Warm tip: This article is reproduced from serverfault.com, please click

Filter in rows of all columns with specific conditions

发布于 2020-11-27 23:38:21

I'm still learning R, I have this dataset, it has 5 columns, first column is tracking_id, the next four columns have values of four groups.

First, I want to filter rows that have values equal or larger than 1, then I want to filter rows based on comparison of the last three columns ("CD44hi_CD69low_rep","CD44hi_CD69hi_CD103low_rep","CD44hi_CD69hi_CD103hi_rep") that are 8 folds higher or 4 folds lower compared to column ("CD44low_rep").

The output should have 5 columns, with values equal or larger than 1 that are 8 fold higher or 4 fold less of the last three column compared to second column.

I should get something like this:

To filter rows equal or larger than 1, I tried this:

df1 %>% select_if(is.numeric) %>%  filter_all(all_vars(. >= 1))

Then to filter 8 folds high or 4 fold less, I tried (thanks to @akrun):

nm1 <- c("CD44hi_CD69low_rep",  "CD44hi_CD69hi_CD103low_rep", 
         "CD44hi_CD69hi_CD103hi_rep")
i1 <- (rowSums(df1[nm1]  >= (df1$CD44low_rep * 8)) == 3) &
     (rowSums(df1[nm1]  <= (df1$CD44low_rep * 4)) == 3)

However, I'm getting no input.

I'm following these steps:

Analysis and graphic display of RNA-Seq data. A total of 9,085 genes for which
the maximum fragments per kilobase of exon per million mapped reads value in all
samples was ≥1.0 were subjected to further analyses. A principal component analysis
was performed using R (https://www.r-project.org/). Clustering was performed using
APCluster (an R Package for Affinity Propagation Clustering). The transcriptional
signatures of CD44hiCD69lo, CD44hiCD69hiCD103lo and CD44hiCD69hiCD103hi CD4+
T cells were defined with genes for which the expression was eightfold higher or
fourfold lower than that in CD44loCD69lo CD4+ T cells.
For the visualization of the co-regulation network, the 500 genes in the CD44hi
CD4+ T cell groups that showed the greatest variation compared with the naive
(CD44loCD69lo) CD4+ T cell group were subjected to further analyses. The first-
neighbor genes were determined using the following two criteria: (1) a correlation
of >0.8; and (2) a ratio of norm of 0.8–1.25. The network graph of 483 genes was
visualized using Cytoscape (http://www.cytoscape.org/).

The IDs that I'm interested in are:

values <- c('S100a10', 'Esm1', 'Itgb1', 'Anxa2', 'Hist1h1b', 
                                                'Il2rb', 'Lgals1', 'Mki67', 'Rora', 'S100a4', 
                                                'S100a6', 'Adam8', 'Areg', 'Bcl2l1', 'Calca', 
                                                'Capg', 'Ccr2', 'Cd44', 'Csda', 'Ehd1', 
                                                'Id2', 'Il10', 'Il1rl1', 'Il2ra', 'Lmna', 
                                                'Maf', 'Penk', 'Podnl1', 'Tiam1', 'Vim',
                                                'Ern1', 'Furin', 'Ifng', 'Igfbp7', 'Il13', 
                                                'Il4', 'Il5', 'Nrp1', 'Ptprs', 'Rbpj', 
                                                'Spry1', 'Tnfsf11', 'Vdr', 'Xcl1', 'Bmpr2', 
                                                'Csf1', 'Dst', 'Foxp3', 'Itgav', 'Itgb8', 
                                                'Lamc1', 'Myo1e', 'Pmaip1', 'Prdm1', 'Ptpn5', 
                                                'Ramp1', 'Sdc4')

After applying @RonakShah (thank you!), I get only 21 instead of 57:

library(dplyr)
df09 <- read.csv('https://raw.githubusercontent.com/learnseq/learning/main/dfpilot.csv')

filtertrial <- df09 %>% 
  #Keep rows where all the values are greater than 1
  filter(across(where(is.numeric), ~. >= 1)) %>%
  #Rows where any value is higher than 8 times CD44low_rep
  #Or 4 times less than CD44low_rep
  filter(Reduce(`|`, across(CD44hi_CD69low_rep:CD44hi_CD69hi_CD103hi_rep, 
         ~. >= CD44low_rep*8 | . <= CD44low_rep/4)))

values <- c('S100a10', 'Esm1', 'Itgb1', 'Anxa2', 'Hist1h1b', 
                                                'Il2rb', 'Lgals1', 'Mki67', 'Rora', 'S100a4', 
                                                'S100a6', 'Adam8', 'Areg', 'Bcl2l1', 'Calca', 
                                                'Capg', 'Ccr2', 'Cd44', 'Csda', 'Ehd1', 
                                                'Id2', 'Il10', 'Il1rl1', 'Il2ra', 'Lmna', 
                                                'Maf', 'Penk', 'Podnl1', 'Tiam1', 'Vim',
                                                'Ern1', 'Furin', 'Ifng', 'Igfbp7', 'Il13', 
                                                'Il4', 'Il5', 'Nrp1', 'Ptprs', 'Rbpj', 
                                                'Spry1', 'Tnfsf11', 'Vdr', 'Xcl1', 'Bmpr2', 
                                                'Csf1', 'Dst', 'Foxp3', 'Itgav', 'Itgb8', 
                                                'Lamc1', 'Myo1e', 'Pmaip1', 'Prdm1', 'Ptpn5', 
                                                'Ramp1', 'Sdc4')

#Make sure the sorting won't change by using match function and reverse it to get the right order as 
#shown in the original plot.
dfgll <- filtertrial %>% slice(match(rev(values), tracking_id))


dfgll

How to achieve this?

Questioner

user432797

Viewed

Original

Ronak Shah 2020-11-28 15:39:58

You can try the following :

library(dplyr)
df <- read.csv('https://raw.githubusercontent.com/learnseq/learning/main/dfpilot.csv')

df %>% 
  #Keep rows where all the values are greater than 1
  filter(across(where(is.numeric), ~. > 1)) %>%
  #Rows where any value is higher than 8 times CD44low_rep
  #Or 4 times less than CD44low_rep
  filter(Reduce(`|`, across(CD44hi_CD69low_rep:CD44hi_CD69hi_CD103hi_rep, 
         ~. > CD44low_rep*8 | . < CD44low_rep/4))) -> result

head(result)  

#    tracking_id CD44low_rep CD44hi_CD69low_rep CD44hi_CD69hi_CD103low_rep
#1         42624        1.91               6.68                      17.50
#2 A930005H10Rik        9.41               4.63                       1.48
#3         Actn1       42.01              21.77                       1.71
#4       Adora2a        1.31               7.05                      15.51
#5         Ahnak       12.09             152.43                     362.87
#6        Als2cl       11.17               1.98                       1.01

#  CD44hi_CD69hi_CD103hi_rep
#1                     22.51
#2                      1.55
#3                      1.22
#4                     13.31
#5                    299.07
#6                      1.26

user432797 2020-11-28 16:47:36

Thank you @RonakShah, I deleted the other question, but the output seems not right for some reason, I get only 21 rows output at the end instead of 57 after I compared 57 selected IDs, I even changed > to >=, also < to <=, but still, I got only 21 rows, any input?

Ronak Shah 2020-11-29 00:28:19

I get 197 rows for the data you have shared running the same code as above. How do you get 21 rows?

user432797 2020-11-29 01:22:24

I do get 197 but when I filter the ids that I'm interested in I get only 21 instead of 57! I updated the question with the code that I used after your code. @RonalShah

Ronak Shah 2020-11-29 01:34:57

Yes, so only 21 id's satisfy the criteria that you have mentioned. Please review your criteria again. Your first criteria is keep only those rows where all values is greater than 1. One of the id which is absent in final output is Esm1. If you check it's values df09 %>% filter(tracking_id == 'Esm1') you'll see that CD44low_rep is 0.1740859 for it. Hence, the code removes it from the list since it is less than 1. I am sure that is the similar case with rest of the id's which is not present in the final output.

user432797 2020-11-29 22:03:05

Thank you @RonakShah, I agree, I don't know why they put in the criteria to remove values less than 1, probably they meant something in data preprocessing in bash, not R.

Filter in rows of all columns with specific conditions

热门帖子

热门github