Monday, 17 June 2019

How to select data based on 2 conditions in R?

I am looking to select a subset of my data based on two conditions.

First, here is my data:

Gene          AreaID     Label
DNAJC12       rs1111111  unlikely
HERC4         rs1111111  unlikely
RP11-57G10.8  rs2222222  possible
RPL12P8       rs1111111  unlikely
SIRT1         rs3333333  certain
RP11-57G10.8  rs3333333  possible
RPL12P8       rs3333333  unlikely
SIRT1         rs3333333  unlikely

I want to select the genes that have an 'unlikely' label and share the same area ID. In addition, that area ID must not appear for any gene with a different label.

So, for example, my output would contain only these rows:

Gene          AreaID     Label
DNAJC12       rs1111111  unlikely
HERC4         rs1111111  unlikely
RPL12P8       rs1111111  unlikely

It should not include the rs3333333 area ID, which has several 'unlikely' rows but also has genes with other labels.

I have tried the following, based on reading similar questions on here, but it does not seem to work:

library(dplyr)  # provides %>% and filter()
loci <- read.csv('dataset.csv')
sub_list <- lapply(1:length(loci), function(i) loci %>% filter(loci$AreaID==duplicated(loci) & loci$Label =='unlikely'))
do.call(rbind, sub_list)

I have also tried:

prediction_snps = loci$AreaID[loci$label == 'unlikely']
result = loci[prediction_snps, ]

I am not sure how else to approach this, as I am new to R.
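
For reference, the rule described above (keep an area ID only if it is shared by several rows and every one of those rows is 'unlikely') might be sketched in base R as follows. This is only one possible reading of the requirement, not a definitive solution. The data frame is rebuilt inline from the table above instead of being read from dataset.csv, and the names mixed_ids, shared_ids and result are made up for illustration.

```r
# Toy copy of the table from the question (an assumption: the real data
# would come from read.csv('dataset.csv') with the same column names).
loci <- data.frame(
  Gene   = c("DNAJC12", "HERC4", "RP11-57G10.8", "RPL12P8",
             "SIRT1", "RP11-57G10.8", "RPL12P8", "SIRT1"),
  AreaID = c("rs1111111", "rs1111111", "rs2222222", "rs1111111",
             "rs3333333", "rs3333333", "rs3333333", "rs3333333"),
  Label  = c("unlikely", "unlikely", "possible", "unlikely",
             "certain", "possible", "unlikely", "unlikely"),
  stringsAsFactors = FALSE
)

# Area IDs that occur with at least one label other than 'unlikely'.
mixed_ids <- unique(loci$AreaID[loci$Label != "unlikely"])

# Area IDs that occur on more than one row.
shared_ids <- unique(loci$AreaID[duplicated(loci$AreaID)])

# Keep rows whose area ID is shared and never carries another label.
result <- loci[loci$AreaID %in% shared_ids & !(loci$AreaID %in% mixed_ids), ]
print(result)
```

If an area ID that appears only once with an 'unlikely' label should also be kept, the shared_ids condition can be dropped. Column names are case-sensitive in R, so the label column here is loci$Label, not loci$label.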
