Wednesday, 15 January 2020

Add a new data frame column conditional on a second data frame in R

I have two data frames which I am trying to integrate. The first data frame looks like:

df1=data.frame(Gene=c("gene1","gene2","gene3","gene4","gene5"),
              CHR=c(1,4,5,6,7),
              START=c(1000,5000,10000,15000,20000), 
              STOP=c(2000,6000,11000,16000,21000), 
              stringsAsFactors=FALSE)

> df1
Gene CHR START  STOP
gene1   1  1000  2000
gene2   4  5000  6000
gene3   5 10000 11000
gene4   6 15000 16000
gene5   7 20000 21000

The second data frame looks like:

df2=data.frame(Disorder=c("A","A","A","B","C"),
              Locus=c(1,2,3,1,1),
              Chr=c(1,1,6,4,1),
              Locus.Start=c(157,1500,14600,30000,900), 
              Locus.Stop=c(800,2400,15900,35000,3000), 
              stringsAsFactors=FALSE)

> df2
Disorder Locus Chr Locus.Start Locus.Stop
     A     1   1         157        800
     A     2   1        1500       2400
     A     3   6       14600      15900
     B     1   4       30000      35000
     C     1   1         900       3000

What I am trying to do is add a column to df1 that, for each gene, lists the disorder and locus from df2 wherever the chromosome matches (CHR == Chr) and the gene's position (START or STOP) falls within the locus (between Locus.Start and Locus.Stop).

So, where df1$CHR == df2$Chr AND

((df1$START >= df2$Locus.Start AND df1$START <= df2$Locus.Stop) OR

(df1$STOP >= df2$Locus.Start AND df1$STOP <= df2$Locus.Stop))

print Disorder Locus, otherwise print NA.
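To make the rule concrete, here is a minimal sketch that applies it to a single gene (gene1) against every row of df2. The names gene_chr, gene_start, gene_stop, hit and label are illustrative only, and df2 uses the values shown in the printed table above.

```r
# df2 with the values from the printed table in the question.
df2 <- data.frame(Disorder = c("A", "A", "A", "B", "C"),
                  Locus = c(1, 2, 3, 1, 1),
                  Chr = c(1, 1, 6, 4, 1),
                  Locus.Start = c(157, 1500, 14600, 30000, 900),
                  Locus.Stop = c(800, 2400, 15900, 35000, 3000),
                  stringsAsFactors = FALSE)

# Values for gene1, copied from df1 (illustrative variable names).
gene_chr <- 1
gene_start <- 1000
gene_stop <- 2000

# One logical value per locus: same chromosome AND (START or STOP inside the locus).
hit <- df2$Chr == gene_chr &
  ((gene_start >= df2$Locus.Start & gene_start <= df2$Locus.Stop) |
   (gene_stop  >= df2$Locus.Start & gene_stop  <= df2$Locus.Stop))

# Collapse all matching loci into one string, or NA when nothing matches.
label <- if (any(hit)) {
  paste0(df2$Disorder[hit], " loc", df2$Locus[hit], collapse = ", ")
} else {
  NA_character_
}
label
```

As written, the rule only checks whether START or STOP lies inside a locus, so a gene that completely contains a locus would not be flagged.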

This would result in a table looking like:

> df1
 Gene CHR START  STOP  Map
gene1   1  1000  2000  A loc2, C loc1 
gene2   4  5000  6000  NA
gene3   5 10000 11000  NA
gene4   6 15000 16000  A loc3
gene5   7 20000 21000  NA

So far, I have just been trying to get anything close to that (so accepting A loc2 C loc1 as the last column for example) and have tried things like:

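 # Note: this compares df1 and df2 row by row (row i with row i),
 # not every gene against every locus.
 df1$Map<-ifelse(df1$CHR == df2$Chr & 
         ((df1$START >= df2$Locus.Start & df1$START <= df2$Locus.Stop)|
         (df1$STOP >= df2$Locus.Start & df1$STOP <= df2$Locus.Stop)),
          paste0(df2$Disorder, " loc", df2$Locus),NA)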

Is there a way of referencing between two data frames like this, to use information from df2 to make a new column in df1?

Many thanks for any help received.
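One possible approach, offered as a sketch and not the only way to do it, is to join the two data frames on chromosome with base R's merge(), keep the gene/locus pairs that satisfy the rule, and collapse the matches per gene with aggregate(). The intermediate names pairs, inside, hits, Label and agg are made up for illustration, and df2 uses the values from the printed table.

```r
# Example data from the question (df2 as in the printed table).
df1 <- data.frame(Gene = c("gene1", "gene2", "gene3", "gene4", "gene5"),
                  CHR = c(1, 4, 5, 6, 7),
                  START = c(1000, 5000, 10000, 15000, 20000),
                  STOP = c(2000, 6000, 11000, 16000, 21000),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Disorder = c("A", "A", "A", "B", "C"),
                  Locus = c(1, 2, 3, 1, 1),
                  Chr = c(1, 1, 6, 4, 1),
                  Locus.Start = c(157, 1500, 14600, 30000, 900),
                  Locus.Stop = c(800, 2400, 15900, 35000, 3000),
                  stringsAsFactors = FALSE)

# Pair every gene with every locus on the same chromosome.
pairs <- merge(df1, df2, by.x = "CHR", by.y = "Chr")

# Keep the pairs where START or STOP falls inside the locus.
inside <- with(pairs,
  (START >= Locus.Start & START <= Locus.Stop) |
  (STOP  >= Locus.Start & STOP  <= Locus.Stop))
hits <- pairs[inside, ]

# Sort so the labels within each gene come out in a fixed order.
hits <- hits[order(hits$Gene, hits$Disorder, hits$Locus), ]
hits$Label <- paste0(hits$Disorder, " loc", hits$Locus)

# One comma-separated string per gene that has at least one match.
agg <- aggregate(Label ~ Gene, data = hits, FUN = paste, collapse = ", ")

# Genes without a match are absent from agg, so match() gives NA for them.
df1$Map <- agg$Label[match(df1$Gene, agg$Gene)]
df1
```

This sketch assumes at least one gene matches a locus; if hits were empty, the aggregate() call would need a guard. For large genomic tables, a dedicated interval-overlap tool would likely scale better than a full join on chromosome.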
