I have two data frames which I am trying to integrate. The first data frame looks like:
df1=data.frame(Gene=c("gene1","gene2","gene3","gene4","gene5"),
CHR=c(1,4,5,6,7),
START=c(1000,5000,10000,15000,20000),
STOP=c(2000,6000,11000,16000,21000),
stringsAsFactors=FALSE)
> df1
Gene CHR START STOP
gene1 1 1000 2000
gene2 4 5000 6000
gene3 5 10000 11000
gene4 6 15000 16000
gene5 7 20000 21000
The second data frame looks like:
df2=data.frame(Disorder=c("A","A","A","B","C"),
Locus=c(1,2,3,1,1),
Chr=c(1,1,6,4,1),
Locus.Start=c(157,1500,14600,30000,900),
Locus.Stop=c(800,2400,15900,35000,3000),
stringsAsFactors=FALSE)
> df2
Disorder Locus Chr Locus.Start Locus.Stop
A 1 1 157 800
A 2 1 1500 2400
A 3 6 14600 15900
B 1 4 30000 35000
C 1 1 900 3000
What I am trying to do is add a column to df1 that lists, for each gene, the disorder and locus from df2 wherever the chromosomes match (CHR == Chr) and the gene's START or STOP position falls within the locus (between Locus.Start and Locus.Stop).
So, where df1$CHR == df2$Chr AND
((df1$START >= df2$Locus.Start AND df1$START <= df2$Locus.Stop) OR
(df1$STOP >= df2$Locus.Start AND df1$STOP <= df2$Locus.Stop))
print Disorder and Locus, otherwise print NA.
This would result in a table looking like:
> df1
Gene CHR START STOP Map
gene1 1 1000 2000 A loc2, C loc1
gene2 4 5000 6000 NA
gene3 5 10000 11000 NA
gene4 6 15000 16000 A loc3
gene5 7 20000 21000 NA
So far, I have just been trying to get anything close to that (so accepting A loc2 C loc1 as the last column for example) and have tried things like:
df1$Map<-ifelse(df1$CHR == df2$Chr &
((df1$START >= df2$Locus.Start & df1$START <= df2$Locus.Stop)|
(df1$STOP >= df2$Locus.Start & df1$STOP <= df2$Locus.Stop)),
print(df2$Disorder " loc"df2$Locus),NA)
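Two things seem to go wrong in this attempt. First, print(df2$Disorder " loc"df2$Locus) is not valid R, since strings are joined with paste0() rather than by writing them next to each other. Second, df1$CHR == df2$Chr compares row 1 with row 1, row 2 with row 2 and so on, instead of each gene against every locus. A small sketch of the difference, with the two chromosome columns copied into standalone vectors (df1_chr, df2_chr and same_chr are names made up for this illustration):

```r
df1_chr <- c(1, 4, 5, 6, 7)   # same values as df1$CHR
df2_chr <- c(1, 1, 6, 4, 1)   # same values as df2$Chr

# Element-wise: gene i is only compared with locus i
df1_chr == df2_chr
# TRUE FALSE FALSE FALSE FALSE

# All pairs: one row per gene, one column per locus
same_chr <- outer(df1_chr, df2_chr, "==")
same_chr[4, 3]   # gene4 (chr 6) against the third locus (chr 6)

# Building a label from its parts
paste0("A", " loc", 2)
```

So the element-wise form never sees that gene4 and the third row of df2 share a chromosome, because they sit in different rows.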
Is there a way of referencing between two data frames like this, to use information from df2 to make a new column in df1?
Many thanks for any help received.
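One possible way to get the table described above is to test each gene against all rows of df2 and collapse the hits into a single string. This is only a minimal base R sketch, not the only approach. It assumes the df2 values shown in the printed table (Locus.Start 900 and Locus.Stop 3000 for disorder C), and map_gene is a hypothetical helper name:

```r
df1 <- data.frame(Gene = paste0("gene", 1:5),
                  CHR = c(1, 4, 5, 6, 7),
                  START = c(1000, 5000, 10000, 15000, 20000),
                  STOP = c(2000, 6000, 11000, 16000, 21000),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Disorder = c("A", "A", "A", "B", "C"),
                  Locus = c(1, 2, 3, 1, 1),
                  Chr = c(1, 1, 6, 4, 1),
                  Locus.Start = c(157, 1500, 14600, 30000, 900),
                  Locus.Stop = c(800, 2400, 15900, 35000, 3000),
                  stringsAsFactors = FALSE)

# map_gene (hypothetical helper): label every df2 locus that one gene hits
map_gene <- function(chr, start, stop) {
  # Same chromosome, and START or STOP lies inside the locus
  hit <- df2$Chr == chr &
    ((start >= df2$Locus.Start & start <= df2$Locus.Stop) |
     (stop  >= df2$Locus.Start & stop  <= df2$Locus.Stop))
  if (!any(hit)) return(NA_character_)
  # e.g. "A loc2, C loc1" when several loci match
  paste(paste0(df2$Disorder[hit], " loc", df2$Locus[hit]), collapse = ", ")
}

# Apply the helper to each gene (row of df1) in turn
df1$Map <- mapply(map_gene, df1$CHR, df1$START, df1$STOP)
df1
```

Note that the condition as stated only checks whether START or STOP falls inside a locus, so a locus lying entirely inside a gene would not be reported. If that case matters, the test could be changed to a general overlap check (start <= Locus.Stop and stop >= Locus.Start).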