vendredi 1 décembre 2017

R matching 2 data frames and creating a new variable based on 2 dataframe

I have a requirement where I need to look at values in one data frame and check that with other data frame and create a new categorical column. This seems to be something simple as vlookup but I am not able to get how to do it in R.

My data is big and has thousands of rows and many columns. Below I have tried to create a sample data of similar format.

#####Generating sample data

library("plyr")
library("data.table")
library("rpart")

set.seed(1200)

id <- 1:1000
ibd <- sample(1:15,1000,replace = T)
bills <- sample(1:20,1000,replace = T)
nos <- sample(1:80,1000,replace = T)
stru <- sample(c("A","B","C","D"),1000,replace = T)
v1 <- sample(1:80,1000,replace = T)
v2 <- sample(1:80,1000,replace = T)
v3 <- sample(1:80,1000,replace = T)
v4 <- sample(1:80,1000,replace = T)
v5 <- sample(1:80,1000,replace = T)
v6 <- sample(1:80,1000,replace = T)
v7 <- sample(1:80,1000,replace = T)
v8 <- sample(1:80,1000,replace = T)
v9 <- sample(1:80,1000,replace = T)
v10 <- sample(1:80,1000,replace = T)
a1 <- sample(1:80,1000,replace = T)
b1 <- sample(1:80,1000,replace = T)
type <- sample(1:15,1000,replace = T)
value <- sample(100:1000,1000,replace = T)

df1 <- data.frame(id,ibd,bills,nos,stru,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,a1,b1,type,value)

num_var <- c("bills","nos","v1","v2","v3")

v0 <- num_var

ibda <- sort(rep(1:15,4),decreasing = F)
billsa <- sample(5:15,60,replace = T)
nosa <- sample(15:60,60,replace = T)
v1a <- sample(10:70,60,replace = T)
v2a <- sample(20:70,60,replace = T)
v3a <- sample(20:70,60,replace = T)

df2 <- data.frame(ibda,billsa,nosa,v1a,v2a,v3a)

bills_ibd1 <- sort(df2[ibda == 1,"billsa"])

So if you see bills_ibd1 contains 05,10,13,15. I want to check these values in df1 for ibd==1 and create a categorical variable "bills_cat" in df1 which will have codes as below

if (ibd == 1 & bills_ibd1 <= 05) bills_cat = 1
if (ibd == 1 & bills_ibd1 > 05 & bills_ibd1 <= 10)  bills_cat = 2
if (ibd == 1 & bills_ibd1 > 10 & bills_ibd1 <= 13)  bills_cat = 3
if (ibd == 1 & bills_ibd1 > 13 & bills_ibd1 <= 15)  bills_cat = 4
if (ibd == 1 & bills_ibd1 > 15 )  bills_cat = 5

Note - bills_ibd1 is getting generated from df2 and I would have such variable for each ibd and column variable.

But this way I will have to write many if statements and I observed variable bills_cat getting replaced.

Is there a simple and better way of achieving this? I need to check for variable in df1 at ibd level based on the values from df2. Please suggest

Aucun commentaire:

Enregistrer un commentaire