Tuesday, April 9, 2019

How to use if-statement in apply function?

Since I have to read over 3 GB of data, I would like to improve my code by replacing the two for-loops and the if-statement with the apply function.

Below is a reproducible example of my code. The overall purpose (in this example) is to count the number of positive and negative values in the "c" column for each combination of x and y (the values of columns "a" and "b"). In the real case I have over 150 files to read.

# Example of initial data set
df <- data.frame(a=rep(c(1:5),times=3),b=rep(c(1:3),each=5),c=rnorm(15))
# Another dataframe to keep track of "c" counts, one row per (a, b) pair
dfOcc <- data.frame(a=rep(c(1:5),times=3),b=rep(c(1:3),each=5),"positive"=c(0),"negative"=c(0))

So far I have written this code, which works but is really slow:

for (i in 1:nrow(df)) {
  x = df[i,"a"]
  y = df[i,"b"]
  # Element-wise `&` is needed here: `&&` only compares single values
  idx <- which(dfOcc$a==x & dfOcc$b==y)
  if (df[i,"c"]>=0) {
    dfOcc[idx,"positive"] <- dfOcc[idx,"positive"] +1
  }else{
    dfOcc[idx,"negative"] <- dfOcc[idx,"negative"] +1
  }
}

I am unsure whether the code is slow because of the size of the files (260k rows each) or because of the for-loop.

So far I managed to improve it in this way:

dfOcc[which(dfOcc$a==df$a & dfOcc$b==df$b),"positive"] <- apply(df,1,function(x){ifelse(x["c"]>0,1,0)})

This works fine in this example but not in my real case:

  • It only keeps count of the positive c values, and running this code a second time for the negative ones might be counterproductive
  • My original datasets have 260k rows each while my "tracer" has 10k rows (the initial dataset repeats the a and b values with other c values)

Any tip on how to improve those two points would be greatly appreciated!
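
For reference, here is a minimal sketch of one way both points might be handled without a loop or apply. It is not tested on the real files. It assumes each file has the same a, b and c columns as the example, flags the sign of c in two helper columns, and sums them per (a, b) pair with base R's aggregate. The names count_signs, counts_list and all_counts are made up for illustration, and the two-element list only stands in for the 150 files.

```r
# Example data in the same shape as the question (seed fixed for repeatability)
set.seed(1)
df <- data.frame(a = rep(1:5, times = 3),
                 b = rep(1:3, each = 5),
                 c = rnorm(15))

# Hypothetical helper: count c >= 0 and c < 0 for every (a, b) pair.
# Zero is counted as positive, as in the original loop.
count_signs <- function(d) {
  d$positive <- as.integer(d$c >= 0)
  d$negative <- as.integer(d$c < 0)
  # One grouped sum handles both columns, however often (a, b) repeats
  aggregate(cbind(positive, negative) ~ a + b, data = d, FUN = sum)
}

dfOcc <- count_signs(df)

# Several files: count each one separately, stack the small per-file
# tables, and sum again. The list below is a stand-in for the real files.
counts_list <- list(count_signs(df), count_signs(df))
all_counts <- aggregate(cbind(positive, negative) ~ a + b,
                        data = do.call(rbind, counts_list), FUN = sum)
```

Because the grouping is done on the data itself, the "tracer" does not need to be built beforehand or have the same number of rows as the input. Pairs of (a, b) that never occur in a file simply do not appear in that file's table.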
