Since I have to read over 3 GB of data, I would like to improve my code by replacing two for-loops and an if-statement with the apply function.
Below is a reproducible example of my code. The overall purpose (in this example) is to count the number of positive and negative values in the "c" column for each combination of a and b (x and y in the loop). In my real case I have over 150 files to read.
# Example of initial data set
df <- data.frame(a=rep(c(1:5),times=3),b=rep(c(1:3),each=5),c=rnorm(15))
# Another dataframe to keep track of "c" counts
dfOcc <- data.frame(a=rep(c(1:5),times=3),b=rep(c(1:3),each=5),"positive"=c(0),"negative"=c(0))
So far I have written this code, which works but is really slow:
for (i in 1:nrow(df)) {
  x = df[i,"a"]
  y = df[i,"b"]
  # row of dfOcc matching this (a, b) pair; element-wise & is needed here, not &&
  idx = which(dfOcc$a==x & dfOcc$b==y)
  if (df[i,"c"]>=0) {
    dfOcc[idx,"positive"] <- dfOcc[idx,"positive"] +1
  }else{
    dfOcc[idx,"negative"] <- dfOcc[idx,"negative"] +1
  }
}
I am unsure whether the code is slow because of the size of the files (260k rows each) or because of the for-loop.
So far I managed to improve it in this way:
dfOcc[which(dfOcc$a==df$a & dfOcc$b==df$b),"positive"] <- apply(df,1,function(x){ifelse(x["c"]>0,1,0)})
This works fine in this example but not in my real case:
- It only keeps count of the positive c values, and running this code twice might be counterproductive.
- My original datasets are 260k rows while my "tracer" is 10k rows (the initial dataset repeats the a and b values with other c values).
Any tips on how to address these two points would be greatly appreciated!
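For reference, here is a minimal sketch of one direction that might address both points without a loop. It assumes that grouping with base R's aggregate is acceptable. The helper count_signs, the fixed seed, and the simulated list chunks (a stand-in for the real files) are hypothetical names introduced only for illustration. Because the counts are grouped by (a, b), it does not matter that the data has more rows than the "tracer".

```r
set.seed(1)  # assumption: seed added only to make the sketch reproducible

# Hypothetical helper: count c >= 0 and c < 0 for every (a, b) pair of one data frame
count_signs <- function(d) {
  d$positive <- as.integer(d$c >= 0)  # 1 where c is non-negative, else 0
  d$negative <- as.integer(d$c < 0)   # 1 where c is negative, else 0
  aggregate(cbind(positive, negative) ~ a + b, data = d, FUN = sum)
}

# Stand-in for one input file: each (a, b) pair appears twice with different c values
df <- data.frame(a = rep(1:5, times = 6),
                 b = rep(rep(1:3, each = 5), times = 2),
                 c = rnorm(30))

# One row per (a, b) pair, with both counts computed in a single pass
dfOcc <- count_signs(df)

# Several files: count each one separately, stack the results, then sum again
chunks <- list(df, df)  # assumption: simulated here instead of reading real files
perFile <- do.call(rbind, lapply(chunks, count_signs))
total <- aggregate(cbind(positive, negative) ~ a + b, data = perFile, FUN = sum)
```

Counting per file and then summing the small per-file tables keeps memory use low, since only the aggregated rows are kept between files.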