Monday, 5 February 2018

Optimization of iteration in R

Preface: I have two CSV tables, each containing 3 million rows and about 20 columns, and I want to extract 5 columns for all rows that meet certain requirements. It would be better if I worked with SQL or some other database tool, but hey, I started out in R and I have to finish it now.

Currently my request is running on an R server with about 16 GB RAM. Tomorrow the run of the first table will hit one week of runtime, and about 80% is done.

This leads me to the following question: Does it make any difference how I formulate my if-clause? Currently I do the following (omitting loading the CSV, preparing the data frame, etc.):

i = 1
while(i < length_csv){
   if((csv$row11[i] != condition1) && (csv$row11[i] != condition2) 
   && (csv$row11[i] != condition3) && (csv$row11[i] != condition4) 
   && (csv$row11[i] != condition5) && (csv$row11[i] != condition6) 
   && (csv$row11[i] != condition7) && (csv$row3[i] == condition8)){
      dataframe = rbind(dataframe,c(csv$row1[i],csv$row2[i],csv$row11[i],csv$row12[i],csv$row13[i]))
      }
   i = i + 1
}

Would it be more efficient if the request were nested like this:

i = 1
while(i < length_csv){
    if(csv$row3[i] == condition8){
        if(csv$row11[i] != condition1){
            if(csv$row11[i] != condition2){
                ... etc
            }
        }
    }
    i = i + 1
}
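
On the nesting itself: R's `&&` evaluates left to right and stops at the first FALSE, so a flat `&&` chain with the `row3` test placed first should skip the remaining comparisons just as the nested form does. A small sketch to illustrate this, where `expensive_check`, `value` and `flag` are hypothetical names made up for the demonstration:

```r
# Hypothetical helper that counts how often it is actually evaluated.
calls <- 0
expensive_check <- function(x) {
  calls <<- calls + 1
  x != "a"
}

value <- "b"  # made-up test value
flag  <- 0    # made-up value so that the first test fails

# Flat form: flag == 1 is FALSE, so && never evaluates expensive_check().
if ((flag == 1) && expensive_check(value)) {
  message("match")
}
flat_calls <- calls

# Nested form: the inner test is skipped for the same reason.
if (flag == 1) {
  if (expensive_check(value)) {
    message("match")
  }
}
nested_calls <- calls
```

In both forms the counter stays at zero, which suggests that rewriting the chain as nested ifs changes readability more than runtime.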

Or are there other ways to formulate the request I might have overlooked?
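
One other formulation, sketched here under assumptions: build a single logical vector over all rows and subset the data frame once, rather than testing row by row and calling `rbind()` on each match (which copies the growing data frame every time). The toy data, the `excluded` vector and the value of `condition8` below are made up for illustration; the sketch also assumes that conditions 1 to 7 are plain values compared against `row11` and that the columns contain no missing values.

```r
# Toy stand-in for the real csv data frame; column names follow the
# question, the values are invented for illustration.
csv <- data.frame(
  row1  = 1:6,
  row2  = c("a", "b", "c", "d", "e", "f"),
  row3  = c("keep", "keep", "drop", "keep", "keep", "drop"),
  row11 = c("x1", "ok", "ok", "x2", "fine", "ok"),
  row12 = c(10, 20, 30, 40, 50, 60),
  row13 = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
  stringsAsFactors = FALSE
)

# Hypothetical stand-ins for condition1..condition7 and condition8.
excluded   <- c("x1", "x2", "x3", "x4", "x5", "x6", "x7")
condition8 <- "keep"

# One logical value per row; & is the element-wise AND
# (&& is meant for single values inside if()).
keep <- !(csv$row11 %in% excluded) & csv$row3 == condition8

# Subset rows and columns in one step instead of rbind() per row.
result <- csv[keep, c("row1", "row2", "row11", "row12", "row13")]
```

Unlike `rbind(dataframe, c(...))`, which coerces each row to a single type, this keeps the original column types. It also covers every row, whereas `while(i < length_csv)` stops one short if `length_csv` is the row count.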
