dimanche 2 octobre 2016

speeding up nested ifelse statement

Example at this point in my code:

    time_elapsed                     network_name             daypart       day
 1:         4705                          Laff TV 2016-09-09 03:11:35    Friday
 2:         1800                              CNN 2016-09-10 08:00:00  Saturday
 3:           23                             INSP 2016-09-02 18:00:00    Friday
 4:          148                              NBC 2016-09-02 16:01:26    Friday
 5:          957                  History Channel 2016-09-07 14:44:03 Wednesday
 6:         1138         Nickelodeon/Nick-at-Nite 2016-09-09 16:00:00    Friday
 7:          120                       Starz Edge 2016-09-07 15:28:59 Wednesday
 8:          268            Starz Encore Westerns 2016-09-07 17:13:05 Wednesday
 9:            6                              CBS 2016-09-10 04:00:00  Saturday
10:           69                      Independent 2016-09-07 12:48:11 Wednesday
11:         4151                              NBC 2016-09-09 04:32:37    Friday
12:          570 PBS: Public Broadcasting Service 2016-09-07 16:17:58 Wednesday
13:         1421                            NBCSN 2016-09-03 15:22:23  Saturday
14:          466          Estrella TV (Broadcast) 2016-09-04 19:00:00    Sunday

(generally over 200 million rows)

I wrote the following nested ifelse statement a couple months ago when I was running my entire script over just a couple of million rows but now that I'm running it on a bigger scale I'd really like to find a way to make it a little faster.

targets_random$daypart <- ifelse((wday(targets_random$daypart) == 1 | 
                wday(targets_random$daypart) == 7), "W: Weekend",
                        ifelse(hour(targets_random$daypart) <= 2, "LP: Late Prime",
                        ifelse((hour(targets_random$daypart) >= 3 & 
                hour(targets_random$daypart) <= 5), "O: Overnight",
                        ifelse((hour(targets_random$daypart) >= 6 & 
                hour(targets_random$daypart) <= 9), "EM: Early Morning",
                        ifelse((hour(targets_random$daypart) >= 10 & 
                hour(targets_random$daypart) <= 16), "D: Day",
                        ifelse((hour(targets_random$daypart) >= 17 & 
                hour(targets_random$daypart) <= 20), "F: Fringe",
                        ifelse(hour(targets_random$daypart) >= 21, "P: Prime", NA)))))))

I tried using a data.table solution but was only very slightly faster and transformed my data.table to a list. For the life of me I couldn't see why. That added enough time to undo it wasn't worth the savings.

Any suggestions would be much appreciated. What I have works and if I have to stick with it, it'll be fine. It currently takes about 3.5 hours to run through the full code. The biggest portion is a SQL query and the file creation of the results but it would be nice if I could shave off as much time as possible!

(As a sidenote - it used to be almost 8 hours before I replaced tons of parts with data.table syntax. I'm an official fan now!)

Aucun commentaire:

Enregistrer un commentaire