Monday, 11 November 2019

How to simplify if-statement with multiple data frames/conditions in a list, in R?

I would like help to improve my code/knowledge of R. My code works, but I think it can be more efficient and any help is appreciated.

  • I have a list (DFx) of nested lists of data frames, so:

    • DFx[1] = list of length 1
      • DFx[[1]] = data frame

If I call on a data frame within this list, for example:

list(`31457` = structure(list(by5min = structure(c(1L, 2L,3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L), .Label = c("2018-08-06 23:20:00", "2018-08-06 23:25:00", "2018-08-06 23:30:00", 
                                                                                                                     "2018-08-06 23:35:00", "2018-08-06 23:40:00", "2018-08-06 23:45:00", 
                                                                                                                     "2018-08-06 23:50:00", "2018-08-06 23:55:00", "2018-08-07 00:00:00", 
                                                                                                                     "2018-08-07 00:05:00", "2018-08-07 00:10:00"), class = "factor"), 
                              HR = c(90.1966666666667, 94.99, 95.54, 91.2633333333333, 
                                     93.37, 92.3466666666667, 89.0933333333333, 90.92, 91.0533333333333, 
                                     96.7666666666667, 93.3533333333333), 
                              WeekDay = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("Mon", "Tue"), class = c("ordered","factor")), Hour = c(23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 0L, 0L, 0L), 
                              YearDay = c(218, 218, 218, 218, 218, 218, 218, 218, 219, 219, 219),
                              Name = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "31457", class = "factor")), row.names = 245:255, class = "data.frame"))

I get the following (keep in mind this is a tiny subset; the data typically spans 1000+ rows):

$`31457`
                 by5min       HR WeekDay Hour YearDay  Name
245 2018-08-06 23:20:00 90.19667     Mon   23     218 31457
246 2018-08-06 23:25:00 94.99000     Mon   23     218 31457
247 2018-08-06 23:30:00 95.54000     Mon   23     218 31457
248 2018-08-06 23:35:00 91.26333     Mon   23     218 31457
249 2018-08-06 23:40:00 93.37000     Mon   23     218 31457
250 2018-08-06 23:45:00 92.34667     Mon   23     218 31457
251 2018-08-06 23:50:00 89.09333     Mon   23     218 31457
252 2018-08-06 23:55:00 90.92000     Mon   23     218 31457
253 2018-08-07 00:00:00 91.05333     Tue    0     219 31457
254 2018-08-07 00:05:00 96.76667     Tue    0     219 31457
255 2018-08-07 00:10:00 93.35333     Tue    0     219 31457

I then convert each list element into a data frame and use left_join to add information about specific dates from a separate data frame (DF2). This lets me split the data into two groups (pre- vs. post-MRI).

library(dplyr)  # provides %>%, left_join() and filter()

for (y in seq_along(DFx)) {
  # DFx[y] is a length-1 list; data.frame() flattens it into one data frame
  DF_Joined_tmp = DFx[y] %>% 
    data.frame()

  # skip iteration if DF is empty (one column or fewer, or nothing but NA)
  if (ncol(DF_Joined_tmp) <= 1 || all(is.na(DF_Joined_tmp))) {next}

  # clean: rename columns 5 and 6 so they match the join keys
  colnames(DF_Joined_tmp)[5:6] = c("YearDay", "Name")
  DF_Joined_tmp$Name = as.character(DF_Joined_tmp$Name)

  # Pre
  DF_Prex = left_join(DF_Joined_tmp, DF2, by = c("YearDay" = "Day_MRI", "Name" = "Whoop_ID")) # join by MRI date/ID
  DF_Prex$MRI_DAY = lubridate::yday(DF_Prex$MRI_DATE)
  DF_Prex = filter(DF_Prex, YearDay <= mean(MRI_DAY, na.rm = TRUE)) # less than or equal to MRI day: PRE
  DF_Prex = DF_Prex[,-c(7:11)] # drop the joined columns

  # Post
  DF_Postx = left_join(DF_Joined_tmp, DF2, by = c("YearDay" = "Day_MRI", "Name" = "Whoop_ID"))
  DF_Postx$MRI_DAY = lubridate::yday(DF_Postx$MRI_DATE)
  DF_Postx = filter(DF_Postx, YearDay > mean(MRI_DAY, na.rm = TRUE)) # greater than MRI day: POST
  DF_Postx = DF_Postx[,-c(7:11)] # drop the joined columns

After the Prex/Postx DFs are created, I run them through an if/else block, which is where I think my code can be improved most:

  # store the processed data for element y
  if (nrow(DF_Prex) == 0 && nrow(DF_Postx) == 0) {  # if there are no pre/post dates
    DF_Post[[y]] = DF_Joined_tmp[,-c(7:11)]

  } else {

    if (length(unique(DF_Prex$YearDay)) == 1) {  # if there is only one day of data
      DF_Pre[[y]] = DF_Prex
    } else {                                     # otherwise one data frame per day
      DF_Pre[[y]] = split(DF_Prex, DF_Prex$YearDay)
    }

    if (length(unique(DF_Postx$YearDay)) == 1) {  # if there is only one day of data
      DF_Post[[y]] = DF_Postx                     # index is y (z was never defined)
    } else {                                      # otherwise one data frame per day
      DF_Post[[y]] = split(DF_Postx, DF_Postx$YearDay)
    }

  }
}

remove(DF_Joined_tmp, DF_Prex, DF_Postx) # tidy workspace
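The two inner if/else blocks differ only in which data frame they handle, so one way to shorten the loop is to move that branch into a small helper. This is a minimal sketch, assuming the intent is "keep a single-day data frame as is, otherwise split it by YearDay". The name `split_if_multiday` and the toy data are hypothetical, not part of the original code or of any package.

```r
# Hypothetical helper: returns the data frame unchanged when it covers
# exactly one day, otherwise a list of data frames, one per YearDay.
split_if_multiday <- function(df) {
  if (length(unique(df$YearDay)) == 1) {
    return(df)
  }
  split(df, df$YearDay)
}

# Toy data standing in for DF_Prex / DF_Postx
one_day  <- data.frame(HR = c(90, 95), YearDay = c(218, 218))
two_days <- data.frame(HR = c(90, 95, 91), YearDay = c(218, 218, 219))

res_one <- split_if_multiday(one_day)    # still a data frame
res_two <- split_if_multiday(two_days)   # list with elements "218" and "219"
```

Inside the loop, the two branches would then reduce to `DF_Pre[[y]] = split_if_multiday(DF_Prex)` and `DF_Post[[y]] = split_if_multiday(DF_Postx)`.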

I looked into using case_when, but I'm not sure how to apply it when I have 2 data frames.
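case_when() works on the columns of a single data frame, so it may fit better if the Pre/Post decision is stored as a column instead of as two separate data frames. This is a minimal sketch, assuming dplyr is installed and the MRI day is available as one number. `mri_day`, the `Group` column and the toy values are illustrative assumptions, not taken from the original data.

```r
library(dplyr)

# Toy stand-in for DF_Joined_tmp; mri_day is an assumed value
df <- data.frame(
  HR      = c(90.2, 95.0, 91.1, 96.8),
  YearDay = c(218, 218, 219, 220)
)
mri_day <- 218

# Label each row once instead of filtering the same join twice
labelled <- df %>%
  mutate(Group = case_when(
    YearDay <= mri_day ~ "Pre",   # on or before the MRI day
    TRUE               ~ "Post"   # everything else, including a missing MRI day
  ))

# A single split then yields the Pre and Post data frames
by_group <- split(labelled, labelled$Group)
```

If `mri_day` is NA, the first condition evaluates to NA and every row falls through to "Post", which mirrors the loop's choice of storing participants without MRI dates in DF_Post.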

  • I need to split the data into lists by day of the year while nested under the same participant ID.
  • So each list would hold nested lists of separate days.
  • The output would be:

    • PRE group: list of DFs. Output: list of several data frames, PRE MRI day
    • POST group: list of DFs. Output: list of several data frames, POST MRI day
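The nested structure described in this list (participant first, then day) can be produced with split() inside lapply(). This is a minimal base R sketch, assuming each element of the input list is one participant's data frame with a YearDay column. The second ID and all values are made up for illustration.

```r
# Toy list: one data frame per participant, named by participant ID
pre_list <- list(
  `31457` = data.frame(HR = c(90, 95, 91), YearDay = c(218, 218, 219)),
  `31458` = data.frame(HR = c(88, 92),     YearDay = c(220, 221))
)

# For each participant, split the rows into one data frame per day
pre_nested <- lapply(pre_list, function(df) split(df, df$YearDay))

# Access pattern: participant first, then day
day_218 <- pre_nested[["31457"]][["218"]]
```

The same call on a list of post-MRI data frames would give the second structure.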

I am asking for help to simplify because after this I use similar loops to further separate the data and add information from other dataframes.
