mercredi 24 avril 2019

Fast alternative to ifelse for replacing strings in data.table

I have a large list of data.tables (~10m rows in total) that contain a number of string variants of NA, such as "N/A" or "Unknown". I would like to replace these observations with a missing value, for all of the columns in all of my data.tables.

A simplified example of the data is set out below:

library(data.table)
dt1 <- data.table(v1 = 1:4, v2 = c("yes", "no", "unknown", NA))
dt2 <- data.table(v1 = c("1", "2", "not applicable", "4"), v2 = c("yes", "yes", "n/a", "no"))
master_list <- list(dt1 = dt1, dt2 = dt2)

The following solution works, but it is taking a prohibitively long time (~30 minutes with the full data) so I am trying to find a more efficient solution:

unknowns <- c("n/a", "not applicable", "unknown")
na_edit <- function(x){ifelse(x %in% unknowns, NA, x)}

master_list <- lapply(master_list, function(dt) {
 dt[, lapply(.SD, na_edit)]
})

> master_list$dt1
   v1   v2
1:  1  yes
2:  2   no
3:  3 <NA>
4:  4 <NA>

I have tried something resembling the following, removing the need for the ifelse, but I have not been able to make this work across multiple columns.

lapply(master_list, function(dt) {
  dt[v2 %in% unknowns, v2 := NA]
})

I feel an answer may lie in the responses in this thread. Would anyone be able to help me apply similar, or other methods, to the problem above? Many thanks in advance.

Aucun commentaire:

Enregistrer un commentaire