Wednesday, 25 July 2018

faster multi-column partial string recognition loop in R

I want to create a fast function that returns TRUE or FALSE depending on whether the character string in one column contains the name of one of my other columns. The TRUE or FALSE is to be recorded in the individually named columns. Below is an example of the data structure:

df = data.frame(Authors, A1, A2 [... all the way A63])
# Example of "Authors" column row values: ("A1, A12, A50")
# All other columns equal: NA
# Note: "Authors" has millions of rows.
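
For anyone who wants to reproduce this, a small stand-in for the structure above might look like the following. This is only a sketch: the three author names, the four sample rows, and the helper name `author_names` are made up for illustration, and the real data has 63 author columns and millions of rows.

```r
# Hypothetical toy data: three author columns instead of 63, four rows instead of millions.
author_names <- c("A1", "A2", "A12")

df <- data.frame(
  Authors = c("A1, A12", "A2", "A12", "A1, A2, A12"),
  stringsAsFactors = FALSE
)

# Add one column per author, initially all NA, as in the structure described above.
df[author_names] <- NA
```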

I have a nested loop that looks for each author's name (for example "A12") in the "Authors" column, df[,1], which often holds several names (for example "A1, A12, A50"). It writes TRUE into the column named after that author if the name appears in the string, and FALSE otherwise. Here is a slow nested loop that achieves the intended result:

for (i in 2:ncol(df)) {
    for (j in 1:nrow(df)) {
        # The column name is the pattern and the Authors string is searched;
        # \\b keeps "A1" from matching inside "A12".
        df[j, i] <- grepl(paste0("\\b", colnames(df)[i], "\\b"), df[j, 1])
    }
}
# Intended result: df[2,2] is TRUE if df[2,1] is "A1, A2, A50", otherwise FALSE.

The above works, but it is excruciatingly slow on millions of rows. Any pointers on how I might speed this up?
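
One possible direction, offered as a sketch rather than a tested solution for the real data: `grepl` is already vectorized over its second argument, so the inner loop over rows can probably be dropped, leaving one call per author column (63 calls in total instead of millions). The toy data and the name `author_names` below are assumptions for illustration. The sketch also assumes author names consist only of letters and digits, so that the `\\b` word boundary separates "A1" from "A12" and no regex metacharacters need escaping.

```r
# Hypothetical toy data standing in for the real data frame.
df <- data.frame(
  Authors = c("A1, A12, A50", "A12", "A2, A50", "A1"),
  stringsAsFactors = FALSE
)
author_names <- c("A1", "A2", "A12", "A50")

# Loop over author columns only; each grepl() call scans every row at once.
for (a in author_names) {
  # \\b is a word boundary, so "A1" does not match inside "A12".
  df[[a]] <- grepl(paste0("\\b", a, "\\b"), df$Authors, perl = TRUE)
}
```

On the toy data, the first row ("A1, A12, A50") ends up TRUE for A1, A12 and A50 and FALSE for A2. If the names could contain spaces or punctuation, splitting `Authors` on the comma and comparing whole tokens would be a safer alternative to the word-boundary pattern.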
