Saturday, 23 January 2016

Regex double column matching R

This is a follow-up to a question I asked yesterday: Partial string match two columns R

The answer provided there was great; however, I found that many species are not referred to directly. For example, a tortoise is never named in dats$product.authorise, but 'exotic' is an acceptable match.

dats<-data.frame(ID=c(1:4),species=c("dog","cat","rabbit","tortoise"),
            species.descriptor=c("all animal dog","all animal cat","rabbit exotic","tortoise exotic"),
            product=c(1,2,3,4),product.authorise=c("all animal dog cat rabbit","cat horse pig",
            "dog cat","exotic"))
dats
  ID  species species.descriptor product         product.authorise
   1      dog     all animal dog       1 all animal dog cat rabbit
   2      cat     all animal cat       2             cat horse pig
   3   rabbit      rabbit exotic       3                   dog cat
   4 tortoise    tortoise exotic       4                    exotic

I have come up with a working solution: paste $species.descriptor and $product.authorise together, then mark the row as 'TRUE' if any particular regular expression appears two or more times in the combined field, like so:

library(stringr)
dats$bound<-paste(dats$product.authorise, dats$species.descriptor)

species_descriptor<-c("all animal","dog","cat","rabbit","exotic","horse","pig","tortoise")
species_descriptor<-setNames(nm=species_descriptor)
result<-ifelse(sapply(species_descriptor, str_count, string=dats$bound)>=2,"TRUE","FALSE")
result<-as.data.frame(result)

result$AuthorisedCount<-apply(result[,1:ncol(result)],MARGIN=1,function(x){sum(x=="TRUE",na.rm=TRUE)})
result$SpeciesAuthorised<-ifelse(result$AuthorisedCount>=1,"TRUE","FALSE")

dats<-cbind(dats, result$SpeciesAuthorised)
names(dats)[7]<-"SpeciesAuthorised" 
dats$bound<-NULL

dats
  ID  species species.descriptor product         product.authorise SpeciesAuthorised
   1      dog     all animal dog       1 all animal dog cat rabbit              TRUE
   2      cat     all animal cat       2             cat horse pig              TRUE
   3   rabbit      rabbit exotic       3                   dog cat             FALSE
   4 tortoise    tortoise exotic       4                    exotic              TRUE

This works, and it runs quickly on the much larger dataset; however, I suspect there is a much more elegant way of doing this. Does anyone have any suggestions?
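
One possible simplification, offered only as a sketch: instead of pasting the two columns together and counting matches, test each term against each column separately and keep the rows where at least one term occurs in both. The names `terms`, `has_term`, `in_descriptor` and `in_product` are made up for this illustration. This is not quite the same rule as the code above, since a term repeated twice within a single column no longer counts, and the result is a logical column rather than the strings "TRUE"/"FALSE".

```r
# Same example data as above, kept as character columns.
dats <- data.frame(
  ID = 1:4,
  species = c("dog", "cat", "rabbit", "tortoise"),
  species.descriptor = c("all animal dog", "all animal cat",
                         "rabbit exotic", "tortoise exotic"),
  product = 1:4,
  product.authorise = c("all animal dog cat rabbit", "cat horse pig",
                        "dog cat", "exotic"),
  stringsAsFactors = FALSE
)

# Terms to look for (hypothetical name; same values as species_descriptor above).
terms <- c("all animal", "dog", "cat", "rabbit", "exotic", "horse", "pig", "tortoise")

# Logical matrix with one row per element of `text` and one column per term;
# fixed = TRUE treats each term as a literal substring rather than a regex.
has_term <- function(text) {
  do.call(cbind, lapply(terms, grepl, x = text, fixed = TRUE))
}

in_descriptor <- has_term(dats$species.descriptor)
in_product    <- has_term(dats$product.authorise)

# A row is authorised if at least one term occurs in both columns.
dats$SpeciesAuthorised <- rowSums(in_descriptor & in_product) > 0
```

As with the str_count version, this is plain substring matching, so a term such as "cat" would also match inside a longer word; if that matters on the full dataset, word-boundary patterns would be needed.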
