Friday, February 9, 2018

Replace values when a factor has more than 52 levels

I'm running a classification algorithm in R on data with many categorical variables, and my problem is that many of them have more than 52 levels (53 levels is the limit for most classification algorithms).

So I want to cap the number of levels by frequency: keep the 52 most frequent levels of each factor and replace all the others with "Other".

Here is my code:

library(dplyr)
library(data.table)

# Count how often each postcode occurs
motor2$BESTUURDER.PERS_POSTCODE <- as.factor(motor2$BESTUURDER.PERS_POSTCODE)
var.smry <- motor2 %>%
  select(BESTUURDER.PERS_POSTCODE) %>%
  group_by(BESTUURDER.PERS_POSTCODE) %>%
  dplyr::summarise(n())

# Sort the postcodes from most to least frequent
names(var.smry)[2] <- "Count"
var.smry <- var.smry %>%
  arrange(desc(Count))
var.smry$Count <- as.factor(var.smry$Count)

# Keep the row names as a frequency rank, in a column called rn
var.smry <- setDT(var.smry, keep.rownames = TRUE)[]

var2 <- var.smry %>%
  select(rn, BESTUURDER.PERS_POSTCODE)

# Attach the rank to every row of the original data
motor2 <- (merge(var2, motor2, by = 'BESTUURDER.PERS_POSTCODE'))

motor2$BESTUURDER.PERS_POSTCODE <- as.character(motor2$BESTUURDER.PERS_POSTCODE)

# Replace the postcodes ranked 52 or lower with "Other", leaving NA untouched
motor2$BESTUURDER.PERS_POSTCODE <- ifelse(motor2$rn >= 52, ifelse(!is.na(motor2$BESTUURDER.PERS_POSTCODE),"Other",motor2$BESTUURDER.PERS_POSTCODE), motor2$BESTUURDER.PERS_POSTCODE)

motor2 <- motor2 %>%
  select(-rn)

# Label missing postcodes and convert back to a factor
motor2$BESTUURDER.PERS_POSTCODE <- as.character(motor2$BESTUURDER.PERS_POSTCODE)
motor2$BESTUURDER.PERS_POSTCODE[is.na(motor2$BESTUURDER.PERS_POSTCODE)] <- "missing"
motor2$BESTUURDER.PERS_POSTCODE <- as.factor(motor2$BESTUURDER.PERS_POSTCODE)

I really don't understand why it is not working...

Any help would be very much appreciated.

Thanks a lot,

Allan
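
A note on the code above, offered as a likely explanation rather than a confirmed diagnosis: setDT(..., keep.rownames = TRUE) stores the row names in rn as a character column, so motor2$rn >= 52 is probably compared as text ("100" >= "52" is FALSE, "6" >= "52" is TRUE) instead of as a number. Also, >= 52 would keep only 51 levels even with a numeric rank. Below is a minimal base R sketch of the intended result. The helper name lump_top_n and the toy postcode vector are made up for illustration and are not part of the original code or data.

```r
# Toy stand-in for motor2$BESTUURDER.PERS_POSTCODE (hypothetical data):
# 60 distinct postcodes with frequencies 60, 59, ..., 1, plus 3 missing values.
postcode <- c(rep(sprintf("%04d", 1000:1059), times = 60:1), NA, NA, NA)

# Hypothetical helper: keep the `keep` most frequent values, replace the rest
# with `other`, and give NA its own label.
lump_top_n <- function(x, keep = 52, other = "Other", missing = "missing") {
  x <- as.character(x)
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NA by default
  top <- names(counts)[seq_len(min(keep, length(counts)))]
  out <- ifelse(is.na(x), missing, ifelse(x %in% top, x, other))
  factor(out)
}

lumped <- lump_top_n(postcode)
nlevels(lumped)          # 54: 52 kept postcodes + "Other" + "missing"
sum(lumped == "Other")   # 36 rows, from the 8 rarest postcodes (8 + 7 + ... + 1)
```

With both "Other" and "missing" present, the result has keep + 2 levels, so if the limit really is 53, keep = 51 would be the safer choice. On the real data the call would presumably be motor2$BESTUURDER.PERS_POSTCODE <- lump_top_n(motor2$BESTUURDER.PERS_POSTCODE, keep = 51). The forcats package offers fct_lump_n() for the same kind of lumping.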
