I'm running a classification algorithm in R on data with many categorical variables, and my problem is that many of them have more than 52 levels (53 is the limit for most classification algorithms).
So what I want to do is recode each such variable by level frequency: keep the 52 most frequent levels and replace all the others with "Other".
Here is my code:
motor2$BESTUURDER.PERS_POSTCODE <- as.factor(motor2$BESTUURDER.PERS_POSTCODE)
var.smry <- motor2 %>%
  select(BESTUURDER.PERS_POSTCODE) %>%
  group_by(BESTUURDER.PERS_POSTCODE) %>%
  dplyr::summarise(n())
names(var.smry)[2] <- "Count"
var.smry <- var.smry %>%
  arrange(desc(Count))
var.smry$Count <- as.factor(var.smry$Count)
var.smry <- setDT(var.smry, keep.rownames = TRUE)[]
var2 <- var.smry %>%
  select(rn, BESTUURDER.PERS_POSTCODE)
motor2 <- merge(var2, motor2, by = 'BESTUURDER.PERS_POSTCODE')
motor2$BESTUURDER.PERS_POSTCODE <- as.character(motor2$BESTUURDER.PERS_POSTCODE)
motor2$BESTUURDER.PERS_POSTCODE <- ifelse(motor2$rn >= 52, ifelse(!is.na(motor2$BESTUURDER.PERS_POSTCODE), "Other", motor2$BESTUURDER.PERS_POSTCODE), motor2$BESTUURDER.PERS_POSTCODE)
motor2 <- motor2 %>%
  select(-rn)
motor2$BESTUURDER.PERS_POSTCODE <- as.character(motor2$BESTUURDER.PERS_POSTCODE)
motor2$BESTUURDER.PERS_POSTCODE[is.na(motor2$BESTUURDER.PERS_POSTCODE)] <- "missing"
motor2$BESTUURDER.PERS_POSTCODE <- as.factor(motor2$BESTUURDER.PERS_POSTCODE)
I really don't understand why it is not working...
Any help would be very much appreciated.
Thanks a lot,
Allan
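One possible cause, offered as a guess rather than a confirmed diagnosis: setDT(..., keep.rownames = TRUE) stores the row names in rn as character strings, so motor2$rn >= 52 is compared as text ("6" >= "52" is TRUE, "100" >= "52" is FALSE) and the wrong postcodes get recoded. The sketch below sidesteps the rank column entirely and does the "keep the top n levels, lump the rest" step in base R. The helper name lump_top_n and the toy postcode vector are made up for illustration and are not part of the original code.

```r
# Hypothetical helper (not from the original post): keep the n most
# frequent values of x, recode the rest as `other` and NA as `missing`.
lump_top_n <- function(x, n = 52, other = "Other", missing = "missing") {
  x <- as.character(x)
  # table() ignores NA by default, so NA is never counted as a level here
  counts <- sort(table(x), decreasing = TRUE)
  # names of the n most frequent values (or all of them if fewer than n)
  keep <- names(counts)[seq_len(min(n, length(counts)))]
  # everything that is neither NA nor in the top n becomes `other`
  x[!is.na(x) & !(x %in% keep)] <- other
  # NA gets its own explicit label, as in the original code
  x[is.na(x)] <- missing
  factor(x)
}

# Toy data, invented for illustration
postcode <- c("a", "a", "a", "b", "b", "c", "d", NA)
result <- lump_top_n(postcode, n = 2)
```

On the real data this would be called as motor2$BESTUURDER.PERS_POSTCODE <- lump_top_n(motor2$BESTUURDER.PERS_POSTCODE, n = 52). Note that "Other" and "missing" are levels too, so with n = 52 the result can have up to 54 levels; n may need to be lowered if the algorithm's limit is strict. If the forcats package is available, fct_lump_n() performs a similar lumping.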