Thursday, 28 November 2019

I wish to merge categories from a factor variable to reduce the number of levels

I have a dataset, bank.full, with a factor variable job:

summary(bank.full$job)
       admin.   blue-collar  entrepreneur     housemaid    management
         5171          9732          1487          1240          9458
      retired self-employed      services       student    technician
         2264          1579          4154           938          7597
   unemployed       unknown
         1303           288

This is the cross tab of job against the target variable y, as row proportions (each row sums to 1):

               y
                  no  yes
  admin.        0.88 0.12
  blue-collar   0.93 0.07
  entrepreneur  0.92 0.08
  housemaid     0.92 0.08
  management    0.87 0.13
  retired       0.83 0.17
  self-employed 0.89 0.11
  services      0.91 0.09
  student       0.72 0.28
  technician    0.90 0.10
  unemployed    0.84 0.16
  unknown       0.89 0.11

Now I wish to merge the job categories whose cross-tab values are similar. I tried these two approaches.

 bank.full$newjob<-ifelse(c(bank.full$job=='admin.',
+                            bank.full$job=='self-employed',
+                            bank.full$job=='unknown'),'CAT1',
+                   ifelse(c(bank.full$job=='blue-collar',
+                            bank.full$job=='entrepreneur'),'CAT2',
+                   ifelse(c(bank.full$job=='housemaid',
+                            bank.full$job=='services'),'CAT3',
+                   ifelse(c(bank.full$job=='management',
+                            bank.full$job=='unemployed',
+                            bank.full$job=='technician'),'CAT4',
+                   ifelse(bank.full$job=='student','student','retired')))))
Error in `$<-.data.frame`(`*tmp*`, newjob, value = c("CAT4", "retired",  : 
  replacement has 135633 rows, data has 45211
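
The replacement length in this error, 135633, is exactly three times the 45211 rows of data: c() concatenates the three logical vectors end to end instead of combining them with OR. Below is a minimal sketch of the same grouping written with %in%, which yields one TRUE/FALSE per row. The data frame bank.toy is a made-up stand-in for bank.full, not the real dataset.

```r
# Toy stand-in for bank.full (hypothetical data, not the real dataset)
bank.toy <- data.frame(
  job = factor(c("admin.", "blue-collar", "student", "retired",
                 "unknown", "services", "technician", "entrepreneur"))
)

# c() concatenates the comparisons, so the test is 3x longer than the data
bad_test <- c(bank.toy$job == "admin.",
              bank.toy$job == "self-employed",
              bank.toy$job == "unknown")
length(bad_test)   # 3 * nrow(bank.toy)

# %in% returns one TRUE/FALSE per row, so the lengths match
job <- as.character(bank.toy$job)
bank.toy$newjob <- ifelse(job %in% c("admin.", "self-employed", "unknown"), "CAT1",
                   ifelse(job %in% c("blue-collar", "entrepreneur"), "CAT2",
                   ifelse(job %in% c("housemaid", "services"), "CAT3",
                   ifelse(job %in% c("management", "unemployed", "technician"), "CAT4",
                          job))))   # student and retired keep their own labels
bank.toy$newjob <- factor(bank.toy$newjob)
```

Combining the comparisons with | instead of c() would work as well; %in% is just shorter when several levels map to one category.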

Second Approach

bank.full$newjob<-ifelse(bank.full$job=='admin.','CAT1',
+                   ifelse(bank.full$job=='self-employed','CAT1',
+                   ifelse(bank.full$job=='unknown'),'CAT1',
+                   ifelse(bank.full$job=='blue-collar','CAT2',
+                   ifelse(bank.full$job=='entrepreneur','CAT2',
+                   ifelse(bank.full$job=='housemaid','CAT3',
+                   ifelse(bank.full$job=='services','CAT3',
+                   ifelse(bank.full$job=='management','CAT4',
+                   ifelse(bank.full$job=='unemployed','CAT4',
+                   ifelse(bank.full$job=='technician','CAT4',"")))))))))
Error in ifelse(bank.full$job == "self-employed", "CAT1", ifelse(bank.full$job ==  : 
  unused arguments ("CAT1", ifelse(bank.full$job == "blue-collar", "CAT2", ifelse(bank.full$job == 
"entrepreneur", "CAT2", ifelse(bank.full$job == "housemaid", "CAT3", ifelse(bank.full$job == "services", "CAT3", ifelse(bank.full$job == "management", "CAT4", ifelse(bank.full$job == "unemployed", "CAT4",
 ifelse(bank.full$job == "technician", "CAT4", ""))))))))
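
The "unused arguments" error here traces back to the closing parenthesis placed right after bank.full$job=='unknown': it ends that ifelse() call early, so 'CAT1' and everything after it are passed as extra arguments to the enclosing ifelse(). Deep nesting makes this kind of slip easy. One possible alternative that avoids nesting altogether is a named lookup vector, sketched below. The vector job is a hypothetical sample standing in for bank.full$job, and job_map is an illustrative name.

```r
# Hypothetical sample of job values standing in for bank.full$job
job <- factor(c("admin.", "self-employed", "unknown", "blue-collar",
                "entrepreneur", "housemaid", "services", "management",
                "unemployed", "technician", "student", "retired"))

# Named lookup vector: names are the old levels, values the merged categories
job_map <- c("admin." = "CAT1", "self-employed" = "CAT1", "unknown" = "CAT1",
             "blue-collar" = "CAT2", "entrepreneur" = "CAT2",
             "housemaid" = "CAT3", "services" = "CAT3",
             "management" = "CAT4", "unemployed" = "CAT4", "technician" = "CAT4",
             "student" = "student", "retired" = "retired")

# Index the map by each row's character value, then drop the names
newjob <- factor(unname(job_map[as.character(job)]))
table(newjob)
```

Any level missing from job_map would come out as NA, which makes a forgotten category easy to spot.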

I was able to get output up to this level, but when I insert all the if conditions I get an error.

bank.full$newjob<-ifelse(bank.full$job=='admin.','CAT1',
+                          ifelse(bank.full$job=='self-employed','CAT1',
+                                 ifelse(bank.full$job=='unknown','CAT1',
+ ifelse(c(bank.full$job=='blue-collar',bank.full$job=='entrepreneur'),'CAT2',""))))
> bank.full$newjob<-as.factor(bank.full$newjob)
> summary(bank.full$newjob)
       CAT1  CAT2 
28441  7038  9732 
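
Note that the CAT2 count above, 9732, equals the blue-collar count alone; the 1487 entrepreneur rows ended up in the empty category. This is consistent with how ifelse() handles a `no` argument that is longer than the test: only the first elements are used, without any warning. A small sketch of the effect on a made-up four-element vector (job here is hypothetical):

```r
# Hypothetical four-row example
job <- c("admin.", "blue-collar", "entrepreneur", "student")

# Inner ifelse() gets a test of length 8 because c() concatenates
inner <- ifelse(c(job == "blue-collar", job == "entrepreneur"), "CAT2", "")
length(inner)

# Outer ifelse() has a test of length 4 and silently uses only inner[1:4],
# i.e. only the blue-collar comparison
truncated <- ifelse(job == "admin.", "CAT1", inner)

# With %in% the inner result has the right length and both jobs are caught
fixed <- ifelse(job == "admin.", "CAT1",
         ifelse(job %in% c("blue-collar", "entrepreneur"), "CAT2", ""))
```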
