samedi 25 février 2017

Conditional value assignment to multiple columns based on some other column value

I have been stuck with this problem and I am sure for some of you this will not be too hard to solve. I've had no success in finding the answer in this forum.

The company I work at has a rotational program in which employees spend some time in multiple divisions and, at the end of some period, they get evaluated for a more senior division (promotion). Most will finish their program in 3 years, then a few more in the 4th and 5th. A small percentage (around 15%) does not complete the program. The dataset is fairly large and goes back more than 30 years. Some data is entered manually and is prone to data entry mistakes. The columns cont1, cont2,...,cont7 flags whether the individual is still in the rotational program. The columns prom3, prom4 and prom5 has a 'Y' if the employee has successfully completed the program in 3, 4 and 5 years respectively. So a 'Y' in prom3 means that there will also be an 'Y' in prom4 and prom5, and consequently a NA in cont3,...,cont7 because the person is no longer in the rotational program. If an individual is not promoted in year 3 but, instead, do it in year 4, then prom4 is 'Y' and cont4,...,cont7 is NA. By now you see the problem. The issue is that I have more years. I understand I can use ifelse() but the code gets quite messy and long. I would like to find a solution to do this in a more elegant way, dynamically.

I need to find a way to dynamically program if prom3 has a 'Y' then cont2 is 'Y' and cont3,...,cont7 is NA. If id has 'Y' in prom4 then cont4,...,cont7 is NA and cont2 and cont3 is 'Y' and so on. Something like:

contYears <- seq(2,7, by=1)

promYears <- seq(3,5, by=1)

if (paste0("prom",promYears)=='Y'){
is.na(paste0("cont",contYears)) while contYears >= promYears)}
else paste0("cont",contYears)=='Y'

Sorry for the not so elegant try above!

Thank you for your help!!! Below a toy df:

set.seed(123)
df <- tibble::data_frame(id = seq(1,100, by=1),
                     cont2 = sample(c('Y', NA), 100, replace=T, prob = c(0.9, 0.1)),
                     cont3 = sample(c('Y', NA), 100, replace=T, prob = c(0.8, 0.2)),
                     cont4 = sample(c('Y', NA), 100, replace=T, prob = c(0.5, 0.5)),
                     cont5 = sample(c('Y', NA), 100, replace=T, prob = c(0.25,0.75)),
                     cont6 = sample(c('Y', NA), 100, replace=T, prob = c(0.15,0.85)),
                     cont7 = sample(c('Y', NA), 100, replace=T, prob = c(0.10,0.9)),
                     prom3 = sample(c('Y', NA), 100, replace=T, prob = c(0.5,0.5)),
                     prom4 = sample(c('Y', NA), 100, replace=T, prob = c(0.75,0.25)),
                     prom5 = sample(c('Y', NA), 100, replace=T, prob = c(0.85,0.15)))

head(df) 

id cont2 cont3 cont4 cont5 cont6 cont7 prom3 prom4 prom5
1     Y     Y     Y     Y     Y  <NA>  <NA>  <NA>     Y
2     Y     Y     Y     Y  <NA>  <NA>     Y     Y     Y
3     Y     Y     Y  <NA>     Y  <NA>  <NA>     Y     Y
4     Y     Y     Y  <NA>  <NA>  <NA>  <NA>  <NA>     Y
5     Y     Y     Y  <NA>  <NA>  <NA>  <NA>     Y     Y
6  <NA>     Y  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>

Aucun commentaire:

Enregistrer un commentaire