I have been stuck with this problem and I am sure for some of you this will not be too hard to solve. I've had no success in finding the answer in this forum.
The company I work at has a rotational program in which employees spend some time in multiple divisions and, at the end of some period, they get evaluated for a more senior division (promotion). Most will finish their program in 3 years, then a few more in the 4th and 5th. A small percentage (around 15%) does not complete the program. The dataset is fairly large and goes back more than 30 years. Some data is entered manually and is prone to data entry mistakes. The columns cont1
, cont2
,...,cont7
flags whether the individual is still in the rotational program. The columns prom3
, prom4
and prom5
has a 'Y'
if the employee has successfully completed the program in 3, 4 and 5 years respectively. So a 'Y'
in prom3
means that there will also be an 'Y'
in prom4
and prom5
, and consequently a NA
in cont3
,...,cont7
because the person is no longer in the rotational program. If an individual is not promoted in year 3 but, instead, do it in year 4, then prom4
is 'Y'
and cont4
,...,cont7
is NA
. By now you see the problem. The issue is that I have more years. I understand I can use ifelse()
but the code gets quite messy and long. I would like to find a solution to do this in a more elegant way, dynamically.
I need to find a way to dynamically program if prom3
has a 'Y'
then cont2
is 'Y'
and cont3
,...,cont7
is NA
. If id
has 'Y'
in prom4
then cont4
,...,cont7
is NA
and cont2
and cont3
is 'Y'
and so on. Something like:
contYears <- seq(2,7, by=1)
promYears <- seq(3,5, by=1)
if (paste0("prom",promYears)=='Y'){
is.na(paste0("cont",contYears)) while contYears >= promYears)}
else paste0("cont",contYears)=='Y'
Sorry for the not so elegant try above!
Thank you for your help!!! Below a toy df:
set.seed(123)
df <- tibble::data_frame(id = seq(1,100, by=1),
cont2 = sample(c('Y', NA), 100, replace=T, prob = c(0.9, 0.1)),
cont3 = sample(c('Y', NA), 100, replace=T, prob = c(0.8, 0.2)),
cont4 = sample(c('Y', NA), 100, replace=T, prob = c(0.5, 0.5)),
cont5 = sample(c('Y', NA), 100, replace=T, prob = c(0.25,0.75)),
cont6 = sample(c('Y', NA), 100, replace=T, prob = c(0.15,0.85)),
cont7 = sample(c('Y', NA), 100, replace=T, prob = c(0.10,0.9)),
prom3 = sample(c('Y', NA), 100, replace=T, prob = c(0.5,0.5)),
prom4 = sample(c('Y', NA), 100, replace=T, prob = c(0.75,0.25)),
prom5 = sample(c('Y', NA), 100, replace=T, prob = c(0.85,0.15)))
head(df)
id cont2 cont3 cont4 cont5 cont6 cont7 prom3 prom4 prom5
1 Y Y Y Y Y <NA> <NA> <NA> Y
2 Y Y Y Y <NA> <NA> Y Y Y
3 Y Y Y <NA> Y <NA> <NA> Y Y
4 Y Y Y <NA> <NA> <NA> <NA> <NA> Y
5 Y Y Y <NA> <NA> <NA> <NA> Y Y
6 <NA> Y <NA> <NA> <NA> <NA> <NA> <NA> <NA>