I'm working with a messy big data patient file (> 40 million rows). Each patient (id
) has several rows. Each row (roughly) represents one consultation with a symptom/disease code (icpc
). I add a new column with categories for patients with specific conditions (based on columns icpc
and icpc2
).
My raw data.frame (df
) looks something like this (this is fabricated data, id
is much longer in my dataset and I left out irrelevant columns which I like to drop):
id icpc icpc2 reg.date
123 D95 F15 19JUN2015
123 F85 15AUG2016
332 A01 16MAR2010
332 A04 20JAN2018
332 K20 20FEB2017
100 B10 01JUN2017
100 A04 11JAN2008
113 T08 18MAR2018
113 P28 19JAN2017
113 D95 A01 16JAN2013
113 A04 01MAY2009
551 B12 A01 03APR2011
551 D95 09MAY2015
Say I want to categorize patients with D95 and/or A01 as YES in the new column named 'condit' (based on two columns. I've been trying the following:
df.cat <- df %>%
group_by(id) %>%
mutate(condit =
if_else(icpc == c("D95", "A01") | icpc2 == c("D95", "A01") ~ 'yes', 'no'))
But this gives an error:
Error: `condition` must be a logical vector, not a `formula` object
Run `rlang::last_error()` to see where the error occurred.
I have also tried using case_when
instead of if_else
because I also need make a seperate file with multiple options for condit
.
This is the data.frame I want:
id icpc icpc2 reg.date condit
123 D95 F15 19JUN2015 yes
123 F85 15AUG2016 yes
332 A01 16MAR2010 yes
332 A04 20JAN2018 yes
332 K20 20FEB2017 yes
100 B10 01JUN2017 no
100 A04 11JAN2008 no
113 T08 18MAR2018 yes
113 P28 19JAN2017 yes
113 D95 A01 16JAN2013 yes
113 A04 01MAY2009 yes
551 B12 A01 03APR2011 yes
551 D90 09MAY2015 yes
Any help will be greatly appreciated. Thank you!
Aucun commentaire:
Enregistrer un commentaire