mardi 31 mars 2020

Create new column based on multiple conditions in multiple columns

I'm working with a messy big data patient file (> 40 million rows). Each patient (id) has several rows. Each row (roughly) represents one consultation with a symptom/disease code (icpc). I add a new column with categories for patients with specific conditions (based on columns icpc and icpc2).

My raw data.frame (df) looks something like this (this is fabricated data, id is much longer in my dataset and I left out irrelevant columns which I like to drop):

id icpc icpc2 reg.date 
123 D95 F15   19JUN2015 
123 F85       15AUG2016 
332 A01       16MAR2010 
332 A04       20JAN2018
332 K20       20FEB2017
100 B10       01JUN2017 
100 A04       11JAN2008
113 T08       18MAR2018
113 P28       19JAN2017 
113 D95 A01   16JAN2013
113 A04       01MAY2009
551 B12 A01   03APR2011
551 D95       09MAY2015

Say I want to categorize patients with D95 and/or A01 as YES in the new column named 'condit' (based on two columns. I've been trying the following:

df.cat <- df %>%
  group_by(id) %>%
  mutate(condit = 
           if_else(icpc == c("D95", "A01") | icpc2 == c("D95", "A01") ~ 'yes', 'no'))

But this gives an error:

Error: `condition` must be a logical vector, not a `formula` object
Run `rlang::last_error()` to see where the error occurred.

I have also tried using case_when instead of if_else because I also need make a seperate file with multiple options for condit.

This is the data.frame I want:

id icpc icpc2 reg.date  condit
123 D95 F15   19JUN2015 yes
123 F85       15AUG2016 yes
332 A01       16MAR2010 yes
332 A04       20JAN2018 yes
332 K20       20FEB2017 yes
100 B10       01JUN2017 no
100 A04       11JAN2008 no
113 T08       18MAR2018 yes
113 P28       19JAN2017 yes
113 D95 A01   16JAN2013 yes
113 A04       01MAY2009 yes
551 B12 A01   03APR2011 yes
551 D90       09MAY2015 yes

Any help will be greatly appreciated. Thank you!

Aucun commentaire:

Enregistrer un commentaire