mardi 17 mars 2020

Ignore NA values of a column within a statement

Until now I've been working with a medium size dataset for an Ocupation Survey(around 200 mb total), here's the data if you want to review it: https://drive.google.com/drive/folders/1Od8zlOE3U3DO0YRGnBadFz804OUDnuQZ?usp=sharing

I have the following code:

hogares<-read.csv("/home/servicio/Escritorio/TR_VIVIENDA01.CSV")
personas<-read.csv("/home/servicio/Escritorio/TR_PERSONA01.CSV")
datos<-merge(hogares,personas)

library(dplyr)


base<-tibble(ID_VIV=datos$ID_VIV, ID_PERSONA=datos$ID_PERSONA, EDAD=datos$EDAD, CONACT=datos$CONACT) 
base$maxage <- ave(base$EDAD, base$ID_VIV, FUN=max)
base$Condición_I<-case_when(base$CONACT==32 & base$EDAD>=60 ~ 1,
                            base$CONACT>=10 & base$EDAD>=60 & base$CONACT<=16 ~ 2,
                            base$CONACT==20 & base$EDAD>=60 | base$CONACT==31 & base$EDAD>=60 | (base$CONACT>=33 & base$CONACT<=35 & base$EDAD>=60) ~ 3)
base <- subset(base, maxage >= 60) 
base<- base %>%  group_by(ID_VIV) %>% mutate(Condición_V = if(n_distinct(Condición_I) > 1) 4 else Condición_I)
base$ID_VIV<-as.character(base$ID_VIV)           
base$ID_PERSONA<-as.character(base$ID_PERSONA)
base

And ended up with:

# A tibble: 38,307 x 7
# Groups:   ID_VIV [10,499]
   ID_VIV      ID_PERSONA     EDAD CONACT maxage Condición_I Condición_V
   <chr>       <chr>         <int>  <int>  <int>       <dbl>       <dbl>
 1 10010000007 1001000000701    69     32     69           1           1
 2 10010000008 1001000000803    83     33     83           3           4
 3 10010000008 1001000000802    47     33     83          NA           4
 4 10010000008 1001000000801    47     10     83          NA           4
 5 10010000012 1001000001204     4     NA     60          NA           4
 6 10010000012 1001000001203     2     NA     60          NA           4
 7 10010000012 1001000001201    60     10     60           2           4
 8 10010000012 1001000001202    21     10     60          NA           4
 9 10010000014 1001000001401    67     32     67           1           4
10 10010000014 1001000001402    64     33     67           3           4

The Condición_I column value is a code for the labour conditions of each individual(row), some of this individuals share house (that's why they share ID_VIV), I only care about the individuals that are 60yo or more, all the NA are individuals who live with a 60+yo but I do not care about their situation (but I need to keep them), I need the column Condición_V to display another value following this conditions:

Condición_I == 1 ~ 1
Condición_I == 2 ~ 2
Condición_I == 3 ~ 3
Any combination of Condición_I ~ 4

This means that if all the 60 and+_yo individuals in a house have Condición_I == 1 then Condición_V will be 1 that's true up to code 3, when there are x.e. one person C_I == 1 and another one C_I == 3 in the same house, then Condición_V will be 4
And I'm hoping to get this kind of result:

A tibble: 38,307 x 7

# Groups:   ID_VIV [10,499]
   ID_VIV      ID_PERSONA     EDAD CONACT maxage Condición_I Condición_V
   <chr>       <chr>         <int>  <int>  <int>       <dbl>       <dbl>
 1 10010000007 1001000000701    69     32     69           1           1
 2 10010000008 1001000000803    83     33     83           3           3
 3 10010000008 1001000000802    47     33     83          NA           3
 4 10010000008 1001000000801    47     10     83          NA           3
 5 10010000012 1001000001204     4     NA     60          NA           2
 6 10010000012 1001000001203     2     NA     60          NA           2
 7 10010000012 1001000001201    60     10     60           2           2
 8 10010000012 1001000001202    21     10     60          NA           2
 9 10010000014 1001000001401    67     32     67           1           4
10 10010000014 1001000001402    64     33     67           3           4

I know my error is in:

`#base<- base %>%  group_by(ID_VIV) %>% mutate(Condición_V = if(n_distinct(Condición_I) > 1) 4 else` Condición_I)

Is there a way to use that line of code ignoring the NA values or is it my best option to do it otherway, I do not have to do it the way I'm trying and any other way or help will be much appreciated!

if-statement

mardi 17 mars 2020

Ignore NA values of a column within a statement

A tibble: 38,307 x 7

Aucun commentaire:

Enregistrer un commentaire