Monday, 27 May 2019

Looping over the subsets of a dataframe based on two conditions

I have the following problem: I need to go through each subset of a dataframe (the subsets are defined by the value of one variable) and create new entries for another variable depending on two conditions.

The dataframe (dt3) has 4 variables: birth year (birth_year), last name (Name), role in the household (role) and household (hh). The whole set is subsetted by the hh variable, which gathers all the individuals living in the same household. For instance, in my example below, the first 4 rows belong to household "1". Under the variable role, only the head of the household is stated. The remaining roles are empty and must be derived, and this is what I'm trying to do. My first step is to assign the role "children". I was thinking of doing it by looping over the whole data set and over each subset (each hh value). Whenever a row holds a person who has the same last name as the head of the household and whose birth year is at least 15 years later than the head's, that person is inferred to be "children".

The original dataframe is:

birth_year  Name        role                hh
1877        Snijders    Head of household   1
1885        Marteen     NA                  1
1897        Snijders    NA                  1
1892        Zelstra     NA                  1
1878        Kuipers     Head of household   2
1870        Marteen     NA                  2
1897        Wals        NA                  2
1900        Venstra     NA                  2
1900        Lippe       Head of household   3
1905        Flachs      NA                  3
1920        Lippe       NA                  3
1922        Lippe       NA                  3

So I need to go through the whole set, household by household, and check the following two conditions: a. the person's name == the name of the head, and b. the person's birth year is at least 15 years later than the head's.

If both hold, this person is "children".
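Conditions a and b can also be checked for every row at once, without an explicit loop. The following is only a minimal base R sketch, assuming each household has exactly one row whose role is "Head of household"; the names `heads`, `head_idx`, `same_name`, `born_15_later` and `is_child` are illustrative and do not come from the attempts in this post.

```r
# Example data from the question; only the head's role is filled in.
dt3 <- data.frame(
  birth_year = c(1877, 1885, 1897, 1892, 1878, 1870,
                 1897, 1900, 1900, 1905, 1920, 1922),
  Name = c("Snijders", "Marteen", "Snijders", "Zelstra", "Kuipers", "Marteen",
           "Wals", "Venstra", "Lippe", "Flachs", "Lippe", "Lippe"),
  role = rep(c("Head of household", NA, NA, NA), times = 3),
  hh = rep(1:3, each = 4),
  stringsAsFactors = FALSE
)

# One row per household: the head's name and birth year.
# Assumption: exactly one head per household.
heads <- dt3[!is.na(dt3$role) & dt3$role == "Head of household", ]

# For every row of dt3, the position of its household's head in `heads`.
head_idx <- match(dt3$hh, heads$hh)

# Condition a: same last name as the head of the same household.
same_name <- dt3$Name == heads$Name[head_idx]
# Condition b: born at least 15 years after the head.
born_15_later <- dt3$birth_year - heads$birth_year[head_idx] >= 15

# Only fill rows whose role is still missing, so the heads stay untouched.
is_child <- is.na(dt3$role) & same_name & born_15_later
dt3$role[is_child] <- "children"

print(dt3)
```

Looking the head up with `match()` means the head does not have to be the first row of its household; rows that fail either condition keep their `NA` role.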

So far I've tried several things. Since I place the head in the first row of each household, I was doing this:

a) A nested loop, where I go through the data set and then through each hh. For each hh I check the conditions (by comparing each row's name and birth year with those of the first row of the hh, i.e. the head):

for (n in 1:unique(dt3$hh)) {
  for (i in 1:length(which(dt3$hh == n))) {
    mutate(dt3, role = ifelse(dt3$Name[[1, 2]] == dt3$Name[[n, 1]] &
      dt3$birth_year[[n, i]] > dt3$birth_year[[n, 1]], "children", "NoA"))
  }
}

b) I have also tried to do the same with lists. I first split dt3 by the hh variable:

dt3 <- split(dt3, f = dt3$hh)

And then

for (n in 1:dt3) {
  mutate(dt3, role = ifelse(dt3$name[[n, i]] == dt3$name[[n, 1]] &
    dt3$birth_year[[n, i]] > dt3$birth_year[[n, 1]], "children", "NoA"))
}
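For comparison, here is a minimal base R sketch of how a split-based approach could look: split by hh, label each piece, and bind the pieces back together. It assumes the head is the first row of every household, as described above; `label_children`, `households`, `labelled` and `result` are hypothetical names introduced only for this illustration.

```r
# Example data from the question; only the head's role is filled in.
dt3 <- data.frame(
  birth_year = c(1877, 1885, 1897, 1892, 1878, 1870,
                 1897, 1900, 1900, 1905, 1920, 1922),
  Name = c("Snijders", "Marteen", "Snijders", "Zelstra", "Kuipers", "Marteen",
           "Wals", "Venstra", "Lippe", "Flachs", "Lippe", "Lippe"),
  role = rep(c("Head of household", NA, NA, NA), times = 3),
  hh = rep(1:3, each = 4),
  stringsAsFactors = FALSE
)

# Label the children inside one household.
# Assumption: the head of the household is the first row of the subset.
label_children <- function(household) {
  head_name <- household$Name[1]
  head_year <- household$birth_year[1]
  is_child <- is.na(household$role) &
    household$Name == head_name &
    household$birth_year - head_year >= 15
  household$role[is_child] <- "children"
  household
}

# split() returns a list with one data frame per hh value.
households <- split(dt3, dt3$hh)
# Apply the rule to each household, then stack the pieces again.
labelled <- lapply(households, label_children)
result <- do.call(rbind, labelled)
rownames(result) <- NULL

print(result)
```

Keeping the split result in its own variable (`households`) leaves the original `dt3` intact, and the function returns the modified subset so the change is not lost.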

What I was expecting is an outcome like this:

birth_year  Name        role                hh
1877        Snijders    Head of household   1
1885        Marteen     NA                  1
1897        Snijders    children            1
1892        Zelstra     NA                  1
1878        Kuipers     Head of household   2
1870        Marteen     NA                  2
1897        Wals        NA                  2
1900        Venstra     NA                  2
1900        Lippe       Head of household   3
1905        Flachs      NA                  3
1920        Lippe       children            3
1922        Lippe       children            3

Any tips are welcome.

Thank you in advance.
