lundi 17 février 2020

Conditional loop matching two datasets in R

I would like to figure out who lived in a location before the current owner.

Previous owners (>20,000) are stored in a dataset called lifetime_census. Here is a sample of the data:

  id    squirrel_id reflo  locx  locy census_date gr    year  census_year lifespan sex  
  <chr> <chr>       <chr> <dbl> <dbl> <date>      <chr> <chr>       <dbl>    <dbl> <chr>
16161 5587        -310     -3    10 2001-05-15  KL    2001         2001     1810 F    
17186 5587        -310     -3    10 2001-08-15  KL    2001         2001     1810 F    
17723 5587        -310     -3    10 2002-05-15  KL    2002         2002     1810 F    
18473 5587        -310     -3    10 2002-08-15  KL    2002         2002     1810 F    
19345 5879        -310     -3    10 2003-05-15  KL    2003         2003     2109 M    
20535 5879        -310     -3    10 2003-08-15  KL    2003         2003     2109 M
21812 6055        -310     -3    10 2004-05-15  KL    2004         2004      802 F    
22663 6130        -310     -3    10 2004-08-15  KL    2004         2004      374 M   
16848 5101        Q1     17.3   0.8 2001-06-21  KL    2001         2001       55 F    
17033 5101        Q1     17.3   0.8 2001-08-15  KL    2001         2001       55 F    
17836 6501        Q1     17.3   0.8 2002-05-15  KL    2002         2002     1598 F    
18626 6501        Q1     17.3   0.8 2002-08-15  KL    2002         2002     1598 F    
19439 6501        Q1     17.3   0.8 2003-05-15  KL    2003         2003     1598 F    
20675 6501        Q1     17.5   0.8 2003-08-15  KL    2003         2003     1598 F    
21815 6057        Q1     17.5   0.8 2004-05-15  KL    2004         2004      479 F    
21815 6077        M2     13     1.8 2004-05-15  SU    2004         2004      979 M    

I then have an owners dataset (here is another sample):

  squirrel_id spr_census reflo.x spring_locx spring_locy spring_grid aut_census reflo.y autumn_locx autumn_locy autumn_grid
  <chr>            <dbl>  <chr>         <dbl>       <dbl> <chr>            <dbl> <chr>         <dbl>       <dbl> <chr>      
1 6391              2005  Q1             17.5         0.8 KL                2004 M4             13.3           4 SU  
1 6130              2005 -310             -3          10 KL                2004 -310           -3.2        10.3 KL    

Using a loop I'd like to create a new column for each squirrel_id in the owners dataset, showing who the most recent previous owner was (i.e., the previous owner's squirrel_id, their sex, lifespan, and reflo - which are all in the lifetime_census dataset).

I would like to do this for the spr_census and aut_census locations (reflo.x and reflo.y, respectively) for each squirrel_id in the owners dataset.

The complicating factors include:

  • if there is not an owner with the exact same reflo (i.e., lifetime_census$reflo==owners$reflo.x and lifetime_census$reflo==owners$reflo.y), then the next closest owner will do (for spring: (30*owners$spring_locx[i]-30*lifetime_census$locx)^2+(30*owners$spring_locy[i]-30*lifetime_census$locy)^2<=(distance)^2 and for autumn: (30*owners$autumn_locx[i]-30*lifetime_census$locx)^2+(30*owners$autumn_locy[i]-30*lifetime_census$locy)^2<=(distance)^2)
  • the previous owner has to be within two years of the spr_census and aut_census years
  • if even after these conditions are relaxed there is still no owner, then it should be marked with an NA

For example, for squirrel_id 6391, the previous spr_census at reflo Q1 would have been squirrel_id 6057, but for the aut_census at reflo M4 there is no previous owner, but there is an owner nearby (at M2).

The goal is to have these columns added to the owners dataset:

squirrel_id spring_owner_id  spring_owner_reflo  spring_owner_sex  spring_owner_lifespan  autumn_owner_id  autumn_owner_reflo  autumn_owner_sex  autumn_owner_lifespan
6391        6057             Q1                  F                 479                     6077              M2                 M                979  

What I have currently tried is this (just for the spr_census):

n <- length(owners$squirrel_id)
distance <- 30

for(i in 1:n) {
    print(i)
  last_owner <- subset(lifetime_census,
    lifetime_census$gr== owners$spring_grid[i] & #owners must be on same grid
    lifetime_census$census_year <= owners$spr_census[i] & #owners can be in current or past year
    lifetime_census$squirrel_id != owners$squirrel_id[i] & #previous owner cant be current owner
    (30*owners$spring_locx[i]-30* lifetime_census$locx)^2+(30* owners$spring_locy[i]-30* lifetime_census$locy)^2<=(distance)^2)  #pick neighbour within set distance 
    #lifetime_census$reflo==owners$reflo.x #not included

    #Put it all together
    owners[i,"spring_owner"] <- last_owner

else {
owners[i, "spring_owner"] <- NA
}
}

I am unable to figure out how to make this code find the previous spr_census and aut_census owners, nor restrict this to the last two years, and then use the distance filter after no reflo match was found.

Any ideas?

Aucun commentaire:

Enregistrer un commentaire