dimanche 24 janvier 2021

Separating list-col values into sigular values relative to a condition

I'm trying to separate the list-col values when converting a dataframe from long to wide format. I know that this has been answered here, however I want these columns with values adjacent to one another, and by adding an id this won't be the case, and neither does distinct() work. Although, I believe this can be tackled using this logic.

Given the dataframe looks like this:

# A tibble: 10 x 4
   year  locality_id landcover  pland
   <chr> <chr>           <int>  <dbl>
 1 2010  L452817             8 0.0968
 2 2010  L452817             9 0.0323
 3 2010  L452817            12 0.613 
 4 2010  L452817            13 0.194 
 5 2010  L452817            14 0.0645
 6 2019  L452817             8 0.0645
 7 2019  L452817             9 0.0645
 8 2019  L452817            12 0.516 
 9 2019  L452817            13 0.194 
10 2019  L452817            14 0.161 

And the wide format looks like this:

#with lists in a single row
Rows: 7
Columns: 4
$ locality_id <chr> "L452817", "L452817", "L452817", "L452817", "L452817",...
$ pland       <dbl> 0.09677419, 0.03225806, 0.61290323, 0.19354839, 0.0645...
$ `2010`      <list> [8, 9, 12, 13, 14, 16, 16]
$ `2019`      <list> [17, 17, 17, 13, <8, 9>, 12, 14]

I'm trying to wrap my head around the logic to get around this for my situation, I've thought of various ways to tackle it and many failed. My current solution (until I find a problem with it), is combing a previous solution to a question of mine.

Given the first table, luckily the landcover value are the same for each year relative to locality_id however, their pland values are different, so when converting it to pivot_wider I'll get lists. My two solutions:

  1. Minus the pland values for 2019 from 2010, and store those differences:

enter image description here

However, my dataset is very large and I'll find that some values are covered in 2010 and not in 2019 or vice versa. When such a case happens, fill those NA values by fulfiling the condition provided in one of my previous questions, where if 2010 is NA fill it with 16 otherwise, if 2019 is NA fill it with 17. However, assign the pland value a negative sign if its from 2010 and make it positive if its ``in 2019. To suggest that if the habitat was in 2010 and not 2019, then there's clearly been a decrease in that habitat type.

Although, I may find that I have completely different pland and landcover values for a specific locality_id, also of different length (however, I was only able to find a similar looking table to meet this condition given the amount of time I have available):

# A tibble: 7 x 4
  year  locality_id landcover  pland
  <chr> <chr>           <int>  <dbl>
1 2010  L596267             0 0.194 
2 2010  L596267             9 0.0323
3 2010  L596267            11 0.0645
4 2010  L596267            13 0.710 
5 2019  L596267             0 0.194 
6 2019  L596267            11 0.0968
7 2019  L596267            13 0.710

2010 has length 4 whilst 2019 has length 3. Luckily, in this scenario, some values match so we just proceed by minus the pland values for those that match. Otherwise, replace NA with 17.

Here's a criteria for Pland, for each year relative to locality_id , then pland should sum to 1.

However, I'm trying to calculate the difference, so these values should be either minus, 0, or positive.

Finding the differences has also been mentioned here for a previous question of mine, however, implementing it when transitioning from long to wide format is the tricky part.

reproducible code:

structure(list(year = c(2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 
2010L, 2010L, 2010L, 2010L, 2010L, 2019L, 2019L, 2019L, 2019L, 
2019L), locality_id = c("L452817", "L452817", "L452817", "L452817", 
"L452817", "L910180", "L910180", "L910180", "L910180", "L910180", 
"L910180", "L452817", "L452817", "L452817", "L452817", "L452817"
), landcover = c(8L, 9L, 12L, 13L, 14L, 0L, 8L, 9L, 10L, 11L, 
13L, 8L, 9L, 12L, 13L, 14L), pland = c(0.0967741935483871, 0.032258064516129, 
0.612903225806452, 0.193548387096774, 0.0645161290322581, 0.4375, 
0.34375, 0.03125, 0.03125, 0.09375, 0.0625, 0.0645161290322581, 
0.0645161290322581, 0.516129032258065, 0.193548387096774, 0.161290322580645
)), row.names = c(NA, -16L), class = c("tbl_df", "tbl", "data.frame"
))

Aucun commentaire:

Enregistrer un commentaire