Sunday, 28 October 2018

How to create a custom rules-based tie-breaker?

Sample df:

library(tidyverse)  # tibble(), the dplyr verbs, and stringr's `fruit` vector

set.seed(1)
df <- tibble(name = fruit[1:10],
             A = rpois(10, 10),
             B = rpois(10, 2),
             C = rpois(10, 6),
             D = rpois(10, 2))

           name  A B C D A_rank AB_rank ABC_rank ABCD_rank
1         apple  8 1 8 2      9       9        8         7
2       apricot 10 1 7 3      7       7        4         5
3       avocado  7 0 8 0     10      10        9        10
4        banana 11 1 3 2      4       5        9         9
5   bell pepper 14 4 7 3      1       1        1         1
6      bilberry 12 1 5 3      3       3        4         5
7    blackberry 11 2 8 2      4       3        3         3
8  blackcurrant  9 2 7 4      8       7        4         4
9  blood orange 14 2 8 2      1       2        2         2
10    blueberry 11 1 6 1      4       5        4         7

This builds on a question I asked previously, where I wanted to perform row-wise calculations to compute ranks on the cumulative sums of the columns (A, then A + B, then A + B + C, and so on), where a higher sum means a lower rank number.

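# Inner apply(): row-wise cumulative sums (returned with one column per fruit).
# Negate so a higher total gets a lower rank, then min_rank() each cumulative total.
df <- cbind(df, apply(-apply(df[, -1], 1, cumsum), 1, min_rank) %>%
          as_tibble() %>%
          rename(A_rank = A, AB_rank = B, ABC_rank = C, ABCD_rank = D))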
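
For anyone who wants to check the rank columns without dplyr, here is a minimal base R sketch of the same computation. It assumes the scores are hard-coded from the table above rather than drawn with rpois, and the names `scores`, `totals`, and `ranks` are only illustrative. It relies on rank(..., ties.method = "min"), which is what min_rank is equivalent to.

```r
# Scores copied from the table above (rows = fruits in table order, columns = events)
scores <- matrix(
  c( 8, 1, 8, 2,
    10, 1, 7, 3,
     7, 0, 8, 0,
    11, 1, 3, 2,
    14, 4, 7, 3,
    12, 1, 5, 3,
    11, 2, 8, 2,
     9, 2, 7, 4,
    14, 2, 8, 2,
    11, 1, 6, 1),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("A", "B", "C", "D"))
)

# Row-wise cumulative sums: column k holds the total of the first k events.
# apply() over rows returns the result transposed, hence t().
totals <- t(apply(scores, 1, cumsum))

# Rank each cumulative column. Negating makes a higher total get a lower rank,
# and ties.method = "min" gives tied fruits the same (lowest) rank.
ranks <- apply(-totals, 2, rank, ties.method = "min")
colnames(ranks) <- c("A_rank", "AB_rank", "ABC_rank", "ABCD_rank")
ranks
```

The resulting matrix should match the four rank columns in the table, ties included.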

However, what I would like now is to incorporate a custom rules-based tie-breaker function, which neither base R nor dplyr provides. The rules for my tie-breaker function at each rank calculation are:

  • The fruit with the higher number of points in the most events (columns) wins.
  • If a tie still remains, the fruit with the largest number of points in any single column is given the higher place.
    • If the tie still exists, compare the second-highest number of points, and so on.
  • Otherwise, fall back to min_rank.
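
To make the rules concrete, here is a rough sketch of how they might look as a comparison between two tied fruits. `compare_tied` is a hypothetical helper name, not an existing function, and it assumes a pairwise reading of the first rule (count the columns in which one fruit beats the other). How the rules should apply when three or more fruits are tied at once is not settled by this sketch.

```r
# Hypothetical helper: compare two fruits' scores over the columns counted so far.
# Returns 1 if x places higher, -1 if y places higher, 0 if they are still tied.
compare_tied <- function(x, y) {
  # Rule 1: which fruit has the higher score in more columns?
  wins <- sum(x > y) - sum(y > x)
  if (wins != 0) return(sign(wins))

  # Rule 2: compare the largest single score, then the second largest, and so on
  d <- sort(x, decreasing = TRUE) - sort(y, decreasing = TRUE)
  first_diff <- d[d != 0]
  if (length(first_diff) > 0) return(sign(first_diff[1]))

  # Still tied: the caller falls back to min_rank
  0
}

compare_tied(c(12, 1), c(11, 2))        # bilberry vs blackberry on A, B: 1
compare_tied(c(11, 1), c(11, 1))        # banana vs blueberry on A, B: 0
compare_tied(c(7, 0, 8), c(11, 1, 3))   # avocado vs banana on A, B, C: -1
```

A comparison like this only decides a single pair. Turning it into ranks would still need a step that orders each group of tied fruits and assigns min_rank to any that remain level.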

So, in my df, looking at the first rank computation, just for A:

df %>% select(name, A, A_rank) %>% arrange(A_rank)
           name  A A_rank
1   bell pepper 14      1
2  blood orange 14      1
3      bilberry 12      3
4        banana 11      4
5    blackberry 11      4
6     blueberry 11      4
7       apricot 10      7
8  blackcurrant  9      8
9         apple  8      9
10      avocado  7     10

Since this is the first rank calculation, there is no further information to separate fruits with tied scores, so they simply use min_rank, which is fine.

After summing columns A and B row-wise:

df %>% select(name, A, B, AB_rank) %>% arrange(AB_rank)
           name  A B AB_rank
1   bell pepper 14 4       1
2  blood orange 14 2       2
3      bilberry 12 1       3
4    blackberry 11 2       3
5        banana 11 1       5
6     blueberry 11 1       5
7       apricot 10 1       7
8  blackcurrant  9 2       7
9         apple  8 1       9
10      avocado  7 0      10

Here, bilberry and blackberry each have one column in which they score higher than the other, so the first rule leaves them tied and I want to move on to the second rule: bilberry takes rank 3, as it has the larger single number (12, in column A), while blackberry goes to rank 4. The same reasoning separates apricot (10 in A) from blackcurrant (9 in A), giving them ranks 7 and 8.

For banana and blueberry, a tie would still remain after applying both rules, so they use min_rank, which is fine here.

Expected output

           name  A B AB_rank
1   bell pepper 14 4       1
2  blood orange 14 2       2
3      bilberry 12 1       3
4    blackberry 11 2       4
5        banana 11 1       5
6     blueberry 11 1       5
7       apricot 10 1       7
8  blackcurrant  9 2       8
9         apple  8 1       9
10      avocado  7 0      10
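
For this two-column step the rules simplify, which makes it easy to check the expected output above. If two fruits have equal A + B, either their scores are identical or each beats the other in exactly one column, so a pairwise reading of the first rule can never separate them and only the "largest single score" rule matters. A minimal base R sketch under that assumption, with the scores hard-coded from the table and all variable names illustrative:

```r
fruit_name <- c("apple", "apricot", "avocado", "banana", "bell pepper",
                "bilberry", "blackberry", "blackcurrant", "blood orange", "blueberry")
A <- c(8, 10, 7, 11, 14, 12, 11, 9, 14, 11)
B <- c(1, 1, 0, 1, 4, 1, 2, 2, 2, 1)

total <- A + B       # cumulative score after two events
top   <- pmax(A, B)  # largest single score per fruit

# Sort by total, then by largest single score, both descending
ord <- order(-total, -top)

# A fruit starts a new rank unless its (total, top) pair repeats an earlier one
is_new <- !duplicated(cbind(total, top)[ord, ])

# Fruits that are still tied inherit the rank of the first fruit in their group
AB_rank <- integer(length(total))
AB_rank[ord] <- cummax(seq_along(ord) * is_new)

data.frame(name = fruit_name, A, B, AB_rank)[ord, ]
```

This shortcut only holds for two columns. From the third column onwards the first rule can decide a tie on its own, so a more general comparison is needed.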

Now, using the sums of A, B, C:

df %>% select(name, A, B, C, ABC_rank) %>% arrange(ABC_rank)
           name  A B C ABC_rank
1   bell pepper 14 4 7        1
2  blood orange 14 2 8        2
3    blackberry 11 2 8        3
4       apricot 10 1 7        4
5      bilberry 12 1 5        4
6  blackcurrant  9 2 7        4
7     blueberry 11 1 6        4
8         apple  8 1 8        8
9       avocado  7 0 8        9
10       banana 11 1 3        9

Apricot, bilberry, blackcurrant, and blueberry all have the same sum. Applying the first rule, blueberry drops to rank 7, as it does not have the highest number in any of the three columns A, B, C. Bilberry then takes rank 4, as it has the highest single figure (12, in A), followed by apricot at rank 5 with a highest figure of 10, and blackcurrant at rank 6.

Looking at avocado and banana, banana would be rank 9, as it has larger values than avocado in two columns (A and B), while avocado would become rank 10.

Expected output

           name  A B C ABC_rank
1   bell pepper 14 4 7        1
2  blood orange 14 2 8        2
3    blackberry 11 2 8        3
4      bilberry 12 1 5        4
5       apricot 10 1 7        5
6  blackcurrant  9 2 7        6
7     blueberry 11 1 6        7
8         apple  8 1 8        8
9        banana 11 1 3        9
10      avocado  7 0 8       10

This is quite complex, and I'm not sure what the best way to tackle it is. Possibly an if/else statement?
