Friday, 30 October 2020

How to conditionally select top value per group without comparing each value?

The data I have looks like:

  Group Gene      Score     direct_count   secondary_count 
    1   AQP11    0.5566507       4               5
    1   CLNS1A   0.2811747       0               2
    1   RSF1     0.5469924       3               6
    2   CFDP1    0.4186066       1               2
    2   CHST6    0.4295135       1               3
    3   ACE      0.634           1               1
    3   NOS2     0.6345          1               1

I am grouping the genes by the Group column then selecting the best gene per group based on conditions:

  1. Select the gene with the highest score if the score difference between the top scored gene and every other gene in the group is >0.05

  2. If the score difference between the top scored gene and any other gene in the group is <0.05, select the gene with the highest direct_count, choosing only among the genes within a <0.05 distance of the top scored gene in that group

  3. If the direct_count is the same, select the gene with the highest secondary_count

  4. If all counts are the same, select all of the genes that are within a <0.05 distance of each other.
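
The four rules above can be read as successive filters on one group. The following is only a minimal base R sketch of that reading, not a definitive implementation: the helper name pick_best and the tol argument are hypothetical, and it assumes that rule 4 means "return every gene still tied after both count comparisons".

```r
# Hypothetical helper: returns the positions of the selected genes within one group.
pick_best <- function(score, direct_count, secondary_count, tol = 0.05) {
  # Rules 1-2: only genes within `tol` of the top score stay in contention.
  keep <- which(max(score) - score < tol)
  # Rule 2: among those, keep the gene(s) with the highest direct_count.
  keep <- keep[direct_count[keep] == max(direct_count[keep])]
  # Rule 3: break remaining ties on secondary_count.
  keep <- keep[secondary_count[keep] == max(secondary_count[keep])]
  # Rule 4 (assumed reading): whatever is still tied is returned together.
  keep
}

# The three example groups, using the values from the table above.
g1 <- pick_best(c(0.5566507, 0.2811747, 0.5469924), c(4, 0, 3), c(5, 2, 6))
g2 <- pick_best(c(0.4186066, 0.4295135), c(1, 1), c(2, 3))
g3 <- pick_best(c(0.634, 0.6345), c(1, 1), c(1, 1))
```

Here g1 points at the first gene (AQP11), g2 at the second (CHST6), and g3 at both genes of group 3, which matches the output I am after.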

The expected output for the example looks like:

 Group Gene      Score     direct_count   secondary_count 
    1   AQP11    0.5566507       4               5       #highest direct_count
    2   CHST6    0.4295135       1               3       #highest secondary_count after matching direct_count
    3   ACE      0.634           1               1       #ACE and NOS2 have matching counts
    3   NOS2     0.6345          1               1

Currently I am trying to code this with:

library(data.table)
setDT(df)  # converts df to a data.table by reference
new_df <- df[, 
   {d = dist(Score, method = 'manhattan')
   if (any(d > 0.05)) 
     ind = which.max(d)
   else if (sum(max(direct_count) == direct_count) == 1L) 
     ind = which.max(direct_count)
   else if (sum(max(secondary_count) == secondary_count) == 1L) 
     ind = which.max(secondary_count)
   else 
     ind = which((outer(direct_count, direct_count, '==') & outer(secondary_count, secondary_count, '=='))[1, ])
   
   .SD[ind]
   }
   , by = Group]

However, I am struggling to adjust my first else if statement so that it follows my 2nd condition and only chooses between genes with a <0.05 distance to the top scored gene. Currently it compares all genes in the group. For example, if the top scored gene is at 0.7 and another gene at 0.68 meets the <0.05 distance requirement, a gene with a score of 0.1 but the largest count columns still gets selected over both of them.

Essentially I want my conditions 2 to 4 to only be considering the genes that are <0.05 distance to the top scored gene per group.
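
One way this restriction might be expressed in data.table is to filter each group down to the genes near the top score first, and only then compare the counts. This is a sketch under assumptions rather than a confirmed solution: the names tol and cand are made up for illustration, the data are typed in from the input data given in this question, and ties that survive both count comparisons are assumed to be returned together.

```r
library(data.table)

df <- data.table(
  Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
  Gene = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"),
  Score = c(0.5566507, 0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345),
  direct_count = c(4L, 0L, 3L, 1L, 1L, 1L, 1L),
  secondary_count = c(5L, 2L, 6L, 2L, 3L, 1L, 1L)
)

tol <- 0.05  # assumed name for the score-distance threshold

# Step 1: per group, keep only genes within `tol` of the top score.
cand <- df[, .SD[max(Score) - Score < tol], by = Group]
# Step 2: among those candidates, keep the highest direct_count.
cand <- cand[, .SD[direct_count == max(direct_count)], by = Group]
# Step 3: break remaining ties on secondary_count; full ties are all kept.
new_df <- cand[, .SD[secondary_count == max(secondary_count)], by = Group]

new_df$Gene
```

Because the low-scoring genes are dropped in the first step, they can no longer win on counts in the later steps.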

Input data:

structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), Gene = c("AQP11", 
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507, 
0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L, 
0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L, 
3L, 1L, 1L)), row.names = c(NA, -7L), class = c("data.table", 
"data.frame"))

Edit:

The reason for my question is a problem with one specific group not doing as I expect:

structure(list(Group = c(2L, 2L, 2L, 2L, 2L), Gene = c("CFDP1", 
"CHST6", "RNU6-758P", "Gene1", "TMEM170A"), Score = c(0.551740109920502, 
0.598918557167053, 0.564491391181946, 0.567291617393494, 0.616708278656006
), direct_count = c(1, 1, 0, 0, 0), secondary_count = c(62, 
6, 1, 1, 2)), row.names = c(NA, -5L), class = c("data.table", 
"data.frame"))

From this group Gene1 is being selected when it should actually be CHST6 and I can't find why.

Data looks like:

  Group Gene         Score      direct_count     secondary_count
1   2    CFDP1        0.5517401        1                  62
2   2    CHST6        0.5989186        1                   6
3   2    RNU6-758P    0.5644914        0                   1
4   2    Gene1        0.5672916        0                   1
5   2    TMEM170A     0.6167083        0                   2

CHST6 has the highest direct_count out of all genes within <0.05 of the top scored gene in this group, yet Gene1 is being selected.
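
A possible explanation, offered as a sketch rather than a confirmed diagnosis: dist() returns a vector of pairwise distances (10 values for 5 genes), so which.max() on it gives a position in that pairwise vector, not a row number. The variable names below (gene, score, bad_ind, near_top) are only for illustration.

```r
gene  <- c("CFDP1", "CHST6", "RNU6-758P", "Gene1", "TMEM170A")
score <- c(0.551740109920502, 0.598918557167053, 0.564491391181946,
           0.567291617393494, 0.616708278656006)

# Pairwise distances, stored in the order (2,1), (3,1), (4,1), (5,1), (3,2), ...
d <- dist(score, method = "manhattan")
n_pairs <- length(d)

# Position of the largest pairwise distance; for this group that is the
# (5,1) pair, i.e. the 4th entry of d, which is unrelated to row 4.
bad_ind <- which.max(d)

# Genes that are actually within 0.05 of the top score.
near_top <- gene[max(score) - score < 0.05]
```

For this group bad_ind is 4, so .SD[ind] returns the 4th row, Gene1, while near_top contains CHST6, Gene1 and TMEM170A, of which CHST6 has the highest direct_count.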
