The data I have looks like:
Group Gene Score direct_count secondary_count
1 AQP11 0.5566507 4 5
1 CLNS1A 0.2811747 0 2
1 RSF1 0.5469924 3 6
2 CFDP1 0.4186066 1 2
2 CHST6 0.4295135 1 3
3 ACE 0.634 1 1
3 NOS2 0.6345 1 1
I am grouping the genes by the Group column, then selecting the best gene per group based on these conditions:
- Select the gene with the highest score if the score difference between the top-scored gene and all others in the group is >0.05.
- If the score difference between the top gene and any other gene in the group is <0.05, select the gene with the higher direct_count, choosing only among the genes that are <0.05 from the top-scored gene in that group.
- If the direct_count is the same, select the gene with the highest secondary_count.
- If all counts are the same, select all genes that are <0.05 from each other.
The expected output for the example looks like:
Group Gene Score direct_count secondary_count
1 AQP11 0.5566507 4 5 #highest direct_count
2 CHST6 0.4295135 1 3 #highest secondary_count after matching direct_count
3 ACE 0.634 1 1 #ACE and NOS2 have matching counts
3 NOS2 0.6345 1 1
Currently I am trying to code this with:
df <- setDT(df)
new_df <- df[,
  {
    d = dist(Score, method = 'manhattan')
    if (any(d > 0.05))
      ind = which.max(d)
    else if (sum(max(direct_count) == direct_count) == 1L)
      ind = which.max(direct_count)
    else if (sum(max(secondary_count) == secondary_count) == 1L)
      ind = which.max(secondary_count)
    else
      ind = which((outer(direct_count, direct_count, '==') & outer(secondary_count, secondary_count, '=='))[1, ])
    .SD[ind]
  },
  by = Group]
However, I am struggling to adjust my first else if statement so that it follows my 2nd condition and only chooses between genes that are <0.05 from the top-scored gene. Currently it compares all genes in the group. For example, if the top-scored gene is at 0.7 and other genes in the group are at 0.68 (meeting the <0.05 requirement), a gene with a score of 0.1 but the largest count columns still gets selected over the top-scored gene.
Essentially, I want conditions 2 to 4 to consider only the genes that are <0.05 from the top-scored gene in each group.
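To make the intended behaviour concrete, here is a minimal sketch of the rules, not a definitive implementation. It first narrows each group to the genes within the score window of the top gene and only then applies the count tie-breaks. The helper name `pick_rows` and its `tol` argument are hypothetical and not part of my code above; treating a difference of exactly 0.05 as outside the window is also an assumption, since the conditions only mention <0.05 and >0.05.

```r
library(data.table)

# Hypothetical helper: returns the row positions to keep within one group.
# `tol` is an assumed name for the 0.05 score window.
pick_rows <- function(score, direct, secondary, tol = 0.05) {
  # Candidates: rows whose score is less than `tol` below the group's top score.
  # If the top gene leads every other gene by more than `tol`, it is the only candidate.
  cand <- which(max(score) - score < tol)
  # Among the candidates, keep those with the highest direct_count.
  cand <- cand[direct[cand] == max(direct[cand])]
  # Among those, keep the highest secondary_count; any remaining ties are all returned.
  cand[secondary[cand] == max(secondary[cand])]
}

# Same values as the input data posted in this question.
df <- data.table(
  Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
  Gene = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"),
  Score = c(0.5566507, 0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345),
  direct_count = c(4L, 0L, 3L, 1L, 1L, 1L, 1L),
  secondary_count = c(5L, 2L, 6L, 2L, 3L, 1L, 1L)
)

new_df <- df[, .SD[pick_rows(Score, direct_count, secondary_count)], by = Group]
print(new_df)
```

Because every candidate is within the window of the top score, the candidates are also within the window of each other, which is what the last condition asks for. Separately, `which.max(d)` on a `dist` object returns a position within the vector of pairwise distances rather than a row number, so it may be worth checking whether that explains an unexpected row coming back.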
Input data:
structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), Gene = c("AQP11",
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507,
0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L,
0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L,
3L, 1L, 1L)), row.names = c(NA, -7L), class = c("data.table",
"data.frame"))
Edit:
The reason for my question is a problem with one specific group not doing as I expect:
structure(list(Group = c(2L, 2L, 2L, 2L, 2L), Gene = c("CFDP1",
"CHST6", "RNU6-758P", "Gene1", "TMEM170A"), Score = c(0.551740109920502,
0.598918557167053, 0.564491391181946, 0.567291617393494, 0.616708278656006
), direct_count = c(1, 1, 0, 0, 0), secondary_count = c(62,
6, 1, 1, 2)), row.names = c(NA, -5L), class = c("data.table",
"data.frame"))
From this group Gene1 is being selected when it should actually be CHST6 and I can't find why.
Data looks like:
Group Gene Score direct_count secondary_count
1 2 CFDP1 0.5517401 1 62
2 2 CHST6 0.5989186 1 6
3 2 RNU6-758P 0.5644914 0 1
4 2 Gene1 0.5672916 0 1
5 2 TMEM170A 0.6167083 0 2
CHST6 has the highest direct_count of all genes that are <0.05 from the top-scored gene in this group, yet Gene1 is being selected.