Hi, I am trying to create a new column in a data frame whose value depends on conditions in several other columns of the same data frame. My research involves quantifying the severity of occlusion of the coronary artery (heart artery).
The example data frame 'x' is:
structure(list(Study_number = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 17, 18, 18, 18, 18, 19, 19, 19, 19, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 22, 23, 23, 23, 23, 24, 24, 24, 24, 25, 25, 25, 25, 26, 26, 26, 26, 27, 27, 27, 28, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 33, 34, 34, 34, 34, 35, 36, 36, 36, 36, 37, 37, 37, 37, 38, 38, 38, 38, 39, 39, 39, 39, 40, 40, 40, 40, 41, 41, 41, 41, 42, 42, 42, 42, 43, 43, 43, 43, 44, 44, 44, 44, 45, 45, 45, 45, 46, 46, 46, 46, 47, 47, 47, 47, 48, 48, 48, 48, 49, 49, 49, 49, 50, 50, 50, 50, 51, 51, 51, 51, 52, 52, 52, 53, 53, 53, 53, 54, 54, 54, 54, 55, 55, 55, 56, 56, 56, 56, 57, 57, 57, 57, 58, 58, 58, 58, 59, 59, 59, 59, 60, 60, 60, 60, 61, 61, 61, 61, 62, 62, 63, 63, 63, 63, 64, 64, 64, 64, 65, 65, 65, 65, 66, 66), Vessel = c(1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 2, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3), Segment = c(3, 9, 7, 8, 2, 9, 7, 8, 9, 7, 8, 3, 9, 6, 11, 3, 9, 6, 8, 2, 9, 9, 15, 2, 9, 7, 8, 2, 9, 6, 8, 2, 9, 2, 9, 7, 8, 3, 9, 9, 11, 1, 9, 7, 8, 2, 9, 6, 8, 2, 9, 7, 11, 1, 9, 6, 12, 2, 9, 7, 11, 2, 9, 6, 15, 2, 9, 6, 8, 2, 9, 7, 8, 
3, 9, 7, 11, 2, 9, 6, 11, 2, 9, 7, 8, 1, 9, 6, 11, 2, 9, 8, 11, 2, 9, 7, 8, 2, 9, 7, 11, 9, 7, 11, 2, 9, 6, 11, 3, 9, 7, 11, 2, 9, 6, 11, 2, 9, 7, 8, 1, 9, 6, 11, 4, 9, 7, 3, 9, 7, 8, 9, 2, 9, 7, 8, 2, 9, 7, 11, 1, 9, 7, 14, 2, 9, 7, 11, 2, 9, 6, 12, 2, 9, 6, 11, 2, 9, 7, 8, 2, 9, 9, 8, 2, 9, 7, 12, 2, 9, 7, 11, 1, 9, 7, 8, 2, 9, 7, 15, 2, 9, 6, 11, 2, 9, 6, 8, 3, 9, 10, 14, 2, 9, 6, 11, 1, 6, 11, 1, 9, 6, 8, 1, 9, 7, 11, 2, 8, 12, 2, 9, 7, 8, 1, 9, 7, 11, 0, 9, 6, 12, 1, 9, 7, 8, 0, 9, 6, 11, 0, 9, 7, 8, 9, 7, 3, 9, 7, 8, 2, 9, 7, 11, 21, 9, 6, 11, 9, 7), Severity = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("Study_number", "Vessel", "Segment", "Severity"), row.names = c(NA, -250L), class = c("tbl_df", "tbl", "data.frame"))
The actual data frame looks like this:
   Study_number Vessel Segment Severity
          <dbl>  <dbl>   <dbl>    <dbl>
 1            1      1       3        0
 2            1      2       9        0
 3            1      3       7        0
 4            1      4       8        0
 5            2      1       2        0
 6            2      2       9        0
 7            2      3       7        0
 8            2      4       8        0
 9            3      2       9        0
10            3      3       7        1
Study_number = participant ID
Vessel = Vessel ID (1 to 4)
Segment = Segment ID of that particular vessel
Severity = Severity of disease in that vessel (0 = no, 1 = yes)
There are usually 4 vessels (1-4) per participant, although some participants may have only 3 vessels.
What I want to achieve is a new column called 'Overall_severe_disease', which should flag a participant when any of the following criteria is met:
- Vessel 2 has severe disease (i.e., Vessel == 2 and Severity == 1 in the same row); OR
- Vessel 3 has segment 6 or segment 7 with severe disease (i.e., Vessel == 3 and Segment == 6 or 7 and Severity == 1 in the corresponding row) AND at least one other vessel has severe disease (i.e., sum of the Severity column == 2); OR
- 3 or more vessels have severe disease (i.e., sum of the Severity column >= 3 per participant).
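The criteria above are all participant-level statements, so they can be sketched as one TRUE/FALSE value per participant. The following is only a minimal base R sketch under stated assumptions: the data frame `toy` and the helper `is_overall_severe()` are made up for illustration and are not part of the original data or code.

```r
# Toy data (hypothetical, not the study data): three participants, four vessels each.
toy <- data.frame(
  Study_number = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  Vessel       = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
  Segment      = c(2, 9, 7, 8, 2, 9, 6, 11, 2, 9, 7, 8),
  Severity     = c(0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0)
)

# Hypothetical helper: takes the rows of ONE participant and returns a single
# TRUE/FALSE saying whether any of the three criteria is met.
is_overall_severe <- function(d) {
  n_severe <- sum(d$Severity)
  # Criterion 1: vessel 2 is severe.
  rule1 <- any(d$Vessel == 2 & d$Severity == 1)
  # Criterion 2: vessel 3, segment 6 or 7, is severe and exactly 2 vessels are severe.
  rule2 <- any(d$Vessel == 3 & d$Segment %in% c(6, 7) & d$Severity == 1) && n_severe == 2
  # Criterion 3: three or more vessels are severe.
  rule3 <- n_severe >= 3
  rule1 || rule2 || rule3
}

# One logical value per participant, named by Study_number.
flag <- sapply(split(toy, toy$Study_number), is_overall_severe)

# Copy the participant-level flag back onto every row of that participant.
toy$Overall_severe_disease <- unname(
  ifelse(flag[as.character(toy$Study_number)], "Yes", "No")
)
```

Because `any()` collapses the row-wise comparisons to a single value before the rules are combined, the result in this sketch does not depend on the order in which the three rules are written.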
This is how I attempted to tackle the issue.
- First create a Vessel-Severity column by pasting them together.
x$Vessel_Severity <- paste(x$Vessel, x$Severity, sep = '-')
The new dataframe will look like this:
  Study_number Vessel Segment Severity Vessel_Severity
         <dbl>  <dbl>   <dbl>    <dbl> <chr>
1            1      1       3        0 1-0
2            1      2       9        0 2-0
3            1      3       7        0 3-0
4            1      4       8        0 4-0
5            2      1       2        0 1-0
6            2      2       9        0 2-0
- Then I use the following ddply function in the plyr package to apply nested ifelse conditions to each participant.
library(plyr)
x <- ddply(x, 'Study_number', transform,
           Overall_severe_disease = ifelse(Vessel_Severity == '3-1' & Segment %in% c(6, 7) & sum(Severity) == 2, 1,
                                    ifelse(Vessel_Severity == '2-1', 1,
                                    ifelse(sum(Severity) >= 3, 1, 0))))
- After that, I use the following call to assign 'Yes' or 'No' to the 'Overall_severe_disease' column (if any row of a participant has a '1', the whole participant is assigned 'Yes').
x <- ddply(x, 'Study_number', transform, Overall_severe_disease = ifelse(sum(Overall_severe_disease) >= 1, 'Yes', 'No'))
- This method works and gives me 16 unique participants with 'Overall_severe_disease':
length(unique(x$Study_number[x$Overall_severe_disease=='Yes']))
- But if I change the order of the nested ifelse statements and place the last condition first ('ifelse(sum(Severity) >= 3'), then ddply does not seem to apply the remaining conditions, and I get a heavily under-estimated result (8 unique participants rather than 16):
x <- ddply(x, 'Study_number', transform,
           Overall_severe_disease = ifelse(sum(Severity) >= 3, 1,
                                    ifelse(Vessel_Severity == '2-1', 1,
                                    ifelse(Vessel_Severity == '3-1' & Segment %in% c(6, 7) & sum(Severity) == 2, 1, 0))))
x <- ddply(x, 'Study_number', transform, Overall_severe_disease = ifelse(sum(Overall_severe_disease) >= 1, 'Yes', 'No'))
length(unique(x$Study_number[x$Overall_severe_disease=='Yes']))
I am confused by this behaviour. I would be grateful for some advice and clarification.
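One possible explanation, offered as a sketch rather than a confirmed diagnosis: ifelse() returns a value with the same shape as its test argument. When the outermost test is sum(Severity) >= 3, the test has length 1, so the whole nested expression returns a single value taken from the first row of the group, and transform then recycles it over every row of that participant. The vectors below are a hypothetical single participant, not rows from the real data.

```r
# Hypothetical participant: only vessel 2 (the second row) has severe disease.
Vessel_Severity <- c("1-0", "2-1", "3-0", "4-0")
Severity <- c(0, 1, 0, 0)

# Row-wise test outermost: the test has length 4, so the result has length 4.
row_first <- ifelse(Vessel_Severity == "2-1", 1,
             ifelse(sum(Severity) >= 3, 1, 0))

# Scalar test outermost: the test has length 1, so the result has length 1.
# Only the first element of the inner (row-wise) result is kept.
sum_first <- ifelse(sum(Severity) >= 3, 1,
             ifelse(Vessel_Severity == "2-1", 1, 0))

print(row_first)  # one value per row; the vessel-2 row is flagged
print(sum_first)  # a single value, taken from the first row only
```

In this sketch the participant would be flagged with the first ordering but missed with the second, because the severe vessel is not in the first row. Wrapping the row-wise conditions in any() would make each condition a single value per participant, so the order of the conditions would no longer matter.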