Friday, 9 July 2021

R - Why does my for-loop with if-statement not work?

Sorry if this has an obvious solution, but I'm not very familiar with conditional statements. I've been stuck on this problem for a while and haven't been able to find the mistake.

I have a data frame that looks like this, with several more columns of clinical data:

> samples[,c(2,360,361)]
      patient_id  sample_id timepoint
d1.18    1056023 1056023.d1        d1
d1.4     3278638 3278638.d1        d1
d1.37     858412  858412.d1        d1
d4.4     3278638 3278638.d4        d4
d4.31     467506  467506.d4        d4
d4.29    1064441 1064441.d4        d4
d1.29    1064441 1064441.d1        d1
d4.37     858412  858412.d4        d4
d4.22     967710  967710.d4        d4
d1.52     294224  294224.d1        d1
d4.51     907354  907354.d4        d4

For some patients I have two samples, taken at two different timepoints: d1 and d4. For others I only have d1 or d4. I would like to keep only one sample per patient, choosing d1 when both samples are available. My final data frame should look like this:

> samples[,c(2,360,361)]
      patient_id  sample_id timepoint
d1.18    1056023 1056023.d1        d1
d1.4     3278638 3278638.d1        d1
d1.37     858412  858412.d1        d1
d4.31     467506  467506.d4        d4
d1.29    1064441 1064441.d1        d1
d4.22     967710  967710.d4        d4
d1.52     294224  294224.d1        d1
d4.51     907354  907354.d4        d4

This has been my approach:

for(i in unique(samples$patient_id)){
  if((sum(samples$patient_id == i)) == 2){
    samples <- samples[-(samples$patient_id == i & samples$timepoint == d4),]
  }
}

Although my final data frame has the same number of rows as length(unique(samples$patient_id)), some patients have disappeared completely and others still have both samples.
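
A likely explanation for this symptom, sketched on a toy copy of the table above (the three-column data frame and the name `fixed` are assumptions for illustration, not the real clinical data): applying `-` to a logical vector turns it into 0 and -1, so the index removes row 1 every time the condition matches anything, which would account for the correct row count with the wrong rows. In addition, `d4` without quotes refers to an object named d4 rather than the string "d4". Using `!` and a quoted "d4" appears to give the intended result:

```r
# Toy data mirroring the example in the question (assumed, not the real table)
samples <- data.frame(
  patient_id = as.integer(c(1056023, 3278638, 858412, 3278638, 467506, 1064441,
                            1064441, 858412, 967710, 294224, 907354)),
  timepoint  = c("d1", "d1", "d1", "d4", "d4", "d4",
                 "d1", "d4", "d4", "d1", "d4"),
  stringsAsFactors = FALSE
)
samples$sample_id <- paste(samples$patient_id, samples$timepoint, sep = ".")

# Negating a logical vector yields 0 and -1; as a row index, 0 is ignored
# and -1 drops the FIRST row, whichever row actually matched.
-(c(FALSE, TRUE, FALSE))   # 0 -1 0

fixed <- samples
for (i in unique(fixed$patient_id)) {
  if (sum(fixed$patient_id == i) == 2) {
    # `!` keeps every row that is NOT the d4 sample of this patient
    fixed <- fixed[!(fixed$patient_id == i & fixed$timepoint == "d4"), ]
  }
}
fixed
```

This sketch assumes a patient never has more than one d1 and one d4 sample.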

Instead of removing rows from the original data frame, I have also tried storing the rows I want in an empty list, or generating the sample names from the patient and timepoint columns, like this:

patients <- unique(samples$patient_id)
dat <- list()

for(i in patients){
  
  if((sum(samples$patient_id == i)) == 2){
    dat[[i]] <- paste(i, "d1", sep = ".")
  }else if ((sum(samples$patient_id == i)) == 1){
    dat[[i]] <- paste(i, "d4", sep = ".")
  } else{
    NULL
  }
}

But this results in a list with 1314182 elements.
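
The size of the list is probably explained by positional indexing: `dat[[i]]` with a numeric `i` assigns to position i, so R pads the list with NULL up to the largest patient ID used. The loop also assumes that a patient with a single sample always has d4, which is not true for the example table (1056023 only has d1). A minimal sketch that names the elements instead and looks up the actual timepoint, using a small assumed data frame and the hypothetical names `keep`, `tp` and `chosen`:

```r
# Small assumed subset of the example table
samples <- data.frame(
  patient_id = as.integer(c(1056023, 3278638, 3278638, 467506)),
  timepoint  = c("d1", "d1", "d4", "d4"),
  stringsAsFactors = FALSE
)

# A numeric index is a position, so the list grows to that length
dat <- list()
dat[[5]] <- "x"
length(dat)   # 5

# Indexing by name (a character string) adds one element per patient
keep <- list()
for (i in unique(samples$patient_id)) {
  tp <- samples$timepoint[samples$patient_id == i]   # timepoints of this patient
  chosen <- if ("d1" %in% tp) "d1" else tp[1]        # prefer d1 when available
  keep[[as.character(i)]] <- paste(i, chosen, sep = ".")
}
unlist(keep)
```

The resulting sample names can then be matched against `samples$sample_id` with `%in%` to subset the data frame.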

I would be very grateful for any assistance!
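
For reference, the selection rule described in the question (keep d1 when it exists, otherwise keep d4) can also be written without a loop. This is only a sketch on an assumed toy copy of the table, with the hypothetical names `is_d1`, `has_d1` and `one_per_patient`, and it assumes at most one sample per patient and timepoint:

```r
# Toy data mirroring the example in the question (assumed, not the real table)
samples <- data.frame(
  patient_id = as.integer(c(1056023, 3278638, 858412, 3278638, 467506, 1064441,
                            1064441, 858412, 967710, 294224, 907354)),
  timepoint  = c("d1", "d1", "d1", "d4", "d4", "d4",
                 "d1", "d4", "d4", "d1", "d4"),
  stringsAsFactors = FALSE
)
samples$sample_id <- paste(samples$patient_id, samples$timepoint, sep = ".")

is_d1  <- samples$timepoint == "d1"                          # rows that are d1 samples
has_d1 <- samples$patient_id %in% samples$patient_id[is_d1]  # patient has a d1 somewhere

# Keep every d1 row, plus d4 rows of patients with no d1 sample
one_per_patient <- samples[is_d1 | !has_d1, ]
one_per_patient
```

Because this is a plain logical subset, the remaining rows stay in their original order.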
