mercredi 24 mars 2021

How to replace values of several columns based on/ another column in R within each row?

I am working on a data set (30000 x 500 ) where I need to replace some values in columns based on data from another column. The problem is that in each row, the reference values change. Here is an sub-example of the data set:

#Create a data frame
df <- data.frame(SNP = c("SNP1","SNP2","SNP3","SNP4","SNP5","SNP6","SNP7","SNP8","SNP9","SNP10"), 
                   A_allele = c("C","G","C","G","C","C","A","T","G","C"),
                   B_allele = c("G","A","T","A","A","G","T","A","C","A"),
                   alleles = c("C/G","G/A","C/T","G/A","C/A","C/G","A/T","T/A","G/C","C/A"),
                   line_1 = sample(c("A","B"),10, replace = TRUE),
                   line_2 = sample(c("A","B"),10, replace = TRUE),
                   line_3 = sample(c("A","B"),10, replace = TRUE),
                   line_4 = sample(c("A","B"),10, replace = TRUE),
                   line_5 = sample(c("A","B"),10, replace = TRUE),
                   line_6 = sample(c("A","B"),10, replace = TRUE),
                   line_7 = sample(c("A","B"),10, replace = TRUE),
                   line_8 = sample(c("A","B"),10, replace = TRUE),
                   line_9 = sample(c("A","B"),10, replace = TRUE),
                   line_10 = sample(c("A","B"),10, replace = TRUE)
                   )

df
head(df)
     SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 line_8 line_9 line_10
1   SNP1        C        G     C/G      B      A      B      A      B      B      B      B      B       A
2   SNP2        G        A     G/A      A      B      A      A      A      B      B      A      B       A
3   SNP3        C        T     C/T      B      B      A      B      B      B      A      A      A       A
4   SNP4        G        A     G/A      A      B      B      A      B      A      B      B      B       A
5   SNP5        C        A     C/A      B      A      B      B      B      A      B      A      B       B
6   SNP6        C        G     C/G      B      A      B      A      B      A      B      B      B       B
7   SNP7        A        T     A/T      B      A      A      B      A      A      B      A      B       A
8   SNP8        T        A     T/A      A      B      A      B      A      A      B      B      A       B
9   SNP9        G        C     G/C      B      A      B      B      B      B      A      B      A       B
10 SNP10        C        A     C/A      B      B      B      B      B      A      A      A      A       A

For each row, A_allele and B_allele columns serve as reference values to change A or B values in the 10 lines. When there is an "A" value => use the value from column A_allele and when there is a "B" value => use the value from column_B.

In the example, this should be as following:

  • Row 1: Change lines with A to C / Change lines with B to G
  • Row 2: Change lines with A to G / Change lines with B to A
  • Row 3: Change lines with A to C / Change lines with B to T
  • Row 10: The same idea.

The output should look something like this:

SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 line_8 line_9 line_10
1   SNP1    C   G   C/G G   C   G   C   G   G   G   G   G   C
2   SNP2    G   A   G/A G   A   G   G   G   A   A   G   A   G
3   SNP3    C   T   C/T T   T   C   T   T   T   C   C   C   C
4   SNP4    G   A   G/A G   A   A   G   A   G   A   A   A   G
5   SNP5    C   A   C/A A   C   A   A   A   C   A   C   A   A
6   SNP6    C   G   C/G G   C   G   C   G   C   G   G   G   G
7   SNP7    A   T   A/T T   A   A   T   A   A   T   A   T   A
8   SNP8    T   A   T/A T   A   T   A   T   T   A   A   T   A
9   SNP9    G   C   G/C C   G   C   C   C   C   G   C   G   C
10  SNP10   C   A   C/A A   A   A   A   A   C   C   C   C   C

As there are ~30000 rows, I would like an efficient code to run if it possible.

Any suggestions?

Aucun commentaire:

Enregistrer un commentaire