I am working on a data set (30000 x 500 ) where I need to replace some values in columns based on data from another column. The problem is that in each row, the reference values change. Here is an sub-example of the data set:
#Create a data frame
df <- data.frame(SNP = c("SNP1","SNP2","SNP3","SNP4","SNP5","SNP6","SNP7","SNP8","SNP9","SNP10"),
A_allele = c("C","G","C","G","C","C","A","T","G","C"),
B_allele = c("G","A","T","A","A","G","T","A","C","A"),
alleles = c("C/G","G/A","C/T","G/A","C/A","C/G","A/T","T/A","G/C","C/A"),
line_1 = sample(c("A","B"),10, replace = TRUE),
line_2 = sample(c("A","B"),10, replace = TRUE),
line_3 = sample(c("A","B"),10, replace = TRUE),
line_4 = sample(c("A","B"),10, replace = TRUE),
line_5 = sample(c("A","B"),10, replace = TRUE),
line_6 = sample(c("A","B"),10, replace = TRUE),
line_7 = sample(c("A","B"),10, replace = TRUE),
line_8 = sample(c("A","B"),10, replace = TRUE),
line_9 = sample(c("A","B"),10, replace = TRUE),
line_10 = sample(c("A","B"),10, replace = TRUE)
)
df
head(df)
SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 line_8 line_9 line_10
1 SNP1 C G C/G B A B A B B B B B A
2 SNP2 G A G/A A B A A A B B A B A
3 SNP3 C T C/T B B A B B B A A A A
4 SNP4 G A G/A A B B A B A B B B A
5 SNP5 C A C/A B A B B B A B A B B
6 SNP6 C G C/G B A B A B A B B B B
7 SNP7 A T A/T B A A B A A B A B A
8 SNP8 T A T/A A B A B A A B B A B
9 SNP9 G C G/C B A B B B B A B A B
10 SNP10 C A C/A B B B B B A A A A A
For each row, A_allele and B_allele columns serve as reference values to change A or B values in the 10 lines. When there is an "A" value => use the value from column A_allele and when there is a "B" value => use the value from column_B.
In the example, this should be as following:
- Row 1: Change lines with A to C / Change lines with B to G
- Row 2: Change lines with A to G / Change lines with B to A
- Row 3: Change lines with A to C / Change lines with B to T
- Row 10: The same idea.
The output should look something like this:
SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 line_8 line_9 line_10
1 SNP1 C G C/G G C G C G G G G G C
2 SNP2 G A G/A G A G G G A A G A G
3 SNP3 C T C/T T T C T T T C C C C
4 SNP4 G A G/A G A A G A G A A A G
5 SNP5 C A C/A A C A A A C A C A A
6 SNP6 C G C/G G C G C G C G G G G
7 SNP7 A T A/T T A A T A A T A T A
8 SNP8 T A T/A T A T A T T A A T A
9 SNP9 G C G/C C G C C C C G C G C
10 SNP10 C A C/A A A A A A C C C C C
As there are ~30000 rows, I would like an efficient code to run if it possible.
Any suggestions?
Aucun commentaire:
Enregistrer un commentaire