vendredi 10 avril 2020

How to merge two files based on data in multiple columns?

I have two separate files, each containing a different number of columns which I want to merge based on data in multiple columns.

file1

VMNF01000015.1  1769465 1769675 .   .   -   Focub_II5_mimp_1
VMNF01000014.1  3225875 3226081 .   .   +   Focub_II5_mimp_1
VMNF01000014.1  3226046 3226081 .   .   -   Focub_II5_mimp_1
VMNF01000014.1  3585246 3585281 .   .   -   Focub_II5_mimp_1
VMNF01000014.1  3692468 3692503 .   .   -   Focub_II5_mimp_1
VMNF01000014.1  3715380 3715415 .   .   +   Focub_II5_mimp_1
VMNF01000014.1  2872478 2872511 .   .   -   Focub_II5_mimp_1

file2

VMNF01000014.1  3225875-3226081(+)  gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1  3226046-3226081(-)  tacacacctgcgaatactttttgcatcccactgta
VMNF01000015.1  1769465-1769675(-)  gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1  3692468-3692503(-)  tacagtgggatgcaaaaagtattcgcaggtgt
VMNF01000014.1  3715380-3715415(+)  gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1  3585246-3585281(-)  tacagtgggatgcaaaaagtattcgcaggtgt
VMNF01000014.1  2872478-2872511(-)  gtacttcagcctggattcaaacttattgcatcccactgta

First, I think I need to create another 2 columns in file2, separating numbers by "-" and creating a new column for "(*)", but I cannot work out how to separate the numbers without replacing "(-)" too. So far I have been using this command:

awk '{gsub("-","\t",$2);print;}'

Once this has been done, I would like to add the last column in file2 to file1. I have been able to do this using the following command:

awk 'NR==FNR {a[$1]=$3; next} {print $1,$2,$3,$4,$5,$6,$7,a[$1];}' file2 file1 > file3. 

However, the data does not match. It is matched based on the entry in column 1. The data in column 1 is the same in many instances, so the data in column 8 of file3 only matches one of the entries, and doesn't match the data in column 2 or 3 in file1 e.g.

file3:

VMNF01000015.1  1769465 1769675 .   .   -   Focub_II5_mimp_1    gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1  3225875 3226081 .   .   +   Focub_II5_mimp_1    gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1  3226046 3226081 .   .   -   Focub_II5_mimp_1    gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1  3585246 3585281 .   .   -   Focub_II5_mimp_1    gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1  3692468 3692503 .   .   -   Focub_II5_mimp_1    gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1  3715380 3715415 .   .   +   Focub_II5_mimp_1    gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1  2872478 2872511 .   .   -   Focub_II5_mimp_1    gtacttcagcctggattcaaacttattgcatcccactgta

Even if I was able to separate the data in column 2 of file2, I would still have the same problem as the data in column 2 is the same in some instances. What I need is code that says something along the lines of: sperate the data in column 2 (see below);

VMNF01000014.1  3225875    3226081    (+)   gtacttcagcctggattcaaacttattgcatcccactgta

then:

if $1,$2,$3 in file1 match $1,$2,$3 in file2, print $1,$2,$3,$4,$5,$6,$7 from file1 and add $5 from file2.

How can I do this? I know that awk can use if statements, but I don't know how to use them in awk.

Any advice?

Aucun commentaire:

Enregistrer un commentaire