Suppose I have a file that contains the following unique header line:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MA605 MA611 MA622 MA625 MA629 Ncm8
I want to create a table after reading this file. The problem is that every column from the 10th onward represents a sample, so the number of samples can vary between files of this kind.
I created a new output file that had the following header.
contig pos ref alt_My freq_My MA605 MA611 MA622 MA625 MA629 Ncm8
with the following script:
with open('MY.phased_variants.vcf') as vcf1_My:
    for line in vcf1_My:
        if line.startswith('#CHROM'):
            header = line.rstrip('\n').split('\t')
            # Every column after the 9th field is a sample, so this slice
            # captures however many samples a given file contains.
            sample_genotype = header[9:]
            with open("My_allele_table-Markov02.txt", "w") as output:
                output.write("contig\tpos\tref\talt_My\tfreq_My\t" + '\t'.join(sample_genotype) + '\n')
            break
But I want my header to be:
contig pos ref alt_My freq_My MA605_GT MA605_PG MA611_GT MA611_PG ...
and so on for the remaining samples too, i.e. MA622_GT, MA625_GT, MA629_GT, Ncm8_GT.
I tried to use:
GT_tag = '_'.join('GT' for x in sample_genotype)
How can I improve this last line of code to do what I want? Any other solutions are also appreciated.
Thanks,
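One possible approach, as a minimal sketch: the attempt above joins the literal string 'GT' once per sample, which gives 'GT_GT_GT_GT_GT_GT' rather than tagged sample names. Pairing each sample name with each tag in a nested comprehension should give the desired columns instead. The sketch assumes every sample gets both a _GT and a _PG column, in that order; the names tags, fixed_columns and sample_columns are made up for illustration, and the sample list is hard-coded here in place of header[9:].

```python
# Sample names as they would come out of header[9:] in the script above
# (hard-coded here so the sketch runs on its own).
sample_genotype = ["MA605", "MA611", "MA622", "MA625", "MA629", "Ncm8"]

# Assumption: every sample gets the same two tags, in this order.
tags = ["GT", "PG"]

# Fixed leading columns, taken from the existing output header.
fixed_columns = ["contig", "pos", "ref", "alt_My", "freq_My"]

# One "<sample>_<tag>" column per (sample, tag) pair, grouped by sample.
sample_columns = [sample + "_" + tag for sample in sample_genotype for tag in tags]

header_line = "\t".join(fixed_columns + sample_columns)
print(header_line)
```

In the original script, header_line + '\n' would replace the string passed to output.write(). Changing the tags list is enough to add or drop per-sample fields.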