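I am not sure the question title is appropriate, but I have tried to explain the problem below. Please suggest a better title if you can think of one.
Say I have two kinds of list data: a fixed list of headers, and an attri_values list that takes a different value for each input record (three example values are shown):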
list_headers = ['gene_id', 'gene_name', 'trans_id']
attri_values =
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"']
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"', 'transcript_version "1"', 'exon_number "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200001.1.exon1"', 'exon_version "1"']
['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"', 'transcript_version "1"', 'exon_number "3"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200002.1.exon3"', 'exon_version "1"']
I am trying to build a table by matching each header in list_headers against the attributes in attri_values.
output = open('gtf_table', 'w')
output.write('\t'.join(list_headers) + '\n') # this will first write the header
# then
for values in attri_values:
    for list in list_headers:
        if values.startswith(list):
            attr_id = ''.join([x for x in attri_values if list in x])
            attr_id = attr_id.replace('"', '').split(' ')[1]
            output.write('\t' + '\t'.join([attr_id]))
        elif not values.startswith(list):
            attr_id = 'NA'
            output.write('\t' + '\t'.join([attr_id]))
output.write('\n')
Problem: when a header from list_headers matches a value in attri_values, everything works, but when there is no match, 'NA' is written many times over. I tried moving the 'NA' branch around in different ways, without success.
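A minimal sketch of why the extra 'NA' values appear, assuming the loops are nested as in the code above (attributes in the outer loop, headers in the inner loop). Every (attribute, header) pair then produces one cell, so a record with 4 attributes and 3 headers yields 12 cells instead of 3. The names `row` and `cells` are introduced here for illustration only.

```python
list_headers = ['gene_id', 'gene_name', 'trans_id']

# First example record from the question.
row = ['gene_id "scaffold_200001.1"', 'gene_version "1"',
       'gene_source "jgi"', 'gene_biotype "protein_coding"']

# Same nesting as the original code, but collecting cells in a list
# instead of writing them to a file, so they can be counted.
cells = []
for value in row:
    for header in list_headers:
        if value.startswith(header):
            cells.append(value.replace('"', '').split(' ')[1])
        else:
            cells.append('NA')

print(len(cells))         # 12 cells, although the table has only 3 columns
print(cells.count('NA'))  # 11 of them are 'NA'
```

The count grows with the number of attributes in the record, which would explain why longer records produce even more repeated 'NA' entries.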
Final expected outcome:
gene_id            gene_name  trans_id
scaffold_200001.1  NA         NA
scaffold_200001.1  NA         scaffold_200001.1
scaffold_200002.1  NA         scaffold_200002.1
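One possible way to get this table, sketched under some assumptions: each attribute string has the form `key "value"`, and the three example lists are collected in a list called `records` (a hypothetical name, not in the original code). The idea is to turn each record into a dictionary first, then look up each header exactly once, so each row gets one cell per header. `build_row` is also a made-up helper name.

```python
list_headers = ['gene_id', 'gene_name', 'trans_id']

# Hypothetical container holding the three example records from the question.
records = [
    ['gene_id "scaffold_200001.1"', 'gene_version "1"',
     'gene_source "jgi"', 'gene_biotype "protein_coding"'],
    ['gene_id "scaffold_200001.1"', 'gene_version "1"',
     'trans_id "scaffold_200001.1"', 'transcript_version "1"',
     'exon_number "1"', 'gene_source "jgi"',
     'gene_biotype "protein_coding"', 'transcript_source "jgi"',
     'transcript_biotype "protein_coding"',
     'exon_id "scaffold_200001.1.exon1"', 'exon_version "1"'],
    ['gene_id "scaffold_200002.1"', 'gene_version "1"',
     'trans_id "scaffold_200002.1"', 'transcript_version "1"',
     'exon_number "3"', 'gene_source "jgi"',
     'gene_biotype "protein_coding"', 'transcript_source "jgi"',
     'transcript_biotype "protein_coding"',
     'exon_id "scaffold_200002.1.exon3"', 'exon_version "1"'],
]


def build_row(attributes, headers):
    """Return one cell per header, using 'NA' when the key is absent."""
    lookup = {}
    for attribute in attributes:
        # Split 'key "value"' at the first space, then drop the quotes.
        key, _, raw = attribute.partition(' ')
        lookup[key] = raw.strip('"')
    return [lookup.get(header, 'NA') for header in headers]


rows = [build_row(attributes, list_headers) for attributes in records]

# Assemble the tab-separated table: header line first, then one line per record.
lines = ['\t'.join(list_headers)] + ['\t'.join(row) for row in rows]
print('\n'.join(lines))
```

The same `lines` could be written to the 'gtf_table' file instead of printed. Matching on the exact key, rather than with `startswith` or `in`, also avoids a header accidentally matching a longer attribute name.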