Monday, April 24, 2017

How to find the match between two lists and write the output based on matches?

I am not sure the title describes the question well, so I have tried to explain the problem below. Please suggest a better title if you can think of one.

Say I have two types of list data:

list_headers = ['gene_id', 'gene_name', 'trans_id']

attri_values = 

['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"']
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"', 'transcript_version "1"', 'exon_number "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200001.1.exon1"', 'exon_version "1"']
['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"', 'transcript_version "1"', 'exon_number "3"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200002.1.exon3"', 'exon_version "1"']

I am trying to build a table by matching each name in list_headers against the attributes in attri_values.

output = open('gtf_table', 'w')
output.write('\t'.join(list_headers) + '\n') # this will first write the header

# then
for values in attri_values:
    for list in list_headers:
        if values.startswith(list):
            attr_id = ''.join([x for x in attri_values if list in x])
            attr_id = attr_id.replace('"', '').split(' ')[1]
            output.write('\t' + '\t'.join([attr_id]))

        elif not values.startswith(list):
            attr_id = 'NA'
            output.write('\t' + '\t'.join([attr_id]))

        output.write('\n')

Problem: when a string from list_headers matches a value in attri_values, everything works, but when there is no match I get many repeated 'NA' entries. I tried moving the NA condition around in different ways, without success.

Final expected outcome:

gene_id    gene_name    trans_id
scaffold_200001.1    NA    NA
scaffold_200001.1    NA    scaffold_200001.1
scaffold_200002.1    NA    scaffold_200002.1
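
One possible way to get this table, as a rough sketch rather than a definitive answer: parse each record into a dict once, then look up every header with a default of 'NA', so each header yields exactly one cell per row. This assumes each printed list above is one record and that the records are collected in a list of lists. The names records, build_rows and table_text are hypothetical, and the sample data is abbreviated from the lists in the question.

```python
list_headers = ['gene_id', 'gene_name', 'trans_id']

# Assumption: one inner list per record (abbreviated from the question's data).
records = [
    ['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"'],
    ['gene_id "scaffold_200001.1"', 'gene_version "1"',
     'trans_id "scaffold_200001.1"', 'exon_number "1"'],
    ['gene_id "scaffold_200002.1"', 'gene_version "1"',
     'trans_id "scaffold_200002.1"', 'exon_number "3"'],
]


def build_rows(headers, records, missing='NA'):
    """Return one row per record, with one cell per header."""
    rows = []
    for record in records:
        # Split each 'key "value"' string at the first space only.
        attrs = {}
        for item in record:
            key, _, value = item.partition(' ')
            attrs[key] = value.strip('"')
        # dict.get supplies the placeholder once per missing header.
        rows.append([attrs.get(header, missing) for header in headers])
    return rows


rows = build_rows(list_headers, records)

# Join the header and the rows into tab-separated text.
lines = ['\t'.join(list_headers)] + ['\t'.join(row) for row in rows]
table_text = '\n'.join(lines) + '\n'
print(table_text, end='')
```

To write the file instead of printing, table_text could be passed to output.write(...) inside a with open('gtf_table', 'w') as output: block. Looking up by exact key also avoids substring matches, such as 'gene_id' matching inside a longer attribute name.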
