mercredi 8 novembre 2017

quantifying reads to a reference with Python

Input file (test.sam):

SN398:470:C8RD3ACXX:7:1111:19077:53994  16  chrI    65374   255 51M *   0   0   TGAGAAATTCTTGAACATTCGTCTGTATTGATAAATAAAACTAGTATACAG IJJJJJJJJJJJJJIJJJIJJJJJJHJJJJJJJJJJJJHHHHHFFFFDB@B AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:51 YT:Z:UU NH:i:1

genes.bed file is the reference:

chrI    130798  131983  YAL012W 0   +   130798  131983  0   1   1185,   0,
chrI    334 649 YAL069W 0   +   334 649 0   1   315,    0,
chrI    537 792 YAL068W-A   0   +   537 792 0   1   255,    0,
chrI    1806    2169    YAL068C 0   -   1806    2169    0   1   363,    0,
chrI    2479    2707    YAL067W-A   0   +   2479    2707    0   1   228,    0,
chrI    7234    9016    YAL067C 0   -   7234    9016    0   1   1782,   0,
chrI    10090   10399   YAL066W 0   +   10090   10399   0   1   309,    0,
chrI    11564   11951   YAL065C 0   -   11564   11951   0   1   387,    0,
chrI    12045   12426   YAL064W-B   0   +   12045   12426   0   1   381,    0,

script is the following - it looks if "chr" matches between two files, and if fourth column of test.sam (called genomic_location) is within the second and third column of genes.bed file, then it prints the fourth column of genes.bed and counts it as "1".

#!/usr/bin/env python
import sys

samfile=open('test.sam')  #sorted sam file
bedfile=open('genes.bed') #reference genome
sys.stdout=open('merged.txt', 'w')

lookup = {}
for line in bedfile:
   fields = line.strip().split()
   chrm   = fields[0]
   st     = int(fields[1])
   end    = int(fields[2])
   name   = fields[3]
   if chrm not in lookup:
       lookup[chrm] = {}
   for i in range(st,end):
       if i not in lookup[chrm]:
           lookup[chrm][i] = [name]
       else:
           lookup[chrm][i].append(name)

gene_counts = {}
for line in samfile:
   reads = line.split()
   qname = reads[0]
   flag  = reads[1] # be 0 or 16
   rname=reads[2]
   genomic_location = int(reads[3])
   mapq  = int(reads[4])
   if rname in lookup:
       if genomic_location in lookup[rname]:
           for gene in lookup[rname][genomic_location]:
               if gene not in gene_counts:
                   gene_counts[gene]  = 0
           else:
               gene_counts[gene] += 1

print gene_counts

I need to change it in such a way that when flag (second column in input file test.sam) is 16, then subtract 51 from the fourth column in inputfile (test.sam) and then process it to see if that newly made integer is within st and end of genes.bed file.

What do you think is the best way to do this? I need to implement this within script and not make a new input files (test.sam) that first changes the fourth column if second is 16.

I would like to do this Python. Thank you for your help and please let me know if something is unclear.

Aucun commentaire:

Enregistrer un commentaire