Input file (test.sam):
SN398:470:C8RD3ACXX:7:1111:19077:53994 16 chrI 65374 255 51M * 0 0 TGAGAAATTCTTGAACATTCGTCTGTATTGATAAATAAAACTAGTATACAG IJJJJJJJJJJJJJIJJJIJJJJJJHJJJJJJJJJJJJHHHHHFFFFDB@B AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU NH:i:1
genes.bed file is the reference:
chrI 130798 131983 YAL012W 0 + 130798 131983 0 1 1185, 0,
chrI 334 649 YAL069W 0 + 334 649 0 1 315, 0,
chrI 537 792 YAL068W-A 0 + 537 792 0 1 255, 0,
chrI 1806 2169 YAL068C 0 - 1806 2169 0 1 363, 0,
chrI 2479 2707 YAL067W-A 0 + 2479 2707 0 1 228, 0,
chrI 7234 9016 YAL067C 0 - 7234 9016 0 1 1782, 0,
chrI 10090 10399 YAL066W 0 + 10090 10399 0 1 309, 0,
chrI 11564 11951 YAL065C 0 - 11564 11951 0 1 387, 0,
chrI 12045 12426 YAL064W-B 0 + 12045 12426 0 1 381, 0,
script is the following - it looks if "chr" matches between two files, and if fourth column of test.sam (called genomic_location) is within the second and third column of genes.bed file, then it prints the fourth column of genes.bed and counts it as "1".
#!/usr/bin/env python
import sys
samfile=open('test.sam') #sorted sam file
bedfile=open('genes.bed') #reference genome
sys.stdout=open('merged.txt', 'w')
lookup = {}
for line in bedfile:
fields = line.strip().split()
chrm = fields[0]
st = int(fields[1])
end = int(fields[2])
name = fields[3]
if chrm not in lookup:
lookup[chrm] = {}
for i in range(st,end):
if i not in lookup[chrm]:
lookup[chrm][i] = [name]
else:
lookup[chrm][i].append(name)
gene_counts = {}
for line in samfile:
reads = line.split()
qname = reads[0]
flag = reads[1] # be 0 or 16
rname=reads[2]
genomic_location = int(reads[3])
mapq = int(reads[4])
if rname in lookup:
if genomic_location in lookup[rname]:
for gene in lookup[rname][genomic_location]:
if gene not in gene_counts:
gene_counts[gene] = 0
else:
gene_counts[gene] += 1
print gene_counts
I need to change it in such a way that when flag (second column in input file test.sam) is 16, then subtract 51 from the fourth column in inputfile (test.sam) and then process it to see if that newly made integer is within st and end of genes.bed file.
What do you think is the best way to do this? I need to implement this within script and not make a new input files (test.sam) that first changes the fourth column if second is 16.
I would like to do this Python. Thank you for your help and please let me know if something is unclear.
Aucun commentaire:
Enregistrer un commentaire