Friday, 23 December 2016

How to do a context-dependent merge of data from two different columns?

In a file with the following data structure:

contig  pos    GT    PGT       PID     PG      PB     updated_Block
2      5426    0/1   0|1       5398   1|0   1311       1  
2      5427    0/1   0|1       5398   0/1   .          1
2      5434    0/1   0|1       5398   1|0   1311       1
2      5454    0/1   0|1       5398   0/1   .          1
2      5457    0/0   .          .     0/0   .          1
2      5467    0/1   0|1       5467   0|1   1311       1
2      5480    0/1   0|1       5467   0|1   1311       1
2      5483    0/0   0|1       5482   0/0   1667       2
2      5518    1/1   1|1       5467   1/1   .          1
2      5519    0/0   .         .      0/0   .          .
2      5547    1/1   1|1       5467   1/1   .          1
2      5550    ./.   .         .      ./.   .          .
2      5559    1/1   1|1       5467   1/1   .          1
2      5561    0/0   .         .      0/0   .          .
2      5576    0/1   0|1       5576   1|0   1311       1
2      5599    0/1   0|1       5576   1|0   1311       1
2      5602    0/0   .         .      0/0   .          .
2      5657    0/1   .         .      1|0   1311       1
2      5723    0/1   .         .      1|0   1311       1
2      6414    0/1   .         .      0|1   1667       2
2      6446    0/1  0|1      6446     0|1   1667       2
2      6448    0/1  0|1      6446     0|1   1667       2
2      6465    0/1  0|1      6446     0|1   1667       2
2      6636    0/1  .          .      1|0   1667       2
2      6740    0/1  .        6740     0|1   1667       2
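As a starting point, rows like these could be loaded into a list of dictionaries keyed by the header names. This is only a minimal sketch: it assumes the columns are whitespace-separated and that missing values are written as `.`, and the names `read_rows` and `SAMPLE` are hypothetical, introduced just for illustration.

```python
import io

# A few lines copied from the table above; a real run would open the file instead.
SAMPLE = """\
contig  pos    GT    PGT       PID     PG      PB     updated_Block
2      5426    0/1   0|1       5398   1|0   1311       1
2      5427    0/1   0|1       5398   0/1   .          1
2      5457    0/0   .          .     0/0   .          1
"""

def read_rows(handle):
    """Yield one dict per data line, keyed by the header names (hypothetical helper)."""
    header = next(handle).split()      # first line holds the column names
    for line in handle:
        fields = line.split()          # assumption: whitespace-separated columns
        if fields:                     # skip blank lines
            yield dict(zip(header, fields))

rows = list(read_rows(io.StringIO(SAMPLE)))
print(rows[0]["PID"], rows[0]["PB"])
```

All values stay as strings here, so a missing block ID can be tested with a plain comparison against `"."`.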

PID identifies the block and PGT is a value within that block; both are generated by one program. A second program generates the same type of information for the same data set: PB is the block and PG is a value within that block.

So, from the data above, we can bridge the data belonging to PB block 1311 with the data from PID blocks 5398, 5467 and 5576. The two programs emit their values and block information based on different probability tests. So I just need to find the overlapping blocks and merge them into a larger block set, and then record the result in the updated_Block column with a unique block value such as 1, 2, ...

Details: 1311 from PB overlaps with 5398, 5467 and 5576 from PID, so together they make one large block. This overlap may break at some line, at which point we start building another larger block.
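One possible way to express this merging is a union-find (disjoint-set) pass: whenever a PID and a PB appear on the same line they are linked, and every linked group gets one label. The sketch below is an assumption about what "overlap" means, not a confirmed solution. It treats links as global across the file rather than detecting a break line by line, leaves rows with neither ID as `.`, and uses a hypothetical function name, `assign_blocks`.

```python
def assign_blocks(rows):
    """Return one label per row: '1', '2', ... or '.' when the row has no PID and no PB."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps the trees shallow
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Pass 1: link the PID and PB block IDs that share a line.
    for row in rows:
        pid, pb = row["PID"], row["PB"]
        if pid != "." and pb != ".":
            union(("PID", pid), ("PB", pb))

    # Pass 2: number each merged group in order of first appearance.
    labels, out = {}, []
    for row in rows:
        pid, pb = row["PID"], row["PB"]
        if pid != ".":
            key = ("PID", pid)
        elif pb != ".":
            key = ("PB", pb)
        else:
            out.append(".")                 # assumption: unphased rows stay unassigned
            continue
        root = find(key)
        labels.setdefault(root, str(len(labels) + 1))
        out.append(labels[root])
    return out

# (pos, PID, PB) triples taken from a subset of the table above.
rows = [{"pos": p, "PID": pid, "PB": pb} for p, pid, pb in [
    ("5426", "5398", "1311"), ("5427", "5398", "."), ("5457", ".", "."),
    ("5467", "5467", "1311"), ("5483", "5482", "1667"), ("5518", "5467", "."),
    ("5657", ".", "1311"), ("6414", ".", "1667"), ("6740", "6740", "1667"),
]]
result = assign_blocks(rows)
print(result)  # ['1', '1', '.', '1', '2', '1', '1', '2', '2']
```

In the table, position 5457 carries block 1 even though it has neither a PID nor a PB; whether such rows should inherit the surrounding block is a separate decision that this sketch does not make.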

I am confused about how to approach this problem. I wanted to build a list of dictionaries first, but I would still have trouble reading each line until a break is met and then resuming from the line after that break.

Any suggestions?
