I have a file with the following data structure:
contig pos GT PGT PID PG PB updated_Block
2 5426 0/1 0|1 5398 1|0 1311 1311
2 5427 0/1 0|1 5398 0/1 . 1311
2 5434 0/1 0|1 5398 1|0 1311 1311
2 5454 0/1 0|1 5398 0/1 . 1311
2 5457 0/0 . . 0/0 . 1311
2 5467 0/1 0|1 5467 0|1 1311 1311
2 5480 0/1 0|1 5467 0|1 1311 1311
2 5483 0/0 0|1 5482 0/0 1667 1667
2 5518 1/1 1|1 5467 1/1 . 1311
2 5519 0/0 . . 0/0 . .
2 5547 1/1 1|1 5467 1/1 . 1311
2 5550 ./. . . ./. . .
2 5559 1/1 1|1 5467 1/1 . 1311
2 5561 0/0 . . 0/0 . .
2 5576 0/1 0|1 5576 1|0 1311 1311
2 5599 0/1 0|1 5576 1|0 1311 1311
2 5602 0/0 . . 0/0 . .
2 5657 0/1 . . 1|0 1311 1311
2 5723 0/1 . . 1|0 1311 1311
2 6414 0/1 . . 0|1 1667 1667
2 6446 0/1 0|1 6446 0|1 1667 1667
2 6448 0/1 0|1 6446 0|1 1667 1667
2 6465 0/1 0|1 6446 0|1 1667 1667
2 6636 0/1 . . 1|0 1667 1667
2 6740 0/1 . 6740 0|1 1667 1667
PID identifies a block and PGT is the value of one data point in that block, both generated by one program. A second program generates the same type of information for the same data set: PB is the block and PG is the value in that block.
So, from the data above, the rows belonging to PB block 1311 can be bridged with the rows from PID blocks 5398, 5467 and 5576. The two programs emit their values and block info based on different probability tests, so I just need to find the overlapping blocks and merge them into a larger block set. I want to record the result in updated_Block using unique block values such as 1, 2, ...
Details: 1311 from PB overlaps with 5398, 5467 and 5576 from PID, so together they make one large block. This overlap may break at some line, at which point a new larger block starts.
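One possible way to sketch this merge is a union-find (disjoint-set) over block IDs: every row that carries both a PID and a PB value links those two blocks, and each resulting group gets the next integer label. This is only a sketch under assumptions: the name `merge_blocks`, the `(column, value)` keys, and treating `.` as "missing" are illustrative choices, and rows with neither a PID nor a PB are simply left as `.`.

```python
def merge_blocks(rows):
    """Return one merged-block label per row (1, 2, ... or '.' if unassigned).

    rows: list of dicts with 'PID' and 'PB' keys; '.' is assumed to mean missing.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def block_keys(row):
        # Tag each ID with its column so a PID and a PB with the same number stay distinct.
        return [(col, row[col]) for col in ("PID", "PB") if row[col] != "."]

    # Pass 1: a row carrying both a PID and a PB links those two blocks.
    for row in rows:
        keys = block_keys(row)
        if len(keys) == 2:
            parent[find(keys[1])] = find(keys[0])

    # Pass 2: number the merged groups in order of first appearance.
    labels, result = {}, []
    for row in rows:
        keys = block_keys(row)
        if not keys:
            result.append(".")
            continue
        root = find(keys[0])
        labels.setdefault(root, len(labels) + 1)
        result.append(labels[root])
    return result


# A few rows taken from the table above (only the columns the sketch uses).
sample = [
    {"pos": "5426", "PID": "5398", "PB": "1311"},
    {"pos": "5427", "PID": "5398", "PB": "."},
    {"pos": "5467", "PID": "5467", "PB": "1311"},
    {"pos": "5483", "PID": "5482", "PB": "1667"},
    {"pos": "5519", "PID": ".", "PB": "."},
    {"pos": "5657", "PID": ".", "PB": "1311"},
    {"pos": "6446", "PID": "6446", "PB": "1667"},
]
print(merge_blocks(sample))  # [1, 1, 1, 2, '.', 1, 2]
```

Because the links are collected over the whole file before any label is assigned, a row such as pos 5483 (PB 1667) sitting in the middle of the 1311 rows still ends up in the second group, without having to detect the break while reading.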
I am confused about how to approach this problem. I wanted to build a list of dictionaries first, but I would still have trouble reading each line until a break is met and then resuming from the line after the break.
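For the list-of-dictionaries idea, a minimal sketch could look like the following. It assumes the file is whitespace-delimited with the header line shown above; `read_rows` is a hypothetical helper name, and `io.StringIO` only stands in for an open file handle.

```python
import io


def read_rows(handle):
    """Parse a whitespace-delimited table with a header line into a list of dicts."""
    header = handle.readline().split()
    rows = []
    for line in handle:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(header, fields)))
    return rows


# Three rows copied from the table above.
text = """contig pos GT PGT PID PG PB updated_Block
2 5426 0/1 0|1 5398 1|0 1311 1311
2 5427 0/1 0|1 5398 0/1 . 1311
2 5457 0/0 . . 0/0 . 1311
"""
rows = read_rows(io.StringIO(text))
print(rows[1]["PID"], rows[1]["PB"])  # 5398 .
```

With every row held in memory like this, the break does not have to be detected while reading; it can be worked out in a separate pass over `rows`.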
Any suggestions?