I want to update the values in a column by reading two files simultaneously.
main_file has the following data:
contig pos GT PGT_phase PID PG_phase PI
2 1657 ./. . . ./. .
2 1738 0/1 . . 0|1 935
2 1764 0/1 . . 1|0 935
2 1782 0/1 . . 0|1 935
2 1850 0/0 . . 0/0 .
2 1860 0/1 . . 1|0 935
2 1863 0/1 . . 0|1 935
2 2969 0/1 . . 1|0 3352
2 2971 0/0 . . 0/0 .
2 5207 0/1 0|1 5185 1|0 1311
2 5238 0/1 . . 0|1 1311
2 5241 0/0 . . 0/0 .
2 5258 0/1 . . 1|0 1311
2 5260 0/0 . . 0/0 .
2 5319 0/0 . . 0/0 .
2 5398 0/1 0|1 5398 1|0 1311
2 5403 0/1 0|1 5398 1|0 1311
2 5426 0/1 0|1 5398 1|0 1311
2 5427 0/1 0|1 5398 0/1 .
2 5434 0/1 0|1 5398 1|0 1311
2 5454 0/1 0|1 5398 0/1 .
2 5457 0/0 . . 0/0 .
2 5467 0/1 0|1 5467 0|1 1311
2 5480 0/1 0|1 5467 0|1 1311
2 5483 0/0 0|1 5482 0/0 .
2 6414 0/1 . . 0|1 1667
2 6446 0/1 0|1 6446 0|1 1667
2 6448 0/1 0|1 6446 0|1 1667
2 6465 0/1 0|1 6446 0|1 1667
2 6636 0/1 . . 1|0 1667
2 6740 0/1 . 6740 0|1 1667
2 6748 0/1 . 6740 0|1 .
The other file, match_file, has the following kind of data:
PI PID
1309 3617741,3617753,3617788,3618156,3618187,3618289
131 11793586
1310
1311 5185,5398,5467,5576
1312 340692,340728
1313 18503498
1667 6740,12237,12298
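Since each PI row lists several PIDs, one way to make the lookup easy is to invert match_file into a PID-to-PI dictionary. This is only a minimal sketch: `build_pid_to_pi` is a hypothetical helper name, the fields are assumed to be tab-separated (as in my code further down), and a row without a PID list (such as 1310) is assumed to contribute nothing.

```python
def build_pid_to_pi(match_lines):
    """Map each PID string to the PI of the row that lists it."""
    pid_to_pi = {}
    for line in match_lines:
        fields = line.rstrip('\n').split('\t')
        pi = fields[0]
        # A row such as "1310" has no PID list; treat it as empty.
        pid_field = fields[1] if len(fields) > 1 else ''
        for pid in pid_field.split(','):
            if pid:
                # Assumption: if a PID appears under several PIs, keep the first.
                pid_to_pi.setdefault(pid, pi)
    return pid_to_pi


# Three data rows taken from match_file above (header already skipped).
match_lines = [
    "1310",
    "1311\t5185,5398,5467,5576",
    "1667\t6740,12237,12298",
]
pid_to_pi = build_pid_to_pi(match_lines)
print(pid_to_pi["5398"])  # prints 1311
```

With this dictionary, finding the PI for a given PID is a single lookup instead of a scan over every row of match_file.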
What I am trying to do:
- I want to create a new column (new_PI) with updated PI values.
How the updating works:
- If there is a PI value in the line of main_file, it's simple: new_PI = main_PI, and then continue.
- If both main_PI and main_PID are '.' in main_file, new_PI = '.', and continue.
- But if the PI value is '.' and the PID value is some integer, we look in match_file for the PI value whose list of PIDs contains that value. If a matching PID is found, new_PI = PI_match_file, and then continue.
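The three rules above can be sketched as one small function. `compute_new_pi` is a hypothetical name, and the `pid_to_pi` dictionary is assumed to have been built from match_file beforehand. Returning '.' when no matching PID exists is my assumption, since the rules do not say what should happen in that case.

```python
def compute_new_pi(pi_main, pid_main, pid_to_pi):
    """Return the new_PI value for one row of main_file."""
    if pi_main != '.':
        # Rule 1: an existing PI is kept as it is.
        return pi_main
    if pid_main == '.':
        # Rule 2: neither PI nor PID is known.
        return '.'
    # Rule 3: PI is missing but PID is known, so look the PID up.
    # Assumption: fall back to '.' when the PID is not in match_file.
    return pid_to_pi.get(pid_main, '.')


# Small lookup using values from match_file (PID -> PI).
pid_to_pi = {"5185": "1311", "5398": "1311", "5467": "1311", "6740": "1667"}

print(compute_new_pi("935", ".", pid_to_pi))   # prints 935
print(compute_new_pi(".", ".", pid_to_pi))     # prints .
print(compute_new_pi(".", "5398", pid_to_pi))  # prints 1311
```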
I have written the code below:
    main_file = open("2ms01e_chr2_table.txt", 'r+')
    match_file = open('updated_df_table.txt', 'r+')
    main_header = main_file.readline()
    match_header = match_file.readline()
    main_data = main_file.read().rstrip('\n').split('\n')
    match_data = match_file.read().rstrip('\n').split('\n')
    file_update = open('PI_updates.txt', 'w')
    file_update.write('contig pos GT PGT_phase PID PG_phase PI new_PI\n')
    file_update.close()
    for line in main_data:
        main_column = line.split('\t')
        PID_main = main_column[4]
        PI_main = main_column[6]
        if PID_main == '.' and PI_main == '.':
            new_PI = '.'
            continue
        if PI_main != '.':
            new_PI = PI_main
            continue
        if PI_main == '.' and PID_main != '.':
            for line in match_data:
                match_column = line.split('\t')
                PI_match = match_column[0]
                PID_match = match_column[1].split(',')
                if PID_main in PID_match:
                    new_PI = PI_match
                    continue
        file_update = open('PI_updates.txt', 'a')
        file_update.write(line + '\t' + str(new_PI)+ '\n')
        file_update.close()
I am not getting any error, but it looks like my code does not read the two files correctly.
My output should be something like this:
contig pos GT PGT PID PG PI new_PI
2 5426 0/1 0|1 5398 1|0 1311 1311
2 5427 0/1 0|1 5398 0/1 . 1311
2 5434 0/1 0|1 5398 1|0 1311 1311
2 5454 0/1 0|1 5398 0/1 . 1311
2 5457 0/0 . . 0/0 . .
2 5467 0/1 0|1 5467 0|1 1311 1311
2 5480 0/1 0|1 5467 0|1 1311 1311
2 5483 0/0 0|1 5482 0/0 1667 1667
2 5518 1/1 1|1 5467 1/1 . 1311
2 5519 0/0 . . 0/0 . .
2 5547 1/1 1|1 5467 1/1 . 1311
2 5550 ./. . . ./. . .
2 5559 1/1 1|1 5467 1/1 . 1311
2 5561 0/0 . . 0/0 . .
2 5576 0/1 0|1 5576 1|0 1311 1311
2 5599 0/1 0|1 5576 1|0 1311 1311
2 5602 0/0 . . 0/0 . .
2 5657 0/1 . . 1|0 1311 1311
2 5723 0/1 . . 1|0 1311 1311
2 6414 0/1 . . 0|1 1667 1667
2 6446 0/1 0|1 6446 0|1 1667 1667
2 6448 0/1 0|1 6446 0|1 1667 1667
2 6465 0/1 0|1 6446 0|1 1667 1667
2 6636 0/1 . . 1|0 1667 1667
2 6740 0/1 . 6740 0|1 1667 1667
2 6748 0/1 . 6740 0|1 . 1667
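A possible cause, assuming the write statements sit at the end of the outer `for` body: every `continue` in the first two branches jumps to the next row before anything is written, and the inner loop reuses the name `line`, so the row that does get written is a match_file row rather than the main_file row. The following is a hedged sketch of one way to restructure it, not a definitive fix. `add_new_pi` is a hypothetical name, both files are assumed to be tab-separated with one header line, and a PID with no match is assumed to give '.'.

```python
import io


def add_new_pi(main_in, match_in, out):
    """Copy main_in to out, appending a new_PI column to every row."""
    # Build the PID -> PI lookup once instead of rescanning match_file per row.
    next(match_in, None)  # skip the match_file header
    pid_to_pi = {}
    for row in match_in:
        fields = row.rstrip('\n').split('\t')
        if len(fields) > 1:  # rows like "1310" have no PID list
            for pid in fields[1].split(','):
                if pid:
                    pid_to_pi.setdefault(pid, fields[0])

    header = next(main_in).rstrip('\n')
    out.write(header + '\tnew_PI\n')
    for line in main_in:
        line = line.rstrip('\n')
        if not line:
            continue  # skip blank lines only
        columns = line.split('\t')
        pid_main, pi_main = columns[4], columns[6]
        if pi_main != '.':
            new_pi = pi_main
        elif pid_main == '.':
            new_pi = '.'
        else:
            new_pi = pid_to_pi.get(pid_main, '.')  # assumption: '.' if no match
        # No `continue` in the branches above, so every row reaches this write.
        out.write(line + '\t' + new_pi + '\n')


# In-memory demo using rows from the tables in this post.
main_in = io.StringIO(
    "contig\tpos\tGT\tPGT_phase\tPID\tPG_phase\tPI\n"
    "2\t5427\t0/1\t0|1\t5398\t0/1\t.\n"
    "2\t5457\t0/0\t.\t.\t0/0\t.\n"
    "2\t6636\t0/1\t.\t.\t1|0\t1667\n"
)
match_in = io.StringIO(
    "PI\tPID\n1310\n1311\t5185,5398,5467,5576\n1667\t6740,12237,12298\n"
)
out = io.StringIO()
add_new_pi(main_in, match_in, out)
print(out.getvalue())
```

With the real files, the same function could be called inside `with open("2ms01e_chr2_table.txt") as main_in, open("updated_df_table.txt") as match_in, open("PI_updates.txt", "w") as out:` so that the output file is opened once rather than reopened for every row.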
Thanks in advance!