Friday, 21 December 2018

Awk: From CSV to PDB (Protein Data Bank)

I have a CSV file with this format:

ATOM,3662,H,VAL,A,257,6.111,31.650,13.338,1.00,0.00,H
ATOM,3663,HA,VAL,A,257,3.180,31.995,13.768,1.00,0.00,H
ATOM,3664,HB,VAL,A,257,4.726,32.321,11.170,1.00,0.00,H
ATOM,3665,HG11,VAL,A,257,2.387,31.587,10.892,1.00,0.00,H

And I would like to format it according to PDB standards (fixed position):

ATOM   3662  H   VAL A 257       6.111  31.650  13.338  1.00  0.00           H

ATOM   3663  HA  VAL A 257       3.180  31.995  13.768  1.00  0.00           H

ATOM   3664  HB  VAL A 257       4.726  32.321  11.170  1.00  0.00           H

ATOM   3665 HG11 VAL A 257       2.387  31.587  10.892  1.00  0.00           H

Everything can be treated as right-justified except the first and the third columns. The first is not a problem. The third, however, is left-justified when its length is 1 to 3 characters, but starts one position further to the left when its length is 4.
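The column rule for the third field can be sketched in awk roughly as follows. This is only a minimal sketch, not a validated PDB writer: the function name csv_to_pdb is made up for illustration, the column widths are assumed from the standard ATOM record layout (atom name in columns 13-16, coordinates in columns 31-54), and numeric values are copied as text rather than reformatted.

```shell
# csv_to_pdb is a hypothetical helper name; it reads CSV from stdin or from files.
csv_to_pdb() {
  awk -F, '{
    # A 4-character atom name starts in column 13; shorter names start in column 14.
    name = (length($3) == 4) ? $3 : " " $3
    # Widths assumed from the ATOM record layout; literal spaces fill the unused columns.
    printf "%-6s%5s %-4s %3s %1s%4s    %8s%8s%8s%6s%6s          %2s\n",
           $1, $2, name, $4, $5, $6, $7, $8, $9, $10, $11, $12
  }' "$@"
}

# Demo on two of the rows from the question.
printf '%s\n' \
  'ATOM,3662,H,VAL,A,257,6.111,31.650,13.338,1.00,0.00,H' \
  'ATOM,3665,HG11,VAL,A,257,2.387,31.587,10.892,1.00,0.00,H' | csv_to_pdb
```

With the file from the question, this would be called as csv_to_pdb 1iy8_min.csv. The key point is that the extra space is built into the value (name) instead of being passed as a separate printf argument, so the name always occupies exactly four columns.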

I have this AWK one-liner that almost does the trick:

awk -F, 'BEGIN {OFS=FS} {if(length($3) == 4 ) {pad=" "} else {pad=" "}} {printf "%-6s%5s%s%-4s%4s%2s%4s%11s%8s%8s%6s%6s%12s\n", $1, $2, $pad, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12}' < 1iy8_min.csv

Except for two things:

1) The exception of the third column. I was thinking about adding a condition which changes the padding before the third column according to the field length, but I cannot get it to work (the idea is illustrated in the above one-liner).

2) The other problem is that if there are no spaces between the fields, the padding does not work at all.

ATOM   3799  HH   TYR A 267     -5.713  16.149  26.838  1.00  0.00           H

HETATM 3801  O7N  NADA12688.285     19.839  10.489    1.00 20.51     O   

In the above example, the second line should be:

HETATM 3801  O7N  NAD A1268      8.285  19.839  10.489  1.00 20.51           O

But because there is no space between fields 5 and 6, everything gets shuffled. I think that A1268 is read as a single field, probably because awk's default field separator is whitespace. Is it possible to make the parsing position-dependent?
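One way to read such a line by position rather than by separator is awk's substr() function, which is sketched below. The function name pdb_fields is made up for illustration, the column numbers are assumed from the standard ATOM/HETATM layout (chain ID in column 22, residue number in columns 23-26, x coordinate in columns 31-38), and the input line is a hand-aligned version of the HETATM record above. GNU awk also offers a FIELDWIDTHS variable for fixed-width input, but that is a gawk extension; substr() works in any POSIX awk.

```shell
# pdb_fields is a hypothetical helper name; it prints chain ID, residue number
# and x coordinate, taken from fixed columns instead of whitespace-split fields.
pdb_fields() {
  awk '{
    chain  = substr($0, 22, 1)   # column 22
    resseq = substr($0, 23, 4)   # columns 23-26
    x      = substr($0, 31, 8)   # columns 31-38
    # Adding 0 converts the padded text to a number, which drops the padding.
    printf "%s,%s,%s\n", chain, resseq + 0, x + 0
  }' "$@"
}

# Demo: chain "A" and residue 1268 touch, yet they are still separated correctly.
printf '%s\n' \
  'HETATM 3801  O7N NAD A1268       8.285  19.839  10.489  1.00 20.51           O' | pdb_fields
```

Because the fields are cut out by column, it does not matter that the chain ID and a four-digit residue number sit next to each other without a space.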
