CHARMMing's PDB Reader
CHARMMing's PDB reader exists to take a raw PDB file from the Protein Data Bank and convert it into a CHARMM-formatted PDB file. CHARMM has its own syntactic expectations for PDB files, so many PDB files will have problems when loaded into CHARMM (some of these problems are quite minor while others are potentially serious). This section of the tutorial is an explanation of how CHARMMing's PDB reader modifies a file from the PDB. It is not meant to be an exhaustive exploration of the PDB file format.
The PDB reader itself is a standalone Python script, pdbinfo/parser_v3.py in the CHARMMing tarball. It requires two libraries that are also included in the tarball, lib/Atom.py and lib/Etc.py; these are general-purpose structure management libraries that we developed along with the parser.
The basic algorithm used by CHARMMing is:
- Remove all lines that do not start with ATOM or HETATM
- Handle multi-model atoms. Some PDBs separate different models cleanly into different sections, but others have alternate configurations listed for individual atoms or residues within a single model. These are identified via the following procedure:
  - The line is checked to see if it belongs to an alternate model (this is done by looking at the PDB B-factor, which is usually less than 1 when multiple models are present, and by checking whether an 'A' has been prepended to the residue type). If the line is not part of a multi-model description, the algorithm skips to the last step of this procedure.
  - The line is stored in a two-state (empty or full) buffer. If the buffer is empty, the current line is stored in it and processing moves on to the next line. If no corresponding multi-model line is found later, the buffer is flushed and the line is kept.
  - If the buffer is full, compare the current line to the one in the buffer to see whether they are multiple configurations of the same atom. If they are not, flush the buffer and store the current line in it (the current line starts a multi-model section for a new atom).
  - If the current line matches the line in the buffer, compare their B-factor weights. Whichever line has the lower B-factor (and is therefore less likely to be the correct structural data) is tagged for removal, so it will not be written out when the multi-model buffer is flushed. Processing then moves on to the next line.
  - If the buffer is full and the current line does not refer to the atom in the multi-model buffer, flush the buffer (i.e. collect the lines in it, except those tagged for removal).
- Sort the lines that have been read in by segment ID, residue number, and atom number, in that order (this groups all segments together, orders residues within segments, and orders atoms within residues)
- Group the segments together (i.e. make a separate data structure for each independent segment) and decide whether each segment is a protein, nucleic acid, composed of "good hetero-atoms", or composed of "bad hetero-atoms" (accomplished by looking at the residues present within the segment)
- Rename the terminal oxygen atom type in each protein segment from OXT to OT2 (most PDB files use OXT for the terminal oxygen, but CHARMM expects OT2).
- Decide whether each nucleic acid segment is DNA or RNA (by testing for the presence of uracil or thymine residues)
- Change the residue names and atom types to fit in with CHARMM's naming conventions
  - HOH residues are renamed TIP3
  - Adenine, thymine, cytosine, uracil, and guanine residues are given their CHARMM names (ADE, THY, CYT, URA, and GUA)
  - The CD1 atom in residue ILE (isoleucine) is given the atom type CD
  - ZN, NA, CS, CL, CA, and K residues are renamed ZN2, SOD, CES, CLA, CAL, and POT respectively
- Re-index atom numbers and residue IDs so that both are numbered continuously starting from 1. The mapping from original to new atom and residue numbers is preserved. Soon CHARMMing will allow the user to download PDBs with either the original or new atom numbering.
- Write all segments out in separate PDB files.
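
The multi-model buffering steps in the algorithm above can be sketched as follows. This is a minimal illustration, not the code in parser_v3.py: the record layout (dicts with the keys segid, resid, name, weight, and is_alt) and both function names are hypothetical, weight stands for the value the text calls the B-factor, and the atom name is assumed to serve as the atom identifier when comparing two lines.

```python
def same_atom(a, b):
    """Two records describe the same atom if segment, residue and atom name match."""
    return (a["segid"], a["resid"], a["name"]) == (b["segid"], b["resid"], b["name"])


def resolve_alternates(records):
    """Keep one record per atom, preferring the higher weight among alternates.

    Each record is a dict with hypothetical keys: segid, resid, name,
    weight (the value the text calls the B-factor) and is_alt (True when
    the line was flagged as an alternate configuration).
    """
    kept = []
    buffered = None  # two-state buffer: None (empty) or one record (full)
    for rec in records:
        if not rec["is_alt"]:
            # Ordinary line: flush any pending alternate, then keep this line.
            if buffered is not None:
                kept.append(buffered)
                buffered = None
            kept.append(rec)
        elif buffered is None:
            # Empty buffer: store the line and move on.
            buffered = rec
        elif same_atom(buffered, rec):
            # Same atom: the lower-weight line is dropped (ties keep the first).
            if rec["weight"] > buffered["weight"]:
                buffered = rec
        else:
            # A new atom's alternates begin: flush the old one, buffer the new.
            kept.append(buffered)
            buffered = rec
    if buffered is not None:
        kept.append(buffered)  # final flush at end of input
    return kept
```

Because only the surviving candidate is ever held in the buffer, "tagging for removal" reduces here to overwriting or ignoring the losing line; a parser that needs to report discarded lines would keep them in a separate list instead.
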
- The PDB parser removes all lines other than ATOM and HETATM ones. A separate routine is called beforehand to parse out title, author, journal, and disulfide bond information.
- PDBs with multiple MODEL segments are passed to a separate routine to be split into their constituent models.
- We can tell if two atoms are the same by looking at their segment IDs, residue IDs, and atom numbers.
- Nothing is written out until the very end of processing (the last step of the algorithm above); PDB lines are kept in internal data buffers until then. A Python pickle file containing a snapshot of the internal data structures is also written out (especially those structures needed to restore the original atom and residue numbering).
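
The sorting, re-indexing, and snapshot steps can be illustrated with a short sketch. It rests on several assumptions that are not taken from the CHARMMing source: atoms are plain dicts with the hypothetical keys resid, serial, and name; the function works on one segment at a time, so numbering restarts at 1 in each segment; and the lookup tables are stored in the new-to-original direction because that is the direction needed to restore the original numbering. The function names reindex_segment and snapshot are invented for this example.

```python
import pickle


def reindex_segment(atoms):
    """Sort one segment's atoms and renumber atoms and residues from 1.

    Each atom is a dict with hypothetical keys: resid, serial and name.
    Returns the renumbered atoms plus new-to-original lookup tables.
    """
    ordered = sorted(atoms, key=lambda a: (a["resid"], a["serial"]))
    atom_map = {}      # new atom number -> original atom number
    residue_map = {}   # new residue number -> original residue number
    new_resid_of = {}  # original residue number -> new residue number
    renumbered = []
    for new_serial, atom in enumerate(ordered, start=1):
        old_resid = atom["resid"]
        if old_resid not in new_resid_of:
            # First atom of a residue not seen before: assign the next number.
            new_resid_of[old_resid] = len(new_resid_of) + 1
            residue_map[new_resid_of[old_resid]] = old_resid
        atom_map[new_serial] = atom["serial"]
        renumbered.append(
            {**atom, "serial": new_serial, "resid": new_resid_of[old_resid]}
        )
    return renumbered, atom_map, residue_map


def snapshot(atom_map, residue_map):
    """Serialize the lookup tables so the original numbering can be restored."""
    return pickle.dumps({"atom_map": atom_map, "residue_map": residue_map})
```

The sketch serializes to bytes with pickle.dumps so that it has no side effects; writing the snapshot to disk would use pickle.dump with a file opened in binary mode.
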