The Protein Data Bank is the world-wide repository for the results of macromolecular structure determination. The macromolecular CIF (mmCIF) format was developed to capture all the detailed searchable relationships among data items and to overcome the field-width limitations of the old 80-column PDB format. While mmCIF has been of great value to the internal operation of the PDB and has simplified the creation of powerful search engines, users of the PDB have been reluctant to make the transition from the easily-read fixed-field PDB format to mmCIF, and most existing macromolecular software remains unable to read mmCIF directly.
We propose a new fixed-field wide PDB format (WPDB) that will carry all the information provided by mmCIF using record formats very similar to those in the existing PDB format. By increasing the widths of many fields (especially atom number, atom name, chain identifier and coordinates), we should overcome the major deficiencies of the old PDB format, including the 99,999 atom limitation and the idiosyncratic atom naming forced by the 4-column field width, especially for hydrogens. In addition, we propose new record types to carry the new information from the mmCIF files (e.g. the more detailed information on secondary structure and biological function). We are creating simple external translation filter programs to go back and forth to mmCIF format and, where the data permits, to the existing PDB format. Software developers should find it completely straightforward to revise their input and output routines to process wide PDB (WPDB) and anyone will be able to use the filters.
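The 99,999 atom limitation follows directly from the field width: the old ATOM record reserves five columns (7-11) for the atom serial number. The following is a minimal Python sketch of that arithmetic, not part of any format specification. The names `format_serial` and `max_serial` and the 9-column wide field are assumptions made for illustration.

```python
def max_serial(width: int) -> int:
    """Largest serial number that fits in `width` decimal columns."""
    return 10 ** width - 1


def format_serial(serial: int, width: int) -> str:
    """Right-justify an atom serial number in a fixed-width field.

    Raises ValueError when the number does not fit, rather than letting
    it spill into the neighbouring columns of the record.
    """
    if serial < 0:
        raise ValueError("serial must be non-negative")
    text = str(serial)
    if len(text) > width:
        raise ValueError(f"serial {serial} does not fit in {width} columns")
    return text.rjust(width)


OLD_WIDTH = 5   # columns 7-11 of the old PDB ATOM record
WIDE_WIDTH = 9  # hypothetical wider field, chosen for illustration only

print("old limit: ", max_serial(OLD_WIDTH))
print("wide limit:", max_serial(WIDE_WIDTH))

try:
    format_serial(100000, OLD_WIDTH)
except ValueError as err:
    print("old format:", err)

print("wide format:", repr(format_serial(100000, WIDE_WIDTH)))
```

Rejecting an over-long value is a design choice of this sketch. Entries that exceed the limit in practice have had to repeat or wrap serial numbers instead.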
This project is an extension of the DOE-funded BIOMOL project (ER63601-1021466-0009501) to enhance access to the contents of the PDB through improvements in software external to the database. Just as the base proposal works to enhance access by extending RasMol to increase local functionality in management of biological units, the WPDB project works to enhance access by providing software tools that allow easy adaptation of applications to a format that will support the greater detail and size of structures not easily handled in the old PDB format. Indeed, we will demonstrate the ease of this adaptation by adding both WPDB input and WPDB output capabilities to RasMol.
The managers of the RCSB PDB have implicitly recognized the need for a format going beyond both the old PDB format and mmCIF. They have applied the mmCIF dictionary to XML to produce a new composite format, PDBML. Unfortunately, this format makes working with mmCIF harder for application developers, because it combines the order independence of mmCIF with the inefficiency of managing the syntactic verbosity of XML. They also propose a new atom record format that partially returns to the efficiency and fixed ordering of the old PDB format. It is time to take this idea to its logical conclusion and provide a fixed-field, easily parsed format for all PDB entries, including the newer ones.
The original design paradigm for the WPDB format was:
The last design requirement would seem to conflict with the requirement for a fixed field format, since mmCIF format allows for arbitrarily long chain identifiers. We originally planned to resolve this conflict by adding record types for translation between long and short chain identifiers, so that atom records could use short identifiers that had been linked by an earlier translation table to longer names. At present we are exploring a simpler solution: use of field-by-field continuation when needed.
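The originally planned translation-table approach can be illustrated with a minimal Python sketch. It shows only the idea of linking long chain identifiers to short fixed-width aliases. The function name `build_chain_table`, the alias alphabet and the example identifiers are all hypothetical, and no WPDB record layout is implied.

```python
import string
from itertools import product


def build_chain_table(long_ids, short_width=1):
    """Map each distinct long chain identifier to a short fixed-width alias.

    Aliases are handed out in a fixed order (A, B, ..., Z, a, ..., z, 0, ..., 9
    for width 1), so the same input order always yields the same table.
    """
    alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits
    aliases = ("".join(chars) for chars in product(alphabet, repeat=short_width))
    table = {}
    for long_id in long_ids:
        if long_id in table:
            continue  # already translated
        try:
            table[long_id] = next(aliases)
        except StopIteration:
            raise ValueError("too many chains for the short field width") from None
    return table


# Hypothetical long identifiers, as might come from an mmCIF entry.
chains = ["heavy_chain_1", "light_chain_1", "heavy_chain_1"]
to_short = build_chain_table(chains)
to_long = {short: long_id for long_id, short in to_short.items()}

print(to_short)
print(to_long["B"])
```

The sketch also shows the cost of this approach: every reader must consult the table before it can interpret an atom record. That is one reason a field-by-field continuation may be simpler.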
The PDB format atom naming convention will be retained, using fixed column positions to distinguish carbon-alpha from calcium, but the field will be extended to at least 6 characters to allow for proper hydrogen naming without the need to wrap the field around. Let us look at the old PDB format CRYST1, ORIGX, SCALE and ATOM records from 3CRO:
.........1.........2.........3.........4.........5.........6.........7
1234567890123456789012345678901234567890123456789012345678901234567890
CRYST1   49.200   47.600   61.700  90.00 109.50  90.00 P 21          2
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.020325  0.000000  0.007198        0.00000
SCALE2      0.000000  0.021008  0.000000        0.00000
SCALE3      0.000000  0.000000  0.017194        0.00000
ATOM      5  O5*   A A   1     -16.851  -5.543  74.981  1.00 55.62
ATOM      6  C5*   A A   1     -18.254  -5.683  75.238  1.00 51.97
ATOM      7  C4*   A A   1     -18.600  -7.125  75.571  1.00 37.32
ATOM      8  O4*   A A   1     -19.740  -7.166  76.456  1.00 26.97
ATOM      9  C3*   A A   1     -18.978  -8.004  74.382  1.00 34.63

Larger cells argue for wider fields for the cell edges in the CRYST1, for at least the translations in the ORIGX records, for all the fields in the SCALE records and for the coordinates in the ATOM records. A larger field is needed for the atom serial number to prevent the need for repeated numbers already seen in some NMR entries, etc. A wider format, as below, would only require minor changes in format edit descriptors (e.g. 'a10' or '3x') in applications.
.........1.........2.........3.........4.........5.........6.........7.........8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
CRYST1 49.200 47.600 61.700 90.00 109.50 90.00 P 21 2
ORIGX1 1.000000 0.000000 0.000000 0.00000
ORIGX2 0.000000 1.000000 0.000000 0.00000
ORIGX3 0.000000 0.000000 1.000000 0.00000
SCALE1 0.020325 0.000000 0.007198 0.00000
SCALE2 0.000000 0.021008 0.000000 0.00000
SCALE3 0.000000 0.000000 0.017194 0.00000
ATOM 5 O5* A A 1 -16.851 -5.543 74.981 1.00 55.62
ATOM 6 C5* A A 1 -18.254 -5.683 75.238 1.00 51.97
ATOM 7 C4* A A 1 -18.600 -7.125 75.571 1.00 37.32
ATOM 8 O4* A A 1 -19.740 -7.166 76.456 1.00 26.97
ATOM 9 C3* A A 1 -18.978 -8.004 74.382 1.00 34.63
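The claim that applications need only minor changes can be illustrated with a minimal Python sketch. The reader and writer below are driven by a column table and a format string, so moving to wider fields changes only those two values. `OLD_ATOM` follows the standard old PDB ATOM columns. `WIDE_ATOM` and `WIDE_ATOM_FMT` are hypothetical layouts invented for this sketch and are not taken from the WPDB draft. The function names are likewise illustrative.

```python
# Old PDB ATOM layout: field -> (start, end) columns, 1-based and inclusive.
OLD_ATOM = {
    "serial": (7, 11), "name": (13, 16), "res_name": (18, 20),
    "chain": (22, 22), "res_seq": (23, 26),
    "x": (31, 38), "y": (39, 46), "z": (47, 54),
}
OLD_ATOM_FMT = "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f"

# HYPOTHETICAL wide layout (assumption, not the WPDB draft): 9-column
# serial, 6-column atom name, 10-column chain, 12-column coordinates.
WIDE_ATOM = {
    "serial": (7, 15), "name": (17, 22), "res_name": (24, 26),
    "chain": (28, 37), "res_seq": (38, 41),
    "x": (46, 57), "y": (58, 69), "z": (70, 81),
}
WIDE_ATOM_FMT = "ATOM  %9d %-6s %3s %-10s%4d    %12.3f%12.3f%12.3f"


def parse_atom(record, layout):
    """Slice a fixed-field ATOM record according to a column table."""
    raw = {key: record[start - 1:end] for key, (start, end) in layout.items()}
    return {
        "serial": int(raw["serial"]),
        # Keep leading blanks: the starting column of the name is significant.
        "name": raw["name"].rstrip(),
        "res_name": raw["res_name"].strip(),
        "chain": raw["chain"].strip(),
        "res_seq": int(raw["res_seq"]),
        "xyz": (float(raw["x"]), float(raw["y"]), float(raw["z"])),
    }


def write_atom(atom, fmt):
    """Format an atom dictionary as a fixed-field record."""
    x, y, z = atom["xyz"]
    return fmt % (atom["serial"], atom["name"], atom["res_name"],
                  atom["chain"], atom["res_seq"], x, y, z)


def is_calcium(name):
    """Heuristic: a name starting 'CA' in the first column of the field is
    calcium, while ' CA' (shifted one column) is an alpha carbon."""
    return name[:2].upper() == "CA"


# First ATOM record of the 3CRO excerpt, written in the old layout.
atom = {"serial": 5, "name": " O5*", "res_name": "A", "chain": "A",
        "res_seq": 1, "xyz": (-16.851, -5.543, 74.981)}
old_record = write_atom(atom, OLD_ATOM_FMT)
print(old_record)

# The same code handles a large serial and a long chain name once the
# table and the format string are swapped.
big = dict(atom, serial=123456789, chain="heavyA")
wide_record = write_atom(big, WIDE_ATOM_FMT)
print(wide_record)
print(parse_atom(wide_record, WIDE_ATOM) == big)
print(is_calcium("CA"), is_calcium(" CA"))
```

Keeping the layout in data rather than in code mirrors the way a Fortran application would change only its format edit descriptors.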
Since the original proposal, the format has evolved. It will now handle up to 999,999,999 atoms and 10-character chain names. The current strawman draft of the WPDB format is available as a 1.85MB PDF here. The code used to produce the draft WPDB entries is available here. Comments and corrections are appreciated.
 H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The protein data bank. Nucleic Acids Research, 28:235 -- 242, 2000. See http://www.rcsb.org
 F. C. Bernstein, T. F. Koetzle, G. J. B. Williams, E. F. Meyer Jr., M. D. Brice, J. R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi. The protein data bank: a computer-based archival file for macromolecular structures. J. Mol. Biol., 112:535 -- 542, 1977.
 H. J. Bernstein. Recent changes to RasMol, recombining the variants. Trends in Biochemical Sciences (TIBS), 25(9):453 -- 455, September 2000.
 Protein data bank atomic coordinate and bibliographic entry format description. Technical report, Brookhaven National Laboratory, 1992. http://arcib.dowling.edu/~bernsteh/PDB_format_1992.pdf.
 P. E. Bourne, H. M. Berman, B. McMahon, K. D. Watenpaugh, J. Westbrook, and P. M. D. Fitzgerald. The macromolecular crystallographic information file (mmCIF). Methods in Enzymology, 277:571 -- 590, 1997.
 A. Mondragon, C. Wolberger, and S. C. Harrison. Structure of phage 434 cro protein at 2.35 Å resolution. J. Mol. Biol., 205(1):179 -- 188, 1989. PDB entry 3CRO, 1990.
 Roger Sayle and E. James Milner-White. RasMol: Biomolecular graphics for all. Trends in Biochemical Sciences (TIBS), 20(9):374, September 1995.
 J. Westbrook, N. Ito, H. Nakamura, K. Henrick, and H. M. Berman. PDBML: The representation of archival macromolecular structure data in XML. Bioinformatics, 21(7):988 -- 992, 2005.