correspondence - San Diego Supercomputer Center

3 downloads 3883 Views 150KB Size Report
Aug 8, 2007 - created different instantiations of the same data. The resultant websites created by the the Research Collaboratory for Structural. Bioinformatics ...
© 2007 Nature Publishing Group http://www.nature.com/naturebiotechnology

CORRESPONDENCE

Realism about PDB To the editor: On behalf of the worldwide Protein Data Bank (wwPDB), we are responding to the commentary in your April issue entitled “Overhauling the PDB” (Nat. Biotechnol. 25, 437–442, 2007). We consider that the commentary fails to recognize the realities of dealing with complex biological macromolecular structural data, and with the equally complex and diverse communities who have a long history of working with these data. The wwPDB1 was formed to represent these international users and to ensure that the archive is kept to a uniform standard that enables the effective use and exchange of data content. The PDB archive consists of the flat files distributed via FTP (file transfer protocol). The term ‘PDB’ does not refer to the websites, browsers or database query engines developed by wwPDB partners. The wwPDB has not reengineered the PDB, nor is the term ‘Brookhaven Protein Data Bank’ appropriate, as Brookhaven’s management of the PDB archive ended in 1998. In creating the wwPDB, we deliberately separated the PDB archive from the views of the data. The members of wwPDB have each created different instantiations of the same data. The resultant websites created by the the Research Collaboratory for Structural Bioinformatics (RCSB), the Macromolecular Structure Database (MSD) at the European Bioinformatics Institute (Hinxton Hall, UK) and PDBj (Osaka University, Japan) each provide different services and views on the PDB data. Each database uses schemas that are in some manner derived from the macromolecular Crystallographic Information File (mmCIF), and are designed to satisfy particular search and retrieval requirements. None is put forward as a standard and each takes a different approach to data cleaning, normalization

for integrity and denormalization for fast transaction services. The members of wwPDB are well aware of the complexities and inconsistencies in the PDB data. The PDB archive contains experimental data that have, until recently, not always been collected consistently. It includes data from new experimental techniques as they emerge, from many different contributing laboratories and from high-throughput structural genomics centers. The wwPDB has worked to remediate these data, and now all 42,000 entries have been corrected for errors in sequence, chemical components, nomenclature, citations and more. The wwPDB website (http://www.wwpdb. org/) provides information about the progress of this remediation project with the complete corpus available for community review2. Beyond remediation of the PDB archive, we have updated the rigor of our annotation practices to avoid introducing new errors in the future. All current databases derived from the PDB archive are loaded with preremediated data and have problems with data uniformity. This will be rectified as the remediated PDB passes from the test phase into full production over the next few months. The mmCIF was created under the auspices of the International Union of Crystallography (IUCr) as a dictionary of terms that describes the crystallographic experiment and the results3. The dictionary was extended to include terms for nuclear magnetic resonance and electron microscopy as defined by experts in these fields. This PDB Exchange Dictionary4 was created in full collaboration with structural biologists and computer scientists and represents a very detailed vocabulary for structural biology. The mmCIF was evolved from the small molecule CIF format5, which necessitated the creation of a new dictionary definition language (DDL). The semantics of the DDL

NATURE BIOTECHNOLOGY VOLUME 25 NUMBER 8 AUGUST 2007

provide the necessary details, such as category containers with natural key identifiers, semantic definitions and examples, data types with regular expressions, controlled vocabularies, boundary values, subclasses, aliases and versioning. In addition, directed relationships of the parent-child or foreign key types are described between common identifiers. These semantics permit the precise definition, validation and exchange of data. The use of this information makes it possible to perform largely automated transformations of the underlying data into different concrete formats—PDB, mmCIF or PDBML (an extensible markup language (XML) format developed by the wwPDB members6 as a direct translation of mmCIF)—to program application programming interfaces (APIs)7 and to remodel this information in various database schemas6. The mmCIF dictionary has been referred to as an ontology of terms for experimental structure determination8. The term ‘ontology’ currently implies a more specific set of modeled relationships, some of which are not present in our data dictionary. We also take issue with the assertion by Ross King and colleagues that data organization in our data dictionary, or any domain dictionary for that matter, should dictate the logical and physical organization of our database systems. Although the simple regular tabular organization of mmCIF simplifies the relational mapping of PDB data, each of the wwPDB groups has created a logical design and physical implementation that suits requirements of its particular web resources. The commentary authors refer to standard normalization techniques (that is, the third normal form (3NF)) to describe the design of the ‘logical model’ in a particular normal form. Universally, the ‘physical model’ that is then implemented does not necessarily have any particular form. Physical database implementations address issues of performance and maintenance using the logical model as the underlying basis. The authors have mixed the meanings of a logical model design with the implementations within the wwPDB websites.

845

© 2007 Nature Publishing Group http://www.nature.com/naturebiotechnology

CORRESPONDENCE Finally, we certainly agree that the PDB should have a standard data representation, although a well-designed ontology plays less of a role than that championed by King and his coauthors. Even so, the legacy requirements of our community and the dynamics of changing a global resource require that it be developed over time and in collaboration with our diverse user base. Anything less is a gross underestimation of the current usage and impact of PDB data. COMPETING INTERESTS STATEMENT The authors declare no competing financial interests.

Helen M Berman1, Kim Henrick2, Haruki Nakamura3, John Markley4, Philip E Bourne5 & John Westbrook6 1RCSB Protein Data Bank, Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, New Jersey 08854, USA. 2Macromolecular Structural Database, European Bioinformatics Institute, EMBL, Outstation, Hinxton, Cambridge, UK. 3PDBj, Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka 5650871, Japan. 4BioMagResBank, University of Wisconsin-Madison, Madison, Wisconsin 53706, USA. 5RCSB Protein Data Bank, San Diego Supercomputer Center and the Skaggs School of Pharmacy and Pharmaceutical Sciences at the University of California, San Diego, La Jolla, California 92093, USA. 6RCSB Protein Data Bank, Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, New Jersey 08854, USA. Correspondence should be addressed to H.M.B. e-mail: [email protected]

1. Berman, H.M., Henrick, K. & Nakamura, H. Nat. Struct. Biol. 10, 980 (2003). 2. Berman, H.M., Henrick, K. Nakamura, H. & Markley, J.L. Nucleic Acids Res. 35, D301–D333 (2007). 3. Fitzgerald, P.M.D. et al. in International Tables for Crystallography (eds. Hall, S.R. & McMahon, B.) 295– 443 (Springer, Dordrecht, The Netherlands, 2005). 4. Westbrook, J., Henrick, K., Ulrich, E.L. & Berman, H.M. in International Tables for Crystallography (eds. Hall, S.R. & McMahon, B.) 195–198 (Springer, Dordrecht, The Netherlands, 2005). 5. Hall, S.R., Allen, A.H. & Brown, I.D. Acta Crystallograph. A47, 655–685 (1991). 6. Westbrook, J., Ito, N., Nakamura, H., Henrick, K. & Berman, H.M. Bioinformatics 21, 988–992 (2005). 7. http://www.omg.org/cgi-bin/doc?lifesci/00-02-02 8. Westbrook, J. & Bourne, P.E. Bioinformatics 16, 159– 168 (2000).

Amanda C Schierz, Larisa N Soldatova & Ross D King respond: The first point we would like to make is that we are very pleased to learn of the project to clean up the data in the PDB and to see on the website (http://www.wwpdb.org/) ‘Announcement: Release of Remediated PDB Data’ (16 April, 2007). This is a welcome development for the structural biology community.

846

Even so, we are disappointed with the reply from the wwPDB group. What we had hoped to read was a plan for structural biology to regain its lead in scientific data standards. Instead what the letter consists of is a series of red herrings, excuses for past problems, a complacent description of the current situation and a vague promise of jam tomorrow. Our main claims that mmCIF is a poor ontology and that the RCSB is a poor relational database are not seriously disputed. Considering the red herrings: the wwPDB authors object to our use of “PDB” and of the term “Brookhaven Protein Data Bank” for post-1998 data. Yet, the title ‘Overhauling the PDB’ was Nature Biotechnology’s editorial suggestion, not ours, indicating that PDB is a generally accepted term for their organization. And although we should have deleted the word “Brookhaven” (which was erroneously introduced by editors at the proof stage), one must ask, ‘What’s in a name?’ Would your data smell any sweeter with the correct name? The wwPDB group also claims that we argue “that data organization in our data dictionary or any domain dictionary for that matter, should dictate the logical and physical organization of our database systems.” We don’t claim this. We are well aware of the differences between a logical and physical database model, which is why we were surprised that the RCSB PDB logical and physical model are exactly the same! The question of how an ontology can contribute to database design is an active area of research with high potential. The worry is that given the poor example of the RCSB PDB, database developers may draw

the wrong conclusion about the usefulness of ontologies. Considering the excuses: the wwPDB group argues that it has a lot of complex data to deal with. We, of course, accept this. But the problem is not new and PDB/wwPDB have had over 35 years to get things right. We have examined the wwPDB remediated chemical component dictionary and note that the obsolete component codes are now clearly labeled as such. Even so, we believe that some, perhaps many, of the mmCIF files on the remediated wwPDB FTP (file transfer protocol) site still contain incorrect data. For example, the nuclear magnetic resonance– obtained protein structures 1AXJ and 2FN2 are both supposedly remediated, yet both contain the mmCIF CELL category (which is defined as ‘Data items in the CELL category record details about the crystallographic cell parameters’). The wwPDB also seem to put the blame for poor features in mmCIF on the International Union of Crystallography (IUCr). This seems to be an abdication of responsibility. Their claim that mmCIF was an ontology when Westbrook and Bourne1 was written, but the meaning of the term ‘ontology’ has changed, is also weak. To conclude, we hope that before the end of the decade, the wwPDB will present the structural biology community with guaranteed clean and self-consistent structural data, a state of the art ontology to represent these data and link it with other types of data and (at least) one state-of-the-art relational database to store and access the data. 1. Westbrook, J. & Bourne, P.E. Bioinformatics 16, 159– 168 (2000).

The Metabolomics Standards Initiative To the editor: The standards papers that Nature Biotechnology hosted online as part of a community consultation (http://www. nature.com/nbt/consult/index.html), in particular those by the Human Proteome Organization Proteomics Standardization Initiative (HUPO-PSI)1,2 and the Functional Genomics Experiment (FuGE)3 working groups, represent an important first step toward permitting the sharing of high-quality, structured data. We particularly applaud the open consultation solicited by Nature Biotechnology and

advocate the early-community-involvement approach taken by HUPO-PSI, FuGE and the other working groups in the development of such guidelines and standards. These are the most effective ways to ensure that the output generated is pragmatic and the standards are both useful and widely accepted by the community. As representatives of the nascent Metabolomics Standards Initiative (MSI)4, we are following closely the work of the FuGE and the PSI working groups, leveraging on their work where commonality exists, such as the mass

VOLUME 25 NUMBER 8 AUGUST 2007 NATURE BIOTECHNOLOGY