Genome Databases Pathway ... - CiteSeerX

Curator’s Guide for Pathway/Genome Databases Pathway Tools Version 14.0 Ron Caspi, Carol Fulcher, Ingrid Keseler, Markus Krummenacker, Suzanne Paley, Alex Shearer, and Peter D. Karp SRI International 333 Ravenswood Ave. Menlo Park, CA 94025 [email protected] June 11, 2012

Previous Contributors: Martha Arnaud, John Ingraham, Cynthia Krieger

1

Contents 1 Introduction 1.1

4

Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2 Literature Search 2.1

4 5

Recommended Databases for Literature Search . . . . . . . . . . . . . . . . . . . . . .

5

2.1.1

PubMed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

5

2.1.2

MEDLINE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

5

2.1.3

BIOSIS via LANL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

5

2.1.4

SciSearch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

5

3 PGDB Curation

5

3.1

Naming Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6

3.2

Overview of PGDB Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6

3.2.1

Pathways . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6

3.2.2

Chemical Compounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7

3.2.3

Compound Classes for Broad Substrate Specificity and Polymerization . . . . .

9

3.2.4

Reactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

15

3.2.5

Proteins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

17

3.2.6

Genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

21

Summaries and History Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

22

3.3.1

Writing Style Guidelines for Summaries . . . . . . . . . . . . . . . . . . . . . .

22

3.3.2

Formatting Summaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

23

3.3.3

Say It in Your Own Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

23

3.3.4

Citation Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

23

3.4

Saving Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

24

3.5

Evidence Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

24

3.3

4 Pathway Curation

24

4.1

Pathway Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

25

4.2

Defining Pathway Start and End Points . . . . . . . . . . . . . . . . . . . . . . . . . .

25

4.3

Pathway Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

26

4.4

Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

27

2

4.5

Database Searching Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

27

4.6

Pathway Entry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

28

5 EcoCyc-Specific Information

28

5.1

E. coli Gene Frame Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

28

5.2

Interrupted Genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

29

6 MetaCyc-Specific Information

29

6.1

Database Searching Strategies for MetaCyc . . . . . . . . . . . . . . . . . . . . . . . .

29

6.2

Species Information

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

29

Taxonomic Range Information . . . . . . . . . . . . . . . . . . . . . . . . . . .

30

6.3

E. coli Pathways in MetaCyc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

30

6.4

Pathways from Other Pathway Databases . . . . . . . . . . . . . . . . . . . . . . . . .

31

6.5

Proteins as Substrates in MetaCyc . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

31

6.6

Curation with Classification Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . .

31

6.6.1

31

6.2.1

Gene Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7 Organism Summary

32

8 Update Propagation among DBs

33

8.1

Invoking DB Updating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

33

8.2

Overview of the Updating Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . .

33

9 Database Release Process

35

9.1

The Consistency Checker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

35

9.2

Release Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

36

9.2.1

Database Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

36

9.2.2

Updates to the Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

37

Updates to the General DB Information . . . . . . . . . . . . . . . . . . . . . . . . . .

37

9.3

10 Programming Hints

37

3

1

Introduction

This guide contains information for curators of Pathway/Genome Databases (PGDBs) such as EcoCyc [4] and MetaCyc [3]. PGDBs are created, updated, and queried using the Pathway Tools software system [2]. This curator’s guide addresses issues regarding PGDB conventions, literature search and review, PGDB entry and editing, and a variety of programs that may be run periodically on PGDBs. Since the roles of curators may vary, only parts of this guide may be relevant to any particular reader. For example, some curators may not conduct literature research and others may never run any of the automated programs. Furthermore, sections of this guide are specific to the curation of a particular PGDB, such as the organism-specific PGDB EcoCyc, and thereby may not be applicable to other PGDBs. For example, there is an organism-specific section for EcoCyc and it addresses such issues as the conventions for naming E. coli genes. Organism-specific sections are labeled as such. Another important source of information for curators is the Pathway Tools User’s Guide document. This guide is a work in [email protected].

1.1

progress.

Please

mail

suggestions

for

improvements

to

Definitions

The following terms are used in this manual. Pathway/Genome Database: A PGDB describes the genome of an organism (its chromosome(s), genes, and genome sequence), the product of each gene, the biochemical reaction(s) catalyzed by each gene product, the substrates of each reaction, and the organization of reactions into pathways. A PGDB can also describe the genetic network of an organism: its promoters, operons, transcription factors, and transcription-factor binding sites. A PGDB is a type of MOD (Model Organism Database). EcoCyc Database: A PGDB for the organism E. coli. The majority of the information in EcoCyc is derived from the biomedical literature. MetaCyc Database: A PGDB containing metabolic data for many different organisms. The goal of MetaCyc is to contain broad coverage of experimentally elucidated metabolic pathways from many different organisms, rather than to attempt to model the complete pathway complement of any particular organism. MetaCyc contains a broad base of well-established pathways that are used by the PathoLogic program to predict the pathway complement of a particular organism, which is modeled within a separate PGDB for that organism. The majority of the information in MetaCyc is derived from the biomedical literature. BioCyc Database Collection: The collection of PGDBs at URL BioCyc.org is called the BioCyc database collection. EcoCyc, MetaCyc, and BsubCyc are all component databases within the BioCyc Collection.

4

2

Literature Search

2.1

Recommended Databases for Literature Search

2.1.1

PubMed

This database is a quick and easy starting point for researching metabolic pathways. Independent of what database you use for your literature search, you’ll want to get the PubMed or MEDLINE citation number for the articles you’ll reference to add a web link from your PGDB to PubMed. For those articles that don’t have PubMed or MEDLINE citations you’ll have to create a frame for the reference (See Pathway Tools User’s Guide). For a description of PubMed searching strategies see the following URL: http://www.ncbi.nlm.nih.gov:80/entrez/query/static/help/pmhelp.html#PubMedSearching. The word “coli” is useful for restricting searches for EcoCyc references that return a huge number of irrelevant results. Use of the “Limits” option to limit searching to titles and abstracts also helps restrict searches in cases where the gene name happens also to be a name of a person or institution or part of an institution’s address. 2.1.2

MEDLINE

MEDLINE, not through PubMed, has more database searching functions such as combining different searches, subtracting others, etc. 2.1.3

BIOSIS via LANL

This database is accessible free of charge to Stanford University so you can access it from any terminal connected to Stanford. If you have access to Lane library there are computers there that can be used for database searching. 2.1.4

SciSearch

This database allows you to search for articles that cite an article of interest, which enables you to search for related articles that have been published after the article of interest.

3

PGDB Curation

This section discusses curation of specific PGDB datatypes. A pathway is connected to its component reactions, which in turn are connected to the enzymes that catalyze them. Enzymes are connected to the genes that encode them, which are connected to the replicons (chromosomes or plasmids) on which they reside. Enzymes can be active as monomers or as complexes; complexes are connected to objects that represent their polypeptide subunits. Reactions are also connected to objects representing their substrates (such as Metabolite-1 in this example). Genes are also connected to objects representing the transcription units (operons) containing them. Transcription units are also connected to their promoters.

5

3.1

Naming Issues

A variety of naming issues arise when developing PGDBs. We strive for consistency in naming. One reason consistency is important is so that users of a database can rely on a naming scheme being used in a systematic way. For example, if a user wishes to find all degradation pathways, they can do a substring search for the term “degradation,” and we should ensure that they will not miss some pathways because they use the term “catabolism” instead. Although we advocate using a single consistency criterion for the common names of different object types, such as chemicals, enzymes, and pathways, it is also reasonable and preferable to consistently apply several naming schemes when creating synonyms, if you believe that different communities of users are likely to consistently try to use different sorts of names when querying different entities. The addition of synonyms for different object types increases the flexibility and robustness of the database. Synonyms 1) Enable users to search the database for information using alternative names and readily find information and 2) Prevent addition of redundant information in the database under alternative names. Furthermore, if a reaction does not have a specific Enzyme Commission (EC) number, the software program PathoLogic [2] relies on enzyme names to correlate an annotated gene with an enzyme, emphasizing the importance of enzyme names. Try not to put commas inside the common names for objects, because the Pathway Tools software uses commas as separators between the names of objects when it displays a list of objects. One alternative is to use hyphens instead of commas. More detailed naming conventions are presented in sections that follow on each datatype.

3.2

Overview of PGDB Content

The following sections discuss curation of different PGDB datatypes, such as the naming conventions for each datatype. Frequently, it is beneficial to know the type of information collected in PGDBs prior to curating a given pathway or other object type, so these sections summarize the information to be gathered for each datatype. 3.2.1

Pathways

Please note that Section 4 discusses curation of pathways in more detail. Summary of information collected for pathways: • Common name • Synonym(s) of pathway name • Superclass(es) of pathway • Evidence code for the existence of the pathway, along with a reference • Names of species in which the pathway has been experimentally demonstrated (only relevant to MetaCyc since other PGDBs are specific to a given organism). • Summary 6

– General description of the pathway and its significance. – Please try to use uniform styling for the General Background and About This Pathway headings within summaries. Consistently use these two phrases, in bold, with a blank line before the next sentence. No underlining, no colon. – Statement regarding the initial and end substrates. In cases where the initial or end compounds are rather specific (as in the degradation of a xenobiotic compound or the biosynthesis of a secondary metabolite), provide background information about the compound. If the pathway is defined as the degradation of substrate A to substrate E, but E is further degraded, comment on how E is further degraded and to what end products in the species noted. – Statement regarding whether the pathway is shared among different types of species. – Relationship to similar pathways in the same or different species. – Relationship to linked pathways (preceding and subsequent pathways), and sub- and superpathways (see definitions), if applicable. Use internal links when referring to other pathways. – Highlight interesting or novel reactions/enzymes in the pathway. – If the pathway contains proposed intermediates, or hypothetical reactions, discuss the relevant circumstantial or preliminary evidence. • Links to other pathways within the same PGDB. • Links to other DBs, such as other PGDBs or The University of Minnesota Biocatalysis/Biodegradation DB (UM-BBD). • Label hypothetical reactions as such. • Citations Pathway Naming. When adding a new pathway name, try to use the format and style of other, similar pathway names. For consistency, please use the terms “degradation” and “biosynthesis” for pathway common names when appropriate. Do not use the terms “catabolism”, anabolism, nor “utilization”. If “degradation” or “biosynthesis” is not appropriate, “metabolism” may be used. Pathway variants should be enumerated with roman numerals, not cardinal numbers. For example, TCA cycle variation I, TCA variation II, etc. Begin the name of any superpathway with “superpathway of...” Explicitly note pathways that are anaerobic by inclusion of (anaerobic) at the end of the pathway name. The one exception is the anaerobic energy pathways (“anaerobic respiration,” etc.), in which the word “anaerobic” is found at the beginning of the pathway name. Spell out the full names of amino acids or chemical compounds within all pathway names. 3.2.2

Chemical Compounds

Summary of information collected for chemical compounds:

7

• Chemical name and synonym(s) • Superclass(es) of chemical compound • Chemical structure • Chemical formula and molecular weight are calculated from the chemical structure. • Gibbs free energy of formation This can be added via the frame editor. Add it if you come across this information, but do not spend time searching for it. • Links to other databases, such as other PGDBs, Chemical Abstract Service (CAS), etc. • Summary • Citation(s) Chemical names, synonyms and structures may be found in a variety of sources, such as the following: • Examples for On-line databases – Kyoto Encyclopedia of Genes and Genomes (http://www.genome.ad.jp/kegg/ligand.html). Go to Search Compound. – The University of Minnesota Biocatalysis/Biodegradation (http://umbbd.ahc.umn.edu/index.html). Use compound search.

Database

(KEGG): (UMBBD):

– Wikipedia – Online literature databases – Sigma-Aldrich product catalog: (http://www.sigmaaldrich.com/). – Klotho Biochemical Compounds (http://www.biocheminfo.org/klotho/).

Declarative

Database:

– IUBMB Enzyme Nomenclature Database: (http://www.chem.qmul.ac.uk/iubmb/enzyme/). This DB can be helpful if the chemical of interest is a substrate or product of an EC reaction, and there is a link from the reaction to a reaction diagram, displaying the chemical structure. – Enzymes and Metabolic Pathways (EMP) Database: (http://emp.mcs.anl.gov/). – Agency for Toxic Substances (http://www.atsdr.cdc.gov/toxfaq.html).

and

Disease

Registry

(ATSDR):

• Books and chemical catalogs – Merck Index – General chemistry and biochemistry textbooks if the chemical is common or is a central metabolite – Sigma catalog – Aldrich catalog

8

Chemical structures and synonyms may also be found by searching the scientific literature for research articles. If you find a structure in a research paper, make sure to cite the article. If a compound cannot be found by name-based searching, it can sometimes be found by searching for an enzyme that is known to catalyze a reaction in which it is a substrate. It is desirable but not essential to find each compound in at least two sources, both to ensure correctness of its structure, and to aid in finding additional synonyms. The recommended structure editor to use with Pathway Tools is Marvin (see Pathway Tools User Guide). It is often faster to enter a structure by modifying the structure of an existing similar compound rather than entering the structure from scratch. If a structure is not found for a compound, enter a value in the Comment-Internal slot using the frame editor to record this fact of this form: “No structure found in search on DATE by PERSON”. Chemical Naming. Common names for chemical compounds should not be capitalized except where uppercase characters are strictly required, e.g., use “L-tryptophan,” not “L-Tryptophan.” In addition, for consistency, when naming organic acids, please use the name of the conjugate base as the common name and the conjugate acid as a synonym. For example, the common name for acetic acid should be acetate, while acetic acid should be used as a synonym . Many chemicals have a common name as well as a trivial name, and an IUPAC (International Union of Pure and Applied Chemistry) name. For example, dimethyl ketone (common name) has the trivial name of acetone, and the IUPAC name of 2-propanone. It is also known as beta-ketopropane, and 2-oxopropane. Different communities of users should be able to query the chemical using any one of these names and find the chemical entry. To make this possible, each chemical entry in the PGDB should include, in addition to the chemical common name, all trivial name(s) and IUPAC name(s) as synonym. Furthermore, some functional groups have multiple prefixes and/or suffixes. For example, the functional group of ketones, >C=O, has two different prefixes, oxo- (IUPAC name) and keto(common name). The bifunctional compound 4-oxohexanoate is equivalent to 4-oxohexanoic acid, 4ketohexanoate and 4-ketohexanoic acid, and thus the chemical entry for this compound should include all of these names as either the common name or synonym. Furthermore, if you think that many people are likely to use names such as both “glucose-6-P” and “glucose-6-phosphate,” you may wish to consistently use one style for the common name, and one style as a synonym. Note that the Pathway Tools matching software for chemical names eliminates ambiguity due to capitalization, hyphenation, and spacing. Therefore the names “glucose-6-P” and “glucose 6-p” will match one another during a user query, so there is no need to enter both of these names as synonyms. 3.2.3

Compound Classes for Broad Substrate Specificity and Polymerization

Some enzymes (and their reactions) have broad and/or ill-characterized substrate specificities, and the rationale and philosophy for how to faithfully represent these complexities will be explained here. Most reactions in the small molecule metabolism convert distinct and well-defined compounds, the representation of which is achieved by compound instance frames that reside under the Compounds class hierarchy. However, a substantial number of reaction equations, including many that were imported from the EC enzyme classification system, refer to classes of chemical compounds, and the question arises of how to represent such reactions, such that much of the chemical logic of metabolic pathways can be captured and made computationally accessible. 9

One of the strengths of the PGDB schema is that the concept of an enzyme with broad substrate specificity can be faithfully captured through the use of generic reactions. A generic reaction is a reaction containing at least one substrate that is a compound class. Such a compound class can have multiple specific instances, i.e. example compounds for which the reaction equation is known or presumed to hold true. For example, the reaction R1 : a + B = c + D would define a generic reaction if compounds B and D are compound classes. The reaction R2 : a + B1 = c + D1 is said to be a specialization of R1 if the substrate B1 is a subclass of B and if D1 is a subclass of D. The reaction R3 : a + b1 = c + d1 is said to be an instance of the generic reaction R1 if b1 is an instance of class B and if d1 is an instance of class D. Note that a PGDB states that enzyme E catalyzes generic reaction R, the PGDB is making the statement that E catalyzes every instance of R that can be constructed by combining all instances of the substrates of R. Organization of the Compound Class Hierarchy. Compounds are organized in a class hierarchy for several reasons, including specification of groups of compounds that are interchangable as enzyme substrates, and to allow user navigation and retrieval of sets of compounds that share functional groups or that share metabolic purposes. Underneath the class Compounds, compound classes describing functional groups include All-Amines, All-Carboxy-Acids, All-Amino-Acids, All-Carbohydrates. Classes describing metabolic purposes include Coenzymes, Hormones, Vitamins, Secondary-Metabolites. These classification hierarchies are not yet as fully developed as they could and should be, and may be subject to additional refinement and rearrangement. The special class Unclassified-Compounds serves as the catch-all bin for compounds that have not yet been carefully placed in an appropriate category. Newly created compound instances (when the curator uses the Compound Editor [2]) are created in this class by default, unless another class is explicitly specified. The common name of a class starts with “a” or “an,” e.g. an alcohol. Broad Substrate Specificity. Compound classes allow a PGDB to represent broad substrate specificities in reaction equations. By using a compound class as a substrate in a reaction equation, you are effectively stating that all instances of that class are interchangable within that reaction. As an example in EcoCyc, the reaction LYSOPHOSPHOLIPASE-RXN (EC 3.1.1.5) contains two classes: a 2-acylglycerophosphocholine + H2O = a fatty acid + L-1-glycero-3-phosphocholine The class “a 2-acylglycerophosphocholine” (frame ID 2-Acylglycero-Phosphocholines) on the left side stands for a group of Lysolecithin compounds that have a variable-length fatty acid tail. On the right side, the fatty acids class stands for the corresponding hydrolysis products. In contrast, H2O and L-1-glycero-3-phosphocholine are specific compounds, represented as instances.

10

What usually determines the exact extent of the specificity is the enzyme that catalyzes the reaction, and this even may differ depending on the isozyme or organism. Because the classes used for representing substrate specificity do not necessarily fully overlap with the classification based on the chemical viewpoint, often the curator has to make an informed decision about how much can be safely assumed about the likely extent of specificity in a given reaction equation. As a complicated example, in EcoCyc, the class Nucleosides is a substrate in the transport reaction TRANS-RXN-108. However, the narrower subclass Ribonucleosides is a substrate in the 3NUCLEOTID-RXN phosphatase reaction, which appears to apply only to RNA-derived substrates and not to the other major subclass Deoxy-Ribonucleosides. The overlap of these classes is quite good between their viewpoints of functional groups versus roles as broad reaction substrates. However, one subtle difference is that the class Ribonucleosides includes the compound INOSINE, because this class definition clearly makes sense from the functional group viewpoint. However, in the 3-NUCLEOTIDRXN, it is unlikely that INOSINE will ever be a substrate that occurs biologically in the RNA material that is processed. Such classification decisions often have to be made, and more precise guidelines will likely evolve for some time to come. In summary, to represent broad substrate specificity in reaction equations, classes can be used that will generally contain specific compound instances, and could in principle acquire additional future instances as more new compounds are added to the database and as the enzymes for the reactions are investigated in greater biochemical detail. Maintaining detailed information about compound classes and their instances allows inferences to be made about compound conversions mediated by enzymes with broad specificity, which would not become apparent in the metabolic network otherwise. Sometimes an enzyme will have broad substrate specificity comprising most of the members of a compound class but with a few notable exceptions. Rather than cluttering up the compound taxonomy with classes that attempt to exclude just those few exceptions, it is possible to specify a reaction as involving a compound class and then enumerate the exceptions. To exclude a specific reaction from those catalyzed by an enzyme attached to the generic reaction, use the Edit-¿Detach Enzyme(s) command from the right-click menu for the specific reaction. Ill-characterized Substrates. Some EC reactions describe very broad conversions of functional groups, such as the reduction of aldehydes to alcohols (EC 1.1.1.2): NADP + an alcohol = NADPH + an aldehyde This reaction is so broad that it would be dangerous to make assumptions about which compound instances are truly affected. Often, the description of such reaction equations is so vague, that it is very difficult to search and obtain information on the scope of the reaction. In other cases, the biochemical experiments simply have not yet been performed that could delineate the extent of the specificity more precisely. For example in HumanCyc, a substantial number of function assignments to enzymes were made by sequence similarity to a family of enzymes that is known to convert a general class of compounds, but the true substrate in humans has not been determined yet for every particular putative enzyme. In these cases, we advocate creating generic reactions some of whose substrates are compound classes that contain no compound instances. This is in contrast to the classes that represent broad substrate specificity, which generally will contain compound instances. The classes representing ill-characterized substrates serve as place holders, until more information becomes available that allows a better reclassification.

11

In the special class Pseudo-Compounds, ultra-ill-defined compound anomalies can be recorded, until they can be resolved more satisfactorily. Polymerization. A number of EC reactions describe polymerization reactions. Various enzymes, such as polymerases, nucleases, and fatty acid synthase, can all be viewed as causing polymerizations or de-polymerizations. Polymerization under a high degree of enzymatic control is what sets cells apart from just a simple bag of small molecules and their direct interconversions. Because of the key role that polymerization plays in biology, and to close the loop computationally for the full metabolic flow of materials, a representation had to be found that captures the essence of polymerization reactions. This can be done in the PGDB schema by looking at a growing polymer as if it were a compound class, and as if its hypothetical instances (which do not necessarily have to be explicitly enumerated) represent the intermediate products as the polymer is being grown. Internally, PGDB reaction equations that add a monomer to a polymer will generally list the compound class representing the polymer on both sides of the reaction equation. But when the name of that polymer is displayed to the user by Pathway Tools in a reaction or pathway drawing, it generates different names for different lengths of the polymer, depending on the side of the equation the class appears on. For example, the reaction FOLYLPOLYGLUTAMATESYNTH-RXN (EC 6.3.2.17) refers to the same, identical compound class THF-GLU-N on both sides of the equation, but the display of this reaction shows the appropriate (n) and (n+1) versions of the polymer name: tetrahydropteroyl-[γ-Glu](n) + L-glutamate + ATP = tetrahydropteroyl-[γ-Glu](n+1) + ADP + phosphate These additional names are stored in 3 special slots in the compound class frame, namely N-NAME, N+1-NAME, and N-1-NAME. These slots can be edited to contain a single string each, which can be selected as an alternate COMMON-NAME in reaction and pathway displays. In a reaction frame that refers to a polymer, the selection of the appropriate name is achieved by attaching the NAME-SLOT annotation to the polymer compound class frame that is one of the values of the LEFT and RIGHT slots in the reaction frame. The value for the NAME-SLOT annotation is the symbol that stands for one of the three slots mentioned above. Using the “Show Frame” menu command to print the frame contents in the example above shows these annotations: --- Instance FOLYLPOLYGLUTAMATESYNTH-RXN --Types: EC-6.3.2 COMMENT: "This reaction is involved in the conversion of folates to polyglutamate derivatives." COMMON-NAME: "FOLYLPOLYGLUTAMATE SYNTHASE" CREATION-DATE: 3000588944 CREATOR: |mriley| EC-NUMBER: "6.3.2.17"

12

ENZYMATIC-REACTION: FOLYLPOLYGLUTAMATESYNTH-ENZRXN IN-PATHWAY: FOLSYN-PWY LEFT: GLT THF-GLU-N ---NAME-SLOT: N-NAME ATP NAMES: "FOLYLPOLYGLUTAMATE SYNTHASE", "FOLYLPOLY-GAMMA-GLUTAMATE SYNTHETASE" OFFICIAL-EC?: T RIGHT: THF-GLU-N ---NAME-SLOT: N+1-NAME |Pi| ADP SCHEMA?: T SUBSTRATES: ADP, |Pi|, GLT, THF-GLU-N, ATP SYNONYMS: "FOLYLPOLY-GAMMA-GLUTAMATE SYNTHETASE" TEMPLATE-FILE: "/home/bolinas2/ecocyc/templates/new/folsyn/folC.template" ____________________________________________ In a pathway that contains polymerization reactions, the selection of the appropiate names is achieved by the POLYMERIZATION-LINKS slot of the pathway frame. Each value of this slot is a list of frame ID symbols of the form (cpd-class product-rxn reactant-rxn). When both reactions are non-nil, an identity link is created between the polymer compound class cpd-class, serving as a product of product-rxn, and the same compound class serving as a reactant of reactant-rxn. The identity link is displayed to the user by a dashed line in the pathway drawing. The PRODUCT-NAME-SLOT and REACTANT-NAME-SLOT annotations on this list specify which slot should be used for displaying the compound label in product-rxn and reactant-rxn above respectively — if one or both are omitted, COMMON-NAME is assumed. Either reaction above may be nil; in this case, no identity link is created — this form is used solely in conjunction with one of the name-slot annotations to specify a name-slot other than COMMON-NAME for a polymer compound class in a reaction of the pathway. As an example, this is what the POLYMERIZATION-LINKS slot looks like in the pathway FOLSYNPWY, as printed out by “Show Frame”: POLYMERIZATION-LINKS: (THF-GLU-N FOLYLPOLYGLUTAMATESYNTH-RXN FOLYLPOLYGLUTAMATESYNTH-RXN) ---PRODUCT-NAME-SLOT: N+1-NAME ---REACTANT-NAME-SLOT: N-NAME The Pathway Editor [2] supports the selection of these alternate name labels. Right-clicking on a 13

compound in the pathway editor will bring up a menu with commands, if applicable, that allow creating and deleting polymerization links, and selecting from the name labels that the polymer compound class makes available in the name slots described above. The Reaction Editor [2] still lacks support for name label selection as of March 2008. For reactions, use the Frame Editor for adding the appropriate annotations. Macromolecules as Substrates. Numerous reaction equations refer to proteins or nucleic acids that are enzymatically modified. One prominent example is protein phosphorylation in a regulatory context. In a sense, proteins and nucleic acids are polymers, but they are generally treated differently in the PGDB schema from the polymerization compound class concept discussed above. Historically, the PGDB schema has made the high-level distinction between small molecule compounds and macromolecules. Proteins and nucleic acids are in the latter category. Usually, when a macromolecule is listed as a subtrate in a reaction, a corresponding modified form of the macromolecule must be specified on the other side of the equation. The unmodified macromolecule generally will be a gene product, such as a protein monomer or tRNA. For the modified variant, a new frame generally needs to be created, which represents the modified macromolecule, and which points to the unmodified version in its UNMODIFIED-FORM slot. The modified and unmodified variants should be made instances in a class of their own. Also see the description in [2]. Redox-active proteins are also treated in this manner, with the corresponding oxidized and reduced variants paired up. Many reaction equations accept a broad range of macromolecules, and in these cases, macromolecule classes can be used as the substrates, in analogy to how compound classes are utilized to represent broad substrate specificity. A typical example includes tRNA-charging reactions, such as LEUCINE– TRNA-LIGASE-RXN (EC 6.1.1.4): tRNAleu + L-leucine + ATP = L-leucyl-tRNAleu + pyrophosphate + AMP On the left side, the class LEU-tRNAs contains 8 distinct gene products in EcoCyc. There are also numerous reaction equations coming from the EC system that are very generic modifications of macromolecule structure, such as DNA methylation at one type of base, wherever it may occur in a long DNA polymer. For now, the best way to represent such modifications at particular macromolecule residues, which are likely to apply to many different gene products that are left otherwise unspecified, is by using macromolecule classes, which in analogy to ill-specified compound classes will contain no instances. At a future time, these placeholder classes might be replaced by a better mechanism for representing residue- and site-specific modifications on macromolecules. Naming Conventions of Compound Classes. Like all class names, the frame IDs for compound classes should be capitalized plurals, with words separated by dashes. Example: All-Amines. The “All-” prefix should only be used for high-level classes from a chemist’s viewpoint. The COMMONNAME should make good sense when written in a reaction equation, which generally will be a lower cased singular name, usually beginning with “an ” and essentially the same name as the frame ID, but with spaces used instead of dashes to separate words. Example: the class Aliphatic-Amines has the common name “an aliphatic amine”.

14

Chemical Structures of Compound Classes. Compound classes can also have molecular structures, just like instances. The main difference is that a class needs to show in its structure that variations exist, which if explicitly enumerated, would correspond to the compound instances populating this class. In a class structure, this variation can be represented by an “R” group. The Marvin Structure Editor allows “R” to be entered as a special atom type. There are class structures that need to represent 2 or more different “R” groups that have to be distinguished between. Marvin allows the labeling of different groups as R1, R2 etc. However, while it is currently possible to create such a structure, editing such a structure currently converts all of the groups to R1. Better software support for this feature should be added in the future. 3.2.4

Reactions

Summary of information collected for reactions: • EC number. There is a slot to label whether this is an official EC number or not. The default is that it is the official number. The EC number for the reaction may not be official unless the reaction is in the exact form as defined by the Enzyme Commission. • (EC) Reaction name. The reaction name is imported automatically from the ENZYME database for EC reactions. For non-EC reactions it can be entered via the Synonym Editor. Typically reaction names are of the same form as enzyme names, and are particularly important to provide if there is no associated enzyme. However, for transport reactions, Pathway Tools reaction pages will automatically generate a name for the reaction such as “Transport of pyruvate,” therefore a common name should only be provided for transport reactions if the name that would be generated is incorrect. • Spontaneous reaction? • Change in Gibbs Free Energy for the reaction in the direction of the reaction as written. ∆G0 can be added via the frame editor. Add it if you come across this information, but do not spend time searching for it. • Summary: The summary is optional. Most of the reactions do not have summary comments. However, if useful, consider adding the following information: – A brief description of the reaction, highlighting any interesting or unusual aspects of it. – If the reaction is novel, describe why. – If the reaction is hypothetical, explain the supporting evidence. – If the reaction is spontaneous, describe under what conditions. In vivo? In vitro? • Citation(s): This is perhaps the most important annotation for a non-EC reaction. Reaction Directionality. By way of background, it is important to know that the Enzyme Commission dictates a preferred direction for every enzymatic reaction that they classify. All PGDBs should store reactions in this preferred direction, therefore, you should not change the direction in which a reaction is stored in the DB for reactions that have EC numbers. Reactions that do not have EC numbers should be stored in the same direction as comparable EC-classified reactions, if known.

15

The direction in which the Enzyme Commission writes the reaction often differs from the direction in which the reaction occurs physiologically. If the reaction is irreversible, or a strong preference for its directionality exists in the cell, a Reaction Direction can be specified using the Reaction Editor and using the Enzyme Editor (see documentation for the Reaction-Direction slot). Curators should enter reaction direction information using the Reaction Editor only for reactions that are very likely to always proceed in the specified direction. Therefore, in the majority of cases, reaction direction information should be entered using the Enzyme Editor. Note that the Navigator software chooses which direction to display a reaction in reaction windows and enzyme windows based on several criteria. Please see the Navigator User’s Guide for more information. Physiologically Relevant Reactions. By default the value of the Physiologically-Relevant? slot in reactions is True, however, the user can override that setting using the reaction editor for reactions thought not to be physiologically relevant. Examples of such reactions is transport of a substrate that is not naturally occurring. One reason to specify a reaction as not physiologically relevant is that its substrates will not be identified as dead-end metabolites by the dead-end metabolite finder. Hypothetical Reactions. Reactions within pathways may be indicated as hypothetical if there is insufficient experimental evidence. The last few reactions in the pathway “toluene degradation to benzoyl-CoA (anaerobic)” are marked as hypothetical because the intermediates, 2-carboxymethyl3-hydroxyphenylpropionyl-CoA and benzoylsuccinyl-CoA, have yet to be observed in the lab. These intermediates and their respective reactions were proposed based on analogous biochemical reactions. There is also genetic evidence for the hypothesized enzymes. If a reaction is marked as hypothetical it should be further explained in the summary of the pathway and possibly in the reaction or enzyme windows. Reaction Balancing. Ideally, all biochemical reactions within a PGDB should be exactly balanced with respect to mass and atoms, meaning that the sums of the masses and atoms for the reactants of the reaction should equal that for the reaction products. In practice, some reactions are unbalanced. Here we discuss how to find and correct unbalanced reactions. It is possible to check the balancing of reactions in the PGDB by right-clicking on the reaction and selecting Show → Is reaction mass-balanced? In addition, closing the Reaction Editor forces a balance check. The balance checker ignores the balancing if hydrogens because the chemical structures within BioCyc PGDBs are currently stored with inconsistent ionization states that cannot be expected to balance across every reaction. Reactions are typically unbalanced for several possible reasons. (1) The literature itself contains an unbalanced reaction, such as because a full reaction equation has not been determined reliably, and the author does not know the full equation. These reactions should all be given a summary to indicate that the reaction is expected to be unbalanced. (2) The reaction involves polymerization, and therefore involves complexity that the reaction balancer cannot handle properly. (3) The reaction equation is in error, such as due to lack of a water molecule, and should be corrected. (4) One or more chemical structures for substrates of the reaction are not known or simply have not been entered to the DB. (5) One or more chemical structures for substrates of the reaction are in error and should be corrected. When errors in reactions or compound structures are corrected, those corrections should be logged

16

in a history note for the reaction or compound, created using the command Right-Click → Notes → Add to History. If an error is found in a reaction that has an EC number, please alert the Enzyme Commission (IUBMB). The information printed for each reaction will aid the curator in identifying the reason the reaction is unbalanced. For example, if a large fraction of all reactions containing a given compound are unbalanced, there is a good chance that this is because the structure of that compound is in error. 3.2.5

Proteins

Guidelines for Gene-Product Summaries (See also: General guidelines for summaries in Section 3.3.) Summaries are stored in PGDB object for the protein (or RNA) product of a gene, rather than in the gene itself. The first sentence of the summary should summarize the function of the gene product. Gene-product summaries should describe important information known about the gene product such as its function and role within the cell, any phenotypes resulting from its mutation or absence, protein domains, its participation as a component of a larger structure, its similarity to other proteins (including any functional complementation studies), membership in protein families, and information about its structure. If the enzyme is novel, explain why. If the enzyme has different isoforms, describe the substrate specificity of the isoforms and the cell-type/tissue/developmental specificity of the isoforms. Except in rare cases, in which the gene product is particularly complex or physiologically significant, an effort should be made to keep the length of “Summaries” (including references)to less than 500 words. In longer summaries, the first paragraph should in a few sentences summarize what is known about the gene product. Experimental support for these conclusions and information regarding protein structure and regulation of expression should come later. In other words, “Summaries” should be organized more like a news story than a scientific paper: conclusions should precede rather than follow detailed information and evidence; the user should be able to gain the essential facts without necessarily having to read the entire summary. In those cases in which background information is considered to be an important aid to some users, such information can be added but it should not be included in the first paragraph. It is helpful to explain gene names within summaries. Use consistent styling including double quotes and bold characters to indicate what longer phrase a gene name abbreviates. Where the explanation is put will vary between proteins. Sometimes it will work best at the start of the summary, for example: “N-acetylgalactosamine repressor,” AgaR, controls the expression of the aga gene cluster... In cases where the gene name is not an abbreviation of the protein name, the explanation will often work best at the end of the summary, in a paragraph of its own. For example: Apt: “adenine phosphoribosyltransferase” [Kocharian76]

17

If relevant reviews are available, they should be grouped together at the end of the summary, such as: “Reviews: [Smith95,Jones98].” Literature citations are an important and valuable part of summaries, but some effort should be made to restrict their numbers to the most significant ones — try not to exceed 10–20 references within most summaries (in some cases of extremely well-studied genes, more references may be appropriate). Sentences, particularly those in the first paragraph, should not be interrupted by long listings of tens of references. If a literature search fails to turn up any information about a particular object, a statement with the following exact wording should be included in the summary (of course, change the date as appropriate): “No information about this protein was found by a literature search conducted on 18 June 2003.” Summary of information collected for proteins • General Protein Information – Species name (in multi-organism PGDBs) – Common name and synonym(s) of protein: use only if the protein is multifunctional, or if the common name of a single-function enzyme differs from the enzymatic reaction name. – Cellular location (eg. membrane, cytoplasm, chloroplast, etc.) – Native molecular weight in kilodaltons – Neidhardt Spot Number (reflects the proteins electrophoretic behavior in 2-dimensional electrophoresis). (Can be added via the frame editor, but not typical. Add it if you come across this information, but do not search for it). – Summary – Citation(s). In EcoCyc, we are assembling comprehensive reference lists for each protein. – Last-Curated. A PGDB can track a “last-curated” date for a gene product. Checking the “last-curated” box in the protein editor will cause this field to be updated. The last-curated date is the date on which a systematic literature search was last performed by curators for this gene product. This date can be used by both curators and database users to determine how up to date the entry is. Do not check this box if only partial curation is performed by the gene product, as this would interfere with the purpose of this field, which is the record the last date on which full curation was performed. • Enzyme Activity For a description of some of these terms see Appendix A of Pathway Tools User’s Guide. – Enzyme activity name and synonym(s). This name should be based solely on the catalyzed reaction. Additional information, such as a roman numeral used to identify a particular isozyme, should be entered under the enzyme common name, not here (see Enzyme Naming below). – Summary ∗ General description of the enzymatic activity, highlighting any interesting aspects. Kinetic data that does not have a dedicated slot, such as Ki(s) for inhibitors or maximal reaction velocity, should be noted, if available. 18

– Inhibitors (physiologically relevant or not?) ∗ Competitive inhibitors: inhibit enzyme activity by binding reversibly to the enzyme and thereby preventing the substrate from binding. ∗ Noncompetitive inhibitors: inhibit enzyme activity by binding reversibly to either the free enzyme or the enzyme-substrate complex. The substrate is not prevented from binding, but the enzyme with the inhibitor bound is not catalytically active. ∗ Uncompetitive inhibitors: inhibit enzyme activity by binding reversibly to the enzymesubstrate complex. ∗ Allosteric inhibitors: inhibit enzyme activity by binding reversibly to the enzyme and inducing a conformational change that decreases the affinity of the enzyme to its substrates. Allosteric inhibitors can be competitive or noncompetitive, therefore those inhibition categories may be used in conjunction with this one. ∗ Irreversible inhibitors: irreversibly inhibit the enzyme activity by binding to the enzyme and dissociating so slowly as to be considered irreversible. ∗ Other inhibitors: inhibit the enzyme activity by a mechanism that has been characterized but does not fall cleanly into one of the above categories. ∗ Inhibitors of unknown mechanism: inhibit enzyme activity, but the mechanism of action is unknown either because it has not yet been elucidated, or because it has not been curated. – Activators (physiologically relevant or not?) – Prosthetic Groups/Cofactors (eg. metals, FAD, NAD(H), etc.) (Note: Prosthetic groups are defined as being covalently or tightly bound to an enzyme, whereas cofactors are not.) Use the protein features mechanism to define which subunit of a multimeric enzyme a cofactor interacts with. – Alternative substrates: Can be used to specify other known substrates for the enzyme. In general it is preferable to define reaction frames for alternative substrates. Curate alternative substrates using the “Alternative substrates” field within the enzyme activity only when the full reaction equation is not known or cannot easily be inferred. – Citation(s). • Enzyme Subunit Composition – Subunit composition. Specifies the number of copies of each monomer subunit of a multimeric protein. In cases where sub-complexes of a large multimer have been observed, those sub-complexes can be created as PGDB objects that are sub-complexes of a larger super-complex. – Subunit name(s) and synonym(s) – Subunit molecular weight (experimental and computed from sequence) – PI (isoelectric point). Can be added via the the frame editor. Add it if you come across the information but do not search for it. – UNIPROT primary accession number of subunit, if available (http://uniprot.org/). Links protein subunit to UNIPROT entries. – Summary

19

∗ Brief description of the enzyme subunit, such as its function, if known or proposed. For example, a subunit may be known to house the catalytic active site, it may have an FAD binding motif and may be proposed to be involved in electron transport, or it may be proposed to be the membrane anchor subunit of a membrane-bound enzyme. – Citation(s) Protein Naming. The word “monomer” may be used to refer to a polypeptide that has intrinsic activity or to a polypeptide that acts as a member of a homo-multimeric complex. The words “subunit” or “component” may be used to refer to a member of a hetero-multimeric complex. When naming a protein complex, use the name that is commonly recognized by the scientific community. Any other names for the complex should be added as synonyms. Avoid extra wordiness (“complex”, for example) unless it is necessary to avoid confusion. For example, the long name of the pyruvate dehydrogenase multienzyme complex distinguishes it from pyruvate dehydrogenase, which is one of its components. Any protein product of a named locus without an associated gene should be named using the guidelines for protein complex names: “acetate kinase B” should be named “acetate kinase B” with “AckB” as a synonym, rather than “AckB,” because the ackB mutation has not been localized to a particular gene. Enzyme Naming. Be sure not to assign non-specific enzyme names to enzymes. For example, do not use names such as “oxidoreductase” or “hydrogenase” because these names refer to nonspecific classes of enzyme activity, not to specific enzyme activities. Use of these names will be very problematic for the PathoLogic program, because genome annotations often use these nonspecific enzyme names for genes whose specific functions cannot be inferred. Thus, if a MetaCyc enzyme were called “oxidoreductase,” PathoLogic would assign a a gene annotated with the name “oxidoreductase” to the corresponding reaction in MetaCyc, which is not correct. The following procedures should be followed regarding entry of enzyme names ending in roman numerals and arabic numbers, such as “pyruvate kinase II” and “tagatose-1,6-bisphosphate aldolase 2.” Enzyme names ending in roman numerals fall into two categories: those in which the roman numerals are an integral part of the enzyme activity, and those in which the roman numerals differentiate isozymes that have the same enzyme activity. PGDBs allow two sets of names and synonyms to be defined for enzymes: names for the enzyme activity, and names for the protein. Consider that E. coli has two proteins with pyruvate kinase activity, designated pyruvate kinase I and pyruvate kinase II. The name of the enzyme activity for both of these enzymes is pyruvate kinase, however, the names of the proteins are pyruvate kinase I and pyruvate kinase II. To encode this situation in a PGDB, enter the protein name, pyruvate kinase I, in the top section of the protein editor, by adding a common name to the protein. Enter the enzyme activity name, pyruvate kinase, in the second section of the protein editor, under the first horizontal line, in the box labeled “Enzyme activity name.” Note that every enzyme should have at least one enzyme activity name, but protein names should be assigned less frequently, such as when different isozymes exist. In addition, protein names should be defined when names exist for an enzyme that are specific to that 20

protein. Arabic numbers at the ends of enzyme names are also used to differentiate isozymes, and should be treated in the same manner. In contrast, consider the enzymes exodeoxyribonuclease I and exodeoxyribonuclease III. These enzymes catalyze different reactions, and the roman numerals designate exactly which enzyme activity an enzyme has. Therefore, these names should be entered in the “Enzyme activity name” field of a PGDB, because the names are specific to the enzyme activity, not to the protein. Using enzymatic reaction synonyms in special cases of multifunctional enzymes. Note that this section only applies to multi-organism PGDBs, such as MetaCyc, that are used as a reference for PathoLogic. Some enzymes catalyze multiple reactions in the cell, yet are known by a single common name. For example, consider the E. coli enzyme octaprenyl synthase. This enzyme catalyzes the sequential addition of isoprenoid units to farnesyl diphosphate, generating the compounds geranylgeranyl diphosphate, geranylfarnesyl diphosphate, hexaprenyl diphosphate, heptaprenyl diphosphate, and finally octaprenyl diphosphate. The enzymatic reaction names for the different reactions should be specified properly as geranylgeranyl diphosphate synthase, geranylfarnesyl diphosphate synthase etc. However, the enzyme is commonly referred to as octaprenyl diphosphate synthase, and a similar enzyme in the annotated genome of a different organism is likely to be annotated under such name. To ensure that PathoLogic would be able to match the enzyme with all the appropriate reactions catalyzed by it, the name “octaprenyl diphosphate synthase” should be entered as a synonym for all the enzymatic reactions catalyzed by the enzyme. 3.2.6

Genes

Summary of information collected for genes: • Common name and synonym(s) • Superclass(es) of gene • Gene product type (eg. enzyme, regulator, leader, etc.). Evidence for product type (experimental or predicted based on sequence analysis) • Transcription direction (unspecified, forward or reverse) • Left and right end position of gene on chromosome or plasmid • Link to other DBs • Summary • Citation(s) Gene Naming. Regarding gene naming, for bacterial databases, automated programs will periodically ensure that the capitalized gene name is a synonym for the name of the gene product (e.g., “TrpA” will become a synonym for the product of “trpA”). For most organisms, the frame name for each gene frame will be the unique identifier assigned to the gene by the genome project (e.g., “HP0001” for an H. pylori gene). Therefore, there is no need to add this same identifier as a synonym for the gene name. 21

3.3

Summaries and History Notes

Formerly called comments, text stored in the Comment slot of a PGDB will be seen by the public under the heading “Summary:”. Should you want to record information that will not be seen by the public but that will be visible in the Navigator to curators, put that information in the Comment-Internal slot using Frame Editor. Curators are encouraged to record historical information about PGDB objects to provide explanation and justification of edits to a PGDB. Such commentary should be stored in a history note for the object, created using the command Right-Click → Notes → Add to History. Example commentary could describe the reasons for changing a gene function, a chemical structure, or a pathway definition. History note contents are also public, and will be displayed with the date they were created and the username of the creating curator. Please try to minimize redundancy between information within summaries, and information in other parts of the PGDB, whenever possible. For example, in the summary for a protein, do not write “The enzyme is inhibited by calcium” if calcium is listed in the PGDB as an inhibitor for the enzyme. One reason for this rule is to avoid annoying the reader with redundant information. Another reason is to minimize the possibility that the PGDB will become inconsistent. For example, imagine for the above calcium example, that a later publication reported that the enzyme is also inhibited by magnesium, and that a curator updated the PGDB inhibitor slot to reflect this fact, but overlooked updating of the summary. The summary would then be wrong (incomplete), and might confuse or irritate the reader, and decrease the reader’s opinion of the database’s quality. The situation would be even worse if the inhibitor was changed from calcium to magnesium because the earlier information was in error. In practice, it is probably not possible to eliminate redundancy, because some of the important biology that curators will want to write in summaries will be redundant with other PGDB data. When a curator does choose to put redundant information in a summary, the information should be important, and the summary should try to provide additional information beyond what is present in structured database fields. Furthermore, the information should be general; avoid giving laundry lists. For example, write “This transcription factor regulates stress response genes,” which is less likely to be rendered incorrect by later information than “This transcription factor regulates genes xyz and abc.” Also try to avoid redundancy with other information that is displayed on the same page. 3.3.1

Writing Style Guidelines for Summaries

Summaries should be written in full sentences rather than in sentence fragments. Use multiple paragraphs within summaries where the extra whitespace adds clarity and separates ideas. Embed citations within summaries, in a way that makes the link between the citation and the text is supports clear. Other than general commentary, most of the text of a summary should consist of an assertion followed by a citation, then another assertion followed by a citation. Do not lump all the citations at the end of the summary, because it is not clear which assertion(s) a given citation pertains to. Abbreviations should be defined at their first use within a summary, in parentheses. Use American English.

22

3.3.2

Formatting Summaries

Pathway Tools supports a subset of HTML tags for encoding special characters and formatting (such as italics) within summaries and names. Following is a list of tags that are accepted. For example, to encode the name α-D-glucose, enter the characters: “α-D-glucose”. • Greek letters – alpha symbol: α – beta symbol: β – delta symbol: δ – gamma symbol: γ – omega: ω – mu (micro): μ • Text – italicized text: italicized text – bold text: bold text – underlined text: underlined text – supertext: supertext – subtext: subtext • Special characters – Angstrom (˚ A): Å – Degrees (◦ ): ° A few simple, HTML tags, such as
for paragraph, and as
for hard line break, are detected by Pathway Tools, but are removed from the displayed text and not observed. Thus, to force a new paragraph, hit the return key twice instead of using
. 3.3.3

Say It in Your Own Words

Avoid word-for-word duplication from papers, and enclose any small duplication (one to three sentences) within quotes and cite the source. 3.3.4

Citation Guidelines

Citations should be used within summaries to cite the source of the information just conveyed. Citations can also be entered independently of summaries, in which case they are stored in the Citations slot of the object (such as a protein), and they provide general references for the object.

23

3.4

Saving Changes

If changes have been made to a database, an asterisk will appear to the left of the database name at the top of the navigator window, for example, *MetaCyc. Changes are not saved automatically. It is important to remember to use the “Save DB” command to save changes.

3.5

Evidence Codes

PGDBs include an evidence ontology that is designed to encode information about why we believe certain assertions in a PGDB, the sources of those assertions, and the degree of confidence scientists hold in those assertions. A detailed description of the evidence ontology can be found in [1]. The evidence ontology can be browsed within a PGDB by running the GKB Taxonomy viewer on the PGDB class hierarchy rooted at class Evidence. An HTML version of the evidence ontology is available at URL http://bioinformatics.ai.sri.com/evidence-ontology/. An assertion could be the existence of a biological object described in a PGDB. For example, we would like to be able to encode the evidence supporting the existence of a gene, an operon, or a pathway that is described within a PGDB. Curators can assign evidence codes by clicking on the “Evidence Code” button within most Pathway Tools Editors. For example, within the protein editor, there is an Evidence Code button below the Enzyme Activity box. This button allows the curator to assign an evidence code; a citation to the source of the evidence should also be added when available (the citation is optional since some evidence codes such as Ev-AS-NAS are used when no citation is available). Whenever an evidence code has been assigned, a new “Evidence Code” button will be drawn to allow assignment of an additional evidence code. It is proper to assign multiple evidence codes if multiple types of evidence support a given conclusion. We offer several guidelines as to how to apply the evidence ontology to different classes of PGDB objects. Pathways. Assign code EV-Exp or one of its sub-codes to a pathway if some experimental evidence supports the existence of the pathway. Code Ev-Comp or its sub-codes should be used for pathways whose presence is inferred computationally, such as by the PathoLogic program. Therefore, we expect that all pathways in the MetaCyc PGDB will be assigned an EV-Exp code because MetaCyc contains experimentally elucidated pathways. Proteins. Evidence codes should be assigned to a protein to define the evidence supporting the function of the protein. For example, was the function of the protein elucidated using sequence analysis, or using experimental methods; if the latter, what class of method was used? Enzymes. Several evidence codes specific to enzymes capture evidence from experimental methods that are specific to elucidating enzyme activities, such as EV-Exp-IMP-Reaction-Enhanced. These evidence codes are actually stored within instances of the Enzymatic-Reactions class.

4

Pathway Curation

This section discusses curation of pathways in more depth. 24

4.1

Pathway Definitions

A metabolic pathway is a set of one or more enzymatic transformations (such as biosynthesis, degradation, conversion, or utilization), as it occurs in a particular organism. Identical pathways that exist in other organisms are not repeated in the MetaCyc database; instead a single pathway is labeled by the multiple organisms in which it occurs. Metabolism of a substrate exogenously supplied to cells, such as a vitamin, or drug, can also constitute a pathway. Pathways can be classified into base pathways and superpathways. A base pathway is considered a lowest-level pathway in the sense that it is not subdivided into smaller component pathways. Base pathways can be linear, circular, or branched. A superpathway is an aggregation of two or more pathways that are related in some way. A pathway component of a superpathway is referred to as a subpathway. A subpathway is part of a superpathway, and a superpathway is composed of subpathways. The subpathways of a superpathway can be base pathways, or can themselves be superpathways. Some superpathways will contain additional reactions and enzymes not found within the base pathways, such as reactions that connect two base pathways together. PGDBs always contain links between associated base pathways and superpathways, and those links are displayed by the Pathway/Genome Navigator toward the bottom of a pathway display page. There are two main types of superpathways: those whose subpathways are related by a common substrate, and those whose subpathways are related by being analogous base pathways from different organisms. For the first type of superpathway, its subpathways could be derived all from the same organism, or they could be derived from multiple organisms. Multispecies superpathways that are connected via a common substrate are potentially useful in metabolic engineering. The steps used in creating superpathways from subpathways are described in the Pathway Tools User’s Guide, Volume II, section 2.3.5.3. More specifically, superpathways can be created based on the following types of relationships among their subpathways: (1) subpathways that are physically connected through a common substrate (that is, one subpathway produces the substrate, and another subpathway consumes it); (2) subpathways that are unconnected, but that metabolize the same substrates (e.g. MetaCyc Superpathway of aspartate and asparagine biosynthesis: interconversion of aspartate and asparagine); and (3) analogous subpathways consisting of an analogous series of reactions catalyzed by the same enzymes, or by analogous enzymes (e.g. MetaCyc Superpathway of isoleucine and valine biosynthesis). Additionally, pathway variants exist in which the same substrate is synthesized or degraded using different enzymes and/or cofactors, in the same or different organisms. Pathway variants share identical pathway names followed by a roman numeral. Many examples of pathway variants can be found in MetaCyc by browsing the pathway ontology. These pathway variants can potentially be combined into superpathways.

4.2

Defining Pathway Start and End Points

Several considerations guide the questions of how to define the start and end points of a pathway, and of whether a given published pathway should be encoded in a PGDB as a single base pathway, or as a set of base pathways within a common superpathway. The following rules should be used to guide the creation and editing of base pathways and superpathways, when possible.

25

1. The substrate biosynthesized or degraded by a pathway should be a stable substrate, as opposed to a transient intermediate. However, a pathway could show the biosynthesis of a stable intermediate that is a precursor for the biosynthesis of other substrates. 2. Biosynthetic pathways should begin with an intermediate of central metabolism. These intermediates are the 13 precursor metabolites: glucose-6-phosphate, fructose-6-phosphate, ribose-5phosphate, erythrose-4-phosphate, triose phosphate, 3-phosphoglycerate, phosphoenolpyruvate, pyruvate, acetyl CoA, alpha-oxoglutarate, succinyl CoA, oxaloacetate, and sedoheptulose-7phosphate. A pathway link (see Section 4.3) should be created to indicate the pathway that produces the precursor metabolite at the start of the pathway. 3. Degradative pathways that produce an intermediate of central metabolism should stop at that point. Some degradative pathways may not produce intermediates of central metabolism, but instead produce compounds that are excreted from the cell. If appropriate, a pathway link (see Section 4.3) should be created to indicate the pathway that processes the resulting metabolite at the end of the pathway. 4. Another class of pathways is applicable in cases where compounds are metabolized in a dissimilatory manner for the production of energy. Examples for these pathways include sulfate reduction and ammonia oxidation. In such cases, the metabolites are unlikely to consume or produce intermediates of central metabolism. Such pathways should start with the natural form of the compound being used as an electron donor or acceptor, and end with the compound generated at the end of the electron transport process, which would generally be secreted by the organism. 5. Very large or complex pathways should usually be defined as superpathways that combine several smaller base pathways, where those base pathways are divided at breakpoints. Dividing a large or complex pathway in this fashion is particularly useful to optimize the accuracy of PathoLogic predictions, especially in cases where it is the base pathways, rather than the entire pathways, that tend to be present as units across different organisms. If the pathway was defined as one large base pathway, rather than as a set of base pathways connected through a superpathway, PathoLogic would be unable to predict the presence of the smaller base pathways independently in different organisms. Breakpoints for large base pathways can be chosen based on various criteria such as: branch point substrates; substrates involved in regulation; a major metabolite that is further metabolized; the cellular compartment in which the reactions occur (organelle or cytosol); a transport segment or a utilization segment; or the type of reaction (oxidative or non-oxidative). 6. If a published pathway contains several pathways that are already defined in a PGDB as base pathways, it should be represented as a superpathway. 7. If the pathway contains too many reactions to conveniently represent in one base pathway, it should be broken into two or more base pathways, which should be linked together by pathway links (see below).

4.3

Pathway Links

Pathway links are a mechanism for indicating substrate connections among pathways. Pathway links are displayed as arrows connecting an input or output substrate in a pathway to the name of a second pathway in which that substrate is metabolized. Pathway links can illustrate the source pathway for an 26

input substrate, or the destination pathway for an output substrate (assuming that it is not completely metabolized). Clicking on the second pathway name takes the user to that pathways display page. Links can be created to another base pathway, a superpathway, or to a class of substrates that derive from the pathway (e.g. MetaCyc glycolysis I). The steps used in creating pathway links are described in the Pathway Tools User’s Guide, Volume II, section 2.3.5.3. Note that although it is helpful to explain the origin or fate of substrates in the pathway summaries field, this unstructured text is not computationally useful, and thus cannot replace the use of pathway links.

4.4

Limitations

Metabolic pathways involving macromolecules and cellular structures may be difficult to represent in PGDBs. This is a factor in pathway selection. Pathways that involve reactions that synthesize, degrade, or modify small molecule components of macromolecules and cellular structures can be represented. However, some processes may be beyond the scope of PGDBs, which focus on small molecule metabolism.

4.5

Database Searching Strategies

For those curating MetaCyc, please see Section 6.1 for additional information regarding MetaCyc specifically. To search available databases for information regarding a particular pathway, it is recommended to begin by using general keywords related to the pathway name (eg, creatinine degradation to formate methionine biosynthesis, etc). You may need to search using several alternative names, for example: toluene degradation, toluene oxidation, toluene catabolism etc. Adding additional search terms such as anaerobic will help avoid getting irrelevant hits. As in all searches, if you get too many hits you should narrow your search by adding keywords, such as bacteria to limit the search to only bacterial pathways, or a species name to limit the search to only pathways in a particular species. Some databases allow the use of wildcards, which are truncated names followed by a special character such as * to designate different variations for the ending of the word. For example, bacter* would include bacteria, bacterial, bacterium, etc. Different databases may use different wild cards, so it is always useful to consult the databases search description/overview. For a description of PubMed searching strategies see the following URL: http://www.ncbi.nlm.nih.gov:80/entrez/query/static/help/pmhelp.html#PubMedSearching. If you know a little bit about the pathway, such as the names of intermediates, or enzymes involved in the pathway, you should also search for articles using these keywords. Furthermore, if you know the names of the researchers who studied the pathway, you can search for articles using a combination of their names and the substrate or enzyme they are working on. Once you get some articles of interest, you would usually find other related articles by 1) looking at their references, and 2) using SciSearch to find articles that cite them. Often an articles full text is available online in html format. These html formatted articles often have web links to articles that they reference as well as to articles that cite them, which is a very convenient way of finding additional articles. Once you have identified and gathered the relevant articles for a pathway, try to find their PubMed ID (PMID) numbers and label them as these numbers are the easiest way to enter references information into PGDBs.

27

4.6

Pathway Entry

Once you have enough papers to put a pathway together, draw out the pathway, making note of the chemical reactions, chemical names and structures of the metabolites, and enzyme names if known. Try to identify any EC numbers that may have been assigned to the reactions in the pathway. It is very important to find out whether the chemicals and/or reactions already exist in the database. This may not be straightforward, as chemicals may have many different names. MetaCyc already includes all of the reactions which have been assigned EC numbers, so you may want to search the IUBMB web site carefully to find out whether existing EC reactions fit any of the reactions in the new pathway. Often authors are not aware of such EC numbers and do not include them in their publications. Make sure you do not create duplicate chemicals and/or reaction in the database, as this will lead to certain problems in the future. After you have identified all existing reactions, you may need to create new reactions and chemicals. Write down the frame IDs of both existing and new reactions and add them to the drawing of the pathway which you have prepared. This will greatly facilitate the creation of the new pathway. Once you finished these steps, you are ready to define the new pathway. Do not forget to assign appropriate class and evidence codes. An ideal reference for the pathway evidence code is a recent review article which cites all the relevant experimental literature. In such case use the code EV-EXP-TAS. Make sure you assign the appropriate organisms to the pathway, and mark any hypothetical reaction as such. See section xxx for guideline for writing summaries for new pathways. Once the pathway is defined in the PGDB, you need to enter enzymes and genes for the various reactions. Review the papers in greater detail while taking notes of the relevant information (see Section 3.2 below). As mentioned earlier, it is best if you are already aware of the type of information you’ll input into the PGDBs. This way, you can skim/read the papers for the relevant information, and take notes in a similar fashion to how the PGDB is organized, which expedites inputting the information into the database. In addition, you’ll need to cite the information you input into PGDBs. Hopefully, your papers have PMID reference numbers; in which case, the full reference information will be imported automatically

5 5.1

EcoCyc-Specific Information E. coli Gene Frame Names

All frame IDs for E. coli genes are derived using numeric sequences. Some genes frame names are the same as the identifier used in Rudd’s EcoGene database; they use the prefix “EG”. Genes with prefix “G” were either (a) created by the MBL group when they encountered new genes in the literature, or (b) created by SRI — virtually all of these genes were created in the course of including data from the Blattner GenBank entry for the full E. coli sequence into EcoCyc. Examples: EG10115, G1001. Identifiers used by other databases are often listed in the synonyms slot for the gene, e.g., the Blattner “b” identifiers and later assigned “EG” identifiers. Dr. K. Rudd developed a naming scheme for E. coli ORFs, in which ORF genes are named beginning with “y.” The rest of the name encodes the position of the gene on the E. coli chromosome. These names are retained as synonyms for genes whose functions are later determined.

28

5.2

Interrupted Genes

Interrupted genes in E. coli are defined as follow. For each piece of the coding region, a separate gene frame is created. The same EG number is stored as a synonym for each interrupted gene, but the two interrupted genes have different b-numbers. In addition, a name of the form “ilvG” is stored as a synonym, but a name of the form “ilvG 1” is stored as the common-name. The interrupted? slot is set to T, and the start-base and end-base delimit the segment of the coding region defined by that gene frame.

6

MetaCyc-Specific Information

MetaCyc describes the union of pathways across a range of different organisms.

6.1

Database Searching Strategies for MetaCyc

The database searching strategies discussed below are unique to curating MetaCyc since MetaCyc is currently the only PGDB covering metabolic pathways from multiple species. Section 4.5 above should be reviewed in addition to this section prior to embarking on database searching. While researching a pathway you may find that the pathway was studied in many different species, just a couple, or maybe only one. If the pathway has been studied in more than one species, choose the model species in which the pathway has been studied the most and thereby is best defined. This way you can build the most complete picture of a pathway. Pathways in MetaCyc may be associated with one or more species. In order for a species to be associated with a pathway, each of the reactions composing the pathway must have been either experimentally demonstrated or hypothesized to occur in the given species. Enzymes in MetaCyc, on the other hand, are associated with only one species because it is assumed that the same enzyme in different species will have slightly different properties. For an enzyme to be associated with a pathway (i.e. displayed in the pathway navigator window over the reaction it catalyzes) the enzyme and the pathway both must be associated with the same species. This further emphasizes the reason for studying all of the reactions and enzymes for a pathway in one particular species. It is a greater priority to add as many different pathways to MetaCyc than it is to find all of the species in which a given pathway is found. However, if during your research you discover that many species have the same pathway, list all of the species for that given pathway. To reduce redundancy in MetaCyc, the same pathway in multiple species is only entered once. If you find that other species use a slightly different pathway, a new pathway can be created for these species. For example, you may note that the glutamate fermentation-the hydroxyglutarate pathway includes a list of several species in which the pathway is found, but it only lists one example for each enzyme. The priority was to find enzyme information for one of the species listed in the pathway for each of the pathway reactions, and not to include enzyme information for each of the species.

6.2

Species Information

Unlike the EcoCyc DB, which describes pathways for E. coli only, MetaCyc describes pathways for many different organisms. When entering a pathway into MetaCyc, you should record the one or more species in which the pathway is known to occur in the Species slot of the pathway, which can be done using the Species field of the Pathway Info Editor. The list of species recorded for a pathway should be 29

considered a representative list, not necessarily an exhaustive list of all species in which the pathway occurs. When creating the enzymes that catalyze steps in the pathway, specify the species from which the enzyme was obtained. Each enzyme in MetaCyc should define a protein from a single species. You may create enzymes that catalyze a reaction even if those enzymes are from other species besides those in which the pathway occurs, but those enzymes will not be displayed within the pathway diagram. 6.2.1

Taxonomic Range Information

Should the curator be reasonably certain that a pathway should be expected to occur in a limited set of taxa (e.g., plants only, or animals only) a high-level taxonomic classification for its expected taxonomic range should be entered in the Taxonomic-Range slot of the pathway frame by using the Expected Taxonomic Range field in the Pathway Info Editor. Here you can limit the taxonomic range of organisms in which the pathway occurs by selecting the appropriate taxonomic domain(s) or subdomain(s). You can determine the most specific higher level domain for each genus and species by using the NCBI Taxonomy Browser. The higher level domain will usually be a phylum, or class. If the species is not present in the NCBI taxonomy, select only the highest level, a superkingdom (Archaea, Bacteria, or Eukaryota). Only enter this information if you are reasonably confident that the distribution of the pathway will be limited. The Taxonomic-Range slot is primarily intended for use by the PathoLogic pathway prediction program during generation of new PGDBs. By assigning an expected taxonomic range to the organisms in a MetaCyc pathway, the domains, or subdomains that are represented in the expected taxonomic range can be compared with that assigned to the organism in the new PGDB. PathoLogic could then assign a lower probability score to the pathway if the organism for the PGDB is not in the expected taxonomic range of the MetaCyc pathway. This form of reasoning can help to exclude inappropriate pathways from the new PGDB. Expected Taxonomic Range also gives MetaCyc users a taxonomic perspective of the pathway that they might not otherwise have if they are not familiar with some of the organisms in the species slot.

6.3

E. coli Pathways in MetaCyc

MetaCyc contains many E. coli enzymes and pathways. Pathways and enzymes of E. coli K–12 are automatically imported from EcoCyc into MetaCyc periodically. To ensure that updates are not lost as a result of that automatic import process, please observe these rules when updating E. coli K–12 enzymes and pathways in MetaCyc: • Please consider EcoCyc to be the authoritative source for K–12 enzymes. Do not update K–12 enzymes in MetaCyc. Update them in EcoCyc. • For K–12 pathways, the only slots you should update in MetaCyc are: – Species, Summary, Common-Name, Synonyms, Citations, History You should not change the pathway topology (set of reactions within a pathway or their relative arrangements). The most likely changes to a K–12 pathway in MetaCyc is extending that pathway to another species.

30

6.4

Pathways from Other Pathway Databases

When entering pathways into MetaCyc from other pathway DBs, please: • Enter a citation in the Citations slot of the pathway, for the appropriate DB • Enter a web link from the MetaCyc pathway to the other pathway DB • Describe in the summary field of the pathway any changes that you made to the pathway, and why

6.5

Proteins as Substrates in MetaCyc

Some reactions in MetaCyc pathways involve protein substrates such as acyl carrier protein (ACP) or thioredoxin. These protein substrates should be entered as protein class frames within MetaCyc. The protein class frames are generic, species-independent descriptions of these proteins. The reason for this approach is that a number of different MetaCyc reactions may reference a particular protein as a substrate, but those reactions may be from different pathways in different species, leading to confusion as to exactly which version of the protein is intended. In addition, when MetaCyc pathways are predicted in organism-specific PGDB’s, these reactions are now used in the context of this specific organism, and must not refer to protein instances from other organisms (e.g. a human pathway utilizing an E. coli protein as a substrate). If organism-specific instances of such proteins exist in MetaCyc (e.g. specific E. coli thioredoxins), these instances should reside within the appropriate protein class. This will enable the user to see a list of all such instances by clicking on a protein substrate in a pathway diagram.

6.6

Curation with Classification Systems

PGDBs contain several classification systems to which individual objects can be assigned. For example, the classification system for pathways divides all metabolic pathways into over a 100 different categories and sub-categories, such as biosynthesis, biosynthesis of amino acids, and biosynthesis of carbohydrates. Individual pathways can be assigned to those classes to facilitate retrieval and drilldown by users into categories and sub-categories. Classification systems exist for pathways, chemical compounds, reactions, and genes. In some cases it is appropriate to assign an object to more than one class, such as in the case of a multifunctional protein. When updating an object, please consider whether its assignment within a classification system should be revised. 6.6.1

Gene Classes

The gene class hierarchy is the MultiFun system developed by Monica Riley and her colleagues [5]. The gene class hierarchy has several classes specific for classification of uncharacterized genes. The general gene class “ORF” has three child classes as of the most recent updates to the hierarchy (in July 2003): 31

1. conserved ORF 2. conserved hypothetical ORF 3. nonconserved ORF As we describe them to the users: “Conserved ORFs have homologs, usually in other organisms, where one or more of those homologs have known functions, but the sequence similarity of a conserved ORF to its homologs is not strong enough to permit inference of function of the conserved ORF. In contrast conserved hypothetical ORFs have homologs, but none of those homologs have known functions.” The remaining category, nonconserved ORF, is used to refer to genes that do not have assigned gene class functions and which do not have significant similarity to other genes. As guidelines for naming of the products of these uncharacterized genes, we suggest that a product of a “conserved ORF” will be called “conserved protein,” a product of a “conserved hypothetical ORF” will be called “conserved hypothetical protein,” and a product of an “nonconserved ORF” will be called “hypothetical protein.” In cases where some additional information has been recorded, but this information is insufficient for prediction of protein function, additional information may be appended to the standard name; for example, “conserved protein with an unusual xyz domain.” The explanations of selected gene class definitions are provided by Dr. Gretta Serres: • The “cell structure” category contains both genes encoding components of the structure and genes encoding products involved in the biosynthesis of the structure. • The terms “trigger” (3.4) and “modulator” (3.5) refer to specific compounds/proteins that either trigger a response or modulate a response. I, for example, in the case of lacI, lactose acts as the trigger and CRP acts as a modulator of the regulation. • Any transcriptional activator or repressor should be assigned a class of 3.1.2.2 or 3.1.2.3, or both for dual regulators. • The “adaptation to stress” class is intended to cover the classical inducers of stress such as osmotic pressure, temperature, and starvation. • The “protection” class is intended to cover those inducers related to cell killing or stress induced by compounds or chemicals. • The “defense/survival” category is intended to cover virulence factors. • The “ extrachromosomal origin” category is intended to include chromosomal genes affecting functions of genes of extrachromosomal origin as well as genes that are of extrachromosomal origin.

7

Organism Summary

The organism summary display differs from most of the other displayed information in that it is not computed directly from the DB at runtime. Rather, it is created by a part-manual, part-automated procedure, and the resulting statistics information is cached to allow fast re-display. The information for every organism lives in a separate DB. Each DB has a class Organisms, which tends to have exactly one instance in a given DB, namely the frame that describes the organism. The frame name for that frame is the PGDB unique ID for the organism, such as ECOLI for E. coli. In its Genome slot, that frame will point to the genetic elements, which often is just one chromosome.

32

Both the organism and chromosome frames have a slot called Cached-Statistics, in which the values are stored that will be displayed in the organism summary. For example, the numbers of genes coding for proteins and coding for RNAs are stored. Most of these values are filled in automatically by running a stats-collecting function over the DB.

8

Update Propagation among DBs

Many DB updates are non-species-specific, and therefore are intended to apply to all DBs. It is convenient to make these changes only once in one DB, and rely on an automated mechanism for propagating such updates to all the other DBs. In general, it is recommended that all such updates be performed in MetaCyc. Since this is not always convenient, the update propagation mechanism will take care of propagating certain kinds of updates from clone DBs (including EcoCyc) “up” to MetaCyc. However, all schema changes must be performed in MetaCyc — otherwise they will be undone during update propagation. For instance-level updates, update propagation uses the transaction logs collected in Oracle to determine when one updated value is more recent than another, to determine in which direction the update should be performed. Update propagation is invoked manually and, for the time being, relatively infrequently. It is a timeintensive operation, which can take on the order of a half hour per DB. The update propagation mechanism does not save any of the updated DBs — it is up to the user to examine the output to make sure that it looks reasonable, and to save the affected DBs.

8.1

Invoking DB Updating

To invoke the update propagation mechanism, call (propagate-kb-updates). This is the normal mode of update propagation and will cause updates to be propagated among all known DBs. Alternatively, to propagate only among a subset of DBs, the above function can be called with a list of orgkb-descriptors (an orgkb-descriptor can be found by calling find-org on the org-id). For example, to propagate updates only between EcoCyc and MetaCyc ( MetaCyc is automatically always included in update propagation), you would type (propagate-kb-updates (list (find-org ’ecoli))). Update propagation should be done with care, and the output examined fairly carefully before any DBs are saved (update propagation does not automatically save the changed DBs). There may still be bugs in the mechanism, and we want to take pains to avoid losing any data or undoing important changes.

8.2

Overview of the Updating Procedure

Update propagation involves the following sequence of events: • Any schema classes (classes can be explicitly specified to be either schema or data classes) in MetaCyc that are not present in the clone DBs are copied to the clones. • All of MetaCyc’s slotunits are copied over to all other clones. This will overwrite the corresponding slotunits in the clone DBs. Slotunits present in a clone DB but not in MetaCyc will not be deleted, but a warning will be printed.

33

• All instances of the Databases class will be copied from MetaCyc to all clone DBs, overwriting the existing frames if they exist. • The class hierarchy of each clone DB is “molded” to match that of MetaCyc, i.e. each class will have the same set of parents in all DBs. This operation doesn’t copy any classes, it just potentially alters the parents of existing classes. • For each clone DB, we loop through all instances of designated classes, looking at certain slots to see which updates need to be propagated “upwards” to MetaCyc. The set of considered classes and slots is listed below. The mechanism of this is as follows: – If a frame in a clone DB is not present in MetaCyc, and either it has never been deleted from MetaCyc or the creation date in the clone DB is more recent than the deletion date in MetaCyc, then the frame is copied to MetaCyc. – If the frame’s parents are different in the clone DB and MetaCyc, and the last change to the parents is more recent in the clone DB than in MetaCyc, then the parents in MetaCyc are changed to match those in the clone DB. Any parents not present in MetaCyc that are necessary for this operation (presumably, these are all data classes) will first be copied to MetaCyc. – If, for each considered slot, the slot values and/or annotations are different between the two DBs and the modification date is more recent for the clone than for MetaCyc, then all slot values and annotations (except for citation and comment annotations, which are often species-specific) for that slot in that frame are copied from the clone DB to MetaCyc. While looping through these frames, we collect a list of frames and slots that either were updated in MetaCyc or that differ between the two DBs (and were not updated because the MetaCyc changes were more recent). This list is used in the next phase of the update. • After updates have been propagated “upwards” from all clones to MetaCyc, we iterate through all the clone DBs again, propagating changes “downwards” from MetaCyc to the clones: – We “mold” all instances of the designated classes in the clone DB so that each instance’s parents match those in MetaCyc (because any more recent changes in parentage in any of the clone DBs have already been copied to MetaCyc, we cannot lose any changes here). If an instance exists in a clone DB, but its new parents do not (these would have to be data classes, otherwise they would already have been copied), we copy those parents over from MetaCyc. – We iterate through the list that we have collected of frames and slots that either have been updated in MetaCyc or potentially need updating in one or more clones, comparing values between the clone and MetaCyc. If there is any difference, the values and annotations (except for citation and comment annotations, which are often species-specific) are copied over from MetaCyc to the clone. If the frame does not exist in MetaCyc, then it must have been deleted recently (otherwise it would already have been copied there), so it is deleted from the clone DB. – For all slot values that are copied to a clone DB and that represent frames in MetaCyc, we check to see if the corresponding frame exists in the clone DB. If not, we copy that frame to the clone DB also. The exception is if the frame is a modified protein, and the unmodified form is not present in the clone DB either, in which case we do not import either frame.

34

While update propagation is being performed, messages about any changes are printed to the terminal window. The user should examine these messages to make sure that the changes seem reasonable before saving the DBs. It is also good practice to save this output to a file in the $GPROOT/ecocyc/metacyc/released/kb/ directory, so that it can be examined in case changes have been lost or updates fail to be propagated (more so that we can determine if there is a bug in the propagation code rather than to recover any data, which can be retrieved from the change logs stored in Oracle). One known failing of the update propagation mechanism is that if a different change has been made to some slot of a frame in two different clone DBs (but not in MetaCyc), then the value that prevails is that from the last clone in the orgkb list (i.e. when updates are propagated upward from a clone, they may overwrite updates previously propagated upward from earlier clone DBs), and then of course that value is propagated downwards to all other clones. It is largely to avoid this problem that we recommend that the majority of changes be performed in MetaCyc rather than in any of the clones. The classes and slots for which update propagation is done are listed below. This set of classes and slots is currently hard-coded into the update propagation code. • Compounds: common-name, synonyms, overview-node-shape, smiles, pka3, pka2, pka1, structure-atoms, structure-bonds, superatoms, systematic-name, n-1-name, n+1-name, molecular-weight, comment, citations, dblinks, chemical-formula, display-coords-2d, charge, casregistry-numbers, atom-chirality, atom-charges, gibbs-0, and aromatic-rings. • EC-Reactions and Unclassified-Reactions (i.e. this excludes Transport-Reactions, which are not propagated): common-name, synonyms, requirements, right, left, substrates, dblinks, deltago, ec-number, ec-number-old, and official-ec?. • Super-Pathways: common-name, synonyms, primaries, polymerization-links, layout-advice, netreaction-equation, class-instance-links and disable-display. • All Pathways except for Super-Pathways (which are covered above) and 2Comp-Pathways (which are not propagated): reaction-list, predecessors, common-name, synonyms, primaries, polymerization-links, layout-advice, net-reaction-equation, class-instance-links and disabledisplay.

9

Database Release Process

The database release process is performed by BioCyc project staff at SRI. At each release, SRI curators will need to run the Correctify program and fix any errors identified, update the database release notes, and update general database information that is displayed to the users.

9.1

The Consistency Checker

The Consistency Checker is a utility that performs various checks and corrections of the data. This program is run by SRI prior to each release, and should be run by PGDB curators periodically. The Consistency Checker builds indexes, checks citation format, checks correspondence between polypeptides and genes, changes compound names to the corresponding frame ID in some slots, updates references to compound name strings to refer to the corresponding frames, checks that physiological

35

regulators are on the master list of regulators, checks various cross-references, checks that all reactions listed within pathways exist, checks pathway links and removes obsolete ones, checks various types of reaction information, checks for enzymatic reactions that are lacking a link to the reaction or the enzyme, checks compound structure information and removes redundant bonds, calculates suband superpathways, checks links to modified proteins, checks computed slot values, verifies replicon components and positions, and, finally, checks various constraints. To run the program, open the database (e.g., EcoCyc or MetaCyc) in the Pathway Tools Navigator and select Consistency Checker from the Tools menu. The interface is divided to two main panes - automatic tasks and manual tasks. The automatic tasks find and fix the problems automatically (although the changes to the DB are not saved until the user invokes the Save command), while The manual tasks require user intervention. All tasks in each of the panels can be selected simultaneously, or the user may choose to run each task individually. when running the manual tasks, problematic objects (such as badly formed references) are printed in the right panel of the interface, and can be easily opened in the main Navigator window by clicking on them within the Consistency Checker interface. Once opened, the curator can inspect the object, analyze the cause of the problem, and repair it within the main window. The process can be repeated as many times as required, until the errors have all been addressed. Remember to save the changes in the Navigator. The Consistency Checker generates log files each time it is invoked. The log files are saved in the PGDB reports directory, and the exact path is printed within the Consistency Checker interface. Changes performed automatically by the Consistency Checker may show up in the history record displayed by invoking the Show Changes command in the editor software. This can cause confusion while trying to understand the history of a particular object, and curators are advised to keep in mind the possibility that a series of otherwise mysterious changes to an object may be due to the Consistency Checker, rather than intentional manipulations of the object in question.

9.2

Release Notes

Updates to the EcoCyc and MetaCyc release notes are made by EcoCyc and MetaCyc curators at SRI. 9.2.1

Database Statistics

The (kb-stats) and (new-pathways) queries are used to gather database statistics for the release notes. If all updates to the database prior to the release have been completed, and the database has not yet been saved to a file, open EcoCyc in the Pathway Tools Navigator and then select Exit to get back to the Lisp prompt. At the Lisp prompt type (kb-stats :summary? t) and let it run. The output generated will contain the statistics that we record in the release notes, which are displayed on our web site. It is helpful to save the output to a file for future reference. To generate a list of new pathways entered since the last release, type (new-pathways :spreadsheet? t) and let it run. The output will provide the list of new pathways, in a format that makes it easier to process for generation of the pathway list in the release notes, which includes url’s for all pathways . Remember to list superpathways separately.

36

After the new pathway list has been gathered, only once a release, run (new-pathways :reset? t ) to re-set the pathway list. This command cannot be run from a Windows installation. It is advisable to save a copy of the output of each of these programs for reference. The statistics may be gathered after the database has been frozen and saved to a file, even after curation into the MySQL database has resumed. At a Lisp prompt, type (select-organism :org-id ’insert organism here :version “insert version number here” :dbms-type :file) with the correct organism and version number inserted, for example, (select-organism :org-id ’ecoli :version “7.5” :dbms-type :file) Then, at the next Lisp prompt, type (kb-stats :summary? t) or ((new-pathways :spreadsheet? t) and proceed as previously described. The statistics will be computed using the file, which includes the data that will be displayed in the new released version of that database. 9.2.2

Updates to the Notes

The text of the release notes is found in the file ~brg/website/pgdb-release-notes-beta/[KB-NAME]/release-notes.shtml for example, ~brg/website/pgdb-release-notes-beta/metacyc/release-notes.shtml The file should not be edited directly. Rather, it should be checked out of cvs, updated locally, and committed. When preparing a new version of the file, copy the format of previous sections; list the names of newly curated pathways, modified pathways, and other significant improvements. Use a horizontal rule (HTML command ¡hr¿) to separate notes from different releases. MetaCyc curators use an Excel spreadsheet to automate the formatting of the modified pathways section.

9.3

Updates to the General DB Information

There is a special frame that contains general information about each database, such as the list of authors, the copyright information, and citation information. This information should be checked and updated, if necessary, prior to each release. To update this information, edit the frame using the Frame Editor. At a Lisp prompt type (efr ’ecoli) or (efr ’meta) to open the EcoCyc ansd MetaCyc information frames, respectively.

10

Programming Hints • Frame fetching hangs: (gfp::reset-prefetching)

If frame fetching hangs,

try breaking and then calling

References [1] P.D. Karp, S. Paley, C.J. Krieger, and P. Zhang. An evidence ontology for use in pathway/genome databases. In R. Altman and T. Klein, editors, Proc Pacific Symposium on Biocomputing, pages 190–201, Singapore, 2004. World Scientific. 37

[2] P.D. Karp, S. Paley, and P. Romero. The Pathway Tools Software. Bioinformatics, 18:S225–S232, 2002. [3] P.D. Karp, M. Riley, S. Paley, and A. Pellegrini-Toole. The MetaCyc database. Nuc Acids Res, 30(1):59–61, 2002. [4] P.D. Karp, M. Riley, M. Saier, I.T. Paulsen, S. Paley, and A. Pellegrini-Toole. The EcoCyc database. Nuc Acids Res, 30(1):56–8, 2002. [5] M.H. Serres and M. Riley. MultiFun, a multifunctional classification scheme for Escherichia coli K–12 gene products. Genome Biol., 5(4):205–222, 2000.

38