
www.rsc.org/molecularbiosystems | Molecular BioSystems

Identity and divergence of protein domain architectures after the yeast whole-genome duplication event†

Luigi Grassi,‡ab Diana Fusco,‡§c Alessandro Sellerio,‡¶c Davide Corà,∥ab Bruno Bassetti,cd Michele Caselle abe and Marco Cosentino Lagomarsino*††c

Downloaded on 29 December 2010 Published on 01 November 2010 on http://pubs.rsc.org | doi:10.1039/C003507F

Received 26th February 2010, Accepted 29th June 2010
DOI: 10.1039/c003507f

Gene duplication is a key mechanism in evolution for generating new functionality, and it is known to have produced a large proportion of genes. Duplication mechanisms include small-scale, or “local”, events such as unequal crossing over and retroposition, together with global events, such as chromosomal or whole-genome duplication (WGD). In particular, different studies confirmed that the yeast S. cerevisiae arose from a 100–150 million-year-old whole-genome duplication. Detection and study of duplications are usually based on sequence alignment, synteny and phylogenetic techniques, but protein domains are also useful in assessing protein homology. We develop a simple and computationally efficient protein domain architecture comparison method based on the domain assignments available from public databases. We test the accuracy and the reliability of this method in detecting instances of gene duplication in the yeast S. cerevisiae. In particular, we analyze the evolution of WGD and non-WGD paralogs from the domain viewpoint, in comparison with a more standard functional analysis of the genes. A large number of domains are shared by genes that underwent local and global duplications, indicating the existence of a common set of “duplicable” domains. On the other hand, WGD and non-WGD paralogs tend to have different functions. We find evidence that this comes from functional migration within similar domain superfamilies, but also from the existence of small sets of WGD and non-WGD specific domain superfamilies with largely different functions. This observation gives a novel perspective on the finding that WGD paralogs tend to be functionally different from small-scale paralogs. WGD and non-WGD superfamilies carry distinct functions.
Finally, the Gene Ontology similarity of paralogs tends to decrease with duplication age, while this tendency is weaker or not observable by the comparison of the domain architectures of paralogs. This suggests that the set of domains composing a protein tends to be maintained, while its function, cellular process or localization diversifies. Overall, the gathered evidence gives a different viewpoint on the biological specificity of the WGD and at the same time points out the validity of domain architecture comparison as a tool for detecting homology.

a Università degli Studi di Torino, Dip. Fisica Teorica, Via Giuria 1, 10125 Torino, Italy
b Center for Complex Systems in Molecular Biology and Medicine, University of Torino, Via Accademia Albertina 13, 10100 Torino, Italy
c Università degli Studi di Milano, Dip. Fisica, Via Celoria 16, 20133 Milano, Italy. E-mail: [email protected]
d I.N.F.N., Milano, Via Celoria 16, 20133 Milano, Italy
e I.N.F.N., Torino, Via Giuria 1, 10125 Torino, Italy
† Electronic supplementary information (ESI) available: Supplementary figures and tables and the pseudocode description of the three algorithms for the homology assignment. See DOI: 10.1039/c003507f
‡ Equal contribution.
§ Current address: Duke University, Program in Computational Biology and Bioinformatics, North Building, 470 Research Drive, Durham NC 27708.
¶ Current address: École Polytechnique Fédérale de Lausanne, FSB IPMC GSM, CH-1015 Lausanne, Switzerland.
∥ Current address: Institute for Cancer Research and Treatment, Strada Provinciale 142 km 3,95 Candiolo, 10060 Torino, Italy.
†† Current address: Genomic Physics Group FRE 3214 CNRS “Microorganism Genomics” and Université Pierre et Marie Curie, Campus des Cordeliers, 15 rue de l’École de Médecine, 75006 Paris.

This journal is © The Royal Society of Chemistry 2010

Mol. BioSyst., 2010, 6, 2305–2315

1. Introduction

Genomes possess a high degree of redundancy in the information they encode.1–5 Considering protein-coding genes in eukaryotes, most of this redundancy is due to duplication events.6 Such duplications can involve a fraction of a genome (e.g. individual genes, genomic segments, whole chromosomes) or a whole genome.7,8 More specifically, S. cerevisiae has arisen from an ancient whole-genome duplication, which differentiated the genus Saccharomyces in a way observable by the existence of 2:1 synteny maps (e.g. comparing S. cerevisiae with K. waltii9 or A. gossypii).10 About 10% of the genes originally duplicated by this event have been retained and are observable today as paralogs.

Proteins descending from a common ancestor (homologs) are usually identified by sequence alignment and synteny methods. The main limitation of such methods is that they do not take into account the protein’s fold, which tends to be conserved for longer evolutionary times than sequence identity. On the other hand, structure and function of proteins can be represented on a coarser scale, considering protein domains, the modular substructures defined by folding,11 compact structure,12 function and evolution.13 Protein domains and their homology relationships are available in many databases. There are several definitions for domains. We will be mainly interested in domain superfamilies, which are composed of domains closely related in terms of structure, function and sequence. These are available in domain databases such as SCOP14 and CATH.15 Genome assignments for SCOP domains are available in the SUPERFAMILY database, where hidden Markov models (HMMs), made for members of each SCOP superfamily, are matched with all sequences present in sequenced genomes.16 At the sequence level, domain families are composed of domains whose sequences have significant similarities. They are collected in databases such as ProDom17 and Pfam.18 Identity or similarity of proteins in terms of their domain architecture, defined as the sequential order of domains in their sequences, is a powerful tool to identify homology with little computational effort.19–21 The study of domain architectures offers an interesting perspective to follow evolutionary trajectories of biological processes.22,23

Here, we focus on two main questions about domain architectures and their evolution. The first question concerns the reliability of homology detection using the available domain assignments. The second question is whether protein architecture provides insight into the evolution of paralogs. In detail, we used the domain architecture information combined with functional annotation for the characterization of paralogs derived from global and local duplications at different dates. We addressed the first question by implementing a simple algorithm to detect homology via SUPERFAMILY24 domain assignments. Then we compared the results with the ones obtained by sequence alignment/synteny methods.
The assignment of protein homology through the comparison of their architecture has two main limitations: (i) the partial coverage of domain assignments; (ii) the correct assignment of homology in the case of multi-domain proteins that have a partial overlap of architecture (i.e. correctly distinguishing unrelated proteins from homologous proteins that have been differentiated by recombination events). For these reasons, the choice of homology criterion implies a trade-off between error tolerance and the rate of false positive homologs. We tested different criteria on WGD paralogs in S. cerevisiae by comparing the structural domain architecture, taking into consideration their K. waltii orthologs.

The second question arises from the fact that gene duplications provide raw material to develop new functions. In particular, it is interesting to ask how the whole-genome duplication event reshaped the yeast genome in a distinct way from small-scale duplications and how this is reflected by the domain superfamilies found in WGD paralogs versus local paralogs. We used our method to evaluate differences between pre-WGD, WGD and post-WGD paralogs, and performed a parallel comparison by using a Gene Ontology enrichment analysis. Both analyses converge on the conclusion that whole-genome and small-scale paralogs tend to be functionally different, corroborating existing evidence.25,26 On the other hand, the study of domain superfamilies of paralogs gives a different perspective on this finding. We find that since domains of paralogs are essentially maintained, the functional dichotomy can be created by two main factors. The first one is a significant functional difference between domain superfamilies that are exclusively found in WGD paralogs and domain superfamilies found exclusively in small-scale paralogs. The second one is that paralogs tend to change their subcellular localization and their specificity for given biological pathways, while conserving their domain architectures.

2. Results

2.1 Homology assignment by domain characterization

In order to study homology from the domain viewpoint, we implemented three homology criteria based on the main domain rearrangements that occur in the evolution of protein architectures.27 Additions of new domain(s) from the same family as one of the adjacent domains lead to the formation of repeats, whereas insertions are additions of new domain(s) from a family/families other than the adjacent domains. Domains may also be deleted or exchanged (Fig. 1). Criterion A defines two proteins as homologs if they have identical architectures (i.e. they contain the same domains in the same order). Criterion B requires the conservation of the order of domains in the two proteins, allowing multiple repetitions of one or more domains. The classes of homology defined by this criterion include the ones defined by criterion A, plus homologs affected by protein rearrangements that led to domain repeats. Finally, two proteins are identified as homologs by criterion C if their architectures are equal, or if one of them is an approximate repetition of the other. This criterion, allowing domain deletion/insertion, is the least restrictive of the three and is the most tolerant of differences in domain assignments arising from incomplete knowledge. In the supplementary materials† we describe the pseudocode for the algorithmic implementation of the three criteria.

Fig. 1 Cartoon of the domain-based methods used to classify and compare proteins. We compared protein pairs based on their SUPERFAMILY domain architectures, to produce homology assignments using three different homology criteria (A, B, C, lower-left panel), and two different similarity scores for the two architectures (domain score and architecture score, lower-right panel).

We compared homology classes generated by the three criteria with those defined by sequence alignment/synteny methods. This test was divided into two different steps. First, we evaluated the fraction of homologs identified for the WGD (by Kellis et al.)9 and by general sequence alignment/synteny methods (Ensembl Compara)28 that are also identified by criteria A, B, and C. The results of this analysis are shown in Table 1. These results confirm the efficiency of domain-based classifications in detecting evolutionary relatedness among proteins (as observed in ref. 19–21,29). More specifically, even the most stringent homology criterion, A, is able to find the majority of triplets (72%), pairs (67%) and Ensembl Compara homology classes (64%). The other criteria perform better; in particular, criterion C retrieves more than 90% of the information in blocks of conserved synteny. The results indicate that this method detects orthologs and paralogs with similar efficiency. In order to test the efficiency of the homology assessment in the presence of multi-domain proteins, we repeated the analysis separately for single- and multi-domain proteins. We found a small quantitative difference (less than 10%) between the two results. We also tested, with a hypergeometric test, whether multi-domain proteins are enriched among the Ensembl Compara paralogs not detected by our method. The P-values were not significant for any of the tests reported in Table 1. Second, we quantified the fraction of paralogs not recovered by Ensembl Compara for each paralogy class defined by the three homology criteria (Fig. 2). All three criteria define a significant fraction of classes that are not recovered by Ensembl Compara.
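The three homology criteria of Section 2.1 can be illustrated with a minimal Python sketch. This is not the pseudocode from the supplementary materials. It assumes that an architecture is a tuple of superfamily identifiers (the identifiers used below are hypothetical), reads criterion B as equality after collapsing adjacent repeats, and approximates criterion C, which the text specifies only loosely, as a subsequence match after collapsing repeats.

```python
from itertools import groupby


def collapse_repeats(arch):
    """Collapse runs of adjacent identical domains: (a, a, b) -> (a, b)."""
    return tuple(domain for domain, _ in groupby(arch))


def is_subsequence(short, long):
    """True if `short` can be obtained from `long` by deleting domains."""
    remaining = iter(long)
    # `domain in remaining` advances the iterator, so order is enforced.
    return all(domain in remaining for domain in short)


def criterion_a(x, y):
    """Criterion A: same domains in the same order."""
    return tuple(x) == tuple(y)


def criterion_b(x, y):
    """Criterion B (assumed reading): same order, repeats allowed."""
    return collapse_repeats(x) == collapse_repeats(y)


def criterion_c(x, y):
    """Criterion C (assumption): one collapsed architecture is a
    subsequence of the other, i.e. they differ by insertions/deletions."""
    cx, cy = collapse_repeats(x), collapse_repeats(y)
    if not cx or not cy:
        return False  # proteins without assigned domains are not compared
    short, long = sorted((cx, cy), key=len)
    return is_subsequence(short, long)


# Hypothetical architectures written as superfamily identifiers.
p1 = ("sf1", "sf2")
p2 = ("sf1", "sf1", "sf2")   # p1 plus a repeat of sf1
p3 = ("sf1", "sf3", "sf2")   # p1 plus an inserted domain
print(criterion_a(p1, p2), criterion_b(p1, p2), criterion_c(p1, p3))
```

Under these assumptions every pair accepted by A is also accepted by B, and every pair accepted by B is also accepted by C, which mirrors the nesting of homology classes described in the text.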
Note that criteria A and B follow qualitatively similar trends and produce a small fraction of partially covered classes, while criterion C has a larger number of partially covered classes, essentially because classes produced with this criterion are very large. Criterion B is the most efficient of the three criteria in recovering Ensembl Compara paralogy relations. Fig. 2 shows the limitations of both criteria A and C. The former, being more restrictive, builds small homology classes and consequently, the probability that a whole class is not recognized by Ensembl Compara is higher. The latter builds wide homology classes associating distant homologs. The consequence is that the classes built with criterion C will almost certainly contain some Ensembl Compara homologs, as shown by the small number of classes that are not recovered. On the other hand, the same classes do not exclusively contain Ensembl Compara homologs and are rarely completely covered.

Table 1 Comparison of classes obtained with domain-based homology criteria and homology classes built with WGD paralogs and their orthologs9 (upper panel) and paralog relations provided by Ensembl Compara (lower panel, ref. 28). For both tables, the first row shows the number of genes in the reference homology classes. The second row contains the result of the intersection of these data with the architecture databases. The following three rows report the total and the relative fraction of the number of triplets and pairs found in the homology classes with criteria A, B, and C.

Upper panel (WGD homology classes, Kellis et al.)

Homology criterion   Triplets (total)   Triplets (%)   Pairs (total)   Pairs (%)
Kellis et al.        457                -              2609            -
Overlap              289                100%           1099            100%
Criterion A          207                72%            734             67%
Criterion B          239                83%            836             76%
Criterion C          270                92%            1010            91%

Lower panel (Ensembl Compara)

Homology criterion   Ensembl Compara classes (total)   (%)
Ensembl              672                               -
Overlap              470                               100%
Criterion A          301                               64%
Criterion B          347                               74%
Criterion C          403                               86%

Fig. 2 Fraction R of domain-based homology classes not scored by Ensembl Compara. All homology classes are ranked by R on the x-axis. The different lines in the plot refer to homology criteria A (black solid line), B (red dashed line) and C (green dash-dotted line), defined in the text (see Fig. 1). Criterion A has the highest number of classes entirely not covered by Ensembl Compara classes. Criterion C, while having the lowest number of classes entirely not covered, also has the lowest rate of entirely covered ones.

2.2 Domain architecture evolution in WGD and non-WGD paralogs

Duplicate gene pairs may undergo an altered selective regime that leads to an asymmetrical evolution of the proteins acting at different levels. Indeed, genes at fixation may evolve in different ways, depending on the divergence process and the nature of the duplication.30 Among the possibilities, there is a process by which one copy maintains the original function, and thus is constrained by selection. This constraint on the first copy leaves the other copy free to evolve, as originally hypothesized by Ohno31 and supported by evidence in yeast.5,32,33 However, theoretical and experimental work34,35 argues that both paralogs can evolve independently at the same rate. We tested the consequences of these processes at the domain level.

The length of an architecture is defined as the total number of domains that form it. In order to globally quantify the changes in protein architectures, we introduced two scoring methods able to evaluate relatedness between architectures. The first, called the “domain score”, is the number of domains shared by both proteins divided by the mean number of domains of the two proteins (i.e. the “Dice coefficient”36 of the two architectures). This coefficient is similar to the Jaccard coefficient employed by Jin et al.23 in comparing protein domain composition. The Jaccard coefficient is less significant for proteins with a small number of shared domains. Note that the occurrence of convergent evolution of domain architectures is rare.37 The second scoring method we introduced is the “architecture score”. This score is defined as the length of the longest exactly matching sequence of domains between two architectures divided by the mean length of the two architectures (see Experimental and Fig. 1).

For each WGD triplet, we compared the two paralogs in S. cerevisiae with the corresponding K. lactis ortholog, detecting the best and worst matches. This was done using the domain score and the architecture score between both paralogs and their K. lactis ortholog. In the analysis, the protein architectures were retrieved from two independent databases, Pfam-A18 and SUPERFAMILY.24 The choice of these two databases aims at combining the reliability of a manually curated database (Pfam-A) with the higher coverage of an automatically generated database (SUPERFAMILY). Automatically generated databases can include potential annotation artifacts and domains which have drifted apart. Beaussart et al.38 describe a method to correct the missing relationships among the domains in automatically generated databases. In accordance with this method, we did not consider the paralogous pairs with uneven architecture, where domain loss/gain is due to a very divergent low-scoring domain, below the detection threshold. Table 2 shows the fraction (F2) of WGD triplets in which both S. cerevisiae paralogs have identical domain (or architecture) scores to their single-copy ortholog in K. lactis. Furthermore, we called F1 the fraction of triplets in which only one of the two S. cerevisiae paralogs gives a domain (or architecture) score equal to one, when compared to the corresponding K. lactis ortholog.
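The two similarity scores and the F2/F1 classification of triplets can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: architectures are tuples of domain identifiers (hypothetical ones below), shared domains are counted as a multiset intersection so that two identical architectures score exactly one, and a triplet is taken to be F2 when both paralogs score one against the ortholog and F1 when exactly one does.

```python
from collections import Counter


def domain_score(x, y):
    """Dice-like score: shared domains / mean number of domains.
    Shared domains are counted with multiplicity (an assumption)."""
    if not x or not y:
        return 0.0
    shared = sum((Counter(x) & Counter(y)).values())
    return shared / ((len(x) + len(y)) / 2)


def architecture_score(x, y):
    """Length of the longest common contiguous run of domains,
    divided by the mean length of the two architectures."""
    if not x or not y:
        return 0.0
    best = 0
    prev = [0] * (len(y) + 1)
    for dx in x:
        curr = [0] * (len(y) + 1)
        for j, dy in enumerate(y, start=1):
            if dx == dy:
                # Extend the run that ended at the previous positions.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best / ((len(x) + len(y)) / 2)


def classify_triplet(paralog1, paralog2, ortholog, score=domain_score):
    """Return 'F2', 'F1' or 'neither' (the last label is hypothetical)."""
    perfect = sum(score(p, ortholog) == 1.0 for p in (paralog1, paralog2))
    return {2: "F2", 1: "F1", 0: "neither"}[perfect]


# Hypothetical example: one paralog has lost the middle domain.
ortholog = ("sf1", "sf2", "sf3")
kept = ("sf1", "sf2", "sf3")
lost = ("sf1", "sf3")
print(domain_score(lost, ortholog), architecture_score(lost, ortholog))
print(classify_triplet(kept, lost, ortholog))
```

In this toy case the domain score stays fairly high after a single domain loss, while the architecture score drops more sharply because the loss breaks the contiguous run of matching domains.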
Comparing proteins with the architecture score derived from SUPERFAMILY, we classify 93% of triplets as F2 and 4% as F1, while with the domain score we classify 94% of triplets as F2 and 5% as F1. We obtained similar results using Pfam-A as a source for the domain architectures (Table 2). Most paralogs share the same architecture as their single-copy ortholog.

Table 2 Quantification of uneven architecture divergence between paralogs. The table shows relative frequencies of WGD paralogs in S. cerevisiae having identical architecture to their single-copy ortholog in K. lactis according to the domain (upper panel) and architecture (lower panel) scores. The first two rows of each panel show the statistics restricted to the F2 and F1 triplets.

Domain score comparison         SUPERFAMILY   Pfam-A
F2 triplets                     94%           90%
F1 triplets                     5%            7%

Architecture score comparison   SUPERFAMILY   Pfam-A
F2 triplets                     93%           90%
F1 triplets                     4%            7%

Focusing on pairs with very dissimilar architecture, one can consider the mechanisms that generated divergence. Using SUPERFAMILY assignments, we found a total of 12 uneven paralog pairs, in most of which the architecture divergence is due to domain loss (66% of the cases, see Fig. 3) and to domain gain in one of the two paralogs (20% of the cases). The set of paralogs with uneven architecture identified by using Pfam-A is larger and includes the pairs detected with SUPERFAMILY. Domain loss and domain gain are the most frequent mechanisms for this set as well, accounting for 56% and 36% of the cases, respectively.

Fig. 3 Paralogous pairs with uneven architecture. In the paralogous pairs with uneven architecture, very often one of the two paralogs shares the same architecture with the single-copy ortholog in K. lactis, while the other one has a missing domain. This is the case for the post-WGD paralogs YKL104C-YMR084W, and the WGD paralogs YDL036C-YOL066C.

We extended the analysis of paralog divergence to non-WGD paralogs, taking into account the duplication date reported by the analysis of Wapinski et al.26 We measured the average domain and architecture scores as a function of duplication age, dividing the paralogs into post-WGD, WGD, and pre-WGD. We then divided the paralogs further into Hemiascomycetes and Euascomycetes (referring to duplications that occurred after and before the branching of these two clades). The analysis is reported in Fig. 4. The plots show that domain score and architecture score follow similar trends, decreasing only weakly over time and reaching mean values never lower than 0.8. This indicates that paralogs tend to maintain their architecture even after millions of years of evolution. WGD paralogs have slightly higher similarity and lower spread than the small-scale paralog age groups. The proportionality of domain score and architecture score is further confirmed by the significant Spearman’s rank correlation coefficient of 0.97 (Supplementary Table S4†). This result is, in part, related to the large number of single-domain proteins in S. cerevisiae (more than 70% in both the Pfam-A and the SUPERFAMILY data sets). If two single-domain proteins have the same domain content, they also certainly share the same architecture. Similar results were retrieved using both Pfam-A and SUPERFAMILY as reference data for the domain assignments.

2.3 Functional divergence and duplication age

In order to gain more insight into the divergence of paralogs at the domain level, we evaluated how the same duplicate proteins tend to diverge in their function. Specifically, we calculated the Gene Ontology (GO) term similarity between paralogs for each of the GO branches (“molecular function”, “biological process” and “cellular component”) by using the GOSim package.39 The results, shown in Fig. 5, indicate that for all three GO branches, recent paralogs tend to be more similar than older ones. Indeed, average GO term similarity values decrease as the duplication time increases. On the other hand, the mean GO term similarity of paralogs in all duplication date groups never reaches values lower than one half, indicating that ancient pre-WGD paralogs tend to maintain some functional overlap.

Fig. 4 A: Domain score versus duplication date. B: Architecture score versus duplication date. The duplication dates are elaborated from SYNERGY duplication groups (ref. 26), considering post-WGD duplications, WGD duplications, pre-WGD duplications exclusive of the Hemiascomycetes class and duplications already present in the Euascomycetes class. The y-axes of the plots report mean domain/architecture scores (central dashes), standard error SE (box margins) and 2SE (whiskers).

Fig. 5 Functional similarity of paralogs of different duplication age. The y-axes of the plots report mean similarity score (central dashes), standard error SE (box margins) and 2SE (whiskers) between the associated GO terms39 of paralogs, computed over sets of paralogous pairs belonging to the same age groups (x-axis). The three panels refer to each of the three GO branches: molecular function (A), cellular component (B), biological process (C). In all the plots, the GO term similarity values tend to decrease with duplication age.

The curve of GO similarity versus duplication age reaches lower values for the “biological process” and “cellular component” branches. This indicates that paralogs more likely diversify by the biological process in which they participate and the cellular compartment to which they belong, rather than by their molecular function. Secondly, they diversify at domain and architecture scores typically close to one, indicating that on average the function of paralogs migrates within the same fold structure, presumably by sequence mutations or recombinations maintaining the same structural domains.40 This is confirmed by a very weak positive correlation between domain score and GO similarity score for each paralogous pair (see Supplementary Table S4†).

The same trends are also visible from the histograms of GOSim and domain-based similarity scores of all paralogous pairs (Fig. 6). The number of paralogous pairs having high domain-based similarity is consistently higher than the number of those having high GO similarity, but this trend is weaker for the “molecular function” taxonomy. In order to gather more direct evidence of this general domain conservation under functional and sequence evolution, we also compared these figures with the normalized histogram of the (protein) sequence identity (%id) between pairs of paralogs from Smith-Waterman pairwise alignments, performed by using EMBOSS Water (Fig. 6).41 The latter distribution has the lowest peak at one and the highest value at low scores, confirming that proteins can conserve architectures and functions, in spite of the migration in amino acid sequence.

Fig. 6 Structural and functional divergence of paralogs. The plot reports histograms over all paralogous pairs of domain score, architecture score, sequence identity and GO similarity (for all three taxonomies: molecular function, biological process, cellular compartment). All curves are peaked around the value one, but the highest density values are reached by domain and architecture score curves, while the GO similarity curves reach lower values at one and develop a secondary peak below 0.5. This indicates that paralogs tend to maintain domain composition and architecture while changing their functions.

In order to exclude computational biases that could influence the results, we repeated the analyses with different conditions. First, not all S. cerevisiae proteins are covered entirely by domains; some have gaps. Excluding proteins with gaps from the analysis should confirm that the functional migration of paralogs is not attributable to unknown domains. Supplementary Fig. S3† shows that this is indeed the case. Second, Gene Ontology annotations inferred from computational evidence could generate false positives in GO similarity, especially in the case of recent paralogs with significant sequence similarity. To circumvent this possibility, we repeated the analysis considering exclusively manually curated GO annotations. This filter significantly reduces the size of our data set, especially in the case of non-WGD paralogs. For this reason, we grouped non-WGD paralogs into pre- and post-WGD sets. This gave sufficient statistics to retrieve the same trends of Fig. 5 for the “biological process” and “molecular function” GO branches (Supplementary Fig. S2†). Data were insufficient in the case of the “cellular component” GO branch.

2.4 Functional properties of WGD and non-WGD paralogs from a domain content viewpoint

We focused on the difference in function between domains of local and global paralogs. Whole-genome and small-scale duplications are different biological processes, and the analysis of WGD and non-WGD paralogs can give some insight into the biological constraints leading to long-term persistence of paralogous pairs in these two cases. In particular, recent studies25,26,42 found that WGD and non-WGD paralogs are enriched for different functional classes of genes. Thus, we set out to quantify, with domain-based methods, how the effects of the WGD on the genome are qualitatively different from those brought about by local duplications. Functional assignment of SUPERFAMILY domains24,43 can be used to evaluate the evolutionary destiny of paralogs. We considered the functional annotations for domain superfamilies given in the SCOP database16,44 and defined WGD and non-WGD paralogs using the data from ref. 26. We then proceeded to evaluate the trends in domain duplications, regardless of the specific protein in which they appear. We assigned domain superfamilies (SUPERFAMILY entry codes) to a set O, if they were duplicated in at least one WGD paralog, and to a set P, if they appeared in at least one local duplication (see Experimental). First, we found that the intersection of these two sets, in the universe of all SUPERFAMILY domain superfamilies that can be annotated in the proteome of S. cerevisiae, is larger (P-value < 10⁻²⁸) than expected from a hypergeometric null model (Fig. 7). Thus, there is a dominant common set of domain superfamilies that is prone to be duplicated, regardless of the local or global duplication mode.

Finally, we asked whether the different behavior of WGD versus non-WGD paralogs can be related to the dissemination of domain superfamilies in different genomes. We defined superfamilies as high-occurring if their occurrence in SUPERFAMILY-annotated genomes is higher than 0.95 (152 domain superfamilies) and low-occurring if their occurrence is lower than 0.05 (246 domain superfamilies). We assessed the enrichment of WGD and non-WGD paralogs by a hypergeometric test. The results show that both WGD and non-WGD characteristic domains are enriched for the high-occurrence class (Z-score for WGD equal to 13.75 and Z-score for non-WGD equal to 5.6), suggesting that the presence of a domain superfamily in a genome is related to its duplicability, without relationship to the duplication mode.

On the other hand, the observed distribution of the fraction of WGD versus non-WGD paralogous proteins where each domain superfamily is found is very uneven (Supplementary Fig. S4†). This trend indicates the existence of two populations of domain superfamilies: those that are duplicated only outside the WGD, and those that have a bias towards being found in the WGD only. Consequently, we analyzed the sets O\P, the domains only found in WGD paralogs, and P\O, the domains only found in non-WGD paralogs, for functional enrichment. For the finer categories of the SCOP functional classification, we found a few cases where the enrichment was biased in two opposite ways in the two sets (i.e. categories having a positive Z-score for WGD domains, and a negative Z-score for non-WGD domains).
The categories that show a bias for WGD-specific domains (belonging to O\P) correspond to functions that are growth-related (ribosomes, translation), involved in regulation of gene transcription and degradation (transcription factors, proteases), primary metabolism (coenzymes) or cell adhesion. On the other hand, a positive bias for locally duplicated domains (belonging to P\O) was found in functional categories related to transport, post-transcriptional regulatory processes and secondary metabolism. Surprisingly, we found that the category DNA repair and replication tends to be enriched among domain superfamilies duplicated only locally rather than only globally. Weaker signals for the same trend were found for RNA processing and modification, chromatin structure and dynamics, toxins and defense enzymes.

To test for the effects of duplication age, we performed the same analysis by distinguishing the domain superfamilies of pre- and post-WGD small-scale paralogs. The enriched categories are generally the same, but this analysis gives a better perspective of the time when the superfamilies with specific functions were produced by small-scale duplications. For example, the enrichment of non-WGD paralogs in DNA replication/repair and proteases is due only to pre-WGD paralogs. Analogously, polysaccharide metabolism enrichment is due to post-WGD domains, while phospholipid metabolism enrichment is due to pre-WGD domains. Redox function shows some enrichment for post-WGD paralogs and under-representation among pre-WGD paralogs. This analysis shows enrichment in DNA binding for all the paralogs, rather than just for WGD domains.

Fig. 7 Domain superfamilies of small-scale and global paralogs. (A) Domain superfamilies (SCOP entry names) are extracted from the non-overlapping sets of WGD and non-WGD paralogs. Two overlapping sets (P and O) are obtained, and their non-overlapping parts (WGD- and non-WGD-specific domain superfamilies) are analyzed for functional enrichment. (B) Venn diagram of the sets O and P. (C) Table summarizing the significantly enriched functional classes for the sets of WGD and non-WGD domains. The empirical intersection is 11 standard deviations larger than the mean value provided by a hypergeometric distribution. zP and zO refer, respectively, to the Z-score for the non-WGD paralogs analysis and the WGD paralogs analysis (the sets P\O and O\P in panel B, respectively).

As a reference and control, we also performed a more standard functional characterization based on Gene Ontology analysis of proteins along the lines of previous studies25,42 (see Supplementary materials†). We considered the separate sets of WGD and non-WGD paralogs. For each set we extracted the over-represented GO terms and we looked for the terms shared between WGD and non-WGD paralogs, or specifically connected to a group (over-represented in a group and not significantly present in the other). WGD and non-WGD paralogs are typically enriched in unrelated sets of GO terms. The categories we find for the two sets agree with the above analysis based on domain superfamilies. Performing the same analysis separately on pre- and post-WGD paralogs, we find that post-WGD paralogs have over-represented terms pointing to some specific metabolic pathways such as “thiamin and derivative metabolic process”, “cobalamin metabolic process” and “beta-alanine metabolic process”. The WGD paralogs are enriched for genes involved in cell metabolism, such as “hexose metabolic process”, “cellular bud”, “nucleotide binding” and “alcohol catabolic process”.45 The exclusive over-represented terms of pre-WGD paralogs are, in large part, connected to transport functions such as “carboxylic acid transport” and “ion transmembrane transporter activity”. Interestingly, genes connected to vacuole function are also pre-WGD paralogs (see Supplementary Table 1†).
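The over-representation step described above can be illustrated with a short sketch. It computes a one-sided hypergeometric (Fisher-type) P-value for a single GO term and then separates shared terms from group-specific ones. This is only an illustration under stated assumptions, not the code used in the study: the function names, the toy numbers and the use of a single cutoff `alpha` for both "over-represented" and "not significantly present" are choices made here.

```python
from math import comb


def enrichment_p_value(universe, annotated, set_size, hits):
    """One-sided P-value: probability of drawing at least `hits` annotated
    genes when `set_size` genes are sampled without replacement from a
    universe of `universe` genes, `annotated` of which carry the term."""
    upper = min(annotated, set_size)
    favourable = sum(
        comb(annotated, k) * comb(universe - annotated, set_size - k)
        for k in range(hits, upper + 1)
    )
    return favourable / comb(universe, set_size)


def split_terms(p_wgd, p_non_wgd, alpha=1e-3):
    """Return (shared, WGD-only, non-WGD-only) sets of enriched terms.

    Each argument maps a GO term to its enrichment P-value in one group;
    a term missing from a mapping is treated as not enriched there.
    """
    sig_wgd = {t for t, p in p_wgd.items() if p <= alpha}
    sig_non = {t for t, p in p_non_wgd.items() if p <= alpha}
    return sig_wgd & sig_non, sig_wgd - sig_non, sig_non - sig_wgd


# Toy numbers (hypothetical, not taken from the paper).
p = enrichment_p_value(universe=6000, annotated=120, set_size=500, hits=30)
shared, wgd_only, non_wgd_only = split_terms(
    {"hexose metabolic process": 1e-6, "nucleotide binding": 2e-4},
    {"carboxylic acid transport": 5e-5, "nucleotide binding": 0.2},
)
```

Exact integer binomials are used so that the tail probability stays accurate for small P-values; a statistics library would be the more usual choice for large-scale runs.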

3. Discussion and conclusions

Detection of homology among distant paralogs and orthologs is difficult because of sequence divergence. However, homologous proteins tend to maintain the same architecture more stably than the identity of their amino acid sequence. To score distant relationships among yeast and K. waltii proteins we used SCOP superfamily domain assignments. There are three main reasons for this choice. First, these domains contain three-dimensional structural information and are not solely based on sequence similarity. Thus, they can be considered, at least to a certain extent, “independent” from sequence alignments. Second, compared to the higher classification into “folds”, they are defined to guarantee monophyly (i.e. excluding convergent evolution). Evolutionary information on domains is an intrinsic feature of the classification scheme of the SCOP database, which is the basis for the hidden Markov models of the SUPERFAMILY database. Third, this choice was made in previous studies,29,46 giving us a basis for comparison.

The criteria and scores we used assume that two proteins share the same common ancestor if they have the same domain architecture, or a series of domains from the same protein families. This method allowed us to compare the more distant structural homology relationships with those obtained by sequence comparisons and synteny, and also provided us with a simple means to study the genome-wide evolution of protein function from the domain viewpoint. Naturally, the hidden Markov model assignment of domains depends on the scoring parameters.16 We limited our analysis to the criteria used by the SUPERFAMILY database.29,46–48

3.1 Domain architecture and homology

Despite the sparse coverage of structural domains, it seems evident from our results that even elementary domain-based homology criteria can recover most of the information obtained through sequence alignment/synteny techniques. Indeed, the criteria we defined are able to capture a large fraction of Ensembl Compara homology classes. This indicates that the rate of false negatives is small. Examining the results, it is evident that this rate is mainly due to the partial coverage of domain assignments, and thus is expected to decrease as our knowledge of protein domains improves.
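To make the simplest of these criteria concrete, the sketch below groups proteins whose ordered domain lists match exactly (the idea behind criterion A) and measures how many reference homology classes end up inside a single domain-based class. It is a minimal illustration, not the authors' implementation. The function names are invented here, the way recovery is counted ("a reference class is recovered if all its members fall in one domain-based class") is an assumption, the reference classes merely stand in for Ensembl Compara classes, and YFG1/YFG2 with their domains are hypothetical. The other architectures are those quoted in Section 3.2.

```python
from collections import defaultdict


def exact_match_classes(architectures):
    """Group proteins whose ordered domain lists are identical."""
    groups = defaultdict(list)
    for protein, domains in architectures.items():
        if domains:  # proteins without assigned domains cannot be compared
            groups[tuple(domains)].append(protein)
    return [sorted(g) for g in groups.values() if len(g) > 1]


def fraction_recovered(reference_classes, domain_classes):
    """Fraction of reference classes fully contained in one domain-based class."""
    domain_sets = [set(c) for c in domain_classes]
    found = sum(
        any(set(ref) <= dom for dom in domain_sets) for ref in reference_classes
    )
    return found / len(reference_classes)


dehydrogenase = ["GroES-like", "NAD(P)-binding Rossmann-fold"]
nuclease = ["PIN-like", "5' to 3' exonuclease C-terminal subdomain"]
architectures = {
    "BDH1": dehydrogenase,
    "SOR1": dehydrogenase,
    "XYL2": dehydrogenase,
    "DIN7": nuclease,
    "EXO1": nuclease,
    "YFG1": ["hypothetical domain X"],  # hypothetical protein
    "YFG2": ["hypothetical domain Y"],  # hypothetical protein
}
classes = exact_match_classes(architectures)

# Stand-in reference classes (hypothetical, for illustration only).
reference = [["DIN7", "EXO1"], ["SOR1", "XYL2"], ["YFG1", "YFG2"]]
recovered = fraction_recovered(reference, classes)
```

In this toy case the dehydrogenases form one class larger than either reference pair, which mirrors the observation that architecture-based classes tend to be larger than sequence-based ones.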


The comparison of proteins by domain architecture produces larger homology classes than conventional methods of homology assignment. These can include false positives, but also distant relationships visible in domain architectures only. We quantified them by measuring the fraction of domain-based homology classes not containing Ensembl Compara classes. Criteria A and B have a similar percentage of homologs not detected in Ensembl Compara, while criterion C follows a different trend. This last criterion is the only one that allows for insertion and/or loss of external domains, an event observed and expected in the evolution of proteins.49–51 The different qualitative trend of criterion C could suggest a lower reliability (higher false positive rate) compared to the other ones.

It is important to stress that the architecture comparison methods implemented in this paper can show false positive matches. In other words, the less restrictive the criterion is, the higher the possibility of incorrectly identifying evolutionarily unrelated genes as homologs. While we have estimated the relative false positive rate of the different criteria by comparison with Ensembl homology relationships, the absolute rate remains difficult to establish. It is also possible to assign a significance score to single homologous pairs from the P-values, $P_i$, of the domains assigned in the two architectures. Interpreting each $P_i$ as the probability that the single domain assignment is not valid, one can score the homology relationship by $\prod_i (1 - P_i)$. While this could be used to judge the relative reliability of single homologous pairs detected by our method only, we believe that their absolute reliability is much harder to identify. Overall, while some instances could represent false positives, it is also natural to expect that some others represent distant relationships that are not detected as paralogs by Ensembl Compara, but are recognized by domain-based criteria.
However, quantifying them would require an independent method to establish very distant evolutionary relationships between proteins, a challenge beyond the scope of this work.

Despite a limited ability to estimate the false-positive rate, we have other biologically-based observables that argue for the reliability of domain-based criteria. Firstly, the mean domain scores of paralogs are very close to one (Fig. 4). This indicates that even ancient paralogs tend to have very similar domain composition. The similar tendency of the architecture score curve (Fig. 4) also suggests that multi-domain paralogs tend to maintain their architecture even after millions of years. These results are confirmed by a Spearman’s rank correlation coefficient between domain score and architecture score close to one. The architecture comparison of S. cerevisiae WGD paralogs with their K. lactis single copy ortholog (Table 2) gives a further confirmation that proteins tend to maintain their architecture over millions of years of evolution. These results constitute further proof of the power and usefulness of homology criteria based on protein architecture.

The most important current challenge is that no tool is available to quantify the failure rate of domain-based methods in detecting homologs. In other words, it would be important to estimate precisely which fraction of homologs detected by domain-based methods and not by more conventional methods are really significant. For example, one cannot exclude that genes gained by horizontal transfer give rise to proteins with the same domain structure as some other proteins in the genome,52 or that the partial coverage of domain databases does not allow distinct architectures to be resolved.

3.2 Domain structure and function of paralogous proteins

We now turn our attention to the second question of our study, namely the functional properties of paralogs in light of the evolution of their domain architectures. Paralogs show a strong tendency to conserve their domain composition and their architectures. This has to be compared with the functional GO similarity analysis on the same sets, showing a trend for divergence with increasing duplication age. An explanation of this phenomenon may be that proteins evolve through point mutations affecting one nucleotide at a time. Domain superfamilies can withstand these mutations without changing significantly, but some elementary biochemical properties that define protein function may vary. In other words, point mutations can change a protein's function without changing its domain composition. It is well known that proteins with identical folds can diverge greatly, not only in sequence but also in function.40 Along the same lines, Wapinski and collaborators26 observe that the functional fates of paralogs rarely diverge with respect to biochemical function but typically diverge with respect to regulatory control. The typical case when this is known to happen is that of transcription factors,53 where the migration of sequences within the same DNA-binding fold can lead to major changes in the affinity for a given set of sequences, and thus to large variation in the set of regulated targets. More simply, GO term divergence could come from a change of cellular compartment or biological process while performing similar biochemical functions.
Also, note that the drop with duplication age of the GO similarity score, relative to the “molecular function” branch, is weaker than in either the “biological process” branch or the “cellular component” branch. We extracted from our set some paralogs that maintain exactly the same domain architecture after duplication, while changing the molecular function, the cellular compartment and/or the biological process in which they are involved (GO term similarity < 0.15). Two examples of these paralogs are BDH1 and SOR1, ancient pre-WGD paralogs. BDH1 is a butanediol dehydrogenase involved in alcohol metabolic processes with a domain architecture composed of a GroES-like domain and a NAD(P)-binding Rossmann-fold domain. SOR1 is a sorbitol dehydrogenase involved in hexose metabolism with the same domain architecture. SOR1 is also a post-WGD paralog of XYL2, which encodes a xylitol dehydrogenase and shares the same domain architecture. DIN7 and EXO1 are instead WGD paralogs, both encoding proteins with nuclease activity involved in DNA repair and replication. They share the same domain architecture composed of a PIN-like domain and a 5′ to 3′ exonuclease C-terminal subdomain. However, DIN7 is mitochondrial and EXO1 is nuclear. Similarly, the WGD paralogs SEC14 and YKL091C are both phosphatidylinositol/phosphatidylcholine transfer proteins with the same domain architecture (CRAL/TRIO domain and CRAL/TRIO N-terminal domain), but the first performs its function in the cytosol and in the Golgi apparatus, while the second is nuclear.

Naturally, the coverage of domains in genomes is only partial, leaving open the question of whether the observed trends of functional annotations with duplication age are due to modifications in the space of domains that are not visible to our methods. While of course this may happen, it seems unlikely that this can affect the global observed trends, assuming that we are observing an unbiased random sample of the existing structural domains. In other words, if the domains that change their superfamily during evolution have a fixed probability to be in the set of known domains, this would generate on average a decreasing trend of the domain score with duplication age, which we do not observe. A confirmation of this is given by the fact that when we remove proteins with gaps in their architecture, none of the observed trends (Fig. 6, Supplementary figure S3†) change.

3.3 Specificity of the whole-genome duplication

Classifying the universe of all S. cerevisiae domain superfamilies into locally and globally duplicated superfamilies yields two sets of WGD and non-WGD domains. These sets are not mutually exclusive, as the same SUPERFAMILY domain superfamily can be present in both WGD and non-WGD paralogs. Notably, this intersection is much larger than expected from a hypergeometric null model. This can be interpreted as the fact that, within the universe of domains, the main distinction is between domain superfamilies found or not found in duplications, rather than between domain superfamilies in global versus local duplications. Whole-genome and local duplications are unified, rather than separated, by this trend.

However, the domain superfamilies of WGD paralogs lying outside a common set of duplicable domains remain significant, as they give rise to evident peaks in the frequency of observing a domain in the sets of WGD and non-WGD paralogs. Moreover, they are also significant functionally. Indeed, the separate sets of WGD-specific and local-duplication-specific domain superfamilies are enriched for different functional categories. Similar categories are found with a more standard functional analysis of the genes. Thus, the domain-superfamily-based functional analysis gives an additional perspective on the biological differences between WGD and non-WGD paralogs found in previous work.25,26,42

There exist functionally significant sets of superfamilies that are specifically “duplicable” for WGD or small-scale duplications. These superfamilies are likely peculiar as a consequence of their function and interaction properties, as they explain, at least in part, the functional differences between small-scale duplications and the WGD. Prior studies have concluded that these functional differences54 are the consequence of differential rates of retention. By definition, all pre-existing genes were duplicated in the WGD event.
Thus, we know that all available domain superfamilies could have a priori been retained and that the one tenth that was indeed retained encodes a set of superfamilies not found in small-scale paralogs. Assuming that small-scale duplications also occur uniformly on the genome (i.e. all genes can be subject to them and later on retained or not), we could conclude that the proteins that could not be retained in small-scale duplications because of their functional or interaction properties, probably related to the WGD-specific domain superfamilies, were in fact retained after the WGD. On the other hand, should some regions of the genome be less prone to single-gene or segmental duplications, it is also possible that the WGD-specific trends are related to a release of this bias in favor of uniform sampling. In this second hypothesis, one would observe (once synteny is correctly mapped) complementarity of WGD and non-WGD paralogs in well-defined regions of the genome, which has, to our knowledge, not been reported.

Several investigators found that WGD paralogs and non-WGD paralogs are similarly biased with respect to codon bias and evolutionary rate, although they differ significantly in their function and in the number of interacting partners.25,26,42 While the debate on the exact biological role of the WGD is open, in agreement with these results we find that some fundamental functions such as ribosomes and translation are enriched in the WGD paralogs and in the WGD-specific domain superfamilies, while some functions that have been defined as “peripheral”26 (e.g. secondary metabolism) are enriched for local duplications. The rationale for this result26 might be that functions related to core biological processes, or in general realized by genes with more entangled genetic interactions, are more difficult to replicate by duplicating one part at a time, as happens with local duplications.26 On the other hand, global moves such as the WGD could release these constraints and allow “recycling” and disentanglement of more elaborate cell machinery.
Guan and coworkers25 have studied in great depth and with great clarity the differences between local and global duplications, and found that WGD paralogs tend to share more interaction partners and biological functions than non-WGD paralogs. This is confirmed by our analysis. The fact that some superfamilies are more prevalent in WGD duplications is likely related to their membership in protein complexes and, hence, probably to the requirement to maintain precise stoichiometry. Accordingly, the functional categories that show a bias for WGD-specific domains correspond to cell tasks that involve protein complexes and protein–protein interactions, such as ribosomes, translation, proteases and coenzymes. On the other hand, locally duplicated domains are found in functional categories related to transport, post-transcriptional regulatory processes and secondary metabolism, consistent with the hypothesis that enzymes and transporters tend to act more independently, and are more easily maintained after local duplication.

To confirm the latter result, we analyzed the distribution of the GO similarity normalized histograms for all the pairs of the two separate sets. Indeed, WGD paralogs are slightly more similar than non-WGD paralogs for all three GO branches (Supplementary Fig. S1†). On the other hand, examining Fig. 5, one notes that pre-WGD paralogs are less similar at the functional level, so this signal could come, at least in part, from the functional difference of ancient non-WGD paralogs. This could be a consequence of duplication date rather than duplication mode.


Finally, we can speculate on the consequences of our observation that the functional dichotomy is also found at the domain superfamily level. If it is true that function migrates abundantly, the functional dichotomy of local and global paralogs may emerge from migration of function maintaining similar domain structures. However, this cannot be the only source of diversification, as in that case the same functional differences would not emerge from the analysis of WGD- and non-WGD-specific domain superfamilies. On the contrary, our results indicate that the dichotomy must be, at least in part, related to the “special” domain superfamilies that are only found in local or global duplications.
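The comparison of the overlap between the sets O and P with a hypergeometric null model, discussed in this section, can be sketched as a Z-score calculation. The snippet below is a minimal illustration rather than the authors' procedure: the function name is invented here and the set sizes in the example are hypothetical placeholders, not the values shown in Fig. 7.

```python
from math import sqrt


def overlap_z_score(universe, size_o, size_p, observed_overlap):
    """Z-score of the observed |O ∩ P| under a hypergeometric null model.

    The null model draws `size_o` superfamilies without replacement from a
    universe of `universe` superfamilies, `size_p` of which belong to P.
    """
    mean = size_o * size_p / universe
    variance = (
        mean * (1 - size_p / universe) * (universe - size_o) / (universe - 1)
    )
    return (observed_overlap - mean) / sqrt(variance)


# Hypothetical set sizes, for illustration only.
z = overlap_z_score(universe=100, size_o=50, size_p=50, observed_overlap=30)
```

The expression is symmetric in the two sets, so it does not matter which of O and P is treated as the "drawn" sample.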

4. Experimental

4.1 Data sets

We used the SUPERFAMILY database version16,24 for the SCOP 1.73 superfamilies assignment, and the functional annotation of domains. Pfam-A 24.0 was the reference source for domain families.18 We implemented C code to reconstruct the protein domain architectures, as ordered lists of domains and “gaps”. Gaps were defined as protein subsequences of 100 AA or more not scored for domains; 100 AA is the most probable domain size in the empirical data, and a choice of a lower cutoff does not affect the results presented here. As a reference for homology assignment we used different homology tools based on sequence alignment and synteny. For general homology, we referred to Ensembl Compara (release 50),28 where the alignments are curated under sequence similarity, synteny and phylogeny. For K. lactis-S. cerevisiae WGD paralogs, we referred to the Fungal Orthogroups Repository;26 this tool was also used for dating paralogs.

4.2 Homology criteria

Three different homology criteria were used to compare the domain architecture of proteins.29,50 Criterion A considers exactly matching architectures. The underlying biological hypothesis is that divergence after duplication does not change the domain architecture of the proteins, implying that divergence between homologs should happen at the sequence/peptide level. Criterion B relaxes the previous condition, and considers homologous domain architectures that are equal or contain multiple repetitions of ordered sets of domains, ignoring possible gap mismatches. Criterion C further relaxes the above conditions, considering domain architectures as homologous if one contains repeated architecture domain sequences possibly interspaced by gaps or other domains. The code that implements the three criteria is available from the authors upon request. See also the supplementary materials† for the pseudocode description of the three algorithms.

4.3 Domain architecture comparison scores

We defined two different methods to compare proteins in their structural properties. The first, the “domain score”, quantifies the variation in the domains of the two architectures, and is defined as the number of domains common to the two architectures, divided by the mean number of domains found in both (Dice’s coefficient). The second, the “architecture score”, takes into account the order of appearance of domains in the two architectures and is defined as the length of the longest matching string of domains between the two architectures, divided by their mean length. Both scores have a range from 0 (no similarity) to 1 (full similarity). The scores for pairs of WGD and non-WGD paralogs of different age groups were averaged and histogrammed.

4.4 Domain-based functional analysis

Duplicate proteins with non-empty domain architecture were divided into two separate sets of WGD and non-WGD paralogs by using the classification in ref. 55. Domain superfamilies (SCOP superfamily entries) extracted from the two sets were divided accordingly into three sets: the set O of domains found in WGD paralogs; the set P of domains found in non-WGD paralogs; the set $O \cap P$ of domains found in at least one member of both protein sets (Fig. 7). To assess the functional enrichment for WGD and non-WGD paralogs, we implemented a null model based on the hypergeometric distribution, which provides the expected number of domains assigned with function F belonging either to WGD paralogs or to non-WGD paralogs, using as universe the set of all distinct domains found in S. cerevisiae.

4.5 Gene ontology analysis

Duplicate proteins were divided into two separate sets of WGD and non-WGD paralogs by using the classification in ref. 55. We downloaded the Gene Ontology (GO) annotation DAGs from the GO website56 (download date 1/2008) and the gene product annotations from the Ensembl database, version 46. We considered a gene annotated to a GO term if it was directly annotated to it or to any of its descendants in the GO tree. We used the SYNERGY algorithm results26 for defining paralogy classes and groups of WGD versus small-scale paralogs. Orthologs and paralogs were considered as different groups. As a reference, 1000 triplets of sets were considered, each consisting of 500 randomly assorted genes from the S. cerevisiae genome with the only constraint that each gene was chosen only once in each triplet. For each group we implemented an exact Fisher’s test to assess whether a set of genes could be enriched in a certain GO term.57,58 Fisher’s test gives the probability P of obtaining an equal or greater number of genes annotated to the term in a set made of the same number of genes, but randomly selected. Subsequently, the terms shared by both groups and the exclusive terms (terms present in only one group) were extracted. Finally, we filtered the results, retaining only GO terms with P-values $\le 10^{-3}$. For each pair of paralogs, we calculated the Lin GO term similarity, by using the GOSim R-package (Version 1.1.5.1).39 For each duplication date group we calculated the mean and the standard deviation of the mean of the GO term similarity. We also performed these analyses restricting to annotations with experimental evidence codes.
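The two comparison scores of Section 4.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' C implementation. Two points the text leaves open are decided here by assumption: repeated domains are counted with multiplicity in the domain score, and the "longest matching string" is taken to be the longest contiguous run of identical domains. Function names are invented for the example, and gap handling is left to the caller.

```python
from collections import Counter


def domain_score(a, b):
    """Dice coefficient on domain content (repeats counted with multiplicity)."""
    if not a or not b:
        return 0.0
    common = sum((Counter(a) & Counter(b)).values())
    return 2 * common / (len(a) + len(b))


def architecture_score(a, b):
    """Longest contiguous run of matching domains divided by the mean length."""
    if not a or not b:
        return 0.0
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                # extend the run that ended at the previous pair of positions
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return 2 * best / (len(a) + len(b))


# Architecture quoted in Section 3.2 for BDH1 and SOR1.
bdh1 = ["GroES-like", "NAD(P)-binding Rossmann-fold"]
sor1 = ["GroES-like", "NAD(P)-binding Rossmann-fold"]
same_content = domain_score(bdh1, sor1)
same_order = architecture_score(bdh1, sor1)

# Hypothetical architectures showing how the two scores can differ.
content = domain_score(["A", "B", "C"], ["A", "C"])
order = architecture_score(["A", "B", "C"], ["A", "C"])
```

With identical architectures both scores equal one, whereas the loss of an internal domain lowers the architecture score more than the domain score, which is why the two are reported separately.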

Author contributions

BB, MC and MCL conceived and designed the study; AS, DF, and LG analysed the data; DC and MCL contributed to materials and analysis; AS, DF, LG and MCL wrote the paper.

Acknowledgements

We would like to thank Hervé Isambert for useful discussions, and Paolo Provero for critical reading of this manuscript. This work was partially supported by the Fund for Investments of Basic Research (FIRB) from the Italian Ministry of the University and Scientific Research, No. RBNE03B8KK-006.


References

1 G. M. Rubin, M. D. Yandell and J. R. Wortman, et al., Science, 2000, 287, 2204–2215.
2 E. V. Koonin and M. Y. Galperin, Curr. Opin. Genet. Dev., 1997, 7, 757–63.
3 E. S. Lander, L. M. Linton and B. Birren, et al., Nature, 2001, 409, 860–921.
4 A. McLysaght, K. Hokamp and K. H. Wolfe, Nat. Genet., 2002, 31, 200–4.
5 R. B. Langkjaer, P. F. Cliften, M. Johnston and J. Piskur, Nature, 2003, 421, 848–852.
6 R. Friedman and A. L. Hughes, Genome Res., 2001, 11, 373–381.
7 G. Liti and E. J. Louis, Annu. Rev. Microbiol., 2005, 59, 135–153.
8 R. Koszul, B. Dujon and G. Fischer, Genetics, 2006, 172, 2211–2222.
9 M. Kellis, B. W. Birren and E. S. Lander, Nature, 2004, 428, 617–624.
10 F. S. Dietrich, S. Voegeli, S. Brachat, A. Lerch, K. Gates, S. Steiner, C. Mohr, R. Pohlmann, P. Luedi, S. Choi, R. A. Wing, A. Flavier, T. D. Gaffney and P. Philippsen, Science, 2004, 304, 304–307.
11 D. B. Wetlaufer, Proc. Natl. Acad. Sci. U. S. A., 1973, 70, 697–701.
12 J. S. Richardson, Adv. Protein Chem., 1981, 34, 167–339.
13 P. Bork and R. F. Doolittle, Proc. Natl. Acad. Sci. U. S. A., 1992, 89, 8990–4.
14 A. Andreeva, D. Howorth, S. E. Brenner, T. J. P. Hubbard, C. Chothia and A. G. Murzin, Nucleic Acids Res., 2004, 32, D226–229.
15 C. A. Orengo, A. D. Michie, S. Jones, D. T. Jones, M. B. Swindells and J. M. Thornton, Structure, 1997, 5, 1093–1108.
16 J. Gough, K. Karplus, R. Hughey and C. Chothia, J. Mol. Biol., 2001, 313, 903–919.
17 F. Servant, C. Bru, S. Carrère, E. Courcelle, J. Gouzy, D. Peyruc and D. Kahn, Briefings Bioinf., 2002, 3, 246–251.
18 R. D. Finn, J. Tate, J. Mistry, P. C. Coggill, S. J. Sammut, H.-R. Hotz, G. Ceric, K. Forslund, S. R. Eddy, E. L. L. Sonnhammer and A. Bateman, Nucleic Acids Res., 2008, 36, D281–288.
19 K. Lin, L. Zhu and D.-Y. Zhang, Bioinformatics, 2006, 22, 2081–2086.
20 N. Song, R. D. Sedgewick and D. Durand, J. Comput. Biol., 2007, 14, 496–516.
21 N. Song, J. M. Joseph, G. B. Davis and D. Durand, PLoS Comput. Biol., 2008, 4, e1000063.
22 J. Weiner 3rd, F. Beaussart and E. Bornberg-Bauer, FEBS J., 2006, 273, 2037–2047.
23 J. Jin, X. Xie, C. Chen, J. G. Park, C. Stark, D. A. James, M. Olhovsky, R. Linding, Y. Mao and T. Pawson, Sci. Signaling, 2009, 2, ra76.
24 D. Wilson, M. Madera, C. Vogel, C. Chothia and J. Gough, Nucleic Acids Res., 2007, 35, D308–313.
25 Y. Guan, M. J. Dunham and O. G. Troyanskaya, Genetics, 2007, 175, 933–943.
26 I. Wapinski, A. Pfeffer, N. Friedman and A. Regev, Nature, 2007, 449, 54–61.
27 A. K. Björklund, D. Ekman, S. Light, J. Frey-Skött and A. Elofsson, J. Mol. Biol., 2005, 353, 911–923.
28 P. Flicek, B. L. Aken, K. Beal, B. Ballester, M. Caccamo, Y. Chen, L. Clarke, G. Coates, F. Cunningham, T. Cutts, T. Down, S. C. Dyer, T. Eyre, S. Fitzgerald, J. Fernandez-Banet, S. Graf, S. Haider, M. Hammond, R. Holland, K. L. Howe, K. Howe, N. Johnson, A. Jenkinson, A. Kahari, D. Keefe, F. Kokocinski, E. Kulesha, D. Lawson, I. Longden, K. Megy, P. Meidl, B. Overduin, A. Parker, B. Pritchard, A. Prlic, S. Rice, D. Rios, M. Schuster, I. Sealy, G. Slater, D. Smedley, G. Spudich, S. Trevanion, A. J. Vilella, J. Vogel, S. White, M. Wood, E. Birney, T. Cox, V. Curwen, R. Durbin, X. M. Fernandez-Suarez, J. Herrero, T. J. P. Hubbard, A. Kasprzyk, G. Proctor, J. Smith, A. Ureta-Vidal and S. Searle, Nucleic Acids Res., 2008, 36, D707–714.
29 M. M. Babu and S. A. Teichmann, Nat. Genet., 2004, 36, 492–496.
30 B. P. Cusack and K. H. Wolfe, Mol. Biol. Evol., 2007, 24, 679–86.
31 S. Ohno, Evolution by Gene Duplication, Allen and Unwin, London, UK, 1970.
32 D. R. Scannell and K. H. Wolfe, Genome Res., 2008, 18, 137–47.
33 K. P. Byrne and K. H. Wolfe, Genetics, 2007, 175, 1341–50.
34 A. Force, M. Lynch, F. B. Pickett, A. Amores, Y.-l. Yan and J. Postlethwait, Genetics, 1999, 151, 1531–1545.
35 M. Lynch and A. Force, Genetics, 2000, 154, 459–473.
36 C. van Rijsbergen, Information Retrieval, Butterworths, London, 1979.
37 J. Gough, Bioinformatics, 2005, 21, 1464–1471.
38 F. Beaussart, J. Weiner 3rd and E. Bornberg-Bauer, Bioinformatics, 2007, 23, 1834–1836.
39 H. Fröhlich, N. Speer, A. Poustka and T. Beißbarth, BMC Bioinformatics, 2007, 8, 166.
40 M. N. Carbone and F. H. Arnold, Curr. Opin. Struct. Biol., 2007, 17, 454–459.
41 P. Rice, I. Longden and A. Bleasby, Trends Genet., 2000, 16, 276–277.
42 J. C. Davis and D. A. Petrov, Trends Genet., 2005, 21, 548–551.
43 C. Vogel, C. Berzuini, M. Bashton, J. Gough and S. A. Teichmann, J. Mol. Biol., 2004, 336, 809–823.
44 A. G. Murzin, S. E. Brenner, T. Hubbard and C. Chothia, J. Mol. Biol., 1995, 247, 536–540.
45 G. C. Conant and K. H. Wolfe, Mol. Syst. Biol., 2007, 3, year.
46 M. Cosentino-Lagomarsino, P. Jona, B. Bassetti and H. Isambert, Proc. Natl. Acad. Sci. U. S. A., 2007, 104, 5516–20.
47 M. Madan Babu and S. A. Teichmann, Nucleic Acids Res., 2003, 31, 1234–1244.
48 A. L. Sellerio, B. Bassetti, H. Isambert and M. C. Lagomarsino, Mol. BioSyst., 2009, 5, 170–179.
49 G. Apic, J. Gough and S. A. Teichmann, J. Mol. Biol., 2001, 310, 311–25.
50 A. D. Moore, A. K. Björklund, D. Ekman, E. Bornberg-Bauer and A. Elofsson, Trends Biochem. Sci., 2008, 33, 444–451.
51 P. Durrens, M. Nikolski and D. Sherman, PLoS Comput. Biol., 2008, 4, e1000200.
52 M. N. Price, P. S. Dehal and A. P. Arkin, GenomeBiology, 2008, 9, R4.
53 S. Itzkovitz, T. Tlusty and U. Alon, BMC Genomics, 2006, 7, 239.
54 L. Hakes, J. Pinney, S. Lovell, S. Oliver and D. Robertson, GenomeBiology, 2007, 8, R209.
55 K. P. Byrne and K. H. Wolfe, Genome Res., 2005, 15, 1456–61.
56 The Gene Ontology, http://www.geneontology.org.
57 D. Cora, F. Di Cunto, P. Provero, L. Silengo and M. Caselle, BMC Bioinformatics, 2004, 5, 57.
58 D. Cora, C. Herrmann, C. Dieterich, F. Di Cunto, P. Provero and M. Caselle, BMC Bioinformatics, 2005, 6, 110.

Mol. BioSyst., 2010, 6, 2305–2315
