Consideration of RNA Secondary Structure Significantly Improves Likelihood-Based Estimates of Phylogeny: Examples from the Bilateria

Maximilian J. Telford,* Michael J. Wise,† and Vivek Gowri-Shankar‡

*Department of Biology, University College London, London, United Kingdom; †School of Biomedical and Chemical Sciences, University of Western Australia, Crawley, Perth, Western Australia; and ‡Department of Computer Science, University of Manchester, Manchester, United Kingdom

Sequences from ribosomal RNA (rRNA) genes have made a huge contribution to our current understanding of metazoan phylogeny and indeed the phylogeny of all of life. That said, some parts of this rRNA-based phylogeny remain unresolved. One approach to increase the resolution of these trees would be to use more appropriate models of sequence evolution in phylogenetic analysis. RNAs transcribed from rRNA genes have a complex secondary structure mediated by base pairing between sometimes distant regions of the rRNA molecule. The pairing between the stem nucleotides has important consequences for their evolution, which differs from that of unpaired loop nucleotides. These differences in evolution should ideally be accounted for when using rRNA sequences for phylogeny estimation. We use a novel permutation approach to demonstrate the significant superiority of models of sequence evolution that allow stem and loop regions to evolve according to separate models and, in common with previous studies, we show that 16-state models that take base pairing of stems into account are significantly better than simpler, 4-state, single-nucleotide models. One of these 16-state models has been applied to the phylogeny of the Bilateria using small subunit rRNA (SSU) sequences. Our optimal tree largely echoes previous results based on SSU, in particular supporting the tripartite bilaterian tree of deuterostomes, lophotrochozoans, and ecdysozoans. There are also a number of differences, however, perhaps the most important of which is the observation of a clade consisting of the gastrotrichs plus platyhelminths that is basal to all other lophotrochozoan taxa. Use of 16-state models also appears to reduce the Bayesian support given to certain biologically improbable groups found using standard 4-state models.

Key words: ribosomal RNA, phylogeny, Bilateria, secondary structure, maximum likelihood, Bayesian analysis.
E-mail: [email protected].
Mol. Biol. Evol. 22(4):1129–1136. 2005. doi:10.1093/molbev/msi099. Advance Access publication February 2, 2005.
Molecular Biology and Evolution vol. 22 no. 4 © Society for Molecular Biology and Evolution 2005; all rights reserved.

Introduction

Thousands of small subunit ribosomal RNA (SSU rRNA) and hundreds of large subunit rRNA (LSU) genes have been sequenced from the diversity of animal taxa, and analyses of these have led to important revolutions in our understanding of animal relationships and evolution (Adoutte et al. 2000; Mallatt and Winchell 2002). The current SSU + LSU-based model of bilaterian phylogeny posits two principal clades: the deuterostomes (chordates, echinoderms, and hemichordates) and the protostomes. The protostomes consist of the two major groups Ecdysozoa (arthropods plus various introvertan worms including priapulids and nematodes) and Lophotrochozoa (mollusks, annelids, flatworms, and a number of other phyla) (Aguinaldo et al. 1997). This general scheme has been supported by additional molecular phylogenetic studies (e.g., Haase et al. 2001), although there is also evidence that contradicts this view (e.g., Wolf, Rogozin, and Koonin 2004). Although this tripartite phylogeny seems on balance to be robust, SSU and LSU phylogenies have not been able to resolve reliably the relationships within the ecdysozoan and lophotrochozoan clades.

One approach to improving the resolution of the animal tree using SSU and LSU would be to employ more accurate methods of phylogenetic analysis. An important aspect of rRNA genes that has largely been ignored by zoologists when reconstructing bilaterian phylogenies is that the functional RNAs transcribed from these genes have a complex secondary structure mediated by base pairing between sometimes distant regions of the rRNA molecule. By accounting for this secondary structure in the models of evolution used in phylogeny reconstruction, we might hope to improve the reliability of the bilaterian tree.

rRNAs consist of base-paired stem and unpaired loop regions. The pairing within the stems involves the Watson-Crick A:U and G:C pairs and the noncanonical G:U pair; other pairings exist, but they are rare enough to be disregarded in the current context. The constraints arising from the need to conserve secondary structure have important effects on the evolution of the stem sequences that differ from those of the loop regions. In essence, because there is selective pressure for the maintenance of rRNA secondary structure, potential substitutions affecting one half of a pair of stem nucleotides have a different probability of fixation compared to the equivalent nucleotide in a loop. More specifically, in stem regions, changes from a base-paired state to an unpaired state tend to be strongly selected against. Moreover, not all changes between sets of pairing bases are equal. Changes between A:U and G:U and between G:U and G:C each involve a single change and are therefore common. Changing between A:U and G:C or between U:A and C:G can also occur relatively easily, through two changes via the stable transitional state G:U or U:G. Changing from any of A:U, G:U, and G:C to their counterparts U:A, U:G, or C:G, on the other hand, necessitates either two simultaneous changes, which is very unlikely, or passage via an unpaired transitional state, which is presumably maladaptive. Such changes are correspondingly much rarer.

Stem models have been around for a number of years, and there have been a number of descriptions of the use of such models in phylogeny reconstruction (Schöniger and von Haeseler 1994; Muse 1995; Rzhetsky 1995; Tillier and Collins 1995; Otsuka, Terai, and Nakano 1999; Savill, Hoyle, and Higgs 2001; Jow et al. 2002; Hudelot et al. 2003; Smith, Lui, and Tillier 2004). Considering the widely accepted view that the use of more accurate models of sequence evolution should lead to more accurate inference of phylogeny based on those sequences, it seems surprising that the vast majority of published studies of rRNA sequences have not taken into account the differences expected between the evolution of stems and of loops.

Our particular interest is in incorporating knowledge of secondary structure in a phylogenetic analysis of the Bilateria, something that has been attempted in the past but without great success. Early in the history of animal SSU phylogenetics, Patterson (1989) recognized the interdependence of nucleotides within stems and therefore employed a scheme which weighted stem nucleotides at half the weight of the loop positions. However, this does not take into account the different selective pressures on stem positions, only their nonindependence. More recently, Otsuka and Sugaya (2003) considered the mitochondrial LSU genes of various animals and, taking stem pairs into account, derived genetic distances between pairs of taxa. These distances were used for tree construction using the Neighbor Joining and unweighted pair group method with arithmetic mean (UPGMA) methods but gave trees that contradicted current understanding of animal phylogeny: Otsuka and Sugaya did not find a monophyletic grouping of lophotrochozoans, the mollusks and annelids are grouped with arthropods rather than with brachiopods and platyhelminths, and the hemichordates group within the chordates rather than with echinoderms. In any case it is clear that a likelihood-based tree estimation method would be preferable to Otsuka and Sugaya’s distance-based approach (Schöniger and von Haeseler 1994; Muse 1995; Rzhetsky 1995). The ideal approach must be to model the different evolution of stems and loops within a likelihood framework.
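As a small illustration of the pair-to-pair changes discussed earlier in this Introduction, the following Python sketch classifies a change between two pairing states by the number of nucleotide substitutions it requires and by whether a paired intermediate exists. It is not part of the authors' software; the names `PAIRING`, `n_changes`, and `classify` are hypothetical, and only the six pairing states named in the text are considered.

```python
# Illustrative sketch only (hypothetical names, not the authors' code).
# States are written 5' base then 3' base, e.g. "AU" for an A:U pair.
PAIRING = {"AU", "UA", "GC", "CG", "GU", "UG"}  # Watson-Crick plus G:U wobble


def n_changes(a, b):
    """Number of positions at which two pair states differ (0, 1, or 2)."""
    return sum(x != y for x, y in zip(a, b))


def classify(src, dst):
    """Describe the change from one pairing state to another."""
    if src not in PAIRING or dst not in PAIRING:
        raise ValueError("both states must be pairing states")
    d = n_changes(src, dst)
    if d == 0:
        return "no change"
    if d == 1:
        return "single change"
    # Two substitutions needed: look for a paired state one step from both.
    for mid in sorted(PAIRING):
        if n_changes(src, mid) == 1 and n_changes(mid, dst) == 1:
            return "double change via " + mid
    return "double change, no paired intermediate"
```

Under these assumptions, `classify("AU", "GC")` reports a route through G:U, whereas `classify("AU", "UA")` finds no paired intermediate, which corresponds to the rarer class of change described in the text.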
Previous likelihood-based studies have already suggested the superiority of approaches that take into account base pair correlation in RNA stems over methods assuming complete evolutionary independence of stem nucleotides. Using a standard likelihood ratio test, Muse (1995) showed that the addition of an extra pairing parameter to standard DNA models significantly increased the fit with real data sets. Similarly, Schöniger and von Haeseler (1999) rejected the standard Hasegawa, Kishino, and Yano (HKY) model in favor of a corresponding doublet model with a Goldman-Cox test. Savill et al. (2001) compared a number of doublet models and showed that models that allow for double substitutions are preferred. Here, a permutation test (Felsenstein 2003) is used to demonstrate the presence of altered evolutionary patterns in stem regions compared to loops and to provide further evidence that base-paired models are more appropriate than standard, single-nucleotide models. We use the best of these 16-state models to estimate the phylogeny of 72 bilaterians using a Markov chain Monte Carlo (MCMC) Bayesian search procedure.

Materials and Methods

Data

Bilaterian SSU sequences were downloaded from the European Ribosomal RNA database (http://oberon.fvms.ugent.be:8080/rRNA/) in the dedicated comparative sequence editor (DCSE) format. Our test data set comprised 17 sequences from the deuterostomes (1 xenoturbellid, 4 hemichordates, 5 echinoderms, 2 urochordates, 1 cephalochordate, and 4 vertebrates) and 6 out-group sequences (3 ecdysozoans and 3 lophotrochozoans). Sequences from additional taxa were downloaded from GenBank. Our bilaterian data set contained 72 taxa from the deuterostomes, ecdysozoans, and lophotrochozoans. We used the newly developed software Xstem and Ystem to manipulate the data sets and to extract the secondary structure information in DCSE in a format usable by the phylogeny software MrBayes (Huelsenbeck and Ronquist 2001), Tree-Puzzle (Schmidt et al. 2002), and Phase (http://bioinf.man.ac.uk/resources/phase/) (see Appendix).

Test Tree

Our initial approach was to evaluate the different models of evolution by comparing their likelihood maximized on one specific test tree. For this model evaluation, a data set of 23 bilaterian SSU sequences was used. Unreliably aligned positions were excluded, leaving 1563 positions in the analysis. A total of 680 sites were paired stem nucleotides and the remaining 883 were loop-bulge nucleotides (alignment available on request). The tree used for all comparisons of models is shown in figure 1. Ideally, one would optimize the topology along with all other parameters when performing the permutation tests, but this is computationally impossible. In spite of this we believe our results to be robust for two reasons. First, our tree is very likely to be approximately correct; it makes excellent sense biologically, and this topology or a very close variant of it is found by all of the models used in an MCMC Bayesian search (results not shown).
Second, we have repeated the same permutation tests using a different and probably slightly suboptimal topology (we moved Branchiostoma to be the sister group of the echinoderm-hemichordate-Xenoturbella clade) and got essentially identical results (not shown). We subsequently use the best model identified in this way under a Bayesian framework.

Phylogenetic Models

Five different models of sequence evolution were used:

(1) General time reversible (GTR). This condition does not consider the differences between stems and loops. A GTR model with discrete gamma (four categories), proportion of invariant sites (pinv), and base composition was used (written as GTR + I + G4).

(2) 2xGTR. Positions were divided into stem and loop positions, and each of these two sets was allowed to evolve as single nucleotides according to a separate model (each of the two models comprising the same parameters as the GTR model).

(3) RNA16A (for stems). All 16 character states are included in this model, and the model permits double substitutions, which have been shown to be important by Savill, Hoyle, and Higgs (2001). The rate matrix has 16 frequency parameters, one for each of the 16 possible pairs of nucleotides. The rate parameters group together different types of changes: αs models the single substitution rate, αd models the double substitution rate, β models the double transversion rate, γ models substitutions between paired and mismatched states, and ε models rates between mismatch states (2 parameters are fixed so there are 19 estimated parameters). Please see Savill, Hoyle, and Higgs (2001) for a detailed explanation of this model.

(4) RNA16HKY85 (for stems). Stem-pair evolution was also modeled using a 16-state model after Schöniger and von Haeseler (1999). This model has 16 frequency parameters for the 16 possible stem pairs and 2 rate parameters. The rate parameters differ significantly from those of RNA16A: in this case we do not consider classes of change between pairs of nucleotides but rather changes between single nucleotides within the stem pairs (in the case of RNA16HKY85, grouped into transitions and transversions). Simultaneous double transitions are disallowed (there are 16 estimated parameters).

(5) RNA16GTR (for stems). This is a more parameter-rich version of the above model; like the GTR model for unpaired nucleotides, it has six rate parameters: one for each single nucleotide change. Simultaneous double transitions are disallowed (20 estimated parameters).

Also included as parameters in models RNA16A, RNA16HKY85, and RNA16GTR were the shape parameter α of the gamma distribution with four discrete rate categories and the proportion of invariant sites.

Permutation Test of 2xGTR Model

We have used a permutation test to compare models rather than the more usual parametric bootstrap. The parametric bootstrap tests the null hypothesis that two nested models (with differing numbers of parameters) are indistinguishable. Parametric bootstrapping uses multiple data sets simulated according to the simpler of the two models under comparison.
In contrast, the permutation test approach (Felsenstein 2003), although clearly related to parametric bootstrapping, differs in that (1) the null hypothesis is that the partitioning of characters to different models has no effect and (2) the multiple data sets derive from a randomization of the initial data rather than from simulation. In short, we are asking whether partitioning characters according to our more complex model (e.g., considering stem nucleotides separately from loop nucleotides) is distinguishable, in terms of lnLikelihood, from a random partitioning of these characters. If it is, and if the likelihood is higher, then we conclude that the more complex model is preferable to the simpler model.

The first test is whether stem nucleotides and loop nucleotides come from a single population (i.e., they are evolving according to a single GTR model, the simpler model) or whether they come from different populations (and are therefore evolving according to different GTRs, the more complex model). We calculate the lnLikelihood of the more complex model, which corresponds to the data set correctly divided into stems and loops (real data set), using separate GTR models for stem and loop. This is compared to the lnLikelihoods calculated under the same dual model using 100 permutated samples in which the same nucleotides are randomly assigned to pseudo stem and loop classes of the same size as the real ones and therefore actually correspond to the simpler model. If the real lnLikelihood is outside the distribution of 95% of the permutated lnLikelihoods, then the stems and loops are inferred to be evolving according to different models (to come from different populations) with 95% confidence. An increase in lnLikelihood seen with twin models over a single model implies that we are justified in assigning stems and loops to separate models in our phylogenetic analyses. To simplify plotting the lnLikelihoods, we subtracted the lnLikelihoods of the single GTR from the lnLikelihoods of the double GTRs (2xGTR) for unpermutated and permutated data. The likelihoods were calculated using Phase on the fixed topology shown in figure 1. Branch lengths were also reestimated.

Permutation Tests of Models RNA16A, RNA16HKY85, and RNA16GTR

This permutation test approach was also taken when considering conditions RNA16A, RNA16HKY85, and RNA16GTR. Here, we are asking whether the stem nucleotides evolve according to models in which one side of a pair constrains the evolution of the other side of a pair (i.e., a 16-state, stem-pair model, the complex model) as opposed to a model in which this is not true and in which there is no correlation (equivalent to a four-state model, the simple model). For each of the three 16-state models we calculate the lnLikelihood when the model is fitted to the correct stem pairs. This is compared to the lnLikelihoods, calculated under the same model, of 100 permutated samples in which the nucleotides are randomly assigned to the same number of stem pairs; the randomization ensures that there is no correlation between members of pairs.
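The logic of the stem-versus-loop permutation test described above can be sketched in Python as follows. This is a toy under stated assumptions, not the Phase analysis: the lnLikelihood of a GTR model on a tree is replaced by a much simpler stand-in (the maximized multinomial lnLikelihood of base composition), the data are synthetic, and all function and variable names are hypothetical. Only the resampling scheme (same class sizes, random membership, 100 permutations) follows the text.

```python
import math
import random
from collections import Counter


def composition_lnl(bases):
    """Maximized multinomial lnL of base composition (toy stand-in for a GTR fit)."""
    n = len(bases)
    return sum(c * math.log(c / n) for c in Counter(bases).values())


def partition_gain(bases, is_stem):
    """lnL under two separate models minus lnL under one shared model."""
    stem = [b for b, s in zip(bases, is_stem) if s]
    loop = [b for b, s in zip(bases, is_stem) if not s]
    return composition_lnl(stem) + composition_lnl(loop) - composition_lnl(bases)


def permutation_test(bases, is_stem, n_perm=100, seed=1):
    """Compare the observed gain with gains from random pseudo stem/loop labels."""
    rng = random.Random(seed)
    observed = partition_gain(bases, is_stem)
    labels = list(is_stem)
    null = []
    for _ in range(n_perm):
        rng.shuffle(labels)  # class sizes preserved, membership randomized
        null.append(partition_gain(bases, labels))
    p_value = sum(g >= observed for g in null) / n_perm
    return observed, null, p_value


# Synthetic, made-up data: GC-rich "stems" followed by A-rich "loops".
bases = list("G" * 40 + "C" * 40 + "A" * 10 + "U" * 10
             + "A" * 50 + "U" * 20 + "G" * 15 + "C" * 15)
is_stem = [True] * 100 + [False] * 100
observed, null, p_value = permutation_test(bases, is_stem)
```

If the observed gain lies above the bulk of the permuted gains, the partition carries information that a random split of the same sites does not, which is the inference the authors draw for stems and loops.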
If the lnLikelihood of the correctly paired data set is outside the distribution of 95% of the lnLikelihoods of the permutated data, then we can say with 95% confidence that the stems do not evolve according to the uncorrelated model. An increase in lnLikelihood seen with 16-state models over a 4-state model implies that we would be justified in the use of 16-state models in our phylogenetic analyses. In order to permutate the data we randomly reordered columns of stem nucleotides while maintaining the set of parentheses indicating the secondary structure. To simplify plotting the lnLikelihoods, we subtracted the lnLikelihood of the GTR from the lnLikelihoods of the 16-state models (models RNA16A, RNA16HKY85, and RNA16GTR) for unpermutated and permutated stem data.

Phylogeny of the Bilateria

The Bayesian framework using the software MCMCPhase was used to estimate the phylogeny of a sample of 72 taxa from within the Bilateria using GTR for the loop regions and RNA16A for the stem regions. Unreliably aligned positions were excluded, leaving 1510 positions in the analysis. A total of 634 sites were paired stem nucleotides and the remaining 876 were loop-bulge nucleotides (alignment available on request). Random starting trees were used, and the parameters of the substitution models were treated as random variables and estimated during the analysis. An initial 1,000,000 generations were run, which was seen to be more than sufficient for the lnLikelihood estimates to plateau. A further 1,000,000 generations were run with sampling of the tree, including branch lengths, every 1,000th generation. The analyses were repeated four times with indistinguishable results, supporting the notion that a plateau had been reached. The MrBayes "sumt" command was used to generate a consensus tree with branch lengths and Bayesian clade support values from these samples. For comparison, the same analyses were run using a single GTR for both stem and loop positions.

FIG. 1. Test tree of deuterostomes plus protostomian out-groups used for all permutation test comparisons. Taxa shown: Barentsia, Ochetostoma, Phascolosoma, Pycnophyes, Priapulus, Gordius, Saccoglossus, Cephalodiscus, Balanoglossus, Ptychodera, Cassidulus, Polycheira, Ophiocanops, Astropecten, Dorometra, Xenoturbella, Halocynthia, Ciona, Alligator, Gallus, Homo, Lampetra, Branchiostoma.

Results and Discussion

Comparing GTR and 2xGTR: Parameter Estimates and Permutation Test

We have used our full test data set (23 species, 680 stem nucleotides, 883 loop nucleotides) and optimized the maximum likelihood estimates of parameters under the single GTR and under the 2xGTRs over our test tree (fig. 1). These comparisons reveal striking differences (fig. 2). Most obvious is the increased transition/transversion ratio (A↔G and C↔U) in stem regions versus loops and the increased frequency of adenine in loops, whereas cytosine and guanine are overrepresented in stems. These observations clearly match expectations, as secondary structure–conserving single substitutions are all transitions and G:C pairs are the most thermodynamically stable; hence the preponderance of these categories in stem regions (Vawter and Brown 1993). The overrepresentation of adenine in loops has been suggested to result from their use in various conserved structural motifs (Gutell et al. 2000).

The lnLikelihoods of the test tree under single and dual models are −9990.1 and −9904.78, respectively (difference = 85.32). The permutation test (fig. 3A) shows the significance of the use of separate models for stems and loops as opposed to using a single model. The observed improvement under the more complex dual model implies that the use of separate models for stems and loops is strongly supported.

Permutation Tests of Models RNA16A, RNA16HKY85, and RNA16GTR

We have used our stems-only test data set (23 species, 680 stem nucleotides/340 stem pairs) and optimized the maximum likelihood estimates of parameters under the single GTR and under the 16-state models RNA16A, RNA16HKY85, and RNA16GTR on our test tree (fig. 1). The lnLikelihood of the test tree under the 4-state GTR model is −3551.99, and under the 16-state correlated models RNA16A, RNA16HKY85, and RNA16GTR the values are −2788.29, −2823.82, and −2820.78, respectively. As before, the permutation tests (fig. 3B–D) demonstrate that the lnLikelihoods for the data set with correctly paired nucleotides are far outside the distribution of lnLikelihoods of the permutated data sets. The lnLikelihood is much higher for the correctly paired than for the permutated data sets, which implies that the sequences are indeed evolving according to the model in which nucleotide changes in stem pairs are correlated. The use of the 16-state, stem-pairing models described is clearly supported by these analyses.
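The stem-pair permutation described in the Methods (reordering columns of stem nucleotides while keeping the parentheses that mark the secondary structure) can be sketched in Python as below. This is a minimal illustration, not Xstem or Phase: it assumes the structure is given as a simple dot-bracket string rather than in DCSE format, and the names `parse_pairs` and `shuffle_stem_columns` are hypothetical.

```python
import random


def parse_pairs(structure):
    """Return sorted (i, j) column pairs from a dot-bracket string such as '((..))'."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return sorted(pairs)


def shuffle_stem_columns(alignment, structure, seed=0):
    """Reorder whole stem columns at random; loop columns and the structure stay put."""
    stem_cols = [i for i, ch in enumerate(structure) if ch in "()"]
    shuffled = stem_cols[:]
    random.Random(seed).shuffle(shuffled)
    source = dict(zip(stem_cols, shuffled))  # destination column -> source column
    return ["".join(seq[source.get(i, i)] for i in range(len(seq)))
            for seq in alignment]


# Made-up two-taxon alignment with one two-pair stem.
structure = "((..))"
alignment = ["GCAAGC", "GUAAAC"]
permuted = shuffle_stem_columns(alignment, structure)
```

Because whole columns are moved, each column keeps its own pattern of variation across taxa, while the columns that end up opposite each other in a pair are no longer the true partners; this is what removes the correlation between members of pairs.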
The large improvement in lnLikelihood using model RNA16A over models RNA16HKY85 and RNA16GTR seems to confirm its superiority; this is particularly impressive considering that RNA16GTR is actually marginally more parameter rich than RNA16A. We used model RNA16A for our analysis of metazoan phylogeny. There is very little improvement in lnLikelihood using model RNA16GTR over model RNA16HKY85 (difference = 3.045). As models RNA16HKY85 and RNA16GTR are nested, we are able to compare their lnLikelihoods using a chi-square test with 4 degrees of freedom for the four additional parameters in model RNA16GTR. The chi-square test shows that the improvement in likelihood of model RNA16GTR versus model RNA16HKY85 (using 2 × difference = 6.08 as the test statistic) is nonsignificant.

Superiority of Models that Account for Stem Pairing

We are unaware of a previous use of a permutation approach for studies of stem models. Using this approach we have been able to demonstrate for the first time the superiority of models of nucleotide evolution that divide rRNA sequences into sets of stem and loop nucleotides.
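The nested comparison of RNA16HKY85 and RNA16GTR reported above can be checked with a short Python calculation. The sketch uses the lnLikelihood values quoted in the text (−2823.82 and −2820.78) and the closed-form upper tail of the chi-square distribution, which holds for even degrees of freedom; the function name `chi2_sf_even_df` is hypothetical and introduced here only for illustration.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X >= x) for chi-square with an even number of df."""
    if df <= 0 or df % 2:
        raise ValueError("closed form used here requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i)
                                 for i in range(df // 2))


lnl_hky = -2823.82   # RNA16HKY85, from the text
lnl_gtr = -2820.78   # RNA16GTR, from the text
stat = 2 * (lnl_gtr - lnl_hky)     # likelihood ratio statistic, about 6.08
p = chi2_sf_even_df(stat, 4)       # 4 additional parameters in RNA16GTR
```

With these inputs the statistic falls below the 5% critical value for 4 degrees of freedom (about 9.49), in agreement with the nonsignificant result stated in the text.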


FIG. 2. Comparison of the maximum likelihood estimates of (A) exchangeability parameters (A↔C, A↔G, etc.) and (B) nucleotide frequencies (fA, fC, fG, fU) for loops and stems analyzed according to separate GTR models for stems and for loops and for combined loops + stems under the single GTR model. Exchangeability parameters cannot be estimated independently of each other, and G↔U for all three sets of characters is set to equal 1 as a reference.

Furthermore, we have used the permutation approach to confirm results (e.g., Schöniger and von Haeseler 1999) showing that 16-state, stem-pair models for stem regions of rRNA genes are appropriate and far superior to 4-state, single-nucleotide models. Models RNA16HKY85 and RNA16GTR were included for completeness because they are included in the popular MrBayes software. Our results show that, although they are almost certainly inferior to model RNA16A, they improve hugely over four-state models and their use should be encouraged.

FIG. 3. The results of permutation tests of the 2xGTR model and models RNA16A, RNA16HKY85, and RNA16GTR. (A) The differences in lnLikelihood between the single GTR and double GTR for 100 permutated data sets have been binned. (B, C, D) The differences in lnLikelihood between the single GTR for stems and the three different 16-state models for 100 permutated data sets have been binned. The differences for the unpermutated data sets are indicated by the arrow and in each case are far outside and greater than the distribution of permutated data, showing that the data do not conform to the simpler model and that the more complex model is superior.

Phylogeny of the Bilateria

Our Bayesian tree (fig. 4), derived using a combination of the 4-state GTR model for loops and the 16-state RNA16A model for stems, accords well with current ideas of bilaterian phylogeny (Adoutte et al. 2000). We see the three separate major clades of the Deuterostomia, Ecdysozoa, and Lophotrochozoa, although the tree was rooted between Deuterostomia and Protostomia so that the relative topology of these three branches has not been tested.

Within the Deuterostomia we recover the expected division between chordates and a clade containing the Ambulacraria (echinoderms plus hemichordates) and the worm Xenoturbella as their sister group. The 99% Bayesian support for the exclusion of Xenoturbella from within the Ambulacraria using this superior model of evolution adds to our confidence in this result (Bourlat et al. 2003), although it must be remembered that Bayesian support generally gives higher values than an equivalent nonparametric bootstrap.

Within the Ecdysozoa there is good support for the basal position of the priapulid and kinorhynch and also the nematomorphs. The branching pattern between nematodes, onychophorans, and tardigrades relative to Euarthropoda is not well resolved (low Bayesian support values). Within the Euarthropoda we find two principal branches: the chelicerates plus myriapods on one side and crustaceans plus insects on the other. This result is in keeping with recent studies (Friedrich and Tautz 1995; Boore, Lavrov, and Brown 1998; Cook et al. 2001), and all these clades receive high support values.

Within the Lophotrochozoa, although certain groups are well resolved (gastrotrichs, platyhelminths, bivalve mollusks, gastropod mollusks, polyplacophoran mollusks, pogonophorans, nemerteans, and rotifers), for the most part the relationships between these and other taxa are not well resolved. We do find weak support for a monophyletic assemblage of annelids that includes the sipunculid (now widely considered a derived annelid) and strong support for the groupings of rotifer plus cycliophoran plus entoprocts and gastrotrich plus platyhelminths, but other branches remain poorly supported. The lack of resolution within the Lophotrochozoa is in common with other analyses that have used SSU (Mallatt and Winchell 2002).

Comparison of the GTR + RNA16A tree and the single GTR tree (not shown) reveals a number of differences. Our most significant novel finding using GTR + RNA16A is the observation of two highly supported groups within the Lophotrochozoa: the gastrotrichs plus platyhelminths in one and all other lophotrochozoan phyla (including the rotifer-cycliophoran clade) in the other. Analyses with a single GTR model group the rotifer-cycliophoran clade with the gastrotrich-platyhelminth clade rather than with the other lophotrochozoans. This difference deserves to be examined further considering the demonstrated superiority of the evolutionary model we have used. The single GTR model also fails to group entoprocts with cycliophoran + rotifers and supports a number of unlikely associations: entoprocts with nemerteans, phoronid with annelids, and pogonophorans with gastropod mollusks. In general, the GTR + RNA16A model seems to be more agnostic in its placement of these taxa, resulting in a less well-resolved but more accurate tree. Overall, our method, although certainly not perfect with this data set, seems to have found a more credible tree than the simpler alternative.

FIG. 4. Phylogeny of the Bilateria estimated from 72 SSU genes using the GTR model for loop nucleotides and the RNA16A model for stem pairs. Branches that received <50% Bayesian support have been collapsed. Numbers to the left of the clades show Bayesian support values (except for the numbers followed by *, which are to the right of their clade for reasons of space). Well-supported clades are indicated with labels to the right. The three major bilaterian superphyla are shown in boxes. See text for further details.
Conclusions It is well established that phylogenetic methods perform better when the model of evolution is appropriate (Felsenstein 1978, Leitner, Kumar, and Albert, 1997; Sullivan and Swofford 1997; Posada and Crandall 2001). We have shown that the simple expedient of considering stem nucleotides independently of loop nucleotides significantly improves the estimated lnLikelihood. Moreover, the 16-state models we have used result in a further major improvement in the fit of the model to the evolution of the pairs of stem nucleotides. Although the consideration of more complex models of sequence evolution inevitably requires additional computer time, with the existence of computationally efficient methods such as the Bayesian MCMC approach used here, the analysis even of relatively large data sets is feasible. Although clearly desirable in terms of a measurably improved approximation to the true evolution of rRNA sequences, incorporation of secondary structure information is not a panacea. Our consideration of the phylogeny of the Bilateria using SSU sequences shows that much of our tree does not differ from previous analyses using standard four-state models. Additionally, areas of the tree,

particularly within the Lophotrochozoa, are still not resolved. rRNA genes are, nevertheless, the most widely sequenced genes across all living organisms (more than 20,000 SSU sequences in the European rRNA database), and such a significant improvement in using them for phylogenetic analyses may be hoped to further our understanding of the tree of life. We believe that our tree represents the best-justified SSU-based phylogeny of the Bilateria yet produced.

Acknowledgments

We would like to thank Nick Goldman for advice on permutation tests. The comments of the editor and two anonymous reviewers were also very much appreciated. M.J.T. was supported by the Wellcome Trust 060503/Z/00/Z. M.J.W. was supported during the course of this project by a Fellowship at Pembroke College, Cambridge, generously provided by Bristol-Myers Squibb. Xstem (written in Python) and Ystem (written in Perl) are not platform specific and are available on request from the authors.

Appendix

Software

The biggest obstacle to using secondary structure information in phylogenetics has been interpreting the secondary structure information contained in rRNA sequence databases and translating it into a format that phylogeny software packages can use. To overcome this, we have developed the software applications Xstem and Ystem.

Extracting Secondary Structure Information Using Xstem

Xstem parses the DCSE alignment format file and identifies the positions that make up the two halves of each stem pair in each sequence. The output can be written in formats readable by MrBayes and Phase, and in the 20-state character set proposed by Smith, Lui, and Tillier (2004).

Adding Taxa and Altering Alignments Using Ystem

Ystem allows conversion of the DCSE alignment into a Nexus file that can then be imported into programs such as MacClade, where new taxa can be added and the alignment altered.
Ystem allows reconversion of the altered Nexus file to the DCSE format, reintroducing the structural information from the original DCSE file while maintaining the new alignment and additional taxa.

Literature Cited

Adoutte, A., G. Balavoine, N. Lartillot, O. Lespinet, B. Prud'homme, and R. de Rosa. 2000. The new animal phylogeny: reliability and implications. Proc. Natl. Acad. Sci. USA 97:4453–4456.
Aguinaldo, A. A., J. M. Turbeville, L. S. Linford, M. C. Rivera, J. R. Garey, R. A. Raff, and J. A. Lake. 1997. Evidence for a clade of nematodes, arthropods and other moulting animals. Nature 387:489–493.
Boore, J. L., D. V. Lavrov, and W. M. Brown. 1998. Gene translocation links insects and crustaceans. Nature 392:667–668.


Bourlat, S. J., C. Nielsen, A. E. Lockyer, D. T. J. Littlewood, and M. J. Telford. 2003. Xenoturbella is a deuterostome that eats molluscs. Nature 424:925–928.
Cook, C. E., M. L. Smith, M. J. Telford, A. Bastianello, and M. E. Akam. 2001. Hox genes and the phylogeny of the arthropods. Curr. Biol. 11:759–763.
Felsenstein, J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401–410.
Felsenstein, J. 2003. Inferring phylogenies. Sinauer Associates Inc, Sunderland, Mass.
Friedrich, M., and D. Tautz. 1995. rDNA phylogeny of the major extant arthropod classes and the evolution of myriapods. Nature 376:165–167.
Gutell, R. R., J. J. Cannone, Z. Shang, Y. Du, and M. J. Serra. 2000. A story: unpaired adenosine bases in ribosomal RNAs. J. Mol. Biol. 304:335–354.
Haase, A., M. Stern, K. Wächtler, and G. Bicker. 2001. A tissue-specific marker of Ecdysozoa. Dev. Genes Evol. 211:428–433.
Hudelot, C., V. Gowri-Shankar, H. Jow, M. Rattray, and P. G. Higgs. 2003. RNA-based phylogenetic methods: application to mammalian mitochondrial RNA sequences. Mol. Phyl. Evol. 28:241–252.
Huelsenbeck, J. P., and F. R. Ronquist. 2001. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17:754–755.
Jow, H., C. Hudelot, M. Rattray, and P. G. Higgs. 2002. Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution. Mol. Biol. Evol. 19:1591–1601.
Leitner, T., S. Kumar, and J. Albert. 1997. Tempo and mode of nucleotide substitutions in gag and env gene fragments in human immunodeficiency virus type 1 populations with a known transmission history. J. Virol. 71:4761–4770.
Mallatt, J., and C. J. Winchell. 2002. Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes. Mol. Biol. Evol. 19:289–301.
Muse, S. V. 1995. Evolutionary analysis of DNA sequences subject to constraints on secondary structure. Genetics 139:1429–1439.
Otsuka, J., and N. Sugaya. 2003. Advanced formulation of base pair changes in the stem regions of ribosomal RNAs; its application to mitochondrial rRNAs for resolving the phylogeny of animals. J. Theor. Biol. 222:447–460.
Otsuka, J., G. Terai, and T. Nakano. 1999. Phylogeny of organisms investigated by the base-pair changes in the stem regions of small and large ribosomal subunit RNAs. J. Mol. Evol. 48:218–235.
Patterson, C. 1989. Phylogenetic relations of major groups: conclusions and prospects. Pp. 471–488 in B. Fernholm and H. Jörnvall, eds. The hierarchy of life. Elsevier, Amsterdam.
Posada, D., and K. A. Crandall. 2001. Selecting the best-fit model of nucleotide substitution. Syst. Biol. 50:580–601.
Rzhetsky, A. 1995. Estimating substitution rates in ribosomal RNA genes. Genetics 141:771–783.
Savill, N. J., D. C. Hoyle, and P. G. Higgs. 2001. RNA sequence evolution with secondary structure constraints: comparison of substitution rate models using maximum-likelihood methods. Genetics 157:399–411.
Schmidt, H. A., K. Strimmer, M. Vingron, and A. von Haeseler. 2002. TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18:502–504.
Schöniger, M., and A. von Haeseler. 1994. A stochastic model for the evolution of autocorrelated DNA sequences. Mol. Phyl. Evol. 3:240–247.
Schöniger, M., and A. von Haeseler. 1999. Toward assigning helical regions in alignments of ribosomal RNA and testing the appropriateness of evolutionary models. J. Mol. Evol. 49:691–698.
Smith, A. D., T. W. H. Lui, and E. R. M. Tillier. 2004. Empirical models for substitution in ribosomal RNA. Mol. Biol. Evol. 21:419–427.
Sullivan, J., and D. L. Swofford. 1997. Are guinea pigs rodents? The importance of adequate models in molecular phylogenies. J. Mamm. Evol. 4:77–86.
Tillier, E. R. M., and R. A. Collins. 1995. Neighbor joining and maximum likelihood with RNA sequences: addressing the interdependence of sites. Mol. Biol. Evol. 12:7–15.
Vawter, L., and W. M. Brown. 1993. Rates and patterns of base change in the small subunit ribosomal RNA gene. Genetics 134:597–608.
Wolf, Y. I., I. B. Rogozin, and E. V. Koonin. 2004. Coelomata and not Ecdysozoa: evidence from genome-wide phylogenetic analysis. Genome Res. 14:29–36.

Hervé Philippe, Associate Editor

Accepted January 25, 2005