An Empirical Approach Using Three Large DNA Data Sets for ...

Syst. Biol. 47(1):32–42, 1998

Inferring Complex Phylogenies Using Parsimony: An Empirical Approach Using Three Large DNA Data Sets for Angiosperms

DOUGLAS E. SOLTIS,1 PAMELA S. SOLTIS,1 MARK E. MORT,1 MARK W. CHASE,2 VINCENT SAVOLAINEN,2 SARA B. HOOT,3 AND CYNTHIA M. MORTON4

1 Department of Botany, Washington State University, Pullman, Washington 99164-4238, USA; E-mail: dsoltis@mail.wsu.edu (D.E.S.), psoltis@wsu.edu (P.S.S.), markmort@mail.wsu.edu (M.E.M.)
2 Molecular Systematics Section, Jodrell Laboratory, Royal Botanic Gardens, Kew, Richmond, TW9 3DS, United Kingdom; E-mail: mc03kg@lion.rbgkew.org.uk (M.W.C.)
3 Department of Biological Sciences, University of Wisconsin, Milwaukee, Wisconsin 53201, USA; E-mail: hoot@csd.uwm.edu
4 Department of Botany, University of Reading, Reading RG6 2AS, United Kingdom; E-mail: c.morton@reading.ac.uk

Abstract. To explore the feasibility of parsimony analysis for large data sets, we conducted heuristic parsimony searches and bootstrap analyses on separate and combined DNA data sets for 190 angiosperms and three outgroups. Separate data sets of 18S rDNA (1,855 bp), rbcL (1,428 bp), and atpB (1,450 bp) sequences were combined into a single matrix 4,733 bp in length. Analyses of the combined data set show great improvements in computer run times compared to those of the separate data sets and of the data sets combined in pairs. Six searches of the 18S rDNA + rbcL + atpB data set were conducted; in all cases TBR branch swapping was completed, generally within a few days. In contrast, TBR branch swapping was not completed for any of the three separate data sets, or for the pairwise combined data sets. These results illustrate that it is possible to conduct a thorough search of tree space with large data sets, given sufficient signal. In this case, and probably most others, sufficient signal for a large number of taxa can only be obtained by combining data sets. The combined data sets also have higher internal support for clades than the separate data sets, and more clades receive bootstrap support of ≥ 50% in the combined analysis than in analyses of the separate data sets. These data suggest that one solution to the computational and analytical dilemmas posed by large data sets is the addition of nucleotides, as well as taxa. [Large data sets, parsimony, phylogeny.]

Phylogenetic relationships in many large groups of organisms remain enigmatic despite intensive study. Elucidating relationships in these groups will ultimately require the compilation and phylogenetic analysis of sequences and/or morphological traits representing hundreds of taxa. The feasibility of phylogenetic analysis of such large data sets has been debated, however (e.g., Patterson et al., 1993; Hillis et al., 1994; Hillis, 1995). For example, Hillis et al. (1994) suggested that in some instances correct phylogeny reconstruction for only four taxa would require over 10,000 bp of DNA sequence. This degree of complexity implied much greater difficulty with larger data sets and stimulated some to propose that phylogenetic problems be broken into a series of smaller problems (e.g., Mishler, 1994; Kim, 1996; Soltis and Soltis, 1996; Rice et al., 1997), one extreme

being a large number of four-taxon questions (e.g., Graur et al., 1996). Large data sets also pose problems for parsimony analyses because of the large number of trees that must be examined in searching for the shortest tree(s). The number of potential solutions increases faster than exponentially as taxa are added (Felsenstein, 1978). For example, for 20 taxa there are approximately 8.87 × 10^23 possible rooted trees (Felsenstein, 1978); for 228 taxa (the number of species recently analyzed by Soltis et al., 1997b, in a phylogenetic analysis of angiosperms using nuclear 18S ribosomal DNA (rDNA) sequences), there are approximately 1.2 × 10^502 solutions (Hillis, 1996). Despite the dire predictions suggested for some four-taxon analyses (Hillis et al., 1994) and the number of possible trees for large data sets (Felsenstein, 1978), several

1998    SOLTIS ET AL.: LARGE DNA DATA SETS

phylogenetic analyses involving hundreds of species have been conducted for angiosperms, and the results of these studies have important implications for the analysis of large data sets. Analyses of three large DNA data sets (each with over 200 species) have been conducted, using the plastid genes rbcL (Chase et al., 1993) and atpB (Savolainen et al., 1996) and the nuclear 18S rDNA (Soltis et al., 1997b). All three analyses have yielded highly similar topologies for the angiosperms (reviewed by Chase and Cox, 1997; Soltis et al., 1997a; Chase and Albert, 1998). A more lengthy analysis of the 499-taxon rbcL data set of Chase et al. (1993) found shorter trees (Rice et al., 1997), but the general picture of angiosperm relationships remains unchanged. Significantly, none of the searches in any of these analyses swapped to completion, despite huge investments of computer time: Soltis et al. (1997b) employed over 2 years of computer time on the 18S rDNA analysis and Rice et al. (1997) devoted a total of "approximately 11.6 months of CPU time" using three Sun workstations in the reanalysis of the rbcL data set. The three gene trees (representing the plastid and nuclear genomes) are highly similar in the relationships they depict for all major groups of angiosperms, suggesting that even these rough estimates of phylogeny based on the individual data sets provide a consistent picture of organismal relationships. Hence, these analyses indicate that the phylogenetic analysis of large data sets may be more tractable than suggested by earlier simulation studies.

Hillis (1996) recently tested the feasibility of analyzing large data sets by simulating the 18S rDNA phylogeny for angiosperms based on 228 sequences (Soltis et al., 1997b). His simulations suggested that the model phylogeny can be reconstructed using either parsimony or neighbor-joining methods with > 99% accuracy with only 5,000 bp of sequence data. Initial empirical work (Soltis et al., 1997a) supported these conclusions. Separate and combined data sets for 232 species of angiosperms for which both rbcL and 18S rDNA sequences were available (over 3,000 bp of sequence


data) were analyzed using parsimony methods. Analysis of the combined data set swapped to completion in a few days; after 1 month, neither of the separate data sets had swapped to completion. In addition, combining the data sets greatly increased the internal support for many clades (as measured by parsimony jackknife values; Farris et al., 1996). Analyses of the rbcL + 18S rDNA data set generally result in trees having greater overall resolution than those inferred from the separate data sets and a combination of the well-supported clades present in the separate rbcL and 18S rDNA trees. Analysis of the combined data set also recovered several "uniquely supported" clades that received jackknife support of ≥ 50% in the analyses of the combined, but not the separate, data sets. These results are comparable to those observed with combined data sets in more focused studies involving far fewer taxa (e.g., Olmstead and Sweere, 1994; Soltis et al., 1996; Sullivan, 1996).

To explore further the feasibility of analysis of large data sets using parsimony, we conducted searches on a combined 18S rDNA + rbcL + atpB data set for 193 species. We also examined the effects of combining large data sets on internal support by conducting bootstrap analyses on the separate and combined data sets.

We first constructed separate data sets of 18S rDNA (1,855 bp), rbcL (1,428 bp), and atpB (1,450 bp) sequences for 193 taxa for which all three sequences were available; these three data sets were then combined into a single data matrix 4,733 bp in length. Represented in this matrix are 190 angiosperms from approximately 148 families that represent well the diversity of angiosperms; also included are three outgroups, Ephedra sinica, Ginkgo biloba, and Pseudotsuga menziesii (for atpB, Pinus was used in place of Pseudotsuga).
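The combination step described above amounts to joining the aligned sequences of each taxon end to end. The following is a minimal sketch of that idea, not the procedure the authors used: the function name `combine_matrices` and the toy sequences are invented for illustration, and it assumes taxon labels have already been harmonized across genes (for example, where a placeholder genus stands in for a family).

```python
from typing import Dict, List


def combine_matrices(matrices: List[Dict[str, str]]) -> Dict[str, str]:
    """Concatenate aligned per-gene matrices into one matrix, taxon by taxon."""
    taxa = list(matrices[0])  # taxon order follows the first matrix
    for matrix in matrices:
        if set(matrix) != set(taxa):
            raise ValueError("every matrix must contain exactly the same taxa")
        if len({len(seq) for seq in matrix.values()}) != 1:
            raise ValueError("sequences within a matrix must have equal length")
    return {taxon: "".join(m[taxon] for m in matrices) for taxon in taxa}


# Toy stand-ins for the 18S rDNA, rbcL, and atpB matrices (invented sequences).
toy_18s = {"Ginkgo": "ACGT", "Ephedra": "ACGA"}
toy_rbcl = {"Ginkgo": "TTG", "Ephedra": "TTC"}
toy_atpb = {"Ginkgo": "GG", "Ephedra": "GA"}

combined = combine_matrices([toy_18s, toy_rbcl, toy_atpb])
combined_width = len(combined["Ginkgo"])  # 4 + 3 + 2 columns in the toy case
reported_width = 1855 + 1428 + 1450       # gene lengths given in the text
print(combined_width, reported_width)
```

The last line simply confirms that the three reported gene lengths add up to the 4,733 bp stated for the combined matrix.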
Of the 190 angiosperms included, we used 18S rDNA, rbcL, and atpB sequences for the same genus (and species, if possible) for 137 taxa; in 52 instances, different genera were used as placeholders for a family. In one instance, different families of a sister pair were


SYSTEMATIC BIOLOGY

used: Cyperus (Cyperaceae) was used for 18S rDNA and rbcL, whereas Juncus (Juncaceae) was used for atpB. Most of the 18S rDNA and rbcL sequences are from Soltis et al. (1997b) and Chase et al. (1993), respectively; other sources of published sequences include Hoot and Crane (1995), Hoot et al. (1995), and Soltis and Soltis (1997). These were supplemented with additional unpublished sequences.

We conducted parsimony searches on the combined and the three separate data sets using PAUP* 4.0 (Swofford, 1997) and Power Macintosh computers. All parsimony searches were conducted as follows. First, 500 replicate heuristic searches with RANDOM taxon addition and NNI branch swapping were conducted, saving five trees per replicate. Using the shortest trees obtained from these initial searches as starting trees, we then conducted subsequent searches using TBR branch swapping and saving all most parsimonious trees. For the three separate data sets, these subsequent TBR searches were allowed to proceed until the search "stalled" on a tree length for 4 days or more and the number of trees in memory exceeded 3,500. At this point, we selected a new group of five starting trees one step longer than those used initially, and TBR searches were conducted as before. If the initial NNI searches did not produce trees one step longer, we used the next longest trees. This process was repeated three times for each of the three separate data sets. We almost certainly did not find the shortest trees via this approach. The goal of this study was not to ascertain phylogenetic relationships per se, but rather to compare the performance of separate versus combined analyses; thus, trees from the separate parsimony analyses are not presented.
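Heuristic searches of this kind are needed because exhaustive enumeration is out of reach for data sets of this size. As a rough illustration, the sketch below counts unrooted, fully bifurcating trees with the standard formula (2n - 5)!!. It assumes that the 228-taxon figure cited from Hillis (1996) refers to this class of trees; under that assumption the count comes out on the order of 10^502, in line with the value quoted in the text. The function names are illustrative only.

```python
def odd_double_factorial(k: int) -> int:
    """Return k!! = k * (k - 2) * ... * 3 * 1 for odd k (1 when k <= 1)."""
    product = 1
    while k > 1:
        product *= k
        k -= 2
    return product


def unrooted_bifurcating_trees(n_taxa: int) -> int:
    """Number of distinct unrooted, fully bifurcating trees for n_taxa >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    return odd_double_factorial(2 * n_taxa - 5)


def order_of_magnitude(count: int) -> int:
    """Power of ten of a positive integer, computed without floating point."""
    return len(str(count)) - 1


trees_5 = unrooted_bifurcating_trees(5)      # 3 * 5 = 15 trees
trees_228 = unrooted_bifurcating_trees(228)  # far too large for a float
print(trees_5, order_of_magnitude(trees_228))
```

Exact integer arithmetic is used throughout because the 228-taxon count exceeds the range of double-precision floating point.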
However, the shortest trees obtained agree closely with those presented elsewhere (see reviews by Soltis et al., 1997a; Chase and Albert, 1998). For the combined 18S rDNA + rbcL + atpB data set, a similar approach was used: After each TBR search, a new group of five starting trees that were one step longer than those used in the previous search was selected for further analysis. Where the initial NNI searches did not produce trees one step longer than those just used, the next longest NNI trees were selected as starting trees. This process was repeated six times for the combined 18S rDNA + rbcL + atpB data set.

Issues of congruence and whether or not multiple data sets should be combined to form a single data set have been the focus of considerable debate (see reviews by Bull et al., 1993; de Queiroz et al., 1995; Huelsenbeck et al., 1996; Johnson and Soltis, 1998). Visual comparison of the rbcL trees of Chase et al. (1993) with the 18S rDNA (Soltis et al., 1997b) and atpB (Savolainen et al., 1996) trees certainly suggests that these three molecular data sets are largely congruent. Using the 193-taxon data matrices assembled here, we assessed congruence using the "partition homogeneity test" command implemented in PAUP* 4.0. Tree lengths obtained from partitioning the data by gene were compared to those obtained from random partitions following Farris et al. (1995). We performed 500 replicate searches using NNI. The partition homogeneity test found the three data sets to be significantly different from random partitions of a single data set (P < 0.05): 18S rDNA–atpB, P = 0.02; 18S rDNA–rbcL, P = 0.02; atpB–rbcL, P = 0.004. Thus, based on these data sets, the greatest amount of incongruence actually involves the two plastid genes, data sets that many would argue could be combined a priori (see later discussion). Significant incongruence among matrices based on the test of Farris et al. (1995) does not mean incombinability. We have found, for example, that multiple DNA sequence data sets derived from the chloroplast genome are statistically incongruent, although the topologies retrieved from them each are highly similar (Johnson and Soltis, 1998).
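The logic of the randomization test can be sketched in a few lines. This is an illustration of the general idea only, not PAUP*'s implementation: the helper names are hypothetical, the summed tree lengths are invented, and the parsimony searches that would produce those lengths are left out. It assumes the P value is the share of partitions (the observed one included) whose summed tree length is no greater than the observed sum, which is one common convention for this test.

```python
import random
from typing import List, Sequence


def random_partitions(n_sites: int, sizes: Sequence[int],
                      rng: random.Random) -> List[List[int]]:
    """Shuffle column indices into partitions of the same sizes as the genes."""
    if sum(sizes) != n_sites:
        raise ValueError("partition sizes must add up to the number of sites")
    order = list(range(n_sites))
    rng.shuffle(order)
    parts, start = [], 0
    for size in sizes:
        parts.append(sorted(order[start:start + size]))
        start += size
    return parts


def partition_homogeneity_p(observed_sum: int,
                            random_sums: Sequence[int]) -> float:
    """Share of all partitions (observed one included) whose summed tree
    length is no greater than the observed sum."""
    as_short = sum(1 for s in random_sums if s <= observed_sum)
    return (as_short + 1) / (len(random_sums) + 1)


rng = random.Random(0)
# One random re-partition of 4,733 sites into blocks sized like the three genes.
parts = random_partitions(4733, [1855, 1428, 1450], rng)

# Invented summed tree lengths, for illustration only.
p_value = partition_homogeneity_p(100, [101, 102, 99, 105])
print([len(p) for p in parts], p_value)
```

In a real analysis each random partition would be searched for its shortest trees, and the sums of those lengths would form the null distribution against which the by-gene partition is compared.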
Nonetheless, because the chloroplast genome is inherited as a unit and is not subject to recombination, plastid data sets (both sequences and restriction sites) are generally considered "combinable." Furthermore, the sensitivity of this partition homogeneity test has not been


explored rigorously (e.g., Sullivan, 1996), particularly for large data sets. Perhaps this test is too conservative for large data sets, with a different placement of only one or a few small clades sufficient to indicate significant incongruence when the overall picture is one of general congruence.

To evaluate the effects of combining large data sets, we compared levels of support for clades in the separate 18S rDNA, rbcL, and atpB data sets with those for the combined 18S rDNA + rbcL + atpB data set, as well as for each combination of two data sets: 18S rDNA + rbcL; 18S rDNA + atpB; rbcL + atpB. We wanted to determine whether increased resolution and internal support are achieved in analyses of combined rather than separate data sets. To provide an estimate of internal support, we used the "fast" bootstrap option available on PAUP* 4.0 with 1,000 replicates. All of the clades of three or more species that received bootstrap support of ≥ 50% are provided in Table 1. The complete bootstrap results for all seven data sets are available at http://www.wsu.edu:8080/~soltilab/.

More clades having bootstrap values of ≥ 50% were recovered from the analysis of the rbcL (81) and atpB (88) data sets than for the 18S rDNA data set (48) (Table 1, Fig. 1). This result is not surprising given the slower rate of evolution and lower signal of 18S rDNA compared to rbcL and atpB (Hoot and Crane, 1995; Hoot et al., 1995; Nickrent and Soltis, 1995; Chase and Cox, 1997). Most of the clades having bootstrap values of ≥ 50% with 18S rDNA are a subset of those observed for rbcL and atpB (this is not to say that 18S rDNA data do not make an important overall contribution; see later discussion and Fig. 1). However, three clades receiving bootstrap support are unique to the 18S rDNA data (i.e., Aristolochiaceae; Chloranthus/Saururaceae/Piperaceae; Aristolochia/Asarum/Saruma/Lactoris).
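The two bookkeeping steps behind a bootstrap percentage can be sketched as follows: resampling alignment columns with replacement, and counting how often a clade appears among the replicate trees. This is a generic illustration rather than PAUP*'s "fast" bootstrap; the tree-building step is omitted, the function names are hypothetical, and the replicate clade sets are invented.

```python
import random
from typing import Dict, FrozenSet, List, Set


def resample_columns(alignment: Dict[str, str],
                     rng: random.Random) -> Dict[str, str]:
    """Draw alignment columns with replacement to build one pseudoreplicate."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in picks)
            for taxon, seq in alignment.items()}


def bootstrap_support(replicate_clades: List[Set[FrozenSet[str]]],
                      clade: FrozenSet[str]) -> float:
    """Percentage of replicate trees that contain the given clade."""
    hits = sum(1 for clades in replicate_clades if clade in clades)
    return 100.0 * hits / len(replicate_clades)


rng = random.Random(1)
toy = {"A": "ACGT", "B": "ACGA", "C": "TCGA"}  # invented sequences
pseudo = resample_columns(toy, rng)

# Invented clade sets standing in for the trees found in four replicates.
ab = frozenset({"A", "B"})
replicates = [{ab}, {ab}, {frozenset({"B", "C"})}, {ab}]
support = bootstrap_support(replicates, ab)  # 3 of 4 replicates -> 75.0
print(pseudo, support)
```

With 1,000 replicates, as used here, a clade reaching the 50% threshold would be present in at least 500 of the replicate trees.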
The number of clades receiving a bootstrap value of ≥ 50% in the analyses of the combined data sets increased compared to the separate data sets (Table 1; Fig. 1): 18S rDNA + rbcL, 92; 18S rDNA + atpB, 97; rbcL + atpB, 105; 18S rDNA + rbcL + atpB, 112. In many instances, clades supported by a bootstrap value of ≥ 50% in the separate analyses are more strongly supported in the combined analyses. Of the 46 clades receiving bootstrap support of ≥ 50% in at least one of the three individual data sets, 36 (78%) received increased bootstrap support when all three data sets were combined. The same percentage of two-species clades (not shown) also received increased support when data sets were combined. For approximately 13% of the clades showing an increase in bootstrap support, this increase is small (only 1 or 2%), and for another 9% of the clades, the increase is only 3–5%, but in most cases (78%), the increase is > 5%. Furthermore, over half of the clades that show an increase in bootstrap support of only 1 or 2% reflect increases from initial bootstrap values of 98–99% with the separate data sets to 100% with the combined data sets.

Many clades show a steady increase in bootstrap support when the separate and combined data sets are compared (Table 1). For example, the sister-group relationship of Saururus and Houttuynia with Peperomia and Piper receives bootstrap support > 50% in analyses of all three separate data sets (rbcL 91%; 18S rDNA 87%; atpB 86%); when data sets are combined in groups of two, these values increase to 96–98%; when all three data sets are combined, the value is 99%. A similar example is provided by the bootstrap values for the apioid clade (rbcL 84%; 18S rDNA 62%; atpB 88%); when data sets are combined in groups of two, the bootstrap values range from 92–98%; when all three data sets are combined, the value increases to 100% (Table 1). In many instances this increase in bootstrap support in the combined data sets is substantial, elevating a clade from moderate to high support.
For example, the caryophyllids s.l. receive bootstrap support of 59% with rbcL and 81% with atpB; although this clade is present in all of the shortest 18S rDNA trees observed, it does not receive bootstrap support ≥ 50%. When the data sets are combined, the


TABLE 1. Bootstrap percentages for a subset of the clades supported by bootstrap values of ≥ 50% in at least one of the seven sequence data sets (rbcL, 18S rDNA, atpB, rbcL + 18S rDNA, rbcL + atpB, 18S rDNA + atpB, rbcL + 18S rDNA + atpB). This table contains all of the larger clades detected, and of the three- and four-taxon clades but none of the numerous two-species clades. The names given for clades are informal and are given simply for convenience.

Clade

Basal clade
Austrobaileya/Illiciales
Chloranthus/Saururaceae/Piperaceae
Saururus/Houttuynia/Peperomia/Piper
Aristolochia/Asarum/Saruma/Lactoris
Aristolochiaceae
Nelumbo/Platanus/Placospermum
Ranunculids
Glaucidium/Ranunculus/Hydrastis/Nandina
Ranunculids minus Dicentra
Akebia/Sinofranchetia/Decaisnea
Glaucidium/Ranunculus/Hydrastis
Ranunculids-2
Ranunculids-3
Saxifragoids
Saxifragaceae s.s.
Crassulaceae
Saxifragaceae s.s. minus S. mertensiana
Dudleya/Sedum/Kalanchoe
Saxifragoids-1
Haloragis/Penthorum/Tetracarpaea
Sapindoids
Sapindoids-1
Ailanthus/Poncirus/Citrus
Sapindoids-2
Glucosinolates
Glucosinolates-1
Brassica/Capparis/Floerkea
Celastroids
Myrtoids
Alnus/Casuarina/Myrica/Carya
Fagales
Caryophyllids s.l.
Caryophyllids s.s.
Caryophyllids s.l./Simmondsia
Plumbago/Coccoloba/Nepenthes
Mirabilis/Phytolacca/Tetragonia
Ceanothus/Morus/Urtica/Humulus
Malpighioids
Cunonioids
Santaloids
Datisca/Coriaria/Abobra
Malvoids
Aucuba/Garrya/Eucommia
Apioids
Ericoids
Lamioids
Lamioids-1
Hydrophyllum/Phacelia/Bourreria
Lamioids-2
Asteroids
Campanula/Lobelia/Roussea
Cyperoids
Higher Eudicots
Eudicots

rbcL

55 89 91

18S

atpB

84 87 87 68 71

77 86

rbcL +

18S rbcL +

52 97 a 74 98 a

54 a 58 92

98 50 62

84 98

98

88 96 99 80

79 98 52 93 83 68 97 75 52 54

55 99 100 77

99 78 90 92 81 95

59 99

57 89 72 53 79

99

87 96 78

84

62

67 77 97

50 56

94 84 85 88 51 96 76 51 53

73

79

57 a 93 a 97 a

atpB

80 a 74 96 a 65

rbcL + 18S + atpB

99 a 99 a 68

76 a

51 73 96 76

atpB 18S +

81 a 100 a 84 a 50 b 50 b 86 a 100 a 93 b 52 b 93 a 98 a 99 80 51 b 63 51 99 100 62

84 a 99 53 b 99

96 a 99 a 88 a 59 b 92 a 58 b 89 a 90 a 99 a 63 b 53 b 90 a 51 b

81 a 63 a 74 a 100 a 63 80 a 90 a 95 a 100 a 57 a 97 a 56 b 96 a 98 a 100 a 94 85 a 73 a 99 100 98 a

52 b

100 a 56 a 77 a 67 b 88 a 100 a 95 a 92 a 87 a 97 a 82 a 66 a

65 a 100 a 61 b 88 a 97 a 98 a 98 b 92 a 91 a 98 a 81 b 91 a 100 a 100 a

65 b 100 72 94 a 88 94 a 97 a 52 b 66 a 96 a 72 55 a 93 a 81 b 85 90 a 72 96 a 77 b 87 a 99 a 90

81 a 58 a

68 a 70 a

70 b 71 b

66 b

67 94 a 100 a

52 b 84 a 61 a 84 a 100 a 79 a 82 a 93 a 58 b 98 a 100 a 51 98 a 65 b 98 a 99 a 100 a 95 a 84 a 66 a 99 100 94 a 72 86 96 a 100 a 58 b 72 a 100 a 74 a 86 a 99 a 100 a 98 a 92 a 84 100 a 87 b 97 a 100 a 100 a 66 b 86 a 62 a 62 67 b 74 b

TABLE 1. Continued.

Clade

Monocots
Asteridae s.l.
Angiosperms
Asteridae s.s.

rbcL    18S    atpB    rbcL + 18S    rbcL + atpB    18S + atpB    rbcL + 18S + atpB

53 b 91

100

97

100

100 a

52 b 100

55 b 58 b 100 53 b

a Clade that receives a higher bootstrap value in the analysis of any of the combined data sets.
b Clade that does not receive bootstrap support of > 50% in any of the three separate data sets, but does in the combined data sets.

bootstrap values increase greatly, with the highest value (96%) attained with the combination of all three data sets. Similarly, the clade Saxifragaceae s.s. has bootstrap values of 84% and 79% with rbcL and atpB, respectively; with the combination of all three data sets, support increases to 98%.

There are numerous examples in which a clade has bootstrap support of ≥ 50% with only one of the separate data sets, but when data sets are combined the bootstrap values increase substantially. This pattern is exemplified by the ranunculids, a clade observed in all of the shortest trees retrieved in all analyses. Of the three separate data sets, the ranunculids clade received bootstrap support ≥ 50% only with rbcL (51%); however, this value increases to 84% when all three data sets are combined. Similarly, the malpighioid clade has a bootstrap value of ≥ 50% only with rbcL (53%), but this increases to 86% when all three data sets are combined. Other examples of this same phenomenon include clades labelled asteroids, myrtoids, malvoids, ranunculids-2, ranunculids minus Dicentra, ranunculids-3, glucosinolates, and glucosinolates-1 (Table 1).

FIGURE 1. Number of angiosperm clades having bootstrap (BP) support ≥ 50% based on the phylogenetic analysis of the separate and combined DNA data sets for angiosperms.

Particularly noteworthy are clades that are present in all of the shortest trees but do not receive bootstrap support of ≥ 50% in analyses of the separate data sets; as data sets are combined, the level of bootstrap support rises. Ericoids are present with bootstrap support of < 50% in the separate data sets, but as the data sets are combined in pairs, bootstrap support increases to ≥ 50% in all pairwise combinations (rbcL + 18S rDNA 58%; rbcL + atpB 81%; 18S rDNA + atpB 77%); with the combination of all three data sets, bootstrap support reaches 87%. Other examples of small to moderate-sized clades in which this same phenomenon is observed are lamioids-2, saxifragoids, and caryophyllids s.l./Simmondsia. Significantly, a number of large clades do not exhibit bootstrap support of ≥ 50% until data sets are combined: higher eudicots, eudicots, monocots, Asteridae s.l., and Asteridae s.s. (Table 1). Two clades do not exhibit bootstrap support of ≥ 50% until all three data sets are combined: Nelumbo/Platanus/Placospermum (52%) and Asteridae s.s. (53%). Thus,


all of these clades represent what we have referred to previously as uniquely supported clades (Soltis et al., 1997a); only with the combination of data sets is there sufficient signal to yield bootstrap support of ≥ 50%.

Combining data sets therefore results in both an increase in bootstrap support for clades (compared to the separate data sets) and the recovery of clades with bootstrap support of ≥ 50% that did not receive this same level of bootstrap support in the analyses of the separate data sets. These observations apparently reflect the additive effect of the underlying phylogenetic signal provided by the separate data sets. Thus, although an individual data set may not provide enough signal for bootstrap support of ≥ 50% for a given clade in the separate analysis, that data set may contribute to an increased level of bootstrap support when data sets are combined. This seems to be particularly true for large clades for which bootstrap values in analyses of the separate data sets are < 50%. Analyses of all three separate data sets reveal (1) a eudicot clade, (2) higher eudicots, (3) Asteridae s.l., and (4) Asteridae s.s., but only with the combination of data sets does sufficient signal exist to provide a bootstrap value of ≥ 50% for any of these clades (Table 1). All three data sets contribute to this additive effect, including 18S rDNA, which has the lowest number of bootstrap-supported clades (see also Fig. 1). As an example, the Asteridae s.l. clade first attains bootstrap support ≥ 50% with the combination of 18S rDNA + atpB, but not with 18S rDNA + rbcL or rbcL + atpB. The 18S rDNA data clearly make a critical contribution to the support for the caryophyllids s.l./Simmondsia and saxifragoid clades as well (Table 1).

Although bootstrap support typically increases as data sets are added (Table 1), this is not always the case.
These instances are instructive in that they often pinpoint areas of conflict among the separate data sets. One example involves the monophyly of Aristolochiaceae. Although 18S rDNA sequence data support this clade with a bootstrap value of 71%, both the shortest


atpB and rbcL trees suggest instead that one member of Aristolochiaceae (Aristolochia) is sister to Lactoridaceae. Similarly, although atpB data indicate a monophyletic betuloid clade (bootstrap value of 92%) comprising five taxa (Alnus, Casuarina, Myrica, Carya, and Fagus), the other data sets are not in complete agreement, due largely to different placements of Myrica in the rbcL trees. This result is noteworthy because, although correctly reported in Chase et al. (1993), the rbcL sequences of Myrica and Celtis were accidentally switched when submitted to GenBank; hence, the present analysis reveals this "sequencing error." By combining data sets and comparing topologies to those resulting from the analyses of the separate data sets, areas in need of additional investigation (e.g., higher taxon density, resequencing of a taxon, etc.) can be identified.

In addition to producing trees with increased internal resolution and higher internal support, the combined data set also showed great improvements in computer run times (see also Olmstead and Sweere, 1994). With the 18S rDNA + rbcL + atpB data set, TBR branch swapping on the starting trees was generally completed within a few days (Table 2). Subsequent searches (six were conducted) with the 18S rDNA + rbcL + atpB data set continued to swap to completion, even as longer starting trees were employed (Table 2). Two of these six searches each retrieved the shortest trees obtained (18,770 steps); these "shortest" trees were in two different islands (sensu Maddison, 1991). These results indicate that it is possible to conduct a rigorous search of tree space with large data sets, given sufficient signal and taxa. In this case, and most others, we suspect, sufficient signal for a large number of taxa can only be obtained by combining data sets.
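Table 2 reports, for each search, the length of the starting trees and of the shortest trees found. The sketch below tabulates the gap between the two, both in steps and as a proportion of the shortest length. The tree lengths are transcribed from Table 2; the helper names `mean_gap` and `relative_gap` are illustrative and not part of the original analysis. The means come out at 41, about 22.7, about 16.7, and 13 steps for 18S rDNA, rbcL, atpB, and the three-gene data set, respectively.

```python
from statistics import mean

# (starting length, shortest length) pairs transcribed from Table 2.
searches = {
    "18S rDNA": [(3174, 3135), (3176, 3135), (3177, 3134)],
    "rbcL": [(7548, 7526), (7549, 7527), (7549, 7525)],
    "atpB": [(7745, 7734), (7747, 7730), (7749, 7727)],
    "18S rDNA + rbcL + atpB": [
        (18780, 18770), (18781, 18772), (18784, 18770),
        (18786, 18771), (18787, 18774), (18787, 18770),
    ],
}


def mean_gap(pairs):
    """Average number of steps between starting and shortest trees."""
    return mean(start - shortest for start, shortest in pairs)


def relative_gap(pairs):
    """Mean gap as a proportion of the shortest tree length found."""
    return mean_gap(pairs) / min(shortest for _, shortest in pairs)


for name, pairs in searches.items():
    print(f"{name}: {mean_gap(pairs):.1f} steps, {relative_gap(pairs):.5f}")
```

Expressing the gap as a proportion makes the searches comparable even though the combined matrix yields much longer trees than any single gene.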
In contrast, swapping was not completed with any of the three separate data sets. After several days to a week, the separate data sets stalled on a tree length, at which point thousands of trees of that length accumulated. Branch swapping was continued for 4 or more days until 3,500 or more trees of this length were in memory;

TABLE 2. General results for parsimony searches of separate and combined data sets.

Data set                 Search swap to   Length of        Length of        Number of        Time of analysis
                         completion?      starting trees   shortest trees   shortest trees   (h:min:sec)a
18S-1                    no               3,174            3,135            > 3,800          137:29:60.0
18S-2                    no               3,176            3,135            > 5,300          177:28:52.5
18S-3                    no               3,177            3,134            > 5,300          221:33:51.6
rbcL-1                   no               7,548            7,526            > 4,457          320:39:38.6
rbcL-2                   no               7,549            7,527            > 5,716          483:12:33.0
rbcL-3                   no               7,549            7,525            > 5,198          410:13:09.2
atpB-1                   no               7,745            7,734            > 5,351          25:20:30.9
atpB-2                   no               7,747            7,730            > 3,627          6:18:31.0
atpB-3                   no               7,749            7,727            > 3,500          19:49:19.4
18S + rbcL               no               10,977           10,898           > 5,300          673:14:28.7
18S + atpB               no               11,081           11,069           > 5,394          754:41:48.3
rbcL + atpB              no               15,425           15,417           > 2,371          173:25:39.8
18S + rbcL + atpB-1      yes              18,780           18,770           18               72:48:10.8
18S + rbcL + atpB-2      yes              18,781           18,772           67               44:38:01.9
18S + rbcL + atpB-3      yes              18,784           18,770           6                10:12:11.0
18S + rbcL + atpB-4      yes              18,786           18,771           210              168:52:37.1
18S + rbcL + atpB-5      yes              18,787           18,774           18               19:38:23.2
18S + rbcL + atpB-6      yes              18,787           18,770           42               18:27:30.8

a The run times for the rbcL searches are much longer than those for the separate 18S rDNA and atpB searches because the rbcL searches were conducted on a Power Macintosh 6100 rather than on a 7100.

then these searches were terminated. For each data set, the searches typically stalled on trees of the same, or very similar, length, regardless of the length of the starting trees used (Table 2).

These results for the separate data sets are not surprising. As noted, our previous analyses of 228 18S rDNA sequences (Soltis et al., 1997b) did not swap to completion despite over 2 years of computer time. Lengthy searches of separate rbcL and atpB data sets of comparable size also have not swapped to completion.

We experimented less extensively with computer run times for the combination of pairs of data sets. Only one set of starting trees was used for the searches conducted with 18S rDNA + atpB, 18S rDNA + rbcL, and rbcL + atpB. Nonetheless, subsequent swapping on starting trees with TBR was not completed in the time frame employed (Table 2). The 190 angiosperms employed in this analysis are a subset both of the 228-taxon 18S rDNA + rbcL data set analyzed by Soltis et al. (1997a) and of a 270-taxon 18S rDNA + rbcL data set analyzed by Soltis et al. (unpubl.). Although the 190-taxon data set did not swap to completion, analyses of both the 228- and 270-taxon 18S rDNA + rbcL data sets did. Thus, although perhaps counterintuitive, these results illustrate another advantage of combining sequences in studies of this magnitude: The time required for data analysis actually decreases with the addition of more characters and taxa. These empirical data therefore complement the simulations of both Hillis (1996) and Graybeal (1998), which similarly reveal the importance of adding taxa in phylogenetic analyses. The ultimate advantage of decreased run times is the opportunity for more thorough searches of tree space, despite the large numbers of taxa involved and the enormous numbers of possible trees (Felsenstein, 1978).

The explanation for why analyses of the combined data sets swap to completion, whereas the individual data sets do not, may be, in part, the fact that analyses of the combined data sets quickly and consistently find trees very close in length to the shortest trees ultimately obtained (Chase and Cox, 1997; Chase et al., in prep.). Our data lend support to this hypothesis. For example, comparison of the length of the starting trees with the shortest trees obtained for the individual data sets (Table 2) reveals an average difference of 41, 23, and 16 steps for 18S rDNA, rbcL, and atpB, respectively. When all three data sets are combined, the starting trees obtained are very close in length to the shortest trees ultimately obtained, with an average difference of only 13 steps (Table 2). This disparity may be larger given that the individual searches probably did not find the true shortest trees (as they did not swap to completion), whereas the analyses of the combined 18S rDNA + rbcL + atpB data set always swapped to completion. In addition, the average value of 13 steps for the 18S rDNA + rbcL + atpB data set is inflated because we experimented more extensively with the use of longer and longer starting trees with this combined data set; the smallest difference between lengths of starting and shortest trees was only 9 steps (with 18S rDNA + rbcL + atpB-2; Table 2). Most importantly, when the difference between the lengths of the starting and shortest trees is viewed as a proportion of the length of the shortest trees, the value for the combined analysis (13/18,770 = 0.00069) is much less than those for any of the separate analyses (41/3,134 = 0.013 for 18S rDNA, 23/7,525 = 0.0031 for rbcL, and 16/7,727 = 0.0021 for atpB), demonstrating that the combined analysis much more easily finds trees close to the shortest trees ultimately found. These empirical results for large data sets support the contention of Hillis (1996) that adding a large number of taxa and increasing the number of base pairs in phylogenetic analyses may not only increase the accuracy of the estimated trees, but also reduce the computational difficulty of the inference process.
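The proportional comparison in the preceding paragraph can be checked with a few lines of arithmetic. The sketch below is illustrative only: the step differences and shortest-tree lengths are the values quoted in the text (from Table 2), while the variable names are our own and are not part of the original analysis.

```python
# Average difference (in steps) between starting and shortest trees, and the
# length of the shortest trees found, as quoted in the text (Table 2).
# The dictionary layout and names are assumptions made for illustration.
searches = {
    "18S rDNA": (41, 3134),
    "rbcL": (23, 7525),
    "atpB": (16, 7727),
    "18S rDNA + rbcL + atpB": (13, 18770),
}

# Express each difference as a proportion of the shortest-tree length.
proportions = {
    name: difference / shortest_length
    for name, (difference, shortest_length) in searches.items()
}

for name, proportion in proportions.items():
    print(f"{name}: {proportion:.5f}")

# The data set whose starting trees are proportionally closest to the shortest trees.
closest = min(proportions, key=proportions.get)
```

Scaling by tree length matters because the combined matrix yields much longer trees; a raw difference of 13 steps is a far smaller fraction of 18,770 steps than 16 steps is of 7,727.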
In his simulation studies (based on the estimated phylogeny of 228 angiosperms inferred from 18S rDNA sequences provided by Soltis et al., 1997b), Hillis (1996) attributed the unexpected ease of phylogeny reconstruction for this large data set to the fact that homoplasy was likely dispersed across the many branches of the tree, allowing the phylogenetic signal to be detected. The results of the empirical approach presented here for three combined data sets representing 193 taxa and nearly 5,000 nucleotides lend support to this contention. Combining rbcL, atpB, and 18S rDNA data sets for a similar suite of 193 taxa increases the internal support (as measured by the "fast" bootstrap option on PAUP* 4.0) for many clades, often dramatically. Analyses of combined data sets generally reveal complementarity; that is, the well-supported clades present in the separate rbcL, atpB, and 18S rDNA trees are all present. Parsimony analyses of large, combined data sets also reveal clades with bootstrap support of > 50% that did not receive this same level of bootstrap support in the analyses of the separate data sets. Analyses of combined data matrices also reveal greater overall resolution of relationships and a much higher number of clades with bootstrap support > 50% than do analyses of any of the three separate data sets. Patterns of complementarity and increased support for clades observed here with the combined 18S rDNA + rbcL + atpB data set, as well as for each pair of data sets, are similar to those observed in analyses of combined rbcL and 18S rDNA data sets for 232 taxa (Soltis et al., 1997a) and also in more focused studies (e.g., Olmstead and Sweere, 1994; Hoot and Crane, 1995; Hoot et al., 1995, in press; Johnson and Soltis, 1995; Soltis et al., 1996; Sullivan, 1996). Furthermore, combined data sets show an enormous improvement in computer run times. Thus, for these reasons, large data sets may be more tractable than previously considered (e.g., Hillis, 1995; Patterson et al., 1993). We conclude, in support of the simulation studies of Hillis (1996) and Graybeal (1998), that one solution to the computational/analytical dilemmas posed by large data sets is the addition of nucleotides, as well as taxa.
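As a rough illustration of what combining data sets and bootstrapping characters involve, the following sketch concatenates per-gene alignments and draws one bootstrap pseudoreplicate by resampling alignment columns with replacement. It is a minimal sketch under stated assumptions: the function names and toy sequences are hypothetical, and it does not reproduce the "fast" bootstrap option of PAUP* 4.0 or any tree search.

```python
import random


def concatenate(matrices):
    """Join per-gene alignments (taxon -> aligned sequence) into one matrix.

    Only taxa present in every alignment are kept (an assumption of this sketch).
    """
    shared_taxa = sorted(set.intersection(*(set(m) for m in matrices)))
    return {taxon: "".join(m[taxon] for m in matrices) for taxon in shared_taxa}


def bootstrap_replicate(matrix, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(matrix.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[c] for c in columns) for taxon, seq in matrix.items()}


# Toy alignments, invented purely for illustration.
gene_a = {"taxon1": "ACGT", "taxon2": "ACGA", "taxon3": "ACTA"}
gene_b = {"taxon1": "GGC", "taxon2": "GAC", "taxon3": "GAT"}

combined = concatenate([gene_a, gene_b])
replicate = bootstrap_replicate(combined, random.Random(0))
```

In a full analysis each pseudoreplicate would be passed to a parsimony search, and the support for a clade would be the percentage of replicates in which that clade is recovered.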
The exploratory analyses of a combined 18S rDNA + rbcL + atpB data set indicate that a well-resolved, strongly supported phylogenetic tree for a data set of 400–500 angiosperms not only is possible, but may be achieved within the next 2 years.

ACKNOWLEDGMENTS

This work was supported in part by National Science Foundation grants DEB-9307000 and DBI-9512890. We thank David Swofford for access to test versions of PAUP* 4.0; we also thank Anna Graybeal, Dick Olmstead, and two anonymous reviewers for helpful comments on the manuscript.

REFERENCES

Bull, J. J., J. P. Huelsenbeck, C. W. Cunningham, D. L. Swofford, and P. J. Waddell. 1993. Partitioning and combining data in phylogenetic analysis. Syst. Biol. 42:384–397.

Chase, M. W., and V. A. Albert. 1998. A perspective on the contribution of plastid rbcL DNA sequences to angiosperm phylogenetics. Pages 488–507 in Molecular systematics of plants II (D. E. Soltis, P. S. Soltis, and J. J. Doyle, eds.). Chapman and Hall, New York.

Chase, M. W., and A. V. Cox. 1997. Gene sequences, collaboration, and analysis of large data sets. Aust. Syst. Bot. (in press).

Chase, M. W., D. E. Soltis, R. G. Olmstead, D. Morgan, D. H. Les, B. D. Mishler, M. R. Duvall, R. A. Price, H. G. Hills, Y.-L. Qiu, K. A. Kron, J. H. Rettig, E. Conti, J. D. Palmer, J. R. Manhart, K. J. Sytsma, H. J. Michaels, W. J. Kress, K. G. Karol, W. D. Clark, M. Hedrén, B. S. Gaut, R. K. Jansen, K.-J. Kim, C. F. Wimpee, J. F. Smith, G. R. Furnier, S. H. Strauss, Q.-Y. Xiang, G. M. Plunkett, P. S. Soltis, S. M. Swensen, S. E. Williams, P. A. Gadek, C. J. Quinn, L. E. Eguiarte, E. Golenberg, G. H. Learn, Jr., S. W. Graham, S. C. H. Barrett, S. Dayanandan, and V. A. Albert. 1993. Phylogenetics of seed plants: An analysis of nucleotide sequences from the plastid gene rbcL. Ann. Mo. Bot. Gard. 80:528–580.

de Queiroz, A., M. J. Donoghue, and J. Kim. 1995. Separate versus combined analysis of phylogenetic evidence. Annu. Rev. Ecol. Syst. 26:657–681.

Farris, J. S., M. Källersjö, A. G. Kluge, and C. Bult. 1995. Testing significance of incongruence. Cladistics 10:315–319.

Farris, J. S., V. A. Albert, M. Källersjö, D. Lipscomb, and A. G. Kluge. 1996. Parsimony jackknifing outperforms neighbor-joining. Cladistics 12:99–124.

Felsenstein, J. 1978. The number of evolutionary trees. Syst. Zool. 27:27–33.

Graur, D., L. Duret, and M. Gouy. 1996. Phylogenetic position of the order Lagomorpha (rabbits, hares and allies). Nature 379:333–335.

Graybeal, A. 1998. Is it better to add taxa or characters to a difficult phylogenetic problem? Syst. Biol. 47:9–17.

Hillis, D. M. 1995. Approaches for assessing phylogenetic accuracy. Syst. Biol. 44:3–16.

Hillis, D. M. 1996. Inferring complex phylogenies. Nature 383:130.

Hillis, D. M., J. P. Huelsenbeck, and D. L. Swofford. 1994. Hobgoblin of phylogenetics? Nature 369:363–364.

Hoot, S. B., and P. R. Crane. 1995. Inter-familial relationships in the Ranunculidae based on molecular systematics. Plant Syst. Evol. 9(suppl.):119–131.

Hoot, S. B., A. Culham, and P. R. Crane. 1995. Phylogenetic relationships of the Lardizabalaceae and Sargentodoxaceae: Chloroplast and nuclear DNA sequence evidence. Plant Syst. Evol. 9(suppl.):195–199.

Hoot, S. B., J. W. Kadereit, F. R. Blattner, K. B. Jork, A. E. Schwarzbach, and P. R. Crane. In press. Data congruence and phylogeny of the Papaveraceae s.l. based on four data sets: atpB and rbcL sequences, trnK restriction sites, and morphological characters. Syst. Bot.

Huelsenbeck, J. P., J. J. Bull, and C. W. Cunningham. 1996. Combining data in phylogenetic analysis. Trends Ecol. Evol. 11:152–158.

Johnson, L. A., and D. E. Soltis. 1995. Phylogenetic inference in Saxifragaceae sensu stricto and Gilia (Polemoniaceae) using matK sequences. Ann. Mo. Bot. Gard. 82:149–175.

Johnson, L. A., and D. E. Soltis. 1998. Assessing congruence: Empirical examples from molecular data. Pages 297–348 in Molecular systematics of plants II (D. E. Soltis, P. S. Soltis, and J. J. Doyle, eds.). Chapman and Hall, New York.

Kim, J. 1996. General inconsistency conditions for maximum parsimony: Effects of branch lengths and increasing numbers of taxa. Syst. Biol. 45:363–374.

Maddison, D. R. 1991. The discovery and importance of multiple islands of most-parsimonious trees. Syst. Zool. 40:315–328.

Mishler, B. D. 1994. Cladistic analysis of molecular and morphological data. Am. J. Phys. Anthropol. 94:143–156.

Nickrent, D. L., and D. E. Soltis. 1995. A comparison of angiosperm phylogenies from nuclear 18S rDNA and rbcL sequences. Ann. Mo. Bot. Gard. 82:208–234.

Olmstead, R. G., and J. A. Sweere. 1994. Combining data in phylogenetic systematics: An empirical approach using three molecular data sets in the Solanaceae. Syst. Biol. 43:467–481.

Patterson, C., D. M. Williams, and C. J. Humphries. 1993. Congruence between molecular and morphological phylogenies. Annu. Rev. Ecol. Syst. 24:153–188.

Rice, K. A., M. J. Donoghue, and R. G. Olmstead. 1997. Analyzing large data sets: rbcL 500 revisited. Syst. Biol. 46:554–563.

Savolainen, V., C. M. Morton, S. B. Hoot, and M. W. Chase. 1996. An examination of phylogenetic patterns of plastid atpB gene sequences among eudicots. Am. J. Bot. 83(suppl.):541.

Soltis, D. E., C. Hibsch-Jetter, P. S. Soltis, M. W. Chase, and J. S. Farris. 1997a. Molecular phylogenetic relationships among angiosperms: An overview based on rbcL and 18S rDNA sequences. J. Plant Res. (in press).

Soltis, D. E., R. K. Kuzoff, E. Conti, R. Gornall, and K. Ferguson. 1996. matK and rbcL gene sequence data indicate that Saxifraga (Saxifragaceae) is polyphyletic. Am. J. Bot. 83:371–382.

Soltis, D. E., and P. S. Soltis. 1997. Phylogenetic relationships in Saxifragaceae s.l.: A comparison of topologies based on 18S rDNA and rbcL sequences. Am. J. Bot. 84:504–522.

Soltis, D. E., P. S. Soltis, D. L. Nickrent, L. A. Johnson, W. J. Hahn, S. B. Hoot, J. A. Sweere, R. K. Kuzoff, K. A. Kron, M. W. Chase, S. M. Swensen, E. A. Zimmer, S.-M. Chaw, L. J. Gillespie, W. J. Kress, and K. J. Sytsma. 1997b. Angiosperm phylogeny inferred from 18S ribosomal DNA sequences. Ann. Mo. Bot. Gard. 84:1–49.

Soltis, P. S., and D. E. Soltis. 1996. Phylogenetic analysis of large molecular data sets. Bol. Soc. Bot. Mex. 59:99–113.

Sullivan, J. 1996. Combining data with different distributions of among-site rate variation. Syst. Biol. 45:375–380.

Swofford, D. L. 1997. PAUP*: Phylogenetic analysis using parsimony, version 4.0. Sinauer, Sunderland, Massachusetts.

Received 8 May 1997; accepted 13 June 1997
Associate Editor: D. Cannatella