A Simple Hierarchical Approach to Modeling Distributions of Substitution Rates

Sergei L. Kosakovsky Pond and Simon D. W. Frost
Antiviral Research Center, University of California San Diego, San Diego, California

Genetic sequence data typically exhibit variability in substitution rates across sites. In practice, there is often too little variation to fit a different rate for each site in the alignment, but the distribution of rates across sites may not be well modeled using simple parametric families. Mixtures of different distributions can capture more complex patterns of rate variation, but are often parameter-rich and difficult to fit. We present a simple hierarchical model in which a baseline rate distribution, such as a gamma distribution, is discretized into several categories, the quantiles of which are estimated using a discretized beta distribution. Although this approach involves adding only two extra parameters to a standard distribution, a wide range of rate distributions can be captured. Using simulated data, we demonstrate that a "beta-gamma" model can reproduce the moments of the rate distribution more accurately than the distribution used to simulate the data, even when the baseline rate distribution is misspecified. Using hepatitis C virus and mammalian mitochondrial sequences, we show that a beta-gamma model can fit as well as or better than a model with multiple discrete rate categories, and compares favorably with a model that fits a separate rate category to each site. We also demonstrate this discretization scheme in the context of codon models specifically aimed at identifying individual sites undergoing adaptive or purifying evolution.

Introduction

Genetic sequences typically show extensive variation across sites in diversity; for example, in coding regions, some sites may be under functional constraints, and hence may show little variation, while other sites may be under diversifying selection and are therefore highly variable. It is important to characterize this variation, as it provides useful information on the role of selection on the sequences. Estimation of the rate at which a site evolves can be used to detect positive or negative selection at a site, by comparing the rates of nonsynonymous and synonymous substitution at that site (Nielsen and Yang 1998; Suzuki and Gojobori 1999; Yang et al. 2000; Nielsen and Yang 2003). The evolutionary rate of a site may also vary in different phylogenetic subtrees, for example, because of "covarion" type evolution (Fitch and Markowitz 1970; Fitch 1971; Fitch and Ayala 1994), which may also reflect selection. Failure to take rate heterogeneity into account can lead to errors in phylogenetic reconstruction and underestimation of divergence times (Yang 1996).

Broadly speaking, there have been two kinds of approaches to fit variable rates to sequence data. In the first approach, a different rate is fitted to each site in the sequence, either by maximum likelihood (Olsen, Pracht, and Overbeek 1994; Kelly and Rice 1996; Nielsen 1997; Meyer and von Haeseler 2003) or by distance-based methods (Gu and Zhang 1997; Pesole and Saccone 2001; Horner and Pesole 2003). Although this approach makes no assumption regarding the distribution of rates across sites, it involves the estimation of many parameters, particularly for long sequences, and hence approximations are required to reduce the computational burden. In addition, for smaller data sets, such a "fixed effects" approach may lack statistical power, or may overfit the data, leading to biased rate estimates.
Key words: substitution rates, hierarchical model, adaptive evolution, hepatitis C, model selection, parallel algorithms. E-mail: [email protected]. Mol. Biol. Evol. 22(2):223–234. 2005 doi:10.1093/molbev/msi009 Advance Access publication October 13, 2004

In the second type of approach, an underlying rate distribution is assumed and fitted to sequence data. This "random effects" approach provides estimates of the distribution parameters (which can be compared between different genes or populations) and can provide empirical Bayes estimates of the evolutionary rate at each site. As rate variation is modeled using relatively few extra parameters, this approach has more statistical power to estimate rates by pooling information across sites. The gamma distribution is often used to model rate variation (Yang 1993, 1994) for reasons of computational simplicity, although there are some theoretical justifications for its use (Felsenstein 2001).

In practice, distributions belonging to simple parametric families are often too simple to capture the pattern of rate variation. One approach to this problem is to fit mixtures of different distributions. In the context of models of codon substitution, Yang et al. (2000) proposed the use of distributions ranging in complexity from several discrete categories to mixtures of beta and normal distributions. However, such mixture models are notoriously difficult to fit and are often computationally intensive. As it is not clear a priori which rate distribution is most appropriate, a large number of models may have to be explored, exacerbating the problem.

There is clearly a need for models of rate variation that are simple yet flexible. Susko et al. (2003) suggested fitting distributions with a large (100) but a priori fixed number of rate classes to the data. Morozov et al. (2000) proposed Fourier and wavelet models of rate variation, which provided a flexible, nonparametric estimate of variation in substitution rates. However, this approach was extremely computationally challenging, and its fit cannot be directly compared to the fit of a parametric distribution.
We present a simple method of discretization of a rate distribution, where the quantiles of the rate distribution are themselves modeled using a discretized beta distribution. We show that this hierarchical model of rate variation gives better fits to data sets with highly variable rate distributions than standard models of rate variation, and it is computationally faster than fitting mixtures of distributions.

Molecular Biology and Evolution vol. 22 no. 2 © Society for Molecular Biology and Evolution 2005; all rights reserved.


Materials and Methods

Nucleotide Substitution Models

Because it has been suggested (Muse 1999; Huelsenbeck, Larget, and Alfaro 2004) that the best-fitting model of sequence evolution may not be one of the "standard" models, we employed a likelihood-based model selection procedure to identify the most appropriate simplification of the general time-reversible (GTR) nucleotide substitution model (Lanave et al. 1984; Tavaré 1986; Rodriguez et al. 1990), described in detail in the Supplemental Material online. The instantaneous substitution rate matrix for the GTR model is defined by

$$Q_{REV}(dt) = \begin{pmatrix} * & \pi_C R_{AC} & \pi_G & \pi_T R_{AT} \\ \pi_A R_{AC} & * & \pi_G R_{CG} & \pi_T R_{CT} \\ \pi_A & \pi_C R_{CG} & * & \pi_T R_{GT} \\ \pi_A R_{AT} & \pi_C R_{CT} & \pi_G R_{GT} & * \end{pmatrix} \theta\, dt, \tag{1}$$

where $\pi_A, \pi_C, \pi_G, \pi_T$ denote the observed proportions of respective nucleotides in the data and constitute the vector of equilibrium frequencies for the Markov process whose transition probability matrix for time $t > 0$ is obtained by exponentiating the instantaneous rate matrix: $T_{REV}(t) = \exp[t \times Q_{REV}]$. $\theta$ is a rate variation parameter, sampled from a discrete-valued probability distribution supported on $[0, R]$, $R \ge 1$, governed by its own vector of parameters and, without loss of generality, restricted to have mean one (this is due to a well-known confounding of times and evolutionary rates).

Codon Substitution Models

We are also interested in investigating whether a specific site in a gene alignment is undergoing adaptive or purifying evolution, by modeling variation in both synonymous and nonsynonymous substitution rates. This approach is discussed in detail by Kosakovsky Pond (2003) and forms an extension of the codon models proposed previously (Muse and Gaut 1994; Goldman and Yang 1994; Nielsen and Yang 1998; Yang et al. 2000). The general rate matrix element for this model defines the instantaneous rate of substituting a non-stop codon $x$ with a non-stop codon $y$:

$$Q_{x,y}(dt) = \begin{cases} 0, & x \to y \text{ requires} \ge 2 \text{ nucleotide substitutions,} \\ \alpha_s R_{ij}\, \pi_{n_y}\, dt, & x \to y \text{ 1-step synonymous substitution of nucleotide } i \text{ with nucleotide } j, \\ \beta_s R_{ij}\, \pi_{n_y}\, dt, & x \to y \text{ 1-step nonsynonymous substitution of nucleotide } i \text{ with nucleotide } j. \end{cases} \tag{2}$$

Due to reversibility, $R_{ij} = R_{ji}$, and because of the confounding of rates and times, only five rate parameters are estimable; we choose to set $R_{AG} = R_{GA} = 1$, as we did in the nucleotide-reversible model. Site-specific synonymous rates $\alpha_s$ and nonsynonymous rates $\beta_s$ are sampled independently from various probability distributions. $\pi_{n_y}$ denotes the frequency of the target nucleotide in codon $y$ in the appropriate codon position (for instance, the target nucleotide in an ACG → ACT substitution is T in position 3).

The model also specifies the equilibrium frequency of a codon composed of the nucleotide triplet $ijk$. If we denote the frequency of nucleotide $n \in \{A, C, G, T\}$ at codon position $m = 1, 2, 3$ as $\pi^m_n$, then the equilibrium frequency of codon $ijk$ is the product of the constituent nucleotide frequencies, scaled to account for the absence of stop codons (TAA, TAG, TGA for the universal genetic code) in the model:

$$\pi_{ijk} = \frac{\pi^1_i\, \pi^2_j\, \pi^3_k}{1 - \sum \text{frequencies of stop codons}}.$$
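The codon frequency formula above can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the authors' HyPhy implementation: the function name `codon_equilibrium_frequencies`, the 0-based position indexing, and the uniform input frequencies are all hypothetical choices made for the example.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # universal genetic code, as in the text


def codon_equilibrium_frequencies(position_freqs):
    """Return {codon: frequency} for the 61 sense codons.

    position_freqs[m][n] is the frequency of nucleotide n at codon
    position m (m = 0, 1, 2 here; the text numbers positions 1, 2, 3).
    """
    # Product of positional nucleotide frequencies for all 64 triplets.
    raw = {
        i + j + k: position_freqs[0][i] * position_freqs[1][j] * position_freqs[2][k]
        for i, j, k in product(NUCLEOTIDES, repeat=3)
    }
    stop_mass = sum(raw[codon] for codon in STOP_CODONS)
    # Rescale so that the sense codons sum to one.
    return {c: f / (1.0 - stop_mass) for c, f in raw.items() if c not in STOP_CODONS}


# Hypothetical positional frequencies, for illustration only.
uniform = {n: 0.25 for n in NUCLEOTIDES}
freqs = codon_equilibrium_frequencies([uniform, uniform, uniform])
print(len(freqs), sum(freqs.values()))
```

With uniform positional frequencies every sense codon receives the same frequency, 1/61, which is a quick sanity check on the rescaling.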

Modeling Quantiles of Rate Distributions

The main idea of our method lies in estimation of both the rates and their proportions from the data. Traditional discrete gamma approaches (Yang 1994) accomplish flexible modeling of rates (to an extent), but they do not adjust their proportions, whereas with more general methods (such as using general discrete distributions) it is possible to infer both rates and proportions, but such methods are computationally challenging and may suffer from convergence issues, particularly for large numbers of rate classes (Morozov et al. 2000). Although our method estimates both aspects of rate distributions, we restrict the number of parameters estimated from the data, without limiting the flexibility of the rate distribution.

We begin by selecting a simple parametric baseline family, such as the unit mean gamma (Γ) distribution, with density

$$r(\theta; \mu) = \frac{\mu^{\mu}\, \theta^{\mu - 1}\, e^{-\mu\theta}}{\Gamma(\mu)}, \quad \mu > 0, \tag{3}$$

or the unit mean log-normal distribution with density

$$r(\theta; \sigma) = \frac{1}{\theta \sigma \sqrt{2\pi}} \exp\left\{ -\frac{(\log\theta + \sigma^2/2)^2}{2\sigma^2} \right\}, \quad \sigma > 0. \tag{4}$$

As an example of a rather inflexible rate distribution, we include a unit mean piecewise uniform distribution, supported on $[0, R]$, with density

$$r(\theta; R) = \frac{R - 1}{R}\, I_{[0,1]}(\theta) + \frac{1}{R(R - 1)}\, I_{(1,R]}(\theta), \quad R > 1, \tag{5}$$

where $I_{[a,b]}$ is the indicator function of interval $[a, b]$, i.e., $I_{[a,b]}(\theta) = 1$ if $\theta \in [a, b]$, and $I_{[a,b]}(\theta) = 0$ otherwise. We cannot simply use a uniform on $[0, R]$ because the unit mean constraint would force $R = 2$, and thus limit the largest possible value of $\theta$ to be 2, a priori. Lastly, we consider another popular density: Γ coupled with an invariable rate class (Γ+Inv), with the density function given by

$$r(\theta; \mu, P) = P\, \delta_0(\theta) + (1 - P)\, \frac{\mu^{\mu}\, \theta^{\mu - 1}\, e^{-\mu\theta}}{\Gamma(\mu)}, \quad \mu > 0,\ P \in [0, 1], \tag{6}$$

where $\delta_0(\theta)$ is the delta function at 0, and $P$ is the proportion of invariable sites. Each continuous density is discretized in $N \ge 2$ rate categories, using the process described below.

It is customary to summarize the main characteristics of a distribution by its variance $\sigma^2$, skewness $k_1$, and kurtosis $\beta_2$, which are defined in terms of central moments $\mu_n = E[\theta - E\theta]^n$ as:

$$\sigma^2 = \mu_2, \quad k_1 = \frac{\mu_3}{\mu_2^{3/2}}, \quad \beta_2 = \frac{\mu_4}{\mu_2^{2}}.$$

For comparative purposes, we also define an N-bin unit mean general discrete distribution (GDD) with $2N - 2$ parameters, by specifying $N$ rates $0 \le \theta_1 < \theta_2 < \cdots < \theta_N$, and probabilities $p_k = \Pr\{\theta = \theta_k\}$, subject to $\sum_k p_k = 1$ and $\sum_k p_k \theta_k = 1$. For a fixed number of rate classes, GDD is the most flexible among all discrete-valued distributions. However, GDD is a parameter-rich distribution, and in practice it often fails to converge for $N > 4$ for all but the largest data sets and may also overfit the data.

In order to capture most of the flexibility of GDD, without dealing with the associated complexity and convergence issues, we draw upon an idea from Bayesian estimation theory, namely Dirichlet process prior models (Ferguson 1973), in which a baseline prior distribution is discretized into a countable number of categories, with the probability of assignment to a given class for a given data point drawn from a Dirichlet distribution, the multivariate generalization of a beta distribution. Applying this idea in a frequentist setting, we partition the entire rate distribution into $N$ equiprobable intervals and represent each interval with its mean or median, as first discussed by Yang (1994). This approach captures, as $N$ increases, the properties of the underlying continuous distribution. In practice, $N$ is limited by computational resources and the length of sequence data, and it is often set (somewhat arbitrarily) to a small number, such as $N = 8$. One can attempt to fit models by starting with $N = 2$ and increasing $N$ until adding another category does not improve the fit, measured in likelihood terms.

Second, we use the beta distribution, with density

$$r(t; p, q) = \frac{1}{B(p, q)}\, t^{p - 1} (1 - t)^{q - 1}, \quad t \in (0, 1),\ p, q > 0, \tag{7}$$

to decide what intervals the underlying rate distribution is partitioned into. Specifically, we discretize the beta distribution into $N - 1$ equiprobable intervals and compute conditional means, $t_0 = 0 < t_1 < t_2 < \cdots < t_{N-1} < t_N = 1$, over these intervals. We then split the underlying rate distribution of $\theta$ into intervals $[\theta_i, \theta_{i+1})$, $0 \le i < N$, such that $\Pr\{\theta \in [\theta_i, \theta_{i+1})\} = t_{i+1} - t_i$, and calculate the rate class values $\bar\theta_i$ as conditional means over the appropriate interval:

$$\bar\theta_i = E\left[\theta \mid \theta \in [\theta_{i-1}, \theta_i)\right] = \frac{1}{t_i - t_{i-1}} \int_{\theta_{i-1}}^{\theta_i} \theta f(\theta)\, d\theta, \quad i = 1 \ldots N,$$

where $f$ is the density of the baseline rate distribution. Although in theory any distribution supported on $[0, 1]$ can be used to model the quantiles of a baseline rate distribution, the choice of a beta distribution allows the model to capture a wide range of distributional behaviors with the addition of only two extra parameters (fig. 1).
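The two-stage discretization can be sketched in a few lines of Python. This is a minimal sketch under loud assumptions, not the HyPhy implementation: the baseline is the unit mean exponential distribution (the gamma density with μ = 1), chosen here only because its quantiles and partial means have closed forms; the beta interval means are approximated with a plain midpoint rule; and the function names `beta_interval_means` and `beta_exponential_classes` are hypothetical.

```python
import math


def beta_interval_means(p, q, n_intervals, grid=100_000):
    """Conditional means of Beta(p, q) over n_intervals equiprobable intervals.

    Midpoint-rule approximation on a uniform grid (an assumption of this
    sketch); accuracy degrades when p or q is far below 1.
    """
    xs = [(k + 0.5) / grid for k in range(grid)]
    ws = [x ** (p - 1.0) * (1.0 - x) ** (q - 1.0) for x in xs]  # unnormalised density
    total = sum(ws)
    means, cum, mass, moment, k = [], 0.0, 0.0, 0.0, 1
    for x, w in zip(xs, ws):
        cum += w
        mass += w
        moment += x * w
        # Close the current interval once it holds k/n of the total mass.
        if k < n_intervals and cum >= k * total / n_intervals:
            means.append(moment / mass)
            mass = moment = 0.0
            k += 1
    means.append(moment / mass)
    return means


def beta_exponential_classes(p, q, n_classes):
    """Rates and weights of a beta-discretized unit mean exponential baseline."""
    # t_0 = 0 < t_1 < ... < t_{N-1} < t_N = 1, from N - 1 beta intervals.
    t = [0.0] + beta_interval_means(p, q, n_classes - 1) + [1.0]
    # Interval boundaries solve F(theta_i) = t_i with F(theta) = 1 - exp(-theta).
    bounds = [-math.log(1.0 - ti) if ti < 1.0 else math.inf for ti in t]

    def upper_partial_mean(a):
        # Integral of theta * exp(-theta) from a to infinity.
        return 0.0 if math.isinf(a) else (a + 1.0) * math.exp(-a)

    weights = [t[i] - t[i - 1] for i in range(1, n_classes + 1)]
    rates = [
        (upper_partial_mean(bounds[i - 1]) - upper_partial_mean(bounds[i])) / weights[i - 1]
        for i in range(1, n_classes + 1)
    ]
    return rates, weights


rates, weights = beta_exponential_classes(p=0.8, q=1.2, n_classes=8)
for r, w in zip(rates, weights):
    print(f"rate {r:8.4f}  weight {w:.4f}")
```

Because the class rates are conditional means, the weighted average of the rates equals the baseline mean of one whatever values p and q take, so the unit mean constraint survives the discretization.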

Depending on the values of p and q, the following behaviors arise.

1. p ≈ 1 and q ≈ 1. The distribution is similar to what would be obtained by discretizing with equiprobable intervals.
2. p ≪ 1 and q ≪ 1. The distribution is bimodal, with a skew toward high rates, if q > p, or low rates, if p < q.
3. p ≫ 1 and q ≫ 1. The distribution is tri-modal, with a skew toward high rates, if q > p, or low rates, if p < q.
4. p > 1 and q < 1. The distribution approaches a mixture of point mass and a sharp peak for low rates, the width of which is controlled by p.
5. p < 1 and q > 1. The distribution approaches a mixture of point mass and a sharp peak for high rates, the width of which is controlled by q.

We denote distributions discretized this way by prepending β– to the name of the baseline distribution.

For comparative purposes, we also implemented a discretization approach proposed by Felsenstein (2001), where the quantiles of a gamma distribution are chosen using Gauss-Laguerre quadratures, in order to reproduce as many moments of the continuous Γ distribution as possible given the coarseness of discretization (denoted by Γ quadratures). This model would excel compared to traditionally discretized Γ if the rates were sampled from the gamma distribution and the patterns present in the alignment exhibited sufficient variation to enable reliable inference of high moments.

To assess the goodness-of-fit, we compare the likelihood of the model to a model that does not allow rate variation and to one in which each site (strictly speaking, each site pattern) can evolve at a different rate. We will refer to the latter as the "site specific rate" (SSR) method. As this model is computationally burdensome because of the large number of parameters for even small sequence alignments, we take an approximate approach, in which nuisance parameters such as the branch lengths and the substitution parameters are fixed, while the rate parameters $\theta_s$ for each site $D_s$ are estimated independently.
To avoid artificial infinite rates for variable sites, as was done by Nielsen (1997), we capped the maximum value of $\theta$ at 100. For nucleotide data, one can justify this restriction by noticing that for $\theta t > 5$,

$$\exp\left[ \begin{pmatrix} -3 & 1 & 1 & 1 \\ 1 & -3 & 1 & 1 \\ 1 & 1 & -3 & 1 \\ 1 & 1 & 1 & -3 \end{pmatrix} \theta t \right] \approx \begin{pmatrix} 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \end{pmatrix},$$

hence the substitution matrix becomes saturated and large values of $\theta$ are not identifiable. (The diagonal entries of the exponential equal $1/4 + (3/4)e^{-4\theta t}$ and the off-diagonal entries equal $1/4 - (1/4)e^{-4\theta t}$, so for $\theta t > 5$ every entry is within $2 \times 10^{-9}$ of 0.25.) Estimated rates are then normalized to yield an average value of 1, in order to resolve confounding with the length of the tree. By adding site-by-site log-likelihood scores, we can obtain an approximate upper bound on the likelihood scores attainable by any model with parametric rate distributions. We can then compare the fit of the

FIG. 1. Cumulative distribution functions obtained by discretizing the gamma distribution with μ = 0.5 into 8 rate classes for illustrative values of beta distribution parameters p and q. Continuous cumulative distributions are shown for reference. [Figure panels: for each of (p, q) = (0.8, 1.2), (0.05, 0.15), (8, 0.4), (0.5, 5), and (50, 20), the discretized gamma distribution (CDF against rate) alongside the interval-defining beta distribution (CDF).]

Table 1. Sequence Alignments Used for Analyses

| Data Set | N | bp | Patterns | Model | AICmax | AICmin |
|---|---|---|---|---|---|---|
| HCV cds | 25 | 9465 | 2456/N.A | 010232 | 124043 | 99237.9 |
| HCV core gene | 100 | 573 | 213/184 | 010232 | 13173.1 | 10547.2 |
| HCV E2 gene | 81 | 576 | 324/184 | 010010 | 20308.7 | 17371.8 |
| HCV NS5B gene | 79 | 1773 | 738/541 | 012030 | 49208.2 | 39427 |
| mtDNA COXI gene | 21 | 1530 | 683/506 | 012343 | 30973 | 24172 |

NOTE. N = number of sequences in the alignments; bp = length of alignment in nucleotide bases; Patterns = number of unique alignment column patterns (nucleotide/codon); Model = code of the selected nucleotide evolutionary model; AICmax = AIC score of the nucleotide model without site-by-site rate variation; AICmin = AIC score of the nucleotide model where the rate at each unique site pattern is estimated independently.

models using Akaike Information Criterion (AIC) scores (Akaike 1974), and compare moments of the inferred discretized rate distributions with those obtained by the SSR method utilizing standard unbiased estimators of sample moments. Note that because the number of parameters in the SSR model is determined by the data, AIC is not, strictly speaking, directly applicable. However, we use the AIC measure for the SSR to estimate a reasonable upper bound on the fit that can be achieved by models with parametric distributions. The SSR approach has been parallelized to allow it to run quickly on a cluster of computers. Lastly, we can search for the maximum number of discrete rate categories, $N_{max}$, supported by the data, by performing a one-dimensional search on $N$ under the GDD, using AIC scores to locate it.

Sequence Data

We apply our methods to sequence data from genotype 1 subtype b hepatitis C virus (HCV) and cytochrome oxidase I (COXI) mammalian sequences. These sequences typically exhibit extensive rate variation across sites. Data on Core, E2, and NS5B genes, as well as complete genome alignments (table 1), were downloaded from the Los Alamos HCV database (http://hcv.lanl.gov). The mitochondrial alignment was recently analyzed by Seo, Kishino, and Thorne (2004). We used the heuristic algorithm described in the Supplemental Material online to obtain viral phylogenies and appropriate evolutionary models. Alignments and trees can be downloaded from http://www.hyphy.org/pubs/BetaGamma.zip.

Results

To test the ability of our model to flexibly describe variation in substitution rates across sites, we compared the fit of our model to sequences of HCV, mammalian mitochondrial sequences, and to simulated data. All of the analyses were performed with the HyPhy (v 0.99b) package (Kosakovsky Pond, Frost, and Muse 2004), and are included in the most recent package build.
Nucleotide Analyses

First, we investigated the distribution of nucleotide substitution rates in the complete genome sequences of HCV. As the sequences are fairly long, come from different genes of an organism that evolves extremely rapidly, and are subject to vastly different selective pressures, we expect to find very high variability of substitution rates, so these data provide a good test case for different modeling approaches and distributions. Fitting the site-specific rate model (tables 1 and 2) suggests that nucleotide substitution rates follow a highly variable distribution (median 0, mean 1, large variance), with a substantial weight in the tail of the distribution (high skewness and kurtosis). A linear search using a GDD produced support for five discrete rate classes with the cumulative distribution function (CDF) shown in figure 2.

We next fitted a series of nucleotide models using varying probability distributions to approximate the variability of substitution rates across sites (see fig. 2), using eight rate categories for discretized distributions, and compared them with the best-fitting (in the AIC sense) GDD-based model. We used eight rate categories, in part because it is the number often arbitrarily chosen in existing methods, and in part because it is reasonably close to the five rate classes supported by the data under the GDD model.
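The search over the number of GDD rate classes can be sketched as follows. This is an illustrative sketch only: `fit_gdd` stands for a hypothetical routine returning the maximized log-likelihood of a model with an N-bin GDD (no such function is defined in the text), the toy log-likelihood values are made up, and the AIC is taken in its usual form, twice the number of parameters minus twice the log-likelihood.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (smaller is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def search_rate_classes(fit_gdd, n_base_params, max_classes=10):
    """Increase N from 2 until the AIC of an N-bin GDD stops improving.

    fit_gdd(N) is a hypothetical callable returning the maximized
    log-likelihood of a model with an N-bin GDD; the GDD contributes
    2N - 2 free parameters on top of n_base_params.
    """
    best_n, best_aic = None, float("inf")
    for n in range(2, max_classes + 1):
        score = aic(fit_gdd(n), n_base_params + 2 * n - 2)
        if score >= best_aic:
            break  # the extra class no longer pays for its two parameters
        best_n, best_aic = n, score
    return best_n, best_aic


# Toy log-likelihoods (made up for illustration) that plateau after five classes.
toy_loglik = {2: -1050.0, 3: -1020.0, 4: -1005.0, 5: -1000.0, 6: -999.5, 7: -999.4}
print(search_rate_classes(toy_loglik.__getitem__, n_base_params=10, max_classes=7))
```

Stopping at the first N that fails to improve the score keeps the number of model fits small, at the cost of assuming that AIC has a single minimum along N.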
Several observations are immediately clear: (1) all rate variation models produce much lower AIC scores than the single rate model, yielding more than half of the maximum score improvement AICmax − AICmin, supporting the hypothesis of presence of rate variation; (2) the overall length of the tree varies up to 50%, depending on the choice of the distribution; (3) β– distributions fit dramatically better than their conventionally discretized counterparts; (4) conventionally discretized distributions greatly underestimate the variance, skewness, and kurtosis of the rate distribution inferred by the SSR approach; GDD captures the moments better than any other distribution, while β–Γ is second-best; (5) the shape parameters of baseline continuous distributions are close for the conventional and β– partitionings (table 2); (6) the intentionally inflexible β–uniform distribution has the worst fit of all models, and fails to capture the skewness and kurtosis of the rate distribution, suggesting that it is important to choose prior rate distributions from a continuous family that can model asymmetry and long tails; (7) while the Γ quadratures model does an admirable job of reproducing distributional moments, β–Γ is preferred by AIC, by a sizeable margin; (8) the β–Γ distribution is sufficiently flexible to outperform Γ+Inv and recover a large proportion of essentially invariable sites. Including an invariable class in the β–Γ model results in a much

Table 2. Distributional Properties of Nucleotide Substitution Rates in HCV and Vertebrate Mitochondrial DNA Alignments, Sorted by Increasing AIC

| Data set | Distribution | N | T | σ | k1 | β2 | AIC | Parameters |
|---|---|---|---|---|---|---|---|---|
| HCV cds | SSR | 5 | 1.09 | 9.6 | 42 | 293.04 | 99237.9 | |
| HCV cds | GDD | 5 | 1.58 | 16.1 | 15.4 | 266.5 | 112967 | |
| HCV cds | β–Γ+Inv | 8 | 1.32 | 0.97 | 3.59 | 22.1 | 113131 | μ = 0.61, p = 2, q = 0.12, P = 0.48 |
| HCV cds | β–Γ | 8 | 1.37 | 4.2 | 4.11 | 28.27 | 113147 | μ = 0.2, p = 0.24, q = 0.2 |
| HCV cds | Γ+Inv | 8 | 1.31 | 0.91 | 2.04 | 6.2 | 113178 | μ = 0.69, P = 0.44 |
| HCV cds | Γ quadratures | 8 | 1.29 | 3.98 | 3.99 | 26.88 | 113180 | μ = 0.25 |
| HCV cds | β–log-normal | 8 | 1.93 | 17.67 | 9.4 | 93.82 | 113221 | σ = 2.11, p = 0.11, q = 0.26 |
| HCV cds | Γ | 8 | 1.28 | 2.85 | 1.9 | 5.1 | 113252 | μ = 0.25 |
| HCV cds | Log-normal | 8 | 1.12 | 3.09 | 2.1 | 5.7 | 113551 | σ = 1.66 |
| HCV cds | β–uniform | 8 | 0.98 | 2.37 | 2.5 | 7.54 | 114430 | R = 8.11, p = 0.68, q = 1 |
| HCV core gene | SSR | 4 | 1.51 | 7.2 | 14.75 | 44.95 | 10547.2 | |
| HCV core gene | β–Γ | 8 | 2.02 | 4.97 | 4.57 | 34.06 | 11589 | μ = 0.19, p = 0.85, q = 0.15 |
| HCV core gene | GDD | 4 | 2.05 | 6.13 | 5.21 | 35.02 | 11589.7 | |
| HCV core gene | β–Γ+Inv | 8 | 2.02 | 4.97 | 4.57 | 34.06 | 11591 | μ = 0.19, p = 0.85, q = 0.15, P = 0 |
| HCV core gene | Γ quadratures | 8 | 1.70 | 4.16 | 4.08 | 27.96 | 11592.7 | μ = 0.24 |
| HCV core gene | β–log-normal | 8 | 2.92 | 52.8 | 81.7 | 12135 | 11601.4 | σ = 2.09, p = 1.61, q = 0.22 |
| HCV core gene | Γ+Inv | 8 | 2.59 | 1.44 | 2.2 | 6.63 | 11656.6 | μ = 0.21, P = 0.37 |
| HCV core gene | Γ | 8 | 2.72 | 3.16 | 1.96 | 5.25 | 11662.7 | μ = 0.21 |
| HCV core gene | Log-normal | 8 | 2.53 | 3.33 | 2.13 | 5.77 | 11705.6 | σ = 1.73 |
| HCV core gene | β–uniform | 8 | 0.98 | 1.58 | 1.51 | 3.44 | 11808.3 | R = 10000, p = 0.94, q = 1.08 |
| HCV E2 gene | SSR | 5 | 3.33 | 2.43 | 3.3 | 16.25 | 17371.8 | |
| HCV E2 gene | GDD | 5 | 3.86 | 2.21 | 2.07 | 7.64 | 18133.4 | |
| HCV E2 gene | β–Γ+Inv | 8 | 3.83 | 1.62 | 2.49 | 9.77 | 18141.1 | μ = 0.48, p = 0.48, q = 0.59, P = 0.19 |
| HCV E2 gene | β–Γ | 8 | 3.82 | 2.69 | 2.67 | 10.73 | 18141.6 | μ = 0.33, p = 0.54, q = 0.63 |
| HCV E2 gene | Γ+Inv | 8 | 3.84 | 1.14 | 1.69 | 4.7 | 18154.6 | μ = 0.64, P = 0.25 |
| HCV E2 gene | Γ | 8 | 3.56 | 2.13 | 1.72 | 4.6 | 18159.2 | μ = 0.36 |
| HCV E2 gene | Γ quadratures | 8 | 3.2 | 2.44 | 3.12 | 17.62 | 18177 | μ = 0.41 |
| HCV E2 gene | β–log-normal | 8 | 3.41 | 2.00 | 1.4 | 3.08 | 18217.3 | σ = 1.54, p = 3.49, q = 3.2 |
| HCV E2 gene | Log-normal | 8 | 2.9 | 2.41 | 2.01 | 5.44 | 18227.4 | σ = 1.46 |
| HCV E2 gene | β–uniform | 8 | 0.98 | 1.58 | 1.51 | 3.44 | 18306.5 | R = 10000, p = 1.17, q = 1.74 |
| HCV NS5B gene | SSR | 4 | 2.43 | 2.4 | 11.17 | 40.5 | 39427 | |
| HCV NS5B gene | GDD | 4 | 2.81 | 3.45 | 2.57 | 9.8 | 42319.4 | |
| HCV NS5B gene | β–Γ+Inv | 8 | 2.81 | 1.4 | 3.42 | 19.1 | 42320.5 | μ = 0.48, p = 0.51, q = 0.34, P = 0.40 |
| HCV NS5B gene | β–Γ | 8 | 2.83 | 4.52 | 3.99 | 24.7 | 42323 | μ = 0.21, p = 0.85, q = 0.43 |
| HCV NS5B gene | Γ+Inv | 8 | 2.88 | 1.33 | 2.1 | 6.2 | 42366.1 | μ = 0.45, P = 0.35 |
| HCV NS5B gene | Γ | 8 | 2.93 | 3.02 | 1.93 | 5.2 | 42387.3 | μ = 0.23 |
| HCV NS5B gene | Γ quadratures | 8 | 2.2 | 3.78 | 3.89 | 25.7 | 42419.1 | μ = 0.26 |
| HCV NS5B gene | β–log-normal | 8 | 4.84 | 18.5 | 7.7 | 63.6 | 42495 | σ = 2.18, p = 3.71, q = 0.87 |
| HCV NS5B gene | Log-normal | 8 | 2.4 | 3.35 | 2.13 | 5.77 | 42577.8 | σ = 1.73 |
| HCV NS5B gene | β–uniform | 8 | 1.64 | 3.3 | 4.17 | 23.29 | 42869.3 | R = 13.7, p = 0.59, q = 0.5 |
| mtDNA COXI gene | SSR | 4 | 2.34 | 7.16 | 13.63 | 40.6 | 24172 | |
| mtDNA COXI gene | GDD | 4 | 4.11 | 2.32 | 2.02 | 7.1 | 27272.3 | |
| mtDNA COXI gene | β–Γ+Inv | 8 | 4.06 | 0.49 | 1.33 | 3.46 | 27281.8 | μ = 1.62, p = 2.35, q = 1.93, P = 0.51 |
| mtDNA COXI gene | Γ+Inv | 8 | 4.24 | 0.64 | 1.75 | 5.2 | 27284 | μ = 1.16, P = 0.47 |
| mtDNA COXI gene | β–Γ | 8 | 3.35 | 1.55 | 0.61 | 1.85 | 27338.1 | μ = 0.23, p = 4.0, q = 5.87 |
| mtDNA COXI gene | Γ | 8 | 4.97 | 3.07 | 1.94 | 5.2 | 27389.3 | μ = 0.22 |
| mtDNA COXI gene | β–log-normal | 8 | 9.54 | 21.9 | 6.54 | 44.11 | 27427.1 | σ = 2.54, p = 0.09, q = 0.27 |
| mtDNA COXI gene | Γ quadratures | 8 | 2.87 | 3.42 | 3.70 | 23.5 | 27533.2 | μ = 0.29 |
| mtDNA COXI gene | Log-normal | 8 | 2.67 | 3.27 | 2.12 | 5.75 | 27704.6 | σ = 1.71 |
| mtDNA COXI gene | β–uniform | 8 | 2.61 | 1.26 | 1.13 | 2.42 | 27907.2 | R = 10000, p = 1.23, q = 2.12 |

NOTE. N = number of discrete rate categories supported by the data (SSR) or the number of discrete rate classes used for the analysis (all others); T: (1) for SSR denotes the length of the tree with θ = 1; (2) for all other methods denotes the length of the tree under the model with an appropriate distribution of θ; both measured in expected substitutions per site; σ = variance of the distribution; k1 = skewness of the distribution; β2 = proper kurtosis of the distribution; AIC = Akaike Information Criterion score for the model.

smaller AIC improvement than does adding an invariable class to the Γ model.

Even though the β– distributions were split into eight classes, the discretization scheme collapsed several rate classes with θ ≪ 1 into one, thus suggesting that eight was an excessive number of rate categories. Moreover, such distributions were better able to capture the variability of θ, by inferring classes with large rates (θ_i > 10), unlike

FIG. 2. Inferred cumulative rate distributions for the complete HCV genome alignment. SSR derived cumulative rate distribution is shown for reference. [Figure panels: β–Uniform, GDD, β–Γ, Gamma, β–Log-normal, and Log-normal; each plots CDF against rate.]

conventionally discretized distributions. For example, a β–Γ model with three rate classes had a better AIC score (113206) than a Γ model with eight rate classes. We note that because the computational difficulty is proportional to the number of rate classes, fitting β–Γ with a few rate categories may yield better fits substantially faster than traditionally discretized Γ.

Similar observations hold for specific HCV genes and the COXI data set. Note that for the Core gene, the β–Γ distribution has a better AIC score than the GDD, and for the other two HCV data sets, the β–Γ distribution fits almost as well as GDD and infers very similar moments. For E2 and COXI, traditionally discretized Γ has a better AIC score than Γ quadratures, most likely because the rates in these alignments do not come from a distribution similar to a continuous Γ.

A notable difference for the COXI data set is that the addition of an invariable class of sites results in a much more pronounced improvement of fit. In particular, Γ+Inv outperforms β–Γ (although β–Γ+Inv fits better still). This example highlights that if the true distribution of rates cannot be well approximated by any member of the parametric family (Γ in this case), an adaptive discretization scheme alone cannot match the fit achieved by choosing a better baseline distribution. Having selected Γ+Inv as the baseline distribution, we observe that using the β– discretization scheme still leads to a better AIC score.

Codon-Stratified Models

For coding data, models that assume different patterns of rate variation among first and second positions in a codon and among third positions in a codon are biologically reasonable. As an illustration, we separated the COXI alignment into two fixed partitions: first and second codon positions and third codon positions. Similar to Yang and Swanson (2002), we fitted a fixed sites model, with all model parameters except rate distributions shared among partitions. The β–Γ distribution (with four rate categories in each partition) had an AIC score of 26107.4 versus 26593.9 for the Γ distribution. The shape parameters of the gamma distribution fitted to the first and second positions were μ12 = 0.01 for β–Γ and μ12 = 0.13 for Γ, and those for the third position were μ3 = 2.17 and μ3 = 2.18, respectively. Both models established that positional rate distributions have very different variances: high in the first two positions and low in the third. β–Γ further established that both rate distributions were bimodal: p = 0.6, q = 0.35 for the first two positions and p = 0.08, q = 0.23 for the third position. While both codon-stratified models have lower AIC scores than any of the models in table 2 (except SSR), one must be careful not to directly compare the fits of fixed and random sites models. As Yang and Swanson (2002) pointed out, likelihood scores of random sites models are inherently lower compared to equivalent fixed sites models.
Simulation Results

We also conducted the following simulation studies:

1. 100 data sets with 25 sequences and 10,000 nucleotides each, simulated under the tree and model inferred from the complete HCV genome alignment using the gamma distribution with shape parameter μ = 0.25; on average there were 2,404 unique alignment columns per data set;
2. 100 data sets with 100 sequences and 1,000 nucleotides each, simulated under the tree and model inferred from the HCV Core gene alignment using the gamma distribution with shape parameter μ = 0.21; on average there were 414 unique alignment columns per data set;
3. 100 data sets with 79 sequences and 1,000 nucleotides each, simulated under the tree and model inferred from the HCV NS5B gene alignment using the 4 bin GDD distribution with θ1 = 0.06, p1 = 0.663; θ2 = 1.14, p2 = 0.181; θ3 = 3.88, p3 = 0.125; θ4 = 8.69, p4 = 0.03; on average there were 387 unique alignment columns per data set;
4. 100 data sets with 81 sequences and 1,000 nucleotides each, simulated under the tree and model inferred from the HCV E2 gene alignment using a single rate model (θ = 1); on average there were 944 unique alignment columns per data set.

Simulation 1 was intended to capture the situation of modeling variation of continuously sampled rates in a relatively small number of long sequences (such as a complete viral genome); simulations 2 and 3 were performed to investigate modeling variation of continuously (2) or discretely (3) sampled rates in a relatively large number of short sequences; simulation 4 was conducted as a negative control to assess performance of various models in the absence of rate variation. We tabulated the actual distribution of rates in each replicate, and fitted each of the distributional models to the alignment. Tables 3 and 4 summarize actual and inferred distributional properties for every model and rank models by their AIC scores. We observe:

• GDD, β–Γ, and Γ quadrature distributions recover the average tree length reasonably well. Surprisingly, the traditionally discretized Γ distribution does relatively poorly, and overestimates the length;
• all inferred variances are reasonably close to the true value, except for β–log-normal, which greatly overestimates the variance;
• GDD, β–Γ, Γ quadratures, and β–log-normal are able to infer high skewness and kurtosis, whereas others fail to correctly reflect large higher moments;
• a Γ distribution discretized using quadrature fitted the data better than a conventionally discretized Γ distribution, but worse than a β–Γ;
• β–uniform consistently provides the worst fit and β–Γ the best. In particular, β–Γ is always preferred to traditionally discretized Γ. Moreover, GDD is also always preferred to Γ, though the true distribution is a gamma;
• when compared with Γ quadratures, mean AIC scores for the β–Γ model were 20 (simulation 1) and 30 (simulation 2) units better. Perhaps this is so because even long and large data sets, simulated over a rather long tree, still fail to provide enough information to reliably infer a continuous rate distribution;
• for simulations without rate variation, all models except β–uniform on average converged to the point mass at 1.

The ability of the β–Γ distribution to recover the statistical moments of the rate distribution and the shape parameter of the underlying continuous distribution (when applicable), together with the consistent improvement in fit over a conventionally discretized Γ, implies that β–Γ is preferable to a conventionally discretized Γ distribution. Although the Γ quadratures distribution fitted significantly better than the conventionally discretized Γ distribution, the β–Γ model still provided a substantially better fit. The ability of this model to adopt a wide variety of shapes

Hierarchical Rate Variation Model 231

Table 3 Simulated and Inferred Distribution Properties for Simulations, Where h Were Drawn from a Pre-specified Distribution Distribution

^ T

^ r

^1 k

^2 b

h ;  with l ¼ 0.25 4.05 8.04 2.85 1.9 3.08 2.1 3.95 3.69 7.19 5.64 4.04 2.68 3.92 3.54 3.86 3.93

31 5.08 5.69 21.5 69 9.16 18.5 26.18

Simulation 2. 1,000 sites, 100 sequences. h ;  with l ¼ 0.21 Actual 2.72 4.7 9.13  3.7 3.1 1.95 Log-normal 3.1 3.35 2.13 b2 2.96 4.76 4.1 b–log-normal 4.74 22.8 12.7 b–uniform 1.92 5.26 3.44 GDD 2.91 4.27 3.38  quadratures 2.18 3.8 3.9

31.45 5.22 5.77 26.65 272.7 14.74 16.1 25.78

Simulation 3. 1,000 sites, 79 sequences. h ; 4 bin GDD Actual 2.08 1.94  2.6 3.04 Log-normal 2.18 3.33 b2 2.48 4.05 b–log-normal 4.1 16.6 b–uniform 1.34 4.88 GDD 2.54 3.4  quadratures 2.08 4.95

3.5 1.93 2.13 3.25 9.8 3.06 2.55 4.15

14.0 5.19 5.77 15.52 261.1 12.7 9.8 32.7

0 0.14 0.07 25.9 2552 236 2387 0.21

0 2.3 5.03 139 106 57800 3 3 106 3.07

Simulation 1. 10,000 sites, 25 sequences. Actual 1.28  1.35 Log-normal 1.22 b2 1.29 b–log-normal 1.21 b–uniform 1.08 GDD 1.29  quadratures 1.25

Simulation 4. 1,000 sites, 79 sequences. h ¼ 1 Actual 3.43 0  3.43 0.01 Log-normal 3.43 0.005 b2 3.43 0.008 b–log-normal 3.43 0.002 b–uniform 3.38 4.42 GDD 3.43 0.009  quadratures 3.43 0.012

Parameter Estimates l ¼ 0.25 ^ ¼ 0.25(0.006) l ^ ¼ 1.66(0.016) r ^ ¼ 0.23(0.01), ^p ¼ 0.99(0.96), ^q ¼ 0.44(0.2) l ^ ¼ 1.72(0.1), ^p ¼ 1.86(1.11), ^q ¼ 1.25(0.74) r ^ ¼ 0.25(0.006) l l ¼ 0.21 ^ ¼ 0.22(0.014) l ^ ¼ 1.74(0.12) r ^ ¼ 0.2(0.016), ^p ¼ 2.38(1.18), ^q ¼ 0.48(0.23) l ^ ¼ 2.04(0.15), ^p ¼ 2.5(0.72), ^q ¼ 0.65(0.22) r ^ ¼ 0.26(0.017) l l ¼ 0.23(0.015) ^ ¼ 0.22(0.014) l ^ ¼ 1.73(0.05) r ^ ¼ 0.22(0.015), ^p ¼ 2.34(3.14), ^q ¼ 1.0(0.883) l ^ ¼ 1.94(0.2), ^p ¼ 2.6(1.26), ^q ¼ 1.33(1.0) r ^ ¼ 0.25(0.03) l

^ ¼ 90.6(18.7) l ^ ¼ 0.05(0.06) r ^ ¼ 51.6(42.8), ^p ¼ 12.3(27.3), ^q ¼ 50(39.5) l ^ ¼ 0.6(1.9), ^p ¼ 30.8(37), ^q ¼ 28.8(37.3) r ^ ¼ 91.6(17.6) l

NOTE.—Each fitted distribution was discretized into eight rate classes, while GDD had five rate classes for simulations 1 and 4, and four rate classes for simulations 2 ^1 ¼ mean skewness; b2 ¼ mean kurtosis; mean parameter estimates (and ^ ¼ mean variance; k and 3. T^ ¼ mean length of the tree measured in expected substitutions per site; r standard deviations)

is likely responsible for b 2  outcompeting the other models. Codon Analyses We next fitted codon models (eq. 2) to HCV gene alignments to investigate the distributions of synonymous (a), and non-synonymous (b) substitution rates. We employed independent b 2  distributions, with unit mean for a, due to identifiability issues as discussed in Kosakovsky Pond (2003), and estimated the mean of the distribution for b. We chose to use three rate classes for a and 4 for b. Increasing the number of categories did not lead to a better AIC score. Likelihood analyses were parallelized (see Supplemental Material online for details) and run on a Linux cluster. For comparative purposes we also employed GDD, SSR, -based models, and a single rate model (table 5). All analyses used branch lengths derived from the single rate nucleotide model (with appropriate tranformations to distances on a codon scale) in order to reduce computational complexity. Lastly, to provide a reference point for existing models, we fitted the

M8 (b1x) model of Yang et al. (2000), modified to employ the rate matrices given by (eq. 2). This model did not account for synonymous rate variation, and it fitted the data poorly, even though we estimated branch lengths under the full codon model instead of using nucleotide approximations, increasing computational effort 10-fold. Note that b 2  distributions yield AIC scores better than or comparable to the GDD model, and they fit the moments of rate distributions much better than the conventionally discretized gamma distribution, which, once again, fails to represent high skewness and kurtosis of substitution rate distributions. We used the b 2  model to compute Bayes factors for the events of positive or negative selection at a site (Kosakovsky Pond 2003). The distribution of Bayes factors for the events of positive and negative selection across codon sites under the b 2  model are shown in figure 3. We illustrate the identification of selected codon sites with the E2 gene alignment. All sites with respective Bayes factors over 100 (a rather conservative threshold) were classified as undergoing adaptive evolution (Table 6).
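The Bayes factor for selection at a site can be illustrated with a small sketch. It assumes an empirical Bayes setup that is not spelled out in this section (the details are in Kosakovsky Pond 2003): each site falls into one of the (α, β) rate-class pairs, with prior weights given by the two independent fitted distributions, and the Bayes factor is the ratio of posterior to prior odds of the event β > α. The function name, rate classes, weights, and site likelihoods below are hypothetical placeholders, not values from the analyses.

```python
from itertools import product


def bayes_factor_positive(alphas, alpha_w, betas, beta_w, site_lik):
    """Bayes factor for the event beta > alpha at one site (illustrative).

    alphas, betas   : rate values of the discrete classes
    alpha_w, beta_w : prior weights of those classes (each sums to 1)
    site_lik[i][j]  : likelihood of the site given (alphas[i], betas[j])
    """
    prior = 0.0      # prior probability that beta > alpha
    joint_pos = 0.0  # weight * likelihood over class pairs with beta > alpha
    joint_all = 0.0  # the same sum over all class pairs
    for (i, a), (j, b) in product(enumerate(alphas), enumerate(betas)):
        w = alpha_w[i] * beta_w[j]  # independent priors on alpha and beta
        term = w * site_lik[i][j]
        joint_all += term
        if b > a:
            prior += w
            joint_pos += term
    posterior = joint_pos / joint_all
    # Posterior odds divided by prior odds.
    return (posterior / (1.0 - posterior)) / (prior / (1.0 - prior))


# Hypothetical example: three classes for alpha, four for beta.
alphas, alpha_w = [0.3, 1.0, 2.5], [0.4, 0.4, 0.2]
betas, beta_w = [0.0, 0.2, 0.8, 3.0], [0.5, 0.3, 0.15, 0.05]
# Made-up site likelihoods that favour the fastest beta class.
site_lik = [[1e-9, 2e-9, 5e-8, 4e-6],
            [1e-9, 2e-9, 4e-8, 3e-6],
            [1e-9, 1e-9, 2e-8, 1e-6]]
bf = bayes_factor_positive(alphas, alpha_w, betas, beta_w, site_lik)
print(f"BF = {bf:.1f}; exceeds threshold of 100: {bf > 100}")
```

A site whose likelihood is identical under every class pair carries no information, so its posterior equals its prior and the Bayes factor is 1. The sketch does not guard against a posterior of exactly 1, which would need special handling in practice.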


Table 4 AIC Means, Standard Deviations, and Ranks of Each Distribution for 100 Data Sets Simulated under a Given Distribution of θ

Distribution     E[AIC]    SD[AIC]  Rank 1  Rank 2  Rank 3  Rank 4  Rank 5  Rank 6  Rank 7

Simulation 1. 10,000 sites, 25 sequences. θ ~ Γ with μ = 0.25
β–Γ              58400.2   639.3    99      1       0       0       0       0       0
GDD              58408.5   639.5    0       98      2       0       0       0       0
Γ quadratures    58419.6   640.9    1       1       98      0       0       0       0
Γ                58484.5   638.7    0       0       0       99      1       0       0
β–log-normal     58552.3   641.9    0       0       0       1       83      16      0
Log-normal       58572.7   643.7    0       0       0       0       16      84      0
β–uniform        58871.1   664.2    0       0       0       0       0       0       100

Simulation 2. 1,000 sites, 100 sequences. θ ~ Γ with μ = 0.21
β–Γ              12049.4   547.7    99      1       0       0       0       0       0
GDD              12063.4   548.2    1       91      8       0       0       0       0
Γ quadratures    12081.8   549.9    0       1       53      38      8       0       0
β–log-normal     12086.3   551.1    0       7       29      37      27      0       0
Γ                12094.4   546.9    0       0       10      25      65      0       0
Log-normal       12162.2   549.4    0       0       0       0       0       100     0
β–uniform        12235.2   553      0       0       0       0       0       0       100

Simulation 3. 1,000 sites, 79 sequences. θ ~ 4-bin GDD
GDD              9918.1    1086.2   68      27      4       1       0       0       0
β–Γ              9920.1    1088.4   30      62      8       0       0       0       0
Γ                9930.7    1091.9   2       8       57      33      0       0       0
Γ quadratures    9938.8    1078.3   0       3       31      63      2       0       1
β–log-normal     9962.2    1097.7   0       0       0       1       57      42      0
Log-normal       9966      1100.6   0       0       0       2       41      57      0
β–uniform        10053.2   1123.8   0       0       0       0       0       1       99

Simulation 4. 1,000 sites, 100 sequences. Rates are constant
Log-normal       17334.6   247.7    88      10      0       1       0       1       0
Γ                17334.9   247.6    9       75      14      0       2       0       0
Γ quadratures    17335     247.6    0       3       83      13      0       1       0
β–Γ              17336.8   248      0       2       1       13      61      21      2
β–log-normal     17337     247.9    1       1       1       10      33      52      2
GDD              17341.2   247.6    0       0       0       0       0       21      79
β–uniform        17371.6   274.1    2       9       1       63      4       4       17

NOTE: For each simulation, the models are sorted by their mean AIC.
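The ranking procedure behind a table of this kind can be sketched as follows. AIC is computed as 2k − 2 ln L, where k is the number of estimated parameters and ln L the maximized log-likelihood; within each replicate the models are ordered by ascending AIC, and the ranks are tallied across replicates. The model names, parameter counts, and log-likelihoods below are hypothetical placeholders, not values from the simulations.

```python
def aic(log_lik, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (smaller is better)."""
    return 2 * n_params - 2 * log_lik


def tally_ranks(replicates):
    """Count how often each model attains each AIC rank.

    replicates: list of dicts mapping model name -> (log_lik, n_params)
    returns:    dict mapping model name -> list of counts, index 0 = rank 1
    """
    models = sorted(replicates[0])
    counts = {m: [0] * len(models) for m in models}
    for rep in replicates:
        # Order the models of this replicate from best (lowest AIC) to worst.
        ordered = sorted(rep, key=lambda m: aic(*rep[m]))
        for rank, model in enumerate(ordered):
            counts[model][rank] += 1
    return counts


# Hypothetical fits for two replicates: (log-likelihood, number of parameters).
replicates = [
    {"gamma": (-6040.0, 10), "beta-gamma": (-6020.0, 12), "GDD": (-6019.5, 16)},
    {"gamma": (-5990.0, 10), "beta-gamma": (-5975.0, 12), "GDD": (-5969.0, 16)},
]
for model, row in tally_ranks(replicates).items():
    print(model, row)
```

In the first hypothetical replicate the GDD has the highest likelihood but loses to the beta-gamma model once its four extra parameters are penalized. Ties in AIC are not treated specially in this sketch.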

The Γ model fails to detect any positively selected sites, while the β–Γ model finds 13 sites under selection, including 3 in the hypervariable region (HVR). The absence of positively selected sites under the gamma model is likely due to its inability to capture the pattern of rate variation.

Table 5 Codon-based analyses of substitution rates in HCV genes

                 Synonymous rates α         Nonsynonymous rates β
Distribution     σ      λ1     β2           E[β]    σ       λ1     β2       AIC

Core
SSR              3.1    8.13   44.4         0.18    0.23    2.6    37.7     11430.6
β–Γ              1.95   2.5    9.7          0.17    0.14    4.75   31.8     11523.7
GDD              1.87   2.27   8.38         0.156   0.15    6.97   60.45    11526.24
Γ                1.32   0.66   1.5          0.22    0.09    1.07   2.26     11771.02
β+ω              0      0      0            0.17    0.17    4.41   21.9     12073.2
Single rate                                 0.16                            12565.7

E2
SSR              0.97   2.04   48.4         0.19    0.14    1.62   35       17779.2
β–Γ              0.49   1.39   5.15         0.18    0.1     3.10   14.1     18006.8
GDD              0.5    1.43   5.25         0.18    0.11    3.7    20.2     18017.54
Γ                0.4    0.38   1.5          0.24    0.1     1.04   2.23     18081.70
β+ω              0      0      0            0.2     0.11    3.23   17.7     18230.4
Single rate                                 0.18                            19105.83

NS5B
SSR              1.43   2.52   30.6         0.12    0.09    1.14   20.5     41071.6
GDD              0.95   1.33   3.94         0.11    0.06    3.54   15.93    41632.8
β–Γ              1.05   1.61   4.87         0.109   0.056   3.45   15.63    41633.4
Γ                0.81   0.55   1.5          0.18    0.07    1.1    2.29     41873.00
β+ω              0      0      0            0.13    0.07    3.3    14.1     43272.2
Single rate                                 0.11                            44964.48

NOTE: Models are sorted in ascending order by AIC. σ = variance of the distribution; λ1 = skewness of the distribution; β2 = proper kurtosis of the distribution; E[β] = mean of the distribution for non-synonymous rates; AIC = Akaike Information Criterion score for the model.

Table 6 HCV E2 codons identified as positively selected using different methods and distributions of synonymous and non-synonymous rates

Codon    GDD    β–Γ    Γ
123      0.4    0.1    0.2
139      0.4    1      0.2
153      0.3    0.1    0.2
154      0.4    0.3    0.3
164      0.0    0.1    0.3

Codons 9*, 19*, 25*, 40, 44, 60, 62, 66, 79, and 108 (GDD, β–Γ, and Γ values, as ordered in the source):
0.7 0.0 0.0 0.0 −0.2 0.4 0.2 −0.1 0.0 0.2 0.3 0.1 0.1 0.1 0.4 0.3 0.1 0.1 0.1 −0.1 −0.9 0.3 0.1 0.2 −0.3 −0.8 0.3 0.1 0.1 0

NOTE: The values reflect the estimated dN − dS at a given site scaled by the length of the tree, and codons found to be under diversifying selection (Bayes Factor > 100) are highlighted in bold. * marks codons located in the hypervariable (HVR-1) region.

Discussion

Estimation of the variability in substitution rates across a sequence may be of primary interest, or it may be secondary to estimating a phylogeny and divergence times. Previous attempts to describe rate variation have assumed that each site evolves under a separate rate, or have modeled rate variation using distributions that are either too simple to capture the underlying distribution of rates, or mixtures of densities that may take a long time to fit and may suffer from convergence problems. We have proposed a model of rate variation that is simple enough to be described by a relatively small number of parameters, but flexible enough to capture a wide range of patterns of rate variation. Our approach, despite its simplicity, can provide a good fit to data with relatively few parameters, reducing problems with convergence and computational complexity. For our test hepatitis C and mammalian mitochondrial data sets, the β–Γ model fitted better than conventionally discretized distributions, and for individual HCV genes, the β–Γ model provided a fit similar to or better than a general discrete distribution. The good fits obtained by our model may reflect the fact that, in many cases, only a limited number of statistical moments (mean, variance, skewness, and kurtosis) can be realistically fitted, due to the restricted number of site patterns in many data sets and the inherent stochasticity of the substitution process. β–Γ also compared favorably to complex site-specific rate models, recovering many of the important distributional properties at a much smaller computational cost. The simple parametric formulation of the β–Γ approach makes direct comparison across data sets easy to carry out. Furthermore, as suggested by our simulation results, β–Γ distributions capture the properties of the underlying rate distribution very well. Mayrose et al. (2004) even suggested that in certain settings random effects models are better able to recover "true" evolutionary rates, when compared with SSR-type models.

In principle, several distributions can be organized hierarchically to model the quantiles of a rate distribution, although preliminary investigations suggest that this approach does not give the dramatic improvement in fit obtained by using, say, the β–Γ model over the conventionally discretized gamma distribution. In the context of adaptive evolution, one may envision using rate distributions that do not allow β_s > α_s or β_s < α_s to address the question of presence or absence of selective pressure in the data set as a whole, as opposed to identifying individual sites under selection. It is not trivial, however, to choose a good parameterization in the presence of both synonymous and nonsynonymous rate variation. Moreover, most applications are ultimately concerned with locating the sites under selective pressure.

Our approach also results in a degree of robustness to the exact choice of baseline distribution; indeed, in our simulations generated under a gamma model of rate variation, a discretized log-normal distribution sometimes fitted better than a gamma model. Hence, the researcher may not need to fit a large number of rate distributions to the data, further reducing the computational burden of estimating rate variation.

It may seem counterintuitive that the β–Γ model fits data simulated under a gamma rate model better than a gamma rate model itself. The relatively poor fit of the conventionally discretized gamma distribution arises in part from the poor approximation to the continuous distribution when equiprobable intervals are chosen; using a quadrature method to discretize the gamma distribution instead (Felsenstein 2001) results in a significantly better fit. However, as the β–Γ model gave a better fit than the gamma distribution discretized using quadrature, it is also the ability of the β–Γ model to adapt to the distribution of rates present in the data that is responsible for the improvement in fit over a gamma rate model. Given the finite length of alignments and the stochasticity of the substitution process, only a small number of rate classes can be detected (a mere five rate classes could be detected in nucleotide alignments of the entire hepatitis C genome), and the distribution of such small samples may differ dramatically from the underlying rate distribution. The ability of the β–Γ model to recover the moments of the rate distribution makes it a good general-purpose distribution with which to obtain more accurate estimates of branch lengths and phylogenies, and to provide information on variation in selective coefficients across a gene (Nielsen and Yang 2003).
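The limitation of equiprobable classes can be made concrete with a small calculation. The sketch below computes the variance, skewness, and kurtosis of a discrete rate distribution, first for the 4-bin GDD used in simulation 3 and then for eight equally weighted classes pushed to a highly asymmetric arrangement (seven classes at rate 0 and one at rate 8, a constructed example rather than a fitted model). The function name `moments` is illustrative.

```python
import math


def moments(rates, weights):
    """Mean, variance, skewness and kurtosis of a discrete distribution."""
    total = sum(weights)
    w = [x / total for x in weights]  # renormalize rounded weights

    mean = sum(wi * r for wi, r in zip(w, rates))

    def central(k):
        # k-th central moment
        return sum(wi * (r - mean) ** k for wi, r in zip(w, rates))

    var = central(2)
    skewness = central(3) / var ** 1.5
    kurtosis = central(4) / var ** 2  # proper (not excess) kurtosis
    return mean, var, skewness, kurtosis


# 4-bin GDD from simulation 3 (rates and probabilities as listed in the text).
gdd = moments([0.06, 1.14, 3.88, 8.69], [0.663, 0.181, 0.125, 0.03])
# Eight equiprobable classes: seven at rate 0, one at rate 8 (mean 1).
extreme = moments([0.0] * 7 + [8.0], [1.0] * 8)

print("4-bin GDD:      ", [round(x, 2) for x in gdd])
print("8 equal classes:", [round(x, 2) for x in extreme])
```

The eight-class example gives skewness 6/√7 (about 2.27) and kurtosis 43/7 (about 6.14). With all but one class collapsed onto a single value, this arrangement is, by a standard bound for equally weighted points, the most skewed one that eight equiprobable classes can attain. The 4-bin GDD, whose class weights are unequal, reaches a kurtosis near 9.9 with only four classes.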

[Figure 3 axis label: Log[BF{NS/S>1|s}]]

Literature Cited
