Copyright © 2003 by the Genetics Society of America

Population Subdivision and Molecular Sequence Variation: Theory and Analysis of Drosophila ananassae Data

Claus Vogl,*,†,1 Aparup Das,* Mark Beaumont,‡ Sujata Mohanty* and Wolfgang Stephan*

*Department Biologie II, Ludwig-Maximilians Universität, D-80333 München, Germany, ‡School of Animal and Microbial Sciences, University of Reading, Whiteknights, Reading, RG6 6AJ, United Kingdom and †Veterinärmedizinische Universität Wien, A-1210 Vienna, Austria

Manuscript received February 5, 2003
Accepted for publication August 6, 2003

ABSTRACT

Population subdivision complicates analysis of molecular variation. Even if neutrality is assumed, three evolutionary forces need to be considered: migration, mutation, and drift. Simplification can be achieved by assuming that the process of migration among and drift within subpopulations is occurring fast compared to mutation and drift in the entire population. This allows a two-step approach in the analysis: (i) analysis of population subdivision and (ii) analysis of molecular variation in the migrant pool. We model population subdivision using an infinite island model, where we allow the migration/drift parameter Θ to vary among populations. Thus, central and peripheral populations can be differentiated. For inference of Θ, we use a coalescence approach, implemented via a Markov chain Monte Carlo (MCMC) integration method that allows estimation of allele frequencies in the migrant pool. The second step of this approach (analysis of molecular variation in the migrant pool) uses the estimated allele frequencies in the migrant pool for the study of molecular variation. We apply this method to a Drosophila ananassae sequence data set. We find little indication of isolation by distance, but large differences in the migration parameter among populations. The population as a whole seems to be expanding. A population from Bogor (Java, Indonesia) shows the highest variation and seems closest to the species center.

1 Corresponding author: Veterinärmedizinische Universität Wien, Veterinärplatz 1, A-1210 Vienna, Austria. E-mail: [email protected]

Genetics 165: 1385–1395 (November 2003)

POPULATION subdivision is centrally important for evolution and affects estimation of all evolutionary parameters from natural and domestic populations. Even if just a neutral model is considered (i.e., selection is ignored), analysis of molecular variation in subdivided populations requires modeling of at least three forces: mutation, migration, and drift. Straightforward incorporation of these three forces into a comprehensive model leads to formidable complexity and necessitates computer-intensive numerical integration schemes (Beerli and Felsenstein 1999, 2001). If the number of subpopulations (demes) is large, however, the process of migration among and drift within subpopulations (“scattering phase”) is relatively fast, while the process of coalescence in the entire population (“collecting phase”) is relatively slow (Wakeley 2001). Then, time is too short for mutations to happen in the scattering phase; instead, almost all mutations will occur in the collecting phase.

Because of its importance, many approaches to measuring population subdivision have been put forward. Among them, the infinite island model and the complementary F-statistics approach (Wright 1931, 1969) are the earliest and most influential. Rousset (2001) and Excoffier (2001) introduce the different approaches, clarify their historic development, and discuss their relative merits. We therefore limit our introduction and discussion to what is relevant to our approach.

Since the seminal work of Wright (1931), population subdivision has often been modeled with the infinite island model. In randomly mating species, usually a single parameter describing the correlation of two random gametes within a population is estimated. This parameter was labeled Fst (Wright 1969) or Θ (Weir and Cockerham 1984). It is related to the migration parameter γ = 4N_e m (or 3N_e m for X-linked variation) via the equation Θ = 1/(1 + γ) (see Table 1 for abbreviations of key mathematical symbols). In the original model, it is assumed that all populations share the same migration parameter γ. But, obviously, populations generally differ in their sizes or migration rates or both. Wakeley (2001) relaxed these assumptions and showed that, even if a rather general migration model is assumed, the scattering phase is short compared to the collecting phase as long as the number of demes is large.

In many species with subdivided populations, the central or original populations show higher variability than the peripheral populations. This is the case for the only slightly subdivided populations of humans (Rosenberg et al. 2002), Drosophila melanogaster (David and Capy 1988), and D. simulans (Hamblin and Veuille 1999), which have their population centers in sub-Saharan Africa and spread over all continents. In many crop species, most variation among cultivars can be found where


C. Vogl et al.

TABLE 1
Abbreviations of key symbols

Symbol                Explanation
i, 1 ≤ i ≤ I          Index for the population
l, 1 ≤ l ≤ L          Index for the locus
k, 1 ≤ k ≤ K          Index for alleles of the focal locus
Θ_i                   Probability of two random alleles to coalesce within the population before migration; similar to the Weir and Cockerham (1984) measure of population subdivision Θ and Wright’s correlation coefficient Fst
γ                     γ = 4N_e m (or γ = 3N_e m for X-linked loci), where m is the migration rate per generation and N_e the effective population size
{n_1, ..., n_K}       Allele frequencies for the focal population and locus
{ρ_1, ..., ρ_K}       Inferred allelic proportions in the migrant pool
θ_w                   The Watterson (1975) statistic of sequence variation

the species was originally domesticated, e.g., for maize in Mexico (Piperno and Flannery 2001) and for potato in the Andes (Hawkes 1990).

D. ananassae exhibits more population structure than both D. melanogaster and D. simulans. It exists in many semi-isolated populations around the equator, particularly in mainland Southeast Asia and on the islands of the Pacific Ocean (Tobari 1993). D. ananassae is thought to have originated in Southeast Asia, where most of its relatives occur (Dobzhansky and Dreyfus 1943; McEvey et al. 1987). Due to its extensive population structure, D. ananassae can be used to analyze the effect of population subdivision on genetic variation. Past molecular analyses of the effect of population subdivision on variation are limited to a few loci and populations (Stephan 1989; Stephan and Langley 1989; Stephan and Mitchell 1992; Stephan et al. 1998; Chen et al. 2000). Herein, we use data from nine nuclear DNA fragments in eight populations of Southeast Asia and Australia to improve understanding of population subdivision and molecular genetic variation in this species.

Current methods for inferring population differentiation with only a single parameter cannot capture variability among populations. With coalescence-based approaches (Beerli and Felsenstein 1999, 2001; Wakeley 1999), however, this can be accomplished. These approaches generally are quite complex, such that the powerful Markov chain Monte Carlo (MCMC) method or the related importance-sampling method is used. Both methods can be combined and approximate the full likelihood or posterior probability extremely well. With a Bayesian approach, additional complications can also be incorporated relatively easily, e.g., missing data, unreliable scoring of markers, and mutation. As implemented until now, however, the cost of this flexibility is an extraordinary computational load (e.g., Beerli and Felsenstein 1999, 2001; Wakeley 1999), even if data sets are only moderately sized.
Hence, one would wish for a method fast enough to allow for analysis of large data sets with only moderate computational requirements, but flexible enough to allow for inference of different migration parameters for different populations, among other parameters. Then, a species' differentiation into central, genetically variable populations and peripheral, genetically impoverished populations could be detected. Furthermore, the method should be able to deal with molecular sequence data.

Herein, we follow Wakeley (1998, 1999, 2001) and assume that the process of migration among and drift within subpopulations is relatively fast compared to the process of mutation and drift in the entire population. This permits the use of a two-step method for the analysis of molecular sequence variation in organisms with population subdivision. In the first step, we use the infinite island model to account for migration among and drift within subpopulations (scattering phase) to estimate the migration/drift parameter γ [or conversely the differentiation parameter Θ = 1/(γ + 1)] for each population separately and to estimate contributions to the pool of migrants. In the second step, we analyze variation within the pool of migrants (collecting phase), using traditional statistics for molecular sequence variation (Watterson 1975; Tajima 1989). With these statistics, global expansion or shrinkage of the entire population over longer time spans can be detected as in the case of a single panmictic population.

Theory for analysis of molecular sequence data (e.g., Watterson 1975; Tajima 1983, 1989) is based on sample allele counts, i.e., integral frequencies. On the other hand, the usual K-allele infinite island model, which we employ for the scattering phase, is parameterized using allelic proportions in the migrants, i.e., “probabilities” of alleles. To overcome this mismatch (counts vs. proportions), we impute the number of alleles that reach the migrant pool by discounting those alleles that coalesce within the population. The allele frequencies in the migrant pool can then be subject to an analysis of molecular variation using the methods cited above or similar ones. In contrast to Wakeley (1999), our method is computationally simple and can handle almost arbitrarily large data sets. The cost of this simplicity is the use of the

Population Subdivision and Sequence Variation

Figure 1.—Map showing the sampling locations of Drosophila ananassae. 1, Kakadu; 2, Darwin; 3, Bogor; 4, Mandalay; 5, Kathmandu; 6, Bhubaneswar; 7, Puri; 8, Chennai.

inexact K-allele model for the scattering phase. We evaluate this approximation with computer simulations. The method is applied to a data set of nine loci from eight D. ananassae populations.

MATERIALS AND METHODS

Data collection: Population samples: A total of 69 isofemale lines of D. ananassae were sampled from the following eight locations (sample size in parentheses): Kakadu (3) and Darwin (6), Australia; Bogor (10), Java, Indonesia; Mandalay (10), Burma; Kathmandu (10), Nepal; and Bhubaneswar (10), Puri (10), and Chennai (10), India (Figure 1). Details on the collections of these samples will be reported elsewhere.

Identification of DNA fragments: The data set consists of DNA sequence data from nine loci. Since DNA sequence information is limited in D. ananassae, we used genomic information from D. melanogaster to identify marker loci on the X chromosome of D. ananassae. (D. melanogaster is the only closely related species with a completely available genome sequence.) The forward and reverse primers for each fragment were designed (and kindly provided by David de Lorenzo) in adjacent exons of a randomly chosen gene to amplify the intron flanked by these exons. More than 200 fragments of the above specifications (sizes varying from 400 to 1000 bp in the D. melanogaster database) were identified and tested for amplification by PCR with genomic DNA from D. ananassae; <10% of the primer pairs actually amplified DNA fragments. For each of the successfully amplified fragments, the DNA sequence was aligned with the corresponding fragment of D. melanogaster. A fragment was used for the population genetic analysis if sequence homology to the exons of D. melanogaster at both ends of the fragment was observed and if the fragment length was 300–600 bp. Eight such fragments were identified and used in this study. An intron fragment of the Om(1D) gene (Stephan et al. 1998) located in a normal recombination region of the X chromosome of D. ananassae was used as the ninth locus.

DNA sequencing: A single male from each isofemale line was used for genomic DNA preparation using the PUREGENE DNA isolation kit (Gentra Systems, Minneapolis) and then purified with EXOSAP-IT (USB Corporation, Cleveland). Both DNA strands were sequenced on an automated DNA sequencer (Megabace1000; Amersham Biosciences, Buckinghamshire, UK). Sequences were edited with SeqMan and aligned with MegAlign (DNAStar, Madison, WI). Manual alignment was used when required. Insertion-deletion polymorphisms were removed from sequences after alignment and thus not considered for the analyses.

Analysis of migration and drift among populations: The main culprit for the slow computation speed of some coalescence-based approaches is the complex migration model used (Beerli and Felsenstein 1999, 2001). By contrast, the infinite island model allows for much easier computation. Furthermore, Wakeley (2001) showed that this model agrees with quite general migration models if the number of subpopulations is large. Hence, we adopt an infinite island model with haploid individuals as the basic migration model, but use all the flexibility it can offer; i.e., we allow for population-specific migration parameters (Figure 2A). Following Weir and Cockerham (1984), we label this parameter Θ_i. For comparison, we also present simulation results using a splitting population model (Figure 2B).

We define Θ_i as the probability that, looking back in time, two random alleles coalesce within the ith population before either of them migrates into the allele pool (Figure 3A). If applied to subdivided populations, the last part of this definition needs to be replaced by “before the split from the ancestral population” (Figure 3B).

The Weir and Cockerham (1984) approach is analogous to an ANOVA random-effects model. With a random-effects model, a hyperparameter (usually the variance of effects) is used to combine information from individual effects. Generally, only the variance of effects is interesting, while estimates of the individual effects are ignored. An example is the inference of the additive genetic variance in the standard quantitative genetics model, where the individual breeding values are usually ignored.
Although we are interested in the individual migration parameters, we nevertheless follow the random-effects model by assuming that the Θ_i are drawn from a distribution. The advantage is that even if each population is rather uninformative, quite accurate estimates of the means and variances of the Θ_i over populations can be obtained. We chose an independent β-distribution for each Θ_i with parameters α and β. The β-distribution is flexible and can assume shapes from a wide “U” to a narrow bell, yet is controlled by only two parameters. We then have

$$\Pr(\Theta_i \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, \Theta_i^{\alpha - 1} (1 - \Theta_i)^{\beta - 1}. \tag{1}$$

We note that the standard Weir and Cockerham model follows in the limit of α → ∞ and β → ∞ with α/(α + β) = const. = Θ. For further development of the model, we assume that loci are unlinked and therefore conditionally independent of each other, yet all have the same Θ_i for each population. The latter assumption could be relaxed easily (Balding and Nichols 1995, 1997). Our method assumes that the time for coalescence within the population is so short that no new mutations occur (for justification of this assumption, see Wakeley 2001). We differentiate between the migrant pool and the I populations and assume a K-allele prior for allelic proportions in the migrant gene pool during the scattering phase. This is strictly justified only if the mutations in the collecting phase follow a K-allele model. With other mutation models (e.g., the infinite site model) Wakeley (1998, 1999) offers an exact method, where the distribution of noncoalescing migrants is enumerated. Unfortunately, this method is computationally cumbersome, such that only small data sets can be analyzed. The justification for our simpler method is that, for larger data sets, influence of the prior generally becomes negligible. (We could not detect a compromise in our computer simulations with rather moderately sized data sets; see below.)
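The limiting statement about the Weir and Cockerham model can be checked from the standard moments of the beta distribution in Equation 1. The short derivation below is a sketch added for illustration; the symbol $\bar\Theta$ for the prior mean is notation introduced here and does not appear in the original text.

```latex
\begin{align*}
\mathrm{E}[\Theta_i \mid \alpha, \beta] &= \frac{\alpha}{\alpha + \beta} \equiv \bar\Theta, \\
\mathrm{Var}[\Theta_i \mid \alpha, \beta]
  &= \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}
   = \frac{\bar\Theta (1 - \bar\Theta)}{\alpha + \beta + 1}
   \;\longrightarrow\; 0
   \quad \text{as } \alpha + \beta \to \infty \text{ with } \bar\Theta \text{ fixed}.
\end{align*}
```

With the ratio α/(α + β) held constant, the prior variance vanishes, so every population shares the single value Θ, which is the one-parameter setting described above.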


Figure 2.—Two models of population subdivision. (A) The island model, where the infinitely large migrant pool exchanges migrants at a rate of m_i per generation with the ith population, which consists of N_e diploid individuals. At equilibrium, Θ_i = 1/(1 + 4N_ei m_i). (B) The splitting population model, where an ancestral population subdivides into subpopulations of effective population size N_ei that evolve in isolation for t generations, such that Θ_i = 1 − e^(−t/(2N_ei)).
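The two expressions for Θ_i in the caption of Figure 2 can be compared numerically. The following is a minimal sketch, not the authors' program: the function names and the example values of N_e are hypothetical, and matching the two models by solving for the isolation time t that gives the same Θ_i is one simple choice assumed here for illustration.

```python
import math


def theta_island(n_e: float, m: float) -> float:
    """Equilibrium differentiation in the island model: 1 / (1 + 4 N_e m)."""
    return 1.0 / (1.0 + 4.0 * n_e * m)


def theta_split(t: float, n_e: float) -> float:
    """Differentiation after t generations of isolation: 1 - exp(-t / (2 N_e))."""
    return 1.0 - math.exp(-t / (2.0 * n_e))


def gamma_from_theta(theta: float) -> float:
    """Invert Theta = 1 / (1 + gamma) to recover the migration parameter."""
    return 1.0 / theta - 1.0


def split_time_for_theta(theta: float, n_e: float) -> float:
    """Isolation time t at which the splitting model yields the given Theta."""
    return -2.0 * n_e * math.log(1.0 - theta)


if __name__ == "__main__":
    n_e = 1000.0  # hypothetical effective population size
    for target in (0.1, 0.25, 0.5, 0.75, 0.9):
        m = gamma_from_theta(target) / (4.0 * n_e)  # migration rate giving this Theta
        t = split_time_for_theta(target, n_e)       # isolation time giving this Theta
        print(target, theta_island(n_e, m), theta_split(t, n_e), round(t, 1))
```

Both functions return the same Θ_i for each target value, which is the sense in which a migration rate in one model can be paired with an isolation time in the other.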

The scattering phase is handled as follows. Conditional on the allelic proportions in the migrant pool and the hyperparameters α and β defined above, populations are independent of each other. Conditional on Θ_i, coalescence is independent among loci. Hence, consideration of a single locus in a single population suffices for illustration, as all other loci and populations are conditionally independent. It can be shown that the number of coalescences within allelic class k is independent of other allelic classes and depends only on the migration parameter γ_i, the allelic proportion ρ_k, and the current frequency in the sample n_k (see appendix): within allelic class k the decision between migration and coalescence can be made recursively. Suppose that there are currently n_k individuals in class k; then the probability of the next event being a migration event is γ_i ρ_k/(γ_i ρ_k + n_k − 1) (appendix, Equation A6) and the probability of it being a coalescence event is its complement. Then n_k is reduced by one and the procedure is repeated until only one individual is left, which is a migrant by default.

Relevant parameters for calculation of allelic proportions in the migrant pool are the allele frequencies for each locus. These must be estimated from only those alleles within the I populations that originated directly from the migrant pool, whereas alleles that arose through sampling within populations (or, looking backward, through coalescence within a population) must be ignored. Otherwise, estimation of allelic proportions in the migrant pool is straightforward.

For integration, we employ an MCMC approach (Gelman et al. 1995), where we alternate cyclically between updating the parameters: conditionally on the current Θ_i and the vector of inferred allelic proportions in the migrant pool {ρ_1, ..., ρ_K}, we update the coalescence history for each population and locus; conditionally on the coalescence histories for each population, we update the allelic proportions in the migrant pool {ρ_1, ..., ρ_K} for each locus; conditionally on the allelic proportions in the migrant pool {ρ_1, ..., ρ_K}, the observed allele frequencies {n_1, ..., n_K}, and the two hyperparameters α and β, we update Θ_i for each population; and, finally, conditionally on the Θ_i’s we update the hyperparameters α and β. This updating scheme converges relatively quickly such that, with reasonable starting conditions, good approximations to the posterior distribution may already be obtained after a “burn-in” period of ~1000 iterations.

For validation of our method and the computer program, we simulated data sampled from five populations with varying Θ_i (Table 2). The model parameters (Fst or Θ_i) can be tuned such that probabilities of coalescence within populations are identical to first order for the infinite island model and the splitting population model. Parameters were five populations with Θ_i from 0.1 to 0.9 and 20 loci. With the infinite island model, we observed a bias toward lower values; with the splitting population model the bias reverses for higher Θ_i. Despite the bias, our estimator definitely improved on the Weir and Cockerham estimator.

Analysis of molecular variation within the migrant pool: The MCMC procedure described above provides, for each iteration, a sample of alleles that make it into the migrant pool. The collection of such samples over a run approximates the posterior frequencies of alleles in the migrant pool. Hence, we determine statistics describing molecular variation, the Watterson (1975) θ_w (not to be confused with the population differentiation parameter Θ above), π (Tajima 1983), and D (Tajima 1989), at every 100th cycle. We also determine the average over loci, the posterior variance, and the quantiles. The variance of the estimators in the posterior distribution is only part of the total variance. In addition, there is the familiar variation of the estimates due to coalescence, drift, and sampling within the entire population (Tajima 1983, 1989). In the case of D, this variance is normalized to one under equilibrium assumptions. The two sampling processes, within populations and within the migrant pool, are assumed to be independent.
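The three statistics named above have standard textbook definitions, which the following sketch implements for a single locus given as a list of equal-length aligned sequences. It is an illustration only: the function names are hypothetical, values are per locus rather than per site, and nothing here reproduces the authors' program or its imputation of the migrant-pool sample.

```python
import math
from itertools import combinations


def segregating_sites(seqs):
    """Number of alignment columns that contain more than one distinct base."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def nucleotide_diversity(seqs):
    """Tajima's (1983) pi: mean number of pairwise differences per locus."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)


def watterson_theta(n, s):
    """Watterson's (1975) estimator S / a1 for sample size n, per locus."""
    a1 = sum(1.0 / i for i in range(1, n))
    return s / a1


def tajimas_d(n, s, pi):
    """Tajima's (1989) D from sample size n, segregating sites s, and pi."""
    if n < 4 or s == 0:
        # For n = 3 both variance constants are zero, so D is undefined;
        # without segregating sites there is nothing to compare.
        return float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


if __name__ == "__main__":
    sample = ["AAA", "AAT", "ATT", "TTT"]  # toy alignment, not real data
    s = segregating_sites(sample)
    pi = nucleotide_diversity(sample)
    print(s, pi, watterson_theta(len(sample), s), tajimas_d(len(sample), s, pi))
```

In a two-step analysis of the kind described here, functions like these would be applied to the sample of alleles imputed for the migrant pool at a given cycle, and the resulting values collected over the run.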
Hence, the total variance is the sum of the two variances. Strictly, the Bayesian approach we adopt does not allow for calculation of confidence intervals or significance. We follow conventional statistics, however, and declare results significant


if the 0.95 posterior intervals do not overlap the expected value or if two 0.95 posterior intervals do not overlap.

For validation of the method, we performed computer simulations assuming both the infinite island and the splitting population model. Parameters were 10 populations, 20 loci with a finite site mutation model with 500 sites each, a mutation rate of 7.5 × 10^−6, and a population size of 10^4 in the migrant pool or the ancestral population (for the infinite island model and the splitting population model, respectively). Note that, because the model is a finite site model rather than the more usual infinite site model, the expectation of π is ~0.0013 (and thus slightly <0.0015) and the expectation for Tajima’s (1989) D is slightly negative (~−0.1). From this ancestral population/gene pool, 10 subpopulations were sampled with (A) Θ_i = 0.1, (B) Θ_i = 0.2, and (C) Θ_i = 0.5, using both the infinite island model and the splitting population model. In Table 3, we find that the estimated Θ_i is biased to low values in the infinite island model and, generally, toward high values in the splitting population model, similar to the simulations with unequal population sizes above. As expected, averages across populations of the estimators of population variation π and θ_w are lower than the true values in the ancestral population/gene pool and D is positive. This is expected as drift reduces molecular variation, affecting θ_w more than π. Our new estimators of π, θ_w, and D in the migrant pool are very close to the true values. All these results hold true for both the infinite island and the splitting population model. In fact, we find few differences between these two models.

TABLE 2
Estimates of Θ_i in a mutation-migration-drift model with five populations

         True Θ_i
W&C      0.1     0.25    0.5     0.75    0.9
0.16     0.00    0.14    0.35    0.63    0.83
0.15     0.00    0.15    0.30    0.67    0.88
0.15     0.00    0.17    0.37    0.70    0.77
0.15     0.00    0.19    0.36    0.59    0.84
0.15     0.00    0.19    0.36    0.57    0.89
0.20     0.01    0.37    0.61    0.77    0.89
0.19     0.02    0.30    0.62    0.80    0.92
0.20     0.02    0.35    0.58    0.81    0.93
0.20     0.03    0.28    0.55    0.78    0.90
0.16     0.01    0.33    0.59    0.74    0.94

First five rows, infinite island model; second five rows, splitting population model; W&C, the Weir and Cockerham estimator of Fst.

Figure 3.—The coalescence in (A) the island model and (B) the splitting population model. In model A, haplotypes 1 and 2 coalesce within the population before migration, whereas haplotype 3 migrates before coalescence. In model B, haplotypes 1 and 2 coalesce within the population before joining the ancestral population at time t = 0, while haplotype 3 does not coalesce before migration.

DATA ANALYSIS

We analyzed a sample of eight D. ananassae populations from Asia and Australia (Figure 1), where nine loci were sequenced in up to 10 isofemale lines per population. Pairwise Fst values for the eight populations (Table 4) show values between 0.04 and 0.19. The two Australian populations (Kakadu and Darwin) are most closely related; otherwise there is little evidence for isolation by distance. In particular, the two Indian populations of Bhubaneswar and Puri have an intermediate Fst of 0.09, although they are separated by only 50 km. The Weir and Cockerham (1984) estimate for the whole data set is 0.09.

Descriptive statistics of molecular variation (π, θ_w, and D) are given in Table 5. Populations with the highest molecular variation are Kakadu (Australia) and Bogor (Indonesia) with θ_w on average ~0.013. The Kakadu population has the smallest sample size and, accordingly, the highest variation among loci. Hence, the very high values for loci 1 and 2 may be statistical flukes. The overall high values for Bogor, on the other hand, seem quite trustworthy. The other Australian population (Darwin) and the Indian populations have smaller levels of molecular variation of θ_w ≈ 0.005. The D-statistic is mostly neutral, but Kathmandu (Nepal), Mandalay (Burma), and Bogor (Indonesia) have increasingly negative D’s.

With our two-step analysis, we performed a single MCMC run, where 10^5 iterations were sampled after a burn-in period of 10^4 iterations. For each population, the mean Θ_i and quantiles were calculated from the


TABLE 3
Statistics of molecular variation with true Θ = 0.1, true Θ = 0.2, and true Θ = 0.5

         Mean                     Migrant est.             Migrant true
Θ_i      π       θ_w     D        π       θ_w     D        π       θ_w     D

True Θ = 0.1
0.06     13.3    12.5    0.10     14.2    13.8    −0.06    14.0    13.8    −0.09
0.05     12.0    11.5    0.14     12.7    13.0    −0.10    12.5    13.0    −0.15
0.06     9.7     9.8     −0.05    10.4    11.9    −0.38    10.2    11.9    −0.42
0.06     11.2    10.8    0.09     11.9    12.9    −0.31    11.8    12.9    −0.34
0.06     13.1    13.8    −0.11    14.5    13.2    0.13     14.3    13.2    0.09
0.12     11.8    10.7    0.40     13.3    13.5    −0.08    13.2    13.5    −0.11
0.12     10.3    9.4     0.28     11.5    12.7    −0.39    11.4    12.8    −0.43
0.13     12.7    11.1    0.48     14.3    13.3    0.14     14.2    13.3    0.10
0.13     11.45   10.64   0.25     12.7    14.1    −0.34    12.7    14.1    −0.36
0.13     10.97   10.05   0.36     12.4    13.0    −0.15    12.3    13.0    −0.18

True Θ = 0.2
0.12     11.4    10.5    0.34     13.3    13.9    −0.11    13.1    13.8    −0.11
0.12     11.5    10.4    0.40     13.1    12.7    0.01     12.7    12.7    −0.08
0.11     9.9     9.3     0.22     11.3    12.7    −0.37    11.0    12.7    −0.44
0.12     11.2    10.0    0.47     12.8    13.0    −0.05    12.4    12.9    −0.10
0.11     13.1    11.3    0.35     14.6    13.8    0.00     14.5    13.8    −0.03
0.24     9.15    7.77    0.65     11.3    12.6    −0.30    11.1    12.7    −0.38
0.24     7.88    6.86    0.56     9.9     11.9    −0.51    9.9     12.0    −0.55
0.24     10.26   8.38    0.74     12.9    13.0    −0.07    12.9    13.2    −0.12
0.24     10.39   8.39    0.80     12.7    12.4    −0.03    12.6    12.6    −0.09
0.25     9.62    8.49    0.54     12.7    13.4    −0.26    12.6    13.5    −0.32

True Θ = 0.5
0.34     7.6     6.8     0.41     11.2    12.2    −0.33    11.0    12.1    −0.37
0.32     9.0     8.0     0.39     13.7    13.2    0.04     13.5    13.2    0.00
0.32     10.0    8.2     0.65     14.1    12.9    0.14     14.0    12.9    0.10
0.32     7.6     6.8     0.45     11.3    12.3    −0.27    11.1    12.4    −0.34
0.36     7.0     6.2     0.42     11.3    11.4    −0.10    10.9    11.3    −0.19
0.56     6.0     4.6     0.85     11.3    11.7    −0.15    11.2    11.8    −0.18
0.57     6.1     4.7     0.76     11.8    12.4    −0.20    11.5    12.4    −0.30
0.55     8.4     6.4     0.89     15.7    14.9    0.03     15.8    15.1    0.01
0.59     5.6     4.3     0.78     11.8    12.0    −0.12    11.6    12.1    −0.19
0.55     5.0     3.9     0.81     9.6     10.8    −0.42    9.5     11.1    −0.50

Within each block: first five rows, infinite island model; second five rows, splitting population model. Mean, the mean value of π, θ_w, and D among subpopulations; Migrant est., π, θ_w, and D in the ancestral/migrant pool population estimated by our method; Migrant true, the values of π, θ_w, and D in the ancestral/migrant pool population that gave rise to the 10 subpopulations. π and θ_w are multiplied by 1000 for ease of presentation.

approximate posterior distribution (A5). Estimates of Θ_i vary considerably among populations (see Table 6). The Bogor population is the most variable and least differentiated from the migrant pool, with a Θ_i of ~0.02, whereas, at the opposite end, the population from Chennai in Southern India has a Θ_i of ~0.25. Thus, the Bogor population seems closest to the species center and the Chennai population most peripheral. The average Θ_i is ~0.11, slightly higher than the estimate according to Weir and Cockerham (1984).

The molecular variation among migrants (Table 7) is higher than that in any of the populations. This is to be expected, as our method restores the variation that is reduced due to drift within populations. The large differences between π and θ_w are unexpected, though: on average, values for θ_w are much higher than those for π. Correspondingly, Tajima’s (1989) D-statistic is very negative, on average −1.40. While only one individual D-value is statistically significantly negative, there is a clear trend toward negative values and the mean D-value is significantly negative.

DISCUSSION

Population subdivision is centrally important to evolution. Unfortunately, it complicates analysis of molecular


TABLE 4
Pairwise Fst values

              Kakadu   Darwin   Bogor   Mandalay   Kathmandu   Bhubaneswar   Puri
Darwin        0.04
Bogor         0.07     0.10
Mandalay      0.17     0.16     0.09
Kathmandu     0.19     0.20     0.08    0.10
Bhubaneswar   0.10     0.10     0.05    0.11       0.09
Puri          0.19     0.15     0.11    0.05       0.10        0.09
Chennai       0.19     0.18     0.12    0.11       0.16        0.12          0.10

variation. Herein, we follow Wakeley (1998, 2001) and assume a two-step model: the first step is modeling of migration among and drift within subpopulations (scattering phase), using the infinite island model; the second is analysis of mutation and drift in the entire population (collecting phase), using traditional statistics for molecular sequence variation (Watterson 1975; Tajima 1989).
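The first step recalled here rests on the recursive choice between migration and coalescence given in the Materials and Methods (appendix, Equation A6). The sketch below simulates that recursion for one allelic class in one population. It is an illustration under stated assumptions: the function names, the use of Python's `random` module, and the example parameter values are all hypothetical, and the helper for the expected count simply sums the stepwise migration probabilities.

```python
import random


def migration_probability(gamma_i, rho_k, n_k):
    """Probability that the next event among n_k >= 2 lineages is a migration."""
    g = gamma_i * rho_k
    return g / (g + n_k - 1)


def sample_migrant_count(gamma_i, rho_k, n_k, rng):
    """Draw how many of the n_k sampled copies of allele k trace to the migrant pool."""
    if n_k == 0:
        return 0
    migrants = 0
    remaining = n_k
    while remaining > 1:
        if rng.random() < migration_probability(gamma_i, rho_k, remaining):
            migrants += 1  # this lineage leaves the population as a migrant
        # otherwise two lineages coalesce; either way one lineage fewer remains
        remaining -= 1
    return migrants + 1  # the last remaining lineage is a migrant by default


def expected_migrant_count(gamma_i, rho_k, n_k):
    """Expected value of the draw: 1 plus the stepwise migration probabilities."""
    if n_k == 0:
        return 0.0
    return 1.0 + sum(migration_probability(gamma_i, rho_k, j) for j in range(2, n_k + 1))


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical values: gamma_i = 4, rho_k = 0.5, ten sampled copies of allele k.
    draws = [sample_migrant_count(4.0, 0.5, 10, rng) for _ in range(10000)]
    print(sum(draws) / len(draws), expected_migrant_count(4.0, 0.5, 10))
```

A small γ_i (strong differentiation) makes most lineages coalesce before migrating, so few copies reach the migrant pool; a large γ_i lets nearly every sampled copy count as a separate migrant.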

Our method of modeling the process of migration among and drift within subpopulations with the infinite island model is similar to those developed by Balding and Nichols (1995, 1997) and Nicholson et al. (2002). But we allow each population to have its own specific differentiation parameter Θ_i and assume that the Θ_i’s are drawn from a common distribution. [This is similar to the random-effects model of Weir and Cockerham (1984).]
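To make the idea of population-specific Θ_i’s drawn from a common distribution concrete, the sketch below draws Θ_i values from a shared beta distribution, as in Equation 1, and converts them to migration parameters with Θ = 1/(1 + γ). The function names and the hyperparameter values are hypothetical choices for illustration and are not estimates from the data.

```python
import random


def draw_population_thetas(alpha, beta, n_populations, rng):
    """Draw one Theta_i per population from a shared Beta(alpha, beta) distribution."""
    return [rng.betavariate(alpha, beta) for _ in range(n_populations)]


def migration_parameter(theta):
    """Convert Theta_i to gamma_i using Theta = 1 / (1 + gamma)."""
    return 1.0 / theta - 1.0


if __name__ == "__main__":
    rng = random.Random(2003)
    # Same prior mean (1/9) in both cases; the second prior is far more concentrated.
    for alpha, beta in ((1.0, 8.0), (1000.0, 8000.0)):
        thetas = draw_population_thetas(alpha, beta, 8, rng)
        print(alpha, beta, [round(t, 3) for t in thetas])
```

With small hyperparameters the eight draws differ widely, which corresponds to a mix of central and peripheral populations; with large hyperparameters they cluster near the common mean, approaching the single-parameter case.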

TABLE 5
Molecular variation summary statistics of eight populations and nine loci

Locus  L    Parm.  Kak.    Darw.   Bogo.   Mand.   Kath.   Bhub.   Puri    Chen.   E
L1     455  θw     27.84   2.89    13.21   3.88    2.33    6.99    3.11    0.78    7.63
            π      27.84   2.64    11.87   2.78    2.98    4.54    2.78    0.44    6.98
            D      —       −0.45   −0.47   −1.14   1.00    −1.53   −0.4    −1.11   −0.51
L2     422  θw     30.02   2.08    20.94   21.78   21.78   13.4    8.38    10.05   16.05
            π      30.02   2.21    21.91   19.27   22.70   14.27   8.32    14.48   16.65
            D      —       0.31    0.22    −0.55   0.20    0.30    −0.03   1.99    0.31
L3     363  θw     16.53   24.13   19.48   2.92    16.55   3.90    1.95    2.92    11.05
            π      16.53   27.55   15.49   1.65    12.98   3.06    2.02    4.53    10.48
            D      —       0.89    −0.96   −1.56   −1.00   −0.82   0.12    1.98    −0.17
L4     468  θw     17.09   6.55    22.66   12.08   7.55    7.55    4.53    6.80    10.6
            π      17.09   8.12    18.57   8.55    4.94    6.46    3.28    3.85    8.86
            D      —       1.39    −0.87   −1.36   −1.53   −0.64   −1.15   −1.90   −0.76
L5     417  θw     0.00    8.4     13.56   4.24    9.32    5.09    4.24    13.56   5.65
            π      0.00    8.31    18.57   8.55    4.94    6.46    3.28    3.85    4.98
            D      —       −0.06   −0.77   −1.74   −1.59   1.06    0.98    −0.65   −0.66
L6     395  θw     3.38    1.11    5.37    3.58    1.79    2.68    0.89    3.58    2.8
            π      3.38    0.84    3.88    4.44    2.76    3.26    0.51    3.77    2.86
            D      —       −0.93   −1.15   0.93    1.74    0.78    −1.11   0.20    0.06
L7     342  θw     1.95    2.56    10.34   6.20    8.27    4.13    10.34   8.27    6.51
            π      1.95    2.53    10.20   6.82    8.38    5.65    9.68    9.16    6.80
            D      —       −0.05   −0.06   0.41    0.06    1.41    −0.28   0.47    0.25
L8     489  θw     2.73    1.79    4.34    4.34    3.61    3.61    2.17    2.17    3.09
            π      2.73    1.77    3.09    4.14    2.04    4.04    2.59    2.59    2.87
            D      —       −0.05   −1.19   −0.19   −1.74   0.48    0.70    0.70    −0.16
L9     396  θw     18.52   11.06   14.28   8.93    8.93    7.14    5.36    4.46    9.83
            π      18.52   12.96   11.9    9.88    8.53    8.59    6.62    4.60    10.2
            D      —       1.03    −0.77   0.47    −0.2    0.87    0.98    0.12    0.31
E           θw     13.29   6.50    12.59   7.46    8.12    6.72    4.93    5.09    8.13
            π      13.29   7.10    11.31   6.65    7.42    6.81    4.82    5.42    7.85
            D      —       0.13    −0.67   −0.47   −0.34   0.11    −0.13   0.17    −0.15

Kak., Kakadu; Darw., Darwin; Bogo., Bogor; Mand., Mandalay; Kath., Kathmandu; Bhub., Bhubaneswar; Chen., Chennai. θw and π are multiplied by 1000. L, the length of the sequenced stretch in base pairs; Parm., the parameter; E, the locus and population means, respectively. Sample size in the Kakadu population was too small for calculation of D.
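The three statistics reported per locus and population in Table 5 can be illustrated with a short Python sketch. It assumes the standard definitions of Watterson's (1975) estimator and Tajima's (1989) D; the function names and the input format are hypothetical and are not taken from the authors' software. The functions work with per-locus counts; dividing θw and π by the sequence length L would presumably give the per-site scale used in the table.

```python
import math
from itertools import combinations


def watterson_theta(n, seg_sites):
    """Watterson's estimator: segregating sites divided by a1 = sum 1/i."""
    a1 = sum(1.0 / i for i in range(1, n))
    return seg_sites / a1


def mean_pairwise_differences(seqs):
    """Average number of differing sites over all pairs of aligned sequences."""
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return diffs / len(pairs)


def tajimas_d(n, seg_sites, pi):
    """Tajima's D; returns None when the variance term is undefined."""
    if n < 4 or seg_sites == 0:
        return None  # the variance constants vanish for n < 4
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)
```

The guard for small samples reflects the table footnote: with very few sequences, as in Kakadu, D cannot be calculated.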


TABLE 6
Parameters from the posterior distribution of Θi

Population   Posterior mean  2.5%  5.0%  50.0%  95.0%  97.5%
Kakadu       0.17            0.04  0.05  0.16   0.31   0.33
Darwin       0.22            0.11  0.13  0.22   0.32   0.34
Bogor        0.02            0.00  0.00  0.02   0.04   0.05
Mandalay     0.10            0.05  0.05  0.10   0.16   0.17
Kathmandu    0.14            0.07  0.08  0.14   0.21   0.23
Bhubaneswar  0.08            0.04  0.04  0.08   0.13   0.14
Puri         0.15            0.08  0.09  0.14   0.22   0.25
Chennai      0.18            0.10  0.11  0.17   0.26   0.27

In contrast to the method developed by Wakeley (1999), our approach is exact only for a K-allele mutation model, but it leads to relatively simple formulas for conditional probabilities of key parameters. We did not note any loss of performance in computer simulations. Furthermore, our approach is fast (inference takes only minutes with quite large data sets on a personal computer) and can handle data sets of almost arbitrary size. We use an MCMC scheme, mainly Gibbs sampling, for integration.

We validated the model with computer simulations. If the infinite island model is used for simulations, the statistical analysis performs very well, as expected. But even if a different model with splitting populations instead of the infinite island model is used for simulations, estimates of $\Theta_i$ are quite accurate. Our analysis therefore seems quite robust to violations of assumptions as long as the population differentiation parameters $\Theta_i$ remain the same.

In the second step of our analysis, we analyze the allele spectrum within the migrant population, using traditional estimates of population variation and deviation from mutation-drift equilibrium (Watterson 1975; Tajima 1989). As with the migration/drift parameters, mutation/drift parameters used for simulation are recovered in our analysis. The migration/drift process increases variability in the estimates of molecular variation, but does not appear to cause bias.

TABLE 7
Inferred molecular variation within the migrant pool

      L1     L2     L3     L4     L5     L6     L7     L8     L9     Mean
π     8.3    19.5   12.6   12.0   9.4    3.3    9.1    3.5    12.1   10.0
θw    22.6   30.6   26.0   28.0   19.6   5.1    10.8   6.4    15.7   18.3
D     −2.18  −1.24  −1.77  −1.98  −1.77  −0.99  −0.47  −1.38  −0.77  −1.40

D-values outside the 0.95 interval are in italics.

The process of estimating statistics of molecular variation in the migrant pool/ancestral population described above can be thought of as a way of removing the effect of drift within subpopulations. The statistics can therefore be used exactly like those from a sample of an undivided population. A negative D, for example, may indicate population expansion.

With this method, we analyzed molecular variation from eight populations of D. ananassae from South Asia, Southeast Asia, and Australia. In tropical and subtropical regions of the world, D. ananassae is one of the most common Drosophila species, especially in and around human habitations. Although populations are separated by major geographical barriers such as mountains and oceans, recurrent transportation by human activity may lead to gene exchange. In spite of this, earlier studies with molecular genetic markers detected significant population subdivision with a different method (Stephan 1989; Stephan and Langley 1989; Stephan and Mitchell 1992; Stephan et al. 1998; Chen et al. 2000). This was confirmed in this study: the average population differentiation parameter $\Theta$ was approximately 0.1. The migration/drift parameters $\Theta_i$ vary considerably among populations (Table 6), from a very low value of 0.02 in Bogor (Java, Indonesia) to values an order of magnitude higher in some Australian and Indian populations. These estimates suggest that the Bogor population is close to the species center, while the Australian and Indian populations are peripheral.

Furthermore, we observe little isolation by distance in this data set: the Indian populations of Puri and Bhubaneswar are separated by only 50 km, yet share only as much genetic variation as two random populations. At ranges different from those covered in this data set, isolation by distance may still be observed in D. ananassae, and appropriate adjustments would then need to be made in the method.
As with population differentiation, levels of molecular variation are quite variable among populations: the Bogor population is more than twice as variable as some Indian populations. In light of the previous analyses, the Indian populations seem to show reduction of variation due to drift. This is consistent with the values of Tajima's (1989) D-statistic observed in these populations: the Indian populations show neutral or slightly positive D, while D in the Bogor population is negative. In a substructured population, drift leads to positive D values. For consistency with the negative D of the Bogor population, a migrant pool with negative D needs to be postulated. This may be caused by rapid population expansion. The process of drift in peripheral populations may then push these negative values back toward neutral or slightly positive values. Thus, in some Indian populations, two processes pushing in opposite directions might cancel out and lead to neutral or slightly positive D.

In our two-step analysis, the allele spectrum in the migrant pool has an even more negative D than the sample from Bogor. Simulation results indicate, however, that this may be partly an artifact: in expanding populations, estimates of the D-statistic are increasingly negative with increasing sample size (data not shown).

From a biological point of view, our observation of negative values of the D-statistic for the Bogor population is particularly interesting, as our results also suggest that this population is close to the species center of D. ananassae. This observation contradicts the conventional assumption that ancestral populations are in approximate equilibrium. A similar case was recently reported by Hamblin and Veuille (1999): using a single, putatively neutral marker, these authors found negative (though not significant) D-values for a number of African populations of D. simulans.

We thank two reviewers and the editor for their comments on the manuscript, our colleagues at the evolutionary biology group at Ludwig-Maximilians Universität for discussion and critical reading of the manuscript, David de Lorenzo for designing the primers we used, and the Deutsche Forschungsgemeinschaft (grant no. STE 325/4-1) for financial support.

LITERATURE CITED

Balding, D., and R. Nichols, 1995 A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96: 3–12.
Balding, D., and R. Nichols, 1997 Significant genetic correlations among caucasians at forensic DNA loci. Heredity 78: 583–589.
Beerli, P., and J. Felsenstein, 1999 Maximum-likelihood estimation of migration rates and effective population numbers in two populations using a coalescence approach. Genetics 152: 763–773.
Beerli, P., and J. Felsenstein, 2001 Maximum likelihood estimation of a migration matrix and effective population sizes in subpopulations by using a coalescent approach. Proc. Natl. Acad. Sci. USA 98: 4563–4568.
Chen, Y. B., J. Marsh and W. Stephan, 2000 Joint effect of natural selection and recombination on gene flow between Drosophila ananassae populations. Genetics 155: 1185–1194.
David, J., and P. Capy, 1988 Genetic variation of Drosophila melanogaster natural populations. Trends Genet. 4: 106–111.
Dobzhansky, T., and A. Dreyfus, 1943 Chromosomal aberrations in Brazilian Drosophila ananassae. Proc. Natl. Acad. Sci. USA 29: 368–375.
Excoffier, L., 2001 Analysis of population subdivision, pp. 271–307 in Handbook of Statistical Genetics, edited by D. Balding, M. Bishop and C. Cannings. Wiley, New York.
Gelman, A., J. Carlin, H. Stern and D. Rubin, 1995 Bayesian Data Analysis. Chapman & Hall, London/New York.
Hamblin, M., and M. Veuille, 1999 Population structure among African and derived populations of Drosophila simulans: evidence for ancient subdivision and recent admixture. Genetics 153: 305–317.
Hawkes, J., 1990 The Potato: Evolution, Biodiversity and Genetic Resources. Belhaven Press, London.
McEvey, S. F., J. R. David and L. Tsacas, 1987 The Drosophila ananassae complex with description of a new species from French Polynesia (Diptera: Drosophilidae). Ann. Soc. Entomol. 23: 377–385.
Nicholson, G., A. Smith, F. Jonsson, O. Gustafsson, K. Stefansson et al., 2002 Assessing population differentiation and isolation from single-nucleotide polymorphism data. J. R. Stat. Soc. B 64: 1–21.
Piperno, D., and K. Flannery, 2001 The earliest archeological maize (Zea mays L.) from highland Mexico: new accelerator mass spectrometry dates and their implications. Proc. Natl. Acad. Sci. USA 98: 2101–2103.
Rosenberg, N. A., J. K. Pritchard, J. L. Weber, H. M. Cann, K. K. Kidd et al., 2002 Genetic structure of human populations. Science 298: 2381–2385.
Rousset, F., 2001 Inference from spatial population genetics, pp. 239–269 in Handbook of Statistical Genetics, edited by D. Balding, M. Bishop and C. Cannings. Wiley, New York.
Stephan, W., 1989 Molecular genetic variation in the centromeric region of the X chromosome in three Drosophila ananassae populations. II. The Om(1D) locus. Mol. Biol. Evol. 6: 624–635.
Stephan, W., and C. H. Langley, 1989 Molecular genetic variation in the centromeric region of the X chromosome in three Drosophila ananassae populations. I. Contrasts between the vermilion and forked loci. Genetics 121: 89–99.
Stephan, W., and S. J. Mitchell, 1992 Reduced levels of DNA polymorphism and fixed between-population differences in the centromeric region of Drosophila ananassae. Genetics 132: 1039–1045.
Stephan, W., L. Xing, D. A. Kirby and J. M. Braverman, 1998 A test of the background selection hypothesis based on nucleotide data from Drosophila ananassae. Proc. Natl. Acad. Sci. USA 95: 5649–5654.
Stephens, M., and P. Donnelly, 2000 Inference in molecular population genetics. J. R. Stat. Soc. B 62: 605–655.
Tajima, F., 1983 Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437–460.
Tajima, F., 1989 Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585–595.
Tobari, Y. N., 1993 Drosophila ananassae: Genetical and Biological Aspects. Japan Scientific Societies Press, Tokyo.
Wakeley, J., 1998 Segregating sites in Wright's island model. Theor. Popul. Biol. 53: 166–174.
Wakeley, J., 1999 Nonequilibrium migration in human history. Genetics 153: 1863–1871.
Wakeley, J., 2001 The coalescent in an island model of population subdivision with variation among demes. Theor. Popul. Biol. 59: 133–144.
Watterson, G., 1975 On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7: 256–276.
Weir, B., and C. Cockerham, 1984 Estimating F-statistics for the analysis of population structure. Evolution 38: 1358–1370.
Wright, S., 1931 Evolution in Mendelian populations. Genetics 16: 97–159.
Wright, S., 1969 Evolution and the Genetics of Populations. II. The Theory of Gene Frequencies. University of Chicago Press, Chicago.

Communicating editor: M. Veuille

APPENDIX

Consider a haploid infinite island model and assume that the diffusion approximation or the Moran model holds. Since we assume that loci are unlinked and therefore conditionally independent, consideration of one locus suffices. Assume that the locus has K alleles in the pool of migrants; some of these may be missing in a sample from a particular population. The allelic proportions in the pool of migrants $\{\rho_1, \ldots, \rho_K\}$ are constant over time, but unknown and need to be estimated. The main parameter of interest is the migration parameter for each population, $\gamma_i$, or equivalently the probability that two random alleles coalesce within the population before migration, $\Theta_i = 1/(\gamma_i + 1)$. Given the allelic proportions in the pool of migrants, estimation of $\gamma_i$ or, equivalently, $\Theta_i$ is independent among populations. We thus leave out the indices i and l for population and locus, respectively. Data consist of allele frequencies in the sample, denoted by $\{n_1, \ldots, n_K\}$, with sample size $n = n_1 + \cdots + n_K$.

The likelihood: The distribution of the data $\{n_1, \ldots, n_K\}$ given the allelic proportions in the migrant pool $\{\rho_1, \ldots, \rho_K\}$ and the migration parameter $\gamma$ is a Dirichlet-multinomial distribution (see Balding and Nichols 1995, 1997, and references therein):

$$
\Pr(n_1, \ldots, n_K \mid \gamma, \rho_1, \ldots, \rho_K) = \frac{n!}{n_1! \cdots n_K!}\,\frac{\Gamma(\gamma)}{\Gamma(\gamma + n)}\,\frac{\Gamma(\rho_1\gamma + n_1) \cdots \Gamma(\rho_K\gamma + n_K)}{\Gamma(\rho_1\gamma) \cdots \Gamma(\rho_K\gamma)}. \tag{A1}
$$
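As an illustration, Equation A1 can be evaluated on the log scale with the log-gamma function. The following Python sketch is not the authors' implementation; the function names are hypothetical, and it assumes that every $\rho_k$ and $\gamma$ are strictly positive.

```python
import math


def gamma_from_theta(theta):
    """Invert Theta = 1 / (gamma + 1) to obtain the migration parameter."""
    return 1.0 / theta - 1.0


def log_dirichlet_multinomial(counts, rho, gamma):
    """Log of Equation A1 for one locus in one population.

    counts: allele counts n_1..n_K in the sample
    rho:    allelic proportions in the migrant pool (all > 0, assumed)
    gamma:  migration parameter (> 0, assumed)
    """
    n = sum(counts)
    # Multinomial coefficient n! / (n_1! ... n_K!)
    logp = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    # Gamma(gamma) / Gamma(gamma + n)
    logp += math.lgamma(gamma) - math.lgamma(gamma + n)
    # Product over alleles of Gamma(rho_k*gamma + n_k) / Gamma(rho_k*gamma)
    for c, r in zip(counts, rho):
        logp += math.lgamma(r * gamma + c) - math.lgamma(r * gamma)
    return logp
```

For a single sampled allele the expression reduces to the migrant-pool proportion of that allele, which gives a quick sanity check of the formula.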

The coalescence history: For simulating the coalescence history backward in time, we consider an infinitesimally short time interval, $\delta$. Two types of events may happen in this interval: a migration event and a coalescent event. With a coalescent event, two lineages with the same allele collapse into one. With a migration event, a lineage is substituted by another one by migration. Since migration is independent of the allelic type, no information on which allele was in the lineage before the migration event is available. Hence, the lineage is exchangeable with all other lineages in the population for which no data are available. The lineage is thus dropped from the analysis exactly as one is dropped through coalescence.

Subsequently we need the posterior distribution of the type of the next allele sampled given the previously sampled allelic types $\{n_1, \ldots, n_K\}$ and the allelic proportions in the migrant pool $\{\rho_1, \ldots, \rho_K\}$. This distribution has been derived previously (Balding and Nichols 1995, 1997; Stephens and Donnelly 2000). For consistent notation, we introduce the characteristic vector $\{x_1, \ldots, x_K\}$, where $x_k = 1$ for the chosen allele and 0 for all others. In our notation, the posterior of the characteristic vector is

$$
\Pr(x_1, \ldots, x_K \mid n_1, \ldots, n_K, \gamma, \rho_1, \ldots, \rho_K) = \left(\frac{\rho_1\gamma + n_1}{n + \gamma}\right)^{x_1} \cdots \left(\frac{\rho_K\gamma + n_K}{n + \gamma}\right)^{x_K}, \tag{A2}
$$

which corresponds to formula (16) in Stephens and Donnelly (2000). We index the sequence of coalescence or migration events with s. Obviously, the last event must be a migration event. Assume that the lineages remaining in the analysis after the sth coalescence or migration event are $\{n_1(s), \ldots, n_K(s)\}$. Introduce another characteristic vector $\{y_1(s), \ldots, y_K(s)\}$, defined like $\{x_1(s), \ldots, x_K(s)\}$. The probability that the number of lineages is reduced by one by migration, to $\{n_1(s) - x_1(s), \ldots, n_K(s) - x_K(s)\}$, within the interval $\delta$ given $\{n_1(s), \ldots, n_K(s)\}$, $\{\rho_1, \ldots, \rho_K\}$, and $\gamma$ is

$$
\begin{aligned}
\Pr(x_1, \ldots, x_K \mid n_1, \ldots, n_K, \rho_1, \ldots, \rho_K, \gamma)
&= \delta\,\frac{2}{n(n + \gamma - 1)}\,\frac{\gamma\,(n_1\rho_1)^{x_1} \cdots (n_K\rho_K)^{x_K}}{2} \times \frac{\sum_y \Pr(y_1, \ldots, y_K \mid n_1 - x_1, \ldots, n_K - x_K, \gamma)}{\Pr(x_1, \ldots, x_K \mid n_1 - x_1, \ldots, n_K - x_K, \gamma)} \\
&= \delta\,\frac{1}{n(n + \gamma - 1)}\,\frac{\gamma\,(n_1\rho_1)^{x_1} \cdots (n_K\rho_K)^{x_K}}{\Pr(x_1, \ldots, x_K \mid n_1 - x_1, \ldots, n_K - x_K, \gamma)} \\
&= \delta\,\frac{\gamma}{n}\left(\frac{n_1\rho_1}{n_1 + \rho_1\gamma - 1}\right)^{x_1} \cdots \left(\frac{n_K\rho_K}{n_K + \rho_K\gamma - 1}\right)^{x_K},
\end{aligned} \tag{A3}
$$

where we dropped the index s for notational convenience, and $\sum_y$ is the sum over all possible vectors $\{y_1(s), \ldots, y_K(s)\}$. Compare this equation to the first part of Equation 15 in Stephens and Donnelly (2000). The probability that the number of lineages is reduced by one by coalescence, to $\{n_1(s) - x_1(s), \ldots, n_K(s) - x_K(s)\}$, within the interval $\delta$ given $\{n_1(s), \ldots, n_K(s)\}$ and $\gamma$ is

$$
\Pr(x_1, \ldots, x_K \mid n_1, \ldots, n_K, \rho_1, \ldots, \rho_K, \gamma) = \delta\,\frac{1}{n}\left(\frac{n_1(n_1 - 1)}{\rho_1\gamma + n_1 - 1}\right)^{x_1} \cdots \left(\frac{n_K(n_K - 1)}{\rho_K\gamma + n_K - 1}\right)^{x_K}, \tag{A4}
$$

where we again left out the index s for notational convenience. Compare this equation to the second part of Equation 15 in Stephens and Donnelly (2000). Summing the relevant terms, one realizes that the posterior probability of class k to be chosen for either a migration or a coalescence event is $n_k/n$. Thus, if we again use the characteristic vector $\{x_1, \ldots, x_K\}$, the probability of class k to be chosen is a Bernoulli distribution

$$
\Pr(x_1, \ldots, x_K \mid n_1, \ldots, n_K, \rho_1, \ldots, \rho_K, \gamma) = \frac{1}{n}\,n_1^{x_1} \cdots n_K^{x_K}. \tag{A5}
$$
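The step from Equations A3 and A4 to A5 can be written out for a single class k, setting $x_k = 1$ and all other $x_j = 0$. This only restates the sum described in the text, with the common factor $\delta$ kept as in A3 and A4:

$$
\delta\,\frac{\gamma}{n}\,\frac{n_k\rho_k}{n_k + \rho_k\gamma - 1} + \delta\,\frac{1}{n}\,\frac{n_k(n_k - 1)}{\rho_k\gamma + n_k - 1} = \delta\,\frac{n_k}{n}\,\frac{\gamma\rho_k + n_k - 1}{\rho_k\gamma + n_k - 1} = \delta\,\frac{n_k}{n}.
$$

Summed over the K classes this equals $\delta$, so conditional on an event occurring, class k is chosen with probability $n_k/n$.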

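Given that class k is chosen, the conditional probability that the event is a coalescence or a migration is

$$
\Pr(\text{coalescence} \mid \gamma, k, n_k, \rho_k) = \frac{n_k - 1}{\gamma\rho_k + n_k - 1} \quad \text{and} \quad \Pr(\text{migration} \mid \gamma, k, n_k, \rho_k) = \frac{\gamma\rho_k}{\gamma\rho_k + n_k - 1}. \tag{A6}
$$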
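Equations A5 and A6 suggest a simple backward simulation of the coalescence history of one population at one locus. The Python sketch below is an illustration under stated assumptions, not the authors' code: the function name `simulate_migrants` is hypothetical, and $\gamma$ and all $\rho_k$ are assumed to be strictly positive.

```python
import random


def simulate_migrants(counts, rho, gamma, rng=None):
    """Sample how many lineages of each allele trace back to the migrant pool.

    counts: sampled allele counts n_1..n_K in one population at one locus
    rho:    allelic proportions in the migrant pool (all > 0, assumed)
    gamma:  migration parameter of the population (> 0, assumed)
    """
    rng = rng or random.Random()
    remaining = list(counts)
    migrants = [0] * len(counts)
    while sum(remaining) > 0:
        # Equation A5: the next event hits class k with probability n_k / n.
        k = rng.choices(range(len(remaining)), weights=remaining)[0]
        n_k = remaining[k]
        # Equation A6: the event is a migration with probability
        # gamma*rho_k / (gamma*rho_k + n_k - 1), otherwise a coalescence.
        p_migration = gamma * rho[k] / (gamma * rho[k] + n_k - 1)
        if rng.random() < p_migration:
            migrants[k] += 1
        remaining[k] -= 1  # either event removes one lineage from class k
    return migrants
```

The last lineage of every class present in the sample always leaves by migration, because the coalescence probability in A6 is zero when $n_k = 1$. The returned migrant counts are the records needed to update the allelic proportions in the migrant pool.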

As the temporal sequence of coalescences is irrelevant, each allelic class may be treated independently. The number of coalescences relative to the number of migrations within allelic class k is thus independent of the other classes and depends only on the frequency $n_k$, the allelic proportion in the migrant pool $\rho_k$, and the migration parameter $\gamma$. The coalescence history can thus be inferred by simulating backward until no more lineages remain in the population. For each locus, this requires at most $n_i - 1$ samples from a Bernoulli distribution from the ith population. Records of the allelic types of all migrants need to be kept for updating the estimates of allelic proportions within the migrant pool.

The migration parameter: Updating $\gamma$ or $\Theta$ involves sampling from a nonstandard distribution. [Note that even though we currently do not index $\Theta$, we refer to the population-specific migration parameter and not to the Weir and Cockerham (1984) summary parameter.] This distribution is proportional to the product of the conditional distribution of $\Theta$ given the hyperparameters $\alpha$ and $\beta$ in formula (1) and the likelihood of formula (A1) multiplied over all loci. If a flat prior for $\gamma$ is used, the prior distribution is not proper; i.e., no constant can be found such that it integrates to one. Reparameterizing $\Theta = 1/(1 + \gamma)$ assures a proper posterior distribution with flat priors. Introducing the index l for the loci, we then have

$$
\Pr(\Theta \mid \rho_{l1}, \ldots, \rho_{lK}, n_{l1}, \ldots, n_{lK}, \alpha, \beta) \propto \Pr(\Theta \mid \alpha, \beta) \prod_{l=1}^{L} \Pr(n_{l1}, \ldots, n_{lK} \mid \rho_{l1}, \ldots, \rho_{lK}, \Theta). \tag{A7}
$$
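The unnormalized target of Equation A7 can be sketched in Python as follows. The conditional distribution $\Pr(\Theta \mid \alpha, \beta)$ of formula (1) is not reproduced in this appendix, so the sketch takes it as a user-supplied function `log_prior` and defaults to a flat prior on $\Theta$; that default, the function names, and the data layout are assumptions made for illustration.

```python
import math


def log_likelihood(counts, rho, gamma):
    """Log of Equation A1 up to the multinomial coefficient (constant in gamma)."""
    n = sum(counts)
    logp = math.lgamma(gamma) - math.lgamma(gamma + n)
    for c, r in zip(counts, rho):
        logp += math.lgamma(r * gamma + c) - math.lgamma(r * gamma)
    return logp


def log_posterior_theta(theta, loci, log_prior=lambda theta: 0.0):
    """Unnormalized log of Equation A7 for one population.

    theta:     differentiation parameter, must lie strictly between 0 and 1
    loci:      list of (counts, rho) pairs, one per locus
    log_prior: log of Pr(theta | alpha, beta); flat by default (assumption)
    """
    if not 0.0 < theta < 1.0:
        return float("-inf")
    gamma = 1.0 / theta - 1.0  # inverse of Theta = 1 / (1 + gamma)
    return log_prior(theta) + sum(log_likelihood(c, r, gamma) for c, r in loci)
```

A sample that is fixed for one allele at a locus where the migrant pool is polymorphic gives more support to large $\Theta$ (strong drift) than to small $\Theta$, as one would expect.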

With a Metropolis-Hastings step, it is possible to sample from this distribution by using a proposal or jumping distribution $j(\Theta \mid x)$ (Gelman et al. 1995). Suppose the old value is $\Theta_{\text{old}}$ and the new value sampled from $j(\Theta \mid x)$ is $\Theta_{\text{new}}$; then $\Theta_{\text{new}}$ is accepted in favor of retaining $\Theta_{\text{old}}$ with probability one if the ratio

$$
a = \frac{\Pr(\Theta_{\text{new}} \mid x)\,j(\Theta_{\text{old}} \mid x)}{\Pr(\Theta_{\text{old}} \mid x)\,j(\Theta_{\text{new}} \mid x)} \tag{A8}
$$

is greater than one, and with probability a otherwise.

The hyperparameters: Updating the hyperparameters $\alpha$ and $\beta$ involves sampling from a nonstandard distribution proportional to formula (1). Again we employ a Metropolis-Hastings step.

Allelic proportions in the migrant pool: To get a new estimate of the allele frequencies in the migrant pool $\{\rho_1, \ldots, \rho_K\}$, the sum over all populations of allelic frequencies in each allelic class that enter the migrant pool needs to be determined. Given $\rho$, the probability of this sum of alleles is multinomial. The new allele proportions $\{\rho_1, \ldots, \rho_K\}$ can thus be sampled from the sum of allelic frequencies using a Dirichlet distribution, where a flat or other Dirichlet prior distribution may be used.

The sampling scheme: We alternate between (i) sampling the coalescence history recursively using formula (A6) conditional on the migrant allelic proportions $\rho$ and the differentiation parameter $\Theta_i$ for each population and locus, (ii) sampling the allelic proportions $\{\rho_1, \ldots, \rho_K\}$ in the migrant pool conditional on the coalescence histories in all populations using a Dirichlet distribution for each locus, (iii) sampling the population differentiation parameters $\Theta_i$ from the Dirichlet-multinomial distribution conditional on the observed allele frequencies and the hyperparameters using (A7), and (iv) sampling the hyperparameters $\alpha$ and $\beta$ conditional on the population differentiation parameters $\Theta_i$. This updating scheme converges to the joint posterior distribution of the parameters. From the joint posterior all marginal posterior distributions can be obtained easily. Furthermore, the sample of individuals that reach the migrant pool is used for calculating the sequence variation statistics once per 100 cycles.
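Two of the updates in this scheme can be sketched in a few lines of Python: the Metropolis-Hastings step of Equation A8 and the Dirichlet draw of the migrant-pool proportions in step (ii). The appendix does not specify the jumping distribution, so the symmetric Gaussian random walk used here is an assumption, as are the function names and the flat Dirichlet prior chosen as the default.

```python
import math
import random


def metropolis_step(current, log_target, rng, step=0.1):
    """One update of Theta with a symmetric random-walk proposal (assumed).

    log_target returns the unnormalized log posterior, e.g. the log of A7.
    """
    proposal = rng.gauss(current, step)
    # For a symmetric proposal the j terms in Equation A8 cancel,
    # so log a is just the difference of the log targets.
    log_a = log_target(proposal) - log_target(current)
    if log_a >= 0 or rng.random() < math.exp(log_a):
        return proposal  # accepted with probability min(1, a)
    return current


def sample_dirichlet(migrant_counts, rng, prior=1.0):
    """Draw migrant-pool proportions from Dirichlet(migrant_counts + prior).

    migrant_counts: per-allele totals of lineages reaching the migrant pool,
    summed over populations; prior=1.0 corresponds to a flat Dirichlet prior.
    """
    draws = [rng.gammavariate(c + prior, 1.0) for c in migrant_counts]
    total = sum(draws)
    return [d / total for d in draws]
```

Proposals outside the interval (0, 1) are rejected automatically if `log_target` returns negative infinity there, so the chain should be started from a value inside the interval.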