Genomic signature of successful colonization of ... - Wiley Online Library

Molecular Ecology (2016) 25, 616–629

doi: 10.1111/mec.13491

Genomic signature of successful colonization of Eurasia by the allopolyploid shepherd’s purse (Capsella bursapastoris) M I N , * K . H O L M , * S . I . W R I G H T † and A. CORNILLE,* A. SALCEDO,† D. KRYVOKHYZHA,* S. GLE M. LASCOUX* *Department of Ecology and Genetics, Evolutionary Biology Centre, Science for life Laboratory, Uppsala University, Uppsala 75236, Sweden, †Department of Ecology and Evolutionary Biology, University of Toronto, 25 Willcocks St., Toronto, ON M6R 1M3, Canada

Abstract Polyploidization is a dominant feature of flowering plant evolution. However, detailed genomic analyses of the interpopulation diversification of polyploids following genome duplication are still in their infancy, mainly because of methodological limits, both in terms of sequencing and computational analyses. The shepherd’s purse (Capsella bursa-pastoris) is one of the most common weed species in the world. It is highly self-fertilizing, and recent genomic data indicate that it is an allopolyploid, resulting from hybridization between the ancestors of the diploid species Capsella grandiflora and Capsella orientalis. Here, we investigated the genomic diversity of C. bursa-pastoris, its population structure and demographic history, following allopolyploidization in Eurasia. To that end, we genotyped 261 C. bursa-pastoris accessions spread across Europe, the Middle East and Asia, using genotyping-by-sequencing, leading to a total of 4274 SNPs after quality control. Bayesian clustering analyses revealed three distinct genetic clusters in Eurasia: one cluster grouping samples from Western Europe and Southeastern Siberia, the second one centred on Eastern Asia and the third one in the Middle East. Approximate Bayesian computation (ABC) supported the hypothesis that C. bursa-pastoris underwent a typical colonization history involving low gene flow among colonizing populations, likely starting from the Middle East towards Europe and followed by successive human-mediated expansions into Eastern Asia. Altogether, these findings bring new insights into the recent multistage colonization history of the allotetraploid C. bursa-pastoris and highlight ABC and genotyping-by-sequencing data as promising but still challenging tools to infer demographic histories of selfing allopolyploids. Keywords: Brassicaceae, time estimation, invasive species, dispersal, bottleneck, disomic inheritance, unphased SNP Received 21 April 2015; revision received 18 November 2015; accepted 19 November 2015

Introduction Understanding how evolutionary processes have shaped the distribution and genetic structure of species across geographically variable environments is a key goal in evolutionary biology because it can provide Correspondence: Amandine Cornille, Fax: +4618 471 6484; E-mail: [email protected]

insights into the processes driving species diversification, distribution and persistence. This is not only an academic exercise; it is also crucial for predicting how organisms will respond to global crises such as climate change, modifications of ecosystems, and the spread of invasive species. The first step is the characterization of the demographic history of the species (divergence time and migration rates among populations with their respective effective population sizes) and the second © 2015 John Wiley & Sons Ltd

D E M O G R A P H I C H I S T O R Y O F T H E S H E P H E R D ’ S P U R S E 617 step is the identification of local adaptation, i.e. adaptation in response to selection that varies across geographically distinct populations (Tiffin & Ross-Ibarra 2014). Polyploids represent good study systems to tackle such questions. Polyploid species commonly occupy larger geographic ranges and usually wider ecological niches than their diploid progenitors (Parisod 2012), suggesting high colonization capacity and local adaptation across wide ranges of biotic and abiotic conditions. Such success has been attributed to the hybrid origin of most polyploids combining diploid nuclear genomes from two or more different ancestral species (i.e. allopolyploids) (Soltis et al. 2014). However, methodological issues have so far limited the understanding of within-species demographic dynamics and local adaptation in allopolyploids. Indeed, it remains challenging to (i) obtain reliable molecular markers and (ii) develop powerful and fast computational coalescent tools to infer demography and divergence histories in polyploids. The recent development of individual next-generation sequencing (NGS) barcoding methods, e.g. genotyping-by-sequencing (Elshire et al. 2011), GBS hereafter, or RAD-seq (Davey et al. 2011), has facilitated population-level and genome-level samplings of nonmodel species (e.g. Hohenlohe et al. 2011; Peterson et al. 2012; Senn et al. 2013; White et al. 2013; Xu et al. 2014), such as polyploids. In parallel, the development of approximate Bayesian computation methods (referred to as ABC hereafter) has provided a robust framework to infer past species’ histories (Beaumont 2010; Bertorelle et al. 2010; Csillery et al. 2012; Roux & Pannell 2015). These coalescent-based ABC approaches allow the formal comparison of alternative demographic models and the estimation of their parameters (e.g. divergence time, migration rates, effective population size). Altogether, the combination of large genomic data sets with powerful coalescent-based ABC approaches offers the possibility to unravel the past demography and divergence history of polyploids, a still largely unexplored topic (Roux & Pannell 2015). However, there are still challenges associated with the complex biological features of allopolyploids (Dufresne et al. 2014). Allopolyploids are typically characterized by two or more distinct subgenomes and are thus characterized by multiple alleles and loci. Homeologous recombination between subgenomes may happen (e.g. Guo et al. 2014; mixed disomic inheritance) or not (e.g. Douglas et al. 2015; fully disomic inheritance, i.e. chromosomes exclusively pair with their homeolog). Allopolyploids can therefore display mixed inheritance patterns (i.e. fully or mixed disomic inheritances, Ramsey & Schemske 2002) that can be associated with specific mating systems (e.g. polyploidy can be associated © 2015 John Wiley & Sons Ltd

with selfing, Neuffer & Hurka 1999). This complex background leads to conceptual questions such as how many alleles and gene copies are expected, depending on the combination of ploidy level and mating system. This can lead to challenges associated with quantifying and interpreting polymorphism. For instance, under complete disomic inheritance, many apparently polymorphic sites may represent fixed differences between homeologous genes, which will lead to large numbers of sites that are heterozygous in all samples (hereafter ‘fixed heterozygotes’). Practically, the complex nature of polyploids is also challenging when it comes to the use of ABC tools that were originally developed to analyse haploids or phased diploid data. For instance, how should one simulate allopolyploid genomes and their often-complex mode of inheritance using coalescentbased approaches initially developed for diploids? Especially in allopolyploids with low sequence divergence between homeologous genomes, when it is often challenging to determine the subgenome from which haplotypes originated (Ilut et al. 2012; Bombarely et al. 2014; Roux & Pannell 2015). Therefore, the use of unphased genetic data is often the only option when studying history and demography of polyploids. A way to bypass these challenges, at least partly, is to focus on allopolyploid model systems with simple inheritance patterns (e.g. fully disomic) and with selfing mating systems. For instance, a fully disomic allotetraploid can be viewed as two diploid subgenomes independently evolving, and, in selfers, few true heterozygotes are expected, such that distinguishing paralogy from heterozygosity becomes much more straightforward than in outcrossers. The shepherd’s purse, Capsella bursa-pastoris, is a fully disomic tetraploid species (2n = 32) with a worldwide distribution, growing under a large range of environmental conditions (Hurka & Neuffer 1997) and is considered to be a major weed (Aksoy et al. 1998). This species displays a remarkable phenotypic variability for several morphological and life history traits (Shull 1929) which are likely of great importance for colonization of new habitats (e.g. flowering time) (Neuffer & Hurka 1986; Neuffer 2011). Some of its life history traits, such as rapid flowering, selfing mating system and extensive seed production, are typical of colonizing species growing in disturbed habitats. Genome data have recently confirmed that C. bursa-pastoris is an allopolyploid that originated through hybridization between the ancestors of the diploid outcrosser Capsella grandiflora, found across a narrow geographic range in the Balkans, and of the diploid selfer Capsella orientalis, widespread across Central Asia (Douglas et al. 2015). This therefore suggests a Eurasian origin for C. bursa-pastoris. However, while Douglas et al. (2015) have documented the

D E M O G R A P H I C H I S T O R Y O F T H E S H E P H E R D ’ S P U R S E 625 We used 1000 pseudo-observed data sets to check the accuracy of the marginal posterior distributions estimated with our approach. As seen from the cross-validation step, the prediction errors were high and independent of the tolerance rate (Table S5, Supporting information). These estimations should therefore be regarded with caution. However, the observed values marginally fell within the simulated data (P = 0.06), suggesting that the assumed model was capable of reproducing the observed summary statistics. Overall, ABC analyses provided support for a colonization of Eurasia by C. bursa-pastoris initiated from the Middle East first towards Europe and then, more recently, to Asia, with the occurrence of moderate gene flow among populations.

Discussion Analyses of present-day variation, population structure and past demographic history of the shepherd’s purse provide general insights into the rapid recent colonization of a polyploid selfing weed across a large geographic area. We inferred that the colonization by C. bursa-pastoris occurred with moderate gene flow among populations and likely started from the Middle Eastern population followed by two consecutive colonization events, first towards Europe and then more recently into Northeast Asia. This complex past demographic history involving a multistep colonization process was likely shaped by human-mediated dispersal as discussed in sections hereafter. We also illustrated that our ABC approach applied to GBS comes with the price of assumptions that may not be met in all polyploids (i.e. fully disomic inheritance with highly homozygous individuals). Altogether, our results demonstrate that combining SNP data from GBS and approximate Bayesian computation methods is a promising, albeit challenging avenue to infer the past demographic histories of polyploid selfing species.

A recent quick West-Eastern colonization of Eurasia, initiated in the Middle East? ABC analyses indicate that C. bursa-pastoris underwent the typical colonization dynamics of species colonizing new range areas and experiencing a series of extinction/colonization events (Excoffier et al. 2009). We indeed detected demographic bottlenecks, followed by population size expansions, a pattern which is common among colonizing species (e.g. Husband & Barrett 1991; Francßois et al. 2008). Our analyses, combined with previous investigations on the time of origin of the shepherd’s purse, also confirmed that C. bursa-pastoris, now found across the world, successfully colonized it rather © 2015 John Wiley & Sons Ltd

recently (i.e. middle to late Pleistocene). Based on our estimation, the colonization of C. bursa-pastoris started ca. 167 250 ya in the Middle East. Previous studies also estimated a recent allopolyploid origin, with estimates around 67 770 ya (Slotte et al. 2008), 80 000 ya (Roux & Pannell 2015) and 128 000 ya (Douglas et al. 2015). Strictly speaking, all these estimates are difficult to compare. They were obtained from different models (different historical split and demographic events, priors, numbers of individuals, of SNPs and mutation rates) and they were all accompanied by relatively large confidence intervals. They also involved either only C. bursapastoris (here, with a large sampling), or C. bursa-pastoris and its diploid ancestors (C. grandiflora and/or C. orientalis) (Slotte et al. 2008; Roux & Pannell 2015) or its diploid relative C. rubella (Slotte et al. 2008; Roux & Pannell 2015). Concerning the origin of colonization of the shepherd’s purse, there is still an ongoing debate. The global pattern of genetic diversity across Eurasia with a higher level of genetic variation in the Middle East supports an origin of C. bursa-pastoris from the Middle East. This scenario is compatible with ABC analyses that lend support for two consecutive colonization events from the Middle Eastern population, first into Europe and then later on towards Asia, and is in line with previous investigations of the past demographic history of C. bursa-pastoris (Ceplitis et al. 2005; Slotte et al. 2008) and the current distribution of C. grandiflora. However, the relatively low Bayes Factors obtained using ABC analyses revealed a limit in power to accurately disentangle the origin of C. bursa-pastoris, and a larger number of loci and a more extensive sampling would be needed to pinpoint with higher certainly the origin of the species. The sampling gap between the Middle East and Asia may be part of the explanation for the limited power to disentangle origins. Biases in sampling design have recently been shown to have a large effect on population genetic analyses (Meirmans 2015). Here, we missed a large sampling area of C. bursa-pastoris populations in Central Asia, which is likely a key region to disentangle the origin of C. bursa-pastoris since C. orientalis, which is known to be one of the progenitor of C. bursa-pastoris, inhabits this region (Douglas et al. 2015). According to previous study (Hurka et al. 2012; Han et al. 2015), Eurasian steppe belt formed at the end of the Pliocene might have been a wide overlapping region between C. grandiflora and C. orientalis. A more comprehensive sampling in Central Asia would therefore help to improve the accuracy of our inferences concerning the spatial origin of C. bursa-pastoris, though it might still be challenging, if possible at all, to delineate the region of origin of the tetraploid. Another open question that

D E M O G R A P H I C H I S T O R Y O F T H E S H E P H E R D ’ S P U R S E 619 could severely bias inferences about polymorphism, we filtered out not only sites showing fixed heterozygosity in the GBS data set, but also possible fixed heterozygotes from the whole genome data (Douglas et al. 2015). Based on our observations on the distribution of heterozygotes across the GBS data set (see details of the method in Appendix S1), we assumed that SNPs with more than 75% of heterozygous individuals in the high quality sequenced draft genomes (Douglas et al. 2015) correspond to homeologues fixed for different alleles (see details of the method in Appendix S1). We then filtered out these fixed heterozygote SNPs in the GBS data set. While it is possible that some of this filtering had led to elimination of polymorphic sites, we expected this filtering to be a good trade-off between eliminating sites that are not actually segregating, minimizing the potential for artificial inflation of polymorphism, and keeping true segregating sites that might still be present in C. bursa-pastoris (see Appendix S1 for the explanation of the choice of this threshold). We further tested the reliability of the 75% cut-off filtering by checking the robustness of population genetic structure inferences and genetic diversity estimates using different thresholds (Appendix S1). Finally, among the 24 quality control duplicates, we only kept one duplicate that had the lowest number of missing data in the final data set.

Spatial genetic diversity and population structure We used a model-based clustering method implemented in ADMIXTURE v.1.23 (Alexander et al. 2009) to determine population structure and admixture in C. bursa-pastoris. ADMIXTURE is based on the same statistical model as the one implemented in STRUCTURE (Pritchard et al. 2000), but it uses a fast numerical optimization algorithm for large SNP data sets to infer the proportion of ancestry of genotypes in K distinct predefined clusters. We used the cross-validation procedure implemented in ADMIXTURE to choose the number of ancestral populations. We tested our data with the number of ancestral populations, K, varying from 1 to 12 and using default settings. Once the most likely K value was chosen, we assigned each individual to its respective population (i.e. clusters inferred by ADMIXTURE, including hybrids with a membership coefficient of up to 0.50 for the focal cluster). We also statistically checked if ADMIXTURE membership coefficient could be an artefact of missing data. We ran a linear model with R v3.1.3 testing the effect of individual cluster assignment inferred using ADMIXTURE (i.e. ME or EUR or ASI) on missing data. After converting the SNP data set into haploid calls, Watterson’s theta estimator (ΘW) (Watterson 1975), Tajima’s D (D) (Tajima 1989), mean pairwise nucleotide © 2015 John Wiley & Sons Ltd

differences (p) (Nei 1978), Weir and Cockerham FST were calculated with arlsumstat (Excoffier & Lischer 2010) for each sampling site (i.e. geographical location) and each population (i.e. clusters inferred by ADMIXTURE, including hybrids with a membership coefficient of up to 0.50 for the focal cluster, see analyses hereafter). Summary statistics were only computed on nonmissing 64-bp reads (i.e. base pair cut-off used to trim reads from the GBS library). For both population and site analyses, the percentage of missing SNPs per individual and the number of SNPs used to compute the summary statistics were reported. We further explored the geographic distribution of the genome-wide diversity across Eurasia using three different spatial analyses of the genomic data. In the first analysis, we mapped spatial patterns of mean pairwise nucleotide differences, p (Nei 1978), for the 62 sites (we excluded the three sites from US for a better visualization) with the KRIGING and SURFACE functions (FIELD R package), and we tested for a Pearson correlation between p and latitude or longitude. Second, we used the SPA (spatial ancestry analysis) software (Yang et al. 2012) which implemented an unsupervised learning algorithm allowing the modelling of the mapping of each individual on a 3-dimensional space on the basis of its genetic information alone (i.e. on the basis of allele frequencies at the 4274 SNPs). Finally, we tested for isolation by distance (IBD) patterns by correlating FST⁄(1 FST) estimates and geographical distances (Rousset 1997). Mantel tests with 9999 random permutations were performed between pairwise FST and the matrix of geographic distances (mantel.rtest function from ADE4 R package).

Demographic model comparison and parameter estimations using approximate Bayesian computation We identified with ADMIXTURE substantial genetic differentiation in the shepherd’s purse between the European/Russian (referred as EUR hereafter), the Middle Eastern (ME) and the Asian (ASI) regions. Using ABC, we aimed at detecting which of these three populations was at the origin of the colonization of Eurasia and the extent of between-population gene flow during this colonization. We therefore investigated whether the observed pattern of genetic diversity resulted from the successive geographic expansion of three independent populations with an origin from a single ancestral source population at TOR (i.e. time of the ancestral bottleneck corresponding to the polyploidization event, followed by colonization across Eurasia) from the population in Asia (ASI, scenarios 1 and 4, Fig. 1), in the Middle East (ME, scenarios 2 and 5, Fig. 1) or in Europe (EUR, scenarios 3

620 A . C O R N I L L E E T A L .

Origin from ASI

Moderate migration rates (m [0 ; 0.05]) Time P = 0.12 Sc1 BF1 = 0.31

Sc2

TOR ASI

TANC ASI-ME

TANC ASI-ME

TREC EUR-ME

TREC EUR-ME

ASI-A ASI-B EUR-A EUR-B ME-A ME-B NASI NEUR NME

Origin from ME

Time

Origin from EUR

EUR-A EUR-B ASI-A ASI-B NEUR NASI

Time TOR ME

TANC ME-EUR

TANC ME-EUR

TREC ASI-EUR

TREC ASI-EUR

ME-A ME-B ASI-A ASI-B EUR-A EUR-B NEUR NASI NME

Time

Sc3

Sc5

TOR ME

ME-A ME-B ASI-A ASI-B EUR-A EUR-B NEUR NASI NME

P = 0.15 BF3 = 0.18

Time

Sc4

TOR ASI

ASI-A ASI-B EUR-A EUR-B ME-A ME-B NASI NEUR NME

P = 0.73 BF2 = 2.76

High migration rates (m [0 ; 0.5])

Time

Sc6

TOR EUR

TOR EUR

TANC EUR-ME

TANC EUR-ME

TREC ASI-ME

TREC ASI-ME

ME-A ME-B NME

EUR-A EUR-B ASI-A ASI-B NEUR NASI

ME-A ME-B NME

Fig. 1 Demographic scenarios compared using approximate Bayesian computation to reconstruct the colonization history of the shepherd’s purse across Eurasia. The six distinct scenarios assumed distinct origins of colonization of Capsella bursa-pastoris in Eurasia from the Asian population (ASI, scenarios 1 and 4), from the Middle Eastern population (ME, scenarios 2 and 5) or from the European population (EUR, scenarios 3 and 6). The two left populations diverged successively and underwent a bottleneck followed by exponential growth Gx. The scenarios assumed moderate (left side of the figure, scenarios 1, 2 and 3) and high (right side of the figure, scenarios 4, 5 and 6) gene flow between each pair of populations (i.e. ASIA/B-EURA/B, ASIA/B-MEA/B, EURA/B-MEA/B), except between A and B genomes from the same population (i.e. EURA-EURB, ASIA-ASIB and MEA-MEB). NXA = NXB = NX: effective population size of population x, mx-y: gene flow per generation from population x to population y and reciproqually (bidirectional arrows in the figure), TOR: time at the origin of the colonization, TANCx-y: first (going forward in time) divergence time between the population x and y, TRECy-z: second, more recent, divergence time between populations y and z, Sc: scenario; x,y,z: referred to populations detected using ADMIXTURE (i.e. EUR or ASI or ME depending on the scenarios presented in this figure). Posterior probabilities (P) and Bayes factor (BF) for the three historical demographic scenarios of colonization of C. bursa-pastoris in Eurasia compared by approximate Bayesian computation obtained in the second set of ABC analyses (i.e. sc1 vs. sc2 vs. sc3, moderate gene flow among populations); BF1: Bayes factor for scenario 1 vs. 2 and 3; BF2: Bayes factor for scenario 2 vs. 1 and 3; and BF3: Bayes factor for scenario 3 vs. 2 and 1.

and 6, Fig. 1), respectively. For each scenario, once the species arose from an ancestral single source population, it spread to the two other populations. This colonization process was modelled by two successive colonization events of the two other populations at, respectively, TANC - the most ancient divergence time between two populations - and TREC - the most recent divergence time between two populations (Fig. 1). At each event (TOR, TANC and TREC), the newly formed populations underwent an instantaneous change in effective popula-

tion size (referred to as RESIZE_X parameter hereafter: NT/NT+1, the relative size of the population X one generation before the bottleneck relative to the size of the population X one generation after), followed by exponential growth with Gx parameter such that G = ln(NT/ N0)/T. The assumptions on the order of splits proposed in these three scenarios were built in accordance to spatial genetic diversities and differentiations observed across Eurasia and previous hypotheses on the origin of C. bursa-pastoris (Ceplitis et al. 2005; Slotte et al. 2008; © 2015 John Wiley & Sons Ltd

D E M O G R A P H I C H I S T O R Y O F T H E S H E P H E R D ’ S P U R S E 621 Hurka et al. 2012; Douglas et al. 2015). In particular, for each model we assumed that the EUR and ME populations diverged from each other because they showed the weakest genetic differentiation and highest admixture (Fig. 2) suggesting a common recent history. For each of these three scenarios we then assumed high or moderate symmetric bidirectional gene flow between populations (i.e. priors on migration rates m between populations [0–0.5] and [0–0.05] respectively, Fig. 1). Thus, in total we simulated six scenarios (Fig. 1). Capsella bursa-pastoris is an allopolyploid with disomic inheritance. Although we are using unphased SNP data, shared polymorphism between homeologues may still exist in our data set and has to be considered. We therefore treated its two subgenomes as two discrete populations, A and B, each sharing the same current effective population size N, the same divergence time T and the same exponential growth rate G (Fig. 1). In total, we simulated six independent populations (EURA, EURB, MEA, MEB, ASIA, ASIB) for each scenario. At T time (TOR, TANC and TREC depending on the tested scenarios, see Fig. 1), the two subgenomes A and B, modelled as two discrete populations (EURA/EURB or MEA/MEB or ASIA/ASIB), coalesced backward in time. For each of the (a)

six scenarios, gene flow was allowed between populations of different origins (i.e. ASIA/B-EURA/B, ASIA/BMEA/B, EURA/B-MEA/B, see details in Fig. 1) but not between populations A and B of the same genome (i.e. EURA-EURB, ASIA-ASIB and MEA-MEA) because of the disomic inheritance of C. bursa-pastoris. In total, we estimated 15 demographic parameters, the effective size of each population (NEUR, NME, NASI), the rates of symmetric migration between population pairs for each generation (mx y: migration rate from populations x to y per generation), the relative size of the population one generation before the bottleneck (i.e. RESIZE_EUR, RESIZE_ME, RESIZE_ASI, the effective size of the population X one generation earlier in time relative to the effective size of the population X one generation later in time such as NT = RESIZE * NT+1) and the divergence times (TOR, TANC and TREC, Fig. 1). The unit of divergence time in FASTSIMCOAL2 is expressed in generations. C. bursa-pastoris is an annual species, we therefore assumed a generation time of 1 year. The boundaries of the uniform (or log-uniform) prior distributions are presented in Table S2 (Supporting information). The “populations” used in these analyses are the main clusters identified with ADMIXTURE (Table 1).

(d)

60

π 0.010

50

0.008 0.007

40

Latitude

0.009

FST = 0.19

0.006

30

0.005 0.004

0

20

40

60

80

100

120

140

Longitude

(b)

1.0

FST = 0.10

FST = 0.16

0.5

EUR ME ASI

0.0

(c)

1.0

−0.5

0.0

0.5

1.0

TWN

CHN

RUS

GRC ITA DZ ISR SYR TUR JOR

−1.0 −1.0

−1.0

−0.5

FR SE ESP CZE BIH

0.0

USA GB

−0.5

0.5

Fig. 2 Geographic distribution of genetic diversity, genetic structure and admixture of Capsella bursa-pastoris in Eurasia and USA. (a) Map of interpolated mean pairwise nucleotide diversity (p), adjusted for number of missing SNPs per site, across Eurasia (for clarity, the three sites from USA were excluded from the map), with each site represented by a white circle on the map; (b) Three-dimensional space mapping of the 261 C. bursa-pastoris accessions inferred with SPA with colors referring to populations inferred with ADMIXTURE and with each axe representing one of the three dimensions predicted by SPA; (c) Population structure for the C. bursapastoris data set (N = 261, 65 sites), inferred with ADMIXTURE, for K = 3, showing three distinct genetic clusters. Each vertical line represents an individual. Individuals were grouped by country. Colours represent the inferred ancestry from K ancestral populations. For clarity, the 65 sites are grouped by country with USA: United States of America, GB: Great Britain, FR: France, SE: Sweden, ESP: Spain, CZE: Czech Republic, BIH: Bosnia Herzegovina, RUS: Russia, GRC: Greece, ITA: Italy, DZ: Algeria, JOR: Jordania, ISR: Israel, SYR: Syria, TUR: Turkey, CHN: China, TWN: Taiwan; (d) Map representing the mean membership proportions inferred by ADMIXTURE for the three detected clusters, for samples of C. bursa-pastoris collected from the same site; FST: pairwise genetic differentiation (Weir & Cockerham 1984), all FST were highly significant (P < 0.001). © 2015 John Wiley & Sons Ltd

622 A . C O R N I L L E E T A L . Table 1 Genetic polymorphism within each detected with ADMIXTURE in Capsella bursa-pastoris

Population

n

Missing data (%)

S

Θ(W)

p

EUR (red) ME (blue) ASI (green)

76 42 143

27 25 33

3122 3189 2848

0.0027 0.0031 0.0025

0.0026 0.0030 0.0020

population

D 0.24 NS 0.27 NS 0.71 NS

n: number of individuals; missing data: mean percentage of missing SNPs per individual in the population; S: number of polymorphic sites; Θ(W) :Watterson’s theta; p, mean standardized pairwise differences; D: Tajima’s D; NS: non-significant; p was computed only on non missing 64-bp reads.

We used ABCTOOLBOX (Wegmann et al. 2010) and fastsimcoal2 (Excoffier et al. 2013) to simulate each of these six scenarios. For each scenario, we simulated a total of 4274 unlinked SNPs (assuming a minimum allele frequency of the derived allele of 0.02 to avoid ascertainment bias in obtaining the site frequency spectrum, Excoffier et al. 2013) and generated 106 genetic data sets from coalescent simulations, using model parameters drawn from prior distributions (Table S2, Supporting information). For each simulation, we computed S, the number of polymorphic sites, Tajima’s D (D) (Tajima 1989), the mean number of pairwise differences for each population (p) and FST (Weir & Cockerham 1984) between pairs of populations (EUR, ASI, ME). We also computed summary statistics from the joint frequency spectrum generated by FASTSIMCOAL2 and assuming three population pairs x y (i.e. EUR-ME, EUR-ASI and MEASI): the number of exclusive polymorphisms in population x (Sxx) and the number of shared polymorphisms between populations x and y (Ssx y) (Wakeley & Hey 1997). All these statistics were computed by merging the two copies of subgenomes A and B of the same population (i.e. ASIA/B, EURA/B or MEA/B) to refer to our unphased SNP data set for which we did not have the possibility to assign the alleles to the two distinct C. bursa-pastoris subgenomes (Roux & Pannell 2015). A total of 18 summary statistics were computed for each simulation. The model choice procedure was hierarchical, conducted using the ABC R package (Csillery et al. 2012), and by excluding statistics that were highly correlated (i.e. p was highly correlated with S, and was therefore deleted from the analyses). First, we compared pairwise models assuming moderate vs. high gene flow (i.e. scenarios 1 vs. 4; 2 vs. 5 and 3 vs. 6, Fig. 1). Based on this first set of ABC analyses, we then compared the best three chosen models. The different models were compared by calculating their Bayes factors and estimating their relative posterior probabilities using the Neural network methodology implemented in the ABC R pack-

age, based on the 0.01% of simulated data sets most closely matching the observed data. To estimate the performance of our method for discriminating between the six models, we computed type I and type II error misclassifications. The error misclassification represented, using 1000 pseudo-observed data sets, the percentage of times that a given scenario did not have the highest posterior probability of the competing scenarios when it was actually the true scenario (type I error), and the percentage of times that a given scenario had the highest posterior probability when it was not the true scenario (type II error). Once the best model had been chosen, we estimated demographic parameters under this scenario, using a rejection method for the 10 000 retained simulations (Blum & Francßois 2010; Csillery et al. 2012). The same set of 15 summary statistics (i.e. 18 summary statistics but excluding p) was also calculated on the observed data and then was used to calculate the Euclidean distance to each simulation. We report the mean and 95% highest posterior density (HPD) interval for each model parameter estimated. Our model parameter estimation procedure was validated in two ways. First, the goodness-of-fit between the selected scenario and the observed data was quantitatively measured by the posterior predictive P-value, corresponding to the probability that data replicated using the estimated posterior distributions of the parameters are not more extreme than the observed data, and was computed using the GFIT function from the ABC R package (Csillery et al. 2012). Second, the ability of our ABC framework to estimate the model parameters was assessed by a cross-validation step implemented in the ABC R package (CV4ABC function). All the simulations were run on Linux clusters at Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX).

Results SNP variability and missing data In total, 1 724 406 unique 64-bp reads were identified across the 261 C. bursa-pastoris accessions, with 69% of these reads aligned to the reference diploid genome of C. rubella (Slotte et al. 2013). A total of 19 139 putative unphased SNPs were identified. After filtering for sequencing errors, we kept 4899 SNPs (i.e. representing 26% of the putative unphased SNP data set). Finally, after filtering for fixed heterozygotes (see methodology in Appendix S1), the final data set consisted in 261 individuals typed for 4274 unphased SNPs (i.e. representing 87% of the data set filtered for sequencing error and 22% of the putative unphased SNP data set), with an © 2015 John Wiley & Sons Ltd

D E M O G R A P H I C H I S T O R Y O F T H E S H E P H E R D ’ S P U R S E 623 average read depth (DP) per SNP of 132 and a rate of missing data per individual of 0.008. After checking, the ascertainment bias and possible residual fixed heterozygosity in this data set were considered very low (see Appendix S1). The final mean error rate across SNPs and individuals was 0.13 after filtering. Using the 24 duplicates included in the three flow cells, we showed a good repeatability of the GBS procedure for this final data set (see Appendix S1).

Spatial genetic diversity and population structure Table S1 gives genetic diversity estimates for each sampling site and population. Note that summary statistics were computed by excluding missing data to limit the ascertainment bias associated to missing data (see Materials and methods). After checking for biases implied by our filtering methodology (Appendix S1) and applying this standardization, the bias was then expected to be almost the same for all localities and populations. The map of interpolated pairwise nucleotide diversity, p, by site (Fig. 2a) showed that genetic diversity was highest in Italy, the Middle East and Central Asia. However, the correlations between longitude or latitude and p were nonsignificant (r = 0.008, P = 0.95 and r = 0.03, P = 0.83, respectively) (Fig. 2a). The results of ADMIXTURE from K = 2 to K = 5 analyses are shown in Figs 2 and S2 (Supporting information). For K = 2, the analyses revealed a clear west/east partitioning across Eurasia with sampling sites from Europe/Middle East/Russia forming a first group and the Asian sampling sites a second one (Fig. S2, Supporting information). The simulations for K = 3 still detected the Asian cluster (green), but divided the Western cluster into two distinct clusters: one extending from Western Europe deep into Southeastern Siberia (red) and another corresponding to Middle East sampling sites (blue) (Fig. 2c and d). For K = 4, the European/Russian cluster was further divided in two new clusters. However, most individuals showed some level of admixture suggesting limits to infer ancestry for this K value (Fig. S2, Supporting information). At K = 5 the genetic structure obtained was similar to the pattern observed at K = 3 (European/Russian, Middle East and Asian clusters), with a fourth cluster identified in the Balkans (Fig. S2, Supporting information). When we increased K up to 12 distinct clusters, further substructure was revealed in the two main regions of the study sampling: Europe/Middle East and Asia. At K = 12, the Europe/ Middle East sub-region showed a cluster spread in a Southwestern crescent in Europe (Spain, the Middle East and Great Britain), a cluster in Italy and in the Middle East, a cluster in Central Europe, and another cluster grouping the sites from Northern Europe with © 2015 John Wiley & Sons Ltd

sites from Russia (Fig. S3, Supporting information). At K = 12, the Asian subregion showed a main division between two clusters in Eastern and Western China, respectively, showing admixture (Fig. S3, Supporting information). Cross-validation (CV) errors implemented in ADMIXTURE decreased monotonically from K = 2 to K = 12. Therefore, increasing the number of clusters continually improved the fit of the model to the data. However, CV errors decreased more slowly after K = 3 (Fig. S4, Supporting information). We therefore concluded that K = 3 was the most relevant number of clusters in the shepherd’s purse in Eurasia when considering to reconstruct the range-wide population history of the species. We noticed that at K = 3 the US sites were admixed between these three main clusters (Figs 2c and d). Interestingly, the Eastern samples of Russia belonged to the European cluster. This Europe/Russian clustering still persisted whatever the K value suggesting that the Russian samples were really genetically close to the European samples, even though they are geographically closer to the Asian populations. We therefore included them with the European populations (i.e. “EUR” population, see hereafter) for population genetic diversity estimates and ABC analyses. We used the ADMIXTURE membership coefficients inferred at K = 3 to define the three populations used in subsequent analyses. Genotypes were assigned to a given population if their membership coefficient for that population strictly exceeded 0.50. The three populations are hereafter referred to as the “European” (EUR, red, N = 76 individuals), the “Middle Eastern” (ME, blue, N = 42) and the “Asian” (ASI, green, N = 143) populations. Summary statistics for each population identified by ADMIXTURE are shown in Table 1. Pairwise FST values and a map of the mean membership ancestry coefficients per sampling site to each of the three populations are given in Fig. 2d. Admixture was detected in the three populations (Fig. 2). First of all, we showed that the ADMIXTURE membership coefficient were not an artefact of missing data (linear model, all P > 0.06). The ME, EUR and ASI populations presented admixed genotypes, with 83% (N = 35), 60% (N = 46) and 28% (N = 41) of genotypes, respectively, having membership coefficients