Appendix S1: Supporting information: materials and methods

M. Parejo, D. Wragg, D. Henriques, A. Vignal, M. Neuditschko

DNA extraction, sequencing and SNP calling

The whole-genome sequence data of the 56 A. m. mellifera drones used in this study originate from our previously published work (Parejo et al. 2016), which describes the materials and methods in more detail. Here, we briefly describe the major steps.

DNA was extracted from one honey bee larva per colony with phenol-chloroform-isoamyl alcohol (25:24:1) (Ausubel 1988) and subsequently purified with the EZ1® DNA Tissue Kit (QIAGEN Redwood City, www.qiagen.com). Paired-end libraries (2 x 100 bp) were prepared following the manufacturer’s protocol (TruSeq Nano Kit v3) and sequenced on an Illumina HiSeq2500 with 24 samples per lane and a sequencing depth of 10X per individual.

Mapping and variant calling followed best practices. Reads were mapped to the reference genome (Amel4.5) using bwa mem 0.7.10 (Li and Durbin, 2009), and PCR duplicates were marked using PICARD 1.80 (http://picard.sourceforge.net). To improve mapping quality, reads were realigned around indels with GATK 3.3.0 (McKenna et al. 2010).

SNP calling was performed in two steps as described in Wragg et al. (2016). First, variants were identified for each sample separately using three different variant calling tools: UnifiedGenotyper (UG) (Van der Auwera et al. 2013), mpileup (Li et al. 2009) and PLATYPUS 0.8.1 (Rimmer et al. 2014). Variants were retained only if they had a base quality (BQ) score ≥ 20 and a mapping quality (MQ) ≥ 30. Variants from UG were additionally required to have a genotype quality (GQ) ≥ 30, a quality by depth (QD) ≥ 2, and a Fisher strand (FS) value ≤ 60. The three call sets were then combined using a Bayesian method (BAYSIC: Cantarel et al., 2014).
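As an illustration only, the per-sample quality thresholds listed above can be written as a single predicate. This is a minimal sketch: the `Call` record and its field names are hypothetical, and in the study the filtering was done with the variant callers and associated tools themselves. The BAYSIC combination step is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Hypothetical record of the quality metrics attached to one variant call."""
    bq: float          # base quality (BQ)
    mq: float          # mapping quality (MQ)
    gq: float = 0.0    # genotype quality (GQ), only checked for UG calls
    qd: float = 0.0    # quality by depth (QD), only checked for UG calls
    fs: float = 0.0    # Fisher strand value (FS), only checked for UG calls


def passes_filters(call: Call, caller: str) -> bool:
    """Return True if the call meets the thresholds stated in the text."""
    # Thresholds applied to the calls of all three tools.
    ok = call.bq >= 20 and call.mq >= 30
    # Additional thresholds applied to UnifiedGenotyper (UG) calls only.
    if caller == "UG":
        ok = ok and call.gq >= 30 and call.qd >= 2 and call.fs <= 60
    return ok
```

For example, `passes_filters(Call(bq=25, mq=40), "mpileup")` is true, whereas the same call labelled `"UG"` fails because its default GQ and QD values are below the UG thresholds.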
In a second step, the individual variant call format (VCF) files were merged, keeping only SNP variants, and filtered on depth (DP) using BCFtools (Li 2011) to generate a set of master sites mapped to chromosomes 1 to 16 with 9 ≤ DP ≤ 3 × mean DP. All individuals were re-genotyped with BQ ≥ 20 at these master sites, producing a multi-sample VCF containing all samples.

For this study, 56 A. m. mellifera samples (core bees from Switzerland, N = 39, and core bees from Savoy, N = 17) were extracted from the entire dataset. We investigated selection signatures between these populations because, despite their geographic proximity and genetic relatedness, they formed two distinct communities in model- and network-based clustering.

As a final quality control step, SNPs with a minor allele frequency (MAF) < 0.01 or a genotyping call rate < 0.9 were removed using PLINK 1.9 (Chang et al. 2015; www.cog-genomics.org/plink2), leaving 2,924,632 SNPs for subsequent analysis. Finally, the SNPs were annotated using SnpEff 4.1g (build 2015-05-17) (Cingolani et al. 2012) with an interval size of 2 kb to report upstream or downstream effects of the variants.
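The site-level filters described above can be sketched as follows. This is an illustration under stated assumptions, not the BCFtools or PLINK implementation: the function names are hypothetical, DP is treated as a single depth value per site, and genotypes are assumed to be haploid calls coded 0 (reference), 1 (alternate) or None (missing).

```python
def keep_site(dp, mean_dp, min_dp=9, max_factor=3):
    """Depth window for master sites: min_dp <= DP <= max_factor * mean DP."""
    return min_dp <= dp <= max_factor * mean_dp


def maf_and_call_rate(genotypes):
    """Return (minor allele frequency, call rate) for one SNP.

    `genotypes` holds one haploid call per sample: 0, 1 or None (missing).
    """
    called = [g for g in genotypes if g is not None]
    call_rate = len(called) / len(genotypes)
    if not called:
        return 0.0, call_rate
    alt_freq = sum(called) / len(called)
    # The minor allele is whichever of the two alleles is rarer.
    return min(alt_freq, 1.0 - alt_freq), call_rate


def keep_snp(genotypes, min_maf=0.01, min_call_rate=0.9):
    """Keep a SNP unless MAF < min_maf or call rate < min_call_rate."""
    maf, call_rate = maf_and_call_rate(genotypes)
    return maf >= min_maf and call_rate >= min_call_rate
```

With 56 samples, a SNP carried by five drones with one missing call is kept, while a monomorphic site or a site missing in 16 samples is removed.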


Selection signatures

To identify selection signatures, three different test statistics were combined.

(1) Fixation indices (FST) (Weir and Cockerham 1984) between the two A. m. mellifera populations were computed in windows of 2 kb using VCFtools (Danecek et al. 2011). FST is one of the most commonly used statistics to infer selection (Vitti et al. 2013). It compares the variance of allele frequencies within and between populations (Holsinger and Weir 2009). When selection acts on a locus in one population but not in the other, the allele frequency at this locus can differ significantly between the two populations.

(2) Cross-population extended haplotype homozygosity (XP-EHH) (Sabeti et al. 2007) was computed with selscan (Szpiech and Hernandez 2014) with default settings and averaged over 2 kb windows. This test statistic is based on linkage disequilibrium (LD): a causative allele and its neighbouring linked variants are swept to fixation, resulting in extended haplotypes (Vitti et al. 2013). A significant advantage of our data is that they consist of haploid drones, so phasing of haplotypes was not required.

(3) Cross-population composite likelihood ratio (XP-CLR) was computed using XP-CLR 1.0 (Chen et al. 2010). XP-CLR scores were calculated with default settings at a set of grid points with a spacing of 2 kb, using the information from SNPs within a flanking window of 0.5 cM. This statistic identifies genomic regions where allele frequencies changed over many sites too quickly to be explained by genetic drift alone (Vitti et al. 2013).

Because the three statistics screen the genome for signatures of selection in different ways, they were combined into a composite selection score (CSS) based on a joint fractional rank (Randhawa et al. 2014). Combining statistics in this way is commonly used to capture a more robust selection signal (e.g. Frischknecht et al. 2016).
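The combination of the three statistics can be sketched as follows. The text does not spell out the computation, so this is a minimal illustration that assumes one common formulation of the CSS: each statistic is ranked across windows, ranks are converted to fractional ranks r/(n + 1) and then to z-values, the z-values are averaged per window, and the CSS is the -log10 of the upper-tail p-value of that mean under N(0, 1/m) for m statistics. The function names are hypothetical, ties are broken by window order, and larger input values are assumed to indicate stronger evidence of selection.

```python
import math
from statistics import NormalDist


def fractional_ranks(values):
    """Rank values in ascending order and rescale ranks to r / (n + 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # ties keep window order
    ranks = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank / (n + 1)
    return ranks


def composite_selection_scores(*stats):
    """Combine per-window statistics (equal-length lists) into CSS values."""
    m = len(stats)
    n = len(stats[0])
    standard = NormalDist()
    # Convert each statistic to z-values via its fractional ranks.
    z_per_stat = [[standard.inv_cdf(r) for r in fractional_ranks(s)]
                  for s in stats]
    # The mean of m independent standard normal values has variance 1 / m.
    mean_dist = NormalDist(0.0, math.sqrt(1.0 / m))
    scores = []
    for i in range(n):
        mean_z = sum(z[i] for z in z_per_stat) / m
        # Upper-tail probability, computed via symmetry for numerical stability.
        p = mean_dist.cdf(-mean_z)
        scores.append(-math.log10(p))
    return scores


# Toy values for five 2 kb windows (invented for illustration only).
fst = [0.01, 0.02, 0.50, 0.03, 0.02]
xpehh = [0.1, 0.3, 2.5, 0.2, 0.0]
xpclr = [1.0, 2.0, 30.0, 0.5, 1.5]
css = composite_selection_scores(fst, xpehh, xpclr)
```

In this toy example the third window ranks highest for all three statistics and therefore receives the largest CSS. Working on ranks rather than raw values puts statistics with very different scales on a common footing before they are averaged.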
Following this approach, we identified 6 putative sweep regions with a total of 8 genes located within or near (±2 kb) these regions. Gene ontology terms for these genes were retrieved from Ensembl (Metazoa release 34, December 2016; Yates et al. 2015), and Drosophila orthologs from OrthoDB (v9.1, November 2016; Zdobnov et al. 2016).

References

1. Ausubel, F. (1988) Current protocols in molecular biology, edited by Frederick M. Ausubel et al. Updated quarterly. New York, United States: Greene Pub. Associates; Wiley-Interscience.
2. Chang, C. C., Chow, C. C., Tellier, L. C. A. M., Vattikuti, S., Purcell, S. M. and Lee, J. J. (2015) Second-generation PLINK: Rising to the challenge of larger and richer datasets. GigaScience, 4(1).
3. Chen, H., Patterson, N. and Reich, D. (2010) Population differentiation as a test for selective sweeps. Genome Research, 20(3), pp. 393-402.
4. Cingolani, P., Platts, A., Wang le, L., Coon, M., Nguyen, T., Wang, L., Land, S. J., Lu, X. and Ruden, D. M. (2012) A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin), 6(2), pp. 80-92.
5. Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., Handsaker, R. E., Lunter, G., Marth, G. T., Sherry, S. T., McVean, G. and Durbin, R. (2011) The variant call format and VCFtools. Bioinformatics, 27(15), pp. 2156-2158.
6. Frischknecht, M., Flury, C., Leeb, T., Rieder, S. and Neuditschko, M. (2016) Selection signatures in Shetland ponies. Animal Genetics, 47(3), pp. 370-372.
7. Holsinger, K. E. and Weir, B. S. (2009) Genetics in geographically structured populations: defining, estimating and interpreting FST. Nat Rev Genet, 10(9), pp. 639-650.
8. Li, H. (2011) A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), pp. 2987-2993.
9. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G. and Durbin, R. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), pp. 2078-2079.
10. McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M. and DePristo, M. A. (2010) The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), pp. 1297-1303.
11. Randhawa, I. A. S., Khatkar, M. S., Thomson, P. C. and Raadsma, H. W. (2014) Composite selection signals can localize the trait specific genomic regions in multi-breed populations of cattle and sheep. BMC Genetics, 15(1), p. 34.
12. Rimmer, A., Phan, H., Mathieson, I., Iqbal, Z., Twigg, S. R. F., Wilkie, A. O. M., McVean, G. and Lunter, G. (2014) Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nature Genetics, 46(8), pp. 912-918.
13. Sabeti, P. C., Varilly, P., Fry, B., Lohmueller, J., Hostetter, E., Cotsapas, C., Xie, X., Byrne, E. H., McCarroll, S. A., Gaudet, R., Schaffner, S. F. and Lander, E. S. (2007) Genome-wide detection and characterization of positive selection in human populations. Nature, 449(7164), pp. 913-918.
14. Szpiech, Z. A. and Hernandez, R. D. (2014) selscan: An Efficient Multithreaded Program to Perform EHH-Based Scans for Positive Selection. Molecular Biology and Evolution, 31(10), pp. 2824-2827.
15. Van der Auwera, G. A., Carneiro, M. O., Hartl, C., Poplin, R., del Angel, G., Levy-Moonshine, A., Jordan, T., Shakir, K., Roazen, D., Thibault, J., Banks, E., Garimella, K. V., Altshuler, D., Gabriel, S. and DePristo, M. A. (2013) From fastQ data to high-confidence variant calls: The genome analysis toolkit best practices pipeline. Current Protocols in Bioinformatics, (SUPL.43).
16. Vitti, J. J., Grossman, S. R. and Sabeti, P. C. (2013) Detecting natural selection in genomic data. Annual Review of Genetics. Annual Reviews Inc., pp. 97-120.
17. Weir, B. S. and Cockerham, C. C. (1984) Estimating F-statistics for the analysis of population structure. Evolution, 38(6), pp. 1358-1370.
18. Yates, A., Akanni, W., Amode, M. R., Barrell, D., Billis, K., Carvalho-Silva, D., Cummins, C. et al. (2015) Ensembl 2016. Nucleic Acids Research, 44(D1), pp. D710-D716.
19. Zdobnov, E. M., Tegenfeldt, F., Kuznetsov, D., Waterhouse, R. M., Simão, F. A., Ioannidis, P., Seppey et al. (2016) OrthoDB v9.1: cataloging evolutionary and functional annotations for animal, fungal, plant, archaeal, bacterial and viral orthologs. Nucleic Acids Research, p. gkw1119.
