
GBE
Discovering Functional DNA Elements Using Population Genomic Information: A Proof of Concept Using Human mtDNA
Daniel R. Schrider* and Andrew D. Kern
Department of Genetics, Rutgers, The State University of New Jersey
*Corresponding author: E-mail: [email protected].
Accepted: May 29, 2014

Abstract Identifying the complete set of functional elements within the human genome would be a windfall for multiple areas of biological research including medicine, molecular biology, and evolution. Complete knowledge of function would aid in the prioritization of loci when searching for the genetic bases of disease or adaptive phenotypes. Because mutations that disrupt function are disfavored by natural selection, purifying selection leaves a detectable signature within functional elements; accordingly, this signal has been exploited for over a decade through the use of genomic comparisons of distantly related species. However, the functional complement of the genome changes extensively across time and between lineages; therefore, evidence of the current action of purifying selection in humans is essential. Because the removal of deleterious mutations by natural selection also reduces within-species genetic diversity within functional loci, dense population genetic data have the potential to reveal genomic elements that are currently functional. Here, we assess the potential of this approach by examining an ultradeep sample of human mitochondrial genomes (n = 16,411). We show that the high density of polymorphism in this data set precisely delineates regions experiencing purifying selection. Furthermore, we show that the number of segregating alleles at a site is strongly correlated with its divergence across species after accounting for known mutational biases in human mitochondrial DNA (ρ = 0.51; P < 2.2 × 10⁻¹⁶). These two measures track one another at a remarkably fine scale across many loci, a correlation that is purely the result of natural selection. Our results demonstrate that genetic variation has the potential to reveal with surprising precision which regions in the genome are currently performing important functions and are likely to have deleterious fitness effects when mutated. As more complete human genomes are sequenced, similar power to reveal purifying selection may be achievable in the human nuclear genome.
Key words: population genetics, natural selection, mitochondria.

Introduction Only 1–2% of the human genome lies within protein-coding sequence (Lander et al. 2001). Determining the extent to which the remainder of the genome is functional is crucial to our understanding of human biology. A variety of recently developed experimental techniques have aided in the search of noncoding DNA for functional elements (Dunham et al. 2012); however, on their own these techniques can produce a huge number of false positives (Graur et al. 2013). Searches for the evolutionary signature of purifying selection have therefore proved a more fruitful strategy for identifying functional elements; indeed, phylogenetic searches comparing sequences of related species have revealed that approximately 5% of the human genome is constrained by natural selection (Chinwalla et al. 2002; Siepel et al. 2005; Lunter et al. 2006; Birney et al. 2007; Davydov et al. 2010), and similar strategies have been used to predict the phenotypic severity of mutations (Stone and Sidow 2005). Although whole-genome comparisons aimed at identifying the footprints of selection are highly effective, they have been used primarily to detect elements under constraint for hundreds of millions of years of evolutionary history (Siepel et al. 2005; Davydov et al. 2010). However, the set of functional elements in the genome experiences considerable turnover (Demuth et al. 2006). Comparative genomic techniques will fail in these instances, particularly for the supremely interesting cases of human-specific gain (Knowles and McLysaght 2009) and loss of function (Wang et al. 2006). Surveys of genetic diversity within species, on the other hand, have the potential to identify regions currently

© The Author(s) 2014. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

1542 Genome Biol. Evol. 6(7):1542–1548. doi:10.1093/gbe/evu116 Advance Access publication June 9, 2014


experiencing purifying selection and that are therefore functional, as purifying selection will remove genetic diversity from such loci. Unfortunately, genetic variation in the human genome is quite sparse, with a comparison of any two homologous chromosomes uncovering less than 1 single-nucleotide polymorphism (SNP) every kilobase (Lander et al. 2001). Sampling more individuals, however, yields additional polymorphisms, and an ultradeep sample of mitochondrial variation from 16,411 genomes is available in the MITOMAP database (Ruiz-Pesini et al. 2007). These data are extremely polymorphic, with more than one SNP on every other base pair on average. This data set thus serves as an ideal proving ground for the approach of identifying functional constraint using massive amounts of polymorphism data, which will soon be available for nuclear genomes. Here, we show that the density of polymorphism in these data closely tracks divergence at a fine scale, implying that these data can indeed be used to reveal the strength of purifying selection in the human mitochondrial genome at a very high resolution. Our results suggest an enormous potential for population genomic data to uncover functional DNA elements, including those not conserved across species.

Results and Discussion We set out to determine the extent to which polymorphism data reveal the strength of purifying selection across the human mitochondrial genome and downloaded the coordinates of all 8,944 SNPs from MITOMAP (http://www.mitomap.org/MITOMAP, last accessed August 4, 2013). We reasoned that if the density of polymorphism was governed by the amount of purifying selection acting on each site, then SNP density would be correlated with divergence across species, in accordance with expectations under the Neutral model (Kimura 1982). This is indeed what we observe in the form of a strong correlation between the number of alleles per site and its average negated phyloP score (Siepel et al. 2006) measuring divergence across vertebrates (Spearman’s ρ = 0.52; P < 2.2 × 10⁻¹⁶). This correlation is also highly significant when averaging polymorphism and divergence within 10-bp adjacent windows (ρ = 0.50; P < 2.2 × 10⁻¹⁶; fig. 1).

Although this observation is consistent with purifying selection both removing diversity and constraining divergence at functional elements, such a pattern could also be generated by variation in the spontaneous mutation rate. It has been shown that mutation rate in the mitochondria varies according to the duration for which a given site remains single stranded on the H strand (DssH) during DNA replication (Reyes et al. 1998). We also find evidence for this in the form of a significant correlation between divergence at each site and the duration the site is single stranded on the H strand during replication, although this correlation is far weaker than that shared between polymorphism and divergence (ρ = 0.11; P < 2.2 × 10⁻¹⁶). Moreover, after correcting for DssH, the correlation between polymorphism and divergence at individual sites is essentially unchanged and still highly significant (ρ = 0.49; P < 2.2 × 10⁻¹⁶). Rather than being driven by a subset of mitochondrial loci, this correlation is significant (at P < 0.05) in 36/37 genes and is significant in 35/37 genes after correcting for DssH (table 1). Similarly, polymorphism and divergence are more strongly correlated in protein-coding (ρ = 0.53; P < 2.2 × 10⁻¹⁶) and RNA-coding genes (ρ = 0.43; P < 2.2 × 10⁻¹⁶) than in noncoding DNA within the control region (ρ = 0.23; P = 9.3 × 10⁻¹⁵) or outside of it (ρ = 0.30; P = 0.021). This correlation is also far stronger at nonsynonymous than at synonymous sites (ρ = 0.25 for second codon position sites, P < 2.2 × 10⁻¹⁶; ρ = 0.079 for 4-fold degenerate sites, P = 3.6 × 10⁻⁴; fig. 2), as expected if purifying selection is a more predominant force at nonsynonymous sites. Finally, the minor allele frequencies of SNPs from the Human Mitochondrial Genome Database (mtDB; Ingman and Gyllensten 2006) are correlated with divergence (ρ = 0.076; P = 4.0 × 10⁻¹⁶), even though variation in mutation rate is not expected to affect allele frequencies.

Thus, purifying selection uniquely drives patterns of polymorphism in the human mitochondrial genome. This finding supports previous reports that purifying selection is a prominent force in the mitochondrial genome (Rand and Kann 1996; Nielsen and Weinreich 1999; Elson et al. 2004; Stewart et al. 2008). The patterns we observed are not the result of positive selection, as the fixation of a beneficial mutation through a selective sweep removes all genetic diversity from a nonrecombining chromosome (Smith and Haigh 1974). We are thus limited to observing mutations occurring since the most recent sweep.
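The per-site, windowed, and covariate-corrected correlations described above can be illustrated with a short sketch. This is a minimal illustration rather than the authors' code: it computes Spearman's ρ from average ranks, averages values in adjacent windows, and uses a first-order partial correlation on ranks as one possible way to correct for a covariate such as DssH. The text does not state which correction was applied, so that choice is an assumption, and all function names and toy values are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def window_means(values, size=10):
    """Average values in adjacent, non-overlapping windows (last one may be short)."""
    return [mean(values[i:i + size]) for i in range(0, len(values), size)]


def partial_spearman(x, y, z):
    """Rank correlation of x and y controlling for z (assumed correction method)."""
    rxy, rxz, ryz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (rxy - rxz * ryz) / ((1 - rxz ** 2) * (1 - ryz ** 2)) ** 0.5


if __name__ == "__main__":
    # Made-up toy values, not data from the study.
    alleles_per_site = [1, 1, 2, 3, 1, 2, 4, 3, 1, 1, 2, 2]
    negated_phylop = [-2.1, -1.8, 0.3, 1.2, -2.5, 0.1, 1.9, 1.1, -1.7, -2.2, 0.4, 0.2]
    dssh = [0.1, 0.2, 0.2, 0.3, 0.4, 0.4, 0.5, 0.6, 0.7, 0.7, 0.8, 0.9]
    print(spearman(alleles_per_site, negated_phylop))
    print(spearman(window_means(alleles_per_site, 4), window_means(negated_phylop, 4)))
    print(partial_spearman(alleles_per_site, negated_phylop, dssh))
```

Working on ranks keeps the sketch robust to the heavily tied allele counts (most sites carry one to four alleles); significance testing is left out.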
Having established that patterns of polymorphism across the human mitochondrial genome are largely determined by purifying selection, we sought to determine the resolution at which these data reveal the strength of selection acting on particular sites in the genome. We examined patterns of SNP diversity and divergence in 5-bp sliding windows across each gene in the mitochondrial genome, observing the extent to which the two measures mirror one another on a small scale. In particular, within each window, we calculated the average SNP density per base pair and the average probability that the site is not conserved across vertebrates according to phastCons (Siepel et al. 2005). We find that for many loci, tRNA genes in particular, these two measures track one another to a surprising extent (e.g., the phenylalanine and tryptophan tRNA genes shown in fig. 3; the remaining tRNA genes are shown in supplementary fig. S1, Supplementary Material online). This result demonstrates that SNP density has the ability to reveal the strength of selection at a surprisingly detailed resolution, on the scale of a few base pairs.

As a simple proof of concept, we sought to use SNP density to predict function at a fine scale via a hidden Markov model (HMM; Rabiner 1989). Using a strategy similar to that of phastCons (Siepel et al. 2005), we learned a two-state HMM (constrained


FIG. 1.—The correlation between polymorphism and divergence in the human mitochondrial genome. The average number of alleles per base pair in 10-bp windows is shown in the x-axis and divergence as measured by the negated phyloP score is shown in the y-axis.

vs. unconstrained) where the observation for each site in the genome is the number of alleles at the site. We then used this HMM to predict constrained regions to which we refer as mitoPopCons elements. There is extremely strong overlap between mitoPopCons elements obtained from polymorphism data and phastCons elements predicted from divergence (P < 0.0001; fig. 4; see Materials and Methods). Although mitoPopCons recovers somewhat fewer genic base pairs than phastCons (35.4% of all genic base pairs are recovered by mitoPopCons versus 49.7% by phastCons), mitoPopCons elements contain fewer intergenic base pairs (0.75% of mitoPopCons base pairs are intergenic versus 3.4% of phastCons bases). Given the dramatically deeper evolutionary time period examined by phastCons data, that it seems to perform only marginally better than mitoPopCons underscores the potential of population genetic approaches. phastCons elements are smaller and more numerous (1,395 elements averaging 6.7 bp in length) than mitoPopCons elements (33 elements averaging 167 bp), perhaps implying that phylogenetic data allow for higher resolution prediction than even our dense polymorphism data. On the other hand, element length distributions may have been influenced by the difference in emission probability training methods used for mitoPopCons (trained via the Baum–Welch algorithm; see Materials and Methods) and phastCons elements. In any case, the success of this simple HMM shows that SNP diversity has the ability to accurately predict function at a fine scale in the human mitochondria. We have shown that ultradense polymorphism data can be used to accurately detect functional nucleotides in the human mitochondrial genome, potentially at the level of the individual base pair, while sidestepping limitations of phylogenetic approaches. This result suggests that as whole-genome

sequencing becomes more ubiquitous, it may become possible to perform such high-resolution prediction in the nuclear genome as well. Applying a polymorphism-based approach to the nuclear genome will present several additional challenges. First, independent assortment and recombination in the nuclear genome cause different loci to have distinct genealogical histories and therefore varying levels of diversity under neutrality, thereby potentially impeding the detection of selection. As another consequence of recombination, both positive (Smith and Haigh 1974) and negative selection (Charlesworth et al. 1993) will have localized effects on flanking variation, rather than genome-wide effects as in the mitochondria. These forces will further increase variance in polymorphism at unselected sites and may thus obscure the signal of negative selection at selected sites. Another difficulty of the nuclear genome is that its nucleotide diversity is far lower than that of the mitochondrial genome (Lander et al. 2001), meaning that an even larger number of sequences than examined here may be required to accurately detect selection. Moreover, the power to detect function increases logarithmically with sample size (supplementary fig. S2, Supplementary Material online). However, given the ever-increasing rate at which new human genome sequences are released, this problem may not be insurmountable. Finally, there is likely more variation in the strength of purifying selection acting in the nuclear genome than in the mitochondria. As a consequence, weakly constrained but still functionally important regions may evade detection, especially by a two-state method such as the HMM used here, which allows for only one level of constraint. If these hurdles can be overcome, approaches such as ours will then have an enormous impact on biological research, allowing for the discovery of the complete set of functional


Table 1 Gene-Specific Correlations Between SNP Density and Negative phyloP Score

| Gene Name | Gene Description | Gene Start (hg19) | Gene End (hg19) |
|---|---|---|---|
| MT-TF | tRNA phenylalanine | 579 | 649 |
| MT-RNR1 | 12S ribosomal RNA | 650 | 1603 |
| MT-TV | tRNA valine | 1604 | 1672 |
| MT-RNR2 | 16S ribosomal RNA | 1673 | 3230 |
| MT-TL1 | tRNA leucine 1 | 3231 | 3305 |
| MT-ND1 | NADH dehydrogenase subunit 1 | 3308 | 4263 |
| MT-TI | tRNA isoleucine | 4264 | 4332 |
| MT-TQ | tRNA glutamine | 4330 | 4401 |
| MT-TM | tRNA methionine | 4403 | 4470 |
| MT-ND2 | NADH dehydrogenase subunit 2 | 4471 | 5512 |
| MT-TW | tRNA tryptophan | 5513 | 5580 |
| MT-TA | tRNA alanine | 5588 | 5656 |
| MT-TN | tRNA asparagine | 5658 | 5730 |
| MT-TC | tRNA cysteine | 5762 | 5827 |
| MT-TY | tRNA tyrosine | 5827 | 5892 |
| MT-CO1 | Cytochrome c oxidase subunit I | 5905 | 7446 |
| MT-TS1 | tRNA serine 1 | 7447 | 7515 |
| MT-TD | tRNA aspartic acid | 7519 | 7586 |
| MT-CO2 | Cytochrome c oxidase subunit II | 7587 | 8270 |
| MT-TK | tRNA lysine | 8296 | 8365 |
| MT-ATP8 | ATP synthase F0 subunit 8 | 8367 | 8573 |
| MT-ATP6 | ATP synthase F0 subunit 6 | 8528 | 9208 |
| MT-CO3 | Cytochrome c oxidase subunit III | 9208 | 9991 |
| MT-TG | tRNA glycine | 9992 | 10059 |
| MT-ND3 | NADH dehydrogenase subunit 3 | 10060 | 10405 |
| MT-TR | tRNA arginine | 10406 | 10470 |
| MT-ND4L | NADH dehydrogenase subunit 4L | 10471 | 10767 |
| MT-ND4 | NADH dehydrogenase subunit 4 | 10761 | 12138 |
| MT-TH | tRNA histidine | 12139 | 12207 |
| MT-TS2 | tRNA serine 2 | 12208 | 12266 |
| MT-TL2 | tRNA leucine 2 | 12267 | 12337 |
| MT-ND5 | NADH dehydrogenase subunit 5 | 12338 | 14149 |
| MT-ND6 | NADH dehydrogenase subunit 6 | 14150 | 14674 |
| MT-TE | tRNA glutamic acid | 14675 | 14743 |
| MT-CYB | Cytochrome b | 14748 | 15888 |
| MT-TT | tRNA threonine | 15889 | 15954 |
| MT-TP | tRNA proline | 15957 | 16024 |

elements in the human genome and the degree to which new mutations at each site are deleterious. Such efforts will vastly improve predictions of the phenotypic impact of mutations occurring in humans and will prioritize searches for disease-causing mutations. This information will also reveal species-specific changes in selective pressures at the resolution of individual nucleotides, thereby greatly improving our understanding of how the functional components of genomes evolve.
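As a rough illustration of the two-state approach described above, the sketch below decodes a sequence of per-site allele counts into constrained and unconstrained states and collapses the constrained state into intervals. It is not the authors' implementation. The transition and emission probabilities are hypothetical, hand-picked values, whereas the text describes learning the model from the data; Viterbi decoding is an assumption made here; and names such as `viterbi` and `constrained_elements` are illustrative.

```python
import math

STATES = ("constrained", "unconstrained")

# Hypothetical parameters chosen by hand for illustration only;
# they are not the trained values from the study.
START = {"constrained": 0.5, "unconstrained": 0.5}
TRANS = {
    "constrained": {"constrained": 0.9, "unconstrained": 0.1},
    "unconstrained": {"constrained": 0.1, "unconstrained": 0.9},
}
# P(number of alleles observed at a site | state); allele counts must be 1..4.
EMIT = {
    "constrained": {1: 0.80, 2: 0.15, 3: 0.04, 4: 0.01},
    "unconstrained": {1: 0.30, 2: 0.40, 3: 0.20, 4: 0.10},
}


def viterbi(alleles_per_site):
    """Return the most probable state path for a sequence of allele counts."""
    if not alleles_per_site:
        return []
    first = alleles_per_site[0]
    score = {s: math.log(START[s]) + math.log(EMIT[s][first]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for obs in alleles_per_site[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][obs])
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def constrained_elements(path):
    """Collapse a state path into half-open (start, end) runs of constrained sites."""
    elements = []
    start = None
    for i, state in enumerate(path):
        if state == "constrained" and start is None:
            start = i
        elif state != "constrained" and start is not None:
            elements.append((start, i))
            start = None
    if start is not None:
        elements.append((start, len(path)))
    return elements


if __name__ == "__main__":
    # Toy input: monomorphic sites flanking a short, highly polymorphic stretch.
    toy = [1] * 6 + [3] * 6 + [1] * 6
    print(constrained_elements(viterbi(toy)))
```

With sticky transitions like these, isolated polymorphic sites inside a constrained run tend not to split an element, which is one way a two-state model can yield long elements; the actual element lengths depend on the trained parameters.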

Materials and Methods We converted all coordinates obtained from MITOMAP and mtDB from the Cambridge Reference Sequence (CRS)

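Table 1 (continued)

| Gene Name | Spearman’s ρ |
|---|---|
| MT-TF | 0.484 |
| MT-RNR1 | 0.429 |
| MT-TV | 0.336 |
| MT-RNR2 | 0.443 |
| MT-TL1 | 0.301 |
| MT-ND1 | 0.542 |
| MT-TI | 0.269 |
| MT-TQ | 0.463 |
| MT-TM | 0.291 |
| MT-ND2 | 0.516 |
| MT-TW | 0.509 |
| MT-TA | 0.429 |
| MT-TN | 0.378 |
| MT-TC | 0.596 |
| MT-TY | 0.352 |
| MT-CO1 | 0.633 |
| MT-TS1 | 0.627 |
| MT-TD | 0.365 |
| MT-CO2 | 0.573 |
| MT-TK | 0.331 |
| MT-ATP8 | 0.266 |
| MT-ATP6 | 0.389 |
| MT-CO3 | 0.538 |
| MT-TG | 0.396 |
| MT-ND3 | 0.510 |
| MT-TR | 0.470 |
| MT-ND4L | 0.491 |
| MT-ND4 | 0.561 |
| MT-TH | 0.277 |
| MT-TS2 | 0.470 |
| MT-TL2 | 0.374 |
| MT-ND5 | 0.532 |
| MT-ND6 | 0.500 |
| MT-TE | 0.347 |
| MT-CYB | 0.551 |
| MT-TT | 0.560 |
| MT-TP | 0.103 |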

P 1.90 × 10⁻⁵