letters

focus on Genomes of Icelanders

The Y-chromosome point mutation rate in humans

© 2015 Nature America, Inc. All rights reserved.

Agnar Helgason1,2, Axel W Einarsson1,2, Valdís B Guðmundsdóttir1, Ásgeir Sigurðsson1, Ellen D Gunnarsdóttir1, Anuradha Jagadeesan1,2, S Sunna Ebenesersdóttir1,2, Augustine Kong1 & Kári Stefánsson1,3

Mutations are the fundamental source of biological variation, and their rate is a crucial parameter for evolutionary and medical studies. Here we used whole-genome sequence data from 753 Icelandic males, grouped into 274 patrilines, to estimate the point mutation rate for 21.3 Mb of male-specific Y chromosome (MSY) sequence, on the basis of 1,365 meioses (47,123 years). The combined mutation rate for 15.2 Mb of X-degenerate (XDG), X-transposed (XTR) and ampliconic excluding palindromes (rAMP) sequence was 8.71 × 10−10 mutations per position per year (PPPY). We observed a lower rate (P = 0.04) of 7.37 × 10−10 PPPY for 6.1 Mb of sequence from palindromes (PAL), which was not statistically different from the rate of 7.2 × 10−10 PPPY for paternally transmitted autosomes1. We postulate that the difference between PAL and the other MSY regions may provide an indication of the rate at which nascent autosomal and PAL de novo mutations are repaired as a result of gene conversion.

Several recent studies have used whole-genome sequencing data to find de novo mutations through a direct comparison of chromosomes from parents and offspring, yielding mutation rate estimates in the range of 0.89–1.2 × 10−8 mutations per position per generation (PPPG)1–4. Three studies showed that variation in the number of new mutations carried by offspring is primarily determined by the age of the father at the time of conception (reflecting the different numbers of mitoses preceding the meioses that yield eggs and sperm)1,3,5 and that fathers contribute around 3–4 times as many mutations to their offspring as mothers do1,3,5,6. These findings lead to an expectation of a relatively high mutation rate in the MSY, which is only carried and transmitted by fathers.
The euchromatic part of the MSY spans 22.4 Mb in NCBI Build 36 and is composed of a mosaic of three different classes of sequence7 (Supplementary Table 1). First, there are eight XDG regions (8.6 Mb) that derive from the ancestral autosomal pair that gave rise to mammalian sex chromosomes. Second, there are two XTR regions (3.4 Mb) that derive from the single transposition of a fragment from Xq21 of the X chromosome about 3–4 million years ago7. Third, there are ampliconic regions (10.2 Mb), characterized by large and highly similar repeat units, including eight palindromes and three inverted repeats (PAL; 6.2 Mb) and other spacer and repeated sequences (rAMP; 4 Mb). Being haploid and not subject to homologous recombination during meiosis, the MSY evolves solely through the accumulation of mutations and non-homologous gene conversion in palindromic sequences8,9. Consequently, the MSY mutation rate can be estimated using the genotypes from two or more males in multi-generation patrilineal genealogies10,11, which provide a greater number of chromosome transmission events (meioses) per number of genotyped individuals than father-son pairs.

At present, there is only one genealogy-based estimate of the MSY single-nucleotide mutation rate in humans, 3.0 × 10−8 (95% confidence interval (CI) = 8.9 × 10−9 to 7.0 × 10−8) mutations PPPG, based on 4 mutations in 10.15 Mb of XDG sequence data from 2 males separated by 13 patrilineal meioses11. More recently, two studies used a phylogenetic approach to estimate the XDG region mutation rate on the basis of the number of mutational differences between chromosomes descended from common patrilineal ancestors whose approximate year of birth is presumed to be known, yielding estimates of 8.2 × 10−10 PPPY for XDG regions alone12 and 5.3 × 10−10 PPPY for XDG and rAMP regions13 (this study also reported a rate of 6.5 × 10−10 PPPY from a subset of data with greater sequence depth).

To obtain a more reliable estimate of the MSY mutation rate, we used whole-genome sequencing data from 873 males clustered into 275 patrilines according to the deCODE Genetics genealogical database. Before identifying de novo mutations, we examined the patrilines for inconsistencies between the genetic and genealogical data (Online Methods). The corrected data set consisted of 753 males grouped into 274 patrilines, connected by 739 branches and a total of 2,449 meioses amounting to 85,128 years (Table 1 and Supplementary Tables 2 and 3). The average sequence depth in this set was 12.4× (s.d. = 5.3×) across XDG regions. The largest patriline, which comprised 86 meioses traced to a most recent common ancestor (MRCA) born in 1537, is shown in Figure 1.
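As a quick arithmetic check on the earlier genealogy-based estimate cited above, the point estimate follows from dividing the mutation count by the number of position-transmissions (positions × meioses). The sketch below is illustrative only: the function name is hypothetical, and it does not attempt to reproduce the published confidence interval, whose method is not stated here.

```python
def pedigree_rate_pppg(mutations, positions, meioses):
    """Point estimate of mutations per position per generation (PPPG)."""
    return mutations / (positions * meioses)


# Values quoted in the text for the earlier genealogy-based study (ref. 11):
# 4 mutations, 10.15 Mb of XDG sequence, 13 patrilineal meioses.
earlier_rate = pedigree_rate_pppg(mutations=4, positions=10.15e6, meioses=13)
print(f"earlier genealogy-based estimate: {earlier_rate:.2e} PPPG")

# Meioses available per sequenced male in the corrected data set
# (2,449 meioses, 753 males), illustrating the yield of deep patrilines.
meioses_per_male = 2449 / 753
print(f"meioses per sequenced male: {meioses_per_male:.2f}")
```

The same division underlies every rate in this study; only the counts and the denominators change.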
Genotypes were called for all 1,214 Icelandic males with whole-genome sequencing data at deCODE Genetics for each position on the euchromatic MSY using an approach that accounts for its paralogous nature (Online Methods). The identification of de novo mutations in each patriline was based on stringent filters to minimize false positives and negatives. The vast majority of positions from the XDG sequence class yielded unambiguous haploid genotypes that could be used in a relatively straightforward manner to identify de novo mutations in the 274 patrilines (Fig. 1). Despite the paralogous nature of the PAL, rAMP and XTR regions, it was possible to detect de novo mutations in these sequence classes by combining the reads

1deCODE Genetics/Amgen, Inc., Reykjavik, Iceland. 2Department of Anthropology, University of Iceland, Reykjavik, Iceland. 3Faculty of Medicine, University of Iceland, Reykjavik, Iceland. Correspondence should be addressed to A.H. ([email protected]) or K.S. ([email protected]).

Received 4 June 2014; accepted 1 December 2014; published online 25 March 2015; doi:10.1038/ng.3171

Nature Genetics  ADVANCE ONLINE PUBLICATION



Table 1  Summary of patrilines and branches

Data set / Statistic                             Mean (s.d.)        Min–max      Sum

Patrilines (n = 274)
  Number of whole genome–sequenced males         2.75 (1.29)        2–11         753
  Number of branches                             2.7 (2.26)         1–17         739
  Number of meioses                              8.94 (11.36)       1–86         2,449
  Birth year of MRCA                             1835.8 (123.2)     1375–1974    NA

All branches (n = 739)
  Length in generations                          3.31 (2.8)         1–14         2,449
  Length in years                                115.19 (100.95)    16–505       85,128
  Mean generation interval in years              33.99 (5.8)        16–60        NA
  Birth year of ancestor                         1823.5 (122.5)     1375–1980    NA
  Birth year of descendant                       1938.7 (67.0)      1476–2004    NA

Branches with sequence depth >10× (n = 482)
  Length in generations                          2.83 (2.63)        1–14         1,365
  Length in years                                97.8 (94.5)        16–505       47,123
  Mean generation interval in years              33.6 (6.2)         16–60        NA
  Birth year of ancestor                         1837 (129.6)       1375–1980    NA
  Birth year of descendant                       1935 (81.3)        1476–2004    NA

NA, not applicable.
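The per-branch means in Table 1 can be recovered from the Sum column, which also gives the ratio-of-sums number of years per meiosis. The sketch below is only a consistency check; the dictionary layout and function name are illustrative. Note that years per meiosis computed this way (about 34.5 for the >10× branches) need not equal the tabulated mean generation interval (33.6), presumably because the latter averages per-branch means without weighting by branch length.

```python
# Sums taken from Table 1; the dictionary layout is illustrative only.
BRANCH_SETS = {
    "all branches": {"n": 739, "meioses": 2449, "years": 85128},
    "depth >10x": {"n": 482, "meioses": 1365, "years": 47123},
}


def branch_summary(n, meioses, years):
    """Return per-branch means and the ratio-of-sums years per meiosis."""
    return {
        "generations_per_branch": meioses / n,
        "years_per_branch": years / n,
        "years_per_meiosis": years / meioses,
    }


summaries = {name: branch_summary(**vals) for name, vals in BRANCH_SETS.items()}
for name, summary in summaries.items():
    print(name, {key: round(value, 2) for key, value in summary.items()})
```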

from paralogous positions and applying a weighting scheme to account for uncertainty about the location of the mutations (Online Methods and Supplementary Fig. 1). The full list of 1,456 candidate mutations identified at 2,050 positions, which yielded a combined sum of weights of 1,432.5, is shown in Supplementary Table 4. The results for all branches and two subsets of branches, those with >1 descendant and those with a combined sequence depth >10× (which included all multi-descendant branches), are shown in Table 2.

Figure 1  A de novo mutation at MSY position 7,270,276 in the largest patriline used in this study. Each square represents a male in the patriline, with vertical position scaled by birth year. Lines between squares represent Y-chromosome transmission events. The filled squares represent males demarcating the branches to which mutations can be assigned in the patriline. Black squares represent males with whole-genome sequencing data and are labeled with the counts of alleles mapped to the forward and reverse strands at position 7,270,276. Inside each black square is the genotype called on the basis of these alleles. The reference allele at this position is G. All males with whole-genome sequencing data who did not belong to this patriline were called with a G genotype. This information, along with the distribution of genotypes in the patriline, can be used to infer the genotype of the MRCA. The most straightforward interpretation is that the MRCA and each of his four sons carried a G at position 7,270,276. A single G > A mutation event is required on the labeled two-generation branch (also indicated by a thicker line) to account for the distribution of genotypes in and outside the patriline.

False positive variants are twice as likely to be transversions as transitions, whereas the expected transition/transversion (TiTv) ratio in the human genome is around 2.1 (ref. 1). A low TiTv ratio among a set of de novo candidate mutations might therefore indicate an excessive number of false positives. Despite the stringency of the filters used to exclude false positives, the candidate mutations identified for all 739 branches yielded a suspiciously low TiTv ratio of 1.68. In comparison, the TiTv ratios observed for the 170 multi-descendant branches and 482 branches with >10× sequencing depth were similar to the genome-wide expectation, 1.95 and 1.94, respectively. This indicates that the majority of the false positives were assigned to branches with one descendant and a sequence depth ≤10×. Consequently, all subsequent analyses were based on the 482 branches with >10× sequence depth, which comprised 1,365 meioses or 47,123 years. Although XDG regions exhibited a lower TiTv ratio than the other MSY regions, 1.74 versus 2.07–2.23 (Supplementary Table 5), this difference is consistent with a value of 1.72 reported in a recent phylogenetic analysis14.

To directly estimate the false positive rate, we validated 101 candidate mutations through Sanger sequencing (Supplementary Table 6). We also validated genotype calls for 41 candidate mutations using whole-genome sequencing data obtained from 3 sons of different males included in the main analysis (Supplementary Table 7). The false positive rate for the 482 branches with >10× sequence depth was 0/43 from Sanger sequencing and 0/15 from whole-genome sequencing, yielding a combined rate of 0/58 (95% CI = 0–0.077). We obtained an indirect estimate of the false negative rate by comparing the microarray genotypes at 360 haplogroup-informative SNPs for 25 males with whole-genome sequencing data. We found 2 discrepancies in 8,956 instances with valid genotypes from both platforms, amounting to an error rate of 0.00022 (95% CI = 0–0.0009). Neither error would have generated a false mutation call in our study (Supplementary Table 8).

Our mutation rate estimate for XDG sequence was 3.07 × 10−8 (95% CI = 2.76 × 10−8 to 3.4 × 10−8) PPPG and 8.88 × 10−10 (95% CI = 8.0 × 10−10 to 9.86 × 10−10) PPPY. These confidence intervals encompass a previous genealogical estimate of 3.0 × 10−8 PPPG (ref. 11) and a phylogenetic estimate of 8.2 × 10−10 PPPY (ref. 12) but are inconsistent with other phylogenetic estimates of 5.3 × 10−10 and 6.5 × 10−10 PPPY (ref. 13).
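The upper bounds quoted above for the validation results (0/58 and 2/8,956) can be approximated with a Wilson score interval with continuity correction. The text does not name the interval method, so its use here is an assumption, chosen only because it happens to reproduce the quoted bounds; the function name is hypothetical.

```python
import math


def wilson_cc_interval(successes, trials, z=1.959964):
    """Two-sided Wilson score interval with a 0.5 continuity correction.

    Assumed method: the source does not state how its intervals were derived.
    """
    p_hat = successes / trials
    half_z2_over_n = z * z / (2 * trials)
    denominator = 1 + 2 * half_z2_over_n

    def bound(p_corrected, sign):
        spread = z * math.sqrt(
            p_corrected * (1 - p_corrected) / trials
            + half_z2_over_n / (2 * trials)
        )
        return (p_corrected + half_z2_over_n + sign * spread) / denominator

    p_low = p_hat - 0.5 / trials
    p_high = p_hat + 0.5 / trials
    lower = 0.0 if p_low <= 0 else bound(p_low, -1)
    upper = 1.0 if p_high >= 1 else bound(p_high, +1)
    return lower, upper


fp_lower, fp_upper = wilson_cc_interval(0, 58)      # validated false positives
gt_lower, gt_upper = wilson_cc_interval(2, 8956)    # genotype discrepancies
print(f"false positive rate: {fp_lower:.3f} to {fp_upper:.3f}")
print(f"genotype error rate: {gt_lower:.5f} to {gt_upper:.5f}")
```

With zero observed events the lower bound is clamped at 0, so the interval is driven entirely by the number of validated mutations.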
A comparison of the mutation rates of each sequence class with those of the other three classes combined using a χ2 test of equal proportions showed a lower rate in PAL sequence


Table 2  Mutation counts and rate by sequence region

                                                   Sum of weighted mutations                   Mutation rate (95% CI)
Region   Generations       Years                   Total      Ti       Tv       TiTv ratio     PPPG (×10−8)         PPPY (×10−10)

All (n = 739)
rAMP     6,874,385,229     238,741,250,131         199.67     134.75   64.92    2.08           2.90 (2.52−3.34)     8.36 (7.26−9.63)
PAL      13,393,136,657    465,132,879,492         383.33     239.17   144.17   1.66           2.86 (2.59−3.17)     8.24 (7.45−9.12)
XDG      19,839,248,707    689,006,510,906         622        379.75   242.25   1.57           3.14 (2.90−3.39)     9.03 (8.34−9.77)
XTR      7,598,266,520     263,902,578,153         227.5      144      83.5     1.72           2.99 (2.62−3.42)     8.62 (7.55−9.84)
ALL      47,705,037,113    1,656,783,218,682       1,432.5    897.67   534.83   1.68           3.00 (2.85−3.16)     8.65 (8.21−9.11)

>1 descendant (n = 170)
rAMP     2,100,028,363     74,200,282,332          64.5       45.67    18.83    2.42           3.07 (2.39−3.94)     8.69 (6.76−11.16)
PAL      4,165,585,875     147,208,882,493         110        71.83    38.17    1.88           2.64 (2.18−3.20)     7.47 (6.17−9.04)
XDG      6,080,806,070     214,842,865,403         193.5      125.5    68       1.85           3.18 (2.76−3.67)     9.01 (7.80−10.39)
XTR      2,327,330,075     82,236,059,838          64         42.5     21.5     1.98           2.75 (2.13−3.54)     7.78 (6.04−10.01)
ALL      14,673,750,383    518,488,090,066         432        285.5    146.5    1.95           2.94 (2.68−3.24)     8.33 (7.57−9.17)

Sequence depth >10× (n = 482)
rAMP     4,050,198,239     139,807,329,028         117.08     80.83    36.25    2.23           2.89 (2.40−3.48)     8.37 (6.96−10.07)
PAL      8,022,126,135     276,927,105,410         204.17     137.67   66.5     2.07           2.55 (2.21−2.93)     7.37 (6.41−8.48)
XDG      11,731,325,928    404,947,751,959         359.75     228.5    131.25   1.74           3.07 (2.76−3.40)     8.88 (8.00−9.86)
XTR      4,493,981,680     155,134,333,009         132.5      89.5     43       2.08           2.95 (2.48−3.51)     8.54 (7.18−10.16)
ALL      28,297,631,982    976,816,519,406         813.5      536.5    277      1.94           2.87 (2.68−3.08)     8.33 (7.77−8.93)
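The derived columns of Table 2 follow from its count columns: the TiTv ratio is Ti/Tv, and each rate is the sum of weighted mutations divided by the Generations or Years column. Reading those two columns as positions × meioses and positions × years is an inference from the arithmetic, not something the table states. The sketch below recomputes the values for the >10× branches, including the pooled XDG, rAMP and XTR rate; the data structure and function name are illustrative, and the confidence intervals are not reproduced.

```python
# Rows for branches with sequence depth >10x (n = 482), copied from Table 2.
# region: (Generations column, Years column, Ti weight, Tv weight)
DEPTH_GT_10X = {
    "rAMP": (4_050_198_239, 139_807_329_028, 80.83, 36.25),
    "PAL": (8_022_126_135, 276_927_105_410, 137.67, 66.5),
    "XDG": (11_731_325_928, 404_947_751_959, 228.5, 131.25),
    "XTR": (4_493_981_680, 155_134_333_009, 89.5, 43.0),
}


def summarize(regions, table=DEPTH_GT_10X):
    """Pool the given regions and return TiTv ratio, PPPG and PPPY."""
    generations = sum(table[r][0] for r in regions)
    years = sum(table[r][1] for r in regions)
    transitions = sum(table[r][2] for r in regions)
    transversions = sum(table[r][3] for r in regions)
    total = transitions + transversions
    return {
        "titv": transitions / transversions,
        "pppg": total / generations,
        "pppy": total / years,
    }


xdg = summarize(["XDG"])
pal = summarize(["PAL"])
non_pal = summarize(["XDG", "rAMP", "XTR"])
for label, result in (("XDG", xdg), ("PAL", pal), ("XDG+rAMP+XTR", non_pal)):
    print(
        f"{label}: TiTv={result['titv']:.2f} "
        f"PPPG={result['pppg'] * 1e8:.2f}e-8 PPPY={result['pppy'] * 1e10:.2f}e-10"
    )
```

Pooling is done on the counts and denominators rather than by averaging the per-region rates, so larger regions carry proportionally more weight.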

(two-tailed P = 0.04, mostly driven by PAL versus XDG). We therefore report combined mutation rates for the XDG, rAMP and XTR regions of 3.01 × 10−8 (95% CI = 2.77 × 10−8 to 3.26 × 10−8) PPPG and 8.71 × 10−10 (95% CI = 8.03 × 10−10 to 9.43 × 10−10) PPPY and separately for PAL sequence as 2.55 × 10−8 (95% CI = 2.21 × 10−8 to 2.93 × 10−8) PPPG and 7.37 × 10−10 (95% CI = 6.41 × 10−10 to 8.48 × 10−10) PPPY.

Despite some evidence for structural variation among haplogroups in IR2, rAMP2 and XTR2 sequence (Supplementary Fig. 2 and Supplementary Table 1), this variation did not have a notable impact on our results. Thus, a χ2 test for equality of proportions showed a significant difference across haplogroups for only 1 (XDG9) of 27 sequence regions, accounted for by an unusually high mutation rate for 5 branches belonging to an E1b1 patriline haplogroup (Supplementary Table 9). We observed 28 instances of multiple mutations on the same branch that were clustered by physical position (

with other studies1,3,5,16, a joint analysis showed that only the correlation with the number of years was significant (Table 3). We assessed the impact of father’s age at conception through the correlation of the mutation rate (in PPPG) and the mean generation interval per branch (r = 0.212, P = 1.4 × 10−6). This correlation was weaker than in autosomal studies1, partly because the MSY is shorter, such that relatively few de novo mutations are observed per generation (an average of 0.59, where 35% of branches had none). Also, information is lost through the averaging of generation intervals for branches with multiple generations (the 255 single-generation branches yielded r = 0.226, P = 1.8 × 10−4).

The patriline branches spanned several centuries, with the oldest and youngest from 1375–1476 and 1980–2002, respectively. A test for temporal variation in mutation rate showed that, if such a difference exists, it was too small to be detected by our study (Fig. 2b,c).

One important application of the MSY mutation rate is for estimating the time to the most recent common ancestor (TMRCA). Recent studies have provided conflicting estimates of the mutation rate11–13 and the TMRCA for all Y chromosomes in humans17–19. One source of confusion stems from the tendency to report pedigree estimates in PPPG and evolutionary estimates in PPPY. Given the impact of father’s age at conception on the mutation rate and variation in generation intervals across time and space1,20, it is vital that rates be reported and used in PPPY. Our pedigree-based mutation rate estimate across 16 Mb of XDG, rAMP and XTR sequence of 8.71 × 10−10 PPPY has the advantage of considering a large number of meioses and having a narrow CI. When this rate was applied to estimate

Figure 2  The number of mutations by branch length and calendar year. All panels show results for the 482 branches whose descendants with whole-genome sequencing data exceeded an XDG sequence depth of 10×. (a) A scatterplot of the number of mutations per branch per position (y axis) against the number of years per branch (x axis). (b) The mutation rate per branch per position per year (PPPY) (y axis) plotted against the mid-year (x axis) for each of the 482 branches, where gray circles represent branches with >1 descendant and black circles represent branches with 1 descendant. (c) The mutation rate in PPPY and its 95% CI grouped into five categories by the mid-year of the branch.

90%. For each position with at least one such paralog, genotype calls were performed on the set of alleles obtained by combining all the reads mapped to them. For AMP and XTR positions with no paralogs, genotype calls were based only on the reads mapped to each position and were processed in the same manner as those in the XDG class. A de novo mutation at one of n paralogous positions is expected to yield mutant alleles among combined reads at a frequency e(p) = 1/n. Combined paralogous positions with an observed frequency of the mutant allele P > 0.15 consistently result in pseudo-heterozygote genotypes when processed by the calling algorithm used in this study. Each position with n > 1 that yielded both pseudo-heterozygote and pseudo-homozygote genotypes among the members of a patriline was considered to harbor a candidate mutation and was subjected to the set of filters described above.
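To make the expectation e(p) = 1/n concrete, the sketch below tabulates the expected mutant-allele frequency among combined reads for different numbers of paralogous positions and compares it with the 0.15 frequency quoted above. It illustrates the arithmetic only; the function and constant names are hypothetical, and this is not the calling algorithm used in the study.

```python
PSEUDO_HET_THRESHOLD = 0.15  # observed mutant-allele frequency quoted in the text


def expected_mutant_fraction(n_paralogs):
    """Expected mutant fraction e(p) = 1/n when one of n paralogs mutates."""
    return 1.0 / n_paralogs


def exceeds_threshold(mutant_reads, total_reads):
    """True if the observed mutant-read fraction is above the quoted threshold."""
    return total_reads > 0 and mutant_reads / total_reads > PSEUDO_HET_THRESHOLD


for n in range(1, 9):
    e_p = expected_mutant_fraction(n)
    print(f"n={n}: e(p)={e_p:.3f} above threshold: {e_p > PSEUDO_HET_THRESHOLD}")

# Hypothetical read counts: 5 mutant reads out of 24 combined reads.
print(exceeds_threshold(5, 24))
```

Under this reading, a mutation in one of up to six paralogous copies has an expected frequency above 0.15, whereas with seven or more copies the expectation falls below it, so random sampling of reads matters more as n grows.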
In addition, we performed a binomial test to evaluate the probability of observing a sum of m or fewer copies of the mutated allele in r reads derived from putative carriers in the patriline, given a rate of success of e(p) = 1/n. Candidate mutations for which this probability was