Plant Biotechnology Journal (2012), pp. 1–10

doi: 10.1111/j.1467-7652.2012.00713.x

Targeted re-sequencing of the allohexaploid wheat exome

Mark O. Winfield1,*, Paul A. Wilkinson1, Alexandra M. Allen1, Gary L. A. Barker1, Jane A. Coghill1, Amanda Burridge1, Anthony Hall2, Rachael C. Brenchley2, Rosalinda D’Amore2, Neil Hall2, Michael W. Bevan3, Todd Richmond4, Daniel J. Gerhardt4, Jeffrey A. Jeddeloh4 and Keith J. Edwards1

1 School of Biological Sciences, University of Bristol, Bristol, UK
2 School of Biological Sciences, University of Liverpool, Liverpool, UK
3 John Innes Centre, Norwich Research Park, Norwich, UK
4 Roche NimbleGen Inc., Madison, WI, USA

Received 2 March 2012; revised 25 April 2012; accepted 30 April 2012.
*Correspondence (Tel 44 117 331 6770; fax 44 117 925 7374; email [email protected])

Keywords: next-generation sequencing, single-nucleotide polymorphisms, NimbleGen Array, wheat, bioinformatics, exome capture.

Summary

Bread wheat, Triticum aestivum, is an allohexaploid composed of three distinct ancestral genomes, A, B and D. The polyploid nature of the wheat genome, together with its large size, has limited our ability to generate the significant amount of sequence data required for whole genome studies. Even with the advent of next-generation sequencing technology, it is still relatively expensive to generate whole genome sequences for more than a few wheat genomes at any one time. To overcome this problem, we have developed a targeted-capture re-sequencing protocol based upon NimbleGen array technology to capture and characterize 56.5 Mb of genomic DNA with sequence similarity to over 100 000 transcripts from eight different UK allohexaploid wheat varieties. Using this procedure in conjunction with a carefully designed bioinformatic procedure, we have identified more than 500 000 putative single-nucleotide polymorphisms (SNPs). While 80% of these were variants between the homoeologous genomes, A, B and D, a significant number (20%) were putative varietal SNPs between the eight varieties studied. A small number of these latter polymorphisms were experimentally validated using KASPar technology and 94% proved to be genuine. The procedures described here to sequence a large proportion of the wheat genome, and the various SNPs identified, should be of considerable use to the wider wheat community.

© 2012 The Authors. Plant Biotechnology Journal © 2012 Society for Experimental Biology, Association of Applied Biologists and Blackwell Publishing Ltd

Introduction

Globally, wheat is one of the three most important crops for human and livestock feed (Shewry, 2009). However, unlike rice (a diploid) and maize (an ancient tetraploid), wheat is an allohexaploid (AABBDD) derived from the hybridization of the diploid Aegilops tauschii (DD genome) with the tetraploid Triticum turgidum (AABB genome) approximately 8000 years ago (Dubcovsky and Dvorak, 2007). Our ability to characterize the wheat genome has been revolutionized by the development of next-generation sequencing (NGS). However, the size and complexity of the genome mean that whole genome sequencing (WGS) costs are still high compared with many model species with smaller genomes (see Biesecker et al., 2011).

In maize, a crop with a large ancient tetraploid genome, the issues surrounding WGS have been overcome by utilizing targeted-capture re-sequencing using NimbleGen arrays (Fu et al., 2010). Targeted-capture, which is used to enrich for sequences of interest before carrying out NGS, can be carried out by hybridization of the target sequences to bait probes in solution (Sulonen et al., 2011) or on solid support (Asan et al., 2011). Saintenac et al. (2011) have recently reported the use of SureSelect, an in-solution targeted-capture technology, to examine the genome of tetraploid wheat. Saintenac et al. did not extend their study to hexaploid wheat and, although they pointed out that, without proper safeguards, many of the sequence variations identified might not be verifiable and may prove to be false, they performed no validation of their putative single-nucleotide polymorphisms (SNPs).

In hexaploid wheat, there is the inherent difficulty of the three homoeologous genomes, and thus, there are likely to be at least three similar copies of each gene. In addition, some loci may have undergone gene duplications producing closely related paralogues. Therefore, intensive computational analyses are required to translate raw wheat NGS data into accurately mapped reads from which a comprehensive list of variants can be derived (Reumers et al., 2012). This is a challenging problem and, without safeguards, many of the detected variants might be subsequently re-classified as genotyping errors. As a result, it is usually necessary to validate the putative variants using Sanger sequencing or other targeted approaches, and this may prove as expensive as the initial NGS approaches. It is, therefore, important to include filters in any analysis to remove false-positive calls (a variant called when, in fact, one does not exist), while minimizing false negatives (no variant called, when, in fact, a variant is present). Quality filters can be applied to the raw read data to remove sequences based on particular parameters, such as low coverage of the target sequence or quality scores indicating the reliability of the base calling. In well-annotated genomes, such as those available for model organisms, it is also possible to apply filters that remove target genomic regions that contain repetitive DNA sequences (Reumers et al., 2012). This presents more of a challenge for complex unfinished genomes such as wheat. Targeting of exonic regions of the genome should minimize the presence of repeats.

To extend targeted-capture re-sequencing to hexaploid wheat, and to include within these studies an examination of homoeologous and varietal SNPs, we have developed a pipeline for sequence capture, NGS and sequence characterization of wheat genomic DNA resulting in SNP identification and validation. Our pipeline incorporates a NimbleGen array designed to capture a significant proportion of the wheat exome, and a bioinformatics pipeline designed to identify both putative homoeologous and varietal SNPs. Furthermore, the work presented here confirms, against 23 wheat varieties, the validity of a small subset (96) of the putative SNPs identified from the NimbleGen capture experiments. Here, we discuss the results of targeted sequencing of eight UK bread wheat varieties chosen for this study as representative of valuable British varieties.
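The kind of quality filtering described in the Introduction (discarding candidate calls with low coverage or unreliable base calls) can be sketched as follows. This is only an illustration and not the pipeline used in this study: the `CandidateCall` record, the function names and the thresholds (`min_depth=10`, `min_quality=20.0`) are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CandidateCall:
    """A putative variant call (hypothetical record, for illustration only)."""
    contig: str
    position: int
    depth: int                # number of reads covering the position
    mean_base_quality: float  # mean Phred-scaled quality of the supporting bases


def passes_filters(call: CandidateCall,
                   min_depth: int = 10,
                   min_quality: float = 20.0) -> bool:
    """Keep a call only if it is both well covered and well supported.

    The default thresholds are assumptions for this sketch, not values
    taken from the paper.
    """
    return call.depth >= min_depth and call.mean_base_quality >= min_quality


def filter_calls(calls: List[CandidateCall],
                 min_depth: int = 10,
                 min_quality: float = 20.0) -> List[CandidateCall]:
    """Return the subset of calls that pass both filters, in input order."""
    return [c for c in calls if passes_filters(c, min_depth, min_quality)]
```

Raising either threshold trades false positives for false negatives, which is the balance the text describes.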

Results

Coverage of the exome

The design of the NimbleGen array used in this study is described in Experimental procedures. The array contained 132 606 repeat-masked expressed sequence tags covering 56.5 Mb. The average and median lengths of the sequences used to generate the capture probes were 426 and 366 bp, respectively. Given that the wheat genome is approximately 17 Gb in size, of which 1%–2% is thought to be protein coding (Paux et al., 2006), the wheat exome represents about 170–340 Mb. If the wheat NimbleGen array contains only unique sequences, as it was designed to do, and given that the total length of the features on the array is 56.5 Mb, it is capable of capturing at least 50% of the genes in a diploid exome.
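The arithmetic behind the exome estimate can be reproduced with a short calculation. The sketch below assumes that the "diploid exome" is one-third of the hexaploid exome (an even split among the A, B and D genomes); the text does not state this explicitly, so it is an interpretation, and the function names are illustrative only.

```python
GENOME_SIZE_MB = 17_000               # approx. 17 Gb wheat genome, from the text
CODING_FRACTION_RANGE = (0.01, 0.02)  # 1%-2% protein coding (Paux et al., 2006)
ARRAY_TARGET_MB = 56.5                # total length of features on the array
SUBGENOMES = 3                        # A, B and D; an even split is an assumption


def exome_size_range_mb(genome_mb, fractions):
    """Return the (low, high) exome size in Mb for a range of coding fractions."""
    return tuple(genome_mb * f for f in fractions)


def array_share_of_diploid_exome(array_mb, hexaploid_exome_mb, subgenomes=SUBGENOMES):
    """Fraction of a single-subgenome (diploid) exome that the array could cover."""
    diploid_exome_mb = hexaploid_exome_mb / subgenomes
    return array_mb / diploid_exome_mb


low_mb, high_mb = exome_size_range_mb(GENOME_SIZE_MB, CODING_FRACTION_RANGE)
share_if_large = array_share_of_diploid_exome(ARRAY_TARGET_MB, high_mb)
share_if_small = array_share_of_diploid_exome(ARRAY_TARGET_MB, low_mb)

print(f"Hexaploid exome: {low_mb:.0f}-{high_mb:.0f} Mb")
print(f"Array share of a diploid exome: {share_if_large:.0%}-{share_if_small:.0%}")
```

Under these assumptions, the lower bound of the share corresponds to the 2% coding fraction and is consistent with the "at least 50%" figure quoted above.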

Screening the NimbleGen array with genomic DNA derived from eight UK wheat varieties

Eight UK wheat varieties (Alchemy, Avalon, Cadenza, Hereward, Rialto, Robigus, Savannah and Xi19) were used for the study. Hybridization of Illumina NGS libraries derived from genomic DNA from these eight varieties to NimbleGen arrays followed by Illumina NGS gave a total of 127.2 million reads of which 48.4 million reads (average, 38.1%; range, 22%–44.5%) could be aligned to a reference sequence generated from the sequences used to make the array: the variety Xi19 had the lowest number of mapped reads, and the varieties Alchemy and Savannah had the highest (Table 1). Of the 132 606 features on the array, 119 386 (90%) had a match to at least one read. The remaining 13 220 features (10%) had no matches. The mean coverage level of sequence reads matching NimbleGen array capture probes (derived from EST and cDNA sequences; see section ‘Generation of the NimbleGen Array’ in Experimental procedures) from across all eight varieties was 381 sequences per NimbleGen feature (Figure S1). Thus, the average coverage per variety was 47.6× (381/8).

Table 1 The number of raw sequences and mapped reads for the eight wheat varieties following hybridization to the wheat NimbleGen array

Variety     Raw reads (×million)   Mapped reads (×million)   Percentage mapped
Alchemy     44.4                   16.9                      38.0
Avalon      27.2                   10.9                      39.9
Cadenza     20.2                   4.9                       24.1
Hereward    41.0                   15.2                      37.0
Rialto      30.4                   11.8                      38.8
Robigus     31.4                   6.3                       20.0
Savannah    48.7                   16.8                      34.6
Xi19        9.8                    2.9                       29.9
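The per-variety figures above follow from simple ratios, sketched here for reference. The raw and mapped read counts are those of Table 1; the function names are illustrative. Because the table values are rounded to one decimal place, percentages recomputed this way can differ from the printed column by a few tenths of a percentage point.

```python
# (raw reads, mapped reads) in millions, as listed in Table 1
TABLE_1 = {
    "Alchemy": (44.4, 16.9),
    "Avalon": (27.2, 10.9),
    "Cadenza": (20.2, 4.9),
    "Hereward": (41.0, 15.2),
    "Rialto": (30.4, 11.8),
    "Robigus": (31.4, 6.3),
    "Savannah": (48.7, 16.8),
    "Xi19": (9.8, 2.9),
}


def percent_mapped(raw_millions, mapped_millions):
    """Percentage of raw reads that aligned to the array-derived reference."""
    return 100.0 * mapped_millions / raw_millions


def per_variety_coverage(pooled_coverage, n_varieties):
    """Average coverage per variety when pooled coverage is shared evenly."""
    return pooled_coverage / n_varieties


for variety, (raw, mapped) in TABLE_1.items():
    print(f"{variety:<9} {percent_mapped(raw, mapped):5.1f}% mapped")

# 381 reads per feature pooled over eight varieties
print(f"Per-variety coverage: {per_variety_coverage(381, len(TABLE_1)):.1f}x")
```

The even division by eight is the same simplification the text makes; actual coverage per variety would scale with each variety's number of mapped reads.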

SNP distribution within and between the eight varieties

Using a previously described custom Perl script (Allen et al., 2011), we found 59 762 contigs from the unigene set (i.e. those contigs used to create the array-bound capture probes) where one or more putative SNPs could be characterized once the Illumina data had been aligned to this unigene reference set. These 59 762 array-bound unigenes had a cumulative length of 34.1 Mb. The average and median lengths of these SNP-containing contigs were 571 and 469 bp, respectively. These figures are somewhat higher than the pre-filtered values (426 and 366 bp) reported above. The total number of putative SNPs called across the eight varieties was 511 439 (Table S1). The average number of SNPs per contig was 8.6 (511 439/59 762), which translates into a SNP occurring, on average, every 67 bp (34.1 Mb/511 439), or 15 SNPs per kb. This figure is very similar to that of 16.5 SNPs per kb reported by Barker and Edwards (2009). The number of SNPs per contig ranged from 1 (7928 contigs or 13.3% of the total) to 127 (present in only one contig; Figure 1a): 50% of all contigs had