Exome Sequencing

Curr Cardiol Rep (2014) 16:507 DOI 10.1007/s11886-014-0507-2

CARDIOVASCULAR GENOMICS (R MCPHERSON, SECTION EDITOR)

Exome Sequencing: New Insights into Lipoprotein Disorders
Sali M. K. Farhan & Robert A. Hegele

© Springer Science+Business Media New York 2014

Abstract Several next-generation sequencing platforms allow for a DNA-to-diagnosis protocol to identify the molecular basis of monogenic dyslipidemias. However, recent reports of the application of whole genome or whole exome sequencing in families with severe dyslipidemias have largely identified genetic variants in known lipid genes. To date, high-throughput DNA sequencing in families with previously uncharacterized monogenic dyslipidemias has failed to reveal new genes for regulation of plasma lipids. This suggests that rather than sequencing whole genomes or exomes, most patients with monogenic dyslipidemias could be diagnosed using a more dedicated approach that focuses primarily on genes already known to act within lipoprotein metabolic pathways.
Keywords Next-generation DNA sequencing · Diagnosis, Mutation · Inherited disorders · Dyslipidemia · Hypercholesterolemia · Hypertriglyceridemia · Exome sequencing · Lipoprotein disorders

This article is part of the Topical Collection on Cardiovascular Genomics
S. M. K. Farhan · R. A. Hegele
Department of Biochemistry, Robarts Research Institute, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada N6A 5B7
S. M. K. Farhan · R. A. Hegele
Department of Medicine, Robarts Research Institute, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada N6A 5B7
R. A. Hegele (*)
Blackburn Cardiovascular Genetics Laboratory, Robarts Research Institute, Western University, 4288A-1151 Richmond Street North, London, Ontario, Canada N6A 5B7
e-mail: [email protected]

Introduction
Sanger sequencing, now considered "first generation" sequencing, has been widely and successfully used for genomic studies. Specifically, Sanger sequencing helped elucidate the code of the first human genome, a large-scale, multiyear project and multibillion-dollar investment [1]. While Sanger sequencing is still considered the gold standard for molecular genetic studies, cost and time constraints have necessitated the development of fast, high-throughput and more affordable genome-wide sequencing approaches. In 2008, a human genome was decoded using massively parallel sequencing for a cost of $1.5 million and five months of effort [2]. We are now in the era of high-throughput DNA sequencing, referred to as next-generation sequencing (NGS). NGS generates a voluminous amount of data (billions of nucleotide reads per single run) at far lower cost (under $5,000 to $10,000). NGS allows for whole genome sequencing (WGS) and whole exome sequencing (WES), the latter of which is a scaled-down application that captures only protein-coding regions, or the “exome”. The integration of NGS technologies into molecular biology has led to a wide range of applications, which have so far included the discovery of hundreds of disease-causing mutations for monogenic disorders covering a wide range of organ systems and medical specialties [3•]. Collectively, these findings have increased our understanding of how genetic variation is related to human disease [3•]. WES has been used primarily to find causative mutations in Mendelian diseases, since these are monogenic and are often caused by nonsynonymous mutations within protein-coding regions that have been highly conserved throughout evolution. In contrast, the value of NGS technology in complex disorders remains to be determined; such diseases have an inherently complicated genetic basis.
The power to link a particular genomic variation to the phenotype is limited by sample size and confounding by the requisite biological architecture whereby numerous variants each contribute in a small way to produce the phenotype, in contrast to the rather pure large effects of high penetrance single genetic variants that typically characterize Mendelian diseases [4]. Complex traits seem more amenable to current microarray-based genome-wide association studies (GWAS) to isolate the small individual risk attributable to common genetic variants, although there is a recent trend to study rare variants with minor allele frequencies (MAFs) 3000

[Table fragment; surviving cells: "Shortest reads; technology limited to 80 to 100 bases"; ">200"; "Low cost per base, most favorable"; "Shorter reads"; ">6000"]
Abbreviations: N/A, not applicable; PCR, polymerase chain reaction

sequence attached on the flow cell [11]. Extending the vacant 3’ OH with a fluorescently labeled primer that is complementary to the adapter allows hybridization, and light is emitted and detected, revealing the identity of the nucleotide base [11].

Genome Alignment and Variant Detection
The final step of NGS is to align the newly generated sequences to publicly available human reference genomes to identify a consensus sequence. Several commercially available genome analysis pipelines have built-in algorithms designed to yield accurate and refined data alignment. From this point, the genetic variants present in the initial biological sample are "called", including single nucleotide polymorphisms (SNPs) and insertions/deletions (indels). Additional bioinformatic tools, such as allele frequency databases or in silico predictions of dysfunction, can be added to help sort through the list of genetic variants.
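The distinction between single nucleotide changes and indels at the calling step can be illustrated with a minimal Python sketch. It assumes each call is represented by a reference and an alternate allele string, as in VCF-style output; the function name and the example alleles are hypothetical and do not come from any pipeline mentioned in the text.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a call from its reference and alternate allele strings."""
    if not ref or not alt:
        raise ValueError("ref and alt alleles must be non-empty strings")
    if ref == alt:
        return "no_change"
    if len(ref) == 1 and len(alt) == 1:
        # A single-base substitution; called a SNP when it is polymorphic
        # in a population.
        return "single_nucleotide_substitution"
    if len(ref) == len(alt):
        return "multi_nucleotide_substitution"
    # Unequal lengths mean bases were gained or lost relative to the reference.
    return "insertion" if len(alt) > len(ref) else "deletion"


# Hypothetical alleles, for illustration only.
calls = [("A", "G"), ("C", "CTT"), ("GAT", "G"), ("AC", "GT")]
for ref, alt in calls:
    print(ref, alt, classify_variant(ref, alt))
```

Real variant callers also report genotype, quality and depth for each call; this sketch covers only the allele-length logic.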

Limitations of NGS Research
As the cost of NGS technologies decreases, the demand for computer scientists and bioinformaticians to develop universal quality control metrics, genome alignment and analysis toolkits across all platforms is increasing sharply. For example, current WES capture kits vary in their genome target and size, which leads to variant output discrepancies among different NGS platforms [14]. Several comparative studies of NGS data have evaluated multiple commercial genome alignment tools and variant calling outputs head-to-head. It appears that when the results of two or more analytical programs are interpreted in combination, there are fewer false discoveries overall. However, this approach may not be practical or economical for a single investigator. Instead, it is likely that NGS analysis will become a fee-for-service activity, in which an investigator or clinician pays a genomics center a single, all-inclusive fee in exchange for data that have been fully analyzed by a range of complementary validated methods. Furthermore, there is a marked increase in the use of in silico prediction programs to validate WGS/WES findings, especially given the extremely large numbers of variants that are generated from a single WGS or even WES experiment. While these tools are readily available and are often useful for predicting the biological effect of putative mutations, interpreting the results can be biased by the initial hypothesis, and different methods can produce contrasting outputs, similar to interpreting an astrologer's chart or a Rorschach diagram. Despite refinements and enhancements to these programs, the burden of proof for causality ultimately falls on the molecular biologist to demonstrate that genetic variants found using WGS/WES are physiologically pathogenic or dysfunctional in an experimental system.
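The idea of interpreting two or more analytical programs in combination can be sketched as a set intersection over their calls. This is only an illustration under stated assumptions: the `Call` record, the function names and the coordinates are hypothetical, and requiring agreement probably trades some sensitivity for fewer false discoveries.

```python
from typing import Iterable, NamedTuple, Set


class Call(NamedTuple):
    """One variant call, keyed by position and alleles (hypothetical record)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def concordant(*call_sets: Iterable[Call]) -> Set[Call]:
    """Calls reported by every program."""
    sets = [set(s) for s in call_sets]
    return set.intersection(*sets) if sets else set()


def discordant(*call_sets: Iterable[Call]) -> Set[Call]:
    """Calls reported by at least one program but not by all of them."""
    sets = [set(s) for s in call_sets]
    return set().union(*sets) - concordant(*sets)


# Hypothetical output of two calling programs run on the same sample.
program_a = {Call("2", 100, "A", "G"), Call("2", 200, "C", "T"), Call("5", 300, "G", "GA")}
program_b = {Call("2", 100, "A", "G"), Call("5", 300, "G", "GA"), Call("7", 50, "T", "C")}

print(len(concordant(program_a, program_b)), "calls agreed on by both programs")
print(len(discordant(program_a, program_b)), "calls needing further review")
```

Discordant calls are not necessarily wrong; in practice they would be candidates for confirmation, for example by Sanger sequencing.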

Alternative Genomic Approaches to Define the Molecular Basis of Monogenic Dyslipidemias
The decreasing cost and accessibility of WES make it attractive as the initial approach to characterize the molecular basis of lipid disorders. However, an initial screening step could be used before subjecting an interesting clinical sample to WES and its attendant massive datasets of genetic variants. For instance, mutations within genes already known to be causative for a particular dyslipidemia should be ruled out by an initial triage or screening step. This could be accomplished with either limited Sanger sequencing or NGS-based targeted resequencing of a limited number of known dyslipidemia genes. In this approach, genomic DNA extracted from a patient’s blood is first sequenced using reagents that detect only selected known monogenic dyslipidemia genes. The resulting list of genetic variants is compared with the literature and public databases of known genomic and disease-causing variants. A variant of questionable relevance could be subjected to further genetic, in silico and functional tests to validate the potential pathogenic effect of the mutation. A phenotype-specific gene panel is a rapid, economical genomic approach and has been successful in diagnosing patients with other monogenic diseases [15–17]. As more mutations are identified by NGS platforms and catalogued in repositories, genotype-phenotype correlations for newly discovered mutations will become easily accessible. These public databases can ultimately help clinicians and their patients avoid a diagnostic odyssey.

NGS and Lipid Disorders
Once robust NGS technology became available for research applications around 2009, it was applied to solve the molecular basis of several probands and families with atypical or uncharacterized monogenic dyslipidemias. These were patients with quantitatively and qualitatively severe phenotypes that appeared not to involve known genes in lipoprotein metabolism. While WGS was used in one of these studies, WES, almost exclusively with the Illumina platform, has been used for the remainder. Cumulatively, these studies over the past four years are rather remarkable in that no completely new or previously unknown genes in lipoprotein metabolism have been uncovered as being causative of the dyslipidemia phenotypes in any of these probands and families. The results of these studies are summarized in Table 2.

Compound Heterozygous ABCG5 Mutations in Severe Hypercholesterolemia

The first reported use of NGS to solve the molecular basis of a dyslipidemia phenotype involved an 11-month-old breast-fed girl who developed subcutaneous xanthomas over her Achilles tendons and was found to have a plasma total cholesterol level of 26.5 mmol/L [18•]. Sanger sequencing of known candidate genes for autosomal dominant and recessive hypercholesterolemia [19], namely LDLR, LDLRAP1, PCSK9, APOB, and APOE, revealed no mutations. Because this seemed to be a novel syndrome, the proband's DNA was subjected to WGS using the Complete Genomics platform. Out of ~138 billion total nucleotides sequenced, a number that reflects the fact that each nucleotide base was read multiple times, the proband's genome showed a total of ~3.8 million deviations from the "normal" reference sequence. To filter this huge volume of variants down to a manageable number, several assumptions were made. First, focusing only on variants predicted to change the amino acid sequence, i.e., non-synonymous single substitutions, reduced the haystack to 9726 variants. Second, removing variants that had previously been observed in public genomic databases, in "normal" individuals or in those without severe dyslipidemia, left 699 novel variants distributed among 604 genes. Third, retaining only variants within single-copy genes that carried at least two mutations left 23 variants that could possibly explain recessive inheritance. Of these 23, only two were novel nonsense mutations within a single gene, namely ABCG5 p.Q16X and p.R446X. ABCG5 encodes a sterolin half-transporter and had been mapped a decade earlier as one of the causative genes for sitosterolemia [20, 21], a phenotype with variable clinical expression, but which can present in childhood with severe hypercholesterolemia and subcutaneous xanthomas [22, 23].
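The three filtering assumptions described above can be sketched as a short Python cascade. This is a toy illustration, not the authors' pipeline: the `Variant` record, its field names and every gene other than ABCG5 are hypothetical, the single-copy-gene criterion is not modeled, and the code does not check that two variants in a gene lie on different alleles.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    gene: str
    change: str              # protein-level label, e.g. "p.Q16X"
    consequence: str         # "synonymous", "missense" or "nonsense"
    seen_in_databases: bool  # previously observed in public variant databases


PROTEIN_ALTERING = {"missense", "nonsense"}


def recessive_candidates(variants):
    """Apply the three filters in order and return {gene: [variants]}."""
    # Step 1: keep only variants predicted to change the amino acid sequence.
    coding = [v for v in variants if v.consequence in PROTEIN_ALTERING]
    # Step 2: drop variants already observed in reference databases.
    novel = [v for v in coding if not v.seen_in_databases]
    # Step 3: keep genes carrying at least two remaining variants
    # (phase is NOT checked here, so both could sit on the same allele).
    by_gene = defaultdict(list)
    for v in novel:
        by_gene[v.gene].append(v)
    return {gene: vs for gene, vs in by_gene.items() if len(vs) >= 2}


# ABCG5 entries are from the text; all other rows are invented placeholders.
toy_calls = [
    Variant("ABCG5", "p.Q16X", "nonsense", False),
    Variant("ABCG5", "p.R446X", "nonsense", False),
    Variant("GENE_A", "p.A10V", "missense", True),
    Variant("GENE_B", "p.L5L", "synonymous", False),
    Variant("GENE_C", "p.G20S", "missense", False),
    Variant("GENE_D", "p.R3H", "missense", False),
    Variant("GENE_D", "p.T9M", "missense", False),
]

candidates = recessive_candidates(toy_calls)
# Final prioritization: genes in which every remaining variant is nonsense.
nonsense_only = {
    gene for gene, vs in candidates.items()
    if all(v.consequence == "nonsense" for v in vs)
}
print(sorted(candidates), sorted(nonsense_only))
```

Each step discards variants on the basis of an assumption, so a causal variant that is synonymous, non-coding or already catalogued would be filtered out by a cascade of this kind.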
However, the biochemical sine qua non in sitosterolemia is the detection in plasma of pathologically high levels of plant sterols, such as campesterol and sitosterol; these molecules are normally actively re-secreted into the gut lumen by sterolin co-transporters ABCG5 and ABCG8 [23]. Given the complete loss of function of the ABCG5 co-transporter that would be predicted by the presence of nonsense mutations on both alleles, plasma levels of sitosterol should have been markedly elevated in this patient. Yet her plasma showed relatively low levels of plant sterols shortly after the dyslipidemia was first identified, when her plasma total cholesterol was ~20 mmol/L. Despite this atypical phenotypic presentation of sitosterolemia, she had other typical features, including a dramatic positive response to fortuitous treatment with ezetimibe in combination with a statin. Certainly, having a firm diagnosis of sitosterolemia as a result of NGS has been beneficial for this patient, and allows for more rational, evidence-based therapy going forward. In retrospect, the experience gained through investigating this patient suggests that: 1) some flexibility may be required when interpreting “classical” phenotypic manifestations; 2) searching for variants within coding regions was sufficient, and indeed was an important filtering step; and 3) widening the range of known genes sequenced at the screening step may have avoided the need to apply WGS to this sample. This outcome is in fact fairly common in WGS or WES studies of rare disease: about one-third of probands and families sequenced are found to have mutations in known genes that were already documented as being the cause of related phenotypes [3•].

Table 2 Dyslipidemia mutations detected using next generation sequencing
Phenotype | Method | Gene and mutation | Comments | Year (reference)
Severe hypercholesterolemia (AR) | WGS | ABCG5 p.Q16X; ABCG5 p.R446X | compound heterozygote; known sitosterolemia gene | 2010 [18•]
Familial combined hypolipidemia | WES | ANGPTL3 p.E129X; ANGPTL3 p.S17X | compound heterozygote; known gene from mouse models; reductions in several lipoproteins | 2010 [24]
Familial hypoalphalipoproteinemia | WES | ABCA1 p.S1731C; LPL p.P207L | oligogenic interaction; double heterozygote: very low HDL-C; simple heterozygotes: low HDL-C | 2012 [31]
Hypercholesterolemia (AD) | WES | APOE p.L167del | mutation previously associated with combined hyperlipidemia, hypertriglyceridemia, splenomegaly, histiocytosis | 2013 [32•, 33•]
Hypobetalipoproteinemia (AD) | WES | APOB p.K2240X | additional features include fatty liver and liver cancer | 2013 [37•]
Hypercholesterolemia (AR) | WES | LIPA c.894G>A | cholesterol ester storage disease presenting as autosomal recessive hypercholesterolemia | 2013 [39•]
Abbreviations: AD, autosomal dominant; AR, autosomal recessive; WGS, whole genome sequencing; WES, whole-exome sequencing

Compound Heterozygous ANGPTL3 Mutations in Familial Combined Hypolipidemia
The second report of NGS in dyslipidemia [24] involved a 40-member three-generation family of European descent that was first ascertained and studied in the 1990s. The lipid phenotype in affected individuals included markedly depressed mean levels of total and LDL cholesterol, HDL cholesterol and triglycerides [25]. This family was one of six originally reported as having "hypobetalipoproteinemia", i.e., depressed LDL cholesterol but with no evidence for a mutation in APOB, which is the usual causative gene for this phenotype [25]. Because biochemical and genetic screening methods indicated that APOB was not involved, this seemed to be a novel dyslipidemia syndrome. For the NGS study, DNA from two affected individuals was subjected to whole exome sequencing using the Illumina platform [24]. Out of ~10 billion nucleotide bases sequenced, each sample showed a total of ~18,000 deviations from the "normal" reference sequence. After filtering based on novelty and rarity (absence from existing genomic sequence databases), severity (priority on nonsense mutations likely to cause loss of function) and consistency with recessive inheritance (mutations on both alleles in affected family members), six variants remained in contention, and only one gene contained two nonsense variants in both affected siblings, namely ANGPTL3, p.S17X and p.E129X. Individuals carrying both mutant alleles had severely reduced TG, LDL-C, and HDL-C levels, while heterozygotes for either allele had intermediate reductions in TG and LDL-C, with normal levels of HDL-C. Thus, the TG and LDL-C phenotypes behaved as co-dominant traits, while the HDL-C phenotype behaved as a recessive trait. ANGPTL3, encoding angiopoietin-like 3 protein, had been discovered a decade earlier as the causative gene for hypolipidemia in a mouse strain [26]. ANGPTL3 is hepatically secreted and appears to suppress the activities of plasma lipases, such as lipoprotein lipase and endothelial lipase [27, 28]. Furthermore, re-sequencing of ANGPTL3 found enrichment of rare loss-of-function mutations in subjects with low plasma TG [29]. However, despite these earlier supportive findings, it is still not clear why all three lipid variables would be significantly reduced in compound heterozygous family members. The robustness of the initial observation is emphasized by many subsequent reports of families with ANGPTL3 mutations and a combined hypolipidemia phenotype, or familial hypobetalipoproteinemia subtype 2 (FHBL2).
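The reasoning by which TG and LDL-C were described as co-dominant and HDL-C as recessive can be made concrete with a small sketch that places the heterozygote mean between the non-carrier and two-allele means. The function name, the 10 % tolerance and all lipid values below are hypothetical choices for illustration; they are not data from the study, and a real analysis would rely on statistical testing of family data.

```python
def inheritance_pattern(noncarrier: float, heterozygote: float,
                        biallelic: float, tolerance: float = 0.10) -> str:
    """Classify a trait from three group means.

    `position` is 0 when heterozygotes match non-carriers and 1 when they
    match carriers of two mutant alleles.
    """
    span = noncarrier - biallelic
    if span == 0:
        return "no effect"
    position = (noncarrier - heterozygote) / span
    if position <= tolerance:
        return "recessive"      # heterozygotes look like non-carriers
    if position >= 1 - tolerance:
        return "dominant"       # heterozygotes look like two-allele carriers
    return "co-dominant"        # heterozygotes are intermediate


# Hypothetical group means in mmol/L (non-carrier, heterozygote, two alleles).
example_means = {
    "TG": (1.5, 1.0, 0.5),
    "HDL-C": (1.3, 1.28, 0.6),
}
for trait, means in example_means.items():
    print(trait, inheritance_pattern(*means))
```

With these invented values, the heterozygote TG mean lies midway between the other two groups, whereas the heterozygote HDL-C mean is almost indistinguishable from non-carriers, which mirrors the pattern described in the text.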


Recently, a large series of 115 FHBL2 individuals carrying 13 different ANGPTL3 mutations has been reported [30]. Compared with normal controls, 22 individuals with two mutated alleles had undetectable plasma ANGPTL3 levels and reductions in all lipoproteins except Lp(a), while 93 individuals with one mutated allele had half-normal ANGPTL3 levels and levels of all lipoproteins except Lp(a) that were intermediate between those in normal controls and in homozygotes, with all traits following an autosomal co-dominant inheritance pattern [30]. There were no clinical sequelae, particularly no fatty liver, and possibly a lower risk of diabetes and cardiovascular disease in mutation carriers.

Double Heterozygous ABCA1 and LPL Mutations in Hypoalphalipoproteinemia
A 75-person French Canadian family was reported in 2012, 27 of whom had depressed HDL cholesterol levels [31]. DNA of three family members was subjected to WES using the Illumina platform. Out of ~95 million total nucleotides sequenced per sample, a total of 3419 variants were shared across all three individuals, of which 332 were novel and had low allele frequencies. Of these, two heterozygous variants, namely ABCA1 (encoding ATP-binding cassette subfamily A member 1) p.S1731C and LPL (encoding lipoprotein lipase) p.P234L, were each excellent candidates, as both genes are known contributors to low HDL cholesterol. Dysfunction of the ABCA1 variant was confirmed with in vitro lipid efflux studies, while the LPL variant was well known to be dysfunctional [31]. Studies of lipid profiles in all members of the extended family indicated that carriers of both variants had the lowest HDL cholesterol, while carriers of one or the other variant had HDL cholesterol that was intermediate between normal controls and carriers of two mutations. In this example, WES revealed an oligogenic interaction between two mutations that additively worsened HDL cholesterol levels.
Heterozygous APOE Mutation in Autosomal Dominant Hypercholesterolemia
Two independently ascertained families reported in 2013 presented with autosomal dominant familial hypercholesterolemia (ADH) that strongly resembled heterozygous FH, with typical physical findings such as xanthelasma palpebrarum and tendon xanthomas [32•, 33•]. In each case, Sanger sequencing of known genes for autosomal dominant hypercholesterolemia (namely LDLR, APOB, and PCSK9) showed no mutations; however, this method would have missed large deletions, which require a non-sequencing-based method for detection. In the case of the first family, from France, concurrent linkage analysis studies narrowed the causative locus to chromosome 19q13.31, which is remote from the LDLR gene (on the opposite arm of the chromosome). WES identified four possible disease-causing mutations in this region, of which a three-nucleotide deletion that caused an in-frame deletion of the leucine at apo E residue 167, referred to as APOE p.L167del, was the most likely cause [32•]. The mutation co-segregated with the hyperlipidemia phenotype in a multigenerational extended pedigree. Following a very similar procedure, the same mutation was found in a Quebec family of Italian origin in which hypercholesterolemia showed autosomal dominant segregation [33•]. The same APOE p.L167del mutation (also known as p.L149del due to different numbering conventions) was previously reported as being associated with a wider range of phenotypes quite distinct from ADH, including hypertriglyceridemia with splenomegaly [34], combined hyperlipidemia [35], and splenomegaly with thrombocytopenia and sea-blue histiocytosis [36]. It now seems that the expression of these more complex phenotypes may depend on the presence of so far uncharacterized "second hits" or variants at other loci, and that the two reported ADH families [32•, 33•] reflect the “pure” phenotypic effect of the APOE p.L167del mutation. While APOE has now been added as a bona fide gene that should be considered in the differential diagnosis of ADH, it is another example of how WES led investigators to an already well-known gene.
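The two labels for the same APOE mutation, p.L167del and p.L149del, differ by 18 residue positions. A minimal sketch of converting between such numbering conventions follows; it assumes the difference reflects counting with or without an 18-residue signal peptide, which the text does not state explicitly, and the function name is hypothetical.

```python
import re


def shift_residue(variant: str, offset: int) -> str:
    """Shift the residue number in a simple label such as 'p.L167del'."""
    match = re.fullmatch(r"p\.([A-Z])(\d+)([A-Za-z]+)", variant)
    if match is None:
        raise ValueError(f"unsupported variant label: {variant!r}")
    residue, position, suffix = match.groups()
    new_position = int(position) + offset
    if new_position < 1:
        raise ValueError("offset moves the residue before position 1")
    return f"p.{residue}{new_position}{suffix}"


# Offset inferred from the two labels in the text (167 - 149 = 18).
APOE_OFFSET = 18
print(shift_residue("p.L167del", -APOE_OFFSET))
```

The same kind of offset may explain why the LPL variant appears as p.P234L in the text and as p.P207L in Table 2 (a difference of 27), although the source does not say so.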

Heterozygous APOB Mutation in Hypobetalipoproteinemia with Steatosis
A large Italian kindred was reported in 2013, in which depressed LDL cholesterol levels, hepatosteatosis, and liver cancer appeared to display an autosomal dominant pattern of inheritance [37•]. Low plasma levels of total cholesterol and fatty liver were observed in the 25-year-old female proband and in ten additional family members, whose liver involvement ranged from no pathology to cirrhosis to fatal hepatocellular carcinoma. WES was performed in two affected individuals using the Illumina platform. About 8.25 billion nucleotide bases were read per sample, and ~22,400 single nucleotide variants were identified in each sample. After variant filtering using criteria similar to those above, ~300 novel variants were shared between affected individuals. Among these, a heterozygous nonsense variant, p.K2240X in exon 26 of APOB, was identified and confirmed by Sanger sequencing. In studies of 16 members of the extended family, the mutation co-segregated with low LDL cholesterol. Again, this example shows how WES was used to discover a novel nonsense mutation in a gene already known to cause hypobetalipoproteinemia [38].
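The total base counts quoted in these reports reflect each position being read many times. A rough way to relate total bases to average read depth is sketched below. The function and its parameters are hypothetical, and the target size and on-target fraction in the example are placeholder assumptions, because the capture design and on-target rate are not given in the text.

```python
def mean_depth(total_bases: float, target_size: float,
               on_target_fraction: float = 1.0) -> float:
    """Approximate average depth: usable sequenced bases divided by target size."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    if not 0 < on_target_fraction <= 1:
        raise ValueError("on_target_fraction must be in (0, 1]")
    return total_bases * on_target_fraction / target_size


# ~8.25 billion bases per sample is from the text; the 50 Mb target and the
# 60 % on-target fraction are assumed values chosen only for illustration.
wes_depth = mean_depth(8.25e9, target_size=50e6, on_target_fraction=0.6)
print(f"approximate mean exome depth: {wes_depth:.0f}x")
```

Under these assumed values the estimate is roughly 99-fold. Applying the same arithmetic to the ~138 billion bases of the WGS study, with an assumed genome size of about 3.1 billion bases, gives roughly 45-fold average depth.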


Homozygous LIPA Mutation in Severe Recessive Hypercholesterolemia
In 2013, a family ascertained in Holland displayed a phenotype consistent with autosomal recessive hypercholesterolemia, but did not have mutations in LDLRAP1 or indeed any classical FH genes [39•]. WES was performed in the proband, her father and her affected brother using the Illumina platform. More than 4 billion bases were sequenced from each sample, and ~37,000 single nucleotide variants were identified in each sample. Filtering steps reduced this to two candidate single nucleotide substitutions, including a homozygous mutation in exon 8 of the LIPA gene affecting the splice donor site, referred to as c.894G>A and also known as E8SJM. However, this mutation had previously been reported as a recurring variant in multiple independent families with cholesterol ester storage disease (CESD), a highly distinctive phenotype. The authors then noninvasively measured hepatic cholesterol content using magnetic resonance spectroscopy and observed abnormal hepatic accumulation of cholesterol in the homozygous individuals, consistent with a key feature of CESD but not ADH. They then genotyped E8SJM in >27,000 individuals and found no association with plasma lipids or cardiovascular disease risk in heterozygotes. Again, this example shows how WES identified a previously reported splice-site mutation in a gene already known to cause LDL cholesterol elevation, albeit in the context of different clinical expression of related phenotypes [39•].


is a secondary feature, such as inherited forms of diabetes [42, 43]; and 2) encode proteins identified as important in lipoprotein metabolism but for which no human disease-causing variants have yet been reported. The panel also contains reagents to detect common single nucleotide polymorphisms (SNPs) that have been shown in GWAS studies to be associated with subtle variations in plasma lipids [44], in order to allow construction of polygenic genetic risk scores for plasma lipoprotein traits. An important component of LipidSeq was the development of a bioinformatic pipeline to allow each DNA variant detected in a patient sample to be assessed for likely pathogenicity based on searches for prior publication and of databases of genomic variants to determine frequency in normal populations [40•]. Another key feature is bioinformatic modeling to predict possible functional consequences of the variant [40•]. Quality control experiments indicate that the LipidSeq panel performs well compared to Sanger sequencing, at