Mutations Causing Complex Disease May under Certain ... - Core

0 downloads 17 Views 3MB Size Report
Jul 10, 2015 - ogy of a complex disease can, under certain circumstances, be depleted among patients. Populations with defined disease prevalence were ...

RESEARCH ARTICLE

Mutations Causing Complex Disease May under Certain Circumstances Be Protective in an Epidemiological Sense Sabine Siegert1,2, Andreas Wolf3, David N. Cooper4, Michael Krawczak3‡, Michael Nothnagel1,3‡* 1 Cologne Center for Genomics, University of Cologne, Cologne, Germany, 2 Institute of Epidemiology, Christian-Albrechts University, Kiel, Germany, 3 Institute of Medical Informatics and Statistics, ChristianAlbrechts University, Kiel, Germany, 4 Institute of Medical Genetics, School of Medicine, Cardiff University, Cardiff, United Kingdom ‡ These authors are joint senior authors on this work. * [email protected]

Abstract OPEN ACCESS Citation: Siegert S, Wolf A, Cooper DN, Krawczak M, Nothnagel M (2015) Mutations Causing Complex Disease May under Certain Circumstances Be Protective in an Epidemiological Sense. PLoS ONE 10(7): e0132150. doi:10.1371/journal.pone.0132150 Academic Editor: Amanda Ewart Toland, Ohio State University Medical Center, UNITED STATES Received: September 2, 2014 Accepted: June 10, 2015 Published: July 10, 2015 Copyright: © 2015 Siegert et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability Statement: All relevant data are within the paper and its Supporting Information files. Funding: This study was supported by the National Genome Research Network (NGFNplus 01GS0809) and the DFG Excellence Cluster ‘Inflammation at Interfaces’ (EXC 306 and 306/2). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Guided by the practice of classical epidemiology, research into the genetic basis of complex disease has usually taken for granted the dictum that causative mutations are invariably over-represented among clinically affected as compared to unaffected individuals. However, we show that this supposition is not true and that a mutation contributing to the etiology of a complex disease can, under certain circumstances, be depleted among patients. Populations with defined disease prevalence were repeatedly simulated under a WrightFisher model, assuming various types of population history and genotype-phenotype relationship. For each simulation, the resulting mutation-specific population frequencies and odds ratios (ORs) were evaluated. In addition, the relationship between mutation frequency and OR was studied using real data from the NIH GWAS catalogue of reported phenotype associations of single-nucleotide polymorphisms (SNPs). While rare diseases (prevalence 1, up to 20% of mutations causing a pandemic disease (prevalence 10–20%) had ORs165 common human diseases have been identified by GWAS [4–6]. Although most of the genotype-phenotype associations reported from GWAS were modest at best, even if successfully replicated [4,7], and often involved variants located outside functionally relevant gene regions, these studies have nevertheless provided important new insights into the etiology of complex human diseases [4,8]. This notwithstanding, genetic risk factors hitherto identified by GWAS usually account for only a small fraction of the heritability of the disease in question. Indeed, for the vast majority of conditions studied, >90% of the population-level phenotypic variation has remained unaccounted for by reference to any known genetic variation [6,9]. Several explanations for this outcome have been put forward, including a major role for rare as yet unidentified variants, structural features such as copy-number variation (CNVs), power loss due to excessive multiple testing, locus and allelic heterogeneity, parent of origin effects and gene-gene as well as gene-environment interactions [4,5,10,11]. In addition, a number of sometimes counterintuitive phenomena have been described to occur in the context of genetic disease associations. For example, risk alleles often change their role from predisposing to protective and vice versa in different populations, thereby spawning so-called ‘flip-flop associations’ that probably reflect gene-gene interaction and differential linkage disequilibrium [12]. Moreover, a ‘synthetic association’ [13] between a disease and a common non-causative variant can arise as a consequence of the chance accumulation of several rare causative variants in cis to the latter [14–16]. Finally, it has been demonstrated by simulation that, in multi-locus systems of correlated causal and non-causal genetic factors, an ‘indirect association’ between a non-causal variant and a disease can arise even in the absence of any causative genes in the vicinity [17]. Here, we report a population history mechanism that may obfuscate the causal relationship between a genetic variant and a (common) human disease. In classical epidemiology, making inference about disease causation is usually equivalent to ascertaining whether or not a given exposure confers a relative risk greater than unity. However, this process of reasoning (also enshrined in the concept of ‘attributable fractions’ [18]) may be confounded by other unmeasured causal factors. In particular, a causative exposure may pose as a protective factor rather than a risk factor if it is negatively correlated with its causal complement, defined as those other risk factors or combinations of risk factors that are both necessary and sufficient for the exposure in question to cause disease [18]. In genetic epidemiology, genotypes play the role of risk factors and, since many human diseases follow a complex oligogenic or polygenic mode of inheritance [19], causal complementation is probably omnipresent (see Fig 1). Moreover, all humans share a common genealogy which induces varying levels of inter- and intra-individual dependency between genotypes [20]. In view of the above, any uncritical adoption of the paradigm from classical epidemiology that disease-causing genetic variants should invariably be over-represented in the clinically affected portion of the population under study (i.e. that they confer a relative disease risk exceeding unity) [21] may be misleading. In fact, we show by simulation that, depending upon the specifics of the genealogy of the population of interest, a causative mutation may well appear “protective” in an epidemiological sense. Normally, the functional follow-up of such a variant would end up being inconclusive simply because its damaging character would be difficult to reconcile with its protective effect. In many instances, the variant would then be erroneously excluded from further consideration. In order to lend additional support to our

PLOS ONE | DOI:10.1371/journal.pone.0132150 July 10, 2015

2 / 21

Misreading Effect Measures in Genetic Epidemiology

conclusions, we also present empirical evidence for the presence, in real GWAS, of an inverse relationship between epidemiological effect size and disease prevalence as would be predicted by the above mechanism of confounding.

Fig 1. Illustration of a genealogy-actuated negative correlation between causative mutations. The underlying penetrance model assumes that two or more causative mutations are required for an individual to become affected (with certainty). The blue mutation is depleted among affected individuals in this population of four and, therefore, is protective in an epidemiological sense. The other mutations are enriched in cases. doi:10.1371/journal.pone.0132150.g001

PLOS ONE | DOI:10.1371/journal.pone.0132150 July 10, 2015

3 / 21

Misreading Effect Measures in Genetic Epidemiology

Materials and Methods Simulations We investigated various models of population history and penetrance to determine the impact upon epidemiological effect size of a genealogy-related negative correlation between diseasecausing mutations. To this end, we repeatedly simulated the genealogy of a population of constant size (N = 10,000) following a coalescent approach [22]. The coalescent is a well-established means of modeling population histories which, compared to forward-time simulation, is both easier to implement and computationally more efficient [23]. We initially adopted a Wright-Fisher model [24] of a single haploid locus was initially adopted in order to model the evolution of functionally relevant variation within genes or gene regions small enough to escape internal meiotic recombination. Subsequently, we extended our simulations to oligogenic models of inheritance involving 2, 5 or 10 unlinked and hence independently evolving loci in order to assess the possible influence of recombination on our results as well. After the simulation of each locus-specific genealogy, causative mutations were randomly placed on the branches of each coalescence tree with probabilities proportional to the respective branch lengths. Once all locus-specific trees and the distribution of mutations on the trees had been determined in a given simulation, each leaf (i.e. haploid individual) was randomly assigned a dichotomous disease state with probability P(k), where k denotes the number of causative mutations present at all loci combined. We considered two biologically plausible penetrance models that have been widely used for similar analyses before [25–27]. More specifically, we used a oneparameter multiplicative model (Fig 2A), k

PðkÞ ¼ 1  ð1  gÞ with 0 < g < 1;

ð1Þ

Fig 2. Disease risk as a function of the number of mutations present under two different penetrance models. (A) Multiplicative model with parameter γ = 0.3 (blue) or γ = 0.1 (red). (B) Logistic model with parameters α = -5; β = 1 (blue) or α = -5; β = 0.5 (red). doi:10.1371/journal.pone.0132150.g002

PLOS ONE | DOI:10.1371/journal.pone.0132150 July 10, 2015

4 / 21

Misreading Effect Measures in Genetic Epidemiology

which implies a steep increase in disease probability with increasing mutation number, and a two-parameter logistic model (Fig 2B), logit½PðkÞ ¼ a þ b  k with b > 0;

ð2Þ

invoking tolerance against a small number of mutations. Both the multiplicative model and the logistic model assume that mutations exert their additive effects on the respective scale. However, whereas individuals lacking a mutation (i.e. k = 0) would be unaffected under the multiplicative model, their (baseline) risk equals 1/(1+e-α)>0 under the logistic model. For both models, we chose parameter values inducing either a strong or a weak increase in disease probability with mutation number. We also modeled oligogenic disease etiologies that involved 2, 5 or 10 unlinked loci (e.g. genes on different chromosomes), respectively. Since unlinked loci evolve essentially independently in outbred populations, they also have nearly independent genealogies [28]. For this reason, we repeatedly simulated the unrelated history of 2, 5 or 10 haploid loci, respectively, and randomly assigned each individual a dichotomous disease state with probability P({k1,. . .,km}), where ki denotes the number of causative mutations at the i-th locus. Again, we considered both a multiplicative model, Pm k Pðfk1 ; . . . ; km gÞ ¼ 1  ð1  gÞ i¼1 i with 0 < g < 1 ð3Þ and a logistic model, logit½Pðfk1 ; . . . ; km gÞ ¼ a þ b 

Pm

k with b > 0;

i¼1 i

ð4Þ

both of which assume the same contribution to the disease risk by all m loci. Only those simulations with disease prevalence within a pre-specified range were selected for further analysis. Simulations were repeatedly performed until 1,000 such datasets had been obtained per prevalence range. For each dataset, the mutation-specific population frequencies and odds ratios (ORs) were calculated, except for those mutations that were exclusively present in either cases or controls. Simulations were carried out with in-house scripts for Perl and R (S1 File). We used R v3.0.3 [29] for statistical analysis and for creating graphs.

Analysis of GWAS data In order to investigate whether the results of our inevitably simplifying simulations were consistent with real data, we scrutinized the phenotype associations of single-nucleotide polymorphisms (SNPs) compiled in the NIH GWAS catalogue [30] (http://www.genome.gov/ 26525384). Of the 14,869 records logged as of 13 November 2013, some 3,479 contained complete information on the trait-associated SNP allele, the minor allele frequency (MAF), OR and statistical significance (i.e. the p value). We merged reports with synonymous trait names (S1 Table) and, in order to improve data quality, we considered further only those traits for which at least 10 (not necessarily different) significant SNP associations (p18 years) were included in our analysis. We adopted two different significance thresholds for our analysis, namely (i) p

Suggest Documents