Bayesian logistic regression using a perfect phylogeny

Biostatistics Advance Access published March 23, 2006

Biostatistics (2006), 1, 1, pp. 1–22 Printed in Great Britain

BY TAANE G CLARK
Department of Epidemiology and Public Health, Imperial College, London, UK
e-mail: [email protected]

MARIA DE IORIO
Department of Epidemiology and Public Health, Imperial College, London

ROBERT C GRIFFITHS

SUMMARY
Haplotype data capture the genetic variation among individuals in a population and among populations. An understanding of this variation and the ancestral history of haplotypes is important in genetic association studies of complex disease. We introduce a method for detecting associations between disease and haplotypes in a candidate gene region or candidate block with little or no recombination. A perfect phylogeny demonstrates the evolutionary relationship between single-nucleotide polymorphisms (SNPs) in the haplotype blocks. Our approach extends the logic regression technique of Ruczinski et al. (2003) to a Bayesian framework, and constrains the model space to that of a perfect phylogeny. Environmental factors, as well as their interactions with SNPs, may be incorporated into the regression framework. We demonstrate our method on simulated data from a coalescent model, as well as data from a candidate gene study of sarcoidosis.

Some key words: Logic regression, gene tree, haplotype–association, Gibbs sampling, SNP data.

1. INTRODUCTION
The identification of genes involved in complex disease has the potential to reduce its prevalence and morbidity through the development of new therapies, as well as early screening. A dense genome map consisting of nearly four million single-nucleotide polymorphisms (SNPs), 0.1% of the genome, will become available to facilitate this task (Cardon and Abecasis, 2003). SNPs are single-base variations in the genetic code that occur approximately once every 1000 bases along the roughly three billion bases of the human genome. The patterns in the complete set of SNPs, that is, the structure of the genome, are an area of current investigation and remain uncertain. However, recent studies of human populations suggest that the genome consists of chromosome segments that are ancestrally conserved ('haplotype blocks') and have discrete boundaries defined by recombination hot spots (Daly et al., 2001; Gabriel et al., 2002; Reich et al., 2002). Because of the block structure, haplotype-based methods that exploit the historical development of distinct haplotypes appear to offer a promising approach to disease mapping (Seltman et al., 2003). The evolutionary history of a sample of haplotypes can be represented by a coalescent tree (Kingman, 1982), with disease mutations embedded in the ancestral history of distinct haplotypes. We will assume the blocks are consistent with a perfect phylogeny, where there is no recombination within

© The Author 2006. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected]

Downloaded from http://biostatistics.oxfordjournals.org/ by guest on May 30, 2013

Department of Statistics, University of Oxford, Oxford


Taane G Clark et al.

the block and SNPs are in high linkage disequilibrium (LD), that is, highly correlated or showing a high degree of non-random association with each other. The perfect phylogeny, a population genetic constraint, means that the haplotypes can be displayed in a gene tree (Griffiths, 2001) which is unique in its topology. This tree demonstrates the evolutionary relationships between SNPs in the haplotypes, and imposes constraints on the relationship between the SNPs and their potential associations with disease or the outcome of interest (called the phenotype).

Complex diseases have both genetic and environmental components, and we consider a Bayesian (logistic) regression framework to model both types of effect. Advantages of a regression framework include the interpretability of coefficients, diagnostics, and its multiple extensions. A Bayesian framework offers advantages in terms of model selection and incorporating prior information from earlier empirical studies. We consider a situation where a number of potential environmental factors and multiple haplotype blocks, and interactions between the two, may be associated with a binary outcome: diseased or not diseased. These types of data may result from case–control studies, a study design being increasingly applied in genetic association studies using population-based data.

The (conditional) logistic model is the analysis tool of choice for binary response data, due to the attractive interpretation of the regression coefficients in terms of the change in log odds of one class (diseased) over another (non-diseased) for a unit change in the associated covariate. The natural likelihood to use for a case–control study is a 'retrospective' likelihood, i.e. a likelihood based on the probability of exposure given disease status. Prentice and Pyke (1979) showed that, when a logistic regression form is assumed for the probability of disease given exposure (and co-factors are treated completely non-parametrically), the maximum likelihood estimators and asymptotic covariance matrix of the log odds ratios obtained from the retrospective likelihood are the same as those obtained from the 'prospective' likelihood, i.e. that based on the probability of disease given exposure.

Evidence is mounting that multiple (and not single) mutations within a gene, occurring on the same chromosome, can have a large effect on phenotype (Seltman et al., 2003). In our regression framework, combinations of SNPs within haplotype blocks are represented by logic trees (Ruczinski et al., 2003), structures that are particularly useful for representing interactions. Our ultimate goal is to find SNPs, environmental factors and interactions between them that are associated with disease.

In section 2 we describe how logic trees model interactions between SNPs. In section 3 we describe the perfect phylogeny (or equivalently the gene tree) constraint on the model space of the logic trees. Section 4 describes how trees and environmental effects are incorporated into a Bayesian logistic regression framework. The method is demonstrated on simulated data from a coalescent model in section 5, as well as data from a candidate gene study of sarcoidosis in section 6.

The focus is on haplotype data consisting of SNPs. Association analyses using haplotypes tend to assume: (a) an additive effect of the pair of alleles, (b) Hardy–Weinberg equilibrium for each SNP, which ensures that haplotypes for individuals are independent (Sasieni, 1997), and (c) that the phase of haplotypes is known (Stephens and Donnelly, 2003). However, we discuss how phasing of haplotypes may be incorporated, as well as how genotypic data may be used to determine the mode of disease inheritance. Other extensions to our methodology, such as modelling a continuous phenotype, are also discussed.

2. LOGIC TREES
A locus is a specified site or short region on a chromosome. Complex traits may be caused by the interaction of many loci, each with varying effect. Patterns of interactions between several loci, for example, disease phenotype caused by locus A and locus B, or A but not B, or A and (B or C), make identification of the loci involved more difficult. Ruczinski et al. (2003) describe a flexible framework, called logic regression, for searching the potentially large space of interactions and constructing Boolean combinations of loci. These Boolean combinations or logic trees are predictors within a regression model, hence the term logic regression.

Consider a dataset consisting of $n$ individuals with $2n$ chromosomes or haplotypes consisting of SNPs. Let $S_i$ denote SNP $i$, a binary variable which takes value one for a mutant type and zero for the wild-type or normal allele. The wild-type variant of $S_i$ is also called the conjugate ($S_i^c$). An example of a logic tree $T$ is $T = S_1^c \wedge (S_2 \vee S_3)$, where $\vee$ is an OR and $\wedge$ is an AND operator. $T$ denotes the condition in which the $S_2$ mutant, the $S_3$ mutant, or both are present together with the wild-type variant of $S_1$. $T$ is a column vector with dimension $2n$, whose elements $T_i$ take value one if the Boolean expression is true for the $i$-th haplotype, and zero otherwise. The interpretation of the coefficient of $T$ within a logistic regression model is a log odds ratio. Logic trees can be presented graphically (see figure 1), where SNPs are leaves on branches with operators (e.g. AND or $\wedge$, OR or $\vee$), and may be modified using the following steps:

1. Birth step
• Growth at an operator, e.g. $T = S_1^c \wedge S_5 \wedge (S_2 \vee S_3)$
• Growth at a SNP, e.g. $T = S_1^c \wedge [S_2 \vee (S_3 \wedge S_6^c)]$

2. Death step
• Deleting a SNP, e.g. $T = S_1^c \wedge S_2$

3. Move step
• Changing SNPs, e.g. $T = S_1^c \wedge (S_4^c \vee S_3)$
• Changing operators, e.g. $T = S_1^c \wedge (S_2 \wedge S_3)$.

Using the logic tree representation and the modifying operations on trees above, the implementation of Ruczinski et al. (2003) adaptively selects a model of predetermined size using a (stochastic) simulated annealing algorithm. The best scoring model generally over-fits the data, and cross-validation is used to predetermine the number of Boolean expressions and total number of leaves in the model.

Our approach differs from Ruczinski et al. (2003) in at least two fundamental ways. Firstly, we implement a type of logic regression within a Bayesian framework, which does not require predetermination of the number of trees and maximum number of SNPs in the model. Bayesian model selection procedures are automatic Ockham's razors, favouring simpler models over more complex ones when the data provide roughly comparable model fits. Secondly, we incorporate evolutionary or population genetic information into the model, potentially increasing the power to detect associations (Seltman et al., 2003). There is evidence that the human genome can be parsed into haplotype blocks, sizeable regions over which there is little evidence for historical recombination (Gabriel et al., 2002). We assume a coalescent model (Kingman, 1982) of evolution within haplotype blocks. A standard population genetics assumption for mutations at SNP sites is the infinitely-many-sites model, where point mutations occur at distinct sites with no back or parallel mutations (Griffiths, 2001). It follows that a natural structure to consider for haplotype blocks is a gene tree constructed from the configuration of mutations at SNP sites in the block. The unique gene tree topology is equivalent to a perfect phylogeny (Griffiths, 2001). Logic trees consistent with the perfect phylogeny are used to represent characteristics of blocks in the logistic regression model.
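As a minimal illustration of how a logic tree becomes a regression predictor, the sketch below evaluates the example tree $T = S_1^c \wedge (S_2 \vee S_3)$ from this section on a handful of haplotypes. The haplotype values and the function name `example_tree` are assumptions made for illustration only; they are not part of the authors' implementation.

```python
# Hypothetical haplotypes: each tuple is (S1, S2, S3), 1 = mutant, 0 = wild-type.
haplotypes = [
    (0, 0, 0),
    (1, 0, 0),
    (0, 1, 0),
    (0, 1, 1),
]

def example_tree(s1, s2, s3):
    """Evaluate T = S1^c AND (S2 OR S3) for one haplotype."""
    return int((not s1) and (s2 or s3))

# T is the 0/1 predictor column that would enter the logistic regression.
T = [example_tree(*h) for h in haplotypes]
print(T)  # [0, 0, 1, 1]
```

Each birth, death or move step changes the Boolean expression, and therefore this column, while the rest of the regression is left unchanged.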

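[Figure 1: tree diagram with an AND node at the root joining the leaf $S_1^c$ and an OR node over the leaves $S_2$ and $S_3$.]

Fig. 1. Logic tree for $T = S_1^c \wedge (S_2 \vee S_3)$, where $\vee$ = OR and $\wedge$ = AND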

3. PERFECT PHYLOGENIES

If mutation and recombination rates are not high, it is reasonable to assume that the haplotypes observed within a block have evolved according to a perfect phylogeny, in which at most one mutation event has occurred at any site (the infinitely-many-sites model (Griffiths, 2001)) and no recombination has occurred in the given region. Consider an incidence matrix for our haplotype data, that is, a matrix whose rows represent the unique haplotypes in the data and whose columns represent individual SNPs with two base types (see figure 2). Data are compatible with a rooted perfect phylogeny if, and only if, for any two SNPs (columns) in the incidence matrix, not all three combinations (01, 10, 11) between them exist. Recombination leads to the possible existence of all three combinations. If the perfect phylogeny condition holds between all pairwise combinations of SNPs, it is possible to construct a gene tree (Griffiths, 2001). This unique tree describes the ancestry of a sample of $2n$ haplotypes, consisting of at most $m + 1$ distinct haplotypes (lineages or branches), where $m$ is the number of segregating (i.e. mutant) SNPs. Gusfield's algorithm (Gusfield, 1991) is used to construct the topology of the gene tree. An example of a gene tree based on haplotypes from a perfect phylogeny consisting of three SNPs is shown in figure 2. Knowledge of which type at each SNP is the mutant determines the root. In general, an out-group might be used to determine the root, or the most frequent haplotype taken as the root (Griffiths, 2001). The model in this paper assumes rooted gene trees; however, a similar approach with unrooted gene trees could be taken. Recombination, parallel or back mutations, gene conversion or genotypic misclassification can cause the perfect phylogeny condition to be violated.
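The pairwise condition above can be checked directly on an incidence matrix. The following is a minimal sketch, assuming haplotypes are stored as tuples of 0/1 alleles; the function name `is_perfect_phylogeny` and the extra "recombinant" haplotype are hypothetical and serve only to show a violation. It tests compatibility only and does not build the gene tree (for which the text uses Gusfield's algorithm).

```python
from itertools import combinations

def is_perfect_phylogeny(incidence):
    """Pairwise test for a rooted perfect phylogeny.

    `incidence` is a list of distinct haplotypes, each a tuple of 0/1 alleles
    (0 = wild-type, 1 = mutant). The data are compatible when no pair of
    SNP columns shows all three combinations 01, 10 and 11.
    """
    n_snps = len(incidence[0])
    for i, j in combinations(range(n_snps), 2):
        seen = {(h[i], h[j]) for h in incidence}
        if {(0, 1), (1, 0), (1, 1)} <= seen:
            return False
    return True

# The four haplotypes a, b, c, d of figure 2 (columns S1, S2, S3).
figure2 = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 1, 1)]
# Adding a haplotype that carries both S1 and S2 breaks the condition.
recombinant = figure2 + [(1, 1, 0)]

print(is_perfect_phylogeny(figure2))      # True
print(is_perfect_phylogeny(recombinant))  # False
```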
Data that violate the perfect phylogeny condition may be: (a) pruned using an algorithm that deletes haplotypes or SNPs or a combination of both to give a reduced set of data consistent with a gene tree (Griffiths, 2001), or (b) partitioned into blocks consisting of perfect phylogenies using a block construction approach, such as the imperfect phylogeny method (Halperin and Eskin, 2003).

Haplotype   S1   S2   S3
a            0    0    0
b            1    0    0
c            0    1    0
d            0    1    1

[Gene tree not reproduced: haplotype a is the root; mutation 1 leads to b; mutation 2 leads to c; mutation 3, below mutation 2, leads to d.]

Fig. 2. Incidence matrix for four distinct haplotypes (a, b, c, d), where 0 = wild-type allele, 1 = mutant allele (top); gene tree for the data in (top), where the numbers refer to mutations at $S_1$, $S_2$ and $S_3$ respectively (bottom)

Because sets of nearby SNPs on the same chromosome may be inherited in haplotype blocks within each of which there is little evidence of recombination (Gabriel et al., 2002), it is reasonable to assume that each haplotype block is consistent with a gene tree. By also assuming an infinitely-many-sites model, logic trees representing SNP configurations in haplotypes must be contained in sub-trees of the gene tree. Under the infinitely-many-sites model, a new mutation occurs at a site never previously mutant and, consequently, mutations on differing branches of the gene tree cannot appear in the same haplotype. For example, in figure 2 we could not observe a haplotype with mutations at both loci 1 and 3, or at both loci 1 and 2. This observation implies that mutations occurring on differing branches of a gene tree cannot be joined by an $\wedge$ operator in a logic tree. The infinitely-many-sites model therefore imposes constraints on the combination of mutant types of SNPs. For the data in figure 2, the logic tree $S_1 \wedge S_3$ is not possible because a haplotype containing mutations at both SNPs 1 and 3 cannot exist in the gene tree. A logic tree with the expression $S_1 \wedge S_3^c$ is compatible with the gene tree because a haplotype with a mutant SNP 1 and non-mutant SNP 3 can exist. In general, a logic expression $S_{i_1} \wedge S_{i_2} \wedge \cdots \wedge S_{i_k} \wedge S_{j_1}^c \wedge S_{j_2}^c \wedge \cdots \wedge S_{j_l}^c$ is consistent with the gene tree if and only if there is a haplotype in the sample with mutant sites at $i_1, i_2, \ldots, i_k$ and wild-type sites at positions $j_1, j_2, \ldots, j_l$.

Using a gene tree constraint reduces the possible moves in the model space, and so speeds up the model selection process. The reduction in the number of possible SNP combinations to consider is a major point of the model, as well as a genuine attempt to include evolutionary information from the sample. In the next section, the logic trees and the gene tree constraint for haplotype blocks are included with environmental factors in a Bayesian logistic model.

4. BAYESIAN LOGISTIC FRAMEWORK

4.1. Basic model
We consider a situation where a number of potential environmental factors and multiple haplotype blocks, and interactions between the two, may be associated with a binary outcome: diseased or not diseased. Prior knowledge is often available about the effects of particular combinations of SNPs, environmental factors, or interactions between them. Hence, we consider a Bayesian logistic regression model,

\[
\begin{aligned}
y_i &\sim \mathrm{Bernoulli}(g^{-1}(\eta_i)) \\
\eta_i &= X_i\beta + T_i\alpha \\
\beta &\sim \pi(\beta) \\
\alpha &\sim \pi(\alpha) \\
T_j &\sim \pi(T_j),
\end{aligned}
\tag{1}
\]

where $y_i \in \{0, 1\}$, $i = 1, \ldots, 2n$, is a binary response variable for a collection of $2n$ haplotypes with an associated intercept and $p$ environmental measurements $X_i = (x_{i0}, x_{i1}, \ldots, x_{ip})$. There are $q$ logic trees $T_j$, $j = 1, \ldots, q$, constructed from $n_{SNP}$ possible SNPs; $T = (T_1, T_2, \ldots, T_q)$ and $T_i = (T_{i1}, \ldots, T_{iq})$; $g(u) = \log(u/(1 - u))$ is the logistic link function, $\eta_i$ is the linear predictor, and $\beta_{p \times 1}$ and $\alpha_{q \times 1}$ represent regression coefficients for environmental factors and logic trees respectively, with prior distributions $\pi(\beta)$ and $\pi(\alpha)$. Unlike normal linear regression, inference on $\alpha$ and $\beta$ is complicated by the fact that no conjugate priors exist. Holmes and Held (2003) consider a latent variable representation of the logistic model, and (1) may be rewritten as

\[
\begin{aligned}
y_i &= \begin{cases} 1, & z_i > 0 \\ 0, & \text{otherwise} \end{cases} \\
z_i &= X_i\beta + T_i\alpha + \varepsilon_i \\
\varepsilon_i &\sim N(0, \lambda_i) \\
\lambda_i &= (2\psi_i)^2 \\
\psi_i &\sim \mathrm{KS} \\
\beta &\sim \pi(\beta) \\
\alpha &\sim \pi(\alpha) \\
T_j &\sim \pi(T_j),
\end{aligned}
\tag{2}
\]

where KS is the Kolmogorov–Smirnov distribution (Devroye, 1986). Using this framework, conjugate priors are available for the conditional likelihood and full posterior inference can be performed using a block Gibbs sampler (Holmes and Held, 2003). In particular, in the case of priors $\pi(\beta) = N(m_\beta, v_\beta)$ and $\pi(\alpha) = N(m_\alpha, v_\alpha)$,

\[
\begin{aligned}
\beta \mid z, \lambda, y, T, \alpha &\sim N(\hat\beta, V_\beta) \\
\tilde z_\beta &= z - T\alpha \\
\hat\beta &= V_\beta\,(v_\beta^{-1} m_\beta + X' W \tilde z_\beta) \\
V_\beta &= (v_\beta^{-1} + X' W X)^{-1}, \\
\alpha \mid z, \lambda, y, X, \beta &\sim N(\hat\alpha, V_\alpha) \\
\tilde z_\alpha &= z - X\beta \\
\hat\alpha &= V_\alpha\,(v_\alpha^{-1} m_\alpha + T' W \tilde z_\alpha) \\
V_\alpha &= (v_\alpha^{-1} + T' W T)^{-1} \\
W &= \mathrm{diag}(\lambda_1^{-1}, \ldots, \lambda_{2n}^{-1}).
\end{aligned}
\tag{3}
\]
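The Gaussian full conditional for $\beta$ in (3) translates almost line by line into code. The sketch below is a minimal NumPy illustration under stated assumptions: the function name `draw_beta`, the toy data and the prior settings ($m_\beta = 0$, $v_\beta = 10I$) are all hypothetical, and the latent $z$ and mixing variances $\lambda$ are simply supplied rather than sampled. The update for $\alpha$ has the same form with the roles of $X$ and $T$ exchanged.

```python
import numpy as np

def draw_beta(X, T, z, lam, alpha, m_beta, v_beta, rng):
    """One draw of beta from its Gaussian full conditional in (3)."""
    W = np.diag(1.0 / lam)                     # W = diag(1 / lambda_i)
    z_tilde = z - T @ alpha                    # remove the logic-tree part
    v_inv = np.linalg.inv(v_beta)
    V = np.linalg.inv(v_inv + X.T @ W @ X)     # conditional covariance V_beta
    beta_hat = V @ (v_inv @ m_beta + X.T @ W @ z_tilde)
    return rng.multivariate_normal(beta_hat, V), beta_hat, V

# Made-up example: 6 haplotypes, intercept plus one covariate, one logic tree.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(6), [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]])
T = np.array([[0.0], [0.0], [1.0], [1.0], [0.0], [1.0]])
z = np.array([-0.5, 0.3, 1.2, 2.0, -1.0, 1.5])   # latent variables, assumed given
lam = np.full(6, 2.0)                            # mixing variances, assumed given
alpha = np.array([0.8])

beta, beta_hat, V = draw_beta(
    X, T, z, lam, alpha,
    m_beta=np.zeros(2), v_beta=10.0 * np.eye(2), rng=rng,
)
print(beta_hat, beta)
```

For large samples one would avoid forming the dense $2n \times 2n$ matrix $W$ and instead scale the rows of $X$ by $1/\lambda_i$; the explicit form is kept here to mirror (3).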

Alternatively, we could update $\alpha$ and $\beta$ jointly. The full conditional distribution for $z_i$ is truncated normal,

\[
z_i \mid \beta, \alpha, X_i, T_i, y_i, \lambda_i \sim
\begin{cases}
N(X_i\beta + T_i\alpha, \lambda_i)\, I(z_i > 0), & y_i = 1 \\
N(X_i\beta + T_i\alpha, \lambda_i)\, I(z_i \le 0), & \text{otherwise.}
\end{cases}
\tag{4}
\]

$W$ is the precision matrix. With the model held fixed, the block Gibbs sampler processes $\lambda$, updating $W$, and then updates $z$, $\alpha$ and $\beta$ at each iteration. This algorithm is used to estimate the posterior distributions of $\alpha$ and $\beta$ in the selected model. The conditional distribution of the variance $\lambda_i$ does not have a standard form, but updating is achieved efficiently using rejection sampling, with the acceptance probability of a proposed $\lambda^*$ being

\[
\exp(0.5\lambda^*)\,\frac{1}{4}\,\lambda^{*-1/2}\,\mathrm{KS}\!\left(\frac{1}{2}\lambda^{*1/2}\right),
\tag{5}
\]

where $\mathrm{KS}(\cdot)$ denotes the Kolmogorov–Smirnov distribution density (Holmes and Held, 2003). In the next section we discuss the prior distribution defined on the space of logic trees.

4.2. Prior distribution on trees
In this section we focus on the prior for the parameters that define the tree structure. In specifying the prior on a logic tree we follow the Bayesian Classification and Regression Tree (CART) literature (Chipman et al., 1998; Denison et al., 1998). We define a probability distribution over the space of possible logic trees which satisfy the Perfect Phylogeny constraint (PPC). Any logic tree can be defined by the position of the Boolean operators present, the values they assume ($\wedge$ or $\vee$), together with the SNPs included and their respective values (0 or 1). Let $K$ be the number of SNPs in the tree. In the remainder of the section we denote the SNPs by $S_i$, $i = 1, \ldots, K$, and the operator variables by $O_i$. It is straightforward to show that if $K$ SNPs are included in a tree then there are exactly $K - 1$ operators in the tree.

Denison et al. (2002) argue that simple prior distributions are most appropriate when fitting complex non-linear models. We embrace this principle and assume that each tree structure with the same number of SNPs which also satisfies the PPC is equally likely. There is a natural hierarchical structure in this setup, which we formalise as

\[
p(K, S_1, \ldots, S_K, O_1, \ldots, O_{K-1}) = p(K)\,\bigl[p(S_1, \ldots, S_K, O_1, \ldots, O_{K-1} \mid K)\, I(\mathrm{PPC})\bigr],
\]

where $I(\cdot)$ is the usual indicator function that is equal to 1 if the PPC is satisfied and 0 otherwise; $p(S_1, \ldots, S_K, O_1, \ldots, O_{K-1} \mid K)$ is the probability of a particular tree configuration conditional on there being $K$ SNPs in the tree, and it is given by the uniform distribution defined over the space of all the possible tree configurations with $K$ SNPs. A geometric distribution with parameter $\theta$ is used to specify the prior distribution on the number of SNPs:

\[
p(K = k) = \theta(1 - \theta)^{k-1}, \qquad k = 1, 2, \ldots
\]

Therefore any logic tree must contain at least one SNP. We will assume that $\theta$ is a constant, but it is possible to specify a prior for it. Note also that Denison et al. (1998) prefer a truncated Poisson distribution, but we feel that a geometric distribution is more effective in favouring trees with a small number of SNPs. We also have a finite number of possible trees, as there cannot be more than $n_{SNP} - 1$ operator variables, where $n_{SNP}$ is the number of SNPs in the data set.
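Two ingredients of the tree prior are easy to sketch in code: the indicator $I(\mathrm{PPC})$ and the geometric prior on the number of SNPs. The sketch below is a hedged illustration, not the authors' implementation. It covers only pure conjunctions, using the rule from section 3 that a conjunction is consistent with the gene tree exactly when some sampled haplotype matches it; the function names and the value $\theta = 0.5$ are assumptions, and the uniform term over tree configurations with $K$ SNPs is not computed.

```python
import math

def satisfies_ppc(mutant_sites, wild_sites, haplotypes):
    """Indicator I(PPC) for a pure conjunction of SNPs.

    The conjunction S_i1 & ... & S_ik & S_j1^c & ... & S_jl^c is kept only if
    some observed haplotype is mutant at every i and wild-type at every j.
    """
    return any(
        all(h[i] == 1 for i in mutant_sites) and all(h[j] == 0 for j in wild_sites)
        for h in haplotypes
    )

def log_prior_size(k, theta):
    """log p(K = k) for the geometric prior theta * (1 - theta)**(k - 1)."""
    if k < 1:
        return -math.inf          # a logic tree must contain at least one SNP
    return math.log(theta) + (k - 1) * math.log(1.0 - theta)

figure2 = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 1, 1)]  # columns S1, S2, S3

print(satisfies_ppc([0, 2], [], figure2))   # S1 & S3   -> False
print(satisfies_ppc([0], [2], figure2))     # S1 & S3^c -> True
print(log_prior_size(2, 0.5))               # log of theta * (1 - theta)
```

Because trees that fail the constraint receive zero prior mass, proposals that violate it can be rejected before any likelihood is evaluated.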

4.3. Covariate selection
The model consists of environmental factors ($X$) and trees ($T$) consisting of SNPs. Rather than propose moves to both $X$ and $T$ and update $\beta$ and $\alpha$ jointly, we consider each separately. There are three possible moves in the environmental factor space, chosen with equal probability:

• Birth step: adding an environmental factor
• Death step: deleting an environmental factor
• Move step: adding and deleting different environmental factors, but retaining the same number of environmental factors.

A convenient approach to choosing a relevant set of environmental covariates is to adopt a prior distribution on the covariate matrix, $\pi(X)$, that places mass on the $2^p$ possible sub-models made up of different columns of $X$. In particular, consider the covariate indicator vector $\gamma = \{\gamma_1, \ldots, \gamma_p\}$, $\gamma_i \in \{0, 1\}$, $i = 1, \ldots, p$, such that $\gamma_i = 1$ if the $i$th environmental covariate is present in the model and $\gamma_i = 0$ if it is not. The intercept is always present in the model, i.e. $\gamma_0 = 1$. A prior on the model space can be specified via a prior on the indicator, $\pi(\gamma)$, and this is the same as the proposal distribution of $\gamma$. We select an environmental factor at random, and propose $\gamma_i^* = 1$ if the current $\gamma_i = 0$, or $\gamma_i^* = 0$ otherwise. A key advantage of using representation (2) is that when updating the environmental set defined by $\gamma$ it is possible to condition on $z$, $T$, $\alpha$ and jointly update the $\beta$s as well, from their full conditional distribution given the new model structure. Holmes and Held (2003) suggest using a Metropolis–Hastings (M–H) step to update the current set $\gamma$. Assuming a uniform prior on the environmental space, and a proposal in the same space that is the same as the prior, the new move $\gamma^*$ is accepted with probability

\[
\min\left\{1,\;
\frac{|V_{\beta(\gamma^*)}|^{0.5}\,|v_{\beta(\gamma)}|^{0.5}\,\exp\bigl(0.5\,\hat\beta_{\gamma^*}'\,V_{\beta(\gamma^*)}^{-1}\,\hat\beta_{\gamma^*}\bigr)}
     {|V_{\beta(\gamma)}|^{0.5}\,|v_{\beta(\gamma^*)}|^{0.5}\,\exp\bigl(0.5\,\hat\beta_{\gamma}'\,V_{\beta(\gamma)}^{-1}\,\hat\beta_{\gamma}\bigr)}
\right\},
\tag{6}
\]

where $\hat\beta_\gamma$ and $V_{\beta(\gamma)}$ are defined in (3), and the subscripts indicate that they are conditioned on the environmental factor set defined by $\gamma$. The acceptance probability resembles a Bayes factor of a standard Bayesian linear model. This implicit marginalisation of $\beta$ in the proposal step leads to efficient dimension sampling, as the $\beta$s are being updated from their full conditional distributions given the change to the covariate set.

After considering a move in environmental factor space, we propose a move in tree space. The space of trees is potentially large and the number of trees, $q$, is not known in advance, but the Bayesian framework contains a natural penalty against over-complex models. We assume that $q$ has a value between 0 and $t_{max}$, where $t_{max}$ is the maximum number of trees that can be included in the model. We assume a uniform prior for $q$. We consider two moves in tree space, chosen with equal probability:

• Adding a new logic tree consisting of one SNP
• Modifying an existing logic tree: choosing with equal probability between the birth, death or move steps for the logic trees. The birth and move steps must be consistent with the constraint of the gene tree.

In this setting, a death move from a singleton tree (a tree consisting of one SNP) will result in the deletion of the tree. Let $\phi$ denote the configuration of trees, and assume that the proposed move results in a new configuration $\phi^*$. By configuration, $\phi$, we simply mean the set of all logic trees included in the model and their topologies. As above, it is possible to condition on $z$, $X$, $\beta$ and jointly update the $\alpha$s as well,

from their full conditional distribution given the new model structure. In this context, the acceptance probability for the proposed move $\phi^*$ is

\[
\min\left\{1,\;
\frac{|V_{\alpha(\phi^*)}|^{0.5}\,|v_{\alpha(\phi)}|^{0.5}\,\exp\bigl(0.5\,\hat\alpha_{\phi^*}'\,V_{\alpha(\phi^*)}^{-1}\,\hat\alpha_{\phi^*}\bigr)\,\pi(\phi^*)}
     {|V_{\alpha(\phi)}|^{0.5}\,|v_{\alpha(\phi^*)}|^{0.5}\,\exp\bigl(0.5\,\hat\alpha_{\phi}'\,V_{\alpha(\phi)}^{-1}\,\hat\alpha_{\phi}\bigr)\,\pi(\phi)}
\right\},
\tag{7}
\]

where $\pi(\phi)$ denotes the joint prior distribution on the number of trees and on the tree structure. To incorporate interactions between SNPs and environmental effects, we extend $X$ to include such interactions, and process them in the same way as the environmental factors. For example, if there are 5 environmental measurements and 10 SNPs, the inclusion of all first order interactions requires 50 extra terms in $X$.

4.4. Summary of algorithm
Using the components discussed previously, our algorithm is:

0. Initialise all parameters in the model, including $\gamma$ and $\phi$.
1. Generate $\psi_i$ and $\lambda_i$, for $i = 1, \ldots, 2n$. Use equation (5) to accept or reject $\lambda_i^*$, and calculate $W$.
2. Generate $z$ from equation (4).
3. Propose a move amongst environmental factors and interactions, $\gamma^*$, and propose a move for $\beta \mid \gamma^*$. Calculate the acceptance probability using equation (6) and, if accepted, set $\gamma = \gamma^*$ and $\beta = \beta^*$.
4. Propose a move amongst trees, $\phi^*$, and propose a move for $\alpha \mid \phi^*$. Calculate the acceptance probability using equation (7) and, if accepted, set $\phi = \phi^*$ and $\alpha = \alpha^*$.
5. Repeat steps 1 to 4 until convergence.

4.5. Use of genotypes
Genotypes comprise the combined information for the two homologous chromosomes present in a diploid individual, and are the genetic information most commonly supplied for an individual following DNA typing. In a sample of $n$ individuals, a segregating SNP site will have at least two of the three possible genotypes 0/0, 0/1 and 1/1, where 1 refers to a mutant allele and 0 the wild-type. Genotypic data may be applied in our algorithm to determine the mode of inheritance for the disease. When compared to a haplotype analysis the number of observations is halved ($n$ instead of $2n$) and the number of potential predictors doubled. In particular, two predictors, say $S_{i1}$ and $S_{i2}$, are created for each SNP $i$ ($S_i$):

\[
S_{i1} = \begin{cases}
1, & \text{if site } i \text{ has two mutant alleles, i.e. 1/1} \\
0, & \text{if site } i \text{ has zero or one mutant alleles, i.e. 0/1 or 0/0;}
\end{cases}
\]

\[
S_{i2} = \begin{cases}
1, & \text{if site } i \text{ has one or two mutant alleles, i.e. 0/1 or 1/1} \\
0, & \text{if site } i \text{ has zero mutant alleles, i.e. 0/0.}
\end{cases}
\]

Whether $S_{i1}$ or $S_{i2}$ is present in the (final) logic trees depends on whether a recessive or dominant model, respectively, best fits the data. $S_{i1}$ and $S_{i2}$ cannot be in the same logic tree, but they may be in separate trees, effectively fitting an additive model on the logit scale or a multiplicative model on the odds scale.
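The genotype coding of section 4.5 can be written as a small helper. This is a minimal sketch; the function name `genotype_predictors` and the representation of a genotype as a count of mutant alleles are assumptions made for illustration.

```python
def genotype_predictors(n_mutant):
    """Map a genotype (0, 1 or 2 mutant alleles) to the pair (S_i1, S_i2)."""
    if n_mutant not in (0, 1, 2):
        raise ValueError("genotype must carry 0, 1 or 2 mutant alleles")
    s_i1 = int(n_mutant == 2)   # recessive coding: 1 only for genotype 1/1
    s_i2 = int(n_mutant >= 1)   # dominant coding: 1 for genotype 0/1 or 1/1
    return s_i1, s_i2

for label, count in [("0/0", 0), ("0/1", 1), ("1/1", 2)]:
    print(label, genotype_predictors(count))
# 0/0 (0, 0)
# 0/1 (0, 1)
# 1/1 (1, 1)
```

With both predictors in separate trees, the three genotypes receive linear predictors that differ by the coefficient of $S_{i2}$ and then by the coefficient of $S_{i1}$, which is how the additive model on the logit scale arises.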

4.6. Haplotype reconstruction
Another approach when using genotype data is to reconstruct the haplotype configuration or phase. The determination of phase uses the $n_{hap}$ possible sets of haplotypes ($H$) forming $n_{hap}$ perfect phylogenies (each represented by a gene tree $G$) constructed from the genotype data. For a perfect phylogeny this is possible using existing algorithms (Bafna et al., 2003; Eskin et al., 2003). We can model the uncertainty of phase determination by adding a new step in the hierarchy in model (1). In particular, $H$ becomes a random variable that can assume one of the $n_{hap}$ possible configurations. Note that in the case of a perfect phylogeny, the number of possible haplotype configurations, $n_{hap}$, is finite. A priori we assume each of these configurations is equally likely. Therefore, we can determine the posterior distribution of $H$ given the data, $P(H \mid y)$, by adding a new M–H step within our block Gibbs sampling framework.

$G$ is the gene tree associated with a set of haplotypes $H$. A new haplotype configuration, $H^*$, results in a new gene tree $G^*$, because it implies a different evolutionary model for the loci. Therefore, when we propose $H^*$ given the rest of the parameters in the model, we must ensure that the proposed haplotype configuration is consistent with the logic trees currently in the model. Also, by changing the haplotype configuration we modify the set of individuals that satisfy the relation defined by the logic tree, and therefore $T_i$, $i = 1, \ldots, 2n$. To simulate from the posterior distribution of $H$, we need to add a new step in the algorithm described in section 4.4. Let $H$ be the current haplotype configuration. Propose a new haplotype configuration $H^*$ from the set of haplotypes consistent with the logic trees already in the model. Accept $H^*$ with probability

\[
\min\left\{1,\;
\frac{\prod_{i=1}^{2n}\exp\bigl(-0.5\,(X_i\beta + T_i^{H^*}\alpha)'\,\lambda_i^{-1}\,(X_i\beta + T_i^{H^*}\alpha)\bigr)}
     {\prod_{i=1}^{2n}\exp\bigl(-0.5\,(X_i\beta + T_i^{H}\alpha)'\,\lambda_i^{-1}\,(X_i\beta + T_i^{H}\alpha)\bigr)}
\right\},
\tag{8}
\]

where $T^{H^*}$ and $T^{H}$ denote the dependence of the covariates $T_i$ on the phasing.

4.7. Continuous phenotypes
In many studies the outcome or phenotype $y$ is continuous. We will assume that $y$ (possibly after transformation) follows a Gaussian distribution. In this case we can use the standard Bayesian linear model (Lindley and Smith, 1972), where the model (1) can be specified by

\[
\begin{aligned}
y_i &= X_i\beta + T_i\alpha + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \\
\beta &\sim \pi(\beta), \qquad \alpha \sim \pi(\alpha), \qquad \pi(\sigma^2) = \mathrm{IG}(\alpha/2, \beta/2),
\end{aligned}
\tag{9}
\]

where IG is an inverse-gamma probability distribution. Posterior inference and model selection can be performed by slight modification of the algorithm in section 4.4.

4.8. Model selection
The model space consisting of logic trees and environmental factors is potentially large. Chipman et al. (1998) and Denison et al. (1998) argue that simulation from the true posterior distribution cannot be achieved for CART models without a prohibitive number of iterations, because of the huge number of possible trees and the difficulty of traversing the posterior surface. This implies that it is difficult to determine exactly which trees have the largest posterior probability. The same type of concerns holds in the case

Downloaded from http://biostatistics.oxfordjournals.org/ by guest on May 30, 2013

of logic trees because of the similarity of the search space. Moreover, with the difficulty of sampling from the posterior distribution, it is not clear how to think of the “posterior mean” and of “the posterior mode” found using the MCMC output. Although the entire posterior distribution cannot be efficiently computed in non–trivial problems, the reversible jump algorithm can still be used to effectively explore the posterior distribution. Therefore, we prefer to classify the different models visited by using an optimality criterion such as the Bayesian Information Criterion (BIC) (Schwarz, 1978), which is widely applied in the Bayesian model selection literature, or the marginal likelihood. Denison et al. (2002) suggest using the marginal likelihood to compare tree models, while Chipman et al. (1998) propose using the marginal likelihood along with the misclassification rate to determine good models. Note that the BIC and the marginal likelihood criterion are asymptotically equivalent. The marginal likelihood contains a natural dimension penalty, is consistent with the Bayesian paradigm, and has been applied to CART models (Denison et al., 1998) and logic trees (Clark et al., 2005). Nevertheless it would be expected that averaging over the models visited during the MCMC run (as described in section 4.10) would lead to better predictions than single models. We refer to Chipman et al. (1998) and Denison et al. (1998) for two different strategies which try to eliminate some of the problems associated with sampling on the space of trees. In the following sections, we evaluate the performance of the BIC and marginal likelihood on simulated and epidemiological data. The BIC for a model consisting of p components (q logic trees, p − q environmental factors including the intercept) fitted to 2n haplotypes, is

$$
\mathrm{BIC} = -2\,l(\hat{\alpha}, \hat{\beta}) + p \ln 2n,
$$

where $l(\hat{\alpha}, \hat{\beta})$ is the maximum log–likelihood for the model, $\hat{\alpha}$ and $\hat{\beta}$ are the maximum likelihood estimates (mle) of α and β respectively, and p is the total number of parameters in the model. The mles are estimated using iteratively re–weighted least squares (McCullagh and Nelder, 1989). In the setting of a logistic regression, the marginal likelihood for a model M consisting of logic trees and environmental factors, p(y|M), is not available in closed form. However, the marginal likelihood can be approximated via the Laplace approximation (DiCiccio et al., 1997)

$$
\hat{p}(y \mid M) = (2\pi)^{p/2}\, |\hat{\Sigma}|^{1/2}\, h(\hat{\alpha}, \hat{\beta} \mid M),
$$

where h(α, β|M) = p(y|α, β, M) p(α, β|M), and

$$
\hat{\Sigma} = \left[ -\frac{\partial^{2} \log h(\hat{\alpha}, \hat{\beta} \mid M)}{\partial \hat{\theta}_i\, \partial \hat{\theta}_j} \right]^{-1}.
$$

Here $\hat{\alpha}$ and $\hat{\beta}$, the maximum a posteriori estimates of α and β respectively, are found via Newton’s method, and $\hat{\theta} = (\hat{\alpha}, \hat{\beta})$.

4.9. Model coefficients

The logistic model produces (log) odds ratios which are meaningful quantities in relating disease to risk factors in epidemiological studies, and are easily interpreted by clinicians and genetic epidemiologists. The mean, standard deviation and, indeed, the distribution, of the (log) odds ratios for the logic trees, environmental, and interaction effects can be calculated for the selected model using the algorithm described earlier.

4.10. Model averaging

One application of a model is to predict the probability of being diseased for any given environmental factor (x) and SNP configuration (t). This prediction is achieved through Bayesian model averaging (Hoeting et al., 1999; Denison et al., 2002). The posterior predictive distribution for the binary outcome, y, for x and t can be written as

$$
p(y \mid x, t, D) = \sum_{j=1}^{M} p(y \mid x, t, M_j)\, p(M_j \mid D),
$$

where D is the data, and $M_j$ is the model at iteration j of M (post burn–in). The equation is simply a mixture of individual predictive distributions for y given each model, weighted by the posterior probability of each model. The predicted outcome, or expectation of the posterior predictive distribution for y, $\tilde{y}$, is

$$
\tilde{y} = E(y \mid x, t, D) = \sum_{j=1}^{M} E(y \mid x, t, M_j)\, p(M_j \mid D).
$$

Other summary statistics of the posterior predictive distribution may be calculated in a similar way.

5. SIMULATION STUDY

Here we compare the performance of our method with logic regression and stepwise logistic regression using data from seven simulation scenarios. Using a coalescent model (Hudson, 2002) we generated 10000 haplotypes (equivalent to 5000 diploid individuals) consisting of two blocks of SNPs, each with 50 SNPs. We restricted ourselves to those SNPs with minor allele frequencies (MAFs) in excess of 0.5%, and this resulted in two blocks consisting of 31 and 32 SNPs, with MAFs ranging from 0.6–48.3% in block 1 and 0.7–45.9% in block 2. The SNPs in blocks 1 and 2 are labelled as S1–S31 and S32–S63 respectively. From our population of 10000 haplotypes, we sampled with replacement 100 datasets consisting of 1000 haplotypes (500 individuals). We considered seven simulation models to generate phenotypic data,

• Model 0: η = 0
• Model 1: η = −2 + log(3)J([S6 ∨ S37])
• Model 2: η = −2 + log(2)J([S6 ∨ S37])
• Model 3: η = −2 + log(3)J([S34^c ∧ S37])
• Model 4: η = −2 + log(2)J([S34^c ∧ S37])
• Model 5: η = −2 + log(3)J([(S6 ∧ S41) ∨ S37])
• Model 6: η = −2 + log(2)J([(S6 ∧ S41) ∨ S37])

where η refers to the logit(P(Y = 1)), so that the probability of being diseased (P(Y = 1)) is 1/(1+exp(−η)). Model 0 consists of no causal SNP, i.e. there is no SNP associated with the disease outcome. An additive disease model was defined at the genotype level using J(·), which takes values 0, 1 and 2, according to whether zero, one, or two haplotypes constituting a single genotype are consistent with the logic tree. The intercept relates to the underlying risk of disease in the population and the slope estimate mimics a form of (log) genotypic relative risk for the logic tree for each individual. Note that the genetic signal is relatively weak in all the simulation scenarios to mimic a potentially realistic situation.

For our approach, we assumed π(α) = N(0, 100), π(β) = N(0, 100), the maximum number of trees (tmax) to be 5, and the number of SNPs on a tree to be Geometric(0·3). We ran the algorithm for 1,000,000 iterations, with the first 10,000 being discarded as burn–in. Computational time using a single Pentium IV processor for a single replicate was less than 15 minutes. We compared our method to (i) logistic regression with stepwise selection using Akaike’s Information Criterion (AIC) and BIC, where up to two–way interactions could be included, and (ii) logic regression. The stepwise logistic approaches and logic regression were implemented in the R statistical software package (R Development Core Team, 2005). We have used the R package LogicReg to implement logic regression; to avoid time consuming cross–validation experiments, we prespecified the number of trees and leaves to be those of the true model. In particular, for models 1, 2, 3 and 4, we set the number of trees to be 1 and the number of leaves to be 2. Similarly, for models 5 and 6 we set the number of trees and leaves to be 1 and 3 respectively.

Table 1 shows the proportion of replicates where the methods have either: (a) identified the correct set of SNPs used to simulate the phenotype, with no extra SNPs (i.e. the final model from each statistical procedure contains all the correct SNPs and no others), or (b) identified the correct set of SNPs with perhaps extra SNPs. Table 2 shows (c) the proportion of replicates where the methods have not identified any of the correct set of SNPs used to simulate the phenotype in the model (i.e. the final model selected from each statistical procedure does not contain any of the correct SNPs), and (d) under simulation scenario 0, the proportion of false positives (i.e. the proportion of replicates for which the statistical procedure selected at least one SNP). The stepwise logistic approach using the AIC criterion tends to select models with more SNPs than other approaches (see Table 1); in fact, the method performs best when considering (c) and worst when considering (a) and (d). The stepwise logistic approach with the BIC performs well in simulation scenarios 1 and 2, where the logic tree structure is simpler. The performance of the stepwise approaches is worst in simulations 3, 4, 5 and 6, where some or all the SNPs involved in simulating the phenotype are in the same block, and the search is affected by high LD. Comparatively, logic regression seems to perform better in simulations 3, 4, 5 and 6, and this may be due to the increased flexibility of the logic tree structure, and its ability to model interactions.

Logic regression was constrained to have the correct number of trees and leaves, and therefore models larger than the correct one are never selected. Hence, the mean number of extra SNPs other than the correct ones is zero, and the proportions in (b) are underestimated (see second column of Table 1). For the same reason, the false positive rates for logic regression in Table 2 have not been calculated. Our method uses logic trees jointly with the perfect phylogeny constraint and an optimality criterion, and generally tends to perform better than the other methods; in the worst cases its performance is comparable to the best approach. An advantage of the proposed approach is that it keeps the size of the model relatively small. In our simulations both the BIC and marginal likelihood criteria showed favourable properties with marginal likelihood performing slightly better. However, although BIC performs well, the properties of such a criterion are not fully understood especially in complex settings. Table 3 shows the median and interquartile range of selection proportions from our method for the causal SNPs in simulations 1 to 6.

From the MCMC run, we can estimate the posterior probability of inclusion for each SNP. We can then use the Bayes’ factor (BF) (Kass and Raftery, 1995) to compare the hypothesis H1: "variable/SNP i should be included in the model (i.e. γi = 1)" against H0: "variable/SNP i should not be included in the model (i.e. γi = 0)". Let p0 = P(γi = 1) be the prior probability of H1 (and therefore 1 − p0 is the prior probability of H0), and let p1 = P(γi = 1 | Data) be the posterior probability of H1. The Bayes’ factor is then defined as the ratio of the posterior odds and the prior odds:

$$
\mathrm{BF}(H_1, H_0) = \frac{p_1/(1 - p_1)}{p_0/(1 - p_0)}.
$$

Kass and Raftery (1995) suggest that a BF greater than 10 provides substantial evidence against H0, but this threshold should always be interpreted with caution. Due to convergence issues discussed in section 4.8, the estimate of p1 is also problematic. Let us consider, for example, one of the simulation replicates from model 1. The posterior probability of inclusion for S6 is equal to one, leading to a BF of infinity. In the case of S37, p1 = 0·948, and the BF is greater than 100. For most of the remaining SNPs, p1 is less than 0·01, and BF is less than 0·06. The prior probability p0 was evaluated by simulations.
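As a small worked illustration of the Bayes factor described in this section (posterior odds of inclusion divided by prior odds), the sketch below computes it from p1 and p0. The function name is hypothetical, and the prior inclusion probability p0 = 0.14 used in the usage note is an assumed value chosen purely for illustration; the paper states only that p0 was evaluated by simulation.

```python
import math


def bayes_factor(posterior_prob, prior_prob):
    """Bayes factor for inclusion: posterior odds divided by prior odds."""
    if not 0.0 < prior_prob < 1.0:
        raise ValueError("prior_prob must lie strictly between 0 and 1")
    if not 0.0 <= posterior_prob <= 1.0:
        raise ValueError("posterior_prob must lie between 0 and 1")
    if posterior_prob == 1.0:
        # A SNP included at every iteration has infinite posterior odds.
        return math.inf
    posterior_odds = posterior_prob / (1.0 - posterior_prob)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return posterior_odds / prior_odds
```

With the assumed p0 = 0.14, `bayes_factor(0.948, 0.14)` exceeds 100 and `bayes_factor(1.0, 0.14)` is infinite, in line with the S37 and S6 examples in the text; a different p0 would change these values.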

Table 1. For simulation scenarios 1–6, we report the proportion of replicates that include the correct set of SNPs used to simulate the phenotype with or without the inclusion of other SNPs. The first panel shows the proportion of replicates in which the statistical procedure selects a final model that contains all the correct SNPs and no other. The second panel reports the proportion of replicates in which the statistical procedure selects a model containing all the correct SNPs with the possible inclusion of others; the mean number of SNPs that were not used to simulate the phenotype but were in the final model (’incorrect SNPs’) is presented in brackets, and a dash in the case of logic regression implies that the calculation could not be performed. AIC refers to Akaike’s Information Criterion; BIC refers to the Bayesian Information Criterion; BLRPP refers to the Bayesian logistic regression with a perfect phylogeny approach; ML refers to the marginal likelihood

All correct SNPs, without other SNPs (by simulation scenario)

| Method | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Stepwise logistic (AIC) | 0 | 0 | 0 | 0 | 0 | 0 |
| Stepwise logistic (BIC) | 0.70 | 0.62 | 0.08 | 0.05 | 0 | 0 |
| Logic Regression | 0.66 | 0.34 | 0.49 | 0.22 | 0.20 | 0.06 |
| BLRPP (BIC) | 0.74 | 0.65 | 0.54 | 0.31 | 0.23 | 0.12 |
| BLRPP (ML) | 0.76 | 0.66 | 0.57 | 0.35 | 0.24 | 0.14 |

All correct SNPs, with or without other SNPs (by simulation scenario)

| Method | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Stepwise logistic (AIC) | 0.98 (4.7) | 0.85 (5.2) | 0.18 (4.9) | 0.15 (4.0) | 0.20 (5.3) | 0.15 (5.1) |
| Stepwise logistic (BIC) | 1 (0.4) | 0.82 (0.5) | 0.13 (0.5) | 0.07 (0.3) | 0.06 (0.6) | 0 (0.4) |
| Logic Regression | 0.66 (-) | 0.34 (-) | 0.49 (-) | 0.22 (-) | 0.20 (-) | 0.06 (-) |
| BLRPP (BIC) | 1 (0.5) | 0.86 (0.6) | 0.61 (0.1) | 0.40 (0.3) | 0.32 (0.6) | 0.19 (0.5) |
| BLRPP (ML) | 1 (0.4) | 0.88 (0.4) | 0.63 (0.1) | 0.43 (0.2) | 0.33 (0.4) | 0.20 (0.4) |

6. SARCOIDOSIS STUDY

Sarcoidosis is a rare disease due to inflammation, particularly in the lung and lymph nodes, and is characterised by the presence of granulomas, small areas of inflamed cells. Sarcoidosis is thought to result from the interaction between an unknown environmental antigenic trigger and the host’s genetic susceptibility. Clinical onset and progression vary widely in sarcoidosis, ranging from benign disease to progressive pulmonary fibrosis leading to respiratory failure. A subset of patients has Löfgren’s syndrome, which is characterised by symptoms such as an acute presentation of fever. It has been found that some relevant genes on the human leukocyte antigen (HLA) region, such as DQB1*0201, may be associated with Löfgren’s syndrome and not other types of sarcoidosis (Spagnolo et al., 2003). It has also been hypothesised that C–C chemokines, such as CCR2 (chromosome 3) and CCR5 (chromosome 3), and the tumour necrosis factor (TNF) gene (chromosome 6) are similarly associated with Löfgren’s syndrome, but no other form of sarcoidosis (Spagnolo et al., 2003). Here we present data from 141 Dutch patients with sarcoidosis, 47 (33%) with Löfgren’s syndrome. The candidate gene regions of interest are TNF (4 SNPs: T-1032C, G-308A, G-238A, G488A), CCR5 (6 SNPs: A-5652G, C-3890A, T-3452G, T-2129C, T-1829C, G32A) and CCR2 (5 SNPs: G-6928T, A6752G, G190A, A3610G, C3671G). These represent regions of approximately 1.5, 5.5, and 10.5 kb respectively, and will be referred to collectively as SNPs S1–S15. The gene trees for the Sarcoidosis study haplotypes using all 282 haplotypes are shown in Figure 3, and represent five blocks constructed using


Table 2. For simulation scenarios 1–6, we report the proportion of replicates in which each statistical procedure excludes from the final model all the SNPs used to simulate the phenotype (denoted by the ’no correct SNPs’ columns); from those replicates, the mean number of SNPs not used to simulate the phenotype but included in the model (’incorrect SNPs’) is presented in round brackets. The ’False positives’ column reports the proportion of replicates under simulation scenario 0 in which the statistical procedure selected a model containing at least one SNP; the mean number of SNPs in those models is presented in square brackets. A dash in the table indicates that the calculation could not be performed. AIC refers to Akaike’s Information Criterion; BIC refers to the Bayesian Information Criterion; BLRPP refers to the Bayesian logistic regression with a perfect phylogeny approach; ML refers to the marginal likelihood

No correct SNPs (scenarios 1–6) and false positives (scenario 0)

| Method | 1 | 2 | 3 | 4 | 5 | 6 | False positives (scenario 0) |
|---|---|---|---|---|---|---|---|
| Stepwise logistic (AIC) | 0 (-) | 0 (-) | 0.76 (7.1) | 0.71 (6.7) | 0 (-) | 0.08 (5.5) | 0.97 [5.1] |
| Stepwise logistic (BIC) | 0 (-) | 0 (-) | 0.84 (1.9) | 0.81 (1.5) | 0.03 (1.3) | 0.38 (0.6) | 0.21 [1.2] |
| Logic Regression | 0 (-) | 0.05 (2) | 0.51 (2) | 0.74 (2) | 0.08 (3) | 0.22 (3) | - |
| BLRPP (BIC) | 0 (-) | 0.01 (1) | 0.30 (1.2) | 0.49 (1.1) | 0.03 (1.3) | 0.12 (1.1) | 0.17 [1.1] |
| BLRPP (ML) | 0 (-) | 0 (-) | 0.29 (1.2) | 0.47 (1.1) | 0.02 (1.5) | 0.09 (1.2) | 0.16 [1.1] |

the imperfect phylogeny method (Halperin and Eskin, 2003). The plots indicate that mutations at S2, S12 and S14 may be associated with Löfgren’s syndrome. Table 4 shows other covariates of interest: gender, age at diagnosis, history of smoking and various HLA regions. There appears to be a large difference in the distribution of gender and DQB1*201 between the phenotype groups. Although the HLA regions are genetic in nature, they involve potentially complex combinations of genotypes, and in practice they tend not to be converted to SNPs. Here we consider them as binary covariates indicating whether the particular combination of genotypes of interest is present or absent for an individual. The HLA covariates could be viewed as existing risk factors that the new candidate SNPs may potentially displace from an association model. An analysis of the data by Spagnolo et al. (2003) with haplotypes consisting of only CCR2 (S11–S15) suggested that female gender, DQB1*201, and a haplotype S11 S12 S13 S14 S15 are associated with Löfgren’s syndrome in those diagnosed with a form of sarcoidosis. We applied our approach assuming π(α) = π(β) = N(0, 100), tmax = 5, and the number of SNPs on a tree to be distributed as a Geometric(0·3). The block Gibbs sampler was run for 1,000,000 iterations with a 10,000 burn–in. The selection frequencies in Table 5 indicate that gender and DQB1*201 were present in most post burn–in models, whilst S12 was the most selected SNP. The trees selected most frequently were:


Table 3. Median (inter-quartile range) of SNP selection proportions from the Bayesian logistic regression with a perfect phylogeny approach

| Model | S6 | S34 | S37 | S41 | Other |
|---|---|---|---|---|---|
| 1 | 0.997 (0.991 – 1) | - | 0.976 (0.947 – 0.991) | - | 0.003 (0.001 – 0.006) |
| 2 | 0.994 (0.820 – 0.991) | - | 0.639 (0.152 – 0.894) | - | 0.004 (0.002 – 0.009) |
| 3 | - | 0.427 (0.177 – 0.639) | 0.775 (0.429 – 0.996) | - | 0.005 (0.002 – 0.015) |
| 4 | - | 0.278 (0.071 – 0.411) | 0.521 (0.132 – 0.876) | - | 0.006 (0.003 – 0.018) |
| 5 | 0.351 (0.057 – 0.748) | - | 0.743 (0.578 – 0.971) | 0.277 (0.049 – 0.409) | 0.005 (0.003 – 0.010) |
| 6 | 0.147 (0.039 – 0.496) | - | 0.252 (0.100 – 0.669) | 0.110 (0.030 – 0.351) | 0.008 (0.004 – 0.020) |

Table 4. Covariate factors in the Sarcoidosis study

| Covariates | Löfgren’s syndrome (n=47) | Other sarcoidosis (n=94) |
|---|---|---|
| Male | 18 (38.3%) | 64 (68.1%) |
| Age at diagnosis | 34 (20–64) | 35 (17–71) |
| History of smoking | 21 (44.7%) | 34 (36.2%) |
| DRB1*01 | 4 (8.5%) | 12 (12.8%) |
| DRB1*04 | 11 (23.4%) | 19 (20.2%) |
| DRB1*04-DQB1*0301 | 3 (6.4%) | 3 (3.2%) |
| DRB1*07 | 4 (8.5%) | 15 (16.0%) |
| DRB1*10 | 2 (4.3%) | 5 (5.3%) |
| DQB1*201 | 38 (80.9%) | 18 (19.1%) |
| DQB1*602 | 5 (10.6%) | 25 (26.6%) |

Fig. 3. Gene trees for the Sarcoidosis study: TNF block (top), CCR5 blocks (middle), CCR2 blocks (bottom). The percentages relate to the haplotype frequency within the Löfgren’s syndrome group and the other sarcoidosis group

Table 5. Proportion of post burn–in iterations in which the specified covariate factors in the Sarcoidosis study were selected in the model

| Covariates | Proportion | Variance |
|---|---|---|
| S1 | 0.046 | 0.044 |
| S2 | 0.276 | 0.200 |
| S3 | 0.104 | 0.092 |
| S4 | 0.388 | 0.237 |
| S5 | 0.099 | 0.090 |
| S6 | 0.062 | 0.058 |
| S7 | 0.238 | 0.184 |
| S8 | 0.032 | 0.031 |
| S9 | 0.114 | 0.101 |
| S10 | 0.056 | 0.053 |
| S11 | 0.057 | 0.053 |
| S12 | 0.215 | 0.168 |
| S13 | 0.099 | 0.089 |
| S14 | 0.065 | 0.066 |
| S15 | 0.055 | 0.052 |
| Gender | 0.999 | 0.001 |
| Age | 0.003 | 0.003 |
| History of smoking | 0.181 | 0.148 |
| DRB1*01 | 0.058 | 0.055 |
| DRB1*04 | 0.049 | 0.046 |
| DRB1*04-DQB1*0301 | 0.163 | 0.136 |
| DRB1*07 | 0.029 | 0.029 |
| DRB1*10 | 0.100 | 0.090 |
| DQB1*201 | 1.000 | < 0·001 |
| DQB1*602 | 0.026 | 0.026 |
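For most rows of Table 5 the variance column is numerically close to p(1 − p), the variance of a Bernoulli inclusion indicator with selection proportion p. That reading is an inference from the reported numbers rather than something the text states. Under that assumption, a minimal sketch of how both columns could be obtained from the post burn–in inclusion indicators of one covariate is given below; the function name and the input format are hypothetical.

```python
def selection_summary(indicators):
    """Return (proportion, variance) for a sequence of 0/1 inclusion indicators.

    The variance is the Bernoulli variance p * (1 - p), which is an assumed
    reading of the 'Variance' column, not a definition taken from the paper.
    """
    n = len(indicators)
    if n == 0:
        raise ValueError("at least one post burn-in iteration is required")
    proportion = sum(indicators) / n
    variance = proportion * (1.0 - proportion)
    return proportion, variance


# Illustrative input: a covariate included in 276 of 1000 retained iterations.
example_proportion, example_variance = selection_summary([1] * 276 + [0] * 724)
```

For this illustrative input the proportion is 0.276 and the variance rounds to 0.200, matching the S2 row.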

(i) S12 (17%)

(ii) S14 (11%)

(iii) S1 (10%).

The results were robust to changes in the random number seed, tmax, and the parameter in the geometric distribution for tree size. The best model under both the BIC criterion and marginal likelihood contained gender, a history of smoking, DQB1*201 and S12 ∨ S14. The results of fitting this model in a Bayesian logistic regression are presented in Table 6, and are consistent with the findings of Spagnolo et al. (2003). Using model averaging, the predicted probability of Löfgren’s syndrome for a male with a history of smoking, the DQB1*201 allele, and with mutations at SNPs 12 or 14 was 0.702 (variance 0.209). Application of logic regression to the haplotype data suggested the trees (log OR) S2^c (-1.22), S12 ∨ S14 (1.52) and S15 (-0.863) were associated with the response. However, when including the three non–SNP covariates in Table 6 as fixed covariates, the tree component (log OR) of the model was S12 ∧ S1^c (1.27). A Bayesian logistic regression with gender, a history of smoking, DQB1*201 and S12 ∧ S1^c resulted in log ORs (se) of -1.78 (0.37), 0.79 (0.36), 3.09 (0.37) and 1.71 (0.73) respectively.

Table 6. Log odds ratios (OR) from a Bayesian logistic regression on the best ML model for the Sarcoidosis study

| Covariates | Gender | Smoking | DQB1*201 | S12 ∨ S14 |
|---|---|---|---|---|
| log(OR) | -1.84 | 0.94 | 2.99 | 1.51 |
| sd(log OR) | 0.39 | 0.39 | 0.37 | 0.37 |

7. DISCUSSION

We have presented a Bayesian logistic regression method that performs an evolution–based association analysis using haplotype data, where these data are consistent with a perfect phylogeny. Haplotype data capture the genetic variation among individuals in a population and among populations, and an understanding of this variation is essential in genetic association studies of complex disease. A gene tree is equivalent to the SNP configuration of mutations when there is no recombination and recurrent or parallel mutations do not occur. A natural choice for modelling genealogies is the coalescent model, which induces a stochastic distribution for a gene tree. Although we only use the basic gene tree topology, full likelihood inference to estimate mutation rates, the age of mutations, or the time to the most recent common ancestor is also possible. Cladistic approaches (Templeton et al., 1992) are an alternative to the gene tree, and have been employed within a generalised linear framework on family and population–based data, where models account for haplotype phase ambiguity (Seltman et al., 2003). Cladistic approaches may be justified if interest lies not in parameters of the evolutionary model, but rather in the particular history of a specific locus (Rosenberg and Nordborg, 2002). Parsimonious methods (e.g. Estabrook et al. (1975)) may also be used to construct a phylogeny, but have been criticised on the grounds that they are not based on statistical principles (e.g. Felsenstein (1983)). Alternative association methods based on the coalescent exist (e.g. Larribe et al. (2002); Morris et al. (2002)), but single (and not multiple) sites are assumed to be disease predisposing and covariates or environmental factors cannot be included. The incorporation of covariates is important for confounding reasons and for assessing the balance of genetic and environmental effects and the interactions between them. We applied our method to both simulated and epidemiological data.
The simulated data had varying effect sizes, and our approach was in general able to detect the correct logic tree with or without other SNPs more often than the alternative approaches, whilst keeping the total number of leaves small. In the Sarcoidosis study, we identified genetic and non–genetic factors associated with Löfgren’s syndrome. Although the dataset is small, our results replicate the findings in previous work. In general, large sample sizes are required to detect modest genetic effects and outcomes must be robust (Cardon and Bell, 2001). Our method performs well because of the link between the data matrix, logic trees and gene trees, when the perfect phylogeny assumption holds. The gene tree speeds up the search of the model space. Our methodology is an extension of logic regression, an approach that has been shown to be effective at identifying disease–predisposing SNPs when compared to other approaches (Witte and Fijal, 2001). Logic regression is also related directly to other graphical methods (Ruczinski et al., 2003), such as CART.

In practice, the presence of recombination will mean the perfect phylogeny assumption is unlikely to hold, but like other LD mapping methods mentioned earlier (except Larribe et al. (2002)) assuming (little or) no recombination is a starting point, and may be potentially useful when considering haplotype blocks of high LD. It is possible to apply our current algorithm where there is a small amount of recombination, using either the prune algorithm to obtain a reduced dataset that has a unique gene tree, or the imperfect phylogeny methodology (Halperin and Eskin, 2003) to create blocks. It is also possible to use knowledge of a genome map of haplotype blocks (e.g. the HAPMAP project described in Cardon and Abecasis (2003)) in our methodology. Another potential limitation is that genetic data often consist of diploid genotypes and not haplotypes. Haplotypes need to be reconstructed and any uncertainty associated with this process must be incorporated into the analysis. Haplotype reconstruction programs, such as PHASE (Stephens and Donnelly, 2003), output the posterior probabilities or population frequencies for haplotypes, and these may be used as a probability distribution of the possible haplotype configurations in a more general hierarchical modelling framework. We have presented an extension to the block Gibbs sampler which will estimate the posterior distribution of the haplotype configuration. Moreover, haplotypes do not usually inform us about a mechanism of disease. The use of haplotypes will detect a signal for an association (our objective), and our methodology may be applied to genotypes to detect the mechanism by using the approach described earlier.

The Bayesian nature of the method implies that model selection may be less problematic than in a frequentist setting, where models may be over–fitted and multiple testing can lead to false positives. There are also advantages in terms of incorporating prior beliefs, and interpretability of posterior probabilities and distributions. Extensions of the Bayesian linear model (Lindley and Smith, 1972) are well developed, and it is possible to model a continuous or a count phenotype as well as incorporate cluster effects for family data. Our framework incorporates evolutionary relationships among haplotypes, and the interpretability of results in association studies is enhanced substantially. The methodology described is useful for haplotype data with little recombination, and should prove to be useful for genetic epidemiologists who are searching for the genetic and environmental basis of complex disease in this setting.

Software to implement the method is available from the corresponding author.

ACKNOWLEDGEMENTS

We wish to thank Ken Welsh for providing the data from the Sarcoidosis study, Jonathan Marchini for introducing us to the idea of logic regression and for some assistance with C++ programming, Chris Holmes for providing a copy of his technical report, and Yvonne Griffiths for comments. TC was funded by a National Health Service (UK) Training Fellowship, and is now funded by the Medical Research Council (UK).

REFERENCES

Bafna, V., D. Gusfield, G. Lancia, and S. Yooseph (2003). Haplotyping as a perfect phylogeny: a direct approach. J Comp Biol 3, 323–340.
Cardon, L. R. and G. R. Abecasis (2003). Using haplotype blocks to map human complex trait loci. Trends Genet 19, 135–40.
Cardon, L. R. and J. I. Bell (2001). Association study designs for complex diseases. Nat Rev Genet 2, 91–98.
Chipman, H. A., E. I. George, and R. E. McCulloch (1998). Bayesian CART model search. J Am Statist Assoc 93, 935–960.

Clark, T. G., M. De Iorio, R. C. Griffiths, and M. Farrall (2005). Finding associations in dense genetic maps: a genetic algorithm approach. Hum Heredity 50, 97–108.
Daly, M. J., J. D. Rioux, S. F. Schaffner, T. J. Hudson, and E. S. Lander (2001). High-resolution haplotype structure in the human genome. Nat Genet 29, 229–32.
Denison, D. G. T., C. C. Holmes, B. K. Mallick, and A. F. M. Smith (2002). Bayesian methods for nonlinear classification and regression. England: John Wiley and Sons Ltd.
Denison, D. G. T., B. K. Mallick, and A. F. M. Smith (1998). A Bayesian CART algorithm. Biometrika 85, 363–377.
Devroye, L. (1986). Non–uniform random variate generation, pp. 361–8. New York: Springer.
DiCiccio, T. J., R. E. Kass, A. Raftery, and L. Wasserman (1997). Computing Bayes factors by combining simulation and asymptotic approximations. J Am Statist Soc 92, 903–915.
Eskin, E., E. Halperin, and R. M. Karp (2003). Efficient reconstruction of haplotype structure via perfect phylogeny. J Bioinf Comp Biol 1, 1–20.
Estabrook, G. F., C. S. Johnson, and F. R. McMorris (1975). An idealized concept of the true cladistic character. Math Biosci 23, 263–72.
Felsenstein, J. (1983). Method for inferring phylogenies: A statistical view. In J. Felsenstein (Ed.), Numerical Taxonomy, Berlin, pp. 315–334. Springer-Verlag.
Gabriel, S. B., S. F. Schaffner, H. Nguyen, J. M. Moore, J. Roy, B. Blumenstiel, J. Higgins, M. DeFelice, A. Lochner, M. Faggart, S. N. Liu-Cordero, C. Rotimi, A. Adeyemo, R. Cooper, R. Ward, E. S. Lander, M. J. Daly, and D. Altshuler (2002). The structure of haplotype blocks in the human genome. Science 296, 2225–9.
Griffiths, R. C. (2001). Ancestral inference from gene trees. In P. Donnelly and R. Foley (Eds.), Genes, Fossils, and Behaviour: an Integrated Approach to Human Evolution, Netherlands, pp. 137–172. IOS Press.
Gusfield, D. (1991). Efficient algorithms for inferring evolutionary trees. Networks 21, 19–28.
Halperin, E. and E. Eskin (2003). Haplotype reconstruction from genotype data using imperfect phylogeny. Bioinformatics 1, 1–8.
Hoeting, J. A., D. Madigan, A. E. Raftery, and C. T. Volinsky (1999). Bayesian model averaging: a tutorial (with discussion). Statist Sci 14, 382–417.
Holmes, C. C. and L. Held (2003). On the simulation of Bayesian binary and polychotomous regression models using auxiliary variables. Technical report, Department of Mathematics, Imperial College, London.
Hudson, R. R. (2002). Generating samples under a Wright-Fisher neutral model. Bioinformatics 18, 337–8.
Kass, R. E. and A. Raftery (1995). Bayes factors. J Am Statist Assoc 90, 773–95.
Kingman, J. F. C. (1982). The coalescent. Stochastic Processes and Their Applications 13, 235–48.
Larribe, F., S. Lessard, and N. J. Schork (2002). Gene mapping via the ancestral recombination graph. Theor Popul Biol 62, 215–29.
Lindley, D. V. and A. F. M. Smith (1972). Bayes estimates for the linear model (with discussion). J Roy Statist Soc B 34, 1–41.
McCullagh, P. and J. A. Nelder (1989). Generalized linear models. London: Chapman and Hall.
Morris, A. P., J. Whittaker, and D. J. Balding (2002). Fine–scale mapping of disease loci via shattered coalescent modelling of genealogies. Am J Hum Genet 70, 686–707.
Prentice, R. L. and R. Pyke (1979). Logistic disease incidence and case-control studies. Biometrika 66, 403–11.
R Development Core Team (2005). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0.
Reich, D. E., S. F. Schaffner, M. J. Daly, G. McVean, J. C. Mullikin, J. M. Higgins, D. J. Richter, E. S. Lander, and D. Altshuler (2002). Human genome sequence variation and the influence of gene history, mutation and recombination. Nat Genet 32, 135–142.
Rosenberg, N. A. and M. Nordborg (2002). Genealogical trees, coalescent theory and analysis of genetic polymorphisms. Nat Genet 3, 380–390.
Ruczinski, I., C. Kooperberg, and M. LeBlanc (2003). Logic regression. J Comput Graph Stat 12, 475–511.
Sasieni, P. D. (1997). From genotypes to genes: doubling the sample size. Biometrics 53, 1253–61.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461–4.
Seltman, H., K. Roeder, and B. Devlin (2003). Evolutionary-based association analysis using haplotype data. Genetic Epidemiol 25, 48–58.
Spagnolo, P., E. A. Renzoni, A. U. Wells, H. Sato, J. C. Grutters, P. Sestini, A. Abdallah, E. Gramiccioni, H. J. T. Ruven, R. M. du Bois, and K. I. Welsh (2003). C–C chemokine receptor 2 and sarcoidosis: association with Löfgren’s syndrome. Am J Respir Crit Care Med 168, 1162–6.
Stephens, M. and P. Donnelly (2003). A comparison of bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet 73, 1162–9.
Templeton, A. R., K. A. Crandall, and C. F. Sing (1992). A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping and DNA sequence data. iii. Cladogram estimation. Genetics 132, 619–633.
Witte, J. S. and B. A. Fijal (2001). Introduction: Analysis of sequence data and population structure. Genet Epidemiol S1, 626–631.

[Received 25 November 2004. Revised 16 September 2005]