Selection of Important Metabolites Using a Bayesian Hierarchical Model

Eric Vance

December 10, 2003

Abstract

Identifying key metabolites associated with whether or not a person has a particular disease could be useful in identifying metabolic pathways in diseased cells. In this project I use a Bayesian hierarchical model with latent variables in order to select metabolites useful in distinguishing between diseased and non-diseased persons.

1. Background

1.1 Metabolomics

Metabolomics is the next step in the progression from genomics, transcriptomics, and proteomics. “The measurement of well-characterized, relevant biochemicals allows a specific and easily interpretable view of cellular activity... Control of biochemical concentrations may be viewed as the sole reason for all the machinations that occur at more complex levels: DNA, RNA, and proteins. Thus metabolomics may be the best and most direct measure of cellular physiology.” Beecher (2002). Understanding which metabolites are correlated with a disease could provide insight into the disease mechanism. Biologists currently working in this area use dendrogram-based clustering and principal component analysis, comparing the similarity of metabolite profiles to those of known factor mutations in order to deduce where in metabolism the mutations act. No Bayesian methods have yet been applied to metabolomic datasets.

1.2 Dataset

Measurements of 317 metabolites for 63 patients were made using mass spectrometry. Of the 63 patients, 30 are afflicted with a certain rare disease while 33 are not. Many of the metabolites show large variation with many outlying values. To ensure that the sizes of the coefficients in the model remain interpretable, I scaled each covariate. This scaling is key to making the common prior distribution on the coefficients sensible. Many of the 19,971 metabolite values are missing. The biologists who own this dataset claim that these missing values are true zeros, i.e., the metabolite is not present at all for that person. A quick examination of the data shows that in a few cases this seems unlikely; nevertheless, I defer to the biologists' expertise and replace all missing values with zeros.
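The preprocessing described above (replacing missing values with zeros, then scaling each covariate) can be sketched as follows. The array names are hypothetical, and scaling each column to unit standard deviation is an illustrative assumption, since the text does not specify the exact scaling used.

```python
import numpy as np

# Hypothetical stand-in for the 63 x 317 metabolite matrix, with NaN
# marking missing values; the real data are not reproduced here.
rng = np.random.default_rng(0)
raw = rng.lognormal(size=(63, 317))
raw[rng.random(raw.shape) < 0.3] = np.nan  # simulate missingness

# Replace missing values with zeros, per the biologists' advice.
filled = np.nan_to_num(raw, nan=0.0)

# Scale each covariate (here: to unit standard deviation) so that a
# common prior on the coefficients is sensible.
X = filled / filled.std(axis=0, ddof=1)

# Prepend an intercept column, giving the 63 x 318 design matrix.
X = np.column_stack([np.ones(len(X)), X])
```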

2. Hierarchical Model

2.1 Probit link with latent variable

A probit link is used so that Pr(yi = 1) = Φ(Xi′Ψ), where Y = (y1, . . . , y63)′, X = (1, X2, . . . , X318), and Ψ = (ψ1, ψ2, . . . , ψ318)′, with ψ1 being the intercept, which is included in every model. Following the method of Albert and Chib (1993), a latent variable zi ∼ N(Xi′Ψ, 1) is introduced such that

    yi = 1 if zi > 0,    yi = 0 if zi ≤ 0.

Thus the full-data likelihood is given by

    P(Y | Z) P(Z | Ψ) = ∏_{i=1}^{63} {1(yi = 0) 1(zi ≤ 0) + 1(yi = 1) 1(zi > 0)} N(zi; Xi′Ψ, 1).
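Sampling each latent zi from a normal distribution truncated by the sign constraint is the heart of the Albert and Chib construction. A minimal inverse-CDF sketch (the function name is hypothetical):

```python
from statistics import NormalDist
import random

std_norm = NormalDist()

def sample_latent(mu, y, rng=random):
    """Draw z ~ N(mu, 1) truncated to z > 0 when y == 1 and to
    z <= 0 when y == 0, via the inverse-CDF method."""
    lo, hi = (std_norm.cdf(-mu), 1.0) if y == 1 else (0.0, std_norm.cdf(-mu))
    # Clamp u to keep inv_cdf strictly inside (0, 1).
    u = min(max(rng.uniform(lo, hi), 1e-12), 1.0 - 1e-12)
    return mu + std_norm.inv_cdf(u)
```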

2.2 Priors

In order to select covariates from the design matrix X to include in the model, I use the stochastic search variable selection (SSVS) method of George and McCulloch (1993). The prior on Ψ is a “spike and slab” mixture of normals: with probability p0, ψj comes from the spiked distribution concentrated very near 0.

    π(Ψ | p0, τ) = N(ψ1; 0, 1/τ) · ∏_{j=2}^{318} [ p0 N(ψj; 0, 1/(cτ)) + (1 − p0) N(ψj; 0, 1/τ) ]

[Figure 1 appears here: plot of the spike and slab Normal densities.]

Figure 1: Spike and slab Normal prior for ψj with c = 10,000 and τ = 0.1

By introducing a vector of latent indicator variables ∆ = (1, δ2, . . . , δ318), with Pr(δj = 0) = p0 for j = 2, . . . , 318, the prior for ψj can be written as the mixture

    π(ψj | δj, τ) = 1(δj = 0) N(ψj; 0, 1/(cτ)) + 1(δj = 1) N(ψj; 0, 1/τ).
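A quick numerical check of the two mixture components, using the constants c = 10,000 and τ = 0.1 (the values adopted later in Section 2.4):

```python
import math

c, tau = 10_000, 0.1

# Variances of the two mixture components of the prior above:
spike_var = 1 / (c * tau)   # "spike": coefficients effectively zero
slab_var = 1 / tau          # "slab": coefficients freely estimated

# The spike sd is about 0.03, so nearly all "excluded" coefficients
# satisfy |psi_j| <= 0.1 (roughly three spike sds); the slab sd is
# about 3.16, allowing coefficients on the order of +/-10.
print(math.sqrt(spike_var))  # ~ 0.0316
print(math.sqrt(slab_var))   # ~ 3.16
```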

The prior for ∆ is

    π(∆ | p0) = ∏_{j=2}^{318} (1 − p0)^δj p0^(1−δj),

and the prior for p0 is

    π(p0) ∼ Beta(a0, b0).

2.3 Posterior computation and full conditionals

To compute the posterior distribution,

    P(Z, Ψ, ∆, p0 | Y, X) ∝ P(Y | Z) P(Z | Ψ) π(Ψ | ∆) π(∆ | p0) π(p0),

I use the Gibbs sampler, updating each parameter (Z, Ψ, ∆, p0) from its full conditional distribution at the k-th iteration.

• zi^(k) | z(i)^(k), Ψ^(k−1), ∆^(k−1), p0^(k−1), Y, X = P(zi | Ψ^(k−1), yi, Xi)
      ∝ {1(yi = 0) 1(zi ≤ 0) + 1(yi = 1) 1(zi > 0)} N(zi; Xi′Ψ^(k−1), 1)

• δj^(k) | δ(j)^(k), Z^(k), Ψ^(k−1), p0^(k−1) = P(δj | ψj^(k−1), p0^(k−1))
      ∝ p0^(k−1) 1(δj^(k) = 0) N(ψj^(k−1); 0, 1/(cτ)) + (1 − p0^(k−1)) 1(δj^(k) = 1) N(ψj^(k−1); 0, 1/τ)

• Ψ^(k) | Z^(k), ∆^(k), p0^(k), X ∼ N318( (X′X + P^(k))⁻¹ X′Z^(k), (X′X + P^(k))⁻¹ ),
      where P^(k) = diag(τ, γ2^(k), . . . , γ318^(k)), with γj^(k) = cτ if δj^(k) = 0 and γj^(k) = τ if δj^(k) = 1.

• p0^(k) | Z^(k), ∆^(k), Ψ^(k) ∼ Beta( a0 + Σj (1 − δj^(k)), b0 − 1 + Σj δj^(k) )

The zi’s are sampled one at a time, as are the δj’s. Ψ is sampled in one block from its full conditional multivariate distribution.

2.4 Choice of constants

I chose a0 = b0 = 5 to help improve the mixing. The constants c and τ should be chosen carefully so that the prior distributions on Ψ make sense. Choosing c = 10,000 and τ = 0.1 implies that when ψj belongs to the “spike” distribution it has sd ≈ 0.03, so that nearly all |ψj| ≤ 0.1 for the ψj that are intended to be 0. Larger values of c lead to slower mixing between models.
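The four full conditional updates can be assembled into a compact Gibbs sampler. The sketch below is illustrative rather than the code used for the analysis: the dimensions are shrunk, the data are simulated, and all names beyond the symbols in the text are hypothetical.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
std_norm = NormalDist()

n, p = 63, 10                       # p = 318 in the actual analysis
c, tau, a0, b0 = 10_000, 0.1, 5, 5

# Simulated design matrix (intercept + standardized covariates) and
# placeholder 0/1 responses, standing in for the real dataset.
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
y = rng.integers(0, 2, size=n)

def trunc_norm(mu, positive):
    """z ~ N(mu, 1) truncated to z > 0 (positive) or z <= 0."""
    lo, hi = (std_norm.cdf(-mu), 1.0) if positive else (0.0, std_norm.cdf(-mu))
    u = min(max(rng.uniform(lo, hi), 1e-12), 1.0 - 1e-12)
    return mu + std_norm.inv_cdf(u)

psi = np.zeros(p)
delta = np.ones(p, dtype=int)       # delta[0] fixed at 1: intercept always in
p0 = 0.5

for it in range(200):
    # 1. Latent z_i from truncated normals given the current Psi.
    mu = X @ psi
    z = np.array([trunc_norm(m, yi == 1) for m, yi in zip(mu, y)])

    # 2. Each delta_j (j >= 2) from its spike-vs-slab posterior odds.
    for j in range(1, p):
        spike = p0 * NormalDist(0.0, (c * tau) ** -0.5).pdf(psi[j])
        slab = (1.0 - p0) * NormalDist(0.0, tau ** -0.5).pdf(psi[j])
        delta[j] = rng.random() < slab / (spike + slab)

    # 3. Psi in one block: N((X'X + P)^-1 X'z, (X'X + P)^-1),
    #    with P = diag(tau, gamma_2, ..., gamma_p).
    gamma = np.where(delta == 1, tau, c * tau).astype(float)
    gamma[0] = tau
    cov = np.linalg.inv(X.T @ X + np.diag(gamma))
    psi = rng.multivariate_normal(cov @ (X.T @ z), cov)

    # 4. p0 from the Beta full conditional stated in the text.
    k_in = int(delta[1:].sum())
    p0 = rng.beta(a0 + (p - 1 - k_in), b0 - 1 + k_in)
```

Sampling Ψ in one block, rather than coefficient by coefficient, is what makes step 3 a single multivariate normal draw.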

3. Results

3.1 MCMC convergence

I ran my Gibbs sampler for 110,000 iterations and thinned by keeping every 11th draw, both because of the high autocorrelation in the chain and because of memory storage constraints. I used no burn-in. Figure 2 shows that p0 and the model size (the number of non-zero variables from the “slab” dispersed Normal distribution) converge rapidly to a stationary distribution, but the values of the coefficients do not.
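The thinning step can be sketched as follows; the AR(1) chain here is a stand-in for an actual parameter trace, used only to illustrate the effect of keeping every 11th of 110,000 draws.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated highly autocorrelated AR(1) chain standing in for a
# sampled parameter trace from the Gibbs sampler.
chain = np.empty(110_000)
chain[0] = 0.0
noise = rng.standard_normal(110_000)
for t in range(1, len(chain)):
    chain[t] = 0.99 * chain[t - 1] + noise[t]

# Keep every 11th draw: 110,000 draws -> 10,000 stored samples.
thinned = chain[::11]

def lag1_autocorr(x):
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

# Thinning lowers the lag-1 autocorrelation (0.99^11 ~ 0.90 here)
# while cutting storage by a factor of 11.
print(lag1_autocorr(chain), lag1_autocorr(thinned))
```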

[Figure 2 appears here: trace plots of p0 (110,000 runs, thinned by 11), model size, and ψ117 against iteration index.]

Figure 2: Convergence (or lack thereof) for three quantities: p0, model size, and ψ117

An examination of the plots for more of the ψj coefficients indicates that the MCMC should be run many more times. Variables tend to be included in the model for long stretches of iterations, then become excluded. Variables can jump back into the model, but often many iterations are required for this. The stochastic nature of this procedure leads to high variability in the number of times a variable is included in or excluded from the model.

3.2 Marginal probabilities

The marginal frequency with which a given coefficient is drawn from the “slab” distribution (δj = 1) is an indication of how “important” that covariate is in the model.
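Computing these marginal inclusion proportions from saved indicator draws is straightforward; in this sketch the draws are simulated, since only the idea is being illustrated (array names hypothetical).

```python
import numpy as np

# 'deltas' stands in for the thinned MCMC draws of Delta
# (rows = draws, columns = covariates); simulated for illustration.
rng = np.random.default_rng(0)
n_draws, p = 10_000, 317
true_probs = rng.beta(0.3, 3.0, size=p)          # hypothetical
deltas = rng.random((n_draws, p)) < true_probs   # 0/1 indicator draws

# Proportion of draws in which each covariate came from the "slab".
inclusion = deltas.mean(axis=0)

# The ten covariates most often included in the model.
top10 = np.argsort(inclusion)[::-1][:10]
```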

[Figure 3 appears here: marginal inclusion probability plotted against covariate number; the ten most frequently included covariates are labeled 117, 174, 37, 77, 103, 267, 244, 266, 139, and 300.]

Figure 3: Top 10 “important” variables based on marginal sample proportion

The covariate X117 is included in the model most often, about 59% of the time. This is a good indication that this metabolite is related to whether or not a person has the disease. Directing research towards discovering the theoretical relationship between the disease and metabolite X117, or the other “important” covariates, might yield new insight into the cause of the disease or how it could best be treated. The next figure shows trace plots for the 5 metabolites occurring most frequently in the models.

[Figure 4 appears here: trace plots of ψ117, ψ174, ψ37, ψ77, and ψ103 against iteration index.]

Figure 4: Top 5 “important” coefficients

None of these coefficients converges to a stationary distribution. Another thing to notice is that these variables are not obviously correlated: there is no indication from simply looking at the plots that when one metabolite is included in the model, another will be excluded. However, such relationships do exist within this dataset. For example, X117 and X244 have a pairwise correlation of r = .99. By itself, each of these covariates is a good predictor of whether or not a person has the disease, but rarely does the model support both variables. Figure 5 shows that when one of these metabolites is included in the model, the other is nearly always excluded.
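The mutual-exclusion pattern can also be checked numerically from the indicator draws rather than by eye; in this sketch the two indicator chains are simulated to mimic the behavior described (names hypothetical).

```python
import numpy as np

# 'd117' and 'd244' stand in for the saved indicator chains of two
# highly correlated covariates; here their mutual exclusion is
# simulated: at most one of the pair is included in any draw.
rng = np.random.default_rng(0)
n_draws = 10_000
pick = rng.random(n_draws)
d117 = (pick < 0.59).astype(int)                  # included ~59% of draws
d244 = ((pick >= 0.59) & (pick < 0.90)).astype(int)

both = float(np.mean((d117 == 1) & (d244 == 1)))  # joint inclusion rate
either = float(np.mean((d117 == 1) | (d244 == 1)))
print(both, either)  # a joint rate near 0 signals mutual exclusion
```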

[Figure 5 appears here: trace plots of ψ117 and ψ244 against iteration index.]

Figure 5: ψ117 and ψ244, correlated coefficients

Because of the correlation between the covariates, simply looking at the marginal proportion of times each variable is included in the model could be misleading. However, it is one way to identify interesting variables.

3.3 Comparing the sizes of the coefficients

As predicted, when the variables are excluded from the model, i.e., come from the “spike” Normal distribution, nearly all of the coefficient values lie between -0.1 and 0.1. These values would change if c or τ were adjusted. When the variables are included in the model, their values range between -10 and 10. Figure 6 gives a graphical comparison of the coefficients for the covariate included in the model most often (X117) and the one excluded most often (X309).

[Figure 6 appears here: histograms of the sampled values of ψ117 (included in 5,865 draws) and ψ309 (included in 4 draws).]

Figure 6: Histograms of the contrasting coefficients ψ117 and ψ309

4. Conclusion

The slow mixing between models makes it hard to draw reliable inferences about which covariates are most important. Due to the high correlation among the 317 variables, the variables that appear most often in the first sequence of runs might not be the same as those that appear most often in subsequent runs. However, even in a limited number of runs, it appears that there are a few variables that are clearly important, and this method has a good chance of identifying them.

References

Albert, J. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88, 669–679.

Beecher, C. (2002). Metabolomics: A new “omics” technology. American Genomics - Proteomics Technology.

George, E. and McCulloch, R. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association 88, 881–889.