MARCEL DEKKER, INC. • 270 MADISON AVENUE • NEW YORK, NY 10016 ©2002 Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc.

JOURNAL OF BIOPHARMACEUTICAL STATISTICS Vol. 12, No. 2, pp. 121–135, 2002

BAYESIAN HIERARCHICAL MODELS FOR MULTI-LEVEL REPEATED ORDINAL DATA USING WinBUGS

Zhenguo Qiu,1 Peter X.-K. Song,1 and Ming Tan2,*

1 Department of Mathematics and Statistics, York University, Toronto, Ontario, Canada M3J 1P3
2 Department of Biostatistics, St. Jude Children’s Research Hospital, 332 N Lauderdale St, Memphis, TN 38139

*Corresponding author. Current address: Division of Biostatistics, University of Maryland, Greenebaum Cancer Center, 22 S. Greene St., Baltimore, MD 21201. E-mail: [email protected]

DOI: 10.1081/BIP-120014415. Copyright © 2002 by Marcel Dekker, Inc. 1054-3406 (Print); 1520-5711 (Online). www.dekker.com

ABSTRACT

Multi-level repeated ordinal data arise if ordinal outcomes are measured repeatedly in subclusters of a cluster or on subunits of an experimental unit. If both the regression coefficients and the correlation parameters are of interest, Bayesian hierarchical models have proved to be a powerful tool for analysis, with computation performed by Markov chain Monte Carlo (MCMC) methods. The hierarchical models extend the random effects models by including a (usually flat) prior on the regression coefficients and on the parameters in the distribution of the random effects. Because MCMC can be implemented with the widely available BUGS or WinBUGS software packages, its computational burden has been alleviated. However, care is essential in order to use this software effectively to analyze data with such complex structures. For example, we may have to reparameterize the model and center the covariates to accelerate the convergence of the MCMC, and then carefully monitor the convergence of the Markov chain. This article aims at resolving these issues in the application of WinBUGS through the analysis of a real multi-level ordinal dataset. In addition, we extend the hierarchical model to include a wider class of distributions for the random effects. We propose to use the deviance information criterion (DIC) for model selection. We show that the WinBUGS software can readily implement such extensions and the DIC.

Key Words: Bayesian methods; Multi-level ordinal data; Markov chain Monte Carlo; Hierarchical analysis; Random effects; WinBUGS

INTRODUCTION

Ordinal data arise frequently in studies where there is a meaningful ordering among the outcome categories of interest, for instance, the stages of cancers, a scale (say, of 4) of the patient’s improvement after treatment, or a tumor marker that is continuous in nature but can only be measured accurately up to an ordinal scale. When the ordinal outcome is measured repeatedly in subclusters of a cluster or on subunits of an experimental unit, a multi-level structure arises, and the analysis has to account for the ordering as well as for the correlations at several levels, such as the within-cluster correlation due to clustering of subjects and the within-subcluster correlation due to repeated measures on one subject.

To facilitate our presentation, we first describe a dataset from the study of the oral practice examinations (OPEs) used in many anesthesiology programs to familiarize anesthesiology residents with the format of the oral examination administered by the American Board of Anesthesiology. In this health economics study, the OPE outcome, the final grade, is categorized as “Definite Not Pass,” “Probable Not Pass,” “Probable Pass,” and “Definite Pass.” The study involved 163 residents who took between 1 and 6 OPEs, with an average of 2, each of which was evaluated by two board-certified anesthesiologists (or raters) randomly selected from a pool of 12. The central question is to identify the factors, such as the length of training, didactic experience, and other characteristics, that most influence the OPE outcome. It is also of interest to estimate the reliability coefficient (a function of variance components). Clearly, the response $y_{ijk}$ is a three-level ordinal variable, indexed by resident $i$, OPE $j$, and rater $k$.
The obvious complication in statistical analysis is the presence of the correlation among repeated examinations taken by a resident (correlation at level 2) and the correlation between raters (correlation at level 3). In summary, the OPE dataset represents a class of unbalanced multi-level ordinal data where both the regression coefficients and the correlation parameters are of interest, and where the number of repeated measures and the times at which they occur differ among subjects within a cluster. Consequently, a subject-specific model for this kind of data is preferable to a marginal model.[1]

Recently, both frequentist models (e.g., random effects models[2]) and Bayesian models[3,4] have been developed to analyze this kind of ordinal data. A good review on ordinal data analysis can be found in Agresti.[5] A more focused review on multi-level ordinal data can be found in Qu and Tan[3] and Tan et al.[4] The Bayesian hierarchical models extend the random effects model by specifying a (flat or diffuse) prior (see “Bayesian Hierarchical Models”) on the regression coefficients and on the parameters in the distribution of the random effects. See Ref. [6, page 17] for the discussion of the diffuse prior. If flat or diffuse priors are used, these models give results essentially similar to those from a frequentist model, because in that case the resulting posterior is equal or approximately equal to the normalized likelihood function.

Statistical inference in random effects models is well known to be computationally burdensome. Inferential approaches based on the marginal likelihood[1,7,8] require integrating the conditional likelihood over the distribution of the random effects.[2,9,10] Except for some special cases, such as normal data (see Ref. [11, page 85]) with the identity link and a complementary log–log link with gamma random effects,[12,13] such an integral cannot be obtained in closed form and requires numerical evaluation, usually over many dimensions. The dimensionality can be so high that a direct numerical evaluation becomes prohibitive in the frequentist modeling context.[14,15] Such computational difficulty has unfortunately hindered the wider use of random effects models. Although a variety of methods for multi-level modeling have been developed (summaries are provided at http://www.multilevel.ioe.ac.uk/ and http://www.lrz-lrz-muenchen.de/~wlm/wlmmule.htm), most of these methods are for normal, binary, or multinomial responses. Much less is available for multi-level ordinal random effects models, partly due to the complexity of the order restriction on the intercepts of the proportional odds model (see “Bayesian Hierarchical Models”).
Hedeker and Gibbons[2] proposed and implemented maximum likelihood estimation based on the Gaussian quadrature method for two-level random effects models using their software MIXOR (tigger.uic.edu/~hedeker). However, the MIXOR software currently does not fit three-level ordinal data such as the OPE dataset considered in this article. In contrast, the Bayesian hierarchical models[3,4] incorporate the multi-level structure easily. With these models, statistical inference on variance components or a function of them (such as the ratio of variance components) can be made easily by using samples generated by the Markov chain Monte Carlo (MCMC) method. The MCMC method overcomes the difficulty of evaluating high-dimensional integrals and can be implemented using the WinBUGS software. However, a thorough understanding of the model and of convergence issues in MCMC is essential for effective applications. For instance, great care is required in assessing the convergence of the algorithm to avoid misleading conclusions.

WinBUGS is the Windows version of the BUGS software (Bayesian inference Using Gibbs Sampling), developed by the MRC, Biostatistics Group, Institute of Public Health (www.mrc-bsu.cam.ac.uk/bugs).[16] Primarily, the software generates random samples from a series of conditional posterior distributions according to the MCMC algorithm. The samples generated after the convergence of the algorithm can then be used to form estimates of parameters, to perform model diagnostics, and to examine the goodness-of-fit. The Bayesian Output Analysis (BOA) software[17] (http://www.public-health.uiowa.edu/boa) is programmed in S-Plus with menu-driven functions, not only to assess convergence but also to carry out statistical inference such as estimation of the marginal posterior densities of parameters. The BUGS website also provides a program called CODA that carries out the same computations.

The purpose of this article is to show how to use the WinBUGS software effectively to analyze multi-level ordinal data through the analysis of a real dataset. In particular, we address the issue of convergence for the MCMC algorithm, with emphasis on strategies to improve the rate of convergence. In addition, we extend the hierarchical model[4] to include a wider class of distributions for the random effects. We propose to use the deviance information criterion (DIC) for model selection in multi-level ordinal models. We show that the WinBUGS software can readily implement such extensions and the DIC. Since there is a lack of general programs for analyzing multi-level ordinal data, the proposed extension and its implementation in WinBUGS provide a useful yet easily accessible tool for analyzing data of this kind.

In “Bayesian Hierarchical Models,” we present the model as a generalized proportional odds model for multi-level repeated ordinal data in the context of the OPE study. The DIC for model selection is detailed in “Model Selection.” “Improving Markov Chain Monte Carlo” discusses issues in implementing hierarchical models in WinBUGS, such as the role of reparameterization in improving MCMC convergence. “Oral Practice Examination Data Analysis” briefly summarizes the analysis results of the OPE data. We conclude with a Discussion.

BAYESIAN HIERARCHICAL MODELS

For simplicity and ease of exposition, we present the Bayesian hierarchical model in the context of the OPE data.
To begin, let $y_{ijk}$ be the final grade for resident $i$ given by the $k$th rater at the $j$th OPE, where $j \in A_i$ with $A_i$ being a subset of the set $A = \{1, 2, 3, 4, 5, 6\}$, each integer in $A$ indexing a given OPE. Here $i = 1, 2, \ldots, N$ with $N = 163$, and $k = 1, 2$. The response $y_{ijk}$ takes integer values between 1 and 4, with 1 indicating the worst. By the threshold approach, the response variable may be expressed in terms of a continuous latent variable $z_{ijk}$ taking values on $(-\infty, \infty)$ as follows:

$$y_{ijk} = m \ \text{ if and only if } \ z_{ijk} \in (\alpha_{m-1}, \alpha_m], \qquad m = 1, 2, 3, 4,$$

where $-\infty = \alpha_0 < \alpha_1 < \alpha_2 < \alpha_3 < \alpha_4 = \infty$. Clearly, the cumulative probability $\lambda_{ijk,m} = P(y_{ijk} \le m)$ represents the probability that the final grade for resident $i$ given by the $k$th rater at the $j$th OPE is not better than category $m$. Similar to the cumulative logit model, this cumulative probability is modeled as

$$h(\lambda_{ijk,m}) = \alpha_m + \beta_1 x_{1ij} + \beta_2 x_{2ij} + \beta_3 x_{3ij} + c_i + d_{ij}, \tag{2.1}$$

where $c_i$ is the resident-specific random effect and $d_{ij}$ is the examination-specific random effect. In effect, the model (2.1) results from the assumption that the latent variable $z_{ijk}$ follows a standardized logistic distribution. Understanding this relationship is important as it pertains to the sampling scheme that WinBUGS implements and codes. In the context of hierarchical models, the vector of random effects $\gamma_i = (c_i, d_{i1}, \ldots, d_{i n_i})'$ is usually assumed to be normally distributed with zero mean and variance–covariance matrix $D$, i.e., $\gamma_i \mid D \sim N(0, D)$. The three covariates (or explanatory variables) in the data are $x_{1ij}$, the length of training for the $i$th resident at the time of his/her $j$th examination, and $x_{2ij}$ and $x_{3ij}$, the self-assessed anxiety and preparedness levels of the resident at that time, respectively.

Link functions other than the logistic can be used in model (2.1), e.g., the probit link, which assumes that the latent variable follows a standard normal distribution. The proposed model (2.1) can be considered an extension of the classical cumulative logit models presented in McCullagh and Nelder.[18] It also appears to be an extension of the multi-level random effects model in Gibbons and Hedeker[19] for binary responses, and it can be regarded as analogous to Laird and Ware’s random effects model[20] for Gaussian responses.

Observing that the examinations are nested within residents and the raters are nested within examinations, we postulate, for simplicity, that the random effects at the resident level are not correlated with those at the examination level. This gives rise to a variance components structure for the variance–covariance matrix $D$, which accounts for the sources of variation at different levels of clustering. That is, the variance of the random effect $\gamma_i$ is $D = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_2^2)$, a matrix of dimension $n_i + 1$, with $n_i$ being the number of examinations that resident $i$ took. Here, we assume different variance components $\sigma_1^2$ and $\sigma_2^2$ for the random effects $c_i$ and $d_{ij}$, respectively.
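As an illustration of model (2.1) with the logit link, the following minimal Python sketch turns the cumulative probabilities $\lambda_{ijk,m}$ into the four category probabilities by differencing. All numerical values (cut-points, coefficients, covariates, random effects) are hypothetical and chosen only for illustration; they are not estimates from the OPE data.

```python
import math

def category_probs(cutpoints, eta):
    """Return P(y = m), m = 1..len(cutpoints)+1, when
    logit(P(y <= m)) = alpha_m + eta, as in model (2.1)."""
    # Cumulative probabilities lambda_m = P(y <= m) for m = 1..3
    cum = [1.0 / (1.0 + math.exp(-(a + eta))) for a in cutpoints]
    cum.append(1.0)  # P(y <= 4) = 1
    # Category probabilities are successive differences
    return [cum[0]] + [cum[m] - cum[m - 1] for m in range(1, len(cum))]

# Hypothetical values for illustration only
alpha = [-1.0, 0.5, 2.0]       # ordered cut-points alpha_1 < alpha_2 < alpha_3
beta = [-0.05, 0.4, -0.6]      # regression coefficients
x = [12.0, 1.0, 2.0]           # training length, anxiety, preparedness
c_i, d_ij = 0.3, -0.2          # resident and examination random effects

eta = sum(b * v for b, v in zip(beta, x)) + c_i + d_ij
probs = category_probs(alpha, eta)
print([round(p, 4) for p in probs])
```

Because the linear predictor enters $P(y \le m)$ directly, a larger value of `eta` shifts probability toward the worse grades, which is why a negative coefficient corresponds to a better expected grade.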
The Bayesian hierarchical model requires prior distributions for each parameter in the model. As usual, we assume vague (diffuse) normal priors $N(0, 1000^2)$ for the regression coefficients $\beta_l$, $l = 1, 2, 3$. Specifying the prior distributions for the intercept terms $\alpha_m$ needs extra caution because of the order restriction on them, namely, $\alpha_1 < \alpha_2 < \alpha_3$. To resolve this problem, we introduce parameters $\theta_1 > 0$, $\theta_2 > 0$ such that $\alpha_2 = \alpha_1 + \theta_1$ and $\alpha_3 = \alpha_2 + \theta_2$. Moreover, we assume a vague (diffuse) normal prior $N(0, 1000^2)$ for $\alpha_2$ and truncated normal $N(0, 1000^2)$ priors on the interval $(0, \infty)$ for $\theta_1$ and $\theta_2$. The main reason for choosing such priors is to avoid improper posteriors, which may lead to divergence of the MCMC.[21] Such approximately noninformative priors are commonly used in WinBUGS.[16,21] For both variance components $\sigma_1^2$ and $\sigma_2^2$, the inverse gamma prior distribution $IG(a_0, b_0)$ with parameter values $a_0 = b_0 = 0.0001$ is used.

MODEL SELECTION

The common Bayes (or Schwarz) information criterion (BIC/SIC)[22] for model selection is not appropriate for hierarchical models because the presence of random effects complicates the determination of the true number of free parameters in the assessment of model complexity. The DIC, proposed recently by Spiegelhalter et al.,[23] attempts to resolve this problem in the framework of an approximate Akaike information criterion (AIC), and is adopted here for model selection in the context of the hierarchical models. According to Ref. [23], the Bayesian deviance is given by

$$D(\theta \mid M_k) = -2 \log p(y \mid \theta, M_k) + 2 \log f(y)$$

for model $M_k$ with parameter $\theta$, where $f(y)$ is some known standardizing term that is fully given by the data. The effective number of free parameters is defined as

$$p_D = E_{\theta \mid y}[D] - D(E_{\theta \mid y}[\theta]) = \bar{D} - D(\bar{\theta}).$$

Moreover, the DIC takes the form

$$\mathrm{DIC} = \bar{D} + p_D = D(\bar{\theta}) + 2 p_D,$$

where, as usual, the term $\bar{D}$ measures the model fit and $p_D$ indicates the model complexity. Spiegelhalter et al.[23] showed asymptotically that the DIC is a generalization of the AIC. Computing the DIC is straightforward in an MCMC implementation. By tracking both $\theta$ and $D(\theta)$ over the MCMC iterations, at the end of sampling one can estimate $\bar{D}$ by the sample mean of the simulated values of $D$, and $D(\bar{\theta})$ by plugging in the sample mean of the simulated values of $\theta$. A lower value of DIC indicates a better-fitting model. We adopt the so-called null standardization criterion $D_0(\theta \mid M_k) = -2 \log L(\beta \mid M_k)$ in this article. Because the standardizing term $2 \log f(y)$ is independent of the model $M_k$, ignoring this term does not affect the conclusion of a model comparison. This model selection criterion allows us to assess the necessity of certain random effects in the hierarchical models, and it is particularly useful for determining a proper error structure for the random effects. The implementation of this model selection criterion is demonstrated numerically in “Oral Practice Examination Data Analysis.”

IMPROVING MARKOV CHAIN MONTE CARLO

The MCMC approach is to obtain a set of “independent” draws from the joint posterior distribution of the quantities of interest after the Markov chain has reached its stationary distribution.
This stationarity has been theoretically proved achievable when the Markov chain is geometrically ergodic.[24] The practical issue is to decide on a time after which the drawn samples can be regarded as independent samples. This is usually determined from various convergence diagnostics, such as those implemented in the BOA package. In practice, however, judging convergence is more involved since, in some models, the samples drawn can remain highly correlated, so that the algorithm needs to run for a large number of iterations in order to obtain independent samples. Such extremely slow convergence occurred in fitting the hierarchical model to the OPE data in this article.

For hierarchical models, one way to alleviate this is to thin the successive draws or to adopt the “ordered over-relaxation” method[25] to obtain low correlations in the recorded simulated values. Ordered over-relaxation generates multiple samples at each iteration and then picks one that is negatively correlated with the current value. This costs more time than thinning if the thinning interval is less than the number of multiple samples, so it is less frequently used than thinning. In the example considered in “Oral Practice Examination Data Analysis,” ordered over-relaxation is not effective, as it produces a nondiminishing, sine-shaped autocorrelation pattern.

Another way is to modify the structure of the model via reparameterization strategies or the “sweeping” method.[26] Before implementing such a method, it is useful to explore the between- and within-variable correlations through a pilot study based on a small number of draws, as a basis for choosing a strategy for improvement. In the OPE study, we first generated 4000 preliminary samples from WinBUGS for such an investigation, and found that both between- and within-variable correlations were very high. In the following, we illustrate two reparameterization strategies adopted in the analysis of the OPE data, which considerably improved the rate of convergence and hence significantly reduced the computational time.
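The effect of thinning on the correlation of recorded draws can be illustrated with a small simulation. The sketch below is not the OPE analysis: it uses a hypothetical first-order autoregressive chain (coefficient 0.9, an assumption made purely for illustration) as a stand-in for a slowly mixing sampler and compares the lag-1 autocorrelation before and after keeping every 10th draw.

```python
import random

def autocorr(xs, lag):
    """Sample autocorrelation of the sequence xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((v - mean) ** 2 for v in xs)
    cov = sum((xs[t] - mean) * (xs[t + lag] - mean) for t in range(n - lag))
    return cov / var

def ar1_chain(n, phi, seed=0):
    """Hypothetical stand-in for a slowly mixing chain: x_t = phi * x_{t-1} + noise."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

chain = ar1_chain(20000, phi=0.9)
thinned = chain[::10]                 # thinning interval of 10

r_full = autocorr(chain, 1)           # close to phi for a long chain
r_thin = autocorr(thinned, 1)         # roughly phi**10 in expectation
print(round(r_full, 3), round(r_thin, 3))
```

Thinning lowers the correlation between recorded values but discards draws, so it reduces storage and post-processing cost rather than the number of iterations the sampler must run; this is why the reparameterizations discussed next target the mixing of the chain itself.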

Reparameterization: Centering Around Means

The first reparameterization strategy is to center the covariates around their means, so that the fixed effect parameters become more nearly orthogonal. That is, part of the hierarchical model is reparameterized as

$$\mathrm{logit}(\lambda_{ijk,m}) = \alpha'_m + \beta_1 (x_{1ij} - \bar{x}_1) + \beta_2 (x_{2ij} - \bar{x}_2) + \beta_3 (x_{3ij} - \bar{x}_3) + c_i + d_{ij},$$

where $\bar{x}_l$, $l = 1, 2, 3$, are the means of the covariates $x_{lij}$ and $\alpha'_m = \alpha_m + \beta_1 \bar{x}_1 + \beta_2 \bar{x}_2 + \beta_3 \bar{x}_3$, $m = 1, 2, 3$. This reparameterization is in fact widely used in classical regression analysis to deal with the multicollinearity problem. It is appealing here because the reparameterized hierarchical models have better correlation properties. As shown by the OPE analysis in “Oral Practice Examination Data Analysis,” the reparameterized models are equivalent to the original hierarchical models but converge faster.
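The equivalence between the original and the mean-centered parameterization can be checked numerically. The following sketch uses hypothetical covariate rows, coefficients, and cut-points (none taken from the OPE data) and verifies that, with $\alpha'_m = \alpha_m + \sum_l \beta_l \bar{x}_l$, both forms give the same fixed-effect part of the linear predictor.

```python
def linear_predictor(alpha_m, beta, x):
    """Original form: alpha_m + sum_l beta_l * x_l."""
    return alpha_m + sum(b * v for b, v in zip(beta, x))

def centered_predictor(alpha_m_prime, beta, x, x_bar):
    """Centered form: alpha'_m + sum_l beta_l * (x_l - xbar_l)."""
    return alpha_m_prime + sum(b * (v - m) for b, v, m in zip(beta, x, x_bar))

# Hypothetical covariate rows (training length, anxiety, preparedness)
rows = [[6.0, 2.0, 3.0], [18.0, 1.0, 4.0], [30.0, 3.0, 2.0]]
x_bar = [sum(col) / len(rows) for col in zip(*rows)]

beta = [-0.05, 0.4, -0.6]      # hypothetical coefficients
alpha = [-1.0, 0.5, 2.0]       # hypothetical cut-points

# alpha'_m = alpha_m + beta_1*xbar_1 + beta_2*xbar_2 + beta_3*xbar_3
alpha_prime = [a + sum(b * m for b, m in zip(beta, x_bar)) for a in alpha]

max_diff = max(
    abs(linear_predictor(a, beta, x) - centered_predictor(ap, beta, x, x_bar))
    for a, ap in zip(alpha, alpha_prime)
    for x in rows
)
print(max_diff)
```

The two parameterizations therefore describe the same likelihood; only the posterior correlation between the intercepts and the slopes, and hence the mixing of the sampler, changes.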

Reparameterization: Hierarchical Centering

Another reparameterization strategy is the so-called hierarchical centering, proposed by Gelfand et al.,[27,28] with the same aim of acquiring better correlation properties in the reformulated hierarchical models, so that the computational time per update is reduced. In particular, we consider “partial” hierarchical centering, in which the random effects are centered around the nested mean instead of zero. That is, let

$$e_{ij} = \beta_1 (x_{1ij} - \bar{x}_1) + \beta_2 (x_{2ij} - \bar{x}_2) + \beta_3 (x_{3ij} - \bar{x}_3) \quad \text{and} \quad \bar{e} = \beta_1 \bar{x}_1 + \beta_2 \bar{x}_2 + \beta_3 \bar{x}_3.$$

We take

$$\mathrm{logit}(\lambda_{ijk,1}) = -\theta_1 + d'_{ij} + e_{ij},$$
$$\mathrm{logit}(\lambda_{ijk,2}) = d'_{ij} + e_{ij},$$
$$\mathrm{logit}(\lambda_{ijk,3}) = \theta_2 + d'_{ij} + e_{ij},$$

where $d'_{ij} \sim \mathrm{Normal}(c'_i, \sigma_2^2)$ and $c'_i \sim \mathrm{Normal}(\alpha'_2, \sigma_1^2)$, i.e., $d'_{ij} = c'_i + d_{ij}$ and $c'_i = \alpha'_2 + c_i$, with $\theta_1 > 0$, $\theta_2 > 0$. Then, the cut-points $\alpha_m$, $m = 1, 2, 3$, are recovered by

$$\alpha_1 = \alpha'_2 - \theta_1 - \bar{e}, \qquad \alpha_2 = \alpha'_2 - \bar{e}, \qquad \alpha_3 = \alpha'_2 + \theta_2 - \bar{e}.$$

According to Gelfand et al.,[27,28] the main reason that the hierarchical centering method shortens the computational time is that the updated draws of $\alpha'_2$, $c'_i$, and $d'_{ij}$ become less correlated, and hence a shorter Markov chain is needed for the Monte Carlo algorithm. In addition, by focusing on the main part of the model, we only need to simulate $d'_{ij}$ instead of simulating $\alpha_2$, $c_i$, and $d_{ij}$ as in model (2.1); that is, the updated draws in the hierarchically centered model are “conjugate,” so that the computational time per iteration is reduced.

ORAL PRACTICE EXAMINATION DATA ANALYSIS

We first generated 4000 preliminary samples and analyzed them using the Raftery and Lewis convergence diagnostic (available in BOA) to get some preliminary information about the burn-in, the thinning interval, and the total number of iterations required for the data analysis. The most important issue in the implementation of the MCMC method is to determine how many iterations a program needs to run and after which iteration the generated samples can be used for inference. In the OPE data analysis, three chains, started from different initial values of the precisions of the random effects and run with a five-iteration thinning interval, are recorded from iteration 2001 to iteration 4000. The Gelman-Rubin method is used for convergence diagnostics for the multiple chains.
Then, two diagnostics for checking stationarity and convergence, Heidelberger and Welch’s stationarity test and Geweke’s Z-score test, both available in the BOA package, are applied to the combined 6000 iterations from the three chains.
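For readers unfamiliar with the multiple-chain diagnostic, the sketch below computes the basic Gelman-Rubin potential scale reduction factor from the between-chain and within-chain variances. It is a simplified textbook version written for illustration, not the BOA or WinBUGS implementation, and the chains are simulated (hypothetical) rather than taken from the OPE analysis.

```python
import random
import statistics

def gelman_rubin(chains):
    """Basic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    # W: average within-chain variance; B: n times the variance of chain means
    w = statistics.fmean([statistics.variance(c) for c in chains])
    b = n * statistics.variance(means)
    var_hat = (n - 1) / n * w + b / n   # pooled estimate of the posterior variance
    return (var_hat / w) ** 0.5

def simulate_chain(n, center, seed):
    """Hypothetical chain: independent normal draws around a given center."""
    rng = random.Random(seed)
    return [rng.gauss(center, 1.0) for _ in range(n)]

# Three chains of 2000 recorded draws each, mimicking the design described above
mixed = [simulate_chain(2000, 0.0, s) for s in (1, 2, 3)]
stuck = [simulate_chain(2000, c, s) for c, s in ((0.0, 1), (3.0, 2), (6.0, 3))]

r_mixed = gelman_rubin(mixed)   # near 1 when the chains agree
r_stuck = gelman_rubin(stuck)   # well above 1 when the chains disagree
print(round(r_mixed, 3), round(r_stuck, 3))
```

A value close to 1 indicates that the chains are indistinguishable in location and spread; it does not by itself prove convergence to the stationary distribution, a caveat that applies to every output-based diagnostic.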

MARCEL DEKKER, INC. • 270 MADISON AVENUE • NEW YORK, NY 10016 ©2002 Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc.

BAYESIAN HIERARCHICAL MODELS

129

Following the discussion in “Model Selection,” we compare four models: (a) the naive model with no random effects; (b) the level-1 or partial hierarchical model that incorporates only the resident-specific random effects $c_i$ but not the examination-specific random effects $d_{ij}$; (c) the level-2 or full hierarchical model that includes both random effects, $c_i$ and $d_{ij}$, assuming that the random effects are normally distributed with variance–covariance matrix $D$, namely, model (2.1); and (d) the full hierarchical model but assuming that the random effects are independently $t$-distributed, $t_4(0, \sigma_k^2)$, $k = 1, 2$, to allow for possible heavy-tailed distributions. Table 1 reports the values of the DIC and related quantities for the four models. The value of $p_D$ represents the effective number of free parameters in a given model, which may differ from the number of parameters to be estimated. For the four models considered, the $p_D$ of model (a) is close to 6, the number of parameters in this naive model. The values of $p_D$ for models (b), (c), and (d) are much smaller than the number of parameters: 116.896 vs. 169 for model (b), 190.186 vs. 502 for model (c), and 202.613 vs. 502 for model (d). The increased complexity of model (c) does not wash away its gain in model fit, so the DIC favors (c) over (a) and (b). Model (c) is less complex than model (d), although the model fit $\bar{D}$ of model (c) is slightly larger than that of model (d). Therefore, the DIC selects model (c) over models (a), (b), and (d). This result confirms the hierarchical model with variance components introduced in “Bayesian Hierarchical Models.” We thus use model (c) to analyze the OPE data. The initial values of the fixed effects were specified essentially as the means of the prior distributions (see Appendix A for more details). We used the dynamic trace in WinBUGS and BOA to detect possible nonstationary chains. The autocorrelation in each parameter of the model is checked using BOA.
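The DIC bookkeeping can be reproduced directly from the two deviance summaries reported for each model in Table 1. The short sketch below recomputes $p_D = \bar{D} - D(\bar{\theta})$ and $\mathrm{DIC} = \bar{D} + p_D$ from those reported values; the function and variable names are illustrative choices, not part of WinBUGS.

```python
def dic_summary(d_bar, d_at_mean):
    """Return (p_D, DIC) from the posterior mean deviance and the
    deviance evaluated at the posterior mean of the parameters."""
    p_d = d_bar - d_at_mean
    return p_d, d_bar + p_d

# (mean deviance, deviance at posterior mean) as reported in Table 1
table1 = {
    "(a) Naive": (1489.0, 1483.349),
    "(b) Level-1": (1199.0, 1082.104),
    "(c) Level-2 VC": (854.9, 664.714),
    "(d) Level-2 t4": (850.6, 647.987),
}

results = {name: dic_summary(*vals) for name, vals in table1.items()}
best = min(results, key=lambda name: results[name][1])

for name, (p_d, dic) in results.items():
    print(f"{name:15s} pD = {p_d:8.3f}  DIC = {dic:9.3f}")
print("lowest DIC:", best)
```

The comparison between models (c) and (d) shows the trade-off at work: model (d) has the smaller mean deviance, but its larger $p_D$ more than offsets that gain.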
Table 1. Model Comparisons for OPE Study

Model             $\bar{D}_0$    $D_0(\bar{\theta})$    $p_D$       DIC
(a) Naive         1489       1483.349     5.651     1494.651
(b) Level-1       1199       1082.104   116.896     1315.896
(c) Level-2 VC     854.9      664.714   190.186     1045.086
(d) Level-2 t4     850.6      647.987   202.613     1053.213

The autocorrelations are between -0.026 and 0.027 at Lag 1, and between -0.015 and -0.0001 at Lag 50, implying minimal autocorrelation. The results from BOA are summarized in Table 2, in which “Sd” stands for the sample standard deviation, “MC error” for the Monte Carlo error, and “2.5%,” “Median,” “97.5%” for the estimated 2.5, 50, and 97.5% quantiles of the posterior distribution of each parameter, respectively. From the table we obtain not only point estimates such as means and medians but also interval estimates such as 95% credible intervals for the parameters. The chosen number of iterations was deemed sufficient to achieve convergence and stationarity by the Heidelberger and Welch convergence criterion. All of the Geweke’s Z-scores were within $(-1.96, 1.96)$, and the scale reduction factors of the Brooks–Gelman–Rubin method (the last column of Table 2) are all 1.00 (up to two decimal places).

The estimates show that a better final grade is associated with a longer length of training, a lower self-reported anxiety level, and a higher self-reported preparedness for the OPE. From the 95% credible intervals, it is evident that all three covariates are important to success in the examination. In particular, a precise interpretation of $\hat{\beta}_1 = -0.0528$ is as follows. After 12 months of training, for a given resident and a given rater, the odds that the resident’s OPE is at least grade $m$ vs. lower than $m$ ($m = 1, 2, 3, 4$) increase by a factor of $e^{0.0528 \times 12} \approx 1.88$.

To obtain estimates of the correlation parameters, we note that the variance of the standard logistic distribution is $\pi^2/3$. Then, the correlation coefficient ($\rho_1$) between the examinations for the same resident is

$$\rho_1 = \sigma_1^2 \, (\sigma_1^2 + \sigma_2^2 + \pi^2/3)^{-1},$$

and the correlation coefficient ($\rho_2$) between the two raters for the same resident and the same examination is

$$\rho_2 = (\sigma_1^2 + \sigma_2^2)(\sigma_1^2 + \sigma_2^2 + \pi^2/3)^{-1},$$

where $\rho_2$ can be considered as the inter-rater reliability coefficient.[29] Because both correlation parameters are functions of $\sigma_1^2$ and $\sigma_2^2$, they can be expressed as logical nodes in the WinBUGS program (see Appendix A for details), and the MCMC method then easily produces summary statistics of the posterior distributions of these two correlation parameters. It is clear from the 95% credible intervals that both between-examination correlation and between-rater correlation are present.
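The arithmetic behind the odds interpretation and the two correlation formulas can be sketched as follows. The inputs are the posterior means reported in Table 2; plugging posterior means into a nonlinear function is only an approximation, so the plug-in values need not equal the posterior means of $\rho_1$ and $\rho_2$ obtained by evaluating the formulas at every MCMC draw.

```python
import math

# Posterior means reported in Table 2
beta1_hat = -0.0528
sigma1_sq = 5.3380   # between-resident variance
sigma2_sq = 9.1110   # between-examination variance

# Multiplicative change in the odds of a better grade after 12 more months
odds_factor = math.exp(-beta1_hat * 12)

LOGISTIC_VAR = math.pi ** 2 / 3   # variance of the standard logistic distribution

def correlations(s1, s2):
    """Plug-in rho_1 (between examinations) and rho_2 (between raters)."""
    total = s1 + s2 + LOGISTIC_VAR
    return s1 / total, (s1 + s2) / total

rho1, rho2 = correlations(sigma1_sq, sigma2_sq)
print(round(odds_factor, 2), round(rho1, 4), round(rho2, 4))
```

The plug-in values are close to, but not identical with, the posterior means 0.2974 and 0.8087 in Table 2, which illustrates why functions of the variance components are best monitored directly in the sampler.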
Table 2. Parameter Estimates of OPE Study Using WinBUGS

Variable       Mean      Sd        MC Error   2.5%      Median    97.5%     Z-Score
$\alpha_1$     0.7533    1.0900    0.02460    -1.3190   0.7451    2.8960    0.0142a
$\alpha_2$     1.7730    1.0970    0.02518    -0.2946   1.7690    3.9370    0.0649a
$\alpha_3$     3.0620    1.1130    0.02574    0.9690    3.0400    5.2520    0.5170a
$\beta_1$      -0.0528   0.0247    5.55E-4    -0.1028   -0.0522   -0.0056   0.0488a
$\beta_2$      0.3857    0.1506    0.00370    0.1022    0.3816    0.6942    1.5360a
$\beta_3$      -0.6449   0.1485    0.00357    -0.9484   -0.6426   -0.3630   1.8485a
$\sigma_1^2$   5.3380    1.9620    0.05954    2.2100    5.0820    9.8180    1.9540a
$\sigma_2^2$   9.1110    2.3410    0.06398    5.2480    8.8790    14.520    1.8675a
$\rho_1$       0.2974    0.07974   0.00235    0.1436    0.2977    0.4512    1.7647a
$\rho_2$       0.8087    0.03343   9.88E-4    0.7375    0.8110    0.8680    1.8355a

a Met Heidelberger and Welch convergence diagnostic test criterion.

It is also noted that the posterior means for the two variance parameters $\sigma_1^2$ (between-resident variation) and $\sigma_2^2$ (between-examination variation) are different, and the estimate of $\sigma_1^2$ is smaller than the estimate of $\sigma_2^2$. This ordering seems to be preserved even in cases where we used rather flat priors for both parameters. This is due partly to the fact that residents who have never taken the OPE before tend to do worse in the exam than those who have taken it at least once. In summary, WinBUGS enables us to reduce the computational burden considerably, which in turn makes model selection feasible. In addition, the Bayesian hierarchical model, as an extension of the mixed-effects model, easily yields the resident-specific and examination-specific variance component estimates that are of practical interest.

DISCUSSION

We showed that multi-level ordinal data can be effectively analyzed using Bayesian hierarchical models with WinBUGS. In general, such models provide a promising alternative for analyzing ordinal data with a complex multi-level structure where the correlation parameters are also of interest. The Gibbs sampler avoids numerical integration, a hurdle for inference in random effects models. Through the hierarchical Bayesian models, we can compute the posterior distributions of various parameters of interest, which provide more information than maximum likelihood estimation does, even when the latter is computationally feasible. The method we used to analyze the OPE data also provides a general way to calculate reliability coefficients for ordinal outcomes, which are commonly used in psychometric studies. In addition, the inference based on this model is robust with respect to the specification of the hyperparameters. The WinBUGS software, together with the BOA package, greatly eases monitoring convergence, accelerating and optimizing computation, and checking and selecting models.[30] The WinBUGS software substantially reduces the computational burden and provides an effective inferential tool for the proposed model, although thoughtful analysis, such as proper reparameterization, is essential in applying this program. Finally, it is worth keeping in mind the limitations of the current convergence diagnostics, as pointed out in Gelfand[31]: “in principle, convergence can never be assessed using such output, as comparison can be made only between different iterations of one chain or between different observed chains, but never with the true stationary distribution.”

APPENDIX A

This appendix gives a brief introduction to BUGS programming and lists the BUGS programs used for the data analyses in this article. One programming trick that we find useful for handling more than two indices in multi-level data is detailed below. Readers who are already familiar with the BUGS software can skip this part. The syntax of the WinBUGS language directly describes the relationships among the variables and parameters of a given model by two forms of relation: stochastic “~” and logical “<-”. Additionally, the corresponding priors and initial values of the “nodes (variables and parameters) without parents” must be specified in the programs. However, the current version of WinBUGS restricts mixture nodes to at most two indices, so nested indexing is necessary to reduce the number of indices in our models by one or two. We simply adopted index variables for the i and j levels. Denote N = 666 as the number of observations, NS = 163 as the number of subjects, NJ = 666/2 = 333 as the number of OPE papers that one rater marked, Ncut = 3 as the number of cut-points, K = 2 as the number of raters, and h as the index of overall observations, so that i = U(h) and j = V(h), where U and V are the index variables for levels i and j. “x1.bar,” “x2.bar,” and “x3.bar” are the mean values (given in the data set) of “x1,” “x2,” and “x3,” respectively.

    model {
      for (h in 1:N) {
        covc[h] <- b[1] * (x1[h] - x1.bar) + b[2] * (x2[h] - x2.bar) + b[3] * (x3[h] - x3.bar)
        logit(Q[h,1]) <- -a[1] + d[U[h], V[h]] + covc[h]
        logit(Q[h,2]) <- d[U[h], V[h]] + covc[h]
        logit(Q[h,3]) <- a[3] + d[U[h], V[h]] + covc[h]
        # probability of response = m
        mu[h,1] <- Q[h,1]
        for (m in 2:Ncut) {
          mu[h,m] <- Q[h,m] - Q[h,m-1]
        }
        mu[h,(Ncut+1)] <- 1 - Q[h,Ncut]
        y[h] ~ dcat(mu[h, 1:(Ncut+1)])
        # loglikelihood:
        lik[h] <- log(mu[h, y[h]])
      }
      # Resident-specific random effects
      for (i in 1:NS) {
        c[i] ~ dnorm(a[2], tau[1])
        # t(4): c[i] ~ dt(a[2], tau[1], 4)
      }
      # Resident*exam-specific random effects
      for (j in 1:NJ) {
        d[U[K*j], V[K*j]] ~ dnorm(c[U[K*j]], tau[2])
        # t(4): d[U[K*j], V[K*j]] ~ dt(c[U[K*j]], tau[2], 4)
      }
      # Priors
      for (k in 1:3) {
        b[k] ~ dnorm(0, 1.0E-06)
      }
      a[2] ~ dnorm(0, 1.0E-06)
      a[1] ~ dnorm(0, 1.0E-06)I(0, )
      a[3] ~ dnorm(0, 1.0E-06)I(0, )
      for (k in 1:2) {
        tau[k] ~ dgamma(0.0001, 0.0001)
        sigma[k] <- 1 / tau[k]
      }
      # Correlation coefficients and intercepts on original scale:
      deno <- sigma[1] + sigma[2] + ((pow(pi, 2)) / 3)
      rho[1] <- sigma[1] / deno
      rho[2] <- (sigma[1] + sigma[2]) / deno
      x.bar <- b[1] * x1.bar + b[2] * x2.bar + b[3] * x3.bar
      cut[1] <- a[2] - a[1] - x.bar
      cut[2] <- a[2] - x.bar
      cut[3] <- a[2] + a[3] - x.bar
      # Deviance
      D0 <- -2 * sum(lik[])
    }

    Inits
    list(b = c(0,0,0), a = c(1,0,1), tau = c(10,10))
    list(b = c(0,0,0), a = c(1,0,1), tau = c(1,1))
    list(b = c(0,0,0), a = c(1,0,1), tau = c(0.1,0.1))
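
The deterministic part of the program above can be checked outside WinBUGS. The following Python sketch, offered as an illustration rather than as part of the original analysis, mirrors two steps of the program: the nested-index lookup d[U[h], V[h]], which lets the single observation index h address a two-index random effect, and the conversion of the three cumulative logits into the four category probabilities mu[h, 1:4]. The helper names inv_logit and category_probs, the toy index vectors, and all numerical values are hypothetical.

```python
import math


def inv_logit(x):
    """Inverse of the logit link: maps a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def category_probs(a1, a3, d_ij, covc):
    """Return the four category probabilities for one observation.

    Mirrors the BUGS nodes Q[h, 1:3] and mu[h, 1:4]; a1 and a3 must be
    positive so that the cumulative probabilities are increasing.
    """
    q = [inv_logit(-a1 + d_ij + covc),
         inv_logit(d_ij + covc),
         inv_logit(a3 + d_ij + covc)]
    mu = [q[0]]
    mu += [q[m] - q[m - 1] for m in range(1, len(q))]
    mu.append(1.0 - q[-1])
    return mu


# Hypothetical toy layout: 2 residents and K = 2 raters per examination,
# so each consecutive pair of observations shares one (resident, exam) cell.
K = 2
U = [1, 1, 1, 1, 2, 2]   # resident index i = U(h)
V = [1, 1, 2, 2, 1, 1]   # examination index j = V(h)
d = {(1, 1): 0.3, (1, 2): -0.2, (2, 1): 1.1}  # placeholder random effects

# As in the loop "for (j in 1:NJ)", every K-th observation identifies one cell.
NJ = len(U) // K
cells = [(U[K * j - 1], V[K * j - 1]) for j in range(1, NJ + 1)]

for h in range(len(U)):
    # Nested indexing: one loop index h addresses a two-index node.
    d_ij = d[(U[h], V[h])]
    mu = category_probs(a1=1.0, a3=1.5, d_ij=d_ij, covc=0.0)
    print(h + 1, [round(p, 3) for p in mu])
```

The differences of adjacent cumulative probabilities are positive only when a[1] and a[3] are positive, which is what the truncation I(0, ) on their priors enforces in the program.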

ACKNOWLEDGMENTS

The authors are grateful to the referees for their constructive comments and valuable suggestions that led to a significant improvement of this paper. This work is partially supported by the American, Lebanese, Syrian Associated Charities and by NIH award CA 21765 for the first and the last authors, as well as by the Natural Sciences and Engineering Research Council of Canada for the second author.

REFERENCES

1. Heagerty, P.J.; Zeger, S.L. Marginal Regression Models for Clustered Ordinal Measurements. J. Am. Stat. Assoc. 1996, 91, 1024–1036.
2. Hedeker, D.; Gibbons, R.D. A Random-Effects Ordinal Regression Model for Multilevel Analysis. Biometrics 1994, 50, 933–944.
3. Qu, Y.; Tan, M. Analysis of Clustered Ordinal Data with Subclusters via a Bayesian Hierarchical Model. Commun. Stat. Ser. A: Theory Method 1998, 27, 1461–1476.
4. Tan, M.; Qu, Y.; Mascha, E.; Schubert, A. A Bayesian Hierarchical Model for Multilevel Repeated Ordinal Data: Analysis of Oral Practice Examinations in a Large Anesthesiology Training Programme. Stat. Med. 1999, 18, 1983–1992.
5. Agresti, A. Modelling Ordered Categorical Data: Recent Advance and Future Challenges. Stat. Med. 1999, 18, 2191–2207.
6. Tanner, M.A. Tools for Statistical Inference, 3rd Ed.; Springer: New York, 1996.
7. Molenberghs, G.; Lesaffre, E. Marginal Modeling of Correlated Ordinal Data Using a Multivariate Plackett Distribution. J. Am. Stat. Assoc. 1994, 89, 633–644.
8. Miller, M.E.; Davis, C.S.; Landis, J.R. The Analysis of Longitudinal Polytomous Data: Generalized Estimating Equations and Connections with Weighted Least Squares. Biometrics 1993, 49, 1033–1044.
9. Zeger, S.L.; Liang, K.Y.; Albert, P. Models for Longitudinal Data: A Generalized Estimating Equation Approach. Biometrics 1988, 44, 1049–1060.
10. Harville, D.A.; Mee, R.W. A Mixed Model Procedure for Analyzing Ordered Categorical Data. Biometrics 1984, 40, 393–408.
11. Lindsey, J.K. Models for Repeated Measurements; Oxford University Press Inc.: New York, 1993.
12. Conaway, M.R. A Random Effects Model for Binary Data. Biometrics 1990, 46, 317–328.
13. Crouchley, R. A Random-Effects Model for Ordered Categorical Data. J. Am. Stat. Assoc. 1995, 90, 489–498.
14. Tutz, G.; Hennevogl, W. Random Effects in Ordinal Regression. Comput. Stat. Data Anal. 1996, 22, 537–557.
15. Jiang, J. Consistent Estimators in Generalized Linear Mixed Models. J. Am. Stat. Assoc. 1998, 93, 720–729.
16. WinBUGS 1.3: MRC Biostatistics Unit; Institute of Public Health, Cambridge University: Cambridge, 1999.
17. Smith, B.J. Bayesian Output Analysis Program (BOA) Version 0.5.0 User Manual; Department of Biostatistics, The University of Iowa College of Public Health: Iowa, 2000.
18. McCullagh, P.; Nelder, J.A. Generalized Linear Models, 2nd Ed.; Chapman and Hall: New York, 1989.
19. Gibbons, R.D.; Hedeker, D. Random-Effects Probit and Logistic Regression Models for Three-Level Data. Biometrics 1997, 53, 1527–1537.
20. Laird, N.M.; Ware, J.H. Random-Effects Models for Longitudinal Data. Biometrics 1982, 38, 963–974.
21. DuMouchel, W.; Waternaux, C. Discussion on Hierarchical Models for Combining Information and for Meta-analyses by C. N. Morris and S. L. Normand. In Bayesian Statistics, 4; Bernardo, J.M., Berger, J.O., Dawid, A.P., Smith, A.F.M., Eds.; Oxford University Press: Oxford, 1992; 338–341.
22. Kass, R.; Raftery, A. Bayes Factors and Model Uncertainty. J. Am. Stat. Assoc. 1995, 90, 773–795.
23. Spiegelhalter, D.J.; Best, N.G.; Carlin, B.P. Bayesian Deviance, the Effective Number of Parameters, and the Comparison of Arbitrarily Complex Models. Technical Report; Medical Research Council Biostatistics Unit: Cambridge, 1998.
24. Robert, C.P.; Casella, G. Monte Carlo Statistical Methods; Springer-Verlag: New York, 1999.
25. Neal, R. Suppressing Random Walks in Markov Chain Monte Carlo Using Ordered Over-Relaxation. In Learning in Graphical Models; Jordan, M.I., Ed.; Kluwer Academic Publishers: Dordrecht, 1998; 205–230.
26. Gilks, W.R.; Richardson, S.; Spiegelhalter, D.J. (Eds.) Markov Chain Monte Carlo in Practice; Chapman and Hall: London, 1996.
27. Gelfand, A.E.; Sahu, S.K.; Carlin, B.P. Efficient Parameterizations for Normal Linear Mixed Models. Biometrika 1995, 82, 479–488.
28. Gelfand, A.E.; Sahu, S.K.; Carlin, B.P. Efficient Parametrizations for Generalized Linear Mixed Models. Bayesian Stat. 1996, 5, 165–180.
29. Schubert, A.; Tetzlaff, J.; Tan, M.; Ryckman, V.; Mascha, E. Consistence, Inter-rater Reliability, and Validity of 441 Consecutive Mock Oral Examinations in Anesthesiology: Implications for Use as a Tool for Assessment of Residents. Anesthesiology 1999, 91, 288–298.
30. Newton, M.A.; Raftery, A.E. Approximate Bayesian Inference by Simulation by the Weighted Likelihood Bootstrap (with Discussion). J. R. Stat. Soc., Ser. B 1994, 56, 449–455.
31. Gelfand, A.E. Gibbs Sampling. J. Am. Stat. Assoc. 2000, 95, 1300–1304.

Received April 2000; Revised September 2000, March 2001, May 2002; Accepted May 2002