A Bayesian Approach to Robust Binary Nonparametric Regression

Sally Wood and Robert Kohn
Australian Graduate School of Management, University of New South Wales, Sydney 2052, Australia

February 25, 1997

Summary

This paper presents a Bayesian approach to binary nonparametric regression which assumes that the argument of the link function is an additive function of the explanatory variables and their multiplicative interactions. The paper makes the following contributions. First, a comprehensive approach is presented in which the function estimates are smoothing splines with the smoothing parameters integrated out, and the estimates are made robust to outliers. Second, the approach can handle a wide range of link functions. Third, efficient state space based algorithms are used to carry out the computations. Fourth, an extensive set of simulations is carried out which shows that the Bayesian estimator works well and compares favorably to two estimators which are widely used in practice.

KEY WORDS: Data augmentation; Gibbs sampler; State space model; Mixture of normals.

Sally Wood is a PhD student and Robert Kohn is a professor, both at the Australian Graduate School of Management. This is part of Sally Wood's PhD thesis. We would like to thank Geoff Eagleson, Simon Sheather, Mike Smith, and Matt Wand for constructive comments. We would also like to thank Grace Wahba, Yuedong Wang and Clive Loader for making their software available to us.


1 Introduction

Consider a study in which subjects are classified as either having had a heart attack (HA = 1) or not (HA = 0), and suppose that for each subject the risk factors age (Age), blood pressure (BP), and cholesterol ratio (CR) are also measured. A popular way of modeling the probability of having a heart attack given these risk factors is to write

$$ p(HA = 1 \mid Age, BP, CR) = H\{g_1(Age) + g_2(BP) + g_3(CR)\} . \qquad (1.1) $$

The function H is a cumulative distribution function (cdf) and is called a link function. If H is the standard normal cdf then (1.1) is called a probit regression, and if H is the logistic function $e^s/(1+e^s)$ then (1.1) is called a logistic regression. If the functions $g_1$, $g_2$, and $g_3$ are specified except for a few parameters, for example if $g_1$, $g_2$, and $g_3$ are linear functions, then (1.1) is a binary parametric regression. There is an extensive literature on binary parametric regression which is summarized by McCullagh and Nelder (1989). However, the assumption that the $g_i$ are known is usually unwarranted and it is preferable to assume only that they are smooth functions of their arguments. Such an approach is called binary nonparametric regression and is the topic of this paper. The above example is discussed further in sections 5.4 and 6.4.

This paper presents a Bayesian approach to binary nonparametric regression in which the dependent variable takes the values 0 and 1. We assume that the argument of the link function is an additive function of the explanatory variables and their multiplicative interactions. An integrated Wiener process prior is assumed for each of the component functions. The unknown functions are estimated by their posterior means, which are cubic smoothing splines. We show by simulation that the estimators in this paper outperform some other estimators for binary nonparametric regression which are prominent in the literature. The nonparametric estimation is also made robust to miscoded values of the dependent variable and outliers. We believe that at present there are no competing robust nonparametric estimators in the binary regression literature. The approach in the paper can be applied to any link function that is a cumulative distribution function (cdf) by approximating the cdf by a mixture of normals. In particular, the probit and logit link functions are discussed explicitly.

The computation is carried out by using data augmentation and the Gibbs sampler. By expressing the prior for each function in state space form, an efficient sampling scheme is obtained which requires $O(nM)$ operations and $O(n)$ storage locations, where $n$ is the sample size and $M$ is the number of Gibbs iterations.

There is an extensive non-Bayesian literature on nonparametric regression in exponential family models, with binary regression treated as a special case. The main approaches used are: (a) penalized likelihood, with the unknown functions estimated by smoothing splines, and (b) local likelihood kernel based methods. Using a smoothing spline approach, O'Sullivan, Yandell and Raynor (1986) estimate a single function nonparametrically in an exponential family model, and their work is extended to additive models by Hastie and Tibshirani (1990). Gu (1990a, 1990b, 1992), Wang (1994, 1996) and Wahba, Wang, Gu, Klein and Klein (1995) use tensor product smoothing splines to allow for interactions between variables. The literature on penalized likelihood methods uses a number of approaches to estimate the smoothing parameters, with approximate cross-validation and unbiased risk estimation among the better performers. Simulations by Wang (1994) suggest that unbiased risk estimation generally outperforms approximate cross-validation. We note that at present, exact cross-validation seems computationally too intensive to be practical. Fan, Heckman and Wand (1995) and Carroll, Fan, Gijbels and Wand (1997) proposed a local likelihood approach to nonparametric regression estimation in exponential family models, but their work does not extend to additive nonparametric regression, and they do not have data-driven estimators for their bandwidth parameters. Loader (1995) also proposes a local likelihood approach for both univariate and bivariate nonparametric estimation in exponential family models and provides data-driven bandwidth estimators. In this paper the performance of the Bayesian approach is compared with Loader's local likelihood approach and with the smoothing spline estimator in Wang (1994, 1996), which estimates the smoothing parameters using the unbiased risk estimator.

Previous Bayesian approaches for estimating a binary regression in a nonparametric manner include Gelfand and Kuo (1991), Mallick and Gelfand (1994), Basu and Mukhopadhyay (1994), and Newton, Czado and Chappell (1996). These authors estimate the link function flexibly, but assume that the argument of the link function is linear in the explanatory variables. In example (1.1) this approach would estimate H nonparametrically, but assume that the functions $g_i$ are linear. Such an approach makes $p(HA = 1 \mid Age, BP, CR)$ monotonic in the explanatory variables and is less general than the approach proposed in the current paper. In addition, none of the above Bayesian estimators is robust to miscoded values of the dependent variable.

The paper is organized as follows. Sections 2 and 3 outline the methodology for the probit link function and section 4 provides simulation results. Section 5 makes the estimation robust to outliers. Section 6 extends the analysis to an additive model and discusses a real example. Section 7 extends the approach to general link functions, and in particular the logistic link function.

2 Probit Regression

2.1 Introduction

Suppose that the dependent variable $w$ takes the values 0 and 1 only and that $w$ depends on the explanatory variable $s$. In this section we model the dependence of $w$ on $s$ using the probit regression model
$$ p(w = 1 \mid s) = E(w \mid s) = \Phi\{g(s)\}, $$
where $\Phi$ is the standard normal cdf and $g(s)$ is a smooth function of $s$. Suppose $w = (w_1, \ldots, w_n)$ are $n$ independent binary observations and $s = (s_1, \ldots, s_n)$ are the corresponding values of $s$. Section 2.2 shows how data augmentation can be used to estimate $g$. Generalizations of this approach to additive models and to other link functions $H$ are given later in the paper.

In all our work we estimate $g$ by its posterior mean. We would like to select the prior for $g$ so that its posterior mean has the following two properties: (a) the estimate of $g$ is a smooth function; (b) the prior is flexible enough to give good estimates for a large range of regression functions. Choosing a parametric form for $g$, for example taking $g$ as a linear function, satisfies the first requirement, but not the second. The following prior for $g$ was discussed by Wahba (1983) and satisfies both requirements. Let
$$ g(s) = \beta_0 + \beta_1 s + (\tau^2)^{1/2} \int_0^s W(v)\,dv , \qquad (2.1) $$

where $W$ is a Wiener process with $W(0) = 0$ and $\mathrm{var}\{W(s)\} = s$. We take $\beta = (\beta_0, \beta_1)'$ as diffuse, that is $\beta \sim N(0, cI_2)$ with $c \to \infty$. In (2.1), the parameter $\tau^2$ controls the curvature of $g$ and is called a smoothing parameter. The prior (2.1) has the following properties. First, there is no prior information about $g$ and its first derivative at $s = 0$, because $\beta_0 = g(0)$, $\beta_1 = g^{(1)}(0)$ and $\beta$ is diffuse. The notation $g^{(1)}$ means the first derivative of $g$. Second, the first derivative of $g$ is continuous because a Wiener process is continuous. Third, there is no information about the second derivative of $g$ because $d^2 g(s)/ds^2 = \tau\, dW(s)/ds$ and the first derivative of a Wiener process is infinite. Fourth, the posterior mean of $g$ is a cubic smoothing spline. That is, it has a continuous second derivative throughout its range, is a cubic in each of the subintervals $(s_{i-1}, s_i)$, $i = 2, \ldots, n$, and is linear for $s \le s_1$ and $s > s_n$. In addition, the posterior mean is linear if $\tau^2 = 0$ and it interpolates the data if $\tau^2$ is very large. To complete the prior specification for $g$ we place a flat prior on $\tau^2$.

For computational purposes we now express $g$ in state space form. Let

$$ f(s) = (\tau^2)^{1/2} \int_0^s W(v)\,dv $$
be the nonlinear part of $g$ and define the state vector $x(s)$ as $\{f(s), f^{(1)}(s)\}'$. Then, as in Wecker and Ansley (1983), $x(s_i) = F_i x(s_{i-1}) + u_i$ is the state space representation of $f$. The $u_i$ are independent $N(0, \tau^2 U_i)$, with
$$ F_i = \begin{pmatrix} 1 & \delta_i \\ 0 & 1 \end{pmatrix} \qquad \text{and} \qquad U_i = \begin{pmatrix} \delta_i^3/3 & \delta_i^2/2 \\ \delta_i^2/2 & \delta_i \end{pmatrix}, $$
where $\delta_i = s_i - s_{i-1}$ and $s_0 = 0$.
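To make the state space form concrete, the following Python sketch (ours, not the authors' software) builds the matrices $F_i$ and $U_i$ from the design points and draws one sample path of $f$ from the prior by running the recursion.

```python
import numpy as np

def state_space_matrices(s):
    """Build the transition matrices F_i and innovation covariances U_i of the
    integrated Wiener process prior at the design points s_1 < ... < s_n."""
    s = np.asarray(s, dtype=float)
    deltas = np.diff(np.concatenate(([0.0], s)))    # delta_i = s_i - s_{i-1}, with s_0 = 0
    F = [np.array([[1.0, d], [0.0, 1.0]]) for d in deltas]
    U = [np.array([[d**3 / 3.0, d**2 / 2.0],
                   [d**2 / 2.0, d]]) for d in deltas]
    return F, U

def simulate_prior_path(s, tau2, seed=0):
    """Draw f = {f(s_1), ..., f(s_n)} from the prior by running the recursion
    x(s_i) = F_i x(s_{i-1}) + u_i with u_i ~ N(0, tau2 * U_i) and x(0) = 0."""
    rng = np.random.default_rng(seed)
    F, U = state_space_matrices(s)
    x = np.zeros(2)                                  # x(0) = {f(0), f'(0)} = (0, 0)
    path = []
    for Fi, Ui in zip(F, U):
        x = Fi @ x + rng.multivariate_normal(np.zeros(2), tau2 * Ui)
        path.append(x[0])
    return np.array(path)

# Example: one prior draw at 10 equally spaced design points with tau2 = 4.
print(simulate_prior_path(np.linspace(0.1, 1.0, 10), tau2=4.0))
```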

2.2 Estimation

We estimate $g$, and hence $p(w = 1 \mid s)$, using the data augmentation approach of Albert and Chib (1993), who carry out a Bayesian analysis of linear probit models. The extension of the data augmentation approach to other link functions is discussed in section 7. Let $y = (y_1, \ldots, y_n)$ be the vector of latent variables such that
$$ y_i = g(s_i) + e_i , \qquad (2.2) $$
where the errors $e_i$ are independent $N(0, 1)$. Let $w_i = 1$ if $y_i > 0$ and let $w_i = 0$ otherwise. Then,
$$ p\{w_i = 1 \mid g(s_i)\} = p\{y_i > 0 \mid g(s_i)\} = p\{e_i > -g(s_i)\} = \Phi\{g(s_i)\}, \qquad (2.3) $$

so that $w_i$ has the correct distribution. The posterior mean of $g(s)$ is equal to
$$ E\{g(s) \mid w\} = \int E\{g(s) \mid w, y, \tau^2\}\, p(y, \tau^2 \mid w)\, dy\, d\tau^2 . \qquad (2.4) $$

We use Monte Carlo simulation to estimate this posterior mean. Suppose the sequence $(\tau^2)^{[j]}, y^{[j]}$, $j = 1, \ldots, M$, is generated from the posterior distribution $p(\tau^2, y \mid w)$. Then

$$ \hat g(s) = \frac{1}{M} \sum_{j=1}^{M} E\left\{ g(s) \mid y^{[j]}, (\tau^2)^{[j]} \right\} \qquad (2.5) $$
is an estimator of the right side of (2.4) because $E\{g(s) \mid w, y, \tau^2\}$ simplifies to $E\{g(s) \mid y, \tau^2\}$. Each of the expectations in (2.5) is evaluated using state space filtering and smoothing algorithms. Once $\hat g(s)$ is obtained we estimate $p(w = 1 \mid s)$, which is equal to $\Phi\{g(s)\}$, by $\Phi\{\hat g(s)\}$. The next section shows how to generate the sequence $\{y^{[j]}, (\tau^2)^{[j]}\}$ using the Gibbs sampler.

Remark 2.1 The mode of the posterior density $p(g \mid w, \tau^2)$ is equal to the penalized likelihood estimate of $g$. This is explained by Hastie and Tibshirani (1990, Chapter 6) and Green and Silverman (1994, p. 98). These authors use an approximation to the cross-validation function to estimate $\tau^2$. The Bayesian approach estimates $g$ by its posterior mean with $\tau^2$ integrated out.

Remark 2.2 The estimate $\hat g(s)$ is a cubic smoothing spline because each of the terms in (2.5) is a cubic smoothing spline by Wecker and Ansley (1983).

3 The Gibbs sampler

Let $f = \{f(s_1), \ldots, f(s_n)\}$. We use the Gibbs sampler to draw from the posterior distribution of $\beta$, $f$, $y$ and $\tau^2$. The Gibbs sampler is discussed and illustrated by Gelfand and Smith (1990). We run the sampling scheme for a warmup period, at the end of which we assume the Gibbs sampler is generating iterates from the posterior distribution $p(\beta, f, \tau^2, y \mid w)$, followed by a sampling period used for estimation. The estimate $\hat g(s)$ in (2.5) uses the iterates in the sampling period.

Sampling Scheme 1

0. Initialize $f$ and $\beta$ as $f^{[0]} = 0$ and $\beta^{[0]} = 0$.

1. Draw $y$ from $p(y \mid f, \beta, w)$ as follows. For $i = 1, \ldots, n$, generate $y_i$ from a normal distribution with mean $f(s_i) + z_i\beta$ and variance 1. If $w_i = 1$ then constrain $y_i$ to be positive and if $w_i = 0$ then constrain $y_i$ to be negative.

2. Generate $\beta$ and $f$ as a block from $p(f, \beta \mid y, \tau^2)$ by generating $\beta$ from $p(\beta \mid y, \tau^2)$ and then generating $f$ from $p(f \mid y, \tau^2, \beta)$ conditional on the generated value of $\beta$.

(a) Let $Z = (z_1', \ldots, z_n')'$. To generate $\beta$ we note that $p(\beta \mid y, \tau^2) \propto p(y \mid \tau^2, \beta)\, p(\beta)$. As in Wecker and Ansley (1983), we can use this expression for the conditional density of $\beta$ to show that $p(\beta \mid y, \tau^2)$ is normal with mean $(Z'\Omega^{-1}Z)^{-1} Z'\Omega^{-1} y$ and variance $(Z'\Omega^{-1}Z)^{-1}$, where $\Omega = \mathrm{var}(f + e)$.

(b) The vector $f$ is generated from $p(f \mid y, \tau^2, \beta)$ by generating the errors $e$ and then calculating $f = y - Z\beta - e$. We use the de Jong and Shephard (1995) algorithm to generate $e$ in $O(n)$ operations by expressing the conditional density of $e$ as
$$ p(e \mid y, \tau^2, \beta) = p(e_n \mid y, \tau^2, \beta) \prod_{i=1}^{n-1} p(e_i \mid y, e_{i+1}, \ldots, e_n, \tau^2, \beta) . $$
The error $e_n$ is generated first from $p(e_n \mid y, \tau^2, \beta)$ and then, for $i = n-1, \ldots, 1$, the errors $e_i$ are successively generated conditional on $y, e_{i+1}, \ldots, e_n, \tau^2$ and $\beta$.

3. Generate $\tau^2$ from $p(\tau^2 \mid y, \beta, f)$, which simplifies to $p(\tau^2 \mid f)$. The conditional density $p(\tau^2 \mid f) \propto p(f \mid \tau^2)\, p(\tau^2)$ and is inverse gamma. The parameter $\tau^2$ is generated as in Carter and Kohn (1994, p. 545).

Step 0 initializes the Gibbs sampler, and steps 1 to 3 are repeated for both the warmup and sampling periods.

In binary regression many of the $s_i$ are usually the same, so there are many more observations than there are distinct values of $g(s_i)$. The repeated observations can be used to speed up sampling scheme 1 as follows. Suppose $\mathrm{pr}(w_{ij} = 1 \mid s_i) = \Phi\{g(s_i)\}$, for $j = 1, \ldots, n_i$, $i = 1, \ldots, m$, with $n = \sum_i n_i$. Sampling scheme 1 is modified as follows. Generate $y_{ij}$ given $w_{ij}$ and $g(s_i)$ as in step 1. Form the means $\bar y_i = \sum_j y_{ij} / n_i$ for $i = 1, \ldots, m$ and generate $\beta$ and $f$ from the $\bar y_i$ as in step 2. The justification for using the means $\bar y_i$ instead of all the latent variables $y_{ij}$ is given by Ansley and Kohn (1986).
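The following Python sketch illustrates Sampling Scheme 1 for the probit model. It is ours rather than the authors' code: for brevity it works with the full $n \times n$ prior covariance of $f$ (so each sweep costs $O(n^3)$ instead of the $O(n)$ state space filtering and simulation smoothing used in the paper), and it averages posterior draws of $g$ rather than the conditional means in (2.5); both averages estimate the posterior mean of $g$.

```python
import numpy as np
from scipy.stats import truncnorm, invgamma

def iw_cov(s):
    """Prior covariance of {f(s_1), ..., f(s_n)} implied by the integrated Wiener
    process: cov{f(s_i), f(s_j)} = min^2 * max / 2 - min^3 / 6 (times tau2)."""
    a, b = np.minimum.outer(s, s), np.maximum.outer(s, s)
    return a**2 * b / 2.0 - a**3 / 6.0

def gibbs_probit(w, s, n_iter=1000, warmup=200, seed=0):
    """Simplified sketch of Sampling Scheme 1.  Design points must be positive
    and distinct; each sweep inverts an n x n matrix instead of using the O(n)
    state space algorithms of the paper."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w)
    s = np.asarray(s, dtype=float)
    n = len(w)
    Z = np.column_stack([np.ones(n), s])                 # z_i = (1, s_i)
    K = iw_cov(s)
    f, beta, tau2 = np.zeros(n), np.zeros(2), 1.0
    draws = []
    for it in range(n_iter):
        # Step 1: latent y_i ~ N(z_i beta + f(s_i), 1), truncated to the sign w_i implies.
        mu = Z @ beta + f
        lo = np.where(w == 1, -mu, -np.inf)
        hi = np.where(w == 1, np.inf, -mu)
        y = truncnorm.rvs(lo, hi, loc=mu, scale=1.0, random_state=rng)
        # Step 2: beta | y, tau2 by generalized least squares, then f | y, beta, tau2.
        Omega = tau2 * K + np.eye(n)                     # var(f + e)
        Oi = np.linalg.inv(Omega)
        Vb = np.linalg.inv(Z.T @ Oi @ Z)
        beta = rng.multivariate_normal(Vb @ Z.T @ Oi @ y, Vb)
        r = y - Z @ beta
        Cf = tau2 * K
        f = rng.multivariate_normal(Cf @ Oi @ r,
                                    Cf - Cf @ Oi @ Cf + 1e-10 * np.eye(n))
        # Step 3: tau2 | f is inverse gamma under the flat prior on tau2.
        q = f @ np.linalg.solve(K, f)
        tau2 = invgamma.rvs(a=n / 2.0 - 1.0, scale=q / 2.0, random_state=rng)
        if it >= warmup:
            draws.append(Z @ beta + f)                   # a posterior draw of g at the s_i
    return np.mean(draws, axis=0)                        # approximates ghat(s_i) of (2.5)
```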

4 Simulations

This section uses simulation to evaluate the quality of the Bayesian estimator of the probit regression and compares it to two other estimators that are considered to perform well. The first estimator is LOCFIT, which was developed by Loader (1995). LOCFIT uses the smoothing parameter that maximizes the local log likelihood. The code for LOCFIT is available at the WWW site http://netlib.att.com/netlib/att/stat/prog/locfit.shar.Z and a description of LOCFIT is given in Loader (1995). The second estimator is GRKPACK with the smoothing parameter estimated by the unbiased risk (UBR) method. This estimator was developed by Wang (1994, 1996).

In nonparametric regression with Gaussian errors the average integrated squared error of the function estimate is often used as a measure of performance. This measures the average squared deviation between $g$ and $\hat g$. In binary regression it may be more appropriate to measure the distance between $\Phi(g)$ and $\Phi(\hat g)$. As suggested by Gu (1992) and Wang (1994) we do so using the symmetric Kullback-Leibler distance between $g$ and $\hat g$. This distance is defined as follows. Let $\hat p(w_i = 1 \mid s_i) = \Phi\{\hat g(s_i)\}$. The Kullback-Leibler distance between $\Phi\{g(s_i)\}$ and $\Phi\{\hat g(s_i)\}$ is defined as (Rao, 1973, pp. 58-59)
$$ I\{g(s_i), \hat g(s_i)\} = p(w_i = 0 \mid s_i) \log \frac{p(w_i = 0 \mid s_i)}{\hat p(w_i = 0 \mid s_i)} + p(w_i = 1 \mid s_i) \log \frac{p(w_i = 1 \mid s_i)}{\hat p(w_i = 1 \mid s_i)} . $$
By Rao (1973, pp. 58-59), $I\{g(s_i), \hat g(s_i)\} \ge 0$ and $I\{g(s_i), \hat g(s_i)\} = 0$ if and only if $g(s_i) = \hat g(s_i)$. We define the integrated symmetric Kullback-Leibler distance between $g$ and $\hat g$ as
$$ I(g, \hat g) = \sum_{i=1}^{n} \left[ I\{g(s_i), \hat g(s_i)\} + I\{\hat g(s_i), g(s_i)\} \right] . $$
It follows from the properties of the Kullback-Leibler distance that $I(g, \hat g) \ge 0$ and $I(g, \hat g) = 0$ if and only if $\hat g = g$. Thus, the closer $I(g, \hat g)$ is to zero the better.

We generated 100 realizations, each consisting of 400 observations, from the three smooth functions: (i) $g_1(s) = 4s - 2$, (ii) $g_2(s) = 9/(2\pi)^{1/2} \exp\{-18(s - 0.5)^2\} - 2$, and (iii) $g_3(s) = \sin(4\pi s)$. These functions are plotted in Figure 1. In each case the $s_i$ were equally spaced. The function $g_1$ was chosen because a linear function is often used in probit regression, $g_2$ was chosen because $p(w = 1 \mid s)$ is not monotonic in $s$, and $g_3$ was chosen because $p(w = 1 \mid s)$ is multimodal in $s$. For each realization, the integrated symmetric Kullback-Leibler distance of each estimate was computed. Figure 1 shows boxplots of $I(g, \hat g)$ for the Bayesian, LOCFIT and GRKPACK estimators. The warmup and sampling periods were 200 and 800 iterations, respectively. Convergence of the sampler was checked for selected realizations by starting the sampler at different positions and checking whether the same conditional probability estimates were obtained. In all cases the sampler converged rapidly.

Figure 1 shows that the Bayesian estimator outperforms the LOCFIT and GRKPACK estimators. To see how this difference in $I(g, \hat g)$ translates into different probit regression curves, the estimates of $\Phi\{9/(2\pi)^{1/2} \exp(-18(s - 0.5)^2) - 2\}$ corresponding to the 10th worst, 50th worst, and 10th best values of $I(g, \hat g)$ are plotted in Figure 2. The plot shows that the Bayesian estimator outperforms the other two estimators across the whole domain of $s$.
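The distance just defined is straightforward to compute for the probit link; the short Python sketch below (ours) does so for a true curve $g$ and an estimate $\hat g$ evaluated at the design points.

```python
import numpy as np
from scipy.stats import norm

def integrated_sym_kl(g, ghat, eps=1e-12):
    """Integrated symmetric Kullback-Leibler distance I(g, ghat) between the
    probit curves Phi{g(s_i)} and Phi{ghat(s_i)}; smaller is better."""
    p = np.clip(norm.cdf(np.asarray(g, dtype=float)), eps, 1 - eps)
    q = np.clip(norm.cdf(np.asarray(ghat, dtype=float)), eps, 1 - eps)
    kl_pq = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
    kl_qp = q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))
    return float(np.sum(kl_pq + kl_qp))

# Example: distance between g(s) = 4s - 2 and a slightly biased estimate of it.
s = np.linspace(0.0, 1.0, 400)
print(integrated_sym_kl(4 * s - 2, 3.5 * s - 1.8))
```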

Figures 1 and 2 about here

4.1 Posterior Confidence Intervals

The posterior pointwise confidence interval for $p(w = 1 \mid s)$ shows the uncertainty in the estimate of this probability and is obtained from the corresponding interval for $g(s)$. Let $\hat\sigma_i$ be an estimate of the posterior standard deviation of $g(s_i)$. We estimate the 95% posterior confidence interval of $g(s_i)$ by $\hat g(s_i) \pm 1.96\,\hat\sigma_i$. To obtain $\hat\sigma_i$ we write $\mathrm{var}\{g(s_i) \mid w\}$ as $E\{g(s_i)^2 \mid w\} - E\{g(s_i) \mid w\}^2$. The posterior mean of $g(s_i)$ is estimated by $\hat g(s_i)$ defined by (2.5). Similarly, we estimate $E\{g(s_i)^2 \mid w\}$ by
$$ \frac{1}{M} \sum_{j=1}^{M} E\left\{ g(s_i)^2 \mid y^{[j]}, (\tau^2)^{[j]} \right\} = \frac{1}{M} \sum_{j=1}^{M} \left[ \mathrm{var}\left\{ g(s_i) \mid y^{[j]}, (\tau^2)^{[j]} \right\} + E\left\{ g(s_i) \mid y^{[j]}, (\tau^2)^{[j]} \right\}^2 \right] \qquad (4.1) $$
using the iterates in the sampling period. The term $\mathrm{var}\{g(s_i) \mid y^{[j]}, (\tau^2)^{[j]}\}$ is equal to $\mathrm{var}\{e_i \mid y^{[j]}, (\tau^2)^{[j]}\}$ and the term $E\{g(s_i) \mid y^{[j]}, (\tau^2)^{[j]}\}$ is equal to $y_i^{[j]} - E(e_i \mid y^{[j]}, (\tau^2)^{[j]})$. Both the conditional mean and the conditional variance of $e_i$ are computed in step 2(b) of sampling scheme 1.

Figure 3 illustrates Bayesian posterior confidence intervals for the cubic spline estimate for observations generated from the probit regression $p(w = 1 \mid s) = \Phi(4s - 2)$, using a typical realization from the simulations described in section 4.

Figure 3 about here

Bayesian confidence intervals for nonparametric spline regression with Gaussian errors were proposed by Wecker and Ansley (1983) and Wahba (1983). Gu (1992) uses a Laplace approximation to the likelihood to obtain approximate Bayesian confidence intervals for a nonparametric regression function in generalized linear models. We note that by integrating out $\tau^2$, the Bayesian confidence interval takes into account the extra variability due to estimating the smoothing parameter $\tau^2$, whereas the confidence intervals obtained by Wecker and Ansley (1983), Wahba (1983) and Gu (1992) do not.
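The calculation in (4.1) only requires the conditional means and variances of $g(s_i)$ saved at each sampling iteration. Below is a minimal Python sketch (ours) that also converts the interval for $g$ into an interval for $p(w = 1 \mid s)$ by applying $\Phi$; the toy arrays at the end are purely illustrative.

```python
import numpy as np
from scipy.stats import norm

def pointwise_band(cond_mean, cond_var, level=0.95):
    """cond_mean[j, i] = E{g(s_i) | y^[j], (tau2)^[j]} and
    cond_var[j, i] = var{g(s_i) | y^[j], (tau2)^[j]}, saved over the M sampling
    iterations.  Returns the estimate of p(w = 1 | s_i) and its pointwise band."""
    ghat = cond_mean.mean(axis=0)                           # estimator (2.5)
    second = (cond_var + cond_mean**2).mean(axis=0)         # right side of (4.1)
    sd = np.sqrt(np.maximum(second - ghat**2, 0.0))         # estimated posterior sd of g(s_i)
    z = norm.ppf(0.5 + level / 2.0)                         # 1.96 for level = 0.95
    return norm.cdf(ghat), norm.cdf(ghat - z * sd), norm.cdf(ghat + z * sd)

# Toy example: M = 3 saved iterations, n = 2 design points.
cm = np.array([[0.1, -0.4], [0.2, -0.5], [0.0, -0.3]])
cv = np.array([[0.05, 0.04], [0.06, 0.05], [0.04, 0.03]])
print(pointwise_band(cm, cv))
```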


5 Robust estimation

5.1 Introduction

Suppose that some of the observations are miscoded, with a 1 recorded as a 0 or vice versa, or that some observations are outliers, for example a patient who has a heart attack at a very young age. This section shows how to robustly estimate $p(w = 1 \mid s)$ and identify such miscodings and outliers by using an approach outlined by Verdinelli and Wasserman (1991) for linear binary regression. The results in this section are for probit regression, but can be generalised to other link functions following the discussion in section 7.

5.2 Implementing the Gibbs Sampler

To run the Gibbs sampler we introduce a vector of indicator variables $\gamma = (\gamma_1, \ldots, \gamma_n)$, where $\gamma_i = 1$ if $w_i$ is an outlier or miscoded and $\gamma_i = 0$ otherwise. We assume a priori that the $\gamma_i$ are independent with $p(\gamma_i = 1) = \pi_e$. In our applications we take $\pi_e = 0.05$ and find that this choice works well in practice. Using the same notation as in section 2, we implement the following sampling scheme to estimate the posterior mean of $g$; a sketch of step 1 appears after the scheme.

Sampling Scheme 2

0. Initialize $f$ and $\beta$ as $f^{[0]} = 0$ and $\beta^{[0]} = 0$.

1. Draw $y$ and $\gamma$ jointly from $p(y, \gamma \mid f, \beta, w)$ by drawing $\gamma$ from $p(\gamma \mid f, \beta, w)$ and then drawing $y$ from $p(y \mid f, \beta, w, \gamma)$ conditional on the generated value of $\gamma$. To draw $\gamma$ we note that, conditional on $f$, $w$ and $\beta$, the $\gamma_i$ are independent with distribution $p(\gamma_i \mid f_i, w_i, \beta)$. For $j = 0$ and $j = 1$,
$$ p(\gamma_i = j \mid f_i, w_i, \beta) = \frac{ p(w_i \mid \gamma_i = j, f_i, \beta)\, p(\gamma_i = j) }{ (1 - \pi_e)\, p(w_i \mid \gamma_i = 0, f_i, \beta) + \pi_e\, p(w_i \mid \gamma_i = 1, f_i, \beta) } . $$
To draw $y$ from $p(y \mid f, \beta, w, \gamma)$, we note that the $y_i$ are independent conditional on $f$, $\beta$, $w$ and $\gamma$, with $p\{y_i \mid w_i, f(s_i), \beta, \gamma_i = 1\} = p\{y_i \mid 1 - w_i, f(s_i), \beta, \gamma_i = 0\}$. The $y_i$ are now generated as in sampling scheme 1.

2. Generate $\beta$, $f$ and $\tau^2$ as in sampling scheme 1.
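Here is a Python sketch of step 1 of the scheme, assuming that $\gamma_i = 1$ is interpreted as $w_i$ having been miscoded, so that the latent $y_i$ is drawn as if the label were flipped; the function and variable names are ours.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def draw_gamma_and_y(w, g, pi_e=0.05, seed=0):
    """Step 1 of Sampling Scheme 2: draw the indicators gamma_i and then the
    latent y_i, given the current g(s_i) = z_i beta + f(s_i)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w)
    g = np.asarray(g, dtype=float)
    p1 = norm.cdf(g)                                   # p(w_i = 1 | gamma_i = 0, f_i, beta)
    lik0 = np.where(w == 1, p1, 1 - p1)                # p(w_i | gamma_i = 0, f_i, beta)
    lik1 = np.where(w == 1, 1 - p1, p1)                # p(w_i | gamma_i = 1, f_i, beta)
    post1 = pi_e * lik1 / ((1 - pi_e) * lik0 + pi_e * lik1)
    gamma = (rng.random(len(w)) < post1).astype(int)
    # If gamma_i = 1, y_i is drawn as if the observed label were 1 - w_i.
    w_eff = np.where(gamma == 1, 1 - w, w)
    lo = np.where(w_eff == 1, -g, -np.inf)
    hi = np.where(w_eff == 1, np.inf, -g)
    y = truncnorm.rvs(lo, hi, loc=g, scale=1.0, random_state=rng)
    return gamma, y
```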

5.3 Simulations

To illustrate the impact of miscoded observations and outliers on the estimation of $p(w = 1 \mid s)$, we generated 100 realizations from the probit regression function $\Phi(4s - 2)$. Each realization consisted of 400 observations, with the $s_i$ equally spaced on the interval $[0, 1]$. Miscodings and outliers were introduced into the observations $w_i$ at random, with the probability of generating a miscoded observation or an outlier set at $\pi_g = 0.10$. The simulation experiment was repeated using a value of $\pi_g = 0.20$. Thus the probability used to generate the miscoded observations and outliers ($\pi_g = 0.10$ and $\pi_g = 0.20$) is different from that used to model them ($\pi_e = 0.05$). The integrated symmetric Kullback-Leibler distance $I(g, \hat g)$ was calculated both when outlier detection was turned on and when it was not. Figure 4 presents the boxplots of $I(g, \hat g)$ for both experiments. In both experiments the robust Bayesian estimator performed substantially better than the non-robust estimator.

Figure 4 about here

5.4 Real Example

The robust estimation technique is now applied to a data set consisting of 463 observations on the dependent variable $w$, which equals 1 if the subject had a heart attack and 0 if he or she did not, together with a number of risk factors. These data are described and analyzed by Hastie and Tibshirani (1987). We model the probability of a heart attack (HA) given a patient's cholesterol ratio (CR). The model is
$$ p(HA = 1 \mid CR) = \Phi\{g(CR)\} \qquad \text{with} \qquad g(CR) = \beta_0 + \beta_1 CR + f(CR) . $$
We chose cholesterol ratio as the explanatory variable because it appears that at least one observation ($CR = 15.33$, $HA = 0$) is either an outlier or miscoded. Figure 5 shows that when $p(HA = 1 \mid CR)$ is estimated without outlier detection a sharp drop occurs at high levels of cholesterol ratio. This is due to the presence of the observation ($CR = 15.33$, $HA = 0$) together with the scarcity of data in this region. When outlier detection is turned on (dotted line) this apparent anomaly disappears, and the posterior probability that the observation ($CR = 15.33$, $HA = 0$) is an outlier is approximately 0.50. Similarly, for low levels of cholesterol ratio the posterior probability that the observation ($CR = 1.74$, $HA = 1$) is an outlier is approximately 0.75. However, the observation ($CR = 1.74$, $HA = 1$) does not affect the estimation of $p(HA = 1 \mid CR)$ to the same degree as does the observation ($CR = 15.33$, $HA = 0$), even though the posterior probability of it being an outlier is higher. This is because there are many observations at low levels of cholesterol ratio, and hence the contribution of the observation ($CR = 1.74$, $HA = 1$) to the estimate of $p(HA = 1 \mid CR)$ is reduced.

Figure 5 about here

6 Additive binary regression

6.1 Introduction

The Bayesian approach for estimating a binary regression is now extended to more than one independent variable. We assume that the variables enter additively into the link function, but this formulation allows for interaction between variables by defining the product of two explanatory variables as a new variable. This is explained in section 6.3. For simplicity, the discussion below is confined to the probit link function with two independent variables, but the extension to a more general link function and to more than two independent variables is straightforward. Using the probit link function we model the probability $p(w = 1 \mid s, t)$ as $\Phi\{\beta_0 + g_1(s) + g_2(t)\}$ and impose the constraints $g_1(0) = g_2(0) = 0$ because we cannot uniquely identify separate intercepts for $g_1$ and $g_2$. We write $g_1(s) = \beta_1 s + f_1(s)$ and $g_2(t) = \beta_2 t + f_2(t)$, with the priors for $f_1$ and $f_2$ given by

$$ f_1(s) = (\tau_1^2)^{1/2} \int_0^s W_1(v)\,dv \qquad \text{and} \qquad f_2(t) = (\tau_2^2)^{1/2} \int_0^t W_2(v)\,dv . $$
$W_1$ and $W_2$ are two independent Wiener processes with $W_1(0) = W_2(0) = 0$, and $\tau_1^2$ and $\tau_2^2$ are the smoothing parameters for $g_1$ and $g_2$. Let $\beta = (\beta_0, \beta_1, \beta_2)'$. The prior for $\beta$ is diffuse and is independent of $f_1$ and $f_2$. We also impose flat priors on $\tau_1^2$ and $\tau_2^2$ on the interval $[0, \infty)$ and assume they are independent.

6.2 Gibbs sampler

Let $f_1 = \{f_1(s_1), \ldots, f_1(s_n)\}$, $f_2 = \{f_2(t_1), \ldots, f_2(t_n)\}$, and $Z = (z_1', \ldots, z_n')'$, where $z_i = (1, s_i, t_i)$. Let $\hat\beta$, $\hat f_1$, and $\hat f_2$ be estimates of the posterior means of $\beta$, $f_1$, and $f_2$. These estimates are obtained using the Gibbs sampler outlined below. The regression function $\Phi\{z_i\beta + f_1(s_i) + f_2(t_i)\}$ is estimated by $\Phi\{z_i\hat\beta + \hat f_1(s_i) + \hat f_2(t_i)\}$.

Sampling Scheme 3

0. Initialize $f_1$, $f_2$ and $\beta$ as $f_1^{[0]} = 0$, $f_2^{[0]} = 0$ and $\beta^{[0]} = 0$.

1. Generate $y$ from $p(y \mid f_1, f_2, \beta, w)$ as in sampling scheme 1.

2. Generate $\beta$ and $f_1$ simultaneously from $p(\beta, f_1 \mid y, f_2, \tau_1^2)$. The vectors $\beta$ and $f_1$ are generated exactly as in sampling scheme 1 by taking $y - f_2$ as the observations, because $y - f_2 = Z\beta + f_1 + e$.

3. Generate $\tau_1^2$ and $\tau_2^2$ as in sampling scheme 1.

The vectors $\beta$ and $f_2$ are generated similarly to $\beta$ and $f_1$. The idea of successively generating $f_1$ and $f_2$ was proposed by Wong and Kohn (1996, section 3) for semiparametric regression with Gaussian errors, but they use the Carter and Kohn (1994) sampler, which is less efficient than the de Jong and Shephard (1995) method. Erkanli and Gopalan (1993) discuss a similar sampler for semiparametric regression, both for Gaussian errors and for probit regression. Their approach is based on eigenvalue decompositions rather than a state space approach and therefore requires $O(n^3)$ operations and $O(n^2)$ storage locations, which will be prohibitively expensive for larger sample sizes. Erkanli and Gopalan (1993) only discuss the probit link function, and it seems difficult to extend their approach to other link functions. Furthermore, they do not implement their proposal for probit regression, nor do they consider robust estimation in either the Gaussian case or for binary regression.
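Written backfitting-style, step 2 of the scheme reuses the single-function block update of scheme 1 twice, once with $y - f_2$ and once with $y - f_1$ as the working observations. The Python sketch below (ours) illustrates this, with $K_1$ and $K_2$ denoting the prior covariance matrices of $f_1$ and $f_2$ at their design points; it is a sketch under these assumptions, not the authors' implementation.

```python
import numpy as np

def draw_block(resid, Z, K, tau2, rng):
    """Draw (beta, f) for the working model resid = Z beta + f + e with e ~ N(0, I),
    f ~ N(0, tau2 * K) and a diffuse prior on beta -- the same conditional used
    in step 2 of the Scheme 1 sketch given earlier."""
    n = len(resid)
    Omega = tau2 * K + np.eye(n)
    Oi = np.linalg.inv(Omega)
    Vb = np.linalg.inv(Z.T @ Oi @ Z)
    beta = rng.multivariate_normal(Vb @ Z.T @ Oi @ resid, Vb)
    Cf = tau2 * K
    f = rng.multivariate_normal(Cf @ Oi @ (resid - Z @ beta),
                                Cf - Cf @ Oi @ Cf + 1e-10 * np.eye(n))
    return beta, f

def additive_sweep(y, Z, f1, f2, tau2_1, tau2_2, K1, K2, rng):
    """Step 2 of Sampling Scheme 3: update (beta, f1) treating y - f2 as the
    observations, then (beta, f2) treating y - f1 as the observations."""
    beta, f1 = draw_block(y - f2, Z, K1, tau2_1, rng)
    beta, f2 = draw_block(y - f1, Z, K2, tau2_2, rng)
    return beta, f1, f2
```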

6.3 Interaction

This section shows how our approach allows for interactions between the variables. For simplicity we consider the probit regression model with two explanatory variables $s$ and $t$, and suppose that $s$, $t$, and their interaction $st$ enter linearly into the link function, that is, $p(w = 1 \mid s, t) = \Phi(\beta_0 + \beta_1 s + \beta_2 t + \beta_3 st)$. This parametric regression function can be immediately generalized to
$$ p(w = 1 \mid s, t) = \Phi\{\beta_0 + \beta_1 s + \beta_2 t + \beta_3 st + f_1(s) + f_2(t) + f_3(st)\}, $$
with $f_1$, $f_2$ and $f_3$ having priors that are similar to the prior for the function $f$ in section 2. Sampling scheme 3 can be extended to handle more than two independent variables and higher order interactions of a form similar to that for $s$ and $t$.

6.4 Real Example

The Bayesian approach for estimating a binary regression is now applied to the data set described in section 5.4. We model the probability of a heart attack ($HA = 1$) by an additive probit model using the following three risk factors: (i) systolic blood pressure (BP), (ii) cholesterol ratio (CR), and (iii) age of patient (Age). The model is
$$ p(HA = 1 \mid BP, CR, Age) = \Phi\{g(BP, CR, Age)\}, $$
where
$$ g(BP, CR, Age) = \beta_0 + \beta_1 BP + \beta_2 CR + \beta_3 Age + f_1(BP) + f_2(CR) + f_3(Age) . $$
Sampling scheme 3 in section 6.2 was used to estimate the functions $f_1(BP)$, $f_2(CR)$ and $f_3(Age)$, and $\beta = (\beta_0, \ldots, \beta_3)$. The results are plotted in Figure 6. For all three independent variables there were many coincident points. In order to see the distribution of the data, the independent variables were perturbed by adding a small amount of Gaussian noise to their values. The dotted line is the estimate with outlier detection turned on and the solid line is the estimate when it is not.

Figure 6 shows that the effect of age increases monotonically. When outlier detection is turned on, the effect of cholesterol ratio also increases monotonically. Interestingly, the posterior probability that the observation ($CR = 15.33$, $HA = 0$) is an outlier has decreased from 0.50 in section 5.4 to 0.30. This is because this patient had a systolic blood pressure of only 120 and was 49 years old. Thus, after controlling for systolic blood pressure and age, the likelihood that this observation is an outlier decreased. Conversely, the posterior probability that the observation ($CR = 1.74$, $HA = 1$) is an outlier increased from 0.75 to 0.86. This is because this patient was only 20 years old and had a systolic blood pressure of 106. The plot for blood pressure is interesting because it shows that, even with outlier detection turned on, the estimated effect of blood pressure increases at very low systolic blood pressure. This is an artifact of the data. The graph shows that there are a considerable number of patients who had heart attacks with a systolic blood pressure of less than 120.

Figure 6 about here

6.5 Simulations

We use simulation to evaluate the performance of the Bayesian estimator for the additive model and compare it to the GRKPACK estimator. Fifty realizations were generated from the regression model $p(w_i = 1 \mid s_i, t_i, u_i) = \Phi\{g_1(s_i) + g_2(t_i) + g_3(u_i)\}$, where $g_1(s) = \sin(\pi s/10)$, $g_2(t) = t/4 - 2$ and $g_3(u) = \exp(u/60)^3 - 2$. The independent variables $s$, $t$ and $u$ are the explanatory variables used in the real example, i.e., systolic blood pressure, cholesterol ratio and age. Figure 7 contains boxplots of the integrated symmetric Kullback-Leibler distance for the Bayesian and GRKPACK estimates of $p(w = 1 \mid s, t, u)$, and shows that the Bayesian estimator outperforms the GRKPACK estimator.

Figure 7 about here


7 General link function

7.1 Introduction

For the univariate model $p(w = 1 \mid s) = H\{g(s)\}$, the choice of link function is unimportant, provided that the form of $g$ is specified flexibly. However, for the additive model $p(w = 1 \mid s, t) = H\{\beta_0 + g_1(s) + g_2(t)\}$ the choice of link function is more important, because a model that is additive using one link function will usually not be additive for another link function. The data augmentation method used in section 2 for probit regression can be generalized to a wide range of link functions by approximating the link function by a mixture of normals. We illustrate this approach by applying it to the logistic link function $\Lambda(s) = e^s/(1 + e^s)$, but most link functions can be handled similarly. A mixture of normals approximation to the link function is used so that the state space algorithms can be applied as in sampling scheme 1. McCullagh and Nelder (1989, p. 108) list some commonly used link functions.

Similarly to the probit case, we introduce the latent variables $y = (y_1, \ldots, y_n)$ such that (2.2) holds with $e_i$ having cdf $\Lambda$. Let $w_i = 1$ if $y_i > 0$ and let $w_i = 0$ otherwise. As in (2.3), $p\{w_i = 1 \mid g(s_i)\} = \Lambda\{g(s_i)\}$, so that $w_i$ has the correct distribution. To implement our method we approximate the logistic density by a mixture of five normal densities. The means, variances, and weights of the five components are given in Table 1. The approximation was obtained using a genetic algorithm; see Goldberg (1989) for a description of genetic algorithms. Let the $j$th component have mean $\mu_j$, variance $\sigma_j^2$, and weight $\pi_j$. Figure 8(a) plots the logistic density $\Lambda^{(1)}(s) = e^s/(1 + e^s)^2$ and the mixture of normals approximating that density. Albert and Chib (1993) proposed a $t_8$ distribution as an approximation to $\Lambda$ and this is also plotted in Figure 8(a). Figure 8(b) plots the logarithms of the densities in Figure 8(a). Figure 8 shows that the mixture of normals provides an excellent approximation to $\Lambda^{(1)}(s)$ and that the $t_8$ is a much poorer approximation.

Table 1 and Figure 8 about here

As in section 2, $g$ is estimated by its posterior mean using the Gibbs sampler. Because the algorithms require a Gaussian state space model, it is necessary to introduce the sequence of discrete variables $J = (J_1, \ldots, J_n)$ such that $J_i$ determines the component of the mixture of five normals to which $e_i$ belongs. The $J_i$ are independent a priori because the $e_i$ are independent. From Table 1, $p(J_i = k) = \pi_k$ for $k = 1, \ldots, 5$.

7.2 The Gibbs sampler

Let $y$, $f$ and $w$ be defined as in section 2.

Sampling Scheme 4

0. Initialize $f$, $\beta$ and $J$ as $f^{[0]} = 0$, $\beta^{[0]} = 0$, and $J^{[0]} = (1, \ldots, 1)$.

1. Draw $y$ from $p(y \mid f, \beta, J, w)$ as follows. For $i = 1, \ldots, n$ and $J_i = j$, generate $y_i$ from a normal distribution with mean $f(s_i) + z_i\beta + \mu_j$ and variance $\sigma_j^2$. If $w_i = 1$ then constrain $y_i$ to be positive and if $w_i = 0$ then constrain $y_i$ to be negative.

2. Generate $\beta$, $f$ and $\tau^2$ as in sampling scheme 1.

3. Generate $J$ from $p(J \mid y, f, \beta)$. Given $f$ and $\beta$, the $J_i$ are independent, so that $p(J \mid y, f, \beta) = \prod_{i=1}^{n} p\{J_i \mid y_i, f(s_i), \beta\}$. The $i$th term in the product is evaluated as
$$ p\{J_i = j \mid y_i, f(s_i), \beta\} = \frac{ p\{y_i \mid J_i = j, f(s_i), \beta\}\, \pi_j }{ \sum_{k=1}^{5} p\{y_i \mid J_i = k, f(s_i), \beta\}\, \pi_k } . $$
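Using the component weights, means and standard deviations from Table 1, the Python sketch below (ours) evaluates the mixture approximation to the logistic density and implements the draw of $J_i$ in step 3.

```python
import numpy as np
from scipy.stats import norm

# Mixture approximation to the logistic density, taken from Table 1.
pi_j = np.array([0.3173881, 0.1763082, 0.1881111, 0.1753347, 0.1428579])         # weights
mu_j = np.array([-0.0900430, 0.63457071, -0.02886299, 0.12175356, -0.69453339])  # means
sd_j = np.array([1.479860, 1.611898, 1.079041, 2.682629, 1.890645])              # std devs

def mixture_density(x):
    """Five-component normal mixture approximating the logistic density."""
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]
    return np.sum(pi_j * norm.pdf(x, loc=mu_j, scale=sd_j), axis=1)

def draw_components(y, g, seed=0):
    """Step 3 of Sampling Scheme 4: draw J_i with probability proportional to
    pi_j * N(y_i; g(s_i) + mu_j, sd_j^2), where g(s_i) = f(s_i) + z_i beta."""
    rng = np.random.default_rng(seed)
    resid = np.atleast_1d(np.asarray(y) - np.asarray(g))[:, None]
    probs = pi_j * norm.pdf(resid, loc=mu_j, scale=sd_j)
    probs /= probs.sum(axis=1, keepdims=True)
    return np.array([rng.choice(5, p=p) for p in probs])

# Quick check of the approximation against the logistic density (cf. Figure 8).
x = np.linspace(-6.0, 6.0, 13)
logistic = np.exp(x) / (1.0 + np.exp(x))**2
print(np.max(np.abs(mixture_density(x) - logistic)))
```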

7.3 Simulations

To demonstrate that the choice of link function is important when the model is additive, 50 realisations were generated from the model $p(w = 1 \mid s, t) = \Lambda\{g_1(s) + g_2(t)\}$, where $g_1(s) = \sin(4\pi s)$ and $g_2(t) = 9/(2\pi)^{1/2} \exp\{-18(t - 0.5)^2\} - 2$. Both the probit and logistic link functions were used to estimate $p(w = 1 \mid s, t)$, and the integrated symmetric Kullback-Leibler distance was calculated for the estimates of each realisation. The boxplots of the integrated symmetric Kullback-Leibler distance for each link function appear in Figure 9. Figure 9 shows that when the data are generated from the logistic link function, using the logistic link function to estimate $p(w = 1 \mid s, t)$ produces more accurate results than using the probit link function.

Figure 9 about here

7.4 Timing comparisons

Table 2 reports the time taken by the Bayesian, GRKPACK and LOCFIT estimators for the univariate model for sample sizes $n = 100$, 400, and 1600. For the Bayesian estimator we used a warmup period of 200 iterations and a sampling period of 800 iterations. All runs were on an IBM RS/6000 model 250 PowerPC 601 running at 60 MHz. The table confirms that the time taken by the Bayesian estimator is $O(Mn)$, where $n$ is the sample size and $M$ is the total number of Gibbs iterations. The table shows that for large sample sizes the Bayesian estimator is faster than GRKPACK, but not as fast as LOCFIT. However, the work in previous sections showed that the Bayesian estimator is more accurate than both GRKPACK and LOCFIT.

Table 2 about here

References

Albert, J. and Chib, S. (1993), "Bayesian analysis of binary and polychotomous response data," Journal of the American Statistical Association, 88, 669-679.

Ansley, C.F. and Kohn, R. (1986), "Spline smoothing with repeated values," Journal of Statistical Computation and Simulation, 25, 251-258.

Basu, S. and Mukhopadhyay, S. (1994), "Bayesian analysis of a random link function in binary response regression," Duke University Technical Report.

Carroll, R.J., Fan, J., Gijbels, I., and Wand, M.P. (1997), "Generalized partially linear single-index models," to appear, Journal of the American Statistical Association.

Carter, C.K. and Kohn, R. (1994), "On Gibbs sampling for state space models," Biometrika, 81, 541-553.

De Jong, P. and Shephard, N. (1995), "Efficient sampling from the smoothing density in time series models," Biometrika, 82, 339-350.

Erkanli, A. and Gopalan, R. (1993), "Bayesian nonparametric regression: Smoothing using Gibbs sampling," working paper.

Fan, J., Heckman, N.E., and Wand, M.P. (1995), "Local polynomial kernel regression for generalized linear models and quasi-likelihood functions," Journal of the American Statistical Association, 90, 141-150.

Gelfand, A.E. and Kuo, L. (1991), "Nonparametric Bayesian bioassay including ordered polytomous response," Biometrika, 78, 657-666.

Gelfand, A.E. and Smith, A.F.M. (1990), "Sampling-based approaches to calculating marginal densities," Journal of the American Statistical Association, 85, 398-409.

Goldberg, D.E. (1989), Genetic Algorithms in Search, Optimization and Machine Learning, Reading, MA: Addison Wesley.

Green, P.J. and Silverman, B.W. (1994), Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach, New York: Chapman and Hall.

Gu, C. (1990a), "Adaptive spline smoothing in non-Gaussian regression models," Journal of the American Statistical Association, 85, 801-807.

Gu, C. (1990b), "Penalized likelihood regression: a Bayesian analysis," Statistica Sinica, 2, 255-264.

Gu, C. (1992), "Cross-validating non-Gaussian data," Journal of Computational and Graphical Statistics, 1, 169-179.

Hastie, T.J. and Tibshirani, R.J. (1987), "Generalized additive models: some applications," Journal of the American Statistical Association, 82, 371-386.

Hastie, T.J. and Tibshirani, R.J. (1990), Generalized Additive Models, New York: Chapman and Hall.

Loader, C. (1995), "LOCFIT: A program for local fitting," available at http://netlib.att.com/netlib/att/stat/prog/lfhome/home.html

Mallick, B.K. and Gelfand, A.E. (1994), "Generalized linear models with unknown link functions," Biometrika, 81, 237-245.

McCullagh, P. and Nelder, J.A. (1989), Generalized Linear Models (2nd edition), New York: Chapman and Hall.

Newton, M., Czado, C. and Chappell, R. (1996), "Bayesian inference for semiparametric binary regression," Journal of the American Statistical Association, 91, 142-153.

O'Sullivan, F., Yandell, B. and Raynor, W. (1986), "Automatic smoothing of regression functions in generalized linear models," Journal of the American Statistical Association, 81, 96-103.

Rao, C.R. (1973), Linear Statistical Inference and Its Applications (2nd edition), New York: John Wiley.

Verdinelli, I. and Wasserman, L. (1991), "Bayesian analysis of outlier problems using the Gibbs sampler," Statistics and Computing, 1, 105-117.

Wahba, G. (1983), "Bayesian `confidence intervals' for the cross-validated smoothing spline," Journal of the Royal Statistical Society B, 45, 133-150.

Wahba, G., Wang, Y., Gu, C., Klein, R., and Klein, B. (1995), "Smoothing spline ANOVA for exponential families, with application to the Wisconsin epidemiological study of diabetic retinopathy," Annals of Statistics, 23, 1865-1895.

Wang, Y. (1994), Unpublished doctoral thesis, University of Wisconsin-Madison.

Wang, Y. (1996), "GRKPACK: Fitting smoothing spline ANOVA models for exponential families," to appear in Communications in Statistics: Simulation and Computation.

Wecker, W. and Ansley, C.F. (1983), "The signal extraction approach to nonparametric regression and spline smoothing," Journal of the American Statistical Association, 78, 81-89.

Wong, C. and Kohn, R. (1996), "A Bayesian approach to additive semiparametric regression," Journal of Econometrics, 74, 209-235.


weight ($\pi_j$)    mean ($\mu_j$)      stdev ($\sigma_j$)
0.3173881           -0.0900430          1.479860
0.1763082            0.63457071         1.611898
0.1881111           -0.02886299         1.079041
0.1753347            0.12175356         2.682629
0.1428579           -0.69453339         1.890645

Table 1: The five normal densities used to approximate the logistic density.

n        Bayesian    GRKPACK    LOCFIT
100          9            1         1
400         36           56         3
1600       145         2682        13

Table 2: Time taken (seconds) to obtain estimates of $p(w = 1 \mid s)$ for the univariate case.

Figure 1: Probit link function. Panel (a): boxplots of the integrated Kullback-Leibler distance for the Bayesian estimator (left), the GRKPACK estimator (middle), and the LOCFIT estimator (right); $g(s) = \sin(4\pi s)$. Panel (b) is a plot of $g(s) = \sin(4\pi s)$. Panels (c) and (d) are similar plots for $g(s) = 9/(2\pi)^{1/2} \exp\{-18(s - 0.5)^2\} - 2$, and panels (e) and (f) are similar plots for $g(s) = 4s - 2$.

Figure 2: Probit link function with $g(s) = 9/(2\pi)^{1/2} \exp\{-18(s - 0.5)^2\} - 2$. Panels (a)-(c) plot $\Phi\{g(s)\}$ (dotted line) and $\Phi\{\hat g(s)\}$ (solid line) for the 10th worst estimate (in terms of integrated Kullback-Leibler distance) for (a) the Bayesian estimator, (b) the GRKPACK estimator, and (c) the LOCFIT estimator. Panels (d)-(f) are similar plots for the 50th worst estimate; panels (g)-(i) are similar plots for the 10th best estimate.

Figure 3: Plot of the regression function $\Phi(4s - 2)$ (dotted line), the estimate of the posterior mean of $\Phi(4s - 2)$ (solid line), and the 95 percent pointwise posterior confidence interval (dashed lines) for the Bayesian estimate.

Figure 4: Panel (a): boxplots of the integrated Kullback-Leibler distance for the Bayesian estimator when outlier detection is turned on (left) and when it is not (right); $g(s) = 4s - 2$ and $\pi_g = 0.10$. Panel (b) has the same interpretation as (a), but with $\pi_g = 0.20$.

Figure 5: Plot of the estimate of $p(HA = 1 \mid CR)$ when outlier detection is turned on (dotted line) and when it is not (solid line).

Figure 6: Heart attack data. Panels (a)-(c) are the Bayesian estimates of $\beta_0 + \beta_1 BP + f_1(BP)$, which is the effect of blood pressure, $\beta_2 CR + f_2(CR)$, which is the effect of the cholesterol ratio, and $\beta_3 Age + f_3(Age)$, which is the effect of age, when outlier detection is turned on (dotted line) and when it is not (solid line).

Figure 7: Additive model. Boxplots of the integrated Kullback-Leibler distance of the Bayesian estimator (left) and the GRKPACK estimator (right); the probit regression function is $\Phi\{\sin(\pi s/10) + t/4 + \exp(u/60)^3 - 4\}$.

Figure 8: (a) Plot of the logistic density $e^s/(1 + e^s)^2$ (solid line), the mixture of normals approximation (dotted line), and the $t_8$ approximation (dashed line); (b) plots of the logarithms of the densities to the base $e$.

Figure 9: Additive model. Boxplots of the integrated Kullback-Leibler distance using the probit link function (left) and the logistic link function (right). The data are generated from the model $p(w = 1 \mid s, t) = \Lambda\{g_1(s) + g_2(t)\}$, where $\Lambda$ is the logistic link function, $g_1(s) = \sin(4\pi s)$ and $g_2(t) = 9/(2\pi)^{1/2} \exp\{-18(t - 0.5)^2\} - 2$.