OPTIMAL BAYESIAN DESIGN APPLIED TO LOGISTIC REGRESSION EXPERIMENTS

Kathryn CHALONER and Kinley LARNTZ
Department of Applied Statistics, University of Minnesota, St. Paul, MN 55108, U.S.A.

Journal of Statistical Planning and Inference 21 (1989) 191-208, North-Holland

Received 17 August 1987; revised manuscript received 24 January 1988
Recommended by A. Cohen

Abstract: A traditional way to design a binary response experiment is to design the experiment to be most efficient for a best guess of the parameter values. A design which is optimal for a best guess, however, may not be efficient for parameter values close to that best guess. We propose a general theory of optimal Bayesian design for non-linear models, valid for concave design criteria, which formally accounts for prior uncertainty in the parameter values, and then apply the theory to logistic regression. Designs are found by numerical optimization for a range of prior distributions and a range of criteria. Designs for a situation where the best guess has substantial uncertainty attached to it are examined and are very different from designs for a situation where approximate values of the parameters are known. The theoretical results are used to verify that the designs found are indeed optimal.

AMS Subject Classification: Primary 62K05; Secondary 62F15.

Key words: Algorithm; A-optimality; Bayesian design; binary data; D-optimality; general equivalence theorem; logistic regression; Nelder-Mead simplex algorithm; non-linear models; optimal design; optimization; reliability experiments; stress testing.
1. Introduction

Designing an experiment for a binary response is important in the biological sciences for bioassay experiments and in the engineering sciences for reliability testing. We will focus specifically on the logistic regression experiment, although the theory and methods apply to any generalized linear model. A logistic regression experiment is just one example of a non-linear design problem. The non-linear problem is intrinsically much harder than the linear problem as the efficiency of a design depends on the unknown parameters. A common approach is to design an experiment to be efficient for a best guess of the parameter values. This approach leads to what are termed 'locally optimal' designs, the term being introduced by Chernoff (1953). A natural generalization is to use a prior distribution on the unknown parameters rather than a single guess. We use this approach here and justify it using Bayesian arguments.

0378-3758/89/$3.50 © 1989, Elsevier Science Publishers B.V. (North-Holland)


Non-Bayesian design for logistic regression is discussed at length in Finney (1978), Abdelbasit and Plackett (1983) and Meeker and Hahn (1977). Various Bayesian approaches are given in Tsutakawa (1972, 1980), Zacks (1977), Freeman (1970), Leonard (1982) and Owen (1975). There is a large literature on sequential design for binary response models, see for example Wu (1985), but we will not consider the sequential approach in this paper. In many experimental situations sequential design is impractical or too expensive (see Anbar, 1984). We first present a unifying theory of optimal Bayesian design for non-linear problems. This theory parallels the theory for linear problems as presented in Silvey (1980) and Fedorov (1972, 1981). The theory is valid, under regularity conditions, for general concave optimality criteria. The concavity arises from the formulation of the criteria. The general theory developed in Section 2 discusses an equivalence theorem of Whittle (1973) and Dubov (1977) and shows how the theorem applies to Bayesian design problems. The theorem also provides a method of verifying whether or not a particular design is optimal, or close to optimal. In Section 3 we discuss the logistic regression model and give expressions for the directional derivatives needed to use the theorem. Two concave criteria will be used. The first criterion corresponds to maximizing the average over the prior distribution of the log of the determinant of the information matrix. This maximizes, approximately, the expected increase in Shannon information provided by the experiment, as suggested by Lindley (1956) and given a decision theoretic basis by Bernardo (1979). It was used by Lauter (1974, 1976) for mixtures of linear models. This criterion is equivalent to D-optimality in linear design. The second criterion corresponds to minimizing the approximate expected posterior variance of the quantities of interest.
This criterion requires the experimenter to specify what is to be estimated or what is to be predicted and the relative importance of these predictions. A weighted trace of the inverse of the information matrix is then averaged over the prior distribution and minimized. In linear problems this criterion is called A-optimality. Both these criteria correspond to having a fixed number of observations to allocate. In Section 4 an experiment to investigate these designs for a range of prior distributions is described and the results discussed. One result is that as the uncertainty in the prior distribution increases so does the number of support points required. In Section 5 the designs from Section 4 are compared and contrasted to the corresponding locally optimal designs. We also examine whether designs that maximize the average log determinant of the information matrix are generally efficient in terms of minimizing average variances. In Section 6 we examine a few designs in more detail. Previous work in this area (Tsutakawa, 1972, 1980, and Abdelbasit and Plackett, 1983) is restricted to designs taking observations at equally spaced points with equal numbers of observations at each point. No such restrictions are made here. Extensions of the results of Abdelbasit and Plackett (1983) are given in Minkin (1987) and Khan and Yazdi (1988).

2. The general theory of optimal Bayesian design

2.1. The criteria

To introduce the notation, suppose that an experiment is to be designed by choosing n values of the design variable x from an experimental region X. At each of the n values x_i an independent observation y_i will be available. Let the data be denoted y^T = (y_1, ..., y_n) and the unknown parameters θ^T = (θ_1, ..., θ_p). The contribution to the likelihood for a single observation y_i made at x is p(y_i | θ, x). For most models using exact posterior distributions is intractable and asymptotic arguments must be used. Under easily satisfied assumptions the posterior distribution of θ is approximately a multivariate normal distribution with mean the maximum likelihood estimate, θ̂, and variance-covariance matrix the inverse of the observed Fisher information matrix, i.e. minus the matrix of second derivatives of the log likelihood function evaluated at the maximum likelihood estimate of θ. This approximation is given, for example, in Berger (1985), page 224. The expected Fisher information matrix could also be used, but for the logistic regression model they are identical. By expanding the definition of a design to include any probability measure η on X we define the normalized information matrix I(θ, η) by

I(θ, η) = ∫_X I(θ, x) η(dx),

where I(θ, x) is the expected Fisher information matrix from a single observation at x.

Assuming that the information matrix is nonsingular, the posterior distribution of θ using a design measure η is approximately multivariate normal N_p(θ̂, (n I(θ̂, η))^{-1}). Preposterior expected losses can therefore be approximated using the prior distribution of θ as the predictive distribution of θ̂. The first criterion we consider is to choose the design measure η maximizing φ1(η), where

φ1(η) = E_θ{log det I(θ, η)}

and the expectation E_θ is over the prior distribution of θ.

For a design measure η for which I(θ, η) is singular for θ values of non-zero prior probability we define φ1(η) to be -∞. The second criterion is φ2(η), which requires that the parameters to be estimated are specified and possibly weighted. The criterion is

φ2(η) = -E_θ{tr B(θ) I(θ, η)^{-1}},

where B(θ) is a symmetric p by p matrix. If linear combinations of the θ_i are of interest then B(θ) does not depend on θ and is a matrix of known weights. If non-linear combinations of the θ_i's are of interest then B(θ) has entries which are functions of θ.

2.2. General equivalence theorem

Define H to be the set of all probability measures on X and view the design problem as finding a measure in H that maximizes φ(η). We will only consider concave functions φ on H, and if we assume X is compact then H is compact in the weak convergence topology. Both φ1 and φ2 are concave on H for logistic regression experiments and many other design problems. In linear design problems the information matrix does not depend on θ and there is a compact set of possible information matrices to optimize over. The set is a finite dimensional subset of Euclidean space. The general equivalence theorems of optimal design are usually stated in terms of this set and directional derivatives defined on this set. This is the approach in Silvey (1980). For the non-linear problem we need to optimize over the set of measures H directly and define directional derivatives on H. Whittle (1973) uses this approach and we use his version of the general equivalence theorem. For two measures η1 and η2 in H the derivative at η1 in the direction of η2 is denoted by F(η1, η2) and is defined, when the limit exists, by

F(η1, η2) = lim_{ε→0+} ε^{-1}{φ((1 - ε)η1 + εη2) - φ(η1)}.

The extreme points of H are the measures which put point mass at a single point x in X. We denote such a measure η_x. If F(η1, η2) is linear in η2 then φ is said to be differentiable and

F(η1, η2) = ∫_X F(η1, η_x) η2(dx).

We further define d(η, x) to represent F(η, η_x). Theorem 1 of Whittle (1973) is usually used in the context of linear design but we use it here for non-linear design. This theorem forms the basis for checking whether particular designs are optimal and is reproduced below in our notation.

Theorem (Whittle, 1973). (a) If φ is concave, then a φ-optimal design η0 can be equivalently characterized by any of the three conditions
(i) η0 maximizes φ(η),
(ii) η0 minimizes sup_{x in X} d(η, x),
(iii) sup_{x in X} d(η0, x) = 0.
(b) The point (η0, η0) is a saddle point of F in that

F(η0, η1) ≤ F(η0, η0) ≤ F(η2, η0)   for all η1, η2 in H.

(c) If φ is differentiable, then the support of η0 is contained in the set of x for which d(η0, x) = 0 almost everywhere in η0 measure.

The proof is given by Whittle. Whittle's proof is in the context of the information matrix for the linear design problem but his proof applies in more general problems under some additional assumptions. Specifically, assume that X is compact, that the derivatives exist and are continuous in x, that there is at least one measure in H for which φ is finite, and that φ is such that if η_i → η in weak convergence then φ(η_i) → φ(η). Under these assumptions Whittle's proof applies to general non-linear problems. Dubov (1977) shows how this theorem applies to non-linear regression problems. The equivalence theorem proved by Lauter (1974) is a special case of Whittle's theorem for mixtures of linear models and is also used, for example, by Cook and Nachtsheim (1982) for mixtures of polynomial regression models. Mixtures of linear models are a special case since there is a bound on the number of support points in an optimal design. For models other than linear models, and specifically for logistic regression, there is no such bound. Some results and algorithms of linear design, as described in Wynn (1970) and Fedorov (1972), apply. Specifically, results apply if they do not require the existence of a finite dimensional set of information matrices not depending on θ. For local optimality all results of linear design apply (White, 1973) since for a fixed value of θ the set of information matrices is a finite dimensional set. Even though Whittle's theorem can be used to decide whether a particular design η0 is optimal, it does not appear to have been used elsewhere. For example Pronzato and Walter (1985) use φ1-optimality for some bioengineering design problems but do not use this theorem. These results could also be applied to the dynamic systems described in Mehra (1974) and Titterington (1980). Mehra does not use φ1- or φ2-optimality but uses a criterion which first computes the prior expectation of I(θ, η) and then optimizes the determinant or trace of that matrix.

2.3. The criteria φ1 and φ2

For the two criteria φ1 and φ2 the necessary directional derivatives are

(i) for φ1,

d(η, x) = E_θ{tr I(θ, η)^{-1} I(θ, η_x)} - p,

where p is the dimension of θ, and

(ii) for φ2,

d(η, x) = E_θ{tr B(θ) I(θ, η)^{-1} I(θ, η_x) I(θ, η)^{-1}} + φ2(η).

We now proceed to use these general results for designing logistic regression experiments.

3. The logistic regression design problem

The logistic regression model is such that, for an observation at a value x of the explanatory variable, a Bernoulli response y is observed. There are two parameters θ^T = (μ, β) and the probability of success is

p(x, θ) = 1/(1 + exp(-β(x - μ))).


The model is a generalized linear model with the logit transformation of the expected response being linear in x. The parameter μ is the value of x at which the probability of success is 0.5 and β is the slope in the logit scale. The parameters μ and β can be thought of as representing the location and scale of an underlying tolerance distribution. We will assume that X is closed and bounded and therefore H will be compact. We also suppose that the support of the prior measure is bounded. This requirement ensures non-singularity of I(θ, η) for all measures η except for those corresponding to point mass at a particular x. These conditions are sufficient for the results of Section 2 to be applied to logistic regression. For a design measure η on X putting weight n_i at each of k distinct design points x_i, i = 1, ..., k, with Σ n_i = 1, define

w_i = p(x_i, θ)(1 - p(x_i, θ)),

t = Σ_{i=1}^{k} n_i w_i,    x̄ = t^{-1} Σ_{i=1}^{k} n_i w_i x_i,    s = Σ_{i=1}^{k} n_i w_i (x_i - x̄)².

Note that w_i, t, x̄ and s all depend on θ, but this dependence has been dropped to simplify the notation. The normalized information matrix can be written as

I(θ, η) = ( β²t            -βt(x̄ - μ)      )
          ( -βt(x̄ - μ)     s + t(x̄ - μ)²   ).                      (2)

The inverse of the information matrix is therefore

I(θ, η)^{-1} = β^{-2} ( 1/t + (x̄ - μ)²/s     β(x̄ - μ)/s )
                      ( β(x̄ - μ)/s           β²/s        ).
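These matrix expressions can be checked numerically. The following sketch (ours, in Python with NumPy; the design points, weights and parameter values are arbitrary illustrations) builds I(θ, η) from the quantities w_i, t, x̄ and s and confirms that det I(θ, η) = β²ts and that the inverse agrees with the closed form above.

```python
import numpy as np

def design_summaries(mu, beta, xs, wts):
    """w_i, t, xbar and s for a design putting weights wts on points xs."""
    p = 1.0 / (1.0 + np.exp(-beta * (xs - mu)))   # p(x_i, theta)
    w = p * (1.0 - p)                             # w_i
    t = np.sum(wts * w)
    xbar = np.sum(wts * w * xs) / t
    s = np.sum(wts * w * (xs - xbar) ** 2)
    return w, t, xbar, s

def logistic_info(mu, beta, xs, wts):
    """Normalized information matrix (2) for the logistic model."""
    _, t, xbar, s = design_summaries(mu, beta, xs, wts)
    return np.array([[beta ** 2 * t, -beta * t * (xbar - mu)],
                     [-beta * t * (xbar - mu), s + t * (xbar - mu) ** 2]])

# arbitrary illustrative design and parameter values
mu, beta = 0.1, 7.0
xs = np.array([-0.3, 0.0, 0.3])
wts = np.array([0.3, 0.4, 0.3])

I = logistic_info(mu, beta, xs, wts)
_, t, xbar, s = design_summaries(mu, beta, xs, wts)

# det I = beta^2 * t * s, and the inverse matches the closed form
inv_closed = beta ** -2 * np.array(
    [[1 / t + (xbar - mu) ** 2 / s, beta * (xbar - mu) / s],
     [beta * (xbar - mu) / s, beta ** 2 / s]])
print(np.linalg.det(I), beta ** 2 * t * s)
print(np.allclose(np.linalg.inv(I), inv_closed))
```

The same check applies for any nonsingular design measure, since the closed-form inverse follows directly from the 2 × 2 adjugate of (2).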

It can be shown that the expected Fisher information matrix for any generalized linear model, with a single explanatory variable, takes a form similar to (2). This result follows from the form of the likelihood function for any member of the exponential family. Details can be found in Zacks (1977) for models with no unknown scale parameter and Chaloner (1987) for models with a scale parameter. The form of the information matrix (2) makes the concavity of φ1 and φ2 straightforward to prove. For example, to show that φ1 is concave, take any two measures in H, η1 and η2, and any α where 0 < α < 1. It is easy to show that for any θ

log det I(θ, αη1 + (1 - α)η2) ≥ α log det I(θ, η1) + (1 - α) log det I(θ, η2).

Taking expectations over both sides of the above inequality gives the required concavity of φ1 over H. A similar argument yields the concavity of φ2. The φ1-criterion and several versions of φ2-optimality will now be examined and expressions will be found for the criteria and their derivatives.


3.1. φ1-optimality

For φ1-optimality, maximizing the average log determinant, the expressions simplify to give

φ1(η) = E_θ{log(β²ts)}

and if we define w(x, θ) = p(x, θ)(1 - p(x, θ)) we have

d(η, x) = E_θ{w(x, θ)[t^{-1} + (x̄ - x)² s^{-1}]} - 2.
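These expressions are straightforward to approximate by Monte Carlo. The sketch below (ours, not the authors' FORTRAN implementation; the three-point design and the prior, μ uniform on [-0.3, 0.3] and independently β uniform on [6, 8], are illustrative choices taken from Section 4) evaluates φ1(η) and the derivative d(η, x), and shows the behaviour discussed next: far from the design region the derivative approaches -2.

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative prior: mu ~ U[-0.3, 0.3] and, independently, beta ~ U[6, 8]
mus = rng.uniform(-0.3, 0.3, 4000)
betas = rng.uniform(6.0, 8.0, 4000)

def summaries(xs, wts):
    """t, xbar and s of Section 3 for every prior draw (vectorized over draws)."""
    p = 1.0 / (1.0 + np.exp(-betas[:, None] * (xs - mus[:, None])))
    w = p * (1.0 - p)
    t = (wts * w).sum(axis=1)
    xbar = (wts * w * xs).sum(axis=1) / t
    s = (wts * w * (xs - xbar[:, None]) ** 2).sum(axis=1)
    return t, xbar, s

def phi1(xs, wts):
    """phi_1(eta) = E_theta log det I(theta, eta) = E_theta log(beta^2 t s)."""
    t, _, s = summaries(xs, wts)
    return np.mean(np.log(betas ** 2 * t * s))

def d_phi1(xs, wts, x):
    """d(eta, x) = E_theta{ w(x, theta)[1/t + (xbar - x)^2 / s] } - 2."""
    t, xbar, s = summaries(xs, wts)
    p = 1.0 / (1.0 + np.exp(-betas * (x - mus)))
    w = p * (1.0 - p)
    return np.mean(w * (1.0 / t + (xbar - x) ** 2 / s)) - 2.0

xs = np.array([-0.25, 0.0, 0.25])
wts = np.array([1 / 3, 1 / 3, 1 / 3])
print(phi1(xs, wts))         # finite for this nonsingular design
print(d_phi1(xs, wts, 5.0))  # w(x, theta) is negligible here, so d is near -2
```

Evaluating d(η, x) over a grid of x values in this way is exactly the optimality check used for the figures later in the paper.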

For any design η, we see that if |x| → ∞ then w(x, θ) → 0 at an exponential rate, and therefore d(η, x) → -2. From Whittle's result, we know that at points of support of the optimal design this derivative must be exactly zero. For sufficiently wide X, the design points of the optimal design will therefore not be at the boundary of X. Rather, if η0 is the optimal design, the support points will be at the roots of the function d(η0, x). This is different from design for linear models, where optimal designs require that |x| is as large as possible.

3.2. φ2-optimality

The weighted trace criterion of φ2-optimality is perhaps more appealing than φ1-optimality as it corresponds to the familiar squared error loss function. It requires that the experimenter carefully specify exactly what is to be estimated or predicted. The specification of what is to be estimated provides the matrix B(θ). If, for example, the only parameter of interest is μ, then B(θ) can be written as cc^T with c = (1, 0)^T. If both μ and β are of equal interest then B(θ) is the identity matrix. It is often of interest, especially in reliability and bioassay experiments, to estimate the value of x at which the probability of success, p(x, θ), is a particular value. Suppose that we wish to estimate x0 where logit(p(x0, θ)) = γ; then x0 = μ + γβ^{-1}, which is a non-linear function of θ. Standard asymptotic arguments give B(θ) = c(θ)c(θ)^T where c(θ) is the vector of derivatives of the non-linear function. Specifically, c(θ)^T = (1, -γβ^{-2}). If several percentile response points are of interest then a distribution could be put on γ and B(θ) averaged over this distribution. This is a standard way of using the weighted trace criterion in a Bayesian framework (see, for example, Chaloner, 1984, for this approach in linear models). We implement the design algorithm for three choices of B(θ). The first choice is where μ is the only parameter of interest. The second is where the only parameter of interest is the value x such that p(x, θ) = 0.95. The third is when we are interested in estimating the value of x at which logit(p(x, θ)) = γ and γ has a uniform distribution over [-1, 1]. This range on γ gives probabilities between 0.27 and 0.73. These

three choices are by no means exhaustive of the many which might occur in real experimental situations, but they represent a range of criteria which might be appropriate in particular experiments. These three examples of φ2-optimality will be examined and optimal designs found for a range of uniform and independent prior distributions for μ and β. The criterion functions and derivatives are now derived for these examples of φ2-optimality. Recall that the derivative involves the weighted trace of the matrix I(θ, η)^{-1} I(θ, η_x) I(θ, η)^{-1}. Expressions for the entries of this matrix are given in the Appendix.

3.3. φ2-optimality for estimation of μ

When interest is in estimation of μ alone then the (1,1) entry of B(θ) is 1 and all other entries are zero. For any design η with more than one support point,

φ2(η) = -E_θ{β^{-2}[t^{-1} + (x̄ - μ)² s^{-1}]}

and

d(η, x) = E_θ{w(x, θ)(βst)^{-2}[t(x̄ - x)(x̄ - μ) + s]²} + φ2(η).

Estimating μ alone is of interest in many biological applications; the parameter μ is often referred to as the LD50 as it is the 50th percentile of the underlying logistic tolerance distribution.

3.4. φ2-optimality for any or several percentile response points

Following the earlier discussion in 3.2, to estimate μ + γβ^{-1} we have

φ2(η) = -E_θ{β^{-2}[t^{-1} + (γ - β(x̄ - μ))² β^{-2} s^{-1}]}

and

d(η, x) = E_θ{w(x, θ)(β²st)^{-2}[t(x̄ - x)(β(x̄ - μ) - γ) + βs]²} + φ2(η).

Substituting γ = 0 gives the expressions in 3.3 for the special case of estimating μ alone. Substituting γ = 2.944 gives the situation where we want to estimate the value of x at which p(x, θ) = 0.95. This value of x is referred to as the LD95 and is often of interest in reliability experiments (see Meeker and Hahn, 1977). If a distribution is put over γ to represent interest in several percentile response points then define

B(θ) = ( 1               -E(γ)β^{-2}  )
       ( -E(γ)β^{-2}     E(γ²)β^{-4}  ).

The expectation is over the distribution on γ. An example where γ has a uniform distribution on [-1, 1] will later be used. In that case E(γ) = 0 and E(γ²) = 1/3, and the criterion and derivative become

φ2(η) = -E_θ{β^{-2}[t^{-1} + (x̄ - μ)² s^{-1} + (3β²s)^{-1}]}

and

d(η, x) = E_θ{w(x, θ)(β²st)^{-2}[β²(t(x̄ - x)(x̄ - μ) + s)² + t²(x̄ - x)² 3^{-1}]} + φ2(η).

A uniform [-1, 1] distribution on γ represents interest in calibrating the central part of the logistic response curve.

4. Implementation

Finding optimal designs is a problem in numerical optimization. We have found the Nelder and Mead (1965) version of the simplex algorithm to be effective for these problems. This method is for unconstrained optimization and we transform the problem to satisfy this. The method also requires that the number of design points be specified and so we repeat the optimization, starting with 2 design points and increasing the number of support points steadily up to 11, or, if necessary, 20. We then choose the design which optimizes the criterion on the smallest number of design points. To verify that a design chosen this way does indeed correspond to a global optimum of the criterion, the derivative d(η, x) is evaluated over possible values of x. A double precision FORTRAN implementation of this procedure is available from the authors. This implementation uses an adaptation of O'Neill's (1971) subroutine and standard numerical integration methods. Algorithms of the type proposed by Wynn (1972) and Fedorov (1972) could also be used and versions of these were implemented. These were found to be very slow, especially for vague prior distributions. Our implementations of these algorithms could undoubtedly be improved but we found the Nelder and Mead algorithm to be reliable and adequate, especially given the ability to verify that a global optimum had been found by examining the derivative function.

4.1. A numerical investigation

A numerical investigation using our implementation of the Nelder-Mead algorithm will now be described. Designs were found for a range of uniform and independent prior distributions for μ and β and a range of criteria. It was assumed that μ and β were independent in the prior distribution and each had a uniform distribution on an interval. Three different intervals were chosen for each parameter and all nine combinations were used to generate nine different prior distributions in a 3 × 3 factorial structure. The intervals for μ were all centered at zero and were [-0.1, 0.1], [-0.3, 0.3] and [-1.0, 1.0]. The intervals for β were [6.9, 7.1], [6.0, 8.0] and [4.0, 10.0] and so they were all centered at 7. These intervals were motivated by the example in Brown (1966) and represent a range of prior distributions from quite


precise to quite vague but informative. Even the vaguest prior distribution is proper and recognizes that the sign of β is known and a range is known for μ. This knowledge would be reasonable to assume in many real experimental situations. Independent uniform prior distributions on an interval may not always be appropriate but they serve to illustrate our technique. The computer implementation may easily be modified to use any prior distribution. For each of the nine uniform prior distributions optimal designs were found using the Nelder-Mead algorithm. Four criteria were used: φ1-optimality and the three versions of φ2-optimality introduced in Sections 3.3 and 3.4.

4.2. φ1-optimality

Figure 1 highlights some of the results for maximizing the average log determinant. The figure displays the designs as probability mass functions and gives the maximum value attained by the criterion. Examination of Figure 1 reveals that as our uncertainty increases so does the minimum number of support points required to attain the optimal value. For μ uniform on [-0.1, 0.1] optimal designs for all three prior distributions on β are on only k = 2 points. For μ uniform on [-0.3, 0.3] the optimal design is on k = 3 points and for μ uniform on [-1.0, 1.0] more than 7 points are needed. The designs appear to be much more sensitive to the prior distribution on μ than on β.
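The search described in Section 4 can be sketched in a few lines. The sketch below is ours: it uses SciPy's Nelder-Mead rather than the authors' FORTRAN code, fixes the number of support points at k = 3 (the paper instead repeats the search over increasing k), takes the prior cell μ ~ U[-0.3, 0.3], β ~ U[6, 8], and replaces exact integration over the prior by a midpoint quadrature grid. The design points enter unconstrained and the weights are kept on the simplex by a softmax transformation.

```python
import numpy as np
from scipy.optimize import minimize

# midpoint quadrature grid over the prior cell mu ~ U[-0.3, 0.3], beta ~ U[6, 8]
mu_nodes = np.linspace(-0.2925, 0.2925, 40)
beta_nodes = np.linspace(6.025, 7.975, 40)
MU, BETA = np.meshgrid(mu_nodes, beta_nodes, indexing="ij")

K = 3  # number of support points, fixed for this sketch

def neg_phi1(z):
    """-phi_1(eta): z holds K unconstrained points and K softmax weight logits."""
    xs = z[:K]
    wts = np.exp(z[K:] - z[K:].max())
    wts = wts / wts.sum()
    p = 1.0 / (1.0 + np.exp(-BETA[..., None] * (xs - MU[..., None])))
    w = p * (1.0 - p)
    t = (wts * w).sum(-1)
    xbar = (wts * w * xs).sum(-1) / t
    s = (wts * w * (xs - xbar[..., None]) ** 2).sum(-1)
    return -np.mean(np.log(BETA ** 2 * t * s))  # E log det I = E log(beta^2 t s)

z0 = np.concatenate([np.array([-0.4, 0.0, 0.4]), np.zeros(K)])
res = minimize(neg_phi1, z0, method="Nelder-Mead",
               options={"xatol": 1e-7, "fatol": 1e-10,
                        "maxiter": 20000, "maxfev": 20000})

order = np.argsort(res.x[:K])
points = res.x[:K][order]
logits = res.x[K:]
weights = (np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum())[order]
print(points, weights, -res.fun)
```

The derivative function d(η, x) of Section 3.1 can then be evaluated on a grid of x values to check, as in the figures, that the design found is close to optimal.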

Fig. 1. φ1-optimal designs plotted as probability mass functions on [-1, 1]. The numbers in parentheses are the maximum values attained by the criterion.


4.3. φ2-optimality for the estimation of μ

Figure 2 gives the corresponding results for minimizing the expected variance of μ. Again, as for φ1-optimality, the more uncertainty there is in the prior distribution the more support points are needed.

4.4. φ2-optimality for the estimation of LD95

In engineering problems it may well be an extreme percentile, such as the point where the probability of response is 0.95, that is of interest. Unlike the two previous examples the optimal design is not necessarily symmetric about the prior mean for μ. Figure 3 illustrates these designs. They are clearly not symmetric and can be quite different depending on the prior distribution. We noted that the criterion function to be optimized in this case does not appear to be as well behaved as for the other criteria studied.

4.5. φ2-optimality for the average percentile response point

As described earlier we use a criterion designed to minimize the average variance of μ + γβ^{-1} where γ has a uniform distribution over [-1, 1]. The results are summarized in Figure 4. The designs are symmetric around zero and all are within the interval [-1, 1]. Again we see that as prior uncertainty increases so does the number of support points.

Fig. 2. φ2-optimal designs for estimating the LD50, μ, plotted as probability mass functions on [-1, 1]. The numbers in parentheses are the minimum average variance.

Fig. 3. φ2-optimal designs for estimating the LD95 plotted as probability mass functions on [-1.2, 1.2]. The numbers in parentheses are the minimum average variance.

Fig. 4. φ2-optimal designs for estimating the average percentile response plotted as probability mass functions on [-1, 1]. The numbers in parentheses are the minimum average variance.

4.6. The locally optimal design

In these examples the best guess for (μ, β) would be (0, 7). The locally φ1-optimal design is to put weight 0.5 at each of +0.2205 and -0.2205. These are the points at which the probability p(x, θ), for these values of the parameters θ, is 0.824 and 0.176.
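This locally optimal design is easy to reproduce. For a symmetric two-point design with weight 1/2 at ±x and best guess (μ, β) = (0, 7), the matrix (2) gives t = w(x), x̄ = 0 and s = w(x)x², so det I = β²w(x)²x², and a one-dimensional grid search (our sketch below) recovers the points quoted above.

```python
import numpy as np

beta = 7.0  # best guess, with mu = 0
x = np.arange(1e-4, 0.6, 1e-5)        # candidate half-widths of the design
p = 1.0 / (1.0 + np.exp(-beta * x))   # p(x, theta) at the upper point
g = x * p * (1.0 - p)                 # det I = (beta * g)^2, so maximize g
x_star = x[np.argmax(g)]
p_star = 1.0 / (1.0 + np.exp(-beta * x_star))
print(round(x_star, 4), round(p_star, 3), round(1 - p_star, 3))
# → 0.2205 0.824 0.176
```

Equivalently, the optimal points satisfy β(x - μ) ≈ ±1.5434 on the logit scale, which rescales to any other best guess of (μ, β).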

4.7. General conclusions

In general, as the uncertainty in the prior distribution increases so does the number of support points. This result is in accord with the results of Tsutakawa (1972, 1980) and Abdelbasit and Plackett (1983) who, however, restricted their designs to be equally spaced and have equal weights. The designs we found make no such restriction and indeed the designs, in general, are neither equally spaced nor have equal weight on all design points, especially for estimating the LD95. A feature common to all four design criteria was that as the uncertainty in the prior distribution increased so did the length of computing time. Looking at the derivative functions we found that as the uncertainty in the prior distribution increased, the derivative function d(η, x) of a design close to optimal became very flat compared to the derivative functions for a sharper prior distribution. Optimization in these cases is thus more difficult.

5. Efficiencies

Efficiency calculations were made for the nine examples used in Section 4. We calculated the improvement in using either the φ1-optimal or φ2-optimal design instead of the locally D-optimal design. The locally D-optimal design is most inefficient for the prior distribution where μ is uniform over [-1, 1] and, independently, β is uniform over [4, 10]. For this prior distribution, the locally D-optimal design requires 5.5 times as many observations to be as efficient as the φ1-optimal design in maximizing the expected log of the determinant of the unnormalized information matrix. For minimizing the expected asymptotic variance of μ, the locally D-optimal design requires 480 times as many observations as the φ2-optimal design. Details of these efficiency calculations and tables for all the examples of Section 4 are in a technical report available from the authors.


6. Two examples

In order to further understand these designs we take two particular prior distributions and study the optimal designs in detail. We take the prior distribution where β is between 6 and 8 and two choices of the prior distribution on μ, namely μ between -0.3 and 0.3, and μ between -1 and 1. These two prior distributions represent a moderate amount of prior information and more vague prior information, and they illustrate the difference caused by changing the prior distribution on μ. First consider the prior distribution representing a moderate amount of information: μ uniform on [-0.3, 0.3] and, independently, β uniform on [6, 8]. That the φ1-optimal design in Figure 1 is indeed optimal can be verified by plotting the derivative function d(η, x). The plot is given in Figure 5(a) and it is clear that the function is negative almost everywhere and has exactly three roots, at the design points. Plots are also given of the corresponding derivatives for the criteria for the other three optimal designs in 5(b), 5(c) and 5(d). Now consider vaguer prior information, where μ is uniform on [-1, 1] and β is again uniform on [6, 8]. Figure 6 gives the corresponding plot of the derivative function d(η, x), again for values of x between -1.2 and 1.2. In comparison to Figure 5 we see that the derivative is much flatter in all four cases.

Fig. 5. Derivative functions d(η, x) plotted against x for each of the four optimal designs found for the prior distribution with μ uniform on [-0.3, 0.3] and β uniform on [6, 8]. (a) φ1-optimality; (b) φ2-optimality for LD50; (c) φ2-optimality for LD95; (d) φ2-optimality for average percentile.

K. Chaloner, K. Larntz / Optimal Bayesian design

Fig. 6. Derivative functions d(η,x) plotted against x for each of the four optimal designs found for the prior distribution with μ uniform on [-1, 1] and β uniform on [6, 8]. (a) Φ₁-optimality; (b) Φ₂-optimality for LD50; (c) Φ₂-optimality for LD95; (d) Φ₂-optimality for average percentile.

7. Asymptotic approximations

A referee has pointed out that, as the justification of these Bayesian designs depends on asymptotic arguments, the effectiveness of the designs will depend heavily on the accuracy of the asymptotic approximations. This is clearly very important. There are really two separate issues here: first, whether the value of the criterion really does approximate the expected posterior utility well, and second, whether a design obtained by maximizing the approximate posterior utility gives an actual posterior utility close to the exact optimal value. We have done some preliminary numerical work to answer these questions for the prior distributions used in Section 4. We simulated 1500 replications of the experiment with a sample size of n = 100 for the Φ₂-optimal design for estimating μ. We also simulated 1500 replications for eight other designs obtained by perturbing the Φ₂-optimal design slightly. The actual posterior variance of μ was calculated for each replication of each experiment. The median of the actual posterior variance of μ was close to its asymptotic approximation, n⁻¹Φ₂(η), for all the designs. The Φ₂-optimal design, therefore, has reasonable empirical properties for n = 100. The details of this simulation are omitted to save space; they are in a technical report available from the authors. This very preliminary investigation is clearly incomplete, and we intend to explore this issue for other criteria, prior distributions and sample sizes in future research.
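A simulation check of this kind can be imitated in a few lines. The sketch below makes several simplifying assumptions not taken from the paper: a placeholder three-point design (not the Φ₂-optimal design), hypothetical true values (μ₀, β₀) = (0.1, 7.0), only 200 replications instead of 1500, and a crude grid approximation to the posterior of (μ, β) under the uniform prior.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_posterior_var_mu(design, n, mu0, beta0, grid):
    """One replication: draw binomial responses at the design points,
    then compute Var(mu | data) on a discretized (mu, beta) grid under
    the uniform prior mu ~ U[-0.3, 0.3], beta ~ U[6, 8]."""
    mus, betas = grid
    logpost = np.zeros((len(mus), len(betas)))   # flat prior on the grid
    for x, wt in design:
        m = max(1, round(wt * n))                # integer allocation at x
        p_true = 1 / (1 + np.exp(-beta0 * (x - mu0)))
        y = rng.binomial(m, p_true)
        P = 1 / (1 + np.exp(-betas[None, :] * (x - mus[:, None])))
        logpost += y * np.log(P) + (m - y) * np.log(1 - P)
    post = np.exp(logpost - logpost.max())
    post /= post.sum()
    pmu = post.sum(axis=1)                       # marginal posterior of mu
    mean = (mus * pmu).sum()
    return ((mus - mean) ** 2 * pmu).sum()

grid = (np.linspace(-0.3, 0.3, 61), np.linspace(6.0, 8.0, 41))
design = [(-0.25, 0.3), (0.0, 0.4), (0.25, 0.3)]   # placeholder design
vars_mu = [simulate_posterior_var_mu(design, 100, 0.1, 7.0, grid)
           for _ in range(200)]
median_var = float(np.median(vars_mu))
```

The median of `vars_mu` plays the role of the median actual posterior variance in the text; comparing it against the asymptotic value of the criterion is then a one-line check.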

8. Conclusion

Numerical optimization is efficient in finding good designs which are close to optimal under these Bayesian criteria. The results of Section 2 are useful in understanding the problem and in verifying that the designs obtained from the optimization algorithm are indeed close to optimal. The prior distributions used in this paper, uniform and independent prior distributions for μ and β, clearly do not always apply. The methodology we have presented, however, applies to any situation where bounds are available on the parameter space. We believe that this methodology is preferable to the procedure of designing for a single best guess of the parameter values. We also believe that it provides a framework for the careful examination of any experimental procedure in which the data analysis will include fitting a generalized linear model. Of course, in practice designs are required which have an integer number of observations. The approximate optimal designs derived here can be rounded to exact designs, and the experimenter can calculate exactly how much is lost by rounding. These approximate designs are easy to derive. Even if an experimenter cannot use one of these designs exactly, they provide a useful benchmark with which to calibrate other designs. Interestingly, for these designs, the number of support points is generally greater than the number of parameters and, therefore, checks can usually be made on the appropriateness of the model.
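The rounding step mentioned above can be illustrated with a largest-remainder rule; this is one simple apportionment scheme among several, and the paper does not prescribe a particular one.

```python
import math

def round_design(weights, n):
    """Round approximate design weights to integer allocations summing
    to n, using a largest-remainder rule."""
    raw = [w * n for w in weights]
    alloc = [math.floor(r) for r in raw]
    remainder = n - sum(alloc)
    # give the leftover observations to the largest fractional parts
    order = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i],
                   reverse=True)
    for i in order[:remainder]:
        alloc[i] += 1
    return alloc

round_design([0.3, 0.4, 0.3], 25)   # one allocation, e.g. [8, 10, 7]
```

The loss from rounding can then be measured by evaluating the design criterion at the exact design and comparing it with the approximate optimum.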

Appendix

Denote the matrix A by

A = I(θ,η)⁻¹ I(θ,η_x) I(θ,η)⁻¹;

then the derivative for the Φ₂-criterion is

d(η,x) = E_θ[tr B(θ)A] + Φ₂(η).

The entries of the matrix A can be written as

A₁₁ = w(x,θ)(βst)⁻²[t(x̄−x)(x−μ)+s]²,
A₁₂ = w(x,θ)(βts²)⁻¹(x̄−x)[t(x̄−x)(x−μ)+s],
A₂₁ = A₁₂,
A₂₂ = w(x,θ)(x̄−x)²s⁻².

These expressions simplify the algebra required to find the derivatives for different choices of B(θ).
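A consequence of these expressions worth noting is that A inherits rank one from I(θ,η_x), so the entries satisfy A₁₂² = A₁₁A₂₂. This is easy to verify numerically; the sketch below uses placeholder parameter values and a placeholder design, not figures from the paper.

```python
import numpy as np

def logistic_info(x, mu, beta):
    """Per-observation Fisher information for (mu, beta) in the
    logistic model p = 1 / (1 + exp(-beta * (x - mu)))."""
    p = 1 / (1 + np.exp(-beta * (x - mu)))
    g = np.array([-beta, x - mu])       # gradient of the linear predictor
    return p * (1 - p) * np.outer(g, g)

mu, beta = 0.1, 7.0                     # placeholder parameter values
design = [(-0.25, 0.3), (0.0, 0.4), (0.25, 0.3)]   # placeholder design
I_eta = sum(wt * logistic_info(x, mu, beta) for x, wt in design)
I_inv = np.linalg.inv(I_eta)

x_new = 0.4
A = I_inv @ logistic_info(x_new, mu, beta) @ I_inv
# A is rank one, so the off-diagonal entry squares to the product
# of the diagonal entries:
print(np.isclose(A[0, 1] ** 2, A[0, 0] * A[1, 1]))   # -> True
```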


Acknowledgements We are grateful to the referees and our colleagues at the University of Minnesota for helpful suggestions. We are also grateful for the comments of the participants at the workshop, in March 1987, at Castle Wartburg, Eisenach, DDR. This research was supported in part by grant number DMS-8706754 from the National Science Foundation.
