
Biometrika (1997), 84, 2, pp. 269-281. Printed in Great Britain

On the rate of convergence of the ECME algorithm for multiple regression models with t-distributed errors

BY JEANNE KOWALSKI, XIN M. TU
Department of Mathematics and Statistics, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, U.S.A.
e-mail: JJK+@pitt.edu, xintu+@pitt.edu

ROGER S. DAY
Department of Biostatistics, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, U.S.A.
e-mail: [email protected]

AND JOSÉ R. MENDOZA-BLANCO
Universidad Nacional Autónoma de México, Facultad de Ciencias, Departamento de Matemáticas, Circuito Exterior, Cd. Universitaria, C.P. 04510, México, D.F., México
e-mail: [email protected]

SUMMARY

Although much work has been done on comparing and contrasting the EM and ECME algorithms in terms of their rates of convergence, it is not clear what mechanism underlies each and, furthermore, what factors may determine and influence their rates of convergence. In this paper, we examine the convergence rates and properties of these two popular optimisation algorithms as used in computing the maximum likelihood estimates from regression models with t-distributed errors. By approaching this computing problem through two data augmentation schemes, as well as variations of these well-known algorithms, we offer a more complete view of the performance of each.

Some key words: ECME; EM; Maximum likelihood; Step-length Newton's method; t-distribution.

1. INTRODUCTION

For maximum likelihood estimation of the parameters of regression models with t-distributed errors, many authors have employed a missing data approach, e.g. Lange, Little & Taylor (1989) and Liu & Rubin (1995). By representing the t-distribution as a hierarchical model involving the normal distribution, this approach exploits the simple form of the maximum likelihood estimates for regression with normal errors. Although the missing data approach leads to simple algebra, the expectation/maximisation, EM, algorithm based on it is much too slow to be of any practical use. Liu & Rubin (1995) recently proposed an 'expectation/conditional maximisation either', ECME, algorithm to accelerate the convergence of the EM. Although their results characterise the relation of ECME to the EM, ECM and CM, conditional maximisation, algorithms through the amount of observed information associated with each, the precise mechanism underlying the acceleration of ECME for the present problem is not clear.
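To fix ideas, the EM iteration just described reduces, for known degrees of freedom $\nu$, to iteratively reweighted least squares: the E-step computes weights $w_i = (\nu + 1)/(\nu + \delta_i)$ with $\delta_i = e_i^2/\sigma^2$, and the M-step performs a weighted least-squares update. The sketch below is our own minimal Python illustration of this standard scheme, not the authors' code; all function and variable names are ours.

```python
import numpy as np

def em_t_regression(y, X, nu, n_iter=500, tol=1e-10):
    """EM for y_i = x_i' beta + e_i with e_i ~ t(0, sigma^2, nu), nu known.

    E-step: weights w_i = (nu + 1) / (nu + d_i), d_i = e_i^2 / sigma^2.
    M-step: weighted least squares for beta; weighted mean square for sigma^2.
    """
    n, _ = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]     # least-squares start
    resid = y - X @ beta
    sigma2 = resid @ resid / n
    for _ in range(n_iter):
        w = (nu + 1.0) / (nu + resid**2 / sigma2)   # E-step weights
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, WX.T @ y)  # M-step: weighted LS
        resid = y - X @ beta
        sigma2_new = (w * resid**2).sum() / n       # M-step for sigma^2
        if abs(sigma2_new - sigma2) < tol:
            sigma2 = sigma2_new
            break
        sigma2 = sigma2_new
    return beta, sigma2
```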


In this paper, we study the convergence of ECME from the perspective of standard optimisation theory. We show that the particular implementation of ECME for multiple regression with t-distributed errors proposed by Liu & Rubin (1995) may be viewed as a CM-type algorithm. This characterisation offers insight into the acceleration mechanism of the ECME algorithm, and relates the algorithm to other implementations of ECME as well as to other commonly applied optimisation tools, such as Fisher scoring and Newton's method, for the present computing problem. Further, this approach makes it feasible to establish asymptotic convergence rates for ECME in this case.

2. MULTIPLE REGRESSION WITH t-DISTRIBUTED ERRORS

Consider the following multiple linear regression model, with independent errors from a $t$-distribution:
$$y_i = x_i^{\mathrm{T}}\beta + e_i, \qquad e_i \sim t(0, \sigma^2, \nu) \quad (i = 1, \ldots, n).$$
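As an aside, the hierarchical representation mentioned in § 1 also gives a direct recipe for simulating from this model: conditionally on a chi-squared variate $u_i$ with $\nu$ degrees of freedom, the error $e_i$ is normal with variance $\sigma^2\nu/u_i$. A minimal Python sketch, with illustrative parameter values of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
n, nu, sigma = 200, 4.0, 1.5
X = rng.normal(size=(n, 3))
beta = np.array([1.0, -2.0, 0.5])

u = rng.chisquare(nu, size=n)                   # latent chi-squared variates
e = rng.normal(scale=sigma * np.sqrt(nu / u))   # e_i | u_i ~ N(0, sigma^2 nu / u_i)
y = X @ beta + e                                # marginally, e_i / sigma ~ t_nu
```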

THEOREM 1. Let $M(\xi) = \xi + C(\xi)g(\xi)$, where $C(\xi)$ is an $m \times m$ matrix with continuous components and $M(\xi): \Omega \to \Omega$ is differentiable. Then

$$\frac{\partial M(\xi)}{\partial \xi}\bigg|_{\xi = \xi_{\max}} = I_m + C(\xi_{\max})H(\xi_{\max}),$$
where $H(\xi)$ is the Hessian matrix of $l(\xi)$ and $I_m$ is the $m \times m$ identity matrix. The maximiser $\xi_{\max}$ is a fixed point of $M(\xi)$; that is, $M(\xi_{\max}) = \xi_{\max}$.

Proof. It is readily checked that

$$H(\xi) = \frac{\partial}{\partial \xi}g(\xi), \qquad g(\xi) = H(\xi_{\max})(\xi - \xi_{\max}) + o(\|\xi - \xi_{\max}\|).$$
Thus,
$$M(\xi) - \xi_{\max} = \{I_m + C(\xi_{\max})H(\xi_{\max})\}(\xi - \xi_{\max}) + o(\|\xi - \xi_{\max}\|).$$
Now, denote in general a mapping in (11) by $M(\xi)$, where the corresponding argument $\xi$ is either $\beta$ or $\sigma^2$. Then, a local maximiser $\beta_{\max}$ or $\sigma^2_{\max}$ of $l_o(\theta; Y_{\mathrm{obs}})$ conditional on the remaining arguments is a fixed point of an appropriate mapping $M(\xi)$. □

By Theorem 1, if $\partial M(\xi)/\partial \xi = cI_m$, where $0 < c < 1$ is a scalar, the iteration defined by the mapping yields the Newton direction $H^{-1}(\xi)g(\xi)$, but with a modified step-length given by $1 - c$.
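The step-length interpretation is easy to see numerically. In the toy one-dimensional example below (our own illustration, with $l(\xi) = -(\xi - 2)^2$, so that $g(\xi) = -2(\xi - 2)$ and $H = -2$), a mapping whose derivative at the maximiser equals $c$ takes a fraction $1 - c$ of the full Newton step, and the error contracts by the factor $c$ at every iteration:

```python
# Toy example: maximise l(xi) = -(xi - 2)^2 by a step-length Newton iteration.
g = lambda xi: -2.0 * (xi - 2.0)   # gradient of l
H = -2.0                           # Hessian of l (constant here)

c = 0.4                            # derivative of the mapping at the maximiser
xi = 10.0
for k in range(6):
    xi = xi - (1.0 - c) * g(xi) / H   # Newton step scaled by the length 1 - c
    print(k, xi - 2.0)                # error is multiplied by c = 0.4 each pass
```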


The next theorem shows that each of the recursive equations defined in (11) may be viewed asymptotically as a step-length Newton's or Fisher-scoring method, in the sense that the ascent direction given by each equation at the true value of $\theta$ converges to the Newton direction as $n \to \infty$. In addition, the theorem provides explicit expressions for these asymptotic step-lengths, which are an essential factor in comparing the performance of ECME$_1$ and ECME$_2$. Note that the Hessian matrix in this case converges almost surely to the negative of the information matrix; see, e.g., Serfling (1980, pp. 145-8).

THEOREM 2. Define by $\theta_0 = (\beta_0, \sigma_0^2, \nu_0)$ the true value of $\theta$. Then, as $n \to \infty$:
(a) $\partial \psi_n(\beta)/\partial \beta|_{\theta=\theta_0}$ converges in probability to $2(\nu_0 + 3)^{-1} I_p$, provided that $n^{-1} \sum_{i=1}^n x_{ij}^4 = o(n)$ for each $j$;
(b) $\partial \psi_n(\sigma^2)/\partial \sigma^2|_{\theta=\theta_0}$ converges in probability to $3(\nu_0 + 3)^{-1}$;
(c) $\partial \gamma_n(\sigma^2)/\partial \sigma^2|_{\theta=\theta_0}$ converges in probability to $2(\nu_0 + 3)^{-1}$.
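Part (a) can be checked by simulation: holding $\sigma^2$ and $\nu$ at their true values, numerically differentiate the weighted-least-squares mapping for $\beta$ at $\beta_0$. This is our own check, written under the reading that $\psi_n(\beta)$ denotes that mapping; the numerical Jacobian settles near $2(\nu + 3)^{-1} I$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, nu, sigma2 = 100000, 5.0, 1.0
X = rng.normal(size=(n, 2))
beta0 = np.array([1.0, -1.0])
y = X @ beta0 + rng.standard_t(nu, size=n) * np.sqrt(sigma2)

def psi_beta(beta):
    """One weighted-least-squares update of beta, sigma^2 and nu at truth."""
    r = y - X @ beta
    w = (nu + 1.0) / (nu + r**2 / sigma2)
    WX = X * w[:, None]
    return np.linalg.solve(X.T @ WX, WX.T @ y)

h = 1e-5                              # central differences, column by column
J = np.empty((2, 2))
for j in range(2):
    dj = np.zeros(2)
    dj[j] = h
    J[:, j] = (psi_beta(beta0 + dj) - psi_beta(beta0 - dj)) / (2.0 * h)
print(J)                              # close to 2/(nu + 3) * I = 0.25 * I
```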

Thus, in particular, all three quantities converge in probability to their respective limits.

Proof. (a) Using Lemma 1, it is readily verified that, for all $j$ and $k$,
$$\sum_{i=1}^n \operatorname{var}\{w_i(\theta_0)x_{ij}x_{ik}\} = \sum_{i=1}^n x_{ij}^2 x_{ik}^2 \operatorname{var}\{w_i(\theta_0)\} = \frac{2}{\nu_0(\nu_0 + 3)} \sum_{i=1}^n x_{ij}^2 x_{ik}^2 = o(n^2).$$
The assertion then follows from a weak law of large numbers, e.g. Serfling (1980, Theorem C, p. 27).
(b) It follows from Lemma 1 that the corresponding moment bound holds for all $j$. The rest of the proof is similar to (a) above.


(c) The proof is the same as in (b), after using Lemma 1 to establish the corresponding moment bound for all $j$.
(d) Clearly, the $e_i^2\, \partial w_i(\theta_0)/\partial \sigma^2$ are independently and identically distributed, with $E\{e_i^2\, \partial w_i(\theta_0)/\partial \sigma^2\} = 3/(\nu_0 + 3)$. The first assertion then follows from a strong law of large numbers, e.g. Serfling (1980, Theorem B, p. 27). The second and third claims are similarly proved. □
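The expectation in (d) is easy to verify by simulation: with the standard EM weights $w_i = (\nu + 1)/(\nu + \delta_i)$ and $\delta_i = e_i^2/\sigma^2$, one has $e_i^2\, \partial w_i(\theta_0)/\partial \sigma^2 = (\nu + 1)\delta_i^2/(\nu + \delta_i)^2$. The following one-off Monte Carlo check is ours, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
nu, sigma2 = 5.0, 2.0
e = rng.standard_t(nu, size=10**6) * np.sqrt(sigma2)
d = e**2 / sigma2                       # delta_i = e_i^2 / sigma^2

val = (nu + 1.0) * d**2 / (nu + d)**2   # equals e_i^2 * dw_i / dsigma^2
print(val.mean(), 3.0 / (nu + 3.0))     # both approximately 0.375
```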



Proof of Theorem 2. (a) First, rewrite $\psi_n(\beta)$ as
$$\psi_n(\beta) = \Big\{\sum_{i=1}^n w_i(\theta) x_i x_i^{\mathrm{T}}\Big\}^{-1} \Big\{\sum_{i=1}^n w_i(\theta) x_i x_i^{\mathrm{T}}\, \beta_0 + \sum_{i=1}^n w_i(\theta) x_i e_i\Big\}.$$

Taking the derivative with respect to $\beta_j$ and evaluating at $\theta_0$, we obtain
$$\frac{\partial \psi_n(\beta)}{\partial \beta_j}\bigg|_{\theta=\theta_0} = \Big\{\sum_{i=1}^n w_i(\theta_0) x_i x_i^{\mathrm{T}}\Big\}^{-1} \sum_{i=1}^n \frac{\partial w_i(\theta_0)}{\partial \beta_j}\, x_i e_i + o_p(1).$$
The assertion then follows from Lemma 2.
(b) The assertion is a consequence of the first equation in Lemma 2(d).
(c) Taking the derivative of $\gamma_n(\sigma^2)$ with respect to $\sigma^2$ and evaluating at $\theta_0$, the assertion follows from Lemma 2(d). □
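As a numerical companion to (b) and (c), one can differentiate the two $\sigma^2$-mappings at the truth, with $\beta$ and $\nu$ held fixed. In the Python sketch below, psi is the EM-type update $n^{-1}\sum_i w_i e_i^2$; the exact form of $\gamma_n$ is not fully recoverable from this copy, so gam assumes the weighted-average form $\sum_i w_i e_i^2 / \sum_i w_i$, an assumption on our part that is consistent with the limits $3/(\nu + 3)$ and $2/(\nu + 3)$:

```python
import numpy as np

rng = np.random.default_rng(2)
nu, sigma2 = 5.0, 2.0
e = rng.standard_t(nu, size=200000) * np.sqrt(sigma2)

def psi(s2):    # EM-type update of sigma^2: average of w_i e_i^2
    w = (nu + 1.0) / (nu + e**2 / s2)
    return (w * e**2).mean()

def gam(s2):    # assumed alternative update: weighted average of e_i^2
    w = (nu + 1.0) / (nu + e**2 / s2)
    return (w * e**2).sum() / w.sum()

h = 1e-4        # central-difference derivatives at the true sigma^2
for f, lim in [(psi, 3.0 / (nu + 3.0)), (gam, 2.0 / (nu + 3.0))]:
    print((f(sigma2 + h) - f(sigma2 - h)) / (2.0 * h), lim)
```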

APPENDIX

Proof of Theorem 3. (a) Let $I(\theta)$ be the information matrix. Then it is readily shown that $I(\theta_0)$ is block diagonal, with $\beta$ in one block and $\sigma^2$ and $\nu$ in the other; see, e.g., Lange et al. (1989). The assertion thus follows from the identity $V(\theta_0) = I^{-1}(\theta_0)$ and the relationship between $V(\theta_0)$ and $\rho(\theta_0)$.
(b) The block of the inverse of $I(\theta_0)$ corresponding to $\sigma^2$ and $\nu$ is a $2 \times 2$ matrix whose entries are multiples of a common factor $k(\nu_0)$, among them $(\nu_0 + 3)(\nu_0 + 1)k(\nu_0)$ and $2(\nu_0 + 3)k(\nu_0)$. The first assertion follows by evaluating the quantity $\rho_{\sigma^2 \nu}$. For $\nu_0 \to 0$, it follows from the recurrence $\tau_G(\nu) = \nu^{-2} + \tau_G(\nu + 1)$ for the trigamma function $\tau_G$, for example Abramowitz & Stegun (1965, p. 260), that $\tau_{\nu_0} = \nu_0^{-2} + O(\nu_0^{-1})$, and the second claim follows.
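The trigamma recurrence invoked above can be confirmed directly; in SciPy, the trigamma function is available as polygamma(1, ·). A one-line check of our own:

```python
from scipy.special import polygamma   # polygamma(1, x) is the trigamma function

nu = 0.7
print(polygamma(1, nu) - polygamma(1, nu + 1.0), nu**-2)   # both equal 2.0408...
```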


For $\nu \to \infty$, approximating $\tau_G(\nu)$ to the third order, e.g. Abramowitz & Stegun (1965, p. 260), yields
$$(\nu_0 + 3)(\nu_0 + 1)^2 \nu_0^{-1} \gamma_\nu = \big(3 + \tfrac{1}{2}\big) + \big(\tfrac{3}{2} + 3\big)\nu_0^{-1} + O(\nu_0^{-2}),$$
and the third assertion follows.



REFERENCES

ABRAMOWITZ, M. & STEGUN, I. A. (1965). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. New York: Dover.
DEMPSTER, A. P., LAIRD, N. M. & RUBIN, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with Discussion). J. R. Statist. Soc. B 39, 1-38.
DENNIS, J. E. & MEI, H. H. W. (1979). Two new unconstrained optimization algorithms which use function and gradient values. J. Optimiz. Theory Applic. 28, 453-82.
FANG, K. T., KOTZ, S. & NG, K. W. (1990). Symmetric Multivariate and Related Distributions. London: Chapman and Hall.
JAMSHIDIAN, M. & JENNRICH, R. I. (1993). Conjugate gradient acceleration of the EM algorithm. J. Am. Statist. Assoc. 88, 221-8.
LANGE, K. L., LITTLE, R. J. A. & TAYLOR, J. M. G. (1989). Robust statistical modeling using the t distribution. J. Am. Statist. Assoc. 84, 881-96.
LITTLE, R. J. A. (1988). Robust estimation of the mean and covariance matrix from data with missing values. Appl. Statist. 37, 23-38.
LIU, C. H. & RUBIN, D. B. (1994). A simple extension of EM and ECM with faster monotone convergence. Biometrika 81, 633-48.
LIU, C. H. & RUBIN, D. B. (1995). ML estimation of the t distribution using EM and its extensions, ECM and ECME. Statist. Sinica 5, 55-76.
LOUIS, T. A. (1982). Finding the observed information matrix when using the EM algorithm. J. R. Statist. Soc. B 44, 226-33.
MEILIJSON, I. (1989). A fast improvement to the EM algorithm on its own terms. J. R. Statist. Soc. B 51, 127-38.
MENG, X. L. & RUBIN, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika 80, 267-78.
PRESS, W. H., TEUKOLSKY, S. A., VETTERLING, W. T. et al. (1991). Numerical Recipes, 2nd ed. New York: Cambridge University Press.
SEBER, G. A. F. & WILD, C. J. (1989). Nonlinear Regression. New York: Wiley.
SERFLING, R. J. (1980). Approximation Theorems of Mathematical Statistics. New York: Wiley.
ZANGWILL, W. (1969). Nonlinear Programming: A Unified Approach. Englewood Cliffs, NJ: Prentice-Hall.
ZELLNER, A. (1976). Bayesian and non-Bayesian analysis of the regression model with multivariate Student-t error terms. J. Am. Statist. Assoc. 71, 400-5.

[Received August 1995. Revised October 1996]