Statistics & Probability Letters 61 (2003) 373 – 382

Degeneracy in the maximum likelihood estimation of univariate Gaussian mixtures with EM

Christophe Biernacki*, Stéphane Chrétien

Université de Franche-Comté, UMR CNRS 6623, 25030 Besançon Cedex, France

Received August 2001; received in revised form October 2002

Abstract

As is well known, the likelihood of a Gaussian mixture is unbounded at any parameter value that places a Dirac at an observed sample point. The behavior of the EM algorithm near a degenerate solution is studied. It is established that there exists a domain of attraction around degeneracy and that convergence to these particular solutions is extremely fast. This confirms what many practitioners have already noted in their experiments. Some available proposals for avoiding degeneracy are discussed, but the presented convergence results make it possible to defend the pragmatic approach to the degeneracy problem in EM, which consists of random restarts.

Keywords: Degeneracy; Maximum likelihood; EM algorithm; Gaussian mixtures; Speed of convergence

1. Introduction

Because finite mixtures of distributions are an extremely flexible modeling tool, they have received increasing attention over the years, from both practical and theoretical points of view. Fields of application are numerous (biology, social sciences, etc.) since many techniques of data analysis, such as clustering (McLachlan and Basford, 1988), discriminant analysis (McLachlan, 1992) or image analysis (Besag, 1986), make extensive use of these models. Various approaches to estimating mixture distributions are available, including the method of moments and, nowadays, mainly the maximum likelihood (ML) approach (see McLachlan and Peel, 2000 for a survey). The ML estimator (MLE) has to be computed iteratively and the EM algorithm (Dempster et al., 1977; Redner and Walker, 1984) is generally chosen to accomplish the maximization.

* Corresponding author. Tel.: +33-3-81-66-64-65; fax: +33-3-81-66-66-23. E-mail address: [email protected] (C. Biernacki).



In this study, we restrict our attention to the important case of univariate normal mixtures. The mixture density f of K normal densities
\[
\phi(x\,|\,\mu_k,\sigma_k^2) = \frac{1}{\sqrt{2\pi\sigma_k^2}}\exp\!\left(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\right) \tag{1}
\]
of center $\mu_k$ and variance $\sigma_k^2$ ($k = 1,\dots,K$) is given by
\[
f(x\,|\,\theta) = \sum_{k=1}^{K} p_k\,\phi(x\,|\,\mu_k,\sigma_k^2), \tag{2}
\]
with $p_k$ the proportion of the $k$th component ($0 < p_k < 1$ and $\sum_k p_k = 1$) and with $\theta = (p_1,\dots,p_K,\mu_1,\dots,\mu_K,\sigma_1^2,\dots,\sigma_K^2)$ the parameter vector of the mixture. Considering a sample $x_1,\dots,x_n$ as the realized values of $n$ i.i.d. random variables with common density $f$, the MLE is obtained by maximizing over $\theta$ the likelihood function
\[
\ell(\theta) = \prod_{i=1}^{n} f(x_i\,|\,\theta). \tag{3}
\]
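For concreteness, the quantities in (1)-(3) can be evaluated with a few lines of Python. The sketch below is only an illustration of these definitions; the function names and the small sample used at the end are illustrative and do not come from the paper.

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    """Univariate normal density phi(x | mu_k, sigma_k^2) of Eq. (1)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def mixture_pdf(x, p, mu, sigma2):
    """Mixture density f(x | theta) of Eq. (2): sum_k p_k * phi(x | mu_k, sigma_k^2)."""
    return sum(pk * normal_pdf(x, mk, s2k) for pk, mk, s2k in zip(p, mu, sigma2))

def likelihood(sample, p, mu, sigma2):
    """Likelihood l(theta) of Eq. (3): product over the sample of f(x_i | theta)."""
    return float(np.prod([mixture_pdf(xi, p, mu, sigma2) for xi in sample]))

# Arbitrary two-component example (illustrative values only).
sample = [-1.2, 0.3, 0.8, 2.5, 3.1]
print(likelihood(sample, p=[0.5, 0.5], mu=[0.0, 3.0], sigma2=[1.0, 1.0]))
```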

Nevertheless, it is well known that the likelihood function of normal mixture models is not bounded above (Kiefer and Wolfowitz, 1956; Day, 1969). Indeed, if one mixture center coincides with a sample observation and the corresponding variance tends to zero, then the likelihood function increases without bound (note that the likelihood converges to infinity as fast as the inverse of the corresponding standard deviation). As a consequence, the EM algorithm may converge towards such degenerate solutions.

This paper is concerned with the question of degeneracy of the ML approach in normal mixtures. We focus our attention on the behavior of the EM algorithm near a degenerate solution. Our intention is to clarify some facts about the degeneracy problem that are well known to practitioners and EM users. In particular, it is generally acknowledged that, if the mixture parameters are close to a degenerate solution, then EM converges towards it. Moreover, this convergence is extremely fast. These issues are addressed in the next section. In the last section, we discuss some consequences of these results for the practitioner regarding how to manage degeneracy in EM, and we suggest possible extensions of the present work.

2. EM algorithm near degeneracy

In this section, we prove the existence of a domain of attraction leading EM to degeneracy. Then, we show that the speed of convergence to an infinite likelihood is at least exponential. This confirms the experimental evidence that degeneracy of EM is extremely fast. We start with a numerical illustration.

2.1. Numerical example

To illustrate degeneracy, let us consider a sample of 11 points on the real line. The EM algorithm is started at large values of the variances: $\sigma_1^2 = 10000$, $\sigma_2^2 = 1000$, $\mu_1 = 8$, $\mu_2 = 6$, $p_1 = p_2 = 0.5$.
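For readers wishing to reproduce the flavor of this experiment, a basic two-component EM iteration can be coded directly from (1)-(3). The sketch below is not the original experiment: the 11 sample points are not reported in the text, so an arbitrary sample is drawn, and the stopping threshold is an arbitrary stand-in for the machine tolerance; only the starting values are those quoted above. Depending on the drawn sample, the run may converge to a regular solution or degenerate; in the latter case the printed variances show the slow-then-abrupt collapse described below.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 11-point sample; the paper's actual data are not reported in the text.
x = np.sort(rng.normal(loc=5.0, scale=3.0, size=11))

def phi(x, mu, s2):
    return np.exp(-(x - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

# Starting values quoted in the text.
p = np.array([0.5, 0.5])
mu = np.array([8.0, 6.0])
s2 = np.array([10000.0, 1000.0])

for it in range(1, 201):
    # E-step: conditional probabilities t_ik that x_i arises from component k.
    f = p * phi(x[:, None], mu[None, :], s2[None, :])          # shape (n, K)
    t = f / f.sum(axis=1, keepdims=True)
    # M-step: update proportions, centers and variances.
    nk = t.sum(axis=0)
    p = nk / len(x)
    mu = (t * x[:, None]).sum(axis=0) / nk
    s2 = (t * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk
    print(f"iteration {it:3d}: variances = {s2}")
    if s2.min() < 1e-300:   # a variance has collapsed: numerical tolerance reached
        break
```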

[Figure 1 here: nine panels showing the estimated mixture density (density vs. x) at EM iterations 1, 2, 50 and 77 to 82; the centers $\mu_1$ and $\mu_2$ are marked by triangles.]

Fig. 1. Evolution of the mixture estimates vs. EM iterations.

The number of iterations before reaching the numerical accuracy tolerance was 82. The mixture density estimates are displayed in Fig. 1 for iterations 1, 2, 50 and from 77 to 82. The centers $\mu_1$ and $\mu_2$ are represented by triangles in this figure. We can draw several conclusions from this experiment. Firstly, degeneracy can be reached even when starting with large initial variances. Secondly, convergence can be quite slow when far from the degenerate limit, and one can even confuse this slow behavior with convergence to the desired solution. Thirdly, convergence is extremely fast in a small neighborhood of the degenerate limit. The next section provides a proof of this latter fact, which is well known to practitioners.

2.2. Existence of a domain of attraction

2.2.1. Notations

We start with some notations that will be useful to describe the behavior of EM. Let $p_k$, $\mu_k$ and $\sigma_k^2$, respectively, denote the proportion, the center and the variance of the $k$th component at the current iteration of the EM algorithm, and $p_k^{+}$, $\mu_k^{+}$ and $\sigma_k^{2+}$ these quantities at the next iteration. Let $f_{ik} = p_k\,\phi(x_i\,|\,\mu_k,\sigma_k^2)$ and $f_{ik}^{+} = p_k^{+}\,\phi(x_i\,|\,\mu_k^{+},\sigma_k^{2+})$. We assume that, at the current iteration of EM, the component $k_0$ ($1 \le k_0 \le K$) is close to degeneracy at the individual $x_{i_0}$ ($1 \le i_0 \le n$).

Such a situation is equivalent to a high density $f_{i_0k_0}$ of component $k_0$ at $x_{i_0}$ and to small densities $f_{ik_0}$ at the other individuals $x_i$ ($i \neq i_0$) (this equivalence holds with probability one since all individuals are distinct with probability one). In other words, writing $u_0 = [1/f_{i_0k_0}, \{f_{ik_0}\}_{i\neq i_0}]$, its norm $\|u_0\|$ is small. The updated vector will be denoted by $u_0^{+} = [1/f_{i_0k_0}^{+}, \{f_{ik_0}^{+}\}_{i\neq i_0}]$. We also define the sum of densities $\tilde f_{ik_0} = \sum_{k\neq k_0} f_{ik}$ and the conditional probability that $x_i$ arises from the $k$th component, $t_{ik} = f_{ik}/(f_{ik_0} + \tilde f_{ik_0})$. Note that components other than $k_0$ may also be close to degeneracy, but this has no influence on the development below provided we make the two following restrictions. First, these components are not close to a degenerate solution at the same individual $x_{i_0}$. Second, at least one of the $K$ Gaussians is not close to degeneracy.

2.2.2. Taylor expansion of parameters

Assuming that the norm $\|u_0\|$ is small at the current iteration of EM, we express the first-order Taylor expansions of the conditional probabilities $t_{ik_0}$ ($i = 1,\dots,n$). We have
\[
t_{i_0k_0} = \frac{f_{i_0k_0}}{f_{i_0k_0} + \tilde f_{i_0k_0}} = 1 - \frac{\tilde f_{i_0k_0}}{f_{i_0k_0}} + o(\|u_0\|) \tag{4}
\]
and, for $i \neq i_0$,
\[
t_{ik_0} = \frac{f_{ik_0}}{f_{ik_0} + \tilde f_{ik_0}} = \frac{f_{ik_0}}{\tilde f_{ik_0}} + o(\|u_0\|). \tag{5}
\]
We now compute the first-order expansions of the parameters $p_{k_0}^{+}$, $\mu_{k_0}^{+}$ and $\sigma_{k_0}^{2+}$ at the next iteration of EM. We have
\[
p_{k_0}^{+} = \frac{1}{n}\sum_{i=1}^{n} t_{ik_0}
= \frac{1}{n}\left(1 - \frac{\tilde f_{i_0k_0}}{f_{i_0k_0}} + \sum_{i\neq i_0}\frac{f_{ik_0}}{\tilde f_{ik_0}}\right) + o(\|u_0\|). \tag{6}
\]
Note also that $(p_{k_0}^{+} - 1/n)^2 = o(\|u_0\|)$. Then, the Taylor expansion of the center $\mu_{k_0}^{+}$ is given by
\[
\mu_{k_0}^{+} = \frac{\sum_{i=1}^{n} t_{ik_0}\,x_i}{\sum_{i=1}^{n} t_{ik_0}} \tag{7}
\]
\[
= \frac{\left(1 - \tilde f_{i_0k_0}/f_{i_0k_0}\right)x_{i_0} + \sum_{i\neq i_0}\left(f_{ik_0}/\tilde f_{ik_0}\right)x_i + o(\|u_0\|)}
       {1 - \tilde f_{i_0k_0}/f_{i_0k_0} + \sum_{i\neq i_0} f_{ik_0}/\tilde f_{ik_0} + o(\|u_0\|)} \tag{8}
\]
\[
= \left(\left(1 - \frac{\tilde f_{i_0k_0}}{f_{i_0k_0}}\right)x_{i_0} + \sum_{i\neq i_0}\frac{f_{ik_0}}{\tilde f_{ik_0}}\,x_i\right)
  \left(1 + \frac{\tilde f_{i_0k_0}}{f_{i_0k_0}} - \sum_{i\neq i_0}\frac{f_{ik_0}}{\tilde f_{ik_0}}\right) + o(\|u_0\|) \tag{9}
\]
\[
= x_{i_0} + \sum_{i\neq i_0}\frac{f_{ik_0}}{\tilde f_{ik_0}}\,(x_i - x_{i_0}) + o(\|u_0\|) \tag{10}
\]
and so $(\mu_{k_0}^{+} - x_{i_0})^2 = o(\|u_0\|)$. Finally, the Taylor expansion of the variance $\sigma_{k_0}^{2+}$ is expressed by
\[
\sigma_{k_0}^{2+} = \frac{\sum_{i=1}^{n} t_{ik_0}\,(x_i - \mu_{k_0}^{+})^2}{\sum_{i=1}^{n} t_{ik_0}} \tag{11}
\]
\[
= \frac{\left(1 - \tilde f_{i_0k_0}/f_{i_0k_0}\right)\left(\sum_{i\neq i_0}(f_{ik_0}/\tilde f_{ik_0})(x_i - x_{i_0})\right)^2 + o(\|u_0\|)}
       {1 - \tilde f_{i_0k_0}/f_{i_0k_0} + \sum_{i\neq i_0} f_{ik_0}/\tilde f_{ik_0} + o(\|u_0\|)} \tag{12}
\]
\[
\;+\; \frac{\sum_{i\neq i_0}(f_{ik_0}/\tilde f_{ik_0})\left(x_i - x_{i_0} - \sum_{j\neq i_0}(f_{jk_0}/\tilde f_{jk_0})(x_j - x_{i_0})\right)^2 + o(\|u_0\|)}
           {1 - \tilde f_{i_0k_0}/f_{i_0k_0} + \sum_{i\neq i_0} f_{ik_0}/\tilde f_{ik_0} + o(\|u_0\|)}. \tag{13}
\]
Since $\left(\sum_{i\neq i_0}(f_{ik_0}/\tilde f_{ik_0})(x_i - x_{i_0})\right)^2 = o(\|u_0\|)$, this reduces to
\[
\sigma_{k_0}^{2+} = \sum_{i\neq i_0}\frac{f_{ik_0}}{\tilde f_{ik_0}}\,(x_i - x_{i_0})^2
\left(1 + \frac{\tilde f_{i_0k_0}}{f_{i_0k_0}} - \sum_{i\neq i_0}\frac{f_{ik_0}}{\tilde f_{ik_0}}\right) + o(\|u_0\|) \tag{14}
\]
\[
= \sum_{i\neq i_0}\frac{f_{ik_0}}{\tilde f_{ik_0}}\,(x_i - x_{i_0})^2 + o(\|u_0\|). \tag{15}
\]
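Expansions (6), (10) and (15) can be checked numerically: place a component almost exactly on one observation with a small variance, perform one exact EM update, and compare it with the approximations $p_{k_0}^{+}\approx 1/n$, $\mu_{k_0}^{+}\approx x_{i_0}$ and $\sigma_{k_0}^{2+}\approx\sum_{i\neq i_0}(f_{ik_0}/\tilde f_{ik_0})(x_i-x_{i_0})^2$. The script below is only such a sanity check; the sample and parameter values are arbitrary.

```python
import numpy as np

def phi(x, mu, s2):
    return np.exp(-(x - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

# Arbitrary sample; component k0 = 0 is nearly degenerate at x_{i0} = x[0].
x = np.array([0.0, 1.3, 2.1, 3.7, 5.2])
n, i0 = len(x), 0
p = np.array([0.2, 0.8])
mu = np.array([x[i0], 3.0])
s2 = np.array([1e-2, 1.0])

f = p * phi(x[:, None], mu[None, :], s2[None, :])   # f_{ik}
t = f / f.sum(axis=1, keepdims=True)                 # t_{ik}

# Exact M-step update of component k0.
nk = t[:, 0].sum()
p_plus = nk / n
mu_plus = (t[:, 0] * x).sum() / nk
s2_plus = (t[:, 0] * (x - mu_plus) ** 2).sum() / nk

# First-order approximations (6), (10) and (15).
f_tilde = f[:, 1]   # here K = 2, so the sum over k != k0 has a single term
s2_approx = sum(f[i, 0] / f_tilde[i] * (x[i] - x[i0]) ** 2 for i in range(n) if i != i0)

print("p+   exact vs approx:", p_plus, 1.0 / n)
print("mu+  exact vs approx:", mu_plus, x[i0])
print("s2+  exact vs approx:", s2_plus, s2_approx)
```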

2.2.3. Domain of attraction

In the following theorem, we prove that, if $\|u_0\|$ is small enough, then the EM mapping is contracting and, therefore, EM converges and its fixed point is degenerate.

Theorem 1. There exists $\eta > 0$ such that if $\|u_0\| \le \eta$ then $\|u_0^{+}\| = o(\|u_0\|)$ with probability one.

Proof. First, we will show that there exists $\eta > 0$ such that if $\|u_0\| \le \eta$ then $1/f_{i_0k_0}^{+} = o(\|u_0\|)$. Replacing $p_{k_0}^{+}$, $\mu_{k_0}^{+}$ and $\sigma_{k_0}^{2+}$ by their respective Taylor expansions, we obtain
\[
\frac{1}{f_{i_0k_0}^{+}} = \frac{n\sqrt{2\pi\sum_{i\neq i_0}(f_{ik_0}/\tilde f_{ik_0})(x_i - x_{i_0})^2}}
{1 - \tilde f_{i_0k_0}/f_{i_0k_0} + \sum_{i\neq i_0} f_{ik_0}/\tilde f_{ik_0}} + o(\|u_0\|). \tag{16}
\]
But, using Lemma 2, for all $i \neq i_0$, there exists $\eta > 0$ such that if $\|u_0\| \le \eta$ then $f_{ik_0} \le 1/(f_{i_0k_0})^4$. Thus,
\[
\sum_{i\neq i_0}\frac{f_{ik_0}(x_i - x_{i_0})^2}{\tilde f_{ik_0}}
\le \frac{1}{f_{i_0k_0}^{2}}\left(\frac{1}{f_{i_0k_0}^{2}}\sum_{i\neq i_0}\frac{(x_i - x_{i_0})^2}{\tilde f_{ik_0}}\right) \tag{17}
\]
and, consequently, $1/f_{i_0k_0}^{+} = o(\|u_0\|)$.


Second, we will show that, for all $i \neq i_0$, there exists $\eta > 0$ such that if $\|u_0\| \le \eta$ then $f_{ik_0}^{+} = o(\|u_0\|)$ with probability one. We have (with $i \neq i_0$)
\[
f_{ik_0}^{+} \le \frac{1}{\sqrt{2\pi\sigma_{k_0}^{2+}}}\exp\!\left(-\frac{(x_i - \mu_{k_0}^{+})^2}{2\sigma_{k_0}^{2+}}\right). \tag{18}
\]
Using the Taylor expansion of $\mu_{k_0}^{+}$, we obtain
\[
(x_i - \mu_{k_0}^{+})^2 = \left(x_i - x_{i_0} - \sum_{j\neq i_0}\frac{f_{jk_0}}{\tilde f_{jk_0}}\,(x_j - x_{i_0}) + o(\|u_0\|)\right)^{2} \tag{19}
\]
\[
= (x_i - x_{i_0})^2 - 2(x_i - x_{i_0})\sum_{j\neq i_0}\frac{f_{jk_0}}{\tilde f_{jk_0}}\,(x_j - x_{i_0}) + o(\|u_0\|) \tag{20}
\]
\[
\ge (x_i - x_{i_0})^2 - 2|x_i - x_{i_0}|\sum_{j\neq i_0}\frac{f_{jk_0}}{\tilde f_{jk_0}}\,|x_j - x_{i_0}| + o(\|u_0\|). \tag{21}
\]
So, if $\eta$ is small enough, $(x_i - \mu_{k_0}^{+})^2 \ge (x_i - x_{i_0})^2/2$ and, consequently,
\[
f_{ik_0}^{+} \le \frac{1}{\sqrt{2\pi\sigma_{k_0}^{2+}}}\exp\!\left(-\frac{(x_i - x_{i_0})^2}{4\sigma_{k_0}^{2+}}\right) + o(\|u_0\|). \tag{22}
\]
Replacing now $\sigma_{k_0}^{2+}$ by its Taylor expansion, we have
\[
f_{ik_0}^{+} \le \frac{1}{\sqrt{2\pi\sum_{i'\neq i_0}(f_{i'k_0}/\tilde f_{i'k_0})(x_{i'} - x_{i_0})^2}}
\exp\!\left(-\frac{(x_i - x_{i_0})^2}{4\sum_{i'\neq i_0}(f_{i'k_0}/\tilde f_{i'k_0})(x_{i'} - x_{i_0})^2}\right) + o(\|u_0\|). \tag{23}
\]

Since all individuals $x_i$ are distinct with probability one, we conclude that there exists $\eta > 0$ such that if $\|u_0\| \le \eta$ then $f_{ik_0}^{+} \le \left(\sum_{i\neq i_0}(f_{ik_0}/\tilde f_{ik_0})(x_i - x_{i_0})^2\right)^{2} + o(\|u_0\|)$, with probability one, because $(1/x^{a})\exp(-b/x) \to 0$ when $x \to 0$ for $b > 0$ and all $a$.

2.3. Speed towards degeneracy

Here, we establish that the variance of a degenerate component tends to zero at an exponential speed. Since the likelihood tends to infinity as fast as the inverse of the standard deviation, divergence of the likelihood is exponential too.

Theorem 2. There exist $\eta > 0$, $\alpha > 0$ and $\beta > 0$ such that if $\|u_0\| \le \eta$ then, with probability one,
\[
\sigma_{k_0}^{2+} \le \alpha\,\frac{\exp(-\beta/\sigma_{k_0}^{2})}{\sigma_{k_0}^{2}}. \tag{24}
\]
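Before giving the proof, note that bound (24) can be iterated numerically to see how quickly an already small variance is driven below double precision; the constants used below are arbitrary illustrative values, not quantities derived from any data set.

```python
import math

# Iterate the bound (24): v <- alpha * exp(-beta / v) / v, with illustrative constants.
alpha, beta = 1.0, 1.0   # hypothetical values of the constants appearing in (24)
v = 0.1                  # a moderately small current variance
for it in range(1, 6):
    v = alpha * math.exp(-beta / v) / v
    print(f"iteration {it}: variance bound = {v:.3e}")
    if v == 0.0:         # underflow of double precision: numerical tolerance reached
        break
```

Starting from v = 0.1 with these constants, the bound underflows at the second iteration, in line with the remark in the Discussion that one or two iterations usually suffice once degeneracy has set in.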


Proof. By the Taylor expansion of $\sigma_{k_0}^{2+}$, we have
\[
\sigma_{k_0}^{2+} = \sum_{i\neq i_0}\frac{f_{ik_0}}{\tilde f_{ik_0}}\,(x_i - x_{i_0})^2 + o(\|u_0\|) \tag{25}
\]
\[
= \sum_{i\neq i_0}\frac{p_{k_0}(x_i - x_{i_0})^2}{\sqrt{2\pi\sigma_{k_0}^{2}}\,\tilde f_{ik_0}}
\exp\!\left(-\frac{(x_i - \mu_{k_0})^2}{2\sigma_{k_0}^{2}}\right) + o(\|u_0\|) \tag{26}
\]
\[
\le \sum_{i\neq i_0}\frac{(x_i - x_{i_0})^2}{\sigma_{k_0}^{2}\,\tilde f_{ik_0}}
\exp\!\left(-\frac{(x_i - \mu_{k_0})^2}{2\sigma_{k_0}^{2}}\right) + o(\|u_0\|), \tag{27}
\]
since $p_{k_0}/\sqrt{2\pi} \le 1$ and, by Lemma 1, $\sigma_{k_0} \le \eta \le 1$. Using Lemma 1, if $\|u_0\|$ is small enough, we have $(x_{i_0} - \mu_{k_0})^2 \le -\eta^2\ln\eta^2$, and, so, we can choose $\eta$ such that, for all $\|u_0\| \le \eta$ and $i \neq i_0$, we have $(x_i - \mu_{k_0})^2 \ge \min_{i\neq i_0}(x_i - x_{i_0})^2/2$. Thus,
\[
\sigma_{k_0}^{2+} \le \frac{\exp\!\left(-\min_{i\neq i_0}(x_i - x_{i_0})^2/(4\sigma_{k_0}^{2})\right)}{\sigma_{k_0}^{2}}
\sum_{i\neq i_0}\frac{(x_i - x_{i_0})^2}{\tilde f_{ik_0}} + o(\|u_0\|). \tag{28}
\]
Since we assumed that not all components go to degeneracy, there exists a value $c > 0$, independent of the EM iterations, such that $\tilde f_{ik_0} \ge c$. Noting that the two quantities
\[
\alpha = \sum_{i\neq i_0}\frac{(x_i - x_{i_0})^2}{c}
\quad\text{and}\quad
\beta = \frac{\min_{i\neq i_0}(x_i - x_{i_0})^2}{4} \tag{29}
\]
are strictly positive with probability one (because, for all $i \neq i_0$, $x_i \neq x_{i_0}$ with probability one), the theorem is proved.

3. Discussion

In this last section, we discuss some consequences of Theorems 1 and 2 on how to avoid degeneracy and then describe some prospects of this work.

Avoiding degeneracy is handled in several ways, mainly by transforming or constraining the likelihood objective function. A natural approach consists of constraining the variances to be greater than an a priori value $\epsilon > 0$: $\forall k = 1,\dots,K$, $\sigma_k^2 \ge \epsilon$. The likelihood function is then bounded. Another way is to impose relative constraints between the variances of the type $\min_k \sigma_k^2 \ge \lambda\,\max_k \sigma_k^2$, with $0 < \lambda \le 1$ a given constant. This scheme was first proposed by Hathaway (1985). In the Bayesian framework, the likelihood maximization problem is replaced by a penalized likelihood incorporating a prior density $\pi(\theta)$ on the mixture parameter. In all cases, small values of the variances are heavily penalized, i.e. one chooses the prior $\pi$ such that the limit of $\ln\ell(\theta) + \ln\pi(\theta)$ is $-\infty$ when any variance tends towards zero (see Ciuperca et al., 2000).

All these proposals need some prior information which changes the initial likelihood problem and, so, may affect the estimation of $\theta$. In the Bayesian case, Ciuperca et al. (2000) proved consistency of the estimator. Nevertheless, for moderate sample sizes, the prior density may strongly affect the obtained estimate. In the approach where a minimum variance is defined, the resulting estimate obviously depends on the chosen bound. Indeed, instead of degenerating, EM converges to the a priori variance bound, which has no more concrete meaning than the degenerate point itself (Biernacki and Chrétien, 2001). In the relative-constraints situation, imposing equal variances for all components ($\sigma_1^2 = \cdots = \sigma_K^2$) is an interesting particular case obtained when $\lambda = 1$. Indeed, homoscedasticity may be natural prior knowledge in some cases. For instance, in biology, homoscedasticity is often verified when biometrical features of both males and females are concerned. Nevertheless, when no a priori information is available, this last option remains a subjective way to avoid degeneracy.

In fact, the EM practitioner is generally satisfied to restart the algorithm when the numerical tolerance of the computer is reached. The results of the previous section tend to justify this pragmatic behavior. Indeed, we confirmed what users have already noted in experiments: when EM is close to degeneracy, the EM mapping is contracting and the iterates reach numerical tolerance extremely quickly since convergence is exponential. Usually one or two iterations are sufficient if the value of the constant $c$ in (29) is not too small. Moreover, one can hardly test for degeneracy before reaching the computer limit for such fast converging processes.

Computer experiments seem to confirm that our conclusions equally hold in the multidimensional case and in the situation where two or more components degenerate at the same individual. Moreover, our main results seem to apply to a larger class of methods, such as gradient or Newton searches for instance. These problems do not have a mathematical answer yet and should be addressed in future work.
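The pragmatic strategy defended above, namely restarting EM from a new initial point whenever the numerical tolerance is reached, can be sketched as follows. This is only an illustration: the re-initialization rule (centers drawn among the observations, common variance set to the sample variance), the tolerance and the number of restarts are arbitrary choices, not prescriptions from the paper.

```python
import numpy as np

def phi(x, mu, s2):
    return np.exp(-(x - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

def em(x, p, mu, s2, max_iter=500, tol=1e-12):
    """Plain EM; returns the parameters and a flag telling whether a variance collapsed."""
    for _ in range(max_iter):
        f = p * phi(x[:, None], mu[None, :], s2[None, :])
        t = f / f.sum(axis=1, keepdims=True)
        nk = t.sum(axis=0)
        p, mu = nk / len(x), (t * x[:, None]).sum(axis=0) / nk
        s2 = (t * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk
        if s2.min() < tol:          # degeneracy detected at the numerical tolerance
            return p, mu, s2, True
    return p, mu, s2, False

def em_with_restarts(x, K=2, n_restarts=20, seed=0):
    """Restart EM from a fresh random initialization whenever it degenerates."""
    rng = np.random.default_rng(seed)
    for _ in range(n_restarts):
        p0 = np.full(K, 1.0 / K)
        mu0 = rng.choice(x, size=K, replace=False)   # arbitrary re-initialization rule
        s20 = np.full(K, x.var())
        p, mu, s2, degenerate = em(x, p0, mu0, s20)
        if not degenerate:
            return p, mu, s2
    raise RuntimeError("all restarts ended in degeneracy")
```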

Appendix

In this appendix, two lemmas used in the theorems above are proved.

Lemma 1. There exists $\eta > 0$ such that if $\|u_0\| \le \eta$ then
\[
\sigma_{k_0}^{2} \le \eta^2 \quad\text{and}\quad (x_{i_0} - \mu_{k_0})^2 \le -\eta^2\ln\eta^2. \tag{A.1}
\]

Proof. First, since $\sqrt{2\pi}/p_{k_0} \ge 1$ and $\exp\!\left((x_{i_0} - \mu_{k_0})^2/(2\sigma_{k_0}^2)\right) \ge 1$, we obtain the following inequality on the variance $\sigma_{k_0}^2$:
\[
\|u_0\| \le \eta \;\Rightarrow\; \frac{1}{f_{i_0k_0}} \le \eta
\;\Rightarrow\; \frac{\sqrt{2\pi\sigma_{k_0}^2}}{p_{k_0}}\exp\!\left(\frac{(x_{i_0} - \mu_{k_0})^2}{2\sigma_{k_0}^2}\right) \le \eta
\;\Rightarrow\; \sigma_{k_0}^2 \le \eta^2. \tag{A.2}
\]
Next, if $\eta < 1$ and $\sigma_{k_0}^2 < 1$, we deduce the following inequality on $(x_{i_0} - \mu_{k_0})^2$:
\[
\|u_0\| \le \eta \;\Rightarrow\; \frac{1}{f_{i_0k_0}} \le \eta
\;\Rightarrow\; \sigma_{k_0}\exp\!\left(\frac{(x_{i_0} - \mu_{k_0})^2}{2\sigma_{k_0}^2}\right) \le \eta \tag{A.3}
\]
\[
\;\Rightarrow\; (x_{i_0} - \mu_{k_0})^2 \le \sigma_{k_0}^2\ln\eta^2 - \sigma_{k_0}^2\ln\sigma_{k_0}^2
\;\Rightarrow\; (x_{i_0} - \mu_{k_0})^2 \le -\sigma_{k_0}^2\ln\sigma_{k_0}^2. \tag{A.4}
\]
Because $-x\ln x$ is an increasing function of $x$ for $x \le 1/e$, we have, for $\eta \le 1/\sqrt{e}$, $-\sigma_{k_0}^2\ln\sigma_{k_0}^2 \le -\eta^2\ln\eta^2$ and, so, $(x_{i_0} - \mu_{k_0})^2 \le -\eta^2\ln\eta^2$.

Lemma 2. For all $m \in \mathbb{N}$ and for all $i \neq i_0$, there exists $\eta > 0$ such that if $\|u_0\| \le \eta$ then
\[
f_{ik_0} \le \frac{1}{(f_{i_0k_0})^m} \quad\text{with probability one.} \tag{A.5}
\]

Proof. Equivalently, we will prove that $f_{ik_0}(f_{i_0k_0})^m \le 1$. Since $p_{k_0}^{m+1}/(2\pi)^{(m+1)/2} \le 1$ and, for all $m \in \mathbb{N}$, $\exp\!\left(-m\,(x_{i_0} - \mu_{k_0})^2/(2\sigma_{k_0}^2)\right) \le 1$, we have
\[
f_{ik_0}(f_{i_0k_0})^m = \frac{p_{k_0}^{m+1}}{(2\pi)^{(m+1)/2}\sqrt{\sigma_{k_0}^{2(m+1)}}}
\exp\!\left(-\frac{(x_i - \mu_{k_0})^2}{2\sigma_{k_0}^2}\right)
\exp\!\left(-m\,\frac{(x_{i_0} - \mu_{k_0})^2}{2\sigma_{k_0}^2}\right) \tag{A.6}
\]
\[
\le \frac{1}{\sqrt{\sigma_{k_0}^{2(m+1)}}}\exp\!\left(-\frac{(x_i - \mu_{k_0})^2}{2\sigma_{k_0}^2}\right). \tag{A.7}
\]
Decomposing $(x_i - \mu_{k_0})^2$ gives
\[
(x_i - \mu_{k_0})^2 = (x_i - x_{i_0})^2 + (x_{i_0} - \mu_{k_0})^2 + 2(x_i - x_{i_0})(x_{i_0} - \mu_{k_0}) \tag{A.8}
\]
\[
\ge \frac{(x_i - x_{i_0})^2}{2} + \frac{(x_i - x_{i_0})^2}{2} - 2|x_i - x_{i_0}|\,|x_{i_0} - \mu_{k_0}|. \tag{A.9}
\]
But we proved in Lemma 1 that there exists $\eta > 0$ such that, if $\|u_0\| \le \eta$, then $|x_{i_0} - \mu_{k_0}| \le \sqrt{-\eta^2\ln\eta^2}$. Therefore, for $\eta$ small enough, we have $(x_i - x_{i_0})^2/2 - 2|x_i - x_{i_0}|\,|x_{i_0} - \mu_{k_0}| \ge 0$ and, thus, $(x_i - \mu_{k_0})^2 \ge (x_i - x_{i_0})^2/2$. Consequently, we obtain
\[
f_{ik_0}(f_{i_0k_0})^m \le \frac{1}{\sqrt{\sigma_{k_0}^{2(m+1)}}}\exp\!\left(-\frac{(x_i - x_{i_0})^2}{4\sigma_{k_0}^2}\right). \tag{A.10}
\]
The function $(1/\sqrt{x^{m+1}})\exp(-c/x)$ with $c > 0$ (recall that $x_i \neq x_{i_0}$ with probability one) and $m \in \mathbb{N}$ is an increasing function of $x$ when $0 < x < 2c/(m+1)$. Take $\eta$ such that $\eta^2 < 2c/(m+1)$ and $\sigma_{k_0}^2 \le \eta^2$; such an $\eta$ exists due to Lemma 1. Thus,
\[
\sigma_{k_0}^2 \le \eta^2 \;\Rightarrow\;
\frac{1}{\sqrt{\sigma_{k_0}^{2(m+1)}}}\exp\!\left(-\frac{(x_i - x_{i_0})^2}{4\sigma_{k_0}^2}\right)
\le \frac{1}{\sqrt{\eta^{2(m+1)}}}\exp\!\left(-\frac{(x_i - x_{i_0})^2}{4\eta^2}\right). \tag{A.11}
\]
Noting that $\lim_{x\to 0}(1/\sqrt{x^{m+1}})\exp(-c/x) = 0$, for $\eta$ sufficiently small, $f_{ik_0}(f_{i_0k_0})^m \le 1$.


References

Besag, J., 1986. On the statistical analysis of dirty pictures (with discussion). J. Roy. Statist. Soc. B 48, 259–302.
Biernacki, C., Chrétien, S., 2001. Degeneracy in the likelihood approach to univariate Gaussian mixtures estimation with EM. 10th International Symposium on Applied Stochastic Models and Data Analysis, June 12–15, Compiègne, France, pp. 206–212.
Ciuperca, G., Ridolfi, A., Idier, J., 2000. Penalized maximum likelihood estimator for normal mixtures. Research Report 2000-70, Mathematics Department, Université Paris-Sud, Orsay.
Day, N.E., 1969. Estimating the components of a mixture of normal distributions. Biometrika 56, 463–474.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. B 39, 1–38.
Hathaway, R.J., 1985. A constrained formulation of maximum-likelihood estimation for normal mixture distributions. Ann. Statist. 13, 795–800.
Kiefer, J., Wolfowitz, J., 1956. Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann. Math. Statist. 27, 887–906.
McLachlan, G.J., 1992. Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York.
McLachlan, G.J., Basford, K.E., 1988. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.
McLachlan, G.J., Peel, D., 2000. Finite Mixture Models. Wiley, New York.
Redner, R.A., Walker, H.F., 1984. Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev. 26 (2), 195–239.