IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. PAMI-2, NO. 1, JANUARY 1980


A Convergence Theorem for the Fuzzy ISODATA Clustering Algorithms

JAMES C. BEZDEK

Abstract-In this paper the convergence of a class of clustering procedures, popularly known as the fuzzy ISODATA algorithms, is established. The theory of Zangwill is used to prove that arbitrary sequences generated by these (Picard iteration) procedures always terminate at a local minimum or, at worst, always contain a subsequence which converges to a local minimum of the generalized least-squares objective functional which defines the problem.

Index Terms-Cluster analysis, convergence of fuzzy ISODATA, fuzzy sets, generalized least squares, iterative optimization.

I. INTRODUCTION

IN 1973 Dunn [1] defined the first fuzzy generalization of the conventional minimum-variance partitioning problem, and derived necessary conditions for minimizing the functional J_2(U, v) defined below. He used these conditions to develop a Picard iteration scheme for iterative optimization of J_2 and called it fuzzy ISODATA, in deference to its relation to the hard ISODATA process of Ball and Hall [2] (hard "c-means" or "basic ISODATA" more accurately describes its historical predecessor, as it contains none of the decision-oriented embellishments of Ball and Hall). Numerical examples given in [1] suggested empirically that the algorithm was at least locally convergent, but no proof of convergence was formulated therein. A generalization of J_2(U, v) to an infinite family of objective functions-{J_m(U, v): 1 ≤ m < ∞} defined below-also appeared in 1973 [3]; and for m > 1, a similar algorithm for iterative optimization of J_m was formulated. Numerical experiments with real data in various applications have subsequently established the usefulness of these fuzzy ISODATA partitioning algorithms [4]-[8], and no difficulty has ever been reported concerning the attainment (computationally) of convergence. Nevertheless, their theoretical convergence properties have remained an open question. In this paper we formulate a convergence theorem using the method of Zangwill, which applies to the fuzzy ISODATA algorithms for every m > 1. In Section II we fix notation and establish the problem. Section III briefly reviews the form of Zangwill's theorem used in the sequel. Section IV contains the main result: fuzzy ISODATA terminates at a local minimum of J_m; or at worst, always generates a sequence containing a subsequence convergent to a local minimum. In Section V we observe that the proof cannot be extended directly to the conventional c-means algorithm for the iterative minimization of J_1; and Section VI concludes with a short discussion of how the present theory might extend to more general classes of fuzzy clustering algorithms.

II. THE FUZZY ISODATA ALGORITHMS

Let X = {x_1, x_2, ..., x_n} ⊂ R^s be a finite data set in feature space R^s; let c be an integer, 2 ≤ c < n; and let V_cn denote the vector space of all real (c × n) matrices over R, equipped with the usual scalar multiplication and vector addition. A conventional ("hard," or nonfuzzy) c-partition of X is conveniently represented by a matrix U = [u_ik] ∈ V_cn, the entries of which satisfy

u_ik ∈ {0, 1};  1 ≤ i ≤ c,  1 ≤ k ≤ n,   (1a)

Σ_{i=1}^{c} u_ik = 1;  1 ≤ k ≤ n,   (1b)

0 < Σ_{k=1}^{n} u_ik < n;  1 ≤ i ≤ c.   (1c)
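Conditions (1a)-(1c) are mechanical to verify. The sketch below (NumPy; the helper name `is_hard_partition` is ours, not the paper's) checks a candidate matrix against all three: 0/1 entries, unit column sums, and rows that are neither empty nor all of X.

```python
import numpy as np

# a hard 3-partition of n = 6 points: rows index clusters, columns index points
U = np.array([[1, 1, 0, 0, 0, 0],
              [0, 0, 1, 1, 0, 0],
              [0, 0, 0, 0, 1, 1]])

def is_hard_partition(U):
    """Check (1a)-(1c) for a candidate hard c-partition matrix U (c x n)."""
    a = np.isin(U, (0, 1)).all()                  # (1a): entries are 0 or 1
    b = (U.sum(axis=0) == 1).all()                # (1b): each point in exactly one cluster
    rows = U.sum(axis=1)
    c = ((rows > 0) & (rows < U.shape[1])).all()  # (1c): no empty or exhaustive cluster
    return bool(a and b and c)
```

A matrix that assigns some point to two clusters fails (1b) and is rejected.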

J_1(U, v) = Σ_{k=1}^{n} Σ_{i=1}^{c} (u_ik) ‖x_k − v_i‖²_E.   (4)
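As a concrete reading of (4), the sketch below (NumPy; the function name `j1` is ours) evaluates J_1 for a hard partition U and prototypes v under the Euclidean norm.

```python
import numpy as np

def j1(X, U, V):
    """WGSS objective (4): total squared Euclidean error of representing the
    c clusters coded by the hard partition U with the prototypes V.
    X: (n, s) data; U: (c, n) partition matrix; V: (c, s) prototypes."""
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)  # d2[i, k] = ||x_k - v_i||^2
    return float((U * d2).sum())
```

Assigning each point to its nearest prototype minimizes (4) for fixed V; any other assignment pays the corresponding squared error.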

J_1 is the classical within-group sum of squared errors (WGSS) objective function. Its value is a measure of the total squared error (in the Euclidean sense) incurred by the representation of the c clusters defined by U ∈ M_c by the c prototypes {v_i}. The use of ‖·‖_E renders J_1 a measure of the total within-group scatter, or variance of the {x_k} from the {v_i} in the statistical sense of Wilks [9]. For these reasons, J_1 is a popular criterion for identifying optimal (defined as minima of J_1) pairs (U*, v*), where U* is assumed to be a clustering of X which exhibits inherent data substructure. J_1 performs well for certain kinds of data, but its failure to detect, e.g., linear substructure is well documented [1], [3]-[5], [9]. Nonetheless, it is an often-used clustering criterion; one of the most popular methods for approximating local minima of J_1 is iterative optimization, using the necessary conditions

u*_ik = { 1,  d*_ik = min_j {d*_jk}
        { 0,  otherwise,   (5a)

v*_i = Σ_{k=1}^{n} u*_ik x_k / Σ_{k=1}^{n} u*_ik;  1 ≤ i ≤ c,   (5b)

where d*_ik = ‖x_k − v*_i‖², and ‖·‖ is again any inner product induced norm on R^s. Condition (5a) presumes that the minimum it names is attained by exactly one index i for each k; if this assumption fails, U* ∈ M_c is nonunique [1]. The necessity of (5b) follows easily by differentiation, that of (5a) by one of several arguments (cf. [1]). The hard c-means algorithm is, loosely speaking, iteration through (5) by an operator T_1: M_c → M_c.

It is shown in [3], under the assumption that d_ik = ‖x_k − v*_i‖² > 0 ∀i, k, that (U*, v*) might be a local minimum of J_m only if, for any m > 1, there holds

u*_ik = 1 / Σ_{j=1}^{c} (d*_ik / d*_jk)^{1/(m−1)};  1 ≤ i ≤ c,  1 ≤ k ≤ n,   (12a)

v*_i = Σ_{k=1}^{n} (u*_ik)^m x_k / Σ_{k=1}^{n} (u*_ik)^m;  1 ≤ i ≤ c.   (12b)

The fuzzy ISODATA algorithm for m > 1 is, loosely speaking, Picard iteration through (12a) and (12b).

We assume throughout that m > 1, that the cn distances d_ik = ‖x_k − v_i‖² are always positive, and that the norm in (11) is any inner product induced norm. First, we establish that J_m serves as its own descent functional. It was shown in [3] that J_m descends weakly on the iterates generated by fuzzy ISODATA, i.e., that J_m(𝒯_m(U, v)) ≤ J_m(U, v) for every iterate (U, v).
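The Picard loop through (12a) and (12b) is short in code. The following sketch (NumPy; the names `fuzzy_isodata` and `jm` are ours, not the paper's) takes d_ik = ‖x_k − v_i‖², clamps the distances away from zero to honor the d_ik > 0 assumption, and records J_m after each sweep so the weak descent just noted can be observed numerically.

```python
import numpy as np

def jm(X, U, V, m):
    """Objective J_m(U, v) = sum_k sum_i (u_ik)^m ||x_k - v_i||^2."""
    d = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    return float(((U ** m) * d).sum())

def fuzzy_isodata(X, c, m=2.0, sweeps=50, seed=0):
    """Picard iteration of T_m: (12b) updates prototypes, (12a) updates
    memberships. Returns U, V and the J_m history across sweeps."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                          # U(0): unit column sums
    history = []
    for _ in range(sweeps):
        V = (U ** m) @ X / (U ** m).sum(axis=1, keepdims=True)   # (12b)
        d = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        d = np.fmax(d, 1e-12)                   # keep d_ik > 0, as the theory assumes
        U = d ** (-1.0 / (m - 1.0))
        U /= U.sum(axis=0)                      # (12a): u_ik = d_ik^{1/(1-m)} / sum_j d_jk^{1/(1-m)}
        history.append(jm(X, U, V, m))
    return U, V, history
```

On generic data the recorded history is nonincreasing (up to roundoff), which is exactly the weak-descent inequality J_m(𝒯_m(U, v)) ≤ J_m(U, v).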

The form of Zangwill's theorem used here (Theorem C) concerns a point-to-point algorithm A with descent functional f and solution set S; under its hypotheses, for each iterative sequence {x_k} generated by A, either the sequence terminates at a point x* ∈ S, or the iterates {x_k : x_0 ∈ D_f} are contained in a compact set K ⊂ D_f, and there exists a subsequence which converges to a point x* ∈ S.

Theorem C and its generalizations can be used to secure convergence proofs for almost all of the classical iterative optimization algorithms, e.g., steepest descent, Newton's method, etc., by using this approach as an alternative to more conventional arguments. Note especially that solution points x* ∈ S are strict local minima of f, while the theorem itself asserts global convergence for the iterates of A, that is, convergence to an x* ∈ S starting from an arbitrary x_0 ∈ D_f. In Section IV we show that Theorem C applies to the fuzzy ISODATA algorithms described above.

Setting the partial derivatives of the Lagrangian L of φ (in the relaxed variables of the substitution u_ik = w²_ik, with multipliers α_p for the n constraints (7b)) equal to zero yields

a) Σ_{i=1}^{c} (w*_ip)² = 1;  1 ≤ p ≤ n,

b) ∂L(W*, α*)/∂w_ip = 0;  1 ≤ i ≤ c,  1 ≤ p ≤ n.

From b) we find that either w*_ip = 0, or

c) (u*_ip)^{m−1} d_ip = α_p;  1 ≤ i ≤ c,  1 ≤ p ≤ n.   (20c)

Summing c) over i and applying a) yields

d) (α_p)^{1/(m−1)} = (Σ_{j=1}^{c} (d_jp)^{1/(1−m)})^{−1},   (20d)

or

α_p = (Σ_{j=1}^{c} (d_jp)^{1/(1−m)})^{1−m};  1 ≤ p ≤ n.   (20e)

Together, c) and (20e) recover (12a);

e) Xp=4m(m- l)ap; lp 1 and dip > 0 vj and p, that Xp > 0 Vp, so H, (U*) is positive definite at U*, not only on the tangent subspace defined by equality constraints (7b), but over all of Vcn. Thus, (12a) is sufficient, and U* is a strict local minimum of p. Finally, note that O < u * < 1 V i, k in (1 2a), so the constraints (7c) are satisfied by solutions of the relaxed problem, and hence of the original problem as well. Q.E.D. Next, we fix U E Mfc and consider minimization of Jm (U, v) in the variables {vi}. Proposition 2: Let 4: RcS -* R, (v) .Jm (U, v), where UeMfc is fixed. Then v* is a strict local minimum of if and only if vi*, 1 < i < c, is calculated via (12b), v G (U). Proof: Since minimization of over Rcs is an unconstrained problem, the necessity of (1 2b) follows by requiring v iV(v*) to vanish for every i. Equivalently, the directional derivatives 4'(vO,y) of with respect to vi vanish at v* in arbitrary directions y G RlS, y 0. Let t G IR, and define Vi

h_i(t) = ψ(v*_1, ..., v*_i + ty, ..., v*_c) = Σ_{k=1}^{n} (u_ik)^m ‖x_k − (v*_i + ty)‖²,

where ‖z‖² = ⟨z, z⟩ is the inner product on R^s. Then

h_i′(t) = Σ_{k=1}^{n} −2(u_ik)^m ⟨y, x_k − v*_i − ty⟩,

and at t = 0 we see that necessarily

a) h_i′(0) = −2 ⟨y, Σ_{k=1}^{n} (u_ik)^m (x_k − v*_i)⟩ = 0,  ∀ y ∈ R^s, y ≠ 0.

Since a) can be zero for arbitrary y ≠ 0 if and only if its second argument is the zero vector, condition (12b) follows and necessity is established. While sufficiency of (12b) can be established by calculating the eigenvalues of H_ψ(v*), the (cs × cs) Hessian of ψ at v*, it is instructive to give a different proof. Let y = (y_1, y_2, ..., y_c), y_i ∈ R^s ∀i, and t ∈ R. Then define
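Both halves of Proposition 2 can be spot-checked numerically, since ψ is quadratic in v: at v* = G(U) every directional derivative vanishes, and a central second difference along any coordinate of v_i recovers the diagonal Hessian entry 2 Σ_k (u_ik)^m exactly, up to roundoff. A minimal sketch (NumPy; all variable names are ours):

```python
import numpy as np

def psi(X, U, V, m):
    # psi(v) = J_m(U, v) with U fixed
    d = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    return float(((U ** m) * d).sum())

rng = np.random.default_rng(1)
n, s, c, m = 12, 3, 2, 2.0
X = rng.random((n, s))
U = rng.random((c, n))
U /= U.sum(axis=0)
V = (U ** m) @ X / (U ** m).sum(axis=1, keepdims=True)   # candidate v* from (12b)

# necessity: the directional derivative psi'(v*; y) vanishes for arbitrary y
t = 1e-6
Y = rng.standard_normal(V.shape)
deriv = (psi(X, U, V + t * Y, m) - psi(X, U, V - t * Y, m)) / (2 * t)
assert abs(deriv) < 1e-6

# sufficiency: a diagonal Hessian entry for block i equals 2 * sum_k (u_ik)^m
h = 1e-3
for i in range(c):
    E = np.zeros_like(V)
    E[i, 0] = h
    dd = (psi(X, U, V + E, m) - 2 * psi(X, U, V, m) + psi(X, U, V - E, m)) / h**2
    assert np.isclose(dd, 2 * (U[i] ** m).sum(), rtol=1e-4)
```

Because ψ is quadratic, the finite differences incur no truncation error; only floating-point noise remains.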

h(t) = ψ(v* + ty) = Σ_{k=1}^{n} Σ_{i=1}^{c} (u_ik)^m ‖x_k − (v*_i + ty_i)‖².

Then

h″(t) = y^T [H_ψ(v* + ty)] y = Σ_{k=1}^{n} Σ_{i=1}^{c} 2(u_ik)^m ⟨y_i, y_i⟩.

Thus, at t = 0 we find that ∀ y ∈ R^cs,

b) h″(0) = y^T [H_ψ(v*)] y = Σ_{i=1}^{c} ‖y_i‖² (2 Σ_{k=1}^{n} (u_ik)^m).

For y ≠ 0, we have from b) and constraints (7c) that h″(0) > 0, i.e., H_ψ(v*) is positive definite. Thus, (12b) is sufficient, and v* is a strict local minimum of ψ. Q.E.D.

Next, consider minimization of J_m over (M_fc × R^cs) jointly in the variables (U, v). The U variables are of the Kuhn-Tucker variety due to (7), while v is unconstrained. Using the substitution u_ik = w²_ik and relaxation as in Proposition 1, it is easy to verify that (U*, v*) may minimize J_m locally only if (U*, v*) = (F(v*), G(U*)) = ((F ∘ G)(U*), G(U*)) = 𝒯_m(U*, v*), i.e., only if (U*, v*) is a fixed point of 𝒯_m. In other words, (12) are jointly necessary: this follows by setting the gradient of the Lagrangian of J_m equal to zero. Since the multipliers {α_p} for constraints (7b) are not coupled to the {v_i}, exactly the same conditions will follow. The joint sufficiency of (12a) and (12b) is not so easy to establish. If H_φ(U*) and H_ψ(v*) denote the same Hessian matrices as in Propositions 1 and 2, respectively, the Hessian of the Lagrangian L of J_m at (U*, v*) is the (cn + cs) × (cn + cs) matrix with partitioned form

H_L(U*, v*) = [ H_φ(U*)   M
               M^T        H_ψ(v*) ],

where the blocks H_φ(U*), M, M^T, and H_ψ(v*) are (cn × cn), (cn × cs), (cs × cn), and (cs × cs), respectively.

We know from Proposition 1 that H_φ(U*) is a diagonal matrix with n distinct eigenvalues, each of multiplicity c, viz., λ_k = 4m(m−1)(Σ_{i=1}^{c} (d_ik)^{1/(1−m)})^{1−m}; 1 ≤ k ≤ n. Proposition 2 shows that all the eigenvalues of H_ψ(v*) are positive: in fact, this matrix has c distinct eigenvalues, each of multiplicity s, namely μ_i = 2(Σ_{k=1}^{n} (u*_ik)^m); 1 ≤ i ≤ c; and is also diagonal. Unfortunately, the matrix M in H_L(U*, v*) is not the zero matrix; it contains n (c × s) blocks involving entries of the form (w_ik)^{2m−1}(x_k − v_i), and so the structure of H_L(U*, v*) is not immediately obvious. To establish that conditions (12) are jointly sufficient, it would be enough to show that H_L(U*, v*) was positive definite on at least the subspace of M_fc × R^cs tangent to (U*, v*), i.e., that H_L(U*, v*) restricted to the solution space of the linear system defined by applying the Jacobian of the constraints to (U, v) and equating it to zero was positive definite. Since this is an ambitious undertaking, we are content to conjecture here that (12a) and (12b) are jointly sufficient, and defer this question to a future investigation.

We are, however, now in a position to establish that 𝒯_m descends strictly on the iterates of J_m. This is the content of the following.

Theorem 1: Let S = {(U*, v*) : J_m(U*, v*) < J_m(U, v) for every (U, v) ≠ (U*, v*) sufficiently near (U*, v*)}, and suppose (U, v) ∉ S. Then J_m(𝒯_m(U, v)) < J_m(U, v).

Finally, if (U, v) ∈ S, equality prevails throughout in the above argument. Q.E.D.

The second requirement of Theorem C is that the algorithm 𝒯_m be continuous on the domain of J_m with S deleted. 𝒯_m is in fact continuous on all of M_fc × R^cs, as we show in the following.

Theorem 2: 𝒯_m is continuous on (M_fc × R^cs).

Proof: Since 𝒯_m = A_2 ∘ A_1, and the composition of continuous functions is again continuous, it suffices to show that A_1 and A_2 are each continuous. Since A_1(U, v) = G(U), A_1 is continuous if G is. To see that G is continuous in the (cn) variables {u_ik}, note that G is a vector field, with resolution by (cs) scalar fields, say G = (G_11, G_12, ..., G_cs), where G_ij: R^cn → R is defined via (12b) as

G_ij(U) = Σ_{k=1}^{n} (u_ik)^m x_kj / Σ_{k=1}^{n} (u_ik)^m.

Each G_ij is continuous, so G, and in turn A_1, are continuous. Similarly, since d_jk > 0 ∀j and k, F_ij is continuous for all i, j. Therefore F, and in turn A_2, are continuous on their entire domains. Thus, 𝒯_m = A_2 ∘ A_1 is continuous on M_fc × R^cs. Q.E.D.

The final condition needed for Theorem C is compactness of a subset of (M_fc × R^cs) which contains all of the possible iterate sequences generated by 𝒯_m. Note first that an initial guess for such a sequence consists of either U(0), for then v(0) = G(U(0)); or v(0), in which case U(0) = F(v(0)). If U(0) is the initial guess, then the entire Picard sequence {(𝒯_m)^(k)(U(0), v(0))} always lies in a compact set; while if v(0) is the beginning point, all terms except (U(0), v(0)) do. The result of Theorem 3 is valid in either instance.

Theorem 3: Let [conv(X)]^c be the c-fold Cartesian product of the convex hull of X, and let (U(0), G(U(0))) be the starting point of iteration with 𝒯_m, with U(0) ∈ M_fc and v(0) = G(U(0)). Then


(𝒯_m)^(k)(U(0), v(0)) ∈ M_fc × [conv(X)]^c,  k = 1, 2, ...,   (21a)

and

{(𝒯_m)^(k)(U(0), v(0))} ⊂ M_fc × [conv(X)]^c is compact in M_fc × R^cs.   (21b)

Proof: Let U(0) ∈ M_fc be chosen. Then v(0) = G(U(0)) is calculated using (12b), so that

v_i(0) = Σ_{k=1}^{n} (u_ik(0))^m x_k / Σ_{k=1}^{n} (u_ik(0))^m;  1 ≤ i ≤ c.

It then follows, via Theorem C, that each iterate sequence {(𝒯_m)^(k)(U(0), v(0))} either terminates at a local minimum (U*, v*) of J_m, or contains a subsequence {(𝒯_m)^(i_k)(U(0), v(0))} → (U*, v*), a local minimum of J_m, as i_k → ∞.
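The containment in (21a) rests on the observation that (12b) writes each v_i as a convex combination of the data: the weights (u_ik)^m / Σ_k (u_ik)^m are nonnegative and sum to one, so v_i ∈ conv(X). A quick numerical illustration (NumPy; all names are ours), using the bounding box of X as an easily tested consequence of hull membership:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.random((20, 2))                    # data set in R^2
U = rng.random((3, 20))
U /= U.sum(axis=0)                         # U in M_fc: unit column sums
m = 2.0

W = U ** m / (U ** m).sum(axis=1, keepdims=True)   # rows are convex-combination weights
V = W @ X                                          # prototypes via (12b)

assert (W >= 0).all() and np.allclose(W.sum(axis=1), 1.0)
# each v_i lies in conv(X), hence inside the bounding box of X
assert (V >= X.min(axis=0)).all() and (V <= X.max(axis=0)).all()
```

Since every weight here is strictly positive, each prototype is in fact interior to the hull of the data.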