MDL Principle for Robust Vector Quantization

Horst Bischof
Pattern Recog. and Image Proc. Group
Vienna University of Technology
A-1040 Vienna, Austria
[email protected]

Ales Leonardis
Faculty of Computer and Info. Science
University of Ljubljana
SI-1001 Ljubljana, Slovenia
[email protected]

Alexander Selb
Pattern Recog. and Image Proc. Group
Vienna University of Technology
A-1040 Vienna, Austria
[email protected]

This work was supported by a grant from the Austrian National Fonds zur Förderung der wissenschaftlichen Forschung (No. S7002MAT). A. L. acknowledges the support of the Ministry of Science and Technology of the Republic of Slovenia (Projects J2-0414 and J2-8829).


Keywords: Vector Quantization, Clustering, Minimum Description Length, Robustness, Image coding, Color-image segmentation

Abstract

We address the problem of finding the optimal number of reference vectors for vector quantization from the point of view of the Minimum Description Length (MDL) principle. We formulate vector quantization in terms of the MDL principle, and then derive different instantiations of the algorithm, depending on the coding procedure. Moreover, we develop an efficient algorithm (similar to EM-type algorithms) for optimizing the MDL criterion. In addition, we use the MDL principle to increase the robustness of the training algorithm; namely, the MDL principle provides a criterion to decide which data points are outliers. We illustrate our approach on 2D clustering problems (in order to visualize the behavior of the algorithm) and present applications on image coding. Finally, we outline various ways to extend the algorithm.

1 Introduction

Unsupervised learning (clustering) techniques are widely used methods in pattern recognition and neural networks for exploratory data analysis. These methods are often used to understand the spatial structure of the data samples and/or to reduce the computational costs of designing a classifier. There exists a vast number of different methods for unsupervised learning and clustering (see [1] for a recent review on this topic). A common goal of unsupervised learning algorithms is to distribute a certain number of reference (weight) vectors in a possibly high-dimensional space according to some quality criteria. This is also called Vector Quantization (VQ). In this paper we consider the following problem: Given a finite data set $S = \{x_1, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$, where the $x_i$ are independently and identically distributed (iid) according to some probability distribution $p(x)$, find a


set of reference vectors $A = \{c_1, \ldots, c_m\}$, $c_i \in \mathbb{R}^d$, such that a given distortion measure $E(p(x), A)$ is minimized. A typical application is, for example, to compress a set of vectors for transmission purposes. This can be achieved by vector quantization, which minimizes the expected quantization error

$$E(p(x), A) = \sum_{i=1}^{m} \int_{S_i} \|x - c_i\|^2 \, p(x) \, dx, \qquad (1)$$

by positioning the $c_i$, where $S_i = \{x \in \mathbb{R}^d \mid i = \arg\min_{j \in \{1,\ldots,m\}} \|x - c_j\|\}$ is the Voronoi region of a vector $c_i$ (see Fig. 1). In general, the only knowledge that we have about $p(x)$ is the data set $S$. Therefore, we can only minimize

$$E(S, A) = \sum_{i=1}^{m} \sum_{x \in S_i} \|x - c_i\|^2, \qquad (2)$$

where now $S_i = \{x \in S \mid i = \arg\min_{j \in \{1,\ldots,m\}} \|x - c_j\|\}$ (see Fig. 1). Many different algorithms have been proposed to find the reference vectors $c_i$ for the case when their number $m$ is given, e.g., the well-known K-means and Linde-Buzo-Gray (LBG) algorithms [2, 3]. Also, various neural network models have been proposed which can be used for vector quantization, e.g., hard- and soft-competitive learning, the neural gas network, and networks with topological connectivity like Kohonen's self-organizing feature maps (see [1] for a recent review on this topic). All these methods assume that the number of reference vectors is given a priori. However, finding the "right" number of clusters $m$ remains an important open question. Depending on the data distribution, a very different number of clusters may be appropriate. The methods mentioned above require a decision on the number of clusters in advance, and if the result is not satisfying, new simulations have to be performed from scratch. Fig. 2 illustrates what happens if we take a simple cluster structure (5 clusters (30 points) and 2 isolated points) and initialize K-means with the wrong number of centers. The major problem with these approaches is that the final result depends on the initialization, i.e., for different initializations we get different results. Various heuristics have been proposed to deal with the problem of finding the right number of centers; e.g., ISODATA [2] and scale-space clustering [4] are two examples from the pattern recognition literature. Some neural network algorithms have also been extended to include a mechanism for adding reference vectors during training, e.g., growing cell structures, growing neural gas, etc. (see [1] for a review). These methods include a stopping criterion (i.e., when to stop adding units), which is crucial for finding the right


number of reference vectors. Usually, the growing of the network is stopped when a prespecified performance bound is reached, which is difficult to set a priori, especially if the clusters have unequal distributions. In order to find the right number of reference vectors, an interesting approach was recently proposed by Xu [5]. The approach is designed as a particular instantiation of the Ying-Yang machine which can be used for vector quantization. For this network, Xu developed a pruning method, i.e., units are removed from an initially overly complex network. As a pruning criterion he trades off the error versus the number of reference vectors. Buhmann and Kuhnel [6] propose an algorithm to jointly optimize the distortion errors and the codebook complexity, using the maximum entropy estimation technique. Their method involves the optimization of a complex cost function with a regularization parameter. The cost function is minimized using simulated annealing, and the number of reference vectors is estimated using a growing method. Another problem closely related to finding the optimal number of reference vectors is the problem of outliers. Suppose that the training set $S$ contains additional data points not belonging to the distribution $p(x)$ (the so-called outliers). Ideally, one would treat the outliers in such a way that they would not degrade the result of the vector quantization. However, all the algorithms mentioned above cannot cope with this problem. Especially the incremental algorithms tend to add additional reference vectors for the outliers. Since most clustering methods minimize a squared error measure, it is easy to show that a single outlier may arbitrarily change the position of a reference vector. These clustering methods are non-robust, with a breakdown point of 0%.¹ The problem of robust clustering has only recently received some attention (e.g., [9, 10]). In this paper we address the problem of finding the optimal number of reference vectors for vector quantization (also in the presence of outliers) from the point of view of the Minimum Description Length (MDL) principle. MDL, which is closely related to algorithmic complexity [11], was proposed by Rissanen [12, 13] as a criterion for model selection. MDL has also been applied in the area of neural networks, see, e.g., [14, 15, 16, 17, 18]. In most cases, MDL has been used for supervised networks as a penalty term on the error function or as a criterion for network selection. One exception is the work of Zemel and Hinton [19, 14], who applied the MDL principle to auto-associative networks using stochastic binary units. However, the main goal of their work was not to find the

¹ The breakdown point of an estimator is determined by the smallest portion of outliers in the data set at which the estimation procedure can produce an arbitrarily wrong estimate [7, 8].


size of the network, but to determine a suitable encoding of the data. Tenmoto et al. [20] used MDL for the selection of the number of components in Gaussian mixture models; however, they used the approach for supervised learning and they did not consider outliers. Our approach differs from these methods; namely, we use the MDL principle as a pruning criterion in an integral scheme and to identify outliers: We start with an overly complex network, and while training the network, we gradually reduce the number of redundant reference vectors, arriving at a network which balances the error versus the number of reference vectors. By combining the reduction of complexity and the training phase we achieve a computationally efficient procedure. We organize the paper as follows: In the next section we derive the MDL formulation for vector quantization. From this formulation we derive the pruning criterion which is embedded in the training phase, yielding a complete algorithm. Section 3 shows the experimental results; we illustrate our approach on 2D clustering problems and present applications on image coding and color image segmentation. Section 4 presents conclusions and outlines avenues of further research.
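For reference, the following minimal sketch (in Python; the paper itself gives no code, so names and defaults are illustrative) shows the plain K-means baseline that minimizes the empirical distortion of Eq. (2) for a fixed number $m$ of reference vectors. It is only the building block that the MDL procedure of Section 2 starts from.

```python
import numpy as np

def kmeans(X, m, n_iter=100, seed=0):
    """Plain K-means: minimizes the empirical distortion of Eq. (2) for fixed m."""
    rng = np.random.default_rng(seed)
    # Initialize the reference vectors by drawing m samples from the data set S.
    C = X[rng.choice(len(X), size=m, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: index of the Voronoi region S_i of each data point.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each reference vector to the mean of its region.
        C_new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else C[i]
                          for i in range(m)])
        if np.allclose(C_new, C):
            break
        C = C_new
    return C, labels
```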

2 Robust Vector Quantization in an MDL Framework

We approach the problem of vector quantization as the minimization of the length of the description of the training data $S$ ($S$ is, in fact, the only information available to us). A convenient way to derive the MDL formulation is via a communication game. Assume that our goal is to transfer the data $S$ without error to a receiver, and that we have agreed beforehand on a communication protocol. Let us further assume that we have already identified which data vectors are outliers and are therefore not coded by the reference vectors, i.e., $I = S - O$, where $O$ are the outliers and $I$ are the inliers. Using the reference vectors $A$, the length of encoding $S$ is then given by:

1. The length of encoding the reference vectors $A$, denoted by $L(A)$.

2. The length of encoding $I$ using $A$, which can be subdivided into the following two costs:

   (a) the length of encoding the index of $A$ to which the vectors in $I$ have been assigned, denoted by $L(I(A))$, and

   (b) the length of encoding the residual errors, denoted by $L(\epsilon(I(A)))$.

3. The length of encoding the outliers, denoted by $L(O)$.

Therefore, the cost of encoding $S$ using $A$ is given by

$$L(S(A)) = L(A) + L(I(A)) + L(\epsilon(I(A))) + L(O). \qquad (3)$$

Our goal is to minimize $L(S(A))$, i.e., to determine $O$, $m$, and $c_i$, $1 \le i \le m$, such that $L(S(A))$ is minimal. In principle, we could enumerate all partitions, evaluate each partition by the MDL criterion, and choose the one with the minimal description length. Since the number of partitions grows exponentially with the number of data points, this is not feasible even for data sets of moderate size. Therefore, we have to find a more efficient approach. In order to simplify the subsequent derivation and notation we make the following assumptions:

1. All quantities are specified with a finite precision; in particular, we assume that the vectors in the training set and the reference vectors are represented by $K$ bits.

2. The samples in $S$ are independently and identically distributed (iid), which means that $p(x, y) = p(x)p(y)$, $x, y \in S$.

Using these assumptions we can rewrite (3) in the following form:

$$L(S(A)) = mK + L(I(A)) + \sum_{i=1}^{m} \sum_{x \in S_i} L(x - c_i) + |O| \, K. \qquad (4)$$
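As an illustration of Eq. (4), the sketch below adds up the four coding terms for a given assignment. It anticipates the concrete codes of Section 2.1 (fixed-length index code, Gaussian error code); `K` and `sigma` are the parameters of the description, additive constants are dropped since only differences in description length matter, and all names are illustrative.

```python
import numpy as np

def description_length(X, C, inlier_mask, labels, K, sigma):
    """Total coding length L(S(A)) of Eq. (4):
    m*K + L(I(A)) + sum of error code lengths + |O|*K   (all in bits)."""
    m = len(C)
    I = X[inlier_mask]                         # inliers
    n_out = int((~inlier_mask).sum())          # |O|
    # Index term: fixed-length code, log2(m) bits per inlier (Case 1 in Sec. 2.1).
    L_index = len(I) * np.log2(m) if m > 1 else 0.0
    # Error term: Gaussian code with fixed variance sigma^2 (Case 2 in Sec. 2.1),
    # constant terms omitted.
    resid = I - C[labels[inlier_mask]]
    L_error = (resid ** 2).sum() / (2 * np.log(2) * sigma ** 2)
    return m * K + L_index + L_error + n_out * K
```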

Outliers. Let us first consider the outliers $O$: Given a reference vector set $A$, those vectors from $S$ are considered as outliers, according to the MDL principle, which can be encoded with fewer bits directly, i.e., without a reference vector quantizing them. More precisely, a vector is an outlier if encoding it by an index of a reference vector plus encoding the error requires more bits than coding the vector as it is. This can be calculated using Eq. (4), by checking the change in the coding length when a vector $y \in S_i$ is moved from the inliers to the outliers. If we start from Eq. (4) and change a point from an inlier to an outlier, we have the following change in the coding length (in fact, there are two cases, depending on whether the point that is changed from an inlier to an outlier is the only point in a cluster):

1. If the point that is changed from an inlier to an outlier is not the only point in a cluster, then
   - the number of reference vectors does not change;
   - we save an index;
   - we save the error;
   - we have to encode the outlier with $K$ bits.

2. If the point that is changed from an inlier to an outlier is the only point in a cluster, then
   - we save one reference vector ($K$ bits);
   - we save an index;
   - depending on the encoding procedure for the indices, we may save some bits for specifying all other indices, since we have one reference vector less to encode;
   - we do not save the error, since it was 0;
   - we have to encode the outlier with $K$ bits.

Written in a compact form, this results in the following condition for a vector $y \in S_i$ being an outlier:

$$K < L((I - \{y\})(A)) + L(y - c_i) + \mathbb{I}_{|S_i|=1} \, K, \qquad (5)$$

where
$$\mathbb{I}_{|S_i|=1} = \begin{cases} 1 & \text{if } |S_i| = 1, \\ 0 & \text{otherwise,} \end{cases}$$
is the indicator function, indicating that $y$ is the only data point assigned to cluster $i$.
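A simplified, per-point reading of this condition in code: a vector assigned to $c_i$ is declared an outlier when the $K$ bits needed to code it directly are fewer than the index and error bits it costs as an inlier, plus the $K$ bits of its reference vector if it is the only member of its cluster. This is a sketch only, assuming the fixed-length index code and the Gaussian error code of Section 2.1 with constants dropped.

```python
import numpy as np

def is_outlier(y, c_i, cluster_size, m, K, sigma):
    """Simplified check of the outlier condition (Eq. (5)) for one vector y
    assigned to reference vector c_i (constant coding terms dropped)."""
    L_index = np.log2(m)                                    # index bits for c_i
    L_error = ((y - c_i) ** 2).sum() / (2 * np.log(2) * sigma ** 2)
    L_ref = K if cluster_size == 1 else 0.0                 # c_i itself would be saved
    # Coding y directly costs K bits; it is an outlier if that is cheaper.
    return K < L_index + L_error + L_ref
```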

Minimizing $L(S(A))$. Our next step is to calculate the change in $L(S(A))$ when we remove a reference vector $c_j$ from $A$, i.e., $\Delta L_{c_j} = L(S(A - \{c_j\})) - L(S(A))$. Our goal is to remove those reference vectors which decrease the description length, i.e., $\Delta L_{c_j} < 0$. We can estimate $\Delta L_{c_j}$ according to

$$\hat{\Delta L}_{c_j} = -K + L(I(A - \{c_j\})) - L(I(A)) + \sum_{x \in S_j} \left( L(\epsilon^{(-j)}(x)) - L(x - c_j) \right), \qquad (6)$$

where $\epsilon^{(-j)}(x)$ is the error caused by the vector $x$ when the vector $c_j$ is removed from $A$, i.e., $\epsilon^{(-j)}(x) = (x - c_k)$, $k = \arg\min_{i \in \{1,\ldots,j-1,j+1,\ldots,m\}} \|x - c_i\|$. $\hat{\Delta L}_{c_j}$ is only an estimate of $\Delta L_{c_j}$ because Eq. (6) does not take into account the possibility that, by removing a reference vector, some points may become outliers. However, it can easily be shown (by the definition of an outlier, Eq. (5)) that $\hat{\Delta L}_{c_j} \geq \Delta L_{c_j}$; therefore, we can guarantee that the reference vectors that are removed at this stage definitely decrease the description length. The condition that all reference vectors with $\Delta L_{c_j} < 0$ are removed can be guaranteed by the iterative nature of the complete algorithm (see Section 2.2).

The above derivation holds for the removal of a single reference vector. However, we can prove that all non-neighboring reference vectors with $\hat{\Delta L}_{c_j} < 0$ can be removed in parallel.² The argument goes as follows: Let us consider two non-neighboring reference vectors $c_i$, $c_j$ for which $\hat{\Delta L}_{c_i} < 0$ and $\hat{\Delta L}_{c_j} < 0$ hold. Without loss of generality, we have to show that after the removal of $c_i$ we still have $\hat{\Delta L}_{c_j, c_i} < 0$ ($\hat{\Delta L}_{c_j, c_i}$ denotes the change in coding length when $c_j$ is removed and $c_i$ has been removed before). This guarantees that the reference vectors that are removed at this stage definitely decrease the description length. Since the reference vectors $c_i$, $c_j$ are non-neighbors, the error term in Eq. (6) does not change, nor does the first term of the same equation. Therefore, it is sufficient to show that $L(I(A - \{c_j\})) > L(I(A - \{c_i, c_j\}))$. This condition can easily be verified for the encodings of the index terms specified in the next section. In fact, we cannot think of any reasonable encoding for which this condition would not be satisfied. Therefore, we can conjecture that all non-neighboring reference vectors with $\hat{\Delta L}_{c_j} < 0$ can be removed in parallel. In particular, we use a greedy strategy to select the reference vectors for removal. More specifically, under the condition that the neighborhood constraint holds, we always select the reference vectors with the largest decrease in description length first. This strategy does not necessarily lead to the optimal result (i.e., we cannot guarantee that our selection achieves the maximal reduction in the description length); however, this problem is alleviated by embedding the selection in an iterative algorithm (see Section 2.2).

² Two reference vectors are defined as neighbors when, for at least one sample $x$, one of the reference vectors is the closest and the other one is the second closest.
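The neighborhood relation of footnote 2 and the greedy parallel removal can be made concrete as in the sketch below. This is an illustration only; `delta_hat` is assumed to hold the estimates of Eq. (6) for every current reference vector.

```python
import numpy as np

def neighbor_pairs(X, C):
    """Footnote 2: c_a and c_b are neighbors if, for at least one sample x,
    one of them is the closest and the other the second closest reference vector."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    order = np.argsort(d2, axis=1)               # per sample: indices sorted by distance
    first, second = order[:, 0], order[:, 1]
    return {tuple(sorted(p)) for p in zip(first.tolist(), second.tolist())}

def greedy_parallel_removal(delta_hat, neighbors):
    """Greedily pick non-neighboring reference vectors with negative estimated
    change in description length, largest decrease first."""
    chosen = []
    for j in sorted((j for j, d in enumerate(delta_hat) if d < 0),
                    key=lambda j: delta_hat[j]):
        if all(tuple(sorted((j, k))) not in neighbors for k in chosen):
            chosen.append(j)
    return chosen
```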


2.1 Instantiations of the MDL-Algorithm

Up to now we have presented the problem of vector quantization in the MDL framework on a general level, without specifying the actual coding procedure for the indices and the errors. Depending on the particular type of encoding, we can derive various instantiations of the algorithm. We consider two different types of encodings for the index term $L(I(A))$ and for the error term $L(x - c_i)$ (other encodings can be derived in a similar manner):

Index term $L(I(A))$:

1. A simple type of encoding is to use a fixed-length code for each reference vector index. The amount of bits needed per data point is $\log_2(m)$; therefore, $L(I(A)) = |I| \log_2(m)$. If we remove a reference vector, this term changes to $|I| \log_2(m-1)$. The decrease in the length of encoding for the index term is then $|I|(\log_2(m-1) - \log_2(m))$.

2. We code the indices with the optimal variable-length code according to their probability of occurrence, i.e., $p_i = n_i / |I|$, where $n_i = |S_i|$. The amount of bits for a data point $x \in S_i$ is then $-\log_2(p_i)$; therefore, $L(I(A)) = -\sum_{i=1}^{m} n_i \log_2 p_i$. The change in the length of encoding when $c_j$ is removed is
$$\sum_{k=1, k \neq j}^{m} (n_k + n_{jk}) \log_2\!\left(\frac{n_k + n_{jk}}{|I|}\right) - \sum_{k=1}^{m} n_k \log_2 p_k,$$
which is
$$-n_j \log_2 p_j + \sum_{k=1, k \neq j}^{m} \left( n_{jk} \log_2\!\left(\frac{n_k + n_{jk}}{|I|}\right) + n_k \log_2\!\left(\frac{n_k + n_{jk}}{n_k}\right) \right).$$
When $n_k \gg n_{jk}$, this can be approximated as
$$-n_j \log_2 p_j + \sum_{k=1, k \neq j}^{m} n_{jk} \log_2\!\left(\frac{n_k + n_{jk}}{|I|}\right),$$
where $n_{jk}$ is the number of vectors which have reference vector $j$ as their nearest and $k$ as their second nearest reference vector.
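In code, the two index-coding lengths look as follows (a sketch under the same definitions; the change formulas above follow by evaluating these before and after the removal of $c_j$):

```python
import numpy as np

def index_bits_fixed(n_inliers, m):
    """Case 1: fixed-length code, log2(m) bits per inlier."""
    return n_inliers * np.log2(m) if m > 1 else 0.0

def index_bits_variable(cluster_sizes):
    """Case 2: optimal variable-length code, -log2(p_i) bits per point in
    cluster i, with p_i = n_i / |I|."""
    n = np.asarray(cluster_sizes, dtype=float)
    n = n[n > 0]
    p = n / n.sum()
    return float(-(n * np.log2(p)).sum())
```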

Error term $L(x - c_i)$:

1. Let us assume that the encoding length of the error is proportional to its magnitude, and that the required accuracy (i.e., quantization) is $\delta$ in each dimension. If the error $\epsilon(x) = (x - c_i)$ is independent along the dimensions, it is encoded for each component separately, i.e., $L(\epsilon(x)) = \sum_{j=1}^{d} \max\!\left(\log_2\!\left(\frac{|x_j - c_{ij}|}{\delta}\right), 1\right)$. Assuming that on the average the error is the same in each dimension, we can simplify the above expression and calculate the change in the error term for the removal of the reference vector $c_j$ as $d \sum_{x \in S_j} \log_2\!\left(\frac{\|\epsilon^{(-j)}(x)\|}{\|\epsilon(x)\|}\right)$.

2. If we can assume a particular distribution $p(\epsilon(x))$ for the error, we can encode the error using this distribution, i.e., $L(\epsilon(x)) = -\log_2(p(\epsilon(x)))$. For example, if we assume an independent and normally distributed error with zero mean and a fixed variance of $\sigma^2$ in each dimension, i.e.,
$$p(\epsilon(x)) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{\epsilon^2(x_i)}{2\sigma^2}},$$
the length of encoding the error term is then given by
$$L(\epsilon(x)) = -\log_2(p(\epsilon(x))) = \sum_{i=1}^{d} \frac{\epsilon^2(x_i)}{2\ln(2)\,\sigma^2} + d \log_2(\sqrt{2\pi}\,\sigma) - d \log_2(\delta).$$
The change in the error term is then given by
$$\sum_{x \in S_j} \sum_{i=1}^{d} \frac{(x_i - c_{ik})^2 - (x_i - c_{ij})^2}{2\ln(2)\,\sigma^2},$$
where here again $j$ is the nearest and $k$ the second nearest reference vector. Here we have assumed that the variance $\sigma^2$ is constant and known to the sender and the receiver (i.e., it is a parameter which needs to be set; see also Section 3.1). However, we could also estimate the variance for each reference vector and include it in the transmission as part of the description of the reference vector. In this case, the cost of coding a reference vector will be increased (first term in Eq. (6)).

We can now specify different instantiations of the outlier condition, Eq. (5), and of the conditions for the removal of a reference vector, Eq. (6), by replacing the appropriate terms with the above derived equations. For example, using fixed encoding of the index term (Case 1) and magnitude-based coding of the error term (Case 1) we get

$$\hat{\Delta L}_{c_j} = -K + |I|\left(\log_2(m-1) - \log_2(m)\right) + d \sum_{x \in S_j} \log_2\!\left(\frac{\|\epsilon^{(-j)}(x)\|}{\|\epsilon(x)\|}\right) \qquad (7)$$

as a change in the coding length. Similarly, having variable encoding for the index term (Case 2) and Gaussian encoding for the error term (Case 2) we get

$$\hat{\Delta L}_{c_j} = -K - n_j \log_2 p_j + \sum_{k=1, k \neq j}^{m} n_{jk} \log_2\!\left(\frac{n_k + n_{jk}}{|I|}\right) + \sum_{x \in S_j} \sum_{i=1}^{d} \frac{(x_i - c_{ik})^2 - (x_i - c_{ij})^2}{2\ln(2)\,\sigma^2} \qquad (8)$$

as a change in the coding length.
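The following sketch evaluates the estimate of Eq. (8), i.e., variable-length index coding and Gaussian error coding, for one candidate reference vector $c_j$. The per-point bookkeeping of nearest and second-nearest reference vectors is our own implementation choice, not something prescribed by the paper.

```python
import numpy as np

def delta_L_hat(X_in, C, j, K, sigma):
    """Estimated change in description length, Eq. (8), when c_j is removed
    (X_in are the current inliers)."""
    d2 = ((X_in[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    order = np.argsort(d2, axis=1)
    nearest, second = order[:, 0], order[:, 1]
    n = np.bincount(nearest, minlength=len(C)).astype(float)   # cluster sizes n_k
    if n[j] == 0:
        return -K                  # empty cluster: removing it only saves its K bits
    N = len(X_in)
    in_Sj = nearest == j
    # n_jk: points with c_j nearest and c_k second nearest.
    n_jk = np.bincount(second[in_Sj], minlength=len(C)).astype(float)
    # Index term of Eq. (8).
    idx = -n[j] * np.log2(n[j] / N)
    idx += sum(n_jk[k] * np.log2((n[k] + n_jk[k]) / N)
               for k in range(len(C)) if k != j and n_jk[k] > 0)
    # Error term: squared error w.r.t. the second nearest minus w.r.t. c_j, for x in S_j.
    err = (d2[in_Sj, second[in_Sj]] - d2[in_Sj, j]).sum() / (2 * np.log(2) * sigma ** 2)
    return -K + idx + err
```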

2.2 Complete Algorithm

Having derived the conditions for the removal of superfluous reference vectors, we now formulate the complete algorithm, which is schematically shown in Fig. 3.

1. Initialization: Initialize the vector quantization network with a large number of reference vectors, e.g., by randomly drawing samples from $S$ and using them as reference vectors. Initially $I = S$.

2. Adaptation: Use an unsupervised algorithm to adapt the reference vectors using the inliers $I$. In principle, any clustering algorithm can be used [1]. It is also important to note that we do not need to train the network to convergence, because we compare the reference vectors on a relative basis (see the discussion below).

3. Selection: Remove the superfluous reference vectors (in the MDL sense) according to the procedure described in the previous subsection.

4. Outliers: Detect outliers according to the MDL outlier condition (Eq. (5)). It is important to note that all vectors in $S$ have to be taken into account, because a vector classified as an outlier in a previous iteration may become an inlier again.

5. Convergence: If

   - no additional outliers were detected,
   - the selection step has not removed any reference vectors, and
   - the changes in the adaptation step are small,

   then stop; otherwise go to step 2 (adaptation).

This iterative approach is a very controlled and efficient way of removing reference vectors, and as such shares similarities with EM-type algorithms [21]. In step 2, keeping $m$ fixed, we estimate the positions of the $c_j$, which decreases the error term ($L(x - c_i)$) of the description length. In step 3, we re-calculate the number of reference vectors, keeping the $c_j$ fixed. In step 4, vectors are assigned as outliers if this decreases the description length. Since we reduce the description length of the network in each of the three steps, and we iterate these steps until no further improvement is possible, we are guaranteed to find at least a local minimum of the description length. Of course, convergence to a global minimum cannot be assured. However, starting from an initially high number of reference vectors, we reduce the likelihood of being trapped in a poor local minimum due to initialization problems.

It is also important to note that, in order to achieve a proper selection, it is not necessary to train the network to convergence at each step. This is because the selection removes only those reference vectors that cause a decrease in the description length (i.e., the other reference vectors can compensate for their omission). Since this is independent of the stage of the training, it is not critical when we invoke the selection. One should also note that the quantities needed for the selection can be computed at almost no additional computational cost, because they are already needed at the learning stage.
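Putting the pieces together, the complete loop of Section 2.2 can be sketched as below. The adaptation step is inlined as a single K-means pass; `delta_L_hat`, `neighbor_pairs`, `greedy_parallel_removal`, and `is_outlier` refer to the earlier sketches; the convergence test is simplified to "nothing changed". This is an illustration under those assumptions, not a verbatim transcription of the authors' implementation.

```python
import numpy as np

def mdl_vq(X, m0, K, sigma, max_rounds=50, seed=0):
    """Sketch of the complete algorithm of Section 2.2:
    initialize -> adapt -> select -> detect outliers -> iterate until stable."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=m0, replace=False)].astype(float)  # Step 1
    inlier = np.ones(len(X), dtype=bool)                             # initially I = S
    for _ in range(max_rounds):
        # Step 2: a single K-means pass on the inliers (full convergence is not needed).
        XI = X[inlier]
        d2 = ((XI[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        lab = d2.argmin(axis=1)
        for i in range(len(C)):
            if np.any(lab == i):
                C[i] = XI[lab == i].mean(axis=0)
        # Step 3: remove non-neighboring reference vectors that decrease the coding length.
        removed = []
        if len(C) > 1:
            dL = [delta_L_hat(XI, C, j, K, sigma) for j in range(len(C))]
            removed = greedy_parallel_removal(dL, neighbor_pairs(XI, C))
            C = np.delete(C, removed, axis=0)
        # Step 4: outlier detection on ALL points (an outlier may become an inlier again).
        d2_all = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        near = d2_all.argmin(axis=1)
        sizes = np.bincount(near, minlength=len(C))
        new_inlier = np.array([not is_outlier(X[p], C[near[p]], sizes[near[p]],
                                              len(C), K, sigma) for p in range(len(X))])
        # Step 5: stop when no reference vector was removed and the inlier set is stable.
        if not removed and np.array_equal(new_inlier, inlier):
            break
        inlier = new_inlier
    return C, inlier
```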

3 Experimental Results

To test the proposed method, we applied it to various clustering problems on 2-D data (in order to visualize the behavior of the algorithm) and to vector quantization of images. We used the following datasets in the experiments reported in this paper: For 2-D clustering we used 10 different datasets where the clusters were distributed according to a Gaussian distribution; the number of clusters varied from 3 to 10, the standard deviation within the clusters varied from 0.03 to 0.08, and the number of data points per cluster varied from 20 to 100. For the image quantization experiment we used $8 \times 8 = 64$-dimensional vectors, with 3000 data points in the training set. For the color segmentation experiments we used two different data sets, each consisting of 8000 3-dimensional (RGB) data points. The exact details are given with the experiments below. As for the training procedure, we experimented with two algorithms: the K-means clustering algorithm [2] and the unsupervised learning vector quantization (LVQ) algorithm [22]. In terms of the number of clusters found, the results are independent of the training algorithm; therefore, we show only the results obtained with K-means. In these experiments we used the instantiation in Eq. (8), i.e., optimal index coding and Gaussian error coding, which is particularly well suited to the K-means algorithm.

3.1 2-D Clustering

The clusters were generated according to Gaussian distributions (with different variances) with random placement of the centers in $[0, 2] \times [0, 2]$. Fig. 4 shows a data set (to which a few additional outlier points have been added) and the reference vectors at different stages of the algorithm. One can see that the algorithm gradually reduces the number of reference

vectors and finally ends up with the same number of clusters as were originally generated; however, the outliers influence the final positions of the reference vectors. Fig. 5 shows the same run, but this time with the outliers treated separately. One can see that those points which are far from the other clusters have been correctly detected as outliers. Fig. 6 shows four results obtained on datasets with increasing variance and different numbers of clusters. We initialized the network with 50 reference vectors, randomly selected from the data set. Usually the network converged after $3 \pm 1$ selection steps ($5 \pm 2$ in the case of LVQ). One can see that in cases (a), (b), and (d) the algorithm found the originally generated number of clusters. In case (c), one cluster highly overlaps with three others (lower left corner); therefore, the algorithm did not generate a separate center for it, since according to the MDL principle it is more economical to let the other centers also encode these data points. In the next experiment we tested the noise sensitivity of the method. We generated 6 Gaussian clusters with $\sigma_G = 0.05$. Then we added zero-mean uncorrelated Gaussian noise with increasing variance ($\sigma_N \in [0.005, 0.085]$, i.e., 10%-170% of the variance of the clusters) to each data point. All parameters were kept the same as in the previous experiment. Fig. 7(a) shows the positions of the cluster centers and Fig. 7(b) shows a plot of the mean squared error of the distance between the true centers and the centers resulting from the application of our algorithm. It is important to note that the algorithm consistently found 6 clusters in all cases, which indicates that the selection mechanism is very noise tolerant. Next we show how the parameter $\sigma$ (in Eq. (8)) can be used for hierarchical clustering. We generated 6 Gaussian clusters with 50 data points each and $\sigma_G = 0.07$. Fig. 8 shows the final clustering result depending on $\sigma$. The number of clusters decreases with increasing $\sigma$; however, it is important to note that the number of clusters is very stable around the right value of $\sigma$. Fig. 9 shows a plot of the final number of cluster centers obtained by our method with respect to $\sigma$, i.e., we used the same data set as in Fig. 8, set the parameter $\sigma$ of our algorithm to different values, and plotted it against the final number of cluster centers. From this plot one can see that for finding the right number of centers it is not critical to fine-tune the parameter $\sigma$. In fact, this parameter $\sigma$ can also be used for hierarchical clustering. Our next test compares two training strategies: In the first case, we

apply K-means after each selection step until convergence; in the second case we perform only one K-means iteration after each selection step. We randomly generated different datasets similar to those in Fig. 6 and initialized both training versions with the same 30 reference vectors. Both training strategies converged to the same final number of reference vectors. Table 1 shows the minimum, mean, and maximum number of K-means iterations in these cases. One can see that it is much cheaper to run only one K-means iteration after each selection step.

Table 1: Number of K-means iterations.

                 min    mean    max
Full K-means       9   14.29     22
One K-means        5    6.43     10

Our last 2-D experiment compares our MDL method with the growing neural gas method of Fritzke [23], and shows how these two methods can be combined efficiently. The number of reference vectors generated by the growing neural gas method depends on the number of iterations and the stopping criterion (i.e., the allowable error). Figs. 10(a),(b) show two typical results obtained with the growing neural gas method. One can see that the number of clusters is always overestimated by the growing neural gas network. Figs. 10(c),(d) show two results of our method initialized with 75 and 150 reference vectors, respectively. One can see that our method always finds the right number of clusters. Figs. 10(e),(f) show the results of our method when we take as initial cluster centers those obtained by the growing neural gas method. The results are the same as with random initialization; however, in this case the algorithm converges much faster (i.e., it needs far fewer selection steps). Therefore, it seems beneficial to use the growing neural gas method as an initialization step for our method. Preliminary results have shown that the results are consistent over different datasets.
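The paper does not spell out the data generator; the following stand-in only mirrors the setup described above (Gaussian clusters with random centers in $[0, 2] \times [0, 2]$ plus a few isolated outlier points), and all default values are assumptions.

```python
import numpy as np

def make_2d_clusters(n_clusters=6, pts_per_cluster=50, sigma_g=0.05,
                     n_outliers=5, seed=0):
    """Synthetic 2-D test data in the spirit of Section 3.1: Gaussian clusters
    with random centers in [0, 2] x [0, 2] plus a few isolated outlier points."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(0.0, 2.0, size=(n_clusters, 2))
    X = np.vstack([c + sigma_g * rng.standard_normal((pts_per_cluster, 2))
                   for c in centers])
    outliers = rng.uniform(0.0, 2.0, size=(n_outliers, 2))
    return np.vstack([X, outliers]), centers
```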

3.2 Image Quantization

We also performed several experiments on image vector quantization; Fig. 11(a) shows a typical example (size $768 \times 512$). We divided the image into non-overlapping $8 \times 8$ blocks and used them to form 64-dimensional data vectors. Then, we randomly selected half (3000)

of them for training the network. The network was initialized with 256 reference vectors, and the parameter $\sigma$ (Eq. (8)) was set to 9. After 7 selection steps the network converged to 34 vectors. The obtained reference vectors were used to quantize the image, i.e., each non-overlapping $8 \times 8$ block was assigned the index of the nearest reference vector. From this representation the image can be reconstructed by replacing each index with the corresponding reference vector. Fig. 11(b) shows the reconstruction of the image from the reference vectors. Fig. 11(c) shows the error between the original and the reconstructed image (gray represents zero error). The compression obtained (after run-length encoding of the vector-quantized image) is by a factor of 50, i.e., 0.16 bits/pixel as compared to 8 bits/pixel for the original image (run-length encoding of the original image yields a compression factor of 1.26, or 6.35 bits/pixel). The visible errors are mainly due to the blocking effects of taking $8 \times 8$ windows, and could easily be removed by proper smoothing.
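A sketch of the block pipeline used in this experiment: cut a grayscale image into non-overlapping $8 \times 8$ blocks, quantize each block with a learned codebook (e.g., the output of the `mdl_vq` sketch above, trained on a random half of the block vectors), and reassemble the image. Function names and the reshaping details are ours.

```python
import numpy as np

def blocks_from_image(img, b=8):
    """Cut a grayscale image into non-overlapping b x b blocks, one b*b-dim vector each."""
    H, W = img.shape[0] - img.shape[0] % b, img.shape[1] - img.shape[1] % b
    img = img[:H, :W]
    vecs = (img.reshape(H // b, b, W // b, b)
               .transpose(0, 2, 1, 3)
               .reshape(-1, b * b).astype(float))
    return vecs, (H, W)

def quantize_image(img, codebook, b=8):
    """Replace every block by its nearest reference vector and reassemble the image."""
    vecs, (H, W) = blocks_from_image(img, b)
    d2 = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    rec = codebook[d2.argmin(axis=1)]
    return (rec.reshape(H // b, W // b, b, b)
               .transpose(0, 2, 1, 3)
               .reshape(H, W))
```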

3.3 Color Segmentation

The last example demonstrates the use of our MDL-based algorithm for the segmentation of color images, i.e., finding regions of similar colors. We used 24-bit RGB images as shown in Fig. 12(a),(b) (shown here only in black and white). Fig. 12(a) is a simple test image consisting of three dominant colors. Fig. 12(b) is the well-known Mandrill image. From each of the images we randomly sampled 10% of the pixels (8000 data points) as training sets, each consisting of the 3-dimensional RGB vectors. The network was initialized with 50 reference vectors, and $\sigma$ was set to 4. After 5 selection steps the network converged to 3 and 4 reference vectors, respectively. These reference vectors were then used to segment the images, i.e., each pixel was assigned the index of the nearest reference vector. Fig. 12(c) shows that the image has been correctly segmented into its three dominant colors, corresponding to the background and two regions inside the leaf. The segmentation of the Mandrill in Fig. 12(d) also highlights the dominant regions of the image. One may notice that the printout of the reconstruction is visually more pleasing than the original image, due to a reduction in the number of different gray values, which reduces the half-toning effect of the printer.


4 Conclusions

In this paper we presented an MDL framework for vector quantization networks. This framework was used to derive a computationally efficient algorithm for finding the optimal number of reference vectors as well as their positions. This was made possible by a systematic approach for removing superfluous reference vectors and identifying outliers. We have demonstrated the algorithm on 2-D examples illustrating its features, on quantizing images for compression purposes, and on color segmentation. In all of these examples the algorithm performed successfully. The method of handling outliers (i.e., measuring the cost of coding them) is quite general and might be useful for other methods as well. There are various ways in which we are currently working to extend our method:

Different learning algorithms. The method can in principle be used with any unsupervised learning algorithm. However, for networks which also have topological connectivity, like the Kohonen network, one must add special mechanisms to preserve this connectivity when a unit is removed. Another natural extension is to modify the method to find the number of components in a Gaussian mixture model, where the EM algorithm [21] is used as the training algorithm. In this case the method will be similar to our recently proposed scheme for optimizing RBF networks [17]. Another extension we are currently working on is the use of our method for fuzzy clustering algorithms.

Combination with growing networks. The computationally most expensive steps of our method are the first one or two iterations, when we have to initialize the clustering algorithm with many randomly placed reference vectors, most of which will very likely be pruned later. Therefore, a more elegant way of initializing our method would be to use a growing method, like the growing neural gas algorithm, to find the initial cluster centers, and then use our method to find the right size of the network. Preliminary results have already been shown in Fig. 10 and seem quite promising; however, this should be analyzed further.

Online algorithm. Currently the algorithm is formulated as an off-line method (i.e., all the training data has to be given beforehand). Since one of the strengths of neural networks is that they can also be used on-line (i.e., the networks are trained as new data arrives), it would be interesting to develop an on-line version of our algorithm as well. In this case we would have to replace the MDL measures by suitable running averages.

Combination with the supervised RBF method. Another extension is to use the unsupervised MDL method as an initialization method for our supervised RBF-network construction method [17].

Since the method proposed in this paper is very general, it is easy to incorporate all these extensions without changing the general paradigm.

References

[1] B. Fritzke. Some competitive learning methods (draft). Technical report, Institute for Neural Computation, Ruhr-University Bochum, 1997.
[2] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. New York: Wiley, 1973.
[3] A. Gersho and R. M. Gray. Vector Quantization and Signal Compression. Kluwer Academic Publishers, 1992.
[4] S. Roberts. Parametric and non-parametric unsupervised cluster analysis. Pattern Recognition, 30(2):261–272, 1997.
[5] L. Xu. How many clusters?: A YING-YANG machine based theory for a classical open problem in pattern recognition. In Proceedings of the 1996 IEEE International Conference on Neural Networks, volume 3, pages 1546–1551, Washington, DC, June 2-6, 1996. IEEE Computer Society.
[6] J. Buhmann and H. Kuhnel. Vector quantization with complexity costs. IEEE Transactions on Information Theory, 39:1133–1145, 1993.
[7] P. J. Huber. Robust Statistics. Wiley, New York, 1981.
[8] P. J. Rousseeuw and A. M. Leroy. Robust Regression and Outlier Detection. Wiley, New York, 1987.


[9] G. J. McLachlan and D. Peel. Robust cluster analysis via mixtures of multivariate t-distributions. In A. Amin, D. Dori, P. Pudil, and H. Freeman, editors, Advances in Pattern Recognition, number 1451 in Lecture Notes in Computer Science, pages 658–665. Springer, 1998.
[10] D. Comaniciu and P. Meer. Distribution free decomposition of multivariate data. In A. Amin, D. Dori, P. Pudil, and H. Freeman, editors, Advances in Pattern Recognition, number 1451 in Lecture Notes in Computer Science, pages 602–610. Springer, 1998.
[11] M. Li and P. Vitanyi. An Introduction to Kolmogorov Complexity and its Applications. Springer, 2nd edition, 1997.
[12] J. Rissanen. Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, 30:629–636, July 1984.
[13] J. Rissanen. Stochastic Complexity in Statistical Inquiry, volume 15 of Series in Computer Science. World Scientific, 1989.
[14] R. S. Zemel and G. E. Hinton. Learning population codes by minimum description length. Neural Computation, 7(3):549–564, 1995.
[15] G. E. Hinton and R. S. Zemel. Autoencoders, minimum description length and Helmholtz free energy. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 6, pages 3–10. Morgan Kaufmann Publishers, 1994.
[16] S. Hochreiter and J. Schmidhuber. Flat minima. Neural Computation, 9(1):1–43, 1997.
[17] A. Leonardis and H. Bischof. An efficient MDL-based construction of RBF networks. Neural Networks, 11(5):963–973, July 1998.
[18] J. Rissanen. Information theory and neural nets. In P. Smolensky, M. C. Mozer, and D. E. Rumelhart, editors, Mathematical Perspectives on Neural Networks, pages 567–602. Lawrence Erlbaum, 1996.
[19] R. S. Zemel. A Minimum Description Length Framework for Unsupervised Learning. PhD thesis, University of Toronto, 1994.

[20] H. Tenmoto, M. Kudo, and M. Shimbo. MDL-based selection of the number of components in mixture models for pattern classification. In A. Amin, D. Dori, P. Pudil, and H. Freeman, editors, Advances in Pattern Recognition, number 1451 in Lecture Notes in Computer Science, pages 831–836. Springer, 1998.
[21] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B, 39:1–38, 1977.
[22] T. Kohonen. Self-Organization and Associative Memory. Berlin: Springer-Verlag, 1989.
[23] B. Fritzke. A growing neural gas network learns topologies. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 7. MIT Press, Cambridge, MA, 1995.


Figure 1: Illustration of the Voronoi region $S_i$: shaded part in the continuous case, and dots within the shaded part in the discrete case.

Figure 2: Illustration of the performance of the K-means algorithm when initialized with 8 reference vectors ((a), too many centers) and 4 reference vectors ((b), too few centers).

Figure 3: The complete algorithm as a flowchart: starting from the training set and the initial network size, the steps Initialization (Step 1), Adaptation (Step 2), Selection (Step 3), and Outlier detection (Step 4) are iterated until the convergence test (Step 5) is passed, yielding the trained network.

Figure 4: Different stages of vector quantization of 2-D data (without identifying outliers): (a) initialization, (b) after 1st selection, (c) after 2nd selection, (d) final result. Dots denote data points, octagons the removed reference vectors, and squares the remaining reference vectors.

Figure 5: Different stages of vector quantization of 2-D data (with identifying outliers): (a) initialization, (b) after 1st selection, (c) after 2nd selection, (d) final result. Dots denote data points, octagons the removed reference vectors, squares the remaining reference vectors, and crosses the outliers detected by the algorithm.

Figure 6: Vector quantization of 2-D data; dots denote data points and squares denote cluster centers. (a) 5 clusters, $\sigma_G = 0.03$; (b) 8 clusters, $\sigma_G = 0.04$; (c) 8 clusters, $\sigma_G = 0.06$; (d) 7 clusters, $\sigma_G = 0.08$.

Figure 7: Error sensitivity of the method: (a) spread of the cluster centers for 0-170% noise, (b) MSE of the cluster centers versus the noise level (%).

Figure 8: Clustering results obtained with different $\sigma$ (see Eq. (8)): (a) $\sigma = 0.05$, (b) $\sigma = 0.07$, (c) $\sigma = 0.09$, (d) $\sigma = 0.15$.

Figure 9: Number of cluster centers with respect to $\sigma$ (x-axis: $\sigma$ from 0 to 0.2; y-axis: number of reference vectors).

Figure 10: Growing neural gas versus the MDL algorithm, and a combination of the two methods: (a) growing neural gas 1, (b) growing neural gas 2, (c) MDL initialized with 75 reference vectors, (d) MDL initialized with 150 reference vectors, (e) MDL initialized with the result of (a), (f) MDL initialized with the result of (b).

Figure 11: Vector quantization of the logs image: (a) original image, (b) reconstructed image, (c) error image.

Figure 12: Color segmentation of different images: (a) leaf image, (b) Mandrill image, (c) segmented leaf (3 clusters), (d) segmented Mandrill (4 clusters).