FUZZ-IEEE 2009, Korea, August 20-24, 2009

Clustering Large Data sets based on Data Compression Technique and Weighted Quality Measures
M. Sassi, A. Grissa


Abstract— Various algorithms have been proposed for clustering large data sets in the hard and fuzzy cases, but not as much work has been done on automatic clustering approaches in which the number of clusters is unknown to the user. These approaches need some measure, called a validity function, to evaluate the clustering result and to give the user the optimal number of clusters. In order to obtain this number, three conditions are necessary: (1) a good compression technique for data reduction with limited memory allocated; (2) good measures for evaluating the goodness of clusters for a varying number of clusters; and (3) a good clustering algorithm that can automatically produce the number of clusters and takes into account the compression technique used. In this paper, we propose new clustering approaches that rely on a new compression technique based on quality measures.

I. INTRODUCTION

Clustering methods group homogeneous data into clusters by maximizing the similarity of the data in the same cluster while minimizing it for the data in different clusters. Consequently, compactness and separation are two reasonable measures for evaluating the quality of the obtained clusters. Various algorithms have been proposed for clustering large databases in the hard and fuzzy cases [19,3,5,7,14,4,13,9,1,10], but not as much work has been done on automatic clustering approaches in which the number of clusters is unknown to the user. These approaches need some measure, called a validity function [11,15], to evaluate the clustering result and to give the user the optimal number of clusters. In order to obtain this number, three conditions are necessary: (1) a good compression technique for data reduction with limited memory allocated; (2) good measures for evaluating the goodness of clusters for a varying number of clusters; and (3) a good clustering algorithm that can automatically produce the number of clusters and takes into account the compression technique used. Once these requirements are met, the strategy for getting the optimal number of clusters is straightforward: produce the optimal solution for each potential number of clusters, and use the validity function to choose the best one, thus automatically deciding on the number of clusters.

To satisfy the second condition, several measures for clustering quality evaluation have been proposed [15]. They include the separation between clusters and the compactness within a cluster. However, they address the hard and fuzzy cases of clustering and are not applicable in the case of large databases. To solve this problem, we propose new quality measures which are applicable in the context of large databases. These measures are based on weighted compactness and separation. For the third condition, we are particularly interested in the merge-based and splitting-based clustering processes proposed in [16] and [17] respectively. We propose an extension of these processes, called weighted merge-based and splitting-based processes, which produce a final clustering with limited memory allocated. We neither keep any complicated data structures nor use any complicated compression techniques.

The rest of the paper is organized as follows. In Section 2, we discuss related works. In Section 3, we present the proposed compression technique. Section 4 presents the experimentation and, finally, Section 5 concludes the paper and gives some future work.

II. RELATED WORKS

To make it easier for readers to understand the ideas behind fuzzy clustering techniques, we unify the notation used in the following. To that end, the following definitions are assumed: X ∈ R^{N×M} denotes a set of data items representing a set of N data points x_i in R^M, v_j denotes the j-th cluster,

and c denotes the optimal number of clusters found. The objective function (J_m) minimized by FCM (Fuzzy C-Means) [2] is defined as follows:

Sassi M. is with the National School of Engineering of Tunis, TIC Department, BP. 37 Le Belvédère 1002, Tunis, Tunisia (phone: +21622537272; email: [email protected]). Grissa A. is with the National School of Engineering of Tunis, TIC Department, BP. 37 Le Belvédère 1002, Tunis, Tunisia (phone: +216255444074; email: [email protected]).

J_m(U, V) = \sum_{j=1}^{c} \sum_{i=1}^{N} U_{ji}^{m} \left\| x_i - v_j \right\|^2


U and V can be calculated as:

U_{ji} = \frac{\left( \left\| x_i - v_j \right\|^2 \right)^{\frac{1}{1-m}}}{\sum_{k=1}^{c} \left( \left\| x_i - v_k \right\|^2 \right)^{\frac{1}{1-m}}}, \qquad
v_j = \frac{\sum_{i=1}^{N} (\mu_{ji})^{m} x_i}{\sum_{i=1}^{N} (\mu_{ji})^{m}}

where \mu_{ji} is the membership value of the i-th example x_i in the j-th cluster, v_j is the j-th cluster center, and N is the number of patterns. Given a cluster scheme C = \{C_1, C_2, \ldots, C_c\}, let C' = \{ C_{pk} \mid C_{pk} \in C \text{ and } C_{pk} \text{ is not a singleton}, \; k = 1, 2, \ldots, m \}, where m = |C'|.
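As an illustration of these update rules, the following sketch (ours, not the authors' code; X, c and m are assumed inputs) alternates the two updates until the memberships stabilize:

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, eps=1e-6, seed=0):
    """Plain Fuzzy C-Means: alternates the U and V updates given above."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((c, N))
    U /= U.sum(axis=0)                                         # memberships of each x_i sum to 1
    for _ in range(n_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)           # v_j update
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)    # ||x_i - v_j||^2
        d2 = np.fmax(d2, 1e-12)
        expd = d2 ** (1.0 / (1.0 - m))                         # (||.||^2)^(1/(1-m))
        U_new = expd / expd.sum(axis=0, keepdims=True)         # normalize over the c clusters
        if np.abs(U_new - U).max() < eps:
            U = U_new
            break
        U = U_new
    return U, V
```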

A simpler way to prevent bad clustering due to inadequate seeding is to modify the basic FCM algorithm. We start with a large number of uniformly distributed seeds in the bounded M-dimensional feature space, and we decrease or increase them considerably by merging or splitting the worst clusters until the quality measure stops increasing.

In the merge process, the algorithm is based on global and local measures representing compactness and separation, both for determining the "worst" cluster to be merged and for evaluating the obtained clustering quality. The merge process generally used by earlier studies involves some similarity or compatibility measure to choose the most similar or compatible pair of clusters to merge into one. In our merge process, we choose the "worst" cluster and delete it. Each element included in this cluster is then placed into its own nearest cluster, and the centers of all clusters are adjusted. That means our merge process may affect multiple clusters, which we consider to be more practical. How do we choose the "worst" cluster? We still use the measures of separation and compactness to evaluate individual clusters (except singletons). A small value of this measure indicates the "worst" cluster to be merged. In [16], a clustering algorithm based on these measures, called EFCM (Enhanced Fuzzy C-Means), has been proposed. This clustering model makes it possible to determine the optimal number of clusters; however, in the context of large data sets, it reaches its limits in the determination of the number of clusters.

The general idea in the splitting-based algorithm is to identify the "worst" cluster and split it, thus increasing the number of clusters by one. For identifying the "worst" cluster, a "score" function associated with each cluster has been introduced [17]. In general, when this function is small, the cluster tends to contain a large number of data vectors with low membership values. The lower the membership value, the farther the data is from its cluster center. Therefore, a small value means that the cluster is large in volume and sparse in distribution. This is why the cluster corresponding to the minimum of this function is chosen as the candidate to split when the value of c is increased. On the other hand, a larger value tends to mean that the cluster has a smaller number of elements and exerts a strong "attraction" on them. Based on this principle and this function, an algorithm called the FCM-Based Model Selection Algorithm (FBSA) has been introduced [17].

In the rest of the paper, we present the principles of the new algorithms using a new compression technique, together with the different optimizations that we made for both the global and local clustering quality evaluation.

III. A NEW COMPRESSION TECHNIQUE

In this section, we present the compression technique that we adopted in our model for clustering large data sets. We start with the general principle and then present the basic concepts of this model.

A. General Principle

Suppose we intend to cluster a large data set stored on disk. For large data sets, we assume that the data set size exceeds the memory size. As in [10], we assume the data set is randomly scrambled on the disk. We can only load a certain percentage of the data, based on the available memory allocation. If we load 1% of the data into memory at a time, then we have to do it 100 times to scan through the entire data set. We call each such data access a Partial Data Access (PDA). The number of partial data accesses depends on how much data we load each time. In our approach, after the first PDA, the data is clustered into c partitions using EFCM (resp. FBSA). We will discuss the EFCM (resp. FBSA) algorithm in detail later. Then the data in memory is condensed into c weighted points and clustered with the new points loaded in the next PDA. We call them "weighted" points because they are associated with weights, which are calculated by summing the memberships of the examples in a cluster. This is the key difference from the hard clustering case [5], where a fuzzy membership matrix is not present. In each PDA, new singleton points are loaded into memory and clustered along with the past c weighted points obtained from the previous clustering. We call this a Partial Data Clustering (PDC). After clustering these new singleton points along with the past c weighted points, they are condensed again into c new, higher-weighted points and clustered with the examples loaded in the next PDA. This continues until all the data has been scanned once. The objective function of WEFCM (resp. WFBSA) was modified in a fashion similar to that in [5] and [10] to accommodate the effect of weights. We will discuss the calculation of weighted points in detail later.

As an example, consider a large data set of N examples. If n_1 examples are fetched in the first PDA and clustered into c partitions, then all these n_1 examples in memory are condensed into c weighted points, whose weights sum up to n_1. Condensation of the n_1 examples into c weighted points frees the buffer.



Next, n_2 examples are loaded into memory in the next PDA. These new n_2 examples are then clustered along with the c weighted points. So, after the second PDA there will be n_2 + c examples in memory for clustering, out of which c are weighted points and n_2 examples have weight one (singletons). We call the new algorithm, which takes into account the weights of the c weighted points, WEFCM (resp. WFBSA). After clustering these n_2 + c examples in memory using WEFCM (resp. WFBSA), they are condensed again into c new weighted points. This time the weights of the c points sum up to n_1 + n_2, and thus they have more weight than before. This is because there were already c weighted points, whose weights summed up to n_1, present when the n_2 new singleton examples were loaded in the second PDA. Similarly, after completion of clustering in the third PDA, the weights of the new condensed c points will sum up to n_1 + n_2 + n_3. This means that after the m-th PDA there will be n_m singleton points loaded in memory along with c weighted points from the previous PDC, whose weights sum up to n_1 + n_2 + n_3 + ... + n_{m-1}. So, if the last PDA loads n_l examples, it essentially clusters the whole data set, where n − n_l examples remain as c condensed weighted points and n_l as singleton points. Thus, our simple single-pass fuzzy c-means partitions the whole data in a single pass through the data set. To speed up clustering, we initialize each PDC with the final centers obtained from the previous PDC. This knowledge propagation allows for faster convergence.
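The single-pass scheme just described can be sketched as follows; load_chunk (one PDA) and weighted_cluster (one WEFCM/WFBSA run returning the centers and a c x n membership matrix) are hypothetical placeholders, not functions defined in the paper:

```python
import numpy as np

def single_pass_clustering(load_chunk, weighted_cluster, n_chunks, c):
    """One pass over the data: each PDA's fresh singletons are clustered together
    with the c weighted points carried over from the previous PDC, then the result
    is condensed again into c weighted points."""
    points, weights, centers = None, None, None
    for d in range(n_chunks):
        chunk = load_chunk(d)                                 # one Partial Data Access (PDA)
        if points is None:                                    # first PDA: singletons only
            data, w = chunk, np.ones(len(chunk))
        else:                                                 # later PDAs: c weighted points + fresh singletons
            data = np.vstack([points, chunk])
            w = np.concatenate([weights, np.ones(len(chunk))])
        # one Partial Data Clustering (PDC), seeded with the previous centers
        centers, U = weighted_cluster(data, w, c, init=centers)
        points = centers                                      # condense into c weighted points
        weights = U @ w                                       # w'_j = sum_i mu_ji * w_i
    return centers, weights
```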

B. Weighted points calculation

We will assume we have a weighted WEFCM (resp. WFBSA) algorithm that takes into account the weights of the examples and clusters the data into c partitions. The details of WEFCM (resp. WFBSA) will be discussed in the next section. Consider n_d examples which are loaded into memory in the d-th PDA.

1) Case 1: d = 1. If d is equal to one, i.e. the first PDA, there are no previous weighted points. In this case WEFCM (resp. WFBSA) is the same as EFCM (resp. FBSA). After applying clustering, let v_j be the cluster centroid values obtained, where 1 ≤ j ≤ c. Let \mu_{ji} be the membership values, where 1 ≤ j ≤ c and 1 ≤ i ≤ n_d, and let \vec{W} be the weights of the points in memory. In this case all n_d points have weight 1, because no weighted points from a previous PDC exist:

w_i = 1, \quad \forall \, 1 \leq i \leq n_d

Now, memory is freed by condensing the clustering result into c weighted points, which are represented by the c cluster centers v_j, where 1 ≤ j ≤ c, and their weights are computed as follows:

w'_j = \sum_{i=1}^{n_d} \mu_{ji} \, w_i, \quad 1 \leq j \leq c

The weights of the c points, after condensing the clustering results in memory, are then set as follows: w_j = w'_j, 1 ≤ j ≤ c. It should be noted that when n_d new singleton points (weight one) are loaded in any subsequent PDA (d > 1), their indices associated with \vec{W} begin at (c + 1) and end at (n_d + c), i.e. w_i = 1, \forall \, c < i \leq n_d + c.

2) Case 2: d > 1. In this case, clustering is applied on the n_d singleton points freshly loaded in the d-th PDA along with the c weighted points obtained after condensation from the (d − 1)-th PDA. So, there will be n_d + c points in memory for clustering using WEFCM (resp. WFBSA). The new n_d singleton points have weight one. After clustering, the data in memory (both singletons and weighted points) is condensed into c new weighted points. The new weighted points are represented by the c cluster centers v_j, where 1 ≤ j ≤ c, and their weights are computed as follows:

w'_j = \sum_{i=1}^{n_d + c} \mu_{ji} \, w_i, \quad 1 \leq j \leq c

Then memory is freed up and the weights of the condensed clustering results in memory are updated as follows: w_j = w'_j, 1 ≤ j ≤ c.
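A minimal sketch of the weight bookkeeping for the two cases (the function and variable names are ours):

```python
import numpy as np

def condensed_weights(U, n_d, w_prev=None):
    """Weights of the c condensed points after one PDC.
    Case 1 (d = 1): only the n_d fresh singletons are in memory, all with weight 1.
    Case 2 (d > 1): the c previously condensed points (weights w_prev) occupy the
    first c slots, the fresh singletons the slots c+1 .. n_d+c, each with weight 1."""
    if w_prev is None:                                   # Case 1
        w = np.ones(n_d)
    else:                                                # Case 2
        w = np.concatenate([w_prev, np.ones(n_d)])
    w_new = U @ w                                        # w'_j = sum_i mu_ji * w_i
    # Because each column of U sums to 1 over the clusters, sum_j w'_j equals
    # sum_i w_i, i.e. the total number of examples seen so far.
    return w_new
```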

C. Weighted Quality Measures

As we mentioned in the preceding section, the evaluation of the clustering must be done on two levels: globally and locally.

Definition 1. Global Weighted Compactness: Given a cluster scheme C = \{C_1, C_2, \ldots, C_c\} for a data set X = \{x_1, x_2, \ldots, x_N\}, let C' = \{ C_{pj} \mid C_{pj} \in C \text{ and } C_{pj} \text{ is not a singleton}, \; j = 1, 2, \ldots, k \} where k = |C'|. The global weighted compactness, WCP, of the cluster scheme C is given by:

WCP = \frac{1}{k} \sum_{j=1}^{k} \left( \frac{\sum_{x_i \in C_{pj}, \, x_i \neq v_j} w_i \, \mu_j(x_i)^2 \, \| x_i - v_j \|^2}{\sum_{x_i \in C_{pj}, \, x_i \neq v_j} w_i \, \mu_j(x_i)^2} \right)

where \mu_j(x_i) is the membership value of x_i belonging to C_{pj}, v_j is the center of C_{pj}, c is the number of clusters with 2 ≤ c ≤ N, and \| x_i - v_j \|^2 is the distance between x_i and v_j.

Definition 2. Global Weighted Separation: The global weighted separation, WSP, of a cluster scheme C = \{C_1, C_2, \ldots, C_c\} for a data set X = \{x_1, x_2, \ldots, x_N\} is given by:

WSP = \frac{1}{c} \sum_{j=1}^{c} \min_{1 \leq k \leq c, \, k \neq j} \left\{ \| v_j - v_k \|^2 \right\}

where c is the number of clusters, v_j is the center of the j-th cluster, v_k is the center of the k-th cluster, and \| v_j - v_k \|^2 is the distance between v_j and v_k.

Definition 3. Global Weighted Separation-Compactness: Given a cluster scheme C = \{C_1, C_2, \ldots, C_c\} for a data set X = \{x_1, x_2, \ldots, x_N\}, let C' = \{ C_{pj} \mid C_{pj} \in C \text{ and } C_{pj} \text{ is not a singleton}, \; j = 1, 2, \ldots, k \} where k = |C'|. The global weighted separation-compactness, WSC, of the cluster scheme C is given by:

WSC = \frac{k}{c} \, WSP \times WCP

Then, the objective of the WEFCM algorithm is to find the cluster scheme which solves:

\max_{2 \leq c \leq N} \left\{ \max_{\Omega_C} \{ WSC \} \right\}

where \Omega_C denotes all of the candidate cluster schemes for a certain number of clusters c. We still use the measures of separation and compactness to evaluate individual clusters (except singletons).
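The global measures could be computed along the following lines; this is our reading of Definitions 1-3, and the way points are attributed to clusters and singletons are detected is a simplification of ours:

```python
import numpy as np

def global_weighted_measures(X, V, U, w):
    """WCP, WSP and WSC = (k/c) * WSP * WCP for a cluster scheme.
    X: (N, M) data, V: (c, M) centers, U: (c, N) memberships, w: (N,) weights.
    Assumes at least one non-singleton cluster; crisp argmax labels are used only
    to decide which points belong to which cluster."""
    c = V.shape[0]
    labels = U.argmax(axis=0)
    wcp_terms = []
    for j in range(c):
        mask = (labels == j) & (np.linalg.norm(X - V[j], axis=1) > 0)   # x_i != v_j
        if mask.sum() < 2:                                              # skip singleton clusters
            continue
        num = np.sum(w[mask] * U[j, mask] ** 2 * np.sum((X[mask] - V[j]) ** 2, axis=1))
        den = np.sum(w[mask] * U[j, mask] ** 2)
        wcp_terms.append(num / den)
    k = len(wcp_terms)
    WCP = np.sum(wcp_terms) / k
    D = np.sum((V[:, None, :] - V[None, :, :]) ** 2, axis=-1)           # ||v_j - v_k||^2
    np.fill_diagonal(D, np.inf)                                         # exclude k = j
    WSP = np.sum(D.min(axis=1)) / c
    WSC = (k / c) * WSP * WCP
    return WCP, WSP, WSC
```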

Definition 4. Local Weighted Compactness: Given a cluster scheme C = \{C_1, C_2, \ldots, C_c\} for a data set X = \{x_1, x_2, \ldots, x_N\}, for each C_j \in C, if C_j is not a singleton, the local weighted compactness of C_j, denoted wcp_j, is given by:

wcp_j = \frac{\sum_{x_i \in C_j, \, x_i \neq v_j} w_i \, u_j(x_i)^2 \, \| x_i - v_j \|^2}{\sum_{x_i \in C_j, \, x_i \neq v_j} w_i \, u_j(x_i)^2}

where u_j(x_i) is the membership value of x_i belonging to the j-th cluster, v_j is the center of the j-th cluster C_j, and c is the number of clusters with 2 ≤ c ≤ N.

Definition 5. Local Weighted Separation: Given a cluster scheme C = \{C_1, C_2, \ldots, C_c\} for a data set X = \{x_1, x_2, \ldots, x_N\}, for each C_j \in C, if C_j is not a singleton, the local weighted separation of C_j, denoted wsp_j, is given by:

wsp_j = \min_{1 \leq k \leq c, \, k \neq j} \| v_j - v_k \|^2

where v_j is the j-th center of cluster C_j, v_k is the k-th center of cluster C_k, and c is the number of clusters with 2 ≤ c ≤ N.

Definition 6. Local Weighted Separation-Compactness: Given a cluster scheme C = \{C_1, C_2, \ldots, C_c\} for a data set X = \{x_1, x_2, \ldots, x_N\}, for each C_j \in C, if C_j is not a singleton, the local weighted separation-compactness of C_j, denoted wsc_j, is given by:

wsc_j = wsp_j \times wcp_j

Thus, the "worst" cluster is the one with the least wsc_j value.

Procedure 1 Weighted Merge-based Process
Step 1: Build the array C* = \{C*_1, C*_2, \ldots, C*_c, C*_{c+1}\}, such that each v*_j is the center of cluster C*_j \in C*.
Step 2: Calculate the wsc value for each C*_j in C*.
Step 3: Delete the center of the cluster with the least wsc value from C*, recalculate the cluster centers, and store the new cluster scheme C = \{C_1, C_2, \ldots, C_c\}.
Step 4: If (C* ≠ C) and the maximum step is not reached, go to Step 1. Output C.

Procedure 2 Recalculate the cluster centers
Step 1: Choose the nearest center v*_j for each element x_i \in X, and group x_i into the cluster C*_j whose center is v*_j.
Step 2: Calculate the data median for each C*_j as its new center, denote it as v_j, and output the cluster scheme C = \{C_1, C_2, \ldots, C_c\}.
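One merge step could look as follows; this is our simplification of Procedures 1 and 2 (crisp assignments via the largest membership, a single deletion per call), not the authors' implementation:

```python
import numpy as np

def merge_worst_cluster(X, V, U, w):
    """Drop the center of the cluster with the smallest local wsc_j = wsp_j * wcp_j,
    reassign every point to its nearest surviving center, and recompute each center
    as the data median of its members (Procedure 2)."""
    c = V.shape[0]
    labels = U.argmax(axis=0)
    wsc = np.full(c, np.inf)
    for j in range(c):
        mask = labels == j
        if mask.sum() < 2:                                   # singletons are not scored
            continue
        num = np.sum(w[mask] * U[j, mask] ** 2 * np.sum((X[mask] - V[j]) ** 2, axis=1))
        den = np.sum(w[mask] * U[j, mask] ** 2)
        wcp_j = num / den
        d2 = np.sum((V - V[j]) ** 2, axis=1)
        wsp_j = np.min(d2[np.arange(c) != j])
        wsc[j] = wsp_j * wcp_j
    worst = int(np.argmin(wsc))                              # "worst" = least wsc_j
    V_new = np.delete(V, worst, axis=0)
    d2 = np.sum((X[:, None, :] - V_new[None, :, :]) ** 2, axis=-1)
    new_labels = d2.argmin(axis=1)                           # nearest surviving center
    for j in range(V_new.shape[0]):
        members = X[new_labels == j]
        if len(members):
            V_new[j] = np.median(members, axis=0)            # Procedure 2, Step 2
    return V_new
```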

Let p be the fraction of data loaded in each PDA, d the number of PDAs required by WEFCM, and 1 ≤ r ≤ d. WEFCM follows these steps:
Step 1: The first PDA. Apply the EFCM clustering algorithm.
Step 2: Repeat: clustering is applied on the singleton points freshly loaded in the d-th PDA along with the c_opt weighted points obtained after condensation from the (d − 1)-th PDA. So, there will be n_d + c points in memory for clustering using EFCM, taking into account the weighted local and global measures. Until r = d.

In order to split a cluster, we adopt the "Greedy" technique [6]. The "Greedy" technique aims to initialize the cluster centers as far apart from each other as possible. In an iterative manner, the "Greedy" technique selects as a new cluster center the data vector which has the largest total distance from the existing cluster centers. Adapting the technique for cluster splitting yields the following procedure:

Procedure 3 Weighted Splitting-based Process
Step 1: Identify the cluster to be split. Supposing that the cluster number is j_0, its center and the set of all the data in the cluster are denoted by v_{j0} and E.
Step 2: Search in E for the data vector not labeled "tested" which has the maximal total distance from all of the remaining c − 1 cluster centers. This data vector is denoted by v_{j1}.
Step 3: Partition E into E_0 and E_1 based on the distance of each data vector from v_{j0} and v_{j1}. If |E_1| / |E| ≥ 10%, then v_{j1} is taken as the c-th cluster center; else label v_{j1} "tested" and go to Step 2.
Step 4: Search E for the data vector not labeled "tested" which has the maximal total distance from all of the c cluster centers. This data vector is denoted by v_{j2}.
Step 5: Partition E into E_1 and E_2 based on the distance of each data vector from v_{j1} and v_{j2}. If |E_2| / |E| ≥ 10%, then v_{j2} is taken as the (c + 1)-th cluster center; else label v_{j2} "tested" and go to Step 4.
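A condensed sketch of the splitting search follows; it is our simplification of Steps 2 and 3 of Procedure 3, and the second center v_j2 would be found in the same way:

```python
import numpy as np

def split_candidate(E, V, j0, frac=0.10):
    """Greedy search for one new center inside the "worst" cluster.
    E: data vectors of cluster j0, V: (c, M) current centers. Returns a candidate
    center, or None if every candidate attracts fewer than `frac` of E."""
    others = np.delete(V, j0, axis=0)                 # the c-1 centers of the unsplit clusters
    tested = np.zeros(len(E), dtype=bool)
    while not tested.all():
        # total distance of each untested vector to the existing centers
        tot = np.sum(np.linalg.norm(E[:, None, :] - others[None, :, :], axis=-1), axis=1)
        tot[tested] = -np.inf
        cand = int(np.argmax(tot))                    # farthest untested data vector
        v_new = E[cand]
        # partition E between the old center v_j0 and the candidate
        closer_to_new = (np.linalg.norm(E - v_new, axis=1)
                         < np.linalg.norm(E - V[j0], axis=1))
        if closer_to_new.mean() >= frac:              # at least 10% of E must follow it
            return v_new
        tested[cand] = True                           # label "tested" and try the next one
    return None
```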

This procedure ensures that the two new centers v_{j1} and v_{j2} are as far apart as possible from each other and from the c − 1 centers (of the unsplit clusters). In addition, a significant number of data vectors (10% of E) are required to be in the neighborhood of each center, so as to minimize the possibility of picking up an outlier.

For splitting clusters, we have introduced a weighted score function.

Definition 7. Weighted Score Function: Given a cluster scheme C = \{C_1, C_2, \ldots, C_c\} for a data set X = \{x_1, x_2, \ldots, x_N\}, for each C_j \in C, the weighted score function of C_j, denoted WS(j), is given by:

WS(j) = \frac{\sum_{i=1}^{N} w_i \, \mu_{ij}}{\text{number of data vectors in cluster } j}

Let p be the fraction of data loaded in each PDA, d the number of PDAs required by WFBSA, and 1 ≤ r ≤ d. WFBSA follows these steps:
Step 1: The first PDA. Apply the FBSA clustering algorithm.
Step 2: Repeat: clustering is applied on the singleton points freshly loaded in the d-th PDA along with the c_opt weighted points obtained after condensation from the (d − 1)-th PDA. So, there will be n_d + c points in memory for clustering using FBSA, taking into account the weighted local and global measures. Until r = d.
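A small sketch of the weighted score of Definition 7 (the crisp cluster sizes obtained from nearest-center labels are our assumption):

```python
import numpy as np

def weighted_scores(U, w, labels):
    """WS(j) = sum_i w_i * mu_ij / (number of data vectors in cluster j).
    U: (c, N) memberships, w: (N,) weights, labels: (N,) crisp cluster index per point.
    The cluster with the smallest WS(j) is the candidate to split."""
    c = U.shape[0]
    sizes = np.bincount(labels, minlength=c).astype(float)
    ws = (U * w).sum(axis=1) / np.maximum(sizes, 1.0)   # avoid division by zero for empty clusters
    return ws, int(np.argmin(ws))
```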

IV. EXPERIMENTATION

In this section, we present the data of the experimentation as well as the obtained results. As in [10], we used a reformulated optimization criterion R_m, which is mathematically equivalent to J_m, given by:

R_m(V) = \sum_{i=1}^{N} \left( \sum_{j=1}^{c} \left\| x_i - v_j \right\|^{\frac{2}{1-m}} \right)^{1-m}

The new formulation has the advantage that it does not require the membership matrix U and can be computed directly from the final cluster centroids. For large data sets, where the whole data cannot be loaded into memory, R_m can be computed by incrementally loading examples from the disk. Each experiment was conducted with 50 random initializations. Each data set was clustered using EFCM and FBSA while loading all the data into memory, and using WEFCM and WFBSA while loading a chosen percentage of the data into memory. For measuring quality, as in [10], we compute the mean R_m values of EFCM and WEFCM, and of FBSA and WFBSA. We compare quality by computing the difference between the mean R_m value of EFCM and WEFCM, and between the mean R_m value of FBSA and WFBSA, and then expressing it as a percentage. For example, if we denote the mean R_m value of the EFCM experiments as m_1 and the mean R_m value of the WEFCM experiments as m_2, then the Difference in Quality (DQ) is:

DQ = \left( \frac{m_2 - m_1}{m_1} \right) \times 100

So, a positive value means the average R_m of EFCM (resp. FBSA) is better (lower), while a negative value means the R_m of WEFCM (resp. WFBSA) is better. We also compare the speed-up obtained by WEFCM and WFBSA compared with EFCM and FBSA respectively. For EFCM and FBSA, as stated earlier, we load the entire data set into memory before clustering. Thus the speed-up reported in this paper is the minimum speed-up WEFCM and WFBSA can achieve compared to EFCM and FBSA respectively. This is because for very large data sets the time required by EFCM and FBSA becomes larger and larger due to disk accesses per iteration. Thus we will show that even if we have enough memory to load all the data, WEFCM and WFBSA will be significantly faster than EFCM and FBSA respectively while providing almost the same quality partition.
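R_m and DQ can be computed from the final centroids along these lines (the chunked accumulation stands in for the incremental loading from disk; the sketch is ours):

```python
import numpy as np

def reformulated_rm(X, V, m=2.0, chunk=100_000):
    """R_m(V) = sum_i ( sum_j ||x_i - v_j||^(2/(1-m)) )^(1-m), accumulated chunk by
    chunk so the whole data set never has to sit in memory at once."""
    total = 0.0
    for start in range(0, len(X), chunk):
        part = X[start:start + chunk]
        d2 = np.sum((part[:, None, :] - V[None, :, :]) ** 2, axis=-1)   # ||x_i - v_j||^2
        d2 = np.fmax(d2, 1e-12)
        inner = np.sum(d2 ** (1.0 / (1.0 - m)), axis=1)                 # inner sum over clusters
        total += float(np.sum(inner ** (1.0 - m)))                      # outer sum over examples
    return total

def difference_in_quality(m1, m2):
    """DQ = ((m2 - m1) / m1) * 100, with m1, m2 the mean R_m of the full-data
    and single-pass runs; a positive DQ favours the full-data algorithm."""
    return (m2 - m1) / m1 * 100.0
```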

A. Data sets

Four real data sets are used for the experimentation: ISOLET6, Pen Digits, MRI-1 (Magnetic Resonance Image), and MRI-2. The ISOLET6 data set is a subset of the ISOLET spoken letter recognition training set and has been prepared in the same way as in [8]. Six classes out of 26 were randomly chosen, and it consists of 1440 patterns with 617 features [8]. The optimal number of clusters in this data set is 6. The Pen-Based Recognition of Handwritten Digits data set (Pen Digits) consists of 3498 examples, 16 features, and 10 clusters [8]. The clusters are pen-based handwritten digits, with the ten digits 0 to 9. We clustered this data set into 10 clusters. The MRI-1 data set was created by concatenating 45 slices of MR images of the human brain of size 256x256 from modalities T1, PD, and T2. The magnetic field strength was 3 Tesla. After air was removed, there are slightly above 1 million examples (1,132,545 examples, 3 features). We clustered this data set into 9 clusters. The MRI-2 data set was created by concatenating 2 volumes of human brain MRI data. Each volume consists of 144 slices of MR images of size 512x512 from modalities T1, PD, and T2. The magnetic field strength is 1.5 Tesla. For this data set air was not removed, and the total size of the data set is slightly above 75 million examples (75,497,472 examples, 3 features). We clustered this data set into 10 clusters. The value of m used for fuzzy clustering was m = 2.

B. Experimental results

In this section, we present comparative studies of the proposed clustering algorithms (by merge and by splitting) on the previously presented data sets. If we load n% of the data in each PDA, the run is denoted by WEFCMn or WFBSAn. Experimental results on the four data sets are shown in Table I. For the two small data sets, Isolet6 and Pen Digits, with WEFCM10 (10% data loaded) we got excellent quality partitions over all 50 experiments with different random initializations, i.e. on average the difference in quality from EFCM is 0.21% and 0.14% for Isolet6 and Pen Digits respectively. With WFBSA10 (10% data loaded), we got excellent quality partitions over all 50 experiments with different random initializations, i.e. on average the difference in quality from FBSA is 0.45% and 0.19% for Isolet6 and Pen Digits respectively. We also performed experiments on these small data sets under a more stringent condition, i.e. loading as little as 1% for Isolet6 (loading only 15 of 1440 examples each time) and Pen Digits (loading only 35 of 3498 examples each time). As seen in Table I, with WEFCM1, for Isolet6 and Pen Digits we observed on average acceptable differences in quality of 3.16% and 2.91% respectively from EFCM over 50 experiments. With WFBSA1, we observed on average acceptable differences in quality of 3.87% and 3.05% for Isolet6 and Pen Digits respectively from FBSA over 50 experiments.

As our weighted merge-based and splitting-based processes are mainly meant for large or very large data sets, results on the Magnetic Resonance Image data set MRI-1, which contains 1,132,545 examples, help us better assess our algorithms. The results (Table I) show that WEFCM achieved excellent partitions whose average quality difference with EFCM over all 50 experiments with different random initializations was only 0.09% and 0.01% respectively for 1% (WEFCM1) and 10% (WEFCM10) data loaded. WFBSA achieved good partitions whose average quality difference with FBSA over all 50 experiments with different random initializations was only 0.17% and 0.03% respectively for 1% (WFBSA1) and 10% (WFBSA10) data loaded. So, for medium/large data sets, 1% data loaded may be enough to get an excellent partition, that is, with average quality almost the same as EFCM and FBSA.

The MRI-2 data set consists of 75,497,472 examples, and it takes about 6 to 7 days for EFCM and FBSA to complete each experiment. We clustered the whole data 10 times using EFCM and FBSA. So, we compare it with the first 10 experiments of WEFCM and WFBSA respectively, and thus the average results are computed on 10 experiments only. On this data set we loaded only 1% of the data for WEFCM and WFBSA.

The average quality differences of EFCM and FBSA with WEFCM1 and WFBSA1 are 0.0013% and 0.0029% respectively. EFCM on average took 132.65 hours, above 6 days, to partition the whole data set, while WEFCM took only 2.12 hours. Thus, the speed-up obtained is 62.57 times. The quality difference and speed-up obtained on this 75 million example data set were also excellent.

In summary, EFCM is widely used for other data partitioning purposes. Looking at the excellent quality achieved by WEFCM, besides using it for clustering large generic data sets, it can also be used to segment large volumes of image data quickly and, compared to EFCM, accurately. For example, MRI-1, which is an entire volume of MRI data of the human brain, is segmented into 9 clusters accurately in less than a minute on average, with a not especially fast processor. So, our WEFCM algorithm could help segment the huge amounts of data involved in 3D modeling. For example, to our knowledge, segmenting a whole MRI volume at once using multiple features (generally it is done slice by slice) is done rarely, if ever, but with our WEFCM this can be explored.

TABLE I
DIFFERENCE IN QUALITY OF MERGE-BASED (WEFCM) AND SPLITTING-BASED (WFBSA) PROCESSES

Process    Data loaded   Data set      Quality difference
by merge   WEFCM1        Isolet6       3.16%
                         Pen Digits    2.91%
                         MRI-1         0.09%
                         MRI-2         0.0013%
           WEFCM10       Isolet6       0.21%
                         Pen Digits    0.14%
                         MRI-1         0.01%
                         MRI-2         0.0001%
by split   WFBSA1        Isolet6       3.87%
                         Pen Digits    3.05%
                         MRI-1         0.17%
                         MRI-2         0.0029%
           WFBSA10       Isolet6       0.45%
                         Pen Digits    0.19%
                         MRI-1         0.03%
                         MRI-2         0.0004%

V. CONCLUSION

In this paper, we have proposed a new compression technique based on quality measures for clustering large data sets. The proposed technique takes into account two clustering processes: splitting-based and merge-based. For each process, we have proposed to evaluate the clustering quality globally and locally. The first evaluation is based on weighted global compactness and separation measures, which permit the validation of the clustering result. The second evaluation is based on, for the merge-based process, weighted local compactness and separation measures for determining the "worst" cluster to be merged.



For the splitting-based process, this evaluation is based on a weighted score function. For the two processes, we neither keep any complicated data structures nor use any complicated compression techniques, yet we achieved excellent quality partitions, with an average quality almost the same as EFCM and FBSA, by loading as little as 1% of the data for large data sets. The data compression technique used in this algorithm can be characterized as horizontal, since it acts on the data. As future work, we propose to further reduce the memory allocated to the data by vertical compression acting on the data set attributes. We propose to use Principal Component Analysis for the reduction of the clustering parameters. The methods should also be able to handle different data types.

REFERENCES
[1] S. Asharaf and M. Narasimha Murty, "An adaptive rough fuzzy single pass algorithm for clustering large data sets", Pattern Recognition, vol. 36, 2003.
[2] J. C. Bezdek, "Convergence Theory for Fuzzy C-Means: Counterexamples and Repairs", IEEE Trans. Syst., 873-877, 1987.
[3] P. S. Bradley, U. Fayyad and C. Reina, "Scaling Clustering Algorithms to Large Databases", in Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, KDD-1998, 9-15, 1998.
[4] C. Gupta and R. Grossman, "GenIc: A Single Pass Generalized Incremental Algorithm for Clustering", in Proceedings of the Fourth SIAM International Conference on Data Mining (SDM 04), 22-24, 2004.
[5] F. Farnstrom, J. Lewis and C. Elkan, "Scalability for Clustering Algorithms Revisited", ACM SIGKDD Explorations, vol. 2, 51-57, 2000.
[6] T. Gonzalez, "Clustering to minimize the maximum intercluster distance", Theor. Comput. Sci. 38, 293-306, 1985.
[7] S. Guha, R. Rastogi and K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases", in Proceedings of the ACM SIGMOD International Conference on Management of Data, 73-84, 1998.
[8] S. Hettich and S. Bay, The UCI KDD archive [http://kdd.ics.uci.edu], 1999.
[9] P. Hore, L. Hall and D. Goldgof, "A Cluster Ensemble Framework for Large Data sets", IEEE International Conference on Systems, Man and Cybernetics, 2006.
[10] P. Hore, L. Hall and D. Goldgof, "Single Pass Fuzzy C Means", IEEE International Conference on Fuzzy Systems, 847-852, 2007.
[11] D. W. Kim, K. H. Lee and D. Lee, "On cluster validity index for estimation of the optimal number of fuzzy clusters", Pattern Recognition 37(10), 2009-2025, 2004.
[12] R. Nikhil and J. C. Bezdek, "Complexity Reduction for Large Image Processing", IEEE Transactions on Systems, Man, and Cybernetics, Part B 32(5), 598-611, 2002.
[13] L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha and R. Motwani, "Streaming-Data Algorithms for High-Quality Clustering", in Proceedings of the IEEE International Conference on Data Engineering, March 2002.
[14] J. Richard and J. C. Bezdek, "Extending Fuzzy and Probabilistic Clustering to Very Large Data Sets", Journal of Computational Statistics and Data Analysis, 2006.
[15] M. Sassi, A. Grissa Touzi and H. Ounelli, "Two Levels of Extensions of Validity Function Based Fuzzy Clustering", 4th International Multiconference on Computer Science & Information Technology, Amman, Jordan, 223-230, 2006.
[16] M. Sassi, A. Grissa Touzi and H. Ounelli, "Using Gaussians Functions to Determine Representative Clustering Prototypes", 17th IEEE International Conference on Database and Expert Systems Applications, Poland, 435-439, 2006.
[17] H. Sun, H. Wang and Q. Jiang, "FCM-Based Model Selection Algorithms for Determining the Number of Clusters", Pattern Recognition, Vol. 37, Issue 10, 2027-2037, 2004.
[18] H. G. C. Traven, "A neural network approach to statistical pattern classification by semi-parametric estimation of probability density functions", IEEE Transactions on Neural Networks, vol. 2, 366-377, 1991.
[19] T. Zhang, R. Ramakrishnan and M. Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases", ACM SIGMOD International Conference on Management of Data, 103-114, 1996.