
Potential-Based Fuzzy Clustering and Cluster Validity for Categorical Data and its Application in Modeling Cultural Data

George E. Tsekouras
University of the Aegean, Department of Cultural Technology and Communication
Faonos & Harilaou Trikoupi Str., 81 100, Mytilene, Greece
Tel: +301-2251-0-36631, Fax: +301-2251-0-36609
gtsek@ct.aegean.gr

Abraam Kawa
University of the Aegean, Department of Cultural Technology and Communication
Faonos & Harilaou Trikoupi Str., 81 100, Mytilene, Greece
[email protected]

Evi Sampanikou
University of the Aegean, Department of Cultural Technology and Communication
Faonos & Harilaou Trikoupi Str., 81 100, Mytilene, Greece
e.sampanikou@ct.aegean.gr

Abstract - This paper introduces a novel hierarchical fuzzy algorithm for clustering categorical attributes, which consists of three basic design steps. It incorporates a potential-based clustering scheme with a cluster validity index into a framework that is based on the use of the weighted fuzzy c-modes. The novelty of the contribution lies in the following properties: (a) the potential-based clustering scheme reduces the dependence of the algorithm on initialization, (b) the weighted fuzzy c-modes provides flexibility in detecting the real data structure, and (c) the cluster validity index determines the appropriate number of clusters. The algorithm is applied to model (classify) cultural data related to a number of painters of the seventeenth century, where its performance is compared to the respective performance of an agglomerative hierarchical clustering algorithm.

0-7803-9122-5/05/$20.00 ©2005 IEEE

I. INTRODUCTION

Categorical data clustering (CDC) is an important operation in data mining. A common approach among the various CDC procedures is to use hierarchical clustering schemes, which are based on agglomerative clustering [1] or on the use of similarity [2] and dissimilarity measures [3]. Ralambondrainy [4] converted multiple categorical attributes into binary attributes by using 0 or 1 for the absence or presence of a category, respectively. He then treated these binary values as real differences and used them in the well-known c-means algorithm. However, a major drawback of this approach is that the number of binary values produced becomes very large when each attribute is described by many categories. To reduce the computational complexity of a CDC algorithm, Huang [5] used a simple matching dissimilarity measure and introduced the c-modes algorithm, which is an extension of the classical c-means algorithm. However, as with most clustering algorithms, the c-modes is very sensitive to initialization. In his later work [6], Huang generalized the c-modes approach by introducing the fuzzy c-modes. The existence of fuzziness in a clustering process exhibits two appealing features. Firstly, it provides a flexible representation of the data structure because each object belongs to more than one cluster with different degrees of participation. Secondly, it is able to model the uncertainty typically involved in a data set.

Despite the fact that fuzzy c-modes is a very fast and efficient method, it suffers from two major problems:

(a) Fuzzy c-modes is very sensitive to initialization.
(b) Fuzzy c-modes requires a priori knowledge of the number of clusters.

In order to cope with these two problems we propose a hierarchical fuzzy logic-based clustering algorithm, which consists of three steps. The first step, which attempts to solve the first problem, introduces a potential-based clustering scheme. This scheme provides an initial partition of the original data set, without using any random guesses but rather based on the real data structure. Thus, in this case, it is better to have something than nothing. In the second step, we develop an extended version of the fuzzy c-modes algorithm called weighted fuzzy c-modes. This step uses the above partition as the initial condition to generate a fuzzy partition of the data set. Finally, the second problem is solved in the third step of the algorithm, which utilizes the cluster validity index developed by Tsekouras in [7]. The algorithm is used to analyze cultural data related to the aesthetic judgment of painters, where it is compared to an agglomerative hierarchical clustering method.
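To make the remark about the binary conversion of [4] concrete, the following minimal Python sketch encodes a few categorical objects as 0/1 vectors and counts the resulting columns. The function name `binary_encode` and the toy objects are hypothetical and are not taken from [4]; the sketch only illustrates why the number of binary values grows with the number of categories per attribute.

```python
def binary_encode(objects):
    """0/1 encoding in the spirit of [4]: one binary column per (attribute, category)."""
    p = len(objects[0])
    # Categories observed for each attribute, sorted so the column order is deterministic.
    domains = [sorted({obj[j] for obj in objects}) for j in range(p)]
    encoded = [
        [1 if obj[j] == cat else 0 for j in range(p) for cat in domains[j]]
        for obj in objects
    ]
    return domains, encoded


# Hypothetical toy objects described by (technique, subject, palette).
paintings = [
    ("oil", "portrait", "dark"),
    ("oil", "landscape", "warm"),
    ("fresco", "religious", "warm"),
    ("tempera", "portrait", "cold"),
]

domains, encoded = binary_encode(paintings)
n_attributes = len(paintings[0])  # number of categorical attributes
n_binary = len(encoded[0])        # total number of categories over all attributes
print(n_attributes, n_binary)
```

Here three categorical attributes with three observed categories each already yield nine binary columns, whereas a matching-based measure works on the three original attributes directly.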

II. AGGLOMERATIVE HIERARCHICAL CLUSTERING

In this section we briefly describe the agglomerative hierarchical CDC approach developed in [3], which will be compared with the proposed algorithm. Let $X = \{x_1, x_2, \ldots, x_N\}$ be a set of categorical objects. Each object is described by a set of attributes $A_1, A_2, \ldots, A_p$. The $j$-th attribute $A_j$ $(1 \le j \le p)$ is defined on a domain of categories denoted as,

$$DOM(A_j) = \{a_{j1}, a_{j2}, \ldots, a_{jq_j}\} \qquad (1)$$

where $q_j$ is the number of categories assigned to $A_j$. Thus, the $k$-th categorical object $x_k$ $(1 \le k \le N)$ is described as $x_k = [x_{k1}, x_{k2}, \ldots, x_{kp}]$, with $x_{kj} \in DOM(A_j)$ $(1 \le j \le p)$. Let $x_k = [x_{k1}, x_{k2}, \ldots, x_{kp}]$ and $x_l = [x_{l1}, x_{l2}, \ldots, x_{lp}]$ be two categorical objects. Then, the matching dissimilarity between them is defined as [5],

$$D(x_k, x_l) = \sum_{j=1}^{p} \delta(x_{kj}, x_{lj}) \quad (1 \le k \le N,\ 1 \le l \le N,\ k \ne l) \qquad (2)$$

where $\delta(x, y) = 0$ if $x = y$ and $\delta(x, y) = 1$ if $x \ne y$.

weighted fuzzy c-modes. The main advantage of using the weighted fuzzy c-modes is that it provides a flexible representation of the data, since the contribution to the final fuzzy partition of each cluster center, coming from the first step, is determined by the respective weight value. The third step utilizes a cluster validity index, which decides the final (optimal) number of clusters.

Agglomerative hierarchical clustering (AHC) usually consists of iteratively applying the next three basic steps,


Step 1) Assignment of pattern vectors to clusters.
Step 2) Inter-cluster distance computation.
Step 3) Merging the closest clusters.
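The matching dissimilarity of Eq. (2) and the three generic steps above can be illustrated with the following minimal Python sketch. It is a generic agglomerative loop, not the specific algorithm of [3]: the function names are hypothetical, each cluster is represented by its attribute-wise mode (an assumption made here for simplicity), and clusters are compared with the plain matching dissimilarity between their modes.

```python
from collections import Counter
from itertools import combinations


def matching_dissimilarity(x, y):
    """Eq. (2): number of attributes on which two categorical objects differ."""
    return sum(a != b for a, b in zip(x, y))


def mode(objects):
    """Attribute-wise most frequent category (assumed cluster representative)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*objects))


def agglomerate(objects, n_clusters):
    """Generic AHC loop: start from singletons and merge the closest pair each pass."""
    clusters = [[obj] for obj in objects]  # Step 1: every object starts in its own cluster
    while len(clusters) > n_clusters:
        # Step 2: inter-cluster distance between the representatives of every pair
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: matching_dissimilarity(
                mode(clusters[ij[0]]), mode(clusters[ij[1]])
            ),
        )
        # Step 3: merge the closest pair (j > i, so removing j keeps index i valid)
        clusters[i].extend(clusters.pop(j))
    return clusters


# Hypothetical toy objects described by (technique, subject, palette).
data = [
    ("oil", "portrait", "dark"),
    ("oil", "portrait", "warm"),
    ("fresco", "religious", "warm"),
    ("fresco", "religious", "cold"),
]
result = agglomerate(data, n_clusters=2)
print(result)
```

On this toy set the two oil portraits end up in one cluster and the two frescoes in the other. The loop recomputes every pairwise distance on each pass, which keeps the sketch short but is not efficient for large data sets.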


Merging categorical data typically results in a composite representative object of their centroid. The algorithm described here consists of the following steps [3]:


Agglomerative Hierarchical Clustering Algorithm

Let the initial number of clusters equal the number of objects: $c = N$. Thus, initially, the cardinality of each cluster is $\mathrm{card}_i = 1$ $(1 \le i \le c)$.

Step 1) Compute the weighted dissimilarities between all pairs of clusters as follows,

$$\Delta_{ij} = D(v_i, v_j)\,\frac{\mathrm{card}_i\,\mathrm{card}_j}{\mathrm{card}_i + \mathrm{card}_j} \qquad (3)$$

where $1 \le i \le c$, $1 \le j \le c$, $i \ne j$, and $v_i$, $v_j$ are the respective cluster centers.

Step 2) Determine the mutual pair of clusters having the lowest weighted dissimilarity, and merge them to produce a new cluster:

$$v_{new} = \frac{\mathrm{card}_i\, v_i + \mathrm{card}_j\, v_j}{\mathrm{card}_i + \mathrm{card}_j} \qquad (4)$$

[Figure 1: flow sheet with three boxes in sequence: Potential-based fuzzy clustering; Weighted fuzzy c-modes; Cluster validity index]

Figure 1. The flow sheet of the proposed CDC algorithm.

A. Potential-Based Clustering

In this subsection we use the concept of the potential of a data point, which was introduced by Chiu in [8]. More specifically, we extend this concept to the categorical data case and develop an algorithm that is based on its properties. The potential of the $k$-th categorical object is defined as follows,

(5)

where (1