Clustering Large Symbolic Datasets

M. Narasimha Murty 1, T. Ravindra Babu 1, V. K. Agrawal 2

1 Department of Computer Science and Automation, Indian Institute of Science, Bangalore-560012
2 ISRO Satellite Centre, Bangalore-560017

The Electronic Journal of Symbolic Data Analysis - Vol. 3, N. 1 (2005), ISSN 1723-5081
Submitted: June 2004; Accepted: February 2005

Abstract

Clustering is the process of partitioning a set of labeled/unlabeled patterns into meaningful groups, so that patterns in each group/cluster are similar to each other in some sense and patterns in different clusters are dissimilar in a corresponding sense. A major outcome of the clustering process is an abstraction in the form of a description of the clusters; this abstraction can be useful in several decision-making situations, including classification. Clustering is extensively used in data mining, where a large set of patterns is routinely processed. Here, the size of the dataset is such that it does not fit in the main memory of the machine; typically, data is stored on a disk and read into memory for processing as needed. It is well known that disk access is costlier than memory access. Conventionally, each pattern is viewed as a vector of numbers; this representation is ideally suited for processing patterns using a neural network. However, there are several applications where this simple view is inadequate. For example, features may assume interval values or more complex structures; such patterns are called symbolic patterns. In this paper, we deal with clustering symbolic datasets, which forms an important component of symbolic data mining.

1 Introduction

Clustering is the process of organizing a collection of patterns into clusters based on similarity, such that the patterns within a cluster are as similar as possible and patterns across clusters are as dissimilar as possible [6]. Large data clustering refers to the clustering of datasets that are so large that they cannot be completely accommodated in the main memory. A pattern is usually characterized by a number of features, and each feature can assume binary or floating-point values. However, in many practical situations the features assume interval values or more complex structures, such as summarizations of huge databases by their concepts [3]. Such patterns are referred to as symbolic patterns.

Clustering is useful in many ways. It helps in achieving data compression or abstraction: through clustering, one obtains cluster centers or cluster representatives, and with an appropriate number of clusters, a small set of representatives provides a summary of a large dataset. Clustering the features helps achieve dimensionality reduction. Clustering supports the detection of outliers or noisy patterns, data re-organization and indexing. Clustering also helps in partitioning the dataset, which reduces the computational effort in further processing. Clustering of labeled patterns is also a very useful activity, as it helps in the classification of new patterns.

Large data clustering is a challenging area. The largeness can be in terms of either memory or processing time, and its definition changes as technology improves. For example, a few thousand patterns were referred to as large data in the 1960s [10], whereas currently there are requirements to cluster several million patterns as a single task [6]. Given a large symbolic or conventional dataset, the various stages of clustering include choosing grouping algorithms that are simple and efficient in terms of space and time, and classifying symbolic patterns using cluster descriptions that are themselves symbolic.

The current paper is organized as follows. Section 2 discusses the issues in large symbolic data clustering, Sections 3-6 describe strategies that address these issues along with experimental results, and Section 7 contains a summary and future work.

2 Issues in Large Symbolic Data Clustering

The following are the major issues in large symbolic data clustering that have applications in areas like data mining [8].

• Scalability: Many clustering algorithms work well on small datasets. In the case of large datasets, clustering a sample of the given dataset may lead to biased results. Thus the clustering algorithm should be scalable.

• Ability to deal with symbolic data: The algorithms are required to cluster data with different types of attributes, such as interval-valued, binary, categorical (nominal), ordinal, and mixtures of any of these types.

• Arbitrary cluster shapes: Clustering algorithms based on Manhattan or Euclidean distance measures tend to find spherical clusters of similar size and density. Algorithms should be developed to find clusters of arbitrary shape.

• Minimal requirements of domain knowledge: The clustering algorithm should be insensitive to input parameters based on domain knowledge, as such parameters make the quality of clustering difficult to control.

• Insensitivity to the order of input records: Many clustering results depend on the order in which the input data is presented. The algorithm should be insensitive to the order of inputs.

• Noisy data handling: The clustering algorithm should be insensitive to outliers and noisy patterns.

• Ability to handle high dimensionality: Large data clustering is also often characterized by high dimensionality. Whereas many clustering algorithms work well in low dimensions of 2 or 3, algorithms that handle large data should be able to operate in high-dimensional spaces.

• Constraint-based clustering: Clustering algorithms should be able to operate within specified constraints.

• Interpretability of clustering results: The results of a clustering algorithm should be usable and interpretable.

• Number of database scans: With large, high-dimensional data, the entire dataset may not fit in the main memory. The algorithms should require no more than two scans of the database; otherwise, the repeated disk accesses result in significant additional cost. For example, even linear-time and linear-space algorithms such as k-means may scan the dataset many times.

The following sections describe solutions to the issues raised above.

3 Incremental Clustering Algorithms

In an incremental clustering [6] or on-line clustering algorithm, each pattern is considered one at a time and assigned to one of the current clusters based on some criterion, such as the dissimilarity between the new pattern and the current cluster centroids. This activity is repeated until all input patterns are covered. The major advantage of such an algorithm is that it satisfies all the points discussed in Section 2. Some useful incremental algorithms are the leader clustering algorithm [7], the shortest spanning path (SSP) algorithm [11] and the COBWEB system [4]. The leader algorithm is used in the ART network [2]; it has O(nk) time complexity and O(k) space complexity.

To demonstrate the leader algorithm, handwritten (HW) digit data is considered. It consists of 10003 labeled, 192-feature binary patterns belonging to 10 categories, viz., the digits 0 to 9. Of this data, 6670 patterns, equally divided among the 10 categories, are used as training patterns and 3333 as test patterns, with approximately 333 patterns per class. Table 1 contains the results obtained using the leader algorithm. In the current experiments, domain knowledge of the handwritten data is used in choosing the dissimilarity thresholds and in combining the different classes of the training data. It can be seen from Table 1 that, as the dissimilarity threshold increases, the number of clusters and hence the number of cluster representatives decreases. The classification accuracy using the k-nearest neighbour classifier (k-NNC) is thus a function of the number of representatives.

Table 1: Results of the Leader Clustering Algorithm

Dissimilarity    No. of representative    Classification Accuracy
Threshold        patterns out of 6670     using k-NNC with k=5
2.4              4890                     92.16%
3.0              4750                     92.14%
4.0              3938                     91.69%
4.5              3362                     89.29%
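To make the single-pass nature of the leader algorithm concrete, the following Python sketch keeps the first pattern of each new cluster as its leader; the Euclidean distance, the threshold value and the random stand-in data are illustrative assumptions, not the exact settings of the experiments reported in Table 1.

import numpy as np

def leader_clustering(patterns, threshold):
    """Single-pass leader algorithm: O(nk) time and O(k) space for k leaders."""
    leaders = []                                  # one representative per cluster
    labels = []                                   # cluster index of each pattern
    for x in patterns:
        if leaders:
            dists = [np.linalg.norm(x - l) for l in leaders]
            nearest = int(np.argmin(dists))
        else:
            nearest, dists = -1, []
        if leaders and dists[nearest] <= threshold:
            labels.append(nearest)                # assign to an existing cluster
        else:
            leaders.append(x)                     # pattern becomes a new leader
            labels.append(len(leaders) - 1)
    return leaders, labels

# Hypothetical usage on 192-feature binary patterns (random stand-in data).
# The retained leaders, together with their class labels, can then serve as
# the reduced training set for a k-NNC.
data = np.random.randint(0, 2, size=(1000, 192)).astype(float)
reps, assignment = leader_clustering(data, threshold=4.0)
print(len(reps), "cluster representatives")

Because a pattern is compared only against the current leaders, the data is scanned exactly once, which is what makes the method attractive for disk-resident datasets.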



4 Divide and Conquer Strategy

In this strategy [6], the entire pattern matrix, containing n patterns with d features, is stored in secondary storage. The data is divided into a predefined number of blocks, p (p < n), each assumed to contain approximately n/p patterns. Each block is transferred to main memory and clustered into k clusters using a standard algorithm. By choosing one representative pattern per cluster, we obtain pk representatives. These representatives are further clustered into k clusters, and the cluster labels of the representative patterns are used to label the original pattern matrix. A two-level strategy of this kind for clustering a dataset containing 2000 patterns was described in [12].
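A minimal sketch of this two-level scheme is given below, assuming scikit-learn's KMeans as the standard in-memory algorithm and centroids as the representatives; the block count, k and the random stand-in data are illustrative choices, not those of [12].

import numpy as np
from sklearn.cluster import KMeans

def divide_and_conquer_cluster(data, p, k):
    """Cluster each of p blocks into k clusters, then cluster the
    p*k representatives into k final clusters and propagate the labels."""
    blocks = np.array_split(data, p)              # each block holds about n/p patterns
    reps, block_labels = [], []
    for block in blocks:
        km = KMeans(n_clusters=k, n_init=10).fit(block)
        reps.append(km.cluster_centers_)          # one representative per cluster
        block_labels.append(km.labels_)
    top = KMeans(n_clusters=k, n_init=10).fit(np.vstack(reps))
    # map every original pattern to the final label of its block-level representative
    final, start = np.empty(len(data), dtype=int), 0
    for b, labels in enumerate(block_labels):
        rep_final = top.labels_[b * k:(b + 1) * k]
        final[start:start + len(labels)] = rep_final[labels]
        start += len(labels)
    return final

# Hypothetical usage: 2000 patterns processed in 10 blocks, as if read block-wise from disk.
data = np.random.rand(2000, 16)
labels = divide_and_conquer_cluster(data, p=10, k=5)

Only one block of roughly n/p patterns needs to be in memory at a time, so the scheme scales to datasets that do not fit in main memory.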

5 Use of Intermediate Representation

In this method, we start with the initial representation of the entire data and build an intermediate representation such that the following properties are satisfied by the clustering algorithm.

• A single scan of the dataset is used.
• The structure can be used for pattern synthesis.
• The structure occupies less space than the original dataset and thus may fit in main memory.

An example of such a structure is the Frequent Pattern (FP) tree [9, 5], which is used for data mining tasks. The patterns and features are treated as transactions and items within transactions, respectively. The frequency of each item is computed, and the items are sorted in descending order of frequency; the frequent items in each transaction are then ordered according to this global frequency ordering. The Frequent Pattern-growth (FP-growth) method is devised on top of this structure and adopts a divide-and-conquer approach. In computing the frequent items, the concept of support is used [1]. Table 2 provides the classification results using the k-NNC (k=5) with the FP-tree; the handwritten (HW) data described in Section 3 is used. Observe that the number of nodes in the tree is a function of the support value.

Table 2: Classification of HW Data

Support    No. of nodes    Classification Accuracy
Value      in FP-Tree      with k-NNC, k=5
0.01       16580           92.44%
0.03       16496           92.50%
0.05       16359           92.56%
0.06       16275           92.71%
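The construction of the FP-tree from binary patterns can be sketched as follows; the node structure and the conversion of binary feature vectors into transactions are illustrative assumptions rather than the implementation used to produce Table 2.

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item = item              # feature index (None for the root)
        self.count = 0                # number of transactions sharing this prefix
        self.parent = parent
        self.children = {}            # item -> FPNode

def build_fp_tree(binary_patterns, min_support):
    """Treat each pattern as a transaction whose items are the indices of its
    1-valued features; keep items whose relative frequency is >= min_support
    and insert each transaction in global descending-frequency order."""
    n = len(binary_patterns)
    freq = defaultdict(int)
    for pattern in binary_patterns:
        for j, bit in enumerate(pattern):
            if bit:
                freq[j] += 1
    frequent = {j for j, c in freq.items() if c / n >= min_support}
    root = FPNode(None, None)
    for pattern in binary_patterns:
        items = sorted((j for j, bit in enumerate(pattern) if bit and j in frequent),
                       key=lambda j: -freq[j])
        node = root
        for j in items:               # transactions sharing a prefix share nodes
            if j not in node.children:
                node.children[j] = FPNode(j, node)
            node = node.children[j]
            node.count += 1
    return root

def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children.values())

Lower support values admit more items into the tree and therefore more nodes, which is the trend visible in Table 2.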


6 Compress and Cluster/Classify

Here, we compress the given data and cluster or classify the compressed data. The compressed data forms the symbolic data, which is used directly for clustering or classification. Consider the binary HW data described in Section 3. The following are the steps in classifying the data.

1. Compress the HW data by means of runs (run-length encoding).
2. Use the resulting symbolic data to cluster and/or classify. The distance computation procedure on the symbolic data is provided in Algorithm 1.
3. The compression and decompression procedures are completely lossless.

The results using the k-NNC are provided in Table 3. The CPU times in the table were obtained on a DEC AlphaServer DS10.

Algorithm 1: Computation of Distance with Compressed Data

Step 1: Read Pattern-1 into array a[1..n] and Pattern-2 into array b[1..m]
Step 2: Initialize i=1, j=1, runa=a[i], runb=b[j], distance=0
Step 3: WHILE-loop (from Step 4 to Step 8)
Step 4: If runa=0: (a) increment i, (b) if i > n, go to Step 9, (c) load a[i] into runa
Step 5: If runb=0: (a) increment j, (b) if j > m, go to Step 9, (c) load b[j] into runb
Step 6: If |i - j| is odd, increment distance by min(runa, runb)
Step 7: If runa >= runb: (a) subtract runb from runa, (b) set runb=0; Else: (a) subtract runa from runb, (b) set runa=0
Step 8: Go to Step 3
Step 9: Return distance

Table 3: Data Size and Processing Times

Description of data                  Training Data    Test Data    CPU Time (sec) of k-NNC
Original data as features            2574620          1286538      527.37
Compressed data in terms of runs     865791           432453       106.83
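A direct Python transcription of Algorithm 1 is sketched below, together with a small run-length encoder for binary patterns; the convention that runs alternate starting with the 0-valued bit (with a leading zero-length run when a pattern starts with 1) is an assumption made for this illustration.

def run_length_encode(bits):
    """Encode a binary sequence as alternating run lengths, starting with the
    run of 0s (a leading run of length 0 if the sequence starts with 1)."""
    runs, current, length = [], 0, 0
    for bit in bits:
        if bit == current:
            length += 1
        else:
            runs.append(length)
            current, length = bit, 1
    runs.append(length)
    return runs

def run_distance(a, b):
    """Algorithm 1: Hamming distance between two binary patterns computed
    directly on their run-length encodings a and b."""
    i, j = 0, 0
    runa, runb = a[0], b[0]
    distance = 0
    while True:
        if runa == 0:                       # current run of a exhausted
            i += 1
            if i >= len(a):
                return distance
            runa = a[i]
        if runb == 0:                       # current run of b exhausted
            j += 1
            if j >= len(b):
                return distance
            runb = b[j]
        if abs(i - j) % 2 == 1:             # overlapping runs carry opposite bits
            distance += min(runa, runb)
        if runa >= runb:                    # consume the shorter of the two runs
            runa, runb = runa - runb, 0
        else:
            runb, runa = runb - runa, 0

# Hypothetical usage on two short binary patterns.
p = [0, 0, 1, 1, 0, 1]
q = [0, 1, 1, 1, 0, 0]
print(run_distance(run_length_encode(p), run_length_encode(q)))   # prints 2, their Hamming distance

Since only run lengths are touched, each distance computation is linear in the number of runs rather than in the number of bits, which is consistent with the reduced k-NNC CPU time reported in Table 3.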

7 Conclusions and Future Work

Basic concepts of large data, data clustering and symbolic data have been presented, and the issues in large symbolic data clustering have been highlighted. Different solutions satisfying these requirements have been described, along with experimental results where appropriate.

References

[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules", In Proc. 1994 Int. Conf. on Very Large Data Bases (VLDB'94), pages 487-499, Santiago, Chile, September 1994.

[2] G. Carpenter and S. Grossberg, "ART3: Hierarchical Search Using Chemical Transmitters in Self-Organizing Pattern Recognition Architectures", Neural Networks, 3:129-152, 1990.

[3] E. Diday, "An Introduction to Symbolic Data Analysis and the Sodas Software".

[4] D. Fisher, "Knowledge Acquisition via Incremental Conceptual Clustering", Machine Learning, 2:139-172, 1987.

[5] B. C. M. Fung, "Hierarchical Document Clustering Using Frequent Itemsets", M.Sc. Thesis, Simon Fraser University, September 2002.

[6] A. K. Jain, M. N. Murty and P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, 31(3):264-323, September 1999.




[7] A. K. Jain, R. C. Dubes and C. C. Chen, "Bootstrap Techniques for Error Estimation", IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 9, No. 5, pp. 628-633, 1987.

[8] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, August 2000.

[9] J. Han, J. Pei and Y. Yin, "Mining Frequent Patterns without Candidate Generation", In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD'00), Dallas, Texas, USA, May 2000.

[10] G. J. S. Ross, "Classification Techniques for Large Sets of Data", In A. J. Cole, editor, Numerical Taxonomy, Academic Press, New York, 1968.

[11] J. R. Slagle, C. L. Chang and S. R. Heller, "A Clustering and Data-Reorganizing Algorithm", IEEE Trans. Systems, Man and Cybernetics, 5:125-128, 1975.

[12] H. Stahl, "Cluster Analysis of Large Data Sets", In W. Gaul and M. Schader, editors, Classification as a Tool of Research, pp. 423-430, Elsevier, Amsterdam, 1986.
