DESCRY: A Density-Based Clustering Algorithm for Very Large Data Sets

Fabrizio Angiulli, Clara Pizzuti, Massimo Ruffolo
ICAR-CNR c/o DEIS, Università della Calabria, 87036 Rende (CS), Italy
{angiulli,pizzuti,ruffolo}@icar.cnr.it

Abstract. A novel algorithm, named DESCRY, for clustering very large multidimensional data sets with numerical attributes is presented. DESCRY discovers clusters having different shape, size, and density, even when the data contains noise, by first finding and clustering a small set of points, called meta-points, that well depict the shape of the clusters present in the data set. Final clusters are obtained by assigning each point to one of the partial clusters. The computational complexity of DESCRY is linear both in the data set size and in the data set dimensionality. Experiments show that the qualitative results obtained are very good and comparable with those obtained by state-of-the-art clustering algorithms.

1 Introduction

Clustering is an unsupervised data analysis technique that consists in partitioning large sets of data objects into homogeneous groups [5, 2, 13]. All objects contained in the same group have similar characteristics, where similarity is computed using suitable coefficients. Clustering is used to discover meaningful structures in data when no predefined knowledge is available. The importance of this technique has been recognized in different fields such as sociology, biology, statistics, artificial intelligence and information retrieval. In the last few years clustering has been identified as one of the main data mining tasks [6].

Generally, an object (tuple) is described by a set of d features (attributes) and can be represented by a d-dimensional vector. Thus, if the data set contains N objects, it can be viewed as an N × d matrix, whose rows correspond to the objects and whose columns correspond to the features. Each object can then be considered as a point in a d-dimensional space, where each of the d attributes is one of the axes of the space.

In this paper we present a novel algorithm, named DESCRY, that finds clusters having different shape, size, and density in very large high-dimensional data sets with numerical attributes, even when the data contains noise. The method combines hierarchical, density-based and grid-based approaches to clustering by exploiting the advantages of each of them. DESCRY considers clusters as arbitrarily shaped dense regions of objects in the data space, separated by regions of low density that represent noise.

Given a data set D of N objects in the d-dimensional space, the number K of clusters to find, and the minimum number F of points that a region must contain to be considered dense, DESCRY finds the K most homogeneous groups of objects in D by first finding a small set of points, called meta-points, that well depict the shape of the clusters present in the data set. To obtain the meta-points, the search space is recursively divided into a finite number of non-overlapping rectangular regions. The partitioning of each region continues as long as the number of points it contains is above the threshold value F.

DESCRY consists of four phases: sampling, partitioning, clustering and labelling. Sampling allows the algorithm to efficiently manage very large data sets. Partitioning obtains the meta-points, which are well spread in the data space and representative of the clusters actually present in the data set. Clustering groups the meta-points by means of a hierarchical agglomerative clustering method and builds the partial clusters. Finally, labelling assigns each point to one of the partial clusters and obtains the final clusters. The computational complexity of DESCRY is linear both in the data set size and in the data set dimensionality, and the qualitative results obtained are very good and comparable with those of state-of-the-art clustering algorithms, as confirmed by the reported experimental results.

The rest of the paper is organized as follows. Section 2 gives a detailed description of the DESCRY algorithm. Section 3 briefly surveys existing clustering methods related to DESCRY and points out the main differences with our approach. Finally, Section 4 shows the effectiveness of the proposed algorithm on some data sets.

2 The DESCRY Algorithm

In this section the DESCRY algorithm is presented. Let D be a data set of N data points (or objects) in R^d, the d-dimensional Euclidean space. We denote by ID the minimum bounding rectangle (MBR) of D, that is ID = [lb1, ub1] × · · · × [lbd, ubd], where lbi (ubi) denotes the minimum (maximum) value of the i-th coordinate among the N points of D. Given the number K of clusters to find and the number F of points that a region of the data set must contain to be considered dense, we formulate the problem of finding K clusters in D as the problem of finding the K most homogeneous regions of ID, according to a suitable similarity metric. DESCRY consists of four phases: sampling, partitioning, clustering and labelling. Next, we give a detailed description of the four steps composing the algorithm.
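For concreteness, the following is a minimal sketch (not taken from the authors' implementation; class and method names are illustrative) of how the minimum bounding rectangle ID = [lb1, ub1] × · · · × [lbd, ubd] of a data set stored as an N × d array can be computed in a single pass.

import java.util.Arrays;

// Minimal sketch: computing the minimum bounding rectangle of an N x d data set.
public final class BoundingRectangle {
    // Returns { lb, ub }, where lb[i] (ub[i]) is the minimum (maximum) value
    // of the i-th coordinate over all points.
    public static double[][] mbr(double[][] points) {
        int d = points[0].length;
        double[] lb = new double[d];
        double[] ub = new double[d];
        Arrays.fill(lb, Double.POSITIVE_INFINITY);
        Arrays.fill(ub, Double.NEGATIVE_INFINITY);
        for (double[] p : points) {
            for (int i = 0; i < d; i++) {
                lb[i] = Math.min(lb[i], p[i]);
                ub[i] = Math.max(ub[i], p[i]);
            }
        }
        return new double[][] { lb, ub };
    }
}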

2.1 Sampling

A random sample S of n points is drawn from D. Vitter's Algorithm Z [19] was used to extract S. Algorithm Z performs the sampling in one pass over the data set, using constant space and O(n(1 + log(N/n))) time, where n denotes the size of the sample.

As already discussed in [7], some techniques can be used to determine a size n of the sample for which the probability of missing clusters is low. This phase allows the algorithm to efficiently manage very large data sets and performs a preliminary filtering of the noise present in the data set.
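As an illustration of the one-pass sampling idea, the following is a minimal sketch of reservoir sampling. It implements the simpler Algorithm R rather than Vitter's optimized Algorithm Z used by DESCRY, so it inspects every record, but it likewise draws a uniform sample of size n in a single pass using constant space; names and types are illustrative.

import java.util.Iterator;
import java.util.Random;

// Sketch of one-pass uniform sampling (Algorithm R); Vitter's Algorithm Z, used
// by DESCRY, additionally skips records to reach O(n(1 + log(N/n))) time.
public final class Reservoir {
    public static double[][] sample(Iterator<double[]> stream, int n, Random rnd) {
        double[][] reservoir = new double[n][];
        int seen = 0;
        while (stream.hasNext()) {
            double[] p = stream.next();
            if (seen < n) {
                reservoir[seen] = p;           // fill the reservoir with the first n points
            } else {
                int j = rnd.nextInt(seen + 1); // keep p with probability n/(seen+1)
                if (j < n) reservoir[j] = p;
            }
            seen++;
        }
        return reservoir;                      // assumes the stream holds at least n points
    }
}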

2.2 Partitioning

The random sample S of size n, computed in the sampling step, constitutes the input to the partitioning step. Partitioning consists in dividing the search space into a finite number of non-overlapping rectangular regions by selecting a dimension a and a value c in this dimension, and splitting IS (the MBR of S) using (d−1)-dimensional hyperplanes, such that all the data points having a value in the split dimension smaller than the split value c are assigned to the first partition, whereas the other data points form the second partition. Hyperplanes are parallel to the axes, and are placed so that about the same number of elements lies on both sides.

In order to efficiently partition the space IS, an adaptive k-d-tree like data structure [17, 16] was used. The adaptive k-d-tree is a binary tree that represents a recursive subdivision of the space IS into subspaces. Each node p of the k-d-tree has an associated set of points Dp. Initially, the tree consists of a single root node p, with Dp coinciding with the sample set S. Each partitioning step splits the points Dp, associated with a leaf node p of the k-d-tree, into two disjoint subsets Dq and Dr, associated with two new child nodes q and r of p, respectively. The splitting process continues as long as there exists a set of points Dp, associated with a leaf node p of the tree, such that |Dp| is above the user-provided threshold F. During each split there are two choices to make: the splitting axis a, and the splitting coordinate c along the axis a. We used two different functions to choose the splitting axis. The first, very simple, consists in alternating one axis per iteration, with cost O(1). The second consists in finding the axis a that maximizes the variance of the projections of the points on a, with cost O(ρd), where ρ is the number of points in the region. The splitting coordinate c along the axis a is then chosen by taking the mean or the median of the projections of the points on a, with cost O(ρ). At the end of the partitioning process the leaf nodes of the tree are associated with regions of the space containing almost the same number F of points. Their number m is approximately equal to n/F. Thus, in the average case, the time cost of the partitioning step is O(n log(n/F)), or O(nd log(n/F)), depending on the adopted strategy for selecting the splitting axis.

The centers of gravity of the points contained in each leaf node, called meta-points, are then computed and stored in an appropriate data structure. Meta-points constitute a small set of points, well scattered in the denser regions of the search space, that are representative of the true clusters present in the data set. This phase mitigates the undesirable presence of bridges of outliers. Notice that the leaf nodes of the tree represent regions of IS containing approximately the same number F of points, though the distances among points contained in the same node can be very different with respect to those of other leaf nodes. Thus, the density of the data set in these regions can be considered inversely proportional to the volume of each region.
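The following is a minimal sketch of the partitioning step under stated assumptions: the splitting axis is chosen by simple alternation, the split value is the median of the projections, and a region is split as long as it contains more than F points; the centroid of each final region is returned as a meta-point. Class and method names are illustrative and not taken from the authors' implementation; the variance-based axis selection described above would simply replace the (axis + 1) % d alternation.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch of the k-d-tree based partitioning: recursive median splits until every
// region holds at most F points; each leaf contributes its center of gravity as a
// meta-point. Assumes F >= 1.
public final class Partitioner {
    public static List<double[]> metaPoints(double[][] sample, int F) {
        List<double[]> out = new ArrayList<>();
        split(sample, 0, F, out);
        return out;
    }

    private static void split(double[][] region, int axis, int F, List<double[]> out) {
        if (region.length <= F) {              // leaf: emit its centroid as a meta-point
            out.add(centroid(region));
            return;
        }
        int d = region[0].length;
        double[][] sorted = region.clone();
        Arrays.sort(sorted, Comparator.comparingDouble((double[] p) -> p[axis]));
        int mid = sorted.length / 2;           // median split: balanced children
        split(Arrays.copyOfRange(sorted, 0, mid), (axis + 1) % d, F, out);
        split(Arrays.copyOfRange(sorted, mid, sorted.length), (axis + 1) % d, F, out);
    }

    private static double[] centroid(double[][] pts) {
        double[] c = new double[pts[0].length];
        for (double[] p : pts)
            for (int i = 0; i < c.length; i++) c[i] += p[i];
        for (int i = 0; i < c.length; i++) c[i] /= pts.length;
        return c;
    }
}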

Fig. 1. Example of a k-d-tree in a 2-dimensional space.

We distinguish between three basic cases:

1. The points are uniformly distributed in the region: the associated meta-point falls in the center of the region, since the center of gravity of the points coincides with the geometric center of the region;
2. There exist at least two significantly populated and well separated subregions in the region: the associated meta-point falls in the center of gravity of the two subregions, which could coincide with the geometric center of the region;
3. There exists a highly populated subregion and few outlying points in the region: the associated meta-point falls into the highly populated subregion.

We discuss in the following section how the position of the meta-point affects the quality of the partial clusters obtained in the clustering step.

Figure 1 shows the structure of the k-d-tree, obtained using the BuildKDTree procedure, for a 2-dimensional data set containing 24 points, with the threshold value set to 2. In the figure, points are represented using a small circle, while meta-points are represented using an asterisk. Using the k-d-tree based partitioning method, DESCRY considers more than one representative meta-point per cluster; this enables the algorithm to recognize clusters with non-spherical shape and reduces the influence of differences in cluster density.

2.3 Clustering

The m meta-points M1, . . . , Mm, computed in the partitioning step, constitute the input of the clustering step. The goal of the clustering step is to arrange the m meta-points into K homogeneous groups, or partial clusters, C1, . . . , CK. The final clusters can then be obtained from these partial clusters with a final labelling step that assigns each point of the data set to the closest cluster, at a low cost. To perform clustering we used a hierarchical agglomerative clustering algorithm [11]. An agglomerative algorithm receives as input the number K of desired clusters and a set M1, . . . , Mm of m points to group.

A similarity metric σ must be defined between the elements of the set M1, . . . , Mm, i.e. a function σ(Mi, Mj) returning a real number (1 ≤ i ≤ j ≤ m). Each hierarchical agglomerative clustering algorithm is characterized by a different similarity metric Σ on subsets of M1, . . . , Mm, where Σ is formulated as a function of σ. Hierarchical agglomerative algorithms all have the same basic structure: first, assign each element M1, . . . , Mm to its own cluster C1, . . . , Cm, respectively (call this set of clusters P); then, while the number of clusters is greater than K, choose the pair of clusters Ci, Cj of P scoring the maximum value of similarity Σ(Ci, Cj), delete the clusters Ci and Cj from P, and add the new cluster Ci ∪ Cj to P. In general, hierarchical agglomerative clustering algorithms can be executed in time O(dm^2), for suitable choices of the similarity metric.

In the implementation of DESCRY we used the single linkage clustering algorithm [11], and the Euclidean distance between points as similarity metric. In the single linkage algorithm the similarity Σ(X, Y) between two sets of objects X and Y (X, Y ⊆ {M1, . . . , Mm}) is defined as follows: Σ(X, Y) = min{σ(x, y) | x ∈ X, y ∈ Y}. The main advantage of this method is its fast computability (if the number of points is not large), because of its close relationship to the minimum spanning tree (MST) of the objects. Given a weighted graph G, the MST of G is a minimum weight tree, obtained from the edges of G, and containing all the vertices of G. The MST of a set of objects M1, . . . , Mm, with a similarity metric σ, is thus the MST of the graph with vertices M1, . . . , Mm, and edges (Mi, Mj) of weight σ(Mi, Mj) (1 ≤ i < j ≤ m). Sorting the edges of an MST by length in non-decreasing order reflects the sequence of merging steps of the single linkage algorithm. Thus, once the MST of the m objects is given, it is possible to perform the single linkage clustering in O(m log m) time, by sorting the m − 1 edges of the MST and then merging points following the order induced by the sorted edges.

It remains to state the complexity of the computation of the MST. When M1, . . . , Mm are points in R^d and σ is the Euclidean distance, we have a Euclidean MST. If the dimension d of the data set is low or fixed, then the Euclidean MST of a general data set of m points can be computed in O(m log m) time (note that d does not appear in this expression, as its contribution can be considered negligible); thus, in this case the complexity of the pre-clustering phase is O(m log m). For high-dimensional data sets, the Euclidean MST can be computed in O(dm^2) worst case time, thus leading to a worst case time complexity for the pre-clustering phase of O(dm^2). We note that the number m is very small w.r.t. the size N of the entire data set. In fact, as stated in the previous section, the value of m is approximately equal to n/F, and n, the size of the random sample, is such that n ≪ N. In addition, we will see in Section 4 that, for most data sets, the number m of meta-points can be considered a small fixed constant.

We point out that, although in the implementation of DESCRY we used the usual Euclidean distance as σ and the single linkage method to cluster points, the DESCRY algorithm is parametric w.r.t. the similarity metric σ, and also w.r.t. the hierarchical agglomerative clustering method. In particular, these choices determine the notion of "what can be considered a cluster".
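To make the pre-clustering step concrete, here is a minimal sketch under stated assumptions: single linkage over the m meta-points, realized by building the Euclidean MST with Prim's algorithm in O(dm^2) time and then cutting the K − 1 longest MST edges, which yields the same K groups the agglomerative procedure would produce. Class and method names are illustrative, not the authors' code; a faster MST construction could be substituted for Prim's algorithm in the low-dimensional case.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch of single-linkage clustering of the meta-points via the Euclidean MST:
// build the MST with Prim's algorithm, keep its m-K shortest edges (equivalently,
// cut the K-1 longest ones), and label the resulting connected components.
public final class SingleLinkage {
    public static int[] cluster(double[][] meta, int K) {   // assumes 1 <= K <= m
        int m = meta.length;
        int[] parent = new int[m];          // MST edge (parent[v], v) for v > 0
        double[] best = new double[m];      // best[v] = length of that edge
        boolean[] inTree = new boolean[m];
        Arrays.fill(best, Double.POSITIVE_INFINITY);
        best[0] = 0.0;                      // vertex 0 is the root of the MST
        for (int it = 0; it < m; it++) {    // Prim's algorithm, O(d m^2)
            int u = -1;
            for (int v = 0; v < m; v++)
                if (!inTree[v] && (u == -1 || best[v] < best[u])) u = v;
            inTree[u] = true;
            for (int v = 0; v < m; v++) {
                if (inTree[v]) continue;
                double dist = euclidean(meta[u], meta[v]);
                if (dist < best[v]) { best[v] = dist; parent[v] = u; }
            }
        }
        // Merge along the m-K shortest MST edges with a union-find structure.
        Integer[] order = new Integer[m];
        for (int v = 0; v < m; v++) order[v] = v;
        Arrays.sort(order, (a, b) -> Double.compare(best[a], best[b]));
        int[] uf = new int[m];
        for (int v = 0; v < m; v++) uf[v] = v;
        int kept = 0;
        for (int idx = 0; idx < m && kept < m - K; idx++) {
            int v = order[idx];
            if (v == 0) continue;           // the root has no incoming MST edge
            uf[find(uf, v)] = find(uf, parent[v]);
            kept++;
        }
        int[] labels = new int[m];
        Map<Integer, Integer> remap = new HashMap<>();   // component root -> label in [0, K)
        for (int v = 0; v < m; v++) {
            int root = find(uf, v);
            if (!remap.containsKey(root)) remap.put(root, remap.size());
            labels[v] = remap.get(root);
        }
        return labels;
    }

    private static int find(int[] uf, int x) {
        while (uf[x] != x) { uf[x] = uf[uf[x]]; x = uf[x]; }   // path halving
        return x;
    }

    private static double euclidean(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}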

Using the Euclidean distance as σ means that a cluster is a densely populated region, while using single linkage means that these regions are arbitrarily shaped. Indeed, the single link algorithm is the most versatile hierarchical agglomerative clustering algorithm, as it is able to extract, for example, concentric clusters. Nevertheless, the single link algorithm usually suffers from the presence of "bridges" of outliers connecting two clusters. We now show, as the experimental results confirm, that the partitioning step of the DESCRY algorithm greatly mitigates this undesirable effect.

Consider the three different basic types of regions R associated with the leaves of the tree computed in the partitioning step, described at the end of the previous section. If R is a region of type (1), then either R belongs to a cluster or R is a noisy region. In the former case, the volume of R is small and comparable to the volume of its neighboring regions belonging to the cluster, thus the meta-point of R is most likely to be clustered with the meta-points of these regions in the initial iterations of the agglomerative algorithm. In the latter case, the volume of R is large, and thus its meta-point is most likely to be reached in the final iterations of the clustering algorithm; in general it is not a good candidate to be the nearest point of a point belonging to a cluster. Regions of type (2) are more critical, but in general we can state that, if the volume of R is small and comparable with those of the neighboring regions, then the two highly populated and well separated subregions of R do not really represent two separated clusters, but rather a fluctuation in the density of the overall cluster containing them; otherwise the volume of R is large and the associated meta-point behaves as an outlier. Finally, as for regions of type (3), the position of the meta-point practically erases the presence of outliers. Note that, by the above discussion, the present implementation of DESCRY is able to isolate clusters having significant variance in density, provided that they are well separated or separated by noise.

2.4 Labelling

During the labelling step each point of the original data set D is assigned to one of the K partial clusters. In particular, let p be a point of D, and let Mi (1 ≤ i ≤ m) be the meta-point closest to p. Then p is assigned the label l, where Cl is the cluster to which Mi belongs. This step requires a single scan of the data set, and can be performed in time O(N log m), for low dimensional data sets, by storing the m meta-points in an appropriate data structure T, such as a k-d-tree or an R-tree, and then performing a nearest-neighbor query on T for each point of the data set, or in time O(N md), for high-dimensional data sets, by comparing each point of the data set with each meta-point.
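A minimal sketch of the high-dimensional variant of this step (the O(Nmd) linear scan; an index over the meta-points would give the O(N log m) variant) follows; names are illustrative and not the authors' code.

// Sketch of the labelling step: each data set point inherits the partial-cluster
// label of its nearest meta-point, found by a linear scan over the m meta-points.
public final class Labeller {
    public static int[] label(double[][] data, double[][] meta, int[] metaLabels) {
        int[] labels = new int[data.length];
        for (int i = 0; i < data.length; i++) {
            int nearest = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int j = 0; j < meta.length; j++) {
                double s = 0.0;                  // squared Euclidean distance suffices for the argmin
                for (int k = 0; k < meta[j].length; k++) {
                    double diff = data[i][k] - meta[j][k];
                    s += diff * diff;
                }
                if (s < bestDist) { bestDist = s; nearest = j; }
            }
            labels[i] = metaLabels[nearest];     // label of the partial cluster of the nearest meta-point
        }
        return labels;
    }
}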

2.5 Time and Space Complexity

We now state the space and time complexity of the DESCRY algorithm. As for the space complexity, the algorithm needs O(nd) space to build the k-d-tree representing the partitioning of the sample S of n points.

Meta-points require O(md) space. Finally, the hierarchical agglomerative clustering can be performed using O(m) space. As m ≈ n/F, the space required by the algorithm is O(nd), i.e. it is linear in the size of the sample. We recall that the size of the input data set is O(N d) and that n ≪ N.

The time required by DESCRY is the sum of the times required by the sampling, partitioning, pre-clustering, and labelling steps. The actual implementation of the sampling step requires time O(n(1 + log(N/n))). The partitioning step requires time O(dn log m). The pre-clustering step has time complexity O(dm^2), as we used a hierarchical agglomerative clustering algorithm. Finally, the labelling process has complexity O(N md). Thus, simplifying, we obtain an overall time complexity of O(N md). Furthermore, if we consider low dimensional data sets, or data sets with fixed dimensionality, the complexity reduces to O(N log m), provided that N ≥ n log(N/n)/ log m. We point out that, when the size n of the random sample increases, for most data sets the value of the population threshold F can be simultaneously increased without losing clustering quality. Thus, the ratio n/F, i.e. the number of meta-points m to consider, can be considered a fixed constant. This leads to a final time complexity of O(N d), for high-dimensional data sets, or O(N), for low dimensional data sets, linearly related, by the small constant m or log m, to the data set size N and to the data set dimensionality d. We can conclude that DESCRY is very fast, as it scales linearly both w.r.t. the size N and the dimensionality d of the data set, and that it outperforms existing clustering algorithms. We will see in Section 4 that, despite its low time complexity, DESCRY guarantees a very good clustering quality.
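Summing the per-step costs gives a compact restatement of the argument above (written here in LaTeX notation, using the paper's quantities N, n, d and m ≈ n/F):

T_{\mathrm{DESCRY}} =
  \underbrace{O\bigl(n(1+\log\tfrac{N}{n})\bigr)}_{\text{sampling}}
  + \underbrace{O(dn\log m)}_{\text{partitioning}}
  + \underbrace{O(dm^{2})}_{\text{pre-clustering}}
  + \underbrace{O(Nmd)}_{\text{labelling}}
  = O(Nmd),

since n ≪ N and m is treated as a small constant; with a spatial index over the meta-points the labelling term becomes O(N log m) for low dimensional data.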

3 Related Work

In the last few years a lot of effort has been made to realize fast clustering algorithms for large data sets. Many surveys and books have been written on clustering [5, 13, 2, 11]. Clustering methods can be classified into four categories: partitioning methods, hierarchical methods, density-based methods and grid-based methods [8]. The best known algorithms of each category are k-means [14], k-medoid [13], CLARA [13], and CLARANS [15] for partitioning; CURE [7], CHAMELEON [12] and BIRCH [21] for hierarchical; DBSCAN [4], DENCLUE [9] and OPTICS [3] for density-based; STING [20], WaveCluster [18], CLIQUE [1] and OptiGrid [10] for grid-based.

DESCRY combines the hierarchical, density-based, and grid-based approaches by exploiting the advantages of each of them. Being a density-based method, it can discover clusters of arbitrary shape. DESCRY quantizes the space into a finite number of regions, like the grid-based approach, but the grid structure is not fixed in advance: it is dynamically built on the basis of the random sample extracted from the data set. Because of the partitioning technique adopted, DESCRY can find clusters even when the number of dimensions is high. Finally, the use of a hierarchical agglomerative method on a small number of points, the meta-points, instead of the overall data set provides a very efficient and fast method for very large data sets.

Among the widely known clustering algorithms, the most closely related to DESCRY are BIRCH [21] and CURE [7]. Next, we briefly survey these algorithms and highlight the main differences with respect to our approach.

BIRCH incrementally and dynamically groups incoming multi-dimensional metric data points, trying to produce the best quality clustering with the available resources and with a single scan of the data. The algorithm makes use of a new data structure, the CF tree, whose elements are Clustering Features, i.e. short summaries representing a cluster. The radius of the clusters associated with the leaf nodes, which represent a partitioning of the data set, has to be less than a threshold T. After the CF tree has been built, the clusters associated with the leaf nodes of the tree are clustered by an agglomerative hierarchical clustering algorithm using a distance metric. The following main differences between DESCRY and BIRCH can be pointed out: BIRCH works on the entire data set, while DESCRY performs sampling; the partitioning of BIRCH is introduced in order to obtain small clusters, i.e. regions of the space of bounded diameter, while DESCRY partitions the space into equally populated regions; BIRCH labels data set points using the centroids of the clusters obtained, and if the clusters are not spherical in shape then BIRCH does not perform well, because it uses the notion of radius to control the boundary of a cluster, while DESCRY uses all the meta-points belonging to the partial clusters obtained, so its clusters can be arbitrarily shaped.

CURE identifies clusters having non-spherical shapes and wide variance in size. It achieves this by representing each cluster with a certain fixed number of points that are generated by selecting well scattered points from the cluster, and then shrinking them toward the center of the cluster by a user-specified fraction. The algorithm (a) obtains a random sample of the data set, (b) partitions the sample into a set of partitions, or disjoint samples, (c) performs a pre-clustering of the samples using the BIRCH algorithm, (d) eliminates outliers, (e) clusters the pre-clusters, and (f) assigns each data set point to one of the final clusters obtained. The clustering is performed using an agglomerative procedure that works on representative points, i.e. each cluster is represented by c points uniformly distributed over the cluster. When two clusters are merged, the representative points of the new cluster are recalculated and then shrunk toward the mean, by a fraction α, to dampen the effects of outliers. CURE partitions the sample into a set of partitions, clusters each partition and obtains the representative points of each partial cluster, whereas DESCRY partitions the sample into a finite number of non-overlapping rectangular regions in order to obtain meta-points, but does not cluster the sample. CURE considers as outliers the partial clusters that grow too slowly, removes them, and performs clustering again on the remaining partial clusters, whereas DESCRY directly clusters the meta-points to obtain partial clusters, and outliers are automatically removed during the partitioning step. The computational complexity of CURE is quadratic w.r.t. the data set size, while that of DESCRY depends linearly on the data set size.

Fig. 2. Data sets DS1 and DS2, and clusters found by DESCRY.

Fig. 3. Meta-Points of DS1 and DS2.

4 Experimental Results

In this section we report the results of some experiments performed with DESCRY. In these experiments, we considered the two synthetic data sets shown in Figure 2, which we call DS1 and DS2. DS1 is a data set described in [7], consisting of 100,000 two-dimensional points, grouped in 5 clusters of different size, density and shape. DS2 is a data set described in [12], consisting of 100,000 two-dimensional points, grouped in 2 non-spherical clusters. Both DS1 and DS2 are characterized by the presence of outliers and of bridges linking two different clusters. Figure 2 also shows the clusters reported by DESCRY when the sample size n is set to 2500, and the population threshold F is set to F1 = 35 for DS1 and F2 = 20 for DS2. As the figure shows, the clustering quality of DESCRY is very good. Figure 3 depicts the meta-points resulting from the partitioning step on these two data sets. We can observe that the meta-points depict very well the shape of the clusters. We also studied how the clustering quality is affected by the choice of the population threshold F, varying this value in a suitable neighborhood of F1 and F2, respectively. These experiments show that DESCRY is not very sensitive to such variations of the parameter F.

Fig. 4. Execution time of DESCRY on DS1 as a function of the sample size n when F is fixed (on the left) or n/F is fixed (on the right).

We do not report these experiments due to space limitations. To show how the algorithm scales w.r.t. the sample size n, we report in Figure 4 the execution times (in milliseconds; DESCRY was implemented in Java and the experiments were run on a machine with an AMD Athlon 4 processor at 1.2 GHz and 256 MB of main memory) of DESCRY on DS1, obtained by varying n from 1500 to 5500 and keeping F fixed to F1 = 35 (on the left), while in the figure on the right the ratio n/F is fixed to 75. The figure on the left shows that, when F is fixed, the execution time is practically constant for the partitioning step, while it grows linearly for the pre-clustering and labelling steps. The figure on the right points out that, if the ratio n/F is fixed, increasing the sample does not increase the execution times of the algorithm. Finally, we studied the behavior of the algorithm when the number of dimensions increases. Figure 5 shows that also in this case the execution time of the partitioning and pre-clustering steps is constant, while that of the labelling step grows linearly.

5 Conclusions

This paper described a new method, named DESCRY, to identify clusters having different size and shape in large high dimensional data sets. The algorithm is parametric w.r.t. the agglomerative method used in the pre-clustering step and the similarity metric σ of interest. DESCRY has a very low computational complexity: it requires O(N md) time for high-dimensional data sets, and O(N log m) time for low dimensional data sets, where m can be considered a constant characteristic of the data set.

Fig. 5. Execution time of DESCRY on DS1 for increasing dimension.

Thus DESCRY scales linearly both w.r.t. the size and the dimensionality of the data set. Despite its low complexity, the qualitative results are very good and comparable with those obtained by state-of-the-art clustering algorithms. Future work includes, among other topics, the investigation of similarity metrics particularly meaningful in high-dimensional spaces, exploiting summaries extracted from the regions associated with the meta-points.

References

1. R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the Int. Conf. on Management of Data (SIGMOD'98), pages 94–105, 1998.
2. A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
3. M. Ankerst, M. Breunig, H.P. Kriegel, and J. Sander. OPTICS: Ordering Points to Identify the Clustering Structure. In Proceedings of the Int. Conf. on Management of Data (SIGMOD'99), pages 49–60, 1999.
4. M. Ester, H. Kriegel, J. Sander, and X. Xu. A database interface for clustering large spatial databases. In Proceedings of the 1st Int. Conf. on Knowledge Discovery and Data Mining, 1995.
5. B. Everitt. Cluster Analysis. Heinemann Educational Books Ltd, London, 1977.
6. U.M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From data mining to knowledge discovery: an overview. In U.M. Fayyad et al. (Eds), Advances in Knowledge Discovery and Data Mining, pages 1–34. AAAI/MIT Press, 1996.
7. S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Int. Conf. on Management of Data, pages 73–84, New York, May 1998.
8. J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
9. A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the Int. Conf. on Knowledge Discovery and Data Mining, pages 58–65, 1998.
10. A. Hinneburg and D. A. Keim. Optimal grid-clustering: Towards breaking the curse of dimensionality in high dimensional clustering. In Proceedings of the 25th Int. Conf. on Very Large Data Bases, 1999.
11. A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999.
12. G. Karypis, S. Han, and V. Kumar. Chameleon: Hierarchical clustering using dynamic modeling. IEEE Computer, pages 65–75, 1999.
13. L. Kaufman and P.J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990.
14. J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, 1967.
15. R.T. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th Int. Conf. on Very Large Data Bases, pages 144–155, 1994.
16. H. Samet. Hierarchical representations of collections of small rectangles. ACM Computing Surveys, 20(4):271–309, 1984.
17. H. Samet. The quadtree and related hierarchical data structures. ACM Computing Surveys, 16(2):187–260, 1984.
18. G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multiresolution clustering approach for very large spatial databases. In Proceedings of the Int. Conf. on Very Large Data Bases, pages 428–439, 1998.
19. J. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1):37–57, 1985.
20. W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proceedings of the Int. Conf. on Very Large Data Bases, pages 186–195, 1997.
21. T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD Int. Conf. on Management of Data, pages 103–114, June 1996.