Finding Clusters in Subspaces of Very Large, Multi-dimensional Datasets

Robson L. F. Cordeiro #1, Agma J. M. Traina #2, Christos Faloutsos ∗3, Caetano Traina Jr. #4

# Computer Science Department - ICMC, University of São Paulo, 400 Trabalhador São-carlense Ave, São Carlos SP, Brazil
1 [email protected]  2 [email protected]  4 [email protected]

∗ School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh PA 15213, USA
3 [email protected]

Abstract— We propose the Multi-resolution Correlation Cluster detection (MrCC), a novel, scalable method to detect correlation clusters in data with roughly 5 to 30 axes. Existing methods typically exhibit super-linear behavior in space or execution time. MrCC employs a novel data structure based on multi-resolution and improves on previous approaches in that: (a) it finds clusters that stand out in the data in a statistical sense; (b) it is linear in running time and memory usage with respect to the number of data points and to the dimensionality of the subspaces where clusters exist; (c) it is linear in memory usage and quasi-linear in running time with respect to the space dimensionality; and (d) it is accurate, deterministic, robust to noise, does not require the number of clusters as an input parameter, does not perform distance calculations and is able to detect clusters in subspaces generated by the original axes or by linear combinations of the original axes, including space rotation. We performed experiments on synthetic data ranging from 5 to 30 axes and from 12k to 250k points, and MrCC outperformed five recent, related methods in execution time, being on average 10 times faster than the competitors that also presented high accuracy for every tested dataset. On real data, MrCC found clusters at least 9 times faster than the competitors, improving on their accuracy by up to 34 percent.

I. INTRODUCTION

Traditional clustering methods suffer from the curse of dimensionality and often fail to produce acceptable results when the data dimensionality rises above ten or so [1]. The dimensionality curse refers to the fact that, as dimensionality increases, the space tends to become very sparse and the distances between any pair of points tend to become similar, regardless of the distance function employed [2]. That is the reason why traditional clustering methods are likely to fail on this kind of data. However, it has been shown that multi-dimensional data tend to cluster in subspaces of the original space (i.e., sets of orthogonal vectors formed from the original axes or combinations of subsets of them) [3], [4], [5], [1], [6], [7].

This material is based upon work supported by FAPESP (São Paulo State Research Foundation), CAPES (Brazilian Coordination for Improvement of Higher Level Personnel), CNPq (Brazilian National Council for Supporting Research) and the National Science Foundation under Grant No. DBI-0640543. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, or other funding parties.


Dimensionality reduction aims at eliminating global correlations, i.e., correlations that, to be detected, must exist in every region of the dataset with respect to a set of orthogonal vectors formed by either the original axes (feature selection) or linear combinations of them (feature extraction). However, besides global correlations, local correlations are likely to exist in subsets of the original axes. In fact, one cluster may present correlations that differ from those presented by another cluster. Traditional dimensionality reduction therefore does not solve the problem of identifying local clusters and their correlations in subsets of the dimensions.

In this paper we use the following notation.

Definition 1: A multi-dimensional dataset dS is a set of η points in a d-dimensional space over the set of axes dE = {e1, e2, ..., ed}, where d is the dimensionality of the dataset and ej, 1 ≤ j ≤ d, is an axis (also called a dimension or an attribute) of the space where the dataset is embedded. A point si ∈ dS is a set of values si = {si1, si2, ..., sid}. Without loss of generality, we assume that each value sij, 1 ≤ i ≤ η, 1 ≤ j ≤ d, is a real value in [0, 1), so the whole dataset is embedded in the d-dimensional hyper-cube [0, 1)^d.

Figure 1 shows examples of local correlations in two 3-dimensional datasets over the axes 3E = {x, y, z} [5]. Figure 1a shows a 3-dimensional dataset 3S projected onto axes x and y, while Figure 1b shows the same dataset projected onto axes x and z. We can see that there exist two clusters in this dataset, C1 and C2. Cluster C1 exists in the subspace formed by the two axes x and z, so we denote it 2γC1, while cluster C2 exists in the subspace formed by the two axes x and y, so we denote it 2γC2. The symbol γ marks them as correlation clusters. It is not worth applying a global dimensionality reduction technique to this dataset, as every axis is important to characterize at least one cluster. Also, traditional clustering methods are likely to fail, since both clusters are spread over one axis. Figures 1a and 1b correspond to a simple situation, as both clusters are aligned with original axes, and thus the clusters exist in subspaces of the original axes. Nevertheless, real data may also have clusters aligned along arbitrarily oriented axes.



Fig. 1. Examples of correlation clusters in 3-dimensional datasets over the axes 3E = {x, y, z}.

As an example, Figures 1c and 1d respectively show the x-y and x-z projections of another 3-dimensional dataset 3S′, whose clusters 2γC1′ and 2γC2′ are not aligned with the original axes. In this case, each cluster exists in a plane generated by linear combinations of the original axes.

Correlation clustering methods unify the tasks of traditional clustering and dimensionality reduction, finding clusters in spaces with several dimensions together with the subspaces where the clusters exist. According to the recent survey [8], these methods partition the data into disjoint sets of points that exhibit a dense, arbitrary linear correlation. Points not belonging to any of the correlation sets are labeled as noise. Based on this idea, the clusters that we are looking for are defined as follows.

Definition 2: Given a multi-dimensional dataset dS on the set of axes dE, a correlation cluster in dS, δγCk = ⟨δγEk, δγSk⟩, is defined as a set δγEk ⊆ dE of δ axes together with a set of points δγSk ⊆ dS that exhibit a linear correlation in the axes δγEk. The axes in δγEk are said to be relevant to the cluster, and the axes in dE − δγEk are said to be irrelevant to the cluster. The cardinality δ = |δγEk| is the dimensionality of the cluster. For any pair of distinct correlation clusters in dS, δγCk = ⟨δγEk, δγSk⟩ and δ′γCk′ = ⟨δ′γEk′, δ′γSk′⟩, the expression δγSk ∩ δ′γSk′ = ∅ always holds.

This paper proposes the new Multi-resolution Correlation Clustering (MrCC) method. It looks for clusters in a top-down way, first analyzing the distribution of points in the "full dimensional" space, and then performing a multi-resolution, recursive partitioning of the space, which helps distinguishing clusters covering regions with varying sizes, densities, correlated axes and numbers of points. It uses spatial convolution masks over multi-resolution partitions of the data space. The masks, also known as convolution filters, are extensively used in digital image processing [9] to uncover patterns in 2- and 3-dimensional grids. In our work they are applied in a novel way, over a multi-scale grid structure defined on the input dataset, in order to efficiently detect density variations in multi-dimensional data. The masks we use are integer approximations of the Laplacian filter, a second-derivative operator that reacts to transitions in density. Figures 2a and 2b show examples of 2-dimensional Laplacian masks, and Figure 2c shows a 3-dimensional one.

MrCC also employs the Minimum Description Length (MDL) method [10] to automatically tune a density threshold based on the data distribution. The main idea of MDL is to encode the input data, selecting a minimal code length.

Fig. 2. 2- and 3-dimensional integer approximations of the Laplacian filter.
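Since the figure itself is not reproduced here, the 2-dimensional case can be written out explicitly. The mask below follows the description given in Section III-B (value 2d at the center element, −1 at the 2d face elements, zero elsewhere), instantiated for d = 2; it is shown only as an illustration and is not necessarily identical to the matrices of Figure 2.

\[
M_{2D} \;=\;
\begin{pmatrix}
 0 & -1 &  0 \\
-1 &  4 & -1 \\
 0 & -1 &  0
\end{pmatrix}
\]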

In our case, MDL partitions the computed arrays of axes' relevances, tuning a threshold that defines the axes relevant to each cluster.

MrCC is well-suited to analyze datasets in the range of 5 to 30 dimensions. In our experience, the intrinsic dimensionalities of datasets are frequently smaller than 30 [11]. Therefore, if a dataset has more than 30 or so dimensions, it is possible to first apply some distance-preserving dimensionality reduction or feature selection algorithm, such as PCA or FDR, and then apply MrCC. MrCC improves on previous approaches in that: (a) it finds clusters that stand out in the data in a statistical sense; (b) it is linear in running time and in memory usage with respect to the number of data points and the clusters' dimensionality; (c) it is linear in memory usage and quasi-linear in running time with respect to the space dimensionality; and (d) it is accurate, deterministic, robust to noise, does not require the number of clusters as a parameter and does not perform distance calculations.

We performed experiments on synthetic data comparing MrCC with five recent, related methods. It consistently outperformed all of them in execution time, always presenting high accuracy. Regarding experiments on real data, MrCC found clusters in the KDD Cup 2008 data at least 9 times faster than the competitors, improving on their accuracy by up to 34 percent.

The remainder of this paper is structured as follows. Related work is briefly discussed in Section II. The third and main section presents the proposed MrCC method. Experiments using synthetic and real data are shown in Section IV. Finally, conclusions are presented in Section V. The symbols used in the paper are listed in Table I.

II. RELATED WORK

Finding clusters in subspaces of multi-dimensional data is the purpose of several works; a recent survey on this area is found in [8]. Bottom-up methods find dense regions in low-dimensional subspaces and merge the subspaces containing these regions to uncover clusters and their corresponding subspaces. One of the first methods to use this approach is CLIQUE [3]. It divides one-dimensional projections into a user-defined number of partitions, merging adjacent ones to identify dense partitions in interesting subspaces of higher dimensionality. MDL is employed to define which subspaces are interesting. CLIQUE's main drawbacks are: its merging process scales exponentially with the dimensionality of the subspaces where clusters are found; it relies on a fixed density threshold that assumes the high-dimensional clusters to be as dense as the low-dimensional ones; and it is able to find clusters in subspaces generated by linear combinations of the original axes only if the specified number of partitions is large.


TABLE I
TABLE OF SYMBOLS

dS - A d-dimensional space.
S - A set of d-dimensional points; S ⊂ dS.
    dS - A full d-dimensional dataset.
    δγSk - Points of a correlation cluster δγCk; δγSk ⊆ dS.
E - A set of axes.
    dE - Full set of axes of dS; dE = {e1, e2, ..., ed}, |dE| = d.
    δγEk - Axes relevant to a correlation cluster δγCk; δγEk ⊆ dE, |δγEk| = δ.
d - Dimensionality of dataset dS.
η - Number of points in dataset dS; η = |dS|.
si - A point of dataset dS; si ∈ dS.
sij - Value of point si in axis ej; sij ∈ [0, 1).
δγCk - A correlation cluster; δγCk = ⟨δγEk, δγSk⟩.
δ - Dimensionality of δγCk.
γk - Number of correlation clusters in dS.
T - A Counting-tree.
H - Number of resolutions in T.
h - A level of T.
ξh - Side size of the cells at level h of T.
ah - A cell at level h of T.
ah.loc - Relative position of ah within its parent cell.
ah.n - Count of points in ah.
ah.P[j] - Half-space count of points in ah with respect to ej.
ah.ptr - Pointer to the node that represents ah refined at level h + 1 of T.
lj, uj - Lower and upper bounds of ah in axis ej.

Some later works aimed at reducing these drawbacks. EPCH [1] starts the process by looking for dense partitions in subspaces of a user-defined dimensionality and automatically computes the density threshold based on the data distribution; however, the maximum number of clusters is a required parameter. P3C [6] applies a statistical approach to avoid using a fixed density threshold. Both EPCH and P3C are able to find clusters in subspaces generated by the original axes or by their linear combinations.

Top-down methods analyze the "full dimensional" space to find patterns that indicate clusters. The subspaces where the clusters exist are identified based on the data distribution surrounding the patterns. Similar clusters are merged to produce the final result. One of the first methods following this approach is PROCLUS [4]. It is a k-medoid clustering method that assigns to each medoid a subspace generated by the original axes. Each point is assigned to the closest medoid in its subspace. An iterative process analyzes the distribution of the points of each cluster in each axis, and the axes in which the cluster is densest form the cluster's subspace. Its main disadvantages are: it assumes that analyzing the "full dimensional" space is sufficient to identify clusters that exist only in subspaces; the process has super-linear computational complexity in the number of points and axes; it only finds clusters in subspaces generated by the original axes; it heavily depends on the user-defined number and average dimensionality of the clusters; and its iterative process may not converge in reasonable time.

Other works built on the ideas of PROCLUS aiming at reducing its disadvantages. ORCLUS [5] finds clusters in subspaces generated by the original axes or by linear combinations of them. This is made possible by analyzing each cluster's orientation (the eigenvector with the largest eigenvalue) in an iterative process that merges close clusters with similar orientations. CURLER [12] and PkM [13] use similar ideas to find non-linear correlation clusters. However, both methods have cubic complexity in the data dimensionality. LAC [7] and LWC/CLWC [14] require that the points in a cluster be close to each other according to the L2 norm distance weighted in each of the original axes. Similarly to PROCLUS, the distribution of the points of each cluster is analyzed, but instead of defining subspaces, weighting vectors are defined for each cluster. Iterative processes allow guessing the average dimensionality of the clusters without the user's help. DOC/FASTDOC [15] provides a clustering model and proposes a probabilistic algorithm that looks for approximations of clusters maximizing the quality of the clustering result under the model. However, it relies on a fixed density threshold and requires multiple runs to find each cluster. FPC/CFPC [16] improves upon DOC/FASTDOC by replacing its inner randomized part with a systematic search that mines frequent itemsets. Also, CFPC finds multiple clusters in a single run. HARP [17] looks for clusters based on the traditional agglomerative, hierarchical approach. It exploits the data distribution to automatically adjust internal thresholds. However, the number of clusters and the maximum percentage of noise are user-defined. Also, the algorithm inherits the drawbacks of hierarchical clustering algorithms, in particular the quadratic run time complexity. In [18], a problem formulation that aims at extracting a reduced, non-redundant set of regions that stand out in the dataset in a statistical sense is proposed, and an approximation algorithm, STATPC, is presented. However, STATPC is non-deterministic, it finds clusters only in subspaces generated by the original axes and, according to the original experiments, it has a larger run time than other methods.

Besides CLIQUE, other methods employ MDL. RIC [19] improves a given clustering by cleansing noise and adjusting the clusters so as to determine the most natural subspace for each cluster. OCI [20] is a parameter-free method that applies the exponential power distribution (EPD) model and Independent Component Analysis (ICA) to find both the main directions inside a cluster and the split planes in a top-down approach. It also defines a filter for outliers, based on EPD and ICA. STING [21] is a basis for our work. It also performs multi-resolution space division in a statistical approach to clustering. However, it was designed for GIS datasets, which are mostly two- or very low-dimensional.

In spite of the qualities of the described approaches, it is important to notice that, to the best of the authors' knowledge, there is no method in the literature, designed to look for clusters in subspaces, that exhibits linear behavior in terms of space and execution time with respect to increasing numbers of points, axes and clusters' dimensionalities.


III. THE CLUSTERING METHOD

This section presents the Multi-resolution Correlation Clustering method (MrCC). Its main idea is to identify correlation clusters based on the variation of the data density over the space in a multi-resolution approach, dynamically changing the partitioning size of the analyzed regions. Multi-resolution is exploited by applying d-dimensional hyper-grids with cells of several side sizes over the data space and counting the number of points in each grid cell. The number of cells increases exponentially with the dataset dimensionality as the cell size shrinks, so the grid sizes dividing each region are carefully chosen. The grid densities are stored in a quadtree-like structure called the Counting-tree, where each level represents the dataset as a hyper-grid at a specific resolution. Spatial convolution masks are applied over each level of the Counting-tree to identify patterns in the data distribution at each resolution. Applying the masks to the needed tree levels allows spotting clusters of different sizes.

Given a tree level, MrCC applies a mask to find the regions in the "full dimensional" space with the largest densities of points. The regions found may indicate clusters that exist only in subspaces of the analyzed space. The neighborhoods of these regions are analyzed to decide whether they stand out in the data in a statistical sense, thus indicating clusters. The axes in which the points of an analyzed neighborhood are close to each other are said to be relevant to the respective cluster, while the axes in which these points are sparsely distributed are said to be irrelevant to the cluster. The Minimum Description Length (MDL) principle is used in this process to automatically tune a threshold able to separate relevant from irrelevant axes, based on the data distribution. The following subsections detail the phases of our method.

A. Building the Counting-tree

The first phase of MrCC constructs the Counting-tree, representing a dataset dS with d axes and η points as a set of hyper-grids of d-dimensional cells in several resolutions. The tree root (level zero) corresponds to a hyper-cube embodying the full dataset. The next level divides the space into a set of 2^d hyper-cubes, each filling a "hyper-quadrant" whose side size is equal to half the side size of the previous level. The resulting hyper-cubes are divided again, generating the tree structure. Therefore, each level h of the Counting-tree represents the data as a hyper-grid of d-dimensional cells of side size ξh = 1/2^h, h = 0, 1, ..., H − 1, where H is the number of resolutions. Each cell may or may not be refined in the next level, according to the presence or absence of points in the cell space, so the Counting-tree can be unbalanced.

Each cell has the structure ⟨loc, n, P[d], usedCell, ptr⟩, where loc is the cell's spatial position inside its parent cell, n is the number of points in the cell, P[ ] is a d-dimensional array of half-space counts, usedCell is a boolean flag and ptr is a pointer to the next tree level. The cell position loc locates the cell inside its parent cell. It is a binary number with d bits of the form [bb...b], where the j-th bit places the cell in the lower (0) or upper (1) half of axis ej relative to its immediate parent cell.
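To make the bit encoding of loc concrete, the following minimal Python sketch computes the relative position of the child cell that covers a point inside a given parent cell; the function and variable names are illustrative only and not part of the original implementation.

def cell_loc(point, parent_lower, parent_side):
    # Bit j is 0 if the point falls in the lower half of axis e_j of the
    # parent cell, and 1 if it falls in the upper half.
    half = parent_side / 2.0
    return tuple(0 if point[j] < parent_lower[j] + half else 1
                 for j in range(len(point)))

# Example: a 2-dimensional point inside the root cell [0, 1)^2
print(cell_loc((0.7, 0.2), parent_lower=(0.0, 0.0), parent_side=1.0))  # -> (1, 0)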

Fig. 3. Two-dimensional hyper-grid cells and corresponding Counting-tree.

Each half-space count P[j] counts the number of points in the lower half of the cell with respect to axis ej. The flag usedCell is used in the second phase of MrCC to determine whether the cell has already been used by the clustering process. Initially, every usedCell flag is set to false. Figure 3a illustrates 2-dimensional grids in up to five distinct resolutions, while Figures 3b and 3c illustrate, respectively, a grid over a 2-dimensional dataset in four distinct resolutions and the corresponding Counting-tree. The usedCell flags are not shown in the illustration.

In a Counting-tree, a given cell a at level h is referred to as ah. The immediate parent of ah is ah−1, and so on. The cell position loc of ah corresponds to its relative position within its immediate parent cell, and it is referred to as ah.loc. The parent cell is at relative position ah−1.loc, and so on. Hence, the absolute position of cell ah is obtained by following the sequence of positions [a1.loc ↓ a2.loc ↓ ... ↓ ah−1.loc ↓ ah.loc]. For example, the cell marked as A2 in level 2 of Figure 3c has relative position A2.loc = [00] and absolute position up to level 2 equal to [11 ↓ 00]. A similar notation is used to refer to the other cell attributes, n, P[ ], usedCell and ptr. As an example, given cell ah, its number of points is ah.n, the number of points in its parent cell is ah−1.n, and so on.

The Counting-tree is created in main memory, and the number of levels can be set automatically according to the available memory. Each tree node is implemented as a linked list of cells. Thus, although the number of regions dividing the space explodes at O(2^(dH)), we only store the regions containing at least one point, and each tree level in fact has at most η cells. However, without loss of generality and to ease understanding, in this paper we consider each node as an array of cells.


Algorithm 1 builds a Counting-tree. It receives the dataset normalized to the d-dimensional hyper-cube [0, 1)^d and the desired number of resolutions H, although the tree can be truncated if the available memory is not enough. H must be greater than or equal to 3. MrCC performs a single data scan, counting each data point in every corresponding cell at each tree level as the point is read. Each point is also counted in the half-space count P[j] of every axis ej in which it lies in the lower half of a cell.

Algorithm 1: Building the Counting-tree.
Input: normalized dataset dS, number of distinct resolutions H
Output: Counting-tree T
 1: for each point si ∈ dS do
 2:   start at the root node;
 3:   for h = 1, 2, ..., H − 1 do
 4:     decide which hyper-grid cell in the current tree node covers si (let it be the cell ah);
 5:     ah.n = ah.n + 1;
 6:     ah.usedCell = false;
 7:     if h > 1, update the half-space counts in ah−1;
 8:     access the tree node pointed by ah.ptr;
 9:   end for
10:   update the half-space counts in ah;
11: end for

Time and Space Complexity: Algorithm 1 reads each of the η dataset points once. When a point is read, it is counted in all the H − 1 tree levels, based on its position in all the d axes. Thus, the time complexity of Algorithm 1 is O(η H d). The tree built in this phase has H − 1 levels. Each level has at most η cells, each containing an array with d positions. Thus, the space complexity of Algorithm 1 is O(H η d).
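A minimal Python sketch of Algorithm 1 could look as follows. It assumes the points are already normalized to [0, 1)^d and mirrors the cell fields of the paper (n, P, usedCell, ptr); all other names are illustrative and not part of the original implementation.

class Cell:
    def __init__(self, d):
        self.n = 0                # number of points counted in this cell
        self.P = [0] * d          # half-space counts, one per axis
        self.usedCell = False     # consumed later by the beta-cluster search
        self.ptr = {}             # children at the next level, keyed by loc

def build_counting_tree(points, d, H):
    root = Cell(d)
    for s in points:                              # single data scan
        root.n += 1
        node, lower, side = root, [0.0] * d, 1.0
        for h in range(1, H):                     # levels 1 .. H-1
            side /= 2.0                           # side size of the level-h cells
            loc = tuple(0 if s[j] < lower[j] + side else 1 for j in range(d))
            child = node.ptr.setdefault(loc, Cell(d))
            child.n += 1
            lower = [lower[j] + loc[j] * side for j in range(d)]
            for j in range(d):                    # half-space counts of the child cell
                if s[j] < lower[j] + side / 2.0:
                    child.P[j] += 1
            node = child
    return root

In this sketch each cell updates its own half-space counts as soon as the point reaches it, rather than one level later as in the pseudocode, but the resulting counts are the same.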

B. Finding β-clusters

The second phase of MrCC employs the counts in the Counting-tree to look for patterns that indicate possible correlation clusters. As the clusters are not yet confirmed in this phase, they are called β-clusters. A β-cluster δβCk = ⟨δβEk, δβSk⟩ also follows the definition of a correlation cluster, but uses the symbol β instead of γ. β-clusters are spotted by looking for dense, hyper-rectangular regions formed in subspaces of the d-dimensional data space. In this phase, MrCC employs three matrices L, U and V to describe the β-clusters. Let βk be the number of β-clusters found so far. Each matrix has βk lines and d columns. Thus, the description of a β-cluster δβCk is in the arrays L[k], U[k] and V[k]. L[k] and U[k] store respectively the lower and upper bounds of β-cluster δβCk in each axis, while V[k][j] is true if axis ej is relevant to β-cluster δβCk, and false otherwise.

MrCC looks for β-clusters by applying convolution masks over each level of the Counting-tree. The masks are integer approximations of the Laplacian filter. They are suitable for this task since they spot transitions in intensity. For performance reasons, only masks of order 3 are used, that is, matrices of size 3^d.

In such masks, regardless of their dimensionality, there is always one center element (the convolution pivot), 2d center-face elements (or just face elements, for short) and 3^d − 2d − 1 corner elements. Applying a mask over all cells at a Counting-tree level becomes prohibitively expensive in datasets with several dimensions; for example, a 10-dimensional cell has 59,028 corner elements. Thus, MrCC uses Laplacian masks having non-zero values only at the center and the face elements, that is, 2d at the center and −1 at the face elements, as shown in Figures 2b and 2c. A 10-dimensional cell has only 20 face elements. Therefore, it is possible to convolute each level of a Counting-tree with linear complexity in the dataset dimensionality d. Notice that the mask order must be an odd integer. We chose 3, since it is the smallest available. The algorithm's ability to find clusters improves a little when we use masks of order φ, φ > 3, having non-zero values at all elements (center, face and corner elements), but the required time increases too much, in the order of O(φ^d), as compared to O(d) when using masks of order 3 having non-zero values only at the center and the face elements.

As can be seen in Algorithm 2, phase two starts by applying the mask to level two of the Counting-tree, starting at a coarse resolution and refining as needed. This allows MrCC to find β-clusters of different sizes. When analyzing a resolution level h, the mask is convoluted over every cell at this level, excluding those already used by a previously found β-cluster. The cell with the largest convoluted value indicates the densest region at this resolution, excluding the space of the β-clusters previously found. Applying a convolution mask to level h requires a partial walk over the Counting-tree, but no node deeper than h is visited. The walk starts going down from the root until reaching a cell bh at level h that may need to be convoluted. The neighbors of this cell are named after the convolution matrix: face and corner neighbors. The center element corresponds to the cell itself. If the bh.usedCell flag is true or the cell shares data space with a previously found β-cluster, the cell is skipped. Otherwise, MrCC finds the face neighbors of cell bh and uses the point counts in bh and in its face neighbors to apply the mask. After visiting all cells in level h, the cell with the largest convoluted value has its usedCell flag set to true.

To verify whether a cell bh shares data space with a previously found β-cluster δβCk, MrCC uses the relative positions of the predecessors of bh to compute its absolute position in the data space up to level h. Let uj and lj represent respectively the upper and lower bounds of cell bh in axis ej. A β-cluster δβCk shares its space with cell bh if uj ≥ L[k][j] ∧ lj ≤ U[k][j] holds for every axis ej ∈ dE.

Each cell bh at resolution level h is itself a d-dimensional hyper-cube and it can be divided into 2^d cells in the next level, splitting each axis in half. Therefore, of the two face neighbors in axis ej, one is always stored in the same node as cell bh and the other, if it exists, is stored in a sibling node of cell bh. We call the face neighbor stored in the same node the internal neighbor of cell bh, and the other its external neighbor with respect to axis ej.


Algorithm 2: Finding the β-clusters.
Input: Counting-tree T, significance level α
Output: matrices of β-clusters L, U and V, number of β-clusters βk
 1: βk = 0;
 2: repeat
 3:   h = 1;
 4:   repeat
 5:     h = h + 1;
 6:     for each cell bh in level h of T do
 7:       if bh.usedCell = false ∧ bh does not share space with a previously found β-cluster then
 8:         find the face neighbors of bh in all axes;
 9:         apply the mask with the convolution pivot bh, using the counts of points in bh and in the found neighbors;
10:         ah = bh, if the resulting convoluted value is the largest one found in this iteration;
11:       end if
12:     end for
13:     ah.usedCell = true;
14:     centered on ah and based on α, compute cPj, nPj and θjα for every axis ej;
15:     if cPj > θjα for at least one axis ej then
16:       βk = βk + 1; {a new β-cluster was found}
17:     end if
18:   until a new β-cluster is found ∨ h = H − 1
19:   if a new β-cluster was found then
20:     compute r[ ] and use MDL to get cThreshold;
21:     for j = 1, 2, ..., d do
22:       V[βk][j] = (r[j] ≥ cThreshold);
23:       if V[βk][j] = true then
24:         compute L[βk][j] and U[βk][j] based on the lower and upper bounds of ah and of its face neighbors in axis ej;
25:       else
26:         L[βk][j] = 0;
27:         U[βk][j] = 1;
28:       end if
29:     end for
30:   end if
31: until no new β-cluster was found
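Line 14 of Algorithm 2 needs the critical value θjα of the one-sided test described later in this section, where cPj ~ Binomial(nPj, 1/6) under the null hypothesis. A minimal sketch of one way to compute it, assuming SciPy is available, is shown below; the function names are illustrative only.

from scipy.stats import binom

def critical_value(nP_j, alpha):
    # Smallest count theta such that P(X >= theta) <= alpha when
    # X ~ Binomial(nP_j, 1/6), i.e. P(X <= theta - 1) >= 1 - alpha.
    return int(binom.ppf(1.0 - alpha, nP_j, 1.0 / 6.0)) + 1

def axis_rejects_uniformity(cP_j, nP_j, alpha=1.0e-10):
    # Reject the null hypothesis of uniform spread over the six regions
    # when the central region is improbably dense.
    return cP_j > critical_value(nP_j, alpha)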

The internal neighbor always exists, but there are two cases in which the external one may not exist: if cell bh is at the space border in axis ej, or if the space corresponding to the external neighbor was not refined up to level h. For example, cell A2 in resolution level 2 of the Counting-tree presented in Figure 3c has the internal neighbor B2 and the external neighbor D2 with respect to axis e1. The internal neighbor of a cell bh in axis ej at resolution level h and its point count are referred to as NI(bh, ej) and NI(bh, ej).n respectively. Similarly, considering the same axis and resolution, the external neighbor of bh and its point count are NE(bh, ej) and NE(bh, ej).n respectively.

MrCC analyzes the absolute cell positions in a Counting-tree to look for external and internal neighbors.

To decide whether a new β-cluster exists in level h, MrCC searches for the cell ah with the largest convoluted value and analyzes its neighbors. For axis ej, the neighbor cells are the predecessor ah−1 of ah, its internal neighbor NI(ah−1, ej) and its external neighbor NE(ah−1, ej). Together, they have nPj = ah−1.n + NI(ah−1, ej).n + NE(ah−1, ej).n points. The cells' half-space counts show how these points are distributed over six consecutive, equal-size regions in axis ej. The point count in the center region, the one containing ah, is given by cPj = ah−1.P[j] if the j-th bit in ah.loc is 0, or by cPj = ah−1.n − ah−1.P[j] otherwise. As an example, for cell A3 in Figures 3b and 3c, with respect to axis e1, the six analyzed regions are shown in distinct textures, cP1 = 1 and nP1 = 6.

If at least one axis ej of cell ah has cPj significantly greater than the expected average number of points nPj/6, MrCC assumes that a new β-cluster was found. Thus, for each axis ej, a null hypothesis test is applied to compute the probability that the central region contains cPj points if nPj points are uniformly distributed over the six analyzed regions. The critical value for the test is the threshold to which cPj must be compared to determine whether it is statistically significant to reject the null hypothesis. The statistical significance is a user-defined probability α of wrongly rejecting the null hypothesis. For a one-sided test, the critical value θjα is computed as α = Probability(cPj ≥ θjα). The probability is computed assuming the Binomial distribution with parameters nPj and 1/6, since cPj ~ Binomial(nPj, 1/6) under the null hypothesis, and 1/6 is the probability that one point falls in the central region when it is randomly assigned to one of the six analyzed regions. If cPj > θjα for at least one axis ej, MrCC assumes ah to be the center cell of a new β-cluster and increments βk. Otherwise, the next tree level is processed.

Once a new β-cluster is found, MrCC generates the relevances array r = [r1, r2, ..., rd], where r[j] is a real value in (0, 100] representing the relevance of axis ej to the β-cluster centered in ah. The relevance r[j] is given by (100 ∗ cPj)/nPj. Then, MrCC automatically tunes a threshold to mark each axis as relevant or irrelevant to the β-cluster. The relevances in r[ ] are sorted in ascending order into an array o = [o1, o2, ..., od], which is submitted to MDL to find the best cut position p, 1 ≤ p ≤ d, that maximizes the homogeneity of the values in the two partitions of o[ ], [o1, ..., op−1] and [op, ..., od]. The value cThreshold = o[p] is then used to define axis ej as relevant or irrelevant to the new β-cluster, by setting V[βk][j] = true if r[j] ≥ cThreshold, and false otherwise.

The last step required to identify the new β-cluster finds its lower and upper bounds in each axis. These bounds are respectively set to L[βk][j] = 0 and U[βk][j] = 1 for every axis ej having V[βk][j] = false. For the other axes, the relevant ones, these bounds are first set equal to the lower and upper bounds of ah in these axes, L[βk][j] = lj and U[βk][j] = uj. Then, they are refined by analyzing the neighbors of ah.


Considering a relevant axis ej, if there exists a face neighbor of ah, containing at least one point, whose lower bound is smaller than the lower bound of ah, then L[βk][j] is decreased by 1/2^h. In the same way, if there exists a face neighbor of ah, containing at least one point, whose upper bound is bigger than the upper bound of ah, then U[βk][j] is increased by 1/2^h. In this way, MrCC completes the description of the new β-cluster and restarts applying the convolution mask from level two in order to look for another β-cluster. The process stops when the mask is applied to every tree level without finding a new β-cluster.

Time and Space Complexity: Algorithm 2 identifies βk β-clusters. When looking for each β-cluster, at most H − 2 tree levels are analyzed, each having at most η cells. For each tree level, the cells that do not belong to a previously found β-cluster are the convolution pivots used to apply the mask. Finally, the neighborhood of the cell with the largest convoluted value is analyzed to decide whether it is the center of a new β-cluster. Thus, the time complexity of this part of Algorithm 2 (lines 3-18) is O(βk^2 H^2 η d). After each new β-cluster is found, the relevances array with d real values in (0, 100] is built in O(d) time and sorted in O(d log d) time, the MDL method is used in O(d) time and the new β-cluster is described in O(d H) time. Thus, the time complexity of this part of Algorithm 2 (lines 19-30) is O(d βk (log d + H)). However, each iteration step of the first part of Algorithm 2 consumes a time t1 that is much larger than the time t2 consumed by each iteration step of the second part. Thus, the total time of Algorithm 2 is O(βk^2 H^2 η d) t1 + O(d βk (log d + H)) t2. Given that t1 and t2 are constant values and t1 ≫ t2, we argue that MrCC is quasi-linear in d, and our experiments corroborate this claim. The space complexity of Algorithm 2 is O(d βk + d + H), since it builds the matrices L, U and V and uses arrays with either d or H positions each.

C. Building the Correlation Clusters

The final phase of MrCC builds γk correlation clusters based on the β-clusters found so far. A correlation cluster δγCk is represented by a non-empty set of β-clusters. Unless the correlation cluster is a singleton set, each of its β-clusters must share space in the d-dimensional data space with at least one other β-cluster in the same set. Based on the matrices L and U, two β-clusters δ′βCk′ and δ″βCk″ are said to share the same space if U[k′][j] ≥ L[k″][j] ∧ L[k′][j] ≤ U[k″][j] for every axis ej, 1 ≤ j ≤ d. The space of a correlation cluster is the union of the spaces of its β-clusters, and its relevant axes are those relevant to at least one of its β-clusters.

As shown in Algorithm 3, MrCC analyzes pairs of β-clusters, and those sharing space in the d-dimensional space are assigned to the same cluster. Finally, after assigning each β-cluster to a correlation cluster, the relevant axes are defined. Since the correlation clusters obtained do not share space in the d-dimensional space, any point can belong to at most one correlation cluster. Thus, in order to create a dataset partition, MrCC labels the points after the regions covered by the correlation clusters. All other points are labeled as noise.

Algorithm 3: Building the correlation clusters.
Input: matrices of β-clusters L, U and V, number of β-clusters βk
Output: set of correlation clusters C, number of correlation clusters γk
 1: for k′ = 1, 2, ..., βk do
 2:   for k″ = k′ + 1, k′ + 2, ..., βk do
 3:     assign β-clusters δ′βCk′ and δ″βCk″ to the same correlation cluster, if they share database space;
 4:   end for
 5: end for
 6: for k = 1, 2, ..., γk do
 7:   set an axis as relevant to correlation cluster δγCk, if it is relevant to any of the β-clusters assigned to δγCk;
 8: end for

Time and Space Complexity: Algorithm 3 analyzes all the β-clusters found before building the correlation clusters. Given a pair of β-clusters, their positions in all the d axes are compared. After the γk correlation clusters are built, their relevant axes are defined based on the relevant axes of the β-clusters assigned to them. Thus, the time complexity of Algorithm 3 is O(d (βk^2 + γk βk)). During the process, an array with βk positions links β-clusters to correlation clusters, while a matrix with γk lines and d columns indicates the axes relevant to the correlation clusters. Thus, the space complexity of Algorithm 3 is O(βk + γk d).
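As a concrete illustration of this phase, the Python sketch below groups β-clusters that pairwise share space into correlation clusters using a union-find structure; the grouping is equivalent to the pairwise assignment of Algorithm 3, but the code and its names are illustrative only and not the original implementation.

def share_space(L, U, k1, k2, d):
    # Two beta-clusters share space iff their bounding boxes overlap in every axis.
    return all(U[k1][j] >= L[k2][j] and L[k1][j] <= U[k2][j] for j in range(d))

def build_correlation_clusters(L, U, beta_k, d):
    parent = list(range(beta_k))              # union-find over beta-cluster ids
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x
    for k1 in range(beta_k):
        for k2 in range(k1 + 1, beta_k):
            if share_space(L, U, k1, k2, d):
                parent[find(k1)] = find(k2)
    # correlation-cluster id of each beta-cluster
    return [find(k) for k in range(beta_k)]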

IV. EXPERIMENTS

This section presents the experiments performed to compare MrCC with five recent, related methods. The competitors are: CFPC, HARP, LAC, EPCH and P3C¹. The source codes of these methods were kindly provided by their original authors. All experiments were run on a machine with 8.00GB of RAM and an Intel Xeon 2.33GHz processor, running Microsoft Windows XP Professional x64. Except for LAC, all competitors were tuned to partition the dataset into disjoint clusters and a noise set. Every method indicates the axes relevant to each cluster. LAC finds disjoint groups, but no noise, and it sorts the axes by their importance for each cluster; as LAC is a state-of-the-art competitor, it is worth comparing it to MrCC, even though it does not define the relevant axes.

Several synthetic datasets were created to evaluate the quality of each method, with varying numbers of points, axes and correlation clusters and varying percentages of noise. Scalability was also analyzed with respect to the need for computational resources, memory consumption and processing time. The datasets were created to evaluate the behavior of each algorithm over a wide range of values for several parameters, changing one parameter at a time. Finally, the algorithms were evaluated over a large, multi-dimensional real dataset.

¹ We also tried to compare with STATPC (kindly provided by Gabriela Moise) but, tuned as suggested, the time it takes to run was impractical over our datasets: it did not finish within a week even for our smallest one.


For better visualization, the vertical axes of all graphs on memory consumption and run time are shown in log scale.

A. Evaluating the Results

The quality of each clustering result provided by each technique was measured based on the well-known precision and recall measurements. We distinguish between the correlation clusters known to exist in a dataset, which we call real clusters, and those that a technique found, which we call found clusters. A found cluster δfCk = ⟨δfEk, δfSk⟩ as well as a real cluster δ′rCk′ = ⟨δ′rEk′, δ′rSk′⟩ follow the definition of a correlation cluster, Definition 2. The symbols f and r are used instead of γ. Thus, fk and rk denote the numbers of found and real clusters respectively.

For each found cluster δfCk, its most dominant real cluster δ′rCk′ is the one such that |δfSk ∩ δ′rSk′| = max(|δfSk ∩ δ″rSk″|), 1 ≤ k″ ≤ rk. Similarly, for each real cluster δ′rCk′, its most dominant found cluster δfCk is the one such that |δ′rSk′ ∩ δfSk| = max(|δ′rSk′ ∩ δ″fSk″|), 1 ≤ k″ ≤ fk.

The precision and the recall of found cluster δfCk and real cluster δ′rCk′ are computed as follows:

precision(δfCk, δ′rCk′) = |δfSk ∩ δ′rSk′| / |δfSk|   (1)

recall(δfCk, δ′rCk′) = |δfSk ∩ δ′rSk′| / |δ′rSk′|   (2)

To evaluate the quality of a clustering result, we averaged the precision over all found clusters and their respective most dominant real clusters. Also, we averaged the recall over all real clusters and their respective most dominant found clusters. These two averaged values are closely related to well-known measurements: the first one is directly proportional to the dominant ratio [5], [1], while the second one is directly proportional to the coverage ratio [1]. The harmonic mean of these two averaged values is the Quality. The evaluation of a clustering result regarding the quality of the uncovered relevant axes was made in a similar way: we also computed the harmonic mean of the averaged precisions over all found clusters and the averaged recalls over all real clusters, but we exchanged the sets of points (S sets) in Equations 1 and 2 for sets of axes (E sets). This harmonic mean is the Subspaces Quality. In the cases where a technique does not find clusters in a dataset, the value zero is assumed for both qualities.
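For clarity, the Quality computation can be sketched as follows in Python, representing each cluster simply by its set of point identifiers; the function and variable names are illustrative and not part of the original evaluation scripts.

def quality(found, real):
    # found, real: lists of sets of point ids (the S sets of the clusters)
    if not found or not real:
        return 0.0
    # average precision of each found cluster w.r.t. its most dominant real cluster
    avg_prec = sum(max(len(f & r) for r in real) / len(f) for f in found) / len(found)
    # average recall of each real cluster w.r.t. its most dominant found cluster
    avg_rec = sum(max(len(r & f) for f in found) / len(r) for r in real) / len(real)
    if avg_prec + avg_rec == 0:
        return 0.0
    return 2 * avg_prec * avg_rec / (avg_prec + avg_rec)   # harmonic mean

# Example
print(quality(found=[{1, 2, 3, 4}, {10, 11}], real=[{1, 2, 3}, {10, 11, 12, 13}]))  # ~0.81

The Subspaces Quality is obtained with the same function applied to the sets of relevant axes (the E sets) instead of the sets of points.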

B. Synthetic Data Generation

Several synthetic datasets were created in order to analyze each algorithm over distinct data distributions. The synthetic data were created using strategies similar to those used by the competitors. All datasets are embedded in the hyper-cube [0, 1)^d. However, since the original implementation of EPCH did not provide meaningful results for data in a unitary hyper-cube, the same synthetic datasets were normalized to the hyper-cube [−100, 100)^d before being given to EPCH.

The first group of synthetic datasets was created to analyze the behavior of each method with increasing numbers of points, axes and correlation clusters. The group contains 7 datasets with numbers of axes, points and clusters growing together from 6 to 18, 12,000 to 120,000 and 2 to 17 respectively. Clusters with random sizes were created in subspaces of randomly chosen original axes, with dimensionalities varying from 5 to 17. Each cluster follows Gaussian distributions with random means and standard deviations. The noise percentage was fixed at 15%. For identification purposes, the datasets are named 6d, 8d, 10d, 12d, 14d, 16d and 18d according to their dimensionality.

Among these datasets, one with close quality results for the compared methods was chosen. This dataset, the 14d, has 14 axes, 90,000 data points, 17 correlation clusters and 15 percent of noise. It served as the basis for the creation of the other datasets, with the purpose of evaluating scalability. Based on the 14d dataset, 4 groups with 5 datasets each were created, varying one of these characteristics: number of points, number of axes, number of clusters or percentage of noise. In other words, each group has 5 datasets in which a single characteristic changes, while all others remain the same as in the 14d dataset. The number of points grows from 50,000 to 250,000, the number of axes grows from 5 to 30, and the number of clusters and the percentage of noise grow from 5 to 25. The clusters were created following the same strategies used for the first group of datasets. Identification names were also assigned to these datasets: the names Xk, Xc, Xd_s and Xo refer respectively to the datasets in the groups changing the number of points, number of clusters, number of axes, and percentage of noise. For example, dataset 30d_s differs from 14d only in that it has 30 axes instead of 14.

A last group of datasets was created with the purpose of analyzing the behavior of each method in the presence of clusters in subspaces generated by linear combinations of the original axes. This group contains the data of the first group rotated 4 times in random planes and by random degrees. These datasets were named 6d_r, 8d_r, 10d_r, 12d_r, 14d_r, 16d_r and 18d_r according to their dimensionality.

C. Real Data

The real dataset used in the experiments consists of the training data provided for the Siemens KDD Cup 2008 (http://www.kddcup2008.com).



The dataset was created for automatic breast cancer diagnosis and consists of 25 of the most significant features automatically extracted from 102,294 Regions of Interest (ROIs) present in X-ray breast images of 118 malignant cases and 1,594 normal cases. The ground truth is also provided, since each ROI was manually classified based on either a radiologist's interpretation, a biopsy, or both. A common breast cancer screening consists of four X-ray images: two images of each breast, taken from the directions known as the Cranial-Caudal view (CC) and the Medio-Lateral-Oblique view (MLO).

Before applying the algorithms to the real data, some preprocessing was necessary. The data was partitioned in order to separate features extracted from heterogeneous images. Thus, four datasets were created, each with features extracted from approximately 25,000 ROIs related to images taken from one breast, left or right, in one direction, CC or MLO. The results obtained for these datasets were evaluated based on the ground-truth class label of each ROI.

D. Sensibility Analysis

MrCC's behavior varies depending on two input parameters: α and H. This section analyzes how they affect our method. We first varied these parameters to maximize the Quality obtained by MrCC, defining the best configuration for each dataset. Then, we modified the best configuration of each dataset, changing one parameter at a time, and analyzed MrCC's behavior. For example, when varying H for a dataset, the value of α was fixed at the value of the dataset's best configuration. MrCC's behavior was analyzed varying α from 1.0E−3 to 1.0E−160 and H from 4 to 80. Figure 4 shows the results for the first group of synthetic datasets. The values of α that led to the best results vary from 1.0E−5 to 1.0E−20, as can be seen in Figures 4a, 4b and 4c. The run time and memory consumption were barely affected by changes in α. Concerning H, Figures 4d, 4e and 4f show that the Quality does not increase significantly for H higher than 4. However, as expected, the run time and the memory consumption increased super-linearly and linearly, respectively, with increasing values of H. Therefore, small values of H, such as 4, are enough for most datasets. Similar results were obtained for our other synthetic and real datasets; they are not shown due to space limitations.

E. System Configuration

MrCC was tuned as suggested in Section IV-D. The values α = 1.0E−10 and H = 4 were fixed for all experiments. The other algorithms were tuned as follows. The correct number of clusters in each dataset was informed to LAC, EPCH, CFPC and HARP. Also, the known percentage of noise of each dataset was informed to HARP. The other parameters of each algorithm were tuned following their original authors' instructions. LAC was tested with integer values from 1 to 11 for the parameter 1/h. However, the time used by LAC to look for clusters differed considerably for distinct values of 1/h, so a time-out of three hours was specified for the LAC executions, and all configurations that exceeded this limit were interrupted. EPCH was tuned with integer values from 1 to 5 for the dimensionalities of its histograms, and several real values varying from 0 to 1 were tried for the outliers threshold.

For the tests with P3C, the values 1.0E−1, 1.0E−2, 1.0E−3, 1.0E−4, 1.0E−5, 1.0E−7, 1.0E−10 and 1.0E−15 were tried for the Poisson threshold. HARP implements three kinds of cache structures that affect its time and space complexities, but they all lead to identical clustering results. Due to our memory limitation, the Conga line structure was used in all experiments. It does not have the best time complexity, but it leads to a linear space complexity. CFPC was tuned with the values 5, 10, 15, 20, 25, 30 and 35 for w, the values 0.05, 0.10, 0.15, 0.20 and 0.25 for α, the values 0.15, 0.20, 0.25, 0.30 and 0.35 for β, and the value 50 for maxout. Finally, as CFPC is non-deterministic, we ran it 5 times for each tested configuration and averaged the results; the averaged values were taken as the final results for each configuration. The results reported in Sections IV-F and IV-G refer to the configurations that led to the best Quality achieved by each competitor over all possible parameter tunings.

F. Results on Synthetic Data

Figure 5 shows the behavior of the methods applied over the synthetic datasets. The quality of the clusters found and the amount of computational resources used are shown through the measured Quality, Subspaces Quality, memory consumption and run time respectively. We linked the values obtained for each method to ease the reading of the graphs. In the cases where a competitor found no cluster, distinct memory consumptions and run times were measured for distinct parameter configurations, so these results were ignored. Also, when a method used a secondary-memory cache, its run time was ignored as well. In these cases, linking lines were not drawn between the values obtained for the respective methods in the graphs.

Figure 5a shows that MrCC, EPCH, HARP and LAC presented similarly high Quality for all datasets in the first group. CFPC presented good Quality for the datasets 6d, 8d, 10d and 12d, but a clear decrease in quality for the datasets with higher dimensionalities. P3C provided the worst quality results. Regarding memory requirements, Figure 5b shows a huge discrepancy between the memory needs of HARP and EPCH compared to those of the other methods. As an example, for the biggest dataset, 18d, HARP used approximately 34.4GB of memory, while EPCH and MrCC used respectively 7.7 and 0.3 percent of this amount. Finally, Figure 5c shows that MrCC was the fastest algorithm for all datasets in the first group. As an example, for the biggest dataset, 18d, MrCC was respectively 2.8, 8.6, 23 and 81.3 times faster than CFPC, EPCH, LAC and P3C.

The groups of datasets with changes in a single characteristic (number of points, number of axes, number of clusters or percentage of noise) were used for the scalability tests. The main goal of these tests was to analyze the behavior of each algorithm with respect to changes in a single characteristic at a time. The Quality presented in Figures 5d, 5g, 5j and 5m shows that MrCC, LAC and EPCH performed well, exchanging positions but in general staying within 10 percent of each other.


Fig. 4. Sensibility analysis: left column, Quality; middle column, memory requirement; right column, wall clock time.

The graphs also show that CFPC and HARP did not provide results as good as these, since both methods were very sensitive to changes in some of the varying characteristics. The worst quality results were again provided by P3C. The memory requirement graphs for these datasets are presented in Figures 5e, 5h, 5k and 5n, while the run times are shown in Figures 5f, 5i, 5l and 5o. As can be seen, HARP and EPCH consumed a huge amount of memory for these datasets, several times larger than the amount used by the other methods. The amount of memory used by MrCC for these data varied between 0.06 and 1.5 percent of the amount needed by the most expensive method for the same datasets. Regarding run time, MrCC was the fastest method on all datasets of these groups, except for the 20d_s dataset, in which MrCC and LAC tied. As an example, for the 20c dataset, MrCC ran respectively 4.8, 13.6, 18.8, 183 and 1,785 times faster than CFPC, LAC, EPCH, P3C and HARP. Notice that these results corroborate the theoretical study of MrCC's time and space complexity presented in Section III.

We argue that the quadratic behavior with respect to the number of β-clusters found is not a crucial concern for MrCC. Experimental evaluation showed that this number closely follows the number of clusters found in the tested datasets. In our experiments, the largest number of β-clusters found over all synthetic and real datasets was 33, while the largest number of clusters present in these data is 25. Furthermore, the analysis of data with many clusters is usually meaningless, since it is very hard for the user to obtain semantic interpretations from a large number of clusters. We also claim that the quadratic behavior in H is not a relevant concern for MrCC, since very small values of H are sufficient to obtain good clustering results. Remember that a Counting-tree describes a d-dimensional dataset at H distinct resolutions. Each cell at a resolution level h is divided into 2^d cells at level h + 1, which are each divided again into 2^d cells at level h + 2, and so on. Thus, for data with ten or more dimensions, the maximum count of points in a cell of tree level h converges to 1 very fast as the value of h increases.

After this point, even for extremely skewed data, more levels are useless, since they would not help to better describe the data. The sensibility analysis presented in Section IV-D corroborates this claim.

The last test refers to the group of rotated datasets. It intends to analyze how each method behaves in the presence of clusters in subspaces generated by linear combinations of the original axes. The results in terms of Quality, memory consumption and run time are presented respectively in Figures 5p, 5q and 5r. MrCC and LAC were only marginally affected by rotation, presenting a variation of at most 5 percent in their respective Quality compared to the results for the datasets without rotation. All other methods presented considerable decreases in Quality for at least one of the rotated datasets. Concerning the need for computational resources, the results for these datasets were similar to those obtained for the datasets of the first group.

The quality of the relevant axes was also evaluated. LAC was not considered in this experiment, since it only weights the axes for the clusters found, instead of indicating the relevant ones. The results for the first group of datasets are shown in Figure 5s. As can be seen, the Subspaces Quality is close for MrCC and EPCH, while P3C, CFPC and HARP presented worse results. The results for the other datasets were similar; due to space limitations, they are not shown in the paper.

Concluding, P3C was not able to find clusters in several experiments, being on average the worst method regarding Quality. HARP and CFPC provided good quality results for some datasets, but they also provided bad results for several others. MrCC, LAC and EPCH provided good results for all synthetic datasets, in general tying with respect to Quality. However, in contrast to MrCC, LAC and EPCH demanded guessing the number of clusters and required distinct threshold configurations for each dataset to obtain their best reported results.

[Figure 5 appears here. Panels (a)–(r) plot Quality, memory requirement (KB) and wall-clock time (seconds) for P3C, LAC, EPCH, CFPC, HARP and MrCC over the synthetic dataset groups: dimensionality (6d–18d), number of points (50k–250k), number of clusters (5c–25c), percentage of noise (5o–25o), the 5d_s–30d_s group and the rotated datasets (6d_r–18d_r). Panel (s) plots the Subspaces Quality for the first group of datasets. Panel (t) reports the results for features automatically extracted from X-ray breast images:

Method   Quality   Memory (KB)   Time (s)
EPCH     0.7050      511,031       20.95
CFPC     0.8322        7,110        8.12
HARP     0.8669    1,586,984    1,001.43
MrCC     0.9466       23,908        0.87]

Fig. 5. MrCC is shown in circles: results on Quality (left column), memory requirement (middle column) and wall clock time (right column) over synthetic data; (s): Subspace Quality for synthetic data; (t): results on real data. Time and space are in log scale.


Regarding memory requirements, CFPC in general required the least amount, followed by LAC, MrCC, P3C, EPCH and HARP, which on average required respectively 1.2, 2.8, 6.5, 112 and 600 times more memory than CFPC. MrCC was the fastest among all methods tested, being on average 4.1, 9.8, 10.3, 219 and 1,422 times faster than CFPC, EPCH, LAC, P3C and HARP, respectively. Finally, MrCC was in general 10 times faster than LAC and EPCH, presenting high quality results for all tested datasets, similar to those of these methods.

G. Results on Real Data

The tested methods were applied to the real datasets described in Section IV-C. However, LAC grouped all points into a single cluster for every real dataset, and P3C exceeded the time limit of one week set for the experiments on every dataset; thus, these two methods are not reported. The results obtained for the left breast images in the Medio-Lateral Oblique view are presented in Figure 5t. As can be seen, MrCC was at least 9 times faster than the competitors, improving on their accuracy by up to 34 percent. The results regarding memory consumption follow the same pattern as for the synthetic data. Similar results were obtained for the other three real datasets; they are not shown due to space limitations.

V. CONCLUSIONS

The main contribution of this paper is the MrCC method for correlation clustering. It is well-suited to data with 5 up to 30 axes and gains over previous approaches because:
• it finds clusters that stand out in the data in a statistical sense;
• it is linear in running time and in memory usage with respect to increasing number of data points and clusters' dimensionality;
• it is linear in memory usage and quasi-linear in running time with respect to increasing data dimensionality;
• it is accurate, deterministic, robust to noise, does not require the number of clusters as an input parameter, does not perform distance calculations and is able to detect clusters in subspaces generated by the original axes or by linear combinations of them.
The theoretical study of MrCC's time and space complexity presented in Section III, as well as the experiments with synthetic and real data, corroborates these properties. On synthetic data, MrCC outperformed five recent, related methods in time, being on average 10 times faster than the competitors that also presented high accuracy results for all tested datasets. Data with numbers of axes, points, clusters and percentages of noise varying respectively from 5 to 30, 12k to 250k, 5 to 25 and 5 to 25 were analyzed in these experiments without causing significant changes in MrCC's results. MrCC was also robust to data rotations. On real data (KDD Cup 2008), MrCC found clusters at least 9 times faster than the competitors, improving on their accuracy by up to 34 percent, while some methods did not even finish within a week. Our method looks for dense space regions, and thus works for any data distribution. We illustrated it on rotated Gaussians, as well as on real data of unknown distribution.
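For readers who want to reproduce this kind of input, the snippet below is a hedged sketch (with hypothetical sizes and variances, not the exact generator used in our experiments) of how a rotated-Gaussian correlation cluster can be produced: draw an axis-aligned, elongated Gaussian and apply a random orthonormal transform, so the cluster ends up in a subspace spanned by linear combinations of the original axes.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, illustrative values only
n, d = 10_000, 3
# Axis-aligned "cigar": large spread along the first axis, tiny along the rest.
cluster = rng.normal(scale=[0.20, 0.01, 0.01], size=(n, d))
# Random orthonormal matrix from the QR decomposition of a Gaussian matrix.
q, _ = np.linalg.qr(rng.normal(size=(d, d)))
# Rotate, shift toward the center of the unit hypercube and clip to [0, 1)^d.
rotated = np.clip(cluster @ q.T + 0.5, 0.0, 1.0 - 1e-9)
```

After the transform the cluster is no longer aligned with any original axis, which is exactly the situation the rotated datasets of Figures 5p-5r exercise.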

Notice that MrCC is limited by the size of the clusters it finds. Our method analyzes the point distribution in specific regions of the "full-dimensional" data space using a statistical hypothesis test to identify β-clusters, which lead to the clusters. However, these regions must contain a minimum number of points to reject the null hypothesis. Thus, MrCC may miss clusters with few points that exist in low-dimensional subspaces, since such clusters tend to be extremely sparse in spaces with several dimensions. On the other hand, the clustering results tend to improve as the number of points in the clusters increases. Therefore, MrCC is suitable for very large, multi-dimensional datasets and, as it scales linearly with the dataset size, it performs better as the dataset grows. In this way, MrCC tends to be able to partition even datasets with more than 30 axes, when they are huge.
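To illustrate why a minimum number of points is required, the sketch below uses a plain binomial tail test as a stand-in for the actual statistic of Section III (the test form, the region probability p and the counts are assumptions made for illustration only): under the null hypothesis of uniformly spread points, a region covering a fraction p of the space receives Binomial(η, p) points, and only a sufficiently large observed count yields a p-value small enough to reject the null.

```python
from scipy.stats import binom

def region_p_value(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p); a small value flags a dense region."""
    return binom.sf(k - 1, n, p)

# 40 points observed where ~10 are expected (a 4x over-density): the p-value
# is vanishingly small, the null is rejected and the region may seed a β-cluster.
print(region_p_value(40, 100_000, 1e-4))

# The same 4x over-density with only 2 points observed (0.5 expected) gives a
# p-value of roughly 0.09, which is not significant at the usual levels; this
# is why clusters with very few points in sparse subspaces may be missed.
print(region_p_value(2, 100_000, 5e-6))
```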
