Advancing Data Clustering via Projective Clustering Ensembles

Francesco Gullo
DEIS Dept., University of Calabria
87036 Rende (CS), Italy
[email protected]

Carlotta Domeniconi
Dept. of Computer Science, George Mason University
22030 Fairfax, VA, USA
[email protected]

Andrea Tagarelli
DEIS Dept., University of Calabria
87036 Rende (CS), Italy
[email protected]

ABSTRACT

Projective Clustering Ensembles (PCE) are a recent advance in data clustering research that combines two powerful tools: clustering ensembles and projective clustering. Specifically, PCE enables clustering ensemble methods to handle ensembles composed of projective clustering solutions. PCE has been formalized as an optimization problem with either a two-objective or a single-objective function. Two-objective PCE has been shown to generally produce more accurate clustering results than its single-objective counterpart, although it can handle the object-based and feature-based cluster representations only independently of one another. Moreover, neither of the early formulations of PCE follows any of the standard approaches to clustering ensembles, namely instance-based, cluster-based, and hybrid. In this paper, we propose an alternative formulation of the PCE problem which overcomes the above issues. We investigate the drawbacks of the early formulations of PCE and define a new single-objective formulation of the problem. This formulation is capable of treating the object- and feature-based cluster representations as a whole, essentially tying them together in a distance computation between a projective clustering solution and a given ensemble. We propose two cluster-based algorithms for computing approximate solutions to the proposed PCE formulation, which have the common merit of conforming to one of the standard approaches to clustering ensembles. Experiments on benchmark datasets have shown the significance of our PCE formulation, as both of the proposed heuristics outperform existing PCE methods.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—clustering; I.2.6 [Artificial Intelligence]: Learning; I.5.3 [Pattern Recognition]: Clustering

General Terms

Algorithms, Theory, Experimentation


Keywords

Data Mining, Clustering, Clustering Ensembles, Projective Clustering, Subspace Clustering, Dimensionality Reduction, Optimization

1. INTRODUCTION

Given a set of data objects as points in a multidimensional space, clustering aims to detect a number of homogeneous, well-separated subsets (clusters) of data in an unsupervised way [18]. After more than four decades, a considerable corpus of methods and algorithms has been developed for data clustering, focusing on different aspects such as data types, algorithmic features, and application targets [14]. In the last few years, there has been an increased interest in developing advanced tools for data clustering. In this respect, clustering ensembles and projective clustering represent two of the most important directions of research.

Clustering ensemble methods [28, 13, 36, 29, 17] aim to extract a "consensus" clustering from a set (ensemble) of clustering solutions. The input ensemble is typically generated by varying one or more aspects of the clustering process, such as the clustering algorithm, the parameter setting, and the number of features, objects, or clusters. The output consensus clustering is usually obtained using instance-based, cluster-based, or hybrid methods. Instance-based methods require a notion of distance measure to directly compare the data objects in the ensemble solutions; cluster-based methods exploit a meta-clustering approach; and hybrid methods attempt to combine the first two approaches based on hybrid bipartite graph clustering.

Projective clustering [32, 35, 30, 34] aims to discover clusters that correspond to subsets of the input data and have different (possibly overlapping) dimensional subspaces associated with them. Projected clusters tend to be less noisy, because each group of data is represented in a subspace that does not contain irrelevant dimensions, and more understandable, because the exploration of a cluster is easier when only a few dimensions are involved.

Clustering ensembles and projective clustering hence address two major issues in data clustering: projective clustering deals with the high dimensionality of data, whereas clustering ensembles handle the lack of a-priori knowledge on clustering targets. The first issue arises due to the sparsity that naturally occurs in high-dimensional data representations.

As such, it is unlikely that all features are equally relevant to form meaningful clusters. The second issue is related to the fact that there are usually many aspects that characterize the targets of a clustering task; however, due to the algorithmic peculiarities of any particular clustering method, a single clustering solution may not be able to capture all facets of a given clustering problem.

In [16], projective clustering and clustering ensembles are treated for the first time in a unified framework. The underlying motivation of that study is that the high-dimensionality and lack-of-a-priori-knowledge problems usually co-exist in real-world applications. To address both issues simultaneously, [16] formalizes the problem of projective clustering ensembles (PCE): the objective is to define methods that, by exploiting the information provided by an ensemble of projective clustering solutions, are able to compute a robust projective consensus clustering. PCE is formulated as an optimization problem; hence, the sought projective consensus clustering is computed as a solution to that problem. Specifically, two formulations of PCE have been proposed in [16], namely two-objective PCE and single-objective PCE. The two-objective PCE formulation consists in the simultaneous optimization of two objective functions, which separately consider the data object clustering and the feature-to-cluster assignment. A well-founded heuristic developed for this formulation of PCE (called MOEA-PCE) has been found to be particularly accurate, although it has drawbacks concerning efficiency, parameter setting, and interpretability of results. By contrast, the single-objective PCE formulation embeds the object-based and feature-based representations of candidate clusters into one objective function. Apart from resting on a weaker formulation than two-objective PCE, the heuristic developed for single-objective PCE (called EM-PCE) is outperformed by the two-objective method in terms of effectiveness, while being more efficient.

Both the early formulations of PCE have their own drawbacks and advantages; however, neither of them follows any of the common approaches to clustering ensembles, i.e., the aforementioned instance-based, cluster-based, and hybrid approaches. This may limit the versatility of such early formulations of PCE and, eventually, their comparability with existing ways of solving clustering ensemble problems, at least in terms of the experience gained in real-world scenarios. Besides this common shortcoming, an even more serious weakness concerns the inability of two-objective PCE to treat the object-based and feature-based cluster representations as interrelated. This may in principle lead to projective consensus clustering solutions that contain conceptual flaws in their cluster composition.

In this work, we face all the above issues by revisiting the PCE problem. For this purpose, we pursue a different approach to the study of PCE, focusing on the development of methods that are closer to the standard clustering ensemble methods. By providing an insight into the theoretical foundations of the early two-objective PCE formulation, we show its weaknesses and propose a new single-objective formulation of PCE. The key idea underlying our proposal is to define a function that measures the distance of any projective clustering solution from a given ensemble, where the object-based and feature-based cluster representations are considered as a whole.
The new formulation enables the development of heuristic algorithms that are easy to define

and, at the same time, are well-founded, as they can exploit the corpus of research results obtained by the majority of existing clustering ensemble methods. In particular, we investigate the opportunity of adapting each of the various approaches to clustering ensembles to the new PCE problem. We define two heuristics that follow a cluster-based approach, namely Cluster-Based Projective Clustering Ensembles (CB-PCE) and a faster variant called Fast Cluster-Based Projective Clustering Ensembles (FCB-PCE). We show not only the suitability of the proposed heuristics to the PCE context, but also their advantages in terms of computational complexity w.r.t. the early formulations of PCE. Moreover, based on an extensive experimental evaluation, we assessed the effectiveness and efficiency of the proposed algorithms, and found that both outperform the early PCE methods in terms of accuracy of the projective consensus clustering. In addition, FCB-PCE turns out to be faster than the early two-objective PCE method, and comparable to or even faster than the early single-objective PCE method in the online phase.

The rest of the paper is organized as follows. Section 2 provides background on clustering ensembles, projective clustering, and the PCE problem. Section 3 describes our new formulation of PCE and presents the two developed heuristics, along with an analysis of their computational complexity. Section 4 contains the experimental evaluation and results. Finally, Section 5 concludes the paper.

2. BACKGROUND

2.1 Clustering Ensembles (CE)

Given a set D of data objects, a clustering solution defined over D is a partition of D into a number of groups, i.e., clusters. A set of clustering solutions defined over the same set D of data objects is called an ensemble. Given an ensemble defined over D, the goal of CE is to derive a consensus clustering, i.e., a (new) partition of D obtained by suitably exploiting the information available from the ensemble. The earliest CE methods aim to explicitly solve the label correspondence problem, i.e., to find a correspondence between the cluster labels across the clusterings of the ensemble [10, 11, 12]. These approaches typically suffer from efficiency issues. More refined methods fall into the instance-based, cluster-based, and hybrid categories.

2.1.1 Instance-based CE

Instance-based CE methods perform a direct comparison between data objects. Typically, instance-based methods operate on the co-occurrence or co-association matrix W, which encodes the pairwise object similarities according to the information available from the ensemble. For each pair of objects (o′, o″), the matrix W stores the number of solutions of the ensemble in which o′ and o″ are assigned to the same cluster, divided by the size of the ensemble. Instance-based methods derive the final consensus clustering by applying one of the following strategies: (i) performing an additional clustering step based on W, using this matrix either as a new data matrix [20] or as a pairwise similarity matrix fed to a specific clustering algorithm [13, 22, 15]; (ii) constructing a weighted graph based on W and partitioning the graph according to well-established graph-partitioning algorithms [28, 3, 29].
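To make the construction concrete, the following is a minimal Python sketch (with illustrative names, not from the paper) of how W can be built from an ensemble of hard clusterings:

```python
# Minimal sketch (illustrative, not from the paper) of the co-association
# matrix W built from an ensemble of hard clusterings over the same objects.
import numpy as np

def co_association(ensemble_labels):
    """W[i, j] = fraction of ensemble solutions placing objects i and j in
    the same cluster."""
    labels = np.asarray(ensemble_labels)      # shape: (n_solutions, n_objects)
    n_solutions, n_objects = labels.shape
    W = np.zeros((n_objects, n_objects))
    for l in labels:
        W += (l[:, None] == l[None, :])       # 1 where i and j share a cluster
    return W / n_solutions

W = co_association([[0, 0, 1, 1, 2],
                    [0, 0, 0, 1, 1],
                    [1, 1, 2, 2, 0]])
print(W[0, 1])                                # 1.0: always co-clustered
```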

2.1.2 Cluster-based CE

Cluster-based CE rests on the principle of "clustering clusters" [7, 28, 6]. The key idea is to apply a clustering algorithm to the set of clusters that belong to the clustering solutions in the ensemble, in order to compute a set of metaclusters (i.e., sets of clusters). The consensus clustering is finally computed by assigning each data object to the metacluster that maximizes a specific criterion, such as the commonly used majority voting, which assigns each data object o to the metacluster that contains the maximum number of clusters to which o belongs.
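A minimal sketch of majority voting under these definitions, assuming hard clusterings and plain Python data structures (illustrative only):

```python
# Minimal sketch of majority voting (illustrative data structures): each
# metacluster is a list of clusters, each cluster a set of object ids.
def majority_voting(metaclusters, n_objects):
    assignment = []
    for o in range(n_objects):
        votes = [sum(1 for cluster in M if o in cluster) for M in metaclusters]
        assignment.append(votes.index(max(votes)))
    return assignment

M0 = [{0, 1}, {0, 1, 2}]              # clusters grouped into metacluster 0
M1 = [{2, 3}, {3}]                    # clusters grouped into metacluster 1
print(majority_voting([M0, M1], 4))   # [0, 0, 0, 1] (object 2: tie, first wins)
```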

2.1.3 Hybrid CE

Hybrid CE methods combine ideas from the instance-based and cluster-based approaches. The objective is to build a hybrid bipartite graph whose vertices are the data objects on one side and the clusters on the other. For each object o and cluster C, the edge (o, C) of the bipartite graph usually assumes a unit weight if the object o belongs to the cluster C according to the clustering solution that includes C, and zero weight otherwise [36]. Some methods use weights in the range [0, 1], which express the probability that object o belongs to cluster C [29]. The consensus clustering of hybrid CE methods is obtained by partitioning the bipartite graph according to well-established methods (e.g., METIS [19]); the nodes representing clusters are then filtered out from the resulting graph partition.
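As a rough illustration (ours, with hypothetical helper names), the bipartite graph can be encoded as a biadjacency matrix with unit weights for the hard case:

```python
# Rough sketch of the hybrid bipartite graph as a biadjacency matrix:
# rows are objects, columns are the clusters of all ensemble solutions,
# with unit weights for hard membership as in [36].
import numpy as np

def biadjacency(ensemble_labels, n_objects):
    columns = []
    for labels in ensemble_labels:            # one solution at a time
        for c in sorted(set(labels)):         # one column per cluster
            columns.append([1.0 if labels[o] == c else 0.0
                            for o in range(n_objects)])
    return np.array(columns).T                # shape: (|D|, total #clusters)

B = biadjacency([[0, 0, 1], [1, 0, 0]], 3)    # 2 solutions, 3 objects
print(B.shape)                                # (3, 4); graph then cut with, e.g., METIS [19]
```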

2.2 Projective Clustering (PC)

Let D be a set of data objects, where each o ∈ D is defined on a feature space F = {1, ..., |F|}. A projective cluster C defined over D is a pair ⟨Γ_C, Δ_C⟩ such that:

• Γ_C denotes the object-based representation of C. It is a |D|-dimensional real-valued vector whose component Γ_{C,o} ∈ [0, 1], ∀o ∈ D, represents the object-to-cluster assignment of o to C, i.e., the probability Pr(C|o) that object o belongs to C;

• Δ_C denotes the feature-based representation of C. It is an |F|-dimensional real-valued vector whose component Δ_{C,f} ∈ [0, 1], ∀f ∈ F, represents the feature-to-cluster assignment of the feature f to C, i.e., the probability Pr(f|C) that feature f belongs to the subspace of features associated with C.

Note that the above definition covers all types of projective clusters handled by existing PC algorithms. In fact, both soft and hard object-to-cluster assignments are taken into account (the assignment is hard when Γ_{C,o} ∈ {0, 1} rather than [0, 1], ∀o ∈ D). Similarly, feature-to-cluster assignments may be equally weighted, i.e., Δ_{C,f} = 1/R if f is recognized as relevant (where R is the number of relevant features for C), and Δ_{C,f} = 0 otherwise. This representation is suited for dealing with the output of all those PC algorithms which only select the relevant features for each cluster, without specifying any feature-to-cluster assignment probability distribution; such algorithms fall into the bottom-up [34, 25], top-down [32, 31, 2, 37, 5], and hybrid [24, 35, 1] approaches. On the other hand, the methods defined in [34, 8, 30] handle projective clusters having soft object-to-cluster assignments and/or unequally weighted feature-to-cluster assignments.

The object-based (Γ_C) and feature-based (Δ_C) representations of any projective cluster C are exploited to define

the projective cluster representation matrix (for brevity, projective matrix) X_C of C. X_C is a |D|×|F| matrix that stores, ∀o ∈ D, f ∈ F, the probability of the intersection of the events "object o belongs to C" and "feature f belongs to the subspace associated with C". Under the assumption of independence between the two events, such a probability is equal to the product of Pr(C|o) = Γ_{C,o} and Pr(f|C) = Δ_{C,f}. Hence, given D = {o_1, ..., o_|D|} and F = {1, ..., |F|}, the matrix X_C can be formally defined as:

\[
X_C =
\begin{pmatrix}
\Gamma_{C,\vec{o}_1} \Delta_{C,1} & \cdots & \Gamma_{C,\vec{o}_1} \Delta_{C,|\mathcal{F}|} \\
\vdots & \ddots & \vdots \\
\Gamma_{C,\vec{o}_{|\mathcal{D}|}} \Delta_{C,1} & \cdots & \Gamma_{C,\vec{o}_{|\mathcal{D}|}} \Delta_{C,|\mathcal{F}|}
\end{pmatrix}
\tag{1}
\]

The goal of a PC method is to derive, from an input set D of data objects, a projective clustering solution C, which is defined as a set of projective clusters that satisfy the following conditions:

\[
\sum_{C \in \mathcal{C}} \Gamma_{C,\vec{o}} = 1, \;\; \forall \vec{o} \in \mathcal{D}
\qquad\text{and}\qquad
\sum_{f \in \mathcal{F}} \Delta_{C,f} = 1, \;\; \forall C \in \mathcal{C}
\]

The semantics of any projective clustering C is that for each projective cluster C ∈ C, the objects belonging to C are actually close to each other if (and only if) they are projected onto the subspace associated with C.
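For concreteness, the projective matrix of (1) is just the outer product of the two representation vectors; a minimal sketch with made-up values:

```python
# A minimal sketch of (1): X_C as the outer product of Gamma_C and Delta_C.
# The numeric values below are made up for illustration.
import numpy as np

gamma_C = np.array([0.9, 0.1, 0.5])          # Gamma_C: Pr(C|o), |D| = 3
delta_C = np.array([0.7, 0.2, 0.1, 0.0])     # Delta_C: Pr(f|C), |F| = 4; sums to 1

X_C = np.outer(gamma_C, delta_C)             # X_C[o, f] = Gamma_{C,o} * Delta_{C,f}
print(X_C.shape)                             # (3, 4), i.e., |D| x |F|
```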

2.3 Projective Clustering Ensembles (PCE)

A projective ensemble E is defined as a set of projective clustering solutions. Neither information about the ensemble generation strategy (algorithms and/or setups) nor the original feature values of the objects within D are provided along with E. Moreover, each projective clustering solution in E may in general contain a different number of clusters. The goal of PCE is to derive a projective consensus clustering by exploiting the information on the projective solutions within the input projective ensemble.

2.3.1 Two-objective PCE

In [16], PCE is formulated as a two-objective optimization problem, whose objectives take into account the object-based (function Ψ_o) and the feature-based (function Ψ_f) cluster representations of a given projective ensemble E:

\[
\mathcal{C}^{*} = \arg\min_{\mathcal{C}} \, \{\Psi_o(\mathcal{C}, \mathcal{E}),\; \Psi_f(\mathcal{C}, \mathcal{E})\}
\tag{2}
\]

where

\[
\Psi_o(\mathcal{C}, \mathcal{E}) = \sum_{\hat{\mathcal{C}} \in \mathcal{E}} \overline{\psi}_o(\mathcal{C}, \hat{\mathcal{C}}),
\qquad
\Psi_f(\mathcal{C}, \mathcal{E}) = \sum_{\hat{\mathcal{C}} \in \mathcal{E}} \overline{\psi}_f(\mathcal{C}, \hat{\mathcal{C}})
\tag{3}
\]

Functions ψ̄_o and ψ̄_f are defined as ψ̄_o(C′, C″) = (ψ_o(C′, C″) + ψ_o(C″, C′))/2 and ψ̄_f(C′, C″) = (ψ_f(C′, C″) + ψ_f(C″, C′))/2, respectively, where

\[
\psi_o(\mathcal{C}', \mathcal{C}'') = \frac{1}{|\mathcal{C}'|} \sum_{C' \in \mathcal{C}'} \Big( 1 - \max_{C'' \in \mathcal{C}''} J\big(\Gamma_{C'}, \Gamma_{C''}\big) \Big)
\]

\[
\psi_f(\mathcal{C}', \mathcal{C}'') = \frac{1}{|\mathcal{C}'|} \sum_{C' \in \mathcal{C}'} \Big( 1 - \max_{C'' \in \mathcal{C}''} J\big(\Delta_{C'}, \Delta_{C''}\big) \Big)
\]

and

\[
J(\vec{u}, \vec{v}) = \frac{\vec{u} \cdot \vec{v}}{\|\vec{u}\|_2^2 + \|\vec{v}\|_2^2 - \vec{u} \cdot \vec{v}} \in [0, 1]
\]

denotes the extended Jaccard similarity coefficient (also known as the Tanimoto coefficient) between any two real-valued vectors u and v [26].
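A minimal sketch of the extended Jaccard (Tanimoto) coefficient on vectors, as used by ψ_o and ψ_f (illustrative code, not the authors' implementation):

```python
# Minimal sketch of the extended Jaccard (Tanimoto) coefficient J of [26].
import numpy as np

def tanimoto(u, v):
    """J(u, v) = u.v / (||u||^2 + ||v||^2 - u.v); lies in [0, 1] for
    non-negative vectors."""
    dot = float(np.dot(u, v))
    return dot / (np.dot(u, u) + np.dot(v, v) - dot)

print(tanimoto(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # 1.0: identical
print(tanimoto(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0: disjoint
```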

The problem defined in (2) is solved by a well-founded heuristic, in which a Pareto-based Multi-Objective Evolutionary Algorithm, called MOEA-PCE, is used to avoid combining the two objective functions into a single one.

2.3.2 Single-objective PCE

To overcome some issues of the two-objective PCE formulation (such as those concerning efficiency, parameter setting, and interpretation of the results), [16] proposes an alternative PCE formulation based on a single-objective function, which aims to consider the object-based and the feature-based cluster representations in E as a whole:

\[
\mathcal{C}^{*} = \arg\min_{\mathcal{C}} \sum_{C \in \mathcal{C}} \sum_{\vec{o} \in \mathcal{D}} \Gamma_{C,\vec{o}}^{\alpha} \sum_{f \in \mathcal{F}} \Big( \Delta_{C,f} - \sum_{\hat{\mathcal{C}} \in \mathcal{E}} \sum_{\hat{C} \in \hat{\mathcal{C}}} \Gamma_{\hat{C},\vec{o}} \, \Delta_{\hat{C},f} \Big)^{2}
\]

where α > 1 is a positive integer that ensures the non-linearity of the objective function w.r.t. Γ_{C,o}. To solve the above problem, the EM-based Projective Clustering Ensembles (EM-PCE) heuristic is defined. EM-PCE iteratively looks for the optimal values of Γ_{C,o} (resp. Δ_{C,f}) while keeping Δ_{C,f} (resp. Γ_{C,o}) fixed, until convergence.

3. CLUSTER-BASED PCE

3.1 Problem Statement

Experimental results have shown that the two-objective PCE formulation is much more accurate than its single-objective counterpart [16]. Nevertheless, two-objective PCE suffers from an important conceptual issue that has not been discussed in [16], which shows that the accuracy of two-objective PCE can be further improved. We unveil this issue in the following example.

Example: Let E be a projective ensemble defined over a set D of data objects and a set F of features. Suppose that E contains only one projective clustering solution C and that C in turn contains two projective clusters C′ and C″, whose object- and feature-based representations are different from one another, i.e., ∃ o ∈ D s.t. Γ_{C′,o} ≠ Γ_{C″,o}, and ∃ f ∈ F s.t. Δ_{C′,f} ≠ Δ_{C″,f}. Let us consider two candidate projective consensus clusterings C₁ = {C₁′, C₁″} and C₂ = {C₂′, C₂″}. We assume that C₁ = C, whereas C₂ is defined as follows. Cluster C₂′ has object- and feature-based representations given by Γ_{C′} (i.e., the object-based representation of the first cluster C′ within C) and Δ_{C″} (i.e., the feature-based representation of the second cluster C″ within C), respectively; cluster C₂″ has object- and feature-based representations given by Γ_{C″} (i.e., the object-based representation of the second cluster C″ within C) and Δ_{C′} (i.e., the feature-based representation of the first cluster C′ within C), respectively. According to (3), it is easy to see that:

\[
\Psi_o(\mathcal{C}_1, \mathcal{E}) = \Psi_o(\mathcal{C}_2, \mathcal{E}) = 0
\qquad\text{and}\qquad
\Psi_f(\mathcal{C}_1, \mathcal{E}) = \Psi_f(\mathcal{C}_2, \mathcal{E}) = 0
\]

Both the candidates C₁ and C₂ minimize the objectives of the early two-objective PCE formulation reported in (2), and hence they are both recognized as optimal solutions. This conclusion is conceptually wrong, because only C₁ should be recognized as an optimal solution, since only C₁ is exactly equal to the unique solution of the ensemble. Conversely, C₂ is not well-representative of the ensemble E, as the object- and feature-based representations of its clusters are inversely associated with each other w.r.t. the associations present in C. Indeed, in C₂, C₂′ = ⟨Γ_{C′}, Δ_{C″}⟩ and C₂″ = ⟨Γ_{C″}, Δ_{C′}⟩, whereas the solution C ∈ E is such that C′ = ⟨Γ_{C′}, Δ_{C′}⟩ and C″ = ⟨Γ_{C″}, Δ_{C″}⟩.

The issue described in the above Example arises because the two-objective PCE formulation ignores that the object-based and feature-based representations of any projective cluster are strictly coupled to each other and hence need to be considered as a whole. In other words, in order to effectively evaluate the quality of a candidate projective consensus clustering, the objective functions Ψ_o and Ψ_f cannot be kept separate from each other. We attempt to solve the above drawback by proposing the following alternative formulation of PCE, which is based on a single objective function:

\[
\mathcal{C}^{*} = \arg\min_{\mathcal{C}} \, \Psi_{of}(\mathcal{C}, \mathcal{E})
\tag{4}
\]

where Ψ_of is a function designed to measure the "distance" of any well-defined projective clustering solution C from E in terms of both data clustering and feature-to-cluster assignment. To carefully take efficiency into account, we define Ψ_of based on an asymmetric function, which has been derived by adapting the measure defined in [16] to our setting:

\[
\Psi_{of}(\mathcal{C}, \mathcal{E}) = \sum_{\hat{\mathcal{C}} \in \mathcal{E}} \overline{\psi}_{of}(\mathcal{C}, \hat{\mathcal{C}})
\tag{5}
\]

where

\[
\overline{\psi}_{of}(\mathcal{C}', \mathcal{C}'') = \frac{1}{2} \big( \psi_{of}(\mathcal{C}', \mathcal{C}'') + \psi_{of}(\mathcal{C}'', \mathcal{C}') \big)
\tag{6}
\]

and

\[
\psi_{of}(\mathcal{C}', \mathcal{C}'') = \frac{1}{|\mathcal{C}'|} \sum_{C' \in \mathcal{C}'} \Big( 1 - \max_{C'' \in \mathcal{C}''} \hat{J}\big(X_{C'}, X_{C''}\big) \Big)
\tag{7}
\]

In (7), the similarity between any pair C′, C″ of projective clusters is computed in terms of their corresponding projective matrices X_{C′} and X_{C″} (cf. (1), Sect. 2.2). For this purpose, the Tanimoto similarity coefficient can easily be generalized to operate on real-valued matrices (rather than vectors):

\[
\hat{J}(X, \hat{X}) = \frac{\sum_{i=1}^{|rows(X)|} X_i \cdot \hat{X}_i}{\|X\|_2^2 + \|\hat{X}\|_2^2 - \sum_{i=1}^{|rows(X)|} X_i \cdot \hat{X}_i}
\tag{8}
\]

where X_i · X̂_i denotes the scalar product between the i-th rows of matrices X and X̂. From a dissimilarity viewpoint, as Ĵ ∈ [0, 1], we adopt in this work the measure 1 − Ĵ, which we hereinafter refer to as the Tanimoto distance.

It can be noted that the proposed formulation based on the function Ψ_of fulfils the requirement of measuring the quality of a candidate consensus clustering in terms of both data clustering and feature-to-cluster assignments as a whole. In particular, we remark that the issue described in the previous Example does not arise in the proposed formulation. Indeed, considering again the two candidate projective consensus clusterings C₁ and C₂ of the Example, it is easy to see that:

\[
\Psi_{of}(\mathcal{C}_1, \mathcal{E}) = 0
\qquad\text{and}\qquad
\Psi_{of}(\mathcal{C}_2, \mathcal{E}) > 0
\]

Thus, C₁ is correctly recognized as an optimal solution, whereas C₂ is not.
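A minimal sketch of the matrix Tanimoto coefficient (8) and of the resulting distance Ψ_of of (5)-(7), where clusters are modeled by their projective matrices (illustrative code; the names are ours):

```python
# Minimal sketch (ours) of the matrix Tanimoto coefficient of (8) and of the
# ensemble distance Psi_of of (5)-(7). A clustering solution is modeled as a
# list of projective matrices X_C.
import numpy as np

def tanimoto_mat(X, Y):
    """J_hat of (8): the sum of row-wise scalar products over the squared
    norms; equals the vector Tanimoto applied to the flattened matrices."""
    dot = float(np.sum(X * Y))                    # sum_i X_i . Y_i
    return dot / (np.sum(X ** 2) + np.sum(Y ** 2) - dot)

def psi_of_asym(C1, C2):
    """Asymmetric term (7): for each cluster of C1, the Tanimoto distance
    1 - J_hat to its best match in C2, averaged over C1."""
    return np.mean([min(1.0 - tanimoto_mat(X1, X2) for X2 in C2) for X1 in C1])

def psi_of(C1, C2):
    return 0.5 * (psi_of_asym(C1, C2) + psi_of_asym(C2, C1))   # (6)

def Psi_of(C, ensemble):
    return sum(psi_of(C, C_hat) for C_hat in ensemble)         # (5)
```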

3.2 Heuristics

Apart from solving the critical issue of two-objective PCE explained above, a major advantage of the proposed PCE formulation w.r.t. the early ones defined in [16] is its close relationship to the classic formulations typically employed by CE algorithms. Like standard CE, the problem defined in (4) can be straightforwardly proved to be a special version of the median partition problem [4], which is defined as follows: given a number of partitions (clusterings) defined over the same set of objects and a distance measure between partitions, find a (new) clustering that minimizes the distance from all the input clusterings. The only difference between (4) and any standard CE formulation is that the former deals with projective clustering solutions (and hence needs a new measure for comparing projective clusterings), whereas the latter involves standard clustering solutions.

The closeness to CE is a key point of our work, as it enables the development of heuristic algorithms for PCE that follow standard approaches to CE. The advantage in this respect is twofold: heuristics for PCE can exploit the extensive and well-established work given so far for standard CE, which enables the development of solutions that are simple and easy to understand, and effective at the same time. Within this view, a reasonable choice for defining proper heuristics for PCE is to adapt the standard CE approaches, i.e., instance-based, cluster-based, and hybrid (cf. Sect. 2.1), to the PCE context. However, it is arguable whether all such CE approaches are well-suited for PCE. In fact, defining an instance-based PCE method is intrinsically tricky, and this also holds for the hybrid approach, which is essentially a combination of the instance-based and cluster-based ones. We explain the issues in defining instance-based PCE in the following.

First, as the focus of any hypothetical instance-based PCE is primarily on data objects, performing the two PCE steps of data clustering and feature-to-cluster assignment altogether would be hard. Indeed, focusing on data objects may produce information about data clustering only (for instance, by exploiting a co-occurrence matrix properly redefined for the PCE context). This would force the assignment of the features to the various clusters to be performed in a separate step, and only once the objects have been grouped into clusters. Unfortunately, performing the two PCE steps of data clustering and feature-to-cluster assignment distinctly may negatively affect the accuracy of the output consensus clustering: according to the definition of projective clustering, the information about the objects belonging to any projective cluster should not be interpreted as absolute, but always in relation to the subspace associated with that cluster, and vice versa. Thus, data clustering and feature-to-cluster assignment should be interrelated at each step of the heuristic algorithm to be defined.

A more crucial issue arises even if one accepts to perform data clustering and feature-to-cluster assignment separately. Given a set of data objects to be included in any projective cluster, the feature-to-cluster assignment process should take into account that the notion of subspace of any given projective cluster makes sense only if it refers to the whole set of objects belonging to that cluster.
In other words, saying that any set of data objects forms a cluster C having a subset S of features associated with it does not mean that each object within C is represented by S, but rather that

Algorithm 1 CB-PCE
Input: a projective ensemble E; the number K of clusters in the output projective consensus clustering
Output: the projective consensus clustering C*
1: Φ_E ← ⋃_{Ĉ ∈ E} Ĉ
2: P ← pairwiseClusterDistances(Φ_E)   {(8)}
3: M ← metaclusters(Φ_E, P, K)
4: C* ← ∅
5: for all M ∈ M do
6:    Γ*_M ← object-basedRepresentation(Φ_E, M)   {(12)}
7:    Δ*_M ← feature-basedRepresentation(Φ_E, M)   {(22)}
8:    C* ← C* ∪ {⟨Γ*_M, Δ*_M⟩}
9: end for

the entire set C is represented by S. Unfortunately, performing feature-to-cluster assignment apart from data clustering contrasts with the semantics of a subspace associated with a set of objects in a projective cluster. Indeed, the various features could be assigned to any given cluster C only by considering the objects within C independently of one another. Consider, for example, the case where the assignment is performed by averaging over the objects within C and over the feature-based representations of all the clusters within the ensemble E, i.e., Δ_{C,f} = avg_{o ∈ C, Ĉ ∈ E, Ĉ ∈ Ĉ} { Γ_{Ĉ,o} × Δ_{Ĉ,f} }, ∀f ∈ F. This case clearly shows that each feature f is assigned to C by considering each object within C independently from the other objects belonging to C.

Within this view, we discard the instance-based and hybrid approaches and embrace a cluster-based approach. In the following, we describe our cluster-based proposal in detail and also show why it is particularly appropriate to the PCE context.

3.2.1 The CB-PCE algorithm

The Cluster-Based Projective Clustering Ensembles (CB-PCE) algorithm is proposed as a heuristic approach to the PCE formulation given in (4). In addition to the notation provided in Sect. 2, CB-PCE employs the following symbols: M denotes a set of metaclusters (i.e., a set of sets of clusters), M ∈ M denotes a metacluster (i.e., a set of clusters), and M ∈ M denotes a cluster (i.e., a set of data objects).

The outline of CB-PCE is reported in Alg. 1. Similarly to standard cluster-based CE, the first step of CB-PCE aims to group the set Φ_E of the clusters from all the solutions within the input ensemble E into metaclusters (Lines 1-3). A clustering step over the set Φ_E is performed by the function metaclusters (Line 3), which exploits the matrix P of pairwise distances between the clusters within Φ_E (Line 2); the distance between any pair of clusters is the Tanimoto distance based on the coefficient reported in (8). The set M of metaclusters is finally exploited to derive the object- and feature-based representations of each projective cluster to be included in the output consensus clustering C* (Lines 4-9). Such representations are denoted by Γ*_M and Δ*_M, ∀M ∈ M, respectively; more precisely, Γ*_M (resp. Δ*_M) denotes the object-based (resp. feature-based) representation of the projective cluster within C* corresponding to the metacluster M. Γ*_M and Δ*_M are derived by optimizing a criterion that is easy to solve, which enables finding approximations that are reasonable and effective at the same time. In particular, we adapt the widely used majority voting to the context at hand. Let us first consider the Γ*_M values.

If the projective clustering solutions within the ensemble are all hard at the clustering level, the majority voting criterion leads to the definition of the following optimization problem:

\[
\{\Gamma^{*}_{\mathcal{M}} \mid \mathcal{M} \in \mathbf{M}\} \;=\; \operatorname*{argmin}_{\{\Gamma_{\mathcal{M}} \mid \mathcal{M} \in \mathbf{M}\}} \; \sum_{\mathcal{M} \in \mathbf{M}} \sum_{\vec{o} \in \mathcal{D}} \Gamma_{\mathcal{M},\vec{o}} \, \frac{1}{|\mathcal{M}|} \sum_{M \in \mathcal{M}} \big(1 - \Gamma_{M,\vec{o}}\big)
\]

\[
\text{s.t.} \quad \sum_{\mathcal{M} \in \mathbf{M}} \Gamma_{\mathcal{M},\vec{o}} = 1, \;\; \forall \vec{o} \in \mathcal{D},
\qquad
\Gamma_{\mathcal{M},\vec{o}} \in \{0, 1\}, \;\; \forall \mathcal{M} \in \mathbf{M}, \forall \vec{o} \in \mathcal{D}
\]

whose solution can easily be proved to be as follows (∀M, ∀o):

\[
\Gamma^{*}_{\mathcal{M},\vec{o}} =
\begin{cases}
1 & \text{if } \mathcal{M} = \operatorname*{argmin}_{\mathcal{M}' \in \mathbf{M}} \frac{1}{|\mathcal{M}'|} \sum_{M \in \mathcal{M}'} \big(1 - \Gamma_{M,\vec{o}}\big) \\[2mm]
0 & \text{otherwise}
\end{cases}
\]

that is, each object o is assigned to the metacluster M containing the maximum number of clusters to which o belongs (i.e., such that Γ_{M,o} = 1). If the ensemble contains projective clusterings that are soft at the clustering level, the following problem can be defined:

\[
\{\Gamma^{*}_{\mathcal{M}} \mid \mathcal{M} \in \mathbf{M}\} = \operatorname*{argmin}_{\{\Gamma_{\mathcal{M}} \mid \mathcal{M} \in \mathbf{M}\}} Q,
\qquad
Q = \sum_{\mathcal{M} \in \mathbf{M}} \sum_{\vec{o} \in \mathcal{D}} \Gamma^{\alpha}_{\mathcal{M},\vec{o}} \, A_{\mathcal{M},\vec{o}}
\tag{9}
\]

s.t.

\[
\sum_{\mathcal{M} \in \mathbf{M}} \Gamma_{\mathcal{M},\vec{o}} = 1, \;\; \forall \vec{o} \in \mathcal{D}
\tag{10}
\]

\[
\Gamma_{\mathcal{M},\vec{o}} \geq 0, \;\; \forall \mathcal{M} \in \mathbf{M}, \forall \vec{o} \in \mathcal{D}
\tag{11}
\]

where A_{M,o} = (1/|M|) Σ_{M ∈ M} (1 − Γ_{M,o}), and α > 1 is an integer that guarantees the non-linearity of the objective function Q w.r.t. Γ_{M,o}, needed to ensure Γ*_{M,o} ∈ [0, 1] (rather than {0, 1}).¹ The solution to this problem, however, is not as straightforward as that of the traditional (hard) case. We derive it in the following.

¹Alternatively, to obtain Γ*_{M,o} ∈ [0, 1], properly defined regularization terms can be introduced (see, e.g., [21]).

Theorem 1. The optimal solution of the problem P defined in (9)-(11) is given by (∀M, ∀o):

\[
\Gamma^{*}_{\mathcal{M},\vec{o}} = \Bigg[ \sum_{\mathcal{M}' \in \mathbf{M}} \bigg( \frac{A_{\mathcal{M},\vec{o}}}{A_{\mathcal{M}',\vec{o}}} \bigg)^{\frac{1}{\alpha-1}} \Bigg]^{-1}
\tag{12}
\]

Proof. The optimal Γ*_{M,o} can be found by means of the conventional Lagrange multipliers method. To this end, we first consider the relaxed problem P′ of P obtained by temporarily discarding the inequality constraints (11) from the constraint set of P. We define the new (unconstrained) objective function Q′ for P′ as follows:

\[
Q' = Q + \sum_{\vec{o} \in \mathcal{D}} \lambda_{\vec{o}} \Bigg( \sum_{\mathcal{M}' \in \mathbf{M}} \Gamma_{\mathcal{M}',\vec{o}} - 1 \Bigg)
\tag{13}
\]

The optimal Γ*_{M,o} are computed by first retrieving the stationary points of Q′, i.e., the points for which

\[
\nabla Q' = \bigg( \frac{\partial Q'}{\partial \Gamma_{\mathcal{M},\vec{o}}}, \frac{\partial Q'}{\partial \lambda_{\vec{o}}} \bigg) = 0
\]

Thus, we solve the following system of equations:

\[
\frac{\partial Q'}{\partial \Gamma_{\mathcal{M},\vec{o}}} = \alpha \, A_{\mathcal{M},\vec{o}} \, (\Gamma_{\mathcal{M},\vec{o}})^{\alpha-1} + \lambda_{\vec{o}} = 0
\tag{14}
\]

\[
\frac{\partial Q'}{\partial \lambda_{\vec{o}}} = \sum_{\mathcal{M}' \in \mathbf{M}} \Gamma_{\mathcal{M}',\vec{o}} - 1 = 0
\tag{15}
\]

Solving (14) w.r.t. Γ_{M,o} and substituting such a solution in (15), we obtain:

\[
\sum_{\mathcal{M}' \in \mathbf{M}} \bigg( \frac{-\lambda_{\vec{o}}}{\alpha \, A_{\mathcal{M}',\vec{o}}} \bigg)^{\frac{1}{\alpha-1}} = 1
\tag{16}
\]

Solving (16) w.r.t. λ_o and substituting such a solution in (14), we obtain:

\[
\alpha \, A_{\mathcal{M},\vec{o}} \, (\Gamma_{\mathcal{M},\vec{o}})^{\alpha-1} - \Bigg[ \sum_{\mathcal{M}' \in \mathbf{M}} \bigg( \frac{1}{\alpha \, A_{\mathcal{M}',\vec{o}}} \bigg)^{\frac{1}{\alpha-1}} \Bigg]^{-(\alpha-1)} = 0
\tag{17}
\]

Finally, solving (17) w.r.t. Γ_{M,o}, we obtain a stationary point whose expression is exactly equal to that in (12):

\[
\Gamma^{*}_{\mathcal{M},\vec{o}} = \Bigg[ \sum_{\mathcal{M}' \in \mathbf{M}} \bigg( \frac{A_{\mathcal{M},\vec{o}}}{A_{\mathcal{M}',\vec{o}}} \bigg)^{\frac{1}{\alpha-1}} \Bigg]^{-1}
\tag{18}
\]

As it holds that (i) the stationary points of the Lagrangian function Q′ are also stationary points of the original objective function Q, (ii) the feasible region of P, and hence the feasible region of P′, is a convex set, and (iii) Q is convex w.r.t. Γ_{M,o}, it follows that such a stationary point represents a global minimum of Q and, accordingly, the optimal solution of P′. Moreover, as A_{M,o} ≥ 0, ∀M, ∀o, it is trivial to observe that Γ*_{M,o} ≥ 0, ∀M, ∀o. Therefore, the solution in (18) satisfies the inequality constraints that were temporarily discarded in order to define the relaxed problem P′ (cf. (11)); thus, it represents the optimal solution of the original problem P, which proves the theorem.

An analogous reasoning can be carried out for Δ*_{M,f}. In this case, the problem to be solved is the following:

\[
\{\Delta^{*}_{\mathcal{M}} \mid \mathcal{M} \in \mathbf{M}\} = \operatorname*{argmin}_{\{\Delta_{\mathcal{M}} \mid \mathcal{M} \in \mathbf{M}\}} \sum_{\mathcal{M} \in \mathbf{M}} \sum_{f \in \mathcal{F}} \Delta^{\beta}_{\mathcal{M},f} \, B_{\mathcal{M},f}
\tag{19}
\]

s.t.

\[
\sum_{f \in \mathcal{F}} \Delta_{\mathcal{M},f} = 1, \;\; \forall \mathcal{M} \in \mathbf{M}
\tag{20}
\]

\[
\Delta_{\mathcal{M},f} \geq 0, \;\; \forall \mathcal{M} \in \mathbf{M}, \forall f \in \mathcal{F}
\tag{21}
\]

where B_{M,f} = |M|⁻¹ Σ_{M ∈ M} (1 − Δ_{M,f}) and β plays the same role as α in the function Q. The solution of this problem is similar to that derived for Γ*_{M,o}:

Theorem 2. The optimal solution of the problem defined in (19)-(21) is given by the following (∀M, ∀f):

\[
\Delta^{*}_{\mathcal{M},f} = \Bigg[ \sum_{f' \in \mathcal{F}} \bigg( \frac{B_{\mathcal{M},f}}{B_{\mathcal{M},f'}} \bigg)^{\frac{1}{\beta-1}} \Bigg]^{-1}
\tag{22}
\]

Proof. Analogous to Theorem 1.
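The closed forms (12) and (22) translate directly into vectorized code; a minimal sketch (ours), assuming strictly positive A_{M,o} and B_{M,f} values and α, β > 1:

```python
# Minimal sketch (ours) of the closed-form solutions (12) and (22); it
# assumes strictly positive A_{M,o} and B_{M,f} values and alpha, beta > 1.
import numpy as np

def gamma_star(A, alpha=2.0):
    """A: (n_metaclusters, n_objects) array of A_{M,o}. Each column of the
    result sums to 1, satisfying constraint (10)."""
    q = A ** (-1.0 / (alpha - 1.0))
    return q / q.sum(axis=0, keepdims=True)

def delta_star(B, beta=2.0):
    """B: (n_metaclusters, n_features) array of B_{M,f}. Each row of the
    result sums to 1, satisfying constraint (20)."""
    q = B ** (-1.0 / (beta - 1.0))
    return q / q.sum(axis=1, keepdims=True)

A = np.array([[0.1, 0.9],              # metacluster 0 close to object 0
              [0.9, 0.1]])             # metacluster 1 close to object 1
print(gamma_star(A))                   # [[0.9, 0.1], [0.1, 0.9]]
```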

Rationale of CB-PCE. Let us now informally show that CB-PCE is well-suited for PCE, thus supporting one of the claims of this work, i.e., that cluster-based approaches (unlike instance-based and hybrid ones) are particularly appropriate to the PCE context. Looking at the PCE formulation reported in (4), it is easy to see that the function Ψ_of retrieves the consensus clustering C* so that each cluster within C* is ideally "assigned" to exactly one cluster of each projective clustering solution in the input ensemble E, where the "assignments" are performed by minimizing the Tanimoto distance 1 − Ĵ (cf. (8)). Thus, considering all the solutions in the ensemble, any cluster C ∈ C* is assigned to a set of clusters (metacluster) M that contains exactly one cluster of each solution in the ensemble, that is, |M| = |E| and M′ ∈ C ∧ M″ ∈ C ⇒ M′ = M″, ∀M′, M″ ∈ M, ∀C ∈ E. Clearly, if one knew in advance the optimal set of metaclusters to be assigned to the clusters within C*, the problem in (4) would be optimally solved by computing, for each metacluster M, the cluster C* that minimizes the Tanimoto distance from all the clusters within M, that is:

\[
C^{*} = \arg\min_{C} \sum_{M \in \mathcal{M}} \Big( 1 - \hat{J}\big(X_C, X_M\big) \Big)
\tag{23}
\]

However, it holds that: (i) the metaclusters are not known in advance, as their computation is part of the optimization process; and (ii) the problem in (23) is hard to solve, as it falls into the class of median problems in which the distance to be minimized is the Tanimoto distance, and this kind of problem has recently been proved to be NP-hard [9].

The validity of CB-PCE as a heuristic approach to the PCE formulation proposed in (4) lies in the fact that it follows exactly the scheme reported above (i.e., it first recognizes metaclusters and then assigns objects and features to metaclusters), subject to approximations that address the two critical points:

1. a sub-optimal set of metaclusters is computed by clustering the overall set of projective clusters within the ensemble, where the distance measure used for comparing clusters is the Tanimoto distance, i.e., the measure employed by the proposed formulation in (4);

2. the Γ*_M and Δ*_M values (for each metacluster M) are computed by optimizing an easy-to-solve criterion that effectively approximates the problem in (23).
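To make the overall scheme concrete, here is a minimal end-to-end sketch of Alg. 1 under our own assumptions: scipy's average-linkage hierarchical clustering stands in for the metaclusters function (any at-most-quadratic clustering of the pooled clusters would do), a solution is a list of (Γ, Δ) vector pairs, and all names are illustrative rather than the authors' implementation:

```python
# Minimal end-to-end sketch of Alg. 1 (ours). Assumes scipy for the
# meta-clustering step and strictly positive A_{M,o}, B_{M,f} values.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def tanimoto_mat(X, Y):                                       # per (8)
    dot = float(np.sum(X * Y))
    return dot / (np.sum(X ** 2) + np.sum(Y ** 2) - dot)

def cb_pce(ensemble, K, alpha=2.0, beta=2.0):
    """ensemble: list of solutions, each a list of (Gamma, Delta) pairs."""
    pool = [c for solution in ensemble for c in solution]     # Phi_E (Line 1)
    X = [np.outer(g, d) for g, d in pool]                     # projective matrices (1)
    n = len(pool)
    P = np.zeros((n, n))                                      # Tanimoto distances (Line 2)
    for i in range(n):
        for j in range(i + 1, n):
            P[i, j] = P[j, i] = 1.0 - tanimoto_mat(X[i], X[j])
    labels = fcluster(linkage(squareform(P), method='average'),
                      t=K, criterion='maxclust')              # metaclusters (Line 3)
    A, B = [], []                                             # A_{M,o} and B_{M,f}
    for m in range(1, labels.max() + 1):                      # Lines 5-9
        members = [pool[i] for i in range(n) if labels[i] == m]
        A.append(np.mean([1.0 - g for g, _ in members], axis=0))
        B.append(np.mean([1.0 - d for _, d in members], axis=0))
    A, B = np.array(A), np.array(B)
    qA = A ** (-1.0 / (alpha - 1.0))                          # closed form (12)
    qB = B ** (-1.0 / (beta - 1.0))                           # closed form (22)
    return qA / qA.sum(axis=0), qB / qB.sum(axis=1, keepdims=True)
```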

3.2.2 Speeding up CB-PCE: FCB-PCE

Given a set D of data objects and a set F of features, the computational complexity of the measure Ĵ reported in (8) (used for computing the similarity between two projective clusters) is O(|D| |F|), as it involves a comparison between two |D|×|F| matrices. For efficiency purposes, we can lower the complexity by defining an alternative measure working in O(|D| + |F|). Given any two projective clusters C′ and C″, such a measure, called Ĵ_fast, exploits the object-based (Γ_{C′} and Γ_{C″}) and feature-based (Δ_{C′} and Δ_{C″}) representation vectors of C′ and C″, respectively, rather than their corresponding projective matrices. Formally:

\[
\hat{J}_{fast}(C', C'') = \frac{1}{2} \Big( \hat{J}\big(\Gamma_{C'}, \Gamma_{C''}\big) + \hat{J}\big(\Delta_{C'}, \Delta_{C''}\big) \Big)
\tag{24}
\]

where Ĵ(·,·) again denotes the Tanimoto similarity coefficient defined in (8), in this case applied to real-

valued vectors rather than matrices. It is easy to observe that, like Ĵ, Ĵ_fast ∈ [0, 1].

Based on Ĵ_fast, we define a version of the CB-PCE algorithm that is identical to the one defined in Sect. 3.2.1, except for the measure used for comparing the projective clusters, which is in this case Ĵ_fast. We hereinafter refer to this alternative version of the algorithm as the Fast Cluster-Based Projective Clustering Ensembles (FCB-PCE) algorithm.

Although clearly advantageous in terms of efficiency, FCB-PCE has a major drawback concerning accuracy: the measure Ĵ_fast is less accurate than its slower counterpart Ĵ exploited by CB-PCE. This essentially depends on the fact that comparing any two projective clusters C′ and C″ through their projective matrices X_{C′} and X_{C″} is generally more effective than comparing their object- and feature-based representation vectors Γ_{C′}, Γ_{C″}, Δ_{C′}, and Δ_{C″} [23].² Indeed, although it can be trivially proved that X_{C′} = X_{C″} ⇔ Γ_{C′} = Γ_{C″} ∧ Δ_{C′} = Δ_{C″}, the vector pairs (Γ_{C′}, Δ_{C′}) and (Γ_{C″}, Δ_{C″}) are in general a factorization of the matrices X_{C′} and X_{C″}, respectively (i.e., X_{C′} = Γᵀ_{C′} Δ_{C′} and X_{C″} = Γᵀ_{C″} Δ_{C″}). Thus, only the matrices X_{C′} and X_{C″} provide the whole information about the representation of the corresponding projective clusters.

Although Ĵ_fast is less accurate than Ĵ, it still allows the comparison of projective clusters by taking into account their object- and feature-based representations altogether. Hence, the proposed FCB-PCE heuristic based on Ĵ_fast still represents a valuable heuristic for the PCE formulation proposed in this work, as it overcomes the main issue of two-objective PCE explained in Sect. 3.1.

²[23] deals with hard projective clusters; however, the reasoning therein can easily be extended to the soft case.
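A one-line sketch of Ĵ_fast (24), reusing the tanimoto() helper sketched in Sect. 2.3.1:

```python
# Minimal sketch of J_fast (24): O(|D| + |F|) per comparison instead of the
# O(|D| |F|) cost of the matrix version; reuses tanimoto() from the earlier sketch.
def tanimoto_fast(gamma1, delta1, gamma2, delta2):
    return 0.5 * (tanimoto(gamma1, gamma2) + tanimoto(delta1, delta2))
```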

3.2.3 Computational Analysis

Here we discuss the computational complexity of the proposed CB-PCE and FCB-PCE algorithms. We are given: a set D of data objects, each one defined over a feature space F; a projective ensemble E defined over D and F; and a positive integer K representing the number of clusters in the output projective consensus clustering. We also assume that the size |C| of each solution C in E is O(K). For both algorithms, we may distinguish three steps:

1. pre-processing: the computation of the pairwise distances between clusters, involving the measure Ĵ (cf. (8)) for CB-PCE and Ĵ_fast (cf. (24)) for FCB-PCE; this step takes O(K² |E|² |D| |F|) for CB-PCE and O(K² |E|² (|D| + |F|)) for FCB-PCE, because computing Ĵ (resp. Ĵ_fast) is O(|D| |F|) (resp. O(|D| + |F|)) (cf. Sect. 3.2.2), and the clusters to be compared to each other are O(K |E|);

2. meta-clustering: the clustering of the O(K |E|) clusters of all the solutions in the ensemble; assuming a clustering algorithm that is at most quadratic w.r.t. the size of the dataset to be partitioned, this step takes O(K² |E|²) for both CB-PCE and FCB-PCE;

3. post-processing: the assignment of objects and features to the metaclusters, which is exactly the same for both CB-PCE and FCB-PCE.

Table 1: Computational complexities

                total                        online                          offline
MOEA-PCE        O(I t K² |E| (|D| + |F|))    O(I t K² |E| (|D| + |F|))       —
EM-PCE          O(K |E| |D| |F|)             O(I K |D| |F|)                  O(K |E| |D| |F|)
CB-PCE          O(K² |E|² |D| |F|)           O(K |E| (K |E| + |D| + |F|))    O(K² |E|² |D| |F|)
FCB-PCE         O(K² |E|² (|D| + |F|))       O(K |E| (K |E| + |D| + |F|))    O(K² |E|² (|D| + |F|))

According to (12) and (22), both the object and the feature assignments need to look up all the clusters in each metacluster only once; thus, for each object and for each feature, this step costs O(K |E|). Accordingly, performing it for all objects and features leads to a total cost of O(K |E| (|D| + |F|)) for the entire post-processing step.

It can be noted that the first step is an offline phase, i.e., a phase to be performed only once in case of a multi-run execution, whereas the second and third steps are online. Thus, as summarized in Table 1 (where we also report the complexities of the earlier MOEA-PCE and EM-PCE methods defined in [16]; I denotes the number of iterations to convergence, for MOEA-PCE and EM-PCE, whereas t is the population size, for MOEA-PCE only), we can finally state that:

• the offline, online, and total (i.e., offline + online) complexities of CB-PCE are O(K² |E|² |D| |F|), O(K |E| (K |E| + |D| + |F|)), and O(K² |E|² |D| |F|), respectively;

• the offline, online, and total (i.e., offline + online) complexities of FCB-PCE are O(K² |E|² (|D| + |F|)), O(K |E| (K |E| + |D| + |F|)), and O(K² |E|² (|D| + |F|)), respectively.

Interpretation of the complexity results. Let us now provide an insight into the comparison between the (total) complexities derived above. For the sake of readability, we hereinafter omit the suffix "-PCE" from the names of the various PCE algorithms, and we denote with r(a1, a2) the ratio between the complexities of the PCE algorithms a1 and a2; clearly, a ratio smaller (resp. greater) than 1 means that the complexity of a1 is smaller (resp. greater) than that of a2. Our main observations are summarized in the following.

• As expected, FCB-PCE is always faster than CB-PCE, as it holds that r(FCB, CB) = (|D| + |F|)/(|D| |F|) ≤ 1, ∀ |D|, |F| > 1.

• CB-PCE:
  – it holds that r(CB, EM) = K |E| > 1; thus, CB-PCE is always slower than EM-PCE;
  – the ratio r(CB, MOEA) is equal to (|E| |D| |F|)/(I t (|D| + |F|)). This implies that r(CB, MOEA) < 1 if (2 |D| |F|)/(|D| + |F|) < 2 I t/|E|, i.e., since (|D| + |F|)/2 ≥ (2 |D| |F|)/(|D| + |F|), that r(CB, MOEA) < 1 if |D| + |F| < 4 I t/|E|. The latter condition is true only in a small number of real cases; as an example, considering the numerical values for I, t, and |E| suggested in [16] (i.e., 200, 30, and 200, respectively), CB-PCE is faster than MOEA-PCE only if |D| + |F| < 120, i.e., when the input dataset is very small and/or low-dimensional. For this reason, CB-PCE can in practice be regarded as always slower than MOEA-PCE.

• FCB-PCE:
  – it holds that the ratio r(FCB, EM) = (K |E| (|D| + |F|))/(|D| |F|) is greater than 1 if (2 |D| |F|)/(|D| + |F|) < 2 K |E|, which essentially means that FCB-PCE is slower than EM-PCE if |D| + |F| < 4 K |E|, since (|D| + |F|)/2 ≥ (2 |D| |F|)/(|D| + |F|). Thus, for large and/or high-dimensional datasets (i.e., datasets having |D| and |F| such that |D| + |F| > 4 K |E|), FCB-PCE may be faster than EM-PCE, whereas for small and/or low-dimensional datasets it may not;
  – r(FCB, MOEA) = |E|/(I t); setting t equal to 15% of the ensemble size |E|, as suggested in [16], yields r(FCB, MOEA) = 20/(3 I). Thus, as it typically holds that I ≫ 7 (e.g., in [16] I = 200), r(FCB, MOEA) is always smaller than 1 and, therefore, FCB-PCE is always faster than MOEA-PCE.

To summarize, we can state that CB-PCE is the slowest method, whereas FCB-PCE is faster than MOEA-PCE and, compared to EM-PCE, faster (resp. slower) for large (resp. small) and/or high-dimensional (resp. low-dimensional) datasets.

4. EXPERIMENTAL EVALUATION

We conducted an experimental evaluation to assess the accuracy and efficiency of the consensus clusterings obtained by the proposed CB-PCE and FCB-PCE. The comparison also involved the existing PCE algorithms (i.e., MOEA-PCE and EM-PCE) [16] as baseline methods.⁴

4.1 Evaluation methodology

Following [16], we used eight benchmark datasets from the UCI Machine Learning Repository [27], namely Iris, Wine, Glass, Ecoli, Yeast, Segmentation, Abalone, and Letter, and two time-series datasets from the UCR Time Series Classification/Clustering Page [33], namely Tracedata and ControlChart. Table 2 reports the main characteristics of the datasets; the interested reader is referred to [27, 33] for a description of the datasets.

⁴Experiments were conducted on a quad-core Intel Pentium IV 3GHz platform with 4GB memory, running Microsoft Windows XP Pro.

Table 2: Datasets used in the experiments

dataset         objects   attributes   classes
Iris                150            4         3
Wine                178           13         3
Glass               214           10         6
Ecoli               327            7         5
Yeast             1,484            8        10
Segmentation      2,310           19         7
Abalone           4,124            7        17
Letter            7,648           16        10
Tracedata           200          275         4
ControlChart        600           60         6


4.1.1 Ensemble generation

We generated ensembles as suggested in [16]. In particular, for each set of experiments and each dataset we considered 20 different ensembles; all the results presented in the following refer to averages over these ensembles. Ensemble generation was carried out by running the LAC projective clustering algorithm [30], in which the diversity of the solutions was ensured by randomly choosing the initial centroids and varying the parameter h; we recall that this parameter controls the incentive for clustering on more features, depending on the strength of the local correlation of the data. To test the ability of the proposed algorithms to deal with soft clustering solutions and with solutions having equally weighted feature-to-cluster assignments, we generated each ensemble E as a composition of four equal-sized subsets, denoted as E1 (hard data clustering, unequally weighted feature-to-cluster assignments), E2 (hard data clustering, equally weighted feature-to-cluster assignments), E3 (soft data clustering, unequally weighted feature-to-cluster assignments), and E4 (soft data clustering, equally weighted feature-to-cluster assignments).

4.1.2 Setting of the PCE algorithms

We set the parameters of MOEA-PCE and EM-PCE as reported in [16]. In particular, as for MOEA-PCE, the population size (t) was set to 15% of the ensemble size and the maximum number I of iterations to 200; the random noise needed for the mutation step was obtained via Monte Carlo sampling on a standard Gaussian distribution. Regarding EM-PCE, the parameter α was set to 2; this value was also the optimal value for the parameters α and β of our CB-PCE and FCB-PCE.

4.1.3 Assessment criteria

We assessed the quality of a consensus clustering C using both an external and an internal validity approach; specifically, we carried out two evaluation stages, the first based on the similarity of C w.r.t. a reference classification and the second based on the average similarity w.r.t. the solutions in the input ensemble E.

Similarity w.r.t. the reference classification. We denote with Ce a reference classification, where the object-based representations ΓCe of each projective cluster e within Ce are provided along with D (the selected datasets C are all available with a reference classification), whereas the e e feature-based representations ∆C,f e , ∀C ∈ C, ∀f ∈ F , are

~ o∈D

~ o∈D

X

ΓC,~ ˆ o

−1 X

ΓC,~ ˆ o × ofˆ

~ o∈D

~ o∈D

with ofˆ denoting the fˆ-th feature value of object ~o. Similarity between C and Ce was computed in terms of the Normalized Mutual Information, by taking into account their object-based (NMIo ) representations, feature-based representations (NMIf ), or both (NMIof ), and by adapting the original definition given in [28] to handle soft solutions. Here we report the formal definition of NMIof , NMIo and NMIf can be derived in a similar way:   X X a(C,C) e e |D|2 ×a(C,C) e × log T (C,C)×b(C)×b( e e T (C,C) C) e = NMIof (C, C)

C∈C C∈ e Ce

q e H(C) × H(C)

where a(C 0 , C 00 ) =

XX

ΓC 0,~o ∆C 0,f ΓC 00,~o ∆C 00,f

~ o∈D f ∈F

ˆ = b(C)

XX

ΓC,~ ˆ o ∆C,f ˆ

ˆ =− H(C)

~ o∈D f ∈F

T (C 0 , C 00 ) =

X

ˆ b(C) |D|

log

ˆ b(C) |D|

ˆ Cˆ C∈

XX X ~ o∈D f ∈F

C 0 ∈C 0

ΓC 0,~o ∆C 0,f

 X

 ΓC 00,~o ∆C 00,f

C 00 ∈C 00

We now explain the rationale of this evaluation stage. Let us consider NMIof , where analogous considerations hold for NMIo and NMIf . Since no additional information is provided along with any given input projective ensemble E—the reference classifications associated to the benchmark datasets are indeed exploited only for testing purposes— randomly extracting a projective solution from E is the only fair way to proceed in case no PCE method is used. Within this view, in order to establish the validity of a projective consensus C computed by any PCE algorithm, we compare the results achieved by C w.r.t. those obtained by any projective clustering randomly chosen from E. Such a comparison can be performed according to the following expression, which aims to compute the “expected difference” between the results by C and those by E:  X e = e − NMIof (C, ˆ C) e Pr(C) ˆ Θof (C, E, C) NMIof (C, C) ˆ C∈E

ˆ is the probability of randomly choosing Cˆ from where Pr(C) E. Since no prior knowledge is provided along with E, we can ˆ assume a uniform distribution for the probabilities Pr(C), −1 ˆ ˆ i.e., Pr(C) = |E| , ∀C ∈ E. Computing Θof hence becomes equal to computing the similarity between C and Ce minus

Table 3: Evaluation w.r.t. the reference classification

Θ_of:
data           MOEA-PCE   EM-PCE   CB-PCE   FCB-PCE
Iris             +.146     +.168    +.218     +.185
Wine             +.136     +.083    +.275     +.224
Glass            +.105     +.162    +.158     +.157
Ecoli            +.164     +.086    +.211     +.232
Yeast            +.049     +.021    +.092     +.095
Segmentation     +.137     +.144    +.148     +.141
Abalone          +.116     +.111    +.134     +.130
Letter           +.111     +.107    +.141     +.134
Trace            +.097     +.019    +.125     +.140
ControlChart     +.091     +.204    +.345     +.276
min              +.049     +.019    +.092     +.095
max              +.164     +.204    +.345     +.276
avg              +.115     +.110    +.185     +.171

Θ_o:
data           MOEA-PCE   EM-PCE   CB-PCE   FCB-PCE
Iris             +.319     +.228    +.309     +.297
Wine             +.201     +.130    +.272     +.253
Glass            +.092     +.134    +.180     +.167
Ecoli            +.245     +.125    +.223     +.213
Yeast            +.090     +.066    +.113     +.110
Segmentation     +.102     +.206    +.194     +.185
Abalone          +.141     +.116    +.185     +.182
Letter           +.146     +.122    +.188     +.185
Trace            +.032     +.026    +.154     +.132
ControlChart     +.050     +.011    +.027     +.051
min              +.032     +.011    +.027     +.051
max              +.319     +.228    +.309     +.297
avg              +.142     +.116    +.185     +.178

Θ_f:
data           MOEA-PCE   EM-PCE   CB-PCE   FCB-PCE
Iris             +.198     -.095    +.139     +.117
Wine             +.152     +.030    +.211     +.206
Glass            +.048     +.060    +.001     +.009
Ecoli            +.042     +.042    +.023     +.017
Yeast            +.006     +.090    +.102     +.010
Segmentation     +.075     +.079    +.098     +.150
Abalone          +.093     +.092    +.123     +.120
Letter           +.092     +.097    +.131     +.124
Trace            -.007     +.114    +.112     +.115
ControlChart     +.233     +.416    +.287     +.283
min              -.007     -.095    +.001     +.009
max              +.233     +.416    +.287     +.283
avg              +.093     +.093    +.123     +.122


Θo and Θf can be defined analogously. The larger Θof , Θo and Θf , the better the quality of C.

Similarity w.r.t. the ensemble solutions. The goal of this evaluation stage was to assess how well a consensus clustering complies with the solutions in the input ensemble. For this purpose, we evaluated the average similarity NMĪ_of(C, E) = avg_{C′∈E} NMI_of(C, C′) between the consensus clustering C and the solutions in the ensemble E (NMĪ_o and NMĪ_f are defined analogously). To improve the readability of the results, we normalize NMĪ_of, NMĪ_o, and NMĪ_f by dividing them by the average pairwise similarity of the solutions in the ensemble. Formally, we define the ratios (coefficients of variation) Υ_of, Υ_o, and Υ_f:

\[
\Upsilon_{of}(\mathcal{C}, \mathcal{E}) = \overline{\mathrm{NMI}}_{of}(\mathcal{C}, \mathcal{E}) \, \Big/ \operatorname*{avg}_{\mathcal{C}', \mathcal{C}'' \in \mathcal{E}} \mathrm{NMI}_{of}(\mathcal{C}', \mathcal{C}'')
\tag{26}
\]

Υ_o and Υ_f are defined similarly. The larger these quantities are, the better the quality of C.
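Both scores are cheap to compute once an NMI routine is available; a minimal sketch (nmi_of is an assumed helper implementing NMI_of, not shown here):

```python
# Minimal sketch of the two assessment scores: Theta_of per (25) and
# Upsilon_of per (26). nmi_of(C1, C2) is an assumed helper (not shown).
import itertools
import numpy as np

def theta_of(consensus, ensemble, reference, nmi_of):
    # NMI of the consensus minus the average NMI of the ensemble solutions (25)
    return nmi_of(consensus, reference) - np.mean(
        [nmi_of(sol, reference) for sol in ensemble])

def upsilon_of(consensus, ensemble, nmi_of):
    # average NMI w.r.t. the ensemble over the ensemble's self-similarity (26)
    num = np.mean([nmi_of(consensus, sol) for sol in ensemble])
    den = np.mean([nmi_of(a, b)
                   for a, b in itertools.combinations(ensemble, 2)])
    return num / den
```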

4.2 Results

4.2.1 Accuracy

For each algorithm, dataset, and ensemble, we performed 50 different runs. Tables 3 and 4 report the average clustering results obtained by CB-PCE and FCB-PCE, as well as by the early MOEA-PCE and EM-PCE.

Evaluation w.r.t. the reference classification. Both CB-PCE and FCB-PCE achieved higher Θ_of results (first group of four columns in Table 3) than MOEA-PCE on all datasets. In particular, CB-PCE obtained an average improvement of 0.070, with a maximum gain of 0.254 (ControlChart), whereas FCB-PCE obtained an average improvement of 0.056, with a maximum of 0.185 (again on ControlChart). EM-PCE was on average less accurate than MOEA-PCE; thus, the average gains of CB-PCE and FCB-PCE w.r.t. EM-PCE were higher than those achieved w.r.t.


MOEA-PCE (0.075 and 0.061, respectively). Comparing the two proposed methods, CB-PCE achieved higher quality than FCB-PCE on nearly all datasets (all but Ecoli, Yeast, and Trace), with an average gain of about 0.014 and peaks on ControlChart (0.069) and Wine (0.051). The higher performance of CB-PCE vs. FCB-PCE confirms one of the major claims of this work (cf. Sect. 3.2.2). The superior performance of CB-PCE and FCB-PCE w.r.t. the early MOEA-PCE and EM-PCE was also confirmed in terms of the object-based (Θ_o) and feature-based (Θ_f) representations. In particular, CB-PCE achieved an average Θ_o equal to 0.185 and average improvements w.r.t. MOEA-PCE and EM-PCE of 0.043 and 0.069, respectively. Also, CB-PCE outperformed MOEA-PCE (resp. EM-PCE) on seven (resp. eight) out of ten datasets. As for FCB-PCE, the average Θ_o was 0.178, with average gains w.r.t. MOEA-PCE and EM-PCE equal to 0.036 and 0.062, respectively. FCB-PCE performed better than MOEA-PCE and EM-PCE on eight and nine out of ten datasets, respectively. In terms of Θ_f, CB-PCE and FCB-PCE were on average comparable to each other; in fact, they achieved average Θ_f equal to 0.123 and 0.122, respectively. The average improvements obtained by CB-PCE (resp. FCB-PCE) w.r.t. both MOEA-PCE and EM-PCE were equal to 0.030 (resp. 0.029). As for Θ_of and Θ_o, both the proposed CB-PCE and FCB-PCE performed better than MOEA-PCE and EM-PCE on the majority of the datasets also in terms of Θ_f.

Evaluation w.r.t. the ensemble solutions. Concerning the coefficients of variation of the consensus clustering w.r.t. the average pairwise similarity of the input ensemble (Table 4), CB-PCE and FCB-PCE led to average values equal to 1.110 and 1.108 (Υ_of), 1.318 and 1.316 (Υ_o), and 1.049 and 1.030 (Υ_f), respectively. In particular, in the case of Υ_of, CB-PCE improved upon MOEA-PCE and EM-PCE by 0.062 and 0.114 on average, respectively, whereas the average improvements obtained by FCB-PCE w.r.t. MOEA-PCE and EM-PCE were 0.060 and 0.112, respectively. Also, CB-PCE was able to obtain peaks of improvement up to 0.297 (w.r.t. MOEA-PCE) and 0.454 (w.r.t. EM-PCE); the maximum gains of FCB-PCE were 0.3 and 0.457 w.r.t. MOEA-PCE and EM-PCE, respectively. Both CB-PCE and FCB-PCE outperformed MOEA-PCE and EM-PCE on nearly all datasets. CB-PCE results were better than those of MOEA-PCE and EM-PCE on seven and nine out of ten datasets, respectively.

Table 4: Evaluation w.r.t. the ensemble solutions

Υ_of:
data           MOEA-PCE   EM-PCE   CB-PCE   FCB-PCE
Iris              1.019     .914     .984      .989
Wine               .993     .960    1.074     1.072
Glass             1.023     .918    1.000     1.003
Ecoli             1.074    1.052    1.058     1.015
Yeast             1.074    1.050    1.217     1.189
Segmentation      1.008     .851    1.305     1.308
Abalone           1.044    1.001    1.068     1.071
Letter            1.040    1.001    1.045     1.088
Trace             1.170    1.207    1.196     1.196
ControlChart      1.034    1.006    1.152     1.152
min                .993     .851     .984      .989
max               1.170    1.207    1.305     1.308
avg               1.048     .996    1.110     1.108

Υ_o:
data           MOEA-PCE   EM-PCE   CB-PCE   FCB-PCE
Iris              1.025    1.004    1.044     1.039
Wine              1.060     .991    1.057     1.056
Glass             1.114     .971    1.064     1.066
Ecoli             1.034    1.023    1.027     1.028
Yeast             1.189    1.182    1.310     1.297
Segmentation      1.367    1.304    1.788     1.786
Abalone           1.121    1.102    1.208     1.208
Letter            1.118    1.099    1.277     1.274
Trace             1.325    1.501    1.503     1.503
ControlChart      1.162    1.237    1.903     1.903
min               1.025     .971    1.027     1.028
max               1.367    1.501    1.903     1.903
avg               1.152    1.141    1.318     1.316

Υ_f:
data           MOEA-PCE   EM-PCE   CB-PCE   FCB-PCE
Iris               .953     .906     .986      .977
Wine              1.018     .952    1.001     1.001
Glass              .979     .915    1.004     1.004
Ecoli              .975     .924     .986      .992
Yeast              .960    1.021    1.036     1.037
Segmentation       .971     .969    1.032     1.013
Abalone            .982     .902     .980      .986
Letter             .981     .891    1.169      .998
Trace              .949     .927    1.062     1.062
ControlChart      1.085     .577    1.234     1.234
min                .949     .577     .980      .977
max               1.085    1.021    1.234     1.234
avg                .985     .898    1.049     1.030

Table 5: Execution times (milliseconds)

TOTAL
data            MOEA-PCE      EM-PCE    CB-PCE        FCB-PCE
Iris                17,223        55        13,235        906
Wine                21,098       184        50,672        993
Glass               61,700       281       110,583      3,847
Ecoli               94,762       488       137,270      4,911
Yeast            1,310,263     1,477     2,218,128     56,704
Segmentation     1,250,732    11,465     6,692,111     47,095
Abalone         13,245,313    34,000    19,870,218    527,406
Letter           7,765,750    54,641    26,934,327    271,064
Trace               86,179     4,880     2,589,899      3,731
ControlChart       291,856     2,313     3,383,936     12,439

ONLINE
data            MOEA-PCE      EM-PCE    CB-PCE     FCB-PCE
Iris                17,223        53        343        372
Wine                21,098       153        306        323
Glass               61,700       239      1,713      1,713
Ecoli               94,762       427      1,643      1,689
Yeast            1,310,263       477     12,159     12,157
Segmentation     1,250,732     8,496      6,095      5,126
Abalone         13,245,313    12,922    107,547     90,078
Letter           7,765,750    28,766     15,593     15,610
Trace               86,179     3,224        836        840
ControlChart       291,856       735      2,717      2,783

OFFLINE (MOEA-PCE has no offline phase)
data            MOEA-PCE    EM-PCE    CB-PCE        FCB-PCE
Iris               –             2        12,892        534
Wine               –            31        50,366        670
Glass              –            42       108,870      2,134
Ecoli              –            61       135,627      3,222
Yeast              –         1,000     2,205,969     44,547
Segmentation       –         2,969     6,686,016     41,969
Abalone            –        21,078    19,762,671    437,328
Letter             –        25,875    26,918,734    255,454
Trace              –         1,656     2,589,063      2,891
ControlChart       –         1,578     3,381,219      9,656

As for FCB-PCE, it was superior to MOEA-PCE and EM-PCE on seven and eight out of ten datasets, respectively. The Υo and Υf results followed trends similar to those of Υof. CB-PCE still prevailed over FCB-PCE, although the difference between the two methods was less marked than in the evaluation w.r.t. the reference classification: the average gains of CB-PCE w.r.t. FCB-PCE were 0.002 (Υof), 0.002 (Υo), and 0.019 (Υf).
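For readability, we sketch below one plausible way such a coefficient can be computed. This is a hypothetical rendering (the exact definitions of Υof, Υo, and Υf are given earlier in the paper): it assumes Υ is the ratio between the average similarity of the consensus to the ensemble solutions and the average pairwise similarity within the ensemble, for a pluggable similarity function sim between projective clustering solutions. Under this reading, values above 1 mean that the consensus agrees with the ensemble more than its members agree with one another.

from itertools import combinations

def upsilon(consensus, ensemble, sim):
    # Average similarity of the consensus to each ensemble solution...
    num = sum(sim(consensus, c) for c in ensemble) / len(ensemble)
    # ...normalized by the average pairwise similarity within the ensemble.
    pairs = list(combinations(ensemble, 2))
    den = sum(sim(a, b) for a, b in pairs) / len(pairs)
    return num / den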

4.2.2 Efficiency

Table 5 reports the runtimes of the proposed algorithms CB-PCE and FCB-PCE, along with those of the early MOEA-PCE and EM-PCE. The reported times (expressed in milliseconds) are organized to distinguish between the online and offline phases. The total runtimes confirm the theoretical considerations made in Sect. 3.2.3. Indeed, FCB-PCE is always faster than CB-PCE (by 2 to 3 orders of magnitude) and MOEA-PCE (by 1 to 2 orders), while CB-PCE is always slower than EM-PCE (by 2 to 3 orders) and slower than MOEA-PCE (by up to 2 orders) on all datasets but Iris. The latter observation fully complies with the analysis of the relative performance of CB-PCE and MOEA-PCE: CB-PCE is generally outperformed by MOEA-PCE, except on datasets of small size and/or low dimensionality, like Iris. FCB-PCE would appear generally slower than EM-PCE. However, as stated in Sect. 3.2.3, the relative performance of the two methods mostly depends on the size |D| of the dataset, the dimensionality |F| of the data objects, and the number K of clusters; in particular, the larger |D| + |F| and/or the smaller K, the better the performance of FCB-PCE relative to EM-PCE.


As a final remark, we note that the runtimes of the proposed CB-PCE and FCB-PCE were roughly similar in the online phase. As expected, the difference between the two methods depends only on their offline phases, which are influenced by the adoption of the measures Ĵ and Ĵfast (cf. Eqs. (8) and (24)).
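To make the online/offline decomposition concrete, the sketch below cross-checks a few rows of Table 5: for each method with an offline phase, the total runtime is the sum of its online and offline times (values in milliseconds, transcribed from the table; the dictionary layout is ours). It also reports the offline share of the total, which is where CB-PCE spends nearly all of its time.

# Per dataset: list of (method, total, online, offline) runtimes in ms,
# transcribed from Table 5 for two representative datasets.
times = {
    "Iris":    [("EM-PCE", 55, 53, 2),
                ("CB-PCE", 13_235, 343, 12_892),
                ("FCB-PCE", 906, 372, 534)],
    "Abalone": [("EM-PCE", 34_000, 12_922, 21_078),
                ("CB-PCE", 19_870_218, 107_547, 19_762_671),
                ("FCB-PCE", 527_406, 90_078, 437_328)],
}
for dataset, rows in times.items():
    for method, total, online, offline in rows:
        # The table is internally consistent: total = online + offline.
        assert total == online + offline, (dataset, method)
        print(f"{dataset:8} {method:8} offline share: {offline / total:.1%}")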

5. CONCLUSION

Recent advances in data clustering have resulted in the introduction of a new problem, called projective clustering ensembles (PCE), whose goal is to derive a robust projective consensus clustering from an ensemble of projective clustering solutions. PCE was originally formulated as a two-objective or a single-objective optimization problem, and related heuristics have been developed focusing on either effectiveness or efficiency. In this paper we addressed the main issues in existing PCE methods: none of them exploits the approaches commonly adopted for solving the clustering ensemble problem, thus missing a wealth of experience gained by the majority of clustering ensemble methods; more importantly, the two-objective PCE is not capable of treating the object-to-cluster and the feature-to-cluster assignments as interrelated. We defined an alternative formulation of PCE as a new single-objective problem, in which the objective function takes into account the object- and feature-based cluster representations as a whole, within a notion of distance between projective clustering solutions. We developed two heuristics for this new formulation, namely CB-PCE and FCB-PCE, which follow the cluster-based approach to the clustering ensemble problem. Experiments on benchmark datasets have shown that the proposed algorithms outperform the early PCE methods in accuracy, and that FCB-PCE is faster than the early two-objective PCE method.

6. REFERENCES

[1] E. Achtert, C. Böhm, H. Kriegel, P. Kröger, I. Müller-Gorman, and A. Zimek. Detection and Visualization of Subspace Cluster Hierarchies. In Proc. DASFAA Conf., pages 152–163, 2007.
[2] C. C. Aggarwal, C. M. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast Algorithms for Projected Clustering. In Proc. SIGMOD Conf., pages 61–72, 1999.
[3] H. Ayad and M. S. Kamel. Finding Natural Clusters Using Multi-Clusterer Combiner Based on Shared Nearest Neighbors. In Proc. Int. Workshop on Multiple Classifier Systems (MCS), pages 166–175, 2003.
[4] J. P. Barthélemy and B. Leclerc. The Median Procedure for Partitions. Partitioning Data Sets, 19:3–33, 1995.
[5] C. Böhm, K. Kailing, H. P. Kriegel, and P. Kröger. Density Connected Clustering with Local Subspace Preferences. In Proc. ICDM Conf., pages 27–34, 2004.
[6] C. Boulis and M. Ostendorf. Combining Multiple Clustering Systems. In Proc. PKDD Conf., pages 63–74, 2004.
[7] P. S. Bradley and U. M. Fayyad. Refining Initial Points for K-Means Clustering. In Proc. ICML Conf., pages 91–99, 1998.
[8] L. Chen, Q. Jiang, and S. Wang. A Probability Model for Projective Clustering on High Dimensional Data. In Proc. ICDM Conf., pages 755–760, 2008.
[9] F. Chierichetti, R. Kumar, S. Pandey, and S. Vassilvitskii. Finding the Jaccard Median. In Proc. SODA Conf., pages 293–311, 2010.
[10] E. Dimitriadou, A. Weingesse, and K. Hornik. Voting-Merging: An Ensemble Method for Clustering. In Proc. ICANN Conf., pages 217–224, 2001.
[11] S. Dudoit and J. Fridlyand. Bagging to Improve the Accuracy of a Clustering Procedure. Bioinformatics, 19(9):1090–1099, 2003.
[12] B. Fischer and J. M. Buhmann. Bagging for Path-Based Clustering. TPAMI, 25(11):1411–1415, 2003.
[13] A. L. N. Fred. Finding Consistent Clusters in Data Partitions. In Proc. Int. Workshop on Multiple Classifier Systems (MCS), pages 309–318, 2001.
[14] G. Gan, C. Ma, and J. Wu. Data Clustering: Theory, Algorithms, and Applications. ASA-SIAM Series on Statistics and Applied Probability, 2007.
[15] A. Gionis, H. Mannila, and P. Tsaparas. Clustering Aggregation. TKDD, 1(1), 2007.
[16] F. Gullo, C. Domeniconi, and A. Tagarelli. Projective Clustering Ensembles. In Proc. ICDM Conf., pages 794–799, 2009.
[17] F. Gullo, A. Tagarelli, and S. Greco. Diversity-Based Weighting Schemes for Clustering Ensembles. In Proc. SDM Conf., pages 437–448, 2009.
[18] A. K. Jain and R. Dubes. Algorithms for Clustering Data. Prentice-Hall, 1988.
[19] G. Karypis and V. Kumar. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM J. Sci. Comp., 20(1):359–392, 1998.
[20] L. I. Kuncheva, S. T. Hadjitodorov, and L. P. Todorova. Experimental Comparison of Cluster Ensemble Methods. In Proc. Int. Conf. on Information Fusion, pages 1–7, 2006.
[21] R. P. Li and M. Mukaidono. Gaussian Clustering Method Based on Maximum-Fuzzy-Entropy Interpretation. Fuzzy Sets and Systems, 102(2):253–258, 1999.
[22] N. Nguyen and R. Caruana. Consensus Clustering. In Proc. ICDM Conf., pages 607–612, 2007.
[23] A. Patrikainen and M. Meila. Comparing Subspace Clusterings. TKDE, 18(7):902–916, 2006.
[24] C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. M. Murali. A Monte Carlo Algorithm for Fast Projective Clustering. In Proc. SIGMOD Conf., pages 418–427, 2002.
[25] K. Sequeira and M. Zaki. SCHISM: A New Approach for Interesting Subspace Mining. In Proc. ICDM Conf., pages 186–193, 2004.
[26] A. Strehl, J. Ghosh, and R. Mooney. Impact of Similarity Measures on Web-Page Clustering. In Proc. AAAI Workshop on AI for Web Search, pages 58–64, 2000.
[27] A. Asuncion and D. Newman. UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/.
[28] A. Strehl and J. Ghosh. Cluster Ensembles — A Knowledge Reuse Framework for Combining Multiple Partitions. J. Mach. Learn. Res., 3:583–617, 2002.
[29] C. Domeniconi and M. Al-Razgan. Weighted Cluster Ensembles: Methods and Analysis. TKDD, 2(4), 2009.
[30] C. Domeniconi, D. Gunopulos, S. Ma, B. Yan, M. Al-Razgan, and D. Papadopoulos. Locally Adaptive Metrics for Clustering High Dimensional Data. Data Mining and Knowledge Discovery, 14(1):63–97, 2007.
[31] E. Achtert, C. Böhm, H. Kriegel, P. Kröger, I. Müller-Gorman, and A. Zimek. Finding Hierarchies of Subspace Clusters. In Proc. PKDD Conf., pages 446–453, 2006.
[32] E. Ka Ka Ng, A. W.-C. Fu, and R. C.-W. Wong. Projective Clustering by Histograms. TKDE, 17(3):369–383, 2005.
[33] E. Keogh, X. Xi, L. Wei, and C. A. Ratanamahatana. The UCR Time Series Classification/Clustering Page, http://www.cs.ucr.edu/~eamonn/time_series_data/.
[34] G. Moise, J. Sander, and M. Ester. Robust Projected Clustering. KAIS, 14(3):273–298, 2008.
[35] M. L. Yiu and N. Mamoulis. Iterative Projected Clustering by Subspace Mining. TKDE, 17(2):176–189, 2005.
[36] X. Z. Fern and C. Brodley. Solving Cluster Ensemble Problems by Bipartite Graph Partitioning. In Proc. ICML Conf., pages 281–288, 2004.
[37] K. Y. Yip, D. W. Cheung, and M. K. Ng. On Discovery of Extremely Low-Dimensional Clusters Using Semi-Supervised Projected Clustering. In Proc. ICDE Conf., pages 329–340, 2005.