Clustering Heterogeneous Data Sets

2012 Eighth Latin American Web Congress

Artur Abdullin and Olfa Nasraoui
Knowledge Discovery and Web Mining Lab, Department of Computer Engineering and Computer Science
University of Louisville, Louisville, KY, USA
[email protected], [email protected]

Abstract—Recent years have seen an increasing interest in clustering data comprising multiple domains or modalities, such as categorical, numerical, and transactional data. This kind of data is sometimes treated within the context of clustering multiview, heterogeneous, or multimodal data. Traditionally, different types of attributes or domains have been handled either by first combining them into one format (possibly using some type of conversion) and then applying a traditional clustering algorithm, or by computing a combined distance matrix that takes into account the distance values for each domain and then applying a relational or graph clustering approach. In other cases, where the data consists of multiple views, multiview clustering has been used. In this paper, we review existing approaches such as multiview clustering and discuss several additional approaches that can be harnessed for clustering heterogeneous data once they are adapted for this purpose. These additional approaches include ensemble clustering, collaborative clustering, and semi-supervised clustering.

Keywords—clustering, heterogeneous data sets

I. INTRODUCTION

Many recent applications deal with heterogeneous data that consists of several parts, each part being of a different type of domain or modality. For example, many Web data sets, network activity data, scientific data sets, and census data sets typically comprise several parts that are of different types: numerical, categorical, transactional, free text, ratings, social relationships, etc. Traditionally, each of these different types of data has been best clustered with a different specialized clustering algorithm or with a specialized dissimilarity measure. A very common approach to cluster data with mixed types has been either to convert all data types to the same type (e.g., from categorical to numerical or vice-versa) and then cluster the data with a standard clustering algorithm that is suitable for that target domain, or to use a different dissimilarity measure for each domain, then combine them into one dissimilarity measure and cluster this dissimilarity matrix. However, there are many different contexts in which this plurality of data exists. For example, in multiview data, the features can naturally be divided into subsets (views), each of which is sufficient to learn a target concept. In multimodal data, there exists more than one modality in the data. One example is online images (on Flickr or Facebook) with visual content features and tags. Another example is data consisting of ratings, clickstreams, and transactions by users relating to items purchased or viewed online. A third example would be user feedback in the form of user ratings and textual reviews/comments.

In this paper, we discuss several approaches that can be used for the purpose of clustering mixed data, although most of them have been proposed for different purposes. We first review the related areas of multiview clustering, ensemble clustering, and collaborative clustering, and explain how each one can be modified for the specific purpose of clustering heterogeneous data. Then we present a new methodology to handle heterogeneous data that we recently presented in [1], [2], which makes an innovative use of Semi-Supervised Learning (SSL), used in a completely novel way and for a purpose that has not been the objective of previous SSL research and applications. Unlike our previous work in [1], [2], in this paper we explore the compatibility between different domains in heterogeneous data before applying our SSL-based clustering. Our findings indicate that a preliminary domain compatibility analysis step sets the stage for a more effective clustering of heterogeneous data that exploits the synergy between the different domains.

The rest of this paper is organized as follows. Sections II, III, and IV give an overview of multiview clustering, ensemble clustering, and collaborative clustering, respectively, and explain how each one can be modified for the specific purpose of clustering heterogeneous data. Section V presents our new SSL framework to cluster heterogeneous data with compatibility analysis. Section VI presents experimental results that illustrate the proposed approach, and finally, Section VII presents our conclusions.

II. MULTIVIEW CLUSTERING (MC)

The multiview setting typically applies to supervised learning problems that have a natural way to divide their features into subsets (views), each of which is sufficient to learn the target concept. Multiview algorithms train two independent hypotheses with bootstrapping by providing each other with labels for the unlabeled data [3]. The training algorithms tend to maximize the agreement between the two independent hypotheses and optimally combine the multiple views. In the rest of this section, we use the terms graph and view interchangeably.

Besides the direct combination of graphs, [4] and [5] proposed to maximize the agreement between different views. Relying on the central idea that the clustering from one view should agree with the clustering from another view, they extended spectral clustering to multiple views based on the co-training idea [3]. Their approach is based on the assumption that the true underlying clustering would assign corresponding points in each view to the same cluster. First, they perform spectral clustering on the individual graphs to obtain the discriminative eigenvectors in each view. Then they iteratively find a projection of the similarity matrix of the first view along the eigenvectors of the second view and vice-versa. Using the projections of the first and second views as the new graphs of similarities, they compute the Laplacian and find updated values for the discriminative eigenvectors in both views. After the final values of the eigenvectors of both views are obtained, they select the most informative view and cluster its eigenvectors with the k-means algorithm.

A completely different approach was proposed in [6], which first "combines" the two views/graphs and then proceeds with spectral clustering. They use a Markov random walk model to combine multiple graphs. Assuming a random walk whose current position is at a vertex in one graph, in the next step the walker may continue her random walk in the same graph with a certain probability, or jump to the other graph with the remaining probability and continue her random walk there. A subset of vertices is regarded as a cluster if, during the random walk, the probability of leaving this subset is small while the stationary probability mass of the same subset is large.

Another approach [7] developed an algorithm for spectral clustering in the multiview setting where there are two independent subsets of dimensions, each of which could be used for clustering. The algorithm clusters the data in each view so as to minimize the disagreement between the clusterings. The main idea is that two (or more) networks receiving data from different views, but with no explicit supervisory label, should cluster the data in each view so as to minimize the disagreement between the clusterings. Both views are combined into a bipartite graph, where the strength of the weight (a Gaussian-weighted normalized distance) between two nodes (patterns) in different views depends on the number of co-occurring pairs of patterns that are sufficiently close in both views. Using those weights, they define an affinity matrix which is then clustered by spectral graph clustering [8].

Finally, [9] presented a Linked Matrix Factorization (LMF) algorithm to find a shared partition of different views in both unsupervised and semi-supervised settings. In LMF, each graph is approximated by matrix factorization with a graph-specific factor and a factor common to all graphs, where the common factor provides features for all vertices. Vertices are then clustered in the new feature space common to all views with a spectral clustering algorithm.
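To make the co-training-style alternation of [4] more concrete, the following is a minimal sketch (not the authors' implementation): it assumes two precomputed similarity matrices over the same objects, and the helper names, the number of alternations, and the choice to finally cluster the first view's eigenvectors are illustrative.

import numpy as np
from sklearn.cluster import KMeans

def spectral_embedding(S, k):
    # Top-k eigenvectors of the normalized similarity D^{-1/2} S D^{-1/2}.
    d = np.maximum(S.sum(axis=1), 1e-12)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    _, vecs = np.linalg.eigh(D_inv_sqrt @ S @ D_inv_sqrt)
    return vecs[:, -k:]          # eigenvalues sorted ascending, so take the largest k

def cotrain_multiview_spectral(S1, S2, k, n_iter=5):
    U1, U2 = spectral_embedding(S1, k), spectral_embedding(S2, k)
    for _ in range(n_iter):
        # Project each view's similarities onto the other view's eigenvectors, then re-symmetrize.
        P1, P2 = U2 @ U2.T @ S1, U1 @ U1.T @ S2
        U1 = spectral_embedding((P1 + P1.T) / 2.0, k)
        U2 = spectral_embedding((P2 + P2.T) / 2.0, k)
    # Cluster the row-normalized eigenvectors of one (e.g. the most informative) view with k-means.
    rows = U1 / np.maximum(np.linalg.norm(U1, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10).fit_predict(rows)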

Figure 1: Multiview clustering.

A. Using MC to Cluster Heterogeneous Data and Relationship to Our SSL Framework

It is clear that the above methods could be used for clustering heterogeneous data as long as the data is expressed as a graph. However, one limitation of existing MC methods is the insistence on enforcing "agreement" between the different aspects of the data. Such an assumption, when violated, may yield forced and incorrect results. In this paper, we analyze the inter-domain compatibility to improve the clustering of heterogeneous data.

III. ENSEMBLE-BASED CLUSTERING (EC)

The success of ensemble-based methods for supervised learning has motivated the development of ensemble methods for unsupervised learning. The basic idea of clustering ensembles is to combine multiple partitions into a single clustering solution. Clustering ensembles can go beyond what is typically achieved by a single clustering algorithm in several respects: (i) robustness: better average performance across data sets; (ii) novelty: finding a combined solution unattainable by any single clustering algorithm; (iii) stability and confidence estimation: clustering solutions with lower sensitivity to noise, outliers, or sampling variations, because clustering uncertainty can be assessed from ensemble distributions; and (iv) parallelization and scalability: the ability to integrate solutions from multiple distributed sources of data or features [10], [11]. Ensemble clustering must tackle three major problems which are specific to combination design:
• Consensus function: Unlike supervised classification, the patterns are unlabeled and therefore there is no explicit correspondence between the labels delivered by different clusterings. An extra complexity arises when different partitions contain different numbers of clusters, often resulting in an intractable label correspondence problem. The optimal correspondence can be obtained using the Hungarian method for the minimal weight bipartite matching problem, with O(k^3) complexity for k clusters [12], [13]. One simple consensus scheme is sketched below.
• Diversity of clustering: There are many different ways of generating a clustering ensemble and then combining the partitions. Multiple data partitions could be generated by: (i) applying different clustering algorithms, (ii) applying the same clustering algorithm with different values of parameters (different numbers of clusters, different numbers of neighbors, etc.) or initializations, and (iii) combining different data representations (different sets of features or different subsets of the original data) and clustering algorithms [14], [15], [16].
• Cluster ensemble selection: Given a large library of clustering solutions, the goal of cluster ensemble selection is to choose a subset from the library to form a smaller cluster ensemble that performs as well as, or better than, using all available clustering solutions [17].

Clustering ensembles can also be used in multiobjective clustering as a compromise between individual clusterings with conflicting objective functions, and they play an important role in distributed data mining [14]. In [18], the authors propose a divide-and-conquer technique to cluster data with mixed types of attributes. First, the original mixed data set is divided into two subsets: the pure categorical data set and the pure numerical data set. Next, an existing clustering algorithm designed for a specific type of data is employed to cluster each subset separately and produce the corresponding clusterings. Last, the clustering results of the categorical and numerical data sets are combined as a categorical data set, on which a categorical data clustering algorithm is used to produce a final clustering.

Figure 2: Ensemble clustering.
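As a hedged illustration of the consensus step, the sketch below generates a diverse k-means ensemble (varying k and the initialization) and combines the partitions through a co-association matrix followed by average-linkage clustering; this is one common consensus scheme rather than the specific method of any paper cited above, and the parameter names are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def coassociation_consensus(X, k_final, n_members=20, k_range=(2, 10), seed=0):
    # Diversity: many base partitions with different k and random initializations.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    coassoc = np.zeros((n, n))
    for _ in range(n_members):
        k = int(rng.integers(k_range[0], k_range[1] + 1))
        labels = KMeans(n_clusters=k, n_init=1, random_state=int(rng.integers(1 << 31))).fit_predict(X)
        coassoc += labels[:, None] == labels[None, :]
    coassoc /= n_members
    # Consensus: turn co-membership frequency into a distance and cluster it hierarchically.
    dist = 1.0 - coassoc
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=k_final, criterion="maxclust") - 1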

A. Using EC to Cluster Heterogeneous Data and Relationship to Our SSL Framework

EC can handle heterogeneous data by dedicating a different clustering process to each domain and aggregating the results within an ensemble framework, which, by intuition, emphasizes a consensus or agreement between the different domains. Our proposed SSL-based approach is reminiscent of ensemble-based clustering. However, one main distinction is that our approach enables the different algorithms running in each domain to reinforce or supervise each other until the final clustering is obtained. In other words, our approach is more collaborative. Ensemble-based methods were not intended to provide a collaborative exchange of knowledge between different data "domains" while the algorithms are still running, but rather to combine the end results of several runs or algorithms.

IV. COLLABORATIVE CLUSTERING (CC)

The problem of collaborative clustering can be defined as follows: "Given a finite number of disjoint data sites with patterns defined in the same or different feature spaces, develop a scheme of collective development and reconciliation of a fundamental cluster structure across the sites that is based upon the exchange and communication of local findings, where the communication needs to be realized at some level of information granularity" [19]. One important feature is that sharing the raw data is not allowed, given privacy restrictions or other technical reasons. However, some findings at the higher conceptual level of information granules can be shared between the collaborating data sites. Usually, the information granules are cluster membership partition matrices, constructed through fuzzy clustering [20]. The main goal of collaboration is to give each node the ability to benefit other nodes based on their needs. It is important to note that the collaborative approach aims only at enriching the local clustering solution of each individual node based on recommendations from other nodes. Thus, no "combined" solution is desired. This means that the goal of collaborative clustering is distinct from the goal of providing a clustering solution for the entire heterogeneous data set. In other words, collaborative clustering is centered on data being distributed over multiple sites.

Pedrycz [21] proposed an algorithm where two underlying processes are run consecutively. It starts with fuzzy clustering procedures (FCM) that are run independently at each data site for a certain number of iterations until convergence. Next, the data sites exchange their findings by transferring partition matrices, and afterward an iterative process that optimizes the objective function takes place. After convergence, the partition matrices are exchanged between the data sites, and the iterative computation of the partition matrices and the prototypes resumes.

Another interesting CC approach was presented in [22], which proposed a distributed collaborative approach for document clustering. The main objective of that work was to allow peers in a network to form independent opinions of their local document grouping, followed by an exchange of cluster summaries in the form of key-phrase vectors. The nodes then expand and enrich their local solution by receiving recommended documents from their peers, based on the peers' judgement of the similarity of the local documents to the exchanged cluster summaries.

Figure 3: Collaborative clustering.
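As a rough sketch of the summary-exchange idea in [22] (not the actual key-phrase-based protocol), each node below clusters only its local documents, shares nothing but centroid summaries, and uses the peers' summaries to flag which of its own documents could be recommended to them; the node structure, similarity threshold, and names are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def collaborative_summaries(local_data, n_clusters=5, sim_threshold=0.4):
    # Step 1: independent local clustering at every node; only summaries (centroids) leave a node.
    summaries = {}
    for node, X in local_data.items():
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(normalize(X))
        summaries[node] = normalize(km.cluster_centers_)
    # Step 2: each node compares its own documents with peers' summaries and records
    # which local documents it could recommend to which peer.
    recommendations = {}
    for node, X in local_data.items():
        Xn = normalize(X)
        for peer, centroids in summaries.items():
            if peer == node:
                continue
            sims = Xn @ centroids.T                      # cosine similarities to the peer's clusters
            recommendations[(node, peer)] = np.where(sims.max(axis=1) >= sim_threshold)[0]
    return summaries, recommendations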

A. Using CC for Heterogeneous Data and Relationship to Our SSL Framework

Although this was not the purpose of CC, one way to harness CC to cluster heterogeneous data is to consider each site as dedicated to only one pure domain of the data. The main differences between collaborative clustering and our proposed SSL approach are:
• in collaborative clustering the data is distributed across different nodes or sites, and in fact this is the main assumption;
• all data sets have the same type of features or domains;
• collaborative clustering seeks to improve the local clustering solution at each node or site, and no final combined solution is desired.

V. SEMI-SUPERVISED LEARNING (SSL) APPROACH TO CLUSTER HETEROGENEOUS DATA

A. Semi-supervised Clustering

Apart from clustering algorithms, which are unsupervised learners in the sense that they use unlabeled data, recent years have seen increasing interest in another direction, known as semi-supervised learning, which takes advantage of both labeled and unlabeled data. Many semi-supervised algorithms have been proposed, including co-training, transductive support vector machines, entropy minimization, semi-supervised Expectation Maximization, graph-based approaches, and clustering-based approaches. In semi-supervised clustering, labeled data can be used in the form of (1) initial seeds [23], (2) constraints [24], or (3) feedback [25]. All these existing approaches are based on model-based clustering [26], where each cluster is represented by its centroid. Seed-based approaches use labeled data only to help initialize the cluster centroids, constrained approaches keep the grouping of labeled data unchanged throughout the clustering process, and feedback-based approaches start by running a regular clustering process and finally adjust the resulting clusters based on the labeled data.
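As a small illustration of the seed-based variant mentioned above, the sketch below uses a few labeled records only to initialize the centroids, after which ordinary k-means runs on all the data; this is a minimal sketch in the spirit of [23], not its exact procedure, and the argument names are illustrative.

import numpy as np
from sklearn.cluster import KMeans

def seeded_kmeans(X, seed_idx, seed_labels, n_clusters):
    # seed_idx: array of indices of labeled records; seed_labels: their labels in {0,...,n_clusters-1}.
    # Initialize each centroid as the mean of the labeled seeds of that class.
    init = np.vstack([X[seed_idx[seed_labels == c]].mean(axis=0) for c in range(n_clusters)])
    # The seeds do not constrain the assignments afterwards; they only guide the start.
    return KMeans(n_clusters=n_clusters, init=init, n_init=1).fit_predict(X)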

B. Semi-supervised Clustering to Cluster Heterogeneous Data with Hidden Markov Random Fields (HMRF)

The SSL approach to cluster heterogeneous data can be implemented in various ways. While the method in [1] relied on the exchange of seeds as the semi-supervising link between the alternating clustering processes of the different data types or domains, [2] uses cluster-membership constraints as the semi-supervising link between the processes. Whereas traditional semi-supervised learning or transductive learning has been used mainly to exploit additional information in unlabeled data to enhance the performance of a classification model trained with labeled data [27], or to exploit external supervision in the form of some labeled data to enhance the results of clustering unlabeled data, the methodology presented in this paper uses SSL "without" any external labels. In fact, the guiding or semi-supervising labels are "inferred" by multiple Semi-supervised Learners (SSL), such that each SSL transmits to the other SSL a subset of pairwise must-link constraints (MLC) and cannot-link constraints (CLC) that it has learned on its own from the data in one domain, and that try to favor placing some data records (in the MLC) in the same cluster while trying to forbid others (in the CLC) from being placed in the same cluster. Hence the SSLs from the different domains try to mutually guide each other, with each separate SSL transmitting semi-supervision constraints to the other SSL in the other domain, according to what it has discovered in its own domain.

For the SSL in [2], we chose Basu et al.'s Hidden Markov Random Fields (HMRF) K-Means algorithm, which combines the constraint-based and distance-based approaches in a unified model. The HMRF-KMeans algorithm [27] provides a principled probabilistic framework for incorporating supervision into prototype-based clustering by using an objective function that is derived from the posterior energy of the Hidden Markov Random Fields framework for the constrained cluster label assignments. The HMRF consists of a hidden field of random variables with unobservable values corresponding to the cluster assignments/labels of the data, and an observable set of random variables which are the input data. The neighborhood structure over the hidden labels is defined based on the constraints between data point assignments (the neighbors of a data point are the points that are related to it via must-link or cannot-link constraints). HMRF-KMeans is an Expectation Maximization (EM) based partitional clustering algorithm for semi-supervised clustering. The objective function [27] is given by

J_{obj} = \sum_{x_i \in X} D(x_i, \mu_{l_i}) + \sum_{(x_i, x_j) \in M} w_{ij} \phi_D(x_i, x_j) I[l_i \neq l_j] + \sum_{(x_i, x_j) \in C} \bar{w}_{ij} (\phi_{D_{max}} - \phi_D(x_i, x_j)) I[l_i = l_j] + \log Z,   (1)

where D(x_i, \mu_{l_i}) is the distortion between x_i and \mu_{l_i}, w_{ij} is the cost of violating the must-link constraint (i, j), and \phi_D(x_i, x_j) is the penalty scaling function, chosen to be a monotonically increasing function of the distance between x_i and x_j according to the current distortion measure D. I is the indicator function (I(true) = 1, I(false) = 0), so that the must-link term is active only when the cluster labels of x_i and x_j are different. In the next term, \bar{w}_{ij} is the cost of violating the cannot-link constraint (i, j), \phi_{D_{max}} is the maximum value of the scaling function \phi_D for the data set, and Z is a normalization constant. Thus, the task is to minimize J_{obj} over the cluster representatives {\mu_h}_{h=1}^{K}, the cluster label configuration L = {l_i}_{i=1}^{N} (every l_i takes values from the set {1, ..., K}), and D (if the distortion measure is parameterized).
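To make Eq. (1) concrete, here is a minimal sketch that evaluates the objective for a candidate assignment, taking the penalty scaling function equal to the distortion itself and treating log Z as a constant that is dropped (choices the paper also makes later); the cosine distortion and all names are illustrative.

import numpy as np

def cosine_distortion(a, b):
    return 1.0 - (a @ b) / max(np.linalg.norm(a) * np.linalg.norm(b), 1e-12)

def hmrf_objective(X, labels, centroids, must_link, cannot_link,
                   D=cosine_distortion, w=1.0, w_bar=1.0, phi_max=1.0):
    # Distortion of every point to its assigned representative.
    J = sum(D(X[i], centroids[labels[i]]) for i in range(len(X)))
    # Must-link penalty: active only when the two points end up in different clusters.
    J += sum(w * D(X[i], X[j]) for (i, j) in must_link if labels[i] != labels[j])
    # Cannot-link penalty: active only when the two points end up in the same cluster.
    J += sum(w_bar * (phi_max - D(X[i], X[j])) for (i, j) in cannot_link if labels[i] == labels[j])
    return J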

Many distortion measures can be parameterized [28] and integrated into the HMRF-KMeans algorithm. In this work, we do not parameterize any distortion measure and instead keep it a function only of the data objects, D = D(x_i, x_j). For data records that consist of different domains, we invoke several HMRF-KMeans processes, one per domain, with each one receiving supervising constraints that were discovered in the other domains. For the sake of simplicity, we limit the data to consist of two parts in the rest of this paper. We start by dividing the set of attributes into two subsets: one subset, called domain T1, with only attributes of one type, say a visual bag of features, and a second subset, called T2, with attributes of the other type, such as text features. The first subset consists of d_{T1} attributes from domain T1 and the second subset consists of d_{T2} attributes from domain T2, such that d_{T1} + d_{T2} = d, the total number of attributes. As a distortion measure D for the visual and text domains, we used the cosine distance, defined as follows:

D_{T,cos}(x_i, x_j) = 1 - \frac{\sum_{m \in T} x_{im} x_{jm}}{||x_i||_T \, ||x_j||_T},   (2)

where ||x||_T is the L2 norm calculated in domain T and x_{im} denotes the m-th attribute of data record x_i. We also define the penalty scaling function \phi_D(x_i, x_j) to be equal to the corresponding distance function, and set the pairwise constraint violation costs W and \bar{W} to unit costs, so that w_{ij} = \bar{w}_{ij} = 1 for any pair (i, j). The maximum value of the scaling function, \phi_{D_T,max}, is also set to 1 for both domains. Putting all this into (1) gives the following objective functions for the visual domain T1 or the text domain T2, where we use T_{1(2)} to denote either domain, and we use D_{T1}, M_{T2}, and C_{T2} to formulate J_{T1}, while using D_{T2}, M_{T1}, and C_{T1} to formulate J_{T2}:

J_{T_{1(2)}} = \sum_{x_i \in X} D_{T_{1(2)},cos}(x_i, \mu_{l_i}) + \sum_{(x_i, x_j) \in M_{T_{2(1)}}} D_{T_{1(2)},cos}(x_i, x_j) I[l_i \neq l_j] + \sum_{(x_i, x_j) \in C_{T_{2(1)}}} (1 - D_{T_{1(2)},cos}(x_i, x_j)) I[l_i = l_j] + \log Z_{T_{1(2)}}.   (3)

The main idea of HMRF-KMeans is as follows: in the E-step, given the current cluster representatives, every data point is re-assigned to the cluster that minimizes its contribution to J_{obj}. In the M-step, the cluster representatives {\mu_h}_{h=1}^{K} are re-estimated from the previous cluster assignments to minimize J_{obj} for the current assignment. The E-step and M-step are alternated repeatedly until a specified convergence criterion is reached. M_{Ti} is the set of must-link constraints inferred based on the clustering of domain Ti, and C_{Ti} is the set of cannot-link constraints inferred based on the clustering of domain Ti. We further set the normalization constants Z_{T1} and Z_{T2} to be constant throughout the clustering iterations, and hence drop these terms from Equation 3. The constraints are generated based on the cluster membership assignment of the n_{Ti} points that are closest to each cluster in domain Ti. We alternate the HMRF-KMeans clustering between the different domains until both algorithms converge or the number of exchange iterations exceeds a maximum number. The general flow of our approach is presented in Figure 4.

Figure 4: Outline of the mutual semi-supervision based heterogeneous data clustering using HMRF-KMeans.
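The alternation just described can be outlined as follows. This is a simplified sketch of the flow in Figure 4, not the authors' code: hmrf_kmeans_step below is a plain k-means stand-in for a real HMRF-KMeans implementation (which would minimize Eq. (3) under the exchanged constraints), and the function and parameter names are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def hmrf_kmeans_step(X, k, must_link, cannot_link):
    # Stand-in only: a real HMRF-KMeans step would use the constraints in its E-step.
    km = KMeans(n_clusters=k, n_init=10).fit(normalize(X))
    return km.labels_, km.cluster_centers_

def closest_points(X, centroids, labels, n_closest):
    # Indices of the n_closest points (cosine distance) to each cluster representative.
    reps = []
    Xn = normalize(X)
    for c, mu in enumerate(centroids):
        idx = np.where(labels == c)[0]
        d = 1.0 - Xn[idx] @ (mu / max(np.linalg.norm(mu), 1e-12))
        reps.append(idx[np.argsort(d)[:n_closest]])
    return reps

def constraints_from(reps):
    # Must-link within a cluster's representative points, cannot-link across clusters.
    ml = [(i, j) for r in reps for i in r for j in r if i < j]
    cl = [(i, j) for a in range(len(reps)) for b in range(a + 1, len(reps))
          for i in reps[a] for j in reps[b]]
    return ml, cl

def mutual_ssl_clustering(X1, X2, k, n_rounds=10, n_closest=5):
    labels1, cent1 = hmrf_kmeans_step(X1, k, [], [])          # unsupervised start in domain T1
    for _ in range(n_rounds):
        ml, cl = constraints_from(closest_points(X1, cent1, labels1, n_closest))
        labels2, cent2 = hmrf_kmeans_step(X2, k, ml, cl)      # T2 guided by T1's constraints
        ml, cl = constraints_from(closest_points(X2, cent2, labels2, n_closest))
        labels1, cent1 = hmrf_kmeans_step(X1, k, ml, cl)      # T1 guided by T2's constraints
    return labels1, labels2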


C. Discovering Domain Compatibility in Heterogeneous Data

One important issue in clustering heterogeneous data is that the different domains may exhibit some compatibility (or agreement) for part of the data, while exhibiting incompatibility for the rest of the data. We call the set consisting of the first type of data the compatible set, and the set containing the rest of the data the incompatible set. Ideally, one would be motivated to build different descriptive or summarization models and different predictive models for the data depending on whether or not the data is deemed to be in the compatible set. That way, when data is available in different domains, these domains can be utilized to full advantage in a judicious manner (separately or in combination) without forfeiting the abundance of data in the multiple domains. Therefore, as opposed to our previous work in [1], [2], we explore clustering the heterogeneous data separately depending on its compatibility status. In order to do this, we need to identify the compatible and incompatible sets, which we do using the following steps: (i) first, we cluster each domain using bisecting k-means (Section V-C1); (ii) then, we use the resulting partitions to identify domain compatibility (Section V-C2). Finally, we apply SSL HMRF-KMeans on each of the two extracted data sets (Section V-B).

1) Bisecting K-Means: A simple variant of k-means, "bisecting" k-means, can produce clusters of documents that are generally better than those produced by k-means or by agglomerative hierarchical clustering techniques. The bisecting k-means algorithm starts with a single cluster of all the data points and repeatedly bisects the largest remaining cluster into two subclusters at each step, until the desired number of clusters is reached [29]. The steps are as follows:
1) Pick a cluster to split.
2) Find 2 sub-clusters using the basic k-means algorithm.
3) Repeat step 2, the bisecting step, for a fixed number of times and take the split that produces the clustering with the highest overall similarity.
4) Repeat steps 1, 2, and 3 until the desired number of clusters is reached.
We use the cosine distance to handle the visual and text domains of our data, which are in bag-of-terms format.

2) Using Preliminary Partitions to Pinpoint Domain Incompatibility: After partitioning the data in both domains, we identify the corresponding clusters between the domains by solving a matching problem using the Hungarian method [12], which takes as input an inter-cluster matching weight inversely proportional to the Jaccard coefficient between the data membership assignments in each pair of clusters. Finally, we compare the membership matrices and find the data records that were assigned to the same (corresponding) or to different clusters. If a data record was assigned to corresponding clusters in both domains, this indicates that the different domains agreed on this data; if a data record was assigned to different (non-corresponding) clusters in the two domains, this indicates that the clusterings from the different domains disagreed on this data, in which case it is considered part of the incompatible set. A sketch of this matching step follows.
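The sketch below is one way to implement the matching and split just described: given the two per-domain label vectors, it matches clusters by maximizing total Jaccard overlap with the Hungarian method (equivalent to minimizing a weight inversely related to the Jaccard coefficient) and then separates the records into compatible and incompatible sets; function and variable names are illustrative.

import numpy as np
from scipy.optimize import linear_sum_assignment

def split_by_compatibility(labels_a, labels_b, k):
    # Jaccard overlap between every pair of clusters across the two domains.
    jacc = np.zeros((k, k))
    for a in range(k):
        in_a = labels_a == a
        for b in range(k):
            in_b = labels_b == b
            union = np.logical_or(in_a, in_b).sum()
            jacc[a, b] = np.logical_and(in_a, in_b).sum() / union if union else 0.0
    # Hungarian matching: maximize total Jaccard agreement between corresponding clusters.
    rows, cols = linear_sum_assignment(-jacc)
    match = dict(zip(rows, cols))
    agrees = np.array([match[la] == lb for la, lb in zip(labels_a, labels_b)])
    return np.where(agrees)[0], np.where(~agrees)[0]    # compatible set, incompatible set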

Figure 5: Representation of a data record of the MIRFlickr data set.

VI. EXPERIMENTAL RESULTS

A. The MIRFlickr Data Set

The MIRFlickr-25000 image data set is composed of 25,000 pictures downloaded from the popular online photo-sharing service Flickr [30]. The data set comes with the Flickr tags given by users, which can be considered as low-level, noisy text. By processing this content, a 2105-word dictionary is defined based on the most frequent terms. The bag-of-features approach is used to represent the visual content using a dictionary of 1000 visual patterns which were extracted based on the image content. This image collection has also been manually annotated using a set of 38 semantic terms or tags provided as groundtruth for validating information retrieval tasks. The annotation vector has binary elements indicating whether or not the photo can be described by each term [31]. Figure 5 shows a sample from the data.

B. Clustering Evaluation

The proposed semi-supervised framework was evaluated using several internal and external clustering validity metrics. As an internal evaluation measure we used the Davies-Bouldin (DB) index. The DB index is a function of the ratio of the sum of within-cluster scatter to between-cluster separation [32]. Hence the ratio is small if the clusters are compact and far from each other; that is, the DB index will have a small value for a good clustering. As an external evaluation measure, we also computed the DB index based only on the groundtruth domain, where each data record is represented as an annotation vector of terms. Because calculating the DB index requires a distance measure, we used the cosine distance for all data domains, including the annotation domain.

C. Results with the MIRFlickr Data Set

Using the methodology described in Section V-C, we extracted two subsets of the MIRFlickr data set depending on whether the domains were compatible or incompatible. The compatible subset consists of 2,266 data records, while the size of the incompatible subset was 22,734 data records. Figures 6 and 7 illustrate four randomly selected pictures along with their text data from the compatible and incompatible subsets, respectively. To show the importance of domain compatibility in clustering heterogeneous data, we performed two similar experiments. In the first experiment, we used only data from the compatible subset, while in the second experiment, we used data from the incompatible subset. Since the incompatible subset had more than ten times as many data records, we used only 2,266 randomly selected data records to perform our experiments and obtain comparable metrics that are not biased by the size of the data sets. We used the following parameters in the proposed semi-supervised framework (the parameters were selected after several trials): the same number of clusters K_{T1} = K_{T2} = 16 in both domains, the same number n_{T1} = n_{T2} = 5 of closest points to the clusters, and t_{T1} = t_{T2} = 2 iterations of the algorithm in each turn.

We compared the proposed semi-supervised clustering approach with the following two classical baseline approaches for clustering mixed-type data. The first baseline approach is to convert all data to the same attribute type and cluster it using spherical k-means [33]; spherical k-means is the closest equivalent to HMRF-KMeans using the cosine distance and without SSL. We call this method the conversion algorithm. Since both domains already have the same attribute type, there is no need for converting one domain to another; instead, we normalized each domain to an L2-norm of 1, merged the data records together, and normalized them again to an L2-norm of 1. The second classical baseline approach is to run spherical k-means independently on both domains. We call this method the splitting algorithm (both baselines are sketched at the end of this subsection). We repeated each experiment 10 times, and report the mean, standard deviation, minimum, median, and maximum values for each validation metric (in the format mean ± std [min, median, max]). Tables I and II show the results for clustering the compatible and incompatible subsets using the proposed semi-supervised framework, the conversion algorithm, and the splitting algorithm. The best results are marked in a bold font, based on their significant p-values. The results are described below for each subset.

Table I: Clustering results for the compatible set (mean ± std [min, median, max]).

Algorithm        Domain  DB Index                        Tags DB Index
Semi-supervised  Text    2.06 ± 0.31 [1.36, 2.17, 2.27]  18.9 ± 8.5 [3.9, 20.7, 32.5]
Semi-supervised  Visual  2.34 ± 0.20 [2.03, 2.34, 2.63]  27.4 ± 6.0 [17.1, 28.8, 36.0]
Conversion       Text    2.41 ± 0.07 [2.30, 2.41, 2.53]  33.8 ± 3.8 [27.2, 35.5, 37.5]
Conversion       Visual  2.39 ± 0.05 [2.30, 2.40, 2.45]  32.6 ± 4.9 [25.0, 35.3, 37.4]
Splitting        Text    1.91 ± 0.01 [1.89, 1.90, 1.92]  30.1 ± 7.6 [19.9, 28.9, 45.6]
Splitting        Visual  1.90 ± 0.05 [1.81, 1.90, 1.95]  33.2 ± 2.5 [30.4, 32.2, 38.5]

• Compatible subset: As Table I illustrates, the conventional splitting algorithm yielded better clustering results for the text and visual domains separately, based on the DB index; however, the proposed SSL approach outperformed it in terms of the DB index calculated in the annotation tag (groundtruth) domain. Note the low minimum value of the DB index in the text and visual domains, showing that over all runs, the proposed SSL approach could achieve a better clustering than the classical baseline approaches. The conversion algorithm performed worse than the splitting and SSL approaches.
• Incompatible subset: Table II shows very similar results to the clustering of the compatible subset. Most interestingly, the splitting and conversion algorithms were not affected by the compatibility issue, since they involve no mutual supervision between the domains, unlike the proposed SSL approach, for which the results on the compatible subset were significantly better than those obtained on the incompatible subset.
• Also, in both the compatible and incompatible sets, the inter-domain supervision had more impact on improving clusters in the text domain than in the visual domain. Thus, for both subsets, SSL is able to leverage the mutual supervision between the different noisy domains to extract clusters with better consistency, as measured based on the true tags.
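For reference, the two baselines described above can be sketched as follows; KMeans on unit-normalized vectors is used here as a stand-in for the spherical k-means of [33], and the function names are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def conversion_baseline(X_text, X_visual, k=16):
    # Normalize each domain to unit L2 norm, merge the records, renormalize, cluster once.
    merged = normalize(np.hstack([normalize(X_text), normalize(X_visual)]))
    return KMeans(n_clusters=k, n_init=10).fit_predict(merged)

def splitting_baseline(X_text, X_visual, k=16):
    # Cluster each domain independently; no interaction between the domains.
    labels_text = KMeans(n_clusters=k, n_init=10).fit_predict(normalize(X_text))
    labels_visual = KMeans(n_clusters=k, n_init=10).fit_predict(normalize(X_visual))
    return labels_text, labels_visual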

Figure 6: Sample data from the compatible clusters. (a) Text: graffiti. (b) Text: flower, green, orange, petal, spider, yellow. (c) Text: grey, horse, friend. (d) Text: animal, close up, detail, flower, insect, red, water.

Figure 7: Sample data from the incompatible clusters. (a) Text: hawaii, light house, vacation. (b) Text: gold, record, sing, vintage, vinyl. (c) Text: canon, flower, rainy. (d) Text: Indonesia, japan, sun, temple.

Table II: Clustering results for the incompatible subset (mean ± std [min, median, max]).

Algorithm        Domain  DB Index                        Tags DB Index
Semi-supervised  Text    2.14 ± 0.22 [1.42, 2.20, 2.28]  26.7 ± 7.5 [7.9, 27.9, 38.0]
Semi-supervised  Visual  2.55 ± 0.37 [2.30, 2.42, 3.63]  31.7 ± 4.4 [24.7, 31.9, 40.7]
Conversion       Text    2.37 ± 0.08 [2.29, 2.37, 2.58]  33.1 ± 3.4 [28.0, 32.6, 37.4]
Conversion       Visual  2.39 ± 0.07 [2.32, 2.36, 2.51]  33.4 ± 3.2 [28.4, 33.7, 39.6]
Splitting        Text    1.91 ± 0.01 [1.89, 1.90, 1.92]  31.1 ± 7.6 [19.9, 28.9, 45.6]
Splitting        Visual  1.88 ± 0.05 [1.81, 1.90, 1.95]  33.5 ± 2.5 [30.4, 32.2, 38.5]

VII. CONCLUSIONS

Our preliminary results show that the proposed mutual-SSL-based heterogeneous data clustering framework using HMRF-KMeans tends to yield better clustering results, as assessed based on the groundtruth annotations, in both the text and visual domains. Thus, the constraints obtained from each domain tend to provide additional helpful knowledge to guide the clustering in the other domain. This information may in turn be used to avoid local minima and obtain a better clustering in both domains. Moreover, by first distinguishing between the data depending on whether the different domains described the data in a compatible manner, the SSL approach was able to compute an even better clustering compared to both conventional methods. We are currently completing our study by extending our experiments and methodology to mixed data involving web and transactional information (particularly text, ratings, and clickstreams) as the data domains. In the future, we plan to investigate the effect of parameterized distortion measures incorporated in the proposed heterogeneous data clustering framework. We also plan to devise a better method to estimate the confidence levels of the points contributing to the created constraints, and then use them to obtain better-informed constraint violation cost weights W and \bar{W}. We are also extending our experiments to study the sensitivity of the proposed framework to its parameters, such as n_T, the number of closest points to the cluster representatives, and the number of iterations t_T when running the HMRF-KMeans algorithm in each stage.

ACKNOWLEDGMENT

This work was supported by US National Science Foundation Data Intensive Computation Grant IIS-0916489.

REFERENCES

[1] A. Abdullin and O. Nasraoui, "A semi-supervised learning framework to cluster mixed data types," in Proceedings of KDIR 2012 - International Conference on Knowledge Discovery and Information Retrieval, October 2012.
[2] A. Abdullin and O. Nasraoui, "Clustering heterogeneous data with mutual semi-supervision," in Proceedings of SPIRE 2012 - International Symposium on String Processing and Information Retrieval, October 2012.
[3] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the Eleventh Annual Conference on Computational Learning Theory (COLT '98). New York, NY, USA: ACM, 1998, pp. 92-100.
[4] K. Abhishek and D. I. Hal, "A co-training approach for multi-view spectral clustering," in Proceedings of the 28th International Conference on Machine Learning, 2011, pp. 393-400.
[5] K. Abhishek, R. Piyush, and D. I. Hal, "Co-regularized multi-view spectral clustering," in Proceedings of the 25th Neural Information Processing Systems, 2011, pp. 1412-1421.
[6] D. Zhou and C. J. C. Burges, "Spectral clustering and transductive learning with multiple views," in Proceedings of the 24th International Conference on Machine Learning (ICML '07). ACM, 2007, pp. 1159-1166.
[7] V. de Sa, "Spectral clustering with two views," in Proceedings of the ICML 2005 Workshop on Learning with Multiple Views, Bonn, Germany, 2005, pp. 20-27.
[8] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Advances in Neural Information Processing Systems. MIT Press, 2001, pp. 849-856.
[9] W. Tang, Z. Lu, and I. S. Dhillon, "Clustering with multiple graphs," in Proceedings of the 2009 Ninth IEEE International Conference on Data Mining. Washington, DC, USA: IEEE Computer Society, 2009, pp. 1016-1021.
[10] A. Topchy, A. K. Jain, and W. Punch, "A mixture model for clustering ensembles," in Proceedings of the SIAM International Conference on Data Mining, 2004.
[11] A. Topchy, A. Jain, and W. Punch, "Clustering ensembles: models of consensus and weak partitions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 12, pp. 1866-1881, Dec. 2005.
[12] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, pp. 83-97, 1955.
[13] A. Frank, "On Kuhn's Hungarian method - a tribute from Hungary," Naval Research Logistics (NRL), vol. 52, no. 1, pp. 2-5, 2005.
[14] A. Strehl and J. Ghosh, "Cluster ensembles - a knowledge reuse framework for combining multiple partitions," Journal of Machine Learning Research, vol. 3, pp. 583-617, Mar. 2003.
[15] A. Topchy, B. Minaei-Bidgoli, A. Jain, and W. Punch, "Adaptive clustering ensembles," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), vol. 1, Aug. 2004, pp. 272-275.
[16] P. Hore, L. O. Hall, and D. B. Goldgof, "A scalable framework for cluster ensembles," Pattern Recognition, vol. 42, no. 5, pp. 676-688, 2009.
[17] X. Z. Fern and W. Lin, "Cluster ensemble selection," Statistical Analysis and Data Mining, vol. 1, no. 3, pp. 128-141, 2008.
[18] Z. He, X. Xu, and S. Deng, "Clustering mixed numeric and categorical data: A cluster ensemble approach," arXiv:cs/0509011, Sep. 2005.
[19] W. Pedrycz and P. Rai, "Collaborative clustering with the use of fuzzy c-means and its quantification," Fuzzy Sets and Systems, vol. 159, no. 18, pp. 2399-2427, 2008.
[20] J. C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," Journal of Cybernetics, vol. 3, no. 3, pp. 32-57, 1973.
[21] W. Pedrycz and P. Rai, "A multifaceted perspective at data analysis: a study in collaborative intelligent agents," IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 39, no. 4, pp. 834-844, Aug. 2009.
[22] K. Hammuoda and M. Kamel, "Collaborative document clustering," in Proceedings of the Sixth SIAM International Conference on Data Mining (SDM06), Bethesda, MD, Apr. 2006, pp. 453-463.
[23] S. Basu, A. Banerjee, and R. Mooney, "Semi-supervised clustering by seeding," in Proceedings of the 19th International Conference on Machine Learning (ICML 2002), 2002.
[24] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl, "Constrained k-means clustering with background knowledge," in Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01). Morgan Kaufmann Publishers Inc., 2001, pp. 577-584.
[25] D. Cohn, R. Caruana, and A. McCallum, "Semi-supervised clustering with user feedback," Tech. Rep., 2003.
[26] S. Zhong and J. Ghosh, "A unified framework for model-based clustering," Journal of Machine Learning Research, vol. 4, pp. 1001-1037, 2003.
[27] S. Basu, M. Bilenko, and R. J. Mooney, "A probabilistic framework for semi-supervised clustering," in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '04), 2004, pp. 59-68.
[28] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, "Distance metric learning, with application to clustering with side-information," in Advances in Neural Information Processing Systems 15. MIT Press, 2002, pp. 505-512.
[29] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in KDD Workshop on Text Mining, 2000.
[30] M. J. Huiskes and M. S. Lew, "The MIR Flickr retrieval evaluation," in MIR '08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval. New York, NY, USA: ACM, 2008.
[31] J. C. Caicedo, J. BenAbdallah, F. A. Gonzalez, and O. Nasraoui, "Multimodal representation, indexing, automated annotation and retrieval of image collections via non-negative matrix factorization," Neurocomputing, vol. 76, no. 1, pp. 50-60, 2012.
[32] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, no. 2, pp. 224-227, Apr. 1979.
[33] I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text data using clustering," Machine Learning, vol. 42, pp. 143-175, 2001.
