Hierarchical Agglomerative Clustering with Ordering Constraints

Haifeng Zhao
Department of Computer Science
University of California, Davis
Davis, CA, USA 95616
Email: [email protected]

Abstract—Many previous researchers have converted background knowledge into constraints to obtain more accurate clustering; such methods are usually called constrained clustering. Previously used constraints are instance-level, non-hierarchical constraints, such as must-link and cannot-link, which provide no hierarchical information. To incorporate hierarchical background knowledge into agglomerative clustering, we extend instance-level constraints to hierarchical constraints in this paper, which we name ordering constraints. Ordering constraints capture hierarchical side information and allow the user to encode hierarchical knowledge, such as ontologies, into agglomerative algorithms. We experimented with ordering constraints on labeled newsgroup data. The experiments show that the dendrogram generated under ordering constraints is more similar to the known hierarchy than the dendrogram generated by previous agglomerative clustering algorithms. We believe this work will have a significant impact on the field of agglomerative clustering.

Keywords-hierarchical agglomerative clustering; constrained clustering; ordering constraint

ZiJie Qi
Department of Computer Science
University of California, Davis
Davis, CA, USA 95616
Email: [email protected]

I. INTRODUCTION

The basic Hierarchical Agglomerative Clustering (HAC) method begins with each instance as a separate group. These groups are combined until only one group remains. Constrained clustering methods incorporate side information to improve clustering results. Typical constraints used in previous research are Must-Link (ML) and Cannot-Link (CL) [1], [2]. ML and CL were first applied in k-means clustering. Davidson and Ravi [3], [4], [5] investigated how to use ML and CL in hierarchical agglomerative clustering and proved that the problem is tractable. However, using ML and CL in HAC provides no improvement over classic HAC when the number of clusters is small compared to the number of instances. Bade and Nürnberger [6] introduced must-link-before (MLB) constraints, a type of hierarchical constraint. To resolve conflicts produced by MLB, they enlarge the distance between the two most similar clusters to prevent their combination. MLB thus incorporates merging preferences into HAC, but its defect is that it modifies the underlying distance between two clusters; too many modifications may lead to inaccurate results.

In this paper, we present a new type of constraint for hierarchical agglomerative clustering, the ordering constraint (OC), which is considerably more expressive than ML, CL, and MLB. An OC for an instance is a merging preference for that instance. For example, suppose side information tells us that instance A is more similar to instance C than to instance B. Then a corresponding OC specifies that A must merge with C before merging with B during clustering.

Concretely, consider newsgroup instances. Assume that before clustering we have some labeled instances: x1 belongs to rec.sports.basketball, x2 belongs to comp.sys.mac.hardware, x3 belongs to rec.sports.hockey, and x4 belongs to rec.sports.basketball. Instance x1 prefers to merge in the sequence x4, x3, x2, so the OC of x1 is {x1 → x4 → x3 → x2}. Similarly, we can define an OC for every other labeled instance. After generating the OCs for all labeled instances, we run HAC to generate the dendrogram. During clustering, every OC must be obeyed: if x1 and x4 have not been merged, or x4 and x3 have not been merged, then x1 and x3 cannot be merged. The objective of the ordering constraint is to construct a dendrogram with all merging preferences satisfied.

Traditional constraints such as ML and CL cannot provide the merging-preference information that OCs express. Unlike MLB, an OC does not change the similarities of clusters; it merely delays the combination of two clusters until the conflict is resolved. OCs can therefore provide higher accuracy than ML, CL, and MLB. We call HAC embedded with OCs Hierarchical Agglomerative Clustering with Ordering Constraints (HACOC).

In Section II, we define OCs in detail and discuss related issues. In Section III, we present the hierarchical agglomerative clustering algorithm with ordering constraints. In Section IV, our experimental results show that HACOC generates a markedly better dendrogram than classic HAC algorithms.

II. DEFINITIONS

A. Ordering Constraint

Ordering constraints (OC) come in two types: the Instance-level Ordering Constraint (IOC) and the Cluster-level Ordering Constraint (COC).
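The merge precondition from the newsgroup example above can be sketched in a few lines of Python. This is a hypothetical illustration, not the paper's implementation; all function and variable names are our own. The idea: x1 may merge with x3 only once every instance preceding x3 in x1's OC is already in x1's cluster.

```python
# Sketch (our own, for illustration) of the merge precondition
# implied by one instance-level ordering constraint.
# Running example: the OC of x1 is x1 -> x4 -> x3 -> x2.

def same_cluster(clusters, a, b):
    """True if instances a and b currently belong to the same cluster."""
    return clusters[a] == clusters[b]

def can_merge(clusters, ioc, a, b):
    """Check whether merging a's and b's clusters respects a's IOC.

    b may join a's cluster only if every instance that precedes b
    in the IOC has already been merged into a's cluster.
    """
    if b not in ioc:
        return True  # b is unconstrained with respect to this IOC
    pos = ioc.index(b)
    return all(same_cluster(clusters, a, earlier) for earlier in ioc[:pos])

# Each instance starts in its own cluster (cluster id = instance id).
clusters = {x: x for x in ["x1", "x2", "x3", "x4"]}
ioc_x1 = ["x1", "x4", "x3", "x2"]

print(can_merge(clusters, ioc_x1, "x1", "x3"))  # False: x1 and x4 not merged yet
clusters["x4"] = clusters["x1"]                 # merge x1 and x4 first
print(can_merge(clusters, ioc_x1, "x1", "x3"))  # True: preference satisfied
```

The sketch delays a forbidden merge rather than rejecting it permanently: once the earlier merges in the sequence have happened, the same call succeeds, matching the intended behavior of an OC.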
An IOC defines a preference for an instance xi. It contains a number of sequenced instances that guide xi's merging operations during clustering. Each instance is initialized with one IOC (which may be empty if the instance is not labeled). Since instances merge into clusters during clustering, their IOCs must also be combined. We therefore define the COC, a cluster-level ordering constraint: the COC of a cluster is the set of the IOCs of all its instances. For example, if a cluster c has m instances that possess IOCs, the COC of c holds those m IOCs. Note that since a COC is a set, the IOCs within a COC are unordered.

To explain our method, we now formally introduce the notation used throughout the paper and the definitions relevant to ordering constraints.

(1) IOC[xi] denotes the instance-level ordering constraint initialized as xi's ordering constraint. At initialization, IOC[xi] = {xi1 → ... → xim}, where xi1, ..., xim are m instances. During clustering, each xik ∈ IOC[xi] is replaced by cj if xik merges into cj.

(2) IOCi,j denotes the j-th instance (or cluster) in IOC[xi]. For example, if x1's ordering constraint is IOC[x1] = {x1 → x3 → x5}, we can also write IOC[x1] = {IOC1,1 → IOC1,2 → IOC1,3}, where IOC1,1 stands for x1, IOC1,2 for x3, and IOC1,3 for x5.

(3) COCi,j denotes the j-th IOC in COCci. For example, if cluster c2 has three instances x1, x4, and x5, then c2's ordering constraint is COCc2 = {IOC[x1], IOC[x4], IOC[x5]}. We can also write COCc2 = {COC2,1, COC2,2, COC2,3}, where COC2,1 stands for IOC[x1], COC2,2 for IOC[x4], and COC2,3 for IOC[x5].

(4) COCi,j,k denotes the k-th instance (or cluster) in COCi,j. For example, suppose x3 and x4 combine as c3. Then c3's COC contains both IOC[x3] and IOC[x4], so COCc3 = {IOC[x3] = {c3 → x5 → x6}, IOC[x4] = {c3 → x7 → x8}}. We can also write COCc3 = {COC3,1 = {COC3,1,1 → COC3,1,2 → COC3,1,3}, COC3,2 = {COC3,2,1 → COC3,2,2 → COC3,2,3}}, where in IOC[x3], COC3,1,1 stands for c3, COC3,1,2 for x5, and COC3,1,3 for x6.

(5)
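The bookkeeping in definitions (1) through (4) can be illustrated with a short Python sketch. This is hypothetical code of our own, not the paper's implementation: a COC is kept as a list of IOCs, and when two clusters merge, both old cluster ids are substituted by the new cluster's id inside every IOC, exactly as in definition (1).

```python
# Sketch (our own notation) of COC maintenance during a merge.
# A COC is a collection of IOCs; an IOC is an ordered list of ids.

def substitute(ioc, old_ids, new_id):
    """Replace any merged id in an IOC by the new cluster id,
    collapsing consecutive duplicates that the substitution creates."""
    out = []
    for item in ioc:
        item = new_id if item in old_ids else item
        if not out or out[-1] != item:
            out.append(item)
    return out

def merge(cocs, ca, cb, new_id):
    """Merge clusters ca and cb into new_id.

    The new COC is the union of the two old COCs (a set of IOCs, so
    their relative order is irrelevant), and both old ids are
    substituted by new_id in every IOC of every remaining cluster.
    """
    cocs[new_id] = cocs.pop(ca) + cocs.pop(cb)
    for cid in cocs:
        cocs[cid] = [substitute(ioc, {ca, cb}, new_id) for ioc in cocs[cid]]
    return cocs

# Definition (4) example: x3 and x4 combine as c3, so
# IOC[x3] = {x3, x5, x6} becomes {c3, x5, x6} and
# IOC[x4] = {x4, x7, x8} becomes {c3, x7, x8}.
cocs = {"x3": [["x3", "x5", "x6"]], "x4": [["x4", "x7", "x8"]]}
merge(cocs, "x3", "x4", "c3")
print(cocs["c3"])  # [['c3', 'x5', 'x6'], ['c3', 'x7', 'x8']]
```

Collapsing consecutive duplicates after substitution keeps an IOC well-formed when two adjacent entries of the sequence end up in the same cluster.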