Clustering using Unsupervised Binary Trees

arXiv:1011.2624v1 [stat.ME] 11 Nov 2010

Clustering using Unsupervised Binary Trees: CUBT

Ricardo Fraiman, Universidad de San Andrés
Badih Ghattas, Université de la Méditerranée
Marcela Svarc, Universidad de San Andrés

November 12, 2010

Abstract. We introduce a new clustering method based on unsupervised binary trees. It is a three-stage procedure: the first stage performs recursive binary splits that reduce the heterogeneity of the data within the new subsamples; in the second stage (pruning) adjacent nodes are considered for aggregation; finally, in the third stage (joining) similar clusters are joined even if they do not descend from the same node. Consistency results are obtained and the procedure is tested on simulated and real data sets.

1 Introduction

Clustering is a method of unsupervised classification and a common technique for statistical data analysis, with applications in many fields such as medicine, marketing and economics, among others. The term cluster analysis (first used by Tryon, 1939) covers a number of different algorithms and methods for grouping similar data into respective categories. The grouping is built up in such a way that "the degree of association between data is maximal if they belong to the same group and minimal otherwise". Cluster analysis, or clustering, is the assignment of a set of observations from Rp into subsets (called clusters) such that observations in the same cluster are similar "in some sense".
These definitions are quite vague, since there is no clear population objective function to measure the performance of a clustering procedure. Each clustering algorithm implicitly has an objective function, which varies from one method to another. It is important to notice that even though most clustering procedures require the number of clusters beforehand, in practice this information is usually unknown. On the contrary, in supervised classification the number of groups is known and we have, in addition, a learning sample and a universal objective function: to minimize the number of misclassifications or, in population terms, to minimize the Bayes error. Despite these facts there are many similarities between supervised and unsupervised classification. Specifically, many algorithms share the same spirit for both problems.

Supervised and unsupervised classification algorithms (Hastie et al., 2003) have two main branches: algorithms can be partitional or hierarchical. Partitional algorithms determine all groups at once. The most popular and studied partitioning procedure for cluster analysis is k-means. Hierarchical algorithms find groups successively, splitting or joining previously established groups. These algorithms can be either agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate group and merge them into successively larger groups. Divisive algorithms begin with the whole set and proceed to split it into successively smaller groups. Hierarchical algorithms create a hierarchy of partitions which may be represented in a tree structure.

The best known hierarchical algorithm for supervised classification is CART (Breiman et al., 1984). CART has a relevant additional property: the partition tree is built up from a few binary conditions on the original coordinate variables of the data. In most cases the interpretation of the results is summarized in a tree with a very simple structure. Such a classification scheme is valuable not only for rapid classification of new observations; it can also often yield a much simpler "model" for explaining why the observations are classified in a particular group, which is remarkably important in many applications. Moreover, it is important to highlight that the algorithm does not assume any kind of parametric model for the underlying distribution.

Several methods to obtain clusters based on decision trees have already been introduced. Liu et al. (2000) use decision trees to partition the data space into clusters and empty (sparse) regions at different levels of detail. The method is based on the idea of adding an artificial sample of size N uniformly distributed on the space. With these N points added to the original data set, the problem becomes that of obtaining a partition of the space into dense and sparse regions. They treat it as a classification problem using a
new "purity" function adapted to the problem, based on the relative density among regions. Chavent et al. (1999) obtain a binary clustering tree based on a particular variable and a binary transformation of it. They present two different procedures: in the first one the splitting variables are recursively selected using correspondence analysis and the factorial scores lead to the binary transformation; in the second one the candidate variables and their transformations are selected simultaneously by an optimization criterion which evaluates the resulting partitions. Basak et al. (2005) propose four different measures for selecting the most appropriate features for splitting the data at every node, and two algorithms for partitioning the data at every decision node. Specifically for categorical data, Andreopoulos et al. (2007) introduce HIERDENC, an algorithm that searches for dense subspaces on the "cube" distribution of the attribute values of the data set.

Our objective is to propose a simple clustering procedure sharing the appealing properties of CART. We introduce a hierarchical top-down method called CUBT (Clustering using Unsupervised Binary Trees), where the clustering tree is based on binary rules on the original variables, which helps to understand the clustering structure. One main difference with the most popular hierarchical algorithms, such as single linkage or complete linkage, is that our clustering rule has a predictive property, since it allows classifying new observations. The procedure has three stages. The first one grows a maximal tree by applying a recursive partitioning algorithm. The second one prunes the tree using a minimal dissimilarity criterion. Finally, the third one aggregates leaves of the tree which do not necessarily share the same direct ascendant.

The paper is organized as follows. In Section 2 we introduce some notation and describe the empirical and population versions of our method. The latter describes the method in terms of the population, regarded as a random vector X ∈ Rp with unknown distribution P. The consistency of our method is shown in Section 3. In Section 4 we present the results of a simulation study where we challenge our method on several different models and compare it with k-means. We also compare, on a synthetic data set, the tree structure provided by CART (using the training sample) and by CUBT considering the same sample without the labels. A real data example is analyzed in Section 5. Concluding remarks are given in Section 6. All proofs are given in the Appendix.


2 Clustering à la CART

We start by fixing some notation. Let X ∈ Rp be a random p-dimensional real vector with coordinates X(j), j = 1, ..., p, such that E(‖X‖²) < ∞. The data consist of n independent and identically distributed realizations of X, Ξ = {X1, ..., Xn}. For the population version the space is Rp, while for the empirical version the space is Ξ. We denote by t the nodes of the tree. Each node t determines a subset of Rp, which will also be denoted by t ⊂ Rp. To the root node we assign the whole space.

Even though our procedure shares in many aspects the spirit of CART, two main differences should be pointed out. First, as we are in the case of unsupervised classification, only the information of the observations without labels is available; thus the splitting criterion cannot be based on the labels as in CART. The other essential difference is that, instead of having one final pruning stage, our algorithm has two backward phases: first we prune the tree and then there is a final joining stage. The former evaluates the merging of adjacent nodes, while the latter aims to aggregate similar clusters that do not share the same direct ascendant in the tree.

2.1 Forward step: maximal tree construction

As we are defining a top-down procedure, we begin by assigning the whole space to the root node. Let t be a node and t̂ = Ξ ∩ t the corresponding set of sample observations. At each stage a terminal node t is considered to be split into two subnodes, the left and right children tl and tr, if it fulfills a certain condition. At the beginning there is only one node, the root, which contains the whole space. The splitting rule has the form x(j) ≤ a, where x(j) is a coordinate variable and a is a threshold level. Thus, tl = {x ∈ t : x(j) ≤ a} and tr = {x ∈ t : x(j) > a}.

Let Xt be the restriction of X to the node t, i.e. Xt = X | {X ∈ t}, and let αt = P(X ∈ t) be the probability of being in t. Then R(t) is a heterogeneity measure of t defined by

    R(t) = \alpha_t \,\mathrm{trace}\bigl(\mathrm{Cov}(X_t)\bigr),    (1)

where Cov(Xt) is the covariance matrix of Xt. Thus, R(t) roughly measures the "mass concentration" of the random vector X on the set t, weighted by the mass of the set t. In the empirical algorithm αt and Cov(Xt) are replaced by their empirical versions (estimates), and R(t) is then called the deviance. We denote by nt the cardinality of t̂, n_t = \sum_{i=1}^{n} I\{X_i \in t\} (where I_A stands for the indicator function of the set A); hence the estimated probability is α̂t = nt/n and the estimate of E(‖Xt − µt‖²) is

    \frac{1}{n_t} \sum_{\{i:\, X_i \in t\}} \| X_i - \bar{X}_t \|^2,    (2)

where X̄t is the empirical mean of the observations in t. The best split for t is defined as the couple (j, a) ∈ {1, ..., p} × R (the first element indicates the variable on which the partition is defined and the second element is the threshold level) that maximizes

    \Delta(t, j, a) = R(t) - R(t_l) - R(t_r).

It is easy to verify that ∆(t, j, a) ≥ 0 for every t, j, a; this property is also satisfied by all the splitting criteria proposed in CART.

We start with the whole space assigned to the root node, and each node is then split recursively until one of the following stopping rules is satisfied:

• All the observations within the node are identical.
• There are fewer than minsize observations in the node.
• The reduction of the deviance is less than mindev × R(S), where S is the whole space.

Here minsize and mindev are tuning parameters that must be supplied by the user; we use minsize = 5 and mindev = 0.01. Once the algorithm stops, a label is assigned to each leaf (terminal node); we call this tree the maximal tree. We use the standard criterion for enumerating the nodes of a complete binary tree: if t is node number n, its left child is numbered 2n and its right child 2n + 1. At this point we have obtained a partition of the space and consequently a partition of the data set, where each leaf is associated with a cluster. Ideally this tree has at least as many leaves as the population has clusters; in practice it may have too many, and an agglomerative stage must then be applied, as in CART. It is important to remark that even though the number of clusters may be known beforehand, it is not needed at this stage, and small values of mindev ensure a tree with many leaves. Moreover, if the tree has the same number of leaves as the number of clusters, it is not necessary to run the subsequent stages of the algorithm.
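
For readers who prefer pseudocode, the forward step can be summarized by the following sketch (a minimal illustration in Python, not the authors' implementation; the helper names `deviance`, `best_split` and `grow` are ours).

```python
import numpy as np

def deviance(X_node, n_total):
    """Empirical R(t): (n_t / n) * trace of the node covariance, i.e.
    the sum of squared distances to the node mean divided by n."""
    if len(X_node) == 0:
        return 0.0
    centred = X_node - X_node.mean(axis=0)
    return (centred ** 2).sum() / n_total

def best_split(X_node, n_total):
    """Search every coordinate and every observed threshold for the split
    maximizing Delta(t, j, a) = R(t) - R(t_l) - R(t_r)."""
    best = (None, None, 0.0)            # (variable j, threshold a, Delta)
    r_t = deviance(X_node, n_total)
    for j in range(X_node.shape[1]):
        for a in np.unique(X_node[:, j])[:-1]:      # candidate thresholds
            left = X_node[X_node[:, j] <= a]
            right = X_node[X_node[:, j] > a]
            delta = r_t - deviance(left, n_total) - deviance(right, n_total)
            if delta > best[2]:
                best = (j, a, delta)
    return best

def grow(X_node, n_total, minsize=5, mindev=0.01, r_root=None):
    """Recursively grow the maximal tree; leaves are returned as arrays."""
    if r_root is None:
        r_root = deviance(X_node, n_total)
    j, a, delta = best_split(X_node, n_total)
    if (len(X_node) < minsize or len(np.unique(X_node, axis=0)) == 1
            or j is None or delta < mindev * r_root):
        return [X_node]                             # stopping rules: leaf
    left, right = X_node[X_node[:, j] <= a], X_node[X_node[:, j] > a]
    return (grow(left, n_total, minsize, mindev, r_root)
            + grow(right, n_total, minsize, mindev, r_root))
```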

[Figure 1: Illustration of the maximal tree for data simulated from model M1 (minsize = 10, minsplit = 25, mindev = 0.01). The splits are of the form x1 < a or x2 < a, and each leaf is labeled with its number of observations.]

An example of a maximal tree is given in Figure 1 for a data set simulated from model M1, defined in Section 4. The maximal tree has fourteen leaves while model M1 has a three-group structure; the number of observations belonging to each terminal node is indicated.

2.2 Backward step: pruning and joining

In this step we successively use two algorithms to produce the final grouping: the first one prunes the tree, and the second one, which we call "joining", merges leaves that are not necessarily adjacent. The pruning criterion is named "minimum dissimilarity pruning".

2.2.1 Minimum dissimilarity pruning

In this stage we define a measure of dissimilarity between sibling nodes and collapse them if this measure is lower than a threshold. First, we consider the maximal tree T0 obtained in the previous stage. Let tl and tr be a pair of terminal nodes sharing the same direct ascendant. Next, define (in population terms) the random variable Wlr = D(X_{t_l}, X_{t_r}), the Euclidean distance between the random elements of X_{t_l} and X_{t_r}. Finally, define the dissimilarity measure between the sets tl and tr as

    \Delta_{lr} = \int_0^{\delta} q_{\alpha}(W_{lr}) \, d\alpha,

where q_α(·) stands for the quantile function and δ ∈ (0, 1) is a proportion. If ∆lr < ε, we prune the tree, i.e. we replace tl and tr by tl ∪ tr in the partition. It is worth noting that ∆lr is just "a more resistant version" of the distance between the supports of the random vectors X_{t_l} and X_{t_r}.

The dissimilarity measure ∆ can be estimated as follows. Let nl (resp. nr) be the size of t̂l (resp. t̂r). Consider, for every xi ∈ t̂l and yj ∈ t̂r, the sequences d̃l(i) = min_{y ∈ t̂r} d(xi, y) and d̃r(j) = min_{x ∈ t̂l} d(x, yj), and their ordered versions, denoted dl and dr. For δ ∈ [0, 1], let

    \bar{d}_l^{\delta} = \frac{1}{\delta n_l} \sum_{i=1}^{\delta n_l} d_l(i), \qquad \bar{d}_r^{\delta} = \frac{1}{\delta n_r} \sum_{j=1}^{\delta n_r} d_r(j).

We compute the dissimilarity between tl and tr as

    d^{\delta}(l, r) = d^{\delta}(t_l, t_r) = \max\bigl(\bar{d}_l^{\delta}, \bar{d}_r^{\delta}\bigr),

and at each step of the algorithm the leaves tl and tr are merged into their ascendant node t if d^δ(l, r) ≤ ε, where ε > 0. The dissimilarity pruning thus depends on two parameters, δ and ε; from now on we call ε "mindist".
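
As an illustration of how the empirical dissimilarity d^δ(l, r) between two sibling leaves could be computed, consider the following sketch (ours, not the authors' code; it uses scipy's cdist for the pairwise distances).

```python
import numpy as np
from scipy.spatial.distance import cdist

def dissimilarity(X_left, X_right, delta=0.4):
    """Empirical d^delta(l, r): for each point, distance to the nearest point
    of the sibling node; average the smallest delta-fraction of these
    distances on each side and take the maximum of the two."""
    pair_dist = cdist(X_left, X_right)        # all pairwise distances
    d_l = np.sort(pair_dist.min(axis=1))      # nearest-neighbour distances, left to right
    d_r = np.sort(pair_dist.min(axis=0))      # nearest-neighbour distances, right to left
    k_l = max(1, int(delta * len(d_l)))
    k_r = max(1, int(delta * len(d_r)))
    return max(d_l[:k_l].mean(), d_r[:k_r].mean())

def prune_pair(left_leaf, right_leaf, mindist=0.3, delta=0.4):
    """Merge two sibling leaves when their dissimilarity is below mindist."""
    if dissimilarity(left_leaf, right_leaf, delta) <= mindist:
        return [np.vstack([left_leaf, right_leaf])]   # collapsed into one leaf
    return [left_leaf, right_leaf]
```

In the full procedure this test would be applied bottom-up to every pair of sibling leaves of the maximal tree.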

2.2.2 Joining

The idea of the joining step is to aggregate nodes which do not necessarily share the same direct ascendant. It is based on the relative decrease of the deviance when two nodes are aggregated. For any tree T we define its deviance as

    R(T) = \frac{1}{n} \sum_{t \in \tilde{T}} R(t),


where T̃ is the set of leaves of T. Our goal is to find an optimal subtree of T. All pairs of terminal nodes ti and tj are compared by computing

    d_{ij} = \frac{R(t_i \cup t_j) - R(t_i) - R(t_j)}{R(t_i \cup t_j)}.

For the empirical version we use a plug-in estimate of dij following the definitions given in Section 2.1. As in standard hierarchical clustering procedures, pairs of terminal nodes are successively aggregated in increasing order of dij (respectively d̂ij), so at each step the number of clusters decreases by one. We may consider two different stopping rules for the joining procedure, corresponding to the case where the number of clusters k is known and to the case where k is unknown. If k is known, with m the current number of terminal nodes, repeat the following step until m ≤ k:

• For each pair of indices (i, j), 1 ≤ i < j ≤ m, let (ĩ, j̃) = arg min_{i,j} {dij}. Replace tĩ and tj̃ by their union tĩ ∪ tj̃, set m = m − 1 and iterate.

If k is unknown,

• if dĩj̃ < η, replace tĩ and tj̃ by their union tĩ ∪ tj̃, where η > 0 is a given constant, and continue until this condition is no longer fulfilled.

In the first case the stopping criterion is the number of clusters, while in the second case a threshold η for dij must be set.
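
A minimal sketch of the joining loop when k is known might look as follows (again our own illustrative code; `deviance` mirrors the empirical R(t) of Section 2.1).

```python
import numpy as np

def deviance(X_node, n_total):
    """Empirical R(t): sum of squared distances to the node mean, over n."""
    centred = X_node - X_node.mean(axis=0)
    return (centred ** 2).sum() / n_total

def joining(leaves, n_total, k):
    """Greedily merge the pair of leaves with the smallest relative deviance
    increase d_ij until only k clusters remain."""
    leaves = list(leaves)
    while len(leaves) > k:
        best_pair, best_d = None, np.inf
        for i in range(len(leaves)):
            for j in range(i + 1, len(leaves)):
                union = np.vstack([leaves[i], leaves[j]])
                r_union = deviance(union, n_total)
                d_ij = ((r_union - deviance(leaves[i], n_total)
                         - deviance(leaves[j], n_total)) / r_union
                        if r_union > 0 else 0.0)
                if d_ij < best_d:
                    best_pair, best_d = (i, j), d_ij
        i, j = best_pair
        merged = np.vstack([leaves[i], leaves[j]])
        leaves = [leaf for idx, leaf in enumerate(leaves) if idx not in (i, j)]
        leaves.append(merged)
    return leaves
```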

2.3 CUBT and k-means

In this section we discuss somewhat informally when our procedure and the well known k-means algorithm should produce a reasonably good output. We consider the case of "nice groups" that are strictly separated. More precisely, let A1, ..., Ak be disjoint connected compact sets in Rp such that A_i = \overline{A_i^0} for i = 1, ..., k, and let {Pi : i = 1, ..., k} be probability measures on Rp with supports {Ai : i = 1, ..., k}. A typical case is obtained by defining a random vector X* with density f and then considering the random vector X = X* | {f > δ} for a positive level δ, as in several hierarchical clustering procedures.

On the one hand, an admissible family for CUBT will be a family of sets A1, ..., Ak such that there exists another family of disjoint sets B1, ..., Bk, built up as intersections of a finite number of half-spaces delimited by hyperplanes orthogonal to the coordinate axes, satisfying Ai ⊂ Bi.

On the other hand, k-means is defined through the vector of centers (c1, ..., ck) minimizing

    E\Bigl(\min_{j=1,\dots,k} \| X - c_j \|\Bigr);

associated with each center cj is the convex polyhedron Sj of all points in Rp closer to cj than to any other center, called the Voronoi cell of cj. The sets of the partition S1, ..., Sk are the population clusters for k-means. Therefore, the population clusters for k-means are defined by exactly k hyperplanes in arbitrary position. An admissible family for k-means will then be a family of sets A1, ..., Ak that can be separated by exactly k hyperplanes. Even though the hyperplanes for k-means can be in general position, one cannot use more than k of them.

It is clear that in this sense CUBT is much more flexible than k-means, since its family of admissible sets is more general. For instance, k-means will necessarily fail to identify nested groups, while this is not the case for CUBT. Another important difference between k-means and CUBT is that our proposal is less sensitive to small changes in the parameters that define the partition: small changes in them produce small changes in the partition. In contrast, small changes in the centers (c1, ..., ck) defining the k-means partition can produce significant changes in the associated partition given by the Voronoi cells.

3 Consistency of CUBT

In this section we give some theoretical results about the consistency of our algorithm. First we prove an important property, the monotonicity of the deviance as the tree size increases. A simple equivalent characterization of the function R(t) is given in the following lemma.

Lemma 3.1. Let tl and tr be disjoint compact sets in Rp and denote µs = E(X_{t_s}), s = l, r. If t = tl ∪ tr, we have that

    R(t) = R(t_l) + R(t_r) + \frac{\alpha_{t_l}\alpha_{t_r}}{\alpha_t}\,\|\mu_l - \mu_r\|^2.    (3)

The proof is given in the Appendix.
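
As a quick numerical sanity check (ours, not part of the paper), the decomposition (3) can be verified with the empirical counterparts of R, α and µ on two arbitrary disjoint samples; the identity holds exactly for the plug-in estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X_l = rng.normal(loc=0.0, size=(400, 2))      # sample falling in t_l
X_r = rng.normal(loc=5.0, size=(600, 2))      # sample falling in t_r

def R_hat(X_node, n_total):
    """Empirical R(t) = alpha_t * trace(Cov(X_t))."""
    centred = X_node - X_node.mean(axis=0)
    return (centred ** 2).sum() / n_total

alpha_l, alpha_r = len(X_l) / n, len(X_r) / n
mu_l, mu_r = X_l.mean(axis=0), X_r.mean(axis=0)

lhs = R_hat(np.vstack([X_l, X_r]), n)
rhs = (R_hat(X_l, n) + R_hat(X_r, n)
       + alpha_l * alpha_r / (alpha_l + alpha_r) * np.sum((mu_l - mu_r) ** 2))
print(np.isclose(lhs, rhs))   # True, up to floating point error
```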

Remark 3.1 (Monotonicity of the function R(·) and geometric interpretation). Observe that Lemma 3.1 entails that, for all disjoint compact sets tl, tr and t = tl ∪ tr, the function R(·) is monotonic in the sense that

    R(t) \ge R(t_l) + R(t_r).    (4)

Moreover, R(t) will be close to R(tl) + R(tr) when the last term on the right hand side of (3) is small. This will happen either if one of the sets tl, tr has a very small fraction of the mass of t, and/or if the centers of the two subsets tl, tr are very close. In either case we will not want to split the set t.

The following results show the consistency of the empirical algorithm to its population version. We begin with the splitting algorithm and then follow with the pruning and joining.

Theorem 3.1. Assume that the random vector X has distribution P and a density f such that ‖x‖² f(x) is bounded. Let X1, ..., Xn be iid random vectors with the same distribution as X and denote by Pn the empirical distribution of the sample X1, ..., Xn. Let {t_{1n}, ..., t_{m_n n}} be the empirical binary partition obtained by the forward empirical algorithm, and {t_1, ..., t_m} the population version. Then mn = m ultimately, and each pair (i_{jn}, a_{jn}) ∈ {1, ..., p} × R determining the empirical partition converges a.s. to the corresponding pair (i_j, a_j) ∈ {1, ..., p} × R of the population version. In particular, this implies that

    \lim_{n \to \infty} \sum_{i=1}^{m} P(t_{in} \,\Delta\, t_i) = 0,

where ∆ stands for the symmetric difference. The proof is given in the Appendix.

Theorem 3.2. Let {t*_{1n}, ..., t*_{k_n n}} be the final empirical binary partition obtained after the forward and backward empirical algorithm, and {t*_1, ..., t*_k} the population version. Under the assumptions of Theorem 3.1 we have that kn = k ultimately (kn = k for all n if k is known), and

    \lim_{n \to \infty} \sum_{i=1}^{k} P(t^*_{in} \,\Delta\, t^*_i) = 0.

The proof is given in the Appendix.

4 Some experiments

In this section we present the results of a simulation study where we challenge our method on four different models, and we compare these results with those of k-means. As is well known, the performance of k-means strongly depends on the position of the initial centroids used to start the algorithm. Several proposals have been made to handle this effect (see Steinley and Brusco, 2007). We follow the recommendations in that reference, considering ten random initializations and keeping the one with minimum within-cluster sum of squares, given by

    \sum_{i=1}^{n} \left( \sum_{j=1}^{k} \| X_i - c_j \|^2 \, I\{X_i \in G_j\} \right),

where Gj is the j-th group and cj is the corresponding center. We denote this version k-means(10).

4.1 Simulated data sets

We consider four different models, with k = 3, 4 and 10 groups. The sample size of each group is ni = 100 for i = 1, ..., k, so the total sample size is N = 100k, except in the last model, where ni = 30 for i = 1, ..., 10 and hence N = 300.

M1. Three groups in dimension 2. The data are generated according to the distributions

    N(\mu_1, \Sigma), \quad N(\mu_2, \Sigma), \quad N(\mu_3, \Sigma),    (5)

with µ1 = (−4, 0), µ2 = (0, 0), µ3 = (5, 5) and

    \Sigma = \begin{pmatrix} 0.52 & 0.6 \\ 0.6 & 0.72 \end{pmatrix}.

The left panel of Figure 2 exhibits an example of data generated from model M1.

M2. Four groups in dimension 2. The data are generated from N(µi, Σ), i = 1, ..., 4, with centers (−1, 0), (1, 0), (0, −1), (0, 1) and covariance matrix Σ = σ² Id, with σ = 0.11, 0.13, ..., 0.19. The right panel of Figure 2 gives an example of data generated from model M2 with σ = 0.15.

[Figure 2: Scatter plots corresponding to M1 (left) and M2 (right, σ = 0.15).]

M3. Three groups in dimension 10. The data are generated according to the distributions given in (5) with Σ = σ² Id, σ = 0.7, 0.75, ..., 0.9, µ1 = (−2, ..., −2), µ2 = (0, ..., 0), µ3 = (3, ..., 3). The left panel of Figure 3 shows an example of data generated from model M3 with σ = 0.8, projected on the first two dimensions.

M4. Ten groups in dimension 5. The data are generated from N(µi, Σ), i = 1, ..., 10. The means of the first five groups are the vectors of the canonical basis, µi = ei, i = 1, ..., 5, while the centers of the remaining five groups are µ5+i = −ei, i = 1, ..., 5. In every case the covariance matrix is Σ = σ² Id, with σ = 0.11, 0.13, ..., 0.17. The right panel of Figure 3 gives an example of data generated from model M4 with σ = 0.19, projected on the first two dimensions.
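
The simulation settings translate directly into code; the following sketch (ours, with illustrative helper names) generates one sample from model M1 and one from model M2.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_m1(n_per_group=100):
    """Model M1: three bivariate normal groups with a common covariance."""
    sigma = np.array([[0.52, 0.6],
                      [0.6, 0.72]])
    means = [(-4, 0), (0, 0), (5, 5)]
    X = np.vstack([rng.multivariate_normal(m, sigma, n_per_group) for m in means])
    labels = np.repeat(np.arange(3), n_per_group)
    return X, labels

def sample_m2(sigma=0.15, n_per_group=100):
    """Model M2: four spherical normal groups with standard deviation sigma."""
    means = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    X = np.vstack([rng.normal(loc=m, scale=sigma, size=(n_per_group, 2))
                   for m in means])
    labels = np.repeat(np.arange(4), n_per_group)
    return X, labels
```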

4.2 Tuning the method

We perform M = 100 replicates for each model and compare the results with those of k-means and k-means(10). Throughout the simulations k is assumed to be known. In order to run CUBT we must fix the values of the parameters involved at each stage of the algorithm:

• For the maximal tree: minsize = 5 and mindev = 0.01.
• For the pruning stage: mindist = 0.3, 0.5 and δ = 0.4.
• For the joining stage: the value of k for each model, as stated previously.


[Figure 3: Two-dimensional projection scatter plots corresponding to M3 (left, σ = 0.8) and M4 (right, σ = 0.19).]

Since we are working with synthetic data sets, we know the actual label of each observation, so it is reasonable to measure the goodness of a partition by computing the number of misclassified observations, which is the analogue of the misclassification error for supervised classification procedures. Denote the original clusters by r = 1, ..., R and the predicted ones by s = 1, ..., S. Let y1, ..., yn be the group labels of the observations and ŷ1, ..., ŷn the class labels assigned by the clustering algorithm. Let Σ be the set of permutations of {1, ..., S} and A the set of arrangements of S elements taken from {1, ..., R}. Then the misclassification error we use may be expressed as

    MCE = \min_{\sigma \in \Sigma} \frac{1}{n} \sum_{i=1}^{n} 1\{y_i \neq \sigma(\hat{y}_i)\}  \quad \text{if } S \le R,
    MCE = \min_{\sigma \in A} \frac{1}{n} \sum_{i=1}^{n} 1\{\sigma(y_i) \neq \hat{y}_i\}  \quad \text{otherwise.}    (6)

If the number of clusters is large, the assignment problem may be solved in polynomial time using bipartite matching and the Hungarian method (Papadimitriou and Steiglitz, 1982). It is important to remark that Equation (6) is given in a more general framework than we need: in our case, as the number of clusters is known, S equals R; nonetheless Equation (6) allows S to differ from R.
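
When S = R, the minimization in (6) is a linear assignment problem over label permutations; a possible implementation (ours) uses the Hungarian-method solver available in scipy.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mce(y_true, y_pred):
    """Misclassification error (6) for S = R: find the label permutation
    maximizing the number of matches, via the Hungarian method."""
    classes = np.unique(np.concatenate([y_true, y_pred]))
    # confusion[r, s] = points with true label r and predicted label s
    confusion = np.array([[np.sum((y_true == r) & (y_pred == s)) for s in classes]
                          for r in classes])
    row, col = linear_sum_assignment(-confusion)   # maximize matched counts
    return 1.0 - confusion[row, col].sum() / len(y_true)
```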


            σ       CUBT (0.3)   CUBT (0.5)   k-Means   k-Means(10)
  Model 1
            --      0.0073       4e-04        0.0967    0
  Model 2
            0.11    0.001        0.001        0.114     0
            0.13    0.0064       0.00675      0.0979    0
            0.15    0.0182       0.0171       0.127     0
            0.17    0.0749       0.0411       0.0649    2.5e-05
            0.19    0.136        0.0913       0.0608    0.00015
  Model 3
            0.7     0.0166       0.0296       0.139     0
            0.75    0.0262       0.0275       0.134     0
            0.8     0.576        0.0427       0.127     1e-04
            0.85    0.0633       0.0515       0.0719    1e-04
            0.9     0.0986       0.0636       0.124     1e-04
  Model 4
            0.11    0.0015       0.0015       0.134     0.0015
            0.13    0.0149       0.0117       0.1       0.00713
            0.15    0.0456       0.0337       0.107     0.00297
            0.17    0.202        0.121        0.0965    0.000267

Table 1: Simulation results for models M1 to M4.

4.3 Results

Table 1 shows the results for the four models. Except for model M1, we varied the value of σ, reflecting the level of overlap between classes. We report the misclassification error obtained for each clustering procedure: CUBT with mindist = 0.3 (CUBT (0.3)) and mindist = 0.5 (CUBT (0.5)), k-Means and k-Means(10). The results for the different values of mindist are practically the same, and in almost every case the results for CUBT lie between those of k-Means and k-Means(10). Moreover, it is important to notice that the average number of leaves of the maximal tree is much larger than the actual number of clusters: for model M1 it equals 33.8, for M2 it ranges between 35.2 and 40.4, for M3 between 18.4 and 20.8, and for M4 between 45.2 and 47.8.


4.4 A comparison between CART and CUBT

We compare in a simple example the tree structure obtained by CART using the complete training sample (observations plus group labels) with the tree structure obtained by CUBT using only the training sample without the labels. We generate three subpopulations in a two-dimensional variable space. The underlying distribution of the vector X = (X1, X2) is bivariate normal, where X1 and X2 are independent and their distributions for the three groups are given by

    X1 ∼ N(0, 0.03), N(2, 0.03), N(1, 0.25),
    X2 ∼ N(0, 0.25), N(1, 0.25), N(2.5, 0.03).

The data are then rotated by π/4. A difficulty of this problem is that the optimal partitions are not parallel to the axes. Figure 4 shows, for a single sample of size 300, the partitions obtained by CART and by CUBT.

We performed 100 replicates, and for each one a training sample of size 300 is generated where every group has the same size. We then compute the mean misclassification rate with respect to the true labels. For CUBT its value was 0.032, while for CART there are no classification errors, since we use the same sample for both purposes, growing the tree and classifying the observations. Moreover, in order to compare our procedure with the traditional CART classification method, we obtained the respective binary trees for CART and CUBT; both trees are presented in Figure 5. Notice that the structure is exactly the same for both of them, and that the different cutoff values in the different branches may be understood with the aid of Figure 4, which corresponds to the same data set.
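
The construction of this synthetic example is easy to reproduce; the sketch below (ours) generates one rotated sample of size 300, reading the second argument of N(·, ·) above as a variance, which is our assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

# (mean, variance) pairs for X1 and X2 in each of the three groups;
# treating the second argument of N(., .) as a variance is an assumption.
params_x1 = [(0, 0.03), (2, 0.03), (1, 0.25)]
params_x2 = [(0, 0.25), (1, 0.25), (2.5, 0.03)]

def rotated_sample(n_per_group=100, angle=np.pi / 4):
    groups = []
    for (m1, v1), (m2, v2) in zip(params_x1, params_x2):
        x1 = rng.normal(m1, np.sqrt(v1), n_per_group)
        x2 = rng.normal(m2, np.sqrt(v2), n_per_group)
        groups.append(np.column_stack([x1, x2]))
    X = np.vstack(groups)
    c, s = np.cos(angle), np.sin(angle)
    rotation = np.array([[c, -s], [s, c]])   # rotate the cloud by pi/4
    return X @ rotation.T, np.repeat(np.arange(3), n_per_group)
```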

5 A real data example: European Jobs

The data set contains the percentage employed in different economic activities for several European countries in 1979. The categories are agriculture (A), mining (M), manufacturing (MA), power supply industries (P), construction (C), service industries (SI), finance (F), social and personal services (S), and transportation and communication (T). It is important to notice that these data were collected during the Cold War. The aim is to allocate the observations to clusters, but the number of clusters is unknown, so we study the data structure for several numbers of clusters.


[Figure 4: Plot of the partitions of the space for a generated data set. The solid lines show the partition obtained by CUBT and the dashed lines the partition obtained by CART.]

[Figure 5: Left: tree corresponding to CUBT. Right: tree corresponding to CART. In both cases the left branch corresponds to the smaller values of the partitioning variable.]

We first consider a four-group structure. In this case a single variable, the percentage employed in agriculture (A), determines the tree structure. The four groups are given in Table 2 and the corresponding tree is plotted in the top panel of Figure 6. As can be seen in the tree, the highest value of A corresponds to Turkey, which is an outlier and forms a single-observation cluster; this can be explained by its social and territorial proximity to Africa. The countries in Groups 2 and 3 were either under a communist government or going through difficult political situations; for instance, Spain was leaving behind Franco's regime. The countries belonging to Group 2 were poorer than those in Group 3. Finally, Group 4 has the lowest percentage of employment in agriculture, and the countries in that group were the most developed and not controlled by a communist regime, with the exception of East Germany. If a four-group structure is obtained by k-Means, Turkey and Ireland are each isolated in a group; Greece, Portugal, Poland, Romania and Yugoslavia form another group; and the rest of the countries form the last cluster.

  Group 1   Group 2       Group 3          Group 4
  Turkey    Greece        Ireland          Belgium
            Poland        Portugal         Denmark
            Romania       Spain            France
            Yugoslavia    Bulgaria         W. Germany
                          Czechoslovakia   E. Germany
                          Hungary          Italy
                          USSR             Luxembourg
                                           Netherlands
                                           United Kingdom
                                           Austria
                                           Norway
                                           Sweden
                                           Switzerland

Table 2: CUBT clustering structure for four groups.

If instead of four clusters we consider a five-cluster structure, the main difference is that Group 4 of the previous partition is divided into three groups (Groups 3, 4 and 5), and the variables that explain those partitions are the percentage employed in mining and the percentage employed in agriculture. The third group of the previous partition remains stable, and Groups 1 and 2 of the previous partition collapse into one group, which we denote Group 1. The percentage employed in agriculture is the variable that determines the partition between Groups 1 and 2 and the rest. If a five-cluster structure is considered via k-Means, then Turkey and Ireland are again each isolated in one group, and Greece, Portugal, Poland, Romania and Yugoslavia form another group, just as in the four-cluster structure; Switzerland and East and West Germany form a new cluster.

[Figure 6: Tree structures for four groups (top) and five groups (bottom); in each case the left branch contains the smaller values of the variable defining the partition.]

  Group 1      Group 2          Group 3          Group 4       Group 5
  Turkey       Ireland          Belgium          W. Germany    France
  Greece       Portugal         Denmark          E. Germany    Italy
  Poland       Spain            Netherlands      Switzerland   Luxembourg
  Romania      Bulgaria         United Kingdom   Austria
  Yugoslavia   Czechoslovakia   Norway
               Hungary          Sweden
               USSR

Table 3: CUBT clustering structure for five groups.


6 Concluding Remarks

A new clustering method, CUBT, "in the spirit" of CART, has been presented, defining the clusters in terms of binary rules on the original variables. This approach shares with classification trees a nice property which is very important in many practical applications: as the tree structure is based on the original variables, it helps to determine which variables are important in the cluster formation. Moreover, the tree allows the classification of new observations.

The binary tree is obtained in three stages. In the first stage the sample is split into two subsamples, reducing the heterogeneity of the data within the new subsamples according to the objective function R(·), and the procedure is applied recursively to each subsample. In the second and third stages the maximal tree obtained in the first stage is pruned, using two different criteria: one for adjacent nodes and the other for all terminal nodes. The algorithm is simple and runs in a reasonable computing time, and there are no restrictions on the dimension of the data. Our method is consistent under quite general assumptions and behaves quite well on the simulated examples that we have considered, as well as on a real data example.

A robust version could be developed by replacing, in the objective function given in (1), Cov(Xt) by a robust covariance functional robcov(Xt) (see for instance Maronna et al., 2006, Chapter 6, for a review) and then proceeding in the same way. However, a detailed study of this alternative goes beyond the scope of this work.

7 Appendix

7.1 Proof of Lemma 3.1

Observe first that, since tl and tr are disjoint,

    E(X_{t_l \cup t_r}) = \gamma \mu_l + (1 - \gamma)\mu_r,

where γ = P(X ∈ tl | X ∈ tl ∪ tr). Given j = 1, ..., p, denote M_{2i}^{(j)} = \int_{t_i} x(j)^2 \, dF(x), i = l, r, where F stands for the distribution function of the vector X. It follows easily that

    E\bigl(X_{t_l \cup t_r}(j)^2\bigr) = \gamma M_{2l}^{(j)} + (1 - \gamma) M_{2r}^{(j)},

and therefore

    \mathrm{var}\bigl(X_{t_l \cup t_r}(j)\bigr) = \gamma\, \mathrm{var}\bigl(X_{t_l}(j)\bigr) + (1 - \gamma)\, \mathrm{var}\bigl(X_{t_r}(j)\bigr) + \gamma(1 - \gamma)\bigl(\mu_l(j) - \mu_r(j)\bigr)^2.

Finally, summing over j we get the desired result.

7.2 Proof of Theorem 3.1

Let T be the family of polygons in Rp with faces orthogonal to the axes, and fix i ∈ {1, ..., p} and t ∈ T. For a ∈ R denote tl = {x ∈ t : x(i) ≤ a} and tr = t \ tl. Define

    r(t, i, a) = R(t) - R(t_l) - R(t_r),    (7)

and

    r_n(t, i, a) = R_n(t) - R_n(t_l) - R_n(t_r),    (8)

the corresponding empirical version. We start by showing the uniform convergence

    \sup_{a \in \mathbb{R}} \sup_{t \in \mathcal{T}} | r_n(t, i, a) - r(t, i, a) | \to 0 \quad a.s.    (9)

By Lemma 3.1,

    \alpha_t \, r(t, i, a) = \alpha_{t_l} \alpha_{t_r} \| \mu_l(a) - \mu_r(a) \|^2,    (10)

where αA = P(X ∈ A) and µj(a) = E(X_{t_j}), j = l, r. Then the pairs (i_{jn}, a_{jn}) and (i_j, a_j) are the arguments that maximize the right hand side of (10) with respect to the measures Pn and P, respectively. Observe that the right hand side of (10) equals

    \alpha_{t_r} \int_{t_l} \|x\|^2 \, dP(x) + \alpha_{t_l} \int_{t_r} \|x\|^2 \, dP(x) - 2 \Bigl\langle \int_{t_l} x \, dP(x), \int_{t_r} x \, dP(x) \Bigr\rangle.    (11)

In order to prove (9) it suffices to show that:

(i) \sup_{a \in \mathbb{R}} \sup_{t \in \mathcal{T}} | P_n(t_j) - P(t_j) | \to 0 a.s., j = l, r;
(ii) \sup_{a \in \mathbb{R}} \sup_{t \in \mathcal{T}} | \int_{t_j} \|x\|^2 dP_n(x) - \int_{t_j} \|x\|^2 dP(x) | \to 0 a.s., j = l, r;
(iii) \sup_{a \in \mathbb{R}} \sup_{t \in \mathcal{T}} | \int_{t_j} x(i) dP_n(x) - \int_{t_j} x(i) dP(x) | \to 0 a.s., j = l, r, i = 1, ..., p.


Since T is a Vapnik–Chervonenkis class, (i) holds. Now observe that the conditions for uniform convergence over families of sets still hold if we deal with finite signed measures. Therefore, considering the finite measure ‖x‖² dP(x) and the finite signed measure x(i) dP(x), we also have that (ii) and (iii) hold. Since

    \lim_{a \to -\infty} \alpha_{t_l} \alpha_{t_r} \| \mu_l(a) - \mu_r(a) \|^2 = \lim_{a \to \infty} \alpha_{t_l} \alpha_{t_r} \| \mu_l(a) - \mu_r(a) \|^2 = 0,

we have that argmax_{a ∈ R} r_n(t, i, a) → argmax_{a ∈ R} r(t, i, a) a.s. At the first step of the algorithm t = Rp, and we get that i_{1n} = i_1 for n large enough and a_{1n} → a_1 a.s. For the next step, the empirical procedure starts working with t_{nl} and t_{nr} while the population algorithm works with t_l and t_r. However, we have that

    \sup_{a \in \mathbb{R}} | r_n(t_{nj}, i, a) - r(t_j, i, a) | \le \sup_{a \in \mathbb{R}} \sup_{t \in \mathcal{T}} | r_n(t_{nj}, i, a) - r(t_{nj}, i, a) | + \sup_{a \in \mathbb{R}} | r(t_{nj}, i, a) - r(t_j, i, a) |,    (12)

for j = l, r. We already know that the first term on the right hand side of (12) converges to zero almost surely. To show that the second term also converges to zero, it suffices to show that

(i) \sup_{a \in \mathbb{R}} | P(t_{nj}) - P(t_j) | \to 0 a.s., j = l, r;
(ii) \sup_{a \in \mathbb{R}} | \int_{t_j} \|x\|^2 dP(x) - \int_{t_{nj}} \|x\|^2 dP(x) | \to 0 a.s., j = l, r;
(iii) \sup_{a \in \mathbb{R}} | \int_{t_j} x(i) dP(x) - \int_{t_{nj}} x(i) dP(x) | \to 0 a.s., j = l, r, i = 1, ..., p,

which follows from the assumption that ‖x‖² f(x) is bounded. This concludes the proof.

7.3 Proof of Theorem 3.2

We need to show consistency in both steps of the backward algorithm.

(i) Convergence of the pruning step. Let {t*_{1n}, ..., t*_{mn}} be the output of the forward algorithm. The partition produced by the pruning step converges to the corresponding population version from:

• the conclusions of Theorem 3.1;
• the fact that the random variables W_{lr}, d̃_l, d̃_r are positive;
• the uniform convergence of the empirical quantile function to its population version;
• the Lebesgue dominated convergence theorem.

(ii) Convergence of the joining step. Let {t_{1n}, ..., t_{mn}} be the output of the pruning algorithm. Since (i) and the conclusions of Theorem 3.1 hold, the proof will be complete if we show that, for any pair t_{ln}, t_{rn} ∈ {t_{1n}, ..., t_{mn}},

    d_{ln,rn} = \frac{R_n(t_{ln} \cup t_{rn}) - R_n(t_{ln}) - R_n(t_{rn})}{R_n(t_{ln} \cup t_{rn})}

converges almost surely to

    d_{l,r} = \frac{R(t_l \cup t_r) - R(t_l) - R(t_r)}{R(t_l \cup t_r)}

as n → ∞. Now observe that the following inequalities hold:

    | R_n(t_{ln}) - R(t_l) | \le | R_n(t_{ln}) - R(t_{ln}) | + | R(t_{ln}) - R(t_l) | \le \sup_{t \in \mathcal{T}} | R_n(t) - R(t) | + | R(t_{ln}) - R(t_l) |.    (13)

Finally, from an argument similar to that of Theorem 3.1 we obtain that

    \lim_{n \to \infty} R_n(t_{sn}) = R(t_s) \quad a.s.,

for s = l, r, and

    \lim_{n \to \infty} R_n(t_{ln} \cup t_{rn}) = R(t_l \cup t_r) \quad a.s.,

from which we derive

    \lim_{n \to \infty} d_{ln,rn} = d_{l,r} \quad a.s.,

which completes the proof.

References

Andreopoulos, B., An, A. and Wang, X., 2007. Hierarchical Density-Based Clustering of Categorical Data and a Simplification. Advances in Knowledge Discovery and Data Mining, 11–22.

Basak, J. and Krishnapuram, R., 2005. Interpretable Hierarchical Clustering by Constructing an Unsupervised Decision Tree. IEEE Transactions on Knowledge and Data Engineering, 17, 121–132.

Blockeel, H., De Raedt, L. and Ramon, J., 1998. Top-down induction of clustering trees. Proceedings of the 15th International Conference on Machine Learning, Morgan Kaufmann, 55–63.

Breiman, L., Friedman, J., Stone, C. J. and Olshen, R. A., 1984. Classification and Regression Trees. Chapman & Hall/CRC.

Chavent, M., Guinot, C., Lechevallier, Y. and Tenenhaus, M., 1999. Méthodes Divisives de Classification et Segmentation Non Supervisée: Recherche d'une Typologie de la Peau Humaine Saine. Revue de Statistique Appliquée, XLVII, 87–99.

Hastie, T., Tibshirani, R. and Friedman, J. H., 2003. The Elements of Statistical Learning. Springer.

Liu, B., Xia, Y. and Yu, P. S., 2000. Clustering through decision tree construction. CIKM '00: Proceedings of the Ninth International Conference on Information and Knowledge Management, ACM, New York, NY, USA, 20–29.

Maronna, R. A., Martin, D. R. and Yohai, V. J., 2006. Robust Statistics: Theory and Methods. Wiley Series in Probability and Statistics, Wiley.

Papadimitriou, C. and Steiglitz, K., 1982. Combinatorial Optimization: Algorithms and Complexity. Englewood Cliffs: Prentice Hall.

Questier, F., Put, R., Coomans, D., Walczak, B. and Vander Heyden, Y., 2005. The use of CART and multivariate regression trees for supervised and unsupervised feature selection. Chemometrics and Intelligent Laboratory Systems, 76(1), 45–54.

Smyth, C., Coomans, D., Everingham, Y. and Hancock, T., 2006. Autoassociative Multivariate Regression Trees for Cluster Analysis. Chemometrics and Intelligent Laboratory Systems, 80(1), 120–129.

Smyth, C., Coomans, D. and Everingham, Y., 2006. Clustering noisy data in a reduced dimension space via multivariate regression trees. Pattern Recognition, 39(3), 424–431.

Steinley, D. and Brusco, M., 2007. Initializing K-means Batch Clustering: A Critical Evaluation of Several Techniques. Journal of Classification, 24, 99–121.

Tryon, R. C., 1939. Cluster Analysis. McGraw-Hill.
