Clustering including dimensionality reduction: least-squares and maximum-likelihood approaches

Maurizio Vichi
Department of Statistics, Probability and Applied Statistics
University “La Sapienza” of Rome
P.le Aldo Moro, 5, I-00185 Rome, Italy

ABSTRACT. In this paper, new methodologies for clustering and dimensionality reduction of large data sets are presented using both least-squares and maximum-likelihood approaches. The methodologies are illustrated with both real applications and Monte Carlo simulations.

KEYWORDS: Clustering of objects and variables, dimensionality reduction, least-squares partitioning, maximum-likelihood clustering, mixture models.

1 Introduction

The analysis of proximity relationships within a set of objects can be carried out by identifying disjoint classes of objects that are perceived as similar to one another within each class. Such partitions can be obtained by applying cluster analysis methodologies. Cluster analysis, however, is also frequently used to partition variables instead of objects, or both objects and variables. For example, marketers are interested in knowing how the market can be segmented into homogeneous classes of consumers according to their preferences for products; at the same time, they may wish to know how products are clustered according to the preferences of customers. This case will be referred to as two-mode partitioning. The basic idea is to identify blocks, i.e., sub-matrices of the observed data matrix, where the objects and variables forming each block specify an object cluster and a variable cluster. Of course, two-mode partitioning requires variables expressed on the same scale, so that entries are comparable across both rows and columns. If this is not the case, the data need to be rescaled by an appropriate standardization method. The interested reader can find a comprehensive structured overview of two-mode clustering methods in Van Mechelen, Bock and De Boeck (2004).

In this paper we show the performance of some recently proposed methodologies for two-mode partitioning of two-way data. In particular, the “double k-means” (Vichi, 2000; Rocci and Vichi, 2004) for two-way data is discussed and compared with procedures that can be obtained by applying ordinary clustering techniques in repeated steps. The performance of double k-means has been tested by both a simulation study and an application to gene microarray data. Recently, Vichi and Martella (2005) have studied maximum-likelihood estimation of the double k-means parameters.
Clustering of objects and variables according to double k-means is particularly valuable when variables are not clearly distinguishable from objects, as in the case described above, where customers and products are considered. In this situation centroids for both objects and variables (e.g., mean profiles of customers and mean profiles of products) can be used to summarize the observed data matrix. For a usual multivariate data matrix, however, a reduction of the objects is generally obtained by means of centroids from a partitioning methodology, while a reduction of the variables is achieved by a factorial methodology such as PCA, hence by means of linear combinations that give different weights to the original variables. PCA and other factorial techniques often have the drawback that different factors are characterized by the same original variables, so that the interpretation of these factors becomes a relevant and complex problem. In this situation it would be useful to partition objects into clusters summarized by centroids, but also to partition variables into clusters of correlated variables, summarized by linear combinations of maximum variance, as is done in clustering and disjoint principal component analysis (CDPCA) (Vichi and Saporta, 2004). This methodology can be seen as a generalization of the double k-means.

2 The clustering and dimensionality reduction model

The double k-means model (Vichi, 2000) is formally specified as follows:

X = U Y V' + E,    (1)

where X is an (I × J) data matrix and E is the error component matrix. Matrix U = [u_ip] is an (I × P) membership matrix, taking values in {0, 1}, that specifies for each object i its membership in a class of the partition of objects into P classes. Matrix V = [v_jq] is a (J × Q) membership matrix, taking values in {0, 1}, that specifies for each variable j its membership in a class of the partition of variables into Q classes. Matrix Y = [y_pq] is the (P × Q) centroid matrix, where y_pq denotes the mean of the values corresponding to object cluster p and variable cluster q. The first term in model (1) pertains to the portion of information in X that can be explained by the simultaneous classification of objects and variables. Of course, matrix X is assumed to be column standardized if the variables are not commensurate. Vichi (2000) and Rocci and Vichi (2004) propose fast alternating least-squares algorithms for the case in which the model is estimated with the least-squares approach, while more recently Vichi and Martella (2005) estimate the parameters of the model according to a model-based likelihood approach. The double k-means model can be modified to assess a partition of the objects along a set of centroids, as above, but also a partition of the variables along a set of linear combinations of maximum variance. Model (1) is then written (Vichi and Saporta, 2004)

X = U Y V'B + E,    (2)

where B is a diagonal matrix defined so that V'BBV = I_Q and tr(BB) = Q. An efficient alternating least-squares algorithm is given.
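To make the least-squares estimation of model (1) concrete, the following is a minimal NumPy sketch of an alternating scheme: update the centroids Y as block means, reassign objects, update Y again, then reassign variables. It is written in the spirit of the alternating least-squares algorithms cited above, not as a reproduction of them; the function names `block_means` and `double_kmeans`, the random initialization, the handling of empty clusters and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def block_means(X, U, V):
    """Least-squares centroids Y for fixed memberships: means of the P x Q blocks."""
    counts = np.outer(U.sum(axis=0), V.sum(axis=0))
    # Assumption: an empty block simply gets centroid 0 (its entries do not exist).
    return (U.T @ X @ V) / np.maximum(counts, 1.0)


def double_kmeans(X, P, Q, n_iter=30, seed=0):
    """Illustrative alternating least-squares fit of X = U Y V' + E."""
    rng = np.random.default_rng(seed)
    n_obj, n_var = X.shape
    # Random initial memberships, stored as binary indicator matrices.
    U = np.eye(P)[rng.integers(0, P, size=n_obj)]
    V = np.eye(Q)[rng.integers(0, Q, size=n_var)]
    losses = []
    for _ in range(n_iter):
        Y = block_means(X, U, V)
        losses.append(float(np.sum((X - U @ Y @ V.T) ** 2)))
        # Reassign each object to the row cluster whose fitted profile is closest.
        row_models = Y @ V.T                                  # P x J
        d_rows = ((X[:, None, :] - row_models[None, :, :]) ** 2).sum(axis=2)
        U = np.eye(P)[d_rows.argmin(axis=1)]
        # Refresh centroids, then reassign each variable in the same way.
        Y = block_means(X, U, V)
        col_models = (U @ Y).T                                # Q x I
        d_cols = ((X.T[:, None, :] - col_models[None, :, :]) ** 2).sum(axis=2)
        V = np.eye(Q)[d_cols.argmin(axis=1)]
    Y = block_means(X, U, V)
    losses.append(float(np.sum((X - U @ Y @ V.T) ** 2)))
    return U, Y, V, losses


# Synthetic block-structured data (hypothetical, for illustration only).
rng = np.random.default_rng(1)
U0 = np.eye(3)[np.repeat([0, 1, 2], 10)]      # 30 objects in 3 classes
V0 = np.eye(2)[np.repeat([0, 1], 4)]          # 8 variables in 2 classes
Y0 = np.array([[0.0, 5.0], [5.0, 0.0], [-5.0, -5.0]])
X = U0 @ Y0 @ V0.T + 0.1 * rng.standard_normal((30, 8))

U, Y, V, losses = double_kmeans(X, P=3, Q=2)
```

Each step minimizes the residual sum of squares with respect to one block of parameters while the others are held fixed, so the recorded loss cannot increase from one iteration to the next. Like ordinary k-means, such a scheme may stop at a local minimum, so several random starts would normally be used.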

3 Application

The short-term scenario of September 1999 on the macroeconomic performance of the national economies of twenty member countries of the Organization for Economic Co-operation and Development (OECD) was considered by Vichi and Kiers (2001) to test the ability of factorial k-means analysis (which allows a simultaneous classification of objects and a component reduction for variables) to identify classes of similar economies and to help understand the relationships within the set of observed economic indicators. The performance of the economies reflects the interaction of six main economic indicators: Gross Domestic Product (GDP), Leading Indicator (LI), Unemployment Rate (UR), Interest Rate (IR), Trade Balance (TB) and Net National Savings (NNS). The variables have been standardized. The classification obtained by tandem analysis, i.e., k-means applied to the scores on the first two principal components, with three clusters for the objects and two components for the variables, is displayed in Figure 1. The first PCA component is characterized mainly by net national savings and gross domestic product, whereas the second is characterized by interest rate and trade balance. The unemployment rate characterizes both dimensions, as can be seen from Table 1. The first component explains 28% of the total variance, while the second explains 23%. The classification of the countries is given below:

Figure 1. Tandem analysis, i.e., k-means clustering computed on the first two principal components.

First class: Australia, Canada, Finland, France, Spain, Sweden, United Kingdom, United States;
Second class: Greece, Mexico;
Third class: Austria, Belgium, Denmark, Germany, Italy, Japan, Portugal, Netherlands, Norway, Switzerland.

Table 1: Component loadings defined by PCA

         Comp 1    Comp 2
GDP      -0.567    -0.065
IR       -0.175    -0.696
LI       -0.192    -0.229
UR       -0.489     0.367
NNS       0.607    -0.092
TB        0.059     0.563
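The tandem analysis used in this section (standardize the variables, extract the first two principal components, then run k-means on the component scores) can be sketched as follows. This is only an illustrative outline under stated assumptions: the OECD data are not reproduced here, so a random placeholder matrix with the same shape (20 objects, 6 variables) is used, and `tandem_analysis` is a hypothetical helper with a plain Lloyd-type k-means written in NumPy.

```python
import numpy as np


def tandem_analysis(X, n_components, n_clusters, n_iter=100, seed=0):
    """PCA on standardized data, followed by k-means on the component scores."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)          # column standardization
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[:n_components].T                    # J x n_components
    scores = Z @ loadings                             # I x n_components
    explained = s[:n_components] ** 2 / np.sum(s ** 2)

    # Plain Lloyd k-means on the scores, started from randomly chosen objects.
    rng = np.random.default_rng(seed)
    centroids = scores[rng.choice(len(scores), n_clusters, replace=False)]
    for _ in range(n_iter):
        d = ((scores[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centroids = np.array([
            scores[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return loadings, explained, labels


# Placeholder data standing in for the 20 x 6 indicator matrix (not the real data).
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((20, 6))
loadings, explained, labels = tandem_analysis(X_demo, n_components=2, n_clusters=3)
```

Because the components are extracted without reference to the cluster structure, every variable generally receives a non-zero loading on every component, which is the interpretation problem the disjoint approach is meant to avoid.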

Clustering and disjoint PCA (CDPCA) has been applied to the same data set, fixing the number of clusters for the objects and for the variables at three and two, respectively. The component loadings matrix is shown in Table 2, while the classification of the countries is given below:

First class: Australia, Canada, Denmark, Finland, France, Germany, Italy, Spain, Sweden, United Kingdom, United States;
Second class: Greece, Mexico, Portugal;
Third class: Austria, Belgium, Japan, Netherlands, Norway, Switzerland.

The first dimension of CDPCA is still characterized mainly by net national savings, and less strongly by gross domestic product, whereas the second CDPCA dimension is characterized by interest rate and trade balance. This time, however, the unemployment rate characterizes the first dimension only, as can be seen from Table 2. The first CDPCA dimension explains 26% of the total variance, while the second accounts for 21%. Thus, the loss of explained variance with respect to PCA (28% and 23%) is small.

Table 2: Component loadings defined by Disjoint PCA

         Dim 1     Dim 2
GDP      -0.383     0
IR        0        -0.697
LI        0         0.229
UR       -0.498     0
NNS       0.778     0
TB        0         0.679
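As a quick consistency check, the Table 2 loadings can be read as the (6 × 2) matrix BV of model (2), with V the binary variable membership matrix and B the diagonal weight matrix. The sketch below rebuilds V and B from the printed values and evaluates the constraints V'BBV = I_Q and tr(BB) = Q. The assignment of each printed value to Dim 1 or Dim 2 follows the description of the two dimensions in the text and is an assumption about the table layout; because the loadings are rounded to three decimals, the constraints can only hold approximately.

```python
import numpy as np

variables = ["GDP", "IR", "LI", "UR", "NNS", "TB"]

# Loadings as read from Table 2 (columns: Dim 1, Dim 2); layout is an assumption.
A = np.array([
    [-0.383,  0.000],   # GDP
    [ 0.000, -0.697],   # IR
    [ 0.000,  0.229],   # LI
    [-0.498,  0.000],   # UR
    [ 0.778,  0.000],   # NNS
    [ 0.000,  0.679],   # TB
])

# Disjoint structure: each variable loads on exactly one dimension.
V = (A != 0).astype(float)          # (J x Q) binary membership matrix
B = np.diag(A.sum(axis=1))          # (J x J) diagonal weights, one per variable

gram = V.T @ B @ B @ V              # should be close to the identity I_Q
trace = np.trace(B @ B)             # should be close to Q = 2

for q in range(A.shape[1]):
    members = [v for v, flag in zip(variables, V[:, q]) if flag == 1]
    print(f"Dim {q + 1}: {members}, squared norm = {gram[q, q]:.3f}")
print(f"tr(BB) = {trace:.3f}")
```

Under this reading the off-diagonal entries of V'BBV are exactly zero, because no variable belongs to both dimensions, and only the diagonal is affected by rounding.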

Comparing the two graphical representations in Figures 1 and 2, it can be observed that CDPCA shows three homogeneous classes more clearly; these mainly contain the same countries as in the tandem analysis, with some relevant differences. The differences are mainly due to the role of the unemployment rate in the two analyses and, less strongly, to that of the leading indicator. In CDPCA, UR characterizes the first dimension only, while it influences both dimensions in Figure 1. In Figure 2, Italy and Germany are positioned higher in the plot than in Figure 1, to better represent their higher unemployment rate. In Figure 2, Mexico and Portugal are also located much closer to each other because they have very similar values of GDP, LI and TB, which describe the first dimension of CDPCA.

Figure 2. Clustering and Disjoint PCA.

4 Bibliography

ROCCI, R., VICHI, M., Multimode partitioning, 2004, submitted.
VAN MECHELEN, I., BOCK, H. H., DE BOECK, P., Two-mode clustering methods: a structured overview, Statistical Methods in Medical Research, 2004, to appear.
VICHI, M., Double k-means clustering for simultaneous classification of objects and variables. In Borra et al. (eds): Advances in Classification and Data Analysis, 43-52, 2000, Springer.
VICHI, M., KIERS, H.A.L., Factorial k-means analysis for two-way data, Computational Statistics and Data Analysis, 37, 49-64, 2001.
VICHI, M., MARTELLA, F., Model-based clustering for block-data, 2005, submitted.
VICHI, M., SAPORTA, G., Clustering and Disjoint Principal Component Analysis, 2004, submitted.