sparse dimensionality reduction methods

2 downloads 0 Views 1MB Size Report
ZHANG XIAOWEI. (B.Sc., ECNU, China). A THESIS ...... [86] Z. Jin, J. Y. Yang, Z. M. Tang, and Z. S. Hu, A theorem on the uncorre- lated optimal discriminant ...
SPARSE DIMENSIONALITY REDUCTION METHODS: ALGORITHMS AND APPLICATIONS

ZHANG XIAOWEI (B.Sc., ECNU, China)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY DEPARTMENT OF MATHEMATICS NATIONAL UNIVERSITY OF SINGAPORE JULY 2013

Acknowledgements

First and foremost I would like to express my deepest gratitude to my supervisor, Associate Professor Chu Delin, for all his guidance, support, kindness and enthusiasm over the past five years of my graduate study at National University of Singapore. It is an invaluable privilege to have had the opportunity to work with him and learn many wonderful mathematical insights from him. Back in 2008 when I arrived at National University of Singapore, I knew little about the area of data mining and machine learning. It is Dr. Chu who guided me into these research areas and encouraged me to explore various ideas, and patiently helped me improve how I do research. It would not have been possible to complete this doctoral thesis without his support. Beyond being an energetic and insightful researcher, he also helped me a lot on how to communicate with other people. I feel very fortunate to be advised by Dr. Chu. I would like to thank Professor Li-Zhi Liao and Professor Michael K. Ng, both from Hong Kong Baptist University, for their assistance and support in my research. Interactions with them were very constructive and helped me a lot in writing this thesis. I am greatly indebted to National University of Singapore for providing me a full scholarship and an exceptional study environment. I would also like to thank

v

vi

Acknowledgements Department of Mathematics for providing financial support for my attendance of IMECS 2013 in Hong Kong and ICML 2013 in Atlanta. The Centre for Computational Science and Engineering provides large-scale computing facilities which enable me to conduct numerical experiments in my thesis. I am also grateful to all friends and collaborators. Special thanks go to Wang Xiaoyan and Goh Siong Thye, with whom I closely worked and collaborated. With Xiaoyan, I shared all experience as a graduate student and it was enjoyable to discuss research problems or just chat about everyday life. Siong Thye is an optimistic man and taught me a lot about machine learning, and I am more than happy to see that he continues his research at MIT and is working to become the next expert in his field. Last but not least, I want to warmly thank my family, my parents, brother and sister, who encouraged me to pursue my passion and supported my study in every possible way over the past five years.

Contents

Acknowledgements Summary

v xi

1 Introduction

1

1.1

Curse of Dimensionality . . . . . . . . . . . . . . . . . . . . . . . . .

2

1.2

Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . .

4

1.3

Sparsity and Motivations . . . . . . . . . . . . . . . . . . . . . . . . .

7

1.4

Structure of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9

2 Sparse Linear Discriminant Analysis

11

2.1

Overview of LDA and ULDA . . . . . . . . . . . . . . . . . . . . . . 12

2.2

Characterization of All Solutions of Generalized ULDA . . . . . . . . 16

2.3

Sparse Uncorrelated Linear Discriminant Analysis . . . . . . . . . . . 21 2.3.1

Proposed Formulation . . . . . . . . . . . . . . . . . . . . . . 22

2.3.2

Accelerated Linearized Bregman Method . . . . . . . . . . . . 24

2.3.3

Algorithm for Sparse ULDA . . . . . . . . . . . . . . . . . . . 29

vii

viii

Contents 2.4

2.5

Numerical Experiments and Comparison with Existing Algorithms . . 31 2.4.1

Existing Algorithms . . . . . . . . . . . . . . . . . . . . . . . 32

2.4.2

Experimental Settings . . . . . . . . . . . . . . . . . . . . . . 35

2.4.3

Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . 36

2.4.4

Real-World Data . . . . . . . . . . . . . . . . . . . . . . . . . 38

Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3 Canonical Correlation Analysis 3.1

3.2

45

Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.1.1

Various Formulae for CCA . . . . . . . . . . . . . . . . . . . . 47

3.1.2

Existing Methods for CCA . . . . . . . . . . . . . . . . . . . . 49

General Solutions of CCA . . . . . . . . . . . . . . . . . . . . . . . . 51 3.2.1

Some Supporting Lemmas . . . . . . . . . . . . . . . . . . . . 53

3.2.2

Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

3.3

Equivalent relationship between CCA and LDA . . . . . . . . . . . . 66

3.4

Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

4 Sparse Canonical Correlation Analysis

71

4.1

A New Sparse CCA Algorithm . . . . . . . . . . . . . . . . . . . . . . 72

4.2

Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

4.3

4.2.1

Sparse CCA Based on Penalized Matrix Decomposition . . . . 76

4.2.2

CCA with Elastic Net Regularization . . . . . . . . . . . . . . 78

4.2.3

Sparse CCA for Primal-Dual Data Representation . . . . . . . 78

4.2.4

Some Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . 82

Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.3.1

Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . 84

4.3.2

Gene Expression Data . . . . . . . . . . . . . . . . . . . . . . 87

Contents 4.3.3 4.4

ix Cross-Language Document Retrieval . . . . . . . . . . . . . . 93

Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

5 Sparse Kernel Canonical Correlation Analysis

101

5.1

An Introduction to Kernel Methods . . . . . . . . . . . . . . . . . . . 102

5.2

Kernel CCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

5.3

Kernel CCA Versus Least Squares Problem . . . . . . . . . . . . . . . 108

5.4

Sparse Kernel Canonical Correlation Analysis . . . . . . . . . . . . . 114

5.5

Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

5.6

5.5.1

Experimental Settings . . . . . . . . . . . . . . . . . . . . . . 120

5.5.2

Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . 121

5.5.3

Cross-Language Document Retrieval . . . . . . . . . . . . . . 123

5.5.4

Content-Based Image Retrieval . . . . . . . . . . . . . . . . . 127

Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

6 Conclusions

133

6.1

Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . 133

6.2

Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

Bibliography

137

A Data Sets

157

Summary

This thesis focuses on sparse dimensionality reduction methods, which aim to find optimal mappings to project high-dimensional data into low-dimensional spaces and at the same time incorporate sparsity into the mappings. These methods have many applications, including bioinformatics, text processing and computer vision. One challenge posed by high dimensionality is that, with increasing dimensionality, many existing data mining algorithms usually become computationally intractable. Moreover, a lot of samples are required when performing data mining techniques on high-dimensional data so that information extracted from the data is accurate, which is well known as the curse of dimensionality. To deal with this problem, many significant dimensionality reduction methods have been proposed. However, one major limitation of these dimensionality reduction techniques is that mappings learned from the training data lack sparsity, which usually makes interpretation of the results challenging or computation of the projections of new data time-consuming. In this thesis, we address the problem of deriving sparse version of some widely used dimensionality reduction methods, specifically, Linear Discriminant Analysis (LDA), Canonical Correlation Analysis (CCA) and its kernel extension Kernel Canonical Correlation Analysis (kernel CCA).

xi

xii

Summary First, we study uncorrelated LDA (ULDA) and obtain an explicit characterization of all solutions of ULDA. Based on the characterization, we propose a novel sparse LDA algorithm. The main idea of our algorithm is to select the sparsest solution from the solution set, which is accomplished by minimizing `1 -norm subject to a linear constraint. The resulted `1 -norm minimization problem is solved by (accelerated) linearized Bregman iterative method. With similar idea, we investigate sparse CCA and propose a new sparse CCA algorithm. Besides that, we also obtain a theoretical result showing that ULDA is a special case of CCA. Numerical results with synthetic and real-world data sets validate the efficiency of the proposed methods, and comparison with existing state-of-the-art algorithms shows that our algorithms are competitive. Beyond linear dimensionality reduction methods, we also investigate sparse kernel CCA, a nonlinear variant of CCA. By using the explicit characterization of all solutions of CCA, we establish a relationship between (kernel) CCA and least squares problems. This relationship is further utilized to design a sparse kernel CCA algorithm, where we penalize the least squares term by `1 -norm of the dual transformations. The resulted `1 -norm regularized least squares problems are solved by fixed-point continuation method. The efficiency of the proposed algorithm for sparse kernel CCA is evaluated on cross-language document retrieval and contentbased image retrieval.

List of Tables

1.1

Sample size required to ensure that the relative mean squared error at zero is less than 0.1 for the estimate of a normal distribution . . .

2.1

3

Simulation results. The reported values are means (and standard deviations), computed over 100 replications, of classification accuracy, sparsity, orthogonality and total number of selected features. . . . . . 37

2.2

Data stuctures: data dimension (d), training size (n), the number of classes (K) and the number of testing data (# Testing). . . . . . . . 39

2.3

Numerical results for gene data over 10 training-testing splits: mean (and standard deviation) of classification accuracy, sparsity, orthogonality and the number of selected variables. . . . . . . . . . . . . . . 40

2.4

Numerical results for image data over 10 training-testing splits: mean (and standard deviation) of classification accuracy, sparsity, orthogonality and the number of selected variables. . . . . . . . . . . . . . . 41

4.1

Comparison of results obtained by SCCA `1 with µx = µy = µ and 1 = 2 = 10−5 , PMD, CCA EN, and SCCA PD. . . . . . . . . . . . . 87

xiii

xiv

List of Tables 4.2

Data stuctures: data dimension (d1 ), training size (n), the number of classes (K) and the number of testing data (# Testing), m is the rank of matrix XY T , l is the number of columns in Wx and Wy and we choose l = m in our experiments. . . . . . . . . . . . . . . . . . . 88

4.3

Comparison of classification accuracy (%) between ULDA and WxN S of CCA using 1NN as classifier . . . . . . . . . . . . . . . . . . . . . 89

4.4

Comparison of results obtained by SCCA `1 , PMD, CCA EN, and SCCA PD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

4.5

Comparison of results obtained by SCCA `1 , PMD, CCA EN, and SCCA PD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

4.6

Average AROC of standard CCA and sparse CCA algorithms using Data Set I (French to English). . . . . . . . . . . . . . . . . . . . . . 97

4.7

Average AROC of standard CCA, SCCA `1 and SCCA PD using Data Set II (French to English). . . . . . . . . . . . . . . . . . . . . . 98

5.1

Computational complexity of Algorithm 7 . . . . . . . . . . . . . . . 120

5.2

Correlation between the first pair of canonical variables obtained by ordinary CCA, RKCCA and SKCCA. . . . . . . . . . . . . . . . . . . 121

5.3

Cross-language document retrieval using CCA, KCCA, RKCCA and SKCCA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

5.4

Content-based image retrieval using CCA, KCCA, RKCCA and SKCCA.129

List of Figures

2.1

2D visualization of the SRBCT data: all samples are projected onto the first two sparse discriminant vectors obtained by PLDA (upper left), SDA (upper right), GLOSS (lower left) and SULDA (lower right), respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.1

True value of vectors v1 and v2 . . . . . . . . . . . . . . . . . . . . . . 85

4.2

Wx and Wy computed by different sparse CCA algorithms: (a) SCCA `1 (our approach), (b) Algorithm PMD, (c) Algorithm CCA EN, (d) Algorithm SCCA PD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

4.3

Average AROC achieved by CCA and sparse CCA as a function of the number of columns of (Wx , Wy ) used: (a) Data Set I, (b) Data Set II. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

5.1

Plots of the first pair of canonical variates:(a) sample data, (b) ordinary CCA, (c) RKCCA, (d) SKCCA. . . . . . . . . . . . . . . . . . . 122

xv

xvi

List of Figures 5.2

Cross-language document retrieval using CCA, KCCA, RKCCA and SKCCA: (a) Europarl data with 50 training data, (b) Europarl data with 100 training data,(c) Hansard data with 200 training data, (d) Hansard data with 400 training data. . . . . . . . . . . . . . . . . . . 124

5.3

Gabor filters used to extract texture features. Four frequencies f = 1/λ = [0.15, 0.2, 0.25, 0.3] and four directions θ = [0, π/4, π/2, 3π/4] are used. The width of the filters are σ = 4. . . . . . . . . . . . . . . 128

5.4

Content-based image retrieval using CCA, KCCA, RKCCA and SKCCA on UW ground truth data with 217 training data. . . . . . . . . . . . 130

Chapter

1

Introduction Over the past few decades, data collection and storage capabilities as well as data management techniques have achieved great advances. Such advances have led to a leap of information in most scientific and engineering fields. One of the most significant reflections is the prevalence of high-dimensional data, including microarray gene expression data [7, 51], text documents [12, 90], functional magnetic resonance imaging (fMRI) data [59, 154], image/video data and high-frequency financial data, where the number of features can reach tens of thousands. While the proliferation of high-dimensional data lays the foundation for knowledge discovery and pattern analysis, it also imposes challenges on researchers and practitioners of effectively utilizing these data and mining useful information from them, due to the high dimensionality character of these data [47]. One common challenge posed by high dimensionality is that, with increasing dimensionality, many existing data mining algorithms usually become computationally intractable and therefore inapplicable in many real-world applications. Moreover, a lot of samples are required when performing data mining techniques on high-dimensional data so that information extracted from the data is accurate, which is well known as the curse of dimensionality.

1

2

Chapter 1. Introduction

1.1

Curse of Dimensionality

The phrase ‘curse of dimensionality’, apparently coined by Richard Bellman in [117], is used by the statistical community to describe the problem that the number of samples required to estimate a function with a specific level of accuracy grows exponentially with the dimension it comprises. Intuitively, as we increase the dimension, most likely we will include more noise or outliers as well. In addition, if the samples we collect are inadequate, we might be misguided by the wrong representation of the data. For example, we might keep sampling from the tail of a distribution, as illustrated by the following example. Example 1.1. Consider a sphere of radius r in d dimensions together with the concentric hypercube of side 2r, so that the sphere touches the hypercube at the centres of each of its sides. The volume of the hypercube is (2r)d and the volume of the sphere is

2rd π d/2 , dΓ(d/2)

where Γ(·) is the gamma function defined by Z Γ(x) =



ux−1 e−u du.

0

Thus, the ratio of the volume of the sphere to the volume of the cube is given by π d/2 , d2d−1 Γ(d/2)

which converges to zero as d → ∞. We can see from this result that, in

high dimensional spaces, most of the volume of a hypercube is concentrated in the large number of corners. Therefore, in the case of a uniform distribution in high-dimensional space, most of the probability mass is concentrated in tails. Similar behaviour can be observed for the Gaussian distribution in high-dimensional spaces, where most of the probability mass of a Gaussian distribution is located within a thin shell at a large radius [15]. Another example illustrating the difficulty imposed by high dimensionality is kernel density estimation. Example 1.2. Kernel density estimation (KDE) [20] is a popular method for estimating probability density function (PDF) for a data set. For a given set of samples

1.1 Curse of Dimensionality

3

{x1 , · · · , xn } in Rd , the simplest KDE aims to estimate the PDF f (x) at a point x ∈ Rd with the estimation in the following form:   n 1X 1 kx − xi k ˆ fn (x) = k , n i=1 hdn hn where hn =

1 nd+4

is the bandwidth, k : [0, ∞) → [0, ∞) is a kernel function satisfying

certain conditions. Then the mean squared error in the estimate fˆn (x) is given by   1 2 ˆ ˆ M SE[fn (x)] = E[(fn (x) − f (x)) ] = O , as n → ∞. n4/(d+4) Thus, the convergence rate slows as the dimensionality increases. To achieve the same convergence rate with the case where d = 10 and n = 10, 000, approximately 7 million (i.e., n ≈ 7 × 106 ) samples are required if the dimensionality is increased to d = 20. To get a rough idea of the impact of sample size on the estimation error, we can look at the following table, taken from Silverman [129], which illustrates how the sample size required for a given relative mean squared error for the estimate of a normal distribution increases with the dimensionality. Table 1.1: Sample size required to ensure that the relative mean squared error at zero is less than 0.1 for the estimate of a normal distribution Dimensionality

Required Sample Size

1

4

2

19

3

67

6

2790

10

842000

Although the curse of dimensionality draws a gloomy picture for high-dimensional data analysis, we still have hope in the fact that, for many high-dimensional data in practice, the intrinsic dimensionality [61] of these data may be low in the sense that the minimum number of parameters required to account for the observed properties of these data is much smaller. A typical example of this kind arises in document classification [12, 96].

4

Chapter 1. Introduction Example 1.3 (Text document data). The simplest possible way of representing a document is as a bag-of-words, where a document is represented by the words it contains, with the ordering of these words being ignored. For a given collection of documents, we can get a full set of words appearing in the documents being processed. The full set of words is referred as the dictionary whose dimensionality is typically in tens of thousands. Each document is represented as a vector in which each coordinate describes the weight of one word from the dictionary. Although the dictionary has high dimensionality, the vector associated with a given document may contain only a few hundred non-zero entries, since the document typically contains only very few of the vast number of words in the dictionary. In this sense, the intrinsic dimensionality of this data is the number of non-zero entries in the vector, which is far smaller than the dimensionality of the dictionary. To avoid the curse of dimensionality, we can design methods which depend only on the intrinsic dimensionality of the data; or alternatively work on the lowdimensional data obtained by applying dimensionality reduction techniques to the high-dimensional data.

1.2

Dimensionality Reduction

Dimensionality reduction, aiming at reducing the dimensionality of original data, transforms the high-dimensional data into a much lower dimensional space and at the same time preserves essential information contained in the original data as much as possible. It has been widely applied in many areas, including text mining, image retrieval, face recognition, handwritten digit recognition and microarray data analysis. Besides avoiding the curse of dimensionality, there are many other motivations for us to consider dimensionality reduction. For example, dimensionality reduction can remove redundant and noisy data and avoid data over-fitting, which improves

1.2 Dimensionality Reduction

5

the quality of data and facilitates further processing tasks such as classification and retrieval. The need for dimensionality reduction also arises for data compression in the sense that, by applying dimensionality reduction, the size of the data can be reduced significantly, which saves a lot storage space and reduces computational cost in further processing. Another motivation of dimensionality reduction is data visualization. Since visualization of high-dimensional data is almost beyond the capacity of human beings, through dimensionality reduction, we can construct 2-dimensional or 3-dimensional representation of high-dimensional data such that essential information in the original data is preserved. In mathematical terms, dimensionality reduction can be defined as follows. Assume we are given a set of training data h i A = a1 · · · an ∈ Rd×n consisting of n samples from d-dimensional space. The goal is to learn a mapping f (·) from the training data by optimizing certain criterion such that, for each given data x ∈ Rd , f (x) is a low-dimensional representation of x. The subject of dimensionality reduction is vast, and can be grouped into different categories based on different criteria. For example, linear and non-linear dimensionality reduction techniques; unsupervised, supervised and semi-supervised dimensionality reduction techniques. In linear dimensionality, the function f is linear, that is, xL = f (x) = W T x,

(1.1)

where W ∈ Rd×l (l  d) is the projection matrix learned from training data, e.g., Principal Component Analysis (PCA) [87], Linear Discriminant Analysis (LDA) [50, 56, 61] and Canonical Correlation Analysis (CCA) [2, 79]. In nonlinear dimensionality reduction [100], the function f is non-linear, e.g., Isometric feature map (Isomap) [137], Locally Linear Embedding (LLE) [119, 121], Laplacian Eigenmaps [11] and various kernel learning techniques [123, 127]. In unsupervised learning, the training data are unlabelled and we are expected to find hidden structure of

6

Chapter 1. Introduction these unlabelled data. Typical examples of this type include Principal Component Analysis (PCA) [87] and K-mean Clustering [61]. In contrast to unsupervised learning, in supervised learning, we know the labels of training data, and try to find the discriminant function which best fits the relation between the training data and the labels. Typical examples of supervised learning techniques include Linear Discriminant Analysis (LDA) [50, 56, 61], Canonical Correlations Analysis (CCA) [2, 79] and Partial Least Squares (PLS) [148]. Semi-supervised learning falls between unsupervised and supervised learning, and makes use of both labelled and unlabelled training data (usually a small portion of labelled data with a large portion of unlabelled data). As a relatively new area, semi-supervised learning makes use of the strength of both unsupervised and supervised learning and has attracted more and more attention during last decade. More details of semi-supervised learning can be found in [26]. In this thesis, since we are interested in accounting label information for learning, we restrict our attention to supervised learning. In particular, we mainly focus on Linear Discriminant Analysis (LDA), Canonical Correlation Analysis (CCA) and its kernel extension Kernel Canonical Correlation Analysis (kernel CCA). As one of the most powerful techniques for dimensionality reduction, LDA seeks an optimal linear transformation that transforms the high-dimensional data into a much lower dimensional space and at the same time maximizes class separability. To achieve maximal separability in the reduced dimensional space, the optimal linear transformation should minimize the within-class distance and maximize the between-class distance simultaneously. Therefore, optimization criteria for classical LDA are generally formulated as the maximization of some objective functions measuring the ratio of between-class distance and within-class distance. An optimal solution of LDA can be computed by solving a generalized eigenvalue problem [61]. LDA has been applied successfully in many applications, including microarray gene expression data analysis [51, 68, 165], face recognition [10, 27, 169, 85], image retrieval [135] and document classification [80]. CCA was originally proposed in [79] and has

1.3 Sparsity and Motivations become a powerful tool in multivariate analysis for finding the correlations between two sets of high-dimensional variables. It seeks a pair of linear transformations such that the projected variables in the lower-dimensional space are maximally correlated. To extend CCA to non-linear data, many researchers [1, 4, 72, 102] applied kernel trick to CCA, which results in kernel CCA. Empirical results show that kernel CCA is efficient in handling non-linear data and can successfully find non-linear relationship between two sets of variables. It has also been shown that solutions of both CCA and kernel CCA can be obtained by solving generalized eigenvalue problems [14]. Applications of CCA and kernel CCA can be found in [1, 4, 42, 59, 63, 72, 92, 93, 102, 134, 143, 144, 158].

1.3

Sparsity and Motivations

One major limitation of dimensionality reduction techniques considered in previous section is that mappings f (·) learned from training data lack sparsity, which usually makes interpretation of the obtained results challenging or computation of the projections of new data time-consuming. For instance, in linear dimensionality reduction (1.1), low-dimensional projection xL = W T x of new data point x is a linear combination of all features in original data x, which means all features in x contribute to the extracted features in xL , thus makes it difficult to interpret xL ; in kernel learning techniques, we need to evaluate the kernel function at all training samples in order to compute projections of new data points due to the lack of sparsity in the dual transformation (see Chapter 5 for detailed explanation), which is computationally expensive. Sparsity is a highly desirable property both theoretically and computationally as it can facilitate interpretation and visualization of the extracted feature, and a sparse solution is typically less complicated and hence has better generalization ability. In many applications such as gene expression analysis and medical diagnostics, one can even tolerate a small degradation in performance to achieve high sparsity [125].

7

8

Chapter 1. Introduction The study of sparsity has a rich history and can be dated back to the principle of parsimony which states that the simplest explanation for unknown phenomena should be preferred over the complicated ones in terms of what is already known. Benefiting from recent development of compressed sensing [24, 25, 48, 49] and optimization with sparsity-inducing penalties [3, 142], extensive literature on the topic of sparse learning has emerged: Lasso and its generalizations [53, 138, 139, 170, 173], sparse PCA [39, 40, 88, 128, 174], matrix completion [23, 116], sparse kernel learning [46, 132, 140, 156], to name but a few. A typical way of obtaining sparsity is minimizing the `1 -norm of the transformation matrices.1 The use of `1 -norm for sparsity has a long history [138], and extensive study has been done to investigate the relationship between a minimal `1 norm solution and a sparse solution [24, 25, 28, 48, 49]. In the thesis, we address the problem of incorporating sparsity into the transformation matrices of LDA, CCA and kernel CCA via `1 -norm minimization or regularization. Although many sparse LDA algorithms [34, 38, 101, 103, 105, 111, 126, 152, 157] and sparse CCA algorithms [71, 114, 145, 150, 151, 153] have been proposed, they are all sequential algorithms, that is, the sparse transformation matrix in (1.1) is computed one column by one column. These sequential algorithms are usually computationally expensive, especially when there are many columns to compute. Moreover, there does not exist effective way to determine the number of columns l in sequential algorithms. To deal with these problems, we develop new algorithms for sparse LDA and sparse CCA in Chapter 2 and Chapter 4, respectively. Our methods compute all columns of sparse solutions at one time, and the computed sparse solutions are exact to the accuracy of specified tolerance. Recently, more and more attention has been drawn to the subject of sparse kernel approaches [15, 156], such as support vector machines [123], relevance vector machine [140], sparse kernel partial least squares [46, 107], sparse multiple kernel learning [132], and many others. 1

In this thesis, unless otherwise specified, the `1 -norm is defined to be summation of the absolute

value of all entries, for both a vector and a matrix.

1.4 Structure of Thesis However, seldom can be found in the area of sparse kernel CCA except [6, 136]. To fill in this gap, a novel algorithm for sparse kernel CCA is presented in Chapter 5.

1.4

Structure of Thesis

The rest of this thesis is organized as follows. • Chapter 2 studies sparse Uncorrelated Linear Discriminant Analysis (ULDA) that is an important generalization of classical LDA. We first parameterize all solutions of the generalized ULDA via solving the optimization problem proposed in [160], and then propose a novel model for computing sparse ULDA transformation matrix. • In Chapter 3, we make a new and systematic study of CCA. We first reveal the equivalent relationship between the recursive formulation and the trace formulation of the multiple-projection CCA problem. Based on this equivalence relationship, we adopt the trace formulation as the criterion of CCA and obtain an explicit characterization of all solutions of the multiple CCA problem even when the sample covariance matrices are singular. Then, we establish equivalent relationship between ULDA and CCA. • In Chapter 4, we develop a novel sparse CCA algorithm, which is based on the explicit characterization of general solutions of CCA in Chapter 3. Extensive experiments and comparisons with existing state-of-the-art sparse CCA algorithms have been done to demonstrate the efficiency of our sparse CCA algorithm. • Chapter 5 focuses on designing an efficient algorithm for sparse kernel CCA. We study sparse kernel CCA via utilizing established results on CCA in Chapter 3, aiming at computing sparse dual transformations and alleviating overfitting problem of kernel CCA, simultaneously. We first establish a relationship between CCA and least squares problems, and extend this relationship

9

10

Chapter 1. Introduction to kernel CCA. Then, based on this relationship, we succeed in incorporating sparsity into kernel CCA by penalizing the least squares term with `1 -norm of dual transformations and propose a novel sparse kernel CCA algorithm, named SKCCA. • A summary of all works in previous chapters is presented in Chapter 6, where we also point out some interesting directions for future research.

Chapter

2

Sparse Linear Discriminant Analysis Despite simplicity and popularity of Linear discriminant analysis (LDA), there are two major deficiencies that restrict its applicability in high-dimensional data analysis, where the dimension of the data space is usually thousands or more. One deficiency is that classical LDA cannot be applied directly to undersampled problems, that is, the dimension of the data space is larger than the number of data samples, due to singularity of the scatter matrices; the other is the lack of sparsity in LDA transformation matrix. To overcome the first problem, generalizations of classical LDA to undersampled problems are required. To overcome the second problem, we need to incorporate sparsity into LDA transformation matrix. So, in this chapter we study sparse LDA, specifically sparse uncorrelated linear discriminant analysis (ULDA), where ULDA is one of the most popular generalizations of classical LDA, aiming at extracting mutually uncorrelated features and computing sparse LDA transformation, simultaneously. We first characterize all solutions of the generalized ULDA via solving the optimization problem proposed in [160], then propose a novel model for computing the sparse solution of ULDA based on the characterization. This model seeks the minimum `1 -norm solution from all the solutions with minimum dimension. Finding minimum `1 -norm solution can be formulated as a `1 -minimization problem which is solved by the accelerated linearized Bregman method [21, 83, 167, 168], resulting in a

11

12

Chapter 2. Sparse Linear Discriminant Analysis new algorithm named SULDA. Different from existing sparse LDA algorithms, our approach seeks a sparse solution of ULDA directly from the solution set of ULDA, so the computed sparse transformation is an exact solution of ULDA, which further implies that extracted features by SULDA are mutually uncorrelated. Besides interpretability, sparse LDA may also be motivated by robustness to the noise, or computational efficiency in prediction. A part of the work presented in this chapter has been published in [37]. This chapter is organized as follows. We briefly review LDA and ULDA in Section 2.1, and derive a characterization of all solutions of generalized ULDA in Section 2.2. Based on this characterization we develop a novel sparse ULDA algorithm SULDA in Section 2.3, then test SULDA on both simulations and real-world data and compare it with existing state-of-the-art sparse LDA algorithms in Section 2.4. Finally, we conclude this chapter in Section 2.5.

2.1

Overview of LDA and ULDA

LDA is a popular tool for both classification and dimensionality reduction that seeks an optimal linear transformation of high-dimensional data into a low-dimensional space, where the transformed data achieve maximum class separability [50, 61, 77]. The optimal linear transformation achieves maximum class separability by minimizing the within-class distance while at the same time maximizing the between-class distance. LDA has been widely employed in numerous applications in science and engineering, including microarray data analysis, information retrieval and face recognition. Given a data matrix A ∈ Rd×n consisting of n samples from Rd . We assume A = [a1 a2 · · · an ] =

h

A1 A2 · · · AK

i

,

where aj ∈ Rd (1 ≤ j ≤ n), n is the sample size, K is the number of classes and Ai ∈ P Rd×ni with ni denoting the number of data in the ith class. So we have n = K i=1 ni .

2.1 Overview of LDA and ULDA

13

Further, we use Ni to denote the set of column indices that belong to the ith class. Classical LDA aims to compute an optimal linear transformation GT ∈ Rl×d that maps aj in the d-dimensional space to a vector aLj in the l-dimensional space GT : aj ∈ Rd → aLj := GT ai ∈ Rl , where l  d, so that class structure in the original data is preserved in the ldimensional space. In order to describe class quality, we need a measure of within-class distance and between-class distance. For this purpose, we now define scatter matrices. In discriminant analysis [61], between-class scatter matrix Sb , within-class scatter matrix Sw and total scatter matrix St are defined as: K

1X Sb = ni (c(i) − c)(c(i) − c)T , n i=1 K

Sw =

1XX (aj − c(i) )(aj − c(i) )T , n i=1 j∈N i

St =

1 n

n X

(aj − c)(aj − c)T ,

j=1

where c(i) =

1 Ai ei with ei = [1 · · · 1]T ∈ Rni , ni

i = 1, · · · , K

denotes the centroid of class i and c=

1 Ae with e = [1 · · · 1]T ∈ Rn n

denotes the global centroid. It follows from the definition that St is the sample covariance matrix and St = Sb + Sw . Moreover, let √ 1 √ Hb = √ [ n1 (c(1) − c) · · · nk (c(k) − c)] ∈ Rd×K , n 1 Hw = √ [A1 − c(1) eT1 · · · Ak − c(k) eTk ] ∈ Rd×n , n 1 Ht = √ [a1 − c · · · an − c] = A − ceT ∈ Rd×n , n

14

Chapter 2. Sparse Linear Discriminant Analysis then the scatter matrices can be expressed as Sb = Hb HbT ,

Sw = Hw HwT ,

St = Ht HtT .

(2.1)

Since K

Trace(Sb ) =

1X ni kc(i) − ck22 , n i=1

Trace(Sw ) =

1XX kaj − c(i) k22 , n i=1 j∈N

K

i

it is easy to see that Trace(Sb ) measures the distance between classes and Trace(Sw ) measures the closeness of data within the same class over all k classes. In the low-dimensional space mapped by the linear transformation GT ∈ Rl×d , between-class, within-class and total scatter matrices are of the forms SbL = GT Sb G,

SwL = GT Sw G,

StL = GT St G.

Thus, an optimal transformation GT should maximize Trace(SbL ) and minimize Trace(SwL ), simultaneously, which results in a commonly used criterion for classical LDA:  G∗ = arg max Trace((StL )−1 SbL ) . G

(2.2)

In classical LDA [61], the optimization problem above is solved by computing all the eigenpairs Sb x = λSt x,

λ 6= 0.

Thus, the optimal G∗ consists of eigenvectors of St−1 Sb corresponding to all non-zero eigenvalues, provided that St is nonsingular. Since rank(Sb ) ≤ K − 1, the reduced dimension by classical LDA is at most K − 1. Classical LDA cannot be applied directly when St is singular, which is the case for undersampled problems. To deal with the singularity of St , many generalizations of classical LDA have been proposed. These generalizations include pseudo-inverse LDA (e.g., uncorrelated LDA (ULDA) [33, 85, 160], orthogonal LDA [32, 160], null space LDA

2.1 Overview of LDA and ULDA

15

[27, 31]), two-stage LDA [81, 164], regularized LDA [58, 68, 162], GSVD-based LDA (LDA/GSVD)[80, 82], and least squares LDA [161]. For details and comparison of these generalizations, see [113] and references therein. In this chapter we are interested in the generalized ULDA [160], which considers the following optimization problem G∗ = arg max Trace((StL )(+) SbL ) = arg max Trace(SbL ), GT St G=I

GT St G=I

(2.3)

where (StL )(+) denotes the pseudo-inverse [65] of StL . The generalized ULDA has several nice properties: 1. Due to the constraint GT St G = I, feature vectors extracted by ULDA are mutually uncorrelated, thus ensuring minimum redundancy in the transformed space; 2. The generalized ULDA can handle undersampled problems; 3. The generalized ULDA and classical LDA have common optimal transformation matrix when the total scatter matrix is nonsingular. Numerical experiments on real-world data show that the generalized ULDA [33, 160] is competitive with other dimensionality reduction methods in terms of classification accuracy. An algorithm, based on the simultaneous diagonalization of the scatter matrices, was proposed in [160] for computing the optimal solution of optimization problem (2.3). Recently, an eigendecomposition-free and SVD-free ULDA algorithm was developed in [33] to improve the efficiency of the generalized ULDA. Some applications of ULDA can be found in [85, 163, 165].

16

Chapter 2. Sparse Linear Discriminant Analysis

2.2

Characterization of All Solutions of Generalized ULDA

In this section, we consider the characterization of all solutions to the optimization problem (2.3). Before that, we need the following lemma. m×m Lemma be a symmetric positive semi-definite matrix and M =   2.1. Let A ∈ R M  1  ∈ Rr×s with M1 ∈ Rm×s and m ≤ r. Then M2

Trace((M1T M1 + M2T M2 )(+) M1T AM1 ) = Trace((M T M )(+) M1T AM1 ) ≤ Trace(A). (2.4) Proof. Let the singular value decomposition (SVD) of     Σ 0 U Σ  V T =  1  M =U 0 0 U2 0

M be  0 V T, 0

where U ∈ Rr×r and V ∈ Rs×s are orthogonal, U1 ∈ Rm×r is row orthogonal and Σ is a nonsingular diagonal matrix. Then we have 

 I 0  U1T , M1 (M T M )(+) M1T = U1  0 0

which implies that Trace((M T M )(+) M1T AM1 ) = Trace(M1 (M T M )(+) M1T A)   I 0  U1T A) = Trace(U1  0 0   I 0  U1T AU1 ) = Trace( 0 0 ≤ Trace(U1T AU1 ) = Trace(U1 U1T A) = Trace(A), where the inequality is obtained from the fact that U1T AU1 is positive semi-definite and all diagonal entries are nonnegative.

2.2 Characterization of All Solutions of Generalized ULDA

17

We characterize all solutions of the optimization problem (2.3) explicitly in Theorem 2.2, which is based on singular value decomposition (SVD) [65] and simultaneous diagonalization of scatter matrices. Theorem 2.2. Let the reduced SVD of Ht be Ht = U1 Σt V1T ,

(2.5)

where U1 ∈ Rd×γ and V1 ∈ Rn×γ are column orthogonal, and Σt ∈ Rγ×γ is diagonal and nonsingular with γ = rank(Ht ) = rank(St ). Next, let the reduced SVD of T Σ−1 t U1 Hb be

Σt−1 U1T Hb = P1 Σb QT1 ,

(2.6)

where P1 ∈ Rγ×q , Q1 ∈ RK×q are column orthogonal, Σb ∈ Rq×q is diagonal and nonsingular. Then q = rank(Hb ) = rank(Sb ), and G is a solution of the optimization problem (2.3) if and only if q ≤ l ≤ γ and  h i  −1 G = U1 Σt P1 M1 + M2 Z,

(2.7)

where M1 ∈ Rγ×(l−q) is column orthogonal satisfying MT1 P1 = 0, M2 ∈ Rd×l is an arbitrary matrix satisfying MT2 U1 = 0, and Z ∈ Rl×l is orthogonal. Proof. Let U2 ∈ Rd×(d−γ) , V2 ∈ Rn×(n−γ) , P2 ∈ Rγ×(γ−q) and Q2 ∈ RK×(K−q) be h i h i h i column orthogonal matrices such that U = U1 U2 , V = V1 V2 , P = P1 P2 h i and Q = Q1 Q2 are orthogonal, respectively. Then, it is obvious that   h i Σt 0 h iT  V1 V2 , Ht = U1 U2  0 0 and h

St = Ht HtT = U1 Σt

  i Iγ 0 h iT   U2 U1 Σt U2 . 0 0

Note that St = Sb + Sw , St , Sb and Sw are all symmetric and positive semi-definite, and Sb = Hb HbT , so Sb U2 = 0 and it holds that   h i Sb 0 h iT   Sb = U1 Σt U2 U1 Σt U2 , 0 0

18

Chapter 2. Sparse Linear Discriminant Analysis where T T T T −1 T T Sb = (Σ−1 t U1 Hb )(Σt U1 Hb ) = (P1 Σb Q1 )(P1 Σb Q1 )   iT h i Σ2 0 h b   = P1 P2 P1 P 2 . 0 0 h i So, let Q = U1 Σt P1 U1 Σt P2 U2 , we have     Iq 0 0 Σ2b 0 0         St = Q  0 Iγ−q 0 QT , Sb = Q  0 0 0 QT ,     0 0 0 0 0 0   0 0 I − Σ2b  q    T Sw = St − Sb = Q  0 Iγ−q 0 Q ,   0 0 0

which yield q = rank(Hb ) = rank(Sb ), and   Σ2b 0 0     (+) St Sb = Q−T  0 0 0 QT .   0 0 0 iT h where G1 ∈ Rq×l , G2 ∈ R(γ−q)×l For any G ∈ Rd×l , let QT G = GT1 GT2 GT3 and G3 ∈ R(d−γ)×l . We have  T G1 StL = GT St G =   G2

  G  1 , G2

SbL = GT Sb G = GT1 Σ2b G1 .

This, together with lemma 2.1, yields that Trace((StL )(+) SbL ) ≤ Trace(Σ2b ),     G1 Il where equality holds if   =  . G2 0 Thus, (+)

Trace(Σ2b ) = Trace(St Sb ) = max Trace((StL )(+) SbL ) G

=

max Trace((StL )(+) SbL ),

GT St G=Il

(2.8)

2.2 Characterization of All Solutions of Generalized ULDA and we get that G ∈ Rd×l is a solution of optimization problem (2.3) if and only if GT1 G1 + GT2 G2 = Il ,

Trace(GT1 Σ2b G1 ) = Trace(Σ2b ).

q×(γ−l) that and G2 ∈ GT1 G1 + GT2 G2 = Il implies  l ≤ γ, and there exist G1 ∈ R  G1 G1  is orthogonal, which gives that G1 GT1 = Iq − G1 G1T . R(γ−q)×(γ−l) such that  G2 G2 Thus, we have the following conclusions

Trace(GT1 Σ2b G1 ) = Trace(Σ2b ) ⇔ Trace(Σ2b G1 GT1 ) = Trace(Σ2b ) ⇔ Trace(Σ2b (I − G1 G1T )) = Trace(Σ2b ) ⇔ 0 = Trace(Σ2b G1 G1T ) = Trace(G1T Σ2b G1 ) ⇔ G1 = 0 ⇔ G1 GT1 = Iq , which, in return, implies q ≤ l ≤ γ, and G ∈ Rd×l is a solution of optimization problem (2.3)    T   G  1 G1 G1   ⇔ G = Q−T G2  , G1 GT1 = Iq ,     = Il   G2 G2 G3          G1       G1  Iq 0   −T   G = Q = Z,  ,  G 2     G2 0 G3 ⇔ G3      q ≤ l ≤ γ, G3 ∈ R(γ−q)×(l−q) is column orthogonal and      Z ∈ Rl×l is orthogonal  h i  T ⇔ G = U1 Σ−1 + U G Z Z, P1 P2 G3 2 3 t where in the last equality we used h i −1 Q−T = U1 Σ−1 . t P1 U1 Σt P2 U2 h i h i Since P1 P2 and U1 U2 are orthogonal, it follows that for any M1 ∈

19

20

Chapter 2. Sparse Linear Discriminant Analysis Rγ×(l−q) and M2 ∈ Rd×l M1 is column orthogonal, and MT1 P1 = 0 ⇔ M1 = P2 G3 , for some column orthogonal G3 , and MT2 U1 = 0 ⇔ M2 = U2 G3 , for some G3 ∈ R(d−γ)×l . Therefore, we conclude that G ∈ Rd×l is a solution of optimization problem (2.3) if and only if q ≤ l ≤ γ and  h i  G = U1 Σ−1 + M P 1 M1 2 Z, t where M1 ∈ Rγ×(l−q) is column orthogonal satisfying MT1 P1 = 0, M2 ∈ Rd×l is an arbitrary matrix satisfying MT2 U1 = 0, and Z ∈ Rl×l is orthogonal. A similar result as Theorem 2.2 has been established in [33], where the optimal solution to the optimization problem (2.3) is computed by means of economic QR decomposition with/without column pivoting. When we compute the optimal linear transformation G∗ of LDA for data dimensionality reduction, we prefer the dimension of the transformed space to be as small as possible. Hence, we parameterize all minimum dimension solutions of optimization problem (2.3) in Corollary 2.3 which is a special case of Theorem 2.2 with l = q. Corollary 2.3. G ∈ Rd×l is a minimum dimension solution of the optimization problem (2.3) if and only if l = q and G = (U1 Σ−1 t P1 + M2 )Z,

(2.9)

where M2 ∈ Rd×q is any matrix satisfying MT2 U1 = 0 and Z ∈ Rq×q is orthogonal. Another motivation for considering minimum dimension solutions of optimization problem (2.3) is that theoretical results in [33] show that among all solutions of the optimization problem (2.3), minimum dimension solutions maximize the ratio Trace(SbL ) L) , Trace(Sw

which is also an important measure of class discrimination in LDA [61].

2.3 Sparse Uncorrelated Linear Discriminant Analysis Corollary 2.4. Let SG be the set of optimal solutions to the optimization problem (2.3), that is, h i  o n  d×l −1 SG = G ∈ R : G = U1 Σt P1 M1 + M2 Z, q ≤ l ≤ γ . Then  G = arg max

 Trace(SbL ) : G ∈ SG , Trace(SwL )

if and only if l = q, that is, G = (U1 Σ−1 t P1 + M2 )Z. Proof. From the proof of Theorem 2.2, we note that for any G ∈ SG , StL = GT St G = Il ,

Trace(SbL ) = Trace(Σ2b ).

Thus Trace(Σ2b ) Trace(Σ2b ) Trace(SbL ) = ≤ , Trace(SwL ) l − Trace(Σ2b ) q − Trace(Σ2b ) where equality holds if and only if l = q. Moreover, G is in the form of (2.9) when l = q. From both equations (2.7) and (2.9), we see that the optimal solution G∗ of h i generalized ULDA equals to the summation of two factors, U1 Σ−1 P M t 1 1 Z in the range space of St and M2 Z in the null space of St . Since the factor M2 Z belongs to null(Sb ) ∩ null(Sw ), it does not contain discriminative information. However, with the help of factor M2 Z we can construct a sparse solution of ULDA in the next section.

2.3

Sparse Uncorrelated Linear Discriminant Analysis

In this section, we introduce a novel model for sparse uncorrelated linear discriminant analysis (sparse ULDA) which is formulated as a `1 -minimization problem, and describe how to solve the proposed optimization problem.

21

22

Chapter 2. Sparse Linear Discriminant Analysis

2.3.1

Proposed Formulation

Note from Corollary 2.3 that G is a minimum dimension solution of the optimization problem (2.3) if and only if equality (2.9) holds, which is equivalent to U1T G = Σ−1 t P1 Z,

Z T Z = I.

(2.10)

The main idea of our sparse ULDA algorithm is to find the sparsest solution of ULDA from the set of all G satisfying (2.10). For this purpose, a natural way is to find a matrix G that minimizes the `0 -norm (cardinality),1 that is,  T G∗0 = min kGk0 : G ∈ Rd×q , U1T G = Σ−1 t P1 Z, Z Z = I .

(2.11)

However, `0 -norm is non-convex and NP-hard [109], which makes the above optimization problem intractable. A typical convex relaxation of the problem is to replace the `0 -norm with `1 -norm. This convex relaxation is supported by recent research in the field of sparse representation and compressed sensing [25, 24, 49] which shows that for most large underdetermined systems of linear equations, if the solution x∗ of the `0 -minimization problem x∗ = arg min{kxk0 : Ax = b, A ∈ Rm×n , x ∈ Rn }

(2.12)

is sufficiently sparse, then x∗ can be obtained by solving the following `1 -minimization problem: min{kxk1 : Ax = b, A ∈ Rm×n , x ∈ Rn },

(2.13)

which is known as basis pursuit (BP) problem and can be reformulated as a linear programming problem [28]. Therefore, we replace the `0 -norm with its convex relaxation `1 -norm in our sparse ULDA setting, which results in the following optimization problem  T G∗ = arg min kGk1 : G ∈ Rd×q , U1T G = Σ−1 t P1 Z, Z Z = I , 1

(2.14)

The `0 -norm (cardinality) of a vector (matrix) is defined as the number of non-zero entries in

the vector (matrix).

2.3 Sparse Uncorrelated Linear Discriminant Analysis where kGk1 :=

Pd Pq i=1

j=1

|Gij |.

Note that Z ∈ Rq×q in (2.14) is orthogonal. However, on one hand, there still lack numerically efficient methods for solving non-smooth optimization problems over the set of orthogonal matrices. On the other hand, it can introduce at most q 2 additional zeros in G by optimizing Z over all q × q orthogonal matrices assuming that the zero structure of the previous G is not destroyed; but usually, q < K  d, so the number of the additional zeros in G introduced by optimizing Z is very small compared with dq. So it is acceptable that G∗ in (2.14) is computed with a fixed Z (Z = Iq in our experiments). When q = 1, the `1 -minimization problem (2.14) reduces to the BP problem (2.13). Although the BP problem can be solved in polynomial time by standard linear programming methods [28], there exist even more efficient algorithms which exploit special properties of `1 -norm and the structure of A. For example, many algorithms [21, 22, 83, 112, 167, 168] solved the BP problem by applying Bregman iterative method, while a lot of algorithms [8, 55, 69, 149, 155, 171] considered the unconstrained basis pursuit denoising problem   1 2 n min kAx − bk2 + λkxk1 : x ∈ R 2 or constrained basis pursuit denoising problem min {kxk1 : kAx − bk2 ≤ ,

x ∈ Rn } .

When q > 1, the `1 -minimization problem (2.14) can be decomposed into q independent BP problems, which means that all numerical methods for solving (2.13) can be automatically extended to solve (2.14). Since the linearized Bregman method [21, 22, 83, 112, 167, 168] is considered as one of the most powerful methods for solving problem (2.13), and has been accelerated in a recent study [83], we apply it to solve (2.14). Before that, we briefly describe the derivation of accelerated linearized Bregman method in the next subsection.

23

24

Chapter 2. Sparse Linear Discriminant Analysis

2.3.2

Accelerated Linearized Bregman Method

We derive the (accelerated) linearized Bregman method for the basis pursuit problem min{kxk1 : Ax = b, A ∈ Rm×n , x ∈ Rn }. Most of the results derived in this subsection can be found in [21, 22, 83, 98, 112, 167, 168], and can be generalized to general convex function J (x) other than kxk1 . In order to make (2.13) simpler to solve, we usually prefer to solve the unconstrained problem  1 2 n (2.15) kAx − bk2 + µkxk1 : x ∈ R , min 2 where µ > 0 is a penalty parameter. In order to enforce that Ax = b, we must 

choose µ to be extremely close to 0. Unfortunately, a small µ makes (2.15) difficult to solve numerically for many problems. In the remaining part of this section, we introduce the linearized Bregman method which can obtain a very accurate solution to the original basis pursuit problem (2.13) using a constant µ and hence avoid the problem of numerical instabilities caused by forcing µ → 0. For any convex function J (x), the Bregman distance [18] based on J (x) between points u and v is defined as DJp (u, v) = J (u) − J (v) − hp, u − vi,

(2.16)

where p ∈ ∂J (v) is some subgradient of J at the point v. We can prove that DJp (u, v) ≥ 0 and DJp (u, v) ≥ DJp (w, v) for all points w on the line segment connecting u and v. Instead of solving (2.15), Bregman iterative method replace µkxk1 with the associated Bregman distance and solves a sequence of convex problems   1 k+1 pk k 2 x = arg min D (x, x ) + kAx − bk2 , x 2 pk+1 = pk − AT (Axk+1 − b), for k = 0, 1, · · · , starting from x0 = 0 and p0 = 0, where k

Dp (x, xk ) := µkxk1 − µkxk k1 − hpk , x − xk i

(2.17a) (2.17b)

2.3 Sparse Uncorrelated Linear Discriminant Analysis

25

denotes the Bregman distance. Here pk+1 in (2.17b) is chosen based on the optimality condition 0 ∈ µ∂kxk+1 k1 −pk +AT (Axk+1 −b) so that pk+1 ∈ µ∂kxk+1 k1 . It has been proven in [168] that xk is a solution of the basis pursuit problem (2.13) after a finite number of iterations. In each step of the Bregman iterative method, we need to solve subproblem (2.17a) which usually does not have a closed form solution and must be solved iteratively. Although many algorithms have been proposed to solve this subproblem, see [8, 69, 91, 171], it usually takes many iterations to find a satisfactory approximate solution. To overcome this difficulty, the linearized Bregman method was proposed in [168]. One of the simplest ways to derive the linearized Bregman method is the following: first, we approximate the quadratic term 12 kAx − bk22 by its linearization at xk 1 kAxk − bk22 + hAT (Axk − b), x − xk i. 2 Since the approximation is only accurate when x is close to xk , we then add a penalty term

1 kx 2δ

− xk k22 to the objective. The resulting approximation of subproblem

(2.17a) becomes   1 1 k 2 T k k k 2 k+1 pk k x = arg min D (x, x ) + kAx − bk2 + hA (Ax − b), x − x i + kx − x k2 x 2 2δ   1 k k T k k k 2 = arg min µkxk1 − hp , x − x i + hA (Ax − b), x − x i + kx − x k2 x 2δ ) (

  2 k 1

x − δ pk − AT (Axk − b) + x , (2.18) = arg min µkxk1 + x 2δ δ 2 where δ > 0 is a penalty parameter. Similar to (2.17b) in Bregman iterative method, pk+1 for the linearized Bregman method is chosen according to the optimality condition of problem (2.18): 0=p

k+1

1 + δ

   xk k+1 k T k x − δ p − A (Ax − b) + , δ

(2.19)

which leads to k

p

k+1

X xk+1 − xk xk+1 T i = p − A (Ax − b) − = ··· = A (b − Ax ) − . δ δ i=0 k

T

k

(2.20)

26

Chapter 2. Sparse Linear Discriminant Analysis Then, letting v k+1 =

k X

AT (b − Axi ),

k ≥ 0,

i=0

subproblem (2.18) becomes  

2 1 k+1

x − δv , min µkxk1 + 2 x 2δ which has a closed form unique solution xk+1 = δS(v k+1 , µ), where S(·, ·) : Rn × R+ → Rn is the componentwise soft-thresholding shrinkage operator defined as S(x, µ) = sign(x) max{|x| − µ, 0}

(2.21)

with denoting the componentwise product. It is easy to see that S(x, µ) reduces any x with magnitude not larger than µ to zero, thus reducing the `1 -norm and introducing sparsity. Therefore, we obtain the simplified iterative scheme [168] for the linearized Bregman method   v k+1 = v k + AT (b − Axk+1 ),  xk+1 = δS(v k+1 , µ),

(2.22)

for k = 0, 1, · · · , starting from x0 = 0 and v 0 = 0. The convergence of the linearized Bregman method (2.22) has been studied in [21, 22, 167], and is summarized in the following theorem. Theorem 2.5 ([21, 167]). Assume that 0 < δ
0.

28

Gradient descent applied to problem (2.24) takes the form
\[
y^{k+1} = y^k - \tau \nabla F_\mu(y^k) = y^k + \tau\bigl(b - \delta A\, S(A^T y^k, \mu)\bigr), \tag{2.26}
\]
where τ > 0 denotes the step size. Letting w^k = δS(Aᵀy^k, µ) and v^k = Aᵀy^k, one can show that the iterates w^k obtained from the gradient method (2.26) with τ = 1 coincide with the iterates x^k generated by the linearized Bregman method (2.22). The following theorem shows that the unique solution x*_µ of (2.23) can be recovered from any solution of its dual problem (2.24).

Theorem 2.6 ([167]). Let δ > 0. For any solution y* of the dual problem (2.24), x*_µ = δS(Aᵀy*, µ) is the unique solution of (2.23).

Motivated by the above theorem and by the equivalence between the linearized Bregman method and gradient descent applied to the dual problem (2.24), an accelerated linearized Bregman method was proposed in [83], obtained by solving the dual problem (2.24) with an accelerated gradient descent method [110]:
\[
\begin{cases}
y^{k+1} = \tilde y^k - \tau \nabla F_\mu(\tilde y^k) = \tilde y^k + \tau\bigl(b - \delta A\, S(A^T \tilde y^k, \mu)\bigr),\\
\tilde y^{k+1} = \alpha_k y^{k+1} + (1 - \alpha_k) y^k,
\end{cases} \tag{2.27}
\]
starting from y⁰ = ỹ⁰ = 0, where α_k := 1 + k/(k+3) = (2k+3)/(k+3) are weighting parameters. In [110], it has been proven that the accelerated gradient descent method obtains an ε-optimal solution in O(1/√ε) iterations, whereas the classical gradient descent method needs O(1/ε) iterations. Letting x^k = δS(Aᵀỹ^k, µ), ṽ^k = Aᵀỹ^k and v^k = Aᵀy^k, we obtain the accelerated linearized Bregman method:
\[
\begin{cases}
v^{k+1} = \tilde v^k + \tau A^T(b - Ax^k),\\
\tilde v^{k+1} = \alpha_k v^{k+1} + (1 - \alpha_k) v^k,\\
x^{k+1} = \delta S(\tilde v^{k+1}, \mu),
\end{cases} \tag{2.28}
\]
starting from x⁰ = 0 and v⁰ = ṽ⁰ = 0. Due to the equivalence with accelerated gradient descent, the accelerated linearized Bregman method requires O(1/√ε) iterations to obtain an ε-optimal solution to (2.23), which is stated formally in the following theorem.

Theorem 2.7 ([83]). Let the sequences {x^k} and {y^k} be generated by the accelerated linearized Bregman method (2.28) and the accelerated gradient descent method (2.27) with τ
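For completeness, a corresponding sketch of the accelerated scheme (2.28) with the weights α_k = (2k+3)/(k+3) is given below; the step size τ, the stopping rule and the calling convention are illustrative assumptions.

    import numpy as np

    def soft_threshold(x, mu):
        return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

    def accelerated_linearized_bregman(A, b, mu, delta, tau=1.0, max_iter=2000, tol=1e-8):
        # Accelerated linearized Bregman method (2.28):
        #   v^{k+1}       = v_tilde^k + tau * A^T (b - A x^k)
        #   v_tilde^{k+1} = alpha_k * v^{k+1} + (1 - alpha_k) * v^k, alpha_k = (2k+3)/(k+3)
        #   x^{k+1}       = delta * S(v_tilde^{k+1}, mu)
        # starting from x^0 = 0 and v^0 = v_tilde^0 = 0.
        n = A.shape[1]
        x = np.zeros(n)
        v = np.zeros(n)
        v_tilde = np.zeros(n)
        for k in range(max_iter):
            v_new = v_tilde + tau * (A.T @ (b - A @ x))
            alpha_k = (2 * k + 3) / (k + 3)
            v_tilde = alpha_k * v_new + (1.0 - alpha_k) * v
            v = v_new
            x = delta * soft_threshold(v_tilde, mu)
            if np.linalg.norm(A @ x - b) <= tol * max(1.0, np.linalg.norm(b)):
                break
        return x

In the experiments of Section 2.4 the parameters are set to δ = 0.9 and τ = 1; the sketch leaves them as inputs.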
ε > 0 is a tolerance parameter. Let ∆_k = U₁ᵀG^k − Σ_t⁻¹P₁Z; then U₁ᵀG^k = ∆_k + Σ_t⁻¹P₁Z, ‖∆_k‖_F ≤ ε, and
\[
(G^k)^T S_t G^k = I_q + \Delta_k^T \Sigma_t P_1 Z + Z^T P_1^T \Sigma_t \Delta_k + \Delta_k^T \Sigma_t^2 \Delta_k.
\]
The deviation of (G^k)ᵀS_tG^k from I_q at the kth iteration can therefore be bounded as
\[
\frac{\|(G^k)^T S_t G^k - I_q\|_F}{\sqrt{q}}
\le \frac{\|\Delta_k^T \Sigma_t^2 \Delta_k\|_F + 2\|Z^T P_1^T \Sigma_t \Delta_k\|_F}{\sqrt{q}}
\le \frac{\|\Sigma_t\|_2^2 \|\Delta_k\|_F^2 + 2\|\Sigma_t\|_2 \|\Delta_k\|_F}{\sqrt{q}}
= \frac{\|\Sigma_t\|_2 \|\Delta_k\|_F \bigl(2 + \|\Sigma_t\|_2 \|\Delta_k\|_F\bigr)}{\sqrt{q}}
\le \frac{\epsilon\|H_t\|_2 \bigl(2 + \epsilon\|H_t\|_2\bigr)}{\sqrt{q}} = O(\epsilon). \tag{2.33}
\]

Thus, it is expected that the extracted features are well uncorrelated if the tolerance ε is small enough.
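As a quick numerical check of this property, the deviation ‖GᵀS_tG − I_q‖_F/√q can be computed directly from a transformation matrix G and the training data. A minimal sketch follows; it assumes S_t = H_tH_tᵀ with H_t the centred data matrix, so the exact scaling should be matched to the definition of S_t used in this chapter.

    import numpy as np

    def orthogonality_measure(G, A):
        # ||G^T S_t G - I_q||_F / sqrt(q) with S_t = H_t H_t^T and H_t the centred columns of A.
        q = G.shape[1]
        H_t = A - A.mean(axis=1, keepdims=True)
        M = G.T @ (H_t @ (H_t.T @ G))     # G^T S_t G, computed without forming S_t explicitly
        return np.linalg.norm(M - np.eye(q), ord='fro') / np.sqrt(q)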

2.4

Numerical Experiments and Comparison with Existing Algorithms

This section presents experimental results comparing SULDA with its non-sparse counterpart ULDA and with three other sparse LDA algorithms: penalized Fisher's linear discriminant analysis (PLDA) [152], sparse discriminant analysis (SDA) [38] and the group-lasso optimal scoring solver (GLOSS) [103]. Before that, we give an overview of these three sparse LDA algorithms.


2.4.1

Existing Algorithms

PLDA was proposed in [152]; it penalizes the discriminant vectors in Fisher's discriminant problem, which seeks discriminant vectors β₁, ⋯, β_l by successively solving
\[
\beta_r := \arg\max_{\beta}\; \beta^T S_b \beta \quad \text{s.t. } \beta^T S_w\beta \le 1, \;\; \beta^T S_w\beta_i = 0, \;\forall i < r. \tag{2.34}
\]
In high-dimensional settings where d > n, S_w is singular. To address this singularity, a modified Fisher's discriminant problem that replaces S_w with its diagonal Ŝ_w := diag(S_w) has been proposed, for instance in [13, 51, 58]. It has also been shown that the solution β_r of problem (2.34) with S_w replaced by Ŝ_w also solves
\[
\max_{\beta}\; \{\beta^T S_b^r \beta : \beta^T \hat S_w \beta \le 1\}, \tag{2.35}
\]
where S_b^r := H_b P_r^{\perp} H_b^T; here P₁^⊥ = I and, for r > 1, P_r^⊥ is the orthogonal projection onto the space orthogonal to {H_bᵀβ_i}_{i=1}^{r-1}. Based on the reformulation (2.35) of the modified Fisher's linear discriminant problem, penalized linear discriminant analysis was proposed in [152], where the rth penalized discriminant vector β̂_r is defined as the solution of
\[
\max_{\beta}\; \Bigl\{ \beta^T S_b^r\beta - \lambda_r \sum_{j=1}^{d} |S_w(j,j)\,\beta_j| \;:\; \beta^T \hat S_w \beta \le 1 \Bigr\}, \tag{2.36}
\]

where {S_w(j,j)}_{j=1}^d denote the diagonal entries of S_w and β_j denotes the jth component of β. Optimization problem (2.36) is non-convex and was solved in [152] by a minorization-maximization algorithm, and the matrix G_PLDA := [β̂₁ ⋯ β̂_l] is defined as the sparse LDA transformation matrix in PLDA. The R package 'penalizedLDA' implementing PLDA is available on CRAN: http://cran.r-project.org/web/packages/penalizedLDA.

SDA was derived in [38] by exploiting the connection between Fisher's linear discriminant analysis problem (2.34) and the optimal scoring problem [74, 76, 77]
\[
(\theta_r, \beta_r) := \arg\min_{\theta,\beta}\; \|Y\theta - H_t^T\beta\|_2^2 \quad \text{s.t. } \tfrac{1}{n}\theta^T Y^T Y \theta = 1, \;\; \theta^T Y^T Y \theta_i = 0, \;\forall i < r, \tag{2.37}
\]

where Y ∈ R^{n×k} is a matrix of dummy variables, with Y_{i,j} = 1_{i∈N_j} indicating whether the ith observation belongs to the jth class. It has been shown in [74] that the solution β_r is proportional to the solution of Fisher's linear discriminant problem (2.34). SDA computes the rth sparse discriminant vector β̂_r by solving
\[
(\hat\theta_r, \hat\beta_r) := \arg\min_{\theta,\beta}\; \|Y\theta - H_t^T\beta\|_2^2 + \gamma\,\beta^T\Omega\beta + \lambda\|\beta\|_1 \quad \text{s.t. } \tfrac{1}{n}\theta^T Y^T Y\theta = 1, \;\; \theta^T Y^T Y \hat\theta_i = 0, \;\forall i < r, \tag{2.38}
\]

where Ω is a positive definite matrix and γ and λ are nonnegative parameters. Optimization problem (2.38) is non-convex and was solved in [38] by alternating minimization: fixing β and optimizing with respect to θ, which has a closed-form solution; then fixing θ and optimizing with respect to β, which reduces to a generalized elastic net problem [173]
\[
\hat\beta_r = \arg\min_{\beta}\; \bigl\{ \|Y\hat\theta_r - H_t^T\beta\|_2^2 + \gamma\,\beta^T\Omega\beta + \lambda\|\beta\|_1 \bigr\}. \tag{2.39}
\]

The matrix G_SDA := [β̂₁ ⋯ β̂_l] is then defined as the sparse LDA transformation matrix in SDA. A MATLAB implementation of SDA can be found at http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=5671. GLOSS, similar to SDA, addresses the sparse LDA problem by exploiting the connection between LDA and optimal scoring. However, instead of obtaining sparsity through elastic-net penalization and computing the sparse discriminant vectors recursively, GLOSS computes all sparse discriminant vectors simultaneously and obtains


sparsity through the group-lasso [170] penalty. Specifically, GLOSS considers the following group-lasso optimal scoring problem:
\[
B^* = \arg\min_{B\in\mathbb{R}^{d\times l}}\, \min_{\Theta\in\mathbb{R}^{k\times l}} \; \frac{1}{2}\|Y\Theta - H_t^T B\|_F^2 + \lambda \sum_{j=1}^{d} \|\beta_j\|_2 \quad \text{s.t. } \Theta^T Y^T Y \Theta = I_l, \tag{2.40}
\]
where B = [β₁ ⋯ β_d]ᵀ and Y is defined as in (2.37). Due to the group-lasso

penalty, GLOSS selects the same features in all discriminant directions. It has been proved in [103] that the group-lasso optimal scoring problem (2.40) is equivalent to the following regularized LDA problem:
\[
G_{RLDA} = \arg\max_{G\in\mathbb{R}^{d\times l}} \Bigl\{ \operatorname{Trace}(G^T S_b G) \;:\; G^T \bigl(S_w + \tfrac{\lambda}{n}\,\Omega\bigr) G = I_l \Bigr\},
\]
where Ω = diag(‖β₁*‖₂⁻¹, ⋯, ‖β_d*‖₂⁻¹), with the convention that ‖β_j*‖₂ = 0 implies that the jth row of G_RLDA is also zero. The group-lasso optimal scoring problem (2.40) is non-convex and was solved by an active-set iterative method in [103]. The matrix G_GLOSS := B* is defined as the sparse LDA transformation matrix in GLOSS. A MATLAB implementation of GLOSS can be found at https://www.hds.utc.fr/~grandval/dokuwiki/doku.php?id=en:code.

Compared with the above three methods, our newly proposed sparse ULDA algorithm SULDA seeks a sparse solution directly from the solution set of ULDA. Thus, the computed sparse transformation is an exact solution of ULDA rather than an approximation. This also implies that the features extracted by SULDA are mutually uncorrelated, which ensures minimum redundancy in the low-dimensional space. Moreover, the optimization problems in the aforementioned methods are non-convex, so their global optima are very difficult to obtain and a local optimum is the best one can expect. In contrast, the optimization problem of sparse ULDA is convex, and we have proposed an efficient algorithm, involving only matrix factorizations (SVDs) and multiplications, to find the global optimum.


2.4.2

Experimental Settings

We now consider the problem of selecting tuning parameters for all four sparse LDA methods. The tuning parameters {λ_r}_{r=1}^l of PLDA (2.36) were selected in [152] as
\[
\lambda_r := \lambda\,\bigl\| \hat S_w^{-1/2} S_b^r \hat S_w^{-1/2} \bigr\|_2,
\]
where λ is a non-negative constant that can be chosen by cross-validation [77]. In our experiments, we chose λ by 10-fold cross-validation with the candidate set {10⁻², 10⁻¹·⁹, ⋯, 10⁻¹}.¹ In the SDA problem (2.38), the positive definite matrix Ω is set to the identity matrix, which is the default value in the MATLAB implementation provided by the authors. The generalized elastic net problem (2.39) in SDA was solved by least angle regression-elastic net (LARS-EN) [53, 173], which can find the whole solution path. In our experiments, we chose γ and λ such that SDA and SULDA achieve comparable sparsity. For GLOSS, the authors suggested computing a series of sparse solutions along the regularization path defined by penalties λ₁ = λ_max > ⋯ > λ_T = λ_min ≥ 0, where G_GLOSS(λ_max) = 0 and λ_{j+1} = λ_j/2. As with SDA, we chose λ on the regularization path such that GLOSS and SULDA achieve comparable sparsity. In the implementation of SULDA, we set the tuning parameters δ = 0.9, τ = 1 (< 1/δ) and ε = 10⁻⁵. For the selection of µ, we used µ = 100 for the simulations in Section 2.4.3 and µ = 10 for the real-world data in Section 2.4.4. All experiments were performed under CentOS 5.2 and MATLAB v7.4 (R2007a) running on a SUN v40z server with four AMD 2.4GHz Opteron 850 CPUs and 32GB of RAM from the Centre for Computational Science and Engineering, National University of Singapore.

¹ One can always choose a finer candidate set with a larger range, but cross-validation will then take more time. In our experiments we found that the given candidate set works well for all the data sets and that the computational time for cross-validation is reasonable.
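For concreteness, the candidate set for λ and the halving regularization path used for GLOSS can be generated as in the following sketch; the path length is an illustrative assumption.

    import numpy as np

    # Candidate set {10^-2, 10^-1.9, ..., 10^-1} for the cross-validated choice of lambda in PLDA.
    lambda_candidates = 10.0 ** np.arange(-2.0, -0.95, 0.1)

    def halving_path(lambda_max, num_steps=20):
        # Regularization path lambda_1 = lambda_max > lambda_2 > ... with lambda_{j+1} = lambda_j / 2.
        return [lambda_max / (2.0 ** j) for j in range(num_steps)]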


2.4.3

Simulation Study

We compare the newly proposed sparse ULDA algorithm SULDA with PLDA, SDA and GLOSS in a simulation study in which three simulation models are considered. In each simulation, there are 300 samples of dimension d = 1000, including n = 100 samples for training, 100 samples for validation and 100 samples for testing. The simulation models are as follows (a data-generation sketch is given after the list):

(a) Simulation 1: there are two classes, i.e. k = 2. For j ∈ N_i, a_j ∼ N(µ_i, I), where
\[
\mu_1 = [\underbrace{0.6 \cdots 0.6}_{50}\;\; 0 \cdots 0], \qquad
\mu_2 = [\underbrace{0 \cdots 0}_{50}\;\; \underbrace{0.6 \cdots 0.6}_{50}\;\; 0 \cdots 0].
\]

(b) Simulation 2: there are two classes, i.e. k = 2. For j ∈ N_i, a_j ∼ N(µ_i, Σ), where
\[
\mu_1 = 0, \qquad \mu_2 = [\underbrace{0.6 \cdots 0.6}_{100}\;\; 0 \cdots 0], \qquad \Sigma = 0.5I + 0.5ee^T.
\]

(c) Simulation 3: there are two classes, i.e. k = 2. For j ∈ N_i, a_j ∼ N(µ_i, Σ), where
\[
\mu_1 = 0, \qquad \mu_2 = [\underbrace{0.5 \cdots 0.5}_{100}\;\; 1/d \cdots 1/d], \qquad \Sigma = 0.5I + 0.5ee^T.
\]
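The following sketch shows one way to generate data from these models (Simulation 2 is shown); the per-class sample size is an assumption made for illustration, and the factor structure of Σ = 0.5I + 0.5eeᵀ is used to avoid forming the d × d covariance matrix.

    import numpy as np

    def simulate_model_2(n_per_class=150, d=1000, rng=None):
        # Simulation 2: two classes with a_j ~ N(mu_i, Sigma), where
        # mu_1 = 0, mu_2 = [0.6 (100 times), 0, ..., 0] and Sigma = 0.5*I + 0.5*e*e^T.
        rng = np.random.default_rng() if rng is None else rng
        mu2 = np.zeros(d)
        mu2[:100] = 0.6

        def draw(n, mu):
            # x = mu + sqrt(0.5)*z + sqrt(0.5)*g*e has covariance 0.5*I + 0.5*e*e^T.
            z = rng.standard_normal((n, d))
            g = rng.standard_normal((n, 1))
            return mu + np.sqrt(0.5) * z + np.sqrt(0.5) * g

        samples = np.vstack([draw(n_per_class, np.zeros(d)), draw(n_per_class, mu2)])
        labels = np.repeat([1, 2], n_per_class)
        return samples, labels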

Simulations 1 and 2 are sparse discriminant models with different covariance structures and means, where Simulation 1 additionally assumes independence among features. Simulation 3 is essentially sparse in the sense that the discriminant vectors depend on all features in theory but can be well approximated by sparse discriminant vectors. Simulation results are summarized in Table 2.1, which reports the mean values (and standard deviations), computed over 100 replications, of classification accuracy, sparsity, orthogonality and the total number of selected features.²

² A feature is called a selected feature if the corresponding row of the sparse G has at least one nonzero entry.
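The sparsity and selected-variable counts reported below can be computed from a sparse transformation matrix G as in the following small sketch (the formal definitions (2.41) and (2.42) are restated after the table):

    import numpy as np

    def sparsity_percent(G):
        # Percentage of zero entries in G, cf. (2.41).
        return 100.0 * np.count_nonzero(G == 0) / G.size

    def num_selected_features(G):
        # A feature (row of G) is selected if it contains at least one nonzero entry.
        return int(np.count_nonzero(np.any(G != 0, axis=1)))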

Table 2.1: Simulation results. The reported values are means (and standard deviations), computed over 100 replications, of classification accuracy, sparsity, orthogonality and total number of selected features.

                       Accuracy (%)     Sparsity (%)       Orthogonality          # Variable
  Simulation 1:
      PLDA             98.60 (1.15)     36.80 (22.29)      8.29e-1 (2.33e-2)      632.0 (222.9)
      SDA              93.21 (2.89)     90.90 (1.02e-1)    7.28e-2 (2.08e-2)      91.0 (1.0)
      GLOSS            92.94 (2.83)     90.97 (2.36e-1)    1.30e-3 (7.40e-4)      90.3 (2.4)
      SULDA            92.88 (2.85)     90.10 (4.81e-2)    4.14e-7 (3.49e-7)      99.0 (0.5)
  Simulation 2:
      PLDA             75.92 (15.17)    57.02 (29.24)      6.70e-1 (2.80e-1)      429.8 (292.4)
      SDA              91.94 (2.93)     91.92 (9.78e-2)    1.72e-1 (3.48e-2)      80.9 (1.0)
      GLOSS            93.53 (2.44)     91.18 (2.67e-1)    1.56e-3 (9.53e-4)      88.2 (2.7)
      SULDA            94.03 (2.50)     90.09 (5.14e-2)    1.28e-6 (1.37e-6)      99.1 (0.5)
  Simulation 3:
      PLDA             70.13 (14.43)    57.96 (28.81)      7.34e-1 (3.18e-1)      420.4 (288.1)
      SDA              88.69 (3.24)     90.86 (1.26e-1)    9.34e-2 (2.72e-2)      91.4 (1.3)
      GLOSS            89.18 (3.01)     91.12 (2.43e-1)    1.68e-3 (1.04e-4)      88.8 (2.4)
      SULDA            89.65 (2.99)     90.09 (5.39e-2)    1.25e-6 (1.10e-6)      99.2 (0.5)

In Table 2.1, 'Accuracy' denotes the classification accuracy computed by the 1NN (nearest neighbour) classifier [61, 77]; 'Sparsity' denotes the percentage of zero entries in the solution G, that is,
\[
\text{Sparsity} = \frac{\text{number of zero entries in } G}{d \cdot q} \times 100\%, \tag{2.41}
\]
'Orthogonality' measures the difference between GᵀS_tG and I_q and is defined as
\[
\text{Orthogonality} = \frac{\|G^T S_t G - I_q\|_F}{\sqrt{q}}, \tag{2.42}
\]
and '# Variable' denotes the total number of selected features. As can be observed from Table 2.1, our method, SULDA, has the best overall performance across all settings. SULDA is comparable to SDA and GLOSS in terms of classification accuracy in all scenarios, and it achieves slightly


higher accuracy than SDA and GLOSS in all simulations except Simulation 1. In Simulation 1, PLDA obtains much higher accuracy than the other three competitors. One reason for this is that PLDA assumes independence among features and approximates the within-class covariance matrix by a diagonal matrix, which matches the setting of Simulation 1, where the within-class covariance matrix is the identity. However, independence among features is unrealistic in most applications, and ignoring correlations among features can lead to misleading feature selection and poor performance. In Simulations 2 and 3, where features are correlated, PLDA performs much worse than the other three competitors. Moreover, PLDA tends to compute discriminant vectors with much lower sparsity, and the standard deviation of its sparsity is very large, which suggests that PLDA is sensitive to the tuning parameter λ, whose selection can be challenging. Regarding orthogonality (that is, correlation among the extracted features in the lower-dimensional space), the clear winner is SULDA, which computes sparse discriminant vectors with orthogonality of order O(ε). We can also see from the last column of Table 2.1 that SDA, GLOSS and SULDA use nearly the same number of variables to represent the extracted features, while PLDA uses four to six times as many variables.

2.4.4

Real-World Data

We now compare SULDA with its non-sparse counterpart ULDA and the three other sparse LDA methods, PLDA, SDA and GLOSS, on real-world data, including four gene expression data sets (Colon, Prostate, SRBCT and Brain) and four image data sets (Yale64×64, UMIST, ORL64×64 and Palmprint). A more detailed description of these data sets is given in Appendix A. In our experiments, each data set was randomly split into training and testing data as follows: within each class with n_i samples, we randomly selected ⌈0.5 n_i⌉ of them as training data and used the rest as testing data. The splitting was repeated 10 times. Statistics of all data sets used are listed in Table 2.2.
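The per-class splitting procedure described above can be sketched as follows (the handling of the random seed is an implementation assumption):

    import numpy as np

    def stratified_split(labels, train_frac=0.5, rng=None):
        # For each class with n_i samples, randomly pick ceil(train_frac * n_i) indices
        # for training and use the remaining indices for testing.
        rng = np.random.default_rng() if rng is None else rng
        labels = np.asarray(labels)
        train_idx, test_idx = [], []
        for c in np.unique(labels):
            idx = rng.permutation(np.where(labels == c)[0])
            n_train = int(np.ceil(train_frac * len(idx)))
            train_idx.extend(idx[:n_train])
            test_idx.extend(idx[n_train:])
        return np.array(train_idx), np.array(test_idx)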

Table 2.2: Data structures: data dimension (d), training size (n), the number of classes (K) and the number of testing data (# Testing).

  Data set       d        n      K      # Testing
  Colon          2000     31     2      31
  Prostate       6033     51     2      51
  SRBCT          2308     32     4      31
  Brain          5597     21     5      21
  Yale64×64      4096     90     15     75
  UMIST          10304    290    20     285
  ORL64×64       4096     200    40     200
  Palmprint      4096     300    100    300

Results are summarized in Tables 2.3 and 2.4, where definitions (2.41) and (2.42) are used to compute sparsity and orthogonality, respectively. We can observe from both tables that the overall performance of SULDA is better than that of the other three sparse LDA algorithms. In particular, SULDA achieves the highest classification accuracy over all data sets, followed by SDA and GLOSS, which achieve higher accuracy than PLDA. Regarding sparsity and the number of selected variables, although we chose the tuning parameters such that SDA, GLOSS and SULDA achieve similar sparsity, GLOSS tends to compute sparse discriminant vectors that use fewer variables, followed by SULDA. This is attributed to the group-lasso penalty in GLOSS, see (2.40), which selects the same features in all discriminant vectors. All three methods compute solutions with much higher sparsity than PLDA and select fewer variables. Another important advantage of SULDA over the other three algorithms is that the extracted features in the low-dimensional space are mutually uncorrelated. We see from the 'Orthogonality' columns of Tables 2.3 and 2.4 that ‖GᵀS_tG − I_q‖_F/√q = O(ε) for SULDA, which is consistent with the error bound (2.33). For the other three algorithms, however, ‖GᵀS_tG − I_q‖_F/√q is relatively large, which implies that the extracted features are far from being uncorrelated.


Table 2.3: Numerical results for gene data over 10 training-testing splits: mean (and standard deviation) of classification accuracy, sparsity, orthogonality and the number of selected variables.

             Methods    Accuracy (%)     Sparsity (%)        Orthogonality            # Variable
  Colon      PLDA       79.68 (4.82)     71.06 (8.41)        37.60 (8.86)             578.9 (168.2)
             SDA        80.97 (5.15)     97.94 (6.00e-2)     9.47 (1.64)              41.1 (1.2)
             GLOSS      79.03 (7.79)     98.58 (7.17e-2)     9.13 (2.59)              28.5 (1.4)
             SULDA      83.87 (4.81)     98.49 (2.42e-2)     3.38e-6 (2.51e-6)        30.3 (0.5)
             ULDA       81.94 (5.52)     0 (0)               1.11e-15 (1.13e-15)      2000 (0)
  Prostate   PLDA       60.78 (14.96)    86.99 (19.68)       96.97 (2.00e+2)          785.0 (1.19e+3)
             SDA        89.80 (4.51)     98.96 (2.38e-2)     17.22 (2.97)             62.5 (1.4)
             GLOSS      90.00 (4.57)     99.23 (3.48e-2)     17.90 (2.99)             46.2 (2.1)
             SULDA      90.98 (3.72)     99.17 (1.17e-14)    5.72e-6 (5.93e-6)        50 (0)
             ULDA       93.33 (2.11)     0 (0)               1.58e-15 (1.35e-15)      6033 (0)
  SRBCT      PLDA       95.48 (4.35)     82.71 (6.14)        30.65 (9.47)             962.8 (324.8)
             SDA        97.42 (3.33)     98.94 (1.23e-2)     42.85 (5.49)             72.5 (1.1)
             GLOSS      97.74 (2.18)     98.47 (7.13e-2)     42.0 (8.89)              35.4 (1.6)
             SULDA      99.36 (1.36)     98.65 (7.61e-3)     3.91e-6 (1.59e-6)        79.6 (3.7)
             ULDA       97.74 (2.18)     0 (0)               2.90e-15 (1.74e-15)      2308 (0)
  Brain      PLDA       30.95 (25.72)    94.00 (12.65)       18.40 (36.73)            836 (1762.5)
             SDA        75.71 (5.70)     98.99 (7.00e-3)     6.57 (0.71)              223.9 (1.2)
             GLOSS      76.19 (7.78)     99.57 (3.42e-2)     2.34 (1.32)              23.9 (1.9)
             SULDA      77.62 (5.04)     99.64 (4.92e-3)     6.41e-6 (2.19e-6)        78.2 (1.9)
             ULDA       81.43 (4.17)     0 (0)               2.83e-15 (1.13e-15)      5597 (0)

Comparing the classification accuracy achieved by ULDA and SULDA, we notice that the performance of ULDA usually degrades slightly when sparsity is taken into account. However, for some data sets, especially gene expression data (Colon and SRBCT in Table 2.3), incorporating sparsity actually improves the performance of LDA.


Table 2.4: Numerical results for image data over 10 training-testing splits: mean (and standard deviation) of classification accuracy, sparsity, orthogonality and the number of selected variables.

              Methods    Accuracy (%)    Sparsity (%)       Orthogonality            # Variable
  Yale64×64   PLDA       70.13 (4.67)    76.98 (13.54)      4.10e+5 (1.15e+5)        3648.2 (699.6)
              SDA        85.87 (3.03)    99.17 (8.64e-3)    1.46e+5 (1.19e+4)        432 (9.8)
              GLOSS      82.00 (2.11)    97.50 (3.68e-1)    6.01e+4 (1.26e+4)        102.5 (15.1)
              SULDA      86.67 (2.67)    97.84 (1.71e-2)    1.86e-5 (3.05e-6)        888.8 (22.2)
              ULDA       90.00 (3.40)    0 (0)              4.35e-15 (1.80e-16)      4096 (0)
  UMIST       PLDA       90.04 (3.04)    78.77 (7.19e-1)    9.83 (0.79)              1.01e+4 (131.4)
              SDA        94.98 (1.22)    95.78 (1.05e-2)    8.03 (0.18)              5111.6 (32.0)
              GLOSS      95.19 (1.24)    99.64 (4.74e-2)    9.09e-1 (4.47e-3)        37.4 (4.9)
              SULDA      95.72 (0.93)    97.17 (1.67e-2)    2.10e-5 (1.70e-6)        3720.6 (25.8)
              ULDA       96.25 (0.72)    0 (0)              5.27e-15 (3.36e-16)      10304 (0)
  ORL64×64    PLDA       81.05 (7.26)    68.53 (12.64)      1.80e+5 (4.16e+4)        3851.9 (771.9)
              SDA        91.60 (2.04)    94.66 (8.77e-3)    2.29e+5 (1.82e+3)        3169.9 (14.4)
              GLOSS      91.50 (2.84)    95.84 (2.50e-1)    2.85e+4 (8.29e+2)        170.5 (10.2)
              SULDA      91.85 (2.12)    95.14 (1.21e-3)    2.62e-5 (3.10e-6)        2937 (32.9)
              ULDA       93.90 (2.13)    0 (0)              6.16e-15 (3.67e-16)      4096 (0)
  Palmprint   PLDA       95.50 (1.47)    62.06 (2.45e-1)    4.47 (1.05e-1)           4096 (0)
              SDA        98.27 (0.75)    92.74 (1.11e-2)    17.73 (1.27e-1)          4055.8 (7.6)
              GLOSS      96.80 (1.48)    87.48 (3.83e-1)    1.18 (1.62e-2)           512.7 (15.7)
              SULDA      98.60 (0.58)    92.67 (4.00e-3)    1.78e-5 (3.92e-7)        4012.1 (7.5)
              ULDA       99.40 (0.26)    0 (0)              7.93e-15 (3.27e-16)      4096 (0)

In addition, concerning sparsity and the number of selected variables, SULDA can obtain a highly sparse solution (with at least 95% sparsity except for Palmprint) and uses far fewer variables for interpretation, with only a small sacrifice in classification accuracy (the decrease is less than 4%). From this point of view, we conclude that SULDA strikes a good balance between computing a sparse solution and maintaining good performance in terms of classification accuracy.


[Figure 2.1 appears here: four scatter-plot panels (PLDA, SDA, GLOSS, SULDA), each plotting the 2nd discriminant vector against the 1st discriminant vector for samples from classes 1-4.]

Figure 2.1: 2D visualization of the SRBCT data: all samples are projected onto the first two sparse discriminant vectors obtained by PLDA (upper left), SDA (upper right), GLOSS (lower left) and SULDA (lower right), respectively.

Besides classification, we also found that the sparse discriminant vectors computed by SULDA are amenable to low-dimensional representations of the data. A 2-dimensional visualization of the SRBCT data is shown in Figure 2.1, where the samples were projected onto the first two sparse discriminant vectors (i.e., l = 2) computed by PLDA, SDA, GLOSS and SULDA. We can see from Figure 2.1 that SULDA gives the best class discrimination in the 2-dimensional space. For PLDA, classes 1, 3 and 4 intersect, while for SDA, classes 3 and 4 intersect. Although the classes do not intersect in the GLOSS case, data points within each class are spread far apart, especially in class 4, so the class separation obtained by GLOSS is not satisfactory. For SULDA, on the other hand, the classes are


well separated from each other and data points in the same class are close together (training data from the same class are projected to the same point). In fact, when the training data consist of n linearly independent samples, which is the case in our experiments, we can prove that γ = rank(S_t) = n − 1, q = rank(S_b) = k − 1 and rank(S_t) = rank(S_b) + rank(S_w). In this case, the minimum dimension solutions G of ULDA belong to range(S_b) ∩ null(S_w), which implies that GᵀH_w = 0 and
\[
G^T a_j = G^T c^{(i)}, \qquad \forall j \in N_i \;(1 \le i \le k),
\]
that is, all training data from the same class (class i) are projected to the same point Gᵀc^(i).
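A visualization such as Figure 2.1 can be produced by projecting the samples onto the first two discriminant vectors. A minimal sketch is given below; the data matrix A (d × n), the transformation G and the label vector are assumed to be available from the preceding computations.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_2d_projection(A, G, labels, title):
        # Project the d x n data matrix A onto the first two discriminant vectors in G.
        Z = G[:, :2].T @ A                       # 2 x n matrix of projected coordinates
        for c in np.unique(labels):
            mask = (labels == c)
            plt.scatter(Z[0, mask], Z[1, mask], s=15, label='class %s' % c)
        plt.xlabel('1st discriminant vector')
        plt.ylabel('2nd discriminant vector')
        plt.title(title)
        plt.legend()
        plt.show()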

2.5

Conclusions

In this chapter, we developed SULDA, an efficient algorithm for sparse uncorrelated LDA, based on a characterization of the solutions of generalized ULDA. Specifically, we characterized all solutions of the optimization criterion (2.3) of generalized ULDA. Based on this characterization, we incorporated sparsity into the transformation matrix by selecting the solution with minimum ℓ1-norm from all minimum dimension solutions of ULDA. The resulting ℓ1-minimization problem was solved by the accelerated linearized Bregman method. Unlike existing sparse LDA algorithms, SULDA seeks a sparse solution directly from the solution set of ULDA. Thus, the computed sparse transformation is an exact solution of ULDA rather than an approximation. This implies that the features extracted by SULDA are mutually uncorrelated, which ensures minimum redundancy in the low-dimensional space. Computationally, SULDA is easy to implement, as it requires only matrix factorizations (SVDs) and multiplications. The effectiveness of SULDA is supported by experimental results on both simulations and real-world data, including gene


expression data and image data. In our experiments, SULDA consistently outperforms its competitors in terms of classification accuracy and achieves comparable interpretability (sparsity and number of used variables). The resulting sparse transformation can also be used to visualize observations and to inspect class discrimination in the low-dimensional space. In the derivation of SULDA, we fixed the orthogonal matrix Z. One future direction is to consider an arbitrary orthogonal Z and to select the sparse solution of ULDA from a larger solution set. Another potential extension is to consider sparse kernel LDA using a similar idea. Moreover, it is well known that LDA tends to give undesirable results if the samples in a class form several separate clusters (i.e., are multimodal), and many extensions of LDA have been proposed to deal with this problem [75, 133]. In the future, we plan to extend the idea of SULDA to discriminant analysis approaches that can handle multimodal labelled data.

Chapter 3

Canonical Correlation Analysis

In many real-world applications, researchers are faced with the problem of analyzing so-called paired data (X, Y), where X and Y can be associated either with two different objects or with two different views of the same object, and of finding the underlying relationship between them. For instance, in cross-language document retrieval [35, 144], there are documents in two different languages describing the same content; in content-based image retrieval [36, 73], the image and the associated text can be considered as two different views of the same object; and in genomic data analysis [158], DNA copy number variations, gene expression data and single nucleotide polymorphism (SNP) data might all be available on a common set of patient samples. Canonical correlation analysis (CCA) [2, 79] is a powerful tool in multivariate data analysis for finding the correlation between two sets of multidimensional variables. CCA seeks a pair of linear transformations associated with the two sets of variables such that the projected variables in the lower-dimensional space are maximally correlated. CCA has found many applications in, for instance, statistical analysis [2], detection of neural activity in functional magnetic resonance imaging (fMRI) [59, 154], machine learning [73, 134] and genomic data analysis [158].


The optimal pair of linear transformations in CCA can be obtained by solving a generalized eigenvalue problem, which is computationally expensive for high-dimensional data. Moreover, due to the high dimensionality, the sample covariance matrices are singular, which makes it even more challenging to solve the associated generalized eigenvalue problem. Although regularized CCA [43, 73] was proposed to tackle the singularity of the sample covariance matrices, it is still not clear how to select its regularization parameters, either theoretically or practically. On the other hand, the CCA problem may have infinitely many solutions, so it is necessary to investigate the properties of these solutions and to characterize them appropriately. In this chapter, we make a new and systematic study of CCA. We first reveal the equivalence between the recursive formulation and the trace formulation of the multiple-projection CCA problem. Owing to this equivalence, we adopt the trace formulation as the criterion of CCA. We obtain an explicit characterization of all solutions of the multiple-projection CCA problem even when the sample covariance matrices are singular. CCA has been shown to be equivalent to LDA for the binary-class problem [74]. We extend this equivalence to the multi-class problem by establishing the relationship between uncorrelated linear discriminant analysis (ULDA) and CCA. The rest of this chapter is organized as follows. In Section 3.1, we give a brief review of the CCA problem. In Section 3.2, we characterize all solutions of the multiple-projection CCA problem. We establish the equivalence between uncorrelated linear discriminant analysis and CCA in Section 3.3. Finally, some concluding remarks are given in Section 3.4.

3.1

Background

This section contains a brief review of CCA, various formulations for CCA and existing approaches for solving it.


3.1.1


Various Formulae for CCA

Let x ∈ R^{d₁} and y ∈ R^{d₂} be two random variables with joint covariance matrix
\[
\Sigma := \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{xy}^T & \Sigma_{yy} \end{bmatrix},
\]
where Σ_xx ∈ R^{d₁×d₁} and Σ_yy ∈ R^{d₂×d₂} are the covariance matrices of x and y, respectively, and Σ_xy ∈ R^{d₁×d₂} is the cross-covariance matrix between x and y. In essence, CCA computes two projection vectors, w_x ∈ R^{d₁} and w_y ∈ R^{d₂}, such that the correlation between w_xᵀx and w_yᵀy, defined as
\[
\rho := \frac{w_x^T \Sigma_{xy} w_y}{\sqrt{w_x^T \Sigma_{xx} w_x}\,\sqrt{w_y^T \Sigma_{yy} w_y}},
\]
is maximized. Thus, the criterion used by CCA is the optimization problem
\[
\max_{w_x \ne 0,\, w_y \ne 0}\; \frac{w_x^T \Sigma_{xy} w_y}{\sqrt{w_x^T \Sigma_{xx} w_x}\,\sqrt{w_y^T \Sigma_{yy} w_y}}. \tag{3.1}
\]

Since the objective function in (3.1) is invariant to the scaling of w_x and w_y, (3.1) can be reformulated as
\[
\max_{w_x, w_y}\; w_x^T \Sigma_{xy} w_y \quad \text{s.t. } w_x^T \Sigma_{xx} w_x = 1,\; w_y^T \Sigma_{yy} w_y = 1.
\]
In practice, the covariance matrices Σ_xx, Σ_yy and Σ_xy are usually not available. Instead, sample covariance matrices are constructed from given sample data. Let {x_i}_{i=1}^n ⊂ R^{d₁} and {y_i}_{i=1}^n ⊂ R^{d₂} be n given observations of x and y, respectively, and denote X = [x₁ ⋯ x_n] ∈ R^{d₁×n} and Y = [y₁ ⋯ y_n] ∈ R^{d₂×n}. Without loss of generality, we assume that both {x_i} and {y_i} have zero mean, i.e., Σ_{i=1}^n x_i = 0 and Σ_{i=1}^n y_i = 0. Then the optimization problem for CCA based on the sample covariance matrices is
\[
\max_{w_x, w_y}\; w_x^T X Y^T w_y \quad \text{s.t. } w_x^T X X^T w_x = 1,\; w_y^T Y Y^T w_y = 1, \tag{3.2}
\]

T

Y wy = 1,

48

Chapter 3. Canonical Correlation Analysis or equivalently, min

wx ,wy

kX T wx − Y T wy k22

(3.3)

s.t. wxT XX T wx = 1, wyT Y Y T wy = 1. In [79], multiple-projection CCA problem was also studied, aiming to compute multiple pairs of projections recursively. Specifically, a sequence of {wxk } and {wyk } are computed as follows: (wxk , wyk ) := arg max wxT XY T wy wx ,wy

s.t. wxT XX T wx = 1, wyT Y Y T wy = 1, T

X wx ⊥ {X

T

wx1 , · · ·

,X

T

(3.4)

wxk−1 },

Y T wy ⊥ {Y T wy1 , · · · , Y T wyk−1 }, where k = 1, · · · , l, and l ≤ min{rank(X), rank(Y )}. The columns of X T wxk and Y T wyk in (3.4) are called the canonical variables or canonical vectors, and and

wyk , kY T wyk k2

wxk kX T wxk k2

are called canonical weights or canonical loadings. Canonical variables

are linear combinations of all variables in the original space. For multiple projections, we can prove that formulation (3.4) can be equivalently reformulated as follows: max

Wx ,Wy

Trace(WxT XY T Wy )

s.t. WxT XX T Wx = I, Wx ∈ Rd1 ×l ,

(3.5)

WyT Y Y T Wy = I, Wy ∈ Rd2 ×l , where h i Wx = wx1 · · · wxl ∈ Rd1 ×l ,

h i Wy = wy1 · · · wyl ∈ Rd2 ×l .

The proof is postponed to Section 3.2. Similarly, (3.5) can be reformulated as the following distance minimization formulation [73, 89]: min

Wx ,Wy

kX T Wx − Y T Wy k2F

s.t. WxT XX T Wx = I, Wx ∈ Rd1 ×l , WyT Y Y T Wy = I, Wy ∈ Rd2 ×l .

(3.6)

3.1 Background

49

An important advantage of this formulation is that it can be generalized to multiple sets of variables. Let {X (i) ∈ Rdi ×n }K i=1 be a set of sample matrices, where K is the number of different views. CCA for multiple sets of variables considers the following optimization problem: K P

min W (1) ,··· ,W (K)

k(W (i) )T X (i) − (W (j) )T X (j) k2F

i,j=1

(3.7)

s.t. (W (i) )T X (i) (X (i) )T W (i) = I, W (i) ∈ Rdi ×l , i = 1, · · · , K. Like CCA for two sets of variables, a solution of (3.7) can be found by solving a generalized eigenvalue problem. The formulation of the trace maximization problem in (3.5) has been studied in [16, 73, 134]. In this chapter, we focus on this optimization problem and analyze it thoroughly.

3.1.2

Existing Methods for CCA

Considering Karush–Kuhn–Tucker (KKT) conditions of the optimization problem (3.2), we find that the optimization problem reduces to the generalized eigenvalue problem 

0  Y XT

     T XY w XX 0 w   x = η    x . 0 wy 0 YYT wy T

Under the assumption that Y Y T is nonsingular, it is easy to show that the optimal solution wx of (3.2) can be obtained by solving the following generalized eigenvalue problem: XY T (Y Y T )−1 Y X T wx = η 2 XX T wx , η > 0,

(3.8)

where η 2 is the largest generalized eigenvalue. The optimal solution wy of (3.2) is given by 1 wy = (Y Y T )−1 Y X T wx . η

(3.9)

It has been shown in [73] that we can choose the associated eigenvectors corresponding to the largest l eigenvalues of the generalized eigenvalue problem in (3.8)

50

Chapter 3. Canonical Correlation Analysis as multiple projection vectors {wxk }, and then using (3.9) to find the corresponding {wyk }. Following this idea, Sun et al. [134] developed a way of computing multiple projections Wx and Wy via the eigenvalue decomposition. Let 1

H = Y T (Y Y T )− 2 ∈ Rn×d2 , CXX = XX T ∈ Rd1 ×d1 , CHH = XHH T X T ∈ Rd1 ×d1 . It has been shown that optimal solution Wx to (3.5) can be expressed as the associ† ated eigenvectors corresponding to the largest l eigenvalues of the matrix CXX CHH , † where CXX denotes the pseudo-inverse [65] of CXX . It was also shown in [73] that

the projection matrix Wx can be computed by solving the following optimization problem max Trace(WxT XY T (Y Y T )−1 Y X T Wx ) Wx

s.t. WxT XX T Wx = I, Wx ∈ Rd1 ×l . Consequently, Wx consists of the associated eigenvectors corresponding to the largest † l eigenvalues of the matrix CXX CHH .

One disadvantage of the above approach for computing (Wx , Wy ) is the restriction that Y Y T must be nonsingular. However, in many practical applications [7, 61], the data points are from a very high-dimensional space and thus usually the number of the data samples is much smaller than the data dimension, i.e., d2  n and d1  n. Therefore, both Y Y T and XX T are singular. This is known as the undersampled problem [61] and it is also commonly called a high dimensional small sample size problem. In order to apply the above approach to CCA problems with singular sample covariance matrices, regularized CCA [73, 134] has been proposed to solve the following generalized eigenvalue problem: XY T (Y Y T + γy I)−1 Y X T wx = η 2 (XX T + γx I)wx , where γx > 0 and γy > 0 are regularization parameters. Thus, it is necessary to choose suitable values of parameters γx and γy for effective data analysis. However, appropriate regularization parameters γx and γy are difficult to determine. We note that when γx and γy are large then we will lose information in XX T and Y Y T ; when

3.2 General Solutions of CCA

51

they are too small, the regularization may not be sufficiently effective. Therefore the main task of regularized CCA is to choose appropriate regularization parameters γx and γy . But, up to now it is still not clear how to select regularization parameters theoretically and practically in regularized CCA.

3.2

General Solutions of CCA

In this section, we focus on the trace formulation of CCA in (3.5) to compute multiple projections Wx and Wy simultaneously. We also characterize all solutions of (3.5) without the restriction that Y Y T or XX T is nonsingular. We first define r = rank(X),

s = rank(Y ),

m = rank(XY T ),

t = min{r, s}.

Let the reduced singular value decompositions of X and Y be     h i Σ1 Σ1 X = U   QT1 = U1 U2   QT1 = U1 Σ1 QT1 , 0 0 and

    h i Σ2 Σ2 Y = V   QT2 = V1 V2   QT2 = V1 Σ2 QT2 , 0 0

(3.10)

(3.11)

respectively, where U ∈ Rd1 ×d1 , U1 ∈ Rd1 ×r , U2 ∈ Rd1 ×(d1 −r) , Σ1 ∈ Rr×r , Q1 ∈ Rn×r , V ∈ Rd2 ×d2 , V1 ∈ Rd2 ×s , V2 ∈ Rd2 ×(d2 −s) , Σ2 ∈ Rs×s , Q2 ∈ Rn×s , U and V are orthogonal, Σ1 and Σ2 are nonsingular and diagonal, Q1 and Q2 are column orthogonal. It follows from the two orthogonality constraints in (3.5) that l ≤ min{rank(X), rank(Y )} = min{r, s} = t.

(3.12)

Denote singular values of QT1 Q2 ∈ Rr×s as η1 ≥ η2 ≥ · · · ≥ ηt ≥ 0, and assume that all distinctive non-zero entries of singular values η1 , · · · , ηt are σ1 > σ2 > · · · > σq > 0, with multiplicity for those q singular values being m1 , m2 , · · · , mq , respectively.

52

Chapter 3. Canonical Correlation Analysis Then m=

q X

mi = rank(QT1 Q2 ) ≤ min{r, s} = t.

i=1

Next, let QT1 Q2 = P1 ΣP2T

(3.13)

be the singular value decomposition of QT1 Q2 , where P1 ∈ Rr×r and P2 ∈ Rs×s are orthogonal, Σ ∈ Rr×s , and



σ1 Im1

   Σ=   

... σq Imq

    0  η1        ..  , if r ≤ s,  . .   . .               ηt 0       =  η1         ...         0   , if r > s.          η t          0 ··· 0

(3.14)

Now we have XY T

 Σ1 P1 =U

 I



Σ 0 0 0



 ΣP  (V  2 2

 I

)T .

Let     (1) T f W P Σ fx =  x  =  1 1  U T Wx , W (2) fx I W

fx(1) ∈ Rr×l , W fx(2) ∈ R(d1 −r)×l , W

(3.15)

f (1) ∈ Rs×l , W f (2) ∈ R(d2 −s)×l , W y y

(3.16)

and     (1) T f P Σ W fy =  y  =  2 2  V T Wy , W fy(2) I W

then a direct calculation yields that optimization problem (3.5) is equivalent to   fx(1) )T ΣW fy(1) max Trace (W fx(1) , W fy(1) W

fx(1) )T W fx(1) = I, W fx(1) ∈ Rr×l , s.t. (W (1)

(1)

fy )T W fy (W

fy(1) ∈ Rs×l . = I, W

(3.17)

3.2 General Solutions of CCA

53

fx(1) , W fy(1) ) Hence, (Wx , Wy ) is a solution of optimization problem (3.5) if and only if (W is a solution of optimization problem (3.17), which means the general solutions (Wx , Wy ) of optimization problem (3.5) can be obtained via the general solutions (1)

(1)

fx , W fy ) of optimization problem (3.17). (W

3.2.1

Some Supporting Lemmas

Before we present the general solutions of (3.5), we need some supporting lemmas. Lemma 3.1. [78] Let A ∈ Rν×τ have singular values σ1 (A) ≥ σ2 (A) ≥ · · · ≥ σJ (A) ≥ 0, where J = min{ν, τ }. For each k = 1, · · · , J we have k P

σi (A) = max Trace(X T AY ) X, Y

i=1

ν×k

s.t. X ∈ R

(3.18) τ ×k

, Y ∈R

T

T

, X X = Y Y = I.

We have from Lemma 3.1 that for any l ≤ t = min{r, s}, l P

ηi =

i=1

max fx(1) ,W fy(1) W



fx(1) )T ΣW fy(1) Trace (W



fx(1) )T W fx(1) = I, W fx(1) ∈ Rr×l , s.t. (W (1)

(1)

fy fy )T W (W

(3.19)

fy(1) ∈ Rs×l . = I, W

Since optimization problem (3.5) is equivalent to (3.17), we conclude that l P i=1

ηi = max

Wx ,Wy

Trace(WxT XY T Wy )

s.t. WxT XX T Wx = I, Wx ∈ Rd1 ×l ,

(3.20)

WyT Y Y T Wy = I, Wy ∈ Rd2 ×l . Lemma 3.2. [78] Let A ∈ Rν×τ , V ∈ Rν×k and W ∈ Rτ ×k be given, where k ≤ min{ν, τ } and V, W are column orthogonal. Then σi (V T AW) ≤ σi (A), i = 1, · · · , k, where σi (V T AW) and σi (A) are the i-th largest singular values of V T AW and A, respectively.

54

Chapter 3. Canonical Correlation Analysis fx(1) , W fy(1) ) with W fx(1) ∈ Rr×l , W fy(1) ∈ Rs×l and l ≤ min{r, s} is a Lemma 3.3. (W solution of optimization problem (3.17) if and only if f (1) )T W f (1) = (W f (1) )T W f (1) = I, (W x x y y fx(1) )T ΣW fy(1) is symmetric positive semi-definite with eigenvalues and C := (W η1 , η2 , · · · , ηl . Proof. From (3.19) we can see that the sufficiency is trivial, so we omit it and focus on the necessity. Necessity: When l = 1, the necessity is obvious, thus in the following we only consider the case l ≥ 2. fx(1) , W fy(1) ) is a solution of optimization problem (3.17), then the conSuppose (W straints must be satisfied, that is, f (1) )T W f (1) = I, (W x x

f (1) )T W f (1) = I, (W y y

and by (3.19), we have Trace(C) =

l X

ηi .

(3.21)

i=1

If C = (ci,j ) is not symmetric, without loss of generality, we assume c1,2 6= c2,1 . Set

  cosθ −sinθ , Rθ =  sinθ cosθ

 Rθ Cθ = C 

 I

.

Then Trace(Cθ ) = Trace(C) + (c1,1 + c2,2 )cosθ + (c1,2 − c2,1 )sinθ − c1,1 − c2,2 , and dTrace(Cθ ) = c1,2 − c2,1 6= 0, Trace(C) = Trace(Cθ )|θ=0 . dθ θ=0 Thus, Trace(Cθ ) > Trace(C) for a sufficient small θ (negative or positive depending on the sign of (c1,2 − c2,1 )), equivalently,    R fx(1) )T ΣW fy(1)  θ  > Trace((W fx(1) )T ΣW fy(1) ), Trace (W I

3.2 General Solutions of CCA

55

which together with equalities T



fx(1) fx(1) )T W (W

 R f (1)  θ = W y

I

 R f (1)  θ  (W y

 I

) = I,

fx(1) , W fy(1) ) is a solution of optimization problem (3.17). contradicts the fact that (W Hence, C is symmetric. Let λ1 (C) ≥ λ2 (C) ≥ · · · ≥ λl (C) be eigenvalues of C, then all the eigenvalues must be nonnegative. Otherwise, suppose λl (C) < 0 and let the eigenvalue decomposition of C be  λ (C)  1  .. C =Z .  where Z ∈ Rl×l is orthogonal, then   fx(1) Z)T Σ(W fy(1) Z  Trace (W

I

   T Z ,  λl (C)

 ) = −λl (C) + −1 >

l X

(3.22)

l−1 X

λi (C)

i=1

λi (C) = Trace(C),

i=1

which together with equalities 

 I f (1) Z)T (W f (1) Z) = W f (1) Z  (W x x y

T

  I f (1) Z   (W ) = I, y −1 −1

is a contradiction with the fact that Trace(C) is the maximum objective function value of optimization problem (3.17). Hence, λ1 (C) ≥ · · · ≥ λl (C) ≥ 0 and thus C is symmetric positive semi-definite. Since C is symmetric positive semi-definite, λ1 (C), · · · , λl (C) are also the singular values of C. By Lemma 3.2 and the definition of C, we have the following relationships λi (C) ≤ ηi ,

i = 1, 2, · · · , l.

(3.23)

56

Chapter 3. Canonical Correlation Analysis On the other hand, l X

ηi = Trace(C) =

l X

i=1

λi (C) ≤

i=1

l X

ηi ,

i=1

which together with Equation (3.23) implies λi (C) = ηi ,

i = 1, 2, · · · , l.

In summary, we have proven that C is a symmetric positive semi-definite matrix with eigenvalues {η1 , η2 , · · · , ηl }. Lemma 3.4. Let l satisfy 1 ≤ l ≤ t = min{r, s} and (wxk , wyk ) (k = 1, · · · , l) be a solution of the kth optimization problem (3.4), then ηk = (wxk )T XY T wyk , ∀ k = 1, · · · , l. Proof. If we let i h 1 l = U Σ−1 P (:, 1 : l), w˜x · · · w˜x 1 1 1

i h 1 l = V Σ−1 P (:, 1 : l), w˜y · · · w˜y 1 2 2

then (w˜x1 )T XY T w˜y1 = η1 ,

(w˜x2 )T XY T w˜y2 = η2 , · · · , (w˜xl )T XY T w˜yl = ηl ,

(w˜x1 )T XX T w˜x1 = 1,

(w˜xν )T XX T w˜xν = 1,

X T w˜xν ⊥ {X T w˜x1 , · · · , X T w˜xν−1 },

(w˜y1 )T Y Y T w˜y1 = 1,

(w˜yν )T Y Y T w˜yν = 1,

Y T w˜yν ⊥ {Y T w˜y1 , · · · , Y T w˜yν−1 },

ν = 2, · · · , l. Therefore, we have (wxk )T XY T wyk ≥ ηk ,

∀ k = 1, · · · , l.

Next, we prove by induction that the equality holds in (3.24). Since (wx1 , wy1 ) = arg max{wxT XY T wy : wxT XX T wx = 1, wyT Y Y T wy = 1}, it follows from (3.20) that (wx1 )T XY T wy1 = η1 .

(3.24)

3.2 General Solutions of CCA

57

In the following, we assume 1 ≤ k < l and (wxi )T XY T wyi = ηi , ∀ 1 ≤ i ≤ k, we want to show (wxk+1 )T XY T wyk+1 = ηk+1 . In fact, for any (wx , wy ) satisfying wxT XX T wx = 1, X T wx ⊥ {X T wx1 , · · · , X T wxk }, wyT Y Y T wy = 1,

Y T wy ⊥ {Y T wy1 , · · · , Y T wyk },

the matrix pair h i cx = w1 · · · wk w , W x x x

h i cy = w1 · · · wk w , W y y y

satisfies cx )T XX T W cx = (W cy )T Y Y T W cy = I, (W which implies k+1 X

c T XY T W cy ) = ηi ≥ Trace(W x

i=1

k X

ηi + wxT XY T wy ,

i=1

and so wxT XY T wy ≤ ηk+1 , which further yields that (wxk+1 )T XY T wyk+1 ≤ ηk+1 . Combining the inequality above and equation (3.24), we can get (wxk+1 )T XY T wyk+1 = ηk+1 . Hence, we conclude that (wxk )T XY T wyk = ηk , k = 1, · · · , l.

3.2.2

Main Results

In the following theorem, we reveal the equivalent relationship between optimization problems in (3.4) and (3.5).

58

Chapter 3. Canonical Correlation Analysis Theorem 3.5. Assume l satisfies 1 ≤ l ≤ t = min{r, s}. (i) Let (wxk , wyk ) (k = 1, · · · , l) be a solution of the k-th optimization problem in (3.4), then (Wx , Wy ) with h i 1 l Wx = wx · · · wx ,

h i 1 l Wy = wy · · · wy ,

is a solution of optimization problem (3.5). (ii) Let (Wx , Wy ) be a solution of optimization problem in (3.5), then there exists an orthogonal matrix Z ∈ Rl×l such that (wˆxk , wˆyk ) (k = 1, · · · , l) is solution of the kth optimization problem in (3.4), where i h 1 l = W Z, wˆx · · · wˆx x

h

wˆy1

···

wˆyl

i

= Wy Z.

Proof. (i) Let (wxk , wyk ) (k = 1, · · · , l) be a solution of the kth optimization problem (3.4), it is easy to verify that Wx and Wy defined above satisfy the orthogonality constraints in (3.5). Thus, in order to prove that (Wx , Wy ) is a solution of optimization problem (3.5) we only need to show Trace(WxT XY T Wy )

=

l X

ηi ,

i=1

which follows directly from Lemma 3.4. (ii) Let (Wx , Wy ) be a solution of optimization problem (3.5), it follows from transformations (3.15)-(3.16) and Lemma 3.3 that f (1) )T ΣW f (1) = W T XY T Wy C := (W x y x is symmetric positive definite with eigenvalues  η  1  ... C =Z 

{η1 , · · · , ηl }. Let 

ηl

  T Z 

be the eigenvalue decomposition of C, then  η  1  .. (Wx Z)T XY T (Wy Z) =  . 



ηl

  , 

3.2 General Solutions of CCA

59

which indicates that (wˆxk )T XY T wˆyk = ηk for k = 1, · · · , l. Obviously, (wˆxk , wˆyk ) (k = 1, · · · , l) satisfies constraints of the kth optimization problem (3.4). Hence (wˆxk , wˆyk ) (k = 1, · · · , l) is a solution of the kth optimization problem (3.4). Theorem 3.5 shows that optimization problems in (3.4) and (3.5) are equivalent in the sense that the only difference between a solution of optimization problem in (3.4) and that of optimization problem in (3.5) lies in an orthogonal matrix. This provides theoretical support for the adoption of trace formulation (3.5) as the criterion of CCA in this thesis. The rest of this section is dedicated to the derivation of explicit characterization of all solutions of CCA. The Lagrangian function of optimization problem (3.17) is   f (1) , W f (1) , Λx , Λy ) =Trace (W f (1) )T ΣW f (1) − 1 hΛx , (W f (1) )T W f (1) − Ii L(W x y x y x x 2 1 f (1) )T W f (1) − Ii, − hΛy , (W (3.25) y y 2 fx(1) , W fy(1) ) is a solution of where Λx , Λy ∈ Rl×l are Lagrangian multipliers. If (W optimization problem (3.17), the corresponding KKT conditions should be fx(1) − W fy(1) Λy = 0, ΣT W

fy(1) − W fx(1) Λx = 0, ΣW

and from Lemma 3.3 it is easy to prove  η  1  .. Λx = Λy = C = ZΘZ T , with Θ :=  . 



ηl

  , 

(3.26)

for some orthogonal Z, since C is symmetric positive semi-definite as proved in Lemma 3.2. Therefore, the KKT conditions can be reformulated as f (1) Z) = (W f (1) Z)Θ, ΣT (W x y

f (1) Z) = (W f (1) Z)Θ. Σ(W y x

(3.27)

When Θ is nonsingular, we get an equivalent form of the two equalities above fx(1) Z) = (W fx(1) Z)Θ2 , ΣΣT (W

fy(1) Z) = ΣT (W f (1) Z)Θ−1 . (W x

(3.28)

60

Chapter 3. Canonical Correlation Analysis The KKT conditions (3.27) and (3.28) will be used in the proof of Theorem 3.6, where we characterize all solutions of optimization problem (3.5). k P

Theorem 3.6. (i) If l =

mi for some k satisfying 1 ≤ k ≤ q, then (Wx , Wy ) is

i=1

a solution of optimization problem in (3.5) if and only if Wx = U1 Σ−1 1 P1 (:, 1 : l)W + U2 E,

(3.29a)

Wy = V1 Σ−1 2 P2 (:, 1 : l)W + V2 F,

(3.29b)

where Wx ∈ Rd1 ×l and Wy ∈ Rd2 ×l , W ∈ Rl×l is orthogonal, E ∈ R(d1 −r)×l and F ∈ R(d2 −s)×l are arbitrary. k k+1 P P mi < l < mi for some k satisfying 0 ≤ k < q, then (Wx , Wy ) is a (ii) If i=1

i=1

solution of optimization problem in (3.5) if and only if h i Wx =U1 Σ−1 W + U2 E, P (:, 1 : α ) P (:, α + 1 : α )G 1 1 k 1 k k+1 h i Wy =V1 Σ−1 P2 (:, 1 : αk ) P2 (:, αk + 1 : αk+1 )G W + V2 F, 2 where Wx ∈ Rd1 ×l and Wy ∈ Rd2 ×l , αk :=

k P

(3.30a) (3.30b)

mi for k = 1, · · · , q, W ∈ Rl×l is

i=1

orthogonal, G ∈ Rmk+1 ×(l−αk ) is column orthogonal, E ∈ R(d1 −r)×l and F ∈ R(d2 −s)×l are arbitrary. (iii) If m < l ≤ min{r, s}, then (Wx , Wy ) is a solution of optimization problem in (3.5) if and only if h i Wx =U1 Σ−1 P (:, 1 : m) P (:, m + 1 : r)G 1 1 1 1 W + U2 E, h i Wy =V1 Σ−1 P (:, 1 : m) P (:, m + 1 : s)G 2 2 2 2 W + V2 F,

(3.31a) (3.31b)

where Wx ∈ Rd1 ×l and Wy ∈ Rd2 ×l , W ∈ Rl×l is orthogonal, G1 ∈ R(r−m)×(l−m) and G2 ∈ R(s−m)×(l−m) are column orthogonal, E ∈ R(d1 −r)×l and F ∈ R(d2 −s)×l are arbitrary. Remark 3.7. In Theorem 3.6 and the rest of this thesis, we use A(:, i : j) to denote a submatrix consisting of the ith to the jth column of A, which is the way of representing a submatrix in MATLAB.

3.2 General Solutions of CCA

61

Proof. Necessity: To prove the necessity, we can substitute the formulae for Wx and Wy from each case into optimization problem (3.5). For each of the above three cases, a direct calculation verifies that WxT XX T Wx

= Il ,

WyT Y

T

Y Wy = Il ,

Trace(WxT XY T Wy )

=

l X

ηi .

i=1

Hence, all (Wx , Wy ) in (3.29)-(3.31) are solutions of optimization problem (3.5). Sufficiency: Suppose (Wx , Wy ) is a solution of optimization problem (3.5), then (1)

(1)

fx , W fy ) is a solution of optimization problem (3.17). we will derive the general (W fx(1) , W fy(1) ) in three different formulas for (Wx , Wy ) via the general formulas of (W cases. (i). If l =

k P

mi for some k satisfying 1 ≤ k ≤ q, then Θ is nonsingular and

i=1

equations in (3.28) hold. Moreover, Θ has the following form   σI  1 m1    .. Θ= . .   σk Imk fx(1) Z and W fy(1) Z into Partitioning W   (1,1) f W  x   fx(1) Z = W fx(1,2)  W ,   (1,3) f Wx

  (1,1) f W fy(1) Z =  y  , W fy(1,2) W

fx(1,1) , W fy(1,1) ∈ Rl×l , W fx(1,2) ∈ R(m−l)×l , W fx(1,3) ∈ R(r−m)×l , and W fy(1,2) ∈ where W R(s−l)×l , the first equation in (3.28) is reduced to   2 f (1,1) Θ Wx       2  fx(1,1)  σk+1 Imk +1  W        f (1,2)   f (1,2)  2 .. W = .   x  Wx  Θ ,        (1,3) 2 f   σq Imq Wx   0

62

Chapter 3. Canonical Correlation Analysis which yields fx(13) = 0, W

fx(12) = 0, W

fx(11) Θ2 . fx(11) = W Θ2 W

fx(1) Z is column orthogonal, W fx(1,2) = 0 and W fx(1,3) = 0, it follows that W fx(1,1) Since W (1,1)

fx is an orthogonal matrix. Consequently, Θ2 W  L  1  .. (1,1) fx = W . 

fx(1,1) Θ2 implies that =W 

Lk

  , 

fx(1) Z into the second where Li ∈ Rmi ×mi is orthogonal, i = 1, · · · , k. Substituting W equation in (3.28), we obtain f (1,2) = 0, W y

f (1,1) = W f (1,1) , W y x

and thus, fx(1) W

  (1,1) f Wx  ZT , = 0

fy(1) W

  (1,1) f Wx  ZT , = 0

which, together with (3.15) and (3.16), implies   W = U Σ−1 P (:, 1 : l)W fx(1,1) Z T + U2 W fx(2) , x 1 1 1  W = V Σ−1 P (:, 1 : l)W fx(1,1) Z T + V W fy(2) . y

1

2

2

2

This is exactly the form of equation (3.29) with f (1,1) Z T , W=W x

f (2) , E =W x

f (2) , F =W y

fx(1,1) Z T ∈ Rl×l is orthogonal. since W k k+1 P P mi < l < mi for some k satisfying 0 ≤ k < q, then Θ is still (ii). If i=1

i=1

nonsingular and equations in (3.28) still hold. But the explicit form of Θ becomes     σI  1 m1  Θ1,1 0   . .   Θ= , with Θ1,1 =  . .   0 σk+1 I σk Imk

3.2 General Solutions of CCA

63

For simplicity we denote αk :=

k X

mi ,

k = 1, · · · , q,

i=1

fx(1) Z and W fy(1) Z into and partition W   (1,1) f W  x   fx(1) Z = W fx(1,2)  W ,   (1,3) f Wx

  (1,1) f W fy(1) Z =  y  , W fy(1,2) W

fx(1,1) , W fy(1,1) ∈ R(αk+1 )×l , W fx(1,2) ∈ R(m−αk+1 )×l , W fx(1,3) ∈ R(r−m)×l , and where W fy(1,2) ∈ R(s−αk+1 )×l , then the first equation in (3.28) becomes W     2 Θ fx(1,1)    11 W    2    σk+1 Imk+1 (1,1)    f  Wx   2   (1,2)  2  σk+2 Imk+2    fx  Θ , W   f (1,2)  =  ...   Wx   (1,3)     fx W   2   σq Imq   0 which further results in fx(1,2) = 0, W and

 Θ2  1,1

f (1,3) = 0, W x

 2 σk+1 I

f (1,1) = W f (1,1) Θ2 . W x x

(3.32)

fx(1,1) be partitioned as Let W fx(1,1) W

  M1,1 M1,2 , = M2,1 M2,2

where M1,1 ∈ Rαk ×αk , M1,2 ∈ Rαk ×(l−αk ) , M2,1 ∈ Rmk+1 ×αk , M2,2 ∈ Rmk+1 ×(l−αk ) .

64

Chapter 3. Canonical Correlation Analysis Then (3.32) implies that M1,2 = 0, M2,1 = 0, Θ21,1 M1,1 = M1,1 Θ21,1 . fx(1) Z is column orthogonal, W fx(1,2) = 0, W fx(1,3) = 0, M1,2 = 0 and Moreover, since W M2,1 = 0, it follows that M1,1 is orthogonal and M2,2 is column orthogonal, and M1,1 must be of the form

M1,1

 L  1  .. = . 



Lk

  , 

where Li ∈ Rmi ×mi is orthogonal, i = 1, · · · , k. fx(1) Z into the second equation in (3.28), we get W fy(1,2) = 0 and Substituting W fy(1,1) = W fx(1,1) . Therefore, W   M1,1 0    T fx(1) =  W  0 M2,2  Z ,   0 0

fy(1) W

  M1,1 0     T = 0 M2,2  Z ,   0 0

which, together with (3.15) and (3.16), result in    h i  M1,1   Wx = U1 Σ−1  P1 (:, 1 : αk ) P1 (:, αk + 1 : αk+1 )M2,2  1   0   h i M1,1   −1   W = V Σ  P (:, 1 : α ) P (:, α + 1 : α )M y 1 2 2 k 2 k k+1 2,2   0

0

 fx(2) ,  Z T + U2 W

I  0 fy(2) .  Z T + V2 W I

This is exactly the form of equation (3.30) with   M1,1 0 fx(2) , F = W fy(2) .  ZT , E = W G = M2,2 , W =  0 I (iii). If m < l ≤ min{r, s}, then Θ is singular, and equations in (3.28) do not hold but equations in (3.27) hold. Now Θ is of the form    σI  1 m1 Θ1,1 0  ..  , with Θ1,1 =  Θ= .  0 0



σq Imq

  . 

3.2 General Solutions of CCA

65

fx(1) Z and W fy(1) Z into Partition W h i fx(1) Z = M M , W 1 2

h i fx(1) Z = N N , W 1 2

where M1 ∈ Rr×m , M2 ∈ Rr×(l−m) , N1 ∈ Rs×m , and N2 ∈ Rs×(l−m) , we obtain from KKT conditions (3.27) that the following equations hold ΣT M1 = N1 Θ1,1 ,

ΣN1 = M1 Θ1,1 ,

i.e., ΣΣT M1 = M1 Θ21,1 ,

N1 = ΣT M1 Θ−1 1,1 .

fx(1) Z and W fy(1) Z are column orthogonal, so M1 and N1 are also column Note that W orthogonal. As a result, similar to the derivation in Case (i), we have     W1 W1 M1 =   , N1 =   , 0 0 where W1 ∈ Rm×m is orthogonal. Let   M1,2 , M2 =  M2,2

  N1,2 , N2 =  N2,2

where M1,2 , N1,2 ∈ Rm×(l−m) , M2,2 ∈ R(r−m)×(l−m) , N2,2 ∈ R(s−m)×(l−m) . Then M1,2 = 0, N1,2 = 0, M2,2 and N2,2 are column orthogonal. Summarizing the above analysis, we get     W 0 W 0 fx(1) =  1 fy(1) =  1  ZT , W  ZT , W 0 M2,2 0 N2,2 which implies    h i  W1   Wx = U1 Σ−1  P1 (:, 1 : m) P1 (:, m + 1 : r)M2,2  1   0   h i W1   −1   W = V Σ  P (:, 1 : m) P (:, m + 1 : s)N y 1 2 2 2 2,2   0

0

0 I

 fx(2) ,  Z T + U2 W

I 

fy(2) .  Z T + V2 W

66

Chapter 3. Canonical Correlation Analysis Since W1 and Z T are orthogonal, M2,2 and N2,2 are column orthogonal, it follows that the above formula is exactly of the form of (3.31) with   W1 0 fx(2) ,  ZT , E = W G1 = M2,2 , G2 = N2,2 , W =  0 I

f (2) . F =W y

As can be seen from the proof, the three cases in Theorem 3.6 are based on the multiplicity of singular values of Σ in (3.14). In fact, if we let G = [Il−αk 0]T in case (ii), G1 = [Il−m 0]T and G2 = [Il−m 0]T in case (iii), then the formulae for (Wx , Wy ) in (3.30) and (3.31) reduce to a simpler form in (3.29). Another important observation of Theorem 3.6 is that the component of Wx (or Wy ) lying in the null space of X T (or Y T ) is arbitrary. This is consistent with the fact that the component of Wx (or Wy ) lying in the null space of X T (or Y T ) does not contribute to the canonical correlations between X and Y . Theorem 3.6 will be used to establish relationship between CCA and LDA in the next section and develop a new sparse CCA algorithm in Chapter 4.

3.3

Equivalent relationship between CCA and LDA

One popular application of CCA is dimensionality reduction for multi-class data where one set of variables is derived from the sample data and the other set of variables is derived from the class information. In this section, we study the inherent relationship between CCA and Linear discriminant analysis (LDA) introduced in Chapter 2. Recently, a CCA-based probabilistic interpretation of LDA is given in [5]. Such a probabilistic interpretation deepens the understanding of CCA and LDA as modelbased methods, enables the use of local CCA models as components of a larger probabilistic model, and suggests generalizations to members of the exponential family other than the Gaussian distribution. CCA has been shown to be equivalent

3.3 Equivalent relationship between CCA and LDA

67

to LDA for the binary-class problem in [74] and the multi-class problem in [5]. We establish the equivalent relationship between the ULDA problem (2.3) (cf. Section 2.1) and the CCA problem.

Suppose we are given a data matrix $A \in \mathbb{R}^{d\times n}$ from $K$ different classes with $n_i$ data points in the $i$th class. We define the vectors $\{c^{(i)}\}_{i=1}^K$, $c$, $\{e_i\}_{i=1}^K$, $e$ and the scatter matrices $S_b$, $S_w$ and $S_t$ in the same way as in Section 2.1. Let
$$
X := [x_1, \cdots, x_n] = A - ce^T \in \mathbb{R}^{d\times n} \tag{3.33}
$$
and
$$
Y := [y_1, \cdots, y_n] =
\begin{bmatrix} \frac{1}{\sqrt{n_1}} e_{n_1}^T & & \\ & \ddots & \\ & & \frac{1}{\sqrt{n_K}} e_{n_K}^T \end{bmatrix}
- \frac{1}{n}\begin{bmatrix} \sqrt{n_1} \\ \vdots \\ \sqrt{n_K} \end{bmatrix} e^T \in \mathbb{R}^{K\times n} \tag{3.34}
$$
be two different views of the same set of data; then $\{x_i\}_{i=1}^n$ and $\{y_i\}_{i=1}^n$ are centered. Since
$$
XY^T = \begin{bmatrix} \sqrt{n_1}\,(c^{(1)} - c) & \cdots & \sqrt{n_K}\,(c^{(K)} - c) \end{bmatrix},
$$
it follows that the between-class scatter matrix $S_b$, the within-class scatter matrix $S_w$ and the total scatter matrix $S_t$ can be written as
$$
S_b = \frac{1}{n} XY^TYX^T, \qquad S_w = \frac{1}{n} X(I - Y^TY)X^T, \qquad S_t = \frac{1}{n} XX^T.
$$
Moreover, it is easy to check that
$$
m = \mathrm{rank}(XY^T) = \mathrm{rank}(S_b), \qquad
YY^T = I - \frac{1}{n}\begin{bmatrix} \sqrt{n_1} \\ \vdots \\ \sqrt{n_K} \end{bmatrix}\begin{bmatrix} \sqrt{n_1} \\ \vdots \\ \sqrt{n_K} \end{bmatrix}^T,
$$
and
$$
\mathrm{rank}(Y) = K-1, \qquad \sigma_1(Y) = \cdots = \sigma_{K-1}(Y) = 1.
$$
Obviously, $YY^T$ is singular, so the existing methods for computing CCA solutions described in Section 3.1 are inapplicable.
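For readers who want to experiment with this construction, the following is a minimal NumPy sketch (not part of the thesis) that builds the two views $X$ and $Y$ of (3.33) and (3.34) from a data matrix and a label vector. The function and variable names are ours, and unlike the block form in (3.34) the sketch does not assume that the samples are ordered by class.

```python
import numpy as np

def build_cca_views(A, labels):
    """Build the two views X (3.33) and Y (3.34) from data A (d x n) and class labels."""
    d, n = A.shape
    classes, class_idx = np.unique(labels, return_inverse=True)
    K = len(classes)
    counts = np.bincount(class_idx)            # n_1, ..., n_K

    c = A.mean(axis=1, keepdims=True)          # global centroid
    X = A - c                                  # X = A - c e^T, columns are centered

    # Indicator part: row i has 1/sqrt(n_i) on the columns belonging to class i
    Y = np.zeros((K, n))
    Y[class_idx, np.arange(n)] = 1.0 / np.sqrt(counts[class_idx])
    # Subtract (1/n) [sqrt(n_1), ..., sqrt(n_K)]^T e^T so that the rows of Y are centered
    Y -= np.sqrt(counts)[:, None] / n
    return X, Y

# Tiny example: X Y^T then has rank at most K - 1, matching rank(S_b)
A = np.random.randn(5, 12)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
X, Y = build_cca_views(A, labels)
print(np.linalg.matrix_rank(X @ Y.T))  # <= K - 1
```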

Let all the matrix factorizations (3.10)-(3.14) in Section 3.2 be determined. We have $\Sigma_2 = I_{K-1}$ in (3.10), and
$$
S_b = \frac{1}{n} U \begin{bmatrix} \Sigma_1 Q_1^T Q_2 \Sigma_2^2 Q_2^T Q_1 \Sigma_1 & 0 \\ 0 & 0 \end{bmatrix} U^T
    = \frac{1}{n} U \begin{bmatrix} (\Sigma_1 P_1)\Sigma\Sigma^T(\Sigma_1 P_1)^T & 0 \\ 0 & 0 \end{bmatrix} U^T
    = \frac{1}{n} \begin{bmatrix} U_1\Sigma_1 P_1 & U_2 \end{bmatrix}
      \begin{bmatrix} \Theta^2 & \\ & 0_{(d-m)} \end{bmatrix}
      \begin{bmatrix} U_1\Sigma_1 P_1 & U_2 \end{bmatrix}^T, \tag{3.35}
$$
where
$$
\Theta = \begin{bmatrix} \eta_1 & & \\ & \ddots & \\ & & \eta_m \end{bmatrix}.
$$
Similarly, we have
$$
S_t = \frac{1}{n} \begin{bmatrix} U_1\Sigma_1 P_1 & U_2 \end{bmatrix}
      \begin{bmatrix} I_r & \\ & 0_{(d-r)} \end{bmatrix}
      \begin{bmatrix} U_1\Sigma_1 P_1 & U_2 \end{bmatrix}^T, \tag{3.36}
$$
and $\mathrm{rank}(S_t) = r$, $\mathrm{rank}(\Sigma) = \mathrm{rank}(S_b) = m$. Now we can parameterize a general solution of the optimization problem (2.3) in the following theorem, whose proof is very similar to that of Theorem 2.2.

Theorem 3.8. Suppose the matrix factorizations (3.10)-(3.14) are determined. Then $G \in \mathbb{R}^{d\times l}$ is a solution of the optimization problem (2.3) if and only if $m \le l \le r$ and
$$
G = U_1 \Sigma_1^{-1} \begin{bmatrix} P_1(:,1:m) & P_1(:,m+1:r)G_1 \end{bmatrix} W + U_2 E, \tag{3.37}
$$
where $G_1 \in \mathbb{R}^{(r-m)\times(l-m)}$ is column orthogonal, $W \in \mathbb{R}^{l\times l}$ is orthogonal, and $E \in \mathbb{R}^{(d-r)\times l}$ is arbitrary.

Proof. For any $G \in \mathbb{R}^{d\times l}$, let
$$
\widetilde G = \begin{bmatrix} \widetilde G_1 \\ \widetilde G_2 \\ \widetilde G_3 \end{bmatrix}
:= \begin{bmatrix} P_1^T\Sigma_1 & \\ & I \end{bmatrix} U^T G,
$$
where $\widetilde G_1 \in \mathbb{R}^{m\times l}$, $\widetilde G_2 \in \mathbb{R}^{(r-m)\times l}$ and $\widetilde G_3 \in \mathbb{R}^{(d-r)\times l}$. From equations (3.35) and (3.36), we have
$$
G^T S_b G = \widetilde G_1^T \Theta^2 \widetilde G_1, \qquad
G^T S_t G = \begin{bmatrix} \widetilde G_1 \\ \widetilde G_2 \end{bmatrix}^T \begin{bmatrix} \widetilde G_1 \\ \widetilde G_2 \end{bmatrix}.
$$
Thus, the optimization problem (2.3) is equivalent to
$$
\max \ \mathrm{Trace}(\widetilde G_1^T \Theta^2 \widetilde G_1)
\quad \text{s.t.} \quad
\begin{bmatrix} \widetilde G_1 \\ \widetilde G_2 \end{bmatrix}^T \begin{bmatrix} \widetilde G_1 \\ \widetilde G_2 \end{bmatrix} = I. \tag{3.38}
$$
Since it has been proven in Theorem 2.2 that
$$
\max_{G^T S_t G = I} \mathrm{Trace}\big((G^T S_t G)^\dagger (G^T S_b G)\big) = \mathrm{Trace}(S_t^\dagger S_b) = \mathrm{Trace}(\Theta^2),
$$
it follows that the optimal objective function value of (3.38) is $\mathrm{Trace}(\Theta^2)$ as well. Thus, we obtain
$$
\begin{aligned}
& G \text{ is a solution of optimization problem (2.3)} \\
\Leftrightarrow\ & \begin{bmatrix} \widetilde G_1 \\ \widetilde G_2 \end{bmatrix} \text{ is a solution of optimization problem (3.38)} \\
\Leftrightarrow\ & \mathrm{Trace}(\widetilde G_1^T \Theta^2 \widetilde G_1) = \mathrm{Trace}(\Theta^2), \quad \widetilde G_1^T \widetilde G_1 + \widetilde G_2^T \widetilde G_2 = I \\
\Leftrightarrow\ & \widetilde G_1 \widetilde G_1^T = I_m, \quad \widetilde G_1^T \widetilde G_1 + \widetilde G_2^T \widetilde G_2 = I \\
\Leftrightarrow\ & m \le l \le r, \quad
\begin{bmatrix} \widetilde G_1 \\ \widetilde G_2 \end{bmatrix} = \begin{bmatrix} I_m & 0 \\ 0 & G_1 \end{bmatrix} W,
\ \text{where } G_1 \in \mathbb{R}^{(r-m)\times(l-m)} \text{ is column orthogonal and } W \in \mathbb{R}^{l\times l} \text{ is orthogonal} \\
\Leftrightarrow\ & G = U_1\Sigma_1^{-1}\begin{bmatrix} P_1(:,1:m) & P_1(:,m+1:r)G_1 \end{bmatrix} W + U_2 E, \quad \text{with } E = \widetilde G_3 \text{ and } m \le l \le r.
\end{aligned}
$$

Based on Theorem 3.6 and Theorem 3.8, we can reveal an important relationship between ULDA and CCA. Comparing (3.37) with (3.31), we see that the general formula for the linear transformation $G$ of ULDA has the same form as the general formula for the linear transformation $W_x$ of CCA. Hence, CCA is equivalent to ULDA provided that in CCA one set of variables is derived from the data matrix and the other set of variables is constructed from the class information; see (3.33) and (3.34).

3.4 Conclusions

In this chapter, we systematically studied general solutions of CCA with multiple projection vectors. We first established an equivalence between a recursive formulation and a trace formulation of the multiple-projection CCA problem. Supported by this equivalence, we adopted the trace formulation as the criterion of CCA. We then characterized all solutions of the multiple-projection CCA problem explicitly. Based on this explicit characterization, we revealed a close relationship between uncorrelated linear discriminant analysis and CCA. A thorough investigation of the general solutions of CCA not only deepens the understanding of the structure of CCA solutions, but also provides an alternative approach for obtaining multiple CCA projections even when the covariance matrices are singular. Most importantly, it paves the way for the design of the novel sparse CCA and sparse kernel CCA algorithms in Chapter 4 and Chapter 5, respectively.

Chapter 4

Sparse Canonical Correlation Analysis

Since canonical variables are linear combinations of all features in the original space and the canonical loadings are typically dense, it is challenging to interpret canonical variables in high-dimensional data analysis. Moreover, for high-dimensional data there is usually a large portion of features that are not informative for the objective, and including such features in the representation of the canonical variables may degrade the effectiveness of CCA. In order to improve the interpretability and performance of CCA in high-dimensional data analysis, researchers have studied sparse CCA, which computes pairs of projections such that maximal correlation is obtained in the transformed space with only a small number of variables involved in each projection; see [35, 71, 114, 131, 145, 150, 151, 153]. Sparse CCA has been applied in many areas, including identifying genes with expression correlated with regions of DNA copy number variations or single nucleotide polymorphisms (SNPs) [29, 114, 145, 151, 153], cross-language document retrieval [35, 71, 131] and identifying words that are musically meaningful [141].

In this chapter, we develop a new sparse CCA algorithm, named SCCA ℓ1, by adopting ideas similar to those of sparse ULDA in Chapter 2. Specifically, we compute sparse CCA solutions by selecting the solution with minimal ℓ1-norm from the solution set parameterized in Chapter 3. We conduct experiments on several simulated data sets and real-world data sets including gene expression data and cross-language

document data to demonstrate the effectiveness of the proposed algorithm, and compare it with some state-of-the-art sparse CCA algorithms.

4.1 A New Sparse CCA Algorithm

In Chapter 3, we characterized all solutions of CCA explicitly in Theorem 3.6. This explicit characterization provides us with an algorithm to compute a sparse solution of CCA from the solution set of the optimization problem (3.5). We note that structured parameter matrices $W$ and $G$ (or $G_1$, $G_2$) are involved in the parameterization of the solution set, and it is not straightforward to compute a sparse solution over these sets of structured matrices. Our idea is to seek a sparse solution of CCA from a solution subset of the optimization problem (3.5), as described in the following lemma.

Lemma 4.1. Any $(W_x, W_y)$ of the form
$$
\begin{cases}
W_x = U_1 \Sigma_1^{-1} P_1(:,1:l) + U_2 E, \\
W_y = V_1 \Sigma_2^{-1} P_2(:,1:l) + V_2 F,
\end{cases} \tag{4.1}
$$
is a solution of the optimization problem (3.5), where $E \in \mathbb{R}^{(d_1-r)\times l}$ and $F \in \mathbb{R}^{(d_2-s)\times l}$ are arbitrary.

Proof. Equation (4.1) can be obtained by letting $W = I_l$ in (3.29). Hence, by Theorem 3.6, any $(W_x, W_y)$ of the form (4.1) is a solution of the optimization problem (3.5).

Thus, the set of solutions $(W_x, W_y)$ of the form (4.1) is a subset of all solutions of the optimization problem (3.5). Note that any $(W_x, W_y)$ in (4.1) is equivalent to
$$
\begin{cases}
U_1^T W_x = \Sigma_1^{-1} P_1(:,1:l), \\
V_1^T W_y = \Sigma_2^{-1} P_2(:,1:l),
\end{cases} \tag{4.2}
$$

which implies that the arbitrary parameter matrices E and F involved in (4.1) have been eliminated in (4.2). Hence, like sparse ULDA in Chapter 2, we can compute a


sparse solution of CCA by solving the following two optimization problems
$$
\min \big\{ \|W_x\|_1 \;:\; U_1^T W_x = \Sigma_1^{-1} P_1(:,1:l), \ W_x \in \mathbb{R}^{d_1\times l} \big\} \tag{4.3}
$$
and
$$
\min \big\{ \|W_y\|_1 \;:\; V_1^T W_y = \Sigma_2^{-1} P_2(:,1:l), \ W_y \in \mathbb{R}^{d_2\times l} \big\}, \tag{4.4}
$$
where
$$
\|W_x\|_1 = \sum_{i=1}^{d_1}\sum_{j=1}^{l} |W_x(i,j)|, \qquad
\|W_y\|_1 = \sum_{i=1}^{d_2}\sum_{j=1}^{l} |W_y(i,j)|.
$$
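Each column of (4.3) (and of (4.4)) is an ℓ1-minimization problem with linear equality constraints, so small instances can also be posed directly as a linear program via the standard splitting $|w| \le t$. The SciPy sketch below is only meant to make that formulation concrete; it is not the method adopted in the thesis, which uses the accelerated linearized Bregman iteration described next, and the helper name is ours.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit_column(U1, b):
    """Solve min ||w||_1 s.t. U1.T @ w = b for one column, as an LP in (w, t)."""
    d = U1.shape[0]
    c = np.concatenate([np.zeros(d), np.ones(d)])          # minimize sum(t)
    # inequality constraints  w - t <= 0  and  -w - t <= 0  encode |w| <= t
    A_ub = np.block([[np.eye(d), -np.eye(d)],
                     [-np.eye(d), -np.eye(d)]])
    b_ub = np.zeros(2 * d)
    A_eq = np.hstack([U1.T, np.zeros((U1.shape[1], d))])    # U1.T @ w = b
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b,
                  bounds=[(None, None)] * d + [(0, None)] * d, method="highs")
    return res.x[:d]

rng = np.random.default_rng(0)
U1, _ = np.linalg.qr(rng.standard_normal((30, 4)))
b = rng.standard_normal(4)
w = basis_pursuit_column(U1, b)
print(np.linalg.norm(U1.T @ w - b), np.sum(np.abs(w) > 1e-8))
```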

Since both optimization problems (4.3) and (4.4) have the same form (i.e., the basis pursuit problem (2.13)), we focus on optimization problem (4.3); all obtained results can be directly extended to the other problem. As discussed in Chapter 2 when deriving the sparse LDA algorithm SULDA, one of the most powerful methods for solving the basis pursuit problem (2.13) is the linearized Bregman method [21, 22, 112, 167, 168] and its acceleration [83]. Applying the accelerated linearized Bregman iteration (2.28) to optimization problem (4.3), we get the following procedure for computing a sparse $W_x$:
$$
\begin{cases}
V_x^{k+1} = \widetilde V_x^k + \tau_x U_1\big(\Sigma_1^{-1}P_1(:,1:l) - U_1^T W_x^k\big), \\
\widetilde V_x^{k+1} = \alpha_k V_x^{k+1} + (1-\alpha_k) V_x^k, \\
W_x^{k+1} = \delta_x\, S(\widetilde V_x^{k+1}, \mu_x),
\end{cases} \tag{4.5}
$$
where $W_x^0 = 0$, $V_x^0 = \widetilde V_x^0 = 0$, $0 < \delta_x < 1$, $\tau_x > 0$ and $\mu_x > 0$ are parameters, $\alpha_k$ is the acceleration parameter of (2.28), and $S(\cdot,\cdot)$ is the soft-thresholding operator defined in (2.21). Applying the same iteration to optimization problem (4.4) yields the analogous procedure
$$
\begin{cases}
V_y^{k+1} = \widetilde V_y^k + \tau_y V_1\big(\Sigma_2^{-1}P_2(:,1:l) - V_1^T W_y^k\big), \\
\widetilde V_y^{k+1} = \alpha_k V_y^{k+1} + (1-\alpha_k) V_y^k, \\
W_y^{k+1} = \delta_y\, S(\widetilde V_y^{k+1}, \mu_y),
\end{cases} \tag{4.6}
$$
for computing a sparse $W_y$. The resulting method, named SCCA ℓ1, is summarized in Algorithm 2.

Algorithm 2 SCCA ℓ1
1: Input: Training data $X \in \mathbb{R}^{d_1\times n}$ and $Y \in \mathbb{R}^{d_2\times n}$, the number of pairs of projections $l$, parameters $\delta_x, \delta_y, \tau_x, \tau_y, \mu_x, \mu_y$, and tolerances $\epsilon_1 > 0$, $\epsilon_2 > 0$
2: Output: Sparse transformation matrices $W_x \in \mathbb{R}^{d_1\times l}$ and $W_y \in \mathbb{R}^{d_2\times l}$
3: Compute matrix factorizations (3.10)-(3.14)
4: Let $W_x^0 = V_x^0 = \widetilde V_x^0 = 0$, and $W_y^0 = V_y^0 = \widetilde V_y^0 = 0$
5: while error$_x > \epsilon_1$ do
6:   Compute $W_x^{k+1}$ via (4.5)
7:   error$_x := \|U_1^T W_x^{k+1} - \Sigma_1^{-1}P_1(:,1:l)\|_F$
8: end while
9: while error$_y > \epsilon_2$ do
10:   Compute $W_y^{k+1}$ via (4.6)
11:   error$_y := \|V_1^T W_y^{k+1} - \Sigma_2^{-1}P_2(:,1:l)\|_F$
12: end while
13: return $W_x = W_x^k$ and $W_y = W_y^k$

We remark that in Algorithm 2 the following stopping criteria can be adopted:
$$
\|U_1^T W_x^k - \Sigma_1^{-1}P_1(:,1:l)\|_F \le \epsilon_1, \qquad
\|V_1^T W_y^k - \Sigma_2^{-1}P_2(:,1:l)\|_F \le \epsilon_2,
$$
where $0 < \epsilon_1 < 1$ and $0 < \epsilon_2 < 1$ are tolerance parameters. Let $\Delta_{x_k} := U_1^T W_x^k - \Sigma_1^{-1}P_1(:,1:l)$; then
$$
\|\Delta_{x_k}\|_F \le \epsilon_1, \qquad U_1^T W_x^k = \Delta_{x_k} + \Sigma_1^{-1}P_1(:,1:l),
$$
and
$$
(W_x^k)^T XX^T W_x^k = I + \Delta_{x_k}^T \Sigma_1^2 \Delta_{x_k} + \Delta_{x_k}^T \Sigma_1 P_1(:,1:l) + P_1(:,1:l)^T \Sigma_1 \Delta_{x_k}.
$$


Thus,
$$
\frac{\|(W_x^k)^T XX^T W_x^k - I\|_F}{\sqrt{l}}
\le \frac{\|\Delta_{x_k}^T\Sigma_1^2\Delta_{x_k}\|_F + 2\|\Delta_{x_k}^T\Sigma_1 P_1(:,1:l)\|_F}{\sqrt{l}}
\le \frac{\|\Sigma_1\|_2^2\|\Delta_{x_k}\|_F^2 + 2\|\Sigma_1\|_2\|\Delta_{x_k}\|_F}{\sqrt{l}}
= \frac{\|X\|_2\|\Delta_{x_k}\|_F\big(2 + \|X\|_2\|\Delta_{x_k}\|_F\big)}{\sqrt{l}}
\le \frac{\|X\|_2\big(2 + \|X\|_2\epsilon_1\big)\epsilon_1}{\sqrt{l}}
= O(\epsilon_1). \tag{4.7}
$$
Similarly, we also have
$$
\frac{\|(W_y^k)^T YY^T W_y^k - I\|_F}{\sqrt{l}}
\le \frac{\|Y\|_2\big(2 + \|Y\|_2\epsilon_2\big)\epsilon_2}{\sqrt{l}}
= O(\epsilon_2). \tag{4.8}
$$
Therefore, the orthogonality constraints
$$
W_x^T XX^T W_x = W_y^T YY^T W_y = I \tag{4.9}
$$
are well satisfied by $W_x^k$ and $W_y^k$ if the tolerance parameters $\epsilon_1$ and $\epsilon_2$ are small enough.

Another issue is that, for a fixed $\delta_x$ ($0 < \delta_x < 1$), the sparsity of $W_x$ computed by Algorithm 2 will generally increase as $\mu_x$ increases. As shown in Theorem 2.5, $W_x^k$ eventually converges to the unique solution of optimization problem (4.3) with minimal Frobenius norm when $\mu_x$ is large enough. Thus, the sparsity of $W_x$ increases to a stationary value. The same result applies to $W_y$ computed by Algorithm 2.
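To make the iteration (4.5) and the loop structure of Algorithm 2 concrete, here is a minimal NumPy sketch of an accelerated linearized-Bregman-type solver for a single problem of the form (4.3), i.e. min ‖W‖1 subject to U1ᵀW = B. It is an illustration under stated assumptions: the schedule used for α_k and the iteration cap are our choices and are not taken from (2.28), and no claim is made that this reproduces the thesis implementation.

```python
import numpy as np

def soft_threshold(V, mu):
    """Entrywise soft-thresholding S(V, mu) = sign(V) * max(|V| - mu, 0)."""
    return np.sign(V) * np.maximum(np.abs(V) - mu, 0.0)

def sparse_bp_bregman(U1, B, mu=5.0, delta=0.9, tau=1.0, tol=1e-5, max_iter=20000):
    """Accelerated linearized-Bregman-style sketch for min ||W||_1 s.t. U1.T @ W = B,
    mirroring the structure of iteration (4.5); the alpha_k weights below are an
    assumption, not the thesis's exact choice."""
    d = U1.shape[0]
    l = B.shape[1]
    W = np.zeros((d, l))
    V = np.zeros((d, l))
    V_tilde = np.zeros((d, l))
    for k in range(max_iter):
        residual = B - U1.T @ W                  # constraint residual (stopping rule of Algorithm 2)
        if np.linalg.norm(residual, 'fro') <= tol:
            break
        V_new = V_tilde + tau * (U1 @ residual)  # gradient-type step
        alpha_k = (k + 2.0) / (k + 4.0)          # assumed acceleration weights
        V_tilde = alpha_k * V_new + (1.0 - alpha_k) * V
        V = V_new
        W = delta * soft_threshold(V_tilde, mu)  # shrinkage keeps W sparse
    return W

# Tiny usage example with a random column-orthonormal U1
rng = np.random.default_rng(0)
U1, _ = np.linalg.qr(rng.standard_normal((50, 5)))
B = rng.standard_normal((5, 3))
W = sparse_bp_bregman(U1, B)
print(np.linalg.norm(U1.T @ W - B, 'fro'), np.mean(W == 0))
```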

4.2 Related Work

Motivated by the need to understand and interpret the results obtained in real-world applications, many sparse CCA algorithms have been proposed: for instance, a greedy algorithm [150], sparse CCA based on penalized matrix decomposition (PMD) [114, 151, 153], sparse CCA via penalizing the distance functions in (3.3) or (3.6) [71, 134, 145], sparse CCA via Bayesian learning [115], and many others. Among these methods, Witten et al.'s algorithm PMD in [151, 153], Waaijenborg et

al.'s algorithm CCA EN in [145], and Hardoon et al.'s algorithm SCCA PD in [71] can be applied to compute multiple pairs of sparse projections. Thus, in the rest of this section, we give a detailed review of these three sparse CCA algorithms. We assume in this section that both data sets $X$ and $Y$ have been standardized to have mean zero and standard deviation one, that is, $\sum_{i=1}^n x_i = 0$, $\sum_{i=1}^n y_i = 0$ and
$$
XX^T(i,i) = 1, \quad YY^T(j,j) = 1, \qquad 1 \le i \le d_1, \ 1 \le j \le d_2.
$$

4.2.1 Sparse CCA Based on Penalized Matrix Decomposition

The authors of [151, 153] first considered the following optimization problem:
$$
\begin{aligned}
\max_{w_x, w_y} \ & w_x^T XY^T w_y \\
\text{s.t.} \ & w_x^T XX^T w_x \le 1, \quad w_y^T YY^T w_y \le 1, \\
& P_1(w_x) \le C_1, \quad P_2(w_y) \le C_2,
\end{aligned} \tag{4.10}
$$
where $P_1$ and $P_2$ are convex penalty functions which can take on various forms, including $P_1(w_x) = \|w_x\|_1$ (Lasso) and $P_1(w_x) = \|w_x\|_1 + \lambda \sum_{i=2}^{d_1} |w_x(i) - w_x(i-1)|$ (fused Lasso), and $C_1$ and $C_2$ are positive parameters controlling the sparsity of $w_x$ and $w_y$, respectively. The optimization problem in (4.10) was then approximated by the following diagonal penalized CCA:
$$
\max \big\{ w_x^T XY^T w_y : \|w_x\|_2^2 \le 1, \ \|w_y\|_2^2 \le 1, \ P_1(w_x) \le C_1, \ P_2(w_y) \le C_2 \big\}, \tag{4.11}
$$
which replaces the sample covariance matrices $XX^T$ and $YY^T$ with their diagonals (i.e., identity matrices). The diagonal penalized CCA problem (4.11) was solved by a penalized matrix decomposition method proposed in [153]. Specifically, the objective function in (4.11) was alternately maximized with respect to $w_x$ and $w_y$ according to the following updates:
$$
w_x^{k+1} = \arg\max \big\{ w_x^T XY^T w_y^k : \|w_x\|_2^2 \le 1, \ P_1(w_x) \le C_1 \big\}, \tag{4.12}
$$
$$
w_y^{k+1} = \arg\max \big\{ w_y^T YX^T w_x^{k+1} : \|w_y\|_2^2 \le 1, \ P_2(w_y) \le C_2 \big\}. \tag{4.13}
$$
When $P_1$ is an $\ell_1$ penalty, the update $w_x^{k+1}$ has the form
$$
w_x^{k+1} = \frac{S(XY^T w_y^k, \Delta_1)}{\|S(XY^T w_y^k, \Delta_1)\|_2},
$$
where $\Delta_1 = 0$ if this results in $\|w_x^{k+1}\|_1 \le C_1$; otherwise, $\Delta_1$ is chosen by binary search so that $\|w_x^{k+1}\|_1 = C_1$. Here, $S(\cdot,\cdot)$ is the soft-thresholding shrinkage operator defined in (2.21). A similar update for $w_y$ can be obtained if $P_2$ is also an $\ell_1$ penalty.

There are two tuning parameters $C_1$ and $C_2$ associated with the penalty functions $P_1$ and $P_2$ in PMD. These two parameters can be set to $C_1 = c_1\sqrt{d_1}$ and $C_2 = c_2\sqrt{d_2}$ [151], where $0 < c_1, c_2 \le 1$. A permutation-based algorithm was suggested in [151] for the selection of the tuning parameters $c_1$ and $c_2$, which can be described as follows:

1. For each grid point $(c_1(i), c_2(j))$ of tuning parameter values:
   (a) Compute $w_x$ and $w_y$ using data $X$ and $Y$, and compute
   $$
   d_{i,j} = \frac{w_x^T XY^T w_y}{\|X^T w_x\|_2 \, \|Y^T w_y\|_2}.
   $$
   (b) For $k = 1, \cdots, N$, where $N$ is some large integer:
       i. Permute the rows of $X$ to obtain the matrix $X^k$, and compute $w_x^k$ and $w_y^k$ using data $X^k$ and $Y$.
       ii. Compute
       $$
       d_{i,j}^k = \frac{(w_x^k)^T X^k Y^T w_y^k}{\|(X^k)^T w_x^k\|_2 \, \|Y^T w_y^k\|_2}.
       $$
   (c) Calculate the p-value $p_{i,j} := \frac{1}{N}\sum_{k=1}^N 1_{d_{i,j}^k \ge d_{i,j}}$.
2. Choose the tuning parameter $(c_1(i), c_2(j))$ corresponding to the smallest $p_{i,j}$.

In order to compute multiple pairs of projections, PMD updates the matrices $X$ and $Y$ with
$$
X^k(Y^k)^T = X^{k-1}(Y^{k-1})^T - \alpha_k w_x^k (w_y^k)^T,
$$
or equivalently
$$
X^k = \begin{bmatrix} X^{k-1} & \sqrt{\alpha_k}\, w_x^k \end{bmatrix}, \qquad
Y^k = \begin{bmatrix} Y^{k-1} & -\sqrt{\alpha_k}\, w_y^k \end{bmatrix},
$$
where $k = 1, \cdots, l$ and $\alpha_k = (w_x^k)^T X^k (Y^k)^T w_y^k$. An R package implementing these methods, called PMA (Penalized Multivariate Analysis), is available at http://cran.r-project.org/web/packages/PMA/index.html.
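To illustrate the ℓ1-penalized updates (4.12)-(4.13) and the binary search for Δ1 described above, the following NumPy sketch computes a single pair (wx, wy) under the diagonal approximation (4.11). It is a schematic re-implementation, not the PMA package; the initialization of wy and the fixed iteration count are assumptions.

```python
import numpy as np

def soft_threshold(v, delta):
    return np.sign(v) * np.maximum(np.abs(v) - delta, 0.0)

def l1_constrained_unit_vector(a, C, n_bisect=50):
    """Return S(a, D)/||S(a, D)||_2 with D chosen by binary search so the l1 norm is <= C."""
    w = a / np.linalg.norm(a)
    if np.linalg.norm(w, 1) <= C:
        return w                         # Delta_1 = 0 is already feasible
    lo, hi = 0.0, np.abs(a).max()
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        w = soft_threshold(a, mid)
        w = w / np.linalg.norm(w)
        if np.linalg.norm(w, 1) > C:
            lo = mid                     # threshold too small, increase sparsity
        else:
            hi = mid
    return w

def pmd_single_pair(X, Y, c1=0.3, c2=0.3, n_iter=100):
    """Alternating updates (4.12)-(4.13) under the diagonal approximation (4.11)."""
    d1, d2 = X.shape[0], Y.shape[0]
    C1, C2 = c1 * np.sqrt(d1), c2 * np.sqrt(d2)
    K = X @ Y.T
    wy = np.ones(d2) / np.sqrt(d2)       # simple initialization (assumed)
    for _ in range(n_iter):
        wx = l1_constrained_unit_vector(K @ wy, C1)
        wy = l1_constrained_unit_vector(K.T @ wx, C2)
    return wx, wy
```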

4.2.2 CCA with Elastic Net Regularization

Waaijenborg et al. proposed a sparse CCA algorithm in [145], named CCA EN, which starts with the optimization problem in (3.3) and applies an approximation to the elastic net regularization to penalize canonical weights wx and wy . However, the authors did not state the exact criterion of sparse CCA that they were solving. The algorithm is described in Algorithm 3. Algorithm CCA EN has two tuning parameters λx and λy , which are determined in [145] by performing k-fold cross-validation. Our MATLAB implementation of CCA EN is based on Algorithm 3.

4.2.3 Sparse CCA for Primal-Dual Data Representation

The original algorithm proposed in [71] deals with the case where one set of variables is drawn from the primal representation and the other set of variables is drawn from the dual representation. (The terms primal and dual are different from those commonly used in the optimization literature: here, by primal representation we mean the original input data, while by dual representation we mean data in the kernel space.) The authors of [71] considered the following optimization problem:
$$
\begin{aligned}
\min_{w_x, e} \ & \|X^T w_x - K_y e\|^2 + \mu\|w_x\|_1 + \gamma\|\tilde e\|_1 \\
\text{subject to} \ & \|e\|_\infty = 1,
\end{aligned} \tag{4.14}
$$
where $K_y$ is the kernel matrix corresponding to the dual representation of $Y$, $\mu$ and $\gamma$ are positive parameters, $\tilde e = [e_1, \cdots, e_{i-1}, e_{i+1}, \cdots, e_n]$ and $e_i = 1$.


Algorithm 3 CCA EN
1: Input: Training data $X \in \mathbb{R}^{d_1\times n}$ and $Y \in \mathbb{R}^{d_2\times n}$, the number of pairs of projections $l$ required to compute
2: Output: Sparse transformation matrices $W_x \in \mathbb{R}^{d_1\times l}$ and $W_y \in \mathbb{R}^{d_2\times l}$
3: $X^0 = X$ and $Y^0 = Y$
4: for $i = 1, \cdots, l$ do
5:   Standardize $X^{i-1}$ and $Y^{i-1}$
6:   Set $k = 1$, $\xi_k^i = X^{i-1}(s,:)^T$ and $\eta_k^i = Y^{i-1}(t,:)^T$ such that $\frac{(\xi_k^i)^T \eta_k^i}{\|\xi_k^i\|_2 \|\eta_k^i\|_2}$ is maximized over all $s$ and $t$
7:   repeat
8:     $w_x^{i,k} = S(X^{i-1}\xi_k^i, \lambda_x)/\|S(X^{i-1}\xi_k^i, \lambda_x)\|_2$
9:     $w_y^{i,k} = S(Y^{i-1}\eta_k^i, \lambda_y)/\|S(Y^{i-1}\eta_k^i, \lambda_y)\|_2$
10:    $\xi_{k+1}^i = (X^{i-1})^T w_x^{i,k}$ and $\eta_{k+1}^i = (Y^{i-1})^T w_y^{i,k}$
11:  until convergence
12:  Set $w_x^i = w_x^{i,k}$, $w_y^i = w_y^{i,k}$, $\xi^i = \xi_k^i$, $\eta^i = \eta_k^i$
13:  $X^i = X^{i-1}\big(I - \xi^i(\xi^i)^T/\|\xi^i\|_2^2\big)$, $\quad Y^i = Y^{i-1}\big(I - \eta^i(\eta^i)^T/\|\eta^i\|_2^2\big)$
14: end for
15: return $W_x = [w_x^1 \cdots w_x^l]$ and $W_y = [w_y^1 \cdots w_y^l]$

To find the optimal index $i$, the authors suggested solving (4.14) for all values of $i$ ($i = 1, \cdots, n$) and selecting the one that gives the minimum objective value. A detailed description of this algorithm is given in Algorithm 4. In [71], the following automatic selection of the parameters $\mu$ and $\gamma$ was used:
$$
\mu = \frac{2}{d_1}\|XK_y e\|_1, \qquad \gamma = \frac{2}{n}\|K_y^T K_y e\|_1.
$$
MATLAB codes implementing Algorithm SCCA PD are available at http://www.davidroihardoon.com/Professional/Code.html. Besides the original algorithm, we also use a variant of SCCA PD which handles the case where both sets of variables are drawn from the primal representation.

Algorithm 4 SCCA PD
1: Input: Training data $X \in \mathbb{R}^{d_1\times n}$ and kernel matrix $K_y \in \mathbb{R}^{n\times n}$, the number of pairs of projections $l$ required to compute
2: Output: Sparse transformation matrices $W_x \in \mathbb{R}^{d_1\times l}$ and $W_y \in \mathbb{R}^{n\times l}$
3: $X^1 = X$ and $K_y^1 = K_y$
4: for $k = 1, \cdots, l$ do
5:   for $s = 1, \cdots, n$ do
6:     Set $w_x^{(s)} = 0$, $e^{(s)} = 0$ with $e_s^{(s)} = 1$, $\alpha = 2X^k K_y^k e^{(s)} + \mu\mathbf{1}$
7:     Define the index set $I = \{i : \alpha_i < 0 \text{ or } \alpha_i > 2\mu\}$
8:     repeat
9:       Update $w_x^{(s)}$ using Algorithm 5
10:      $\beta = 2(K_y^k)^T K_y^k e^{(s)} - 2K_y^k (X^k)^T w_x^{(s)} + \gamma\mathbf{1}$ and $J = \{j : \beta_j < 0\}$
11:      Update $e^{(s)}$ using Algorithm 6
12:      $\alpha = 2X^k K_y^k e^{(s)} + \mu\mathbf{1}$ and $I = \{i : \alpha_i < 0 \text{ or } \alpha_i > 2\mu\}$
13:    until convergence
14:    $w_x^{(s)} = w_x^{(s)}/\|(X^k)^T w_x^{(s)}\|_2$, $\quad e^{(s)} = e^{(s)}/\|K_y^k e^{(s)}\|_2$
15:  end for
16:  Choose the optimal $s$ such that $(w_x^{(s)}, e^{(s)})$ minimizes the objective function value of (4.14), and set $w_x^k = w_x^{(s)}$ and $e^k = e^{(s)}$
17:  Compute $u_k = X^k(X^k)^T w_x^k$ and $\tau_k = (K_y^k)^T K_y^k e^k$
18:  Update $X^k$ and $K_y^k$ as follows:
$$
X^{k+1} = \Big(I - \frac{X^k(X^k)^T u_k u_k^T}{u_k^T X^k (X^k)^T u_k}\Big) X^k, \qquad
K_y^{k+1} = \Big(I - \frac{\tau_k\tau_k^T}{\tau_k^T\tau_k}\Big) K_y^k \Big(I - \frac{\tau_k\tau_k^T}{\tau_k^T\tau_k}\Big) \tag{4.15}
$$
19: end for
20: return $W_x = [w_x^1 \cdots w_x^l]$ and $W_y = [e^1 \cdots e^l]$
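For concreteness, the deflation step (4.15), as reconstructed above, can be written as a small NumPy helper; the function and variable names are ours, and this is only an illustrative sketch rather than the authors' code.

```python
import numpy as np

def scca_pd_deflate(Xk, Kyk, wxk, ek):
    """One deflation step of (4.15): remove the directions already explained
    by the current pair (wxk, ek) from Xk and from the kernel matrix Kyk."""
    u = Xk @ (Xk.T @ wxk)                  # u_k = X^k (X^k)^T w_x^k
    tau = Kyk.T @ (Kyk @ ek)               # tau_k = (K_y^k)^T K_y^k e^k
    XXu = Xk @ (Xk.T @ u)                  # X^k (X^k)^T u_k
    P_x = np.eye(Xk.shape[0]) - np.outer(XXu, u) / (u @ XXu)
    P_tau = np.eye(Kyk.shape[0]) - np.outer(tau, tau) / (tau @ tau)
    return P_x @ Xk, P_tau @ Kyk @ P_tau

# tiny usage with random data
rng = np.random.default_rng(1)
X0, Ky0 = rng.standard_normal((6, 10)), rng.standard_normal((10, 10))
X1, Ky1 = scca_pd_deflate(X0, Ky0, rng.standard_normal(6), rng.standard_normal(10))
```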

In this variant, we replace Ky and e in (4.14) with Y and wy , respectively. The


Algorithm 5 SCCA PD: Update $w_x$
1: Input: $X \in \mathbb{R}^{d_1\times n}$, $K_y \in \mathbb{R}^{n\times n}$, $w_x^{old}$, $e$, $\alpha$ and index set $I$
2: Output: Updated $w_x$
3: repeat
4:   for $i = 1$ to the length of $I$ do
5:     if $\alpha_{I_i} > 2\mu$ then
6:       $\alpha_{I_i} = 2\mu$, $\quad w_{x,I_i}^{new} = w_{x,I_i}^{old} + \big(2(XK_y e)_{I_i} - 2(XX^T w_x^{old})_{I_i} - \alpha_{I_i} + \mu\big)/\big(2(XX^T)_{I_i,I_i}\big)$
7:     else if $\alpha_{I_i} < 0$ then
8:       $\alpha_{I_i} = 0$, $\quad w_{x,I_i}^{new} = w_{x,I_i}^{old} + \big(2(XK_y e)_{I_i} - 2(XX^T w_x^{old})_{I_i} - \alpha_{I_i} + \mu\big)/\big(2(XX^T)_{I_i,I_i}\big)$
9:     else if $w_{x,I_i} > 0$ then
10:      $w_{x,I_i}^{new} = w_{x,I_i}^{old} - (2\mu - \alpha_{I_i})/\big(2(XX^T)_{I_i,I_i}\big)$
11:    else
12:      $w_{x,I_i}^{new} = w_{x,I_i}^{old} + \alpha_{I_i}/\big(2(XX^T)_{I_i,I_i}\big)$
13:    end if
14:    if $\mathrm{sign}(w_{x,I_i}^{new}) \ne \mathrm{sign}(w_{x,I_i}^{old})$ then
15:      $w_{x,I_i}^{old} = 0$
16:    else
17:      $w_{x,I_i}^{old} = w_{x,I_i}^{new}$
18:    end if
19:  end for
20: until convergence
21: return $w_x = w_x^{new}$

2 kXY T wy k1 , d1

γ=

2 kY Y T wy k1 . d2

82

Chapter 4. Sparse Canonical Correlation Analysis Algorithm 6 SCCA PD–Update e 1:

Input: X ∈ Rd1 ×n , Ky ∈ Rn×n , wx , eold , β, s and index set J

2:

Output: Updated e

3:

repeat

4:

for i = 1 to the length of J do if Ji 6= s then

5: 6:

old T old T T enew Ji = eJi + (2(Ky X wx )Ji − 2(Ky Ky e )Ji + β − γ)/4(Ky Ky )Ji ,Ji

7:

if enew Ji < 0 then eold Ji = 0

8:

else if enew Ji > 1 then

9:

eold Ji = 1

10:

end if

11:

end if

12: 13:

end for

14:

until convergence

15:

return wx = wxnew

The deflation procedure for X in (4.15) is unchanged, and the deflation procedure for Y is the same as that of X, that is   Yk YkT vk vkT Yk+1 = I − T Yk , vk Yk YkT vk

k = 1, · · · , l − 1,

where Y1 = Y and vk = Yk YkT wyk .

4.2.4

Some Remarks

Aforementioned algorithms compute sparse wx and wy recursively, which requires them to update matrices X and Y after the acquisition of one pair of wx and wy in order to compute a new pair. As algorithms PMD, CCA EN, and SCCA PD do not take orthogonality constraints (4.9) into account, in our experiment, we perform

4.3 Numerical Results

83

column scaling: wxk :=

wxk , kX T wxk k2

wyk :=

wyk kY T wyk k2

(4.16)

on the kth column of Wx and Wy so that (wxk )T XX T wxk = 1,

(wyk )T Y Y T wyk = 1,

k = 1, · · · , l.

Compared with existing sparse CCA algorithms, our proposed method has some nice properties. First of all, the optimization criterion of our method consists of two `1 -minimization problems which are convex and whose global optimal solutions can be found efficiently. Accelerated linearized Bregman iterative method has been applied to obtain the solution of sparse CCA problem. Secondly, the orthogonality constraints on canonical variables are well satisfied, which implies that the canonical variables are mutually uncorrelated. Finally, multiple sparse projections are computed simultaneously, while other approaches in [71, 114, 131, 145, 150, 151] have to recursively compute multiple sparse projections. As a result, the orthogonality constraints on canonical variables may not be satisfied even when the normalization (4.16) is adopted, and the related canonical variables are not mutually uncorrelated.

4.3

Numerical Results

In this section, we conducted experiments on several simulated data sets and realworld data sets obtained from gene classification and cross-language document retrieval to demonstrate the effectiveness of the proposed algorithm. We also compare the performance of the proposed method with existing state-of-the-art sparse CCA algorithms reviewed in Section 4.2. In the implementation of PMD, we provided the tuning parameters c1 and c2 with the following two candidate sets: c1 , c2 ∈ {0.01 : 0.01 : 0.5}, and c1 , c2 ∈ {0.5 : 0.01 : 1}. In the implementation of CCA EN, we employed 5-fold cross-validation to choose

84

Chapter 4. Sparse Canonical Correlation Analysis λx and λy with the following two candidate sets: λx , λy ∈ {0 : 0.01 : 0.5}, and λx , λy ∈ {0.5 : 0.01 : 1}. For our algorithm SCCA `1 , we set δx = δy = δ = 0.9, τx = τy = τ = 1 and 1 = 2 = 10−5 unless otherwise specified. All the experiments were performed under CentOS 5.2 and MATLAB v7.4 (R2007a) running on a SUN v40z server with four AMD 2.4GHz Opteron 850 CPUs and 32GB of Random-access memory (RAM).

4.3.1

Synthetic Data

In this section, we test our sparse CCA algorithm on a simple synthetic data to show that it does well at identifying linear combinations of the underlying factors, and compare it with other three algorithms. The synthetic data X and Y were generated according to the following model: X = (v1 + 1 )uT ,

Y = (v2 + 2 )uT ,

(4.17)

where v1 ∈ R200 and v2 ∈ R300 are given by v1 = [1| ·{z · · 1} 25

−1 · · − 1} | ·{z 25

· · 0}], |0 ·{z 150

v2 = [0| ·{z · · 0} 250

1| ·{z · · 1} 25

−1 · · − 1}], | ·{z 25

1 ∈ R200 and 2 ∈ R300 are two noise vectors with 1 (i) ∼ N (0, 0.12 ), ∀ i = 1, · · · , 200, and 2 (j) ∼ N (0, 0.12 ), ∀ j = 1, · · · , 300, u ∈ R50 is a random vector with all its entries from the normal distribution, that is u(i) ∼ N (0, 1), ∀ i = 1, · · · , 50. Plots of v1 and v2 are presented in Figure 4.1. From the construction of X and Y , we can see that the first 50 variables of X and the last 50 variables of Y are correlated, and only single pair of projections Wx and Wy can be computed from

4.3 Numerical Results

85

True v1

True v2

1

1

0.8

0.8

0.6

0.6

0.4

0.4

0.2

0.2

0

0

−0.2

−0.2

−0.4

−0.4

−0.6

−0.6

−0.8

−0.8

−1

0

50

100

150

200

−1

0

100

200

300

Figure 4.1: True value of vectors v1 and v2 . sparse CCA algorithms. A good sparse CCA algorithm should compute Wx and Wy that can identify correlated variables between X and Y , that is, non-zero elements of Wx must lie in the first 50 components while non-zero elements of Wy must lie in the last 50 components. Wx and Wy obtained by our method SCCA `1 and other three methods are plotted in Figure 4.2. All numerical results are summarized in Table 4.1, where ’sparsity’ denotes the percentage of zero entries as defined in (2.41) and a higher value of ’sparsity’ means fewer non-zero entries in Wx (or Wy ), which leads to easier interpretation. A lower value of

kWxT XX T Wx −Il kF √ l

and

kWyT Y Y T Wy −Il kF √ l

implies that Wx and Wy satisfy the

orthogonality constraints (4.9) better. From Figure 4.2(a), we see that wx and wy computed by SCCA `1 have the property that non-zero elements of wx lie in the first 50 components and the last 150 elements are zeros while non-zero elements of wy lie in the last 50 components and the first 250 elements are zeros. Therefore, SCCA `1 can compute wx and wy that identify correlated variables in X and Y . Notice from Theorem 2.5 that, as µ gets larger, the sparsity of wx and wy generally becomes higher, eventually wx and wy converge to vectors with only one non-zero element as shown in Figure 4.2(a). Algorithm PMD achieves similar results as SCCA `1 when parameters c1 and c2 are selected from candidate set 0.01:0.01:0.5, which can be observed from Figure 4.2(b). However, when parameters c1 and c2 are selected from candidate

86

Chapter 4. Sparse Canonical Correlation Analysis wx

wy µ =µ =5

0.1

x

y

0.05

0.05

0

0

−0.05

−0.05

−0.1 50

100

150

200

0

100

200

0

0

−2

−2

x

0.1

0.05

0.05

0

0

−0.05

100

150

200

150

0

100

200

wx

−3

0.1

200

300

200

300

200

300

candidate set for c 1 and c : 0.01:0.01:0.5 2

0.05

0

0

−0.05

−0.05 −0.1 0

300

100

50

100

150

200

0

100

(b) wy

−3

x 10

candidate set of λx and λy: 0:0.01:0.5

2

0

wy

(a) x 10

200

−0.1

−0.1 50

100 wx

candidate set for c1 and c2: 0.01:0.01:0.5

0.05

−0.05

−0.1 0

50

0.1

µx=µy=10

y

2

300

wy µ =µ =10

0.1

candidate set for c1 and c : 0.5:0.01:1

2

2

0 wx

wy

−3

x 10

candidate set for c 1 and c : 0.5:0.01:1

2

−0.1 0

wx

−3

x 10

µx=µy=5

0.1

wx

candidate set of λx and λy: 0:0.01:0.5

2

wy 0.15

0.1 0.1

0

0

−2

−2

0.05

0

50 −3

x 10

100 wx

150

200

0

candidate set of λx and λy: 0.5:0.01:1

2

100

300

candidate set of λx and λy: 0.5:0.01:1

2

0

200 wy

−3

x 10

0.05

0

0

−0.05

−0.05

0 −0.1 −0.1 −2

−2 0

50

100

150

200

(c)

0

100

200

300

0

50

100

150

−0.15 200 0

100

(d)

Figure 4.2: Wx and Wy computed by different sparse CCA algorithms: (a) SCCA `1 (our approach), (b) Algorithm PMD, (c) Algorithm CCA EN, (d) Algorithm SCCA PD. set 0.5:0.01:1, sparsity of computed wx and wy is very low and there are many non-zero elements showing in the last 150 component of wx and in the first 250 components of wy , which means the computed wx and wy select wrong variables. For Algorithm SCCA PD, we can see from Figure 4.2(d) that the only one non-zero element of wx lies in the first 50 components of wx while non-zero elements of wy lie in the last 50 components of wy , which is similar to the results of SCCA `1 .

4.3 Numerical Results

87

Table 4.1: Comparison of results obtained by SCCA `1 with µx = µy = µ and 1 = 2 = 10−5 , PMD, CCA EN, and SCCA PD.

SCCA `1

PMD with

CCA EN with

candidates for c1 , c2

candidates for λx , λy

SCCA PD

µ=5

µ = 10

0.5:0.01:1

0.01:0.01:0.5

0:0.01:0.5

0.5:0.01:1

c1

1.0000

1.0000

1.0000

1.0000

1.0000

1.0000

1.0000

c2

99.50

99.50

71

99.50

0

2.50

99.50

c3

99.6667

99.6667

26.3333

99.6667

0.3333

0.3333

98.3333

c4

1.3965e-5

6.9641e-6

0

1.1102e-16

2.2204e-16

6.6613e-16

4.4409e-16

c5

1.2531e-5

7.6002e-6

0

2.2204e-16

0

1.1102e-16

2.2204e-16

In the table above, the following notations are used: c1 denotes canonical correlation wxT XY T wy ; c2 and c3 denote sparsity of Wx (%) and Wy (%), respectively; c4 = kWyT Y

Y

T

Wy −Il kF √ l

kWxT XX T Wx −Il kF √ l

and c5 =

measuring orthogonality of canonical variables X T Wx and Y T Wy , respectively;

From Table 4.1 we see that sparsity of wx and wy computed by Algorithm CCA EN stays at a low level and hardly changes as λx and λy increase. Plots of wx (see Figure 4.2(c)) look almost the same when λx and λy increase, so do those of wy . Although the canonical weights corresponding correlated variables in X and Y are relatively larger, there are many nonzero elements in the last 150 components of wx and in the first 250 components of wy , which means the computed wx and wy select many wrong variables. We also observe that violation of orthogonality constraints (4.9) in SCCA `1 (Algorithm 2), measured by

kWxT XX T Wx −Il kF √ l

and

kWyT Y Y T Wy −Il kF √ l

,

is of order O(1 ) and O(2 ), respectively, which is consistent with the inequalities in (4.7) and (4.8).

4.3.2

Gene Expression Data

In this subsection, we test our sparse CCA algorithm on four gene expression data sets. We compare the classification performance of non-sparse CCA and ULDA to verify the equivalent relationship presented in Section 3.3. We also compare our

88

Chapter 4. Sparse Canonical Correlation Analysis sparse CCA algorithm with other three algorithms. The four gene expression data sets we used are: Prostate, Lymphoma, SRBCT and Brain, whose structures can be found in Table 4.2 and detailed descriptions are summarized in Appendix A. The data sets and their preprocessing are fully described in [44]. In this experiment, we choose l (the number of columns in Wx and Wy ) equal P to m (the rank of matrix XY T ). When l = m, Trace(WxT XY T Wy ) = m i=1 ηi , which is the summation of all non-zero canonical correlations between X T Wx and Y T Wy . According to Theorem 3.8, we know that l ≥ m is required to ensure that G ∈ Rd×l is a solution of optimization problem in (2.3) (i.e., ULDA). On the other hand, when we compute G for dimensionality reduction, its dimension should be as small as possible. Hence we choose the transformation with the minimal dimension, i.e., l = m. Table 4.2: Data stuctures: data dimension (d1 ), training size (n), the number of classes (K) and the number of testing data (# Testing), m is the rank of matrix XY T , l is the number of columns in Wx and Wy and we choose l = m in our experiments. Data set

d1

n

K

# Testing

m

l

Prostate

6033

51

2

51

1

1

Lymphoma

4026

32

3

30

2

2

SRBCT

2308

32

4

31

3

3

Brain

5597

21

5

21

4

4

To distinguish non-sparse and sparse solutions of CCA, we use (WxN S , WyN S ) to denote non-sparse CCA solution and (Wx , Wy ) to denote sparse CCA solution, i.e., WxN S = U1 Σ−1 1 P1 (:, 1 : l),

WyN S = V1 Σ−1 2 P2 (:, 1 : l),

and (Wx , Wy ) is the pair computed by SCCA `1 , or Algorithm PMD, or Algorithm CCA EN, or Algorithm SCCA PD. Table 4.3 lists the classification accuracy of ULDA and non-sparse CCA using 1NN as classifier. According to Table 4.3, the classification accuracy achieved by WxN S of CCA is the same as that of ULDA, which

4.3 Numerical Results

89

is consistent with the fact, as described in Section 3.3, that ULDA is a special case of CCA. Table 4.3: Comparison of classification accuracy (%) between ULDA and WxN S of CCA using 1NN as classifier Data set

Prostate

Lymphoma

SRBCT

Brain

ULDA

92.1569

100

96.7742

76.1905

WxN S of CCA

92.1569

100

96.7742

76.1905

Now, we study the performance of Algorithm 2 in comparison with other three algorithms. Experimental results are listed in Table 4.4 and Table 4.5, where the following notations are used: (i) c1 and c2 denote correlations Trace((WxN S )T XY T WyN S ) and Trace(WxT XY T Wy ) for training data, respectively; (ii) c3 and c4 denote correlations Trace((WxN S )T XY T WyN S ) and Trace(WxT XY T Wy ) for testing data, respectively; (iii) c5 and c6 denote sparsity of Wx (%) and Wy (%), respectively; (iv) c7 =

kWxT XX T Wx −Il kF √ l

and c8 =

kWyT Y Y T Wy −Il kF √ l

measuring orthogonality of

canonical variables X T Wx and Y T Wy , respectively; (v) c9: classification accuracy using sparse Wx (%). Table 4.4 and Table 4.5 lead to the following observations: 1. In Algorithm 2, for any fixed δ lying between (0, 1) (δ = 0.9 in our experiments), we only need to choose sufficient large µx and µy as stated in Theorem 2.5 to get a high sparsity for the computed solutions Wx and Wy ; however, the sparsity of Wx and Wy is not affected when (µx , µy ) is increased from (5, 50) ∗ ∗ to (10, 100) in our experiments. This demonstrates that (Wx,µ , Wy,µ ) is alx y

ready a good approximation of (Wx∗ , Wy∗ ) when (µx , µy ) = (5, 50). However,

90

Lymphoma

SCCA `1 with

PMD with

CCA EN with

µx = µ, µy = 10µ

candidates for c1 , c2

candidates for λx , λy

SCCA PD

µ=5

µ = 10

0.5:0.01:1

0.01:0.01:0.5

0:0.01:0.5

0.5:0.01:1

c1

1.0000

1.0000

1.0000

1.0000

1.0000

1.0000

1.0000

c2

1.0000

1.0000

0.5052

0.6521

0.4908

0.5316

0.9409

c3

0.8111

0.8111

0.8111

0.8111

0.8111

0.8111

0.81111

c4

0.8273

0.8273

0.5649

0.7618

0.5538

0.5870

0.8659

c5

99.17

99.17

22.48

99.98

0

56.46

99.65

c6

0

50

50

50

0

0

50

c7

1.7916e-7

1.6042e-7

8.8818e-16

4.4409e-16

0

0

0

c8

8.6222e-6

2.8527e-6

4.4409e-16

4.4409e-16

1.7764e-15

0

3.3307e-16

c9

94.1176

94.1176

80.3922

86.2745

66.6667

66.6667

94.1176

c1

2.0000

2.0000

2.0000

2.0000

2.0000

2.0000

2.0000

c2

2.0000

2.0000

1.6754

1.6257

1.8849

1.8842

1.7217

c3

1.8529

1.8529

1.8529

1.8529

1.8529

1.8529

1.8529

c4

1.7946

1.7946

1.5907

1.5017

1.7861

1.7835

1.5427

c5

99.23

99.23

25.67

99.68

19.56

41.24

99.60

c6

50

50

50

66.67

0

0

50

c7

6.8731e-6

8.6251e-6

9.0242e-1

9.4602e-1

4.2377e-2

4.0695e-2

2.8697e-1

c8

3.9435e-6

1.4939e-6

8.9761e-2

5.9459e-1

1.8663e-3

4.1585e-3

4.8046e-1

c9

100

100

96.6667

96.6667

100

100

96.6667

Chapter 4. Sparse Canonical Correlation Analysis

Prostate

Table 4.4: Comparison of results obtained by SCCA `1 , PMD, CCA EN, and SCCA PD.

SRBCT Brain

SCCA `1 with

PMD with

CCA EN with

µx = µ, µy = 10µ

candidates for c1 , c2

candidates for λx , λy

SCCA PD

µ = 10

0.5:0.01:1

0.01:0.01:0.5

0:0.01:0.5

0.5:0.01:1

c1

3.0000

3.0000

3.0000

3.0000

3.0000

3.0000

3.0000

c2

3.0000

3.0000

2.7473

2.6552

2.8114

1.8898

2.7784

c3

2.7846

2.7846

2.7846

2.7846

2.7846

2.7846

2.7846

c4

2.6265

2.6202

2.4658

2.2657

2.4813

1.7796

2.4004

c5

98.64

98.64

35.67

99.70

30.39

80.95

99.39

c6

41.67

41.67

75

75

0

0

75

c7

4.0122e-6

2.8706e-6

5.3604e-1

7.5417e-1

1.2411e-1

5.8078e-1

1.5976e-1

c8

4.0975e-6

7.5397e-6

4.1404e-1

5.3127e-1

7.3770e-16

1.2208e-2

4.1404e-1

c9

93.5484

93.5484

87.0968

87.0968

93.5484

83.8710

100

c1

4.0000

4.0000

4.0000

4.0000

4.0000

4.0000

4.0000

c2

4.0000

4.0000

3.7513

3.7851

3.7957

3.8060

3.3233

c3

3.3198

3.3198

3.3198

3.3198

3.3198

3.3198

3.3198

c4

2.6198

2.6179

3.0767

3.0352

3.2804

3.3287

2.2795

c5

99.64

99.64

20.54

84.94

33.54

69.45

99.81

c6

20

20

30

75

0

0

75

c7

4.7467e-6

1.4462e-6

3.8263e-1

4.3716e-1

5.2609e-2

8.8121e-2

2.4601e-1

c8

7.6711e-6

4.5569e-67

1.2352e-1

3.9253e-1

6.7958e-3

1.2938e-2

5.8152e-1

c9

80.9524

80.9524

90.4762

76.1905

90.4762

90.4762

61.9048

91

µ=5

4.3 Numerical Results

Table 4.5: Comparison of results obtained by SCCA `1 , PMD, CCA EN, and SCCA PD.

92

Chapter 4. Sparse Canonical Correlation Analysis there is no mathematical justification for selecting parameters for other three algorithms PMD, CCA EN and SCCA PD. 2. For Algorithm 2, the sparse solution Wx gives a comparable classification accuracy compared with the non-sparse solution WxN S . But, the classification performance of algorithms PMD and CCA EN is highly affected by the parameters involved. When we applied PMD with c1 and c2 selected from 0.5 : 0.01 : 1 to the Prostate dataset the obtained accuracy is much less than that obtained by applying PMD with c1 and c2 selected from 0.01 : 0.01 : 0.5 to the same dataset. However, a contrast phenomenon is observed when applying PMD to the Brain dataset. Similar phenomenon can be observed for CCA EN. 3. An important advantage of Algorithm 2 over other three algorithms PMD, CCA EN and SCCA PD is that sparse solutions Wx and Wy computed by Algorithm 2 satisfy the orthogonality constraints (4.9) very well. Orthogonality constraints (4.9) imply that both canonical variables X T Wx and Y T Wy are mutually uncorrelated. According to the experimental results obtained by Algorithm 2 we see that

kWxT XX T Wx −Il kF √ l

= O() and

kWyT Y Y T Wy −Il kF √ l

= O(), which

are consistent with error bounds (4.7) and (4.8). However, for other three algorithms PMD, CCA EN and SCCA PD, we find that and

kWyT Y

Y

T W −I k y l F



l

kWxT XX T Wx −Il kF √ l

= O(1)

= O(1), when l > 1. In addition, although the normal-

ization diag(WxT XX T Wx ) = I and diag(WyT Y Y T Wy ) = I are performed, the canonical variables X T Wx and/or Y T Wy are not mutually uncorrelated. 4. The correlation Trace(WxT XY T Wy ) for training data (c2) achieved by Algorithm 2 is greater than that achieved by other three algorithms, and the correlation Trace(WxT XY T Wy ) for testing data (c4) achieved by Algorithm 2 is greater than that achieved by other three algorithms in Lymphoma and SRBCT data sets. The correlation Trace(WxT XY T Wy ) for testing data (c4) achieved by Algorithm 2 is lower than that achieved by some of other three algorithms in Prostate and Brain data sets, but Algorithm 2 computes Wx and

4.3 Numerical Results

93

Wy with higher sparsity in these cases.

4.3.3

Cross-Language Document Retrieval

In this experiment, we apply sparse CCA to the task of cross-language document retrieval to demonstrate the effectiveness of our sparse CCA algorithm. The data set contains a collection of documents, referred as corpus, where each document is represented in two languages, say English and French. The goal of cross-language document retrieval is to retrieve the most relevant documents in the target language for a given query in one language. Previous studies [71, 131, 144] have shown that the CCA approach works well for cross-language document retrieval and performs better than the latent semantic indexing (LSI) [12] approach. For a given collection of documents in one language (say English), we first obtain a bag-of-words representation of the documents using Term Frequency Inverse Document Frequency (TFIDF), see [96, 120, 127]. Specifically, a document is represented by a bag of words where their frequencies are taken into account. In the model, the ordering of words and grammatical information are ignored. A dictionary containing a set of terms that appear in the corpus is predefined so that a document can be represented as a vector in which each dimension is associated with one term in the dictionary. Given a document d, it is represented by a vector h iT w(t1 , d) w(t2 , d) · · · w(tk , d) ,

(4.18)

where k is the size of the dictionary which is typically very large, and w(ti , d) is the TFIDF weight of the term ti in the document d. The term frequency tf (ti , d) is the number of times ti appears in the document d which is usually normalized to prevent a bias towards longer documents. On the other hand, intuitively, if one term is common in the corpus (e.g. “the”), it is not a good keyword to distinguish relevant and non-relevant documents and terms and thus should be given smaller weight than the less common terms. To accomplish this task, we introduce another concept called inverse document frequency. The inverse document frequency idf (ti )

94

Chapter 4. Sparse Canonical Correlation Analysis measures the importance of the term ti in the corpus. Assume there are n documents in the corpus and let df (ti ) denote the number of documents containing the term   ti , then idf (ti ) = ln df n(ti ) + 1. The TFIDF weight w(ti , d) is determined by w(ti , d) = tf (ti , d) × idf (ti ), and a high value of w(ti , d) is reached by a high term frequency of ti in the document d and a low document frequency of ti in the whole corpus. This preprocessing gives us a term-document matrix E ∈ Rd1 ×n where n is the number of documents, d1 is the number of terms in English document, and the (i, j) entry is the TFIDF weight of term ti in document dj . Similarly, we can get a term-document matrix F ∈ Rd2 ×n for the corresponding French corpus. Now we apply sparse CCA to these two matrices to find two transformations WE and WF . For a given query document in one language, say French, we first convert the document into a term vector qF ∈ Rd2 as in the form of (4.18), then project qF into the lower-dimensional space by computing WFT qF . Similarly, if Et ∈ Rd1 ×N denotes the testing English documents, where N is the number of testing English documents, we project all testing English documents into a lower-dimensional space by computing WET Et . Finally, we perform document retrieval by selecting those testing English documents whose projections are closest to the projected query. We measure the precision of document retrieval by using average area under the receiver operating characteristic curve (AROC) [17, 54, 131]. Specifically, we compute the distance τ (j) = kWFT qF − WET Et (:, j)k2 ,

(4.19)

for j = 1, 2, · · · , N, and sort τ (·) in a decreasing order. Let I be the index location of the the most relevant document, then the retrieved document is among the top I N

(< 1) percentage most relevant documents of all documents, and

I N

is used as the

retrieval precision. For example, suppose there are 100 testing English documents. If a testing English document corresponding to the smallest τj in (4.19) was the English counterpart of the query, then the AROC of this retrieval is 1. If a testing English document corresponding to the 10th smallest τj in (4.19) was the English counterpart of the query, then the AROC of this retrieval equals to (1 −

10 ), 100

i.e.,

4.3 Numerical Results 0.9. For a collection of queries, we use the average of all queries’ retrieval precision as the average retrieval precision of this collection. We conducted experiments with the Aligned Hansards of the 36th Parliament of Canada, which is a collection of text chunks (sentences or smaller fragments) in English and French from the 36th Parliament proceedings of Canada. The detailed information can be found in [64]. In our experiments, we used only a part of text chunks to generate term-documents matrices. The text chunks were split into documents based on ”* * *” delimiters (without quotes) so that a text chunk can be split into multiple documents. After removing stop words 2 and rare words (appearing less than 3 times in the corpus) and stemming 3 , we obtained a 5383×818 term-document matrix for English documents and a 8015 × 818 term-document matrix for French documents. We randomly split each term-document matrix above into two parts, one part is used as a training data set to obtain sparse CCA transformations and the remaining documents as a testing data set to test the retrieval ability of sparse CCA. We first used a data set with 50 documents as training data, named Data Set I, to investigate the performance of SCCA `1 (Algorithm 2) in comparison with other three algorithms; then used a data set with 400 documents as training data, named Data Set II, to compare retrieval performance of Algorithm 2 with SCCA PD only, since we found that it is very time-consuming to obtain solutions by algorithms PMD and CCA EN for Data Set II. In this experiment, Algorithm 2 was implemented with µx = µy = 10, PMD was implemented with parameters c1 and c2 selected from candidate set 0.01:0.01:0.5, and CCA EN was implemented with parameters λx and λy selected from candidate set 0:0.01:0.5. For the SCCA PD approach, the 2

Stop words are those frequent words such as articles, prepositions, conjunctions, etc. that

carry little information about the contents of the processed documents. 3 In information retrieval, stemming is the process for reducing inflected or derived words to their root form, so that different forms of the same word are treated as equivalent terms. For example ”stemmer”, ”stemming”, and ”stemmed” are mapped to the common root ”stem”. This process can be used to reduce the size of the dictionary.

95

96

Chapter 4. Sparse Canonical Correlation Analysis linear kernel was used for data F , that is, we applied Algorithm SCCA PD to the data pair (E, Ky ) with Ky = F T F , and we adopted the deflation procedure (4.15) to Ky . This experimental setting is consistent with [71]. Therefore, Wy obtained by Algorithm 2, PMD and CCA EN is a primal projection in R8015×l , while Wy obtained by SCCA PD is a dual projection in Rn×l (n denotes the number of documents in the training data). For simplicity and consistency, we still use the violation of orthogonality constraint

kWyT Ky KyT Wy −Il kF √

l

kWyT Y Y T Wy −Il kF √ l

to denote

with respect to their sizes.

Our numerical results using Data Set I and Data Set II are listed in Table 4.6 and Table 4.7, respectively, where l denotes the number of columns of Wx (also Wy ) used in the experiment, ’FULL’ means all canonical weights corresponding to positive canonical correlations are used; more precisely, ’FULL’ corresponds to l = 49 in Table 4.6 and l = 399 in Table 4.7. Figure 4.3 shows the retrieval performance obtained by both non-sparse and sparse CCA algorithms on both data sets. 1

1 0.95

0.9 0.9

0.8

0.8

Average AROC

Average AROC

0.85 CCA SCCA ℓ1

0.75

PMD 0.7

CCA EN

0.65

SCCA PD

0.7 CCA SCCA ℓ1 0.6

0.6

SCCA PD

0.5

0.55 0.5

0

10

20 30 40 Number of columns of (Wx,Wy) used

(a)

50

0.4

0

50

100 150 200 250 300 Number of columns of (Wx,Wy) used

350

400

(b)

Figure 4.3: Average AROC achieved by CCA and sparse CCA as a function of the number of columns of (Wx , Wy ) used: (a) Data Set I, (b) Data Set II. According to Table 4.6 and Figure 4.3(a), SCCA `1 has comparable performance with PMD, and satisfies the orthogonality constraints much better than other sparse

4.3 Numerical Results

97

Table 4.6: Average AROC of standard CCA and sparse CCA algorithms using Data Set I (French to English). l

35

40

45

FULL

CCA

98.91

98.97

99.03

99.03

SCCA `1

84.34

84.69

84.65

85.29

PMD

85.85

87.50

87.83

88.94

CCA EN

98.29

98.50

98.05

98.15

SCCA PD

62.01

59.33

58.48

57.59

SCCA `1

98.96

98.96

98.97

98.97

sparsity of

PMD

99.53

99.55

99.54

99.45

Wx ∈ R5383×l (%)

CCA EN

59.78

58.61

57.70

57.11

SCCA PD

99.57

99.60

99.63

99.65

SCCA `1

99.31

99.31

99.31

99.31

PMD

99.43

99.43

99.43

99.32

CCA EN

60.19

59.29

58.59

58.14

SCCA PD

96.63

96.75

96.67

96.61

SCCA `1

1.1437e-5

1.2611e-5

1.4081e-5

1.4454e-5

PMD

6.5088e-1

7.0284e-1

7.4672e-1

8.1044e-1

CCA EN

1.3778e-1

1.4480e-1

1.5223e-1

1.6181e-1

SCCA PD

2.8121e-1

3.2218e-1

4.1094e-1

5.0629e-1

SCCA `1

1.4254e-5

1.4785e-5

1.4773e-5

1.4860e-5

PMD

6.3700e-1

7.1013e-1

7.6125e-1

8.8940e-1

CCA EN

1.2534e-1

1.3226e-1

1.3625e-1

1.5350e-1

SCCA PD

6.0289e-1

7.8656e-1

9.4589e-1

1.0346

AROC (%)

sparsity of Wy (%)

kWxT XX T Wx −Il kF √ l

kWyT Y Y T Wy −Il kF √ l

CCA algorithms. CCA EN achieves larger AROC than other sparse CCA algorithms, but it computes Wx and Wy with the lowest sparsity (less than 60%). SCCA PD achieves the smallest AROC. From Table 4.7 and Figure 4.3(b), we see that the average AROC achieved by standard (non-sparse) CCA stays in a very high level, which reveals that the

98

Chapter 4. Sparse Canonical Correlation Analysis Table 4.7: Average AROC of standard CCA, SCCA `1 and SCCA PD using Data Set II (French to English). l

100

200

300

FULL

CCA

99.42

99.52

99.54

99.55

SCCA `1

94.02

95.34

96.28

96.74

SCCA PD

90.59

84.60

77.97

71.69

sparsity of

SCCA `1

92.30

92.30

92.30

92.30

Wx ∈ R5383×l (%)

SCCA PD

97.40

97.54

97.81

98.20

SCCA `1

94.80

94.81

94.81

94.81

SCCA PD

99.70

99.58

99.13

98.77

SCCA `1

8.1942e-6

1.1937e-5

1.4906e-5

1.7541e-5

SCCA PD

3.6583e-01

7.2186e-01

9.8499e-01

1.2085

SCCA `1

7.9308e-6

1.1524e-5

1.4377e-5

1.6682e-5

SCCA PD

2.0598

3.9042

5.7919

7.1947

AROC (%)

sparsity of Wy (%)

kWxT XX T Wx −Il kF √ l

kWyT Y Y T Wy −Il kF √ l

standard CCA approach is efficient for cross-language document retrieval tasks. For sparse CCA, the average AROC achieved by SCCA `1 is also rather high; the canonical loadings Wx and Wy are very sparse, and the sparsity is higher than 92%. Indeed, SCCA `1 can compute canonical loadings (Wx , Wy ) with high sparsity and achieve retrieval performance close to the standard CCA. We can observe from Figure 4.3(b) that the average AROC obtained by SCCA `1 is high (more than 90%) when only 60 columns of Wx and Wy are used and it increases when the number of columns of (Wx , Wy ) used increases. The average AROC gap between the regular CCA and SCCA `1 is decreasing when more canonical components are used for retrieval. From Table 4.7, we see that both SCCA `1 and SCCA PD can compute highly sparse Wx and Wy , but the average AROC achieved by Algorithm SCCA PD is lower than that achieved by SCCA `1 when l ≥ 80 and decreases when more columns of (Wx , Wy ) are used for retrieval task, as observed from Figure 4.3(b). Table 4.7 also shows that SCCA `1 can compute sparse Wx and Wy that satisfy

4.4 Conclusions

99

the orthogonality constraints well, since

kWxT XX T Wx −Il kF √ l

and

kWxT Y Y T Wx −Il kF √ l

ob-

tained by SCCA `1 are much smaller than those obtained by Algorithm SCCA PD, which implies that SCCA `1 can extract mutually uncorrelated canonical variables X T Wx , as well as mutually uncorrelated canonical variables Y T Wy . In contrast, the values of

kWxT XX T Wx −Il kF √ l

and

kWxT Y Y T Wx −Il kF √ l

associated with Algorithm SCCA PD

are large, which means that the computed sparse (Wx , Wy ) is far away from the solutions of CCA problem. (Wx , Wy ) in Figure 4.3(a) is obtained from Data Set I which contains only 50 training data, but (Wx , Wy ) in Figure 4.3(b) is obtained from Data Set II which contains 400 training data. When we compare Figure 4.3(a) and Figure 4.3(b), we can see AROC in Figure 4.3(b) is better than that in Figure 4.3(a) when (Wx , Wy ) used for retrieval have the same number of columns. This is reasonable, since when more training data are used, the computed (Wx , Wy ) generally contains more information about the data, which usually results in better retrieval performance (higher AROC).

4.4

Conclusions

In this chapter, based on the explicit characterization of all solutions of CCA obtained in Chapter 3, we developed a new sparse CCA algorithm named SCCA `1 . Compared with some existing sparse CCA algorithms, our sparse CCA algorithm has the following properties: (i) It computes a sparse solution of CCA problem by solving two `1 -minimization problems using the accelerated linearized Bregman iteration, and thus it is very simple and can be implemented easily. (ii) It computes multiple sparse projections for CCA problem simultaneously, and the computed sparse solution is an exact solution of CCA problem to the accuracy of given tolerance . In contrast, existing algorithms have to recursively

100

Chapter 4. Sparse Canonical Correlation Analysis compute multiple sparse projections, and thus the orthogonality constraints in CCA problem may not be satisfied. In our numerical examples, such computed sparse solution can be far away from the solution of CCA problem. We have tested several simulated data sets and real-world data sets obtained from gene classification and cross-language document retrieval to demonstrate the effectiveness of our sparse CCA algorithm. The performance of our sparse CCA algorithm is competitive with the state-of-the-art sparse CCA algorithms. There are some interesting research problems related to sparse CCA worthy of further study which are described as follows: 1. Different methods may have a different set of parameters to be used, they all seem to be selected in a heuristic fashion. There are some measures for the goodness of sparse CCA algorithms, these measures include “accuracy”, “orthogonality”, and “sparsity”. Some algorithms perform better in term of some measures and not in others. Hence, it is interesting to investigate whether accuracy and/or orthogonality can be shown as a function of sparsity and consequently define the level of goodness for sparse CCA algorithms. 2. Theorem 2.5 claims the existence of a finite µ∗x in the accelerated linearized ∗ Bregman method (4.5), which ensures that Wx,µ = Wx∗ for any µx ≥ µ∗x . A x

recent study [98] provides an estimation of µ∗x under some mild conditions. It is interesting to apply the results in [98] to obtain a computable estimation of µ∗x in the (accelerated) linearized Bregman method (4.5).

Chapter

5

Sparse Kernel Canonical Correlation Analysis Over the last decades, a lot of powerful kernel-based methods, such as support vector machines (SVMs) [19, 123], kernel principal component analysis (KPCA) [124, 122], kernel Fisher discriminant analysis (KFDA) [104] and kernel canonical correlation analysis (kernel CCA) [1, 4, 102], have been proposed and become rather popular in the field of non-linear data analysis, especially in machine learning. These kernel-based learning methods have been successfully applied to various areas, see [108, 123, 127] for more details. One of the significant limitations of kernel methods is the lack of sparsity in dual transformations such that the kernel function must be evaluated at all training data points, which can lead to excessive computational time to compute projections of new data. To deal with this limitation, we shall find sparse solutions for kernel methods so that projections of new data can be computed by evaluating the kernel function at a subset of the training data. Another motivation for studying sparse kernel methods is that kernel methods are prone to over-fitting as pointed out in [123]. There are many sparse kernel approaches [15], such as SVMs, relevance vector machine [140] and sparse kernel partial least squares [46, 107], but seldom can these

101

102

Chapter 5. Sparse Kernel Canonical Correlation Analysis be found in the area of sparse kernel CCA except [6, 136]. In this chapter, we study sparse kernel CCA via utilizing established results on CCA in Chapter 3, aiming at computing sparse dual transformations and alleviating over-fitting problem of kernel CCA, simultaneously. In Section 5.1, we give a brief introduction to general kernel methods. In Section 5.2, we present a detailed derivation of kernel CCA. In Section 5.3, we first establish a relationship between CCA and least squares problems, and then extend this relationship to kernel CCA. In Section 5.4, based on the relationship, we incorporate sparsity into kernel CCA by penalizing the least squares term with `1 -norm and propose a novel sparse kernel CCA algorithm, named SKCCA. Numerical results of applying the newly proposed algorithm to various applications and comparative empirical results with other algorithms are presented in Section 5.5. Finally, we draw some conclusions in Section 5.6.

5.1

An Introduction to Kernel Methods

Suppose we are given a set of training data {xi }ni=1 ⊆ Rd

1

and a learning algorithm

that utilizes the training data only through their inner products {hxi , xj i}i,j . In kernel methods, we first implicitly map the input data as elements in a potentially high (possibly infinitely) dimensional feature space H via a non-linear mapping φ : Rd → H x → φ(x), then apply the given algorithm on the mapped data, that is, one works with φ(x1 ), · · · , φ(xn ) ∈ H. By appropriately choosing the mapping φ, one can obtain a kernel algorithm with all required variability and richness. 1

In fact, the assumption xi ∈ Rd is not required, we only need to assume that inner products

{hxi , xj i}i,j can be defined.

5.1 An Introduction to Kernel Methods

103

We further require $\mathcal{H}$ to be complete and equipped with an inner product, and consider applying the given algorithm to the mapped data $\{\phi(x_i)\}_{i=1}^n$. Since the algorithm depends only on inner products between samples, it is well defined in the feature space $\mathcal{H}$. For certain feature spaces $\mathcal{H}$ and associated mappings $\phi$, there is an effective way of computing inner products in the feature space using kernel functions. Suppose there exists a kernel function $\kappa(\cdot,\cdot)$ associated with the feature space $\mathcal{H}$ such that
$$\kappa(x, x') = \langle \phi(x), \phi(x')\rangle \quad \text{for all } x, x' \in \mathbb{R}^d.$$
Since the algorithm depends only on the inner products $\langle\phi(x_i), \phi(x_j)\rangle$, the mapping $\phi(x)$ need never be explicitly computed; one can always replace the inner products with kernel function values when needed. In the literature, such a feature space $\mathcal{H}$ is called a Reproducing Kernel Hilbert Space (RKHS) [147] associated with the reproducing kernel $\kappa$. Some widely used kernel functions include:

• Gaussian Radial Basis Function (Gaussian RBF)
$$\kappa(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right), \qquad \sigma \in \mathbb{R}; \tag{5.1}$$

• Polynomial
$$\kappa(x, y) = \big(\gamma_1 (x \cdot y) + \gamma_2\big)^d, \qquad d \in \mathbb{N},\ \gamma_1, \gamma_2 \in \mathbb{R}; \tag{5.2}$$

• Hyperbolic tangent (Sigmoidal)
$$\kappa(x, y) = \tanh\big(\gamma_1 (x \cdot y) + \gamma_2\big), \qquad \gamma_1, \gamma_2 \in \mathbb{R}. \tag{5.3}$$

As pointed out in [124], a significant advantage of kernel functions is that inner products in the feature space $\mathcal{H}$ can be computed implicitly without using the mapping $\phi$; for any given algorithm that uses the data only through inner products, the algorithm can be executed implicitly in $\mathcal{H}$ by using the kernel $\kappa$. For example, linear algorithms can be extended to sophisticated non-linear versions simply by the choice of the kernel function $\kappa$.
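As an illustration of how such a kernel is used in practice, the following is a minimal sketch (in Python with NumPy, an assumption for exposition only; the experiments in Section 5.5 were carried out in MATLAB) of assembling the Gaussian RBF kernel matrix (5.1) from training data stored column-wise:

    import numpy as np

    def rbf_kernel_matrix(X, sigma):
        # X has shape (d, n): n training points in R^d, one point per column.
        sq_norms = np.sum(X**2, axis=0)                    # ||x_i||^2 for each column
        sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X.T @ X)
        # Clip tiny negative values caused by round-off before exponentiating.
        return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

The function name and interface are hypothetical; any routine producing the matrix $[\kappa(x_i, x_j)]_{i,j}$ serves the same purpose.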


5.2  Kernel CCA

Since ordinary CCA only considers linear transformations of the original variables, it cannot capture non-linear relations among variables. However, in a wide range of practical problems linear relations may not be adequate for studying relations among variables. Detecting non-linear relations among data is important and useful in modern data analysis, especially when dealing with data that are not in the form of vectors, such as text documents, images, microarray data and so on. A natural extension, therefore, is to explore and exploit non-linear relations among data. Non-linear CCA has received considerable attention [42, 99], and one of the most frequently used approaches is the kernel generalization of CCA, named kernel canonical correlation analysis (kernel CCA) [1, 4, 60, 63, 72, 73, 97, 99, 102, 127].

Let $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^{d_1}$ and $\{y_i\}_{i=1}^n \subseteq \mathbb{R}^{d_2}$ be $n$ samples of variables $x$ and $y$, respectively. Denote
$$X = [x_1 \ \cdots \ x_n] \in \mathbb{R}^{d_1\times n}, \qquad Y = [y_1 \ \cdots \ y_n] \in \mathbb{R}^{d_2\times n},$$
and assume both $\{x_i\}_{i=1}^n$ and $\{y_i\}_{i=1}^n$ have zero mean, i.e., $\sum_{i=1}^n x_i = 0$ and $\sum_{i=1}^n y_i = 0$.

The main idea of kernel CCA is to first virtually map the data $X$ into a high-dimensional feature space $\mathcal{H}_x$ via a mapping $\phi_x$ such that the data in the feature space become
$$\Phi_x = [\phi_x(x_1) \ \cdots \ \phi_x(x_n)] \in \mathbb{R}^{N_x\times n},$$
where $N_x$ is the dimension of the feature space $\mathcal{H}_x$, which can be very high or even infinite. The mapping $\phi_x$ from the input data to the feature space $\mathcal{H}_x$ is performed implicitly by considering a kernel function $\kappa_x$ satisfying $\kappa_x(x_1, x_2) = \langle\phi_x(x_1), \phi_x(x_2)\rangle$, where $\langle\cdot,\cdot\rangle$ is an inner product in $\mathcal{H}_x$, rather than by giving the coordinates of $\phi_x(x)$ explicitly. In the same way, we can map $Y$ into a feature space $\mathcal{H}_y$ associated with a kernel $\kappa_y$ through a mapping $\phi_y$ such that
$$\Phi_y = [\phi_y(y_1) \ \cdots \ \phi_y(y_n)] \in \mathbb{R}^{N_y\times n}.$$

After mapping $X$ to $\Phi_x$ and $Y$ to $\Phi_y$, we then apply ordinary linear CCA to the data pair $(\Phi_x, \Phi_y)$. Since, intrinsically, kernel CCA performs ordinary CCA on $\Phi_x$ and $\Phi_y$, the solutions of kernel CCA should be obtained by virtually solving
$$\begin{aligned} \max_{\mathcal{W}_x, \mathcal{W}_y}\ & \operatorname{Trace}(\mathcal{W}_x^T \Phi_x \Phi_y^T \mathcal{W}_y) \\ \text{s.t.}\ & \mathcal{W}_x^T \Phi_x \Phi_x^T \mathcal{W}_x = I, \ \mathcal{W}_x \in \mathbb{R}^{N_x\times l}, \\ & \mathcal{W}_y^T \Phi_y \Phi_y^T \mathcal{W}_y = I, \ \mathcal{W}_y \in \mathbb{R}^{N_y\times l}, \end{aligned} \tag{5.4}$$
where $\mathcal{W}_x$ and $\mathcal{W}_y$ denote the transformations in the feature spaces. Recall from Theorem 3.6 in Chapter 3 that each solution of CCA for the data pair $(X, Y)$ can be expressed as $X W_x + W_x^{\perp}$ and $Y W_y + W_y^{\perp}$, where $W_x^{\perp}$ and $W_y^{\perp}$ are orthogonal to the range space of $X$ and $Y$, respectively. Since the optimization problem (5.4) has the same form as the optimization problem (3.5) of CCA, each solution $(\mathcal{W}_x, \mathcal{W}_y)$ of the kernel CCA problem (5.4) can be represented as
$$\mathcal{W}_x = \Phi_x W_x + W_x^{\perp}, \qquad \mathcal{W}_y = \Phi_y W_y + W_y^{\perp}, \tag{5.5}$$
where $W_x, W_y \in \mathbb{R}^{n\times l}$ are usually called dual transformation matrices, and $W_x^{\perp}$ and $W_y^{\perp}$ are orthogonal to the range space of $\Phi_x$ and $\Phi_y$, respectively. Substituting (5.5) into (5.4), we have
$$\mathcal{W}_x^T \Phi_x \Phi_y^T \mathcal{W}_y = W_x^T K_x K_y W_y, \qquad \mathcal{W}_x^T \Phi_x \Phi_x^T \mathcal{W}_x = W_x^T K_x^2 W_x, \qquad \mathcal{W}_y^T \Phi_y \Phi_y^T \mathcal{W}_y = W_y^T K_y^2 W_y,$$
where
$$K_x = \langle\Phi_x, \Phi_x\rangle = [\kappa_x(x_i, x_j)]_{i,j=1}^{n} \in \mathbb{R}^{n\times n}, \qquad K_y = \langle\Phi_y, \Phi_y\rangle = [\kappa_y(y_i, y_j)]_{i,j=1}^{n} \in \mathbb{R}^{n\times n} \tag{5.6}$$
are the matrices consisting of inner products of the data sets $\Phi_x$ and $\Phi_y$, respectively. $K_x$ and $K_y$ are called kernel matrices or Gram matrices.

Thus, the computation of the transformations of kernel CCA can be converted into the computation of the dual transformation matrices $W_x$ and $W_y$ by solving the following optimization problem
$$\begin{aligned} \max_{W_x, W_y}\ & \operatorname{Trace}(W_x^T K_x K_y W_y) \\ \text{s.t.}\ & W_x^T K_x^2 W_x = I, \ W_x \in \mathbb{R}^{n\times l}, \\ & W_y^T K_y^2 W_y = I, \ W_y \in \mathbb{R}^{n\times l}, \end{aligned} \tag{5.7}$$
which is used as the criterion of kernel CCA in this chapter. A solution of the kernel CCA problem (5.7) can be obtained by solving the following generalized eigenvalue problem
$$\begin{bmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{bmatrix}\begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \lambda \begin{bmatrix} K_x^2 & 0 \\ 0 & K_y^2 \end{bmatrix}\begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \tag{5.8}$$
where $(\alpha, \beta)$ is a pair of dual vectors (transformations).

As can be seen from the analysis above, the terms $W_x^{\perp}$ and $W_y^{\perp}$ in (5.5) do not contribute to the canonical correlations between $\Phi_x$ and $\Phi_y$, and are therefore usually neglected in practice. This is also consistent with the Representer Theorem [123]. Therefore, when we are given a set of testing data $X_t = [x_t^1 \ \cdots \ x_t^N]$ consisting of $N$ points, the projection of $X_t$ onto the kernel CCA direction $\mathcal{W}_x$ can be computed by first mapping $X_t$ into the feature space $\mathcal{H}_x$ and then computing its inner product with $\mathcal{W}_x$. More specifically, suppose $\Phi_{x,t} = [\phi_x(x_t^1) \ \cdots \ \phi_x(x_t^N)]$ is the image of $X_t$ in the feature space $\mathcal{H}_x$; then the projection of $X_t$ onto the kernel CCA direction $\mathcal{W}_x$ is given by
$$\mathcal{W}_x^T \Phi_{x,t} = W_x^T K_{x,t}, \tag{5.9}$$
where $K_{x,t} = \langle\Phi_x, \Phi_{x,t}\rangle = [\kappa_x(x_i, x_t^j)]_{i=1:n}^{j=1:N} \in \mathbb{R}^{n\times N}$ is the matrix of kernel evaluations between $X_t$ and all training data $X$. A similar process can be adopted to compute projections of new data drawn from the variable $y$.

In the process of deriving (5.7), we have implicitly assumed that the data $\Phi_x$ and $\Phi_y$ have been centered (that is, the column means of both $\Phi_x$ and $\Phi_y$ are zero); otherwise, we need to perform data centering before applying kernel CCA. Unlike the centering of $X$ and $Y$ in the primal representation, we cannot perform data centering directly on $\Phi_x$ and $\Phi_y$, since we do not know their explicit coordinates. However, as shown in [123, 124], data centering in an RKHS can be accomplished via operations on the kernel matrices. To center $\Phi_x$, a natural idea is to compute $\Phi_{x,c} = \Phi_x\big(I - \frac{e_n e_n^T}{n}\big)$, where $e_n$ denotes the column vector in $\mathbb{R}^n$ with all entries equal to 1. Since kernel CCA makes use of the data only through the kernel matrix $K_x$, the centering can be performed on $K_x$ as
$$K_{x,c} = \langle\Phi_{x,c}, \Phi_{x,c}\rangle = \Big(I - \frac{e_n e_n^T}{n}\Big)\langle\Phi_x, \Phi_x\rangle\Big(I - \frac{e_n e_n^T}{n}\Big) = \Big(I - \frac{e_n e_n^T}{n}\Big)K_x\Big(I - \frac{e_n e_n^T}{n}\Big). \tag{5.10}$$
Similarly, we can center the testing data as
$$K_{x,t,c} = \Big\langle\Phi_{x,c},\ \Phi_{x,t} - \Phi_x\frac{e_n e_N^T}{n}\Big\rangle = \Big(I - \frac{e_n e_n^T}{n}\Big)K_{x,t} - \Big(I - \frac{e_n e_n^T}{n}\Big)K_x\frac{e_n e_N^T}{n}. \tag{5.11}$$
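The centering operations (5.10)-(5.11) involve only the kernel matrices, so they can be implemented directly; the following minimal sketch (Python/NumPy, an illustrative assumption rather than the code used in this thesis) mirrors the two formulas:

    import numpy as np

    def center_kernels(Kx, Kxt):
        # Kx: n-by-n training kernel matrix; Kxt: n-by-N train/test kernel matrix.
        n, N = Kx.shape[0], Kxt.shape[1]
        Hn = np.eye(n) - np.ones((n, n)) / n                  # I - e_n e_n^T / n
        Kx_c = Hn @ Kx @ Hn                                   # equation (5.10)
        Kxt_c = Hn @ (Kxt - Kx @ np.ones((n, N)) / n)         # equation (5.11)
        return Kx_c, Kxt_c

With the centered matrices, the test projections in (5.9) are simply $W_x^T K_{x,t,c}$.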

More details about data centering in an RKHS can be found in [123, 124]. In the remainder of this chapter, we assume that the given data have been centered in the dual space.

Several papers have studied properties of kernel CCA, including the geometry of kernel CCA [97], convergence analysis of kernel CCA [72] and statistical consistency of kernel CCA [60]. Kernel CCA has been successfully applied in many fields, including cross-language document retrieval [144], content-based image retrieval [73], bioinformatics [143, 158] and independent component analysis [4, 63].

Despite the wide usage of kernel CCA, like other kernel methods (e.g., KPCA), it has a significant limitation, namely the lack of sparsity in the dual transformation matrices $W_x$ and $W_y$. From (5.9) we can see that when the dual transformation matrix $W_x$ is dense, the kernel function values $\kappa_x(x_i, x)$ must be evaluated at all $\{x_i\}_{i=1}^n$ in order to compute the projection of a new data point $x$, which can lead to excessive computational time. To handle this limitation of kernel CCA, we shall find sparse solutions for kernel CCA so that projections of new data can be computed by evaluating the kernel function at only a small subset of the training data. Although there are many sparse kernel approaches [15], such as support vector machines [123], the relevance vector machine [140] and sparse kernel partial least squares [46, 107], few such approaches can be found in the area of sparse kernel CCA.


5.3  Kernel CCA Versus Least Squares Problem

Note from the least distance formulation of CCA in (3.3) that when one of $X$ and $Y$ is one-dimensional, CCA is equivalent to least squares estimation for a linear regression problem. For more general cases, a relation between CCA and least squares problems has been established in [134] under the condition that $\operatorname{rank}(X) = n-1$ and $\operatorname{rank}(Y) = d_2$. In this section, we establish a relation between CCA and least squares problems without any additional constraint on $X$ or $Y$, which will then be extended to kernel CCA.

Note from Lemma 4.1 that any $(W_x, W_y)$ of the form
$$W_x = U_1\Sigma_1^{-1}P_1(:, 1{:}l) + U_2 E, \qquad W_y = V_1\Sigma_2^{-1}P_2(:, 1{:}l) + V_2 F,$$
is a solution of the CCA problem (3.5), where $E \in \mathbb{R}^{(d_1-r)\times l}$ and $F \in \mathbb{R}^{(d_2-s)\times l}$ are arbitrary. Suppose the matrix factorizations (3.10)-(3.13) have been computed, and let
$$T_x = Y^T\big[(YY^T)^{\frac{1}{2}}\big]^{\dagger}V_1 P_2(:, 1{:}l)\,\Sigma(1{:}l, 1{:}l)^{-1} = Q_2 P_2(:, 1{:}l)\,\Sigma(1{:}l, 1{:}l)^{-1}, \tag{5.12}$$
$$T_y = X^T\big[(XX^T)^{\frac{1}{2}}\big]^{\dagger}U_1 P_1(:, 1{:}l)\,\Sigma(1{:}l, 1{:}l)^{-1} = Q_1 P_1(:, 1{:}l)\,\Sigma(1{:}l, 1{:}l)^{-1}, \tag{5.13}$$
where $1 \le l \le m$. Then we have the following theorem.

Theorem 5.1. For any $l$ satisfying $1 \le l \le m$, suppose $W_x \in \mathbb{R}^{d_1\times l}$ and $W_y \in \mathbb{R}^{d_2\times l}$ satisfy
$$W_x = \arg\min\{\|X^T W_x - T_x\|_F^2 : W_x \in \mathbb{R}^{d_1\times l}\} \tag{5.14}$$
and
$$W_y = \arg\min\{\|Y^T W_y - T_y\|_F^2 : W_y \in \mathbb{R}^{d_2\times l}\}, \tag{5.15}$$
where $T_x$ and $T_y$ are defined in (5.12) and (5.13), respectively. Then $W_x$ and $W_y$ form a solution of the optimization problem (3.5), that is, a solution of CCA.


Proof. Since (5.14) and (5.15) have the same form, we only prove the result for $W_x$; the same idea can be applied to $W_y$. We know that $W_x$ is a solution of (5.14) if and only if it satisfies the normal equation
$$XX^T W_x = X T_x. \tag{5.16}$$
Substituting the factorizations (3.10), (3.11) and (3.13) into the equation above, we get $XX^T = U_1\Sigma_1^2 U_1^T$ and
$$X T_x = U_1\Sigma_1 Q_1^T Q_2 P_2(:, 1{:}l)\,\Sigma(1{:}l, 1{:}l)^{-1} = U_1\Sigma_1 P_1(:, 1{:}l),$$
which yield an equivalent reformulation of (5.16):
$$U_1\Sigma_1^2 U_1^T W_x = U_1\Sigma_1 P_1(:, 1{:}l). \tag{5.17}$$
It is easy to check that $W_x$ is a solution of (5.17) if and only if
$$W_x = U_1\Sigma_1^{-1}P_1(:, 1{:}l) + U_2 E, \tag{5.18}$$
where $E \in \mathbb{R}^{(d_1-r)\times l}$ is an arbitrary matrix. Therefore, $W_x$ is a solution of (5.14) if and only if $W_x$ can be written in the form (5.18). Similarly, $W_y$ is a solution of (5.15) if and only if $W_y$ can be written as
$$W_y = V_1\Sigma_2^{-1}P_2(:, 1{:}l) + V_2 F, \tag{5.19}$$
where $F \in \mathbb{R}^{(d_2-s)\times l}$ is an arbitrary matrix. Now, comparing equations (5.18) and (5.19) with equation (4.1) in Lemma 4.1, we conclude that for any solution $W_x$ of the least squares problem (5.14) and any solution $W_y$ of the least squares problem (5.15), $W_x$ and $W_y$ form a solution of the optimization problem (3.5), hence a solution of CCA.

Remark 5.2. In Theorem 5.1 we only consider $l$ satisfying $1 \le l \le m$. This is reasonable, since there are $m$ non-zero canonical correlations between $X$ and $Y$, and weight vectors corresponding to zero canonical correlations do not contribute to the correlation between the data $X$ and $Y$.
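In particular, taking $E = 0$ and $F = 0$ in (5.18)-(5.19) corresponds to the minimum-norm least squares solutions, which standard solvers return directly. The following minimal sketch (Python/NumPy, an assumption for exposition) illustrates this, taking the targets $T_x$ and $T_y$ of (5.12)-(5.13) as given:

    import numpy as np

    def cca_from_least_squares(X, Y, Tx, Ty):
        # Minimum-norm solutions of (5.14) and (5.15); lstsq handles the
        # rank-deficient case and returns the least-norm minimizer.
        Wx, *_ = np.linalg.lstsq(X.T, Tx, rcond=None)
        Wy, *_ = np.linalg.lstsq(Y.T, Ty, rcond=None)
        return Wx, Wy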

Since the kernel CCA criterion (5.7) and the CCA criterion (3.5) have the same form, we can expect a characterization of the solutions of the optimization problem (5.7) similar to Theorem 3.6. Define
$$\hat{r} = \operatorname{rank}(K_x), \qquad \hat{s} = \operatorname{rank}(K_y), \qquad \hat{m} = \operatorname{rank}(K_x K_y^T),$$
and let the eigenvalue decompositions of $K_x$ and $K_y$ be, respectively,
$$K_x = U\begin{bmatrix}\Pi_1 & 0 \\ 0 & 0\end{bmatrix}U^T = \begin{bmatrix}U_1 & U_2\end{bmatrix}\begin{bmatrix}\Pi_1 & 0 \\ 0 & 0\end{bmatrix}\begin{bmatrix}U_1 & U_2\end{bmatrix}^T = U_1\Pi_1 U_1^T \tag{5.20}$$
and
$$K_y = V\begin{bmatrix}\Pi_2 & 0 \\ 0 & 0\end{bmatrix}V^T = \begin{bmatrix}V_1 & V_2\end{bmatrix}\begin{bmatrix}\Pi_2 & 0 \\ 0 & 0\end{bmatrix}\begin{bmatrix}V_1 & V_2\end{bmatrix}^T = V_1\Pi_2 V_1^T, \tag{5.21}$$
where $U \in \mathbb{R}^{n\times n}$, $U_1 \in \mathbb{R}^{n\times\hat{r}}$, $U_2 \in \mathbb{R}^{n\times(n-\hat{r})}$, $\Pi_1 \in \mathbb{R}^{\hat{r}\times\hat{r}}$, $V \in \mathbb{R}^{n\times n}$, $V_1 \in \mathbb{R}^{n\times\hat{s}}$, $V_2 \in \mathbb{R}^{n\times(n-\hat{s})}$, $\Pi_2 \in \mathbb{R}^{\hat{s}\times\hat{s}}$, $U$ and $V$ are orthogonal, and $\Pi_1$ and $\Pi_2$ are nonsingular and diagonal. In addition, let
$$U_1^T V_1 = P_1\Pi P_2^T \tag{5.22}$$
be the singular value decomposition of $U_1^T V_1$, where $P_1 \in \mathbb{R}^{\hat{r}\times\hat{r}}$ and $P_2 \in \mathbb{R}^{\hat{s}\times\hat{s}}$ are orthogonal and $\Pi \in \mathbb{R}^{\hat{r}\times\hat{s}}$ is a diagonal matrix. Then we can prove, for $1 \le l \le \min\{\hat{r}, \hat{s}\}$, that
$$W_x = U_1\Pi_1^{-1}P_1(:, 1{:}l) + U_2 E, \qquad W_y = V_1\Pi_2^{-1}P_2(:, 1{:}l) + V_2 F, \tag{5.23}$$
with $E \in \mathbb{R}^{(n-\hat{r})\times l}$ and $F \in \mathbb{R}^{(n-\hat{s})\times l}$ arbitrary matrices, form a subset of the solutions of the optimization problem (5.7). Similar to Theorem 5.1, solutions of kernel CCA can also be associated with least squares problems, as described in the following theorem.


Theorem 5.3. Define
$$T_x = K_x K_x^{\dagger}V_1 P_2(:, 1{:}l)\,\big(\Pi(1{:}l, 1{:}l)\big)^{-1} = U_1 P_1(:, 1{:}l), \tag{5.24}$$
$$T_y = K_y K_y^{\dagger}U_1 P_1(:, 1{:}l)\,\big(\Pi(1{:}l, 1{:}l)\big)^{-1} = V_1 P_2(:, 1{:}l), \tag{5.25}$$
with $1 \le l \le \hat{m}$. Then each pair of $W_x$ and $W_y$ satisfying
$$W_x = \arg\min\{\|K_x W_x - T_x\|_F^2 : W_x \in \mathbb{R}^{n\times l}\} \tag{5.26}$$
and
$$W_y = \arg\min\{\|K_y W_y - T_y\|_F^2 : W_y \in \mathbb{R}^{n\times l}\}, \tag{5.27}$$
respectively, forms a solution of kernel CCA (5.7).

The proof of Theorem 5.3 is very similar to that of Theorem 5.1 and is hence omitted. Theorem 5.3 will be applied to derive the sparse kernel CCA algorithm in Section 5.4. Before that, we investigate the over-fitting problem of kernel CCA. The canonical correlations in kernel CCA depend only on the kernel matrices $K_x$ and $K_y$; therefore, as can be seen from the factorizations (5.20)-(5.22), the canonical correlations in kernel CCA are determined by the singular values of $U_1^T V_1$. The following lemma reveals a simple result regarding the distribution of the canonical correlations.

Lemma 5.4. Let $\hat{r} = \operatorname{rank}(K_x)$ and $\hat{s} = \operatorname{rank}(K_y)$. If $\hat{r} + \hat{s} = n + \gamma$ for some $\gamma > 0$, then $U_1^T V_1$ has at least $\gamma$ singular values equal to 1.

Proof. Since $U_1 \in \mathbb{R}^{n\times\hat{r}}$, $U_2 \in \mathbb{R}^{n\times(n-\hat{r})}$ and $V_1 \in \mathbb{R}^{n\times\hat{s}}$ are column orthogonal and $U_1 U_1^T + U_2 U_2^T = I_n$, we have
$$(U_1^T V_1)^T U_1^T V_1 = V_1^T U_1 U_1^T V_1 = I_{\hat{s}} - V_1^T U_2 U_2^T V_1.$$
If there exists $\gamma > 0$ such that $\hat{r} + \hat{s} = n + \gamma$, then $n - \hat{r} = \hat{s} - \gamma < \hat{s}$ and
$$\operatorname{rank}(V_1^T U_2 U_2^T V_1) = \operatorname{rank}(U_2^T V_1) \le n - \hat{r},$$
which implies that $V_1^T U_2 U_2^T V_1$ has at least $\hat{s} - (n - \hat{r}) = \gamma$ zero eigenvalues. Thus, $(U_1^T V_1)^T U_1^T V_1$ has at least $\gamma$ eigenvalues equal to 1, which further implies that $U_1^T V_1$ has at least $\gamma$ singular values equal to 1.

A direct consequence of Lemma 5.4 is that at least $\gamma$ canonical correlations in kernel CCA are equal to 1. In kernel methods, due to the nonlinearity of kernel functions, the rank of the kernel matrices is typically very close to $n$, which forces most canonical correlations to be 1. For example, for the Gaussian RBF kernel one can prove that the kernel matrix $K_x$ given by
$$(K_x)_{ij} = \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\Big)$$
has full rank, provided the points $\{x_i\}_{i=1}^n$ are distinct. A similar result can be proven for the linear kernel
$$\kappa(x, y) = \langle x, y\rangle, \tag{5.28}$$
which is a special case of the polynomial kernel (5.2), when $\{x_i\}_{i=1}^n$ and $\{y_i\}_{i=1}^n$ are linearly independent, respectively. Thus, in kernel methods we usually have
$$\hat{r} = \operatorname{rank}(K_x) = n - 1, \qquad \hat{s} = \operatorname{rank}(K_y) = n - 1,$$
after centering the data. Since $K_x e = 0$ and $K_y e = 0$, we see that $U_1^T e = V_1^T e = 0$ and both $\big[U_1 \ \tfrac{e}{\sqrt{n}}\big]$ and $\big[V_1 \ \tfrac{e}{\sqrt{n}}\big]$ are orthogonal matrices. This implies that
$$\Big[U_1 \ \tfrac{e}{\sqrt{n}}\Big]^T\Big[V_1 \ \tfrac{e}{\sqrt{n}}\Big] = \begin{bmatrix}U_1^T V_1 & 0 \\ 0 & 1\end{bmatrix}$$
is orthogonal. In this case, all the non-zero canonical correlations, determined by the singular values of $U_1^T V_1$, are equal to 1. Therefore, ordinary kernel CCA fails to provide a useful estimate of the canonical correlations: for any distinct samples $\{x_i\}_{i=1}^n$ of variable $x$ and $\{y_i\}_{i=1}^n$ of variable $y$, the canonical correlations returned by kernel CCA with the Gaussian RBF kernel will all be 1 (i.e., the two views appear perfectly correlated), even though the variables $x$ and $y$ share no joint information.

To avoid the aforementioned over-fitting problem in kernel CCA, researchers have suggested regularized kernelizations of CCA [4, 14, 60, 73, 97]. One way of regularization is to penalize the weight vectors $w_x$ and $w_y$, leading to
$$\max_{\alpha,\beta}\ \frac{\alpha^T K_x K_y\beta}{\sqrt{(\alpha^T K_x^2\alpha + \lambda_x\alpha^T K_x\alpha)(\beta^T K_y^2\beta + \lambda_y\beta^T K_y\beta)}}, \tag{5.29}$$


where $\lambda_x$ and $\lambda_y$ are positive parameters. As shown in [14], the dual vectors $\alpha$ and $\beta$ solving the above optimization problem form an eigenvector of
$$\begin{bmatrix}0 & K_x K_y \\ K_y K_x & 0\end{bmatrix}\begin{bmatrix}\alpha \\ \beta\end{bmatrix} = \eta\begin{bmatrix}K_x^2 + \lambda_x K_x & 0 \\ 0 & K_y^2 + \lambda_y K_y\end{bmatrix}\begin{bmatrix}\alpha \\ \beta\end{bmatrix} \tag{5.30}$$
corresponding to the largest eigenvalue. Dual transformation matrices $W_x, W_y \in \mathbb{R}^{n\times l}$ can be obtained by computing the eigenvectors corresponding to the $l$ leading eigenvalues of (5.30). If we have the SVD
$$(\Pi_1 + \lambda_x I)^{-1/2}\,\Pi_1^{1/2}\,U_1^T V_1\,\Pi_2^{1/2}\,(\Pi_2 + \lambda_y I)^{-1/2} = Q_1\Lambda Q_2^T,$$
where $Q_1 \in \mathbb{R}^{\hat{r}\times\hat{r}}$ and $Q_2 \in \mathbb{R}^{\hat{s}\times\hat{s}}$ are orthogonal, then we can use
$$W_x = U_1(\Pi_1^2 + \lambda_x\Pi_1)^{-1/2}Q_1(:, 1{:}l), \qquad W_y = V_1(\Pi_2^2 + \lambda_y\Pi_2)^{-1/2}Q_2(:, 1{:}l), \qquad 1 \le l \le \hat{m}, \tag{5.31}$$
as a solution of the regularized kernel CCA (5.29).

Another way of regularizing kernel CCA was proposed in [4], where the following optimization problem was considered:
$$\max_{\alpha,\beta}\ \frac{\alpha^T K_x K_y\beta}{\sqrt{(\alpha^T(K_x + \lambda_x I)^2\alpha)(\beta^T(K_y + \lambda_y I)^2\beta)}}, \tag{5.32}$$
which can be equivalently formulated as a generalized eigenvalue problem
$$\begin{bmatrix}0 & K_x K_y \\ K_y K_x & 0\end{bmatrix}\begin{bmatrix}\alpha \\ \beta\end{bmatrix} = \eta\begin{bmatrix}(K_x + \lambda_x I)^2 & 0 \\ 0 & (K_y + \lambda_y I)^2\end{bmatrix}\begin{bmatrix}\alpha \\ \beta\end{bmatrix}. \tag{5.33}$$
If we have the SVD
$$(\Pi_1 + \lambda_x I)^{-1}\,\Pi_1\,U_1^T V_1\,\Pi_2\,(\Pi_2 + \lambda_y I)^{-1} = \hat{Q}_1\hat{\Lambda}\hat{Q}_2^T, \tag{5.34}$$
where $\hat{Q}_1 \in \mathbb{R}^{\hat{r}\times\hat{r}}$ and $\hat{Q}_2 \in \mathbb{R}^{\hat{s}\times\hat{s}}$ are orthogonal, then
$$W_x = U_1(\Pi_1 + \lambda_x I)^{-1}\hat{Q}_1(:, 1{:}l), \qquad W_y = V_1(\Pi_2 + \lambda_y I)^{-1}\hat{Q}_2(:, 1{:}l), \qquad 1 \le l \le \hat{m}, \tag{5.35}$$
is a solution of the regularized kernel CCA (5.32).
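To make the construction above concrete, here is a minimal sketch (Python/NumPy; the function name and the numerical tolerance are assumptions, not part of the original derivation) of computing the regularized dual transformations (5.31) from centered kernel matrices via the factorizations (5.20)-(5.22):

    import numpy as np

    def rkcca(Kx, Ky, lam_x, lam_y, l, tol=1e-10):
        # Eigenvalue decompositions (5.20)-(5.21); keep only the nonzero part.
        dx, Ux = np.linalg.eigh(Kx)
        dy, Vy = np.linalg.eigh(Ky)
        U1, pi1 = Ux[:, dx > tol], dx[dx > tol]
        V1, pi2 = Vy[:, dy > tol], dy[dy > tol]
        # SVD of (Pi1 + lx I)^(-1/2) Pi1^(1/2) U1^T V1 Pi2^(1/2) (Pi2 + ly I)^(-1/2).
        M = (np.sqrt(pi1 / (pi1 + lam_x))[:, None]
             * (U1.T @ V1)
             * np.sqrt(pi2 / (pi2 + lam_y))[None, :])
        Q1, _, Q2t = np.linalg.svd(M)
        # Dual transformations (5.31).
        Wx = (U1 / np.sqrt(pi1**2 + lam_x * pi1)) @ Q1[:, :l]
        Wy = (V1 / np.sqrt(pi2**2 + lam_y * pi2)) @ Q2t.T[:, :l]
        return Wx, Wy

The diagonal scalings are applied through elementwise broadcasting instead of forming $\Pi_1$ and $\Pi_2$ explicitly, which keeps the cost dominated by the eigenvalue decompositions.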

Like many other regularized learning methods, the performance of regularized kernel CCA depends on the choice of the tuning parameters $\lambda_x$ and $\lambda_y$. However, there is still a lack of theoretical guidance and of computationally efficient ways to choose optimal regularization parameters.

5.4  Sparse Kernel Canonical Correlation Analysis

Motivated by recent research on the Lasso [138] and compressed sensing [24, 25, 49], we incorporate sparsity into kernel CCA by using the established relationship between kernel CCA and the least squares problems (5.26)-(5.27) and considering the following $\ell_1$-norm penalized least squares problems:
$$\min\Big\{\tfrac{1}{2}\|K_x W_x - T_x\|_F^2 + \sum_{i=1}^{l}\rho_{x,i}\|W_{x,i}\|_1 : W_x \in \mathbb{R}^{n\times l}\Big\} \tag{5.36}$$
and
$$\min\Big\{\tfrac{1}{2}\|K_y W_y - T_y\|_F^2 + \sum_{i=1}^{l}\rho_{y,i}\|W_{y,i}\|_1 : W_y \in \mathbb{R}^{n\times l}\Big\}, \tag{5.37}$$
where $\rho_{x,i}, \rho_{y,i} > 0$ are regularization parameters and $W_{x,i}$, $W_{y,i}$ denote the $i$th columns of $W_x$ and $W_y$, respectively. When we set $\rho_{x,1} = \cdots = \rho_{x,l} = \rho_x > 0$ and $\rho_{y,1} = \cdots = \rho_{y,l} = \rho_y > 0$, problems (5.36) and (5.37) become
$$\min\Big\{\tfrac{1}{2}\|K_x W_x - T_x\|_F^2 + \rho_x\|W_x\|_1 : W_x \in \mathbb{R}^{n\times l}\Big\} \tag{5.38}$$
and
$$\min\Big\{\tfrac{1}{2}\|K_y W_y - T_y\|_F^2 + \rho_y\|W_y\|_1 : W_y \in \mathbb{R}^{n\times l}\Big\}, \tag{5.39}$$
where
$$\|W_x\|_1 = \sum_{i=1}^{n}\sum_{j=1}^{l}|W_x(i, j)|, \qquad \|W_y\|_1 = \sum_{i=1}^{n}\sum_{j=1}^{l}|W_y(i, j)|.$$

Since (5.36) and (5.37) (and likewise (5.38) and (5.39)) have the same form, all results holding for one problem naturally extend to the other, so we concentrate on (5.36). When $l = 1$, the optimization problem (5.36) reduces to the well-known basis pursuit denoising (BPDN) problem of the form
$$\min_{x\in\mathbb{R}^d}\ \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1. \tag{5.40}$$
Many efficient approaches have been proposed to solve it; see [8, 9, 55, 69, 159]. In this section we adopt the fixed-point continuation (FPC) method [69, 70], due to its simple implementation and nice convergence properties. The fixed-point algorithm for (5.40) is an iterative method which updates the iterates as
$$x^{k+1} = S\big(x^k - \tau A^T(Ax^k - b),\ \nu\big), \quad \text{with } \nu = \tau\lambda, \tag{5.41}$$
where $\tau > 0$ denotes the step size and $S(\cdot,\cdot)$ is the soft-thresholding operator defined in (2.21). The fixed-point algorithm can be naturally extended to solve (5.36), which yields
$$W_{x,i}^{k+1} = S\big(W_{x,i}^{k} - \tau_x K_x^T(K_x W_{x,i}^{k} - T_{x,i}),\ \nu_{x,i}\big), \qquad i = 1, \cdots, l, \tag{5.42}$$
where $\nu_{x,i} = \tau_x\rho_{x,i}$ and $\tau_x > 0$ denotes the step size. The fixed-point iterations have some nice convergence properties, which are presented in the following theorem.

Theorem 5.5. [69] Let $\Omega$ be the solution set of (5.36). Then there exists $M^* \in \mathbb{R}^{n\times l}$ such that
$$K_x^T(K_x W_x - T_x) \equiv M^*, \qquad \forall\, W_x \in \Omega. \tag{5.43}$$
In addition, if we define
$$L := \{(i, j) : |M^*_{i,j}| < \rho_{x,j}\} \tag{5.44}$$
as a subset of indices, let $\lambda_{\max}(K_x^T K_x)$ be the maximum eigenvalue of $K_x^T K_x$, and choose $\tau_x$ satisfying $0 < \tau_x < 2/\lambda_{\max}(K_x^T K_x)$, then the sequence $\{W_x^k\}$ generated by (5.42) converges to some $W_x^* \in \Omega$, and there exists an integer $K > 0$ such that
$$(W_x^k)_{i,j} = (W_x^*)_{i,j} = 0, \qquad \forall\, (i, j) \in L, \tag{5.45}$$
when $k > K$.

Remark 5.6.
1. Equation (5.43) shows that for any two optimal solutions of (5.36) the gradient of the squared Frobenius norm term in (5.36) must be equal.
2. Equation (5.45) means that the entries of $W_x^k$ with indices in $L$ converge to zero in finitely many steps. The positive integer $K$ is a function of $W_x^0$ and $W_x^*$, and is determined by the distance between them.

Similarly, we can design a fixed-point algorithm to solve (5.37) as follows:
$$W_{y,i}^{k+1} = S\big(W_{y,i}^{k} - \tau_y K_y^T(K_y W_{y,i}^{k} - T_{y,i}),\ \nu_{y,i}\big), \qquad i = 1, \cdots, l, \tag{5.46}$$
where $\nu_{y,i} = \tau_y\rho_{y,i}$ and $\tau_y > 0$ denotes the step size. Now, we are ready to present our new sparse kernel CCA method, given as Algorithm 7.

Algorithm 7 (SKCCA: Sparse kernel CCA)
Input: Training data $X \in \mathbb{R}^{d_1\times n}$, $Y \in \mathbb{R}^{d_2\times n}$.
Output: Sparse dual transformation matrices $W_x \in \mathbb{R}^{n\times l}$ and $W_y \in \mathbb{R}^{n\times l}$.
1: Construct and center the kernel matrices $K_x$, $K_y$;
2: Compute the matrix factorizations (5.20)-(5.22);
3: Compute $T_x$ and $T_y$ defined in (5.24)-(5.25);
4: repeat
5:   $W_{x,i}^{k+1} = S\big(W_{x,i}^{k} - \tau_x K_x^T(K_x W_{x,i}^{k} - T_{x,i}),\ \nu_{x,i}\big)$, $\nu_{x,i} = \tau_x\rho_{x,i}$, $i = 1, \cdots, l$;
6: until convergence
7: repeat
8:   $W_{y,i}^{k+1} = S\big(W_{y,i}^{k} - \tau_y K_y^T(K_y W_{y,i}^{k} - T_{y,i}),\ \nu_{y,i}\big)$, $\nu_{y,i} = \tau_y\rho_{y,i}$, $i = 1, \cdots, l$;
9: until convergence
10: return $W_x = W_x^k$ and $W_y = W_y^k$.
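A compact sketch of Algorithm 7 is given below (Python/NumPy; this is an illustrative re-implementation under stated assumptions, not the MATLAB code used for the experiments in Section 5.5). It forms the targets of Theorem 5.3, picks a step size satisfying the condition of Theorem 5.5, and runs the soft-thresholded fixed-point iterations (5.42) and (5.46); the soft-thresholding operator is taken to be the usual componentwise shrinkage, and the simple relative-change stopping test is an assumption:

    import numpy as np

    def soft_threshold(Z, nu):
        return np.sign(Z) * np.maximum(np.abs(Z) - nu, 0.0)

    def fpc(K, T, rho, max_iter=10000, xtol=1e-5):
        # Fixed-point continuation for min 0.5*||K W - T||_F^2 + sum_i rho_i ||W_i||_1.
        tau = 1.0 / np.linalg.norm(K, 2)**2     # satisfies 0 < tau < 2 / lambda_max(K^T K)
        nu = tau * np.asarray(rho)
        W = np.zeros_like(T)
        for _ in range(max_iter):
            W_new = soft_threshold(W - tau * (K.T @ (K @ W - T)), nu)
            if np.linalg.norm(W_new - W) <= xtol * max(1.0, np.linalg.norm(W)):
                return W_new
            W = W_new
        return W

    def skcca(Kx, Ky, l, rho_x, rho_y, tol=1e-10):
        # Steps 2-3: factorizations (5.20)-(5.22) and targets (5.24)-(5.25).
        dx, Ux = np.linalg.eigh(Kx)
        dy, Vy = np.linalg.eigh(Ky)
        U1, V1 = Ux[:, dx > tol], Vy[:, dy > tol]
        P1, _, P2t = np.linalg.svd(U1.T @ V1)
        Tx, Ty = U1 @ P1[:, :l], V1 @ P2t.T[:, :l]
        # Steps 4-9: column-wise l1-penalized least squares solved by FPC.
        return fpc(Kx, Tx, rho_x), fpc(Ky, Ty, rho_y)

Here rho_x and rho_y are vectors of per-column regularization parameters, chosen for instance as in Section 5.5.1.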


Although different solutions may be returned by Algorithm 7 when started from different initial points, we can conclude from (5.43) that
$$K_x^T K_x W_x^* = K_x^T K_x\widehat{W}_x^*, \qquad \forall\, W_x^*, \widehat{W}_x^* \in \Omega,$$
which results in $U_1^T W_x^* = U_1^T\widehat{W}_x^*$. Similarly, we have $V_1^T W_y^* = V_1^T\widehat{W}_y^*$ for two different solutions of (5.37). Hence,
$$(W_x^*)^T K_x K_x^T W_x^* = (\widehat{W}_x^*)^T K_x K_x^T\widehat{W}_x^*, \quad (W_y^*)^T K_y K_y^T W_y^* = (\widehat{W}_y^*)^T K_y K_y^T\widehat{W}_y^*, \quad (W_x^*)^T K_x K_y^T W_y^* = (\widehat{W}_x^*)^T K_x K_y^T\widehat{W}_y^*.$$
The above equations show that any two optimal solutions of (5.36) and (5.37) approximate the solution of kernel CCA at the same level.

Due to the effect of the $\ell_1$-norm regularization, a solution $(W_x^*, W_y^*)$ computed by Algorithm 7 no longer satisfies the orthogonality constraints of kernel CCA, but we can derive a bound on the deviation. Since (5.38) is a convex optimization problem, we have
$$K_x^T(K_x W_x^* - T_x) + \rho_x G = 0, \quad \text{for some } G \in \partial\|W_x^*\|_1, \tag{5.47}$$
where $\partial\|W_x^*\|_1$ denotes the subdifferential of $\|\cdot\|_1$ at $W_x^*$. Simplifying (5.47), we get
$$U_1^T W_x^* = \Pi_1^{-1}P_1(:, 1{:}l) - \rho_x\Pi_1^{-2}U_1^T G,$$
which implies
$$(W_x^*)^T K_x K_x^T W_x^* = I_l - \rho_x P_1(:, 1{:}l)^T\Pi_1^{-1}U_1^T G - \rho_x G^T U_1\Pi_1^{-1}P_1(:, 1{:}l) + \rho_x^2 G^T U_1\Pi_1^{-2}U_1^T G.$$
Since $G \in \mathbb{R}^{n\times l}$ satisfies $|G_{i,j}| \le 1$ for $i = 1, \cdots, n$, $j = 1, \cdots, l$, it follows that
$$\frac{\|(W_x^*)^T K_x K_x^T W_x^* - I_l\|_F}{\sqrt{l}} \le \frac{\rho_x\sqrt{n}}{\sigma_{\hat{r}}(K_x)}\Big(2 + \frac{\rho_x\sqrt{nl}}{\sigma_{\hat{r}}(K_x)}\Big), \tag{5.48}$$
where $\sigma_{\hat{r}}(K_x)$ denotes the smallest non-zero singular value of $K_x$. So the bound is affected by the regularization parameter $\rho_x$, the smallest non-zero singular value of $K_x$, and the size of $W_x^*$.

Regarding the ability of SKCCA to alleviate the over-fitting problem of kernel CCA, we know that the $\ell_1$-penalization term can alleviate over-fitting in linear regression while at the same time introducing sparsity, as shown in [138]. Thus, we can expect that sparse kernel CCA (5.36)-(5.37) enjoys both properties: computing sparse $W_x$ and $W_y$, and avoiding over-fitting in a way similar to regularized kernel CCA. Moreover, regularized kernel CCA is generally adopted to handle the over-fitting problem, and we can establish a relationship between regularized kernel CCA (5.32) and sparse kernel CCA (5.38)-(5.39).

Lemma 5.7. Let $\hat{Q}_1 \in \mathbb{R}^{\hat{r}\times\hat{r}}$ and $\hat{Q}_2 \in \mathbb{R}^{\hat{s}\times\hat{s}}$ be the orthogonal matrices defined in (5.34), and define
$$\hat{T}_x = U_1\hat{Q}_1(:, 1{:}l), \qquad \hat{T}_y = V_1\hat{Q}_2(:, 1{:}l).$$
Then $(W_x^*, W_y^*)$ such that $W_x^*$ and $W_y^*$ are the minimum Frobenius norm solutions of
$$\min\Big\{\tfrac{1}{2}\|K_x W_x - \hat{T}_x\|_F^2 + \tfrac{\lambda_x}{2}\|W_x\|_F^2 : W_x \in \mathbb{R}^{n\times l}\Big\} \tag{5.49}$$
and
$$\min\Big\{\tfrac{1}{2}\|K_y W_y - \hat{T}_y\|_F^2 + \tfrac{\lambda_y}{2}\|W_y\|_F^2 : W_y \in \mathbb{R}^{n\times l}\Big\}, \tag{5.50}$$
respectively, is also a solution of the regularized kernel CCA (5.32).

Proof. From the representation (5.5) of the primal transformations, we have
$$W_x^* = \arg\min\{\|K_x W_x - \hat{T}_x\|_F^2 + \lambda_x\operatorname{Trace}(W_x^T K_x W_x) : W_x \in \mathbb{R}^{n\times l}\},$$
$$W_y^* = \arg\min\{\|K_y W_y - \hat{T}_y\|_F^2 + \lambda_y\operatorname{Trace}(W_y^T K_y W_y) : W_y \in \mathbb{R}^{n\times l}\}.$$
Since $W_x^*$ is the minimum Frobenius norm solution, we have
$$W_x^* = U_1(\Pi_1 + \lambda_x I)^{-1}\hat{Q}_1(:, 1{:}l).$$

Similarly,
$$W_y^* = V_1(\Pi_2 + \lambda_y I)^{-1}\hat{Q}_2(:, 1{:}l).$$
Comparing the above two equations with the solution of the regularized kernel CCA (5.32) given in (5.35), we conclude that $(W_x^*, W_y^*)$ is also a solution of the regularized kernel CCA (5.32).

Comparing (5.38) (associated with sparse kernel CCA) with (5.49) (associated with regularized kernel CCA), we see that the differences lie in the predictors ($T_x$ for sparse kernel CCA and $\hat{T}_x$ for regularized kernel CCA) and in the penalization terms (the $\ell_1$-norm of the dual transformation $W_x$ for sparse kernel CCA, and the Frobenius norm of the primal transformation for regularized kernel CCA). Since regularized kernel CCA is believed to handle the over-fitting problem of kernel CCA, there is good reason to believe that the newly proposed sparse kernel CCA can also alleviate the over-fitting problem of kernel CCA.

Now, we estimate the computational complexity of Algorithm 7. Since the first step concerns the construction of the kernel matrices via evaluating kernel functions, the computational complexity of this step depends on the kernel function used, so we omit this part. In step 2, we need to compute the eigenvalue decompositions of two symmetric matrices in $\mathbb{R}^{n\times n}$, which require $O(n^3)$ flops, and the reduced SVD of $U_1^T V_1$, which requires $O((\hat{r} + \hat{s})\hat{s}^2)$ flops. In the implementation of the fixed-point iteration, the dominant operations are matrix multiplications. We note that
$$\|K_x W_x - T_x\|_F^2 = \|(U_1\Pi_1)^T W_x - P_1(:, 1{:}l)\|_F^2.$$
Thus, the update of $W_x^k$ can be reformulated as
$$W_x^{k+1} = S\Big(W_x^k - \tau_x U_1\Pi_1\big((U_1\Pi_1)^T W_x^k - P_1(:, 1{:}l)\big),\ \nu_x\Big).$$
A reformulation of the update of $W_y^k$ can be obtained similarly. Therefore, FPC requires roughly $O(n\hat{r}l)$ flops to update $W_x^k$ and $O(n\hat{s}l)$ flops to update $W_y^k$ in each iteration.

To sum up, Algorithm 7 has the computational complexity summarized in Table 5.1, where $I_x$ and $I_y$ denote the number of iterations required for computing $W_x$ and $W_y$, respectively.

Table 5.1: Computational complexity of Algorithm 7

Step No.       | Step 2   | Steps 4-6            | Steps 7-9
Order of flops | $O(n^3)$ | $I_x\,O(n\hat{r}l)$  | $I_y\,O(n\hat{s}l)$

5.5  Numerical Results

In this section, we test the newly proposed sparse kernel CCA algorithm, referred to as SKCCA, on both artificial and real-world data, and compare it with kernel CCA (KCCA) and regularized kernel CCA (RKCCA) (5.31). All experiments were performed under CentOS 5.2 and MATLAB v7.4 (R2007a), running on an IBM HS21XM Bladeserver with two Intel Xeon E5450 3.0GHz quad-core Harpertown CPUs and 16GB of random-access memory (RAM).

5.5.1  Experimental Settings

In the implementation of SKCCA, we need to determine the regularization parameters $\{\rho_{x,i}\}_{i=1}^l$ and $\{\rho_{y,i}\}_{i=1}^l$. We know that $x^*$ is a solution of the BPDN problem (5.40) if and only if $0 \in A^T(Ax^* - b) + \lambda\partial\|x^*\|_1$, which implies that $x = 0$ is a solution of (5.40) when $\lambda \ge \|A^T b\|_\infty$. To avoid the zero solution, which is meaningless in practice, we chose
$$\rho_{x,i} = \gamma_x\|K_x^T T_{x,i}\|_\infty, \qquad \rho_{y,i} = \gamma_y\|K_y^T T_{y,i}\|_\infty, \qquad i = 1, \cdots, l,$$
where $0 < \gamma_x < 1$ and $0 < \gamma_y < 1$.

To perform the fixed-point iteration, we used the FPC_BB algorithm, an accelerated version of FPC, which is available at http://www.caam.rice.edu/~optimization/L1/fpc/, with xtol = $10^{-5}$, mxitr = $10^4$ and all other parameters at their default values. In the implementation of RKCCA (5.31), we used 5-fold cross-validation to choose the optimal regularization parameters $\lambda_x$ and $\lambda_y$ from the candidate set $10^{\mathrm{linspace}(-4,4,100)}$. (linspace(-4,4,100) is a MATLAB command which produces 100 points equally distributed between -4 and 4.)
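In code, the parameter choice above is a one-liner; a minimal sketch (Python/NumPy, an assumption for exposition, with a hypothetical helper name):

    import numpy as np

    def choose_rho(K, T, gamma):
        # rho_i = gamma * ||K^T T_i||_inf for each column of T; gamma in (0, 1)
        # keeps the corresponding column of the solution from being forced to zero.
        return gamma * np.max(np.abs(K.T @ T), axis=0)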

5.5.2  Synthetic Data

In this section, we apply ordinary CCA, RKCCA and SKCCA to synthetic data to demonstrate the ability of kernel CCA to find nonlinear relationships. Let $Z$ be a random variable following the uniform distribution over the interval $(-2, 2)$; we sampled 500 pairs of $(X, Y)$ in the following way: $X = [Z;\ Z]$ and $Y = [Z^2 + 0.3\epsilon_1;\ \sin(\pi Z) + 0.3\epsilon_2]$, where $\epsilon_1$ and $\epsilon_2$ follow the standard normal distribution. Obviously, the variables $X$ and $Y$ are nonlinearly related. We attempt to reveal the nonlinear association using the first pair of canonical variates $w_x^T X$ and $w_y^T Y$, and we also plot $(w_x^T X, w_y^T Y)$ in Figure 5.1. When implementing kernel CCA we employed the Gaussian kernel (5.1) with $\sigma$ equal to the maximum distance between data points. We set the regularization parameters $\lambda_x = \lambda_y = 10^{-2}$ in RKCCA and $\rho_x = \rho_y = 10^{-1}$ in SKCCA. The canonical correlations between the first pair of canonical variables are listed in Table 5.2.

Table 5.2: Correlation between the first pair of canonical variables obtained by ordinary CCA, RKCCA and SKCCA.

                                                        | CCA    | RKCCA  | SKCCA
$\dfrac{w_x^T X Y^T w_y}{\|X^T w_x\|_2\,\|Y^T w_y\|_2}$ | 0.3971 | 0.9621 | 0.9632
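For reference, the sampling scheme just described can be reproduced with a few lines (Python/NumPy, an illustrative sketch; the random seed is an arbitrary assumption):

    import numpy as np

    def synthetic_pair(n=500, seed=0):
        rng = np.random.default_rng(seed)
        Z = rng.uniform(-2.0, 2.0, size=n)
        eps1, eps2 = rng.standard_normal(n), rng.standard_normal(n)
        X = np.vstack([Z, Z])                                                # 2 x n
        Y = np.vstack([Z**2 + 0.3 * eps1, np.sin(np.pi * Z) + 0.3 * eps2])   # 2 x n
        return X, Y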

Figure 5.1: Plots of the first pair of canonical variates: (a) sample data, (b) ordinary CCA, (c) RKCCA, (d) SKCCA.

Figure 5.1(b) shows the scatter of the first pair of canonical variates found by ordinary CCA, from which we see that some strong relationship is left unexplained. In contrast, Figures 5.1(c) and 5.1(d) show the scatter of the first pair of canonical variates found by RKCCA and SKCCA, respectively. A clear linear relationship between $w_x^T X$ and $w_y^T Y$ can be observed. Table 5.2 also shows that the canonical correlations obtained by RKCCA and SKCCA are 0.9621 and 0.9632, respectively, which are larger than that achieved by ordinary CCA. The comparison implies that ordinary CCA may not be applicable for finding nonlinear relations between two sets of data.

5.5.3  Cross-Language Document Retrieval

Previous studies [131, 144] have shown that kernel CCA works well for cross-language document retrieval and performs better than the latent semantic indexing approach. In this subsection, we apply SKCCA to the task of cross-language document retrieval and present comparison results of SKCCA with KCCA and RKCCA. In this experiment, we used the following two data sets:

1. The English-French corpus from the Europarl parallel corpus dataset [94], available at http://www.statmt.org/europarl/, from which we obtained 202 samples and generated a 23308 × 202 term-document matrix for the English corpus and a 33986 × 202 term-document matrix for the French corpus.

2. The Aligned Hansards of the 36th Parliament of Canada [64], which is the same data set as that used in Chapter 4.

For each of the two data sets, we made two random training-testing splits. For the Europarl data, 50 and 100 pairs of documents were used as training data and the rest as testing data, while for the Hansard data, 200 and 400 pairs of documents were used as training data and the rest as testing data. In all experiments, the linear kernel (5.28) was employed to compute the kernel matrices. We measured the precision of document retrieval using the average area under the ROC curve (AROC), and for a collection of queries we used the average of all queries' retrieval precision as the average retrieval precision of the collection. More details about the acquisition of the term-document matrices, data preprocessing and evaluation of retrieval performance can be found in Chapter 4.

Figure 5.2 presents the retrieval accuracy of CCA, KCCA, RKCCA and SKCCA on both data sets.


Figure 5.2: Cross-language document retrieval using CCA, KCCA, RKCCA and SKCCA: (a) Europarl data with 50 training data, (b) Europarl data with 100 training data, (c) Hansard data with 200 training data, (d) Hansard data with 400 training data.

Detailed results are presented in Table 5.3, where we recorded the retrieval precision (AROC) using the projections corresponding to all non-zero canonical correlations (i.e., $l = \hat{m}$), the summation of canonical correlations between the testing data (Corr), the sparsity of $W_x$ and $W_y$, and the violation of the orthogonality constraints, measured by
$$\mathrm{Err}(W_x) := \frac{\|W_x^T K_x^2 W_x - I_l\|_F}{\sqrt{l}} \qquad \text{and} \qquad \mathrm{Err}(W_y) := \frac{\|W_y^T K_y^2 W_y - I_l\|_F}{\sqrt{l}}.$$
In Table 5.3, the 'Sparsity' column records the sparsity of both $W_x$ and $W_y$: the first component records the sparsity of $W_x$ and the second component the sparsity of $W_y$.


Table 5.3: Cross-language document retrieval using CCA, KCCA, RKCCA and SKCCA.

Algorithm | AROC  | Corr   | Sparsity     | Err(Wx)    | Err(Wy)    | Regularizer
Europarl: 50 training data; l = 49
CCA       | 99.74 | 40.41  | (13.5, 8.8)  | 1.2577e-14 | 1.3551e-14 | -
KCCA      | 99.74 | 40.27  | (0, 0)       | 2.1562e-15 | 2.1179e-15 | -
RKCCA     | 99.74 | 40.72  | (0, 0)       | 1.2487e-4  | 1.3140e-4  | (10^-4, 10^-4)
SKCCA     | 99.74 | 41.54  | (91.9, 91.1) | 9.8277e-1  | 9.7882e-1  | (0.7, 0.68)
Europarl: 100 training data; l = 99
CCA       | 99.10 | 78.99  | (3.7, 1.7)   | 1.2337e-14 | 1.0911e-14 | -
KCCA      | 99.10 | 78.64  | (0, 0)       | 3.1902e-15 | 3.2905e-15 | -
RKCCA     | 99.44 | 79.59  | (0, 0)       | 3.8950e-1  | 1.3295e-4  | (0.5214, 10^-4)
SKCCA     | 99.95 | 82.60  | (95.5, 94.5) | 9.9250e-1  | 9.8891e-1  | (0.75, 0.71)
Hansard: 200 training data; l = 199
CCA       | 99.81 | 160.32 | (12.6, 13.1) | 8.7618e-15 | 1.0059e-14 | -
KCCA      | 99.81 | 160.01 | (0, 0)       | 6.2742e-15 | 5.6267e-15 | -
RKCCA     | 99.98 | 158.15 | (0, 0)       | 6.1274e-1  | 3.5092e-1  | (1.3219, 0.4329)
SKCCA     | 99.97 | 162.43 | (93.7, 91.2) | 9.8233e-1  | 9.7271e-1  | (0.6, 0.55)
Hansard: 400 training data; l = 399
CCA       | 99.80 | 286.73 | (1.4, 1.7)   | 1.1289e-14 | 1.1697e-14 | -
KCCA      | 99.80 | 286.74 | (0, 0)       | 1.5014e-14 | 1.3191e-14 | -
RKCCA     | 99.99 | 287.19 | (0, 0)       | 9.4202e-1  | 7.7037e-1  | (14.8497, 2.7826)
SKCCA     | 99.98 | 315.98 | (95.8, 93.4) | 9.8842e-1  | 9.7925e-1  | (0.59, 0.52)

The 'Regularizer' column records the values of the regularization parameters $(\lambda_x, \lambda_y)$ in RKCCA or $(\gamma_x, \gamma_y)$ in SKCCA.

Figure 5.2 shows that all four algorithms achieve high precision for cross-language document retrieval, even when only a small number of training data were used. From the figures we also see that increasing $l$, the number of columns of $W_x$ and $W_y$ used in the retrieval task, usually helps to improve the precision. One possible explanation is that when we increase $l$, more projections corresponding to non-zero canonical correlations are used for document retrieval, and these added projections may carry information contained in the training data. In addition, we notice that when $l = 10$ the average AROC obtained by RKCCA and SKCCA is already very high, and increasing $l$ does not improve the AROC much further. Thus, in order to reduce computational time in practice we do not need to compute dual projections corresponding to all non-zero canonical correlations.

All plots in Figure 5.2 show that RKCCA and SKCCA outperform KCCA in terms of retrieval accuracy, although the difference is small. This indicates that both RKCCA and SKCCA have the ability to avoid the over-fitting problem of ordinary kernel CCA, as explained in Sections 5.3 and 5.4 and further illustrated in Table 5.3. As can be seen from Table 5.3, RKCCA and SKCCA achieve high retrieval precision on both data sets, and the two approaches have comparable performance in terms of precision, which is also shown in Figure 5.2. We also note that SKCCA obtains a larger summation of canonical correlations between the testing data (Corr) than the other three approaches. In both experiments the sparsity of $W_x$ and $W_y$ computed by KCCA and RKCCA is 0, which means the dual projections are dense; in contrast, the sparsity of $W_x$ and $W_y$ computed by SKCCA is greater than 90%, which means that more than 90% of the entries of both $W_x$ and $W_y$ are zero.

Different from the situations for sparse LDA and sparse CCA in previous chapters, the orthogonality constraints are not well satisfied by SKCCA. For regularized kernel CCA (5.31), we can show that the dual transformation $W_x$ satisfies
$$\frac{\|W_x^T K_x^2 W_x - I_l\|_F}{\sqrt{l}} \le \frac{\lambda_x}{\sigma_{\hat{r}}(K_x) + \lambda_x},$$
that is, $\mathrm{Err}(W_x) = O(\lambda_x)$ when $\lambda_x \ll \sigma_{\hat{r}}(K_x)$. This is verified by the experimental results on the Europarl data set. Comparing the retrieval precision of RKCCA and SKCCA in the two experiments, we also found that the retrieval precision of these two approaches stays at a high level and is nearly unchanged when the number of training data is increased from 200 to 400, which indicates that RKCCA and SKCCA can work very well with limited training data.
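For completeness, the orthogonality-violation measures Err(Wx) and Err(Wy) reported in Tables 5.3 and 5.4 amount to a single matrix expression; a minimal sketch (Python/NumPy, assumed for exposition) is:

    import numpy as np

    def orth_violation(K, W):
        # Err(W) = ||W^T K^2 W - I_l||_F / sqrt(l) for a dual transformation W.
        l = W.shape[1]
        M = W.T @ (K @ (K @ W))
        return np.linalg.norm(M - np.eye(l)) / np.sqrt(l)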

5.5.4  Content-Based Image Retrieval

Content-based image retrieval (CBIR) is a challenging aspect of multimedia analysis and has become popular in the past few years. Generally, CBIR is the problem of searching for digital images in large databases by their visual content (e.g., color, texture, shape) rather than by metadata such as keywords, labels and descriptions associated with the images. Kernel CCA has previously been used for image retrieval [72, 73]. In this section, we apply our sparse kernel CCA approach to a content-based image retrieval task by combining image and text data.

We experimented on the Ground Truth Image Database created at the University of Washington, which consists of 21 data sets of outdoor scene images. The data set is available at http://www.cs.washington.edu/research/imagedatabase/groundtruth/. In our experiment we used 852 images from 19 data sets that have been annotated with keywords. We exploited text features and low-level image features, including color and texture, and applied sparse kernel CCA to perform image retrieval from text queries. We used 217 images as training data and the rest as testing data.

Text Features. We used the bag-of-words approach, the same as in the cross-language document retrieval experiment, to represent the text associated with the images. Since each image in the data set has been annotated with keywords, we treat the terms associated with an image as a document. After removing stop-words and stemming, we obtain a term-document matrix of size 189 × 852 for the Ground Truth image data set.

We applied Gabor filters to extract texture features and used the HSV (hue-saturation-value) color representation for color features. To enhance sensitivity to the overall shape, we divided each image into $8 \times 8 = 64$ patches from which texture and color features were extracted.

Texture Features. The Gabor filter in the spatial domain is given by
$$g_{\lambda\theta\psi\sigma\gamma}(x, y) = \exp\Big(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\Big)\cos\Big(2\pi\frac{x'}{\lambda} + \psi\Big), \tag{5.51}$$
where $x' = x\cos(\theta) + y\sin(\theta)$, $y' = -x\sin(\theta) + y\cos(\theta)$, and $x$ and $y$ specify the position of a light impulse. In this equation, $\lambda$ represents the wavelength of the cosine factor, $\theta$ represents the orientation of the normal to the parallel stripes of the Gabor function in degrees, $\psi$ is the phase offset of the cosine factor in degrees, $\gamma$ is the spatial aspect ratio and $\sigma$ is the standard deviation of the Gaussian. The Gabor filter impulse responses used in this experiment are shown in Figure 5.3. From each of the 64 image patches, the Gabor filters extract 16 texture features, which eventually results in a total of $64 \times 16 = 1024$ texture features for each image.
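A minimal sketch of such a filter bank is given below (Python with NumPy and SciPy, assumed for exposition; the kernel support size and the use of the mean response magnitude as the per-filter feature are assumptions, since the text does not specify how each filter response is summarized):

    import numpy as np
    from scipy.signal import convolve2d

    def gabor_kernel(freq, theta, sigma=4.0, psi=0.0, gamma=1.0, size=21):
        # Real Gabor filter following (5.51), with f = 1/lambda.
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        x_p = x * np.cos(theta) + y * np.sin(theta)
        y_p = -x * np.sin(theta) + y * np.cos(theta)
        envelope = np.exp(-(x_p**2 + gamma**2 * y_p**2) / (2.0 * sigma**2))
        return envelope * np.cos(2.0 * np.pi * freq * x_p + psi)

    def gabor_features(patch):
        # 4 frequencies x 4 orientations = 16 texture features per image patch.
        feats = []
        for f in (0.15, 0.2, 0.25, 0.3):
            for theta in (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
                resp = convolve2d(patch, gabor_kernel(f, theta), mode='same', boundary='symm')
                feats.append(np.abs(resp).mean())
        return np.array(feats)

Here the patch is taken to be a 2-D grayscale array; applying the routine to all 64 patches gives the 1024 texture features mentioned above.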


Figure 5.3: Gabor filters used to extract texture features. Four frequencies $f = 1/\lambda = [0.15, 0.2, 0.25, 0.3]$ and four directions $\theta = [0, \pi/4, \pi/2, 3\pi/4]$ are used. The width of the filters is $\sigma = 4$.

Color Features. We used the HSV color representation for the color features. Each color component was quantized into 16 bins, and each image patch was represented by 3 normalized color histograms. This gives 48 features for each of the 64 patches, which eventually results in $48 \times 64 = 3072$ color features for each image.
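The color features can be sketched similarly (Python/NumPy, assumed; the patch is taken to be already converted to HSV with channel values in [0, 1]):

    import numpy as np

    def hsv_histograms(patch_hsv, bins=16):
        # 3 channels x 16 bins = 48 normalized color features per image patch.
        feats = []
        for c in range(3):                                   # H, S, V channels
            hist, _ = np.histogram(patch_hsv[:, :, c], bins=bins, range=(0.0, 1.0))
            hist = hist.astype(float)
            feats.append(hist / max(hist.sum(), 1.0))
        return np.concatenate(feats)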


Following previous work [72, 73], we used the Gaussian kernel
$$\kappa_x(I_i, I_j) = \exp\Big(-\frac{\|I_i - I_j\|^2}{2\sigma^2}\Big),$$
where $I_i$ is the vector concatenating the texture and color features of the $i$th image and $\sigma$ is selected as the minimum distance between different images, to compute the kernel matrix $K_x$ for the first view. The linear kernel (5.28) was employed to compute the kernel matrix $K_y$ for the other view using the text features.

In Table 5.4, we compare the performance of CCA, KCCA, RKCCA and SKCCA. As in the cross-language document retrieval experiments, we used AROC to evaluate the performance of these algorithms. We see from Table 5.4 that both RKCCA and SKCCA outperform CCA and KCCA, and that RKCCA achieves the best performance in terms of AROC. The dual projections $W_x$ and $W_y$ computed by SKCCA have high sparsity, which can greatly reduce the computational time of projecting a new data point in practice, since we only need to evaluate kernel functions between the new data point and a small subset of the training data. Moreover, SKCCA also obtains a larger summation of canonical correlations between the testing data than the other approaches.

Table 5.4: Content-based image retrieval using CCA, KCCA, RKCCA and SKCCA.

Algorithm | AROC  | Corr  | Sparsity     | Err(Wx)    | Err(Wy)    | Regularizer
UW ground truth data: 217 training data; l = 124
CCA       | 73.96 | 11.53 | (0, 7.7)     | 6.832e-15  | 7.638e-15  | -
KCCA      | 82.59 | 19.09 | (0, 0)       | 3.1310e-15 | 6.8237e-14 | -
RKCCA     | 85.67 | 21.96 | (0, 0)       | 9.9861e-1  | 9.9052e-1  | (2.7e+3, 1.7e+2)
SKCCA     | 83.17 | 25.55 | (91.1, 95.1) | 9.5920e-1  | 9.7713e-1  | (0.51, 0.54)

In Figure 5.4, we plot the AROC of CCA, KCCA, RKCCA and SKCCA as a function of the number of projections used (i.e., different $l$). We observe that the AROC of SKCCA is at first smaller than, and then exceeds, that of kernel CCA. This indicates that when a suitable number of dual projections is used for retrieval, SKCCA can improve the performance of kernel CCA.

Figure 5.4: Content-based image retrieval using CCA, KCCA, RKCCA and SKCCA on UW ground truth data with 217 training data.

5.6  Conclusions

In this chapter we proposed a novel sparse kernel CCA algorithm called SKCCA. The algorithm is based on a relationship between kernel CCA and least squares problems, which is an extension of a similar relationship between CCA and least squares problems. We incorporated sparsity into kernel CCA by penalizing the $\ell_1$-norm of the dual vectors. The resulting $\ell_1$-regularized minimization problems were solved by a fixed-point continuation (FPC) algorithm. Empirical results show that SKCCA not only performs well in computing sparse dual transformations, but also alleviates the over-fitting problem of kernel CCA.

Although we did not discuss it in this chapter, the relationship between CCA and least squares problems described in Theorem 5.1 can also be exploited to design new sparse CCA algorithms by adding sparsity-inducing penalization to the objective functions. This work is left for future research.

Besides, several interesting questions and extensions of sparse kernel CCA remain. In many applications, such as genomic data analysis, CCA is often performed on more than two data sets. It will be helpful to extend sparse kernel CCA to deal with multiple data sets. In the derivation of SKCCA, we did not discuss the choice of kernel function; however, it is believed that the performance of kernel CCA depends on the choice of the kernel. As future research, we plan to study the problem of finding the optimal kernel for kernel CCA in different applications. Moreover, we also plan to generalize the idea of sparse kernel CCA in this chapter to involve multiple kernels.

Chapter 6
Conclusions

6.1  Summary of Contributions

In this thesis, we have considered several sparse dimensionality reduction methods for high-dimensional data analysis and their applications in various fields. We now summarize the contributions as follows.

First, we studied a sparse version of uncorrelated linear discriminant analysis, which is an important generalization of LDA. We parameterized all solutions of the generalized ULDA via solving the optimization problem proposed in [160], and then proposed a novel model, named SULDA, for computing a sparse ULDA transformation matrix. The main idea of this model is to select the minimum $\ell_1$-norm solution from the general solution set, which leads to a basis pursuit problem. We applied the accelerated linearized Bregman iterative method to solve this optimization problem. Experimental results demonstrate that SULDA consistently outperforms its competitors in terms of classification accuracy and achieves comparable interpretability (sparsity and number of used variables). The resulting sparse transformation can also be used to visualize observations and inspect class discrimination in the low-dimensional space.

In our second contribution, we have made a new and systematic study of CCA.

We first revealed the equivalence between the recursive formulation and the trace formulation of the multiple-projection CCA problem. Based on this equivalence, we adopted the trace formulation as the criterion of CCA and obtained an explicit characterization of all solutions of the multiple-projection CCA problem, even when the sample covariance matrices are singular. We also established the equivalence between uncorrelated linear discriminant analysis and the CCA problem. Based on the explicit characterization of the general solutions of CCA, we developed a novel sparse CCA algorithm, named SCCA_$\ell_1$. Compared with existing sparse CCA algorithms, our proposed method has the following properties:

1. Sparse projections are computed by solving two $\ell_1$-minimization problems, one for each set of variables. The accelerated linearized Bregman iterative method can be applied to our sparse CCA optimization problems, which makes our algorithm easy to implement.

2. The orthogonality constraints on the canonical variables are well satisfied, which implies that the canonical variables are mutually uncorrelated. Multiple sparse projections are computed simultaneously, while other approaches have to compute multiple sparse projections recursively. As a consequence, for those approaches the orthogonality constraints on the canonical variables may not be satisfied even when normalization is adopted, and the associated canonical variables are not mutually uncorrelated.

Finally, we focused on designing an efficient algorithm for sparse kernel CCA. We studied sparse kernel CCA by utilizing established results on CCA, aiming at computing sparse dual transformations and alleviating the over-fitting problem of kernel CCA simultaneously. We first established a relationship between CCA and least squares problems and extended this relationship to kernel CCA. Then, based on this relationship, we incorporated sparsity into kernel CCA by penalizing the least squares term with the $\ell_1$-norm and proposed a novel sparse kernel CCA algorithm, named SKCCA. Numerical results of applying the newly proposed algorithm to various applications and comparative results with kernel CCA and regularized kernel CCA were also presented. Empirical results show that SKCCA not only performs well in computing sparse dual transformations, but also alleviates the over-fitting problem of kernel CCA.

6.2  Future Work

There are still many interesting problems that will lead to further research in the field of sparse dimensionality reduction. The first one consists of designing efficient algorithms for $\ell_1$-minimization problems over the manifold of column orthogonal matrices (the Stiefel manifold). When we selected the minimum $\ell_1$-norm solution from the solution set, the parameter denoting an arbitrary orthogonal matrix was fixed at the identity matrix. Thus, the optimal solution was taken over a subset and may not be globally optimal. Minimizing an objective function with orthogonality constraints is a challenging problem and has attracted more and more attention. A further study of $\ell_1$-minimization problems with orthogonality constraints should be helpful for designing efficient algorithms in high-dimensional data analysis.

We used the accelerated linearized Bregman iterative method to solve the $\ell_1$-minimization problems in sparse ULDA and sparse CCA, and the FPC method to solve the penalized least squares problems in sparse kernel CCA. There may exist faster methods, but since our sparse algorithms (SULDA, SCCA_$\ell_1$ and SKCCA) are frameworks, we can replace the accelerated linearized Bregman iterative method and FPC by other methods without causing any problem. In addition, in this thesis we only considered sparse CCA and sparse kernel CCA for two sets of data; however, as can be seen from the derivation of our algorithms, the same idea can be generalized to handle multiple sets of data. Due to the high efficiency and robustness of SKCCA for finding nonlinear relations among high-dimensional data, it would be fruitful to apply it in other areas such as fMRI data analysis, automatic image annotation and so on.


Another important direction is to study structured sparsity in sparse dimensionality reduction methods. As can be seen from the experimental results of SULDA, although the computed transformation matrix is highly sparse, the number of used variables is relatively high. This can still cause problems for interpreting the extracted features. In some applications, like genome-wide association studies, structural information (e.g., group structure) can greatly facilitate the interpretation of the results obtained. Thus, it is very meaningful to study structured sparse dimensionality reduction methods.

Bibliography

[1] S. Akaho, A kernel method for canonical correlation analysis, in Proceedings of the International Meeting of the Psychometric Society, 2001.
[2] T. W. Anderson, An Introduction to Multivariate Statistical Analysis, Wiley, 3rd ed., 2003.
[3] F. R. Bach, R. Jenatton, J. Mairal, and G. Obozinski, Optimization with sparsity-inducing penalties, Foundations and Trends in Machine Learning, 4 (2011), pp. 1–106.
[4] F. R. Bach and M. I. Jordan, Kernel independent component analysis, Journal of Machine Learning Research, 3 (2003), pp. 1–48.
[5] F. R. Bach and M. I. Jordan, A probabilistic interpretation of canonical correlation analysis, tech. report, University of California, Berkeley, 2005.
[6] S. Balakrishnan, K. Puniyani, and J. D. Lafferty, Sparse additive functional and kernel CCA, in Proceedings of the 29th International Conference on Machine Learning, 2012.


[7] P. Baldi and G. Hatfield, DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling, Cambridge University Press, 2002.
[8] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences, 2 (2009), pp. 183–202.
[9] S. Becker, J. Bobin, and E. J. Candès, NESTA: A fast and accurate first-order method for sparse recovery, SIAM Journal on Imaging Sciences, 4 (2011), pp. 1–39.
[10] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 19 (1997), pp. 711–720.
[11] M. Belkin and P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation, 15 (2003), pp. 1373–1396.
[12] M. W. Berry, S. T. Dumais, and G. W. O'Brien, Using linear algebra for intelligent information retrieval, SIAM Review, 37 (1995), pp. 573–595.
[13] P. J. Bickel and E. Levina, Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations, Bernoulli, 10 (2004), pp. 989–1010.
[14] T. D. Bie, N. Cristianini, and R. Rosipal, Eigenproblems in pattern recognition, in Handbook of Geometric Computing: Applications in Pattern Recognition, Computer Vision, Neuralcomputing, and Robotics, Springer, 2005, pp. 129–170.
[15] C. M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, 2006.

[16] Å. Björck and G. H. Golub, Numerical methods for computing angles between linear subspaces, Mathematics of Computation, 27 (1973), pp. 579–594.
[17] A. P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition, 30 (1997), pp. 1145–1159.
[18] L. M. Bregman, The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming, USSR Computational Mathematics and Mathematical Physics, 7 (1967), pp. 200–217.
[19] C. J. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, 2 (1998), pp. 121–167.
[20] T. Cacoullos, Estimation of a multivariate density, Annals of the Institute of Statistical Mathematics, 18 (1966), pp. 179–189.
[21] J. Cai, S. Osher, and Z. Shen, Convergence of the linearized Bregman iteration for $\ell_1$-norm minimization, Mathematics of Computation, 78 (2009), pp. 2127–2136.
[22] J. Cai, S. Osher, and Z. Shen, Linearized Bregman iterations for compressed sensing, Mathematics of Computation, 78 (2009), pp. 1515–1536.
[23] E. Candès and B. Recht, Exact matrix completion via convex optimization, Foundations of Computational Mathematics, 9 (2009), pp. 717–772.
[24] E. Candès, J. Romberg, and T. Tao, Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information, IEEE Transactions on Information Theory, 52 (2006), pp. 489–509.
[25] E. Candès, J. Romberg, and T. Tao, Stable signal recovery from incomplete and inaccurate measurements, Communications on Pure and Applied Mathematics, 59 (2006), pp. 1207–1223.


[26] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning, MIT Press, 2006.
[27] L. Chen, H. M. Liao, M. Ko, J. Lin, and G. Yu, A new LDA-based face recognition system which can solve the small sample size problem, Pattern Recognition, 33 (2000), pp. 1713–1726.
[28] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM Review, 43 (2001), pp. 129–159.
[29] X. Chen and H. Liu, An efficient optimization algorithm for structured sparse CCA, with applications to eQTL mapping, Statistics in Biosciences, 4 (2012), pp. 3–26.
[30] K. Chin, S. DeVries, J. Fridlyand, et al., Genomic and transcriptional aberrations linked to breast cancer pathophysiologies, Cancer Cell, 10 (2006), pp. 529–541.
[31] D. Chu and S. T. Goh, A new and fast implementation for null space based linear discriminant analysis, Pattern Recognition, 43 (2010), pp. 1373–1379.
[32] D. Chu and S. T. Goh, A new and fast orthogonal linear discriminant analysis on undersampled problems, SIAM Journal on Scientific Computing, 32 (2010), pp. 2274–2297.
[33] D. Chu, S. T. Goh, and Y. S. Hung, Characterization of all solutions for undersampled uncorrelated linear discriminant analysis problems, SIAM Journal on Matrix Analysis and Applications, 32 (2011), pp. 820–844.
[34] D. Chu, L.-Z. Liao, and M. K. Ng, Sparse orthogonal linear discriminant analysis, SIAM Journal on Scientific Computing, 34 (2012), pp. A2421–A2443.
[35] D. Chu, L.-Z. Liao, M. K. Ng, and X. Zhang, Sparse canonical correlation analysis: New formulation and algorithm. Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.

[36] D. Chu, L.-Z. Liao, M. K. Ng, and X. Zhang, Sparse kernel canonical correlation analysis, in International MultiConference of Engineers and Computer Scientists, 2013.
[37] D. Chu and X. Zhang, Sparse uncorrelated linear discriminant analysis, in the 30th International Conference on Machine Learning, 2013.
[38] L. Clemmensen, T. Hastie, D. Witten, and B. Ersbøll, Sparse discriminant analysis, Technometrics, 53 (2011), pp. 406–413.
[39] A. d'Aspremont, F. R. Bach, and L. El Ghaoui, Optimal solutions for sparse principal component analysis, Journal of Machine Learning Research, 9 (2008), pp. 1269–1294.
[40] A. d'Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet, A direct formulation of sparse PCA using semidefinite programming, SIAM Review, 49 (2007), pp. 434–448.
[41] R. Datta, D. Joshi, J. Li, and J. Z. Wang, Image retrieval: Ideas, influences, and trends of the new age, ACM Computing Surveys, 40 (2008).
[42] J. Dauxois and G. M. Nkiet, Nonlinear canonical analysis and independence tests, The Annals of Statistics, 26 (1998), pp. 1254–1278.
[43] T. De Bie and B. De Moor, On the regularization of canonical correlation analysis, in 4th International Symposium on Independent Component Analysis and Blind Signal Separation, 2003.
[44] M. Dettling, BagBoosting for tumor classification with gene expression data, Bioinformatics, 20 (2004), pp. 3583–3593.
[45] C. Dhanjal, Sparse Kernel Feature Extraction, PhD thesis, University of Southampton, 2008.

[46] C. Dhanjal, S. Gunn, and J. Shawe-Taylor, Efficient sparse kernel feature extraction based on partial least squares, IEEE Transactions on Pattern Analysis and Machine Intelligence, 99 (2008), pp. 1347–1361.
[47] D. L. Donoho, High-dimensional data analysis: the curses and blessings of dimensionality, in Mathematical Challenges of the 21st Century, American Mathematical Society, 2000.
[48] D. L. Donoho, Compressed sensing, IEEE Transactions on Information Theory, 52 (2006), pp. 1289–1306.
[49] D. L. Donoho, For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution, Communications on Pure and Applied Mathematics, 59 (2006), pp. 797–829.
[50] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Wiley Interscience, 2nd ed., 2000.
[51] S. Dudoit, J. Fridlyand, and T. P. Speed, Comparison of discriminant methods for the classification of tumors using gene expression data, Journal of the American Statistical Association, 97 (2002), pp. 77–87.
[52] M. Dundar, G. Fung, J. Bi, S. Sathyakama, and B. Rao, Sparse Fisher discriminant analysis for computer aided detection, in Proceedings of the SIAM International Conference on Data Mining, 2005.
[53] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, Least angle regression, Annals of Statistics, 32 (2004), pp. 407–499.
[54] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters, 27 (2006), pp. 861–874.
[55] M. A. Figueiredo, R. D. Nowak, and S. J. Wright, Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems, IEEE Journal of Selected Topics in Signal Processing, 1 (2007), pp. 586–597.
[56] R. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics, 7 (1936), pp. 179–188.
[57] M. P. Friedlander and P. Tseng, Exact regularization of convex programs, SIAM Journal on Optimization, 18 (2007), pp. 1326–1350.
[58] J. Friedman, Regularized discriminant analysis, Journal of the American Statistical Association, 84 (1989), pp. 165–175.
[59] O. Friman, J. Cedefamn, P. Lundberg, M. Borga, and H. Knutsson, Detection of neural activity in functional MRI using canonical correlation analysis, Magnetic Resonance in Medicine, 45 (2001), pp. 323–330.
[60] K. Fukumizu, F. R. Bach, and A. Gretton, Statistical consistency of kernel canonical correlation analysis, Journal of Machine Learning Research, 8 (2007), pp. 361–383.
[61] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, 2nd ed., 1990.
[62] E. Fung and M. K. Ng, On sparse Fisher discriminant method for microarray data analysis, Bioinformation, 2 (2007), pp. 230–234.
[63] C. Fyfe and P. L. Lai, ICA using kernel canonical correlation analysis, in Proceedings of the Workshop on Independent Component Analysis and Blind Signal Separation, 2000, pp. 279–284.
[64] U. Germann, Aligned Hansards of the 36th Parliament of Canada, http://www.isi.edu/natural-language/download/hansard/, 2001.
[65] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, 3rd ed., 1996.

[66] G. H. Golub and H. Zha, The canonical correlations of matrix pairs and their numerical computation, tech. report, Computer Science Department, Stanford University, 1992.
[67] G. H. Golub and H. Zha, Perturbation analysis of the canonical correlations of matrix pairs, Linear Algebra and Its Applications, 210 (1994), pp. 3–28.
[68] Y. Guo, T. Hastie, and R. Tibshirani, Regularized linear discriminant analysis and its application in microarray, Biostatistics, 8 (2007), pp. 86–100.
[69] E. T. Hale, W. Yin, and Y. Zhang, Fixed-point continuation for ℓ1-minimization: Methodology and convergence, SIAM Journal on Optimization, 19 (2008), pp. 1107–1130.
[70] E. T. Hale, W. Yin, and Y. Zhang, Fixed-point continuation applied to compressed sensing: Implementation and numerical experiments, Journal of Computational Mathematics, 28 (2010), pp. 170–194.
[71] D. R. Hardoon and J. Shawe-Taylor, Sparse canonical correlation analysis, Machine Learning, 83 (2011), pp. 331–353.
[72] D. R. Hardoon and J. R. Shawe-Taylor, Convergence analysis of kernel canonical correlation analysis: Theory and practice, Machine Learning, 74 (2009), pp. 23–38.
[73] D. R. Hardoon, S. R. Szedmak, and J. R. Shawe-Taylor, Canonical correlation analysis: An overview with application to learning methods, Neural Computation, 16 (2004), pp. 2639–2664.
[74] T. Hastie, A. Buja, and R. Tibshirani, Penalized discriminant analysis, The Annals of Statistics, 23 (1995), pp. 73–102.
[75] T. Hastie and R. Tibshirani, Discriminant analysis by Gaussian mixtures, Journal of the Royal Statistical Society, Series B (Methodological), 58 (1996), pp. 155–176.

[76] T. Hastie, R. Tibshirani, and A. Buja, Flexible discriminant analysis by optimal scoring, Journal of the American Statistical Association, 89 (1994), pp. 1255–1270.
[77] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, 2nd ed., 2009.
[78] R. A. Horn and C. R. Johnson, Topics in Matrix Analysis, Cambridge University Press, 1991.
[79] H. Hotelling, Relations between two sets of variates, Biometrika, 28 (1936), pp. 321–377.
[80] P. Howland, M. Jeon, and H. Park, Structure preserving dimension reduction for clustered text data based on the generalized singular value decomposition, SIAM Journal on Matrix Analysis and Applications, 25 (2003), pp. 165–179.
[81] P. Howland and H. Park, Equivalence of several two-stage methods for linear discriminant analysis, in Proceedings of the Fourth SIAM International Conference on Data Mining, 2004.
[82] P. Howland and H. Park, Generalizing discriminant analysis using the generalized singular value decomposition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 26 (2004), pp. 995–1006.
[83] B. Huang, S. Ma, and D. Goldfarb, Accelerated linearized Bregman method, Journal of Scientific Computing, 54 (2013), pp. 428–453.
[84] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
[85] Z. Jin, J. Y. Yang, Z. S. Hu, and Z. Lou, Face recognition based on the uncorrelated discriminant transformation, Pattern Recognition, 34 (2001), pp. 1405–1416.

[86] Z. Jin, J. Y. Yang, Z. M. Tang, and Z. S. Hu, A theorem on the uncorrelated optimal discriminant vectors, Pattern Recognition, 34 (2001), pp. 2041–2047.
[87] I. Jolliffe, Principal Component Analysis, Springer-Verlag, 1986.
[88] M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre, Generalized power method for sparse principal component analysis, Journal of Machine Learning Research, 11 (2010), pp. 517–553.
[89] J. R. Kettenring, Canonical analysis of several sets of variables, Biometrika, 58 (1971), pp. 433–451.
[90] H. Kim, P. Howland, and H. Park, Dimension reduction in text classification using support vector machines, Journal of Machine Learning Research, 6 (2005), pp. 37–53.
[91] S. J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky, An interior-point method for large-scale ℓ1-regularized least squares, IEEE Journal of Selected Topics in Signal Processing, 1 (2007), pp. 606–617.
[92] T. K. Kim and R. Cipolla, Canonical correlation analysis of video volume tensors for action categorization and detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 31 (2009), pp. 1415–1428.
[93] T. K. Kim, J. Kittler, and R. Cipolla, Discriminative learning and recognition of image set classes using canonical correlations, IEEE Transactions on Pattern Analysis and Machine Intelligence, 29 (2007), pp. 1005–1018.
[94] P. Koehn, Europarl: A parallel corpus for statistical machine translation, in Proceedings of the Tenth Machine Translation Summit, 2005, pp. 79–86.
[95] T. Kolenda, L. K. Hansen, J. Larsen, and O. Winther, Independent component analysis for understanding multimedia content, in Proceedings of the IEEE Workshop on Neural Networks for Signal Processing XII, H. Bourlard, T. Adali, S. Bengio, J. Larsen, and S. Douglas, eds., 2002, pp. 757–766.
[96] G. Kowalski and M. Maybury, Information Storage and Retrieval Systems: Theory and Implementation, Kluwer Academic Publishers, Norwell, MA, USA, 2nd ed., 2000.
[97] M. Kuss and T. Graepel, The geometry of kernel canonical correlation analysis, tech. report, Max Planck Institute for Biological Cybernetics, Germany, 2003.
[98] M. Lai and W. Yin, Augmented ℓ1 and nuclear-norm models with a globally linearly convergent algorithm, tech. report, Rice CAAM, 2012.
[99] P. Lai and C. Fyfe, Kernel and nonlinear canonical correlation analysis, International Journal of Neural Systems, 10 (2001), pp. 365–374.
[100] J. Lee and M. Verleysen, Nonlinear Dimensionality Reduction, Springer, 2007.
[101] Q. Mai, H. Zou, and M. Yuan, A direct approach to sparse discriminant analysis in ultra-high dimensions, Biometrika, 99 (2012), pp. 29–42.
[102] T. Melzer, M. Reiter, and H. Bischof, Nonlinear feature extraction using generalized canonical correlation analysis, in Proceedings of the International Conference on Artificial Neural Networks, 2001, pp. 353–360.
[103] L. Merchante, Y. Grandvalet, and G. Govaert, An efficient approach to sparse linear discriminant analysis, in Proceedings of the 29th International Conference on Machine Learning, 2012.
[104] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, Fisher discriminant analysis with kernels, in Neural Networks for Signal Processing IX, IEEE, 1999, pp. 41–48.

[105] B. Moghaddam, Y. Weiss, and S. Avidan, Generalized spectral bounds for sparse LDA, in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 641–648.
[106] M. Momma, Efficient computations via scalable sparse kernel partial least squares and boosted latent features, in Proceedings of the 11th ACM International Conference on Knowledge Discovery in Data Mining, 2005, pp. 654–659.
[107] M. Momma and K. P. Bennett, Sparse kernel partial least squares regression, in Proceedings of the 16th International Conference on Computational Learning Theory, B. Schölkopf and M. K. Warmuth, eds., 2003, pp. 216–230.
[108] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, An introduction to kernel-based learning algorithms, IEEE Transactions on Neural Networks, 12 (2001), pp. 181–202.
[109] B. K. Natarajan, Sparse approximate solutions to linear systems, SIAM Journal on Computing, 24 (1995), pp. 227–234.
[110] Y. Nesterov, A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2), Doklady Akademii Nauk SSSR, 269 (1983), pp. 543–547.
[111] M. K. Ng, L. Liao, and L. Zhang, On sparse linear discriminant analysis algorithm for high-dimensional data classification, Numerical Linear Algebra with Applications, 18 (2011), pp. 223–236.
[112] S. Osher, Y. Mao, B. Dong, and W. Yin, Fast linearized Bregman iteration for compressive sensing and sparse denoising, Communications in Mathematical Sciences, 8 (2010), pp. 93–111.
[113] C. H. Park and H. Park, A comparison of generalized linear discriminant analysis algorithms, Pattern Recognition, 41 (2008), pp. 1083–1097.
[114] E. Parkhomenko, D. Tritchler, and J. Beyene, Sparse canonical correlation analysis with application to genomic data integration, Statistical Applications in Genetics and Molecular Biology, 8 (2009), Issue 1, Article 1.
[115] P. Rai and H. Daumé III, Multi-label prediction via sparse infinite CCA, in Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2009.
[116] B. Recht, M. Fazel, and P. A. Parrilo, Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, SIAM Review, 52 (2010), pp. 471–501.
[117] R. Bellman, Adaptive Control Processes: A Guided Tour, Princeton University Press, 1961.
[118] R. Rosipal and L. J. Trejo, Kernel partial least squares regression in reproducing kernel Hilbert space, Journal of Machine Learning Research, 2 (2001), pp. 97–123.
[119] S. T. Roweis and L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science, 290 (2000), pp. 2323–2326.
[120] G. Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1986.
[121] L. K. Saul and S. T. Roweis, Think globally, fit locally: Unsupervised learning of low dimensional manifolds, Journal of Machine Learning Research, 4 (2003), pp. 119–155.
[122] B. Schölkopf, S. Mika, C. J. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. J. Smola, Input space versus feature space in kernel-based methods, IEEE Transactions on Neural Networks, 10 (1999), pp. 1000–1017.

[123] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, 2002.
[124] B. Schölkopf, A. J. Smola, and K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation, 10 (1998), pp. 1299–1319.
[125] S. Shalev-Shwartz, N. Srebro, and T. Zhang, Trading accuracy for sparsity in optimization problems with sparsity constraints, SIAM Journal on Optimization, 20 (2010), pp. 2807–2832.
[126] J. Shao, Y. Wang, X. Deng, and S. Wang, Sparse linear discriminant analysis by thresholding for high-dimensional data, The Annals of Statistics, 39 (2011), pp. 1241–1265.
[127] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
[128] H. Shen and J. Z. Huang, Sparse principal component analysis via regularized low rank matrix approximation, Journal of Multivariate Analysis, 99 (2008), pp. 1015–1034.
[129] B. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall/CRC, 1986.
[130] B. K. Sriperumbudur, D. A. Torres, and G. R. G. Lanckriet, Sparse eigen methods by D.C. programming, in Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 831–838.
[131] B. K. Sriperumbudur, D. A. Torres, and G. R. G. Lanckriet, A majorization-minimization approach to the sparse generalized eigenvalue problem, Machine Learning, 85 (2011), pp. 3–39.
[132] N. Subrahmanya and Y. C. Shin, Sparse multiple kernel learning for signal processing applications, IEEE Transactions on Pattern Analysis and Machine Intelligence, 32 (2010), pp. 788–798.

[133] M. Sugiyama, Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis, Journal of Machine Learning Research, 8 (2007), pp. 1027–1061.
[134] L. Sun, S. Ji, and J. Ye, Canonical correlation analysis for multilabel classification: A least-squares formulation, extensions, and analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, 33 (2011), pp. 194–200.
[135] D. Swets and J. Weng, Using discriminant eigenfeatures for image retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18 (1996), pp. 831–836.
[136] L. Tan and C. Fyfe, Sparse kernel canonical correlation analysis, in Proceedings of the 9th European Symposium on Artificial Neural Networks, 2001, pp. 335–340.
[137] J. B. Tenenbaum, Mapping a manifold of perceptual observations, in Advances in Neural Information Processing Systems, vol. 10, MIT Press, 1998, pp. 682–688.
[138] R. Tibshirani, Regression shrinkage and selection via the Lasso, Journal of the Royal Statistical Society, Series B, 58 (1996), pp. 267–288.
[139] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight, Sparsity and smoothness via the fused Lasso, Journal of the Royal Statistical Society, Series B, (2005), pp. 91–108.
[140] M. E. Tipping, Sparse Bayesian learning and the relevance vector machine, Journal of Machine Learning Research, 1 (2001), pp. 211–244.
[141] D. Torres, B. Sriperumbudur, and G. Lanckriet, Finding musically meaningful words by sparse CCA, in NIPS Workshop on Music, the Brain and Cognition, 2007.

[142] J. A. Tropp, Just relax: Convex programming methods for subset selection and sparse approximation, tech. report, Caltech, 2004.
[143] J.-P. Vert and M. Kanehisa, Graph-driven features extraction from microarray data using diffusion kernels and kernel CCA, in Advances in Neural Information Processing Systems, 2003.
[144] A. Vinokourov, J. Shawe-Taylor, and N. Cristianini, Inferring a semantic representation of text via cross-language correlation analysis, in Advances in Neural Information Processing Systems, 2003.
[145] S. Waaijenborg, P. C. de Witt Hamer, and A. H. Zwinderman, Quantifying the association between gene expressions and DNA-markers by penalized canonical correlation analysis, Statistical Applications in Genetics and Molecular Biology, 7 (2008), Issue 1, Article 3.
[146] G. Wahba, Spline Models for Observational Data, vol. 59 of CBMS-NSF Regional Conference Series in Applied Mathematics, SIAM, Philadelphia, 1990.
[147] G. Wahba, Support vector machines, reproducing kernel Hilbert spaces, and randomized GACV, in Advances in Kernel Methods: Support Vector Learning, MIT Press, 1999, pp. 69–88.
[148] J. A. Wegelin, A survey of partial least squares (PLS) methods, with emphasis on the two-block case, tech. report, University of Washington, 2000.
[149] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang, A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization and continuation, SIAM Journal on Scientific Computing, 32 (2010), pp. 1832–1857.
[150] A. Wiesel, M. Kliger, and A. O. Hero, A greedy approach to sparse canonical correlation analysis, http://arxiv.org/abs/0801.2748, 2008.

[151] D. M. Witten and R. Tibshirani, Extensions of sparse canonical correlation analysis with applications to genomic data, Statistical Applications in Genetics and Molecular Biology, 8 (2009), Issue 1, Article 28.
[152] D. M. Witten and R. Tibshirani, Penalized classification using Fisher's linear discriminant, Journal of the Royal Statistical Society, Series B (Statistical Methodology), 73 (2011), pp. 753–772.
[153] D. M. Witten, R. Tibshirani, and T. Hastie, A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis, Biostatistics, 10 (2009), pp. 515–534.
[154] K. J. Worsley, J. B. Poline, K. J. Friston, and A. C. Evans, Characterizing the response of PET and fMRI data using multivariate linear models, NeuroImage, 6 (1997), pp. 305–319.
[155] S. Wright, R. Nowak, and M. Figueiredo, Sparse reconstruction by separable approximation, IEEE Transactions on Signal Processing, 57 (2009), pp. 2479–2493.
[156] M. Wu, B. Schölkopf, and G. Bakır, A direct method for building sparse kernel learning algorithms, Journal of Machine Learning Research, 7 (2006), pp. 603–624.
[157] M. Wu, L. Zhang, Z. Wang, D. Christiani, and X. Lin, Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection, Bioinformatics, 25 (2009), pp. 1145–1151.
[158] Y. Yamanishi, J. P. Vert, A. Nakaya, and M. Kanehisa, Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis, Bioinformatics, 19 (2003), pp. i323–i330.

[159] J. Yang and Y. Zhang, Alternating direction algorithms for ℓ1-problems in compressive sensing, SIAM Journal on Scientific Computing, 33 (2011), pp. 250–278.
[160] J. Ye, Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems, Journal of Machine Learning Research, 6 (2005), pp. 483–502.
[161] J. Ye, Least squares linear discriminant analysis, in Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 1087–1094.
[162] J. Ye, R. Janardan, V. Cherkassky, T. Xiong, J. Bi, and C. Kambhamettu, Efficient model selection for regularized linear discriminant analysis, in Proceedings of the 15th ACM International Conference on Information and Knowledge Management, 2006.
[163] J. Ye, R. Janardan, Q. Li, and H. Park, Feature extraction via generalized uncorrelated linear discriminant analysis, IEEE Transactions on Knowledge and Data Engineering, 18 (2006), pp. 1312–1321.
[164] J. Ye and Q. Li, A two-stage linear discriminant analysis via QR-decomposition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27 (2005), pp. 929–941.
[165] J. Ye, T. Li, T. Xiong, and R. Janardan, Using uncorrelated discriminant analysis for tissue classification with gene expression data, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1 (2004), pp. 181–190.
[166] J. Ye and T. Xiong, Computational and theoretical analysis of null space and orthogonal linear discriminant analysis, Journal of Machine Learning Research, 7 (2006), pp. 1183–1204.
[167] W. Yin, Analysis and generalizations of the linearized Bregman method, SIAM Journal on Imaging Sciences, 3 (2010), pp. 856–877.
[168] W. Yin, S. Osher, D. Goldfarb, and J. Darbon, Bregman iterative algorithms for ℓ1-minimization with applications to compressed sensing, SIAM Journal on Imaging Sciences, 1 (2008), pp. 143–168.
[169] H. Yu and J. Yang, A direct LDA algorithm for high-dimensional data with application to face recognition, Pattern Recognition, 34 (2001), pp. 2067–2070.
[170] M. Yuan and Y. Lin, Model selection and estimation in regression with grouped variables, Journal of the Royal Statistical Society, Series B, 68 (2006), pp. 49–67.
[171] S. Yun and K.-C. Toh, A coordinate gradient descent method for ℓ1-regularized convex minimization, Computational Optimization and Applications, 48 (2011), pp. 273–307.
[172] W. M. Zheng, L. Zhao, and C. R. Zou, An efficient algorithm to solve the small sample size problem for LDA, Pattern Recognition, 37 (2004), pp. 1077–1079.
[173] H. Zou and T. Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society, Series B, 67 (2005), pp. 301–320.
[174] H. Zou, T. Hastie, and R. Tibshirani, Sparse principal component analysis, Journal of Computational and Graphical Statistics, 15 (2006), pp. 265–286.

Appendix A

Data Sets

In the numerical experiments of this thesis we have used many different types of data to evaluate our algorithms, including gene expression data and image data. For convenience and future reference, we describe the acquisition and some important characteristics of these data sets in this appendix.

Gene Expression Data

Colon: The colon cancer data set contains expression measurements of 40 tumor and 22 normal colon tissues (k = 2) for 6,500 human genes, measured using the Affymetrix technology. A selection of the 2,000 genes with the highest minimal intensity across the samples was made and further processed in [44]; a sketch of this filtering step is given below. The data set is available at http://stat.ethz.ch/~dettling/bagboost.html.

Leukemia: The leukemia data set consists of samples from patients with either acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML), that is, k = 2. The data set can be downloaded from http://stat.ethz.ch/~dettling/bagboost.html.
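The filtering described for the Colon data, keeping the 2,000 genes with the highest minimal intensity across samples, amounts to ranking the row minima of the expression matrix. The following is a minimal sketch of that step, not the exact preprocessing of [44]: the use of NumPy, the genes-by-samples layout, and the synthetic numbers are our own assumptions.

import numpy as np

def select_top_genes(expr, n_genes=2000):
    # expr: array of shape (num_genes, num_samples); keep the n_genes rows
    # whose smallest value across samples is largest.
    min_intensity = expr.min(axis=1)             # minimal intensity of each gene
    keep = np.argsort(min_intensity)[-n_genes:]  # indices of the n_genes largest minima
    return expr[np.sort(keep), :]                # preserve the original gene order

# Toy illustration with the dimensions quoted for the Colon data:
# 6,500 genes measured on 62 tissues (40 tumor + 22 normal); values are synthetic.
rng = np.random.default_rng(0)
expr = rng.lognormal(mean=6.0, sigma=1.0, size=(6500, 62))
X = select_top_genes(expr)
print(X.shape)                                   # (2000, 62)

Ranking by the row minima, rather than by variance, is exactly the "highest minimal intensity" criterion quoted above; ties are broken arbitrarily by the sort.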

Prostate: The prostate cancer raw data comprise the expression profiles of 52 prostate tumor and 50 non-tumor prostate samples, obtained using the Affymetrix technology. The data were processed in [44], and the data set is available at http://stat.ethz.ch/~dettling/bagboost.html.

Lymphoma: Lymphoma is a data set of the three most prevalent adult lymphoid malignancies. The data set is available at http://stat.ethz.ch/~dettling/bagboost.html.

SRBCT: The SRBCT (Small Round Blue Cell Tumor) data set has 2,308 genes and 63 experimental conditions: 8 Burkitt lymphoma (BL), 23 Ewing sarcoma (EWS), 12 neuroblastoma (NB), and 20 rhabdomyosarcoma (RMS) samples. The data set is available at http://stat.ethz.ch/~dettling/bagboost.html.

Brain: The brain tumor data set contains n = 42 microarray gene expression profiles from k = 5 different tumors of the central nervous system, namely 10 medulloblastomas, 10 malignant gliomas, 10 atypical teratoid/rhabdoid tumors (AT/RTs), 8 primitive neuro-ectodermal tumors (PNETs), and 4 human cerebella. The raw data were generated using the Affymetrix technology, and the processed data set is available at http://stat.ethz.ch/~dettling/bagboost.html.

Image Data

Yale: The Yale database contains 165 grayscale images in GIF format of 15 individuals. There are 11 images per subject, one per facial expression or configuration: center-light, w/glasses, happy, left-light, w/no glasses, normal, right-light, sad, sleepy, surprised, and wink. Each image of the data set Yale64×64 was downsampled to 64 × 64 pixels. The Yale face data are available at http://cvc.yale.edu/projects/yalefaces/yalefaces.html.

UMIST: The UMIST Face Database consists of 575 images of 20 individuals (mixed race/gender/appearance). Each individual is shown in a range of poses from profile to frontal views, and images are numbered consecutively as they were taken. The files are all in PGM format, approximately 220 × 220 pixels with 256 grey levels. In our experiments, each image in UMIST was downsampled to 112 × 92 pixels. The UMIST face data are available at http://www.sheffield.ac.uk/eee/research/iel/research/face.

ORL: The ORL data set contains ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, with varying lighting, facial expressions (open/closed eyes, smiling/not smiling), and facial details (glasses/no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement). The files are in PGM format, and the size of each image is 92 × 112 pixels, with 256 grey levels per pixel. The database can be retrieved from http://www.cl.cam.ac.uk/Research/DTG/attarchive:pub/data/attfaces.tar.Z. In our experiments, each image was resized to 64 × 64 pixels for ORL64×64.

Palmprint: The PolyU Palmprint Database contains 7,752 grayscale images of 386 different palms in BMP format; it is available at http://www4.comp.polyu.edu.hk/~biometrics/. Around twenty samples were collected from each palm in two sessions, with about 10 samples captured in each session. We selected 100 different palms from this database and used six samples from each, three captured in the first session and three in the second. All images were resized to 64 × 64 pixels.
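Several of the image data sets above (Yale64×64, ORL64×64, and Palmprint) were reduced to 64 × 64 grayscale images before being vectorized. The following is a minimal sketch of such preprocessing, assuming the images sit in a directory readable via a glob pattern and that the Pillow library is available; the file layout, function name, and choice of resampling filter are illustrative assumptions, not the tools actually used in this thesis.

import glob
import numpy as np
from PIL import Image   # Pillow; an assumed dependency, not prescribed by the thesis

def load_images(pattern, size=(64, 64)):
    # Read all images matching `pattern`, convert to 8-bit grayscale,
    # resize to `size`, and stack the flattened images as rows of a matrix.
    rows = []
    for path in sorted(glob.glob(pattern)):
        img = Image.open(path).convert("L")      # force grayscale
        img = img.resize(size, Image.BILINEAR)   # downsample
        rows.append(np.asarray(img, dtype=np.float64).ravel())
    if not rows:
        raise FileNotFoundError("no images match " + repr(pattern))
    return np.vstack(rows)

# Hypothetical usage, assuming the ORL images have been unpacked into ./orl/:
# X = load_images("./orl/*.pgm")   # one 4096-dimensional row per 64 x 64 image

Each flattened image then forms one row of the data matrix on which the dimensionality reduction methods of the earlier chapters operate.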
