Regularized sliced inverse regression for kernel models

BY QIANG WU
Department of Statistical Science, Institute for Genome Sciences & Policy, Department of Computer Science, Duke University, Durham NC 27708-0251, U.S.A.
[email protected]

FENG LIANG
Department of Statistics, University of Illinois at Urbana-Champaign, IL 61820, U.S.A.
[email protected]

AND SAYAN MUKHERJEE
Department of Statistical Science, Institute for Genome Sciences & Policy, Department of Computer Science, Duke University, Durham NC 27708-0251, U.S.A.
[email protected]


SUMMARY

We extend the sliced inverse regression (SIR) framework for dimension reduction using kernel models and regularization. The result is a nonlinear dimension reduction method that finds submanifolds containing the inverse regression curve rather than linear subspaces and can be applied to high-dimensional data. This is advantageous when the relevant predictor variables are concentrated on a nonlinear low-dimensional manifold, and it generalizes the SIR setting to nonlinear subspaces. We provide a simple algorithm for nonlinear dimension reduction, and a proof of consistency of the method under weak conditions is given. Simulations as well as applications to high-dimensional data are used to illustrate the efficacy of the method.

Some Key Words: Dimension reduction, sliced inverse regression, kernel methods, manifold learning.


1 INTRODUCTION

Dimension reduction is either explicitly or implicitly involved in any statistical or machine learning application with high-dimensional data. In the regression setting the following semi-parametric model summarizes the relation between a response variable Y and a p × 1 predictor variable X:

Y = F(β_1^T X, β_2^T X, ..., β_d^T X, ε),   (1)

where ε denotes a random error term independent of X, F(·) is an unknown function, and {β_i}_{i=1}^d are unknown vectors. The underlying assumption of this model is that the response variable Y depends on X only through d ≪ p linear combinations of the predictors. This can be stated by the following conditional independence property:

Y ⊥⊥ X | P_S X,   S = span(β_1, ..., β_d),   (2)

where P_S denotes the projection operator onto the d-dimensional subspace S.

The smallest subspace that satisfies (2) was termed the central mean subspace S_{Y|X} in Cook (1996, 1998) and the effective dimension reduction (e.d.r.) space in Li (1991). A variety of methods have been developed to estimate bases of S, including slicing regression (Duan and Li, 1991), sliced inverse regression (Li, 1991), and sliced average variance estimation (Cook and Weisberg, 1991). These methods have recently been extended to the setting with more covariates p than observations n by partial inverse regression (Li et al., 2007). The advantage of all of these methods is that they make almost no assumptions on the regression function F, except that F relates to X only through d linear combinations. As a result, if the data are concentrated on a nonlinear low-dimensional manifold, the estimated e.d.r. space either has very large dimension or fails to capture the intrinsic dimension of the manifold.

There has been much work in the machine learning literature on nonlinear dimension reduction and manifold learning using algorithms such as isometric feature mapping (ISOMAP) (Tenenbaum et al., 2000), local linear embedding (LLE) (Roweis and Saul, 2000), Hessian eigenmaps (Donoho and Grimes, 2003), and Laplacian eigenmaps (Belkin and Niyogi, 2003). None of these methods considers the regression setting where one is given the response variable.

In this paper we develop a kernel-based sliced inverse regression (kSIR) method to allow for nonlinear e.d.r. directions and generalize to the manifold setting. Though we focus our discussion on the regression setting, the extension to classification where Y takes categorical values is straightforward by utilizing link functions. In particular, the original SIR algorithm (Li, 1991) corresponds to Fisher's discriminant analysis in the classification setting. In Section 2 we describe the regularized kSIR method. Section 3 provides weak conditions as well as asymptotic rates of consistency, generalizing the consistency results for functional SIR (Ferré and Villa, 2006). The efficacy of the method is illustrated in Section 4 on synthetic data and two high-dimensional real classification problems. Summary comments are provided in Section 5.

2 KERNEL SLICED INVERSE REGRESSION

Our generalization of SIR is based upon properties of reproducing kernel Hilbert spaces (RKHS) and specifically Mercer kernels (Mercer, 1909). We exploit the fact that a Mercer kernel can be used to induce a map from the predictor space to a possibly infinite-dimensional Hilbert space. Linear functions in this Hilbert space are nonlinear in the original predictor space, allowing us to extend the linear model component in (1) to a nonlinear model and to define a nonlinear e.d.r. space.


2.1 Mercer Kernels and Nonlinear E.D.R. Directions

A continuous, positive semi-definite kernel K on a compact space X is a Mercer kernel with the following RKHS (Mercer, 1909; König, 1986):

H_K = { f : f(x) = ∑_{j∈Λ} a_j ψ_j(x) with ∑_{j∈Λ} a_j^2 / λ_j < ∞ },

where {ψ_j} and {λ_j} are the orthonormal eigenfunctions and the corresponding non-increasing eigenvalues of the integral operator with kernel K on L^2(X, μ),

λ_j ψ_j(x) = ∫_X K(x, u) ψ_j(u) μ(du),

and Λ := {j : λ_j > 0} determines the dimension d_K of the RKHS. Given the eigenfunctions we define the map φ : X → H_K,

φ(X) = (ψ_1(X), ψ_2(X), ..., ψ_{d_K}(X)).

We can use this map to generalize the linear model component in (1) to the following nonlinear model:

Y = F(⟨β_1, φ(X)⟩_K, ..., ⟨β_d, φ(X)⟩_K, ε),   (3)

where ⟨·, ·⟩_K is the RKHS inner product and the β_i are vectors in the RKHS of dimension d_K. This model implies that the response variable Y depends on X only through d linear combinations of φ(X).
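To make the feature map concrete, the following short sketch (our illustration, not part of the original text) writes out an explicit six-dimensional feature map for the quadratic kernel K(u, v) = (1 + u^T v)^2 on R^2 and checks numerically that ⟨φ(u), φ(v)⟩ equals K(u, v); for kernels such as the Gaussian the map is infinite-dimensional and only kernel values are ever computed.

```python
import numpy as np

def quadratic_kernel(u, v):
    """Mercer kernel K(u, v) = (1 + u'v)^2."""
    return (1.0 + u @ v) ** 2

def quadratic_feature_map(x):
    """Explicit feature map phi on R^2 satisfying <phi(u), phi(v)> = (1 + u'v)^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2.0) * x1, np.sqrt(2.0) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2.0) * x1 * x2])

rng = np.random.default_rng(0)
u, v = rng.normal(size=2), rng.normal(size=2)
# kernel value and inner product of mapped points agree
assert np.isclose(quadratic_kernel(u, v),
                  quadratic_feature_map(u) @ quadratic_feature_map(v))
```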

The following proposition extends the theoretical foundation of SIR to this nonlinear setting.

Proposition 1. Assume that the model given in (3) holds and that for any f ∈ H_K the conditional mean

E[⟨f, φ(X)⟩_K | ⟨β_1, φ(X)⟩_K, ..., ⟨β_d, φ(X)⟩_K]

is linear in ⟨β_1, φ(X)⟩_K, ..., ⟨β_d, φ(X)⟩_K. Then the centered inverse regression curve is contained in the span of the projections of the covariance operator or matrix, Σ_{φ(X)} = cov(φ(X)), onto the e.d.r. directions,

E[φ(X) | y] − E[φ(X)] ∈ span(Σ_{φ(X)} β_1, ..., Σ_{φ(X)} β_d).

The space S = span(β_1, ..., β_d) is the nonlinear e.d.r. space.

Proposition 1 is a straightforward extension of the multivariate case in Li (1991) to a Hilbert space and is similar to the setting of functional regression developed in Ferré and Yao (2003). An immediate consequence of this proposition is that the eigenvectors corresponding to the d non-zero eigenvalues of Γ = cov(E[φ(X)|Y]) with respect to Σ_{φ(X)} are e.d.r. directions. The essence of the kSIR method is to apply the original SIR method to the map φ(X).

2.2 Estimating the E.D.R. Directions

Given n observations {(X_1, Y_1), ..., (X_n, Y_n)}, our objective is to provide an estimate of the e.d.r. directions {β̂_1, ..., β̂_d} as defined in (3). We first formulate a procedure almost identical to the standard SIR procedure except that it operates in the mapped space φ(X); this highlights the immediate relation between the SIR and kSIR procedures (a finite-dimensional code sketch is given after the three steps below).

1. Without loss of generality we assume that the mapped predictor variables have zero mean, E[φ(x)] = 0. We compute a sample estimate of the covariance matrix or operator. Define Φ = [φ(x_1), ..., φ(x_n)] as the data matrix in the mapped space, whose i-th column is φ(x_i). The sample covariance is

   Σ̂ = (1/n) Φ Φ^T.

2. Bin the Y variables into m groups G_1, ..., G_m and compute the mean vector of the mapped predictor variables for each group,

   Ψ_i = (1/n_i) ∑_{j∈G_i} φ(x_j),   i = 1, ..., m.

   Compute the sample between-group covariance matrix

   Γ̂ = ∑_{i=1}^m (n_i/n) Ψ_i Ψ_i^T.

3. Estimate the SIR directions β̂_i by solving the generalized eigen-problem

   Γ̂ β = λ Σ̂ β.   (4)
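When the feature space is finite-dimensional (equivalently, when φ(x) = x and the kernel is linear), the three steps above can be carried out directly. The sketch below is our illustration of that finite-dimensional case, not the authors' code: it slices the response, forms the sample covariance Σ̂ and the between-group covariance Γ̂, and solves the generalized eigen-problem (4) with scipy.

```python
import numpy as np
from scipy.linalg import eigh

def linear_sir(X, y, n_slices=20, n_directions=2):
    """Classical SIR with phi(x) = x: solve Gamma_hat beta = lambda Sigma_hat beta, eq. (4)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                  # step 1: center the predictors
    Sigma_hat = Xc.T @ Xc / n                # sample covariance

    # step 2: slice y into groups and form the between-group covariance
    Gamma_hat = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), n_slices):
        mean_i = Xc[idx].mean(axis=0)
        Gamma_hat += (len(idx) / n) * np.outer(mean_i, mean_i)

    # step 3: generalized eigen-problem; eigh returns eigenvalues in ascending order
    evals, evecs = eigh(Gamma_hat, Sigma_hat)
    return evecs[:, ::-1][:, :n_directions]  # leading e.d.r. directions
```

Note that this requires Σ̂ to be non-singular, which is exactly what fails in the high-dimensional settings discussed next; the kernel formulation that follows avoids forming Σ̂ altogether.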

This procedure is computationally difficult or impossible because the estimate of the covariance operator or matrix has the dimensionality of the RKHS, which may be infinite. In addition, the covariance matrix is typically not of full rank, so the solution of the eigen-decomposition is not unique. The same fundamental problems exist for any method that requires estimates of the SIR directions from high-dimensional data. The kSIR model given by equation (3) requires not the SIR directions themselves but the projections onto the SIR directions,

v_1 = ⟨β_1, φ(X)⟩_K, ..., v_d = ⟨β_d, φ(X)⟩_K,

where (v_1, ..., v_d) are the kSIR variates. For Mercer kernels the computation of the kSIR variates does not require an estimate of the covariance of φ(x). The key quantity in this formulation is the Gram matrix K defined by the kernel K(·, ·) corresponding to the RKHS,

K_ij = K(x_i, x_j),   i, j = 1, ..., n.

Given the Gram matrix K and the matrix J with J_ij = 1/n_m if observations i and j are in the m-th group, which consists of n_m observations, and J_ij = 0 otherwise, we define the following generalized eigen-problem:

K J K c = λ K^2 c,   (5)

where c is an n-dimensional vector. The projection of a point φ(x) onto the estimate of the i-th SIR direction is given by

v_i = ĉ_i^T [K(x, x_1), ..., K(x, x_n)]^T = ĉ_i^T K_x,

where ĉ_i is the i-th eigenvector of equation (5). The following proposition states the equivalence between the generalized eigen-problems in (5) and (4); see Appendix A for details of the derivation.

Proposition 2. Given observations {(x_1, y_1), ..., (x_n, y_n)}, the estimates of the kSIR variates defined by (4),

(v̂_1, ..., v̂_d) = (⟨β̂_1, φ(X)⟩_K, ..., ⟨β̂_d, φ(X)⟩_K),

and those defined by (5),

(v̂_1, ..., v̂_d) = (ĉ_1^T K_x, ..., ĉ_d^T K_x),

are equal provided Σ̂ is invertible. When Σ̂ is not invertible, the conclusion still holds with the β̂_i taken as eigenvectors modulo the null space of Σ̂.

2.3 Stability of the Eigen-decomposition

The eigen-decomposition (5) will often be ill-conditioned, resulting in over-fitting as well as unstable estimates of the e.d.r. space. This can be addressed either by thresholding eigenvalues of the estimated covariance matrix Σ̂ or by adding a regularization term to (5).

Theoretically, restricting the subspace to eigenvectors corresponding to non-zero eigenvalues of Σ̂ makes the generalized eigen-decomposition in (4) well-defined. However, it is often still poorly conditioned, and in practice this approach does not work well. Instead we regularize (5),

K J K c = λ (K^2 + sI) c,   (6)

which results in robust estimates of the e.d.r. space that generalize well. This suggests the following simple algorithm, sketched in code below:

1. Given the predictor variables {x_1, ..., x_n}, compute the kernel matrix K with K_ij = K(x_i, x_j).

2. Compute the centered kernel matrix K̃,

   K̃_ij = K_ij − K_i· − K_·j + K_··,

   where K_·· is the mean of the kernel matrix and K_i· and K_·j are the means of the i-th row and j-th column of the kernel matrix.

3. Compute the weight matrix J, defined above, by slicing the response variables {y_1, ..., y_n}.

4. Solve the generalized eigenvector problem

   K̃ J K̃ c = λ (K̃^2 + sI) c

   for c. We use cross-validation to estimate the regularization parameter s.

This procedure is computationally advantageous even in the case of linear models when p ≫ n, because the eigen-decomposition involves n × n matrices rather than the p × p matrices of the standard SIR formulation.
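The following Python sketch is our illustration of the four steps above (helper names are ours, not the authors'); it centers the Gram matrix, builds the slice matrix J, solves the regularized eigen-problem of step 4 with scipy's generalized symmetric eigensolver, and projects points through v_i = ĉ_i^T K_x. In practice s would be chosen by cross-validation as described in the text.

```python
import numpy as np
from scipy.linalg import eigh

def gaussian_kernel(A, B, sigma):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def center_gram(K):
    """Step 2: K_tilde[i, j] = K[i, j] - K[i, .] - K[., j] + K[., .]."""
    row = K.mean(axis=1, keepdims=True)
    col = K.mean(axis=0, keepdims=True)
    return K - row - col + K.mean()

def slice_matrix(y, n_slices):
    """Step 3: J[i, j] = 1 / n_m if observations i and j fall in the same slice of y."""
    n = len(y)
    J = np.zeros((n, n))
    for idx in np.array_split(np.argsort(y), n_slices):
        J[np.ix_(idx, idx)] = 1.0 / len(idx)
    return J

def kernel_sir(K, y, n_slices=20, n_directions=2, s=1e-3):
    """Step 4: solve K_tilde J K_tilde c = lambda (K_tilde^2 + s I) c and return
    the centered Gram matrix and the leading eigenvectors c_hat (columns)."""
    Kt = center_gram(K)
    J = slice_matrix(y, n_slices)
    A = Kt @ J @ Kt                          # left-hand side operator
    B = Kt @ Kt + s * np.eye(len(y))         # regularized right-hand side
    evals, evecs = eigh(A, B)                # eigenvalues in ascending order
    C = evecs[:, ::-1][:, :n_directions]
    return Kt, C

def ksir_variates(K_points_train, C):
    """Project points onto the estimated directions, v_i = c_hat_i' K_x, using the
    kernel values between the given points (rows) and the training points (columns)."""
    return K_points_train @ C
```

For training points the variates are simply K @ C; as in the text, the raw kernel vector K_x is used for projection, although one could also center it consistently with K̃.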

3 CONSISTENCY OF kSIR

Consistency of the kSIR method depends on whether the condition of Proposition 1 holds. This condition is analogous to the requirements for the consistency of functional SIR (Ferré and Yao, 2003; Ferré and Villa, 2006). Given this condition, the regularization method proposed here is consistent; see Appendix B for the proof.

Theorem 1. Assume E[K(X, X)^2] < ∞ and that the conditions in Proposition 1 hold.

Then for each i = 1, ..., d, the following holds:

‖⟨β̂_i, φ(·)⟩ − ⟨β_i, φ(·)⟩‖_K = o_p(1),

where ⟨β_i, φ(·)⟩ are the projections onto the e.d.r. directions and ⟨β̂_i, φ(·)⟩ are the projections onto the e.d.r. directions estimated by the regularization method. In addition, if the e.d.r. directions β_i depend only on a finite number of eigenvectors of the covariance operator Σ, the rate of convergence is O(n^{-1/4}).

We close with the following comments. To provide rates of convergence, conditions on the decay of the spectrum of the covariance operator are required, and if the e.d.r. space depends on eigenvectors of the covariance operator with small eigenvalues then the rate of convergence can be arbitrarily slow. The results in Ferré and Yao (2003) can be used to prove the consistency of thresholding eigenvalues under stronger and more complicated conditions. In this context we posit either that the regularization method is a better estimator or that the results in Ferré and Yao (2003) can be improved. In Ferré and Villa (2006) consistency for regularized functional SIR was shown; however, their method of regularization differs from ours and they required more complicated conditions for consistency.


4 EXAMPLES

4.1 Synthetic data set

A synthetic data set illustrates the advantage of a nonlinear inverse regression curve and the robustness of the method to the choice of kernel. The data consist of 400 samples drawn from the following two functions:

Y = X_1^2 + X_2^2 + ε,   (7)

Y = X_1 + X_2^2 + ε,   (8)

where ε ~ No(0, 0.2^2) i.i.d. and X ∈ R^5 with X ~ No(0, I_5), where I_5 is the 5-dimensional identity covariance. We used 20 slices or bins for the Y variable. We consider two kernels in the kSIR setting: the quadratic kernel K_quadratic(u, v) = (1 + u^T v)^2, which matches the functional form of the regressions, and the Gaussian kernel K_Gaus(u, v) = exp(−‖u − v‖^2 / 2σ^2), which is nonlinear but does not match the functional form of the regressions; we set σ to the median pairwise distance.

We first examine the results for equation (7). In Figure 1 we plot y versus the first variate, β_1^T x, for SIR as well as for kSIR using a linear, quadratic, and Gaussian kernel. The first observation is that, as expected, the performance of SIR and of kSIR with a linear kernel is almost identical and poor: linear features cannot explain the variance. The second observation is that the Gaussian kernel, which is not matched to the regression function, performs quite well. This indicates that the nonlinearity does not have to match the regression model exactly for the method to work well. The results for equation (8) are similar and are displayed in Figure 2. Both the Gaussian and quadratic kernels perform well. For SIR we examine the first two variates, β_1^T x and β_2^T x. The first variate contains some information but is incomplete, and the second variate does not explain the variation in the response variable.
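As a rough sketch of this experiment (ours, not the authors'; it reuses the hypothetical gaussian_kernel and kernel_sir helpers from the Section 2.3 sketch and follows the stated choices of 400 samples, 20 slices, and the median pairwise distance for σ), the data for model (7) can be generated and reduced as follows.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
n, p = 400, 5
X = rng.normal(size=(n, p))                # X ~ No(0, I_5)
eps = rng.normal(scale=0.2, size=n)        # eps ~ No(0, 0.2^2)
y = X[:, 0] ** 2 + X[:, 1] ** 2 + eps      # model (7); use X[:, 0] + X[:, 1] ** 2 for (8)

sigma = np.median(pdist(X))                # median pairwise distance heuristic
K = gaussian_kernel(X, X, sigma)           # helpers from the Section 2.3 sketch
Kt, C = kernel_sir(K, y, n_slices=20, n_directions=1, s=1e-3)
v1 = K @ C[:, 0]                           # first kSIR variate of each sample
# Plotting y against v1 should reveal the quadratic structure of (7), as in Figure 1.
```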


4.2 Digits data

The advantage of nonlinear e.d.r. directions over linear directions is illustrated with a 784-dimensional problem of classifying images of handwritten digits. The MNIST data set (Y. LeCun, http://yann.lecun.com/exdb/mnist/) contains 60,000 images of handwritten digits {0, 1, 2, ..., 9} as training data and 10,000 images as test data. Each image consists of p = 28 × 28 = 784 gray-scale pixel intensities. In our simulations, we randomly sampled 1000 images for training (100 samples for each digit) and computed 9 e.d.r. directions, since the data are naturally sliced into 10 bins corresponding to the digits. For comparison, this was done using our kSIR method with a linear as well as a Gaussian kernel. We projected the training data and the 10,000 test data onto these directions and then used a k-nearest neighbor classifier with k = 5 to classify the test data. We report the classification error over 100 iterations in Table 1. The Gaussian kernel performed much better than the linear kernel, and the accuracy for digits 2, 3, 5, and 8 improves dramatically with nonlinear structures.
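A rough sketch of this experiment is given below. It is our illustration only: it reuses the hypothetical helpers from the Section 2.3 sketch, assumes the sampled MNIST images are already loaded as arrays X_train (1000 × 784), y_train, X_test (10000 × 784), and y_test, performs a single run rather than 100 iterations, and uses an arbitrary, untuned bandwidth.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# X_train, y_train, X_test, y_test are assumed to hold the sampled MNIST data;
# gaussian_kernel, kernel_sir, and ksir_variates are the Section 2.3 sketch helpers.
sigma = 5.0                                            # illustrative, untuned bandwidth
Xtr, Xte = X_train / 255.0, X_test / 255.0             # scale pixels to [0, 1] (an assumption)

K = gaussian_kernel(Xtr, Xtr, sigma)
# with 100 images per digit, 10 slices of the sorted labels align with the digit classes
Kt, C = kernel_sir(K, y_train, n_slices=10, n_directions=9, s=1e-3)

Z_train = ksir_variates(K, C)                          # 9 kSIR variates per training image
Z_test = ksir_variates(gaussian_kernel(Xte, Xtr, sigma), C)

knn = KNeighborsClassifier(n_neighbors=5).fit(Z_train, y_train)
print("overall test error:", 1.0 - knn.score(Z_test, y_test))
```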

4.3 Gene expression data

This example illustrates the computational advantage of the kSIR method. In this data analysis, linear e.d.r. directions are sufficient to classify fourteen types of tumors given their gene expression profiles (Ramaswamy et al., 2001; Rifkin et al., 2003). The data consist of 144 training samples and 46 test samples from fourteen types of primary tumors; the dimensionality of the data is 16,063. The data analyses in both Ramaswamy et al. (2001) and Rifkin et al. (2003) indicate that variable or feature selection did not increase accuracy and in general decreased it. We applied the kSIR method to these data not to improve accuracy but to indicate the computational advantage of the method.

We first applied a linear support vector machine to these data and obtained a test error rate of 10/46 = 22%. We then applied kSIR to reduce the dimensionality from 16,063 to 13 and applied a support vector machine, with a linear and with a Gaussian kernel whose variance was set by cross-validation, to the projected data, obtaining test error rates of 10/46 = 22% for the Gaussian kernel and 13/46 = 28% for the linear kernel. In the analysis of Rifkin et al. (2003) the error rate for a linear support vector machine with fewer than 100 variables was between 35% and 40%.

digit     Gaussian kernel (%)     linear kernel (%)
0          2.7276 (± 0.0079)       4.8704 (± 0.0163)
1          1.5004 (± 0.0024)       2.9242 (± 0.0127)
2         10.3886 (± 0.0428)      19.2132 (± 0.0568)
3          8.4545 (± 0.0431)      17.2297 (± 0.0798)
4          7.8391 (± 0.0574)      13.2719 (± 0.1066)
5          8.7668 (± 0.0439)      21.4585 (± 0.0865)
6          4.7244 (± 0.0116)       8.1597 (± 0.0296)
7          8.8658 (± 0.0286)      13.5447 (± 0.0295)
8          9.8142 (± 0.0669)      19.8121 (± 0.0816)
9          7.7443 (± 0.0628)      15.3261 (± 0.0450)
average    7.0826 (± 0.0108)      13.5811 (± 0.0087)

Table 1: Results for classifying digits based on e.d.r. projections.

5 SUMMARY COMMENTS

The interest in manifold learning and nonlinear dimension reduction in both statistics and machine learning has led to a variety of statistical models and algorithms. However, these methods are developed in the unsupervised learning framework.


Therefore the estimated dimensions may not be optimal for a particular prediction problem. In this paper we develop a supervised nonlinear dimension reduction method by extending a classical model, sliced inverse regression, to the nonlinear setting using a kernel function and the corresponding reproducing kernel Hilbert space. The resulting procedure can be implemented by a simple algorithm, and we prove conditions under which the method is consistent.

From a theoretical point of view, kernel SIR can be viewed as linear SIR in a reproducing kernel Hilbert space. When this RKHS is a function space, kernel SIR shares similarities with functional SIR, developed by Ferré and his coauthors in a series of papers (Ferré and Yao, 2003, 2005; Ferré and Villa, 2006). In functional SIR the observable data are functions and the goal is to find linear e.d.r. directions for functional data analysis. In kernel SIR the observable data are usually not functions but are mapped into a function space in order to characterize nonlinear structure. Moreover, kernel SIR does not need the explicit mapping into the Hilbert space, only a Mercer kernel function defined on the original predictor space.

ACKNOWLEDGMENTS

We acknowledge support of the National Science Foundation (DMS-0732276 and DMS-0732260) and the National Institutes of Health (P50 GM 081883). Any opinions, findings and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or NIH.

APPENDIX A: EQUIVALENCE OF EIGEN-DECOMPOSITIONS

We derive the eigen-decomposition (5) in terms of the kernel function and prove Proposition 2.

Proof. For a Mercer kernel, K(x, z) = ⟨φ(x), φ(z)⟩_K, so K_x = ⟨Φ, φ(x)⟩_K and ĉ_i^T K_x = ⟨Φ ĉ_i, φ(x)⟩_K. Therefore it suffices to prove that β̂_i = Φ ĉ_i, which is equivalent to proving that if c is an eigenvector of (5), then β = Φ c is a solution of (4).

Suppose Φ has the spectral decomposition

Φ = U D V^T = Ū D̄ V̄^T,   (9)

where U = [u_1, ..., u_{d_K}], V = [v_1, ..., v_n], D has the d × d non-singular block D̄ in its upper-left corner and zeros elsewhere, Ū = [u_1, ..., u_d], and V̄ = [v_1, ..., v_d]. Suppose that c is an eigenvector of (5) with eigenvalue λ,

K J K c = λ K^2 c.   (10)

Notice that the Gram matrix K, with elements K_ij = ⟨φ(x_i), φ(x_j)⟩_K, is

K = Φ^T Φ = V D^T D V^T = V̄ D̄^2 V̄^T.

By this and the fact that V̄^T V̄ = I_d we see that (10) implies

V̄ D̄^2 V̄^T J V̄ D̄^2 V̄^T c = λ V̄ D̄^4 V̄^T c.   (11)

Let β = Φ c = Ū D̄ V̄^T c. By (11) and the fact that V̄^T V̄ = Ū^T Ū = I_d we have

Φ J Φ^T β = Ū D̄ V̄^T J V̄ D̄^2 V̄^T c
          = Ū D̄^{-1} V̄^T (V̄ D̄^2 V̄^T J V̄ D̄^2 V̄^T c)
          = λ Ū D̄^{-1} V̄^T V̄ D̄^4 V̄^T c
          = λ Ū D̄^3 V̄^T c
          = λ Ū D̄ V̄^T V̄ D̄ Ū^T Ū D̄ V̄^T c
          = λ Φ Φ^T β.

Since the sample between-group covariance matrix can be written as Γ̂ = (1/n) Φ J Φ^T and the sample covariance as Σ̂ = (1/n) Φ Φ^T, we have in fact proven Γ̂ β = λ Σ̂ β, and so β is an eigenvector of (4). This completes the proof.

APPENDIX B: PROOF OF CONSISTENCY

We provide the proof of Theorem 1. The result is based on Hilbert-space-valued variables and covariance operators, so we first state some of their properties; for details see Blanchard et al. (2007) and references therein. Given a separable Hilbert space H, the outer-product operator f ⊗ g for f, g ∈ H is defined as

(f ⊗ g)(h) = ⟨g, h⟩_H f,   for all h ∈ H.

Let Z be a random vector taking values in H satisfying E‖Z‖_H^2 < ∞. The covariance operator

Σ = E[(Z − EZ) ⊗ (Z − EZ)]

is self-adjoint, positive, and compact. Note that when d_K is infinite, all the matrices should be considered as operators.

For kSIR, H is the linear span of {φ(x), x ∈ X}, a Hilbert space isomorphic to H_K, and we take Z = φ(X). By Proposition 1, the true kSIR directions are given by the generalized eigen-decomposition problem

Γ β = λ Σ β,   (12)

where Γ = cov[E(Z|Y)]. In the following we prove Theorem 1 in two steps. In the first step we show that the eigen-decomposition problem

Σ̂ Γ̂ β = λ (Σ̂^2 + sI) β,   (13)

provides consistent estimates of the true kSIR directions, where Γ̂ and Σ̂ are the empirical estimates, defined in Section 2.2, of the between-group covariance Γ and the covariance operator Σ, s > 0 is the regularization parameter, and I is the identity operator. In the second step we show that (6) provides the same solution as (13).

By the compactness of Σ, the eigenvalues w_i and corresponding eigenfunctions e_i exist and

Σ = ∑_{i=1}^∞ w_i e_i ⊗ e_i.   (14)

We define the inverse as

Σ^{-1} = ∑_{i=1}^∞ w_i^{-1} e_i ⊗ e_i.

Note that this is well defined only on the subspace range(Σ). Instead of working with equations (12) and (13) we work with standard eigen-decompositions of the form

T β = λ β,   (15)

T̂_s β = λ β,   (16)

where T = Σ^{-1} Γ and T̂_s = (Σ̂^2 + sI)^{-1} Σ̂ Γ̂. It is easy to see that (12) is equivalent to (15) and that (13) is equivalent to (16). The following lemma ensures that T is well defined.

Lemma 1. Γ is a finite-rank operator with rank d, and there exist γ_i and u_i ∈ range(Σ) such that

Γ = ∑_{i=1}^d γ_i u_i ⊗ u_i   and   T = ∑_{i=1}^d γ_i Σ^{-1}(u_i) ⊗ u_i.

The following theorem shows that (16) provides consistent estimates for (15).

Theorem 2. Under the condition E[K(X, X)^2] < ∞, for each i = 1, ..., d the following hold:

|λ̂_{s,i} − λ_i| = o_p(1)   and   ‖β̂_{s,i} − β_i‖_H = o_p(1),

where {λ_i, β_i} are the eigenvalues and eigenvectors of equation (15) and {λ̂_{s,i}, β̂_{s,i}} are the eigenvalues and eigenvectors of equation (16). In addition, if the e.d.r. directions β_i depend only on a finite number of eigenvectors of the covariance operator, the rate of convergence is O(n^{-1/4}).

Proof. For N ≥ 1 the projection operator Π_N and its complement are defined as

Π_N = ∑_{i=1}^N e_i ⊗ e_i   and   Π_N^⊥ = I − Π_N = ∑_{i=N+1}^∞ e_i ⊗ e_i.

We will prove the following bound, which holds for all N ≥ 1:

‖T̂_s − T‖ = O_p(1/(s√n)) + (s/w_N^2) ∑_{j=1}^d ‖Π_N(v_j)‖_H + ∑_{j=1}^d ‖Π_N^⊥(v_j)‖_H,   (17)

where v_j = Σ^{-1} u_j. We will use the facts that (Ferré and Yao, 2003)

‖Σ̂ − Σ‖ = O_p(1/√n)   and   ‖Γ̂ − Γ‖ = O_p(1/√n).

Define

T_1 = (Σ̂^2 + sI)^{-1} Σ Γ   and   T_2 = (Σ^2 + sI)^{-1} Σ Γ.

Then

‖T̂_s − T‖ ≤ ‖T̂_s − T_1‖ + ‖T_1 − T_2‖ + ‖T_2 − T‖,

and we bound each term separately. For the first term observe that

‖T̂_s − T_1‖ ≤ ‖(Σ̂^2 + sI)^{-1}‖ ‖Σ̂ Γ̂ − Σ Γ‖ = O_p(1/(s√n)).

For the second term note that

T_1 = ∑_{j=1}^d γ_j ((Σ̂^2 + sI)^{-1} Σ u_j) ⊗ u_j   and   T_2 = ∑_{j=1}^d γ_j ((Σ^2 + sI)^{-1} Σ u_j) ⊗ u_j.

Therefore

‖T_1 − T_2‖ ≤ ∑_{j=1}^d γ_j ‖((Σ̂^2 + sI)^{-1} − (Σ^2 + sI)^{-1}) Σ u_j‖_H.

Since u_j ∈ range(Σ) there exists v_j such that u_j = Σ v_j. Then

((Σ̂^2 + sI)^{-1} − (Σ^2 + sI)^{-1}) Σ u_j = (Σ̂^2 + sI)^{-1} (Σ^2 − Σ̂^2) (Σ^2 + sI)^{-1} Σ^2 v_j.

This implies

‖T_1 − T_2‖ ≤ ∑_{j=1}^d γ_j ‖(Σ̂^2 + sI)^{-1} (Σ^2 − Σ̂^2) (Σ^2 + sI)^{-1} Σ^2‖ ‖v_j‖ = O_p(1/(s√n)).

For the third term the following holds:

‖T_2 − T‖ ≤ ∑_{j=1}^d γ_j ‖((Σ^2 + sI)^{-1} Σ − Σ^{-1}) u_j‖_H,

and for each j = 1, ..., d,

‖(Σ^2 + sI)^{-1} Σ u_j − Σ^{-1} u_j‖_H ≤ ‖(Σ^2 + sI)^{-1} Σ^2 v_j − v_j‖_H
  = ‖ ∑_{i=1}^∞ (w_i^2/(s + w_i^2) − 1) ⟨v_j, e_i⟩ e_i ‖_H
  = ( ∑_{i=1}^∞ s^2/(s + w_i^2)^2 ⟨v_j, e_i⟩^2 )^{1/2}
  ≤ (s/w_N^2) ( ∑_{i=1}^N ⟨v_j, e_i⟩^2 )^{1/2} + ( ∑_{i=N+1}^∞ ⟨v_j, e_i⟩^2 )^{1/2}
  = (s/w_N^2) ‖Π_N(v_j)‖_H + ‖Π_N^⊥(v_j)‖_H.

Combining these terms results in (17). Note that (17) implies ‖T̂_s − T‖ = o_p(1). The rate is O(n^{-1/4}) if the e.d.r. directions β_i depend only on a finite number of eigenvectors of the covariance operator, since ‖Π_N^⊥(v_j)‖_H = 0 if N is large enough. The consistency then follows by perturbation theory.

It remains to show that the kSIR algorithm (6) provides exactly the same solutions as (13). This is summarized in the following theorem.

Theorem 3. If {λ, c} is a solution of problem (6), then {λ, β} with β = Φ c is a solution of (13).

The proof is similar to that of Proposition 2 in Appendix A; we omit the details.

REFERENCES

Belkin, M. and P. Niyogi (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15(6), 1373–1396.

Blanchard, G., O. Bousquet, and L. Zwald (2007). Statistical properties of kernel principal component analysis. Mach. Learn. 66, 259–294.

Cook, R. (1996). Graphics for regressions with a binary response. J. Amer. Statist. Assoc. 91, 983–992.

Cook, R. (1998). Regression Graphics: Ideas for Studying Regressions Through Graphics. Wiley.

Cook, R. and S. Weisberg (1991). Discussion of Li (1991). J. Amer. Statist. Assoc. 86, 328–332.

Donoho, D. and C. Grimes (2003). Hessian eigenmaps: new locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences 100, 5591–5596.

Duan, N. and K. Li (1991). Slicing regression: a link-free regression method. Ann. Statist. 19(2), 505–530.

Ferré, L. and N. Villa (2006). Multilayer perceptron with functional inputs: an inverse regression approach. Scandinavian Journal of Statistics 33(4), 807–823.

Ferré, L. and A. Yao (2003). Functional sliced inverse regression analysis. Statistics 37(6), 475–488.

Ferré, L. and A. Yao (2005). Smoothed functional inverse regression. Statist. Sinica 15(3), 665–683.

König, H. (1986). Eigenvalue Distribution of Compact Operators, Volume 16 of Operator Theory: Advances and Applications. Basel: Birkhäuser Verlag.

Li, K. (1991). Sliced inverse regression for dimension reduction (with discussion). J. Amer. Statist. Assoc. 86, 316–342.

Li, L., R. Cook, and C.-L. Tsai (2007). Partial inverse regression. Biometrika 94(3), 615–625.

Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society, London A 209, 415–446.

Ramaswamy, S., P. Tamayo, R. Rifkin, S. Mukherjee, C. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J. Mesirov, T. Poggio, W. Gerald, M. Loda, E. Lander, and T. Golub (2001). Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Academy of Sciences U.S.A. 98, 149–154.

Rifkin, R., S. Mukherjee, P. Tamayo, S. Ramaswamy, C.-H. Yeang, M. Angelo, M. Reich, T. Poggio, E. Lander, T. Golub, and J. Mesirov (2003). An analytical method for multiclass molecular cancer classification. SIAM Review 45(4), 706–723.

Roweis, S. and L. Saul (2000). Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326.

Tenenbaum, J., V. de Silva, and J. Langford (2000). A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323.


[Figure 1: four scatter-plot panels titled "kSIR Gaussian kernel", "kSIR Quadratic kernel", "kSIR linear kernel", and "SIR", each plotting y against β_1^T x.]

Figure 1: Scatter plot of y versus the first variate, β_1^T x, for SIR as well as kSIR using a linear, quadratic, and Gaussian kernel.

[Figure 2: four scatter-plot panels titled "kSIR Gaussian kernel", "kSIR Quadratic kernel", "SIR", and "SIR", plotting y against β_1^T x (and against β_2^T x for the last SIR panel).]

Figure 2: Scatter plot of y versus the first variate, β_1^T x, for kSIR using quadratic and Gaussian kernels in the top two panels. The bottom two panels display the scatter plots for the first two SIR variates.
