Robust ridge regression for high-dimensional data

Ricardo A. Maronna
University of La Plata and C.I.C.P.B.A.

Abstract

Ridge regression, being based on the minimization of a quadratic loss function, is sensitive to outliers. Current proposals for robust ridge regression estimators are sensitive to "bad leverage observations", cannot be employed when the number of predictors p is larger than the number of observations n, and have a low robustness when the ratio p/n is large. In this paper a ridge regression estimate based on Yohai's (1987) repeated M estimation ("MM estimation") is proposed. It is a penalized regression MM estimator, in which the quadratic loss is replaced by an average of ρ(r_i/σ̂), where the r_i are the residuals and σ̂ is the residual scale from an initial estimator, which is a penalized S estimator, and ρ is a bounded function. The MM estimator can be computed for p > n and is robust for large p/n. A fast algorithm is proposed. The advantages of the proposed approach over its competitors are demonstrated through both simulated and real data.

Key words and phrases: MM estimate; Shrinking; S estimate.

Note: Supplemental materials are available on line.

Ricardo Maronna (E-mail: [email protected]) is professor, Mathematics Department, University of La Plata, C.C. 172, La Plata 1900, Argentina; and researcher at C.I.C.P.B.A.

1 Introduction

Ridge regression, first proposed by Hoerl and Kennard (1970), is a useful tool for improving prediction in regression situations with highly correlated predictors. For a regression data set (X, y) with X ∈ R^(n×p) and y ∈ R^n, the ridge estimator can be defined as

    β̂ = (β̂_0, β̂_1')' = arg min {L(X, y, β) : β_0 ∈ R, β_1 ∈ R^p},    (1)

where the loss function L is

    L(X, y, β) = ||r(β)||² + λ ||β_1||²    (2)

and

    r = (r_1, ..., r_n)' = y - β_0 1_n - X β_1    (3)

is the residual vector, with 1_n = (1, 1, ..., 1)', where in general A' denotes the transpose of A. Here λ is a penalty parameter that must be estimated in order to optimize the prediction accuracy.
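As a concrete reference point, here is a minimal numpy sketch of the estimator defined by (1)-(3). The function name and interface are mine, not the paper's, and a full implementation would also handle centering, scaling and the choice of λ.

```python
import numpy as np

def ridge_with_intercept(X, y, lam):
    """Ridge fit per (1)-(3): the intercept beta_0 is not penalized."""
    n, p = X.shape
    Z = np.hstack([np.ones((n, 1)), X])   # prepend the intercept column
    P = np.eye(p + 1)
    P[0, 0] = 0.0                         # do not penalize beta_0
    beta = np.linalg.solve(Z.T @ Z + lam * P, Z.T @ y)
    return beta[0], beta[1:]              # (beta_0 hat, beta_1 hat)
```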

Despite more modern approaches such as the Lasso (Tibshirani 1996), Least Angle Regression (LARS) (Efron et al. 2004) and boosting (Bühlmann 2006), ridge regression (henceforth RR) continues to be useful in many situations, in particular in chemometrics. It is particularly useful when it can be assumed that all coefficients have approximately the same order of magnitude (unlike sparse situations, where most coefficients are assumed to be null). Zou and Hastie (2005) state that "For usual n > p situations, if there are high correlations between predictors, it has been empirically observed that the prediction performance of the lasso is dominated by ridge regression". Frank and Friedman (1993) compare the performances of several estimators commonly used in chemometrics, such as partial least squares (PLS) and principal components regression (PCR). They conclude (p. 110) that "In all of the situations studied, RR dominated the other methods, closely followed by PLS and PCR, in that order".

Since RR is based on least squares, it is sensitive to atypical observations. Lawrence and Marsh (1984) proposed a model for predicting fatalities in U.S. coal mines, based on a data set with n = 64 and p = 6 affected by both atypical observations and high collinearity. They demonstrate that a robust version of RR yields results superior to those from ordinary least squares (OLS) and from classical RR, in particular in their compatibility with theoretical expectations. From today's perspective, however, their approach to robustification could be greatly improved. Several other approaches have been proposed to make RR robust towards outliers, the most recent and interesting ones being those by Silvapulle (1991) and Simpson and Montgomery (1996). These estimators however present two drawbacks, which will be analyzed in Section 4. The first one is that they are not sufficiently robust towards "bad leverage points", especially for large p. The second is that they require an initial (standard) robust regression estimator. When p > n this becomes impossible; and when p < n but the ratio p/n is "large", the initial estimator has a low robustness, as measured by the breakdown point.

The situation of prediction with p >> n has become quite usual, especially in chemometrics. One instance of such a situation appears when attempting to replace the laboratory determination of the amount of a given chemical compound in a sample of material with methods based on cheaper and faster spectrographic measurements. Section 7 of this article deals with one archaeological instance of such a situation, where the material is glass from ancient vessels, with n = 180 and p = 486.

The analysis indicates that the data contain about 7% atypical observations, which affect the predictive power of classical RR, while the robust method proposed in this article has a clearly lower prediction error than classical RR, PLS, and a robust version of PLS.

A further area that stimulates the development of robust RR for high-dimensional data is that of functional regression. Some approaches for functional regression, such as the one by Crambes et al. (2009), boil down to an expression of the form (2) with ||β_1||² replaced by β_1' A β_1, where A is a positive-definite matrix, i.e., a problem that is similar to RR, generally with p >> n. Therefore, a reliable robust RR estimator for p >> n is a good basis to obtain a reliable robust functional regression estimator. This approach has been employed by Maronna and Yohai (2009) to propose a robust estimator for functional regression based on splines.

The purpose of this article is to develop ridge estimators that can be used when p > n, are robust when p/n is large, and are resistant to bad leverage points. In Section 2 we describe an estimator based on Yohai's (1987) approach of MM estimation, which allows controlling both robustness and efficiency. The MM estimator requires a robust (but possibly inefficient) initial estimator. In Section 3 we describe an estimator based on the minimization of a penalized robust residual scale (an "S estimator") to be used as a starting point. Sections 2.4 and 2.5 deal with the choice of λ. In Section 4 we recall the estimators proposed by Silvapulle and by Simpson and Montgomery, describe their drawbacks, and propose simple modifications of them to improve their robustness. Section 5 describes a simulation study for the case n > p in which the proposed estimator is seen to outperform its competitors, and the modified Silvapulle estimator performs well for not too extreme situations. Section 6 describes a simulation for p > n in which the MM estimator is seen to outperform a robust version of Partial Least Squares.


Section 7 deals with the application of the proposed estimator to a real high-dimensional data set with p >> n. Section 8 compares the computing times of the different estimators, Section 9 describes the difficulties in estimating the estimator's variability, and finally Section 10 summarizes the conclusions. An Appendix containing proofs and details of definitions and of calculations is given with the Supplemental Material.

2 MM estimators for ridge regression

To ensure both robustness and efficiency under the normal model we employ the approach of MM estimation (Yohai 1987). Start with an initial robust but possibly inefficient estimator β̂_ini; from the respective residuals compute a robust scale estimator σ̂_ini. Then compute an M estimator with fixed scale σ̂_ini, starting the iterations from β̂_ini, and using a loss function that ensures the desired efficiency. Here "efficiency" will be loosely defined as "similarity with the classical RR estimator at the normal model".

Recall first that a scale M estimator (an M-scale for short) of the data vector r = (r_1, ..., r_n)' is defined as the solution σ̂ = σ̂(r) of

    (1/n) Σ_{i=1}^n ρ(r_i/σ̂) = δ,    (4)

where ρ is a bounded ρ-function in the sense of Maronna et al. (2006) and δ ∈ (0, 0.5) determines the breakdown point of σ̂. Recall that the breakdown point (BDP) of an estimator is the maximum proportion of observations that can be arbitrarily altered with the estimator remaining bounded away from the border of the parameter set (which can be at infinity). Note that when ρ(t) = t² and δ = 1 we have the classical mean squared error (MSE): σ̂² = ave_i{r_i²}. We shall henceforth employ for ρ the bisquare ρ-function

    ρ_bis(t) = min{1, 1 - (1 - t²)³}.    (5)
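A minimal numpy sketch of (4)-(5), assuming the usual rescaling fixed-point iteration for the M-scale (the paper's own algorithm is described in its Supplemental Material); the function names are mine.

```python
import numpy as np

def rho_bis(t):
    """Bisquare rho-function (5): 1 - (1 - t^2)^3 for |t| <= 1, and 1 otherwise."""
    t2 = np.minimum(np.asarray(t, dtype=float) ** 2, 1.0)
    return 1.0 - (1.0 - t2) ** 3

def m_scale(r, delta=0.5, c=1.0, n_iter=50):
    """Solve (1/n) sum_i rho_bis(r_i / (c * sigma)) = delta for sigma
    by the standard rescaling fixed-point iteration."""
    r = np.asarray(r, dtype=float)
    sigma = np.median(np.abs(r)) / 0.6745 + 1e-12   # MAD-type starting value
    for _ in range(n_iter):
        sigma *= np.sqrt(np.mean(rho_bis(r / (c * sigma))) / delta)
    return sigma
```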

Let β̂_ini be an initial estimator. Let r* = r(β̂_ini) and let σ̂_ini be an M-scale of r*:

    (1/n) Σ_{i=1}^n ρ_0(r*_i/σ̂_ini) = δ,    (6)

where ρ_0 is a bounded ρ-function and δ is to be chosen. Then the MM estimator for RR (henceforth RR-MM) is defined by (1) with

    L(X, y, β) = σ̂_ini² Σ_{i=1}^n ρ(r_i(β)/σ̂_ini) + λ ||β_1||²,    (7)

where ρ is another bounded ρ-function such that ρ ≤ ρ_0. The factor σ̂_ini² before the summation is employed to make the estimator coincide with the classical one when ρ(t) = t². We shall henceforth use

    ρ_0(t) = ρ_bis(t/c_0),  ρ(t) = ρ_bis(t/c),    (8)

where the constants c_0 < c are chosen to control both robustness and efficiency, as described later in Section 2.3. The classical estimator (2) will henceforth be denoted by RR-LS for brevity.

Remark: Centering and scaling of the columns of X are two important issues in ridge regression. Since our estimator contains the intercept term β̂_0, centering would affect only β̂_0 and not the vector of slopes β̂_1, and is therefore not necessary. However, in the algorithm for RR-MM the data are previously centered by means of the bisquare location estimator, but this is done only for the convenience of the numerical procedure. As to scaling, we consider it to be the user's choice rather than an intrinsic part of the definition of the estimator. In our algorithm, the columns of X are scaled by means of a scale M estimator.

2.1 The "weighted normal equations" and the iterative algorithm

It is known that the classical estimator RR-LS satisfies the "normal equations"

    β̂_0 = ȳ - x̄' β̂_1,  (X'X + λ I_p) β̂_1 = X'(y - β̂_0 1_n),    (9)

where I_p is the identity matrix and x̄ and ȳ are the averages of X and y respectively. A similar system of equations is satisfied by RR-MM. Define

    ψ(t) = ρ'(t),  W(t) = ψ(t)/t.    (10)

Let

    t_i = r_i/σ̂_ini,  w_i = W(t_i)/2,  w = (w_1, ..., w_n)',  W = diag(w).    (11)

Setting the derivatives of (7) with respect to β to zero yields for RR-MM

    w'(y - β̂_0 1_n - X β̂_1) = 0    (12)

and

    (X'W X + λ I_p) β̂_1 = X'W (y - β̂_0 1_n).    (13)

Therefore RR-MM satisfies a weighted version of (9). Since for the chosen ρ, W(t) is a decreasing function of |t|, observations with larger residuals will receive lower weights w_i.

As is usual in robust statistics, these "weighted normal equations" suggest an iterative procedure. Starting with an initial β̂: compute the residual vector r and the weights w. Leaving β̂_0 and w fixed, compute β̂_1 from (13). Recompute w, and compute β̂_0 from (12). Repeat until the change in the residuals is small enough. Recall that σ̂_ini remains fixed throughout.

This procedure may be called "iterative reweighted RR". It is shown in the Supplemental Material that the objective function (7) descends at each iteration. The initial estimator, which is a penalized S estimator, is described in Section 3.
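A compact numpy sketch of this iteration, assuming the bisquare ρ of (5) and (8), so that W(t) = (6/c²)(1 - (t/c)²)² for |t| ≤ c and 0 otherwise. The starting values β̂_0, β̂_1 and the scale σ̂_ini would come from the penalized S estimator of Section 3; the function name and interface are mine.

```python
import numpy as np

def rr_mm_irls(X, y, lam, sigma_ini, beta0, beta1, c=3.44, n_iter=100, tol=1e-8):
    """Iterative reweighted RR of Section 2.1: sigma_ini stays fixed,
    the weights w_i = W(t_i)/2 are recomputed, and (13) and (12) are solved in turn."""
    n, p = X.shape

    def weights(b0, b1):
        t = (y - b0 - X @ b1) / sigma_ini
        u = np.minimum(np.abs(t) / c, 1.0)
        return 0.5 * (6.0 / c ** 2) * (1.0 - u ** 2) ** 2   # bisquare W(t)/2, zero for |t| > c

    r_old = y - beta0 - X @ beta1
    for _ in range(n_iter):
        w = weights(beta0, beta1)
        beta1 = np.linalg.solve(X.T @ (w[:, None] * X) + lam * np.eye(p),
                                X.T @ (w * (y - beta0)))     # equation (13)
        w = weights(beta0, beta1)                            # recompute the weights
        beta0 = np.sum(w * (y - X @ beta1)) / np.sum(w)      # equation (12)
        r_new = y - beta0 - X @ beta1
        if np.max(np.abs(r_new - r_old)) < tol:
            break
        r_old = r_new
    return beta0, beta1
```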

2.2 Equivalent degrees of freedom

Let w be defined by (10)-(11) and

    x̃ = w'X / (w'1),  X̃ = X - 1_n x̃.    (14)

Then it follows that the fitted values ŷ = X̃ β̂_1 + β̂_0 1_n verify

    ŷ = (H + 1_n w'/(1'w)) y,    (15)

where the "hat matrix" H is given by

    H = X̃ (X̃'W X̃ + λ I_p)^{-1} X̃'W.    (16)

Then the equivalent degrees of freedom (edf) (Hastie et al. 2001) are defined by

    edf = tr(H + 1_n w'/(1'w)) = p̂_MM + 1    (17)

with

    p̂_MM = tr(H),    (18)

where tr(.) denotes the trace. When λ = 0 we have tr(H) = rank(X). We may thus call p̂_MM the "equivalent number of predictors".

The edf may be considered as a measure of the degree of "shrinking" of the estimator.
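In code, (14)-(18) amount to a few lines. A hedged numpy sketch (names mine):

```python
import numpy as np

def edf_rr_mm(X, w, lam):
    """Equivalent degrees of freedom (17): edf = tr(H) + 1, with H from (16)
    and the weighted centering of (14); w are the weights of (11)."""
    p = X.shape[1]
    x_tilde = (w @ X) / np.sum(w)            # weighted column means, (14)
    Xt = X - x_tilde                          # centered matrix X-tilde
    A = Xt.T @ (w[:, None] * Xt) + lam * np.eye(p)
    H = Xt @ np.linalg.solve(A, Xt.T * w)     # hat matrix (16): Xt (Xt'W Xt + lam I)^{-1} Xt'W
    p_mm = np.trace(H)                        # equivalent number of predictors (18)
    return p_mm + 1.0
```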

2.3 Choosing the constants

We now deal with the choice of δ in (6) and of c_0 and c in (8). For the case λ = 0, i.e., the standard M estimation of a regression with q = p + 1 parameters, the natural choice is δ = 0.5 (1 - q/n) (henceforth a "standard" estimator will refer to one with λ = 0).

This choice of δ is "natural" in that it maximizes the BDP of β̂. It can also be justified by the fact that the scale can be written as

    (1/(n - q)) Σ_{i=1}^n ρ(r_i/σ̂) = 0.5,

and the denominator n - q is the same as the one used for the classical s² in OLS. For λ > 0, the natural extension is

    δ = 0.5 (1 - q̂/n),    (19)

where q̂ is a measure of the "actual number of parameters", which depends on λ. We would use q̂ = edf defined in (17), but p̂_MM cannot be known a priori. For this reason we shall employ a measure of edf derived from the initial estimator in a manner analogous to (17), as described in Section 3.

Since δ ceases to be fixed at the value 0.5, c_0 must now depend on δ in order to ensure consistency of σ̂ for normal data when λ = 0. We therefore employ c_0(δ) defined by

    E ρ_bis(z/c_0(δ)) = δ,  z ~ N(0, 1),

which can be computed numerically. As to c, it is chosen to attain the desired efficiency of the standard estimator at the normal model. If p is fixed and n → ∞, the value c = 3.44 yields 0.85 asymptotic efficiency at the normal model when λ = 0.

However, for high-dimensional data both σ̂_ini and c require a correction. Maronna and Yohai (2010) point out two problems that appear when fitting a standard MM estimator to data with a high ratio p/n: (1) the scale based on the residuals from the initial regression estimator underestimates the true error scale, and (2) even if the scale is correctly estimated, the actual efficiency of the MM estimator can be much lower than the nominal one. For this reason, σ̂_ini is corrected (upwards) using formula (9) of their paper, namely as

    σ̃_ini = σ̂_ini / (1 - (k_1 + k_2/n) q̂/n)  with  k_1 = 1.29,  k_2 = -6.02.

As to c, it must also be increased: we put c = 4 when q̂/n > 0.1.
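The consistency constant c_0(δ) can be computed with a few lines of scipy. A sketch, assuming a simple root-finder over the Gaussian expectation is acceptable (function names mine); for δ = 0.5 it should return a value close to 1.55, the familiar bisquare constant for a 50% breakdown-point scale.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def rho_bis(t):
    t2 = min(t * t, 1.0)
    return 1.0 - (1.0 - t2) ** 3

def c0_for_delta(delta):
    """Solve E rho_bis(Z / c0) = delta for c0, with Z ~ N(0, 1)."""
    phi = lambda z: np.exp(-0.5 * z * z) / np.sqrt(2.0 * np.pi)
    expected_rho = lambda c0: quad(lambda z: rho_bis(z / c0) * phi(z), -np.inf, np.inf)[0]
    # E rho_bis(Z/c0) decreases in c0 from 1 to 0, so the root is bracketed
    return brentq(lambda c0: expected_rho(c0) - delta, 1e-3, 100.0)
```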

2.4 Choosing λ through cross-validation

In order to choose λ we must estimate the prediction error of RR-MM for different values of λ. Call CV(K) the K-fold cross-validation process, which requires recomputing the estimate K times. For K = n ("leave-one-out") we can use an approximation to avoid recomputing. Call ŷ_{-i} the fit of y_i computed without using the i-th observation; i.e., ŷ_{-i} = β̂_0^(-i) + x_i' β̂_1^(-i), where (β̂_0^(-i), β̂_1^(-i)')' is the RR-MM estimate computed without observation i. Then a first-order Taylor approximation of the estimator yields the approximate prediction errors

    r_{-i} = y_i - ŷ_{-i} ≈ r_i ( 1 + W(t_i) h_i / (1 - ψ'(t_i) h_i) ),  h_i = x̃_i' ( Σ_{j=1}^n ψ'(t_j) x̃_j x̃_j' + 2λ I )^{-1} x̃_i,    (20)

where W, ψ and t_i are defined in (10)-(11) and x̃_i is the i-th row of X̃ defined in (14). This approximation does not take the initial estimator into account. Henceforth CV(n) will refer to the above approximation.

Given the prediction errors r_{-} = (r_{-1}, ..., r_{-n})', we compute a robust mean squared error (MSE) as σ̂(r_{-})², where σ̂ is a "τ-scale" with constant c = 5 (Yohai and Zamar 1988); and choose the λ minimizing this MSE. Initially an M-scale had been employed, but the resulting λ's were somewhat erratic, and this led to the use of the more efficient τ-scale. Note that the choice of c must strike a balance between robustness and efficiency; the value 5 was chosen on the basis of exploratory simulations.

It turns out that σ̂(r_{-}), as a function of λ, may have several local minima (as will be seen in Figure 4), which precludes the use of standard minimization algorithms to choose the "best" λ. For this reason we prefer to compute σ̂(r_{-}) for a selected set of candidate λ's in a range [λ_min, λ_max]. The definition of this set is given in Section 2.5.
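A schematic version of the CV(K) search over a candidate grid. A trimmed mean of squared prediction errors stands in for the τ-scale (an assumption on my part), and the function names and interface are mine.

```python
import numpy as np

def choose_lambda_cv(X, y, lambdas, fit, K=5, trim=0.1, seed=0):
    """Pick lambda by K-fold CV with a robust summary of squared prediction errors.
    `fit(X, y, lam)` must return (beta0, beta1)."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % K               # balanced random fold labels
    scores = []
    for lam in lambdas:
        err2 = np.empty(n)
        for k in range(K):
            test = folds == k
            b0, b1 = fit(X[~test], y[~test], lam)
            err2[test] = (y[test] - b0 - X[test] @ b1) ** 2
        kept = np.sort(err2)[: int(np.ceil((1.0 - trim) * n))]   # drop the largest errors
        scores.append(kept.mean())
    return lambdas[int(np.argmin(scores))]
```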

2.5 The set of candidate λ's and the edf

It is intuitively clear that the robustness of RR-MM depends on that of the initial estimator, which, as we shall see, depends in turn on the edf, which are a function of λ. For this reason we shall need an a priori approximation to the edf corresponding to each λ.

For the classical estimator RR-LS it is known that the edf are p̂(λ) + 1 with

    p̂(λ) = Σ_{j=1}^p λ_j / (λ_j + λ/n),    (21)

where the λ_j are the eigenvalues of C, the sample covariance matrix of X. We use a robust version of (21), replacing those eigenvalues by the Spherical Principal Components (SPC) of X (Locantore et al. 1999):

    p̂_SPC(λ) = Σ_{j=1}^p λ̃_j / (λ̃_j + λ/n),    (22)

where the λ̃_j are the "pseudo eigenvalues"; the details are described in the Supplemental Material. Note that p̂_SPC(λ) is a decreasing function. The choice of SPC was based on the fact that it is very fast, can be computed for p >> n, and has some interesting theoretical properties (see the Discussion of the article by Locantore et al. (1999)).

Call ε* the BDP of β̂_ini. It is not difficult to show that if λ > 0, the BDP of RR-MM is at least ε*. However, since the estimator is not regression-equivariant, this result is deceptive. As an extreme case, take p = n - 1 and ε* = 0.5. Then the BDP of β̂ is null for λ = 0, but is 0.5 for λ = 10^{-6}! Actually, for a given contamination rate less than 0.5, the maximum bias of the estimator remains bounded, but the bound tends to infinity when λ tends to zero. This fact suggests keeping the candidate λ's bounded away from 0 when p/n is large.

This idea is more easily expressed through the edf. As will be seen in Section 3, to control the robustness of the initial estimator it is convenient that the "effective dimension" be bounded away from n; more precisely, that p̂_SPC(λ) + 1 ≤ 0.5 n. We therefore choose [λ_min, λ_max] such that

    λ_min = inf{λ ≥ 0 : p̂_SPC(λ) + 1 ≤ 0.5 n}  and  p̂_SPC(λ_max) = 1,    (23)

so that λ_min = 0 if p < (n - 1)/2. We take N values of λ between λ_min and λ_max, with N depending on p, so that the N values of p̂_SPC(λ) are equispaced. As a compromise between precision and computing time, we take

    N = min{p, n, 20}.    (24)

A technical detail appears. The set of candidate λ's is chosen so that RR-LS has a given degree of shrinking. Note however that even if there is no contamination, RR-LS and RR-MM with the same λ need not yield approximately the same estimators. For a given λ, we want to find a penalty λ' such that RR-MM with penalty λ' yields approximately the same degree of shrinking as RR-LS with penalty λ. There is a simple solution. Note that lim_{k→∞} (k²/3) ρ_bis(r/k) = r², and therefore RR-LS with a given penalty λ is approximately "matched" to RR-MM with λ' = 3λ/c², with c in (8).
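The construction of the candidate set can be sketched as follows. The exact recipe for the SPC "pseudo eigenvalues" is in the paper's Supplemental Material; here, as an assumption of mine, the SPC directions come from an SVD of the spatially signed data and the pseudo eigenvalues are squared MADs of the corresponding scores, after which p̂_SPC of (22) is inverted numerically to obtain a grid with equispaced edf (all names mine).

```python
import numpy as np

def spc_pseudo_eigenvalues(X):
    """Rough sketch of Spherical Principal Components (Locantore et al. 1999)."""
    mu = np.median(X, axis=0)                    # simple robust center (an assumption)
    Z = X - mu
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    S = Z / np.where(norms > 0, norms, 1.0)      # project rows onto the unit sphere
    _, _, Vt = np.linalg.svd(S, full_matrices=False)
    scores = Z @ Vt.T                            # data projected on the SPC directions
    mad = np.median(np.abs(scores - np.median(scores, axis=0)), axis=0) / 0.6745
    return np.sort(mad ** 2)[::-1]               # "pseudo eigenvalues", decreasing

def lambda_grid(X, N=None):
    """Candidate lambdas with (roughly) equispaced p_SPC(lambda) of (22),
    restricted so that p_SPC(lambda) + 1 <= 0.5 n as in (23)-(24)."""
    n, p = X.shape
    lam_j = spc_pseudo_eigenvalues(X)
    p_spc = lambda lam: np.sum(lam_j / (lam_j + lam / n))     # equation (22)
    N = N or min(p, n, 20)                                    # equation (24)
    targets = np.linspace(min(0.5 * n - 1, p_spc(0.0)), 1.0, N)
    grid = []
    for tgt in targets:                           # invert p_SPC by bisection
        lo, hi = 0.0, 1.0
        while p_spc(hi) > tgt:
            hi *= 10.0
        for _ in range(80):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if p_spc(mid) > tgt else (lo, mid)
        grid.append(hi)
    return np.array(grid)
```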

3 The initial estimator for RR-MM: S estimators for RR

As in the standard case, the initial estimator for RR-MM will be the one based on the minimization of a (penalized) robust scale (an S estimator). We define an S estimator for RR (henceforth RR-SE) by replacing in (2) the sum of squared residuals by a squared robust M-scale σ̂ (4), namely

    L(X, y, β) = n σ̂(r(β))² + λ ||β_1||².    (25)

Here the factor n is employed to make (25) coincide with (2) when σ̂²(r) = ave_i{r_i²}, corresponding to ρ(t) = t² and δ = 1 in (4).

A straightforward calculation shows that the estimator satisfies the "normal equations" (12)-(13) with w_i = W(t_i) and λ replaced by λ' = λ n^{-1} Σ_{i=1}^n w_i t_i². These equations suggest an iterative algorithm similar to the one at the end of Section 2.1, with the difference that now the scale changes at each iteration, i.e. now in the definition of t_i in (11), σ̂_ini is replaced by σ̂(r(β̂)). Although it has not been possible to prove that the objective function descends at each iteration, this fact has been verified in all cases up to now. Expressing the estimator as a weighted linear combination yields an expression for the equivalent number of parameters p̂_SE as in (17).

As with RR-MM, it is not difficult to show that if λ > 0 the BDP of RR-SE coincides with that of the M-scale σ̂. Recall that this BDP equals min(δ, 1 - δ) with δ in (4). In the case of standard regression, recall that for X in general position, the BDP of a regression S estimator with intercept is min(δ, 1 - δ - (p + 1)/n), and therefore its maximum (which coincides with the maximum BDP of regression-equivariant estimators) is attained when δ = 0.5 (1 - (p + 1)/n) (see Maronna et al. 2006). For λ > 0, the above facts suggest using

    δ = 0.5 (1 - (p̂ + 1)/n),    (26)

where p̂ is a measure of the "actual number of parameters", which will depend on λ. It would be ideal to use p̂ = p̂_SE, but this value is unknown before computing the estimator; for this reason we use p̂ = p̂_SPC. Once β̂_SE is computed, we compute δ for RR-MM using (19) with q̂ = p̂_SE + 1, the reason being that p̂_SE is expected to give a more accurate measure of the edf than the tentative p̂_SPC. It follows from (26) that if we want δ ≥ 0.25 we get the value of λ_min in (23).

The technical detail of matching λ's described at the end of Section 2.5 occurs also with RR-SE, namely, the same value of λ yields different degrees of shrinking for RR-LS and RR-SE. Since its solution is more involved, it is deferred to the Supplemental Material.
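For completeness, here is a brief sketch (my own reconstruction, using only the definitions above) of the "straightforward calculation" behind the modified penalty λ' = λ n^{-1} Σ_i w_i t_i².

```latex
% Implicit differentiation of (4) at r(beta), with t_i = r_i/\hat\sigma and psi = rho':
\begin{align*}
0 = \frac{\partial}{\partial\beta_1}\,\frac1n\sum_{i=1}^n
      \rho\!\Big(\frac{r_i(\beta)}{\hat\sigma(\beta)}\Big)
  &= \frac1n\sum_{i=1}^n \psi(t_i)
      \Big(\frac{-x_i}{\hat\sigma}-\frac{t_i}{\hat\sigma}\,
      \frac{\partial\hat\sigma}{\partial\beta_1}\Big)
   \;\Longrightarrow\;
   \frac{\partial\hat\sigma}{\partial\beta_1}
   = -\,\frac{\sum_i \psi(t_i)\,x_i}{\sum_i \psi(t_i)\,t_i}\,. \\
\frac{\partial}{\partial\beta_1}\Big\{n\hat\sigma^2+\lambda\|\beta_1\|^2\Big\}=0
  &\;\Longrightarrow\;
   \sum_i \hat\sigma\,\psi(t_i)\,x_i
   =\frac{\lambda}{n}\Big(\sum_i\psi(t_i)\,t_i\Big)\beta_1 .
\end{align*}
% Since psi(t) = W(t) t and \hat\sigma psi(t_i) = W(t_i) r_i, this reads
% sum_i W(t_i) r_i x_i = lambda' beta_1 with lambda' = lambda n^{-1} sum_i W(t_i) t_i^2,
% i.e. (13) with weights w_i = W(t_i); the derivative in beta_0 yields (12) likewise.
```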


4 Other robust RR estimators for n > p

4.1 The Silvapulle estimator

Silvapulle's (1991) estimator is based on the following fact. Let X be centered to zero mean. Call β̃_LS = (β̃_0, β̃_1')' the ordinary LS (OLS) estimator. Then the RR-LS estimator (β̂_0, β̂_1')' fulfills

    β̂_0 = β̃_0,  β̂_1 = Z(λ) β̃_1,    (27)

where

    Z(λ) = (X'X + λI)^{-1} X'X.    (28)

Let now β̂_M = (β̂_M,0, β̂_M,1')' be a standard regression M estimator and σ̂ a robust residual scale. Then Silvapulle proposes to define a RR estimator by (27)-(28) with β̂_M instead of β̃_LS, and to choose λ in the same way as Hoerl et al. (1975), namely

    λ̂ = p σ̂² / ||β̂_M,1||².    (29)

Silvapulle uses a simultaneous estimator of β and σ of "Huber's Proposal II" type (Huber 1964), which has the following drawbacks: (1) the M estimator has a monotonic ψ = ρ', which makes it non-robust towards high leverage outliers; (2) the scale has a low BDP; (3) the matrix Z in (28) is also sensitive to leverage points; and (4) the centering of X may also be influenced by leverage points.

For this reason the following modifications are proposed. Call μ̂ a robust location vector of X, obtained by applying a robust location estimator to each column (we use the bisquare M estimator). Call X̄ = X - 1_n μ̂' the centered matrix. Let β̂_M = (β̂_M,0, β̂_M,1')' be a standard MM regression estimator of (X̄, y) with initial scale σ̂. This estimator yields a vector of weights w. Let W = diag(w) and

    Z(λ) = (X̄'W X̄ + λI)^{-1} X̄'W X̄.

Finally define the modified Silvapulle estimator by

    β̂_1 = Z(λ̂) β̂_M,1,  β̂_0 = β̂_M,0 - β̂_1' μ̂.
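A schematic implementation of the modified estimator. The MM regression step is abstracted as a user-supplied function (any robust regression package could supply it), and a column-wise median replaces the bisquare location, so this is a sketch rather than the paper's exact procedure; all names are mine.

```python
import numpy as np

def modified_silvapulle(X, y, robust_mm_fit):
    """Modified Silvapulle estimator, cf. (27)-(29) with the robustified Z(lambda).
    `robust_mm_fit(Xc, y)` is assumed to return (beta_M0, beta_M1, sigma_hat, w):
    an MM regression fit with its residual scale and final observation weights."""
    n, p = X.shape
    mu = np.median(X, axis=0)                    # robust column locations (bisquare in the paper)
    Xc = X - mu                                  # centered matrix
    b_M0, b_M1, sigma, w = robust_mm_fit(Xc, y)
    lam = p * sigma ** 2 / np.sum(b_M1 ** 2)     # Hoerl-Kennard-Baldwin choice (29)
    G = Xc.T @ (w[:, None] * Xc)
    Z = np.linalg.solve(G + lam * np.eye(p), G)  # robustified Z(lambda), cf. (28)
    beta1 = Z @ b_M1
    beta0 = b_M0 - beta1 @ mu
    return beta0, beta1
```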

4.2 The Simpson-Montgomery estimator

Simpson and Montgomery (1996) employ a Schweppe GM type of estimator (see Maronna et al. 2006, Sec. 5.11). They start by computing a standard regression S estimator β̂ = (β̂_0, β̂_1')' with residual scale σ̂; then λ is estimated according to Hoerl and Kennard's (1976) iterative method. Call (t, V) the robust location/covariance M estimator proposed by Krasker and Welsch (1982), let X̄ be the centered matrix X̄ = X - 1_n t', and let y be robustly centered. Define the Mahalanobis distances d_i² = x̄_i' V^{-1} x̄_i and the weights w_i = W_mult(r_i/(σ̂ η_i)), where η_i = C median_j(d_j)/d_i, with the constant C chosen to regulate efficiency, and W_mult is the weight function given by the multivariate M estimator. Finally their RR estimator is given by

    (X̄' Ω X̄ + λ̂ I) β̂_1 = X̄' Ω y,  with  Ω = diag(w_1, ..., w_n).    (30)

The main drawbacks of this approach are: (1) the multivariate estimator has a very low BDP; (2) the estimation of λ depends on X'X, which is not robust; and (3) the Hoerl-Kennard iterations are unstable. For these reasons the following modifications were introduced: (1) (t, V) is a multivariate S estimator using the bisquare scale; (2) call W the matrix of weights given by the initial regression S estimator; then the modified estimator is given by (30) with Ω replaced by W; and (3) only two Hoerl-Kennard iterations are employed.

5 Simulations for p < n

The performances of the robust estimators described above were compared for n > p through a simulation study. For each simulation scenario we generate N_rep datasets as y_i = x_i' β_true + e_i, i = 1, ..., n, with x_i ~ N_p(0, V), where V = [v_jk] with v_jj = 1 and v_jk = ρ for j ≠ k, and e_i ~ N(0, 1). Since the estimators are neither regression- nor affine-equivariant, the choice of β_true matters. For a given value of the "signal to noise ratio" (SNR) R, defined as R² = ||β_true||² / (p Var(e)), we take β_true at random: β_true = √p R b, where b has a uniform distribution on the unit spherical surface.

To introduce contamination we choose two constants k_lev and k_slo, which respectively control the leverage and slope of the contamination, their effect being to mislead β̂_1 towards β_true (1 + k_slo). For a given contamination rate ε ∈ (0, 1) let m = [nε], where [.] denotes the integer part. Then the first m observations (x_i, y_i) are replaced by x_i = x_0 and y_i = x_0' β_true (1 + k_slo), with x_0 = k_lev a/√(a'V^{-1}a), where a is random with a'1_p = 0. Thus x_0 is placed on the directions most influential to the estimator. The values of k_slo were taken in a grid aimed at determining the maximum MSE for each estimator. We took two values for ρ: 0.5 and 0.9, and two for k_lev: 5 and 10.

Before computing the estimators, the columns of X are centered and normalized to unit scale. For RR-LS we use the mean and standard deviation, and for the robust estimators we use the bisquare location and scale estimators. Then β̂ is de-normalized at the end.
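A hedged numpy sketch of this data-generating process (one replication; the function name and random-number handling are mine):

```python
import numpy as np

def simulate_contaminated(n, p, rho, snr, eps, k_lev, k_slo, seed=0):
    """Section 5 design: equicorrelated N_p(0, V) predictors, N(0, 1) errors,
    beta_true of norm sqrt(p)*SNR, and the first m = [n*eps] observations replaced
    by leverage points that pull the slopes towards beta_true*(1 + k_slo)."""
    rng = np.random.default_rng(seed)
    V = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
    L = np.linalg.cholesky(V)
    X = rng.standard_normal((n, p)) @ L.T
    b = rng.standard_normal(p)
    b /= np.linalg.norm(b)                       # uniform direction on the unit sphere
    beta_true = np.sqrt(p) * snr * b
    y = X @ beta_true + rng.standard_normal(n)
    m = int(n * eps)
    if m > 0:
        a = rng.standard_normal(p)
        a -= a.mean()                            # a' 1_p = 0
        x0 = k_lev * a / np.sqrt(a @ np.linalg.solve(V, a))
        X[:m] = x0
        y[:m] = x0 @ beta_true * (1.0 + k_slo)
    return X, y, beta_true
```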

The set of λ's for RR-LS is chosen according to (21), in order to obtain N values of λ with N = p, so that the N values of p̂(λ) are equispaced between 1 and p.

To evaluate the performance of a given estimator β̂ = (β̂_0, β̂_1')', recall that the conditional prediction mean squared error for a new observation (x_0, y_0) is

    MSE(β̂) = E(y_0 - β̂_0 - x_0' β̂_1)² = 1 + Δ,

where

    Δ = β̂_0² + (β̂_1 - β_true)' V (β̂_1 - β_true).

It must be recalled that the distributions of robust estimators under contamination are themselves heavy-tailed, and it is therefore prudent to evaluate their performance through robust measures. For this reason, to summarize the N_rep values of MSE(β̂) we employed both their average and their 10% (upper) trimmed average. The results given below correspond to this trimmed MSE, although the ordinary average yields qualitatively similar results.

The estimators employed were:

- RR-LS with N = p and K_CV-fold CV with K_CV = n, 5 and 10.
- RR-MM with the same N and K_CV.
- The Simpson-Montgomery estimator, original and modified versions.
- The Silvapulle estimator, original and modified versions.

Table 1 displays the maximum trimmed MSEs for p = 10, n = 50, ρ = 0.9 and k_lev = 5, and three values of the SNR. The values for ε = 0 correspond to no contamination, and those for ε = 0.1 to the maximum of the trimmed MSEs. The results for RR-MM and RR-LS with K_CV = 10 are omitted since they were not very different from those with K_CV = 5.

                       SNR = 10         SNR = 1          SNR = 0.1
                       ε=0     ε=0.1    ε=0     ε=0.1    ε=0     ε=0.1
RR-LS, CV(n)           1.34    43.35    1.28    5.33     1.13    3.01
RR-LS, CV(5)           1.34    44.05    1.30    5.45     1.13    3.06
RR-MM, CV(n)           1.38    1.70     1.31    1.58     1.13    1.32
RR-MM, CV(5)           1.41    1.74     1.32    1.47     1.13    1.23
Simp-Mont. orig.       1.48    36.47    1.37    17.03    1.12    3.27
Simp-Mont. modif.      1.50    2.53     1.40    1.92     1.23    1.55
Silv. orig.            1.35    40.48    1.29    5.85     1.16    3.40
Silv. modif.           1.35    1.66     1.28    1.57     1.14    1.38

Table 1: Maximum trimmed MSEs of estimators for p = 10, n = 50, ρ = 0.9.

Both versions of RR-LS have similar behaviors, as do both versions of RR-MM. All robust estimators show a high efficiency for ε = 0. The original Simpson-Montgomery and Silvapulle estimators are clearly not robust; it is curious that under ε = 0.1 the former can attain higher MSEs than RR-LS. It must be recalled that the maximum MSE of RR-LS over all k_slo is infinite, but here the range of k_slo was limited to finding the maximum MSE of each robust estimator. The modified Simpson-Montgomery estimator is robust, but for both ε = 0 and ε = 0.1 it exhibits higher MSEs than the two versions of RR-MM and the modified Silvapulle estimator. For these reasons we henceforth drop the original Silvapulle estimator and both versions of Simpson-Montgomery from the study, as well as RR-LS with K_CV = 5. The modified Silvapulle estimator appears as a good competitor to RR-MM, despite its simplicity.

We now consider the same scenario but with ε = 0.2. The results are given in Table 2.

SNR               10        1        0.1
RR-LS, CV(n)      231.82    24.28    3.41
RR-MM, CV(n)      3.13      2.80     2.48
RR-MM, CV(5)      3.13      2.14     1.85
Silv. modif.      3.55      3.22     3.18

Table 2: Maximum trimmed MSEs of estimators for p = 10, n = 50, ρ = 0.9 and ε = 0.2.

Among the three robust estimators, RR-MM with CV(5) appears as the best and the modified Silvapulle estimator as the worst, with RR-MM with CV(n) in between. Figure 1 displays the MSEs of the three estimators as a function of k_slo.

[Figure 1: Simulation MSEs for p = 10, n = 50 and ε = 0.2 of MM estimators with 5-fold and n-fold cross-validation, and the modified Silvapulle estimator, as a function of contamination slope (panels SNR = 1 and SNR = 0.1). The values for k_slo = 0 are those for ε = 0.]

We finally consider a situation with a high p/n ratio, namely p = 10 and n = 25. Table 3 and Figure 2 show the results.

                  SNR = 10         SNR = 1          SNR = 0.1
                  ε=0     ε=0.1    ε=0     ε=0.1    ε=0     ε=0.1
RR-LS, CV(n)      2.03    41.93    1.63    5.38     1.22    4.58
RR-MM, CV(n)      2.36    2.80     1.82    2.09     1.33    1.54
RR-MM, CV(5)      2.32    2.70     1.73    1.81     1.25    1.36
Silv. modif.      2.34    3.09     1.74    2.27     1.43    2.02

Table 3: Maximum trimmed MSEs of estimators for p = 10, n = 25, ρ = 0.9.

[Figure 2: Simulation MSEs for p = 10, n = 25, SNR = 0.1 and ε = 0.1 of MM estimators with 5-fold and n-fold cross-validation, and the modified Silvapulle estimator, as a function of contamination slope. The values for k_slo = 0 are those for ε = 0.]

The efficiency of all robust estimators is lower than for n = 50, though still larger than 0.85. Again RR-MM, CV(5) is the best and the modified Silvapulle estimator the worst, with RR-MM, CV(n) in between. A plausible explanation for the superiority of K_CV = 5 over K_CV = n is that the estimation of the MSE for each λ is more robust in the former, which would lead to a better choice of λ.

It is seen in both tables that all MSEs increase with increasing SNR. This phenomenon appears more clearly in the next Section. A plausible explanation is that for large SNR, the bias caused by shrinking dominates the variance reduction, and all estimators "shrink too much". All simulations were repeated for ρ = 0.5 and for k_lev = 10, yielding the same qualitative conclusions as above.

6 Simulations for n < p

Since RR estimators are equivariant with respect to orthogonal transformations of X, the case n < p can always be reduced to one with p ≤ n - 1. To this end, X is first normalized, the principal components of the transformed matrix are computed, and the data are expressed in the coordinate system of the principal directions with (numerically) non-null eigenvalues. Note that this implies that now the edf will be ≤ n.

Many situations with p > n correspond to data such as e.g. spectra, for which X usually exhibits a "local" dependence structure. For this reason we now take x_i ~ N_p(0, V) with v_jk = ρ^{|j-k|}. All other details of the simulation are like in the former Section. The values chosen for ρ are 0.9 and 0.5, and those for k_lev are 5 and 10. We took n = 30 and p = 40. This is not very realistic, but it was necessary to limit the computing time. The values of k_slo ranged between 1 and 15 for ε = 0.1 and between 1 and 20 for ε = 0.2. Again, the estimators were RR-LS and RR-MM with K_CV = n and 5. Since the Silvapulle and the Simpson-Montgomery estimators cannot be computed for n < p, in order to have competitors against RR we included the robust Partial Least Squares (PLS) estimator RSIMPLS (Hubert and Vanden Branden 2003), and also the classical PLS estimator CSIMPLS, both as implemented in the library LIBRA (see http://wis.kuleuven.be/stat/robust/LIBRA.html). The number of components was chosen in both through cross-validation. The breakdown point of RSIMPLS was the default value of 0.25; attempts to employ a larger breakdown point such as 0.5 or 0.4 caused the program to crash due to some matrix becoming singular.

Table 4 gives the results for ρ = 0.9 and k_lev = 5. The SNR was taken as 0.1, 0.5 and 1. The case SNR = 10 was omitted since all estimators had very high MSEs.

                  SNR = 1                 SNR = 0.5               SNR = 0.1
                  ε=0    ε=0.1   ε=0.2    ε=0    ε=0.1   ε=0.2    ε=0    ε=0.1   ε=0.2
RR-LS, CV(n)      4.0    38.1    88.2     3.2    19.9    45.7     2.1    5.7     11.7
RR-LS, CV(5)      4.1    22.7    52.3     3.0    12.1    27.0     1.8    3.6     6.8
RR-MM, CV(n)      4.6    5.8     9.2      3.1    3.9     6.0      1.8    2.1     3.5
RR-MM, CV(5)      4.8    5.9     10.7     3.1    3.8     5.7      1.8    2.1     3.0
RSIMPLS           7.3    10.4    85.7     4.5    6.1     42.6     2.2    2.8     10.2
CSIMPLS           4.7    28.2    61.1     3.3    14.8    31.2     2.1    3.7     7.7

Table 4: Maximum trimmed MSEs of estimators for p = 40, n = 30, ρ = 0.9.

It is seen that:

- Both versions of RR-MM have similar performances, and are rather efficient.
- RR-LS with K_CV = 5 is remarkably better than with K_CV = n.
- RSIMPLS has a low efficiency, and for ε = 0.2 it breaks down, performing even worse than CSIMPLS; this behavior is puzzling.

7 A real data set with p >> n

We consider a dataset corresponding to electron-probe X ray microanalysis of archaeological glass vessels (Janssens et al. 1998), where each of n = 180 vessels is represented by a spectrum on 1920 frequencies. For each vessel the contents of 13 chemical compounds are registered. Since for frequencies below 15 and above 500 the values of x_ij are almost null, we keep frequencies 15 to 500, so that we have p = 486.

Figure 3 shows the residual QQ-plots for compound 13 (PbO). Both robust estimates exhibit seven clear outliers plus six suspect points, while RR-LS exhibits one outlier and CSIMPLS none. The dotted lines correspond to a 95% envelope based on 200 bootstrap samples. It is curious that the envelopes corresponding to the two non-robust estimators are relatively much more dispersed than the ones corresponding to the two robust ones.

[Figure 3: Vessel data: residual QQ plots for compound 13 (panels LS, MM, CPLS and RPLS; residuals vs. normal quantiles), with bootstrap 95% envelopes.]

The edf for RR-LS and for RR-MM with K_CV = n and K_CV = 5 were respectively 108.3, 10.0 and 5.4, and both versions of PLS chose 5 components. The predictive behavior of the estimators was assessed through 5-fold CV; the criteria used were the root mean squared error (RMSE) and the RMSE with upper 10% trimming (RMSE(0.9)), considered to be safer in the presence of outliers. Table 5 shows the results.

             RR-LS, CV(n)  RR-LS, CV(5)  RR-MM, CV(n)  RR-MM, CV(5)  RSIMPLS  CSIMPLS
RMSE         0.194         0.186         0.871         0.892         0.890    0.510
RMSE(0.9)    0.133         0.119         0.091         0.088         0.113    0.194

Table 5: Vessel data: RMSEs for compound 13.

According to RMSE(0.9), both versions of RR-MM have a similar performance, which is clearly better than those of the other estimators. However, according to RMSE they perform much worse than RR-LS. The plausible reason is that the outliers inflate the RMSE, and this demonstrates the importance of using robust criteria to evaluate robust estimators.

The left panel of Figure 4 shows the ridge trace for coefficients 1, 50, 100 and 150 of β̂_1 as a function of λ (in logarithmic scale), and the right one shows the estimated robust MSEs. The latter panel shows that the absolute minimum occurs at λ = 2575, i.e. log10(λ) = 3.41. The coefficients are seen to oscillate wildly for low λ and then to stabilize.

8 Computing times

Table 6 gives the computing times (in minutes) for the robust estimators. The first four rows are the averages of five samples generated as in Section 6, and the last one corresponds to the data in Section 7. The computation was performed in Matlab on a 3.01 GHz PC with an Intel Core Duo CPU.

n     p     RR-MM, CV(n)   RR-MM, CV(5)   RSIMPLS   Silv. modif.
100   10    0.0203         0.2705         0.1242    0.0019
100   20    0.0597         0.6847         0.1251    0.0028
100   40    0.1114         1.0261         0.1305    0.0054
100   200   0.4040         2.7384         0.1413    -
180   486   3.413          15.932         0.517     -

Table 6: Computing times in minutes.

It is seen that in the last two rows, the times required by RR-MM with K_CV = 5 are about five times those for K_CV = n, which is to be expected, since the latter does not actually perform the cross-validation. But in the other cases, the difference is about ten-fold, which is surprising.

[Figure 4: Vessel data: ridge trace for coefficients 1 (o), 50 (*), 100 (x) and 150 (+) (left panel) and estimated MSE (right panel), as a function of log10(λ).]

It is also seen that RSIMPLS is very fast, and the modified Silvapulle estimator is still faster.

9 The problem of estimating variability

An approach to estimate the variability of estimators and fitted values is to derive an (exact or approximate) expression for their variances. For RR-LS it is easy to derive a simple formula for the covariance matrix of β̂ for a given λ. But the fact that λ actually depends on the data makes the analysis rather complicated. I do not know of any theoretical results in this direction.

The situation is even more complex for the robust estimators. Deriving the asymptotic distributions of robust RR estimators for fixed λ seems feasible (though not trivial) if n → ∞ with p fixed. But the result would be irrelevant, for in practical terms this means that p/n is negligible, and in this case there would not be much difference between ridge and standard (i.e., with λ = 0) regression. The practically interesting cases are those where p/n is not negligible (or even > 1), and the corresponding asymptotic theory would have p tend to infinity with n, with p/n tending to a positive constant. This seems a very difficult task, even for fixed λ, and even more so for an estimated λ.

Another approach is to use the bootstrap, which implies recomputing the estimate at least several hundred times. As Table 6 shows, this is feasible when p < n for the modified Silvapulle estimator, or for RR-MM if the data set is small. In the case of large p > n, when only the latter can be employed, the required time seems too large for practical purposes. Therefore, estimating the variability of the robust RR estimators for large data sets would require either a solution to the problem of approximating their asymptotic distribution, or a way to drastically reduce computing times.

10 Conclusions

The modified Silvapulle estimator is simple and fast. The results of Section 5 for n > p indicate that it has a good performance if p/n is small and the contamination rate is low. Otherwise, RR-MM with either K_CV = n or K_CV = 5 perform better, the latter being the best.

When n < p the modified Silvapulle estimator cannot be employed. The results of Sections 6 and 7 indicate that (at least for the situations considered there) both versions of RR-MM outperform RSIMPLS, with respect to both efficiency and robustness. Although in general RR-MM with K_CV = 5 does better than with K_CV = n, the advantages of the former do not seem so clear as for n > p. Since the difference between the respective computing times is at least five-fold, K_CV = n should be preferred for high-dimensional data.

11 Supplemental material

Appendix: Contains details of the description of the estimators and of the algorithms, and details of derivations (.pdf file).

RidgeCode: Matlab code for the computing of RR-MM (zipped RAR files).

ACKNOWLEDGEMENTS

The author thanks the editor, the associate editor, and two referees for many constructive comments and suggestions, which greatly improved the quality of the article. This research was partially supported by grants X-018 from University of Buenos Aires, PID 5505 from CONICET and PICT 00899 from ANPCyT, Argentina.

References

Bühlmann, P. (2006). "Boosting for High-Dimensional Linear Models". The Annals of Statistics, 34, 559-583.

Crambes, C., Kneip, A. and Sarda, P. (2009). "Smoothing Splines Estimators for Functional Linear Regression". The Annals of Statistics, 37, 35-72.

Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). "Least Angle Regression". The Annals of Statistics, 32, 407-499.

Frank, I.E. and Friedman, J.H. (1993). "A Statistical View of Some Chemometrics Regression Tools". Technometrics, 35, 109-135.

Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. New York: Springer.

Hoerl, A.E. and Kennard, R.W. (1970). "Ridge Regression: Biased Estimation for Nonorthogonal Problems". Technometrics, 12, 55-67.

Hoerl, A.E., Kennard, R.W. and Baldwin, K.F. (1975). "Ridge Regression: Some Simulations". Communications in Statistics A (Theory and Methods), 4, 105-123.

Huber, P.J. (1964). "Robust Estimation of a Location Parameter". The Annals of Mathematical Statistics, 35, 73-101.

Hubert, M. and Vanden Branden, K. (2003). "Robust Methods for Partial Least Squares Regression". Journal of Chemometrics, 17, 537-549.

Janssens, K., Deraedt, I., Freddy, A. and Veekman, J. (1998). "Composition of 15-17th Century Archeological Glass Vessels Excavated in Antwerp, Belgium". Mikrochimica Acta, 15 (Suppl.), 253-267.

Krasker, W.S. and Welsch, R.E. (1982). "Efficient Bounded Influence Regression Estimation". Journal of the American Statistical Association, 77, 595-604.

Lawrence, K.D. and Marsh, L.C. (1984). "Robust Ridge Estimation Methods for Predicting U.S. Coal Mining Fatalities". Communications in Statistics, Theory and Methods, 13, 139-149.

Locantore, N., Marron, J.S., Simpson, D.G., Tripoli, N., Zhang, J.T. and Cohen, K.L. (1999). "Robust Principal Components for Functional Data". Test, 8, 1-28.

Maronna, R.A., Martin, R.D. and Yohai, V.J. (2006). Robust Statistics: Theory and Methods. John Wiley and Sons, New York.

Maronna, R.A. and Yohai, V.J. (2009). "Robust Functional Linear Regression Based on Splines". Available at http://mate.unlp.edu.ar/publicaciones/23112009231134.pdf

Maronna, R.A. and Yohai, V.J. (2010). "Correcting MM Estimates for 'Fat' Data Sets". Computational Statistics & Data Analysis, 54, 3168-3173.

Peña, D. and Yohai, V.J. (1999). "A Fast Procedure for Outlier Diagnostics in Large Regression Problems". Journal of the American Statistical Association, 94, 434-445.

Silvapulle, M.J. (1991). "Robust Ridge Regression Based on an M-estimator". Australian Journal of Statistics, 33, 319-333.

Simpson, J.R. and Montgomery, D.C. (1996). "A Biased-robust Regression Technique for the Combined Outlier-multicollinearity Problem". Journal of Statistical Computing and Simulation, 56, 1-20.

Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso". Journal of the Royal Statistical Society, Series B, 58, 267-288.

Yohai, V.J. (1987). "High Breakdown-point and High Efficiency Estimates for Regression". The Annals of Statistics, 15, 642-656.

Zou, H. and Hastie, T. (2005). "Regularization and Variable Selection via the Elastic Net". Journal of the Royal Statistical Society (B), 67, 301-320.