UNDERSTANDING STATISTICS, 3(4), 349–364 Copyright © 2004, Lawrence Erlbaum Associates, Inc.

Robust Regression Methods: Achieving Small Standard Errors When There Is Heteroscedasticity

Rand R. Wilcox
Department of Psychology
University of Southern California

H. J. Keselman
Department of Psychology
University of Manitoba

A serious practical problem with the ordinary least squares regression estimator is that it can have a relatively large standard error when the error term is heteroscedastic, even under normality. In practical terms, power can be poor relative to other regression estimators that might be used. This article illustrates the problem and summarizes strategies for dealing with it. Included are new results on the robust estimator recently studied by Anderson and Schumacker (2003).

Keywords: robust estimation, multivariate outliers, heteroscedasticity

In a recent article appearing in this journal, Anderson and Schumacker (2003) provided a good introduction to robust regression methods. They began by summarizing some fundamental concerns about the usual least squares estimator; they described some of the more basic robust methods; and they compared the standard errors of these methods to least squares, concluding that a variation of a minimum M-estimator, called an MM-estimator and originally suggested by Yohai (1987), be used. The goal of this article is to expand on Anderson and Schumacker's article by summarizing issues related to heteroscedasticity. We indicate why certain robust estimators can substantially increase power, even under normality; we summarize various methods aimed at dealing with heteroscedasticity; and we present new results on the relative performance of the MM-estimator studied by Anderson and Schumacker.

Requests for reprints should be sent to Rand R. Wilcox, Department of Psychology, University of Southern California, SGM 501, MC 1061, Los Angeles, CA 90089. E-mail: [email protected]


Another goal is to discuss some aspects of detecting outliers among multivariate data. When dealing with regression, multivariate outliers are related to the deleterious effects of heteroscedasticity. In essence, eliminating outliers can reduce heteroscedasticity, which in turn can increase power when testing hypotheses. Another concern with outliers is that they can have an inordinate influence on the least squares estimator, resulting in a poor reflection of how the bulk of the points are related. As a simple illustration, imagine that Y = X + ε, where both X and ε have standard normal distributions and are independent. We generated 20 points according to this model and got a least squares estimate of the slope equal to 1.06, close to the true value of 1. Then we added two points located at (2.1, –2.4) and found that the estimated slope dropped to 0.32. Note that for X = 2.1, and because we are working with normal distributions, Y = –2.4 is unusually far from the regression line; its distance is more than three standard deviations. So only two unusual values can have a substantial impact on the least squares estimator.

HETEROSCEDASTICITY: SOME BASICS

Consider two variables, say, X and Y. In regression, homoscedasticity refers to situations in which the conditional variance of Y, given X, does not vary with X. This is illustrated in Figure 1, which shows a plot of Y values corresponding to X = 25, 50, and 75. Heteroscedasticity refers to situations in which the conditional variance of Y, given X, varies with X, an example of which is shown in Figure 2. In symbols, the usual (homoscedastic) regression model is Y = β0 + β1X + ε, where β0 and β1 are the (unknown) intercept and slope, respectively, X and ε are independent, and ε has mean zero and variance σ2. So by implication, the conditional variance of Y, given X, is VAR(Y|X) = σ2, which does not vary with X. If X and Y are independent, homoscedasticity is implied, but otherwise there is no particular reason to assume homoscedasticity.


FIGURE 1 Homoscedasticity means that the conditional variance of Y, given X, does not change with X. Here, the variation among the Y values is the same at X = 25, 50, and 75, which is implied by homoscedasticity.

FIGURE 2 An example of heteroscedasticity. The conditional variance of Y, given X, changes with X.

Now consider any estimate of the slope and intercept, say, b1 and b0, respectively. A basic goal is to choose an estimator that has a relatively small standard error. A classic result by Gauss, known today as the Gauss–Markov theorem, described the optimal estimator among a class of estimators that seems quite natural and reasonable. Let (X1, Y1), … , (Xn, Yn) be a random sample of n points. Then among all weighted means of the Y values that might be used to estimate the intercept and slope, the optimal estimates are the values b1 and b0 that minimize

\[
\sum_i \frac{1}{\sigma_i^2}\,(Y_i - b_0 - b_1 X_i)^2,
\]

where σi2 is the variance of Y given that X = Xi. That is, use weighted least squares with weights wi = 1/σi2. A serious impediment when trying to take advantage of this result is that in most applied situations, σi2 is not known and estimation of the σi2 values is not straightforward, such as when X is fairly continuous. A related problem is finding effective methods for testing hypotheses and computing confidence intervals when there is heteroscedasticity.
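To make the Gauss–Markov point concrete, here is a minimal R sketch (R is one of the software packages referred to later in this article). The variance function, sample size, and number of replications are illustrative assumptions, not values taken from this article; the point is only that weighting by 1/σi2 can yield a noticeably smaller standard error than ordinary least squares when the error term is heteroscedastic.

```r
# Minimal sketch: OLS versus optimal weighted least squares under heteroscedasticity.
# The conditional standard deviation sigma(X) = 1 + |X| is an illustrative assumption.
set.seed(1)
n <- 50
nrep <- 2000
slope.ols <- slope.wls <- numeric(nrep)
for (r in 1:nrep) {
  x <- rnorm(n)
  sigma <- 1 + abs(x)                # standard deviation of Y given X
  y <- x + sigma * rnorm(n)          # true slope is 1
  slope.ols[r] <- coef(lm(y ~ x))[2]
  slope.wls[r] <- coef(lm(y ~ x, weights = 1 / sigma^2))[2]  # weights w_i = 1/sigma_i^2
}
sd(slope.ols)   # standard error of the OLS slope over replications
sd(slope.wls)   # typically noticeably smaller
```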


STRATEGIES FOR DEALING WITH HETEROSCEDASTICITY

There are at least five general strategies for dealing with heteroscedasticity. The first is to assume that it never causes any practical problems and simply use homoscedastic methods. A concern, however, is that the ordinary (homoscedastic) least squares estimator is known to have a relatively large standard error in some situations, even when the error term ε has a normal distribution (e.g., Wilcox, 2003). That is, power can be low compared to an estimator that effectively deals with heteroscedasticity. In defense of homoscedastic methods, one might suggest that situations in which heteroscedasticity causes practical problems are rare. In fairness, if X and Y are dependent, there seems to be little or no evidence to support this claim, and there are known situations where heteroscedasticity causes serious practical problems when analyzing real data. Rather than take a chance that homoscedastic methods will have relatively high power, perhaps a better strategy is to use heteroscedastic methods that perform about as well as homoscedastic methods when in fact there is homoscedasticity, but that continue to have relatively high power in situations where homoscedastic methods perform poorly. Two strategies outlined here are based on this point of view.

A related concern, relevant to the first strategy, is that very poor probability coverage can result when computing a confidence interval for β0 and β1. Put another way, control over the probability of a Type I error can be highly unsatisfactory when testing hypotheses, even with large sample sizes. The basic reason is that conventional homoscedastic methods use an incorrect estimate of the standard error when there is heteroscedasticity (e.g., Long & Ervin, 2000). Again, what seems like a more satisfactory strategy is to use a method that performs well under standard assumptions and that continues to perform well under heteroscedasticity.

A second strategy for dealing with heteroscedasticity is a direct approach: Attempt to estimate the optimal weights, wi (i = 1, … , n), and then use these weights in a weighted least squares estimation procedure. In the univariate case, this strategy can be described more formally as follows: Assume Y = X + λ(X)ε, where λ is some function used to model heteroscedasticity, and then attempt to estimate the function λ(X), which in turn yields an estimate of the optimal weights, wi. Methods based on this strategy have been proposed by Cohen, Dalal, and Tukey (1993) as well as Wilcox (1996a). On the positive side, the standard error of these estimators can be substantially smaller than the standard error of the ordinary least squares estimator. However, there are some serious drawbacks. One is that extensions of these methods to multiple predictors have not been derived. Methods for estimating λ(X) in the multiple-predictor case (based on smoothers) are available, but obtaining a reasonably accurate estimate is difficult, particularly when the sample size is relatively small. Another has to do with testing hypotheses. Wilcox (1996a) considered several methods when using the estimator proposed by Cohen et al., but a satisfactory method was not found. A reasonably satisfactory method was found when using Wilcox's estimator, but his method does not deal with other basic robustness issues discussed, for example, by Hampel, Ronchetti, Rousseeuw, and Stahel (1986); Huber (1981); Staudte and Sheather (1990); and Wilcox (in press). One of these issues has to do with the sensitivity of an estimator to outliers. A fundamental goal among robust methods is to find an estimator that cannot be dominated by a few unusual points. In particular, a small number of outliers should not be able to mask a true association. Least squares plus various other estimators suffer from this problem (e.g., Staudte & Sheather, 1990; Wilcox, 2003). We elaborate on this issue later in this article.

A third strategy for dealing with heteroscedasticity is indirect in the sense that, rather than estimate the optimal weights, one simply downweights or eliminates points that are unusually far from the regression line. Note that if the σi2 were known, the optimal weighted least squares estimator gives relatively little weight to points for which σi2 is large. Moreover, σi2 = VAR(Y|X = Xi) is relatively large if Y values at X = Xi are relatively far from the regression line. This suggests that if we were to downweight or eliminate points that are unusually far from the regression line, a smaller standard error, and higher power, might be achieved. Certain robust estimators accomplish this goal while maintaining high power under normality and homoscedasticity (Wilcox, 2003). One of our goals here is to expand on the list of robust estimators given by Anderson and Schumacker (2003) and Wilcox (2003), comment on their relative merits, and then present new results regarding the MM-estimator.

A fourth strategy is to use a transformation of the data such as taking logarithms or using a Box–Cox transformation. In some instances this improves matters, but this is not always the case. A particular concern here is that generally these transformations do not deal effectively with outliers. (For a detailed discussion of this point when dealing with measures of location, see Doksum & Wong, 1983.) More precisely, even after data have been transformed, outliers can remain (as illustrated by Wilcox & Keselman, 2003, when dealing with measures of location). So even after transforming data, estimators that reduce the effects of outliers can still increase power (e.g., Wilcox, in press). Another concern is that transformations can mask or eliminate some features of the associations among variables that are of particular interest (Doksum, Blyth, Bradlow, Meng, & Zhao, 1994).

A fifth strategy is to simply look at a scatter plot, remove any obvious outliers, and apply the least squares estimator to the data that remain. There are, however, two fundamental concerns. The first is that, even when using established outlier detection methods, the resulting standard error can be relatively large, even when standard assumptions are met. More precisely, under normality and homoscedasticity, no points should be eliminated as outliers, but points will be labeled as outliers by chance. One of the points made in this article is that there is, however, a variation of this approach that performs nearly as well as least squares when the error term is homoscedastic and has a normal distribution, and it can offer a distinct advantage when the error term is heteroscedastic. The other fundamental concern has to do with testing hypotheses: Eliminating outliers and using conventional methods can result in using the wrong standard error. Here we merely note that this latter issue can be addressed with appropriate bootstrap methods (Wilcox, 2003, in press).


THE FINITE SAMPLE BREAKDOWN POINT OF AN ESTIMATOR

There are several technical criteria used by mathematical statisticians when searching for a robust estimator. Before continuing with the main issues in this article, we briefly describe one of the simpler criteria in order to provide some sense of why certain robust regression estimators have been argued to have practical value. The finite sample breakdown point of an estimator reflects how many outliers are needed to completely ruin it. More precisely, it reflects the proportion of points that must be altered to make an estimator arbitrarily large or small. The finite sample breakdown point of the sample mean, for example, is only 1/n, meaning that only one of the n values can make the mean arbitrarily large or small. A consequence is that the sample mean can be highly atypical, as has been illustrated with real data by Wilcox and Keselman (2003) and Wilcox (2003). The median is relatively insensitive to outliers, and its finite sample breakdown point is approximately .5, the highest possible value. In regression, an issue is how many outliers are needed to ruin the least squares estimator, or any proposed estimator, the goal again being to reflect sensitivity to outliers. A reasonably high breakdown point is often sought, but in regression there are known situations where this does not guarantee that an estimator is always insensitive to outliers (e.g., Wilcox, 2003). That is, for some estimators, situations occur in which only two outliers can have a substantial impact, as will be illustrated. So a necessary condition for an estimator to be relatively immune to the deleterious effects of outliers is that it have a reasonably high finite sample breakdown point, but this does not guarantee that practical problems with outliers are always avoided.
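As a quick numerical illustration of the breakdown idea, the following R sketch (the data are made up) shows that altering a single value can move the sample mean arbitrarily far while leaving the median essentially untouched.

```r
# Breakdown illustration: one altered value ruins the mean but not the median.
x <- c(2.1, 2.4, 2.7, 3.0, 3.2, 3.5, 3.8, 4.0, 4.3, 4.6)
mean(x); median(x)     # both are near 3.3

x[10] <- 1e6           # replace a single value with an extreme outlier
mean(x); median(x)     # the mean explodes; the median is unchanged
```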

DETECTING MULTIVARIATE OUTLIERS

Anderson and Schumacker (2003) touched on the important issue of detecting outliers, and they described some of the earlier attempts at dealing with this problem. Many advances and improvements on these early methods have appeared. Complete details go well beyond the scope of this article, but some indication of the problem with early methods should be provided. Note that outliers can contribute to heteroscedasticity, which in turn can result in a relatively large standard error and poor power. So a practical reason for having an interest in outlier detection techniques is that they provide a way to reduce problems due to heteroscedasticity.

A fundamental concern with any outlier detection method is something called masking, roughly meaning that the very presence of outliers can result in even extreme outliers being missed. Hampel, Ronchetti, Rousseeuw, and Stahel (1986, p. 69) defined masking a bit more precisely as an event in which one outlier prohibits the detection of another outlier among the data. The notion of masking has a certain similarity to the finite sample breakdown point: How many outliers does it take to ruin an outlier detection method? In the univariate case, for example, outlier detection methods based on the mean and variance perform poorly because when there are two or more outliers, none might be found no matter how extreme they might be (e.g., Rousseeuw & Leroy, 1987; Wilcox, 2003; Wilcox & Keselman, 2003). Masking occurs when one uses an outlier detection method based on the mean and variance because outliers can inflate the mean, and particularly the sample variance, the result being that points that are obvious outliers, based on plots of the data, are not flagged as outliers.

The modern strategy for dealing with masking in the univariate case is to replace the sample mean and the sample variance with measures of central tendency and dispersion that are insensitive to outliers. One common approach is to replace the mean with the median and the sample variance with the so-called median absolute deviation (see Wilcox & Keselman, 2003). Another possibility is some variation of boxplot methods (e.g., Carling, 2000; Hoaglin & Iglewicz, 1987).

With multivariate data, the best known method for detecting outliers, which was discussed by Anderson and Schumacker (2003), is undoubtedly an approach based on the vector of means and the usual variance–covariance matrix using Mahalanobis distance. That is, when dealing with p-variate data, compute the usual vector of means, X̄, and the variance–covariance matrix S, in which case the distance of the ith point from the sample mean is

\[
D_i = \sqrt{(X_i - \bar{X})'\,S^{-1}(X_i - \bar{X})}. \qquad (1)
\]

If Di is sufficiently large, declare Xi an outlier. However, it has long been known that this approach suffers from masking (Rousseeuw & Leroy, 1987). Overcoming this problem might appear to be trivial: Simply check for outliers among each of the marginal distributions using a technique that avoids masking. So, for example, in the bivariate case, this means we merely check for outliers among the X values, ignoring Y, and then check for outliers among the Y values, ignoring X. This method can be unsatisfactory, however, one reason being that, when dealing with correlation (r) and regression, outliers that can grossly affect Pearson's correlation and the least squares regression line can be missed. That is, it is possible to have outliers that have an inordinate effect, yet when checking for outliers among the X values ignoring Y, or when checking for outliers among the Y values ignoring X, none are found.

Consider again the 20 points generated earlier plus the two points added at (2.1, –2.4). If we check for outliers among the X values using the median absolute deviation–median rule described, for example, in Wilcox and Keselman (2003), or using the boxplot rule in Carling (2000), or the one in Hoaglin and Iglewicz (1987), no outliers are found. The same is true for the Y values, ignoring X. Yet the two points at (2.1, –2.4) are outliers, and they have a substantial impact on the least squares estimate of the slope and intercept as well as r. More precisely, the estimated slope drops from 1.06 to 0.32, and r drops from .677 to .212. Moreover, when testing H0: ρ = 0 with Student's T, the p value increases from .001 to .343. So a significant result becomes nonsignificant, meaning that we fail to detect a true association. Of course, the association is weaker in some sense with the addition of the two outliers. Nevertheless, an association remains, surely failing to detect it is undesirable, and for the bulk of the points the association is fairly strong.

Now we provide a brief overview of some modern multivariate outlier detection methods. Myriad methods have been proposed; virtually all of them offer substantial advantages over approaches based on means and the usual covariance matrix, but, for some purposes, some of the best known methods (among statisticians) have recently been found to be unsatisfactory for some of the practical problems considered here. For a book-length discussion of how to detect outliers, see Barnett and Lewis (1994).

Notice that in the univariate case, to avoid masking, outlier detection methods are based on measures of dispersion (and location) that are not themselves sensitive to outliers. Various multivariate outlier detection methods are based on this same strategy. One of the earliest and more successful methods is based on the so-called minimum volume ellipsoid (MVE) estimator. Consider any subset of the data containing half the points. The MVE estimator of location and scatter begins by searching among all such subsets with the goal of finding the one with the smallest volume. Once (an approximation of) this subset is found, compute the mean and covariance matrix of these centrally located points, ignoring the other half of the data. It is evident that this approach guards against outliers. (The finite sample breakdown point is approximately .5, the highest possible value.) Details on using the MVE measures of location and scale to detect outliers can be found in Rousseeuw and van Zomeren (1990). Basically, replace the usual mean and covariance matrix in Equation 1 with the MVE estimates when computing Di, and then declare Xi an outlier if Di exceeds the square root of the .975 quantile of a chi-squared distribution with p degrees of freedom (see also Davies, 1987; Fung, 1993; Hampel et al., 1986; Rousseeuw & Leroy, 1987; Rousseeuw & Van Driessen, 1999; Tyler, 1991; cf. Becker & Gather, 1999). Recently, Rousseeuw and Van Driessen suggested replacing the MVE with the minimum covariance determinant (MCD) estimator (i.e., searching for the half of the data that minimizes the generalized variance, a well-known measure of dispersion in multivariate statistics). Both the MVE and MCD estimators can be computed with built-in functions in SAS (1990), S–PLUS (Becker et al., 1988), and R (Venables & Smith, 2002).

Outlier detection methods based on the MVE and MCD estimators represent a major improvement over methods based on Mahalanobis distance, but in recent years, for certain purposes, even better methods have appeared (e.g., Wilcox, 2003, in press). One of these is called a minimum generalized variance (MGV) method. We do not describe the straightforward but tedious details here, but we note that the method is easily implemented via the software packages S–PLUS and R, as illustrated by Wilcox (2003, in press).
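To give a sense of how such robust distances are used, here is a small R sketch. It relies on cov.rob() from the MASS package to obtain MCD estimates of location and scatter (an implementation choice assumed here; any MVE or MCD routine would do), and it is not the MGV method of Wilcox (2003, in press).

```r
# Sketch: classical versus MCD-based robust distances for bivariate data.
library(MASS)  # provides cov.rob()

set.seed(2)
x <- rnorm(20)
y <- x + rnorm(20)
X <- rbind(cbind(x, y), c(2.1, -2.4), c(2.1, -2.4))  # append two unusual points

cutoff <- sqrt(qchisq(.975, df = ncol(X)))

# Classical distances, as in Equation 1: prone to masking.
d.classical <- sqrt(mahalanobis(X, colMeans(X), cov(X)))

# Robust distances: MCD center and scatter replace the mean and covariance matrix.
fit <- cov.rob(X, method = "mcd")
d.robust <- sqrt(mahalanobis(X, fit$center, fit$cov))

which(d.classical > cutoff)  # the added points may escape detection
which(d.robust > cutoff)     # the robust distances are more likely to flag them
```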

CONVENTIONAL REGRESSION DIAGNOSTICS

A natural approach, when trying to determine whether a point has an undue influence on the least squares regression line, is to remove it and observe the extent to which this alters the estimate based on all of the data (Cook, 1977, 1979). Anderson and Schumacker (2003) summarized the more basic techniques. There are, in fact, several methods based on this strategy, some of which are a function of the diagonals of the so-called hat matrix (e.g., Fox, 1999; Montgomery & Peck, 1992), which is reported by both SAS and SPSS (e.g., Norusis, 2000). Belsley, Kuh, and Welsch (1980) developed this idea into a standardized difference of fitted values commonly called DFFITS, and they developed an alternative measure called DFBETAS. A related approach is to assess the effect of removing a point on the precision of the least squares estimator. This can be done via the measure called COVRATIO, which reflects the effect on the volume of the confidence region for all parameters when a single point is deleted from the data. COVRATIO assumes homoscedasticity, which leads to a standard expression for the variances and covariances of the estimators, and the determinant of this covariance matrix plays a role when using COVRATIO (Belsley et al., 1980). All of these methods represent important milestones, but by today's standards they are less satisfactory than more recently developed techniques (e.g., Staudte & Sheather, 1990). A basic concern with these methods is that they provide protection against a single outlier, but two outliers can cause a problem; that is, they can suffer from masking when there are only two outliers.
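For reference, these single-case deletion diagnostics are available as standard functions in R; the sketch below, using made-up data, shows how the hat values, DFFITS, DFBETAS, and COVRATIO discussed above can be obtained from an ordinary least squares fit.

```r
# Sketch: classical single-case deletion diagnostics for a least squares fit.
set.seed(3)
x <- rnorm(25)
y <- x + rnorm(25)
fit <- lm(y ~ x)

hatvalues(fit)   # diagonals of the hat matrix (leverage)
dffits(fit)      # standardized change in the fitted value when case i is deleted
dfbetas(fit)     # standardized change in each coefficient when case i is deleted
covratio(fit)    # effect of deleting case i on the precision of the estimates

# Convenient summary that flags cases judged influential by the usual cutoffs:
summary(influence.measures(fit))
```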

ROBUST REGRESSION

Among the strategies previously described for dealing with heteroscedasticity, replacing the least squares estimator with some type of robust estimator currently appears to be the best approach. A variety of such estimators has been proposed (e.g., Rousseeuw & Leroy, 1987; Staudte & Sheather, 1990; Wilcox, 2003, in press), many of which are known to have substantial advantages over least squares for a wide range of situations. Our immediate goal is to provide an overview of some of the strategies that might be used, and then we discuss some of their relative merits. (Hypothesis testing can be performed with appropriate bootstrap methods covered by Wilcox, 2003, in press.)

Functions of the Residuals

From basic principles, least squares regression is based on the goal of minimizing

\[
\sum r_i^2,
\]

where ri (i = 1, … , n) are the usual residuals. Recall from basic principles that this method makes no distributional assumptions; it merely specifies a reasonable method for judging a fit to data. The main point here is that this method can be generalized, again without making any distributional assumptions, by saying that we want to minimize

\[
\sum \xi(r_i),
\]

where ξ is some appropriate function of the residuals. The choice ξ(ri) = ri2 leads to least squares estimation, and ξ(ri) = |ri| yields the so-called least absolute value method of estimation, which predates least squares by about 50 years and is among the methods considered by Anderson and Schumacker (2003). However, the least absolute value estimator has a finite sample breakdown point of only 1/n. It protects against unusual Y values, but a single outlier among the X values can cause serious practical problems.

Choices for ξ also include what are called M-estimators. In fact, many variations have been proposed in recent years, including what are called generalized M-estimators and the MM-estimator studied by Anderson and Schumacker (2003). Some of these estimators (e.g., Coakley & Hettmansperger, 1993) have the highest possible breakdown point. We note that the computation of Yohai's MM-estimator is an involved process consisting of three stages. (In brief, an initial fit is computed using some other robust estimator; a robust measure of scale associated with the residuals is computed, which in turn is used in a conventional M-estimation technique.) The computations can be performed by the S–PLUS command summary(lmRob(y~x)) after issuing the command library(robust). Because our simulations do not support the use of this MM-estimator, we follow Anderson and Schumacker (2003) and omit further details.

Another approach to fitting a plane to data is to ignore the largest residuals when assessing the fit. Some of these methods were listed by Anderson and Schumacker (2003) and include the least trimmed squares estimator, meaning that one trims the largest squared residuals when assessing fit. Rather than using the squared residuals, another variation is to use their absolute values and again trim the largest values. One more strategy is to take the slope and intercept to be the values that minimize the median of the squared residuals. This is called the least median of squares estimator. These estimators offer certain practical advantages over least squares, but all indications are that other regression estimators are usually preferable in terms of achieving a small standard error (Wilcox, in press).
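To make the criterion ∑ξ(ri) concrete, the following R sketch fits a simple regression by directly minimizing the sum of absolute residuals, the choice ξ(r) = |r| mentioned above, using the general-purpose optimizer optim(). The data are made up, and the sketch is meant only to illustrate the idea of replacing the squared-residual criterion.

```r
# Sketch: least absolute value regression by minimizing sum(|residuals|) directly.
set.seed(4)
x <- rnorm(30)
y <- x + rnorm(30)

# Objective function: sum of xi(r_i) with xi(r) = |r|.
lav.criterion <- function(b, x, y) sum(abs(y - b[1] - b[2] * x))

# Start from the least squares fit and minimize the criterion numerically.
start <- coef(lm(y ~ x))
fit.lav <- optim(start, lav.criterion, x = x, y = y)

coef(lm(y ~ x))  # least squares: xi(r) = r^2
fit.lav$par      # least absolute value: xi(r) = |r|
```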

S-Estimators

Another general approach, resulting in what are called S-estimators, is to search for the regression line that minimizes some robust measure of variability applied to all of the residuals. Again, distributional assumptions are not made when assessing fit. These estimators offer clear advantages over least squares and other estimators but, under heteroscedasticity, it currently seems that alternative estimators are preferable (Wilcox, in press). A related approach in the one-predictor case, called the STS (S-type Theil–Sen) estimator, is to compute the slope between points i and i', Sii', and take the estimate of the slope to be the value of Sii' that minimizes a robust measure of scale applied to the values Y1 – Sii'X1, … , Yn – Sii'Xn. Here, a robust measure of variation called the percentage bend midvariance is used. For more details, plus variations of this method and extensions to multiple predictors, see Wilcox (2003, Section 13.3).

Correlation-Type Estimators

There are two classes of correlation-type estimators (Wilcox, in press) but, for brevity, we describe only the class that has relevance here. Consider p predictors, X1, … , Xp, and let τj be any correlation between Xj, the jth predictor, and Y – b1X1 – … – bpXp. Then a general class of correlation-type estimators is obtained by choosing the slope estimates b1, … , bp so as to minimize ∑|τ̂j|. An important special case that has received considerable attention is where τj is Kendall's tau. With a single predictor, this yields the Theil (1950) and Sen (1968) estimator. (A simple method for computing the Theil–Sen estimator can be found in Wilcox, 2003, and various extensions to multiple predictors were described by Wilcox, in press.) In the single-predictor case, the Theil–Sen estimator takes on a fairly simple form. For any Xi < Xi', again let Sii' be the slope of the line connecting these two points. The Theil–Sen estimate of the slope, bts, is simply the median of all the Sii' values.

Skipped Estimators

Skipped estimators are based on the strategy of first applying some multivariate outlier detection method, eliminating any points that are flagged as outliers, and applying some regression estimator to the data that remain. A natural suggestion is to apply the usual least squares estimator, but this approach has been found to be rather unsatisfactory (e.g., Wilcox, in press). To achieve a relatively small standard error and high power under heteroscedasticity, a better approach is to apply the Theil–Sen estimator after outliers have been removed. Currently, one of the best outlier detection methods appears to be the MGV method (Wilcox, in press). Use of the Theil–Sen estimator after outliers are removed by the MGV method yields the MGV estimator. It is easily applied using the S–PLUS or R function mgvreg described by Wilcox (2003, in press).
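Because the single-predictor Theil–Sen estimator has such a simple form, it can be sketched directly in R, as follows. The skipped variant shown here removes points flagged by an MCD-based distance rule before applying Theil–Sen; it is offered only as an illustrative stand-in and is not the MGV method or the mgvreg function just cited.

```r
# Sketch: Theil-Sen slope (median of all pairwise slopes) and a simple skipped variant.
library(MASS)  # cov.rob() is used below for the outlier-removal step

theil.sen <- function(x, y) {
  ij <- combn(length(x), 2)                  # all pairs of points
  slopes <- (y[ij[2, ]] - y[ij[1, ]]) / (x[ij[2, ]] - x[ij[1, ]])
  b1 <- median(slopes, na.rm = TRUE)         # slope: median of the pairwise slopes
  b0 <- median(y - b1 * x)                   # one common choice of intercept
  c(intercept = b0, slope = b1)
}

# Skipped variant: remove multivariate outliers (here via MCD distances), then fit.
skipped.ts <- function(x, y) {
  X <- cbind(x, y)
  fit <- cov.rob(X, method = "mcd")
  d <- sqrt(mahalanobis(X, fit$center, fit$cov))
  keep <- d <= sqrt(qchisq(.975, df = ncol(X)))
  theil.sen(x[keep], y[keep])
}
```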

E-Type Skipped Estimators

E-type skipped estimators (where E stands for error term) look for outliers among the residuals based on some preliminary fit, remove (or downweight) the corresponding points, and then compute a new fit to the data. Rousseeuw and Leroy (1987) suggested using LMS (least median of squares) to obtain an initial fit, removing any points for which the corresponding standardized residuals are large, and then applying least squares to the data that remain. He and Portnoy (1992) showed that this approach performs rather poorly in terms of its standard error. An E-type estimator that appears to have practical value, called the TSTS (trimmed STS) estimator, uses the STS estimator (described in the S-Estimators section) to obtain an initial fit, eliminates points with unusually large residuals using a boxplot rule, and then applies the Theil–Sen estimator to the data that remain. The calculations are performed by the R and S–PLUS function tstsreg in Wilcox (2003, in press).

AN ILLUSTRATION

Consider again the situation where Y = X + ε, n = 20, both X and ε have standard normal distributions, X and ε are independent, and two outliers are added at (2.1, –2.4). Using the MM-estimator recommended by Anderson and Schumacker (2003), the estimate of the slope, before the outliers are added, is 1.06, very close to the true slope, 1. However, with the addition of the outliers, the MM-estimate is now equal to –0.11. Again, a reasonable argument is that the association is weaker with the addition of the outliers, but it would be erroneous to conclude that no association exists. Put another way, one goal of robust estimators is to not have a few aberrant points dominate, and in this particular case the MM-estimator does not achieve this goal. Without the outliers, the Theil–Sen estimate of the slope is 1.01, but with the outliers it is 0.52. The same problem (the estimate of the slope being adversely affected by only two outliers) arises when using a range of other estimators that have been proposed that are not covered here but are listed in Wilcox (in press).

Some robust estimators avoid the problem just illustrated, but many of them can have relatively large standard errors, particularly under heteroscedasticity. So a problem of recent interest has been dealing with outliers that can have an inordinate influence and simultaneously achieving a relatively low standard error under heteroscedasticity. One estimator that has been found to achieve this goal is the MGV estimator. Even with the outliers added, it estimates the slope to be 0.97. So the MGV estimator yields a value that is close to the true slope used to generate the bulk of the data, and on average it performs relatively well if we repeatedly generate data as done here (Wilcox, 2003, in press). The TSTS estimator also performs reasonably well; in the example with the outliers, the estimated slope is 1.01.
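Readers who want to experiment with examples of this kind can adapt the following R sketch. It uses rlm(..., method = "MM") from the MASS package rather than the lmRob function mentioned earlier, reuses the theil.sen() function sketched in the previous section, and generates its own random sample, so the particular estimates will differ from the numbers reported above.

```r
# Sketch: effect of two unusual points on least squares, an MM-type fit, and Theil-Sen.
library(MASS)
set.seed(5)

x <- rnorm(20)
y <- x + rnorm(20)                             # true slope is 1
x2 <- c(x, 2.1, 2.1)
y2 <- c(y, -2.4, -2.4)                         # add two points at (2.1, -2.4)

coef(lm(y ~ x))[2]                             # least squares, original 20 points
coef(lm(y2 ~ x2))[2]                           # least squares with the two outliers
coef(rlm(y ~ x, method = "MM"))[2]             # an MM-type fit (MASS implementation)
coef(rlm(y2 ~ x2, method = "MM"))[2]
theil.sen(x, y)["slope"]                       # theil.sen() from the earlier sketch
theil.sen(x2, y2)["slope"]
```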

COMPARISONS BASED ON STANDARD ERRORS

Now we report new results on the MM-estimator when there is heteroscedasticity. The issue is how the standard error of the MM-estimator compares to some of the better robust estimators outlined here. To find out, we used simulations to estimate the standard error of the MM-estimator for four types of distributions and three patterns of heteroscedasticity. The four distributions were (a) normal, (b) symmetric and heavy-tailed, (c) asymmetric and light-tailed, and (d) asymmetric and heavy-tailed, and are the same distributions used by Wilcox (2003, p. 493) to compare other regression estimators not covered here. More precisely, observations were generated from a g-and-h distribution (Hoaglin, 1985). Readers interested in more details are referred to Wilcox (1996a, 1996b). Values for both X and ε were generated from one of the four distributions, and then Y = X + λ(X)ε was computed, where the function λ(X) is used to model heteroscedasticity. The three choices for λ(X) were

\[
\lambda(X) = 1 \ \ (\text{homoscedasticity}), \qquad \lambda(X) = X^2, \qquad \text{and} \qquad \lambda(X) = 1 + \frac{2}{|X| + 1}.
\]

These three choices for λ(X) will be called variance patterns 1, 2, and 3, respectively.

Note that in general, robust estimators might be estimating different quantities depending on the distributions generating the data. For example, the population means and medians can differ. In our simulation study, however, all of the robust regression estimators are attempting to estimate the same slope, which has the value 1. That is, regardless of how the data were generated and which regression estimator is used, the population parameters being estimated are the same. So here, our results on standard errors have implications about power. Nevertheless, it is possible to generate data so that different regression estimators are attempting to estimate different population values for the slope. So the first of two estimators might have a smaller standard error, but the second might yield more power when testing hypotheses. The only known way to determine whether the choice of estimator can result in a different conclusion when testing hypotheses is to try both. Also, if a robust estimator gives a substantially different estimate of the slope versus least squares, this might be a sign that least squares provides a poor summary of the bulk of the observations.

TABLE 1
Estimated Ratios of Standard Errors

X    ε    VP      MM       TS      MGV     TSTS
N    N     1     0.88     0.91     0.91     0.88
N    N     2     1.97     2.64     2.62     2.36
N    N     3     0.92   202.22   196.31   135.70
N    SH    1     2.29     4.28     4.27     3.51
N    SH    2     6.59    10.67    10.94     8.66
N    SH    3     1.04   220.81   214.31   121.35
N    AL    1     3.37     1.13     1.13     1.05
N    AL    2     8.97     3.21     3.21     2.84
N    AL    3     3.86   183.74   177.53   106.70
N    AH    1     0.76     8.89     8.85     7.05
N    AH    2    30.63    26.66    27.07    20.89
N    AH    3     0.79   210.37   204.25   103.04
SH   N     1     0.76     0.81     0.72     0.76
SH   N     2    30.63    40.57    42.30    27.91
SH   N     3     0.79    41.70    34.44    22.57
SH   SH    1     1.96     3.09     2.78     2.41
SH   SH    2    52.55    78.43    83.56    47.64
SH   SH    3     1.87    38.70    31.93    17.80
SH   AL    1     0.92     0.99     0.87     0.90
SH   AL    2    33.82    46.77    49.18    31.46
SH   AL    3     0.91    39.32    32.70    19.76
SH   AH    1     3.14     6.34     5.64     4.62
SH   AH    2    59.55   138.53   146.76    78.35
SH   AH    3     3.13    43.63    37.34    18.40

Note. n = 20. VP = variance pattern; MM = minimum M-estimator; TS = Theil–Sen estimator; MGV = minimum generalized variance; TSTS = trimmed S-type Theil–Sen estimator; N = normal; SH = symmetric, heavy-tailed; AL = asymmetric, light-tailed; AH = asymmetric, heavy-tailed.

Table 1 shows the estimates, based on 5,000 replications, of the standard error of the least squares estimator divided by the estimated standard error of the estimator indicated at the top of the column. So, for example, under the column labeled “TS” (the Theil–Sen estimator), when both X and ε have normal distributions and there is homoscedasticity, the ratio is 0.91, indicating that least squares is slightly preferable to the Theil–Sen estimator. Note that many entries are substantially greater than 1, indicating that the least squares estimator has a relatively large standard error. Looking at the column headed by “MM,” we see that in some situations the MM-estimator offers substantial improvement over least squares, but generally it does not perform as well as the TS, MGV, and TSTS estimators, particularly for variance pattern 3. (An exception is when X has a normal distribution, ε has a skewed, light-tailed distribution, and there is homoscedasticity.)
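For readers who want to see how ratios of this kind are produced, the following R sketch estimates the ratio of standard errors for least squares versus Theil–Sen under variance pattern 2 with normal X and ε. It is a simplified stand-in for the study summarized in Table 1: it uses normal rather than g-and-h distributions, fewer replications, and the theil.sen() function sketched earlier, so the resulting ratio will only roughly resemble the corresponding table entry.

```r
# Sketch: estimating a ratio of standard errors by simulation (variance pattern 2).
set.seed(6)
n <- 20
nrep <- 1000                        # Table 1 used 5,000 replications
lambda <- function(x) x^2           # variance pattern 2; pattern 3 would be 1 + 2/(abs(x) + 1)

b.ls <- b.ts <- numeric(nrep)
for (r in 1:nrep) {
  x <- rnorm(n)
  y <- x + lambda(x) * rnorm(n)     # Y = X + lambda(X) * epsilon, true slope 1
  b.ls[r] <- coef(lm(y ~ x))[2]
  b.ts[r] <- theil.sen(x, y)["slope"]
}
sd(b.ls) / sd(b.ts)                 # estimated ratio of standard errors
```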


CONCLUSIONS

Anderson and Schumacker (2003) concluded that, in terms of achieving a relatively small standard error, a version of an MM-estimator is best among the regression estimators they considered. However, their numeric results might suggest that, despite this, it offers little advantage over least squares. One goal here was to point out that when there is heteroscedasticity, various robust estimators might offer a substantial advantage. Our new results indicate that the MM-estimator can improve on least squares, but all indications are that, in general, other robust estimators are preferable. Finally, based on Table 1, the simpler Theil–Sen estimator competes well with the MGV method, and it offers some protection against outliers. But as was illustrated, the Theil–Sen estimator can be adversely affected by two or more properly placed outliers. One of the main results here is that researchers can guard against this problem with little or no loss for the situations studied in Table 1.

REFERENCES

Anderson, C., & Schumacker, R. E. (2003). A comparison of five robust regression methods with ordinary least squares regression: Relative efficiency, bias, and test of the null hypothesis. Understanding Statistics, 2, 79–103.
Barnett, V., & Lewis, T. (1994). Outliers in statistical data (3rd ed.). New York: Wiley.
Becker, C., & Gather, U. (1999). The masking breakdown point of multivariate outlier detection rules. Journal of the American Statistical Association, 94, 947–955.
Becker, R. A., Chambers, J. M., & Wilks, A. R. (1988). The new S language. Pacific Grove, CA: Wadsworth & Brooks/Cole.
Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley.
Carling, K. (2000). Resistant outlier rules and the non-Gaussian case. Computational Statistics & Data Analysis, 33, 249–258.
Coakley, C. W., & Hettmansperger, T. P. (1993). A bounded influence, high breakdown, efficient regression estimator. Journal of the American Statistical Association, 88, 872–880.
Cohen, M., Dalal, S. R., & Tukey, J. W. (1993). Robust, smoothly heterogeneous variance regression. Applied Statistics, 42, 339–354.
Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19, 15–18.
Cook, R. D. (1979). Influential observations in linear regression. Journal of the American Statistical Association, 74, 169–174.
Davies, P. L. (1987). Asymptotic behavior of S-estimators of multivariate location parameters and dispersion matrices. Annals of Statistics, 15, 1269–1292.
Doksum, K. A., Blyth, S., Bradlow, E., Meng, X., & Zhao, H. (1994). Correlation curves as local measures of variance explained by regression. Journal of the American Statistical Association, 89, 571–582.
Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411–417.
Fox, J. (1999). Applied regression analysis, linear models, and related methods. Thousand Oaks, CA: Sage.
Fung, W.-K. (1993). Unmasking outliers and leverage points: A confirmation. Journal of the American Statistical Association, 88, 515–519.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. (1986). Robust statistics: The approach based on influence functions. New York: Wiley.
He, X., & Portnoy, S. (1992). Reweighted LS estimators converge at the same rate as the initial estimator. Annals of Statistics, 20, 2161–2167.
Hoaglin, D. C. (1985). Summarizing shape numerically: The g-and-h distributions. In D. Hoaglin, F. Mosteller, & J. Tukey (Eds.), Exploring data tables, trends, and shapes. New York: Wiley.
Hoaglin, D. C., & Iglewicz, B. (1987). Fine-tuning some resistant rules for outlier labelling. Journal of the American Statistical Association, 82, 1147–1149.
Huber, P. (1981). Robust statistics. New York: Wiley.
Long, J. S., & Ervin, L. H. (2000). Using heteroscedasticity consistent standard errors in the linear regression model. American Statistician, 54, 217–224.
Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley.
Norusis, M. J. (2000). SPSS 10.0 guide to data analysis. Englewood Cliffs, NJ: Prentice Hall.
Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier detection. New York: Wiley.
Rousseeuw, P. J., & Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41, 212–223.
Rousseeuw, P. J., & van Zomeren, B. C. (1990). Unmasking multivariate outliers and leverage points (with discussion). Journal of the American Statistical Association, 85, 633–639.
SAS. (1990). SAS language and procedures. Cary, NC: Author.
Sen, P. K. (1968). Estimates of the regression coefficient based on Kendall's tau. Journal of the American Statistical Association, 63, 1379–1389.
Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.
Theil, H. (1950). A rank-invariant method of linear and polynomial regression analysis. Indagationes Mathematicae, 12, 85–91.
Tyler, D. E. (1991). Some issues in the robust estimation of multivariate location and scatter. In W. Stahel & S. Weisberg (Eds.), Directions in robust statistics and diagnostics, Part 2 (pp. 327–336). New York: Springer-Verlag.
Venables, W. N., & Smith, D. M. (2002). An introduction to R. Bristol, England: Network Theory.
Wilcox, R. R. (1996a). Confidence intervals for the slope of a regression line when the error term has non-constant variance. Computational Statistics & Data Analysis, 22, 89–98.
Wilcox, R. R. (1996b). Estimation in the simple linear regression model when there is heteroscedasticity of unknown form. Communications in Statistics—Theory and Methods, 25, 1305–1324.
Wilcox, R. R. (2003). Applying contemporary statistical methods. San Diego, CA: Academic.
Wilcox, R. R. (in press). Introduction to robust estimation and hypothesis testing (2nd ed.). San Diego, CA: Academic.
Wilcox, R. R., & Keselman, H. J. (2003). Modern robust data analysis methods: Measures of central tendency. Psychological Methods, 8, 254–274.
Yohai, V. J. (1987). High breakdown point and high efficiency robust estimates for regression. Annals of Statistics, 15, 642–656.

364

WILCOX AND KESELMAN

Fung, W.-K. (1993). Unmasking outliers and leverage points: A confirmation. Journal of the American Statistical Association, 88, 515–519. Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. (1986). Robust statistics: The approach based on influence functions. New York: Wiley. He, X., & Portnoy, S. (1992). Reweighted LS estimators converge at the same rate as the initial estimator. Annals of Statistics, 20, 2161–2167. Hoaglin, D. C. (1985). Summarizing shape numberically: The grand distribution. In D. Hoaglin, F. Mosteller, & J. Tukey (Eds.), Exploring data tables trends and shapes. New York: Wiley. Hoaglin, D. C., & Iglewicz, B. (1987). Fine-tuning some resistant rules for outlier labelling. Journal of the American Statistical Association, 82, 1147–1149. Huber, P. (1981). Robust statistics. New York: Wiley. Long, J. S., & Ervin, L. H. (2000). Using heteroscedasticity consistent standard errors in the linear regression model. American Statistician, 54, 217–224. Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis. New York: Wiley. Norusis, M. J. (2000). SPSS 10.0 guide to data analysis. Englewood Cliffs, NJ: Prentice Hall. Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier detection. New York: Wiley. Rousseeuw, P. J., & van Driesen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41, 212–223. Rousseeuw, P. J. & van Zomeren, B. C. (1990). Unmasking multivariate outliers and leverage points (with discussion). Journal of the American Statistical Association, 85, 633–639. SAS. (1990). SAS language and procedures. Cary, NC: Author. Sen, P. K. (1968). Estimate of the regression coefficient based on Kendall’s tau. Journal of the American Statistical Association, 63, 1379–1389. Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley. Theil, H. (1950). A rank-invariant method of linear and polynomial regression analysis. Indagationes Mathematicae, 12, 85–91. Tyler, D. E. (1991). Some issues in the robust estimation of multivariate location and scatter. In W. Stahel & S. Weisberg (Eds.), Directions in robust statistics and diagnostics, Part 2 (pp. 327–336). New York: Springer-Verlag. Venables, W. N., & Smith, D. M. (2002). An introduction to R. Bristol, England: Network Theory. Wilcox, R. R. (1996a). Confidence intervals for the slope of a regression line when the error term has non-constant variance. Computational Statistics & Data Analysis, 22, 89–98. Wilcox, R. R. (1996b). Estimation in the simple linear regression model when there is heteroscedasticity of unknown form. Communications in Statistics—Theory and Methods, 25, 1305–1324. Wilcox, R. R. (2003). Applying contemporary statistical methods. San Diego, CA: Academic. Wilcox, R. R. (in press). Introduction to robust estimation and hypothesis testing (2nd ed.). San Diego, CA: Academic. Wilcox, R. R., & Keselman, H. J. (2003). Modern robust data analysis methods: Measures of central tendency. Psychological Methods, 8, 254–274. Yohai, V. J. (1987). High breakdown point and high efficiency robust estimates for regression. Annals of Statistics, 15, 642–656.