DIAGNOSTICS IN DESIGN OF EXPERIMENTS
L.M. Bhar
I.A.S.R.I., Library Avenue, New Delhi-110 012

1. Introduction
Regression analysis is a statistical methodology that utilizes the relation between two or more quantitative variables so that one variable can be predicted from the other, or others. This methodology is widely used in business, the social and behavioural sciences, the biological sciences and many other disciplines, including design of experiments.

When a regression model is considered for an application, we can usually not be certain in advance that the model is appropriate for that application. Any one, or several, of the features of the model, such as linearity of the regression function or normality of the error terms, may not hold for the data, and should be checked before inferences based on that model are undertaken. In addition, the data may contain outliers or extreme values which are not easily detectable but highly influential, as the least squares estimation procedure tends to pull the estimated regression response towards outlying observations.

Many diagnostic techniques, which detect problems with the data and provide remedies, are found in the literature of regression analysis. Many of these techniques are based on residual analysis. The basic set-up of design of experiments comes under the general set-up of regression, so some of the diagnostic techniques for regression analysis can be applied to this field as well. However, some useful statistics cannot be applied directly. We shall deal with the diagnostic techniques specific to design of experiments separately in a later section, along with some other aspects of diagnosis. Residual analysis, nonetheless, is an important tool for diagnosing any data set arising from regression analysis.
Before dealing with diagnostics in design of experiments, we present a brief overview of residual analysis, which can also be applied to design of experiments, including response surface designs.

2. Residual Analysis
Throughout, we consider the regression model y = Xβ + ε, where y is an n-component vector of observations, X is an n x p matrix of predictor variables of rank p (in design of experiments, X is the design matrix), β is the p-component vector of parameters and ε is the n-component vector of errors. The errors are distributed normally with mean zero and variance σ².

Direct diagnostic plots for the variable y are ordinarily not very useful in regression analysis, because the values of the observations on the response variable are a function of the level of the predictor variable. Instead, diagnostics for the response variable are usually carried out indirectly through an examination of the residuals. The residual ei is the difference between the observed value yi and the fitted value ŷi:

ei = yi − ŷi.
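As a minimal sketch of the definition above, the residuals can be computed by fitting the model y = Xβ + ε by least squares and taking e = y − ŷ. The simulated data and variable names below are purely illustrative, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])        # design matrix with intercept
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)     # simulated response

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares estimate of beta
y_hat = X @ beta_hat                              # fitted values
e = y - y_hat                                     # residuals: e_i = y_i - yhat_i

# With an intercept in the model, least squares residuals sum to (numerically) zero
print(e.sum())
```

The near-zero residual sum is a useful sanity check that the fit was computed correctly.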

Diagnostic in Design of Experiments

2.1 Diagnostics for Residuals
We shall consider the use of residuals for examining some important types of departures from the simple linear regression model with normal errors:

(1) The regression function is not linear
(2) The error terms do not have constant variance
(3) The error terms are not independent
(4) The model fits all but one or a few outlier observations
(5) The error terms are not normally distributed.

The following plots of residuals are generally utilized for this purpose:

(1) Plot of residuals against predictor variable
(2) Plot of absolute or squared residuals against predictor variable
(3) Plot of residuals against fitted values
(4) Plot of residuals against time or other sequence
(5) Plot of residuals against omitted predictor variables
(6) Normal probability plot of residuals.
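A hedged illustration of plots (1), (3), (4) and (6) from the list above, drawn with matplotlib on simulated data; all variable names and the output file name are assumptions for the sketch, not from the text.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                       # render off-screen; no display needed
import matplotlib.pyplot as plt
from statistics import NormalDist

rng = np.random.default_rng(3)
n = 30
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.6 * x + rng.normal(0, 1, n)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
e = y - fitted                              # residuals

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].scatter(x, e)                    # (1) residuals vs predictor
axes[0, 0].set_title("residuals vs predictor")
axes[0, 1].scatter(fitted, e)               # (3) residuals vs fitted values
axes[0, 1].set_title("residuals vs fitted")
axes[1, 0].plot(e, marker="o")              # (4) residuals vs sequence
axes[1, 0].set_title("residuals vs sequence")
# (6) normal probability plot: sorted residuals vs expected normal quantiles
k = np.arange(1, n + 1)
z = np.array([NormalDist().inv_cdf(p) for p in (k - 0.375) / (n + 0.25)])
axes[1, 1].scatter(z, np.sort(e))
axes[1, 1].set_title("normal probability plot")
fig.savefig("residual_plots.png")
```

For well-behaved data the first three panels show a structureless horizontal band around zero, and the normal probability plot is close to a straight line.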

2.1.1 Non-linearity of Regression Function
Whether a linear function is appropriate for the data being analyzed can be studied from a residual plot against the predictor variable or, equivalently, from a residual plot against the fitted values. If the function is linear, the residuals fall within a horizontal band centered around 0, displaying no systematic tendency to be positive or negative.

2.1.2 Non-constancy of Error Variance
Plots of the residuals against the predictor variable or against the fitted values are helpful not only for studying whether a linear regression function is appropriate, but also for examining whether the variance of the error terms is constant.

2.1.3 Non-normality of Error Terms
(i) Comparison of frequencies: When the number of cases is reasonably large, one approach is to compare the actual frequencies of the residuals against the expected frequencies under normality. For example, one can determine whether about 68 percent of the residuals ei fall between ±√MSE, or about 90 percent fall between ±1.645√MSE.

(ii) Normal probability plot: Still another possibility is to prepare a normal probability plot of the residuals. Here each residual is plotted against its expected value under normality. A plot that is nearly linear suggests agreement with normality, whereas a plot that departs substantially from linearity suggests that the error distribution is not normal. The expected value of the kth smallest residual in a random sample of n is

√MSE · z((k − 0.375) / (n + 0.25)),

where z(·) denotes the standard normal quantile function.
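The formula above can be evaluated directly; a small sketch using the standard library's NormalDist for z(·) follows. The function name and the example values of n and MSE are illustrative assumptions.

```python
import math
from statistics import NormalDist

def expected_order_stats(n, mse):
    """Approximate expected values of the n ordered residuals under normality:
    sqrt(MSE) * z((k - 0.375) / (n + 0.25)) for k = 1, ..., n."""
    z = NormalDist().inv_cdf          # standard normal quantile function z(.)
    return [math.sqrt(mse) * z((k - 0.375) / (n + 0.25)) for k in range(1, n + 1)]

vals = expected_order_stats(n=10, mse=4.0)
print(vals[0], vals[-1])
```

Because (k − 0.375)/(n + 0.25) and (n + 1 − k − 0.375)/(n + 0.25) sum to exactly 1, the expected values come out symmetric about zero, as they should for normal errors.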



2.1.4 Presence of Outliers
Frequently in regression applications the data set contains some cases that are outlying or extreme; that is, the observations for these cases are well separated from the remainder of the data. Such outlying cases may involve large residuals and often have dramatic effects on the fitted least squares regression function.

Tests for outliers
A simple test for identifying an outlier involves fitting a new regression line to the other n−1 observations. The suspect observation, which was not used in fitting the new line, can now be regarded as a new observation. One can calculate the probability that, in n observations, a deviation from the fitted line as great as that of the outlier would be obtained by chance. If this probability is sufficiently small, the outlier is rejected as not having come from the same population as the other n−1 observations; otherwise, it is retained. A number of other test statistics are available for detecting outliers in regression.

3. Diagnostics in Design of Experiments
In Section 2 we dealt with some diagnostics in regression analysis. These techniques can be applied to design of experiments. In this section, however, we discuss the problem of outliers in detail, since the test statistics for dealing with outliers in regression cannot be applied directly to design of experiments.

3.1 Detection of outliers
The fact that an observation is an outlier, that is, it produces a large residual when the chosen model is fitted to the data, does not necessarily mean that the observation is an influential one with respect to the fitted equation. When an outlier is omitted from the analysis, the fitted equation may change hardly at all. An example given by Andrews and Pregibon (1978), using data from Mickey, Dunn and Clark (1967), illustrates the point well.

Agricultural field experiments are always laid out using standard experimental designs.
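The leave-one-out test described under "Tests for outliers" in Section 2.1.4 can be sketched as follows: refit the regression without observation i, then measure how far yi falls from the refitted line in units of its prediction standard error (an externally studentized residual). The data, the planted outlier, and the function name are all illustrative assumptions.

```python
import numpy as np

def loo_outlier_stat(X, y, i):
    """t-like statistic for observation i, based on a fit that excludes it."""
    mask = np.ones(len(y), dtype=bool)
    mask[i] = False
    Xi, yi = X[mask], y[mask]
    beta, *_ = np.linalg.lstsq(Xi, yi, rcond=None)   # fit on the other n-1 points
    resid = yi - Xi @ beta
    dof = Xi.shape[0] - Xi.shape[1]
    s2 = resid @ resid / dof                         # MSE without point i
    xi = X[i]
    # prediction variance at x_i: s2 * (1 + x_i' (Xi'Xi)^{-1} x_i)
    var_pred = s2 * (1.0 + xi @ np.linalg.solve(Xi.T @ Xi, xi))
    return (y[i] - xi @ beta) / np.sqrt(var_pred)

rng = np.random.default_rng(1)
n = 15
x = np.linspace(0, 10, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.8 * x + rng.normal(0, 0.5, n)
y[7] += 6.0                                          # plant one outlier

t_stats = [abs(loo_outlier_stat(X, y, i)) for i in range(n)]
print(int(np.argmax(t_stats)))                       # index of the suspect point
```

A large absolute value of the statistic (relative to a t distribution with n − 1 − p degrees of freedom) signals that the suspect point did not come from the same population as the rest.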
There are a number of standard designs available in the literature, and a few of them are generally used under field conditions. The interest of scientists nowadays is in the development of optimal designs, but one of the most important aspects of experiments, namely the presence of outliers, is being ignored. The consequences of the presence of outliers are well known: even a single outlier may alter the inference to be drawn from an experiment. Agricultural field data certainly contain outliers, owing to various reasons, yet little attention is paid to this important problem.

Cook (1977) introduced a statistic to indicate the influence of an observation with respect to a particular model. This statistic is used extensively in linear regression diagnostics, and it is also very useful for assessing the degree of influence of an observation on a subset of parameters. Though the general set-up of an experimental design is that of a linear model, the Cook statistic cannot be applied as such for testing outliers in this field. In experimental designs, particularly in varietal designs, the design matrix X does not have full column rank; thus unique estimation of the parameters is not possible.



Moreover, in experimental designs the interest of the experimenter lies only in a subset of the parameters rather than in the whole set. For example, in block designs we are interested in the estimation of treatment contrasts only; other parameters, such as block effects and the general mean, are considered nuisance parameters. One may, therefore, be interested in seeing the effect of an outlying observation on the estimation of treatment contrasts. There is thus a need to develop this statistic appropriately for this field.

3.1.1 Development of test statistics
Cook statistic in the general model for experimental designs
Consider the general linear model

y = Xθ + e;  E(e) = 0;  D(e) = σ²In,  σ² > 0,

where y is an n x 1 vector of observations, X is an n x p full-rank matrix of known constants, θ is a p x 1 vector of unknown parameters, and e is an n x 1 vector of independent random variables, each with zero mean and variance σ² > 0. To determine the degree of influence the ith data point has on the estimate of θ, a natural first step is to compute the least squares estimate of θ with that point deleted. Accordingly, let θ̂(i) denote the least squares estimate of θ with the ith point deleted. An easily interpretable measure of the distance of θ̂(i) from θ̂ is given by Cook (1977) as

Di = (θ̂ − θ̂(i))′ [D(θ̂)]⁻¹ (θ̂ − θ̂(i)) / Rank[D(θ̂)],

where D(θ̂) denotes the estimated dispersion matrix of θ̂.

The statistic provides a measure of the distance between θ̂ and θ̂(i) in terms of descriptive levels of significance, because Di corresponds to a (1 − α) x 100% confidence ellipsoid for θ under normal theory, determined by Di ≤ F(p, n − p, 1 − α). Extension of Di to more than one outlier is straightforward. For the usual interpretation of the Cook statistic, see Cook (1977, 1979). Now consider the general linear model for an experimental design d (say). The model is the same, except that the rank of X is now m (m < p).
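In the full-rank case, Di as defined above can be computed directly from its definition: refit with point i deleted and measure the shift in θ̂, weighted by [D(θ̂)]⁻¹ = (X′X)/s² and divided by the rank p. A minimal sketch on simulated data follows; the data, the perturbed observation, and the function name are illustrative assumptions.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook statistic D_i for each observation, full-rank X assumed."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # full-data estimate theta_hat
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)                  # estimate of sigma^2
    D = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        beta_i, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)  # theta_hat(i)
        diff = beta - beta_i
        # D_i = diff' [D(theta_hat)]^{-1} diff / Rank[D(theta_hat)],
        # with D(theta_hat) = s2 (X'X)^{-1}, which has rank p here
        D[i] = diff @ (X.T @ X) @ diff / (p * s2)
    return D

rng = np.random.default_rng(2)
n = 12
x = np.linspace(1, 12, n)
X = np.column_stack([np.ones(n), x])
y = 3.0 + 1.5 * x + rng.normal(0, 0.4, n)
y[-1] += 5.0                                      # perturb the last (high-leverage) point
D = cooks_distance(X, y)
print(int(np.argmax(D)))                          # most influential observation
```

The perturbed point combines a large residual with high leverage, so it dominates the Di values; in experimental designs, where X is not of full rank, this direct computation breaks down, which is precisely the problem the text goes on to address.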