SAS SYSTEM FOR LINEAR MODELS
Rajender Parsad and R. Srivastava
I.A.S.R.I., Library Avenue, New Delhi-110 012

1. Concepts and Analysis of Linear Models
A model is an equation or a set of equations which represents the behaviour of a system. It can also be defined as a formalized expression of the theory or causal situation that is regarded as having generated the observed data. Modelling, therefore, refers to the development of mathematical expressions that describe in some sense the behaviour of a random variable of interest. This random variable is generally called the dependent or response variable. The variables (real or dummy) that are thought to provide information on the behaviour of the dependent variable are incorporated into the model as predictor, explanatory or independent variables. The model also involves some unknown constants, called parameters, which control its behaviour. These parameters are generally estimated, and hypotheses regarding them tested, from the data using the analysis of the model. The mathematical complexity of the model, and the degree to which it is realistic, depend on how much is known about the process being studied and on the purpose of the modelling exercise. In general, there are two types of models, viz. linear models (linear in the parameters) and non-linear models (non-linear in the parameters). We shall restrict ourselves to linear models only. Most of the models used in the design of experiments are linear.

In studying the variability that is evident in data, we are interested in attributing that variability to the various categorizations of the data. The classifications that identify the source of each datum are called factors. The individual classes of a classification are the levels of the factor, and the subset of data occurring at the intersection of one level of every factor being considered is said to be in a cell of the data.
In classifying a model in terms of factors and their levels, the feature of interest is the extent to which different levels of a factor affect the variable of interest; this is called the effect of a level of a factor on that variable.

Types of Models on the basis of Nature of Effects
As explained earlier, a statistical linear model is actually a linear relation of the effects of the different levels of a number of factors involved in an experiment, along with one or more terms representing error effects. The effects of any factor can be either fixed or random. Fixed effects are the effects attributable to a finite set of levels of a factor that occur in the data. For example, the effects of four well-defined varieties of a crop, say tomato, are fixed, as the experimenter is interested in only these four specific varieties and has no thought of other varieties. Random effects are attributable to a usually infinite set of levels of a factor, of which only a random sample is deemed to occur in the data. Example: if a crop, say tomato, is taken as a factor with varieties as its levels and tried at


a number of gardens selected at random from a large number of gardens, then the effect of gardens is a random effect, since the gardens have been chosen randomly with the object of using them as a representation of the population of all home gardens in a particular area. Models involving only fixed effects (except the error term, which is random) are called fixed effects models, or sometimes just fixed models, and those having only random effects apart from a single general mean (µ) common to all observations are called random effects models or, more simply, random models. Models that contain both fixed effects (besides the mean) and random effects (besides the error term) are called mixed models. In application to real-life situations, mixed models have broader use than random models, because it is so often appropriate to have both fixed and random effects in the same model. Indeed, every model that contains µ is in a sense a mixed model, because it also contains the residual error term and so automatically has a mixture of fixed and random effects. In practice, however, the name mixed model is usually reserved for a model having both fixed effects other than µ and random effects besides the customary residuals. Here, we shall restrict ourselves to linear, fixed effects models only. In fixed effects models, the main objectives are to estimate the effects, find a measure of variability among the effects of each of the factors, and finally find the variability among the error effects. The random errors are generally assumed to be normally and independently distributed with zero mean and a constant variance σ2. A further assumption made in the model is that the effects are additive in nature.

Types of Models on the basis of Number of Classificatory Factors
Models can also be classified on the basis of the nature of the data, that is, the number of controllable factors involved in the data classification.
For example, if the data are from the different levels of a single factor, we call the data one-way classified data and the model a one-way classified model. If the data are from different levels of two factors, we call the data two-way classified data and the model a two-way classified model, and so on. In general, if the data belong to the level combinations of m different factors, we call them m-way classified data and the model an m-way classified model. In the context of experimental designs, the model for the analysis of a completely randomized design (CRD) is a one-way classified model; for block designs it is a two-way classified model; for row-column designs it is a three-way classified model; and so on.

Types of Models on the basis of Nature of Data
The levels of the factors may be crossed with each other or nested, one within the other. Therefore, there may be two types of classifications, viz. crossed and nested.

Nested and Crossed Classifications
If every level of a factor can be used in combination with every level of every other factor, the classification is crossed. In this classification the factors "cross" with each other, and their "intersections" are the subclasses or cells of the situations wherein


data arise. Absence of data from a cell does not imply non-existence of that cell, only that it has no data. The total number of cells in a cross classification is the product of the numbers of levels of the various factors, and not all of these may have observations in them. For example, suppose we have data from 6 plants, representing 3 varieties tested in combination with two fertilizer treatments. The entries in the table are such that yijk represents the yield of plant ‘k’ of variety ‘i’ that received treatment ‘j’.

TABLE 1
                         Variety
Treatment        1             2           3
    1        y111, y112      y211        y311
    2        y121            y221         -

We shall now write this model using 5 dummy (0,1) variables and 5 regression coefficients: α1, α2 and α3 corresponding to the effects of the 3 varieties, and β1 and β2 to the effects of the two treatments. Further, the general mean or intercept term is denoted by µ. Then, the model can be represented as

yijk = µ + α1xijk,1 + α2xijk,2 + α3xijk,3 + β1x*ijk,1 + β2x*ijk,2 + eijk        (1)

where the xijk’s and x*ijk’s are dummy (0,1) variables. Thus, the observation corresponding to variety 2 and treatment 1, y211, can be represented in the model with xijk,2 = x*ijk,1 = 1 and the rest zero, i.e., y211 = µ + α2 + β1 + e211. In general, this model is represented as

yijk = µ + αi + βj + eijk ;  i=1(1)a, j=1(1)b, k=1(1)nij        (2)

where a(b) is the number of levels of the first (second) factor, say variety (treatment), nij is the number of observations in the cell at the intersection of level ‘i’ of factor 1 and level ‘j’ of factor 2, and the eijk are residuals normally distributed with mean 0 and variance σ2. In the above there are two factors; therefore, it is a two-way classified model, and αi of equation (2) refers to the effect of level ‘i’ of the factor variety, while βj is the effect on the yield of treatment ‘j’. Effects of this nature, which pertain to a single level of a factor, are called main effects. Sometimes, however, the effect of one factor (variety) is not the same for all levels of the other factor (treatment), and conversely the effect of treatment is not the same for all varieties. There are then some additional effects accounting for the way in which treatments and varieties interact. These effects are called interaction effects and represent the manner in which each level of one main effect (variety) interacts with each level of the other main effect (treatment). Thus, the interaction effect between level ‘i’ of the α-effect and level ‘j’ of the β-effect is introduced in model (2) as


yijk = µ + αi + βj + (αβ)ij + eijk ;  i=1(1)a, j=1(1)b, k=1(1)nij        (3)

(αβ)ij measures the failure of the effect of one factor to be the same at all levels of the other factor. Note that whenever nij = 1 for all cells, the model with interaction (3) becomes indistinguishable from the model without interaction (2): the (αβ)ij and eijk terms of (3) get combined as εij = (αβ)ij + eij1, say, and (3) becomes yij = µ + αi + βj + εij (4). This means that for nij = 1 we can study only the case of no interaction. However, there do occur experimental situations where it is not possible to use each level of every factor in combination with every level of every other factor. For example, suppose a student survey is carried out at a university to ascertain the reaction to instructors' usage of a new computing facility that provides typewriter terminals in the classroom. Suppose that all freshmen have to take English, Geology or Chemistry in the first semester; all three courses are large and are divided into sections, each section with a different instructor, and not all sections necessarily have the same number of students. In the survey, the response provided by each student is his/her opinion (measured on a scale of 1 through 10) of the instructor's use of the computer. Based on these data, the questions of interest are: do the instructors differ in their use of the computer, and is its use affected by the subject matter being taught? A possible model for this situation would include a general mean µ, main effects αi, i=1(1)3, for the three courses, and βj, j=1(1)2, for the two sections of each course (assuming there are two sections in each course). Let there be nij students in section ‘j’ of course ‘i’, and let yijk be the opinion of student ‘k’ in section ‘j’ of course ‘i’.
Then, the model for the cross classification, yijk = µ + αi + βj + eijk; i=1(1)3, j=1(1)2, k=1(1)nij, will not be of help, because βj for j=1 represents the effect of section 1 of the English course, of the Geology course and of the Chemistry course. This is meaningless, as the three sections, being composed of different groups of students, have nothing in common other than that they are all numbered 1 in their respective courses. Here, sections are not crossed with courses but lie within courses, i.e., sections are nested within courses. A nested factor is one whose levels are nested within the levels of another, i.e., a factor with a different set of levels for each level of another factor. Thus, section is a nested factor, and this is a nested or hierarchical classification; a suitable model for it is

yijk = µ + αi + βi(j) + eijk        (5)

where βi(j) is the effect of section ‘j’ nested within course ‘i’. This is a two-way nested classification model: sections within courses. The βi(j) should not be confused with (αβ)ij, the interaction effect in the crossed classification model. The interaction effect is peculiar to the combination of level ‘i’ of the α-factor and level ‘j’ of


the β-factor. Interactions between a factor and one nested within it cannot, therefore, exist. Nested and crossed classifications are by no means mutually exclusive; both can occur in the same model. It is easy to see that all these kinds of models can be written as

Y = Xθ + e        (6)

where X is known as the design matrix, Y is the vector of observations, θ the vector of parameters, and e a vector of random errors, usually assumed to be distributed as N(0, σ2I). Therefore, the analyses of all the models, crossed or nested or a combination of both, can be derived from the analysis of the general linear model (6) as particular cases. Before discussing the analysis of a general linear model, it will not be out of place to define different types of data.

Types of Data
On the basis of equality or inequality of cell frequencies, the data can be of two types, viz. balanced and unbalanced. If all cells of the data have an equal number of observations, the data are said to be balanced; if the numbers of observations in the cells are unequal, the data are said to be unbalanced. Unbalanced data may be all-cells-filled data or some-cells-empty data. On the basis of independence or interdependence of the estimates of the various effects, the data can be classified as orthogonal or non-orthogonal. An orthogonal data set is one which leads to estimates of the various effects that are independent (uncorrelated) of each other. In such data, the levels of one factor occur with the levels of the other factor with proportional frequencies. For example, let there be two factors A and B with r and s levels respectively. Suppose ni. denotes the number of times the ith level of A occurs in the data, n.j the number of times the jth level of B occurs, nij the number of times the ith level of A occurs with the jth level of B, and n the total number of data points. Then the proportional frequency condition can be stated as
nij = ni. n.j / n ;  i = 1(1)r, j = 1(1)s.
It is clear from this that all balanced data sets are orthogonal, and some all-cells-filled unbalanced data can be orthogonal as well. The data are said to be non-orthogonal if they lead to estimates of various effects that are not independent of each other.
Clearly, all some-cells-empty unbalanced data are non-orthogonal. All these types of data sets can occur in the experimental design context as well; for example, the data from a randomized complete block (RCB) design are balanced and orthogonal, whereas those from a balanced incomplete block (BIB) design are unbalanced and non-orthogonal. The data obtained from a block design in which each of the blocks contains some of the treatments once and some of the treatments twice are unbalanced and orthogonal.

2. Procedures and Statements
PROC ANOVA performs analysis of variance for balanced data only, from a wide variety


of experimental designs, whereas PROC GLM can analyze both balanced and unbalanced data. As ANOVA takes into account the special features of a balanced design, it is faster and uses less storage than PROC GLM for balanced data. The basic syntax of the ANOVA procedure is:

PROC ANOVA <options>;
   CLASS variables;
   MODEL dependents = independent variables (or effects) / options;
   MEANS effects / options;
   ABSORB variables;
   FREQ variables;
   TEST H = effects E = effect;
   MANOVA H = effects E = effect M = equations / options;
   REPEATED factor-name levels / options;
   BY variables;

The PROC ANOVA, CLASS and MODEL statements are required; the other statements are optional. The CLASS statement defines the variables for classification (numeric or character; character variables may have a maximum of 16 characters). The MODEL statement names the dependent variables and independent variables or effects. If no effects are specified in the MODEL statement, ANOVA fits only the intercept. Included in the ANOVA output are F-tests of all effects in the MODEL statement; all of these F-tests use the residual mean square as the error term. The MEANS statement produces tables of the means corresponding to the list of effects. Among the options available in the MEANS statement are several multiple comparison procedures, viz. the least significant difference (LSD), Duncan's new multiple range test (DUNCAN), the Waller-Duncan test (WALLER) and Tukey's honest significant difference (TUKEY). The LSD, DUNCAN and TUKEY options take the level of significance ALPHA = 5% unless the ALPHA = option is specified; only ALPHA = 1%, 5% and 10% are allowed with Duncan's test. Confidence intervals (95%) about means can be obtained using the CLM option under the MEANS statement. The TEST statement is used to test effects for which the residual mean square is not the appropriate error term, such as main-plot effects in a split-plot experiment.
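As an illustration, the statements below sketch the analysis of a randomized complete block design and compare treatment means by the LSD procedure at the 1% level. The data set and variable names (rcbd, REP, TRT, YIELD) and the data lines are hypothetical.

```sas
/* Hypothetical RCB design: REP = block, TRT = treatment, YIELD = response */
data rcbd;
   input REP TRT YIELD;
   cards;
1 1 12.3
1 2 14.1
1 3 13.0
2 1 11.8
2 2 13.9
2 3 12.6
;
proc anova data=rcbd;
   class rep trt;
   model yield = rep trt;
   means trt / lsd alpha=0.01 clm;   /* LSD test at 1%, with confidence limits */
run;
```

The MEANS statement here requests the multiple comparison and confidence limits discussed above; replacing LSD by DUNCAN or TUKEY requests the corresponding test.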
There can be multiple MEANS and TEST statements (as in PROC GLM), but only one MODEL statement, followed by the RUN statement. The ABSORB statement implements the technique of absorption, which saves time and reduces storage requirements for certain types of models. The FREQ statement is used when each observation in a data set represents n observations, where n is the value of the FREQ variable. The MANOVA statement is used for implementing multivariate analysis of variance. The REPEATED statement is useful for analyzing repeated measurement designs, and the BY statement specifies that separate analyses be performed on observations in groups defined by the BY variables. Using PROC GLM for analysis of variance is similar to using PROC ANOVA, and the statements listed for PROC ANOVA are also used for PROC GLM. In addition, the following


statements can be used with PROC GLM:
CONTRAST 'label' effect-name <effect coefficients>;
ESTIMATE 'label' effect-name <effect coefficients>;
ID variables;
LSMEANS effects < / options>;
OUTPUT <OUT = SAS-data-set> keyword = names <... keyword = names>;
RANDOM effects < / options>;
WEIGHT variables;
Multiple comparisons as used in the options under the MEANS statement are useful when there are no particular comparisons of special interest, but there do occur situations where preplanned comparisons are required. Using the CONTRAST and LSMEANS statements, we can test specific hypotheses regarding preplanned comparisons. The basic form of the CONTRAST statement is as described above, where label is a character string used for labelling the output, effect-name is a class (independent) variable, and the effect coefficients are a list of numbers that specifies the linear combination of parameters in the null hypothesis. A contrast is a linear function whose coefficients sum to zero for each effect. While using CONTRAST statements, the number of levels (classes) of the effect should be kept in mind: if there are more levels of the effect in the data than the number of coefficients specified in the CONTRAST statement, PROC GLM adds trailing zeros. Suppose there are 5 treatments in a completely randomized design, denoted T1, T2, T3, T4, T5, and the null hypothesis to be tested is Ho: T2 + T3 = 2T1, i.e., -2T1 + T2 + T3 = 0. If in the data the treatments are classified using TRT as the class variable, then the effect name is TRT and the statement is
CONTRAST 'T1 VS 2&3' TRT -2 1 1 0 0;
If the last two zeros are not given, the trailing zeros are added automatically. This statement gives a sum of squares with 1 degree of freedom (d.f.) and an F-value against the residual mean square as error, unless otherwise specified. The name or label of the contrast must be 20 characters or less.
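Putting the pieces together, a complete program for the CRD above might be sketched as follows; the data set name (crd) and the data lines are hypothetical.

```sas
/* Hypothetical CRD: 5 treatments T1-T5, two replicates each */
data crd;
   input TRT $ YIELD;
   cards;
T1 20.1
T1 19.4
T2 23.4
T2 22.9
T3 22.8
T3 23.1
T4 19.7
T4 20.3
T5 21.5
T5 21.0
;
proc glm data=crd;
   class trt;
   model yield = trt;
   /* Ho: -2T1 + T2 + T3 = 0; trailing zeros may be omitted */
   contrast 'T1 VS 2&3' trt -2 1 1 0 0;
run;
```

The CONTRAST statement here produces the 1 d.f. sum of squares and F-test described in the text, tested against the residual mean square.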
The available CONTRAST statement options are:
E: prints the entire vector of coefficients in the linear function, i.e., the contrast.
E = effect: specifies an effect in the model that can be used as an error term.
ETYPE = n: specifies the type (1, 2, 3 or 4) of the E effect.
Multiple-degrees-of-freedom contrasts can be specified by repeating the effect name and coefficients as needed, separated by commas. Thus, for the above example, the statement
CONTRAST 'All' TRT -2 1 1 0 0, TRT 0 1 -1 0 0;
produces a 2 d.f. sum of squares due to both contrasts. This feature


can be used to obtain partial sums of squares for effects through the reduction principle, using sums of squares from multiple-degrees-of-freedom contrasts that include and exclude the desired contrasts. Although only t-1 linearly independent contrasts exist for t classes, any number of contrasts can be specified. The ESTIMATE statement can be used to estimate linear functions of parameters that may or may not be obtainable using the CONTRAST or LSMEANS statement. For its specification, only the word CONTRAST is to be replaced by ESTIMATE in the CONTRAST statement. Fractions in effect coefficients can be avoided by using DIVISOR = common-denominator as an option. This statement provides the value of the estimate, a standard error and a t-statistic for testing whether the estimate is significantly different from zero. The LSMEANS statement produces the least squares estimates of CLASS variable means, i.e., adjusted means; for a one-way structure, these are simply the ordinary means. The least squares means for the five treatments for all dependent variables in the MODEL statement can be obtained using the statement
LSMEANS TRT / options;
The various options available with this statement are:
STDERR: gives the standard error of each estimated least squares mean and the t-statistic for a test of the hypothesis that the mean is zero.
PDIFF: prints the p-values for the tests of equality of all pairs of CLASS means.
SINGULAR: tunes the estimability checking.
The options E, E= and ETYPE= are similar to those discussed under the CONTRAST statement. When predicted values are requested as a MODEL statement option, values of the variables specified in the ID statement are printed alongside each observed, predicted and residual value, for identification. The OUTPUT statement produces an output data set that contains the original data set values along with the predicted and residual values. Among the other options available under the MODEL statement in PROC GLM are:
1. SOLUTION
2. XPX (= X'X)
3. I (g-inverse)

Some Special Features
PROC GLM recognizes different theoretical approaches to ANOVA by providing four types of sums of squares and associated statistics, called Type I, Type II, Type III and Type IV. The Type I sums of squares are the classical sequential sums of squares obtained by


adding the terms to the model in some logical sequence. The sum of squares for each class of effects is adjusted only for those effects that precede it in the model; thus, the sums of squares and their expectations depend on the order in which the model is specified. The Type II, III and IV sums of squares are 'partial sums of squares' in the sense that each is adjusted for all other classes of effects in the model, but each is adjusted according to different rules. One general rule applies to all three types: the estimable functions that generate the sums of squares for one class of effects will not involve any other classes of effects except those that "contain" the class of effects in question. For example, the estimable functions that generate SS(AB) in a three-factor factorial will have zero coefficients on the main effects and on the (A x C) and (B x C) interaction effects; they will contain non-zero coefficients on the (A x B x C) interaction effects, because the A x B x C interaction "contains" the A x B interaction. The Type II, III and IV sums of squares differ from each other in how the coefficients are determined for the classes of effects that do not have zero coefficients, i.e., those that contain the class of effects in question. The estimable functions for the Type II sums of squares impose no restrictions on the values of the non-zero coefficients on the remaining effects; they are allowed to take whatever values result from the computations adjusting for the effects that are required to have zero coefficients. Thus, the coefficients on the higher-order interaction effects and higher-level nesting effects are functions of the number of observations in the data. In general, the Type II sums of squares do not possess the equitable distribution property and orthogonality characteristic of balanced data.
The Type III and IV sums of squares differ from the Type II sums of squares in that the coefficients on the higher-order interaction or nested effects that contain the effects in question are also adjusted so as to satisfy either the orthogonality condition (Type III) or the equitable distribution property (Type IV). The coefficients on these effects are no longer functions of the nij and, consequently, are the same for all designs with the same general form of estimable functions. If there are no empty cells (no nij = 0), both conditions can be satisfied at the same time; Type III and Type IV sums of squares are then equal, and the hypotheses being tested are the same as when the data are balanced. When there are empty cells, the hypotheses being tested by the Type III and Type IV sums of squares may differ. The Type III criterion of orthogonality reproduces the same hypotheses one obtains if effects are assumed to sum to zero; when there are empty cells, this is modified to "the effects that are present are assumed to sum to zero". The Type IV hypotheses utilize balanced subsets of non-empty cells and may not be unique. For a 2x3 factorial, adding the terms to the model in the order A, B, AB, the various types of sums of squares can be explained as follows:

Effect         Type I           Type II          Type III
General Mean   R(µ)             R(µ)
A              R(A/µ)           R(A/µ,B)         R(A/µ,B,AB)
B              R(B/µ,A)         R(B/µ,A)         R(B/µ,A,AB)
A*B            R(A*B/µ,A,B)     R(A*B/µ,A,B)     R(A*B/µ,A,B)

Here R(A/µ) is the sum of squares due to A adjusted for µ, and so on; with no empty cells, the Type IV sums of squares coincide with Type III.

Thus, in brief, the four sets of sums of squares Type I, II, III and IV can be thought of respectively as sequential, each-after-all-others, Σ-restrictions and hypotheses. There is a relationship between the four types of sums of squares and four types of data structures: balanced and orthogonal, unbalanced and orthogonal, unbalanced and non-orthogonal (all cells filled), and unbalanced and non-orthogonal (empty cells). For illustration, let nIJ denote the number of observations in level I of factor A and level J of factor B. The following table gives the relationship between data structure and types of sums of squares in two-way classified data (model fitted in the order A, B, A*B):

                                Data Structure Type
Effect   1. Equal nIJ    2. Proportionate nIJ   3. Disproportionate,    4. Empty cell
                                                   non-zero nIJ
A        I=II=III=IV     I=II, III=IV           III=IV
B        I=II=III=IV     I=II, III=IV           I=II, III=IV            I=II
A*B      I=II=III=IV     I=II=III=IV            I=II=III=IV             I=II=III=IV

In general: I=II=III=IV for balanced data; II=III=IV for no-interaction models; I=II, III=IV for orthogonal data; III=IV for all-cells-filled data.
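The four types can be requested explicitly as MODEL statement options in PROC GLM and compared against the table above. A minimal sketch, in which the data set name (twoway) and variables a, b and y are hypothetical:

```sas
/* Request all four types of sums of squares for a two-way model */
proc glm data=twoway;
   class a b;
   model y = a b a*b / ss1 ss2 ss3 ss4;
run;
```

For balanced data all four printed columns coincide; differences appear as the cell frequencies become unbalanced or some cells are empty.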

Proper Error Terms: In general, F-tests of hypotheses in ANOVA use the residual mean square as the error term; there are situations, however, where other terms are to be used as error terms. For such situations PROC GLM provides the TEST statement, which is identical to the TEST statement available in PROC ANOVA. PROC GLM also allows specification of appropriate error terms in the MEANS, LSMEANS and CONTRAST statements. To illustrate, let us use a split-plot experiment involving the yield of different irrigation treatments (IRRIG) applied to main plots and cultivars (CULT) applied to subplots. The data so obtained can be analysed using the following statements:

data splitplot;
   input REP IRRIG CULT YIELD;
   cards;


... ... ...
;
proc print;
proc glm;
   class rep irrig cult;
   model yield = rep irrig rep*irrig cult irrig*cult;
   test h = irrig e = rep*irrig;
   contrast 'IRRIG1 VS IRRIG2' irrig 1 -1 / e = rep*irrig;
run;
Here the irrigation effects are tested against error (A), the sum of squares due to rep*irrig, as specified in the TEST and CONTRAST statements respectively. In the TEST statement, H = gives the numerator source of variation and E = the denominator (error) source of variation. It may be noted that PROC GLM can be used to perform analysis of covariance as well: the covariate should be included in the MODEL statement without being specified in the CLASS statement.
PROC RSREG fits the parameters of a complete quadratic response surface, analyses the fitted surface to determine the factor levels of optimum response, and performs a ridge analysis to search for the region of optimum response. Its basic syntax is:
PROC RSREG <options>;
   MODEL responses = independents / <options>;
   RIDGE <options>;
   WEIGHT variable;
   ID variable;
   BY variable;
run;
The PROC RSREG and MODEL statements are required. The BY, ID, MODEL, RIDGE and WEIGHT statements can appear in any order. The PROC RSREG statement invokes the procedure, and the following options are allowed with it:
DATA = SAS-data-set: specifies the data to be analysed.
NOPRINT: suppresses all printed results when only the output data set is required.
OUT = SAS-data-set: creates an output data set.
The MODEL statement without any options transforms the independent variables to coded values. By default, PROC RSREG codes each independent variable by subtracting the average of its highest and lowest values from the original value and dividing by half of their difference.
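As an illustration of these statements, a second-order design in two factors could be analysed, with a lack-of-fit test and a ridge search for the maximum, roughly as follows; the data set name (rsdata) and variables X1, X2 and Y are hypothetical.

```sas
/* Hypothetical response surface data: factors X1, X2; response Y */
proc rsreg data=rsdata;
   model y = x1 x2 / lackfit;   /* fit the full quadratic in coded variables */
   ridge max;                   /* compute the ridge of maximum response */
run;
```

The LACKFIT option requires the repeated observations to appear together in the data, as noted below.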


Canonical and ridge analyses are performed on the model fitted to the coded data. The important options available with the MODEL statement are:
NOCODE: analyses the original (uncoded) data.
ACTUAL: prints the actual values from the input data set.
COVAR = n: declares that the first n variables on the independent side of the model are simple linear regressors (covariates) rather than factors in the quadratic response surface.
LACKFIT: performs a lack-of-fit test; for this, the repeated observations must appear together in the data.
NOANOVA: suppresses the printing of the analysis of variance and parameter estimates from the model fit.
NOOPTIMAL (NOOPT): suppresses the printing of the canonical analysis for the quadratic response surface.
NOPRINT: suppresses both the ANOVA and the canonical analysis.
PREDICT: prints the values predicted by the model.
RESIDUAL: prints the residuals.
The RIDGE statement computes the ridge of optimum response. The important options available with the RIDGE statement are:
MAX: computes the ridge of maximum response.
MIN: computes the ridge of minimum response. At least one of these two options must be specified.
NOPRINT: suppresses printing of the ridge analysis when only an output data set is required.
OUTR = SAS-data-set: creates an output data set containing the computed optimum ridge.
RADIUS = coded-radii: gives the distances from the ridge starting point at which to compute the optimum.
PROC REG is the primary SAS procedure for performing the computations for a statistical analysis of data based on a linear regression model. The basic statements for performing such an analysis are

PROC REG;
   MODEL dependent variables = independent variables / model options;
RUN;
The PROC REG procedure and MODEL statement without any options give the ANOVA, root mean square error, R-square, adjusted R-square, coefficient of variation, etc. The options under the MODEL statement include:
P: gives predicted values corresponding to each observation in the data set; the estimated standard errors are also given by this option.


CLM: yields upper and lower 95% confidence limits for the mean of the subpopulation corresponding to specific values of the independent variables.
CLI: yields a prediction interval for a single unit to be drawn at random from a subpopulation.
STB: prints standardized regression coefficients.
XPX, I: prints matrices used in regression computations.
NOINT: forces the regression response to pass through the origin. With this option the total sum of squares is uncorrected, and hence the R-square statistic is much larger than for models with an intercept. If, however, a no-intercept model is to be fitted with the corrected total sum of squares, so that the usual definitions of statistics such as R2 and MSE are retained, then the statement RESTRICT intercept = 0; may be used after the MODEL statement.
For obtaining residuals, studentized residuals and Cook's D statistic, the option R may be used under the MODEL statement. The INFLUENCE option under the MODEL statement is used for detection of outliers in the data and provides residuals, studentized residuals, diagonal elements of the hat matrix, COVRATIO, DFFITS, DFBETAS, etc. For detecting multicollinearity in the data, the options VIF (variance inflation factors) and COLLIN or COLLINOINT may be used. Besides these, options for weighted regression, output data sets, specification error, heterogeneous variances, etc. are available under PROC REG. (PROC PRINCOMP can be utilized to perform principal component analysis.)
Multiple MODEL statements are permitted in PROC REG, unlike PROC ANOVA and PROC GLM. A MODEL statement can contain several dependent variables: the statement
model y1 y2 y3 y4 = x1 x2 x3 x4 x5;
performs four separate regression analyses of the variables y1, y2, y3 and y4 on the set of variables x1, x2, x3, x4, x5. Polynomial models can be fitted by generating independent variables in the model such as x1=x, x2=x**2, x3=x**3, and so on, depending upon the order of the polynomial to be fitted.
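Several of the MODEL statement options described above can be combined in one run; a sketch, with hypothetical data set and variable names (mydata, y, x1-x3):

```sas
/* Predictions, intervals and diagnostics in a single MODEL statement */
proc reg data=mydata;
   model y = x1 x2 x3 / p clm cli r influence vif;
run;
```

Here P, CLM and CLI give predictions and intervals, R and INFLUENCE give residual and outlier diagnostics, and VIF gives the variance inflation factors for multicollinearity checking.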
From one variable, several other variables can be generated in the DATA step before the MODEL statement, and the transformed variables can then be used in the MODEL statement. For example, LY = LOG(Y) and LX = LOG(X) give logarithms of Y and X to the base e, while LOGY = LOG10(Y) and LOGX = LOG10(X) give logarithms of Y and X to the base 10. A TEST statement after the MODEL statement can be utilized to test hypotheses on individual parameters or on any linear function(s) of the parameters. For example, to test the equality of the coefficients of x1 and x2 in the regression model y = β0 + β1x1 + β2x2, the statement is


test1: test x1 - x2 = 0;
The general syntax is Label: TEST equation <, ..., equation>; The fitted model can be changed by using a separate MODEL statement or by using DELETE variables; or ADD variables; statements. PROC REG provides two types of sums of squares, obtained with the SS1 or SS2 options on the MODEL statement. Type I SS are sequential sums of squares; Type II SS are partial sums of squares, and the two coincide for the variable fitted last in the model. For most applications, the desired test for a single parameter is based on the Type II sum of squares, which is equivalent to the t-test for the parameter estimate. The Type I sums of squares, however, are useful when a specific sequencing of tests on individual coefficients is needed, as in polynomial models.
PROC ANOVA and PROC GLM are general purpose procedures that can be used for a broad range of data classifications. In contrast, PROC NESTED is a specialized procedure that is useful only for nested classifications. It provides estimates of the components of variance using the analysis of variance method of estimation. The CLASS statement in PROC NESTED has a broader purpose than it does in PROC ANOVA and PROC GLM; it encompasses the purpose of the MODEL statement as well. The data, however, must be sorted appropriately. For example, suppose microbial counts are made in a laboratory study whose objective is to assess the sources of variation in the number of microbes. For this study n1 packages of the test material are purchased and n2 samples are drawn from each package, i.e. samples are nested within packages. Suppose a logarithmic transformation is to be used for the microbial counts. The appropriate SAS statements are:
proc sort; by package sample;
proc nested;
  class package sample;
  var logcount;
run;
The corresponding PROC GLM statements are:
proc glm;
  class package sample;
  model logcount = package sample(package);
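Since packages and samples are random effects, these steps can be sketched end-to-end; a minimal sketch, assuming a raw data set named counts with variables package, sample and count (names not given in the text):

```sas
/* 'counts' is an assumed data set with variables package, sample, count */
data microbes;
  set counts;
  logcount = log(count);   /* natural-log transform of the counts */
run;

proc sort data=microbes;
  by package sample;       /* PROC NESTED requires appropriately sorted data */
run;

proc nested data=microbes;
  class package sample;    /* sample is nested within package */
  var logcount;
run;

/* equivalent variance-component model in PROC GLM */
proc glm data=microbes;
  class package sample;
  model logcount = package sample(package);
  random package sample(package) / test;  /* tests against proper error terms */
run;
```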

The F-statistics in the basic PROC GLM output are not necessarily correct. To obtain correct tests, a RANDOM statement listing all random effects in the model is used, and its TEST option requests tests against the correct error terms. For fixed effects models, the usual arguments for choosing proper error terms hold in PROC GLM and PROC ANOVA. Exercise 1: An experiment was conducted using a randomized complete block design in


5 treatments a, b, c, d and e with three replications. The data (yield) obtained are given below:

                        Treatment (TRT)
Replication (REP)     a      b      c      d      e
        1           16.9   18.2   17.0   15.1   18.3
        2           16.5   19.2   18.1   16.0   18.3
        3           17.5   17.1   17.3   17.8   19.8

1. Perform the analysis of variance of the data.
2. Test the equality of treatment means.
3. Test H0: 2T1 = T2 + T3, where T1, T2, T3, T4 and T5 are the treatment effects.
Procedure: Prepare a SAS data file using
data name;
input rep trt $ yield;
cards;
... ... ...
;
Print the data using PROC PRINT. Perform the analysis using PROC ANOVA; obtain treatment means and pairwise comparisons using the least significant difference and Duncan's new multiple range tests. Make use of the following statements:
proc print;
proc anova;
  class rep trt;
  model yield = rep trt;
  means trt / lsd;
  means trt / duncan;
run;
Perform the contrast analysis using PROC GLM:
proc glm;
  class rep trt;
  model yield = rep trt;
  contrast '1 Vs 2&3' trt 2 -1 -1;
run;
Exercise 2: An experiment was conducted with 49 crop varieties (TRT) using a simple lattice design. The layout and data obtained (YLD) are given below; entries are treatment numbers with yields in parentheses.


REPLICATION (REP)-I, BLOCKS (BLK):
BLK 1: 22(7)  24(20) 28(25) 27(68) 25(4)  26(11) 23(45)
BLK 2: 10(12) 14(26) 8(42)  9(13)  13(10) 12(21) 11(11)
BLK 3: 45(22) 44(21) 43(16) 47(37) 49(13) 48(21) 46(12)
BLK 4: 37(25) 41(23) 40(11) 42(24) 36(30) 39(34) 38(15)
BLK 5: 18(33) 19(17) 21(13) 17(10) 15(36) 20(30) 16(14)
BLK 6: 30(33) 34(31) 35(10) 32(12) 29(22) 33(33) 31(18)
BLK 7: 5(28)  6(74)  7(14)  2(14)  1(16)  3(11)  4(7)

REPLICATION (REP)-II, BLOCKS (BLK):
BLK 1: 22(29) 8(127) 43(119) 1(24)   36(58) 29(97) 15(47)
BLK 2: 18(64) 25(31) 46(85)  11(51)  4(39)  39(67) 32(93)
BLK 3: 20(25) 27(71) 13(51)  48(121) 41(22) 6(75)  34(44)
BLK 4: 23(45) 16(22) 2(13)   37(85)  9(10)  30(65) 44(5)
BLK 5: 5(19)  19(47) 47(86)  40(33)  12(48) 33(73) 26(56)
BLK 6: 3(13)  24(23) 17(51)  10(30)  31(50) 38(30) 45(103)
BLK 7: 14(60) 49(72) 21(10)  42(23)  35(54) 28(54) 7(85)

1. Perform the analysis of variance of the data. Also obtain the Type II SS.
2. Obtain adjusted treatment means with their standard errors.
3. Test the equality of all adjusted treatment means.
4. Test whether the sum of treatment means 1 to 3 is equal to the sum of treatment means 4 to 6.
5. Estimate the difference between the average of treatment means 1 to 3 and that of treatment means 4 to 6.
6. Partition the between-block sum of squares into the between-replication sum of squares and the between-blocks-within-replications sum of squares.
PROCEDURE
Prepare the data file:
data name;
input rep blk trt yield;
cards;
.... .... ....
;
Print the data using PROC PRINT. Perform the analysis for objectives 1 to 5 using PROC GLM. The statements are as follows:
proc print;
proc glm;
  class rep blk trt;


  model yield = blk trt / ss2;
  contrast 'A' trt 1 1 1 -1 -1 -1;
  estimate 'A' trt 1 1 1 -1 -1 -1 / divisor=3;
run;
Objective 6 can be achieved with another model statement:
proc glm;
  class rep blk trt;
  model yield = rep blk(rep) trt / ss2;
run;
Exercise 3: Analyze the data obtained from a split-plot experiment on the yield of 3 irrigation (IRRIG) treatments applied to main plots and 2 cultivars (CULT) applied to subplots, in three replications (REP). The layout and data (YLD) are given below:

Replication-I:    I1: C1(1.6) C2(3.3)   I2: C1(2.6) C2(5.1)   I3: C1(4.7) C2(6.8)
Replication-II:   I1: C1(3.4) C2(4.7)   I2: C1(4.6) C2(1.1)   I3: C1(5.5) C2(6.6)
Replication-III:  I1: C1(3.2) C2(5.6)   I2: C1(5.1) C2(6.2)   I3: C1(5.7) C2(4.5)
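The split-plot analysis hinted at in the text can be sketched as follows; this is a minimal sketch in which the variable names REP, IRRIG, CULT and YLD are taken from the exercise, and REP*IRRIG serves as the main-plot error:

```sas
proc glm;
  class rep irrig cult;
  model yld = rep irrig rep*irrig cult irrig*cult;
  /* test irrigation main effects against the main-plot error */
  test h=irrig e=rep*irrig;
  means irrig / lsd e=rep*irrig;  /* main-plot comparisons */
  means cult irrig*cult / lsd;    /* subplot comparisons use residual error */
run;
```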

Perform the analysis of the data. (HINT: Steps are given in the text.)
Exercise 4: An agricultural field experiment was conducted with 9 treatments using 36 plots arranged in 4 complete blocks, and a sample of the harvested output from each of the 36 plots was analysed blockwise by three technicians using three different operations. The data collected (treatment numbers, with observations in parentheses) are given below:

Block-1
Operation   Technician 1   Technician 2   Technician 3
    1         1(1.1)         2(2.1)         3(3.1)
    2         4(4.2)         5(5.3)         6(6.3)
    3         7(7.4)         8(8.7)         9(9.6)

Block-2
Operation   Technician 1   Technician 2   Technician 3
    1         1(2.1)         4(5.2)         7(8.3)
    2         2(3.2)         5(6.7)         8(9.9)
    3         3(4.5)         6(7.6)         9(10.3)

Block-3
Operation   Technician 1   Technician 2   Technician 3
    1         1(1.2)         6(6.3)         9(9.4)
    2         2(2.7)         5(5.9)         7(7.8)
    3         8(8.7)         4(4.8)         3(3.3)

Block-4
Operation   Technician 1   Technician 2   Technician 3
    1         1(3.1)         9(11.3)        6(8.1)
    2         2(4.5)         8(10.7)        4(6.9)
    3         5(7.8)         7(9.3)         3(5.8)


1. Perform the analysis of the data, considering that technicians and operations are crossed with each other and nested within the blocking factor.
2. Perform the analysis considering the effects of technicians as negligible.
3. Perform the analysis ignoring the effects of both operations and technicians.
Procedure: Prepare the data file:
data name;
input blk tech oper trt obs;
cards;
.... .... ....
;
Perform the analysis for objective 1 using PROC GLM. The statements are as follows:
proc glm;
  class blk tech oper trt;
  model obs = blk tech(blk) oper(blk) trt / ss2;
  lsmeans trt oper(blk) / pdiff;
run;
For objective 2, use PROC GLM with the model statement changed as follows:
proc glm;
  class blk tech oper trt;
  model obs = blk oper(blk) trt / ss2;
run;
For objective 3:
proc glm;
  class blk tech oper trt;
  model obs = blk trt / ss2;
run;
Exercise 5: A greenhouse experiment on tobacco mosaic virus was conducted. The experimental unit was a single leaf. Individual plants were found to contribute significantly to the error and hence were taken as one source of heterogeneity in the experimental material. The position of the leaf within a plant was also found to contribute significantly to the error; therefore, the three leaf positions, viz. top, middle and bottom, were identified as the levels of a second factor causing heterogeneity. 7 solutions were applied to the leaves of 7 plants and the number of lesions produced per leaf was counted. Analyze the data of this experiment.


                                 Plants
Leaf Position    1      2      3      4      5      6      7
Top             1(2)   2(3)   3(1)   4(5)   5(3)   6(2)   7(1)
Middle          2(4)   3(3)   4(2)   5(6)   6(4)   7(2)   1(1)
Bottom          4(3)   5(4)   6(7)   7(6)   1(3)   2(4)   3(7)

The figures at the intersections of plants and leaf positions are the solution numbers, and the figures in parentheses are the numbers of lesions produced per leaf. Procedure: Prepare the data file:
data name;
input plant posi $ trt count;
cards;
.... .... ....
;

Perform the analysis using PROC GLM. The statements are as follows:
proc glm;
  class plant posi trt;
  model count = plant posi trt / ss2;
  lsmeans trt / pdiff;
run;
Exercise 6: The following data were collected through a pilot sample survey on hybrid jowar crop, on yield and biometrical characters. The biometrical characters were average plant population (PP), average plant height (PH), average number of green leaves (NGL) and yield (kg/plot).

1. Fit a multiple linear regression equation taking yield as the dependent variable and the biometrical characters as explanatory variables. Print the matrices used in the regression computations.
2. Test the significance of the regression coefficients, and also the equality of the regression coefficients of (a) PP and PH, (b) PH and NGL.
3. Obtain the predicted values corresponding to each observation in the data set.
4. Identify the outliers in the data set.
5. Check for linear relationships among the biometrical characters.


6. Fit the model without intercept.

No.    PP       PH      NGL    Yield
 1    142.00   0.5250   8.20   2.470
 2    143.00   0.6400   9.50   4.760
 3    107.00   0.6600   9.30   3.310
 4     78.00   0.6600   7.50   1.970
 5    100.00   0.4600   5.90   1.340
 6     86.50   0.3450   6.40   1.140
 7    103.50   0.8600   6.40   1.500
 8    155.99   0.3300   7.50   2.030
 9     80.88   0.2850   8.40   2.540
10    109.77   0.5900  10.60   4.900
11     61.77   0.2650   8.30   2.910
12     79.11   0.6600  11.60   2.760
13    155.99   0.4200   8.10   0.590
14     61.81   0.3400   9.40   0.840
15     74.50   0.6300   8.40   3.870
16     97.00   0.7050   7.20   4.470
17     93.14   0.6800   6.40   3.310
18     37.43   0.6650   8.40   1.570
19     36.44   0.2750   7.40   0.530
20     51.00   0.2800   7.40   1.150
21    104.00   0.2800   9.80   1.080
22     49.00   0.4900   4.80   1.830
23     54.66   0.3850   5.50   0.760
24     55.55   0.2650   5.00   0.430
25     88.44   0.9800   5.00   4.080
26     99.55   0.6450   9.60   2.830
27     63.99   0.6350   5.60   2.570
28    101.77   0.2900   8.20   7.420
29    138.66   0.7200   9.90   2.620
30     90.22   0.6300   8.40   2.000
31     76.92   1.2500   7.30   1.990
32    126.22   0.5800   6.90   1.360
33     80.36   0.6050   6.80   0.680
34    150.23   1.1900   8.80   5.360
35     56.50   0.3550   9.70   2.120
36    136.00   0.5900  10.20   4.160
37    144.50   0.6100   9.80   3.120
38    157.33   0.6050   8.80   2.070
39     91.99   0.3800   7.70   1.170
40    121.50   0.5500   7.70   3.620
41     64.50   0.3200   5.70   0.670
42    116.00   0.4550   6.80   3.050
43     77.50   0.7200  11.80   1.700
44     70.43   0.6250  10.00   1.550
45    133.77   0.5350   9.30   3.280
46     89.99   0.4900   9.80   2.690

Procedure: Prepare a data file:
data mlr;
input PP PH NGL Yield;
cards;
.... ....
;
Use PROC REG:
proc reg;
  model Yield = PP PH NGL / p r influence vif collin xpx i;
  test1: test PP = 0;
  test2: test PH = 0;
  test3: test NGL = 0;
  test4: test PP - PH = 0;
  test4a: test PP = 0, PH = 0;
  test5: test PH - NGL = 0;
  test5a: test PH = 0, NGL = 0;
  model Yield = PP PH NGL / noint;
run;
model Yield = PP PH NGL;
restrict intercept = 0;
run;
Exercise 7: An experiment was conducted with five levels of each of four fertilizer treatments: nitrogen (N), phosphorus (P2O5), potassium (K2O) and zinc (Zn). The levels of the four factors and the yields obtained are given below. Fit a second order response surface using the original (uncoded) data. Test the lack of fit of the model. Compute the ridges of maximum and minimum response. Obtain the predicted residual sum of squares (PRESS).

  N    P2O5   K2O    Zn    Yield
 40     30     25    20    11.28
 40     30     25    60     8.44
 40     30     75    20    13.29
 40     90     25    20     7.71
120     30     25    20     8.94
 40     30     75    60    10.90
 40     90     25    60    11.85
120     30     25    60    11.03
120     30     75    20     8.26
120     90     25    20     7.87
 40     90     75    20    12.08
 40     90     75    60    11.06
120     30     75    60     7.98
120     90     75    60    10.43
120     90     75    20     9.78
120     90     75    60    12.59
160     60     50    40     8.57
  0     60     50    40     9.38
 80    120     50    40     9.47
 80      0     50    40     7.71
 80     60    100    40     8.89
 80     60      0    40     9.18
 80     60     50    80    10.79
 80     60     50     0     8.11
 80     60     50    40    10.14
 80     60     50    40    10.22
 80     60     50    40    10.53
 80     60     50    40     9.50
 80     60     50    40    11.53
 80     60     50    40    11.02

Procedure: Prepare a data file:
/* yield at different levels of several factors */
title 'yield with factors N P K Zn';
data dose;
input n p k Zn y;
label y = "yield";
cards;
..... ..... .....
;
Use PROC RSREG:
proc rsreg;
  model y = n p k Zn / nocode lackfit press;
  ridge min max;
run;
Exercise 8: Fit a second order response surface to the following data. Take replications as a covariate.

_____________________________________________________________________
Fertilizer1   Fertilizer2    x1     x2     Yield (lb/plot)
                                           Rep 1     Rep 2
_____________________________________________________________________
    50            15         -1     -1      7.52      8.12
   120            15         +1     -1     12.37     11.84
    50            25         -1     +1     13.55     12.35
   120            25         +1     +1     16.48     15.32
    35            20        -√2      0      8.63      9.44
   134            20        +√2      0     14.22     12.57
    85            13          0    -√2      7.90      7.33
    85            27          0    +√2     16.49     17.40
    85            20          0      0     15.73     17.00
_____________________________________________________________________
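No SAS statements are given for this exercise. One possible sketch is shown below: the replicate yields are stacked into one observation per plot, and the COVAR= option of PROC RSREG treats the first listed regressor as a covariate rather than as a factor of the quadratic surface. The data set name ccd and its variables are assumptions for illustration.

```sas
/* 'ccd' is an assumed data set with one observation per plot per replicate:
   design coordinates x1, x2, the replicate number rep, and yield y */
proc rsreg data=ccd;
  /* covar=1 treats the first regressor (rep) as a covariate,
     excluded from the second order response surface */
  model y = rep x1 x2 / covar=1 lackfit;
run;
```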

References
Dey, A. (1985). Orthogonal Fractional Factorial Designs. Wiley Eastern Limited, New Delhi.
Lindman, H.R. (1992). Analysis of Variance in Experimental Design. Springer-Verlag, New York.
Littel, R.C., Freund, R.J. and Spector, P.C. (1991). SAS System for Linear Models, Third Edition. SAS Institute Inc.
Searle, S.R. (1971). Linear Models. John Wiley & Sons, New York.
Searle, S.R., Casella, G. and McCulloch, C.E. (1992). Variance Components. John Wiley & Sons, New York.
