International Journal of Agronomy and Plant Production. Vol., 4 (1), 127-141, 2013
Available online at http://www.ijappjournal.com
ISSN 2051-1914 ©2013 VictorQuest Publications
A Review on Applied Multivariate Statistical Techniques in Agriculture and Plant Science
Armin Saed-Moucheshi 1*, Elham Fasihfar, Hojat Hasheminasab, Amir Rahmani and Alli Ahmadi
1- Dept. Crop Production and Plant Breeding, Shiraz University, Shiraz (Iran)
2- Dept. Crop Production and Plant Breeding, Razi University, Kermanshah (Iran)
3- Dept. Plant Protection, Tabriz University, Tabriz (Iran)
*Corresponding Author Email: [email protected]
Abstract
Most scientists make decisions by analyzing data obtained from research work. Almost all scientific data are abundant, and by themselves they are of little help unless they are summarized by some method and interpreted appropriately. A data set may contain many observations that stand out and whose presence in the data cannot be justified by any simple explanation. Multivariate statistics is a form of statistics encompassing the simultaneous observation and analysis of more than one statistical variable. In this review we clarify how multivariate statistical methods such as multiple regression analysis, principal component analysis (PCA), factor analysis (FA), cluster analysis, and canonical correlation (CC) can be used to explain relationships among different variables and to make decisions for future work, with examples relating to agriculture and plant science.

Keywords: Canonical correlation; Factor analysis; Principal component analysis; Cluster analysis; Multiple regression.

Introduction
Most crucial decisions in science, sociology, politics, economics, business, biology and botany are based on analyzing data obtained from research work. Almost all scientific data are abundant, and by themselves they are of little help unless they are summarized by some method and interpreted appropriately. Since such a summary and the corresponding interpretation can rarely be made just by looking at the raw data, careful scientific scrutiny and analysis of these data can usually provide an enormous amount of valuable information. Admittedly, the more complex the data and their structure, the more involved the data analysis (Steel and Torrie, 1960). Complexity in a data set may exist for a variety of reasons. The data set may contain many observations that stand out and whose presence in the data cannot be justified by any simple explanation.
Another situation in which a simple analysis alone may not suffice occurs when the data on some of the variables are correlated or when there is a trend present in the data. Often, data are collected on a number of units, and on each unit not just one but many variables are measured. Further, when many variables exist, scientists need more complex analyses to extract the most definite and most easily comprehensible information that the data can provide (Everitt and Dunn, 1992). Univariate data, where only one variable is under consideration, are usually summarized by the (population or sample) mean, variance, skewness and kurtosis (Anderson, 1984). These are the basic quantities used for data description. Multivariate statistics, on the other hand, is a form of statistics encompassing the simultaneous observation and analysis of more than one statistical variable. Methods of bivariate statistics, for example simple linear regression and correlation, are special cases of multivariate statistics in which two variables are involved (Steel and Torrie, 1960). Multivariate statistics is concerned with understanding the aims and background of each form of multivariate analysis and with explaining how different variables are related to one another. The practical application of multivariate statistics to a particular problem may involve several types of univariate and multivariate analysis in order to
understand the relationships among variables and their relevance to the actual problems being studied (Johnson and Wicheren, 1996). Many different multivariate analysis techniques are available, such as multivariate analysis of variance (MANOVA), multiple regression analysis, principal component analysis (PCA), factor analysis (FA), canonical correlation analysis (CC), and cluster analysis. In this review we explain the multivariate statistical techniques that are applicable in agriculture and plant science, with related examples, in order to provide plant scientists with a practical manual for research work.

Multiple Linear Regression Analysis
Linear regression is an approach to modeling the relationship between a dependent variable, Y, and one or more explanatory variables, denoted X. The case of one explanatory variable is called simple regression. For example, to determine how much the yield of a plant changes when its height increases by 1 cm, we use simple linear regression (Draper and Smith, 1966). The prediction model equation for simple linear regression is:

Y = b0 + b1X + ε

b0: the intercept, which geometrically represents the value of the dependent variable (Y) where the regression line crosses the Y axis. Substantively, it is the expected value of Y when the independent variable is equal to zero.
b1: the slope coefficient (regression coefficient). It represents the change in Y associated with a one-unit increase in X.
ε: the error of the prediction. In most situations we are not in a position to determine the population parameters directly; instead, we must estimate their values from a finite sample from the population, and ε is the part of Y that the estimated model does not predict.

Multiple regression considers more than one explanatory variable (X); for example, how much the plant yield changes for a one-unit change in stem height, stem diameter, root length and leaf area.
The prediction model for multiple regression is an expansion of the simple linear regression model, as follows:

Y = b0 + b1X1 + b2X2 + … + biXi + ε

bi: the partial slope coefficient (also called partial regression coefficient or metric coefficient). It represents the change in Y associated with a one-unit increase in Xi when all other independent variables are held constant.

Here b0 is the sample estimate of β0 and bi is the sample estimate of βi, where the β's are the parameters of the whole population from which the sample is drawn. After determining the intercept and regression coefficients, we have to test them for significance by analysis of variance (ANOVA). ANOVA determines whether each regression coefficient calculated for the candidate model should be present in the final model as a predictor or not. Statistical software calculates a P-value (or sig-value) for the significance test of each coefficient. If the P-value for a coefficient is less than 0.05 (P<0.05), the coefficient is statistically significant and the related variable can be retained as a predictor; if it is greater than 0.05 (P>0.05), the coefficient is not statistically significant and the related variable should not be present as a predictor (Draper and Smith, 1981). The coefficient of determination, or R-square (R²), shows how well the model of predictors fits the dependent variable or variables: the higher the R², the better the fit and the goodness of the model. Moreover, the significance test for the intercept (b0) is similar to that for the regression coefficients (Kleinbaum et al., 1998).

The significance tests of the coefficients and R² help researchers to decide which predictors are more important and must be present in the model. In addition to these methods, some other techniques have been developed for determining the best model of predictors. Besides this, when the number of predictors increases, most of the variables are usually strongly correlated with each other; it is then not necessary to keep all of these correlated variables in the model, because they can be used in place of each other (Manly, 2001).
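The fitting step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any particular statistical package: the data are hypothetical values for two traits and a yield, and the helper names `solve` and `fit_ols` are introduced here for illustration only. The sketch estimates b0 and the partial slope coefficients by least squares and reports R² together with the adjusted R². Significance tests are left out because P-values require the t or F distribution from a statistics package.

```python
def solve(a, b):
    """Solve the square system a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(xs, y):
    """Least-squares fit of Y = b0 + b1*X1 + ... + bk*Xk; xs is a list of rows."""
    rows = [[1.0] + list(r) for r in xs]               # prepend the intercept column
    p = len(rows[0])
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    b = solve(xtx, xty)
    fitted = [sum(bj * rj for bj, rj in zip(b, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    n, k = len(y), p - 1
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)  # penalizes extra predictors
    return b, r2, adj_r2


# Hypothetical data (assumption): two traits per plant and a yield with small noise
traits = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
noise = [0.1, -0.1, -0.1, 0.1, 0.1, -0.1]
grain_yield = [2.0 + 0.5 * a + 1.5 * b + e for (a, b), e in zip(traits, noise)]

coef, r2, adj_r2 = fit_ols(traits, grain_yield)
print("b0, b1, b2:", [round(c, 3) for c in coef])
print("R2:", round(r2, 4), "adjusted R2:", round(adj_r2, 4))
```

The adjusted R² is always at most R² when at least one predictor is present, which is why it is often reported alongside R² when models with different numbers of predictors are compared.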
Backward elimination: in this technique, unlike forward selection, all variables are initially present in the model and the less important variables are removed from the model step by step. In the first step, all possible models obtained by removing each one of the variables are considered, and the variable having the least mean square is removed from the model. In the following steps this procedure is repeated for as long as the P-value of the least important remaining variable is higher than the chosen threshold. Once every remaining variable has a P-value below the threshold, the analysis stops and the model with the remaining variables is taken as the best predicting model (Burnham and Anderson, 2002).

Forward selection: in this method, in the first step of the analysis, the simple regressions on each of the independent variables are calculated, and the variable with the highest mean square (or F-value) enters the regression model as the first and most important predictor. In the second step, the variable entered in the first step stays in the model, all possible models containing it together with one further variable are fitted, and the one with the highest mean square is the preferred prediction model. This procedure continues until the P-value of the best candidate variable is higher than the threshold P-value. The variables that remain outside the model at that point are not included in the prediction model (Harrell, 2001).

Stepwise regression: this variable selection method has proved to be an extremely useful computational technique in data analysis problems (Dong et al., 2008). As in forward selection, all possible univariate models are worked out and the variable with the highest mean square is included in the model. In the second step, all other possible models containing the first included variable are investigated and the variable with the highest mean square is entered into the model; but once the second variable has been entered, the first variable must be tested for significance in the presence of the second. If the first entered variable is still significant, both variables are kept in the model, but if it is no longer significant it is removed from the model. In the following steps this procedure is repeated, and any variable entered in a previous step whose P-value has become higher than the threshold is removed.
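The forward selection procedure described above can be sketched as follows. This is a hedged illustration, not the algorithm of a specific package: it uses a fixed F-to-enter threshold (`f_to_enter=4.0`, an assumed value) in place of a P-value threshold, because the F distribution is not available in the Python standard library. The helper names `solve`, `rss` and `forward_selection`, and the data, are hypothetical; the data are constructed so that the response depends on x1 and x2 but not on x3.

```python
def solve(a, b):
    """Solve the square system a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares of y regressed on an intercept plus the given columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    b = solve(xtx, xty)
    return sum((yi - sum(bj * rj for bj, rj in zip(b, r))) ** 2
               for r, yi in zip(design, y))


def forward_selection(data, y, f_to_enter=4.0):
    """Enter, one at a time, the variable with the largest partial F-value,
    stopping when no candidate reaches f_to_enter (assumed threshold)."""
    selected, remaining = [], list(data)
    current_rss = rss([], y)                      # intercept-only model
    while remaining:
        best_name, best_f, best_rss = None, 0.0, None
        for name in remaining:
            cols = [data[s] for s in selected] + [data[name]]
            df_resid = len(y) - len(cols) - 1
            if df_resid <= 0:
                continue                          # too few observations for this model
            new_rss = rss(cols, y)
            f_value = (current_rss - new_rss) / (new_rss / df_resid)
            if f_value > best_f:
                best_name, best_f, best_rss = name, f_value, new_rss
        if best_name is None or best_f < f_to_enter:
            break
        selected.append(best_name)
        remaining.remove(best_name)
        current_rss = best_rss
    return selected


# Hypothetical data (assumption): the response depends on x1 and x2 only
traits = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [6, 4, 4, 6, 6, 4, 4, 6],
    "x3": [11, 11, 9, 9, 9, 9, 11, 11],
}
noise = [0.1, -0.1, 0.1, -0.1, -0.1, 0.1, -0.1, 0.1]
grain_yield = [1.0 + 3.0 * a + 2.0 * b + e
               for a, b, e in zip(traits["x1"], traits["x2"], noise)]

chosen = forward_selection(traits, grain_yield)
print("Variables entered, in order:", chosen)
```

Backward elimination mirrors this loop by starting from the full model and dropping the variable with the smallest partial F-value, and stepwise regression adds a removal check after each entry.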
Indeed, stepwise regression uses both the forward selection and the backward elimination techniques and is more suitable than either of them alone (Miller, 2002).

Path analysis: regression coefficients depend strongly on the units of the variables. Depending on the unit in which a variable is measured, its coefficient is high or low, so coefficients of variables measured in different units cannot be compared directly. In order to compare coefficients, the solution is to standardize the data of each variable by subtracting its mean and dividing by its standard deviation. After standardization, the variable with the higher coefficient has the greater effect on the dependent variable. When independent variables are correlated with each other, they can affect each other. In this situation, the correlation between each independent variable and the dependent variable can be divided into the direct effect of that independent variable and its indirect effects via the other correlated variables (Fig. 1). Using standardized data in the regression model to calculate the regression coefficients gives the direct effects of the variables. The indirect effect of a variable via another variable can be estimated by multiplying the direct effect of the other variable by the correlation coefficient between the two independent variables (Shipley, 1997). Therefore, path analysis can be described as an extension of the regression model, used to test the fit of the correlation matrix against two or more causal models which are being compared by the researcher (Dong et al., 2008).
Figure 1. Diagram of path analysis
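The partition of a correlation into direct and indirect effects can be illustrated for the simplest case of two correlated predictors. The sketch below is a minimal example under stated assumptions: the data are hypothetical, the function names `pearson` and `path_two_predictors` are introduced here for illustration, and the closed-form expressions used for the direct effects are the standardized partial regression coefficients for exactly two predictors. For each predictor, the direct effect plus the indirect effect via the other predictor adds up to its correlation with the dependent variable.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))


def path_two_predictors(x1, x2, y):
    """Split r(x1, y) and r(x2, y) into direct and indirect effects."""
    r12, r1y, r2y = pearson(x1, x2), pearson(x1, y), pearson(x2, y)
    # Direct effects: standardized partial regression coefficients
    p1 = (r1y - r12 * r2y) / (1 - r12 ** 2)
    p2 = (r2y - r12 * r1y) / (1 - r12 ** 2)
    return {
        "x1": {"r": r1y, "direct": p1, "indirect": r12 * p2},  # indirect via x2
        "x2": {"r": r2y, "direct": p2, "indirect": r12 * p1},  # indirect via x1
    }


# Hypothetical measurements (assumption): two traits and grain yield on six plants
spike_weight = [2.1, 2.5, 3.0, 3.4, 3.9, 4.4]
grains_per_spike = [30, 34, 33, 38, 41, 40]
grain_yield = [2.0, 2.6, 2.9, 3.5, 3.8, 4.5]

effects = path_two_predictors(spike_weight, grains_per_spike, grain_yield)
for name, e in effects.items():
    print(name, {k: round(v, 3) for k, v in e.items()})
```

With more than two predictors the direct effects are obtained by solving the full system of normal equations on the correlation matrix, and each predictor has one indirect effect per correlated partner.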
For a better understanding of the regression techniques mentioned above, we present an example here.

Example 1: we measured some morphological traits of three wheat cultivars, consisting of Tiller numbers/plant, Spike length, Spikelets/spike, Spike weight/plant, Grains/spike, Grain weight/spike, 100-Grain weight, Total chlorophyll content of flag leaf, Biologic yield/plant, Root weight, Leaves area and grain yield, under four water regimes (Moocheshi et al., 2012). Here we want to evaluate the relationship between grain yield and the measured morphological traits using the techniques mentioned above.

Multiple regression
Table 1 shows the regression coefficient values, their standard errors, t-values and P-values. The full regression equation based on the results is:

Y = 0.5394 - 0.12X1 - 0.02X2 - 0.01X3 + 0.96X4 + 0.01X5 - 0.78X6 - 0.01X7 - 0.004X8 + 0.01X9 + 0.08X10 - 0.001X11

X1= Tiller numbers/plant, X2= Spike length, X3= Spikelets/spike, X4= Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X7= 100-Grain weight, X8= Total chlorophyll content of flag leaf, X9= Biologic yield/plant, X10= Root weight, X11= Leaves area and Y= Grain yield.

The coefficient of determination (R²) is equal to 99.2%, which is very high, but it overstates the fit of the model because R² increases as the number of variables increases. The adjusted R² was introduced in place of R² to solve this problem, but it is not a completely accepted index either. Also, the number of variables in this model is large, so explaining the relationship between the dependent variable and the many independent variables is complex; on the other hand, some coefficient values are very small and the corresponding variables can be removed from the model. Based on the P-values, most of the variables are not statistically significant. The P-value shows which variables should be present in the model as predictors and which should not.
As can be seen in Table 1, X4 and X6 are the variables with a P-value lower than 0.05, and we must select them as the variables with the greatest effect on yield. The predicting model based on the regression analysis is as follows:

Y = 0.96X4 - 0.78X6

Selection procedures
Backward elimination: in the four steps of the backward elimination, the four variables X1, X3, X2 and X7 are removed from the model and the other variables remain. Based on this result, these four variables are the least important variables for predicting yield. With this procedure the predicting model is formulated as follows (Tables 2 and 3):

Y = -0.19 + 0.98X4 + 0.01X5 - 1.54X6 - 0.004X8 - 0.01X9 + 0.1X10 + 0.005X11

X4= Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X8= Total chlorophyll content of flag leaf, X9= Biologic yield/plant, X10= Root weight, X11= Leaves area, and Y= Yield.

Forward selection: as in backward elimination, seven variables are included in the forward selection model, but the values of the coefficients differ slightly (Table 4).

Y = -0.003 + 0.98X4 - 0.004X5 + 0.01X6 - 0.01X8 - 1.54X9 + 0.11X10 - 0.003X11

X4= Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X8= Total chlorophyll content of flag leaf, X9= Biologic yield/plant, X10= Root weight, X11= Leaves area, and Y= Yield.

Stepwise selection: Table 5 shows the variables entered into, or removed from, the model of stepwise regression. Similar to the results of backward elimination and forward selection, stepwise selection retains seven variables:

Y = -0.195 + 0.98X4 - 0.01X5 - 1.54X6 - 0.004X8 - 0.01X9 + 0.1X10 - 0.005X11

X4= Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X8= Total chlorophyll content of flag leaf, X9= Biologic yield/plant, X10= Root weight, X11= Leaves area, and Y= Yield.

Which model should be used as the predicting model is the choice of the researcher, who can use the model that best explains the idea of the research, but stepwise selection is usually the best.
On the other hand, the t-test of significance for the variables in multiple regression analysis is not a sufficient technique on its own.

Path analysis: to carry out path coefficient analysis more effectively and to understand the relationship between yield and the other morphological traits, the researcher can use the results of the selection procedures in the path analysis, but here we considered all variables. In this technique, the correlation coefficient between yield and each of the measured morphological traits is partitioned into its direct effect and its indirect effects on yield via the other variables. The highest direct effect on yield was obtained for spike weight/plant (1.013), while the other variables had very low direct effects on yield (Table 6). The sum of the indirect effects of spike weight/plant was negative. Except for spike weight/plant, the variables had high indirect effects on grain yield. Spikelets/spike showed the lowest contribution to grain yield through its direct effect, but the highest contribution through other traits.
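The screening of the full regression model by P-value, as applied to Table 1 in the example, can be reproduced with a short script. This is a minimal sketch: the B, SE and P values are copied from Table 1, while the variable names `table1`, `ALPHA`, `t_values` and `significant` are introduced here for illustration. It recomputes each t-value as B/SE and keeps the predictors with P < 0.05.

```python
# (predictor, B, SE, P) copied from Table 1
table1 = [
    ("X1", -0.1164, 0.08245, 0.171),
    ("X2", -0.0202, 0.05014, 0.691),
    ("X3", -0.0082, 0.02037, 0.693),
    ("X4", 0.9617, 0.01927, 0.001),
    ("X5", 0.0110, 0.00699, 0.131),
    ("X6", -0.7802, 0.34490, 0.033),
    ("X7", -0.0070, 0.00979, 0.483),
    ("X8", -0.0042, 0.00318, 0.196),
    ("X9", 0.0131, 0.01165, 0.273),
    ("X10", 0.0840, 0.09246, 0.373),
    ("X11", -0.0008, 0.00318, 0.803),
]

ALPHA = 0.05  # significance threshold used in the text

# The t statistic of a coefficient is its estimate divided by its standard error
t_values = {name: b / se for name, b, se, p in table1}

# Predictors whose coefficients are significant at the chosen threshold
significant = [name for name, b, se, p in table1 if p < ALPHA]

print("Significant predictors:", significant)
for name in significant:
    print(name, "t =", round(t_values[name], 2))
```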
Table 1. The regression coefficient (B), standard error (SE), T-value and probability (P) of the estimated variables in predicting wheat grain yield by the multiple linear regression analysis under inoculation (In) and non-inoculation (Non-In) conditions and different water levels

Predictor   DF   B         SE        T       P
Constant    1    0.5394    0.49180   1.10    0.284
X1          1    -0.1164   0.08245   -1.41   0.171
X2          1    -0.0202   0.05014   -0.40   0.691
X3          1    -0.0082   0.02037   -0.40   0.693
X4          1    0.9617    0.01927   49.90   0.001
X5          1    0.0110    0.00699   1.56    0.131
X6          1    -0.7802   0.34490   -2.26   0.033
X7          1    -0.0070   0.00979   -0.71   0.483
X8          1    -0.0042   0.00318   -1.33   0.196
X9          1    0.0131    0.01165   1.12    0.273
X10         1    0.0840    0.09246   0.91    0.373
X11         1    -0.0008   0.00318   -0.25   0.803

X1= Tiller numbers/plant, X2= Spike length, X3= Spikelets/spike, X4= Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X7= 100-Grain weight, X8= Total chlorophyll content of flag leaf, X9= Biologic yield/plant, X10= Root weight and X11= Leaves area.

Table 2. Summary of Backward elimination

Step   Variable removed   Number of variables remaining in model   Partial R-Square   Model R-Square   F Value   Pr > F
1      x1                 10                                       0                  0.9991           0.02      0.8836
2      x3                 9                                        0                  0.9991           0.03      0.8558
3      x2                 8                                        0                  0.9991           0.28      0.6028
4      x7                 7                                        0                  0.9991           1.06      0.3117

X1= Tiller numbers/plant, X2= Spike length, X3= Spikelets/spike, X7= 100-Grain weight.

Table 3. Backward elimination and remained variables in the model

No.   Variable    Parameter estimate   Standard error   Sum of squares   F-Value
      Intercept   -0.19463             0.08673          0.03923          5.040
1     x4          0.97670              0.00947          82.8773          640.1
2     x5          0.01208              0.00342          0.09736          12.50
3     x6          -1.54441             0.21063          0.41875          53.76
4     x8          -0.00407             0.00138          0.06753          8.670
5     x9          -0.01094             0.00460          0.04402          5.650
6     x10         0.09707              0.04682          0.03347          4.300
7     x11         0.00505              0.00160          0.07755          9.960

X4= Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X8= Total chlorophyll flag leaf, X9= Biologic yield/plant, X10= Root weight and X11= Leaves area.
Table 4. Summary of forward selection

Step   Variable entered   Partial R-Square   Model R-Square   Parameter estimate   Standard error   F-Value
1      x4                 0.9963             0.9963           0.97859              0.00963          83.85
2      x6                 0.0013             0.9975           0.01198              0.00341          17.04
3      x9                 0.0005             0.998            -1.54065             0.21043          7.79
4      x5                 0.0004             0.9985           -0.00443             0.0043           8.69
5      x11                0.0002             0.9987           -0.0034              0.00152          5.48
6      x8                 0.0002             0.9989           -0.01166             0.00465          6.42
7      x10                0.0001             0.9991           0.11336              0.04937          4.3
       Intercept                                              -0.12314             0.11097

Pr > F   0.0329