SOME ASPECTS OF LINEAR DISCRIMINANT ANALYSIS

KAUSTAV ADITYA
M.Sc. (Agricultural Statistics), Roll No. 4493, I.A.S.R.I., Library Avenue, New Delhi-110012
Chairperson: Dr. Ranjana Agarwal

Abstract: Discriminant analysis is a multivariate technique concerned with classifying distinct sets of objects (or sets of observations) and allocating new objects or observations to previously defined groups. In this talk, the univariate approach to discriminating observations is discussed first. Thereafter, the emphasis is on examining differences in observations with respect to two or more variables simultaneously. Discriminant analysis for two groups is dealt with by discussing the selection of discriminator variables, leading to the development of a discriminant function which is then used for classifying observations. Fisher's linear discriminant function is also derived. Tests for differences in the means of the two groups for each discriminator variable, and tests for differences between the two groups when all the variables are considered simultaneously, are discussed using the univariate Wilks' test statistic. The procedure of Multiple Discriminant Analysis (MDA) for several groups is discussed. Lastly, determination of the optimum number of discriminant functions needed to adequately represent the differences among the groups is outlined.

Key words: Discriminant Analysis, Multivariate Technique, Classificatory Procedure, Multiple Discriminant Analysis (MDA), Wilks' Test Statistic

1. Introduction
Discriminant analysis is a multivariate technique concerned with separating distinct sets of objects (or sets of observations) and allocating new objects (or observations) to previously defined groups. As a classificatory procedure, it is often employed to investigate observed differences when causal relationships are not well understood. For example: (1) In an agricultural experiment, crop yield can be forecast on the basis of weather parameters using discriminant analysis. Here the crop year is categorized as congenial, normal or adverse to yield on the basis of the weather parameters, from which one can decide which weather conditions are suitable for the crop. (2) A medical researcher is interested in determining the factors that significantly differentiate between patients who have had a heart attack and those who have not. The researcher then wants to use the identified factors to predict whether a patient is likely to have a heart attack in the future.

The steps of discriminant analysis, illustrated in the sketch below, are as follows:
1. Identify the variables that best discriminate among the various groups.
2. Use the identified variables to develop an equation, or function, for computing a new variable or index that parsimoniously represents the differences among the groups.
3. Use the discriminant function to classify future observations into one of the pre-defined groups.
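A minimal sketch of these three steps, assuming scikit-learn is available; the data here are synthetic stand-ins, not values from the text:

```python
# Illustrative three-step workflow (hypothetical data, not from the text).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Step 1: data on two candidate discriminator variables for two groups
group1 = rng.normal(loc=[0.19, 0.18], scale=0.05, size=(12, 2))
group2 = rng.normal(loc=[0.01, 0.02], scale=0.05, size=(12, 2))
X = np.vstack([group1, group2])
y = np.array([1] * 12 + [2] * 12)

# Step 2: estimate the discriminant function from the training data
lda = LinearDiscriminantAnalysis().fit(X, y)

# Step 3: classify a future observation into one of the predefined groups
new_firm = np.array([[0.15, 0.17]])
print(lda.predict(new_firm))  # predicted group label
```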


Thus discriminant analysis is the method of describing, either graphically (in three or fewer dimensions) or algebraically, the differential features of objects from several known collections (or populations). Here we try to find discriminants whose numerical values are such that the collections are separated as much as possible. This terminology was introduced by R.A. Fisher (1936-38). Discriminant analysis has been discussed in many books; to mention a few, Anderson (1984), Hair et al. (1995), Sharma (1996), and Johnson and Wichern (2006). The procedure discussed in this write-up is mainly taken from the book by Sharma (1996).

2. The Univariate Approach for Discriminating Observations and Identifying the Best Set of Variables
For this discussion we use a data table giving financial ratios for a sample of 24 firms, of which 12 are the most admired and 12 are the least admired firms (Table 2.1). The financial ratios are EBITASS (earnings before interest and taxes to total assets) and ROTC (return on total capital); they can be used to visually assess the extent to which the two ratios discriminate between the two groups. It is clear that the two groups of firms are well separated with respect to each ratio. Examining differences between groups with respect to a single variable is referred to as univariate analysis: does each variable (ratio), taken by itself, discriminate between the two groups? The term univariate emphasizes that the differences between the two groups are assessed for each variable independently of the remaining variables. Examining differences with respect to two or more variables simultaneously is referred to as multivariate analysis.

Table 2.1: Financial data for most admired and least admired firms

Group 1: Most admired firms          Group 2: Least admired firms
Firm No.   EBITASS    ROTC           Firm No.   EBITASS    ROTC
 1          0.158      0.182          13        -0.012     -0.031
 2          0.210      0.206          14         0.036      0.053
 3          0.207      0.188          15         0.038      0.036
 4          0.280      0.236          16        -0.063     -0.074
 5          0.197      0.193          17         0.054     -0.119
 6          0.227      0.173          18         0.000     -0.005
 7          0.148      0.196          19         0.005      0.039
 8          0.254      0.212          20         0.091      0.122
 9          0.079      0.147          21        -0.036      0.072
10          0.149      0.128          22         0.045      0.064
11          0.200      0.150          23        -0.026      0.024
12          0.187      0.191          24         0.016      0.026
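As a quick numerical check of this univariate separation, here is a minimal sketch assuming numpy; the arrays are transcribed from Table 2.1 (EBITASS shown; ROTC is analogous):

```python
# Univariate look at how well one ratio separates the two groups.
import numpy as np

ebitass_g1 = np.array([0.158, 0.210, 0.207, 0.280, 0.197, 0.227,
                       0.148, 0.254, 0.079, 0.149, 0.200, 0.187])
ebitass_g2 = np.array([-0.012, 0.036, 0.038, -0.063, 0.054, 0.000,
                       0.005, 0.091, -0.036, 0.045, -0.026, 0.016])

# Group means are far apart relative to the spread of the data
print(ebitass_g1.mean(), ebitass_g2.mean())  # ~0.191 vs ~0.012
# Ranges barely overlap: group 1 min 0.079 vs group 2 max 0.091
print(ebitass_g1.min(), ebitass_g2.max())
```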

On the other hand, consider the case where data on four financial ratios, X1, X2, X3 and X4, are available for the two groups of firms. Figure 2.1 portrays the distribution of each financial ratio, with the curve on the left representing the most admired firms and the curve on the right the least admired firms. From the figure it is apparent that the difference between the most admired and least admired firms is greater with respect to the ratios X1 and X2 than with respect to the ratios X3 and X4. That is, X1 and X2 are the variables that provide the best discrimination between the two groups. Variables providing the best discrimination among the groups are called discriminator variables.

Figure 2.1: Distributions of the financial ratios X1, X2, X3 and X4 for the two groups of firms

3. Some Terminologies
Linear Discriminant Function: An equation of the form

Z = W1X1 + W2X2 + … + WnXn

where Z is the discriminant score, Wi is the discriminant weight associated with the ith independent variable, and Xi is the ith independent variable. It is assumed here that the population covariance matrices are equal and of full rank. If the assumption of equal covariance matrices is rejected, a quadratic discriminant function can be used for classification; however, for small sample sizes the linear discriminant function has been found to perform better than the quadratic one, since the number of parameters to be estimated for the quadratic function is nearly double.

Discriminant Score: Referred to as the Z score, it is defined by the discriminant function. The discriminant score of each individual is calculated by multiplying each variable value by its corresponding weight and summing the products, as in the sketch following these definitions.

Cutting Score or Cutoff Value: The criterion (score) against which each individual's discriminant score is judged to determine the group into which the individual is classified.

Discriminant Weight: Also called the discriminant coefficient; its value is determined by the variance structure of the original variables. Independent variables with large discriminatory power have large weights, and vice versa.

Discriminator Variables: Variables providing the best discrimination between the various groups.
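The following minimal sketch illustrates these terms; the weights and the cutoff here are hypothetical, chosen only for illustration:

```python
# Discriminant score Z = w1*X1 + w2*X2 judged against a cutting score.
import numpy as np

w = np.array([0.8, 0.3])        # hypothetical discriminant weights
x_new = np.array([0.15, 0.17])  # one observation on (X1, X2)

z = w @ x_new                   # discriminant (Z) score
cutoff = 0.08                   # hypothetical cutting score
group = 1 if z > cutoff else 2  # classify by comparing Z to the cutoff
print(z, group)
```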

4. Two Groups Discriminant Analysis
Discriminant analysis has the following objectives:
1. Selection of discriminator variables.
2. Development of the discriminant function.
3. Classification of future observations.
These objectives are discussed below for the two-group case.

4.1 Selection of Discriminator Variables
Table 4.1 gives the means and standard deviations of the two groups given in Table 2.1. The difference in the means of the two groups is assessed by the independent-sample t-test.

Table 4.1: Means, standard deviations and t-values for the independent-sample t-test

Variables   Group 1 Mean   Group 1 Std. Dev.   Group 2 Mean   Group 2 Std. Dev.   t-value (calculated)
EBITASS     0.1913         0.051               0.012          0.041               9.367
ROTC        0.1835         0.029               0.017          0.063               8.337

The t-values, 9.367 for EBITASS and 8.337 for ROTC, are used to test the equality of the means of the two groups for the two ratios. The tests suggest that the two groups are significantly different with respect to both financial ratios at a significance level of 0.05 (the table value of t is 2.06); that is, both financial ratios do discriminate between the two groups and consequently can be used to form the discriminant function. This conclusion is based on the univariate approach only: a separate independent t-test is done for each financial ratio, as in the sketch below. A preferred approach is to perform a multivariate test in which both financial ratios are tested simultaneously.
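A minimal sketch of the two pooled-variance t-tests, assuming SciPy is available, with the data transcribed from Table 2.1. The printed figures are rounded, so the computed t-values will be of the magnitude reported in Table 4.1 rather than exactly equal to them:

```python
# Independent-sample t-tests for each ratio (pooled variances, consistent
# with the equal-covariance assumption). Data transcribed from Table 2.1.
import numpy as np
from scipy.stats import ttest_ind

ebitass_g1 = np.array([0.158, 0.210, 0.207, 0.280, 0.197, 0.227,
                       0.148, 0.254, 0.079, 0.149, 0.200, 0.187])
ebitass_g2 = np.array([-0.012, 0.036, 0.038, -0.063, 0.054, 0.000,
                       0.005, 0.091, -0.036, 0.045, -0.026, 0.016])
rotc_g1 = np.array([0.182, 0.206, 0.188, 0.236, 0.193, 0.173,
                    0.196, 0.212, 0.147, 0.128, 0.150, 0.191])
rotc_g2 = np.array([-0.031, 0.053, 0.036, -0.074, -0.119, -0.005,
                    0.039, 0.122, 0.072, 0.064, 0.024, 0.026])

for name, g1, g2 in [("EBITASS", ebitass_g1, ebitass_g2),
                     ("ROTC", rotc_g1, rotc_g2)]:
    t, p = ttest_ind(g1, g2, equal_var=True)  # pooled-variance t-test
    print(name, round(t, 3), round(p, 4))     # large t, tiny p: reject H0
```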

4.2 Development of the Linear Discriminant Function
The objective of discriminant analysis is to maximize the ratio of the between-groups sum of squares to the within-group sum of squares, which results in the best discrimination between the groups. This is achieved by identifying a linear combination Z, referred to as the linear discriminant function, that best discriminates between the groups. The projection of a point onto the discriminant function is called the discriminant score. Let the linear combination, or discriminant function, that forms the new variable be

Z = w1 EBITASS + w2 ROTC    …(4.1)


where Z is the discriminant function and w1 and w2 are the weights assigned to the two variables. The objective of discriminant analysis is to identify the weights w1 and w2 such that

λ = SSB/SSW    …(4.2)

is maximized, where SSB is the pooled between-groups sum of squares and SSW is the pooled within-group sum of squares. SSB is obtained by pooling the between-groups sums of squares of the individual variables,

SSBj = n1(x̄j1 − x̄j.)² + n2(x̄j2 − x̄j.)²

where n1 and n2 are the numbers of observations in group 1 and group 2, x̄j1 and x̄j2 are the means of the jth variable (j = 1, 2) in the first and second groups, and x̄j. is the mean of the jth variable for the whole data. The within-group sum of squares SSW is obtained by pooling the sums of squared deviations of the observations from their respective group means. The discriminant function given by Equation (4.1), obtained by maximizing Equation (4.2), is referred to as Fisher's linear discriminant function.

4.3 Classification of the Future Observations
One of the objectives of the discriminant function is to classify observations into predefined groups. Classification is generally a separate procedure from discrimination, but it is sometimes treated as part of the discriminant analysis. Here the classification of the observations is done using the discriminant scores. Figure 4.1 represents a one-dimensional plot of the discriminant scores, commonly referred to as a plot of the observations in the discriminant space. Classification proceeds as follows. First, the discriminant space is divided into two mutually exclusive and collectively exhaustive regions, R1 and R2. As there is only one discriminant score, a single point divides the one-dimensional space into the two regions; that value of the discriminant score is called the cutoff value. Next, the discriminant score of a given firm is plotted in the discriminant space, and the firm is classified as most admired or least admired according as its score falls in region R1 or R2; equivalently, a firm is classified as most admired or least admired according as its computed discriminant score is greater than or less than the cutoff value. A minimal end-to-end sketch follows.
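This sketch assumes synthetic stand-in data for Table 2.1 and uses the closed-form weights derived later in Section 5 (proportional to the inverse pooled covariance matrix times the mean difference), with the cutoff taken, as one common choice, at the midpoint of the two group mean scores:

```python
# Estimate discriminant weights, score the firms, classify by a cutoff.
import numpy as np

rng = np.random.default_rng(1)                  # synthetic stand-in data
X1 = rng.normal([0.19, 0.18], 0.05, (12, 2))    # group 1 (12 firms x 2 ratios)
X2 = rng.normal([0.01, 0.02], 0.05, (12, 2))    # group 2

# Pooled covariance matrix (proportional to the within-group SSCP matrix W,
# so the weights below are proportional to W^{-1}(xbar1 - xbar2))
S_pooled = ((len(X1) - 1) * np.cov(X1.T) +
            (len(X2) - 1) * np.cov(X2.T)) / (len(X1) + len(X2) - 2)
w = np.linalg.solve(S_pooled, X1.mean(0) - X2.mean(0))

z1, z2 = X1 @ w, X2 @ w                 # discriminant scores per group
cutoff = (z1.mean() + z2.mean()) / 2    # midpoint cutting score (one choice)
z_new = np.array([0.15, 0.17]) @ w      # score of a new firm
print("group 1" if z_new > cutoff else "group 2")
```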

Figure 4.1: Plot of the discriminant scores in the one-dimensional discriminant space

5. Fisher's Linear Discriminant Function
In deriving the discriminant function, Fisher's idea was to transform the multivariate observation X to a univariate observation ξ such that the ξ's derived from the different populations were separated as much as possible. Fisher suggested taking linear


combinations of X to create the ξ's because they are simple functions of X and are easily handled. His approach does not assume that the populations are normal, but it does assume that the population covariance matrices of the different populations are equal. Let X be a p×1 random vector whose variance-covariance matrix is Σ, and let T (p×p) be the total SSCP matrix (the matrix of sums of squares and cross products, which can be computed by multiplying the total covariance matrix by its degrees of freedom). Let γ be a p×1 vector of weights. The discriminant function is then given by

ξ = X'γ    …(5.1)

The sum of squares of the resulting discriminant scores is given by

ξ'ξ = (X'γ)'(X'γ) = γ'XX'γ = γ'Tγ    …(5.2)

where T = XX' is the total SSCP matrix for the p variables. Since T = B + W, where B and W are, respectively, the between-groups and within-group SSCP matrices for the p variables (obtained by multiplying the pooled between-groups covariance matrix and the pooled within-group covariance matrix by their corresponding degrees of freedom), Equation (5.2) can be written as

ξ'ξ = γ'(B + W)γ = γ'Bγ + γ'Wγ    …(5.3)

In Equation (5.3), γ'Bγ and γ'Wγ are, respectively, the between-groups and within-group sums of squares for the discriminant score ξ. The objective of discriminant analysis is to estimate the weight vector γ of the discriminant function given by Equation (5.1) such that

λ = γ'Bγ / γ'Wγ    …(5.4)

is maximized. The vector of weights γ can be obtained by differentiating λ with respect to γ and equating to zero. That is,

δλ/δγ = [2(Bγ)(γ'Wγ) − 2(γ'Bγ)(Wγ)] / (γ'Wγ)² = 0

Dividing throughout by γ'Wγ (permissible since γ'Wγ is a positive definite quadratic form, so γ'Wγ > 0 for all non-null γ),

2(Bγ) − 2[(γ'Bγ)/(γ'Wγ)](Wγ) = 0


or

2(Bγ − λWγ)/(γ'Wγ) = 0    [as λ = γ'Bγ/γ'Wγ]

(B − λW)γ = 0

(W⁻¹B − λI)γ = 0    …(5.5)

Equation (5.5) is a system of homogeneous equations, and for a non-trivial solution

|W⁻¹B − λI| = 0    …(5.6)

That is, the problem reduces to finding the eigenvalues and eigenvectors of the non-symmetric matrix W⁻¹B, with the eigenvectors giving the weights for forming the discriminant function, as in the sketch below.
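A minimal sketch of this eigen computation, assuming SciPy; the B and W matrices below are illustrative values, not estimates from the text:

```python
# Solve B @ gamma = lambda * W @ gamma, which is equivalent to the
# eigenproblem of W^{-1} B; eigh handles the symmetric pair (B, W) directly
# and guarantees real eigenvalues.
import numpy as np
from scipy.linalg import eigh

B = np.array([[0.40, 0.30],   # between-groups SSCP (illustrative values)
              [0.30, 0.25]])
W = np.array([[0.20, 0.05],   # within-group SSCP (illustrative values)
              [0.05, 0.15]])

eigvals, eigvecs = eigh(B, W)               # ascending eigenvalues
lam, gamma = eigvals[-1], eigvecs[:, -1]    # largest lambda and its vector
print(lam, gamma)  # lam maximizes gamma'B gamma / gamma'W gamma
```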

For the two-group case, Equation (5.5) can be simplified further. For this, consider the between-groups SSCP matrix, which for g groups is generally defined as

Bs = Σᵢ ni(μi − μ)(μi − μ)',  i = 1, …, g

where ni is the number of units in the ith group, g is the number of groups, μi is the mean vector of the ith group, and μ is the overall mean vector. In the two-group case, denote Bs simply by B; then

B = n1(μ1 − μ)(μ1 − μ)' + n2(μ2 − μ)(μ2 − μ)'

Substituting μ = (n1μ1 + n2μ2)/(n1 + n2),

B = n1[μ1 − (n1μ1 + n2μ2)/(n1 + n2)][μ1 − (n1μ1 + n2μ2)/(n1 + n2)]'
  + n2[μ2 − (n1μ1 + n2μ2)/(n1 + n2)][μ2 − (n1μ1 + n2μ2)/(n1 + n2)]'

  = [n1n2²/(n1 + n2)²](μ1 − μ2)(μ1 − μ2)' + [n2n1²/(n1 + n2)²](μ2 − μ1)(μ2 − μ1)'

  = [n1n2(n1 + n2)/(n1 + n2)²](μ1 − μ2)(μ1 − μ2)'

or

B = [n1n2/(n1 + n2)](μ1 − μ2)(μ1 − μ2)'    …(5.7)

where μ1 and μ2 are, respectively, the p×1 mean vectors for group 1 and group 2. Letting c = n1n2/(n1 + n2), Equation (5.7) reduces to

B = c(μ1 − μ2)(μ1 − μ2)'

Therefore Equation (5.5) can be written as

[W⁻¹c(μ1 − μ2)(μ1 − μ2)' − λI]γ = 0

W⁻¹c(μ1 − μ2)(μ1 − μ2)'γ = λγ

(c/λ)W⁻¹(μ1 − μ2)(μ1 − μ2)'γ = γ    …(5.8)

Now, since (μ1 − μ2)'γ is a scalar, Equation (5.8) can be written as

γ = kW⁻¹(μ1 − μ2),  where k = (c/λ)(μ1 − μ2)'γ    …(5.9)

Because (μ1 − μ2)'γ is a scalar, k is a constant. Since the within-group variance-covariance matrix Σw is proportional to W, and it is assumed that Σ1 = Σ2 = Σw = Σ, Equation (5.9) can also be written as

γ = kΣ⁻¹(μ1 − μ2)    …(5.10)

Assuming the value one for the constant k, Equation (5.10) can also be written as

γ = Σ⁻¹(μ1 − μ2)  or  γ' = (μ1 − μ2)'Σ⁻¹    …(5.11)
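A minimal sketch of Equation (5.11) with sample quantities plugged in; the matrix and mean difference below are assumed illustrative values, not estimates from the text:

```python
# Fisher weights as W^{-1}(xbar1 - xbar2), i.e. Equation (5.11) with k = 1.
import numpy as np

W = np.array([[0.20, 0.05],            # within-group SSCP (illustrative)
              [0.05, 0.15]])
mean_diff = np.array([0.179, 0.167])   # xbar1 - xbar2 (illustrative values)

gamma = np.linalg.solve(W, mean_diff)
print(gamma)
# Any k != 0 rescales gamma, the scores and the cutoff together, so the
# weights are unique only in a relative sense, as noted in the text.
```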

The discriminant function given by Equation (5.11) is Fisher's discriminant function. Obviously, different values of the constant k give different values of γ, and hence the absolute weights of the discriminant function are not unique; the weights are unique only in a relative sense.

6. Evaluating the Significance of Discriminating Variables
The first step in discriminant analysis is to assess the significance of the discriminating variables, that is, to determine whether the selected discriminating variables significantly


differentiate between the two groups. A formal statistical test for the difference between the means of the two groups is as follows. The null and alternative hypotheses for each discriminating variable are:

H0: μ1EBITASS = μ2EBITASS
H1: μ1EBITASS ≠ μ2EBITASS

where μ1EBITASS and μ2EBITASS are the means of the variable EBITASS for group 1 and group 2, respectively. This hypothesis can be tested using an independent-sample t-test. Alternatively, one can use the Wilks' Λ test statistic, computed using the following formula:

Λ = SSw/SSt    …(6.1)

where SSw (the pooled within-group sum of squares) is obtained from the SSCPw matrix, which in turn is obtained by adding the respective sums of squares and cross products of the groups, and SSt (the total sum of squares) is obtained from the SSCPt matrix, which can be computed by multiplying St by the total degrees of freedom (n1 + n2 − 1). Note that the smaller the value of Λ, the stronger the evidence against the null hypothesis. To assess the statistical significance of Wilks' Λ, it can be converted into an F-ratio using the transformation

F = [(1 − Λ)/Λ] × [(n1 + n2 − p − 1)/p]    …(6.2)

Given that the null hypothesis is true, this F-ratio follows the F-distribution with p and (n1 + n2 − p − 1) degrees of freedom. A minimal sketch of Equations (6.1) and (6.2) follows.
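This sketch uses synthetic group data (assumed, not from the text) and a single variable, so p = 1:

```python
# Univariate Wilks' Lambda and its F transformation.
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(2)
g1 = rng.normal(0.19, 0.05, 12)          # synthetic group 1 data
g2 = rng.normal(0.01, 0.05, 12)          # synthetic group 2 data
x = np.concatenate([g1, g2])

ss_w = ((g1 - g1.mean())**2).sum() + ((g2 - g2.mean())**2).sum()
ss_t = ((x - x.mean())**2).sum()
lam = ss_w / ss_t                        # Wilks' Lambda, Equation (6.1)

p = 1
n1, n2 = len(g1), len(g2)
F = ((1 - lam) / lam) * (n1 + n2 - p - 1) / p   # Equation (6.2)
print(lam, F, f_dist.sf(F, p, n1 + n2 - p - 1)) # small p-value: reject H0
```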

7. Selection of Discriminator Variables and Determination of the Number of Discriminant Functions
Since discriminant analysis involves the inversion of within-group matrices, the accuracy of the computations is severely affected if the matrices are singular or nearly singular (i.e., some of the discriminator variables are highly correlated or are linear combinations of other variables). The tolerance level provides a control for the desired computational accuracy, or the degree of multicollinearity one is willing to tolerate. The tolerance of a variable is equal to 1 − R², where R² is the square of the multiple correlation coefficient between that variable and the other variables in the discriminant function. The higher the value of R², the lower the tolerance, and vice versa; that is, the tolerance is a measure of the amount of multicollinearity among the discriminator variables. If the tolerance of a given variable is less than a specified value, the variable is not included in the discriminant function (see the sketch below). The maximum number of discriminant functions that can be computed is the minimum of G − 1 and p, where G is the number of groups and p is the number of variables. As the number of groups here is two, only one discriminant function is possible.
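A minimal sketch of the tolerance check, where R² comes from regressing each candidate variable on the remaining discriminators; the 0.01 threshold and the data are assumed for illustration, not prescribed by the text:

```python
# Tolerance = 1 - R^2 for each variable against the other discriminators.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(24, 3))
X = np.column_stack([X, X[:, 0] + 0.01 * rng.normal(size=24)])  # near-copy

for j in range(X.shape[1]):
    others = np.column_stack([np.ones(24), np.delete(X, j, axis=1)])
    # Least-squares fit of variable j on the remaining variables
    beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    resid = X[:, j] - others @ beta
    r2 = 1 - resid.var() / X[:, j].var()
    tol = 1 - r2                       # low tolerance = high collinearity
    print(j, round(tol, 4), "drop" if tol < 0.01 else "keep")
```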


8. Statistical Significance of the Discriminant Function
Differences in the means of the two groups for each individual discriminator variable were tested using the univariate Wilks' Λ test statistic in Section 6. However, with more than one discriminator variable it is important to test the differences between the two groups for all the variables jointly, or simultaneously. This test has the following null and alternative hypotheses:

H0: (μ1EBITASS, μ1ROTC)' = (μ2EBITASS, μ2ROTC)'
H1: H0 is not true

The test statistic for this multivariate hypothesis is a direct generalization of the univariate Wilks' Λ statistic and is given by

Λ = |SSCPw| / |SSCPt|

where Λ is the ratio of the determinants of the within-group and total sum of squares and cross products matrices. Wilks' Λ can be approximated by a chi-square statistic using the following transformation (where g is the number of groups and p is the number of variables):

χ² = −[n − 1 − (p + g)/2] ln Λ

The χ² statistic is distributed as a chi-square with p(g − 1) degrees of freedom. Since the discriminant function is a linear combination of the discriminator variables, it can also be concluded that the discriminant function is statistically significant; that is, the means of the discriminant scores of the two groups are significantly different.

9. Assessing the Importance of the Discriminator Variables
If discriminant analysis is done on standardized data, the resulting function is called a standardized canonical discriminant function. A separate analysis is not required, however: standardized coefficients can be calculated from the unstandardized coefficients using the transformation

b̂*j = b̂j ŝj

where b̂*j, b̂j and ŝj are, respectively, the standardized coefficient, the unstandardized coefficient, and the pooled standard deviation of variable j. The standardized coefficients of EBITASS and ROTC are 0.743 and 0.305 respectively, obtained by pooling the variation of each variable from Σw. A sketch of both computations follows.
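This sketch covers the multivariate Wilks' Λ test of Section 8 and the standardization of Section 9; all inputs are synthetic stand-ins for the example data, and the unstandardized coefficients use the k = 1 convention of Section 5:

```python
# Multivariate Wilks' Lambda, its chi-square approximation, and
# standardized coefficients b*_j = b_j * s_j.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
X1 = rng.normal([0.19, 0.18], 0.05, (12, 2))   # synthetic group 1
X2 = rng.normal([0.01, 0.02], 0.05, (12, 2))   # synthetic group 2
X = np.vstack([X1, X2])
n, p, g = len(X), 2, 2

sscp_w = (len(X1) - 1) * np.cov(X1.T) + (len(X2) - 1) * np.cov(X2.T)
sscp_t = (n - 1) * np.cov(X.T)
lam = np.linalg.det(sscp_w) / np.linalg.det(sscp_t)   # Wilks' Lambda

chi_sq = -(n - 1 - (p + g) / 2) * np.log(lam)
print(lam, chi_sq, chi2.sf(chi_sq, p * (g - 1)))      # test of H0

# Standardized coefficients from the unstandardized ones
b = np.linalg.solve(sscp_w / (n - g), X1.mean(0) - X2.mean(0))
s_pooled = np.sqrt(np.diag(sscp_w / (n - g)))   # pooled std. deviations
print(b * s_pooled)                             # judge relative importance
```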

Standardized coefficients are normally used to judge the relative importance of the discriminator variables forming the discriminant function: the greater the standardized coefficient, the greater the relative importance, and vice versa. It appears that ROTC is less important than EBITASS in forming the discriminant function. However, caution should be exercised before making such an interpretation when the variables are correlated among themselves; depending on the degree of multicollinearity present in the sample data, the relative importance of the variables may change from sample to sample.

10. Multiple Discriminant Analysis (MDA)
When one is interested in discriminating among more than two groups, multiple discriminant analysis is the appropriate procedure. Generally, discriminant analysis is useful in situations where the total sample is divided into groups based on variables characterizing several known classes. The primary objective of MDA is to understand group differences and to predict the likelihood that an entity (an individual or an object) will belong to a particular class or group based on several variables. For example, consider the following situations:
- A marketing manager is interested in determining the factors that best discriminate among the groups of heavy, medium and light users of a given product category.
- The management of a telephone company is interested in identifying the characteristics that best discriminate among households that have one, two, three, or more than three phone connections.
Each of these examples involves discrimination among three or more groups, and MDA is suitable for such purposes. The objectives of MDA are the same as for two-group discriminant analysis, with one additional objective: to identify the minimum number of discriminant functions that provide most of the discrimination among the groups, because it may not be possible to represent all the differences among the groups by a single discriminant function.

10.1 Analytical Approach of MDA
The objectives and mechanics of multiple-group discriminant analysis are quite similar to those of two-group discriminant analysis. First, a univariate analysis can be done to determine whether each of the discriminating variables significantly discriminates among the groups. This can be achieved by an overall F-test, which is significant if the means of at least one pair of groups differ significantly. Having identified the discriminating variables, the next step is to estimate the discriminant functions. Suppose the first discriminant function is

Z1 = W11X1 + W12X2 + … + W1pXp

where Wij is the weight of the jth variable in the ith discriminant function. The weights of the first discriminant function are estimated such that

λ1 = (between-groups SS of Z1) / (within-group SS of Z1)

is maximized. Suppose the second discriminant function is given by

Z2 = W21X1 + W22X2 + … + W2pXp

The weights of this discriminant function are estimated such that

λ2 = (between-groups SS of Z2) / (within-group SS of Z2)

is maximized, subject to the constraint that the discriminant scores Z1 and Z2 are uncorrelated. The procedure is repeated until all possible discriminant functions are identified. Once the discriminant functions are identified, the next step is to determine a rule for classifying future observations. The classification procedure involves dividing the discriminant space into g mutually exclusive and collectively exhaustive regions. To classify a given observation, its discriminant scores are computed, the observation is plotted in the discriminant space, and it is classified into the group in whose region it falls.

10.2 Evaluating the Significance of the Variables
In almost all cases the distribution of the transformed value of Wilks' Λ follows an F-distribution. As an example, consider a four-group discriminant analysis with two measured variables, X1 and X2. The F-statistic is used to test the following univariate null and alternative hypotheses for each discriminating variable:

H0: μ1 = μ2 = μ3 = μ4
H1: at least one pair of means differs

where μ1, μ2, μ3 and μ4 are, respectively, the population means for groups 1, 2, 3 and 4. The null hypothesis is rejected if the means of at least one pair of groups are significantly different at the 5% level of significance; that is, one or more pairs of groups are significantly different with respect to the means of X1 and X2.

10.3 How Many Discriminant Functions?
The main question in MDA is how many discriminant functions to retain so as to adequately represent the differences among the groups. This question can be answered by evaluating the significance (both statistical and practical) of each discriminant function. Not all of the k discriminant functions need be statistically significant; only r (r < k) of them may be, and these r functions may suffice to represent the differences among the groups. A closing sketch follows.
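This multi-group sketch uses synthetic four-group data and assumes scikit-learn; it shows the maximum number of functions, min(g − 1, p), and the share of the between-group discrimination each function carries:

```python
# MDA sketch: number of discriminant functions and their relative strength.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
centers = [(0, 0), (3, 0), (0, 3), (3, 3)]                 # synthetic groups
X = np.vstack([rng.normal(c, 1.0, (25, 2)) for c in centers])
y = np.repeat(np.arange(4), 25)

g, p = 4, 2
print("max functions:", min(g - 1, p))  # here: 2

lda = LinearDiscriminantAnalysis(solver="eigen").fit(X, y)
# Proportion of discrimination captured by each function; one would retain
# only the leading r functions that carry most of it.
print(lda.explained_variance_ratio_)
```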