Jurnal Teknologi Maklumat dan Sains Kuantitatif

Jilid 4, Bil. 1, 2002

JURNAL TEKNOLOGI MAKLUMAT DAN SAINS KUANTITATIF

Kandungan (Contents)                                               Muka Surat

The effects of nonnormality on the performance of the linear
discriminant function for two dependent populations
    Yap Bee Wah, Ong Seng Huat                                              1

Examination timetabling with genetic algorithms
    Azman Yasin, Nusnasran Puteh, Hatim Mohamad Tahir                      11

Numerical solution for one dimensional thermal problems using
the finite element method
    Hisham Bin Md. Basir                                                   25

Pemodelan tuntutan insurans bagi perbelanjaan perubatan (kajian kes)
    Noriszura Hj. Ismail, Yeoh Sing Yee                                    45

Applications of Leverenz theorem in univalent functions
    Aishah Sheikh Abdullah                                                 59

Keutamaan pemilihan bidang dan tempat pengajian: Pendekatan konjoin kabur
    Nadzri Bin Mohamad, Abu Osman Bin Md. Tap                              67

Universiti Teknologi MARA

Jurnal Teknologi Maklumat & Sains Kuantitatif, Jilid 4, Bil. 1, 2002 (1-9)

The effects of nonnormality on the performance of the linear discriminant function for two dependent populations

Yap Bee Wah
Faculty of Information Technology and Quantitative Sciences, University of Technology MARA, 40450 Shah Alam

Ong Seng Huat
Institute of Mathematical Sciences, University of Malaya, 50603 Kuala Lumpur

Abstract

The linear discriminant function (LDF) is based upon a parametric normal model whose population parameters are estimated from sample data. However, real-life data do not usually satisfy the parametric normal model. A Monte Carlo simulation study is carried out to investigate the effects of nonnormality on the performance of the LDF when the variables are correlated with positively skewed distributions and the two populations are jointly distributed (dependent). It is found that the LDF is quite robust to slight departures from the normality assumption, but its performance is adversely affected when the distributions are highly skewed.

Keywords: Dependent Populations; Linear Discriminant Function; Simulation; Nonnormality; Skewed Distributions

1. Introduction

The linear discriminant function (LDF) is based upon parametric normal models, with the population parameters estimated from the sample data. Lachenbruch et al. (1973) and Clarke et al. (1979) have shown that, for independent populations, nonnormality can adversely affect the error rates (Rawlings et al., 1986). This study is motivated by the fact that real-life data do not usually satisfy the parametric normal models and may involve variables which are skewed. We extend the study by Clarke et al. (1979) to the case of two jointly distributed or dependent populations involving covariates. Covariates are variables which by themselves do not have any discriminatory power but which, owing to their correlation with other variables, might prove useful when combined with them. The motivation for considering dependent populations is given by the following example. Consider the data on the trace elemental concentration levels of breast cancer tissues analysed recently by Ng et al. (1997). In this data set, concentrations of the trace elements in normal and malignant breast tissues were measured for each of twenty-six female patients. The two groups of normal and malignant tissues constitute the populations studied. Since each patient provides one observation for each of the normal and malignant groups, the two populations are dependent.

e-mail: [email protected]; [email protected]


The objectives of this study are to examine the robustness of the LDF and its classification performance, in terms of the probability of correct classification (PCC), when the variables are correlated with positively skewed distributions.

2. Classification model for two dependent populations

This study considers classification involving two dependent sets of p variates and q covariates, where x = (x_1, ..., x_p, z_1, ..., z_q)' and y = (y_1, ..., y_p, z_1, ..., z_q)' are taken from multivariate normal populations π₁ and π₂. It is assumed that x and y are dependent and jointly follow a multivariate normal distribution:

$$\begin{pmatrix} x \\ y \end{pmatrix} \sim N_{2(p+q)}\!\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma & \rho^{*}\Sigma \\ \rho^{*}\Sigma & \Sigma \end{pmatrix} \right) \qquad (1)$$

where x and y are (p+q) × 1 random vectors consisting of p discriminators and q covariates; μ_i, i = 1, 2, are (p+q) × 1 mean vectors; |ρ*| < 1; and Σ is a (p+q) × (p+q) covariance matrix. Note that the q covariates are common to both x and y.

In classification without covariate adjustment, the q covariates are included as variates to form the usual plug-in likelihood ratio classification statistic (McLachlan, 1992) given by

$$W = \left[z - \tfrac{1}{2}(\bar{x} + \bar{y})\right]' S^{-1} (\bar{x} - \bar{y}) > k \qquad (2)$$

where $k = \ln\!\left(\dfrac{c(1|2)\,p_2}{c(2|1)\,p_1}\right)$, $p_i$ is the prior probability and $c(i|j)$ is the cost of misclassifying an observation belonging to π_j into π_i. Assuming equal priors and equal costs of misclassification, we get k = 0 and allocate z to π₁ if W > 0; otherwise, allocate z to π₂.

Let Σ be partitioned as

$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$

where Σ₁₁ is a (p × p) covariance matrix, Σ₂₂ is a (q × q) covariance matrix, Σ₁₂ is a (p × q) covariance matrix, and $\beta = \Sigma_{22}^{-1}\Sigma_{21}$. With covariate adjustment the classification statistic is given by

$$W_c = \left[z_x - \tfrac{1}{2}(\hat{\mu}_1 + \hat{\mu}_2)\right]' S_{11\cdot 2}^{-1} (\hat{\mu}_1 - \hat{\mu}_2) > k \qquad (3)$$

where $\hat{\mu}_1 = \bar{x}^{(1)} - \hat{\beta}'\bar{z}$ and $\hat{\mu}_2 = \bar{y}^{(1)} - \hat{\beta}'\bar{z}$. This is obtained from (2) by substituting the unconditional quantities with the corresponding conditional ones.
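As an illustration of the plug-in rule (2), the statistic W can be computed from training samples using the pooled sample covariance matrix as the estimate of Σ. This is a minimal sketch, not the authors' code; the function name and the equal-priors defaults are illustrative.

```python
import numpy as np

def plug_in_ldf(z, x_sample, y_sample, p1=0.5, p2=0.5, c12=1.0, c21=1.0):
    """Plug-in LDF statistic W of (2); allocate z to pi_1 when W > k."""
    xbar, ybar = x_sample.mean(axis=0), y_sample.mean(axis=0)
    n1, n2 = len(x_sample), len(y_sample)
    # Pooled sample covariance matrix as the estimate of Sigma.
    S = ((n1 - 1) * np.cov(x_sample, rowvar=False)
         + (n2 - 1) * np.cov(y_sample, rowvar=False)) / (n1 + n2 - 2)
    W = (z - 0.5 * (xbar + ybar)) @ np.linalg.solve(S, xbar - ybar)
    k = np.log((c12 * p2) / (c21 * p1))  # k = 0 for equal priors and costs
    return W, (1 if W > k else 2)
```

With equal priors and equal costs k = 0, so the sign of W alone decides the allocation.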


3. Simulation Methodology

Without loss of generality, it is assumed that the means and covariance matrix in (1), after a linear transformation (x → Ax + b) to canonical form, are given by

$$\mu_1 = 0; \quad \mu_2 = (\Delta, 0, \ldots, 0)'; \quad \Sigma = I_{p+q}$$

where Δ² is the Mahalanobis distance and I_{p+q} is an identity matrix. Training samples from multivariate normal distributions with these parameters are generated and then transformed. We consider the lognormal distribution used by Clarke et al. (1979), where each element of the multivariate normal vectors is transformed as follows:

$$t_i = \exp[(x_i - \gamma)/\delta] \qquad (4)$$

The values of δ are selected to give a range of skewness and kurtosis. The parameter γ in the transformation acts only as a multiplicative constant and can be eliminated from the S_L transforms. Hence, the parameters varied in this study are N (sample size), p, q, Δ, ρ* and δ. The values of these parameters chosen are:

N = 15, 30 and 60
p = 1, 3, 5 and 10
q = 1 and 2
δ = 2 and 5
Δ = 1, 2 and 4
ρ* = -0.8, -0.4, -0.1, 0.0, 0.1, 0.4, 0.8

Figure 1 displays the shapes of the normal and lognormal distributions generated using S-PLUS 2000. The physical representation of the parameter values for Δ is highlighted in Figure 2. Table 1 displays the mean, variance, skewness and kurtosis of the transformed distributions. Following Johnson et al. (1994, p. 211), the rth moment about zero of a lognormal random variable T is

$$\mu_r' = E(T^r) = \exp(r\xi)\,\omega^{r^2/2} \qquad (5)$$

where ω = exp(σ²), and the variance is

$$\mu_2 = \mu_2' - (\mu_1')^2 = e^{2\xi}(\omega^2 - \omega) = e^{2\xi}\,\omega(\omega - 1). \qquad (6)$$

The shape factors which determine the skewness and kurtosis are

$$\alpha_3 = (\omega + 2)(\omega - 1)^{1/2} \qquad (7)$$

and

$$\alpha_4 = \omega^4 + 2\omega^3 + 3\omega^2 - 3. \qquad (8)$$
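The moment formulas (5)-(8) can be evaluated directly for the two values of δ used in the study; a short check, with ξ = 0 and ω = exp(σ²) = exp(1/δ²) (variable names illustrative):

```python
import math

# Evaluate (5)-(8) with xi = 0 and omega = exp(1 / delta^2).
rows = {}
for delta in (2, 5):
    w = math.exp(1.0 / delta**2)
    mean = math.sqrt(w)                        # mu'_1 = omega^(1/2), from (5)
    var = w * (w - 1.0)                        # (6) with xi = 0
    skew = (w + 2.0) * math.sqrt(w - 1.0)      # (7)
    kurt = w**4 + 2 * w**3 + 3 * w**2 - 3.0    # (8)
    rows[delta] = (round(mean, 2), round(var, 2), round(skew, 2), round(kurt, 2))
    print(delta, rows[delta])
# delta = 2 -> (1.13, 0.36, 1.75, 8.9); delta = 5 -> (1.02, 0.04, 0.61, 3.68)
```

These values reproduce the entries of Table 1.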


Note that α₃ > 0 and α₄ > 3; that is, the lognormal distributions are positively skewed and leptokurtic. In this study, since we choose γ = 0, we have ξ = 0; substituting ω = exp(σ²) = exp(1/δ²) for δ = 2 and 5 into (5)-(8), the following values for the mean, variance, skewness and kurtosis in Table 1 are obtained.

Table 1: Marginal moments for the sampled lognormal distributions

δ    Mean    Variance    Skewness    Kurtosis
2    1.13    0.36        1.75        8.90
5    1.02    0.04        0.61        3.68
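Table 1 can also be checked empirically by applying transform (4) with γ = 0 to standard normal draws; this is an illustrative sketch (sample size and seed are arbitrary choices, not from the paper):

```python
import numpy as np

# Apply transform (4), t = exp(x / delta), to standard normal draws and
# compare the sample moments with Table 1.
rng = np.random.default_rng(1)
x = rng.standard_normal(200_000)
stats = {}
for delta in (2, 5):
    t = np.exp(x / delta)                               # transform (4)
    skew = np.mean((t - t.mean())**3) / t.std()**3      # sample skewness
    stats[delta] = (t.mean(), t.var(ddof=1), skew)
    print(f"delta={delta}: mean={t.mean():.2f}, "
          f"var={t.var(ddof=1):.2f}, skew={skew:.2f}")
```

The sample moments agree with the theoretical values in Table 1 up to Monte Carlo error.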

[Plot: densities of the normal, lognormal(δ = 2) and lognormal(δ = 5) distributions.]

Figure 1: Normal and lognormal distributions generated using S-PLUS 2000

The algorithm for generating samples in the simulation experiment is given below.

a) Generate multivariate normal vectors x and y from the model in (1) after the linear transformation.
b) For each vector of observations, transform element-wise using (4).
c) Construct the LDF (2) using sample estimates from the transformed training samples. Classify each observation in the sample using the leave-one-out method and obtain the probability of correct classification, PCC.
d) Repeat (a)-(c) for M simulation runs and calculate the expected PCC, i.e.

$$\overline{PCC} = \sum_{m=1}^{M} PCC_m \,/\, M$$
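The steps (a)-(d) above can be sketched for the simplest case. This is a minimal illustration for p = 1 discriminator with no covariates (q = 0), keeping the dependence ρ* of model (1); the function and its defaults are illustrative, not the authors' implementation.

```python
import numpy as np

def simulate_pcc(N=30, Delta=2.0, delta=2.0, rho=0.4, M=50, seed=0):
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])   # model (1), canonical form, p = 1
    pccs = []
    for _ in range(M):
        # (a) dependent normal pairs (x_j, y_j) with means 0 and Delta
        xy = rng.multivariate_normal([0.0, Delta], cov, size=N)
        # (b) element-wise lognormal transform (4), gamma = 0
        x, y = np.exp(xy[:, 0] / delta), np.exp(xy[:, 1] / delta)
        correct = 0
        # (c) leave-one-out classification with the plug-in LDF (2)
        for i in range(N):
            x_tr, y_tr = np.delete(x, i), np.delete(y, i)
            s2 = (x_tr.var(ddof=1) + y_tr.var(ddof=1)) / 2.0
            mid = 0.5 * (x_tr.mean() + y_tr.mean())
            diff = x_tr.mean() - y_tr.mean()
            correct += ((x[i] - mid) * diff / s2) > 0    # z from pi_1
            correct += ((y[i] - mid) * diff / s2) <= 0   # z from pi_2
        pccs.append(correct / (2.0 * N))
    # (d) expected PCC averaged over the M runs
    return float(np.mean(pccs))
```

With Δ = 2 the optimal PCC under normality is Φ(1) ≈ 0.84; the averaged leave-one-out PCC on the transformed data falls somewhat below this, illustrating the effect of skewness.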

[Figure 2: Densities of populations π₁ and π₂ for Δ = 1, 2 and 4, showing the misclassification probabilities P(2|1) and P(1|2). The optimal PCC is Φ(Δ/2); for example, Φ(1) = 0.8413 for Δ = 2.]