Missing Attribute Value Prediction Based on Artificial Neural Network ...

2008 International Conference on BioMedical Engineering and Informatics

Missing Attribute Value Prediction Based on Artificial Neural Network and Rough Set Theory

N.A. Setiawan, P.A. Venkatachalam, A.F.M. Hani
Electrical and Electronic Engineering Department, Universiti Teknologi PETRONAS, Bandar Seri Iskandar, 31750 Tronoh, Perak, MALAYSIA
E-mail: [email protected], [email protected]

Singular Value Decomposition (SVD) based method (SVDImpute), weighted k-nearest neighbors (KNNImpute), and row average. Bhattacharya, Shrestha and Solomatine [6] used ANN to reconstruct missing wave data in sedimentation modeling. The curse of dimensionality is one of the main factors that prevent ANN from reaching good performance [7]. Feature selection and attribute reduction are needed to overcome the curse of dimensionality. Li, Manry, Narasimha, and Yu [8] proposed a piecewise linear network using orthonormal least squares (PLN-OLS) to rank and select features. Rough set theory can also reduce the dimensionality of attributes and select features without loss of information through its reduct and core concepts [9]. In this research, ANN with RST attribute reduction is proposed to predict simulated missing attribute values in heart disease data from the University of California Irvine (UCI) database. The prediction accuracy is compared with that of ANN without input attribute reduction, ANN with PLN-OLS feature selection, k-NN, and the most common attribute value method, a simple and common method of filling in missing data.

Abstract

In this research, an artificial neural network (ANN) combined with rough set theory (RST), named ANNRST, is proposed to predict missing attribute values. The prediction is applied to heart disease data from the UCI datasets. The ANN used is a multilayer perceptron (MLP) with resilient back-propagation learning. RST can reduce the dimensionality of attributes through its reduct. The reduct, combined with the decision attribute, is used as the input of the ANN. Using simulated missing values, the prediction accuracy of ANN is compared with that of ANNRST. The accuracy of ANNRST is also compared with missing data imputation by k-Nearest Neighbor (k-NN), the most common attribute value method, and ANN with piecewise linear network-orthonormal least squares (PLN-OLS) feature selection. Simulation results show that ANNRST can predict the missing values with a maximum accuracy close to that of ANN without dimensionality reduction (pure ANN), and that it outperforms k-NN, the most common attribute value method, and ANN with PLN-OLS.

Keywords: neural network, rough set theory, missing value.

2. Backpropagation neural network

The back-propagation ANN, or multilayer perceptron (MLP), has been the most common network applied to a variety of applications. It consists of processing elements (nodes, or neurons) connected by weights. The knowledge is represented as weights between the layers. The model used in this research is a three-layer network with eight nodes in the hidden layer and a single node in the output layer. The number of input nodes is determined by the result of feature selection and attribute reduction. Figure 1 shows the topology of an ANN with six input nodes, eight hidden nodes and a single output node.

1. Introduction

Missing attribute values are a common problem in knowledge discovery from data (KDD) processes. This problem may arise during data collection because of device and human error. The quality of KDD depends on the quality of the data, and ignoring the missing data may degrade the performance of KDD. Grzymala-Busse and Hu [1] compared several approaches to missing attribute values in data mining. Li and Cercone introduced new approaches, RSFit and ItemRSFit, for processing data with missing attribute values based on rough set theory (RST), distance based methods and association rules [2-4]. Troyanskaya, et al. [5] implemented and evaluated three methods for missing value estimation for DNA microarrays: a

978-0-7695-3118-2/08 $25.00 © 2008 IEEE DOI 10.1109/BMEI.2008.322


The relation in Equation (4) induces a partition of the universe into sets of objects that are indiscernible using only the attributes in $B$. The sets into which the objects are divided are called equivalence classes, denoted $[x]_B$. If a new attribute that represents some classification of the objects is added to the information system, the system is called a decision system, which is defined as:

$$S = (U, A \cup \{d\}) \tag{5}$$

where $d$ represents the decision attribute. The elements of $A$ are called conditional attributes or conditions. Let $S = (U, A)$ and let $B \subseteq A$ be a subset of attributes. The approximation of a set of objects $X$ using only the information in $B$ is defined by the following sets.

B-lower approximation of X:

$$\underline{B}X = \{x \mid [x]_B \subseteq X\} \tag{6}$$

B-upper approximation of X:

$$\overline{B}X = \{x \mid [x]_B \cap X \neq \emptyset\} \tag{7}$$

The lower approximation contains all objects that certainly belong to the set $X$. The upper approximation contains all objects that possibly belong to the set $X$.

B-boundary region of X:

$$BR_B(X) = \overline{B}X - \underline{B}X \tag{8}$$

This set contains the objects that cannot be classified as either definitely inside $X$ or definitely outside $X$. A set is rough if $BR_B(X) \neq \emptyset$.

Sometimes not all the knowledge in an information system is necessary to divide the objects into classes. In this case the knowledge can be reduced. Reducing the knowledge results in reducts. A reduct is a minimal set of attributes $B \subseteq A$ such that

$$IND_A(B) = IND_A(A) \tag{9}$$

Reducts can be computed based on discernibility matrices and discernibility functions. A discernibility matrix of an information system $S$ is a symmetric $n \times n$ matrix with entries:

$$c_{ij} = \{a \in A \mid a(x_i) \neq a(x_j)\} \quad \text{for } i, j = 1, \dots, n \tag{10}$$
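The approximations in Equations (6) to (8) can be illustrated with a minimal Python sketch. The function names `partition` and `approximations` and the toy table are hypothetical and are not part of the original work; the values are invented purely for illustration.

```python
def partition(table, attrs):
    """Group object ids into equivalence classes [x]_B for the attributes in attrs."""
    groups = {}
    for obj, row in table.items():
        groups.setdefault(tuple(row[a] for a in attrs), set()).add(obj)
    return list(groups.values())


def approximations(table, attrs, target):
    """Return the B-lower approximation, B-upper approximation and boundary of target."""
    lower, upper = set(), set()
    for cls in partition(table, attrs):
        if cls <= target:   # class lies entirely inside X: certain members, Eq. (6)
            lower |= cls
        if cls & target:    # class overlaps X: possible members, Eq. (7)
            upper |= cls
    return lower, upper, upper - lower  # boundary region, Eq. (8)


# Toy information system (hypothetical values, not taken from the heart disease data).
table = {
    "x1": {"chol": "high", "age": "old"},
    "x2": {"chol": "high", "age": "old"},
    "x3": {"chol": "low", "age": "old"},
    "x4": {"chol": "low", "age": "young"},
}
X = {"x1", "x3"}
lower, upper, boundary = approximations(table, ["chol", "age"], X)
```

Here x1 and x2 are indiscernible but only x1 belongs to X, so both fall in the boundary region and X is rough with respect to the two attributes.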

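Figure 1. ANN topology with six inputs and single hidden layer. [Diagram: inputs $x_1, \dots, x_6$; hidden nodes $y_1^1, \dots, y_8^1$; output node $y_1^2$; input-to-hidden weights $w_{11}, w_{12}, \dots, w_{67}, w_{68}$.]

$$y_j^1 = f\left(\sum_{i=1}^{6} x_i w_{ij}\right) \tag{1}$$

$$y_1^2 = f\left(\sum_{j=1}^{8} y_j^1\right) \tag{2}$$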

Here f is the activation function, which can be a sigmoid function. The input neurons are indexed by i and the hidden neurons are indexed by j.
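The forward pass in Equations (1) and (2) can be sketched in plain Python as follows. This is only an illustrative sketch: the function name `mlp_forward`, the constant weights, and the hidden-to-output weights `w_out` are assumptions (Equation (2) is printed without weights, whereas a standard MLP weights each hidden output), and `math.tanh` stands in for the sigmoid-type activation.

```python
import math


def mlp_forward(x, w_hidden, w_out, f=math.tanh):
    """Forward pass of a perceptron with one hidden layer and one output node.

    x        : input values, one per input node
    w_hidden : w_hidden[i][j] is the weight from input i to hidden node j
    w_out    : w_out[j] is the weight from hidden node j to the output node
               (assumed here; Eq. (2) is printed without weights)
    """
    n_hidden = len(w_out)
    # Eq. (1): each hidden node applies f to the weighted sum of the inputs.
    hidden = [
        f(sum(x[i] * w_hidden[i][j] for i in range(len(x))))
        for j in range(n_hidden)
    ]
    # Eq. (2) with the assumed output weights: f of the weighted sum of hidden outputs.
    return f(sum(hidden[j] * w_out[j] for j in range(n_hidden)))


# Six inputs and eight hidden nodes, as in Figure 1 (all values are illustrative).
x = [0.5, -0.2, 0.1, 0.0, 0.3, -0.4]
w_hidden = [[0.1] * 8 for _ in range(6)]
w_out = [0.5] * 8
y = mlp_forward(x, w_hidden, w_out)
```

Setting every entry of `w_out` to 1 reproduces Equation (2) exactly as printed.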

3. Rough set theory

Rough set theory deals with the analysis of the classification of a set of objects that may involve vagueness of knowledge. An explanation of RST can be found in [10]. An information system is defined as:

$$S = (U, A) \tag{3}$$

where $U$ represents the universal set and $A$ represents a non-empty finite set of attributes such that $a : U \to V_a$ for every $a \in A$, where $V_a$ is the value set of $a$. Let $B \subseteq A$. Then each such subset defines an equivalence relation called an indiscernibility relation, $IND_A(B)$, which is defined as:

$$IND_A(B) = \{(x, y) \in U^2 \mid \forall a \in B,\ a(x) = a(y)\} \tag{4}$$
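As a small illustration of Equation (4), the sketch below groups objects that agree on every attribute in B. The function name `equivalence_classes` and the toy table are hypothetical; the attribute values are invented and do not come from the heart disease data.

```python
from collections import defaultdict


def equivalence_classes(table, attrs):
    """Partition object ids so that objects with equal values on attrs share a class."""
    groups = defaultdict(set)
    for obj, row in table.items():
        key = tuple(row[a] for a in attrs)  # objects with the same key are indiscernible
        groups[key].add(obj)
    return list(groups.values())


# Toy information system (hypothetical values).
table = {
    "x1": {"age": "old", "fbs": 0},
    "x2": {"age": "old", "fbs": 0},
    "x3": {"age": "young", "fbs": 0},
    "x4": {"age": "young", "fbs": 1},
}
classes_age = equivalence_classes(table, ["age"])
classes_all = equivalence_classes(table, ["age", "fbs"])
```

Adding the second attribute refines the partition: x3 and x4 are indiscernible on age alone but are separated once fbs is included.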


A discernibility function $f_A$ for an information system $S$ is a Boolean function of $m$ Boolean variables $a_1^*, \dots, a_m^*$ (corresponding to the attributes $a_1, \dots, a_m$) defined as

$$f_A(a_1^*, \dots, a_m^*) = \bigwedge \left\{ \bigvee c_{ij}^* \mid 1 \le j \le i \le n,\ c_{ij} \neq \emptyset \right\} \tag{11}$$

where $c_{ij}^* = \{a^* \mid a \in c_{ij}\}$. By finding the set of all prime implicants of the discernibility function, all the minimal reducts of the system can be determined. For a decision system $S$, an approximation of the decision $d$ can be found by constructing its decision-relative discernibility matrix. The process of computing this matrix is called computing the discernibility matrix modulo the decision attribute.
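The link between Equations (10) and (11) and reducts can be sketched with a brute-force search. Because the discernibility function is a conjunction of disjunctions of unnegated variables, its prime implicants correspond to the minimal attribute subsets that intersect every non-empty entry of the discernibility matrix. The function names and the toy table below are hypothetical, and exhaustive enumeration is only practical for a handful of attributes.

```python
from itertools import combinations


def discernibility_entries(table, attrs):
    """Non-empty entries c_ij of the discernibility matrix (Eq. (10)), for j < i."""
    objs = list(table)
    entries = []
    for i in range(len(objs)):
        for j in range(i):
            c = {a for a in attrs if table[objs[i]][a] != table[objs[j]][a]}
            if c:
                entries.append(c)
    return entries


def reducts(table, attrs):
    """All minimal attribute subsets that intersect every non-empty entry c_ij."""
    entries = discernibility_entries(table, attrs)
    found = []
    for size in range(len(attrs) + 1):          # smallest subsets first
        for subset in combinations(attrs, size):
            s = set(subset)
            if any(r <= s for r in found):
                continue                         # contains a smaller reduct: not minimal
            if all(s & c for c in entries):
                found.append(s)
    return found


# Toy information system (hypothetical values).
table = {
    "x1": {"a": 0, "b": 0, "c": 0},
    "x2": {"a": 0, "b": 1, "c": 1},
    "x3": {"a": 1, "b": 0, "c": 1},
    "x4": {"a": 1, "b": 1, "c": 0},
}
all_reducts = reducts(table, ["a", "b", "c"])
```

In this toy table no single attribute discerns all pairs of objects, but any two attributes do, so there are three reducts of size two.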


thirteen conditional attributes, so these three attributes are removed. The original data has thirteen conditional attributes and a single decision attribute. Only ten conditional attributes are used in this research because of data completeness. Preprocessing removes instances that have too many missing values and removes outliers based on the statistical mean and standard deviation. After preprocessing, 597 instances are available. The final attributes are shown in Table 1. The diagnosis of coronary artery disease is the decision attribute. Discretisation is conducted using Boolean reasoning.

If $M(A) = (c_{ij})$ is the discernibility matrix of a decision system $S$, the decision-relative discernibility matrix of $S$ is defined as $M^d(A) = (c_{ij}^d)$, where $c_{ij}^d = \emptyset$ if $d(x_i) = d(x_j)$ and $c_{ij}^d = c_{ij} - \{d\}$ otherwise. In this research, the concept of reduct is used to reduce the dimension of input attributes for ANN.
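A minimal sketch of the decision-relative discernibility matrix follows. The function name `decision_relative_entries` and the toy decision table are hypothetical. Entries are built from the conditional attributes only, which gives the same result as removing d from an entry computed over all attributes.

```python
def decision_relative_entries(table, cond_attrs, decision):
    """Entries c_ij^d for j < i, keyed by the pair of object ids."""
    objs = list(table)
    entries = {}
    for i in range(len(objs)):
        for j in range(i):
            xi, xj = table[objs[i]], table[objs[j]]
            if xi[decision] == xj[decision]:
                # Same decision value: the pair need not be discerned.
                entries[(objs[i], objs[j])] = set()
            else:
                # Different decision: keep the conditional attributes that differ.
                entries[(objs[i], objs[j])] = {
                    a for a in cond_attrs if xi[a] != xj[a]
                }
    return entries


# Toy decision system (hypothetical values); num plays the role of d.
table = {
    "x1": {"age": "old", "chol": "high", "num": 1},
    "x2": {"age": "old", "chol": "low", "num": 0},
    "x3": {"age": "young", "chol": "low", "num": 0},
}
m_d = decision_relative_entries(table, ["age", "chol"], "num")
```

Objects x2 and x3 share the same decision, so their entry is empty even though they differ in age.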

4. Combined system (ANNRST)

RST is used to reduce the number of ANN inputs. For example, in the heart disease problem there are ten input attributes before attribute reduction. Using the RST reduct concept, the number of inputs can be reduced to fewer than ten, e.g., six attributes. The topology of ANNRST is shown in Figure 2.

5.2. Simulation of missing value and reduct computation

Missing values are simulated for the fbs attribute by arbitrarily removing its values. Ten to ninety values were removed from the 597 instances of the fbs attribute and treated as missing values. Reduct computation using ROSETTA [12] with Boolean reasoning discretisation and Johnson's algorithm results in a reduct with five attributes: age, trestbps, chol, thalach and oldpeak.
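For readers unfamiliar with Johnson's algorithm, the following sketch shows the usual greedy idea: repeatedly pick the attribute that occurs in the most remaining discernibility entries until every entry is covered. It returns a single reduct candidate rather than all reducts. The function name `johnson_reduct`, the alphabetical tie-breaking, and the example entries are assumptions; ROSETTA's own implementation may differ in details such as weighting and tie-breaking.

```python
from collections import Counter


def johnson_reduct(entries):
    """Greedy cover of the non-empty discernibility entries."""
    remaining = [set(c) for c in entries if c]
    reduct = []
    while remaining:
        counts = Counter(a for c in remaining for a in c)
        # Most frequent attribute; ties are broken alphabetically (an assumption).
        best = min(counts, key=lambda a: (-counts[a], a))
        reduct.append(best)
        # Entries containing the chosen attribute are now discerned.
        remaining = [c for c in remaining if best not in c]
    return reduct


# Hypothetical discernibility entries using attribute names from Table 1.
entries = [
    {"age", "chol"},
    {"chol"},
    {"age", "thalach"},
    {"oldpeak", "chol"},
]
selected = johnson_reduct(entries)
```

In this example chol covers three of the four entries, and age is then chosen to cover the remaining one.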


5.3. Experiment


ANN is used to predict the simulated missing values of the fbs attribute. Pure ANN uses ten input nodes. ANN with RST dimensionality reduction (ANNRST) and ANN with PLN-OLS attribute ranking (ANN-PLN-OLS) use six input nodes. The attributes used for ANNRST are the attributes of the computed reduct plus the decision attribute (num). The six attributes selected by PLN-OLS feature ranking are thalach, cp, age, restecg, trestbps and oldpeak. Ten to ninety simulated missing values of the fbs attribute are predicted using pure ANN, ANNRST and ANN-PLN-OLS. For each number of simulated missing values, one hundred simulations with random initial weights are conducted in MATLAB with Nguyen-Widrow weight initialization, min-max scaling, resilient backpropagation training, and the tansig activation function at the hidden and output layers. The k-NN imputation method and most common attribute value filling are also computed to find their accuracy for the same cases. The comparison is made by calculating the maximum and average accuracy over the 100 simulation runs for the ANN based methods. The results of the ANN based methods are not always consistent between simulation runs: they depend on the initial weights even though the other parameters are kept constant. Iteration is stopped when there is no significant change in its

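Figure 2. ANNRST topology with single hidden layer. [Diagram: inputs X1, ..., Xm enter the RST Attribute Reducer, whose outputs X’1, ..., X’n feed the ANN.]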

X1, ..., Xm are the inputs of the ANN before RST attribute reduction. X’1, ..., X’n are the inputs of the ANN after attribute reduction, with m > n.

5. Experiment and results

5.1. Data and preprocessing

The heart (coronary artery) disease data comes from the data mining repository at UCI [11]. The data contains 920 instances. The cleveland data is the most complete. The swiss data is the most incomplete, so it is not used in this research. The longbeach and hungarian data have many missing values in three of


mean squared error (MSE), to avoid the overfitting problem. The average accuracy indicates how stably an ANN based method reaches its best result.
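The evaluation procedure for the k-NN baseline can be sketched as follows: hide some values of the target attribute, predict them from the nearest complete rows, and score the predictions. The paper does not state the value of k or the distance measure, so k = 3, squared Euclidean distance, the function names, and the toy rows are all assumptions; real attributes would also need scaling before distances are computed.

```python
import random
from collections import Counter


def simulate_missing(n_rows, n_missing, seed=0):
    """Choose which row indices will have the target value hidden."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), n_missing))


def knn_impute(rows, target, missing_idx, k=3):
    """Predict rows[i][target] by majority vote among the k nearest complete rows."""
    missing = set(missing_idx)
    known = [i for i in range(len(rows)) if i not in missing]
    features = [a for a in rows[0] if a != target]
    preds = {}
    for i in missing_idx:
        def dist(j, i=i):
            # Squared Euclidean distance on the non-target attributes.
            return sum((rows[i][a] - rows[j][a]) ** 2 for a in features)
        nearest = sorted(known, key=dist)[:k]
        votes = Counter(rows[j][target] for j in nearest)
        preds[i] = votes.most_common(1)[0][0]
    return preds


def accuracy(rows, target, preds):
    """Percentage of hidden values that were predicted correctly."""
    hits = sum(rows[i][target] == p for i, p in preds.items())
    return 100.0 * hits / len(preds)


# Toy data (hypothetical): fbs is 1 whenever the single feature x is at least 5.
rows = [{"x": float(v), "fbs": int(v >= 5)} for v in range(10)]
preds = knn_impute(rows, "fbs", [2, 7], k=3)
acc = accuracy(rows, "fbs", preds)
hidden = simulate_missing(len(rows), 4, seed=1)
```

For the ANN based methods the same accuracy function would be applied to each of the 100 runs, and the maximum and mean of the resulting values would be reported.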

this case (fbs attribute), the discrete or symbolic value is only "0" or "1", and the value "0" makes up almost 90% of the total. The result would be different if the values were more varied, e.g. more than two kinds of symbolic value. ANN and ANNRST can reach 100% maximum accuracy, with average accuracies of 70.2% and 73.3% for ANN and ANNRST respectively. RST data reduction has a positive effect on both maximum and average accuracy.
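The dependence of most common value filling on the class balance can be seen in a short sketch. The function name `most_common_value_fill` and the synthetic column, in which exactly 90% of the values are 0, are assumptions made for illustration and are not the actual fbs data.

```python
from collections import Counter


def most_common_value_fill(values, missing_idx):
    """Fill every missing position with the most frequent observed value."""
    missing = set(missing_idx)
    observed = [v for i, v in enumerate(values) if i not in missing]
    mode = Counter(observed).most_common(1)[0][0]
    return {i: mode for i in missing_idx}


# Synthetic binary column: every tenth value is 1, so 90% of the values are 0.
fbs = ([0] * 9 + [1]) * 10
missing_idx = list(range(0, 100, 3))          # 34 positions are treated as missing
filled = most_common_value_fill(fbs, missing_idx)
hits = sum(fbs[i] == v for i, v in filled.items())
acc = 100.0 * hits / len(filled)
```

Every missing position receives the value 0, so the accuracy is simply the share of zeros among the hidden values (30 of 34 here), which is why this baseline scores well on a heavily skewed attribute.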

5.4. Comparison results and discussion

The accuracy of the imputation methods is compared in Figure 3 and Figure 4. Figure 3 shows that ANNRST is comparable with ANN in maximum accuracy and is better than ANN-PLN-OLS and most common value filling. The k-NN method shows the worst accuracy in this case. For average accuracy, shown in Figure 4, ANNRST is slightly better than ANN and ANN-PLN-OLS.

Figure 3. Comparison graph for max accuracy vs missing data. [Plot: x-axis "Missing data" (10 to 90); y-axis "Max accuracy (%)"; series: ANN, ANNRST, ANN-PLN-OLS, kNN, Most common value.]

Figure 4. Comparison graph for average accuracy vs missing data. [Plot: x-axis "Missing data" (10 to 90); y-axis "Average accuracy (%)"; series: ANN, ANNRST, ANN-PLN-OLS, kNN, Most common value.]

Table 1. Summary of attributes (UCI heart disease database)

| Attribute | Description | Value description |
|---|---|---|
| age | Age | Numerical value |
| sex | Sex | 1 if male; 0 if female |
| cp | Chest pain type | 1 typical angina; 2 atypical angina; 3 non-anginal pain; 4 asymptomatic |
| trestbps | Resting systolic blood pressure on admission to the hospital (mmHg) | Numerical value |
| chol | Serum cholesterol (mg/dl) | Numerical value |
| fbs | Fasting blood sugar over 120 mg/dl? | 1 if yes; 0 if no |
| restecg | Resting electrocardiographic results | 0 normal; 1 having ST-T wave abnormality; 2 LV hypertrophy |
| thalach | Maximum heart rate achieved | Numerical value |
| exang | Exercise induced angina? | 1 if yes; 0 if no |
| oldpeak | ST depression induced by exercise relative to rest | Numerical value |
| num | Diagnosis of heart disease (angiographic disease status / presence of coronary artery disease (CAD)) | 0 if less than 50% diameter narrowing in any major vessel (CAD no); 1 if more than 50% (CAD yes) |

6. Conclusion

In this paper, ANN with RST attribute reduction (ANNRST) is proposed. Compared with the k-NN imputation method, ANNRST gives better maximum and average accuracy. ANNRST also gives better maximum accuracy than most common value filling, even though its average accuracy is below that of most common value filling. The RST attribute reduction approach also gives better results than PLN-OLS feature ranking when applied to ANN based imputation methods. ANNRST gives comparable maximum accuracy and better average accuracy than pure ANN. Thus RST can be used

The average accuracy of the ANN based methods is still better than the accuracy of the k-NN method, especially for large numbers of missing values. The RST and PLN-OLS attribute reduction methods seem to improve on pure ANN in average accuracy. The most common value filling method outperforms the ANN based methods in average accuracy. This method depends strongly on the values of the attribute that has the missing data. For


to reduce the dimensionality of ANN based imputation methods without loss of accuracy, which makes the ANN topology simpler.

[5] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, R.B. Altman, "Missing value estimation methods for DNA microarrays", Bioinformatics 17(6), 2001, pp. 520-525.
[6] B. Bhattacharya, D.L. Shrestha, D.P. Solomatine, "Neural networks in reconstructing missing wave data in sedimentation modeling", Proceedings of the XXXth IAHR Congress, Greece, August 2003.
[7] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1994.
[8] J. Li, M.T. Manry, P.L. Narasimha, C. Yu, "Feature selection using a piecewise linear network", IEEE Transactions on Neural Networks, 17(5), September 2006, pp. 1101-1115.
[9] Q. Shen, A. Chouchoulas, "Rough set-based dimensionality reduction for supervised and unsupervised learning", Int. J. Appl. Math. Comput. Sci., Vol. 11, 2001, pp. 583-601.
[10] T.R. Hvidsten, Fault diagnosis in rotating machinery using rough set theory and ROSETTA, Technical Report, Norwegian University of Science and Technology, 1999.
[11] D.J. Newman, S. Hettich, C.L. Blake, C.J. Merz, UCI Repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, CA: University of California, Department of Information and Computer Science, 1998.
[12] A. Ohrn, ROSETTA, http://www.idi.ntnu.no/~aleks/rosetta, 1999.

7. Acknowledgment

We would like to thank Universiti Teknologi PETRONAS for the kind support of presenting this paper at BMEI 2008.

8. References

[1] J.W. Grzymala-Busse, M. Hu, "A comparison of several approaches to missing attribute values in data mining", RSCTC 2000, LNAI, 2005, pp. 378-385.
[2] J. Li, N. Cercone, Comparisons on different approaches to assign missing attribute values, Technical Report, CS-2006-04, School of Computer Science, University of Waterloo, January 2006.
[3] J. Li, N. Cercone, "Assigning missing attribute values based on rough set theory", IEEE GrC 2006, Atlanta USA, May 2006.
[4] J. Li, N. Cercone, Predicting missing attribute values based on Frequent Itemset and RSFit, Technical Report, CS-2006-13, School of Computer Science, University of Waterloo, April 2006.
