
Feature Selection Techniques on Thyroid, Hepatitis, and Breast Cancer Datasets

Mohammad Ashraf, Girija Chetty, Dat Tran
Faculty of Information Science and Engineering, University of Canberra
Australian Capital Territory, AUSTRALIA
{mohammad.baniahmad, girija.chetty, dat.tran}@canberra.edu.au

Abstract

In the last century, the challenge was to develop new technologies for storing large amounts of data. Today, the challenge is to use these incredible amounts of data effectively and to extract knowledge that benefits business, scientific, and government applications by working with a subset of features rather than the full feature set. In this paper, we focus on feature selection techniques as a means of obtaining high quality attributes that enhance the mining process. Feature selection touches every discipline that requires knowledge discovery from large data. In our study, we compare benchmark feature selection methods on three well-known datasets using three well-recognized machine learning algorithms. The study found that feature selection methods can improve the performance of learning algorithms. However, no single feature selection method best satisfies all datasets and learning algorithms. Machine learning researchers should therefore understand the nature of their datasets and the characteristics of their learning algorithms in order to obtain better outcomes. Overall, correlation based feature selection (CFS) and consistency based subset evaluation (CB) performed better than information gain (IG), symmetrical uncertainty (SU), Relief (RF), and principal components analysis (PC).

Keywords: Feature Selection Methods; Data Mining; Breast Cancer Dataset; Thyroid; Hepatitis.

I. INTRODUCTION

Nowadays, we are able to collect and generate more data than ever before. Contributing factors include the steady progress of computer hardware technology for storing data and the computerization of business, scientific, and government transactions. In addition, the use of the internet as a global information system has flooded us with an incredible amount of data and information. Data mining has emerged through the natural evolution of information technology, and it has attracted considerable attention from information systems researchers in recent years due to the wide availability of large amounts of data and the need to turn such data into knowledge and useful patterns. The knowledge and patterns gained can be used in many fields such as marketing, business analysis, and health information systems [1].

Such volumes of data imply the presence of low quality, unreliable, redundant and noisy data, which hampers the process of discovering knowledge and useful patterns and makes knowledge discovery during training more difficult. Researchers have therefore felt the need to produce more reliable data from large numbers of records, for example by using feature selection methods. Feature selection, or attribute subset selection, is the process of identifying and utilizing the most relevant attributes while removing as many redundant and irrelevant attributes as possible [2] [3]. Feature selection does not alter the original representation of the data in any way; it merely selects a promising subset. Recently, the motivation for applying feature selection techniques in machine learning has shifted from a theoretical exercise to a standard step in model building.

Many attribute selection methods treat the task as a search problem, where each state in the search space represents a distinct subset of the possible attributes [4]. Since the space is exponential in the number of attributes, producing a very large number of possible subsets, a heuristic search procedure is required for all but trivial datasets. The search procedure is combined with an attribute utility estimator in order to evaluate the relative merit of alternative subsets of attributes [2]. This large number of possible subsets, and the computational cost involved, motivates a benchmark of feature selection methods that identifies the best possible subsets with respect to both accuracy and computational overhead. The sketch below makes the search framing concrete.
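As an illustration only, the following minimal Python sketch shows greedy forward selection over the subset space; `merit` is an assumed placeholder for any subset evaluator (any of the measures described in Section III could play this role), not a specific library function or the exact search used in our experiments.

```python
# Minimal sketch of attribute selection as heuristic search (greedy forward
# selection). `merit` is an assumed placeholder for any subset evaluator.
from typing import Callable, FrozenSet

def forward_selection(features: list,
                      merit: Callable[[FrozenSet], float]) -> frozenset:
    """Hill-climb over the space of feature subsets, adding one feature at a
    time while the evaluator reports an improvement."""
    selected: frozenset = frozenset()
    best_score = merit(selected)
    improved = True
    while improved:
        improved = False
        best_feature, best_candidate_score = None, best_score
        for f in features:                 # try every single-feature addition
            if f in selected:
                continue
            score = merit(selected | {f})
            if score > best_candidate_score:
                best_feature, best_candidate_score = f, score
        if best_feature is not None:       # commit the best addition, if any
            selected |= {best_feature}
            best_score = best_candidate_score
            improved = True
    return selected
```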


In the current work, we focus on three disease datasets (Thyroid, Hepatitis, and Breast Cancer). Every year, millions of women around the world suffer from breast cancer, making it the second most common non-skin cancer after lung cancer, and the fifth leading cause of cancer death in the world [5]. Thyroid disorders are much more common in women than in men and may lead to thyroid cancer [6]. Hepatitis can be caused by chemicals, drugs, drinking too much alcohol, or by different kinds of viruses, and may lead to liver problems [7]. This paper begins with brief related work, followed by a description of benchmark feature selection methods, a description of our methodology, and the results obtained on the three datasets. Finally, we give a brief discussion and outline future work.

II. RELATED WORK

Numerous feature selection methods have been used in a broad range of domains. Hall and Holmes [2] presented a benchmark comparison of several attribute selection methods for supervised classification. Attribute selection was evaluated by cross-validating the attribute rankings with respect to two classification learners, C4.5 and naïve Bayes. Their results conclude that feature selection methods can enhance the performance of some learning algorithms, and that correlation based feature selection produced the best result among six different feature selection methods. Saeys et al. [8] reviewed the importance of feature selection in a set of well known bioinformatics applications, focusing on two main issues: large input dimensionality and small sample sizes. The authors found that feature selection methods could help researchers address these issues, and they expect feature selection to become fundamental in dealing with high dimensional applications.

The literature distinguishes two categories of feature selection: wrapper and filter. The wrapper evaluates and selects attributes based on accuracy estimates obtained from the target learning algorithm: using a given learning algorithm, the wrapper searches the feature space by omitting features and testing the impact of their omission on the prediction metrics. A feature whose omission makes a significant difference to the learning process clearly matters and should be considered a high quality feature. The filter, on the other hand, uses the general characteristics of the data itself and works independently of the learning algorithm. Specifically, a filter uses the statistical correlation between a set of features and the target feature; the amount of correlation between a feature and the target variable determines the importance of that feature [9]. A further category ranks a feature or set of features using algorithms that score attributes according to their contribution to a subset of attributes [1].

III. FEATURE SELECTION METHODS

A. Information Gain

The information gain method approximates the quality of each attribute using entropy, by estimating the difference between the prior entropy and the post entropy [10]. It is one of the simplest attribute ranking methods and is often used in text categorization. If $A$ is an attribute and $C$ is the class, the following equation gives the entropy of the class before observing the attribute:

$$H(C) = -\sum_{c \in C} p(c) \log_2 p(c) \qquad (1)$$

where () is the probability function of variable  . The conditional entropy of  given  (post entropy) is giving by: (|) = − ∑ () ∑ (|)



(|)

(2)

The information gain (the difference between the prior entropy and the post entropy) is given by the following equations:

$$IG(C, A) = H(C) - H(C|A) \qquad (3)$$

$$IG(C, A) = -\sum_{c \in C} p(c) \log_2 p(c) - \sum_{a \in A} \Big( -p(a) \sum_{c \in C} p(c|a) \log_2 p(c|a) \Big) \qquad (4)$$
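As a concrete illustration, here is a small Python sketch of equations (1)-(4) for discrete attributes. It assumes NumPy and is meant as a reference implementation of the ranking score, not the Weka code used later in our experiments.

```python
import numpy as np

def entropy(labels: np.ndarray) -> float:
    """Equation (1): H(C) = -sum_c p(c) log2 p(c), from label frequencies."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(attribute: np.ndarray, labels: np.ndarray) -> float:
    """Equations (2)-(4): IG(C, A) = H(C) - H(C|A) for a discrete attribute."""
    values, counts = np.unique(attribute, return_counts=True)
    p_a = counts / counts.sum()
    h_conditional = sum(p * entropy(labels[attribute == v])
                        for v, p in zip(values, p_a))
    return entropy(labels) - h_conditional
```

Ranking attributes then amounts to sorting them by this score in descending order.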


B. Correlation Based Feature Selection (CFS)

CFS is a simple filter algorithm that ranks feature subsets, scoring the merit of a feature or subset of features according to a correlation based heuristic evaluation function. The purpose of CFS is to find subsets containing features that are highly correlated with the class yet uncorrelated with each other; the remaining features are ignored. Redundant features are excluded because they will be highly correlated with one or more of the remaining features. The acceptance of a feature depends on the extent to which it predicts classes in areas of the instance space not already predicted by other features. CFS's feature subset evaluation function is as follows [11]:

$$Merit_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}} \qquad (5)$$

where $Merit_S$ is the worth of feature subset $S$ containing $k$ features, $\overline{r_{cf}}$ is the average feature-to-class correlation, and $\overline{r_{ff}}$ is the average feature-to-feature correlation. To approximate the correlation between features in this equation, CFS uses a modified information gain measure called symmetrical uncertainty, which compensates for information gain's bias towards attributes with more values [12]:

$$SU(X, Y) = 2\left[\frac{IG(X, Y)}{H(X) + H(Y)}\right] \qquad (6)$$
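The following sketch illustrates equations (5) and (6), reusing `entropy` and `information_gain` from the previous sketch. Discrete attribute values are assumed, as is a NumPy matrix `X` with one column per feature; it is an illustration rather than the Weka implementation.

```python
import numpy as np
from itertools import combinations

def symmetrical_uncertainty(x: np.ndarray, y: np.ndarray) -> float:
    """Equation (6): SU = 2 * IG(X, Y) / (H(X) + H(Y)), normalised to [0, 1]."""
    denom = entropy(x) + entropy(y)
    return 2.0 * information_gain(x, y) / denom if denom > 0 else 0.0

def cfs_merit(X: np.ndarray, y: np.ndarray, subset: list) -> float:
    """Equation (5): merit of a subset of column indices, using SU as the
    correlation measure between features and between feature and class."""
    k = len(subset)
    r_cf = float(np.mean([symmetrical_uncertainty(X[:, j], y) for j in subset]))
    if k == 1:
        return r_cf
    r_ff = float(np.mean([symmetrical_uncertainty(X[:, i], X[:, j])
                          for i, j in combinations(subset, 2)]))
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)
```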

C. Relief

One of the best known feature selection techniques is Relief. The aim is to measure the quality of attributes according to how well their values distinguish instances of different classes. Relief uses instance based learning (lazy learning, such as k Nearest Neighbour) to assign a grade to each feature; each feature's grade reflects its ability to distinguish among the class values. Features are ranked by weight, and those that exceed a user-determined threshold are selected to form the promising subset. For each sampled instance, the closest neighbouring instance of the same class (the nearest hit) and the closest instance of a different class (the nearest miss) are found. The score $W_i$ of the $i$-th attribute is computed as the average, over all examples, of the magnitude of the difference between the distance to the nearest hit and the distance to the nearest miss [13]:

$$W_i = W_i - \frac{\mathrm{diff}(i, R, h)}{m} + \frac{\mathrm{diff}(i, R, h')}{m} \qquad (7)$$

where $W_i$ is the grade for attribute $i$, $R$ is a randomly sampled instance, $h$ is the nearest hit, $h'$ is the nearest miss, and $m$ is the number of samples.
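Below is a minimal two-class Relief sketch under simplifying assumptions (numeric features scaled to [0, 1], `diff` taken as the absolute difference, Manhattan distance for nearest neighbours); the Weka implementation is more general.

```python
import numpy as np

def relief(X: np.ndarray, y: np.ndarray, m: int, seed: int = 0) -> np.ndarray:
    """Equation (7): accumulate, over m random samples, the per-feature
    difference to the nearest hit (penalty) and nearest miss (reward)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(m):
        r = rng.integers(n_samples)              # random sample instance R
        dist = np.abs(X - X[r]).sum(axis=1)      # Manhattan distance to R
        dist[r] = np.inf                         # never pick R itself
        same_class = (y == y[r])
        h = np.where(same_class, dist, np.inf).argmin()       # nearest hit
        h_miss = np.where(~same_class, dist, np.inf).argmin() # nearest miss
        w += (-np.abs(X[r] - X[h]) + np.abs(X[r] - X[h_miss])) / m
    return w  # features whose weight exceeds a user threshold are kept
```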

D. Principal Components Analysis (PC)

The purpose of PC is to reduce the dimensionality of a dataset containing a large number of correlated attributes by transforming the original attribute space into a new space in which the attributes are uncorrelated. The algorithm ranks the transformed attributes by how much of the variation in the original dataset they account for; the transformed attributes with the most variation are kept, and the rest are discarded. It is also worth noting that PC is applicable to unsupervised datasets because it does not take the class label into account [14].
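A bare-bones sketch of the transformation via eigen-decomposition of the covariance matrix follows; in practice a library routine such as scikit-learn's `sklearn.decomposition.PCA` would typically be used instead.

```python
import numpy as np

def pca_transform(X: np.ndarray, n_components: int) -> np.ndarray:
    """Project X onto the directions of greatest variance."""
    Xc = X - X.mean(axis=0)                  # centre each attribute
    cov = np.cov(Xc, rowvar=False)           # attribute covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1]        # rank components by variance
    return Xc @ eigvecs[:, order[:n_components]]
```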

E. Consistency Based Subset Evaluation (CB)

CB adopts the class consistency rate as its evaluation measure. The idea is to obtain a set of attributes that divides the original dataset into subsets each dominated by a single class [2]. One well known consistency based feature selector is the consistency metric proposed by Liu and Setiono [15]:

$$Consistency_S = 1 - \frac{\sum_{i=0}^{J} \left( |D_i| - |M_i| \right)}{N} \qquad (8)$$

where $S$ is a feature subset, $J$ is the number of distinct attribute-value combinations appearing for $S$, $|D_i|$ is the number of occurrences of the $i$-th attribute-value combination, $|M_i|$ is the cardinality of the majority class for the $i$-th attribute-value combination, and $N$ is the number of instances in the dataset. For continuous values, Chi2 [16] may be used; Chi2 automatically discretizes continuous feature values and removes irrelevant continuous attributes.
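A short sketch of equation (8) follows; `rows` holds each instance projected onto the candidate feature subset, and discrete values are assumed (a discretization step such as Chi2 would precede this for continuous attributes).

```python
from collections import Counter, defaultdict

def consistency(rows: list, labels: list) -> float:
    """Equation (8): 1 minus the inconsistency rate of a feature subset.
    Each distinct attribute-value combination D_i contributes its count
    minus the size of its majority class M_i."""
    per_combination = defaultdict(Counter)
    for row, label in zip(rows, labels):
        per_combination[tuple(row)][label] += 1  # class counts per combination
    inconsistency = sum(sum(c.values()) - max(c.values())
                        for c in per_combination.values())
    return 1.0 - inconsistency / len(labels)
```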


IV. EXPERIMENT METHODOLOGY

The experiment applied the attribute selection methods described above to three well-known datasets from the UCI machine learning repository [17]. We chose three datasets of varying sizes: the smallest contains 155 instances and the largest 3,772 instances, the number of attributes ranges from 9 to 30, and every dataset has 2 classes. We used the whole dataset as training data, with cross-validation as the testing method. Table I gives a full description of the datasets.

TABLE I. DATASET CHARACTERISTICS

    Dataset        Training   Testing            Attributes   Classes
    BreastCancer   699        Cross-validation   9            2
    Hepatitis      155        Cross-validation   20           2
    Thyroid        3772       Cross-validation   30           2

To obtain as fair a comparison as possible between the feature selection methods, we considered three machine learning algorithms from three categories of learning methods. The first algorithm is k nearest neighbours (kNN), from the lazy learning category. kNN is an instance-based classifier in which the class of a test instance is determined by the classes of the training instances most similar to it; distance functions, such as the Euclidean and Manhattan distances, are commonly used to measure similarity between instances [18]. The second algorithm is the Naïve Bayes classifier (NB), from the Bayesian category. NB is a simple probabilistic classifier based on Bayes' theorem, and it is one of the most efficient and effective learning algorithms for machine learning and data mining thanks to its conditional independence assumption (no attribute depends on any other) [19]. The last algorithm is the Random Tree (RT), a classification tree. RT classifies an instance into one of a predefined set of classes based on its attribute values, and such trees are frequently used in fields such as engineering, marketing, and medicine [20].

The software package used in this paper is the Waikato Environment for Knowledge Analysis (Weka), which provides an environment for running many machine learning algorithms and feature selection methods. Weka is open source machine learning software written in Java, and it includes data mining and machine learning methods for data pre-processing, classification, regression, clustering, association rules, and visualization [21]. The sketch below outlines the evaluation protocol.
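Our experiments were run in Weka, but the protocol is straightforward to reproduce; the sketch below shows an equivalent pipeline in Python with scikit-learn, where `load_data` is a hypothetical helper for one of the UCI datasets, `mutual_info_classif` stands in for an information-gain-style ranking, `DecisionTreeClassifier` stands in for Weka's Random Tree, and `k=5` is an arbitrary choice.

```python
# Hedged re-creation of the evaluation protocol with scikit-learn (the paper
# itself used Weka). load_data() is a hypothetical placeholder.
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = load_data()  # placeholder: feature matrix and class labels of a dataset

classifiers = {
    "NB": GaussianNB(),
    "kNN": KNeighborsClassifier(n_neighbors=3),
    "RT": DecisionTreeClassifier(random_state=0),
}
for name, clf in classifiers.items():
    baseline = cross_val_score(clf, X, y, cv=10).mean()     # original attributes
    with_fs = make_pipeline(SelectKBest(mutual_info_classif, k=5), clf)
    reduced = cross_val_score(with_fs, X, y, cv=10).mean()  # after selection
    print(f"{name}: {baseline:.4f} (all features) -> {reduced:.4f} (selected)")
```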

V. EXPERIMENT RESULTS

We use the notation "+", "-", and "=" to show how each feature selection method's classification performance compares with the original dataset (before feature selection): "+" denotes improvement, "-" denotes degradation, and "=" denotes no change. The experimental results of using Naïve Bayes (NB) as the machine learning algorithm on the three datasets (Thyroid, Hepatitis, and BreastCancer) are shown in Table II.

TABLE II. RESULTS FOR ATTRIBUTE SELECTION METHODS WITH NAÏVE BAYES

    Method   Thyroid    Hepatitis   BreastCancer
    NB       92.60%     84.52%      95.99%
    CFS      96.53%+    87.74%+     95.99%=
    IG       93.88%+    85.16%+     95.99%=
    RF       92.60%=    84.52%=     95.99%=
    PC       94.30%+    84.52%=     96.14%+
    CB       94.59%+    84.52%=     96.28%+
    SU       93.88%+    85.16%+     95.99%=

Table II shows that the classification accuracy of NB on the original Thyroid dataset is 92.60%, with improvements from the feature selection methods CFS, IG, PC, CB, and SU; the best result was achieved by CFS (96.53%). On the original Hepatitis dataset, the classification accuracy is 84.52%, with improvements from CFS, IG, and SU; the best result was again achieved by CFS (87.74%). On the original BreastCancer dataset, the classification accuracy is 95.99%, with improvements from PC and CB; the best result was achieved by CB (96.28%). Figure 1 illustrates the results in Table II.

The second machine learning classifier used to test the feature selection methods is kNN. The experimental results of using kNN on the three datasets (Thyroid, Hepatitis, and BreastCancer) are shown in Table III.

Fig. 1. Results for attribute selection methods with Naïve Bayes.

TABLE III. RESULTS FOR ATTRIBUTE SELECTION METHODS WITH kNN

    Method   Thyroid    Hepatitis   BreastCancer
    kNN      95.92%     81.94%      95.42%
    CFS      96.10%+    84.52%+     95.42%=
    IG       96.50%+    81.29%-     95.42%=
    RF       95.92%+    81.94%=     95.42%=
    PC       95.78%-    81.29%-     95.42%=
    CB       96.37%+    81.94%=     95.85%+
    SU       96.50%+    81.29%-     95.42%=

Table III shows that the classification accuracy of kNN on the original Thyroid dataset is 95.92%, with improvements from the feature selection methods CFS, IG, RF, CB, and SU; the best result was achieved by IG and SU (96.50%). On the original Hepatitis dataset, the classification accuracy is 81.94%, with an improvement from CFS but degradation from IG, PC, and SU; the best result was achieved by CFS (84.52%) and the worst results were obtained by IG, PC, and SU (81.29%). On the original BreastCancer dataset, the classification accuracy is 95.42%, and it was left unchanged by CFS, IG, RF, PC, and SU; only CB improved on it (95.85%). Figure 2 illustrates the results in Table III.

The last machine learning classifier in our experiment is the Random Tree (RT). The experimental results of using RT on the three datasets (Thyroid, Hepatitis, and BreastCancer) are shown in Table IV.

Fig. 2. Results for attribute selection methods with kNN.

TABLE IV. RESULTS FOR ATTRIBUTE SELECTION METHODS WITH RANDOM TREE

    Method   Thyroid    Hepatitis   BreastCancer
    RT       96.92%     76.77%      94.56%
    CFS      96.29%-    77.42%+     94.56%=
    IG       96.63%-    74.19%-     94.56%=
    RF       97.22%+    76.77%=     94.56%=
    PC       97.03%+    76.13%-     94.85%+
    CB       97.16%+    80.65%+     93.56%-
    SU       96.63%-    74.19%-     94.56%=

In Table IV we observe improvements in classification accuracy from applying RF, PC, and CB on the Thyroid dataset, and degradation from CFS, IG, and SU on the same dataset. The best performing feature selection method is RF (97.22%), while the worst is CFS (96.29%, still close to the accuracy on the original dataset). On the Hepatitis dataset, classification accuracy increased with CFS and CB and decreased with IG, PC, and SU. On the last dataset (BreastCancer), there is an improvement with PC, degradation with CB, and no change with CFS, IG, RF, and SU. Figure 3 illustrates the results in Table IV.


Fig. 3. Results for attribute selection methods with Random Tree.

VI. DISCUSSION AND CONCLUSION

According to the results obtained in the current work on three datasets of different sizes, Naïve Bayes performed best with respect to classification accuracy, while kNN and RT performed better mainly after feature selection methods were applied. In general, feature selection methods can improve the performance of learning algorithms. However, no single feature selection method best satisfies all datasets and learning algorithms, so machine learning researchers should understand the nature of their datasets and the characteristics of their learning algorithms in order to obtain the best possible outcome. Overall, the CFS and CB feature selection methods performed better than IG, SU, RF, and PC. We also found that IG and SU performed identically, because SU is a modified version of IG. Future work should compare feature selection methods and the associated learning algorithms with respect to speed and tolerance to noise, as well as apply feature selection methods to more datasets.

REFERENCES

[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 3rd ed. San Francisco: Morgan Kaufmann, 2011.
[2] M. A. Hall and G. Holmes, "Benchmarking attribute selection techniques for discrete class data mining," IEEE Transactions on Knowledge and Data Engineering, vol. 15, 2003.
[3] M. Ashraf, et al., "A new approach for constructing missing features values," International Journal of Intelligent Information Processing, vol. 3, pp. 110-118, 2012.
[4] A. L. Blum and P. Langley, "Selection of relevant features and examples in machine learning," Artificial Intelligence, vol. 97, pp. 245-271, 1997.
[5] International Agency for Research on Cancer. (2002, 20 May 2011). Mammography screening can reduce deaths from breast cancer. Available: http://www.iarc.fr/en/mediacentre/pr/2002/pr139.html
[6] S. L. Lee. (2012, 01/03/2012). Thyroid problems. Available: http://www.emedicinehealth.com/thyroid_problems/article_em.htm
[7] (2012). Introducing hepatitis C. Available: http://www.hep.org.au/
[8] Y. Saeys, et al., "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, pp. 2507-2517, October 2007.
[9] M. Leach, "Parallelising feature selection algorithms," University of Manchester, Manchester, 2012.
[10] I. Kononenko, "Estimating attributes: Analysis and extensions of RELIEF," presented at Machine Learning: ECML-94, 1994.
[11] M. A. Hall, "Correlation-based feature selection for machine learning," PhD thesis, Department of Computer Science, The University of Waikato, Hamilton, 1999.
[12] L. Rutkowski, et al., Eds., Artificial Intelligence and Soft Computing, Part I. Poland: Springer, 2010.
[13] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[14] I. T. Jolliffe. (2002, 30/04/2012). Principal Component Analysis, 2nd ed. Available: http://books.google.com.au
[15] H. Liu and R. Setiono, "A probabilistic approach to feature selection: A filter solution," in Proceedings of the 13th International Conference on Machine Learning, 1996, pp. 319-327.
[16] H. Liu and R. Setiono, "Chi2: Feature selection and discretization of numeric attributes," in Proceedings of the 7th IEEE International Conference on Tools with Artificial Intelligence, 1995.
[17] W. Wolberg and L. Mangasarian, "Multisurface method of pattern separation for medical diagnosis applied to breast cytology," Proceedings of the National Academy of Sciences, vol. 87, pp. 9193-9196, 1990.
[18] J. Pevsner, Bioinformatics and Functional Genomics, 2nd ed. New York: Wiley-Blackwell, 2009.
[19] H. Zhang and J. Su, "Naive Bayes for optimal ranking," Journal of Experimental & Theoretical Artificial Intelligence, vol. 20, pp. 79-93, 2008.
[20] L. Rokach and O. Maimon, Eds., Data Mining with Decision Trees (Series in Machine Perception and Artificial Intelligence). World Scientific Publishing, 2008.
[21] M. Ashraf, et al., "Information gain and adaptive neuro-fuzzy inference system for breast cancer diagnoses," presented at the International Conference on Computer Sciences and Convergence Information Technology (ICCIT), Seoul, 2010.
