SAI Intelligent Systems Conference 2015 November 10-11, 2015 | London, UK

A Feature Reduction Framework based on Rough Set for Biomedical Data Sets

Syed Hasnain Ali
Department of Computer Engineering, College of Electrical and Mechanical Engineering, National University of Sciences and Technology, Pakistan
[email protected]

Dr. Usman Qamar
Department of Computer Engineering, College of Electrical and Mechanical Engineering, National University of Sciences and Technology, Pakistan
[email protected]

Madiha Guftar
Department of Computer Engineering, College of Electrical and Mechanical Engineering, National University of Sciences and Technology, Pakistan
[email protected]

Abdul Wahab Muzaffar
Department of Computer Engineering, College of Electrical and Mechanical Engineering, National University of Sciences and Technology, Pakistan
[email protected]

Abstract—Feature selection reduces a data set to a subset that still represents the entire data, with lower computational complexity and little effect on performance. Extracting such a subset, however, is a nontrivial task, even though a number of methods exist to handle the problem. In recent years, approaches based on rough sets have been applied to feature selection. The dependency measure is one way to find a minimal feature subset, called a reduct, from the entire dataset. Techniques based on rough set theory form one of the mature areas of feature reduction, resting entirely on the concept of sets and the associated mathematics. We conducted experiments using publicly available datasets from the UCI repository as well as real data sets built from patient reports. A framework was devised using different rough-set-based algorithms, and it was observed that after attribute reduction our results improved in terms of time complexity while only a negligible effect was seen on the other measures. We measured the performance of our framework using precision, recall, accuracy and F-measure.

Keywords—Rough Set; Incremental Quick Reduct; Genetic Algorithm; Johnson's Algorithm; K-nn

I. INTRODUCTION

Feature selection (FS) is the process of selecting a subgroup of features that still provides enough information to carry out any processing on behalf of the entire dataset. The feature selection process yields a subset of features that contains the most useful information in the dataset [1]. The selected features can then be used in data mining, machine learning and pattern recognition in place of the entire data set, which reduces processing time. Real-world datasets contain an abundance of noisy data and irrelevant features, which makes feature selection a significant step in cleaning these datasets [2-4]. Rough Set Theory (RST), proposed by [5], is a mathematical method for handling imperfect knowledge, i.e. imprecision, uncertainty and vagueness. In recent years there has been a rapid rise of interest in rough set theory and its applications worldwide. Rough set theory has been used by many researchers as an underlying framework for feature selection, and different algorithms have been proposed based on the concepts it provides. Set approximation and dependency calculation are the basic steps towards finding the relevant features (reducts) of the original dataset while still maintaining the relevant information. This research focuses on feature selection in different medical datasets; the selected features may then be used for prediction on the data set in place of the whole feature set.

The rest of the paper is organized as follows. Section 2 provides details on related work. Section 3 describes the data acquisition process. Section 4 covers rough set preliminaries, and Section 5 describes feature selection in rough sets. The proposed methodology is discussed in Section 6. Results are discussed in Section 7, and Section 8 summarizes the conclusions of the work.

II. RELATED WORK

Jelonek et al. [6] used rough sets to select attributes for the classification of histological images with neural networks. The main objective was to decrease the training time of the network by reducing the number of inputs to the neural network. Using a set of images of seven types of brain cancer cells, a large number of features was generated for each image. Dimensionality could be decreased to about 11% of the original set by combining medical expertise with rough set processing. Cross-validation experiments showed that a neural network using this reduced set of attributes performed only somewhat better than a network using the complete set of inputs. In a follow-up study [7] using alternative image features, an even larger pool of features was reduced to about 4% of the original set using an entirely automatic rough set feature selection, and satisfactory performance was maintained. Dreiseitl et al. [8] utilized different types of techniques, including rough sets, for the selection of important features for the prediction of myocardial infarction. The selected feature subsets were validated and visualized with the help of self-organizing maps.


For data-driven modelling, stability across data from diverse geographic sites is not the only robustness issue of interest; temporal stability is of significance as well. Kandulski et al. [9] focused on synthesizing rough set approximations to create a hierarchy of recognized risk factors for surgical wound infection. The data was gathered over a span of four years; applying the technique to it, they report that the hierarchy of features selected from the first two years of data is essentially identical to the one obtained when only the data from the last two years was considered. With the growing volume of data retained in medical databases, efficient and effective procedures for medical data mining are in high demand. In this domain, applications of rough sets include inducing propositional rules from databases by means of rough sets and subsequently using those rules in an expert system. Tsumoto [10] presents a knowledge discovery system based on rough sets and feature-oriented generalization, applied to medicine, which extracts feature information and diagnostic rules from clinical data on congenital anomaly diseases. Experimental results showed that the proposed method mines expert knowledge appropriately and, moreover, finds that symptoms observed in six locations (feet, ears, noses, lips, eyes, and fingers) play a significant part in differential diagnosis. Hassanien et al. [11] present a rough set approach to feature reduction and the generation of classification rules on a set of medical datasets. They develop a rough set reduction technique to find all reducts of the data that contain the minimal subset of features associated with the class label, for classification purposes. A statistical test was introduced to evaluate the significance of a rule, measuring the validity of the rules on the basis of the approximation quality of the features. A dataset of patients with suspected breast cancer was used for evaluation, and the rough set classification accuracy was compared with that of the well-known ID3 classifier. Huang and Zhang [12] presented a new rough set application for ECG recognition. First, rough set theory was used to reduce the recognition rules for characteristic points in the ECG; these rules were then used as constraint conditions in an eigenvalue determination algorithm to identify the characteristic points. Several aspects of the related algorithms, i.e. the sizer method, the difference method and the selection of difference parameters, are discussed. To validate R-wave recognition they used the MIT-BIH data, showing that the resulting detection rate exceeds that of conventional recognition approaches. Recently, Independent Component Analysis (ICA) [13] has gained popularity as a useful technique for uncovering statistically independent sources (variables) for blind source separation, as well as for feature extraction. Swiniarski et al. [14] studied a number of combined approaches to feature extraction/reduction, feature selection, and classifier design for the recognition of breast cancer in mammograms.

The approaches encompassed independent component analysis, principal component analysis (PCA) and rough set theory. Three types of classifiers were designed and tested: an error back-propagation neural network, a Learning Vector Quantization neural network, and a rough set rule-based classifier. In a comparison on two distinct data sets of mammograms, the rough set rule-based classifier achieved a remarkably better level of accuracy than the other classifiers. Thus, using ICA or PCA as a feature extraction method together with rough sets for feature selection and rule-based classification suggests a better solution for the detection of breast cancer in mammogram recognition.

III. KNOWLEDGE ACQUISITION AND PRESENTATION

Five different data sets were used in this research. Two were obtained from the Armed Forces Institute of Cardiology & National Institute of Heart Diseases (AFIC & NIHD), and three from the UCI Machine Learning Repository. The Syncope and Cardiac patient datasets obtained from AFIC comprised unstructured text reports of 157 and 1500 patients respectively, containing information related to test results. For the Syncope data set, the test reports did not contain detailed information about the symptoms a patient experienced at the time of the syncope episode; for that reason, a closed-ended questionnaire with 31 attributes was designed with the help of cardiologists to take each patient's history manually prior to the test. The Fertility, Hepatitis and Breast Cancer datasets obtained from UCI comprise 100, 155 and 699 records respectively. All the acquired data sets were saved and normalized in Microsoft Excel, where missing values were also treated. Dealing with missing values is mandatory because data sets often do not work properly otherwise. Missing values are replaced with the most commonly occurring value of the corresponding attribute, or else with the most likely value derived from statistics.
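To make the imputation step concrete, the following is a minimal Python sketch of mode-based missing value replacement, assuming the normalized data has been exported from Excel to CSV (the file names below are hypothetical, not taken from the paper):

import pandas as pd

# Load one of the normalized data sets (hypothetical file name).
df = pd.read_csv("syncope_dataset.csv")

# Replace each missing value with the most commonly occurring
# value (the mode) of its attribute, as described above.
for col in df.columns:
    mode = df[col].mode(dropna=True)
    if not mode.empty:  # column has at least one non-missing value
        df[col] = df[col].fillna(mode.iloc[0])

df.to_csv("syncope_dataset_imputed.csv", index=False)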

IV. ROUGH SET THEORY PRELIMINARIES

Rough set theory (RST) can be utilized as a tool to discover dependencies within data and to reduce the number of attributes in a dataset using the data alone, requiring no additional information (Pawlak, 1991; Polkowski, 2002). Given a dataset with discretized attribute values, RST makes it possible to find a subset of the original attributes (termed a reduct) that is the most informative; all other attributes can be removed from the data with minimal loss of information. RST provides different ways to represent the data, e.g. information systems, decision systems, indiscernibility, approximations, and reducts via a discernibility matrix or discernibility function. This section provides the basic definitions and concepts of RST.

A. Information System

Knowledge in rough sets is represented through information systems. An information system (IS) is basically a table that offers a convenient means to describe a finite set of objects, called the universe, by a finite set of attributes, thereby expressing all available information and knowledge. The rows represent the objects (entities or records) and the columns represent the attributes (features). Formally, an IS is defined as ∆ = (U, A), in which U is a non-empty finite set of objects (the universe) and A is a finite set of attributes (features). Every attribute a ∈ A has a value function a: U → V_a, in which V_a is known as the value set of attribute a.

B. Decision System

A specialized form of information system that includes a "decision attribute" is known as a decision system. Formally, a decision system has the form Γ = (U, C ∪ {D}), in which C represents the condition attributes and D represents the decision attribute.

C. Indiscernibility

The mathematical machinery of rough sets derives from the observation that the level of detail in a dataset can be expressed by partitions, and their associated equivalence relations, on the set of objects; these are known as indiscernibility relations. Indiscernibility defines an equivalence relation between objects in ∆ = (U, A). A relation R ⊆ U × U is an equivalence relation if it is reflexive (xRx for every x), symmetric (xRy implies yRx) and transitive (xRy and yRz imply xRz). For any subset C ⊆ A of attributes in ∆, there is an associated indiscernibility relation IND_A(C):

IND_A(C) = \{ (O_1, O_2) \in U^2 \mid \forall c \in C,\; c(O_1) = c(O_2) \}

The equivalence class of an object x under IND_A(C) is denoted [x]_C, and IND_A(C) is called the C-indiscernibility relation. If (O_1, O_2) ∈ IND_A(C), then the objects O_1 and O_2 are indiscernible (indistinguishable) with respect to C.

D. Approximation

The principal concepts of RST are the approximations. Let X ⊆ U be some arbitrary set of objects. Normally it is not possible to define such a set in a crisp manner; instead, every rough set is associated with a pair of crisp sets. Indiscernibility defines two important constructions, the lower approximation and the upper approximation, to describe such sets. In an information system, assume B ⊆ A and X ⊆ U. The B-lower approximation \underline{B}X and the B-upper approximation \overline{B}X of X are defined as:

\underline{B}X = \{ x \in U \mid [x]_B \subseteq X \}    (1)

\overline{B}X = \{ x \in U \mid [x]_B \cap X \neq \emptyset \}    (2)

The B-lower approximation is also known as the positive region: it is the union of the equivalence classes of IND_A(B) that are certainly contained in X. The B-upper approximation is the set of objects that, with respect to B, can possibly belong to X; its complement U \ \overline{B}X is known as the negative region.
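As an illustration of the definitions above, the following is a minimal Python sketch that computes indiscernibility classes and the lower and upper approximations of equations (1) and (2) on a small, made-up decision table (the objects, attributes and values are invented for illustration and are not from the paper's datasets):

from collections import defaultdict

# A toy information system: each row is an object described by attributes.
U = {
    "o1": {"fever": "yes", "cough": "yes", "flu": "yes"},
    "o2": {"fever": "yes", "cough": "no",  "flu": "yes"},
    "o3": {"fever": "no",  "cough": "yes", "flu": "no"},
    "o4": {"fever": "yes", "cough": "yes", "flu": "no"},
}

def partition(objects, attrs):
    """Equivalence classes of the indiscernibility relation IND(attrs)."""
    classes = defaultdict(set)
    for name, row in objects.items():
        classes[tuple(row[a] for a in attrs)].add(name)
    return list(classes.values())

def lower_approx(objects, attrs, X):
    """Equation (1): union of classes certainly contained in X."""
    return {x for c in partition(objects, attrs) if c <= X for x in c}

def upper_approx(objects, attrs, X):
    """Equation (2): union of classes that intersect X."""
    return {x for c in partition(objects, attrs) if c & X for x in c}

B = ["fever", "cough"]
X = {o for o, row in U.items() if row["flu"] == "yes"}  # target concept
print(partition(U, B))        # three classes: {o1, o4}, {o2}, {o3}
print(lower_approx(U, B, X))  # {'o2'}
print(upper_approx(U, B, X))  # {'o1', 'o2', 'o4'}

Objects o1 and o4 agree on every condition attribute yet have different decisions, so the concept "flu = yes" is rough with respect to B: its lower and upper approximations differ.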

E. Dependency

The dependency measure provides another way to analyse the data. The degree of attribute dependency measures the extent to which one subset of attributes depends on another. An attribute set D depends totally on an attribute set C if the values of the attributes in C uniquely determine the values of the attributes in D. Formally, in a decision system Γ = (U, C ∪ {D}), the attribute D depends on the attributes C to a degree K, calculated by:

K = \gamma_C(D) = \frac{|POS_C(D)|}{|U|}    (3)

where

POS_C(D) = \bigcup_{X \in U/D} \underline{C}X    (4)

is known as the positive region of U/D with respect to C. K is called the degree of dependency and specifies the proportion of objects that can be classified with certainty into the partition induced by D, i.e. U/D. If K = 1, D depends fully on C; if 0 < K < 1, D depends partially on C; and if K = 0, D does not depend on C.
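As a concrete illustration of equations (3) and (4), the following is a minimal, self-contained Python sketch that computes the degree of dependency on the same toy decision table used in the previous sketch (the table and attribute names are invented for illustration):

from collections import defaultdict

# Toy decision table (illustrative values only).
U = {
    "o1": {"fever": "yes", "cough": "yes", "flu": "yes"},
    "o2": {"fever": "yes", "cough": "no",  "flu": "yes"},
    "o3": {"fever": "no",  "cough": "yes", "flu": "no"},
    "o4": {"fever": "yes", "cough": "yes", "flu": "no"},
}

def partition(objects, attrs):
    """Equivalence classes of the indiscernibility relation IND(attrs)."""
    classes = defaultdict(set)
    for name, row in objects.items():
        classes[tuple(row[a] for a in attrs)].add(name)
    return list(classes.values())

def dependency_degree(objects, C, D):
    """K = |POS_C(D)| / |U|, per equations (3) and (4)."""
    pos = set()
    for X in partition(objects, D):        # blocks of U/D
        for c in partition(objects, C):    # blocks of U/C
            if c <= X:                     # c lies in the C-lower approximation of X
                pos |= c                   # accumulate the positive region POS_C(D)
    return len(pos) / len(objects)

print(dependency_degree(U, ["fever", "cough"], ["flu"]))  # 0.5

Here K = 0.5 because o1 and o4 agree on both condition attributes yet differ on the decision, so only o2 and o3 can be classified with certainty. A greedy reduct search such as Quick Reduct repeatedly adds the attribute that most increases this value, stopping once it matches the dependency of the full attribute set.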