
Using Data Mining Techniques to Enhance Heart Disease Diagnosis

Dhiyaa ALjdiaoi 1,*, Hamed Monkaresi 2

1 Department of Computer Engineering, Razi University, Kermanshah, Iran. Email address: [email protected], telephone number: 00989212635305
2 Department of Computer Engineering, Razi University, Kermanshah, Iran. Email address: [email protected], telephone number: 00989183369976

ABSTRACT

Data mining techniques have been applied successfully in many fields, including science, business, the Web, and bioinformatics, and to different types of data such as visual, textual, spatial, real-time and sensor data. Medical data, however, remains information rich but knowledge poor: there is a lack of effective analysis tools for discovering the hidden relationships and trends in data obtained from clinical records. The use of a single data mining technique in the diagnosis of heart disease has been comprehensively investigated and shows acceptable levels of accuracy. The system designed here is based on the Cleveland Heart Disease Dataset, from which 13 features are used as input. This research applies data mining techniques to enhance heart disease diagnosis and prediction, including decision trees, Naïve Bayes classifiers, K-nearest neighbor classification (KNN) and support vector machines (SVM). The results show that the NB classifier achieves a classification accuracy of 87.45%.

Keywords: Heart Disease, Data Mining, K-nearest neighbor, Naïve Bayes, Support Vector Machine, Decision Tree

1. INTRODUCTION

In health care systems, data mining applications can be used to predict the efficacy of medical therapies. With the aid of data mining techniques, researchers in the medical field can recognize and predict diseases and provide effective care for patients, for example through heart disease prediction and detection [1-4]. Data mining is a useful way to search large datasets and discover knowledge that leads to new approaches to important problems. Today, data mining contributes to detection tasks in many fields, including health care. It should be noted that the data mining process is more than data analysis; it includes classification, clustering, association rule mining, and prediction [5]. The term heart disease refers to various defects that affect the heart. In recent years, heart disease has been a major cause of mortality all over the world. In many countries, including India, many people are severely affected by this dangerous disease [6]. For these reasons, heart disease diagnosis is important for a healthy life. Different classifiers have therefore been used for heart disease diagnosis, and this paper applies four classifiers separately.

2. Research Method

The present study was conducted using data from the University of California, Irvine (UCI). The data comprise 13 features and are classified into two classes, "present" and "absent" heart disease. After feature analysis, four classifiers were developed and validated: Decision Tree (DT), K-Nearest Neighbor (KNN), Naïve Bayes (NB) and Support Vector Machine (SVM).

www.C-IT.ir


2.1. KNN Algorithm

The K-nearest neighbor algorithm is a classification method based on similarity to other cases. Cases that are close to one another are called "neighbors". When a new case is presented, its distance from each of the cases in the model is calculated. The cases with the smallest distances, which are the most similar, are its nearest neighbors, and the new case is placed into the group that contains those nearest neighbors. The algorithm can also predict a continuous target. In this situation, the average or the median target value of the nearest neighbors is used as the predicted value for the new case [7].

2.2. SVM Algorithm

Support Vector Machine (SVM) is a supervised learning algorithm introduced by Vapnik in 1995. Its underlying idea is to control the generalization error. The algorithm constructs a "hyperplane" that divides the data into classes, so that all samples belonging to one class lie on one side and the rest on the other side. In the linear SVM classifier, the separating line is chosen so that the margin between it and the nearest samples of each class is as large as possible [8], [9].

2.3. NB Algorithm

Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes' theorem. Studies comparing classification algorithms have found a simple Bayesian classifier, known as the Naïve Bayesian classifier, to be comparable in performance with decision tree classifiers. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases. The Naïve Bayes classifier is a probabilistic classifier based on an independence assumption: it assumes that the presence of a particular feature of an object is unrelated to the presence of any other feature, given the class variable.
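Under this conditional independence assumption, the posterior probability of a class factorizes over the individual features. A sketch in standard notation follows; the symbols C (class) and x_1, ..., x_n (feature values) are introduced here for illustration and do not appear in the original text:

```latex
P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C),
\qquad
\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

For the data used in this study, n = 13 and C takes the two values "present" and "absent".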
For example, an animal may be considered to be a cat if it hunts, plays with kids, has four legs, has a head, and weighs about 3 kilograms. The Naïve Bayes algorithm treats each of these features as contributing independently to the prediction that the animal is a cat, with no feature depending on the values of the other features [10]. Naïve Bayes is an important classifier: it is easy to construct, does not require complicated parameter estimation, and is easy to interpret. Therefore, Naïve Bayes can be used by both expert and non-expert data mining developers. Finally, Naïve Bayes generally performs well in comparison with other data mining methods [11].

2.4. DT Algorithm

A decision tree is a classification method that consists of nodes, branches, and leaves. The first, or top, node of the tree is called the root node. Each node in the tree is connected to one or more other nodes by branches. A node that has no outgoing branches is called a leaf node; it terminates a path and gives the outcome value [11].
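The decision tree structure described above can be sketched in Python as follows. This is a minimal, hand-built illustration, not the tree used in the study: the `Node` class, the `predict` function, the attribute names `cp` and `age`, and the thresholds are all hypothetical and were not learned from any data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """A decision tree node: an internal test or, if it has no children, a leaf."""
    feature: Optional[str] = None      # attribute tested at an internal node
    threshold: Optional[float] = None  # go left if value <= threshold, else right
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None        # outcome value stored at a leaf

    def is_leaf(self) -> bool:
        # A leaf node has no outgoing branches.
        return self.left is None and self.right is None


def predict(root: Node, case: dict) -> str:
    """Follow branches from the root node until a leaf node is reached."""
    node = root
    while not node.is_leaf():
        node = node.left if case[node.feature] <= node.threshold else node.right
    return node.label


# Hypothetical tree (illustrative thresholds only, not clinical rules).
tree = Node(
    feature="cp", threshold=3,
    left=Node(label="absent"),
    right=Node(
        feature="age", threshold=50,
        left=Node(label="absent"),
        right=Node(label="present"),
    ),
)

print(predict(tree, {"cp": 4, "age": 62}))  # root -> right -> right leaf
print(predict(tree, {"cp": 2, "age": 62}))  # root -> left leaf
```

In a trained tree, the tests and thresholds at the internal nodes would be chosen by the learning algorithm rather than written by hand.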

2.5. Dataset Description

The Cleveland Clinic Foundation heart disease dataset is available from the University of California, Irvine (UCI). The dataset has 76 raw attributes. However, all published experiments refer to only 13 of them, because experienced cardiac clinicians consider these features to be the key attributes and the other features have many missing values. The Cleveland dataset contains 303 rows, of which 297 instances are complete. The six instances that contain missing values were removed from the experiment. The dataset can be downloaded from https://archive.ics.uci.edu/ml/datasets/Heart+Disease. The 13 attributes are used as input and are explained in Table 1. The dataset has one class attribute for heart disease, with two output classes denoted as one (heart disease present) and zero (heart disease absent).
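The preprocessing described above (removing incomplete instances and using a two-class target) can be sketched as follows. This is a sketch under stated assumptions, not the authors' code. It assumes a comma-separated layout with 13 feature columns followed by one class column, a missing value written as `?`, and a raw class code in which 0 means absent and any larger value means present. The function name `load_complete_cases` and the three sample rows are hypothetical.

```python
import csv
import io

N_FEATURES = 13  # number of input attributes used in the study


def load_complete_cases(lines):
    """Parse rows, skip any row with a missing value, and binarize the class."""
    features, labels = [], []
    for row in csv.reader(lines):
        if len(row) != N_FEATURES + 1:
            continue  # skip blank or malformed lines
        if any(cell.strip() == "?" for cell in row):
            continue  # assumption: "?" marks a missing value
        values = [float(cell) for cell in row]
        features.append(values[:N_FEATURES])
        # assumption: raw code 0 = absent, any positive code = present
        labels.append(1 if values[N_FEATURES] > 0 else 0)
    return features, labels


# Synthetic sample rows in the assumed layout (not real patient records).
sample = io.StringIO(
    "50,1,2,120,200,0,0,150,0,1.0,1,0,3,0\n"
    "60,0,4,140,250,1,2,130,1,2.0,2,1,7,3\n"
    "55,1,3,130,220,0,0,160,0,0.5,1,?,3,0\n"
)
X, y = load_complete_cases(sample)
print(len(X), y)  # the third row is dropped because it contains "?"
```

Applied to the full file, a filter of this kind would be expected to keep the 297 complete instances out of the 303 rows reported above, provided the file matches the assumed layout.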


Table 1: Selected Cleveland Heart Disease Dataset Attributes

S/N | Attribute | Description | Values
1 | Age | Age in years | Continuous
2 | Gender | Male or Female | 1 = Male, 0 = Female
3 | Cp | Chest pain type | 1 = Typical angina, 2 = Atypical angina, 3 = Non-anginal pain, 4 = Asymptomatic
4 | Trestbps | Resting blood pressure (in mm Hg) | Continuous
5 | Chole | Serum cholesterol (in mg/dl) | Continuous
6 | FBS | Fasting blood sugar | 1 >= 120 mg/dl 0