Lung Cancer Classification Using Genetic Algorithm to Optimize Prediction Models

Joey Mark Diaz, Raymond Christopher Pinon, Geoffrey Solano
University of the Philippines, Manila
Manila, Philippines

Abstract- Lung cancer is one of the most fatal types of cancer around the world. The World Cancer Research Fund International estimated that in 2012, 1.8 million new cases of this disease were diagnosed. Early diagnosis and classification of this condition guide medical professionals toward safer and more effective treatment of the patient. The availability of microarray technology has paved the way to exploring genes and their association with various diseases like lung cancer. This study utilized a genetic algorithm as a method of feature (gene) selection for the support vector machine and artificial neural network to classify the lung cancer status of a patient. The genetic algorithm (GA) successfully identified genes that classify lung cancer patient status with notable predictive performance.

Keywords- microarray, feature selection, genetic algorithm, support vector machine, artificial neural networks, GALGO

I. INTRODUCTION

Lung cancer has been identified as a major health issue for both developed and developing countries. In 2000, over one million deaths were reported worldwide, with 53% occurring in developed countries and 47% in less developed countries [22]. As of 2012, 1.8 million cases have been diagnosed, and estimates suggest that by 2030 lung cancer will reach around 10 million deaths per year [1][21].

Surgical removal of lung cancer still remains the gold standard in preventing lung cancer. Early diagnosis of lung cancer is therefore important to prevent the spread of the cancer.

Treatment of lung cancer also varies depending on the type of tumor present. Classification of different tumor types is thus important to ensure higher survival rates. However, classification of lung cancers is challenging [4][19].

Currently, cancer classification is based on subjective interpretation of histopathological and clinical data. Classification also depends on the site of origin of the tumor. Clinical information may be incomplete at times, and the wide classes of most tumors lack morphologic features which are essential in classification [23].

Cellular function is determined by gene expression. Humans have approximately 20,000 to 25,000 genes, each of which has a particular sequence [24]. Genes are first transcribed into messenger RNA (mRNA) (transcription), which is then used to synthesize protein (translation). The whole process from transcription to translation is called gene expression. Cells which are in a state of disease have different gene expressions from normally functioning cells. Genes which differ in expression are used as biological markers to indicate particular disease states.

With the advent of modern technologies such as the DNA microarray, we are now able to measure the expression levels of thousands of genes in a given cell or tissue. Microarray technology made it possible to search systematically for markers of cancer classification and outcome prediction in a variety of tumor types [14]. Microarray technology thus became an important tool for studying the transcriptome of cancer cells.

One application of microarrays is classification analysis. Microarray data is used to determine if genes are active, hyperactive or inactive in various tissues. Samples are then classified into two or more groups [4]. However, classification analysis using microarray data becomes difficult because of small sample size, high dimensionality of the data (gene expressions from all 20,000+ genes) and the presence of fragments (noise and irrelevant information). Thus we have to implement strategies to prevent misclassification and improve our analysis. In this study, we use the GALGO package developed for R for classification analysis of lung cancer data. GALGO allows the development and analysis of statistical models using a unique wrapping function for the selection of genes [8].

II. REVIEW OF RELATED LITERATURE

Microarray technology was used for tumor classification and cancer diagnosis in the works of Golub et al., Ben-Dor et al. and Alizadeh et al. These techniques, using two or three classes, returned test success rates of 90-100% for most binary class data. However, expansion of the problem to multiple tumor classes decreases the performance of these methods drastically because classification for different cancer types is not yet clearly defined. This makes methods such as those of Golub et al. and Slonim et al., which are based on gene expression and start with a feature selection step that considers possible correlation with an ideal gene marker, particularly difficult to apply. Also, complex relationships between genes affect the discriminant analysis in classification [26].

Tibshirani et al. and Ooi et al. used discriminant approaches which consider genetic interactions. Tibshirani et al. (2001) were successful in finding genes for classifying small round blue cell tumors and leukemias using the simple nearest prototype (centroid) classifier.

Ooi et al. (2002) used the genetic algorithm maximum likelihood classification method (GA/MLHD) and found that the method permits substantial feature reduction in classifier gene sets without compromising predictive accuracy [25].

Pan et al. (2003) used a hybrid genetic algorithm based clustering (HGACLUS) schema, combined with the advantages of simulated annealing, for finding an optimal or near-optimal set of medoids. They found that HGACLUS performed more robustly and more accurately than other methods on simulated data, embryonal CNS data and NCI60 data [27].

Liu and Lin (2005) used the genetic algorithm to identify a set of key features and combined the silhouette statistic with a form of linear discriminant analysis. They found that the GA/silhouette algorithm with the one-minus Pearson distance metric achieved the best performance and outperformed many previous methods.

Zhu et al. (2007) used a Markov Blanket-Embedded Genetic Algorithm (MBEGA) for selecting genes. The embedded Markov blanket based operators add or delete features (genes) from a solution to improve the solution and increase accuracy. The method is effective and efficient in eliminating redundant and irrelevant features based on both the Markov blanket and predictive power in the classifier model [10].

Yang et al. (2009) used a hybrid filter/wrapper method called IG-GA for feature selection in microarray datasets. Information gain (IG) was used to select important feature subsets (genes), and the genetic algorithm was used for the actual feature selection. The method was applied to eleven classification problems from the literature and was shown to reduce the number of gene expression levels effectively while either obtaining higher classification accuracy or using fewer features [28].

Cabrera (2014) developed a computer program which can assess the presence of lung cancer and further classify subtypes of lung cancer using normalization by decimal scaling, quantile normalization, min-max normalization and z-score transformation. The median absolute deviation (MAD) and signal-to-noise ratio (SNR) were used in dimension reduction for choosing gene markers. Data was subdivided into test and training sets and then used to classify patients by the support vector machine model [29].

This study utilized the genetic algorithm (GA) as a method of feature (gene) selection to optimize the performance of SVM and ANN in classifying the lung cancer status of a patient.

III. METHODOLOGY

A. Dataset and Preprocessing

The dataset consisted of 203 patients subdivided into 17 normal lung patients, 6 small cell lung carcinoma (SM), 21 squamous cell lung carcinoma (SQ), 20 pulmonary carcinoid (CO), and 139 adenocarcinoma (AD) patients. There were 12600 genes (features), which were preprocessed using the standard normal score method. The features used to classify lung cancer were selected based on the highest standard deviation, trimming the dataset down to 3312 genes [19].

B. Prediction Models Optimization and Model Validation

Artificial neural networks (ANN) and support vector machine (SVM) were used as prediction models to classify lung cancer. SVM and ANN are powerful tools in solving multiclass prediction problems, as in the case of lung cancer classification. These models were optimized using the genetic algorithm (GA) implemented in the R Galgo package, especially in terms of feature (gene) selection. Validation of the final models was done using a cross-validation method which draws bootstrap samples to address the problem of a small test sample [30].

C. Setting Up the Genetic Algorithm

Two separate genetic algorithms, one each for the ANN and SVM models, were run using the Galgo R statistical package. Chromosome size was set to 50 genes (features) with a target accuracy rate (fitness) of at least 97%.

IV. RESULTS AND DISCUSSION

A. Genetic Algorithm with SVM as Classifier

A total of 160 sets of calculations (chromosomes), each with at most 200 generations, were performed for the classification problem. Seven sets of solution chromosomes (sets of genes) satisfying the desired accuracy rate of at least 97% correct cancer class prediction were obtained. For the majority of the calculations, which yielded a lung cancer class prediction accuracy of less than 97%, the accuracy ranged from 86% to higher than 96% (see Figure 1).

[Figure 1: fitness versus generation, Lung Cancer Classification Using SVM in GA; panels: 7 solution chromosomes, 153 no-solution chromosomes]

Fig. 1. Classification Performance of Solution and No-solution Chromosomes with Genetic Algorithm Using SVM

Based on the solution chromosomes, a final model was derived with respect to parsimony and prediction accuracy rate. Figure 3 shows the candidate models obtained using the forward selection method, with accuracy rate plotted on the y-axis and the selected features (genes) on the x-axis. Of the top 22 models presented, the simplest model that predicts the lung cancer status of a patient at a high accuracy rate was model no. 15, consisting of 43 genes. Its sensitivity rate is between 79.5% and 98.9%, with an average of 91.16%. Moreover, model specificity is between 89.4% and 100%, with an average of 97.8% (Figure 4).

Adenocarcinoma and normal patients were predicted with the highest accuracy, at 97.92% and 97.14% respectively. Good prediction accuracy rates were also observed for small cell lung carcinoma (SM), squamous cell lung carcinoma (SQ) and pulmonary carcinoid (CO) patients, with correct prediction accuracy rates of 83.33%, 81.03% and 91.17% respectively. The overall prediction accuracy rate was 95.87% and the average accuracy rate was 91.16% (Table 1).

B. Genetic Algorithm with ANN as Classifier

As with the method using SVM, a total of 160 sets of calculations (chromosomes), each with at most 200 generations, were run for the classification problem. The majority of the sets of analyses in this method (154 chromosomes) satisfied the desired accuracy rate of at least 97% correct prediction. Only 6 chromosomes (no solution) obtained an accuracy rate of less than 97% (Figure 2).

[Figure 2: fitness versus generation, Lung Cancer Classification Using ANN in GA; panels: 154 solution chromosomes, 6 no-solution chromosomes]

Figure 5 displays the candidate models derived using the forward selection method. The top 5 models were plotted with accuracy rate on the y-axis and the selected features (genes) on the x-axis. The selected final model was model no. 1, with 45 features. Its sensitivity rate ranges from 72.9% to 98.1%, with an average of 89.34%. The minimum specificity is 93.3% and the maximum is 99.3%, with an average of 97.12% (Figure 6).
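The sensitivity and specificity ranges reported for the final models are per-class quantities in a five-class problem. A minimal sketch of how such rates can be obtained from a multiclass confusion matrix is given below, assuming a one-vs-rest definition. It is written in Python for illustration only and is not the GALGO computation used in the study; the function name `per_class_rates` and the three-class counts are hypothetical.

```python
def per_class_rates(confusion, labels):
    """One-vs-rest sensitivity and specificity for each class.

    confusion[i][j] is the number of samples of true class i
    that were predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    rates = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]                           # class k predicted as k
        fn = sum(confusion[k]) - tp                    # class k predicted as other
        fp = sum(row[k] for row in confusion) - tp     # other predicted as class k
        tn = total - tp - fn - fp                      # everything else
        rates[label] = {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
        }
    return rates


# Hypothetical counts for illustration only (not the paper's data).
labels = ["AD", "NL", "SQ"]
confusion = [
    [18, 1, 1],
    [0, 9, 1],
    [2, 0, 8],
]
rates = per_class_rates(confusion, labels)
for label in labels:
    print(label, rates[label])
```

The minimum, maximum and mean of these per-class values would correspond to the kind of ranges and averages quoted in the text, under the stated one-vs-rest assumption.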

Fig. 2. Classification Performance of Solution and No-solution Chromosomes with Genetic Algorithm Using ANN

Table 2 shows that the final model was able to classify adenocarcinoma and normal patients with accuracy rates of 97.92% and 92.6% respectively. Moreover, squamous cell lung carcinoma (SQ) and pulmonary carcinoid (CO) patients were classified at accuracy rates of 82.49% and 86.0% respectively. The lowest classification accuracy was observed for small cell lung carcinoma (SM) patients, at an accuracy rate of 64.64%. The overall correct prediction rate was 93.66%, with an average of 84.75%.
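The gap between the overall rate and the average rate follows from class imbalance. The sketch below (Python, for illustration only) assumes that the "average" is the unweighted mean of the per-class accuracy rates and that the "overall" rate weights each class by its number of patients; both definitions are assumptions, not statements from the paper. With the ANN per-class rates and the class sizes from the dataset description, these assumptions reproduce the reported figures only approximately (84.73 against 84.75, and 93.72 against 93.66), so the authors' exact computation, which involves bootstrap resampling, may differ.

```python
# Class sizes from the dataset description and ANN per-class accuracy rates (%).
class_sizes = {"AD": 139, "NL": 17, "SQ": 21, "CO": 20, "SM": 6}
class_accuracy = {"AD": 97.92, "NL": 92.6, "SQ": 82.49, "CO": 86.0, "SM": 64.64}


def average_accuracy(acc):
    """Unweighted mean of per-class accuracy (every class counts equally)."""
    return sum(acc.values()) / len(acc)


def overall_accuracy(acc, sizes):
    """Mean of per-class accuracy weighted by the number of samples per class."""
    total = sum(sizes.values())
    return sum(acc[c] * sizes[c] for c in acc) / total


avg = average_accuracy(class_accuracy)
overall = overall_accuracy(class_accuracy, class_sizes)
print(round(avg, 2))      # 84.73
print(round(overall, 2))  # 93.72
```

Because adenocarcinoma accounts for 139 of the 203 patients, its high accuracy dominates the weighted rate, while the small SM class (6 patients) pulls the unweighted average down.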

[Figure: Models Using Forward Selection, Lung Cancer Classification Using SVM in GA; x-axis: selected features (genes), ranked 1 to 50; y-axis: accuracy rate]

Fig. 5. Candidate Models Derived Using Forward Selection Method with Genetic Algorithm Using ANN

[Figure: Class Confusion (1 Models), Lung Cancer Classification Using ANN in GA; per-class samples, sensitivity and specificity for AD, NL, CO, SQ and SM]

IlIn*lI3;m:Drl*M)gJmlrl"'�1 II I IIlJ111:ie�l�lD�iI 1I;m:[l ..l4O....klLWH't"1 .. lIJ:IIlIrV»l.�g'M)galDrl"':DgQII[ir9'»l._ft�1 lI;l;1[11:iV)l