Lung Cancer Classification Using Genetic Algorithm to Optimize Prediction Models
Joey Mark Diaz, Raymond Christopher Pinon, Geoffrey Solano
University of the Philippines, Manila
Manila, Philippines
Abstract- Lung cancer is one of the most fatal types of cancer around the world. The World Cancer Research Fund International estimated that 1.8 million new cases of this disease were diagnosed in 2012. Early diagnosis and classification of this condition guide medical professionals toward safer and more effective treatment of the patient. The availability of microarray technology has paved the way to exploring genes and their association with various diseases like lung cancer. This study utilized a genetic algorithm as a method of feature (gene) selection for the support vector machine and the artificial neural network to classify the lung cancer status of a patient. The genetic algorithm (GA) successfully identified genes that classify patient lung cancer status with notable predictive performance.
Keywords- microarray, support vector machine, feature selection, genetic algorithm, artificial neural networks, GALGO
I. INTRODUCTION
Lung cancer has been identified as a major health issue for both developed and developing countries. In 2000, over one million deaths were reported worldwide, with 53% occurring in developed countries and 47% in less developed countries [22]. As of 2012, 1.8 million cases had been diagnosed, and estimates suggest that by 2030 lung cancer will reach around 10 million deaths per year [1][21].
Surgical removal of lung cancer still remains the gold standard in preventing lung cancer. Early diagnosis of lung cancer is therefore important to prevent the spread of the cancer.
Treatment of lung cancer also varies depending on the type of tumor present. Classification of different tumor types is thus important to ensure higher survival rates. However, classification of lung cancers is challenging [4][19].
Cellular function is determined by gene expression. Humans have approximately 20,000 to 25,000 genes, each of which has a particular sequence [24]. Genes are first transcribed into messenger RNA (mRNA) (transcription), and the mRNA is then used to synthesize protein (translation). The whole process from transcription to translation is called gene expression. Cells in a state of disease have different gene expressions from normally functioning cells. Genes which differ in expression are used as biological markers to indicate particular disease states. With the advent of modern technologies such as the DNA microarray, we are now able to measure the expression levels of thousands of genes in a given cell or tissue. Microarray technology made it possible to search systematically for markers of cancer classification and outcome prediction in a variety of tumor types [14]. Microarray technology thus became an important tool for studying the transcriptome of cancer cells.
One application of microarrays is classification analysis. Microarray data is used to determine whether genes are active, hyperactive or inactive in various tissues. Samples are then classified into two or more groups [4].
However, classification analysis using microarray data is difficult because of small sample size, high dimensionality of the data (gene expressions from all 20,000+ genes) and the presence of fragments (noise and irrelevant information). Thus we have to implement strategies to prevent misclassification and improve our analysis. In this study, we use the GALGO package developed for R for classification analysis of lung cancer data. GALGO allows the development and analysis of statistical models using a unique wrapping function for selection of genes [8].
II. REVIEW OF RELATED LITERATURE
Currently, cancer classification is based on subjective interpretation of histopathological and clinical data. Classification also depends on the site of origin of the tumor. Clinical information may be incomplete at times, and the wide classes of most tumors lack morphologic features which are essential in classification [23].
Microarray technology was used for tumor classification and cancer diagnosis in the works of Golub et al., Ben-Dor et al. and Alizadeh et al. These techniques, using two or three classes, returned test success rates of 90-100% for most binary class data. However, expansion of the problem to multiple tumor classes decreases the performance of these methods drastically because classification for different cancer types is not yet clearly defined. This makes methods such as those of Golub et al. and Slonim et al., which are based on gene expression and start with a feature selection step that looks for correlation with an ideal gene marker, particularly difficult to apply. Also, complex relationships between genes affect the discriminant analysis in classification [26].
Tibshirani et al. and Ooi et al. used discriminant approaches which consider genetic interactions. Tibshirani et al. (2001) were successful in finding genes for classifying small round blue cell tumors and leukemias using the simple nearest prototype (centroid) classifier.
Ooi et al. (2002) used a genetic algorithm maximum likelihood classification method (GA/MLHD) and found that the method permits substantial feature reduction in classifier gene sets without compromising predictive accuracy [25].
Pan et al. (2003) used a hybrid genetic algorithm based clustering (HGACLUS) schema, combined with the advantages of simulated annealing, for finding an optimal or near-optimal set of medoids. They found that HGACLUS performed more robustly and more accurately than other methods on simulated data, embryonal CNS data and NCI60 data [27].
Liu and Lin (2005) used the genetic algorithm to identify a set of key features and combined the silhouette statistic with a form of linear discriminant analysis. They found that the GA/silhouette algorithm with the one-minus Pearson distance metric achieved the best performance and outperformed many previous methods.
Zhu et al. (2007) used a Markov Blanket-Embedded Genetic Algorithm (MBEGA) for selecting genes. The embedded Markov blanket based operators add or delete features (genes) from a solution to improve the solution and increase accuracy. The method is effective and efficient in eliminating redundant and irrelevant features based on both Markov blanket and predictive power in the classifier model [10].
Yang et al. (2009) used a hybrid filter/wrapper method called IG-GA for feature selection in microarray datasets. Information gain (IG) was used to select important feature subsets (genes), and the genetic algorithm was used for the actual feature selection. The method was applied to eleven classification problems from the literature and was shown to reduce the number of gene expression levels effectively while either obtaining higher classification accuracy or using fewer features [28].
Cabrera (2014) developed a computer program which can assess the presence of lung cancer and further classify subtypes of lung cancer using normalization by decimal scaling, quantile normalization, min-max normalization and z-score transformation. The median absolute deviation (MAD) and signal-to-noise ratio (SNR) were used in dimension reduction for choosing gene markers. Data was subdivided into test and training sets, then used to classify patients by the support vector machine model [29].
This study utilized the genetic algorithm (GA) as a method of feature (gene) selection to optimize the performance of SVM and ANN in classifying the lung cancer status of a patient.
III. METHODOLOGY
A. Dataset and Preprocessing
The dataset consisted of 203 patients subdivided into 17 normal lung patients, 6 small cell lung carcinoma (SM), 21 squamous cell lung carcinoma (SQ), 20 pulmonary carcinoid (CO), and 139 adenocarcinoma (AD) patients. There were 12600 genes (features), which were preprocessed using the standard normal score method. The features used to classify lung cancer were selected based on the highest standard deviation, trimming the dataset down to 3312 genes [19].
B. Prediction Models Optimization and Model Validation
Artificial neural networks (ANN) and the support vector machine (SVM) were used as prediction models to classify lung cancer. SVM and ANN are powerful tools for solving multiclass prediction problems, as in the case of lung cancer classification. These models were optimized using the genetic algorithm (GA) implemented in the R Galgo package, especially in terms of feature (gene) selection. Validation of the final models was done using a cross validation method which draws sample bootstraps to address the problem of a small test sample [30].
C. Setting Up the Genetic Algorithm
Two separate genetic algorithms, one each for the ANN and SVM models, were run using the Galgo R statistical package. Chromosome size was set to 50 genes (features) with a target accuracy rate (fitness) of at least 97%.
IV. RESULTS AND DISCUSSION
A. Genetic Algorithm with SVM as Classifier
A total of 160 sets of calculations (chromosomes), each with at most 200 generations, were performed for the classification problem. Seven sets of solution chromosomes (sets of genes) satisfying the desired accuracy rate of at least 97% correct cancer class prediction were obtained.
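The search configuration described for these runs (a fixed chromosome size, a 97% fitness target, and at most 200 generations per run) can be illustrated with a generic genetic algorithm. The following is a minimal sketch and not the GALGO implementation. It assumes a nearest-centroid classifier with leave-one-out accuracy as a stand-in fitness function in place of SVM or ANN, and the function names, population size, mutation rate, and crossover scheme are all hypothetical choices made for illustration.

```python
import random


def loo_accuracy(X, y, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier on the chosen genes.

    Stand-in fitness only: the study used SVM and ANN models, not this classifier.
    """
    correct = 0
    for i, sample in enumerate(X):
        sums, counts = {}, {}
        for j, (row, label) in enumerate(zip(X, y)):
            if j == i:
                continue  # hold out sample i
            acc = sums.setdefault(label, [0.0] * len(genes))
            for k, g in enumerate(genes):
                acc[k] += row[g]
            counts[label] = counts.get(label, 0) + 1

        def sq_dist(label):
            # squared distance from the held-out sample to a class centroid
            return sum((sample[g] - sums[label][k] / counts[label]) ** 2
                       for k, g in enumerate(genes))

        correct += min(sums, key=sq_dist) == y[i]
    return correct / len(X)


def evolve(X, y, n_genes, chrom_size=50, pop_size=20, max_gen=200,
           target=0.97, seed=0):
    """Run one GA search; return (best gene set, its fitness, generations used).

    Assumes 2 <= chrom_size <= n_genes and pop_size >= 4.
    """
    rng = random.Random(seed)
    pop = [rng.sample(range(n_genes), chrom_size) for _ in range(pop_size)]
    best, best_fit = None, -1.0
    for gen in range(max_gen):
        scored = sorted(((loo_accuracy(X, y, c), c) for c in pop),
                        key=lambda t: t[0], reverse=True)
        if scored[0][0] > best_fit:
            best_fit, best = scored[0]
        if best_fit >= target:
            return best, best_fit, gen  # a "solution" chromosome
        parents = [c for _, c in scored[:pop_size // 2]]
        children = [list(scored[0][1])]  # keep the current best (elitism)
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, chrom_size)
            child = a[:cut] + b[cut:]  # one-point crossover
            if rng.random() < 0.3:  # point mutation (hypothetical rate)
                child[rng.randrange(chrom_size)] = rng.randrange(n_genes)
            seen = set()
            for idx, g in enumerate(child):  # repair duplicated genes
                while g in seen:
                    g = rng.randrange(n_genes)
                seen.add(g)
                child[idx] = g
            children.append(child)
        pop = children
    return best, best_fit, max_gen  # no solution within max_gen
```

Calling `evolve` repeatedly with different seeds would mirror the idea of many independent runs, some ending as solutions and others stopping at the generation limit below the target fitness.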
For most of the calculations that yielded a lung cancer class prediction accuracy below 97%, the accuracy ranged from 86% to above 96% (see Figure 1).
Fig. 1. Classification Performance of Solution and No-solution Chromosomes with Genetic Algorithm Using SVM [panels: 7 solution chromosomes and 153 no-solution chromosomes; fitness plotted against generation; plot not recoverable]
Based on the solution chromosomes, a final model was derived with respect to parsimony and prediction accuracy rate. Figure 3 shows the candidate models obtained using the forward selection method, with accuracy rate plotted on the y-axis and the selected features (genes) on the x-axis. Of the top 22 models presented, the simplest model that predicts the lung cancer status of a patient at a high accuracy rate was model no. 15, consisting of 43 genes. It has a sensitivity rate between 79.5% and 98.9% with an average of 91.16%. Moreover, model specificity is between 89.4% and 100% with an average of 97.8% (Figure 4).
Adenocarcinoma and normal patients were predicted with the highest accuracy, at 97.92% and 97.14% respectively. Good prediction accuracy rates were also observed for small cell lung carcinoma (SM), squamous cell lung carcinoma (SQ) and pulmonary carcinoid (CO) patients, with correct prediction accuracy rates of 83.33%, 81.03% and 91.17% respectively. The overall prediction accuracy rate was 95.87% and the average accuracy rate was 91.16% (Table 1).
B. Genetic Algorithm with ANN as Classifier
As with the method using SVM, a total of 160 sets of calculations (chromosomes), each with at most 200 generations, were run for the classification problem. The majority of the sets of analyses in this method (154 chromosomes) satisfied the desired accuracy rate of at least 97% correct prediction. Only 6 chromosomes (no solution) obtained an accuracy rate of less than 97% (Figure 2).
Fig. 2. Classification Performance of Solution and No-solution Chromosomes with Genetic Algorithm Using ANN [panels: 154 solution chromosomes and 6 no-solution chromosomes; fitness plotted against generation; plot not recoverable]
Figure 5 displays the candidate models derived using the forward selection method. The top 5 models were plotted with accuracy rate on the y-axis and the selected features (genes) on the x-axis. The selected final model was model no. 1 with 45 features. The sensitivity rate ranges from 72.9% to 98.1% with an average of 89.34%. The minimum specificity is 93.3% and the maximum is 99.3%, with an average of 97.12% (Figure 6).
Table 2 revealed that the final model was able to classify adenocarcinoma and normal patients with accuracy rates of 97.92% and 92.6% respectively. Moreover, squamous cell lung carcinoma (SQ) and pulmonary carcinoid (CO) patients were classified at accuracy rates of 82.49% and 86.0% respectively. The lowest classification accuracy was observed for small cell lung carcinoma (SM) patients, at an accuracy rate of 64.64%. The overall correct prediction rate was 93.66% with an average of 84.75%.
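The gap between the overall rate and the average rate comes from class imbalance: the large adenocarcinoma class dominates the overall figure, while the small SM class pulls the average down. The sketch below illustrates this with the class sizes from the dataset description and the ANN per-class rates quoted above. It assumes that "overall" means the class-size-weighted mean and "average" means the unweighted mean of the per-class rates, which the text does not state explicitly. Under that assumption the reported values are reproduced only approximately (about 93.7% and 84.7%), so the exact computation used in the study may differ.

```python
# Class sizes from the dataset description (203 patients in total).
class_sizes = {"AD": 139, "NL": 17, "SM": 6, "SQ": 21, "CO": 20}

# Per-class accuracy rates quoted for the final ANN model.
ann_class_accuracy = {"AD": 0.9792, "NL": 0.926, "SM": 0.6464,
                      "SQ": 0.8249, "CO": 0.86}


def unweighted_average(accuracy):
    """Mean of the per-class rates; every class counts equally."""
    return sum(accuracy.values()) / len(accuracy)


def size_weighted_average(accuracy, sizes):
    """Mean of the per-class rates weighted by the number of patients per class."""
    total = sum(sizes[c] for c in accuracy)
    return sum(accuracy[c] * sizes[c] for c in accuracy) / total


overall = size_weighted_average(ann_class_accuracy, class_sizes)
average = unweighted_average(ann_class_accuracy)
print(f"size-weighted: {overall:.4f}, unweighted: {average:.4f}")
```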
[Figure: Models Using Forward Selection (Lung Cancer Classification Using SVM in GA); x-axis: selected genes 1 to 50; plot not recoverable from the source]
Fig. 5. Candidate Models Derived Using Forward Selection Method with Genetic Algorithm Using ANN
[Figure: Class Confusion (1 Models) (Lung Cancer Classification Using ANN in GA); per-class sample counts, sensitivity, and specificity; values not recoverable from the source]
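Per-class sensitivity and specificity of the kind reported for the final models can be derived from a multiclass confusion matrix by treating each class in turn as the positive class against all others. The following is a minimal sketch of that one-vs-rest calculation. The function name and the three-class matrix are hypothetical and are not data from this study.

```python
def one_vs_rest_rates(confusion, labels):
    """Return {label: (sensitivity, specificity)} for a square confusion matrix.

    Rows are true classes and columns are predicted classes (an assumption
    of this sketch).
    """
    total = sum(sum(row) for row in confusion)
    rates = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]                          # class i predicted as i
        fn = sum(confusion[i]) - tp                   # class i predicted as other
        fp = sum(row[i] for row in confusion) - tp    # others predicted as i
        tn = total - tp - fn - fp                     # everything else
        rates[label] = (tp / (tp + fn), tn / (tn + fp))
    return rates


# Hypothetical counts for three classes, for illustration only.
example_labels = ["A", "B", "C"]
example_confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 0, 10],
]
example_rates = one_vs_rest_rates(example_confusion, example_labels)
for label, (sens, spec) in example_rates.items():
    print(f"{label}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

A class with few samples, such as SM in this dataset, has a sensitivity that moves in large steps with each misclassified patient, which is one reason per-class rates are worth reporting alongside the overall accuracy.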