IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 60, NO. 4, APRIL 2013


Gene-Expression-Based Cancer Subtypes Prediction Through Feature Selection and Transductive SVM

Ujjwal Maulik*, Senior Member, IEEE, Anirban Mukhopadhyay*, Senior Member, IEEE, and Debasis Chakraborty

Abstract—With the advancement of microarray technology, gene expression profiling has shown great potential in outcome prediction for different types of cancers. Microarray cancer data, organized in a samples-versus-genes fashion, are being exploited for the classification of tissue samples into benign and malignant classes or their subtypes. They are also useful for identifying potential gene markers for each cancer subtype, which helps in the successful diagnosis of a particular cancer type. Nevertheless, small sample size remains a bottleneck in designing suitable classifiers. Traditional supervised classifiers can work only with labeled data, so the large number of microarray samples that lack adequate follow-up information is disregarded. In this paper, a novel approach that combines feature (gene) selection and the transductive support vector machine (TSVM) is proposed. We demonstrate that 1) potential gene markers can be identified and 2) TSVMs improve prediction accuracy compared with the standard inductive SVMs (ISVMs). A forward greedy search algorithm based on consistency and a statistic called the signal-to-noise ratio were employed to obtain the potential gene markers. The selected genes of the microarray data were then exploited to design the TSVM. Experimental results confirm the effectiveness of the proposed technique compared with the ISVM and the low-density separation method for semisupervised cancer classification as well as gene-marker identification.

Index Terms—Low-density separation (LDS), microarray data, semisupervised classification, support vector machines (SVM), transductive learning.

Manuscript received July 27, 2012; revised September 17, 2012; accepted October 12, 2012. Date of publication October 18, 2012; date of current version March 15, 2013. The code of the software is available at www.anirbanm.in/biotsvm. Asterisk indicates corresponding author.
*U. Maulik is with the Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India (e-mail: [email protected]).
*A. Mukhopadhyay is with the Department of Computer Science and Engineering, Kalyani University, Kalyani 741235, India (e-mail: anirban@klyuniv.ac.in).
D. Chakraborty is with the Department of Electronics and Communication Engineering, Murshidabad College of Engineering and Technology, Cossimbazar, West Bengal 742102, India (e-mail: [email protected]).
Digital Object Identifier 10.1109/TBME.2012.2225622

I. INTRODUCTION

CANCER classification of different tumor types is of great importance in cancer diagnosis and drug discovery. A major challenge in clinical cancer research is the prediction of prognosis at the time of tumor discovery. Accurate prediction of different tumor types can help in providing better treatment and in minimizing toxicity for the patients. The advent of microarray technology has made it possible to study the expression profiles of a large number of genes across different experimental conditions. Microarray-based gene expression profiling has shown great potential in the prediction of different cancer subtypes [1], [3], [16], [23], [26]-[30].

Nevertheless, small sample size remains a bottleneck in obtaining robust and accurate prediction models [13], [15]. The number of samples in microarray-based cancer studies is usually small because microarray experiments are time consuming, expensive, and limited by sample availability. Cancer classification using gene expression data usually relies on traditional supervised learning techniques, in which only labeled data (i.e., data from samples with clinical follow-up) can be exploited for learning, while unlabeled data (i.e., data from samples without clinical follow-up) are disregarded. Recent research in the area of cancer diagnosis suggests that unlabeled data, in addition to a small number of labeled data, can produce a significant improvement in accuracy; this approach is called semisupervised learning [9]. Indeed, semisupervised learning has proved to be effective in solving different biological problems, including protein classification [36], prediction of transcription factor-gene interactions [14], and gene-expression-based cancer subtype discovery [24], [32].

Major research on extending support vector machines (SVMs) to handle semilabeled data is based on the following idea: solve the standard inductive SVM (ISVM) while treating the unknown labels as additional optimization variables. By maximizing the margin in the presence of unlabeled samples, one can learn a decision boundary that traverses low-density regions while respecting the labels in the input space. In other words, this approach implements the cluster assumption for semisupervised learning, namely that samples in the same data cluster have identical labels. The idea was first introduced in [34] under the name of transductive SVM, but since it learns an inductive rule defined over the entire input space, the approach is also referred to as the semisupervised SVM (S3VM). Each cluster of samples is assumed to belong to one data class; thus, a decision boundary is defined between clusters. A variety of semisupervised techniques have been proposed [9], [22], [35]. Many successful algorithms directly or indirectly assume high density within classes and low density between classes, and can fail when the classes are strongly overlapping [4], [37]. This can be illustrated by comparing the well-known SVMs to their semisupervised extensions: the transductive SVM [21], the progressive TSVM (PTSVM) algorithm [11], transductive SVMs (TSVMs) [33], and semisupervised SVMs (S3VMs) [5]. TSVMs and S3VMs are iterative algorithms that use SVMs to gradually search for a reliable hyperplane by exploiting both labeled and unlabeled samples in the training phase.

Cancer classification using microarray data poses another major challenge because of the huge number of features (genes) compared to the number of examples (tissue samples). This is an important problem in machine learning, known as feature selection [6].




Successful gene identification involves 1) dimension reduction to reduce computational cost; 2) reduction of noise to increase classification performance; and 3) identification of more interpretable features. Only a small number of genes in a microarray dataset consisting of thousands of genes show strong correlation with the target phenotypes, and only a few of the selected genes have a biological relationship with the target diseases. A survey of the classical and computational intelligence methods for gene identification can be found in [2].

In this paper, we aimed to develop a classification system by identifying potential gene markers and subsequently applying the proposed technique to the selected genes for the classification of human cancers. A forward greedy reduction algorithm was exploited to identify the gene markers. The effectiveness of the proposed technique was compared with the LDS and ISVM on the basis of the overall average accuracy, the Wilcoxon signed rank test [18], and the one-tailed paired t-test [25]. The performance of the proposed technique was also investigated by combining it with another feature selection method known as the signal-to-noise ratio (SNR) [17].

The rest of this paper is organized as follows. Traditional transductive SVMs are briefly described in Section II. Section III describes the proposed TSVM. Feature selection techniques are briefly presented in Section IV. Section V describes the datasets used and reports the experimental results, followed by the conclusion and discussion in Section VI.

II. TRANSDUCTIVE SVMS FOR SEMISUPERVISED CLASSIFICATION

A. Low-Density Separation Approach

The low-density separation (LDS) algorithm is based on the cluster assumption. It combines two effective procedures to place the decision boundary in low-density regions between clusters. First, it derives graph-based distances that emphasize low-density regions. Second, it uses a gradient descent approach to optimize the TSVM objective function in order to obtain a decision boundary that avoids high-density regions. By combining the two procedures, LDS achieves better accuracy than the SVM and other traditional semisupervised methods on several test datasets [10]. Further details of the LDS algorithm can be found in the original paper [10]. A MATLAB implementation available from http://olivier.chapelle.cc/lds was used for the LDS algorithm.

B. Transductive SVMs

The ISVM classifier is based on the hyperplane that maximizes the separating margin between two classes using the available labeled samples. The ISVM was originally developed for two-class pattern recognition and was later extended to the multiclass problem. A nice tutorial on SVMs can be found in [7]. However, in many real-life applications, obtaining labeled patterns is expensive, while large numbers of unlabeled samples are readily available. Since unlabeled patterns are significantly easier to obtain than labeled ones, TSVMs were proposed in [33]. TSVMs are basically iterative algorithms [11] that gradually search for the optimal separating hyperplane in the feature space through a transductive process that incorporates unlabeled samples in the training phase.

In the semisupervised framework, two datasets are defined: a labeled training dataset S = {(x_i, y_i)}, i = 1, ..., l, and an unlabeled dataset V = {x_j}, j = l + 1, ..., n. The learning process of the TSVM can then be formulated as the minimization of

J(w, \xi, \xi^*) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i + C^* \sum_{j=1}^{d} \xi_j^*

subject to

y_i(\phi(x_i) \cdot w + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, l,
y_j(\phi(x_j) \cdot w + b) \ge 1 - \xi_j^*, \quad \xi_j^* \ge 0, \quad j = 1, 2, \ldots, d.    (1)

Here, C and C* are the user-specified penalty values for the training and the transductive samples, respectively, and \xi_i and \xi_j^* are the slack variables; d denotes the number of selected unlabeled samples in the transductive process (d \le n - l). Training the TSVM corresponds to solving the aforementioned optimization problem. A detailed description of the TSVM can be found in [21]. Finally, after introducing the Lagrange multipliers \alpha_i and \alpha_j^*, the decision function of the TSVM becomes

f(x) = \operatorname{sgn}\Bigl( \sum_{i=1}^{l} y_i \alpha_i k(x, x_i) + \sum_{j=1}^{d} y_j^* \alpha_j^* k(x, x_j) + b \Bigr).    (2)
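As a concrete reading of (2), the short sketch below evaluates the kernel expansion for a new sample once the multipliers have been obtained, using the Gaussian RBF kernel adopted later in the experiments. It is only an illustration of the formula; the argument names and the default gamma value are assumptions, not the authors' implementation.

import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    # Gaussian RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)
    return float(np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2)))

def tsvm_decision(x, sv_labeled, y_labeled, alpha_labeled,
                  sv_trans, y_trans, alpha_trans, b, gamma=0.5):
    # Eq. (2): kernel expansion over the labeled and the transductive support vectors.
    s = sum(y * a * rbf_kernel(x, xi, gamma)
            for xi, y, a in zip(sv_labeled, y_labeled, alpha_labeled))
    s += sum(y * a * rbf_kernel(x, xj, gamma)
             for xj, y, a in zip(sv_trans, y_trans, alpha_trans))
    return int(np.sign(s + b))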

III. PROPOSED TSVMS

A. Proposed Transductive Procedure

Unlike the selection procedure of transductive samples in traditional TSVMs, the selection of transductive samples is done by filtering the unlabeled samples as follows. Considering a binary classification problem, the algorithm begins by training the SVM classifier using the available working set W^(0). Since support vectors (i.e., patterns belonging to the margin band M = {x : |w^(i) . phi(x) + b^(i)| <= 1}) are the only patterns that affect the position of the discriminant hyperplane, unlabeled samples that fall into the margin carry richer information for finding a better separating hyperplane. To select these samples, we define N+ and N- to be the numbers of positive and negative patterns within the margin bounds. At each iteration, N+ and N- transductive samples are selected on either side of the separating hyperplane to define the positive and negative candidate sets B+ and B-. In other words, all the positive and negative semilabeled samples are selected from the upper (positive) and the lower (negative) side of the margin, respectively. As a consequence, a transductive set B_t^(i) = B+ U B- is formed at the first (i = 1) iteration. Let A_t^(0) = {}. At this stage, B_t^(1) is merged with the initial working set, the classifier is retrained, and the process is repeated. Subsequently, the second resultant transductive set D_t^(2) is computed by intersecting the first and second transductive sets (i.e., A_t^(1) and B_t^(2)). As a consequence, the resulting transductive set contains the samples common to the first and second transductive sets.


Input: Labeled points S = {(x_j, y_j)}, j = 1, 2, ..., l, and unlabeled points V = {x_j}, j = l + 1, ..., n.
Output: Transductive SVM classifier trained on the original training set and the transductive set.
begin
1. Initialize the working set W^(0) = S and the previous transductive set A_t^(0) = {}; specify C and C*.
2. Train the SVM classifier with the working set W^(0).
3. Obtain the label vector of the unlabeled set V.
for i = 1 to T    // T is the number of iterations
  4. Select N+ positive transductive samples from the upper side of the margin and N- negative transductive samples from the lower side, respectively.
  5. Form the positive candidate set B+ containing the N+ positive transductive samples and the negative candidate set B- containing the N- negative transductive samples.
  6. B_t^(i) = B+ U B-.
  7. Update the training set:
     if A_t^(i-1) = {}
        W^(i) = W^(i-1) U B_t^(i);  D_t^(i) = B_t^(i)
     else
        D_t^(i) = A_t^(i-1) intersect B_t^(i);  W^(i) = (W^(i-1) - D_t^(i-1)) U D_t^(i)
     end if
  8. A_t^(i) = B_t^(i).
  9. Train the TSVM classifier with the updated training set W^(i).
  10. Obtain the label vector of the unlabeled set V.
end for
end

Fig. 1. Proposed transductive procedure for a two-class TSVM.
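For concreteness, the following is a minimal sketch of how the procedure of Fig. 1 could be realized on top of an off-the-shelf SVM; scikit-learn's SVC is used as the inductive base learner, all unlabeled points falling inside the margin band |f(x)| <= 1 are taken as candidates (rather than a fixed N+ and N-), and every name and default value is illustrative. The authors' own software is available at the URL given in the first-page footnote; this sketch is not that code.

import numpy as np
from sklearn.svm import SVC

def transductive_svm(X_lab, y_lab, X_unlab, C=1.0, gamma=0.5, T=10):
    """Two-class sketch of Fig. 1 (labels +1/-1): unlabeled points inside the
    margin band receive pseudo-labels, and only points whose pseudo-label agrees
    with the one assigned at the previous iteration are kept in the working set."""
    clf = SVC(C=C, kernel="rbf", gamma=gamma)
    clf.fit(X_lab, y_lab)                              # step 2: train on W(0) = S
    prev = {}                                          # A_t^(i-1): index -> pseudo-label
    for _ in range(T):
        f = clf.decision_function(X_unlab)
        band = [j for j in np.where(np.abs(f) <= 1.0)[0] if f[j] != 0]
        current = {j: int(np.sign(f[j])) for j in band}  # steps 4-6: B_t^(i)
        if prev:                                       # step 7: D_t = A_t^(i-1) ∩ B_t^(i)
            current = {j: s for j, s in current.items() if prev.get(j) == s}
        prev = {j: int(np.sign(f[j])) for j in band}   # step 8: A_t^(i) = B_t^(i)
        if current:                                    # W(i) = S ∪ D_t^(i)
            idx = np.array(sorted(current))
            X_aug = np.vstack([X_lab, X_unlab[idx]])
            y_aug = np.concatenate([y_lab, [current[j] for j in idx]])
        else:
            X_aug, y_aug = X_lab, y_lab
        clf.fit(X_aug, y_aug)                          # step 9: retrain on W(i)
    return clf, clf.predict(X_unlab)                   # step 10: final labels of V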

The first transductive set is then removed from the initial working set, while the second resultant transductive set is added to it. The resulting hybrid training set is used at the next iteration to find a more reliable discriminant hyperplane. It is to be noted that the same unlabeled set V is relabeled by the currently trained learner. The final transductive set thus contains samples whose labels remain consistent throughout the entire process. The algorithm is illustrated in Fig. 1.

This procedure improves the generalization capability of the classifier. Gradually, the separating hyperplane moves to a finer position in subsequent iterations. This can be explained by arguing that reducing the misclassification of transductive samples can lead to the identification of a more reliable discriminant function. However, as with all semisupervised techniques, it is not possible to guarantee that the proposed transductive SVM increases accuracy with respect to the inductive SVM in all cases. The convergence of the learning depends on the similarity between the problems represented by the training points and the unlabeled points. For the multiclass problem, the proposed TSVM adopts the one-against-all (OAA) architecture.

B. Selection of Transductive Samples

The proposed TSVM exploits the standard theoretical approach of the TSVMs presented in Section II-B. However, in designing the proposed TSVM, we address two important issues: 1) selecting the samples with an expected accurate labeling and 2) choosing the informative samples.


According to previous studies in semisupervised learning, the transductive samples are usually selected from the upper (positive) and the lower (negative) side of the margin; the P >= 1 transductive samples closest to the margin bounds are assigned the labels "+1" and "-1," respectively. If the number of unlabeled samples on one side of the margin is lower than P, the labeling is done anyway. A dynamic adjustment is necessary to take into account that the position of the hyperplane changes at every iteration. Typically, the most confident unlabeled patterns, together with their predicted labels, are added to the current training set; the classifier is retrained and the process is repeated. It is to be noted that the classifier uses its own predictions to teach itself, so a classification error can reinforce itself. Therefore, caution must be taken in the selection of transductive samples, because wrong labeling may substantially degrade the performance of the classifier. Because support vectors contain the richest information among the informative samples (i.e., the ones in the margin band), the unlabeled patterns closest to the margin bounds have the highest probability of being correctly classified. Therefore, in the proposed approach, we design a selection procedure (i.e., a filtering process) to increase the acceptability of the samples with the expected correct labeling. In other words, an unlabeled sample is considered a transductive sample only if the successively trained TSVMs assign the same label to it; we can then expect this sample to carry information with an accurate class label.

IV. FEATURE SELECTION TECHNIQUES

This section briefly describes the consistency-based feature selection (CBFS) and SNR procedures.

A. Consistency-Based Feature Selection

Feature selection is a useful technique for dimensionality reduction. In classification, it is used to find an optimal subset of relevant features so that the overall accuracy is increased while the data size is reduced. When a classification problem is defined by features, the number of features can be quite large, and many of them can be irrelevant. A relevant feature can increase the performance of a classifier, while an irrelevant feature can deteriorate it. Therefore, in order to select the relevant features, it is necessary to measure the goodness of the selected features using a feature selection criterion. Class separability is often used as one of the basic selection criteria. In this study, the consistency measure is exploited as a selection criterion; it does not attempt to maximize class separability but aims to retain the discriminatory power of the original features.

A typical feature selection method has three basic steps: 1) a generation procedure to generate the next candidate subset of features; 2) an evaluation function to evaluate the candidate subset; and 3) a stopping criterion to decide when to stop. Each feature selection method preserves a particular property of a given information system, based on a certain predetermined heuristic function. In rough set theory, feature reduction is about finding a feature subset that has minimal features while retaining particular properties. More details about this feature selection method can be found in [12] and [31].



In [12], the authors introduced the consistency function. The consistency measure is defined through the inconsistency rate, which is computed as follows.

Definition 1: Two instances are considered inconsistent if they match on all the selected features but have different class labels.

Definition 2: The inconsistency count \xi_i for a pattern p_i of a feature subset is the number of times the pattern appears in the data minus the largest number of its occurrences within a single class.

Definition 3: The inconsistency rate of a feature subset is the sum \sum \xi_i of the inconsistency counts over all patterns of the feature subset that appear in the data, divided by |U|. Correspondingly, the consistency is computed as \delta = (|U| - \sum \xi_i)/|U|.

From the aforementioned analysis, one can understand that dependence is the ratio of samples correctly classified, whereas consistency is the ratio of samples probably correctly classified. There are two kinds of samples in POS_E(G) U H: POS_E(G) is the set of consistent samples, while H is the set of samples belonging to the largest class within each pattern of the boundary region.

Definition 4: Let R = (U, F U G, f) be a decision table, E a subset of F, and a an attribute of E. The condition attribute a is indispensable in E if \delta_{E-{a}}(G) < \delta_E(G); otherwise a is redundant. E is independent if every attribute a in E is indispensable. \delta_E(G) reflects not only the size of the positive region but also the distribution of the boundary samples. An attribute is said to be redundant if the consistency does not decrease when it is deleted. The term "redundant" has two meanings: the first is relevant but redundant, as in the literature [19], [20]; the second is irrelevant. Thus, consistency can detect both kinds of superfluous attributes [12].

Definition 5: An attribute subset E is a consistency-based reduct of the decision table if 1) \delta_E(G) = \delta_F(G) and 2) for every a in E, \delta_E(G) > \delta_{E-{a}}(G). The first condition guarantees that the reduct has the same distinguishing ability as the whole set of features, while the second guarantees that every attribute in the reduct is indispensable. Therefore, there is no superfluous attribute in the reduct. Formally, the forward greedy attribute reduction algorithm can be written as in Fig. 2.
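As a concrete illustration of Definitions 1-3 and of the greedy search shown in Fig. 2 below, the following sketch computes the consistency of a candidate gene subset and grows a reduct greedily. It assumes that the expression values have already been discretized (the consistency measure is only meaningful on discrete values); the function and variable names are illustrative and are not the authors' implementation.

import numpy as np
from collections import Counter

def consistency(X, y, subset):
    """delta of Definitions 1-3: group samples by their (discretized) values on the
    selected features; in each group the inconsistency count is the group size minus
    the largest class count. X is a samples-by-genes array of discrete values."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(tuple(row[list(subset)]), []).append(label)
    inconsistency = sum(len(g) - max(Counter(g).values()) for g in groups.values())
    return (len(y) - inconsistency) / len(y)

def forward_greedy_reduct(X, y):
    """Forward greedy reduction as in Fig. 2: repeatedly add the attribute with the
    largest positive consistency gain, stopping when no attribute improves delta."""
    red = []
    current = consistency(X, y, red)          # delta of the empty subset
    candidates = set(range(X.shape[1]))
    while candidates:
        gains = {a: consistency(X, y, red + [a]) - current for a in candidates}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:                  # Sig(a_k, red, G) <= 0: return red
            break
        red.append(best)
        candidates.remove(best)
        current += gains[best]
    return red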

Input: Decision table R = (U, F U {d}, f).
Output: One reduct red.
begin
1. red <- {}    // red is the pool that conserves the selected attributes
2. For each a_i in F - red, compute Sig(a_i, red, G) = \delta_{red U {a_i}}(G) - \delta_{red}(G).
3. Select the attribute a_k that satisfies Sig(a_k, red, G) = max_{a_i in F - red} Sig(a_i, red, G).
4. if Sig(a_k, red, G) > 0
      red <- red U {a_k}; go to step 2
   else
      return red
5. end

Fig. 2. Forward greedy attribute reduction algorithm.

B. Signal-to-Noise Ratio

The signal-to-noise ratio (SNR) is defined as

\mathrm{SNR} = \frac{\mu_1 - \mu_2}{\sigma_1 + \sigma_2}    (3)

where \mu_i and \sigma_i, i in {1, 2}, respectively denote the mean and the standard deviation of class i for the corresponding gene (feature). A larger absolute value of SNR for a gene indicates that the gene's expression level is high in one class and low in the other. Therefore, this statistic is very useful in distinguishing the genes that are expressed differently in the two classes of samples. After computing the SNR statistic for each gene, the genes are sorted in descending order of their SNR values.

From the sorted list, the top ten genes are selected as the gene markers (five up-regulated, i.e., positive SNR, and five down-regulated, i.e., negative SNR) for a particular tumor subtype (for example, the EWS subtype in the SRBCT dataset). The top ten gene markers for the other tumor subtypes are selected in a similar way, i.e., by considering two classes each time, one corresponding to the tumor class for which the gene markers are being identified and the other corresponding to all the remaining classes. For example, 40 marker genes were obtained for the SRBCT data (10 genes for each of the 4 cancer subtypes).

V. DATASETS AND EXPERIMENTAL RESULTS

This section presents the datasets and the experimental results.

A. Gene Expression Datasets

In this paper, four publicly available benchmark datasets, viz., leukemia, small round blue-cell tumor (SRBCT), mixed-lineage leukemia (MLL), and diffuse large B-cell lymphoma (DLBCL), were downloaded in the form of expression values from the website [8] for our experiments. The datasets are described below.

Leukemia: This dataset contains 72 samples and 5147 genes. The subtypes consist of 47 acute lymphoblastic leukemia (ALL) and 25 acute myeloid leukemia (AML) samples.

Small round blue-cell tumors (SRBCT): This dataset has 83 samples, and the total number of genes is 2308. The subtypes include Ewing's sarcoma (EWS) (29 samples), neuroblastoma (NB) (18 samples), Burkitt's lymphoma (BL) (11 samples), and rhabdomyosarcoma (RMS) (25 samples).

Mixed-lineage leukemia (MLL): This dataset contains 72 samples and 12 533 genes. The subtypes are ALL (24 samples), MLL (20 samples), and AML (28 samples).

Diffuse large B-cell lymphoma (DLBCL): This dataset contains 77 samples and 7070 genes. The subtypes are diffuse large B-cell lymphoma (DLBCL) (58 samples) and follicular lymphoma (FL) (19 samples).



TABLE I
DESCRIPTION OF THE GENE MARKERS FOUND IN THE FOUR CANCER DATASETS

Gene Image ID       Description
Leukemia dataset:
  M27891_at         cystatin C (amyloid angiopathy and cerebral hemorrhage), CST3
  Y07604_at         non-metastatic cells 4, protein expressed in
SRBCT dataset:
  784224            fibroblast growth factor receptor 4
  812105            transmembrane protein
  207274            Human DNA for insulin-like growth factor II (IGF-2); exon 7 and additional ORF
  782811            high mobility group (nonhistone chromosomal) protein isoforms I and Y
  344134            immunoglobulin lambda-like polypeptide 3
MLL dataset:
  31375_at          ribosomal protein L28
  31385_at          serpin peptidase inhibitor, clade I (pancpin), member 2
  31394_at          ribonuclease, RNase A family, 2 (liver, eosinophil-derived neurotoxin)
  31441_at          pseudogene
DLBCL dataset:
  M59829_at         heat shock 70kDa protein 1-like
  X53961_at         Transferrin, Peptidase S60, transferrin lactoferrin
  U46006_s_at       cysteine and glycine-rich protein 2
  X85785_rna1_at    Duffy blood group, chemokine receptor

B. Selection of Gene Markers

Since gene-microarray datasets contain thousands of genes, it is necessary to identify the gene markers that are mostly responsible for distinguishing a particular tumor class from the remaining ones. In this study, we used the feature selection algorithm of Fig. 2 for the selection of marker genes. The identified gene markers are (Leukemia: M27891_at, Y07604_at), (SRBCT: 207274, 344134, 782811, 784224, 812105), (MLL: 31375_at, 31385_at, 31394_at, 31441_at), and (DLBCL: M59829_at, X53961_at, U46006_s_at, X85785_rna1_at). The gene markers for the datasets are described in Table I. Similarly, using the SNR method, the numbers of marker genes selected are 20, 40, 30, and 20 from the Leukemia, SRBCT, MLL, and DLBCL datasets, respectively. It is to be noted that only the genes selected by the feature selection methods were used to test the algorithms.
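To make the SNR-based selection concrete, the sketch below ranks the genes for one subtype in a one-versus-all fashion using Eq. (3) and keeps the five most positive and five most negative genes. The samples-by-genes layout, the small epsilon guarding against constant genes, and all names are illustrative assumptions rather than the authors' code.

import numpy as np

def snr_markers(X, y, target_class, n_up=5, n_down=5):
    """One-vs-all SNR ranking per gene: mean/std of the target class versus all
    remaining samples; returns indices of the top up- and down-regulated genes."""
    in_c, out_c = X[y == target_class], X[y != target_class]
    snr = (in_c.mean(axis=0) - out_c.mean(axis=0)) / (
        in_c.std(axis=0) + out_c.std(axis=0) + 1e-12)   # epsilon avoids division by zero
    order = np.argsort(snr)            # ascending: most negative SNR first
    up = order[::-1][:n_up]            # largest positive SNR (up-regulated)
    down = order[:n_down]              # most negative SNR (down-regulated)
    return up, down

Repeating the call for each tumor subtype yields ten markers per class, e.g., the 40 SRBCT markers mentioned above.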

C. Performance Assessment

To compare the performance of the three algorithms, we computed the overall average accuracies and standard deviations, and conducted statistical tests such as the Wilcoxon signed rank test [18] and the one-tailed paired t-test [25]. The statistical t-test is formulated as follows. If there is an unknown common population variance \sigma^2, we use an estimate \hat{\sigma}^2 for it, where

\hat{\sigma}^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}    (4)

and s_1^2, s_2^2 are the sample variances and n_1, n_2 are the sample sizes. For small samples, we use the test statistic

T = \frac{\bar{X}_1 - \bar{X}_2}{\hat{\sigma} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}    (5)

where \bar{X}_1, \bar{X}_2 are the sample means and T follows a t(n_1 + n_2 - 2) distribution. For example, the results (i.e., the third entry in columns 4 and 5 of Table II) obtained for the (CBFS + TSVM) and (CBFS + LDS) methods corresponding to a training set of size 20 for the Leukemia dataset are

\bar{X}_1 = 91.43, s_1 = 1.69, n_1 = 10 and \bar{X}_2 = 89.49, s_2 = 2.70, n_2 = 10.

We have

\hat{\sigma}^2 = \frac{(10 - 1)\,1.69^2 + (10 - 1)\,2.70^2}{10 + 10 - 2} = 5.06.

Now n_1 + n_2 - 2 = 18, so T follows t(18). Using the one-tailed test at the 5% level, the critical value of t for \nu = 18 is t = 1.734 (Elementary Mathematical Tables, Cambridge Univ. Press). We therefore reject the null hypothesis H_0 if t_test > 1.734, where

t_{\mathrm{test}} = \frac{91.43 - 89.49}{2.25 \sqrt{\frac{1}{10} + \frac{1}{10}}} = 1.926.

As t_test > 1.734, we reject H_0 and conclude that there is evidence, at the 5% level, that the performance of the (CBFS + TSVM) is significantly better than that of the (CBFS + LDS).
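The arithmetic of this worked example can be checked in a few lines. The figures are the ones quoted from Table II; with the two-decimal inputs the pooled variance comes out at about 5.07 (the 5.06 above reflects rounding), and the test statistic agrees with the reported 1.926.

import math

# Eq. (4) and Eq. (5) applied to the Leukemia dataset, training-set size 20:
# column 4 (CBFS + TSVM) versus column 5 of Table II.
m1, s1, n1 = 91.43, 1.69, 10
m2, s2, n2 = 89.49, 2.70, 10

pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)   # Eq. (4)
t = (m1 - m2) / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))        # Eq. (5)

print(f"pooled variance = {pooled_var:.2f}, t = {t:.3f}")
# t is approximately 1.93 > 1.734, so H0 is rejected at the 5% level with 18 degrees of freedom.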

D. Integrating Unlabeled Data From the Same Dataset Improves Prediction Performance

In cancer gene expression datasets, it is common that only some of the samples have sufficient clinical follow-up data, while the others are unlabeled with regard to the clinical question of interest. Therefore, we investigated whether integrating unlabeled data from the same dataset could improve the prediction performance. We compared the performance of the proposed transductive method with that of the ISVM and LDS. The labeled samples of each dataset were therefore roughly divided into training and test sets. Feature selection was applied on the training set. Experiments were carried out with 10, 15, and 20 training samples taken from the training set. For each size, ten different training patterns were realized using a random procedure, under the constraint that there is at least one sample for each class. Thus, the performance of the proposed method was evaluated on 30 training sets (for each dataset) made up of different samples and with different sizes. Therefore, this validation procedure is reliable and statistically stable. In order to investigate the effectiveness of the proposed TSVM, the test set was used as the unlabeled set; however, these samples were not considered for model selection. The transductive set was then extracted from this set and added to the training set. The resulting hybrid training set (initial training set and transductive set) was exploited to classify the unlabeled set.
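The random realization of training sets described above can be sketched as follows; the function only draws index sets and is an assumption about the sampling procedure, not the authors' implementation. It presumes that the requested size is at least the number of classes.

import numpy as np

def random_realizations(y_labeled, size, n_realizations=10, seed=0):
    """Draw `n_realizations` random training index sets of the given size from the
    labeled pool, retrying until every class is represented at least once."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y_labeled)
    assert size >= len(classes), "size must allow one sample per class"
    picks = []
    while len(picks) < n_realizations:
        idx = rng.choice(len(y_labeled), size=size, replace=False)
        if all(c in y_labeled[idx] for c in classes):
            picks.append(idx)
    return picks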



TABLE II
OVERALL ACCURACIES AND STANDARD DEVIATIONS AVERAGED OVER TEN RANDOM REALIZATIONS OF THE TRAINING SETS MADE UP OF 10, 15, AND 20 SAMPLES OF THE FOUR CANCER DATASETS

Dataset   Test set  Training set  CBFS + TSVM    CBFS + LDS      CBFS + ISVM     SNR + TSVM      SNR + LDS       SNR + ISVM
Leukemia  36        10            88.43 +/- 1.76   85.88 +/- 3.08^3  81.99 +/- 2.79^1  87.88 +/- 2.44^6  82.55 +/- 2.78^1  81.66 +/- 2.48^1
                    15            90.70 +/- 2.53   86.94 +/- 2.69^1  83.93 +/- 2.67^1  90.42 +/- 3.23^6  85.49 +/- 2.85^1  82.19 +/- 2.67^1
                    20            91.43 +/- 1.69   89.49 +/- 2.70^4  85.27 +/- 2.47^1  91.38 +/- 2.32^6  87.16 +/- 3.23^1  86.10 +/- 3.22^1
SRBCT     41        10            89.18 +/- 2.33   85.11 +/- 2.55^1  80.75 +/- 2.59^1  64.60 +/- 2.51^1  68.71 +/- 2.89^1  59.61 +/- 2.73^1
                    15            90.24 +/- 1.95   89.02 +/- 2.36^6  84.87 +/- 2.02^1  66.63 +/- 3.34^1  71.45 +/- 2.66^1  60.23 +/- 2.64^1
                    20            93.71 +/- 2.15   90.75 +/- 3.10^3  89.26 +/- 1.91^1  71.21 +/- 2.72^1  74.44 +/- 2.45^1  64.87 +/- 2.91^1
MLL       36        10            70.55 +/- 2.86   75.55 +/- 2.92^1  65.27 +/- 2.82^1  85.55 +/- 2.40^1  80.82 +/- 3.20^1  75.82 +/- 2.86^1
                    15            74.99 +/- 2.38   77.58 +/- 2.65^3  70.25 +/- 2.35^1  86.94 +/- 3.51^1  81.10 +/- 3.02^1  77.77 +/- 2.72^3
                    20            77.97 +/- 1.60   78.60 +/- 3.02^6  73.56 +/- 1.55^1  88.80 +/- 2.24^1  83.44 +/- 2.59^1  78.60 +/- 1.89^6
DLBCL     38        10            88.93 +/- 2.37   84.99 +/- 2.87^1  83.64 +/- 2.52^1  84.75 +/- 3.68^1  81.55 +/- 2.64^1  80.78 +/- 2.81^1
                    15            90.25 +/- 1.87   86.83 +/- 3.01^1  85.25 +/- 2.23^1  86.04 +/- 2.73^1  83.89 +/- 2.97^1  81.83 +/- 3.01^1
                    20            91.83 +/- 1.61   87.66 +/- 2.45^1  86.83 +/- 2.04^1  89.73 +/- 2.23^3  86.66 +/- 2.56^1  83.04 +/- 2.04^1

Superscripts denote confidence levels for the difference in accuracy between the proposed (CBFS + TSVM) and the corresponding combination of algorithms using the one-tailed paired t-test: 1 is 99.5%, 2 is 99%, 3 is 97.5%, 4 is 95%, 5 is 90%, and 6 is below 90%.

TABLE III
SUMMARY OF ACCURACY RESULTS

Measure            CBFS+TSVM  CBFS+LDS   CBFS+ISVM  SNR+TSVM   SNR+LDS    SNR+ISVM
No. wins           -          9-3        12-0       9-3        9-3        9-3
No. signif. wins   -          8-2        12-0       6-3        9-3        9-2
p-value            -          1.099E-1   4.882E-4   2.661E-1   1.294E-1   6.80E-3
Average            86.51      84.86      80.90      82.83      80.60      76.04

E. Input Parameters

For all the experiments reported in this paper, the Gaussian RBF kernel function of the form k(x_i, x_j) = exp(-\gamma \|x_i - x_j\|^2), where \gamma is the kernel parameter, was used. However, the approach is general, and any other kernel can be used. The datasets were normalized so that each feature is rescaled between -1 and +1. The parameters C and \gamma were tuned by letting each of them vary over the candidate set {0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8}. On account of the very small number of training samples, only a coarse grid search was applied. Each combination of parameter choices was evaluated using fivefold cross-validation, and the parameters with the best cross-validation accuracy were adopted. The final classifier model was then trained on the training set using the optimal parameters. For transductive learning, we assumed that C* equals C; however, other weighting strategies may be used. In our experiments, the highest accuracies were obtained by setting T = 10 or T = 15, depending on the specific training set considered. The LDS algorithm has several parameters, but we considered only the two most critical ones, C and \rho, where C is the soft margin parameter and \rho is a softening parameter for the graph distance computation. Default values were used for the other parameters. The LDS parameters were tuned using the aforementioned procedure.
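The coarse grid search just described can be sketched as follows. An RBF-kernel SVC stands in for the inductive base learner inside the TSVM, the candidate set is the one stated in the text, and the function name is illustrative; C* is afterwards taken equal to the selected C.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def tune_svm(X_train, y_train):
    """Coarse grid search over C and gamma with fivefold cross-validation."""
    candidates = [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8]
    search = GridSearchCV(SVC(kernel="rbf"),
                          {"C": candidates, "gamma": candidates},
                          cv=5, scoring="accuracy")
    search.fit(X_train, y_train)    # X_train, y_train: training data on the selected genes
    return search.best_params_["C"], search.best_params_["gamma"]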

F. Results

The proposed algorithm, as well as the other algorithms, was applied to the reference test set using all ten random realizations of each training set size. Using the ten different training sets of a given size (say, 15), ten accuracy values (in percent) were obtained. The overall accuracy was then computed by taking the mean of the resulting ten accuracy values. Table II reports the overall accuracies and standard deviations on the reference test data corresponding to the training patterns of different sizes. It can be observed from the table that the proposed (CBFS+TSVM) significantly increased the average accuracies compared to the (CBFS+LDS), except on the MLL dataset. Similarly, the proposed approach clearly outperformed the (CBFS+ISVM) across all the datasets. It can also be seen that the (SNR+TSVM) achieved significant empirical success over the (SNR+LDS), except on the SRBCT dataset, indicating that any feature selection method could have been used instead of the proposed one.

The results are summarized in Table III. The second row shows the number of domains in which the TSVM was more accurate than the corresponding classifier, versus the number in which it was less. For example, the (CBFS+TSVM) classifier was more accurate than the (CBFS+LDS) in nine domains, and less accurate in three. The third row considers only those domains where the accuracy difference was significant at the 5% level, using the one-tailed paired t-test. For example, the (CBFS+TSVM) combination was significantly more accurate than the (CBFS+LDS) in eight domains, and less accurate in two. The fourth row reports the p-values of the Wilcoxon signed rank test computed on the 12 average accuracy differences; the small p-values give high confidence that the proposed approach is more accurate than each of the other learners, a smaller p-value indicating a more significant difference.


The fifth row indicates the average accuracy across all the datasets; again, the proposed technique performed better than each of the other combinations. In general, the proposed technique is effective for the classification of cancer subtypes.

VI. CONCLUSION AND DISCUSSION

The present study was designed to address the small-sample-size problem in gene-expression-based outcome prediction for human cancers. This paper mainly shows the effectiveness of the proposed transductive SVM scheme in the framework of transductive inference learning, using two feature selection methods. In TSVM algorithms, the transductive samples are selected on the basis of a geometric analysis of the feature space, and only support-vector-like samples that contain the richest information are included in the training set. In particular, the proposed TSVM is an iterative procedure that defines the hyperplane according to a transductive process that integrates unlabeled samples together with the training samples. The proposed transductive learning approach successfully employed unlabeled gene expression data and achieved better empirical success. However, if labeled and unlabeled data follow different distributions, integrating unlabeled data may lead to poor performance. Our results demonstrate the significant potential of semisupervised learning in the domain of clinical problems. Experimental results on the cancer datasets clearly indicate that the proposed technique performs better than the ISVM and LDS. As a scope of future work, we plan to apply fuzzy-rough set theory to find more relevant gene markers and to introduce fuzzy set theory in transductive/semisupervised learning to improve the performance of the proposed technique.

REFERENCES

[1] S. Bandyopadhyay, A. Mukhopadhyay, and U. Maulik, "An improved algorithm for clustering gene expression data," Bioinformatics, vol. 23, no. 21, pp. 2859-2865, 2007.
[2] S. Bandyopadhyay, U. Maulik, and D. Roy, "Gene identification: Classical and computational intelligence approaches," IEEE Trans. Syst., Man, Cybern. C, vol. 38, no. 1, pp. 55-68, Jan. 2008.
[3] S. Bandyopadhyay, R. Mitra, and U. Maulik, "Development of the human cancer microRNA network," BMC Silence, vol. 1, no. 6, 2010.
[4] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: A geometric framework for learning from examples," Univ. Chicago, Chicago, IL, Tech. Rep. TR 2004-2006, 2004.
[5] K. P. Bennett and A. Demiriz, "Semi-supervised support vector machines," in Proc. Adv. Neural Inform. Process. Syst., 1998, vol. 10, pp. 368-374.
[6] A. Blum and P. Langley, "Selection of relevant features and examples in machine learning," Artif. Intell., vol. 97, no. 1/2, pp. 245-271, 1997.
[7] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Knowl. Discov. Data Mining, vol. 2, pp. 121-167, 1998.
[8] [Online]. Available: http://www.biolab.si/supp/bi-cancer/projections/index.htm
[9] O. Chapelle, V. Sindhwani, and S. S. Keerthi, "Optimization techniques for semi-supervised support vector machines," J. Mach. Learn. Res., vol. 9, pp. 203-233, 2008.
[10] O. Chapelle and A. Zien, "Semi-supervised classification by low density separation," in Proc. 10th Int. Workshop Artif. Intell. Stat., 2005, pp. 57-64.
[11] Y. Chen, G. Wang, and S. Dong, "Learning with progressive transductive support vector machine," Pattern Recognit. Lett., vol. 24, no. 12, pp. 1845-1855, 2003.
[12] M. Dash and H. Liu, "Consistency-based search in feature selection," Artif. Intell., vol. 151, pp. 155-176, 2003.
[13] A. Dupuy and R. M. Simon, "Critical review of public microarray studies in cancer outcome and guidelines on statistical analysis and reporting," J. Nat. Cancer Inst., vol. 99, pp. 147-157, 2007.


[14] J. Ernst, Q. K. Beg, K. A. Kay, G. Balázsi, Z. N. Oltvai, and Z. Bar-Joseph, "A semi-supervised method for predicting transcription factor-gene interactions in Escherichia coli," PLoS Comput. Biol., vol. 4, p. e1000044, 2008.
[15] L. Ein-Dor, O. Zuk, and E. Domany, "Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer," Proc. Nat. Acad. Sci. USA, vol. 103, pp. 5923-5928, 2006.
[16] A. J. Gentles, S. K. Plevritis, R. Majeti, and A. A. Alizadeh, "Association of a leukemic stem cell gene expression signature with clinical outcomes in acute myeloid leukemia," J. Amer. Med. Assoc., vol. 304, pp. 2706-2715, 2010.
[17] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander, "Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring," Science, vol. 286, pp. 531-537, 1999.
[18] M. Hollander and D. A. Wolfe, Nonparametric Statistical Methods. NJ: Wiley, 1999.
[19] Q. H. Hu, D. R. Yu, and Z. X. Xie, "Information-preserving hybrid data reduction based on fuzzy-rough techniques," Pattern Recognit. Lett., vol. 27, pp. 414-423, 2006.
[20] R. Jensen and Q. Shen, "Semantics-preserving dimensionality reduction: Rough and fuzzy-rough-based approaches," IEEE Trans. Knowl. Data Eng., vol. 16, no. 12, pp. 1457-1471, Dec. 2004.
[21] T. Joachims, "Transductive inference for text classification using support vector machines," in Proc. Int. Conf. Mach. Learning, 1999, pp. 200-209.
[22] R. Johnson and T. Zhang, "On the effective Laplacian normalization for graph semi-supervised learning," J. Mach. Learn. Res., vol. 8, pp. 1489-1517, 2007.
[23] H. K. Kim, I. J. Choi, C. G. Kim, A. Oshima, and J. E. Green, "Gene expression signatures to predict the response of gastric cancer to cisplatin and fluorouracil," J. Clin. Oncol., vol. 27, no. 15S, 2009.
[24] D. C. Koestler, C. J. Marsit, B. C. Christensen, M. R. Karagas, R. Bueno, D. J. Sugarbaker, K. T. Kelsey, and E. A. Houseman, "Semi-supervised recursively partitioned mixture models for identifying cancer subtypes," Bioinformatics, vol. 26, pp. 2578-2585, 2010.
[25] E. Kreyszig, Introductory Mathematical Statistics. New York: Wiley, 1970.
[26] U. Maulik, A. Mukhopadhyay, and S. Bandyopadhyay, "Combining Pareto-optimal clusters using supervised learning for identifying co-expressed genes," BMC Bioinformat., vol. 10, no. 27, 2009.
[27] U. Maulik, "Analysis of gene microarray data in soft computing framework," Appl. Soft Comput., vol. 11, no. 6, pp. 4152-4160, 2011.
[28] A. Mukhopadhyay, S. Bandyopadhyay, and U. Maulik, "Multi-class clustering of cancer subtypes through SVM based ensemble of Pareto-optimal solutions for gene marker identification," PLoS ONE, vol. 5, no. 11, pp. 1-14, 2010.
[29] U. Maulik, S. Bandyopadhyay, and A. Mukhopadhyay, Multiobjective Genetic Algorithms for Clustering: Applications in Data Mining and Bioinformatics. New York: Springer-Verlag, 2011.
[30] U. Maulik and A. Mukhopadhyay, "Simulated annealing based automatic fuzzy clustering combined with ANN classification for analyzing microarray data," Comput. Oper. Res., vol. 37, no. 8, pp. 1369-1380, 2010.
[31] Y. Qian, J. Liang, W. Pedrycz, and C. Dang, "Positive approximation: An accelerator for attribute reduction in rough set theory," Artif. Intell., vol. 174, pp. 597-618, 2010.
[32] I. Steinfeld, R. Navon, D. Ardigò, I. Zavaroni, and Z. Yakhini, "Clinically driven semi-supervised class discovery in gene expression data," Bioinformatics, vol. 24, pp. 190-197, 2008.
[33] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[34] V. Vapnik and A. Sterin, "On structural risk minimization or overall risk in a problem of pattern recognition," Autom. Remote Contr., vol. 10, no. 3, pp. 1495-1503, 1977.
[35] J. H. Wang and X. T. Shen, "Large margin semi-supervised learning," J. Mach. Learn. Res., vol. 8, pp. 1867-1891, 2007.
[36] J. Weston, C. Leslie, E. Le, D. Zhou, A. Elisseeff, and W. S. Noble, "Semi-supervised protein classification using cluster kernels," Bioinformatics, vol. 21, pp. 3241-3247, 2008.
[37] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in Proc. 20th Int. Conf. Mach. Learning, 2003.

Authors’ photographs and biographies not available at the time of publication.