Deriving meaningful rules from gene expression data for classification

Nikhil Ranjan Pal, Animesh Sharma‡, Somitra Kumar Sanadhya
{nikhil, animesh_r, somitra_r}@isical.ac.in
Electronics and Communication Sciences Unit, Indian Statistical Institute, 203, B. T. Road, Calcutta - 700108, India.
‡ Corresponding Author

Abstract — We propose a novel scheme for designing fuzzy rule based classifiers for gene expression data analysis. A neural network based method is used for selecting a set of informative genes. Considering only this selected set of genes, we cluster the expression data with a fuzzy clustering algorithm. Each cluster is then converted into a fuzzy if-then rule, which models an area of the input space. These rules are tuned using a gradient descent technique to improve the classification performance. The rule base is tested on a leukemia data set containing two classes and is found to produce excellent results. The membership functions associated with the rules are then analyzed and the rule base is further simplified without compromising the classification accuracy. The most attractive attributes of the proposed scheme are: it is an automatic rule extraction scheme; unlike other classifiers, it produces human-interpretable rules; and it is not expected to generalize badly, because fuzzy rules do not respond to areas not represented by the training data.

I. INTRODUCTION

Improvements in tumor classification are central to precise and individualized therapeutic approaches. One of the most powerful techniques developed in biotechnology is the DNA microarray [1]. Using microarrays, biologists are able to capture the expression levels of almost all genes of a cell in a single experiment. The number of these genes runs into the thousands, so microarrays are a major advance in understanding cell processes. But since the data are of very high dimension and the number of instances typically available is very limited, the classification of such data is a difficult task. The challenge is to extract meaningful information from such high-dimensional data to gain insight into biological processes and to identify how their disturbance leads to various diseases [2]. The high-dimensional nature of microarray data and the limited number of exemplars make the task of designing classifiers difficult because of the curse of dimensionality [3]. It is known that for a given problem all features that characterize a data point may not be equally important. Moreover, using more features is not necessarily better. For example, even if we are able to design a classifier using all gene expression values, it does not help us identify the discriminating genes, the marker genes. We further emphasize that the use of an appropriate set of features has a significant effect on the designed classifier, and the influence also depends on the type of classifier used.

Various supervised and unsupervised machine-learning methods have been employed for the analysis of samples based on their gene expression patterns. In unsupervised analysis, since the goal is class discovery, the data are organized without using class label information. Some examples of unsupervised methods widely used in the analysis of expression data are hierarchical clustering [4], k-means clustering [5], [6] and self-organizing feature maps (SOFM) [7]. Supervised analysis uses


some external information, such as the disease status of the samples studied. The main objective of supervised analysis is to design classifiers that can discriminate between the classes to which the data belong. To design a classifier, the data set is typically divided into a training set and a test set. The classifier is trained on the training set and tested on the test set. Once the test result is found to be satisfactory, the classifier can be applied to data with unknown classification. Some of the popular supervised methods include k-nearest neighbor (k-NN) classification [8], support vector machines (SVM) [9], and artificial neural networks (ANN) [10].

We summarize here some results from the literature on the leukemia data set, as we shall use the same data set in our investigation. Toure et al. [11] reported 58.8% accuracy in predicting the class of leukemia cancer. For the same data set, Cho et al. [12] used different mutually exclusive sets of features to design several classifiers and combined them using a neural network; the accuracies they obtained with various classifiers varied between 58.5% and 100%. Min Su et al. [13] reported an accuracy of 76.5% and Ben-Dor et al. [14] reported a recognition accuracy of 91.1% on the same data set. Mukherjee et al. [15] achieved 94.1% accuracy with the top 5 genes selected on the basis of a feature selection method proposed for SVM classifiers.

Fuzzy rule based classifiers (FRBCs) have been used in various areas such as remotely sensed image analysis [16] and medical diagnosis [17]. Although FRBCs have been applied to analyze gene expression data [18], they have not been adequately exploited for such analysis. It may be noted that in [18] the fuzzy sets are defined by experts. In the present study we propose a methodology to extract a set of fuzzy rules for the classification of expression data. It uses both supervised and unsupervised methods. We begin with a set of features selected by an Online Feature Selection (OFS) scheme [19]. Our fuzzy rule extraction scheme uses a fuzzy clustering algorithm to partition the training data into a number of clusters. Each cluster is then converted into a fuzzy rule. The rule base is then further refined using a gradient based iterative scheme. We also demonstrate how some features selected by the OFS scheme can be eliminated without compromising classification accuracy. Unlike [18], we have not used any expert-defined categorization of the expression levels to define the fuzzy sets. Our scheme is a completely automated system which finds a set of optimally defined fuzzy sets. The classification performance of the FRBC is found to be similar to that of other classifiers, but the FRBC is simpler and easier to interpret. We apply our method to generate human-interpretable rules from the two-category cancer expression data, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) [20]. Pre-determined test sets of unseen cases are used for validation. This is one of the well-known data sets available for the classification of gene expression data [21], [22], [23], [24], [25]. The proposed system is able to classify the expression data with good accuracy using only four rules defined on just four features.

II. MATERIALS AND METHODS

We use a neural network (NN) to select a small set of features and then use a fuzzy clustering algorithm to cluster the data set in the reduced dimension to extract prototypes for each class.
Then these prototypes are used to generate a set of fuzzy rules, which is then refined to improve the classifier performance.

A. Data Set.

Learning to distinguish AML from ALL is a well-known and well-studied problem [20]. These tumors are often indistinguishable histologically but show significant differences in clinical behavior. Subclassification of these tumors based on their molecular profiles may help explain why they respond so differently to treatment. Golub et al. [20] developed an innovative classification scheme for leukemia, analyzing microarray data by neighborhood analysis. This strategy was able to distinguish


between AML and ALL with an accuracy of 94.1%. In this study we have used the same data set. The gene expression intensities were obtained from Affymetrix high-density oligonucleotide microarrays containing 7219 probes. In this data set, gene expression profiles have been constructed for 72 persons who have either ALL or AML. Each person contributed one DNA microarray sample, so the database consists of 72 samples. We have used the same training-test partition as Cho et al. [12] for a fair comparison with published results. The training data set consists of 38 bone marrow samples, containing 27 ALL and 11 AML cases. The test data set contains 20 ALL and 14 AML cases. We note that this data set has become a benchmark in the cancer classification community [21], [22], [23], [24], [25], and hence we have used it.

B. Feature selection.

Feature selection techniques aim to reduce the feature space to a highly predictive subset, i.e., they aim to discard the bad or irrelevant features from the available set of features without losing prediction accuracy. The literature is quite rich in feature selection methodologies [26]. Some of these methods use neural networks or neuro-fuzzy systems [11], while others use fuzzy logic [27] or statistical techniques [26]. Other approaches to dimensionality reduction replace the given set of features by a new but smaller set of computed features [28]. Here we have used a set of features selected by a neural network based feature selection method [19]. In a standard multilayer perceptron (MLP) network, the effect of some features (inputs) can be eliminated by not allowing them into the network, i.e., by equipping each input node (hence each feature) with a gate and closing the gate. For good features the associated gates can be completely opened; if a feature is only partially important, the corresponding gate should be partially opened. Pal and Chintalapudi [11] suggested a mechanism for realizing such gates so that useful features can be identified and features can be attenuated according to their relative usefulness. To model the gates, we associate a gate function with each node in the input layer of the MLP. A gate function should produce a value of 1 or nearly 1 for a good feature, while for a bad feature it should produce a value of nearly 0. We call the network an Online Feature Selection (OFS) network. Further details of the scheme are given in the Appendix. This methodology was used by Pal et al. [19] for gene expression data analysis.

C. Fuzzy clustering.

We use the above-mentioned OFS network to select five features. Using these features, we perform fuzzy clustering with the Fuzzy c-means (FCM) algorithm. FCM may assign a data point to more than one cluster to some degree; the degree to which a data point belongs to a cluster is specified by a membership grade (explained in the Appendix). Fuzzy c-means has been successfully applied to various areas including gene expression data [29]. Let $X^{TR} = X^{TR}_{AML} \cup X^{TR}_{ALL}$ be the training data, where $X^{TR}_{AML}$ represents the AML part of the training data and $X^{TR}_{ALL}$ the ALL part. Suppose we want to extract $C_1$ and $C_2$ rules respectively from the ALL and AML data. Then we cluster $X^{TR}_{ALL}$ into $C_1$ clusters and $X^{TR}_{AML}$ into $C_2$ clusters. Deciding on the optimal number of clusters (rules) per class is related to the cluster validity issue [30], which we do not pursue here.
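To make the per-class clustering step concrete, the following is a minimal sketch of how the ALL and AML parts of the training data could be clustered separately and each cluster turned into the raw material for one rule (prototype, per-feature spread, class label). The helper `fuzzy_c_means` is a hypothetical function of the kind sketched in the FCM appendix at the end of the paper, and the choice of 2 clusters per class is only for illustration.

```python
import numpy as np

def rules_from_class(X_class, n_rules, class_label):
    """Cluster one class's training data with FCM and convert each cluster
    into a (prototype, sigma, class_label) triple for a fuzzy rule."""
    # fuzzy_c_means: hypothetical helper, see the FCM appendix sketch below.
    V, U, labels = fuzzy_c_means(X_class, C=n_rules)
    rules = []
    for i in range(n_rules):
        members = X_class[labels == i]      # points assigned to cluster i after defuzzification
        sigma = members.std(axis=0)         # per-feature spread sigma_ij used by the rule
        rules.append((V[i], sigma, class_label))
    return rules

# X_all, X_aml: training samples of the two classes restricted to the selected genes.
# rule_base = rules_from_class(X_all, 2, "ALL") + rules_from_class(X_aml, 2, "AML")
```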
In this work we experimented with just two different numbers of clusters (rules) per class and found that 2 rules per class gives the best result; hence we report results with 2 rules per class.

D. Fuzzy rule based classifier.

A cluster found in the previous step represents a dense, compact area of the data, and the associated cluster prototype quantizes the cluster. If $v \in R^p$ is a prototype representing a cluster of points


in $R^p$, then we can describe the cluster by the set of points $x \in R^p$ satisfying "x CLOSE TO v", where "CLOSE TO" is a fuzzy set. A prototype $v_i$ (representing a cluster of points) for class k can be translated into a fuzzy rule of the form R_i: If x is CLOSE TO $v_i$ then the class is k. The fuzzy set CLOSE TO v is further represented by a set of p simpler atomic clauses: $x_1$ is CLOSE TO $v_1$ and $x_2$ is CLOSE TO $v_2$ and ... and $x_p$ is CLOSE TO $v_p$. Here $v = (v_1, v_2, \ldots, v_p)^T$ and $x = (x_1, x_2, \ldots, x_p)^T$. In this way we get a set of initial rules. In general, the i-th rule, representing one of the c classes, takes the form: If $x_1$ is CLOSE TO $v_{i1}$ and ... and $x_p$ is CLOSE TO $v_{ip}$ then the class is k. Here p is the number of features and hence the number of atomic clauses. The fuzzy set CLOSE TO $v_{ij}$ is modeled by a Gaussian membership function

$$\mu_{ij}(x_j; v_{ij}, \sigma_{ij}) = \exp\!\left(-\frac{(x_j - v_{ij})^2}{\sigma_{ij}^2}\right),$$

although other choices are possible. We compute $\sigma_{ij}$ as the standard deviation of the j-th component of the data points falling in the i-th cluster. For a given data point x, we first find the firing strength of each rule using the product T-norm:

$$\alpha_i(x) = \prod_{j=1}^{p} \mu_{ij}(x_j; v_{ij}, \sigma_{ij}).$$

Here $\alpha_i(x)$ is the firing strength of the i-th rule on a data point x; it gives the degree of match between x and the antecedent of the i-th rule. The class label of the rule having the maximum firing strength determines the class of x: let $l = \arg\max_i \{\alpha_i(x)\}$; if the l-th rule represents class c, then x is assigned to class c.
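To make the classification step concrete, here is a minimal NumPy sketch of the rule firing and winner-take-all decision described above; the prototype values, spreads, and class labels shown are illustrative placeholders, not the rules learned in the paper, and the 0.01 firing threshold follows the implementation note in the appendix on rule base tuning.

```python
import numpy as np

def firing_strengths(x, V, S):
    """Product-T-norm firing strength of each rule for one sample x.
    V[i, j] = prototype v_ij and S[i, j] = spread sigma_ij of rule i, feature j."""
    memberships = np.exp(-((x - V) ** 2) / (S ** 2))  # Gaussian membership of each atomic clause
    return memberships.prod(axis=1)                   # alpha_i(x), one value per rule

def classify(x, V, S, rule_class, threshold=0.01):
    """Assign x the class of the maximally fired rule; flag outliers."""
    alpha = firing_strengths(x, V, S)
    if alpha.max() < threshold:          # no rule fired appreciably -> treat as an outlier
        return None
    return rule_class[int(np.argmax(alpha))]

# Illustrative 2-rule, 2-feature example (placeholder values only).
V = np.array([[1.0, 2.0], [5.0, 6.0]])   # prototypes
S = np.array([[0.5, 0.5], [1.0, 1.0]])   # per-cluster standard deviations
rule_class = np.array([0, 1])            # class label attached to each rule
print(classify(np.array([1.2, 2.1]), V, S, rule_class))  # -> 0
```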

E. Tuning of the Rule base.

The initial rule base $R^0$ thus obtained can be further tuned to achieve better performance. Let x be from class c and let $R_c$ be the rule from class c giving the maximum firing strength $\alpha_c$ for x. Also let $R_{\neg c}$ be the rule from the incorrect classes having the highest firing strength $\alpha_{\neg c}$ for x. We use the error function $E_x = (1 - \alpha_c + \alpha_{\neg c})^2$ for x to train the rule base. This error function has been used by Chiu [31]. Our goal is to minimize $E = \sum_{x \in X^{TR}} E_x$. To do this we reduce $E_x$ with respect to $v_{cj}$, $v_{\neg cj}$ and $\sigma_{cj}$, $\sigma_{\neg cj}$ of the two rules $R_c$ and $R_{\neg c}$. This refines the rules with respect to their context in the data space. Details of the rule updating procedure are given in the Appendix. The performance of the classifier depends crucially on the adequacy of the number of features and the number of rules. We find that the set of 5 genes selected by the OFS scheme generates a few rules which achieve 100% accuracy on the training set and 94.1% accuracy on the test data. These fuzzy rules, unlike those of an SVM or MLP, are human interpretable and have a biological meaning. Further analysis of the rules extracted using 5 genes allowed us to drop 1 gene; thus, using only 4 genes we could obtain simpler rules without compromising the accuracy of the classifier.

III. RESULTS

A. Generated Rules.

We discuss the effectiveness of the FRBC for AML-ALL classification based on the gene expression data used by Golub et al. [20]. We also demonstrate the elegance of these rules in terms of their human interpretability, unlike the black-box characteristics of a NN and the less interpretable hyperplane of an SVM. In connection with the leukemia data set, the authors of [32] remark, "It contains 2 ALL samples that are consistently misclassified or classified with low confidence by most methods". There are a number


Fig. 1. Rules with 5 features.

Fig. 2. Rules with 4 features.


TABLE I
FIVE BEST FEATURES SELECTED BY OFS, THEIR WITHIN-CLASS MEAN (µ) AND STANDARD DEVIATION (σ), WITH SNR = (µAML − µALL)/(σAML + σALL)

Feature #  GeneID  Name                   µALL   σALL   µAML   σAML   SNR
1          3320    LTC4S U50136           978    319    2562   753    1.48
2          4847    Zyxin X95735_at        350    388    3024   1436   1.46
3          4052    Catalase EC1.11.1.6    1391   1351   4295   1664   0.96
4          4196    PRG1 X17042            1643   1741   7109   3020   1.15
5          1249    MCL1 L08246_at         1067   656    3767   1851   1.08
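As a worked check of the SNR column: for feature 1 (gene 3320), (µAML − µALL)/(σAML + σALL) = (2562 − 978)/(753 + 319) = 1584/1072 ≈ 1.48, matching the tabulated value.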

TABLE II
FIVE BEST FEATURES SELECTED BY OFS, THEIR MIN AND MAX VALUES IN TRAINING AND TEST DATA

Feature #  GeneID  Name                   Min (Training)  Max (Training)  Min (Test)  Max (Test)
1          3320    LTC4S U50136           64              3568            383         3965
2          4847    Zyxin X95735_at        -428            7133            -674        6218
3          4052    Catalase EC1.11.1.6    318             8970            115         7260
4          4196    PRG1 X17042            33              10449           140         11003
5          1249    MCL1 L08246_at         128             7003            190         6718

of possible explanations for this, including incorrect diagnosis of the samples. Using the FRBC, we could classify AML-ALL with 94.1% test accuracy (only 2 misclassifications among the 34 test samples) using features 1, 2, 4 and 5 of the five features selected by OFS (Table I). Using these features we generated 2 rules per class, giving a set of 20 fuzzy sets (one for each feature in each rule). The rule base was then refined. The points correctly classified by these rules involving 20 fuzzy sets are given in Table III; the number within parentheses gives the number of points incorrectly classified by the rule. Figure 1 gives a pictorial representation of the rules. Each panel in Figure 1 corresponds to one feature, and the five fuzzy sets defining a rule are shown in the same color. A careful inspection of Figure 1 reveals that three of the fuzzy sets (in rules 2, 3 and 4) defined on the third feature have almost the same mean, i.e., they represent the same concept. So the third feature does not add discriminatory power to rules 2, 3 and 4.

TABLE III
POINTS CLASSIFIED BY FOUR RULES INVOLVING 20 FUZZY SETS ON FIVE FEATURES

Rule  ALL Training  AML Training  ALL Test  AML Test
1     7 (0)         0 (0)         3 (0)     0 (0)
2     20 (0)        0 (0)         17 (0)    0 (0)
3     0 (0)         4 (0)         0 (0)     6 (1)
4     0 (0)         7 (0)         0 (0)     6 (1)
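Reading the test columns of Table III as a consistency check: the correctly classified test points are 3 + 17 + 6 + 6 = 32 of the 34 test samples, i.e., 32/34 ≈ 94.1%, with the 2 misclassified AML test cases accounted for by the parenthesized counts of rules 3 and 4.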

TABLE IV
POINTS CLASSIFIED BY FOUR RULES INVOLVING 16 FUZZY SETS ON FOUR FEATURES

Rule  ALL Training  AML Training  ALL Test  AML Test
1     17 (0)        0 (0)         11 (0)    0 (0)
2     10 (0)        0 (0)         9 (0)     0 (0)
3     0 (0)         7 (0)         0 (0)     8 (1)
4     0 (0)         4 (0)         0 (0)     4 (1)


TABLE V
POINTS CLASSIFIED BY FOUR RULES INVOLVING 13 FUZZY SETS

Rule  ALL Training  AML Training  ALL Test  AML Test
1     17 (0)        0 (0)         11 (0)    0 (0)
2     10 (0)        0 (0)         9 (0)     0 (0)
3     0 (0)         7 (0)         0 (0)     8 (1)
4     0 (0)         4 (0)         0 (0)     4 (1)

TABLE VI
FOUR RULES WITH THEIR FUZZY SET DEFINITIONS

Rule  Feature 1 (Gene 3320)  Feature 2 (Gene 4847)  Feature 4 (Gene 4196)  Feature 5 (Gene 1249)
1     Low                    Low                    Low                    Low
2     Low-Med                Low                    Low                    Low
3     High-Med               High                   High                   High
4     High                   Medium                 Low-Med                Medium

On the other hand, rule 1 is well separated from the other rules by features 1, 2, 4 and 5, so the third feature may be dropped. We therefore experimented with only 4 features and again generated 2 rules per class (Figure 2). The points classified by these four rules involving 16 fuzzy sets are given in Table IV, which shows that the overall performance remains the same. It is evident from Figure 2 that this set of 16 fuzzy sets can be further reduced to a set of 13 fuzzy sets (4 fuzzy sets on feature 1 and 3 on each of features 2, 4 and 5) without compromising the accuracy. So we perform this simplification of the membership functions and compute the accuracy of the simplified rule base on the data set. Figure 3 displays the simplified rule base and Table V shows the classification performance: both training and test accuracies remain the same for the simplified rule base.

Next we attach human-interpretable linguistic labels to each membership function involved in the rules. Table II shows the minimum and maximum values of each of the five selected features on the training and test data sets. Analyzing the location of each membership function over the domain of its variable, we assign linguistic values such as low, low-medium, high-medium and high. The rule base with these meaningful linguistic values is shown in Table VI. Looking at Tables IV and VI, we can say: i) low expression levels of genes 3320, 4847, 4196 and 1249 classify the 17 and 11 ALL samples in the training and test data respectively; ii) low expression levels of genes 4847, 4196 and 1249 with a low-medium expression level of gene 3320 classify the 10 and 9 ALL samples in the training and test data respectively; iii) high expression levels of genes 4847, 4196 and 1249 with a high-medium expression level of gene 3320 classify the 7 and 8 AML samples in the training and test data respectively; and iv) medium expression levels of genes 4847 and 1249 with a low-medium expression level of gene 4196 and a high expression level of gene 3320 classify 4 AML samples in each of the training and test data. This elegantly demonstrates the physical interpretability of rules derived by the FRBC. It will also eliminate technology-dependent scaling errors [18].

IV. DISCUSSION

We have proposed a method for gene expression data classification. Our method used a connectionist system to select five important features (genes). Then we used exploratory data analysis to extract fuzzy


Fig. 3. The simplified rules with 13 fuzzy sets.

rules for classification based on the five selected features. Analyzing the extracted membership functions, we were able to remove one of the five selected features, and we further simplified the membership functions (and hence the rules). The extracted rules are easy to interpret and are not likely to generalize badly, because fuzzy rules do not respond to areas not represented in the training data. Since the rule base can do an equally good job with only four features, one may wonder whether the NN method selected a bad feature. It did not: as we mentioned earlier, the importance of a feature depends on both the problem and the tool used to solve that problem, so a feature that is important for one classifier may not be as important for another. Although here we have assigned the linguistic labels to the membership functions intuitively, in future we plan to use data analysis techniques to derive such labels. We also want to investigate the effectiveness of the FRBC approach on features selected by other techniques, and to explore the FRBC on multi-class gene expression data and other bioinformatics problems such as protein fold prediction and phylogenetic analysis.

REFERENCES

[1] P. O. Brown and D. Botstein, "Exploring the new world of the genome with DNA microarrays," Nature Genetics, vol. 21, pp. 33–37, 1999.
[2] M. Schena, R. A. Heller, T. P. Theriault, K. Konrad, E. Lachenmeier, and R. W. Davis, "Microarrays: biotechnology's discovery platform for functional genomics," Trends in Biotechnology, vol. 16, no. 7, pp. 301–306, 1998.
[3] R. Bellman, Adaptive Control Processes: A Guided Tour. Princeton University Press, 1961.
[4] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, "Cluster analysis and display of genome-wide expression patterns," Proceedings of the National Academy of Sciences, vol. 95, pp. 14863–14868, 1998.
[5] J. A. Hartigan, Clustering Algorithms. John Wiley and Sons, New York, 1975.
[6] S. Tavazoie, J. D. Hughes, M. J. Campbell, R. J. Cho, and G. M. Church, "Systematic determination of genetic network architecture," Nature Genetics, vol. 22, pp. 281–285, 1999.
[7] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, and E. Dmitrovsky, "Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation," Proceedings of the National Academy of Sciences, vol. 96, pp. 2907–2912, 1999.


[8] J. Theilhaber, T. Connolly, S. Roman-Roman, S. Bushnell, A. Jackson, K. Call, T. Garcia, and R. Baron, "Finding genes in the C2C12 osteogenic pathway by k-nearest-neighbor classification of expression data," Genome Research, vol. 12, no. 1, pp. 165–176, 2002.
[9] T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler, "Support vector machine classification and validation of cancer tissue samples using microarray expression data," Bioinformatics, vol. 16, no. 10, pp. 906–914, 2000.
[10] J. Khan, J. S. Wei, M. Ringner, L. H. Saal, M. Ladanyi, and F. Westermann, "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, vol. 7, pp. 673–679, 2001.
[11] N. R. Pal and K. K. Chintalapudi, "A connectionist system for feature selection," Neural, Parallel and Scientific Computations, vol. 5, pp. 359–382, 1997.
[12] S. B. Cho and J. Ryu, "Classifying gene expression data of cancer using classifier ensemble with mutually exclusive features," in Proceedings of the IEEE, vol. 90, no. 11, 2002, pp. 1744–1753.
[13] S. Min, M. Basu, and A. Toure, "Multi-domain gating network for classification of cancer cells using gene expression data," in Proceedings of the 2002 International Joint Conference on Neural Networks, vol. 1, 2002, pp. 286–289.
[14] A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and N. Yakhini, "Tissue classification with gene expression profiles," Journal of Computational Biology, vol. 7, pp. 559–584, 2000.
[15] S. Mukherjee, P. Tamayo, D. Slonim, A. Verri, T. Golub, J. P. Mesirov, and T. Poggio, "Support vector machine classification of microarray data," MIT Cambridge, Tech. Rep. A.I. Memo No. 1677, 1999.
[16] A. Bárdossy and L. Samaniego, "Fuzzy rule-based classification of remotely sensed imagery," IEEE Transactions on Geoscience and Remote Sensing, vol. 40, no. 2, pp. 362–374, 2002.
[17] J. F.-F. Yao and J.-S. Yao, "Fuzzy decision making for medical diagnosis based on fuzzy number and compositional rule of inference," Fuzzy Sets and Systems, vol. 120, no. 2, pp. 351–366, 2001.
[18] L. O. Machado, S. Vinterbo, and G. Weber, "Classification of gene expression data using fuzzy logic," Journal of Intelligent Fuzzy Systems, vol. 12, pp. 19–24, 2002.
[19] N. R. Pal, A. Sharma, S. K. Sanadhya, and Karmeshu, "On identifying marker genes from gene expression data in a neural framework through online feature analysis," communicated to International Journal of Intelligent Systems, 2005.
[20] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, and J. P. Mesirov, "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, pp. 531–537, 1999.
[21] Zhou, "LS bound based gene selection for DNA microarray data," BMC Bioinformatics, vol. 20, pp. 1093–1102, 2004.
[22] Wang, "Tumor classification and marker gene prediction by feature selection and fuzzy c-means clustering using microarray data," BMC Bioinformatics, vol. 60, p. 4, 2003.
[23] Liu, "A combinational feature selection and ensemble neural network method for classification of gene expression data," BMC Bioinformatics, vol. 4, p. 136, 2004.
[24] K. Bae and B. K. Mallick, "Gene selection using a two-level hierarchical Bayesian model," BMC Bioinformatics, vol. 20, p. 18, 2004.
[25] R. Alexandridis, S. Lin, and M. Irwin, "Class discovery and classification of tumor samples using mixture modeling of gene expression data - a unified approach," BMC Bioinformatics, vol. 20, p. 16, 2004.
[26] P. A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach. Prentice Hall, London, 1982.
[27] R. De, N. R. Pal, and S. K. Pal, "Feature analysis: neural network and fuzzy set theoretic approaches," Pattern Recognition, vol. 30, no. 10, pp. 1579–1590, 1997.
[28] N. R. Pal and E. V. Kumar, "Two efficient connectionist schemes for structure preserving dimensionality reduction," IEEE Transactions on Neural Networks, vol. 9, no. 6, pp. 1142–1153, 1998.
[29] J. Wang, T. H. Bo, I. Jonassen, O. Myklebost, and E. Hovig, "Tumor classification and marker gene prediction by feature selection and fuzzy c-means clustering using microarray data," BMC Bioinformatics, vol. 60, pp. 1471–1482, 2003.
[30] N. R. Pal and J. C. Bezdek, "On cluster validity for the fuzzy c-means model," IEEE Transactions on Fuzzy Systems, vol. 3, no. 3, pp. 370–379, 1995.
[31] S. L. Chiu, "Fuzzy model identification based on cluster estimation," Journal of Intelligent Fuzzy Systems, vol. 2, no. 3, pp. 267–278, 1994.
[32] J. P. Brunet, P. Tamayo, T. R. Golub, and J. P. Mesirov, "Metagenes and molecular pattern discovery using matrix factorization," Proceedings of the National Academy of Sciences, vol. 101, no. 12, pp. 4164–4169, 2004.
[33] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, Norwell, MA, USA, 1981.

APPENDIX: ONLINE FEATURE SELECTION

To use a gate, we multiply each input feature value by its gate function value, and the modulated feature value is passed into the network. The gate functions attenuate the features before they propagate through the net, so we may call them attenuating functions. A simple way of identifying useful gate functions is to check whether the function $F_i : R \to [0, 1]$ gives affirmative answers to the following questions: (i) does it have a tunable parameter, and is it differentiable with respect to that parameter? (ii) is it monotonic with respect to its tunable parameter? The sigmoidal function satisfies these criteria and is the one we use in this paper.


The basic philosophy of learning is to keep all gates almost closed at the beginning of training (i.e., no feature is important) and then open the gates as required during training. To complete the description in connection with the MLP, let $F_i$ be the gate (attenuation) function associated with the i-th input feature, with argument $m_i$, and let $F_i'(m_i)$ be the value of the derivative of the attenuation function at $m_i$. Let $\mu$ be the learning rate of the attenuation parameter, $\nu$ the learning rate of the connection weights, $x_i$ the i-th component of an input vector x, $x_i' = x_i F_i(m_i)$ the attenuated value of $x_i$, $w^0_{ij}$ the weight connecting the j-th node of the first hidden layer to the i-th node of the input layer, and $\delta^1_j$ the error term for the j-th node of the first hidden layer [11]. It can easily be shown that, except for $w^0_{ij}$, the update rules for all weights remain the same as those of an ordinary MLP trained with backpropagation. Assuming that the first hidden layer has q nodes, the update rules for $w^0_{ij}$ and $m_i$ are:

$$w^0_{ij,\mathrm{new}} = w^0_{ij,\mathrm{old}} - \nu\, x_i\, \delta^1_j\, F(m_i) \qquad (1)$$

$$m_{i,\mathrm{new}} = m_{i,\mathrm{old}} - \mu\, x_i \left( \sum_{j=1}^{q} w^0_{ij}\, \delta^1_j \right) F'(m_i) \qquad (2)$$
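As a rough illustration of how the gates modulate the inputs and how their parameters could be updated alongside ordinary backpropagation, the following sketch implements the sigmoidal gate and the two special update rules (1) and (2) for the input layer of an MLP. The layer sizes, learning rates, and the assumption that the hidden-layer error terms `delta1` are already available from standard backpropagation are illustrative choices, not specifics from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate_forward(x, m):
    """Attenuate each input feature by its gate value F(m_i) = sigmoid(m_i)."""
    return x * sigmoid(m)                        # x'_i = x_i * F(m_i)

def ofs_update(x, m, W0, delta1, mu=0.01, nu=0.01):
    """One OFS-style update of the input-to-hidden weights W0 (shape p x q)
    and of the gate parameters m (shape p), given hidden-layer error terms
    delta1 (shape q) obtained from ordinary backpropagation."""
    F = sigmoid(m)
    dF = F * (1.0 - F)                            # derivative of the sigmoid gate
    W0_new = W0 - nu * np.outer(x * F, delta1)    # eq. (1): w0_ij <- w0_ij - nu * x_i * delta1_j * F(m_i)
    m_new = m - mu * x * (W0 @ delta1) * dF       # eq. (2): m_i <- m_i - mu * x_i * (sum_j w0_ij delta1_j) * F'(m_i)
    return W0_new, m_new

# Start with gates almost closed: a large negative m gives F(m) close to 0.
p, q = 5, 3
m = np.full(p, -5.0)
W0 = 0.1 * np.random.randn(p, q)
```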

As mentioned earlier, several choices are possible for the gate function, but here we use the sigmoidal function $F(m) = 1/(1 + e^{-m})$. The p gate parameters are initialized so that when training starts, F(m) is practically zero for all gates, i.e., no feature is allowed to enter the network. As the gradient descent learning proceeds, gates for the features that can reduce the error faster are opened faster. The learning of the gate functions continues along with that of the other weights of the network. At the end of training we can pick the important features based on the values of the attenuation functions. Typically, training can be stopped when the training error is reduced to an acceptable level; in this study we stopped training when the training error was reduced to 0.00001 and the misclassification count became 0. Note that different initializations of the network may lead to different subsets of good features (genes). If this happens, it indicates that there are different sets of features that can do the classification job equally well. One may rank the features based on the extent to which the gates are opened and use a set of top-ranked features. This is expected to do a good job because OFS looks at all the features at a time during the training process; consequently, two correlated features are not likely to appear together as good features.

APPENDIX: THE FUZZY C-MEANS ALGORITHM

The Fuzzy C-Means (FCM) clustering algorithm [33] attempts to cluster data vectors into C groups based on the distances between them. The FCM algorithm minimizes the objective function

$$J = \sum_{i=1}^{C} \sum_{k=1}^{N} u_{ik}^{m}\, \|x_k - v_i\|^2,$$

subject to

$$\sum_{i=1}^{C} u_{ik} = 1 \quad \forall\, k = 1, 2, \ldots, N \qquad \text{and} \qquad 0 \le u_{ik} \le 1 \quad \forall\, i, k.$$

Here $m > 1$ is the fuzzifier, $u_{ik}$ denotes the membership of the k-th data vector to the i-th cluster, and $v_i \in R^p$ is the


centroid of the i-th cluster. The first-order necessary conditions on U and V at a local minimum of J are:

$$u_{ik} = \left[ \sum_{j=1}^{C} \left( \frac{\|x_k - v_i\|}{\|x_k - v_j\|} \right)^{\frac{2}{m-1}} \right]^{-1}, \quad \forall\, i, k \qquad (3)$$

and

$$v_i = \frac{\sum_{k=1}^{N} u_{ik}^{m}\, x_k}{\sum_{k=1}^{N} u_{ik}^{m}}, \quad \forall\, i. \qquad (4)$$

The algorithm iterates between equations (4) and (3), in that order, as described below.

Algorithm: Fuzzy C-Means
1) Initialize a valid fuzzy c-partition $U = [u_{ik}]_{C \times N}$.
2) Compute a new set of prototypes using eq. (4).
3) Compute a new partition matrix using eq. (3) with these new prototypes.
4) Repeat this process (steps 2 and 3 alternately) till the entries of the partition matrix stabilize.
5) Defuzzification: assign the data vector $x_k$ to the cluster for which its membership value $u_{ik}$ is largest.
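For concreteness, here is a compact NumPy sketch of the alternating optimization just described, initializing the partition matrix and iterating eqs. (4) and (3) until the memberships stabilize. The fuzzifier value, tolerance, and iteration cap are illustrative defaults rather than settings reported in the paper.

```python
import numpy as np

def fuzzy_c_means(X, C, m=2.0, tol=1e-5, max_iter=300, seed=0):
    """X: (N, p) data matrix. Returns prototypes V (C, p), memberships U (C, N),
    and hard labels from the defuzzification step."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((C, N))
    U /= U.sum(axis=0, keepdims=True)                 # valid fuzzy c-partition: columns sum to 1
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)  # eq. (4): membership-weighted centroids
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)  # (C, N) distances to prototypes
        d = np.fmax(d, 1e-12)                          # guard against division by zero
        U_new = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0)), axis=1)  # eq. (3)
        if np.max(np.abs(U_new - U)) < tol:            # partition matrix has stabilized
            U = U_new
            break
        U = U_new
    labels = U.argmax(axis=0)                          # step 5: defuzzification
    return V, U, labels
```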

The same procedure can be carried out by initializing the prototypes instead of the partition matrix, in which case the algorithm iterates between equations (3) and (4) in that order. The convergence properties remain the same under both schemes of initialization. As the value of m increases, the algorithm produces fuzzier partitions [33].

APPENDIX: TUNING OF THE RULE BASE

Here we give an algorithmic description of the rule refinement procedure when the product is used to compute the firing strength. The tuning process is repeated until the rate of decrease in E becomes negligible.

Begin
  Choose learning parameters η_m and η_s.
  Choose a parameter reduction factor 0 < ε < 1.
  Choose the maximum number of iterations, maxiter.
  Compute the error E_0 for the initial rule base R^0.
  Compute the misclassification M_0 corresponding to the initial rule base R^0.
  t ← 1
  While (t ≤ maxiter) do
    For each vector x ∈ X^{TR}
      Find the rules R_c and R_{¬c}.
      Modify the parameters of rules R_c and R_{¬c} as follows.
      For k = 1 to p

        $$v_{ck}^{new} = v_{ck}^{old} - \eta_m \frac{\partial E}{\partial v_{ck}^{old}} = v_{ck}^{old} + \eta_m (1 - \alpha_c + \alpha_{\neg c})\, \frac{\alpha_c}{(\sigma_{ck}^{old})^2}\, (x_k - v_{ck}^{old})$$

        $$v_{\neg ck}^{new} = v_{\neg ck}^{old} - \eta_m \frac{\partial E}{\partial v_{\neg ck}^{old}} = v_{\neg ck}^{old} - \eta_m (1 - \alpha_c + \alpha_{\neg c})\, \frac{\alpha_{\neg c}}{(\sigma_{\neg ck}^{old})^2}\, (x_k - v_{\neg ck}^{old})$$

        $$\sigma_{ck}^{new} = \sigma_{ck}^{old} - \eta_s \frac{\partial E}{\partial \sigma_{ck}^{old}} = \sigma_{ck}^{old} + \eta_s (1 - \alpha_c + \alpha_{\neg c})\, \frac{\alpha_c}{(\sigma_{ck}^{old})^3}\, (x_k - v_{ck}^{old})^2$$

        $$\sigma_{\neg ck}^{new} = \sigma_{\neg ck}^{old} - \eta_s \frac{\partial E}{\partial \sigma_{\neg ck}^{old}} = \sigma_{\neg ck}^{old} - \eta_s (1 - \alpha_c + \alpha_{\neg c})\, \frac{\alpha_{\neg c}}{(\sigma_{\neg ck}^{old})^3}\, (x_k - v_{\neg ck}^{old})^2$$

      End For
    End For
    Compute the error E_t for the new rule base R^t.
    Compute the misclassification M_t for R^t.
    If M_t > M_{t−1} or E_t > E_{t−1} then
      η_m ← (1 − ε)η_m
      η_s ← (1 − ε)η_s
      R^t ← R^{t−1}
      /* If the error has increased, the learning coefficients are possibly too large, so we decrease them and restore the rule base to R^{t−1}. */
    If M_t = 0 or E_t = 0 then Stop
    t ← t + 1
  End While
End

At the end of the rule base tuning we get the final rule base R^{final}, which is expected to give a very low error rate. Since a Gaussian membership function extends to infinity, for any data point all rules will fire to some extent. In our implementation, if the firing strength is less than a threshold (0.01), the rule is considered not to have fired. If no rule is fired by a data point, that point can be regarded as an outlier. If this happens for some test data, it indicates an observation not close enough to the training data, and consequently no conclusion should be drawn about such test points.
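The gradient step above is mechanical enough to state in a few lines of code. The sketch below applies one pass of the update rules to the prototype and spread matrices introduced in the earlier classification sketch; the learning-rate values and the full algorithm's error monitoring, learning-rate reduction, and rollback are omitted, and the helper `firing_strengths` is the one sketched in Section II-D.

```python
import numpy as np

def tune_epoch(X, y, V, S, rule_class, eta_m=0.05, eta_s=0.05):
    """One pass of gradient tuning over the training set.
    V, S: (n_rules, p) prototypes and spreads; rule_class: class label of each rule."""
    for x, cls in zip(X, y):
        alpha = firing_strengths(x, V, S)          # product-T-norm firing strengths (Section II-D sketch)
        own = np.where(rule_class == cls)[0]       # rules of the true class of x
        other = np.where(rule_class != cls)[0]     # rules of the competing classes
        c = own[np.argmax(alpha[own])]             # R_c: best own-class rule
        nc = other[np.argmax(alpha[other])]        # R_notc: best competing rule
        err = 1.0 - alpha[c] + alpha[nc]           # square root of E_x = (1 - alpha_c + alpha_notc)^2
        dc, dnc = x - V[c], x - V[nc]              # use old prototype values in all four updates
        # Move the winning own-class rule toward x and widen it ...
        V[c]  += eta_m * err * alpha[c]  * dc       / S[c] ** 2
        S[c]  += eta_s * err * alpha[c]  * dc ** 2  / S[c] ** 3
        # ... and push the best competing rule away from x and narrow it.
        V[nc] -= eta_m * err * alpha[nc] * dnc      / S[nc] ** 2
        S[nc] -= eta_s * err * alpha[nc] * dnc ** 2 / S[nc] ** 3
    return V, S
```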