
Int. J. Comput. Intelligence in Bioinformatics and Systems Biology, Vol. 1, No. 3, 2010

A comparative study of multi-classification methods for protein fold recognition

Ioannis K. Valavanis*
Faculty of Electrical and Computer Engineering, National Technical University of Athens, 9 Iroon Polytechniou Str., 15780 Zografou, Athens, Greece
Fax: +30 210 772 2320
E-mail: [email protected]
*Corresponding author

George M. Spyrou
Biomedical Informatics Unit, Biomedical Research Foundation, Academy of Athens, 4 Soranou Efessiou Str., 115 27 Athens, Greece
E-mail: [email protected]

Konstantina S. Nikita
Faculty of Electrical and Computer Engineering, National Technical University of Athens, 9 Iroon Polytechniou Str., 15780 Zografou, Athens, Greece
Fax: +30 210 772 2320
E-mail: [email protected]

Abstract: Fold recognition based on sequence-derived features is a complex multi-class classification problem. In the current study, we comparatively assess five different classification techniques, namely multilayer perceptron and probabilistic neural networks, nearest neighbour classifiers, multi-class support vector machines and classification trees, for fold recognition on a reference set of proteins that are organised in 27 folds and are described by 125-dimensional vectors of sequence-derived features. We evaluate all classifiers in terms of total accuracy, mutual information coefficient, sensitivity and specificity measurements using a ten-fold cross-validation method. A polynomial support vector machine and a multilayer perceptron with one hidden layer of 88 nodes performed best, achieving satisfactory multi-class classification accuracies (42.8% and 42.1%, respectively) given the complexity of the problem and the similar classification performances reported by other researchers.

Keywords: protein; sequence-derived features; neural network; NN; support vector machine; SVM; classification tree; nearest neighbour classifier; multi-class.

Copyright © 2010 Inderscience Enterprises Ltd.


Reference to this paper should be made as follows: Valavanis, I.K., Spyrou, G.M. and Nikita, K.S. (2010) 'A comparative study of multi-classification methods for protein fold recognition', Int. J. Computational Intelligence in Bioinformatics and Systems Biology, Vol. 1, No. 3, pp.332–346.

Biographical notes: Ioannis K. Valavanis received his Diploma in Electrical and Computer Engineering in 2003 from the National Technical University of Athens (N.T.U.A.), Greece, his MSc in Bioinformatics in 2006 from the University of Athens, and his PhD in 2009 from N.T.U.A. His PhD research included the use of network-based methods for the analysis of protein structural/sequence data and data related to multifactorial diseases. His scientific interests include bioinformatics, medical and genetic data analysis, and artificial intelligence and its applications. He is the co-author of 16 articles published in international conferences or journals and two published as book chapters.

George M. Spyrou received his BSc in Physics from the University of Athens, Greece, and holds Master of Science degrees in Medical Physics and in Bioinformatics. During his PhD in Medical Physics, he worked on algorithms and simulations applied to medical problems, especially breast cancer imaging. Currently, he is a Senior Research Scientist in the Biomedical Informatics Unit of the Biomedical Research Foundation of the Academy of Athens (BRFAA), leading the Modeling and Computational Intelligence in Biomedical Informatics Group. He has also been appointed Head of the Department of Informatics and New Technologies at BRFAA.

Konstantina S. Nikita received her Diploma in Electrical Engineering and her PhD from the National Technical University of Athens (NTUA), Greece, and her MD from the Medical School, University of Athens. Currently, she is a Professor at the Department of Electrical and Computer Engineering, NTUA. She has co-authored 130 papers in refereed international journals and chapters in books, and more than 220 papers in conference proceedings. She is a member of the Editorial Board of the IEEE Transactions on Biomedical Engineering and a Guest Editor of several international journal issues on biomedical engineering subjects. She was the recipient of the 2003 Bodossakis Foundation Academic Prize for exceptional achievements in 'Theory and applications of information technology in medicine'.

1 Introduction

Today, large-scale sequencing projects rapidly produce protein sequences, while the number of known three-dimensional protein structures increases slowly due to the still expensive and time-consuming determination of a protein's structure through X-ray crystallography or nuclear magnetic resonance (NMR) experiments. Furthermore, structure determination for several proteins, e.g., transmembrane proteins or some large proteins, may not be possible with these techniques. It is worth noting that the UniProtKB/TrEMBL database (Release 38.1, 18-3-2008) (Apweiler et al., 2004) contains 5,443,281 protein sequence entries, while the number of protein structures stored in the Protein Data Bank (PDB, 25-3-2008) (Berman et al., 2000) is 45,906. Thus, the need for extracting structural information through the computational analysis of protein sequences has become very important, and much research has been conducted towards this goal in recent years.
In particular, the prediction of the fold of a query protein from its primary sequence has become a very challenging task. A fold is a three-dimensional pattern that, according to the SCOP hierarchical classification of protein structures (Hubbard et al., 1999), is characterised by a set of major secondary structural elements (e.g., α-helices and β-sheets) with a certain arrangement and topological connections. Each fold belongs to one of four structural classes, namely α, β, α + β and α/β, which characterise a protein at an upper level of structural organisation depending on which secondary structural elements dominate the structure and how these elements succeed one another in it. It has been estimated that the number of folds is ~1,000 (Chothia, 1992), a very small number compared to the number of proteins. Taking into account, as well, that the fold a protein belongs to is essential for its function, we can infer the importance of methods that predict the fold of a protein from its primary sequence.

One family of methods used to assign a fold to a protein sequence relies on alignment of the query sequence, assigning to it the fold of a protein that shows an adequate level of similarity. Alignment can be performed in one of the following ways: sequence-sequence (Needleman and Wunsch, 1970; Vingron and Waterman, 1994); sequence-profile (Baldi et al., 1994; Gough et al., 2001), where the query sequence is aligned with a sequence pattern called a profile; profile-profile (Thompson et al., 1994; Marti-Renom et al., 2004); and sequence-structure (Skolnick and Kihara, 2001; Kim et al., 2003), where the query sequence is aligned to template structures. Profile-profile alignment methods are the most sensitive of the sequence-based alignment methods at detecting distant homologues with low sequence similarity (< 20%), while the performance of sequence-structure alignment depends on data derived from known structures, which are still very small in number compared to protein sequences.

Another strategy for fold prediction is to use sequence-derived features combined with classification techniques. This family of methods has gained great interest since the work described in Craven et al. (1995). Craven et al. extracted several sequence-derived attributes, i.e., average residue volume, charge and polarity composition, predicted secondary structure composition, isoelectric point and the Fourier transform of the hydrophobicity function, from a set of 211 proteins belonging to 16 folds, and used these attributes to train and test decision trees, k nearest neighbour (kNN) and neural network (NN) classifiers on the 16-class fold assignment problem. Fold prediction becomes more challenging as more classes appear and the classification problem grows more complex. Work in this field has included the application of one-versus-others (Dubchak et al., 1999) and unique one-versus-others and all-versus-all methods (Ding and Dubchak, 2001), all of which use NNs or support vector machines (SVMs) as classifiers in multiple binary classification tasks. In Marsolo et al. (2005), the authors used Bayesian classifiers, decision trees and SVMs as building blocks in a multi-level classification scheme in order to classify proteins first at class level and then at fold level, while in Suganthan and Kalyanmoy (2004), the authors combined evolutionary algorithms for the selection of the most informative sequence-derived features with SVMs as classification modules in the fold recognition problem.

In the current study, we revisit the multi-class classification problem in the context of protein fold recognition based on sequence-derived features. We use a non-redundant set of proteins organised in 27 folds and four classes according to SCOP (Hubbard et al., 1999) and apply several classifiers based on computational intelligence or statistics, namely
NNs, SVMs, kNN classifiers and classification trees (CTs), to the 27-class classification problem. Our aim is to identify the optimal classification strategy for this complex multi-class problem; thus, the classification task is not separated into multiple binary problems. Classifiers are built for various parameters that define their internal architecture and are evaluated through appropriate accuracy measurements. Finally, the re-sampling technique of cross-validation is used in order to avoid biased results. In what follows, the dataset and the classification techniques used are described in Section 2, and the strategy for their evaluation in Section 3; results related to the performance of the classifiers are reported and discussed in Section 4. Finally, the conclusions drawn from the current study are presented in Section 5.

2 Materials and methods

2.1 Dataset

A well-defined reference dataset of 311 proteins (Ding and Dubchak, 2001; Marsolo et al., 2005; Suganthan and Kalyanmoy, 2004) was used. This dataset, extracted in its initial form using PDB_Select sets (Hobohm and Sander, 1994), is a non-redundant set of proteins that belong to 27 folds according to SCOP (Hubbard et al., 1999) and contains proteins with no more than 35% sequence identity for aligned subsequences longer than 80 residues. For each protein sequence, a set of 125 sequence-derived features concerning amino acid composition (20 features), predicted secondary structure (21 features), hydrophobicity (21 features), normalised van der Waals volume (21 features), polarity (21 features) and polarisability (21 features) was extracted. Apart from the amino acid composition features, which contain the percentage occurrences of the amino acids in the primary sequence, all other features were extracted using a classification of all amino acids into three classes per attribute (e.g., polar, neutral and hydrophobic for the hydrophobicity attribute); a sketch of the composition features is given after Table 1. The exact way these features were extracted is described in detail in the work of Dubchak et al. (1999, 1995). In this study, the values of the features were normalised in [–1, 1]. The distribution of the protein sequences of the dataset in the 27 folds and four structural classes (α, β, α/β, α + β) is presented in Table 1.

Table 1  Distribution of proteins within dataset in folds and classes

No.  Fold                                  N seq
α class
1    Globin-like                           13
2    Cytochrome c                           7
3    DNA-binding 3-helical bundle          12
4    4-helical up-and-down bundle           7
5    4-helical cytokines                    9
6    Alpha; EF-hand                         6
β class
7    Immunoglobin-like β-sandwich          30
8    Cupredoxins                            9
9    Viral coat and capsid protein         16
10   ConA-like lectins/glucanases           7
11   SH3-like barrel                        8
12   OB-fold                               13
13   Trefoil                                8
14   Trypsin-like proteases                 9
15   Lipocalines                            9
α/β class
16   (TIM)-barrel                          29
17   FAD (also NAD)-binding motif          11
18   Flavodoxin-like                       11
19   NAD(P)-binding Rossman-fold           13
20   P-loop containing nucleotide          10
21   Thioredoxin-like                       9
22   Ribonuclease H-like motif             10
23   Hydrolases                            11
24   Periplasmic binding protein-like      11
α + β class
25   β-grasp                                7
26   Ferredoxin-like                       13
27   Small inhibitors, toxins, lectins     13

Note: N seq is the number of sequences of each fold.
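To make the feature extraction concrete, the following minimal Python sketch (our illustration, not the authors' code) computes the 20 amino acid composition features described above, i.e., the percentage occurrence of each residue in the primary sequence; the helper name and the use of standard one-letter residue codes are assumptions.

```python
# Illustrative sketch: 20 amino acid composition features (percentage
# occurrence of each residue), one of the six feature groups of Section 2.1.
AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'  # standard one-letter codes (assumed)

def composition_features(sequence):
    """Return the 20 composition features for a primary sequence."""
    sequence = sequence.upper()
    return [100.0 * sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]
```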

2.2 Classifiers

We utilised four different classification methods from the fields of statistics and computational intelligence: NNs, SVMs, CTs and kNN classifiers. For constructing and testing the NNs and CTs, the corresponding toolboxes of Matlab 7.0 (Mathworks, http://www.mathworks.com) were utilised, while the kNN classifier was developed in the same programming environment. The SVMlight toolbox (SVM multiclass, http://www.citeulike.org/user/pjcite/article/3338163) was used for the implementation of the multi-class SVM classifier. In the following, each classification method is briefly described.


2.2.1 Neural networks

NNs consist of a number of processing units, called neurons, ordered in layers that work together to produce an output. The interconnections of the neurons, which are activated using linear or non-linear activation functions, enable a NN to recognise non-linear relationships between the input variables and the output. The internal structure of a NN (e.g., the interconnection weights) can be adjusted for a specific classification problem during the training process. This process is a repetitive procedure of updating the NN internal structure towards the minimisation of an error function between the NN's output and the desired output. Two different NN architectures, the multi-layer perceptron NN (MLP-NN) and the probabilistic NN (P-NN), were used in the current study.

1 Multi-layer perceptron neural network

The MLP-NN classifier (Haykin, 1999) used in this study is a feed-forward NN consisting of one input layer with a number of input neurons equal to the number of features fed into the NN, one hidden layer with a variable number of neurons, and one output layer of 27 output neurons, one for each of the 27 protein folds. The log-sigmoid and tan-sigmoid activation functions were employed for the hidden and the output layers, respectively. The use of the log-sigmoid function in the hidden layer led us to normalise the values of the 125 features describing each protein sequence in [–1, 1]. The tan-sigmoid function produces an output value for each of the 27 output neurons; a training example of a protein sequence that belongs to fold i is assigned a value equal to 1 in the ith output neuron, while all other output neuron values are set to 0. The MLP-NN was trained on the training set by the batched back-propagation algorithm with adaptive learning rate and momentum. When the MLP-NN is used to classify a query protein, the values of the 125 sequence-derived features are fed to the input layer and the protein is assigned to the fold that corresponds to the most activated output neuron, i.e., the output neuron with the greatest output value. The MLP-NN classifier was trained and tested for a variable number Nhidden of hidden neurons (Nhidden = 8, 16, 24, …, 136).
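As an illustration of this architecture, a minimal scikit-learn sketch is given below. It is not the authors' Matlab implementation: MLPClassifier uses a softmax output layer rather than 27 tan-sigmoid output neurons, the momentum value is an assumption, and X_train/y_train are assumed to hold the scaled 125-dimensional feature vectors and the fold labels.

```python
# Minimal sketch of a one-hidden-layer MLP for the 27-fold problem
# (an approximation of the paper's Matlab network, not a reproduction).
from sklearn.neural_network import MLPClassifier

def build_mlp(n_hidden=88):
    return MLPClassifier(hidden_layer_sizes=(n_hidden,),  # one hidden layer
                         activation='logistic',           # log-sigmoid hidden units
                         solver='sgd',                    # back-propagation
                         learning_rate='adaptive',        # adaptive learning rate
                         momentum=0.9,                    # momentum term (assumed value)
                         max_iter=2000,
                         random_state=0)

# mlp = build_mlp(88).fit(X_train, y_train)
# fold = mlp.predict(x_query.reshape(1, -1))  # index of the most activated output
```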

2 Probabilistic neural network

The P-NN (Kadah et al., 1996) consists of one input layer with a number of neurons equal to the number of features used, a hidden layer, also known as the summation layer, and an output layer. In order to classify a query protein sequence, the corresponding sequence-derived feature vector X is applied to the input layer, which computes the distances of X from the feature vectors in the training set. In the summation layer, a probability density function p(X | Ci) is estimated for each class Ci, i = 1, …, 27, corresponding to the ith fold, taking into account the classes to which a certain amount of the training examples belong. Finally, the neuron in the output layer classifies the input feature vector into the class with the highest probability density. Several values of the spread parameter S, which defines the amount of training examples taken into account for the classification of a new feature vector, were used (S = 0.05, 0.1, 0.2, …, 1.6).
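A minimal numpy sketch of this decision rule follows, under the assumption that the summation layer uses Gaussian kernels of width equal to the spread S (the common Parzen-window formulation of the P-NN); all names are illustrative.

```python
# Illustrative P-NN decision rule: estimate p(X | C_i) per fold with Gaussian
# kernels of width `spread` and pick the fold with the highest density.
import numpy as np

def pnn_predict(X_train, y_train, x, spread=0.4):
    sq_dist = np.sum((X_train - x) ** 2, axis=1)     # input layer: distances
    kernel = np.exp(-sq_dist / (2.0 * spread ** 2))  # Gaussian activations
    classes = np.unique(y_train)
    # summation layer: mean activation per class estimates p(x | C_i)
    density = [kernel[y_train == c].mean() for c in classes]
    return classes[int(np.argmax(density))]          # output layer: argmax
```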


2.2.2 k Nearest neighbour

The kNN classifier is a very popular statistical classifier because of its simplicity and ease of implementation. When a new feature vector is to be classified, the kNN classifier identifies its k nearest neighbours in the training set; here, the Euclidean distance was used as the similarity measurement. The feature vector is classified to the most frequent class among its k neighbours. In the current study, a 1NN classifier was used first, which simply assigns a query protein sequence to the fold of the training-set protein with the most similar sequence-derived feature vector. Next, kNN classifiers (k > 1) using distance-based weights (Wu et al., 2002; Mougiakakou et al., 2007) were used. When a feature vector X is to be classified, the kNN classifier finds the k patterns (Y1, Y2, …, Yk) in the training set nearest to X based on the Euclidean distance function D. Each class Ci, i = 1, …, 27, corresponding to the 27 folds, contests the feature vector with a voting power wi:

w_i = Σ_{j=1}^{k} D(X, Y_j)^{−1} f(Y_j, C_i),   i = 1, …, 27        (1)

where the function f is defined as:

f(Y_j, C_i) = 1 if C_i is the class of Y_j, and 0 otherwise.        (2)

The query protein sequence is finally assigned to the class that accumulates the maximum weight among the neighbours. The kNN classifiers were used for k = 1, 3, 6, 9, 12, 28, 56 and 84.
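A minimal numpy sketch of the vote of equations (1) and (2) follows; X_train, y_train and the query x are assumed given, and the small eps guard is our addition to avoid division by zero for duplicate vectors.

```python
# Illustrative distance-weighted kNN vote, equations (1)-(2).
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=3):
    dist = np.sqrt(np.sum((X_train - x) ** 2, axis=1))  # Euclidean distances D
    nearest = np.argsort(dist)[:k]                      # the k nearest patterns
    eps = 1e-12                                         # guard (our addition)
    votes = {}
    for j in nearest:
        # equation (1): the class of Y_j receives voting power D(X, Y_j)^-1;
        # equation (2) restricts each term to the matching class only
        c = y_train[j]
        votes[c] = votes.get(c, 0.0) + 1.0 / (dist[j] + eps)
    return max(votes, key=votes.get)                    # maximum total weight
```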

2.2.3 Classification tree

CTs comprise a non-parametric discriminative method (Breiman et al., 1984). The CT recursively partitions the initial set of protein sequences in the training set into subsets using one of the features that describe the protein sequences. The initial set is sorted according to each of the features used and then split into two subsets, testing different threshold levels for each variable. The variable and threshold that result in the best split (in terms of the purity of the resulting subsets) are finally chosen to split the set. The Gini index criterion of impurity (Breiman et al., 1984) was used for node splitting. After the CT has been built, the classification of a query protein sequence starts from the feature assigned to the root, which is used to make the first binary decision for the query protein sequence. The protein sequence then travels towards the leaves of the CT until it is assigned to one of the classes of the CT, i.e., a fold.
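A minimal scikit-learn equivalent of this procedure (shown for illustration; the paper used the Matlab toolbox) is:

```python
# Illustrative classification tree with Gini-impurity node splitting.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion='gini', random_state=0)
# tree.fit(X_train, y_train)                    # recursive binary splits
# fold = tree.predict(x_query.reshape(1, -1))   # root-to-leaf traversal
```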

2.2.4 Support vector machine

The SVM is a promising classification method initially developed for binary classification problems. The SVM, which is a margin method, draws an optimal hyperplane (determined by w, b) in a high-dimensional space that defines a boundary maximising the margin between the data samples of the two classes, so that it generalises well to unseen data (Vapnik, 1995). The decision function for a feature vector X is

f(X) = w · φ(X) + b        (3)

where φ is a mapping of the feature vectors to the high-dimensional space; the kernel function of the SVM defines the mapping φ. Instead of using a binary SVM and adopting techniques to incorporate it into a multi-class problem, e.g., one-versus-all or all-versus-all (Ding and Dubchak, 2001), we chose to use a recently developed multi-class SVM (Crammer and Singer, 2001). Thus, the multi-class fold recognition problem requires the training of a single SVM classifier and not of a large number of binary SVMs, as done in other studies (Dubchak et al., 1999; Ding and Dubchak, 2001). The optimisation problem is solved using an efficient cutting plane algorithm that exploits the sparseness and structural decomposition of the problem (Tsochantaridis et al., 2004). Four different multi-class SVM classifiers were built and evaluated on the available dataset, based on four different kernel functions: linear, radial basis, polynomial and sigmoid.
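For illustration, a sketch of evaluating the four kernels follows. Note the hedge: the paper trains a single Crammer and Singer (2001) multi-class SVM with SVMmulticlass, whereas scikit-learn's SVC decomposes the problem into one-versus-one binary SVMs, so this sketch only approximates the setup; the regularisation value and degree are assumptions.

```python
# Illustrative comparison of the four kernel functions (approximation only;
# the paper used the single multi-class formulation of SVMmulticlass).
from sklearn.svm import SVC

kernels = ['linear', 'poly', 'rbf', 'sigmoid']
models = {name: SVC(kernel=name, degree=3, C=1.0) for name in kernels}
# for name, clf in models.items():
#     clf.fit(X_train, y_train)
#     print(name, clf.score(X_test, y_test))  # total accuracy per kernel
```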

3 Evaluation of classifiers

Each classifier was trained and tested using the available dataset of 311 protein sequences and their sequence-derived feature vectors, on the basis of appropriate evaluation measurements and a ten-fold cross-validation technique.

Total accuracy (TA) is the first measurement chosen to evaluate each classifier. Suppose that a classifier, trained to classify a feature vector into m classes (C1, …, Cm), is run on an evaluation set of N = n1 + n2 + … + nm samples (n1 belonging to C1, n2 to C2, …, nm to Cm). Given that P = p1 + p2 + … + pm samples (p1 from C1, p2 from C2, …, pm from Cm) are assigned to the correct class, TA is defined as:

TA = P / N        (4)

The mutual information coefficient (MIC) is another measurement that can be utilised for the evaluation of classifiers, including multi-class ones (Baldi et al., 2000). Let N = {n_i / N}, i = 1, …, m, contain the fractions of samples that belong to each class, let Y = {y_i / N}, i = 1, …, m, contain the fractions of samples assigned to each class, and let Z = {z_ij / N}, i, j = 1, …, m, be the consistency table (z_ij is the number of samples that belong to class i and have been assigned to class j). MIC is defined as:

MIC = (H(N) + H(Y) − H(Z)) / H(N)        (5)

where H(A) is the entropy of a vector or matrix A (Baldi et al., 2000). MIC takes values in the range [0, 1], and total consistency between the obtained outputs and the desired outputs of a classifier is characterised by MIC = 1.

Ten-fold cross-validation (10-CV) is a re-sampling technique that is often used during the evaluation of a classifier in order to obtain unbiased results. 10-CV yields ten equally sized partitions of the data, extracted randomly and without replacement from the available dataset. The classifiers are trained using nine of the ten partitions (corresponding, here, to 279 of the 311 cases) and tested on the remaining partition (here, 32 of the 311 cases), and the whole process is repeated until the classifier has been tested on all partitions. Mean values of the classifier evaluation measurements (mean ± std) over the CV test sets can then be obtained and used as objective measurements for the evaluation of the classifiers.
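The evaluation protocol can be sketched as follows (illustrative only; X and y are assumed to be numpy arrays holding the 311 feature vectors and integer fold labels 0..26, and build_mlp is the helper sketched earlier):

```python
# Illustrative evaluation harness: MIC of equation (5) plus the ten-fold
# cross-validation protocol reporting TA (equation (4)) and MIC (mean, std).
import numpy as np
from sklearn.model_selection import KFold

def entropy(p):
    p = p[p > 0]                           # Shannon entropy of a distribution
    return -np.sum(p * np.log2(p))

def mic(y_true, y_pred, m=27):
    Z = np.zeros((m, m))
    for t, p in zip(y_true, y_pred):
        Z[t, p] += 1                       # consistency table z_ij
    Z /= len(y_true)                       # joint truth/prediction distribution
    n, y = Z.sum(axis=1), Z.sum(axis=0)    # class fractions N and Y
    return (entropy(n) + entropy(y) - entropy(Z.ravel())) / entropy(n)  # eq. (5)

def cross_validate(make_clf, X, y, n_splits=10, seed=0):
    tas, mics = [], []
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in cv.split(X):     # random partitions, no replacement
        clf = make_clf().fit(X[train_idx], y[train_idx])
        y_pred = clf.predict(X[test_idx])
        tas.append(np.mean(y_pred == y[test_idx]))  # TA = P / N, equation (4)
        mics.append(mic(y[test_idx], y_pred))       # equation (5) on the test set
    return (np.mean(tas), np.std(tas)), (np.mean(mics), np.std(mics))

# (ta_mean, ta_std), (mic_mean, mic_std) = cross_validate(build_mlp, X, y)
```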


Figure 1  TA of MLP-NN classifier per number of hidden nodes (see online version for colours)

Figure 2  TA of P-NN classifier per spread value (see online version for colours)

Figure 3  TA of kNN classifier per number of nearest neighbours (see online version for colours)

Figure 4  TA of SVM classifier per type of kernel (see online version for colours)


Figure 5  Best TA per classifier type (see online version for colours)

Figure 6  MIC for the selected classifiers (according to TA) per classifier type (see online version for colours)

4 Results and discussion

All classification techniques were implemented for variable parameters describing their internal structure (e.g., the number of nodes in the hidden layer of a NN or the kernel function of an SVM), so that each technique yielded more than one classifier. Each of the derived multi-class classifiers was trained and tested using the 10-CV technique, and the TA and MIC measurements (mean ± std) obtained over the ten CV test sets were utilised for the evaluation of the classifiers.

Regarding the MLP-NN classifier, we varied the number of hidden nodes from eight up to 136. The mean value and standard deviation of the TA measurements over the 10-CV test sets are presented in Figure 1. A small number of hidden nodes (8–32) yields low TA, since so few hidden nodes cannot cope with the complexity of the multi-class problem: a 125-dimensional feature vector has to be mapped to 27 output neurons. The maximum average accuracy was obtained for 88 hidden nodes and equals 0.422.
When the number of hidden nodes was set to more than 112, we observed a decrease in the TA measurement. This is possibly due to overfitting of the MLP-NN to the training data: with such a large number of hidden nodes, the MLP-NN becomes unable to generalise to unseen data. TA measurements for the MLP-NN were also obtained when we additionally took into account the second, or the second and third, most activated output neurons, thus allowing the 'second' and 'third' choices of the MLP-NN to contribute to the accuracy when they indicate the correct fold. The mean TA was found to be 0.541 when using 88 hidden neurons and both the first and second most activated output neurons, and 0.609 when the first, second and third most activated output neurons were taken into account. These accuracy measurements do not actually reflect the ability of the MLP-NN to assign a fold to protein sequences; however, they characterise its performance in some way and show the potential of the MLP-NN.

The P-NN classifier was trained and tested for several spread values, from 0.05 up to 1.6, and the TA measurements are presented in Figure 2. The greater the spread value, the more training examples are included by the P-NN in order to make its final decision. The optimal spread value was found to be 0.4 and yielded a mean TA equal to 0.323. The kNN classifier, which can be viewed as a statistical counterpart of the P-NN, was tested using 1, 3, 6, 9, 12, 28, 56 and 84 nearest neighbours, and the results are presented in Figure 3. Three nearest neighbours yielded the best TA, equal to 0.347. From Figure 3, it is observed that TA decreases as the number of nearest neighbours increases, especially beyond 12. This happens because too many training examples, which can confuse the classifier, contribute to the final decision. For example, fold 27 includes at most 13 cases in the training set; thus, an 84NN classifier, when classifying a protein of fold 27, would take into account too many additional cases that do not belong to fold 27 and would misclassify the protein. The CT was found to have a performance, TA = 0.345, similar to that of the best-performing kNN; a single classification model was derived using the CT technique.

The SVM classification technique was tested using four different classifiers, each corresponding to one of the following kernels: linear, polynomial, radial basis and sigmoid. Results on TA are presented in Figure 4. The polynomial kernel led to the best-performing SVM classifier, which achieved the maximum mean TA, equal to 0.428, the best mean accuracy of all classifiers evaluated in the current study.

TA values (mean ± std) for the best classifier per classifier type are presented in the bar chart of Figure 5. We can confirm that the SVM, the best-performing classifier in terms of mean TA over the CV test sets, is slightly better than the best MLP-NN classifier. The best kNN and CT classifiers follow, while the best P-NN classifier performed worst. MIC measurements over the CV test sets were obtained for the best-performing classifier per classifier type, as information additional to TA, and are reported (mean ± std) in the bar chart of Figure 6. All best-performing classifiers per type have high MIC values, specifically in the range [0.712–0.798]. The CT classifier featured the maximum mean MIC value (0.798), while the MLP-NN and the SVM (the best-performing classifier according to TA) follow, with MIC values equal to 0.772 and 0.746, respectively. Given that the best-performing classifiers per type feature similar MIC values, and that the best SVM and best MLP-NN clearly outperform all other classifiers in terms of TA, we can select these classification techniques for fold recognition on the basis of the available dataset, giving a slight precedence to the SVM.

Table 2  Mean occurrence of folds in CV test sets and mean sensitivity and specificity measurements obtained per fold for the selected classifiers (according to TA)

The unequal distribution of the 27 folds in the available dataset may introduce a bias into the results presented so far for the best-performing classifiers in terms of TA. In order to assess the degree of bias, we calculated the mean sensitivity and specificity per fold over the CV test sets, as information additional to the TA and MIC measurements described in the previous section. For obtaining the sensitivity and specificity measurements, the multi-class predictions for fold recognition were handled in the context of 27 different two-class problems using one-versus-all comparisons (Patel and Markey, 2005). Results on these two indices, as well as the mean number of cases of each fold in the CV test sets, are presented in Table 2. The results show that mainly the folds that are quite frequent in the CV test sets, and in the CV training sets as well (e.g., folds 1, 7 and 16), feature high sensitivity, and these are the folds that contribute most to the high TA measurements. Furthermore, the polynomial SVM and the MLP-NN with 88 hidden nodes, which were selected in terms of TA, seem to outperform the other classifiers (again with a slight precedence of the SVM) in terms of sensitivity as well. All obtained specificity measurements are high and of rather low information content: the number of true negative predictions in the two-class problems, i.e., the number of cases correctly not assigned to a fold, is high, but many of these cases are misclassified into other folds rather than the correct one.

For the polynomial SVM, which performed better than all the constructed classifiers in terms of TA, the leave-one-out re-sampling technique was utilised as well, and the TA measurement was calculated once again. The obtained value was 0.441; that is, 137 of the 311 cases were correctly classified during the iterative process of training the classifier using 310 cases and testing it on the remaining one. The resulting TA is quite close to the one obtained using 10-CV testing (TA = 0.428), showing that the accuracy measurement is largely independent of the testing technique.

It is important to note that the best mean TA values among all classifiers, achieved by the polynomial SVM and the MLP-NN with 88 nodes in the hidden layer (42%–43% using ten-fold cross-validation), are quite acceptable accuracy measurements for the 27-class classification problem studied here; a random classifier would achieve a TA of only 3.7% (1/27 × 100). Furthermore, Ding and Dubchak (2001) report TA measurements in the order of 42%–45% when using one-versus-others or all-versus-all classification methods based on a large number of binary NNs or SVMs trained on different subsets of the features used here and the same set of protein sequences. Here, we were able to achieve an accuracy of the same order by utilising an SVM generalised to multi-class problems or an appropriate MLP-NN architecture. In addition, once trained, the multi-class classifiers described here can assign a fold to a vector of sequence-derived features corresponding to a query protein sequence at very low time and computational cost (1–2 seconds on a typical PC). The use of the best-performing classifiers combined with subsets of the features used here, and the selection of the most informative features, comprises future work of the authors.
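A minimal sketch of the one-versus-all computation behind Table 2 follows (our illustration; labels are assumed to be integers 0..26):

```python
# Illustrative per-fold sensitivity/specificity from multi-class predictions,
# treating each fold as a one-versus-all two-class problem.
import numpy as np

def per_fold_sensitivity_specificity(y_true, y_pred, m=27):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = {}
    for fold in range(m):
        tp = np.sum((y_true == fold) & (y_pred == fold))
        fn = np.sum((y_true == fold) & (y_pred != fold))
        tn = np.sum((y_true != fold) & (y_pred != fold))
        fp = np.sum((y_true != fold) & (y_pred == fold))
        sens = tp / (tp + fn) if (tp + fn) else float('nan')  # recall per fold
        spec = tn / (tn + fp) if (tn + fp) else float('nan')  # true negative rate
        stats[fold] = (sens, spec)
    return stats
```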

5 Conclusions

The current study applied several multi-class classification methods to the protein fold recognition problem. Sequence-derived feature vectors were used to classify protein sequences into one of 27 folds, and popular classifiers from the fields of statistics and computational intelligence were comparatively assessed using appropriate accuracy and
mutual information measurements. The polynomial multi-class SVM and the MLP-NN were found to outperform the other classifiers, with the SVM performing best in terms of total accuracy. The selected SVM achieved 42.8% accuracy in the 27-class problem, an acceptable performance for this complex classification task.

Acknowledgements

The authors would like to thank the authors of Ding and Dubchak (2001) for making the set of proteins and the corresponding sequence-derived features available on the WWW at http://ranger.uta.edu/~chqding/protein. Ioannis Valavanis would like to thank the State Scholarships Foundation of Greece (IKY).

References

Apweiler, R., Bairoch, A., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M.J., Natale, D.A., O'Donovan, C., Redaschi, N. and Yeh, L.L. (2004) 'UniProt: the universal protein knowledge base', Nucleic Acids Research, Vol. 32, pp.D115–D119.

Baldi, P., Chauvin, Y., Hunkapiller, T. and McClure, M.A. (1994) 'Hidden Markov models of biological primary sequence information', in Proceedings of the National Academy of Sciences USA, Vol. 91, pp.1059–1063.

Baldi, P., Brunak, S., Chauvin, Y., Andersen, C.A.F. and Nielsen, H. (2000) 'Assessing the accuracy of prediction algorithms for classification: an overview', Bioinformatics, Vol. 16, No. 5, pp.412–424.

Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E. (2000) 'The Protein Data Bank', Nucleic Acids Research, Vol. 28, pp.235–242.

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984) Classification and Regression Trees, Wadsworth, Belmont, California.

Chothia, C. (1992) 'One thousand families for the molecular biologist', Nature, Vol. 357, pp.543–544.

Crammer, K. and Singer, Y. (2001) 'On the algorithmic implementation of multiclass kernel-based vector machines', Journal of Machine Learning Research, Vol. 2, pp.265–292.

Craven, M.W., Mural, R.J., Hauser, L.J. and Uberbacher, E.C. (1995) 'Predicting protein folding classes without overly relying on homology', in Proceedings of Intelligent Systems in Molecular Biology (ISMB), Vol. 3, pp.98–106.

Ding, C.H. and Dubchak, I. (2001) 'Multi-class protein fold recognition using support vector machines and neural networks', Bioinformatics, Vol. 17, pp.349–358.

Dubchak, I., Muchnik, I., Holbrook, S.R. and Kim, S.H. (1995) 'Prediction of protein folding class using global description of amino acid sequence', in Proceedings of the National Academy of Sciences USA, Vol. 92, pp.8700–8704.

Dubchak, I., Muchnik, I., Mayor, C., Dralyuk, I. and Kim, S.H. (1999) 'Recognition of a protein fold in the context of the structural classification of proteins (SCOP) classification', Proteins, Vol. 35, pp.401–407.

Gough, J., Karplus, K., Hughey, R. and Chothia, C. (2001) 'Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure', Journal of Molecular Biology, Vol. 313, pp.903–919.

Haykin, S. (1999) Neural Networks: A Comprehensive Foundation, Prentice-Hall, Upper Saddle River, NJ, USA.


Hobohm, U. and Sander, C. (1994) 'Enlarged representative set of protein structures', Protein Science, Vol. 3, pp.522–524.

Hubbard, T.J.P., Ailey, B., Brenner, S.E., Murzin, A.G. and Chothia, C. (1999) 'SCOP: a structural classification of proteins database', Nucleic Acids Research, Vol. 27, pp.254–256.

Kadah, Y.M., Farag, A.A., Zurada, J.M., Badawi, A.M. and Youssef, A.B.M. (1996) 'Classification algorithms for quantitative tissue characterization of diffuse liver disease from ultrasound images', IEEE Transactions on Medical Imaging, Vol. 15, pp.466–478.

Kim, D., Xu, D., Guo, J.T., Ellrott, K. and Xu, Y. (2003) 'PROSPECT II: protein structure prediction program for genome-scale applications', Protein Engineering, Vol. 16, pp.641–650.

Marsolo, K., Parthasarathy, S. and Ding, C. (2005) 'A multi-level approach to SCOP fold recognition', in Proceedings of the Fifth IEEE Symposium on Bioinformatics and Bioengineering, pp.57–64.

Marti-Renom, M.A., Madhusudhan, M.S. and Sali, A. (2004) 'Alignment of protein sequences by their profiles', Protein Science, Vol. 13, pp.1071–1087.

Mathworks, http://www.mathworks.com.

Mougiakakou, S.G., Valavanis, I.K., Nikita, A. and Nikita, K. (2007) 'Differential diagnosis of CT focal liver lesions using texture features, feature selection and ensemble driven classifiers', Artificial Intelligence in Medicine, Vol. 41, pp.25–37.

Needleman, S. and Wunsch, C. (1970) 'A general method applicable to the search for similarities in the amino acid sequence of two proteins', Journal of Molecular Biology, Vol. 48, pp.443–453.

Patel, A.C. and Markey, M.K. (2005) 'Comparison of three-class classification performance metrics: a case study in breast cancer CAD', in Eckstein, M.P. and Jiang, Y. (Eds.): Medical Imaging 2005: Image Perception, Observer Performance and Technology Assessment, Proceedings of SPIE, WA, Vol. 5749, pp.581–589.

Skolnick, J. and Kihara, D. (2001) 'Defrosting the frozen approximation: PROSPECTOR – a new approach to threading', Proteins, Vol. 42, pp.319–331.

Suganthan, S.Y.M. and Kalyanmoy, P.N. (2004) 'Multi-class protein fold recognition using multi-objective evolutionary algorithms', in Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pp.61–66.

SVM multiclass, http://www.citeulike.org/user/pjcite/article/3338163.

Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) 'CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice', Nucleic Acids Research, Vol. 22, pp.4673–4680.

Tsochantaridis, I., Hofmann, T., Joachims, T. and Altun, Y. (2004) 'Support vector machine learning for interdependent and structured output spaces', in Proceedings of the International Conference on Machine Learning (ICML 2004).

Vapnik, V. (1995) The Nature of Statistical Learning Theory, Springer, New York.

Vingron, M. and Waterman, M. (1994) 'Sequence alignment and penalty choice. Review of concepts, case studies and implications', Journal of Molecular Biology, Vol. 235, pp.1–12.

Wu, Y., Ianakiev, K. and Govindaraju, V. (2002) 'Improved k-nearest neighbor classification', Pattern Recognition, Vol. 35, pp.2311–2318.

Zambon, M., Lawrence, R.L., Bunn, A. and Powell, S. (2006) 'Effect of alternative splitting rules on image processing using classification tree analysis', Photogrammetric Engineering and Remote Sensing, Vol. 72, pp.25–30.