Protein Fold Prediction using Cluster Merging

Nguyen Quang Phuoc
eB Corporation
14th Fl., HIGH-END TOWER, 235-2, Guro-Dong, Seoul, Korea

Sung-Ryul Kim (*contact author)
Div. of Internet and Media, Konkuk University
1 HwaYang-Dong, GwangJin-Gu, Seoul, Korea

Abstract—Protein folding prediction, also called protein structure prediction, is one of the most important problems in understanding living organisms. Predicting the folded structure of a protein from its linear sequence is therefore a major challenge in biology. Despite years of research and a wide variety of approaches, protein folding prediction still remains a difficult problem. One of the main difficulties is controlling the over-fitting and under-fitting behavior of the classifiers in the prediction system. In this paper we propose a new learning method that improves the accuracy of protein folding prediction by balancing over-fitting against under-fitting. The key of this method is a special way of analyzing the distances among training data points in order to cluster them into regions with a high density of data points. In this way, over-fitting and under-fitting can be controlled in a comprehensive manner. Our experimental results indicate that the proposed method has significant potential to improve the accuracy of protein folding prediction.

Keywords: Protein folding prediction, classification, cluster, over-fitting, under-fitting

I. INTRODUCTION

Proteins (also known as polypeptides) are organic compounds made of amino acids arranged in a linear chain and folded into a globular form. The amino acids in a polypeptide are joined together by peptide bonds between the carboxyl and amino groups of adjacent amino acid residues. The sequence of amino acids in a protein is defined by the sequence of a gene, which is encoded in the genetic code. In general, the genetic code specifies 20 standard amino acids. Research concerning the native folded state of a protein has great potential to explain many biological events, since the 3D structure of a protein gives functional information about that protein, and one of the fundamental aims of biology is to understand the function of proteins. Knowledge about the function of proteins provides an understanding of biochemical processes in living beings, the characterization of genetic diseases, the development of designer drugs, and so on. A protein may have many different conformations (three-dimensional structures). However, only one or a few of these structures, the so-called native conformations, have biological activity. Because protein structures are so complicated, they are categorized into four levels of structure. Readers are referred to the many available textbooks for details of these structures. In this paper, we focus on the problem of predicting the fold of a protein given its primary structure sequence. The paper is organized as follows: in Section II we describe the background work and present our motivation. In Section III we describe our method, and in Section IV we present our experimental results. Finally, we conclude in Section V.

II. Previous Works and our Motivation

A. Previous Works

There exists an extensive amount of work on protein structure prediction using many different methods. Because of space constraints, we cannot explain all of the known methods; we simply summarize our assessment of the current state of the art here. In many real-life or experimental studies of data mining, some classification approaches work well with some datasets while performing poorly with other datasets for no apparent reason. For instance, methods using decision trees have had some success in the medical domain [22]. However, they have also shown certain limitations when used in the same domain, as described for instance in [15] and [17]. Similarly, the success of methods using support vector machines has been shown in bioinformatics, such as in [3]. At the same time, those methods do not perform as well in some cases [21]. As these cases show, there is no good understanding of why some models are accurate and others are not; their performance is oftentimes coincidental. A growing belief is that over-fitting and under-fitting problems may cause the poor performance of a classification (class prediction) model. Over-fitting means that the extracted model describes the behavior of the known data very well but does poorly on new data points. Under-fitting occurs when the model generalizes too much from the available data and then performs poorly on the vast amounts of data it has not yet seen.

B. Over-fitting and Under-fitting Problem


In general, one classification system describes the positive data (and thus we will call it the positive system) while the other system describes the negative data (and thus we will call it the negative system). In Figure 3(a) the positive system corresponds to sets A and B (which define the positive pattern) while sets C and D correspond to the negative system (which define the negative pattern). In many real-life applications, there are two different penalty costs: one if a true positive point is erroneously classified as negative, and one if a true negative point is classified as positive. The first case is known as a false negative, while the second case is known as a false positive. Furthermore, a closer examination of Figure 3(a) indicates that there are some unclassifiable points which are not covered by either of the patterns, or are covered by patterns that belong to both classes. For instance, point N (indicated by a triangle) is not covered by any of the patterns, while point M (also a triangle) is covered by sets A and C, which belong to the positive and the negative patterns, respectively. In the first case, as point N is not covered by any of the patterns, the inferred system may declare it unclassifiable. In the second case, there is a direct disagreement by the inferred system, as the new point (i.e., point M) is covered simultaneously by patterns of both classes. Again, such a point may also be declared unclassifiable. Thus, in many real-life applications of data mining one may have to consider three different penalty costs: one cost for the false positive case, one cost for the false negative case, and one cost for the unclassifiable case.

Consider Figure 3(b). Suppose that all patterns A, B, C and D have been reduced significantly but still cover the original training data. A closer examination of this figure indicates that both points M and N are not covered by any of the inferred patterns now. In other words, these points, and several additional points which were classified before by the inferred systems, have now become unclassifiable. Furthermore, the data points which were simultaneously covered by patterns from both classes before, and thus were unclassifiable, are now covered by only one type of pattern or none at all. Thus, it is very likely that the situation depicted in Figure 3(b) has a higher total penalty cost than the original situation depicted in Figure 3(a). If one takes this idea of reducing the covering sets as much as possible to the extreme, then there would be one pattern (i.e., just a small circle) around each individual training data point. In this extreme case, the total penalty cost due to unclassifiable points would be maximized, as the system would be able to classify the training data only and nothing else. The previous scenarios are known as over-fitting of the training data.

On the other hand, suppose that the original patterns depicted as sets A, B, C and D (as shown in Figure 3(a)) are now expanded significantly as in Figure 3(c). An examination of this figure demonstrates that points M and

Figure 2. Sample data from two classes in 2-D

Let us assume that the “circles” and “squares” in Figure 2 correspond to sampled observations from two classes defined in 2-D. In general, a data point is a vector defined on n variables along with their values. In the above figure, n is equal to 2 and the two variables are indicated by the X and Y axes. For the sake of description we assume here that there are only two classes. Arbitrarily, we will call one of them the positive class (the circles in the figure) and the other the negative class. Thus, a positive data point, also known as a positive example, is a data point that has been evaluated to belong to the positive class. A similar definition exists for negative data points or negative examples. Usually we are given a set of positive and negative examples, such as the ones depicted in Figure 3. The set of data is called the training data (examples) or the classified examples. The remainder of the data from the state space is called the unclassified data.

Figure 3. An illustrative example of over-fitting and under-fitting


assumed that we do not know the actual class values of these two new points. We would like to use the available training data and inferred patterns to classify these two points. Because points P and Q are covered by patterns A and B, respectively, both of these points may be assumed to be positive examples.

N are now covered simultaneously by patterns of both classes. Also, more points are now covered simultaneously by patterns of both classes. Thus, under this scenario we also have many unclassifiable points, because this scenario creates many cases of disagreement between the two classification systems (i.e., the positive and the negative systems). This means that the total penalty cost due to unclassifiable points will also be significantly higher than under the scenario depicted in Figure 3(a). This scenario is known as under-fitting of the training data.
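To make the cost comparison between the scenarios of Figures 3(a), 3(b), and 3(c) concrete, here is a minimal sketch (not from the paper; the cost values and function name are illustrative assumptions) of how a total penalty could be computed from the three error counts discussed above.

def total_penalty(num_false_pos, num_false_neg, num_unclassifiable,
                  cost_fp=1.0, cost_fn=1.0, cost_unc=0.5):
    # Three penalty costs: false positives, false negatives, and unclassifiable
    # points (covered by no pattern or by patterns of both classes).
    return (cost_fp * num_false_pos
            + cost_fn * num_false_neg
            + cost_unc * num_unclassifiable)

# E.g., an over-fitted system with many unclassifiable points:
# total_penalty(0, 0, 40) = 20.0, versus a balanced one: total_penalty(3, 2, 5) = 7.5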

Figure 5. Pattern B is a homogeneous set while pattern A is not. Pattern A can be replaced by the two homogeneous sets A1 and A2

Let us look more closely at pattern A. This pattern covers regions of the state space that are not adequately populated by positive training points. Such regions, for instance, exist in the upper left corner and the lower part of pattern A (see also Figure 3(a)). It is possible that the unclassified points which belong to such regions are erroneously assumed to be of the same class as the positive training points covered by pattern A. Point P is in one of these sparsely covered regions under pattern A. Thus, the assumption that point P is a positive point may not be very accurate. On the other hand, pattern B does not have such sparsely covered regions. Thus, it may be more likely that the unclassified points covered by pattern B are more accurately assumed to be of the same class as the positive training points covered by the same pattern. For instance, the assumption that point Q is a positive point may be more accurate. The simple observations above lead one to surmise that the accuracy of the inferred systems can be increased if the derived patterns are, somehow, more compact and homogeneous. A homogeneous set describes a steady or uniform distribution of a set of distinct points. That is, within the pattern there are no regions (also known as bins) with unequal concentrations of classifiable (i.e., either positive or negative) and unclassified points. In other words, if a pattern is partitioned into smaller bins of the same unit size and the densities of these bins are almost equal to each other (or, equivalently, their standard deviation is small enough), then this pattern is a homogeneous set.
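As a concrete illustration of the binning definition above, the following is a minimal sketch (not from the paper; the bin size, threshold, and names are assumptions) that tests whether a 2-D pattern is a homogeneous set by comparing the spread of its bin densities.

import numpy as np

def is_homogeneous(points, x_range, y_range, bin_size=1.0, max_rel_std=0.5):
    # Partition the rectangular region covering a pattern into equal-size bins
    # and check that the per-bin point counts are nearly uniform.
    x_edges = np.arange(x_range[0], x_range[1] + bin_size, bin_size)
    y_edges = np.arange(y_range[0], y_range[1] + bin_size, bin_size)
    counts, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                                  bins=[x_edges, y_edges])
    mean = counts.mean()
    if mean == 0:
        return False
    # Homogeneous: standard deviation of bin densities is small relative to the mean.
    return counts.std() / mean <= max_rel_std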

Figure 4. An illustrative example of a better classification

Thus, we cannot separate the control of over-fitting and under-fitting into two independent domains. We need to find a way to simultaneously balance over-fitting and under-fitting by adjusting the inferred systems (i.e., the positive and the negative systems) obtained from a classification algorithm. Balancing the two systems aims at minimizing the total misclassification cost of the final system. In terms of Figures 3(a), (b), and (c), let us now consider the situation depicted in Figure 4. At this point assume that in reality point M is negative while point N is positive. Figure 4 shows different levels of over-fitting and under-fitting for the two classification systems. For the sake of illustration, sets C and D are kept the same as in the original situation (i.e., as depicted in Figure 3(a)), while set A has been reduced (i.e., it fits the training data more closely) and now does not cover point M. On the other hand, set B is expanded (i.e., it generalizes the training data more) to cover point N.

III. Our Method

A. Homogeneous Pattern

In order to help motivate the proposed methodology, we first consider the situation depicted in Figure 5(a). This figure presents two inferred patterns, shown as the circular areas that surround groups of training data (shown as small circles). These data are part of the training data shown earlier in Figure 3 (recall that the circles in Figure 2 represent positive points). Moreover, in Figure 5(a) there are two additional data points, shown as small triangles and denoted by P and Q, respectively. In this situation, it is

B. Density of a Homogeneous Pattern

The pattern represented by the non-homogeneous set A can be replaced by the two more homogeneous sets denoted A1 and A2, as in Figure 5(b). Now the regions covered by the two new smaller patterns A1 and A2 are more homogeneous than the area covered by the original pattern A.

another two-class problem: one “true” class consists of another original class, and the “others” class is the rest. This process is repeated for each of the K classes, leading to K trained two-way classifiers, each of which separates the proteins of one “true” class from the “others” class containing all remaining classes. In the testing process, the system submits each testing query to all K two-way classifiers and takes the maximum of the K resulting scores; the classifier that produces the maximum score determines the predicted class.
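A minimal sketch of this one-against-all scheme, using scikit-learn's SVC as an illustrative stand-in for the two-class base classifier (the feature matrices X_train/X_test and fold labels y_train are assumptions, not part of the paper):

import numpy as np
from sklearn.svm import SVC

def one_against_all_fit(X_train, y_train):
    # Train one two-way classifier per fold: the "true" class against "others".
    models = {}
    for c in np.unique(y_train):
        binary_labels = (y_train == c).astype(int)
        models[c] = SVC(probability=True).fit(X_train, binary_labels)
    return models

def one_against_all_predict(models, X_test):
    # For each query, the classifier with the maximum score gives the prediction.
    classes = list(models)
    scores = np.column_stack([models[c].predict_proba(X_test)[:, 1]
                              for c in classes])
    return np.array(classes)[scores.argmax(axis=1)]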

Given these considerations, point P may be assumed to be an unclassifiable point, while point Q is still a positive point. As presented in the previous paragraphs, the homogeneity property of patterns may influence the number of misclassification cases of the inferred classification systems. Furthermore, if a pattern is a homogeneous set, then the

D. Accuracy Measure for Multi-class Prediction

In two-class problems, assessing the accuracy involves calculating true positive rates and false positive rates. In multi-class problems, particularly those converted by the one-against-all method, this assessment has to be extended to handle more than two classes. A simple standard assessment, Q, was introduced by Rost and Sander, 1993 [18] and Baldi et al., 2000 [1]. Suppose there are N = n1 + n2 + … + nk test proteins, where ni is the number of examples in the ith class. Let C = c1 + c2 + ... + ck be the total number of proteins that are correctly recognized, where ci is the number of examples that are correctly recognized in the ith class. The accuracy for the ith class is therefore Qi = ci/ni, and the overall accuracy is Q = C/N. An individual class contributes to the overall accuracy in proportion to the number of proteins in its class, so each Qi relates to the overall Q through a weight wi = ni/N. The overall accuracy is:

Q = Σi wi·Qi = Σi (ni/N)(ci/ni) = C/N.
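As a concrete illustration of this measure (a minimal sketch; the array names are assumptions), the per-class and overall accuracies can be computed as follows.

import numpy as np

def multiclass_accuracy(y_true, y_pred):
    # Per-class accuracies Q_i = c_i / n_i and overall accuracy Q = C / N.
    classes = np.unique(y_true)
    per_class = {c: float(np.mean(y_pred[y_true == c] == c)) for c in classes}
    overall = float(np.mean(y_pred == y_true))  # equals sum_i w_i * Q_i
    return per_class, overall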

Figure 6. An illustrative example of Homogeneous Pattern

number of training points covered by this pattern may be another factor that affects the accuracy of the overall inferred systems. For instance, Figure 6 shows the case discussed in Figure 5(b) (i.e., pattern A has been replaced by the two more homogeneous sets denoted A1 and A2). Suppose that all patterns A1, A2 and B are homogeneous sets and a new point S (indicated by a triangle) is covered by pattern A1. A closer examination of this figure shows that the number of points in B is higher than that in A1. Although both points Q and S are covered by homogeneous sets, the assumption that point Q is a positive point may be more accurate than the assumption that point S is a positive point. This observation leads one to surmise that the accuracy of the inferred systems may also be affected by a density measure. Such a density can be defined as the number of points in each inferred pattern per unit of area or volume. We call this density the homogeneity degree.
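A minimal sketch of the homogeneity degree just defined (the circular-pattern representation is an illustrative assumption):

import math

def homogeneity_degree(num_points, radius):
    # Number of covered training points per unit of area for a circular pattern.
    return num_points / (math.pi * radius ** 2)

# With equal radii, a pattern B covering 12 points has a higher homogeneity
# degree than a pattern A1 covering 4, so predictions inside B are trusted more.
print(homogeneity_degree(12, 1.0) > homogeneity_degree(4, 1.0))  # True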

E. Main Algorithm

The following figure illustrates the main steps in our algorithm. The details are omitted due to space limitations.
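Since the algorithmic details are omitted here, the following is only a generic sketch of an agglomerative cluster-merging loop of the kind suggested by the abstract and by the homogeneity-degree discussion above; the distance measure, 2-D density estimate, and threshold are assumptions, not the authors' algorithm.

import numpy as np

def merge_clusters(points, labels, density_threshold):
    # Generic sketch: start from singleton clusters and greedily merge the two
    # closest same-class clusters as long as the merged cluster keeps a high
    # enough density (homogeneity degree). Assumes 2-D points.
    clusters = [[i] for i in range(len(points))]
    while True:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if labels[clusters[a][0]] != labels[clusters[b][0]]:
                    continue  # only merge clusters of the same class
                d = np.linalg.norm(points[clusters[a]].mean(axis=0)
                                   - points[clusters[b]].mean(axis=0))
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best is None:
            break
        _, a, b = best
        merged = clusters[a] + clusters[b]
        pts = points[merged]
        radius = np.linalg.norm(pts - pts.mean(axis=0), axis=1).max() + 1e-9
        if len(merged) / (np.pi * radius ** 2) < density_threshold:
            break  # merging the closest pair would break homogeneity; stop
        clusters[a] = merged
        del clusters[b]
    return clusters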

C. Multi-class Classification: One-against-all Method

Most classification methods dealing with two-class problems are accurate and efficient. Examples include support vector machines and neural networks. However, when dealing with more classes, they usually lose accuracy and efficiency. This section describes a method, one-against-all, that utilizes two-class classification methods as the basic building block for a larger number of classes. This is a simple and effective method introduced by Dubchak et al., 1999 [10] and Brown et al., 2000 [2]. Suppose there are K classes in the problem. The K classes are first partitioned into a two-class problem: one class consists of proteins in one “true” class, and the “others” class includes all other classes. A two-class classification method is then used to train for this two-class problem. The process partitions the K classes into

Figure 7. The main algorithm flowchart

IV. Experimental Results

A. Data Sets

To evaluate the proposed method and compare it with existing methods, we have applied our method to the

sequences, and fold prediction by different machinelearning techniques can be performed rapidly and automatically.

protein data sets studied by Ding and Dubchak (2001) [7], which can be obtained from http://ranger.uta.edu/~chqding/protein. This data collection has been widely used by many existing studies (Chen and Kurgan, 2007 [4]; Guo and Gao, 2008 [11]; Shamim et al., 2007 [19]; Shen and Chou, 2006 [20]). It contains six sub-datasets, each consisting of 311 proteins for training and 385 proteins for testing. The data sets were selected from a database built for the prediction of 128 folds and based on the PDB select sets (Hobohm et al., 1992 [12]; Hobohm and Sander, 1994 [13]), in which no two proteins share more than 35% sequence identity for aligned subsequences longer than 80 residues. Since the accuracy of any machine learning method depends directly on the number of representatives available for training, the authors utilized the 27 most populated folds in the database, which have seven or more proteins each and represent all major structural classes: α, β, α/β, and α + β. The number of training proteins (Ntrain) in each structural class is shown in Table 1.

Table 2: Six parameter datasets extracted from protein sequence

Symbol   Parameter                          Dim
C        Amino acids composition            20
S        Predicted secondary structure      21
H        Hydrophobicity                     21
V        Normalized van der Waals volume    21
P        Polarity                           21
Z        Polarizability                     21
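As an illustration of how the 21-dimensional parameter vectors for the properties in Table 2 can be built from a sequence using the composition, transition, and distribution descriptors of Dubchak et al. [9, 10], here is a minimal sketch; the three-group hydrophobicity assignment below is an assumption used only for illustration.

GROUPS = {  # assumed polar / neutral / hydrophobic grouping of the 20 amino acids
    'polar': set('RKEDQN'),
    'neutral': set('GASTPHY'),
    'hydrophobic': set('CLVIMFW'),
}

def ctd_features(seq):
    # seq: protein sequence of standard one-letter amino acid codes (uppercase).
    names = list(GROUPS)
    labels = [next(g for g in names if aa in GROUPS[g]) for aa in seq]
    n = len(labels)
    feats = []
    # Composition: percent of residues falling in each of the three groups.
    feats += [100.0 * labels.count(g) / n for g in names]
    # Transition: frequency of adjacent residues switching between two groups.
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            t = sum(1 for a, b in zip(labels, labels[1:])
                    if {a, b} == {names[i], names[j]})
            feats.append(100.0 * t / (n - 1))
    # Distribution: sequence positions (as % of length) where the first, 25%,
    # 50%, 75% and 100% of each group's residues are located.
    for g in names:
        pos = [k + 1 for k, lab in enumerate(labels) if lab == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            idx = max(0, int(round(frac * len(pos))) - 1) if pos else 0
            feats.append(100.0 * pos[idx] / n if pos else 0.0)
    return feats  # 3 + 3 + 15 = 21 features per property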

Note that the six feature vector datasets (parameter sets) are extracted independently. Thus, one may apply machine-learning techniques based on a single parameter set for protein fold prediction. We found that using multiple parameter sets and applying majority voting on their results leads to much better prediction accuracy; this is the approach we take in this study. Alternatively, one may combine the different parameter sets into one dataset so that each protein is represented by a 125-dimensional feature vector. We experimented with this approach and found that the prediction accuracy is not enhanced.
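A minimal sketch of the majority voting over per-parameter-set predictions described above (the classifier outputs shown are illustrative assumptions):

from collections import Counter

def majority_vote(predictions_per_set):
    # predictions_per_set: one list of predicted fold labels per parameter set,
    # each covering the same test proteins in the same order.
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_set)]

# Example with three parameter sets and two test proteins:
# majority_vote([['a/b', 'all-a'], ['a/b', 'all-b'], ['a+b', 'all-b']])
# returns ['a/b', 'all-b']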

Table 1: The number of training data sets and test data sets for each class

Class    Training data    Test data
all-α    55               61
all-β    109              117
α/β      115              145
α+β      32               62
Total    311              385

B. Results

After running the new machine learning method on the datasets from Ding and Dubchak, we obtain the accuracies (in percent) shown in Table 3. The best accuracy is 92.2%, obtained when predicting the class all-α on data set no. 2, and the worst accuracy is 69.6%, obtained when predicting the class all-β on data set no. 6. Overall, the average accuracy is 82.5%, which is an acceptable result. In the next section, we will compare these results with some studies that use the same data sets from Ding and Dubchak.

As an independent dataset for testing we used the PDB40D set developed by the authors of the SCOP (Structural Classification of Proteins) database (Lo Conte et al., 2000 [16]). This set contains the SCOP sequences having less than 40% identity with each other. From this set we selected 386 representatives of the same 27 largest folds (Ntest), as shown in Table 1. All PDB40D proteins that had higher than 35% identity with the proteins of the training set were excluded from the testing set. To use machine learning methods, feature vectors are extracted from the protein sequences. The percentage composition of the 20 amino acids forms one parameter set. For each structural or physico-chemical property listed in Table 2, feature vectors are extracted from the primary sequence based on three descriptors: ‘composition’, the percent composition of three constituents (e.g. polar, neutral and hydrophobic residues in the case of hydrophobicity); ‘transition’, the transition frequencies (polar to neutral, neutral to hydrophobic, etc.); and ‘distribution’, the distribution pattern of the constituents (where the first residue of a given constituent is located, and where 25, 50, 75 and 100% of that constituent are contained). For concrete details, see Dubchak et al. (1995, 1999) [9, 10]. With this feature extraction method, feature vectors (we call them parameter vectors) can be easily calculated from new protein

Table 3: The detailed accuracy (%)

Sets      all-α     all-β     α/β       α+β       Average
1         90.13     80.00     84.68     89.36     86.04
2         92.21     80.00     86.49     88.31     86.75
3         84.94     72.47     78.96     84.68     80.26
4         84.42     70.39     78.70     86.49     80.00
5         85.19     73.77     81.04     88.83     82.21
6         84.41     69.61     78.96     85.97     79.74
Average                                           82.50

Prediction errors are shown in Table 4. The prediction error includes the misclassification cost and the unclassifiable cost. A misclassification occurs when a class-A protein is predicted as a class-B protein. In case the classifier model cannot predict a class value for a protein, the protein is said to be in the unclassifiable case.

[10] Dubchak, I., Muchnik, I., Mayor, C., Dralyuk, I., and Kim, S. H., "Recognition of a protein fold in the context of the structural classification of proteins (SCOP) classification," Proteins 35 (1999), 401-407.
[11] Guo, X. and Gao, X., "A novel hierarchical ensemble classifier for protein fold recognition," Protein Eng. Des. Sel. 21 (2008), 659-664.
[12] Hobohm, U., Scharf, M., Schneider, R., and Sander, C., "Selection of a representative set of structures from the Brookhaven Protein Bank," Protein Sci. 1 (1992), 409-417.
[13] Hobohm, U. and Sander, C., "Enlarged representative set of proteins," Protein Sci. 3 (1994), 522-524.
[14] Isik, Z., et al., "Protein structural class determination using support vector machines," Lecture Notes in Computer Science 3280, ISCIS (2004), 82-89.
[15] Kokol, P. M., Zorman, M., Stiglic, B., and Malcic, I., "The limitations of decision trees and automatic learning in real world medical decision making," Proceedings of the 9th World Congress on Medical Informatics MEDINFO'98, vol. 52, pp. 529-533.
[16] Lo Conte, L., Ailey, B., Hubbard, T. J. P., Brenner, S. E., Murzin, A. G., and Chothia, C., "SCOP: a structural classification of proteins database," Nucleic Acids Res. 28 (2000), 257-259.
[17] Podgorelec, V., Kokol, P., Stiglic, B., and Rozman, I., "Decision trees: an overview and their use in medicine," Journal of Medical Systems, Kluwer Academic/Plenum Press, vol. 26, no. 5 (2002), pp. 445-463.
[18] Rost, B. and Sander, C., "Prediction of protein secondary structure at better than 70% accuracy," J. Mol. Biol. 232 (1993), 584-599.
[19] Shamim, M. T., et al., "Support vector machine-based classification of protein folds using the structural properties of amino acid residues and amino acid residue pairs," Bioinformatics 23 (2007), 3320-3327.
[20] Shen, H. B. and Chou, K. C., "Ensemble classifier for protein fold pattern recognition," Bioinformatics 22 (2006), 1717-1722.
[21] Spizer, M., Stefan, L., Paul, C., Alexander, S., and George, F., "IsoSVM – Distinguishing isoforms and paralogs on the protein level," BMC Bioinformatics, vol. 7 (2006), 110-124.
[22] Zavrsnik, J., Kokol, P., Maleiae, I., Kancler, K., Mernik, M., and Bigec, M., "ROSE: decision trees, automatic learning and their applications in cardiac medicine," MEDINFO'95, Vancouver, Canada, 201-206.

Table 4: The prediction error (%); (1): misclassification, (2): unclassifiable

Sets   all-α(1)  all-α(2)  all-β(1)  all-β(2)  α/β(1)  α/β(2)  α+β(1)  α+β(2)  Average
1      9.4       0.5       19.5      0.5       14.8    0.5     10.1    0.5     14.0
2      7.3       0.5       19.5      0.5       13.0    0.5     11.2    0.5     13.3
3      14.6      0.5       27.0      0.5       20.5    0.5     14.8    0.5     19.7
4      15.1      0.5       29.1      0.5       20.8    0.5     13.0    0.5     20.0
5      14.3      0.5       25.7      0.5       18.4    0.5     10.7    0.5     17.8
6      14.8      0.8       29.9      0.5       20.3    0.8     13.5    0.5     20.3
Average                                                                        17.5

V. Conclusions and Future Works

In this paper we have described an intensive cluster merging method for the notoriously hard problem of protein structure prediction, based on a novel idea of balancing over-fitting and under-fitting. The results show that this approach is feasible, but we still have to improve and optimize the threshold parameters in order to achieve better prediction performance. For future work, we hope to test this method on other applications such as image recognition, to improve the performance of the approach by finding a suitable density for homogeneous patterns, and possibly to decrease the execution time by using parallel computing techniques.

* This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2011-(C1090-1101-0008)).

REFERENCES

[1] Baldi, P., et al., "Assessing the accuracy of prediction algorithms for classification: an overview," Bioinformatics 16 (2000), 412-424.
[2] Brown, M., et al., "Knowledge-based analysis of microarray gene expression data by using support vector machines," Proc. Natl. Acad. Sci. USA 97 (2000), 262-267.
[3] Byvatov, E. and Schneider, G., "Support vector machine applications in bioinformatics," Applied Bioinformatics, vol. 2, no. 2 (2003), 262-267.
[4] Chen, K. and Kurgan, L., "PFRES: Protein fold classification by using evolutionary information and predicted secondary structure," Bioinformatics 23 (2007), 2843-2850.
[5] Damoulas, T. and Girolami, M., "Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection," Bioinformatics 24 (2008), 1264-1270.
[6] Dandekar, T. and Argos, P., "Potential of genetic algorithms in protein folding and protein engineering simulations," Protein Eng. Des. Select. 5 (1992), 637-645.
[7] Ding, C. H. and Dubchak, I., "Multi-class protein fold recognition using support vector machines and neural networks," Bioinformatics 17 (2001), 349-358.
[8] Dong, Q., Zhou, S., and Guan, J., "A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation," Bioinformatics 25 (2009), 2655-2662.
[9] Dubchak, I., Muchnik, I., Holbrook, S. R., and Kim, S. H., "Prediction of protein folding class using global description of amino acid sequence," Proc. Natl Acad. Sci. USA 92 (1995), 8700-8704.