A Model Based on Minimotifs for Classification of Stable ... - IEEE Xplore

A Model Based on Minimotifs for Classification of Stable Protein-protein Complexes Luis Rueda and Manish Pandit

Abstract: Prediction of protein-protein interactions (PPIs) is an important problem in biology, since interactions play a key role in most biological processes and functions in living cells. PPIs have been studied from many perspectives. Of these, an important problem is the prediction of different complex types, such as obligate vs. non-obligate and transient vs. permanent, among others. We focus on the prediction of obligate protein complexes, which are more stable and perform a specific function, as opposed to transient and non-obligate complexes, which last for a short period of time. We have modeled the prediction problem using minimotifs, also known as short-linear motifs, to extract information contained in the protein sequences in order to distinguish between obligate and non-obligate PPIs. Incorporating different classifiers, such as the k-nearest neighbor (k-NN), the support vector machine (SVM) and linear dimensionality reduction (LDR), yields a very powerful scheme for prediction. On two well-known datasets, the model delivers classification accuracies as high as 99%. Analysis and cross-dataset validation show that the information contained in the training sequences is crucial for prediction and determination of stability in PPIs.

Keywords: protein-protein interaction; short-linear motifs; classification; linear dimensionality reduction

I. INTRODUCTION

Prediction of protein-protein interactions (PPIs) has been studied from many different perspectives in solving different problems. The main aspects that are studied include [1]: sites of the protein complex interfaces (where the interaction occurs), arrangement of proteins in a complex (how two proteins form a complex, also known as docking), the type of protein complex (what is to be predicted), the molecular interaction event (whether the interaction will occur), and temporal and spatial trends (dynamics and temporal states of the interactions) [2], [3]. Of these, we focus on the problem of determining the stability of complexes by means of different types, and the transitions from non-obligate (less stable, or transient) to obligate (more stable) complexes. Obligate interactions are considered permanent, while non-obligate interactions can be either permanent or transient [4]. Non-obligate and transient interactions are more difficult to study and understand due to their instability and short life, while obligate and permanent interactions last for a longer period of time, and hence are more stable.

Luis Rueda and Manish Pandit are with the School of Computer Science, University of Windsor, 401 Sunset Ave., Windsor, ON, N9B 3P4, Canada (email: {lrueda,panditm}@uwindsor.ca). This work has been partially supported by NSERC, the Natural Sciences and Engineering Research Council of Canada, and the University of Windsor, Office of Research Services research equipment grants. This work has also been made possible by using the facilities of the Shared Hierarchical Academic Research Computing Network (SHARCNET: www.sharcnet.ca) and Compute/Calcul Canada.

Characterizing the properties of protein interaction types can be done by studying their sequence information, their structural information, or both. Structure-based prediction methods, including computational approaches, homology modelling, threading-based methods and protein-protein docking, are typically more accurate than those that do not employ structural data [5]. These studies have been carried out mostly by relying on biological knowledge about the atoms or molecules, which are normally selected manually by observing groups of complexes or based on prediction results. Interacting regions can be characterized by diverse sets of physicochemical properties [5] and other sequence features. In addition, other properties have been used for prediction of PPIs, such as solvent accessibility [6], geometry, hydrophobicity, sequence-based features, and desolvation energy [7]. Based on interface properties such as interface area and ratio of area [6], Zhu et al. predicted biological and crystal packing interactions using a support vector machine (SVM). The work presented in [7] shows the use of desolvation energies to predict obligate and non-obligate complexes using SVM and linear dimensionality reduction (LDR). Recently, the work presented in [8] showed that electrostatic energies can be effectively used for prediction of biological PPIs. From another perspective, simple codon pairs have been used in prediction of protein-protein interactions in yeast [9], achieving relatively good performance.

The most effective approaches for prediction of PPIs mainly use structural information of protein complexes to calculate the feature values. The Protein Data Bank (PDB) is the main source of the molecular structures of protein complexes [10]. Models based on structural information from the PDB are not yet perfect and are time consuming. In addition, only a small number of structures (approximately 90,000) are known in the PDB, and these structures correspond to a much smaller number of proteins and their interactions. This number is very small compared to the number of possible protein interactions available (or yet to be discovered) in high-throughput protein and protein-protein interaction databases, such as UniProt [11] and those integrated in the International Molecular Exchange (IMEx) Consortium [12]. As a result, models based on protein structures are limited by the availability of structural information. Also, structure-based models tend to be time consuming when physicochemical properties are to be computed. Thus, for a large number of proteins, a model that can replace the use of structural properties with sequence information is desirable, which is the main motivation of the proposed model.

Motifs are patterns widespread over a group of proteins that are related by function or may have other biological features in common. Usually, each motif contains a sequence pattern of 3-20 amino acids. Motifs of 3-10 amino acids are considered short, linear motifs (SLiMs) or minimotifs [13]. SLiMs are characterized by their capacity to encode a functional interaction in a short sequence, their enrichment in intrinsically disordered regions of protein sequences, their ability to function independently of their tertiary structure context, and their tendency to evolve convergently [13]. Sequence information is obtained from the SLiMs, which represent the conserved regions of protein complexes. The conserved-region property differs between types of complexes and can therefore be used to predict PPIs. Figure 1 shows a SLiM found in E. coli glutathione S-transferase (1a0f), Chain B, at residue 42. The space-filled area of the chain (inside the circle) is a SLiM which was identified using Multiple Expectation Maximization for Motif Elicitation (MEME) [14].

In this paper, we propose a model that uses minimotifs or SLiMs as properties to predict obligate and non-obligate protein interaction complexes. The model uses k-NN, LDR and SVM as the classifiers to predict these types. Our prediction results for two well-known datasets show a prediction accuracy of more than 99%, which implies an increase of at least 7% from previous approaches, even better than the state-of-the-art structure-based methods, while using only sequence information. Cross-dataset validation corroborates the power of the model in terms of classification accuracy and generalization. Here, we present a wide range of experimental results, cross-dataset validations and discussions. Some preliminary results on specific motif lengths are included in [15].

II. THE PROPOSED METHOD

The method that we propose for prediction of obligate and non-obligate PPI types uses information contained in the training sequences as properties. The first and most important stage involves inferring SLiMs from the sequences in the training datasets. Using an information-theoretic scoring system, we generate feature vectors for all complexes. Classification is performed via k-NN, LDR and SVM, and validation is carried out using leave-one-out and cross-dataset schemes. We call this method "prediction of protein interaction types using SLiMs and information contained in sequences", or PPI-SLiM-Seq for short. Each stage of the model is described in detail in the subsequent subsections.

A. Datasets

In our experiments, we use two pre-classified datasets of obligate and non-obligate protein complexes. These datasets contain curated protein complexes from the studies of Zhu et al. [6] and Mintseris et al. [17], and are referred to here as the ZH and MW datasets, respectively. The ZH dataset contains 75 obligate and 62 non-obligate complexes, and the MW dataset contains 212 non-obligate and 115 obligate complexes.

Fig. 1. A short, linear motif found in E. coli glutathione S-transferase (1a0f), Chain B at residue 42. The SLiM was identified using MEME and the image was produced using the ICM Browser [16] from the PDB [10] structure files of the complex.

B. SLiM Identification

A few approaches have been proposed for SLiM discovery, and the corresponding tools are publicly available, such as SLiMFinder [18], SLiMSearch [19], QuasiMotiFinder [20], MnM [21] and MEME. All these tools are capable of discovering SLiMs from new datasets of proteins. Some methods, however, use predefined knowledge (SLiM/motif databases) to find SLiMs. These schemes pose some inconveniences for our approach, since we need to discover new SLiMs. We have therefore chosen MEME, which provides a stand-alone application that can be run in open-source environments. MEME is based on the expectation maximization (EM) algorithm for the finite mixture model introduced in [14]. The EM algorithm has some advantages over Gibbs sampling [22]. MEME also incorporates the position-specific prior (PSP) model, which can include multiple types of auxiliary data when discovering motifs, consequently reducing the running time of the underlying process [14].

We have used MEME [14] to find independent sets of SLiMs for the two datasets, ZH and MW. Since the web-based application has some limitations (e.g., on the number and length of the sequences), we deployed MEME on SHARCNET (http://www.sharcnet.ca), which also allows parallel processing. In our implementation, the parameters of MEME were optimized to find 500 and 1,000 SLiMs in the datasets. The lengths of the SLiMs were set to 3-10 and 2-7, the minimum number of sites to 8 and the maximum number of sites to 200. Motifs of length greater than 10 were not considered, following the general consensus of the most recent studies, which claim that SLiMs are patterns of typically less than 10 amino acids [13], [19]. We chose a minimum number of sites equal to 8 to make sure that at least a few complexes have occurrences of the motif, hence avoiding a large number of zeros in the classification dataset. For the maximum number of sites, a value of 200 was set in order to limit the number of searches MEME performs for each motif. As this process is repeated a large number of times, using a larger value would slow down the overall process quite significantly. In addition, based on our observations of the resulting SLiMs and the relevant information, it is unlikely that a SLiM would contain a few hundred sites. Based on the different combinations of these MEME parameters, four SLiM sets were compiled as follows:

• SLiM ZH 1000 3 10 - 1000 SLiMs, length: 3-10, dataset: ZH
• SLiM ZH 1000 2 7 - 1000 SLiMs, length: 2-7, dataset: ZH
• SLiM MW 1000 3 10 - 1000 SLiMs, length: 3-10, dataset: MW
• SLiM MW 1000 2 7 - 1000 SLiMs, length: 2-7, dataset: MW

C. Feature Vectors

Once the SLiM sets are obtained, a 20-dimensional feature vector is computed for each complex in the dataset. For each complex, its sequences are divided into overlapping ℓ-mers, which are considered as potential sites of motifs in the training set. Let us consider an ℓ-mer a in a sequence of length L. We divide the sequence into all possible overlapping ℓ-mers of length W, which delivers a total of L − W + 1 ℓ-mers. Then, Equation (3), which is explained in Section II-D, is used to calculate the information contained in ℓ-mer a, given a profile X. Once the scores for all possible ℓ-mers are obtained, the top 20 scores are placed in a 20-dimensional feature vector. This process is repeated for all SLiM sets found as detailed above, producing four different datasets that are used for training and testing the classifiers.

D. Scoring the Sites

The score for each ℓ-mer, or potential site of a motif, is computed as follows. According to the authors of [23], [24], [25] and [26], the information contained in the training sequences has a significant impact on prediction of PPIs. There are different methods to calculate the information content of a site [27]: symbol frequency scores, symbol entropy scores, and stereochemical scores. We use a variant of the model of symbol entropy scores. Position-specific probability matrices (PSPMs) are used as background (profile) to calculate the information content of a site. The conditional probability of a site a of length ℓ, for a given profile X, can be defined as:

$$ P(a \mid X) = \prod_{i=1}^{\ell} P(a_i) \tag{1} $$

where X is the profile and P(a_i) is the probability of the i-th residue of a under that profile. Since P(a_i) ≤ 1, the product of the P(a_i) is very small for large sites, and hence taking − log gives a more meaningful measure. PSPMs were obtained for each SLiM after running MEME. Also, the amount of information contained in a site is equivalent to the product of P(a_i) and log(P(a_i)) [28]. Thus, we weight each term in the sum by the probability of the corresponding amino acid, yielding the information content for a site of length ℓ:

$$ I(a \mid X) = -\sum_{i=1}^{\ell} P(a_i) \log(P(a_i)) \tag{2} $$

Equation (2) implies that the larger the site is, the larger the information content is. Thus, in order to remove this effect, we divide the total information content by the length of the site, ℓ. In this way, the information content of a site a of length ℓ is defined as:

$$ \hat{I}(a \mid X) = -\frac{1}{\ell} \sum_{i=1}^{\ell} P(a_i) \log(P(a_i)) \tag{3} $$

Since log(1) = 0, for any P(a_i) = 1, a small threshold δ (say δ = 0.01) is subtracted from P(a_i) as follows:

$$ \log(P(a_i)) = \begin{cases} \log(0.99) & \text{if } P(a_i) = 1 \\ \log(P(a_i)) & \text{otherwise} \end{cases} \tag{4} $$

This correction avoids unexpected errors while computing the classification features for SLiMs containing dominating amino acids (i.e., when only one amino acid is present at a specific position of the SLiM).

E. Classification

The classification stage follows two validation approaches. One of these is leave-one-out validation with a k-NN classifier. For k-NN, we have used the Euclidean distance, with the metric learned using the specialized large margin nearest neighbor approach [29]. In the leave-one-out approach, each complex is selected in turn to be classified, and the nearest neighbors are found in the training dataset (the remaining complexes). However, the complex to be classified may contain potential sites for some motifs in the training dataset. To avoid using any information contained in the test set (the complex to be classified), the following two rules are followed in order to compute the scores for the test complex:

1) Exclude the profiles of all the SLiMs that were found in the test complex.
2) Consider only those profiles of SLiMs that have the same length as the current ℓ-mer.

The second approach uses cross-dataset validation to test the accuracy and significance of the newly proposed features. For cross-dataset validation we have used SVM and LDR for classification. The reason for using these two approaches (as opposed to k-NN) is to show the power of generalization of the scheme in the prediction of new complexes, which is provided by SVM and LDR.
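The site scoring of Equations (3) and (4) and the top-20 feature construction of Section II-C can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: the function names, the representation of a PSPM as a list of per-position dictionaries, the use of the natural logarithm, skipping residues with zero probability, pooling the scores of all profiles before taking the top 20, and padding short vectors with zeros.

```python
import math

DELTA = 0.01  # threshold delta of Eq. (4)


def site_information(site, profile):
    """Length-normalised information content of a site, Eq. (3).

    `profile` is a hypothetical PSPM layout: one dict per motif position,
    mapping a residue letter to its probability at that position.
    """
    if len(site) != len(profile):
        raise ValueError("site and profile must have the same length")
    total = 0.0
    for residue, column in zip(site, profile):
        p = column.get(residue, 0.0)
        if p == 0.0:
            continue  # assumption: a zero-probability residue contributes nothing
        # Eq. (4): replace log(1) by log(1 - delta) so the term does not vanish
        log_p = math.log(1.0 - DELTA) if p == 1.0 else math.log(p)
        total += p * log_p
    return -total / len(site)


def overlapping_lmers(sequence, width):
    """All overlapping windows of the given width: L - W + 1 of them."""
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]


def feature_vector(sequences, profiles, size=20):
    """Top `size` site scores of a complex, in decreasing order.

    Assumption: scores from all profiles are pooled, and vectors with
    fewer than `size` scores are padded with zeros.
    """
    scores = []
    for sequence in sequences:
        for profile in profiles:
            for site in overlapping_lmers(sequence, len(profile)):
                scores.append(site_information(site, profile))
    scores.sort(reverse=True)
    top = scores[:size]
    return top + [0.0] * (size - len(top))
```

For example, calling `feature_vector(["ACGAC"], [[{"A": 1.0}, {"C": 0.5, "G": 0.5}]])` scores the four overlapping 2-mers of the sequence against a single two-position profile and returns a 20-dimensional vector. Each ℓ-mer is compared only with profiles of its own length, in line with rule 2 above; rule 1 (excluding profiles of SLiMs found in the test complex) would be applied by filtering `profiles` before the call.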

For SVM, we have used LibSVM with a linear kernel and default parameters [30]. We have used SVM since it is a competitive classifier for a wide range of problems and due to its power of generalization. Since SVM is a very well-known classifier in the machine learning community, the relevant details of this technique are omitted. The reader is referred to more specific literature on SVMs [31].

For LDR we have employed three different criteria combined with linear or quadratic classifiers, resulting in a total of six classification schemes. Of these, the maximum accuracies for quadratic and linear among the three criteria are reported. LDR is based on a linear transformation of the data to a lower dimension in such a way that class separability is preserved, or even improved. We use three different LDR criteria [32]:

Fisher's discriminant analysis (FDA): It aims to maximize the following criterion:

$$ J_{FDA}(A) = \mathrm{tr}\left\{ (A S_W A^t)^{-1} (A S_E A^t) \right\} \tag{5} $$

The matrix A is found by considering the eigenvector corresponding to the largest eigenvalue of $S_{FDA} = S_W^{-1} S_E$.

Heteroscedastic discriminant analysis (HDA): It aims to obtain the matrix A that maximizes the function

$$ J_{HDA}(A) = \mathrm{tr}\left\{ (A S_W A^t)^{-1} \left[ A S_E A^t - A S_W^{1/2} \, \frac{p_1 \log(S_W^{-1/2} S_1 S_W^{-1/2}) + p_2 \log(S_W^{-1/2} S_2 S_W^{-1/2})}{p_1 p_2} \, S_W^{1/2} A^t \right] \right\} \tag{6} $$

which is resolved via eigenvalue decomposition.

Chernoff discriminant analysis (CDA): It aims to maximize the following function:

$$ J_{CDA}(A) = \mathrm{tr}\left\{ p_1 p_2 A S_E A^t (A S_W A^t)^{-1} + \log(A S_W A^t) - p_1 \log(A S_1 A^t) - p_2 \log(A S_2 A^t) \right\} \tag{7} $$

which is resolved via a gradient algorithm.

Once the linear transformation is applied, the new data is passed through a quadratic Bayesian (QB) or linear Bayesian (LB) classifier, where the latter is obtained by making the covariance matrices the same (i.e., by taking the average of the two covariances).
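As an illustration of the FDA criterion in Equation (5), the following dependency-free Python sketch computes a one-dimensional FDA projection for two classes. It is a sketch under stated assumptions rather than the implementation used in the experiments. It assumes the usual definitions, which the text does not spell out: S_1 and S_2 are the class covariance matrices, p_1 and p_2 the class proportions, S_W = p_1 S_1 + p_2 S_2 and S_E = (m_1 − m_2)(m_1 − m_2)^t, with m_1 and m_2 the class means. Under these assumptions S_E has rank one, so the eigenvector of S_W^{-1} S_E with the largest eigenvalue is proportional to S_W^{-1}(m_1 − m_2), and a linear solve is enough. All function names are hypothetical.

```python
def mean(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance(rows):
    """Covariance matrix of the rows (normalised by n)."""
    m = mean(rows)
    n, d = len(rows), len(m)
    cov = [[0.0] * d for _ in range(d)]
    for row in rows:
        diff = [x - mu for x, mu in zip(row, m)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j] / n
    return cov


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular within-class scatter matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fda_direction(class1, class2):
    """Unit vector proportional to S_W^{-1} (m_1 - m_2) for two classes."""
    n1, n2 = len(class1), len(class2)
    p1, p2 = n1 / (n1 + n2), n2 / (n1 + n2)
    s1, s2 = covariance(class1), covariance(class2)
    d = len(s1)
    # Assumed within-class scatter: S_W = p1 * S_1 + p2 * S_2
    s_w = [[p1 * s1[i][j] + p2 * s2[i][j] for j in range(d)] for i in range(d)]
    diff = [a - b for a, b in zip(mean(class1), mean(class2))]
    w = solve(s_w, diff)
    norm = sum(v * v for v in w) ** 0.5
    return [v / norm for v in w]


def project(rows, direction):
    """Project each row onto the FDA direction."""
    return [sum(x * w for x, w in zip(row, direction)) for row in rows]
```

The projected values returned by `project` would then be passed to a QB or LB classifier. HDA and CDA are not covered by this sketch, since they require matrix logarithms and, for CDA, a gradient-based optimization.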
In both approaches, accuracy is computed as the number of true positives (TP, obligate complexes that are correctly classified) plus the number of true negatives (TN, non-obligate complexes that are correctly classified), divided by the total number of samples.

III. RESULTS

The computational experiments and results presented in this study focus on two main aspects. The first is to demonstrate the ability of the model to learn the underlying parameters and optimize the prediction using the first validation approach with k-NN. In this regard, nearly perfect prediction accuracy has been obtained, even surpassing state-of-the-art structure-based methods. The second validation approach uses the two main datasets to show that the parameters learned are independent of the dataset being used, hence showing the power of generalization of the model in predicting new complexes. Interestingly enough, nearly perfect prediction accuracy is also obtained in this setup.

A. Leave-one-out Validation of PPI-SLiM-Seq

Table I shows the results of leave-one-out validation with k-NN. The first column of the table identifies the SLiM sets. Two different SLiM sets from each dataset (ZH and MW) are used. The other columns show the value of ℓ and the corresponding accuracies. For a specific value of ℓ, the maximum accuracy over the different values of k is selected and entered in the table. The values k = 1, 5, 10, 15, 20, 25, 30, 35 have been chosen arbitrarily. We use the Euclidean distance, with the metric learned using the specialized large margin nearest neighbor approach. In the typeset table, the largest accuracies for each dataset are underlined.

TABLE I
k-NN CLASSIFICATION RESULTS FOR THE DATASETS USING PPI-SLiM-Seq.

SLiM set                  length (ℓ)   ZH (%)   MW (%)
ℓ = 3-10, 1,000 SLiMs     10           98.54    98.77
                          9            99.27    99.07
                          8            95.62    99.27
                          7            99.27    96.31
                          6            99.27    96.62
                          5            99.27    99.27
                          4            99.27    97.54
ℓ = 2-7, 1,000 SLiMs      7            98.35    98.46
                          6            98.54    96.62
                          5            99.27    98.77
                          4            96.35    99.27
                          3            93.43    65.64

For ℓ = 10, the highest accuracy is 98.54% for k = 35 and the lowest is 95.62% for k = 5. For ℓ = 9, 7, 6, 5, the model yields the highest accuracy of 99.27%, and for ℓ = 8 it yields the highest accuracy of 95.62%. The table also shows that a partition size of 3 yields lower accuracy; in that case, k = 5 yields the highest accuracy of 93.43%. For the ZH dataset, the highest accuracy is 99.27% for different values of ℓ and k. For ℓ = 7 and all the values of k, the accuracy is 99.27%.

B. Cross-dataset Validation of PPI-SLiM-Seq

To further validate the results achieved by the proposed model, we ran the second validation approach, cross-dataset validation, on the two datasets. We used the SLiMs of the MW dataset for training with the ZH dataset for testing, and vice versa. Table II shows the results of cross-dataset validation. The first column briefly describes the dataset with its SLiMs, and the next column gives the partition size or SLiM length. The "SVM" column contains the classification accuracy obtained by SVM, and the remaining columns are for the QB and LB classifiers combined with different LDR criteria.

TABLE II
SVM AND LDR CLASSIFICATION RESULTS FOR THE ZH AND MW DATASETS WITH THE MW AND ZH SLiMs, RESPECTIVELY.

Dataset / SLiMs           (ℓ)   SVM     LDR Quadratic   LDR Linear
ZH Dataset, MW SLiMs      5     95.62   97.81           97.08
                          4     97.81   99.27           97.81
MW Dataset, ZH SLiMs      6     98.77   98.77           98.47
                          5     98.47   99.08           98.47

Using the ZH SLiMs for training with the MW dataset, ℓ = 6 and ℓ = 5 yield almost the same accuracy. Using the MW SLiMs for training with the ZH dataset, ℓ = 5 and ℓ = 4 also yield almost the same results. We chose the values of ℓ experimentally to maximize accuracy. For the ZH dataset with the MW SLiMs for training, cross-dataset validation yields accuracies of 97.81% and 99.27% for ℓ = 5 and ℓ = 4, respectively, using SVM and the different LDR schemes. For the same values of ℓ, leave-one-out with k-NN gives an accuracy of 99.27% (for k = 1) and 99.27% (for k = 1, 25, 30, 35), respectively. Similarly, for the MW dataset with the ZH SLiMs for training, cross-dataset validation gives accuracies of 98.77% and 99.08% for ℓ = 6 and ℓ = 5, respectively, using SVM and the different LDR schemes, while for the same values of ℓ, leave-one-out with k-NN yields accuracies of 96.62% (for k = 1, 5) and 99.07% (for k = 20, 25, 30, 35), respectively.

IV. DISCUSSION AND COMPARISON

In [6], Zhu et al. predicted obligate and non-obligate complexes with 88.32% accuracy. Note that these results are only on complexes of the ZH dataset, and the results have not been verified in this study. They used four NOXclass features with a two-stage SVM to achieve that performance. The authors of [7] achieved a maximum accuracy of 82.13% with LDR (HDA, FDA, CDA) combined with quadratic and linear classifiers, which is lower than that of [6]. Using electrostatic energies as properties, Vasudev and Rueda predicted obligate and non-obligate complexes with an accuracy of 96.17% [8], which is approximately 14% higher than the accuracy obtained by Aziz et al. [7] and 8% higher than the accuracy obtained by Zhu et al. [6]. In the proposed approach, only information contained in sequences is used as features, achieving a maximum accuracy of 99.27% using leave-one-out with k-NN, which is nearly perfect and higher than all previously proposed methods. Table III summarizes the comparison.

TABLE III
COMPARISON OF CLASSIFICATION ACCURACY WITH OTHER RELATED WORKS.

Dataset                  NOXClass   Aziz et al. [7]   Vasudev and Rueda [8]   PPI-SLiM-Seq with k-NN
Zhu et al. [6]           88.32      82.13             96.17                   99.27
Mintseris et al. [17]    -          80.86             97.36                   99.07

As an example, the proposed model is more powerful than the recently proposed model that uses electrostatic energies, surpassing that method by 3.10 percentage points on the ZH dataset and by 1.71 percentage points on the MW dataset (Table III). As a final remark, we note the importance of using SLiMs in the prediction of obligate and non-obligate protein complexes. The power of the proposed scheme demonstrates that evolutionary features derived from sequence information alone can describe the stability of protein complexes. This is shown in this work on two well-known datasets, and the results have been validated following sound protocols for machine learning prediction validation, including leave-one-out and cross-dataset validations; the latter shows the generalization properties of the proposed model. Many research avenues are open from this study, including the scalability of the model to a large number of proteins that are not currently available in the structural databases.

V. CONCLUSIONS AND FUTURE DIRECTIONS

In this study, we have used SLiMs identified by MEME to predict obligate and non-obligate complexes. The results are excellent, with accuracies of over 99% on well-known datasets. Cross-dataset validation demonstrates the power of the method to predict unknown complexes, while being independent of the underlying parameters. The results shown in the paper are superior to those of the state-of-the-art methods for predicting obligate complexes considered here. Although the proposed model is shown to be very powerful, it can be improved in various ways. One can use other SLiM identification tools in order to compare the results and perform further cross-dataset validation. We have considered a maximum sequence partition of length 10, which can be extended to a partition size of up to 20. Different parameters can be used in different combinations to identify different SLiM sets. Fixing the length of the SLiMs and the number of sites is another possible approach to SLiM identification. Finally, another approach that can be explored is to combine the information contained in the sequences with other physicochemical properties in prediction.

REFERENCES

[1] I. Kufareva and R. Abagyan, "Predicting molecular interactions in structural proteomics," in Computational Protein-Protein Interactions, N. R. and S. G., Eds. CRC Press, 2009.
[2] E. Levy and J. Pereira-Leal, "Evolution and dynamics of protein interactions and networks," Current Opinion in Structural Biology, vol. 18, pp. 1–9, 2008.
[3] R. Nussinov and B. Ma, "Protein dynamics and conformational selection in bidirectional signal transduction," BMC Biology, vol. 10, p. 2, 2012.
[4] I. Nooren and J. Thornton, "Diversity of protein–protein interactions," The EMBO Journal, vol. 22, no. 14, pp. 3486–3492, 2003.
[5] S. Park, J. Reyes, D. Gilbert, J. Kim, and S. Kim, "Prediction of protein-protein interaction types using association rule based classification," BMC Bioinformatics, vol. 10, no. 1, p. 36, 2009.
[6] H. Zhu, F. Domingues, I. Sommer, and T. Lengauer, "NOXclass: prediction of protein-protein interaction types," BMC Bioinformatics, vol. 7, no. 1, p. 27, 2006.
[7] M. Aziz, M. Maleki, L. Rueda, M. Raza, and S. Banerjee, "Prediction of biological protein–protein interactions using atom-type and amino acid properties," Proteomics, 2011.
[8] G. Vasudev and L. Rueda, "A model to predict and analyze protein-protein interaction types using electrostatic energies," BIBM, 2012.
[9] Y. Zhou, Y. Zhou, F. He, J. Song, and Z. Zhang, "Can simple codon pair usage predict protein-protein interaction?" Molecular BioSystems, vol. 8, pp. 1394–1404, 2012.
[10] F. Bernstein, T. Koetzle, G. Williams, E. Meyer Jr, M. Brice, J. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi, "The Protein Data Bank," European Journal of Biochemistry, vol. 80, no. 2, pp. 319–324, 1977.
[11] T. U. Consortium, "Reorganizing the protein space at the Universal Protein Resource (UniProt)," Nucleic Acids Research, vol. 40, no. D1, pp. D71–D75, 2012.
[12] S. Orchard, S. Kerrien, S. Abbani et al., "Protein interaction data curation: the International Molecular Exchange (IMEx) consortium," Nature Methods, vol. 9, pp. 345–350, 2012.
[13] N. Davey, G. Travé, and T. Gibson, "How viruses hijack cell regulation," Trends in Biochemical Sciences, vol. 36, no. 3, pp. 159–169, 2011.
[14] T. Bailey, M. Boden, T. Whitington, and P. Machanick, "The value of position-specific priors in motif discovery using MEME," BMC Bioinformatics, vol. 11, p. 179, 2010.
[15] M. Pandit and L. Rueda, "Prediction of biological protein-protein interaction types using short-linear motifs," in Proc. of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics. ACM, 2013, pp. 698–699.
[16] R. Abagyan, M. Totrov, and D. Kuznetsov, "ICM–A new method for protein modeling and design: applications to docking and structure prediction from the distorted native conformation," Journal of Computational Chemistry, vol. 15, no. 5, pp. 488–506, 1994.
[17] J. Mintseris and Z. Weng, "Structure, function, and evolution of transient and obligate protein–protein interactions," Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 31, p. 10930, 2005.
[18] R. Edwards, N. Davey, and D. Shields, "SLiMFinder: a probabilistic method for identifying over-represented, convergently evolved, short linear motifs in proteins," PLoS One, vol. 2, no. 10, p. e967, 2007.
[19] N. Davey, N. Haslam, D. Shields, and R. Edwards, "SLiMSearch 2.0: biological context for short linear motifs in proteins," Nucleic Acids Research, vol. 39, no. suppl 2, pp. W56–W60, 2011.
[20] R. Gutman, C. Berezin, R. Wollman, Y. Rosenberg, and N. Ben-Tal, "QuasiMotiFinder: protein annotation by searching for evolutionarily conserved motif-like patterns," Nucleic Acids Research, vol. 33, no. suppl 2, pp. W255–W261, 2005.
[21] T. Mi, J. Merlin, S. Deverasetty, M. Gryk, T. Bill, A. Brooks, L. Lee, V. Rathnayake, C. Ross, D. Sargeant et al., "Minimotif Miner 3.0: database expansion and significantly improved reduction of false-positive predictions from consensus sequences," Nucleic Acids Research, vol. 40, no. D1, pp. D252–D260, 2012.
[22] A. Neuwald, J. Liu, and C. Lawrence, "Gibbs motif sampling: detection of bacterial outer membrane protein repeats," Protein Science, vol. 4, no. 8, pp. 1618–1632, 1995.
[23] J. Bock and D. Gough, "Predicting protein–protein interactions from primary structure," Bioinformatics, vol. 17, no. 5, pp. 455–460, 2001.
[24] Y. Ofran and B. Rost, "Predicted protein-protein interaction sites from local sequence information," FEBS Letters, vol. 544, no. 1-3, pp. 236–239, 2003.
[25] J. Shen, J. Zhang, X. Luo, W. Zhu, K. Yu, K. Chen, Y. Li, and H. Jiang, "Predicting protein–protein interactions based only on sequences information," Proceedings of the National Academy of Sciences, vol. 104, no. 11, p. 4337, 2007.
[26] B. Wang, P. Chen, D. Huang, J. Li, T. Lok, and M. Lyu, "Predicting protein interaction sites from residue spatial sequence profile and evolution rate," FEBS Letters, vol. 580, no. 2, pp. 380–384, 2006.
[27] W. Valdar, "Scoring residue conservation," Proteins: Structure, Function, and Bioinformatics, vol. 48, no. 2, pp. 227–241, 2002.
[28] I. Eidhammer, I. Jonassen, and W. Taylor, Protein bioinformatics: an algorithmic approach to sequence and structure analysis. Wiley Online Library, 2004.
[29] A. Mucherino, P. Papajorgji, and P. Pardalos, "k-Nearest Neighbor Classification," Data Mining in Agriculture, pp. 83–106, 2009.
[30] C. Chang and C. Lin, "Libsvm: a library for support vector machines," last date accessed: May 31, 2011. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf
[31] S. Abe, Support Vector Machines for Pattern Classification. Springer, 2005.
[32] L. Rueda and M. Herrera, "Linear dimensionality reduction by maximizing the Chernoff distance in the transformed space," Pattern Recognition, vol. 41, no. 10, pp. 3138–3152, 2008.