Prediction of Biological Protein-protein Interaction ...

0 downloads 0 Views 100KB Size Report
Keywords protein-protein interaction, short-linear motifs, classification. 1. ... tain large numbers of proteins and experimentally known interactions, including ...
Prediction of Biological Protein-protein Interaction Types Using Short-Linear Motifs Manish Pandit, Luis Rueda, and Alioune Ngom School of Computer Science University of Windsor 401 Sunset Avenue, Windsor, Ontario N9B 3P4, Canada

{panditm,lrueda,angom}@uwindsor.ca

ABSTRACT Protein-protein interactions (PPIs) play a key role in many biological processes and functions in living cells. Thus, identification, prediction, and analysis of PPIs are important aspects in molecular biology. We propose a computational model to predict biological PPI types using short-linear motifs (SLiMs). The information contained in protein sequences is used to distinguish between interaction types, namely obligate and non-obligate. Classifiers, such as k -nearest neighbor (k -NN), support vector machine (SVM) and linear dimensionality reduction (LDR) on two well-known datasets confirm the power of the proposed model with accuracy above 99%. The results show that the information contained in the training sequences is crucial for prediction and analysis of biological PPIs.

Categories and Subject Descriptors I.5.2 [Pattern Recognition]: Design Methodology—Classifier design and evaluation, Pattern analysis; J.3 [Life and Medical Science]: Biology and genetics

focus on prediction of two types of interactions, namely obligate and non-obligate. The most effective approaches for prediction of PPIs use mainly structural information of protein complexes to calculate the feature values [3, 4]. However, for most of the interactomes only a fraction of the structures is currently known, compared to high-throughput databases that contain large numbers of proteins and experimentally known interactions, including those in the IMEx Consortium: DIP, IntAct, MINT, I2D, MatrixDB, BioGRID, InnateDB and SIB [5], among others. As a result, models based on protein structures are limited to availability of structural information, while some methods have used homology-based models to predict structural properties of the interactions [6]. On the other hand, short-linear motifs (SLiMs) are short patterns of 3-10 amino acids that have been found to be biologically meaningful because of their capacity to encode functional aspects, bind to important domains (e.g., PDZ and SH2), and enrichment in intrinsically disordered regions of proteins. SLiMs should also be able to function independently of their tertiary structure context and their tendency to evolve convergently [7].

General Terms Algorithms, Performance, Experimentation

Keywords protein-protein interaction, short-linear motifs, classification.

1. INTRODUCTION Prediction of PPIs has gained much interest in recent years with many different proposed methods [1]. Obligate interactions are usually considered as permanent, while nonobligate interactions can be either permanent or transient [2]. Non-obligate and transient interactions are more difficult to study and understand due to their instability and short life, while obligate and permanent interactions last for a longer period of time, and hence are more stable [2]. We

Copyright is held by author/owner(s) BCB’13, September 22-25, 2013, Washington, DC ACM 978-1-4503-2434-2/13/09.

ACM-BCB 2013

2.

METHODS

The method that we propose (PPI-SLiM-Seq) uses shortlinear motifs as properties to predict obligate and non-obligate protein interaction types. SLiM-based scores are assigned to all possible ℓ-mers of the sequences with a certain length ℓ, creating features that are used by known classifiers such as k -NN, LDR and SVM to predict unknown protein complexes. We have used MEME [8] to find SLiMs from a compiled sequence dataset which contains all the sequences for all the chains of each complex listed in the dataset (ZH or MW). We deployed MEME on the SHARCNET (www.sharcnet.ca), which also allows parallel processing. In this implementation, the parameters of MEME were optimized to find 500 and 1,000 SLiMs in the datasets. The length of the SLiMs were set to 3 − 10 and 2 − 7, the minimum number of sites to 8 and the maximum number of sites to 200. For a complex to be classified, the sequences are divided into overlapping ℓ-mers, which are potential sites of motifs in the training dataset. Considering an ℓ-mer a in a sequence of length L, each sequence in the complex is parsed to find all possible overlapping ℓ-mers of length W , delivering a total of {L − (W + 1)} ℓ-mers. Then, the score or information contained in ℓ-mer a, given a profile X that represents a SLiM of length ℓ is computed as follows:

698

LDR

∑ ℓ

ˆ | X) = − 1 × I(a ℓ

P (ai ) × log(P (ai ))

(1)

i=1 Since log(1) = 0, for any P (ai ) = 1, a small threshold δ (say δ = 0.01) is subtracted from P (ai ) as follows: { log(0.99) log(P (ai )) = log(P (ai ))

if P (ai ) = 1 otherwise

(2)

We have implemented leave-one-out validation with a k NN classifier that uses the Euclidean distance which is learned using specialized large margin nearest neighborhood. We have also implemented cross-dataset validation to test the accuracy and significance of the newly proposed features. For cross-dataset validation we have used LDR and SVM for classification. For SVM, we have used the linear kernel with default parameters. LDR is based on linear transformation of the data to a lower dimension in such a way that class separability is preserved, or even improved. We use three different LDR criteria, namely Fisher’s discriminant analysis (FDA), heteroscedastic discriminant analysis and Chernoff discriminant analysis (CDA) [4]. The data is linearly transformed onto a lower-dimensional space and then passed through a quadratic or linear Bayesian classifier.

3. RESULTS AND DISCUSSION For experimental purposes, we have used two pre-classified, curated datasets of protein complexes obtained from previous studies [4], namely Zhu et al. (ZH) and Mintseris et al. (MW). The ZH dataset contains 75 obligate and 62 nonobligate complexes, and the MW dataset contains 212 non-obligate and 115 obligate complexes. Several experiments have been conducted, showing the power of the prediction scheme. By using motifs of a particular length ℓ and applying the k -NN classifier following a leave-one-out validation procedure (see Mehtods), accuracies of over 99% have been achieved. We have also implemented a cross-dataset validation procedure. We used the SLiMs of the MW dataset for training with the ZH dataset for testing and vice versa. Table 1 shows the results of crossdataset validation. The first column briefly describes the dataset with SLiMs, and the next column is for the partition size or SLiM length. The “SVM” column contains the classification accuracy obtained by SVM and the remaining columns are for different LDR criteria. We used the ZH SLiMs for training with the MW dataset for ℓ = 6, 5 yielding almost the same accuracy. Using MW SLiMs for training with the ZH dataset for ℓ = 5, 4 also yields almost the same results. We chose the values of ℓ experimentally to maximize accuracy. Cross dataset validation yields accuracies of over 97%, and noticeable, values above 99% for ℓ = 4 (MW dataset) for ℓ = 5 (ZH dataset). Additional results (not included here) show that PPISLiM-Seq is at least 2% more accurate than previous approaches that use structural information (solvent accessibility, desolvation and electrostatic energies). This demonstrates the power of the proposed scheme that uses sequence information only to predict and analyze the stability of protein complexes. This is shown in this work in two well-known

ACM-BCB 2013



SVM

Quadratic

Linear

ZH Dataset

5

95.62

97.81

97.08

MW SLiMs

4

97.81

99.27

97.81

MW Dataset

6

98.77

98.77

98.47

ZH SLiMs

5

98.47

99.08

98.47

Table 1: SVM and LDR classification results for the ZH and MW datasets with the MW and ZH SLiMs respectively.

datasets, and the results have been validated following sound protocols for machine learning prediction validation, including leave-one-out cross-validation and cross-dataset validation – the latter shows the generalization properties of the proposed model. Many research avenues are open from this study, including the scalability of the model for prediction of high-throughput PPI data.

4.

REFERENCES

[1] S. Park, J. Reyes, D. Gilbert, J. Kim, and S. Kim, “Prediction of protein-protein interaction types using association rule based classification,” BMC bioinformatics, vol. 10, no. 1, p. 36, 2009. [2] I. Nooren and J. Thornton, “Diversity of protein–protein interactions,” The EMBO journal, vol. 22, no. 14, pp. 3486–3492, 2003. [3] M. Aziz, M. Maleki, L. Rueda, M. Raza, and S. Banerjee, “Prediction of biological protein–protein interactions using atom-type and amino acid properties,” Proteomics, 2011. [4] G. Vasudev and L. Rueda, “A model to predict and analyze protein-protein interaction types using electrostatic energies,” BIBM, 2012. [5] S. Orchard, S. Kerrien, S. Abbani et al., “Protein interaction data curation: the international molecular exchange (imex) consortium.” Nature Methods 9:345-350, 2012. [6] Q. Zhang, D. Petrey, L. Deng et al., “Structure-based prediction of protein-protein interactions on a genome-wide scale.” Nature 490:556-560, 2012. [7] N. Davey, G. Trav´e, and T. Gibson, “How viruses hijack cell regulation,” Trends in biochemical sciences, vol. 36, no. 3, pp. 159–169, 2011. [8] T. Bailey, M. Boden, T. Whitington, and P. Machanick, “The value of position-specific priors in motif discovery using MEME,” BMC Bioinformatics, vol. 11, p. 179, 2010.

699