
Prediction of Protein-Protein Recognition Using Support Vector Machine Based on Feature Vectors

Huang-Cheng Kuo1*, Ping-Lin Ong2, Jung-Chang Lin1, Jen-Peng Huang3

1 Department of Computer Science and Information Engineering, National Chiayi University, Taiwan 600
2 Department of Biochemical Science and Technology, National Chiayi University, Taiwan 600
3 Department of Information Management, Southern Taiwan University, Taiwan 710

* [email protected]

Abstract

Protein-protein recognition plays a crucial role in the regulation of biochemical pathways and in signal transmission, and its analysis has recently become a popular research topic. A protein recognizes another protein by forming a transient complex with it; otherwise the two form a permanent complex. Understanding the physico-chemical properties of the protein interface can therefore offer important clues to biological processes and functions. In this paper, we propose a prediction method for protein-protein recognition based on features extracted from the residues. Residues on the binding sites of the two contacting proteins in a complex are projected from 3D onto a 2D plane. To give all complexes the same orientation, each 2D plane is rotated by an angle determined by principal component analysis (PCA). The 2D plane is then partitioned into a 5x5 grid. The feature vector is composed of the distribution of residue polarity, electric charge, and hydrophobicity on the 2D plane. A support vector machine (SVM) is adopted for prediction. Experimental results show that the prediction achieves an accuracy of 80%.

Keywords: Protein-protein recognition, Feature vector, Principal component analysis, Support vector machine.

1. Introduction

Proteins are the major catalytic agents, signal transmitters, and transporters in cells [4]. Protein-protein interactions play a key role in protein function and are involved in signaling cascades and biochemical pathways. Analysis of the physico-chemical properties of the protein interface can offer clues to biological function and cellular processes. However, wet-lab analysis of protein-protein interactions is still highly time-consuming and expensive, so algorithms for predicting protein-protein interactions have been developed. Many studies [1][2][3][8][10][15][17][18] focus on protein interactions and protein sequence alignment. If one protein sequence is homologous to another, the two may be classified into the same group, and the known protein can then be used to predict the structure and function of the unknown protein. Predicting the interaction relationships between proteins is therefore an important issue of interest to biologists.

In this paper we propose a new and efficient method for predicting recognition protein complexes, based on an analysis of interface properties and on mathematical methods. We devise a feature vector that represents the properties of binding site residues in terms of polarity, electric charge, and hydrophobicity. A support vector machine is adopted to separate two classes of complexes, namely transient and permanent complexes, or recognition and non-recognition. What causes the proteins to combine is outside the scope of this paper; we focus on the conditions or physico-chemical properties that distinguish recognition complexes from non-recognition complexes. There are four kinds of protein complexes: homo-complexes, hetero-complexes, protein-inhibitor complexes, and antibody-antigen complexes [16]. Here we adopt the hetero-complexes and homo-complexes. We also discuss which physico-chemical properties lead proteins to transient recognition.

When two proteins recognize each other, not all of their surfaces are involved; only some surfaces come into contact. These contact surfaces are called binding sites, and they carry many of the main functions and characteristics of the proteins. We extract information about the residues on the contacting binding sites of a protein complex. Such information includes residue propensity, hydrophobicity, accessible surface area, shape index, electrostatic potential, curvedness, and conservation scores [17]. We retrieve the binding site data from the BOND website [19], which provides a large number of protein complex binding sites, as our experimental data. Viruses can bind to the cavities of protein binding sites. Finding recognizable proteins, regarded as chemical compounds, that fill these cavities and prevent viral invasion is therefore important, and such discoveries can help advance drug design.

In this paper, we project the 3D residues on the two contacting binding sites of a protein complex onto a 2D plane, because (1) it is more convenient to observe the residue distribution in a 2D plane, and (2) it is easier to rotate a 2D plane. The rotation angle is determined with the PCA method. After rotating the residues on the 2D plane, we observe their spread on the plane and write the specific characteristics into the feature vector, which represents the character of a protein complex. Once all feature vectors are built, we use them to train an SVM model and test the prediction accuracy with leave-one-out cross-validation. The experimental results show that this method can achieve an accuracy of 80%, which indicates that the method is effective.

The rest of this paper is organized as follows. Section 2 presents related work. Section 3 describes the feature vector. Section 4 describes the preparation of the datasets for the experiments. Section 5 elaborates on the proposed prediction mechanism. Section 6 reports the experimental results. Section 7 gives conclusions and future work.

2. Related works

Many studies have improved the prediction accuracy of protein-protein interaction by using binding sites [10][17], and others have exploited domains to predict interactions [3][15]. When predicting with binding sites, we want to know which attributes are important and clearly express the protein characteristics. Patel et al. [17] mention seven properties of binding sites for analysis: residue propensity, hydrophobicity, accessible surface area, shape index, electrostatic potential, curvedness, and conservation score. In addition, some studies show that the interface area of a binding site is about 1600 (±400) Å² [12][13]. To classify transient and permanent complexes, many studies apply support vector machines [10][15] or neural networks [1][14] to train models and predict protein complexes. Mintseris and Weng [11] use atomic contact vectors (ACVs) to represent the interface of interactions between proteins, and test the utility of ACVs by using quadratic Fisher discriminant (QFD) and kernel discriminant analysis (KDA) to classify transient and permanent recognition complexes. That approach has two disadvantages. First, the vector dimension is too high, because every atomic type contact is counted even when no such contact exists. Second, too few attributes are adopted. In this paper, we use residue pairs at the binding sites instead of atom pairs, and the dimension of our feature vectors is much lower than in the above-mentioned methods.

3. Feature vector

The feature vector of a complex consists of polarity, electric charge, hydrophobicity, and the distance between the residues of the two proteins. We project the binding site residues of the protein complex onto a 2D plane, and then rotate the plane so that the residues are aligned vertically on it. The 2D plane is partitioned into a 5x5 grid. All blocks of the grid are of equal size; the width and the height of the grid are determined by the range that the residues of the two proteins span on the plane. The grid is illustrated in Figure 1.

3.5. The electric attraction of each block

Because like charges repel and opposite charges attract, we sum the electric charges of the residues in each block and write each sum to the feature vector as one field. This attribute therefore produces 25 fields. The electric charges are represented by +1, 0, and -1 for positive charge, no charge, and negative charge, respectively. Table 2 lists the electric charge of the 20 amino acids.
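As a rough illustration of this attribute, the sketch below sums the Table 2 charges of the residues that fall into each block of the 5x5 grid. The function name `block_charge_sums`, the `(row, col, residue)` input format, and the row-major ordering of the 25 fields are assumptions made for the example; the paper does not specify them.

```python
# Electric charge per amino acid as listed in Table 2 (+1, 0, -1).
CHARGE = {
    "ALA": 0, "ARG": 1, "ASN": 0, "ASP": -1,
    "CYS": 0, "GLU": -1, "GLN": 0, "GLY": 0,
    "HIS": 1, "ILE": 0, "LEU": 0, "LYS": 1,
    "MET": 0, "PHE": 0, "PRO": 0, "SER": 0,
    "THR": 0, "TRP": 0, "TYR": 0, "VAL": 0,
}


def block_charge_sums(placed_residues, grid_size=5):
    """Return grid_size * grid_size charge sums in row-major order.

    placed_residues: iterable of (row, col, residue_name) tuples, where
    row and col are 0-based block indices (hypothetical input format).
    """
    sums = [0] * (grid_size * grid_size)
    for row, col, name in placed_residues:
        sums[row * grid_size + col] += CHARGE[name.upper()]
    return sums


# Toy input: Arg and Asp share block (0, 0), so their charges cancel.
example = [(0, 0, "ARG"), (0, 0, "ASP"), (2, 3, "LYS"), (4, 4, "GLU")]
fields = block_charge_sums(example)
```

Here `fields` holds the 25 values that this attribute would contribute to the feature vector of one complex.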

Figure 1. A protein complex – 1AXI. The different colors represent the different proteins in the same complex.

Table 2. The electric charge of 20 amino acids

After testing many attributes with the SVM, we find that five attributes yield a much higher prediction accuracy. We use them to construct a feature vector that represents the properties of a protein complex. They are physical and chemical characteristics of the residues on the binding sites and are described as follows.

3.1. The ratio of residues distributed area

We count the number of blocks of the 2D grid that are occupied by residues of the two proteins. This number is divided by 25 to obtain the ratio.

3.2. The number of binding sites

This attribute counts the binding site residues of the two proteins in the same protein complex. Each residue on a protein's binding site is counted only once.

3.3. The average distance of residue pairs

We calculate the distance between the two residues of each binding pair in the 3D structure and average these distances.
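A minimal sketch of this attribute is shown below. It assumes that each residue is represented by one 3D coordinate (for instance its Cα atom, as in Step 1 of Section 5) and that the binding pairs are already known; the function name `average_pair_distance` is hypothetical.

```python
import math


def average_pair_distance(pairs):
    """Mean Euclidean distance over binding residue pairs.

    pairs: list of ((x1, y1, z1), (x2, y2, z2)) tuples, one tuple per
    residue pair, with one 3D coordinate per residue (assumed input).
    """
    if not pairs:
        raise ValueError("at least one residue pair is required")
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# Two toy pairs with distances 5 and 3, so the mean is 4.
pairs = [((0, 0, 0), (3, 4, 0)), ((1, 1, 1), (1, 1, 4))]
mean_distance = average_pair_distance(pairs)
```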

3.4. The hydrophobicity of each residue pair

Hydrophobic amino acids aggregate with each other to avoid contact with H2O, and the force driving this aggregation is called the hydrophobic bond. This attribute therefore counts how many residue pairs show hydrophobic aggregation. The hydrophobicity of the 20 amino acids is shown in Table 1, where "1" means hydrophobic and "0" means hydrophilic.
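One possible reading of this attribute is sketched below: a pair counts as a hydrophobic aggregation when both of its residues are labeled 1 in Table 1. That interpretation, and the name `count_hydrophobic_pairs`, are assumptions; the paper does not state the exact counting rule.

```python
# Residues labeled 1 (hydrophobic) in Table 1.
HYDROPHOBIC = {"ALA", "CYS", "ILE", "LEU", "MET", "PHE", "VAL"}


def count_hydrophobic_pairs(pairs):
    """Count residue pairs in which both residues are hydrophobic.

    pairs: iterable of (residue_name_a, residue_name_b) tuples using
    three-letter amino acid codes (assumed input format).
    """
    return sum(
        1
        for a, b in pairs
        if a.upper() in HYDROPHOBIC and b.upper() in HYDROPHOBIC
    )


# Leu-Val and Ile-Ala qualify; Arg-Asp and Phe-Ser do not.
pairs = [("LEU", "VAL"), ("ARG", "ASP"), ("PHE", "SER"), ("ILE", "ALA")]
hydrophobic_pair_count = count_hydrophobic_pairs(pairs)
```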

Table 1. The hydrophobicity of 20 amino acids (1 = hydrophobic, 0 = hydrophilic)

Ala 1    Arg 0    Asn 0    Asp 0
Cys 1    Glu 0    Gln 0    Gly 0
His 0    Ile 1    Leu 1    Lys 0
Met 1    Phe 1    Pro 0    Ser 0
Thr 0    Trp 0    Tyr 0    Val 1

Electric charge of the 20 amino acids (values of Table 2):

Ala 0    Arg +1   Asn 0    Asp -1
Cys 0    Glu -1   Gln 0    Gly 0
His +1   Ile 0    Leu 0    Lys +1
Met 0    Phe 0    Pro 0    Ser 0
Thr 0    Trp 0    Tyr 0    Val 0

4. Datasets

This paper uses two kinds of experimental data: transient recognition complexes and permanent recognition complexes. The recognition complexes are adopted from [11], in which Julian Mintseris and Zhiping Weng identified 209 transient recognition complexes, including 34 antibody-antigen complexes and 60 enzyme-inhibitor complexes. First, we obtain the binding sites from the BOND website (http://bond.unleashedinformatics.com/), which provides the detailed amino acid numbers of a pair of interacting proteins. Second, we retrieve the 3D structure coordinates of the proteins from the PDB [7] (http://www.rcsb.org/pdb/), which provides a large number of accurate three-dimensional protein complex structures. Because matching binding sites could not be found on the BOND website for all of the 209 recognition complexes, those complexes could not be integrated with the PDB data. After filtering out the inadequate data, 68 transient recognition complexes remain for the experiment.

The permanent recognition data are adopted from [6], which provides 76 homodimer proteins. A homodimer is formed by two independent proteins that usually come from the same family or have similar structures. Most homodimers are permanent complexes [10]. We therefore assume that homodimers are permanent complexes (i.e., non-recognition), and we obtain 52 permanent complexes. The total of 120 protein complexes is shown in Table 3.

5. Methods

The proposed method is divided into five steps.

Step 1. The 3D coordinates of the residues on the protein binding sites are obtained by looking up the PDB file and the BOND file. To simplify the calculation, we use the coordinate of the Cα atom of each residue. Three points are needed to determine the projection plane. The mid-point of each residue pair on the binding sites is computed, and the three points are determined as follows: the first point is the mean of all the mid-points; the second point is the mid-point farthest from the first point; the third point is the mid-point farthest from the second point. Euclidean distance is used. Figure 2 shows this step.

Figure 2. Illustration of the first step, which constructs the projection plane from the two protein binding sites.

Step 2. Project all of the Cα atoms of the residues on the binding sites onto this plane, which differs for each protein complex.

Step 3. Rotate the residues on this plane twice. First, we rotate the plane so that it is parallel to the YZ plane. Second, we rotate the plane again so that it is parallel to the XY plane, which eliminates the Z coordinate of the residues. We can then work with the (x, y) coordinates alone, as shown in Figure 3. The counter-clockwise rotation is

\[
\begin{bmatrix} x' \\ y' \end{bmatrix} =
\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}
\]

where θ is the angle of rotation, and x' and y' are the new coordinates of x and y.

Step 4. To improve the prediction accuracy of the SVM, we use principal component analysis (PCA) to rotate the binding site residues of the two proteins on the 2D plane. We calculate the eigenvector of the first principal component from the covariance matrix, and then use it to rotate the binding site residues on the plane so that they are parallel to the Y axis.

Step 5. All residues on the 2D plane are placed into a 5x5 grid. All blocks are of equal size; the width and the height of the grid are determined by the range that the residues of the two proteins span. It is worth noting that the block size differs between complexes, as shown in Figure 4.
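Steps 1 to 3 can be sketched as follows. The sketch picks the three plane points as described in Step 1 and then expresses every atom in an orthonormal basis of that plane. Using a plane basis is a shortcut chosen for the example instead of the two successive rotations of Step 3; the result should differ from the paper's coordinates by at most an in-plane rotation or reflection. All function names are hypothetical, and the code assumes the three points are not collinear.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(v):
    norm = math.sqrt(dot(v, v))  # zero if the three points are collinear
    return tuple(x / norm for x in v)


def plane_points(pairs):
    """Step 1: mean of mid-points, then two successive farthest mid-points."""
    mids = [tuple((a + b) / 2 for a, b in zip(r1, r2)) for r1, r2 in pairs]
    p1 = tuple(sum(c) / len(mids) for c in zip(*mids))
    p2 = max(mids, key=lambda m: math.dist(m, p1))
    p3 = max(mids, key=lambda m: math.dist(m, p2))
    return p1, p2, p3


def project_to_2d(points, p1, p2, p3):
    """Steps 2-3: orthogonal projection onto the plane, as (x, y) pairs."""
    u = unit(sub(p2, p1))                        # first in-plane axis
    n = unit(cross(sub(p2, p1), sub(p3, p1)))    # plane normal
    v = cross(n, u)                              # second in-plane axis
    return [(dot(sub(p, p1), u), dot(sub(p, p1), v)) for p in points]


# Toy complex: three residue pairs whose mid-points lie in the plane z = 1.
pairs = [((0, 0, 0), (0, 0, 2)), ((4, 0, 0), (4, 0, 2)), ((0, 3, 0), (0, 3, 2))]
atoms = [atom for pair in pairs for atom in pair]
flat = project_to_2d(atoms, *plane_points(pairs))
```

In the toy input the two residues of each pair sit directly above and below the plane, so they project to the same 2D point, while in-plane distances are preserved.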

Figure 4. The distribution of 1AXI in two views: the left graph shows the result after Step 5, and the right graph shows the original 3D structure rendered by the RasMol software.

The steps above constitute our proposed, novel method. In the experiments we use them to obtain the feature vector of each complex, and then predict whether a complex is recognition or non-recognition by applying a support vector machine (SVM).
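Steps 4 and 5 can be sketched in the same spirit. For 2D data the direction of the first principal component has a closed form, so the sketch computes it directly from the covariance terms and then applies the counter-clockwise rotation formula. The bounding-box grid, the clamping of points on the upper edge into the last block, and all function names are assumptions made for the example. Note that PCA fixes the axis only up to a 180 degree flip.

```python
import math


def rotate(points, theta):
    """Counter-clockwise rotation of 2D points by theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def align_to_y_axis(points):
    """Step 4: center the points and turn the first principal axis onto Y."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n
    # Angle of the first principal axis of the 2x2 covariance matrix.
    phi = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return rotate(centered, math.pi / 2 - phi)


def grid_cells(points, size=5):
    """Step 5: assign each point a (row, col) block in a size x size grid."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x0, y0 = min(xs), min(ys)
    width = (max(xs) - x0) / size or 1.0    # avoid zero-width blocks
    height = (max(ys) - y0) / size or 1.0
    return [(min(int((y - y0) / height), size - 1),
             min(int((x - x0) / width), size - 1)) for x, y in points]


# Toy input: points on the diagonal end up spread along the Y axis.
aligned = align_to_y_axis([(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)])
cells = grid_cells(aligned)
occupied_ratio = len(set(cells)) / 25       # attribute 3.1
```

The `occupied_ratio` line shows how the ratio of residues distributed area from Section 3.1 would follow from the block assignment.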

Figure 3. The rotation illustration. The counter-clockwise rotation formula is the 2D rotation matrix given with the method steps in Section 5.

6. Experimental results

In this section we use LIBSVM [20] with leave-one-out cross-validation to measure the prediction accuracy for transient recognition complexes and permanent recognition complexes (non-recognition), using the attributes of the feature vector described above. LIBSVM is an SVM tool developed by Chih-Jen Lin and his laboratory members.
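The leave-one-out protocol itself can be illustrated with a short sketch. The classifier is passed in as a callback so that the evaluation loop stays independent of the learning method. The `nearest_centroid` function below is only a stand-in used to keep the example self-contained; it is not an SVM and not what the paper uses. Reproducing the experiments would require substituting an SVM trainer such as LIBSVM for it. All names are hypothetical.

```python
def leave_one_out_accuracy(vectors, labels, train_and_predict):
    """Hold out each sample once, train on the rest, and score the hit rate."""
    correct = 0
    for i in range(len(vectors)):
        train_x = vectors[:i] + vectors[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if train_and_predict(train_x, train_y, vectors[i]) == labels[i]:
            correct += 1
    return correct / len(vectors)


def nearest_centroid(train_x, train_y, query):
    """Stand-in classifier (NOT an SVM): pick the class with the closest mean."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return min(
        centroids,
        key=lambda lab: sum((q - c) ** 2 for q, c in zip(query, centroids[lab])),
    )


# Toy data: label 1 for recognition, 0 for non-recognition.
vectors = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
labels = [0, 0, 0, 1, 1, 1]
accuracy = leave_one_out_accuracy(vectors, labels, nearest_centroid)
```

With 120 complexes, this loop would train 120 models, each tested on the single complex that was held out.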

Table 3. The dataset of 68 recognition complexes and 52 non-recognition complexes

Recognition complexes

1A2K A:D

1A4Y A:B

1ACB E:I

1ARO P:L

1ATN A:D

1AVW A:B

1AVZ B:C

1AXI A:B

1AY7 A:B

1B41 A:B

1BKD R:S

1BLX A:B

1BML A:C

1BP3 A:B

1BQQ T:M

1BUH A:B

1BVN P:T

1BZQ A:L

1C1Y A:B

1C4Z A:D

1CD9 A:B

1CDM A:B

1CLV A:I

1CMX A:B

1CSE E:I

1CXZ A:B

1D2Z A:B

1D5M A:C

1D6R T:A

1DE4 A:C

1DF9 B:C

1DFJ E:I

1DHK A:B

1DN1 A:B

1DPJ A:B

1DS6 A:B

1DTD A:B

1DZB A:X

1E0O A:B

1E44 A:B

1E96 A:B

1EAI A:C

1EAY A:C

1EV2 A:E

1F02 I:T

1F3V A:B

1F60 A:B

1F7Z A:I

1FC2 C:D

1FOE A:B

1FQ1 A:B

1FYH A:B

1G3N A:C

1GH6 A:B

1GL4 A:B

1HX1 A:B

1I1R A:B

1I2M A:B

1I5K A:C

1IAR A:B

1IBR A:B

1IM3 A:D

1J7V L:R

1JDH A:B

1JDP A:H

1JTD A:B

1JTG A:B

1KAC A:B

Non-recognition complexes

11GS A:B

1AD3 A:B

1AFW A:B

1AJS A:B

1ALK A:B

1AOM A:B

1AOR A:B

1AQ6 A:B

1AUO A:B

1AY0 A:B

1AZY A:B

1BFT A:B

1BSR A:B

1CG2 B:C

1CHM A:B

1CP2 A:B

1D9G A:B

1DAA A:B

1E7N A:B

1F6Y A:B

1FA7 A:B

1FIP A:B

1FRO A:B

1FZR C:D

1HJR B:D

1HSS C:D

1ICW A:B

1IMB A:B

1ISB A:B

1KBA A:B

1LYN A:B

1MJL A:B

1MKA A:B

1MY7 A:B

1NXF A:B

1OAC A:B

1OH0 A:B

1PRE A:B

1QAE A:B

1QIP A:B

1RFB A:B

1SES A:B

1SLT A:B

1SMT A:B

1SOX A:B

1WGI A:B

1XSO A:B

1YFB A:B

2CCY A:B

2RSP A:B

3CSM A:B

5COX A:B

Figure 5 shows the first experiment, which compares the prediction accuracy obtained when a single attribute is used as the feature vector. The attributes are labeled with the letters listed in Table 4.

Table 4. The letter for each attribute

A    The ratio of residues distributed area
B    The number of binding sites
C    The average distance of residue pairs
D    The hydrophobicity of each residue pair
E    The electric attraction of each block

[Figure 5 plot: "The Prediction of Each Attribute on 120 Dataset". x-axis: Attributes (A to E); y-axis: Percentage of accuracy (%), 0 to 80.]

Figure 5. Predicting on each attribute

The second experiment compares the prediction accuracies of different combinations of two attributes in the feature vector. The result is shown in Figure 6. The highest prediction accuracy, 80%, is obtained by combining the average distance of residue pairs with the electric attraction of each block.
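The attribute combinations evaluated in these experiments can be enumerated mechanically. The sketch below uses the letters of Table 4; the function name `attribute_subsets` is hypothetical, and the sketch only lists the combinations without running any prediction.

```python
from itertools import combinations

ATTRIBUTES = "ABCDE"  # attribute letters from Table 4


def attribute_subsets(size):
    """All combinations of `size` attributes, written like 'A+B'."""
    return ["+".join(combo) for combo in combinations(ATTRIBUTES, size)]


pairs = attribute_subsets(2)    # ten pairs, from A+B to D+E
triples = attribute_subsets(3)  # ten triples, from A+B+C to C+D+E
```

Each label would correspond to one feature vector layout that is trained and scored separately.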

[Figure 6 plot: "The Prediction of Two Attributes on 120 Dataset". x-axis: Attributes (A+B, A+C, A+D, A+E, B+C, B+D, B+E, C+D, C+E, D+E); y-axis: Percentage of accuracy (%).]

Figure 6. Combination of two attributes to predict

In the third experiment, we analyze the performance of feature vectors built from sets of three attributes. Figure 7 also includes the combination of all five attributes.

[Figure 7 plot: "The Prediction of Three and All Attributes on 120 Dataset". x-axis: Attributes (the ten combinations of three attributes, A+B+C through C+D+E, and A+B+C+D+E); y-axis: Percentage of accuracy (%).]

Figure 7. Combination of three and all attributes to predict

In Figure 7, the highest accuracy is again 80%. It occurs when the three attributes are the ratio of residues distributed area, the average distance of residue pairs, and the hydrophobicity of each residue pair. After this series of experiments, we find that these attributes are powerful and affect the recognition of proteins.

7. Conclusions and Future works

In this paper, we proposed a new method that projects the 3D structures of protein complexes onto a 2D plane and divides the plane into a grid of 5x5 blocks. To predict whether a protein complex is recognition or non-recognition, we design a feature vector for the protein. There are five attributes in a feature vector. With a dataset of feature vectors, a support vector machine is used to predict protein-protein recognition. The experimental results show that the method performs well: the best prediction accuracy reaches 80% with leave-one-out cross-validation. We expect the accuracy to be higher if a larger dataset is available to train the prediction model. In the future, we will collect more examples for the dataset, try to find a more powerful feature vector to improve the accuracy, and consider prediction with neural networks.

8. References

[1] Bing Wang, Hau San Wong, Peng Chen, Hong-Qiang Wang, and De-Shuang Huang, “Predicting Protein-Protein Interaction Sites Using Radial Basis Function Neural Networks,” International Joint Conference on Neural Networks, 2006, pp. 2325-2330.

[2] Bing Wang, Peng Chen, De-Shuang Huang, Jing-jing Li, Tat-Ming Lok, Michael R. Lyu, “Predicting Protein Interaction Sites from Residue Spatial Sequence Profile and Evolution Rate,” Proceedings of the Federation of European Biochemical Societies Letters, Vol. 580, No. 2, 2006, pp. 380-384.

[3] Chengbang Huang, Faruck Morcos, Simon P. Kanaan, Stefan Wuchty, Danny Z. Chen, and Jesus A. Izaguirre, “Predicting Protein-Protein Interactions from Protein Domains Using a Set Cover Approach,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 4, No. 1, 2007, pp. 78-87.

[4] Feihong Wu, Byron Olson, Drena Dobbs, and Vasant Honavar, “Comparing Kernels For Predicting Protein Binding Sites From Amino Acid Sequence,” International Joint Conference on Neural Networks, 2006, pp. 1612-1616.

[5] Graham R. Smith and Michael J. E. Sternberg, “Prediction of Protein-Protein Interactions by Docking Methods,” Current Opinion in Structural Biology, Vol. 12, No. 1, 2002, pp. 28-35.

[6] Hannes Ponstingl, Kim Henrick, and Janet M. Thornton, “Discriminating Between Homodimeric and Monomeric Proteins in the Crystalline State,” Proteins, Vol. 41, No. 1, 2000, pp. 47-57.

[7] Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T.N. Bhat, Helge Weissig, Ilya N. Shindyalov, and Philip E. Bourne, “The Protein Data Bank,” Nucleic Acids Research, Vol. 28, No. 1, 2000, pp. 235-242.

[8] Hsueh-Fen Juan, Hsuan-Cheng Huang, “An Efficient Mechanism for Prediction of Protein-Ligand Interactions Based on Analysis of Protein Tertiary Substructures,” IEEE Symposium on Bioinformatics and Bioengineering, 2004, pp. 427-433.

[9] Inbal Halperin, Buyong Ma, Haim Wolfson and Ruth Nussinov, “Principles of Docking: an Overview of Search Algorithms and a Guide to Scoring Functions,” Proteins, Vol. 47, No. 4, 2002, pp. 409-443.

[10] James R. Bradford and David R. Westhead, “Improved Prediction of Protein-Protein Binding Sites Using a Support Vector Machines Approach,” Bioinformatics, Vol. 21, No. 8, 2005, pp. 1487-1494.

[11] Julian Mintseris and Zhiping Weng, “Atomic Contact Vectors in Protein-Protein Recognition,” Proteins: Structure, Function, and Genetics, Vol. 53, No. 3, 2003, pp. 629-639.

[12] Loredana Lo Conte, Cyrus Chothia, and Joel Janin, “The Atomic Structure of Protein-Protein Recognition Sites,” Journal of Molecular Biology, Vol. 285, No. 5, 1999, pp. 2177-2198.

[13] Pinak Chakrabarti and Joel Janin, “Dissecting Protein-Protein Recognition Sites,” Proteins, Vol. 47, No. 3, 2002, pp. 334-343.

[14] Piero Fariselli, Florencio Pazos, Alfonso Valencia, and Rita Casadio, “Prediction of Protein-Protein Interaction Sites in Heterocomplexes with Neural Networks,” European Journal of Biochemistry, Vol. 269, No. 5, 2002, pp. 1356-1361.

[15] Shinsuke Dohkan, Asako Koike, and Toshihisa Takagi, “Prediction of Protein-Protein Interactions Using Support Vector Machines,” IEEE Symposium on Bioinformatics and Bioengineering, 2004, pp. 576-583.

[16] Susan Jones and Janet M. Thornton, “Principles of protein-protein interactions,” Proceedings of the National Academy of Sciences of the United States of America, Vol. 93, No. 1, 1996, pp. 13-20.

[17] Tapan Patel, Manoj Pillay, Rahul Jawa and Li Liao, “Information of Binding Sites Improves Prediction of Protein-Protein Interaction,” International Conference on Machine Learning and Applications, 2006, pp. 205-212.

[18] Zahra Nafar, Ashkan Golshani, “Data Mining Methods for Protein-Protein Interactions,” Canadian Conference on Electrical and Computer Engineering, 2006, pp. 991-994.

[19] Biomolecular Object Network Databank, http://bond.unleashedinformatics.com

[20] LIBSVM, http://www.csie.ntu.edu.tw/~cjlin/libsvm