
Vol. 10 no. 5 1994 Pages 495-500

An Excel spreadsheet computer program combining algorithms for prediction of protein structural characteristics

Josep Clotet1, Juan Cedano2 and Enrique Querol1,2,3

Abstract

A program running on personal computers (either Apple Macintosh or PC, using Excel worksheets) for the prediction of some protein structural characteristics is reported. The program runs according to the Chou and Fasman algorithm, with some modifications, for secondary structure prediction. The program also incorporates several complementary analyses for secondary structure prediction to help the user in the decision-making process: rules for amino acid preferences in the N-cap and C-cap of α-helices; prediction of the protein structural class; and a search for sequential motifs related to secondary structure. Additional algorithms performed by the program are: prediction of domain boundaries, prediction of loops, prediction of the state of cysteines (reduced or in a disulfide bridge), hydropathy profiles according to Kyte and Doolittle and to Hopp and Woods, and a flexibility plot according to Karplus and Schulz.

Introduction

The increasing number of DNA and protein sequences entering databases makes the use of algorithms to predict protein conformation necessary. Generating, from first principles, structures for proteins that are too uncooperative to crystallize or too large for NMR still requires methods of analysis and prediction at the sequence level. Although a number of computer approaches, such as modeling, neural networks and empirical predictive algorithms (e.g. those of Chou and Fasman, 1978; Garnier et al., 1978), are still considered by researchers, and are even offered in sophisticated program packages such as those from GCG (Genetics Computer Group, University of Wisconsin, Madison, WI), some of these algorithms are easily performed on personal computers (Fasman, 1989).

New approaches for secondary structure prediction (for a review see Thornton et al., 1991; Russell and Barton, 1993) make use of the similarity among members of protein families to distinguish loop regions from regular secondary structures (Zvelebil et al., 1987; Rost and Sander, 1993). The power of the method lies more in considering the parsing of the sequence than in the algorithm used for secondary structure prediction. New methods based on neural networks are being developed, but they perform at their best when a family of sequences to compare also exists (Rost and Sander, 1993).

We have previously reported a program to perform the Chou-Fasman algorithm (Parrilla et al., 1986), but it runs on the now outdated Apple II range of computers. The present program, using Excel worksheets, has several refinements with respect to the original procedure of Chou and Fasman (1978). It overpredicts turns less, and it uses Boolean alternatives to reach a decision when secondary structures overlap. In addition, some useful algorithms have been incorporated: hydropathic profiles (Hopp and Woods, 1981; Kyte and Doolittle, 1982) and several algorithms to help the analysis of conformation. These algorithms are: (i) rules for amino acid preferences in the N-cap and C-cap of α-helices (Richardson and Richardson, 1988); (ii) an algorithm to predict the structural class of the protein (Zhang and Chou, 1992); (iii) a search for sequential motifs related to secondary structure (Rooman and Wodak, 1988); (iv) an algorithm based upon the method of Vonderviszt and Simon (1986) for prediction of domain boundaries; (v) an algorithm based upon the search for loops reported by Leszczynski and Rose (1986); (vi) an algorithm based upon the analysis of the reduced/disulfide-bridge state of cysteines by Muskal et al. (1990); and (vii) the flexibility algorithm of Karplus and Schulz (1985). Algorithms (i)-(v) may help the user to make decisions, or to modify those made automatically by the program.

System and methods

1Departament de Bioquimica i Biologia Molecular and 2Institut de Biologia Fonamental V. Villar Palasi, Universitat Autonoma de Barcelona, 08193 Bellaterra, Barcelona, Spain.

3To whom correspondence should be addressed at: Institut de Biologia Fonamental, Universitat Autonoma de Barcelona, 08193 Bellaterra, Barcelona, Spain.

© Oxford University Press

The minimal hardware and software requirements are an Apple Macintosh computer with 2 Mbytes of RAM and the Excel spreadsheet program, version 3.0 or, ideally, 4.0. A corresponding version is available for PCs running Windows.


Downloaded from http://bioinformatics.oxfordjournals.org/ at University of Iowa Libraries/Serials Acquisitions on July 26, 2015


J.Clotet, J.Cedano and E.Querol

Algorithm

Chou-Fasman secondary structure prediction


Helix cap-preference rules

The program incorporates the empirical rules of amino acid residue preferences for the N- and C-cap ends of helices, according to Richardson and Richardson (1988).

Structural motifs

The program incorporates sequential motifs related to specific secondary structures, as reported by Rooman and Wodak (1988), to help the user to make decisions.

Protein structural class

An algorithm to predict the structural class of the protein according to the method of Zhang and Chou (1992) has been implemented. This method correlates the amino acid composition of the protein with the structural class, assigning it to one of four classes: α, β, α + β and α/β. The predicted class is shown in a small table at the bottom of the spreadsheet.

Domain boundaries

An algorithm for prediction of the domain boundaries of the amino acid sequence has been implemented. The minimum values are calculated as reported by Vonderviszt and Simon (1986), but the domain borders have been defined in a different way: (i) when two minima are very close to each other, the program chooses the one with the minimum value; (ii) when a minimum coincides with a regular secondary structure it is discarded (not considered) as a boundary; and (iii) only sequences > 50 amino acid residues are accepted as a domain region. The predicted domains are shown in a small table at the bottom of the spreadsheet, giving the residue numbers that define a border, or 'single domain' when it is a single-domain protein.

Loops

Once the α-helix, β-strand and turn segments have been predicted, the program starts a search for loops outside the regions previously predicted as α-helix or β-strand. The


The criteria for the search for nucleation peptides, helix or strand propagation and confirmation are basically as reported by Chou and Fasman (1978). However, a higher threshold has been chosen for β-turn localization to minimize overprediction. Turns have been defined by having ⟨Pt⟩ > 1, ⟨Pα⟩ < ⟨Pt⟩ > ⟨Pβ⟩ and pt > 1.00 × 10^-4. Where a series of turns is found to overlap, the assignment is made to the turn with the higher local ⟨Pt⟩ value. Our version improves the matching of turns with respect to X-ray data.

At this stage the program has tentatively defined regions of helix, sheet and turn, and now begins the analysis of regions that have more than one structure assigned to them ('overlaps'). This is the most critical step in predictive algorithms. The program starts with the analysis of the helix-strand overlaps. It executes the procedures PROBABILITY, BOUNDARIES, LENGTH and STRAND, and in order to reach a decision the program assigns each of them a different value: positive for helix, negative for strand, and zero for some situations found in the LENGTH and STRAND procedures. This method gives a Boolean answer, '.true.' or '.false.'.

The program calculates ⟨Pα⟩ and ⟨Pβ⟩ in the overlapping region and initially assigns helix or strand according to the highest ⟨P⟩. If the difference between ⟨Pα⟩ and ⟨Pβ⟩ is > 0.20, it takes the function PROBABILITY as true for helix (or strand) and assigns a value of +4 (helix) or -4 (strand). If the difference is within the interval ±0.20, the region is assigned a value of +2 (helix) or -2 (strand).

The program then enters the analysis of BOUNDARIES to solve overlaps between helices and strands with similar ⟨P⟩. We have used the frequencies at which three residues are found at the beginning and end of helices and strands according to Chou and Fasman (1978). The program calculates boundaries for each region in both structures and decides which one has the higher value, assigning +2 for helix and -2 for strand. All the above values (+4, -4; +2, -2) and the intervals have been determined by trial and error, using the PDB data on the structure of 30 proteins.

The lengths of the helix and strand structures are evaluated as follows: if the helix is longer than the strand, the helix is chosen by assigning +1 to the LENGTH function. The STRAND procedure checks whether there is a turn close to the tentative strand. It takes five residues from the sheet as the limit distance to check for the presence of a turn. If there is a tentative turn, the program checks for a symmetrical strand, as β-strands forming part of a β-sheet are very frequent in the supersecondary structure of proteins. It assigns a value of -2 to this function.

Finally, the program solves the overlapping of helices or strands with turns: it makes use of the parameter ⟨Pt⟩ (the conformational potential based on all four positions of a reverse turn, according to Chou and Fasman), assigning the structure with the highest ⟨Pα⟩, ⟨Pβ⟩ or ⟨Pt⟩. It performs a third evaluation of the remaining regions. The program then performs a second search for the presence of turns, with pt > 0.75 × 10^-4, in regions assigned as random coil. This analysis has to be performed because the overlapping procedure can change or delete some of the secondary structure previously assigned.
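The helix-strand overlap scoring described above can be illustrated with a short Python sketch. It is not the authors' Excel implementation: the function and argument names are hypothetical, and several details are assumptions made only for illustration, namely that the four procedure values are summed to reach the decision, that LENGTH contributes 0 when the helix is not longer than the strand, and that ties contribute 0.

```python
def sign(x: float) -> int:
    """Return +1, -1 or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def resolve_helix_strand_overlap(p_alpha, p_beta,
                                 helix_boundary, strand_boundary,
                                 helix_len, strand_len,
                                 turn_and_partner_strand):
    """Score one helix-strand overlap; positive favours helix, negative strand.

    p_alpha, p_beta: average propensities <Pa>, <Pb> in the overlapping region.
    helix_boundary, strand_boundary: boundary scores of the two candidates.
    turn_and_partner_strand: True when a tentative turn lies within five
        residues of the strand and a symmetrical strand is found.
    """
    # PROBABILITY: +/-4 if the averages differ by more than 0.20, else +/-2.
    diff = p_alpha - p_beta
    probability = sign(diff) * (4 if abs(diff) > 0.20 else 2)

    # BOUNDARIES: +2 if the helix boundaries score higher, -2 if the strand's do.
    boundaries = 2 * sign(helix_boundary - strand_boundary)

    # LENGTH: +1 if the helix is longer (0 otherwise is an assumption).
    length = 1 if helix_len > strand_len else 0

    # STRAND: -2 when a nearby turn and a symmetrical strand are present.
    strand = -2 if turn_and_partner_strand else 0

    scores = {"PROBABILITY": probability, "BOUNDARIES": boundaries,
              "LENGTH": length, "STRAND": strand}
    total = sum(scores.values())  # summing the values is an assumption
    if total > 0:
        call = "helix"
    elif total < 0:
        call = "strand"
    else:
        call = "unresolved"
    return call, scores
```

For example, `resolve_helix_strand_overlap(1.30, 1.05, 1.10, 1.00, 10, 6, False)` favours the helix, because PROBABILITY, BOUNDARIES and LENGTH are all positive and STRAND is zero.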

Spreadsheet for prediction of protein structure

Disulfide bridges

An algorithm to analyze the covalent state of the cysteine residues has been implemented. The program uses the data reported by Muskal et al. (1990), who predict free and half-cysteines from the amino acid environment of these residues. However, as Muskal et al. did not provide a simple algorithm for the prediction, we have implemented one based upon their work. It searches for cysteines and, upon finding one, averages the values of the seven adjacent residues, using the data tabulated by Muskal et al. (1990). A value >0 indicates a half-cysteine (involved in a disulfide bridge). If the resulting number of half-cysteines is odd, the half-cysteine predicted with the lowest value is rejected. By trial and error we have found a threshold value of 0.08 to yield the best prediction.

Profiles

The program plots two hydropathic profiles, according to Kyte and Doolittle (1982) and Hopp and Woods (1981). The user can select the window length (a 21-residue and a 5-residue window are the program default values for the Kyte-Doolittle and Hopp-Woods profiles respectively). The user can introduce a new profile algorithm or a new hydrophobicity table. Finally, the program produces the flexibility plot according to Karplus and Schulz (1985).

Implementation

Our program is user-friendly, interactive and flexible, and it permits the user to introduce new tables of parameters. Although it runs automatically, it does not hide much of the decision-making process from the user, who can inspect each of the intermediate predictions. It incorporates the additional criteria described above to enable the user to reach a decision different from the automatic prediction. When the user opens the program (file named 'PROTEIN PREDICTION') he or she will find a dialogue window indicating each step. The protein sequence has to be entered, or imported from other files, using the one-letter amino acid code, in cells of a typical Excel spreadsheet box (inside a document named 'SequencesW). For users not familiar with Excel, it should be mentioned that a spreadsheet cell takes up to 255 characters; thus, if the sequence exceeds that length, additional cells have to be used. Nevertheless, in order to visualize the whole sequence in the cell charts it is advisable to enter the sequence as 50-character strings. At the start of the prediction, a window asks whether or not the user wishes to see the screen activity during the calculations. The running time for a protein of 300 amino acid residues is ~30 min for the whole set of predictions with the screen activity shown, and about 20% faster if the screen activity is not shown.

Discussion

Figure 1 shows three partial views from three outputs of the program (for carboxypeptidase A, myohemerythrin and superoxide dismutase respectively). Columns G-I show the Chou-Fasman prediction, without solving the overlap. Columns N-P show, with the overlapping solved, the final prediction in bold characters. Columns D-F and K-M show (as +, ++ and +++) the additional criteria (according to Richardson and Richardson, 1988) to help the user make a decision. Columns D and K correspond to an α-helix, columns E and L to a β-strand and columns F and M to a β-turn. Three + symbols represent the maximum probability of a specific type of secondary structure being present. The program evaluates the presence of symbols in columns D-F against the previous prediction, and then reinforces the final prediction. For

Table I. Comparative accuracy (%) between the results reported elsewhere by Chou and Fasman and those obtained with this program

Protein: Adenylate kinase, BPTI, β-Glucanase, T4 lysozyme, CEW lysozyme, Myohemerythrin, Staphylococcal nuclease, Papain, Ribonuclease S, Superoxide dismutase, Thioredoxin, α-Chymotrypsin, Subtilisin BPN', Thermolysin

Chou and Fasman: 66a, 86, 66, gg, 66, 62, 80, 89, 66, 77, 78, 74, 72

This program: 75a, 88, 70, 80, 73, 53, 74, 62, 68, 71, 76, 65, 73, 60

a Data are percentages relative to X-ray data.
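As an illustration of the hydropathy profiles described under Profiles, the following Python sketch computes a sliding-window average with the published Kyte and Doolittle (1982) scale and the 21-residue default window mentioned in the text. It is a minimal sketch, not the spreadsheet's own formulae: the function name is hypothetical, and assigning each average to the central residue and skipping incomplete windows at the chain ends are assumptions.

```python
# Kyte and Doolittle (1982) hydropathy values, one-letter amino acid code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=21, scale=KYTE_DOOLITTLE):
    """Return (central residue number, mean hydropathy) for each full window.

    Residue numbers are 1-based. Letters missing from the scale raise KeyError.
    Passing another table as `scale` mirrors the option of supplying a new
    hydrophobicity table.
    """
    values = [scale[aa] for aa in sequence.upper()]
    if window < 1 or window > len(values):
        return []
    profile = []
    for start in range(len(values) - window + 1):
        mean = sum(values[start:start + window]) / window
        center = start + window // 2 + 1  # 1-based index of the middle residue
        profile.append((center, mean))
    return profile
```

Calling `hydropathy_profile(seq, window=5, scale=other_table)` with a different table (for instance the Hopp and Woods values, not reproduced here) would give the second profile in the same way.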



presence of turns inside loop regions is permitted. Using the data tabulated by Leszczynski and Rose (1986), we have implemented a search algorithm that looks for 6-residue loop-nucleation segments (the minimal size reported by Leszczynski and Rose, 1986) whose average value exceeds a threshold of 1.12 (the value that we found to match best the data reported by Leszczynski and Rose). Additional residues with a tabulated value of 1.0 or more are then incorporated at the N- and C-termini of the nucleation segment. Propagation continues until a residue is found with P(loop) < 1.00, or up to the nearest α or β structure, or until the loop reaches a length of 16 residues (the upper limit according to Leszczynski and Rose, 1986).
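The loop search just described (nucleation, then propagation) can be sketched in Python as follows. This is only an illustration under stated assumptions: the per-residue loop values are passed in by the caller (the Leszczynski and Rose table is not reproduced here), the function name and the left-to-right scan are hypothetical, and extending one residue at a time alternately at the N- and C-terminal sides is an assumption about a detail the text does not specify.

```python
def predict_loops(values, blocked, nucleus_len=6, nucleus_threshold=1.12,
                  extend_threshold=1.00, max_len=16):
    """Return loops as (start, end) pairs, 0-based and inclusive.

    values:  per-residue loop propensities (caller-supplied table lookup).
    blocked: per-residue booleans, True where helix or strand was predicted.
    """
    n = len(values)
    loops = []
    i = 0
    while i + nucleus_len <= n:
        window = range(i, i + nucleus_len)
        mean = sum(values[j] for j in window) / nucleus_len
        if any(blocked[j] for j in window) or mean <= nucleus_threshold:
            i += 1
            continue
        start, end = i, i + nucleus_len - 1
        # Do not grow back into a loop that was already assigned.
        floor = loops[-1][1] + 1 if loops else 0
        grew = True
        while grew and end - start + 1 < max_len:
            grew = False
            # N-terminal extension.
            if (start > floor and not blocked[start - 1]
                    and values[start - 1] >= extend_threshold):
                start -= 1
                grew = True
            # C-terminal extension, still respecting the 16-residue limit.
            if (end - start + 1 < max_len and end + 1 < n
                    and not blocked[end + 1]
                    and values[end + 1] >= extend_threshold):
                end += 1
                grew = True
        loops.append((start, end))
        i = end + 1
    return loops
```

Turn assignments are deliberately not part of `blocked`, since the presence of turns inside loop regions is permitted.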



without any bias or hand refinement by the authors. The detailed predictions are not shown.

To check the program performance in predicting protein structural class we have analyzed the same set of proteins reported by Zhang and Chou (1992), namely chicken egg-white lysozyme, myohemerythrin, bovine pancreatic trypsin inhibitor (BPTI), papain, elastase, adenylate kinase, thermolysin, alcohol dehydrogenase, ferredoxin, α-chymotrypsin, lactate dehydrogenase, myoglobin, staphylococcal nuclease, superoxide dismutase, thioredoxin, cytochrome c550, rubredoxin and carboxypeptidase A. Our program yields the same results as those reported by Zhang and Chou, scoring 70% accuracy.

To check the performance in predicting the state (reduced or disulfide bridge) of cysteines using our program, we separated a set of proteins with and without disulfide bridges from the 1992 release of the Brookhaven Protein Data Bank. The set used was: adenylate kinase, alcohol dehydrogenase, BPTI, carboxypeptidase A, α-chymotrypsin, cytochrome c550, elastase, ferredoxin, lactate dehydrogenase, egg-white lysozyme, T4 lysozyme, myohemerythrin, myoglobin, staphylococcal nuclease, papain, rubredoxin, superoxide dismutase, thermolysin and thioredoxin. The global average score for the set of proteins was 75%.

Finally, to check the program performance in predicting domain boundaries, we used a larger set of proteins than that reported by Vonderviszt and Simon (1986). The set of proteins used was: alcohol dehydrogenase, BPTI, carbonic anhydrase, carboxypeptidase A, α-chymotrypsin, cytochrome c550, glyceraldehyde-3-phosphate dehydrogenase, hemoglobin β-chain, IgG heavy chain, lactate dehydrogenase, egg-white lysozyme, T4 lysozyme, myohemerythrin, myoglobin, staphylococcal nuclease, papain, ribonuclease A, subtilisin, superoxide dismutase, thermolysin and thioredoxin. Domains are the most difficult structures to predict. As described in the Algorithm section, we have modified the method so that a sequence < 50 residues in length is not considered a domain. The program also discards predicted domain boundaries when they overlap a regular secondary structure. We have considered a good prediction to be a domain boundary located within ±15 residues of the experimental position (in fact, real domains are not easily

Fig. 1. Output of the program showing partial predictions of three proteins: (a) carboxypeptidase A; (b) myohemerythrin; (c) superoxide dismutase. Column A numbers the amino acid residues. Columns B and C present the amino acid residues in one-letter and three-letter codes respectively. The state of cysteines is shown beside the one-letter symbol of the amino acid (column B) in a wider cell, as bold 'C-C' for the disulfide-bridge state and as 'C' for the reduced state. Columns D-F and K-M show, as +, ++ or +++, the additional criteria used by the program (rules for amino acid preferences in the N- and C-cap of α-helices: Richardson and Richardson, 1988) to help the user to make a decision. Columns D and K correspond to an α-helix, E and L to a β-strand and F and M to a β-turn. Columns G-I show the previous Chou-Fasman prediction, without solving the overlapping. Columns N-Q (α-helix, β-strand, β-turn and loop respectively) show the final prediction, with the overlapping solved. Columns S and T indicate sequential motifs related to secondary structures, as reported by Rooman and Wodak (1988). Columns U-W depict the probabilities according to the Chou-Fasman procedure. Columns X and Y depict the Boolean decisions. At the bottom of the spreadsheet, two small tables indicate the domain prediction and the structural class of the protein.



example, in Figure 1(a), the α-helix from residues 19-28 is reinforced while the α-helix from residues 12-17 disappears in the final prediction. In Figure 1(b), the two pluses in column D (α-helix) help the user to decide in favor of a helix between residues 19 and 30, as the program has been unable to solve the overlap with a β-strand from residues 28-34. Loops are indicated as 'loop' in column Q (Figure 1c). The state of cysteines is shown beside the one-letter symbol of the amino acid (column B) in a wider cell, as bold 'C-C' for the disulfide-bridge state (see residue 55 in Figure 1c), and as 'C' for the reduced state. Columns S and T indicate the existence of sequential motifs related to secondary structure (according to Rooman and Wodak, 1988). Columns U-W depict probabilities according to the Chou-Fasman procedure, for the user to consider other decisions. The last columns, 'Boolean decisions', depict the Boolean values, calculated as described in the Algorithm section. At the bottom of the spreadsheet, two small tables indicate the domain prediction and the structural class of the protein.

The program allows loops to overlap with turns, but not with regular secondary structures. The prediction of loops by the program scores 75% accuracy for the list of loops reported by Leszczynski and Rose (1986). Loop regions are very valuable for antigenic determinant prediction and for protein engineering (they are appropriate targets for site-directed mutagenesis because they are often involved in protein functions). Residue replacements performed on them are generally conformationally neutral.

The accuracy of the secondary structure predictions was determined for the same set of proteins used and reported by Chou and Fasman (1978) and Prevelige and Fasman (1989). To this list of proteins has been added a β-glucanase, whose structure we predicted (Querol et al., 1992) before the X-ray structure of a highly homologous one became known (Keitel et al., 1993). Accuracy has been defined as the percentage of correctly assigned residues (Prevelige and Fasman, 1989). Table I compares the prediction accuracy of our program with that reported for the procedures of Chou and Fasman (1978) and Prevelige and Fasman (1989). It is noteworthy that all of our results presented in the tables have been obtained using only the automatic output of the program


Acknowledgements

This work was supported by grants from CICYT (Ministerio de Educacion y Ciencia) BIO91-0477 and BIO94-0912-CO2-01 to E.Q.; J.C. is the recipient of a pre-doctoral fellowship from CIRIT, Generalitat de Catalunya.

References

Chou,P.Y. and Fasman,G.D. (1978) Prediction of protein secondary structure from their amino acid sequence. Adv. Enzymol., 47, 45-148.
Fasman,G.D. (1989) The development of the prediction of protein structure. In Fasman,G.D. (ed.), Prediction of Protein Structure and the Principles of Protein Conformation. Plenum Press, New York, pp. 193-316.
Garnier,J., Osguthorpe,D.J. and Robson,B. (1978) Analysis of the accuracy and implications of simple methods for predicting secondary structure of globular proteins. J. Mol. Biol., 120, 97-120.
Hopp,T.P. and Woods,K.R. (1981) Prediction of protein antigenic determinants from amino acid sequence. Proc. Natl. Acad. Sci. USA, 78, 3824-3828.
Juncosa,M., Pons,J., Dot,T., Querol,E. and Planas,A. (1994) Identification of active site carboxylic residues in Bacillus licheniformis endo-1,3-1,4-β-D-glucan 4-glucanohydrolase by site-directed mutagenesis. J. Biol. Chem., in press.
Karplus,P.A. and Schulz,G.E. (1985) Prediction of chain flexibility in proteins. Naturwissenschaften, 72, 212-213.
Keitel,T., Simon,O., Borriss,R. and Heinemann,U. (1993) Molecular and active site structure of a Bacillus β-glucanase (1,3-1,4). Proc. Natl. Acad. Sci. USA, 90, 5287-5291.
Kyte,J. and Doolittle,R.F. (1982) A simple method for displaying the hydropathic character of a protein. J. Mol. Biol., 157, 105-132.
Leszczynski,F.J. and Rose,G.D. (1986) Loops in globular proteins, a novel category of secondary structure. Science, 234, 849-855.
Muskal,S.M., Holbrook,S.R. and Kim,S.H. (1990) Prediction of the disulfide-bonding state of cysteine in proteins. Prot. Engng., 3, 667-672.
Parrilla,A., Domenech,A. and Querol,E. (1986) A Pascal microcomputer program for prediction of protein secondary structure and hydropathic segments. Comput. Applic. Biosci., 2, 211-215.
Planas,A., Juncosa,M., Lloberas,J. and Querol,E. (1992) Essential catalytic role of Glu134 in endo-1,3-1,4-β-D-glucan 4-glucanohydrolase from B.licheniformis as determined by site-directed mutagenesis. FEBS Lett., 308, 141-145.
Prevelige,P. and Fasman,G.D. (1989) Chou-Fasman prediction of secondary structure of proteins: the Chou-Fasman-Prevelige algorithm. In Fasman,G.D. (ed.), Prediction of Protein Structure and the Principles of Protein Conformation. Plenum Press, New York, pp. 391-416.
Querol,E., Padrós,E., Planas,A., Juncosa,M. and Lloberas,J. (1992) Prediction and Fourier Transform infrared spectroscopy estimation of the secondary structure of a Bacillus licheniformis endo-β-1,3-1,4-D-glucanase. Biochem. Biophys. Res. Commun., 184, 612-617.
Richardson,J.S. and Richardson,D. (1988) Amino acid preferences for specific location at the end of α-helices. Science, 240, 1648-1652.
Rooman,M.J. and Wodak,S.J. (1988) Identification of predictive sequence motifs limited by protein structure data base size. Nature, 335, 45-49.
Rose,G.D. (1978) Prediction of chain turns in globular proteins on a hydrophobic basis. Nature, 272, 586-590.
Rost,B. and Sander,C. (1993) Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232, 584-599.
Russell,R.B. and Barton,G.J. (1993) The limits of protein secondary structure prediction accuracy from multiple sequence alignment. J. Mol. Biol., 234, 951-957.
Thornton,J.M., Flores,T.P., Jones,D.T. and Swindells,M.B. (1991) Prediction of progress at last. Nature, 354, 105-106.
Vonderviszt,F. and Simon,I. (1986) A possible way for prediction of domain boundaries in globular proteins from amino acid sequence. Biochem. Biophys. Res. Commun., 139, 11-17.
Zhang,Ch. and Chou,K. (1992) An optimization approach to predicting protein structural class from amino acid composition. Prot. Sci., 1, 401-408.
Zvelebil,M.J., Barton,G.J., Taylor,W.R. and Sternberg,M.J.E. (1987) Prediction of protein secondary structure and active sites using the alignment of homologous sequences. J. Mol. Biol., 195, 957-961.


delineated). These modifications lowered the overprediction of domains. The prediction accuracy for the above set of proteins was 45%.

What is the utility of secondary structure predictions? The ultimate challenge in predicting secondary structures is to recognize the correct chain folding. This goal, however, cannot be achieved with existing methods. Since tertiary topology is mostly a collection of secondary structures connected by loop regions, its prediction will probably have to be preceded by knowledge of secondary structures. Secondary structure predictions can provide hypotheses for experimental work, site-directed mutagenesis, etc. (Planas et al., 1992; Juncosa et al., 1994).

The program is free. Copies can be obtained from E. Querol by sending a 3.5 in. diskette. Please indicate whether you want the Apple or PC version and the desired screen-size display.