Rapid identification of proteins

Proc. Natl. Acad. Sci. USA Vol. 90, pp. 5138-5142, June 1993 Biochemistry

Rapid identification of proteins (amino acid composition/molecular weight/isoelectric point/protein analysis)

GERRY SHAW University of Florida College of Medicine, Department of Neuroscience, JHMHC Box-J100244, Gainesville, FL 32610

Communicated by George K. Davis, February 9, 1993

introduced since protein blotting is normally performed on proteins larger than this. MAKE-LIB simultaneously writes the accession numbers and identifiers of each selected protein sequentially to the NAMES file. Amino acid composition determinations are of differing accuracy and reproducibility for different amino acids (3-5). We therefore made multiple composition determinations from blots of about 5 µg of each of four different pure proteins of known primary sequence and found that each amino acid was quantified with a characteristic and quite repeatable error (Table 1). The variance and standard deviation (SD) of the quantities obtained for each amino acid were then calculated as described in the legend to Table 1. The values obtained are in line with those reported by other workers (5) but show that different facilities have rather different accuracies and reproducibilities for each amino acid. To make optimum use of this method, users should probably determine their own values for Table 1. The amino acid composition data obtained on an unknown protein are entered into the FINDER program, each quantity is corrected by the appropriate error factor shown in Table 1, the mole percentage is calculated, and the corrected values are displayed on the monitor. Table 1 shows that Gly, Met, and Pro are determined particularly unreliably, as shown by the high variances and SDs. Val might sometimes be less reliably determined than suggested by Table 1, since it is poorly recovered from acid hydrolysates of proteins rich in hydrophobic sequences (6). Both MAKE-LIB and FINDER therefore determine mole percentage values initially excluding these 4 amino acids (Gly, Met, Pro, and Val). The mole percentage values of these 4 amino acids are then calculated relative to the total of 100% for the other 12. Gross errors in any or all of these 4 amino acids therefore cannot affect the remaining composition data.
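The mole percentage calculation described above can be sketched as follows. This is a minimal illustration in Python, not the author's C code: the function and variable names are hypothetical, and dividing each quantity by its Table 1 factor is an assumption, since the text says only that each quantity is "corrected by" the factor.

```python
# The 16 amino acid classes quantified routinely (Asp+Asn = Asx, Glu+Gln = Glx).
CLASSES = ["Asx", "Thr", "Ser", "Glx", "Pro", "Gly", "Ala", "Val",
           "Met", "Ile", "Leu", "Tyr", "Phe", "His", "Lys", "Arg"]

# Classes kept out of the 100% total because they are determined less reliably.
UNRELIABLE = {"Gly", "Met", "Pro", "Val"}

# One-letter codes mapped to classes; Trp (W) and Cys (C) are not counted.
ONE_LETTER = {"D": "Asx", "N": "Asx", "T": "Thr", "S": "Ser", "E": "Glx",
              "Q": "Glx", "P": "Pro", "G": "Gly", "A": "Ala", "V": "Val",
              "M": "Met", "I": "Ile", "L": "Leu", "Y": "Tyr", "F": "Phe",
              "H": "His", "K": "Lys", "R": "Arg"}


def count_residues(sequence):
    """Count residues of a database sequence in each of the 16 classes."""
    counts = {name: 0.0 for name in CLASSES}
    for residue in sequence.upper():
        name = ONE_LETTER.get(residue)
        if name is not None:  # skips Trp, Cys, and any unknown symbol
            counts[name] += 1.0
    return counts


def correct(raw, factors):
    """Correct measured quantities for systematic over- or underquantitation.

    ASSUMPTION: the factor is observed/expected, so correction is a division.
    """
    return {name: raw[name] / factors[name] for name in CLASSES}


def mole_percentages(quantities):
    """Express all 16 classes relative to a 100% total of the 12 reliable ones."""
    reliable_total = sum(quantities[n] for n in CLASSES if n not in UNRELIABLE)
    if reliable_total <= 0:
        raise ValueError("no reliable amino acids present")
    return {n: 100.0 * quantities[n] / reliable_total for n in CLASSES}
```

Because the denominator contains only the 12 reliable classes, a gross error in Gly, Met, Pro, or Val changes only that amino acid's own value, which is the property the text relies on.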
FINDER compares the 16 calculated values with the corresponding values in each array in the SCORES file. The program gives a score of 2 if the experimental value for a particular amino acid and the corresponding database value are within 1.5 SDs of each other, 1 if they are within 3 SDs, and 0 if they are further than 3 SDs apart (Table 1). The maximum possible score is therefore 32. Since amino acids present in very small percentage amounts are determined less accurately, experimentally determined values below 3 mol % are treated, for the purposes of calculating the range of variability tolerated, as if they were 3 mol %. This scoring method is one of more than 50 tested and is a good compromise between speed and sensitivity. Entries in the SCORES file which have a score higher than a preset value (usually 20 points for a preliminary run) are displayed on the monitor screen along with the calculated mean score, SD of scores (see Table 2), and a histogram of score distribution. The programs and data files run on IBM PC-compatible computers and occupy a total of less than 3.4 megabytes of disk space. Programs were written in the C language and run in between 10 sec and 1.5 min, depending on the type of computer used.
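The scoring rule can be illustrated with a short sketch. It is written in Python for brevity and is not the author's C implementation; the names `score_pair` and `search` are hypothetical, and library entries are represented as plain dictionaries rather than 18-element arrays. It also assumes that the SDs of Table 1 are relative values (they were derived from data standardized to a mean of 1.00), so the tolerated range is taken as the SD multiplied by the experimental mole percentage, with the 3 mol % floor applied.

```python
import statistics

CLASSES = ["Asx", "Thr", "Ser", "Glx", "Pro", "Gly", "Ala", "Val",
           "Met", "Ile", "Leu", "Tyr", "Phe", "His", "Lys", "Arg"]

# Relative SDs taken from Table 1.
SD = {"Asx": 0.0563, "Thr": 0.0273, "Ser": 0.0759, "Glx": 0.0528,
      "Pro": 0.3048, "Gly": 0.0978, "Ala": 0.0169, "Val": 0.0627,
      "Met": 0.1722, "Ile": 0.0464, "Leu": 0.0359, "Tyr": 0.0377,
      "Phe": 0.0488, "His": 0.0778, "Lys": 0.0662, "Arg": 0.0784}


def score_pair(experimental, database, floor=3.0):
    """Score one database entry against experimental mole percentages (0-32)."""
    total = 0
    for name in CLASSES:
        # ASSUMPTION: one SD in mol % = relative SD x experimental value,
        # treating experimental values below 3 mol % as if they were 3 mol %.
        one_sd = SD[name] * max(experimental[name], floor)
        difference = abs(experimental[name] - database[name])
        if difference <= 1.5 * one_sd:
            total += 2
        elif difference <= 3.0 * one_sd:
            total += 1
    return total


def search(experimental, library, threshold=20):
    """Return (score, index) hits above threshold, plus mean and SD of all scores."""
    scores = [score_pair(experimental, entry) for entry in library]
    hits = sorted(((s, i) for i, s in enumerate(scores) if s > threshold),
                  reverse=True)
    return hits, statistics.mean(scores), statistics.pstdev(scores)
```

A well-determined amino acid such as Ala (relative SD 0.0169) must agree closely to earn points, whereas Pro (relative SD 0.3048) scores even with a loose fit, which is the weighting the method is designed to achieve.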

ABSTRACT The amino acid composition, molecular weight, and isoelectric point of a protein can all be easily and economically determined by current electrophoretic techniques. A method which uses such easily obtained data to identify proteins is described. A computer program first corrects for systematic errors in amino acid quantitation and then searches the current sequence database for proteins with amino acid compositions similar to the corrected values, taking into account the reliability of determination of each amino acid. The program also provides the calculated molecular weight, isoelectric point, and name of each candidate, providing three further independent criteria for protein identification. The program is surprisingly sensitive, and the composition data alone, if of good quality, usually suggest the correct protein as a strong candidate if it or a close homologue is present in the database. Further studies show that proteins in the current database have amino acid compositions distinct enough to allow this method to be generally applicable. The method is a quick and cost-effective first step in protein characterization and should become increasingly useful as the number of fully sequenced proteins continues to rise.

Current PAGE and blotting methods allow the convenient and inexpensive determination of amino acid composition of a protein, as well as molecular weight estimation and isoelectric point measurement (1). These parameters are much easier and cheaper to determine than primary amino acid sequence, the usual method of characterizing unidentified proteins. Protein sequencing is a fairly complex multistep procedure requiring a considerable amount of time and resources, and technical problems frequently occur (1, 2). Here is described an efficient method for protein identification based on the more easily determined protein characteristics.

MATERIALS AND METHODS Programs. The protein sequence database was the Protein Identification Resource (PIR) Release 32, containing 40,287 entries. Fig. 1 shows in diagrammatic form how the programs work. MAKE-LIB counts the number of Asx, Thr, Ser, Glx, Pro, Gly, Ala, Val, Met, Ile, Leu, Tyr, Phe, His, Lys, and Arg residues in each protein. These are the amino acids routinely quantified, counting Asp and Asn together (Asx) and Glu and Gln together (Glx), and not counting Trp and Cys. The percentages of these amino acids (calculated as described below) are put into the first 16 elements of an 18-element array, the last two elements being loaded with the calculated molecular mass and calculated isoelectric point of the protein. If the calculated molecular mass is >5 kDa, the entire 18-element array is saved on disk as one entry of a file called SCORES, the process being repeated for all proteins in the database. The 5-kDa cutoff removes 4786 proteins and was

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.

Abbreviations: PVDF, poly(vinylidene difluoride); PIR, Protein Identification Resource; GFAP, glial fibrillary acidic protein.


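[Fig. 1 diagram: the protein sequence database is read by the MAKE-LIB program, which writes amino acid compositions to the SCORES library and accession numbers and identifiers to the NAMES library; amino acid composition data are entered into FINDER, which lists candidate proteins.]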


(PVDF) membranes were obtained from Bio-Rad or Millipore.

Amino Acid Analysis. Coomassie brilliant blue-stained protein bands were excised from PVDF membranes and sealed in vacuo in glass vials containing 6 M HCl (constant boiling; Pierce) plus 0.1% phenol. Acid hydrolysis was performed at 120°C for 20 hr. Hydrolysates were applied to a Beckman system 6300 amino acid analyzer, which separates amino acids on a polystyrenesulfonic acid ion-exchange resin in sodium-containing buffers and derivatizes the amino acids with ninhydrin as they are eluted. Ninhydrin levels are then determined spectrophotometrically. Data were interpreted with the Nelson Analytical 2600 chromatograph software package.

FIG. 1. Flow diagram of the operation of the programs described here.

Proteins. Proteins were obtained from Sigma (bovine ubiquitin, bovine serum albumin, rabbit muscle myosin, pig muscle glyceraldehyde-3-phosphate dehydrogenase, chicken ovalbumin, chicken lysozyme, human carbonic anhydrase, bovine carbonic anhydrase, rabbit phosphorylase b, Escherichia coli β-galactosidase) or produced locally [porcine neurofilament subunits NF-L and NF-M, porcine glial fibrillary acidic protein (GFAP), and various unidentified proteins]. The pig spinal cord cytoskeletal preparation was produced as described (7). ABRF-92 is a protein sample from the Association of Biomedical Research Facilities, the identity of which is kept secret, used to test how accurately amino acid compositions are being determined nationwide (5).

Electroblotting. Electroblotting was performed at a constant voltage of 90 V for 30-120 min from Laemmli gels in the presence of 10 mM Mes at pH 6.0, containing either 20% methanol for low molecular weight proteins or between 0.01% and 0.1% SDS for larger proteins. Poly(vinylidene difluoride

Table 1. Accuracy and reproducibility of amino acid composition data

Amino acid   Factor   Variance   SD
Asx          1.01     0.0032     0.0563
Thr          1.01     0.0007     0.0273
Ser          0.94     0.0058     0.0759
Glx          1.04     0.0028     0.0528
Pro          1.15     0.0929     0.3048
Gly          1.10     0.0096     0.0978
Ala          1.04     0.0003     0.0169
Val          0.87     0.0039     0.0627
Met          0.68     0.0296     0.1722
Ile          0.91     0.0022     0.0464
Leu          1.04     0.0013     0.0359
Tyr          1.10     0.0014     0.0377
Phe          1.00     0.0024     0.0488
His          0.80     0.0061     0.0778
Lys          1.21     0.0044     0.0662
Arg          0.97     0.0061     0.0784

Samples of bovine serum albumin, chicken ovalbumin, chicken lysozyme, and human carbonic anhydrase were blotted onto PVDF three separate times, and the amino acid compositions were determined. The results obtained for each amino acid were compared with the theoretically expected values, and the average degree of over- or underquantitation has been placed in the "Factor" column. The values obtained for each amino acid from a particular protein were then standardized to give a mean of 1.00. The values for the same amino acid from each protein were then combined and the variance and standard deviation (SD) were determined.
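Since users are encouraged to determine their own values for Table 1, the calculation in the legend can be sketched as below. This is an illustrative Python reading of the legend, not the author's code: the function names are hypothetical, and the use of the sample (n - 1) variance is an assumption, since the legend does not say which estimator was used.

```python
import statistics


def correction_factor(observed, expected):
    """Average degree of over- or underquantitation for one amino acid.

    `observed` and `expected` are parallel lists of measured and
    theoretically expected quantities, one pair per determination.
    """
    return statistics.mean(o / e for o, e in zip(observed, expected))


def pooled_variability(replicates_by_protein):
    """Variance and SD for one amino acid pooled over several proteins.

    Each inner list holds the replicate values obtained for that amino acid
    from one protein. Values are first standardized so that each protein's
    replicates have a mean of 1.00, then all standardized values are combined.
    """
    standardized = []
    for values in replicates_by_protein:
        mean = statistics.mean(values)
        standardized.extend(v / mean for v in values)
    # ASSUMPTION: sample variance (n - 1 denominator).
    variance = statistics.variance(standardized)
    return variance, variance ** 0.5
```

For example, with made-up replicate values of 9, 10, and 11 from one protein and 18, 20, and 22 from another, the standardized values are 0.9, 1.0, 1.1, 0.9, 1.0, and 1.1, so both proteins contribute equally regardless of how much of the amino acid each contains.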

RESULTS The degree of over- or underestimation of quantitation and the reproducibility of determination of each amino acid under the experimental conditions used were determined as described in Table 1 and in Materials and Methods. The determination of these two factors allows the systematic correction of amino acid composition data and the construction of a scoring method which expects the best fit for amino acids which we can determine very accurately but tolerates a loose fit with those which are less accurately determined. To test FINDER, several pure proteins were blotted onto PVDF membranes, their amino acid compositions were determined, and the data were entered into the FINDER program. The program was surprisingly sensitive, almost always producing the best match with the expected protein (Table 2). FINDER effectively points to a single candidate based on the amino acid composition alone, although the SDS/PAGE molecular sizes are also in agreement with the calculated molecular sizes of the candidates. Many of the closely matching sequences were obviously related to the target molecule. For instance, the second-best, although rather distant, match to E. coli β-galactosidase was β-galactosidase from Klebsiella pneumoniae. Similarly, bovine NF-L finds as best matches all four NF-L sequences in the PIR 32 database. Note that the absence of the bovine NF-L sequence from the database would clearly not have prevented protein identification. Often members of the same protein superfamilies as the source protein were among those better matching. For example, data from blots of GFAP produced vimentin, desmin, lamins, and other intermediate-filament subunits as poorer matches than GFAP itself, but nonetheless with scores in the 16-20 range.
The relatively low best score obtained by certain of these proteins (e.g., bovine ubiquitin in Table 2) indicates that some of the data cannot have been of the highest quality, but that the method was still sensitive enough to identify the protein in question. In the case of ubiquitin, note that no protein other than a member of the ubiquitin family scored more than 20, so that even the relatively low score obtained still suggests ubiquitin as a strong candidate. FINDER also suggested the expected proteins as best candidates in the case of bovine neurofilament subunit NF-M, bovine GFAP, bovine serum albumin, chicken ovalbumin, chicken lysozyme, pig glyceraldehyde-3-phosphate dehydrogenase, and rabbit muscle myosin. In general, lower scores with the expected candidate were associated with lower amounts of amino acid analyzed; >10 nmol of total amino acid hydrolyzed usually gave a workable result, but the best data were obtained with >25 nmol, corresponding to about 2.5 µg, or 50 pmol, of a 50-kDa protein. These studies suggest that the proteins selected have amino acid compositions unique enough for the program to select them from the database. To see whether this is generally true of known proteins, a program (called SEEKER) was written which randomly selects a protein composition from the SCORES file and looks for other proteins with similar compositions,


Table 2. Results obtained from FINDER

Each block is headed by the protein analyzed, its SDS/PAGE molecular mass, and the amount analyzed (nmol). Columns: Score; calculated molecular mass (kDa); IEP; accession and name of candidate.

E. coli β-galactosidase, 115 kDa; amount, 43.45 nmol
  30   116.3   5.1   GBEC; β-galactosidase, E. coli LacZ
  24   117.5   5.5   A24925; β-galactosidase, Klebsiella pneumoniae
  22    36.4   5.1   A39967; inter-α-trypsin inhibitor, human (fragment)

Bovine NF-L, 68 kDa; amount, 25.93 nmol
  26    61.8   4.3   QFPGL; neurofilament triplet L protein, pig
  25    61.8   4.3   A25227; neurofilament triplet L protein, mouse
  24    61.7   4.4   S07144; neurofilament triplet L protein, human
  20    32.4   4.1   QFMSL; neurofilament L protein, mouse (fragment)

Human carbonic anhydrase, 29 kDa; amount, 55.91 nmol
  27    29.1   7.0   CRHU2; carbonate dehydratase (EC 4.2.1.1) II, human
  27    29.2   7.0   A27175; carbonate dehydratase (EC 4.2.1.1) II, human
  23    18.4   7.1   A26386; acidic fibroblast growth factor, human (fragments)
  22    57.0   5.5   A33879; cytosol aminopeptidase, Saccharomyces cerevisiae

Ubiquitin, 8 kDa; amount, 22.41 nmol
  24    59.6   7.1   A34080; ubiquitin 14, Dictyostelium discoideum
  24    25.5   7.1   D34080; ubiquitin 1, D. discoideum
  24    42.6   7.1   C34080; ubiquitin 2, D. discoideum
  23    25.6   7.1   B27806; ubiquitin, D. discoideum
  23    42.6   7.1   B34080; ubiquitin 19, D. discoideum
  22    25.9   7.1   A31560; polyubiquitin, Drosophila melanogaster
  22     8.5   7.1   A26087; ubiquitin, Drosophila melanogaster
  22    25.7   7.1   A26437; ubiquitin, human
  22    77.0   7.1   A22005; ubiquitin precursor, human
  22    42.8   7.1   A27806; ubiquitin (clone p229), D. discoideum
  21     8.4   7.1   UQHU; ubiquitin, human
  21     8.4   7.1   UQBO; ubiquitin, bovine
  21     8.4   7.1   UQFFM; ubiquitin, Mediterranean fruit fly
  21     8.5   7.1   UQBY; ubiquitin, S. cerevisiae
  21    34.4   7.1   UQNC; ubiquitin precursor, Neurospora crassa
  21     8.4   8.1   UQUYSF; ubiquitin, fall armyworm (fragment)
  21    30.5   7.8   S04863; ubiquitin precursor, maize (fragment)
  21    42.8   7.1   D29456; ubiquitin fusion protein, S. cerevisiae
  21    93.9   7.1   A30126; ubiquitin precursor, Caenorhabditis elegans
  21    25.8   7.1   S17740; ubiquitin precursor, Phytophthora infestans
  21    25.5   8.1   S12577; polyubiquitin, Tetrahymena pyriformis (SGC5)
  21    25.7   7.1   S13928; polyubiquitin, chicken
  21    34.0   7.1   S12583; polyubiquitin, mouse

Unidentified spinal cord cytoskeletal protein, 50 kDa; amount, 67.54 nmol
  27    49.6   4.6   S04695; tubulin β chain, Volvox carteri f.
  25    49.6   4.5   UBKM; tubulin β chain, Chlamydomonas reinhardtii
  24    49.5   4.5   JQ0177; tubulin β chain, Polytomella agilis
  24    49.5   4.5   MZ0005; tubulin β-2 chain, Polytomella agilis
  23    49.4   4.5   S05496; tubulin β chain, Euglena gracilis
  23    49.9   4.5   B30309; tubulin β chain, Euplotes crassus
  23    49.4   4.6   S00683; tubulin β chain, Stylonychia lemnae
  23    51.2   7.6   A37851; phosphoenolpyruvate carboxylase, E. coli
  23    74.4   4.9   C36346; fibulin C, human
  23    86.7   5.9   A23679; furin precursor, mouse

Unidentified protein sample ABRF-92, 25 kDa; amount, 71.76 nmol
  30    25.6   8.0   KYBOA; chymotrypsin A precursor, bovine
  20    13.4   7.1   S09959; Ig heavy chain V-D-J region, mouse
  19    42.8   7.0   DEZMG3; glyceraldehyde-3-phosphate dehydrogenase A precursor, maize
  19    25.7   4.8   KYBOB; chymotrypsin B precursor, bovine
  19    38.9   4.3   PS0140; aspergillopepsin A precursor, Aspergillus awamori (fragment)
  19    15.0   7.7   F29380; Ig heavy chain precursor V region (fragment)
  19    41.1   4.3   JU0340; aspergillopepsin A precursor, Aspergillus awamori
  19    27.7   7.0   A21195; chymotrypsinogen 2 precursor, dog

Only the best scores are shown, and all proteins with the lowest score listed are included. IEP, calculated isoelectric point. Amount, total quantity (nmol) of relevant amino acids analyzed. The variable ubiquitin molecular masses arise from polyubiquitin cDNAs and ubiquitin precursors, which contain multiple 8.4-kDa ubiquitin monomers.

using a variety of scoring methods. Unrelated proteins of similar amino acid composition are extremely difficult to find, even with scoring methods looser than that used by FINDER, suggesting a surprising degree of uniqueness to amino acid composition data. SEEKER performed 100 random searches using the FINDER scoring method, and the resulting 1528 proteins scoring ≥20 were examined in detail. Of these, 651 proteins were clearly related to the target protein (e.g., the same sequence under a different accession number; the same gene product from a different tissue, strain, or species; or readily comprehensible situations such as a protein and a large fragment of the same protein). Fifty-four of the searches found no unrelated protein scoring >20 points. In 28 of those cases no other protein, whether related or not, scored >20, so that


within the limits defined by the scoring method, the amino acid composition of these proteins was unique. In the remaining 46 cases the unrelated proteins invariably obtained low scores, as shown in Fig. 2. No protein unrelated to the target protein scored >25, and only 2 unrelated proteins scored 25. In contrast, 55 clearly related proteins scored 25, and 201 related proteins scored >25. Of the proteins with low scores, 13.5% of those scoring 20 were related to the target, as were 31% of those scoring 21, 47% of those scoring 22, 80% of those scoring 23, 86% of those scoring 24, and 96.5% of those scoring 25. FINDER scores of 20 and above therefore have a significant probability of indicating a genuine match, the likelihood increasing dramatically with increasing score. Proteins selected by SEEKER that produced a significant number of scores with clearly unrelated proteins had amino acid compositions close to the average composition of the database (for the PIR 32 database, in mole percentages: Asx, 9.84; Thr, 6.05; Ser, 7.43; Glx, 10.77; Pro, 5.39; Gly, 7.41; Ala, 7.81; Val, 6.69; Met, 2.37; Ile, 5.58; Leu, 9.45; Tyr, 3.34; Phe, 4.11; His, 2.35; Lys, 5.99; Arg, 5.39). The FINDER scoring algorithm was therefore used to search for proteins matching this average. Even in this worst-case situation the highest score was 25, obtained by a single protein. Sixteen proteins scored 24, and 32 proteins scored 23. These data suggest that, given good-quality input, even a protein with an amino acid composition very close to the database average would be selected on the basis of amino acid composition alone. A corollary of this is that even quite low-quality data would still allow the identification of the ~50% of proteins in the current database which are not close in amino acid composition to any unrelated protein. The program has been used to aid in the identification of a growing list of proteins found in a variety of different situations.
The program frequently suggests a strong and plausible candidate with reasonably matching molecular mass and isoelectric point, which can then be tested for identity to the target protein. Here I provide two examples. Salt- and detergent-extracted pig spinal cord cytoskeletal preparations contain intermediate-filament subunits as major components, but microtubule- and microfilament-associated proteins would be expected to be extracted (Fig. 3A), so that a protein of 50-kDa apparent molecular mass might be a novel intermediate filament-associated protein (7). The amino acid composition from a single PVDF blot was fed into FINDER, which revealed that the 50-kDa protein had a composition very close to that of the β-tubulin multigene family (Table 2). Out


FIG. 3. (A) SDS/PAGE of solubilized salt- and detergent-extracted pig spinal cord cytoskeletal material resolved on DEAE-cellulose; fraction numbers are at the bottom of each lane. An unidentified protein with an SDS/PAGE molecular size of 50 kDa was eluted between neurofilament subunits NF-H and NF-M, suggesting an isoelectric point of about 5.0. (B and C) All lanes containing the 50-kDa protein are strongly labeled with both monoclonal (B) and polyclonal (C) antibodies to β-tubulin, in line with the results from FINDER.

of the 58 proteins scoring 20 or more, 29 were β-tubulins, and the 4 best-scoring proteins were all β-tubulins, scoring in the range 24-27. The SDS/PAGE molecular mass and the isoelectric point estimated from ion-exchange chromatography are also consistent with the 50-kDa protein being a β-tubulin, although none of the close non-β-tubulin candidates meet both these criteria. Finally, β-tubulin, while not expected, is also not an implausible component of the preparation. Later experiments showed that antibodies to β-tubulin gave clear and strong signals on the 50-kDa band, leaving little doubt that this protein is a member of the β-tubulin family (Fig. 3 B and C). A further example is provided by ABRF-92, a protein sample given out by the Association of Biomedical Research Facilities, the identity of which is kept secret. The amino acid composition determined from ABRF-92 was fed into FINDER, which pointed to bovine chymotrypsinogen B precursor as clearly and unambiguously the best match (Table 2). The Protein Core Facility at the University of Florida obtained two peptide sequences corresponding exactly to the bovine chymotrypsinogen B precursor sequence, and a telephone call to the Association of Biomedical Research Facilities further confirmed this identification. In the case of several other proteins, FINDER found no close matches. Later, partial peptide sequences from two of these showed no similarity to any known protein, suggesting that these proteins are novel, and also indicating that the program tends not to generate false positives. In one case FINDER failed to identify a protein which was later characterized by peptide sequencing. In this case, examination of the original composition data revealed that they matched the expected values very poorly, probably due to contamination or other technical problems. Finally, several published amino acid profiles were examined with FINDER.
Even though the error and reproducibility factors shown in Table 1 are expected to be somewhat different for other laboratories, the program usually worked efficiently. For example, the amino acid composition of the 66-kDa protein described by Chiu et al. (8) was entered into FINDER, which found as the single best match α-internexin (score of 25 points; next best score, 21 points), in line with the generally held belief that these two proteins are identical (9).

DISCUSSION

FIG. 2. Results obtained from 100 runs of the SEEKER program. [Plot of the number of sequences (vertical axis) against score, from 20 to 32 (horizontal axis), shown separately for related and unrelated proteins.]

The program described should be a useful addition to the current array of scientific software. It is far easier, cheaper, and quicker to obtain amino acid composition data than peptide sequence. The blockage of the N-terminal amino acid is not a problem for composition analysis, the amount of protein required is quite low, and the recovery of amino acids from PVDF membranes should be excellent. Perhaps the most surprising finding reported here is the highly specific nature of the composition data; unrelated proteins have distinct amino acid compositions, and proteins with the same or very similar profiles are invariably closely related. The composition profile, as routinely determined, is an array of 16 non-integer numbers whose range in known proteins is quite surprising. Even if proteins of >20 kDa are selected, the ranges in PIR 32 are as follows: Asx, 0-46.08%; Thr, 0-45.3%; Ser, 0-38.69%; Glx, 0-45.78%; Pro, 0-47.5%; Gly, 0-66.32%; Ala, 0-4.65%; Val, 0-19.18%; Met, 0-15.19%; Ile, 0-23.35%; Leu, 0-29.18%; Tyr, 0-16.76%; Phe, 0-27.71%; His, 0-64.94%; Lys, 0-32.23%; Arg, 0-32.14%. For comparison, note that only 10 digits (each with only 10 possible values) are sufficient to uniquely identify most of the tens of millions of telephones in the United States. Three further criteria beyond the composition can also be used in the identification process: the name, calculated molecular mass, and isoelectric point of each candidate protein. The molecular mass determined from SDS/PAGE is reasonably accurate for most proteins (10), but in some extreme cases may differ from the real molecular mass by as much as a factor of 2. Similarly, the isoelectric-point calculation does not take into account interactions between charged groups within the molecule or the effects of posttranslational modification and so is only accurate to within about 1 pH unit. However, both numbers, used judiciously, can rule out a large number of proteins from any list of candidates. Finally, the candidate should be a protein which could plausibly be found in the experimental situation under examination.
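The kind of isoelectric-point calculation referred to above, which treats every ionizable group independently, can be sketched as follows. This is not the author's routine: the text does not describe how the isoelectric point was calculated, so the pKa values below are assumed textbook-style values, and the function names are hypothetical. The sketch finds the pH of zero net charge by bisection on the Henderson-Hasselbalch charge of each group.

```python
# ASSUMPTION: illustrative pKa values; the paper does not state which were used.
PKA_POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0}             # side chains carrying +1
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}    # side chains carrying -1
PKA_N_TERMINUS = 8.0
PKA_C_TERMINUS = 3.1


def net_charge(sequence, ph):
    """Net charge at a given pH, ignoring interactions between charged groups."""
    seq = sequence.upper()
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERMINUS))    # protonated N terminus
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERMINUS - ph))   # ionized C terminus
    for residue, pka in PKA_POSITIVE.items():
        charge += seq.count(residue) / (1.0 + 10 ** (ph - pka))
    for residue, pka in PKA_NEGATIVE.items():
        charge -= seq.count(residue) / (1.0 + 10 ** (pka - ph))
    return charge


def isoelectric_point(sequence, tolerance=0.001):
    """pH at which the net charge is zero, found by bisection between 0 and 14."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid    # still positively charged: the pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2.0
```

Bisection works here because the net charge decreases monotonically as pH rises. The limitations named in the text apply directly: nothing in this model accounts for neighboring charges or posttranslational modification, which is why such a value is best used only to exclude clearly incompatible candidates.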
The composition, molecular weight, isoelectric point, and context therefore provide four independent criteria which can be used in the identification process. Experience to date suggests that one can feel very confident about the identity of a protein which matches a particular candidate by all four criteria. The finding that a β-tubulin isotype is a component of the salt-extracted spinal cord cytoskeleton is interesting and may point to the presence of an unusually stable form of microtubules in these preparations. Alternatively, the β-tubulin may be a contaminant and of no real significance. In either case we are now at a stage that would normally have required considerable work and expense. The surprising variability of amino acid compositions of known proteins, the reason that FINDER works, has been conclusively demonstrated here. The resolution of FINDER is dependent on the empirically determined accuracy and reproducibility of amino acid composition data produced by a


particular facility. Future improvements in protein hydrolysis techniques and amino acid determination will therefore naturally increase the resolution of the program. FINDER can be easily modified and made even more sensitive if Cys and Trp, or separate Glu/Gln and Asp/Asn determinations, become more widely performed. It is becoming ever more likely that the sequence of an unknown protein in a particular experimental situation has already been determined, so that the protein could be identified by programs like the one described here. Failure to find a good match with FINDER, given good-quality amino acid composition data, is also useful information, since it suggests that a protein is novel and worthy of detailed characterization. After the work described here was complete, I became aware that previous authors have used amino acid composition data to identify proteins, although it is fair to state that this approach is not well known or widely used and that the method described here is significantly more refined in several respects (11, 12). Executable versions of FINDER and the NAMES and SCORES files are available from the author. FINDER has been copyrighted.

I thank Benne Parten at the Protein Core Facility of the University of Florida for performing the hydrolyses and amino acid analyses. Julio Hawkins, Laura Errante, Ben Dunn, Nancy Denslow, Jeff Harris, and Paul Hargrave provided helpful criticism and other input. This work was supported by National Institutes of Health Grant NS22695.

1. Matsudaira, P., ed. (1989) A Practical Guide to Protein and Peptide Purification for Microsequencing (Academic, San Diego).
2. LeGendre, N. (1990) BioTechniques 9, 788-805.
3. Ozols, J. (1990) Methods Enzymol. 182, 587-601.
4. Gharahdaghi, F., Atherton, D., DeMott, M. & Mische, S. M. (1992) Techniques in Protein Chemistry III, ed. Angeletti, R. H. (Academic, San Diego), pp. 249-260.
5. Strydom, D. J., Tarr, G. E., Pan, Y.-C. E. & Paxton, R. J. (1992) Techniques in Protein Chemistry III, ed. Angeletti, R. H. (Academic, San Diego), pp. 261-274.
6. Tsugita, A., Uchida, T., Mewes, H. W. & Ataka, T. (1987) J. Biochem. 102, 1593-1597.
7. Shaw, G. & Hou, Z.-H. (1990) J. Neurosci. Res. 25, 561-568.
8. Chiu, F. C., Barnes, E. A., Das, K., Haley, J., Socolow, P., Macaluso, F. P. & Fant, J. (1989) Neuron 2, 1435-1445.
9. Fliegner, K. H., Ching, G. Y. & Liem, R. K. H. (1990) EMBO J. 9, 749-755.
10. Weber, K. & Osborn, M. (1969) J. Biol. Chem. 244, 4406-4412.
11. Eckerskorn, C., Jungblut, P., Mewes, W., Klose, J. & Lottspeich, F. (1988) Electrophoresis 9, 830-838.
12. Sibbald, P. R., Sommerfeldt, H. & Argos, P. (1991) Anal. Biochem. 198, 330-333.