Determining the roles of different chain fragments ... - Semantic Scholar

2 downloads 0 Views 172KB Size Report
threading' that allows estimation of the contributions of particu- lar chain fragments to recognition of the native structure (Reva and Topiol, 2000). In particular, we ...
Protein Engineering vol.15 no.1 pp.13–19, 2002

Determining the roles of different chain fragments in recognition of immunoglobulin fold

B.Reva1,2, A.Kister3, S.Topiol1 and I.Gelfand3 1CTA/CAMM,

Novartis Institute for Biomedical Research, 556 Morris Avenue, Summit, NJ 07901 and 3Department of Mathematics, Rutgers State University, New Brunswick, NJ 08855, USA

2To

whom correspondence should be addressed. Currently on leave from the Institute of Mathematical Problems of Biology, Russian Academy of Sciences, Puschino, Moscow Region, Russia.

2Present

address: Discovery Partners, Computational Division, Suite 645, Two Executive Drive, Fort Lee, NJ 07024. E-mail: [email protected]

We examine sequence-to-structure specificity of β-structural fragments of immunoglobulin domains. The structure specificity of separate chain fragments is estimated by computing the Z-score values in recognition of the native structure in gapless threading tests. To improve the accuracy of our calculations we use energy averaging over diverse homologs of immunoglobulin domains. We show that the interactions between residues of β-structure are more determinant in recognition of the native structure than the interactions within the whole chain molecule. This result distinguishes immunoglobulins from more typical proteins where the interactions between residues of the whole chain normally recognize the native fold more accurately than interactions between the residues of the secondary structure residues alone [Reva,B. and Topiol,S. (2000) Biocomputing: Proceedings of the Pacific Symposium. World Scientific Publishing Co., pp. 168–178]. We also find that the predominant contributions of the secondary structure are produced by the four central β-strands that form the core of the molecule. The results of this study allow us through quantitative means to understand the architecture of immunoglobulin molecules. Comparing the fold recognition data for different chain fragments one can say that β-strands form a rigid frame for immunoglobulin molecules, whereas loops, with no structural role, can develop a broad variety of binding specificities. It is well known that protein function is determined by specific portions of a protein chain. This study suggests that the whole protein structure can be predominantly determined by a few fragments of chain which form the structural framework of the molecule. This idea may help in better understanding the mechanisms of protein evolution: strengthening a protein structure in the key framework-forming regions allows mutations and flexibility in other chain regions. Keywords: averaging over homologs/fragmentation threading/ framework-forming regions in immunoglobulins/sequence-tostructure specificity Introduction The unique structure of a protein molecule is dependent on interactions between residues of a chain in solution. Suppose some of these interactions could be turned off or changed as a result of residue substitutions in the directed mutagenesis. How would this affect the native structure and the total © Oxford University Press

energy landscape of the molecule? Currently, there is no experimentally tested reliable computational approach that could give an exact answer to the role of any given chain residues or fragments in determining the native structure. However, there is experimental evidence showing significant differences between residues in determining the native structure of a protein molecule. In particular, alignments of protein homologs show that conserved residues are not distributed randomly along a chain, but rather correlated with their positions in the structure (Ashford et al., 1987; Chothia et al., 1998; Ptitsyn, 1998; Mirny and Shakhnovich, 1999; Orengo, 1999; Yang and Honig, 2000). The experimental studies show also that tolerance of protein structure to residue substitutions depends not only on the type of mutation but rather on its position in the surrounding three-dimensional structure (Fersht, 1999). This data can be generalized in the problem of sequenceto-structure specificity: what is the particular role of a given residue or a group of residues in determining a protein structure? Answering this question quantitatively (even approximately) is important for developing protein structure prediction methods, understanding the architecture of protein globules and designing new protein structures. In principle, one could think of two different hypothetical classes of protein sequences. In the first class of sequences, most chain fragments contribute equally to the energy separation of the native structure from other structures. In the second class of sequences, there are significant differences among chain fragments in their contribution into recognition of the native structure. Some of the chain fragments are highly preferred in the native structures making the dominant contribution to energy separation of the native structure from other folds; the other chain fragments are energetically ‘neutral’ (have low-specificity) for the native structure or possibly even prefer a different fold. In this work we investigate which of these two hypothetical classes of sequences more closely reflects the case of immunoglobulins. Recently, we have developed an approach of ‘fragmentation threading’ that allows estimation of the contributions of particular chain fragments to recognition of the native structure (Reva and Topiol, 2000). In particular, we compared the roles of secondary structure and loops in recognition of the native structures of proteins. The accuracy of recognition was estimated by computing the Z-score values for fragments of protein chains in threading tests. We found that for the overwhelming majority of diverse protein sequences, secondary structure fragments are more determinant of the native structure than loops. Typically, however, the secondary structure is, in turn, less determinant of the structure than the whole sequence. Therefore, roughly speaking, the majority of sequences show the features described above as class I. However, ~14% of the sequences studied (34 of 240) showed the features of class II: the fragments of secondary structure taken alone produce significantly lower Z-score values than the corresponding whole sequences. Most of these proteins (33 of 34) contain a significant portion of β-structure. 13

B.Reva et al.

Fig. 1. Two-dimensional schematic representation of immunoglobulin structures (adapted from Flores et al., 1994). The three-dimensional structure is represented as a sequence of secondary structure elements (helices, shown by open circles; and strands, shown by triangles). The scheme illustrates the relative spatial position and direction of these elements. (a) Variable domains (VL and VH) and (b) constant domains (CL and CH) of, correspondingly, light (L) and heavy (H) chains of immunoglobulin FAB (PDB code: 6fab). Capital letters used in designation of β-strands correspond to the chain fragments names used in multiple sequence alignments of Gelfand and Kister (Gelfand and Kister, 1995); closed (gray) triangles are used to single out the core β-strands.

In this work we apply the fragmentation threading approach in a more detailed study of the specific role of secondary structure fragments in determining (recognizing) the threedimensional structure of the immunoglobulin molecule that have been found among the class II proteins. Immunoglobulins are chosen for such a study because they constitute a vast and well explored class of proteins. There are hundreds of threedimensional structures of immunoglobulins that can be used for producing reliable sequence and structure alignments; alignments of diverse homologs help to increase the accuracy of energy calculations (Finkelsten, 1998; Reva et al., 1999). Because immunoglobulins play a central role in developing specific immune responses, they are an important object for protein engineering. Analysis of conserved mutations in multiple alignments of immunoglobulin sequences and structures suggested a special role of four central β-strands (strands B, C, E, F in Figure 1) in determining immunoglobulin fold (Galitsky et al., 1999; Halaby et al., 1999; Kister et al., 2001). These four cross-linked, anti-parallel β-strands form the core of immunoglobulin globules and constitute one of the main features of the immunoglobulin-like fold. These chain fragments are enriched in conserved hydrophobic residues. Because of their position in the core, they dominate interresidue contacts and interactions. There is also a specific cysteine bridge between strands B and F. Therefore, in examining sequence-to-structure specificity in immunoglobulins it is interesting, first of all, to evaluate the role of these β-strands in fold recognition and compare their contribution with those of other chain regions. In this study, we focus on three specific 14

Fig. 2. Scheme of interactions taken into account (Reva et al., 1998; Reva and Topiol, 2000); filled circles show Cα atoms of residues to which a potential is applied. (a) Interactions depend on the distance r between residues α and β (‘long-range’ potentials). (b) Chain bending potential: bending at the intervening residue α affects the distance r between terminal residues i – 1 and i ⫹ 1. (c) Short-range potentials depending on the distance between terminal residues α and β. (d) Chiral potential (Reva et al., 2000) depending on the dihedral angle χ between two planes of Cα atoms, (i – 1, i, i ⫹ 1) and (i, i ⫹ 1, i ⫹ 2), and the residues α and β. Potentials are derived at a resolution of ∆ ⫽ 1 Å; the angular resolution for chiral potential is ∆χ ⫽ 30°. This crude resolution of energy functions renders insignificant the statistical errors connected with the inclusion (or deletion) of each individual protein in the database used in the derivation of the potentials.

questions: What is the role in recognition of the native structure for residues of (i) the secondary structure, (ii) the four central β-strands, and (iii) loops? Methods The method of ‘fragmentation threading’ was first described in Reva and Topiol (Reva and Topiol, 2000). The essential features of the method are given below. The energy functions In energy calculations we used pairwise Cα atom-based potentials (Reva et al., 1997; Reva et al., 2000) as illustrated in Figure 2. These phenomenological potentials reflect the statistics of protein structure. Interactions between atoms of residues are thermodynamically averaged over different orientations and different chemical surroundings. Such potentials can reproduce only the smoothed energy landscape of a protein molecule and can be applied only to relatively large groups of residues (as was done in this study). The potentials have been tested in different recognition tests and proved to be accurate enough to distinguish the native structure of proteins with relatively high accuracy. Thus, we believe that using the phenomenological potentials in the task of recognition of the native structure by a significant portion of the protein sequence is still a valid implementation of the potentials and one can trust the results obtained. The Z-score The accuracy of protein structure recognition is commonly characterized by the value of the Z-score (Hendlich et al., 1990):

Determining the roles of different chain fragments

Z⫽

ENf – 〈E〉

(1)

D

where ENf is the energy of the native fold, 〈E〉 ⫽

1 M

M

ΣE

i

i⫽1

is the average energy, D⫽

[

M

1 M

Σ (E – i

〈E〉)2

i⫽1

]

1/ 2

is the standard deviation of energies; Ei is the energy of the i-th of M alternative folds (i ⫽ 1, . . . , M). The Z-score gives the average number of standard deviations between the native and the random fold energy. It allows one to estimate the expected number of folds NZ, among which the native structure can be selected as the one with the lowest energy. Assuming a normal distribution of energies of competing folds, Nz ⫽ z

√2π

x2p ⫽

x2

∫ exp (– 2 ) dx

(2)

–⬁

When Z⬍⬍–1, which corresponds to a reasonable accuracy of the predicting methods, 2

Nz µ √2π|Z|exp

Normal distribution for alternative fold energies In estimating the accuracy of protein structure recognition by a Z-score value we assume a normal distribution for alternative fold energies. This type of energy distribution is typical for systems approximated by the random energy model (REM) (Derrida, 1981). The total energy in the REM is a sum of many independent random interactions. The energy of a protein globule is a sum of thousands of inter-residue interactions. Alternative random structures allow for a huge variety of interresidue contacts. Therefore, it is generally believed that the energy spectrum of a protein molecule has the normal form. However, because we compare the energies of interactions for relatively small and specially chosen groups of residues it is especially important to validate the applicability of the REM for separate chain fragments. A deviation between the theoretical and observed distribution is usually measured by the χ2 value (Mathews and Walker, 1964). To compute this quantity we divide an energy distribution into bins and calculate the observed and the expected bin populations. The χ2 value for a protein p is computed as:

(Z2 )

(3)

The larger NZ (and hence –Z), the more accurate the protein structure recognition. Generation of alternative structures by gapless threading For evaluating the average energy, 〈E〉, and standard deviation, D, one needs to use a representative set of alternative structures. Commonly, such structures are obtained by the method of gapless threading of a query sequence onto all possible threedimensional structures provided in the form of backbones of a set of non-homologous proteins (Hendlich et al., 1990). No internal gaps or insertions are allowed; thus, a chain of N residues in length can be threaded through a host protein molecule of M residues in length in M – N ⫹ 1 different ways. Because threaded structures which differ by a small number of register shifts are similar to each other (measured by RMSD; Reva et al., 1998) we sampled threaded structures with register shifts of 10. Only a fraction of the possible conformations of the query sequences is generated by this procedure. However, this fraction is enough for a crude estimation of the Z-score values. [All the protein structures used in threading were taken from the PDB according to Sander’s 25% similarity list of October 1997 (Hobohm et al., 1992). We used only 364 structures from this list as follows: those with no chain breaks, with a resolution better than 2.5 Å and an R factor less than 0.2, and with no structural homologs (Reva et al., 1998)]. To determine a particular contribution of a given group of residues to structure recognition we calculated separately the energy of this group in the native structure as well as the corresponding energies in alternative structures. These energies were used for estimating the Z-score for the particular fragments of aligned sequences.

1 Q

K

Σ i⫽1

(e) 2 (n(o) p,i – n p,i) 2 (n(e) p,i)

(4)

(e) 2 where n(o) p,i and n p,i) are the observed and the expected population, respectively, of energies in bin i for protein p; K is the number of bins taken into account; and Q is the number of degrees of freedom. For the normal distribution Q ⫽ K – 3 because the total population, mean value and dispersion of the expected statistics are extracted from the observed statistics. When χ2 ~1, the experimental data can be treated as confirming the expected statistics. We used a threshold of χ2 ⫽ 3 (corresponding to 5% significance at Q ⫽ 2) to distinguish the energy distributions that differ from the normal one. Because the region of low energies (E ⬍ 〈E〉) is of major interest, the χ2 values were estimated for the lower half of the energy distributions. The interval (–⬁, 〈E〉) was divided into K ⫽ 5 bins: (–⬁, E40), (E40, E40 ⫹ δ), (E40 ⫹ 3δ, 〈E〉), where E40 is the 40th lowest observed energy, δ ⫽ (〈E〉 – E40)/4. Alignment of immunoglobulin chains The multiple alignment of immunoglobulin chains is based on both sequence and structural information (Gelfand and Kister, 1995). The alignment procedure consisted of two steps: (1) determination of the secondary structure units in the proteins, and (2) multiple sequence and structural alignment of the secondary structure units. For assignment of the secondary structure we determined a consensus in (i) backbone dihedral angles, (ii) torsion angles around virtual Cα–Cα bonds, and (iii) hydrogen bonds between main chain atoms (Gelfand and Kister, 1995). In constructing the multiple alignments the corresponding strands and loops from different structures were grouped together and aligned separately (i.e. the group of A strands, the group of AB loops, etc; see Tables I and II). No internal deletions or insertions were allowed in the selected chain fragments. In constructing the multiple alignment of separate chain fragments we took into account and determined a consensus in (i) the sequence alignment, (ii) the backbone conformation and hydrogen bonds, and (iii) the residue–residue contacts (contact between two residues was detected if any two heavy atoms of these residues were closer than 5 Å; Gelfand and Kister, 1995). Only the structures where all the

15

B.Reva et al.

Table I. Average Z-scores for Ig domains obtained for different groups of structural fragmentsa Fragmentsb

VL(70)

Whole chain All β-strands Core β-strands Chain with no core Loops

VH(64)

CL(13)

CH(16)

〈Z1〉

〈Zh〉

〈Γ〉

〈Z1〉

〈Zh〉

〈Γ〉

〈Z1〉

〈Zh〉

〈Γ〉

〈Z1〉

〈Zh〉

〈Γ〉

–5.5 –6.0 –5.1 –3.9 –3.3

–6.4 –6.6 –5.9 –4.6 –3.6

5 4 4 6 6

–6.3 –6.5 –4.8 –4.6 –3.0

–7.4 –7.6 –5.7 –5.2 –3.2

9 5 4 10 13

–6.2 –6.3 –5.0 –4.2 –1.7

–6.7 –7.0 –5.5 –4.4 –1.8

2 2 2 2 2

–4.8 –5.3 –4.9 –2.8 –1.3

–5.1 –5.4 –5.2 –3.0 –1.6

2 2 2 2 2

a〈Z 〉 and 〈Z 〉 are correspondingly the average Z-scores calculated for individual chains, and the average Z-scores calculated with averaging energies over 1 h diverse homologs; 〈Γ〉 is the average number of diverse homologs used in averaging energies; the diverse homologs were selected individually for each of the chains of the alignment by the criterion of pairwise energy correlation that does not exceed 0.7 and 0.8 for the variable and the constant domains, respectively. bCore β-strands are the strands B, C, E, F as shown in Figure 1.

Table II. Average characteristics of energy distributions obtained separately for the diverse mini-alignments of Ig domainsa Ig domains

Fragments

% Residue

〈Z1〉

〈Zh〉

〈χ12〉

〈χh2〉

〈∆E1〉

〈∆Eh〉

〈∆D1〉

〈∆Dh〉

VL(7)

Whole chain All β-strands Core β-strands Chain with no core Loops

100 56 26 74 44

–5.6 –5.8 –4.9 –4.2 –3.4

–6.5 –6.8 –5.8 –4.6 –3.4

1.5 1.4 2.4 1.8 1.4

1 2.3 2.7 2.5 2.2

–1.4 –1.1 –1.1 –1 –0.8

–1.3 –1.1 –1.1 –0.9 –0.6

26.8 11.1 6.2 19.9 10.9

20.9 9.3 5.2 15.5 8.6

VH(11)

Whole chain All β–strands Core β–strands Chain with no core Loops

100 61 24 76 49

–6.2 –6.6 –4.9 –4.3 –3.1

–7.3 –7.8 –5.5 –5.1 –3.3

2 1.9 2.2 1.4 2.1

1.7 1.4 3.1 1.2 4.1

–1.4 –1.3 –1.1 –0.9 –0.6

–1.3 –1.2 –1.1 –0.8 –0.4

26.1 12.3 6.5 19.6 9.7

20.3 10.4 5.8 14.8 7.3

CL(2)

Whole chain All β-strands Core β-strands Chain with no core Loops

100 69 37 63 31

–6.2 –6.2 –4.6 –4.3 –1.6

–6.4 –6.6 –4.9 –4.3 –1.5

0.9 1.9 1.9 2.3 13.1

1 0.8 1.3 3.2 10.2

–1.6 –1.4 –1 –1.3 –0.5

–1.5 –1.3 –1 –1.2 –0.4

25.6 15.6 8.2 18.2 9

23 13.6 7.5 16.6 8.5

CH(2)

Whole chain All β-strands Core β-strands Chain with no core Loops

100 65 37 63 35

–5.2 –5.2 –4.5 –2.9 –2.2

–5.2 –5.5 –5 –2.8 –1.5

2.8 2.4 2.3 2.8 8.5

2.3 2.7 2.1 2.4 7.3

–1.6 –1.4 –1.1 –0.9 –0.6

–1.4 –1.4 –1.1 –0.8 –0.4

29.1 16.6 8.3 16.8 8.5

25.9 15.7 7.5 14.9 7.7

aAveraging

is done over all chains in the corresponding diverse mini-alignments of Tables III–IV; the numbers of VL, VH, CL and CH domains in the corresponding alignments are given in brackets; % Residue is the average percentage of residues in the chain fragments; 〈Z〉, 〈χ2〉, 〈∆E〉 and 〈∆D〉 are, correspondingly, the average Z-score, the χ2-criterion, the average energy differences per residue and the averaged standard deviation of energies; indices 1 and h correspond respectively to individual chain energies and to energies produced by averaging over all homologs in the mini-alignments.

residues have been resolved by X-ray were used in the final multiple alignments. The resulting alignments of variable domains consist of 70 (VL) and 64 (VH) chains for light and heavy chains, respectively; the alignments of constant domains were constructed for 13 (CL) light and 16 (CH) heavy chains. Averaging energies over homologs To reduce the errors in energy calculations we used the recently developed approach of averaging energies over aligned homologs (Finkelstein, 1998; Reva et al., 1999). The averaged energy Ui corresponding to a structure i is computed directly as: Ui ⫽

1 Γ

Γ

ΣE

λ i

(5)

λ⫽1

where Eλi is an energy of a homolog λ in a structure i, and Γ is the number of homologs. The improvement in the accuracy of protein structure recognition depends on the diversity of homologs used in this averaging (Reva et al., 1999). Nearly identical sequences are 16

obviously of little use for energy averaging; the homologs selected for averaging should be as dissimilar as possible (within the context of a set of homologs). When homologous sequences are diverse and there are no significant differences between their native folds, such energy averaging reduces random energy errors and improves the separation of the common native fold from other structures. According to the general approach (Finkelstein, 1998), the diverse homologs are determined by the lowest correlation in energy. However, in practice, the selection of homologs is a controversial procedure because initially homologs are selected by their sequence similarity. Based on our previous tests (Reva et al., 1999), we used the optimized procedure for selection of homologs, i.e. we computed pairwise correlation coefficients for each pair λµ of homologous chain fragments and, for the purpose of averaging, selected only those homologs with an energy correlation of less than 0.7–0.85. The energies corresponding to interactions with ‘gap’ positions in the alignment were taken to be zero.

Determining the roles of different chain fragments

Table III. (A) Alignment of the diverse VL domains with the domain 8FAB used as a roota and (B) alignment of the diverse VH domains with the domain 1IBG used as a rootb

aAll

sequences are extracted from the original multiple alignment of VL domains; energy correlation between any two chains in the alignment does not exceed 0.7 (from the original alignment of 70 sequences only the sequences 8FAB, 1FDL and 1IND of 103, 106 and 107 residue length, respectively, produced the diverse sets of seven homologs); the names of the β-strands are given by one-letter words, the loop names are given by two-letter words; the core strands (B, C, E, F) are shown in bold. bAll sequences are extracted from the original multiple alignment of V domains; energy correlation between any two chains in the alignment does not H exceed 0.7 (from the original alignment of 64 sequences only the sequences 1IBG, 1ACY, 1VGE 2FB4 and 2IG2 of 119, 120, 121, 125 and 125 residue length, respectively, produced the diverse sets of 11 homologs); the names of the β-strands are given by one-letter words, the loop names are given by twoletter words; the core strands (B, C, E, F) are shown in bold.

In the recognition tests with energy averaging over homologs (Reva et al., 1999) a sequence of known three-dimensional structure was used as a ‘root’ sequence. (A root sequence is taken into alignment with no deletions or insertions; the alignments of other sequences have to be adjusted correspondingly.) The three-dimensional structures are known for all sequences of immunoglobulins used in this work. Therefore,

each of the sequences could be taken as a root sequence with the corresponding three-dimensional structure used as the common native fold. We considered two types of averaging. In the first case, each of the sequences of the original alignment was treated as a root, we computed the corresponding set of diverse homologs for each sequence and used them, in turn, in averaging over energies. The obtained characteristics were 17

B.Reva et al.

Table IV. (A) Alignment of the diverse CL domains with the domain 2FB4 used as a roota and (B) alignment of the diverse CH domains with the domain 1BAF used as a rootb

aAll

sequences are extracted from the original multiple alignment of CL domains; energy correlation between chains 2FB4 and 1DFB is 0.68 (the chain 2FB4 is alphabetically the first one among the shortest chains of 97 residues); the names of the β-strands are designated by single letters and loop names by double letters; the core strands (B, C, E, F) are shown in bold. bAll sequences are extracted from the original multiple alignment of C domains; energy correlation between chains 1BAF and 1MCP is 0.69 (the chain H 1BAF is alphabetically the first one among the shortest chains of 94 residues); the names of the β-strands are designated by single letters and loop names by double letters; the core strands (B, C, E, F) are shown in bold.

simply averaged over all the sequences in the original alignment. Because these sequences are not equally diverse, for control, we also considered separately, specially selected diverse sets of homologs (mini-alignments) extracted from the corresponding original alignments. Specifically, the minialignments were formed from diverse sets with a maximal number of homologs; in cases where the numbers of homologs in the alignments were equal, we chose the one corresponding to the shortest root sequence to decrease a number of possible deletions. This procedure worked well for relatively diverse variable domains (for more details see captions to Table III). For constant domains, the root sequences for mini-alignments were chosen among the shortest sequences of the original alignments and then by alphabetic order of the corresponding PDB names. Results and discussion The main results of the study are presented in Table I. In Table I we compare the averaged Z-scores for the whole chains, for all fragments of β-structure, for the core β-strands (strands B, C, E, F; see Figure 1), for the chains whose core strands have been extracted, and for loops of immunoglobulins. The Z-scores are calculated and averaged for the single domains (〈Z1〉) and for the corresponding alignments of diverse homologs (〈Zh〉). The diverse homologs were determined individually for each of the chains of the original multiple alignments using the criterion of maximal pairwise energy correlation of 0.7 (see Methods). Table I shows the role of the different structural fragments in recognition of the native fold of immunoglobulins. One can see that for all four classes of immunoglobulin domains the β-structure shows higher specificity (lower Z-scores) for the native fold than the whole chain. Furthermore, the four central β-strands (the core strands) contribute most significantly to the recognition of the native structure to the extent comparable with the structure specificity 18

of the whole chains. In contrast to this, the chains without inclusion of the interactions produced by the core strands are significantly less determinant of structure recognition. The loop fragments are structurally least relevant. To make the results of our calculations more reliable and also to take the advantage of the multiple alignments, we computed the Z-scores using energy averaging over diverse homologs (Finkelsten, 1998; Reva et al., 1999). Diverse homologs were separately grouped into mini-alignments (similar to the ones shown in Tables III and IV) for all types of the considered chain fragments by the maximal pairwise energy correlation criterion. Table I shows that averaging energies over diverse homologs noticeably improves the accuracy of the structure recognition in all the cases considered. The results produced for single chain energies, as characterized above, and for averaged energies are perfectly consistent. The averaged number of diverse homologs, 〈Γ〉, may be used to indicate the extent of the diversity of the corresponding chain fragments. One can see that β-structure and core strands are the most conserved fragments of the variable chains of immunoglobulins, whereas loops are the most variable ones. The constant domains show significantly less variability compared to the variable ones. (To get at least two chains in the diverse alignments for all the sequences in CL and CH multiple alignments we had to increase the correlation threshold to 0.85.) To avoid biasing in averaging over whole sets of chains in the original multiple alignments we conducted control tests by computing and averaging the Z-scores and other characteristics of the energy distributions within the diverse mini-alignments of immunoglobulin domains. The results of this test are given in Table II. The procedure of selection of diverse homologs is described in the Methods; the corresponding mini-alignments are presented in Tables III and IV. One can see that the values of the Z-scores of Tables I and II are generally consistent both qualitatively and quantitatively.

Determining the roles of different chain fragments

Besides the Z-scores, Table II gives additional information on average percentages of β-structure, core strands and loops to indicate how many residues actually contribute to structure recognition. The χ2 values of Equation (8) are used to compare the deviations between the observed energy distributions and the REM-based estimates. A critical question for this study is the legitimacy of the approach used for estimating the accuracy of fold recognition. The Z-score comparative analysis is based on the concept that distributions of alternative fold energies are always well approximated by the normal law. The low χ2 values of Table II shows that for the majority of the tested chain fragments the energy distributions can be reasonably treated as normal ones (however, the energy distributions for loops show more significant deviations from the normal law than the other considered fragments). Table II also presents data on average energy differences per residue between the native and misfolded conformations for the considered chain fragments and the corresponding standard deviations of energies. In complete agreement with the general theory (Finkelstein, 1998), the average energy differences between the native and the misfolded structures are virtually unaffected by energy averaging over homologs, whereas the corresponding standard deviations are systematically reduced in comparison to the average standard deviations for single chains. This results in a reduction of the Z-scores and an increase in the accuracy of structure recognition. Conclusions In this work we examined the role of secondary structure fragments in fold recognition of immunoglobin molecules. The obtained results show that the interactions between residues of β-structure are more determinant in recognition of the native structure than the interactions within the whole chain molecule. This result distinguishes immunoglobulins from more typical proteins where the interactions between residues of the whole chain normally recognize the native fold more accurately than interactions between the residues of the secondary structure (Reva and Topiol, 2000). We also found that within the set of β-structures, the main contributions of the secondary structure are produced by the four central β-strands that form the core of the molecule. The results of this study provide a quantitatively based understanding of the design of immunoglobulin molecules. Comparing the fold recognition data for different chain fragments one can say that β-strands form a rigid framework for the immunoglobulin molecule, whereas loops, with no structural task, are ‘free’ to develop a broad variety of binding specificities. It is well known that protein function is determined by specific fragments of a protein chain. This study suggests that the whole protein structure can be predominantly determined by a few specific fragments of a chain which form a structural frame of the molecule. This idea may help in better understanding the mechanisms of protein evolution: strengthening of a protein structure in the key frame-forming regions allows mutations and flexibility in other chain regions. The fragmentation threading is a simple and fast method to explore design of protein globules and to approach the problem of sequence-to-structure specificity.

References Ashford,D., Chothia,C. and Lesk,A. (1987) J. Mol. Biol., 196, 199–216. Chothia,C., Gelfand,I.M. and Kister,A.E (1998) J. Mol. Biol., 278, 457–479. Derrida,B. (1981) Phys. Rev. B, 24, 2613–2626. Fersht,A. (1999) Structure and Mechanism in Protein Science: A Guide to Enzyme Catalysis and Protein Folding. W.H.Freeman & Co. Finkelstein,A. (1998) Phys. Rev. Lett., 80, 4823–4825. Flores,T., Moss,D. and Thornton,J. (1994) Protein Eng., 7, 31–37. Galitsky,B., Gelfand,I. and Kister,A. (1999) Protein Eng., 12, 919–925. Gelfand,I. and Kister,A. (1995) Proc. Natl Acad. Sci. USA, 92, 10884–10888. Halaby,D.M., Poupon,A. and Mornon,J.-P. (1999) Protein Eng., 12, 563–571. Hendlich,M., Lackner,P., Weitckus,S., Flokner,H., Froschauer,S. Gottsbacher,K., Casari,G., Sippl,M. (1990) J. Mol. Biol., 216, 167–180. Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Protein Sci., 1, 409–417. Kister,A., Roytberg,M., Clothia,C., Vasiliev,Y., Gelfand,I. (2001) Protein Sci., 10, 1801–1810. Mathews,J. and Walker,B. (1964) Mathematical Methods in Physics. W.A.Benjamin Inc., New York. Mirny,L. and Shakhnovich,E. (1999) J. Mol. Biol., 291, 177–196. Orengo,C. (1999) Protein Sci., 8, 699–715. Ptitsyn,O. (1998) J. Mol. Biol., 278, 655–666. Reva,B. and Topiol,S. (2000) Biocomputing: Proceedings of the Pacific Symposium. World Scientific Publishing Co., pp. 168–178. Reva,B., Finkelstein,A., Sanner,M. and Olson,A. (1997) Protein Eng., 10, 865–876. Reva,B., Finkelstein,A. and Skolnick,J. (1998) Folding Design, 3, 141–147. Reva,B., Finkelstein,A., Skolnick,J. (2000) Derivation and testing residueresidue mean force potentials for use in protein structure recognition. Protein Structure Prediction Methods and Protocols. Humana Press Inc., Tokowa, NJ, USA. pp 155–174. Reva,B., Skolnick,J. and Finkelstein,A. (1999) Proteins, 35, 353–359. Yang,A.-S. and Honig,B. (2000) J. Mol. Biol., 301, 691–711. Received May 1, 2001; accepted September 25, 2001

Acknowledgements The authors are grateful to Dr A.V.Finkelstein for useful discussions. A.E.K. is supported by the Gabriella and Paul Rosenbaum Foundation. I.M.G. and A.E.K. thank Mrs M.Goldman for continuous encouragement.

19