
Bulletin of Mathematical Biology Vol. 45, No. 5, pp. 665-720, 1983, Printed in Great Britain

0092-8240/83$3.00 + 0.00 Pergamon Press Ltd. © 1983 Society for Mathematical Biology

THE THEORY AND PRACTICE OF DISTANCE GEOMETRY

TIMOTHY F. HAVEL
Group in Biophysics, University of California, Berkeley, CA 94720, U.S.A.

IRWIN D. KUNTZ†
Department of Pharmaceutical Chemistry, University of California, San Francisco, CA 94143, U.S.A.

GORDON M. CRIPPEN
Dept. of Chemistry, Texas A & M University, College Station, TX 77843, U.S.A.

The mathematics of distance geometry constitutes the basis of a group of algorithms for revealing the structural consequences of diverse forms of information about a macromolecule's conformation. These algorithms are of proven utility in the analysis of experimental conformational data. This paper presents the basic theorems of distance geometry in Euclidean space and gives formal proofs of the correctness and, where possible, of the complexity of these algorithms. The implications of distance geometry for the energy minimization of macromolecules are also discussed.

Introduction. A central problem in molecular biophysics is the determination of the time-average conformations of biological macromolecules. While the only truly effective means of doing this at present is X-ray crystallography, the fact that many macromolecules do not form good crystals has led to the use of many other experimental techniques for obtaining less complete information. Methods for the synthesis of these data into a coherent picture have not kept up with these developments. In addition, the time-average conformation in solution is not always the same as in the crystal, and the exact conformation in solution may experience substantial fluctuations from the average. These deviations are often not taken into account when studying structure/function relations in macromolecules. Finally, the known functional characteristics of a macromolecule certainly place constraints upon what its conformation might be, but there exists no systematic method of extracting this information about the structure.

† To whom correspondence should be addressed.


The most common techniques currently used to deal with conformational problems of these kinds are exhaustive grid search (Marshall et al., 1979), constrained random search coupled with energy minimization (Dygert et al., 1975) and computer graphics (Langridge et al., 1981). The first two are expensive in terms of computer time, and the last is expensive in terms of human time. Over the last few years a new approach has evolved that shows promise of being able to handle these problems in their full generality. This approach is known by the name of the field of mathematics it relies on: 'Distance Geometry'. Although originally conceived (Crippen, 1977) as a means of circumventing the long-standing local minimum problem of molecular energy minimization (Nemethy and Scheraga, 1977), it has led to the development of a compact data structure for representing classes of related conformations consistent with experimental, energetic and functional information. Algorithms have been designed to manipulate this data structure so as to identify the common structural characteristics of the conformations it embodies. Other algorithms have been developed to extract from it examples of conformations consistent with the given information and to estimate the range of such conformations. The utility of these algorithms has been demonstrated by a number of experimental and theoretical studies. References to some of these are given in Section 3.3 of this paper, and a complete review is available (Crippen, 1981).

Many of the algorithms presented in this paper have been published elsewhere, but the theorems on which they are based and the proofs of their correctness and complexity bounds have not. This paper gives a complete and logically self-contained exposition of this material, and considers its implications for the theory of macromolecular conformation as a whole.
We will also present several new algorithms, and describe refinements to the existing algorithms. The unifying feature of all these algorithms is that they rely solely on geometric principles to derive structural conclusions.

1. The Use of Distances as Coordinates. This section is concerned with the mathematical foundations of distance geometry and its application to Euclidean space. We first discuss the advantages and disadvantages of representing physical systems in terms of distances. This leads to the central result of this section, which is a characterization of physical distances. We conclude by explaining the physical meaning of this characterization, with particular emphasis on its implications for the local minimum problem.

1.1. Euclidean distance geometry. Euclidean distance geometry may be defined as the study of Euclidean configurations using the distances between points as the primary coordinate system. Distance geometry was initially developed by Karl Menger in the 1920s and 1930s, and has been worked on extensively in the U.S.A. by Leonard Blumenthal and his students at the


University of Missouri. It is from Blumenthal's classic text, The Theory and Applications of Distance Geometry (1970), that many of the results that follow are taken.

Although it is the distances between points that are of central importance in physical interactions, the majority of physical problems are defined in terms of more artificial coordinate systems. The problems inherent in this are particularly noticeable when there are geometric constraints that must not be violated in the solution. These constraints are called 'holonomic', and frequently take the form of a set of fixed distances. A simple example of such is the distance between the two masses in the rigid rotor problem. To maintain the constraints, physicists often resort to elaborate coordinate systems in which the relations between the invariants are given by nonlinear trigonometric functions. The relations among the distances themselves are always given by rational functions.

As might be expected, the use of distances as coordinates in physical problems has a drawback. Since there are $N(N-1)/2$ distances characterizing a configuration consisting of $N$ points, but only $3N - 6$ degrees of freedom in three dimensions, a coordinate representation given in terms of distances will in general have no physically meaningful solution. The following theorem makes it seem, however, that when one is not concerned with the dimensionality of one's results the distances are the 'natural' coordinates to use.

THEOREM 1.1. The number of internal degrees of freedom available to a configuration of points in Euclidean space of arbitrarily high dimension is exactly the same as the number of distances between those points.

Proof. Without loss of generality, we consider a set of $N$ points in $R^n$ with $n > N$. By means of a suitable linear transformation, we may reduce all but the first $N - 1$ coordinates of each point to zero. The total number of variables left is $N(N-1)$. Since the configuration is invariant with respect to translation and rotation, the total number of degrees of freedom is thus $N(N-1) - N_T - N_R$, where $N_T$ and $N_R$ are, respectively, the number of translational and rotational degrees of freedom. The number of independent translations in $R^n$ is $n$, and since independent rotations occur in perpendicular planes defined by any two of the $n$ coordinate axes, there are $n(n-1)/2$ of them.† Taking $n = N - 1$, the dimension of the subspace the points now occupy, this gives $N(N-1) - (N-1) - (N-1)(N-2)/2 = N(N-1)/2$ degrees of freedom, as claimed.

† Another way to see this is to observe that there are $n(n+1)/2$ independent components to the inertial tensor, $n$ of which correspond to the mass distribution.
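The counting argument in the proof of Theorem 1.1 can be checked numerically. The following is a minimal sketch, not part of the original paper; the function names are hypothetical and chosen only for illustration. It assumes the points span a subspace of dimension $N - 1$, as in the proof.

```python
def internal_degrees_of_freedom(n_points):
    """Count internal degrees of freedom as in the proof of Theorem 1.1.

    Assumes the N points occupy an (N - 1)-dimensional subspace.
    """
    dim = n_points - 1
    variables = n_points * dim            # N(N - 1) remaining coordinates
    translations = dim                    # one translation per axis
    rotations = dim * (dim - 1) // 2      # one rotation per pair of axes
    return variables - translations - rotations


def number_of_distances(n_points):
    """Number of unordered pairs of points, N(N - 1)/2."""
    return n_points * (n_points - 1) // 2


def degrees_of_freedom_3d(n_points):
    """Internal degrees of freedom of N points in three dimensions, 3N - 6."""
    return 3 * n_points - 6
```

For four points the two counts agree with the three-dimensional value (six distances, six degrees of freedom), while for five points there are ten distances but only nine degrees of freedom in three dimensions, which is the drawback described above.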



Note that this theorem does not say that the distances between $N$ points in $E^n$ may be chosen arbitrarily for $n \geq N - 1$; even then certain relations, for example the triangle inequality, must be satisfied. What it does say is that under these conditions any one distance can assume a continuum of possible values even after the values of all the remaining distances have been fixed. As will be shown in the proof of Theorem 2.3, for an independent configuration of $N$ points in $E^{N-2}$, that is, a set of points in $E^{N-2}$ not contained in any $(N-3)$-dimensional subspace, any one distance can assume at most two possible values after the remaining distances have been fixed.

1.2. The congruent embedding problem. In order to use distances effectively as coordinates, we will need at the very least some means of testing our proposed coordinate lists for physical actuality. We will refer to this as the 'congruent embedding problem'. In order to describe what is special about Euclidean distances, it is useful to capture their simplest properties in the following definition.

Definition 1.1. A semimetric is a real-valued function $\rho$ defined on the Cartesian product of a topological space $X$ with itself with the following properties:
(1) $\rho(x, y) \geq 0$ (positive semidefinite);
(2) $\rho(x, y) = \rho(y, x)$ (symmetric);
(3) $\rho(x, y) = 0 \Leftrightarrow x = y$ (homogeneous);
for all $x, y \in X$. A topological space on which a semimetric has been defined is called a 'semimetric space'. A semimetric that obeys the triangle inequality
(4) $\rho(x, y) + \rho(y, z) \geq \rho(x, z)$
for all $x, y, z \in X$ is called a 'metric', and the space on which it acts is called a 'metric space'. We shall see that the Euclidean distance function is an example of a metric.

Definition 1.2. A 'congruence' is a function $\phi$ that maps one semimetric space $S$ onto another semimetric space $S'$ such that

$$\rho(x, y) = \rho'(\phi(x), \phi(y)) \tag{1.1}$$

for all $x, y \in S$. In other words, congruences are transformations between semimetric spaces that preserve the semimetric. Two semimetric spaces are termed 'congruent' if there exists a congruence between them. We will



denote this equivalence relation by $S \approx S'$. It is easily shown that congruences are necessarily homeomorphisms. Thus, stated abstractly, we seek necessary and sufficient conditions given solely in terms of the semimetric for an arbitrary semimetric space to be congruent to a subset of $E^n$. The next definition will allow us to phrase this more succinctly.

Definition 1.3. A semimetric space $S$ is said to be 'congruently embeddable'† in a semimetric space $T$ if it is congruent to a subset of $T$. Where no confusion can arise, we shall simply say $S$ is embeddable in $T$. We denote this partial order relation by $S \subset T$. A semimetric space $S$ is 'irreducibly embeddable' in $E^n$ if it is embeddable in $E^n$ but not in any nontrivial subspace of $E^n$, that is, if it is congruent with an independent subset of $E^n$.

The following definition and theorem allow us to decompose the congruent embedding problem into a set of subproblems, each defined on $n + 3$ points. In accord with Blumenthal's notation, we shall be numbering our points from 0 to $N$ for the remainder of this section.

Definition 1.4. A semimetric space $T$ has congruence order $k$ with respect to a given class of semimetric spaces $\{S\}$ if $S \subset T$ whenever each $k$-tuple of points $\{s_0, \ldots, s_{k-1}\}$ in $S$ is embeddable in $T$.

THEOREM 1.2. The Euclidean space $E^n$ has congruence order $n + 3$ with respect to the class of all semimetric spaces $\{S\}$.

Proof. We must show that if each $(n+3)$-tuple of a semimetric space $S$ is embeddable in $E^n$, then we can find a congruence of $S$ with a subset of $E^n$. Let $\{s\}_{n+1} = \{s_0, s_1, \ldots, s_n\}$ be an $(n+1)$-tuple in $S$ for which the congruence image $\{e\}_{n+1}$ that exists in $E^m$ by hypothesis has maximum dimension $m$. Without loss of generality we may take $m = n$, so that $\{e\}_{n+1}$ is independent. Considering now an arbitrary $(n+2)$-tuple $\{s\}_{n+2}$ containing $\{s\}_{n+1}$, we have by hypothesis an $(n+2)$-tuple $\{e'\}_{n+2}$ in $E^n$ congruent to $\{s\}_{n+2}$. By the fact that congruence is an equivalence relation, we have $\{s\}_{n+1} \approx \{e'_0, e'_1, \ldots, e'_n\} \approx \{e\}_{n+1}$. It is a distinctive property of Euclidean space that congruences between subsets can be extended to motions over the entire space, and this extension is unique for independent subsets. Composing the motion taking $\{e'\}_{n+1}$ to $\{e\}_{n+1}$ with the congruence between $\{s\}_{n+2}$ and $\{e'\}_{n+2}$ that exists for each such $(n+2)$-tuple, we find our choice of $\{s\}_{n+1}$ has determined a mapping $M : S \to E^n$. To show that this mapping is a congruence, let $s_{n+1}, s_{n+2}$ be any two elements of $S$, and let $e_{n+1}, e_{n+2}$ be their images under this mapping. By hypothesis, the $(n+3)$-tuple $\{s_0, \ldots, s_{n+2}\}$ is congruent to an $(n+3)$-tuple $\{e''_0, \ldots, e''_{n+2}\}$ in $E^n$, and as before there exists a unique motion taking $\{e''\}_{n+1}$ to $\{e\}_{n+1}$. Since there exists at most one point with given distances to $n + 1$ other independent points in $E^n$, we see that the images of $e''_{n+1}$ and $e''_{n+2}$ under this motion must coincide with $e_{n+1}$ and $e_{n+2}$. Hence $\rho(s_{n+1}, s_{n+2}) = d(e''_{n+1}, e''_{n+2}) = d(e_{n+1}, e_{n+2})$.

† Blumenthal uses the equivalent term 'congruently imbeddable'.

Note that this proof is valid under weaker hypotheses, which we state in a corollary for future reference.

COROLLARY 1.1. A semimetric space $S$ is irreducibly embeddable in $E^n$ if $S$ contains an $(n+1)$-tuple irreducibly embeddable in $E^n$ such that every $(n+3)$-tuple containing it is embeddable in $E^n$.

An $(n+1)$-tuple on which a semimetric has been defined will henceforth be called a 'semimetric $(n+1)$-tuple'. For the three-dimensional case with which we are most concerned, this corollary says that we need only find a necessarily noncoplanar set of four points and then check that every set of six points including these four is embeddable. We will now show how this can be done.

1.3. Cayley-Menger determinants. In manipulating distance coordinates it is useful to have them in a well-defined standard format. The following format is the most natural one.

Definition 1.5. The distance matrix of a semimetric $(N+1)$-tuple $\{s_0, \ldots, s_N\}$ is an $(N+1)$ by $(N+1)$ matrix $D = D\{s_0, \ldots, s_N\}$ whose $ij$-th element $d_{ij} = \rho(s_i, s_j)$ for all $i, j$.

In correspondence with the way in which we have been labeling our points, the elements of our distance matrices will be numbered from 0 to $N$. By the symmetry of the distance function it follows that this matrix is symmetric. Distance matrices have a long history, having been used extensively as a basis for deriving geometric and clustering models of data in the social sciences and psychology (Shepard, 1980). They have also been used in examining and comparing macromolecular structures such as proteins (Phillips, 1970; Nishikawa and Ooi, 1973; Rossman and Liljas, 1974; Kuntz, 1975). As it turns out, however, it is not this matrix that will be of the greatest use to us.

Definition 1.6. The matrix of squared distances $D^{(2)} = \{d_{ij}^2\}$ is defined as the matrix whose elements are the squares of the elements of $D$. The bordered matrix of squared distances, $\bar{D}^{(2)}$, is an $(N+2)$ by $(N+2)$ matrix consisting of $D^{(2)}$ augmented by an additional row and column consisting of



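all ones except for their common diagonal element, which is zero, as shown below:

$$\bar{D}^{(2)} = \begin{pmatrix} 0 & d_{01}^2 & \cdots & d_{0N}^2 & 1 \\ d_{10}^2 & 0 & \cdots & d_{1N}^2 & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ d_{N0}^2 & d_{N1}^2 & \cdots & 0 & 1 \\ 1 & 1 & \cdots & 1 & 0 \end{pmatrix} \tag{1.2}$$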

Definition 1.7. A Cayley-Menger determinant is the determinant of a bordered matrix of squared distances, $\det(\bar{D}^{(2)})$.
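Definitions 1.6 and 1.7 translate directly into a short computation. The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, the border is placed in the last row and column, and exact rational arithmetic is used so that a vanishing determinant is reported as exactly zero. As a check it relies on the standard fact that for three points the Cayley-Menger determinant equals $-16$ times the squared area of the triangle, and that it vanishes for four coplanar points.

```python
from fractions import Fraction


def determinant(matrix):
    """Exact determinant by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in matrix]
    n = len(m)
    det = Fraction(1)
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero pivot.
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def cayley_menger(squared_distances):
    """Determinant of the bordered matrix of squared distances."""
    n = len(squared_distances)
    bordered = [list(row) + [1] for row in squared_distances]
    bordered.append([1] * n + [0])
    return determinant(bordered)


# A 3-4-5 right triangle (area 6) and the corners of a unit square.
triangle = [[0, 9, 16], [9, 0, 25], [16, 25, 0]]
square = [[0, 1, 2, 1], [1, 0, 1, 2], [2, 1, 0, 1], [1, 2, 1, 0]]
```

Because simultaneously permuting rows and columns leaves a determinant unchanged, placing the border first or last gives the same value.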

We now give the principal result of this section. THEOREM 1.3. A necessary and sufficient condition that a semimetric (N + 1)-tuple be embeddable in E n f o r n i such that DUB(Pi, pj) is maximum;

4. 5.

S+-S+Pi;

6.

for each Px not in S with k > i do begin

7. 8. 9.

DLB(Pi, p]) 0. Performing these substitutions on the off-diagonals establishes that Pi} = d/} for all i, ] > 0 as well. Having constructed this explicit representation, the theorem is proved.


Thus we see that transforming from distance coordinates to Cartesian coordinates is as simple as matrix diagonalization.† In practice, small errors in the distances may cause the eigenvalues $\lambda_k$ to be nonzero for $k > n$, and if some of these become negative the method will fail because the square roots of the eigenvalues are required to obtain the coordinates. Even if they are all positive, the fact that we require three-dimensional coordinates means that it is desirable to have a means of obtaining coordinates that are in some sense an optimum approximation to the full $N$-dimensional coordinate set. The approximation we use will be optimum in the sense of minimizing the rotation error between the two configurations, that is, the sum of the squares of the differences in the coordinates minimized over all possible relative orientations of the two configurations. The method will also allow us to largely bypass the problem of negative eigenvalues (Eckart and Young, 1936).

† The inverse transformation, from Cartesian coordinates to distances, is, of course, even more trivial.

THEOREM 3.2. The $N \times N$ symmetric matrix of rank $n$ that best approximates any given $N \times N$ symmetric matrix of higher rank in the sense of minimizing the distance between them [as defined in Definition (2.1)] is obtained by setting all but the $n$ eigenvalues of largest absolute value to zero and performing the inverse unitary transformation. For positive semidefinite metric matrices, the corresponding coordinates obtained from the rank $n$ matrix by Theorem (3.1) will be optimum in the sense of minimizing the rotation error with respect to the higher dimensional configuration, and the relative orientations of the two configurations will be such that this minimum is actually attained.

Proof. For any two symmetric matrices $A = U^T \Gamma U$ and $B = V^T \Omega V$ we have

$$\|A - B\|^2 = \|A\|^2 + \|B\|^2 - 2\langle A, B\rangle. \tag{3.4}$$

Minimizing this expression subject to the norms of $A$ and $B$ being held constant is the same as maximizing $\langle A, B\rangle$. Now

$$\langle A, B\rangle = \mathrm{Tr}(A^T B) = \mathrm{Tr}(AB) \quad \text{by symmetry} \tag{3.5}$$

$$= \mathrm{Tr}(U^T \Gamma U V^T \Omega V) = \mathrm{Tr}\Big(\Big\{\sum_k \Big(\sum_l U_{li}\gamma_l U_{lk}\Big)\Big(\sum_m V_{mk}\omega_m V_{mj}\Big)\Big\}\Big)$$

$$= \sum_{l,m} \gamma_l \omega_m (u_l \cdot v_m)^2$$
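The eigenvalue truncation described in Theorem 3.2 can be illustrated with a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' code: the function name `best_rank_n` is hypothetical, and the distance between matrices is assumed to be the Frobenius norm, so that the squared approximation error equals the sum of the squares of the discarded eigenvalues.

```python
import numpy as np


def best_rank_n(sym, n):
    """Rank-n approximation of a symmetric matrix by eigenvalue truncation.

    Keeps the n eigenvalues of largest absolute value and zeroes the rest.
    """
    eigenvalues, eigenvectors = np.linalg.eigh(sym)
    keep = np.argsort(-np.abs(eigenvalues))[:n]
    kept_vectors = eigenvectors[:, keep]
    # Rebuild V diag(w) V^T from the retained eigenpairs only.
    return (kept_vectors * eigenvalues[keep]) @ kept_vectors.T


# Example: approximate a random symmetric 5 x 5 matrix by one of rank 2.
rng = np.random.default_rng(0)
raw = rng.normal(size=(5, 5))
a = (raw + raw.T) / 2
b = best_rank_n(a, 2)
```

Selecting by absolute value rather than by signed value matters when the input has negative eigenvalues, which is exactly the situation the surrounding text describes for distances containing small errors.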

/=x

(3.7)

4.

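Proof. Letting $x_{0i}$ be the vector from the center of mass to the $i$th point, we have

$$\sum_{j=1}^{N} x_{0j} = 0 = \sum_{j=1}^{N} (x_{01} + x_{1j}), \tag{3.8}$$

where $x_{ij}$ is the vector from the $i$th to the $j$th point. Hence

$$x_{01} = -\frac{1}{N} \sum_{j=2}^{N} x_{1j} \tag{3.9}$$

and

$$\begin{aligned}
d_{01}^2 = x_{01} \cdot x_{01} &= \frac{1}{N^2} \sum_{j=2}^{N} \sum_{k=2}^{N} x_{1j} \cdot x_{1k} \\
&= \frac{1}{2N^2} \sum_{j=2}^{N} \sum_{k=2}^{N} \left(d_{1j}^2 + d_{1k}^2 - d_{jk}^2\right) \\
&= \frac{1}{2N^2} \Big[ 2(N-1) \sum_{j=2}^{N} d_{1j}^2 - 2 \sum_{k>j=2}^{N} d_{jk}^2 \Big] \\
&= \frac{N-1}{N^2} \sum_{j=2}^{N} d_{1j}^2 - \frac{1}{N^2} \sum_{k>j=2}^{N} d_{jk}^2 \\
&= \frac{1}{N} \sum_{j=1}^{N} d_{1j}^2 - \frac{1}{N^2} \sum_{k>j=1}^{N} d_{jk}^2.
\end{aligned} \tag{3.10}$$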


The same equations hold for any point besides the first as well, so the theorem is proved.

It was first shown by Lagrange in 1783 (Flory, 1969) that the second term of equation (3.7) equals the trace of the inertial tensor divided by the dimension of the space (for unit-mass points). This quantity in turn is the mean square distance from the center of mass, or squared radius of gyration, of the configuration. Thus, if we add an $(N+1)$th point to an $N$-point Euclidean configuration whose distances to the other points are defined by equation (3.7) and use it as our reference point, the row sums of the metric matrix will be zero. This condition will also hold after setting to zero all but the three eigenvalues of greatest absolute value. The metric matrix then becomes a Gram matrix, for which this condition implies that the coordinates are center of mass as well as principal axes coordinates.

We are now ready to present our algorithm for converting distances to Cartesian coordinates (Crippen and Havel, 1978). An earlier version of this algorithm was developed by Crippen (1978).

Algorithm 3.1. Given an approximate set of distances between the points of a three-dimensional configuration, compute a set of 'best fit' principal axes, center of mass coordinates.

Procedure for Algorithm 3.1
Boolean procedure COORDS(PTS, DST, CRD):
begin
comment: PTS is an indexed set of N points, DST the distances between them, and CRD is the returned coordinates.
1. for i ← 1 until N do

2. 3.

4. 5. 6. 7. 8. 9. 10. 11. 12. end
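A minimal Python sketch of the idea behind Algorithm 3.1 follows. It is assembled from the results described in the text (distances to the center of mass, the metric matrix, and truncation to the three eigenvalues of greatest absolute value) rather than from the individual steps of the procedure, so it should be read as an illustration under those assumptions and not as the authors' COORDS routine. The function name, the use of NumPy, and the choice to return `None` when a retained eigenvalue is negative are all assumptions made here.

```python
import numpy as np


def coords_from_distances(dist, dim=3):
    """Center-of-mass, principal-axes coordinates from a distance matrix.

    dist is an N x N symmetric array of (approximate) distances.
    Returns an N x dim array, or None if a retained eigenvalue is negative.
    """
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    # Squared distance of each point to the center of mass (unit masses):
    # (1/N) sum_j d_ij^2 - (1/N^2) sum_{j<k} d_jk^2.
    d0 = d2.sum(axis=1) / n - d2[np.triu_indices(n, k=1)].sum() / n**2
    # Metric matrix of dot products relative to the center of mass,
    # from the law of cosines.
    metric = 0.5 * (d0[:, None] + d0[None, :] - d2)
    eigenvalues, eigenvectors = np.linalg.eigh(metric)
    keep = np.argsort(-np.abs(eigenvalues))[:dim]
    if np.any(eigenvalues[keep] < 0):
        return None  # square roots of the eigenvalues are required
    return eigenvectors[:, keep] * np.sqrt(eigenvalues[keep])


# Usage: recover coordinates for six random points from their distances.
rng = np.random.default_rng(1)
points = rng.normal(size=(6, 3))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
crd = coords_from_distances(dist)
```

For exact distances the recovered coordinates reproduce the input distances up to a rigid motion, and their column sums vanish because the row sums of the metric matrix are zero.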

$d_{0i} \leftarrow \frac{1}{N} \sum_j d_{ji}^2 - \frac{1}{N^2} \sum \ldots$

… $> 0$ on the half-planes $x_0 > x_1$ and $x_0 < x_1$. Convexity follows as before from Proposition (2.2). At $x_0 = x_1$, of course, neither the function nor its Hessian exists. This convexity fails to hold in higher dimensions. For example, in two dimensions the second derivative is $d_{01}^{-6}[6(x_0 - x_1)^2 - 2(y_0 - y_1)^2]$. When this becomes negative the Hessian cannot be positive semidefinite.

We are now ready to define a new distance error function, which we will denote by $D$:

$$D(\{x_m\}) = \sum_{i<j}^{N} \left[ \left(\max\{0, (d_{ij}^2/u_{ij}^2) - 1\}\right)^2 + \left(\max\{0, (l_{ij}^2/d_{ij}^2) - 1\}\right)^2 \right]. \tag{3.15}$$
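The error function of equation (3.15) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name is hypothetical, `lower` and `upper` are taken to be $N \times N$ arrays of lower and upper distance bounds $l_{ij}$ and $u_{ij}$, the sum is assumed to run over pairs $i < j$, and no two points are assumed to coincide (otherwise the second term is undefined).

```python
import numpy as np


def distance_error(coords, lower, upper):
    """Sum of squared relative violations of the distance bounds."""
    x = np.asarray(coords, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    # Squared distances between all pairs of points.
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    i, j = np.triu_indices(len(x), k=1)  # pairs with i < j
    # Penalty when a distance exceeds its upper bound.
    over = np.maximum(0.0, d2[i, j] / upper[i, j] ** 2 - 1.0)
    # Penalty when a distance falls below its lower bound.
    under = np.maximum(0.0, lower[i, j] ** 2 / d2[i, j] - 1.0)
    return float((over ** 2 + under ** 2).sum())


# Two points a distance 2 apart.
pair = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
```

The function is zero exactly when every distance lies within its bounds, and each violated bound contributes the square of its relative violation.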