On the Distribution of the Distance Between Two Multivariate Normally Distributed Points

Houssain Kettani (1) and George Ostrouchov (2)

(1) Department of Electrical and Computer Engineering and Computer Science, Polytechnic University of Puerto Rico, San Juan, PR 00919, [email protected]

(2) Statistics and Data Sciences Group, Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, [email protected]

Abstract

Motivated by the problem of cluster identification, we consider two multivariate normally distributed points. We seek the distribution of the squared Euclidean distance between these two points. We find the corresponding distribution in the general case, and then reduce it for special cases based on the mean and covariance.

1 Introduction and Formulation

Suppose we have two clusters of points, Ci and Cj, in n-dimensional space. Each point in a cluster is drawn from the same multivariate normal distribution with known mean and covariance. An n-dimensional random variable x has a multivariate normal (or Gaussian) distribution with mean m and covariance matrix R if it has the following probability density function (pdf):

f(x) = (2π)^{−n/2} |R|^{−1/2} exp{−(1/2)(x − m)^T R^{−1} (x − m)}.

Such a random variable is denoted by x ∼ N(m, R). Consequently, we pose the following question: given a point xi in cluster Ci and a point xj in cluster Cj, what is the distribution of the squared distance between the two points?

To this end, the squared Euclidean distance between two n-dimensional points xi and xj is given by

d²ij ≐ (xi − xj)^T (xi − xj),

where "T" denotes the transpose operator. Let us now write

z ≐ xi − xj,   (1)

so that

d²ij = z^T z = ∑_{k=1}^{n} z_k²,

where the z_k are the entries of z. The distance dij is also referred to as the amplitude of the random variable z, and its distribution is referred to as the amplitude probability distribution function (APDF). This notion is of particular interest in wave, antenna, and signal analysis [9, Appendix E], and in electromyography (EMG) [12]. The results in this paper are also useful in the areas of cluster identification [13] and dimension reduction [4]. In the following section, we present the distribution of the squared distance. We then consider special cases in Section 3, where this distribution simplifies to more familiar distributions. We end the paper with concluding remarks in Section 4. An index of notations is provided at the end of the paper to facilitate understanding and lookup of the notation used.
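Although not part of the original derivation, the setup above is easy to simulate. The following Python sketch (with arbitrary illustrative means and covariances) draws paired points from two Gaussian clusters and accumulates the squared distances whose distribution is derived below:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3                                     # dimension (illustrative)
m_i, m_j = np.zeros(n), np.ones(n)        # cluster means (illustrative)
R_i, R_j = np.eye(n), 2.0 * np.eye(n)     # cluster covariances (illustrative)

# Draw paired points from the two clusters and form the squared distances.
x_i = rng.multivariate_normal(m_i, R_i, size=100_000)
x_j = rng.multivariate_normal(m_j, R_j, size=100_000)
d2 = np.sum((x_i - x_j)**2, axis=1)       # d2_ij = (x_i - x_j)^T (x_i - x_j)

# Sanity check: E[d2] = tr(Sigma) + mu^T mu with mu = m_i - m_j, Sigma = R_i + R_j.
mu, Sigma = m_i - m_j, R_i + R_j
print(d2.mean(), np.trace(Sigma) + mu @ mu)
```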

2 Distance Distribution

Let xi and xj be two n × 1 random vectors with xi ∼ N(mi, Ri) and xj ∼ N(mj, Rj), respectively, drawn independently. Then from (1), z ∼ N(µ, Σ), where µ = mi − mj and Σ = Ri + Rj. The following theorem gives the characteristic function of the squared distance z^T z. In this theorem, we use the function etr(·) to denote the exponential function of the trace of a matrix.

2.1 Theorem

Let z ∼ N(µ, Σ). Then the distribution of z^T z has the following characteristic function:

φ(ω) = |I + 2jωΣ|^{−1/2} etr{−jω(I + 2jωΣ)^{−1} µµ^T}.   (2)
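As a quick numerical sanity check (not in the original paper), the closed form can be compared against a Monte Carlo estimate of E[exp(−jω z^T z)]. The parameters below are illustrative, and the small ω keeps the principal branch of the complex square root appropriate; note also that etr{−jω(I + 2jωΣ)^{−1}µµ^T} = exp{−jω µ^T (I + 2jωΣ)^{−1} µ}.

```python
import numpy as np

rng = np.random.default_rng(0)
w = 0.4                                            # a single test frequency
mu = np.array([1.0, 0.0, -2.0])                    # illustrative mean
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 0.8, 0.2],
                  [0.0, 0.2, 1.2]])                # illustrative SPD covariance

# Monte Carlo estimate of phi(w) = E[exp(-j w z^T z)].
z = rng.multivariate_normal(mu, Sigma, size=500_000)
phi_mc = np.mean(np.exp(-1j * w * np.sum(z**2, axis=1)))

# Closed form from Theorem 2.1.
M = np.eye(3) + 2j * w * Sigma
phi_cf = np.linalg.det(M) ** -0.5 * np.exp(-1j * w * (mu @ np.linalg.solve(M, mu)))
print(phi_mc, phi_cf)                              # should agree to ~3 decimals
```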

2.2 Proof

We start from the basic definition of the characteristic function:

φ(ω) = E[exp(−jω z^T z)]
     = ∫ (2π)^{−n/2} |Σ|^{−1/2} exp(−jω z^T z) exp{−(1/2)(z − µ)^T Σ^{−1} (z − µ)} dz
     = ∫ (2π)^{−n/2} |Σ|^{−1/2} exp(−s/2) dz,

where

s ≐ 2jω z^T z + (z − µ)^T Σ^{−1} (z − µ)   (3)
  = 2jω z^T z + z^T Σ^{−1} z − 2µ^T Σ^{−1} z + µ^T Σ^{−1} µ.   (4)

2.2.1 Claim

s = (z − µ + q)^T (Σ^{−1} + 2jωI)(z − µ + q) + 2jω µ^T (µ − q),   (5)

where q = 2jω(Σ^{−1} + 2jωI)^{−1} µ.

2.2.2 Proof of Claim

We need to complete the square in s so that the integral can be evaluated as a Gaussian integral. So let us consider the following quadratic form:

t ≐ (z − µ + q)^T (Σ^{−1} + 2jωI)(z − µ + q) = s + u + v,

where s is as in (4), and

u = 2z^T Σ^{−1} q − 4jω µ^T z + 4jω z^T q,   (6)
v = −2µ^T Σ^{−1} q + 2jω µ^T µ − 4jω µ^T q + q^T Σ^{−1} q + 2jω q^T q.   (7)

Now we pick q such that u = 0 for all z. Thus, we have

z^T (Σ^{−1} q − 2jω µ + 2jω q) = 0
⇐ Σ^{−1} q − 2jω µ + 2jω q = 0
⇔ (Σ^{−1} + 2jωI) q = 2jω µ   (8)
⇔ q = 2jω(Σ^{−1} + 2jωI)^{−1} µ.   (9)

Now, we have

v = −2µ^T Σ^{−1} q + 2jω µ^T µ − 4jω µ^T q + q^T Σ^{−1} q + 2jω q^T q
  = −2µ^T (Σ^{−1} q − 2jω µ + 2jω q) − 2jω µ^T µ + q^T Σ^{−1} q + 2jω q^T q   (10)
  = −2jω µ^T µ + q^T Σ^{−1} q + 2jω q^T q   (11)
  = −2jω µ^T µ + q^T (Σ^{−1} q + 2jω q − 2jω µ) + 2jω q^T µ   (12)
  = −2jω µ^T µ + 2jω q^T µ   (13)
  = −2jω µ^T (µ − q),   (14)

where we applied (8) in (11) and (13). Hence, we have

s = (z − µ + q)^T (Σ^{−1} + 2jωI)(z − µ + q) + 2jω µ^T (µ − q),   (15)

where q is as in (9). And this proves the claim.

2.2.3 Proof of Theorem

Now we have

φ(ω) = ∫ (2π)^{−n/2} |Σ|^{−1/2} exp{−(1/2)(z − µ + q)^T (Σ^{−1} + 2jωI)(z − µ + q)} exp{−jω µ^T (µ − q)} dz
     = (2π)^{−n/2} |Σ|^{−1/2} exp{−jω µ^T (µ − q)} ∫ exp{−(1/2)(z − µ + q)^T (Σ^{−1} + 2jωI)(z − µ + q)} dz
     = (2π)^{n/2} |Σ^{−1} + 2jωI|^{−1/2} (2π)^{−n/2} |Σ|^{−1/2} exp{−jω µ^T (µ − q)}
     = |I + 2jωΣ|^{−1/2} exp{−jω µ^T (µ − q)}
     = |I + 2jωΣ|^{−1/2} exp{−jω µ^T (I − 2jω(Σ^{−1} + 2jωI)^{−1}) µ}
     = |I + 2jωΣ|^{−1/2} exp{−jω µ^T (I − 2jωΣ(I + 2jωΣ)^{−1}) µ}.

Now note that the Kailath variant matrix identity [1, p. 153] states that for matrices A, B, and C, we have

(A + BC)^{−1} = A^{−1} − A^{−1} B (I + C A^{−1} B)^{−1} C A^{−1}.

Thus, substituting A = I, B = 2jωΣ, and C = I, we get

I − 2jωΣ(I + 2jωΣ)^{−1} = (I + 2jωΣ)^{−1}.

Therefore, we have

φ(ω) = |I + 2jωΣ|^{−1/2} exp{−jω µ^T (I + 2jωΣ)^{−1} µ}
     = |I + 2jωΣ|^{−1/2} exp{−jω tr[(I + 2jωΣ)^{−1} µµ^T]}
     = |I + 2jωΣ|^{−1/2} etr{−jω(I + 2jωΣ)^{−1} µµ^T},

where tr(·) denotes the trace of a matrix, and we used µ^T A µ = tr(A µµ^T). Hence the proof of Theorem 2.1 is completed.
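The Kailath variant identity used in the last step is easy to confirm numerically. The following sketch (with an arbitrary random Σ and ω) is only an illustration, not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(4)
n, w = 4, 0.7                                        # arbitrary size and omega
S = rng.standard_normal((n, n))
Sigma = S @ S.T + n * np.eye(n)                      # a random SPD Sigma
B = 2j * w * Sigma                                   # A = I, B = 2j*w*Sigma, C = I

lhs = np.eye(n) - B @ np.linalg.inv(np.eye(n) + B)   # I - 2jwSigma(I + 2jwSigma)^{-1}
rhs = np.linalg.inv(np.eye(n) + B)                   # (I + 2jwSigma)^{-1}
print(np.allclose(lhs, rhs))                         # True
```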

2.3 Remark

The random variable y ≐ z^T z can be viewed as the trace of a matrix that is non-central Wishart distributed, W1(n, Σ, µµ^T), of one dimension, n degrees of freedom, covariance matrix Σ, and non-centrality matrix µµ^T. See [15] for more details on this distribution. Thus, (2) gives the characteristic function of such a trace. The corresponding pdf is given in [14] as an infinite weighted sum of gamma density functions. The following lemma, on the other hand, presents an asymptotic distribution of y. A plot of the pdf from the given characteristic function can be obtained using an algorithm presented in [7].
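In the same spirit as [7], though with a much cruder generic method, one can recover the pdf by numerically inverting the characteristic function: since φ(ω) = E[exp(−jωy)], we have f(y) = (1/π) ∫_0^∞ Re[φ(ω) e^{jωy}] dω. The sketch below uses a diagonal Σ and illustrative parameters (the factorized form of φ anticipates Lemma 3.1.1 below, and keeps the complex square roots on the correct branch), and compares the inversion with a Monte Carlo histogram estimate:

```python
import numpy as np

rng = np.random.default_rng(5)
sig2 = np.linspace(0.5, 1.5, 5)            # diagonal of Sigma (illustrative), n = 5
mu = np.full(5, 0.5)                       # illustrative mean of z

# phi(w) for diagonal Sigma factorizes over coordinates (cf. Lemma 3.1.1).
w = np.linspace(1e-6, 80.0, 160_000)
d = 1.0 + 2j * np.outer(w, sig2)           # 1 + 2jw sigma_k^2, shape (len(w), 5)
phi = np.prod(d ** -0.5, axis=1) * np.exp(-1j * w * np.sum(mu**2 / d, axis=1))

# f(y) = (1/pi) * integral_0^inf Re[phi(w) exp(jwy)] dw  (rectangle rule).
y0 = 6.0                                   # evaluate the density at one point
f_y0 = np.sum(np.real(phi * np.exp(1j * w * y0))) * (w[1] - w[0]) / np.pi

# Monte Carlo histogram estimate of the density near y0 for comparison.
y = np.sum((mu + np.sqrt(sig2) * rng.standard_normal((400_000, 5)))**2, axis=1)
print(f_y0, np.mean(np.abs(y - y0) < 0.1) / 0.2)
```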


2.4 Lemma

As n → ∞, the asymptotic distribution of y is

N(n tr(Σ), 2n tr(Σ²)).

2.5 Proof

This result follows from [5], which states that if W ∼ Wm(n, Σ, µµ^T), then as n → ∞, the asymptotic distribution of

[n / (2 tr(Σ²))]^{1/2} (tr W / n − tr Σ)

is N(0, 1). See [15, pp. 517–518] for similar results regarding other functions of W. Thus, the result of the lemma follows by observing that y ∼ W1(n, Σ, µµ^T).
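A minimal numerical illustration of the lemma, under the scalar-Wishart reading with Σ = σ²I and µ = 0 (an assumption made here for simplicity; the non-centrality term is asymptotically negligible): in that case y = z^T z is exactly σ² times a χ²_n variable, so its mean and standard deviation can be checked against nσ² and √(2n)σ².

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma2 = 1000, 1.5                   # large n; Sigma = sigma^2 I, mu = 0

# y = z^T z is sigma^2 * chi-squared(n) here, so draw it directly.
y = sigma2 * rng.chisquare(df=n, size=100_000)

print(y.mean(), n * sigma2)                    # approx n tr(Sigma)
print(y.std(), np.sqrt(2.0 * n) * sigma2)      # approx sqrt(2n tr(Sigma^2))
```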

3 Special Cases

In this section, we consider special assumptions on the random variables xi and xj.

3.1 Case: Σ Is Diagonal

This corresponds to having the entries x_{ik} of xi independent, the entries x_{jk} of xj independent, and x_{ik} and x_{jk} independent.

3.1.1 Lemma

Let z ∼ N(µ, Σ), where Σ is a diagonal matrix with entries σ_k² in row k. Then the distribution of z^T z has the following characteristic function:

φ(ω) = ∏_{k=1}^{n} [(1 + 2jωσ_k²)^{−1/2} exp(−jω µ_k² / (1 + 2jωσ_k²))].   (16)

3.1.2 Proof

Since Σ is diagonal, we have

|I + 2jωΣ|^{−1/2} = ∏_{k=1}^{n} (1 + 2jωσ_k²)^{−1/2}   (17)

and

µ^T (I + 2jωΣ)^{−1} µ = ∑_{k=1}^{n} µ_k² / (1 + 2jωσ_k²).   (18)

Now, plugging (17) and (18) in (2), we get (16).

3.1.3 Lemma

Let z1 ∼ N(µ1, σ1²). Then the distribution of y ≐ z1² has the following characteristic function:

φ(ω) = (1 + 2jωσ1²)^{−1/2} exp(−jω µ1² / (1 + 2jωσ1²)),   (19)

and the following probability density function:

f(y) = (2πyσ1²)^{−1/2} exp{−(y + µ1²) / (2σ1²)} cosh(√y µ1 / σ1²).   (20)

3.1.4 Proof

The characteristic function follows from Lemma 3.1.1 by putting n = 1. Now, we can find the corresponding pdf as follows. For y > 0, we have

F_Y(y) = P(Y ≤ y) = P(Z² ≤ y) = P(−√y ≤ Z ≤ √y)
       = (2πσ1²)^{−1/2} ∫_{−√y}^{√y} exp{−(t − µ1)² / (2σ1²)} dt
       = (2πσ1²)^{−1/2} [∫_0^{√y} exp{−(t − µ1)² / (2σ1²)} dt − ∫_0^{−√y} exp{−(t − µ1)² / (2σ1²)} dt]
       = (1/2)(2πσ1²)^{−1/2} [∫_0^y s^{−1/2} exp{−(√s − µ1)² / (2σ1²)} ds + ∫_0^y s^{−1/2} exp{−(√s + µ1)² / (2σ1²)} ds],

where in the last step we applied the change of variables s = t², so that dt = ds/(2√s) for t > 0 and dt = −ds/(2√s) for t < 0. Now, we get the pdf from the above cumulative distribution function (cdf) as follows:

f_Y(y) = ∂F_Y(y)/∂y
       = (1/2)(2πσ1²y)^{−1/2} [exp{−(√y − µ1)² / (2σ1²)} + exp{−(√y + µ1)² / (2σ1²)}]
       = (1/2)(2πσ1²y)^{−1/2} exp{−(y + µ1²) / (2σ1²)} [exp{√y µ1 / σ1²} + exp{−√y µ1 / σ1²}]
       = (2πyσ1²)^{−1/2} exp{−(y + µ1²) / (2σ1²)} cosh(√y µ1 / σ1²).
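The density (20) can also be checked against simulation. The sketch below (with illustrative µ1 and σ1) compares (20) with a histogram estimate of the density of z1²:

```python
import numpy as np

rng = np.random.default_rng(8)
mu1, sig1 = 1.2, 0.9                       # illustrative parameters

def f(y):
    """Density (20) of y = z1^2 for z1 ~ N(mu1, sig1^2)."""
    s2 = sig1**2
    return ((2.0 * np.pi * y * s2) ** -0.5
            * np.exp(-(y + mu1**2) / (2.0 * s2))
            * np.cosh(np.sqrt(y) * mu1 / s2))

# Compare (20) with a histogram estimate from simulated z1^2.
y = (mu1 + sig1 * rng.standard_normal(500_000))**2
for t in [0.5, 2.0, 5.0]:
    print(f(t), np.mean(np.abs(y - t) < 0.05) / 0.1)
```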

3.1.5 Notes

It is noted that the characteristic function in Lemma 3.1.1 is that of the sum of n independent random variables, each having the distribution in Lemma 3.1.3 with parameters (µ_k, σ_k²), k = 1, . . . , n. On another note, for the distribution in Lemma 3.1.3, when µ1 = 0 and σ1² = 1, the distribution is chi-squared with one degree of freedom. On the other hand, when µ1 ≠ 0 and σ1² = 1, the distribution is non-central chi-squared with one degree of freedom and noncentrality parameter µ1².

3.2 Case: Σ = σ²I

This corresponds to having the entries x_{ik} of xi independent, the entries x_{jk} of xj independent, and x_{ik} and x_{jk} independent, with a common variance across entries.

3.2.1 Lemma

Let z ∼ N(µ, σ²I). Then the distribution of z^T z has the following characteristic function:

φ(ω) = (1 + 2jωσ²)^{−n/2} exp(−jωλ / (1 + 2jωσ²)),   (21)

where λ = ∑_{k=1}^{n} µ_k², and the following probability density function:

f(y) = 2^{−n/2} exp[−(y + λ)/(2σ²)] ∑_{j=0}^{∞} y^{n/2+j−1} λ^j / [Γ(n/2 + j) 2^{2j} j! σ^{n+4j}]
     = (1/(2σ²)) (y/λ)^{(n−2)/4} exp(−(y + λ)/(2σ²)) I_{(n−2)/2}(√(λy)/σ²),

where I_α(·) is the modified Bessel function of the first kind of degree α. This is the noncentral gamma distribution with parameters a = n/2, b = 2σ², and c = λ/(4σ⁴).

3.2.2 Proof

The characteristic function follows from Lemma 3.1.1 by taking σ_k² = σ² and λ = ∑_{k=1}^{n} µ_k². Now, to obtain the corresponding pdf, we proceed as follows. First note that the characteristic function and pdf of a noncentral chi-squared random variable with n degrees of freedom and noncentrality parameter λ are given as follows:

ϕ_λ(ω) = (1 + 2jω)^{−n/2} exp(−jωλ / (1 + 2jω)),

and

g(y) = 2^{−n/2} exp[−(y + λ)/2] ∑_{j=0}^{∞} y^{n/2+j−1} λ^j / [Γ(n/2 + j) 2^{2j} j!]
     = (1/2) (y/λ)^{(n−2)/4} exp(−(λ + y)/2) I_{(n−2)/2}(√(λy)),

where I_α(·) is the modified Bessel function of the first kind of degree α, given by the following infinite sum:

I_α(y) = (y/2)^α ∑_{i=0}^{∞} (y/2)^{2i} / [i! Γ(α + i + 1)].   (22)

See [6, pp. 900–932] for more information on the various kinds of Bessel functions and some of the identities and approximations associated with them. Thus, from the properties of the Fourier transform, since φ(ω) = ϕ_{λ/σ²}(ωσ²), we get f(y) = (1/σ²) g_{λ/σ²}(y/σ²). And this concludes the proof.

3.2.3 Note

A non-central gamma distribution is defined as having the following characteristic function:

ϕ(ω) = (1 + jωb)^{−a} exp(−jωb²c / (1 + jωb)),

where a, b, and c are referred to as the shape, scale, and non-centrality parameters, respectively [10]. The corresponding probability density function is given as

γ_{a,b,c}(x) = b^{−a} e^{−bc} e^{−x/b} x^{a−1} ∑_{k=0}^{∞} (cx)^k / [k! Γ(a + k)]
            = b^{−a} e^{−bc} e^{−x/b} (x/c)^{(a−1)/2} I_{a−1}(2√(cx)), x > 0.
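Since scipy exposes the noncentral chi-squared distribution, the scaling argument in the proof can be verified directly. The sketch below (with illustrative parameters) compares the empirical cdf of y = z^T z with the cdf of σ² times a noncentral χ²(n, λ/σ²) variable:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, sigma2 = 4, 2.0                          # illustrative dimension and variance
mu = np.array([1.0, -0.5, 0.0, 2.0])        # illustrative mean of z
lam = np.sum(mu**2)                         # lambda = sum of mu_k^2

# y = z^T z with z ~ N(mu, sigma^2 I) is sigma^2 times a noncentral
# chi-squared(n, lam/sigma^2) variable: f(y) = (1/sigma^2) g_{lam/sigma^2}(y/sigma^2).
z = mu + np.sqrt(sigma2) * rng.standard_normal((200_000, n))
y = np.sum(z**2, axis=1)

for t in [2.0, 8.0, 20.0]:
    print(np.mean(y <= t), stats.ncx2.cdf(t / sigma2, df=n, nc=lam / sigma2))
```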

3.3 Case: Σ Is Diagonal and µ = 0

Then from (16) we get

φ(ω) = ∏_{k=1}^{n} (1 + 2jωσ_k²)^{−1/2}.   (23)

But this is the characteristic function of the sum of n independent random variables, each distributed as Γ(1/2, 2σ_k²), where Γ(a, b) is the gamma distribution with probability density function

γ_{a,b}(x) = x^{a−1} e^{−x/b} / (b^a Γ(a)), x > 0,

Γ(a) is the gamma function, defined as

Γ(a) = ∫_0^∞ x^{a−1} exp(−x) dx,

and the characteristic function is

ϕ(ω) = (1 + jωb)^{−a}.

The distribution of such sums is considered in [16], from which we get the Stacy distribution with the following pdf:

f(x) = x^{n/2−1} ∑_{i=0}^{∞} [(−x)^i / Γ(n/2 + i)] ∑_{k_1+···+k_n=i} ∏_{l=1}^{n} Γ(k_l + 1/2) / [k_l! √π (2σ_l²)^{k_l+1/2}].

Note now that for an integer m, we have

Γ(m + 1/2) = (2m − 1)!! √π / 2^m.

So, putting m = k_l, we get

f(x) = x^{n/2−1} ∑_{i=0}^{∞} [(−x)^i / Γ(n/2 + i)] ∑_{k_1+···+k_n=i} ∏_{l=1}^{n} (2k_l − 1)!! / [k_l! 2^{2k_l+1/2} σ_l^{2k_l+1}].

Since we have ∑_{j=1}^{n} k_j = i, we can rewrite f(x) as

f(x) = x^{n/2−1} ∑_{i=0}^{∞} [(−x)^i / (Γ(n/2 + i) 2^{2i+n/2})] ∑_{k_1+···+k_n=i} ∏_{l=1}^{n} (2k_l − 1)!! / (k_l! σ_l^{2k_l+1}).

Note again that for an integer m, we have

Γ(m/2) = (m − 2)!! √π / 2^{(m−1)/2}.

Thus, putting m = n + 2i, we get

f(x) = (x^{n/2−1} / √(2π)) ∑_{i=0}^{∞} [(−x/2)^i / (n + 2(i − 1))!!] ∑_{k_1+···+k_n=i} ∏_{l=1}^{n} (2k_l − 1)!! / (k_l! σ_l^{2k_l+1}).

Now note that for an integer m, we have

(2m − 1)!! = (2m)! / (2^m m!).

Hence, putting m = k_l, and noting that ∑_{j=1}^{n} k_j = i, we get

f(x) = (x^{n/2−1} / √(2π)) ∑_{i=0}^{∞} [(−x)^i / (2^{2i} (n + 2(i − 1))!!)] ∑_{k_1+···+k_n=i} ∏_{l=1}^{n} C_{k_l}^{2k_l} σ_l^{−(2k_l+1)}.
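The representation behind (23), namely that z^T z is a sum of independent Γ(1/2, 2σ_k²) variables, is straightforward to check by simulation (illustrative variances below); the Stacy series itself is harder to evaluate numerically and is not attempted here.

```python
import numpy as np

rng = np.random.default_rng(2)
sig2 = np.array([0.5, 1.0, 2.0, 4.0])       # illustrative diagonal variances
N = 200_000

# z^T z with z ~ N(0, diag(sig2)) versus a sum of independent Gamma(1/2, 2 sigma_k^2).
z = rng.standard_normal((N, sig2.size)) * np.sqrt(sig2)
y_direct = np.sum(z**2, axis=1)
y_gamma = sum(rng.gamma(shape=0.5, scale=2.0 * s, size=N) for s in sig2)

# The two samples should agree in distribution; compare a few quantiles.
for p in [0.25, 0.5, 0.9]:
    print(np.quantile(y_direct, p), np.quantile(y_gamma, p))
```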

3.4 Case: Σ = σ²I and µ = 0

In this case, (23) simplifies to

φ(ω) = (1 + 2jωσ²)^{−n/2}.   (24)

This is the characteristic function of a gamma distributed random variable with parameters a = n/2 and b = 2σ².
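A quick check of this case against scipy's gamma distribution (with illustrative n and σ²):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, sigma2 = 6, 0.8                           # illustrative parameters

y = np.sum(sigma2 * rng.standard_normal((200_000, n))**2, axis=1)

# Gamma(a, b) with shape a = n/2 and scale b = 2 sigma^2, as in (24).
print(y.mean(), stats.gamma.mean(a=n / 2, scale=2 * sigma2))
print(np.quantile(y, 0.9), stats.gamma.ppf(0.9, a=n / 2, scale=2 * sigma2))
```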

3.5 Case: Σ = I and µ = 0

In this case we get

φ(ω) = (1 + 2jω)^{−n/2}.   (25)

This is the characteristic function of a chi-squared random variable with n degrees of freedom. The corresponding pdf is given by

f(x) = x^{n/2−1} exp(−x/2) / (Γ(n/2) 2^{n/2}), x > 0.

In this case, the distance dij is chi distributed with n degrees of freedom, having the corresponding pdf

f(x) = x^{n−1} exp(−x²/2) / (Γ(n/2) 2^{n/2−1}), x > 0.
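This case ties back to the clustering motivation. The sketch below (with an illustrative n) confirms that the distances |z| for z ∼ N(0, I) follow the chi distribution with n degrees of freedom:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 5                                        # illustrative dimension

# With Sigma = I and mu = 0, z ~ N(0, I) and d = |z| should be chi(n) distributed.
z = rng.standard_normal((200_000, n))
d = np.linalg.norm(z, axis=1)

print(d.mean(), stats.chi.mean(n))           # sample vs theoretical mean
print(np.quantile(d, 0.5), stats.chi.ppf(0.5, n))
```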

4 Concluding Remarks

We presented the distribution of the squared Euclidean distance between two multivariate normally distributed points. We then reduced this distribution for special cases based on the mean and covariance. We believe that these results may find application in the area of cluster identification, for example by developing an algorithm that infers the cluster to which a set of points belongs from the histogram of their distances to a particular point.

Index of Notations

• n!! is the double factorial function, defined as

n!! = n(n − 2) · · · 3 · 1 if n > 0 is odd,
      n(n − 2) · · · 4 · 2 if n > 0 is even,
      1 if n = −1 or n = 0.

• C_i^j, where i and j are non-negative integers with j ≥ i, denotes the binomial coefficient, defined as

C_i^j = j! / (i!(j − i)!).

• cosh_i(z) is the alternating hyperbolic cosine/sine function, defined as

cosh_i(z) = cosh(z) if i is even, sinh(z) if i is odd
          = (e^z + (−1)^i e^{−z}) / 2.

References

[1] C. Bishop (1995), "Neural networks for pattern recognition," Oxford University Press.

[2] A. G. Constantine (1963), "Some non-central distribution problems in multivariate analysis," The Annals of Mathematical Statistics, Vol. 34, No. 4, pp. 1270–1285, December 1963.

[3] C. Donati-Martin, Y. Doumerc, H. Matsumoto, and M. Yor (2004), "Some properties of the Wishart processes and a matrix extension of the Hartman-Watson laws," Publications of the Research Institute for Mathematical Sciences, Kyoto University, Kyoto, Japan, Vol. 40, No. 4, pp. 1385–1412.

[4] I. K. Fodor (2002), "A survey of dimension reduction techniques," Lawrence Livermore National Laboratory (LLNL) Technical Report, UCRL-ID-148494, University of California, Livermore, CA, 2002.

[5] Y. Fujikoshi (1970), "Asymptotic expansions of distributions of test statistics in multivariate analysis," Journal of Science of Hiroshima University, Series A-1, Vol. 34, pp. 73–144.

[6] I. S. Gradshteyn and I. M. Ryzhik (2000), "Table of integrals, series, and products," Sixth Edition, Academic Press.

[7] J. A. Gubner (1996), "Computation of shot-noise probability distributions and densities," SIAM Journal on Scientific Computing, Vol. 17, No. 3, pp. 750–761, May 1996.

[8] C. S. Herz (1955), "Bessel functions of matrix argument," The Annals of Mathematics, 2nd Ser., Vol. 61, No. 3, pp. 474–523, May 1955.

[9] J. R. Hoffman, M. G. Cotton, R. J. Achatz, R. N. Statz, and R. A. Dalke (2001), "Measurements to determine potential interference to GPS receivers from ultrawideband transmission systems," National Telecommunications and Information Administration (NTIA) Report 01-384, Institute for Telecommunication Sciences, Boulder, Colorado, February 2001.

[10] M. M. Islam (2003), "Family of non-central transformed chi-square distributions," Pakistan Journal of Statistics, Vol. 19(3), pp. 325–330.

[11] N. L. Johnson, S. Kotz, and N. Balakrishnan (1994), "Continuous univariate distributions," Vol. 1, Second Edition, John Wiley and Sons, Inc.

[12] B. Jonsson (1978), "Kinesiology: with special reference to electromyographic kinesiology," Contemporary Clinical Neurophysiology, EEG Suppl. No. 34, pp. 417–428.

[13] L. Kaufman and P. J. Rousseeuw (1990), "Finding groups in data: an introduction to cluster analysis," John Wiley and Sons, Inc.

[14] S. Kourouklis and P. G. Moschopoulos (1985), "On the distribution of the trace of a noncentral Wishart matrix," Metron, Vol. XLIII, No. 1-2, pp. 85–92.

[15] R. J. Muirhead (2005), "Aspects of multivariate statistical theory," John Wiley and Sons, Inc.

[16] E. W. Stacy (1962), "A generalization of the gamma distribution," The Annals of Mathematical Statistics, Vol. 33, No. 3, pp. 1187–1192.

[17] L. L. Scharf (1991), "Statistical signal processing: detection, estimation, and time series analysis," Addison-Wesley.