Automation and Remote Control, Vol. 64, No. 8, 2003, pp. 1264–1274. Translated from Avtomatika i Telemekhanika, No. 8, 2003, pp. 69–81. Original Russian text copyright © 2003 by Shaikin.

STOCHASTIC SYSTEMS

Statistical Estimation and Classification on Commutative Covariance Structures¹

M. E. Shaikin

Trapeznikov Institute of Control Sciences, Russian Academy of Sciences, Moscow, Russia

Received January 27, 2003

Abstract—Statistical inference is investigated under the following constraints on the covariance structure for the observation vector: covariance matrices belong to some commutative matrix algebra. Commutative approximation of arbitrary covariance structures and statistical estimation of the parameters of a given commutative structure are studied. The results are applied to statistical classification of Gaussian vectors having commutative covariance structure.

1. INTRODUCTION

Statistical formulations of the problems of controlling an object, filtering its phase vector or output signal, classifying or clustering statistical ensembles, and many other applied problems in statistical decision theory usually involve the choice of a probabilistic model for the random (internal or external) factors acting on the controlled object or observation system. The range of covariance models for the components of random factors is rather wide; it extends from the simplest model of uncorrelated identically distributed components with a single unknown parameter (the variance) to a general model of arbitrarily correlated components with n(n + 1)/2 parameters, where n is the number of components. The agreement of a model with observation data need not necessarily grow with the complexity of the model.

In a linear m-parameter model, the covariance matrix is represented as Σ = ∑_{i=1}^m αi Ai, where the Ai are given matrices and the αi are parameters. Such covariance structures are natural for certain experiment designs [1]. In general, the matrices Ai are not given and must be estimated, along with the parameters αi, from observation data. If there are several covariance matrices Σ1, ..., Σs (this is typical of the Gaussian formulation of s-alternative classification), a linear covariance model common to all of Σ1, ..., Σs can be used when the matrices Σ1, ..., Σs commute pairwise. Indeed, in this case there exists an orthonormal base {p1, p2, ..., pn} common to all the Σi, and the representation Σi = P Λi P′ holds for all i = 1, ..., s, where P is an orthogonal matrix with columns p1, p2, ..., pn, Λi = diag(λi1, ..., λin) is a diagonal matrix with nonnegative elements, and the prime denotes the transpose of P. Clearly, the set C_P of all matrices of the type Σ = P Λ P′ for a fixed orthogonal matrix P, where Λ ≥ 0 is a diagonal matrix, is an n-dimensional commutative matrix algebra. Consequently, the covariance structure C_P is also called a commutative (and algebraic) structure.

Statistical inference for algebraic covariance structures that are not necessarily commutative, but have known base matrices Ai, is investigated in [2] in general and in [1] as applied to special structures. Similar problems for linear (not necessarily algebraic) structures were studied in the earlier work [3]. In this paper, we study statistical estimation for commutative covariance structures of the type C_P and estimate the scalar parameters λi as well as the matrix parameter P.
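The defining property of the structure C_P can be illustrated numerically. The following sketch (ours, not from the paper; assumes numpy) builds several matrices P Λi P′ from one fixed orthogonal P, checks that they commute pairwise, and checks that the eigenvectors of a generic member diagonalize every member.

```python
# Sketch (ours): matrices of the form P Lambda_i P' for one fixed orthogonal P
# commute pairwise, and a generic member's eigenvectors diagonalize them all.
import numpy as np

rng = np.random.default_rng(0)
n, s = 4, 3
P, _ = np.linalg.qr(rng.normal(size=(n, n)))          # common orthonormal base
Sigmas = [P @ np.diag(rng.uniform(0.5, 2.0, size=n)) @ P.T for _ in range(s)]

# Pairwise commutation characterizes the structure C_P.
for A in Sigmas:
    for B in Sigmas:
        assert np.allclose(A @ B, B @ A, atol=1e-10)

# Eigenvectors of one member (distinct eigenvalues) diagonalize every member.
_, V = np.linalg.eigh(Sigmas[0])
for A in Sigmas:
    D = V.T @ A @ V
    assert np.allclose(D, np.diag(np.diag(D)), atol=1e-8)
```

With random spectra the first member has distinct eigenvalues almost surely, which is what makes its eigenbasis unique and shared.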

¹ This work was supported by the Russian Foundation for Basic Research, project no. 01-01-00758.


2. FORMULATION OF THE PROBLEM

Let x1, ..., xN, xi ∈ R^n, be a repeated sample from a Gaussian distribution with parameters µ and Σ. The covariance matrix Σ is given in the form Σ = ∑_{i=1}^m σi Ai, where the σi are unknown parameters and the Ai are n × n symmetric linearly independent matrices, which are also not known. More exactly, we assume that all the Ai are diagonalized by the same orthogonal matrix P, i.e., Ai = P Λi P′, where the Λi are diagonal matrices. Moreover, there exists at least one set of coefficients σ1, ..., σm for which the matrix ∑_{i=1}^m σi Ai is positive definite. Our problem is to find estimates (for example, maximum likelihood estimates) of the unknown vector parameter µ, the scalar parameters σ1, ..., σm, and the matrix parameter P. The logarithm of the likelihood function is proportional to

(2/N) ln L = −n ln(2π) − ln|Σ| − tr Σ⁻¹C − (x̄ − µ)′Σ⁻¹(x̄ − µ),   (1)

where

x̄ = (1/N) ∑_{k=1}^N xk,   C = (1/N) ∑_{k=1}^N (xk − x̄)(xk − x̄)′.
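The statistics entering (1) are straightforward to compute. A sketch (ours; the function name is hypothetical) evaluates the right-hand side of (1) for a candidate (µ, Σ) and checks that the µ-dependent term penalizes any µ other than the sample mean.

```python
# Sketch (ours, names hypothetical): the statistics of Eq. (1) and the value
# proportional to (2/N) ln L for a candidate pair (mu, Sigma).
import numpy as np

def loglik_value(X, mu, Sigma):
    N, n = X.shape
    x_bar = X.mean(axis=0)
    D = X - x_bar
    C = D.T @ D / N                              # ML (biased) sample covariance
    Sinv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    d = x_bar - mu
    return -n * np.log(2 * np.pi) - logdet - np.trace(Sinv @ C) - d @ Sinv @ d

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
# The mu-term vanishes at mu = x_bar, so x_bar maximizes (1) over mu.
v_at_mean = loglik_value(X, X.mean(axis=0), np.eye(3))
v_off = loglik_value(X, X.mean(axis=0) + 0.5, np.eye(3))
assert v_at_mean > v_off
```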

Function (1) is maximized with respect to µ at µ̂ = x̄; the last term in (1) then vanishes. The logarithm of the reduced likelihood function is proportional to

−n ln(2π) − ln|Σ| − tr Σ⁻¹C,   (2)

which is used in Section 4 for estimating the parameters σ1, ..., σm, Λ1, ..., Λm, P of the covariance matrix Σ = ∑_{i=1}^m σi P Λi P′. In Section 3, we state the basic preliminaries on commutative algebras with several commuting generators. In Section 5 we study the commutative approximation of a noncommutative covariance structure. Section 6 is devoted to many-alternative classification of Gaussian vectors with a commutative structure for the covariances.

3. PRELIMINARIES ON COMMUTATIVE COVARIANCE STRUCTURES

Let A1, ..., Am be self-adjoint operators in a real n-dimensional linear space. We denote the real algebra with unity E generated by the elements A1, ..., Am by A = A(E, A1, ..., Am). By definition, A is the linear span of the products Ai1 Ai2 ··· Aik, where the factors need not all be distinct. The algebra with one generator A is the algebra P(A) of all polynomials in the variable A. The algebra P(A) is obviously commutative. The dimension of P(A) is equal to the degree of the minimal annihilating polynomial M(λ; A) of A, dim P(A) = deg M(λ; A) ≤ n, and the bound n is attained for an operator A whose eigenvalues are all distinct [4]. Combining P(A) with a self-adjoint operator B ∉ P(A), let us consider the algebra A(E, A, B), which need not be commutative. Since we are interested in commutative covariance structures, we assume that A and B commute: [A, B] := AB − BA = 0. The operator B is then a scalar-type operator in the terminology of [4] or, equivalently, a simple operator in the terminology of [5]. Commuting scalar-type operators have a common proper base [4], and this assertion extends to an arbitrary set of pairwise commuting scalar-type operators. Obviously, the converse is also true: operators having a common base commute. Let us also state the matrix formulation of this assertion. Let {p1, ..., pn} denote the common proper base of the operators A1, ..., Am and let A1, ..., Am be their matrices in some base {e1, ..., en}. Then the above assertion implies that pairwise commuting matrices A1, ..., Am of simple structure can be reduced to diagonal form simultaneously, i.e., by the same similarity transformation: Ai = T Λi T⁻¹, i = 1, 2, ..., m, where the Λi are diagonal matrices. The matrix T transforms


the base {e1, ..., en} to the base {p1, ..., pn}. The kth column of the matrix T contains the coordinates (in the base {e1, ..., en}) of the eigenvector pk corresponding to the eigenvalue λik of the matrix Ai, k = 1, 2, ..., n. This result is not directly applicable to covariance structures, because under the change of base by the matrix T the covariance matrix Σ is transformed into the diagonal matrix Λ congruently: Σ = T Λ T*, where, in general, T* ≠ T⁻¹. But if two self-adjoint (or even merely normal, see [4]) matrices are similar, then they are also unitarily similar, Σ = U Λ U*, where U ∈ Mn(C) is a unitary matrix. Furthermore, since Σ ∈ Mn(R) and Λ ∈ Mn(R), there exists a real orthogonal matrix P ∈ Mn(R) such that Σ = P Λ P′ [6]. Thus, an arbitrary set {A1, ..., Am} of self-adjoint operators generates a commutative covariance structure if and only if the operators A1, ..., Am have a common proper base. Consequently, any operator A ∈ A(E, A1, ..., Am) in an arbitrary (but fixed) base is represented by a matrix A = P Λ P′, where Λ is a real diagonal matrix with nonnegative elements and P is an orthogonal matrix that does not depend on the operator A but only on the base {pi}. Denoting the columns of the matrix P by pi, we can write

A = ∑_{j=1}^n λj pj pj′,

or, in operator form,

A = ∑_{j=1}^n λj pj ⊗ pj.

In particular, the generators Ai ∈ A admit the representation Ai = ∑_{j=1}^n λij pj ⊗ pj (the λij are the eigenvalues of the operator Ai), and the linear combination K = ∑_{i=1}^m σi Ai admits the representation

K = ∑_{i=1}^m ∑_{j=1}^n σi λij pj ⊗ pj.

The operators Ej = pj ⊗ pj are linearly independent if the common eigensubspaces Ei of the operators A1, ..., Am are one-dimensional; in this case, the spectrum of each of the operators A1, ..., Am consists of n elements. But if dim Ei = ni ≥ 1 and the pj, j ∈ Ji, form a base (of cardinality ni) for the subspace Ei, then the operators

Ei = ∑_{j∈Ji} pj ⊗ pj,   i = 1, ..., k,   n1 + ... + nk = n,

are linearly independent. The operators Ei are called the primitive idempotents of the commutative algebra A. They are self-adjoint, mutually orthogonal operators: Ei′ = Ei, Ei Ej = Ej Ei = δij Ei. There exists an algorithm for finding the primitive idempotents Ei when the eigenvalues of the operators A1, ..., Am are known. Thus, for an algebra with a single symmetric generator A1 = A having distinct eigenvalues λ1 < λ2 < ... < λk, the operators Ei are defined by Ei = πi(A), where the πi are polynomials with real coefficients (hence, in particular, the representation A = ∑_{i=1}^k λi Ei is unique). The polynomials πi are of the form

πi(x) = ∏_{j≠i} (x − λj)/(λi − λj)

[7]. In the general case of an arbitrary number of generators m > 1, let λi1 < λi2 < ... < λik be all the distinct eigenvalues of the operator Ai without regard for their multiplicity. Setting

πiαi(x) = ∏_{βi≠αi} (x − λiβi)/(λiαi − λiβi)   and   Eiαi = πiαi(Ai),

we obtain

Eα = ∏_{i=1}^m Eiαi,

where α = (α1, ..., αm) runs through all possible sequences of length m over the alphabet {1, 2, ..., k} [8].

4. MAXIMUM LIKELIHOOD ESTIMATES FOR THE PARAMETERS OF A COMMUTATIVE COVARIANCE STRUCTURE

First let us assume that the matrices Ai in the representation Σ =

∑_{i=1}^m σi Ai are known and only the parameters σi are to be estimated. Function (1) is maximized over the parameters σ1, ..., σm by differentiating (2) with respect to these parameters:

∂Σ/∂σi = Ai,   ∂Σ⁻¹/∂σi = −Σ⁻¹AiΣ⁻¹,   ∂ ln|Σ|/∂σi = tr Σ⁻¹Ai.

Hence the derivatives of function (2) with respect to σi are −tr Σ⁻¹Ai + tr Σ⁻¹AiΣ⁻¹C, and the equations for the maximum likelihood estimates take the form

tr (∑_{j=1}^m σ̂j Aj)⁻¹ Ai = tr (∑_{j=1}^m σ̂j Aj)⁻¹ Ai (∑_{j=1}^m σ̂j Aj)⁻¹ C,   i = 1, ..., m.   (3)

For fixed Ai, system (3) has at least one solution guaranteeing the positive definiteness of the matrix ∑_{j=1}^m σ̂j Aj if the matrix C is positive definite.
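The polynomial construction of the primitive idempotents from Section 3 (Ei = πi(A)) underlies the multiplicity case treated below. A numerical sketch (ours), for a single symmetric generator with known distinct eigenvalues:

```python
# Sketch (ours) of the Section 3 recipe: E_i = pi_i(A), where
# pi_i(x) = prod_{j != i} (x - lambda_j)/(lambda_i - lambda_j).
import numpy as np

def primitive_idempotents(A, eigvals):
    """Evaluate E_i = pi_i(A) for each known distinct eigenvalue of A."""
    n = A.shape[0]
    Es = []
    for i, li in enumerate(eigvals):
        E = np.eye(n)
        for j, lj in enumerate(eigvals):
            if j != i:
                E = E @ (A - lj * np.eye(n)) / (li - lj)
        Es.append(E)
    return Es

# Symmetric A with known distinct spectrum {1, 2, 5}.
P, _ = np.linalg.qr(np.random.default_rng(2).normal(size=(3, 3)))
lam = [1.0, 2.0, 5.0]
A = P @ np.diag(lam) @ P.T
Es = primitive_idempotents(A, lam)

# E_i E_j = delta_ij E_i, sum E_i = I, and A = sum lam_i E_i is recovered.
assert np.allclose(sum(Es), np.eye(3), atol=1e-9)
assert np.allclose(Es[0] @ Es[1], 0.0, atol=1e-9)
assert np.allclose(Es[0] @ Es[0], Es[0], atol=1e-9)
assert np.allclose(sum(l * E for l, E in zip(lam, Es)), A, atol=1e-9)
```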

If the matrices Ai are not known and Ai = ∑_{j=1}^n λij pj pj′, then

Σ = ∑_i σi Ai = ∑_i ∑_j σi λij pj pj′.

Using the notation µj = ∑_{i=1}^m σi λij, we obtain

Σ = ∑_{j=1}^n µj pj pj′ = P M P′,

where M = diag{µ1, ..., µn}. Let us find the maximum likelihood estimates of the parameters µj and pj. Note that ln|Σ| = ln|P M P′| = ln|M| and tr Σ⁻¹C = tr P M⁻¹P′C = tr M⁻¹P′CP in the reduced likelihood function (2). Therefore, the equations for the maximum likelihood estimates of the parameters µj are obtained by differentiating the function

− ln|M| − tr M⁻¹P′CP   (4)

with respect to µj. If the variables µj are all algebraically independent, we obtain ∂ ln|M|/∂µi = 1/µi, and since tr M⁻¹P′CP = ∑_{i=1}^n (1/µi) pi′Cpi, we have

∂(tr M⁻¹P′CP)/∂µi = −(1/µi²) pi′Cpi.

Hence the equations for the maximum likelihood estimates take the form

−1/µi + (1/µi²) pi′Cpi = 0,  or  µ̂i = pi′Cpi.   (5)
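Estimate (5) is easy to check numerically. Taking the pi to be eigenvectors of the empirical covariance C (the choice derived below), the quadratic forms pi′Cpi are exactly the eigenvalues of C. A sketch of ours:

```python
# Sketch (ours): for p_i the eigenvectors of the empirical covariance C,
# the estimates (5), mu_hat_i = p_i' C p_i, are the eigenvalues of C.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 4)) @ rng.normal(size=(4, 4))
X -= X.mean(axis=0)
C = X.T @ X / X.shape[0]

mu_hat, P_hat = np.linalg.eigh(C)       # eigenvalues and orthonormal columns

# Each mu_hat_i equals the quadratic form p_i' C p_i of Eq. (5).
for i in range(4):
    p = P_hat[:, i]
    assert np.isclose(p @ C @ p, mu_hat[i])

# Sigma_hat = P_hat diag(mu_hat) P_hat' reconstructs C exactly.
assert np.allclose(P_hat @ np.diag(mu_hat) @ P_hat.T, C)
```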

Now we find the maximum likelihood estimates for the eigenvectors pi of the matrix Σ. In expression (4), only the last term, equal to ϕ = −∑_{i=1}^n (1/µi) pi′Cpi, depends on the pi. This function must be maximized over the vector variables pi, i = 1, ..., n, satisfying the constraints pi′pj = δij. But it suffices to use only the constraints pi′pi = 1, i = 1, ..., n; the remaining constraints pi′pj = 0, i ≠ j, are then satisfied automatically. Indeed, let us minimize the function

ϕ = ∑_{i=1}^n pi′Cpi/µi − ∑_{i=1}^n νi(pi′pi − 1),

where the νi are Lagrange multipliers. Differentiating ϕ with respect to pi and equating the derivatives ∂ϕ/∂pi to zero, we obtain

Cpi/µi = νi pi,   i = 1, ..., n.   (6)

For the ith equation in (6) to have a nonzero solution, the matrix C/µi − νiI must be degenerate: |C − µiνiI| = 0. The roots θ1, ..., θn of the equation |C − θI| = 0 are ordered such that θi = µiνi. Left-multiplying (6) by pi′, we obtain pi′Cpi/µi = νi pi′pi = νi, and for the expression pi′Cpi/µi to be minimal, νi must be taken equal to νmin = min_j θj/µj. The corresponding eigenvector is determined from the normalized solution of the equation (C/µi − νmin I) p̂i = 0. Hence p̂i is an eigenvector of the matrix C corresponding to the eigenvalue θi = µiνmin = µi min_j θj/µj. Now it is a simple matter to choose orthogonal p̂i such that p̂i′p̂j = δij. By (5), µ̂i = p̂i′C p̂i. Therefore, the eigenvalues and eigenvectors of the empirical covariance matrix C are not merely ad hoc estimates of the eigenvalues and eigenvectors of the unknown covariance matrix Σ; they are their maximum likelihood estimates.

Result (5) is obtained under the assumption that the likelihood function (4) depends on n unknown parameters µ1, ..., µn. It is not excluded, however, that the spectrum of the matrix Σ consists of only k < n points µ1, ..., µk. Let n1, ..., nk be the multiplicities of the eigenvalues µ1, ..., µk, n1 + ... + nk = n. Then, as has already been mentioned, the operators Ei = ∑_{j∈Ji} pj ⊗ pj

are linearly independent and admit the representations

Ai = ∑_{j=1}^k λij Ej,  i = 1, ..., m,   Σ = ∑_{i=1}^m σi Ai = ∑_{i=1}^m ∑_{j=1}^k σi λij Ej,

where the λij, j = 1, ..., k, are the eigenvalues of the matrix Ai. Setting aside the difficult problem of estimating the unknown parameters n1, ..., nk from a sample, let us consider a simple case of practical value in which the spectra {λi1, λi2, ..., λik} of the matrices Ai, i = 1, ..., m, and the multiplicities n1, ..., nk are known. Then it is natural to take

Êj = ∑_{i∈Jj} p̂i p̂i′,   µ̂j = ∑_{i=1}^m σ̂i λij,  j = 1, ..., k,   Âi = ∑_{j=1}^k λij Êj,  i = 1, ..., m.


Substituting these values into (3) and using tr Êj = nj, we obtain

∑_{q=1}^k nq λiq / (∑_{j=1}^m σ̂j λjq) = ∑_{q=1}^k λiq vq / (∑_{j=1}^m σ̂j λjq)²,   i = 1, ..., m,   (7)

where

v1 = ∑_{j=1}^{n1} µ̂j,   v2 = ∑_{j=n1+1}^{n1+n2} µ̂j,   ...,   vk = ∑_{j=n−nk+1}^{n} µ̂j.

Here vq/nq are the estimates of the eigenvalues µq (of multiplicity nq) of the covariance matrix Σ with spectrum {µ1, ..., µk}, q = 1, ..., k.

Let us consider the special case k = m. Since the matrices A1, ..., Am are assumed to be linearly independent, the matrices Λ1, ..., Λm are also linearly independent, and the m × m matrix of coefficients (λij) is nonsingular. The equations

∑_{j=1}^m σ̂j λjq = vq/nq,   q = 1, ..., k,

are consequently solvable for σ̂1, ..., σ̂m. Substituting these σ̂j into (7), we find that (7) is satisfied, i.e., the σ̂j are the maximum likelihood estimates of the parameters σj, j = 1, ..., m.
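In the case k = m the estimates σ̂j follow from a plain linear solve. A sketch with hypothetical spectra and multiplicities (all numbers ours):

```python
# Sketch (ours): with k = m and known spectra lambda_{ij}, the sigma_hat
# solve the linear system  sum_j sigma_j * lambda_{jq} = v_q / n_q.
import numpy as np

lam = np.array([[1.0, 2.0],       # lambda_{ij}: row i = spectrum of A_i
                [3.0, 1.0]])      # nonsingular since the A_i are independent
n_q = np.array([2, 3])            # multiplicities, n_1 + n_2 = n = 5
v = np.array([4.0, 9.0])          # v_q = block sums of the mu_hat (illustrative)

# The q-th equation reads sum_j sigma_j * lam[j, q] = (v / n_q)[q].
sigma_hat = np.linalg.solve(lam.T, v / n_q)

# The recovered sigma reproduce the eigenvalue estimates v_q / n_q.
assert np.allclose(lam.T @ sigma_hat, v / n_q)
```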

5. COMMUTATIVE APPROXIMATION OF NONCOMMUTATIVE COVARIANCE STRUCTURES

The covariance structures encountered in applications are rarely commutative. Certainly, the set of all nonnegative-definite matrices in the algebra P(A) of polynomials in a single matrix variable A is a commutative covariance structure (see Section 3). In designing experiments with randomized blocks, covariance matrices that are invariant to the automorphisms of the experiment design form a cone in an algebra with two commuting generators [1], thereby generating a commutative structure. But the designs constructed from incomplete block schemes are balanced or partially balanced [9] and generate noncommutative covariance structures. It is natural, therefore, to examine the approximation of arbitrary covariance structures by commutative structures.

Let Σi = Pi Λi Pi′, i = 1, ..., N, be a set of covariance matrices, where the Pi are orthogonal matrices and the Λi are diagonal matrices. The columns of the matrix Pi form an orthonormal base Bi. The base-to-base distance ρ(B1, B2) is defined by the formula ρ²(B1, B2) = tr(P1 − P2)(P1 − P2)′. Let us approximate the Σi by matrices of the type Σ̂i = P Mi P′, where P is the same orthogonal matrix, not dependent on i, and the Mi are diagonal matrices. Let us choose the base B for the unknown matrix P from the condition of least deviation of B from all the Bi, i = 1, ..., N:

∑_{i=1}^N ρ²(Bi, B) → min over B.

Since ρ²(Bi, B) = 2n − 2 tr(Pi P′), we must minimize 2nN − 2 ∑_{i=1}^N tr(Pi P′) under the constraint P P′ = I. Introducing a symmetric matrix of Lagrange multipliers Λ, let us form the function

ϕ(P) = nN − tr(∑_{i=1}^N Pi P′) + tr(Λ(P′P − I)).


Differentiating it with respect to P, we obtain

−∑_{i=1}^N Pi + ΛP = 0.

Hence

P = Λ⁻¹ ∑_{i=1}^N Pi = Λ⁻¹Q,  where  Q = ∑_{i=1}^N Pi.

From the condition P P′ = I we obtain Λ⁻¹QQ′Λ⁻¹ = I. Therefore, Λ² = QQ′ and

P = (QQ′)^{−1/2} Q.   (8)
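Formula (8) can be evaluated through the symmetric eigendecomposition of QQ′. The sketch below (ours) also confirms, for two plane rotations, the half-angle property that the isomorphism argument of Example 1 below relies on.

```python
# Sketch (ours): formula (8), P = (QQ')^{-1/2} Q, via the symmetric
# eigendecomposition of QQ'; checked on two 2x2 rotations.
import numpy as np

def mean_orthogonal(Ps):
    Q = np.sum(Ps, axis=0)
    w, V = np.linalg.eigh(Q @ Q.T)          # QQ' symmetric positive definite
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T @ Q

def rot(phi):
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, -s], [s, c]])

# For two rotations, (8) gives the rotation by the mean angle, in line with
# the isomorphism P_k <-> exp(i*phi_k).
P = mean_orthogonal([rot(np.pi / 6), rot(np.pi / 4)])
assert np.allclose(P @ P.T, np.eye(2), atol=1e-12)
assert np.allclose(P, rot((np.pi / 6 + np.pi / 4) / 2), atol=1e-12)
```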

What now remains is to find the diagonal matrices Mi, i = 1, ..., N. Let Σ be one of the matrices Σi. Let us project Σ onto the linear manifold of matrices of the type P M P′, where M is a diagonal matrix, using the scalar product (A, B) = tr(AB′) of arbitrary matrices A and B. As in Section 4, let us express P M P′ as ∑_k µk Ek. From the orthogonality condition

(Σ − ∑_k µ̂k Ek, Ej) = 0

we obtain (Σ, Ej) = ∑_k µ̂k (Ek, Ej), or (Σ, Ej) = µ̂j (Ej, Ej) for all j. Hence

µ̂j = tr(ΣEj)/tr Ej.   (9)
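Formulas (8) and (9) combine into a complete approximation procedure. The following sketch is ours; the sign alignment of eigenvector columns is a practical detail not discussed in the paper, needed because numerical eigendecompositions fix eigenvector signs arbitrarily.

```python
# Sketch (ours) of the Section 5 procedure: common base P from (8), then
# diagonal entries from (9); the resulting approximants commute.
import numpy as np

def commutative_approx(Sigmas):
    Ps, ref = [], None
    for S in Sigmas:
        _, Pi = np.linalg.eigh(S)             # Sigma_i = P_i Lambda_i P_i'
        if ref is None:
            ref = Pi
        else:                                  # crude column-sign alignment
            for j in range(Pi.shape[1]):
                if Pi[:, j] @ ref[:, j] < 0:
                    Pi[:, j] = -Pi[:, j]
        Ps.append(Pi)
    Q = np.sum(Ps, axis=0)
    w, V = np.linalg.eigh(Q @ Q.T)
    P = V @ np.diag(1.0 / np.sqrt(w)) @ V.T @ Q        # formula (8)
    Es = [np.outer(P[:, j], P[:, j]) for j in range(P.shape[1])]
    out = []
    for S in Sigmas:
        mu = [np.trace(S @ E) / np.trace(E) for E in Es]   # formula (9)
        out.append(P @ np.diag(mu) @ P.T)
    return out

rng = np.random.default_rng(4)
A = [M @ M.T + np.eye(3) for M in rng.normal(size=(2, 3, 3))]
B1, B2 = commutative_approx(A)
assert np.allclose(B1 @ B2, B2 @ B1, atol=1e-9)        # approximants commute
```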

Examples. (1) Let us approximate two noncommuting 2 × 2 matrices Σ1 and Σ2 by commuting matrices of the type P Mi P′, i = 1, 2. Note that Σk = Pk Λk Pk′, k = 1, 2, where Pk is a rotation matrix of the type

Pk = [ cos ϕk  −sin ϕk ]
     [ sin ϕk   cos ϕk ].

There is an algebraic isomorphism Pk ↔ e^{iϕk}, which we use below to simplify the computations. We have

Q = P1 + P2 ↔ e^{iϕ1} + e^{iϕ2},   Q′ ↔ e^{−iϕ1} + e^{−iϕ2},   QQ′ ↔ 2(1 + cos(ϕ1 − ϕ2)),

(QQ′)^{−1/2} ↔ 1/√(2(1 + cos(ϕ1 − ϕ2)))

and, finally,

P = (QQ′)^{−1/2} Q ↔ (e^{iϕ1} + e^{iϕ2})/√(2(1 + cos(ϕ1 − ϕ2))).


Thus, according to (8), we have

P = 1/√(2(1 + cos(ϕ1 − ϕ2))) · [ cos ϕ1 + cos ϕ2   −sin ϕ1 − sin ϕ2 ]
                                [ sin ϕ1 + sin ϕ2    cos ϕ1 + cos ϕ2 ].

The idempotent matrices are

E1 = [ cos²ϕ         sin ϕ cos ϕ ]      E2 = [ sin²ϕ          −sin ϕ cos ϕ ]
     [ sin ϕ cos ϕ   sin²ϕ       ],          [ −sin ϕ cos ϕ   cos²ϕ        ],

where ϕ = ϕ2 − ϕ1. Hence µ1 = tr(ΣE1) and µ2 = tr(ΣE2) are given by formula (9):

µ1 = σ11 cos²ϕ + σ22 sin²ϕ + σ12 sin 2ϕ,
µ2 = σ11 sin²ϕ + σ22 cos²ϕ − σ12 sin 2ϕ.

(2) Let Σ1 = P1 Λ1 P1′ and Σ2 = P2 Λ2 P2′ be noncommuting 3 × 3 matrices. Let us approximate them by commuting matrices P M1 P′ and P M2 P′, respectively. The orthogonal matrix P is conveniently determined from the well-known isomorphism P ↔ U, where U is a unitary matrix

U = [ α    β ]
    [ −β̄   ᾱ ],   α, β ∈ C,   |α|² + |β|² = 1.

The transposed matrix P′ corresponds to the matrix U*, the Hermitian conjugate of U. Using the Cayley–Klein parameters α and β, we can express any 3 × 3 orthogonal matrix P as

P = [ (α² − β² + ᾱ² − β̄²)/2    −i(α² + β² − ᾱ² − β̄²)/2    −(αβ + ᾱβ̄) ]
    [ i(α² − β² − ᾱ² + β̄²)/2    (α² + β² + ᾱ² + β̄²)/2      i(αβ − ᾱβ̄)  ]   (10)
    [ αβ̄ + ᾱβ                   −i(αβ̄ − ᾱβ)                 αᾱ − ββ̄     ]

Indeed, the unitary matrix U transforms a 3-dimensional column vector with coordinates x, y, and z into a vector with coordinates x′, y′, and z′ by the rule R′ = U R U*, where

R = [ z        x − iy ]        R′ = [ z′          x′ − iy′ ]
    [ x + iy   −z     ],            [ x′ + iy′    −z′      ].

The canonical vectors (1, 0, 0), (0, 1, 0), and (0, 0, 1) correspond to the Pauli spin matrices

R1 = [ 0  1 ]      R2 = [ 0  −i ]      R3 = [ 1   0 ]
     [ 1  0 ],          [ i   0 ],          [ 0  −1 ].

Therefore, the ith column in (10) is the vector (x′, y′, z′) corresponding to the matrix R′ = U Ri U*. Let us express the unitary matrix U through the Pauli matrices as

[ α    β ]  =  ½(α + ᾱ)R0 + ½(β − β̄)R1 + (i/2)(β + β̄)R2 + ½(α − ᾱ)R3,
[ −β̄   ᾱ ]

where R0 is the 2 × 2 unit matrix. Since the Pauli matrices obey the multiplication rules for quaternion units, i.e., Ri Rj = −Rj Ri for i ≠ j, Ri² = R0, and Ri* = Ri, we obtain QQ′ ↔ (|α1 + α2|² + |β1 + β2|²)R0. Finally, we obtain

P = (1/√(|α1 + α2|² + |β1 + β2|²)) (P1 + P2),

where Pi is a matrix of the type (10) with parameters α = αi and β = βi, i = 1, 2.
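The parametrization (10) can be verified numerically by conjugating the Pauli matrices exactly as in the argument above. A sketch of ours:

```python
# Sketch (ours): conjugating the Pauli matrices by U = [[a, b], [-b*, a*]]
# and reading off (x', y', z') column by column yields a real orthogonal P.
import numpy as np

def cayley_klein_rotation(a, b):
    """Columns = images of the canonical vectors under R -> U R U*."""
    U = np.array([[a, b], [-np.conj(b), np.conj(a)]])
    paulis = [np.array([[0, 1], [1, 0]], complex),
              np.array([[0, -1j], [1j, 0]]),
              np.array([[1, 0], [0, -1]], complex)]
    cols = []
    for R in paulis:
        Rp = U @ R @ U.conj().T
        # Read (x', y', z') back off  Rp = [[z', x'-iy'], [x'+iy', -z']].
        x = 0.5 * (Rp[0, 1] + Rp[1, 0])
        y = 0.5j * (Rp[0, 1] - Rp[1, 0])
        z = Rp[0, 0]
        cols.append([x.real, y.real, z.real])
    return np.array(cols).T

rng = np.random.default_rng(5)
v = rng.normal(size=4)
a, b = v[0] + 1j * v[1], v[2] + 1j * v[3]
norm = np.sqrt(abs(a) ** 2 + abs(b) ** 2)
a, b = a / norm, b / norm                 # enforce |a|^2 + |b|^2 = 1

P = cayley_klein_rotation(a, b)
assert np.allclose(P @ P.T, np.eye(3), atol=1e-12)
assert np.isclose(np.linalg.det(P), 1.0)
```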


6. APPLICATION TO CLASSIFICATION OF GAUSSIAN VECTORS WITH COMMUTATIVE COVARIANCE STRUCTURE

For statistical classification of Gaussian vectors, commutative covariance structures are of interest for two reasons. First, the parameters of the Gaussian distributions, i.e., the vector of means mi and the covariance matrix Σi for the ith class, are as a rule not known but estimated from sample data. In general, a structure (mi, Σi, i = 1, ..., s) has s(n + n(n + 1)/2) parameters, where n is the dimension of the attribute space and s is the number of classes. By contrast, to define a commutative covariance structure (mi, P Λi P′, i = 1, ..., s), it suffices to specify sn parameters for the means, n(n − 1)/2 parameters for the common proper base (i.e., the matrix P), and sn parameters for the matrices Λi. Compared to the general case, the number of parameters is reduced by (s − 1)n(n − 1)/2. Certainly, a commutative structure is suitable only for describing Gaussian distributions whose covariance ellipsoids are almost identically oriented and differ only in the lengths of their axes. Second, for commutative covariance structures, the classifier decision function is simpler if it is expressed in a proper base of the covariance matrices Σi, i = 1, ..., s. But the determination of the threshold for the decision rule is not simplified even for a commutative covariance structure. Thus, in the simplest case of two classes, s = 2, the Neyman–Pearson criterion yields the equation ∫_{D(c)} dN(x; m2, Σ2) = α for the threshold c, where the integration domain D(c) is defined by the formula

D(c) = {x : x′(Σ2⁻¹ − Σ1⁻¹)x + 2(m1′Σ1⁻¹ − m2′Σ2⁻¹)x > c}.

Here α is the conditional probability of wrongly assigning a sample to the first class when it actually belongs to the second class. There is no analytical method of computing the threshold c. It is determined approximately by stochastic modeling with different values of c [10], because it is not possible to find the exact distribution of the quadratic form x′(Σ2⁻¹ − Σ1⁻¹)x + 2(m1′Σ1⁻¹ − m2′Σ2⁻¹)x when x is distributed as N(·; m2, Σ2). The distribution of a quadratic statistic is known only in certain particular cases [11]. The probabilities of classification errors under a given threshold can sometimes be determined by approximate methods [12].

In conclusion, it is worthwhile to compare, for s = 2 (two statistical classes), the exact solution of the problem of reducing two covariance matrices Σ1 and Σ2 to canonical form with its approximate solution Σ̂1 = P Λ1 P′, Σ̂2 = P Λ2 P′, with the matrix P defined by formula (8). Let Σ1 be a nonsingular matrix. According to the theorem on the simultaneous reduction of two quadratic forms with coefficient matrices Σ1 and Σ2, there exists a matrix F such that F′Σ1F = I and F′Σ2F = N, where N = diag(ν1, ..., νn) and the νi are the roots of the equation |Σ1⁻¹Σ2 − νI| = 0. In general, F is not an orthogonal matrix, and the common base {f1, ..., fn}, where fi is the ith column of the matrix F, is not a proper base for the pair (Σ1, Σ2). To find F, note that the matrices Σ1⁻¹Σ2 and Σ1^{−1/2}Σ2Σ1^{−1/2} have the same spectrum, i.e., the νi coincide with the roots of the equation |Σ1^{−1/2}Σ2Σ1^{−1/2} − νiI| = 0. The matrix Σ1^{−1/2}Σ2Σ1^{−1/2} is Hermitian. Therefore, (a) all νi are real, (b) U*Σ1^{−1/2}Σ2Σ1^{−1/2}U = N for some unitary matrix U, and (c) it suffices to take F = Σ1^{−1/2}U for determining the unknown matrix F.

Below, in comparing the exact solution F with the approximate solution F̂ = P, where P is defined by formula (8), we restrict ourselves, for simplicity of presentation, to 2 × 2 matrices. First let us consider the exact solution F. The roots ν1 and ν2 are determined from the characteristic equation

ν² − ν tr Σ1⁻¹Σ2 + det Σ1⁻¹Σ2 = 0.
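Returning to the threshold determination discussed above: since no analytical method is available, c is found by simulation. A sketch of ours, with all distribution parameters illustrative:

```python
# Sketch (ours): stochastic determination of the Neyman-Pearson threshold c.
# All distribution parameters below are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(6)
m1, m2 = np.array([0.0, 0.0]), np.array([1.0, 1.0])
S1, S2 = np.diag([1.0, 2.0]), np.diag([2.0, 1.0])
S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)

def g(X):
    """Quadratic discriminant defining D(c) = {x : g(x) > c}."""
    quad = np.einsum('ni,ij,nj->n', X, S2i - S1i, X)
    lin = 2.0 * X @ (S1i @ m1 - S2i @ m2)
    return quad + lin

# Sample from the second class; pick c as the empirical (1 - alpha) quantile,
# so that Pr{x in D(c) | x ~ N(m2, S2)} is approximately alpha.
X = rng.multivariate_normal(m2, S2, size=100_000)
scores = g(X)
alpha = 0.05
c = np.quantile(scores, 1.0 - alpha)
assert abs(np.mean(scores > c) - alpha) < 0.005
```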

Assuming, as in Example 1, that Σ1 = P1 Λ1 P1′, Σ2 = P2 Λ2 P2′, Λ1 = diag(λ11, λ12), Λ2 = diag(λ21, λ22), P1 ↔ e^{iϕ1}, and P2 ↔ e^{iϕ2}, we obtain det(Σ1⁻¹Σ2) = det Λ2/det Λ1 and tr Σ1⁻¹Σ2 = tr Λ1⁻¹RΛ2R′, where R := P1′P2. Simple computation gives

det Σ1⁻¹Σ2 = λ21λ22/(λ11λ12),

tr Σ1⁻¹Σ2 = (λ21/λ11 + λ22/λ12) cos²ϕ + (λ22/λ11 + λ21/λ12) sin²ϕ,   ϕ = ϕ2 − ϕ1,

A := Λ1⁻¹RΛ2R′ = [ (λ21/λ11) cos²ϕ + (λ22/λ11) sin²ϕ      ((λ21 − λ22)/λ11) sin ϕ cos ϕ       ]
                 [ ((λ21 − λ22)/λ12) sin ϕ cos ϕ           (λ21/λ12) sin²ϕ + (λ22/λ12) cos²ϕ  ].

The eigenvectors u1, ..., un of the matrix Σ1^{−1/2}Σ2Σ1^{−1/2} can be found from the equations Σ1^{−1/2}Σ2Σ1^{−1/2}ui = νiui or, which is the same thing, from the equations Avi = νivi, where vi = Λ1^{−1/2}P1′ui, i = 1, 2. Indeed, it is easy to verify that Σ1^{−1/2}Σ2Σ1^{−1/2} = (Λ1^{−1/2}P1′)⁻¹ A (Λ1^{−1/2}P1′). The vectors f1 and f2 of the canonical (but not proper) base of Σ1 and Σ2 are determined as fi = Σ1^{−1/2}ui = P1Λ1^{−1/2}P1′ · P1Λ1^{1/2}vi = P1vi, i.e., as the result of rotating the vectors v1 and v2 through ϕ1.

We now consider a numerical example. Let ϕ1 = π/6, ϕ2 = π/4, λ11 = 1, λ12 = 2/3, λ21 = 2, and λ22 = 3/2. We have ϕ = ϕ2 − ϕ1 = π/12, ν1ν2 = 9/2, and

ν1 + ν2 = tr A = (λ21/λ11 + λ22/λ12) cos²ϕ + (λ22/λ11 + λ21/λ12) sin²ϕ = (17/4) cos²(π/12) + (9/2) sin²(π/12) = 4.2662.

Hence ν1 = 2.3561 and ν2 = 1.9101. Let us compute the eigenvectors vi of the matrix A. The equation (A − νiI)vi = 0 implies that the product of any row vector of the matrix A − νiI by the column vector vi is zero. It suffices to consider one row (for example, the first), because A − νiI is a matrix of rank 1. Denoting the first row of the matrix A − νiI by (a(νi), b), we obtain

a(νi) = (λ21/λ11) cos²ϕ + (λ22/λ11) sin²ϕ − νi = −0.4897 for i = 1 and 0.0565 for i = 2,

b = ½((λ21 − λ22)/λ11) sin 2ϕ = 0.1250.

The orthogonality condition shows that v1 and v2 are proportional to the vectors

w1 = +(0.1250, 0.4897)′,   w2 = −(0.1250, −0.0565)′,

respectively. The signs before the brackets are chosen such that w1 lies in the first quadrant of the coordinate system OXY and the base (w1, w2) is positively oriented. The vectors wi = αivi must be normalized by the conditions 1 = ui′ui = vi′Λ1vi = wi′Λ1wi/αi², so that αi = √(wi′Λ1wi). Since √(w1′Λ1w1) = 0.419 and √(w2′Λ1w2) = 0.133, it is not difficult to find the vectors v1 and v2 first and then f1 and f2. Computations show that ‖f1‖ = 1.205, ‖f2‖ = 1.003, and the angle between f1 and f2 is 80°00′. The approximate solution, according to (8), is the orthogonal matrix F̂ = P ↔ e^{iϕ̂}, where cos ϕ̂ = 0.7934, which corresponds to the angle ϕ̂ = 37°30′. Hence

F̂ = [ 0.7934  −0.6088 ]
    [ 0.6088   0.7934 ].

Let f̂1 and f̂2 denote the columns of the matrix F̂. Comparing the solutions, we find that the cone {f1, f2} lies wholly within the interior of the 90° cone {f̂1, f̂2}. Moreover, the mismatch between f1 and f̂1 is 6°50′, whereas that between f2 and f̂2 is only 3°10′.

7. CONCLUSIONS

Commutative approximation of an arbitrary covariance structure is studied, probably for the first time, in this paper. The approximation designed here is not the only possible one, and it is worthwhile to examine other approximations, not as simple in formulation as ours, and other methods of solving this approximation problem. Commutative structures are attractive in applications for the extreme simplicity of their algebraic analysis. As applied to statistical classification on commutative covariance structures, what matters is to find methods for approximately estimating the classification error probability for such structures and the deviations of the approximate probability estimates from their true values. At present, error probabilities for high-dimensional arbitrary covariance structures are estimated exclusively by statistical modeling methods.

REFERENCES

1. Sysoev, L.P. and Shaikin, M.E., Optimal Estimates of Parameters in Regression Models of Special Covariance Structure and Their Application in Two-Factor Experiments, Avtom. Telemekh., 1981, no. 6, pp. 44–56.
2. Anderson, T.W., Statistical Inference for Covariance Matrices with Linear Structure, in Time Series Analysis, Rosenblatt, M., Ed., New York: Wiley, 1963.
3. Srivastava, J.N., On Testing Hypotheses Regarding a Class of Covariance Structures, Psychometrika, 1966, vol. 31, no. 1, pp. 147–164.
4. Glazman, I.M. and Lyubich, Yu.I., Konechnomernyi lineinyi analiz (Finite-Dimensional Linear Analysis), Moscow: Nauka, 1969.
5. Gantmakher, F.R., Teoriya matrits, Moscow: Nauka, 1967. Translated under the title The Theory of Matrices, New York: Chelsea, 1959.
6. Marcus, M. and Minc, H., A Survey of Matrix Theory and Matrix Inequalities, Boston: Allyn and Bacon, 1964. Translated under the title Obzor po teorii matrits i matrichnykh neravenstv, Moscow: Nauka, 1972.
7. Halmos, P.R., Finite-Dimensional Vector Spaces, New York: Springer-Verlag, 1974. Translated under the title Konechnomernye vektornye prostranstva, Moscow: Fizmatgiz, 1963.
8. Pukhal'skii, E.A., Computation of Invariants in Classification of Covariance Structures, Avtom. Telemekh., 1986, no. 4, pp. 68–75.
9. Shaikin, M.E., The Algebraic Structure of PBIB-Plans of Variance Analysis and Its Application to Multifactor Experiments with Interaction, Avtom. Telemekh., 1997, no. 11, pp. 90–101.
10. Pugachev, V.S., Teoriya veroyatnostei i matematicheskaya statistika (Probability Theory and Mathematical Statistics), Moscow: Nauka, 1979.
11. Kostylev, V.I., Distribution of the Sum of Two Independent Gamma-Statistics, Radiotekhn. Elektron., 2001, vol. 46, no. 5, pp. 530–533.
12. Jain, A., Moulin, P., Miller, M.I., et al., Information-Theoretic Bounds on Target Recognition Performance, IEEE Trans. PAMI, 2002, vol. 24, no. 9, pp. 1153–1166.

This paper was recommended for publication by V.A. Lototskii, a member of the Editorial Board.


STOCHASTIC SYSTEMS

Statistical Estimation and Classification on Commutative Covariance Structures1 M. E. Shaikin Trapeznikov Institute of Control Sciences, Russian Academy of Sciences, Moscow, Russia Received January 27, 2003

Abstract—Statistical inference is investigated under the following constraints on the covariance structure for the observation vector: covariance matrices belong to some commutative matrix algebra. Commutative approximation of arbitrary covariance structures and statistical estimation of the parameters of a given commutative structure are studied. The results are applied to statistical classification of Gaussian vectors having commutative covariance structure.

1. INTRODUCTION

Statistical formulations of the problems of control of an object, filtering of its phase vector or output signal, classification or clustering of statistical ensembles, and many other applied problems in statistical decision theory usually involve the choice of a probabilistic model for the random (internal or external) factors acting on the controlled object or observation system. The range of covariance models for the components of random factors is rather wide; it extends from the simplest model of uncorrelated identically distributed components with a single unknown parameter (the variance) to a general model of arbitrarily correlated components with n(n + 1)/2 parameters, where n is the number of components. The agreement of a model with observation data need not necessarily grow with the complexity of the model.

In a linear m-parameter model, the covariance matrix is represented as Σ = ∑_{i=1}^{m} αi Ai, where Ai are given matrices and αi are parameters. Such covariance structures are natural for certain experiment designs [1]. In general, the matrices Ai are not given and must be estimated, along with the parameters αi, from observation data. If there are several covariance matrices Σ1, . . . , Σs (this is typical of the Gaussian formulation of s-alternative classification), a linear covariance model common to all Σ1, . . . , Σs can be used when the matrices Σ1, . . . , Σs pairwise commute. Indeed, in this case there exists an orthonormal base {p1, p2, . . . , pn} common to all Σi, and the representation Σi = P Λi P′ holds for all i = 1, . . . , s, where P is an orthogonal matrix with columns p1, p2, . . . , pn, Λi = diag(λi1, . . . , λin) is a diagonal matrix with nonnegative elements, and the prime denotes the transpose of P. Clearly, the set CP of all matrices of the form Σ = P ΛP′ for a fixed orthogonal matrix P, where Λ ≥ 0 is a diagonal matrix, is an n-dimensional commutative matrix algebra. Consequently, the covariance structure CP is also called a commutative (and algebraic) structure.

Statistical inference for algebraic covariance structures, which are not necessarily commutative but have known base matrices Ai, is investigated in [2] in a general setting and in [1] as applied to special structures. Similar problems for linear (not necessarily algebraic) structures are studied in an earlier work [3]. In this paper, we study statistical estimation for commutative covariance structures of the type CP and estimate the scalar parameters λi as well as the matrix parameter P.

1 This work was supported by the Russian Foundation for Basic Research, project no. 01-01-00758.

0005-1179/03/6408-1264$25.00 © 2003 MAIK “Nauka/Interperiodica”
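As an illustration of the structure CP, the following sketch (not from the paper; the dimension and random seed are arbitrary) builds two covariance matrices with a common orthonormal base and checks that they commute:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# A common orthonormal base: the Q-factor of a random matrix is orthogonal.
P, _ = np.linalg.qr(rng.normal(size=(n, n)))

# Two members of the commutative structure C_P: Sigma_i = P Lambda_i P'.
L1 = np.diag(rng.uniform(0.5, 2.0, n))
L2 = np.diag(rng.uniform(0.5, 2.0, n))
S1 = P @ L1 @ P.T
S2 = P @ L2 @ P.T

# Sharing an eigenbasis forces the commutator to vanish (up to rounding).
comm = S1 @ S2 - S2 @ S1
```

Two generic symmetric positive definite matrices would in general have a nonzero commutator, which is what motivates the approximation problem of Section 5.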


2. FORMULATION OF THE PROBLEM

Let x1, . . . , xN, xi ∈ Rⁿ, be a repeated sample from a Gaussian distribution with parameters µ and Σ. The covariance matrix Σ is given in the form Σ = ∑_{i=1}^{m} σi Ai, where σi are unknown parameters and Ai are n × n symmetric linearly independent matrices, which are also unknown. More exactly, we assume that the Ai are all diagonalized by the same orthogonal matrix P, i.e., Ai = P Λi P′, where Λi are diagonal matrices. Moreover, there exists at least one set of coefficients σ1, . . . , σm for which the matrix ∑_{i=1}^{m} σi Ai is positive definite. Our problem is to find estimates (for example, maximum likelihood estimates) of the unknown vector parameter µ, the scalar parameters σ1, . . . , σm, and the matrix parameter P.

The logarithm of the likelihood function is proportional to

(2/N) ln L = −n ln(2π) − ln |Σ| − tr Σ⁻¹C − (x̄ − µ)′ Σ⁻¹ (x̄ − µ),    (1)

where

x̄ = (1/N) ∑_{k=1}^{N} xk,    C = (1/N) ∑_{k=1}^{N} (xk − x̄)(xk − x̄)′.
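The statistics x̄ and C entering (1) can be computed directly from a sample; a minimal sketch (the true parameters, sample size, and seed are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 3, 500
mu_true = np.array([1.0, -2.0, 0.5])
Sigma_true = np.diag([2.0, 1.0, 0.5])

x = rng.multivariate_normal(mu_true, Sigma_true, size=N)  # rows are the sample vectors x_k

x_bar = x.mean(axis=0)        # sample mean (the ML estimate of mu)
D = x - x_bar
C = D.T @ D / N               # empirical covariance matrix with divisor N, as in (1)

# Reduced likelihood (2) evaluated at Sigma = C, where tr(C^{-1} C) = n.
val = -n * np.log(2 * np.pi) - np.log(np.linalg.det(C)) - np.trace(np.linalg.solve(C, C))
```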

Function (1) is maximized with respect to µ at µ̂ = x̄; the last term in (1) then vanishes. The logarithm of the reduced likelihood function is proportional to

−n ln(2π) − ln |Σ| − tr Σ⁻¹C,    (2)

which is used in Section 4 for estimating the parameters σ1, . . . , σm, Λ1, . . . , Λm, P of the covariance matrix Σ = ∑_{i=1}^{m} σi P Λi P′. In Section 3, we state the basic preliminaries on commutative algebras

with several commuting generators. In Section 5, we study the commutative approximation of a noncommutative covariance structure. Section 6 is devoted to many-alternative classification of Gaussian vectors with a commutative covariance structure.

3. PRELIMINARIES ON COMMUTATIVE COVARIANCE STRUCTURES

Let A1, . . . , Am be self-adjoint operators in a real n-dimensional linear space. We denote by A = A(E, A1, . . . , Am) the real algebra with unity E generated by the elements A1, . . . , Am. By definition, A is the linear span of the products Ai1 Ai2 · · · Aik, where the factors need not all be distinct. The algebra with one generator A is the algebra P(A) of all polynomials in the variable A. The algebra P(A) is obviously commutative. The dimension of P(A) is equal to the degree of the minimal annihilating polynomial M(λ; A) of A, dim P(A) = deg M(λ; A) ≤ n, and the bound n is attained for an operator A whose eigenvalues are all distinct [4].

Adjoining to P(A) a self-adjoint operator B ∉ P(A), let us consider the algebra A(E, A, B), which need not be commutative. Since we are interested in commutative covariance structures, we assume that A and B commute: [A, B] := AB − BA = 0. The operator B then is not a scalar operator (otherwise, B ∈ P(A)) but a scalar-type operator in the terminology of [4], or an operator of simple structure in the terminology of [5]. Commuting scalar-type operators have a common proper base [4], and this assertion extends to an arbitrary set of pairwise commuting scalar-type operators. Obviously, the converse is also true: operators having a common base commute.

Let us also state the matrix formulation of this assertion. Let {p1, . . . , pn} denote the common proper base of the operators A1, . . . , Am, and let A1, . . . , Am be their matrices in some base {e1, . . . , en}. The above assertion then implies that pairwise commuting matrices A1, . . . , Am of simple structure can be reduced to diagonal form simultaneously, i.e., by the same similarity transformation: Ai = T Λi T⁻¹, i = 1, 2, . . . , m, where Λi are diagonal matrices. The matrix T transforms the base {e1, . . . , en} into the base {p1, . . . , pn}: the kth column of T contains the coordinates (in the base {e1, . . . , en}) of the eigenvector pk corresponding to the eigenvalue λik of the matrix Ai, k = 1, 2, . . . , n.

This result is not directly applicable to covariance structures, because in passing from the initial base to a new base by the matrix T the covariance matrix Σ is reduced to a diagonal matrix Λ as Σ = T ΛT*, where, in general, T* ≠ T⁻¹. But if two self-adjoint (or even merely normal, see [4]) matrices are similar, then they are also unitarily similar: Σ = U ΛU*, where U ∈ Mn(C) is a unitary matrix. Furthermore, since Σ ∈ Mn(R) and Λ ∈ Mn(R), there exists a real orthogonal matrix P ∈ Mn(R) such that Σ = P ΛP′ [6].

Thus, an arbitrary set {A1, . . . , Am} of self-adjoint operators generates a commutative covariance structure if and only if the operators A1, . . . , Am have a common proper base. Consequently, any operator A ∈ A(E, A1, . . . , Am) is represented, in an arbitrary (but fixed) base, by a matrix A = P ΛP′, where Λ is a real diagonal matrix with elements ≥ 0 and P is an orthogonal matrix that depends not on the operator A but only on the base {pi}. Denoting the columns of the matrix P by pi, we can write

A = ∑_{j=1}^{n} λj pj pj′,

or, in operator form,

A = ∑_{j=1}^{n} λj pj ⊗ pj.

In particular, the generators Ai ∈ A admit the representation Ai = ∑_{j=1}^{n} λij pj ⊗ pj (λij are the eigenvalues of the operator Ai), and the linear combination K = ∑_{i=1}^{m} σi Ai admits the representation

K = ∑_{i=1}^{m} ∑_{j=1}^{n} σi λij pj ⊗ pj.
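The spectral representation A = ∑_j λj pj pj′ is exactly what a symmetric eigendecomposition returns; a short sketch (random symmetric matrix, seed arbitrary) rebuilds A from rank-one projectors:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
M = rng.normal(size=(n, n))
A = (M + M.T) / 2                      # a self-adjoint operator (symmetric matrix)

lam, Pmat = np.linalg.eigh(A)          # eigenvalues lambda_j, orthonormal eigenvectors p_j

# Rebuild A as the sum of rank-one terms lambda_j * p_j p_j'.
A_rebuilt = sum(lam[j] * np.outer(Pmat[:, j], Pmat[:, j]) for j in range(n))
```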

The operators Ej = pj ⊗ pj are linearly independent if the common eigensubspaces Ei of the operators A1, . . . , Am are one-dimensional; in this case, the spectrum of each of the operators A1, . . . , Am consists of n elements. But if dim Ei = ni ≥ 1 and pj, j ∈ Ji, form a base (of cardinality ni) of the subspace Ei, then the operators

Ei = ∑_{j∈Ji} pj ⊗ pj,    i = 1, . . . , k,    n1 + . . . + nk = n,

are linearly independent. The operators Ei are called the primitive idempotents of the commutative algebra A. They are self-adjoint, mutually orthogonal operators: Ei′ = Ei, Ei Ej = Ej Ei = δij Ei.

There exists an algorithm for finding the primitive idempotents Ei if the eigenvalues of the operators A1, . . . , Am are known. Thus, for an algebra with a single symmetric generator A1 = A having distinct eigenvalues λ1 < λ2 < . . . < λk, the operators Ei are defined by Ei = πi(A), where πi are polynomials with real coefficients (hence, in particular, the representation A = ∑_{i=1}^{k} λi Ei is unique). The polynomials πi are of the form

πi(x) = ∏_{j≠i} (x − λj)/(λi − λj)

[7]. In the general case of an arbitrary number of generators m > 1, let λi1 < λi2 < . . . < λik be all the distinct eigenvalues of the operator Ai without regard for their multiplicity. Setting

πiαi(x) = ∏_{βi≠αi} (x − λiβi)/(λiαi − λiβi)    and    Eiαi = πiαi(Ai),

we obtain Eα = ∏_{i=1}^{m} Eiαi, where α = (α1, . . . , αm) runs through all possible sequences of length m

over the alphabet {1, 2, . . . , k} [8].

4. MAXIMUM LIKELIHOOD ESTIMATES FOR THE PARAMETERS OF A COMMUTATIVE COVARIANCE STRUCTURE

First let us assume that the matrices Ai in the representation Σ = ∑_{i=1}^{m} σi Ai are known and only the parameters σi are to be estimated. Function (1) is maximized over the parameters σ1, . . . , σm by differentiating (2) with respect to these parameters:

∂Σ/∂σi = Ai,    ∂Σ⁻¹/∂σi = −Σ⁻¹Ai Σ⁻¹,    ∂ ln |Σ|/∂σi = tr Σ⁻¹Ai.

Hence the derivatives of function (2) with respect to σi are −tr Σ⁻¹Ai + tr Σ⁻¹Ai Σ⁻¹C, and the equations for the maximum likelihood estimates take the form

tr[(∑_{j=1}^{m} σ̂j Aj)⁻¹ Ai] = tr[(∑_{j=1}^{m} σ̂j Aj)⁻¹ Ai (∑_{j=1}^{m} σ̂j Aj)⁻¹ C],    i = 1, . . . , m.    (3)

For fixed Ai, system (3) has at least one solution guaranteeing the positive definiteness of the matrix ∑_{j=1}^{m} σ̂j Aj if the matrix C is positive definite.
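System (3) can be verified numerically in a simple commutative special case: for m = n rank-one generators Ai = pi pi′, the choice σ̂i = pi′Cpi makes both sides of (3) equal to 1/σ̂i. This sketch (not a general solver for (3); sizes and seed are arbitrary) checks the residual:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
P, _ = np.linalg.qr(rng.normal(size=(n, n)))        # common eigenbasis
A = [np.outer(P[:, i], P[:, i]) for i in range(n)]  # rank-one generators A_i = p_i p_i'

M = rng.normal(size=(n, n))
C = M @ M.T + n * np.eye(n)                         # a positive definite "sample" covariance

sigma_hat = np.array([P[:, i] @ C @ P[:, i] for i in range(n)])
Sigma_hat = sum(s * Ai for s, Ai in zip(sigma_hat, A))

Si = np.linalg.inv(Sigma_hat)
lhs = np.array([np.trace(Si @ Ai) for Ai in A])            # tr(Sigma_hat^{-1} A_i)
rhs = np.array([np.trace(Si @ Ai @ Si @ C) for Ai in A])   # tr(Sigma_hat^{-1} A_i Sigma_hat^{-1} C)
```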

If the matrices Ai are not known and Ai = ∑_{j=1}^{n} λij pj pj′, then

Σ = ∑_{i} σi Ai = ∑_{i} ∑_{j} σi λij pj pj′.

Using the notation µj = ∑_{i=1}^{m} σi λij, we obtain

Σ = ∑_{j=1}^{n} µj pj pj′ = P M P′,

where M = diag{µ1, . . . , µn}. Let us find the maximum likelihood estimates of the parameters µj and pj. Note that ln |Σ| = ln |P M P′| = ln |M| and tr Σ⁻¹C = tr P M⁻¹P′C = tr M⁻¹P′CP in the reduced likelihood function (2). Therefore, the equations for the maximum likelihood estimates of the parameters µj are obtained by differentiating the function

− ln |M| − tr M⁻¹P′CP    (4)

with respect to µj. If the variables µj are all algebraically independent, we obtain ∂ ln |M|/∂µi = 1/µi, and since tr M⁻¹P′CP = ∑_{i=1}^{n} (1/µi) pi′Cpi, we have

(∂/∂µi) tr M⁻¹P′CP = −(1/µi²) pi′Cpi.

Hence the equations for the maximum likelihood estimates take the form

−1/µi + (1/µi²) pi′Cpi = 0,  or  µ̂i = pi′Cpi.    (5)
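The conclusion developed below, that the eigenpairs of C furnish the ML estimates, can be checked mechanically: with p̂i taken as the eigenvectors of C, formula (5) returns exactly the eigenvalues of C. A sketch (seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
M = rng.normal(size=(n, n))
C = M @ M.T + 0.1 * np.eye(n)          # empirical covariance matrix (positive definite)

theta, Phat = np.linalg.eigh(C)        # eigenvalues theta_i and eigenvectors p_hat_i of C

# (5): mu_hat_i = p_i' C p_i, evaluated at the eigenvectors of C.
mu_hat = np.array([Phat[:, i] @ C @ Phat[:, i] for i in range(n)])
```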

Now we find the maximum likelihood estimates of the eigenvectors pi of the matrix Σ. In expression (4), only the last term, equal to ϕ = −∑_{i=1}^{n} (1/µi) pi′Cpi, depends on the pi. This function must be maximized over the vector variables pi, i = 1, . . . , n, satisfying the constraints pi′pj = δij. But it suffices to use only the constraints pi′pi = 1, i = 1, . . . , n; the remaining constraints pi′pj = 0, i ≠ j, are then satisfied automatically. Indeed, let us minimize the function

ϕ = ∑_{i=1}^{n} pi′Cpi/µi − ∑_{i=1}^{n} νi (pi′pi − 1),

where νi are Lagrange multipliers. Differentiating ϕ with respect to pi and equating the derivatives ∂ϕ/∂pi to zero, we obtain

Cpi/µi = νi pi,    i = 1, . . . , n.    (6)

For the ith equation in (6) to have a nonzero solution, the matrix C/µi − νi I must be degenerate: |C − µi νi I| = 0. The roots θ1, . . . , θn of the equation |C − θI| = 0 are ordered such that θi = µi νi. Left multiplying (6) by pi′, we obtain pi′Cpi/µi = νi pi′pi = νi, and for the expression pi′Cpi/µi to be minimal, νi must be taken equal to νmin = minj θj/µj. The corresponding eigenvector is determined from the normalized solution of the equation (C/µi − νmin I) p̂i = 0. Hence p̂i is the eigenvector of the matrix C corresponding to the eigenvalue θi = µi νmin = µi minj θj/µj. Now it is a simple matter to choose orthogonal p̂i such that p̂i′p̂j = δij. By (5), µ̂i = p̂i′C p̂i. Therefore, the eigenvalues and eigenvectors of the empirical covariance matrix C are not merely simple estimates of the eigenvalues and eigenvectors of the unknown covariance matrix Σ; they are their maximum likelihood estimates.

Result (5) is obtained under the assumption that the likelihood function (4) depends on n unknown parameters µ1, . . . , µn. It is not excluded, however, that the spectrum of the matrix Σ consists of only k < n points µ1, . . . , µk. Let n1, . . . , nk be the multiplicities of the eigenvalues µ1, . . . , µk, n1 + . . . + nk = n. Then, as has already been mentioned, the operators Ei = ∑_{j∈Ji} pj ⊗ pj are linearly independent and admit the representations

Ai = ∑_{j=1}^{k} λij Ej,    i = 1, . . . , m,    Σ = ∑_{i=1}^{m} σi Ai = ∑_{i=1}^{m} ∑_{j=1}^{k} σi λij Ej,

where λij, j = 1, . . . , k, are the eigenvalues of the matrix Ai. Setting aside the difficult problem of estimating the unknown parameters n1, . . . , nk from a sample, let us consider a simple case of practical value in which the spectra {λi1, λi2, . . . , λik} of the matrices Ai, i = 1, . . . , m, and the multiplicities n1, . . . , nk are known. Then it is natural to take

Êj = ∑_{i∈Jj} p̂i p̂i′,    µ̂j = ∑_{i=1}^{m} σ̂i λij,  j = 1, . . . , k,    Âi = ∑_{j=1}^{k} λij Êj,  i = 1, . . . , m.

Substituting these values into (3) and using tr Êj = nj, we obtain

∑_{q=1}^{k} nq λiq / (∑_{j=1}^{m} σ̂j λjq) = ∑_{q=1}^{k} λiq vq / (∑_{j=1}^{m} σ̂j λjq)²,    i = 1, . . . , m,    (7)

where

v1 = ∑_{j=1}^{n1} µ̂j,    v2 = ∑_{j=n1+1}^{n1+n2} µ̂j,    . . . ,    vk = ∑_{j=n−nk+1}^{n} µ̂j.

Here vq/nq are the estimates of the eigenvalues µq (of multiplicity nq) of the covariance matrix Σ with spectrum {µ1, . . . , µk}, q = 1, . . . , k.

Let us consider the special case k = m. Since the matrices A1, . . . , Am are assumed to be linearly independent, the matrices Λ1, . . . , Λm are also linearly independent, and the m × m matrix of coefficients (λij) is nonsingular. The equations

∑_{j=1}^{m} σ̂j λjq = vq/nq,    q = 1, . . . , k,

are consequently solvable for σ̂1, . . . , σ̂m. Substituting these σ̂j into (7), we find that (7) is satisfied, i.e., the σ̂j are the maximum likelihood estimates of the parameters σj, j = 1, . . . , m.
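For k = m, the estimates σ̂j thus follow from the linear system ∑_j σ̂j λjq = vq/nq. A sketch with hypothetical spectra λjq, multiplicities nq, and grouped sums vq (all numbers are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical spectra lambda_{jq}: row j = generator index, column q = eigenvalue index.
lam = np.array([[1.0, 2.0],
                [3.0, 1.0]])          # must be nonsingular in the case k = m
nq = np.array([2, 3])                 # multiplicities n_q
vq = np.array([4.0, 9.0])             # grouped sums of the estimates mu_hat_j

# Solve sum_j sigma_hat_j * lambda_{jq} = v_q / n_q for sigma_hat
# (coefficient matrix has rows indexed by q, hence the transpose).
sigma_hat = np.linalg.solve(lam.T, vq / nq)
```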

5. COMMUTATIVE APPROXIMATION OF NONCOMMUTATIVE COVARIANCE STRUCTURES

The covariance structures encountered in applications are rarely commutative. Certainly, the set of all nonnegative-definite matrices in the algebra P(A) of polynomials of a single matrix variable A is a commutative covariance structure (see Section 3). In designing experiments with randomized blocks, covariance matrices that are invariant under the automorphisms of the experiment design form a cone in an algebra with two commuting generators [1], thereby generating a commutative structure. But the designs constructed from incomplete block schemes are balanced or partially balanced [9] and generate noncommutative covariance structures. It is natural, therefore, to examine the approximation of arbitrary covariance structures by commutative ones.

Let Σi = Pi Λi Pi′, i = 1, . . . , N, be a set of covariance matrices, where Pi are orthogonal matrices and Λi are diagonal matrices. The columns of the matrices Pi form orthonormal bases Bi. The base-to-base distance ρ(B1, B2) is defined by the formula ρ²(B1, B2) = tr (P1 − P2)(P1 − P2)′. Let us approximate Σi by matrices of the form Σ̂i = P Mi P′, where P is the same orthogonal matrix, not dependent on i, and Mi are diagonal matrices. Let us choose the base B for the unknown matrix P from the condition of least deviation of B from all Bi, i = 1, . . . , N:

∑_{i=1}^{N} ρ²(Bi, B) = min over B.

Since ρ²(Bi, B) = 2n − 2 tr(Pi P′), we must minimize 2nN − 2 ∑_{i=1}^{N} tr(Pi P′) under the constraint P P′ = I. Introducing a symmetric matrix of Lagrange multipliers Λ, let us form the function

ϕ(P) = nN − tr(∑_{i=1}^{N} Pi P′) + tr(Λ(P′P − I)).

Differentiating it with respect to P, we obtain

−∑_{i=1}^{N} Pi + ΛP = 0.

Hence

P = Λ⁻¹ ∑_{i=1}^{N} Pi = Λ⁻¹Q,    where    Q = ∑_{i=1}^{N} Pi.

From the condition P P′ = I we obtain Λ⁻¹QQ′Λ⁻¹ = I. Therefore, Λ² = QQ′ and

P = (QQ′)^{−1/2} Q.    (8)
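Formula (8) is the orthogonal polar factor of Q and can be evaluated stably through the SVD: if Q = U D V′, then (QQ′)^{−1/2}Q = UV′. A sketch comparing the two evaluations (the number and size of the Pi are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n, N = 3, 4
Ps = [np.linalg.qr(rng.normal(size=(n, n)))[0] for _ in range(N)]

Q = sum(Ps)                         # Q = P_1 + ... + P_N
U, d, Vt = np.linalg.svd(Q)
P = U @ Vt                          # (Q Q')^{-1/2} Q, the closest orthogonal matrix to Q

# Direct evaluation of (8) for comparison, via the eigendecomposition of QQ'.
w, S = np.linalg.eigh(Q @ Q.T)
P_direct = S @ np.diag(w ** -0.5) @ S.T @ Q
```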

What now remains is to find the diagonal matrices Mi, i = 1, . . . , N. Let Σ be one of the matrices Σi. Let us project Σ onto the linear manifold of matrices of the form P M P′, where M is a diagonal matrix, using the scalar product (A, B) = tr(AB′) of arbitrary matrices A and B. As in Section 4, let us express P M P′ as ∑_k µk Ek. From the orthogonality condition

(Σ − ∑_k µ̂k Ek, Ej) = 0

we obtain (Σ, Ej) = ∑_k µ̂k (Ek, Ej), or (Σ, Ej) = µ̂j (Ej, Ej) for all j. Hence

µ̂j = tr(ΣEj)/tr Ej.    (9)
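The projection step (9) can be sketched as follows (the base P and the matrix Σ are arbitrary illustrative inputs; one-dimensional eigensubspaces are assumed, so Ej = pj pj′):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 3
P, _ = np.linalg.qr(rng.normal(size=(n, n)))        # fixed common base
E = [np.outer(P[:, j], P[:, j]) for j in range(n)]  # one-dimensional primitive idempotents

M = rng.normal(size=(n, n))
Sigma = M @ M.T + np.eye(n)                         # the matrix to be projected

mu = np.array([np.trace(Sigma @ Ej) / np.trace(Ej) for Ej in E])  # formula (9)
Sigma_proj = sum(m * Ej for m, Ej in zip(mu, E))    # P M P' with M = diag(mu)

# The residual is orthogonal to the manifold in the trace scalar product.
resid = Sigma - Sigma_proj
```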

Examples. (1) Let us approximate two noncommuting 2 × 2 matrices Σ1 and Σ2 by commuting matrices of the form P Mi P′, i = 1, 2. Note that Σk = Pk Λk Pk′, k = 1, 2, where Pk is a matrix of the form

( cos ϕk   −sin ϕk
  sin ϕk    cos ϕk ).

There is an algebraic isomorphism Pk ↔ e^{iϕk}, which we use below to simplify computations. We have

Q = P1 + P2 ↔ e^{iϕ1} + e^{iϕ2},    Q′ ↔ e^{−iϕ1} + e^{−iϕ2},    QQ′ ↔ 2(1 + cos(ϕ1 − ϕ2)),

(QQ′)^{−1/2} ↔ 1/√(2(1 + cos(ϕ1 − ϕ2))), and, finally,

P = (QQ′)^{−1/2} Q ↔ (e^{iϕ1} + e^{iϕ2}) / √(2(1 + cos(ϕ1 − ϕ2))).

Thus, according to (8), we have

P = (1/√(2(1 + cos(ϕ1 − ϕ2)))) ( cos ϕ1 + cos ϕ2   −sin ϕ1 − sin ϕ2
                                  sin ϕ1 + sin ϕ2    cos ϕ1 + cos ϕ2 ).

The idempotent matrices are

E1 = ( cos² ϕ        sin ϕ cos ϕ
       sin ϕ cos ϕ   sin² ϕ ),    E2 = (  sin² ϕ        −sin ϕ cos ϕ
                                          −sin ϕ cos ϕ   cos² ϕ ),

where ϕ = ϕ2 − ϕ1. Hence µ1 = tr(ΣE1) and µ2 = tr(ΣE2) are given by formula (9):

µ1 = σ11 cos² ϕ + σ22 sin² ϕ + σ12 sin 2ϕ,    µ2 = σ11 sin² ϕ + σ22 cos² ϕ − σ12 sin 2ϕ.

(2) Let Σ1 = P1 Λ1 P1′ and Σ2 = P2 Λ2 P2′ be noncommuting 3 × 3 matrices. Let us approximate them by commuting matrices P M1 P′ and P M2 P′, respectively. The orthogonal matrix P is conveniently determined from the well-known isomorphism P ↔ U, where U is a unitary matrix:

U = (  α   β
      −β̄   ᾱ ),    α, β ∈ C,    |α|² + |β|² = 1.

The transposed matrix P′ corresponds to the matrix U*, the Hermitian conjugate of U. Using the Cayley–Klein parameters α and β (the bar denotes complex conjugation), we can express any 3 × 3 orthogonal matrix P as

P = ( (α² − β² + ᾱ² − β̄²)/2    −i(α² + β² − ᾱ² − β̄²)/2    −(αβ + ᾱβ̄)
      i(α² − β² − ᾱ² + β̄²)/2    (α² + β² + ᾱ² + β̄²)/2     −i(αβ − ᾱβ̄)
      αβ̄ + ᾱβ                   i(ᾱβ − αβ̄)                αᾱ − ββ̄ ).    (10)

Indeed, the unitary matrix U transforms a 3-dimensional column vector with coordinates x, y, and z into a vector with coordinates x′, y′, and z′ by the rule R′ = U R U*, where

R = ( z        x − iy
      x + iy   −z ),    R′ = ( z′         x′ − iy′
                               x′ + iy′   −z′ ).

The canonical vectors (1, 0, 0), (0, 1, 0), and (0, 0, 1) correspond to the Pauli spin matrices

R1 = ( 0  1
       1  0 ),    R2 = ( 0  −i
                         i   0 ),    R3 = ( 1   0
                                            0  −1 ).

Therefore, the ith column in (10) is the vector (x′, y′, z′) corresponding to the matrix R′i = U Ri U*. Let us express the unitary matrix U through the Pauli matrices as

U = (  α   β
      −β̄   ᾱ ) = (1/2)(α + ᾱ)R0 + (1/2)(β − β̄)R1 + (i/2)(β + β̄)R2 + (1/2)(α − ᾱ)R3,

where R0 is the 2 × 2 unit matrix. Since the Pauli matrices obey the multiplication rules for quaternion units, i.e., Ri Rj = −Rj Ri for i ≠ j, Ri² = R0, and Ri* = Ri, we obtain QQ′ ↔ (|α1 + α2|² + |β1 + β2|²) R0. Finally, we obtain

P = (P1 + P2) / √(|α1 + α2|² + |β1 + β2|²),

where Pi is a matrix of the form (10) with parameters α = αi and β = βi, i = 1, 2.
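The correspondence U ↔ P can be implemented by conjugating the Pauli matrices, reading off each column of P from R′i = U Ri U*. A sketch (the values of α and β are arbitrary, normalized so that |α|² + |β|² = 1):

```python
import numpy as np

# Pauli matrices R1, R2, R3.
R = [np.array([[0, 1], [1, 0]], dtype=complex),
     np.array([[0, -1j], [1j, 0]]),
     np.array([[1, 0], [0, -1]], dtype=complex)]

alpha, beta = 0.6 + 0.3j, 0.2 - 0.1j
norm = np.sqrt(abs(alpha) ** 2 + abs(beta) ** 2)
alpha, beta = alpha / norm, beta / norm          # enforce |alpha|^2 + |beta|^2 = 1

U = np.array([[alpha, beta], [-np.conj(beta), np.conj(alpha)]])

def coords(Rp):
    # Invert the correspondence R = [[z, x - iy], [x + iy, -z]].
    z = Rp[0, 0]
    x = (Rp[0, 1] + Rp[1, 0]) / 2
    y = (Rp[1, 0] - Rp[0, 1]) / (2 * 1j)
    return np.real([x, y, z])

# Column i of the 3x3 rotation is the image of the ith canonical vector.
P = np.column_stack([coords(U @ Ri @ U.conj().T) for Ri in R])
```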

6. APPLICATION TO CLASSIFICATION OF GAUSSIAN VECTORS WITH COMMUTATIVE COVARIANCE STRUCTURE

For statistical classification of Gaussian vectors, commutative covariance structures are of interest for two reasons. First, the parameters of the Gaussian distributions, i.e., the vector of means mi and the covariance matrix Σi for the ith class, as a rule, are not known but are estimated from sample data. In general, a structure (mi, Σi, i = 1, . . . , s) has s(n + n(n + 1)/2) parameters, where n is the dimension of the attribute space and s is the number of classes. By contrast, to define a commutative covariance structure (mi, P Λi P′, i = 1, . . . , s), it suffices to specify sn parameters for the means, n(n − 1)/2 parameters for the common proper base (i.e., the matrix P), and sn parameters for the matrices Λi. Compared to the general case, the number of parameters is reduced by (s − 1)n(n − 1)/2. Certainly, a commutative structure is suitable for describing only those Gaussian distributions with which almost identically oriented covariance ellipsoids, differing only in the length of the major axis, are associated.

Second, for commutative covariance structures, the classifier decision function is simpler if it is expressed in a proper base of the covariance matrices Σi, i = 1, . . . , s. But the determination of the threshold for the decision rule is not simplified even for a commutative covariance structure. Thus, in the simplest case of two classes, s = 2, the Neyman–Pearson criterion yields the equation

∫_{D(c)} dN(x; m2, Σ2) = α

for the threshold c, where the integration domain is defined by the formula D(c) = {x : x′(Σ2⁻¹ − Σ1⁻¹)x + 2(m1′Σ1⁻¹ − m2′Σ2⁻¹)x > c}. Here α is the conditional probability of making a wrong decision on the membership of a sample in the first class when it actually belongs to the second class. There is no analytical method of computing the threshold c. It is determined approximately by stochastic modeling with different values of c [10], because it is not possible to find the exact distribution of the quadratic form x′(Σ2⁻¹ − Σ1⁻¹)x + 2(m1′Σ1⁻¹ − m2′Σ2⁻¹)x if the distribution of x is of the form N(·; m2, Σ2). The distribution of a quadratic statistic is known only in certain particular cases [11]. The probabilities of classification errors under a given threshold can sometimes be determined by approximate methods [12].

In conclusion, it is worthwhile to compare, for s = 2 (two statistical classes), the exact solution of the problem of reducing two covariance matrices Σ1 and Σ2 to canonical form with its approximate solution Σ̂1 = P Λ1 P′, Σ̂2 = P Λ2 P′, with the matrix P defined by formula (8). Let Σ1 be a nonsingular matrix. According to the theorem on the simultaneous reduction of two quadratic forms with coefficient matrices Σ1 and Σ2, there exists a matrix F such that F′Σ1F = I and F′Σ2F = N, where N = diag(ν1, . . . , νn) and νi are the roots of the equation |Σ1⁻¹Σ2 − νI| = 0. In general, F is not a unitary matrix, and the common base {f1, . . . , fn}, where fi is the ith column of the matrix F, is not a proper base for the pair (Σ1, Σ2). To find F, we may note that the matrices Σ1⁻¹Σ2 and Σ1^{−1/2}Σ2Σ1^{−1/2} have the same spectrum, i.e., the νi coincide with the roots of the equation |Σ1^{−1/2}Σ2Σ1^{−1/2} − νI| = 0. The matrix Σ1^{−1/2}Σ2Σ1^{−1/2} is Hermitian. Therefore, (a) all νi are real, (b) U*Σ1^{−1/2}Σ2Σ1^{−1/2}U = N for some unitary matrix U, and (c) it suffices to take F = Σ1^{−1/2}U for determining the unknown matrix F.
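The threshold c can be estimated by stochastic modeling as mentioned above; the sketch below (class parameters, level α, and sample size are arbitrary illustrations) takes c as the empirical (1 − α)-quantile of the discriminant statistic under N(m2, Σ2):

```python
import numpy as np

rng = np.random.default_rng(7)
alpha = 0.05
m1, m2 = np.array([0.0, 0.0]), np.array([1.0, 1.0])
S1 = np.array([[2.0, 0.3], [0.3, 1.0]])
S2 = np.array([[1.0, -0.2], [-0.2, 1.5]])
S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)

# Discriminant statistic g(x) = x'(S2^{-1} - S1^{-1})x + 2(m1'S1^{-1} - m2'S2^{-1})x,
# evaluated on a large sample from the second class.
x = rng.multivariate_normal(m2, S2, size=100000)
quad = np.einsum('ni,ij,nj->n', x, S2i - S1i, x)
lin = x @ (2.0 * (m1 @ S1i - m2 @ S2i))
g = quad + lin

c = np.quantile(g, 1 - alpha)   # P(g > c | second class) is approximately alpha
```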
Below, in comparing the exact solution F with the approximate solution F̂ = P, where P is defined by formula (8), we restrict ourselves, for simplicity of presentation, to 2 × 2 matrices. First let us consider the exact solution F. The roots ν1 and ν2 are determined from the characteristic equation

ν² − ν tr(Σ1⁻¹Σ2) + det(Σ1⁻¹Σ2) = 0.

Assuming, as in Example 1, that Σ1 = P1Λ1P1′, Σ2 = P2Λ2P2′, Λ1 = diag(λ11, λ12), Λ2 = diag(λ21, λ22), P1 ↔ e^{iϕ1}, and P2 ↔ e^{iϕ2}, we obtain det(Σ1⁻¹Σ2) = det Λ2/det Λ1 and tr(Σ1⁻¹Σ2) = tr(Λ1⁻¹RΛ2R′), where R := P1′P2. Simple computation gives

det Σ1⁻¹Σ2 = λ21λ22/(λ11λ12),

tr Σ1⁻¹Σ2 = (λ21/λ11 + λ22/λ12) cos² ϕ + (λ22/λ11 + λ21/λ12) sin² ϕ,    ϕ = ϕ2 − ϕ1,

A := Λ1⁻¹RΛ2R′ = ( (λ21/λ11) cos² ϕ + (λ22/λ11) sin² ϕ    ((λ21 − λ22)/λ11) sin ϕ cos ϕ
                   ((λ21 − λ22)/λ12) sin ϕ cos ϕ           (λ21/λ12) sin² ϕ + (λ22/λ12) cos² ϕ ).

The eigenvectors u1, . . . , un of the matrix Σ1^{−1/2}Σ2Σ1^{−1/2} can be found from the equations Σ1^{−1/2}Σ2Σ1^{−1/2}ui = νi ui or, which is the same thing, from the equations Avi = νi vi, where vi = Λ1^{−1/2}P1′ui, i = 1, 2. Indeed, it is easy to verify that Σ1^{−1/2}Σ2Σ1^{−1/2} = (Λ1^{−1/2}P1′)⁻¹ A (Λ1^{−1/2}P1′). The vectors f1 and f2 of the canonical (but not proper) base of Σ1 and Σ2 are determined as fi = Σ1^{−1/2}ui = P1Λ1^{−1/2}P1′ · P1Λ1^{1/2}vi = P1vi, i.e., they result from rotating the vectors v1 and v2 through ϕ1.

We now consider a numerical example. Let ϕ1 = π/6, ϕ2 = π/4, λ11 = 1, λ12 = 2/3, λ21 = 2, and λ22 = 3/2. We have ϕ = ϕ2 − ϕ1 = π/12, ν1ν2 = 9/2, and

ν1 + ν2 = tr A = (λ21/λ11 + λ22/λ12) cos² ϕ + (λ22/λ11 + λ21/λ12) sin² ϕ = (17/4) cos²(π/12) + (9/2) sin²(π/12) = 4.2662.

Hence ν1 = 2.3561 and ν2 = 1.9101. Let us compute the eigenvectors vi of the matrix A. The equation (A − νiI)vi = 0 implies that the product of any row vector of the matrix A − νiI by the column vector vi is zero. It suffices to consider one (for example, the first) row, because A − νiI is a matrix of rank 1. Denoting the first row of the matrix A − νiI by (a(νi), b), we obtain

a(νi) = (λ21/λ11) cos² ϕ + (λ22/λ11) sin² ϕ − νi = −0.4897 for i = 1 and 0.0565 for i = 2,

b = (1/2)(λ21/λ11 − λ22/λ11) sin 2ϕ = 0.1250.

The orthogonality condition (a(νi), b) vi = 0 shows that v1 and v2 are proportional to the vectors

w1 = +(0.1250, 0.4897)′,    w2 = −(0.1250, −0.0565)′,

respectively. The signs before the brackets are chosen such that w1 lies in the first quadrant of the coordinate system OXY and the base (w1, w2) is positively oriented. The vectors wi = αivi must be normalized by the conditions 1 = ui′ui = vi′Λ1vi = wi′Λ1wi/αi², so that αi = √(wi′Λ1wi). Since √(w1′Λ1w1) = 0.419 and √(w2′Λ1w2) = 0.133, it is not difficult to find the vectors v1 and v2 first and then f1 and f2. Computations show that ‖f1‖ = 1.205, ‖f2‖ = 1.003, and the angle between f1 and f2 is 80°00′.

The approximate solution, according to (8), is the orthogonal matrix F̂ = P ↔ e^{iϕ̂}, where cos ϕ̂ = 0.7934, which corresponds to the angle ϕ̂ = 37°30′. Hence

F̂ = ( 0.7934   −0.6088
       0.6088    0.7934 ).
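The exact solution F can be reproduced numerically for the same data (ϕ1 = π/6, ϕ2 = π/4, Λ1 = diag(1, 2/3), Λ2 = diag(2, 3/2)); a sketch using Σ1^{−1/2} and the unitary factor U:

```python
import numpy as np

def rot(phi):
    return np.array([[np.cos(phi), -np.sin(phi)],
                     [np.sin(phi),  np.cos(phi)]])

S1 = rot(np.pi / 6) @ np.diag([1.0, 2 / 3]) @ rot(np.pi / 6).T
S2 = rot(np.pi / 4) @ np.diag([2.0, 1.5]) @ rot(np.pi / 4).T

# S1^{-1/2} via the eigendecomposition of S1.
w, V = np.linalg.eigh(S1)
S1_mhalf = V @ np.diag(w ** -0.5) @ V.T

nu, U = np.linalg.eigh(S1_mhalf @ S2 @ S1_mhalf)  # roots nu_i and the unitary factor U
F = S1_mhalf @ U                                  # F'S1F = I, F'S2F = diag(nu)
```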

Let f̂1 and f̂2 denote the columns of the matrix F̂. Comparing the solutions, we find that the cone {f1, f2} lies wholly within the interior of the 90°-cone {f̂1, f̂2}. Moreover, the mismatch between f̂1 and f1 is 6°50′, whereas that between f2 and f̂2 is only 3°10′.

7. CONCLUSIONS

Commutative approximation of an arbitrary covariance structure is studied, probably for the first time, in this paper. The approximation designed here is not the only one possible, and it is worthwhile to examine other approximations, not as simple in formulation as ours, as well as other methods of solving this approximation problem. Commutative structures are attractive in applications for the extreme simplicity of their algebraic analysis. As applied to statistical classification on commutative covariance structures, what matters is to find methods for approximate estimation of the classification error probability for such structures and of the deviations of the approximate estimates of the probabilities from their true values. At present, error probabilities for large-dimensional arbitrary covariance structures are estimated exclusively by statistical modeling methods.

REFERENCES

1. Sysoev, L.P. and Shaikin, M.E., Optimal Estimates of Parameters in Regression Models of Special Covariance Structure and Their Application in Two-Factor Experiments, Avtom. Telemekh., 1981, no. 6, pp. 44–56.
2. Anderson, T.W., Statistical Inference for Covariance Matrices with Linear Structure, in Time Series Analysis, Rosenblatt, M., Ed., New York: Wiley, 1963.
3. Srivastava, J.N., On Testing Hypotheses Regarding a Class of Covariance Structures, Psychometrika, 1966, vol. 31, no. 1, pp. 147–164.
4. Glazman, I.M. and Lyubich, Yu.I., Konechnomernyi lineinyi analiz (Finite-Dimensional Linear Analysis), Moscow: Nauka, 1969.
5. Gantmakher, F.R., Teoriya matrits, Moscow: Nauka, 1967. Translated under the title The Theory of Matrices, New York: Chelsea, 1959.
6. Marcus, M. and Minc, H., A Survey of Matrix Theory and Matrix Inequalities, Boston: Allyn and Bacon, 1964. Translated under the title Obzor po teorii matrits i matrichnykh neravenstv, Moscow: Nauka, 1972.
7. Halmos, P.R., Finite-Dimensional Vector Spaces, New York: Springer-Verlag, 1974. Translated under the title Konechnomernye vektornye prostranstva, Moscow: Fizmatgiz, 1963.
8. Pukhal'skii, E.A., Computation of Invariants in Classification of Covariance Structures, Avtom. Telemekh., 1986, no. 4, pp. 68–75.
9. Shaikin, M.E., The Algebraic Structure of PBIB-Plans of Variance Analysis and Its Application to Multifactor Experiments with Interaction, Avtom. Telemekh., 1997, no. 11, pp. 90–101.
10. Pugachev, V.S., Teoriya veroyatnostei i matematicheskaya statistika (Probability Theory and Mathematical Statistics), Moscow: Nauka, 1979.
11. Kostylev, V.I., Distribution of the Sum of Two Independent Gamma-Statistics, Radiotekhn. Elektron., 2001, vol. 46, no. 5, pp. 530–533.
12. Jain, A., Moulin, P., Miller, M.I., et al., Information-Theoretic Bounds on Target Recognition Performance, IEEE Trans. PAMI, 2002, vol. 24, no. 9, pp. 1153–1166.

This paper was recommended for publication by V.A. Lototskii, a member of the Editorial Board.