The Orthogonally Constrained Regression Revisited

Moody T. CHU and Nickolay T. TRENDAFILOV

The Penrose regression problem, including the orthonormal Procrustes problem and the rotation problem to a partially specified target, is an important class of data matching problems arising frequently in multivariate analysis, yet its optimality conditions have never been clearly understood. This work offers a way to calculate the projected gradient and the projected Hessian explicitly. One consequence of this calculation is the complete characterization of the first order and the second order necessary and sufficient optimality conditions for this problem. Another application is the natural formulation of a continuous steepest descent flow that can serve as a globally convergent numerical method. Applications to the orthonormal Procrustes problem and the Penrose regression problem with partially specified target are demonstrated in this article. Finally, some numerical results are reported and commented on.

Key Words: Continuous-time approach; Penrose regression; Procrustes rotation; Rotation to partially specified target; Projected gradient; Projected Hessian; Optimality conditions.

1. INTRODUCTION

The problem of matching data matrices to maximal agreement by orthogonal rotations arises in many disciplines. A concise and instructive discussion of its application to factor analysis and multidimensional scaling can be found in Gower (1984). For general consideration, ten Berge and Knol (1984) proposed a taxonomy of matching procedures according to properties of gauging criteria, orthogonality, simultaneity, generality, and symmetry involved in the underlying problem. In this article we derive the first order and the second order optimality conditions for two of the most important cases in the family of problems. Our result includes as a special case what is already known in the literature and appears to be the strongest possible provision for assessing a local optimizer.

Moody T. Chu is Professor, Department of Mathematics, North Carolina State University, Raleigh, NC 27695-8205 (E-mail: [email protected]). Nickolay T. Trendafilov is Professor, Laboratory of Computational Stochastics, Institute of Mathematics and Informatics, Bulgarian Academy of Sciences; address for correspondence: Department of Mechanical Engineering, University of Strathclyde, 75 Montrose Street, Glasgow G1 1XJ (E-mail: [email protected]). The first author's research was supported in part by the National Science Foundation under grant DMS-942228. The second author's research was performed while visiting SISTA/ESAT, Katholieke Universiteit Leuven, and was supported by DWTC, Flemish Government, Belgium.

© 2001 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America. Journal of Computational and Graphical Statistics, Volume 10, Number 4, Pages 746-771.


Table 1. Classification of the Penrose Regression Problems

    minimize ‖AQC − B‖       Weighted Procrustes                          Unweighted Procrustes (C = I_q)
    orthonormal Q, p ≥ q     Penrose regression problem (PRP),            orthonormal Procrustes problem (OPP),
                             symmetric (p = q) or asymmetric (p > q)      symmetric (p = q) or asymmetric (p > q)
    orthogonal Q, p = q      symmetric                                    orthogonal Procrustes problem, Q = VU^T

The main problem to be considered is the so-called Penrose regression problem (PRP), also known as the weighted orthonormal Procrustes problem. Given fixed A ∈ R^{n×p}, C ∈ R^{q×m}, and B ∈ R^{n×m}, the PRP is the optimization problem:

    minimize   ‖AQC − B‖                                   (1.1)
    subject to Q ∈ R^{p×q}, Q^T Q = I_q,                   (1.2)

where p ≥ q is assumed and I_q stands for the q × q identity matrix. In general, n and m can be any numbers. If n ≥ p ≥ q ≥ m and A and C are full column-rank matrices, then the original PRP (1.1)-(1.2) can be transformed into a PRP with A and C square (upper-triangular). A related secondary problem, the well-known orthonormal Procrustes problem (OPP) that has been of great interest in regression theory, is a special case of the PRP with m = q and C = I_q. Thus, we can immediately apply our results for the PRP to the OPP. When p = q, the variable Q is an orthogonal matrix (Q^T Q = QQ^T = I). This is the so-called symmetric problem according to the taxonomy of ten Berge and Knol (1984). In case C = I_q (= I_p), the OPP is also known as the orthogonal Procrustes problem, whose optimal solution is well understood. Indeed, the solution of the orthogonal Procrustes problem is given by Q = VU^T, where V and U are the orthogonal matrices involved in the singular value decomposition A^T B = VΣU^T (Green 1952; Golub and van Loan 1991). The corresponding symmetric PRP was discussed in detail by Chu and Trendafilov (1998). The more interesting yet challenging case is the so-called asymmetric problem where p ≠ q (we shall assume henceforth p > q and not address the symmetric problem further at all). Table 1 gives a helpful classification. We are not aware of any direct solution for the asymmetric problem. Several indirect methods are worth mentioning: Green (1952) and Gower (1984) suggested an iterative scheme for the OPP that seems to work well in practice, although they were unable to prove its convergence (Gower 1984; ten Berge and Knol 1984). Mooijaart and Commandeur (1990) proposed an iterative algorithm for the symmetric PRP (p = q), based on plane rotations. It can also be applied to the asymmetric PRP (p > q) by embedding the unknown Q ∈ R^{p×q} in R^{p×p} (Commandeur, personal comm.), but it is quite inefficient numerically (see Section 6).
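For the orthogonal (p = q, C = I) case mentioned above, the SVD solution is easy to state in code. The following NumPy sketch (the function name is ours; the paper's own numerics use MATLAB) computes Q = VU^T from the SVD A^T B = VΣU^T and checks it on a constructed example:

```python
import numpy as np

def orthogonal_procrustes(A, B):
    """Orthogonal (p = q) Procrustes: minimize ||AQ - B|| over Q^T Q = Q Q^T = I.
    With the SVD A^T B = V Sigma U^T, the minimizer is Q = V U^T (Green 1952)."""
    V, _, Ut = np.linalg.svd(A.T @ B)  # NumPy returns U^T as the third factor
    return V @ Ut

# Sanity check: if B = A Q0 for an orthogonal Q0 and A has full column rank,
# the recovered Q equals Q0.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
Q0 = np.linalg.qr(rng.standard_normal((4, 4)))[0]
B = A @ Q0
Q = orthogonal_procrustes(A, B)
assert np.allclose(Q, Q0) and np.allclose(Q.T @ Q, np.eye(4))
```

No such closed form is available for the asymmetric (p > q) problem, which is the point of the iterative and flow-based methods discussed next.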
Koschat and Swayne (1991) proposed yet another iterative method for the PRP when the weight C is diagonal. The algorithm can be extended to handle the unrestricted PRP with an arbitrary q × m matrix C. The method is based on the idea of first embedding the asymmetric problem into a symmetric problem, for which a solution is easy to find, and then repeatedly updating the embedded system. Another approach for


solving the unrestricted PRP based on the majorization idea was proposed by Kiers (1990) and refined by Kiers and ten Berge (1992). In an additional step, the unconstrained solution is updated to fulfill (1.2). Kiers and ten Berge (1992, p. 375) also proved that Koschat and Swayne's algorithm is a special case of their refined majorization.

The method adopted in this article differs from the methods listed above in two ways. First, it works directly on the feasible set (1.2) of the PRP, and second, the solution evolves continuously in time. An important feature of our approach is that related problems can be treated in a completely unified manner. To show this, besides the symmetric and asymmetric PRP, we also solve the unsolved weighted Procrustes rotation problem to a partially specified target; that is, the case when some elements of the matrix B in (1.1) are fixed and others are not specified. The approach adopted is rather general and can be applied to other constrained least-squares problems in data analysis.

Our approach is based on the recent development of the so-called continuous realization methods and their applications; for example, QR, EVD, and SVD algorithms in numerical linear algebra; Rayleigh quotient and balanced matrix factorizations in signal processing; and so on. See Chu (1994) for a brief review of this subject; for a detailed study see Helmke and Moore (1994). This formulation often has the advantage of furnishing better understanding of the global properties of the solution of the underlying problem based on the theory of dynamical systems, Lyapunov stability, and so on. The idea of the continuous-time approach is that certain numerical methods can be thought of as a discretization of a dynamical system governing a flow that starts at a certain initial state and evolves until it reaches an equilibrium point. In this work we propose and apply the projected gradient approach, a specific continuous-time method, to reconsider the PRP (1.1)-(1.2).
Specifically, we shall first study the topology of the set of all p × q (p ≥ q) orthonormal matrices

    O(p, q) := {Q ∈ R^{p×q} | Q^T Q = I_q},                (1.3)

which forms a smooth manifold known also as the Stiefel manifold. Then we show that a gradient flow for the PRP can be derived without difficulty. Our approach here is similar in spirit to that in Chu and Driessel (1990) and Chu and Trendafilov (1998). The results for the PRP include what is known for the OPP as a special case. We derive matrix ODEs describing gradient flows for both of the problems. As a by-product, we can specify the necessary and the sufficient optimality conditions characterizing the PRP (and OPP) optimizers. Green (1952) and Gower (1984) have suggested for the OPP that the equation

    A^T A Q − A^T B = QS                                   (1.4)

must be satisfied for some q × q symmetric matrix S at an optimal solution Q of the OPP (Gower 1984, p. 767). It appears that the best known optimality condition for the OPP is due to ten Berge (1977a), who showed that at a minimizer Q the corresponding matrix B^T AQ must be symmetric and positive semi-definite. We shall see in the sequel that this necessary condition is a special case of our more general result. The theory for the PRP is far from clear. It is the contribution of this article to completely characterize, respectively, the first order and the second order necessary conditions and sufficient conditions for the optimality of the PRP.


This article is organized as follows: Some important topological properties, particularly the tangent space, of the manifold O(p, q) are briefly discussed in Section 2. These properties are then applied to the PRP in Section 3. We derive the projected gradient and the projected Hessian of a certain objective function, which enables us to specify the first order and the second order sufficient and necessary conditions for a local minimizer of the PRP. The same idea can be extended to the OPP, which is stated in Section 4 as an application. Next, in Section 5, the weighted orthonormal Procrustes rotation problem to a partially specified target is solved by means of the projected gradient approach. Finally, we present some numerical experiments in Section 6 to illustrate the evolution of the dynamics and to compare the proposed algorithm to the existing ones.

2. STIEFEL MANIFOLD

The set O(p, q) of all p × q real matrices with orthonormal columns forms a smooth manifold of dimension q(q − 1)/2 + q(p − q) in R^{p×q}. It appears that Stiefel was the first person to study its topology in detail (Stiefel 1935-1936). For a quick introduction to this Stiefel manifold, we recommend the article by Edelman, Arias, and Smith (1998). We outline in this section some main points that will be needed in the discussion. Even though this article is about the asymmetric (p > q) PRP, the considerations in this section are valid for p ≥ q. We shall regard O(p, q) as embedded in the pq-dimensional Euclidean space R^{p×q} equipped with the Frobenius inner product

    ⟨X, Y⟩ := trace(XY^T)                                  (2.1)

for any X, Y ∈ R^{p×q}. Hereafter we suppose that Q depends on the real parameter t, such that for all t ∈ R the matrix Q(t) is orthonormal; that is, Q(t) forms a one-parameter family of p × q orthonormal matrices. Thus, we regard Q(t) as a curve evolving on O(p, q) (we write Q, for short). By definition, a tangent vector H of O(p, q) at Q is the velocity of a smooth curve Q(t) ∈ O(p, q) at t = 0. Having an explicit expression of the feasible set O(p, q), we obtain by differentiating Q^T Q = I_q at t = 0 that

    Q̇^T Q |_{t=0} + Q^T Q̇ |_{t=0} = H^T Q + Q^T H = 0_q,   (2.2)

and thus the tangent space T_Q O(p, q) at any orthonormal matrix Q ∈ O(p, q) is given by

    T_Q O(p, q) = {H ∈ R^{p×q} | H^T Q + Q^T H = 0_q}
                = {H ∈ R^{p×q} | Q^T H is skew-symmetric}.  (2.3)

To further characterize a tangent vector, we recall that a least squares solution X to the equation MX = N is given by

    X = M† N + (I − M† M)W,


where M† is the Moore-Penrose inverse of M, I is an identity matrix, and W is an arbitrary matrix of proper dimension. Applied to our case with M = Q^T, where Q ∈ O(p, q), and N = K ∈ R^{q×q}, where K is skew-symmetric, we note that (Q^T)† = Q. The following theorem therefore follows.

Theorem 1. Any tangent vector H ∈ T_Q O(p, q) has the form

    H = QK + (I_p − QQ^T)W,                                (2.4)

where K ∈ R^{q×q} is skew-symmetric and W ∈ R^{p×q} is arbitrary. When p = q, H = QK.

For convenience, we shall abbreviate I_p as I. Define

    S(q) := {all symmetric matrices in R^{q×q}},           (2.5)

and

    S(q)⊥ := {all skew-symmetric matrices in R^{q×q}},     (2.6)

which is simply the orthogonal complement of S(q) with respect to the Frobenius inner product (2.1). It is not difficult to check by dimension counting arguments that the normal space of O(p, q) at any orthonormal matrix Q is given by

    N_Q O(p, q) = QS(q).                                   (2.7)

Finally, denote by N(Q^T) := {X ∈ R^{p×q} | Q^T X = 0} the null-space of Q^T. Thus, Theorem 1 can be rewritten as the following decomposition of the space R^{p×q}.

Theorem 2. The space R^{p×q} can be written as the direct sum of three mutually perpendicular subspaces

    R^{p×q} = QS(q) ⊕ QS(q)⊥ ⊕ N(Q^T).                     (2.8)

Therefore we are able to define the following projections.

Corollary 1. Let Z ∈ R^{p×q}. Then

    π_T(Z) := Q (Q^T Z − Z^T Q)/2 + (I − QQ^T)Z            (2.9)

defines the projection of Z onto the tangent space T_Q O(p, q). Similarly,

    π_N(Z) := Q (Q^T Z + Z^T Q)/2                          (2.10)

defines the projection of Z onto the normal space N_Q O(p, q).
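The projections (2.9) and (2.10) are straightforward to implement. A small NumPy sketch (function names ours) verifying that they split an arbitrary Z into a tangent part, characterized by (2.3), and a normal part of the form (2.7):

```python
import numpy as np

def proj_tangent(Q, Z):
    # pi_T(Z) = Q (Q^T Z - Z^T Q)/2 + (I - Q Q^T) Z, Equation (2.9)
    return Q @ (Q.T @ Z - Z.T @ Q) / 2 + (np.eye(len(Q)) - Q @ Q.T) @ Z

def proj_normal(Q, Z):
    # pi_N(Z) = Q (Q^T Z + Z^T Q)/2, Equation (2.10)
    return Q @ (Q.T @ Z + Z.T @ Q) / 2

rng = np.random.default_rng(1)
p, q = 5, 3
Q = np.linalg.qr(rng.standard_normal((p, q)))[0]   # a point on O(p, q)
Z = rng.standard_normal((p, q))

T, N = proj_tangent(Q, Z), proj_normal(Q, Z)
assert np.allclose(T + N, Z)                  # the two projections decompose Z
assert np.allclose(Q.T @ T, -(Q.T @ T).T)     # Q^T pi_T(Z) is skew-symmetric, (2.3)
assert np.allclose(Q.T @ N, (Q.T @ N).T)      # pi_N(Z) = Q S with S symmetric, (2.7)
```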

3. PENROSE REGRESSION PROBLEM

Given matrices A ∈ R^{n×p}, C ∈ R^{q×m}, and B ∈ R^{n×m}, consider the function E : R^{p×q} → R defined by

    E(Q) := (1/2) ⟨AQC − B, AQC − B⟩,                      (3.1)


where ⟨·,·⟩ is defined in (2.1). Clearly, the PRP is equivalent to the minimization of the function E(Q) over the feasible set O(p, q). With respect to the Frobenius inner product, the gradient ∇E(Q) of E(Q) should be interpreted as the matrix

    ∇E(Q) = A^T (AQC − B) C^T.                             (3.2)

Suppose the projection g(Q) of the gradient ∇E(Q) onto the tangent space T_Q O(p, q) can be computed explicitly. Then the ODE

    dQ/dt = −g(Q)                                          (3.3)

naturally defines the steepest descent flow for the function E on the feasible set O(p, q). By applying Corollary 1, we can find this projected gradient g(Q), and thus the steepest descent flow for E(Q) in (3.3) is characterized by the following ODE:

    dQ/dt = (Q/2) (C(AQC − B)^T AQ − Q^T A^T (AQC − B) C^T)
            − (I − QQ^T) A^T (AQC − B) C^T.                (3.4)
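The right-hand side of (3.4) is exactly −π_T(∇E(Q)) and can be coded directly. The NumPy sketch below (names ours) checks that it is tangent to O(p, q) and is a descent direction for E:

```python
import numpy as np

def prp_grad(Q, A, B, C):
    """Unprojected gradient (3.2): grad E(Q) = A^T (A Q C - B) C^T."""
    return A.T @ (A @ Q @ C - B) @ C.T

def prp_flow_rhs(Q, A, B, C):
    """Right-hand side of the steepest descent flow (3.4), i.e. -pi_T(grad E)."""
    G = prp_grad(Q, A, B, C)
    return Q @ (G.T @ Q - Q.T @ G) / 2 - (np.eye(len(Q)) - Q @ Q.T) @ G

rng = np.random.default_rng(2)
n, p, q, m = 6, 5, 3, 4
A, C = rng.standard_normal((n, p)), rng.standard_normal((q, m))
B = rng.standard_normal((n, m))
Q = np.linalg.qr(rng.standard_normal((p, q)))[0]

H = prp_flow_rhs(Q, A, B, C)
assert np.allclose(Q.T @ H, -(Q.T @ H).T)              # the flow stays tangent to O(p, q)
assert np.trace(prp_grad(Q, A, B, C) @ H.T) <= 1e-12   # <grad E, dQ/dt> <= 0: E decreases
```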

Starting with an initial value satisfying Q(0)^T Q(0) = I_q, say

    Q(0) = [ I_q
              0  ],

we may trace the corresponding integral curve of (3.4) to its limit point, which will be an approximate solution to the PRP. In Section 6 we demonstrate the numerical solution of (3.4) making use of the MATLAB ODE suite (Shampine and Reichelt 1997). An alternative approach for solving matrix ODEs of this kind was proposed by Diele, Lopez, and Peluso (1998). For a general approach to solving matrix ODEs on differentiable manifolds see, for example, Eng, Marthinsen, and Munthe-Kaas (1997).

Knowledge of the projected gradient g(Q) explicitly provides information about the first order optimality condition.

Theorem 3. A necessary condition for Q ∈ O(p, q) to be a stationary point of the PRP is that the following two conditions hold simultaneously:
(a). C(AQC − B)^T AQ is symmetric, and
(b). (I − QQ^T) A^T (AQC − B) C^T = 0.

Proof: For Q to be a stationary point, it is necessary that g(Q) = 0. Since the two terms in (3.4) are mutually perpendicular by Theorem 2, each individual term must be zero by itself. Condition (a) follows from premultiplying the first factor in (3.4) by Q^T.

Now, consider the Lagrangian L(Q, S) associated with the PRP, where (Q, S) ∈ R^{p×q} × S(q):

    L(Q, S) = (1/2) ⟨AQC − B, AQC − B⟩ + ⟨I − Q^T Q, S⟩.

Then one can easily find that

    ∇_Q L(Q, S) = A^T (AQC − B) C^T − QS = 0               (3.5)


is a necessary condition for Q to solve the PRP. Multiplying (3.5) by Q^T we obtain

    Q^T A^T (AQC − B) C^T = S ∈ S(q),                      (3.6)

and substituting it in (3.5) gives

    (I − QQ^T) A^T (AQC − B) C^T = 0.                      (3.7)

Thus, making use of the classical Lagrangian approach, we establish necessary conditions identical to those given by Theorem 3. For some problems, the necessary conditions obtained by the Lagrangian approach provide equations for finding the unknowns of the original problem. Unfortunately, for most problems this is not the case, the PRP being an example. Indeed, (3.6) does not help to find Q and solve the PRP. The advantage of the projected gradient approach is that it not only gives necessary conditions for the solver, but always gives an ODE defining a path to the solution of the original problem, for example, (3.4) for the PRP.

We can also derive an explicit projected Hessian formula to further identify the stationary points. The development is based on an extension of an idea discussed by Chu and Driessel (1990). From the standard constrained optimization theory (Gill, Murray, and Wright 1981, sec. 3.4), we have the following result:

Theorem 4. At a stationary point Q ∈ O(p, q) satisfying Theorem 3, a second order necessary condition for Q to be a minimizer of the PRP is that the inequality

    ⟨Q^T A^T AQK CC^T − Q^T A^T (AQC − B) C^T K, K⟩
    + 2 ⟨A^T AQK CC^T, (I − QQ^T)W⟩
    + ⟨A^T A (I − QQ^T) W C, (I − QQ^T) W C⟩
    − ⟨(I − QQ^T) W Q^T A^T (AQC − B) C^T, (I − QQ^T) W⟩ ≥ 0    (3.8)

holds for all skew-symmetric matrices K ∈ R^{q×q} and arbitrary matrices W ∈ R^{p×q}. If (3.8) holds with strict inequality, then it is sufficient that Q is a local minimizer for the PRP.

The assertion follows from the adjoint property ⟨XY, Z⟩ = ⟨Y, X^T Z⟩ and the facts that K = −K^T for a skew-symmetric matrix K, and that Q^T (I − QQ^T) = 0 and (I − QQ^T)^2 = I − QQ^T for an orthonormal matrix Q. The condition in Theorem 4 is the best one can hope for from the general theory of nonlinear equality constrained optimization. To our knowledge, the characterization in Equation (3.8) is given here for the first time. Note that we have purposefully rewritten the last term on the left-hand side of (3.8) in the quadratic form of (I − QQ^T)W. In particular, a second-order necessary condition for a stationary point Q ∈ O(p, q) to be a minimizer of the PRP is given by the inequalities:

    ⟨Q^T A^T AQK CC^T, K⟩ + ⟨C(AQC − B)^T AQ, K^2⟩ ≥ 0      (3.9)

for all skew-symmetric matrices K in R^{q×q}, and

    ⟨A (I − QQ^T) W C, A (I − QQ^T) W C⟩
    ≥ ⟨(I − QQ^T) W Q^T A^T (AQC − B) C^T, (I − QQ^T) W⟩     (3.10)


for arbitrary matrices W ∈ R^{p×q}. We can further characterize the condition (3.9) in the following way. Note that the spectral decomposition of K^2 for any skew-symmetric matrix K ∈ R^{q×q} is necessarily of the form

    K^2 = −U Σ^2 U^T,                                      (3.11)

where U is a q × q orthogonal matrix and Σ = diag{σ_1, ..., σ_q} contains singular values with σ_{2i−1} = σ_{2i}, i = 1, ..., ⌊q/2⌋, and σ_q = 0 if q is odd. Denote the spectral decompositions of the following matrices:

    C(AQC − B)^T AQ = V Λ V^T,
    A^T A = T Φ T^T,
    CC^T = S Ψ S^T,

where each of V, S ∈ R^{q×q} and T ∈ R^{p×p} is orthogonal. Note that all entries in Φ = diag{τ_1, ..., τ_p} and Ψ = diag{ψ_1, ..., ψ_q} are nonnegative. We can rewrite the inequality (3.9) as follows:

    ⟨Φ R Ψ, R⟩ − ⟨V Λ V^T, U Σ^2 U^T⟩
      = Σ_{j=1}^{p} ( Σ_{s=1}^{q} τ_j r_{js}^2 ψ_s ) − Σ_{i=1}^{q} λ_i ( Σ_{t=1}^{q} p_{it}^2 σ_t^2 ) ≥ 0,    (3.12)

where P = (p_{it}) := V^T U and R = (r_{js}) := T^T QKS. Since the orthogonal matrix P is arbitrary, it is tempting to say that in order to maintain the inequality in (3.12) all entries λ_1, ..., λ_q must be nonpositive. That is, it appears reasonable to claim that a necessary condition for the stationary point Q ∈ O(p, q) to be a solution of the PRP is that the matrix C(AQC − B)^T AQ be negative semi-definite. The truth is that such a claim is wrong. The difficulty lies in the fact that both terms in (3.12) are of the same order O(‖K‖^2). We simply have no way of showing that λ_i ≤ 0 for all i is necessary to guarantee (3.9) for all nonzero skew-symmetric K. In fact we can argue the opposite by considering the special symmetric case. It can be shown that C(AQC − B)^T AQ being negative semi-definite is sufficient for a stationary point Q to be a solution of the symmetric PRP (Chu and Trendafilov 1998, theorem 3.5), but it simply cannot be necessary. This subtlety is in contrast to the necessary condition in Corollary 2 for the OPP to be discussed in the next section. Our general theory indicates how the presence of C complicates the conditions.

Finally, we show that the Lagrangian approach leads to the same second order optimality condition already obtained in Theorem 4. Following the standard result in constrained optimization theory (Gill, Murray, and Wright 1981, sec. 3.4) we have

    ⟨H, ∇²_{QQ} L(Q, S) H⟩ ≥ 0,   for H ∈ T_Q O(p, q)

(see Section 2), where

    ∇²_{QQ} L(Q, S) H = A^T A H CC^T − HS.

Substituting S from (3.6) and making use of the representation of H ∈ T_Q O(p, q), one arrives at a result identical to that obtained in Theorem 4.
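One way to see (3.8) at work is at a zero-residual global minimizer: if the data are built as B = AQC, then the residual terms vanish and the left-hand side of (3.8) collapses to ‖A(QK + (I − QQ^T)W)C‖², which is plainly nonnegative. The NumPy sketch below (a direct transcription of the four terms of (3.8); names ours) confirms this numerically:

```python
import numpy as np

def hessian_form(Q, A, B, C, K, W):
    """Left-hand side of (3.8) for skew-symmetric K and arbitrary W."""
    I = np.eye(len(Q))
    R = A.T @ (A @ Q @ C - B) @ C.T        # A^T (AQC - B) C^T
    Wp = (I - Q @ Q.T) @ W                 # (I - Q Q^T) W
    t1 = np.trace((Q.T @ A.T @ A @ Q @ K @ C @ C.T - Q.T @ R @ K) @ K.T)
    t2 = 2 * np.trace(A.T @ A @ Q @ K @ C @ C.T @ Wp.T)
    t3 = np.trace(A.T @ A @ Wp @ C @ (Wp @ C).T)
    t4 = -np.trace(Wp @ Q.T @ R @ Wp.T)
    return t1 + t2 + t3 + t4

rng = np.random.default_rng(3)
n, p, q, m = 6, 5, 3, 4
A, C = rng.standard_normal((n, p)), rng.standard_normal((q, m))
Q = np.linalg.qr(rng.standard_normal((p, q)))[0]
B = A @ Q @ C                              # zero residual: Q is a global minimizer

K0 = rng.standard_normal((q, q)); K = K0 - K0.T       # skew-symmetric K
W = rng.standard_normal((p, q))
val = hessian_form(Q, A, B, C, K, W)
H = Q @ K + (np.eye(p) - Q @ Q.T) @ W                 # tangent vector, form (2.4)
assert abs(val - np.linalg.norm(A @ H @ C) ** 2) < 1e-8 and val >= -1e-10
```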


4. ORTHONORMAL PROCRUSTES PROBLEM

The OPP is a special case of the PRP with m = q and C = I_q. Our point in this section is to illustrate how some of the classical results for the OPP follow quickly from our previous results for the more general PRP. The steepest descent flow for the OPP is characterized by the ODE

    dQ/dt = (Q/2) (Q^T A^T B − B^T AQ) − (I − QQ^T) A^T (AQ − B).    (4.1)

By Theorem 3, the first order optimality condition for the OPP becomes the following:

Theorem 5. For Q ∈ O(p, q) to be a stationary point of the OPP, the following two conditions must hold simultaneously:
(a). B^T AQ is symmetric, and
(b). (I − QQ^T) A^T (AQ − B) = 0.

Proof: The condition (a) in Theorem 3 is reduced to −B^T AQ being symmetric because Q^T A^T AQ is automatically symmetric.

We remark here that the conditions in Theorem 5 are equivalent to the Equation (1.4) used in the literature (Gower 1984, p. 767). This is easily seen by manipulations similar to (3.6) and (3.7). Our concern is that it has never been clear to us how (1.4) was developed for the asymmetric case. (It was probably presented at the 1977 Annual Meeting of the Psychometric Society, and probably obtained by making use of the Lagrangian approach; see (3.5).) Our theory now provides a rigorous mathematical justification. The projected Hessian and the second order optimality condition become, according to Theorem 4, as follows.

Theorem 6. At a stationary point Q ∈ O(p, q) satisfying Theorem 5, a second-order necessary condition for Q to be a minimizer of the OPP is that the inequality

    ⟨B^T AQK, K⟩ + 2 ⟨A^T AQK, (I − QQ^T)W⟩
    + ⟨A^T A (I − QQ^T) W, (I − QQ^T) W⟩
    − ⟨(I − QQ^T) W Q^T A^T (AQ − B), (I − QQ^T) W⟩ ≥ 0    (4.2)

holds for all skew-symmetric matrices K ∈ R^{q×q} and arbitrary matrices W ∈ R^{p×q}. If (4.2) holds with strict inequality, then it is sufficient that Q is a local minimizer of the OPP.

The two special cases of Theorem 6 with W = 0 and K = 0 become, respectively,

    ⟨B^T AQK, K⟩ ≥ 0                                       (4.3)

for all skew-symmetric matrices K in R^{q×q}, and

    ⟨A (I − QQ^T) W, A (I − QQ^T) W⟩
    ≥ ⟨(I − QQ^T) W Q^T A^T (AQ − B), (I − QQ^T) W⟩        (4.4)

for arbitrary matrices W ∈ R^{p×q}. We shall show below that (4.3) is equivalent to the condition known in the literature (ten Berge 1977). But (4.4) apparently is new.


Assuming the spectral decomposition (3.11) of K^2, and since B^T AQ is necessarily symmetric at any stationary point Q, let B^T AQ = V Λ V^T denote the corresponding spectral decomposition. Note that

    ⟨B^T AQK, K⟩ = ⟨V Λ V^T, U Σ^2 U^T⟩ = Σ_{i=1}^{q} λ_i ( Σ_{t=1}^{q} p_{it}^2 σ_t^2 ),

where P = (p_{ij}) = V^T U. Since the orthogonal matrix P ∈ R^{q×q} can be arbitrary, in order to maintain the inequality in (4.3) all entries λ_1, ..., λ_q must be nonnegative. We have proved the following result.

Corollary 2. A second-order necessary condition for the stationary point Q ∈ O(p, q) to be a solution of the OPP is that the matrix B^T AQ be positive semi-definite and that the inequality (4.4) be satisfied for arbitrary W ∈ R^{p×q}.

Numerical experiments show that the condition (4.4) is rather rough: the left-hand side of (4.4) considerably dominates. Thus, it is probably not very important for practical purposes.
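The flow (4.1) can be mimicked with a crude explicit Euler scheme combined with the QR re-orthonormalization described in Section 6 (the paper itself integrates with MATLAB's ode15s; this sketch, with names and step sizes our own, only illustrates that the limit satisfies Theorem 5(a) and the positive semi-definiteness of Corollary 2):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, q = 6, 5, 3
A = rng.standard_normal((n, p))
B = A @ np.linalg.qr(rng.standard_normal((p, q)))[0]   # zero-residual OPP data
AtA, AtB = A.T @ A, A.T @ B

def opp_flow_step(Q, h):
    """One explicit Euler step of the OPP flow (4.1), followed by the QR
    re-orthonormalization with diag(R) > 0 discussed in Section 6."""
    dQ = Q @ (Q.T @ AtB - AtB.T @ Q) / 2 - (np.eye(p) - Q @ Q.T) @ (AtA @ Q - AtB)
    Qn, R = np.linalg.qr(Q + h * dQ)
    return Qn * np.sign(np.diag(R))

Q = np.linalg.qr(rng.standard_normal((p, q)))[0]
E0 = 0.5 * np.linalg.norm(A @ Q - B) ** 2
for _ in range(20000):
    Q = opp_flow_step(Q, 1e-2)

S = B.T @ A @ Q
assert 0.5 * np.linalg.norm(A @ Q - B) ** 2 < E0        # the flow descends
assert np.allclose(S, S.T, atol=1e-3)                   # Theorem 5(a) at the limit
assert np.linalg.eigvalsh((S + S.T) / 2).min() > -1e-3  # Corollary 2: B^T AQ is PSD
```

A fixed-step Euler scheme is far cruder than the stiff solver used in Section 6 and is shown here only to make the flow concrete.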

5. PENROSE REGRESSION PROBLEM WITH PARTIALLY SPECIFIED TARGET

All of the problems considered in this work arise in factor analysis and multidimensional scaling for the transformation (rotation) of a certain initial solution to a prescribed simple structure solution (Gower 1984). The structure of the desired solution is given by a so-called target matrix. The aim of the PRP is to find the best fit of the initial solution to the desired one in the least-squares sense. In terms of the formal definition (1.1) of the PRP, the initial solution A should be rotated by Q and then weighted by C so as to fit the target B as well as possible. In many practical situations the target matrix is not known entirely, or the user wants to fix some of its elements while leaving the rest unspecified. In such situations one is interested in fitting only the specified elements of the target. This problem is known as the Procrustes rotation problem to a partially specified target. Browne (1972) solved this problem for orthogonal rotation Q and without weights C. In this section we solve the more general weighted orthonormal Procrustes rotation problem to a partially specified target. We see that without any additional effort the projected gradient approach leads to its solution.

For given fixed A ∈ R^{n×p}, C ∈ R^{q×m}, and V ∈ R^{n×m}, and given partially fixed B ∈ R^{n×m}, the weighted orthonormal Procrustes rotation problem to a partially specified target concerns the optimization:

    minimize   ‖(AQC − B) ⊙ V‖                             (5.1)
    subject to Q ∈ R^{p×q}, Q^T Q = I_q,                   (5.2)


where the matrix V = {v_ij} is defined as

    v_ij = 1 if b_ij is specified, and v_ij = 0 otherwise,

and "⊙" denotes the standard elementwise (Hadamard) matrix product. In other words, this rotation problem seeks an orthonormal Q such that AQC gives the best fit to the specified elements of the target matrix B in the least-squares sense. Consider the function E_V : R^{p×q} → R defined by

    E_V(Q) := (1/2) ⟨(AQC − B) ⊙ V, (AQC − B) ⊙ V⟩.        (5.3)

Clearly, the problem of weighted orthonormal rotation to a partially specified target is equivalent to the minimization of the function E_V(Q) over the feasible set O(p, q). With respect to the Frobenius inner product, the gradient ∇E_V(Q) of E_V(Q) should be interpreted as the matrix

    ∇E_V(Q) = A^T [(AQC − B) ⊙ V] C^T.                     (5.4)

Following step by step the formalism developed in Section 3, one can easily derive the ODE that defines the steepest descent flow for the function E_V on the feasible set O(p, q):

    dQ/dt = (Q/2) (C[(AQC − B) ⊙ V]^T AQ − Q^T A^T [(AQC − B) ⊙ V] C^T)
            − (I − QQ^T) A^T [(AQC − B) ⊙ V] C^T.          (5.5)
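A NumPy sketch of the right-hand side of (5.5) (names ours), confirming that the masked flow is tangent to O(p, q) and decreases E_V:

```python
import numpy as np

def masked_flow_rhs(Q, A, B, C, V):
    """Right-hand side of (5.5): the steepest descent flow for E_V on O(p, q)."""
    G = A.T @ ((A @ Q @ C - B) * V) @ C.T   # gradient (5.4); '*' is the Hadamard product
    return Q @ (G.T @ Q - Q.T @ G) / 2 - (np.eye(len(Q)) - Q @ Q.T) @ G

def E_V(Q, A, B, C, V):
    R = (A @ Q @ C - B) * V
    return 0.5 * np.sum(R * R)              # (5.3); unspecified entries of B are ignored

rng = np.random.default_rng(5)
n, p, q, m = 6, 5, 3, 4
A, C = rng.standard_normal((n, p)), rng.standard_normal((q, m))
B = rng.standard_normal((n, m))
V = (rng.random((n, m)) < 0.6).astype(float)   # 1 where b_ij is specified, 0 otherwise
Q = np.linalg.qr(rng.standard_normal((p, q)))[0]

H = masked_flow_rhs(Q, A, B, C, V)
assert E_V(Q + 1e-4 * H, A, B, C, V) <= E_V(Q, A, B, C, V)   # descent direction
assert np.allclose(Q.T @ H, -(Q.T @ H).T)                    # tangent to O(p, q)
```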

Starting with an initial value Q(0) satisfying Q(0)^T Q(0) = I_q, we may trace the corresponding integral curve of (5.5) to its limit point, which will be an approximate solution to the problem (5.1)-(5.2). In Section 6 we demonstrate and discuss the numerical solution of (5.5) making use of the MATLAB ODE suite (Shampine and Reichelt 1997).

For illustration purposes only, we solve here the small example considered by Browne (1972), which is to find a 3 × 3 orthogonal rotation Q which transforms

    A = [  .6640   .6880   .4920   .8370   .7050   .8200   .6610   .4570   .7650
           .3220   .2480   .3040  −.2910  −.3140  −.3770   .3970   .2940   .4280
          −.0750   .1920   .2240   .0370   .1550  −.1040   .0770  −.4880   .0090 ]^T,

as close as possible, in the least-squares sense, to the following target:

    B = [  x   x   x   x   x   x   .7   0   .7
           0   0   0   x   x   x   x    x   x
           x   0   0   x   0   x   x    x   x  ]^T,

where "x" denotes the unspecified elements of B. Solving (5.5) with the random initial value

    Q(0) = [  .6150   .7817  −.1036
              .6672  −.4459   .5967
              .4202  −.4361  −.7958 ],


we find that

    Q = [ .8172   .4439   .3676
          .3232  −.8810   .3455
          .4772  −.1635  −.8634 ]

is the desired rotation, and compute

    AQ = [ .6109   .7340   .6072   .6076   .5486   .4986   .7052   .2356   .7678
           .0234   .0555  −.0860   .6219   .5643   .7132  −.0689   .0237  −.0389
           .4201   .1728   .0925   .1752   .0169   .2610   .3137   .6909   .4213 ]^T,

which is exactly Browne's solution (Browne 1972). The entries of AQ at the positions of the specified elements of the target B are the fitted values. Starting with different initial values may change the signs of an entire column of AQ (either the second or the third, or both), but (almost) not the magnitudes of the elements; that is, the goodness-of-fit remains unaffected.

Similarly to Section 3, the first order optimality conditions are readily obtained.

Theorem 7. For Q ∈ O(p, q) to be a stationary point of the weighted orthonormal Procrustes rotation problem to a partially specified target, the following two conditions must hold simultaneously:
(a). C[(AQC − B) ⊙ V]^T AQ is symmetric, and
(b). (I − QQ^T) A^T [(AQC − B) ⊙ V] C^T = 0.
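The reported rotation can be checked directly from the printed matrices. The sketch below verifies that Q is orthogonal to the four digits given and that AQ reproduces the reported fit, for example the fitted values .7052 and .7678 against the two specified target entries 0.7 in the first column:

```python
import numpy as np

A = np.array([[ .6640, .6880, .4920, .8370, .7050, .8200, .6610, .4570, .7650],
              [ .3220, .2480, .3040, -.2910, -.3140, -.3770, .3970, .2940, .4280],
              [-.0750, .1920, .2240, .0370, .1550, -.1040, .0770, -.4880, .0090]]).T
Q = np.array([[.8172,  .4439,  .3676],
              [.3232, -.8810,  .3455],
              [.4772, -.1635, -.8634]])

# Q is orthogonal to the four digits reported ...
assert np.allclose(Q.T @ Q, np.eye(3), atol=2e-3)
# ... and AQ reproduces the reported fit in the first column.
AQ = A @ Q
assert abs(AQ[6, 0] - .7052) < 5e-4 and abs(AQ[8, 0] - .7678) < 5e-4
```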

6. NUMERICAL EXPERIMENT

In this section, we report some experiences from our numerical experiments with the ODEs (3.4), (4.1), and (5.5). The computation is carried out in MATLAB 5.2 on a SUN Ultra-2/200 workstation. The solver used for the initial value problem is ode15s from the MATLAB ODE suite (Shampine and Reichelt 1997), also publicly available from the Internet. The code ode15s is a quasi-constant step size implementation of the Klopfenstein-Shampine family of numerical differentiation formulas for stiff systems. More details on this code can be found in Shampine and Reichelt (1997). We give some results to illustrate the behavior of the numerical solutions of (3.4), (4.1), and (5.5). To fit the data comfortably into the text, we display all numbers with five digits. All codes and results used in this experiment are available upon request.

One important feature of the ODEs (3.4), (4.1), and (5.5) is that the resulting Q(t) should automatically stay on the manifold O(p, q). In numerical calculation, however, round-off errors and truncation errors may throw the computed Q(t) off the manifold of constraint; see, for example, Example 1(b). In a large number of numerical experiments we even found a few datasets for which the flow Q(t) does not converge due to "slipping off" the constraint manifold. To remedy this problem, we adopt an additional nonlinear projection scheme suggested by Gear (1986) and Dieci, Russell, and Van Vleck (1994): Suppose Q is an approximate solution to one of the ODEs under consideration satisfying

    Q^T Q = I + O(h^r),                                    (6.1)


where r represents the order of the numerical method. Let Q = Q̃R be the unique QR decomposition of Q with diag(R) > 0. Then

    Q̃ = Q + O(h^r)                                        (6.2)

and Q̃ ∈ O(p, q). The condition diag(R) > 0 is important to ensure that the transition of Q(t) is smooth in t. In our implementation, the approximate solution Q of the ODE is replaced by the corresponding Q̃. An alternative projection scheme was considered at an early stage of this research and was also suggested by the anonymous reviewer: Suppose again that Q is an approximate solution of the ODEs satisfying (6.1), and let Q̃ ∈ O(p, q) be the closest matrix to Q in the least-squares sense; that is, the solution of the asymmetric OPP with A = I, which is Q̃ = VU^T with the SVD Q = VΣU^T. For small deviations of Q from O(p, q) (which is our case), both projection schemes produce identical Q̃ ∈ O(p, q), and the QR decomposition is faster for large sizes of Q. To complete this discussion we note that in general the resulting flow Q(t) is satisfactorily robust, and the ODE can be integrated numerically without applying the projection onto O(p, q) at every integration step. This usually saves up to 20% CPU time.
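Both projection schemes can be sketched and compared in NumPy (function names ours); for a small deviation from the manifold they agree to roughly the size of the deviation:

```python
import numpy as np

def retract_qr(Q):
    """Project an almost orthonormal Q back onto O(p, q) via the unique QR
    decomposition Q = Q_tilde R with diag(R) > 0 (Gear 1986)."""
    Qt, R = np.linalg.qr(Q)
    return Qt * np.sign(np.diag(R))

def retract_svd(Q):
    """The closest orthonormal matrix in the least-squares sense:
    Q_tilde = V U^T from the SVD Q = V Sigma U^T."""
    V, _, Ut = np.linalg.svd(Q, full_matrices=False)
    return V @ Ut

rng = np.random.default_rng(6)
Q = np.linalg.qr(rng.standard_normal((5, 3)))[0]
Qoff = Q + 1e-8 * rng.standard_normal((5, 3))   # Q has "slipped off" the manifold

Q1, Q2 = retract_qr(Qoff), retract_svd(Qoff)
assert np.allclose(Q1.T @ Q1, np.eye(3))        # back on O(p, q)
assert np.allclose(Q1, Q2, atol=1e-6)           # the schemes agree for small deviations
```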

6.1 SIMULATION EXPERIMENTS WITH OPP

First, we present experiments with the gradient flow for the OPP, Equation (4.1). In the two following examples, the tolerance for the absolute error is set at 10^-6, and for the relative error at 10^-3. This criterion is used to control the accuracy in following the solution path. Higher accuracy does not change the dynamics of the solution and simply requires more CPU time. We examine the output values at a time interval of 1. The integration terminates automatically when the relative improvement of E(Q) between two consecutive output points is less than 10^-4, indicating that a local minimizer has been found.

Example 1. In practice, the data in A and B often represent two different ordinations of the same samples or populations. Based on this idea, we produce the test data for this experiment by first using the random number generator rand in MATLAB to create the matrix

    A = [ .2190  .3835  .5297  .4175
          .0470  .5194  .6711  .6868
          .6789  .8310  .0077  .5890
          .6793  .0346  .3834  .9304
          .9347  .0535  .0668  .8462 ].

Then, we define B := AQ0 with

    Q0 = [ 0  1  0
           1  0  0
           0  0  1
           0  0  0 ],

so that the underlying OPP, though it may have many local solutions due to its nonlinearity, has exactly one global solution Q0, at which E(Q0) = 0. We emphasize here that the data

obtained this way are not realistic, because in practice B can rarely be a simple permutation of the columns of A. In fact, it is precisely because it is often difficult to determine the relationship between B and A that an orthonormal Procrustes analysis is needed (Gower 1984). We present the following examples just to illustrate how the descent flow behaves.

Figure 1. A semi-log plot of E(Q(t)) and Ω(Q(t)) for Example 1(a).

(a). Obviously, the initial value determines where our descent flow will converge to. For instance, suppose we start with the matrix

    Q(0) = [ .2618   .9198  -.2333
             .8912  -.3467  -.2333
             .2618   .1301   .9399
             .2618   .1301   .0876 ],                      (6.3)

which represents a nontrivial perturbation of Q0. We solve this example with the projection scheme both on and off. The objective value is ||AQ* - B|| ≈ 4.0415 × 10^-4 when the projection is on, with 262,065 flops used, and ||AQ* - B|| ≈ 6.4311 × 10^-4 without projection, for 215,372 flops. Figure 1 records the history of the changes of the objective value E(Q(t)) = ||AQ(t) - B||, where Q(t) is determined by integrating the ODE (4.1). Clearly, the global solution is obtained in this case. Also recorded in Figure 1 is the history of the function

    Ω(Q(t)) := ||I3 - Q(t)^T Q(t)||.                       (6.4)
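Both quantities plotted in Figure 1 are easy to monitor in a simple discrete analogue of the descent flow. The NumPy sketch below applies a plain gradient step for E(Q), followed by the QR re-projection of (6.1)-(6.2), to the data of Example 1(a); the step size eta and the iteration count are our illustrative choices, not the authors' stiff-ODE scheme:

```python
import numpy as np

A = np.array([[.2190, .3835, .5297, .4175],
              [.0470, .5194, .6711, .6868],
              [.6789, .8310, .0077, .5890],
              [.6793, .0346, .3834, .9304],
              [.9347, .0535, .0668, .8462]])
Q0 = np.array([[0., 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 0]])
B = A @ Q0                                   # so that E(Q0) = 0 exactly
Q = np.array([[.2618, .9198, -.2333],        # the starting value (6.3)
              [.8912, -.3467, -.2333],
              [.2618, .1301, .9399],
              [.2618, .1301, .0876]])

def retract(Q):
    # QR-based re-projection onto O(4, 3) with diag(R) > 0, as in (6.1)-(6.2)
    Qt, R = np.linalg.qr(Q)
    return Qt * np.sign(np.diag(R))

E = lambda Q: np.linalg.norm(A @ Q - B)                  # objective E(Q)
Omega = lambda Q: np.linalg.norm(np.eye(3) - Q.T @ Q)    # nonorthogonality (6.4)

eta = 0.05                                   # illustrative step size
for _ in range(20000):
    Q = retract(Q - eta * A.T @ (A @ Q - B)) # gradient step, then re-projection

print(E(Q), Omega(Q))   # E(Q) is driven toward 0; Omega(Q) stays at round-off level
```

Since B = AQ0 by construction, the objective can in principle be driven to zero, which is what the continuous flow achieves from this starting value.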


Figure 2. A semi-log plot of E(Q(t)) and Ω(Q(t)) for Example 1(b).

This function measures the deviation of Q(t) from the manifold of constraint O(4, 3). It is seen that Q(t) is kept well within the local tolerance when the projection scheme is on.

(b). Suppose the initial value Q(0) is taken to be

    Q(0) = [ I3
              0 ].

Incorporating the projection schemes based on QR and SVD, we can reach only a local minimizer

    Q* = [ .0094   .3811  -.5768
           .9999   .0094   .0088
           .0088  -.5768   .4625
          -.0110   .7225   .6733 ]

with objective value ||AQ* - B|| ≈ .2234. Then, repeating the computations with the projection schemes off, we found the following considerably better "solution":

    Q** = [ .0010   .9817  -.0172
            .9998   .0011   .0008
            .0008  -.0175   .9835
           -.0010   .0208   .0197 ]

Figure 3. A semi-log plot of E(Q(t)) and Ω(Q(t)) for Example 2.

with objective value ||AQ** - B|| ≈ .0066. The test results are presented in Figure 2. It is interesting to note that both Q* and Q** satisfy the conditions in Theorem 5 and Corollary 2, while Q** considerably violates the orthonormality constraint:

    Q**^T Q** = [ .9995   .0020   .0016
                  .0020   .9646  -.0336
                  .0016  -.0336   .9679 ].
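Using the four-digit entries displayed above, the reported objective values can be checked directly (a NumPy sketch; the last digits are limited by the rounding of the printed matrices). The check also makes the relaxation effect concrete, since the fully unconstrained least-squares fit has a still lower residual, here zero because B = AQ0 exactly:

```python
import numpy as np

A = np.array([[.2190, .3835, .5297, .4175],
              [.0470, .5194, .6711, .6868],
              [.6789, .8310, .0077, .5890],
              [.6793, .0346, .3834, .9304],
              [.9347, .0535, .0668, .8462]])
Q0 = np.array([[0., 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 0]])
B = A @ Q0

Qs = np.array([[.0094, .3811, -.5768],       # Q*, reached with projection on
               [.9999, .0094, .0088],
               [.0088, -.5768, .4625],
               [-.0110, .7225, .6733]])
Qss = np.array([[.0010, .9817, -.0172],      # Q**, reached with projection off
                [.9998, .0011, .0008],
                [.0008, -.0175, .9835],
                [-.0010, .0208, .0197]])

E = lambda M: np.linalg.norm(A @ M - B)
print(E(Qs), E(Qss))                         # about .2234 and .0066, as reported

# Dropping the orthonormality constraint altogether gives the unconstrained
# least-squares problem, whose minimum is lower still (zero here, since B = AQ0):
X = np.linalg.lstsq(A, B, rcond=None)[0]
print(E(X))
```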

The result is quite natural: by violating the orthonormality constraint we actually relax the constrained problem, which, of course, leads to lower minima.

Example 2.

Suppose B = AQ0 + (1/2)Δ, where

    Δ = [   .5383   .9503   .6004
           -.6168   .3468  1.0047
          -1.2161  -.9547  -.3608
           -.8900  -.7598  -.6719
          -1.9832   .3192  -.6037 ]

represents a random perturbation from the normal distribution N(0, 1). With Q = Q0, this noise has the magnitude ||AQ0 - B|| = 1.7179. But by following (4.1) with Q(0) given by


(6.3), we obtain

    Q# = [ -.0895   .7726  -.5277
            .7471  -.1845   .0163
            .2731   .6034   .7309
           -.5990  -.0701   .4324 ]

with ||AQ# - B|| reduced to 1.1182. In this example the results obtained with the projection scheme on and off are identical; the flops used are 274,483 and 238,814, respectively. The test results are recorded in Figure 3. Again, we can only report that a local minimizer is found.
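As with Example 1(b), the reported quantities can be verified from the displayed four-digit entries (a NumPy sketch; agreement is limited by the printed rounding):

```python
import numpy as np

A = np.array([[.2190, .3835, .5297, .4175],
              [.0470, .5194, .6711, .6868],
              [.6789, .8310, .0077, .5890],
              [.6793, .0346, .3834, .9304],
              [.9347, .0535, .0668, .8462]])
Q0 = np.array([[0., 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 0]])
Delta = np.array([[.5383, .9503, .6004],
                  [-.6168, .3468, 1.0047],
                  [-1.2161, -.9547, -.3608],
                  [-.8900, -.7598, -.6719],
                  [-1.9832, .3192, -.6037]])
B = A @ Q0 + Delta / 2                        # the perturbed target of Example 2
Qh = np.array([[-.0895, .7726, -.5277],       # the computed minimizer Q#
               [.7471, -.1845, .0163],
               [.2731, .6034, .7309],
               [-.5990, -.0701, .4324]])

print(np.linalg.norm(A @ Q0 - B))             # the noise level ||Delta||/2 = 1.7179
print(np.linalg.norm(A @ Qh - B))             # reduced to about 1.1182 by Q#
print(np.linalg.norm(np.eye(3) - Qh.T @ Qh))  # Q# is orthonormal to the printed digits
```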

6.2 SIMULATION COMPARISON OF SEVERAL PRP SOLVERS

It is worth comparing the proposed algorithm, based on the projected gradient approach, with the other existing solutions given by product of planar rotations (Mooijaart and Commandeur 1990), majorization (Kiers 1990), and "refined" majorization (Kiers and ten Berge 1992; Koschat and Swayne 1991). We report here 100 numerical solutions of (1.1)-(1.2) obtained by each of the four methods, making use of three different random number generators. The experiment is organized as follows. We generate 100 random matrices A ∈ R^{n×p}, C ∈ R^{q×m}, and B ∈ R^{n×m}. For these 100 triples A, B, and C we solve the problem (1.1)-(1.2) by the available solvers. We solve Equation (3.4) starting from the following initial value Q0: first, solve AX = B for X by least squares; then solve the linear system YC = X for Y; and finally project Y onto the Stiefel manifold, using the QR decomposition, to find Q0. In these experiments, the tolerance for the absolute error is set at 10^-6, and for the relative error at 10^-3. In order to gain better overall information on these approaches, we compute the sample mean of the obtained minimal values of the objective function (1.1) and its sample variance over all 100 sets of data. The sample mean of the CPU time used per run and its sample variance over all 100 runs are also computed. It is common for MATLAB procedures to measure the number of flops used. The projected gradient solutions, in all experiments, consume considerably fewer flops than the other three methods. This may mislead the reader into thinking that solving ODEs is an easy task, so we decided to report the CPU time, though it is not quite a reliable measure for comparisons. We generate matrices A, B, and C with dimensions n = 7, m = 4, q = 3, and p = 5, 6, 7, 10, 20; that is, we compare solutions for five different sizes of the orthonormal unknown p × q matrix Q.
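The construction of the starting value Q0 just described can be sketched as follows (in NumPy rather than MATLAB; the uniformly random A, B, and C merely stand in for the test data, with n = 7, m = 4, p = 5, and q = 3 as in the experiment):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p, q = 7, 4, 5, 3
A = rng.random((n, p))
C = rng.random((q, m))
B = rng.random((n, m))

# Step 1: solve AX = B for X in the least-squares sense (X is p x m).
X = np.linalg.lstsq(A, B, rcond=None)[0]
# Step 2: solve YC = X for Y by least squares, via the transposed system.
Y = np.linalg.lstsq(C.T, X.T, rcond=None)[0].T          # Y is p x q
# Step 3: project Y onto the Stiefel manifold via the QR decomposition.
Q0, R = np.linalg.qr(Y)
Q0 = Q0 * np.sign(np.diag(R))                           # enforce diag(R) > 0

print(np.linalg.norm(Q0.T @ Q0 - np.eye(q)))            # Q0 is orthonormal
```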
Results of Mooijaart and Commandeur's (1990) algorithm (product of planar rotations) are reported for the case n = 7, m = 4, p = 5, and q = 3 only, because it is rather slow in CPU time. First we generate A, B, and C by rand - .5, uniformly distributed random numbers on the interval (-.5, .5). The results obtained are summarized in Table 2. The "refined" majorization algorithm is the fastest of the three in CPU time. The minima of (1.1) obtained (i.e., the errors of the fit) by the three methods are practically identical. Note that Mooijaart and Commandeur's (1990) algorithm for n = 7, m = 4, p = 10, and q = 3 gives sample mean and variance of the minima of 1.1744 and .1250, and sample mean and variance of the CPU time of 28.1649 and 773.9405.


Table 2. Results for Data Generated by rand - .5. Refined majorization: Kiers and ten Berge (1992), Koschat and Swayne (1991); majorization: Kiers (1990); product of planar rotations: Mooijaart and Commandeur (1990).

                                                      Minimum              CPU Time
  Data                 Method                         mean      variance   mean      variance
  n=7, m=4, p=5, q=3   Refined majorization          1.6581    .1125      .0596     .0030
                       Majorization                  1.6486    .1123      .1435     .0154
                       Projected gradient            1.6493    .1128      .2597     .0110
                       Product of planar rotations   1.6463    .1193      3.1545    8.3937
  n=7, m=4, p=6, q=3   Refined majorization          1.5242    .1087      .0815     .0033
                       Majorization                  1.5247    .1083      .1909     .0262
                       Projected gradient            1.5236    .1087      .3248     .0274
  n=7, m=4, p=7, q=3   Refined majorization          1.4947    .1249      .1226     .0104
                       Majorization                  1.4951    .1252      .2233     .0314
                       Projected gradient            1.4953    .1251      .3712     .0171
  n=7, m=4, p=10, q=3  Refined majorization          1.1744    .1250      .1543     .0088
                       Majorization                  1.1744    .1250      .2292     .0151
                       Projected gradient            1.1747    .1250      .4073     .0147
  n=7, m=4, p=20, q=3  Refined majorization           .9744    .1104      .2220     .0182
                       Majorization                   .9744    .1104      .4136     .0384
                       Projected gradient             .9745    .1104      .7071     .0303


Table 3. Results for Data Generated by rand. Refined majorization: Kiers and ten Berge (1992), Koschat and Swayne (1991); majorization: Kiers (1990); product of planar rotations: Mooijaart and Commandeur (1990).

                                                      Minimum              CPU Time
  Data                 Method                         mean      variance   mean      variance
  n=7, m=4, p=5, q=3   Refined majorization          2.0113    .3478      .4590     .1700
                       Majorization                  1.9964    .3343      2.1178    2.3850
                       Projected gradient            1.9798    1.9798     .6917     .0342
                       Product of planar rotations   1.9921    .3585      23.8393   1.4163 × 10^3
  n=7, m=4, p=6, q=3   Refined majorization          1.7986    .4608      .7296     .2739
                       Majorization                  1.8018    .4603      2.5293    2.6128
                       Projected gradient            1.8008    .4608      .7807     .0487
  n=7, m=4, p=7, q=3   Refined majorization          1.6446    .3332      1.0917    .4763
                       Majorization                  1.6440    .3329      3.2325    2.4534
                       Projected gradient            1.6397    .3355      .9111     .0495
  n=7, m=4, p=10, q=3  Refined majorization          1.3151    .3480      1.4191    .8074
                       Majorization                  1.3156    .3482      3.9922    3.9207
                       Projected gradient            1.3149    .3479      1.0482    .0387
  n=7, m=4, p=20, q=3  Refined majorization          1.0587    .5094      1.3540    .4421
                       Majorization                  1.0597    .5101      8.9595    13.8677
                       Projected gradient            1.0587    .5093      2.3604    .3359


Next, for the same values of n, m, q, and p, we generate A, B, and C by rand; that is, uniformly distributed random numbers on the interval (0, 1). The results are summarized in Table 3. The behavior of the majorization algorithm (Kiers 1990) is rather poor for these data. The CPU times of the "refined" majorization and projected gradient algorithms are similar for moderate sizes of Q. The errors of the model fit (1.1) for the three algorithms are practically identical. Finally, we generate A, B, and C by randn; that is, random numbers with the normal distribution N(0, 1). The results are summarized in Table 4. Again, for these data, the "refined" majorization algorithm is the fastest one. The projected gradient algorithm gives a better fit of the model (1.1) to these data, while the most significant deviations are produced by the "refined" majorization algorithm.

6.3 PRP WITH PARTIALLY SPECIFIED TARGET

Example 3. Finally, we report some numerical experiments with the solution of the weighted orthonormal Procrustes rotation problem to a partially specified target, (5.1)-(5.2).

(a). In the first experiment we generate a random 5 × 3 orthonormal matrix Q0 and random matrices A and C (the random data in this subsection can be obtained upon request from the second author). Then we form a target B from AQ0 C by considering some of its elements specified and fixed to the corresponding values given in AQ0 C, and the rest of them unspecified and denoted by x's. Let us say that we have to fit

    B = [ x  x  x  -.9211
          x  x  x    x
          x  x  x  -.7706
          x  x  x  -.5564
          x  x  x    x
          x  x  x  -.6329
          x  x  x  -.5897 ]

by AQC for some unknown 5 × 3 orthonormal matrix Q. Solving the ODE (5.5) on Q^T Q = I3, starting from a random initial orthonormal matrix Qin, we find that

    Qout = [ -.1578   .9831   .0590
              .0300  -.0335   .3509
             -.3921   .0037  -.7900
             -.0096   .0434   .3661
             -.9057  -.1745   .3395 ]


Table 4. Results for Data Generated by randn. Refined majorization: Kiers and ten Berge (1992), Koschat and Swayne (1991); majorization: Kiers (1990); product of planar rotations: Mooijaart and Commandeur (1990).

                                                      Minimum               CPU Time
  Data                 Method                         mean      variance    mean      variance
  n=7, m=4, p=5, q=3   Refined majorization          26.8046   131.2824    .1446     .0099
                       Majorization                  25.7152   109.8358    .4894     .1720
                       Projected gradient            23.1794   86.2939     .6894     .0223
                       Product of planar rotations   23.5038   78.9624     7.6978    119.1133
  n=7, m=4, p=6, q=3   Refined majorization          18.4156   66.3144     .3116     .1664
                       Majorization                  18.3138   98.1012     .6902     .2352
                       Projected gradient            18.2555   63.3046     .6516     .0148
  n=7, m=4, p=7, q=3   Refined majorization          13.4904   32.2519     .5140     .2083
                       Majorization                  13.2380   23.5104     .9722     .5702
                       Projected gradient            13.3174   26.8213     .7656     .0275
  n=7, m=4, p=10, q=3  Refined majorization          9.5870    22.1143     .4164     .0880
                       Majorization                  9.0801    2.5370      1.4120    .7686
                       Projected gradient            8.9235    2.7134      .9176     .0198
  n=7, m=4, p=20, q=3  Refined majorization          7.8754    18.2798     .2256     .0273
                       Majorization                  7.5919    15.6923     1.8748    2.9253
                       Projected gradient            7.3064    16.1990     2.0505    .1311


Figure 4. A semi-log plot of EV(Q(t)) and Ω(Q(t)) for Example 3(a).

solves (5.1)-(5.2) with error of the fit 1.7063 × 10^-4. One can check that

    AQout C = [ .0231  -.4695  -.4196  -.9211
                .1719   .0267  -.1113  -.2171
               -.0244  -.4332  -.3538  -.7706
               -.0771  -.3796  -.2754  -.5564
                .2096   .2225   .0603  -.0278
                .1913  -.2119  -.3482  -.6330
               -.1339  -.2871  -.0698  -.5897 ].

The specified elements of the target are recovered perfectly in this case; that is, Qout is a global minimizer of (5.1)-(5.2). The test results are recorded in Figure 4.

(b). Next, consider randomly generated A, B, and C with sizes 7 × 5, 7 × 4, and 3 × 4, respectively. Form a target by considering the elements of B larger than .8 specified and fixed, and


Figure 5. Semi-log plots of EV(Q(t)) and Ω(Q(t)) for Example 3(b).

the rest of them unspecified; that is, consider the target

    B = [ .8484  .8965  .8419   x
          .8213   x      x      x
           x      x      x      x
           x      x      x      x
           x      x      x      x
          .9273  .9948   x      x
           x      x     .8860   x ].

Solving (5.5) with a random initial 5 × 3 orthonormal matrix, we find

    Qout = [ .6896  -.2855   .0564
             .0263   .2751   .8877
             .3193  -.2627   .3643
             .6473   .4825  -.2759
            -.0528   .7355   .0010 ],


and correspondingly

    AQout C = [  .8490   .8964   .8417  1.1581
                 .8206   .6260   .9433   .9211
                 .7611   .9632   .8237  1.1761
                 .7075   .7685   .8258   .9883
                 .9542  1.0664  1.2658  1.3645
                 .9265   .9952  1.0780  1.2847
                1.0005   .9845   .8858  1.2996 ],

which gives error of the fit 1.0001 × 10^-4 to the target B. For another randomly generated triple A, B, and C with the same sizes, form a target in the same way as above; that is, consider the target

    B = [ .9495    0    .9473    0
            0      0      0    .8144
            0      0      0      0
            0      0      0    .8626
            0    .8773    0      0
            0    .9983    0      0
            0    .9223    0      0 ].

The solution of (5.5) with a random initial value is

    Qout = [ .2217   .6325  -.3833
             .0884   .5745   .0242
            -.4601   .3960   .7517
            -.0914   .3351  -.2429
             .8503   .0256   .4780 ],

and we compute that

    AQout C = [ .9755   .5546   .9031   .7363
                .7606   .6229   .3733   .8188
                .9168   .5281   .6975   .8034
                .6668   .5067   .1403   .8552
               1.1653   .9513   .3081  1.4487
               1.2650   .7614   .7505  1.2647
                .4875   .5638  -.0224   .7184 ],

which gives error of the fit .4393 to the target B. The test results are recorded in Figure 5. To investigate the sensitivity of the solution, we made 50 random runs and solved the ODE (5.5) for each run with 20 different initial values Qin. In 42 of the 50 cases we found the sample variance of the fit error over all 20 starts to be of order 10^-4 or less, which indicates that the minima obtained are practically insensitive to the starting values.
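The fit criterion used in this subsection can be sketched with a 0/1 weight matrix that marks the specified entries of the target (a NumPy illustration on synthetic data, since the article's random A and C are not reproduced; the names W and EV are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((7, 5))
C = rng.random((3, 4))
Q0 = np.linalg.qr(rng.standard_normal((5, 3)))[0]   # a 5 x 3 orthonormal matrix

W = (rng.random((7, 4)) > .5).astype(float)         # 1 = specified entry, 0 = unspecified
B = W * (A @ Q0 @ C)                                # specified entries taken from AQ0C

def EV(Q):
    # masked fit error: unspecified entries of the target do not contribute
    return np.linalg.norm(W * (A @ Q @ C - B))

print(EV(Q0))                                       # 0: Q0 matches every specified entry
Qrand = np.linalg.qr(rng.standard_normal((5, 3)))[0]
print(EV(Qrand))                                    # a generic orthonormal Q does not
```

With the target built this way, Q0 is a global minimizer with EV(Q0) = 0, which is exactly the situation engineered in Example 3(a).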


7. CONCLUSION

By using the projected gradient idea, we are able to completely characterize the first order and the second order optimality conditions for the Penrose regression problem. Our results extend what is already known in the literature for the orthonormal Procrustes problem. Furthermore, our approach provides a natural new numerical method for solving these problems. From the numerical experiments it is clear that the projected gradient algorithm is generally slower in CPU time than the "refined" majorization algorithm, but it can be a useful alternative for some data. We should stress that the approach presented in this article is rather universal (both theoretically and numerically), and its application to data analytical problems leading to least squares optimization subject to constraints is straightforward. This was illustrated by solving the hitherto unsolved weighted Procrustes rotation problem to a partially specified target.

ACKNOWLEDGMENTS

The authors thank Henk Kiers, University of Groningen, for the MATLAB codes implementing the majorization (Kiers 1990) and refined majorization methods (Kiers and ten Berge 1992; Koschat and Swayne 1991). We also thank Jacques Commandeur, Leiden University, for discussions while the second author worked on a MATLAB code that realized the Mooijaart and Commandeur algorithm. The authors thank the editor for his support and the anonymous reviewers for the competent and exhaustive comments that clarified and strengthened the work. Ross Lippert was so kind as to improve "our" English.

[Received October 1998. Revised February 2001.]

REFERENCES

Browne, M. W. (1972), "Orthogonal Rotation to a Partially Specified Target," British Journal of Mathematical and Statistical Psychology, 25, 115-120.

Chu, M. T. (1994), "A List of Matrix Flows With Applications," Fields Institute Communications, 3, 87-97.

Chu, M. T., and Driessel, K. R. (1990), "The Projected Gradient Method for Least Squares Matrix Approximations With Spectral Constraints," SIAM Journal on Numerical Analysis, 27, 1050-1060.

Chu, M. T., and Trendafilov, N. (1998), "On a Differential Equation Approach to the Weighted Orthogonal Procrustes Problem," Statistics and Computing, 8, 125-133.

Dieci, L., Russell, R. D., and Van Vleck, E. S. (1994), "Unitary Integrators and Applications to Continuous Orthonormalization Techniques," SIAM Journal on Numerical Analysis, 31, 261-281.

Diele, F., Lopez, L., and Peluso, R. (1998), "The Cayley Transform in the Numerical Solution of Unitary Differential Systems," Advances in Computational Mathematics, 8, 317-334.

Edelman, A., Arias, T., and Smith, S. T. (1999), "The Geometry of Algorithms With Orthogonality Constraints," SIAM Journal on Matrix Analysis and Applications, 20, 303-353.

Engø, K., Marthinsen, A., and Munthe-Kaas, H. (1997), "DiffMan—An Object Oriented MATLAB Toolbox for Solving Differential Equations on Manifolds" (User's Guide), http://www.math.ntnu.no/num/synode/.

Gill, P. E., Murray, W., and Wright, M. H. (1981), Practical Optimization, New York: Academic Press.

Gear, C. W. (1986), "Maintaining Solution Invariants in the Numerical Solution of ODEs," SIAM Journal on Scientific and Statistical Computing, 7, 734-743.


Golub, G. H., and Van Loan, C. F. (1991), Matrix Computations (2nd ed.), Baltimore: The Johns Hopkins University Press.

Gower, J. C. (1984), "Multivariate Analysis: Ordination, Multidimensional Scaling and Allied Topics," in Handbook of Applicable Mathematics, Vol. VI: Statistics, Part B, ed. Emlyn Lloyd, New York: Wiley.

Green, B. (1952), "The Orthogonal Approximation of an Oblique Structure in Factor Analysis," Psychometrika, 17, 429-444.

Helmke, U., and Moore, J. B. (1994), Optimization and Dynamical Systems, London: Springer Verlag.

Kiers, H. A. L. (1990), "Majorization as a Tool for Optimizing a Class of Matrix Functions," Psychometrika, 55, 417-428.

Kiers, H. A. L., and ten Berge, J. M. F. (1992), "Minimization of a Class of Matrix Trace Functions by Means of Refined Majorization," Psychometrika, 57, 371-382.

Koschat, M. A., and Swayne, D. F. (1991), "A Weighted Procrustes Criterion," Psychometrika, 56, 229-239.

Mooijaart, A., and Commandeur, J. J. F. (1990), "A General Solution of the Weighted Orthonormal Procrustes Problem," Psychometrika, 55, 657-663.

Shampine, L. F., and Reichelt, M. W. (1997), "The MATLAB ODE Suite," SIAM Journal on Scientific Computing, 18, 1-22.

Stiefel, E. (1935-1936), "Richtungsfelder und Fernparallelismus in n-dimensionalen Mannigfaltigkeiten," Commentarii Mathematici Helvetici, 8, 305-353.

ten Berge, J. M. F. (1977a), "Optimizing Factorial Invariance," unpublished PhD thesis, Groningen.

ten Berge, J. M. F. (1977b), "Orthogonal Procrustes Rotation for Two or More Matrices," Psychometrika, 42, 267-276.

ten Berge, J. M. F., and Knol, D. L. (1984), "Orthogonal Rotations to Maximal Agreement for Two or More Matrices of Different Column Orders," Psychometrika, 49, 49-55.