The Orthogonally Constrained Regression Revisited

Moody T. CHU and Nickolay T. TRENDAFILOV

The Penrose regression problem, including the orthonormal Procrustes problem and the rotation problem to a partially specified target, is an important class of data matching problems arising frequently in multivariate analysis, yet its optimality conditions have never been clearly understood. This work offers a way to calculate the projected gradient and the projected Hessian explicitly. One consequence of this calculation is the complete characterization of the first order and the second order necessary and sufficient optimality conditions for this problem. Another application is the natural formulation of a continuous steepest descent flow that can serve as a globally convergent numerical method. Applications to the orthonormal Procrustes problem and the Penrose regression problem with partially specified target are demonstrated in this article. Finally, some numerical results are reported and commented upon.

Key Words: Continuous-time approach; Penrose regression; Procrustes rotation; Rotation to partially specified target; Projected gradient; Projected Hessian; Optimality conditions.
1. INTRODUCTION

The problem of matching data matrices to maximal agreement by orthogonal rotations arises in many disciplines. A concise and instructive discussion of its application to factor analysis and multidimensional scaling can be found in Gower (1984). For general consideration, ten Berge and Knol (1984) proposed a taxonomy of matching procedures according to the properties of gauging criteria, orthogonality, simultaneity, generality, and symmetry involved in the underlying problem. In this article we derive the first order and the second order optimality conditions for two of the most important cases in this family of problems. Our result includes as a special case what is already known in the literature and appears to be the strongest possible provision for assessing a local optimizer.

Moody T. Chu is Professor, Department of Mathematics, North Carolina State University, Raleigh, NC 27695-8205 (E-mail: [email protected]). Nickolay T. Trendafilov is Professor, Laboratory of Computational Stochastics, Institute of Mathematics and Informatics, Bulgarian Academy of Sciences; address for correspondence: Department of Mechanical Engineering, University of Strathclyde, 75 Montrose Street, Glasgow G1 1XJ (E-mail: I.Trenda [email protected]). The first author's research was supported in part by the National Science Foundation under grant DMS-942228. The second author's research was performed while visiting SISTA/ESAT, Katholieke Universiteit Leuven, and was supported by DWTC, Flemish Government, Belgium.

© 2001 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America
Journal of Computational and Graphical Statistics, Volume 10, Number 4, Pages 746–771
ORTHOGONALLY CONSTRAINED REGRESSION
747
Table 1. Classification of the Penrose Regression Problems (minimize ||AQC − B||)

    Problem                                Constraint on Q         Symmetry
    -------------------------------------------------------------------------------
    Penrose regression problem (PRP),      orthonormal Q, p ≥ q    symmetric (p = q) or
      the weighted Procrustes problem                               asymmetric (p > q)
    orthonormal Procrustes problem (OPP),  orthonormal Q, p ≥ q    symmetric (p = q) or
      the unweighted case C = I_q                                   asymmetric (p > q)
    orthogonal Procrustes problem,         orthogonal Q, p = q     symmetric; solution
      the case C = I_q with p = q                                   Q = VU^T
The main problem to be considered is the so-called Penrose regression problem (PRP), also known as the weighted orthonormal Procrustes problem. Given fixed A ∈ R^{n×p}, C ∈ R^{q×m}, and B ∈ R^{n×m}, the PRP is the optimization problem:

    minimize    ||AQC − B||                        (1.1)
    subject to  Q ∈ R^{p×q},  Q^T Q = I_q,         (1.2)

where p ≥ q is assumed and I_q stands for the q × q identity matrix. In general, n and m can be any numbers. If n ≥ p ≥ q ≥ m and A and C are full column-rank matrices, then the original PRP (1.1)–(1.2) can be transformed into a PRP with A and C square (upper-triangular). A related secondary problem, the well-known orthonormal Procrustes problem (OPP), which has been of great interest in regression theory, is a special case of the PRP with m = q and C = I_q. Thus, we can immediately apply our results for the PRP to the OPP. When p = q, the variable Q is an orthogonal matrix (Q^T Q = QQ^T = I). This is the so-called symmetric problem according to the taxonomy of ten Berge and Knol (1984). In case C = I_q (= I_p), the OPP is also known as the orthogonal Procrustes problem, whose optimal solution is well understood. Indeed, the solution for the orthogonal Procrustes problem is given by Q = VU^T, where V and U are the orthogonal matrices involved in the singular value decomposition A^T B = VΣU^T (Green 1952; Golub and Van Loan 1991). The corresponding symmetric PRP was discussed in detail by Chu and Trendafilov (1998). The more interesting yet challenging case is the so-called asymmetric problem, where p ≠ q (we shall assume henceforth p > q and not address the symmetric problem further). Table 1 gives a helpful classification. We are not aware of any direct solution for the asymmetric problem. Several indirect methods are worth mentioning: Green (1952) and Gower (1984) suggested an iterative scheme for the OPP that seems to work well in practice, although they were unable to prove its convergence (Gower 1984; ten Berge and Knol 1984). Mooijaart and Commandeur (1990) proposed an iterative algorithm for the symmetric PRP (p = q), based on plane rotations. It can also be applied to the asymmetric PRP (p > q) by embedding the unknown Q ∈ R^{p×q} in R^{p×p} (Commandeur, personal communication), but it is quite inefficient numerically (see Section 6).
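The closed-form SVD solution Q = VU^T of the square orthogonal Procrustes problem mentioned above is easy to sketch. The following Python snippet is our illustration only (the function name `procrustes_rotation` and the synthetic data are ours, not the paper's code):

```python
import numpy as np

def procrustes_rotation(A, B):
    """Closed-form solution of the orthogonal Procrustes problem
    min ||AQ - B|| over square orthogonal Q (the case p = q, C = I):
    Q = V U^T, where A^T B = V Sigma U^T is an SVD (Green 1952)."""
    V, _, Ut = np.linalg.svd(A.T @ B)
    return V @ Ut

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))
# Build a target that is an exact rotation of A, so the minimum value is 0.
Q0, _ = np.linalg.qr(rng.standard_normal((4, 4)))
B = A @ Q0
Q = procrustes_rotation(A, B)
print(np.allclose(Q.T @ Q, np.eye(4)))   # Q is orthogonal
print(np.allclose(A @ Q, B))             # the exact rotation is recovered
```

Because B is constructed as an exact rotation of A, the global minimum is zero and the SVD solution recovers Q0.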
Koschat and Swayne (1991) proposed yet another iterative method for the PRP when the weight C is diagonal. The algorithm can be extended to handle the unrestricted PRP with an arbitrary q × m matrix C. The method is based on the idea of first embedding the asymmetric problem into a symmetric problem, for which a solution is easy to find, and then repeatedly updating the embedded system. Another approach for
solving the unrestricted PRP based on the majorization idea was proposed by Kiers (1990) and refined by Kiers and ten Berge (1992). In an additional step, the unconstrained solution is updated to fulfill (1.2). Kiers and ten Berge (1992, p. 375) also proved that Koschat and Swayne's algorithm is a special case of their refined majorization. The method adopted in this article differs from the methods listed above in two ways. First, it works directly on the feasible set (1.2) of the PRP, and second, the solution evolves continuously in time. An important feature of our approach is that related problems can be treated in a completely unified manner. To show this, in addition to the symmetric and asymmetric PRP, we also solve the previously unsolved weighted Procrustes rotation problem to a partially specified target; that is, the case where some elements of the matrix B in (1.1) are fixed and the others are not specified. The approach adopted is rather general and can be applied to other data-analytic constrained least-squares problems. Our approach is based on the recent development of so-called continuous realization methods and their applications; for example, QR, EVD, and SVD algorithms in numerical linear algebra; Rayleigh quotient and balanced matrix factorizations in signal processing; and so on. See Chu (1994) for a brief review of this subject; for a detailed study see Helmke and Moore (1994). This formulation often has the advantage of furnishing better understanding of the global properties of the solution of the underlying problem, based on the theory of dynamical systems, Lyapunov stability, and so on. The idea of the continuous-time approach is that certain numerical methods can be thought of as discretizations of a dynamical system governing a flow that starts at a certain initial state and evolves until it reaches an equilibrium point. In this work we propose and apply the projected gradient approach—a specific continuous-time method—to reconsider the PRP (1.1)–(1.2).
Specifically, we shall first study the topology of the set of all p × q (p ≥ q) orthonormal matrices

    O(p, q) := {Q ∈ R^{p×q} | Q^T Q = I_q},     (1.3)

which forms a smooth manifold, also known as the Stiefel manifold. Then we show that a gradient flow for the PRP can be derived without difficulty. Our approach here is similar in spirit to that in Chu and Driessel (1990) and Chu and Trendafilov (1998). The results for the PRP include what is known for the OPP as a special case. We derive matrix ODEs describing gradient flows for both problems. As a by-product, we can specify the necessary and the sufficient optimality conditions characterizing the PRP (and OPP) optimizers. Green (1952) and Gower (1984) have suggested for the OPP that the equation

    A^T A Q − A^T B = QS     (1.4)

must be satisfied for some q × q symmetric matrix S at an optimal solution Q of the OPP (Gower 1984, p. 767). It appears that the best known optimality condition for the OPP is due to ten Berge (1977a), who showed that at a minimizer Q the corresponding matrix B^T AQ must be symmetric and positive semi-definite. We shall see in the sequel that this necessary condition is a special case of our more general result. The theory for the PRP is far from clear. It is the contribution of this article to completely characterize the first order and the second order necessary conditions and sufficient conditions for the optimality of the PRP.
This article is organized as follows: Some important topological properties, particularly the tangent space, of the manifold O(p, q) are briefly discussed in Section 2. These properties are then applied to the PRP in Section 3. We derive the projected gradient and the projected Hessian of a certain objective function, which enables us to specify the first order and the second order sufficient and necessary conditions for a local minimizer of the PRP. The same idea is extended to the OPP in Section 4 as an application. Next, in Section 5, the weighted orthonormal Procrustes rotation problem to a partially specified target is solved by means of the projected gradient approach. Finally, we present some numerical experiments in Section 6 to illustrate the evolution of the dynamics and to compare the proposed algorithm to the existing ones.
2. STIEFEL MANIFOLD

The set O(p, q) of all p × q real matrices with orthonormal columns forms a smooth manifold of dimension q(q − 1)/2 + q(p − q) in R^{p×q}. It appears that Stiefel was the first to study its topology in detail (Stiefel 1935–1936). For a quick grasp of this Stiefel manifold, we recommend the article by Edelman, Arias, and Smith (1998). We outline in this section some main points needed in the discussion. Even though this article concerns the asymmetric (p > q) PRP, the considerations in this section are valid for p ≥ q. We shall regard O(p, q) as embedded in the pq-dimensional Euclidean space R^{p×q} equipped with the Frobenius inner product

    <X, Y> := trace(XY^T)     (2.1)
for any X, Y ∈ R^{p×q}. Hereafter we suppose that Q depends on a real parameter t, such that for all t ∈ R the matrix Q(t) is orthonormal; that is, Q(t) forms a one-parameter family of p × q orthonormal matrices. Thus, we regard Q(t) as a curve evolving on O(p, q) (we write Q for short). By definition, a tangent vector H of O(p, q) at Q is the velocity of a smooth curve Q(t) ∈ O(p, q) at t = 0. Having an explicit expression for the feasible set O(p, q), we obtain by differentiating Q^T Q = I_q at t = 0 that

    (dQ/dt)^T|_{t=0} Q + Q^T (dQ/dt)|_{t=0} = H^T Q + Q^T H = 0_q,     (2.2)

and thus the tangent space T_Q O(p, q) at any orthonormal matrix Q ∈ O(p, q) is given by

    T_Q O(p, q) = {H ∈ R^{p×q} | H^T Q + Q^T H = 0_q}
                = {H ∈ R^{p×q} | Q^T H is skew-symmetric}.     (2.3)
To further characterize a tangent vector, we recall that a least squares solution X to the equation MX = N is given by

    X = M† N + (I − M† M) W,

where M† is the Moore–Penrose inverse of M, I is an identity matrix, and W is an arbitrary matrix of proper dimension. Applied to our case with M = Q^T, where Q ∈ O(p, q), and N = K ∈ R^{q×q}, where K is skew-symmetric, we note that (Q^T)† = Q. The following theorem therefore follows.

Theorem 1. Any tangent vector H ∈ T_Q O(p, q) has the form

    H = QK + (I_p − QQ^T) W,     (2.4)
where K ∈ R^{q×q} is skew-symmetric and W ∈ R^{p×q} is arbitrary. When p = q, H = QK.

For convenience, we shall abbreviate I_p as I. Define

    S(q) := {all symmetric matrices in R^{q×q}},     (2.5)

and

    S(q)⊥ := {all skew-symmetric matrices in R^{q×q}},     (2.6)

which is simply the orthogonal complement of S(q) with respect to the Frobenius inner product (2.1). It is not difficult to check by dimension counting arguments that the normal space of O(p, q) at any orthonormal matrix Q is given by

    N_Q O(p, q) = QS(q).     (2.7)

Finally, denote by N(Q^T) := {X ∈ R^{p×q} | Q^T X = 0} the null space of Q^T. Thus, Theorem 1 can be rewritten as the following decomposition of the space R^{p×q}.

Theorem 2. The space R^{p×q} can be written as the direct sum of three mutually perpendicular subspaces

    R^{p×q} = QS(q) ⊕ QS(q)⊥ ⊕ N(Q^T).     (2.8)
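The tangent-vector form in Theorem 1 and the characterization (2.3) are easy to confirm numerically. A minimal Python sketch (our illustration; all variable names are ours): build H = QK + (I − QQ^T)W and check that Q^T H is skew-symmetric.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 5, 3
Q, _ = np.linalg.qr(rng.standard_normal((p, q)))   # a point on O(p, q)

# A tangent vector in the form of Theorem 1:
# H = QK + (I - QQ^T)W, with K skew-symmetric and W arbitrary.
M = rng.standard_normal((q, q))
K = M - M.T                                        # skew-symmetric part of M
W = rng.standard_normal((p, q))
H = Q @ K + (np.eye(p) - Q @ Q.T) @ W

# By (2.3), H is tangent at Q exactly when Q^T H is skew-symmetric;
# here Q^T H collapses back to K itself.
S = Q.T @ H
print(np.allclose(S, -S.T))   # True
```

A short calculation explains the check: Q^T H = K + (Q^T − Q^T QQ^T)W = K, which is skew-symmetric by construction.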
Therefore we are able to define the following projections.

Corollary 1. Let Z ∈ R^{p×q}. Then

    π_T(Z) := Q (Q^T Z − Z^T Q)/2 + (I − QQ^T) Z     (2.9)

defines the projection of Z onto the tangent space T_Q O(p, q). Similarly,

    π_N(Z) := Q (Q^T Z + Z^T Q)/2     (2.10)

defines the projection of Z onto the normal space N_Q O(p, q).
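The projections (2.9) and (2.10) are straightforward to implement. The following Python sketch (ours, not the paper's code) verifies that they split an arbitrary Z into mutually perpendicular tangent and normal parts:

```python
import numpy as np

def proj_tangent(Q, Z):
    """Projection (2.9) of Z onto the tangent space of O(p, q) at Q."""
    p = Q.shape[0]
    return Q @ (Q.T @ Z - Z.T @ Q) / 2 + (np.eye(p) - Q @ Q.T) @ Z

def proj_normal(Q, Z):
    """Projection (2.10) of Z onto the normal space Q S(q) at Q."""
    return Q @ (Q.T @ Z + Z.T @ Q) / 2

rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((6, 3)))
Z = rng.standard_normal((6, 3))
T, N = proj_tangent(Q, Z), proj_normal(Q, Z)

print(np.allclose(T + N, Z))                    # the two parts recover Z
print(abs(np.trace(T @ N.T)) < 1e-9)            # <T, N> = 0: perpendicular
print(np.allclose(Q.T @ T, -(Q.T @ T).T))       # T is tangent, per (2.3)
```

The first two checks confirm the orthogonal splitting implied by Theorem 2; the last confirms that π_T(Z) lands in the tangent space (2.3).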
3. PENROSE REGRESSION PROBLEM

Given matrices A ∈ R^{n×p}, C ∈ R^{q×m}, and B ∈ R^{n×m}, consider the function E : R^{p×q} → R defined by

    E(Q) := (1/2) <AQC − B, AQC − B>,     (3.1)

where <·, ·> is defined in (2.1). Clearly, the PRP is equivalent to the minimization of the function E(Q) over the feasible set O(p, q). With respect to the Frobenius inner product, the gradient ∇E(Q) of E(Q) should be interpreted as the matrix

    ∇E(Q) = A^T (AQC − B) C^T.     (3.2)
Suppose the projection g(Q) of the gradient ∇E(Q) onto the tangent space T_Q O(p, q) can be computed explicitly. Then the ODE

    dQ/dt = −g(Q)     (3.3)

naturally defines the steepest descent flow for the function E on the feasible set O(p, q). By applying Corollary 1, we can find this projected gradient g(Q), and thus the steepest descent flow for E(Q) in (3.3) is characterized by the following ODE:

    dQ/dt = (Q/2) (C(AQC − B)^T AQ − Q^T A^T (AQC − B) C^T) − (I − QQ^T) A^T (AQC − B) C^T.     (3.4)
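The flow (3.4) can be integrated with any stiff ODE solver. The sketch below is our illustration, not the paper's code: the paper uses MATLAB's ode15s, for which we substitute SciPy's BDF method, and the test problem with B = AQ_0C is a hypothetical construction that guarantees a known global minimum of zero.

```python
import numpy as np
from scipy.integrate import solve_ivp

# A small PRP instance with a known global minimizer Q0 (E(Q0) = 0):
rng = np.random.default_rng(3)
n, p, q, m = 6, 4, 3, 3
A = rng.standard_normal((n, p))
C = rng.standard_normal((q, m))
Q0, _ = np.linalg.qr(rng.standard_normal((p, q)))
B = A @ Q0 @ C

def flow(t, y):
    """Right-hand side of the projected-gradient ODE (3.4)."""
    Q = y.reshape(p, q)
    R = A @ Q @ C - B                  # residual AQC - B
    G = A.T @ R @ C.T                  # Euclidean gradient (3.2)
    dQ = Q @ (C @ R.T @ A @ Q - Q.T @ G) / 2 - (np.eye(p) - Q @ Q.T) @ G
    return dQ.ravel()

Qinit = np.vstack([np.eye(q), np.zeros((p - q, q))])   # Q(0) = [I_q; 0]
sol = solve_ivp(flow, (0.0, 200.0), Qinit.ravel(),
                method='BDF', rtol=1e-8, atol=1e-10)
Qf = sol.y[:, -1].reshape(p, q)
print(np.linalg.norm(Qf.T @ Qf - np.eye(q)))   # stays near the manifold
print(np.linalg.norm(A @ Qf @ C - B))          # objective is driven down
```

Because the flow is a descent flow, the objective value at the limit point is no larger than at Q(0); convergence to the global minimizer from this particular start is not guaranteed in general.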
Starting with an initial value satisfying Q(0)^T Q(0) = I_q, say

    Q(0) = [I_q; 0],

we may trace the corresponding integral curve of (3.4) to its limit point, which will be an approximate solution to the PRP. In Section 6 we demonstrate the numerical solution of (3.4) using the MATLAB ODE suite (Shampine and Reichelt 1997). An alternative approach for solving matrix ODEs of this kind was proposed by Diele, Lopez, and Peluso (1998). For a general approach to solving matrix ODEs on differentiable manifolds, see, for example, Eng, Marthinsen, and Munthe-Kaas (1997).

Knowledge of the projected gradient g(Q) explicitly provides information about the first order optimality condition.

Theorem 3. A necessary condition for Q ∈ O(p, q) to be a stationary point of the PRP is that the following two conditions hold simultaneously:
(a) C(AQC − B)^T AQ is symmetric, and
(b) (I − QQ^T) A^T (AQC − B) C^T = 0.

Proof: For Q to be a stationary point, it is necessary that g(Q) = 0. Since the two terms in (3.4) are mutually perpendicular by Theorem 2, each individual term must be zero by itself. Condition (a) follows from premultiplying the first factor in (3.4) by Q^T. □

Now, consider the Lagrangian L(Q, S) associated with the PRP, where (Q, S) ∈ R^{p×q} × S(q):

    L(Q, S) = (1/2) <AQC − B, AQC − B> + <I − Q^T Q, S>.

Then one can easily find that

    ∇_Q L(Q, S) = A^T (AQC − B) C^T − QS = 0     (3.5)
is a necessary condition for Q to solve the PRP. Multiplying (3.5) by Q^T, we obtain

    Q^T A^T (AQC − B) C^T = S ∈ S(q),     (3.6)

and substituting it in (3.5) gives

    (I − QQ^T) A^T (AQC − B) C^T = 0.     (3.7)
Thus, making use of the classical Lagrangian approach, we establish identicalnecessary conditions to those given by Theorem 3. For some problems, the necessary conditions obtained by Lagrangian approach provide equationsfor nding the unknowns of the original problem. Unfortunately, for most of the problems this is not the case, for example, PRP. Indeed, (3.6) cannot help to nd Q and solve PRP. The advantage of the projected gradient approach is that it not only gives necessary conditions for the solver, but always gives an ODE de ning the path to the solution of the original problem, for example, (3.4) for PRP . We can also derive an explicit projected Hessian formula to further identify the stationary points. The development is based on an extension idea discussed by Chu and Driessel (1990). From the standard constrained optimization theory (Gill, Murray, and Wright 1981, sec. 3.4), we have the following result: Theorem 4. At a stationary point Q 2 O(p; q) satisfying Theorem 3, a second order necessary condition for Q to be a minimizer of the PRP is that the inequality T T ® Q A AQKCC T ¡ QT AT (AQC ¡ B)C T K; K ® +2 A T AQKCC T ; (I ¡ QQT )W ® + AT A(I ¡ QQT )W C; (I ¡ QQT )W C ® ¡ (I ¡ QQT )W QT AT (AQC ¡ B)C T ; (I ¡ QQT )W ¶ 0 (3.8)
holds for all skew-symmetric matrices K 2 Rq£q and arbitrary matrices W 2 Rp£q . If (3.8) holds with the strict inequality, then it is suf cient that Q is a local minimizer for the PRP. The assertion follows from the adjoint property hXY; Zi = hY; X T Zi and the facts that K = ¡ K T for K¡ skew-symmetric matrix, QT (I ¡ QQT ) = 0 and (I ¡ QQT )2 = I ¡ QQT for Q-orthonormal matrix. The condition in Theorem 4 is the best one can hope for from the general theory of nonlinear equality constrained optimization. To our knowledge, the characterization in Equation (3.8) is given here for the rst time. Note that we have purposefully rewritten the last term on the left-hand side of (3.8) in the quadratic form of (I ¡ QQT )W . Particularly, a second-order necessary condition for a stationary point Q 2 O(p; q) to be a minimizer of the PRP is given by the inequalities: T T ® ® Q A AQKCC T ; K + C(AQC ¡ B)T AQ; K 2 ¶ 0 (3.9) for all skew-symmetric matrices K in Rq£q and ® A(I ¡ QQT )W C; A(I ¡ QQT )W C ¶ (I ¡ QQT )W QT AT (AQC ¡
B)C T ; (I ¡
QQT )W
®
(3.10)
for arbitrary matrices W ∈ R^{p×q}. We can further characterize condition (3.9) in the following way. Note that the spectral decomposition of K^2 for any skew-symmetric matrix K ∈ R^{q×q} is necessarily of the form

    K^2 = −U Σ^2 U^T,     (3.11)

where U is a q × q orthogonal matrix and Σ = diag{σ_1, ..., σ_q} contains singular values with σ_{2i−1} = σ_{2i}, i = 1, ..., ⌊q/2⌋, and σ_q = 0 if q is odd. Denote the spectral decompositions of the following matrices:

    C(AQC − B)^T AQ = V Λ V^T,
    A^T A = T Φ T^T,
    CC^T = S Ψ S^T,

where each of V, S ∈ R^{q×q} and T ∈ R^{p×p} is orthogonal. Note that all entries in Φ = diag{τ_1, ..., τ_p} and Ψ = diag{ψ_1, ..., ψ_q} are nonnegative. We can rewrite inequality (3.9) as follows:

    <ΦRΨ, R> − <VΛV^T, UΣ^2 U^T>
      = Σ_{j=1}^{p} τ_j (Σ_{s=1}^{q} r_{js}^2 ψ_s) − Σ_{i=1}^{q} λ_i (Σ_{t=1}^{q} p_{it}^2 σ_t^2)  ≥  0,     (3.12)
where P = (p_it) := V^T U and R = (r_js) := T^T QKS. Since the orthogonal matrix P is arbitrary, it is tempting to say that, in order to maintain the inequality in (3.12), all entries λ_1, ..., λ_q must be nonpositive. That is, it appears reasonable to claim that a necessary condition for a stationary point Q ∈ O(p, q) to be a solution of the PRP is that the matrix C(AQC − B)^T AQ be negative semi-definite. Such a claim is, in fact, wrong. The difficulty lies in the fact that both terms in (3.12) are of the same order O(||K||^2). We simply have no way of showing that λ_i ≤ 0 for all i is necessary to guarantee (3.9) for all nonzero skew-symmetric K. In fact, we can argue the opposite by considering the special symmetric case. It can be shown that the condition that C(AQC − B)^T AQ be negative semi-definite is sufficient for a stationary point Q to be a solution of the symmetric PRP (Chu and Trendafilov 1998, theorem 3.5), but it simply cannot be necessary. This subtlety is in contrast to the necessary condition of Corollary 2 for the OPP, to be discussed in the next section. Our general theory indicates how the presence of C complicates the conditions.

Finally, we show that the Lagrangian approach leads to the same second order optimality condition already obtained in Theorem 4. Following the standard result in constrained optimization theory (Gill, Murray, and Wright 1981, sec. 3.4), we have

    <H, ∇²_{QQ} L(Q, S) H> ≥ 0,  for H ∈ T_Q O(p, q)

(see Section 2), where

    ∇²_{QQ} L(Q, S) H = A^T A H CC^T − HS.

Substituting S from (3.6) and making use of the representation of H ∈ T_Q O(p, q), one arrives at a result identical to that obtained in Theorem 4.
4. ORTHONORMAL PROCRUSTES PROBLEM

The OPP is a special case of the PRP with m = q and C = I_q. Our point in this section is to illustrate how some of the classical results for the OPP follow quickly from our previous results for the more general PRP. The steepest descent flow for the OPP is characterized by the ODE

    dQ/dt = (Q/2) (Q^T A^T B − B^T AQ) − (I − QQ^T) A^T (AQ − B).     (4.1)
By Theorem 3, the first order optimality condition for the OPP becomes the following:

Theorem 5. For Q ∈ O(p, q) to be a stationary point of the OPP, the following two conditions must hold simultaneously:
(a) B^T AQ is symmetric, and
(b) (I − QQ^T) A^T (AQ − B) = 0.

Proof: Condition (a) in Theorem 3 reduces to −B^T AQ, and hence B^T AQ, being symmetric, because Q^T A^T AQ is automatically symmetric. □

We remark here that the conditions in Theorem 5 are equivalent to Equation (1.4) used in the literature (Gower 1984, p. 767). This is easily seen by manipulations similar to (3.6) and (3.7). Our concern is that it has never been clear to us how (1.4) was developed for the asymmetric case. (It was probably presented at the 1977 Annual Meeting of the Psychometric Society and obtained, presumably, by means of the Lagrangian approach; see (3.5).) Our theory now provides a rigorous mathematical justification. The projected Hessian and the second order optimality condition become, according to Theorem 4, as follows.

Theorem 6. At a stationary point Q ∈ O(p, q) satisfying Theorem 5, a second-order necessary condition for Q to be a minimizer of the OPP is that the inequality

    <B^T AQK, K> + 2 <A^T AQK, (I − QQ^T) W>
      + <A^T A (I − QQ^T) W, (I − QQ^T) W>
      − <(I − QQ^T) W Q^T A^T (AQ − B), (I − QQ^T) W>  ≥  0     (4.2)

holds for all skew-symmetric matrices K ∈ R^{q×q} and arbitrary matrices W ∈ R^{p×q}. If (4.2) holds with strict inequality, then it is sufficient for Q to be a local minimizer of the OPP.

The two special cases of Theorem 6 with W = 0 and K = 0 become, respectively,

    <B^T AQK, K>  ≥  0     (4.3)

for all skew-symmetric matrices K in R^{q×q}, and

    <A(I − QQ^T) W, A(I − QQ^T) W>  ≥  <(I − QQ^T) W Q^T A^T (AQ − B), (I − QQ^T) W>     (4.4)

for arbitrary matrices W ∈ R^{p×q}. We shall show below that (4.3) is equivalent to the condition known in the literature (ten Berge 1977a). But (4.4) apparently is new.
Assuming the spectral decomposition (3.11) of K^2, and since B^T AQ is necessarily symmetric at any stationary point Q, let B^T AQ = VΛV^T denote the corresponding spectral decomposition. Note that

    <B^T AQK, K> = <VΛV^T, UΣ^2 U^T> = Σ_{i=1}^{q} λ_i (Σ_{t=1}^{q} p_{it}^2 σ_t^2),

where P = (p_ij) = V^T U. Since the orthogonal matrix P ∈ R^{q×q} can be arbitrary, in order to maintain the inequality in (4.3) all entries λ_1, ..., λ_q must be nonnegative. We have proved the following result.

Corollary 2. A second-order necessary condition for a stationary point Q ∈ O(p, q) to be a solution of the OPP is that the matrix B^T AQ be positive semi-definite and that the inequality (4.4) be satisfied for arbitrary W ∈ R^{p×q}.

Numerical experiments show that condition (4.4) is rather loose: the left-hand side of (4.4) typically dominates considerably. Thus, it is probably of limited practical importance.
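Corollary 2's positive semi-definiteness condition is easy to confirm at the known solution of the square orthogonal Procrustes problem, where Q = VU^T comes from the SVD of A^T B (Section 1). A Python sketch, our illustration only:

```python
import numpy as np

# For the square (p = q) orthogonal Procrustes problem, the optimal
# Q = V U^T from the SVD A^T B = V Sigma U^T; at this Q, Corollary 2's
# necessary condition says B^T A Q must be symmetric positive semi-definite.
rng = np.random.default_rng(4)
A = rng.standard_normal((7, 4))
B = rng.standard_normal((7, 4))
V, s, Ut = np.linalg.svd(A.T @ B)
Q = V @ Ut

M = B.T @ A @ Q   # algebraically equals U Sigma U^T, hence symmetric psd
print(np.allclose(M, M.T))
print(np.min(np.linalg.eigvalsh(M)) > -1e-8)
```

Indeed, B^T A Q = UΣV^T · VU^T = UΣU^T, whose eigenvalues are the (nonnegative) singular values of A^T B.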
5. PENROSE REGRESSION PROBLEM WITH PARTIALLY SPECIFIED TARGET

All of the problems considered in this work arise in factor analysis and multidimensional scaling for the transformation (rotation) of a certain initial solution to a prescribed simple-structure solution (Gower 1984). The structure of the desired solution is given by a so-called target matrix. The aim of the PRP is to find the best fit of the initial solution to the desired one in the least-squares sense. In terms of the formal definition (1.1) of the PRP, the initial solution A should be rotated by Q such that, after weighting by C, it fits the target B as well as possible. In many practical situations the target matrix is not known entirely, or the user wants to fix some of its elements while leaving the rest unspecified. In such situations one is interested in fitting only the specified elements of the target. This problem is known as the Procrustes rotation problem to a partially specified target. Browne (1972) solved this problem for orthogonal rotation Q and without weights C. In this section we solve the more general weighted orthonormal Procrustes rotation problem to a partially specified target. We shall see that, without any additional effort, the projected gradient approach leads to its solution. For given fixed A ∈ R^{n×p}, C ∈ R^{q×m}, and V ∈ R^{n×m}, and a given partially fixed B ∈ R^{n×m}, the weighted orthonormal Procrustes rotation problem to a partially specified target concerns the optimization:

    minimize    ||(AQC − B) ∘ V||                  (5.1)
    subject to  Q ∈ R^{p×q},  Q^T Q = I_q,         (5.2)
where the matrix V = {v_ij} is defined as

    v_ij = 1 if b_ij is specified, and v_ij = 0 otherwise,

and "∘" denotes the standard elementwise (Hadamard) matrix product. In other words, this rotation problem seeks an orthonormal Q such that AQC gives the best fit to the specified elements of the target matrix B in the least-squares sense. Consider the function E_V : R^{p×q} → R defined by

    E_V(Q) := (1/2) <(AQC − B) ∘ V, (AQC − B) ∘ V>.     (5.3)

Clearly, the problem of weighted orthonormal rotation to a partially specified target is equivalent to the minimization of the function E_V(Q) over the feasible set O(p, q). With respect to the Frobenius inner product, the gradient ∇E_V(Q) of E_V(Q) should be interpreted as the matrix

    ∇E_V(Q) = A^T [(AQC − B) ∘ V] C^T.     (5.4)
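The gradient formula (5.4) relies on V being a 0/1 mask, so that V ∘ V = V. The following Python sketch (ours; the helper names `grad_EV` and `EV` are hypothetical) checks (5.4) against central finite differences of E_V:

```python
import numpy as np

def grad_EV(Q, A, B, C, V):
    """Euclidean gradient (5.4) of the partially specified target
    objective: A^T [(AQC - B) * V] C^T, with '*' elementwise."""
    return A.T @ ((A @ Q @ C - B) * V) @ C.T

rng = np.random.default_rng(5)
n, p, q, m = 5, 4, 3, 4
A, C = rng.standard_normal((n, p)), rng.standard_normal((q, m))
B = rng.standard_normal((n, m))
V = (rng.random((n, m)) < 0.6).astype(float)   # 0/1 mask of specified entries
Q = rng.standard_normal((p, q))

def EV(Q):
    R = (A @ Q @ C - B) * V
    return 0.5 * np.sum(R * R)

# Central finite-difference approximation of the gradient, entry by entry:
G = grad_EV(Q, A, B, C, V)
h = 1e-6
D = np.zeros_like(Q)
for i in range(p):
    for j in range(q):
        E1 = Q.copy(); E1[i, j] += h
        E2 = Q.copy(); E2[i, j] -= h
        D[i, j] = (EV(E1) - EV(E2)) / (2 * h)
print(np.allclose(G, D, atol=1e-4))   # (5.4) matches finite differences
```

This is an unconstrained check of the Euclidean gradient only; the constrained flow then projects this gradient onto the tangent space exactly as in Section 3.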
Following step by step the formalism developed in Section 3, one easily derives the ODE that defines the steepest descent flow for the function E_V on the feasible set O(p, q):

    dQ/dt = (Q/2) (C[(AQC − B) ∘ V]^T AQ − Q^T A^T [(AQC − B) ∘ V] C^T) − (I − QQ^T) A^T [(AQC − B) ∘ V] C^T.     (5.5)

Starting with an initial value Q(0) satisfying Q(0)^T Q(0) = I_q, we may trace the corresponding integral curve of (5.5) to its limit point, which will be an approximate solution to the problem (5.1)–(5.2). In Section 6 we demonstrate and discuss the numerical solution of (5.5) using the MATLAB ODE suite (Shampine and Reichelt 1997). For illustration purposes only, we solve here the small example considered by Browne (1972), which is to find the 3 × 3 orthogonal rotation Q that transforms

    A = [  .6640   .6880   .4920   .8370   .7050   .8200   .6610   .4570   .7650
           .3220   .2480   .3040  −.2910  −.3140  −.3770   .3970   .2940   .4280
          −.0750   .1920   .2240   .0370   .1550  −.1040   .0770  −.4880   .0090 ]^T

as closely as possible, in the least-squares sense, to the following target:

    B = [  x  x  x  x  x  x  .7  0  .7
           0  0  0  x  x  x   x  x   x
           x  0  0  x  0  x   x  x   x ]^T,

where "x" denotes the unspecified elements in B. Solving (5.5) with the random initial value

    Q(0) = [  .6150   .7817  −.1036
              .6672  −.4459   .5967
              .4202  −.4361  −.7958 ]
we find that

    Q = [  .8172   .4439   .3676
           .3232  −.8810   .3455
           .4772  −.1635  −.8634 ]

is the desired rotation, and we compute

    AQ = [  .6109   .7340   .6072   .6076   .5486   .4986   .7052   .2356   .7678
            .0234   .0555  −.0860   .6219   .5643   .7132  −.0689   .0237  −.0389
            .4201   .1728   .0925   .1752   .0169   .2610   .3137   .6909   .4213 ]^T,

which is exactly Browne's solution (Browne 1972). The entries of AQ in the positions of the specified elements of the target B (set in bold type in the original article) are the fitted values. Starting with different initial values may change the signs of an entire column of AQ (either the second, or the third, or both), but (almost) never the magnitudes of the elements; that is, the goodness of fit remains unaffected. Similarly to Section 3, the first order optimality conditions are readily obtained.

Theorem 7. For Q ∈ O(p, q) to be a stationary point of the weighted orthonormal Procrustes rotation problem to a partially specified target, the following two conditions must hold simultaneously:
(a) C[(AQC − B) ∘ V]^T AQ is symmetric, and
(b) (I − QQ^T) A^T [(AQC − B) ∘ V] C^T = 0.
6. NUMERICAL EXPERIMENT

In this section, we report some experiences from our numerical experiments with the ODEs (3.4), (4.1), and (5.5). The computation is carried out in MATLAB 5.2 on a SUN Ultra-2/200 workstation. The solver used for the initial value problem is ode15s from the MATLAB ODE suite (Shampine and Reichelt 1997), also publicly available from the Internet. The code ode15s is a quasi-constant step size implementation of the Klopfenstein–Shampine family of numerical differentiation formulas for stiff systems. More details of this code can be found in Shampine and Reichelt (1997). We give some results to illustrate the behavior of the numerical solutions of (3.4), (4.1), and (5.5). To fit the data comfortably into the text, we display all numbers with five digits. All codes and results used in this experiment are available upon request.

One important feature of the ODEs (3.4), (4.1), and (5.5) is that the resulting Q(t) should automatically stay on the manifold O(p, q). In numerical computation, however, round-off and truncation errors may throw the computed Q(t) off the constraint manifold; see, for example, Example 1(b). In a large number of numerical experiments we even found a few datasets for which the flow Q(t) does not converge, due to "slipping off" the constraint manifold. To remedy this problem, we adopt an additional nonlinear projection scheme suggested by Gear (1986) and Dieci, Russell, and Van Vleck (1994): Suppose Q is an approximate solution to one of the ODEs under consideration, satisfying

    Q^T Q = I + O(h^r),     (6.1)

where r represents the order of the numerical method. Let Q = Q̃R be the unique QR decomposition of Q with diag(R) > 0. Then

    Q̃ = Q + O(h^r)     (6.2)

and Q̃ ∈ O(p, q). The condition diag(R) > 0 is important to ensure that the transition of Q(t) is smooth in t. In our implementation, the approximate solution Q of the ODE is replaced by the corresponding Q̃. An alternative projection scheme was considered at an early stage of this research and was also suggested by the anonymous reviewer: Suppose again that Q is an approximate solution of the ODEs satisfying (6.1), and let Q̃ ∈ O(p, q) be the matrix closest to Q in least-squares terms; that is, the solution of the asymmetric OPP with A = I, which is Q̃ = VU^T with the SVD Q = VΣU^T. For small deviations of Q from O(p, q) (which is our case), both projection schemes produce identical Q̃ ∈ O(p, q), and the QR decomposition is faster for large sizes of Q. To complete this discussion, we note that in general the resulting flow Q(t) is satisfactorily robust, and the ODE can often be integrated numerically without applying any projection onto O(p, q) at every integration step. This usually saves up to 20% of the CPU time.
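The QR-based projection with diag(R) > 0 can be sketched as follows (our Python illustration; NumPy's qr does not enforce a positive diagonal on R, so the column signs are flipped explicitly):

```python
import numpy as np

def qr_retract(Q):
    """Project an approximate solution back onto O(p, q) via the unique
    QR decomposition with diag(R) > 0 (Gear 1986; Dieci et al. 1994)."""
    Qt, R = np.linalg.qr(Q)
    sign = np.sign(np.diag(R))
    sign[sign == 0] = 1.0          # guard against exact zeros on diag(R)
    return Qt * sign               # flip columns so that diag(R) > 0

rng = np.random.default_rng(6)
Q, _ = np.linalg.qr(rng.standard_normal((5, 3)))
Qoff = Q + 1e-4 * rng.standard_normal((5, 3))   # slipped off the manifold
Qtilde = qr_retract(Qoff)

print(np.allclose(Qtilde.T @ Qtilde, np.eye(3)))   # back on O(p, q)
print(np.linalg.norm(Qtilde - Qoff))               # a small correction
```

For a deviation of size O(h^r) the correction is of the same order, consistent with (6.2).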
6.1
SIMULATION EXPERIMENTS WITH OPP
First, we present experiments with the gradient flow for OPP, Equation (4.1). In the two following examples, the tolerance for the absolute error is set at 10^-6, and for the relative error at 10^-3. This criterion is used to control the accuracy in following the solution path. Higher accuracy does not change the dynamics of the solution and simply needs more CPU. We examine the output values at a time interval of 1. The integration terminates automatically when the relative improvement of E(Q) between two consecutive output points is less than 10^-4, indicating that a local minimizer has been found.

Example 1. In practice, the data in A and B often represent two different ordinations of the same samples or populations. Based on this idea, we produce the test data for this experiment by first using the random number generator rand in MATLAB to create the matrix

    A = [ .2190  .3835  .5297  .4175
          .0470  .5194  .6711  .6868
          .6789  .8310  .0077  .5890
          .6793  .0346  .3834  .9304
          .9347  .0535  .0668  .8462 ].
Then, we define B := AQ0 with

    Q0 = [ 0  1  0
           1  0  0
           0  0  1
           0  0  0 ],
so that the underlying OPP, though it may have many local solutions due to its nonlinearity, has exactly one global solution Q0, at which E(Q0) = 0. We emphasize here that the data
Figure 1. A semi-log plot of E(Q(t)) and Ω(Q(t)) for Example 1(a).
obtained this way are not realistic, because in practice B can rarely be a simple permutation of the columns of A. In fact, it is precisely because it is often difficult to determine the relationship between B and A that an orthonormal Procrustes analysis is needed (Gower 1984). We present the following examples just to illustrate how the descent flow behaves.

(a). Obviously, the initial value determines where our descent flow will converge. For instance, suppose we start with the matrix
    Q(0) = [ .2618   .9198  -.2333
             .8912  -.3467  -.2333
             .2618   .1301   .9399
             .2618   .1301   .0876 ],        (6.3)
that represents a nontrivial perturbation of Q0. We solve this example with the projection scheme both on and off. The objective value ‖AQ* − B‖ ≈ 4.0415 × 10^-4 when the projection is on, with 262,065 flops used, and ‖AQ* − B‖ ≈ 6.4311 × 10^-4 without projection, for 215,372 flops. Figure 1 records the history of the changes of the objective value E(Q(t)) = ‖AQ(t) − B‖, where Q(t) is determined by integrating the ODE (4.1). Clearly, the global solution is obtained in this case. Also recorded in Figure 1 is the history of the function

    Ω(Q(t)) := ‖I3 − Q(t)^T Q(t)‖        (6.4)
Figure 2. A semi-log plot of E(Q(t)) and Ω(Q(t)) for Example 1(b).
that measures the deviation of Q(t) from the constraint manifold O(4, 3). It is seen that Q(t) is kept well within the local tolerance when the projection scheme is on.

(b). Suppose the initial value Q(0) is taken to be

    Q(0) = [ I3
             0  ].
Incorporating the projection schemes based on QR and SVD, we can reach only a local minimizer

    Q* = [ .0094   .3811  -.5768
           .9999   .0094   .0088
           .0088  -.5768   .4625
          -.0110   .7225   .6733 ]
with objective value ‖AQ* − B‖ ≈ .2234. Then, repeating the computations with the projection schemes off, we found the following considerably better "solution":
    Q** = [ .0010   .9817  -.0172
            .9998   .0011   .0008
            .0008  -.0175   .9835
           -.0010   .0208   .0197 ]
Figure 3. A semi-log plot of E(Q(t)) and Ω(Q(t)) for Example 2.
with objective value ‖AQ** − B‖ ≈ .0066. The test results are presented in Figure 2. It is interesting to note that both Q* and Q** satisfy the conditions in Theorem 5 and Corollary 2, while Q** considerably violates the orthonormality constraint:

    Q**^T Q** = [ .9995   .0020   .0016
                  .0020   .9646  -.0336
                  .0016  -.0336   .9679 ].
The result is quite natural: by violating the orthonormality constraint we actually relax the constrained problem, which, of course, leads to lower minima.

Example 2. Suppose B = AQ0 + (1/2)Δ, where
    Δ = [   .5383   .9503   .6004
           -.6168   .3468  1.0047
          -1.2161  -.9547  -.3608
           -.8900  -.7598  -.6719
          -1.9832   .3192  -.6037 ]
represents a random perturbation from the normal distribution N(0, 1). With Q = Q0, this noise has the magnitude ‖AQ0 − B‖ = 1.7179. But by following (4.1) with Q(0) given by
(6.3), we obtain

    Q# = [ -.0895   .7726  -.5277
            .7471  -.1845   .0163
            .2731   .6034   .7309
           -.5990  -.0701   .4324 ]
with ‖AQ# − B‖ reduced to 1.1182. In this example the results obtained with the projection scheme on and off are identical; the flops used are 274,483 and 238,814, respectively. The test results are recorded in Figure 3. Again, we can only report that a local minimizer is found.
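For readers who wish to experiment, the descent flow and the stopping rule described at the start of this section can be imitated in outline. The sketch below is ours, not the paper's method: it uses the generic projected gradient flow on the Stiefel manifold, Q' = −[G − Q sym(Q^T G)] with G = A^T(AQ − B), a crude explicit Euler integrator with QR re-projection, and termination when the relative improvement of E(Q) between output points is below 10^-4; the flow (4.1) and the MATLAB ODE suite used in the paper are more refined.

```python
import numpy as np

def sym(M):
    return (M + M.T) / 2

def opp_descent(A, B, Q, h=1e-2, out_interval=1.0, tol=1e-4, max_out=500):
    """Crude Euler integration of a projected gradient flow for
    min ||A Q - B|| over Q^T Q = I; stop when the relative improvement of
    E(Q) between two consecutive output points is less than tol."""
    steps = int(round(out_interval / h))
    E_prev = np.linalg.norm(A @ Q - B)
    for _ in range(max_out):
        for _ in range(steps):
            G = A.T @ (A @ Q - B)               # Euclidean gradient of E^2/2
            Q = Q - h * (G - Q @ sym(Q.T @ G))  # projected gradient step
            Q = np.linalg.qr(Q)[0]              # re-project onto O(p, q)
        E = np.linalg.norm(A @ Q - B)
        if abs(E_prev - E) < tol * max(E_prev, 1e-12):
            break
        E_prev = E
    return Q, E

# Example-1-like data: B is an exact rotation of A, so E can reach 0
rng = np.random.default_rng(0)
A = rng.random((5, 4))
Q0 = np.eye(4)[:, [1, 0, 2]]        # the permutation-type global solution
B = A @ Q0
Qstart = np.linalg.qr(Q0 + 0.3 * rng.standard_normal((4, 3)))[0]
Q, E = opp_descent(A, B, Qstart)
print(E)    # small residual: a (possibly global) minimizer has been reached
```

As in the experiments above, the flow decreases the objective monotonically in practice, but only a local minimizer is guaranteed.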
6.2 SIMULATION COMPARISON OF SEVERAL PRP SOLVERS
It is worth comparing the proposed algorithm based on the projected gradient approach with the other existing solutions given by the product of planar rotations (Mooijaart and Commandeur 1990), majorization (Kiers 1990), and "refined" majorization (Kiers and ten Berge 1992; Koschat and Swayne 1991). We report here 100 numerical solutions of (1.1)–(1.2) obtained by each of the four methods, making use of three different random number generators. The experiment is organized as follows. We generate 100 random matrices A ∈ R^{n×p}, C ∈ R^{q×m}, and B ∈ R^{n×m}. For these 100 triples A, B, and C we solve the problem (1.1)–(1.2) by the available solvers. We solve Equation (3.4) starting from the following initial value Q0: first, solve AX = B for X by least squares; then solve the linear system Y C = X for Y; and finally project Y onto the Stiefel manifold using the QR decomposition to find Q0. In these experiments, the tolerance for the absolute error is set at 10^-6, and for the relative error at 10^-3. In order to gain better overall information on these approaches we compute the sample mean of the obtained minimal value of the objective function (1.1) and its sample variance over all 100 sets of data. The sample mean of the CPU time used per run and its sample variance over all 100 runs are also computed. It is common for MATLAB procedures to measure the number of flops used. The projected gradient solutions, in all experiments, consume considerably fewer flops than the other three methods. This may mislead the reader into thinking that solving ODEs is an easy task, so we decided to report the CPU time instead, though it is not quite a reliable measure for comparisons. We generate matrices A, B, and C with dimensions n = 7, m = 4, q = 3, and p = 5, 6, 7, 10, 20; that is, we compare solutions for five different sizes of the orthonormal unknown p × q matrix Q.
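The three-step initialization just described can be rendered directly in code. The following NumPy sketch is our own translation (the function name is ours); each step mirrors one clause of the recipe in the text:

```python
import numpy as np

def initial_value(A, B, C):
    # Step 1: solve A X = B for X by least squares
    X = np.linalg.lstsq(A, B, rcond=None)[0]
    # Step 2: solve the linear system Y C = X for Y
    #         (transpose to the standard form C^T Y^T = X^T)
    Y = np.linalg.lstsq(C.T, X.T, rcond=None)[0].T
    # Step 3: project Y onto the Stiefel manifold via the QR decomposition,
    #         enforcing diag(R) > 0 so the projection is unique
    Q0, R = np.linalg.qr(Y)
    s = np.sign(np.diag(R))
    s[s == 0] = 1
    return Q0 * s

n, m, p, q = 7, 4, 5, 3
rng = np.random.default_rng(0)
A = rng.random((n, p)) - .5
B = rng.random((n, m)) - .5
C = rng.random((q, m)) - .5
Q0 = initial_value(A, B, C)
print(np.allclose(Q0.T @ Q0, np.eye(q)))   # Q0 lies on the Stiefel manifold
```

Note the shapes: X is p × m, Y is p × q, and Q0 is the p × q orthonormal starting point for the flow.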
Results of Mooijaart and Commandeur's (1990) algorithm (product of planar rotations) are reported for the case n = 7, m = 4, p = 5, and q = 3 only, because it is rather slow in CPU time. First we generate A, B, and C by rand − .5, uniformly distributed random numbers on the interval (−.5, .5). The results obtained are summarized in Table 2. The "refined" majorization algorithm is the fastest of the three in CPU time. The minima of (1.1) obtained (i.e., the errors of the fit) by the three methods are practically identical. Note that Mooijaart and Commandeur's (1990) algorithm for n = 7, m = 4, p = 10, and q = 3 gives sample mean and variance of the minima of 1.1744 and .1250, and sample mean and variance of the CPU time of 28.1649 and 773.9405.
Table 2. Results for Data Generated by rand − .5

                                                       Minimum             CPU Time
  Data                  Method                         mean      variance  mean      variance

  n=7, m=4, p=5,  q=3   Refined majorization           1.6581    .1125     .0596     .0030
                        Majorization                   1.6486    .1123     .1435     .0154
                        Projected gradient             1.6493    .1128     .2597     .0110
                        Product of planar rotations    1.6463    .1193     3.1545    8.3937

  n=7, m=4, p=6,  q=3   Refined majorization           1.5242    .1087     .0815     .0033
                        Majorization                   1.5247    .1083     .1909     .0262
                        Projected gradient             1.5236    .1087     .3248     .0274

  n=7, m=4, p=7,  q=3   Refined majorization           1.4947    .1249     .1226     .0104
                        Majorization                   1.4951    .1252     .2233     .0314
                        Projected gradient             1.4953    .1251     .3712     .0171

  n=7, m=4, p=10, q=3   Refined majorization           1.1744    .1250     .1543     .0088
                        Majorization                   1.1744    .1250     .2292     .0151
                        Projected gradient             1.1747    .1250     .4073     .0147

  n=7, m=4, p=20, q=3   Refined majorization           .9744     .1104     .2220     .0182
                        Majorization                   .9744     .1104     .4136     .0384
                        Projected gradient             .9745     .1104     .7071     .0303

  NOTE: Refined majorization: Kiers and ten Berge (1992), Koschat and Swayne (1991). Majorization: Kiers (1990). Product of planar rotations: Mooijaart and Commandeur (1990).
Table 3. Results for Data Generated by rand

                                                       Minimum             CPU Time
  Data                  Method                         mean      variance  mean      variance

  n=7, m=4, p=5,  q=3   Refined majorization           2.0113    .3478     .4590     .1700
                        Majorization                   1.9964    .3343     2.1178    2.3850
                        Projected gradient             1.9798    1.9798    .6917     .0342
                        Product of planar rotations    1.9921    .3585     23.8393   1.4163 × 10^3

  n=7, m=4, p=6,  q=3   Refined majorization           1.7986    .4608     .7296     .2739
                        Majorization                   1.8018    .4603     2.5293    2.6128
                        Projected gradient             1.8008    .4608     .7807     .0487

  n=7, m=4, p=7,  q=3   Refined majorization           1.6446    .3332     1.0917    .4763
                        Majorization                   1.6440    .3329     3.2325    2.4534
                        Projected gradient             1.6397    .3355     .9111     .0495

  n=7, m=4, p=10, q=3   Refined majorization           1.3151    .3480     1.4191    .8074
                        Majorization                   1.3156    .3482     3.9922    3.9207
                        Projected gradient             1.3149    .3479     1.0482    .0387

  n=7, m=4, p=20, q=3   Refined majorization           1.0587    .5094     1.3540    .4421
                        Majorization                   1.0597    .5101     8.9595    13.8677
                        Projected gradient             1.0587    .5093     2.3604    .3359

  NOTE: Refined majorization: Kiers and ten Berge (1992), Koschat and Swayne (1991). Majorization: Kiers (1990). Product of planar rotations: Mooijaart and Commandeur (1990).
Next, for the same values of n, m, q, and p we generate A, B, and C by rand; that is, uniformly distributed random numbers on the interval (0, 1). The results are summarized in Table 3. The behavior of the majorization algorithm (Kiers 1990) is rather poor for these data. The CPU times of the "refined" majorization and projected gradient algorithms are similar for moderate sizes of Q. The errors of the model fit (1.1) of the three algorithms are practically identical. Finally, we generate A, B, and C by randn; that is, random numbers with normal distribution N(0, 1). The results are summarized in Table 4. Again, for these data, the "refined" majorization algorithm is the fastest one. The projected gradient algorithm gives a better fit of the model (1.1) to these data, while the most significant deviations are produced by the "refined" majorization algorithm.
6.3 PRP WITH PARTIALLY SPECIFIED TARGET
Example 3. Finally, we report some numerical experiments with the solution of the weighted orthonormal Procrustes rotation problem to a partially specified target (5.1)–(5.2).

(a). In the first experiment we generate a random 5 × 3 orthonormal matrix Q0 and random matrices A and C (if the reader is interested, the random data in this subsection can be obtained upon request from the second author). Then we form a target B from AQ0C by considering some of its elements specified and fixed to the corresponding values given in AQ0C, and the rest of them unspecified and denoted by x's. Let us say that we have to fit

    B = [ x  x  x  -.9211
          x  x  x    x
          x  x  x  -.7706
          x  x  x  -.5564
          x  x  x    x
          x  x  x  -.6329
          x  x  x  -.5897 ]

by AQC for some unknown 5 × 3 orthonormal matrix Q. Solving ODE (5.5) on Q^T Q = I3, starting from a random initial orthonormal matrix Qin, we find that
    Qout = [ -.1578   .9831   .0590
              .0300  -.0335   .3509
             -.3921   .0037  -.7900
             -.0096   .0434   .3661
             -.9057  -.1745   .3395 ]
Table 4. Results for Data Generated by randn

                                                       Minimum             CPU Time
  Data                  Method                         mean      variance  mean      variance

  n=7, m=4, p=5,  q=3   Refined majorization           26.8046   131.2824  .1446     .0099
                        Majorization                   25.7152   109.8358  .4894     .1720
                        Projected gradient             23.1794   86.2939   .6894     .0223
                        Product of planar rotations    23.5038   78.9624   7.6978    119.1133

  n=7, m=4, p=6,  q=3   Refined majorization           18.4156   66.3144   .3116     .1664
                        Majorization                   18.3138   98.1012   .6902     .2352
                        Projected gradient             18.2555   63.3046   .6516     .0148

  n=7, m=4, p=7,  q=3   Refined majorization           13.4904   32.2519   .5140     .2083
                        Majorization                   13.2380   23.5104   .9722     .5702
                        Projected gradient             13.3174   26.8213   .7656     .0275

  n=7, m=4, p=10, q=3   Refined majorization           9.5870    22.1143   .4164     .0880
                        Majorization                   9.0801    2.5370    1.4120    .7686
                        Projected gradient             8.9235    2.7134    .9176     .0198

  n=7, m=4, p=20, q=3   Refined majorization           7.8754    18.2798   .2256     .0273
                        Majorization                   7.5919    15.6923   1.8748    2.9253
                        Projected gradient             7.3064    16.1990   2.0505    .1311

  NOTE: Refined majorization: Kiers and ten Berge (1992), Koschat and Swayne (1991). Majorization: Kiers (1990). Product of planar rotations: Mooijaart and Commandeur (1990).
Figure 4. A semi-log plot of E_V(Q(t)) and Ω(Q(t)) for Example 3(a).
solves (5.1)–(5.2) with error of the fit 1.7063 × 10^-4. One can check that

    AQout C = [  .0231  -.4695  -.4196  -.9211
                 .1719   .0267  -.1113  -.2171
                -.0244  -.4332  -.3538  -.7706
                -.0771  -.3796  -.2754  -.5564
                 .2096   .2225   .0603  -.0278
                 .1913  -.2119  -.3482  -.6330
                -.1339  -.2871  -.0698  -.5897 ].
The specified elements of the target are recovered perfectly in this case; that is, Qout is a global minimizer of (5.1)–(5.2). The test results are recorded in Figure 4.

(b). Next, consider randomly generated A, B, and C with sizes 7 × 5, 7 × 4, and 3 × 4, respectively. Form a target by considering the elements of B larger than .8 specified and fixed and
Figure 5. Semi-log plots of E_V(Q(t)) and Ω(Q(t)) for Example 3(b).
the rest of them unspecified; that is, consider the target

    B = [ .8484  .8965  .8419    x
          .8213    x      x      x
            x      x      x      x
            x      x      x      x
            x      x      x      x
          .9273  .9948    x      x
            x      x    .8860    x  ].
Solving (5.5) with a random initial 5 × 3 orthonormal matrix, we find

    Qout = [  .6896  -.2855   .0564
              .0263   .2751   .8877
              .3193  -.2627   .3643
              .6473   .4825  -.2759
             -.0528   .7355   .0010 ],
and correspondingly

    AQout C = [  .8490   .8964   .8417  1.1581
                 .8206   .6260   .9433   .9211
                 .7611   .9632   .8237  1.1761
                 .7075   .7685   .8258   .9883
                 .9542  1.0664  1.2658  1.3645
                 .9265   .9952  1.0780  1.2847
                1.0005   .9845   .8858  1.2996 ],
which gives error of the fit 1.0001 × 10^-4 to the target B. For another randomly generated triple A, B, and C with the same sizes, form a target in the same way as above; that is, consider the target

    B = [ .9495    0    .9473    0
            0      0      0    .8144
            0      0      0      0
            0      0      0    .8626
            0    .8773    0      0
            0    .9983    0      0
            0    .9223    0      0   ].

The solution of (5.5) with a random initial value is
    Qout = [  .2217   .6325  -.3833
              .0884   .5745   .0242
             -.4601   .3960   .7517
             -.0914   .3351  -.2429
              .8503   .0256   .4780 ],

and we compute that

    AQout C = [  .9755   .5546   .9031   .7363
                 .7606   .6229   .3733   .8188
                 .9168   .5281   .6975   .8034
                 .6668   .5067   .1403   .8552
                1.1653   .9513   .3081  1.4487
                1.2650   .7614   .7505  1.2647
                 .4875   .5638  -.0224   .7184 ],
which gives error of the fit .4393 to the target B. The test results are recorded in Figure 5. To investigate the sensitivity of the solution, we made 50 random runs and solved ODE (5.5) for each run with 20 different initial values Qin. In 42 of the 50 cases the sample variance of the error of the fit over all 20 starts was of order 10^-4 or less, which indicates that the minima obtained are practically insensitive to the starting values.
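The partially specified target can be handled numerically with a simple 0/1 mask. The sketch below is our reading of the criterion (5.1)–(5.2): only the specified entries of B contribute to the error of the fit, and the function name and data are hypothetical illustrations.

```python
import numpy as np

def fit_error(A, Q, C, B, specified):
    """Error of the fit to a partially specified target: the unspecified
    entries of B (the x's) are masked out and do not contribute."""
    residual = (A @ Q @ C - B) * specified   # boolean mask zeroes the x's
    return np.linalg.norm(residual)          # norm over specified entries

# Tiny illustration in the spirit of Example 3(b): specify the entries of B
# larger than .8, and evaluate the error at a random orthonormal Q
rng = np.random.default_rng(0)
A = rng.random((7, 5))
C = rng.random((3, 4))
B = rng.random((7, 4))
specified = B > .8                           # 0/1 mask of specified entries
Q = np.linalg.qr(rng.standard_normal((5, 3)))[0]
print(fit_error(A, Q, C, B, specified))
```

If B is replaced by AQC itself, the masked error vanishes, mirroring the perfect recovery of the specified elements seen in Example 3(a).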
7. CONCLUSION

By using the projected gradient idea, we are able to completely characterize the first order and the second order optimality conditions for the Penrose regression problem. Our results extend what is already known in the literature for the orthonormal Procrustes problem. Furthermore, our approach provides a natural new numerical method for solving these problems. From the numerical experiments it is clear that the projected gradient algorithm is generally slower in CPU time than the "refined" majorization algorithm, but it can be a useful alternative for some data. We should stress that the approach presented in this article is rather universal (both theoretically and numerically), and its application to data analytical problems leading to least squares optimization subject to constraints is straightforward. This was illustrated by solving the hitherto unsolved weighted Procrustes rotation problem to a partially specified target.
ACKNOWLEDGMENTS

The authors thank Henk Kiers, University of Groningen, for the MATLAB codes implementing the majorization (Kiers 1990) and refined majorization (Kiers and ten Berge 1992; Koschat and Swayne 1991) methods. We also thank Jaques Commandeur, Leiden University, for discussions while the second author worked on a MATLAB code realizing the Mooijaart and Commandeur algorithm. The authors thank the editor for his support and the anonymous reviewers for their competent and exhaustive comments, which clarified and strengthened the work. Ross Lippert was so kind as to improve "our" English.
[Received October 1998. Revised February 2001.]
REFERENCES

Browne, M. W. (1972), "Orthogonal Rotation to Partially Specified Target," British Journal of Mathematical and Statistical Psychology, 25, 115–120.
Chu, M. T. (1994), "A List of Matrix Flows With Applications," Fields Institute Communications, 3, 87–97.
Chu, M. T., and Driessel, K. R. (1990), "The Projected Gradient Method for Least Squares Matrix Approximations With Spectral Constraints," SIAM Journal on Numerical Analysis, 27, 1050–1060.
Chu, M. T., and Trendafilov, N. (1998), "On a Differential Equation Approach to the Weighted Orthogonal Procrustes Problem," Statistics & Computing, 8, 125–133.
Dieci, L., Russell, R. D., and Van Vleck, E. S. (1994), "Unitary Integrators and Applications to Continuous Orthonormalization Techniques," SIAM Journal on Numerical Analysis, 31, 261–281.
Diele, F., Lopez, L., and Peluso, R. (1998), "The Cayley Transform in the Numerical Solution of Unitary Differential Systems," Advances in Computational Mathematics, 8, 317–334.
Edelman, A., Arias, T., and Smith, S. T. (1999), "The Geometry of Algorithms With Orthogonality Constraints," SIAM Journal on Matrix Analysis and Applications, 20, 303–353.
Engø, K., Marthinsen, A., and Munthe-Kaas, H. (1997), "DiffMan—An Object Oriented MATLAB Toolbox for Solving Differential Equations on Manifolds" (User's Guide), http://www.math.ntnu.no/num/synode/.
Gear, C. W. (1986), "Maintaining Solution Invariants in the Numerical Solution of ODEs," SIAM Journal on Scientific and Statistical Computing, 7, 734–743.
Gill, P. E., Murray, W., and Wright, M. H. (1981), Practical Optimization, New York: Academic Press.
Golub, G. H., and Van Loan, C. F. (1991), Matrix Computations (2nd ed.), Baltimore: The Johns Hopkins University Press.
Gower, J. C. (1984), "Multivariate Analysis: Ordination, Multidimensional Scaling and Allied Topics," in Handbook of Applicable Mathematics, Vol. VI: Statistics, Part B, ed. Emlyn Lloyd, New York: Wiley.
Green, B. (1952), "The Orthogonal Approximation of an Oblique Structure in Factor Analysis," Psychometrika, 17, 429–444.
Helmke, U., and Moore, J. B. (1994), Optimization and Dynamical Systems, London: Springer-Verlag.
Kiers, H. A. L. (1990), "Majorization as a Tool for Optimizing a Class of Matrix Functions," Psychometrika, 55, 417–428.
Kiers, H. A. L., and ten Berge, J. M. F. (1992), "Minimization of a Class of Matrix Trace Functions by Means of Refined Majorization," Psychometrika, 57, 371–382.
Koschat, M. A., and Swayne, D. F. (1991), "A Weighted Procrustes Criterion," Psychometrika, 56, 229–239.
Mooijaart, A., and Commandeur, J. J. F. (1990), "A General Solution of the Weighted Orthonormal Procrustes Problem," Psychometrika, 55, 657–663.
Shampine, L. F., and Reichelt, M. W. (1997), "The MATLAB ODE Suite," SIAM Journal on Scientific Computing, 18, 1–22.
Stiefel, E. (1935–1936), "Richtungsfelder und Fernparallelismus in n-dimensionalen Mannigfaltigkeiten," Commentarii Mathematici Helvetici, 8, 305–353.
ten Berge, J. M. F. (1977a), "Optimizing Factorial Invariance," unpublished PhD thesis, Groningen.
ten Berge, J. M. F. (1977b), "Orthogonal Procrustes Rotation for Two or More Matrices," Psychometrika, 42, 267–276.
ten Berge, J. M. F., and Knol, D. L. (1984), "Orthogonal Rotations to Maximal Agreement for Two or More Matrices of Different Column Orders," Psychometrika, 49, 49–55.