Mathematical derivations in the proximity scaling (PROXSCAL) of symmetric data matrices

Jacques J.F. Commandeur
Willem J. Heiser

Department of Data Theory, University of Leiden
RR-93-04


Contents

1  Introduction
2  Unrestricted solutions
3  A general solution for restricted matrices
4  Specific model restrictions in PROXSCAL
   4.1  Generalized Euclidean model
   4.2  Weighted Euclidean model
   4.3  Reduced rank model
   4.4  Identity model
5  Projecting the common space on a complicated subspace in the metric of a positive semidefinite matrix
6  Updating the common space when it is partially known
7  Restricting the common space to be a linear combination of external variables
8  Optimal scaling
9  Initialization of the PROXSCAL algorithms
10 Acceleration schemes
11 Normalized STRESS


1  Introduction

This report presents most of the algebra involved in the multidimensional scaling (MDS) of symmetric data matrices as performed by the computer program PROXSCAL (PROXimity SCALing), to be implemented in the software package SPSS. At the time of writing, the mathematical derivations underlying PROXSCAL are scattered over a number of papers, notably De Leeuw and Heiser (1980), Heiser (1985b), Heiser (1985a), Heiser and Stoop (1986), and Meulman and Heiser (1984), but also in unpublished material written by Heiser. The present report provides a comprehensive overview of all this algebra, filling in details that are sometimes only implicitly covered in other papers. Moreover, we intend to use the present report as a reference for more programmatical details concerning the calculations performed in PROXSCAL itself. The algebra involved in MDS of asymmetric and rectangular data matrices will be treated elsewhere.

Given the (dis)similarities $\delta_{ijk}$ between $n$ objects ($i, j = 1, \ldots, n$) for $m$ sources ($k = 1, \ldots, m$), PROXSCAL determines $m$ configurations $X_k$ of order $(n \times p)$, such that the Euclidean distances $d_{ij}(X_k)$ between the rows of the $X_k$'s (conceived of as $n$ points in $p$ dimensions) approximate the given (dis)similarities $\delta_{ijk}$ as well as possible for all $i, j = 1, \ldots, n$ and $k = 1, \ldots, m$. The formal problem PROXSCAL solves is the minimization of the least squares loss function

$$ f(X_1, \ldots, X_m) \equiv \frac{1}{m} \sum_{k=1}^{m} \sum_{i<j} w_{ijk}\,[\delta_{ijk} - d_{ij}(X_k)]^2 . \qquad (1.1) $$
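As a concrete illustration of (1.1), the following sketch evaluates the loss for given weights, dissimilarities and configurations; the function name and the array layout (full symmetric $(n \times n)$ arrays per source) are our own choices for this example, not part of PROXSCAL.

```python
import numpy as np

def stress(delta, weights, configs):
    """Evaluate loss (1.1): delta and weights have shape (m, n, n),
    configs has shape (m, n, p); only pairs with i < j contribute."""
    m, n, _ = delta.shape
    total = 0.0
    for k in range(m):
        X = configs[k]
        # Euclidean distances between the rows of X_k
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        iu = np.triu_indices(n, k=1)                  # pairs with i < j
        total += np.sum(weights[k][iu] * (delta[k][iu] - d[iu]) ** 2)
    return total / m
```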


7  Restricting the common space to be a linear combination of external variables

If $p > s$, that is, if there are more dimensions than external variables, the first $s$ columns of $Q$ consist of the $s$ quantified external variables, and the last $(p - s)$ columns of $Q$ are treated as unrestricted elements. Then, it is possible to directly minimize

$$ h(Q; B; *) = \frac{1}{m} \sum_{k=1}^{m} \operatorname{tr}\,(Q B A_k - \bar X_k)' V_k (Q B A_k - \bar X_k), \qquad (7.9) $$

where $\bar X_k = V_k^{-} B(X_k^0) X_k^0$.

We start by noting that the product $QB$ may be written as a sum of rank-one matrices. Defining $q_j$ ($j = 1, \ldots, h$) as the $(n \times 1)$ vector containing column $j$ of matrix $Q$, and $b_j$ as the $(p \times 1)$ vector containing row $j$ of matrix $B$, it is true that

$$ QB = \sum_{j=1}^{h} q_j b_j' , \qquad (7.10) $$
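A direct transcription of (7.9) may help fix the bookkeeping. This is a minimal sketch assuming the $A_k$, $V_k$ and $\bar X_k$ are supplied as lists of numpy arrays; the names are our own choices.

```python
import numpy as np

def h_loss(Q, B, A, V, Xbar):
    """Evaluate (7.9) for lists A, V, Xbar of length m containing
    A_k (p x p), V_k (n x n) and the targets Xbar_k (n x p)."""
    m = len(A)
    total = 0.0
    for Ak, Vk, Xk in zip(A, V, Xbar):
        R = Q @ B @ Ak - Xk            # residual (n x p)
        total += np.trace(R.T @ Vk @ R)
    return total / m
```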

Loss function (7.9) may then be written as

$$ h(Q; B; *) = \frac{1}{m} \sum_{k=1}^{m} \operatorname{tr}\,\Bigl( \sum_{j=1}^{h} q_j b_j' A_k - \bar X_k \Bigr)' V_k \Bigl( \sum_{j=1}^{h} q_j b_j' A_k - \bar X_k \Bigr). \qquad (7.11) $$

Considering only one $b_j$ and $q_j$, and letting

$$ U_j = \sum_{t \neq j} q_t b_t' , \qquad (7.12) $$

we may write (7.11) as

$$ \begin{aligned} h(q_j, b_j) &= \frac{1}{m} \sum_{k=1}^{m} \operatorname{tr}\,\bigl[ q_j b_j' A_k - (\bar X_k - U_j A_k) \bigr]' V_k \bigl[ q_j b_j' A_k - (\bar X_k - U_j A_k) \bigr] \\ &= c_j + \frac{1}{m} \sum_{k=1}^{m} q_j' V_k q_j\, b_j' A_k A_k' b_j - 2\, q_j' T_j b_j , \end{aligned} \qquad (7.13) $$

where

$$ T_j \equiv \frac{1}{m} \sum_{k=1}^{m} V_k (\bar X_k - U_j A_k) A_k' = \frac{1}{m} \sum_{k=1}^{m} B(X_k^0) X_k^0 A_k' - \frac{1}{m} \sum_{k=1}^{m} V_k U_j A_k A_k' , \qquad (7.14) $$

and $c_j$ is independent of $b_j$ and $q_j$. For fixed vector $q_j$, the global minimum of (7.13) is attained where

$$ b_j = \Bigl( \frac{1}{m} \sum_{k=1}^{m} q_j' V_k q_j\, A_k A_k' \Bigr)^{-1} T_j' q_j . \qquad (7.15) $$

For fixed $b_j$, (7.13) may be written as

$$ \begin{aligned} h(*, q_j) &= c_j + q_j' \Bigl( \frac{1}{m} \sum_{k=1}^{m} b_j' A_k A_k' b_j\, V_k \Bigr) q_j - 2\, q_j' T_j b_j \\ &= c_j + q_j' \bar V_j q_j - 2\, q_j' T_j b_j , \end{aligned} \qquad (7.16) $$

where

$$ \bar V_j = \frac{1}{m} \sum_{k=1}^{m} b_j' A_k A_k' b_j\, V_k . \qquad (7.17) $$

Defining $\bar q_j$ as a vector satisfying

$$ \bar V_j \bar q_j = T_j b_j , \qquad (7.18) $$

we may write (7.16) as

$$ h(*, q_j) = d_j + (q_j - \bar q_j)' \bar V_j (q_j - \bar q_j) , \qquad (7.19) $$

with $d_j$ a term independent of $q_j$. Applying the algebra of Section 5, (7.19) can be minimized by alternatingly computing

$$ \begin{aligned} \tilde q_j &= q_j^0 + \frac{1}{\phi_1} \bar V_j (\bar q_j - q_j^0) \\ &= \frac{1}{\phi_1} T_j b_j + \Bigl( I - \frac{1}{\phi_1} \bar V_j \Bigr) q_j^0 , \end{aligned} \qquad (7.20) $$

where $\phi_1$ is the largest eigenvalue of $\bar V_j$, and $q_j^0$ is the current solution for $q_j$ satisfying the constraints which correspond to the measurement level of external variable $j$, and then determining a new $q_j^+$ satisfying these same constraints by minimizing the function

$$ q(q_j) = (q_j - \tilde q_j)' (q_j - \tilde q_j) , \qquad (7.21) $$

which is generally achieved by performing a regression of $\tilde q_j$ on the original external variable $q_j$.
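The pair of updates (7.15) and (7.20) for a single variable $j$ can be sketched as follows; this is only an illustration, assuming $T_j$ and the lists of $A_k$ and $V_k$ are available as numpy arrays, and the function names are our own.

```python
import numpy as np

def update_bj(qj, Tj, A, V):
    """Update of the regression weights b_j, eq. (7.15)."""
    m = len(A)
    M = sum((qj @ Vk @ qj) * (Ak @ Ak.T) for Ak, Vk in zip(A, V)) / m  # p x p
    return np.linalg.solve(M, Tj.T @ qj)

def majorized_target(qj0, bj, Tj, A, V):
    """Majorizing update (7.20) giving the unconstrained target q~_j."""
    m = len(A)
    Vbar = sum((bj @ Ak @ Ak.T @ bj) * Vk for Ak, Vk in zip(A, V)) / m  # n x n, eq. (7.17)
    phi1 = np.linalg.eigvalsh(Vbar)[-1]            # largest eigenvalue
    return (Tj @ bj) / phi1 + qj0 - (Vbar @ qj0) / phi1
```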

Before specifically discussing how to obtain optimal quantifications for variable $q_j$ in (7.21), we first deal with the question whether and how quantifications of the external variables should be centered and normalized. In this context, it is important to remember that the original problem consists of the approximation of (dis)similarity data with distances. Because translations are distance preserving transformations, we have that

$$ d_{ij}(Z A_k) = d_{ij}(Q B A_k) = d_{ij}\Bigl( \sum_{j=1}^{h} q_j b_j' A_k \Bigr) = d_{ij}\Bigl( \sum_{j=1}^{h} [\,q_j + r_j \mathbf{1}\,]\, b_j' A_k \Bigr) \qquad (7.22) $$

for any arbitrary scalar $r_j$. It follows from (7.22) that it is immaterial whether and how we center variable $q_j$, since loss function (1.1) is unaffected by such a transformation. Moreover, we are free at any time to multiply $q_j$ with an arbitrary number $a_j$ as long as we apply the inverse transformation to $b_j$. This is true because

$$ \sum_{j=1}^{h} q_j b_j' = \sum_{j=1}^{h} (a_j q_j)(a_j^{-1} b_j)' = \sum_{j=1}^{h} \hat q_j \hat b_j' , \qquad (7.23) $$

with $\hat q_j = a_j q_j$ and $\hat b_j = a_j^{-1} b_j$. Therefore, this indeterminacy may be used to normalize either $q_j$ or $b_j$ on fixed length. In the following exposition we adopt the conventions $\mathbf{1}' q_j = 0$ and $q_j' q_j = n$ for $j = 1, \ldots, h$.
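The conventions $\mathbf{1}' q_j = 0$ and $q_j' q_j = n$, together with the compensating rescaling of $b_j$ from (7.23), amount to the following small routine (an illustrative sketch; the function name is ours):

```python
import numpy as np

def center_and_normalize(qj, bj):
    """Impose 1'q_j = 0 and q_j'q_j = n, compensating b_j as in (7.23)."""
    n = qj.shape[0]
    qj = qj - qj.mean()                 # centering leaves the distances intact, (7.22)
    aj = np.sqrt(n / (qj @ qj))         # scale factor giving q_j'q_j = n
    return aj * qj, bj / aj             # inverse scaling of b_j keeps q_j b_j' unchanged
```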

If $q_j$ in (7.21) refers to a numerical variable we require the update to be a linear transformation of the original variable subject to $\mathbf{1}' q_j = 0$ and $q_j' q_j = n$. If $q_j$ is complete, the latter constraints completely fix the linear transformation, meaning that the updating of the quantifications of a complete numerical variable is not required, and may be skipped. However, if the external variable is incomplete, then, letting $M_j$ denote the $(n \times n)$ diagonal matrix with a one on its diagonal where an element of $q_j$ is nonmissing and zeroes elsewhere, we have to solve

$$ q(a_j, r_j) = [(a_j q_j + r_j \mathbf{1}) - \tilde q_j]'\, M_j\, [(a_j q_j + r_j \mathbf{1}) - \tilde q_j] \qquad (7.24) $$

for the nonmissing part of $q_j$. In PROXSCAL, we adopt the option called missing data multiple in the Gifi system (cf., Gifi, 1990) for the missing part of the quantification vectors, which implies that we minimize

$$ q(q_j) = [q_j - \tilde q_j]'\, (I - M_j)\, [q_j - \tilde q_j] \qquad (7.25) $$

for the missing elements of $q_j$. The latter is achieved by simply setting $(I - M_j) q_j^+$ equal to $(I - M_j) \tilde q_j$. To solve (7.24), we compute

$$ a_j = \frac{(\mathbf{1}' M_j \mathbf{1})(q_j' M_j \tilde q_j) - (q_j' M_j \mathbf{1})(\tilde q_j' M_j \mathbf{1})}{(\mathbf{1}' M_j \mathbf{1})(q_j' M_j q_j) - (q_j' M_j \mathbf{1})^2} \qquad (7.26) $$

and

$$ r_j = \frac{(\tilde q_j - a_j q_j)' M_j \mathbf{1}}{\mathbf{1}' M_j \mathbf{1}} , \qquad (7.27) $$

and set $M_j q_j^+$ equal to $M_j (a_j q_j + r_j \mathbf{1})$. Finally, the resulting vector $q_j^+$ is centered and normalized, and care is taken to adapt the corresponding regression weights vector $b_j$ accordingly (see (7.23)).
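A sketch of the numerical-variable update, following (7.26), (7.27) and the missing data multiple rule; the representation of $M_j$ by a boolean vector `nonmissing` is our own convention.

```python
import numpy as np

def update_numerical(qj, qtilde, nonmissing):
    """Linear update (7.26)-(7.27) on the nonmissing part of q_j;
    missing entries are copied from q~_j ('missing data multiple')."""
    w = nonmissing.astype(float)                    # diagonal of M_j as a 0/1 vector
    sw = w.sum()
    num = sw * (qj * w) @ qtilde - (w @ qj) * (w @ qtilde)
    den = sw * (qj * w) @ qj - (w @ qj) ** 2
    aj = num / den                                  # slope, eq. (7.26)
    rj = w @ (qtilde - aj * qj) / sw                # intercept, eq. (7.27)
    return np.where(nonmissing, aj * qj + rj, qtilde)
```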

If $q_j$ is an ordinal variable, (7.21) is minimized by performing a monotone regression of $\tilde q_j$ on the original variable $q_j$. Moreover, if some elements of $q_j$ are missing, then the monotone regression is only applied to the nonmissing elements, while the missing elements of $q_j$ are simply replaced by the corresponding elements of $\tilde q_j$. Again, due to (7.22) the result may be centered on the origin, and due to (7.23) it may then be normalized on fixed length as long as the inverse transformation is applied to the corresponding regression weights.

If $q_j$ is a nominal variable we have to solve, for the nonmissing elements of the variable,

$$ q(y_j) = (G_j y_j - \tilde q_j)'\, M_j\, (G_j y_j - \tilde q_j) , \qquad (7.28) $$

where $G_j$ is the indicator matrix (cf., Gifi, 1990) of order $(n \times k_j)$, $k_j$ being the number of categories for external variable $j$, and the vector $y_j$ of order $(k_j \times 1)$ contains the $k_j$ distinct categories of variable $j$. Dropping missing rows and columns, and tacitly assuming that we deal with the resulting reduced vectors and matrices, the global minimum of (7.28) is obtained for

$$ \bar y_j = (G_j' G_j)^{-1} G_j' \tilde q_j . \qquad (7.29) $$

Then, $q_j = M_j G_j \bar y_j$ yields an update for the nonmissing elements of nominal variable $j$, while the missing elements of $q_j$ are simply replaced by the corresponding elements of $\tilde q_j$. The result is centered on the origin, and normalized on fixed length, and the inverse transformation is applied to the corresponding regression weights (see (7.23)).
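For the ordinal and nominal cases the updates can be sketched as below; the pool-adjacent-violators routine is a generic monotone (isotonic) regression, not PROXSCAL's own implementation, and ties in $q_j$ are not given the special treatment a full implementation would need.

```python
import numpy as np

def monotone_regression(y):
    """Pool-adjacent-violators: least-squares nondecreasing fit to y,
    where y is already arranged in the order of the original variable."""
    vals, sizes = [], []
    for v in y:
        vals.append(float(v)); sizes.append(1)
        # merge blocks while the nondecreasing constraint is violated
        while len(vals) > 1 and vals[-2] > vals[-1]:
            total = vals[-2] * sizes[-2] + vals[-1] * sizes[-1]
            sizes[-2] += sizes[-1]
            vals[-2] = total / sizes[-2]
            vals.pop(); sizes.pop()
    return np.repeat(vals, sizes)

def update_ordinal(qj, qtilde, nonmissing):
    """Ordinal update: monotone regression of q~_j on q_j (nonmissing part);
    missing entries are copied from q~_j."""
    qplus = qtilde.copy()
    obs = np.where(nonmissing)[0]
    order = obs[np.argsort(qj[obs], kind="stable")]   # sort by the original variable
    qplus[order] = monotone_regression(qtilde[order])
    return qplus

def update_nominal(categories, qtilde, nonmissing):
    """Nominal update, eq. (7.29): category means of q~_j over nonmissing cases;
    missing entries are copied from q~_j."""
    qplus = qtilde.copy()
    for c in np.unique(categories[nonmissing]):
        idx = nonmissing & (categories == c)
        qplus[idx] = qtilde[idx].mean()               # (G'G)^{-1} G' q~_j
    return qplus
```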

If $p > s$, that is, if there are more dimensions than external variables, then the last $(p - s)$ columns of $Q$ are treated as unrestricted vectors. Thus, for $j = s + 1, \ldots, p$ an update for $q_j$ is obtained by setting it equal to $\tilde q_j$ defined in (7.20), and then normalizing it such that $q_j' q_j = n$. In all cases, in the minimization of (7.19) an update for $q_j$ may be calculated just once, or we may repeatedly alternate over (7.20) and (7.21) until some convergence criterion is met. The same applies to the calculation of updates for $b_j$ and $q_j$ for one variable $j$: in the minimization of (7.13) we may either stop after one iteration or continue to alternate until no further improvement is obtained. Once all vectors $q_j$ and $b_j$ have been updated for $j = 1, \ldots, h$, we may stop and update the common space with

$$ Z^+ = \sum_{j=1}^{h} q_j b_j' , \qquad (7.30) $$

or, again, repeat the whole minimization procedure for (7.11) until some convergence criterion is reached.

Summarizing, in the GENERALIZED and REDUCED rank models the following convergent algorithm may be used to restrict the common space to be a linear combination of (possibly incomplete) external variables of mixed measurement levels (a schematic sketch in code follows the list):

1. compute $S = \frac{1}{m} \sum_{k=1}^{m} B(X_k^0) X_k^0 A_k'$;

2. for $j = 1, \ldots, h$, perform the following calculations:

   2.1. compute $U_j = \sum_{t \neq j} q_t^0 b_t^{0\prime}$ and $T_j = S - \frac{1}{m} \sum_{k=1}^{m} V_k U_j A_k A_k'$;

   2.2. compute $b_j^+ = \bigl( \frac{1}{m} \sum_{k=1}^{m} q_j^{0\prime} V_k q_j^0\, A_k A_k' \bigr)^{-1} T_j' q_j^0$ and set $b_j^0 = b_j^+$;

   2.3. if $q_j$ is a numerical variable and complete: skip this step; else: compute $\bar V_j = \frac{1}{m} \sum_{k=1}^{m} b_j^{0\prime} A_k A_k' b_j^0\, V_k$, and compute the largest eigenvalue $\phi_1$ of $\bar V_j$ (or an estimate thereof); calculate $\tilde q_j = \frac{1}{\phi_1} T_j b_j^0 + (I - \frac{1}{\phi_1} \bar V_j) q_j^0$;

        2.3.1. if $q_j$ is a numerical variable: compute the direction cosine $a_j$ and the intercept $r_j$ in the regression of $\tilde q_j$ on the original variable $q_j$ according to (7.26) and (7.27); set the nonmissing elements of the update $q_j^+$ equal to $(a_j q_j + r_j \mathbf{1})$, and the missing elements equal to the corresponding elements of $\tilde q_j$;

        2.3.2. if $q_j$ is an ordinal variable: for the nonmissing elements, perform a monotone regression of $\tilde q_j$ on the original variable $q_j$, and store the result in the corresponding elements of $q_j^+$; set the missing elements equal to the corresponding elements of $\tilde q_j$;

        2.3.3. if $q_j$ is a nominal variable: for the nonmissing elements of the variable calculate $G_j (G_j' G_j)^{-1} G_j' \tilde q_j$, and store the result in the corresponding elements of $q_j^+$; set the missing elements equal to the corresponding elements of $\tilde q_j$;

        2.3.4. if $q_j$ is free (i.e., $j > s$): treat the vector as completely missing, i.e., store all elements of $\tilde q_j$ in the update $q_j^+$ as is;

        In all cases, center $q_j^+$ on the origin, and then normalize $q_j^+$ on length $n$; apply the inverse normalization to the regression weight vector $b_j^0$; set $q_j^0 = q_j^+$;

3. calculate the restricted update $Z^+ = \sum_{j=1}^{h} q_j^0 b_j^{0\prime}$, evaluate (4.1), and go to step 2 if the difference in function value between the current and the previous iteration is larger than some predefined criterion; otherwise stop.
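To fix ideas, the following skeleton shows one way steps 1-3 could be organized in code. It is a schematic sketch only: `level_updates[j]` stands for the measurement-level restriction of steps 2.3.1-2.3.4, `evaluate_loss` stands in for the evaluation of (4.1), and the shortcut for complete numerical variables in step 2.3 is omitted; all names are ours.

```python
import numpy as np

def restrict_common_space(S, A, V, Q, B, level_updates, evaluate_loss,
                          max_iter=100, eps=1e-8):
    """Schematic version of steps 1-3: alternate over the columns q_j of Q
    and the rows b_j of B until the loss stops improving.
    S is the matrix of step 1, assumed precomputed for the fixed X_k^0."""
    m, n, h = len(A), Q.shape[0], Q.shape[1]
    previous = np.inf
    for _ in range(max_iter):
        for j in range(h):
            Uj = Q @ B - np.outer(Q[:, j], B[j])                 # step 2.1
            Tj = S - sum(Vk @ Uj @ Ak @ Ak.T for Ak, Vk in zip(A, V)) / m
            lhs = sum((Q[:, j] @ Vk @ Q[:, j]) * (Ak @ Ak.T)
                      for Ak, Vk in zip(A, V)) / m
            B[j] = np.linalg.solve(lhs, Tj.T @ Q[:, j])          # step 2.2, eq. (7.15)
            Vbar = sum((B[j] @ Ak @ Ak.T @ B[j]) * Vk
                       for Ak, Vk in zip(A, V)) / m              # step 2.3, eq. (7.17)
            phi1 = np.linalg.eigvalsh(Vbar)[-1]
            qtilde = (Tj @ B[j]) / phi1 + Q[:, j] - (Vbar @ Q[:, j]) / phi1
            qplus = level_updates[j](qtilde)                     # steps 2.3.1-2.3.4
            qplus = qplus - qplus.mean()                         # center on the origin
            scale = np.sqrt(n / (qplus @ qplus))                 # normalize on length n
            Q[:, j], B[j] = scale * qplus, B[j] / scale
        current = evaluate_loss(Q @ B)                           # step 3
        if previous - current <= eps:
            break
        previous = current
    return Q, B
```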

We have the following special cases. If the weight matrices are equal to each other in the GENERALIZED model, and the identification condition $\frac{1}{m} \sum_{k=1}^{m} A_k A_k' = I$ is used, then formula (7.15) for updating the regression weights simplifies into

$$ b_j = \frac{1}{q_j' V q_j}\, T_j' q_j , \qquad (7.31) $$

with

$$ T_j = \frac{1}{m} \sum_{k=1}^{m} B(X_k^0) X_k^0 A_k' - V U_j \qquad (7.32) $$

and $U_j$ as defined in (7.12). At the same time, matrix $\bar V_j$ defined in (7.17) can then be written as

$$ \bar V_j = b_j' b_j\, V . \qquad (7.33) $$

Since the largest eigenvalue $\phi_1$ of matrix $\bar V_j$ in (7.33) satisfies $\phi_1 = \phi_2\, b_j' b_j$, with $\phi_2$ the largest eigenvalue of matrix $V$, the unrestricted update (7.20) for $q_j$ in (7.19) can be written as

$$ \begin{aligned} \tilde q_j = \frac{1}{\phi_1} T_j b_j + \Bigl( I - \frac{1}{\phi_1} \bar V_j \Bigr) q_j^0 &= \frac{1}{\phi_2\, b_j' b_j} T_j b_j + \Bigl( I - \frac{1}{\phi_2\, b_j' b_j}\, b_j' b_j\, V \Bigr) q_j^0 \\ &= \frac{1}{\phi_2\, b_j' b_j} T_j b_j + \Bigl( I - \frac{1}{\phi_2} V \Bigr) q_j^0 , \end{aligned} \qquad (7.34) $$

with $T_j$ as defined in (7.32). Thus in this case $\phi_2$, the largest eigenvalue of $V$, is identical for all variables $q_j$ and only has to be computed once. When applying identification condition (4.5) to the space weight matrices $A_k$ in the external case, care must be taken to adjust the regression weight matrix $B$ in $Z = QB$ accordingly. If $w_{ijk} = 1$ for all $i$, $j$, and $k$, then, since $V = nJ$ in this case, and still assuming that $\frac{1}{m} \sum_{k=1}^{m} A_k A_k' = I$, updating formula (7.31) changes into

$$ \begin{aligned} b_j &= \frac{1}{q_j' (nJ) q_j} \Bigl( \frac{1}{m} \sum_{k=1}^{m} B(X_k^0) X_k^0 A_k' - (nJ) U_j \Bigr)' q_j \\ &= \frac{1}{q_j' q_j} \Bigl[ \frac{1}{nm} \sum_{k=1}^{m} B(X_k^0) X_k^0 A_k' - U_j \Bigr]' q_j , \end{aligned} \qquad (7.35) $$

on the condition that $\mathbf{1}' q_j = 0$ for $j = 1, \ldots, h$. The updating of the quantifications of the external variables is much simplified in the unweighted case. Because (7.33) may now be written as

$$ \bar V_j = n\, b_j' b_j\, J , \qquad (7.36) $$

with $J$ the centering matrix, and since $\phi_1$, the largest eigenvalue of (7.36), equals $n\, b_j' b_j$, we now have for the unrestricted update (7.20) that

$$ \tilde q_j = \frac{1}{n\, b_j' b_j} T_j b_j + \Bigl( I - \frac{1}{n\, b_j' b_j}\, n\, b_j' b_j\, J \Bigr) q_j^0 = \frac{1}{n\, b_j' b_j} T_j b_j , \qquad (7.37) $$

with

$$ T_j = \frac{1}{m} \sum_{k=1}^{m} B(X_k^0) X_k^0 A_k' - n U_j . \qquad (7.38) $$

This means that (7.21) becomes

$$ q(q_j) = \Bigl( q_j - \frac{1}{n\, b_j' b_j} T_j b_j \Bigr)' \Bigl( q_j - \frac{1}{n\, b_j' b_j} T_j b_j \Bigr) , \qquad (7.39) $$

and shows that (7.19) can now be solved analytically in all cases. In the simplest situation (one source, $w_{ij} = 1$ for all $i, j$), and assuming that the external variables are centered on the origin, we have that

$$ b_j = \frac{1}{q_j' q_j} \Bigl( \frac{1}{n} B(X_k^0) X_k^0 - U_j \Bigr)' q_j , \qquad (7.40) $$

and, in (7.39),

$$ T_j = B(X_k^0) X_k^0 - n U_j . \qquad (7.41) $$

Again, no majorization is needed to obtain updates for the vectors $q_j$ in this situation.
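In this simplest case the whole update is closed form; a sketch, with `BX` standing for the product $B(X^0)X^0$ and all vectors assumed centered (names are ours):

```python
import numpy as np

def simple_updates(qj, BX, Uj):
    """Closed-form updates for one source with unit weights:
    b_j from (7.40), then the target q~_j from (7.37)-(7.38)."""
    n = qj.shape[0]
    bj = (BX / n - Uj).T @ qj / (qj @ qj)      # eq. (7.40)
    Tj = BX - n * Uj                           # eq. (7.41), i.e. (7.38) with m = 1
    qtilde = Tj @ bj / (n * (bj @ bj))         # eq. (7.37)
    return bj, qtilde
```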


8  Optimal scaling

In the case where the (dis)similarities are unique up to a transformation, the algorithm requires an additional step. There are two possibilities. Either the (dis)similarities may only be compared within each source (MATRIX CONDITIONAL), or they may all be compared with each other (UNCONDITIONAL). In the first case the general loss function has to be generalized to

$$ f(\phi_1, \ldots, \phi_m; X_1, \ldots, X_m) \equiv \frac{1}{m} \sum_{k=1}^{m} \sum_{i<j} w_{ijk}\, [\phi_k(\delta_{ijk}) - d_{ij}(X_k)]^2 , $$
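As an illustration of the matrix-conditional loss, the sketch below evaluates it for given per-source transformations $\phi_k$, passed in as arbitrary callables (for instance the result of a monotone regression within source $k$); names and array layout are ours.

```python
import numpy as np

def conditional_stress(delta, weights, configs, phi):
    """Matrix-conditional loss: source k uses its own transformation phi[k]
    of the dissimilarities within that source."""
    m, n, _ = delta.shape
    total = 0.0
    for k in range(m):
        X = configs[k]
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        iu = np.triu_indices(n, k=1)
        dhat = phi[k](delta[k][iu])          # per-source transformed dissimilarities
        total += np.sum(weights[k][iu] * (dhat - d[iu]) ** 2)
    return total / m
```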