Descent methods for optimization on Manifolds applied to Independent Component Analysis Elena Celledoni Department of Mathematical Sciences, NTNU, Norway
June, 2006
Outline

1. General format of the methods
   • Gradient methods
   • Univariate descent methods
   • MEC Learning
2. Statistical signal processing
   • Independent Component Analysis
   • Constraint optimization in ICA
3. Conclusions
PART 1
Optimization on manifolds

We consider optimization problems of the type

    min_{x ∈ M} φ(x),

or of the type

    max_{x ∈ M} φ(x),

where M is a differentiable manifold and φ : M → R is a cost function to be minimized or an objective function to be maximized.
General format for the methods

We consider iterative methods of the form

    x_{k+1} = ϕ_{x_k}(−α_k p_k),

where ϕ_{x_k} : T_{x_k}M → M is a retraction map, α_k a scalar, and p_k ∈ T_{x_k}M a search direction.

Retraction (Shub 86). For each fiber T_qM of TM we have:
• ϕ_q is defined in some open ball B(0, r_q) ⊂ T_qM of radius r_q about 0;
• ϕ_q(v) = q if and only if v = 0 ∈ T_qM;
• ϕ′_q(0) = Id_{T_qM}.
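To make the definition concrete, here is a minimal sketch (my illustration, not from the slides) of one standard retraction on the Stiefel manifold St(n, p): the Q-factor of a QR factorization. The tangent-space projection used to build a test vector is also an assumption of this illustration.

```python
import numpy as np

def qr_retraction(X, V):
    """Retract the tangent vector V at X onto St(n, p) via QR factorization.

    phi_X(V) = qf(X + V), where qf(.) is the Q-factor with positive diagonal
    of R.  Then phi_X(0) = X, and (it can be checked) the derivative of
    t -> phi_X(tV) at t = 0 is V, i.e. the retraction properties hold.
    """
    Q, R = np.linalg.qr(X + V)
    # Fix signs so the diagonal of R is positive (makes the Q-factor unique).
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs  # flips the sign of the corresponding columns

# Usage: a point on St(4, 2) and a tangent vector there.
rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((4, 2)))
A = rng.standard_normal((4, 2))
V = A - X @ (X.T @ A + A.T @ X) / 2   # project A onto T_X St(4, 2)
Y = qr_retraction(X, 0.1 * V)
print(np.allclose(Y.T @ Y, np.eye(2)))  # Y lies on the manifold again: True
```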
Optimizing via gradient flows

Let M be a Riemannian manifold with metric ⟨·, ·⟩. Given a smooth function φ : M → R, the equilibria of

    ẋ(t) = −grad φ(x(t))

are the critical points of φ.

grad φ is the Riemannian gradient:
• grad φ(x) ∈ T_x M
• φ′|_x(v) = ⟨grad φ(x), v⟩ for all v ∈ T_x M

Using a forward Euler timestepping with stepsize α_k for the ODE and advancing the solution using a retraction ϕ_{x_k}, one obtains

Gradient descent on manifolds

    x_{k+1} = ϕ_{x_k}(−α_k p_k),    p_k = grad φ(x_k).
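A minimal numerical sketch of this scheme (my own, not the slides' code) on the unit sphere M = S^{n−1}, minimizing φ(x) = ½ xᵀAx with the normalization retraction ϕ_x(v) = (x + v)/‖x + v‖; the iterates approach the eigenvector of the smallest eigenvalue:

```python
import numpy as np

def sphere_grad_descent(A, alpha=0.1, iters=500, seed=0):
    """Gradient descent on the unit sphere for phi(x) = 0.5 * x^T A x.

    Riemannian gradient: project the Euclidean gradient Ax onto T_x S^{n-1},
    grad phi(x) = Ax - (x^T A x) x.
    Retraction: normalize x + v back onto the sphere.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        g = A @ x - (x @ A @ x) * x        # p_k = grad phi(x_k)
        y = x - alpha * g                  # Euler step in the tangent space
        x = y / np.linalg.norm(y)          # retraction phi_{x_k}(-alpha_k p_k)
    return x

A = np.diag([4.0, 3.0, 1.0])
x = sphere_grad_descent(A)
print(x @ A @ x)  # approaches the smallest eigenvalue, 1.0
```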
Optimizing via gradient flows

U. Helmke & J.B. Moore, Optimization and Dynamical Systems, Springer-Verlag, 1994.

M.T. Chu & K.R. Driessel, The projected gradient method for least squares matrix approximations with spectral constraints, SIAM J. Num. Anal., 1990.

R.W. Brockett, Dynamical systems that sort lists, diagonalize matrices, and solve linear programming problems, Lin. Alg. and Appl., 1988.

S.T. Smith, Geometric Optimization methods for adaptive filtering, PhD Thesis, 1993.

S.I. Amari, Natural Gradient Works Efficiently in Learning, Neural Computation, 1998.

Y. Nishimori, Learning algorithm for ICA by geodesic flows on orthogonal group, Proc. IJCNN, 1999.
Homogeneous manifolds

Let M be a homogeneous manifold acted upon transitively by a Lie group G with Lie algebra g:

    Λ(g, q) = Λ_q(g) = g · q,    ψ : g → G,    ρ_q = (Λ_q ∘ ψ)′(0).

[Commutative diagram: ψ : g → G, Λ_q : G → M, ρ_q : g → T_qM, a_q : T_qM → g, ϕ_q : T_qM → M.]

If there exists a_q : T_qM → g s.t. ρ_q ∘ a_q = Id_{T_qM}, then

    ϕ_q(v) := ψ(a_q(v)) · q = (Λ_q ∘ ψ ∘ a_q)(v)

is a retraction. We can construct retractions using any coordinate map from g to G.
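An illustrative sketch of this construction (my own, under the slides' framework) for M = S^{n−1} acted on by G = SO(n), with coordinate map ψ = exp and the choice a_q(v) = v qᵀ − q vᵀ; one checks that ρ_q(Ω) = Ωq, so ρ_q(a_q(v)) = v whenever ‖q‖ = 1 and v ⊥ q:

```python
import numpy as np
from scipy.linalg import expm

def retraction_sphere(q, v):
    """Retraction on S^{n-1} built from the SO(n) action.

    a_q(v) = v q^T - q v^T lies in so(n);  phi_q(v) = exp(a_q(v)) @ q.
    Assumes ||q|| = 1 and v orthogonal to q (i.e. v in T_q S^{n-1}).
    """
    Omega = np.outer(v, q) - np.outer(q, v)  # a_q(v), skew-symmetric
    return expm(Omega) @ q                   # psi(a_q(v)) . q

q = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 0.5, 0.0])               # tangent vector: v . q = 0
y = retraction_sphere(q, v)
print(np.linalg.norm(y))  # stays on the sphere: 1.0
```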
On homogeneous manifolds

With a_q : T_qM → g s.t. ρ_q ∘ a_q = Id_{T_qM}, we have m_q := a_q(T_qM) ⊂ g.

• We can look for search directions in g (UVD methods)
• We can look for search directions in m_q (MEC Learning)
Univariate descent methods

Consider a basis E_1, E_2, …, E_d for g; take p_k = ρ_{x_k}(E_i) for some i ∈ {1, …, d}, and with ψ : g → G,

    x_{k+1} = ϕ_{x_k}(−α_k p_k) = ψ(−α_k E_i) · x_k,    α_k = argmin_{α ∈ R} φ(ψ(−α E_i) · x_k).

Choose i ∈ {1, …, d} cyclically, or at each iteration solve the optimization problem

    α_k E_i = argmin_{α ∈ R, j ∈ {1, …, d}} φ(ψ(−α E_j) · x_k).

(Celledoni and Fiori 2006)
Univariate descent methods on Stiefel manifolds

Assume M = St(n, p) = {X ∈ R^{n×p} s.t. X^T X = Id_p} and G = SO(n). Consider the basis of g = so(n)

    e_1 e_2^T − e_2 e_1^T, e_1 e_3^T − e_3 e_1^T, …, e_{n−1} e_n^T − e_n e_{n−1}^T,

take p_k = ρ_{x_k}(e_i e_j^T − e_j e_i^T) for 1 ≤ i < j ≤ n and ψ = exp, so that

    x_{k+1} = ϕ_{x_k}(−α_k p_k) = exp(−α_k (e_i e_j^T − e_j e_i^T)) · x_k.

The solution of the univariate optimization problem

    α_k = argmin_{α ∈ R} φ(exp(−α (e_i e_j^T − e_j e_i^T)) · x_k)

and the computation of the retraction are very simple due to the choice of basis: exp(−α (e_i e_j^T − e_j e_i^T)) is a Givens rotation acting only in the (i, j) coordinate plane.
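A small check (my sketch) that the matrix exponential of a basis element α(e_i e_jᵀ − e_j e_iᵀ) is just a plane rotation by α, so applying it to x_k only touches rows i and j:

```python
import numpy as np
from scipy.linalg import expm

def elementary_skew(n, i, j):
    """E = e_i e_j^T - e_j e_i^T, a basis element of so(n)."""
    E = np.zeros((n, n))
    E[i, j], E[j, i] = 1.0, -1.0
    return E

def givens(n, i, j, alpha):
    """Rotation by alpha in the (i, j) coordinate plane."""
    G = np.eye(n)
    G[i, i] = G[j, j] = np.cos(alpha)
    G[i, j], G[j, i] = np.sin(alpha), -np.sin(alpha)
    return G

n, i, j, alpha = 5, 1, 3, 0.7
print(np.allclose(expm(alpha * elementary_skew(n, i, j)),
                  givens(n, i, j, alpha)))  # True
```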
Univariate descent methods: example

Example (Computing eigenpairs). To compute p eigenpairs of a symmetric positive definite n × n matrix A, one considers the maximization or minimization of the function

    φ(X) = ½ trace(X^T A X)

on the Stiefel manifold St(n, p). In this case one finds that

    φ(α) := φ(exp(−α (e_i e_j^T − e_j e_i^T)) · x_k)
          = a sin²(α) + b cos²(α) + c sin(α) + d cos(α) + e sin(α) cos(α),

    φ′(α) = (a − b) sin(2α) + e cos(2α) + c cos(α) − d sin(α),

with the coefficients a, b, c, d, e computed using the columns i and j of A, x_k x_k^T, x_k and A x_k.
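A compact numerical sketch of cyclic UVD for this example (mine, not the slides' implementation). It exploits the structure above: along each rotation curve the cost is a trigonometric polynomial in {1, sin α, cos α, sin 2α, cos 2α}, whose coefficients are recovered here from 5 sample evaluations rather than from the closed-form a, …, e of the slides:

```python
import numpy as np

def uvd_eigen(A, p, sweeps=30, seed=0):
    """Cyclic univariate descent on St(n, p) maximizing phi(X) = 0.5 tr(X^T A X).

    Each step rotates X in one (i, j) coordinate plane by the best angle.
    The univariate cost is an exact trig polynomial; we fit its 5 coefficients
    from samples and minimize on a fine grid.
    """
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    X, _ = np.linalg.qr(rng.standard_normal((n, p)))

    def rotate(X, i, j, a):
        # Givens rotation by a in the (i, j) plane, applied to the rows of X.
        Y = X.copy()
        c, s = np.cos(a), np.sin(a)
        Y[i], Y[j] = c * X[i] + s * X[j], -s * X[i] + c * X[j]
        return Y

    basis = lambda a: np.stack([np.ones_like(a), np.sin(a), np.cos(a),
                                np.sin(2 * a), np.cos(2 * a)], axis=-1)
    samples = np.array([0.0, 0.9, 1.8, 2.7, -1.3])   # 5 generic angles
    grid = np.linspace(-np.pi, np.pi, 4001)

    for _ in range(sweeps):
        for i in range(n - 1):
            for j in range(i + 1, n):
                vals = [-0.5 * np.trace(rotate(X, i, j, a).T @ A
                                        @ rotate(X, i, j, a)) for a in samples]
                coef = np.linalg.solve(basis(samples), vals)   # fit trig poly
                alpha = grid[np.argmin(basis(grid) @ coef)]    # 1-D minimizer
                X = rotate(X, i, j, alpha)
    return X

A = np.diag([5.0, 4.0, 2.0, 1.0])
X = uvd_eigen(A, p=2)
print(round(np.trace(X.T @ A @ X), 2))  # -> 9.0 (= 5 + 4, two largest eigenvalues)
```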
Univariate descent vs gradient descent

A: n × n random symmetric matrix; p: number of computed eigenpairs.

    experiment       method     flops        error
    n = 10, p = 8    gradient   5023347      9.63 × 10^−7
    n = 10, p = 8    UVD        1105129      4.94 × 10^−7
    n = 20, p = 16   gradient   21176640     8.92 × 10^−7
    n = 20, p = 16   UVD        19755729     4.21 × 10^−7
    n = 30, p = 24   gradient   113127486    8.63 × 10^−7
    n = 30, p = 24   UVD        200926984    9.60 × 10^−7
    n = 30, p = 4    gradient   18023792     9.95 × 10^−7
    n = 30, p = 4    UVD        284915561    9.27 × 10^−7

UVD and gradient perform similarly when n ≈ p; gradient is better when n ≫ p.
Gradient vs UVD, n = 10, p = 3 eigenpairs

[Figure: objective function value (y-axis, about 4.5 to 7) versus iteration over the basis elements (x-axis, 0 to 450).]

The adaptation parameter of the gradient method must be chosen ad hoc to get convergence; in UVD there is no adaptation parameter.
Summarizing

• UVD is competitive with gradient methods when n ≈ p
• UVD does not depend on a choice of adaptation parameter
• UVD might do better than gradient for finding global minima/maxima
Optimizing via mechanical systems I - Stiefel manifolds

Consider S* = {[2m_i, w_i]}, a rigid system of n masses m_i with positions w_i (at unitary distance from the origin, on mutually orthogonal axes). The masses move in a viscous liquid. No translation.

    Ẇ = HW,
    Ḣ = ¼ ((F + P) W^T − W (F + P)^T),
    P = −μ H W,

where μ is the viscosity parameter, W the matrix of the positions, H the angular velocity matrix, P the matrix of the viscosity resistance, and F the active forces. W is on O(n) or on the Stiefel manifold.
Optimizing via mechanical systems II

The mechanical system is seen as an adapting rule for neural layers with weight matrix W. The active forces are

    F := −∂U/∂W,

with U a potential energy function. The equilibria of the mechanical system S* are at the local minima of U. Taking U = J_C, a cost function to be minimized, or U = −J_O, an objective function to be maximized, W(t) approaches the solution of the optimization problem as t → ∞.

S. Fiori, 'Mechanical' Neural Learning for Blind Source Separation, Electronics Letters, 1999.
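A forward-Euler sketch of this mechanical learning rule (my own illustration). The potential U(W) = −trace(Wᵀ A W N) is my choice of example, a Brockett-type function whose maximizers diagonalize A; the QR re-orthonormalization of W is a stabilization I add, since a plain Euler step drifts off the manifold:

```python
import numpy as np

def mechanical_learning(A, N, mu=5.0, dt=0.01, steps=20000, seed=0):
    """Euler integration of the 'mechanical' system on O(n).

    W: positions, H: angular velocity (skew-symmetric),
    P = -mu * H @ W: viscous resistance, F = -dU/dW: active force.
    With U(W) = -trace(W^T A W N), F = 2 A W N, and the damped flow tends
    to maximize trace(W^T A W N).
    """
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    W, _ = np.linalg.qr(rng.standard_normal((n, n)))
    H = np.zeros((n, n))
    for _ in range(steps):
        F = 2.0 * A @ W @ N                              # F = -dU/dW
        P = -mu * H @ W                                  # viscous drag
        H += dt * 0.25 * ((F + P) @ W.T - W @ (F + P).T)  # stays skew
        W, _ = np.linalg.qr(W + dt * H @ W)              # step + stabilize
    return W

A = np.diag([3.0, 2.0, 1.0])
N = np.diag([3.0, 2.0, 1.0])
W = mechanical_learning(A, N)
print(np.round(W.T @ A @ W, 2))  # near diag(3, 2, 1) once the flow settles
```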
Reformulation of the equations when n > p.
exponentials of elementary matrices
• There is no adaptation parameter
Pros and cons of the approaches

Pros MEC
• The algorithm can be formulated using the tangent space to the manifold
• This allows the use of retractions and of a minimal number of parameters to describe the problem
• There is a nice interpretation of MEC as a learning algorithm

Cons MEC
• How does it compare to gradient methods?
Thanks
Thanks. . . for your attention!