Descent methods for optimization on Manifolds applied to Independent Component Analysis Elena Celledoni Department of Mathematical Sciences, NTNU, Norway
June, 2006
Outline

1. General format of the methods
   • Gradient methods
   • Univariate descent methods
   • MEC Learning
2. Statistical signal processing
   • Independent Component Analysis
   • Constraint optimization in ICA
3. Conclusions
PART 1
Optimization on manifolds

We consider optimization problems of the type

    min_{x ∈ M} φ(x),

or of the type

    max_{x ∈ M} φ(x),

where M is a differentiable manifold and φ : M → R is a cost function to be minimized or an objective function to be maximized.
General format for the methods

We consider iterative methods of the form

    x_{k+1} = ϕ_{x_k}(−α_k p_k),

where ϕ_{x_k} : T_{x_k}M → M is a retraction map, α_k a scalar, and p_k ∈ T_{x_k}M a search direction.

Retraction (Shub 86). For each fiber T_qM of TM we have:
• ϕ_q is defined in some open ball B(0, r_q) ⊂ T_qM of radius r_q about 0;
• ϕ_q(v) = q if and only if v = 0 ∈ T_qM;
• ϕ′_q(0) = Id_{T_qM}.
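To make the definition concrete, here is a minimal sketch (my illustration, not from the slides) of one standard retraction on the Stiefel manifold St(n, p): the Q-factor of a QR factorization. The tangent-space projection used to build a test vector is also an assumption of this illustration.

```python
import numpy as np

def qr_retraction(X, V):
    """Retract the tangent vector V at X onto St(n, p) via QR factorization.

    phi_X(V) = qf(X + V), where qf(.) is the Q-factor with positive diagonal
    of R.  Then phi_X(0) = X, and (it can be checked) the derivative of
    t -> phi_X(tV) at t = 0 is V, i.e. the retraction properties hold.
    """
    Q, R = np.linalg.qr(X + V)
    # Fix signs so the diagonal of R is positive (makes the Q-factor unique).
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs  # flips the sign of the corresponding columns

# Usage: a point on St(4, 2) and a tangent vector there.
rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((4, 2)))
A = rng.standard_normal((4, 2))
V = A - X @ (X.T @ A + A.T @ X) / 2   # project A onto T_X St(4, 2)
Y = qr_retraction(X, 0.1 * V)
print(np.allclose(Y.T @ Y, np.eye(2)))  # Y lies on the manifold again: True
```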
Optimizing via gradient flows

Let M be a Riemannian manifold with metric ⟨·, ·⟩. Given a smooth function φ : M → R, the equilibria of

    ẋ(t) = −grad φ(x(t))

are the critical points of φ.

grad φ is the Riemannian gradient:
• grad φ(x) ∈ T_x M
• φ′|_x(v) = ⟨grad φ(x), v⟩ for all v ∈ T_x M

Using a forward Euler timestepping with stepsize α_k for the ODE and advancing the solution using a retraction ϕ_{x_k}, one obtains

Gradient descent on manifolds

    x_{k+1} = ϕ_{x_k}(−α_k p_k),    p_k = grad φ(x_k).
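A minimal numerical sketch of this scheme (my own, not the slides' code) on the unit sphere M = S^{n−1}, minimizing φ(x) = ½ xᵀAx with the normalization retraction ϕ_x(v) = (x + v)/‖x + v‖; the iterates approach the eigenvector of the smallest eigenvalue:

```python
import numpy as np

def sphere_grad_descent(A, alpha=0.1, iters=500, seed=0):
    """Gradient descent on the unit sphere for phi(x) = 0.5 * x^T A x.

    Riemannian gradient: project the Euclidean gradient Ax onto T_x S^{n-1},
    grad phi(x) = Ax - (x^T A x) x.
    Retraction: normalize x + v back onto the sphere.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        g = A @ x - (x @ A @ x) * x        # p_k = grad phi(x_k)
        y = x - alpha * g                  # Euler step in the tangent space
        x = y / np.linalg.norm(y)          # retraction phi_{x_k}(-alpha_k p_k)
    return x

A = np.diag([4.0, 3.0, 1.0])
x = sphere_grad_descent(A)
print(x @ A @ x)  # approaches the smallest eigenvalue, 1.0
```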
Optimizing via gradient flows

U. Helmke & J.B. Moore, Optimization and Dynamical Systems, Springer-Verlag, 1994.

M.T. Chu & K.R. Driessel, The projected gradient method for least squares matrix approximations with spectral constraints, SIAM J. Num. Anal., 1990.

R.W. Brockett, Dynamical systems that sort lists, diagonalize matrices, and solve linear programming problems, Lin. Alg. and Appl., 1988.

S.T. Smith, Geometric Optimization methods for adaptive filtering, PhD Thesis, 1993.

S.I. Amari, Natural Gradient Works Efficiently in Learning, Neural Computation, 1998.

Y. Nishimori, Learning algorithm for ICA by geodesic flows on orthogonal group, Proc. IJCNN, 1999.
Homogeneous manifolds

Let M be a homogeneous manifold acted upon transitively by a Lie group G with Lie algebra g:

    Λ(g, q) = Λ_q(g) = g · q,    ψ : g → G,    ρ_q = (Λ_q ∘ ψ)′(0).

[Commutative diagram: ψ : g → G, Λ_q : G → M, ρ_q : g → T_qM, a_q : T_qM → g, ϕ_q : T_qM → M.]

If there exists a_q : T_qM → g s.t. ρ_q ∘ a_q = Id_{T_qM}, then

    ϕ_q(v) := ψ(a_q(v)) · q = (Λ_q ∘ ψ ∘ a_q)(v)

is a retraction. We can construct retractions using any coordinate map from g to G.
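An illustrative sketch of this construction (my own, under the slides' framework) for M = S^{n−1} acted on by G = SO(n), with coordinate map ψ = exp and the choice a_q(v) = v qᵀ − q vᵀ; one checks that ρ_q(Ω) = Ωq, so ρ_q(a_q(v)) = v whenever ‖q‖ = 1 and v ⊥ q:

```python
import numpy as np
from scipy.linalg import expm

def retraction_sphere(q, v):
    """Retraction on S^{n-1} built from the SO(n) action.

    a_q(v) = v q^T - q v^T lies in so(n);  phi_q(v) = exp(a_q(v)) @ q.
    Assumes ||q|| = 1 and v orthogonal to q (i.e. v in T_q S^{n-1}).
    """
    Omega = np.outer(v, q) - np.outer(q, v)  # a_q(v), skew-symmetric
    return expm(Omega) @ q                   # psi(a_q(v)) . q

q = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 0.5, 0.0])               # tangent vector: v . q = 0
y = retraction_sphere(q, v)
print(np.linalg.norm(y))  # stays on the sphere: 1.0
```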
On homogeneous manifolds

With a_q : T_qM → g s.t. ρ_q ∘ a_q = Id_{T_qM}, we have m_q := a_q(T_qM) ⊂ g.

• We can look for search directions in g (UVD methods)
• We can look for search directions in m_q (MEC Learning)
Univariate descent methods

Consider a basis E_1, E_2, …, E_d for g; take p_k = ρ_{x_k}(E_i) for some i ∈ {1, …, d}, and with ψ : g → G,

    x_{k+1} = ϕ_{x_k}(−α_k p_k) = ψ(−α_k E_i) · x_k,    α_k = argmin_{α ∈ R} φ(ψ(−α E_i) · x_k).

Choose i ∈ {1, …, d} cyclically, or at each iteration solve the optimization problem

    α_k E_i = argmin_{α ∈ R, j ∈ {1, …, d}} φ(ψ(−α E_j) · x_k).

(Celledoni and Fiori 2006)
Univariate descent methods on Stiefel manifolds

Assume M = St(n, p) = {X ∈ R^{n×p} s.t. X^T X = Id_p} and G = SO(n). Consider the basis of g = so(n)

    e_1 e_2^T − e_2 e_1^T, e_1 e_3^T − e_3 e_1^T, …, e_{n−1} e_n^T − e_n e_{n−1}^T,

take p_k = ρ_{x_k}(e_i e_j^T − e_j e_i^T) for 1 ≤ i < j ≤ n and ψ = exp, so that

    x_{k+1} = ϕ_{x_k}(−α_k p_k) = exp(−α_k (e_i e_j^T − e_j e_i^T)) · x_k.

The solution of the univariate optimization problem

    α_k = argmin_{α ∈ R} φ(exp(−α (e_i e_j^T − e_j e_i^T)) · x_k)

and the computation of the retraction are very simple due to the choice of basis: exp(−α (e_i e_j^T − e_j e_i^T)) is a Givens rotation acting only in the (i, j) coordinate plane.
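A small check (my sketch) that the matrix exponential of a basis element α(e_i e_jᵀ − e_j e_iᵀ) is just a plane rotation by α, so applying it to x_k only touches rows i and j:

```python
import numpy as np
from scipy.linalg import expm

def elementary_skew(n, i, j):
    """E = e_i e_j^T - e_j e_i^T, a basis element of so(n)."""
    E = np.zeros((n, n))
    E[i, j], E[j, i] = 1.0, -1.0
    return E

def givens(n, i, j, alpha):
    """Rotation by alpha in the (i, j) coordinate plane."""
    G = np.eye(n)
    G[i, i] = G[j, j] = np.cos(alpha)
    G[i, j], G[j, i] = np.sin(alpha), -np.sin(alpha)
    return G

n, i, j, alpha = 5, 1, 3, 0.7
print(np.allclose(expm(alpha * elementary_skew(n, i, j)),
                  givens(n, i, j, alpha)))  # True
```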
Univariate descent methods: example

Example (Computing eigenpairs). To compute p eigenpairs of a symmetric positive definite n × n matrix A, one considers the maximization or minimization of the function

    φ(X) = ½ trace(X^T A X)

on the Stiefel manifold St(n, p). In this case one finds that

    φ(α) := φ(exp(−α (e_i e_j^T − e_j e_i^T)) · x_k)
          = a sin²(α) + b cos²(α) + c sin(α) + d cos(α) + e sin(α) cos(α),

    φ′(α) = (a − b) sin(2α) + e cos(2α) + c cos(α) − d sin(α),

with the coefficients a, b, c, d, e computed using the columns i and j of A, x_k x_k^T, x_k and A x_k.
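A compact numerical sketch of cyclic UVD for this example (mine, not the slides' implementation). It exploits the structure above: along each rotation curve the cost is a trigonometric polynomial in {1, sin α, cos α, sin 2α, cos 2α}, whose coefficients are recovered here from 5 sample evaluations rather than from the closed-form a, …, e of the slides:

```python
import numpy as np

def uvd_eigen(A, p, sweeps=30, seed=0):
    """Cyclic univariate descent on St(n, p) maximizing phi(X) = 0.5 tr(X^T A X).

    Each step rotates X in one (i, j) coordinate plane by the best angle.
    The univariate cost is an exact trig polynomial; we fit its 5 coefficients
    from samples and minimize on a fine grid.
    """
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    X, _ = np.linalg.qr(rng.standard_normal((n, p)))

    def rotate(X, i, j, a):
        # Givens rotation by a in the (i, j) plane, applied to the rows of X.
        Y = X.copy()
        c, s = np.cos(a), np.sin(a)
        Y[i], Y[j] = c * X[i] + s * X[j], -s * X[i] + c * X[j]
        return Y

    basis = lambda a: np.stack([np.ones_like(a), np.sin(a), np.cos(a),
                                np.sin(2 * a), np.cos(2 * a)], axis=-1)
    samples = np.array([0.0, 0.9, 1.8, 2.7, -1.3])   # 5 generic angles
    grid = np.linspace(-np.pi, np.pi, 4001)

    for _ in range(sweeps):
        for i in range(n - 1):
            for j in range(i + 1, n):
                vals = [-0.5 * np.trace(rotate(X, i, j, a).T @ A
                                        @ rotate(X, i, j, a)) for a in samples]
                coef = np.linalg.solve(basis(samples), vals)   # fit trig poly
                alpha = grid[np.argmin(basis(grid) @ coef)]    # 1-D minimizer
                X = rotate(X, i, j, alpha)
    return X

A = np.diag([5.0, 4.0, 2.0, 1.0])
X = uvd_eigen(A, p=2)
print(round(np.trace(X.T @ A @ X), 2))  # -> 9.0 (= 5 + 4, two largest eigenvalues)
```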
Univariate descent vs gradient descent

A: n × n random symmetric matrix; p: number of computed eigenpairs.

    experiment       method     flops        error
    n = 10, p = 8    gradient   5023347      9.63 × 10^−7
    n = 10, p = 8    UVD        1105129      4.94 × 10^−7
    n = 20, p = 16   gradient   21176640     8.92 × 10^−7
    n = 20, p = 16   UVD        19755729     4.21 × 10^−7
    n = 30, p = 24   gradient   113127486    8.63 × 10^−7
    n = 30, p = 24   UVD        200926984    9.60 × 10^−7
    n = 30, p = 4    gradient   18023792     9.95 × 10^−7
    n = 30, p = 4    UVD        284915561    9.27 × 10^−7

UVD and gradient perform similarly when n ≈ p; gradient is better when n ≫ p.
Gradient vs UVD, n = 10, p = 3 eigenpairs

[Figure: objective function value (y-axis, about 4.5 to 7) versus iteration over the basis elements (x-axis, 0 to 450).]

The adaptation parameter of the gradient method must be chosen ad hoc to get convergence; in UVD there is no adaptation parameter.
Summarizing

• UVD is competitive with gradient methods when n ≈ p
• UVD does not depend on a choice of adaptation parameter
• UVD might do better than gradient for finding global minima/maxima
Optimizing via mechanical systems I - Stiefel manifolds

Consider S* = {[2m_i, w_i]}, a rigid system of n masses m_i with positions w_i (at unitary distance from the origin, on mutually orthogonal axes). The masses move in a viscous liquid. No translation.

    Ẇ = HW,
    Ḣ = ¼ ((F + P) W^T − W (F + P)^T),
    P = −μ H W,

where μ is the viscosity parameter, W the matrix of the positions, H the angular velocity matrix, P the matrix of the viscosity resistance, and F the active forces. W is on O(n) or on the Stiefel manifold.
Optimizing via mechanical systems II

The mechanical system is seen as an adapting rule for neural layers with weight matrix W. The active forces are

    F := −∂U/∂W,

with U a potential energy function. The equilibria of the mechanical system S* are at the local minima of U. Taking U = J_C, a cost function to be minimized, or U = −J_O, an objective function to be maximized, W(t) approaches the solution of the optimization problem as t → ∞.

S. Fiori, 'Mechanical' Neural Learning for Blind Source Separation, Electronics Letters, 1999.
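A forward-Euler sketch of this mechanical learning rule (my own illustration). The potential U(W) = −trace(Wᵀ A W N) is my choice of example, a Brockett-type function whose maximizers diagonalize A; the QR re-orthonormalization of W is a stabilization I add, since a plain Euler step drifts off the manifold:

```python
import numpy as np

def mechanical_learning(A, N, mu=5.0, dt=0.01, steps=20000, seed=0):
    """Euler integration of the 'mechanical' system on O(n).

    W: positions, H: angular velocity (skew-symmetric),
    P = -mu * H @ W: viscous resistance, F = -dU/dW: active force.
    With U(W) = -trace(W^T A W N), F = 2 A W N, and the damped flow tends
    to maximize trace(W^T A W N).
    """
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    W, _ = np.linalg.qr(rng.standard_normal((n, n)))
    H = np.zeros((n, n))
    for _ in range(steps):
        F = 2.0 * A @ W @ N                              # F = -dU/dW
        P = -mu * H @ W                                  # viscous drag
        H += dt * 0.25 * ((F + P) @ W.T - W @ (F + P).T)  # stays skew
        W, _ = np.linalg.qr(W + dt * H @ W)              # step + stabilize
    return W

A = np.diag([3.0, 2.0, 1.0])
N = np.diag([3.0, 2.0, 1.0])
W = mechanical_learning(A, N)
print(np.round(W.T @ A @ W, 2))  # near diag(3, 2, 1) once the flow settles
```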
Reformulation of the equations when n > p.
exponentials of elementary matrices
• There is no adaptation parameter
Pros and cons of the approaches

Pros MEC
• The algorithm can be formulated using the tangent space to the manifold
• This allows the use of retractions and of a minimal number of parameters to describe the problem
• There is a nice interpretation of MEC as a learning algorithm

Cons MEC
• How does it compare to gradient methods?
Thanks
Thanks. . . for your attention!