GEOMETRICAL STRUCTURES OF FIR MANIFOLD AND THEIR APPLICATION TO MULTICHANNEL BLIND DECONVOLUTION

L.-Q. Zhang, A. Cichocki and S. Amari
Brain-style Information Systems Research Group, BSI
The Institute of Physical and Chemical Research
Saitama 351-0198, Wako-shi, JAPAN
Phone: +81 48 4679665  Fax: +81 48 4679694
E-mail: {zha, cia}@brain.riken.go.jp  Web: www.open.brain.riken.go.jp

(Proceedings of the '99 IEEE Workshop on Neural Networks for Signal Processing (NNSP'99), pp. 303-312, Madison, Wisconsin, August 23-25, 1999.)

Abstract. In this paper we study geometrical structures on the manifold of FIR filters and their application to multichannel blind deconvolution. First we introduce the Lie group and Riemannian metric on the manifold of FIR filters. Then we derive the natural gradient on the manifold using the isometry of the Riemannian metric. Using the natural gradient, we present a novel learning algorithm for blind deconvolution based on the minimization of mutual information. We also study properties of the learning algorithm, such as equivariance and stability. Simulations are given to illustrate the effectiveness and validity of the proposed algorithm.

INTRODUCTION

Recently, blind separation of independent sources has become an increasingly important research area due to its theoretical interest as well as its rapidly growing applications in various fields, such as telecommunication systems, image enhancement and biomedical signal processing; refer to the review papers [4] and [9] for details. It has been shown that the natural gradient dramatically improves the learning efficiency in instantaneous blind separation and blind deconvolution. For doubly infinite filter mixtures, a natural gradient algorithm has been developed by Amari, Douglas and Cichocki [6]. The main objective of this paper is to develop an efficient learning algorithm for training causal finite


filter demixing models. In contrast to doubly infinite filters, causal finite filters of a fixed length are not closed under multiplication and inversion: in general, the product of two filters of a given length is a filter of greater length, and the same holds for the inverse. In this paper, we derive the explicit form of the natural gradient for FIR filters by using the isometry property of the Lie group on the manifold. The natural gradient descent scheme is employed to train the FIR filter, and a novel learning algorithm is developed to update the parameters of the demixing model. Computer simulations are given to illustrate the effectiveness and validity of the proposed algorithm. Due to space limitations, the detailed proofs will be presented in a full paper. The natural gradient approach to FIR filters is of theoretical and practical interest since, under a simple condition, any doubly finite filter can be decomposed into a cascade of two FIR filters, one a delay filter and the other a forward filter [26].

PROBLEM FORMULATION

In this paper, as a convolutive mixing model, we consider a multichannel linear time-invariant (LTI) system, with no poles on the unit circle, of the form

$$x(k) = H(z) s(k), \qquad (1)$$

where $H(z) = \sum_{p=0}^{\infty} H_p z^{-p}$ is an unknown mixing filter, $H_p$ is an $n \times n$ matrix of mixing coefficients at time lag $p$ (the impulse response at time $p$), $s(k)$ is an $n$-dimensional vector of source signals with mutually independent components at time $k$, and $x(k)$ is an $n$-dimensional sensor vector at time $k$, $k = 1, 2, \ldots$. The goal of multichannel blind deconvolution is to retrieve the source signals $s(k)$ using only the sensor signals $x(k)$, $k = 1, 2, \ldots$, and certain knowledge of the source signal distributions and statistics. We carry out blind deconvolution with a causal finite multichannel equalizer of the form

$$y(k) = W(z) x(k), \qquad (2)$$

where $W(z) = \sum_{p=0}^{N} W_p z^{-p}$, $N$ is the length of the FIR filter $W(z)$, $y(k) = [y_1(k), \ldots, y_n(k)]^T$ is the $n$-dimensional output vector, and $W_p$ is an $n \times n$ coefficient matrix at time lag $p$. The set of all FIR filters $W(z)$ of length $N$, under the constraint that $W_0$ is nonsingular, is denoted by $\mathcal{M}(N)$,

$$\mathcal{M}(N) = \left\{ W(z) \;\Big|\; W(z) = \sum_{p=0}^{N} W_p z^{-p}, \ \det(W_0) \neq 0 \right\}. \qquad (3)$$

$\mathcal{M}(N)$ is a manifold of dimension $n^2(N+1)$. In general, multiplication of two filters in $\mathcal{M}(N)$ enlarges the filter length, and the result no longer belongs to $\mathcal{M}(N)$. This makes it difficult to introduce a Riemannian structure on the manifold of finite multichannel filters. In order to explore geometrical structures of $\mathcal{M}(N)$ that lead to effective learning algorithms for $W(z)$, we define algebraic operations on filters in the Lie group framework.

When we apply $W(z)$ to the sensor signal $x(k)$, the global transfer function from $s$ to $y$ is defined by $G(z) = W(z) H(z)$. The goal of the blind deconvolution task is to find a $W(z)$ such that

$$G(z) = P \Lambda D(z), \qquad (4)$$

where $P \in \mathbf{R}^{n \times n}$ is a permutation matrix, $D(z) = \mathrm{diag}\{ z^{-d_1}, \ldots, z^{-d_n} \}$, and $\Lambda \in \mathbf{R}^{n \times n}$ is a nonsingular diagonal scaling matrix. In other words, the objective of multichannel blind deconvolution is to recover the source vector $s(k)$ from the observation vector $x(k)$ up to scaling, reordering and delays.

Assume that $p(y; W)$ and $p_i(y_i; W)$, $i = 1, \ldots, n$, are the joint probability density function of $y$ and the marginal pdf of $y_i$, respectively. In order to separate independent sources with the demixing model, we formulate blind deconvolution as an optimization problem: our target is to make the components of $y$ as mutually independent, and as independently and identically distributed in time, as possible. To this end, we employ the Kullback-Leibler divergence as a cost function [5],

$$l(y; W) = -H(y; W) + \sum_{i=1}^{n} H(y_i; W), \qquad (5)$$

where $H(y; W) = -\int p(y) \log p(y) \, dy$ and $H(y_i; W) = -\int p_i(y_i) \log p_i(y_i) \, dy_i$. The divergence $l(y; W)$ is a nonnegative functional which measures the mutual independence of the output signals $y_i(k)$; the output signals are mutually independent if and only if $l(y; W) = 0$.
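To make the demixing model (2) concrete, the following NumPy sketch applies an FIR filter $W(z)$ to a block of sensor signals. The array layout (W[p] holding $W_p$, one signal sample per row) and the zero initial conditions are our own illustrative conventions, not part of the paper.

```python
import numpy as np

def demix(W, x):
    """Eq. (2): y(k) = sum_{p=0}^{N} W_p x(k-p).

    W : array of shape (N+1, n, n), with W[p] = W_p
    x : array of shape (T, n), sensor signals (one sample per row)
    Returns y of shape (T, n); samples before k = 0 are taken as zero.
    """
    T, n = x.shape
    y = np.zeros((T, n))
    for p in range(W.shape[0]):
        # contribution of the sensor vector delayed by p samples
        y[p:] += x[:T - p] @ W[p].T
    return y
```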

GEOMETRICAL STRUCTURE FOR MANIFOLD $\mathcal{M}(N)$

In this section we introduce geometrical structures, such as the Lie group and Riemannian metric, on the manifold of FIR filters. Such structures are useful for the derivation of the natural gradient [1].

Lie Group

On the manifold $\mathcal{M}(N)$, the Lie operations of multiplication $*$ and inverse $\dagger$ are defined as follows: for $B(z), C(z) \in \mathcal{M}(N)$,

$$B(z) * C(z) = \sum_{p=0}^{N} \sum_{q=0}^{p} B_q C_{p-q} z^{-p}, \qquad (6)$$

$$B^{\dagger}(z) = \sum_{p=0}^{N} B^{\dagger}_p z^{-p}, \qquad (7)$$

where the $B^{\dagger}_p$ are defined recurrently by $B^{\dagger}_0 = B_0^{-1}$ and $B^{\dagger}_p = -\sum_{q=1}^{p} B^{\dagger}_{p-q} B_q B_0^{-1}$, $p = 1, \ldots, N$. With these operations, both $B(z) * C(z)$ and $B^{\dagger}(z)$ remain in the manifold $\mathcal{M}(N)$. It is easy to verify that the manifold $\mathcal{M}(N)$ with the above operations forms a Lie group. The identity element is $E(z) = I$. Moreover, the Lie group possesses the following properties:

1) Associative law: $A(z) * (B(z) * C(z)) = (A(z) * B(z)) * C(z)$; \qquad (8)

2) Inverse property: $B(z) * B^{\dagger}(z) = B^{\dagger}(z) * B(z) = I$.

In fact, the Lie multiplication of two filters $B(z), C(z) \in \mathcal{M}(N)$ is the truncation of their ordinary product at order $N$, that is,

$$B(z) * C(z) = [B(z) C(z)]_N, \qquad (9)$$

where $[B(z)]_N$ is the truncation operator that omits all terms of order higher than $N$ in the polynomial $B(z)$.
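As an illustration, the Lie operations (6)-(7) amount to a truncated polynomial convolution and a recursively built inverse. A minimal NumPy sketch, using the same (N+1, n, n) array convention as above (the function names are ours):

```python
import numpy as np

def lie_mul(B, C):
    """Lie multiplication (6): A_p = sum_{q=0}^{p} B_q C_{p-q}, p = 0..N."""
    A = np.zeros_like(B)
    for p in range(B.shape[0]):
        for q in range(p + 1):
            A[p] += B[q] @ C[p - q]
    return A

def lie_inv(B):
    """Lie inverse (7): B†_0 = B_0^{-1}, B†_p = -sum_{q=1}^{p} B†_{p-q} B_q B_0^{-1}."""
    D = np.zeros_like(B)
    B0inv = np.linalg.inv(B[0])   # requires det(B_0) != 0, i.e. B(z) in M(N)
    D[0] = B0inv
    for p in range(1, B.shape[0]):
        for q in range(1, p + 1):
            D[p] -= D[p - q] @ B[q] @ B0inv
    return D

# Inverse property check: B * B† = E(z) = I
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 3, 3))
E = lie_mul(B, lie_inv(B))
assert np.allclose(E[0], np.eye(3)) and np.allclose(E[1:], 0)
```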

Riemannian Metrics

The Lie group has the important property that it admits an invariant Riemannian metric. Let $T_{W(z)}$ be the tangent space of $\mathcal{M}(N)$ at $W(z)$, and let $X(z), Y(z) \in T_{W(z)}$ be tangent vectors. We introduce an inner product at $W(z)$, denoted $\langle X(z), Y(z) \rangle_{W(z)}$. Since $\mathcal{M}(N)$ is a Lie group, any $B(z) \in \mathcal{M}(N)$ defines an onto mapping $W(z) \to W(z) * B(z)$. This multiplication transformation maps a tangent vector $X(z)$ at $W(z)$ to the tangent vector $X(z) * B(z)$ at $W(z) * B(z)$. Therefore we can define a Riemannian metric on $\mathcal{M}(N)$ such that the right multiplication transformation is isometric, that is, it preserves the Riemannian metric on $\mathcal{M}(N)$. Explicitly,

$$\langle X(z), Y(z) \rangle_{W(z)} = \langle X(z) * B(z), Y(z) * B(z) \rangle_{W(z) * B(z)}. \qquad (10)$$

If we define the inner product at the identity $E(z)$ by $\langle X(z), Y(z) \rangle_{E(z)} = \sum_{p=0}^{N} \mathrm{tr}(X_p Y_p^T)$, then $\langle X(z), Y(z) \rangle_{W(z)}$ is automatically induced by

$$\langle X(z), Y(z) \rangle_{W(z)} = \langle X(z) * W^{\dagger}(z), Y(z) * W^{\dagger}(z) \rangle_{E(z)}. \qquad (11)$$

NATURAL GRADIENT

Stochastic gradient optimization methods for parameterized systems suffer from slow convergence due to statistical correlations in the processed signals. While quasi-Newton and related methods can be used to improve convergence, they suffer from heavy computation and numerical instability, as well as local convergence. The natural gradient search scheme is an efficient technique for solving iterative estimation problems [1]. For a cost function $l(W(z))$ defined on the Riemannian manifold $\mathcal{M}(N)$, the natural gradient $\tilde{\nabla} l(W(z))$ is the steepest ascent direction of $l(W(z))$ as measured by the Riemannian metric on $\mathcal{M}(N)$; it is the contravariant form of the partial derivatives (the ordinary gradient), denoted by

$$\frac{\partial l(W(z))}{\partial W(z)} = \sum_{p=0}^{N} \frac{\partial l(W(z))}{\partial W_p} z^{-p}, \qquad \left( \frac{\partial l(W(z))}{\partial W_p} \right)_{ij} = \frac{\partial l(W(z))}{\partial w_{pij}}.$$

The natural gradient $\tilde{\nabla} l(W(z))$ of the cost function $l(W(z))$ is calculated as follows [1]:

$$\langle X(z), \tilde{\nabla} l(W(z)) \rangle_{W(z)} = \left\langle X(z), \frac{\partial l(W(z))}{\partial W(z)} \right\rangle_{E(z)} \qquad (12)$$

for any $X(z) \in T_{W(z)}$. Comparing both sides of (12), we have

$$\tilde{\nabla} l(W(z)) = \frac{\partial l(W(z))}{\partial X(z)} * W(z), \qquad (13)$$

where $X(z)$ is a nonholonomic base defined by

$$dX(z) = dW(z) * W^{\dagger}(z) = \left[ dW(z) W^{-1}(z) \right]_N. \qquad (14)$$

It should be noted that $dX(z) = [dW(z) W^{-1}(z)]_N$ is a nonholonomic basis, which has a definite geometrical meaning and has proven useful in blind separation algorithms [3].
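Equations (13)-(14) translate directly into code: the nonholonomic differential is a Lie right-translation by $W^{\dagger}(z)$, and the natural gradient is the nonholonomic gradient right-translated back by $W(z)$. A sketch under the same array conventions as the earlier blocks:

```python
def nonholonomic_diff(dW, W):
    """Eq. (14): dX(z) = dW(z) * W†(z) = [dW(z) W^{-1}(z)]_N."""
    return lie_mul(dW, lie_inv(W))

def natural_grad(dl_dX, W):
    """Eq. (13): natural gradient = (dl/dX(z)) * W(z) in the Lie group sense."""
    return lie_mul(dl_dX, W)
```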

LEARNING ALGORITHM

We apply the stochastic gradient descent method to obtain a learning algorithm. First, we simplify the cost function $l(y; W)$ for the FIR demixing model as follows:

$$l(y; W(z)) = -\frac{1}{2} \log\left( \det(W_0^T W_0) \right) - \sum_{i=1}^{n} \log p_i(y_i(k); W(z)), \qquad (15)$$

where $\det(W_0^T W_0)$ is the determinant of the matrix $W_0^T W_0$. For the gradient of $l(y; W(z))$ with respect to $W(z)$, we calculate the total differential $dl$ of $l(y; W(z))$ under a differential $dW(z)$:

$$dl(y; W(z)) = l(y; W(z) + dW(z)) - l(y; W(z)). \qquad (16)$$

Following the procedure for deriving the natural gradient algorithm in the instantaneous mixing case [5], we have

$$dl(y; W(z)) = -\mathrm{tr}(dW_0 W_0^{-1}) + \varphi^T(y) \, dy, \qquad (17)$$

where $\mathrm{tr}$ is the trace of a matrix and $\varphi(y)$ is a vector of nonlinear activation functions, $\varphi_i(y_i) = -\frac{d \log p_i(y_i)}{d y_i} = -\frac{p_i'(y_i)}{p_i(y_i)}$. Using the notation (14), we have

$$dl(y; W(z)) = -\mathrm{tr}(dX_0) + \varphi^T(y) \, dW(z) W^{-1}(z) y. \qquad (18)$$

From this equation we obtain the partial derivatives of $l(y; W(z))$ with respect to $X(z)$:

$$\frac{\partial l(W(z))}{\partial X_p} = -\delta_{0p} I + \varphi(y) y^T(k - p), \qquad p = 0, \ldots, N. \qquad (19)$$

Using the natural gradient descent learning rule, we present a novel learning algorithm as follows:

$$\Delta W_p = -\eta \sum_{q=0}^{p} \frac{\partial l(W(z))}{\partial X_q} W_{p-q} = \eta \sum_{q=0}^{p} \left( \delta_{0q} I - \varphi(y) y^T(k - q) \right) W_{p-q}, \qquad (20)$$

for $p = 0, 1, \ldots, N$, where $\eta$ is a learning rate. In particular, the learning algorithm for $W_0$ is described by

$$\Delta W_0 = \eta \left( I - \varphi(y) y^T(k) \right) W_0. \qquad (21)$$

It should be noted that although the algorithm looks similar to the Amari et al. algorithm [6], it is not identical. The essential difference is that the update rule for $W_p$ in this paper depends only on $W_q$, $q = 0, \ldots, p$, while in [6] it depends on all parameters $W_q$, $q = 0, \ldots, N$.

The algorithm (20) has two important properties: uniform performance (the equivariant property) and nonsingularity of $W_0$. In the multichannel deconvolution problem, an algorithm is equivariant if its dynamical behavior depends on the global transfer function $G(z)$ but not on the specific mixing filter $H(z)$. In fact, the learning algorithm (20) is equivariant in the Lie group sense. Writing the learning algorithm in the Lie group form and multiplying both sides by the mixing filter $H(z)$ in the Lie group sense, we obtain

$$\Delta G(z) = -\eta \, \frac{\partial l(W(z))}{\partial X(z)} * G(z), \qquad (22)$$

where $G(z) = W(z) * H(z)$. From equation (19) we know that $\frac{\partial l(W(z))}{\partial X(z)}$ is formally independent of the mixing channel $H(z)$. This means that the algorithm (20) is equivariant.

Another important property of the learning algorithm (21) is that it preserves the nonsingularity of $W_0$, provided the initial $W_0$ is nonsingular [22]. This means that the learning algorithm (20) keeps the filter $W(z)$ on the manifold $\mathcal{M}(N)$ if the initial filter is on the manifold. It follows that the equilibrium points of the learning algorithm satisfy

$$E\left[ \varphi(y(k)) y^T(k - p) \right] = 0, \qquad p = 1, \ldots, N, \qquad (23)$$

$$E\left[ I - \varphi(y(k)) y^T(k) \right] = 0. \qquad (24)$$

This means that we can make the separated signals $y$ as mutually independent as possible if the nonlinear activation functions $\varphi(y)$ are suitably chosen.
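A sketch of one online step of the update rule (19)-(21), continuing the earlier blocks. We assume a caller-maintained buffer of the last N+1 output vectors (y_buf[q] = y(k-q)) and, by default, the cubic activation used later in the simulations; buffer management and the data loop are omitted.

```python
def ngd_step(W, y_buf, eta, phi=lambda y: y**3):
    """One step of Eq. (20):
    W_p += eta * sum_{q<=p} (delta_{0q} I - phi(y(k)) y^T(k-q)) W_{p-q}.

    W     : (N+1, n, n) demixing filter coefficients
    y_buf : (N+1, n) recent outputs, y_buf[q] = y(k-q)
    """
    N1, n, _ = W.shape
    f = phi(y_buf[0])                                     # phi(y(k))
    # Eq. (19): dl/dX_q = -delta_{0q} I + phi(y(k)) y^T(k-q)
    dldX = np.stack([np.outer(f, y_buf[q]) for q in range(N1)])
    dldX[0] -= np.eye(n)
    # Eq. (20) is the Lie right-translation of -eta * dl/dX(z) by W(z)
    return W - eta * lie_mul(dldX, W)
```

Note that the triangular sum inside lie_mul makes visible the property stated above: the update of $W_p$ touches only $W_0, \ldots, W_p$.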

STABILITY OF LEARNING ALGORITHM

Since the learning algorithm for updating $W_k$, $k = 0, 1, \ldots, N$, is a linear combination of the $X_k$, $k = 0, 1, \ldots, N$, the stability of the $X_k$ implies the stability of the learning algorithm. In order to analyze the stability of the learning algorithm, we suppose that the separated signals $y = (y_1, \ldots, y_n)^T$ are not only spatially mutually independent but also temporally i.i.d. in each component. Now consider the learning algorithm for updating $X_p$:

$$\frac{dX_p}{dt} = \eta \left( \delta_{0p} I - \varphi(y(k)) y^T(k - p) \right), \qquad p = 0, 1, \ldots, N. \qquad (25)$$

To analyze the asymptotic properties of the learning algorithm, we take the expectation of equation (25):

$$\frac{dX_p}{dt} = \eta \left( \delta_{0p} I - E\{ \varphi(y) y^T(k - p) \} \right), \qquad p = 0, 1, \ldots, N. \qquad (26)$$

Then the stability conditions for (26) are

$$m_i + 1 > 0, \qquad i = 1, \ldots, n, \qquad (27)$$
$$\kappa_i > 0, \qquad i = 1, \ldots, n, \qquad (28)$$
$$\sigma_i^2 \sigma_j^2 \kappa_i \kappa_j > 1, \qquad i, j = 1, \ldots, n, \qquad (29)$$

where $m_i = E[\varphi_i'(y_i(k)) y_i^2(k)]$, $\kappa_i = E[\varphi_i'(y_i)]$ and $\sigma_i^2 = E[|y_i|^2]$, $i = 1, \ldots, n$. These conditions are identical to the ones derived by Amari et al. [2] for instantaneous mixtures. The detailed derivation is left to a full paper.

n

n

N

j

p

pij

p;j

pij

I SI

p;j

i

n

n

N

i

p

pij

p;i

j

309

pij

p;i

pij

pij
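Criterion (30) reads off directly from the coefficient tensor of $G(z)$. A sketch, with G stored as an (N+1, n, n) array so that G[p, i, j] = $G_{pij}$:

```python
def m_isi(G):
    """Multichannel intersymbol interference, Eq. (30)."""
    A = np.abs(G)                           # |G_{pij}|
    row_max = A.max(axis=0).max(axis=1)     # max_{p,j} |G_{pij}| for each output i
    col_max = A.max(axis=0).max(axis=0)     # max_{p,i} |G_{pij}| for each source j
    row_term = (A.sum(axis=(0, 2)) - row_max) / row_max
    col_term = (A.sum(axis=(0, 1)) - col_max) / col_max
    return row_term.sum() + col_term.sum()
```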

Figure 1: $M_{ISI}$ performance of the natural gradient algorithm (solid line) and the Bussgang algorithm (dashed line) versus iteration number (0 to 5000).

It is easy to show that $M_{ISI} = 0$ if and only if $G(z)$ is of the form (4). We give two examples to demonstrate the behavior and performance of the algorithm (20). In both examples the mixing model is the multichannel ARMA model

$$x(k) + \sum_{i=1}^{10} A_i x(k - i) = B_0 s(k) + \sum_{i=1}^{10} B_i s(k - i) + v(k), \qquad (31)$$

where $x, s, v \in \mathbf{R}^3$. The matrices $A_i$ and $B_i$ are randomly chosen such that the mixing system is stable and minimum phase. The sources $s$ are i.i.d. signals uniformly distributed in the range $(-1, 1)$, and $v$ is Gaussian noise with zero mean and covariance matrix $0.1 I$. The nonlinear activation function was chosen as $\varphi(y) = y^3$.

Example 1. We use an AR model as the mixing system, which can be exactly inverted by an FIR filter. A large number of simulations show that the natural gradient learning algorithm easily recovers the source signals in the sense that $W(z) H(z) = P \Lambda D(z)$. Figure 1 illustrates the 100-trial ensemble-average $M_{ISI}$ performance of the natural gradient learning algorithm and the Bussgang algorithm. It is observed that the natural gradient algorithm usually needs fewer than 2000 iterations to obtain satisfactory results, while the Bussgang algorithm needs more than 20000 iterations, since there is a long plateau in Bussgang learning.

Example 2. We choose an ARMA model as the mixing system such that the system is stable and minimum phase. A large number of simulations have been performed. Figures 2 and 3 illustrate a typical example of the coefficients of the global transfer function $G(z) = W(z) H(z)$ at the initial state and after 3000 iterations, respectively, where the $(i,j)$-th subfigure plots the coefficients of the transfer function $G_{ij}(z) = \sum_{k=0}^{\infty} g_{ijk} z^{-k}$ up to order 80.
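For reproducibility, a sketch of the ARMA mixing model (31) with uniform sources and the stated noise covariance. The random choice of $A_i$, $B_i$ is only indicative: the paper additionally requires the system to be stable and minimum phase, which a practical implementation must enforce separately (the small scaling of the AR part below is a heuristic in that direction, not the authors' procedure).

```python
def simulate_mixture(T=5000, n=3, seed=1):
    """Generate sources s and sensor signals x from the ARMA model (31)."""
    rng = np.random.default_rng(seed)
    s = rng.uniform(-1, 1, size=(T, n))             # i.i.d. sources in (-1, 1)
    v = np.sqrt(0.1) * rng.standard_normal((T, n))  # noise with covariance 0.1 I
    A = 0.1 * rng.standard_normal((10, n, n))       # A_i, i = 1..10 (scaled small
    B = rng.standard_normal((11, n, n))             # B_i, i = 0..10  toward stability)
    x = np.zeros((T, n))
    for k in range(T):
        ma = sum(B[i] @ s[k - i] for i in range(min(k + 1, 11)))
        ar = sum(A[i - 1] @ x[k - i] for i in range(1, min(k + 1, 11)))
        x[k] = ma - ar + v[k]
    return s, x
```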

Figure 2: Coefficients of the global transfer function $G(z)$ at the initial state; the $(i,j)$-th panel plots the coefficients of $G_{ij}(z)$ up to order 80, $i, j = 1, 2, 3$.

Figure 3: Coefficients of $G(z)$ after 3000 iterations.

REFERENCES

[1] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10:251-276, 1998.
[2] S. Amari, T. Chen, and A. Cichocki. Stability analysis of adaptive blind source separation. Neural Computation, 10:1345-1351, 1997.
[3] S. Amari, T. Chen, and A. Cichocki. Nonholonomic orthogonal constraints in blind source separation. Neural Computation, to be published.
[4] S. Amari and A. Cichocki. Adaptive blind signal processing: neural network approaches. Proceedings of the IEEE, 86(10):2026-2048, 1998.
[5] S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal separation. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 8 (NIPS*95), pages 757-763, Cambridge, MA, 1996. The MIT Press.
[6] S. Amari, S. Douglas, and A. Cichocki. Multichannel blind deconvolution and source separation using the natural gradient. IEEE Transactions on Signal Processing, to appear.
[7] S. Amari, S. Douglas, A. Cichocki, and H. Yang. Novel on-line algorithms for blind deconvolution using natural gradient approach. In Proc. 11th IFAC Symposium on System Identification (SYSID'97), pages 1057-1062, Kitakyushu, Japan, July 8-11, 1997.
[8] A. J. Bell and T. J. Sejnowski. An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7:1129-1159, 1995.
[9] J.-F. Cardoso. Blind signal separation: statistical principles. Proceedings of the IEEE, 86(10):2009-2025, 1998.
[10] J.-F. Cardoso and B. Laheld. Equivariant adaptive source separation. IEEE Transactions on Signal Processing, SP-43:3017-3029, Dec. 1996.
[11] A. Cichocki, S. Amari, and J. Cao. Neural network models for blind separation of time delayed and convolved signals. IEICE Transactions on Fundamentals, E82-A(9):1595-1603, Sept. 1997.
[12] A. Cichocki and R. Unbehauen. Robust neural networks with on-line learning for blind identification and blind separation of sources. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 43(11):894-906, 1996.
[13] A. Cichocki, R. Unbehauen, and E. Rummert. Robust learning algorithm for blind separation of signals. Electronics Letters, 30(17):1386-1387, 1994.
[14] S. Douglas, A. Cichocki, and S. Amari. Multichannel blind separation and deconvolution of sources with arbitrary distributions. In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing (NNSP'97), pages 436-445, Florida, USA, September 1997.
[15] D. N. Godard. Self-recovering equalization and carrier tracking in two-dimensional data communication systems. IEEE Transactions on Communications, COM-28:1867-1875, 1980.
[16] M. I. Gurelli and C. L. Nikias. EVAM: an eigenvector-based algorithm for multichannel blind deconvolution of input colored signals. IEEE Transactions on Signal Processing, 43:134-149, 1995.
[17] Y. Hua. Fast maximum likelihood for blind identification of multiple FIR channels. IEEE Transactions on Signal Processing, 44:661-672, 1996.
[18] Y. Sato. Two extensional applications of the zero-forcing equalization method. IEEE Transactions on Communications, COM-23:684-687, 1975.
[19] L. Tong, R. W. Liu, V. C. Soon, and Y. F. Huang. Indeterminacy and identifiability of blind identification. IEEE Transactions on Circuits and Systems, 38(5):499-509, May 1991.
[20] L. Tong, G. Xu, and T. Kailath. Blind identification and equalization based on second-order statistics: a time domain approach. IEEE Transactions on Information Theory, 40:340-349, 1994.
[21] K. Torkkola. Blind separation of convolved sources based on information maximization. In S. Usui, Y. Tohkura, S. Katagiri, and E. Wilson, editors, Proc. of the 1996 IEEE Workshop on Neural Networks for Signal Processing (NNSP'96), pages 423-432, New York, NY, 1996. IEEE Press.
[22] H. Yang and S. Amari. Adaptive on-line learning algorithms for blind separation: maximum entropy and minimum mutual information. Neural Computation, 9:1457-1482, 1997.
[23] D. Yellin and E. Weinstein. Criteria for multichannel signal separation. IEEE Transactions on Signal Processing, 42:2158-2168, 1994.
[24] L. Zhang and A. Cichocki. Blind deconvolution/equalization using state-space models. In Proceedings of the 1998 IEEE International Workshop on Neural Networks for Signal Processing (NNSP'98), pages 123-131, Cambridge, UK, August 31-September 2, 1998.
[25] L. Zhang and A. Cichocki. Blind separation of filtered sources using state-space approach. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11, pages 648-654. MIT Press, Cambridge, MA, 1999.
[26] L. Zhang, A. Cichocki, and S. Amari. Multichannel blind deconvolution of nonminimum phase systems using information backpropagation. In Proceedings of the Fifth International Conference on Neural Information Processing (ICONIP'99), to appear, Perth, Australia, Nov. 16-20, 1999.
