Explicit Recursivity into Reproducing Kernel Hilbert Spaces

Devis Tuia¹, Gustavo Camps-Valls¹ and Manel Martínez-Ramón²

¹ Image Processing Laboratory (IPL), Universitat de València, Spain. {dtuia,gcamps}@uv.es
² Dpt. Signal Theory and Communications, Univ. Carlos III de Madrid, Spain. [email protected]

(Work partly supported by the Spanish Ministry for Science and Innovation under projects AYA2008-05965-C04-03, CSD2007-00018 and TEC2008-02473, and by the Swiss National Science Foundation under PostDoc grant PBLAP2-127713.)

ABSTRACT

This paper presents a methodology to develop recursive filters in reproducing kernel Hilbert spaces (RKHS). Unlike previous approaches that exploit the kernel trick on filtered and then mapped samples, we explicitly define model recursivity in the Hilbert space. The method exploits properties of functional analysis and the recursive computation of dot products, without the need for pre-imaging. We illustrate the feasibility of the methodology in the particular case of the gamma filter, an infinite impulse response (IIR) filter with controlled stability and memory depth. Different algorithmic formulations emerge from the signal model. Experiments in chaotic and electroencephalographic time series prediction scenarios demonstrate the potential of the approach.

Index Terms— Recursive filter, gamma filter, kernel methods, functional analysis, pre-image.

1. INTRODUCTION

Recursive identification algorithms play a crucial role in many problems of adaptive control, adaptive signal processing, and general model building and monitoring [1]. The efficient implementation of such algorithms has received great attention in the signal processing literature. Linear autoregressive models require relatively few parameters and allow closed-form analysis, while ladder or lattice implementations of linear filters have long been studied in signal theory. However, in the likely case that the system generating the data is nonlinear, the difficult problems of model specification and parameter estimation arise. Neural networks were proposed to tackle both the nonlinear problem and the online adaptation of model weights in a natural way [2].

In the last decade, nonlinear time series analysis with kernel methods has gained popularity due to the good trade-off observed between performance, stability and robustness. The problem was earlier addressed by using the support vector regression (SVR) method with lagged samples of the available time signals [3–5]. Two main problems are observed here. First, these approaches do not consider the non-i.i.d. nature of the time series in the model.



Hammerstein signal models were studied in [6], while [7] introduced Volterra and Wiener series as particular cases of a kernel regression framework. The nonlinear generalization of autoregressive and moving-average (ARMA) filters with kernels was presented in [8], and kernel γ-filters were introduced in [9] to implement IIR filters with restricted recursion and memory depth control. Second, the model has to be retrained with every new incoming sample. In recent years, great interest has been devoted to kernel-based adaptive filtering, in which the model weights are updated online. Recently proposed filters include kernel Kalman filters [10], kernel recursive least squares [11], kernel LMS [12] and specific kernels for dynamical modeling [13].

Kernel algorithms developed so far define signal models by exploiting the classical “kernel trick”. Essentially, samples are mapped to a reproducing kernel Hilbert space (RKHS), linear operations are performed therein and, finally, dot products are replaced by kernel functions. This is a valid and powerful methodology, but insight into the model structure is hidden behind the kernel function. We alternatively propose here to define the signal model recursion explicitly in the RKHS rather than mapping samples and then “kernelizing”. The methodology has three basic steps: 1) define the (recursive) model structure directly in a suitable RKHS, 2) exploit a proper representer theorem for the model weights, and 3) apply standard recursive formulas from classical signal processing in the RKHS. The obtained kernels inherently implement recursion in the RKHS. This yields the interesting possibility of expressing the kernel as a function of previously computed kernels, while the model hyperparameters keep their meaning. The proposed model does not need to solve a pre-image problem as in [13].

The remainder of the paper is outlined as follows. Section 2 presents the general formulation for kernel-based recursivity in feature spaces. Section 3 illustrates the usefulness of the methodology in the particular case of the γ-filter. The obtained recursive kernels emerging from the signal model are used for kernel least-squares regression. Results in synthetic and real time series prediction scenarios are given in Section 4. Finally, Section 5 concludes the paper.

2. RECURSIVITY INTO HILBERT SPACES

Consider a set of samples $\{x_i, y_i\}_{i=1}^{N}$, and define an arbitrary Markovian model representation of the recursion between input-output time-series pairs,


$$y_n = f(w_i, x_n^i \mid \theta_f) + n_x$$
$$x_n^i = g(x_{n-k}^i, y_{n-l} \mid \theta_g) + n_y, \quad \forall\, k \ge 1,\; l \ge 0 \qquad (1)$$

where x_n^i is the signal present at the input of the i-th filter tap at time n, y_{n-l} is the previous output at time n − l, n_x is the state noise, and n_y is the measurement noise. Here f and g are linear functions parametrized by the model parameters w_i and the hyperparameters θ_f and θ_g. Let us now define a feature mapping φ(·) into a RKHS F. One could replace the samples by their mapped counterparts, y_n = f(w_i, φ(g(φ(x_{n-k}^i), y_{n-l} | θ_g)) | θ_f), and try to express it in terms of dot products in order to replace them by kernel functions. Note, however, that this operation may become very complex, and eventually unsolvable, depending on the parametrization of the recursive function g. Smart solutions exist, but they often require pre-imaging or dictionary approximations [11, 13]. In this paper, we instead propose to define the model's recursive equations directly in the RKHS. The linear model is now defined explicitly in F:

$$y_n = f(w_i, \phi_n^i \mid \theta_F) + n_x$$
$$\phi_n^i = g(\phi_{n-k}^i, y_{n-l} \mid \theta_G) + n_y, \quad \forall\, k \ge 1,\; l \ge 0. \qquad (2)$$

Importantly, note that: 1) the samples φ_n^i do not necessarily have φ(x_n^i) as their pre-image; 2) the model parameters w_i are defined in the possibly infinite-dimensional feature space F; while 3) the hyperparameters keep the same meaning, as they define recursion according to an explicit model structure regardless of the space where it is defined.

This problem may be solved by first defining a proper reproducing property, e.g., w_i = Σ_{m=1}^{N} β_m^i φ_m^i. That is, the infinite-dimensional feature vectors are spanned by linear combinations of signals filtered in the feature space. Note that this is different from the traditional filter-and-map approach. Plugging this into (2) and applying the kernel trick yields

$$\hat y_n = f\!\left(\Big\langle \sum_{m=1}^{N} \beta_m^i\, \phi_m^i,\; \phi_n^i \Big\rangle \,\Big|\, \theta_F\right) = f\!\left(\sum_{m=1}^{N} \beta_m^i\, K^i(m,n) \,\Big|\, \theta_F\right).$$

In the proposed model, a linear relationship is assumed between samples in the RKHS, as each new sample will define a subspace and the method will adjust a set of weights β_m^i in each subspace according to some optimality criterion. Nevertheless, the assumption of linear relationships between samples may not be correct for scalar samples. In this case, a model length can be assumed for the input samples and the mapping can then be applied to vectors made up of delayed windows of the signals.

Note again that K^i(m,n) is not necessarily equivalent to K(x_m^i, x_n^i), and hence the problem is far different from those defined in previous kernel adaptive filters. To compute this kernel, one can resort to applying the recursion formulas in (2) and explicitly define K^i(m,n) as a function of previously computed kernels. The methodology is illustrated in the next section for the particular recursive model structure of the γ-filter. In this way, we not only attain a recursive model in feature spaces, but also avoid applying approximate methods.

3. RECURSIVE GAMMA FILTER IN RKHS

A remarkable compromise between stability and simplicity of adaptation is provided by the linear γ-filter proposed in [14]. We here present its recursive kernel formulation directly in RKHS, and demonstrate that the filter generalizes finite impulse response (FIR) filtering in Hilbert spaces. Finally, we present different forms of predictive models with the obtained recursive kernel.

3.1. The recursive gamma filter

The standard gamma filter is defined by

$$y_n = \sum_{i=1}^{P} w_i\, x_n^i \qquad (3)$$

$$x_n^i = \begin{cases} x_n, & i = 1 \\ (1-\mu)\, x_{n-1}^i + \mu\, x_{n-1}^{i-1}, & 2 \le i \le P \end{cases} \qquad (4)$$

where n is the time index, θ_f = P, and θ_g = μ is a free parameter controlling stability and memory depth.
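To make the tap recursion in (3)-(4) concrete, the following minimal sketch (ours, not part of the original formulation) implements the linear γ-filter in Python/NumPy; the function name, the zero initialization of the taps, and the example weights are illustrative assumptions.

```python
import numpy as np

def gamma_filter(x, w, mu):
    """Linear gamma filter of Eqs. (3)-(4): y_n = sum_i w_i x_n^i, with
    x_n^1 = x_n and x_n^i = (1 - mu) x_{n-1}^i + mu x_{n-1}^{i-1} for 2 <= i <= P."""
    P, N = len(w), len(x)
    taps = np.zeros(P)                 # tap signals x_n^1 ... x_n^P, zero initial conditions
    y = np.zeros(N)
    for n in range(N):
        prev = taps.copy()             # tap values at time n - 1
        taps[0] = x[n]                 # first tap is the raw input sample
        for i in range(1, P):          # leaky recursion for the remaining taps
            taps[i] = (1.0 - mu) * prev[i] + mu * prev[i - 1]
        y[n] = w @ taps                # filter output y_n
    return y

# Example: P = 4 taps with memory parameter mu = 0.6 on a random input
y = gamma_filter(np.random.randn(200), w=np.array([0.5, 0.3, 0.1, 0.1]), mu=0.6)
```

Note how μ trades off the two terms of the recursion: μ = 1 reduces the taps to a pure delay line (FIR behaviour), while smaller μ gives a longer, leaky memory.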

3.2. The recursive gamma filter into RKHS

The same recursion can be expressed in a Hilbert space:

$$y_n = \sum_{i=1}^{P} w_i\, \phi_n^i \qquad (5)$$

$$\phi_n^i = \begin{cases} \varphi(x_n), & i = 1 \\ (1-\mu)\, \phi_{n-1}^i + \mu\, \phi_{n-1}^{i-1}, & 2 \le i \le P \end{cases} \qquad (6)$$

where ϕ(·) is a nonlinear transformation into a RKHS, and φ_n^i is a vector in this RKHS that may not be the image of any vector of the input space, i.e., φ_n^i ≠ ϕ(x_n^i). These vectors are not accessible except for the case i = 1, as stated in recursion (6).

As we remarked in the previous section, this model assumes linear relations between samples in the RKHS. If this assumption is insufficient to solve the identification problem at hand, the scalar sample x_n can easily be replaced by the vector z_n = [x_n, x_{n−1}, …, x_{n−P}]^⊤, where P is the selected time-embedding.

Nevertheless, the weight vectors w_i of (5) are linearly spanned by the N training vectors, w_i = Σ_{m=1}^{N} (α_m^i − α_m^{*i}) φ_m^i. By including this expansion in (5) and applying the kernel trick, K^i(m,n) = ⟨φ_m^i, φ_n^i⟩, one obtains

$$y_n = \sum_{i=1}^{P} \sum_{m=1}^{N} (\alpha_m^i - \alpha_m^{*i})\, K^i(m,n). \qquad (7)$$


The dot products can be computed using the kernel trick and the model recursion defined in (6) as

$$K^i(m,n) = \begin{cases} K(x_m, x_n), & i = 1 \\ (1-\mu)^2\, K^i(m-1,n-1) + \mu^2\, K^{i-1}(m-1,n-1) \\ \quad + \mu(1-\mu)\,\langle \phi_{m-1}^{i}, \phi_{n-1}^{i-1}\rangle + \mu(1-\mu)\,\langle \phi_{m-1}^{i-1}, \phi_{n-1}^{i}\rangle, & 2 \le i \le P. \end{cases} \qquad (8)$$

The second part still has two (interestingly non-symmetric) dot products that are not directly computable. Nevertheless, applying again recursion (6), this can be rewritten as

$$\langle \phi_{m-1}^{i}, \phi_{n-1}^{i-1}\rangle = \begin{cases} 0, & i = 1 \\ \big\langle (1-\mu)\,\phi_{m-2}^{i} + \mu\,\phi_{m-2}^{i-1},\; \phi_{n-1}^{i-1}\big\rangle, & 2 \le i \le P \end{cases}$$

which, in turn, can be rearranged as

$$\langle \phi_{m-1}^{i}, \phi_{n-1}^{i-1}\rangle = \begin{cases} 0, & i = 1 \\ (1-\mu)\,\langle \phi_{m-2}^{i}, \phi_{n-1}^{i-1}\rangle + \mu\, K^{i-1}(m-2, n-1), & 2 \le i \le P. \end{cases}$$

The term K^{i−1}(m−2, n−1) and the first term of the second case can be recursively computed using (8). Assuming that φ_n^i = 0 for n < 0, one obtains the recursion

$$\langle \phi_{m-1}^{i}, \phi_{n-1}^{i-1}\rangle = \begin{cases} 0, & i = 1 \\ \mu \sum_{j=2}^{m-1} (1-\mu)^{j-2}\, K^{i-1}(m-j, n-1), & i \ge 2 \end{cases}$$

and finally, the recursive kernel can be rewritten as

$$K^i(m,n) = \begin{cases} K(x_m, x_n), & i = 1 \\ (1-\mu)^2\, K^i(m-1,n-1) + \mu^2\, K^{i-1}(m-1,n-1) \\ \quad + \mu^2 \sum_{j=2}^{m-1} (1-\mu)^{j-1}\, \big[ K^{i-1}(m-j, n-1) + K^{i-1}(m-1, n-j) \big], & 2 \le i \le P. \end{cases} \qquad (9)$$
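For illustration, the sketch below (ours, not the authors' code) transcribes recursion (9) directly into Python, building the P kernel matrices K^1, …, K^P over the time-ordered training samples from a base RBF Gram matrix. The 0-based indexing, the zero convention for out-of-range time indices, and the helper names are our assumptions; the triple loop is written for clarity rather than speed.

```python
import numpy as np

def rbf_gram(X, sigma):
    # Base RBF kernel K(x_m, x_n) = exp(-||x_m - x_n||^2 / (2 sigma^2)); X has shape (N, d)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def recursive_gamma_kernels(K1, P, mu):
    """Kernel matrices K^1 ... K^P of Eq. (9).

    K1 is the base Gram matrix (the i = 1 case). Terms whose time index falls
    before the start of the series are taken as zero (phi_n^i = 0 for n < 0)."""
    N = K1.shape[0]
    Ks = [K1]
    for _ in range(P - 1):                       # build K^2 ... K^P from the previous scale
        Kprev, Ki = Ks[-1], np.zeros((N, N))
        for m in range(1, N):                    # first row/column stays zero for i >= 2
            for n in range(1, N):
                val = (1 - mu) ** 2 * Ki[m - 1, n - 1] + mu ** 2 * Kprev[m - 1, n - 1]
                for j in range(2, m + 1):        # 0-based m equals the paper's upper limit m-1
                    cross = Kprev[m - j, n - 1]
                    if n - j >= 0:
                        cross += Kprev[m - 1, n - j]
                    val += mu ** 2 * (1 - mu) ** (j - 1) * cross
                Ki[m, n] = val
        Ks.append(Ki)
    return Ks

# Toy usage on a scalar series (sigma, P and mu are placeholder values)
x = np.sin(0.1 * np.arange(120))
Ks = recursive_gamma_kernels(rbf_gram(x[:, None], sigma=1.0), P=4, mu=0.6)
```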

Note that this kernel evolves through time and can be constructed recursively. Also, the hyperparameter μ keeps its meaning of memory depth in feature space, as shown by the following properties.

Property 3.1. The recursive gamma filter generalizes recursive FIR filtering in RKHS.

Proof. The FIR filter in RKHS is obtained as a particular case by setting μ = 1 in (9):

$$K^i(m,n) = \begin{cases} K(x_m, x_n), & i = 1 \\ K^{i-1}(m-1, n-1), & 2 \le i \le P. \end{cases} \qquad (10)$$

Property 3.2. The kernel recursive gamma filter generalizes the gamma filter.

Proof. Change the kernel to a dot product and pass from the dual to the primal formulation; then (3) and (4) are obtained.

3.3. Prediction models

The estimated tensor kernel in (9) has P entries that can be combined for training and prediction in different ways (see the sketch after this list):

• Composite multiple kernel. Define a composite kernel consisting of the sum of all time-scale kernels, K_c = (1/P) Σ_{i=1}^{P} K^i.

• Symmetrized kernel. Define a symmetric kernel from the previous composite kernel by K_2 = K_c K_c^⊤.

Both kernels will be plugged into a regularized least squares (Reg-LS) solver or kernel ridge regression (KRR) [15].
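A possible way to combine the P matrices and plug them into KRR is sketched below (again an illustration rather than the paper's implementation); it reuses rbf_gram and recursive_gamma_kernels from the previous sketch, assumes a single Gram matrix computed over the concatenated train/test time indices, and takes the transpose in K_2 = K_c K_c^⊤ as read.

```python
import numpy as np

def composite_kernels(Ks):
    # Composite and symmetrized kernels of Sec. 3.3 from the list [K^1, ..., K^P]
    Kc = sum(Ks) / len(Ks)           # K_c = (1/P) sum_i K^i
    K2 = Kc @ Kc.T                   # symmetrized kernel (transpose assumed)
    return Kc, K2

def krr_fit_predict(K, y, train_idx, test_idx, lam):
    # Kernel ridge regression on a precomputed Gram matrix: alpha = (K_tr + lam I)^{-1} y_tr
    Ktr = K[np.ix_(train_idx, train_idx)]
    alpha = np.linalg.solve(Ktr + lam * np.eye(len(train_idx)), y[train_idx])
    return K[np.ix_(test_idx, train_idx)] @ alpha

# Toy one-step-ahead prediction (hyperparameters stand in for the grid search of Sec. 4.1)
x = np.sin(0.05 * np.arange(300)) + 0.05 * np.random.randn(300)
y = np.roll(x, -1)                   # target: next sample
Ks = recursive_gamma_kernels(rbf_gram(x[:, None], sigma=1.0), P=4, mu=0.6)
Kc, K2 = composite_kernels(Ks)
y_hat = krr_fit_predict(K2, y, np.arange(100), np.arange(100, 299), lam=0.1)
```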


4. EXPERIMENTAL RESULTS

4.1. Model development

In the experiments below, we focus on the radial basis function (RBF) kernel with lengthscale σ, which offers stable and fast solutions. Model building requires tuning the free parameters related to the use of the Reg-LS or KRR models (σ, λ) and the γ-filter parameters (μ, P). An exhaustive iterative search strategy was used here. As fitness criterion, we considered the normalized MSE (NMSE) for performance assessment,

$$\mathrm{NMSE} = \log_{10}\!\left( \frac{1}{N \hat\sigma^2} \sum_{i} (y_i - \hat y_i)^2 \right),$$

where σ̂² is the estimated variance of the data. This measure removes the dependence on the dynamic range of the data, and the normalization implies that NMSE = 0 is obtained when the estimated mean of the data is used as a predictor. The following ranges are considered for the free parameters: σ ∈ [10⁻¹, 10²], λ ∈ [10⁻¹, 1], P ∈ [2, 8], μ ∈ [0, 1]. The proposed recursive kernels are compared with the standard RBF kernel computed on lagged samples z_t ≡ [x_t, x_{t−1}, …, x_{t−P}]^⊤, thus yielding the so-called lagged kernel K_l with entries K_l(z_i, z_j). The search range of the parameters is the same for all experiments and methods. Note that only the scalar samples x_t are used for computing the proposed recursive kernels.
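For concreteness, the NMSE criterion above corresponds to the following small helper (ours):

```python
import numpy as np

def nmse(y_true, y_pred):
    # NMSE = log10( sum((y - y_hat)^2) / (N * var(y)) ); predicting the mean gives 0, lower is better
    resid = y_true - y_pred
    return np.log10(np.sum(resid ** 2) / (len(y_true) * np.var(y_true)))
```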

4.2. Experiment 1: The Mackey-Glass time series

The first case study deals with the standard Mackey-Glass time series prediction problem, which is well known for its strong nonlinearity. This classical high-dimensional chaotic system is generated by the delay differential equation dx/dt = −0.1 x_n + 0.2 x_{n−Δ}/(1 + x_{n−Δ}^{10}), with delays Δ = 17 and Δ = 30, thus yielding the time series MG17 and MG30, respectively. We considered {50, 100, 200} training samples and used the next 1000 for testing. Results are shown in Table 1 for both time series. The proposed kernels show good performance, outperforming the stacked (lagged-kernel) approach in almost all cases. Recursive kernels provide a richer description of the time series and minimize the NMSE efficiently. The best results are obtained with the K2 kernel trained with KRR.
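The paper does not detail how the series is integrated; one common choice, shown below only as an assumed illustration, is a simple Euler discretization of the delay differential equation with unit step, a constant initial history, and a discarded warm-up transient.

```python
import numpy as np

def mackey_glass(n_samples, delta=17, dt=1.0, x0=1.2, warmup=500):
    # Euler scheme for dx/dt = -0.1 x(t) + 0.2 x(t - delta) / (1 + x(t - delta)^10)
    lag = int(delta / dt)
    total = n_samples + warmup
    x = np.zeros(total + lag)
    x[: lag + 1] = x0                             # constant history up to t = 0
    for t in range(lag, total + lag - 1):
        x_lag = x[t - lag]
        x[t + 1] = x[t] + dt * (-0.1 * x[t] + 0.2 * x_lag / (1.0 + x_lag ** 10))
    return x[lag + warmup:]                       # drop the history and the warm-up transient

mg17 = mackey_glass(1200, delta=17)               # e.g., up to 200 training + 1000 test samples
mg30 = mackey_glass(1200, delta=30)
```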

Table 1. NMSE for the experiments considered, as a function of the number of training examples.

                        MG17                    MG30                     EEG
Solver  Kernel    50     100    200       50     100    200       50     100    200
LS      Kl      -0.85  -0.92  -1.01    -0.27  -0.67  -0.90    -0.06  -0.24  -0.27
        Kc      -0.80  -0.97  -1.07    -0.43  -0.71  -0.84    -0.10  -0.31  -0.33
        K2      -0.99  -1.18  -1.30    -0.75  -1.21  -1.29    -0.11  -0.34  -0.39
KRR     Kl      -0.96  -1.07  -1.18    -0.40  -0.76  -1.04    -0.10  -0.28  -0.36
        Kc      -0.97  -1.07  -1.14    -0.56  -0.87  -0.99    -0.15  -0.32  -0.32
        K2      -1.04  -1.25  -1.33    -0.82  -1.27  -1.35    -0.15  -0.35  -0.37

4.3. Experiment 2: EEG prediction

The second experiment deals with EEG signal prediction 4 samples ahead. This is a very challenging nonlinear problem with high levels of noise and uncertainty. We used file “SLP01A” from the MIT-BIH Polysomnographic Database (data available at http://www.physionet.org/physiobank/database/slpdb/slpdb.shtml). As for the MG experiments, we used {50, 100, 200} training samples, while the next 2000 samples were used for prediction. Test results are shown in the last columns of Table 1. This time series is very complex and poorer NMSE values are obtained in general. However, the proposed recursive kernels still outperform K_l, thus confirming the interest of considering recursivity in feature space. In Fig. 1, the KRR predictions of the first 160 test samples by the different approaches are illustrated: all the methods show a tendency to over-smooth the data where peaks are observed. Nonetheless, the K2 kernel always outperforms the other approaches, confirming the numerical results.

[Fig. 1. KRR predictions (y) of the first 160 test points (t) for the recursive and RBF kernels; curves: Actual, Kc, K2, Kl.]

5. CONCLUSIONS

We have introduced a methodology for developing kernel recursion in reproducing kernel Hilbert spaces, with a particular application to kernel-based recursive filters. The method does not resort to pre-imaging or reduced-rank approximations. The developed γ-filter recursive kernel allows time-embedding directly in the RKHS through the memory factor P/μ. The obtained kernel demonstrates good performance in time series prediction, but other problems could be tackled as well. Also, as noted in the paper, the extension to higher-order (nonlinear) sample dependences can easily be explored.

6. REFERENCES

[1] L. Ljung, System Identification: Theory for the User, Prentice Hall, Upper Saddle River, NJ, 2nd edition, 1999.

[2] G. Dorffner, “Neural networks for time series processing,” Neural Network World, vol. 6, pp. 447–468, 1996.

[3] P. M. L. Drezet and R. F. Harrison, “Support vector machines for system identification,” in Proc. UKACC Int. Conf. Control, Swansea, U.K., 1998, pp. 688–692.

[4] A. Gretton, A. Doucet, R. Herbrich, P. Rayner, and B. Schölkopf, “Support vector regression for black-box system identification,” in Proc. 11th IEEE Workshop on Statistical Signal Processing, Aug 2001, pp. 341–344.


[5] D. Mattera, Support Vector Machines for Signal Processing, Studies in Fuzziness and Soft Computing, pp. 321–342, Springer, Berlin/Heidelberg, 2005.

[6] I. Goethals, K. Pelckmans, J. A. K. Suykens, and B. De Moor, “Subspace identification of Hammerstein systems using least squares support vector machines,” IEEE Trans. Automat. Contr., vol. 50, no. 10, pp. 1509–1519, 2005.

[7] M. O. Franz and B. Schölkopf, “A unifying view of Wiener and Volterra theory and polynomial kernel regression,” Neural Computation, vol. 18, no. 12, pp. 3097–3118, 2006.

[8] M. Martínez-Ramón, J. L. Rojo-Álvarez, G. Camps-Valls, A. Navia-Vázquez, E. Soria-Olivas, and A. R. Figueiras-Vidal, “Support vector machines for nonlinear kernel ARMA system identification,” IEEE Trans. Neural Networks, vol. 17, no. 6, pp. 1617–1622, Nov. 2006.

[9] G. Camps-Valls, J. Muñoz-Marí, M. Martínez-Ramón, J. Requena-Carrión, and J. L. Rojo-Álvarez, “Learning nonlinear time-scales with kernel γ-filters,” Neurocomputing, vol. 72, pp. 1324–1328, 2008.

[10] L. Ralaivola and F. d'Alché-Buc, “Time series filtering, smoothing and learning using the kernel Kalman filter,” in Proc. IEEE International Joint Conference on Neural Networks, Jul 2005, vol. 3, pp. 1449–1454.

[11] Y. Engel, S. Mannor, and R. Meir, “The kernel recursive least squares algorithm,” IEEE Trans. Signal Processing, vol. 52, pp. 2275–2285, 2004.

[12] W. Liu, P. P. Pokharel, and J. C. Principe, “The kernel least-mean-square algorithm,” IEEE Trans. Signal Processing, vol. 56, no. 2, pp. 543–554, Feb. 2008.

[13] L. Ralaivola and F. d'Alché-Buc, “Dynamical modeling with kernels for nonlinear time series prediction,” in Advances in Neural Information Processing Systems, Dec 2004, vol. 16, pp. 129–136.

[14] J. C. Principe, B. de Vries, and P. G. de Oliveira, “The gamma filter – a new class of adaptive IIR filters with restricted feedback,” IEEE Trans. Signal Processing, vol. 41, no. 2, pp. 649–656, Feb 1993.

[15] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, 2004.
