
Sparse Incremental Regression Modeling Using Correlation Criterion with Boosting Search

S. Chen, X.X. Wang and D.J. Brown

Abstract— A novel technique is presented to construct sparse generalized Gaussian kernel regression models. The proposed method appends regressors one by one in an incremental modeling process, tuning the mean vector and diagonal covariance matrix of each individual Gaussian regressor to best fit the training data based on a correlation criterion. It is shown that this is equivalent to incrementally minimizing the modeling mean square error. The optimization at each regression stage is carried out with a simple search algorithm re-enforced by boosting. Experimental results obtained using this technique demonstrate that it offers a viable alternative to the existing state-of-the-art kernel modeling methods for constructing parsimonious models.

Keywords— Regression, Gaussian kernel model, incremental modeling, correlation, boosting

I. INTRODUCTION

A basic principle in nonlinear data modeling is the parsimonious principle of ensuring the smallest possible model that explains the training data. The state-of-the-art sparse kernel modeling techniques [1]–[10] have widely been adopted in data modeling applications. These existing sparse modeling techniques typically use a fixed common variance for all the regressors and select the kernel centers from the training input data. We present a flexible construction method for generalized Gaussian kernel models that appends regressors one by one in an incremental modeling process. The correlation between a Gaussian regressor and the training data is used as the criterion to optimize the mean vector and diagonal covariance matrix of the regressor. This approach is equivalent to incrementally minimizing the modeling mean square error. The optimization is carried out with a simple boosting search. Because the kernel means are not restricted to the training input data and each regressor has an individually tuned diagonal covariance matrix, our method can produce very sparse models that generalize well, and it offers a viable alternative to the existing state-of-the-art sparse kernel modeling methods.

Our proposed incremental modeling method is very different from cascade-correlation incremental learning [11]. In the cascade-correlation method, regression units are constructed on a variable space of increasing dimension, namely, the inputs to a unit are the original inputs together with the outputs of the previously selected units. Our proposed method is a truly incremental modeling from the input space to the output space. It has a desirable geometric property: at each stage a regressor is constructed to fit the peak (in the sense of magnitude) of the current modeling residual. This geometric property is graphically illustrated with a simple one-dimensional modeling problem. Our method also has advantages over the radial basis function network training methods based on clustering (e.g. [12]–[14]). In these clustering-based learning methods, the number of clusters or the model size must be learned by other means, for example, via cross-validation [15],[16]. Moreover, the regressor kernel variances also need to be decided using some other appropriate techniques.

S. Chen is with the School of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, U.K. X.X. Wang and D.J. Brown are with the Department of Creative Technologies, University of Portsmouth, Portsmouth PO1 3HE, U.K.

II. METHOD

Consider the problem of fitting the $N$ pairs of training data $\{\mathbf{x}(k), y(k)\}_{k=1}^{N}$ with the regression model

$$y(k) = \sum_{i=1}^{M} \theta_i g_i(\mathbf{x}(k)) + e(k), \qquad (1)$$

where $\mathbf{x}$ is the $n$-dimensional input variable; $\theta_i$, $1 \le i \le M$, denote the model weights; $M$ is the number of regressors; $g_i(\cdot)$, $1 \le i \le M$, denote the regressors; and $e(k)$ is the modeling error. We allow the regressor to be chosen as the generalized Gaussian kernel function $g_i(\mathbf{x}) = K(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$ with

$$K(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = \exp\!\Big( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^{\mathrm{T}} \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \Big), \qquad (2)$$



where $\boldsymbol{\mu}_i$ is the $i$th kernel center or mean vector and the covariance matrix $\boldsymbol{\Sigma}_i$ is diagonal. We will adopt an incremental approach to build up the regression model (1) by appending regressors one by one. Let us first introduce the following notation:

$$e^{(0)}(k) = y(k), \qquad e^{(j)}(k) = e^{(j-1)}(k) - \theta_j g_j(\mathbf{x}(k)), \quad 1 \le j \le M. \qquad (3)$$

Obviously, $e^{(j)}(k)$ is the modeling error at $\mathbf{x}(k)$ after the $j$th regressor has been fitted, and $e^{(0)}(k)$ is simply the desired output $y(k)$ for the input $\mathbf{x}(k)$.
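As a concrete illustration, the short NumPy sketch below evaluates the generalized Gaussian regressor of (2) with a diagonal covariance matrix and applies the residual update of (3); the function names are illustrative and not part of the letter.

```python
import numpy as np

def gaussian_regressor(X, mu, sigma2_diag):
    """Generalized Gaussian kernel of (2) with a diagonal covariance matrix:
    g(x) = exp(-0.5 * (x - mu)^T diag(sigma2_diag)^{-1} (x - mu)),
    evaluated at every row of X (shape (N, n))."""
    diff = np.asarray(X, dtype=float) - np.asarray(mu, dtype=float)
    quad = np.sum(diff**2 / np.asarray(sigma2_diag, dtype=float), axis=1)
    return np.exp(-0.5 * quad)

def update_residual(e_prev, theta_j, g_j):
    """Residual update of (3): e^(j)(k) = e^(j-1)(k) - theta_j * g_j(x(k))."""
    return e_prev - theta_j * g_j
```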

Next define the mean square error (MSE) for the $j$-term regression model over the training data as

$$\mathrm{MSE}_j = \frac{1}{N} \sum_{k=1}^{N} \big( e^{(j)}(k) \big)^2. \qquad (4)$$

The incremental modeling process is terminated when $\mathrm{MSE}_j \le \xi$, where $\xi$ is a preset modeling accuracy. The termination of the model construction process can alternatively be decided by cross-validation [15],[16], and other termination criteria include the Akaike information criterion [17], the optimal experimental design criteria [9] and the leave-one-out generalization criterion [10].
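To make the overall construction loop concrete, here is a minimal Python sketch of the incremental procedure, assuming a helper `fit_regressor_by_boosting_search` (a placeholder for the stage-wise regressor tuning described in the remainder of this section) that returns the tuned kernel parameters, the regressor weight and the regressor response vector.

```python
import numpy as np

def incremental_modeling(X, y, xi, max_regressors, fit_regressor_by_boosting_search):
    """Outer incremental construction loop: append one tuned Gaussian regressor
    per stage until MSE_j <= xi (the preset accuracy) or max_regressors is reached.
    The helper is assumed to return (mu, sigma2_diag, theta, g), i.e. the tuned
    kernel parameters, the regressor weight and the regressor response vector."""
    e = np.asarray(y, dtype=float).copy()   # e^(0)(k) = y(k)
    model = []
    for j in range(max_regressors):
        mu, sigma2_diag, theta, g = fit_regressor_by_boosting_search(X, e)
        model.append((mu, sigma2_diag, theta))
        e = e - theta * g                   # residual update (3)
        if np.mean(e**2) <= xi:             # termination test on MSE_j, cf. (4)
            break
    return model
```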

At the $j$th stage of modeling, the regressor $g_j$ is fitted to the training data set $\{\mathbf{x}(k), e^{(j-1)}(k)\}_{k=1}^{N}$ by tuning its mean vector $\boldsymbol{\mu}_j$ and diagonal covariance matrix $\boldsymbol{\Sigma}_j$. The correlation function between the regressor and the training data set, given by









$$C(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) = \frac{\Big( \sum_{k=1}^{N} g_j(\mathbf{x}(k))\, e^{(j-1)}(k) \Big)^2}{\sum_{k=1}^{N} g_j^2(\mathbf{x}(k)) \sum_{k=1}^{N} \big( e^{(j-1)}(k) \big)^2}, \qquad (5)$$

defines the similarity between the regressor $g_j$ and the current modeling residual $e^{(j-1)}$.

This correlation criterion can be used to position and shape a regressor. That is, $\boldsymbol{\mu}_j$ and $\boldsymbol{\Sigma}_j$ of the $j$th regressor are chosen to maximize $C(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$. After the regressor positioning and shaping, the corresponding weight $\theta_j$ is calculated by the usual least squares solution



$$\theta_j = \frac{\sum_{k=1}^{N} g_j(\mathbf{x}(k))\, e^{(j-1)}(k)}{\sum_{k=1}^{N} g_j^2(\mathbf{x}(k))}. \qquad (6)$$

Selecting regressors by maximizing $C(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$ is identical to incrementally minimizing the modeling MSE (4). Substituting (3) into (4) with $\theta_j$ given by (6) yields

$$\mathrm{MSE}_j = \frac{1}{N} \sum_{k=1}^{N} \big( e^{(j-1)}(k) - \theta_j g_j(\mathbf{x}(k)) \big)^2 = \big( 1 - C(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) \big)\, \mathrm{MSE}_{j-1}. \qquad (7)$$
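The following NumPy sketch mirrors (5) and (6) for a single candidate regressor response vector; the optional `lam` argument anticipates the regularized weight discussed below, and the commented check restates the MSE recursion (7). The names are illustrative only.

```python
import numpy as np

def correlation_criterion(g, e_prev):
    """Correlation criterion (5) between the regressor response vector g and the
    current residual e_prev; the value lies in [0, 1]."""
    return np.dot(g, e_prev) ** 2 / (np.dot(g, g) * np.dot(e_prev, e_prev))

def regressor_weight(g, e_prev, lam=0.0):
    """Least squares weight (6); a small lam > 0 gives a ridge-regularized variant."""
    return np.dot(g, e_prev) / (lam + np.dot(g, g))

# Consistency check of the MSE recursion (7): with theta from (6),
#   np.mean((e_prev - theta * g)**2)
# equals
#   (1 - correlation_criterion(g, e_prev)) * np.mean(e_prev**2)
```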



Clearly, maximizing $C(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$ is equivalent to minimizing $\mathrm{MSE}_j$ with respect to $\boldsymbol{\mu}_j$ and $\boldsymbol{\Sigma}_j$. An important technique to alleviate over-fitting and improve the robustness of the solution is to apply regularization [6]–[10]. The zero-order regularization can readily be incorporated into the proposed method by changing the least squares solution (6) into





$$\theta_j = \frac{\sum_{k=1}^{N} g_j(\mathbf{x}(k))\, e^{(j-1)}(k)}{\lambda + \sum_{k=1}^{N} g_j^2(\mathbf{x}(k))}, \qquad (8)$$

where $\lambda$ is a small positive regularization parameter.

The optimization for determining $\boldsymbol{\mu}_j$ and $\boldsymbol{\Sigma}_j$ can be performed with guided random search methods, such as the genetic algorithm [18],[19] and adaptive simulated annealing [20],[21]. However, we perform this optimization by a simple search which is re-enforced by boosting [22]–[24]. Let the vector $\mathbf{u}_j$ contain the mean vector $\boldsymbol{\mu}_j$ and the diagonal elements of the covariance matrix $\boldsymbol{\Sigma}_j$. Given the training data $\{\mathbf{x}(k), e^{(j-1)}(k)\}_{k=1}^{N}$, the basic boosting search algorithm is summarized as follows.

Initialization: Set the iteration index $t = 0$, give randomly chosen initial values to the population of $P_s$ points $\mathbf{u}_j^{(1)}[0], \ldots, \mathbf{u}_j^{(P_s)}[0]$ with the associated distribution weightings $\delta_i(0) = 1/P_s$ for $1 \le i \le P_s$, and specify a small positive value $\xi_B$ for terminating the search and a maximum number of iterations $N_B$.

Step 1: Boosting
1. Calculate the loss of each point, namely $\mathrm{cost}_i = 1 - C\big(\mathbf{u}_j^{(i)}[t]\big)$ for $1 \le i \le P_s$ (cf. (7)).
2. Find $\mathbf{u}_{\mathrm{best}} = \arg\min\{\mathrm{cost}_i,\ 1 \le i \le P_s\}$ and $\mathbf{u}_{\mathrm{worst}} = \arg\max\{\mathrm{cost}_i,\ 1 \le i \le P_s\}$.
3. Normalize the loss: $\mathrm{loss}_i = \mathrm{cost}_i \big/ \sum_{l=1}^{P_s} \mathrm{cost}_l$, $1 \le i \le P_s$.
4. Compute a weighting factor $\beta_t$ according to
$$\eta_t = \sum_{i=1}^{P_s} \delta_i(t)\, \mathrm{loss}_i, \qquad \beta_t = \frac{\eta_t}{1 - \eta_t}.$$
5. For $i = 1, \ldots, P_s$, update the distribution weightings
$$\delta_i(t+1) = \begin{cases} \delta_i(t)\, \beta_t^{\mathrm{loss}_i}, & \beta_t \le 1, \\ \delta_i(t)\, \beta_t^{1 - \mathrm{loss}_i}, & \beta_t > 1. \end{cases}$$
6. Normalize the weighting vector: $\delta_i(t+1) = \delta_i(t+1) \big/ \sum_{l=1}^{P_s} \delta_l(t+1)$, $1 \le i \le P_s$.

Step 2: Parameter updating
1. Construct the $(P_s+1)$th point using the formula
$$\mathbf{u}_{P_s+1} = \sum_{i=1}^{P_s} \delta_i(t+1)\, \mathbf{u}_j^{(i)}[t].$$
2. Construct the $(P_s+2)$th point using the formula
$$\mathbf{u}_{P_s+2} = \mathbf{u}_{\mathrm{best}} + \big( \mathbf{u}_{\mathrm{best}} - \mathbf{u}_{P_s+1} \big).$$
3. Choose the better point (smaller loss value) from $\mathbf{u}_{P_s+1}$ and $\mathbf{u}_{P_s+2}$ to replace $\mathbf{u}_{\mathrm{worst}}$ in the population.
Set $t = t+1$ and repeat from Step 1 until $\big\| \mathbf{u}_{P_s+1} - \mathbf{u}_{P_s+2} \big\| \le \xi_B$ or $N_B$ iterations have been reached. Then choose the $j$th regressor as $\mathbf{u}_j^{\mathrm{best}} = \mathbf{u}_{\mathrm{best}}$.

The above basic boosting search algorithm performs a guided random search, and the solution obtained may depend on the initial choice of the population. To derive a robust algorithm that ensures a stable solution, we augment it into the following repeated boosting search algorithm.

Initialization: Specify a maximum number of repetitions $N_R$ and a small positive number $\xi_R$ for stopping the search.
First generation: Randomly choose the initial population of $P_s$ points $\mathbf{u}_j^{(1)}, \ldots, \mathbf{u}_j^{(P_s)}$, and call the boosting search algorithm to obtain a solution $\mathbf{u}_j^{\mathrm{best}}[1]$.
Repeat loop: For $n = 2, \ldots, N_R$:
   Set $\mathbf{u}_j^{(1)} = \mathbf{u}_j^{\mathrm{best}}[n-1]$, and randomly generate the other $P_s - 1$ points $\mathbf{u}_j^{(i)}$, $2 \le i \le P_s$.
   Call the boosting search algorithm to obtain a solution $\mathbf{u}_j^{\mathrm{best}}[n]$.
   If $\big\| \mathbf{u}_j^{\mathrm{best}}[n] - \mathbf{u}_j^{\mathrm{best}}[n-1] \big\| \le \xi_R$, exit the loop.
End for
Choose the $j$th regressor as $\mathbf{u}_j = \mathbf{u}_j^{\mathrm{best}}[n]$.

The algorithmic parameters that need to be chosen appropriately are the population size $P_s$, the termination criterion $\xi_B$ and the maximum number of iterations $N_B$ in the boosting search, as well as the maximum number of repetitions $N_R$ and the stopping criterion $\xi_R$ for the repeating loop. To simplify algorithm tuning, we can simply fix $N_B$ and $N_R$ without the need to specify $\xi_B$ and $\xi_R$. In the following modeling experiments, the values of $P_s$, $N_B$ and $N_R$ were chosen empirically to ensure that the incremental modeling procedure produced consistent final models with the same levels of modeling accuracy and model sparsity over different runs.

The stopping threshold $\xi$ for the incremental modeling procedure should ideally be set to a value slightly larger than the system noise variance. Since the system noise level is generally unknown a priori, an appropriate value for $\xi$ has to be learned during the modeling process. Alternatively, the Akaike information criterion [17] and the optimal experimental design criteria [9] can be employed to terminate the model construction procedure without the need to specify a modeling accuracy $\xi$.
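For illustration only, the following NumPy sketch gives one plausible reading of the basic boosting search described above, applied to a candidate vector that packs the kernel mean and diagonal variances; the function name `boosting_search`, the cost callback and the termination defaults are assumptions for this sketch rather than the authors' implementation.

```python
import numpy as np

def boosting_search(cost, u_init, n_iters=100, xi_B=1e-6):
    """Basic boosting search over a population of candidate parameter vectors u
    (each packing a kernel mean and its diagonal variances). `cost(u)` returns a
    non-negative loss, e.g. 1 - C(mu_j, Sigma_j) from (5)-(7); `u_init` is a
    (P_s, d) array holding the initial population."""
    U = np.array(u_init, dtype=float)            # population, shape (P_s, d)
    Ps = U.shape[0]
    delta = np.full(Ps, 1.0 / Ps)                # distribution weightings

    for _ in range(n_iters):
        # Step 1: boosting
        costs = np.array([cost(u) for u in U])
        i_best, i_worst = int(np.argmin(costs)), int(np.argmax(costs))
        loss = costs / np.sum(costs)             # normalized losses
        eta = float(np.clip(np.dot(delta, loss), 1e-12, 1.0 - 1e-12))
        beta = eta / (1.0 - eta)                 # weighting factor
        delta = delta * (beta ** loss if beta <= 1.0 else beta ** (1.0 - loss))
        delta = delta / np.sum(delta)

        # Step 2: parameter updating
        u_mean = delta @ U                       # weighted combination of the population
        u_mirror = U[i_best] + (U[i_best] - u_mean)
        U[i_worst] = u_mean if cost(u_mean) < cost(u_mirror) else u_mirror
        if np.linalg.norm(u_mean - u_mirror) <= xi_B:
            break

    costs = np.array([cost(u) for u in U])
    return U[int(np.argmin(costs))]              # best candidate for u_j
```

The repeated boosting search would call `boosting_search` up to $N_R$ times, seeding the first member of each fresh random population with the best point found so far.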


Fig. 1. Incremental modeling procedure for the simple function fitting problem: in (a)–(f), the light curves are the modeling errors of the previous stage, $e^{(j-1)}(x)$, and the dark curves are the fitted current regressors, $\theta_j g_j(x)$, for $j = 1, \ldots, 6$, respectively.

III. EXPERIMENTAL RESULTS

Two examples were used to illustrate the proposed sparse modeling approach. The first example was a one-dimensional simulated data set, chosen to demonstrate graphically the motivation and the desired property of the incremental regression procedure using the correlation criterion. The second example was a real-data set.

Example 1. The 500 points of training data were generated as noisy observations of a one-dimensional nonlinear function at equally spaced points $x \in [-10, 10]$, where the additive disturbance was Gaussian white noise with zero mean and variance 0.01. With an appropriately chosen population size $P_s$, maximum number of iterations $N_B$ and maximum number of repetitions $N_R$, together with the modeling accuracy $\xi$ set slightly above the noise variance, the incremental modeling consistently produced models of 6 Gaussian regressors with the same MSE over a large number of different runs. We also used the Akaike information criterion [17] and the optimal experimental design criteria [9] to stop the selection procedure, rather than specifying the modeling accuracy $\xi$, and the results obtained were identical. The construction process in a typical run is illustrated graphically in Fig. 1 (a)–(f), where the effectiveness of regressor tuning based on the correlation criterion is clearly demonstrated. In Fig. 2 (a), the model output from the constructed 6-term model is superimposed on the noisy training data, and the final modeling errors are shown in Fig. 2 (b).

Example 2. This example constructed a model representing the relationship between the fuel rack position (input $u(k)$) and the engine speed (output $y(k)$) for a Leyland TL11 turbocharged, direct injection diesel engine operated at a low engine speed. A detailed system description and the experimental setup can be found in [25]. The data set contained 410 samples. The first 210 data points were used in training and the last 200 points in model validation. The training and validation sets were constructed from lagged values of the system input and output.
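As an illustration of how such a lagged regression set could be assembled, the sketch below assumes, purely for illustration, an input vector of one lagged output and two lagged inputs, $\mathbf{x}(k) = [y(k-1)\ u(k-1)\ u(k-2)]^{\mathrm{T}}$; the array names, the lag structure and the exact boundary handling are assumptions, not details taken from the letter.

```python
import numpy as np

def build_lagged_set(u, y, n_train=210):
    """Assemble a lagged regression set: each row of X is x(k) = [y(k-1), u(k-1), u(k-2)]
    with desired output y(k), then split into the block covering the first n_train
    samples (training) and the remainder (validation)."""
    u, y = np.asarray(u, dtype=float), np.asarray(y, dtype=float)
    X = np.column_stack([y[1:-1], u[1:-1], u[:-2]])   # rows correspond to k = 3, ..., len(y)
    t = y[2:]                                          # desired outputs y(k)
    n = n_train - 2                                    # targets y(3), ..., y(n_train) lie in the training block
    return X[:n], t[:n], X[n:], t[n:]
```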

We used the proposed approach to fit a generalized Gaussian regression model to this data set. With appropriately chosen values of $P_s$, $N_B$ and $N_R$, together with a preset modeling accuracy $\xi$, the incremental modeling produced in repeated runs consistent models of 9 Gaussian regressors. The model output and the model prediction error over the whole data set are shown in Fig. 3. A kernel model obtained by one of the existing state-of-the-art kernel modeling techniques required at least 20 regressors to achieve the same modeling accuracy (see [10]). In comparison, the proposed modeling approach resulted in a much sparser 9-term generalized Gaussian kernel model.

IV. CONCLUSIONS

An incremental modeling technique has been presented to construct sparse generalized Gaussian regression models. The proposed technique tunes the mean vector and diagonal covariance matrix of each individual Gaussian regressor to best fit the training data incrementally, based on the correlation between the regressor and the training data. A simple boosting search algorithm has been adopted for regressor tuning at each modeling stage. Experimental results using this construction technique have demonstrated that it offers a viable alternative to the existing state-of-the-art kernel modeling methods for constructing parsimonious regression models.


Fig. 2. Incremental modeling results for the simple function fitting problem: (a) outputs of the final 6-term model superimposed on the noisy training data $y(x)$, and (b) the final modeling errors $e(x)$.

Fig. 3. The engine data set: (a) model output (dashed) superimposed on the system output (solid), and (b) model prediction error.

REFERENCES

[1] S. Chen, S.A. Billings and W. Luo, "Orthogonal least squares methods and their application to non-linear system identification," Int. J. Control, Vol.50, No.5, pp.1873–1896, 1989.
[2] S. Chen, C.F.N. Cowan and P.M. Grant, "Orthogonal least squares learning algorithm for radial basis function networks," IEEE Trans. Neural Networks, Vol.2, No.2, pp.302–309, 1991.
[3] V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
[4] V. Vapnik, S. Golowich and A. Smola, "Support vector method for function approximation, regression estimation, and signal processing," in: M.C. Mozer, M.I. Jordan and T. Petsche, Eds., Advances in Neural Information Processing Systems 9. Cambridge, MA: MIT Press, 1997, pp.281–287.
[5] B. Schölkopf, K.K. Sung, C.J.C. Burges, F. Girosi, P. Niyogi, T. Poggio and V. Vapnik, "Comparing support vector machines with Gaussian kernels to radial basis function classifiers," IEEE Trans. Signal Processing, Vol.45, No.11, pp.2758–2765, 1997.
[6] S. Chen, Y. Wu and B.L. Luk, "Combined genetic algorithm optimisation and regularised orthogonal least squares learning for radial basis function networks," IEEE Trans. Neural Networks, Vol.10, No.5, pp.1239–1243, 1999.
[7] M.E. Tipping, "Sparse Bayesian learning and the relevance vector machine," J. Machine Learning Research, Vol.1, pp.211–244, 2001.
[8] B. Schölkopf and A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA: MIT Press, 2002.
[9] S. Chen, X. Hong and C.J. Harris, "Sparse kernel regression modelling using combined locally regularized orthogonal least squares and D-optimality experimental design," IEEE Trans. Automatic Control, Vol.48, No.6, pp.1029–1036, 2003.
[10] S. Chen, X. Hong, C.J. Harris and P.M. Sharkey, "Sparse modelling using orthogonal forward regression with PRESS statistic and regularization," IEEE Trans. Systems, Man and Cybernetics, Part B, Vol.34, No.2, pp.898–911, 2004.
[11] S.E. Fahlman and C. Lebiere, "The cascade-correlation learning architecture," in: D.S. Touretzky, Ed., Neural Information Processing Systems 2, Morgan-Kaufmann, 1990, pp.524–532.



[12] J. Moody and C.J. Darken, "Fast learning in networks of locally-tuned processing units," Neural Computation, Vol.1, pp.281–294, 1989.
[13] S. Chen, S.A. Billings and P.M. Grant, "Recursive hybrid algorithm for non-linear system identification using radial basis function networks," Int. J. Control, Vol.55, No.5, pp.1051–1070, 1992.
[14] S. Chen, "Nonlinear time series modelling and prediction using Gaussian RBF networks with enhanced clustering and RLS learning," Electronics Letters, Vol.31, No.2, pp.117–118, 1995.
[15] M. Stone, "Cross validation choice and assessment of statistical predictions," J. Royal Statistical Society, Series B, Vol.36, pp.117–147, 1974.
[16] R.H. Myers, Classical and Modern Regression with Applications. 2nd Edition, Boston: PWS-KENT, 1990.
[17] H. Akaike, "A new look at the statistical model identification," IEEE Trans. Automatic Control, Vol.AC-19, pp.716–723, 1974.
[18] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA: Addison Wesley, 1989.
[19] K.F. Man, K.S. Tang and S. Kwong, Genetic Algorithms: Concepts and Design. London: Springer-Verlag, 1998.
[20] L. Ingber, "Simulated annealing: practice versus theory," Mathematical and Computer Modelling, Vol.18, No.11, pp.29–57, 1993.
[21] S. Chen and B.L. Luk, "Adaptive simulated annealing for optimization in signal processing applications," Signal Processing, Vol.79, No.1, pp.117–128, 1999.
[22] Y. Freund and R.E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," J. Computer and System Sciences, Vol.55, No.1, pp.119–139, 1997.
[23] R.E. Schapire, "The strength of weak learnability," Machine Learning, Vol.5, No.2, pp.197–227, 1990.
[24] R. Meir and G. Rätsch, "An introduction to boosting and leveraging," in: S. Mendelson and A. Smola, Eds., Advanced Lectures on Machine Learning. Springer-Verlag, 2003, pp.119–184.
[25] S.A. Billings, S. Chen and R.J. Backhouse, "The identification of linear and non-linear models of a turbocharged automotive diesel engine," Mechanical Systems and Signal Processing, Vol.3, No.2, pp.123–142, 1989.