Comput Manag Sci (2009) 6:5–24 DOI 10.1007/s10287-008-0074-3 ORIGINAL PAPER

Nonparametric nonlinear regression using polynomial and neural approximators: a numerical comparison

A. Alessandri · L. Cassettari · R. Mosca

Published online: 21 March 2008 © Springer-Verlag 2008

Abstract The solution of nonparametric regression problems is addressed via polynomial approximators and one-hidden-layer feedforward neural approximators. Such families of approximating functions are compared with respect to both complexity and experimental performance in finding a nonparametric mapping that interpolates a finite set of samples according to the empirical risk minimization approach. The theoretical background that is necessary to interpret the numerical results is presented. Two simulation case studies are analyzed to fully understand the practical issues that may arise in solving such problems. These issues depend on both the approximation capabilities of the approximating functions and the effectiveness of the methodologies that are available to select the tuning parameters, i.e., the coefficients of the polynomials and the weights of the neural networks. The simulation results show that the neural approximators perform better than the polynomial ones with the same number of parameters. However, this superiority can be jeopardized by the presence of local minima, which affects the neural networks but not the polynomial approach.

Keywords Nonparametric regression · Polynomial approximation · Neural approximation · Least squares

A. Alessandri (B) · L. Cassettari · R. Mosca Department of Production Engineering, Thermoenergetics and Mathematical Models, DIPTEM-University of Genoa, P.le Kennedy Pad. D, 16129 Genoa, Italy e-mail: [email protected] L. Cassettari e-mail: [email protected] R. Mosca e-mail: [email protected]


1 Introduction

An important task in a number of applications is that of finding a function that is known only through samples. Such a problem is studied in different fields, including regression and function approximation. In this paper the focus is on the estimation of nonparametric models using a finite data set, with the goal of comparing classical regression approaches based on polynomial fitting with other approximators, such as feedforward neural networks, widely used in the area of machine learning.

Models with a nonparametric structure have the desirable property of being flexible, which is particularly useful when there is no a-priori knowledge, as, for example, in many industrial applications. Although, at least in principle, one can rely on a generic nonparametric high-dimensional surface, such an approach is often impractical and motivates the use of models with a meaningful and mathematically tractable structure. Among the various choices, polynomials have attracted special attention for their simplicity. Moreover, any continuous function can be approximated arbitrarily well on a compact set by polynomials, by Weierstrass' approximation theorem (Kolmogorov and Fomin 1975). Such a capability is often referred to as the "universal approximation property." Indeed, other classes of approximators may enjoy this property, for example, a large family of nonlinear approximators known as neural networks (see, e.g., Park and Sandberg 1991; Leshno et al. 1993; Girosi 1994; Kůrková 1995). In addition, one-hidden-layer neural networks exhibit another powerful feature, which consists in requiring a small number of parameters (i.e., neural weights) to ensure a fixed approximation accuracy, especially in high-dimensional settings. More specifically, one-hidden-layer sigmoidal neural networks and radial-basis-function (RBF) networks with tunable external and internal parameters may guarantee a uniform approximation precision with a number of parameters that grows at most polynomially with the dimension of the input of the function to be approximated (see Barron 1993; Zoppoli et al. 2002; Kůrková and Sanguineti 2002, 2005 and the references therein). It is worth noting that such bounds are not so favorable for linear-in-the-parameters approximators such as algebraic polynomials and trigonometric series.

Another crucial question for the selection of a class of approximating functions is that of the algorithms available to find the parameters of the model. As to polynomial approximators, the determination of the parameters via least-squares approaches is straightforward and computationally inexpensive, since the model depends linearly on them. Moreover, numerical issues can be reduced through the use of factorization techniques (Björck 1996). The difficulties that may arise with neural approximators are due to the nonlinear dependence on the parameters. As a consequence, the algorithms for tuning the weights may suffer from local minima. Concerning the computational effort, a number of well-established methodologies for optimizing the neural weights with a finite set of samples are reported in the literature (e.g., the Levenberg-Marquardt or Newton algorithms, see Sjoberg et al. 1995).
The motivations of the present paper are connected with recent works (Malik and Rashid 2000; Kim and Park 2001; Crino and Brown 2007), where the performances of the polynomial approximators that result from the application of response surface methods (RSMs) are compared with those of neural networks. In this respect, the contribution of our research consists in pointing out the theoretical background that


may justify the performances obtained by the different classes of approximators and in comparing such performances in two case studies. To this end, we shall address the problem of estimating a smooth function from stationary ergodic samples. As to the consistency properties of regression estimators, the interested reader may refer to Lugosi and Zeger (1995), Nobel and Adams (2001), Shuhe (2004), Krzyżak and Schäfer (2005), and Pollard and Radchenko (2006). We shall deal with finite sets of input–output pairs affected by additive noise. Clearly, such noise makes the problem of estimating the approximating function more difficult. The success of the approximation can be evaluated in terms of the fitting error, which measures the distance between the unknown function that generates the data and the approximating function. Concerning this issue, we shall rely on the work of Bartlett and Kulkarni (1998), which provides a probabilistic evaluation of both the effect of the noise level in the experimental data and the role played by the number of samples when one requires a desired, hopefully small, gap of fitting error with respect to the ideal best estimator. The results of Bartlett and Kulkarni (1998) are useful to understand the different performances obtained in the simulations by standard polynomial estimators as compared with feedforward neural networks. To this end, the concept of covering number is fundamental, as it allows one to measure the "richness" of a family of approximating functions in a given functional space (see, for an introduction, Zhou 2002; Pontil 2003; Kůrková and Sanguineti 2007).

To conclude, we introduce some notation that we shall use throughout the paper. For $d \in \mathbb{N}$, the symbol $\mathrm{col}(v_1, v_2, \ldots, v_d)$ stands for a generic column vector $v \in \mathbb{R}^d$ and $\|v\| = \sqrt{\sum_{i=1}^{d} v_i^2}$. Moreover, given $f : \mathbb{R}^d \to \mathbb{R}$ and a sequence $\{x_i\}$, $x_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, n$, we define $\|f(x_1^n)\|_n = \sqrt{\sum_{i=1}^{n} f(x_i)^2 / n}$. Let $C(X)$ denote the Banach space of the continuous functions $f : X \to \mathbb{R}$ on the compact set $X \subset \mathbb{R}^d$ with the supremum norm. Moreover, $B_r(C(X))$ is defined as the closed ball of functions belonging to $C(X)$, centered at zero with radius $r > 0$, where the distance is given by the supremum norm. For every given sequence $\{x_i\}$, $x_i \in X \subset \mathbb{R}^d$, $i = 1, \ldots, n$, the closed ball with center $g \in C(X)$ and radius $r$, measured by $\|\cdot\|_n$ and associated with $\{x_i\}$, is defined as $B_r(g, \|\cdot\|_n, x_1^n) = \{ f \in C(X) : \| f(x_1^n) - g(x_1^n) \|_n \le r \}$. $\mathcal{N}_{x_1^n}(G, \varepsilon)$ is the $\varepsilon$-covering number of $G$ restricted to $x_1^n$, i.e., the minimal number $l \in \mathbb{N} \cup \{\infty\}$ of balls $B_\varepsilon(g_j, \|\cdot\|_n, x_1^n)$, $g_j \in G$, $j = 1, 2, \ldots, l$, that cover $G$, i.e., $G \subseteq \bigcup_{j=1}^{l} B_\varepsilon(g_j, \|\cdot\|_n, x_1^n)$. Let us define $\mathcal{N}(G, \varepsilon)$ as the maximum of $\mathcal{N}_{x_1^n}(G, \varepsilon)$ over $x_1^n \in X^n$.

The paper is organized as follows. Section 2 deals with the description of polynomial and neural nonparametric models. In Sect. 3, the basic theoretical background about the empirical estimation of functions is presented. Section 4 reports the description of the numerical results, including an application to an industrial problem. The conclusions are drawn in Sect. 5.

2 Nonparametric modelling

Suppose that we sample two stochastic stationary ergodic processes $\{x_i\}$ and $\{y_i\}$, given by an input $x_i \in X \subset \mathbb{R}^d$, where $X$ is compact, and an output $y_i \in Y \subset \mathbb{R}$, respectively.


We assume that such data are generated according to the unknown function $f : X \to \mathbb{R}$, resulting in the $n$ observations
$$y_i = f(x_i) + e_i, \quad i = 1, 2, \ldots, n, \qquad (1)$$
where $e_i \in \mathbb{R}$ represents a pure error or measurement error. We assume that $\{e_i\}$ is a stochastic stationary ergodic process and that $f$ is continuous on $X$. The function $f$ can be approximated by a polynomial of degree $p$, which entails the appearance of a fitting error $\hat{e}_i$, i.e.,
$$f(x_i) = \gamma_p(x_i, w) + \hat{e}_i, \quad i = 1, 2, \ldots, n, \qquad (2)$$
where $\gamma_p(x, w)$ is a polynomial of degree $p$ in the components of the vector $x \in X$. The vector $w \in \mathbb{R}^{N^{\mathrm{pol}}(p)}$ collects all the coefficients of the polynomial function $\gamma_p$, where
$$N^{\mathrm{pol}}(p) = \sum_{l=0}^{p} \binom{d+l-1}{l}.$$
If $d$ is fixed, $N^{\mathrm{pol}}(p)$ is a monotone increasing function of $p$ and can be used to represent the complexity of $\gamma_p(x, w)$.
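For a quick check of this count, the following minimal sketch (the helper name n_pol is ours, not part of the paper) enumerates the coefficients of a degree-$p$ polynomial in $d$ variables:

```python
from math import comb

def n_pol(p: int, d: int) -> int:
    """Number of coefficients of a polynomial of degree p in d variables:
    N^pol(p) = sum_{l=0}^{p} C(d + l - 1, l)."""
    return sum(comb(d + l - 1, l) for l in range(p + 1))

# For d = 1 this gives p + 1; for d = 2 it gives (p + 1)(p + 2)/2.
assert n_pol(14, 1) == 15
assert n_pol(4, 2) == 15
```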

Another class of approximating functions is that of one-hidden-layer feedforward neural networks composed of two layers, with $k$ computational units in the hidden layer. The input–output mapping is given by
$$z_q(1) = g\!\left[\sum_{p=1}^{d} w_{pq}(1)\, z_p(0) + w_{0q}(1)\right], \quad q = 1, \ldots, k, \qquad (3a)$$
$$z_1(2) = \sum_{p=1}^{k} w_{p1}(2)\, z_p(1), \qquad (3b)$$
where the coefficients $w_{pq}(s)$ and the so-called biases $w_{0q}(s)$ are lumped together into the weight vectors $w_s$, $s = 1, 2$, and $g : \mathbb{R} \to \mathbb{R}$ is the activation function of the neural units in the hidden layer. A neural unit is often called a neuron, and $k$ is referred to as the number of neurons. Let us denote the weight vector by $w = \mathrm{col}(w_1, w_2) \in \mathbb{R}^{N^{\mathrm{nn}}(k)}$, where the total number of weights is $N^{\mathrm{nn}}(k) = k(d+2)$. The function (3) with the weight vector $w \in \mathbb{R}^{N^{\mathrm{nn}}(k)}$ is denoted by $\gamma_k(x, w)$, where $\mathrm{col}(z_1(0), z_2(0), \ldots, z_d(0)) = x \in X \subset \mathbb{R}^d$ is the input vector of the network and $z_1(2) = \gamma_k(x, w) \in \mathbb{R}$ is its output. Since the total number of parameters $N^{\mathrm{nn}}(k)$ is uniquely determined by the number of hidden neurons $k$ and grows linearly with $k$ if $d$ is fixed, the model complexity of the function $\gamma_k$ is represented by $k$, just as the degree $p$ is a measure of the complexity of the polynomial function $\gamma_p$. Thus, we have
$$f(x_i) = \gamma_k(x_i, w) + \hat{e}_i, \quad i = 1, 2, \ldots, n, \qquad (4)$$
where $\hat{e}_i \in \mathbb{R}$ is the fitting error.
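To make the mapping (3) concrete, here is a minimal sketch of a forward pass with a hyperbolic-tangent activation (the activation actually used in Sect. 4); the array layout and the function name gamma_k are our own illustration, not part of the paper:

```python
import numpy as np

def gamma_k(x: np.ndarray, W1: np.ndarray, b1: np.ndarray, w2: np.ndarray) -> float:
    """One-hidden-layer network (3a)-(3b) with k neurons and input dimension d.
    W1: (k, d) hidden-layer weights, b1: (k,) biases, w2: (k,) output weights,
    for a total of k*d + k + k = k*(d + 2) = N^nn(k) parameters."""
    z1 = np.tanh(W1 @ x + b1)   # hidden layer, eq. (3a) with g = tanh
    return float(w2 @ z1)       # linear output layer, eq. (3b)

# Example with d = 1 and k = 5, hence N^nn(5) = 15 parameters
rng = np.random.default_rng(0)
W1, b1, w2 = rng.normal(size=(5, 1)), rng.normal(size=5), rng.normal(size=5)
y_hat = gamma_k(np.array([0.3]), W1, b1, w2)
```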


For the sake of brevity, from now on we denote by $\Gamma_\nu(X)$ a generic class of approximating functions $\gamma_\nu : X \times \mathbb{R}^{N^\Gamma(\nu)} \to \mathbb{R}$, where $\nu$ represents the model complexity of the approximators in $\Gamma_\nu(X)$ and $N^\Gamma(\nu)$ is the number of parameters of $\gamma_\nu$. We suppose that $N^\Gamma(\nu)$ is a monotone increasing function of $\nu$ and that $\Gamma_\nu(X)$ is dense in $C(X)$. Finally, let the total error (denoted by $\tilde{e}_i$) be the sum of the pure and fitting errors, i.e., $\tilde{e}_i = e_i + \hat{e}_i$, $i = 1, 2, \ldots, n$.

3 Empirical estimation of functions

The difficulty of solving the problem of estimating a nonparametric model using a finite set of data depends on the available statistical information. If the measure of $(X, Y)$ on $\mathbb{R}^d \times \mathbb{R}$ were known, we would have to minimize
$$J(g) = \mathrm{E}\left\{ [y - g(x)]^2 \right\}.$$
As is well known, the solution of such a problem is $g^\circ(x) = \mathrm{E}(y \mid x)$. As a matter of fact, in general we have no knowledge of the distribution of $(X, Y)$, and the only information at our disposal is a collection of independent, identically distributed (i.i.d.) samples $(x_i, y_i)$, $i = 1, 2, \ldots, n$. Thus, we aim to construct a sequence of measurable functions $\{g_n\}$, $g_n : X \to \mathbb{R}$, that enjoys the property of consistency, i.e.,
$$\lim_{n \to +\infty} J(g_n) = J(g^\circ), \qquad (5)$$
where the function $g_n$ is estimated by using the available information. Clearly, to formally deal with (5), we need to define a suitable space in which to study the convergence in a given norm (see, for an introduction, Lugosi and Zeger 1995). The function $g_n$ can be obtained by solving the following problem, where from now on we define $x_1^n = \mathrm{col}(x_1, x_2, \ldots, x_n) \in \mathbb{R}^{dn}$ and $y_1^n = \mathrm{col}(y_1, y_2, \ldots, y_n) \in \mathbb{R}^n$.

Problem 1 Given a data set $(x_1^n, y_1^n)$ generated by $f : X \to \mathbb{R}$ (i.e., $y_i = f(x_i) + e_i$, $i = 1, 2, \ldots, n$), find $g_n^\circ \in C(X)$ such that
$$J_n(g_n^\circ, x_1^n, y_1^n) = \min_{g \in C(X)} J_n(g, x_1^n, y_1^n),$$
where
$$J_n(g, x_1^n, y_1^n) = \frac{1}{n} \sum_{i=1}^{n} [y_i - g(x_i)]^2. \qquad (6)$$

A solution to Problem 1 exists by the generalized Weierstrass theorem (Kolmogorov and Fomin 1975), as $J_n(g, x_1^n, y_1^n)$ is a continuous function of $g$. If we search for a solution on a subset $\Gamma_\nu(X) \subset C(X)$, we obtain the following problem.

Problem 2 Given a data set $(x_1^n, y_1^n)$ generated by $f : X \to \mathbb{R}$ (i.e., $y_i = f(x_i) + e_i$, $i = 1, 2, \ldots, n$), find $\gamma_{n\nu} \in \Gamma_\nu(X)$ that minimizes $J_n(\gamma_{n\nu}, x_1^n, y_1^n)$.


First, let us consider the consistency of the sequence $\{\gamma_{n\nu}\}$, $n = 1, 2, \ldots$. The difference between the empirical cost and the optimal cost with $\gamma_{n\nu}$ can be decomposed as follows:
$$J_n(\gamma_{n\nu}, x_1^n, y_1^n) - J(g^\circ) = \left[ \min_{\gamma_\nu \in \Gamma_\nu(X)} J(\gamma_\nu) - J(g^\circ) \right] + \left[ J_n(\gamma_{n\nu}, x_1^n, y_1^n) - \min_{\gamma_\nu \in \Gamma_\nu(X)} J(\gamma_\nu) \right], \qquad (7)$$
where the first term on the right-hand side is the so-called approximation error, while the second one is the estimation error. The approximation error can be made arbitrarily small by using a family $\Gamma_\nu(X)$ of functions sufficiently "rich" to approximate $g^\circ$. The possibility of achieving any desired accuracy of the approximation error can be ensured by the density properties of $\Gamma_\nu(X)$ in $C(X)$ and the continuity of the cost function. The estimation error is a measure of the capacity of finding a "good" $\gamma_\nu$ in the space $\Gamma_\nu(X)$ by minimizing the empirical cost. As to the convergence of the estimation error, the following result holds (see Pollard 1984; Györfi et al. 2002).

Theorem 1 Suppose that the measurement sequence $\{y_i\}$ is a scalar stochastic process such that $\mathrm{E}(y_i^2) < \infty$, $|y_i| \le \bar{y}$, $i = 1, 2, \ldots, n$, where $y_i = f(x_i) + e_i$, $i = 1, 2, \ldots, n$, and $f \in B_{\bar{f}}(C(X))$, $\bar{f} > 0$; let $\gamma_{n\nu} \in \Gamma_\nu^{\bar{f}}(X) = \{ \gamma_\nu \in \Gamma_\nu : |\gamma_\nu(x)| \le \bar{f}, \ \forall x \in X \}$. Then, for every $\varepsilon > 0$, the probability that
$$J_n(\gamma_{n\nu}, x_1^n, y_1^n) - \min_{\gamma_\nu \in \Gamma_\nu^{\bar{f}}(X)} J(\gamma_\nu) > \varepsilon \qquad (8)$$
is no more than
$$8\, \mathrm{E}\!\left[ \mathcal{N}_{x_1^n}\!\left( \Gamma_\nu^{\bar{f}}(X), \frac{\varepsilon}{16} \right) \right] \exp\!\left( - \frac{\varepsilon^2 n}{512\,(\bar{y} + \bar{f})^4} \right), \qquad (9)$$
where the expectation is taken with respect to the distributions of $x_i$, $i = 1, 2, \ldots, n$, in $X$.

Unfortunately, it is difficult to solve Problem 2 for each $n$, $n = 1, 2, \ldots$. If $\nu$ is fixed, the solution of Problem 2 reduces to the search for the optimal parameters that minimize the cost function. To this end, with a slight abuse of notation, we denote the empirical error cost function as follows:
$$J_n(w, \gamma_\nu, x_1^n, y_1^n) = \frac{1}{n} \sum_{i=1}^{n} \left[ y_i - \gamma_\nu(x_i, w) \right]^2, \qquad (10)$$
where $w \in W \subset \mathbb{R}^{N^\Gamma(\nu)}$ ($W$ can be chosen as large as needed by using a-priori information on the regression model). Thus, from Problem 2 we obtain the following nonlinear optimization problem.


Problem 3 Given a data set $(x_1^n, y_1^n)$ generated by $f : X \to \mathbb{R}$ (i.e., $y_i = f(x_i) + e_i$, $i = 1, 2, \ldots, n$), find $w_n^\circ \in W$ such that
$$J_n(w_n^\circ, \gamma_\nu, x_1^n, y_1^n) = \min_{w \in W} J_n(w, \gamma_\nu, x_1^n, y_1^n),$$
where $W \subset \mathbb{R}^{N^\Gamma(\nu)}$ is compact.
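When $\gamma_\nu$ is nonlinear in $w$, as for the neural approximators, Problem 3 is typically tackled by iterative nonlinear least squares such as the Levenberg-Marquardt method mentioned in the Introduction. The following sketch shows one possible setup with SciPy; the helper name fit_gamma and the unconstrained treatment of the set $W$ are our own simplifications, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import least_squares

def fit_gamma(gamma, w0, x, y):
    """Approximately solve Problem 3 for a parametric family gamma(x_i, w):
    minimize the empirical cost (10) by Levenberg-Marquardt nonlinear least squares."""
    def residuals(w):
        return np.array([yi - gamma(xi, w) for xi, yi in zip(x, y)])
    sol = least_squares(residuals, np.asarray(w0, dtype=float), method="lm")
    return sol.x, float(np.mean(sol.fun ** 2))   # estimated weights and attained J_n
```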

Note that the continuity of $\gamma_\nu$ trivially implies the continuity of (10). Since (10) is a continuous function of $w \in W$ and $W$ is compact, a solution of Problem 3 exists by the Weierstrass theorem. Based on the above and, as in Bartlett and Kulkarni (1998), we now introduce some definitions that we shall use later on.

Definition 1 Given a data set $(x_1^n, y_1^n)$ generated by $f : X \to \mathbb{R}$ (i.e., $y_i = f(x_i) + e_i$, $i = 1, 2, \ldots, n$), an empirical estimator is a pair $(\gamma_\nu, w_n^\circ)$, where $\gamma_\nu \in \Gamma_\nu$ and $w_n^\circ$ results from the solution of Problem 3.

Of course, the empirical estimator suffers from the presence of the measurement error. In this respect, it is important to evaluate the performance of the estimator in terms of the fitting error. To this end, note that the approximation error can be evaluated using the cost function (10):
$$\left\| f(x_1^n) - \gamma_\nu(x_1^n, w) \right\|_n^2 = J_n(w, \gamma_\nu, x_1^n, f(x_1^n)), \qquad (11)$$
where, for the sake of brevity, $f(x_1^n) = \mathrm{col}[f(x_1), f(x_2), \ldots, f(x_n)]$ and $\gamma_\nu(x_1^n, w) = \mathrm{col}[\gamma_\nu(x_1, w), \gamma_\nu(x_2, w), \ldots, \gamma_\nu(x_n, w)]$. In the following, we shall consider estimators that provide a fitting error not larger than the best possible fitting error plus $\varepsilon > 0$, as in the following definition.

Definition 2 Given a data set $(x_1^n, y_1^n)$ generated by $f : X \to \mathbb{R}$ (i.e., $y_i = f(x_i) + e_i$, $i = 1, 2, \ldots, n$), an $\varepsilon$-approximate estimator is a pair $(\gamma_\nu, w_n^\varepsilon)$ such that
$$J_n(w_n^\varepsilon, \gamma_\nu, x_1^n, f(x_1^n)) \le \min_{w \in W} J_n(w, \gamma_\nu, x_1^n, f(x_1^n)) + \varepsilon,$$
where $\gamma_\nu \in \Gamma_\nu$ and $w_n^\varepsilon \in W \subset \mathbb{R}^{N^\Gamma(\nu)}$.

Definition 3 Given a data set $(x_1^n, y_1^n)$ generated by $f : X \to \mathbb{R}$ (i.e., $y_i = f(x_i) + e_i$, $i = 1, 2, \ldots, n$), an empirical estimator $(\gamma_\nu, w_n^\varepsilon)$ fails to $\varepsilon$-approximate $f$ if
$$J_n(w_n^\varepsilon, \gamma_\nu, x_1^n, f(x_1^n)) > \min_{w \in W} J_n(w, \gamma_\nu, x_1^n, f(x_1^n)) + \varepsilon,$$
where $\gamma_\nu \in \Gamma_\nu$ and $w_n^\varepsilon \in W \subset \mathbb{R}^{N^\Gamma(\nu)}$.


Clearly, a crucial question is the accuracy of the $\varepsilon$-approximate estimator in relation to the complexity of the class of approximating functions in which the estimator is sought. To this end, we shall analyze the $\varepsilon$-approximate estimator, originally proposed in Bartlett and Kulkarni (1998), and adapt it to our context. The results in Bartlett and Kulkarni (1998) offer a solid framework to interpret the outcomes of the numerical simulations in Sect. 4. In Bartlett and Kulkarni (1998), the following result is proved.

Theorem 2 Suppose that the noise sequence $\{e_i\}$ is a zero-mean stochastic process such that $|e_i| \le \bar{e}$, $i = 1, 2, \ldots, n$, with measurements $y_i = f(x_i) + e_i$, $i = 1, 2, \ldots, n$, and $f \in B_{\bar{f}}(C(X))$, $\bar{f} > 0$; let $\gamma_{n\nu} \in \Gamma_\nu^{\bar{f},W}(X) = \{ \gamma_\nu \in \Gamma_\nu : |\gamma_\nu(x, w)| \le \bar{f}, \ \forall x \in X, \ \forall w \in W \}$. Then, for every $\varepsilon > 0$, the probability of a noise sequence for which an empirical estimator fails to $\varepsilon$-approximate $f$ is no more than
$$\mathcal{N}\!\left( \Gamma_\nu^{\bar{f},W}(X), \frac{\varepsilon}{4\bar{e}} \right) \exp\!\left( - \frac{2 \varepsilon^2 n}{\bar{e}^2 \bar{f}^2} \right). \qquad (12)$$

Theorem 2 provides an upper bound on the probability that an empirical estimator fails to $\varepsilon$-approximate a function $f$. Such a bound depends on both the covering number of $\Gamma_\nu^{\bar{f},W}(X)$ and an exponential term that decreases with $n$. Note that a larger $\varepsilon$ (i.e., a wider tolerance in the failure of the empirical estimator) allows one to keep such a probability small with reduced values of $n$, i.e., of the number of samples. Clearly, the covering number is desired to grow slowly, in such a way that (12) tends to zero fast enough as $n$ goes to infinity. Unfortunately, in general it is difficult to determine the covering number of a given class of functions. However, upper bounds on the covering number for various families of approximators are known. As to kernels based on Legendre orthogonal polynomials, one can refer to Baohuai et al. (2008). Such bounds for one-hidden-layer feedforward neural networks and for feedforward neural networks with a generic number of hidden layers are given in Anthony and Bartlett (1999) at pp. 207-208 (Corollary 14.15 and, with different assumptions, Corollary 14.16) and at p. 210 (Theorem 14.17), respectively.

4 Numerical results

In this section, we illustrate the results obtained with two numerical examples by exploiting the theoretical framework of Sect. 3. The first testbed concerns the approximation of a known scalar function. In this case, we computed the fitting error, which is more important than the total error for the purpose of performance evaluation. The second is an industrial case study, where the original problem was that of evaluating the efficiency of a production shop floor simulated using a discrete-event model. The tests were accomplished to compare polynomials and one-hidden-layer feedforward neural networks with the hyperbolic tangent as the activation function of the hidden layer, while the output layer is simply linear (i.e., there is no activation function).


Table 1 Fitting errors in all the simulation runs using noise-free data

Number of samples   Neural networks        Polynomials
                    Mean      SD           Mean      SD
100                 0.00097   0.00173      0.00473   0.02330
200                 0.00099   0.00170      0.00325   0.00084
300                 0.00099   0.00169      0.00301   0.00072
400                 0.00094   0.00165      0.00285   0.00075
500                 0.00091   0.00163      0.00273   0.00078
600                 0.00096   0.00166      0.00264   0.00080
700                 0.00097   0.00167      0.00255   0.00083
800                 0.00097   0.00167      0.00242   0.00087
900                 0.00096   0.00166      0.00241   0.00086
1,000               0.00094   0.00165      0.00235   0.00088

The determination of the neural weights was made by means of the Levenberg–Marquardt training algorithm available in Matlab (see Demuth and Beale 2000). The parameters of the polynomial approximators were computed by finding the standard least-squares solution via the orthogonal-triangular decomposition. We used candlestick charts for the plots of the errors; they allow a synthetic graphical description of the best result, 25% quantile, median, 75% quantile, and worst result with possible outliers.
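As an illustration of the polynomial side of this procedure for the scalar case $d = 1$ (the helper name fit_polynomial and the use of NumPy are our own; the paper's computations were done in Matlab), the least-squares coefficients can be obtained through a QR factorization of the Vandermonde design matrix:

```python
import numpy as np

def fit_polynomial(x: np.ndarray, y: np.ndarray, p: int) -> np.ndarray:
    """Least-squares coefficients of a degree-p polynomial via the
    orthogonal-triangular (QR) decomposition of the design matrix A[i, l] = x_i**l."""
    A = np.vander(x, N=p + 1, increasing=True)
    Q, R = np.linalg.qr(A)              # A = Q R with Q orthonormal, R upper triangular
    return np.linalg.solve(R, Q.T @ y)  # w minimizing ||A w - y||^2
```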

4.1 Approximation of a scalar function

We considered the problem of approximating the scalar function $f(x) = x^2 \exp(-5x)$ on the domain $[0, 2]$ using measurements corrupted by additive, zero-mean Gaussian noise. To accomplish this approximation task, we considered polynomials and one-hidden-layer feedforward neural networks with the same number of parameters (i.e., the coefficients of the polynomials and the neural weights). In practice, we fixed the number of neural units $\nu$ and chose the degree $p$ of the polynomial approximator equal to $N^{\mathrm{nn}}(\nu) - 1$, i.e., $N^{\mathrm{pol}}(p) = N^{\mathrm{nn}}(\nu)$. The simulations were performed by varying the measurement noise dispersion and the number of samples, as shown in Tables 1, 2, 3, and 4, where one can find the means and standard deviations (SDs) of the fitting errors computed in the various trials for all the kinds of networks.

As can be noticed in the columns of the means, the polynomial approximators apparently perform better, particularly in the presence of a large noise level. Note also the larger dispersion in the results of the neural networks as compared with the polynomials. This may depend on the choice of the initial weights, which is necessary for the neural approximators, while the polynomial approximators do not require initialization (see Figs. 1, 2). As a matter of fact, the fitting errors given by the neural approximators are distributed around two constant values, as clearly shown, for example, in the noise-free case of Fig. 1. The corresponding behavior in the noisy case is more complex, as shown in Fig. 2, where one can also see that the best neural performances under the effect of the noise are obtained with fewer neurons.
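For reference, a minimal sketch of the data-generation step and of the complexity matching used in this test is given below; the sampling of the inputs (assumed uniform on $[0, 2]$) and the variable names are our own assumptions, not stated in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x**2 * np.exp(-5.0 * x)          # target function on [0, 2]

def make_data(n: int, sigma: float):
    """n noisy samples y = f(x) + e, with e zero-mean Gaussian of dispersion sigma."""
    x = rng.uniform(0.0, 2.0, size=n)
    return x, f(x) + sigma * rng.standard_normal(n)

# Complexity matching for d = 1: N^pol(p) = p + 1 and N^nn(k) = 3k, so a network with
# k neurons is paired with a polynomial of degree p = 3k - 1 (e.g., k = 5 neurons vs.
# a polynomial of degree 14, both with 15 parameters, as in Figs. 3-6).
k = 5
p = 3 * k - 1
x, y = make_data(100, sigma=0.01)
```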


Table 2 Fitting errors in all the simulation runs using data corrupted by Gaussian, zero-mean measurement noise with dispersion equal to 0.0001

Number of samples   Neural networks        Polynomials
                    Mean      SD           Mean      SD
100                 0.00428   0.12914      0.00473   0.02350
200                 0.00549   0.40110      0.00325   0.00084
300                 0.00116   0.00816      0.00301   0.00073
400                 0.00103   0.00765      0.00285   0.00075
500                 0.00093   0.00164      0.00273   0.00078
600                 0.00102   0.00545      0.00264   0.00080
700                 0.00099   0.00171      0.00255   0.00083
800                 0.00097   0.00164      0.00248   0.00084
900                 0.00098   0.00193      0.00241   0.00086
1,000               0.00095   0.00168      0.00235   0.00088

Table 3 Fitting errors in all the simulation runs using data corrupted by Gaussian, zero-mean measurement noise with dispersion equal to 0.001

Number of samples   Neural networks        Polynomials
                    Mean      SD           Mean      SD
100                 0.17209   4.2901       0.00448   0.01190
200                 0.02241   0.8329       0.00327   0.00081
300                 0.02333   0.8951       0.00303   0.00074
400                 0.03504   2.2803       0.00286   0.00074
500                 0.00312   0.1126       0.00275   0.00077
600                 0.01810   0.8778       0.00265   0.00079
700                 0.00182   0.0337       0.00256   0.00082
800                 0.09287   5.1301       0.00269   0.00074
900                 0.00125   0.0108       0.00241   0.00086
1,000               0.00283   0.1720       0.00235   0.00088

The bias around two constant values has to be ascribed to the presence of local minima. Trapping into local minima yields results like those depicted in Fig. 3, to be compared with the much better result of Fig. 5, which may be ascribed to a more favorable initial choice of the weights (see also Fig. 4 and, for comparison, Fig. 6 with noisy data). We can summarize the results as follows.

1. The neural approximators suffer from local minima, which makes their results more dispersed with respect to the polynomials.
2. Overall, the neural networks provide the best results in terms of fitting error but are very computationally demanding.
3. In particular, the neural approximators are well-suited to taking advantage of an increase in the number of samples and a reduction of the noise.

As to (1), the question is well-known. In this respect, such issues can be reduced, at the cost of additional computations, through multiple initial choices of the neural weights, as in the sketch below.
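A minimal sketch of such a multi-start strategy follows; the interface train_once, assumed to return a pair (weights, empirical cost) for a given initial guess, and the initialization scale are our own assumptions:

```python
import numpy as np

def multistart_fit(train_once, n_starts: int, n_params: int, seed: int = 0):
    """Mitigate local minima by repeating the training from several random
    initializations of the weights and keeping the run with the smallest empirical cost."""
    rng = np.random.default_rng(seed)
    best_w, best_cost = None, np.inf
    for _ in range(n_starts):
        w0 = rng.normal(scale=0.5, size=n_params)  # random initial weights
        w, cost = train_once(w0)
        if cost < best_cost:
            best_w, best_cost = w, cost
    return best_w, best_cost
```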


Table 4 Fitting errors in all the simulation runs using data corrupted by Gaussian, zero-mean measurement noise with dispersion equal to 0.01

Number of samples   Neural networks        Polynomials
                    Mean      SD           Mean      SD
100                 0.28302   9.59017      0.00475   0.02280
200                 0.07345   2.59010      0.00326   0.00086
300                 0.01371   0.54405      0.00302   0.00072
400                 0.01441   0.74600      0.00288   0.00074
500                 0.00375   0.13407      0.00276   0.00077
600                 0.03020   2.08005      0.00265   0.00079
700                 0.00792   0.67002      0.00256   0.00082
800                 0.00828   0.49603      0.00264   0.00065
900                 0.00837   0.40001      0.00263   0.00079
1,000               0.00361   0.08850      0.00253   0.00082

If, on the one hand, this represents a drawback, on the other hand the neural networks outperform the polynomials, as outlined in (2). Moreover, the trapping into local minima becomes less evident in the presence of a large measurement noise, as one can notice by comparing the boxplots in the left parts of Figs. 1 and 2. Comment (3) descends from the inspection of Tables 1, 2, 3, and 4, where one can notice that the performances of the neural networks become much better than those of the polynomials with weaker noise and more samples. Therefore, one can summarize that on average a polynomial fits better with fewer samples and stronger uncertainty, while additional or more reliable information is more fruitfully exploited by neural networks.

Another interpretation of the overall results can be drawn via (7) in terms of approximation and estimation errors. If one recalls the bounds on the approximation properties of one-hidden-layer feedforward neural networks (see Barron 1993; Kůrková and Sanguineti 2002, 2005), a smaller approximation error may be expected as compared with the polynomial approximators. On the contrary, the neural networks are likely to suffer from a larger estimation error due to the local minima. However, since overall the neural approximators perform better than the polynomial ones, we can conjecture that the approximation error of the former is much lower than that of the latter.

4.2 An industrial case study

We considered a production shop floor for iron components that was modelled using Simul8 (see Concannon et al. 2003). The model is a cascade of buffer-server pairs, each representing a step of the overall manufacturing process. The job routing is pictorially described in Fig. 7. The details of the distributions of the server capacities are reported in Table 5, together with the setup time when the setup is made on line. The distribution of the interarrivals of the crude mineral is exponential with a rate of 1 item/min. The transfer process is realized by using a belt conveyor.

4.2 An industrial case study We considered a production shop floor for iron components that was modelled using Simul8 (see Concannon et al. 2003). Such model is the cascade of pairs of buffer and server that represent a step of the overall manufacturing process. The job routing is pictorially described in Fig. 7. The details of the distributions of the server capacities are reported in Table 5, with the time of the setup when it is made on line. The distribution of the interarrivals of the crude mineral is exponential with a rate of 1 item/min. The transfer process is realized by using a belt conveyor. The goal is to


[Figure 1: candlestick plots of the fitting error versus the number of neurons (left) and versus the polynomial degree (right), for 100 samples (top) and 1,000 samples (bottom) over 1,000 runs.]
Fig. 1 Results obtained by feedforward neural networks and polynomials with the same number of parameters using noise-free data


[Figure 2: candlestick plots of the fitting error versus the number of neurons (left) and versus the polynomial degree (right), for 100 samples (top) and 1,000 samples (bottom) over 1,000 runs.]
Fig. 2 Results obtained by feedforward neural networks and polynomials with the same number of parameters using data corrupted by Gaussian, zero-mean measurement noise with dispersion equal to 0.01


[Figure 3: plot of f(x) on [0, 2], 100 noise-free data points, a polynomial of degree 14, and a neural network with 5 neurons.]
Fig. 3 Comparison between feedforward neural networks and polynomials with the same number of parameters in interpolating noise-free data

[Figure 4: plot of f(x) on [0, 2], 100 noisy data points, a polynomial of degree 14, and a neural network with 5 neurons.]
Fig. 4 Comparison between feedforward neural networks and polynomials with the same number of parameters in interpolating data corrupted by Gaussian, zero-mean measurement noise with dispersion equal to 0.001

The goal is to study the influence of the multi-step grinding machines and the dimensional control devices on the average daily production with a working shift of 8 h. The production efficiency is treated as a continuous variable that depends on the aggregate machine capacities, taking values in the ranges [2.90, 3.50] and [5.80, 6.20] for the multi-step grinding machines and the dimensional control devices, respectively. We collected the results of 162 simulation runs, where the output is the average daily production as a function of two continuous variables that represent the overall server capacities of the multi-step grinding machines and of the dimensional control devices, respectively.


[Figure 5: plot of f(x) on [0, 2], 100 noise-free data points, a polynomial of degree 14, and a neural network with 5 neurons.]
Fig. 5 Comparison between feedforward neural networks and polynomials with the same number of parameters in interpolating noise-free data

[Figure 6: plot of f(x) on [0, 2], 100 noisy data points, a polynomial of degree 14, and a neural network with 5 neurons.]
Fig. 6 Comparison between feedforward neural networks and polynomials with the same number of parameters in interpolating data corrupted by Gaussian, zero-mean measurement noise with dispersion equal to 0.001

For the purpose of comparison, first we fixed the number of neural units and then we chose the degrees of two polynomial approximators that have a total number of coefficients lower and higher than the corresponding number of neurons, respectively. We denote them as the lower-degree and higher-degree polynomial approximators, respectively. Table 6 shows the correspondence between the numbers of neural units and the lower/higher degrees of the polynomials. The total errors obtained with one-hidden-layer feedforward neural networks as a function of the number of neural units, varying from 1 to 40, are depicted in Fig. 8. For the neural networks, the total errors are computed over 1,000 runs with different choices of the initial weights.


Fig. 7 A model of the shop floor example

Table 5 Technical details of the machines (the statistical parameters for the normal distribution refer to mean and dispersion, respectively)

Machine type                  Number of machines   Type of distr.   Stat. parameters   Setup time
Lathe                         8                    Triangular       3, 4, 5            10–180 min
Milling machine               8                    Uniform          3, 5               Off-line setup
Multi-step grinding machine   3                    Normal           3.8, 1.5           Off-line setup
Balancing machine             6                    Uniform          1, 2               7 min/100 item
Nitriding machine             14                   Normal           4.5, 1             Off-line setup
Hardness control machine      5                    Triangular       1, 2, 3            Off-line setup
Grinding machine              7                    Normal           2.4, 0.5           Off-line setup
Cleaner                       5                    Fixed            2.5                Off-line setup
Dimensional control device    6                    Normal           6.8, 1.5           Off-line setup
Packaging                     4                    Uniform          1, 2               Off-line setup

Table 6 Number of neural units, lower degrees, and higher degrees

Number of neural units   Lower degree   Higher degree
1                        1              2
2–3                      2              3
4–6                      3              4
7–12                     4              5
13–23                    5              6
24–40                    6              7

Note that the performances improve with the increase of the number of neural units. Moreover, the training algorithm becomes more stable, as one can notice by looking at the amplitude of the boxes and at the outliers beyond the whiskers. Figure 9 compares the total errors of the lower-degree and higher-degree polynomial approximators with the mean total errors of the feedforward neural networks computed over 1,000 runs with different initial weights. As can be noticed, the neural approximators outperform the polynomial approximators.


[Figure 8: candlestick plot of the total error versus the number of neural units (1 to 40), 162 samples, 1,000 runs.]
Fig. 8 Results of feedforward neural networks with different number of neural units

[Figure 9: total error (logarithmic scale) versus the number of neural units for the lower-degree polynomial, the higher-degree polynomial, and the mean value obtained with neural networks, 162 samples.]
Fig. 9 Mean total errors of feedforward neural networks with different number of neural units over 1,000 runs with different initial weights as compared with the total errors of lower-degree and higher-degree polynomial approximators


[Figure 10: two views of the data set and of the approximating surfaces given by the neural approximator with 7 neural units and by the polynomial approximators of degrees 4 and 5.]
Fig. 10 Results of a feedforward neural network with 7 neural units and two polynomials of degrees 4 and 5. Total errors are 0.2527, 1.9154, and 2.2515, respectively

The best performance was obtained by a 7-neural-unit network with a mean total error equal to 0.17745, to be compared with a total error of 1.9154 given by a polynomial of degree 4 in the most favorable case. In Fig. 9, note also the rapid performance deterioration of the polynomial approximators when their orders increase, which may be due to overfitting. Finally, Fig. 10 shows the shapes of the resulting approximating curves obtained in the best simulation run.


5 Conclusions

In this paper, we have described a theoretical framework to compare the performances of polynomials and one-hidden-layer feedforward neural networks in regression problems. To this end, we have exploited current results on the empirical risk minimization approach and on function approximation to interpret the numerical results of two case studies. When estimating a function from a given data set, one has to account for both the approximation error and the estimation error. The approximation error is small for a "richer" class of approximating functions. As to this question, upper bounds on the approximation capabilities in relation to the dimensionality of the problem are known, for which one-hidden-layer feedforward neural networks appear superior to polynomials (Barron 1993; Kůrková and Sanguineti 2002). However, in order to ensure a small total error, one also needs to keep the estimation error as small as possible. Unfortunately, the presence of local minima, due to the nonlinear dependence of the neural networks on their weights, may undermine these potential advantages. Such nuisances can be reduced through additional computations to cope with local minima via different initializations of the neural parameters.

References

Anthony M, Bartlett P (1999) Neural network learning: theoretical foundations. Cambridge University Press, Cambridge
Baohuai S, Jianli W, Ping L (2008) The covering number for some Mercer kernel Hilbert spaces. J Complex (to appear)
Barron A (1993) Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans Inf Theory 39(3):930–945
Bartlett P, Kulkarni S (1998) The complexity of model classes, and smoothing noisy data. Syst Control Lett 34(3):133–140
Björck A (1996) Numerical methods for least squares problems. SIAM, Philadelphia
Concannon K, Elder M, Hunter K, Tremble J, Tse S (2003) Simulation modeling with SIMUL8. Visual Thinking International Ltd, Mississauga
Crino S, Brown D (2007) Global optimization with multivariate adaptive regression splines. IEEE Trans Syst Man Cybern B 37(2):333–340
Demuth H, Beale M (2000) Neural network toolbox—user's guide. The Math Works Inc., Natick
Girosi F (1994) Regularization theory, radial basis functions and networks. In: From statistics to neural networks. Theory and pattern recognition applications, Subseries F, Computer and Systems Sciences. Springer, Heidelberg, pp 166–187
Györfi L, Kohler M, Krzyżak A, Walk H (2002) A distribution-free theory of nonparametric regression. Springer, New York
Kim B, Park G (2001) Modeling plasma equipment using neural networks. IEEE Trans Plasma Sci 29(1):8–12
Kolmogorov A, Fomin S (1975) Introductory real analysis. Dover Publications, New York
Krzyżak A, Schäfer D (2005) Nonparametric regression estimation by normalized radial basis function networks. IEEE Trans Inf Theory 51(3):1003–1010
Kůrková V (1995) Approximation of functions by perceptron networks with bounded number of hidden units. Neural Netw 8(5):745–750
Kůrková V, Sanguineti M (2002) Comparison of worst-case errors in linear and neural network approximation. IEEE Trans Inf Theory 28(1):264–275


Kůrková V, Sanguineti M (2005) Error estimates for approximate optimization by the extended Ritz method. SIAM J Optim 15(2):461–487
Kůrková V, Sanguineti M (2007) Estimates of covering numbers of convex sets with slowly decaying orthogonal subsets. Discrete Appl Math 155(15):1930–1942
Leshno M, Ya V, Pinkus A, Schocken S (1993) Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Netw 6(6):861–867
Lugosi G, Zeger K (1995) Nonparametric estimation via empirical risk minimization. IEEE Trans Inf Theory 41(3):677–687
Malik Z, Rashid K (2000) Comparison of optimization by response surface methodology with neurofuzzy methods. IEEE Trans Magn 36(1):241–257
Nobel A, Adams T (2001) Estimating a function from ergodic samples with additive noises. IEEE Trans Inf Theory 47(7):2985–2902
Park J, Sandberg IW (1991) Universal approximation using radial-basis-function networks. Neural Comput 3(2):246–257
Pollard D (1984) Convergence of stochastic processes. Springer, New York
Pollard D, Radchenko P (2006) Nonlinear least-squares estimation. J Multivar Anal 97(2):548–562
Pontil M (2003) A note on different covering numbers in learning theory. J Complex 19(5):665–671
Shuhe H (2004) Consistency for the least squares estimator in nonlinear regression model. Stat Probab Lett 67(2):183–192
Sjoberg J, Zhang Q, Ljung L, Benveniste A, Deylon B, Glorennec P, Hjalmarsson H, Juditsky A (1995) Nonlinear black-box models in system identification: a unified overview. Automatica 31(12):1691–1724
Zhou D-X (2002) The covering number in learning theory. J Complex 18(3):739–767
Zoppoli R, Sanguineti M, Parisini T (2002) Approximating networks and extended Ritz method for the solution of functional optimization problems. J Optim Theory Appl 112(2):403–439
