Neural Networks, IEEE Transactions on

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 11, NO. 2, MARCH 2000


On the Optimality of Neural-Network Approximation Using Incremental Algorithms

Ron Meir, Member, IEEE, and Vitaly E. Maiorov

Abstract: The problem of approximating functions by neural networks using incremental algorithms is studied. For functions belonging to a rather general class, characterized by certain smoothness properties with respect to the $L_2$ norm, we compute upper bounds on the approximation error, where error is measured by the $L_q$ norm, $1 \le q \le \infty$. These results extend previous work, applicable in the case $q = 2$, and provide an explicit algorithm to achieve the derived approximation error rate. In the range $q \le 2$, near-optimal rates of convergence are demonstrated. A gap remains, however, with respect to a recently established lower bound in the case $q > 2$, although the rates achieved are provably better than those obtained by optimal linear approximation. Extensions of the results from the $L_2$ norm to $L_p$ are also discussed. A further interesting conclusion from our results is that no loss of generality is suffered using networks with positive hidden-to-output weights. Moreover, explicit bounds on the size of the hidden-to-output weights are established, which are sufficient to guarantee the established convergence rates.

Index Terms: Approximation bounds, incremental algorithms, neural networks.

I. INTRODUCTION

Recent years have witnessed a surge of interest in approaches to adaptive nonlinear approximation and estimation of functions based on data. Augmenting and extending the classic parametric and nonparametric methods for estimation, new approaches have been developed which blend the desirable attributes of both methods, achieving efficient adaptive estimation with respect to very large functional classes. Two well-known examples are free knot splines and nonlinear wavelet estimators. The former approach is well known to achieve optimality in a well-defined sense for functions in arbitrary dimension, while the latter methods have been shown to be nearly optimal for one-dimensional problems, while retaining impressive computational advantages with respect to spline functions; see [13] for a comprehensive recent review of these results.

Manuscript received November 5, 1998; revised August 4, 1999 and September 14, 1999. This work was supported in part by a grant from the Israel Science Foundation, the Ollendorff Center of the Department of Electrical Engineering at the Technion, and in part by the Center for Absorption in Science, Ministry of Immigrant Absorption, Israel. R. Meir is with the Department of Electrical Engineering, Technion, Haifa 32000, Israel. V. E. Maiorov is with the Department of Mathematics, Technion, Haifa 32000, Israel. Publisher Item Identifier S 1045-9227(00)02999-4.

A major problem in many approaches to estimation is the so-called curse of dimensionality, whereby performance degrades rapidly as the dimensionality of the problem increases. Several procedures have been suggested in order to circumvent this problem, of which we mention three. Perhaps the simplest approach is to model the nonlinearity of a $d$-dimensional function as a sum of univariate functions. This is the basis of the so-called class of additive models, taking the form

$$f(x) = \sum_{i=1}^{d} g_i(x_i),$$

where $x = (x_1, \ldots, x_d)$ and the $g_i$ are unknown univariate functions. An obvious extension of this approach, the basis of the so-called projection pursuit model, is to consider linear combinations of nonlinear univariate functions of projections of the input variable $x$. Then we have

$$f(x) = \sum_{i=1}^{n} g_i(a_i^{T} x),$$

where the $a_i$ are unknown parameters, and the $g_i$ are unknown univariate functions, termed ridge functions due to their geometric interpretation. In the field of function approximation this type of representation is known as approximation by ridge functions. The class of neural networks, with which we are concerned in this paper, can be viewed as a degenerate form of projection pursuit, where the functions $g_i$ are all taken to be equal. Thus we have

$$f_n(x) = \sum_{i=1}^{n} c_i \, \sigma(a_i^{T} x + b_i). \tag{1}$$

In the context of learning or system identification, there are four basic issues that need to be addressed when attempting to analyze models such as the above.
• Criterion Selection: Construct an error measure pertinent to the problem at hand.
• Algorithm Design: Given an error criterion, a model structure, and a data set, construct an algorithm for efficiently estimating the model. Often such algorithms are based on a certain quality criterion (a.k.a. loss function) which must also be specified.
• Approximation: How well can the constrained representations above describe general $d$-dimensional signals (functions)? Can the inherent approximation error be estimated? Ideally, one would like to have the flexibility of approximating well as large a class of functions as possible.
• Estimation: Given the finiteness of the data set, assess the performance of the algorithm with respect to the best possible performance.
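The relationship between the three model families above can be made concrete with a minimal Python sketch. All function names, the choice of `tanh` as activation, and the numerical values are illustrative assumptions, not taken from the paper. The sketch evaluates a single hidden layer network and checks that it coincides with a projection pursuit model whose ridge functions are scaled and shifted copies of one activation.

```python
import math
from typing import Callable, Sequence

Univariate = Callable[[float], float]


def dot(a: Sequence[float], x: Sequence[float]) -> float:
    """Euclidean inner product a^T x."""
    return sum(ai * xi for ai, xi in zip(a, x))


def additive_model(gs: Sequence[Univariate], x: Sequence[float]) -> float:
    """Additive model: one univariate function per input coordinate."""
    return sum(g(xi) for g, xi in zip(gs, x))


def projection_pursuit(gs, directions, x) -> float:
    """Projection pursuit: univariate functions of the projections a_i^T x."""
    return sum(g(dot(a, x)) for g, a in zip(gs, directions))


def neural_network(sigma, weights, directions, biases, x) -> float:
    """Single hidden layer network: one shared activation for all units."""
    return sum(c * sigma(dot(a, x) + b)
               for c, a, b in zip(weights, directions, biases))


# Hypothetical input point and network parameters (illustration only).
x = [0.3, -1.2]
directions = [[1.0, 0.5], [-0.7, 2.0], [0.0, 1.0]]
weights = [0.5, 0.2, 0.3]
biases = [0.1, -0.4, 0.0]

network_value = neural_network(math.tanh, weights, directions, biases, x)

# The same value written as a projection pursuit model whose ridge functions
# are scaled and shifted copies of the single activation tanh.
ridge_functions = [
    (lambda t, c=c, b=b: c * math.tanh(t + b)) for c, b in zip(weights, biases)
]
pursuit_value = projection_pursuit(ridge_functions, directions, x)
```

In the same spirit, an additive model is the special case of projection pursuit in which the directions are the coordinate axes.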


Any complete approach to statistical learning must address each of the above issues in order to obtain a full characterization of the efficacy of the proposed methodology. Observe that these problems belong to three different mathematical fields, namely, nonlinear optimization, function approximation, and mathematical statistics. It is only in recent years that impressive progress has been made in addressing all these issues as a coherent whole (see, for example, [19] for wavelets, [21] for local polynomial models, and [43] for splines).

In this work we focus on the class of single hidden layer feedforward neural-network models, as given in (1). Much work has been done on the statistical properties of these systems, through various notions of generalized dimensions (VC dimension, pseudodimension, etc.), and a great deal is known in this respect regarding the estimation issue mentioned above (see, for example, [2], [42], and references therein). The problem of approximation by neural networks has been extensively studied since the first papers on this topic appeared in 1989 [11], [22], [24]. In 1993 Leshno et al. [27] gave precise conditions for the activation function $\sigma$ in (1) to yield denseness in the class of continuous functions. More precisely, they showed that the class of neural networks (with a potentially infinite number of hidden nodes $n$) is dense in the class of continuous functions over $\mathbb{R}^d$ if and only if $\sigma$ is locally integrable and nonpolynomial. Concomitantly and following this work, several authors have addressed the issue of convergence rates for the approximation error for various classes of functions [4], [15], [30], [33]. Under certain conditions optimal rates of convergence have been demonstrated, although a great deal of work still needs to be done in this direction. We refer the reader to [38] for an excellent extensive review of this topic.
A major problem with the application and use of neural networks is the heavy computational burden involved in minimizing a highly nonlinear function of a large number of variables (the weights and thresholds in the network). While various schemes have been proposed to address these issues, many of them are of a heuristic nature, and are not readily amenable to analysis along the lines delineated above. In this work we consider procedures which incrementally add a single hidden unit to the network, at each stage optimizing only over the parameters of that single unit. In this way, the computational burden is greatly reduced, in that only a small number of parameters need to be computed at each stage, as compared with standard approaches which estimate all the parameters of the network simultaneously. Unfortunately, it cannot be shown that this procedure solves the computational problem entirely, as it is well known that local minima exist even for single nonlinear units [3]. This type of incremental procedure was introduced into the neural-network literature by Fahlman and Lebiere [20], although similar approaches had already been studied in the 1970s (see the review [5]). In the context of function approximation, which is our main interest here, incremental algorithms were first proposed by Jones [25] in an abstract Hilbert space setting, and later extended by Barron [4] in the context of neural networks. Important generalizations and extensions, discussed below, have been presented in [26] for Hilbert spaces and in [18] in the more general Banach space setting which will be considered here (see also [16] and [17] for some recent contributions).

The main results established in these papers can be briefly summarized as follows. Let $G$ be a subspace of a Banach space $X$ with norm $\|\cdot\|$, which will be taken in this paper to be the usual $L_q$ norm. Let $\mathrm{co}_n(G)$ consist of all functions of the form $\sum_{i=1}^{n} \alpha_i g_i$, $g_i \in G$, where the $\alpha_i$ are nonnegative and sum to one. Then, for any $f$ in the closure of the convex hull of $G$, there exists a function in $\mathrm{co}_n(G)$ whose distance from $f$ can be upper bounded by $c/n^{\gamma}$, where $c$ is a constant and $\gamma$ is an exponent depending on the nature of the space. Moreover, this optimal function may be obtained by an incremental procedure (as defined in Section IV). Donahue et al. [18] also show that the rates are often optimal. We note in passing that the relevance of these incremental approaches to learning from data has been discussed in [26], where precise bounds are given on the estimation error. An analysis of incremental greedy algorithms in the context of orthogonal systems can be found in [39].

The main contribution of this work is the establishment of approximation error bounds for incremental algorithms under rather general conditions on the approximating function and the approximating norm. In other words, instead of demanding convergence to functions in the convex hull, we study convergence to general functions characterized by their smoothness properties. The results are compared to lower bounds derived recently in [30]. Finally, it should be commented that there is strong motivation for considering general smoothness parameters, e.g., $p$, the norm characterizing smoothness of the space, and $q$, the norm with which error is measured. The basic issue is that one never really knows what is the correct value of $p$ or the most natural value of $q$ with which to measure error. For example, in robust statistics there is often strong motivation to use values of $q$ between one and two. It is therefore important to derive bounds for general values of $p$ and $q$.
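The incremental convex-hull results summarized above can be illustrated by a minimal sketch in a finite-dimensional Euclidean space, which is a Hilbert space. This is not the algorithm analyzed in this paper: the random dictionary, the mixing weights, the fixed schedule $1/n$, and all function names are assumptions made for illustration. At stage $n$ only the newly added element is optimized, and the squared error is compared with the elementary bound $V/n$, where $V$ is the weighted squared spread of the dictionary around the target.

```python
import random


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mix(u, v, alpha):
    """Convex combination (1 - alpha) * u + alpha * v."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(u, v)]


def incremental_approximation(target, dictionary, n_terms):
    """Greedy f_n = (1 - 1/n) f_{n-1} + (1/n) g_n; returns f_n and all errors."""
    current = [0.0] * len(target)  # irrelevant at n = 1, where alpha = 1
    errors = []
    for n in range(1, n_terms + 1):
        alpha = 1.0 / n
        # Optimize over the single new element only; earlier terms stay fixed.
        current = min((mix(current, g, alpha) for g in dictionary),
                      key=lambda candidate: sq_dist(target, candidate))
        errors.append(sq_dist(target, current))
    return current, errors


rng = random.Random(0)
dim, size = 5, 20
dictionary = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(size)]

# Target placed in the convex hull of the dictionary (hypothetical weights).
raw = [rng.random() for _ in range(size)]
mixing = [w / sum(raw) for w in raw]
target = [sum(w * g[k] for w, g in zip(mixing, dictionary)) for k in range(dim)]

# Weighted squared spread V of the dictionary around the target.
variance = sum(w * sq_dist(g, target) for w, g in zip(mixing, dictionary))

approximation, errors = incremental_approximation(target, dictionary, 50)
bound_holds = all(err <= variance / n + 1e-12
                  for n, err in enumerate(errors, start=1))
```

The bound follows because the best single element does at least as well as the weighted average over the dictionary, for which the cross term vanishes and the squared error obeys $e_n \le (1 - 1/n)^2 e_{n-1} + V/n^2$; induction then gives $e_n \le V/n$.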
We refer the reader to [19] for a detailed discussion of the choice of norms and related issues.

The remainder of this paper is organized as follows. In Section II we present some basic definitions and briefly review some standard results from the theory of function approximation and multivariate splines. Section III summarizes some of our previous work wherein general functions in $L_2$ are represented exactly using an integral transform involving the Radon and wavelet transforms. In Section IV we then present an incremental algorithm and derive upper bounds on the approximation error incurred by it, using the $L_2$ norm to measure the error. These results are then extended to general $L_q$ norms in Section V. The paper concludes with a summary and discussion of some open questions. Some of the more technical proofs have been relegated to the appendixes.

II. BACKGROUND AND PRELIMINARY RESULTS

We begin with some comments concerning notation. We let $\mathbb{R}^d$ denote the $d$-dimensional Euclidean space, and $\mathbb{R}_+$ the set of nonnegative real numbers. The set of natural numbers is denoted by $\mathbb{N}$, and $K$ will denote a generic compact set. We also use a shorthand notation for terms which behave like powers of logarithms in $n$. Clearly, for any $\epsilon > 0$, these terms are always smaller than $n^{\epsilon}$ for sufficiently large $n$.


For any measurable set $K$, we use the notation $\|f\|_{L_q(K)}$ to denote the $L_q$ norm of a function $f$ over $K$, and we will often simplify this notation when no confusion can arise.

A comment is also in order concerning the various constants appearing in the paper. The results derived in the sequel rely on several classic lemmas from the theory of functions, as well as on results from previous work, all of which include unspecified constants. In order to keep track of the constants, we have given indices to the constants arising from the various lemmas used, in the order of their appearance. The constants appearing in the main theorems of this work (Theorems IV.1, IV.2, and V.1) will be denoted by the generic symbol "$c$," which will be related to the other numbered constants in the proofs of the theorems. Although this procedure may seem somewhat cumbersome, we believe that it does have the advantage of keeping track of the sources for the different constants, even if no numerical value is given for each constant. The sources of the various constants are summarized at the end of Appendix A.

Definition 1: Let $(X, \|\cdot\|_X)$ and $(Y, \|\cdot\|_Y)$ be two Banach spaces with their respective norms. We then say that $X$ is continuously embedded in $Y$, denoted by $X \hookrightarrow Y$, if
1) $X \subseteq Y$;
2) $\|f\|_Y \le c \, \|f\|_X$ for any $f \in X$, where $c$ is independent of $f$.

Before proceeding to the details of the algorithm and performance bounds, we outline the basic methodology used. First, we observe in Section III that any square integrable function, defined over a compact domain, may be exactly expressed using a convex integral representation, based on ridge functions (as used in neural networks). We then show in Section IV how the integral may be approximated by a finite sum using Monte Carlo based sampling procedures. Finally, we show in Sections IV and V how this nearly optimal approximation may be achieved incrementally, at a greatly reduced computational cost.

A. Some Distance Measures for Functional Spaces

We now introduce several distance measures relevant to measuring the approximation of one space by another. Throughout, $W$ denotes the class of functions being approximated and $X$ the Banach space in which error is measured.

1) Linear Distance: Let $\mathcal{L}_n$ be the set of rank-$n$ linear operators mapping the Banach space $X$ to itself. A linear operator is of rank $n$ if its range is of dimension $n$. Let

$$\delta_n(W, X) = \inf_{L \in \mathcal{L}_n} \sup_{f \in W} \|f - Lf\|_X. \tag{2}$$

Note that the linear operator $L$ does not depend on the function $f$ which is being approximated. For example, it is well known that for Hilbert spaces the best linear approximation relative to a fixed orthonormal basis $\{\phi_i\}_{i=1}^{n}$ is given by $\sum_{i=1}^{n} \langle f, \phi_i \rangle \phi_i$, where $\langle \cdot, \cdot \rangle$ denotes an inner product.

2) Nonlinear Distance: Let $H$ be a functional space. We define the distance from $W$ to $H$ by

$$\mathrm{dist}(W, H, X) = \sup_{f \in W} \inf_{h \in H} \|f - h\|_X. \tag{3}$$


The Kolmogorov $n$-width is defined as

$$d_n(W, X) = \inf_{H_n} \sup_{f \in W} \inf_{h \in H_n} \|f - h\|_X, \tag{4}$$

where the infimum is taken over all $n$-dimensional subspaces $H_n$ of $X$. Observe that, in distinction with the linear case, the optimal choice of the approximating element in (3) depends on the function being approximated. Clearly, the Kolmogorov $n$-width (4) is no larger than the linear distance (2). Of course, the computational problem inherent in the nonlinear case is much harder. An example of nonlinear approximation is the selection of an optimal subset of $n$ basis vectors from a potentially infinite set.

The final distance measure we discuss is relatively new, having been introduced only recently by Maiorov and Ratsaby [31], to whom we refer the reader for details. The basic motivation for this measure, which we refer to as the $\rho$-width, is the fact that in the Kolmogorov width defined above, one considers only linear subspaces of dimension $n$. However, recently studied approximation schemes, such as neural networks, are not linear subspaces, and therefore no direct relationship exists between the Kolmogorov $n$-width and the degree of approximation by neural networks with $n$ hidden units. A quantity which has turned out to be very useful in quantifying the complexity of nonlinear function spaces in the context of estimation is that of the pseudodimension. Roughly speaking, this quantity measures the complexity of a functional class by looking at its degree of oscillation with respect to a set of points. For a detailed description of the pseudodimension and its application to the theory of estimation we refer the reader to [42]. The $\rho$-width is then defined as

$$\rho_n(W, X) = \inf_{H^n} \sup_{f \in W} \inf_{h \in H^n} \|f - h\|_X, \tag{5}$$

where the infimum is taken over all spaces $H^n$ with pseudodimension $n$. In discussing rates of approximation for neural networks, one usually presents the result in terms of the number of hidden units rather than the pseudodimension. However, for some special cases a direct relationship exists between the two, so that the results may be directly compared (see [2], [6], and [42] for a discussion of the relationship in some special cases). Observe, though, that in the case where the space is a linear vector space of dimension $n$, its pseudodimension equals $n$.

B. Some Standard Smoothness Classes

Any attempt at deriving approximation error rates needs to make some assumptions on the regularity of the function being approximated. In recent years an increasing amount of attention has been paid to the so-called Besov class of functions, defined over a compact domain $K$. This class of functions is very general and flexible, and covers most of the standard smoothness classes such as continuous functions, Lipschitz, Hölder, Sobolev, etc. The indexes of this class can be characterized as follows. The index $r$ is related to the number of "smooth" derivatives, while $p$ characterizes the type of norm with which smoothness is assessed. Roughly speaking, larger values of $p$ correspond to higher sensitivity to details. The remaining index has less practical utility, and will not be further discussed. For an extensive discussion of Besov spaces see [41] in a general setting and [14] in the context of approximation theory. For the sake of completeness we give a precise definition of the Besov class in Appendix A [see (35)]. In this work, however, we focus our attention on a related class of functions, the so-called Sobolev class with noninteger smoothness parameter, for reasons which will become clear in the sequel. From Lemma II.1 below we observe that the Besov and Sobolev spaces are intimately connected, so that very little loss of generality is suffered in using the latter.

TABLE I. ASYMPTOTIC RATES OF CONVERGENCE FOR OPTIMAL LINEAR AND NONLINEAR APPROXIMATION OF SOBOLEV SPACES. THE VARIABLE $n$ STANDS FOR THE LINEAR DIMENSION OF THE APPROXIMATING SPACE IN THE CASE OF THE WIDTHS $\delta_n$ AND $d_n$, AND FOR THE PSEUDODIMENSION IN THE CASE OF $\rho_n$. ONLY THE CASE $p \ge 2$ IS DISPLAYED FOR $q > p$, AND THE EMBEDDING CONDITION (8) IS ASSUMED. THE RESULTS FOR $\delta_n$ AND $d_n$ ARE ADAPTED FROM [37] AND [28], WHILE THOSE FOR $\rho_n$ ARE FROM [31].

For any compact domain $K \subset \mathbb{R}^d$, we introduce the well-known Sobolev class of functions $W_p^r(K)$ for any nonnegative integer $r$. Let $k = (k_1, \ldots, k_d)$ be a vector of nonnegative integers with $|k| = k_1 + \cdots + k_d$, and define the derivative

$$D^k f = \frac{\partial^{|k|} f}{\partial x_1^{k_1} \cdots \partial x_d^{k_d}},$$

where $x = (x_1, \ldots, x_d)$. The classic Sobolev class $W_p^r(K)$ is then defined as in (6).

This definition has been extended to the case of real-valued $r$, leading to the so-called Sobolev–Slobodekij space (see [41] for details), whose norm is given in (7). From [41, Sec. 2.2.2] we have that $W_p^r = B_{p,p}^r$ for noninteger values of $r$.

Remark 1: We have chosen to extend the classic Sobolev space $W_p^r$, $r \in \mathbb{N}$, to the Sobolev–Slobodekij space so that the parameter $r$ may assume any nonnegative real value. It should be observed that a related extension may be established using a different generalized derivative, giving rise to the so-called generalized Sobolev space, which for completeness is defined in Appendix A [see (37)]. However, in the case $p = 2$, which is our main concern here, this space is equivalent to the Sobolev–Slobodekij space defined in (7); see [35, Sec. 9.3] for a proof. We comment that Theorem IV.1 and its later ramifications are based on the work in [30], where the generalized Sobolev space was used instead of the Sobolev–Slobodekij space (7). However, the above-mentioned equivalence between these spaces shows that no problem arises here. Due to this equivalence, and in order to keep the notation simple, we will refer to both these spaces as the Sobolev space for $p = 2$. Bounds obtained for one of the spaces hold for the other, with the only difference being a universal multiplicative constant.

An important attribute of the classes $W_p^r$ for real-valued $r$ is the following lemma relating them to the more general Besov classes.

Lemma II.1 ([40, Secs. 2.3.2 and 2.3.3]): For the admissible ranges of the parameters, continuous embeddings hold between the Besov classes and the Sobolev classes $W_p^r$.

Thus, since the parameter appearing in these embeddings is arbitrary, we lose very little in generality by focusing on the Sobolev space (with noninteger smoothness parameter $r$). Approximation error bounds derived in the sequel require some condition relating the norm of the approximation error to the parameters of the underlying space. A basic result from the theory of function spaces (e.g., [41]) states that the condition (8) suffices for the embedding

$$W_p^r(K) \hookrightarrow L_q(K) \tag{9}$$

to hold.

Classic results for the optimal linear and nonlinear approximation of Sobolev spaces are given in Table I, adapted from [37] and [28]. For the results to be meaningful, it is assumed that the embedding condition (8) holds. The asymptotic notation used in the table relates two terms that are of the same order for large values of $n$. Several conclusions may be drawn from this table. First, in the case $q \le p$ linear approaches suffice. However, in the case $q > p$, one observes that nonlinear approaches are essential if the best rates of convergence are to be attained. Finally, we comment that it is demonstrated in [8] that multivariate splines may be used to yield the optimal rates given in the table. In the linear case the knots of the splines are fixed, independently of the function being approximated, while in the nonlinear case they are free to vary with this function. However, the computational problem in the latter case becomes much more involved. We comment in passing that in the one-dimensional case nearly optimal rates of approximation have been attained using wavelets


(see [32] for an excellent overview of these results and many others).

In the context of neural-network approximation of Sobolev spaces, Mhaskar [33] was first to derive upper bounds in the case $q = p$, obtaining the optimal rates given in Table I. However, the class of activation functions considered was rather restrictive, and no efficient approximation algorithm was given. More recently, Petrushev [36] has constructed efficient neural-network approximation algorithms in the case $p = q = 2$, which work well for very smooth functions, giving rise to optimal rates of convergence for a very broad class of activation functions. As far as we are aware, this paper is the first to establish upper bounds for neural-network approximation in the general setting described above, namely for general values of the smoothness and norm parameters. Lower bounds for approximation by neural networks and general ridge functions were recently established in [30] and [29], respectively.

Finally, we define a certain subclass of neural networks, which will be used in the sequel. Let $\sigma$ be a univariate function, and define the corresponding families of functions as in (10) and (11). Note that the class in (11) is contained in the $n$-dimensional linear span of the functions in (10), which essentially corresponds to a general two-layer feedforward neural network with linear output activation. Thus, we are essentially looking at approximations from a restricted subset of this span. The main objective of this work is the derivation of upper bounds on the error incurred by approximating functions in the Sobolev class using incremental algorithms, resulting in neural networks belonging to the class (11), where the constant appearing in (11) will be specified in Section III and may depend on the function being approximated. The corresponding approximation error is defined in (12), where the infimum is taken with respect to all functions obtained using the incremental algorithm.

Throughout the paper we focus on the case $p = 2$. Since, for any compact domain $K$, $W_p^r(K) \subseteq W_2^r(K)$ for $p \ge 2$, upper bounds on the approximation error for $W_2^r(K)$ will suffice to obtain upper bounds for $W_p^r(K)$, $p \ge 2$, although the bounds in this case are not always tight. We comment on this issue further in Section V.

C. Spline-Based Function Decomposition

We present a basic tool which allows one to split an arbitrary function in the Sobolev class into two components, in terms of a projection onto the space of spline functions. This technique is crucial for the error bounds which will be derived in the sequel. Let the domain $K$ be contained in a minimal cube. For a given resolution, subdivide the cube into equivolume subcubes, by splitting each axis of the cube into equal intervals. Consider the linear class of tensor-product splines obtained by taking the product of univariate splines along each axis, with equally spaced knots at a distance equal to the interval size. In the case where $r$ is noninteger, the spline degree is determined by the smallest integer larger than $r$. However, in order not to clutter the notation, we retain the same symbol throughout. For any function $f$ in the Sobolev class, consider the linear operator mapping $f$ to its best approximation in the spline class, and define the resulting decomposition of $f$ into a spline component and a remainder as in (13).

The spline space is spanned by basis B-splines (see [37, Sec. 7.3] for the one-dimensional case). Each element of the spline space may then be uniquely expressed as in (14).

In the sequel we will show how a neural network with $n$ hidden units may be used to approximate a function in the Sobolev class using the $L_q$ norm. In fact, we will show that the spline component defined in (13) may be well approximated, while the remainder only makes a minor contribution. The orders of magnitude of the two contributions will be determined by the value of the resolution parameter in (13). We first show, based on [9], that the remainder can be made arbitrarily small by selecting the resolution large enough.

Lemma II.2: Let the linear space of splines be defined as above, and let the remainder be defined as in (13). Assume further that the embedding condition (8) holds. Then there exists a finite constant such that the norm of the remainder is bounded in terms of the Sobolev norm of $f$, where the norm is given in (7).

The following lemma relates different norms of functions in the spline space.

Lemma II.3: Let the spline component be defined as in (13), and assume the embedding condition (8) holds. Then the norms of the spline component for different exponents are related through a factor depending only on the resolution parameter.

Lemma II.2 is proved in [9, Th. 5.2] for a special case, and is extended to the general case in Appendix A, where the proof of Lemma II.3 can also be found.

III. AN INTEGRAL REPRESENTATION FOR $L_2$ FUNCTIONS

We summarize the basic starting point of this paper, based on [30], to which the reader is referred for further details. One of the motivations for the work in [30] is the paper by Delyon et al. [12], which establishes a similar integral representation based on the multivariate wavelet transform. In our theory, only



univariate wavelets are used. We comment that integral representations have been used for neural networks by several other authors [4], [10], [22].

Consider a univariate wavelet function and its associated scale function (see [32] for detailed definitions of these functions). We have shown in [30] that any function $f \in L_2(K)$ can be exactly represented in the form (15), where the integration over directions extends over the unit sphere in $d$ dimensions. The quantities appearing in (15) are defined in (16) and (17). Note that the function in (16) is a ridge function, namely it assumes constant values on hyperplanes given by $a^{T} x = c$, where $c$ is a constant. These basis functions will serve as the building blocks for the neural-network construction, based on discretizing the integral representation [see Section IV]. The weighting function in (15) is given by (18), which involves a Fourier transform. As can be seen from its definition, this function is a probability density function (pdf), namely it is nonnegative and integrates to one, so that we may write (15) as an expectation with respect to this pdf.

Throughout the paper we follow common conventions in denoting random variables by uppercase letters, and their realizations by lowercase letters.

Remark 2: Observe that the ridge function in (16) is proportional to the wavelet scale function, which requires that it be integrable [32]. Thus, it must vanish sufficiently quickly at infinity, which seems to preclude standard sigmoidal functions. Note, however, that an integrable function can be obtained by summing two shifted versions of a monotonic sigmoidal function, so that there is no loss of generality in this restriction.

IV. INCREMENTAL ALGORITHM PERFORMANCE USING THE $L_2$ NORM

Before presenting the incremental algorithm studied in this work, we consider how the convex integral representation (15) may be discretized in order to obtain a finite neural-network representation. The basic idea is very simple, and is related to the Monte Carlo theory of integration, as well as to ideas of random coding in information theory. It has been introduced into the neural-network literature by Barron [4] and further extended in [26] and [18]. In principle, the results of Section IV may be derived from the more general results in Section V. However, the proof for the Hilbert space $L_2$ is easier and more intuitive, and the algorithm considered is less restrictive than the one proposed in the general Banach space setting.

Let $n$ random variables be drawn independently at random according to the pdf (18), and let $f_n$ denote the corresponding $n$-term average of ridge functions, where the ridge function is defined in (16). As $n$ increases one expects that $f_n$ converges to its expected value $f$. Denoting expectations with respect to the product measure by $\mathbf{E}$, and using Hölder's inequality [23] together with the independence of the samples, one easily shows that the expected error decreases with $n$ at a rate governed by the variance of a single term, the expectation being taken with respect to the random parameter of that term. Once upper bounds for expected deviations have been derived, it trivially follows that there must exist some specific parameter value such that the bound holds. In order to obtain rates for the Sobolev classes, this variance will need to be carefully evaluated (or at least bounded), giving rise to the following theorem.

Theorem IV.1: Let $K$ be a compact subset of $\mathbb{R}^d$. Then the expected approximation error of the Monte Carlo construction satisfies the bound (19), in which the constant is the one given in (17).

We observe that the proof of Theorem IV.1 can be obtained using advanced results from the theory of functional interpolation [7]. We present here a somewhat more elementary proof, relying on simpler concepts and on the results in [30].

Proof: We present the proof for the more difficult of the two cases. The somewhat easier case, where the functions are in fact uniformly bounded, was considered in [30]. Let the spline component and the remainder of $f$ be defined as in (13). Then the error splits into two contributions, as in (20).



By its construction, the spline component clearly belongs to the space considered in [30]. Thus we conclude from Theorem 2 in [30] that the bound (21) holds. Using Lemma II.3 for this function, we then have (22). Combining (21) and (22), we then conclude that (23) holds. Recall the definition of the remainder in (13); thus, from Lemma II.2, we conclude that (24) holds. Since the resolution parameter is arbitrary, we may set it as a function of $n$ and combine (20), (23), and (24) to obtain (19), which is the desired claim. The boundedness of the constant is established in Lemma IV.1.

Remark 3: Observe that Theorem IV.1 implies the existence of a set of parameter values for which the bound (19) holds. Moreover, from the lower bounds established in [30], the rates are optimal up to logarithmic factors in $n$.

Lemma IV.1: The constant in Theorem IV.1 [see (17)] is bounded.

Proof: See Appendix B.

We now define an incremental algorithm, and investigate the approximation error incurred by it in Theorem IV.2 below. The basic idea of the algorithm is the following. Recall that the stochastic algorithm introduced above generates the parameters by a random draw from the product distribution. The idea here is, at stage $n$, to fix the parameters selected at the previous stages, and determine the new parameter from the single unit distribution, rather than from the full $n$-term distribution. Theorem IV.2 then demonstrates that this incremental approach achieves the same rate of convergence attained by the original Monte Carlo approach.

It should be clear that the parameter required in Step 2) of the algorithm exists. Since the right-hand side of the equation appearing in Step 2) is an expectation with respect to the new parameter, clearly there must exist a value of this parameter for which the expression within the expectation is smaller than or equal to the average value. Similarly, a value exists for which the expression is larger than or equal to it. Using the intermediate value theorem, and assuming continuity with respect to the parameter, yields the claim. In fact, replacing the equality in Step 2) by a "less than or equal" symbol, we do not even need a continuity condition.

Define the error incurred at the $n$th stage by this incremental procedure. Then we have our main result.

Theorem IV.2: For any function in the Sobolev class and any $n$, the error of the incremental algorithm is bounded at the same rate as in (19), for some finite constant $c$.

Proof: The claim will be established upon showing that (25) holds, namely, the error incurred by the incremental procedure is identical to that of the nonincremental one described preceding Theorem IV.1. The result will then follow upon using Hölder's inequality and the upper bound (19) for the right-hand side of (25).



We proceed by induction. For , the result follows by def, namely inition. Assume now that he claim holds for

We then have

where the last step follows by induction. We conclude therefore that

where we have used and . This establishes the desired result.
Remark 4: Observe that the incremental algorithm described above selects the new parameter at each stage by computing the average of some random variable. An alternative possibility would be to select so that it minimizes this quantity, giving rise to a greedy algorithm. Clearly the lower bound derived would apply to the latter case as well. In the case , which we pursue in Section V, we will resort to such a greedy algorithm.
where we have used

. Now, from (15) we have , so that the third term in the last equation vanishes. Thus we conclude that

From the definition of

and

we have

V. INCREMENTAL GREEDY ALGORITHM PERFORMANCE USING THE NORM
The proof of Theorem IV.2 relies heavily on the existence of an inner product. This useful tool is no longer available for the case of general Banach spaces such as . In this section we make use of the work of Donahue et al. [18], who in turn rely heavily on results from the theory of the geometry of Banach spaces. Similarly to the previous sections we consider the space . The results for follow along similar lines and will be presented as a corollary at the end of the section. We also recall that the embedding condition (8), discussed in Section II, is needed for the embedding to hold.
Before presenting the basic result of this section, we describe the incremental greedy algorithm used, as it differs somewhat from the incremental procedure presented in Section IV in the case of Hilbert spaces. Let be a bounded subset of . The sequence is then selected in the following fashion. Let minimize , and recursively define so that

(26) where and . We assume that the infimum in (26) is in fact achieved; the procedure can be generalized, as in [18], to account for more general situations. Thus, , where is the function in achieving the minimum at the th stage. In principle one may allow optimization over as well, although in this paper we consider a fixed schedule. Observe that due to the greedy nature of the algorithm, the requirements of the present procedure are more stringent than those used in Section IV for the norm, where each new function was selected by choosing its corresponding parameters as the mean of a certain random variable. We refer the reader to [18] for an extensive discussion of this and other greedy algorithms. The main result of Section V follows.
Theorem V.1: Let the embedding condition hold for , , and assume that for all . Then for any and

and (29)
Lemma V.1 requires that . In order to obtain the bounds in terms of the smoothness parameter , rather than , we establish the following result.
Lemma V.2: Let the embedding condition hold for , , . Then there exist parameters , such that for any (30)

(31) is given in (27), and .
Proof: Assume first that . We express the function as the sum , where as in (13). Since it follows that , and thus Lemma V.1 may be applied, implying that there exists a function such that

where

Applying Lemma II.3 to

where


we obtain

(32) From Lemma II.2 we have that (33)
(27) and is obtained via the incremental greedy algorithm (26).
The main idea in the proof of Theorem V.1 is a two-part approximation scheme. First, we show in Lemmas V.1 and V.2, based on [30], that any may be well approximated by functions in the convex class [see (10) and (17) for definitions]. Then, it is argued, making use of results from [18], that an incremental greedy algorithm can be used to approximate the closure of the class by the class . The proof is completed by using the triangle inequality. The following lemma is a direct consequence of Lemma 5 in [30].
Lemma V.1: Let the embedding condition hold for , and set . Then there exist parameters , such that for any

(28)

Combining (32) and (33) using the triangle inequality we obtain

where . Set , obtaining

The bound on is obtained from Lemma V.1 using Lemma II.3 once again

In the case we recall from Hölder’s inequality that for any function , , where



is the Euclidean volume of . Thus, results for this range can be directly obtained by setting in the results for derived above.
Lemma V.2 establishes a bound on , without referring to any particular approximation scheme. We now quote a result from [18], showing that functions in the convex hull of may be efficiently approximated using an incremental greedy algorithm.
Lemma V.3: ([18], Corollary 3.6) Let be a bounded subset of , , where is a measurable subset of , with given. Select such that for . Then for each , there exists a sequence such that the sequence generated by

TABLE II UPPER BOUNDS FOR THE ERROR OF INCREMENTAL APPROXIMATION OF THE SOBOLEV SPACE W (K ) BY NEURAL NETWORKS USING THE L NORM, AND LOWER BOUNDS FOR APPROXIMATION BY NEURAL NETWORKS DERIVED IN [30] FOR TWO SPECIAL ACTIVATION FUNCTIONS. THE EMBEDDING CONDITION (8) IS ASSUMED. THE PARAMETER IS GIVEN IN (27)

where we have used Lemma V.1 in the second step. Using the assumption , , we then conclude that

(34) , satisfies

if

Since the upper bound on increases with and is finite by the embedding condition, we conclude that for larger than some finite value . Thus, the conditions of Lemma V.3 apply with , and we conclude that for each , there exists a sequence such that for

and

Since , setting for ease of presentation and using Lemma V.2, it follows that for if . We note that the work in [18] was not restricted to a compact domain , as is this work. Thus, there is no direct link between the norms for and , explaining the origin of the two different inequalities in Lemma V.3. In the present work, the restriction to compact domains arises from the technical conditions needed to establish approximation rates for Sobolev spaces. Using Hölder’s inequality as at the end of the proof of Lemma V.2 we obtain the following simple corollary.
Corollary V.1: Under the conditions of Lemma V.3, and the added assumption of the compactness of the domain

if

.
Proof of Theorem V.1: Consider the class , with bounded as in (29). To make the dependence of on explicit, we use the notation . Given let be the function from Lemma V.2 such that . We consider first the case , and then point out how the result for may be obtained.
1) : For any function and we have . Using the triangle inequality we obtain for

where . In the penultimate step we have used the bound on from Lemma V.2, while in the final step we have substituted the value of from that lemma.
2) : The case follows along similar lines using Lemma V.2 and Corollary V.1.
The results of Sections IV and V, together with the lower bounds from [30], are summarized in Table II (disregarding constants).
Remark 5: In Table II we have presented upper and lower bounds on the approximation error incurred by incremental algorithms for the case . The lower bounds derived in [30] apply to general and , but were derived for two specific activation functions (the piecewise polynomial activation function and the standard sigmoidal function ). Furthermore, since , , clearly , and therefore , so that upper bounds for can be immediately obtained from those for . It is likely, though, that in this case the bounds are not as tight as possible. However, in the case the upper bound obtained for yields the optimal rate (again, up to logarithmic factors). These observations are summarized in Fig. 1, and are discussed further in Section VI.
Remark 6: In order to compare the upper bounds derived above to the optimal rates given in Table I, we observe that the correct yardstick is the -width introduced in Section II, which does not require the approximating space to be a linear subspace. However, the relevant complexity measure in this case is the pseudodimension rather than the number of hidden nodes , as in Table II. While the connection between and is unknown in general, we have recently shown in [6] that in the case where the activation function is a piecewise polynomial with a fixed degree and a finite number of knots, then . Thus, at least for this case one may replace by in obtaining rates in terms of the number of hidden nodes. Disregarding logarithmic terms, we can see from Tables I and II that the optimal rate is achieved in the case . Finally, in the case where , an inspection of the denominator of [see (27)] reveals that , with equality only if . Comparing the results to those obtained by linear methods (see Table I) we observe that the rates are in fact faster for the incremental greedy algorithm discussed above. This issue is further discussed in Section VI.

VI. DISCUSSION
We have presented results for the error incurred in the approximation of functions over compacta by neural networks, generated by computationally attractive incremental procedures. While we have not been able to demonstrate the general computational efficiency of the algorithms (e.g., polynomial dependence on the input dimension), we have shown that they possess nearly optimal performance under certain conditions, and are superior to linear methods under different conditions.
In comparing the rates obtained in this work to the optimal rates in Table I, we observe that the appropriate width to compare with is the -width, expressed in terms of the number of hidden nodes, rather than the pseudodimension. In view of Remark 6, we can indeed make such a comparison in the case of piecewise polynomial activation functions, for which the pseudodimension is proportional to , where is the number of hidden nodes. We focus throughout on the regime , as the current work does not permit us to draw any conclusions on upper bounds for . Referring to Fig. 1 and comparing with the optimal rates given in Table I, our results (as well as previous work) can then be summarized as follows.
• From the work of Mhaskar [33] we conclude that, at least for certain types of activation functions, optimal rates of approximation by neural networks can be achieved on the diagonal , although not necessarily by incremental procedures.
• In the case , incremental algorithms were introduced, which achieve the same rates of convergence as those obtained by nonincremental procedures. Moreover, the rates are optimal in the case , and are provably superior to linear methods when .


Fig. 1. Approximation by incremental neural-network construction. Upper bounds for the distance function (12) in the regions II, III, and IV are given, and are shown to be optimal (up to logarithmic factors) in region II. On the line (p = 2; q > 2) results are provably superior to those attained by linear methods.

• Upper bounds on approximation error by incremental algorithms can be derived for any values of and (see Remark 6). These results are optimal for and (region I in Fig. 1). It should be noted, however, that in this regime linear methods suffice to achieve optimality, as is evident from Table I. On the line results are provably superior to those attained by linear methods. Finally, while the upper bounds apply in the regions II and III as well, they are not optimal there.
It is still unclear whether truly optimal behavior can be established for incremental algorithms in the regime , as is the case for multivariate free-knot splines. The basic issue here is whether the nonoptimality is a result of the algorithm, the bounding technique, or the use of neural networks in general. We leave this as an interesting open question at this time. In any event, these results extend the range for which approximation error bounds have been derived for neural networks, and provide explicit algorithms.
Several research issues remain besides the consideration of optimality in the general case. First, although the algorithms discussed are computationally attractive, we have not been able to establish their tractability under general conditions. Proving such computational tractability, together with the approximation error rates derived in this work, would provide a strong theoretical argument for the use of neural networks. Second, establishing estimation error bounds for the incremental algorithms studied here would also be an important contribution. Such results were recently established by Lee et al. [26] for the special case , but to the best of our knowledge have not been extended to general values of . Finally, the problem of the optimality of the incremental algorithms for general values of and is still open.

APPENDIX A
THE BESOV AND GENERALIZED SOBOLEV SPACES
For convenience we summarize several of the key definitions and lemmas used in the paper.
We first present a precise definition of the Besov class of functions (see [40, Sec. 2.3.1]). Let



be a difference operator and recursively define . For any let where and . Denote by , , the space of functions from to for which all derivatives up to order are integrable with respect to the norm [see (6)], and let

where

. Then for any

In spite of the difference between the definitions of the spaces in (37) and in (7), they can be shown ([35, Sec. 9.3]) to be equivalent in the special case , which is our main concern. Thus, we will not distinguish between them.
One final comment is in order concerning the space . It is clear that even if the function is supported on , there is no guarantee that the derivative will possess the same support, and may in fact be supported on . The following lemma shows that it is possible to extend the original space to , in which case the above problem does not arise.
Lemma A.1: ([1, p. 84]) Let be a compact domain with a smooth (Lipschitz) boundary such that , , and denote by a ball of diameter containing . Then can be extended to a function such that vanishes outside , and equals for . Moreover, there is a finite constant , independent of , such that

(35)

Other equivalent characterizations of the Besov space may be found in [41], where the definition is extended to negative as well. In view of Remark 1 we introduce a slightly different definition of a generalized Sobolev space. This definition was used in [30], and is based on notions from the theory of Fourier transforms. The equivalence of this space and the Sobolev space (7) used in this paper will be discussed below. We recall the standard Fourier transform of a function

(38)

A. Sources for the Constants
In line with our discussion in Section II, we give the sources of the various constants appearing in the paper. The constants and appear in Lemmas II.2 and II.3, respectively. The constant appears in the proof of Theorem IV.1 arising from the work in [30]. The constants and appear in Lemma V.1, while results from Lemma V.2. The remaining numbered constants appear at various stages in the proof of Lemma IV.1 in Appendix C.

APPENDIX B
PROOF OF LEMMAS II.2 AND II.3

We use the notation to denote the inverse Fourier transform. For any , define the linear operator

(36)

where . It is a simple matter to show that the standard derivative is related to the differential operator in the case where is an integer. For any nonnegative integer we have . For any compact domain and nonnegative real variable we then define the generalized Sobolev class of functions as follows:

(37)

We start by quoting a result from [37], based on the work in [9]. The result is stated for the multivariate case, and can be easily derived by induction from the one-dimensional version in ([37, Sec. 7.3]).
Lemma B.1: Let be the linear space of splines defined in Section II-C, and let (14) hold. Then, for any and for any there exist constants , and , independent of , such that 1)

2)

where

and we have set .
Proof of Lemma II.2: We use the upper-case notation and for the constants from Lemma B.1, in order to distinguish them from the constants and appearing in the body of the paper. Set , and recursively . In order to simplify the notation in the proof we use . Then


. Consider now the multivariate polynomial of degree , , defined over the cube . Repeating the above procedure for each dimension we obtain (40)
where , and . First, consider the case . Recall that is the minimal cube containing the compact domain . Then

For

and

,

. Thus for

where are the subcubes of defining the space defined in Section II-C, and such that . Using Markov’s inequality (40) and recalling that we obtain

we obtain (41) (39)

where . For we make use of the result 1) in Lemma B.1, similarly to the proof of Lemma II.2 above. Using (39) and replacing by we then have

where we have used Lemma B.1 in the first and last inequalities. Hence we conclude that

(42) Combining (41) and (42) yields which is the desired result.

APPENDIX C
PROOF OF LEMMA IV.1

Using the triangle inequality and Lemma B.1

Assume, without loss of generality, that , the support of , includes the origin and let . From Lemma 4 in [30] (43)
The result then follows from

which is the desired result.
Proof of Lemma II.3: We present the proof for the case where is an integer. For general values of one needs to transform the problem into one of integer using properties of the spline functions (see, for example, [41]). Begin by recalling the Markov inequality for polynomials ([14, Sec. 4.1]). Let be a univariate polynomial of degree defined on , then , where is the derivative of and is an absolute constant. By a simple change of variables one obtains for an interval of size that , and by induction for the th derivative

where is defined in (18). It was shown in [30] that , but we find it more convenient to use the general symbol , setting it to 1/2 at the final stage. It should be kept in mind that in (43) is defined with respect to the function , described in the proof of Theorem IV.1, instead of . This important observation results from the proof of Theorem IV.1, which replaces the problem of approximating with that of approximating the spline function . We first show that

(44) Note that Hölder’s inequality cannot be used to show this, since and the domain of the integral is infinite. Let . Then fix

(45)



Applying Hölder’s inequality to the first integral (defined over a compact domain) we have

From (45)–(47) we obtain

(48) (46) In order to estimate (18). Then we have

we use the definition of

where we have used Lemma II.2 in the form

from where the last inequality holds for sufficiently large , or alternatively for any by increasing the constants appropriately. Integrating (48) over the unit sphere and using Hölder’s inequality we then obtain

The Fourier transform implicit in the second integral should be understood in the sense of a Fourier transform of a generalized function (note that we only require the value of this quantity integrated over ). From [34] we obtain for

where . Recall that the Fourier transform of the generalized function is defined by the relationship , which holds for any compactly supported function . Hence we obtain

Set

which establishes (44) with . Using (43), (44), and Hölder’s inequality in the form we obtain

(49)

. Then, since supp Using Plancherel’s equality, we infer that

(50) Passing to Cartesian coordinates and integrating over

we have

Since

Thus we conclude that Combining this result with Hölder’s inequality we obtain (47) where .

(51)


where is defined in (36) and use has been made of Lemma A.1. Since for , we conclude from (49) and (51) that with

Observing that , and applying Lemma II.3 we obtain

Choosing as in the proof of Theorem IV.1 completes the proof with .

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their very helpful and constructive comments.

REFERENCES
[1] R. A. Adams, Sobolev Spaces. New York: Academic, 1975.
[2] M. Anthony and P. L. Bartlett, A Theory of Learning in Artificial Neural Networks. Cambridge, U.K.: Cambridge Univ. Press, 1999.
[3] P. Auer, M. Herbster, and M. Warmuth, “Exponentially many local minima for single neurons,” in Advances in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds. Cambridge, MA: MIT Press, 1996, pp. 316–322.
[4] A. R. Barron, “Universal approximation bound for superpositions of a sigmoidal function,” IEEE Trans. Inform. Theory, vol. 39, pp. 930–945, 1993.
[5] A. R. Barron and R. L. Barron, “Statistical learning networks: a unifying view,” in Comput. Sci. Statist.: Proc. 20th Symp. Interface, E. Wegman, Ed. Washington, DC, 1988, pp. 192–203.
[6] P. L. Bartlett, V. Maiorov, and R. Meir, “Almost linear VC dimension bounds for piecewise polynomial networks,” Neural Comput., vol. 10, pp. 2159–2173, 1998.
[7] J. Bergh and J. Lofstrom, Interpolation Spaces. Berlin: Springer-Verlag, 1976.
[8] M. S. Birman and M. Z. Solomjak, Quantitative Analysis in Sobolev Imbedding Theorems and Applications to Spectral Theory. Providence, RI: Amer. Math. Soc., 1980, vol. 144.
[9] C. de Boor and G. Fix, “Spline approximation by quasiinterpolation,” J. Approx. Theory, vol. 7, pp. 19–45, 1973.
[10] H. Chen, T. Chen, and R. Liu, “Approximation capability in C (R ) by multilayer feedforward networks and related problems,” IEEE Trans. Neural Networks, vol. 6, no. 1, pp. 25–30, 1995.
[11] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Math. Contr., Signals, Syst., vol. 2, pp. 303–314, 1989.
[12] B. Delyon, A. Juditsky, and A. Benveniste, “Accuracy analysis for wavelet approximations,” IEEE Trans. Neural Networks, vol. 6, pp. 332–348, 1995.
[13] R. DeVore, “Nonlinear approximation,” Acta Numerica, vol. 7, pp. 51–151, 1998.
[14] R. A. DeVore and G. G. Lorentz, Constructive Approximation. New York: Springer-Verlag, 1993.
[15] R. A. DeVore, K. I. Oskolkov, and P. P. Petrushev, “Approximation by feed-forward neural networks,” Ann. Numer. Math., vol. 4, pp. 261–287, 1997.
[16] A. T. Dingankar and I. W. Sandberg, “A note on error bounds for approximation in inner product spaces,” Circuits, Syst., Signal Processing, vol. 15, no. 4, pp. 519–522, 1996.
[17] A. T. Dingankar and I. W. Sandberg, “A note on error bounds for function approximation using nonlinear networks,” Circuits, Syst., Signal Processing, vol. 17, no. 4, pp. 449–457, 1998.
[18] M. J. Donahue, L. Gurvits, C. Darken, and E. Sontag, “Rates of convex approximation in nonhilbert spaces,” Constructive Approx., vol. 13, pp. 187–220, 1997.
[19] D. L. Donoho and I. M. Johnstone, “Wavelet shrinkage: Asymptopia?,” J. Royal Statist. Soc. B, vol. 57, no. 2, pp. 301–369, 1995.
[20] S. E. Fahlman and C. Lebiere, “The cascade-correlation learning architecture,” in Advances in Neural Information Processing Systems, D. S. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1990.
[21] J. Fan and I. Gijbels, Local Polynomial Modeling and Its Applications. London, U.K.: Chapman and Hall, 1996.
[22] K. Funahashi, “On the approximate realization of continuous mappings by neural networks,” Neural Networks, vol. 2, pp. 183–192, 1989.
[23] G. Hardy, J. E. Littlewood, and G. Polya, Inequalities, 2nd ed. Cambridge, U.K.: Cambridge Univ. Press, 1952.
[24] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal function approximators,” Neural Networks, vol. 2, pp. 359–366, 1989.
[25] L. Jones, “A simple lemma on greedy approximation in Hilbert space and convergence rate for projection pursuit regression and neural network training,” Ann. Statist., vol. 20, pp. 608–613, 1992.
[26] W. S. Lee, P. L. Bartlett, and R. C. Williamson, “Efficient agnostic learning of neural networks with bounded fan-in,” IEEE Trans. Inform. Theory, vol. 42, no. 6, pp. 2118–2132, 1996.
[27] M. Leshno, V. Lin, A. Pinkus, and S. Schocken, “Multilayer feedforward networks with a nonpolynomial activation function can approximate any function,” Neural Networks, vol. 6, pp. 861–867, 1993.
[28] G. G. Lorentz, M. V. Golitschek, and Y. Makovoz, Constructive Approximation: Advanced Problems. Berlin, Germany: Springer-Verlag, 1996.
[29] V. E. Maiorov, “On best approximation by ridge functions,” J. Approx. Theory, vol. 99, pp. 68–94, 1999.
[30] V. E. Maiorov and R. Meir, “On the near optimality of the stochastic approximation of smooth functions by neural network,” Advances Computa. Math., 2000, to be published.
[31] V. E. Maiorov and J. Ratsaby, “On the degree of approximation using manifolds of finite pseudo-dimension,” J. Constr. Approx., vol. 15, pp. 291–300, 1999.
[32] S. Mallat, A Wavelet Tour of Signal Processing. New York: Academic, 1998.
[33] H. Mhaskar, “Neural networks for optimal approximation of smooth and analytic functions,” Neural Computa., vol. 8, no. 1, pp. 164–177, 1996.
[34] O. P. Misra and J. L. Lavoine, “Transform analysis of generalized functions,” in North-Holland Math Studies. Amsterdam, The Netherlands: North-Holland, 1986.
[35] S. M. Nikolskii, Approximation of Functions of Several Variables and Imbedding Theorems. Berlin, Germany: Springer-Verlag, 1975.
[36] P. P. Petrushev, “Approximation by ridge functions and neural networks,” SIAM J. Math. Anal., vol. 30, pp. 155–189, 1998.
[37] A. Pinkus, N-Widths in Approximation Theory. New York: Springer-Verlag, 1985.
[38] A. Pinkus, “Approximation theory of the mlp model in neural networks,” Acta Numerica, vol. 8, pp. 143–195, 1999.
[39] V. N. Temlyakov, “The best m-term approximation and greedy algorithms,” Advances in Computa. Math., vol. 8, no. 3, pp. 249–265, 1998.
[40] H. Triebel, Interpolation Theory, Function Spaces and Differential Operators. Berlin, Germany: Veb Deutscher Verlag, 1978.
[41] H. Triebel, Theory of Function Spaces. Basel: Birkhauser, 1983.
[42] M. Vidyasagar, A Theory of Learning and Generalization. New York: Springer-Verlag, 1996.
[43] G. Wahba, Spline Models for Observational Data, ser. Regional Conference Series in Applied Mathematics. Philadelphia, PA: SIAM, 1990.

Ron Meir (M’98) received the B.Sc. degree in physics and mathematics from the Hebrew University, Jerusalem, in 1982, and the M.Sc. and Ph.D. degrees from the Weizmann Institute in 1984 and 1988, respectively. After two years as a Weizmann Research Fellow at Caltech he spent a year and a half working at Bellcore on various aspects of neural network design and analysis. His interests include the statistical theory of learning, pattern recognition, time series modeling, and neural network design and analysis.

Vitaly E. Maiorov graduated from the Department of Mathematics, Moscow State University, in 1969 and received the Ph.D. degree in 1977 from the same institution. From 1977 to 1992, he was an Associate Professor of Mathematics at the Moscow Railway Engineers Institute. Since 1992 he has been associated with the Department of Mathematics at the Technion, Israel. His main research interests are in the theory of approximation, problems of approximation by ridge-functions and neural networks, complexity of stochastic algorithms, and learning theory.