FOR NUMERICAL DIFFERENTIATION, DIMENSIONALITY CAN BE A BLESSING!

ROBERT S. ANDERSSEN+ AND MARKUS HEGLAND*

Abstract. Finite difference methods, like the mid-point rule, have been applied successfully to the numerical solution of ordinary and partial differential equations. If such formulas are applied to observational data, in order to determine derivatives, the results can be disastrous. The reason for this is that measurement errors, and even rounding errors in computer approximations, are strongly amplified in the differentiation process, especially if small step-sizes are chosen and higher derivatives are required. A number of authors have examined the use of various forms of averaging which allow the stable computation of low order derivatives from observational data. The size of the averaging set acts like a regularization parameter and has to be chosen as a function of the grid size h. In this paper, it is initially shown how first (and higher) order single-variate numerical differentiation of higher dimensional observational data can be stabilized with a smaller loss of accuracy than occurs for the corresponding differentiation of one-dimensional data. The result is then extended to the multivariate differentiation of higher dimensional data. The nature of the trade-off between convergence and stability is explicitly characterized, and the complexity of various implementations is examined.

1. Introduction

It is assumed that given observational data can be characterised by a model of the form

(1)    y_{j} = f(jh) + ε_{j},   h = 1/n,   j = (j_1, ..., j_d) ∈ {0, 1, ..., n}^d,

where the function f ∈ C^p(R^d), for some appropriate choice of p, denotes the unknown trend, and the ε_{j} the measurement errors, which are assumed to be independent and identically distributed normal variables with expectation 0 and variance σ^2. The numerical differentiation problem consists in determining (estimating) an approximation to the derivative Df of f, where D denotes a (possibly higher order and higher dimensional) differentiation operator; e.g.

(2)    D = \sum_{|p| \le q} c_p(x) \frac{\partial^{|p|}}{\partial x^p},

where the c_p(x) denote arbitrary coefficients, p = (p_1, ..., p_d) with p_i ≥ 0, ∂x^p = ∂^{p_1}x_1 ⋯ ∂^{p_d}x_d, and |p| = p_1 + ⋯ + p_d. Examples include the Laplacian, Divergence, Gradient and Curl operators as well as other partial derivatives. Numerical differentiation is ill-posed and standard finite difference formulas, which have proved very useful for the solution of partial differential equations, often amplify the measurement error when applied

Date: September 11, 1997.
1991 Mathematics Subject Classification. 65D25.
Key words and phrases. Numerical differentiation.
+ CSIRO Mathematical and Information Sciences, GPO Box 1965, Canberra ACT 2601, Australia, [email protected].
* Computer Sciences Laboratory, Australian National University, Canberra ACT 0200, Australia, [email protected].


to observational data. In particular, for small stepsizes h, the good approximation properties of a finite difference formula are completely masked by the amplified measurement errors. In [2], a scheme is proposed which allows one to construct stable estimates for the differentiation of one-dimensional observational data. This scheme uses averages of finite difference approximations over different step sizes to approximate the derivatives. For second derivatives, one such scheme is given by [1]

(3)    y_i^{[2]} := (2r+1)^{-3} h^{-2} \sum_{j=-r}^{r} \left( y_{i+j+2r+1} - 2 y_{i+j} + y_{i+j-2r-1} \right).

Even if the data y_i are contaminated by observational errors, this approximation converges, as h → 0, to the second derivative f^{(2)}(ih) as long as r = h^{s-1} with 0 < s < 0.2, and f has a continuous second derivative. In the sequel, similar formulas for partial derivatives are developed. For example, it will be shown that the second partial derivative ∂^2 f/∂x_1^2 can be approximated in a stable manner by

(4)    y_{i_1,i_2}^{[2,0]} := (2r+1)^{-4} h^{-2} \sum_{j_1=-r}^{r} \sum_{j_2=-r}^{r} \left( y_{i_1+j_1+2r+1,\, i_2+j_2} - 2 y_{i_1+j_1,\, i_2+j_2} + y_{i_1+j_1-2r-1,\, i_2+j_2} \right).

The condition which guarantees convergence is r = h^{s-1} with 0 < s < 0.66. An independent analysis of a different class of methods for higher-order numerical differentiation of multi-dimensional observational data can be found in Muller [6, p. 77].

Compared with numerical differentiation, which is local and sensitive to the presence of observational errors in the given data, numerical integration, which is non-local, is normally well-posed, in the sense that any numerical quadrature estimate of the value of the integral is not sensitive to random perturbations in the evaluation of the integrand. However, in higher dimensions, because of the non-local character of integration, an exponentially large number of function evaluations must be performed, as the dimension increases, in order to achieve a given accuracy for the numerical estimate of the integral. On the other hand, the stability of the resulting estimate, even in the presence of observational errors, is automatically guaranteed because of the well-posedness of integration. This need for an excessive number of function evaluations is often referred to as the "Curse of Dimensionality" (cf. [7, p. 1-2]). For example, consider the integral (the following notation I, used in [7, p. 1-2], has no connection with its use in the sequel as an index set)

I = \int_{[0,1]^d} f(x)\, dx

over the d-dimensional unit cube, where dx = dx_1 ⋯ dx_d. Classical multi-dimensional integration rules for the integral I use Cartesian products of one-dimensional rules and take, for uniform grids, the form

I_h = h^d \sum_{j \in \{0, ..., n\}^d} w_j y_j.

The most popular product rule is the one based on the trapezoidal rule. The approximation error of the trapezoidal rule is O(h^2), if all second partial derivatives ∂^2 f/∂x_i^2 are continuous. Because the total number of function evaluations required for this approximation is N = (n+1)^d, the approximation error for the product trapezoidal rule is O(N^{-2/d}). Consequently, for a given scheme, in order to achieve a 10 times decrease in the current approximation error, one requires a 10^{d/2} increase in the number of function evaluations. This yields an explicit characterization of the mentioned "Curse of Dimensionality".

Note. In other contexts, such as data smoothing (cf. Hastie and Tibshirani [4]), the increase in function evaluations associated with the increase in dimension is also viewed as a "curse".
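The cost implied by this characterization is easy to tabulate. The following short Python sketch (an illustration added here, not part of the original analysis) evaluates the factor 10^{d/2} by which the number of function evaluations of the product trapezoidal rule must grow in order to gain one extra decimal digit of accuracy:

```python
# Growth in function evaluations needed by the product trapezoidal rule
# (error ~ N^(-2/d)) to reduce the approximation error by a factor of 10.
for d in (1, 2, 3, 5, 10):
    growth = 10 ** (d / 2)   # 10^(d/2), from error = O(N^(-2/d))
    print(f"d = {d:2d}: evaluations must grow by a factor of {growth:,.0f}")
```

Already for d = 10, one extra digit of accuracy costs a factor of 100,000 in function evaluations, which is the "Curse of Dimensionality" in concrete terms.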


In numerical differentiation, there is no curse. This is a direct consequence of the local nature of differentiation. In fact, in order to compute the derivative of a function at a certain point, only the behaviour of that function in the neighbourhood of that point needs to be known. For integration, however, the behaviour of the function throughout the domain of integration has an influence on the value of the integral. The interesting, even in some ways amazing, fact is that some of the "Curse of Dimensionality", associated with numerical integration, can help to "Cure" the "Curse of Ill-Posedness", associated with the evaluation of partial derivatives of multi-dimensional observational data. The ramifications of this observation are the focus of this paper.

2. First Order Differentiation of Two-Dimensional Data

2.1. The One-Dimensional Averaging. By analogy with the model for observational data introduced in equation (1), it is assumed that the given two-dimensional observational data have the form

(5)    y_{i,j} = f_{i,j} + ε_{i,j},   h = 1/n,   f_{i,j} = f(ih, jh),   i, j ∈ {0, 1, ..., n},

where the errors ε_{i,j} are taken to be identically distributed, independent random variables with mean 0 and variance σ^2. In addition, it is assumed throughout this section that f ∈ C^3(R^2). Even though the errors depend on two spatial parameters i and j, they are only assumed to be sampled from a one-dimensional distribution, since each data point (observation, measurement) y_{i,j} of f_{i,j} is performed in the same manner, independently of the spatial position defined by (ih, jh).

A standard approximation for the partial derivative ∂f(x, y)/∂x at the grid-points uses the mid-point rule

(6)    \dot{f}^{(mid)}_{i,j} = (f_{i+1,j} - f_{i-1,j}) / (2h).

These formulas are defined for all i, j, except in the vicinity of the boundary of the region on which the differentiation is being performed, where different rules must be applied depending on the regularity of f in the neighbourhood of the boundary. If one applies this mid-point rule to the data of equation (5) instead of the function values f_{i,j}, then one obtains

(7)    \hat{\dot{f}}^{(mid)}_{i,j} = (y_{i+1,j} - y_{i-1,j}) / (2h)
(8)        = (f(ih+h, jh) - f(ih-h, jh)) / (2h) + (ε_{i+1,j} - ε_{i-1,j}) / (2h)
(9)        = ∂f(ih, jh)/∂x + h^2 ∂^3 f(ih + s_i h, jh)/∂x^3 + η_{i,j},   s_i ∈ (-1, 1).

Though the errors η_{i,j} = (ε_{i+1,j} - ε_{i-1,j})/(2h) will be normally distributed random variables with zero mean and variance h^{-2} σ^2 / 2, they will not necessarily be independent (e.g. η_{i+1,j} and η_{i-1,j} are not independent, because they involve the common factor ε_{i,j}). However, the lack of independence is a minor matter compared with the value of the variance, which becomes arbitrarily large as the step-size h tends to zero, since σ^2 is fixed (even if quite small). Consequently, when evaluated on the observational data y_{i,j}, the mid-point formula (6) fails to yield a convergent approximation to the partial derivative ∂f(ih, jh)/∂x as the step-size h tends to zero. This illustrates why standard finite difference formulas for partial differentiation will tend to be unstable when evaluated on observational data. The goal of this paper is the construction of stable alternatives to the standard formulas.

The essence of the strategy to be adopted is encapsulated in the expression (8). The approximation error, determined by the second term on the right hand side of (9), is O(h^2), and, consequently, is very small when h is small. Thus, the aim is to identify strategies which allow the error associated with η_{i,j} to be decreased at the expense of an increase in the approximation error. This kind of trade-off actually lies at the

heart of any regularization method used to solve improperly posed problems. It is also the basis for the variance-bias interpretation of data smoothing. For example, in standard numerical analysis texts, where it is tacitly assumed that one has control over the choice of the step-length h, such a trade-off is achieved by defining the optimal choice h* of h to be the value which guarantees that

h^2 ∂^3 f(ih + s_i h, jh)/∂x^3 = η_{i,j}.

However, this strategy is limited to situations where the variance of η_{i,j} is relatively small, such as occurs in exact numerical situations where the only error is computer rounding error. Consequently, in a deeper technical sense, it avoids the real issue as to how such a trade-off can be achieved as h → 0. It is this aspect which is the focus of this paper. In fact, the key question being examined is:

"For given data, where the size of the step-length h has already been determined, how does one perform the numerical differentiation in order to fully utilize all the available data and achieve the type of trade-off mentioned above?"

A similar interpretation holds for the kernel smoothing methods applied to the numerical differentiation of observational data by statisticians (cf. Wand and Jones [8]).

If one keeps j fixed, numerical partial differentiation reduces to the classical one-dimensional situation, as is clear from equation (8). Stable finite difference formulas for first order differentiation, based on averaging, have been proposed by Anderssen and de Hoog [2]. Such formulas achieve stability by averaging the values obtained from the application of the mid-point rule on a variety of different sized grids. In particular, consider the following one-parameter family of mid-point rules

\dot{f}^{(mid)}_{i,j}[s] = (f_{i+s,j} - f_{i-s,j}) / (2sh),   s = 1, 2, ....

If one applies these mid-point rules to the data of equation (5) instead of the function values f_{i,j}, then they generate a one-parameter family of identically distributed normal random variables

\tilde{\dot{f}}^{(mid)}_{i,j}[s] = (y_{i+s,j} - y_{i-s,j}) / (2sh) = (f_{i+s,j} - f_{i-s,j}) / (2sh) + (ε_{i+s,j} - ε_{i-s,j}) / (2sh),   s = 1, 2, ...,

with mean ∂f(ih, jh)/∂x + O(s^2 h^2) and variance 2σ^2/(2sh)^2. These random variables are independent because, for s = 1, 2, ..., each is evaluated on different data y_{i,j} from equation (5). Consequently, for each i and j, one can average the mid-point estimates \tilde{\dot{f}}^{(mid)}_{i,j}[s] to obtain the following random variable

(10)    \bar{\dot{f}}^{(mid)}_{i,j} = \frac{1}{r} \sum_{s=1}^{r} \tilde{\dot{f}}^{(mid)}_{i,j}[s]
(11)        = \frac{1}{r} \sum_{s=1}^{r} \frac{f_{i+s,j} - f_{i-s,j}}{2sh} + \sum_{s=1}^{r} \frac{ε_{i+s,j} - ε_{i-s,j}}{2rsh}.

Its mean and variance are given, respectively, by

\frac{1}{r} \sum_{s=1}^{r} \left( ∂f(ih, jh)/∂x + O(s^2 h^2) \right),        2σ^2 \sum_{s=1}^{r} \frac{1}{(2rsh)^2}.

This is the approach pursued by Anderssen and de Hoog [2], though in greater generality than indicated here. Among other things, they showed that, if r behaves like O(h^{-α}) with 2/3 < α < 1, then, asymptotically, this average has mean \dot{f}_{i,j} and bounded variance.
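To illustrate the behaviour just described, the following Python sketch applies the plain mid-point rule (7) and the averaged rule (10)-(11) to noisy samples of a smooth function. The test function, the noise level, and the particular exponent used for r are illustrative assumptions made only for this demonstration; they are not prescriptions from the paper.

```python
import numpy as np

# Illustrative sketch (not from the paper): plain mid-point rule (7) versus the
# step-size-averaged estimate (10)-(11) on noisy samples of a smooth function.
rng = np.random.default_rng(0)

n = 2000                              # grid 0, h, 2h, ..., 1 with h = 1/n
h = 1.0 / n
x = np.arange(n + 1) * h
sigma = 1e-3                          # i.i.d. N(0, sigma^2) measurement errors
y = np.sin(x) + sigma * rng.standard_normal(x.size)

i = n // 2                            # estimate f'(x_i) at an interior point
r = int(round(h ** -0.75))            # r = O(h^{-alpha}); alpha = 3/4 chosen arbitrarily

# Plain mid-point rule on the data: swamped by noise of size O(sigma / h).
plain = (y[i + 1] - y[i - 1]) / (2 * h)

# Average of mid-point rules with step sizes s = 1, ..., r, as in (10)-(11).
s = np.arange(1, r + 1)
averaged = np.mean((y[i + s] - y[i - s]) / (2 * s * h))

print(f"true derivative     : {np.cos(x[i]): .6f}")
print(f"plain mid-point (7) : {plain: .6f}")
print(f"averaged rule (10)  : {averaged: .6f}   (r = {r})")
```

Running the sketch shows the plain mid-point value fluctuating on the order of σ/h, while the averaged value stays close to cos(x_i), which is exactly the stabilization effect analysed above.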


Subsequently, Anderssen et al. [1] examined, again for one-dimensional differentiation (but extended to orders higher than the first), "spatial neighbourhood averaging". Here, one first chooses a particular member of the one-parameter family by specifying the value of s. One then evaluates this mid-point formula on spatially adjacent grids as

\tilde{\dot{f}}^{(mid)}_{i+k,j}[s] = (y_{i+k+s,j} - y_{i+k-s,j}) / (2sh),   k = -r, ..., r,

and averages them to generate the following random variable

(12)    \bar{\dot{f}}^{(mid)}_{i,j} = \frac{1}{2r+1} \sum_{k=-r}^{r} \tilde{\dot{f}}^{(mid)}_{i+k,j}[s]
(13)        = \frac{1}{2r+1} \sum_{k=-r}^{r} \frac{f_{i+k+s,j} - f_{i+k-s,j}}{2sh}
(14)          + \sum_{k=-r}^{r} \frac{ε_{i+k+s,j} - ε_{i+k-s,j}}{2(2r+1)sh}.

As long as r is less than the value of s, the individual random variables, corresponding to each mid-point formula, will be independent for k = -r, ..., r, since each of the data values y_{i,j} entering the formulas \tilde{\dot{f}}^{(mid)}_{i+k,j}[s], k = -r, ..., r, only appears once. Consequently, because the mean and variance of these individual spatial mid-point formulas are, respectively, ∂f((i+k)h, jh)/∂x + O((s+1)^2 h^2) and 2σ^2/(2(s+1)h)^2, it follows that their average is a normally distributed random variable with mean and variance given by

\frac{1}{2r+1} \sum_{k=-r}^{r} \left( ∂f((i+k)h, jh)/∂x + O(s^2 h^2) \right),        \sum_{k=-r}^{r} \frac{2σ^2}{(2(2r+1)sh)^2}.
This step clearly indicates the additional technical considerations introduced by such spatial-neighbourhood averaging. In [1], as indicated above, one averages over different estimates of the same derivative, whereas, in this generalization, one is averaging over the same estimate of spatially-adjacent derivatives. The variance of the resulting random variable is easily computed (cf. Finney [3, Section 5.8]) to be

\frac{2σ^2}{(2r+1)(2sh)^2}.
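For orientation (a short check added here, not part of the original argument): when r < s the 2r+1 mid-point estimates entering (12) are built from disjoint sets of data values, hence are independent, and each has the variance 2σ^2/(2sh)^2 computed earlier for a single mid-point rule with spacing s. The variance of their average is therefore

\frac{1}{(2r+1)^2} \sum_{k=-r}^{r} \frac{2σ^2}{(2sh)^2} = \frac{2σ^2}{(2r+1)(2sh)^2},

i.e. a reduction by the factor 2r+1 relative to a single estimate.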

The evaluation of the approximation error can be more complex, as one must average over the approximation errors arising from each spatial contribution, as well as average over the discretization errors. In particular, because of the standard Taylor series result that

∂f((i+k)h, jh)/∂x + ∂f((i-k)h, jh)/∂x = 2 ∂f(ih, jh)/∂x + ∂^3 f((i+θk)h, jh)/∂x^3 · O((kh)^2),   -1 < θ < 1,

the mean of \bar{\dot{f}}^{(mid)}_{i,j} can be rewritten as

\frac{1}{2r+1} \sum_{k=-r}^{r} ∂f(ih, jh)/∂x \left( 1 + O(k^2 h^2) \right) + \frac{1}{2r+1} \sum_{k=-r}^{r} O(s^2 h^2).

Consequently, the averaging of the individual approximation errors generates two terms, which will be referred to as the "averaging error" and the "averaged discretization error". In [1], as indicated above, the averaging is arranged so that one only has to analyse an averaged discretization error, whereas, in the spatial neighbourhood averaging being examined here, one will have both. Because 0 ≤ |k| ≤ r and

\sum_{k=1}^{r} k^2 ≤ r^3, the averaged discretization error will dominate, and the mean of \bar{\dot{f}}^{(mid)}_{i,j} becomes, with r < s,
∂f(ih, jh)/∂x + O(s^2 h^2),

if it is assumed, as indicated earlier, that ∂^3 f((i+θk)h, jh)/∂x^3 is bounded. Thus, the mean and the variance of the random variable \bar{\dot{f}}^{(mid)}_{i,j} take, respectively, the form

(15)    ∂f(ih, jh)/∂x + O(s^2 h^2),        \frac{2σ^2}{(2r+1)(2sh)^2}.

From the point of view of subsequent deliberations, it is important to note, at this stage, that the averaging has a linear effect on the approximation error associated with the accuracy of the mean, and a quadratic effect on the value of the variance. In fact, it is the exploitation of this difference which allows the type of results, given below, to be derived. The variance of this averaged midpoint estimate is approximately 0.125 r^{-3} times the variance of the standard midpoint estimate of equation (7). If r is taken to be of order O(h^{-2/3}), this averaged midpoint value is stable, as there is no growth of the variance as h → 0. The price paid to achieve this stability is an increase in the approximation error, resulting from the use of mid-point formulas with a stepsize sh as well as from the averaging. Together, these two processes induce an approximation error of order O(h^{14/19}) [1].

2.2. Two-Dimensional Averaging. For the two-dimensional data y_{i,j} (where j varies as well as i), the key observation is that, in estimating the partial derivative ∂f/∂x, one is not limited to doing the spatial-neighbourhood averaging only in the x-direction, as examined and motivated above. One is now free to also do the spatial neighbourhood averaging in the second dimension; i.e., with respect to some specified choice of s, average the mid-point estimates

\tilde{\dot{f}}^{(mid)}_{i,j+k}[s] = (y_{i+s,j+k} - y_{i-s,j+k}) / (2sh),   k = -r, ..., r.

The clear advantage of this step is that it removes the problem of error correlation, because there is no duplication of the data values entering the above family of mid-point estimates. In this way, there is no longer a need to demand that r < s. For example, one could construct the estimate of \dot{f}_{i,j} as the average

(16)    \bar{\dot{f}}^{(mid)}_{i,j} = \frac{1}{2r+1} \sum_{k=-r}^{r} \tilde{\dot{f}}^{(mid)}_{i,j+k}[1].

For each i, j, this defines a normally distributed random variable with mean and variance given, respectively, by

\frac{1}{2r+1} \sum_{k=-r}^{r} ∂f(ih, (j+k)h)/∂x + O(h^2),        \frac{σ^2}{2(2r+1)h^2}.
Here, however, one finds, on appealing to standard Taylor series results of the type given above, that the averaging error dominates, and the value of the mean becomes

∂f(ih, jh)/∂x + O(r^2 h^2).

The effect of this type of averaging is to reduce the variance associated with the standard mid-point formula (7) by a factor of O(r^{1/2}). The cost for this reduction has been achieved at the expense of the size of the approximation error, which has been increased by a factor of order O(r^2 h^2). But, because stability can only be guaranteed if r is of order h^{-2}, the size of the corresponding approximation error
increases to become of order O(h^{-2}). Consequently, this scheme is only convergent when f is only a function of the second variable. It therefore follows that the larger stepsize in the scheme which generates \bar{\dot{f}}^{(mid)}_{i,j} plays a crucial role. It is needed not only to guarantee the independence of the random variables which are averaged, but also to guarantee its simultaneous convergence and stability. Introducing a larger stepsize into the chosen mid-point formula which is spatially neighbourhood averaged in the second dimension, one obtains the following random variable estimate for \dot{f}_{i,j}

(17)    \hat{\dot{f}}^{(mid)}_{i,j} = \frac{1}{2r+1} \sum_{k=-r}^{r} \frac{y_{i+s,j+k} - y_{i-s,j+k}}{2sh}.
For each i, j, this random variable has mean and variance given, respectively, by

\frac{1}{2r+1} \sum_{k=-r}^{r} \left( ∂f(ih, (j+k)h)/∂x + O(s^2 h^2) \right),        \sum_{k=-r}^{r} \frac{2σ^2}{(2(2r+1)sh)^2},
@f (ih; jh) + O(r h ) + O(s h );  : @x 2(2r + 1)s h Stability is guaranteed if rs is of the order of h? . As well as the averaged discretization error O(s h ), associated with the chosen mid-point formula, the spatial neighbouring averaging generates an averaging error of order O((rh) ). For small h, one aims to balance the averaging and the averaged discretization error. Because, if one dominated, the other could be increased without much loss of accuracy, but an improvement in the stability. Consequently, r and s are chosen to be of order O(h? = ). Contrary to what might have been anticipated, this type of averaging does not improve on the spatial neighbourhood averaging in the x-direction. One still has a method with an accuracy of O(h = ). The next possibility combines the two previous approaches. The resulting spatial neighbourhood averaged midpoint rule takes the form r2 r1 X X ^_ 1 yi m s; j k ? yi m?s; j k ; r < s: f i;j = (2r + 1)(2r + 1) (18) 2sh k ?r1 m ?r2 2

2

2

2

2

2

2

2

2

2

2

2

2 3

2 3

+

1

2

+

+

+

+

1

=

=

The constraint r < s ensures that no correlated errors are generated. The mean and variance of this normally distributed random variable estimate of f_i;j are given, respectively, by 1





r2 r1 X X 1 @f ((i + m)h; (j + k)h) + O(s h ) = @f (ih; jh) + O((r + r + s )h ); (2r + 1)(2r + 1) k ?r1 m ?r2 @x @x 1

2

=

2

2

2 1

2 2

2

2

=

and

2 : (2r + 1)(2r + 1)(2sh) Consequently, stability is guaranteed if r r s is of order h? . For the reasons outlined above, r , r and s are chosen to have the same order, which, in the current situation, implies r  r  s  O(h? = ). Thus, as a direct result of the additional spatial neighbourhood averaging in the y-direction, one obtains a stable scheme which has an accuracy of O(h). In this way, dimensionality has become a blessing for numerical di erentiation, because it has allowed one to construct a stable scheme with an improved accuracy over that obtainable from the standard schemes. 2

1

1 2

2

2

2

2

1

1

2

2

1 2

8

R. ANDERSSEN AND M. HEGLAND

^ 2.3. Implementation. The estimate f_i;j can be evaluated in two separate ways. On the one hand, one rst computes the di erences ui m;j k := yi m s; j k2?shyi m?s; j k ; k = ?r ;    ; r ; m = ?r ;    ; r ; and then averages them to obtain r1 X r2 X ^_ 1 f i;j = (2r + 1)(2r + 1) ui m; j k : k ?r1 m ?r2 +

+

+

+

+

+

1

+

1

+

2

1

2

2

+

t = s;

t m; j +k ;

+ +

=

2

=

=

On the other hand, one rst constructs the averages r2 r1 X X 1 yi wi t;j := (2r + 1)(2r + 1) k ?r1 m ?r2 +

1

=

and then computes the di erence

^ f_i;j = wi s; j2?shwi?s; j : For the evaluation of a single derivative, the rst implementation involves 2(2r + 1)(2r + 1) ? 1 additions (subtractions) and (2r + 1)(2r + 1) divisions, while the second involves the same number of additions, but only 5 divisions. Consequently, if divisions should be minimized, then the second implementation is always the preferred option. On the other hand, because one is dividing by terms which do not dependent on the speci c locations of the data points, both can be rewritten so that the division by 2(2r + 1)(2r + 1)sh is performed as the last step. In this way, the goal reduces to one of minimizing the number of additions, and, hence, the minimization of the number of times that the expensive averaging-step must be performed. If only a single partial derivative is required, then both schemes are equivalent, since they both require the averaging to be performed twice. However, the situation changes when more complex partial derivatives are required over an array of locations, which is a commonly occurring situation in various applications including, for example, the evaluation of a two-dimensional velocity eld. In fact, the results are slightly counter-intuitive. Consider an N  M (square lattice) array of locations, over which some speci ed partial di erential operator is required. For the gradient rf = (@f=@x ; @f=@x ), the rst procedure reduces to initially computing the di erences ui;j := yi s1; j ? yi?s1; j and vi;j := yi; j s2 ? yi; j?s2 ; on the N  M array, and then evaluating the gradient using the two averaging steps r1 X r2 X 1 (fx1 )i;j := u 2(2r + 1)(2r + 1)s h k ?r1 m ?r2 i m; j k +

1

1

1

1

2

2

2

+

+

1

2

1

+

=

+

=

and

r1 X r2 X 1 vi m; j k : (fx2 )i;j := 2(2r + 1)(2r + 1)s h k ?r1 m ?r2 This procedure therefore involves 2NM applications of the averaging step. 1

2

2

+

=

=

+

2

FOR NUMERICAL DIFFERENTIATION, DIMENSIONALITY CAN BE A BLESSING!

9

On the other hand, the second procedure reduces to initially performing the following summations for the averaging

wi;j :=

r r X X 2

1

yi

m;j +k ;

+

k=?r1 m=?r2

on the N  M array, and then evaluating the gradient as (fx1 )i;j := wi s1; j ? wi?s1; j 2(2r + 1)(2r + 1)s h and (fx2 )i;j := wi; j s2 ? wi; j?s2 : 2(2r + 1)(2r + 1)s h Typically, since it only involves NM applications of the averaging step, the second procedure is (approximately) twice as fast for the numerical evaluation of the gradient computation as the rst. For the computation of the divergence F = r:(f; g) = (@f=@x + @g=@x ), it is the rst procedure which has the better complexity. One starts with the two functions f and g, and then computes derivatives which are added to give the scalar result. For the rst procedure, one combines the computation of the di erences and their addition in order to determine the values F^i;j := fi s; j ? fi?s; j + gi; j s ? gi; j?s; +

1

2

1

2

2

+

1

1

+

2

+

on the N  M array, and then evaluates the divergence as

r2 r1 X X 1  Fi;j := 2(2r + 1)(2r + 1)sh F^i k ?r1 m ?r2 1

m; j +k ;

+

2

=

=

which only involves NM applications of the averaging step. In order to reduce the number of divisions to one, which is performed as the last step, one must choose s = s = s. For the second procedure, twice the number of applications of the averaging is required. One initially computes the summations 1

wi;j := and

zi;j :=

r r X X 1

2

fi

m; j +k

gi

m; j +k ;

k=?r1 m=?r2 r r X X 2

1

k=?r1 m=?r2

2

+

+

on the N  M array, and then evaluates the divergence as j ? wi?s; j + zi; j s ? zi; j ?s Fi;j := wi s; 2(2 : r + 1)(2r + 1)sh The evaluation of the curl (i.e. @f=@x ? @g=@x proceeds in exactly the same way as for the divergence. Remark. This ability to remove the division to the last step, and, thereby, reduce complexity considerations to the number of averaging steps performed, ceases, if the grid does not remain a square lattice. +

+

1

2

2

1

10

R. ANDERSSEN AND M. HEGLAND

3. Higher Order Derivatives and higher dimensions { Stability and Convergence 3.1. The Basic Theory. The basic method of the previous section is now generalised to to the numerical evaluation of the following q-th order (homogeneous) di erential operator X @q f (x) Df (x) = c @x ; j j q p

p

p =

where the function f :  R d ! R is assumed to have sucient smoothness and the c are constants. The approximation is computed from the following set of observed function values y = f (jh) +  ; h = 1=n; j = (j ; : : : ; jd ) 2 I = f0; 1; : : : ; ngd; where I denotes the index set of the locations of the data, hI  , and the observational errors  are normally distributed independent random variables with expectation 0 and variance  . The multidimensional data array with components y will be denoted by y, while the linear space of all such multidimensional data arrays will be denoted by R I . Furthermore, let f = f (jh); h = 1=n; j = (j ; : : : ; jd ) I and f 2 R denote the multidimensional array containing the exact function values. The Laplacian d X Df = 4f = @@xf i i is an example of such a di erential operator. The generalization consists of two steps; namely, an averaging step followed by a di erentiation step. In order to simplify the algebra, it is assumed that y has been extended periodically to Zd. Let V = f?r ; : : : ; r g      f?rd ; : : : ; rdg, where ri  0 are given integers, de ne the index set of the template on which the averaging is performed, and MV denote the standard averaging operator on this template, where each data point averaged has equal weight. The index set V  I will be called the support of MV . A smooth estimate fb of f is generated by applying the averaging operator MV to the data y X fb = (MV y) = jV1 j y ? ; j 2 I; (19) 2V where jV j denotes the size (number of members) of the index set V . Because of the way in which V has been de ned, MV takes the form of a convolution. Since, in general, f is not periodic, the estimates fb will only be a good approximations of f when j 2 I 0, where I 0 is the index set of the array of averaged values; i.e., the maximal set such that I 0 + V  I: For the error analysis, presented below, the second derivative of f will be required. The Hessian of f will be denoted by  @ f (x)  Hf (x) = @x @x ; p

j

1

j

2

j

j

1

j

2

2

=1

1

1

j

j

j

j

j

k

k

j

j

2

i

j i;j =1;:::;d

and its matrix 2-norm by kHf (x)k. The second moment of the characteristic function of V is de ned by d X X 1 1  (V ) = jV j kkk = 3 ri (ri + 1)  32 krk i 2V where kkk = k +    + kd . 2

2

k

2

2 1

2

2

=1

FOR NUMERICAL DIFFERENTIATION, DIMENSIONALITY CAN BE A BLESSING!

11

Proposition 1. If f 2 C ( ); f = f (jh ), and f = MV f , then f ? f  h kHf (z)k  (V ); j 2 I 0 where jzi ? ji hj  hri ; i = 1; : : : ; d. Proof. From Taylor's theorem, one obtains f ? = f ((j ? k)h) = f (jh) ? hrf (jh)T k + h kT Hf ( 0)k for some  2Pjh + hV . Since V is symmetric, averaging over V eliminates any linear expression in k. In particular, 2V k = 0. The proposition is proved by applying MV to the the standard bound jkT Hf (z)kj  kkk kHf (z0)k and then invoking the convexity of the averaging performed by MV , the continuity of the L -norm (equivalent to the largest singular value), and the Bolzano intermediate value theorem for kHf (z0)k. This approximation error can be viewed as the undesirable side-e ect of the averaging. On the other hand, its advantage is the potential reduction it can generate in the variance of the random errors in the data. The nature of this reduction is encapsulated in the following proposition. Proposition 2. If the errors  are i.i.d. N (0;  ) random variables, (MV ) are N (0;  =jV j) random variables. Furthermore, if j ? j0 62 2V , then the random variables (MV ) and (MV ) are independent. Proof. The variance reduction is standard. Furthermore, (MV ) depends only on  for i 2 j + V . Consequently, if j ? j0 62 2V , then (MV ) and (MV ) depend on disjoint sets of components of , and, therefore, are independent. The condition j ? j0 62 2V is equivalent to condition jji ? ji0 j  2ri + 1 for i = 1; : : : ; d. The numerical evaluation of the derivative Df on the data y at the point j can be formulated as the application of a linear operator A[s] on R I de ned by X (A[s] y) = ? y 2

j

j

j

2

j

2

2

k

k

2

2

2

j

2

j

j0

j

j

i

j0

j

j

2I

j

i

i

i

where 1. the coecients are assumed to be periodic with period I , 2. with respect to some given index set s = (s ; : : : ; sd) 2 N d , which characterises the support (spacing) of the points at which the operator A[s] acts, the components are only nonzero, if, for ki 2 Z; i = 1; : : : d, i = (k s ; : : : ; kdsd) 3. the di erence approximation A[s] approximates the di erential operator D in the sense that, if f 0 = (Df )(jh) and f_ = A[s] f , then (20) jf_ ? f 0j  h ksk C (f ) where C (f ) depends on the values of certain q + 2nd partial derivatives of f . As the are periodic, the linear operators A[s] correspond to convolutions. One is now in a position to combine the numerical di erentiation operator A[s] with the averaging operator MV , in order to construct stable numerical di erentiation rules. In fact, because MV is also a convolution, it commutes with A[s]; i.e., MV A[s] = A[s]MV : Examples of such numerical di erentiation rules include the well-known midpoint rules discussed above in Section 2. The following proposition yields a bound on the approximation error of the averaged nite di erence. i

1

i

1 1

j

j

j

2

j

2

12

R. ANDERSSEN AND M. HEGLAND

Proposition 3. Let MV be the averaging operator with V = f?r ; : : : ; r g      f?rd; : : : ; rdg, A[s] be a di erence operator, and let f_ = A[s]MV f . Furthermore, let f 0 = (Df )(jh). If the function f is such that Df 2 C , then jf_ ? f 0j  h ksk C (f ) + 32 h krk kHDf ( )k; j 2 I 0; where I 0  I is the index set of the array of averaged values. 1

1

j

2

2

j

2

2

2

j

Proof. From the triangle inequality, one obtains jf_ ? f 0j  jf_ ? w j + jw ? f 0j; where w = Mv f 0 . The rst term on the right side can be written as (MV u) , where u = A[s]f ? f 0 . The values ju j of the components of the multidimensional array u are bounded by h2 ksk2C (f ). Furthermore, the averaging operator is uniformly bounded, as can be proved by applying Schwartz' inequality. In particular, if ju j  c for some constant c, then j(MV u) j  c, since X (MV u)2  1 2 jV j u2+ : j

j

j

j

j

j

j

j

j

j

jV j

j

j

2V

k

k

Thus, the e ect of the averaging MV u can be bounded by jfb_ ? w j = j(MV u) j  h ksk C (f ): For the estimation of the second term in the triangle inequality, one uses the smoothness of f (i.e., Df 2 C ) and Proposition 1 to obtain jw ? f 0j  h kHDf (v)k  (V ); j 2 I 0 where the choice of v is characterized in Proposition 1. The required bound of this proposition is now an immediate consequence of these two results. One now turns to the estimation of bounds for the ampli cation of the measurement errors. A rst lemma shows that, if the \spacing" s of the nite di erence formula is suciently large then the error ampli cation is essentially the 2-norm of the coecients of the di erence stencil (template). Lemma 1. Let MV be the averaging operator with V = f?r ; : : : ; r g      f?rd; : : : ; rdg and A[s] be a di erence operator. If  are independent normally distributed random variables with expectation 0 and variance  and if si  2ri + 1 then the  := (A[s]MV ) are normally distributed random variables with expectation 0 and variance X var( ) = jV j = jV j kA[s]ek ; 2I where e is the array with all its components zero except for the origin j = 0, where e = 1. Proof. Let  = MV . Because MV and A[s] commute, it follows that  = MV A[s] = A[s]MV  = A[s]: The linearity of the operators involved implies that E ( ) = 0, where E () denotes the expectation operator. Thus, one obtains that X var( ) = E ( ) = ? ? E (  ): j

j

2

j

2

2

2

j

2

j

1

j

2

j

1

j

2

2

2

j

2

i

i

j

j

2

j

; 2I

i k

j

i

j

k

i

k

FOR NUMERICAL DIFFERENTIATION, DIMENSIONALITY CAN BE A BLESSING!

13

The only terms which contribute to the sum are the ones for which j ? i = N s and j ? k = N s, and, for these terms, k ? i = N s, where the Nt = diag(n ; : : : ; nd); ni 2 N , t = 1; 2; 3 denote integer matrices. Furthermore, for k ? i 62 2V , E (  ) = 0. Since, for si  ri + 1, the only point N s which is an element of 2V is 0, the result of the lemma follows. The next lemma yields a bound for the error ampli cation. The condition on A[s] guarantees the validity of the intermediate mean-value theorem for di erentiation, and holds for all standard nite di erence formulas. Lemma 2. Let A[s] be a di erence operator which satis es the consistency condition that, for all periodic f with continuous Df , there exists, for every k 2 I , a vector z 2 such that [A[s]f ] = Df (z). Then 1

3

2

1

i

k

k

kA[s]ek  (2)q h?(q+d=2) Proof. Let f (x) =

Qd f (x ) with i i i =1

X

jpj=q

jc j p

Yd i=1

si? pi (

=

+1 2)

:

n=si ? p s i X fi (xi) = n exp(?2 ?1 k xi ): k 1

=1

C 1.

It follows that f is periodic and Furthermore, fi(si h k) is one, if k = 0, and zero otherwise; i.e., f (x) is a Lagrange function for the grid points given by N s. This proves that the function f interpolates e. Consequently, if f = f (jh) then A[s]f = A[s]e and, for every k, there is a z such that (mean value theorem) [A[s]e] = Df (z): This proves that f interpolated e. Since the derivatives of the component functions fi are bounded by dpi f (x )  2n pi i pi i  dxi si it follows that, since n = 1=h, X Yd  2 pi jDf (z)j  jc j hsi : i j j q j

k

p

p =

=1

Because these bounds are independent of z, and because only di erent from zero, one obtains the bound

kA[s]ek 

Yd n ! = X 1 2

i=1 si

jpj=q

jc j p

Qd

n i=1 si

of the components of A[s]e are

Yd  2 p

i

i=1

hsi

:

Rearranging terms yields the given bound. Proposition 4. Let MV be the averaging operator with V = f?r ; : : : ; r g      f?rd; : : : ; rdg and A[s] be a di erence operator. In addition, let  be i.i.d. normal random variables with expectation 0 and variance  , and  = A[s]MV . Furthermore, assume that the consistency condition of the previous lemma holds for A[s]. If si  2ri + 1; 1

2

j

1

14

R. ANDERSSEN AND M. HEGLAND

then  are normally distributed random variables with expectation 0 and variance bounded by j

var( )  (2) q  h? 2

j

2

q

0 X Yd (2ri + 1)? @ jc j si? p i i j j q 0 1 d X Y d @ jc j (2ri + 1)? p A : Yd d

1

(2 + )

=1

2

=

( i +1 2)

p

1 A

=1

p =

2

 (2) q  h? 2

2

q

(2 + )

( i +1)

p

jpj=q

i=1

Proof. The proposition follows by combining the results of the two previous lemmas. A numerical di erentiation rule is said to be stable, if the error ampli cation is bounded. As a consequence of the previous proposition, one obtains Corollary 1. If, for all c 6= 0 with p = (p1; : : : ; pd ), there is a constant K such that p

Yd i=1

(2ri + 1)pi  Kh? q +1

d=2)

( +

holds, then the averaged di erentiation rules, de ned above, are stable. It is also convergent, if, for all i = 1; : : : ; d, there exist constants K1 and K2 and 1; 2 < 1 such that si  K1h? 1 ; and ri  K2h? 2 : Any realistic template (stencil) V must not be too elongated in any one of the coordinate directions. A natural constraint, which achieves this, is to require that =d r  r      r = jV 2j : On the other hand, from the earlier assumptions about the index set I it follows that n + 1 = jI j =d; and hence h  jI j? =d: In this way, the study of the stability and convergence of multi-dimensional numerical (partial) di erentiation reduces, for a given partial derivative and multi-dimensional space with dimension d, to specifying how the size of jV j must change relative to the size of jI j. For example, if pi = p; i = 1; 2;    ; d, then q = pd, It therefore follows from Corollary 1 that stability is guaranteed if 1

1

2

1

1

(2r)q d  Kh? q +

and, hence, that

d=2)

( +

jV j p  K 2? q ( +1)

 Kn q

d=2) ;

( +

d=2) jI jp+1=2:

( +

This represents a heuristic proof that stability is guaranteed if, for a suitable chosen  ,

jV j  jI j ; 0 <  < 1:

The admissible range of values of  , which guarantee both convergence and stability, now follows on combining this result with the estimate of Proposition 3.

FOR NUMERICAL DIFFERENTIATION, DIMENSIONALITY CAN BE A BLESSING!

15

3.2. Implementation. As for the gradient in two dimensions (cf. Section 2.3), the way the computations are performed can a ect the number of oating point operations required to evaluate multidimensional numerical derivatives. In order to quantify this, it is necessary to examine the complexity of the various implementations. For example, the evaluation of any partial derivative over the whole domain involves the formation of the double matrix vector product A[s]MV y. This is done in two stages which essentially consist of averaging and di erencing. The division associated with the averaging stage is moved into the di erencing stage. Consequently, the rst stage consists in forming the sum X u = y ? ; j 2 I 0: j

2V

j

k

k

These sums are only computed for j 2 I 0, as the function is not assumed periodic. (The evaluation of derivatives near the boundary will involve di erent approximations which is not discussed in this paper.) The actual evaluation of the double matrix vector product will involve jI 0j(jV j ? 1) additions, but no multiplications nor divisions. The di erence step reduces to the evaluation of X jV j? (A[s]u) = ( =jV j)u ? ; j 2 I 00: 1

j

2I

k

j

k

k

It is assumed that the coecients =jV j have been precomputed. The di erences are only evaluated on the set I 00 which guarantees that only components of u with k 2 I 0 are used. An essential observation is that the are only zero for a small subset of indices which shall be denoted by S . Thus, the di erence step requires jI 00j(jS j ? 1) additions and jI 00j  jS j multiplications. Except for the trivial precomputation step, mentioned above, no divisions are required. The total amount of operations for the formation of A[s]MV y is the sum of these two expressions. For very large data sets, the number of data in the vicinity of the boundary will be small when compared with the number in the interior of the domain I . Consequently, I 0 and I 00 are asymptotically of the same size as I , and the total number of operations required to form A[s]MV becomes jI j (jV j + jS j? 2) additions and jI jjS j multiplications. (In the sequel, only asymptotic estimates will be derived.) Furthermore, as discussed in the previous section, the size of V is related to that of I by the relationship form jV j = cjI j , where 0 <  < 1. Finally, on modern computers, oating point additions and multiplications require about the same number of operations, and so the total oating point operation count is asymptotically jI j(cjI j + 2jS j ? 2) = O(jI j ). An alternative way to proceed is to compute the action of A[s]MV on some given data y. Here, one turns to the application of the fast Fourier transform (FFT), where the complexity will be O(jI j log (I )). Thus, there is a crossover point de ned by an equation of the form jI j = C log (jI j) where C is a constant depending on the choice of V and how the FFTs are performed. Consequently, for larger I , the FFT approach will be more ecient. For this \indirect" method, a d-dimensional FFT will be required. An ecient implementation for a general d can be found in [5]. For the evaluation of a gradient, or Hessian, there will be several, say t, independent derivatives to be computed. If these derivatives are again required over the whole domain this reduces to the evaluation of a matrix vector product of the form (A [s]; : : : ; At [s])T MV y: If the averaging is performed rst on the scalar data, this will require (asymptotically) jI j  (jV j ? 1 + t(jS j ? 1)) additions and tjI j  jS j multiplications. On the other hand, if the di erence operators were applied rst and the results were all averaged independently, then the number of operations would be k

k

k

+1

2

2

1

16

R. ANDERSSEN AND M. HEGLAND

t times what is needed for the computation of one such partial derivative. Consequently, di erencing rst will be much more expensive than performing averaging rst. On the other hand, if a reduction operation takes place (such as occurs in the evaluation of a divergence), the di erencing should be performed rst; i.e., evaluate the formula as 2y 3 MV (A [s]; : : : ; At [s]) 4 ... 5 : yt Compared with gradient-like operations, an additional summation must be performed which involves (t ? 1)jI j additions. Consequently, one requires jI j (jV j + tjS j? 2) additions and tjI jjS j multiplications. When either jS j or jV j is large, FFT's, based on the discrete Fourier transform, can be applied to the stencil of the averaging. The stencil of the averaging, or multi-point di erencing method, can be represented by MVT A[s]T e : This stencil can be precomputed with little work. Because the supports for the di erentiation and the averaging do not overlap, the number of nonzero elements of the stencil is jS j  jV j. Consequently, the application of the stencil requires jS j  jV j ? 1 additions and jS j  jV j multiplications. Having computed the stencil, it can be applied to determine the derivative at a variety of points, and in particular, all points of I 00. Note, however, that for this the number of additions and multiplications required would be (asymptotically) jI j(jS j  jV j ? 1) and jI j  jS j  jV j. This is much more than the separate application of A[s] and MV uses. However, if the value of several derivatives at only one point is required then the number of additions and multiplications reduces to tjS j  jV j ? 1 and tjS j  jV j. A similar estimate holds if a set of derivatives derivatives are only required at a limited number of points As a conclusion, when implementing algorithms for the computation of derivatives, one should carefully consider the characteristics of the problem in order to choose an ecient method. In summary, if derivatives on the full range of points in the domain are to be computed, one fares best when the averaging and the di erencing are done separately. For the computation of a set of di erent derivatives of the same scalar function, one should consider averaging before di erentiating. If the result required is a linear combination of a set of derivatives of multiple functions, one should do the di erentiation rst and the averaging second. For large and complicated derivatives and large averages, multidimensional FFTs should be used for best performance. 1

1

1

References [1] R.S. Anderssen, F. de Hoog, and M. Hegland, A stable nite di erence ansatz for higher order di erentiation of non-exact data, Mathematics Research Report MRR96-023, ANU, School of Mathematical Sciences, 1996, ftp://nimbus.anu.edu.au:/pub/Hegland/andhh96.ps.gz. [2] R.S. Anderssen and F.R. de Hoog, Finite di erence methods for the numerical di erentiation of non-exact data, Computing 33 (1984), 259{267. [3] D.J. Finney, Statistics for mathematicians: and introduction, Oliver and Boyd, Edinburgh, 1968. [4] T.J. Hastie and R.J. Tibshirani, Generalized additive models, Monographs on statistics and applied probability, vol. 43, Chapman and Hall, 1990. [5] Markus Hegland, An implementation of multiple and multi-variate Fourier transforms on vector processors, SIAM J. Sci. Comp. 16 (1995), no. 2, 271{288. [6] H.G. Muller, Nonparametric regression analysis of longitudinal data, Lecture Notes in Statistics, vol. 46, Springer, 1987. [7] H. Niederreiter, Random number generation and quasi-Monte Carlo methods, SIAM, 1992. [8] M.P. Wand and M.C. Jones, Kernel smoothing, Monographs on statistics and applied probability, vol. 60, Chapman and Hall, 1995.