Oracle inequalities for inverse problems

L. Cavalier, G.K. Golubev, D. Picard and A.B. Tsybakov

June 23, 2000

Abstract

We consider a sequence space model of statistical linear inverse problems where we need to estimate a function $f$ from indirect noisy observations. Let a finite set $\Lambda$ of linear estimators be given. Our aim is to mimic the estimator in $\Lambda$ that has the smallest risk on the true $f$. Under general conditions, we show that this can be achieved by simple minimization of an unbiased risk estimator, provided the singular values of the operator of the inverse problem decrease as a power law. The main result is a nonasymptotic oracle inequality that is shown to be asymptotically exact. This inequality can also be used to obtain sharp minimax adaptive results. In particular, we apply it to show that minimax adaptation on ellipsoids in the multivariate anisotropic case is realized by minimization of an unbiased risk estimator, without any loss of efficiency with respect to optimal non-adaptive procedures.

Mathematics Subject Classifications: 62G05, 62G20

Key Words: Statistical inverse problems, Oracle inequalities, Adaptive curve estimation, Model selection, Exact minimax constants

CMI, Universite Aix-Marseille 1, 39 rue F. Joliot-Curie, F-13453 Marseille Cedex, France
Laboratoire de Probabilites et Modeles Aleatoires, Universites Paris 6 - Paris 7, 4 pl. Jussieu, BP 188, F-75252 Paris Cedex 05, France


1 Introduction

Let $A$ be a known linear operator on a Hilbert space $H$ with inner product $(\cdot,\cdot)$ and norm $\|\cdot\|$. Let $f \in H$ be an unknown function that we want to estimate from indirect observations
$$ Y(g) = (Af, g) + \varepsilon\,\xi(g), \quad g \in H, \qquad (1) $$
where $0 < \varepsilon < 1$ and $\xi(g)$ is a Gaussian random variable on a probability space $(\Omega, \mathcal{A}, P)$, with mean 0 and variance $\|g\|^2$, such that $E\{\xi(g)\xi(v)\} = (g, v)$ for any $g, v \in H$, where $E$ is the expectation w.r.t. $P$. The relation (1) defines a Gaussian white noise model. Instead of using all the observations $\{Y(g),\ g \in H\}$ it is usually sufficient to consider the set of values $\{Y(\psi_k)\}$, for some orthonormal basis $\{\psi_k\}_{k=1}^\infty$. Then $\xi(\psi_k) = \xi_k$ are i.i.d. standard Gaussian random variables. We assume that the basis $\{\psi_k\}$ is such that $(Af, \psi_k) = b_k\theta_k$, where $b_k > 0$ and $\theta_k = (f, \varphi_k)$ are the Fourier coefficients of $f$ w.r.t. some orthonormal basis $\{\varphi_k\}$ (not necessarily $\varphi_k = \psi_k$). A typical example where this occurs is when the operator $A$ admits a singular value decomposition of the form
$$ A\varphi_k = b_k\psi_k, \quad A^*\psi_k = b_k\varphi_k, \qquad (2) $$
where $A^*$ is the adjoint of $A$, $b_k$ are the singular values, $\{\psi_k\}$ is an orthonormal basis in $\mathrm{Range}\,A \subseteq H$ and $\{\varphi_k\}$ is the corresponding orthonormal basis in $H$. Under these assumptions (but also in some other situations) one has the equivalent discrete sequence observation model derived from (1):
$$ y_k = b_k\theta_k + \varepsilon\xi_k, \quad k = 1, 2, \dots, \qquad (3) $$
where $y_k$ stands for $Y(\psi_k)$. If $b_k > 0$ the model (3) can be written in the form
$$ X_k = \theta_k + \varepsilon\sigma_k\xi_k, \quad k = 1, 2, \dots, \qquad (4) $$
where $X_k = y_k/b_k$ and $\sigma_k = b_k^{-1} > 0$. This can also be viewed as a model with direct observations and correlated data (Johnstone (1999)). The sequence space formulation (3) or (4) for statistical inverse problems has been studied in a number of papers [see Johnstone and Silverman (1990), Korostelev and Tsybakov (1993), Koo (1993), Donoho (1995), Mair and Ruymgaart (1996), Golubev and Khasminskii (1999a,b), Johnstone (1999), Goldenshluger and Pereverzev (1999), Cavalier and Tsybakov (2000) among others].
For ill-posed inverse problems we have $b_k \to 0$ and $\sigma_k \to \infty$ as $k \to \infty$. Let $\hat\theta = (\hat\theta_1, \hat\theta_2, \dots)$ be an estimator of $\theta = (\theta_1, \theta_2, \dots)$ based on the data (4). Then $f$ is estimated by $\hat f = \sum_k \hat\theta_k\varphi_k$. The mean integrated squared error (MISE) of $\hat f$ is
$$ R(\hat f, f) = E_f\|\hat f - f\|^2 = E_\theta\sum_k(\hat\theta_k - \theta_k)^2 = E_\theta\|\hat\theta - \theta\|^2, $$


where the notation $\|\cdot\|$ means the $\ell_2$-norm when applied to $\theta$-vectors in the sequence space. Here and later $E_f$ and $E_\theta$ denote the expectations w.r.t. $\{Y(g),\ g \in H\}$ or $X = (X_1, X_2, \dots)$ for models (1) and (4) respectively. Analyzing the risk $R(\hat f, f)$ of the estimator $\hat f$ is equivalent to analyzing the corresponding sequence space risk $E_\theta\|\hat\theta - \theta\|^2$. Let $\lambda = (\lambda_1, \lambda_2, \dots)$ be a sequence of nonrandom weights, also called a filter. Every filter $\lambda$ defines the estimator $\hat\theta(\lambda) = (\hat\theta_1, \hat\theta_2, \dots)$, where $\hat\theta_k = \lambda_k X_k$. Examples of commonly used weights $\lambda_k$ are the projection weights $\lambda_k = I(k \le w)$ for some integer $w$ (where $I(\cdot)$ denotes the indicator function), the Tikhonov-Phillips weights
$$ \lambda_k = \frac{1}{1 + (k/w)^\alpha}, \quad w > 0,\ \alpha > 0, $$
and the Pinsker weights
$$ \lambda_k = \big(1 - (k/w)^\alpha\big)_+, \quad w > 0,\ \alpha > 0, $$
where $x_+ = \max(x, 0)$. The estimator $\hat f$ with Tikhonov-Phillips weights for even $\alpha$ is asymptotically equivalent to a smoothing spline estimator [Wahba (1977, 1990)]. The filter $\lambda$ can be interpreted as a smoothing parameter. Although $\lambda$ is not finite dimensional, it is usually determined by a finite number of parameters, as in the above examples. In this paper we discuss how to choose these parameters optimally in a data-driven way. In particular, a data-driven choice of the smoothing parameters $w$ and $\alpha$ for the Tikhonov-Phillips method is of interest. We suppose that there is a finite set of candidate filters $\Lambda = \{\lambda^1, \dots, \lambda^N\}$, with $\lambda^s = (\lambda^s_1, \lambda^s_2, \dots)$, $s = 1, \dots, N$, $N \ge 2$, satisfying some general conditions. These filters can be, for example, of any of the three types described above, as well as pooled sets of filters of different kinds. Given the data $X = (X_1, X_2, \dots)$, our aim is to select a data-dependent sequence of weights $\lambda^* = \lambda^*(X) = (\lambda^*_1, \lambda^*_2, \dots)$, with values in $\Lambda$, that has asymptotically minimal squared risk for the true $\theta$. We show that $\lambda^*$ can be defined as a minimizer (with respect to $\lambda \in \Lambda$) of an unbiased estimator of the risk. Optimality properties of such $\lambda^*$ follow from the oracle inequalities that are the main result of the paper.
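The three weight families just described are easy to compute; the following short Python sketch (our own illustration, with hypothetical helper names not taken from the paper) builds each filter and applies it to a data vector:

```python
import numpy as np

def projection_weights(n, w):
    """lambda_k = I(k <= w): keep the first w coefficients."""
    k = np.arange(1, n + 1)
    return (k <= w).astype(float)

def tikhonov_phillips_weights(n, w, alpha):
    """lambda_k = 1 / (1 + (k/w)^alpha)."""
    k = np.arange(1, n + 1)
    return 1.0 / (1.0 + (k / w) ** alpha)

def pinsker_weights(n, w, alpha):
    """lambda_k = (1 - (k/w)^alpha)_+ , i.e. truncated at zero."""
    k = np.arange(1, n + 1)
    return np.maximum(1.0 - (k / w) ** alpha, 0.0)

def linear_estimate(lam, X):
    """theta_hat_k = lambda_k * X_k for a fixed filter lambda."""
    return lam * X
```

In each family the pair $(w, \alpha)$ plays the role of the finite-dimensional smoothing parameter discussed above; the finite set $\Lambda$ would be obtained by letting $(w, \alpha)$ run over a grid.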
The oracle inequalities are non-asymptotic, they are obtained under very weak conditions on $\Lambda$, and they lead to asymptotically exact inequalities of the form
$$ R(f_*, f) \le (1 + o(1))\min_{\lambda\in\Lambda} R(f_\lambda, f), \qquad (5) $$
as $\varepsilon \to 0$, where $f_* = \sum_{k=1}^\infty \lambda^*_k X_k\varphi_k$, $f_\lambda = \sum_{k=1}^\infty \lambda_k X_k\varphi_k$, and $o(1)$ does not depend on $f$. As a consequence of these inequalities, we can justify the optimal choice of smoothing parameters in the Tikhonov-Phillips and projection methods, as well as in Pinsker's method (the last one yields as a by-product sharp minimaxity of $\lambda^*$ on Sobolev ellipsoids). The optimality results are valid under the assumption that $\sigma_k$ grows as a power of $k$,

as $k \to \infty$. The generality of the oracle inequalities allows us to apply them in various problems. As an example, we consider sharp adaptation on multivariate anisotropic smoothness classes. An interesting conclusion is that the adaptive estimator based on $\lambda^*$ attains the minimax lower bound in the multivariate anisotropic case, without any loss of efficiency. Other oracle inequalities for inverse problems have been proposed recently by Johnstone (1999) and Cavalier and Tsybakov (2000). Johnstone (1999) deals with a class of nonlinear estimators based on soft thresholding in a wavelet context. He obtains an asymptotically exact oracle inequality for this class. Cavalier and Tsybakov (2000) consider the model (4) and obtain asymptotically exact oracle inequalities of the form (5) where $\Lambda$ is the class of all monotone weight sequences $\lambda$, and $f_*$ (respectively, $\lambda^*$) is chosen in a different way, by application of a penalized blockwise Stein's rule. Their method is sharp adaptive in a minimax sense on ellipsoids, but it is not suited for parameter selection in restricted classes of filters, such as the Tikhonov-Phillips one.

2 Main results

We deal with the model (4) and we assume the following.

Assumption 1. None of the $\lambda \in \Lambda$ is identically 0, $\sigma_k > 0$ for all $k$, $\sum_{i=1}^\infty \lambda_i^2\sigma_i^2 < \infty$ for all $\lambda \in \Lambda$, and
$$ \max_{\lambda\in\Lambda}\sup_i |\lambda_i| \le 1. \qquad (6) $$

The risk of the linear estimator $\hat\theta(\lambda)$ is given by
$$ R_\varepsilon[\lambda, \theta] = E_\theta\|\hat\theta(\lambda) - \theta\|^2 = \sum_{i=1}^\infty(1-\lambda_i)^2\theta_i^2 + \varepsilon^2\sum_{i=1}^\infty \lambda_i^2\sigma_i^2. \qquad (7) $$

The assumption (6) is quite natural. In fact, it follows from (7) that the estimator $\hat\theta(\lambda)$ with at least one $\lambda_i \notin [0, 1]$ is inadmissible. However, we include the case of negative bounded $\lambda_i$ since it corresponds to a number of well-known estimators, such as kernel ones with non-negative kernels. The results below remain valid (up to a change in constants) if we replace 1 by an arbitrary constant $C$ in (6).

Our data-driven choice of the smoothing parameter $\lambda$ is based on the principle of unbiased risk estimation. To be more specific, recall briefly a heuristic motivation of the Mallows $C_p$. The best choice of $\lambda$ is the filter $\lambda^0$ (called the oracle) that minimizes $R_\varepsilon[\lambda, \theta]$ over $\lambda \in \Lambda$. The oracle $\lambda^0$ cannot be found directly since the functional $R_\varepsilon[\lambda, \theta]$ depends on the unknown $\theta_i^2$. However, an unbiased estimator of $\theta_i^2$ is available in the form $X_i^2 - \varepsilon^2\sigma_i^2$. Thus the functional
$$ U[\lambda, X] = \sum_{i=1}^\infty(\lambda_i^2 - 2\lambda_i)(X_i^2 - \varepsilon^2\sigma_i^2) + \varepsilon^2\sum_{i=1}^\infty \lambda_i^2\sigma_i^2 \qquad (8) $$
$$ = \sum_{i=1}^\infty(\lambda_i^2 - 2\lambda_i)X_i^2 + 2\varepsilon^2\sum_{i=1}^\infty \sigma_i^2\lambda_i $$
is an unbiased estimator of $R_\varepsilon[\lambda, \theta] - \sum_{i=1}^\infty\theta_i^2$:
$$ E_\theta U[\lambda, X] = R_\varepsilon[\lambda, \theta] - \sum_{i=1}^\infty\theta_i^2, \quad \forall\lambda. \qquad (9) $$
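The unbiasedness identity (9) follows from a one-line computation under the model (4), sketched here for completeness:

```latex
% Since X_i = \theta_i + \varepsilon\sigma_i\xi_i with E\xi_i = 0 and E\xi_i^2 = 1,
% we have E_\theta(X_i^2 - \varepsilon^2\sigma_i^2) = \theta_i^2, hence
E_\theta U[\lambda, X]
  = \sum_{i=1}^\infty (\lambda_i^2 - 2\lambda_i)\theta_i^2
    + \varepsilon^2 \sum_{i=1}^\infty \lambda_i^2\sigma_i^2
  = \sum_{i=1}^\infty (1-\lambda_i)^2\theta_i^2 - \sum_{i=1}^\infty \theta_i^2
    + \varepsilon^2 \sum_{i=1}^\infty \lambda_i^2\sigma_i^2
  = R_\varepsilon[\lambda,\theta] - \sum_{i=1}^\infty \theta_i^2 .
```

Since the subtracted term $\sum_i\theta_i^2$ does not depend on $\lambda$, minimizing $U[\lambda, X]$ over $\Lambda$ targets the same filter as minimizing the true risk in expectation.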

Under our assumptions the random series in (8) converges in the mean squared sense for any $\lambda \in \Lambda$, and the definition (8) is understood in this sense. Of course, in practice one does not compute infinite series. One either requires that $\lambda_i = 0$ for all $i$ large enough (as for the projection or Pinsker filters) or translates the definition of $U[\lambda, X]$ into the function space to make it computable (e.g. for the Tikhonov-Phillips filters with even $\alpha$ the computations are possible in terms of splines). The principle of unbiased risk estimation suggests minimizing over $\lambda \in \Lambda$ the functional $U[\lambda, X]$ in place of $R_\varepsilon[\lambda, \theta]$. This leads to the following data-driven choice of $\lambda$:
$$ \lambda^* = \arg\min_{\lambda\in\Lambda} U[\lambda, X]. \qquad (10) $$
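To make the selection rule (10) concrete, here is a self-contained Python sketch (our own illustration; the simulated setup and all names are assumptions, not the paper's) that simulates the sequence model (4), evaluates the criterion $U$ of (8) over a small family of projection filters, and picks the minimizer, comparing it with the oracle that uses the true $\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 500                      # truncation level for the infinite series
eps = 0.05                   # noise level epsilon
k = np.arange(1, n + 1)
sigma = k ** 0.5             # sigma_k = b_k^{-1}, power growth (mildly ill-posed)
theta = 1.0 / k ** 2         # true coefficients, theta in l2
X = theta + eps * sigma * rng.standard_normal(n)   # model (4)

# candidate projection filters lambda^s_i = I(i <= w_s)
bandwidths = [5, 10, 20, 40, 80]
filters = [(k <= w).astype(float) for w in bandwidths]

def U(lam):
    """Unbiased risk criterion (8): sum (lam^2 - 2 lam) X^2 + 2 eps^2 sum sigma^2 lam."""
    return np.sum((lam**2 - 2*lam) * X**2) + 2 * eps**2 * np.sum(sigma**2 * lam)

def true_risk(lam):
    """Exact risk (7), computable here only because theta is known."""
    return np.sum((1 - lam)**2 * theta**2) + eps**2 * np.sum(lam**2 * sigma**2)

s_star = min(range(len(filters)), key=lambda s: U(filters[s]))
s_oracle = min(range(len(filters)), key=lambda s: true_risk(filters[s]))
print("selected w:", bandwidths[s_star], " oracle w:", bandwidths[s_oracle])
```

In such runs the data-driven bandwidth is typically equal or adjacent to the oracle one; the theorems below quantify how close the risk of $\lambda^*$ is to the oracle risk.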

We show that this simple and intuitive smoothing parameter selection rule is efficient for inverse problems with power growth of $\sigma_k$, in the sense that it satisfies asymptotically precise oracle inequalities. Denote
$$ \rho(\lambda) = \sup_k \sigma_k^2|\lambda_k|\Big(\sum_{i=1}^\infty\sigma_i^4\lambda_i^4\Big)^{-1/2}, \quad \forall\lambda\in\Lambda, $$
and
$$ \rho = \max_{\lambda\in\Lambda}\rho(\lambda). $$
Let the following assumptions hold.

Assumption 2. There exists a constant $C_1 > 0$ such that, uniformly in $\lambda\in\Lambda$,
$$ \sum_{i=1}^\infty\sigma_i^4\lambda_i^2 \le C_1\sum_{i=1}^\infty\sigma_i^4\lambda_i^4. $$

Assumption 3. There exist constants $C_2 > 0$ and $q > 0$ such that
$$ \sum_{\lambda\in\Lambda}\exp(-q/\rho(\lambda)) \le C_2. $$

Assumptions 2 and 3 are very mild, and they are satisfied in most of the interesting examples. Since $|\lambda_i| \le 1$, we have
$$ \sum_{i=1}^\infty\sigma_i^4\lambda_i^4 \le \sum_{i=1}^\infty\sigma_i^4\lambda_i^2, $$
and Assumption 2 means that both sums are of the same order. The sums $\varepsilon^4\sum_{i=1}^\infty\sigma_i^4\lambda_i^4$ and $\varepsilon^4\sum_{i=1}^\infty\sigma_i^4\lambda_i^2$ are the main terms of the variance $\mathrm{Var}(U[\lambda, X])$. On the other hand, $E_\theta(U[\lambda, X]) \asymp \varepsilon^2\sum_{i=1}^\infty\lambda_i^2\sigma_i^2$ and
$$ \frac{\sqrt{\varepsilon^4\sum_{i=1}^\infty\sigma_i^4\lambda_i^4}}{\varepsilon^2\sum_{i=1}^\infty\lambda_i^2\sigma_i^2} \le \rho(\lambda). $$
Therefore, the smallness of $\rho$ guarantees the smallness of the ratio of standard deviation to mean $\sqrt{\mathrm{Var}(U[\lambda, X])}\big/E_\theta(U[\lambda, X])$ uniformly over $\Lambda$. This is essentially close to Assumption 3 and may be used for its intuitive motivation. Remark also that under Assumption 2 we have
$$ \rho(\lambda) \le \sqrt{C_1}, \quad \forall\lambda\in\Lambda. \qquad (11) $$
Denote

$$ S = \max_{\lambda\in\Lambda}\sup_i \sigma_i^2\lambda_i^2\Big/\min_{\lambda\in\Lambda}\sup_i \sigma_i^2\lambda_i^2 \qquad (12) $$
and
$$ L = \log(NS) + \rho^2\log^2 S. $$
(We recall that $N$ is the number of elements of the family $\Lambda$.) The main results of the paper are given in the next two theorems.

Theorem 1. Let Assumptions 1–3 hold. Then for every $\theta\in\ell_2$, every $B > B_0C_2^2$, and for the estimator $\theta^* = (\theta^*_1, \theta^*_2, \dots)$ with $\theta^*_i = \lambda^*_iX_i$ we have
$$ E_\theta\|\theta^* - \theta\|^2 \le \big(1 + \gamma_1(C_2^2+1)B^{-1}\big)\min_{\lambda\in\Lambda}R_\varepsilon[\lambda, \theta] + \gamma_2(C_2^2+1)B\varepsilon^2L\,\omega(B^2L), \qquad (13) $$
where
$$ \omega(x) = \max_{\lambda\in\Lambda}\sup_k \sigma_k^2\lambda_k^2\,I\Big(\sum_{i=1}^\infty\sigma_i^2\lambda_i^2 \le x\sup_k\sigma_k^2\lambda_k^2\Big), \quad x > 0, $$
and $B_0 > 0$, $\gamma_1 > 0$, $\gamma_2 > 0$ are constants depending only on $C_1$ and $q$.

Theorem 2. Let Assumptions 1–3 hold. Then there exist constants $\gamma_3 > 0$, $\gamma_4 > 0$ depending only on $C_1$ and $q$, such that for every $\theta\in\ell_2$ and for the estimator $\theta^* = (\theta^*_1, \theta^*_2, \dots)$ with $\theta^*_i = \lambda^*_iX_i$ we have
$$ E_\theta\|\theta^* - \theta\|^2 \le \big[1 + \gamma_3\big(B^{-1} + (C_2^2+1)BL\rho\big)\big]\min_{\lambda\in\Lambda}R_\varepsilon[\lambda, \theta], $$
provided $B > 0$ and $B^{-1} + (C_2^2+1)BL\rho \le \gamma_4$.


In order to prove (5), in many examples it is sufficient to use Theorem 2, or even the simplified version of it that we now state. Assume that
$$ \lim_{\varepsilon\to0}\rho\log(NS) = 0. \qquad (14) $$
Then Assumption 3 holds with $C_2 = 1$, $q = 1$ for $\varepsilon$ small enough, and $L = O(\log(NS))$. Choosing $B = (L\rho)^{-1/2}$ we get the following corollary of Theorem 2.

Corollary 1. Let Assumptions 1 and 2 hold. Then there exist constants $C_3 > 0$, $C_4 > 0$, depending only on $C_1$, such that for $\rho\log(NS) < C_3$ we have
$$ E_\theta\|\theta^* - \theta\|^2 \le \Big(1 + C_4\sqrt{\rho\log(NS)}\Big)\min_{\lambda\in\Lambda}R_\varepsilon[\lambda, \theta], $$
for every $\theta\in\ell_2$ and for the estimator $\theta^* = (\theta^*_1, \theta^*_2, \dots)$ with $\theta^*_i = \lambda^*_iX_i$.

Thus, the condition (14) is sufficient to have asymptotically exact oracle inequalities of the form (5). Recall that for ill-posed inverse problems $\sigma_k \to \infty$ as $k \to \infty$. A crucial restriction, implicit in Theorems 1, 2 and Corollary 1, is that $\sigma_k$ should grow not faster than a power of $k$. In fact, if $\sigma_k$ grows exponentially, then even in the simplest case of projection weights $\lambda_k$ we have $\rho \ne o(1)$ as $\varepsilon \to 0$ (thus, Theorem 2 cannot be applied if $N$ grows as $\varepsilon \to 0$). Theorem 1 can be applied in this case but does not give correct rates. The oracle inequality (13) is suited to obtain minimax results or rates of convergence for classes of sequences $\theta$. The remainder term on the RHS of (13) is typically of the form $\varepsilon^2(\log(1/\varepsilon))^a$ with $a > 1$. This shows the limits of applications of (13): for the classes where the least favorable functions $\theta$ satisfy $\min_{\lambda\in\Lambda}R_\varepsilon[\lambda,\theta] = O(\varepsilon^2(\log(1/\varepsilon))^a)$, the inequality (13) does not yield sharp results.

Remark 2. Theorems 1, 2 can be used not only for ill-posed, but also for well-posed inverse problems where $\sigma_k \not\to \infty$. For example, both theorems apply if $\sigma_k \asymp k^{-\beta}$ with $0 \le \beta < 1/4$ (allowing $\rho = o(1)$ as $\varepsilon \to 0$). For faster decreasing $\sigma_k$ only Theorem 1 works. The constant $C_2$ grows as $\varepsilon \to 0$ in this case (that is why we kept it

explicit in the results). The RHS of (13) depends on how $C_2$ grows, and this in turn depends on the size $N$ of the set $\Lambda$. If $N \asymp \log(1/\varepsilon)$ as $\varepsilon \to 0$, the behavior of $C_2$ is also logarithmic, and the remainder term in (13) is $O(\varepsilon^2\log(1/\varepsilon)^a)$ for some $a > 0$, which is only logarithmically worse than the optimal rate $\varepsilon^2$ of well-posed inverse problems.

Remark 3. Consider the special case that corresponds to direct observations (i.e. $\sigma_k \equiv 1$). Here several oracle inequalities have been known previously. Theorems 1 and 2 extend these results, especially in what concerns multivariate applications. The first oracle inequalities for the direct observations model appeared, although implicitly, in the proofs of optimality of $C_p$, cross-validation and related data-driven methods [Li (1986, 1987), Polyak and Tsybakov (1990, 1992)]. They are also implicit in minimax adaptive constructions [Golubev (1987, 1992), Golubev and Nussbaum (1992), Efromovich (1999) and the references therein]. Presumably, the first explicit use of oracle inequalities and their implications for minimax theory is due to Donoho and Johnstone (1994) and Kneip (1994). More recent references are Donoho and Johnstone (1995, 1996), Nemirovskii (1998), Birge (1999), Birge and Massart (1999), Goldenshluger and Tsybakov (2000). For linear estimators, the results are often proved in the following model, which can be embedded into ours. Let $Y = (Y_1, \dots, Y_n)^T$ be the vector of observations in the nonparametric regression model
$$ Y_k = f(k/n) + \zeta_k, \quad k = 1, \dots, n, $$
where $\zeta_k$ are i.i.d. $N(0, 1)$ random errors and $f(\cdot)$ is the unknown function to be estimated. Consider the linear estimator $\hat f = S_\lambda Y$ of the vector $f = (f(1/n), f(2/n), \dots, f(1))^T$, where $S_\lambda = \sum_{i=1}^n\lambda_iu_iu_i^T$ is a symmetric $n\times n$ matrix (smoother matrix) with eigenvalues $\lambda_i$, and $\{u_i\}$ is an orthonormal basis in $\mathbb{R}^n$.
Denoting $X_i = n^{-1/2}u_i^TY$, $\theta_i = n^{-1/2}u_i^Tf$, $\xi_i = u_i^T\zeta$ (where $\zeta = (\zeta_1, \dots, \zeta_n)^T$) and $\varepsilon = n^{-1/2}$, we rewrite the initial regression model in the equivalent form
$$ X_i = \theta_i + \varepsilon\xi_i, \quad i = 1, \dots, n, $$
which is a special case of (4), modulo the fact that the $\ell_2$-vectors should contain zeros starting from the $(n+1)$th position: $\theta = (\theta_1, \dots, \theta_n, 0, 0, \dots)$. The linear estimator $\hat f$ is translated into $\hat\theta(\lambda) = (\hat\theta_1, \dots, \hat\theta_n, 0, 0, \dots)$, where $\hat\theta_i = \lambda_iX_i = n^{-1/2}\lambda_iu_i^TY$, and the risk is
$$ E_\theta\|\hat\theta(\lambda) - \theta\|^2 = E\Big(n^{-1}\|S_\lambda Y - f\|^2\Big). $$
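This embedding can be checked numerically. The sketch below (our own illustration, with an arbitrary random orthonormal basis and Pinsker-type eigenvalues, none of which come from the paper) verifies that $n^{-1}\|S_\lambda Y - f\|^2$ coincides with the sequence-space loss $\|\hat\theta(\lambda) - \theta\|^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 64
t = np.arange(1, n + 1) / n
f = np.sin(2 * np.pi * t)                    # vector (f(1/n), ..., f(1))
Y = f + rng.standard_normal(n)               # regression observations

# an orthonormal basis {u_i} of R^n and eigenvalues lambda_i of the smoother
U_mat, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = np.maximum(1.0 - np.arange(1, n + 1) / 20.0, 0.0)   # Pinsker-type weights
S = (U_mat * lam) @ U_mat.T                  # S_lambda = sum_i lam_i u_i u_i^T

# sequence-space translation: X_i = n^{-1/2} u_i^T Y, theta_i = n^{-1/2} u_i^T f
X = U_mat.T @ Y / np.sqrt(n)
theta = U_mat.T @ f / np.sqrt(n)
theta_hat = lam * X

lhs = np.sum((S @ Y - f) ** 2) / n
rhs = np.sum((theta_hat - theta) ** 2)
assert np.allclose(lhs, rhs)                 # the two losses coincide
```

The identity holds for any orthonormal $\{u_i\}$ and any eigenvalue sequence, which is exactly why the regression setting is a special case of the sequence model (4) with $\varepsilon = n^{-1/2}$ and $\sigma_i \equiv 1$.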

Kneip (1994) studies this setup assuming that the class of filters $\Lambda$ consists of sequences $\lambda$ with monotone non-increasing coefficients $\lambda_i$, such that for any two sequences $\lambda, \lambda' \in \Lambda$ we have either $\lambda_i \le \lambda'_i,\ \forall i$, or $\lambda_i \ge \lambda'_i,\ \forall i$ (ordered linear smoothers). The set $\Lambda$ in Kneip (1994) need not be finite. With the above translation into our notation, Kneip [1994, p. 844] proves the oracle inequality
$$ E_\theta\|\theta^* - \theta\|^2 \le (1 + O(B^{-1}))\min_{\lambda\in\Lambda}R_\varepsilon[\lambda, \theta] + O(B)\varepsilon^2, \quad \forall B > 0, \qquad (15) $$

where $\lambda^*$ is the data-driven choice (10). This is similar to (13) (but recall that Kneip's inequality (15) covers only the case $\sigma_k \equiv 1$). Another difference is that we assume finiteness of the set $\Lambda$ (which, in fact, is not restrictive in view of applications), but we drop the ordering assumption. The last point is useful in multivariate models where an ordering of the filters is not natural (cf. the example in Section 6 below).

3 Examples

Consider some examples and consequences of Theorems 1 and 2. Typical assumptions on the parameters appearing in these theorems will be the following:
(i) power growth of $\sigma_k$ as $k \to \infty$;
(ii) power behavior of $S$: $S = O(\varepsilon^{-t})$ for some $t > 0$, as $\varepsilon \to 0$;
(iii) at most power growth of $N$: $N = O(\varepsilon^{-\tau})$ for some $\tau > 0$, as $\varepsilon \to 0$ ($N = O(\log(1/\varepsilon))$ in some examples).
In all these cases we have $\log(NS) = O(\log(1/\varepsilon))$.

Example 1. Projection estimators. Let $1 \le w_1 < \dots < w_N$ be integers. Consider the projection filters $\lambda^s = (\lambda^s_1, \lambda^s_2, \dots)$ defined by
$$ \lambda^1_i = I(i \le w_1),\ \lambda^2_i = I(i \le w_2),\ \dots,\ \lambda^N_i = I(i \le w_N), \quad i = 1, 2, \dots \qquad (16) $$
Throughout this section we assume a power law behavior of $\sigma_k$:
$$ \sigma_{\min}k^\beta \le \sigma_k \le \sigma_{\max}k^\beta, \quad k = 1, 2, \dots, \qquad (17) $$
for some $\sigma_{\min}, \sigma_{\max} > 0$, $\beta \ge 0$. Note that Assumption 2 is satisfied with equality and $C_1 = 1$. It is easy to see that
$$ \rho(\lambda^s) \le Cw_s^{-1/2}, \qquad \rho \le \max_{s=1,\dots,N}Cw_s^{-1/2} = Cw_1^{-1/2}, $$
and
$$ S \le C(w_N/w_1)^{2\beta}. $$
Here and later $C$ stands for positive constants (possibly different in different places) that do not depend on $\theta$, $\lambda$ and $\varepsilon$. Using the bound on $\rho(\lambda)$ we conclude that Assumption 3 is satisfied for any choice of the sequence $w_i$. Remark also that
$$ \omega(x) \le C\sup_k\Big\{k^{2\beta}\,I\big(k^{2\beta+1} \le Cxk^{2\beta}\big)\Big\} \le Cx^{2\beta} \qquad (18) $$


and
$$ L \le C\big[\log(Nw_N/w_1) + w_1^{-1}\log^2(w_N/w_1)\big]. $$
Theorem 1 together with (18) gives the following inequality:
$$ E_\theta\|\theta^* - \theta\|^2 \le [1 + O(B^{-1})]\min_{\lambda\in\Lambda}R_\varepsilon[\lambda, \theta] + O(\varepsilon^2)B^{4\beta+1}L^{2\beta+1}. $$
Minimizing here the right-hand side with respect to $B$ and observing that $R_\varepsilon[\lambda, \theta] \ge \varepsilon^2w_1$ one obtains the following result.

Proposition 1. Assume that $\Lambda = (\lambda^1, \dots, \lambda^N)$ is the set of projection filters defined by (16) and that (17) holds. Then there exists a constant $C_5 > 0$ depending only on $\sigma_{\min}, \sigma_{\max}, \beta$, such that
$$ E_\theta\|\theta^* - \theta\|^2 \le \bigg[1 + C_5\bigg(\frac{\sqrt{\log(Nw_N/w_1)}}{w_1^{1/(4\beta+2)}} + \frac{\log(w_N/w_1)}{w_1^{1/2+1/(4\beta+2)}}\bigg)\bigg]\min_{\lambda\in\Lambda}R_\varepsilon[\lambda, \theta], $$
for every $\theta\in\ell_2$ and for the estimator $\theta^* = (\theta^*_1, \theta^*_2, \dots)$ with $\theta^*_i = \lambda^*_iX_i$.

On the other hand, since $\rho\log(NS) \le Cw_1^{-1/2}\log(Nw_N/w_1)$, Corollary 1 gives
$$ E_\theta\|\theta^* - \theta\|^2 \le \bigg[1 + \frac{C\sqrt{\log(Nw_N/w_1)}}{w_1^{1/4}}\bigg]\min_{\lambda\in\Lambda}R_\varepsilon[\lambda, \theta], $$
provided $w_1^{-1/2}\log(Nw_N/w_1) \to 0$ as $\varepsilon \to 0$. As a consequence of this inequality and Proposition 1 we get an asymptotically exact oracle inequality of the form (5):

Corollary 2. Assume that $\Lambda = (\lambda^1, \dots, \lambda^N)$ is the set of projection filters defined by (16) and that (17) holds. If $N = N(\varepsilon)$ and $w_1 = w_1(\varepsilon)$, $w_N = w_N(\varepsilon)$ are such that
$$ \lim_{\varepsilon\to0}\frac{\log(Nw_N/w_1)}{\max\big(w_1^{1/(2\beta+1)},\ w_1^{1/2}\big)} = 0, $$
then for every $\theta\in\ell_2$ and for the estimator $\theta^* = (\theta^*_1, \theta^*_2, \dots)$ with $\theta^*_i = \lambda^*_iX_i$ we have
$$ E_\theta\|\theta^* - \theta\|^2 \le (1 + o(1))\min_{\lambda\in\Lambda}R_\varepsilon[\lambda, \theta], $$
where $o(1) \to 0$ uniformly in $\theta\in\ell_2$.


In other words, Corollary 2 states that our adaptively selected filter behaves asymptotically at least as well as the best projection estimator in $\Lambda$. For the direct case (where $\sigma_k \equiv 1$) such an inequality was obtained by Birge (1999), who used the Lepski adaptation method rather than the Mallows $C_p$. Next, consider the situation where there are no restrictions on $w_1$ except $w_1 \ge 1$ (i.e. the class $\Lambda$ can contain projection filters of any order less than or equal to some $w_N$). Applying Proposition 1 we get an inequality with a logarithmic loss of efficiency:

Proposition 2. Assume that $\Lambda = (\lambda^1, \dots, \lambda^N)$ is the set of projection filters defined by (16) and that (17) holds. For every $\theta\in\ell_2$ and for the estimator $\theta^* = (\theta^*_1, \theta^*_2, \dots)$ with $\theta^*_i = \lambda^*_iX_i$ we have
$$ E_\theta\|\theta^* - \theta\|^2 \le C_6\big(\sqrt{\log N} + \log w_N\big)\min_{\lambda\in\Lambda}R_\varepsilon[\lambda, \theta], $$
where $C_6 > 0$ depends only on $\sigma_{\min}, \sigma_{\max}, \beta$.

An important special case is wavelet estimators, for which we set $w_1 = 2^{j_0}$ (where $j_0$ is the index of the initial level) and $w_j = 2w_{j-1}$. Typically one chooses $2^{-j_0}$ decreasing as a power of $\varepsilon$ and $N \asymp \log(1/\varepsilon)$. It is easy to see that the result of Corollary 2 remains valid in this case. Thus, a wavelet estimator that uses our data-driven selection of $w_j$ is asymptotically at least as good as the best linear wavelet estimator, for any $\theta$. By taking suprema of both sides of the oracle inequality over Besov classes (for the definition see Donoho and Johnstone (1994, 1995, 1996)) we get that our adaptive estimator attains the optimal rate of convergence on all Besov classes where linear wavelet estimators attain optimal rates.

Example 2. Levelwise "keep-or-kill" estimators. Let $m > 1$ and $1 \le w_1 < \dots < w_m$ be integers, and let $e = (e_1, \dots, e_{m-1})$ be a binary sequence of length $m-1$, $e_k \in \{0, 1\}$. We associate to $e$ a filter $\lambda(e) = (\lambda_1(e), \lambda_2(e), \dots)$ defined by
$$ \lambda_i(e) = I(i \le w_1) + \sum_{k=1}^{m-1}e_k\,I(w_k < i \le w_{k+1}), \quad i = 1, 2, \dots $$
Consider the collection of filters
$$ \Lambda = \{\lambda(e) : e \in E\}, \qquad (19) $$
where $E$ is the set of all binary sequences $e$ of length $m-1$. The linear estimator with weights $\lambda_i(e)$ "keeps" the blocks of coefficients $\{\theta_i : w_k < i \le w_{k+1}\}$ for which $e_k = 1$ and "kills" the blocks for which $e_k = 0$. Clearly, $N = \mathrm{Card}\,\Lambda = 2^{m-1}$.
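The $2^{m-1}$ filters of (19) can be enumerated directly; the following small Python sketch (our own illustration, with hypothetical helper names) builds the whole family for a toy choice of block boundaries:

```python
import numpy as np
from itertools import product

def keep_or_kill_filter(e, w, n):
    """Filter lambda(e): keep i <= w[0]; keep block (w[k], w[k+1]] iff e[k] == 1."""
    i = np.arange(1, n + 1)
    lam = (i <= w[0]).astype(float)
    for k, ek in enumerate(e):
        if ek:
            lam += ((i > w[k]) & (i <= w[k + 1])).astype(float)
    return lam

w = [2, 4, 8, 16]                 # w_1 < ... < w_m with m = 4
m = len(w)
Lambda = [keep_or_kill_filter(e, w, n=16) for e in product([0, 1], repeat=m - 1)]
print(len(Lambda))                # N = 2^(m-1) = 8
```

Each filter is 0/1-valued, so Assumption 1 holds trivially, and the family contains the plain projection filters of orders $w_1, \dots, w_m$ as special cases (all-zero and all-one tails of $e$).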

As in Example 1, we get that Assumption 2 is satisfied and that $\rho \le Cw_1^{-1/2}$, $S \le C(w_m/w_1)^{2\beta}$, provided (17) holds. Therefore, applying Corollary 1 we get the following result.

Proposition 3. Assume that $\Lambda$ is the set of levelwise "keep-or-kill" filters defined by (19) and that (17) holds. If $m = m(\varepsilon)$ and $w_1 = w_1(\varepsilon)$, $w_m = w_m(\varepsilon)$ are such that
$$ \lim_{\varepsilon\to0}\frac{m + \log(w_m/w_1)}{\sqrt{w_1}} = 0, \qquad (20) $$
then for every $\theta\in\ell_2$ and for the estimator $\theta^* = (\theta^*_1, \theta^*_2, \dots)$ with $\theta^*_i = \lambda^*_iX_i$ we have
$$ E_\theta\|\theta^* - \theta\|^2 \le (1 + o(1))\min_{\lambda\in\Lambda}R_\varepsilon[\lambda, \theta], $$
where $o(1) \to 0$ uniformly in $\theta\in\ell_2$.

Note that for wavelet bases (if we consider global thresholding level by level) we have $w_1 = 2^{j_0}$, $w_m = 2^{j_0+m-1}$ with an integer $j_0$, and the condition (20) reduces to
$$ \lim_{\varepsilon\to0}2^{-j_0/2}m = 0, $$
which is readily satisfied in the typical situation where $2^{-j_0}$ decreases as a power of $\varepsilon$ and $m \asymp \log(1/\varepsilon)$. As a conclusion we get, in particular, that the level-by-level wavelet keep-or-kill estimator that uses our data-driven rule attains the optimal rate of convergence on all Besov classes where wavelet keep-or-kill estimators attain optimal rates.

Example 3. Tikhonov-Phillips estimators. Consider the set of filters
$$ \Lambda = \Big\{\lambda = \{\lambda_k\} : \lambda_k = \frac{1}{1+(k/w)^\alpha},\ w\in\mathcal{W},\ \alpha\in\mathcal{A}\Big\}, \qquad (21) $$
where $\mathcal{A}$ is a finite set of real numbers, possibly depending on $\varepsilon$: $\mathcal{A} = \{\alpha(1), \alpha(2), \dots, \alpha(N_A)\}$ such that
$$ 2\beta + 1/2 < \alpha_{\min} = \alpha(1) < \alpha(2) < \dots < \alpha(N_A) = \alpha_{\max}, $$
and $\mathcal{W}$ is a finite subset of the interval $[w_1, w_{\max}]$ with $\mathrm{Card}\,\mathcal{W} = N_W$, where $N_A$, $N_W$ are integers and $0 < w_1 < w_{\max} < \infty$. Note that $\mathrm{Card}\,\Lambda = N$ where $N = N_AN_W$. We get by simple algebra
$$ C_*w^{4\beta+1} \le \sum_{i=1}^\infty\frac{\sigma_i^4}{(1+(i/w)^\alpha)^4} \le \sum_{i=1}^\infty\frac{\sigma_i^4}{(1+(i/w)^\alpha)^2} \le C^*w^{4\beta+1}, $$

where $C_*$ and $C^*$ are positive constants depending only on $\sigma_{\min}, \sigma_{\max}, \beta$. This and (17) guarantee that Assumption 2 is satisfied. Next,
$$ \sup_i\frac{\sigma_i^2}{(1+(i/w)^\alpha)^2} = C(\alpha, \beta)\,w^{2\beta}, $$
where $C(\alpha, \beta)$ is bounded away from 0 and $\infty$ uniformly for $\alpha_{\min} \le \alpha \le \alpha_{\max}$. Therefore $S \le C(w_{\max}/w_1)^{2\beta}$. Furthermore, we get similarly
$$ \rho \le C\max_{\lambda\in\Lambda}\sup_i\frac{\sigma_i^2}{1+(i/w)^\alpha}\Big(\sum_{i=1}^\infty\frac{\sigma_i^4}{(1+(i/w)^\alpha)^4}\Big)^{-1/2} \le Cw_1^{-1/2}. $$
Thus,
$$ \rho\log(NS) \le Cw_1^{-1/2}\log(Nw_{\max}/w_1), $$
and to guarantee an asymptotically exact oracle inequality it remains to require (in view of Corollary 1) that the parameters $w_{\max}, w_1, N_W, N_A$ be such that (14) holds.

Proposition 4. Assume that $\Lambda = (\lambda^1, \dots, \lambda^N)$ is the set of Tikhonov-Phillips filters defined by (21) and that (17) holds. If $N = N_AN_W$, $w_{\max} = w_{\max}(\varepsilon)$, $w_1 = w_1(\varepsilon)$ are such that
$$ \lim_{\varepsilon\to0}\frac{\log(Nw_{\max}/w_1)}{w_1^{1/2}} = 0, \qquad (22) $$
then for every $\theta\in\ell_2$ and for the estimator $\theta^* = (\theta^*_1, \theta^*_2, \dots)$ with $\theta^*_i = \lambda^*_iX_i$ we have
$$ E_\theta\|\theta^* - \theta\|^2 \le (1 + o(1))\min_{\lambda\in\Lambda}R_\varepsilon[\lambda, \theta], $$
where $o(1) \to 0$ uniformly in $\theta\in\ell_2$.

Note that condition (22) is very weak and is easily checked in typical situations. For example, one can take $N_W = O(\varepsilon^{-a})$, $w_{\max} = O(\varepsilon^{-b})$, for some arbitrary fixed $a > 0$, $b > 0$, and assume that $N_A$, $\alpha_{\max}$ do not depend on $\varepsilon$ (it is typical to consider just a small fixed number of integer values of $\alpha$, or even one integer $\alpha$). This should be completed, in order to satisfy (22), by the mild assumption $w_1/\log^2(1/\varepsilon) \to \infty$. Since there is no restriction on the power $a$, the discrete net $\mathcal{W}$ can be arbitrarily fine, and it is not hard to show that the optimality of our discretized rule extends to the set of filters (21) where $w$ varies continuously in the interval.


4 Preliminary lemmas

Let $\xi_i$ be i.i.d. $N(0, 1)$ random variables and let $v = \{v_i\}_{i=1}^\infty$ be a random sequence from $\ell_2$, measurable w.r.t. $\{\xi_i\}_{i=1}^\infty$. It is assumed that $v$ takes values in a finite set $V$ of $\ell_2$ sequences: $v \in V = \{v^1, \dots, v^N\}$.

Lemma 1. For any $K \ge 1$,
$$ E\Big|\sum_{i=1}^\infty v_i\xi_i\Big| \le \sqrt{2\log(NK)}\,\Big[E\|v\| + \sqrt{E\|v\|^2/K}\Big]. $$

Proof. Denote $\eta_v = \|v\|^{-1}\sum_{i=1}^\infty v_i\xi_i$. By the Cauchy-Schwarz inequality,
$$ E\Big|\sum_{i=1}^\infty v_i\xi_i\Big| \le E\|v\|\max_{v\in V}|\eta_v| \le E\Big[\|v\|\max_{v\in V}|\eta_v|\,I\Big(\max_{v\in V}|\eta_v| \le \sqrt{2\log(NK)}\Big)\Big] + E\Big[\|v\|\max_{v\in V}|\eta_v|\,I\Big(\max_{v\in V}|\eta_v| > \sqrt{2\log(NK)}\Big)\Big] $$
$$ \le \sqrt{2\log(NK)}\,E\|v\| + (E\|v\|^2)^{1/2}\Big[E\max_{v\in V}\eta_v^2\,I\Big(\max_{v\in V}|\eta_v| > \sqrt{2\log(NK)}\Big)\Big]^{1/2}. $$
Now, for the function $F(t) = t^2I\big(t > \sqrt{2\log(NK)}\big)$ we use that $F(\max_{v\in V}|\eta_v|) \le \sum_{v\in V}F(|\eta_v|)$, and since $\eta_{v^1}, \dots, \eta_{v^N}$ are identically distributed standard Gaussian variables, we get
$$ E\Big|\sum_{i=1}^\infty v_i\xi_i\Big| \le \sqrt{2\log(NK)}\,E\|v\| + (E\|v\|^2)^{1/2}\Big[N\,E|\eta_{v^1}|^2\,I\big(|\eta_{v^1}| > \sqrt{2\log(NK)}\big)\Big]^{1/2} \le \sqrt{2\log(NK)}\,\Big[E\|v\| + \sqrt{E\|v\|^2/K}\Big], $$
where for the last inequality we used that
$$ E|\eta_{v^1}|^2\,I(|\eta_{v^1}| > x) \le 2(2\pi)^{-1/2}(x+1)\exp(-x^2/2), \quad \forall x > 0, $$
and $NK \ge 2$. $\Box$

Denote $m(v) = \sup_i|v_i|/\|v\|$ and $m_V = \max_{v\in V}m(v)$.

Lemma 2. Let there exist $r > 0$, $M > 0$ such that
$$ \sum_{v\in V}m^2(v)\exp(-r/m(v)) \le M. \qquad (23) $$


Then for any $K \ge 1$ we have
$$ E\Big|\sum_{i=1}^\infty v_i(\xi_i^2 - 1)\Big| \le D\Big[\sqrt{\log(NK)} + m_V\log K\Big]\Big[E\|v\| + \sqrt{E\|v\|^2/K}\Big], $$
where $D = D(r, M) > 0$ is a constant of the form $D = D_0(r)(M+1)$ with $D_0(r)$ depending only on $r$.

Proof. Let $\eta_v = (\sqrt{2}\|v\|)^{-1}\sum_{i=1}^\infty v_i(\xi_i^2 - 1)$. Using the Markov inequality and the formula
$$ -\log(1-x) = \sum_{k=1}^\infty\frac{x^k}{k}, $$
one obtains, for any $0 < t < [\sqrt{2}\,m(v)]^{-1}$,
$$ P\{\eta_v > x\} \le \exp(-tx)\,E\exp(t\eta_v) = \exp(-tx)\prod_{i=1}^\infty\exp\Big[-\frac12\log\Big(1 - \frac{\sqrt{2}\,tv_i}{\|v\|}\Big) - \frac{tv_i}{\sqrt{2}\,\|v\|}\Big] $$
$$ = \exp(-tx)\exp\Big[\sum_{k=2}^\infty\frac{1}{2k}\sum_{i=1}^\infty\Big(\frac{\sqrt{2}\,tv_i}{\|v\|}\Big)^k\Big] \le \exp(-tx)\exp\Big[\frac{1}{m^2(v)}\sum_{k=2}^\infty\frac{1}{2k}\big[\sqrt{2}\,t\,m(v)\big]^k\Big] $$
$$ = \exp(-tx)\exp\Big[-\frac{1}{2m^2(v)}\log\big[1 - \sqrt{2}\,t\,m(v)\big] - \frac{t}{\sqrt{2}\,m(v)}\Big]. $$
Minimization of the last expression with respect to $t$ yields
$$ P\{\eta_v > x\} \le \exp[\varphi_v(x)], \qquad \varphi_v(x) = \frac{1}{2m^2(v)}\log\big[1 + \sqrt{2}\,x\,m(v)\big] - \frac{x}{\sqrt{2}\,m(v)}. $$
Note that for $u \ge 0$ we have
$$ \log(1+u) - u = -\int_0^u\frac{\tau}{1+\tau}\,d\tau \le -\int_0^u\frac{\tau}{1+u}\,d\tau = -\frac{u^2}{2(1+u)}. $$
Thus
$$ \varphi_v(x) \le -\frac{x^2}{2(1 + \sqrt{2}\,x\,m(v))}, $$
and we conclude that
$$ P\{|\eta_v| > x\} \le 2\exp\Big(-\frac{x^2}{2(1 + \sqrt{2}\,x\,m(v))}\Big), \quad \forall x > 0. \qquad (24) $$

Using (24), we find, for any $Q > 0$,
$$ E\eta_v^2\,I\{|\eta_v| > Q\} = Q^2P\{|\eta_v| > Q\} + 2\int_Q^\infty xP\{|\eta_v| > x\}\,dx \le 4\int_Q^\infty x\exp\Big(-\frac{x^2}{2(1 + \sqrt{2}\,x\,m(v))}\Big)dx + 2Q^2\exp\Big(-\frac{Q^2}{2(1 + \sqrt{2}\,Q\,m(v))}\Big). $$
It is easy to see that
$$ -\frac{x^2}{2(1 + \sqrt{2}\,x\,m(v))} \le \begin{cases} -x^2/4, & \sqrt{2}\,m(v)x \le 1, \\ -x/[\sqrt{32}\,m(v)], & \sqrt{2}\,m(v)x > 1, \end{cases} $$
and we get by simple algebra
$$ E\eta_v^2\,I\{|\eta_v| > Q\} \le 8\exp(-Q^2/4) + 256\,m^2(v)\exp\Big(-\frac{Q}{\sqrt{128}\,m(v)}\Big). \qquad (25) $$
In view of (23) we have
$$ \sum_{v\in V}m^2(v)\exp\Big(-\frac{Q}{\sqrt{128}\,m(v)}\Big) \le M\exp\Big(-\frac{Q/\sqrt{128} - r}{m_V}\Big), \quad \forall Q > \sqrt{128}\,r. \qquad (26) $$
Now, acting as in the proof of Lemma 1 and using (25), (26), we obtain
$$ E\Big|\sum_{i=1}^\infty v_i(\xi_i^2-1)\Big| \le \sqrt{2}\,E\|v\|\max_{v\in V}|\eta_v| \le \sqrt{2}\,E\Big[\|v\|\max_{v\in V}|\eta_v|\,I\Big(\max_{v\in V}|\eta_v| \le Q\Big)\Big] + \sqrt{2}\,E\Big[\|v\|\max_{v\in V}|\eta_v|\,I\Big(\max_{v\in V}|\eta_v| > Q\Big)\Big] $$
$$ \le \sqrt{2}\,Q\,E\|v\| + \sqrt{2E\|v\|^2}\Big[E\max_{v\in V}\eta_v^2\,I\Big(\max_{v\in V}|\eta_v| > Q\Big)\Big]^{1/2} \le \sqrt{2}\,Q\,E\|v\| + \sqrt{2E\|v\|^2}\Big[\sum_{v\in V}E\eta_v^2\,I\{|\eta_v| > Q\}\Big]^{1/2} $$
$$ \le \sqrt{2}\,Q\,E\|v\| + \sqrt{2E\|v\|^2}\Big[8N\exp(-Q^2/4) + 256M\exp\Big(-\frac{Q/\sqrt{128} - r}{m_V}\Big)\Big]^{1/2}, $$
for any $Q > \sqrt{128}\,r$. Choosing $Q = \sqrt{2\log(NK)} + \sqrt{128}\,(m_V\log K + r)$ one gets the lemma. $\Box$

Consider the estimator $\tilde\theta_i = \mu_i(X)X_i$, where $\mu_i = \mu_i(X) \in [-1, 1]$ depends on the data $X$ (not necessarily $\mu_i(X) = \lambda^*_i(X)$). We assume that the filter $\mu(X) = (\mu_1(X), \mu_2(X), \dots)$ takes values in the set of candidate filters $\Lambda$. In the next lemma we give a bound for the risk of this estimator. We need the following notation:
$$ \Delta_\varepsilon[\lambda] = \varepsilon^2L\sup_i\sigma_i^2\lambda_i^2, \quad \forall\lambda\in\Lambda. $$


Lemma 3. For any $B > 0$ we have
$$ E_\theta\|\tilde\theta - \theta\|^2 \le [1 + 2B^{-1}]\,E_\theta R_\varepsilon[\mu(X), \theta] + B(D_1^2 + 4)\,E_\theta\Delta_\varepsilon[\mu(X)], $$
where $D_1 = D(q, C_1C_2)$.

Proof. Write
$$ E_\theta\|\tilde\theta - \theta\|^2 = E_\theta\Big[\sum_{i=1}^\infty(1-\mu_i(X))^2\theta_i^2 + \varepsilon^2\sum_{i=1}^\infty\sigma_i^2\mu_i^2(X)\Big] - 2\varepsilon E_\theta\sum_{i=1}^\infty(1-\mu_i(X))\theta_i\sigma_i\mu_i(X)\xi_i + \varepsilon^2E_\theta\sum_{i=1}^\infty\sigma_i^2\mu_i^2(X)(\xi_i^2-1). \qquad (27) $$
Using Lemma 1 with $K = S$ we get
$$ \varepsilon E_\theta\Big|\sum_{i=1}^\infty(1-\mu_i(X))\theta_i\sigma_i\mu_i(X)\xi_i\Big| \le \varepsilon\sqrt{2\log(NS)}\,E_\theta\Big[\Big(\sup_i\sigma_i^2\mu_i^2(X)\Big)^{1/2}\Big(\sum_{i=1}^\infty(1-\mu_i(X))^2\theta_i^2\Big)^{1/2}\Big] + \varepsilon\sqrt{2\log(NS)}\,S^{-1/2}\Big[E_\theta\sum_{i=1}^\infty(1-\mu_i(X))^2\theta_i^2\cdot\max_{\lambda\in\Lambda}\sup_i\sigma_i^2\lambda_i^2\Big]^{1/2}. $$
Now, (12) and the elementary inequality $2ab \le B^{-1}a^2 + Bb^2$, $\forall B > 0$, yield, for any $B > 0$,
$$ \varepsilon E_\theta\Big|\sum_{i=1}^\infty[1-\mu_i(X)]\theta_i\sigma_i\mu_i(X)\xi_i\Big| \le B^{-1}E_\theta\sum_{i=1}^\infty[1-\mu_i(X)]^2\theta_i^2 + \varepsilon^2B\log(NS)\,E_\theta\sup_i\sigma_i^2\mu_i^2(X) + \varepsilon^2B\log(NS)\min_{\lambda\in\Lambda}\sup_i\sigma_i^2\lambda_i^2 \qquad (28) $$
$$ \le B^{-1}E_\theta\sum_{i=1}^\infty[1-\mu_i(X)]^2\theta_i^2 + 2B\,E_\theta\Delta_\varepsilon[\mu(X)]. $$
To bound the last term in (27) we use Lemma 2 with $v_i = \sigma_i^2\mu_i^2(X)$ and $K = S$. The condition (23) of Lemma 2 is satisfied: in fact, the inequalities $|\mu_i(X)| \le 1$ and (11) entail $m(v) \le \rho(\mu(X)) \le \sqrt{C_1}$. Thus, (23) holds with $M = C_1C_2$, $r = q$. Note also that $m_V \le \rho$ and $\sqrt{\log(NS)} + m_V\log S \le \sqrt{2L}$. Therefore, application of Lemma 2 yields
$$ \varepsilon^2E_\theta\Big|\sum_{i=1}^\infty\sigma_i^2\mu_i^2(X)(\xi_i^2-1)\Big| \le D_1\sqrt{2L}\,\varepsilon^2\bigg[E_\theta\Big(\sum_{i=1}^\infty\sigma_i^4\mu_i^4(X)\Big)^{1/2} + S^{-1/2}\Big(E_\theta\sum_{i=1}^\infty\sigma_i^4\mu_i^4(X)\Big)^{1/2}\bigg]. $$
Next, by (12),
$$ S^{-1/2}\Big(E_\theta\sum_{i=1}^\infty\sigma_i^4\mu_i^4(X)\Big)^{1/2} \le \Big[\min_{\lambda\in\Lambda}\sup_i\sigma_i^2\lambda_i^2\;E_\theta\sum_{i=1}^\infty\sigma_i^2\mu_i^2(X)\Big]^{1/2} \le \Big[E_\theta\sup_i\sigma_i^2\mu_i^2(X)\;E_\theta\sum_{i=1}^\infty\sigma_i^2\mu_i^2(X)\Big]^{1/2}. $$
Hence, for any $B > 0$,
$$ \varepsilon^2E_\theta\Big|\sum_{i=1}^\infty\sigma_i^2\mu_i^2(X)(\xi_i^2-1)\Big| \le D_1\sqrt{2L}\,\varepsilon^2E_\theta\Big[\Big(\sup_i\sigma_i^2\mu_i^2(X)\Big)^{1/2}\Big(\sum_{i=1}^\infty\sigma_i^2\mu_i^2(X)\Big)^{1/2}\Big] + D_1\sqrt{2L}\,\varepsilon^2\Big[E_\theta\sup_i\sigma_i^2\mu_i^2(X)\;E_\theta\sum_{i=1}^\infty\sigma_i^2\mu_i^2(X)\Big]^{1/2} \qquad (29) $$
$$ \le 2\varepsilon^2B^{-1}E_\theta\sum_{i=1}^\infty\sigma_i^2\mu_i^2(X) + D_1^2L\,\varepsilon^2B\,E_\theta\sup_i\sigma_i^2\mu_i^2(X) = 2\varepsilon^2B^{-1}E_\theta\sum_{i=1}^\infty\sigma_i^2\mu_i^2(X) + D_1^2B\,E_\theta\Delta_\varepsilon[\mu(X)]. $$
Combining (27)–(29) we complete the proof. $\Box$

5 Proof of the theorems

Denote by $\lambda^0 = (\lambda^0_1, \lambda^0_2, \ldots)$ the oracle: $\lambda^0 = \arg\min_{\lambda\in\Lambda} R_\varepsilon[\theta,\lambda]$. We have
$$
E_\theta U[\hat\lambda, X]
= E_\theta R_\varepsilon[\theta,\hat\lambda] - \sum_{i=1}^\infty \theta_i^2
+ 2\varepsilon E_\theta \sum_{i=1}^\infty (\hat\lambda_i^2 - 2\hat\lambda_i)\sigma_i\theta_i\xi_i
+ \varepsilon^2 E_\theta \sum_{i=1}^\infty (\hat\lambda_i^2 - 2\hat\lambda_i)\sigma_i^2(\xi_i^2 - 1)
$$
$$
= E_\theta R_\varepsilon[\theta,\hat\lambda] - \sum_{i=1}^\infty \theta_i^2
+ 2\varepsilon E_\theta \sum_{i=1}^\infty (1-\hat\lambda_i)^2\sigma_i\theta_i\xi_i
+ \varepsilon^2 E_\theta \sum_{i=1}^\infty (\hat\lambda_i^2 - 2\hat\lambda_i)\sigma_i^2(\xi_i^2 - 1). \quad (30)
$$
We now bound the last two terms in (30). First, note that
$$
[(1-\hat\lambda_i)^2 - (1-\lambda^0_i)^2]^2
= [(1-\hat\lambda_i) + (1-\lambda^0_i)]^2[\hat\lambda_i - \lambda^0_i]^2
\le 2[(1-\hat\lambda_i)^2 + (1-\lambda^0_i)^2](\hat\lambda_i^2 + \lambda_i^{0\,2}). \quad (31)
$$
Then we have, by Lemma 1 with $K = S$ and (31),
$$
\varepsilon E_\theta \sum_{i=1}^\infty (1-\hat\lambda_i)^2\sigma_i\theta_i\xi_i
= \varepsilon E_\theta \sum_{i=1}^\infty [(1-\hat\lambda_i)^2 - (1-\lambda^0_i)^2]\sigma_i\theta_i\xi_i
\ge -\varepsilon E_\theta \Big|\sum_{i=1}^\infty [(1-\hat\lambda_i)^2 - (1-\lambda^0_i)^2]\sigma_i\theta_i\xi_i\Big|
$$
$$
\ge -\varepsilon\sqrt{2\log(NS)}\, E_\theta\Big(\sum_{i=1}^\infty [(1-\hat\lambda_i)^2 - (1-\lambda^0_i)^2]^2\sigma_i^2\theta_i^2\Big)^{1/2}
- \varepsilon\sqrt{2\log(NS)/S}\, \Big(E_\theta\sum_{i=1}^\infty [(1-\hat\lambda_i)^2 - (1-\lambda^0_i)^2]^2\sigma_i^2\theta_i^2\Big)^{1/2}
$$
$$
\ge -2\varepsilon\sqrt{\log(NS)}\, E_\theta\Big(\sum_{i=1}^\infty [(1-\hat\lambda_i)^2 + (1-\lambda^0_i)^2](\hat\lambda_i^2 + \lambda_i^{0\,2})\sigma_i^2\theta_i^2\Big)^{1/2}
- 2\varepsilon\sqrt{\log(NS)/S}\, \Big(E_\theta\sum_{i=1}^\infty [(1-\hat\lambda_i)^2 + (1-\lambda^0_i)^2](\hat\lambda_i^2 + \lambda_i^{0\,2})\sigma_i^2\theta_i^2\Big)^{1/2}.
$$
The argument as in the proof of Lemma 3 gives, for any $B > 0$,
$$
2\varepsilon\sqrt{\log(NS)}\, E_\theta\Big(\sum_{i=1}^\infty [(1-\hat\lambda_i)^2 + (1-\lambda^0_i)^2](\hat\lambda_i^2 + \lambda_i^{0\,2})\sigma_i^2\theta_i^2\Big)^{1/2}
\le \frac{1}{2B}\, E_\theta\sum_{i=1}^\infty [(1-\hat\lambda_i)^2 + (1-\lambda^0_i)^2]\theta_i^2
+ 2B\varepsilon^2\log(NS)\, E_\theta\Big(\sup_i \sigma_i^2\hat\lambda_i^2 + \sup_i \sigma_i^2\lambda_i^{0\,2}\Big)
$$
and
$$
2\varepsilon\sqrt{\log(NS)/S}\, \Big(E_\theta\sum_{i=1}^\infty [(1-\hat\lambda_i)^2 + (1-\lambda^0_i)^2](\hat\lambda_i^2 + \lambda_i^{0\,2})\sigma_i^2\theta_i^2\Big)^{1/2}
\le 2\varepsilon\sqrt{\log(NS)/S}\, \Big(E_\theta\sum_{i=1}^\infty [(1-\hat\lambda_i)^2 + (1-\lambda^0_i)^2]\theta_i^2\Big)^{1/2}\Big(2\max_{\lambda\in\Lambda}\sup_i \sigma_i^2\lambda_i^2\Big)^{1/2}
$$
$$
\le \frac{1}{2B}\, E_\theta\sum_{i=1}^\infty [(1-\hat\lambda_i)^2 + (1-\lambda^0_i)^2]\theta_i^2
+ 2B\varepsilon^2\log(NS)\, E_\theta\Big(\sup_i \sigma_i^2\hat\lambda_i^2 + \sup_i \sigma_i^2\lambda_i^{0\,2}\Big).
$$
Putting together these inequalities, we find
$$
\varepsilon E_\theta \sum_{i=1}^\infty (1-\hat\lambda_i)^2\sigma_i\theta_i\xi_i
\ge -B^{-1} E_\theta\sum_{i=1}^\infty (1-\hat\lambda_i)^2\theta_i^2
- B^{-1}\sum_{i=1}^\infty (1-\lambda^0_i)^2\theta_i^2
- 4B\varepsilon^2\log(NS)\, E_\theta\Big\{\sup_i \sigma_i^2\hat\lambda_i^2 + \sup_i \sigma_i^2\lambda_i^{0\,2}\Big\} \quad (32)
$$
$$
\ge -B^{-1} E_\theta\sum_{i=1}^\infty (1-\hat\lambda_i)^2\theta_i^2 - B^{-1} R_\varepsilon[\theta,\lambda^0]
- 4B\, E_\theta\rho_\varepsilon[\hat\lambda] - 4B\,\rho_\varepsilon[\lambda^0].
$$
Now, we bound the last term in (30) using Lemma 2 with $K = S$ and $v_i = (\hat\lambda_i^2 - 2\hat\lambda_i)\sigma_i^2$. Note that
$$
\lambda_i^2 \le (\lambda_i^2 - 2\lambda_i)^2 \le 4\lambda_i^2, \qquad \forall\ 0\le\lambda_i\le 1.
$$
This and (11) entail
$$
m(v) = \frac{\sup_i |\hat\lambda_i^2 - 2\hat\lambda_i|\,\sigma_i^2}{\big(\sum_{i=1}^\infty \sigma_i^4(\hat\lambda_i^2 - 2\hat\lambda_i)^2\big)^{1/2}} \le 2\rho(\hat\lambda) \le 2\sqrt{C_1}.
$$
Hence, in view of Assumption 3, the condition (23) of Lemma 2 is satisfied with $M = 4\sqrt{C_1 C_2}$, $r = 2q$. Furthermore, $m_V \le 2$ and $\log(NS) + m_V\log S \le 2\sqrt{2}\, L_\varepsilon$. Therefore, by Lemma 2, we obtain
$$
\varepsilon^2 E_\theta\sum_{i=1}^\infty (\hat\lambda_i^2 - 2\hat\lambda_i)\sigma_i^2(\xi_i^2 - 1)
\ge -2D_2\sqrt{2L_\varepsilon}\,\varepsilon^2\, E_\theta\Big(\sum_{i=1}^\infty (\hat\lambda_i^2 - 2\hat\lambda_i)^2\sigma_i^4\Big)^{1/2}
- 2D_2\sqrt{2L_\varepsilon}\,\varepsilon^2 S^{-1/2}\Big(E_\theta\sum_{i=1}^\infty (\hat\lambda_i^2 - 2\hat\lambda_i)^2\sigma_i^4\Big)^{1/2}
$$
$$
\ge -2D_2\sqrt{2C_1 L_\varepsilon}\,\varepsilon^2\Big[\, E_\theta\Big(\sum_{i=1}^\infty \sigma_i^4\hat\lambda_i^4\Big)^{1/2} + S^{-1/2}\Big(E_\theta\sum_{i=1}^\infty \sigma_i^4\hat\lambda_i^4\Big)^{1/2}\Big],
$$
where $D_2 = D(2q, 4\sqrt{C_1 C_2})$. Repeating the argument of (29) to bound the last expression, we finally get
$$
\varepsilon^2 E_\theta\sum_{i=1}^\infty (\hat\lambda_i^2 - 2\hat\lambda_i)\sigma_i^2(\xi_i^2 - 1)
\ge -2\varepsilon^2 B^{-1} E_\theta\sum_{i=1}^\infty \sigma_i^2\hat\lambda_i^2
- 4D_2^2 C_1\varepsilon^2 B L_\varepsilon\, E_\theta\sup_k \sigma_k^2\hat\lambda_k^2
= -2\varepsilon^2 B^{-1} E_\theta\sum_{i=1}^\infty \sigma_i^2\hat\lambda_i^2 - 4C_1 D_2^2 B\, E_\theta\rho_\varepsilon[\hat\lambda]. \quad (33)
$$

Now we are ready to complete the proof of Theorems 1 and 2. From (32), (33) we have
$$
2\varepsilon E_\theta\sum_{i=1}^\infty (1-\hat\lambda_i)^2\sigma_i\theta_i\xi_i
+ \varepsilon^2 E_\theta\sum_{i=1}^\infty (\hat\lambda_i^2 - 2\hat\lambda_i)\sigma_i^2(\xi_i^2 - 1)
\ge -2B^{-1} E_\theta R_\varepsilon[\theta,\hat\lambda] - 2B^{-1} R_\varepsilon[\theta,\lambda^0]
- (4C_1 D_2^2 + 8)B\, E_\theta\rho_\varepsilon[\hat\lambda] - 8B\,\rho_\varepsilon[\lambda^0].
$$
This and (30) yield
$$
E_\theta R_\varepsilon[\theta,\hat\lambda]
\le E_\theta U[\hat\lambda, X] + \sum_{i=1}^\infty \theta_i^2
+ 2B^{-1} E_\theta R_\varepsilon[\theta,\hat\lambda] + (4C_1 D_2^2 + 8)B\, E_\theta\rho_\varepsilon[\hat\lambda]
+ 2B^{-1} R_\varepsilon[\theta,\lambda^0] + 8B\,\rho_\varepsilon[\lambda^0]. \quad (34)
$$

By definition of $\hat\lambda$ and (9),
$$
E_\theta U[\hat\lambda, X] \le E_\theta U[\lambda^0, X] = R_\varepsilon[\theta,\lambda^0] - \sum_{i=1}^\infty \theta_i^2.
$$
This and (34) imply
$$
[1 - 2B^{-1}]\, E_\theta R_\varepsilon[\theta,\hat\lambda]
\le [1 + 2B^{-1}]\, R_\varepsilon[\theta,\lambda^0]
+ 4(C_1 D_2^2 + 2)B\, E_\theta\rho_\varepsilon[\hat\lambda] + 8B\,\rho_\varepsilon[\lambda^0]. \quad (35)
$$
Next, by Lemma 3, for any $B > 0$,
$$
E_\theta\|\hat\theta - \theta\|^2 \le [1 + 2B^{-1}]\, E_\theta R_\varepsilon[\theta,\hat\lambda] + B(D_1^2 + 4)\, E_\theta\rho_\varepsilon[\hat\lambda]. \quad (36)
$$
Inequalities (35) and (36) entail Theorems 1 and 2. In fact, to get Theorem 1 we use the following argument. Note that, for any $x > 0$,
$$
\sup_i \sigma_i^2\lambda_i^2
= \sup_i \sigma_i^2\lambda_i^2\, I\Big\{x\sup_i \sigma_i^2\lambda_i^2 \le \sum_{i=1}^\infty \sigma_i^2\lambda_i^2\Big\}
+ \sup_i \sigma_i^2\lambda_i^2\, I\Big\{x\sup_i \sigma_i^2\lambda_i^2 > \sum_{i=1}^\infty \sigma_i^2\lambda_i^2\Big\}
\le \frac{1}{x}\sum_{i=1}^\infty \sigma_i^2\lambda_i^2
+ \max_{\lambda\in\Lambda}\sup_i \sigma_i^2\lambda_i^2\, I\Big\{x\sup_i \sigma_i^2\lambda_i^2 > \sum_{i=1}^\infty \sigma_i^2\lambda_i^2\Big\}
= \frac{1}{x}\sum_{i=1}^\infty \sigma_i^2\lambda_i^2 + \omega(x).
$$
This inequality with $x = B^2 L_\varepsilon$, the inequalities (35), (36) and the fact that $D_1 \le C(C_2 + 1)$, $D_2 \le C(C_2 + 1)$, yield Theorem 1.

To prove Theorem 2, note first that
$$
\rho_\varepsilon[\lambda] \le \sqrt{C_1}\, L_\varepsilon\, R_\varepsilon[\theta,\lambda], \qquad \forall\ \lambda\in\Lambda,\ \theta\in\ell^2. \quad (37)
$$
Substitution of (37) into (35) and (36) gives, respectively,
$$
E_\theta R_\varepsilon[\theta,\hat\lambda]
\le \frac{1 + 2B^{-1} + 8\sqrt{C_1}\, B L_\varepsilon}{1 - 2B^{-1} - 4\sqrt{C_1}(C_1 D_2^2 + 2)\, B L_\varepsilon}\, R_\varepsilon[\theta,\lambda^0] \quad (38)
$$
(provided the denominator here is positive) and
$$
E_\theta\|\hat\theta - \theta\|^2 \le \big(1 + 2B^{-1} + \sqrt{C_1}(D_1^2 + 4)\, B L_\varepsilon\big)\, E_\theta R_\varepsilon[\theta,\hat\lambda]. \quad (39)
$$
From (38) and (39) we obtain
$$
E_\theta\|\hat\theta - \theta\|^2
\le \frac{(1 + 2B^{-1} + \gamma_5 B L_\varepsilon)(1 + 2B^{-1} + \gamma_5 (C_2^2 + 1) B L_\varepsilon)}{1 - 2B^{-1} - \gamma_6 (C_2^2 + 1) B L_\varepsilon}\; \min_{\lambda\in\Lambda} R_\varepsilon[\theta,\lambda],
$$
provided $B > 0$ and $1 - 2B^{-1} - \gamma_6 (C_2^2 + 1) B L_\varepsilon > 0$, where $\gamma_5 > 0$, $\gamma_6 > 0$ are constants depending only on $C_1$ and $q$. This proves Theorem 2.

6 Application to sharp adaptive estimation

In this section we apply the oracle inequalities of Section 2 to show that sharp minimax adaptive estimators for inverse problems can be obtained by the principle of unbiased risk estimation. We study a problem where sharp adaptive estimators were not known previously, namely the recovery of anisotropic smooth functions from indirect noisy data. For brevity, we restrict the discussion to a specific example (measuring the temperature of the earth). However, the key elements of the proofs are given under general assumptions and the result can be easily extended.

Let $\phi = (\phi_1, \phi_2)$ be the polar coordinates of a point on the surface of the earth (we suppose for simplicity that the earth is a sphere). Consider the problem of measuring the temperature $t(\phi)$ at the point $\phi$. The function $t(\phi)$ is sufficiently smooth (since it is a solution of the thermal conductivity equation) and, of course, periodic. More specifically, we assume that $t(\phi)$ belongs to the following anisotropic Sobolev ball
$$
W_2^m(p) = \Big\{ t : \int_T \Big[ p_1^2\Big(\frac{\partial^{m_1} t}{\partial\phi_1^{m_1}}\Big)^2 + p_2^2\Big(\frac{\partial^{m_2} t}{\partial\phi_2^{m_2}}\Big)^2 \Big]\, d\phi \le 1 \Big\}, \quad (40)
$$
where $T = [0, 2\pi]\times[0, 2\pi]$ and the parameters $m = (m_1, m_2)$, $p = (p_1, p_2)$ characterize the smoothness of the function $t(\phi)$ with respect to $\phi_1$ and $\phi_2$ ($m_i\in\mathbb{N}$, $p_i > 0$). The anisotropic character of smoothness here is important and reflects the fact that the temperature changes more rapidly along the meridians than along the parallels.

Next, it is assumed that the temperature is measured by means of remote infra-red detectors placed, for instance, on satellites or planes. This means that one cannot measure the temperature at a given point directly; one rather measures an average temperature over a vicinity of this point, i.e.
$$
\mathcal{E}t(\phi) = C^{-1}(\gamma_0)\int_{-\infty}^\infty\int_{-\infty}^\infty \exp\{-2\pi\gamma_0\|\phi - \phi'\|\}\, t(\phi'_1, \phi'_2)\, d\phi'_1\, d\phi'_2.
$$
Here $C(\gamma_0)$ is the normalizing constant,
$$
C(\gamma_0) = \int_{-\infty}^\infty\int_{-\infty}^\infty \exp\{-2\pi\gamma_0\|\phi\|\}\, d\phi_1\, d\phi_2,
$$
and $\|\phi\|^2 = \phi_1^2 + \phi_2^2$. The parameter $\gamma_0^{-1}$ characterizes the size of the vicinity of the point $\phi$ where we measure the temperature. The measurements are contaminated by a noise inherent to infra-red detection. We may write the resulting model as
$$
y(\phi) = \mathcal{E}t(\phi) + \varepsilon n(\phi), \qquad \phi\in T, \quad (41)
$$
where $n(\phi)$ is a standard white Gaussian noise in the Hilbert space of periodic functions. The writing (41) means that for every $g\in L_2(T)$ we observe the random variable $Y(g) = (g, \mathcal{E}t) + \varepsilon\xi(g)$, where $\xi(g)\sim N(0, \|g\|^2)$ and $E(\xi(g)\xi(g')) = 0$ if $(g, g') = 0$ (here $(\cdot,\cdot)$ and $\|\cdot\|$ are the scalar product and the norm in $L_2(T)$).

Since the function $t(\phi_1, \phi_2)$ is periodic, it is convenient to consider the statistical model (41) in the Fourier domain. Denote by
$$
\theta_{kl} = \int_T \exp(ik\phi_1 + il\phi_2)\, t(\phi)\, d\phi, \qquad k, l = 0, 1, 2, \ldots
$$
the Fourier coefficients of $t(\phi)$. Then (41) is equivalent to the model
$$
z_{kl} = \theta_{kl} + \varepsilon\sigma_{kl}\xi_{kl}, \quad (42)
$$
where $z_{kl}$ is the suitably normalized observation $Y(\exp(ik\phi_1 + il\phi_2))$, the $\xi_{kl}$ are i.i.d. $N(0, 1)$ random variables, and
$$
\sigma_{kl}^{-2} = C^{-1}(\gamma_0)\int_{-\infty}^\infty\int_{-\infty}^\infty \exp\{ik\phi_1 + il\phi_2 - 2\pi\gamma_0\|\phi\|\}\, d\phi = \frac{\gamma_0^3}{(\gamma_0^2 + k^2 + l^2)^{3/2}}.
$$
(For the last equality see, for instance, Theorem 1.14 in Stein and Weiss (1971).) Note also that the image of the Sobolev ball $W_2^m(p)$ in the Fourier domain has the form
$$
\Theta_2^m(p) = \Big\{ \{\theta_{kl}\} : \sum_{k,l=-\infty}^\infty \theta_{kl}^2 a_{kl}^2 \le 1 \Big\}, \quad (43)
$$
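Since much of what follows hinges on the growth of $\sigma_{kl}^2$, it is worth checking numerically that this sequence has the monotonicity and halving properties required later (Assumption 4 of this section). The sketch below is ours, not the paper's; the choice $\gamma_0 = 1$, the finite index range, and the empirical constant $c = 0.05$ are assumptions of the illustration:

```python
import numpy as np

def sigma2(k, l, gamma0=1.0):
    """sigma_kl^2 = gamma0^(-3) * (gamma0^2 + k^2 + l^2)^(3/2), as in (42)."""
    return gamma0 ** -3 * (gamma0 ** 2 + k ** 2 + l ** 2) ** 1.5

# monotone non-decreasing in each index (first part of Assumption 4)
k = np.arange(0, 200)
assert np.all(np.diff(sigma2(k, 5)) >= 0)

# halving condition sigma^2_{floor(k/2), floor(l/2)} >= c * sigma^2_{kl};
# on this range c = 0.05 suffices (the worst case here is k = l = 3)
kk, ll = np.meshgrid(np.arange(1, 200), np.arange(1, 200))
ratio = sigma2(kk // 2, ll // 2) / sigma2(kk, ll)
assert ratio.min() >= 0.05
```

The asymptotic halving constant for this $\sigma_{kl}^2$ is $2^{-3}$; the smaller empirical value above is driven by small indices.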

where $a_{kl} = [p_1^2(2\pi k)^{2m_1} + p_2^2(2\pi l)^{2m_2}]^{1/2}$. Thus our problem is equivalent to recovering the Fourier coefficients $\theta = \{\theta_{kl}\}$ based on the data (42) under the prior information provided by (43).

Recall that an estimator $\hat\theta^M$ is called asymptotically minimax on $\Theta_2^m(p)$ if, as $\varepsilon\to 0$,
$$
r_\varepsilon[\Theta_2^m(p)] = \inf_{\tilde\theta}\sup_{\theta\in\Theta_2^m(p)} E\|\tilde\theta - \theta\|^2
= (1 + o(1))\sup_{\theta\in\Theta_2^m(p)} E\|\hat\theta^M - \theta\|^2,
$$
where the infimum is taken over all estimators. The next proposition, derived from Pinsker's (1980) theorem, gives an asymptotically minimax estimator and its risk. Denote by $\mu$ the root of the equation
$$
\frac{\varepsilon^2}{\mu}\sum_{k,l}\sigma_{kl}^2\, a_{kl}[1 - \mu a_{kl}]_+ = 1. \quad (44)
$$

Proposition 5  The linear estimator
$$
\hat\theta^M_{kl} = [1 - \mu a_{kl}]_+\, z_{kl}, \qquad k, l = 0, 1, 2, \ldots, \quad (45)
$$
is asymptotically minimax on $\Theta_2^m(p)$, with the risk
$$
\sup_{\theta\in\Theta_2^m(p)} E\|\hat\theta^M - \theta\|^2 = \varepsilon^2\sum_{k,l}\sigma_{kl}^2[1 - \mu a_{kl}]_+.
$$
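Equation (44) determines $\mu$ as the root of an equation whose left-hand side is decreasing in $\mu$, so the weights (45) are easy to compute by bisection. A minimal numerical sketch (ours, not the paper's; the finite index set and the bracketing interval are assumptions of this illustration):

```python
import numpy as np

def pinsker_weights(a, sigma2, eps, iters=200):
    """Solve (44), eps^2/mu * sum(sigma^2 a (1 - mu a)_+) = 1, for mu by
    bisection (the left-hand side is decreasing in mu), then return the
    Pinsker weights (45), lambda = (1 - mu a)_+."""
    def lhs(mu):
        return eps ** 2 / mu * np.sum(sigma2 * a * np.clip(1 - mu * a, 0, None))
    lo, hi = 1e-12, 1e12
    for _ in range(iters):
        mid = np.sqrt(lo * hi)   # bisect on a log scale
        if lhs(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    mu = np.sqrt(lo * hi)
    return mu, np.clip(1 - mu * a, 0, None)
```

For instance, with $a = (1, 2, 4)$, $\sigma^2\equiv 1$ and $\varepsilon = 0.5$, the root is $\mu = 1/3$ and the weights are $(2/3,\, 1/3,\, 0)$.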

Using this proposition one can compute the asymptotic minimax risk up to a constant. The calculations are cumbersome and we do not reproduce them here, since the exact value of the asymptotic minimax risk is not needed for our construction of the adaptive estimator. We only need to evaluate the order of $r_\varepsilon[\Theta_2^m(p)]$.

Lemma 4  Let $m_2 \ge m_1$. Then, as $\varepsilon\to 0$,
$$
r_\varepsilon[\Theta_2^m(p)] = O(\mu^2) = O\Big(\Big(\frac{\varepsilon^2}{\gamma_0^3\, p_1^{4/m_1}\, p_2^{1/m_2}}\Big)^{\alpha}\Big),
\qquad \alpha = \frac{2m_1 m_2}{2m_1 m_2 + m_1 + 4m_2}.
$$
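The exponent $\alpha$ of Lemma 4 and its rewriting via $s$ in Remark 4 below agree identically; a two-line check with exact rational arithmetic (the function names are ours):

```python
from fractions import Fraction

def alpha(m1, m2):
    """Rate exponent of Lemma 4: alpha = 2 m1 m2 / (2 m1 m2 + m1 + 4 m2)."""
    return Fraction(2 * m1 * m2, 2 * m1 * m2 + m1 + 4 * m2)

def alpha_via_s(m1, m2):
    """Same exponent written as 2s/(2s+1) with 1/s = 1/m2 + 4/m1 (Remark 4)."""
    s = 1 / (Fraction(1, m2) + Fraction(4, m1))
    return 2 * s / (2 * s + 1)
```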

Remark 4. We can write $\alpha = 2s/(2s+1)$, where
$$
\frac{1}{s} = \frac{1}{m_2} + \frac{4}{m_1} = \frac{1}{\max(m_1, m_2)} + \frac{4}{\min(m_1, m_2)}.
$$
It is remarkable that the respective roles of $m_1$ and $m_2$ are not equivalent, unlike the case where $\sigma_{kl}\equiv 1$. The coefficient $\sigma_{kl}$ is symmetric in $k, l$, and nevertheless the rate is asymmetric.

Proof. Denote $V_1 = (2\pi)^{-1}(\mu p_1)^{-1/m_1}$ and $V_2 = (2\pi)^{-1}(\mu p_2)^{-1/m_2}$. Since $\sigma_{kl}^2$ is increasing in $k, l$, we have
$$
\varepsilon^2\sum_{k,l}\sigma_{kl}^2[1 - \mu a_{kl}]_+ \le C\varepsilon^2\, V_1 V_2\,\sigma^2_{V_1, V_2}. \quad (46)
$$

Here and later we denote by $C$ positive constants, possibly different on different occasions. It is clear from (44) that $\mu\to 0$ (and thus $V_1\to\infty$, $V_2\to\infty$) as $\varepsilon\to 0$. Consequently, for $\varepsilon$ small enough we get
$$
\sum_{k\le V_1/2,\, l\le V_2/2} a_{kl}[1 - \mu a_{kl}]_+
\ge 2^{-m_2-1}\mu^{-1}\sum_{k\le V_1/2,\, l\le V_2/2}\Big[1 - \sqrt{(k/V_1)^{2m_1} + (l/V_2)^{2m_2}}\Big]_+
$$
$$
\ge 2^{-m_2-1}\mu^{-1}\int_{x\le V_1/2+1}\int_{y\le V_2/2+1}\Big[1 - \sqrt{(x/V_1)^{2m_1} + (y/V_2)^{2m_2}}\Big]_+\, dx\, dy
\ge 2^{-m_2-1}\mu^{-1}\, V_1 V_2\int_{x\le 1/2+1/V_1}\int_{y\le 1/2+1/V_2}\Big[1 - \sqrt{x^2 + y^2}\Big]_+\, dx\, dy
\ge C V_1 V_2\,\mu^{-1}, \quad (47)
$$
where we used that $m_1\ge 1$, $m_2\ge 1$. Noting that $\sigma^2_{\lfloor k/2\rfloor, \lfloor l/2\rfloor}\ge c\,\sigma_{kl}^2$, where $c > 0$ does not depend on $k, l$, and using (47), one obtains
$$
1 = \frac{\varepsilon^2}{\mu}\sum_{k,l}\sigma_{kl}^2\, a_{kl}[1 - \mu a_{kl}]_+
\ge c\,\frac{\varepsilon^2}{\mu}\,\sigma^2_{V_1, V_2}\sum_{|k|\le V_1/2,\, |l|\le V_2/2} a_{kl}[1 - \mu a_{kl}]_+
\ge C\varepsilon^2\mu^{-2}\, V_1 V_2\,\sigma^2_{V_1, V_2}
= C\gamma_0^{-3}\varepsilon^2\mu^{-2}\, V_1 V_2\,(\gamma_0^2 + V_1^2 + V_2^2)^{3/2}.
$$

The proof now follows from this inequality, (46), Proposition 5 and simple algebra. $\Box$

The minimax estimator $\hat\theta^M$ defined by (45) depends on the parameters $(m, p)$ of the functional class $\Theta_2^m(p)$, which are usually not known in practice. To overcome this drawback of $\hat\theta^M$ we construct another estimator which is asymptotically minimax but does not depend on the parameters of the Sobolev ball. Such an estimator is called sharp adaptive. We show that a sharp adaptive estimator can be obtained by using the filter (10) with an appropriately chosen set $\Lambda$.

Let $\Lambda^0$ be the set of all filters with weights $\lambda_{kl}$ of the form
$$
\lambda_{kl}(W, \alpha) = \lambda_{kl}(W_1, W_2, \alpha_1, \alpha_2)
= \Big[1 - \sqrt{(k/W_1)^{2\alpha_1} + (l/W_2)^{2\alpha_2}}\Big]_+,
$$
where $W\in[\log(1/\varepsilon), \infty)\times[\log(1/\varepsilon), \infty)$ and $\alpha\in[1, \infty)\times[1, \infty)$. The minimax filter belongs to this set for $\varepsilon$ small enough (cf. (45), and note that $\mu = O(\varepsilon^{\alpha})$ by Lemma 4). Unfortunately, $\Lambda^0$ contains infinitely many elements and we cannot apply Theorem 1 with $\Lambda = \Lambda^0$. Note also that direct numerical minimization of the unbiased risk estimator over $\Lambda^0$ is very time consuming. Therefore we look for a finite sub-family which approximates well the filters in $\Lambda^0$. In order to make the estimator numerically feasible we pick an approximating set containing a "minimal" number of elements.

Let $\delta = \log^{-1}(1/\varepsilon)$. Denote $w_\delta(k) = (1+\delta)^k$, $k = 1, 2, \ldots$, and $\alpha_\delta(k) = (1 - k\delta)^{-1}$, $k = 1, 2, \ldots, \lfloor 1/\delta\rfloor$. Define the approximating finite sub-family $\Lambda_\delta$ as the set of filters of the following form: $\Lambda_\delta = \{\lambda_{kl}(w_\delta(i), w_\delta(j), \alpha_\delta(s), \alpha_\delta(p))\}$, where $i, j, s, p\in\mathbb{N}$ are such that $w_\delta(i), w_\delta(j)\in[\log(1/\varepsilon), \varepsilon^{-1}]$ and $\alpha_\delta(s), \alpha_\delta(p) > 1$.

Although, for the sake of simplicity, we assumed here a specific form of $\sigma_{kl}$, our results hold in the general situation where $\sigma_{kl}$ satisfies the following assumption.

Assumption 4. The sequence $\sigma_{kl}^2$ is positive, monotone non-decreasing in $k$ and $l$, and there exists $c > 0$ such that $\sigma^2_{\lfloor k/2\rfloor, \lfloor l/2\rfloor}\ge c\,\sigma_{kl}^2$ for all $k, l$.

Lemma 5  Let Assumption 4 hold. For any $\lambda\in\Lambda^0$ there exists a filter $\lambda'\in\Lambda_\delta$ such that, for all $\theta\in\ell^2$,
$$
E\|\hat\theta(\lambda') - \theta\|^2 \le (1 + C\delta)\, E\|\hat\theta(\lambda) - \theta\|^2, \quad (48)
$$
where $C > 0$ does not depend on $\varepsilon, \lambda, \lambda'$.

Proof. Since
$$
E\|\hat\theta(\lambda) - \theta\|^2 = \sum_{k,l}(1 - \lambda_{kl})^2\theta_{kl}^2 + \varepsilon^2\sum_{k,l}\sigma_{kl}^2\lambda_{kl}^2,
$$
it suffices to show that for any $\lambda\in\Lambda^0$ there exists a filter $\lambda'\in\Lambda_\delta$ such that $\lambda_{kl}\le\lambda'_{kl}$ for all $k, l$ and
$$
\sum_{k,l}\sigma_{kl}^2\lambda'^{\,2}_{kl} \le (1 + C\delta)\sum_{k,l}\sigma_{kl}^2\lambda_{kl}^2.
$$

To prove this, first note that the componentwise inequality $(W, \alpha)\le(W', \alpha')$ implies
$$
\lambda_{kl}(W, \alpha) \le \lambda_{kl}(W', \alpha'). \quad (49)
$$
Fix $(W, \alpha) = (W_1, W_2, \alpha_1, \alpha_2)$ and denote
$$
i_0 = \max\{i : w_\delta(i)\le W_1\}, \quad j_0 = \max\{j : w_\delta(j)\le W_2\}, \quad
s_0 = \max\{s : \alpha_\delta(s)\le\alpha_1\}, \quad p_0 = \max\{p : \alpha_\delta(p)\le\alpha_2\}.
$$
Set
$$
(W^0, \alpha^0) = (w_\delta(i_0), w_\delta(j_0), \alpha_\delta(s_0), \alpha_\delta(p_0)), \qquad
(W^1, \alpha^1) = (w_\delta(i_0+1), w_\delta(j_0+1), \alpha_\delta(s_0+1), \alpha_\delta(p_0+1)).
$$
We have $(W^0, \alpha^0)\le(W, \alpha)\le(W^1, \alpha^1)$, and thus
$$
\lambda_{kl}(W^0, \alpha^0) \le \lambda_{kl}(W, \alpha) \le \lambda_{kl}(W^1, \alpha^1). \quad (50)
$$
Set $\lambda'_{kl} = \lambda_{kl}(W^1, \alpha^1)$. In view of (50), it suffices to show that
$$
\sum_{k,l}\sigma_{kl}^2\lambda_{kl}^2(W^1, \alpha^1) \le (1 + C\delta)\sum_{k,l}\sigma_{kl}^2\lambda_{kl}^2(W^0, \alpha^0). \quad (51)
$$
Let $V(W_1, W_2, \alpha_1, \alpha_2) = \sum_{k,l}\sigma_{kl}^2\lambda_{kl}^2(W_1, W_2, \alpha_1, \alpha_2)$. From Lemma 6 of the Appendix we have
$$
V(W^1, \alpha^1) - V(W^0, \alpha^0)
\le C\, V(W^0, \alpha^0)\,\max_k\Big(\frac{w_\delta(k+1) - w_\delta(k)}{w_\delta(k)} + \frac{\alpha_\delta(k+1) - \alpha_\delta(k)}{\alpha_\delta^2(k+1)}\Big).
$$
This and the obvious inequalities
$$
\alpha_\delta(k+1) - \alpha_\delta(k) \le \delta\,\alpha_\delta^2(k+1), \qquad w_\delta(k+1) - w_\delta(k) = \delta\, w_\delta(k)
$$
imply (51). Lemma 5 is proved. $\Box$

Now, choose the data-driven filter $\hat\lambda = (\hat\lambda_{kl};\ k, l = 0, 1, 2, \ldots)$ in the finite family of filters $\Lambda = \Lambda_\delta$ using the unbiased risk estimator, as in (10):
$$
\hat\lambda = \arg\min_{\lambda\in\Lambda_\delta}\Big[\sum_{k,l}(\lambda_{kl}^2 - 2\lambda_{kl})z_{kl}^2 + 2\varepsilon^2\sum_{k,l}\sigma_{kl}^2\lambda_{kl}\Big].
$$
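A compact numerical sketch of this construction (ours, not the paper's: flattening the doubly indexed sums to vectors and the concrete index ranges are illustrative assumptions) — the weights $\lambda_{kl}(W, \alpha)$, the grids $w_\delta$, $\alpha_\delta$, and the unbiased-risk selection of $\hat\lambda$:

```python
import numpy as np

def lambda_W(k, l, W1, W2, a1, a2):
    """Weights lambda_kl(W, alpha) = [1 - sqrt((k/W1)^(2 a1) + (l/W2)^(2 a2))]_+."""
    return np.clip(1.0 - np.sqrt((np.abs(k) / W1) ** (2 * a1)
                                 + (np.abs(l) / W2) ** (2 * a2)), 0.0, None)

def grids(eps):
    """Geometric grid w_delta(i) = (1+delta)^i restricted to [log(1/eps), 1/eps]
    and alpha_delta(s) = (1 - s*delta)^(-1), with delta = 1/log(1/eps)."""
    delta = 1.0 / np.log(1.0 / eps)
    i = np.arange(1, int(np.log(1.0 / eps) / np.log1p(delta)) + 1)
    w = (1.0 + delta) ** i
    w = w[(w >= np.log(1.0 / eps)) & (w <= 1.0 / eps)]
    s = np.arange(1, int(1.0 / delta) + 1)
    alpha = 1.0 / (1.0 - s * delta)
    return w, alpha[alpha > 1.0]

def ure_select(z, sigma2, eps, filters):
    """Minimize the unbiased risk estimator
    sum (lam^2 - 2 lam) z^2 + 2 eps^2 sum sigma^2 lam over a finite family."""
    scores = [np.sum((lam ** 2 - 2 * lam) * z ** 2)
              + 2 * eps ** 2 * np.sum(sigma2 * lam) for lam in filters]
    return filters[int(np.argmin(scores))]
```

Here `ure_select` implements exactly the criterion displayed above; in practice $z$, $\sigma^2$ and each candidate filter would be arrays indexed by $(k, l)$ built from `lambda_W` on the grid returned by `grids`.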

Theorem 3  The estimator $\hat\theta = (\hat\theta_{kl};\ k, l = 0, 1, 2, \ldots)$, where $\hat\theta_{kl} = \hat\lambda_{kl}\, z_{kl}$, is minimax sharp adaptive on the scale of Sobolev classes, i.e. for every $(m, p)$ such that $m_i\in\mathbb{N}$, $p_i > 0$, we have
$$
\sup_{\theta\in\Theta_2^m(p)} E\|\hat\theta - \theta\|^2
\le (1 + o(1))\inf_{\tilde\theta}\sup_{\theta\in\Theta_2^m(p)} E\|\tilde\theta - \theta\|^2, \qquad \varepsilon\to 0, \quad (52)
$$
where $\inf_{\tilde\theta}$ is the infimum over all estimators.

Proof. We apply Theorem 1. First, we check Assumptions 1-3. Assumption 1 is obvious; Assumption 2 follows from the inequalities (cf. (A.2) of the Appendix)
$$
\sum_{k,l}\sigma_{kl}^4\lambda_{kl}^4
\ge c\,\sigma^4_{W_1, W_2}\sum_{|k|\le W_1/2,\, |l|\le W_2/2}\lambda_{kl}^4
\ge C W_1 W_2\,\sigma^4_{W_1, W_2} \quad (53)
$$
and
$$
\sum_{k,l}\sigma_{kl}^4\lambda_{kl}^2 \le \sum_{|k|\le W_1,\, |l|\le W_2}\sigma_{kl}^4 \le C W_1 W_2\,\sigma^4_{W_1, W_2},
$$
where the constants $C > 0$ do not depend on $(W, \alpha)$. Using again (53), we find
$$
\rho(\lambda) \le \sup_{|k|\le W_1,\, |l|\le W_2}\sigma_{kl}^2\,\Big(\sum_{k,l}\sigma_{kl}^4\lambda_{kl}^4\Big)^{-1/2}
\le C(W_1 W_2)^{-1/2} \le C\log^{-1}(1/\varepsilon),
$$
where the constants $C > 0$ do not depend on $(W, \alpha)$. This proves that Assumption 3 is satisfied. Furthermore, using (A.2) of the Appendix, we obtain
$$
\omega(x) \le \max_{\lambda\in\Lambda^0}\sup_{k,l}\sigma_{kl}^2\lambda_{kl}^2\,
I\Big\{x\sup_{k,l}\sigma_{kl}^2\lambda_{kl}^2 > \sum_{k,l}\sigma_{kl}^2\lambda_{kl}^2\Big\}
\le \max_{W_1, W_2}\Big[\max_{|k|\le W_1,\, |l|\le W_2}\sigma_{kl}^2\Big]\, I\big\{C W_1 W_2\,\sigma^2_{W_1, W_2}\le x\,\sigma^2_{W_1, W_2}\big\}
$$
$$
\le C\max_{W_1, W_2}\big[W_1^2 + W_2^2\big]^{3/2}\, I\{W_1 W_2\le C x\} \le C x^{3/2}.
$$
Next, $N = \mathrm{Card}\,\Lambda_\delta$ is of the order $\log^6\varepsilon^{-1}$, and the parameter $S$ is of the order $\varepsilon^{-3}$. Thus, by Theorem 1 and Lemma 5, one concludes that for $\varepsilon$ small enough the estimator $\hat\theta$ satisfies the oracle inequality
$$
E\|\hat\theta - \theta\|^2
\le (1 + C B^{-1})\min_{\lambda\in\Lambda_\delta} E\|\hat\theta(\lambda) - \theta\|^2
+ C\varepsilon^2 B^4\big(\log^6\varepsilon^{-3} + \log^{5/2}\varepsilon^{-1}\big)
$$
$$
\le (1 + C B^{-1})(1 + C\delta)\min_{\lambda\in\Lambda^0} E\|\hat\theta(\lambda) - \theta\|^2 + C\varepsilon^2 B^4\log^6(1/\varepsilon)
\le (1 + C B^{-1})(1 + C\delta)\, E\|\hat\theta^M - \theta\|^2 + C\varepsilon^2 B^4\log^6(1/\varepsilon), \quad (54)
$$
where $\hat\theta^M$ is the asymptotically minimax estimator defined in (45). For the last inequality we used that this estimator belongs to $\Lambda^0$ for $\varepsilon$ small enough, in view of Lemma 4 and the fact that $\mu = O(\varepsilon^{\alpha})$. Choose $B = \varepsilon^{-r/5}$, where $r = 1 - \alpha > 0$ and $\alpha$ is defined in Lemma 4. Taking the suprema of both sides of (54) with respect to $\theta\in\Theta_2^m(p)$, using Proposition 5 and observing that by Lemma 4 the minimax risk over $\Theta_2^m(p)$ has the order $\varepsilon^{2\alpha}$, we get (52). $\Box$

Note that Theorem 3 remains valid if $m$ in the definition of the ellipsoid $\Theta_2^m(p)$ is real-valued. Furthermore, it is easy to see from the proof that the $o(1)$ in (52) converges to 0 uniformly over a large set of values $m, p$.

APPENDIX

Lemma 6  Let Assumption 4 hold. Then the partial derivatives of $\log V$ exist a.e. on the set $[\log(1/\varepsilon), \infty)\times[\log(1/\varepsilon), \infty)\times[1, \infty)\times[1, \infty)$ and satisfy, uniformly on this set, the inequalities
$$
\Big|\frac{\partial\log V}{\partial W_i}\Big| \le C W_i^{-1}, \qquad
\Big|\frac{\partial\log V}{\partial\alpha_i}\Big| \le C\alpha_i^{-2}, \qquad i = 1, 2. \quad (\mathrm{A.1})
$$

Proof. Using Assumption 4 and acting as in (47), we get, for $W_i\ge\log(1/\varepsilon)$,
$$
V(W_1, W_2, \alpha_1, \alpha_2) = \sum_{k,l}\sigma_{kl}^2\lambda_{kl}^2(W_1, W_2, \alpha_1, \alpha_2)
\ge c\,\sigma^2_{W_1, W_2}\sum_{|k|\le W_1/2,\, |l|\le W_2/2}\lambda_{kl}^2(W_1, W_2, \alpha_1, \alpha_2)
$$
$$
\ge c\,\sigma^2_{W_1, W_2}\int_{|x|\le W_1/2+1}\int_{|y|\le W_2/2+1}\Big[1 - \sqrt{(x/W_1)^{2\alpha_1} + (y/W_2)^{2\alpha_2}}\Big]_+^2\, dx\, dy
$$
$$
\ge c\,\sigma^2_{W_1, W_2}\, W_1 W_2\int_{|x|\le 1/2+1/W_1}\int_{|y|\le 1/2+1/W_2}\Big[1 - \sqrt{x^2 + y^2}\Big]_+^2\, dx\, dy
\ge C W_1 W_2\,\sigma^2_{W_1, W_2}, \quad (\mathrm{A.2})
$$
where the constant $C > 0$ does not depend on $(W, \alpha)$. Denote for brevity $A(x, y) = (x^{2\alpha_1} + y^{2\alpha_2})^{-1/2}$. Since $A(x, y)\le x^{-\alpha_1}$ for all $x > 0$, $y > 0$, one obtains
$$
\frac{\partial V}{\partial W_1}
= \frac{\partial}{\partial W_1}\sum_{k,l}\sigma_{kl}^2\lambda_{kl}^2(W_1, W_2, \alpha_1, \alpha_2)
= 4\alpha_1 W_1^{-1}\sum_{k,l}\sigma_{kl}^2\lambda_{kl}(W_1, W_2, \alpha_1, \alpha_2)\,
A\Big(\frac{k}{W_1}, \frac{l}{W_2}\Big)\Big(\frac{k}{W_1}\Big)^{2\alpha_1}
$$
$$
\le 4\alpha_1 W_1^{-1}\,\sigma^2_{W_1, W_2}\sum_{|k|\le W_1,\, |l|\le W_2}\Big(\frac{k}{W_1}\Big)^{\alpha_1}
\le C\alpha_1 W_2\,\sigma^2_{W_1, W_2}\int_0^1 x^{\alpha_1}\, dx
\le C W_2\,\sigma^2_{W_1, W_2}, \quad (\mathrm{A.3})
$$
where the constants $C > 0$ do not depend on $(W, \alpha)$. Combining (A.2) and (A.3), we get the first inequality in (A.1). The second inequality in (A.1) is checked in a similar way. In fact, it suffices to use (A.2) and the relation
$$
\frac{\partial V}{\partial\alpha_1}
= \frac{\partial}{\partial\alpha_1}\sum_{k,l}\sigma_{kl}^2\lambda_{kl}^2(W_1, W_2, \alpha_1, \alpha_2)
= 4\sum_{k,l}\sigma_{kl}^2\lambda_{kl}(W_1, W_2, \alpha_1, \alpha_2)\,
A\Big(\frac{k}{W_1}, \frac{l}{W_2}\Big)\Big(\frac{k}{W_1}\Big)^{2\alpha_1}\log\frac{W_1}{k},
$$
which entails
$$
\Big|\frac{\partial V}{\partial\alpha_1}\Big|
\le C W_2\,\sigma^2_{W_1, W_2}\sum_{|k|\le W_1}\Big(\frac{k}{W_1}\Big)^{\alpha_1}\log\frac{W_1}{k}
\le C W_1 W_2\,\sigma^2_{W_1, W_2}\int_0^1 x^{\alpha_1}|\log x|\, dx
\le C W_1 W_2\,\sigma^2_{W_1, W_2}\,\alpha_1^{-2},
$$
where the constants $C > 0$ do not depend on $(W, \alpha)$.
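As a numerical sanity check of the first bound in (A.1), one can differentiate $\log V$ by finite differences. The sketch below is ours, not the paper's; the truncation level, $\gamma_0 = 1$, $\alpha = (1, 1)$ and the test point are assumptions. It only illustrates that $\partial\log V/\partial W_1$ stays of order $W_1^{-1}$:

```python
import numpy as np

def V(W1, W2, a1=1.0, a2=1.0, K=200, gamma0=1.0):
    """V(W, alpha) = sum sigma_kl^2 lambda_kl^2 over a truncated index grid,
    with sigma_kl^2 = gamma0^-3 (gamma0^2 + k^2 + l^2)^(3/2) (assumed form)."""
    k = np.arange(-K, K + 1)[:, None]
    l = np.arange(-K, K + 1)[None, :]
    sigma2 = gamma0 ** -3 * (gamma0 ** 2 + k ** 2 + l ** 2) ** 1.5
    lam = np.clip(1 - np.sqrt((np.abs(k) / W1) ** (2 * a1)
                              + (np.abs(l) / W2) ** (2 * a2)), 0, None)
    return np.sum(sigma2 * lam ** 2)

# central finite-difference estimate of d log V / d W1 at (W1, W2) = (20, 20)
W1, W2, h = 20.0, 20.0, 0.1
d = (np.log(V(W1 + h, W2)) - np.log(V(W1 - h, W2))) / (2 * h)
```

With this $\sigma_{kl}^2$ one finds a small positive value of `d`, consistent with the bound $C W_1^{-1}$ for a moderate constant $C$.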

References

[1] Birgé, L. (1999) An alternative point of view on Lepski's method. Preprint N.549, Laboratoire de Probabilités et Modèles Aléatoires, Universités Paris 6 - Paris 7.
[2] Birgé, L. and Massart, P. (1999) A generalized Cp criterion for Gaussian model selection. Preprint.
[3] Cavalier, L. and Tsybakov, A.B. (2000) Sharp adaptation for inverse problems with random noise. Preprint N.559, Laboratoire de Probabilités et Modèles Aléatoires, Universités Paris 6 - Paris 7.
[4] Donoho, D.L. (1995) Nonlinear solutions of linear inverse problems by wavelet-vaguelette decomposition. Journal of Applied and Computational Harmonic Analysis, 2, 101-126.
[5] Donoho, D.L. and Johnstone, I.M. (1994) Ideal spatial adaptation via wavelet shrinkage. Biometrika, 81, 425-455.
[6] Donoho, D.L. and Johnstone, I.M. (1995) Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc., 90, 1200-1224.
[7] Donoho, D.L. and Johnstone, I.M. (1996) Neoclassical minimax problems, thresholding and adaptive function estimation. Bernoulli, 2, 39-62.
[8] Efromovich, S.Yu. (1999) Nonparametric Curve Estimation. Springer, N.Y.
[9] Goldenshluger, A. and Pereverzev, S.V. (1999) Adaptive estimation of linear functionals in Hilbert scales from indirect white noise observations. Preprint 506, WIAS, Berlin.
[10] Goldenshluger, A. and Tsybakov, A.B. (2000) Adaptive estimation and prediction in linear regression with infinitely many parameters. Preprint N.598, Laboratoire de Probabilités et Modèles Aléatoires, Universités Paris 6 - Paris 7.
[11] Golubev, G.K. (1987) Adaptive asymptotically minimax estimates of smooth signals. Problems of Information Transmission, 23, 57-67.
[12] Golubev, G.K. (1992) Nonparametric estimation of smooth probability densities in L2. Problems of Information Transmission, 28, 44-54.
[13] Golubev, G.K. and Nussbaum, M. (1992) Adaptive spline estimates in a nonparametric regression model. Theory of Probability and its Applications, 37, 521-529.
[14] Golubev, G.K. and Khasminskii, R.Z. (1999a) Statistical approach to some inverse boundary problems for partial differential equations. Problems of Information Transmission, 35, 51-66.
[15] Golubev, G.K. and Khasminskii, R.Z. (1999b) Statistical approach to Cauchy problem for Laplace equation. Festschrift for W. van Zwet, to appear.
[16] Johnstone, I.M. (1999) Wavelet shrinkage for correlated data and inverse problems: adaptivity results. Statistica Sinica, 9, 51-83.
[17] Johnstone, I.M. and Silverman, B.W. (1990) Speed of estimation in positron emission tomography and related inverse problems. Annals of Statistics, 18, 251-280.
[18] Kneip, A. (1994) Ordered linear smoothers. Annals of Statistics, 22, 835-866.
[19] Koo, J.Y. (1993) Optimal rates of convergence for nonparametric statistical inverse problems. Annals of Statistics, 21, 590-599.
[20] Korostelev, A.P. and Tsybakov, A.B. (1993) Minimax Theory of Image Reconstruction. Lecture Notes in Statistics, v.82, Springer, New York.
[21] Li, K.-C. (1986) Asymptotic optimality of CL and generalized cross-validation in ridge regression with application to spline smoothing. Ann. Statist., 14, 1101-1112.
[22] Li, K.-C. (1987) Asymptotic optimality of Cp, CL, cross-validation and generalized cross-validation: discrete index set. Ann. Statist., 15, 958-976.
[23] Mair, B. and Ruymgaart, F.H. (1996) Statistical estimation in Hilbert scale. SIAM J. Appl. Math., 56, 1424-1444.
[24] Nemirovskii, A. (1998) Topics in Nonparametric Statistics. École d'été de probabilités de St-Flour. Lecture Notes.
[25] Pinsker, M.S. (1980) Optimal filtering of square integrable signals in Gaussian white noise. Problems of Information Transmission, 16, 120-133.
[26] Polyak, B.T. and Tsybakov, A.B. (1990) Asymptotic optimality of the Cp-test for the orthogonal series estimation of regression. Theory Probab. Appl., 35, 293-306.
[27] Polyak, B.T. and Tsybakov, A.B. (1992) A family of asymptotically optimal methods for choosing the estimate order in orthogonal series regression. Theory Probab. Appl., 37, 471-481.
[28] Stein, E. and Weiss, G. (1971) Introduction to Fourier Analysis on Euclidean Spaces. Princeton University Press.
[29] Wahba, G. (1977) Practical approximate solutions to linear operator equations when the data are noisy. SIAM J. Numer. Anal., 14, 651-667.
[30] Wahba, G. (1990) Spline Models for Observational Data. SIAM, Philadelphia.
