NONPARAMETRIC REGRESSION ESTIMATION WHEN THE REGRESSOR TAKES ITS VALUES IN A METRIC SPACE∗

Sophie DABO-NIANG^a and Noureddine RHOMARI^{a,b}

^a CREST, Timbre J340, 3, Av. Pierre Larousse, 92245 Malakoff cedex, France
[email protected]; [email protected]

^b Université Mohammed I, Faculté des Sciences, 60 000 Oujda, Maroc
[email protected]

(1st version, September 2001)

AMS classification: 62G08; 62H30.

Key words and phrases: Kernel estimates; regression function; metric-valued random vectors; discrimination.

Abstract: We study a nonparametric regression estimator when the explanatory variable takes its values in a separable semi-metric space. We establish some asymptotic results and give upper bounds for the p-mean and (pointwise and integrated) almost sure estimation errors, under general conditions. We end with an application to discrimination in a semi-metric space and study, as an example, the case where the explanatory variable is the Wiener process in C[0, 1].

Résumé : Dans ce travail, nous étudions l'estimateur à noyau de la régression quand la variable explicative prend ses valeurs dans un espace semi-métrique séparable. Nous établissons sa consistance en moyenne d'ordre p et presque sûre (ponctuelle et intégrée), sous des conditions générales. Nous précisons aussi des bornes supérieures de ces erreurs d'estimation. Ensuite, nous appliquons ces résultats à la discrimination de variables d'un espace semi-métrique et étudions comme exemple le cas où la variable explicative est le processus de Wiener dans C[0, 1].

∗ Comments are welcome.


1  Introduction

Let (X, Y) be a random vector taking values in X × R with E(|Y|) < ∞, where (X, d) is a separable semi-metric space equipped with the semi-metric d. The distribution of (X, Y) is typically unknown, and so is the regression function m(x) = E(Y | X = x), x ∈ X. We therefore want to estimate the regression operator m(x) by m_n(x), a function of x and of a sample (X_1, Y_1), ..., (X_n, Y_n) of (X, Y). The present framework includes the classical finite dimensional case X = R^d, d ≥ 1, but also infinite dimensional spaces, such as the usual function spaces or sets of probabilities on some given measurable space.

This problem has been widely studied when the variable X lies in a finite dimensional space, and there are many references on that case, contrary to the case of infinite dimensional observations. In this work X may be infinite dimensional; it can for example be a function space, which is an important case. The statistics of functional data has recently attracted growing interest. These questions in infinite dimension are particularly interesting, both for the fundamental problems they raise, see Bosq (2000), and for the many applications they allow, see Ramsay and Silverman (1997).

One possible application arises when the (X_i)_i are random curves, for instance in C(R) or C[0, T]. Many phenomena, in various areas (e.g. weather, medicine, ...), are continuous in time and may or must be represented by curves, and one may be interested either in forecasting or in classification. For example, consider the weather data of some country. One can investigate to what extent total annual precipitation at weather stations can be predicted from the pattern of temperature variation through the year. Let Y be the logarithm of total annual precipitation at a weather station, and let X = X(t), t ∈ [0, T], be its temperature function, where the interval [0, T] is either [0, 12] or [0, 365] depending on whether monthly or daily data are used. For more examples, we refer to Ramsay and Silverman (1997) and the references therein.

The popular estimate of m(x), x ∈ X, which is a locally weighted average,


is defined by the following expression

    m_n(x) = \sum_{i=1}^{n} W_{ni}(x) Y_i,                                        (1)

where (W_{n1}(x), ..., W_{nn}(x)) is a probability vector of weights and each W_{ni}(x) is a Borel measurable function of x, X_1, ..., X_n. We will be interested in the kernel method,

    W_{ni}(x) = K(d(X_i, x)/h_n) / \sum_{j=1}^{n} K(d(X_j, x)/h_n),               (2)

if \sum_{j=1}^{n} K(d(X_j, x)/h_n) ≠ 0, and W_{ni}(x) = 0 otherwise, where (h_n)_n is a sequence of positive numbers and K is a nonnegative measurable function on R.

When (X_i, Y_i) ∈ R^d × R^k, the estimation of m has been treated by many authors with various weights, including the nearest neighbor and kernel methods. For the references cited hereafter, we limit ourselves to the general case where no assumption is made concerning the existence of a density with respect to some reference measure. Stone (1977) showed that the nearest neighbor estimate is universally consistent, that is, E(|m_n(X) − m(X)|^p) → 0 as n → ∞ whenever E(|Y|^p) < ∞, p ≥ 1, and this for more general weight vectors. Devroye and Wagner (1980), and Spiegelman and Sacks (1980), extended the same result to the kernel estimate. Universal consistency results were also presented by Györfi (1981). Krzyżak (1986) gave rates of convergence in probability and almost surely for a wide class of kernel estimates; see also the references cited by these authors. Pointwise consistency was treated, among others, by Devroye (1981) for both the nearest neighbor and the kernel methods, and by Greblicki et al. (1984) for more general kernel estimates. For example, Devroye (1981) proved that m_n(x) is consistent in p-mean, that is,

    lim_{n→∞} E(|m_n(x) − m(x)|^p) = 0,                                           (3)

for P_X-almost all x in R^d, whenever E(|Y|^p) < ∞, p ≥ 1, h_n → 0 and lim_{n→∞} n h_n^d = ∞, where P_X is the probability distribution of X. He also established its almost sure pointwise convergence and its almost sure integrated (w.r.t. P_X) consistency if in addition Y is bounded and lim_{n→∞} n h_n^d / log n = ∞.
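Before stating our extensions, here is a concrete illustration (not from the paper) of the estimator defined by (1) and (2) in a few lines of Python. The sup-norm semi-metric, the uniform kernel K(u) = 1_{|u| ≤ 1} and the toy curve data are placeholder choices made only for this sketch.

```python
import numpy as np

def kernel_regression(x, X, Y, d, h, K=lambda u: (np.abs(u) <= 1.0).astype(float)):
    """Kernel estimate m_n(x) of (1)-(2): a locally weighted average of the Y_i,
    with weights driven by the semi-metric d and the bandwidth h.
    X holds the observed regressors (possibly discretized curves), Y the responses."""
    k = np.array([K(d(Xi, x) / h) for Xi in X], dtype=float)   # kernel values K(d(X_i, x)/h)
    s = k.sum()
    if s == 0.0:          # convention: all weights W_ni(x) = 0 when the denominator vanishes
        return 0.0
    return float(np.dot(k / s, np.asarray(Y, dtype=float)))    # sum_i W_ni(x) Y_i

# Toy functional data: random-walk curves on a grid, sup-norm semi-metric.
rng = np.random.default_rng(0)
X = [np.cumsum(rng.normal(scale=0.1, size=100)) for _ in range(200)]
Y = [float(xi.mean()) + 0.1 * rng.normal() for xi in X]        # toy regression model
d_sup = lambda u, v: float(np.max(np.abs(u - v)))              # sup-norm semi-metric
print(kernel_regression(X[0], X, Y, d_sup, h=0.5))
```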


In this paper we are interested in extending Devroye's (1981) results to the case where X takes values in a general separable semi-metric space, which may be infinite dimensional. We also make precise the rates of convergence, which are optimal in the case of a finite dimensional X. The literature on this topic in infinite dimension is, to our knowledge, relatively limited. Bosq (1983) and Bosq and Delecroix (1985) deal with general kernel predictors for Hilbert-valued Markovian processes. In the case of independent observations, Kulkarni and Posner (1995) studied nearest neighbor estimation in a general separable metric space X; they gave a rate of convergence connected to metric covering numbers in X. The recent work of Ferraty and Vieu (2000) is on the kernel method for X lying in a seminormed vector space whose distribution has a finite fractal dimension (see condition (4) below). They proved that, when Y is bounded, the kernel estimate related to (1) and (2) converges almost surely to m(x) if m is continuous at x, if there exist two positive numbers a(x), c(x) such that

    lim_{h→0} P(X ∈ B_h^x) / h^{a(x)} = c(x),                                     (4)

and if h_n → 0 and lim_{n→∞} n h_n^{a(x)} / log n = ∞, where B_h^x denotes the closed ball

of radius h and center x. Under similar conditions, Ferraty et al. (2002) extended this result to dependent observations, obtained a.s. uniform convergence on compact sets and made its rate precise. Here, the same statements are established, in the independent case, under general assumptions (see Theorem 2 below). The conditions are connected with the probability of the explanatory variable X lying in the ball B_{h_n}^x. We ask that

    lim_{n→∞} n P(X ∈ B_{h_n}^x) / log n = ∞,

which is fulfilled for all random variables X and some class of sequences (h_n) converging to 0, and m may be discontinuous at x; hence we extend the results of [12], and in the case of condition (4) we improve their rate of convergence. The proofs are different. A condition similar to (4) allows [13] to extend the study to dependent observations and to establish a.s. uniform consistency under a uniform-type version of (4). However, condition (4) excludes an interesting class of processes having infinite fractal dimension, such as the Wiener process (see paragraph 4.2). The proof used here can easily be extended to the dependent case via the coupling

method, but it seems difficult to obtain the uniform consistency with it; the a.s. convergence remains valid for absolutely regular (β-mixing) processes (see [26]). Note that our conditions, written in terms of P(X ∈ B_{h_n}^x), unify regression estimation with finite and infinite dimensional explanatory variables, and the results, especially the bounds on the estimation errors, highlight how the probability of small balls related to the explanatory variable influences the convergences.

Let µ be the probability distribution of X. In the case of the finite dimensional space X = R^d, the crucial tool needed to prove those results, e.g. in [9], [18] and [27], is the differentiation result for a probability measure, i.e. if f ∈ L^p(R^d, µ) = {f : (R^d, B_{R^d}, µ) → (R, B_R) : ||f||_p < ∞}, then

    lim_{h→0} (1/µ(B_h^x)) \int_{B_h^x} |f(z) − f(x)|^p dµ(z) = 0,                (5)

for µ-almost all x, p ≥ 1. This holds for all Borel locally finite measures on R^d, in particular for probability measures. Unfortunately, that statement is in general false when X is an infinite dimensional space, even for probability measures, as proved by Preiss (1979). He constructs a Gaussian measure γ on a Hilbert space H and an integrable function f such that

    lim_{s→0} inf { (1/γ(B_h^x)) \int_{B_h^x} f dγ : x ∈ H, 0 < h < s } = ∞.

However, (5) remains valid, under some additional conditions, for some probability measures µ on a Hilbert space, as stated in Tišer's theorem below.

Theorem A (Tišer, 1988). Let H be a Hilbert space and let γ be a Gaussian measure whose covariance operator has the spectral representation Rx = \sum_i c_i ⟨x, e_i⟩ e_i, where (e_i) is an orthonormal system in H. Suppose c_{i+1} ≤ c_i i^{−α} for a given α > 5/2. Then, for all f ∈ L^p(H, γ), p > 1,

    lim_{h→0} (1/γ(B_h^x)) \int_{B_h^x} |f(z) − f(x)| dγ(z) = 0,   for γ-almost every x.

Remark 1. In the above Tišer theorem, the convergence can be stated with |·|^{p'} instead of |·|, and for all f ∈ L^p_{loc}(H, γ) and p' < p (loc stands for locally integrable), that is,

    lim_{h→0} (1/γ(B_h^x)) \int_{B_h^x} |f(z) − f(x)|^{p'} dγ(z) = 0,   for γ-almost every x.

For this, one uses either his own proof, or the same idea as in the first part of the proof of Lemma 2.1 in Devroye (1981), or the same argument as in the proof of Theorem 7.15, page 108, in [32].

Remark 2. It is obvious that (5) remains valid for all almost surely continuous functions f and all Borel locally finite measures on a semi-metric space. In particular it holds at each point x of continuity of f, for all probability measures µ. That is why we preferred to work with general regression functions fulfilling condition (5).

In this work, we shall prove that (5) and E(|Y|^p) < ∞, for some p ≥ 1, imply (3), even for an X-valued random variable X with X of infinite dimension. We also show the pointwise and integrated (w.r.t. µ, i.e. L^1(µ)-error \int |m_n − m| dµ) strong convergence of m_n when the r.v. Y is bounded, under general conditions (m may be discontinuous), and we make the rate of convergence precise for each type of consistency. We then apply these results to the nonparametric discrimination of variables in a semi-metric space (e.g. the curve classification problem). As an example we treat the special case where µ is a Wiener measure and X = C[0, 1] is equipped with the sup-norm. This example, which does not fulfill condition (4) of [12], illustrates the extension and the contribution of this work: we can apply it to regression with an explanatory variable whose distribution has an infinite fractal dimension and to the classification of general curves, which is new to our knowledge. It is obvious that all the results also hold when Y takes values in R^k, k ≥ 1, thanks to the linearity of the estimator (1) with respect to the (Y_i)'s. Another work, dealing with Y in a separable Banach space, which may also be a curve, shows that the same results remain valid for regression estimation with a response variable in a Banach space. As mentioned above, we can also drop the independence condition and replace it by an absolute regularity (β-mixing) condition (see [26]).

The rest of the paper is organized as follows. We give the assumptions and some comments in Section 2. Section 3 contains the results on pointwise consistency in p-mean and on strong consistency, followed by the study of the integrated estimation error in the special case of bounded Y. Section 4 is devoted to the application to the nonparametric discrimination problem, followed by the example of the Wiener measure and X = C[0, 1] equipped with the sup-norm. The proofs are postponed to the last section.

Remark 3. Let X' be a separable semi-metric space, and let Φ : R → R and Ψ : X → X' be measurable functions. It is clear that the results below can be applied to estimate m_Φ^Ψ(z) = E(Φ(Y) | Ψ(X) = z). Introducing these functions allows one to cover and estimate various (functional) parameters related to the distribution of (X, Y), by varying them. For example:
i) If Φ = Id and Ψ = Id, then m_{Id}^{Id}(x) = E(Y | X = x) = m(x).
ii) If A is a measurable set and Φ(Y) = 1_A(Y), then m_Φ^Ψ(z) = P(Y ∈ A | Ψ(X) = z) corresponds to the conditional distribution of Y given Ψ(X), ...
Further, Ψ can be used when X is not separable, in order to fulfill the separability condition. We have omitted these functions simply to lighten the notation.
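Continuing the illustrative sketch given after (2) (same toy data and placeholder semi-metric; not part of the paper), Remark 3 ii) amounts to running the same estimator on transformed responses Φ(Y_i) = 1_A(Y_i); here A = [0, ∞) and Ψ = Id are arbitrary choices made only for this example.

```python
# Conditional probability P(Y in A | X = x) estimated with the same kernel estimator,
# simply by replacing Y_i with Phi(Y_i) = 1_A(Y_i) (here A = [0, +inf), Psi = Id).
Phi_Y = [float(yi >= 0.0) for yi in Y]                # Phi(Y_i) = 1_A(Y_i)
p_hat = kernel_regression(X[0], X, Phi_Y, d_sup, h=0.5)
print(p_hat)                                          # estimate of P(Y >= 0 | X = X[0])
```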

2  Assumptions

Let us assume that there exist positive numbers r, a and b such that

    a 1_{{|u| ≤ r}} ≤ K(u) ≤ b 1_{{|u| ≤ r}},                                     (6)

and, for a fixed x in X, let B_h^x denote the closed ball of radius h and center x, and suppose

    h_n → 0   and   lim_{n→∞} n µ(B_{rh_n}^x) = ∞.                               (7)

We assume also that, for some p ≥ 1,

    lim_{h→0} (1/µ(B_h^x)) \int_{B_h^x} |m(w) − m(x)|^p dµ(w) = 0,                (8)

and

    E[ |Y − m(X)|^p | X ∈ B_{rh_n}^x ] = o( [n µ(B_{rh_n}^x)]^{p/2} ).            (9)

Remark 4
1- (8) is satisfied at every point of continuity of m, for all µ.
2- (8) does not imply, in general, that m is continuous. Indeed, the Dirichlet function defined on [0, 1] by m(x) = 1_{{x rational in [0,1]}} fulfills (8) with µ a probability measure that is absolutely continuous with respect to Lebesgue measure on [0, 1], but it is nowhere continuous (example from Krzyżak (1986)); in infinite dimension take, instead of the rationals, some countable dense set and a continuous measure.
3- If m verifies (8) for p = 1 and is bounded in a neighborhood of x, then (8) is true for all p.
4- When Y is bounded, condition (9) is obviously satisfied.
5- It is clear that E[|Y − m(X)|^p | X ∈ B_{rh_n}^x] = (1/µ(B_{rh_n}^x)) E[|Y − m(X)|^p 1_{B_{rh_n}^x}(X)]; hence condition (9) is satisfied, for example, when (7) holds and if E[|Y − m(X)|^p | X ∈ B_{rh_n}^x] = O(1) or lim_{n→∞} n (µ(B_{rh_n}^x))^{1+2/p} = ∞.
6- As noticed above, thanks to Theorem A, (8) and (9) remain valid for all f ∈ L^{p'}(H, µ) and µ-almost all x in H, p' > p ≥ 1, where H is a Hilbert space and µ a Gaussian measure as in Theorem A; thus, for such a probability measure, (8) and (9) are fulfilled when E|Y|^{p'} < ∞. In this case Theorem 1 below holds for µ-almost all x in H.
7- Recall that (8) and (9) hold for µ-almost all x and all µ in the finite dimensional space X = R^d, whenever E|Y|^p < ∞; see [9] or [32]. In finite dimension X = R^d we have h^d = O(µ(B_h^x)) for µ-almost all x and all probability measures µ on R^d (see [9] or [32]); so, in this case, condition (7) is implied by h_n → 0 and lim_{n→∞} n h_n^d = ∞. The formulation of the assumptions and of the bounds on the estimation errors in terms of the probability µ(B_h^x) thus unifies the two cases of finite and infinite dimensional explanatory variables. It should be noted that the evaluation of µ(B_h^x), known as a small ball probability, is very difficult in infinite dimension (see e.g. [21]).
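In practice µ(B_h^x) is unknown; a natural surrogate, used here only as an illustration (it is not part of the paper), is the empirical small-ball probability, i.e. the fraction of sample points within distance h of x.

```python
import numpy as np

def empirical_small_ball(x, X, d, h):
    """Empirical counterpart of mu(B_h^x) = P(d(X, x) <= h): the proportion of the
    sample X_1, ..., X_n lying in the closed ball of radius h centred at x."""
    dists = np.array([d(Xi, x) for Xi in X], dtype=float)
    return float(np.mean(dists <= h))

# e.g., with the sup-norm semi-metric of the earlier sketch:
# mu_hat = empirical_small_ball(X[0], X, d_sup, h=0.5)
```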


3  Main results

3.1  p-mean consistency

Theorem 1 If E(|Y|^p) < ∞ and if (6), (7) and (9) hold, then

    E(|m_n(x) − m(x)|^p) → 0   as n → ∞,

for all x in X such that m satisfies (8).

If µ is discrete, the result in Theorem 1 holds for all points in the support of µ whenever E(|Y|^p) < ∞ and h_n → 0.

Rate of p-mean convergence

The next result gives a rate of p-mean convergence, p ≥ 2. We therefore need to reinforce condition (8) in order to evaluate the bias sharply. We suppose that m is "p-mean Lipschitz" in a neighborhood of x, with parameters 0 < τ = τ_x ≤ 1 and c = c_x > 0, that is,

    (1/µ(B_h^x)) \int_{B_h^x} |m(w) − m(x)|^p dµ(w) ≤ c_x h^{pτ},   as h → 0.     (10)

It is obvious that (10) is satisfied if m is Lipschitz with parameters τ = τ_x and c_x, but (10) is weaker than the Lipschitz condition and does not imply, in general, even the continuity of m at x (see the example in Remark 4). A similar assumption was used by Krzyżak (1986).

Corollary 1 If E(|Y|^p) < ∞, E(|Y − m(X)|^p | X ∈ B_{rh_n}^x) = O(1) and if (6), (7), (10) are satisfied, with p ≥ 2, then we have

    E(|m_n(x) − m(x)|^p) = O( h_n^{pτ} + [1/(n µ(B_{rh_n}^x))]^{p/2} ).

Without assuming the existence of probability densities, this statement seems to be new, in finite dimension, for an unbounded variable Y.

Remark 5
1. If we assume that lim sup_{h→0} µ(B_h^x)/h^{a(x)} < ∞ for some a(x) > 0, which is weaker than (4), we have, for h = c n^{−1/(2τ+a(x))},

    E(|m_n(x) − m(x)|^p) = O(n^{−pτ/(2τ+a(x))}),   p ≥ 2.

2. In the finite dimensional case X = R^d, the best rate derived from the bound in Corollary 1 is optimal (cf. [29]), because h^d = O(µ(B_h^x)) for µ-almost all x and all probability measures µ on R^d; see Lemma 2.2 of [9] and its proof, or [32, p. 189]. In this case we obtain a rate similar to the one above, with d in place of a(x).
3. In R^d, Spiegelman and Sacks (1980) obtained the same rate for E(|m_n(X) − m(X)|^2) in the case p = 2, τ = 1 and bounded Y. Krzyżak (1986) obtained, for a slightly more general kernel, rates in probability and for a.s. convergence under condition (10) with an unbounded response variable Y.
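For completeness, the bandwidth h = c n^{−1/(2τ+a(x))} of Remark 5 comes from balancing the two terms of the bound in Corollary 1 when µ(B_h^x) ≍ h^{a(x)}; the short computation below is added here only to make this explicit.

```latex
% Bias-variance balance in Corollary 1 when \mu(B_h^x) \asymp h^{a(x)}:
%   h^{p\tau} \asymp \bigl(n h^{a(x)}\bigr)^{-p/2}
%   \iff h^{\tau + a(x)/2} \asymp n^{-1/2}
%   \iff h \asymp n^{-1/(2\tau + a(x))},
% and substituting back gives the rate stated in Remark 5:
\[
  E\bigl(|m_n(x) - m(x)|^p\bigr)
  = O\!\left( h^{p\tau} + \bigl(n\,\mu(B_{rh}^x)\bigr)^{-p/2} \right)
  = O\!\left( n^{-p\tau/(2\tau + a(x))} \right),
  \qquad h = c\, n^{-1/(2\tau + a(x))}.
\]
```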

3.2  Strong consistency

Below, we give the pointwise and µ-integrated almost sure convergence results for the kernel estimate and we make their rates of convergence precise. In order to simplify the proofs, we assume that |Y| ≤ M < ∞ a.s.; with truncation techniques the results can be extended to unbounded Y, see the comments below.

Theorem 2 If Y is bounded, (6) holds, h_n → 0 and lim_{n→∞} n µ(B_{rh_n}^x) / log n = ∞, then

    |m_n(x) − m(x)| → 0   a.s., as n → ∞,                                         (11)

for all x in X such that m satisfies (8). If in addition (8) holds for µ-almost all x, then

    lim_{n→∞} E(|m_n(X) − m(X)| | X_1, Y_1, ..., X_n, Y_n) = 0   a.s.             (12)

Rate of a.s. convergence

We now establish the rate of strong consistency.

Corollary 2 Assume that |Y| ≤ M. If (6) and (10) (for p = 1) hold, and if h_n → 0 and lim_{n→∞} n µ(B_{rh_n}^x) / log n = ∞, then we have

    |m_n(x) − m(x)| = O( h_n^{τ} + [log n / (n µ(B_{rh_n}^x))]^{1/2} )   a.s.

In fact we obtain, more precisely, for large n, a.s.,

    |m_n(x) − m(x)| ≤ c_x (b/a) (r h_n)^{τ} + 6M (b/a) [n µ(B_{rh_n}^x) / log n]^{−1/2},

and the 6M can be replaced by some almost sure bound of

    2 max( \sqrt{E[|m(X) − m(x)|^2 | X ∈ B_{rh_n}^x]},  2 \sqrt{E[|Y − m(X)|^2 | X]} 1_{B_{rh_n}^x}(X) ).

But if m is Lipschitz in a neighborhood of x, with parameters 0 < τ = τ_x ≤ 1 and c_x > 0, this bound becomes

    |m_n(x) − m(x)| ≤ c_x (b/a) (r h_n)^{τ} + 2M (b/a) [n µ(B_{rh_n}^x) / log n]^{−1/2},

and the 2M can be replaced by some almost sure bound of

    2 \sqrt{E[|Y − m(X)|^2 | X]} 1_{B_{rh_n}^x}(X).

We can relax the boundedness of Y by assuming that it satisfies the Cramér condition; in this case the log n above is replaced by (log n)^3, see the comments below.

Remark 6
1. If the distribution of X satisfies the additional condition lim sup_{h→0} µ(B_h^x)/h^{a(x)} < ∞ for some a(x) > 0 (weaker than (4)) and m is continuous at x, then (7) and (8) are fulfilled, and we therefore recover Theorem 1 of Ferraty and Vieu (2000). If in addition (10) holds for p = 1, then we find, for h = c (log n / n)^{1/(2τ+a(x))}, that |m_n(x) − m(x)| = O((log n / n)^{τ/(2τ+a(x))}) a.s., which improves, for m Lipschitz, the bound given by Ferraty and Vieu (2000) under an additional condition on the expansion of the limit in (4). They assumed that there exists a positive real b(x) such that µ(B_h^x) = c(x) h^{a(x)} + O(h^{a(x)+b(x)}) and obtained the rate O((n / log n)^{−γ(x)/(2γ(x)+a(x))}), where γ(x) = min{b(x), τ}.
2. In finite dimension, X = R^d, Krzyżak (1986) obtained, for a slightly more general kernel, similar rates in probability and for almost sure convergence in the bounded case, but weaker ones for unbounded Y.


3.3  How to choose the window h?

It is well known that the performance of the kernel estimate depends on the choice of the window parameter h. The bound in Corollary 2 is simple and easy to compute, which allows us to choose the window parameter h that minimizes that bound. We can also use other methods, such as cross-validation.
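As an illustration of this plug-in idea (a sketch, not the paper's procedure), one can evaluate the Corollary 2 bound over a grid of candidate bandwidths, replacing µ(B_{rh}^x) by the empirical small-ball probability of the earlier sketch; the exponent τ is unknown in practice and is set to 1 below purely for illustration.

```python
import numpy as np

def choose_bandwidth(x, X, d, h_grid, tau=1.0, r=1.0):
    """Pick h minimizing a plug-in version of the Corollary 2 bound
    h^tau + sqrt(log n / (n * mu_hat(B^x_{r h}))), where mu_hat is the empirical
    small-ball probability.  tau = 1 is an illustrative default, not an estimate."""
    n = len(X)
    dists = np.array([d(Xi, x) for Xi in X], dtype=float)
    best_h, best_val = None, np.inf
    for h in h_grid:
        mu_hat = float(np.mean(dists <= r * h))
        if mu_hat == 0.0:                 # empty ball: the bound is infinite, skip this h
            continue
        val = h ** tau + np.sqrt(np.log(n) / (n * mu_hat))
        if val < best_val:
            best_h, best_val = h, val
    return best_h

# e.g. h_star = choose_bandwidth(X[0], X, d_sup, h_grid=np.linspace(0.05, 2.0, 40))
```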

4  Applications

4.1  Discrimination

Consider an object O with an observed variable X in some space X and an unknown nature, or class, Y lying in some discrete set. Classical discrimination is about predicting the unknown class Y of an object O from a d-dimensional observation X. The class Y takes values in a finite set {1, ..., M}. One constructs a function g, called a classifier, defined on X and taking values in {1, ..., M}, which represents one's guess g(X) of Y given X. We err on Y if g(X) ≠ Y, and the probability of error of a classifier g is L(g) = P{g(X) ≠ Y}. The question is how to build this function g, and what is the best way to determine the class of the object. In discrimination, the problem is to classify the object O, i.e. to decide on the value of Y from X and a sample of this pair of variables, by an empirical rule g_n, which is a measurable function from X × (X × {1, ..., M})^n into {1, ..., M}; its probability of error is P{g_n(X) ≠ Y | X_1, Y_1, ..., X_n, Y_n}, where (X_1, Y_1), ..., (X_n, Y_n) are observations from (X, Y).

It is well known that the Bayes classifier, defined by

    g^* = arg min_{g : X → {1,...,M}} P{g(X) ≠ Y},

is the best possible classifier, with respect to quadratic loss (see e.g. [11]). Note that g^* depends upon the distribution of (X, Y), which is unknown. Instead, we build an estimator g_n of the classifier based on a sample of observations.

Here we consider an observation X belonging to some separable semi-metric space X, with an unknown class Y. X may be of infinite dimension, e.g. a curve. We want to predict this class from X and a sample of this pair of variables, for example to classify objects with curve parameters (e.g. determine the risk Y of patients according to an electrocardiogram curve, ...); in this case the (X_i) take values in some function space, for instance C(R). In this section we extend Devroye's results to the setting of X belonging to a semi-metric space X, which may be of infinite dimension, e.g. a function space, as mentioned above.

Let X be a random variable with values in X and let Y take values in {1, ..., M}; Y is estimated from X and (X_1, Y_1), ..., (X_n, Y_n) by g_n(X). The quality of this estimate is measured by the probability of error L_n = P{g_n(X) ≠ Y | X_1, Y_1, ..., X_n, Y_n}, and it is clear that

    L_n ≥ L^* = inf_{g : X → {1,...,M}} P{g(X) ≠ Y},

where L^* is the Bayes probability of error. The Bayes classifier g^* can be approximated by g_n chosen so that

    \sum_{i=1}^{n} W_{ni}(x) 1_{{Y_i = g_n(x)}} = max_{1 ≤ j ≤ M} \sum_{i=1}^{n} W_{ni}(x) 1_{{Y_i = j}}.            (13)

Such a classifier g_n (not necessarily uniquely determined) is called an approximate Bayes classifier. The convergence in probability and the almost sure convergence are established in the following theorem.

Theorem 3 If (6), (7) and (8) hold for µ-almost all x, then lim_{n→∞} L_n = L^* in probability.
If in addition lim_{n→∞} n µ(B_{rh_n}^x) / log n = ∞ for µ-almost all x, then lim_{n→∞} L_n = L^* almost surely.
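A minimal sketch of the rule (13) (an illustration under the same placeholder kernel and semi-metric as in the earlier sketches): each class score is the sum of the kernel weights of the training observations in that class, and g_n(x) is a class with the largest score. Since all weights share the same denominator, un-normalized kernel sums can be compared directly.

```python
import numpy as np

def kernel_classifier(x, X, labels, d, h, classes,
                      K=lambda u: (np.abs(u) <= 1.0).astype(float)):
    """Approximate Bayes rule (13): return a class j maximizing the sum of the
    kernel weights W_ni(x) over the training points X_i with label j."""
    k = np.array([K(d(Xi, x) / h) for Xi in X], dtype=float)    # kernel values
    labels = np.asarray(labels)
    scores = {j: float(k[labels == j].sum()) for j in classes}  # per-class weight mass
    return max(scores, key=scores.get)                          # ties broken arbitrarily

# e.g. y_hat = kernel_classifier(X[0], X, labels, d_sup, h=0.5, classes=[1, 2])
```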


4.2  Example of the Wiener process

In the simple but interesting setting where X = C[0, 1], the set of all continuous real functions defined on [0, 1], equipped with the sup norm, let X = W be the standard Wiener process; hence µ = P^w is the Wiener measure. Define the set

    S = { x ∈ C[0, 1] : x(0) = 0, x is absolutely continuous and \int_0^1 x'(t)^2 dt < ∞ },

where ' stands for the derivative. Then (see Csáki, 1980),

    exp( −(1/2) \int_0^1 x'(t)^2 dt ) P^w(B_h^0) ≤ P^w(B_h^x) ≤ P^w(B_h^0),

as h → 0 and x ∈ S. In addition P^w(B_h^0) / exp(−π^2/(8h^2)) → 1 as h → 0, see Bogachev (1999) (cf. also Lifshits (1995), Section 18). Hence, for x ∈ S,

    exp( −(1/2) \int_0^1 x'(t)^2 dt ) ≤ lim inf_{h→0} P^w(B_h^x)/exp(−π^2/(8h^2)) ≤ lim sup_{h→0} P^w(B_h^x)/exp(−π^2/(8h^2)) ≤ 1.

From this we deduce that, for h = (c log n)^{−1/2}, we have c(x) n^{−cπ^2/8} ≤ P^w(B_h^x) ≤ n^{−cπ^2/8} for large n. It implies, with Corollary 1, that if h = (c log n)^{−1/2} with c < 8/π^2, then E(|m_n(x) − m(x)|^p) = O((log n)^{−τp/2}), p ≥ 2, and the same holds a.s. if Y is bounded.

Note that for all x ∈ S and all a > 0 we have lim_{h→0} µ(B_h^x)/h^a = 0; hence the standard Wiener measure on (C[0, 1], ||·||_∞) has an infinite fractal dimension and thus does not satisfy the condition of [12] and [13, 14] (cf. condition (4)).
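A quick Monte Carlo sanity check of the small-ball behaviour used above (an illustration only; the grid size, sample size and radii are arbitrary, and the time discretization slightly inflates the ball probabilities): the logarithm of the empirical probability of the sup-norm ball is compared with the exponent −π^2/(8h^2).

```python
import numpy as np

rng = np.random.default_rng(1)
T, n_paths = 400, 20_000                       # time grid and Monte Carlo sample size
dt = 1.0 / T
# Discretized standard Wiener paths on [0, 1] (coarse approximation of W).
W = np.cumsum(rng.normal(scale=np.sqrt(dt), size=(n_paths, T)), axis=1)
sup_norm = np.max(np.abs(W), axis=1)           # sup_t |W_t| for each simulated path
for h in (0.8, 0.6, 0.5):
    p_ball = float(np.mean(sup_norm <= h))     # empirical P^w(B_h^0)
    print(h, p_ball, np.log(p_ball), -np.pi ** 2 / (8 * h ** 2))
```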

5  Comments

1) If Y is unbounded but admits an exponential moment, E(e^{θ||Y||}) < ∞ for some θ > 0, then (11) holds whenever h_n → 0, lim_{n→∞} n µ(B_{rh_n}^x) / (log n)^3 = ∞ and m fulfills (8), and the log n in Corollary 2 is replaced by (log n)^3. We can instead assume only the existence of a polynomial moment, but in this case the log n in the condition and in the bound of Corollary 2 become a power of n.

2) We believe that the conditions h_n → 0 and lim_{n→∞} n µ(B_{rh_n}^x) = ∞ (resp. lim_{n→∞} n µ(B_{rh_n}^x) / log n = ∞) are necessary in Theorem 1 (resp. Theorem 2).

3) As we have mentioned, since (8) may hold for discontinuous regression functions m, all the results above can be valid for some discontinuous regression operators. In addition, if K(u) = 1_{{|u| ≤ r}}, and if h_n → 0 and lim_{n→∞} n µ(B_{rh_n}^x) = ∞, then (8) is necessary.

4) The proofs we use allow us not to assume a particular structure for the distribution of X, as in [12, 13, 14]; they allow us to include processes having infinite fractal dimension, for example the Wiener process.

5) (X, d) is a general semi-metric space; it may be, for example, any subset of the usual function spaces (e.g. a set of probability densities equipped with the usual metrics, or any subspace of functions which is not a linear space, obtained for example under some constraints, ...), or:
a. (P, d_pro), a set of probabilities defined on some measurable space, with d_pro the Prohorov metric.
b. (P_R, d_K), a set of probabilities defined on R endowed with its Borel σ-field, with d_K(P, Q) = sup_x |P(−∞, x] − Q(−∞, x]|, the Kolmogorov metric.
c. (D[0, 1], d_sko), the set of càdlàg functions on [0, 1], with d_sko the Skorohod metric.

6) It should be noted that the same results remain valid when Y takes values in a separable Banach space (of finite or infinite dimension).

7) By means of the approximation of dependent r.v.'s by independent ones, we can drop the independence hypothesis and replace it by β-mixing (see [26]).

6  Proofs

We apply, with slight modifications, proofs similar to those of Devroye (1981), except that we use our Lemma 1 below instead of his Lemma 2.1. We add some complements to the proof of strong consistency, and also obtain a rate of convergence; but, for the sake of completeness, we present the proofs in their entirety. Inequality (15) below gives a sharper bound than Devroye's (1981) inequality 2.7.

Let us first introduce some notations needed in the sequel. According to the definition of the kernel estimate (1) and (2), we take the convention that 0/0 = 0. Recall that B_h^x denotes the closed ball of radius h and center x, and define

    ξ = 1_{{d(X, x) ≤ rh_n}},   ξ_i = 1_{{d(X_i, x) ≤ rh_n}}   and   N_n = \sum_{t=1}^{n} ξ_t.

The following lemma is needed to prove the consistency in p-mean as well as the strong consistency.

Lemma 1 For all measurable f ≥ 0 and all x in X we have, for all n ≥ 1,

    E( \sum_{i=1}^{n} W_{ni}(x) f(X_i) | ξ_1, ..., ξ_n ) ≤ ((b/a) / µ(B_{rh_n}^x)) \int_{B_{rh_n}^x} f(w) dµ(w) 1_{{N_n ≠ 0}},      (14)

and

    E( \sum_{i=1}^{n} W_{ni}(x) f(X_i) ) ≤ ((b/a)[1 − (1 − µ(B_{rh_n}^x))^n] / µ(B_{rh_n}^x)) \int_{B_{rh_n}^x} f(w) dµ(w).         (15)

Proof of Lemma 1. Define first

    U_1 = \sum_{i=1}^{n} ξ_i f(X_i) / \sum_{t=1}^{n} ξ_t.

It is easy to see, from the kernel condition (6) and the positivity of f, that

    \sum_{i=1}^{n} W_{ni}(x) f(X_i) ≤ (b/a) U_1,

and remark that, since (X_1, ξ_1), ..., (X_n, ξ_n) are i.i.d.,

    ξ_i E(f(X_i) | ξ_i) = ξ_i E(f(X) | ξ = 1) = ξ_i (1/µ(B_{rh_n}^x)) E( f(X) 1_{{d(X, x) ≤ rh_n}} ),

and E(f(X_i) | ξ_1, ..., ξ_n) = E(f(X_i) | ξ_i). Then we have

    E(U_1 | ξ_1, ..., ξ_n, N_n ≠ 0) = \sum_{i=1}^{n} ξ_i E(f(X_i) | ξ_i) / \sum_{t=1}^{n} ξ_t
                                    = \sum_{i=1}^{n} E(f(X) | ξ = 1) ξ_i / N_n
                                    = E(f(X) | ξ = 1)
                                    = (1/µ(B_{rh_n}^x)) E( f(X) 1_{{d(X, x) ≤ rh_n}} ),

and E(U_1 | ξ_1, ..., ξ_n, N_n = 0) = 0, which can be summarized as

    E(U_1 | ξ_1, ..., ξ_n) = (1/µ(B_{rh_n}^x)) E( f(X) 1_{{d(X, x) ≤ rh_n}} ) 1_{{N_n ≠ 0}}.

Taking now the expectation of this quantity, we obtain

    E(U_1) = (1/µ(B_{rh_n}^x)) E( f(X) 1_{{d(X, x) ≤ rh_n}} ) P(N_n ≠ 0)
           = (1/µ(B_{rh_n}^x)) E( f(X) 1_{{d(X, x) ≤ rh_n}} ) [1 − (1 − µ(B_{rh_n}^x))^n].

This ends the proof of Lemma 1.

Proof of Theorem 1 and Corollary 1. We use the same arguments as those of Theorem 2.1 in [9]. Applying Jensen's inequality successively, we obtain, for all p ≥ 1,

    3^{1−p} E( | \sum_{i=1}^{n} W_{ni}(x) Y_i − m(x) |^p ) ≤ E( | \sum_{i=1}^{n} W_{ni}(x)(Y_i − m(X_i)) |^p )
                                                          + E( \sum_{i=1}^{n} W_{ni}(x) |m(X_i) − m(x)|^p )
                                                          + |m(x)|^p P(N_n = 0).                                  (16)

By the positivity of the kernel and the independence of the (X_i)'s we have

    P(N_n = 0) = [1 − µ(B_{rh_n}^x)]^n ≤ exp(−n µ(B_{rh_n}^x)) → 0,

because r is finite and n µ(B_{rh_n}^x) → ∞ for all x, so the last term of (16) tends to 0. For Corollary 1, remark that exp(−n µ(B_{rh_n}^x)) = o( (n µ(B_{rh_n}^x))^{−p/2} ).

By Lemma 1, the second term on the right-hand side of (16) is not greater than

    ((b/a) / µ(B_{rh_n}^x)) \int_{B_{rh_n}^x} |m(w) − m(x)|^p dµ(w),

which goes to 0 for x such that m satisfies (8). For the rate of convergence, this quantity is bounded by b c_x (r h_n)^{pτ}/a = O(h_n^{pτ}) if m satisfies the p-mean Lipschitz condition (10) in a neighborhood of x.

We are now going to show that the first term on the right-hand side of (16) tends to 0 for x such that (9) holds. We distinguish the cases p ≥ 2 and 1 ≤ p < 2. Suppose first that p ≥ 2. Since (X_1, Y_1), ..., (X_n, Y_n) are i.i.d. and W_{ni}(x) depends only upon the (X_j)_j, the inequalities of Marcinkiewicz and Zygmund (1937) (see also Petrov (1975) or Chow and Teicher (1997)), applied conditionally on (X_1, ..., X_n), together with Jensen's inequality, yield that there exists a constant C = C(p) > 0 depending only upon p such that

    E( | \sum_{i=1}^{n} W_{ni}(x)(Y_i − m(X_i)) |^p ) ≤ C E( [ \sum_{i=1}^{n} W_{ni}(x)^2 (Y_i − m(X_i))^2 ]^{p/2} )
        ≤ C E( [ sup_j W_{nj}(x) ]^{p/2} [ \sum_{i=1}^{n} W_{ni}(x)(Y_i − m(X_i))^2 ]^{p/2} )
        ≤ C E( [ sup_j W_{nj}(x) ]^{p/2} \sum_{i=1}^{n} W_{ni}(x) |Y_i − m(X_i)|^p )
        = C n E( [ sup_j W_{nj}(x) ]^{p/2} W_{nn}(x) |Y_n − m(X_n)|^p ).                                           (17)

As above, using once more that the variables (X_i, Y_i) are i.i.d. and that W_{ni}(x) depends only upon the (X_j)_j, we have

    E( [ sup_j W_{nj}(x) ]^{p/2} W_{nn}(x) |Y_n − m(X_n)|^p )
        = E( E( [ sup_j W_{nj}(x) ]^{p/2} W_{nn}(x) |Y_n − m(X_n)|^p | X_1, ..., X_n ) )
        = E( [ sup_j W_{nj}(x) ]^{p/2} W_{nn}(x) k(X_n) ),                                                         (18)

where k(w) = E[ |Y − m(X)|^p | X = w ]. Now define U = K(d(X_n, x)/h_n), u = E(U), V = \sum_{j=1}^{n−1} K(d(X_j, x)/h_n) and Z_{n−1} = min(1, b/V). Note that a µ(B_{rh_n}^x) ≤ u ≤ b µ(B_{rh_n}^x) and W_{nj}(x) = K(d(X_j, x)/h_n)/(U + V) ≤ Z_{n−1} for all j. Then we have

    [ sup_j W_{nj}(x) ]^{p/2} ≤ (Z_{n−1})^{p/2}.

This last bound and (18) give an estimate of (17):

    E( | \sum_{i=1}^{n} W_{ni}(x)(Y_i − m(X_i)) |^p ) ≤ C n E( Z_{n−1}^{p/2} W_{nn}(x) k(X_n) ).

It is clear that, for all c > 0, Z_{n−1}^{p/2} = Z_{n−1}^{p/2} 1_{{V