Statistical Inference for Stochastic Processes 6: 291–307, 2003. © 2003 Kluwer Academic Publishers. Printed in the Netherlands.


Nonparametric Estimation of Regression Functions in Point Process Models∗

SEBASTIAN DÖHLER and LUDGER RÜSCHENDORF

Institute for Mathematical Stochastics, University of Freiburg, Eckerstr. 1, 79104 Freiburg, Germany
(∗Author for correspondence: Tel.: +49-761-203-5665; Fax: +49-761-203-5661; e-mail: [email protected])
(Received in final form 25 May 2002; Accepted 5 November 2002)

Abstract. We prove that the empirical $L^2$-risk minimizing estimator over some general type of sieve classes is universally, strongly consistent for the regression function in a class of point process models of Poissonian type (random sampling processes). The universal consistency result needs weak assumptions on the underlying distributions and regression functions. It applies in particular to neural net classes and to radial basis function nets. For the estimation of the intensity function of a Poisson process a similar technique yields consistency of the sieved maximum likelihood estimator for some general sieve classes.

Key words: nonparametric estimation, regression functions, point process models, neural nets, sieve classes.

1. Introduction

In this paper we consider estimation of the regression function and the intensity function for Poissonian type point process models. The main result of this paper is a uniform strong consistency result for empirical $L^2$-risk minimizing estimators for the regression function over some general type of sieve classes. The method of empirical risk minimization was proposed in the work of Vapnik and Chervonenkis (1974) and then developed into a powerful method by various authors. For a general review see the books of Devroye et al. (1996) and Györfi et al. (2002), while for the estimation of the regression function for fixed sample size see Devroye et al. (1994), Lugosi and Zeger (1995), and Krzyzak et al. (1996). The main steps of this approach are an error decomposition into a bias term and a stochastic error term. For handling the bias term some approximation result is needed, while the treatment of the stochastic error term requires establishing a uniform law of large numbers for the empirical risk measure. The proof of this uniform law is based on empirical process theory. For a concise presentation of this method see Lugosi and Zeger (1995) and Krzyzak et al. (1996). In this paper we give an extension of this approach from the fixed sample case to the framework of random sampling processes of Poissonian type.


For the estimation of the regression function we develop in detail the extension of the method of empirical $L^2$-risk minimizing estimation. For the estimation of the intensity function it is more natural to switch to sieved maximum likelihood estimation. Since the technical details are similar, we give only a sketch of the corresponding proof of the consistency result. The estimation of the intensity function of Poisson processes is a well motivated and studied subject in the statistical literature. The results in this paper supplement the classes of estimation procedures which were investigated for Poisson processes by many authors (for a general review of statistical problems for point processes see the books of Karr (1991) and Kutoyants (1998)) by nonparametric estimators of general sieve type, which turned out to be very useful and promising for various kinds of applications in the fixed sample case.

Let $\Phi = \sum_{i=1}^{N} \varepsilon_{(X_i, Y_i)}$ be a point process with $(X_i, Y_i) \sim (X, Y)$ an iid sequence with values in $\mathbb{R}^k \times \mathbb{R}$ and a random number $N$ of points independent of the sequence $(X_i, Y_i)$. Our aim is to estimate the regression function $m(x) = E(Y \mid X = x)$ based on an iid sample $\Phi_1, \dots, \Phi_n$, where $\Phi_i = \sum_{j=1}^{N_i} \varepsilon_{(X_{ij}, Y_{ij})} \sim \Phi$. A finite sample estimation theory for this type of point process models has been developed in Rüschendorf (1989) (where they are called random sampling processes). A main example are Poisson processes with a finite intensity in the observation window. Kernel estimators for the estimation of the intensity in the Poisson case were studied in Rudemo (1982) and Kutoyants (1998). The arithmetic mean estimator for the intensity was investigated in Liese (1990) and Liese and Ziegler (1999). Kim and Koo (2000) studied maximum likelihood type sieve estimators of the intensity for wavelet sieves, while Reynaud-Bouret (2001) established and analyzed adaptive properties of penalized projection estimators in various smoothness classes for the intensity function.

Our main aim in this paper is to prove a universal consistency result for estimators of the regression function minimizing the empirical $L^2$-risk over some sieve class $(\mathcal{F}_n)$, that is, we consider

$$\hat{m}_n = \hat{m}_n(\mathcal{F}_n) = \arg\min_{f \in \mathcal{F}_n} \sum_{i=1}^{n} \sum_{j=1}^{N_i} (Y_{ij} - f(X_{ij}))^2. \tag{1.1}$$
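To fix ideas, here is a minimal numerical sketch of the minimization in (1.1) for a linear sieve, in which case the empirical $L^2$-risk minimization reduces to least squares over the pooled points of all observed processes. The basis, the toy data-generating model, and the clipping used to mimic a coefficient bound (cp. the classes $\mathcal{F}(\beta, K)$ in (3.1)) are our own illustrative assumptions, not the paper's procedure.

import numpy as np

def erm_estimator(samples, basis, beta):
    """Sieved empirical L2-risk minimizer (1.1) for a linear sieve:
    fit f = sum_i c_i * basis_i by least squares over the pooled points.
    samples: list of arrays of shape (N_i, 2), one per observed process,
             with rows (x_ij, y_ij); basis: list of functions R -> R.
    The clip onto |c_i| <= beta is a crude surrogate for exact minimization
    under the coefficient bound of the sieve."""
    pts = np.vstack(samples)              # pool the N_1 + ... + N_n points
    x, y = pts[:, 0], pts[:, 1]
    design = np.column_stack([phi(x) for phi in basis])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    coef = np.clip(coef, -beta, beta)
    return lambda t: np.column_stack([phi(t) for phi in basis]) @ coef

# toy data: Poisson number of points per process, m(x) = sin(2*pi*x)
rng = np.random.default_rng(0)
samples = []
for _ in range(200):
    N = rng.poisson(5)
    x = rng.uniform(0, 1, N)
    samples.append(np.column_stack([x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)]))
basis = [np.ones_like, lambda x: x,
         lambda x: np.sin(2 * np.pi * x), lambda x: np.cos(2 * np.pi * x)]
m_hat = erm_estimator(samples, basis, beta=10.0)

For a genuinely nonlinear sieve, such as the neural net classes introduced below, the minimization has no closed form and a numerical optimizer can only approximate the infimum, in line with the remark on approximate minimizers that follows.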

Here and throughout the paper we assume for the sake of simplicity the existence of minimizing functions in $\mathcal{F}_n$. If the minima do not exist, the same analysis can be carried out with functions approximating the infimum; for a similar remark cf. also Lugosi and Zeger (1995). Concerning the sieve class $(\mathcal{F}_n)$ in (1.1) we assume that $\mathcal{F}_n$ approximates a dense subclass $\mathcal{F}$ in $L^2$ which is generated by a VC-class as a vector space.

CONDITION (A)
1. $\mathcal{F} \subset L^2(P)$ is dense for all probability measures $P$ on $\mathbb{R}^k$.
2. $\mathcal{F} = \{\sum_i a_i f_i;\ a_i \in \mathbb{R},\ f_i \in \mathcal{F}_0\}$ is a vector space generated by a VC-class $\mathcal{F}_0$ of functions $f: \mathbb{R}^k \to [0, 1]$.


The sieve $\mathcal{F}_n \subset \mathcal{F}$ is constructed as a sequence of subclasses of $\mathcal{F}$ such that $\mathcal{F}_n \uparrow \mathcal{F}$. This type of condition was used in Döhler (1999) in the context of hazard regression estimation in censoring models. Conditions 1 and 2 hold true in particular for some general classes of neural nets and radial basis functions. Let $\sigma: \mathbb{R} \to [0, 1]$ be monotonically nondecreasing with $\lim_{x \to \infty} \sigma(x) = 1$ and $\lim_{x \to -\infty} \sigma(x) = 0$, and define the neural net class

$$\mathcal{F} = \mathcal{F}(\sigma) := \Big\{ f: \mathbb{R}^k \to \mathbb{R};\ x \mapsto \sum_{i=1}^{K} c_i \sigma(a_i x + b_i) + c_0;\ K \in \mathbb{N},\ a_i \in \mathbb{R}^k,\ b_i \in \mathbb{R},\ c_i \in \mathbb{R} \Big\}. \tag{1.2}$$

Furthermore, for $h: \mathbb{R}_+ \to [0, 1]$ with $\int h(\|x\|) \, d\lambda^k(x) < \infty$ let

$$\mathcal{F} = \mathcal{R}(h) := \Big\{ f: \mathbb{R}^k \to \mathbb{R};\ x \mapsto \sum_{i=1}^{K} c_i h(\|A_i x + b_i\|) + c_0;\ K \in \mathbb{N},\ A_i \in \mathbb{R}^{k \times k},\ b_i \in \mathbb{R}^k,\ c_i \in \mathbb{R} \Big\} \tag{1.3}$$

denote the radial basis function net generated by $h$. Then $\mathcal{F}(\sigma)$ and $\mathcal{R}(h)$ satisfy conditions 1 and 2 in Condition (A). The denseness assertion follows from Hornik (1991), respectively Krzyzak et al. (1996), while the finiteness of the VC-dimension follows from the fact that in both cases each member of $\mathcal{F}_0$ is the composition of a fixed increasing or decreasing function with a finite dimensional vector space of functions.

An estimator sequence $\hat{\mu}_n = \hat{\mu}_n(\Phi_1, \dots, \Phi_n) = \hat{\mu}_n(x)$ is called universally strongly consistent for $m(x) = E(Y \mid X = x)$ in $L^2$-sense if $E(\hat{\mu}_n(X) - m(X))^2 = \int (\hat{\mu}_n(x) - m(x))^2 \, d\mu(x) \to 0$ a.s. for all $(X, Y)$ such that $EY^2 < \infty$, where $\mu = P^X$. Our aim is to prove universal strong consistency for the empirical risk minimizing estimator $\hat{m}_n$.

2. Universal Consistency of Sieved Empirical $L^2$-Risk Minimizing Estimators

In this section we state the main universal consistency result for the empirical $L^2$-risk minimizing estimator sequence $(\hat{m}_n)$ and establish an extension of the truncation technique of Lugosi and Zeger (1995) to the case of random sampling processes. The estimator is based on a sieve $\mathcal{F}_n = \mathcal{F}(\beta_n, K_n) \subset \mathcal{F}$ with $\beta_n \uparrow \infty$, $K_n \uparrow \infty$, with elements $f = \sum_{i=1}^{K} c_i f_i \in \mathcal{F}_n$ such that $|c_i| \le \beta_n$ and $K \le K_n$ (cp. (3.1)).

THEOREM 2.1 (Universal strong consistency of empirical risk estimators). Let $\Phi = \sum_{i=1}^{N} \varepsilon_{(X_i, Y_i)}$ be a (random sampling) point process in $\mathbb{R}^k \times \mathbb{R}$ such that $EY^2 < \infty$ and $EN^r < \infty$ for some $r > 6$, and let $(\Phi_i)$ be an iid sample of $\Phi$. Then under Assumption A there exists a sieve $(\mathcal{F}_n) = (\mathcal{F}(\beta_n, K_n)) \subset \mathcal{F}$ such that $\mathcal{F}_n \uparrow \mathcal{F}$ and the empirical $L^2$-risk minimizing estimator $\hat{m}_n = \hat{m}_n(\mathcal{F}_n)$ is universally strongly consistent (in $L^2$-sense) for the regression function $m = E(Y \mid X = \cdot)$, that is, $\int (\hat{m}_n(x) - m(x))^2 \, dP^X(x) \to 0$ a.s. for all random variables $(X, Y)$ with $EY^2 < \infty$.

For the proof of Theorem 2.1 we need the following extension of Theorem 1 of Lugosi and Zeger (1995), which allows us to restrict to bounded $Y$.

THEOREM 2.2 (Truncation method). Let $\mathcal{F}$ be dense in $L^2(P)$ for all probability measures $P$ on $\mathbb{R}^k$. Assume there exists a sequence $\mathcal{F}_n \uparrow \mathcal{F}$ such that for all bounded r.v.s $Y$ we have

$$\sup_{f \in \mathcal{F}_n} \Big| \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{N_i} (Y_{ij} - f(X_{ij}))^2 - E\Big( \sum_{j=1}^{N} (Y_j - f(X_j))^2 \Big) \Big| \to 0 \quad \text{a.s.} \tag{2.1}$$

Then $\hat{m}_n = \hat{m}_n(\mathcal{F}_n)$ is strongly consistent for $m = E(Y \mid X = \cdot)$ for all $Y \in L^2$.

Proof. For the proof we use the truncation argument of Lugosi and Zeger (1995). For $L > 0$ denote the truncated random variables $Y^L := T_L \circ Y$, $Y^L_{ij} := T_L \circ Y_{ij}$, where

$$T_L(x) = \begin{cases} L & \text{if } x \ge L, \\ x & \text{if } -L < x < L, \\ -L & \text{if } x \le -L. \end{cases}$$

Denote the empirical risk estimator for the truncated process by

$$\hat{m}^L_n := \arg\min_{f \in \mathcal{F}_n} \sum_{i=1}^{n} \sum_{j=1}^{N_i} (Y^L_{ij} - f(X_{ij}))^2.$$
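In code, the truncation device amounts to clipping the responses at level $L$ and rerunning the same sieve minimizer. A short sketch, reusing the illustrative erm_estimator from the sketch in Section 1 (again our own naming, not the paper's):

import numpy as np

def truncated_erm(samples, basis, beta, L):
    # T_L clips each response into [-L, L]; the minimization itself is unchanged.
    clipped = [np.column_stack([s[:, 0], np.clip(s[:, 1], -L, L)])
               for s in samples]
    return erm_estimator(clipped, basis, beta)   # helper from the Section 1 sketch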

The related optimal approximations are

$$f^*_n := \arg\min_{f \in \mathcal{F}_n} E\Big( \sum_{j=1}^{N} (Y_j - f(X_j))^2 \Big) \quad \text{and} \quad f^{L,*}_n := \arg\min_{f \in \mathcal{F}_n} E\Big( \sum_{j=1}^{N} (Y^L_j - f(X_j))^2 \Big).$$

Furthermore, define

$$\delta_n := d_{L^2}(Y, \hat{m}_n(X)) - \inf_{f \in \mathcal{F}_n} d_{L^2}(Y, f(X)), \qquad \delta^L_n := d_{L^2}(Y^L, \hat{m}_n(X)) - \inf_{f \in \mathcal{F}_n} d_{L^2}(Y^L, f(X))$$


with $L^2$-distance $d_{L^2} = d_{L^2(P)}$, and the empirical risk measure

$$\Delta^L_n := \sup_{f \in \mathcal{F}_n} \Bigg| \Big( \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{N_i} (Y^L_{ij} - f(X_{ij}))^2 \Big)^{1/2} - \Big( E \sum_{j=1}^{N} (Y^L_j - f(X_j))^2 \Big)^{1/2} \Bigg|.$$

In the first step we establish

$$\delta_n \le \delta^L_n + 2 d_{L^2}(Y^L, Y). \tag{2.2}$$

For the proof of (2.2) note that

$$\inf_{f \in \mathcal{F}_n} d_{L^2}(Y^L, f(X)) \le d_{L^2}(Y^L, f^*_n(X)) \le d_{L^2}(Y, f^*_n(X)) + d_{L^2}(Y^L, Y) = \inf_{f \in \mathcal{F}_n} d_{L^2}(Y, f(X)) + d_{L^2}(Y^L, Y),$$

using independence of $N$ and $(X, Y)$. From the triangle inequality we obtain

$$\delta_n \le d_{L^2}(Y^L, \hat{m}_n(X)) + d_{L^2}(Y^L, Y) + d_{L^2}(Y^L, Y) - \inf_{f \in \mathcal{F}_n} d_{L^2}(Y^L, f(X)) = \delta^L_n + 2 d_{L^2}(Y^L, Y).$$

In the next step we prove

$$\sqrt{EN}\, \delta^L_n \le 2 \Delta^L_n + 2 \Big( \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{N_i} (Y^L_{ij} - Y_{ij})^2 \Big)^{1/2}. \tag{2.3}$$

For the proof of (2.3) note that by the triangle (Minkowski) inequality, for $i = 1, \dots, n$, $j = 1, \dots, N_i$ and $a_{ij}, b_{ij}, c_{ij} \in \mathbb{R}$,

$$\Big( \sum_{i=1}^{n} \sum_{j=1}^{N_i} (a_{ij} - b_{ij})^2 \Big)^{1/2} \le \Big( \sum_{i=1}^{n} \sum_{j=1}^{N_i} (a_{ij} - c_{ij})^2 \Big)^{1/2} + \Big( \sum_{i=1}^{n} \sum_{j=1}^{N_i} (c_{ij} - b_{ij})^2 \Big)^{1/2}. \tag{2.4}$$


Using (2.4), the minimality of $\hat{m}_n$, and the definition of $f^{L,*}_n$ we obtain

$$\begin{aligned}
\sqrt{EN}\, \delta^L_n
&= \sqrt{EN}\, d_{L^2}(Y^L, \hat{m}_n(X)) - \Big( \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{N_i} (Y^L_{ij} - \hat{m}_n(X_{ij}))^2 \Big)^{1/2} \\
&\qquad + \Big( \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{N_i} (Y^L_{ij} - \hat{m}_n(X_{ij}))^2 \Big)^{1/2} - \sqrt{EN} \inf_{f \in \mathcal{F}_n} d_{L^2}(Y^L, f(X)) \\
&\le \Delta^L_n + \Big( \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{N_i} (Y^L_{ij} - Y_{ij})^2 \Big)^{1/2} + \Big( \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{N_i} (Y_{ij} - \hat{m}_n(X_{ij}))^2 \Big)^{1/2} - \sqrt{EN} \inf_{f \in \mathcal{F}_n} d_{L^2}(Y^L, f(X)) \\
&\le \Delta^L_n + \Big( \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{N_i} (Y^L_{ij} - Y_{ij})^2 \Big)^{1/2} + \Big( \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{N_i} (Y_{ij} - f^{L,*}_n(X_{ij}))^2 \Big)^{1/2} - \sqrt{EN} \inf_{f \in \mathcal{F}_n} d_{L^2}(Y^L, f(X)) \\
&\le \Delta^L_n + 2 \Big( \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{N_i} (Y^L_{ij} - Y_{ij})^2 \Big)^{1/2} + \Big( \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{N_i} (Y^L_{ij} - f^{L,*}_n(X_{ij}))^2 \Big)^{1/2} - \sqrt{EN} \inf_{f \in \mathcal{F}_n} d_{L^2}(Y^L, f(X)) \\
&\le 2 \Delta^L_n + 2 \Big( \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{N_i} (Y^L_{ij} - Y_{ij})^2 \Big)^{1/2}.
\end{aligned}$$


Relations (2.2) and (2.3) imply

$$\delta_n \le \frac{2}{\sqrt{EN}} \Delta^L_n + \frac{2}{\sqrt{EN}} \Big( \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{N_i} (Y^L_{ij} - Y_{ij})^2 \Big)^{1/2} + 2 d_{L^2}(Y^L, Y). \tag{2.5}$$

$Y^L$ is bounded. Therefore, using assumption (2.1) (together with the elementary bound $|\sqrt{a} - \sqrt{b}| \le \sqrt{|a - b|}$) we obtain

$$\limsup_{n \to \infty} \Delta^L_n = 0 \quad \text{a.s.}$$

Relation (2.5) implies, using the strong law of large numbers, $\limsup_{n \to \infty} \delta_n \le 4 d_{L^2}(Y^L, Y)$ a.s. Since $d_{L^2}(Y^L, Y) \to 0$ as $L \to \infty$ (by dominated convergence, as $EY^2 < \infty$), we obtain $\lim_{n \to \infty} \delta_n = 0$ a.s. The denseness assumption on $\mathcal{F}$ implies $\lim_{n \to \infty} \inf_{f \in \mathcal{F}_n} d_{L^2}(Y, f(X)) = d_{L^2}(Y, E(Y \mid X))$. Therefore, from $\lim_{n \to \infty} \delta_n = 0$ a.s. we obtain $\lim_{n \to \infty} d_{L^2}(Y, \hat{m}_n(X)) = d_{L^2}(Y, E(Y \mid X))$ a.s., which yields the strong consistency property stated in Theorem 2.2. For details of this last step see also the proof of Theorem 2.1 in Section 3. □

3. A Uniform Law of Large Numbers for the Empirical Risk Measure

In order to establish a universal consistency result for the empirical risk estimator it is, by Theorem 2.2, sufficient to consider the case where $Y$ is bounded. So let us assume in this section:

ASSUMPTION (B). $Y$ is bounded, that is, $|Y| \le L < \infty$ for some constant $L$.

For $\beta > 0$, $K \in \mathbb{N}$ we introduce

$$\mathcal{F}(\beta, K) := \Big\{ f: \mathbb{R}^k \to [-K\beta, K\beta];\ f = \sum_{i=1}^{K} c_i f_i \text{ with } f_i \in \mathcal{F}_0,\ |c_i| \le \beta \Big\} \tag{3.1}$$

and the induced class

$$\mathcal{G}(\beta, K, L) := \{ g_f: \mathbb{R}^k \times [-L, L] \to [0, (L + K\beta)^2],\ (x, y) \mapsto (y - f(x))^2;\ f \in \mathcal{F}(\beta, K) \}. \tag{3.2}$$

The maximal deviation functional as in (2.1) for the class $\mathcal{F}(\beta, K)$ can then be written in the form

$$\Delta_n(\beta, K, L)(\omega) := \sup_{f \in \mathcal{F}(\beta, K)} \Big| \sum_{i=1}^{n} \sum_{j=1}^{N_i} (Y_{ij} - f(X_{ij}))^2 - E\Big( \sum_{i=1}^{n} \sum_{j=1}^{N_i} (Y_{ij} - f(X_{ij}))^2 \Big) \Big| = \sup_{g \in \mathcal{G}(\beta, K, L)} \Big| \sum_{i=1}^{n} \langle g, \Phi_i(\omega) \rangle - E \sum_{i=1}^{n} \langle g, \Phi_i \rangle \Big| \tag{3.3}$$

with $\langle g, \mu \rangle := \int g \, d\mu$.
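For illustration, an element of the neural net sieve $\mathcal{F}(\beta, K)$ built from units of the class (1.2) might look as follows; this is our own sketch, and the choice of $\sigma$ and all names are assumptions, not the paper's notation.

import numpy as np

def make_sieve_element(A, b, c, beta):
    """One f = sum_i c_i * sigma(a_i . x + b_i) in F(beta, K): rows of A are
    the directions a_i (shape (K, k)), b and c have shape (K,).  Clipping
    enforces the sieve bound |c_i| <= beta; since sigma takes values in
    [0, 1], the resulting f automatically maps into [-K*beta, K*beta]."""
    c = np.clip(c, -beta, beta)
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))   # one admissible sigma
    return lambda x: sigmoid(x @ A.T + b) @ c      # x: array of shape (m, k)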


To estimate $\Delta_n(\beta, K, L)$ we shall make use of a variant of Pollard's inequality due to Ziegler (1994). For the relevant notions from empirical process theory we refer to van der Vaart and Wellner (1996). As usual, $\dim_{VC}(\mathcal{F})$ denotes the Vapnik–Chervonenkis dimension of $\mathcal{F}$, $N(\varepsilon, \mathcal{F}, d_{L^2(\mu)})$ denotes the $\varepsilon$-covering number of $\mathcal{F}$ with respect to $L^2(\mu)$, and $N_2(\varepsilon, Z)$, $D_2(\varepsilon, Z)$ denote the $\varepsilon$-covering, respectively $\varepsilon$-packing, numbers of $Z \subset \mathbb{R}^n$ with respect to the Euclidean metric.

LEMMA 3.1. Let $\mathcal{F}$ be a VC-class with majorant $F$ and let $d = \dim_{VC}(\mathcal{F})$. Then there exists some constant $C(d) \ge 0$ such that

$$N(\varepsilon \|F\|_{L^2(\mu)}, \mathcal{F}, d_{L^2(\mu)}) \le C(d)\, \varepsilon^{-4(d-1)} \tag{3.4}$$

for $0 < \varepsilon \le 1$ and all measures $\mu$ with $\|F\|_{L^2(\mu)} < \infty$.

We also make use of the following stability properties of the covering numbers, which are variants of those in Pollard (1990, Section 5); see also Döhler (1999, 2000).

LEMMA 3.2 (Stability properties of covering numbers). Let $\mathcal{F}$ and $\mathcal{G}$ be two function classes on $\mathbb{R}^k$ and let $\mu$ be a measure. Then:
(a) $N(\varepsilon + \delta, \mathcal{F} \oplus \mathcal{G}, d_{L^2(\mu)}) \le N(\varepsilon, \mathcal{F}, d_{L^2(\mu)})\, N(\delta, \mathcal{G}, d_{L^2(\mu)})$, where $\mathcal{F} \oplus \mathcal{G}$ denotes the sum class.
(b) If $|f| \le K$ for all $f \in \mathcal{F}$, then $N(\varepsilon, \mathcal{F}^2, d_{L^2(\mu)}) \le N\big( \frac{\varepsilon}{2K}, \mathcal{F}, d_{L^2(\mu)} \big)$, where $\mathcal{F}^2 = \{f^2;\ f \in \mathcal{F}\}$.
(c) If $\mu$ is a finite measure and $|f| \le K$ for all $f \in \mathcal{F}$, then for any $a > 0$, $N(\varepsilon, [-a, a] \cdot \mathcal{F}, d_{L^2(\mu)}) \le \frac{4aK \|1\|_{L^2(\mu)}}{\varepsilon} N\big( \frac{\varepsilon}{2a}, \mathcal{F}, d_{L^2(\mu)} \big)$.

The empirical risk measure $\Delta_n(\beta, K, L)$ involves the following random set $Z_n$ of points in $\mathbb{R}^n$:

$$Z_n(\beta, K, L)(\omega) := \{ (\langle g, \Phi_1(\omega) \rangle, \dots, \langle g, \Phi_n(\omega) \rangle) \mid g \in \mathcal{G}(\beta, K, L) \} \subset \mathbb{R}^n. \tag{3.5}$$

Let us denote the corresponding random entropy integral by

$$J_n(\omega) := 9 \int_0^{\delta_n(\omega)} \sqrt{ \log D_2(\varepsilon, Z_n(\beta, K, L)(\omega)) } \, d\varepsilon, \tag{3.6}$$

where $\delta_n(\omega) := \sup\{ \|y\|_2 \mid y \in Z_n(\beta, K, L)(\omega) \}$ and $D_2(\varepsilon, Z_n)$ denotes the $\varepsilon$-packing number of $Z_n$ in the Euclidean metric. In the following we use the notation $J_n = J_n(\beta, K) = J_n(\beta, K, L)$, depending on the parameters of interest.
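Packing numbers of a finite point set can be bounded constructively by a greedy selection; the following sketch (ours, not from the paper) may help fix the definition. A maximal $\varepsilon$-separated subset is automatically an $\varepsilon$-cover, and a ball of radius $\varepsilon/2$ contains at most one point of an $\varepsilon$-separated set, which gives the relation $D_2(\varepsilon, Z) \le N_2(\varepsilon/2, Z)$ used below.

import numpy as np

def greedy_packing(Z, eps):
    """Return a maximal eps-separated subset of the rows of Z (a finite subset
    of R^n in the Euclidean metric); its size is a lower bound for D_2(eps, Z)."""
    chosen = []
    for z in Z:
        if all(np.linalg.norm(z - c) > eps for c in chosen):
            chosen.append(z)
    return chosen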


THEOREM 3.3 (Uniform law for the empirical risk measure, bounded case). Let $1 \le p, q < \infty$ with $q + 1 < p$, let $L > 0$ be fixed, and assume that for $K \in \mathbb{N}$, $\beta > 0$ and any bounded random variable $|Y| \le L$,

$$E J^p_n(\beta, K) = O(n^q). \tag{3.7}$$

Then there exist sequences $K_n \uparrow \infty$, $\beta_n \uparrow \infty$ such that for any bounded r.v. $|Y| \le L$,

$$\lim_{n \to \infty} \frac{1}{n} \Delta_n(\beta_n, K_n, L) = 0 \quad \text{a.s.}$$

Proof. In the first step of the proof we establish: there exist $\delta > 0$ and $\varepsilon_n \downarrow 0$ such that for any $|Y| \le L$ and $K \in \mathbb{N}$, $\beta > 0$,

$$P\Big( \frac{1}{n} \Delta_n(\beta, K, L) > \varepsilon_n \Big) \le \frac{1}{n^{1+\delta}} \tag{3.8}$$

for all $n \ge N_0 = N_0(\beta, K, L) \in \mathbb{N}$. Define $\delta' := \frac{1}{2}(p - (q + 1))$ and $\varepsilon_n := (1/n)^{(1/p)(p - (q + 1 + \delta'))}$, and note that by assumption $\delta' > 0$ and $q + 1 + \delta' < p$. We apply Pollard's maximal inequality for empirical processes indexed by function classes (cp. Pollard, 1990, (7.5)) to obtain the estimate

$$P\Big( \frac{1}{n} \Delta_n(\beta, K, L) > \varepsilon_n \Big) \le \frac{(2 C_p)^p \, E J^p_n(\beta, K, L)}{(n \varepsilon_n)^p}$$

with some constant $C_p > 0$. Therefore, by the growth assumption on the random entropy integral there exist $N_0 = N_0(\beta, K, L) \in \mathbb{N}$ and $C = C(\beta, K, L) > 0$ such that for $n \ge N_0$, and with $\delta = \delta'/2$,

$$P\Big( \frac{1}{n} \Delta_n(\beta, K, L) > \varepsilon_n \Big) \le \frac{(2 C_p)^p C(\beta, K, L)\, n^q}{n^{q + 1 + \delta'}} = \frac{(2 C_p)^p C(\beta, K, L)}{n^{1 + \delta'}} \le \frac{1}{n^{1+\delta}}, \tag{3.9}$$

since $(n \varepsilon_n)^p = n^{q + 1 + \delta'}$. Let $N_{ij} = N_0(i, i, j)$, assume w.l.g. $N_{i1} < N_{i2} < \cdots$, and define $N_i := N_{ii}$. Assume w.l.g. $N_i \uparrow \infty$ and define for $n \in \{N_p, \dots, N_{p+1} - 1\}$, $K_n := \beta_n := p$. Then we obtain for $n \ge N_L$

$$P\Big( \frac{1}{n} \Delta_n(\beta_n, K_n, L) > \varepsilon_n \Big) \le \frac{1}{n^{1+\delta}}, \tag{3.10}$$

since $n \ge N_L$ implies $n \in \{N_{p_0}, \dots, N_{p_0 + 1} - 1\}$ for some $p_0 \ge L$, that is, $K_n = \beta_n = p_0$, and $n \ge N_{p_0} \ge N_{p_0 L} = N_0(p_0, p_0, L)$, so that (3.10) follows from (3.9). Define $A_n = \{ \frac{1}{n} \Delta_n(\beta_n, K_n, L) > \varepsilon_n \}$; then $P(A_n) \le 1/n^{1+\delta}$ for $n$ sufficiently large. This implies the statement of the theorem by an application of the Borel–Cantelli lemma. □
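The blocking construction in this proof is easy to state in code. A sketch (ours), assuming the thresholds $N_1 < N_2 < \cdots$ have already been computed:

import bisect

def sieve_parameters(n, thresholds):
    """Given strictly increasing thresholds [N_1, N_2, ...], return the pair
    (K_n, beta_n) = (p, p) for the block N_p <= n < N_{p+1}."""
    p = bisect.bisect_right(thresholds, n)   # number of thresholds <= n
    return p, p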


For the application of Theorem 3.3 we need to establish an upper bound for the random entropy integral in order to verify condition (3.7).

PROPOSITION 3.4. Let $1 \le p < \infty$ and $E N^{(3/2)p} < \infty$. Then

$$E J^p_n(\beta, K) = O(n^{(3/4)p}) \tag{3.11}$$

for any bounded r.v. $Y$, $|Y| \le L$, and any $K \in \mathbb{N}$ and $\beta > 0$.

Proof. Define $F_n(\beta, K, L)(\omega) := \big( \langle (L + K\beta)^2, \Phi_1(\omega) \rangle, \dots, \langle (L + K\beta)^2, \Phi_n(\omega) \rangle \big) \in \mathbb{R}^n$. Then

$$J_n(\beta, K, L)(\omega) \le C\, |F_n(\beta, K, L)(\omega)| \int_0^1 \sqrt{ \log N_2(\varepsilon |F_n(\beta, K, L)(\omega)|, Z_n(\beta, K, L)(\omega)) } \, d\varepsilon. \tag{3.12}$$

For the proof of (3.12) note that for all $z \in Z_n(\beta, K, L)(\omega)$ and all $1 \le i \le n$, $|z_i| \le (F_n(\beta, K, L)(\omega))_i$, that is, $F_n$ majorizes $Z_n$. As $F_n(\beta, K, L)(\omega) = (L + K\beta)^2 (N_1(\omega), \dots, N_n(\omega))$, we obtain $|F_n(\beta, K, L)(\omega)| = (L + K\beta)^2 \big( \sum_{i=1}^{n} N_i^2(\omega) \big)^{1/2}$ and $\|z\|_2 \le |F_n(\beta, K, L)(\omega)|$. By a substitution as in Pollard (1990, (7.7)) we obtain

$$J_n(\beta, K, L)(\omega) \le 9\, |F_n(\beta, K, L)(\omega)| \int_0^1 \sqrt{ \log D_2(\varepsilon |F_n(\beta, K, L)(\omega)|, Z_n(\beta, K, L)(\omega)) } \, d\varepsilon.$$

Using that the packing number $D_2$ is related to the covering number $N_2$ by $D_2(\varepsilon, \mathcal{F}) \le N_2(\varepsilon/2, \mathcal{F})$, we obtain (3.12).

In the next step we prove

$$N_2(\varepsilon |F_n(\beta, K, L)(\omega)|, Z_n(\beta, K, L)(\omega)) \le C(\beta, K) \Big( \frac{1}{\varepsilon} \Big)^{K(V_0 + 1)} \Big( \sum_{i=1}^{n} N_i(\omega) \Big)^{K(V_0 + 1)/2} \tag{3.13}$$

with $V_0 = 4(\dim_{VC}(\mathcal{F}_0) - 1)$, $C(\beta, K) := C(d)^K 2^{K V_0} (\beta K)^{K(K + V_0)}$, and $C(d) > 0$ a universal constant.

For the proof define the constant function $F^2 := (L + K\beta)^2$ and $M(\omega) := \langle F^2, \sum_{i=1}^{n} \Phi_i(\omega) \rangle^{1/2}$. We first establish

$$N_2(\varepsilon |F_n(\beta, K, L)(\omega)|, Z_n(\beta, K, L)(\omega)) \le N\Big( \frac{\varepsilon M(\omega)}{ \big( \sum_{i=1}^{n} N_i(\omega) \big)^{1/2} }, \mathcal{G}(\beta, K, L), d_{L^2(\sum_{i=1}^{n} \Phi_i(\omega))} \Big). \tag{3.14}$$


To this purpose define $r := N\big( \varepsilon M(\omega)/(\sum_{i=1}^n N_i(\omega))^{1/2}, \mathcal{G}(\beta, K, L), d_{L^2(\sum_{i=1}^{n} \Phi_i(\omega))} \big)$ and let $g_1, \dots, g_r \in \mathcal{G}(\beta, K, L)$ be a corresponding approximation of $\mathcal{G}(\beta, K, L)$ with respect to the $d_{L^2(\sum \Phi_i(\omega))}$-metric. Define $y_i := (\langle g_i, \Phi_1(\omega) \rangle, \dots, \langle g_i, \Phi_n(\omega) \rangle)$. Then for every $z \in Z_n(\beta, K, L)(\omega)$, corresponding to some $g \in \mathcal{G}(\beta, K, L)$, there exists some $y_i$ with

$$d_2(z, y_i) \le \Big( \sum_{i=1}^{n} N_i(\omega) \Big)^{1/2} \frac{\varepsilon M(\omega)}{ \big( \sum_{i=1}^{n} N_i(\omega) \big)^{1/2} } = \varepsilon M(\omega) \le \varepsilon |F_n(\beta, K, L)(\omega)|.$$

This inequality results from an application of Jensen's inequality and the approximation of $g$ by $g_i$. Together we obtain (3.14).

Define next $\Phi'_i(\omega) := \sum_{j=1}^{N_i(\omega)} \varepsilon_{X_{ij}}$. From Pollard's inequality (3.4) we obtain

$$N\Big( \varepsilon \Big( \sum_{i=1}^{n} N_i(\omega) \Big)^{1/2}, \mathcal{F}_0, d_{L^2(\sum_{i=1}^{n} \Phi'_i(\omega))} \Big) \le C(d_0) \Big( \frac{1}{\varepsilon} \Big)^{4(d_0 - 1)} \tag{3.15}$$

with $d_0 = \dim_{VC}(\mathcal{F}_0)$. Denote

$$\mathcal{G}'(\beta, K, L) := \{ g_f: \mathbb{R}^k \times [-L, L] \to [-(L + K\beta), L + K\beta],\ (x, y) \mapsto y - f(x) \mid f \in \mathcal{F}(\beta, K) \}.$$

Then $\mathcal{G}(\beta, K, L) \subset (\mathcal{G}'(\beta, K, L))^2$ and $\mathcal{F}(\beta, K) \subset \oplus_{i=1}^{K} [-\beta, \beta] \cdot \mathcal{F}_0$. Using the stability properties of the covering numbers in Lemma 3.2, these inclusions imply, omitting the arguments in the following formulas (and writing $F := L + K\beta$, so that $M = F (\sum N_i)^{1/2}$):

$$A := N\Big( \frac{\varepsilon M}{ (\sum N_i)^{1/2} }, \mathcal{G}, d_{L^2(\sum \Phi_i)} \Big) = N(\varepsilon F, \mathcal{G}, d_{L^2(\sum \Phi_i)}) \le N\Big( \frac{\varepsilon F}{2(L + K\beta)}, \mathcal{G}', d_{L^2(\sum \Phi_i)} \Big) = N\Big( \frac{\varepsilon}{2}, \mathcal{G}', d_{L^2(\sum \Phi_i)} \Big) = N\Big( \frac{\varepsilon}{2}, \mathcal{F}, d_{L^2(\sum \Phi'_i)} \Big),$$

the inequality by Lemma 3.2(b), using $|g_f| \le L + K\beta$ for $g_f \in \mathcal{G}'$. Since $\mathcal{F} \subset \oplus_{i=1}^{K} [-\beta, \beta] \cdot \mathcal{F}_0$, we obtain from Lemma 3.2(a) and (c)

$$\begin{aligned}
A &\le N\Big( \frac{\varepsilon}{2K}, [-\beta, \beta] \cdot \mathcal{F}_0, d_{L^2(\sum \Phi'_i)} \Big)^K \\
&\le \Bigg( \frac{8 \beta K (\sum N_i)^{1/2}}{\varepsilon}\, N\Big( \frac{\varepsilon}{4 K \beta}, \mathcal{F}_0, d_{L^2(\sum \Phi'_i)} \Big) \Bigg)^K \\
&= \Bigg( \frac{8 \beta K (\sum N_i)^{1/2}}{\varepsilon}\, N\Big( \frac{\varepsilon}{4 K \beta (\sum N_i)^{1/2}} \Big( \textstyle\sum N_i \Big)^{1/2}, \mathcal{F}_0, d_{L^2(\sum \Phi'_i)} \Big) \Bigg)^K.
\end{aligned}$$


This implies, by Pollard's inequality in the form (3.15),

$$A \le \Bigg( C(d)\, \frac{8 \beta K (\sum N_i)^{1/2}}{\varepsilon} \Big( \frac{\varepsilon}{4 K \beta (\sum N_i)^{1/2}} \Big)^{-V_0} \Bigg)^K = C(\beta, K) \Big( \frac{1}{\varepsilon} \Big)^{K(V_0 + 1)} \Big( \sum_{i=1}^{n} N_i \Big)^{K(V_0 + 1)/2},$$

and thus we obtain (3.13). Define

$$R_1(\beta, K) := \int_0^1 \sqrt{ \log \Big( C(\beta, K) \Big( \frac{1}{\varepsilon} \Big)^{K(V_0 + 1)} \Big) } \, d\varepsilon \tag{3.16}$$

and $R_2(K) := \sqrt{K(V_0 + 1)}$. Then from (3.13) and (3.12), and using some simple calculations (see Döhler and Rüschendorf, 2000, proof of Proposition 15), we obtain the estimate

$$J_n(\beta, K, L)(\omega) \le C\, |F_n(\beta, K)(\omega)| \Big( R_1(\beta, K) + R_2(K) \Big( \sum_{i=1}^{n} N_i(\omega) \Big)^{1/4} \Big). \tag{3.17}$$

With $Z_n(\omega) := \big( \sum_{i=1}^{n} N_i^2(\omega) \big)^{1/2}$ this yields

$$J_n(\beta, K, L)(\omega) \le C_1(\beta, K, L)\, Z_n(\omega) \big( 1 + C_2(\beta, K, L)\, Z_n(\omega)^{1/2} \big), \tag{3.18}$$

using $\sum_i N_i \le \sum_i N_i^2 = Z_n^2$ for the integer-valued $N_i$.

Define $p' := [p] \in \mathbb{N}$ and $\delta := p - p'$. Then

$$\begin{aligned}
J^p_n(\beta, K, L)(\omega) &\le C_1^p(\beta, K, L)\, Z^p_n(\omega) \big[ 1 + C_2(\beta, K, L) Z_n(\omega)^{1/2} \big]^{p'} \big[ 1 + C_2(\beta, K, L) Z_n(\omega)^{1/2} \big]^{\delta} \\
&\le C_1^p(\beta, K, L)\, Z^p_n(\omega) \big[ 1 + C_2(\beta, K, L) Z_n(\omega)^{1/2} \big]^{p'} \big[ 1 + C_2^{\delta}(\beta, K, L) Z_n(\omega)^{\delta/2} \big],
\end{aligned}$$

using $(1 + x)^{\delta} \le 1 + x^{\delta}$ for $x \ge 0$. Therefore, by the binomial formula we obtain

$$E J^p_n(\beta, K, L) \le C_1^p(\beta, K, L) \sum_{k=0}^{p'} \binom{p'}{k} C_2^k(\beta, K, L)\, E\big( Z^p_n Z^{k/2}_n (1 + C_3(\beta, K, L) Z_n^{\delta/2}) \big) \le C_4(\beta, K, L)\, E Z_n^{p + (p'/2) + (\delta/2)} = C_4(\beta, K, L)\, E Z_n^{3p/2},$$

since $p' + \delta = p$. As $E N^{3p/2} < \infty$, this implies by a well-known moment estimate

$$E Z_n^{3p/2} = E\Big( \sum_{i=1}^{n} N_i^2(\omega) \Big)^{3p/4} = O(n^{3p/4}),$$

which yields the statement of the proposition. □
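The final moment estimate can also be checked by simulation. A quick Monte Carlo sketch (our own; the Poisson(5) law for $N$ is an arbitrary choice with all moments finite) illustrates that $E(\sum_{i=1}^n N_i^2)^{3p/4} / n^{3p/4}$ stabilizes, as the strong law forces $\sum_i N_i^2 / n \to E N^2$:

import numpy as np

rng = np.random.default_rng(1)
p = 5.0                                    # any p > 4, so that r = 3p/2 > 6
for n in (100, 400, 1600):
    counts = rng.poisson(5.0, size=(2000, n)).astype(float)
    z_sq = (counts ** 2).sum(axis=1)       # 2000 realizations of Z_n^2
    print(n, (z_sq ** (3 * p / 4)).mean() / n ** (3 * p / 4))

The printed ratios should settle near $(E N^2)^{3p/4}$, consistent with the $O(n^{3p/4})$ bound.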

Note that the various constants $C_i(\beta, K, L)$ in the proof of Proposition 3.4 can be made explicit. For the application of Proposition 3.4 to Theorem 3.3 we obtain, with $q = (3/4)p$, the condition $q + 1 = (3/4)p + 1 < p$, that is, $p > 4$; therefore we need the assumption $E N^r < \infty$ with $r = (3/2)p > 6$. This leads to the moment assumption on $N$ in Theorem 2.1.

For the proof of Theorem 2.1 we shall make use of the usual decomposition of the error into approximation and estimation error (for related decomposition results see e.g. Krzyzak et al., 1996).

LEMMA 3.5. Define $J_2(f) = E(Y - f(X))^2\, EN$ and the empirical counterpart $J_{2,n}(f) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{N_i} (Y_{ij} - f(X_{ij}))^2$. Then:
(a) $J_2(f) - J_2(m) = (EN) \int (f - m)^2 \, dP_X$;
(b) $J_2(\hat{m}_n) - J_2(m) = \big( J_2(\hat{m}_n) - \inf_{f \in \mathcal{F}_n} J_2(f) \big) + \big( \inf_{f \in \mathcal{F}_n} J_2(f) - J_2(m) \big)$;
(c) $|J_2(\hat{m}_n) - \inf_{f \in \mathcal{F}_n} J_2(f)| \le 2 \sup_{f \in \mathcal{F}_n} |J_{2,n}(f) - J_2(f)|$.

Proof. (a) and (b) are obvious. For the proof of (c) note that, since $\hat{m}_n$ minimizes $J_{2,n}$ over $\mathcal{F}_n$,

$$J_2(\hat{m}_n) - \inf_{f \in \mathcal{F}_n} J_2(f) = J_2(\hat{m}_n) - J_{2,n}(\hat{m}_n) + J_{2,n}(\hat{m}_n) - \inf_{f \in \mathcal{F}_n} J_2(f) \le J_2(\hat{m}_n) - J_{2,n}(\hat{m}_n) + \sup_{f \in \mathcal{F}_n} \big( J_{2,n}(f) - J_2(f) \big).$$

Therefore,

$$|J_2(\hat{m}_n) - \inf_{f \in \mathcal{F}_n} J_2(f)| \le |J_2(\hat{m}_n) - J_{2,n}(\hat{m}_n)| + \sup_{f \in \mathcal{F}_n} |J_{2,n}(f) - J_2(f)| \le 2 \sup_{f \in \mathcal{F}_n} |J_{2,n}(f) - J_2(f)|. \qquad \square$$
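Lemma 3.5(c) can be illustrated numerically: over a small finite candidate class, the excess risk of the empirical minimizer is dominated by twice the uniform deviation between $J_{2,n}$ and $J_2$. A sketch (ours; the candidate class, noise level, and the grid approximation of $J_2$ are arbitrary assumptions):

import numpy as np

rng = np.random.default_rng(2)
m = lambda x: np.sin(2 * np.pi * x)                 # true regression function
F = [lambda x, a=a: a * x for a in np.linspace(-2, 2, 21)]   # finite class

def J2n(f, samples):
    # empirical risk J_{2,n}(f) = (1/n) sum_i sum_j (Y_ij - f(X_ij))^2
    return sum(((s[:, 1] - f(s[:, 0])) ** 2).sum() for s in samples) / len(samples)

def J2(f, EN=5.0, grid=np.linspace(0, 1, 2001)):
    # J_2(f) = EN * E(Y - f(X))^2 for X ~ U[0,1], Y = m(X) + N(0, 0.3^2) noise
    return EN * (((m(grid) - f(grid)) ** 2).mean() + 0.3 ** 2)

samples = []
for _ in range(500):
    N = rng.poisson(5)
    x = rng.uniform(0, 1, N)
    samples.append(np.column_stack([x, m(x) + rng.normal(0, 0.3, N)]))

m_hat = min(F, key=lambda f: J2n(f, samples))       # empirical minimizer
lhs = abs(J2(m_hat) - min(J2(f) for f in F))        # excess risk
rhs = 2 * max(abs(J2n(f, samples) - J2(f)) for f in F)
print(lhs <= rhs, lhs, rhs)                         # inequality of Lemma 3.5(c)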

Proof of Theorem 2.1. The bounds on the random entropy numbers in Proposition 3.4 together with Theorem 3.3 imply convergence of the empirical risk measure, $\sup_{f \in \mathcal{F}_n} |J_{2,n}(f) - J_2(f)| \to 0$ a.s., for bounded $Y$. Theorem 2.2 implies that this convergence extends to all $Y \in L^2(P)$. By the denseness assumption on the sieve class $\mathcal{F}_n$ also the approximation term in Lemma 3.5(b) converges to zero, $\inf_{f \in \mathcal{F}_n} J_2(f) - J_2(m) \to 0$. So from Lemma 3.5(b) and (c) we obtain $J_2(\hat{m}_n) - J_2(m) = EN \int (\hat{m}_n(x) - m(x))^2 \, dP_X(x) \to 0$ a.s. for any $Y \in L^2(P)$. □

Remark 3.6. The conditions for the universal consistency result are stated in quite weak form. Stone (1977) first showed the existence of weakly universally consistent estimators.


Since then many results on weakly and strongly universally consistent estimators have been obtained by various methods; for a list of references see Devroye et al. (1994). It is known that no nontrivial upper bounds for the rate of convergence can be obtained under this general type of assumptions (see Lugosi and Zeger, 1995). Under additional assumptions on the moments of $Y$ and on the approximation properties of the class $\mathcal{F}$, the estimates in this paper can be made explicit to obtain convergence rates and the order of $K_n$ and $\beta_n$. In the related paper of Döhler (1999) this is made explicit for the problem of estimation of the hazard function in censoring models. An essential step in this approach is an exponential bound in the maximal inequality (instead of the polynomial bound used here). This exponential bound holds true if the random entropy integral is bounded by a constant (see Pollard, 1990, (7.3)).

4. Estimation of the Intensity Function of a Poisson Process

For the related problem of estimating the intensity function of a Poisson process it is more natural to switch from empirical $L^2$-risk minimization to sieved maximum likelihood estimation. Various parametric and nonparametric estimators of the intensity have already been studied in the literature (see the references in our introduction). The empirical risk minimization method applies in much the same way also to the maximum likelihood risk function. For the fixed sample case and the estimation of the hazard function in censoring models this has been worked out in Döhler (1999). Since many technical details are similar to those in the first part of this paper, we give only a sketch of the proof of the strong consistency of the sieved maximum likelihood estimator for general sieve classes satisfying Assumption A.

Let $\Phi$ be a Poisson process on $[0, 1]^k$ with finite intensity measure $\mu$ and intensity function $\tilde{\alpha}_0$; then $\mu(A) = E\Phi(A) = \int_A \tilde{\alpha}_0 \, d\lambda$. Let $\Phi_1, \Phi_2, \dots, \Phi_n$ be an iid sample, $\Phi_i \sim \Phi$; then the log-likelihood of $P_{\tilde{\alpha}}$ with respect to $P_{\tilde{\alpha}_0}$ is given by

$$L_n(\tilde{\alpha}) = n \int (1 - \tilde{\alpha}) \, d\lambda + \int \log \tilde{\alpha} \; d\Big( \sum_{i=1}^{n} \Phi_i \Big).$$

Define

$$\ell(\tilde{\alpha}) = E L_1(\tilde{\alpha}) = \int (1 - \tilde{\alpha} + \tilde{\alpha}_0 \log \tilde{\alpha}) \, d\lambda. \tag{4.1}$$

In order to avoid technical problems with the logarithm let $\tilde{\alpha} = \exp(\alpha)$, $\tilde{\alpha}_0 = \exp(\alpha_0)$, and so we obtain (using the notation $L_n(\tilde{\alpha}) =: L_n(\alpha)$, and similarly $\ell(\alpha)$)

$$\frac{1}{n} L_n(\alpha) - \ell(\alpha) = \frac{1}{n} \sum_{i=1}^{n} \Big( \int \alpha \, d\Phi_i - \int \alpha \exp(\alpha_0) \, d\lambda \Big). \tag{4.2}$$
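Concretely, for $k = 1$ the sieved maximum likelihood step maximizes $\frac{1}{n} L_n(\alpha) = \int (1 - e^{\alpha}) \, d\lambda + \frac{1}{n} \sum_i \int \alpha \, d\Phi_i$ over $\alpha$ in the sieve. The following sketch is our own: the basis, the box-constrained optimizer, and the grid discretization of the $d\lambda$-integral are illustrative assumptions, not the paper's construction.

import numpy as np
from scipy.optimize import minimize

def sieve_mle(processes, basis, beta, grid=np.linspace(0.0, 1.0, 512)):
    """Maximize (1/n) L_n(alpha) over alpha = sum_j c_j * basis_j with the
    sieve bound |c_j| <= beta; processes is a list of arrays of points in
    [0, 1], one array per observed Poisson process."""
    B_grid = np.column_stack([phi(grid) for phi in basis])   # for the d-lambda term
    pts = np.concatenate(processes)
    B_pts = np.column_stack([phi(pts) for phi in basis])     # for the point terms
    n = len(processes)

    def neg_loglik(c):
        integral = np.mean(1.0 - np.exp(B_grid @ c))         # int (1 - e^alpha) d-lambda
        return -(integral + (B_pts @ c).sum() / n)

    res = minimize(neg_loglik, x0=np.zeros(len(basis)),
                   bounds=[(-beta, beta)] * len(basis))      # enforces the sieve bound
    return lambda t: np.column_stack([phi(t) for phi in basis]) @ res.x

# toy usage: true intensity 3 + 2*cos(2*pi*t), simulated by thinning rate-5 proposals
rng = np.random.default_rng(3)
processes = []
for _ in range(300):
    t = rng.uniform(0, 1, rng.poisson(5.0))
    processes.append(t[rng.uniform(0, 5.0, len(t)) < 3 + 2 * np.cos(2 * np.pi * t)])
basis = [np.ones_like, lambda t: np.cos(2 * np.pi * t), lambda t: np.sin(2 * np.pi * t)]
alpha_hat = sieve_mle(processes, basis, beta=5.0)

The estimator returned here is $\hat{\alpha}_n$ in the parameterization (4.2), so the fitted intensity is $\exp(\hat{\alpha}_n)$.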


With respect to the function class $\mathcal{F}(\beta, K)$ as defined in (3.1), the maximal deviation is given by

$$\Delta_n(\beta, K) := \sup_{g \in \mathcal{G}(\beta, K)} \Big| \sum_{i=1}^{n} \langle g, \Phi_i \rangle - E \sum_{i=1}^{n} \langle g, \Phi_i \rangle \Big|, \tag{4.3}$$

where $\langle g, \mu \rangle = \int g \, d\mu$ and $\mathcal{G}(\beta, K) = \exp(\mathcal{F}(\beta, K))$. The random point set $Z_n(\beta, K)$ and the random entropy integral $J_n(\beta, K)$ are of a similar structure as in Section 3 (see (3.5) and (3.6)). To estimate $J_n(\beta, K)$ we need estimates for the covering number, respectively the packing number $D_2(\varepsilon, Z_n(\beta, K))$. These estimates are obtained similarly as in the proof of Proposition 3.4, under the corresponding assumptions on the Poisson process $\Phi$. As a consequence we obtain a maximal inequality as in the proof of Theorem 3.3. Applying the Borel–Cantelli lemma, this yields the corresponding uniform law. Thus we can find suitable sequences $\beta_n \uparrow \infty$, $K_n \uparrow \infty$ such that, with the corresponding sieve $\mathcal{F}_n = \mathcal{F}(\beta_n, K_n)$, $\mathcal{F}_n \uparrow \mathcal{F}$, and

$$\sup_{\alpha \in \mathcal{F}_n} \Big| \frac{1}{n} L_n(\alpha) - \ell(\alpha) \Big| \to 0 \quad \text{a.s.} \tag{4.4}$$

In consequence we obtain, as in Döhler (1999), that the sieved maximum likelihood estimator $\hat{\alpha}_n$ converges to $\alpha_0$ with respect to the statistical distance $\ell(\alpha) - \ell(\alpha_0)$ (the Kullback–Leibler distance), that is,

$$\ell(\hat{\alpha}_n) - \ell(\alpha_0) \to 0 \quad \text{a.s.} \tag{4.5}$$

Using (4.1), the $\ell$-distance has the representation

$$|\ell(\alpha_0) - \ell(\alpha)| = \ell(\alpha_0) - \ell(\alpha) = \int G(\alpha - \alpha_0) \exp(\alpha_0) \, d\lambda \tag{4.6}$$

with $G(y) := \exp(y) - (1 + y)$ (see Döhler, 1999). This representation implies the $L^1$-convergence

$$\int |\hat{\alpha}_n - \alpha_0| \exp(\alpha_0) \, d\lambda \to 0 \quad \text{a.s.} \tag{4.7}$$

by some simple estimates of $G$. As a consequence we therefore obtain strong consistency in $L^1$-sense of the sieved ML-estimator for the intensity function, using the parameterization of the intensity as in (4.2).

THEOREM 4.1 (Strong consistency of the ML-estimator). Let $(\Phi_i)$ be an iid sample of Poisson processes on $[0, 1]^k$ with finite intensity, parametrized as in (4.2). Let $\mathcal{F}$ satisfy Assumption A. Then there exist sequences $\beta_n \uparrow \infty$, $K_n \uparrow \infty$ such that the ML-estimator $(\hat{\alpha}_n)$ based on the sieve $\mathcal{F}_n = \mathcal{F}(\beta_n, K_n)$ is strongly consistent for the intensity function $\alpha_0$ in $L^1$-sense, that is, $\int |\hat{\alpha}_n - \alpha_0| \exp(\alpha_0) \, d\lambda \to 0$ a.s.


Acknowledgements

The authors thank two reviewers for several useful and interesting comments on the subject and presentation of the paper.

References

Devroye, L., Györfi, L., Krzyzak, A. and Lugosi, G.: On the strong universal consistency of nearest neighbor regression function estimates, Ann. Statist. 22 (1994), 1371–1385.
Devroye, L., Györfi, L. and Lugosi, G.: A Probabilistic Theory of Pattern Recognition, Applications of Mathematics, Vol. 31, Springer, Berlin, 1996.
Döhler, S.: Consistent hazard regression estimation by sieved maximum likelihood estimators, In: Proceedings of the Conference on Limit Theorems in Balatonlelle, 1999 (in press).
Döhler, S.: Empirische Risiko-Minimierung bei zensierten Daten, Dissertation, University of Freiburg, 2000.
Döhler, S. and Rüschendorf, L.: Adaptive estimation of hazard functions, In: Probability and Mathematical Statistics, 2000 (in press).
Györfi, L., Kohler, M., Krzyzak, A. and Walk, H.: A Distribution Free Theory of Nonparametric Regression, Springer Series in Statistics, Springer, Berlin, 2002.
Hornik, K.: Approximation capabilities of multilayer feedforward networks, Neural Networks 4 (1991), 251–257.
Karr, A.: Point Processes and their Statistical Inference, Marcel Dekker, New York, 1991.
Kim, W.-C. and Koo, J.-Y.: Inhomogeneous Poisson intensity via information projections onto wavelet subspaces, 2000, Preprint.
Kohler, M.: On the universal consistency of a least squares spline regression estimator, Math. Meth. Statist. 6 (1997), 349–364.
Kohler, M.: Universal consistency of local polynomial kernel regression estimates, 2000, Preprint.
Krzyzak, A. and Linder, T.: Radial basis function networks and complexity regularization in function learning, Neural Networks 9 (1998), 247–256.
Krzyzak, A., Linder, T. and Lugosi, G.: Nonparametric estimation and classification using radial basis function nets and empirical risk minimization, IEEE Trans. Neural Networks 7(2) (1996), 475–487.
Kutoyants, Y. A.: Statistical Inference for Spatial Poisson Processes, Lecture Notes in Statistics, Vol. 134, Springer, Berlin, 1998.
Liese, F.: Estimation of intensity measures of Poisson point processes, In: Proceedings of the 11th Prague Conference, 1990, pp. 121–139.
Liese, F. and Ziegler, K.: A note on empirical process methods in the theory of point processes, Scand. J. Statist. 26 (1999), 533–537.
Lugosi, G. and Zeger, K.: Nonparametric estimation via empirical risk minimization, IEEE Trans. Inform. Theory 41(3) (1995), 677–687.
Pollard, D.: Empirical Processes: Theory and Applications, Institute of Mathematical Statistics, 1990.
Reynaud-Bouret, P.: Concentration inequalities for inhomogeneous Poisson processes and adaptive estimation of the intensity, 2001, Preprint.
Rudemo, M.: Empirical choice of histograms and kernel density estimators, Scand. J. Stat. Theory Appl. 9 (1982), 65–78.
Rüschendorf, L.: Inference for random sampling processes, Stoch. Proc. Appl. 32 (1989), 129–140.
Stone, C. J.: Consistent nonparametric regression (with discussion), Ann. Statist. 5 (1977), 595–645.


van der Vaart, A. and Wellner, J.: Weak Convergence and Empirical Processes, Springer, Berlin, 1996.
Vapnik, V. and Chervonenkis, A.: The Theory of Pattern Recognition, Nauka, Moscow, 1974.
Ziegler, K.: On functional central limit theorems and uniform laws of large numbers for sums of independent processes, Dissertation, University of Munich, 1994.