Covering Numbers for Support Vector Machines

Ying Guo, Department of Engineering, Australian National University, Canberra 0200, Australia
Peter L. Bartlett, RSISE, Australian National University, Canberra 0200, Australia
John Shawe-Taylor, Department of Computer Science, Royal Holloway College, University of London, Egham, TW20 0EX, UK
Robert C. Williamson, Department of Engineering, Australian National University, Canberra 0200, Australia

Abstract

Support vector machines are a type of learning machine related to the maximum margin hyperplane. Until recently, the only bounds on the generalization performance of SV machines (within the PAC framework) were via bounds on the fat-shattering dimension of maximum margin hyperplanes. That result took no account of the kernel used. More recently, it has been shown [8] that one can bound the relevant covering numbers using tools from functional analysis. The resulting bound is quite complex and seemingly difficult to compute. In this paper we show that the bound can be greatly simplified, and as a consequence we are able to determine some interesting quantities (such as the effective number of dimensions used). The new bound is a simple formula involving the eigenvalues of the integral operator induced by the kernel. We present an explicit calculation of covering numbers for an SV machine using a Gaussian kernel, which is significantly better than that implied by the maximum margin fat-shattering result.

1 INTRODUCTION

Support Vector (SV) machines [5] are learning algorithms based on maximum margin hyperplanes [4] which make use of an implicit mapping into feature space by employing a general kernel function in place of the standard inner product. Consequently one can apply an analysis for the maximum margin algorithm directly to SV machines. However, such an analysis completely ignores the effect of the kernel. Intuitively one would expect that a "smoother" kernel would somehow reduce the capacity of the learning machine, thus leading to better bounds on generalization error if the machine can attain a small training error. In [9, 8] it has been shown that this intuition is justified. The main result there (quoted below) gives a bound on the covering numbers for the class of functions computed by support vector machines. This bound, together with statistical results of the form given in [7], yields generalization error bounds that explicitly depend on the kernel used.

In the traditional viewpoint of statistical learning theory, one is given a class of functions $F$, and the generalization performance attainable using $F$ is determined via the covering numbers $\mathcal{N}(\epsilon, F)$ (precise definitions are given below). Many generalization error bounds can be expressed in terms of $\mathcal{N}(\epsilon, F)$. The main method of bounding $\mathcal{N}(\epsilon, F)$ has been to use the Vapnik-Chervonenkis dimension or one of its generalizations (see [1] for an overview). In [9, 8] an alternative viewpoint is taken, where the class $F$ is viewed as being generated by an integral operator induced by the kernel. Properties of this operator are used to bound the required covering numbers. The result is in a form that is not particularly easy to use (see (13) below). The main technical result of this paper is an explicit reformulation of this bound which is amenable to direct calculation. We illustrate the new result by bounding the covering numbers of SV machines which use Gaussian RBF kernels. The result shows the influence of $\sigma^2$ on the covering numbers: the covering numbers decrease when $\sigma^2$ increases. Here $\sigma^2$ is the variance of the Gaussian function used for the kernel. More generally, the main result makes model order selection possible using any parametrized family of kernel functions: we can describe precisely how the capacity of the class is affected by changes to the kernel.

For $d \in \mathbb{N}$, $\mathbb{R}^d$ denotes the $d$-dimensional space of vectors $x = (x_1, \ldots, x_d)$. For $0 < p \le \infty$, define the spaces $\ell_p^d := \{ x \in \mathbb{R}^d : \|x\|_{\ell_p^d} < \infty \}$, where the $p$-norms are

\[ \|x\|_{\ell_p^d} := \Big( \sum_{j=1}^{d} |x_j|^p \Big)^{1/p} \quad \text{for } 0 < p < \infty, \qquad \|x\|_{\ell_\infty^d} := \max_{j=1,\ldots,d} |x_j| \quad \text{for } p = \infty. \]

For $d = \infty$, we write $\ell_p = \ell_p^\infty$ and the norms are defined similarly (by formally substituting $\infty$ for $d$ in the above definitions). The $\epsilon$-covering number of $F$ with respect to the metric $d$, denoted $\mathcal{N}(\epsilon, F, d)$, is the size of the smallest $\epsilon$-cover for $F$ using the metric $d$. Given $m$ points $x_1, \ldots, x_m \in \ell_p^d$, we use the shorthand $X^m = (x_1, \ldots, x_m)$. Suppose $F$ is a class of functions defined on $\mathbb{R}^d$. The $\ell_\infty^{X^m}$ norm with respect to $X^m$ of $f \in F$ is defined as $\|f\|_{\ell_\infty^{X^m}} := \max_{i=1,\ldots,m} |f(x_i)|$. The input space is taken to be $X$, a compact subset of $\mathbb{R}^d$.

Our main result is a bound on the covering numbers of SV machines. We only discuss the case $d = 1$. (In fact the result does hold for general $d$; see the discussion in the conclusion.) Let $k: X \times X \to \mathbb{R}$ be a kernel satisfying the hypotheses of Mercer's theorem (Theorem 2). Given $m$ points $x_1, \ldots, x_m \in X$, denote by $F_{R_w}$ the hypothesis class implemented by SV machines on an $m$-sample with weight vector (in feature space) bounded by $R_w$:

\[ F_{R_w} = \Big\{ x \mapsto \sum_i \alpha_i k(x, x_i) : \ \sum_i \sum_j \alpha_i \alpha_j k(x_i, x_j) \le R_w^2 \Big\}. \tag{1} \]

Let $\lambda_1 \ge \lambda_2 \ge \cdots$ be the eigenvalues of the integral operator $T_k: L_2(X) \to L_2(X)$,

\[ T_k: f \mapsto \int_X k(\cdot, y) f(y) \, dy, \]

and denote by $\phi_n(\cdot)$, $n \in \mathbb{N}$, the corresponding eigenfunctions. (See the next section for a reminder of what this means.) For translation invariant kernels (such as $k(x, y) = \exp(-(x - y)^2/\sigma^2)$), the eigenvalues are given by

\[ \lambda_j = \sqrt{2\pi}\, K(j \omega_0) \tag{2} \]

for $j \in \mathbb{Z}$, where $K(\omega) = \mathcal{F}[k(x)](\omega)$ is the Fourier transform of $k(\cdot)$ (see [9, 8] for further details). For a smooth kernel, the Fourier transform $K(j\omega_0)$ decreases faster (there are fewer "high frequency components"). Thus for smooth kernels, $\lambda_i$ decreases to zero rapidly with increasing $i$.

Theorem 1 (Main Result) Suppose $k$ is a kernel satisfying the hypothesis of Mercer's theorem. The hypothesis class $F_{R_w}$, eigenfunctions $\phi_n(\cdot)$ and eigenvalues $(\lambda_i)$ are defined as above. Let $x_1, \ldots, x_m \in X$ be $m$ data points. Let $C_k = \sup_n \|\phi_n\|_{L_\infty} < \infty$, and for $n \in \mathbb{N}$ set

\[ \epsilon_n = 6 R_w C_k \sqrt{ j^* \left( \frac{\lambda_1 \cdots \lambda_{j^*}}{n^2} \right)^{1/j^*} + \sum_{i=j^*+1}^{\infty} \lambda_i } \tag{3} \]

where

\[ j^* = \min\left\{ j : \lambda_{j+1} < \left( \frac{\lambda_1 \cdots \lambda_j}{n^2} \right)^{1/j} \right\}. \tag{4} \]

Then

\[ \sup_{x_1, \ldots, x_m \in X} \mathcal{N}(\epsilon_n, F_{R_w}, \ell_\infty^{X^m}) \le n. \]

The quantity $\epsilon_n$ is an upper bound on the entropy number of $F_{R_w}$, which is the functional inverse of the covering number. In this theorem, the number $j^*$ has a natural interpretation: for a given value of $n$, it can be viewed as the effective dimension of the function class. Clearly, this effective dimension depends on the rate of decay of the eigenvalues. As expected, for smooth kernels (which have rapidly decreasing eigenvalues), the effective dimension is small. It turns out that all kernels satisfying Mercer's conditions are sufficiently smooth for $j^*$ to be finite.

The remainder of the paper is organized as follows. We start by introducing notation and definitions (Section 2). Section 3 contains the main result (the proof is in Appendix A). Section 4 contains an example application of the main result. Section 5 concludes.
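
Both quantities in Theorem 1 are easy to evaluate numerically once the eigenvalues of $T_k$ are known or bounded. The following minimal sketch (our own illustration, not part of the paper; the eigenvalue sequence, $R_w$ and $C_k$ below are arbitrary example inputs) computes $j^*$ from (4) and $\epsilon_n$ from (3) on a truncated eigenvalue sequence:

```python
import numpy as np

def effective_dimension(lam, n):
    """j* from (4): smallest j with lam_{j+1} < (lam_1*...*lam_j / n^2)^(1/j).

    lam is a non-increasing array of positive eigenvalues; it is 0-based,
    so lam[j] is the (j+1)-th eigenvalue."""
    log_lam = np.log(lam)
    for j in range(1, len(lam)):
        if log_lam[j] < (log_lam[:j].sum() - 2.0 * np.log(n)) / j:
            return j
    return len(lam)  # truncation too short to exhibit j*

def epsilon_n(lam, n, R_w, C_k):
    """Upper bound eps_n on the entropy numbers of F_{R_w}, cf. (3)."""
    j_star = effective_dimension(lam, n)
    log_lam = np.log(lam)
    geo = j_star * np.exp((log_lam[:j_star].sum() - 2.0 * np.log(n)) / j_star)
    tail = lam[j_star:].sum()  # truncated tail; in practice add a bound on the rest
    return 6.0 * R_w * C_k * np.sqrt(geo + tail), j_star

# Example: geometrically decaying eigenvalues (purely illustrative).
lam = 0.9 ** np.arange(1, 200)
eps, j_star = epsilon_n(lam, n=2 ** 10, R_w=1.0, C_k=np.sqrt(2.0))
print(j_star, eps)
```

In practice one would replace the truncated tail sum by a closed-form bound on $\sum_{i>j^*} \lambda_i$ for the kernel at hand.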

2 DEFINITIONS AND PREVIOUS RESULTS

Let $\mathcal{L}(E, F)$ be the set of all bounded linear operators $T$ between the normed spaces $(E, \|\cdot\|_E)$ and $(F, \|\cdot\|_F)$, i.e. operators such that the image of the (closed) unit ball

\[ U_E := \{ x \in E : \|x\|_E \le 1 \} \]

is bounded. The smallest such bound is called the operator norm,

\[ \|T\| := \sup_{x \in U_E} \|Tx\|_F . \tag{5} \]

The $n$th entropy number of a set $M \subset E$, for $n \in \mathbb{N}$, is

\[ \epsilon_n(M) := \inf\{ \epsilon > 0 : \text{there exists an } \epsilon\text{-cover for } M \text{ in } E \text{ containing } n \text{ or fewer points} \}. \tag{6} \]

(The function $n \mapsto \epsilon_n(M)$ can be thought of as the functional inverse of the function $\epsilon \mapsto \mathcal{N}(\epsilon, M, d)$, where $d$ is the metric induced by $\|\cdot\|_E$.) The entropy numbers of an operator $T \in \mathcal{L}(E, F)$ are defined as

\[ \epsilon_n(T) := \epsilon_n(T(U_E)). \tag{7} \]

Note that $\epsilon_1(T) = \|T\|$, and that $\epsilon_n(T)$ certainly is well defined for all $n \in \mathbb{N}$ if $T$ is a compact operator, i.e. if $T(U_E)$ is compact. In the following, $k$ will always denote a kernel, and $d$ and $m$ will be the input dimensionality and the number of training examples, respectively, so that the training data is a sequence

\[ (x_1, y_1), \ldots, (x_m, y_m) \in \mathbb{R}^d \times \mathbb{R}. \]

Let $\log$ denote the logarithm to base 2. We will map the input data into a feature space $S$ via a mapping $\Phi$. We let $\tilde{x} := \Phi(x)$, and

\[ F_{R_w} := \{ \langle w, \tilde{x} \rangle : \tilde{x} \in S, \ \|w\| \le R_w \} \subseteq \mathbb{R}^S . \tag{8} \]

Given a class of functions $F$, the generalization performance attainable using $F$ can be bounded in terms of the covering numbers of $F$. More precisely, for some set $X$, and $x_i \in X$ for $i = 1, \ldots, m$, define the $\epsilon$-growth function of the function class $F$ on $X$ as

\[ \mathcal{N}_m(\epsilon, F) := \sup_{x_1, \ldots, x_m \in X} \mathcal{N}(\epsilon, F, \ell_\infty^{X^m}) \tag{9} \]

where $\mathcal{N}(\epsilon, F, \ell_\infty^{X^m})$ is the $\epsilon$-covering number of $F$ with respect to $\ell_\infty^{X^m}$. Many generalization error bounds can be expressed in terms of $\mathcal{N}_m(\epsilon, F)$.

Given some set $X$, some $1 \le p < \infty$ and a function $f: X \to \mathbb{R}$, we define $\|f\|_{L_p(X)} := \left( \int_X |f(x)|^p \, dx \right)^{1/p}$ if the integral exists, and $\|f\|_{L_\infty(X)} := \operatorname{ess\,sup}_{x \in X} |f(x)|$. For $1 \le p \le \infty$, we let $L_p(X) := \{ f: X \to \mathbb{R} : \|f\|_{L_p(X)} < \infty \}$. We sometimes write $L_p = L_p(X)$.

Suppose $T: E \to E$ is a linear operator mapping a normed space $E$ into itself. We say that $x \in E$ is an eigenvector if for some scalar $\lambda$, $Tx = \lambda x$. Such a $\lambda$ is called the eigenvalue associated with $x$. When $E$ is a function space (e.g. $E = L_2(X)$) the eigenvectors are of course functions, and are usually called eigenfunctions. Thus $\phi_n$ is an eigenfunction of $T: L_2(X) \to L_2(X)$ if $T\phi_n = \lambda \phi_n$. In general $\lambda$ is complex, but in this paper all eigenvalues are real (because of the symmetry of the kernels used to induce the operators). We will make use of Mercer's theorem. The version stated below is a special case of the theorem proven in [6, p. 145].
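
Definition (6) makes the entropy numbers the functional inverse of the covering numbers. On a finite set this inverse relationship can be made concrete with a crude greedy cover; the sketch below is purely illustrative (the point set, the grid of $\epsilon$ values and the greedy strategy are our own choices, not constructions from the paper):

```python
import numpy as np

def greedy_cover_size(points, eps):
    """Upper bound on the eps-covering number of a finite set in the l_inf metric."""
    uncovered = list(points)
    centers = 0
    while uncovered:
        c = uncovered[0]
        # Keep only the points that are not within eps of the chosen center.
        uncovered = [p for p in uncovered if np.max(np.abs(p - c)) > eps]
        centers += 1
    return centers

def entropy_number(points, n, eps_grid):
    """Smallest eps in eps_grid whose greedy cover uses at most n centers, cf. (6)."""
    for eps in sorted(eps_grid):
        if greedy_cover_size(points, eps) <= n:
            return eps
    return None

rng = np.random.default_rng(0)
M = rng.uniform(-1.0, 1.0, size=(500, 2))      # a finite subset of R^2 (illustrative)
eps_grid = np.linspace(0.05, 2.0, 40)
for n in (1, 4, 16, 64):
    print(n, entropy_number(M, n, eps_grid))
```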



Theorem 2 (Mercer) Suppose $k \in L_\infty(X^2)$ is a symmetric kernel such that the integral operator $T_k: L_2(X) \to L_2(X)$,

\[ T_k f(\cdot) := \int_X k(\cdot, y) f(y) \, dy, \tag{10} \]

is positive. Let $\phi_j \in L_2(X)$ be the eigenfunction of $T_k$ associated with the eigenvalue $\lambda_j \neq 0$ and normalized such that $\|\phi_j\|_{L_2} = 1$, and let $\overline{\phi_j}$ denote its complex conjugate. Suppose $\phi_j$ is continuous for all $j \in \mathbb{N}$. Then

1. $(\lambda_j(T_k))_j \in \ell_1$.
2. $\phi_j \in L_\infty(X)$ and $\sup_j \|\phi_j\|_{L_\infty} < \infty$.
3. $k(x, y) = \sum_{j \in \mathbb{N}} \lambda_j \overline{\phi_j}(x) \phi_j(y)$ holds for all $(x, y)$, where the series converges absolutely and uniformly for all $(x, y)$.

We will call a kernel satisfying the conditions of this theorem a Mercer kernel. From statement 2 of Mercer's theorem there exists some constant $C_k \in \mathbb{R}^+$ depending on $k(\cdot, \cdot)$ such that

\[ |\phi_j(x)| \le C_k \quad \text{for all } j \in \mathbb{N} \text{ and } x \in X. \tag{11} \]

This conclusion is the only reason we have added the condition that $\phi_n$ be continuous; it is not necessary for the theorem as stated, but it is convenient to bundle all of our assumptions into the one place. In any case it is not a very restrictive assumption: if $X$ is compact and $k$ is continuous, then $\phi_j$ is automatically continuous (see e.g. [3]). Alternatively, if $k$ is translation invariant, then the $\phi_j$ are scaled cosine functions and thus continuous.
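
When the spectrum of $T_k$ is not available in closed form, the eigenvalues $\lambda_j$ and the constant $C_k$ of (11) can be estimated by discretizing the integral operator on a grid (a Nyström-style approximation). This is our own illustration, not a construction used in the paper; the kernel, interval and grid size are arbitrary choices:

```python
import numpy as np

def approx_spectrum(kernel, a, b, num=400):
    """Approximate eigenvalues/eigenfunctions of T_k on X = [a, b].

    T_k is discretized as ((b - a)/num) * K on a uniform grid, so the eigenvalues
    of that matrix approximate lambda_j, and sqrt(num/(b - a)) times the
    eigenvectors approximate the L_2-normalized eigenfunction values on the grid."""
    x = np.linspace(a, b, num)
    w = (b - a) / num
    K = kernel(x[:, None], x[None, :])
    lam, V = np.linalg.eigh(w * K)
    order = np.argsort(lam)[::-1]          # sort eigenvalues in decreasing order
    lam, V = lam[order], V[:, order]
    phi = V / np.sqrt(w)                   # approximate eigenfunction values
    return lam, phi

sigma = 1.0
gauss = lambda s, t: np.exp(-(s - t) ** 2 / sigma ** 2)
lam, phi = approx_spectrum(gauss, -np.pi, np.pi)
C_k = np.abs(phi[:, lam > 1e-12]).max()    # crude estimate of sup_j ||phi_j||_inf, cf. (11)
print(lam[:5], C_k)
```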

In [8] an upper bound on the entropy numbers was given in terms of the eigenvalues of the kernel used. The result is in terms of the entropy numbers of a scaling operator $A$. The notation $(a_s)_s \in \ell_p$ denotes a sequence $(a_1, a_2, \ldots)$ such that $\sum_{s=1}^{\infty} |a_s|^p < \infty$.

Theorem 3 (Entropy numbers) Let $k: X \times X \to \mathbb{R}$ be a Mercer kernel. Choose $a_j > 0$ for $j \in \mathbb{N}$ such that $(\sqrt{\lambda_s}/a_s)_s \in \ell_2$, and define

\[ A: (x_j)_j \mapsto (R_A \, a_j x_j)_j \tag{12} \]

with $R_A := C_k \, \| (\sqrt{\lambda_j}/a_j)_j \|_{\ell_2}$. Then

\[ \epsilon_n(A: \ell_2 \to \ell_2) \ \le \ \sup_{j \in \mathbb{N}} \ 6 C_k \, \left\| \left( \sqrt{\lambda_s}/a_s \right)_s \right\|_{\ell_2} \left( \frac{a_1 \cdots a_j}{n} \right)^{1/j}. \tag{13} \]

This result leads to the following bounds for SV classes.

Theorem 4 (Bounds for SV classes) Let $k$ be a Mercer kernel. Then for all $n \in \mathbb{N}$,

\[ \epsilon_n(F_{R_w}) \ \le \ R_w \inf_{(a_s)_s : (\sqrt{\lambda_s}/a_s)_s \in \ell_2} \epsilon_n(A) \tag{14} \]

where $A$ is defined as in Theorem 3.

Combining Equations (13) and (14) gives effective bounds on $\mathcal{N}_m(\epsilon, F_{R_w})$, since

\[ \epsilon_n(T: \ell_2 \to \ell_\infty^m) \le \epsilon_0 \ \Rightarrow \ \mathcal{N}_m(\epsilon_0, F_{R_w}) \le n. \]
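
The implication above converts any non-increasing bound $n \mapsto \epsilon_n$ on the entropy numbers into a bound on the covering number at a given scale $\epsilon$: one simply finds the smallest $n$ with $\epsilon_n \le \epsilon$. A minimal sketch (the particular bound passed in at the end is a hypothetical example, not one of the bounds derived in this paper):

```python
import numpy as np

def covering_number_bound(eps, eps_n_fn, n_max=10 ** 7):
    """Smallest n with eps_n <= eps; then N_m(eps, F_{R_w}) <= n by the implication above.

    eps_n_fn(n) must be a non-increasing upper bound on the entropy numbers."""
    lo, hi = 1, 1
    while eps_n_fn(hi) > eps:          # grow an upper bracket geometrically
        hi *= 2
        if hi > n_max:
            raise ValueError("bound too large to invert in the allowed range")
    while lo < hi:                      # binary search for the smallest admissible n
        mid = (lo + hi) // 2
        if eps_n_fn(mid) <= eps:
            hi = mid
        else:
            lo = mid + 1
    return lo

# Example with a hypothetical entropy-number bound eps_n = 2 / log2(n + 1).
print(covering_number_bound(0.2, lambda n: 2.0 / np.log2(n + 1)))
```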

These results thus give a method to obtain bounds on the entropy numbers for kernel machines. In Inequality (14), we can choose $(a_s)_s$ to optimize the bound. The key technical contribution of this paper is the explicit determination of the best choice of $(a_s)_s$. We assume henceforth that $(\lambda_s)_s$ is fixed and sorted in non-increasing order, and $a_i > 0$ for all $i$. For $j \in \mathbb{N}$, we define the set

\[ A_j = \left\{ (a_s)_s : \ \sup_{i \in \mathbb{N}} \left( \frac{a_1 \cdots a_i}{n} \right)^{1/i} = \left( \frac{a_1 \cdots a_j}{n} \right)^{1/j} \right\}. \tag{15} \]

In other words, $A_j$ is the set of $(a_s)_s$ such that the supremum $\sup_{i \in \mathbb{N}} \left( (a_1 \cdots a_i)/n \right)^{1/i}$ is attained at $i = j$. Let

\[ B((a_s), n, j) = \left\| \left( \sqrt{\lambda_s}/a_s \right)_s \right\|_{\ell_2} \left( \frac{a_1 \cdots a_j}{n} \right)^{1/j}, \]

where for notational simplicity, we write $(a_s)$ for $(a_s)_s$.
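
For truncated sequences, $B((a_s), n, j)$ and the index at which $\sup_i ((a_1 \cdots a_i)/n)^{1/i}$ is attained (i.e. the $j$ with $(a_s) \in A_j$) are straightforward to compute. A small sketch, with candidate sequences that are our own examples rather than choices made in the paper:

```python
import numpy as np

def B(lam, a, n, j):
    """B((a_s), n, j) = ||(sqrt(lam_s)/a_s)_s||_2 * ((a_1...a_j)/n)^(1/j), truncated."""
    weight = np.sqrt(np.sum(lam / a ** 2))
    geo = np.exp((np.log(a[:j]).sum() - np.log(n)) / j)
    return weight * geo

def sup_index(a, n):
    """1-based index attaining sup_i ((a_1...a_i)/n)^(1/i) on the truncated sequence."""
    vals = [(np.log(a[:i]).sum() - np.log(n)) / i for i in range(1, len(a) + 1)]
    return int(np.argmax(vals)) + 1

n = 64
lam = 0.5 ** np.arange(1, 60)          # example eigenvalue sequence
a1 = np.sqrt(lam)                      # one candidate scaling sequence
a2 = np.full_like(lam, 0.3)            # a constant candidate (sup taken over the truncation)
for a in (a1, a2):
    j = sup_index(a, n)
    print(j, B(lam, a, n, j))
```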

3 THE OPTIMAL CHOICE OF $(a_s)_s$ AND $j$

Our main aim in this section is to show that the infimum in (14) and the supremum in (13) can be achieved, and to give an explicit recipe for the sequence $(a_s)$ and number $j^*$ that achieve them. The main technical theorem is as follows.

Theorem 5 Let $k: X \times X \to \mathbb{R}$ be a Mercer kernel. Suppose $\lambda_1 \ge \lambda_2 \ge \cdots$ are the eigenvalues of $T_k$. For any $n \in \mathbb{N}$, the minimum

\[ j^* = \min\left\{ j : \lambda_{j+1} < \left( \frac{\lambda_1 \cdots \lambda_j}{n^2} \right)^{1/j} \right\} \tag{16} \]

always exists, and

\[ \inf_{(a_s)_s : (\sqrt{\lambda_s}/a_s)_s \in \ell_2} \ \sup_{j \in \mathbb{N}} B((a_s), n, j) = B((a^*_s), n, j^*), \]

where

\[ a^*_i = \begin{cases} \sqrt{\lambda_i} & \text{when } i \le j^*, \\[4pt] \left( \dfrac{\lambda_1 \cdots \lambda_{j^*}}{n^2} \right)^{1/(2 j^*)} & \text{when } i > j^*. \end{cases} \tag{17} \]

This choice of $(a_s)$ results in a simple form for the bound of (14) in terms of $n$ and $(\lambda_i)$:

Corollary 6 Let $k: X \times X \to \mathbb{R}$ be a Mercer kernel and let $A$ be given by (12). Then for any $n \in \mathbb{N}$, the entropy numbers satisfy

\[ \inf_{(a_s)_s : (\sqrt{\lambda_s}/a_s)_s \in \ell_2} \epsilon_n(A: \ell_2 \to \ell_2) \ \le \ 6 C_k \sqrt{ j^* \left( \frac{\lambda_1 \cdots \lambda_{j^*}}{n^2} \right)^{1/j^*} + \sum_{i=j^*+1}^{\infty} \lambda_i } \tag{18} \]

with

\[ j^* = \min\left\{ j : \lambda_{j+1} < \left( \frac{\lambda_1 \cdots \lambda_j}{n^2} \right)^{1/j} \right\}. \]
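
On a truncated eigenvalue sequence one can construct the optimal sequence of (17) and check numerically that it lies in $A_{j^*}$ and that $B((a^*_s), n, j^*)$ reproduces the square-root expression in (18). The sketch below is our own illustration (the eigenvalue sequence and $n$ are arbitrary):

```python
import numpy as np

def j_star(lam, n):
    """j* from (16): smallest j with lam[j] < (lam[0]*...*lam[j-1]/n^2)^(1/j), 0-based lam."""
    loglam = np.log(lam)
    for j in range(1, len(lam)):
        if loglam[j] < (loglam[:j].sum() - 2 * np.log(n)) / j:
            return j
    return len(lam)

def optimal_a(lam, n):
    """The sequence a* of (17)."""
    js = j_star(lam, n)
    tail_val = np.exp((np.log(lam[:js]).sum() - 2 * np.log(n)) / (2 * js))
    a = np.where(np.arange(len(lam)) < js, np.sqrt(lam), tail_val)
    return a, js

lam = np.exp(-0.1 * np.arange(1, 80) ** 2)      # eigenvalues with Gaussian-like decay
n = 1000
a, js = optimal_a(lam, n)

# Check that (a*) lies in A_{j*}: the sup over j of ((a_1...a_j)/n)^(1/j) is attained at j*.
geo = (np.cumsum(np.log(a)) - np.log(n)) / np.arange(1, len(a) + 1)
print(js, np.isclose(geo[js - 1], geo.max()))

# B((a*), n, j*) versus the truncated right-hand side of (18) (without the 6*C_k factor).
B_star = np.sqrt(np.sum(lam / a ** 2)) * np.exp(geo[js - 1])
rhs = np.sqrt(js * np.exp((np.log(lam[:js]).sum() - 2 * np.log(n)) / js) + lam[js:].sum())
print(B_star, rhs)
```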

This corollary, together with (14), implies Theorem 1.

PROOF OUTLINE The proof of Theorem 5 is quite long and is in Appendix A. It involves the following four steps.

1. We first prove that for all $n \in \mathbb{N}$, the minimum

\[ \hat{j} = \min\left\{ j : \lambda_{j+1} < \left( \frac{\lambda_1 \cdots \lambda_j}{n^2} \right)^{1/j} \right\} \tag{19} \]

exists, whenever $(\lambda_i)$ are the eigenvalues of a Mercer kernel.

2. We then show that

\[ \inf_{(a_s)_s : (\sqrt{\lambda_s}/a_s)_s \in \ell_2} \ \sup_{j \in \mathbb{N}} B((a_s), n, j) \ \ge \ \inf_{j \in \mathbb{N}} \ \inf_{(a_s) \in A_j} B((a_s), n, j). \]

3. The next step is to prove that the choice of $(a_s)$ and $j$ described by (16) and (17) is optimal. This is separated into two parts: for $j_0 \le j^*$ and any $(a_s) \in A_{j_0}$,
\[ B((a_s), n, j_0) \ \ge \ B((a^*_s), n, j^*), \]
and for $j_0 > j^*$ the same inequality also holds.

4. Finally we show that $(a_s) \in A_{j^*}$ and $(\sqrt{\lambda_s}/a_s)_s \in \ell_2$ when $(a_s)$ is chosen according to (17).

4 EXAMPLE

We illustrate the results of this paper with an example. Consider the kernel $k(x, y) = k(x - y)$ where $k(x) = e^{-x^2/\sigma^2}$. For such kernels (RBF kernels), $\|\Phi(x)\|_{\ell_2} = 1$ for all $x \in X$. Thus the class (1) can be written as

\[ F_{R_w} = \{ \langle w, \tilde{x} \rangle : \tilde{x} \in S, \ \|\tilde{x}\|_{\ell_2} \le 1, \ \|w\|_{\ell_2} \le R_w \}. \]

One can use the fat-shattering dimension to bound the covering number of the class of functions $F_{R_w}$ (see [2]).

Lemma 7 With $F_{R_w}$ as above,

\[ \mathrm{fat}_{F_{R_w}}(\gamma) \ \le \ \left( \frac{R_w}{\gamma} \right)^2 . \tag{21} \]

Theorem 8 If $F$ is a class of functions mapping from a set $X$ into the interval $[0, B]$, then for any $\epsilon$, if $m \ge \mathrm{fat}_F(\epsilon/4) \ge 1$,

\[ \log \mathcal{N}_m(\epsilon, F) \ \le \ 3\, \mathrm{fat}_F(\epsilon/4) \, \log^2\!\left( \frac{4eBm}{\epsilon} \right). \tag{22} \]

Combining these results we have the following bound, with which we shall compare our new bound:

\[ \log \mathcal{N}_m(\epsilon, F_{R_w}) \ \le \ 48 \left( \frac{R_w}{\epsilon} \right)^2 \log^2\!\left( \frac{4eBm}{\epsilon} \right). \tag{23} \]

In order to determine the eigenvalues of $T_k$, we need to periodize the kernel. This periodization is necessary in order to get a discrete set of eigenvalues, since $k(x)$ has infinite support (see [9] for further details). For the purpose of the present paper, we can assume a fixed period $2\pi/\omega_0$ for some $\omega_0 > 0$. Since the kernel is translation invariant, the eigenfunctions are $\phi_n(x) = \sqrt{2} \cos(n \omega_0 x)$ and so $C_k = \sqrt{2}$. The $\sqrt{2}$ comes from the requirement in Theorem 2 that $\|\phi_j\|_{L_2} = 1$. By (2), the eigenvalues have the form

\[ \lambda_j = c_1 e^{-c_2 j^2}, \tag{24} \]

with $c_1 = \sqrt{2\pi}\, K(0)$ (a constant proportional to $\sigma$) and $c_2 = \sigma^2 \omega_0^2 / 4$.

From (16), we know that $\lambda_{j+1} < \left( (\lambda_1 \cdots \lambda_j)/n^2 \right)^{1/j}$ implies $j \ge j^*$. But (24) shows that this condition on the eigenvalues is equivalent to

\[ c_1 e^{-c_2 (j+1)^2} \ < \ n^{-2/j} \, c_1 \, e^{-\frac{c_2}{j} \sum_{i=1}^{j} i^2}, \]

which (since $\sum_{i=1}^{j} i^2 = j(j+1)(2j+1)/6$) is equivalent to

\[ c_2 (j+1)^2 \ > \ \frac{2}{j} \ln n + \frac{c_2}{6} (j+1)(2j+1), \]

i.e.

\[ \frac{2 c_2}{3} (j+1)\, j \left( j + \frac{5}{4} \right) \ > \ 2 \ln n, \]

which follows from

\[ j \ > \ \left( \frac{12 \ln n}{\omega_0^2 \sigma^2} \right)^{1/3} . \tag{25} \]

Hence,

\[ j^* \ \le \ \left\lfloor \left( \frac{12 \ln n}{\omega_0^2 \sigma^2} \right)^{1/3} \right\rfloor + 1 . \tag{26} \]

We can now use (18) to give an upper bound on $\epsilon_n$. The tail $\sum_{i=j^*+1}^{\infty} \lambda_i$ in (18) is dominated by the first term, hence we obtain the following bound:

\[ \epsilon_n = O\!\left( \left( j^* \, n^{-2/j^*} \, c_1 \exp\!\left( -\tfrac{c_2}{6} (j^*+1)(2j^*+1) \right) \right)^{1/2} \right) . \]

Substituting (26) shows that

\[ \log \epsilon_n = O\!\left( \log\log n + \log \sigma - (\sigma \log n)^{2/3} \right) . \tag{27} \]

Figure 3: $j^*$ versus $\epsilon$ for a Gaussian kernel. Since $j^*$ can be interpreted as an "effective number of dimensions", this clearly illustrates why the bound on the covering numbers for Gaussian kernels grows so slowly as $\epsilon \downarrow 0$. Even when $\epsilon = 10^{-9}$, $j^*$ is only 13.
The relationship between 2 and n . Here,  2 is the variance of the Gaussian functions. When  2 increases, the kernel function will be wider, so the class FRw should be simpler. In Equation (27), we notice that if  decreases, n decreases for fixed n. Similarly, if  increases, n decreases for fixed n . Since the entropy number (and the covering number) indicates the capacity of the learning machine, the more complicated the machine is, the bigger the covering number for fixed n . Specifically we see from Equation (27) that

.1e–1 .1e–2 .1e–3 1e–05 1e–06 1e–07 1e–08 1e–09 1e–10 1e–11 1e–12 1e–13

Figure 1: n versus Corollary 6.

log 1= n = ( 23 )

n for a Gaussian kernel as given by and that

log Nm (  FRw ) = O(1=):

Figures (1) to (3) illustrate our bounds (for  2

14 12

= 1).

5 CONCLUSIONS

10 8 6 4 2 1.

.1e2

Figure 2:

.1e3

j  versus n for a Gaussian kernel.

We can get several results from Equation (27). The relationship between n and n. For fixed  , (27) shows that

log 1= n = (log 3 n) 2

which implies





3 log Nm(  FRw ) = O log 2 1



(28)

which is considerably better than (23). This can also be seen in Figure 1.

We have presented a new formula for bounding the covering numbers of support vector machines in terms of the eigenvalues of an integral operator induced by the kernel. We showed, by way of an example using a Gaussian kernel, that the new bound is easily computed and considerably better than previous results that did not take account of the kernel. We showed explicitly the effect of the choice of width of the kernel in this case. The “effective number of dimensions”, j  , can illustrate the characters of the kernel functions clearly. For a smooth kernel, the “effective number of the dimensions” j  is small. The value of j  depends on n which in turn depends on . Thus j  can be considered analogous to existing “scalesensitive” dimensions, such as the fat-shatterring dimension. A key difference is that we now have bounds for j  that explicitly depend on the kernel. We have discussed the result for the situation where the input dimension is 1. The main complication arising when d > 1 is that repeated eigenvalues become generic for isotropic translation invariant kernels. This does not break the bounds as stated (as long as one properly counts the multiplicity of eigenvalues). However, it is possible to obtain bounds that can be tighter in some cases, by using a slightly more refined argument [9].

References [1] M. Anthony. Probabilistic analysis of learning in artificial neural networks: The pac model and its variants. Neural Computing Surveys, 1:1–47, 1997. http://www.icsi.berkeley.edu/˜jagota/NCS. [2] M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999. [3] Robert Ash. Information Theory. Interscience Publishers, New York, 1965. [4] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, 5th Annual ACM Workshop on COLT, pages 144– 152, Pittsburgh, PA, 1992. ACM Press.

since (i ) is non-increasing. Since limj !1 j = 0, we get limj!1 Pj = 0. Thus for any n 2 N there is a ^j such that (29) is true. Corollary 10 Suppose k is a Mercer kernel and Tk the associated integral operator. If i = i (Tk ), then the minimum ^j from (19) always exists. Proof By Mercer’s Theorem, (i ) 2 `1 and so limi!1 i 0. Lemma 9 can thus be applied.

STEP TWO Lemma 11 Suppose Aj and B ((as ) n j ) are defined as above, p  s =as s 2 `2 , j  and (as ) 2 Aj satisfy

;



B ((as ) n j  ) = jinf B ((as ) n j ): 2N (a inf )2A

[5] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273 – 297, 1995. [6] H. K¨onig. Eigenvalue Distribution of Compact Operators. Birkh¨auser, Basel, 1986. [7] J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, and M. Anthony. Structural risk minimization over datadependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998. [8] R. Williamson, A. Smola, and B. Sch¨olkopf. Entropy numbers, operators and support vector kernels. In B. Sch¨olkopf, C.J.C. Burges, and A.J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 127–144, Cambridge, MA, 1999. MIT Press. [9] R.C. Williamson, A.J. Smola, and B. Sch¨olkopf. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. NeuroCOLT NC-TR-98-019, Royal Holloway College, 1998.

A

PROOF OF THEOREM 1

STEP ONE As indicated above, we will first prove the existence of ^j , which is defined in (19). Lemma 9 Suppose 1  2      0 is a non-increasing sequence of non-negative numbers and lim j !1 j = 0. Then for all n 2 N , there exists ^j 2 N such that

^j +1
i;c a12i ; 12 8i 2 flm + 1     km g 8c 2 f1     i ; 1g

for m 2 N . Hence, if lm is finite,

km X

1

i a2 ; 12 i i=km;1 +1  lm



lm 1 X

a2i

1



2

=

0



j X

1 ; j ; k0  0: 2 a 2 i=k0 +1 i

km 1 X

a2i



km X



(38) where we set km and lm to 1 if the max does not exist. Since (i ) is a non-increasing sequence, from (38) we know

i a2i ; 2

i=k0 +1 ai

2

(41)

(42)

;



1

 2  0:

(43)

Now, for all km , using (39) and (43) repeatedly, we get

< 12  8i 2 flm + 1     ngg lm = maxfn > km;1 : a12i  12  8i 2 fkm;1 + 1     ngg

 1

;

i=k0 +1

Suppose a12 ; 12 < 0 for some i 2 N . We will separate the i sum into several parts. Set

1

1

which together with (36) gives

(37)

 , so 1 ; 1  0: 2 2

k0 = j0  km = maxfn > lm :

! j;k

1

Hence, for any km , finite or infinite,

From (36), we get aj0 +1

aj0 +1

 Yj

i=k0 +1 i

i=k0 +1



0

1 a2  (j ; k0 )

j X 1

Hence, the left hand side of (33) can be rewritten as

1 i 1 X X 2 i 2 ; i=j +1 ai i=j +1 1 1 X 2

x1 + x2 + : : : + xm  m(x1    xm ) m1 for xi > 0: (40) Now (40) implies that for any k0 + 1  j  km , we have

1 ; 2 2 ai 

=





k X



i a12 ; 12 +    i i=k0 +1 1

+  l1



km X

i a12 ; 12 i i=km;1 +1 k 1 X 1

i=k0 +1

+ lm





1 a2i ; 2 +   

km 1 X

1 ; 2 2 ai 



i=km;1 +1 k 2 X 1 1

 l2 2 ; 2 + i=k0 +1 ai 

km 1 X 1 + lm ; 2 2 i=km;1 +1 ai 

km 1 X 1      lm 2 ; 2  0: i=k0 +1 ai  for all m 2 N . Hence

2



i=km;1 +1

km 1 X 1 ; + lm 2 2 i=lm +1 ai 

km 1 X 1 : =  lm ; 2 2 i=km;1 +1 ai 

i a12 ; 12 i i=k0 +1

1 X





i a12 ; 12  0: i i=j0 +1

(44)

Noticing (37), inequality (33) is true. Now, let us prove the main result. Lemma 13 Let Then we have (39)

Aj

and

B ((as ) n j ) be defined as above.

B ((as ) n j  ) = jinf2N (a inf B ((as ) n j0 ) 0 s )2Aj 0

(45)

where

8 pi when i  j  <

p ai = :  j j when i > j   n (

j ) 1

(46)

1

To prove E1  0. Since i  0 and ai  0, we exploit the inequality of the arithmetic and geometric means (40) again. Hence

1

j  = min j : j+1 < 1 :n:2: j : (47) Proof The main idea is to compare B 2 ((as ) n j0 ) with B 2 ((as ) n j  ) and show B 2 ((as ) n j0 )  B 2 ((as ) n j  ) for all j0 2 N and any (as ) 2 Aj0 . From the definition of B ((as ) n j ), we know  1 i !  a1    aj  j20 X 0 2 B ((as ) n j0 ) = 2 a n i i=1 and

1 X  j B 2 ((as ) n j  ) = j  1 :n: 2: j + i : i=j  +1 1



For convenience, we set

  = 1 :n: 2: j

j 1

:

1 1 X ; i A i=j +1

) i a1    aj j ; j 1 : : : j j 0 a2 n n2 i=1 8 1i 9 1 = < X i  a1    aj  j X +: i ; 2 n i=j +1 ai i=j +1  8

j X 1 < + :j0 1 :n: 2: j + i i=j +1 9 0 1 1= X ; @j   + i A i=j  +1 0

=

0

0

1 0

0

2 0

0

0

1 0

2 0

0

n2 1 : : : j j10 1 : : : j j10 0 0 = j0 ; j0 n2 n2 = 0: To prove E2  0. Applying Lemma 12 shows E2  0. To prove E3  0. In order to prove E 3  0, let us define function      1j X 1 g(j ) = j 1 n2 j + i : i=j +1

1 0

0

= E1 + E2 + E3 : We will show E1  0, E2  0 and E3  0.

(51)

 : : :  j 1 : : : j;1 j; 1 j

j =  j;1 =  n2 n2 1

1

g(j ; 1) ; g(j ) = (j ; 1) j;1 + j ; j j = (j ; j;1 ) ; j ( j ; j;1 ) : (52) ;11 j = jj , (52) can be modified to Noticing jj; g(j ; 1) ; g(j )  = j;;(j1;1) ( jj ; jj;1 ) ; j jj;;11 ( j ; j;1 ) (53) : Since j  j  , following (47), we get 1    j;1 j;1 1 8j  j  : j  (54) n2 So

j =

0

0

(50)

we have

1 0

0

1 0

0

1

B 2 ((as ) n j0 ) ; B 2 ((as ) n j  ) j0 1 i  a1    aj  j20 X i  a1    aj0  j20 + X 0 = 2 2 a n a n i=j0 +1 i i=1 i 0

0

1 0

0

; j0

(48)

Rewrite (48):

0

j X 1 1  : : :  1 j ; @j0 i A + n2 i=j +1 0

j 1 X  1 : : : j  @ ; j + i ; j0 n2 i=j  +1 (X j  

0

0 0

We will show that g (j ) is a non-increasing function of j , for j  j  . Set

Hence,

B 2 ((as ) n j0 ) ; B 2 ((as ) n j  ) 1 i  a1    aj  j20 1 X X 0  = ; j  ; i : 2 n i=1 ai i=j  +1 Part a: For the condition j0  j  .

E1 

0  2 2 ! jj 1 j a a      A j0 @ a12    a2j 1 n2 j 1 j 1 : : : j j

(49)



=



   j;1

1

n2

j

1

1

jj



1

   j;1

j + j j;



1

   j;1

j;

n

2

n

1

1

2

Making use of the formula

xn ; y n = (x ; y )

(

1

1

1)

= j;1 :

n X xn;i yi;1  i=1

(55)

we obtain

jj ; jj;1 = ( j ; j;1 ) Together with j ;1

Since (i =a2i ) > 0, using the inequality of the arithmetic and geometric mean (40) again, we get

j X

jj;i ji;;11

! j1 2   n = j :      1 j  2 2 2 j 2 a a    a n Dj  1 j i=1 i Since (as ) 2 Aj0 , we get Dj0  Di for any i 6= j0 and   1 j +1 < 1 n2j j holds based on (47). Hence

> 0 and (53), we obtain (j ; j;1 ) ; j ( j ; j;1 )  0:

Hence,

g(j ; 1)  g(j ):

Since j0  j  , we get

P1  0 + (Dj0 ; Dj )

E3 = g(j0 ) ; g(j  )  0:

(56)

Combining the above results, we get

B 2 ((as ) n j0 ) ; B 2 ((as ) n j  )  0 8j0  j  : Part b: For the condition j0 > j  .

(57)

Rewrite (48):

B 2 ((as0) n j0 ) ; B 2 ((as ) n1j  ) j0 1 i  a1    aj  j20 X X 0 A = @ a2i + 2 a n i i i=j0 +1 i=1

j0

8 j i=j +1 i=j +1 9 j 0 i for some i 2 (j   j0 ). We separate P2 into several parts. Set

k0 = j0 + 1 lm = minfn < km :

Dj0 ; 1  0 a2i

8i 2 fn     km ; 1gg km = minfn < lm;1 : Daj2i0 ; 1 > 0 8i 2 fn     lm;1 ; 1gg:

j X i

 2 ;j  a i i=1

    i Daj2i0 ; 1  i+c Daj2i0 ; 1 8i 2 fkm+1      lm ; 1g c 2 N    Dj0  i a2i ; 1 > i;c Daj2i0 ; 1 8i 2 flm     km ; 1g 8c 2 f1     i ; 1g:

(62)

Using (62), we have

kX m ;1

Dj



i a2 ; 1 i

kX lm m ;1 D X;1 Dj0

j0 ; 1 ; 1 +   lm l m 2 2 i=lm ai i=km+1 ai 0

i=km+1

j

X i   ; j  + ( D ; D ) j j 0 2 2:

i=1 ai

i=1 ai

(61) Since (i ) is a non-increasing sequence, from (61) we know

F1 can be rewritten as: 1 0 j j0 j0 X X X   i i  A @ 2+ Dj0 ; j  ; i 2 i=j  +1 i=1 ai i=j  +1 ai =

j X i

1  (Dj0 ; Dj )  j   Dj > (Dj0 ; Dj ) D1  j  j +1  0: (60) j Let us consider P2 now. If P2  0, then F1  0. So let us prove that F1  0 is also true when P2  0. Ob;11 and Dj0  Di for any i 6= j0 , the serving a2i = Dii =Dii;

last element of P2

j0 1 X X  ;j ; i ; i

=



j X i

i=1 j ; 1  j j;1 ( j ; j;1 ):

i=1 ai

= lm

kX m ;1

i=km+1

Dj

0

a2i



;1 :

(63)

Hence,

0  P2 =

j X Dj

When Dj 



0

0 2 ; 1 i a i  i=j +1

kX 0 ;1 Dj0 ; 1  i 2 2 i=j  +1 ai i=k1 ai



kX 1 ;1 0 ;1 Dj0 ; 1  +  kX Dj0 ; 1 :  i l1 2 2 i=j  +1 ai i=k1 ai kX 1 ;1

=

If l1

Dj

0

Pk ;1  Dj 0

i=k1



P k ;1  Dj

kX 1 ;1

i=j  +1

Dj



0

can get

0  P2  = j +l 

D

 +l jX

i=j  +1  +l jX i=j  +1

j0 a2i ; 1

 



j +l



1

1

Djl0 and  = Djl , the

0 j 1 j j l    D j +l l (  ; 0 ) j A @ ;1 =  0 j +l l j 0

we only need to show



j +1 j  j0 ( l0 ; l ) > 1: (68) j +l l l ( j0 ; j ) Since j  +1  j  +l , the left hand side of (68) becomes   j +1 j  j0 ( l0 ; l ) > j  j0 ( l0 ; l ) : j +l l l( j0 ; j ) l l ( j0 ; j )

Making use of the formula (55) again, we obtain

Dj0 ; 1 a2i



0 !l 1 j +l l @ a2    a2 Dj j  +1 j  +l 0  j ! l 1 Dj j +l l @Dj ; 1A Djj++ll 0

jl 1  D j j +l l @ D  ; 1A j +l 0 j 1 l ; 1A j +l l @ Dj 1

0

j  j0 ( l0 ; l ) l l ( j0 ; j )  j0 ( 0 ;  ) Pli=1 l0;i i;1 j = P  j ;i i;1 l l( 0 ;  ) ji=1   Pl l;i i;1 0 j  = j l 0Pji=1 j 0;i i;1 l  i=1 0   Pli=1 j0 +l;i i;1 j = Pj j ;i l+i;1 l i=1 0 

1 ; 1A

0

Dj0 with l 2 f1     k0 ; 1g:

=

(64)

(65)

In order to show F1

 0, we just need to show 0 j 1 j  j +1 (Dj0 ; Dj ) +   l @ Dj l ; 1A  0: j +l Dj Dj0 (66)

k=1

i=1

0



(69)

Pj Pl j+l;i i;1 Pkl =1 Pij=1  0j ;i l+i;1 k=1 i=1 0  Pj Pl j l;1 > Plk=1 Pji=1  j0 ;1 l

F1 = P1 + P2    > j j +1 (DDj0 ; Dj )

0j j 1  l + j +l l @ Dj ; 1A :

Pj Pl j +l;i i;1 Plk=1 Pji=1  j0 ;i l+i;1 :

Observe the numerator and the denominator both have j   l n elements represented as m 0  . But we know 0 >  since Dj0 > Dj  , hence from (69), we obtain

Combining (60) and (64), we have

Dj0

1 ; 1A = 0:

1   j  ( l ; l )  1   l( j ; j ): 0  j j +l 0  l j +1 0

1

=

jl

(67)

0 a2i ; 1  0, we can use (62) and (63) re;11 again, we peatedly. Finally, using (40) and a 2i = Dii =Dii; 0

i=k1

Inequality (66) holds. When Dj0 > Dj  , setting 0 = inequality (66) can be rewritten as

Dj0

a2i ; 1 i :



0

j  j +1 (Dj0 ; Dj ) +   l @ Dj j +l Dj Dj0

Noticing

ai ; 1 > 0, we get 0 2

P2 > If l1



; 1 i +

= Dj0 ,

So

Hence

 k=1 i=1 0  l;1 j  0 = j l j0 ; 1 l = > 1: j l 0   

j  j0 ( l0 ; l )  1: l l( j0 ; j ) F1 = P1 + P2 > 0

is proved for j0 = j  + k with all k 2 N .

(70)

To prove F2  0. Using Lemma 12 again, we get

B

F2  0:

(71)

Combining (70) and (71), we get

B 2 ((as ) n j0 ) ; B 2 ((as ) n j  )  0 8j0 > j  :

(72)

THE PROOF THAT INEQUALITY (31) CANNOT BE IMPROVED

Lemma 15 Suppose Aj and B ((as ) n j ) are defined as above. Let j 2 N and (as ) 2 Aj . Suppose j  and (as ) exist. Then

inf sup B ((as ) n j ) (ps =as )s 2`2 j2N = jinf B ((as ) n j ): 2N (a inf )2A

(as )s :

Combining (57) and (72), (45) is proved true.

STEP FOUR We supposed that (as ) 2 Aj  in the above proof. Now let us show it. First, for j > j  , a1 : : : aj : : : aj 1j



sup B ((as ) n j ) inf (ps =as )s 2`2 j2N  jinf B ((as ) n j ): 2N (a inf )2A

inf sup B ((as ) n j ) (ps =as )s 2`2 j2N = sup B ((as ) n j ) j 2N

;p =a 2 ` . s s s 2 v1 p  u u tX i

= B ((as ) n j  )  (a inf B ((as ) n j  ) s )2Aj  jinf B ((as ) n j ): 2N (a inf )2A s

j

We have already proved

inf sup B ((as ) n j ) (ps =as )s 2`2 j2N  jinf B ((as ) n j ): 2N (a inf )2A

(as )s :

s

i=1 ai

2

1 1 X = (73) 2 i=j +1 i : When k (x y ) and n are given, (i ) and j  are determined. 1 2 So  = n; j (1    j  ) jPis a constant. By Mercer’s The1 orem, (i ) 2 `1 and thus i=j  +1 i is finite. So (73) is ;  p finite. Hence s =as s 2 `2 is proved. CONCLUSION Following the proof above, we get Corollary 14 Suppose A j and B ((as ) n j ) are defined as above. Then we have B ((as (j  )) n j  ) = jinf B ((as ) n j ) (74) 2N (asinf )2Aj where p

8 i when i  j  < p

  ai = :  j j when i > j   n (

j ) 1

(75)

1

Theorem 1 is then established.

(78)

(as )s :

Thus (as ) 2 Aj  . We can also show

j  = min j : j+1 < 1 :n:2: j

j

Choose an (as ) to realise the infimum on the left hand side; then (as )s 2 Aj  , where j  is the j that realises the inner supremum. Then

!1 1 1 : : : j;1 j;1 a1 : : : aj;1 j;1 : = n2 n

v u u tj  +

(77)

(as )s :

s

p

s =as s =

j

Proof Let us prove



n a1 : : : a 1j a1 : : : a j1 (j;j ) 1j j j = n n a : : : a j1 1 j = : n Second, for j  j  . From (54), we get a : : : a 1j  p1 : : : j ! 1j 1 j = n n2 

s

1

:

(76)

j

So, equation (77) is proved to be true.

Acknowledgements This work was supported by the Australian Research Council. Thanks to Bernhard Schölkopf and Alex J. Smola for useful discussions.