%P%$%"%9$N?dDj$H%b%G%kA*Br ?y;3 >-

>[email protected] 1Q8w

El5~9)6HBg3X Bg3X1! >pJsM}9)3X8&5f2J 7W;;9)[email protected] ")152-8552 L\9u6hBg2,;32-12-1, Tel 03-5734-2187

[email protected], http://ogawa-www.cs.titech.ac.jp/˜sugi.

%b%G%kA*Br$NLdBj$O!$65;UIU$-3X=,$K$*[email protected]$9$k$?$a$K=EMW$G$"$k!%%b%G%kA*Br$NpJsNL5,=‘ (AIC) $,$3$l$^$G$K9-$/MQ$$$i$l$F$-$?!%$7$+$7$J$,$i!$AIC $G$O71N} %G!l9g$K$OE,@Z$K%b%G%kA*Br$r9T$&$3$H$, $G$-$J$$!%$=$3$GK\[email protected]$G$O!$subspace information criterion (SIC) $H$h$P$l$k?7$7$$%b%G%kA*Br$N5,=‘$r Ds0F$7!$71N}%G!/$J$$$H$-$K$bE,@Z$K%b%G%kA*Br$,9T$($k$3$H$r7W;;5!pJsNL5,=‘ (AIC)!$%P%$%"%9/%P%j%"%s%9!%

Bias Estimation and Model Selection Masashi Sugiyama

Hidemitsu Ogawa

Department of Computer Science, Tokyo Institute of Technology 2-12-1, O-okayama, Meguro-ku, Tokyo, 152-8552, Japan. [email protected], http://ogawa-www.cs.titech.ac.jp/˜sugi.

The problem of model selection is considerably important for acquiring higher levels of generalization capability in supervised learning. So far, Akaike information criterion (AIC) is widely used for model selection. However, AIC does not work well when the number of training examples is small because of the asymptotic approximation. In this paper, we propose a new criterion for model selection called the subspace information criterion (SIC). Computer simulations show that SIC works well even when the number of training examples is small.

Supervised learning, Generalization capability, Model selection, Information criterion, Akaike’s information criterion (AIC), Bias/Variance.

1

Introduction

Supervised learning is obtaining an underlying rule from training examples, and can be regarded as a function approximation problem. So far, many supervised learning methods have been developed, including the backpropagation algorithm [5][22], projection learning [17], Bayesian inference [12], regularization learning [20], and support vector machines [30]. In these learning methods, the quality of the learning results depends heavily on the choice of models. Here, models refer to, for example, Hilbert spaces to which the learning target function belongs in the projection learning and support vector regression cases, multi-layer perceptrons (MLPs) with different numbers of hidden units in the back-propagation case, pairs of the regularization term and regularization parameter in the regularization learning case, and families of probabilistic distributions in the Bayesian inference case. If the model is too complicated, then learning results tend to over-fit training examples. In contrast, if the model is too simple, then it is not capable of fitting training examples causing learning results become under-fitted. In general, both over- and under-fitted learning results have lower levels of generalization capability. Therefore, the problem of finding an appropriate model, referred to as model selection, is considerably important for acquiring higher levels of generalization capability. The problem of model selection has been studied mainly in the field of statistics. Mallows [13] proposed the CP -statistic for the selection of subset-regression models (see also [14]). The CP -statistic gives an unbiased estimate of the averaged predictive empirical error over the noise. Mallows [14] extended the range of application of the CP -statistic to the selection of arbitrary linear regression models. This statistic is called CL or the unbiased risk estimate [31]. The CP -statistic and the unbiased risk estimate require a good estimate of the noise variance. In contrast, the generalized cross-validation [8][31], which is an extension of the traditional crossvalidation [25][31], is the criterion for finding the model minimizing the averaged predictive empirical error without the knowledge of noise variance. Li [11] showed the asymptotic optimality of the unbiased risk estimate and the generalized cross-validation, i.e., they asymptotically select the model minimizing the averaged predictive empirical error (see also [31]). However, these methods do not explicitly take the generalization error into account. In contrast, model selection methods explicitly evaluating the difference between the empirical error and the generalization error have been studied from various standpoints: information statistics [1], Bayesian statistics [24][2][12], stochastic complexity [21], and structural risk minimization [30]. Particularly, informationstatistics-based methods have been extensively studied. Akaike’s information criterion (AIC) [1] is one of the most eminent methods of this type. Many successful applications of AIC to real world problems have been reported (e.g. [7][3]). AIC assumes that models are

faithful 1 . Takeuchi [27] extended AIC to be applicable to unfaithful models. This criterion is called Takeuchi’s modification of AIC (TIC). TIC adopted the log loss as the loss function. Murata et al. [15] generalized the loss function of TIC, and proposed the network information criterion. The learning method with which TIC can deal is restricted to the maximum likelihood estimation. Konishi and Kitagawa [10] relaxed the restriction and derived the generalized information criterion for a class of learning methods represented by statistical functionals. In this paper, we propose a new criterion for model selection from the functional analytic viewpoint. We call this criterion the subspace information criterion (SIC). SIC is mainly different from AIC-type criteria in three respects. The first is the generalization measure. In AIC-type criteria, the averaged generalization error over all training sets is adopted as the generalization measure, and the averaged terms are replaced with particular values calculated by one given training set. In contrast, the generalization measure adopted in SIC is not averaged over training sets. The fact that the replacement of the averaged terms is unnecessary for SIC is expected to result in better selection than AIC-type methods. The second is the approximation method. AIC-type criteria use asymptotic approximation for estimating the generalization error, while SIC estimates it by utilizing the noise characteristics. Our computer simulations show that SIC works well even when the number of training examples is small. The third is the restriction of models. Takeuchi [28] pointed out that AIC-type criteria are effective only in the selection of nested models (see also [15]). In SIC, no restriction is imposed on models.

2

Mathematical foundation model selection

of

Let us consider the problem of obtaining an approximation to a target function f(x) of L variables from a set of M training examples. The training examples are made up of input signals xm in D, where D is a subset of the L-dimensional Euclidean space RL , and corresponding output signals ym in the unitary space C: {(xm , ym ) | ym = f(xm ) + nm }M m=1 ,

(1)

where ym is degraded by additive noise nm . Let θ be a set of factors determining learning results, e.g. the type and number of basis functions, and parameters in learning algorithms. We call θ a model. Let fˆθ be a learning result obtained with a model θ. Assuming that f and fˆθ belong to a Hilbert space H, the problem of model selection is described as follows. Definition 1 (Model selection) From given models, find the one minimizing the generalization error defined as En kfˆθ − fk2 , (2) 1 A model is said to be faithful if the learning target can be expressed by the model [15].

where En denotes the ensemble average over the noise, and k · k denotes the norm in H.

3

Subspace information criterion

In this section, we derive a model selection criterion named the subspace information criterion (SIC), which gives an estimate of the generalization error of fˆθ . Let y, z, and n be M -dimensional vectors whose mth elements are ym , f(xm ), and nm , respectively: y = z + n.

(3)

y and z are called a sample value vector and an ideal sample value vector, respectively. Let Xθ be a mapping from y to fˆθ : fˆθ = Xθ y. (4) Xθ is called a learning operator. In the derivation of SIC, we assume the following conditions. 1. The learning operator Xθ is linear. 2. The mean noise is zero. 3. An unbiased learning result fˆu has been obtained with a linear operator Xu : En fˆu = f, fˆu = Xu y.

(5) (6)

Assumption 1 implies that the range of Xθ becomes a subspace of H. This assumption is practically not so strict since linear learning operators include various learning methods such as least mean squares learning [18], regularization learning [16], projection learning [17], and parametric projection learning [19]. It follows from Eqs.(6), (3), and Assumption 2 that En fˆu = En Xu y = En Xu z + En Xu n = Xu z.

(7)

Hence, Assumption 3 yields Xu z = f.

(8)

A detailed discussion about Assumption 3 is given in Section 5. The unbiased learning result fˆu and the learning operator Xu are used for estimating the generalization error of fˆθ . It is well-known that the generalization error of fˆθ is decomposed into the bias and variance [26]: En kfˆθ − fk2 = kEn fˆθ − fk2 + En kfˆθ − En fˆθ k2 .

(9)

It follows from Eqs.(4) and (3) that Eq.(9) yields En kfˆθ − fk2

= =

kXθ z − fk2 + En kXθ nk2 kXθ z − fk2 + tr (Xθ QXθ∗ ) , (10)

where tr (·) denotes the trace of an operator, Q is the noise covariance matrix, and Xθ∗ denotes the adjoint operator of Xθ . Let X0 be an operator defined as X0 = Xθ − Xu .

(11)

Then, the bias of fˆθ can be expressed by using fˆu as kXθ z − fk2 = kfˆθ − fˆu k2 − kfˆθ − fˆu k2 + kXθ z − fk2 = kfˆθ − fˆu k2 − kXθ z + Xθ n − (f + Xu n)k2

+ kXθ z − fk2 = kfˆθ − fˆu k2 − 2RehXθ z − f, X0 ni − kX0 nk2 , (12)

where ‘Re’ stands for the real part of a complex number and h·, ·i denotes the inner product in H. The second and third terms of the right-hand side of Eq.(12) can not be directly calculated since f and n are unknown. Here, we shall replace them with the averages of them over the noise. Then, the second term vanishes since the mean noise is zero, and the third term yields En kX0 nk2 = tr (X0 QX0∗ ) .

(13)

Note that this approximation gives an unbiased estimate of the bias since it follows from Eq.(12) that En kfˆθ − fˆu k2 − tr (X0 QX0∗ ) = kXθ z − fk2 . (14) To guarantee that the bias is non-negative, we adopt the following term as an approximation of the bias: h i kXθ z − fk2 ≈ s kfˆθ − fˆu k2 − tr (X0 QX0∗ ) , (15) where s[·] is defined as s[t] =

t if t ≥ 0, 0 if t < 0.

Substituting Eq.(15) into Eq.(10), we have h i En kfˆθ − fk2 ≈ s kfˆθ − fˆu k2 − tr (X0 QX0∗ ) + tr (Xθ QXθ∗ ) .

(16)

(17)

This approximation immediately gives the following model selection criterion. Definition 2 (Subspace information criterion) Among the given models, select the one minimizing the following SIC: h i SIC = s kfˆθ − fˆu k2 − tr (X0 QX0∗ ) + tr (Xθ QXθ∗ ) . (18) The model minimizing SIC is called the minimum SIC model (MSIC model) and the learning result obtained by the MSIC model is called the MSIC learning result. The generalization capability of the MSIC learning result measured by Eq.(2) is expected to be the best, the expectation is supported by computer simulations performed in Section 6. A procedure for obtaining the MSIC learning result is called the MSIC procedure.

4

Discussion

In this section, the difference of SIC from other model selection criteria is discussed.

4.1

Comparison with CL

2. AIC-type criteria use asymptotic approximation for estimating the generalization error. In contrast, we consider the case when the number of training examples is finite, and the generalization error is estimated by utilizing the noise characteristics. Therefore, SIC is expected to work well even when the number of training examples is small. Indeed, computer simulations performed in Section 6 support this claim. 3. Takeuchi [28] pointed out that AIC-type criteria are effective only in the selection of nested models (see also [15]):

1. The idea of the approximation used in SIC is partly similar to CL proposed by Mallows [14]. The problem considered in his paper is to estimate the ideal sample value vector z from training examples {(xm , ym )}M m=1 . The averaged predictive empirical error over the noise is adopted as the error measure: En

M 2 X ˆ fθ (xm ) − f(xm ) = En kBθ y − zk2 , (19)

m=1

where Bθ is a mapping from y to an M -dimensional vector whose m-th element is fˆθ (xm ). Analogous to Eqs.(9) and (10), the averaged predictive empirical error can be decomposed into the bias and variance:

S1 ⊂ S2 ⊂ · · · .

4.

En kBθ y − zk2 = kBθ z − zk2 + tr (Bθ QBθ∗ ) . (20) Then, CL is given as CL

=

kBθ y − yk2 − tr ((Bθ − I)Q(Bθ − I)∗ ) + tr (Bθ QBθ∗ ) . (21) 5.

2. Although the problem considered in Mallows’ paper was to estimate z, acquiring a higher level of generalization capability is implicitly expected. However, minimizing the averaged predictive empirical error does not generally mean to minimize the generalization error. In contrast, we consider the problem of minimizing the generalization error and SIC directly gives an estimate of the generalization error. 3. Mallows employed the sample value vector y as an unbiased estimator of the target z. In contrast, we assumed that the unbiased learning result fˆu of the target function f has been obtained. fˆu plays a similar role to y in Mallows’ case. 4. We utilized the fact that the bias is non-negative and introduced the s-function defined by Eq.(16) for keeping the estimate of the bias non-negative. In contrast, the estimate of the bias in CL can be negative (see the first and second terms of the right-hand side of Eq.(21)). Our simulations show that the use of the s-function well improves the performance of model selection.

4.2

6.

7.

8.

Comparison with AIC-type methods

1. In AIC-type criteria, the relation between the generalization error and the empirical error is first evaluated in the sense of the average over all training sets. Then, the averaged terms are replaced with particular values calculated by one given training set. In contrast, the generalization measure adopted in SIC (see Eq.(2)) is not averaged over training sets. The fact that the replacement of the averaged terms is unnecessary for SIC is expected to result in better selection than AIC-type methods.

5

(22)

In SIC, no restriction is imposed on models except that the range of Xθ is included in H. AIC-type methods compare models under the learning method minimizing the empirical error. In contrast, SIC can consistently compare models with different learning methods, e.g. least mean squares learning [18], regularization learning [16], projection learning [17], and parametric projection learning [19]. Namely, the type of learning methods is also included in the model. AIC-type criteria do not explicitly require a priori information about the class to which the target function belongs. In contrast, SIC requires a priori information about the Hilbert space H to which the target function f and a learning result fˆθ belong. If H is unknown, then a Hilbert space including all models is practically adopted as H (see the experiments in Section 6). SIC requires the noise covariance matrix Q while AIC-type criteria do not. However, as shown in Section 5, we can cope with the case where Q is not available. In the derivation of AIC-type methods, terms which are not dominant in model selection are neglected (see e.g. [15]). This implies that the value of the AIC-type criteria is not an estimate of the generalization error. In contrast, SIC gives an estimate of the generalization error itself. This difference can be clearly seen at the right-top graph in Fig.2 (see Section 6). AIC-type methods assume that training examples are i.i.d. In contrast, SIC can deal with the correlated noise if the noise covariance matrix Q is available.

SIC for least mean squares learning

In this section, SIC is applied to the selection of least mean squares (LMS) learning models. LMS learning is to obtain a learning result fˆθ (x) minimizing the empirical error M 2 X ˆ fθ (xm ) − ym

m=1

(23)

from training examples {(xm , ym )}M m=1 . In the LMS learning case, a model θ refers to a reproducing kernel Hilbert space S (see [6][31]) which is a subspace of H. Let KS (x, x′) be the reproducing kernel of S, and D be the domain of functions in S. Then, KS (x, x′ ) satisfies the following conditions. • For any fixed x′ in D, KS (x, x′ ) is a function of x in S. • For any function f in S and for any x′ in D, it holds that hf(·), KS (·, x′ )i = f(x′ ).

(24)

Note that the reproducing kernel is unique if it exists. Let µ be the dimension of S. Then, for any orthonormal basis {ϕi }µi=1 in S, KS (x, x′ ) is expressed as

SIC requires an unbiased learning result fˆu and an operator Xu providing fˆu . Here, we show a method of obtaining fˆu and Xu by LMS learning. Let us assume that H is also a reproducing kernel Hilbert space and regard H as a model. Let AH be a sampling operator defined with the reproducing kernel of H: AH =

M X

m=1

em ⊗ KH (x, xm ) .

(30)

Then, since f belongs to H, the ideal sample value vector z is expressed as z = AH f. (31) Let XH be the LMS learning operator with the model H, and fˆH be a learning result obtained by XH : fˆH = XH y = XH z + XH n = XH AH f + XH n.

(32)

µ

KS (x, x′ ) =

X

ϕi (x)ϕi (x′ ).

(25)

i=1

Let AS be an operator from H to the M -dimensional unitary space CM defined as AS =

M X

m=1

En fˆH = XH AH f + En XH n = A†H AH f = f, em ⊗ KS (x, xm) ,

(26)

where (· ⊗ ·) denotes the Neumann-Schatten product 2 , and em is the m-th vector of the so-called standard basis in CM . AS is called a sampling operator since it holds that AS f = z (27) for any f in S. Then, LMS learning is rigorously defined as follows. Definition 3 (Least mean squares learning) [18] An operator X is called the LMS learning operator for the model θ if X minimizes, for each y in CM , the functional J[X] =

Now let us assume that the range of A∗H agrees with H 4 . This holds only if the number M of training examples is larger than or equal to the dimension of H. Then, it follows from Eqs.(32) and (29) that

M 2 X ˆ fθ (xm ) − ym = kAS Xy − yk2 .

(28)

m=1

Let A† be the Moore-Penrose generalized inverse 3 of A. Then, the following proposition holds. Proposition 1 [18] The LMS learning operator for the model θ is given by Xθ = A†S .

(29)

which implies that fˆH is unbiased. Hence, we can put fˆu = fˆH and Xu = XH . For evaluating SIC, the noise covariance matrix Q is required (see Eq.(18)). One of the measures is to use ˆ=σ Q ˆ 2I

(f ⊗ g) h = hh, gif. 3 An operator X is called the Moore-Penrose generalized inverse of an operator A if X satisfies the following four conditions [4].

AXA = A, XAX = X, (AX)∗ = AX, and (XA)∗ = XA. The Moore-Penrose generalized inverse is unique and denoted as A† .

(34)

as a hypothesis noise covariance matrix, and estimate σ ˆ2 M from training examples {(xm , ym )}m=1 as (see Theorem 1.5.1 in [9]) σ ˆ2 =

M 2 X ˆ (M − dim(H )) , f (x ) − y u m m

(35)

m=1

where M > dim(H ) is implicitly assumed. Based on the above discussions, we shall show a calculation method of LMS learning results and the value of SIC. Let TS and TH be M -dimensional matrices whose (i, j)-elements are KS (xi , xj ) and KH (xi , xj ), respectively. Then, the following theorem holds. Theorem 1 (Calculation of LMS learning results) LMS learning results fˆθ (x) and fˆu (x) can be calculated as fˆθ (x)

=

M X

hTS† y, em iKS (x, xm ),

(36)

M X

† hTH y, em iKH (x, xm).

(37)

m=1 2 For any fixed g in a Hilbert space H and any fixed f in a 1 Hilbert space H2 , the Neumann-Schatten product (f ⊗ g) is an operator from H1 to H2 , defined by using any h ∈ H1 as (see [23])

(33)

fˆu (x)

=

m=1

Let T be an M -dimensional matrix defined as † † † T = TS† − TH TS TS† − TS† TS TH + TH .

(38)

Then, the following theorem holds. 4 This condition is sufficient for fˆ to be unbiased, not necesH sary.

′ ′ input {(xm , ym )}M m=1 , KH (x, x ), and {KSk (x, x )}k ; −3 ǫ ←− 10 ; TH ←− [KH (xi , xj )]ij ; if (M ≤ dim(H )) or (rank(TH ) < dim(H )) print(’H is too complicated.’); exit; end † 2 TH ←− TH (TH + ǫI)−1 ; † σ ˆ 2 ←− (kyk2 − hTH TH y, yi) (M − dim(H )); for all k TSk ←− [KSk (xi , xj )]ij ; TS†k ←− TSk (TS2k + ǫI)−1 ; † † † +TH ; TSk TS†k − TS†k TSk TH T ←− TS†k − TH † 2 2 SICk ←− s hT y, yi − σ ˆ tr (T ) + σ ˆ tr TSk ; end k0 ←− argmink {SICk }; PM fˆ(x) ←− m=1 hTS†k y, em iKSk0 (x, xm); 0

Figure 1: MSIC procedure in a pseudo code — Find the best model from {Sk }k . Theorem 2 (Calculation of SIC for LMS learning) When the noise covariance matrix Q is estimated by Eqs.(34) and (35), SIC for LMS learning is given as (39) SIC = s hT y, yi − σ ˆ 2 tr (T ) + σ ˆ 2 tr TS† ,

where s[·] is defined by Eq.(16), and σ ˆ 2 is given as σ ˆ2 =

† kyk2 − hTH TH y, yi . M − dim(H )

(40)

In practice, the calculation of the Moore-Penrose generalized inverse is sometimes unstable. To overcome the unstableness, we recommend to use Tikhonov’s regularization [29]: TS† ←− TS (TS2 + ǫI)−1 ,

(41)

where ǫ is a small constant, say ǫ = 10−3 . Then, a complete algorithm of the MSIC procedure, i.e., finding the best model from {Sk }k , is described in a pseudo code in Fig.1.

6

Computer simulations

In this section, computer simulations are performed to demonstrate the effectiveness of SIC compared with the network information criterion (NIC) [15], which is a generalized AIC. Let the target function f(x) be √ √ f(x) = 2 sin x + 2 2 cos x √ √ − 2 sin 2x − 2 2 cos 2x √ √ + 2 sin 3x − 2 cos 3x √ √ + 2 2 sin 4x − 2 cos 4x √ √ + 2 sin 5x − 2 cos 5x, (42)

and training examples {(xm , ym )}M m=1 be xm

=

ym

=

π 2πm + , M M f(xm ) + nm ,

−π −

(43) (44)

where the noise nm is subject to the normal distribution with mean 0 and variance 3: nm ∼ N (0, 3).

(45)

Let us consider the following models: {SN }20 N=1

(46)

where SN is a trigonometric polynomial space of order N , i.e., a Hilbert space spanned by {1, sin nx, cos nx}N n=1 , and the inner product is defined as Z π 1 hf, giSN = f(x)g(x)dx. 2π −π

(47)

(48)

Then, the reproducing kernel of SN is expressed as . x − x′ (2N + 1)(x − x′ ) sin sin 2 2 KSN (x, x′) = if x = 6 x′ , 2N + 1 if x = x′ . (49) Our task is to find the best model SN minimizing Z π 2 1 ˆ Error = (50) fSN (x) − f(x) dx, 2π −π where fˆSN (x) is the learning result obtained with the model SN . Let us consider the following two model selection methods.

(A) SIC: LMS learning is adopted. The largest model S20 including all models is employed as H. (B) NIC: The squared loss function is adopted. In this case, learning results obtained by the stochastic gradient descent method [5] converge to the LMS estimator. The distribution of sample points given by Eq.(43) is regarded as a uniform distribution. In both (A) and (B), no a priori information is used and the LMS estimator is commonly adopted. Hence, the efficiency of SIC and NIC can be fairly compared by this simulation. Fig.2 shows the simulation results when the numbers of training examples are 50 (left graphs) and 200 (right graphs). The top graphs show the values of the error measured by Eq.(50), SIC, and NIC in each model, denoted by the solid, dashed, and dotted lines, respectively. The horizontal axis denotes the highest order N of trigonometric polynomials in the model SN . The middle graphs show the target function f(x), training examples, and the MSIC learning result, denoted by the solid line, ’◦’, and the dashed line, respectively. The bottom graphs show the target function f(x), training

M = 50

M = 200

3

3

Error SIC NIC

Error, SIC, NIC

4

Error, SIC, NIC

4

2

2

1

1

Error SIC NIC

0

5

10 N

15

15

20

f MSIC result

10

0

0

0

−5

−5

−10

−10 2

−2

MSIC model = S5 (Error = 0.98)

15

f MNIC result

10

0

−5

−5

−10

−10 MNIC model = S20 (Error = 3.36)

2

f MNIC result

10

0

2

0

15

5

0

20

MSIC model = S5 (Error = 0.11)

5

−2

15

f MSIC result

10 5

0

10 N

15

5

−2

5

−2

0

2

MNIC model = S6 (Error = 0.17)

Figure 2: Simulation results.

examples, and the minimum NIC (MNIC) learning result, denoted by the solid line, ’◦’, and the dotted line, respectively. These results show that when M = 200, both SIC and NIC give reasonable learning results. However, when it comes to the case where M = 50, SIC outperforms NIC. This implies that SIC works well even when the number of training examples is small. As mentioned in Section 4.2.7, NIC neglects terms which are not dominant in model selection while SIC gives an estimate of the generalization error. The righttop graph in Fig.2 clearly shows this difference. SIC well approximates the generalization error while the value of NIC is larger than the true generalization error. Generally speaking, neglecting non-dominant terms does not affect the performance of model selection. However, this graph shows that, for 5 ≤ N ≤ 20, the slope of NIC is gentler than the true generalization error while that of SIC is in good agreement with it. This implies that SIC is expected to be resistant to the noise since the discrimination of models by SIC is clearer than NIC.

7

Conclusion

In this paper, we proposed a new model selection criterion called the subspace information criterion (SIC). Computer simulations showed that SIC works well even when the number of training examples is small.

References [1] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19(6), 716–723. [2] Akaike, H. (1980). Likelihood and the Bayes procedure. In N. J. Bernardo, M. H. DeGroo, D. V. Lindley, & A. F. M. Smith (Eds.), Bayesian Statistics (pp. 141–166). Valencia: University Press. [3] Akaike, H., & Kitagawa, G. (Eds.) (1994, 1995). The practice of time series analysis I, II. Tokyo: Asakura Syoten. (In Japanese) [4] Albert, A. (1972). Regression and the Moore-Penrose pseudoinverse. New York and London: Academic Press. [5] Amari, S. (1967). Theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, EC16(3), 299–307. [6] Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 337–404. [7] Bozdogan, H. (Ed.) (1994). Proceedings of the first US/Japan conference on the frontiers of statistical modeling: An informational approach. Netherlands: Kluwer Academic Publishers. [8] Craven, P., & Wahba, G. (1979). Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized crossvalidation. Numerische Mathematik, 31, 377–403. [9] Fedorov, V. V. (1972). Theory of optimal experiments. New York: Academic Press. [10] Konishi, S., & Kitagawa, G. (1996). Generalized information criterion in model selection. Biometrika, 83, 875–890.

[11] Li, K. (1986). Asymptotic optimality of CL and generalized cross-validation in ridge regression with application to spline smoothing. The Annals of Statistics, 14(3), 1101–1112. [12] MacKay, D. (1992). Bayesian interpolation. Neural Computation, 4(3), 415–447. [13] Mallows, C. L. (1964). Choosing a subset regression. Presented at the Central Regional Meeting of the Institute of Mathematical Statistics, Manhattan, Kansas. [14] Mallows, C. L. (1973). Some comments on CP . Technometrics, 15(4), 661–675. [15] Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion—Determining the number of hidden units for an artificial neural network model. IEEE Transactions on Neural Networks, 5(6), 865–872. [16] Nakashima, A., & Ogawa, H. (1999). How to design a regularization term for improving generalization. In Proceedings of ICONIP’99, the 6th International Conference on Neural Information Processing, 1 (pp. 222– 227), Perth, Australia. [17] Ogawa, H. (1987). Projection filter regularization of illconditioned problem. In Proceedings of SPIE, Inverse Problems in Optics, 808 (pp. 189–196). [18] Ogawa, H. (1992). Neural network learning, generalization and over-learning. In Proceedings of the ICIIPS’92, International Conference on Intelligent Information Processing & System, 2 (pp. 1–6), Beijing, China. [19] Oja, E., & Ogawa, H. (1986). Parametric projection filter for image and signal restoration. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP34(6), 1643–1653. [20] Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9), 1481–1497. [21] Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471. [22] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536. [23] Schatten, R. (1970). Norm ideals of completely continuous operators. Berlin: Springer-Verlag. [24] Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464. [25] Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36, 111–147. [26] Takemura, A. (1991). Modern mathematical statistics. Tokyo: Sobunsya. (In Japanese) [27] Takeuchi, K. (1976). Distribution of information statistics and validity criteria of models. Mathematical Science, 153, 12–18. (In Japanese) [28] Takeuchi, K. (1983). On the selection of statistical models by AIC. Journal of the Society of Instrument and Control Engineering, 22(5), 445–453. (In Japanese) [29] Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of ill-posed problems. Washington DC: V. H. Winston. [30] Vapnik, V. N. (1995). The nature of statistical learning theory. Berlin: Springer-Verlag. [31] Wahba, H. (1990). Spline model for observational data. Philadelphia and Pennsylvania: Society for Industrial and Applied Mathematics.

>[email protected] 1Q8w

El5~9)6HBg3X Bg3X1! >pJsM}9)3X8&5f2J 7W;;9)[email protected] ")152-8552 L\9u6hBg2,;32-12-1, Tel 03-5734-2187

[email protected], http://ogawa-www.cs.titech.ac.jp/˜sugi.

%b%G%kA*Br$NLdBj$O!$65;UIU$-3X=,$K$*[email protected]$9$k$?$a$K=EMW$G$"$k!%%b%G%kA*Br$NpJsNL5,=‘ (AIC) $,$3$l$^$G$K9-$/MQ$$$i$l$F$-$?!%$7$+$7$J$,$i!$AIC $G$O71N} %G!l9g$K$OE,@Z$K%b%G%kA*Br$r9T$&$3$H$, $G$-$J$$!%$=$3$GK\[email protected]$G$O!$subspace information criterion (SIC) $H$h$P$l$k?7$7$$%b%G%kA*Br$N5,=‘$r Ds0F$7!$71N}%G!/$J$$$H$-$K$bE,@Z$K%b%G%kA*Br$,9T$($k$3$H$r7W;;5!pJsNL5,=‘ (AIC)!$%P%$%"%9/%P%j%"%s%9!%

Bias Estimation and Model Selection Masashi Sugiyama

Hidemitsu Ogawa

Department of Computer Science, Tokyo Institute of Technology 2-12-1, O-okayama, Meguro-ku, Tokyo, 152-8552, Japan. [email protected], http://ogawa-www.cs.titech.ac.jp/˜sugi.

The problem of model selection is considerably important for acquiring higher levels of generalization capability in supervised learning. So far, Akaike information criterion (AIC) is widely used for model selection. However, AIC does not work well when the number of training examples is small because of the asymptotic approximation. In this paper, we propose a new criterion for model selection called the subspace information criterion (SIC). Computer simulations show that SIC works well even when the number of training examples is small.

Supervised learning, Generalization capability, Model selection, Information criterion, Akaike’s information criterion (AIC), Bias/Variance.

1

Introduction

Supervised learning is obtaining an underlying rule from training examples, and can be regarded as a function approximation problem. So far, many supervised learning methods have been developed, including the backpropagation algorithm [5][22], projection learning [17], Bayesian inference [12], regularization learning [20], and support vector machines [30]. In these learning methods, the quality of the learning results depends heavily on the choice of models. Here, models refer to, for example, Hilbert spaces to which the learning target function belongs in the projection learning and support vector regression cases, multi-layer perceptrons (MLPs) with different numbers of hidden units in the back-propagation case, pairs of the regularization term and regularization parameter in the regularization learning case, and families of probabilistic distributions in the Bayesian inference case. If the model is too complicated, then learning results tend to over-fit training examples. In contrast, if the model is too simple, then it is not capable of fitting training examples causing learning results become under-fitted. In general, both over- and under-fitted learning results have lower levels of generalization capability. Therefore, the problem of finding an appropriate model, referred to as model selection, is considerably important for acquiring higher levels of generalization capability. The problem of model selection has been studied mainly in the field of statistics. Mallows [13] proposed the CP -statistic for the selection of subset-regression models (see also [14]). The CP -statistic gives an unbiased estimate of the averaged predictive empirical error over the noise. Mallows [14] extended the range of application of the CP -statistic to the selection of arbitrary linear regression models. This statistic is called CL or the unbiased risk estimate [31]. The CP -statistic and the unbiased risk estimate require a good estimate of the noise variance. In contrast, the generalized cross-validation [8][31], which is an extension of the traditional crossvalidation [25][31], is the criterion for finding the model minimizing the averaged predictive empirical error without the knowledge of noise variance. Li [11] showed the asymptotic optimality of the unbiased risk estimate and the generalized cross-validation, i.e., they asymptotically select the model minimizing the averaged predictive empirical error (see also [31]). However, these methods do not explicitly take the generalization error into account. In contrast, model selection methods explicitly evaluating the difference between the empirical error and the generalization error have been studied from various standpoints: information statistics [1], Bayesian statistics [24][2][12], stochastic complexity [21], and structural risk minimization [30]. Particularly, informationstatistics-based methods have been extensively studied. Akaike’s information criterion (AIC) [1] is one of the most eminent methods of this type. Many successful applications of AIC to real world problems have been reported (e.g. [7][3]). AIC assumes that models are

faithful 1 . Takeuchi [27] extended AIC to be applicable to unfaithful models. This criterion is called Takeuchi’s modification of AIC (TIC). TIC adopted the log loss as the loss function. Murata et al. [15] generalized the loss function of TIC, and proposed the network information criterion. The learning method with which TIC can deal is restricted to the maximum likelihood estimation. Konishi and Kitagawa [10] relaxed the restriction and derived the generalized information criterion for a class of learning methods represented by statistical functionals. In this paper, we propose a new criterion for model selection from the functional analytic viewpoint. We call this criterion the subspace information criterion (SIC). SIC is mainly different from AIC-type criteria in three respects. The first is the generalization measure. In AIC-type criteria, the averaged generalization error over all training sets is adopted as the generalization measure, and the averaged terms are replaced with particular values calculated by one given training set. In contrast, the generalization measure adopted in SIC is not averaged over training sets. The fact that the replacement of the averaged terms is unnecessary for SIC is expected to result in better selection than AIC-type methods. The second is the approximation method. AIC-type criteria use asymptotic approximation for estimating the generalization error, while SIC estimates it by utilizing the noise characteristics. Our computer simulations show that SIC works well even when the number of training examples is small. The third is the restriction of models. Takeuchi [28] pointed out that AIC-type criteria are effective only in the selection of nested models (see also [15]). In SIC, no restriction is imposed on models.

2

Mathematical foundation model selection

of

Let us consider the problem of obtaining an approximation to a target function f(x) of L variables from a set of M training examples. The training examples are made up of input signals xm in D, where D is a subset of the L-dimensional Euclidean space RL , and corresponding output signals ym in the unitary space C: {(xm , ym ) | ym = f(xm ) + nm }M m=1 ,

(1)

where ym is degraded by additive noise nm . Let θ be a set of factors determining learning results, e.g. the type and number of basis functions, and parameters in learning algorithms. We call θ a model. Let fˆθ be a learning result obtained with a model θ. Assuming that f and fˆθ belong to a Hilbert space H, the problem of model selection is described as follows. Definition 1 (Model selection) From given models, find the one minimizing the generalization error defined as En kfˆθ − fk2 , (2) 1 A model is said to be faithful if the learning target can be expressed by the model [15].

where En denotes the ensemble average over the noise, and k · k denotes the norm in H.

3

Subspace information criterion

In this section, we derive a model selection criterion named the subspace information criterion (SIC), which gives an estimate of the generalization error of fˆθ . Let y, z, and n be M -dimensional vectors whose mth elements are ym , f(xm ), and nm , respectively: y = z + n.

(3)

y and z are called a sample value vector and an ideal sample value vector, respectively. Let Xθ be a mapping from y to fˆθ : fˆθ = Xθ y. (4) Xθ is called a learning operator. In the derivation of SIC, we assume the following conditions. 1. The learning operator Xθ is linear. 2. The mean noise is zero. 3. An unbiased learning result fˆu has been obtained with a linear operator Xu : En fˆu = f, fˆu = Xu y.

(5) (6)

Assumption 1 implies that the range of Xθ becomes a subspace of H. This assumption is practically not so strict since linear learning operators include various learning methods such as least mean squares learning [18], regularization learning [16], projection learning [17], and parametric projection learning [19]. It follows from Eqs.(6), (3), and Assumption 2 that En fˆu = En Xu y = En Xu z + En Xu n = Xu z.

(7)

Hence, Assumption 3 yields Xu z = f.

(8)

A detailed discussion about Assumption 3 is given in Section 5. The unbiased learning result fˆu and the learning operator Xu are used for estimating the generalization error of fˆθ . It is well-known that the generalization error of fˆθ is decomposed into the bias and variance [26]: En kfˆθ − fk2 = kEn fˆθ − fk2 + En kfˆθ − En fˆθ k2 .

(9)

It follows from Eqs.(4) and (3) that Eq.(9) yields En kfˆθ − fk2

= =

kXθ z − fk2 + En kXθ nk2 kXθ z − fk2 + tr (Xθ QXθ∗ ) , (10)

where tr (·) denotes the trace of an operator, Q is the noise covariance matrix, and Xθ∗ denotes the adjoint operator of Xθ . Let X0 be an operator defined as X0 = Xθ − Xu .

(11)

Then, the bias of fˆθ can be expressed by using fˆu as kXθ z − fk2 = kfˆθ − fˆu k2 − kfˆθ − fˆu k2 + kXθ z − fk2 = kfˆθ − fˆu k2 − kXθ z + Xθ n − (f + Xu n)k2

+ kXθ z − fk2 = kfˆθ − fˆu k2 − 2RehXθ z − f, X0 ni − kX0 nk2 , (12)

where ‘Re’ stands for the real part of a complex number and h·, ·i denotes the inner product in H. The second and third terms of the right-hand side of Eq.(12) can not be directly calculated since f and n are unknown. Here, we shall replace them with the averages of them over the noise. Then, the second term vanishes since the mean noise is zero, and the third term yields En kX0 nk2 = tr (X0 QX0∗ ) .

(13)

Note that this approximation gives an unbiased estimate of the bias since it follows from Eq.(12) that En kfˆθ − fˆu k2 − tr (X0 QX0∗ ) = kXθ z − fk2 . (14) To guarantee that the bias is non-negative, we adopt the following term as an approximation of the bias: h i kXθ z − fk2 ≈ s kfˆθ − fˆu k2 − tr (X0 QX0∗ ) , (15) where s[·] is defined as s[t] =

t if t ≥ 0, 0 if t < 0.

Substituting Eq.(15) into Eq.(10), we have h i En kfˆθ − fk2 ≈ s kfˆθ − fˆu k2 − tr (X0 QX0∗ ) + tr (Xθ QXθ∗ ) .

(16)

(17)

This approximation immediately gives the following model selection criterion. Definition 2 (Subspace information criterion) Among the given models, select the one minimizing the following SIC: h i SIC = s kfˆθ − fˆu k2 − tr (X0 QX0∗ ) + tr (Xθ QXθ∗ ) . (18) The model minimizing SIC is called the minimum SIC model (MSIC model) and the learning result obtained by the MSIC model is called the MSIC learning result. The generalization capability of the MSIC learning result measured by Eq.(2) is expected to be the best, the expectation is supported by computer simulations performed in Section 6. A procedure for obtaining the MSIC learning result is called the MSIC procedure.

4

Discussion

In this section, the difference of SIC from other model selection criteria is discussed.

4.1

Comparison with CL

2. AIC-type criteria use asymptotic approximation for estimating the generalization error. In contrast, we consider the case when the number of training examples is finite, and the generalization error is estimated by utilizing the noise characteristics. Therefore, SIC is expected to work well even when the number of training examples is small. Indeed, computer simulations performed in Section 6 support this claim. 3. Takeuchi [28] pointed out that AIC-type criteria are effective only in the selection of nested models (see also [15]):

1. The idea of the approximation used in SIC is partly similar to CL proposed by Mallows [14]. The problem considered in his paper is to estimate the ideal sample value vector z from training examples {(xm , ym )}M m=1 . The averaged predictive empirical error over the noise is adopted as the error measure: En

M 2 X ˆ fθ (xm ) − f(xm ) = En kBθ y − zk2 , (19)

m=1

where Bθ is a mapping from y to an M -dimensional vector whose m-th element is fˆθ (xm ). Analogous to Eqs.(9) and (10), the averaged predictive empirical error can be decomposed into the bias and variance:

S1 ⊂ S2 ⊂ · · · .

4.

En kBθ y − zk2 = kBθ z − zk2 + tr (Bθ QBθ∗ ) . (20) Then, CL is given as CL

=

kBθ y − yk2 − tr ((Bθ − I)Q(Bθ − I)∗ ) + tr (Bθ QBθ∗ ) . (21) 5.

2. Although the problem considered in Mallows’ paper was to estimate z, acquiring a higher level of generalization capability is implicitly expected. However, minimizing the averaged predictive empirical error does not generally mean to minimize the generalization error. In contrast, we consider the problem of minimizing the generalization error and SIC directly gives an estimate of the generalization error. 3. Mallows employed the sample value vector y as an unbiased estimator of the target z. In contrast, we assumed that the unbiased learning result fˆu of the target function f has been obtained. fˆu plays a similar role to y in Mallows’ case. 4. We utilized the fact that the bias is non-negative and introduced the s-function defined by Eq.(16) for keeping the estimate of the bias non-negative. In contrast, the estimate of the bias in CL can be negative (see the first and second terms of the right-hand side of Eq.(21)). Our simulations show that the use of the s-function well improves the performance of model selection.

4.2

6.

7.

8.

Comparison with AIC-type methods

1. In AIC-type criteria, the relation between the generalization error and the empirical error is first evaluated in the sense of the average over all training sets. Then, the averaged terms are replaced with particular values calculated by one given training set. In contrast, the generalization measure adopted in SIC (see Eq.(2)) is not averaged over training sets. The fact that the replacement of the averaged terms is unnecessary for SIC is expected to result in better selection than AIC-type methods.

5

(22)

In SIC, no restriction is imposed on models except that the range of Xθ is included in H. AIC-type methods compare models under the learning method minimizing the empirical error. In contrast, SIC can consistently compare models with different learning methods, e.g. least mean squares learning [18], regularization learning [16], projection learning [17], and parametric projection learning [19]. Namely, the type of learning methods is also included in the model. AIC-type criteria do not explicitly require a priori information about the class to which the target function belongs. In contrast, SIC requires a priori information about the Hilbert space H to which the target function f and a learning result fˆθ belong. If H is unknown, then a Hilbert space including all models is practically adopted as H (see the experiments in Section 6). SIC requires the noise covariance matrix Q while AIC-type criteria do not. However, as shown in Section 5, we can cope with the case where Q is not available. In the derivation of AIC-type methods, terms which are not dominant in model selection are neglected (see e.g. [15]). This implies that the value of the AIC-type criteria is not an estimate of the generalization error. In contrast, SIC gives an estimate of the generalization error itself. This difference can be clearly seen at the right-top graph in Fig.2 (see Section 6). AIC-type methods assume that training examples are i.i.d. In contrast, SIC can deal with the correlated noise if the noise covariance matrix Q is available.

SIC for least mean squares learning

In this section, SIC is applied to the selection of least mean squares (LMS) learning models. LMS learning is to obtain a learning result fˆθ (x) minimizing the empirical error M 2 X ˆ fθ (xm ) − ym

m=1

(23)

from training examples {(xm , ym )}M m=1 . In the LMS learning case, a model θ refers to a reproducing kernel Hilbert space S (see [6][31]) which is a subspace of H. Let KS (x, x′) be the reproducing kernel of S, and D be the domain of functions in S. Then, KS (x, x′ ) satisfies the following conditions. • For any fixed x′ in D, KS (x, x′ ) is a function of x in S. • For any function f in S and for any x′ in D, it holds that hf(·), KS (·, x′ )i = f(x′ ).

(24)

Note that the reproducing kernel is unique if it exists. Let µ be the dimension of S. Then, for any orthonormal basis {ϕi }µi=1 in S, KS (x, x′ ) is expressed as

SIC requires an unbiased learning result fˆu and an operator Xu providing fˆu . Here, we show a method of obtaining fˆu and Xu by LMS learning. Let us assume that H is also a reproducing kernel Hilbert space and regard H as a model. Let AH be a sampling operator defined with the reproducing kernel of H: AH =

M X

m=1

em ⊗ KH (x, xm ) .

(30)

Then, since f belongs to H, the ideal sample value vector z is expressed as z = AH f. (31) Let XH be the LMS learning operator with the model H, and fˆH be a learning result obtained by XH : fˆH = XH y = XH z + XH n = XH AH f + XH n.

(32)

µ

KS (x, x′ ) =

X

ϕi (x)ϕi (x′ ).

(25)

i=1

Let AS be an operator from H to the M -dimensional unitary space CM defined as AS =

M X

m=1

En fˆH = XH AH f + En XH n = A†H AH f = f, em ⊗ KS (x, xm) ,

(26)

where (· ⊗ ·) denotes the Neumann-Schatten product 2 , and em is the m-th vector of the so-called standard basis in CM . AS is called a sampling operator since it holds that AS f = z (27) for any f in S. Then, LMS learning is rigorously defined as follows. Definition 3 (Least mean squares learning) [18] An operator X is called the LMS learning operator for the model θ if X minimizes, for each y in CM , the functional J[X] =

Now let us assume that the range of A∗H agrees with H 4 . This holds only if the number M of training examples is larger than or equal to the dimension of H. Then, it follows from Eqs.(32) and (29) that

M 2 X ˆ fθ (xm ) − ym = kAS Xy − yk2 .

(28)

m=1

Let A† be the Moore-Penrose generalized inverse 3 of A. Then, the following proposition holds. Proposition 1 [18] The LMS learning operator for the model θ is given by Xθ = A†S .

(29)

which implies that fˆH is unbiased. Hence, we can put fˆu = fˆH and Xu = XH . For evaluating SIC, the noise covariance matrix Q is required (see Eq.(18)). One of the measures is to use ˆ=σ Q ˆ 2I

(f ⊗ g) h = hh, gif. 3 An operator X is called the Moore-Penrose generalized inverse of an operator A if X satisfies the following four conditions [4].

AXA = A, XAX = X, (AX)∗ = AX, and (XA)∗ = XA. The Moore-Penrose generalized inverse is unique and denoted as A† .

(34)

as a hypothesis noise covariance matrix, and estimate σ ˆ2 M from training examples {(xm , ym )}m=1 as (see Theorem 1.5.1 in [9]) σ ˆ2 =

M 2 X ˆ (M − dim(H )) , f (x ) − y u m m

(35)

m=1

where M > dim(H ) is implicitly assumed. Based on the above discussions, we shall show a calculation method of LMS learning results and the value of SIC. Let TS and TH be M -dimensional matrices whose (i, j)-elements are KS (xi , xj ) and KH (xi , xj ), respectively. Then, the following theorem holds. Theorem 1 (Calculation of LMS learning results) LMS learning results fˆθ (x) and fˆu (x) can be calculated as fˆθ (x)

=

M X

hTS† y, em iKS (x, xm ),

(36)

M X

† hTH y, em iKH (x, xm).

(37)

m=1 2 For any fixed g in a Hilbert space H and any fixed f in a 1 Hilbert space H2 , the Neumann-Schatten product (f ⊗ g) is an operator from H1 to H2 , defined by using any h ∈ H1 as (see [23])

(33)

fˆu (x)

=

m=1

Let T be an M -dimensional matrix defined as † † † T = TS† − TH TS TS† − TS† TS TH + TH .

(38)

Then, the following theorem holds. 4 This condition is sufficient for fˆ to be unbiased, not necesH sary.

′ ′ input {(xm , ym )}M m=1 , KH (x, x ), and {KSk (x, x )}k ; −3 ǫ ←− 10 ; TH ←− [KH (xi , xj )]ij ; if (M ≤ dim(H )) or (rank(TH ) < dim(H )) print(’H is too complicated.’); exit; end † 2 TH ←− TH (TH + ǫI)−1 ; † σ ˆ 2 ←− (kyk2 − hTH TH y, yi) (M − dim(H )); for all k TSk ←− [KSk (xi , xj )]ij ; TS†k ←− TSk (TS2k + ǫI)−1 ; † † † +TH ; TSk TS†k − TS†k TSk TH T ←− TS†k − TH † 2 2 SICk ←− s hT y, yi − σ ˆ tr (T ) + σ ˆ tr TSk ; end k0 ←− argmink {SICk }; PM fˆ(x) ←− m=1 hTS†k y, em iKSk0 (x, xm); 0

Figure 1: MSIC procedure in a pseudo code — Find the best model from {Sk }k . Theorem 2 (Calculation of SIC for LMS learning) When the noise covariance matrix Q is estimated by Eqs.(34) and (35), SIC for LMS learning is given as (39) SIC = s hT y, yi − σ ˆ 2 tr (T ) + σ ˆ 2 tr TS† ,

where s[·] is defined by Eq.(16), and σ ˆ 2 is given as σ ˆ2 =

† kyk2 − hTH TH y, yi . M − dim(H )

(40)

In practice, the calculation of the Moore-Penrose generalized inverse is sometimes unstable. To overcome the unstableness, we recommend to use Tikhonov’s regularization [29]: TS† ←− TS (TS2 + ǫI)−1 ,

(41)

where ǫ is a small constant, say ǫ = 10−3 . Then, a complete algorithm of the MSIC procedure, i.e., finding the best model from {Sk }k , is described in a pseudo code in Fig.1.

6

Computer simulations

In this section, computer simulations are performed to demonstrate the effectiveness of SIC compared with the network information criterion (NIC) [15], which is a generalized AIC. Let the target function f(x) be √ √ f(x) = 2 sin x + 2 2 cos x √ √ − 2 sin 2x − 2 2 cos 2x √ √ + 2 sin 3x − 2 cos 3x √ √ + 2 2 sin 4x − 2 cos 4x √ √ + 2 sin 5x − 2 cos 5x, (42)

and training examples {(xm , ym )}M m=1 be xm

=

ym

=

π 2πm + , M M f(xm ) + nm ,

−π −

(43) (44)

where the noise nm is subject to the normal distribution with mean 0 and variance 3: nm ∼ N (0, 3).

(45)

Let us consider the following models: {SN }20 N=1

(46)

where SN is a trigonometric polynomial space of order N , i.e., a Hilbert space spanned by {1, sin nx, cos nx}N n=1 , and the inner product is defined as Z π 1 hf, giSN = f(x)g(x)dx. 2π −π

(47)

(48)

Then, the reproducing kernel of SN is expressed as . x − x′ (2N + 1)(x − x′ ) sin sin 2 2 KSN (x, x′) = if x = 6 x′ , 2N + 1 if x = x′ . (49) Our task is to find the best model SN minimizing Z π 2 1 ˆ Error = (50) fSN (x) − f(x) dx, 2π −π where fˆSN (x) is the learning result obtained with the model SN . Let us consider the following two model selection methods.

(A) SIC: LMS learning is adopted. The largest model S20 including all models is employed as H. (B) NIC: The squared loss function is adopted. In this case, learning results obtained by the stochastic gradient descent method [5] converge to the LMS estimator. The distribution of sample points given by Eq.(43) is regarded as a uniform distribution. In both (A) and (B), no a priori information is used and the LMS estimator is commonly adopted. Hence, the efficiency of SIC and NIC can be fairly compared by this simulation. Fig.2 shows the simulation results when the numbers of training examples are 50 (left graphs) and 200 (right graphs). The top graphs show the values of the error measured by Eq.(50), SIC, and NIC in each model, denoted by the solid, dashed, and dotted lines, respectively. The horizontal axis denotes the highest order N of trigonometric polynomials in the model SN . The middle graphs show the target function f(x), training examples, and the MSIC learning result, denoted by the solid line, ’◦’, and the dashed line, respectively. The bottom graphs show the target function f(x), training

M = 50

M = 200

3

3

Error SIC NIC

Error, SIC, NIC

4

Error, SIC, NIC

4

2

2

1

1

Error SIC NIC

0

5

10 N

15

15

20

f MSIC result

10

0

0

0

−5

−5

−10

−10 2

−2

MSIC model = S5 (Error = 0.98)

15

f MNIC result

10

0

−5

−5

−10

−10 MNIC model = S20 (Error = 3.36)

2

f MNIC result

10

0

2

0

15

5

0

20

MSIC model = S5 (Error = 0.11)

5

−2

15

f MSIC result

10 5

0

10 N

15

5

−2

5

−2

0

2

MNIC model = S6 (Error = 0.17)

Figure 2: Simulation results.

examples, and the minimum NIC (MNIC) learning result, denoted by the solid line, ’◦’, and the dotted line, respectively. These results show that when M = 200, both SIC and NIC give reasonable learning results. However, when it comes to the case where M = 50, SIC outperforms NIC. This implies that SIC works well even when the number of training examples is small. As mentioned in Section 4.2.7, NIC neglects terms which are not dominant in model selection while SIC gives an estimate of the generalization error. The righttop graph in Fig.2 clearly shows this difference. SIC well approximates the generalization error while the value of NIC is larger than the true generalization error. Generally speaking, neglecting non-dominant terms does not affect the performance of model selection. However, this graph shows that, for 5 ≤ N ≤ 20, the slope of NIC is gentler than the true generalization error while that of SIC is in good agreement with it. This implies that SIC is expected to be resistant to the noise since the discrimination of models by SIC is clearer than NIC.

7

Conclusion

In this paper, we proposed a new model selection criterion called the subspace information criterion (SIC). Computer simulations showed that SIC works well even when the number of training examples is small.

References [1] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19(6), 716–723. [2] Akaike, H. (1980). Likelihood and the Bayes procedure. In N. J. Bernardo, M. H. DeGroo, D. V. Lindley, & A. F. M. Smith (Eds.), Bayesian Statistics (pp. 141–166). Valencia: University Press. [3] Akaike, H., & Kitagawa, G. (Eds.) (1994, 1995). The practice of time series analysis I, II. Tokyo: Asakura Syoten. (In Japanese) [4] Albert, A. (1972). Regression and the Moore-Penrose pseudoinverse. New York and London: Academic Press. [5] Amari, S. (1967). Theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, EC16(3), 299–307. [6] Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 337–404. [7] Bozdogan, H. (Ed.) (1994). Proceedings of the first US/Japan conference on the frontiers of statistical modeling: An informational approach. Netherlands: Kluwer Academic Publishers. [8] Craven, P., & Wahba, G. (1979). Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized crossvalidation. Numerische Mathematik, 31, 377–403. [9] Fedorov, V. V. (1972). Theory of optimal experiments. New York: Academic Press. [10] Konishi, S., & Kitagawa, G. (1996). Generalized information criterion in model selection. Biometrika, 83, 875–890.

[11] Li, K. (1986). Asymptotic optimality of CL and generalized cross-validation in ridge regression with application to spline smoothing. The Annals of Statistics, 14(3), 1101–1112. [12] MacKay, D. (1992). Bayesian interpolation. Neural Computation, 4(3), 415–447. [13] Mallows, C. L. (1964). Choosing a subset regression. Presented at the Central Regional Meeting of the Institute of Mathematical Statistics, Manhattan, Kansas. [14] Mallows, C. L. (1973). Some comments on CP . Technometrics, 15(4), 661–675. [15] Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion—Determining the number of hidden units for an artificial neural network model. IEEE Transactions on Neural Networks, 5(6), 865–872. [16] Nakashima, A., & Ogawa, H. (1999). How to design a regularization term for improving generalization. In Proceedings of ICONIP’99, the 6th International Conference on Neural Information Processing, 1 (pp. 222– 227), Perth, Australia. [17] Ogawa, H. (1987). Projection filter regularization of illconditioned problem. In Proceedings of SPIE, Inverse Problems in Optics, 808 (pp. 189–196). [18] Ogawa, H. (1992). Neural network learning, generalization and over-learning. In Proceedings of the ICIIPS’92, International Conference on Intelligent Information Processing & System, 2 (pp. 1–6), Beijing, China. [19] Oja, E., & Ogawa, H. (1986). Parametric projection filter for image and signal restoration. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP34(6), 1643–1653. [20] Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9), 1481–1497. [21] Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471. [22] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536. [23] Schatten, R. (1970). Norm ideals of completely continuous operators. Berlin: Springer-Verlag. [24] Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464. [25] Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36, 111–147. [26] Takemura, A. (1991). Modern mathematical statistics. Tokyo: Sobunsya. (In Japanese) [27] Takeuchi, K. (1976). Distribution of information statistics and validity criteria of models. Mathematical Science, 153, 12–18. (In Japanese) [28] Takeuchi, K. (1983). On the selection of statistical models by AIC. Journal of the Society of Instrument and Control Engineering, 22(5), 445–453. (In Japanese) [29] Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of ill-posed problems. Washington DC: V. H. Winston. [30] Vapnik, V. N. (1995). The nature of statistical learning theory. Berlin: Springer-Verlag. [31] Wahba, H. (1990). Spline model for observational data. Philadelphia and Pennsylvania: Society for Industrial and Applied Mathematics.