Validation of Fusion

Eric Bax

May 5, 1999

Abstract

In fusion, a hypothesis function is formed by using a mixing function to combine the outputs of a collection of basis functions. We develop a method to bound the out-of-sample error of the hypothesis function. First, uniform error bounds are calculated for the basis functions. Then a linear program is used to infer a bound for the hypothesis function from the basis function bounds. The resulting hypothesis function bound is not based on the size of the class of prospective hypothesis functions. Instead, the bound is based on the number of basis functions and on similarities between the basis functions and the hypothesis function. Hence, the linear program produces a stronger bound than direct validation when the number of basis functions is small, when the class of prospective hypothesis functions is complex, and when the hypothesis function is similar to the basis functions.

Key words: machine learning, learning theory, fusion, committees, Vapnik-Chervonenkis, linear programming.

Math and Computer Science Department, University of Richmond, VA 23173 ([email protected]).

1 Introduction

Consider the following machine learning problem. There is an unknown target function and a distribution over the input space of the function. For example, the input distribution could consist of images produced by a satellite over the Pacific Ocean, and the target function could be the fraction of the imaged region that is experiencing heavy winds. We have a set of in-sample examples with inputs drawn according to the input distribution and outputs determined by the target function. We also have a set of out-of-sample example inputs drawn according to the input distribution. We use an error measure that averages over examples, e.g., mean-squared difference between the hypothesis function and the target function. Our goal is to find a hypothesis function with low error over the out-of-sample inputs.

We partition the in-sample data into training data and validation data. We use the training data to develop a set of basis functions. We use the validation data to select a mixing function for the outputs of the basis functions, developing a hypothesis function through fusion of basis functions.

To obtain a bound on the out-of-sample error of the hypothesis function, we first use the validation data to compute uniform error bounds for the basis functions. Then we develop a linear program with constraints based on basis function errors and similarities between the basis functions and the hypothesis function over the out-of-sample inputs. We solve the linear program to produce an out-of-sample error bound for the hypothesis function.

With this technique, the validation data are used to establish uniform bounds for the basis functions, so they must be withheld from the development of the basis functions. However, the validation data may be used to select the mixing function that produces the hypothesis function. Hence, all in-sample data play a role in the development of the hypothesis function, and we produce an error bound for the function as well.
In the next section, we review the uniform validation technique to obtain uniform error bounds over the basis functions. In the following section, we develop the linear program to infer hypothesis error bounds from basis function error bounds. We show how the method extends beyond fusion to produce error bounds for early stopping and uniform error bounds over multiple hypothesis functions. Then we discuss the intuition behind the linear program. Also, we extend the linear programming technique by showing how to develop additional constraints. Finally, we present test results comparing bounds computed by linear programming to direct error bounds.

This work extends earlier results for classification problems to regression problems. Refer to [Bax 1998] for the results on classifiers. For more information on fusion, refer to [Breiman 1992, Haykin 1999, Kim and Bartlett 1995, Sridhar, Seagrave, and Bartlett 1996, Wolpert 1992]. The linear programming technique developed here is an example of a general strategy called validation by inference. For more information on validation by inference, refer to

[Bax, Cataltepe, and Sill 1997].


2 Review of Uniform Validation

Denote the basis functions by $g_1, \ldots, g_M$. Denote the target function by $f$. Let $E(g(x), f(x))$ be the error score for input $x$. Assume that $E$ is monotonic in $|g(x) - f(x)|$ and that $E$ has range $[s, t]$. Define the error score of $g$ on a data set as the average of $E(g(x), f(x))$ over the inputs. Let $\nu$ be the error score of a basis function over the validation data, and let $\nu'$ be the (unknown) error score over the out-of-sample data. Let $D$ be the number of validation examples, and let $N$ be the number of out-of-sample examples. Denote the error scores on individual validation examples using random variables $X_1, \ldots, X_D$. Denote the negatives of the error scores on out-of-sample examples using random variables $X_{D+1}, \ldots, X_{D+N}$. Note that
$$\Pr\{\nu' \geq \nu + \epsilon(t - s)\} = \Pr\left\{\frac{X_1 + \ldots + X_D}{D} + \frac{X_{D+1} + \ldots + X_{D+N}}{N} \leq -\epsilon(t - s)\right\}. \quad (1)$$
Using a result by Hoeffding [Hoeffding 1963], p. 16, Eq. 2.7,
$$\Pr\{\nu' \geq \nu + \epsilon(t - s)\} \leq e^{-2\epsilon^2 (D^{-1} + N^{-1})^{-1}}. \quad (2)$$
If $D = N$, then we can derive a stronger and simpler result. In this case,
$$\Pr\{\nu' \geq \nu + \epsilon(t - s)\} = \Pr\left\{\frac{X_1 + \ldots + X_{2N}}{N} \leq -\epsilon(t - s)\right\}. \quad (3)$$
So
$$\Pr\{\nu' \geq \nu + \epsilon(t - s)\} = \Pr\left\{\frac{X_1 + \ldots + X_{2N}}{2N} \leq -\frac{\epsilon}{2}(t - s)\right\}. \quad (4)$$
Using another result by Hoeffding, [Hoeffding 1963], p. 16, Eq. 2.6,
$$\Pr\{\nu' \geq \nu + \epsilon(t - s)\} \leq e^{-2N\epsilon^2}. \quad (5)$$
For the remainder of this paper, assume $D = N$. In practice, if $D \neq N$, then substitute (2) for (5). Let $\nu_m$ and $\nu'_m$ represent the validation and out-of-sample error scores for basis function $g_m$. Using the sum of probabilities to bound the probability of the union event,
$$\Pr\{\nu'_1 \geq \nu_1 + \epsilon(t - s) \text{ or } \ldots \text{ or } \nu'_M \geq \nu_M + \epsilon(t - s)\} \leq M e^{-2N\epsilon^2}. \quad (6)$$
In other words, with probability at least $1 - M e^{-2N\epsilon^2}$, for all basis functions, the out-of-sample error score is at most $\epsilon(t - s)$ greater than the validation error score. Refer to the out-of-sample error bound $\nu_m + \epsilon(t - s)$ as $b_m$.
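As a concrete illustration, the smallest $\epsilon$ that achieves a given confidence in Formula 6 has a closed form, since $1 - M e^{-2N\epsilon^2} \geq c$ solves to $\epsilon = \sqrt{\ln(M / (1 - c)) / (2N)}$. A minimal sketch (the function name is ours, not the paper's):

```python
import math

def uniform_eps(M, N, confidence):
    """Smallest eps with 1 - M * exp(-2 * N * eps**2) >= confidence (Formula 6)."""
    return math.sqrt(math.log(M / (1.0 - confidence)) / (2.0 * N))

# The settings used in the tests of Section 7:
print(round(uniform_eps(10, 111, 0.90), 3))   # 0.144 (credit-card tests)
print(round(uniform_eps(5, 200, 0.90), 3))    # 0.099 (diabetes tests)
```

The two printed values match the $\epsilon$ values reported for the credit-card and diabetes experiments in Section 7.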


3 The Linear Program

4 Development of the Linear Program

Let $h$ be the hypothesis function. Partition the range of the target function into $S_1, \ldots, S_K$. For each out-of-sample input $x_n$ and partition $S_k$, define
$$y_{nk} = \begin{cases} 1 & \text{if } f(x_n) \in S_k \\ 0 & \text{otherwise.} \end{cases} \quad (7)$$
Define $e_{mnk}$ to be the greatest lower bound on the error score for $g_m$ if $f(x_n) \in S_k$, i.e.,
$$e_{mnk} = \inf_{v \in S_k} E(g_m(x_n), v). \quad (8)$$
Similarly, let $c_{nk}$ be the least upper bound on the error score for hypothesis function $h$ if $f(x_n) \in S_k$, i.e.,
$$c_{nk} = \sup_{v \in S_k} E(h(x_n), v). \quad (9)$$
The out-of-sample error score for $h$ is at most
$$\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} c_{nk} y_{nk}. \quad (10)$$
The out-of-sample error bounds $b_1, \ldots, b_M$ for the basis functions imply the constraints:
$$\forall m \in \{1, \ldots, M\} \quad \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} e_{mnk} y_{nk} \leq b_m. \quad (11)$$
Hence, the solution of the following integer linear program (ILP) is an upper bound on the out-of-sample error for the hypothesis function.
$$\text{maximize} \quad \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} c_{nk} y_{nk} \quad (12)$$
$$\text{subject to} \quad \forall m \in \{1, \ldots, M\} \quad \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} e_{mnk} y_{nk} \leq b_m, \quad (13)$$
$$\forall n \in \{1, \ldots, N\} \quad \sum_{k=1}^{K} y_{nk} = 1, \quad (14)$$
and
$$\forall (n, k) \in \{1, \ldots, N\} \times \{1, \ldots, K\} \quad y_{nk} \in \{0, 1\}. \quad (15)$$
Solving this program may require extensive computation since integer linear programming is NP-complete [Garey and Johnson 1979, Karp 1972]. To produce a linear program, replace the integer constraints (15) by the constraints
$$\forall (n, k) \in \{1, \ldots, N\} \times \{1, \ldots, K\} \quad 0 \leq y_{nk} \leq 1. \quad (16)$$

This weakens the constraints, but it ensures that the program has a tractable solution [Papadimitriou and Steiglitz 1982, Khachian 1979]. (For algorithms to solve linear programs, refer to [Franklin 1980], pp. 79-103.)
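To make the construction concrete, the following sketch solves the integer program (12)-(15) exactly by exhaustive search on a toy instance with squared-error scores. All values are invented for illustration, and exhaustive search is only viable at toy sizes; real instances use the relaxation (16) and an LP solver.

```python
import itertools

def interval_inf_sq(g_val, lo, hi):
    """Greatest lower bound of (g_val - v)**2 over v in [lo, hi]; this is e_mnk."""
    if lo <= g_val <= hi:
        return 0.0
    return min((g_val - lo) ** 2, (g_val - hi) ** 2)

def interval_sup_sq(h_val, lo, hi):
    """Least upper bound of (h_val - v)**2 over v in [lo, hi]; this is c_nk."""
    return max((h_val - lo) ** 2, (h_val - hi) ** 2)

def ilp_bound(basis_vals, h_vals, b, partition):
    """Exact solution of program (12)-(15) by enumerating assignments of f(x_n) to subranges."""
    M, N = len(basis_vals), len(h_vals)
    best = None
    for ks in itertools.product(range(len(partition)), repeat=N):
        feasible = all(
            sum(interval_inf_sq(basis_vals[m][n], *partition[k])
                for n, k in enumerate(ks)) / N <= b[m]
            for m in range(M))
        if feasible:
            val = sum(interval_sup_sq(h_vals[n], *partition[k])
                      for n, k in enumerate(ks)) / N
            best = val if best is None else max(best, val)
    return best

# One basis function with values (0.2, 0.4) on two out-of-sample inputs,
# hypothesis values (0.3, 0.3), bound b_1 = 0.01, and K = 5 subranges of [0, 1].
partition = [(k / 5, (k + 1) / 5) for k in range(5)]
bound = ilp_bound([[0.2, 0.4]], [0.3, 0.3], [0.01], partition)
print(bound)   # ~0.09
```

Here the constraint forces $f(x_1) \leq 0.4$ and $0.2 \leq f(x_2) \leq 0.6$, and the worst feasible assignment places the target values in the subranges farthest from the hypothesis values, giving the bound 0.09.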

4.1 Use of the Linear Program

Use the following process to develop basis functions and produce out-of-sample error bounds for a hypothesis function formed by fusion of basis function outputs. Partition the in-sample data into training data and validation data. Use the training data to develop a collection of basis functions. Choose some $\epsilon > 0$, and use the validation data to compute out-of-sample error bounds for the basis functions according to Formula 6. Use the training and validation data to select the hypothesis function from the class of prospective hypothesis functions. Solve the linear program corresponding to the basis functions, the hypothesis function, and the out-of-sample inputs. The solution is an out-of-sample error bound for the hypothesis function if the basis function bounds are all valid. By Formula 6, the probability that the basis function bounds are all valid is at least $1 - M e^{-2N\epsilon^2}$, where $M$ is the number of basis functions, and $N$ is the number of validation examples and the number of out-of-sample examples.

Note that the linear program bound is valid even if the hypothesis function is not the result of fusion of the basis functions. For example, an out-of-sample error bound for the hypothesis function chosen by early stopping can be computed as follows. Partition the in-sample data into a training set and a validation set. Choose a function at random. Use an iterative method to develop a function that approximates the training data well. Record the sequence of functions produced by the iterations. Sample these functions at intervals to form the set of basis functions. Use the validation data to compute uniform out-of-sample error bounds for the basis functions. Now compute the error over the validation data for every function in the iteration sequence. The function with minimum validation error is the function chosen by early stopping. It is our hypothesis function. With probability at least $1 - M e^{-2N\epsilon^2}$, the solution to the linear program is a valid out-of-sample error bound.
(For further details, refer to [Bax, Cataltepe, and Sill 1997].) Note that if the uniform basis function bounds are valid, then, for every hypothesis function, the solution of the corresponding linear program is a valid out-of-sample error bound. Uniform basis function bounds logically imply uniform hypothesis function bounds. Thus, we can use linear programs to compute uniform out-of-sample error bounds for any number of hypothesis functions. With probability at least $1 - M e^{-2N\epsilon^2}$, the bounds are all valid.
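The early-stopping procedure above can be sketched end to end on a toy one-parameter model. All data, learning rate, and snapshot interval here are invented for illustration.

```python
# Toy early-stopping run: fit w in f(x) = w*x by gradient descent on a
# training set, snapshot w after each epoch, sample every 5th snapshot as a
# basis function, and take the snapshot with least validation error as the
# hypothesis function.
train_x, train_y = [1.0, 2.0], [0.9, 2.1]
val_x, val_y = [1.5], [1.4]

w, lr, snapshots = 0.0, 0.05, []
for epoch in range(40):
    grad = sum(2 * (w * x - y) * x for x, y in zip(train_x, train_y)) / len(train_x)
    w -= lr * grad
    snapshots.append(w)

basis = snapshots[4::5]      # every 5th snapshot: the basis functions
val_err = [(w_t * val_x[0] - val_y[0]) ** 2 for w_t in snapshots]
hypothesis_w = snapshots[min(range(len(snapshots)), key=val_err.__getitem__)]
print(len(basis), round(hypothesis_w, 3))   # 8 0.943
```

The hypothesis is drawn from the full iteration sequence while only the sampled snapshots serve as basis functions, matching the procedure in the text.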

4.2 Bounds Without Out-of-Sample Inputs

Briefly, consider the case in which out-of-sample inputs are unknown, but there is a method to generate inputs at random according to the input distribution.

Generate random sets of inputs having the same size as the out-of-sample data set. For each random data set, form and solve the linear program. Use these sample solutions to compute a probabilistic bound on the solution of the linear program for the out-of-sample inputs. For example, generate 100 random data sets and solve them to generate 100 sample linear program solutions. Let $v$ be the value of the fifth greatest sample solution. Then generate $K$ more sample solutions. Let $p$ be the fraction of these solutions with value greater than $v$. Let $\pi$ be the probability that the solution for a random data set has value greater than $v$. By Hoeffding [Hoeffding 1963], p. 16, Eq. 2.6,
$$\Pr\{\pi \geq p + \hat{\epsilon}\} \leq e^{-2K\hat{\epsilon}^2}. \quad (17)$$
Let $w$ be the (unknown) linear program solution for the out-of-sample data. Recall from Formula 6 that the probability that the linear program solution $w$ is not a valid out-of-sample error bound is at most $M e^{-2N\epsilon^2}$. Hence, the probability that $v + \epsilon(t - s)$ is a valid out-of-sample error bound is at least
$$1 - [\Pr\{\pi \geq p + \hat{\epsilon}\} + \Pr\{w > v \mid \pi < p + \hat{\epsilon}\} + M e^{-2N\epsilon^2}] \quad (18)$$
$$\geq 1 - [e^{-2K\hat{\epsilon}^2} + (p + \hat{\epsilon}) + M e^{-2N\epsilon^2}]. \quad (19)$$
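A sketch of this resampling procedure, with a stand-in for the LP solver. The distribution of sample solutions below is fabricated purely to show the bookkeeping; a real implementation would build and solve the program of Section 4 on each freshly generated input set.

```python
import math
import random

random.seed(0)

# Hypothetical stand-in for "form the linear program on a freshly generated
# random input set and solve it".
def sample_lp_solution():
    return sum(random.random() for _ in range(10)) / 10.0

M, N = 10, 111      # basis functions; validation and out-of-sample set sizes
eps = 0.144         # from the uniform validation bound (Formula 6)
eps_hat = 0.05      # accuracy parameter for the resampling bound
K = 2000            # number of extra sample solutions

first = sorted((sample_lp_solution() for _ in range(100)), reverse=True)
v = first[4]        # fifth greatest of 100 sample solutions

extra = [sample_lp_solution() for _ in range(K)]
p = sum(s > v for s in extra) / K   # fraction of extra solutions above v

# Probability that v + eps*(t - s) fails as an out-of-sample error bound,
# i.e., one minus the right-hand side of (19):
failure = (math.exp(-2 * K * eps_hat ** 2) + (p + eps_hat)
           + M * math.exp(-2 * N * eps ** 2))
print(v, p, failure)
```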


5 Intuition

We begin with a simple example to demonstrate the process of validation by inference. Consider a single basis function, $g_1$, and a single out-of-sample input, $x_1$. Use mean-squared difference as the error function. Suppose we have determined that the error of $g_1$ on $x_1$ is no more than 0.01, with probability at least 90%, i.e.,
$$\Pr\{[g_1(x_1) - f(x_1)]^2 \leq 0.01\} \geq 90\%. \quad (20)$$
Our goal is to compute an error bound for a hypothesis function $h$. Suppose we find that $h(x_1) = 0.3$ and $g_1(x_1) = 0.2$. If the error bound for $g_1$ holds, then $|f(x_1) - g_1(x_1)| \leq 0.1$, so $f(x_1) \in [0.1, 0.3]$. The value in this range that maximizes the error of $h(x_1)$ is $f(x_1) = 0.1$. In this case, the error is 0.04. Hence, with at least 90% probability, the error for the hypothesis function is at most 0.04.

This example illustrates the following procedure for bounding by inference:

1. Establish error bounds for basis functions.
2. Use the bounds and the basis function values on out-of-sample inputs to constrain the target function.
3. Find the value of the target function that satisfies the constraints and maximizes hypothesis function error.

Now extend the example to include an additional basis function, $g_2$. Suppose that we have uniform error bounds for the basis functions:
$$\Pr\{[g_1(x_1) - f(x_1)]^2 \leq 0.01 \text{ and } [g_2(x_1) - f(x_1)]^2 \leq 0.01\} \geq 90\%. \quad (21)$$
Recall $g_1(x_1) = 0.2$. Suppose $g_2(x_1) = 0.35$. Then, with probability at least 90%, $f(x_1) \in [0.1, 0.3] \cap [0.25, 0.45]$. Refer to $[0.1, 0.3]$ and $[0.25, 0.45]$ as constraint regions. Refer to $[0.25, 0.3]$ as the feasible region. The target function value in the feasible region that maximizes hypothesis function error is 0.25. In this case, the error is 0.0025. So, with probability at least 90%, the hypothesis function has error 0.0025 or less.

Now extend the example to include an additional out-of-sample input, $x_2$.
For simplicity, assume that the basis functions and the hypothesis function have the same values on $x_2$ as on $x_1$, i.e., $g_1(x_1) = g_1(x_2) = 0.2$, $g_2(x_1) = g_2(x_2) = 0.35$, and $h(x_1) = h(x_2) = 0.3$. Once again, assume we have uniform error bounds for the basis functions, i.e.,
$$\Pr\left\{\forall m \in \{1, 2\} \;\; \frac{1}{2}\left([g_m(x_1) - f(x_1)]^2 + [g_m(x_2) - f(x_2)]^2\right) \leq 0.01\right\} \geq 90\%. \quad (22)$$
The basis function error bounds constrain the target function values $(f(x_1), f(x_2))$ to circles centered at the basis function values $(g_1(x_1), g_1(x_2))$ and $(g_2(x_1), g_2(x_2))$.

The intersection of these constraint circles is the feasible region. The point in the feasible region with maximum distance from the hypothesis function values $(h(x_1), h(x_2))$ is the pair of target function values that produces the largest error for the hypothesis function among all target function pairs that satisfy the constraints. (If the feasible region is empty, then the uniform error bounds for the basis functions contain contradictions, so the basis function bounds are not valid.)

For our example, the radius of each constraint circle is $\sqrt{0.02}$, and the centers are $(0.20, 0.20)$ and $(0.35, 0.35)$. In the intersection of the constraint circles, the most distant points from the hypothesis function point $(0.30, 0.30)$ are at squared distance 0.01. Thus, the basis function bounds imply the hypothesis function bound
$$\frac{1}{2}\left([h(x_1) - f(x_1)]^2 + [h(x_2) - f(x_2)]^2\right) \leq 0.005. \quad (23)$$
Note that in this simple example the hypothesis function bound is tighter than the basis function bounds.

Additional basis functions contribute additional constraints, and additional data contribute additional dimensions. In general, the feasible region is the intersection of $M$ $N$-dimensional spheres, where $M$ is the number of basis functions and $N$ is the number of out-of-sample inputs. To allow a solution by linear programming, we partition the range to discretize the geometric problem of finding a point in the feasible region with greatest distance from the point given by the hypothesis function values on the out-of-sample inputs. In general, the solution of the linear program yields error bounds inferior to those produced by the exact solution of the geometric problem. Subdividing the partitions can produce superior error bounds, but this requires solving a linear program with more variables.
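The two-circle example can be checked numerically. The sketch below scans the boundary of the first constraint circle and keeps points that also satisfy the second constraint; since the objective is convex, its maximum over the convex feasible region lies at a corner of the intersection, which this scan finds.

```python
import math

# Maximize the mean squared error (half the squared distance from the
# hypothesis point) over the intersection of the two constraint circles.
r2 = 0.02                            # squared radius of each constraint circle
c1, c2 = (0.20, 0.20), (0.35, 0.35)  # circle centers (basis function values)
h = (0.30, 0.30)                     # hypothesis function values

best = 0.0
steps = 200000
for i in range(steps + 1):
    theta = 2 * math.pi * i / steps
    x = c1[0] + math.sqrt(r2) * math.cos(theta)
    y = c1[1] + math.sqrt(r2) * math.sin(theta)
    if (x - c2[0]) ** 2 + (y - c2[1]) ** 2 <= r2:
        best = max(best, ((x - h[0]) ** 2 + (y - h[1]) ** 2) / 2)

print(round(best, 4))   # 0.005, the hypothesis function bound in (23)
```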


6 Extending the Linear Program

One strategy to derive additional constraints for the linear program is to validate some aspect of the behavior of the target function over the out-of-sample data, then identify constraints implied by the pattern of behavior. Validating upper bounds on the difference in behavior between the target function and the basis functions (the out-of-sample error) yields the original constraints. Additional constraints can be derived by validating lower bounds on out-of-sample error. Extend the uniform upper bounds for basis function errors in Formula 6 to include lower bounds:
$$\Pr\{\forall m \in \{1, \ldots, M\} \;\; |\nu'_m - \nu_m| < \epsilon(t - s)\} \geq 1 - 2M e^{-2N\epsilon^2}. \quad (24)$$
The lower bounds imply the following additional constraints for the linear program:
$$\forall m \in \{1, \ldots, M\} \quad \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} e'_{mnk} y_{nk} \geq b'_m, \quad (25)$$
where $b'_m = \nu_m - \epsilon(t - s)$ is the lower bound corresponding to upper bound $b_m$, and $e'_{mnk} = \sup_{v \in S_k} E(g_m(x_n), v)$ is the least upper bound corresponding to the greatest lower bound $e_{mnk}$.

More constraints can be derived by validating the frequency distribution of the target function output. Recall that the linear program is based on partitioning the range of the target function into subranges $S_1, \ldots, S_K$. Let $p_k$ be the fraction of validation inputs for which the target function is in $S_k$. Let $p'_k$ be the (unknown) fraction of out-of-sample inputs for which the target function is in $S_k$. Refer to $p_k$ and $p'_k$ as placement frequencies. Extend the uniform bounds for basis function errors to include bounds for placement frequencies:
$$\Pr\{\forall m \in \{1, \ldots, M\} \;\; |\nu'_m - \nu_m| \leq \epsilon(t - s) \text{ and } \forall k \in \{1, \ldots, K\} \;\; |p'_k - p_k| < \epsilon\} \quad (26)$$
$$\geq 1 - 2(M + K) e^{-2N\epsilon^2}. \quad (27)$$
The placement frequency bounds imply the following additional constraints for the linear program:
$$\forall k \in \{1, \ldots, K\} \quad p_k - \epsilon \leq \frac{1}{N} \sum_{n=1}^{N} y_{nk} \leq p_k + \epsilon. \quad (28)$$
Other constraints can be derived by validating the rank of the target function output among basis function outputs. For $m \in \{0, \ldots, M\}$, let $r_m$ be the fraction of validation inputs for which exactly $m$ of the basis function outputs are less than or equal to the target function value. Let $r'_m$ be the corresponding fraction for out-of-sample inputs. Refer to $r_m$ and $r'_m$ as rank frequencies. Extend the uniform bounds for basis function errors to include bounds for rank frequencies:
$$\Pr\{\forall m \in \{1, \ldots, M\} \;\; |\nu'_m - \nu_m| \leq \epsilon(t - s) \text{ and } \forall m \in \{0, \ldots, M\} \;\; |r'_m - r_m| \leq \epsilon\} \quad (29)$$
$$\geq 1 - 2(M + M + 1) e^{-2N\epsilon^2}. \quad (30)$$
Use the following partitioning scheme to produce the linear program. Let the number of subranges be one more than the number of basis functions, i.e., let $K = M + 1$. For each input, for each $k \in \{1, \ldots, K\}$, define $S_k$ to be the portion of the range containing values greater than exactly $k - 1$ basis function outputs. This partitioning scheme maintains the original constraints implied by the basis function error bounds, and the rank frequency bounds imply the following additional constraints:
$$\forall k \in \{1, \ldots, K\} \quad r_{k-1} - \epsilon \leq \frac{1}{N} \sum_{n=1}^{N} y_{nk} \leq r_{k-1} + \epsilon. \quad (31)$$
Within this partitioning framework, there are many frequency distributions that can be validated to produce further constraints. Examples include the rank frequencies for a subset of the basis functions and the frequency with which the target function output is between the outputs of some pair of basis functions. To increase the number of partitions within this framework, subdivide each subrange $S_k$ into equal-sized sub-subranges, then adjust the constraints accordingly.

Choose the constraints for a given problem without reference to the validation data and the out-of-sample inputs. This restriction ensures the validity of the Hoeffding bounds for uniform validation of the properties that imply the constraints. Choose the nature of the constraints using training data and prior knowledge about the problem. Choose the number of constraints according to the number of validation examples and out-of-sample inputs available. More constraints mean weaker individual constraints, because more constraints require uniform validation of more properties.
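For concreteness, the rank frequencies used above can be computed as follows. The function name and all data values are invented for illustration.

```python
def rank_frequencies(basis_outputs, target_vals):
    """r[m]: fraction of inputs where exactly m basis outputs are <= the target value."""
    M, N = len(basis_outputs), len(target_vals)
    r = [0.0] * (M + 1)
    for n in range(N):
        rank = sum(basis_outputs[m][n] <= target_vals[n] for m in range(M))
        r[rank] += 1.0 / N
    return r

# Two basis functions evaluated on two validation inputs.
g = [[0.2, 0.5], [0.4, 0.9]]
f_vals = [0.3, 0.95]
print(rank_frequencies(g, f_vals))   # [0.0, 0.5, 0.5]
```

On the first input the target value 0.3 exceeds one basis output (rank 1); on the second, 0.95 exceeds both (rank 2), giving the frequencies printed.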


7 Tests

This section outlines the results of tests on two sets of data. The first data set contains information on credit card users. The second data set contains information on individuals at risk for diabetes. For each data set, the tests compare out-of-sample error bounds produced by linear programming to error bounds produced by applying VC theory [Vapnik and Chervonenkis 1971] directly.

In the first data set, each example corresponds to a credit card user. There are five inputs, corresponding to user traits. A sixth trait is used as the target function. The traits are unknown because the data provider has chosen to keep them secret. The data were obtained from the machine-learning database site at the University of California at Irvine (www.ics.uci.edu/pub/machine-learning-databases). The discrete-valued traits were removed, leaving the six continuous-valued traits. Of the 690 examples in the original database, 24 examples had at least one trait missing. These examples were removed, leaving 666 examples. The data were cleaned by Joseph Sill [Sill and Abu-Mostafa 1997]. The target function values were constrained to the range $[-1.25, 1.75]$. In each test, basis function and hypothesis function outputs were also constrained to this range, producing the squared error range $[s, t] = [0, 9]$.

In each test, the 666 examples were randomly partitioned into 444 training examples, $D = 111$ validation examples, and $N = 111$ out-of-sample examples. In each test, $M = 10$ basis functions were trained by early stopping. Then a hypothesis function was formed by linear fusion of the basis function outputs. For each basis function, the training data were randomly partitioned into 400 examples used for actual training and 44 examples used for early stopping. The basis functions were artificial neural networks with five input units, five hidden units, and one output unit. The hidden units had tanh activation functions.
The initial weights were selected independently and uniformly at random from $[-0.1, 0.1]$. Each network was trained by gradient descent on mean squared error over training examples, using sequential mode weight updates with random order of example presentation in each epoch. A snapshot of the weights was recorded after each epoch. The snapshot with minimum error on the 44 early stopping examples was returned as the trained basis function.

In each test, the hypothesis function was formed by finding the linear combination of basis function outputs with least squared error over the validation data. (For details on least-squares minimization, refer to [Franklin 1993], pp. 50-55.) Uniform error bounds with 90% confidence were used for both the linear program bound and the direct VC bound. For the linear program, uniform error bounds over the basis functions are attained for any $\epsilon$ such that
$$1 - 10 e^{-2 \cdot 111 \cdot \epsilon^2} \geq 0.90. \quad (32)$$

(32)

The best solution is $\epsilon = 0.144$. The target function range $[-1.25, 1.75]$ was partitioned into $K = 1000$ intervals to form the linear program.

For the VC bound, we require uniform bounds over the class of 10-dimensional hyperplanes, with 111 validation examples and 111 out-of-sample examples. Using the growth function for linear indicator functions in [Vapnik 1998], pp. 156-159, and following the bound derived for general linear functions in [Vapnik 1998], p. 192, produces the formula for the minimum $\epsilon$ that gives 90% confidence:
$$\epsilon = \sqrt{\frac{\ln m(2 \cdot 111) - \ln 0.10}{2 \cdot 111}}, \quad (33)$$

where $m(\cdot)$ is the growth function for 10-dimensional hyperplanes. The solution is $\epsilon = 0.392$. Hence, the direct VC out-of-sample error bound is the error over the validation data plus $\epsilon(t - s) = 0.392 \cdot 9$. Table 1 shows the results for each test. Note that the linear program consistently produces out-of-sample error bounds that are tighter than direct VC bounds.

In the second data set, each example corresponds to a patient tested for diabetes. The seven inputs include basic personal information and results of simple tests. The target function is the result of an oral glucose tolerance test, an indicator of diabetes mellitus. The data were obtained from the machine-learning database repository at the University of California at Irvine. All function values were constrained to the range $[0, 200]$. So the squared error was constrained to the range $[s, t] = [0, 40000]$.

In each test, the 768 examples were randomly partitioned into 368 training examples, $D = 200$ validation examples, and $N = 200$ out-of-sample examples. In each test, $M = 5$ basis functions were trained by early stopping in a similar manner to the tests on credit card data. The training examples were partitioned into 300 for actual training and 68 for early stopping. The basis functions were networks with seven input units, four hidden units, and a single output unit. The hidden units had tanh activation functions. The initial weights were selected independently and uniformly at random from $[-20, 20]$. In each test, the hypothesis function was formed by finding the linear combination of basis function outputs with least squared error over the validation data. The target function range was partitioned into $K = 100$ subranges to form the linear program for each test. Both the linear program error bounds and the direct VC error bounds were based on uniform error bounds with 90% confidence.
For the linear program bound, $\epsilon = 0.099$ achieves this confidence, since there are $M = 5$ basis functions, $D = 200$ validation examples, and $N = 200$ out-of-sample examples. For the VC bound, $\epsilon = 0.244$ is required to have 90% confidence in uniform bounds over the class of five-dimensional linear functions. Table 2 shows the results for each test. Note that the linear program consistently produces tighter out-of-sample error bounds than the direct VC bounds.


Test   Validation   Linear Program Out-of-Sample   Direct VC Out-of-Sample
       Error        Error Bound                    Error Bound
1      0.412        2.45                           3.94
2      0.571        2.30                           4.10
3      0.566        2.59                           4.09
4      0.566        2.61                           4.09
5      0.418        2.48                           3.94
6      0.471        2.81                           4.00
7      0.466        2.56                           3.99
8      0.559        3.02                           4.08
9      0.451        2.28                           3.98
10     0.472        2.41                           4.00
avg    0.495        2.55                           4.02

Table 1: Test results for credit data. Errors and error bounds are for the hypothesis function, which is a linear combination of the outputs of 10 basis functions. The maximum possible error score is 9.

Test   Validation   Linear Program Out-of-Sample   Direct VC Out-of-Sample
       Error        Error Bound                    Error Bound
1      841          5863                           10592
2      795          5464                           10546
3      844          5720                           10595
4      1016         6655                           10767
5      796          5791                           10547
6      940          5694                           10691
7      879          5645                           10629
8      829          5444                           10579
9      825          5615                           10576
10     970          5633                           10721
avg    874          5752                           10624

Table 2: Test results for diabetes data. Errors and error bounds are for the hypothesis function, which is a linear combination of the outputs of 5 basis functions. The maximum possible error score is 40000.


8 Conclusion

We have developed an algorithm to compute out-of-sample error bounds for hypothesis functions formed by combining the outputs of trained basis functions. The algorithm allows all in-sample data to play a role in the development of the hypothesis function. The algorithm uses a linear program to infer a bound for the hypothesis function from uniform bounds for the basis functions.

One direction for future research is to analyze how the composition of the set of basis functions affects the error bound produced by the algorithm. The number of basis functions mediates a tradeoff. Removing basis functions removes constraints from the linear program. However, nearly identical basis functions correspond to nearly identical constraints. Perhaps near-duplicate basis functions should be removed from the set. Validating fewer basis functions would result in stronger uniform bounds and hence tighter constraints. Since the basis functions must be selected without reference to the validation data, similarities among prospective basis functions should be detected using training data or randomly generated inputs.

Another idea is to apply the linear program to the basis functions themselves to derive stronger basis function bounds before applying the algorithm to the hypothesis function. First, compute uniform basis function bounds using validation data. Next, cast the first basis function in the role of a hypothesis function, form the linear program, and solve for a new basis function bound. If the new bound is stronger than the original, then replace it. Do the same for the remaining basis functions. Repeat this process until no basis function bounds can be improved. Then use a linear program based on the resulting basis function bounds to produce a hypothesis function bound.

Finally, it would be interesting to extend the linear program bounds to derive uniform bounds over classes of hypothesis functions.
For example, to derive uniform bounds over the class of linear functions of some dimension, use a small fraction of the functions as basis functions to infer uniform bounds for the remaining functions in the class. Of course, it is not feasible to explicitly form and solve the linear program for each member of the class. Instead, it would be necessary to use analysis or statistical sampling to derive uniform bounds on differences between validation errors and linear program solutions. Still, the resulting bounds may be stronger than direct uniform bounds over all functions in the class.


9 Acknowledgements

Thanks to the Learning Systems Group at Caltech for interesting problems. Thanks to Dr. Joel Franklin at Caltech for effective solutions. Thanks to Dr. Van Bowen at the University of Richmond, Dr. Zehra Cataltepe at Bell Labs, and Dr. Sam Roweis at University College London for useful pointers and helpful discussions.


References

[Bax 1998] E. Bax, Validation of voting committees, Neural Computation, 10(4):975-986.

[Bax, Cataltepe, and Sill 1997] E. Bax, Z. Cataltepe, and J. Sill, Alternative error bounds for the classifier chosen by early stopping, Proc. IEEE Pacific Rim Conf. on Communications, Computers, and Signal Processing, Victoria, B.C., Canada, 811-814.

[Breiman 1992] L. Breiman, Stacked regressions, Tech. Rep. No. 367, Statistics Dept., Univ. of California at Berkeley.

[Franklin 1980] J. Franklin, Methods of Mathematical Economics, Springer-Verlag New York, Inc.

[Franklin 1993] J. Franklin, Matrix Theory, Prentice Hall, Inc., Englewood Cliffs, New Jersey.

[Garey and Johnson 1979] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman and Company, New York, p. 245.

[Haykin 1999] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, Inc., Upper Saddle River, New Jersey, Ch. 7.

[Hoeffding 1963] W. Hoeffding, Probability inequalities for sums of bounded random variables, Am. Stat. Assoc. J., 13-30.

[Karp 1972] R. M. Karp, Reducibility among combinatorial problems, in R. E. Miller and J. W. Thatcher (eds.), Complexity of Computer Computations, Plenum Press, New York, pp. 85-103.

[Khachian 1979] L. G. Khachian, A polynomial algorithm for linear programming, Doklady Akad. Nauk USSR, 244, 5:1093-96. Translated in Soviet Math Doklady, 20:191-94.

[Kim and Bartlett 1995] K. Kim and E. B. Bartlett, Error estimation by series association for neural network systems, Neural Computation, 7:799-808.

[Papadimitriou and Steiglitz 1982] C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Prentice-Hall, Inc., Englewood Cliffs, New Jersey.

[Sill and Abu-Mostafa 1997] J. Sill and Y. Abu-Mostafa, Monotonicity hints, in M. C. Mozer, M. I. Jordan, and T. Petsche (eds.), Advances in Neural Information Processing Systems 9, pp. 634-660, MIT Press, Cambridge, MA.

[Sridhar, Seagrave, and Bartlett 1996] D. V. Sridhar, R. C. Seagrave, and E. B. Bartlett, Process modeling using stacked neural networks, AIChE Journal, 42(9):2529-2539.

[Vapnik 1998] V. N. Vapnik, Statistical Learning Theory, John Wiley and Sons, Inc.

[Vapnik and Chervonenkis 1971] V. N. Vapnik and A. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities, Theory Prob. Appl., 16:264-280.

[Wolpert 1992] D. H. Wolpert, Stacked generalization, Neural Networks, 5:241-259.

In the next section, we review the uniform validation technique used to obtain uniform error bounds over the basis functions. In the following section, we develop the linear program to infer hypothesis function error bounds from basis function error bounds. We show how the method extends beyond fusion to produce error bounds for early stopping and uniform error bounds over multiple hypothesis functions. Then we discuss the intuition behind the linear program. Also, we extend the linear programming technique by showing how to develop additional constraints. Finally, we present test results comparing bounds computed by linear programming to direct error bounds.

This work extends earlier results for classification problems to regression problems. Refer to [Bax 1998] for the results on classifiers. For more information on fusion, refer to [Breiman 1992, Haykin 1999, Kim and Bartlett 1995, Sridhar, Seagrave, and Bartlett 1996, Wolpert 1992]. The linear programming technique developed here is an example of a general strategy called validation by inference. For more information on validation by inference, refer to [Bax, Cataltepe, and Sill 1997].

2 Review of Uniform Validation

Denote the basis functions by g_1, ..., g_M. Denote the target function by f. Let E(g(x), f(x)) be the error score for input x. Assume that E is monotonic in |g(x) − f(x)| and that E has range [s, t]. Define the error score of g on a data set as the average of E(g(x), f(x)) over the inputs. Let ν be the error score of a basis function over the validation data, and let ν′ be the (unknown) error score over the out-of-sample data. Let D be the number of validation examples, and let N be the number of out-of-sample examples. Denote the negatives of the error scores on individual validation examples by random variables X_1, ..., X_D, and denote the error scores on individual out-of-sample examples by random variables X_{D+1}, ..., X_{D+N}. Note that

Pr{ν′ ≥ ν + ε(t − s)} = Pr{(X_1 + ... + X_D)/D + (X_{D+1} + ... + X_{D+N})/N ≥ ε(t − s)}.   (1)

Using a result by Hoeffding [Hoeffding 1963], p.16, Eq. 2.7,

Pr{ν′ ≥ ν + ε(t − s)} ≤ e^{−2ε²(D⁻¹ + N⁻¹)⁻¹}.   (2)

If D = N, then we can derive a stronger and simpler result. In this case,

Pr{ν′ ≥ ν + ε(t − s)} = Pr{(X_1 + ... + X_{2N})/N ≥ ε(t − s)}.   (3)

So

Pr{ν′ ≥ ν + ε(t − s)} = Pr{(X_1 + ... + X_{2N})/(2N) ≥ (ε/2)(t − s)}.   (4)

Using another result by Hoeffding [Hoeffding 1963], p.16, Eq. 2.6,

Pr{ν′ ≥ ν + ε(t − s)} ≤ e^{−2Nε²}.   (5)

For the remainder of this paper, assume D = N. In practice, if D ≠ N, then substitute (2) for (5). Let ν_m and ν′_m represent the validation and out-of-sample error scores for basis function g_m. Using the sum of probabilities to bound the probability of the union event,

Pr{ν′_1 ≥ ν_1 + ε(t − s) or ... or ν′_M ≥ ν_M + ε(t − s)} ≤ Me^{−2Nε²}.   (6)

In other words, with probability at least 1 − Me^{−2Nε²}, for all basis functions, the out-of-sample error score is at most ε(t − s) greater than the validation error score. Refer to the out-of-sample error bound ν_m + ε(t − s) as b_m.
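For a given confidence level 1 − δ, Formula 6 can be inverted to find the smallest ε with Me^{−2Nε²} ≤ δ, namely ε = √(ln(M/δ)/(2N)). The following sketch (an illustration, not part of the method itself) computes this ε; the two calls reproduce the values used in the tests reported later in the paper.

```python
import math

def uniform_epsilon(M, N, delta):
    """Smallest epsilon with M * exp(-2 * N * epsilon**2) <= delta,
    so all M basis function bounds hold with confidence 1 - delta."""
    return math.sqrt(math.log(M / delta) / (2 * N))

# 10 basis functions, 111 validation/out-of-sample examples, 90% confidence
print(round(uniform_epsilon(10, 111, 0.10), 3))  # 0.144
# 5 basis functions, 200 examples, 90% confidence
print(round(uniform_epsilon(5, 200, 0.10), 3))   # 0.099
```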


3 The Linear Program

4 Development of the Linear Program

Let h be the hypothesis function. Partition the range of the target function into subranges S_1, ..., S_K. For each out-of-sample input x_n and subrange S_k, define

y_nk = 1 if f(x_n) ∈ S_k, and y_nk = 0 otherwise.   (7)

Define e_mnk to be the greatest lower bound on the error score for g_m if f(x_n) ∈ S_k, i.e.,

e_mnk = inf_{v ∈ S_k} E(g_m(x_n), v).   (8)

Similarly, let c_nk be the least upper bound on the error score for hypothesis function h if f(x_n) ∈ S_k, i.e.,

c_nk = sup_{v ∈ S_k} E(h(x_n), v).   (9)

The out-of-sample error score for h is at most

(1/N) Σ_{n=1}^N Σ_{k=1}^K c_nk y_nk.   (10)

The out-of-sample error bounds b_1, ..., b_M for the basis functions imply the constraints

∀m ∈ {1, ..., M}   (1/N) Σ_{n=1}^N Σ_{k=1}^K e_mnk y_nk ≤ b_m.   (11)

Hence, the solution of the following integer linear program (ILP) is an upper bound on the out-of-sample error for the hypothesis function:

maximize (1/N) Σ_{n=1}^N Σ_{k=1}^K c_nk y_nk   (12)

subject to ∀m ∈ {1, ..., M}   (1/N) Σ_{n=1}^N Σ_{k=1}^K e_mnk y_nk ≤ b_m,   (13)

∀n ∈ {1, ..., N}   Σ_{k=1}^K y_nk = 1,   (14)

and ∀(n, k) ∈ {1, ..., N} × {1, ..., K}   y_nk ∈ {0, 1}.   (15)

Solving this program may require extensive computation since integer linear programming is NP-complete [Garey and Johnson 1979, Karp 1972]. To produce a linear program, replace the integer constraints (15) by the constraints

∀(n, k) ∈ {1, ..., N} × {1, ..., K}   0 ≤ y_nk ≤ 1.   (16)

This weakens the constraints, but it ensures that the program has a tractable solution [Papadimitriou and Steiglitz 1982, Khachian 1979]. (For algorithms to solve linear programs, refer to [Franklin 1980], pp. 79-103.)
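The relaxed program (12)-(14), (16) can be set up directly with an off-the-shelf LP solver. The sketch below uses scipy.optimize.linprog on a toy instance with squared error; the basis outputs, hypothesis outputs, partition, and bounds b_m are all made up for illustration, and the inf/sup coefficients exploit the fact that squared error over an interval is extremal at the interval endpoints (or zero when the output lies inside the interval).

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance: M basis functions and hypothesis h evaluated on N
# out-of-sample inputs; squared error, target range [0, 1] split into
# K equal subranges.  All values here are illustrative.
M, N, K = 2, 4, 10
edges = np.linspace(0.0, 1.0, K + 1)          # subrange boundaries
g = np.array([[0.2, 0.4, 0.6, 0.8],           # basis function outputs
              [0.3, 0.5, 0.5, 0.7]])
h = np.array([0.25, 0.45, 0.55, 0.75])        # hypothesis outputs
b = np.array([0.05, 0.05])                    # basis error bounds b_m

def inf_sq(u, lo, hi):  # greatest lower bound of (u - v)^2 for v in [lo, hi]
    return 0.0 if lo <= u <= hi else min((u - lo) ** 2, (u - hi) ** 2)

def sup_sq(u, lo, hi):  # least upper bound of (u - v)^2 for v in [lo, hi]
    return max((u - lo) ** 2, (u - hi) ** 2)

# Coefficients e_mnk and c_nk, flattened over (n, k) with n major.
e = np.array([[inf_sq(g[m, n], edges[k], edges[k + 1])
               for n in range(N) for k in range(K)] for m in range(M)])
c = np.array([sup_sq(h[n], edges[k], edges[k + 1])
              for n in range(N) for k in range(K)])

# maximize (1/N) c.y  <=>  minimize -(1/N) c.y
A_ub = e / N                                  # (1/N) sum e_mnk y_nk <= b_m
A_eq = np.kron(np.eye(N), np.ones(K))         # sum_k y_nk = 1 for each n
res = linprog(-c / N, A_ub=A_ub, b_ub=b, A_eq=A_eq, b_eq=np.ones(N),
              bounds=(0.0, 1.0), method="highs")
bound = -res.fun                              # out-of-sample error bound for h
print(res.status, round(bound, 4))
```

With K = 1000 intervals and hundreds of inputs, as in the tests later in the paper, the same construction simply produces a larger (but still polynomial-size) program.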

4.1 Use of the Linear Program

Use the following process to develop basis functions and produce out-of-sample error bounds for a hypothesis function formed by fusion of basis function outputs. Partition the in-sample data into training data and validation data. Use the training data to develop a collection of basis functions. Choose some ε > 0, and use the validation data to compute out-of-sample error bounds for the basis functions according to Formula 6. Use the training and validation data to select the hypothesis function from the class of prospective hypothesis functions. Solve the linear program corresponding to the basis functions, the hypothesis function, and the out-of-sample inputs. The solution is an out-of-sample error bound for the hypothesis function if the basis function bounds are all valid. By Formula 6, the probability that the basis function bounds are all valid is at least 1 − Me^{−2Nε²}, where M is the number of basis functions, and N is the number of validation examples and the number of out-of-sample examples.

Note that the linear program bound is valid even if the hypothesis function is not the result of fusion of the basis functions. For example, an out-of-sample error bound for the hypothesis function chosen by early stopping can be computed as follows. Partition the in-sample data into a training set and a validation set. Choose a function at random. Use an iterative method to develop a function that approximates the training data well. Record the sequence of functions produced by the iterations. Sample these functions at intervals to form the set of basis functions. Use the validation data to compute uniform out-of-sample error bounds for the basis functions. Now compute the error over the validation data for every function in the iteration sequence. The function with minimum validation error is the function chosen by early stopping. It is our hypothesis function. With probability at least 1 − Me^{−2Nε²}, the solution to the linear program is a valid out-of-sample error bound. (For further details, refer to [Bax, Cataltepe, and Sill 1997].)

Note that if the uniform basis function bounds are valid, then, for every hypothesis function, the solution of the corresponding linear program is a valid out-of-sample error bound. Uniform basis function bounds logically imply uniform hypothesis function bounds. Thus, we can use linear programs to compute uniform out-of-sample error bounds for any number of hypothesis functions. With probability at least 1 − Me^{−2Nε²}, the bounds are all valid.

4.2 Bounds Without Out-of-Sample Inputs

Briefly, consider the case in which out-of-sample inputs are unknown, but there is a method to generate inputs at random according to the input distribution. Generate random sets of inputs having the same size as the out-of-sample data set. For each random data set, form and solve the linear program. Use these sample solutions to compute a probabilistic bound on the solution of the linear program for the out-of-sample inputs. For example, generate 100 random data sets and solve them to generate 100 sample linear program solutions. Let v be the value of the fifth greatest sample solution. Then generate K more sample solutions. Let p be the fraction of these solutions with value greater than v. Let γ be the probability that the solution for a random data set has value greater than v. By Hoeffding [Hoeffding 1963], p.16, Eq. 2.6,

Pr{γ ≥ p + ε̂} ≤ e^{−2Kε̂²}.   (17)

Let w be the (unknown) linear program solution for the out-of-sample data. Recall from Formula 6 that the probability that the linear program solution w is not a valid out-of-sample error bound is at most Me^{−2Nε²}. Hence, the probability that v is a valid out-of-sample error bound is at least

1 − [Pr{γ ≥ p + ε̂} + Pr{w > v | γ < p + ε̂} + Me^{−2Nε²}]   (18)

≥ 1 − [e^{−2Kε̂²} + (p + ε̂) + Me^{−2Nε²}].   (19)
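The sampling procedure above can be sketched as follows. The function solve_lp is a hypothetical stand-in for forming and solving the linear program on a freshly generated input set; here it returns synthetic values so the sketch is self-contained.

```python
import math
import random

random.seed(0)

def solve_lp(inputs):
    # Hypothetical stand-in for forming and solving the linear program
    # on a random input set; returns a synthetic solution value.
    return random.gauss(2.5, 0.3)

first = sorted(solve_lp(None) for _ in range(100))
v = first[-5]                      # fifth greatest of 100 sample solutions

K = 1000
more = [solve_lp(None) for _ in range(K)]
p = sum(s > v for s in more) / K   # fraction of K further solutions above v

# Confidence from (19), with illustrative epsilon-hat and the M, N,
# epsilon values used for the credit card tests later in the paper.
eps_hat, M, N, eps = 0.05, 10, 111, 0.144
confidence = 1 - (math.exp(-2 * K * eps_hat ** 2)
                  + (p + eps_hat)
                  + M * math.exp(-2 * N * eps ** 2))
print(round(v, 3), round(confidence, 3))
```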


5 Intuition

We begin with a simple example to demonstrate the process of validation by inference. Consider a single basis function, g_1, and a single out-of-sample input, x_1. Use mean-squared difference as the error function. Suppose we have determined that the error of g_1 on x_1 is no more than 0.01, with probability at least 90%, i.e.,

Pr{[g_1(x_1) − f(x_1)]² ≤ 0.01} ≥ 90%.   (20)

Our goal is to compute an error bound for a hypothesis function h. Suppose we find that h(x_1) = 0.3 and g_1(x_1) = 0.2. If the error bound for g_1 holds, then |f(x_1) − g_1(x_1)| ≤ 0.1, so f(x_1) ∈ [0.1, 0.3]. The value in this range that maximizes the error of h(x_1) is f(x_1) = 0.1. In this case, the error is 0.04. Hence, with at least 90% probability, the error for the hypothesis function is at most 0.04. This example illustrates the following procedure for bounding by inference:

1. Establish error bounds for basis functions.

2. Use the bounds and the basis function values on out-of-sample inputs to constrain the target function.

3. Find the value of the target function that satisfies the constraints and maximizes hypothesis function error.

Now extend the example to include an additional basis function, g_2. Suppose that we have uniform error bounds for the basis functions:

Pr{[g_1(x_1) − f(x_1)]² ≤ 0.01 and [g_2(x_1) − f(x_1)]² ≤ 0.01} ≥ 90%.   (21)

Recall g_1(x_1) = 0.2. Suppose g_2(x_1) = 0.35. Then, with probability at least 90%, f(x_1) ∈ [0.1, 0.3] ∩ [0.25, 0.45]. Refer to [0.1, 0.3] and [0.25, 0.45] as constraint regions. Refer to [0.25, 0.3] as the feasible region. The target function value in the feasible region that maximizes hypothesis function error is 0.25. In this case, the error is 0.0025. So, with probability at least 90%, the hypothesis function has error 0.0025 or less.

Now extend the example to include an additional out-of-sample input, x_2. For simplicity, assume that the basis functions and the hypothesis function have the same values on x_2 as on x_1, i.e., g_1(x_1) = g_1(x_2) = 0.2, g_2(x_1) = g_2(x_2) = 0.35, and h(x_1) = h(x_2) = 0.3. Once again, assume we have uniform error bounds for the basis functions, i.e.,

Pr{∀m ∈ {1, 2}   (1/2)([g_m(x_1) − f(x_1)]² + [g_m(x_2) − f(x_2)]²) ≤ 0.01} ≥ 90%.   (22)

The basis function error bounds constrain the target function values (f(x_1), f(x_2)) to circles centered at the basis function values (g_1(x_1), g_1(x_2)) and (g_2(x_1), g_2(x_2)). The intersection of these constraint circles is the feasible region. The point in the feasible region with maximum distance from the hypothesis function values (h(x_1), h(x_2)) is the pair of target function values that produces the largest error for the hypothesis function among all target function pairs that satisfy the constraints. (If the feasible region is empty, then the uniform error bounds for the basis functions contain contradictions, so the basis function bounds are not valid.)

For our example, the radius of each constraint circle is √0.02, and the centers are (0.20, 0.20) and (0.35, 0.35). In the intersection of the constraint circles, the most distant points from the hypothesis function point (0.30, 0.30) are distance 0.1 away. Thus, the basis function bounds imply the hypothesis function bound

(1/2)([h(x_1) − f(x_1)]² + [h(x_2) − f(x_2)]²) ≤ 0.005.   (23)

Note that in this simple example the hypothesis function bound is tighter than the basis function bounds. Additional basis functions contribute additional constraints, and additional data contribute additional dimensions. In general, the feasible region is the intersection of M N-dimensional spheres, where M is the number of basis functions and N is the number of out-of-sample inputs. To allow a solution by linear programming, we partition the range to discretize the geometric problem of finding a point in the feasible region with greatest distance from the point given by the hypothesis function values on the out-of-sample inputs. In general, the solution of the linear program yields error bounds inferior to those produced by the exact solution of the geometric problem. Subdividing the partitions can produce superior error bounds, but this requires solving a linear program with more variables.
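The two-input example can be checked numerically. The grid search below (an illustration, not part of the paper's method) scans the intersection of the two constraint circles and recovers the maximum distance 0.1 from the hypothesis point, hence the mean-squared error bound 0.005.

```python
import numpy as np

# Dense grid over a square containing both constraint circles.
xs = np.linspace(0.0, 0.6, 1201)
X, Y = np.meshgrid(xs, xs)

r2 = 0.02  # squared radius of each constraint circle
feasible = (((X - 0.20) ** 2 + (Y - 0.20) ** 2 <= r2)
            & ((X - 0.35) ** 2 + (Y - 0.35) ** 2 <= r2))

# Maximum distance from the hypothesis point (0.3, 0.3) over the
# feasible region; the exact answer is 0.1, attained at the lens tips.
d = np.sqrt((X - 0.30) ** 2 + (Y - 0.30) ** 2)
d_max = d[feasible].max()
print(round(d_max, 3), round(d_max ** 2 / 2, 4))  # ~0.1 and ~0.005
```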


6 Extending the Linear Program

One strategy to derive additional constraints for the linear program is to validate some aspect of the behavior of the target function over the out-of-sample data, then identify constraints implied by the pattern of behavior. Validating upper bounds on the difference in behavior between the target function and the basis functions (the out-of-sample error) yields the original constraints. Additional constraints can be derived by validating lower bounds on out-of-sample error. Extend the uniform upper bounds for basis function errors in Formula 6 to include lower bounds:

Pr{∀m ∈ {1, ..., M}   |ν′_m − ν_m| < ε(t − s)} ≥ 1 − 2Me^{−2Nε²}.   (24)

The lower bounds imply the following additional constraints for the linear program:

∀m ∈ {1, ..., M}   (1/N) Σ_{n=1}^N Σ_{k=1}^K e′_mnk y_nk ≥ b′_m,   (25)

where b′_m = ν_m − ε(t − s) is the lower bound corresponding to upper bound b_m, and e′_mnk is the least upper bound corresponding to greatest lower bound e_mnk.

More constraints can be derived by validating the frequency distribution of the target function output. Recall that the linear program is based on partitioning the range of the target function into subranges S_1, ..., S_K. Let p_k be the fraction of validation inputs for which the target function is in S_k. Let p′_k be the (unknown) fraction of out-of-sample inputs for which the target function is in S_k. Refer to p_k and p′_k as placement frequencies. Extend the uniform bounds for basis function errors to include bounds for placement frequencies:

Pr{∀m ∈ {1, ..., M}   |ν′_m − ν_m| ≤ ε(t − s) and ∀k ∈ {1, ..., K}   |p′_k − p_k| < ε}   (26)

≥ 1 − 2(M + K)e^{−2Nε²}.   (27)

The placement frequency bounds imply the following additional constraints for the linear program:

∀k ∈ {1, ..., K}   p_k − ε ≤ (1/N) Σ_{n=1}^N y_nk ≤ p_k + ε.   (28)

Other constraints can be derived by validating the rank of the target function output among basis function outputs. For m ∈ {0, ..., M}, let r_m be the fraction of validation inputs for which exactly m of the basis function outputs are less than or equal to the target function value. Let r′_m be the corresponding fraction for out-of-sample inputs. Refer to r_m and r′_m as rank frequencies. Extend the uniform bounds for basis function errors to include bounds for rank frequencies:

Pr{∀m ∈ {1, ..., M}   |ν′_m − ν_m| ≤ ε(t − s) and ∀m ∈ {0, ..., M}   |r′_m − r_m| ≤ ε}   (29)

≥ 1 − 2(M + M + 1)e^{−2Nε²}.   (30)

Use the following partitioning scheme to produce the linear program. Let the number of subranges be one more than the number of basis functions, i.e., let K = M + 1. For each input, for each k ∈ {1, ..., K}, define S_k to be the portion of the range containing values greater than exactly k − 1 basis function outputs. This partitioning scheme maintains the original constraints implied by the basis function error bounds, and the rank frequency bounds imply the following additional constraints:

∀k ∈ {1, ..., K}   r_{k−1} − ε ≤ (1/N) Σ_{n=1}^N y_nk ≤ r_{k−1} + ε.   (31)

Within this partitioning framework, there are many frequency distributions that can be validated to produce further constraints. Examples include the rank frequencies for a subset of the basis functions and the frequency with which the target function output is between the outputs of some pair of basis functions. To increase the number of partitions within this framework, subdivide each subrange S_k into equal-sized sub-subranges, then adjust the constraints accordingly.

Choose the constraints for a given problem without reference to the validation data and the out-of-sample inputs. This restriction ensures the validity of the Hoeffding bounds for uniform validation of the properties that imply the constraints. Choose the nature of the constraints using training data and prior knowledge about the problem. Choose the number of constraints according to the number of validation examples and out-of-sample inputs available. More constraints mean weaker individual constraints, because more constraints require uniform validation of more properties.
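As a small illustration of the placement-frequency constraints (28), the snippet below computes the frequencies p_k from validation targets and assembles the interval bounds on the column sums of y. The validation values and partition are made up for the example.

```python
import numpy as np

# Made-up validation targets in [0, 1] and a partition into K subranges.
f_val = np.array([0.12, 0.48, 0.51, 0.55, 0.83, 0.90, 0.07, 0.49])
K, eps = 4, 0.1
edges = np.linspace(0.0, 1.0, K + 1)

# Placement frequencies p_k over the validation data.
counts, _ = np.histogram(f_val, bins=edges)
p = counts / len(f_val)

# Constraint (28): for each k, p_k - eps <= (1/N) sum_n y_nk <= p_k + eps.
lower, upper = p - eps, p + eps
print(p, lower, upper)
```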


7 Tests

This section outlines the results of tests on two sets of data. The first data set contains information on credit card users. The second data set contains information on individuals at risk for diabetes. For each data set, the tests compare out-of-sample error bounds produced by linear programming to error bounds produced by applying VC theory [Vapnik and Chervonenkis 1971] directly.

In the first data set, each example corresponds to a credit card user. There are five inputs, corresponding to user traits. A sixth trait is used as the target function. The traits are unknown because the data provider has chosen to keep them secret. The data were obtained from the machine-learning database site at the University of California at Irvine (www.ics.uci.edu/pub/machine-learning-databases). The discrete-valued traits were removed, leaving the six continuous-valued traits. Of the 690 examples in the original database, 24 examples had at least one trait missing. These examples were removed, leaving 666 examples. The data were cleaned by Joseph Sill [Sill and Abu-Mostafa 1997]. The target function values were constrained to the range [−1.25, 1.75]. In each test, basis function and hypothesis function outputs were also constrained to this range, producing the squared error range [s, t] = [0, 9]. In each test, the 666 examples were randomly partitioned into 444 training examples, D = 111 validation examples, and N = 111 out-of-sample examples.

In each test, M = 10 basis functions were trained by early stopping. Then a hypothesis function was formed by linear fusion of the basis function outputs. For each basis function, the training data were randomly partitioned into 400 examples used for actual training and 44 examples used for early stopping. The basis functions were artificial neural networks with five input units, five hidden units, and one output unit. The hidden units had tanh activation functions. The initial weights were selected independently and uniformly at random from [−0.1, 0.1]. Each network was trained by gradient descent on mean squared error over training examples, using sequential-mode weight updates with random order of example presentation in each epoch. A snapshot of the weights was recorded after each epoch. The snapshot with minimum error on the 44 early stopping examples was returned as the trained basis function. In each test, the hypothesis function was formed by finding the linear combination of basis function outputs with least squared error over the validation data. (For details on least-squares minimization, refer to [Franklin 1993], pp. 50-55.)

Uniform error bounds with 90% confidence were used for both the linear program bound and the direct VC bound. For the linear program, uniform error bounds over the basis functions are attained for ε small enough that

1 − 10e^{−2·111·ε²} ≥ 0.90.   (32)

The best solution is ε = 0.144. The target function range was partitioned into K = 1000 intervals to form the linear program.

For the VC bound, we require uniform bounds over the class of 10-dimensional hyperplanes, with 111 validation examples and 111 out-of-sample examples. Using the growth function for linear indicator functions in [Vapnik 1998], pp. 156-159, and following the bound derived for general linear functions in [Vapnik 1998], p.192, produces the formula for the minimum ε that gives 90% confidence:

ε = √( (ln m_{10}(2·111) − ln 0.10) / (2·111) ),   (33)

where m_{10}(·) is the growth function for 10-dimensional hyperplanes. The solution is ε = 0.392. Hence, the direct VC out-of-sample error bound is the error over the validation data plus ε(t − s) = 0.392 · 9. Table 1 shows the results for each test. Note that the linear program consistently produces out-of-sample error bounds that are tighter than the direct VC bounds.

In the second data set, each example corresponds to a patient tested for diabetes. The seven inputs include basic personal information and results of simple tests. The target function is the result of an oral glucose tolerance test, an indicator of diabetes mellitus. The data were obtained from the machine-learning database repository at the University of California at Irvine. All function values were constrained to the range [0, 200], so the squared error was constrained to the range [s, t] = [0, 40000]. In each test, the 768 examples were randomly partitioned into 368 training examples, D = 200 validation examples, and N = 200 out-of-sample examples.

In each test, M = 5 basis functions were trained by early stopping in a similar manner to the tests on credit card data. The training examples were partitioned into 300 for actual training and 68 for early stopping. The basis functions were networks with seven input units, four hidden units, and a single output unit. The hidden units had tanh activation functions. The initial weights were selected independently and uniformly at random from [−20, 20]. In each test, the hypothesis function was formed by finding the linear combination of basis function outputs with least squared error over the validation data. The error score range was partitioned into K = 100 regions to form the linear program for each test.

Both the linear program error bounds and the direct VC error bounds were based on uniform error bounds with 90% confidence. For the linear program bound, ε = 0.099 achieves this confidence, since there are M = 5 basis functions, D = 200 validation examples, and N = 200 out-of-sample examples. For the VC bound, ε = 0.244 is required to have 90% confidence in uniform bounds over the class of five-dimensional linear functions. Table 2 shows the results for each test. Note that the linear program consistently produces tighter out-of-sample error bounds than the direct VC bounds.


Test  Validation Error  Linear Program           Direct VC
                        Out-of-Sample            Out-of-Sample
                        Error Bound              Error Bound
1     0.412             2.45                     3.94
2     0.571             2.30                     4.10
3     0.566             2.59                     4.09
4     0.566             2.61                     4.09
5     0.418             2.48                     3.94
6     0.471             2.81                     4.00
7     0.466             2.56                     3.99
8     0.559             3.02                     4.08
9     0.451             2.28                     3.98
10    0.472             2.41                     4.00
avg   0.495             2.55                     4.02

Table 1: Test results for credit data. Errors and error bounds are for the hypothesis function, which is a linear combination of the outputs of 10 basis functions. The maximum possible error score is 9.

Test  Validation Error  Linear Program           Direct VC
                        Out-of-Sample            Out-of-Sample
                        Error Bound              Error Bound
1     841               5863                     10592
2     795               5464                     10546
3     844               5720                     10595
4     1016              6655                     10767
5     796               5791                     10547
6     940               5694                     10691
7     879               5645                     10629
8     829               5444                     10579
9     825               5615                     10576
10    970               5633                     10721
avg   874               5752                     10624

Table 2: Test results for diabetes data. Errors and error bounds are for the hypothesis function, which is a linear combination of the outputs of 5 basis functions. The maximum possible error score is 40000.


8 Conclusion

We have developed an algorithm to compute out-of-sample error bounds for hypothesis functions formed by combining the outputs of trained basis functions. The algorithm allows all in-sample data to play a role in the development of the hypothesis function. The algorithm uses a linear program to infer a bound for the hypothesis function from uniform bounds for the basis functions.

One direction for future research is to analyze how the composition of the set of basis functions affects the error bound produced by the algorithm. The number of basis functions mediates a tradeoff. Removing basis functions removes constraints from the linear program. However, nearly identical basis functions correspond to nearly identical constraints. Perhaps near-duplicate basis functions should be removed from the set. Validating fewer basis functions would result in stronger uniform bounds and hence tighter constraints. Since the basis functions must be selected without reference to the validation data, similarities among prospective basis functions should be detected using training data or randomly generated inputs.

Another idea is to apply the linear program to the basis functions themselves to derive stronger basis function bounds before applying the algorithm to the hypothesis function. First, compute uniform basis function bounds using validation data. Next, cast the first basis function in the role of a hypothesis function, form the linear program, and solve for a new basis function bound. If the new bound is stronger than the original, then replace it. Do the same for the remaining basis functions. Repeat this process until no basis function bounds can be improved. Then use a linear program based on the resulting basis function bounds to produce a hypothesis function bound.

Finally, it would be interesting to extend the linear program bounds to derive uniform bounds over classes of hypothesis functions. For example, to derive uniform bounds over the class of linear functions of some dimension, use a small fraction of the functions as basis functions to infer uniform bounds for the remaining functions in the class. Of course, it is not feasible to explicitly form and solve the linear program for each member of the class. Instead, it would be necessary to use analysis or statistical sampling to derive uniform bounds on differences between validation errors and linear program solutions. Still, the resulting bounds may be stronger than direct uniform bounds over all functions in the class.


9 Acknowledgements

Thanks to the Learning Systems Group at Caltech for interesting problems. Thanks to Dr. Joel Franklin at Caltech for effective solutions. Thanks to Dr. Van Bowen at the University of Richmond, Dr. Zehra Cataltepe at Bell Labs, and Dr. Sam Roweis at University College London for useful pointers and helpful discussions.


References

[Bax 1998] E. Bax, Validation of voting committees, Neural Computation, 10(4):975-986.

[Bax, Cataltepe, and Sill 1997] E. Bax, Z. Cataltepe, and J. Sill, Alternative error bounds for the classifier chosen by early stopping, Proc. IEEE Pacific Rim Conf. on Communications, Computers, and Signal Processing, Victoria, B.C., Canada, 811-814.

[Breiman 1992] L. Breiman, Stacked regressions, Tech. Rep. No. 367, Statistics Dept., Univ. of California at Berkeley.

[Franklin 1980] J. Franklin, Methods of Mathematical Economics, Springer-Verlag New York, Inc.

[Franklin 1993] J. Franklin, Matrix Theory, Prentice Hall, Inc., Englewood Cliffs, New Jersey.

[Garey and Johnson 1979] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman and Company, New York. p.245.

[Haykin 1999] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, Inc., Upper Saddle River, New Jersey. Ch. 7.

[Hoeffding 1963] W. Hoeffding, Probability inequalities for sums of bounded random variables, Am. Stat. Assoc. J., 13-30.

[Karp 1972] R. M. Karp, Reducibility among combinatorial problems, in R. E. Miller and J. W. Thatcher (eds.), Complexity of Computer Computations, Plenum Press, New York. pp. 85-103.

[Khachian 1979] L. G. Khachian, "A polynomial algorithm for linear programming," Doklady Akad. Nauk USSR, 244, 5:1093-96. Translated in Soviet Math Doklady, 20:191-94.

[Kim and Bartlett 1995] K. Kim and E. B. Bartlett, Error estimation by series association for neural network systems, Neural Computation, 7:799-808.

[Papadimitriou and Steiglitz 1982] C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization (Algorithms and Complexity), Prentice-Hall, Inc., Englewood Cliffs, New Jersey.

[Sill and Abu-Mostafa 1997] J. Sill and Y. Abu-Mostafa, Monotonicity hints, in M. C. Mozer, M. I. Jordan, and T. Petsche (eds.), Advances in Neural Information Processing Systems 9, pp. 634-660. Cambridge, MA: MIT Press.

[Sridhar, Seagrave, and Bartlett 1996] D. V. Sridhar, R. C. Seagrave, and E. B. Bartlett, Process modeling using stacked neural networks, AIChE Journal, 42(9):2529-2539.

[Vapnik 1998] V. N. Vapnik, Statistical Learning Theory, John Wiley and Sons, Inc.

[Vapnik and Chervonenkis 1971] V. N. Vapnik and A. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities, Theory Prob. Appl., 16:264-280.

[Wolpert 1992] D. H. Wolpert, Stacked generalization, Neural Networks, 5:241-259.