
On Convergence Properties of Pocket Algorithm

Marco Muselli

The author is with the Istituto per i Circuiti Elettronici, Consiglio Nazionale delle Ricerche, 16149 Genova, Italy. E-mail: [email protected]

Abstract: The problem of finding optimal weights for a single threshold neuron starting from a general training set is considered. Among the variety of possible learning techniques, the pocket algorithm comes with a convergence theorem which asserts its optimality. Unfortunately, the original proof ensures the asymptotic achievement of an optimal weight vector only if the inputs in the training set are integer or rational. This limitation is overcome in this paper by introducing a different approach that leads to the general result. Furthermore, a modified version of the learning method considered, called pocket algorithm with ratchet, is shown to obtain an optimal configuration within a finite number of iterations independently of the given training set.

Keywords: Neural networks, optimal learning, pocket algorithm, perceptron algorithm, convergence theorems, threshold neuron.

I. Introduction

In recent years, the interest of the scientific community in neural networks has increased considerably; the main reason certainly lies in the successful results obtained in several application fields. Speech recognition [1], handwritten character reading [2], [3] and chaotic signal prediction [4] are only three among the variety of problems where neural networks have proven to be an efficient way of treating real-world data.

The robustness and reliability shown in these applications are also supported by many theoretical results which characterize neural networks as a promising tool for automatic statistical analysis. In particular, the representation ability of connectionist models has been well established: every (Borel) measurable function can be approximated to within an arbitrary precision by a feedforward neural network with a single hidden layer containing a sufficient number of units [5], [6], [7].

Furthermore, the application of risk minimization theory, developed in the 1970s [8], [9], has allowed an unambiguous definition of the concepts of learning and induction for neural networks. The basic hypothesis of this analysis is the existence of a noisy unknown function from which the input-output pairs of the training set have been extracted. The learning process attempts to estimate this underlying function starting from the available samples. Several induction principles can be introduced to control the learning process; three possible choices have been proposed by Vapnik [9], [10], [11]: empirical risk minimization, structural risk minimization and local risk minimization.

Although these three principles are essentially different, their computational kernel is based on the search for the weights of a neural network (with fixed architecture) that minimize the total error on the training set. In fact, it has been found that performing this task maximizes the probability of giving a good response to a new input pattern, provided that the VC dimension of the resulting configuration is simultaneously kept under control. However, such an optimization is very difficult to achieve because of the complexity of the given cost function: the task of deciding if a given function can be realized by a given neural network is NP-complete [12], [13].

Nevertheless, the use of local or global optimization algorithms, such as the steepest descent method or the conjugate gradient method, has allowed optimal or near-optimal results to be obtained in many interesting cases [14], [15].

If we limit ourselves to considering the case of a single neuron, the literature offers some optimal algorithms which minimize the cumulative error on the training set. When we are facing a regression problem, in which the unknown function is real-valued, the Widrow-Hoff rule or LMS (Least Mean Square) algorithm [15] is able to find the optimal solution if the loss function for the computation of the error is the usual MSE (Mean Square Error). Since the convergence is reached in an asymptotic way, at least in the worst case, the LMS algorithm will be called (asymptotically) optimal. The achievement of the desired solution is guaranteed by the presence of a single local-global minimum in the cost function, which turns out to be quadratic. So a local optimization method, like the steepest descent used by the LMS algorithm, allows optimal weights for the neuron to be obtained. Different techniques, such as the conjugate gradient methods [16], [17], can reach the minimum point (at least theoretically) in a finite number of steps: therefore we shall call them finite-optimal algorithms.

In the complementary case of classification problems, in which the output is binary-valued, the perceptron algorithm is finite-optimal [18], [19] only if the training set is linearly separable; in this case there exists a weight vector which satisfies all the input-output pairs of the training set. In the opposite case the perceptron algorithm cycles indefinitely in the set of feasible weight vectors. Some alternatives or modifications [20], [21], [22] have been proposed to stop the learning when the training set is not linearly separable, but none of these is able to provide the desired optimal solution with probability one.

An important possibility is offered by the pocket algorithm [23], [24]: it runs perceptron learning iteratively and holds in memory the weight vector which has remained unchanged for the greatest number of iterations. A convergence theorem ensures that the pocket algorithm is optimal if the components of the input patterns in the training set are integer or rational. In this case the probability that the saved weight vector, called pocket vector, is optimal approaches one as the number of iterations increases. This theoretical property has been widely employed to show the good qualities of some constructive methods which use the pocket algorithm as a basic component [25], [26], [27].

In classification problems a reasonable measure of error is given by the number of input patterns of the training set misclassified by the given neural network. Unfortunately, the pocket algorithm does not ensure that the error associated with the current pocket vector is smaller than that of previously saved configurations. Thus, its convergence properties can be improved if a proper check is executed before saving a new weight vector; this version of the method is called pocket algorithm with ratchet [24].

The aim of this paper is to extend the theoretical result provided by the convergence theorem of Gallant [23]. In particular, a new approach allows the asymptotic optimality of the pocket algorithm to be established for general values of the input patterns in the training set. Furthermore, a new theorem shows that the version with ratchet is finite-optimal, i.e. it reaches the desired configuration within a finite number of iterations.
An effective stopping criterion is however not available, since the minimum error that can be obtained by a threshold neuron on a given training set is not known a priori.

The statement and proof of the two convergence theorems, contained in section III, are preceded by a formal definition of optimality that will be employed for the theoretical analysis of the pocket algorithm (section II).

II. Optimality definitions

Let N(w) denote a neural network with a fixed architecture, whose input-output transformation depends only on the set of weights contained in a vector w. Furthermore, let ε(w) be the cumulative error scored by N(w) on a given training set S containing a finite number s of samples.

The criteria followed for the computation of such an error do not depend on the particular training set S, but only on the kind of problem we are dealing with (classification, regression, etc.). Suppose without loss of generality that ε(w) ≥ 0 for every weight vector w and denote with ε_min the infimum of ε(w):

    ε_min = inf_w ε(w)

A learning algorithm A for the neural network N(w) provides at every iteration t a weight vector w_A(t) which depends on the particular rule chosen to minimize the cumulative error ε. Thus we can introduce the following definitions, where the probability P is defined over the possible vector sequences w_A(t) generated by the given procedure.

Definition 1: A learning algorithm A for the neural network N(w) is called (asymptotically) optimal if:

    lim_{t→+∞} P(ε(w_A(t)) − ε_min < η) = 1  for every η > 0   (1)

independently of the given training set S.

Definition 2: A learning algorithm A for the neural network N(w) is called finite-optimal if there exists t̄ such that:

    P(ε(w_A(t)) = ε_min for every t ≥ t̄) = 1   (2)

independently of the given training set S.

In the case of the pocket algorithm the neural network N(w) is formed by a single threshold neuron whose input weights and bias are contained in the vector w. The activation function for this neuron is bipolar and is given by:

    ϕ(Σ_{i=0}^{n} w_i x_i) = +1 if Σ_{i=0}^{n} w_i x_i ≥ 0, and −1 otherwise   (3)

having denoted with x_i, i = 0, 1, ..., n, the neuron inputs. For the sake of simplicity we have also included the bias w_0 in the weight vector w by adding a fixed component x_0 = +1 to every input pattern.

To reach an optimal configuration the pocket algorithm repeatedly executes perceptron learning, whose basic iteration is formed by the following two steps:
1. a sample (x, y) is randomly chosen in the training set S with uniform probability;
2. if the current threshold neuron N(w) does not satisfy this sample, its weight vector w is modified according to the following rule:

    w_i = w_i + y x_i  for i = 0, 1, ..., n   (4)
It can be shown that the substitution rule (4) changes the current neuron so as to satisfy (possibly after several weight changes) the input-output pair (x, y) chosen at step 1. It is known that the perceptron learning algorithm is finite-optimal when the training set S is linearly separable; in this case there exists a threshold neuron that correctly satisfies all the samples contained in S. In the opposite case the perceptron algorithm cycles indefinitely through feasible weight vectors without providing any (deterministic or probabilistic) information about their optimality.

However, we can observe that a neuron N(w) which satisfies a high number of samples of the training set S has a small probability of being modified at step 2. This consideration motivates the procedure followed by the pocket algorithm: it holds in memory the weight vector w which has remained unchanged for the greatest number of iterations during perceptron learning. Such a vector w is called pocket vector and the corresponding neuron N(w) forms the output of the training.

To analyze the convergence properties of the pocket algorithm in a theoretical way we have to define a proper cumulative error ε(w). Since the output of the neuron N(w) is binary (3), every training set can be associated with a classification problem. Thus, a good measure of the error made by a given configuration w is the number of misclassified samples of the training set S.
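Reusing phi from the sketch above, the whole procedure can be outlined as follows; the bookkeeping names are again chosen here for illustration, but the logic is the one just described: run perceptron learning and keep in memory the vector with the longest unchanged run.

```python
def pocket_algorithm(X, Y, t_max, rng=None):
    """Pocket algorithm sketch: X is an (s, n+1) array with X[:, 0] = +1,
    Y an array of +/-1 targets; returns the pocket vector."""
    rng = rng or np.random.default_rng(0)
    s, d = X.shape
    w = np.zeros(d)                    # current perceptron vector w_PL(t)
    pocket_w, run, best_run = w.copy(), 0, 0
    for _ in range(t_max):
        i = rng.integers(s)            # step 1: uniform random choice in S
        if phi(w, X[i]) == Y[i]:
            run += 1                   # permanence of the current configuration
        else:
            w = w + Y[i] * X[i]        # step 2: substitution rule (4)
            run = 1                    # the visit opens a new run
        if run > best_run:             # longest run so far: save w as pocket vector
            best_run, pocket_w = run, w.copy()
    return pocket_w
```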


In this way, if the vector w correctly satisfies h input-output pairs of the training set S, we have ε(w) = s − h. In particular, if r is the number of samples successfully classified by an optimal vector we obtain ε_min = s − r. Consequently, the cumulative error ε(w) can assume at most r + 1 different values.

Now, let us denote with w_PL(t), t = 1, 2, ..., the configuration generated by the perceptron algorithm at the t-th iteration and with W the set of the feasible weight vectors w for the threshold neuron N(w). Furthermore, let S_w be the subset of the training set S that contains the samples correctly classified by N(w). The following definitions will be useful:

Definition 3: If w = w_PL(t) ≠ w_PL(t−1) we shall say that a visit of the configuration w has occurred at the t-th iteration of the perceptron algorithm. If w = w_PL(t) = w_PL(t−1) we shall say that a permanence of the configuration w has occurred at iteration t. A run of length k for the weight vector w is then given by a visit of w followed by k − 1 permanencies.

It can easily be seen that the probability of having a permanence for w is given by |S_w|/s, where |S_w| is the cardinality of the set S_w. On the contrary, the probability of having a visit for w is not known a priori.

III. Convergence theorems for the pocket algorithm

The existence of a convergence theorem for the pocket algorithm [23] which ensures the achievement of an optimal configuration when the number of iterations increases indefinitely has been an important motivation for the employment of this learning procedure. However, this theorem can be applied only when the inputs in the training set are integer or rational. A generalization to the real case is possible by introducing a different approach which can also be extended to show the optimality of other similar learning methods.

Consider the 2^s subsets R_m of the given training set S, with m = 1, ..., 2^s; each of these can be associated with a subset V_m ⊂ W of feasible configurations such that for every w ∈ V_m we have S_w = R_m. Naturally, if R_m is not linearly separable the corresponding subset V_m is empty. In the same way, if r is the number of samples in S correctly classified by an optimal neuron, V_m is empty for every m such that |R_m| > r.

By the perceptron convergence theorem [18], if the set R_m is linearly separable, there exists a number τ_m of iterations such that, if τ_m randomly selected samples in R_m are successively presented to the perceptron algorithm, a configuration w ∈ V_m is generated. Since the number of sets R_m is finite, we can denote with τ the maximum value of τ_m for m = 1, 2, ..., 2^s.

Furthermore, let us introduce the sets W_h, with h = 0, 1, ..., s, containing all the weight vectors w ∈ W which correctly classify h samples of the training set S:

    W_h = {w ∈ V_m : |R_m| = h, m = 1, ..., 2^s}

It can easily be seen that W_h = ∅ for r < h ≤ s.

Consider the Markov chain whose states are given by the sets V_m to which the configurations w_PL(t) generated by the perceptron algorithm at each iteration belong. As one can note, this chain has stationary transition probabilities whose values depend on the given training set S. Only the probability p of having a permanence for V_m can be determined a priori: it is given by |R_m|/s. At any iteration of the perceptron algorithm, a sample in S is randomly chosen with uniform probability.
If we call the choice of an input-output pair in R_m a success and the opposite event a failure, we obtain a sequence of Bernoulli trials with probability p = |R_m|/s of success. Let Q_k^t(p) denote the probability that the length of the longest run of successes in t trials is less than k. The complexity of the exact expression for Q_k^t(p) [28] prevents us from using it in the following derivations; however, useful bounds for Q_k^t(p) can be obtained by some simple remarks.
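Although no closed form for Q_k^t(p) is used here, the quantity is easy to approximate by simulation; the following Monte Carlo sketch (function names chosen here) estimates the probability that the longest success run in t Bernoulli trials is shorter than k.

```python
import numpy as np

def longest_run(successes):
    """Length of the longest run of True values in a boolean sequence."""
    best = cur = 0
    for s in successes:
        cur = cur + 1 if s else 0
        best = max(best, cur)
    return best

def estimate_Q(t, k, p, n_rep=100_000, rng=None):
    """Monte Carlo estimate of Q_k^t(p) = P(longest success run in t trials < k)."""
    rng = rng or np.random.default_rng(0)
    hits = sum(longest_run(rng.random(t) < p) < k for _ in range(n_rep))
    return hits / n_rep
```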


Let us subdivide the sequence of t trials of the given experiment into non-overlapping subsets of length k; the maximum number of them is ⌊t/k⌋, having denoted with ⌊x⌋ the greatest integer not exceeding x. If the original sequence does not contain a run of k successes, none of these subsets can be formed by k successes. Since the probability of this event is Q_k^k(p) = 1 − p^k and the ⌊t/k⌋ subsets are stochastically independent, we obtain Q_k^t(p) ≤ (1 − p^k)^⌊t/k⌋.

To find a lower bound for Q_k^t(p), let us denote with u_i, i = 1, ..., t, the i-th outcome of the given experiment and consider the t − k + 1 (overlapping) subsequences of k consecutive trials starting at u_1, u_2, ..., u_{t−k+1}. It can easily be seen that, if none of these t − k + 1 subsequences contains a run of k successes, neither does the complete sequence. Thus, we could conclude that Q_k^t(p) ≥ (1 − p^k)^{t−k+1} if the t − k + 1 subsequences were stochastically independent (which is generally not true). However, this lower bound is always valid, as the following lemma asserts.

Lemma 1: If 1 ≤ k ≤ t and 0 ≤ p ≤ 1 the following inequalities hold:

    (1 − p^k)^{t−k+1} ≤ Q_k^t(p) ≤ (1 − p^k)^⌊t/k⌋   (5)

Proof: From a direct analysis of the given binary experiment we obtain the particular expressions:

    Q_k^t(0) = 1 ,  Q_k^t(1) = 0 ,  Q_1^t(p) = (1 − p)^t ,  Q_t^t(p) = 1 − p^t   (6)

for 1 ≤ k ≤ t and 0 ≤ p ≤ 1; consequently the inequalities (5) are verified for p = 0, p = 1, k = 1 and k = t.

In the general case 1 < k < t and 0 < p < 1, let us denote with u_i the i-th outcome of the given experiment and consider the events:

    U_i = {u_i is a success}
    E_{jl} = {a run of k successes exists in the outcomes u_j, ..., u_l}

with 1 ≤ i ≤ t and 1 ≤ j ≤ l ≤ t. By definition of the probability Q_k^t(p) we obtain:

    P(U_i) = p ,  P(E_{jl}) = 1 − Q_k^{l−j+1}(p)   (7)

With these notations it can be directly seen that Ē_{1t} ⊂ Ē_{1k} ∩ Ē_{k+1,t}, having denoted with Ē_{jl} the complement of the set E_{jl}. But Ē_{1k} and Ē_{k+1,t} are two stochastically independent events; so we have from (7):

    Q_k^t(p) = P(Ē_{1t}) ≤ P(Ē_{1k}) P(Ē_{k+1,t}) = Q_k^k(p) Q_k^{t−k}(p) = (1 − p^k) Q_k^{t−k}(p)   (8)

where we have used (6). By repeatedly applying (8) we obtain the upper bound for Q_k^t(p):

    Q_k^t(p) ≤ (1 − p^k)^⌊t/k⌋ Q_k^{t−k⌊t/k⌋}(p) ≤ (1 − p^k)^⌊t/k⌋

since t − k⌊t/k⌋ < k and Q_k^t(p) = 1 for t < k.

The lower bound in (5) can be found by noting that E_{1t} = E_{1k} ∪ E_{2t}; the application of (6) and (7) therefore gives:

    1 − Q_k^t(p) = P(E_{1t}) = P(E_{1k}) + P(E_{2t}) − P(E_{1k} ∩ E_{2t})
                 = p^k + 1 − Q_k^{t−1}(p) − P(E_{1k} ∩ E_{2t})   (9)


But the set E_{1k} ∩ E_{2t} contains all the sequences of t trials of the given experiment which begin with k successes and have at least a run of k successes in the last t − 1 outcomes. Thus, the following lower bound for P(E_{1k} ∩ E_{2t}) holds:

    P(E_{1k} ∩ E_{2t}) = p^k (P(U_{k+1}) + P(Ū_{k+1}) P(E_{k+2,t}))
                       ≥ p^k (P(U_{k+1}) P(E_{2t}|U_{k+1}) + P(Ū_{k+1}) P(E_{2t}|Ū_{k+1}))
                       = p^k P(E_{2t}) = p^k (1 − Q_k^{t−1}(p))   (10)

since it can easily be seen that P(E_{2t}|Ū_{k+1}) = P(E_{k+2,t}). By substituting (10) in (9) we obtain:

    Q_k^t(p) = Q_k^{t−1}(p) − p^k + P(E_{1k} ∩ E_{2t}) ≥ (1 − p^k) Q_k^{t−1}(p)

and after t − k applications:

    Q_k^t(p) ≥ (1 − p^k)^{t−k} Q_k^k(p) = (1 − p^k)^{t−k+1}

which completes the proof.
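The inequalities (5) can also be checked numerically: Q_k^t(p) admits an exact computation by conditioning on the position of the first failure. The recursion below is a standard formulation chosen here, not one of the expressions of [28].

```python
from functools import lru_cache

def exact_Q(t, k, p):
    """Exact Q_k^t(p): probability of no success run of length k in t trials."""
    @lru_cache(maxsize=None)
    def Q(m):
        if m < k:
            return 1.0    # fewer than k trials cannot contain a run of k successes
        # condition on the first failure at trial j, preceded by j - 1 successes
        return sum(p ** (j - 1) * (1 - p) * Q(m - j) for j in range(1, k + 1))
    return Q(t)

# check the bounds (5) for sample values of t, k and p
t, k, p = 200, 4, 0.6
assert (1 - p**k) ** (t - k + 1) <= exact_Q(t, k, p) <= (1 - p**k) ** (t // k)
```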

This lemma gives a mathematical basis for the proof of the asymptotic optimality of the pocket algorithm.

Theorem 1 (pocket convergence theorem): If the training set S is finite, the pocket algorithm is (asymptotically) optimal, i.e.

    lim_{t→+∞} P(ε(w_PA(t)) − ε_min < η) = 1  for every η > 0

where w_PA(t) is the pocket vector held in memory at iteration t.

Proof: Denote with L(A, t) the length of the longest run for any configuration w belonging to the set A in the first t iterations of the perceptron algorithm. If W_r contains all the optimal vectors for the training set S, we can find a subset R_m ⊂ S having cardinality |R_m| = r for which the associated subset V_m of W_r is not empty. Thus we can write for every η > 0:

    P(ε(w_PA(t)) − ε_min < η) = P(L(W_r, t) > L(W_1, t), ..., L(W_r, t) > L(W_{r−1}, t))
      ≥ P(L(V_m, t) > L(W_1, t), ..., L(V_m, t) > L(W_{r−1}, t))
      = 1 − P(L(V_m, t) ≤ L(W_1, t) or ... or L(V_m, t) ≤ L(W_{r−1}, t))
      ≥ 1 − Σ_{h=1}^{r−1} P(L(V_m, t) ≤ L(W_h, t))

Then it is sufficient to show that for every h < r we have

    lim_{t→+∞} P(L(V_m, t) ≤ L(W_h, t)) = 0   (11)

if |R_m| = r. To this end, let us denote with p and q the probabilities of having a permanence for the weight vectors of V_m and W_h respectively; thus we have p = r/s and q = h/s with p > q.

Now, if r = s the training set S is linearly separable and the perceptron convergence theorem ensures the achievement of an optimal configuration w ∈ W_s within a finite number of iterations.


The subsequent random choices of samples in the training set leave this optimal vector unchanged, since its permanence probability p is equal to one. Thus, (11) is verified: at a given iteration t an optimal configuration becomes the current pocket vector w_PA(t) and its number of permanencies increases indefinitely.

In the case 1 ≤ r < s we can write:

    P(L(V_m, t) ≤ L(W_h, t)) = Σ_{k=0}^{t} P(L(V_m, t) ≤ k | L(W_h, t) = k) · P(L(W_h, t) = k)   (12)

Since the set R_m is linearly separable, if the perceptron algorithm consecutively chooses τ + k samples in R_m we obtain a run with length k + 1 for a configuration w ∈ V_m. Let us employ again the term success to denote the random choice of an input-output pair in R_m; then we can define the following events:

    U_i = {at the i-th iteration a success occurs}
    F_{jl} = {between the iterations j and l (included) there is no run of τ + k successes}
    G_i = {the first run with length k for a vector v ∈ W_h begins at the i-th iteration}

with 1 ≤ i ≤ t and 1 ≤ j ≤ l ≤ t. By using the previous definition of the probability Q_k^t(p) we obtain:

    P(U_i) = p ,  P(F_{jl}) = Q_{τ+k}^{l−j+1}(p)   (13)

With these notations we can write:

    P(L(V_m, t) ≤ k | L(W_h, t) = k) ≤ P(F_{1t} | L(W_h, t) = k)
      = Σ_{i=1}^{t−k+1} P(F_{1t}|G_i) · P(G_i | L(W_h, t) = k)   (14)

since the event {L(V_m, t) ≤ k} is contained in F_{1t} and G_i ⊂ {L(W_h, t) = k}.

Now, let us consider the event F_{jl} and suppose a success occurs at the i-th iteration (with j ≤ i ≤ l). By denoting with Ū_i the complement of the set U_i, for the independence of the choices made at different iterations by the perceptron algorithm we obtain:

    P(F_{jl}|U_i) = P(F_{jl} ∩ U_i) / P(U_i) ≤ P(F_{j,i−1} ∩ F_{i+1,l} ∩ U_i) / P(U_i)
                  = P(F_{j,i−1}) P(F_{i+1,l}) = P(F_{jl}|Ū_i)

since F_{jl} ⊂ (F_{j,i−1} ∩ F_{i+1,l}). As a matter of fact, if the iterations j, j + 1, ..., l do not contain a run of τ + k successes, neither does a subset of them. Thus, if we suppose a failure occurs at a given iteration i, an upper bound for the probability of the event F_{jl} is obtained:

    P(F_{jl}) = P(F_{jl}|U_i) P(U_i) + P(F_{jl}|Ū_i) P(Ū_i)
              ≤ P(F_{jl}|Ū_i) (P(U_i) + P(Ū_i)) = P(F_{jl}|Ū_i) = P(F_{j,i−1}) P(F_{i+1,l})   (15)

Consider the situation (event G_i) in which the first run with length k for a weight vector v ∈ W_h begins at the i-th iteration; this can be true only if samples in S_v have been chosen at the iterations i + 1, ..., i + k − 1 and, at the iteration i + k, an input-output pair not belonging to S_v has been selected. Since no general conditions are given on the intersection between the sets S_v and R_m, some samples chosen in S_v can also belong to R_m. However, by the inequality (15), the probability P(F_{1t}|G_i) can only increase if we suppose that at the iterations i + 1, ..., i + k we have obtained k failures. In a similar way, the visit of the vector v ∈ W_h at iteration i may depend on the choices made by perceptron learning at iterations i − τ + 1, ..., i − 1, i. Thus we obtain a further increase in the probability P(F_{1t}|G_i) if we suppose that τ failures take place at these iterations. On the other hand, the first i − τ random choices are independent of the visit of the configuration v, since τ successive choices in S_v are sufficient to reach v starting from any weight vector. Consequently we obtain from (13) and (15):

    P(F_{1t}|G_i) ≤ P(F_{1,i−τ}) P(F_{i+k+1,t}) = Q_{τ+k}^{i−τ}(p) Q_{τ+k}^{t−i−k}(p)
      ≤ min(1, (1 − p^{τ+k})^{⌊(i−τ)/(τ+k)⌋} (1 − p^{τ+k})^{⌊(t−i−k)/(τ+k)⌋})
      ≤ min(1, (1 − p^{τ+k})^{t/(τ+k) − 3})

where we have used the upper bound in (5) for Q_k^t(p) and the inequality Q_k^t(p) ≤ 1, which holds for any admissible value of k, t and p. Now, it can easily be shown that:

    Σ_{i=1}^{t−k+1} P(G_i | L(W_h, t) = k) = 1

and (14) becomes:

    P(L(V_m, t) ≤ k | L(W_h, t) = k) ≤ min(1, (1 − p^{τ+k})^{t/(τ+k) − 3})   (16)

By substituting (16) in (12) an upper bound for the probability P(L(V_m, t) ≤ L(W_h, t)) is obtained:

    P(L(V_m, t) ≤ L(W_h, t)) ≤ Σ_{k=0}^{t} min(1, (1 − p^{τ+k})^{t/(τ+k) − 3}) · P(L(W_h, t) = k)   (17)

Now, let us verify that the right-hand side of (17) vanishes when the number t of iterations increases indefinitely. To this end, let us break the summation above into two contributions, with the integer k belonging to the intervals [0, ⌊σ⌋] and (⌊σ⌋, t], where

    σ = −α log t / log p ,  α = log p / log √(pq) = 2 log p / (log p + log q)

Note that 0 < α < 1 and σ → +∞ when t → +∞. Then we obtain for 0 ≤ k ≤ ⌊σ⌋:

    Σ_{k=0}^{⌊σ⌋} min(1, (1 − p^{τ+k})^{t/(τ+k) − 3}) · P(L(W_h, t) = k) ≤ (1 − p^{τ+σ})^{t/(τ+σ) − 3}   (18)


but by definition of σ

    p^σ = exp(−α (log t / log p) log p) = exp(−α log t) = t^{−α}

hence we have:

    lim_{t→+∞} (1 − p^{τ+σ})^{t/(τ+σ) − 3}
      = lim_{t→+∞} exp(−(t log p / (α log t − τ log p) + 3) log(1 − p^τ t^{−α}))
      = lim_{t→+∞} exp(p^τ t^{1−α} log p / (α log t)) = 0

since α < 1 and log p < 0.

In the case ⌊σ⌋ < k ≤ t we obtain, by using the lower bound in (5) for Q_k^t(q):

    Σ_{k=⌊σ⌋+1}^{t} min(1, (1 − p^{τ+k})^{t/(τ+k) − 3}) · P(L(W_h, t) = k)
      ≤ Σ_{k=⌊σ⌋+1}^{t} P(L(W_h, t) = k) = P(L(W_h, t) > ⌊σ⌋)
      ≤ 1 − P(L(W_h, t) < ⌊σ⌋) ≤ 1 − Q_{⌊σ⌋}^t(q) ≤ 1 − (1 − q^σ)^t

In fact, the probability P(L(W_h, t) < ⌊σ⌋) is surely not less than the probability Q_{⌊σ⌋}^t(q) of not having ⌊σ⌋ consecutive choices in S_v, where v ∈ W_h. But by definition of σ and α:

    q^σ = exp(−α (log t / log p) log q) = exp(−(2 log q / (log p + log q)) log t) = exp((α − 2) log t) = t^{α−2}

for which we have:

    lim_{t→+∞} [1 − (1 − q^σ)^t] = lim_{t→+∞} [1 − exp(t log(1 − t^{α−2}))]
      = lim_{t→+∞} [1 − exp(−t^{α−1})] = 0

This completes the proof of Theorem 1.
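To make the behaviour of the two contributions concrete, they can be evaluated for sample values of p, q and t; the snippet below is only an illustration of the quantities appearing in the proof, relying on the identities p^σ = t^{−α} and q^σ = t^{α−2} derived above (the value of τ is a placeholder).

```python
from math import log

def proof_bounds(p, q, t, tau=1):
    """Evaluate the two vanishing bounds of Theorem 1 (tau is illustrative)."""
    alpha = 2 * log(p) / (log(p) + log(q))      # 0 < alpha < 1 since 0 < q < p < 1
    sigma = -alpha * log(t) / log(p)            # split point of the summation in (17)
    head = (1 - p ** (tau + sigma)) ** (t / (tau + sigma) - 3)   # bound (18), k <= sigma
    tail = 1 - (1 - q ** sigma) ** t                             # bound for k > sigma
    return head, tail

for t in (10**3, 10**5, 10**7):
    print(t, proof_bounds(p=0.9, q=0.6, t=t))   # both contributions tend to zero
```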

Theorem 1 assures good convergence properties for the configuration generated by the pocket algorithm: the probability that the pocket vector is optimal increases with the execution time. Unfortunately, no useful theoretical estimate of the number of iterations required to reach a given value of this convergence probability is available.

We must also emphasize that the cumulative error ε(w(t)) associated with the pocket vector w(t) does not decrease monotonically with the number t of iterations. In fact, in practical applications we observed that a new pocket vector often satisfies a smaller number of samples of the training set than the previously saved configuration. To eliminate this undesired effect, a modified version of the learning method, called pocket algorithm with ratchet, has been proposed. It computes the corresponding error ε(w(t)) at every possible saving of a new pocket vector w(t) and maintains the previous configuration when the number of misclassified samples does not decrease.
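In terms of the sketch given in section II, the ratchet amounts to a single additional check before each saving: the candidate pocket vector is accepted only if it strictly decreases the number of misclassified samples. Again an illustrative rendering, reusing phi and the conventions introduced earlier.

```python
def error(w, X, Y):
    """Cumulative error eps(w): number of misclassified samples of S."""
    return sum(phi(w, x) != y for x, y in zip(X, Y))

def pocket_with_ratchet(X, Y, t_max, rng=None):
    """Pocket algorithm with ratchet: a longer run is saved only if it improves eps."""
    rng = rng or np.random.default_rng(0)
    s, d = X.shape
    w = np.zeros(d)
    pocket_w, run, best_run = w.copy(), 0, 0
    pocket_err = error(pocket_w, X, Y)
    for _ in range(t_max):
        i = rng.integers(s)
        if phi(w, X[i]) == Y[i]:
            run += 1
        else:
            w = w + Y[i] * X[i]
            run = 1
        if run > best_run:
            best_run = run
            e = error(w, X, Y)             # ratchet: evaluate before saving
            if e < pocket_err:             # keep the old vector unless eps decreases
                pocket_w, pocket_err = w.copy(), e
    return pocket_w
```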


In this way the stability of the algorithm is improved, as the following theorem asserts.

Theorem 2: If the training set is finite, the pocket algorithm with ratchet is finite-optimal, i.e. there exists an integer t̄ such that

    P(ε(w_PA(t)) = ε_min for every t ≥ t̄) = 1

Proof: From the procedure followed by the pocket algorithm with ratchet, we obtain that the cumulative error ε(w_PA(t)) associated with the pocket vector w_PA(t) decreases monotonically towards a minimum ε(w_PA(t̄)) corresponding to the last saved configuration. Thus, if r is the number of samples in the training set S correctly classified by an optimal neuron, we can write:

    P(ε(w_PA(t)) = ε_min for every t ≥ t̄) = 1 − P(w_PA(t) ∉ W_r for every t ≥ t̄)
      ≥ 1 − P(w_PA(t) ∉ V_m for every t ≥ t̄)

where V_m ⊂ W_r is associated with a subset R_m of the training set S having cardinality |R_m| = r.

Let k̄ be the length of the run scored by w_PA(t̄); then the probability that w_PA(t) ∉ V_m for every t ≥ t̄ is surely not greater than the probability that a sequence of τ + k̄ consecutive choices of samples in R_m never takes place after iteration t̄. Thus, by setting p = r/s, from the definition of Q_k^t(p) we obtain:

    P(ε(w_PA(t)) = ε_min for every t ≥ t̄) ≥ 1 − P(w_PA(t) ∉ V_m for every t ≥ t̄)
      ≥ 1 − lim_{t→+∞} Q_{τ+k̄}^t(p) ≥ 1 − lim_{t→+∞} (1 − p^{τ+k̄})^{⌊t/(τ+k̄)⌋} = 1

having used the upper bound in (5).

IV. Conclusions

The interest of the pocket algorithm as a learning method for neurons with hard threshold activation function is mainly based on the presence of a convergence theorem which ensures the achievement of an optimal configuration when the number of iterations increases indefinitely. Unfortunately, the original proof of this theorem only holds if the inputs in the training set are integer or rational and does not allow us to conclude the optimality of the pocket algorithm in all possible situations.

This work introduces an alternative approach which allows us to overcome the limitations of the original treatment. In particular, the convergence in probability to an optimal weight vector is shown independently of the given training set. Furthermore, a second theorem asserts that the pocket algorithm with ratchet finds an optimal weight vector within a finite number of iterations with probability one. Consequently, this is surely the learning method to prefer whenever the size of the training set does not lead to a prohibitive execution time.

On this point, it should be observed that the proofs contained in the previous section do not provide any information about the number of iterations required to reach an optimal or near-optimal weight vector. This fact prevents us from establishing a direct stopping criterion for training with the pocket algorithm in practical situations.

Acknowledgement

The author wishes to thank Dr. Stephen Gallant for fruitful discussions and his valuable comments.


References

[1] Richard P. Lippmann, "Review of neural networks for speech recognition," Neural Computation, vol. 1, pp. 1–38, 1989.
[2] Yann Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computation, vol. 1, pp. 541–551, 1989.
[3] Isabelle Guyon, Vladimir Vapnik, B. Boser, Leon Bottou, and Sara A. Solla, "Structural risk minimization for character recognition," in Advances in Neural Information Processing Systems 4, John E. Moody, Steve J. Hanson, and Richard P. Lippmann, Eds., pp. 471–479. Morgan Kaufmann, 1992.
[4] A. Lapedes and R. Farber, "How neural nets work," in Neural Information Processing Systems, D. Z. Anderson, Ed., pp. 442–456. American Institute of Physics, 1987.
[5] George Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals, and Systems, vol. 2, pp. 303–314, 1989.
[6] K. Funahashi, "On the approximate realization of continuous mapping by neural networks," Neural Networks, vol. 2, pp. 183–192, 1989.
[7] Kurt Hornik, Maxwell Stinchcombe, and Halbert White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359–366, 1989.
[8] Vladimir N. Vapnik and A. Y. Chervonenkis, "On the uniform convergence of relative frequencies of events to their probabilities," Theory of Probability and Its Applications, vol. 16, pp. 264–280, 1971.
[9] Vladimir Vapnik, Estimation of Dependences Based on Empirical Data, New York: Springer-Verlag, 1982.
[10] Vladimir Vapnik, "Principles of risk minimization for learning theory," in Advances in Neural Information Processing Systems 4, John E. Moody, Steve J. Hanson, and Richard P. Lippmann, Eds., pp. 831–838. Morgan Kaufmann, 1992.
[11] Vladimir Vapnik and Leon Bottou, "Local algorithms for pattern recognition and dependencies estimation," Neural Computation, vol. 5, pp. 893–909, 1993.
[12] J. S. Judd, "Learning in networks is hard," in Proceedings of the First International Conference on Neural Networks, San Diego, CA, 1987, pp. 685–692. IEEE.
[13] Avrim Blum and Ronald L. Rivest, "Training a 3-node neural network is NP-complete," in Proceedings of the 1988 Workshop on Computational Learning Theory, David Haussler and Leonard Pitt, Eds., Cambridge, MA, 1988, pp. 9–18. Morgan Kaufmann.
[14] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing, D. E. Rumelhart and J. L. McClelland, Eds., pp. 318–362. Cambridge, MA: MIT Press, 1986.
[15] John Hertz, Anders Krogh, and Richard G. Palmer, Introduction to the Theory of Neural Computation, Redwood City, CA: Addison-Wesley, 1991.
[16] D. G. Luenberger, Introduction to Linear and Nonlinear Programming, Reading, MA: Addison-Wesley, 1984.
[17] William H. Press, Brian P. Flannery, Saul A. Teukolsky, and William T. Vetterling, Numerical Recipes in C, Cambridge: Cambridge University Press, 1988.
[18] Frank Rosenblatt, Principles of Neurodynamics, Washington, DC: Spartan Press, 1961.
[19] Marvin Minsky and Seymour Papert, Perceptrons: An Introduction to Computational Geometry, Cambridge, MA: MIT Press, 1969.
[20] Y.-C. Ho and R. L. Kashyap, "An algorithm for linear inequalities and its applications," IEEE Transactions on Electronic Computers, vol. 14, pp. 683–688, 1965.
[21] L. G. Khachiyan, "A polynomial algorithm in linear programming," Soviet Mathematics Doklady, vol. 20, pp. 191–194, 1979.
[22] A. J. Mansfield, "Comparison of perceptron training by linear programming and by the perceptron convergence procedure," in Proceedings of the International Joint Conference on Neural Networks, Seattle, WA, 1991, pp. II-25–II-30.
[23] Stephen I. Gallant, "Perceptron-based learning algorithms," IEEE Transactions on Neural Networks, vol. 1, pp. 179–191, 1990.
[24] Stephen I. Gallant, Neural Network Learning and Expert Systems, Cambridge, MA: MIT Press, 1993.
[25] Marc Mézard and Jean-Pierre Nadal, "Learning in feedforward layered networks: The tiling algorithm," Journal of Physics A, vol. 22, pp. 2191–2203, 1989.
[26] Marcus Frean, "The upstart algorithm: A method for constructing and training feedforward neural networks," Neural Computation, vol. 2, pp. 198–209, 1990.
[27] Marco Muselli, "On sequential construction of binary neural networks," IEEE Transactions on Neural Networks, vol. 6, pp. 678–690, 1995.
[28] Marco Muselli, "Simple expressions for success run distributions in Bernoulli trials," to appear in Statistics & Probability Letters, 1996.

Marco Muselli was born in 1962. He received the M.S. degree in electronic engineering from the University of Genoa, Italy, in 1985. Mr. Muselli is currently a Researcher at the Istituto per i Circuiti Elettronici of CNR in Genoa. His research interests include neural networks, global optimization, genetic algorithms and characterization of nonlinear systems.