SIMULTANEOUS RECURRENT NEURAL NETWORK TRAINED WITH NON-RECURRENT BACKPROPAGATION ALGORITHM FOR STATIC OPTIMIZATION Gursel Serpen# and Yifeng Xu Electrical Engineering and Computer Science, University of Toledo, Toledo, OH 43606 USA

Abstract – This paper explores the feasibility of employing the non-recurrent backpropagation training algorithm for a recurrent neural network, the Simultaneous Recurrent Neural network, for static optimization. A simplifying observation that maps the recurrent network dynamics, configured to operate in relaxation mode as a static optimizer, to feedforward network dynamics is leveraged to facilitate the application of non-recurrent training algorithms such as standard backpropagation and its variants. A simulation study assessing the feasibility, optimizing potential, and computational efficiency of training the Simultaneous Recurrent Neural network with non-recurrent backpropagation is conducted. A comparative computational complexity analysis between the Simultaneous Recurrent Neural network trained with the non-recurrent backpropagation algorithm and the same network trained with the recurrent backpropagation algorithm is performed. Simulation results demonstrate that it is feasible to apply the non-recurrent backpropagation to train the Simultaneous Recurrent Neural network. The optimality and computational complexity analysis fails to demonstrate any advantage of the non-recurrent backpropagation over the recurrent backpropagation for the optimization problem considered. However, considerable future potential remains to be explored, given that computationally efficient versions of the backpropagation training algorithm, namely quasi-Newton and conjugate gradient descent among others, are also applicable to the neural network proposed for static optimization in this paper.

# Corresponding author: phone: (419) 530 8158, fax: (419) 530 8146, e-mail: [email protected]

Keywords: recurrent network, backpropagation, optimization, traveling salesman, computational complexity

INTRODUCTION

The Simultaneous Recurrent Neural network (SRN) trained with the recurrent backpropagation (RBP) algorithm was recently applied to the solution of large-scale combinatorial optimization problems [Serpen et al., 2001]. The SRN demonstrated notable performance compared to other neuro-optimization algorithms, including those in the family of Hopfield networks [Hopfield and Tank, 1985] and its derivatives, among them the Boltzmann Machine and the Mean Field Annealing network [Cichocki & Unbehauen, 1993; Serpen & Livingston, 2000]. However, these studies also indicated that training the SRN with the recurrent backpropagation for static optimization problems, including the Traveling Salesman problem, is computationally expensive. Specifically, empirical results showed that the required computational resources, in the form of memory space and total computational cost in terms of training time on a serial machine, increased quadratically with the problem size. This determination strongly motivates a search for more computationally efficient ways to train the Simultaneous Recurrent Neural network for static optimization problems.

The architecture of the Simultaneous Recurrent Neural network [Werbos et al., 1998] is shown in Figure 1. Inside the block is a feedforward network that may contain multiple layers of nodes, expressed as a nonlinear mapping from external input x to output z, where f is the function representing the nonlinear mapping and W is the weight matrix. There is direct, unity, non-delayed feedback (ignoring the propagation delay associated with any physical system) from the outputs to the inputs of the feedforward network. The propagation delay associated with the feedback path of an SRN is assumed to be much smaller than the typical sampling delays existing in the feedback paths of Time Delay Neural networks, and consequently, is ignored without loss of generality for the purposes of this study.

(Figure: a feedforward network computing f(W, x, z), with input x, output z, and non-delayed, unity feedback from output to input.)

Figure 1. Simultaneous Recurrent Neural Network.

The Simultaneous Recurrent Neural network has temporal dynamics, which will converge to a fixed point in the presence of one such equilibrium point. Once the network converges to a fixed point (assuming associative memory mode of operation), the recurrent inputs become constant over time. This observation reduces the SRN to a multilayer perceptron (MLP) network, a feedforward topology, with special constraints on its inputs. In this case, in addition to the external input x, the feedback is considered as a second input that is constant over time upon convergence to a fixed point. Specifically, the recurrent network is considered as a feedforward network subject to the constraint that

$$z(\infty) = f\left[ W, x, z(\infty) \right], \qquad (1)$$

where z(∞) is the relaxed value of the output of the network with the topology in Figure 2.


(Figure: the feedforward network computing f[W, x, z(∞)], with input x, output z(∞), and the relaxed output z(∞) applied as a constant second input.)

Figure 2. The Simultaneous Recurrent Neural Network upon Convergence to a Fixed Point.

The significant implication of this simplification of the SRN network dynamics is that the standard, non-recurrent backpropagation (BP) can be employed to train the neural network rather than the computationally expensive recurrent backpropagation, which, among other requirements, needs an adjoint network to be set up [Werbos, 1988; Almeida, 1987; Pineda, 1987].
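The relaxation-mode operation underlying this observation can be sketched as a simple fixed-point iteration. The following is a minimal illustration, not the paper's implementation; the contractive toy map f is an assumption chosen so that a unique fixed point exists:

```python
import numpy as np

def relax_to_fixed_point(f, x, z0, tol=1e-3, max_iter=1000):
    """Iterate z <- f(x, z) until successive outputs settle, approximating z(inf)."""
    z = np.asarray(z0, dtype=float)
    for _ in range(max_iter):
        z_next = f(x, z)
        if np.max(np.abs(z_next - z)) < tol:
            return z_next
        z = z_next
    return z

# Toy contractive map (a stand-in for the trained feedforward block):
# z = 0.5*z + x has the unique fixed point z = 2x.
z_inf = relax_to_fixed_point(lambda x, z: 0.5 * z + x,
                             x=np.array([1.0]), z0=np.zeros(1))
```

Upon convergence, z_inf satisfies the constraint of Equation 1 to within the tolerance, which is exactly the condition under which the network can be treated as feedforward.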

A minimal architecture for the SRN can be achieved by assuming a multilayer continuous-dynamics perceptron network with one hidden layer for the feedforward mapping in Figure 1. The computational topology in Figure 3 is realized for I, J, and K nodes in the input, hidden, and output layers, respectively:

(Figure: a feedforward mapping with input nodes x, hidden nodes y, and output nodes z; V carries the input signals into the hidden layer, U the forward hidden-to-output signals, and W the backward signals through the non-delayed, unity feedback path.)

Figure 3. Minimal Topology for Simultaneous Recurrent Neural Network.


In Figure 3, x and z represent the external input and output, whereas U, W, and V are the K×J, J×K, and J×I weight matrices associated with the forward, backward, and input signal propagations, respectively.

As an associative memory or static optimizer, the SRN topology can further be reduced to the instance given in Figure 4 upon convergence to a fixed point and in the absence of any external input:

(Figure: the two-layer feedforward equivalent, with the relaxed output z fed through the backward weights W into the hidden layer y, and through the forward weights U back to the output layer z.)

Figure 4. Feedforward Equivalent of Minimal SRN Topology for Static Optimization.

A typical computational model for a node would employ continuous perceptrons in both the hidden layer and the output layer. More complex, and therefore more realistic, computational models for the node dynamics can be represented by ordinary differential equations. Let y and z represent the node output vectors for the hidden and output layers, respectively, while the internal state of the output layer nodes is given by s. Node dynamics in the output layer are represented by first-order ordinary differential equations:

$$\frac{ds_k}{dt} = \sum_{j=1}^{J} u_{kj}\, y_j \quad \text{and} \quad z_k = f(s_k) \quad \text{for } k = 1, 2, \ldots, K, \qquad (2)$$

where y_j is the output of the j-th node in the hidden layer, J is the node count in the hidden layer, u_kj is the forward weight from the j-th node in the hidden layer to the k-th node in the output layer, K is the number of nodes in the output array, and f is a continuous, differentiable function, typically a sigmoid. The computational model for the nodes in the hidden layer can, however, utilize continuous perceptron equations for the sake of simplicity and without loss of generality. For a node y_j in the hidden layer, the dynamics are defined by

$$net_j = \sum_{k=1}^{K} w_{jk}\, z_k \quad \text{and} \quad y_j = f(net_j) \quad \text{for } j = 1, 2, \ldots, J, \qquad (3)$$

where z_k is the output of the k-th node in the output layer, J is the node count in the hidden layer, w_jk is the backward weight from the k-th node in the output layer to the j-th node in the hidden layer, K is the number of nodes in the output array, and f is a continuous and differentiable function, typically a sigmoid.

The activation function can be instantiated as a unipolar sigmoid with slope coefficient λ ∈ R⁺, the same for both the hidden and output layers without loss of generality. In other terms, we have

$$y_j = \frac{1}{1 + e^{-\lambda \times net_j}}, \quad j = 1, 2, \ldots, J, \qquad (4)$$

where $net_j = \sum_{k=1}^{K} w_{jk} z_k$ and λ ∈ R⁺, and z_k, the k-th element of the output vector z, is defined by

$$z_k = \frac{1}{1 + e^{-\lambda \times s_k}}, \quad k = 1, 2, \ldots, K. \qquad (5)$$
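Under the stated assumptions, one relaxation pass of the minimal two-layer SRN can be sketched as below. This is a hypothetical discrete approximation: the continuous output-layer dynamics of Equation 2 are replaced by their algebraic fixed-point form, and the dimensions and weight ranges are illustrative only:

```python
import numpy as np

def sigmoid(s, lam=1.0):
    # Unipolar sigmoid of Equations 4 and 5 with slope coefficient lambda.
    return 1.0 / (1.0 + np.exp(-lam * s))

def relaxation_step(W, U, z, lam=1.0):
    """One discrete pass through the two-layer SRN with no external input:
    hidden outputs via Equation 4, then output-layer outputs in the
    algebraic (fixed-point) form of Equations 2 and 5 -- an assumption."""
    y = sigmoid(W @ z, lam)      # hidden layer: net_j = sum_k w_jk z_k
    return sigmoid(U @ y, lam)   # output layer: state from sum_j u_kj y_j

rng = np.random.default_rng(0)
J, K = 8, 16                              # illustrative hidden/output counts
W = rng.uniform(-0.2, 0.2, size=(J, K))   # backward (feedback) weights
U = rng.uniform(-0.2, 0.2, size=(K, J))   # forward weights
z = rng.uniform(0.0, 1.0, size=K)
for _ in range(100):                      # relax toward z(inf)
    z = relaxation_step(W, U, z)
```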

STANDARD BACKPROPAGATION TRAINING FOR SIMULTANEOUS RECURRENT NEURAL NETWORK

This section presents the Simultaneous Recurrent Neural network (SRN) as a static optimizer configured to solve a combinatorial optimization problem, namely the Traveling Salesman problem (TSP). Application of the backpropagation training algorithm to the relaxed dynamics of the SRN configured for static optimization is demonstrated through a mathematical formulation.

In the TSP [Reinelt, 1994], a solution requires each city to be visited exactly once and the total distance traveled to be minimal. In the application of the SRN to the TSP, an N-city TSP is represented by an N×N array implemented through the output layer of the SRN. Each row of the array represents a different city and each column represents a possible position of that city in the path. The network outputs should provide a clear indication of which city is selected, since the travel path is decoded from the SRN output array. Thus, it is desirable for node outputs in the output layer to be as close as possible to the limiting values of 0.0 and 1.0 of the unipolar sigmoid function.

In a 4-city TSP example, all possible paths and cities are represented in a 4×4 array, which is shown in Figure 5. A black cell in Figure 5 represents a node with an output value approaching 1.0, and a white cell represents a node with an output value approaching 0.0. The path decoded from the representation in Figure 5 would be B-A-D-C-B. Thus, the SRN topology is designed to match this form of the TSP representation.

The SRN topology proposed for the TSP is a two-layer recurrent network as in Figure 6. The output layer has an N×N array of nodes, which represents the solution of the TSP in the form of the matrix discussed above. The notation z_mn represents the output of the node in the m-th row and n-th column of the output layer. There is a single hidden layer with a node count of J. Each node in the network has an associated weight vector: u_mn represents the weight vector of the node in the m-th row and n-th column of the output layer, and w_j represents the weight vector of the j-th node in the hidden layer. There is no external input to the SRN. Consequently, the SRN has a two-layer topology, typically with a small number of hidden nodes.

(Figure: a 4×4 array with cities A-D as rows and visiting orders 1-4 as columns; exactly one cell in each row and column is black, i.e., active.)

Figure 5. 4-City TSP Representation.

(Figure: a hidden layer of nodes 1 through J with weight vectors w_1, w_2, ..., w_J, fully connected to an N×N output layer of nodes 11 through NN with weight vectors u_11, u_21, ..., u_N1, ...)

Figure 6. Architecture of the SRN for the N-City TSP.


The error function given in Equation 6, which is formulated in the Appendix, maps the TSP to the SRN topology [Serpen et al., 2001]:

$$E = g_{col} \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ 1 - \sum_{m=1}^{N} z_{mj}(\infty) \right]^2 + g_{row} \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ 1 - \sum_{n=1}^{N} z_{in}(\infty) \right]^2 + g_{bin} \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ -\left[ z_{ij}(\infty) - 0.5 \right]^2 + 0.25 \right] + g_{dis} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{m=1}^{N} z_{ij}(\infty)\, z_{m(j+1)}(\infty)\, d_{im}, \qquad (6)$$

where g_col, g_row, g_bin, and g_dis are positive, real weight parameters, d_im is the distance between city i and city m, and N is the number of cities.
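As a sketch, the error function of Equation 6 can be evaluated for a given relaxed output array as below. Two details are assumptions here: the inner sums over i in the first two terms collapse to a factor of N, and the column index j+1 wraps around to the first column so that the tour is closed:

```python
import numpy as np

def tsp_error(z, d, g_col, g_row, g_bin, g_dis):
    """Equation 6 for an N-by-N relaxed output array z and distance matrix d.
    Wrap-around of column j+1 at j = N (a closed tour) is assumed."""
    N = z.shape[0]
    e_col = g_col * N * np.sum((1.0 - z.sum(axis=0)) ** 2)   # column constraint
    e_row = g_row * N * np.sum((1.0 - z.sum(axis=1)) ** 2)   # row constraint
    e_bin = g_bin * np.sum(-(z - 0.5) ** 2 + 0.25)           # binary constraint
    z_next = np.roll(z, -1, axis=1)                          # z[m, j+1], wrapping
    e_dis = g_dis * np.einsum('ij,mj,im->', z, z_next, d)    # tour distance term
    return e_col + e_row + e_bin + e_dis
```

For a valid 0/1 permutation array, the three constraint terms vanish and E reduces to g_dis times the tour length, as intended.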

The standard backpropagation algorithm states that the update for the weights of an output node is computed through

$$\Delta u_{kj} = -\eta \frac{\partial E}{\partial u_{kj}} \quad \text{for } k = 1, 2, \ldots, N \times N \text{ and } j = 1, 2, \ldots, J, \qquad (7)$$

where η is the learning rate. The partial derivative of the error function in Equation 6 with respect to the weight u_kj between the j-th hidden layer node and the k-th output layer node with k = 1, 2, …, N×N, the derivation of which is detailed in the Appendix, is given as follows:

$$\frac{\partial E}{\partial u_{kj}} = 2 g_{col} \sum_{q=1}^{N} \sum_{r=1}^{N} \left[ \sum_{m=1}^{N} z_{mr}(\infty) - 1 \right] \frac{\partial z_{qr}(\infty)}{\partial u_{kj}} + 2 g_{row} \sum_{q=1}^{N} \sum_{r=1}^{N} \left[ \sum_{n=1}^{N} z_{qn}(\infty) - 1 \right] \frac{\partial z_{qr}(\infty)}{\partial u_{kj}} - 2 g_{bin} \sum_{q=1}^{N} \sum_{r=1}^{N} \left[ z_{qr}(\infty) - \alpha \right] \frac{\partial z_{qr}(\infty)}{\partial u_{kj}} + g_{dis} \sum_{q=1}^{N} \sum_{r=1}^{N} \left[ \sum_{m=1}^{N} d_{qm}\, z_{m(r+1)}(\infty) \right] \frac{\partial z_{qr}(\infty)}{\partial u_{kj}}, \quad k = (q-1)N + r, \qquad (8)$$

where the ∂z_qr(∞)/∂u_kj term is readily computable given the node dynamics in the output layer.


Next, in order to compute the weight update values for the hidden layer nodes, note that

$$\delta_{ok} = -\frac{\partial E}{\partial net_k} = -\frac{\partial E}{\partial z_k} \frac{\partial z_k}{\partial s_k} = -\frac{\partial E}{\partial z_k}\, f'\!\left( \sum_{j=1}^{J} u_{kj}\, y_j \right), \qquad (9)$$

where f is the activation function and y_j is the output of the j-th hidden layer node. The term ∂E/∂z_k is the derivative of the error function in Equation 6 with respect to the k-th node output in the output layer:

$$\frac{\partial E}{\partial z_k} = 2 g_{col} \sum_{q=1}^{N} \sum_{r=1}^{N} \left[ \sum_{m=1}^{N} z_{mr}(\infty) - 1 \right] + 2 g_{row} \sum_{q=1}^{N} \sum_{r=1}^{N} \left[ \sum_{n=1}^{N} z_{qn}(\infty) - 1 \right] - 2 g_{bin} \sum_{q=1}^{N} \sum_{r=1}^{N} \left[ z_{qr}(\infty) - \alpha \right] + g_{dis} \sum_{q=1}^{N} \sum_{r=1}^{N} \sum_{m=1}^{N} d_{qm}\, z_{m(r+1)}(\infty), \quad k = (q-1)N + r, \qquad (10)$$

where α = 0.5 and k = 1, 2, …, N×N. Accordingly, adjustments for the weights belonging to the hidden layer nodes are defined as [Werbos, 1974]

$$\Delta w_{jk} = -\eta\, f'\!\left( \sum_{k'=1}^{N \times N} w_{jk'}\, z_{k'}(\infty) \right) z_k(\infty) \sum_{k'=1}^{N \times N} \delta_{ok'}\, u_{k'j}, \qquad (11)$$

where k = 1, 2, …, N×N, j = 1, 2, …, J, and η is the learning rate.
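The update rules above follow the generic layered form of backpropagation. A minimal sketch under that reading is given below; it uses the standard sigmoid derivative λz(1−z) and omits the constraint-specific bookkeeping of Equations 8 and 10, whose partial derivatives are assumed to be supplied in dE_dz:

```python
import numpy as np

def bp_update(U, W, y, z, dE_dz, eta=0.3, lam=1.0):
    """One standard (non-recurrent) backpropagation update for the relaxed
    SRN, treated as the feedforward network of Figure 4. dE_dz holds the
    partials of Equation 10; the unipolar-sigmoid derivative is lam*z*(1-z).
    This is a generic layered-BP sketch, not the paper's exact code."""
    delta_out = -dE_dz * lam * z * (1.0 - z)        # output deltas, Equation 9
    U_new = U + eta * np.outer(delta_out, y)        # output-layer update, Eq. 7
    delta_hid = lam * y * (1.0 - y) * (U.T @ delta_out)  # backpropagated deltas
    W_new = W + eta * np.outer(delta_hid, z)        # hidden-layer update, Eq. 11
    return U_new, W_new
```

With a zero error gradient the deltas vanish and both weight matrices are left unchanged, as expected of a gradient step.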

SIMULATION STUDY

The simulation study was formulated to investigate the feasibility of training the Simultaneous Recurrent Neural network configured as a static optimizer with the standard, non-recurrent backpropagation algorithm. The simulation study also envisioned assessing the optimizing potential and the computational cost of the proposed SRN/BP algorithm. Consequently, a computational cost comparison between the SRN versions trained with the non-recurrent backpropagation and the recurrent backpropagation algorithms was accomplished.

Performance characteristics including the quality of solutions (as measured by the normalized average distance) and the training time (in terms of processor time expended, as measured by the UNIX utility timex) were utilized in the comparative assessment. It is relevant to note that the iteration count is not included in the performance comparison, since the average time cost per iteration is not the same for the SRN with BP as for the SRN with RBP, which makes such a comparison meaningless. In general, the SRN with RBP requires more time for a given iteration than the SRN with BP because of the complex process of setting up and relaxing both the SRN and its adjoint network dynamics in the case of training with RBP.

In order to facilitate a reasonably objective and unbiased comparison between the two training algorithms, a similar simulation environment was employed. The simulation code was written in C/C++ to run within the UNIX operating system environment. The same C/C++ based code template formed the basis for the standard backpropagation as well as the recurrent backpropagation implementations. The simulations ran on a Sun SPARC workstation with two 300 MHz processors and 1.2 GB of main memory.

In the following sections, setup and initialization are explicitly presented for the SRN/BP algorithm, while similar details for the SRN/RBP algorithm follow the approach presented in [Serpen et al., 2001] and are therefore not duplicated herein.


Setup and Initialization

The SRN architecture given in Figure 6 was utilized in the simulation study. A number of parameters need to be set and initial values need to be determined for the SRN configured to solve the TSP: these are the number of hidden layer nodes, the learning rate, values of constraint weight parameters, slope and range of the activation functions of the nodes in the hidden and output layers, initial values of weights and node outputs, specification of the distance matrix, the stopping criterion for the search process, and convergence criterion for the network dynamics.

The SRN as presented in Figure 6 has a two-layer topology typically with a small number of hidden nodes: the node count in the hidden layer was specified as eight for the entire simulation study, a number determined through an initial empirical study. The network node outputs are randomly initialized using a uniform probability distribution: the initial values of the outputs of all nodes in both layers are set to values in the range of 0.0 to 1.0. The weights are likewise randomly initialized to values in the range of –0.2 to +0.2 with a uniform probability distribution. All nodes in the network utilized a unipolar sigmoid with a slope coefficient value of 1.0. Nodes were considered "active" or "inactive" for output values of ≥ 0.8 and ≤ 0.2, respectively.
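The reported settings can be collected into a small initialization sketch. The numeric values are those stated in the text, while the names and array orientations are illustrative assumptions:

```python
import numpy as np

# Hyperparameters reported in the text for the SRN/BP TSP simulations.
CONFIG = {
    "hidden_nodes": 8,           # fixed for the entire simulation study
    "learning_rate": 0.3,
    "sigmoid_slope": 1.0,
    "weight_range": (-0.2, 0.2),
    "active_threshold": 0.8,     # output >= 0.8 counts as "active"
    "inactive_threshold": 0.2,   # output <= 0.2 counts as "inactive"
}

def init_srn(n_cities, rng=np.random.default_rng()):
    """Randomly initialize weights and node outputs as described in the text."""
    K = n_cities * n_cities           # output nodes: the N-by-N array
    J = CONFIG["hidden_nodes"]
    lo, hi = CONFIG["weight_range"]
    U = rng.uniform(lo, hi, size=(K, J))   # forward (hidden-to-output) weights
    W = rng.uniform(lo, hi, size=(J, K))   # backward (feedback) weights
    z = rng.uniform(0.0, 1.0, size=K)      # initial output node values
    return U, W, z
```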

The distances among N cities are represented by an N×N matrix. Each entry of the matrix is the distance between the two cities given by the indices of that entry. The distance matrix is symmetric, and each entry is randomly set to a value between 0.0 and 1.0 using a uniform probability distribution. According to this distance matrix, the expected value of the normalized average distance for a randomly chosen path is 0.5.

Values of the constraint weight parameters in Table 1 provided good starting points for the range of problems, 100 cities to 600 cities, tested with the SRN trained with the RBP as reported in an earlier study [Serpen et al., 2001]. The weight parameters were initialized to the values shown in Table 1 for the SRN trained with the BP as well. During the training process, the weight parameters needed to be increased following a schedule: this turned out to be critical for the SRN to locate a solution, a finding that concurs with [Serpen et al., 2001]. Weight parameters were incremented by the values given in Table 1 every five iterations.

Parameter   Initial Value   Increment
g_col       0.003           0.004
g_row       0.003           0.004
g_bin       0.003           0.004
g_dis       0.010           0.020

Table 1. Weight Parameter Initial and Incremental Values.
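The schedule of Table 1 amounts to a simple step function of the iteration count, sketched below:

```python
def penalty_weights(iteration):
    """Constraint weight schedule of Table 1: start at the initial values and
    add the increment once every five training iterations."""
    steps = iteration // 5
    return {
        "g_col": 0.003 + steps * 0.004,
        "g_row": 0.003 + steps * 0.004,
        "g_bin": 0.003 + steps * 0.004,
        "g_dis": 0.010 + steps * 0.020,
    }

# penalty_weights(0)["g_dis"] starts at 0.010; by iteration 10 it reaches 0.05.
```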

The learning rate is an important factor that affects the ability of the backpropagation algorithm to locate a solution. If the learning rate is too large, oscillation may occur as the network approaches a solution. Conversely, if the learning rate is too small, the network is likely to take quite a long time to reach a solution due to the slow learning speed. A learning rate value of 0.3 was determined to be appropriate through a preliminary empirical study.


The quality of the solution is measured by computing the normalized average distance (NAD) for the solution path the SRN locates, which is defined by

$$NAD = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{m=1}^{N} z_{ij}(\infty)\, z_{m(j+1)}(\infty)\, d_{im}, \qquad (12)$$

where d_im is the distance between city i and city m, and N is the number of cities.
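A sketch of Equation 12, assuming the column index wraps around so that position N+1 is position 1 (a closed tour):

```python
import numpy as np

def normalized_average_distance(z, d):
    """Normalized average distance of Equation 12 for the relaxed output
    array z and distance matrix d. Wrap-around of the column index at N
    (closing the tour) is an assumption here."""
    N = z.shape[0]
    z_next = np.roll(z, -1, axis=1)   # z[m, j+1] with wrap-around
    return np.einsum('ij,mj,im->', z, z_next, d) / N
```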

The stopping criterion for network training is a critical issue. Since the globally optimal solution of the TSP is not known, it is not possible to monitor the quality of solutions to determine a good stopping point for the training of the network. One standard approach is to stop training as soon as a valid solution, which for a gradient-descent-based search algorithm is guaranteed to be locally optimal, is located. This criterion was employed for simulations with both training algorithms, namely the RBP and the BP. Clearly, training beyond this point is likely to result in solutions with improved quality.

Determining the conclusion of a relaxation of the SRN dynamics is another important simulation consideration. This determination is facilitated by deciding when to declare the SRN dynamics converged to a stable equilibrium point: |z_k(t_i) − z_k(t_{i−1})| < ε for all z_k with k = 1, 2, …, K, where z_k is the k-th neuron in the output layer, K is the number of neurons in the output layer, ε ∈ R⁺, ε → 0, and t_i is the index of discrete time. A practical implementation of this condition can be realized through the following inequality:

$$\frac{1}{K} \sum_{k=1}^{K} \left| z_k(t_i) - z_k(t_{i-1}) \right| < 0.001.$$
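The practical convergence test, the mean-absolute-change bound with the 0.001 threshold, can be sketched as:

```python
import numpy as np

def has_converged(z_now, z_prev, eps=0.001):
    """Declare the SRN relaxed when the mean absolute change of the output
    node values between consecutive time steps falls below eps."""
    return float(np.mean(np.abs(np.asarray(z_now) - np.asarray(z_prev)))) < eps
```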

The simulation study was performed for problem sizes of 10, 30, 50, 70, and 100 cities: a total of 10 trials were carried out for each city count. The normalized average distance for the computed solution and the simulation time (processor time expended) were recorded for each simulation case. The training time is determined through the UNIX utility timex, which returns the processor time expended by the simulation process.

Simulation Results

Simulations were performed for both algorithms, RBP and BP, on the same problem instances of the TSP. A statistical analysis of the simulation data was performed using the t-test, and a box-whiskers plot for each monitored measure was produced.

The plot for solution quality as measured by the normalized average distance in Figure 7 demonstrates that the SRN/BP algorithm performs poorly compared to the SRN/RBP algorithm for larger instances of the Traveling Salesman problem. In fact, the solution quality is no better than that of a randomly chosen path as the problem size approaches the 100-city instance. Results of the paired t-test on the solution quality measure are provided in Table 2: as the P values indicate, the differences in solution quality between the two training algorithms are statistically significant.

Training time (as measured for convergence to the first feasible solution located) required by both algorithms is plotted in Figure 8. The results clearly suggest that the SRN/BP algorithm's performance is severely lagging (note the logarithmic scale of the vertical axis of the plot). Furthermore, results of the paired t-test on the convergence time measure are provided in Table 3: the P values indicate statistical significance for the differences in training time between the two training algorithms.

(Figure: box-whiskers plot of normalized average distance, vertical axis 0 to 0.6, versus city count for RBP and BP at 10, 30, 50, 70, and 100 cities.)

Figure 7. Simulation Results on Solution Quality.


City Count        10              30              50              70              100
Algorithm      RBP     BP      RBP     BP      RBP     BP      RBP     BP      RBP     BP
Mean           0.23    0.34    0.19    0.30    0.21    0.48    0.20    0.47    0.19    0.46
Deviation      0.072   0.081   0.037   0.049   0.037   0.060   0.027   0.038   0.030   0.058
P              0.006           <0.001          <0.001          <0.001          <0.001
t(10)          1.83            1.83            1.83            1.83            1.83

Table 2. Results of t-test for Solution Quality Measure.

(Figure: box-whiskers plot of training time in seconds, on a logarithmic vertical axis from 1 to 100000, versus city count for RBP and BP at 10, 30, 50, 70, and 100 cities.)

Figure 8. Simulation Results on Training Time.


City Count            10               30                50                  70                   100
Algorithm         RBP     BP       RBP     BP       RBP     BP        RBP     BP          RBP      BP
Mean (seconds)    3.2     2.3      71.0    57.3     151.6   284.8     237.0   6310.6      717.0    17496.3
Deviation         0.573   0.520    16.431  6.431    21.186  201.434   51.083  3709.409    299.463  11930.87
P                 <0.001           0.034            0.035             <0.001              <0.001
t(10)             1.83             1.83             1.83              1.83                1.83

Table 3. Results of t-test on Training Time Measure.

Performance assessment of the SRN/BP for two larger TSP instances, namely 130 and 150 cities, was also attempted with a single trial each, since the simulation time requirement turned out to be computationally prohibitive for a statistically significant number of trials. The results, presented in Table 4, suggest that the two main trends associated with the SRN/BP, namely its high computational expense and its lower-quality solutions, remain valid.

City Count   Performance Measure            RBP        BP
130          Solution Quality               0.16       0.49
             Simulation Time (hr:min:sec)   00:28:11   44:56:05
150          Solution Quality               0.20       0.48
             Simulation Time (hr:min:sec)   00:51:31   23:10:24

Table 4. Performance Results for Larger TSP Instances.

In summary, the simulation results clearly demonstrated that it is feasible to train the SRN with the standard, non-recurrent backpropagation algorithm. However, both the optimizing capability, as observed through the solution quality measure, and the computing time expended to locate the locally optimal solution require much improvement for the SRN/BP to gain a competitive edge.

It is highly relevant to note that the performance of the SRN trained with a non-recurrent backpropagation variant has great potential for improvement, since computationally efficient versions of standard backpropagation, namely quasi-Newton [Battiti, 1989] and conjugate gradient descent [Johansson et al., 1991; Kinsella, 1992], are yet to be explored for improving on the training time of the basic backpropagation algorithm. Such a computationally efficient algorithm might facilitate an order-of-magnitude improvement in the computing time expended. These savings in computational effort could then make it possible to continue the search process beyond the very first solution located, which is highly likely to be only locally optimal. Once it becomes computationally affordable to carry the search process much beyond the very first solution located, improved solutions with much smaller and competitive normalized average distances become computable. This promise may position the Simultaneous Recurrent Neural network trained with a computationally efficient variant of standard, non-recurrent backpropagation as a competitive algorithm compared to the same neural network trained with the recurrent backpropagation.

CONCLUSIONS

A recurrent neural network algorithm, the Simultaneous Recurrent Neural network (SRN), was trained with a non-recurrent training algorithm, standard backpropagation, to solve static optimization problems in order to explore the feasibility, the optimizing potential, and the promise of greater computational efficiency. The simulation study showed that it is feasible to train the SRN with the non-recurrent backpropagation algorithm. However, solutions computed through the non-recurrent backpropagation failed to match the quality of those obtained with the recurrent backpropagation algorithm. Furthermore, the training time requirement of the SRN with the backpropagation algorithm increased dramatically for larger-scale TSP instances compared to that of the SRN trained with the recurrent backpropagation algorithm. This research focus still offers much promise that is yet to be explored, since only the basic backpropagation algorithm was utilized and there exist numerous computationally efficient versions of the backpropagation training algorithm, which might offer the competitive edge for non-recurrent training algorithms over recurrent ones for the Simultaneous Recurrent Neural network.

ACKNOWLEDGEMENTS

This project has been funded in part by the USA National Science Foundation Grant No. 9800247. The authors gratefully acknowledge the computing resources made available by Dr. Demetrios Kazakos in his capacity as the chairman of the Electrical Engineering and Computer Science Department at the University of Toledo, Toledo, Ohio, USA. The authors also appreciate the valuable feedback received from anonymous referees towards noticeably improving this paper.

BIBLIOGRAPHY

Almeida, L. B., A Learning Rule for Asynchronous Perceptrons with Feedback in a Combinatorial Environment, Proceedings of the IEEE 1st International Conference on Neural Networks, San Diego, CA, June 21-24, pp. 609-618, 1987.

Battiti, R., Accelerated Back-Propagation Learning: Two Optimization Methods, Complex Systems, Vol. 3, pp. 331-342, 1989.

Cichocki, A. and Unbehauen, R., Neural Networks for Optimization and Signal Processing, Wiley, 1993.

Hopfield, J. J., Neurons with Graded Response Have Collective Computational Properties Like Those of Two-State Neurons, Proceedings of the National Academy of Sciences, USA, Vol. 81, pp. 3088-3092, 1984.

Hopfield, J. J., and Tank, D. W., Neural Computation of Decisions in Optimization Problems, Biological Cybernetics, Vol. 52, pp. 141-152, 1985.

Johansson, E. M., Dowla, F. U., and Goodman, D. M., Backpropagation Learning for Multi-Layer Feed-Forward Neural Networks Using the Conjugate Gradient Method, International Journal of Neural Systems, Vol. 2, No. 4, pp. 291-301, 1991.

Kinsella, J. A., Comparison and Evaluation of Variants of the Conjugate Gradient Method for Efficient Learning in Feed-Forward Neural Networks with Backward Error Propagation, Network, Vol. 3, pp. 27-35, 1992.

Pineda, F. J., Generalization of Back-Propagation to Recurrent Neural Networks, Physical Review Letters, Vol. 59, pp. 2229-2232, 1987.

Reinelt, G., The Traveling Salesman: Computational Solutions for TSP Applications, Vol. 840 of Lecture Notes in Computer Science, Springer-Verlag, Berlin Heidelberg New York, 1994.

Serpen, G., Patwardhan, A. and Geib, J., The Simultaneous Recurrent Neural Network Addressing the Scaling Problem in Static Optimization, International Journal of Neural Systems, Vol. 11, No. 5, pp. 477-487, 2001.

Serpen, G. and Livingston, D. L., Determination of Weights for Relaxation Recurrent Neural Networks, Neurocomputing, Vol. 34, pp. 145-168, 2000.

Werbos, P. J., Generalization of Backpropagation with Application to a Recurrent Gas Market Model, Neural Networks, Vol. 1, No. 4, pp. 234-242, 1988.

Werbos, P. J., and Pang, X., Neural Network Design for J Function Approximation in Dynamic Programming, technical report available online at http://www.isr.umd.edu/TechReports/CSHCN/1998/CSHCN_MS_98-1/CSHCN_MS_98-1.pdf, 1998.

Werbos, P. J., Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, Ph.D. Thesis, Harvard University, 1974.


APPENDIX

ERROR FUNCTION AND ITS PARTIAL DERIVATIVES

In order to train the Simultaneous Recurrent Neural network as a static optimizer for the Traveling Salesman problem, it is necessary to define a measure of the error for the output nodes. The error function needs to ensure valid solutions as well as a minimum path length. Certain constraints need to be in place to ensure a valid solution. Given the problem representation presented in Figure 6, each row and column in the N×N output array must have exactly one node active: the output value of an active node should be close to 1.0 while the output of each inactive node in the N×N array must approach the limiting value of 0.0.

The traveling salesman should visit each city in the list exactly once. Thus, when the network converges to a solution, there should be exactly one active node in each row and in each column. This constraint can be implemented using inhibition among the nodes in a given row and column. The error term for the column constraint is defined by

$$E_{col} = g_{col} \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ 1 - \sum_{m=1}^{N} z_{mj}(\infty) \right]^2, \qquad (A-1)$$

where i and j are the indices for rows and columns, respectively, m is the index for rows of the network, z_mj(∞) is the stable value of the mj-th node output upon convergence to a fixed point, and g_col is a positive real weight parameter. When each column of the output matrix has exactly one active node, this error term will be zero. The first summation over the indexing variable i is included because the error function needs to be defined for each node in the output layer.


Similarly, the error term for the row constraint is given by

$$E_{row} = g_{row} \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ 1 - \sum_{n=1}^{N} z_{in}(\infty) \right]^2, \qquad (A-2)$$

where i and j are the indices for rows and columns, respectively, of the network, n is the index for columns, and g_row is a positive real weight parameter. This error term will have a value of zero when each row of the output matrix has exactly one active node. Again, the second summation over the index variable j is included since the error function needs to be defined for every ij-th node in the output layer.

An error term is also introduced that forces the node outputs to the limiting values of 0.0 or 1.0 as

E_{bin} = g_{bin} \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ -\left( z_{ij}(\infty) - \alpha \right)^{2} + \beta \right],   (A-3)

where α and β are constants and g_{bin} is the positive real weight parameter for this constraint. This error term is a downward-opening quadratic function. By choosing the values α = 0.5 and β = 0.25, the zeros of the function are at 0.0 and 1.0. Thus, this error term has a minimum value of zero when all of the nodes in the output layer have their output values at either 0.0 or 1.0.
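The behavior of this quadratic penalty is easy to verify numerically; the following sketch (hypothetical names) uses the stated values α = 0.5 and β = 0.25:

```python
import numpy as np

def binary_error(z, g_bin=1.0, alpha=0.5, beta=0.25):
    """Binary-forcing penalty of Equation A-3:
    g_bin * sum_i sum_j ( -(z[i, j] - alpha)**2 + beta )."""
    return g_bin * np.sum(-(z - alpha) ** 2 + beta)

# Zero penalty at the limiting values 0.0 and 1.0 ...
print(binary_error(np.array([[0.0, 1.0], [1.0, 0.0]])))  # 0.0
# ... and a maximal per-node penalty of beta for outputs stuck at 0.5.
print(binary_error(np.full((2, 2), 0.5)))                # 1.0
```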


The error term associated with the distance between the cities can be formulated as

E_{dis} = g_{dis} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{m=1}^{N} z_{ij}(\infty)\, z_{m(j+1)}(\infty)\, d_{im},   (A-4)

where d_{im} is the cost associated with the path from city i to city m, and g_{dis} is the positive real weight parameter for this constraint. For each node z_{ij}, the index m ranges over the nodes in the (j+1)-st column, indicated by the z_{m(j+1)}(\infty) term. If both nodes are active, the distance d_{im} from city i to city m contributes to this error term; hence the term attains its minimum value when the total length of the tour is minimum.
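As an illustration, the distance penalty can be computed for a 0/1 output array. The text does not state how the column index j+1 is treated at j = N, so the sketch below (hypothetical names) assumes it wraps around to the first column, i.e., a closed tour:

```python
import numpy as np

def distance_error(z, d, g_dis=1.0):
    """Tour-length penalty of Equation A-4:
    g_dis * sum_i sum_j sum_m z[i, j] * z[m, (j+1) % N] * d[i, m]."""
    z_next = np.roll(z, -1, axis=1)              # column j holds z[:, (j+1) % N]
    return g_dis * np.einsum('ij,mj,im->', z, z_next, d)

# For a valid permutation matrix the penalty is exactly the closed-tour length.
d = np.array([[0., 2., 9.],
              [2., 0., 4.],
              [9., 4., 0.]])                     # symmetric city-to-city costs
tour = np.eye(3)                                 # visit cities 0, 1, 2 in order
print(distance_error(tour, d))                   # 15.0  (= 2 + 4 + 9)
```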

The total error function E is the sum of the individual error terms:

E = E_{col} + E_{row} + E_{bin} + E_{dis}.   (A-5)

The error function E given in Equation A-5 needs to be manipulated to derive the partial derivatives required to apply the backpropagation training algorithm to the relaxed SRN dynamics. More specifically, the derivative of the error function E with respect to some weight w_{kl} needs to be defined, where w_{kl} is the weight between the k-th node in the output layer for k = 1, 2, …, N×N and the l-th node in the hidden layer for l = 1, 2, …, J. The chain rule yields


\frac{\partial E}{\partial w_{kl}} = \frac{\partial E}{\partial z_{k}(\infty)} \frac{\partial z_{k}(\infty)}{\partial w_{kl}}.

This equation can be rewritten in terms of an output node array with N rows and N columns, where k, the index of the output layer nodes, is related to the row and column indices q and r, respectively, by k = (q-1)N + r:

\frac{\partial E}{\partial w_{kl}} = \frac{\partial E}{\partial z_{qr}(\infty)} \frac{\partial z_{qr}(\infty)}{\partial w_{kl}}.   (A-6)
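The mapping k = (q-1)N + r between the linear output index and the row/column pair can be sketched as follows (1-based indices, as in the text; names are illustrative):

```python
def to_linear(q, r, n):
    """Map the 1-based row/column pair (q, r) to the linear output index k."""
    return (q - 1) * n + r

def to_grid(k, n):
    """Recover the 1-based pair (q, r) from the linear index k."""
    q, r = divmod(k - 1, n)
    return q + 1, r + 1

print(to_linear(2, 3, 4))  # 7
print(to_grid(7, 4))       # (2, 3)
```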

Furthermore, the derivative of the total error function given in Equation A-5 can be computed by adding the derivatives of the individual error terms:

\frac{\partial E}{\partial w_{kl}} = \frac{\partial E_{col}}{\partial w_{kl}} + \frac{\partial E_{row}}{\partial w_{kl}} + \frac{\partial E_{dis}}{\partial w_{kl}} + \frac{\partial E_{bin}}{\partial w_{kl}}.   (A-7)

Using the error term due to the row constraint in Equation A-2 and taking the derivative with respect to w_{kl} yields

\frac{\partial E_{row}}{\partial w_{kl}} = -2 g_{row} \sum_{q=1}^{N} \sum_{r=1}^{N} \left( 1 - \sum_{n=1}^{N} z_{qn}(\infty) \right) \frac{\partial z_{qr}(\infty)}{\partial w_{kl}},   (A-8)

while noting that \partial z_{qn}(\infty) / \partial w_{kl} \neq 0 only when n = r, that is, when k = (q-1)N + r.


Similarly, using the error term in Equation A-1 for the column constraint and taking the derivative with respect to w_{kl} results in

\frac{\partial E_{col}}{\partial w_{kl}} = -2 g_{col} \sum_{q=1}^{N} \sum_{r=1}^{N} \left( 1 - \sum_{m=1}^{N} z_{mr}(\infty) \right) \frac{\partial z_{qr}(\infty)}{\partial w_{kl}},   (A-9)

since \partial z_{mr}(\infty) / \partial w_{kl} \neq 0 only when m = q, that is, when k = (q-1)N + r.

The derivative of the error term due to the distance constraint in Equation A-4 with respect to w_{kl} is given by

\frac{\partial E_{dis}}{\partial w_{kl}} = g_{dis} \sum_{q=1}^{N} \sum_{r=1}^{N} \sum_{m=1}^{N} d_{qm} \left( z_{qr}(\infty) \frac{\partial z_{m(r+1)}(\infty)}{\partial w_{kl}} + z_{m(r+1)}(\infty) \frac{\partial z_{qr}(\infty)}{\partial w_{kl}} \right).

The partial derivative vanishes except for the weight from the l-th node in the hidden layer to the k-th node in the output layer, where k = (q-1)N + r. Therefore, the first partial derivative term inside the parentheses is always zero, leading to the following simplified form:

\frac{\partial E_{dis}}{\partial w_{kl}} = g_{dis} \sum_{q=1}^{N} \sum_{r=1}^{N} \sum_{m=1}^{N} d_{qm}\, z_{m(r+1)}(\infty) \frac{\partial z_{qr}(\infty)}{\partial w_{kl}}.   (A-10)

The derivative of the error term for the node output constraint in Equation A-3 can be computed as

\frac{\partial E_{bin}}{\partial w_{kl}} = -2 g_{bin} \sum_{q=1}^{N} \sum_{r=1}^{N} \left[ z_{qr}(\infty) - \alpha \right] \frac{\partial z_{qr}(\infty)}{\partial w_{kl}}.   (A-11)


By substituting the terms in Equations A-8 through A-11 into Equation A-7, the entire partial derivative becomes

\frac{\partial E}{\partial w_{kl}} = 2 g_{col} \sum_{q=1}^{N} \sum_{r=1}^{N} \left( \sum_{m=1}^{N} z_{mr}(\infty) - 1 \right) \frac{\partial z_{qr}(\infty)}{\partial w_{kl}} + 2 g_{row} \sum_{q=1}^{N} \sum_{r=1}^{N} \left( \sum_{n=1}^{N} z_{qn}(\infty) - 1 \right) \frac{\partial z_{qr}(\infty)}{\partial w_{kl}} - 2 g_{bin} \sum_{q=1}^{N} \sum_{r=1}^{N} \left( z_{qr}(\infty) - \alpha \right) \frac{\partial z_{qr}(\infty)}{\partial w_{kl}} + g_{dis} \sum_{q=1}^{N} \sum_{r=1}^{N} \sum_{m=1}^{N} d_{qm}\, z_{m(r+1)}(\infty) \frac{\partial z_{qr}(\infty)}{\partial w_{kl}}.   (A-12)

Associating Equation A-6 with Equation A-12, and noting that the derivative \partial z_{qr}(\infty) / \partial w_{kl} is nonzero only for the single index pair (q, r) satisfying k = (q-1)N + r so that the double sums collapse to that term, yields

\frac{\partial E}{\partial z_{qr}(\infty)} = 2 g_{col} \left( \sum_{m=1}^{N} z_{mr}(\infty) - 1 \right) + 2 g_{row} \left( \sum_{n=1}^{N} z_{qn}(\infty) - 1 \right) - 2 g_{bin} \left( z_{qr}(\infty) - \alpha \right) + g_{dis} \sum_{m=1}^{N} d_{qm}\, z_{m(r+1)}(\infty).   (A-13)

Equations A-12 and A-13 provide the two partial derivatives needed for the implementation of the standard, non-recurrent backpropagation algorithm.
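As a closing sketch, the per-node derivative of Equation A-13 can be assembled with NumPy (0-based indices; the (r+1)-st column is again assumed to wrap modulo N, and all names are illustrative):

```python
import numpy as np

def error_gradient(z, d, g_col=1.0, g_row=1.0, g_bin=1.0, g_dis=1.0, alpha=0.5):
    """dE/dz[q, r] per Equation A-13, returned as an N-by-N array."""
    col_term = 2.0 * g_col * (z.sum(axis=0, keepdims=True) - 1.0)  # varies with column r only
    row_term = 2.0 * g_row * (z.sum(axis=1, keepdims=True) - 1.0)  # varies with row q only
    bin_term = -2.0 * g_bin * (z - alpha)
    dis_term = g_dis * d @ np.roll(z, -1, axis=1)                  # sum_m d[q, m] * z[m, r+1]
    return col_term + row_term + bin_term + dis_term

d = np.array([[0., 2., 9.],
              [2., 0., 4.],
              [9., 4., 0.]])
g = error_gradient(np.eye(3), d)
print(g.shape)   # (3, 3)
print(g[0, 0])   # -1.0 (binary term) + 2.0 (distance term d[0, 1]) = 1.0
```

The returned array is exactly the quantity backpropagated into the hidden layer when the relaxed network is treated as a feedforward network.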
