Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005

Design of an Optical Fixed-Weight Learning Neural Network

A. Steven Younger and Emmett Redd
Department of Physics, Astronomy, and Materials Science
Southwest Missouri State University
Springfield, MO 65804
E-mail: [email protected] or [email protected]

Abstract--This paper deals with the design, analysis, and simulation of a prototype Optical Fixed-Weight Learning Neural Network. This type of network could have learning rates five orders of magnitude faster than networks based on von Neumann platforms. The network has an embedded learning algorithm and dynamically learns new mappings by changing recurrent neural signal strengths. This will greatly speed up optical neural network learning, since the medium containing the synaptic weights does not change during learning. Software simulations suggest that this design is sound. The physical implementation and evaluation of the prototype will be reported elsewhere.

I. INTRODUCTION

We will present the concept of an Optical Fixed-Weight Learning Neural Network. The constraints placed on the network by the physical properties of the optical system will be examined. A design method called analytic construction, which we used to design an Optical FWL-NN, will be presented. Scaling issues will be described in some detail. We will discuss how to compute the necessary fixed synaptic weights. The results were used to create a simple network capable of learning any linearly separable, two-argument Boolean function. Finally, we will present the results of software simulations of the Optical FWL-NN. The physical implementation and evaluation of a hardware prototype will be reported elsewhere.

Optical Neural Networks are the fastest method available today for performing neural computations. Systems employing laser diodes and high-speed operational amplifiers have estimated synaptic processing rates 100,000 times that of a high-end Pentium [1]. Today, optical DSP processors that can perform up to 8,000 billion synaptic operations per second are commercially available (Lenslet, Inc. EnLight 256). Given this speed advantage, why are Optical Neural Networks not more commonly used? One reason is that the training of these networks cannot use the full speed of the optical hardware. Learning changes the synaptic weights, which requires changing the medium on which the synaptic information is stored. In some systems, this is a holographic plate or slide. Other systems use a Spatial Light Modulator (SLM). A new slide must be produced, or weights must be downloaded to the SLM. Both of these are slow compared to the forward propagation of the optical neural network. We propose to integrate learning into the network such that the high-speed hardware can be used during learning as well as forward propagation. This will be done by developing an Optical Fixed-Weight Learning Neural Network (FWL-NN).

A. Fixed-Weight Learning Neural Networks

Several researchers have investigated FWL-NNs [2-6]. Most (perhaps all) implementations of FWL-NNs up to now have been software-based. FWL-NNs are a type of recurrent neural network. They have two features that distinguish them from other neural networks. First, FWL-NNs store the function-mapping information in dynamic, recurrent signal loops instead of synaptic weights. During learning, the network itself dynamically adjusts these internal recurrent neural signals. This approach is similar to biological working memory, where information is thought to be stored in recurrent neural pathways. The second distinction of FWL-NNs is that the learning algorithm is embedded, or encoded, in the synaptic weights. The synaptic weights encode the ability to learn any mapping (from a large, possibly infinite set of mappings) instead of just a particular mapping, as in conventional neural networks. This FWL-NN uses an on-line (always learning) version of Backpropagation. However, any learning algorithm (on-line or off-line) can be implemented as a FWL-NN [2].

0-7803-9048-2/05/$20.00 ©2005 IEEE

B. Optical Neural Networks

A schematic of an optical neural network stratum is shown in Figure 1. A stratum is a layer of neurons together with their synaptic computations. Light from a source passes through a high-speed modulator, which encodes the activation level x_i of the source neuron n_i. Only one of the many source neurons is shown. The light is directed by presynaptic optics onto the synaptic medium. The interaction of the optical neural signal with the synaptic medium (attenuation) performs the synaptic computation (i.e., W_ij · x_i). Postsynaptic optics, such as a cylindrical lens, directs the results from all source neurons onto the light detector (target neuron n_j). The detector performs the spatial summation s_j = Σ_i W_ij · x_i of the signals from all source neurons. An electronic device such as an operational amplifier performs the squashing function (such as logsig(s) = 1/(1 + e^(−s))). The final activations are then sent to the next stratum and/or to the external system.

[Figure 1. Schematic of a stratum of an optical neural network. Labeled components: Light Source → Light Modulator (source neuron activation) → Optical Neural Signal → Presynaptic Optics → Synaptic Medium → Postsynaptic Optics → Light Detector (target neuron) → Summed Neural Signal → Operational Amplifier → Target Neuron Activation (Electronic).]

II. CREATING A FWL-NN

A. Meta-Learning

How are the synaptic weights for the FWL-NN derived? Most researchers have used a method called meta-learning or adaptive behavior [5-7]. Meta-learning is learning how to learn (or how to learn better). It uses a large set of example function mappings to derive a learning algorithm. An optimization algorithm adjusts the synaptic weights in such a way as to minimize a given error metric (usually the total squared error over all meta-training data). The resulting learning algorithms (synaptic weights) have proven to be very fast and accurate at learning functions from the meta-training set [7]. However, because of concerns about the generalization ability of meta-learning-derived learning algorithms [6], we have rejected meta-learning for this FWL-NN project.

B. Analytic Construction

Extending a method [4] that we now call analytic construction, we derived a FWL-NN that contains an on-line Backpropagation learning algorithm. This ensures that the FWL-NN will generalize as well as any standard (non-fixed-weight learning) on-line Backpropagation network. Previous FWL-NNs containing Backpropagation used higher-order Σ-Π synapses [8], which are not easy to implement in optical hardware. We used analytic construction to derive a first-order (i.e., standard-synapse) FWL-NN that contains Backpropagation, which can be implemented much more easily in optical hardware.

Figure 2 illustrates the construction of a FWL-NN that is equivalent to a non-fixed-weight, single-synapse neural network that learns by Backpropagation. Figure 2(a) shows a computer using the Backpropagation algorithm to compute a new synaptic weight during the neural network learning process. It computes the synaptic weight change ΔW(t−1) using the Backpropagation formula

ΔW(t−1) = x(t−1) · y(t−1) · (1 − y(t−1)) · (y(t−1) − T(t−1)).

With learning rate η, a weight decay term ε, and the ΔW(t−1) value, the computer calculates the new synaptic weight

W(t) = (1 − ε) · W(t−1) − η · ΔW(t−1).

The next time a network input is presented, the new weight is used in the synaptic operation (multiplication of the input signal x(t) by the new synaptic weight), i.e., s(t) = W(t−1) · x(t). For a FWL-NN (Figure 2(b)), all of the above computations are done by neural networks (called sub-networks), which operate on (and produce as output) neural signals. No synaptic weights are changed as the FWL-NN learns a new mapping; only the neural signals change.

C. Universal Approximation

From the Universal Approximation Theorem, we know that there exist sub-networks that can perform any learning algorithm, including Backpropagation [2]. We created these sub-networks by training them on the appropriate functions. We then assembled them into a complete FWL-NN. Rather than continue to use "weight signal" or "recurrent neural signal" as a term, we will use the word potency. These potencies have also been called "flying weights" [9]. We trained a neural network to multiply the source signal and potency together. Since this fixed-weight network performs a function similar to a synapse, but is beyond one like a transistor is beyond a resistor, we call it a tranapse. Similarly, the potency comes from a Backpropagation formalism which includes an error signal. Since error in Greek is πλανη, the error signal will be produced by a fixed-weight network called a planapse.
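The target functions that the tranapse and planapse sub-networks are trained to approximate can be written down directly from the formulas above. The sketch below is ours, not the authors' code: the function names are hypothetical, and the linear [0,1] → [−ω,+ω] potency scaling (with an assumed ω = 4) anticipates the scaling discussed later in the paper.

```python
import math

def logsig(s):
    # Squashing function used by the optical neuron: logsig(s) = 1/(1 + e^-s)
    return 1.0 / (1.0 + math.exp(-s))

def tranapse_target(x, p, omega=4.0):
    # A tranapse is trained to multiply a source signal by a potency.
    # Optical signals live in [0,1]; here the potency is linearly mapped
    # to [-omega, +omega] before the multiply (assumed scaling).
    w = 2.0 * omega * p - omega      # [0,1] -> [-omega, +omega]
    return w * x

def planapse_target(x_prev, y_prev, t_prev):
    # A planapse is trained to produce the Backpropagation error term
    # dW = x * y * (1 - y) * (y - T) from the previous time step.
    return x_prev * y_prev * (1.0 - y_prev) * (y_prev - t_prev)
```

In the actual FWL-NN these functions are realized by small fixed-weight sub-networks rather than computed exactly; the closed forms above only define the mappings those sub-networks must learn.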

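The changing-weight network of Figure 2(a) can be sketched as a short scalar simulation. This is a minimal illustration of the update rule above, not the authors' MATLAB simulator; the learning rate η = 2 and decay ε = 0.001 are arbitrary assumed values.

```python
import math

def logsig(s):
    # logsig(s) = 1/(1 + e^-s)
    return 1.0 / (1.0 + math.exp(-s))

def train_single_synapse(samples, eta=2.0, eps=0.001, w0=0.0):
    # Conventional (changing-weight) counterpart of the FWL-NN:
    # after each presentation, form dW = x*y*(1-y)*(y-T) and update
    # W <- (1-eps)*W - eta*dW, as in the formulas of Section II-B.
    w = w0
    for x, target in samples:
        y = logsig(w * x)                        # forward pass
        dw = x * y * (1.0 - y) * (y - target)    # Backprop weight change
        w = (1.0 - eps) * w - eta * dw           # decay plus gradient step
    return w

# Learn the mapping x = 1 -> T = 1; the weight is driven positive
# until the output saturates near 1.
w = train_single_synapse([(1.0, 1.0)] * 200)
```

The FWL-NN of Figure 2(b) performs exactly this computation, but with the multiply and the error term produced by fixed-weight sub-networks and with W carried as a recurrent signal (a potency) instead of a stored weight.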

The outputs of several tranapses can be combined by a single neuron. This works like a neuron that has potencies rather than weights. Since the potencies are the recurrent signals discussed above, the combination of tranapses, planapses, and a neuron will be called a recurron. These able-to-learn recurrons can be connected together as a recurrent neural network, i.e., a recurral network. Later, we will discuss the construction and simulation of a recurron.

[Figure 2. (a) Single-synapse, changing-weight neural network: the synapse computes s(t) = W(t−1) · x(t), the neuron (Σ) squashes the sum to produce y, and a computer derives the new weight from y(t−1), T(t−1), x(t−1), and W(t−1). (b) Equivalent Fixed-Weight Learning Neural Network: a Tranapse combines x(t) with the potency signal W(t−1), which a Planapse maintains from y(t−1), T(t−1), and x(t−1).]

III. CURRENT CONSTRAINTS DUE TO THE OPTICAL HARDWARE

The optical neural signal can be encoded by any of several methods, such as analog intensity-level modulation (driven by a D/A converter), pulse-width modulation, or pulse-train modulation. The first scheme operates faster, but achieving a reproducible linear response is problematic, especially with larger numbers of significant bits of precision. The other two methods are slower, but allow easier extension to greater numbers of significant bits.

No matter which scheme is used, the signal has a limited precision (number of significant bits), a minimum value (zero), and a maximum value (full intensity), which we set to unity. For our prototype, we decided to use a Digital Micromirror Device (DMD) to spatially and temporally modulate the light. The DMD is a rectangular array of one million micromirrors on a one-square-inch integrated circuit. Each mirror can be independently and rapidly switched from an 'on' state to an 'off' state. The DMD can support the pulse-width and pulse-train encoding schemes.

Attenuation in an optical medium ranges from opaque (zero synaptic weight) to transparent (unit synaptic weight). Positive and negative synaptic weights are handled separately in different regions of the medium and combined later in the neuron. The synaptic weight range of [−1, +1] is not sufficient for most neural networks. However, linear scaling can extend the range to [−ω, +ω], where ω > 0 is large enough to encompass the maximum absolute synaptic weight in the network. An ω of 4 to 10 is usually adequate.

There is also the spatial bandwidth of the medium to consider. How many pixels should be used per synapse, and how many grayscale levels should be used? While many media can support 8 to 12 bits of gray per pixel, these tend to have a nonlinear attenuation per gray level. Also, the consistency of the grayscale between individual media is poor. For our prototype, we decided to use 35 mm photographic slides. Slides from commercial slide makers have about 16 megapixels. Binary area encoding of the synapses (each pixel either 0 or 1) will allow us to use linear grayscale calibration. Our initial FWL-NNs consist of about 250 synapses, allowing 64 kilopixels per synapse, or 16 bits of precision. For the prototype array of target neurons, we will use a CCD camera to allow maximum flexibility in detection and integration schemes, albeit at a slower speed than op-amp-based detectors.

IV. TRAINING THE BACKPROPAGATION

A. Scaling

1) Neural Network Weight Scaling: This scaling is necessary to keep the optical signals as near unity as possible, making the best use of the system's dynamic range. As discussed above, the positive and negative weights have been separated into different regions of the slide. The neuron summation function includes a negative sign when summing the negative-region signals. The weights associated with a single neuron are then divided by the largest weight (in absolute value). To compensate for the division, this largest weight then multiplies the output of the neuron's summation function. This is easy in optical systems, since

the detected optical signals must be amplified (scaled) as part of the summation process anyway. This scaling has not yet been implemented on individual fixed-weight neurons. Rather, the largest weight in a layer of neurons has scaled all weights and neuron summation outputs in that layer. Scaling individual neuron weights may be done in the future if the benefits justify the necessary coding.

2) Potency Scaling: As discussed above, scaling of the potencies is necessary because of the limited physical range of optical signals. To properly implement an optical FWL-NN, we must correctly convert the unlimited range of the FWL-NN operations into the limited range of the optical system. Formally putting free-range limits as superscripts, the recurrent, free-range gradient-descent rule of Backpropagation looks like:

  ^(−ω,+ω)P(t) = (1 − ε) · ^(−ω,+ω)P(t−1) − η · ^(−ξ,+ξ)ΔP(t−1),

where ε is small enough that most of the previous potency remains, and η is of a size so that ΔP will have a reasonable impact on P without causing oscillation. Since the P's and ΔP's are optical signals in the range [0,1], we actually have an equation that looks like:

  ^(0,1)P(t) = A · ^(0,1)P(t−1) + B · ^(0,1)ΔP(t−1) + C,

where A, B, and C need to be determined in terms of ε, η, ω, and ξ. A formal way to determine A, B, and C involves augmented-matrix multiplication, which in a single matrix multiply can do scaling, rotation, and translation. Rotation is not needed here. The full matrix results from multiplying five relatively-simple-to-develop augmented matrices together. The first one, which operates on our P and ΔP signals, scales ^(0,1)ΔP to its free-range value ^(−ξ,+ξ)ΔP:

  [ 1   0    0  ]
  [ 0   2ξ  −ξ  ]
  [ 0   0    1  ]

The next one scales ^(0,1)P to its free-range value ^(−ω,+ω)P:

  [ 2ω  0  −ω ]
  [ 0   1   0 ]
  [ 0   0   1 ]

The third matrix,

  [ 1−ε  −η  0 ]
  [ 0     1  0 ]
  [ 0     0  1 ]

performs the recurrent, free-range Backpropagation mentioned above, combining ^(−ω,+ω)P(t−1) and ^(−ξ,+ξ)ΔP(t−1) in a way consistent with scaling. The fourth matrix,

  [ 1/ω  0  0 ]
  [ 0    1  0 ]
  [ 0    0  1 ]

scales ^(−ω,+ω)P(t) into the range [−1,+1]. The fifth matrix,

  [ 1/2  0  1/2 ]
  [ 0    1   0  ]
  [ 0    0   1  ]

puts ^(−1,+1)P(t) into a nominal range of [0,1]. This is followed by a clipping function to absolutely make the range [0,1]. Multiplying these five matrices together and using the product in the vector-matrix multiplication

  [ ^(0,1)P(t)       ]   [ 1−ε   −ηξ/ω   ε/2 + ηξ/(2ω) ]   [ ^(0,1)P(t−1)  ]
  [ ^(−ξ,+ξ)ΔP(t−1) ] = [ 0      2ξ         −ξ         ] × [ ^(0,1)ΔP(t−1) ]
  [ 1               ]   [ 0       0          1         ]   [ 1             ]

gives A = 1 − ε, B = −ηξ/ω, and C = ε/2 + ηξ/(2ω). Some may think all this matrix multiplication unnecessary, but it keeps each individual scaling operation simple and separable. In fact, to keep the last section of our MATLAB code consistent, we separated out the last matrix and included its scaling with the clipping function (together calling them "linsig", a linear analog of logsig) and used the A′, B′, and C′ resulting from multiplying the first four.

B. Backpropagation via Backpropagation

We used Backpropagation to train our sub-networks to do Backpropagation. That is, to obtain the fixed weights for a tranapse, planapse, or combination, we trained these three-layer neural sub-networks using a version of Backpropagation. Care had to be taken, because we were training small networks that had to generalize well.

V. BUILDING A RECURRON

A. Recurral Network Topology

We are only at the beginning stages of building a recurral network. Our first attempt was to build a network that can learn all linearly separable two-input Boolean functions. The set includes AND, OR, and NOR, but excludes XOR and its complement. Table I shows this set of mappings, with XOR and its complement highlighted (shown in red in the original). These functions can be learned by a single neuron with two synaptic weights and one threshold. A single recurron with three potencies should also be able to learn these functions [2].

B. "Pages" Defined

Although neural networks are often visualized as planar, we found this very difficult when trying to duplicate tranapses and planapses. The fixed weights for the various neurons had to be shifted in the arrays in a way that was not consistent from one tranapse/planapse network size to another. We then decided to use a third dimension in our weight matrices and put each tranapse/planapse combination on a separate "page," as shown in Figure 3. This allowed each tranapse/planapse combination to have weights and connections identical in the first two matrix dimensions, the planar pages. This helped partially automate combining the tranapses/planapses into a recurron. The final output neuron

had to have signals coming from multiple pages, and its output has to be fed back to each page.

TABLE I. Boolean Function Truth Table
(columns give the function value for each input pair A,B; XOR and ~XOR are the two functions a single recurron cannot learn)

Function  | 0,0 | 0,1 | 1,0 | 1,1 | Index
TRUE      |  1  |  1  |  1  |  1  | 1.6
NOR       |  1  |  0  |  0  |  0  | 1.5
(A→B)     |  1  |  1  |  0  |  1  | 1.4
~A        |  1  |  1  |  0  |  0  | 1.3
(B→A)     |  1  |  0  |  1  |  1  | 1.2
~B        |  1  |  0  |  1  |  0  | 1.1
~XOR      |  1  |  0  |  0  |  1  | 1.0
NAND      |  1  |  1  |  1  |  0  | 0.9
OR        |  0  |  1  |  1  |  1  | 0.8
XOR       |  0  |  1  |  1  |  0  | 0.7
B         |  0  |  1  |  0  |  1  | 0.6
~(B→A)    |  0  |  1  |  0  |  0  | 0.5
A         |  0  |  0  |  1  |  1  | 0.4
~(A→B)    |  0  |  0  |  1  |  0  | 0.3
AND       |  0  |  0  |  0  |  1  | 0.2
FALSE     |  0  |  0  |  0  |  0  | 0.1

[Figure 3. Page representation of the described recurron: three Tranapses (fed by A(t), B(t), and a Bias) and a Planapse on separate pages feed the output summation Σ, which produces O(t−1); U(t−1) is fed back to each page. A(t) and B(t) represent the Boolean inputs, U(t−1) is the previous target, and a Bias is required; O(t−1) is the previous output.]

VI. SIMULATION AND TESTING

This network was tested on a software-based "medium-grain" simulator written in MATLAB. The simulation included the effects of limited-range neural signals, synaptic weights, and multi-phase synchronous neurons. The recurron was made up of three tranapses, three planapses, and a single neuron. The three inputs were a bias signal plus the two Boolean inputs. A true output signal was also provided.

A. Generating the Test Data

The test dataset was generated by the algorithm:

  repeat M times:
      randomly select a function from Table I (call it Λ)
      repeat N times:
          randomly select A, B ∈ {0,1}
          compute T = Λ(A, B), where T ∈ {0,1}
          output A, B, T, and the identifying index for Λ
      next N
  next M

B. Simulation Results

Figures 4 and 5 show simulation results. For example, at step 1500 of Figure 4, the Boolean function being presented to the recurron was Λ = A ∨ B, as indicated by the function index value (dashed line). As shown by the plot, the squared error of the recurron was initially large, because it had not yet learned the mapping. However, after about 25 steps, the recurron error had fallen to a small value, indicating that learning had taken place. After step 2000, the function was changed; this time, Λ = ~(B→A). The recurron error immediately jumped up and stayed up until the recurron learned the new function (at about step 2050). Note the small burst of errors in Figure 4 around steps 4940-4960. We believe this is due to the finite range of the potency signals and the on-line nature of the learning algorithm. This behavior was previously reported in [4]. Figure 5 shows another simulation run. Note that at step 1200, the recurron took longer than the average time to learn. This is because the test mapping went from function B→A to its complementary function ~(B→A).

[Figure 4. Simulated Recurron Squared Error (M=10, N=500). The dashed line shows the index of the Boolean function being learned (see Table I). Squared error has a maximum value of 1.0, reached when the recurron output equals ~(Target). Since the recurron's output is not digital, the error can range over [0,1].]

[Figure 5. Simulated Recurron Squared Error (M=16, N=200). Same caption as Figure 4.]

Table II shows the overall mean squared error (MSE), the MSE after learning has occurred (we gave the recurron 100 steps to learn), and the percentage of correct recurron predictions after learning. We defined the network's prediction to be correct if (T − O)² < 1/4, and incorrect otherwise.

Table II. Simulation Results

Fig # | M  | N   | overall MSE | MSE after learning | % correct after learning
  4   | 10 | 500 |   0.028     |      0.0073        |         100
  5   | 16 | 200 |   0.046     |      0.0085        |         99.7

VII. CONCLUSION AND FUTURE WORK

The software simulations of our design of an Optical Fixed-Weight Learning Neural Network prototype confirm that such a network is feasible. We can account for the constraints imposed on the network by the physics of the optical neural system. The resulting Optical FWL-NN should be able to learn at a speed substantially faster than networks based on von Neumann hardware. It should also generalize as well as Backpropagation does. Besides construction and evaluation of the physical prototype, future work may involve reducing the learning-algorithm overhead costs and incorporating holographic and nonlinear optical components [10].

ACKNOWLEDGEMENTS

This work was supported in part by Data Fusion Corporation of Northglenn, Colorado, through a grant from the US Army Space and Missile Defense Command, DASG6002-P-0192. This material is based upon work supported by the National Science Foundation under Grant No. 0446660.

REFERENCES

[1] Paul E. Keller and Arthur F. Gmitro, "Operational Parameters of an Opto-Electronic Neural Network Employing Fixed-Planar Holographic Interconnects," World Congress on Neural Networks 1993 (WCNN'93), 1993.
[2] N. E. Cotter and P. R. Conwell, "Learning algorithms and fixed dynamics," Proceedings of the International Conference on Neural Networks 1991, vol. I, pp. 799-804, IEEE, 1991.
[3] Lee A. Feldkamp, Danil V. Prokhorov, and Timothy Feldkamp, "Conditioned Adaptive Behavior from Kalman Filter Trained Recurrent Networks," IJCNN'03, IEEE, 2003.
[4] A. Steven Younger, P. R. Conwell, and N. E. Cotter, "Fixed-Weight On-Line Learning," IEEE Transactions on Neural Networks, vol. 10, no. 2, pp. 272-283, March 1999.
[5] Danil V. Prokhorov, Lee A. Feldkamp, and Ivan Yu. Tyukin, "Adaptive Behavior with Fixed Weights in RNN: An Overview," IJCNN'02, IEEE, 2002.
[6] James T. Lo and Devasis Bassu, "Adaptive vs. Accommodative Neural Networks for Adaptive System Identification," IJCNN'01, IEEE, 2001.
[7] Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell, "Learning To Learn Using Gradient Descent," Proceedings of the International Conference on Artificial Neural Networks, Springer-Verlag, 2001.
[8] E. Fiesler, "Neural Network Classification and Formalization," Computer Standards & Interfaces, vol. 16, no. 3, pp. 231-239, 1994.
[9] Paul Werbos, private communication, 2004.
[10] Yaser S. Abu-Mostafa and Demetri Psaltis, "Optical Neural Computers," Scientific American, March 1987.
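As a cross-check of the potency-scaling algebra of Section IV-A, the five augmented matrices can be multiplied numerically and compared against the closed forms A = 1 − ε, B = −ηξ/ω, and C = ε/2 + ηξ/(2ω). This sketch is ours (not the authors' MATLAB code), and the particular ε, η, ω, ξ values are arbitrary; exact rational arithmetic makes the comparison free of round-off.

```python
from fractions import Fraction as F

def matmul(a, b):
    # Multiply two 3x3 matrices of exact rationals.
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def potency_update_matrix(eps, eta, omega, xi):
    # The five augmented matrices of Section IV-A.2, applied in order
    # (so the product is m5 * m4 * m3 * m2 * m1).
    m1 = [[1, 0, 0], [0, 2 * xi, -xi], [0, 0, 1]]        # dP: [0,1] -> [-xi,+xi]
    m2 = [[2 * omega, 0, -omega], [0, 1, 0], [0, 0, 1]]  # P:  [0,1] -> [-w,+w]
    m3 = [[1 - eps, -eta, 0], [0, 1, 0], [0, 0, 1]]      # free-range update
    m4 = [[F(1) / omega, 0, 0], [0, 1, 0], [0, 0, 1]]    # P(t) -> [-1,+1]
    m5 = [[F(1, 2), 0, F(1, 2)], [0, 1, 0], [0, 0, 1]]   # [-1,+1] -> [0,1]
    m = m1
    for step in (m2, m3, m4, m5):
        m = matmul(step, m)
    return m

# arbitrary example parameters
eps, eta, omega, xi = F(1, 100), F(1, 2), F(4), F(1)
A, B, C = potency_update_matrix(eps, eta, omega, xi)[0]
```

The first row of the product recovers the paper's coefficients exactly, confirming that the chained scalings reproduce the recurrent potency-update rule in [0,1] signal space.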