TRANSACTIONS ON NEURAL NETWORKS, VOL. XXX, NO. XXX, JANUARY 2008


The Correspondence Between Deterministic And Stochastic Digital Neurons: Analysis and Methodology

Luca Geretti and Antonio Abramo

Abstract—The paper analyzes the criteria for a direct correspondence between a deterministic neural network and its stochastic counterpart, and presents the guidelines derived to establish such a correspondence during the design of a neural network application. In particular, the roles of the slope and bias of the neuron activation function, and of the noise affecting its output, are addressed, thus filling a specific gap in the literature. The paper presents the results that have been theoretically derived in this regard, together with simulations of a few relevant application examples performed to support them.

I. INTRODUCTION

Digital stochastic neural networks (SNN) have recently gained attention due to the simpler hardware structure they present compared to that of conventional deterministic neural networks (NN) [1]–[4]. In fact, lower resource allocation can be obtained when implementing an SNN on FPGAs, where the relatively scarce availability of predefined modules can limit the complexity of possible applications [1], [5]–[8]. This advantage can be mainly ascribed to the compactness of the produced hardware, both in terms of neurons and synapses: the neuron architecture can be kept particularly simple, thus shifting complexity issues to the network level only. A desirable situation would be to look at SNNs as a possible design option for a generic (deterministic) neural network, to be chosen freely or when occupation constraints make it mandatory. In this regard, the most trivial choice would be to substitute all deterministic neurons with their stochastic counterparts, keeping the network topology untouched. This option implicitly assumes that the NN operation is preserved by the mapping. Unfortunately, this assumption is false: neuron non-idealities and peculiarities may, for example, prevent the SNN from reaching the output values that would be needed for proper supervised learning. As a consequence, additional neurons may be required to overcome the problem [1], thus implying the realization of a network different from the original one. In other words, once a NN application is designed (e.g. in Matlab™), the option for its stochastic implementation often

The authors are with the DIEGM - University of Udine, Italy.

July 22, 2017

DRAFT

requires the re-design of the very same application, where signal levels, precision and noise have to be controlled once again. Although the present literature is rich in characterizations of stochastic neurons (see [5], [9], [10] for example), to the best of our knowledge no set of comprehensive theoretical or practical guidelines able to state the equivalence between deterministic and stochastic neuron implementations has been presented yet. The purpose of this paper is to address the issue, presenting the critical aspects that must be examined in order to quantitatively guarantee this equivalence for a generic neuron. More specifically, the stochastic neuron non-idealities that have been taken into consideration, and that will be clarified in the remaining Sections, are the following:
1) The presence of an output bias which depends on the number of the neuron inputs;
2) The dependence of the neuron output on the distribution of the activation potential, and not only on its expectation value;
3) The dependence of the activation function slope on the number of neuron inputs;
4) The bound of the synaptic weight values to the [0, 1] interval originating from their probabilistic nature, which limits the range of the activation potential excursion;
5) The role and control of the stochastic noise, needed to maintain a desired precision on the numerical output values.

II. THE STOCHASTIC NEURON

This Section introduces the main notation that will be used throughout the paper, and presents the stochastic neuron model.

A. Notation

Without loss of generality, we focused our investigation on the classical perceptron. As is known, a perceptron implements the following activation (or output) function:

\[ y = \Phi\!\left( \sum_{n=1}^{N_I} w_n x_n - \theta \right) \]    (1)

where x_n is the n-th input among the N_I of the neuron, w_n its n-th synaptic weight, and θ the neuron's threshold. In general, the threshold θ may be considered an inherent property of the neuron. However, as commonly adopted, we chose to treat it as a weight associated to an additional, fictitious input x_{N_I+1}, always active. Consequently, we can redefine w_{N_I+1} ≜ −θ and simplify the notation as follows:

\[ y = \Phi\!\left( \sum_{n=1}^{N} w_n x_n \right) = \Phi\!\left( \sum_{n=1}^{N} s_n \right) = \Phi\big(A(\mathbf{s})\big) \]    (2)

where N = N_I + 1 is the total number of synaptic inputs, s is the vector of the synaptic products s_n, defined in the R^N space, and A(s) is the activation potential. In the remainder of this paper Φ will be treated as a function within the synaptic products space, i.e. Φ(s). Given a generic stochastic bitstream x̂ of length L, we define as x̂^(i) its i-th bit, i = 1, 2, . . . , L, and as I = {i : i = 1, . . . , L} the set of the indices of the subsequent neuron output evaluations. In a stochastic neuron each quantity is defined as the expectation value over I. Thus, if x̂ is a stochastic bitstream it holds that

\[ x = \mathrm{E}[\hat{x}] = \frac{1}{L}\sum_{i=1}^{L} \hat{x}^{(i)}, \qquad x \in [0,1]. \]

Nevertheless, we adopted an equivalent bipolar value representation in the interval W = [−1, 1], obtained by defining x = −1 + 2 E[x̂], whose only difference, from the implementation viewpoint, is the need for an XNOR gate instead of an AND gate to perform the synaptic multiplication [6], [11], [12].

B. The Neuron Output Function

In this Subsection we describe the commonly adopted model for the stochastic neuron output function, which will provide a basis for the next Subsections. Defining as W^N the synaptic product space of a neuron with N inputs, the locus of points featuring identical A(s) values is a hyperplane in W^N (see also [13], [14]). Ideally, the locus A(s) = 0 divides the space into two disjoint sets. The neuron's output function Φ must provide opposite values for the two sets, and must move monotonically from one value to the other, as steeply as possible. This implies that Φ must be an odd function with saturating behavior, such as the erf or the bipolar sigmoid functions. In a stochastic neuron, for each evaluation step i we can write the instantaneous activation potential A* as:

\[ A^{*}(\hat{\mathbf{s}}^{(i)}) \triangleq \sum_{n=1}^{N} \left( -1 + 2\,\hat{s}_n^{(i)} \right) = -N + 2 \sum_{n=1}^{N} \hat{s}_n^{(i)} \]    (3)

i.e. the bipolar sum of all synaptic products. Consequently, A*(ŝ^(i)) ∈ {−N + 2n : n = 0, 1, . . . , N}, and it results that A(s) = E[A*(ŝ)].

The stochastic neurons presented in the specific literature commonly implement the following function of A*(ŝ^(i)) [1], [6], [11]:

\[ \hat{y}(\hat{\mathbf{s}}^{(i)}) = \hat{\Phi}\big(A^{*}(\hat{\mathbf{s}}^{(i)})\big) = \begin{cases} 1 & \text{if } A^{*}(\hat{\mathbf{s}}^{(i)}) > 0 \\ 0 & \text{if } A^{*}(\hat{\mathbf{s}}^{(i)}) \le 0 \end{cases} \]    (4)

where Φ̂ is the instantaneous output function that, being a step function, is discontinuous. It is the non-impulsive distribution of the synaptic products that returns an expected output function Φ = −1 + 2 E[Φ̂] of class C¹. Rewriting Eq. (4) within the domain of the expected value and converting the values into the W representation, we obtain:

\[ y(\mathbf{s}) = -1 + 2\,\mathrm{E}[\hat{y}(\hat{\mathbf{s}})] = -1 + 2\,\mathrm{P}[A^{*}(\hat{\mathbf{s}}) > 0] = -1 + 2 \sum_{j=1}^{N} f_{A^{*}}(\mathbf{s}, j) \]    (5)

where f_{A*}(s, j) = P[A*(ŝ) = j] is the discrete distribution of the activation potential, which is a function of s. Eq. (5) thus represents the discrete integration of the activation potential distribution. For large N this function approaches an erf function (see Subsec. II-D).

C. The Output Function Steepness

In order to improve the input separation ability of the neuron, i.e. to control the steepness of the activation function Φ(A) around A = 0, we exploited the same technique presented in [5], which we will refer to as the Reset Delay Technique (RDT).

According to the RDT, the value of A*(ŝ^(i)) is not reset to zero after each neuron evaluation, but is accumulated for a number of evaluations M > 1, which we named the reset periodicity, where m ∈ M = {1, 2, . . . , M} represents the number of steps following the latest reset event, otherwise defined as the reset step. The expression

\[ m(i) = \mathrm{rem}(i-1, M) + 1 \]    (6)

provides the relation between an evaluation step i and its reset step m. An increase in M causes an increase in dΦ(A)/dA, the steepness of the output function [5]. However, in [5] the value of A* was constrained to the [−N, N] interval, which caused the steepness to saturate to a specific value for M → ∞. Differently, we want to avoid the steepness saturation by allowing A* to reach any value. Therefore, in the remainder of this Subsection we will extend the theoretical analysis of [5] to assess the properties of the modified output function and of its derivative. To this purpose, we define the instantaneous accumulated activation potential A*_A(ŝ^(i)) as:

\[ A^{*}_{A}(\hat{\mathbf{s}}^{(i)}) = \sum_{j=1}^{m(i)-1} A^{*}(\hat{\mathbf{s}}^{(i-j)}) \]    (7)

that is to say, the sum of all activation potentials evaluated after the most recent reset event. Consequently, Eq. (4) can be rewritten as:

\[ \hat{y}(\hat{\mathbf{s}}^{(i)}) = \begin{cases} 1 & \text{if } A^{*}_{A}(\hat{\mathbf{s}}^{(i)}) + A^{*}(\hat{\mathbf{s}}^{(i)}) > 0 \\ 0 & \text{if } A^{*}_{A}(\hat{\mathbf{s}}^{(i)}) + A^{*}(\hat{\mathbf{s}}^{(i)}) \le 0 \end{cases} \]    (8)
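A compact simulation of the reset-delay mechanism of Eqs. (6)–(8) (our sketch, assuming equiprobable synaptic inputs and arbitrary parameter values) shows how increasing M sharpens the output around A = 0:

```python
import random

random.seed(2)

def rdt_output(p, N, M, L):
    """Bipolar output of an RDT neuron (Eqs. (6)-(8)) with N i.i.d. synaptic
    bitstreams of probability p, reset periodicity M, bitstream length L."""
    acc = 0          # accumulated potential A*_A, cleared at every reset step
    ones = 0
    for i in range(L):
        a = -N + 2 * sum(1 if random.random() < p else 0 for _ in range(N))
        acc += a                      # effective potential A*_A + A*
        ones += 1 if acc > 0 else 0   # Eq. (8)
        if (i + 1) % M == 0:          # reset step, cf. Eq. (6)
            acc = 0
    return -1 + 2 * ones / L

# A small positive potential A = N(-1 + 2p) = 0.4: a larger M pushes the
# output much closer to saturation, i.e. the slope around A = 0 increases.
# (The M = 1, even-N output also carries the bias analyzed in Sec. III-A.)
y_m1 = rdt_output(0.55, 4, 1, 60000)
y_m32 = rdt_output(0.55, 4, 32, 60000)
print(round(y_m1, 3), round(y_m32, 3))
```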

In order to obtain the equivalent of Eq. (5), a brief explanation is required. Eq. (6) uniformly remaps the set I onto M. Thus, in a bitstream of length L a particular m manifests L/M times, where for simplicity, but with no loss of generality, we assume L/M integer. Then, we define as I_m = {i : i = 1, . . . , L | m(i) = m} the sets of the evaluation step indices mapped onto a specific value of m, satisfying the requirements:

\[ \mathcal{I} = \bigcup_{m=1}^{M} \mathcal{I}_m \]    (9)

\[ \mathcal{I}_{m_1} \cap \mathcal{I}_{m_2} = \emptyset \quad \forall\, m_1 \neq m_2 \]    (10)

and as ŝ_m the substream of ŝ having the evaluation step indices i ∈ I_m, m = 1, 2, . . . , M, where s_m = s for all m. Defining A*_m = (A*_A + A*)|_m as the effective activation potential on step m, from Eq. (8) each substream ŝ_m will return a different expected potential value A_m(s_m), hence a different expected output y_m(s_m). Since all substreams feature the same length, we can transform the expectation over I into an average of the expectations over each I_m:

\[ y(\mathbf{s}) = -1 + 2\,\mathrm{E}[\hat{y}(\hat{\mathbf{s}})] = -1 + \frac{2}{M}\sum_{m=1}^{M} \mathrm{E}[\hat{y}(\hat{\mathbf{s}}_m)] = \frac{1}{M}\sum_{m=1}^{M} y_m(\mathbf{s}_m) = \frac{1}{M}\sum_{m=1}^{M} y_m(\mathbf{s}) \]    (11)

where we have defined y_m(s_m) = −1 + 2 E[ŷ(ŝ_m)] as the expected output value of subset I_m. Being N_m = m N the number of contributions accumulated in A*_m, similarly to Eq. (5) we can rewrite Eq. (11) as:

\[ y(\mathbf{s}) = -1 + \frac{2}{M}\sum_{m=1}^{M} \mathrm{P}[A^{*}_{m}(\hat{\mathbf{s}}_m) > 0] = -1 + \frac{2}{M}\sum_{m=1}^{M}\sum_{j=1}^{N_m} f_{A^{*}_{m}}(\mathbf{s}, j). \]    (12)

The discussion above implies that we can interpret the effect of a delayed reset as an amplification, by a factor m = A_m/A, of the m-th expected activation potential value, which results in a different potential distribution, hence in different y_m output values, the latter being averaged over M to return the correct output y. If we limit the space of the values of s to the diagonal of W^N, that is to say, if we introduce the constraint s_1 = s_2 = . . . = s_N, and if we define p = P[ŝ_n = 1], the distribution of the accumulated potential becomes binomial. Thus we can rewrite Eq. (12) as follows:

\[ y(p) = -1 + \frac{2}{M}\sum_{m=1}^{M}\;\sum_{k=0}^{\lceil N_m/2 \rceil - 1} \binom{N_m}{k}\, p^{N_m-k}\,(1-p)^{k} \]    (13)

which can be differentiated with respect to A = N(−1 + 2p), obtaining:

\[ \frac{dy}{dA} = \frac{1}{N M}\sum_{m=1}^{M}\;\sum_{k=0}^{\lceil N_m/2 \rceil - 1} \binom{N_m}{k}\,(N_m - k - N_m p)\, p^{N_m-k-1}\,(1-p)^{k-1} \]    (14)
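Eqs. (13)–(14) can be transcribed directly; a sketch of ours that also checks the analytic derivative against a finite difference (parameter values are arbitrary):

```python
from math import comb, ceil

def y_binomial(p, N, M):
    """Output along the diagonal of W^N, Eq. (13)."""
    total = 0.0
    for m in range(1, M + 1):
        Nm = m * N
        total += sum(comb(Nm, k) * p**(Nm - k) * (1 - p)**k
                     for k in range(ceil(Nm / 2)))
    return -1 + 2 * total / M

def dy_dA(p, N, M):
    """Derivative with respect to A = N(-1 + 2p), Eq. (14)."""
    total = 0.0
    for m in range(1, M + 1):
        Nm = m * N
        total += sum(comb(Nm, k) * (Nm - k - Nm * p)
                     * p**(Nm - k - 1) * (1 - p)**(k - 1)
                     for k in range(ceil(Nm / 2)))
    return total / (N * M)

N, M, p = 3, 4, 0.6
eps = 1e-6                # finite difference in p; note dA = 2 N dp
numeric = (y_binomial(p + eps, N, M) - y_binomial(p - eps, N, M)) / (4 * N * eps)
print(round(dy_dA(p, N, M), 6), round(numeric, 6))
```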

whose real-time hardware calculation is clearly heavier than that of [15].

It must be noticed that the above constraint on s could be considered rather limiting. However, Eqs. (13),(14) will be adopted in the following either for s = 0, where the limitation does not exist, or just to compare their behavior with the exact one, as obtained from Eq. (12).

D. Template Deterministic Output Functions

Since the typical outputs of a deterministic neuron are the erf or the bipolar sigmoid functions, we recall here their expressions and derivatives. These results will be useful for the comparison between deterministic and stochastic output functions presented in Sec. III. Given an evaluation step m ∈ [1, M], if N_m = mN ≫ 1 and all synaptic inputs are equiprobable, by force of the Central Limit Theorem the activation potential distribution f_{A*_m} approaches a Gaussian. Such a normal distribution features µ_m = A_m = mA and σ_m = σ̃√(mN), where σ̃ is the standard deviation of a single activation potential contribution. Assuming for simplicity to be in a continuous domain, the output function then reads:

\[ \Phi_E(A) = -1 + \frac{2}{M}\sum_{m=1}^{M} \mathrm{P}[A^{*}_{m}(\hat{\mathbf{s}}_m) > 0] = -1 + \frac{2}{M}\sum_{m=1}^{M} \int_{0}^{\infty} f_{A^{*}_{m}}(mA, x)\, dx \]    (15)

where, compared to Eq. (12), we can notice that the dependence of f_{A*_m} on s is replaced by the dependence on A_m. It can be easily seen that, after a trivial manipulation of the integral, we can write:

\[ \Phi_E(A) = -1 + \frac{2}{M}\sum_{m=1}^{M}\left( \frac{1}{2} + \int_{0}^{mA} f_{A^{*}_{m}}(0, x)\, dx \right) = \frac{2}{M}\sum_{m=1}^{M} \frac{1}{\sqrt{\pi}} \int_{0}^{mA/(\sqrt{2}\,\sigma_m)} e^{-t^2}\, dt \]    (16)

where the erf behavior of the output function is apparent. Additionally, we can rewrite Eq. (16) as:

\[ \Phi_E(A) = \frac{1}{M}\sum_{m=1}^{M} \Phi_{E_m}(A) = \frac{1}{M}\sum_{m=1}^{M} \operatorname{erf}\!\left( \sqrt{\frac{m}{2 N \tilde{\sigma}^2}}\, A \right) = \frac{1}{M}\sum_{m=1}^{M} \operatorname{erf}\!\left( \frac{\sqrt{\pi}}{2}\, K_{E_m} A \right) \]    (17)

where we introduced K_{E_m} = √(2m/(π N σ̃²)). The corresponding output derivative is:

\[ \frac{d\Phi_E(A)}{dA} = \frac{1}{M}\sum_{m=1}^{M} \frac{d\Phi_{E_m}}{dA} = \frac{1}{M}\sum_{m=1}^{M} K_{E_m}\, e^{-\frac{\pi}{4} K_{E_m}^2 A^2} \]    (18)

which shows that K_{E_m} can be interpreted as the derivative of the m-th erf function computed in A = 0. Consequently, we can define the parameter K_E ≜ (Σ_{m=1}^{M} K_{E_m})/M as the output function derivative for A = 0.

Eqs. (17),(18), obtained for N_m ≫ 1, are approximate expressions which require a rather intensive computation. For this reason we may introduce a bipolar sigmoid approximation:

\[ \Phi_S(A) \triangleq -1 + \frac{2}{1 + e^{-2 K_S A}} \]    (19)

whose derivative is, instead, quite simple:

\[ \frac{d\Phi_S(A)}{dA} = K_S \left( 1 - \Phi_S^2(A) \right). \]    (20)
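A small sketch of ours of the two template functions, checking the derivative identity of Eq. (20) numerically (the chosen K and A values are arbitrary):

```python
from math import erf, exp, pi, sqrt

def phi_sigmoid(A, K):
    """Bipolar sigmoid with slope K in A = 0, Eq. (19)."""
    return -1 + 2 / (1 + exp(-2 * K * A))

def dphi_sigmoid(A, K):
    """Closed-form derivative, Eq. (20)."""
    return K * (1 - phi_sigmoid(A, K) ** 2)

def phi_erf(A, K):
    """erf template with the same slope K in A = 0, cf. Eq. (17) with M = 1."""
    return erf(sqrt(pi) / 2 * K * A)

K, A, eps = 1.5, 0.4, 1e-6
numeric = (phi_sigmoid(A + eps, K) - phi_sigmoid(A - eps, K)) / (2 * eps)
print(round(dphi_sigmoid(A, K), 6), round(numeric, 6))
# Both templates share the same derivative K in A = 0:
slope_erf = (phi_erf(eps, K) - phi_erf(-eps, K)) / (2 * eps)
print(round(slope_erf, 4))
```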

Similarly to the erf case, we purposely introduced here the parameter K_S to explicitly refer to the bipolar sigmoid derivative in A = 0. Both previous expressions will be used to analyze the equivalence between a stochastic neuron and its deterministic counterpart.

III. ADDRESSING THE OPEN ISSUES

This Section addresses the open issues presented at the end of Sec. I, which will be analyzed to lay the foundation for the deterministic-to-stochastic equivalence.

A. Controlling The Neuron Bias

As previously mentioned, Eq. (8) shows the instantaneous relationship between the neuron output and the evaluation of its N synaptic products. It must be noticed that the activation function Φ̂ is not properly defined for A*_A(ŝ^(i)) + A*(ŝ^(i)) = 0. In fact, in presence of an even number of synaptic products, N_m, the number of possible values of the activation potential that return ŷ^(i) = 0 exceeds by one the number of values returning ŷ^(i) = 1. This is due to the equal-sign attribution of Eq. (8), which favors the zero output¹. This results in a bias that is superimposed on the neuron output.

To cancel it, we can restate Eq. (8) as:

\[ \hat{y}^{(i)}(\hat{\mathbf{s}}^{(i)}) = \begin{cases} 1 & \text{if } A^{*}_{A}(\hat{\mathbf{s}}^{(i)}) + A^{*}(\hat{\mathbf{s}}^{(i)}) > 0 \\ 1/2 & \text{if } A^{*}_{A}(\hat{\mathbf{s}}^{(i)}) + A^{*}(\hat{\mathbf{s}}^{(i)}) = 0 \\ 0 & \text{if } A^{*}_{A}(\hat{\mathbf{s}}^{(i)}) + A^{*}(\hat{\mathbf{s}}^{(i)}) < 0. \end{cases} \]    (21)

Since 1/2 is not a valid binary value, Eq. (21) must be interpreted in probabilistic terms: in presence of zero activation potential values, the output bit must be generated as a random variable having an expected value P = 1/2. Therefore, the unbiased version of Eq. (12) is:

\[ y(\mathbf{s}) = -1 + \frac{2}{M}\sum_{m=1}^{M}\left( \frac{1}{2}\,\mathrm{P}[A^{*}_{m}(\hat{\mathbf{s}}_m) = 0] + \sum_{j=1}^{N_m} \mathrm{P}[A^{*}_{m}(\hat{\mathbf{s}}_m) = j] \right) = -1 + \frac{1}{M}\sum_{m=1}^{M} \mathrm{P}[A^{*}_{m}(\hat{\mathbf{s}}_m) = 0] + \frac{2}{M}\sum_{m=1}^{M}\sum_{j=1}^{N_m} f_{A^{*}_{m}}(\mathbf{s}, j) \]    (22)

¹ Obviously, the dual situation would happen if the equal sign were attributed to the ŷ^(i) = 1 case.

Consequently, we can define:

\[ \beta(\mathbf{s}) \triangleq \frac{1}{M}\sum_{m=1}^{M} \beta_m(\mathbf{s}) = -\frac{1}{M}\sum_{m=1}^{M} \mathrm{P}[A^{*}_{m}(\hat{\mathbf{s}}_m) = 0] \]    (23)

as the s-dependent output bias that has been compensated by adopting the definitions of Eq. (21). Looking at the definition of the bias, it can be noticed that only the case M = 1 with N odd necessarily shows β(s) = β_1(s) = 0; in all other cases an even N_m = m N exists, hence at least one β_m(s) is not zero. Furthermore, in the case of equiprobable synaptic inputs the unbiased expression corresponding to Eq. (13) becomes:

\[ y(p) = -1 + \frac{2}{M}\sum_{m=1}^{M}\left( -\frac{\beta_m(p)}{2} + \sum_{k=0}^{\lceil N_m/2 \rceil - 1} \binom{N_m}{k}\, p^{N_m-k}\,(1-p)^{k} \right) \]    (24)

where the bias can be analytically calculated as:

\[ \beta_m(p) = \begin{cases} -\binom{N_m}{N_m/2} \left[\, p\,(1-p) \right]^{N_m/2} & \text{if } N_m \text{ even} \\ 0 & \text{if } N_m \text{ odd.} \end{cases} \]    (25)

Finally, the unbiased derivative expression corresponding to Eq. (14) is:

\[ \frac{dy}{dA}(p) = \frac{1}{N M}\sum_{m=1}^{M}\left( -\frac{\beta'_m(p)}{2} + \sum_{k=0}^{\lceil N_m/2 \rceil - 1} \binom{N_m}{k}\,(N_m - k - N_m p)\, p^{N_m-k-1}\,(1-p)^{k-1} \right) \]    (26)

where

\[ \beta'_m(p) = \begin{cases} -\binom{N_m}{N_m/2} \frac{N_m}{2}\,(1 - 2p) \left[\, p\,(1-p) \right]^{N_m/2 - 1} & \text{if } N_m \text{ even} \\ 0 & \text{if } N_m \text{ odd.} \end{cases} \]    (27)
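The bias correction of Eqs. (24)–(25) can be checked against a direct evaluation of the unbiased probability (a sketch of ours, restricted to M = 1 for brevity): with the tie broken at random, y = −1 + 2(P[A* > 0] + ½ P[A* = 0]).

```python
from math import comb, ceil

def beta_m(p, Nm):
    """Output bias of Eq. (25): nonzero only for an even number of products."""
    if Nm % 2:
        return 0.0
    return -comb(Nm, Nm // 2) * (p * (1 - p)) ** (Nm // 2)

def y_unbiased(p, N, M):
    """Unbiased diagonal output, Eq. (24)."""
    total = 0.0
    for m in range(1, M + 1):
        Nm = m * N
        total += -beta_m(p, Nm) / 2 + sum(
            comb(Nm, k) * p**(Nm - k) * (1 - p)**k for k in range(ceil(Nm / 2)))
    return -1 + 2 * total / M

def y_direct(p, N):
    """Direct unbiased output for M = 1: -1 + 2(P[A* > 0] + P[A* = 0]/2)."""
    probs = [comb(N, j) * p**j * (1 - p)**(N - j) for j in range(N + 1)]
    p_pos = sum(pr for j, pr in enumerate(probs) if 2 * j - N > 0)
    p_zero = sum(pr for j, pr in enumerate(probs) if 2 * j - N == 0)
    return -1 + 2 * (p_pos + p_zero / 2)

p, N = 0.65, 4
print(round(y_unbiased(p, N, 1), 6), round(y_direct(p, N), 6))
print(y_unbiased(0.5, N, 1))   # unbiased output vanishes at A = 0
```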

To show the impact of the bias on the neuron output behavior, we simulated a single stochastic neuron both in the biased and unbiased conditions. The resulting activation functions are shown in Fig. 1(a) for the case of a biased neuron (see Eq. (12)) with N = 2, M = 1, and in Fig. 2(a) for the case of its unbiased counterpart (see Eq. (22)). As can be seen, the presence of a bias shows up through the distortion of the output isolines (as in Fig. 1(a)), which instead should be linear as a function of the corresponding activation potential isolines (as in Fig. 2(a)). Fig. 1(b) shows the neuron bias, computed as the difference between the biased and unbiased outputs. To quantify the relevance of its presence, we analyzed the conditions under which it attains a maximum absolute value². Looking at Eq. (23) it can be observed that in the case of N even such a maximum occurs for those values of s that minimize β_m(s) = −P[A*_m(ŝ_m) = 0] for all m. This is equivalent to saying that A(s) must be zero, while the variance of the instantaneous activation potential, σ²_{A*}(ŝ), is minimum (see Eq. (30) and its derivation in Subsec. III-B for an explanation of the activation potential variance behavior). The first condition imposes that the maximum absolute value belongs to the negative diagonal of Fig. 1(b) (hyperplane, in the multidimensional case); the second condition, instead, forces the maximum at the extremes of such a diagonal, where σ²_{A*}(ŝ) = 0 if the number of inputs is even. As a consequence, the bias value at these extremes is forced to be β^{max,even}(s) = −1 for all M, since

² The bias is always negative, as shown in Eq. (23).

P[A*_m(ŝ_m) = 0] = 1 for all m. If N is odd, instead, the bias absolute value is maximized if (N − 1)/2 bits of the input vector have value +1, additional (N − 1)/2 bits have value −1, and the last remaining one has value 0. In this way, the N − 1 non-null bits sum to zero for the activation potential, while it is the remaining bit that determines the output behavior, as it would be in the case of a single-input neuron. Hence, in this case σ²_{A*}(ŝ) = 1 (see Eq. (30) once again). To compute the bias value we can resort to Eq. (25). Being N_m ≡ m and p = 1/2, we obtain:

\[ \beta_m^{max,odd} = \begin{cases} -\binom{m}{m/2}\, 2^{-m} & \text{if } m \text{ even} \\ 0 & \text{if } m \text{ odd} \end{cases} \]    (28)

which describes a bias that approaches zero for increasing m, hence M, values. In conclusion, the presence of the neuron bias can be quite relevant, especially if one considers that its maximum absolute value can be as high as 1, i.e. half the output excursion. On the other hand, from the implementation viewpoint a bias compensation can be easily obtained by detecting the A* + A*_A = 0 condition (see Eq. (21)): each time this is met, the neuron replaces the original output with a bit taken from a random sequence featuring p = 0.5.

B. The Distribution of the Activation Potential

As can be understood from Eq. (22), the actual output function depends on the distribution of the activation potential, not only on its expected value, as would be desirable according to Eq. (2). In addition, the approximation of Eq. (16) shows the dependence of the output function on both the expectation and variance of the activation potential distribution. In other words, if A* has a non-uniform distribution on an equipotential hyperplane, the output varies accordingly. Unfortunately, this is the case in all situations: any random binary stream shows a distribution which depends on the value it represents. In fact, since for any random variable v̂ the variance of its instances can be computed as σ²_{v̂} = E[v̂²] − E[v̂]², for a binary bitstream x̂ with expectation µ we can write that σ²_{x̂} = µ − µ², which in our representation, namely x = −1 + 2µ, returns:

\[ \sigma_{\hat{x}}^{2} = \frac{1}{4}\,(1 - x^{2}) \]    (29)

Such a relation, parabolic in the case of a single synaptic product bitstream ŝ_n, forces the variance of the activation potential to be:

\[ \sigma_{A^{*}}^{2} = 4 \sum_{n=1}^{N} \sigma_{\hat{s}_n}^{2} = N - \sum_{n=1}^{N} s_n^{2} \]    (30)

that is, a paraboloid in W^N having its maximum value at s = 0⃗. The locus of points having fixed variance k is, thus, a circumference in W^N with radius |s| = √(N − k). As a consequence, under the assumption of a Gaussian distribution of A* we can state that ∀(s_α, s_β) ∈ W^N, s_α ≠ s_β, with A(s_α) = A(s_β), it is f_{A*}(s_α) = f_{A*}(s_β) iff |s_α| = |s_β|. This means that equipotential points in W^N provide the same output if and only if they are equidistant from the center of W^N.
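The value-dependent variance of Eqs. (29)–(30) is easy to verify empirically; a sketch of ours, with arbitrary test values:

```python
import random

random.seed(3)

def bit_variance(x, L):
    """Empirical variance of the bits of a stream carrying bipolar value x."""
    p = (1 + x) / 2                      # P[bit = 1] for bipolar value x
    bits = [1 if random.random() < p else 0 for _ in range(L)]
    mu = sum(bits) / L
    return sum((b - mu) ** 2 for b in bits) / L

L = 200000
results = {x: bit_variance(x, L) for x in (0.0, 0.5, 0.9)}
for x, measured in results.items():
    print(x, round(measured, 4), round((1 - x * x) / 4, 4))   # vs Eq. (29)

# Eq. (30): the activation potential variance is the sum of the (scaled)
# bitstream variances, and is maximal at the center s = 0 of W^N.
s = [0.0, 0.0]
var_A = len(s) - sum(v * v for v in s)   # N - sum(s_n^2) = 2 at the center
print(var_A)
```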

The resulting effect is an unavoidable deformation of the output function³, shown in Fig. 2(b) for the case N = 2 and M = 32. In fact, although equipotential (iso-A) hyperplanes have equation ds_2/ds_1 = −1, i.e. feature a linear behavior, the corresponding output isolines show the signature of the parabolic relationship of Eq. (30). The impact of this non-ideality on the behavior of the neuron is discussed in more detail in the next Section, where a deterministic function is associated to its stochastic neuron counterpart.

C. Controlling the Slope of the Neuron Output Function

As is known, the separation ability of a neuron depends on the maximum steepness of its output function slope (OFS). As already mentioned, in our case the slope of the activation function depends on the number of synaptic products, N, and on the reset periodicity, M. Furthermore, it is trivial to notice that the maximum of the OFS always lies at A = 0 (see Fig. 2(b)). However, due to the deformation effect mentioned above, the OFS on the A = 0 hyperplane is not constant, and possesses a minimum at s = 0⃗, the central point of W^N, which corresponds to the maximum of the variance of A* (see also the definition of K_E in Subsec. II-D). Defining the separation ability of the neuron through Eq. (26), i.e. as its derivative at s = 0⃗, after some manipulations we obtain:

\[ K(N, M) \triangleq \left. \frac{dy}{dA} \right|_{p=0.5}\!(N, M) = \frac{2}{N M}\sum_{m=1}^{M} \left\lceil \frac{N_m}{2} \right\rceil \prod_{n=1}^{\lceil N_m/2 \rceil} \frac{2n - 1}{2n} \approx \sqrt{\frac{8M}{9\pi N}} \]    (31)

where the final approximation, obtained using Eq. (18), holds for high values of M. Following the analysis of the previous Subsection, K can be considered a worst-case approximation of the derivative in A = 0. The behavior of K with respect to N and M is shown in Fig. 3, where both the exact values of K (continuous lines) and the approximated ones (dotted lines) were computed for N = {2, 4, 8, 16, 32} and M = {1, 2, . . . , 4096}. Looking at Fig. 2(a), it can be noticed that for N = 2 (which is the minimum number of inputs, considering the threshold) the separation ability is relatively low. Thus larger M values are preferable (see Fig. 2(a), where M = 32). Since the neuron separation ability decreases with N, we can say that, on average, M ≫ 1 values are required, and thus the approximation of Eq. (31) can be widely used.

D. The Limited Range of the Weights

In a deterministic neuron, all weights can be represented in a [−h, h] interval, where h = max |w| ∈ R⁺, while in the stochastic case the range is limited to W = [−1, 1] (if the bipolar representation is assumed, as it is in the present case). Consequently, the corresponding activation potentials span the intervals [−hN, hN] and [−N, N], respectively. Thus, in the stochastic case the activation potential is attenuated by a factor h with respect to the deterministic one. This evidence has a strong impact [10], and to understand its consequences we must analyze the behaviors of the erf and bipolar sigmoid functions in presence of an increased activation potential hA. If we indicate with the subscripts E and S the erf and the bipolar sigmoid cases, respectively, it can be easily noticed from Eq. (17) and Eq. (19) that:

\[ \Phi_{E,S}(hA, K_{E,S}) = \Phi_{E,S}(A, h K_{E,S}) \]    (32)

³ This non-ideality, although clearly identifiable under analysis of the neuron output, does not seem to be properly taken into account in the literature. In [12] there is a hint at the evidence that only linear output functions can be expressed as a function of the activation potential A only. In [16], the Authors discuss the generation of a bitstream with a desired mean value, but do not mention the problem of its distribution. In [17], instead, the effect of the variation of the activation potential distribution of stochastic neurons is presented, but its impact on the neuron output is not commented upon.
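A sketch of ours evaluating the exact slope of Eq. (31) against its large-M approximation, and checking the slope-for-amplitude trade of Eq. (32) on the bipolar sigmoid of Eq. (19); the chosen h, A, K values are arbitrary:

```python
from math import ceil, exp, pi, sqrt

def K_exact(N, M):
    """Exact worst-case slope in A = 0, Eq. (31)."""
    total = 0.0
    for m in range(1, M + 1):
        j = ceil(m * N / 2)
        prod = 1.0
        for n in range(1, j + 1):
            prod *= (2 * n - 1) / (2 * n)
        total += j * prod
    return 2 * total / (N * M)

def K_approx(N, M):
    """Large-M approximation of Eq. (31)."""
    return sqrt(8 * M / (9 * pi * N))

def phi_sigmoid(A, K):
    """Bipolar sigmoid of Eq. (19)."""
    return -1 + 2 / (1 + exp(-2 * K * A))

print(round(K_exact(2, 1), 3))      # 0.5, the value quoted in Sec. IV
print(round(K_exact(2, 256), 3), round(K_approx(2, 256), 3))

# Eq. (32): attenuating A by h is compensated by amplifying the slope K by h
h, A, K = 0.25, 1.2, 0.8
lhs = phi_sigmoid(h * A, K)
rhs = phi_sigmoid(A, h * K)
print(round(lhs, 6), round(rhs, 6))
```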

meaning that an increment of the function slope K_{E,S} returns the same output that would be obtained by incrementing the activation potential A by the same factor. This result allows us to compensate the limited output range of a stochastic neuron by increasing its K, whose value can be obtained using Eq. (31).

E. Output Noise Control

As mentioned, the representation of all numerical quantities in the stochastic case is affected by an inherent noise that limits the precision of the computation. Let us suppose we desire a certain precision in the bitstream representation x̂ of a generic quantity x. The variance of x, σ²_x, can be obtained from Eq. (29), which describes the variance of a single bitstream. Considering that the former is inversely proportional to the number of evaluations, which in our case is the length L of the bitstream, we have:

\[ \sigma_{x}^{2} = \frac{1}{4L}\,(1 - x^{2}) \]    (33)

We propose to find the bitstream length L_B able to return a desired precision B for the representation of x, where B is expressed in terms of bits. According to Eq. (33), the variance depends on the represented value. Therefore, we focus on the worst condition, namely x = 0, which returns σ_x = 1/√(4L_B). If we define ∆ = 2^{−B+1} as the distance between symbols that are adjacent in W, we want to guarantee that a sufficient number of represented values fall inside the interval [−∆/2, ∆/2], where ∆ is the quantization step, so as to correctly return the value x = 0. In other words, we want to enforce that:

\[ \alpha\, \sigma_x \le \frac{\Delta}{2} \]    (34)

where α sets an acceptable confidence interval (for example, if α = 1 we enforce that, statistically, 65% of the values fall into the [−∆/2, ∆/2] interval). Substituting the σ_x and ∆ expressions in Eq. (34) and rearranging, we obtain:

\[ L_B \ge \left( \alpha\, 2^{B-1} \right)^{2} \]    (35)

which grounds the apparently empirical assumptions that are commonly adopted, and represents a quantitative generalization of such rules. For example, in [6] α = 1 was chosen, i.e. the bitstream length was set to L_B = 2^{2B−2}, which means it was implicitly accepted that, in the same worst-case condition, 35% of the expectation values would fall outside the correct quantization interval. By force of Eq. (34), instead, a better choice can be made, for example α = 2, which reduces the fraction of erroneous values to 5%.

The considerations above hold for the case of bits that are completely uncorrelated. A further analysis must be added in presence of a periodical reset. In fact the RDT, which has been proposed as a means to control the activation function slope, operates by introducing a correlation between subsequent output bits that superimposes a correlation noise on the neuron output. To compensate this effect, an additional increase of the bitstream length must be introduced. To determine a proper methodology to tailor such an increase, we conducted an analysis of the noise dependence on the input bitstream length in a worst-case condition, from which a more precise heuristics has been derived. Let us assume we have an input bitstream of length L_B and a reset periodicity equal to M. In addition, let us (unrealistically) suppose that the neuron output function is such that all the M output substreams are totally correlated. As a consequence Eq. (11), providing the neuron response y as the average over the M substream outputs y_m, must be evaluated assuming that all y_m carry the same value. Therefore, we can state that y = ȳ, where we denoted with ȳ the generic y_m. Since ȳ is obtained through the expectation over L_B/M bits (see Subsec. II-B), by force of Eq. (33) we notice that the variance of each substream is M times higher than the variance corresponding to the case M = 1 (i.e. when the reset is performed at each evaluation step). Consequently, to recover the original variance, which is the lowest bound for a given L_B, the bitstream length must be set to L_{M,max} = c_max × L_B, where c_max ≜ c(M)_max = M is the worst-case length correction, since it has been derived in this hypothetical totally correlated case. When correlation is partial, as it is in the case of a generic RDT interval, the value for c(M) can be chosen from Eq. (21), providing the variance of the output at s = 0⃗, i.e. where the variance is maximum, and for different M values. Consequently, after making explicit the dependence of σ_y on M, we can write:

\[ c(M) = \frac{\sigma_{y}^{2}(M)}{\sigma_{y}^{2}(M=1)} \]    (36)

which provides the length coefficient to be used to lower the variance of the represented values to the desired level. Fig. 4 shows in a log-log plot the c_max = M (dashed line) and c(M) (solid line) behaviors⁴ in M. We can notice from the results that c(log₂ M) ≈ c_max(log₂ M − 1). Then, to compensate the correlation noise a value c(M) = c_max(M/2) = M/2 can be chosen, obtaining an approximation that is more than valid if M > 16. Finally, the corresponding minimum bitstream length can be computed:

\[ L_M = c(M)\, L_B \approx \frac{M}{2}\, L_B = M\, \alpha^{2}\, 2^{2B-3} \]    (37)

which can be widely used since M ≫ 1 in most cases (see Subsec. III-C). For the sake of completeness it must be noticed that L_M/M is integer by construction. Consequently, application of Eq. (37) enforces output substreams of equal lengths, a circumstance that guarantees the absence of errors as far as the computation of the output expected values is concerned (see Subsec. II-C).

⁴ Circles have been included to mark the c values obtained in correspondence of integer values of log₂ M.
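The sizing rules of Eqs. (35) and (37) reduce to a few lines of arithmetic; a sketch of ours, with arbitrary example values for B, α and M:

```python
def min_length_uncorrelated(B, alpha):
    """Minimum bitstream length for B bits of precision, Eq. (35)."""
    return (alpha * 2 ** (B - 1)) ** 2

def min_length_rdt(B, alpha, M):
    """Minimum length under a reset periodicity M, Eq. (37)."""
    return M * alpha ** 2 * 2 ** (2 * B - 3)

# alpha = 1 reproduces the choice of [6], L_B = 2^(2B-2).
print(min_length_uncorrelated(8, 1))   # 2^14 = 16384
B, alpha, M = 8, 2, 32
LM = min_length_rdt(B, alpha, M)
# Eq. (37) halves the naive M-fold correction, L_M = (M/2) L_B,
# and L_M / M is an integer, so all substreams have equal length.
print(LM, LM == (M // 2) * min_length_uncorrelated(B, alpha), LM % M)
```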

IV. E QUIVALENCE Based on the results of the previous Section, we analyze here the problem of granting the equivalence between a deterministic neuron and its stochastic counterpart. We first compare the analytical output arising from Eq. (24) in the case of equiprobable outputs with those calculated using bipolar sigmoid and erf approximations featuring the same derivatives in A = 0 (this means that we are forcing K(N, M ) = KS = KE ; see Subsec. III-D). The purpose is to identify the regions where such an equivalence may lead to reasonable results, and to quantify the amount of the discrepancy. Fig. 5 presents the comparison among the three functions described above, plotted for A ≥ 0 (the region A < 0 was omitted due to the odd symmetry of all functions), for few different values of N, M . Rather limited values of both parameters were chosen, i.e. a critical condition was imposed, since for higher values the difference among the various behaviors was almost undetectable. Looking at the results, it is clear that both high N or M values provide an activation potential distribution able to yield a saturating behavior. However, the erf approximation is acceptable for M = 32 regardless of N . Since, typically, M > 16, we can consider the erf an adequate approximation. The bipolar sigmoid approximation, instead, slightly diverges from the analytical behavior due to different saturation characteristics. Consequently, it should be used when computation time is critical, for example in a supervised learning procedure making use of the output function derivative, because its computation is much simpler in this bipolar sigmoid case (compare Eqs. (18),(20)). For the sake of completeness, it must be noticed that the previous analysis has been conducted in the case of equiprobable inputs, thus ignoring the deformation effect of the activation potential behavior shown in Subsec. III-B. 
In order to evaluate the unavoidable bias arising both from such a deformation and from the use of the bipolar sigmoid approximation, in Fig. 6 we subtracted the outputs of Figs. 2(a), 2(b) from those obtained using two different sigmoid functions featuring K(N = 2, M = 1) = 0.5 and K(N = 2, M = 32) = 2.515, respectively. As already said, in the case M = 1, N = 2 the distribution of the activation potential is nearly uniform, returning an output function with linear behavior. This means that the output descends towards ±1 without a clear saturation, returning a negative bias moving towards the point (−1, −1) and a positive one towards (1, 1) (see Fig. 6(a)). On the contrary, when M = 32 the activation potential distribution resembles a Gaussian, but it is affected by the strong distortion that the variance of the activation potential shows in W2. Consequently, the output function behavior is more similar to that resulting from Eq. (16), which shows convex and concave behaviors for A > 0 and A < 0, respectively. The obtained bias (see Fig. 6(b)) is then maximum for A → 0 (i.e. along the negative diagonal of Fig. 6(b)) and σ → 0 (i.e. around (−1, 1) and (1, −1)), while in the central region the error is mainly due to the use of the bipolar sigmoid approximation (see also Fig. 5(b) around A = 1). In both cases the output function appears to be odd and monotone along directions orthogonal to A = 0 (i.e. along the main diagonal in Fig. 6, where a two dimensional case is shown), which guarantees the correct separation ability of the neuron. A bias amount such as that of Fig. 6 can impact the use of online gradient descent learning procedures (e.g. back-propagation), since the output


derivative of a bipolar sigmoid function depends directly on the neuron output (see Eq. (20)). As can be seen in Fig. 6, the sign of the bias in the region of maximum error is equal to that of the output. In other words, the bias adds constructively to the bipolar sigmoid output. Consequently, since |y| ≥ |Φ(A)| and given Eq. (20), we can state that:

\frac{d\Phi}{dA}(y) = K(1 - y^2) \le K(1 - \Phi(A)^2) \quad (38)

that is to say, the actual descent rate is lower than modeled, a circumstance representing a favorable damping condition for the learning procedure. The analysis above has shown that the critical points in the WN (limited) space are those featuring a vanishing activation-potential variance, σ²A∗ = 0, for A = 0 (see Fig. 6(b)), and that they happen to be some of the space vertices. The

conversion error is localized in critical regions around these points. Our next goal is to obtain additional information on the extension of such critical regions with respect to N and M. In particular, we want to graphically demonstrate that the critical regions contract into the critical points at increasing N or M. To this purpose, we considered a neuron with an even number of inputs N, a reset delay M, and a fixed precision B. We evaluated the stochastic approximation error on two specific points, namely sα = {−1, (1 − δB), −1, (1 − δB), . . . , −1, (1 − δB)} and sβ = {−1, (1 − 2δB), −1, (1 − 2δB), . . . , −1, (1 − 2δB)}, i.e. the points at distances dα = δB √(N/2) and dβ = δB √(2N) from the critical point sγ = {−1, 1, −1, 1, . . . , −1, 1}, where δB = 2^(−B+1) is the smallest value representable in W with precision B. We must notice that any permutation of the coordinates of each of the sα,β,γ returns the same output value⁵. Consequently, our treatment holds for all critical regions of WN. Fig. 7 shows the error expressed as the absolute value of the difference between the deterministic output and the corresponding stochastic one, normalized to δB, for the case B = 7. As shown, the approximation error features a peak at varying N or M (the latter is plotted logarithmically in Fig. 7, in order to obtain a cleaner plot): comparing Figs. 7(a) and 7(b), showing respectively the error behavior in sβ (farther from sγ) and in sα (closer to sγ), it can be seen that the error peak moves towards sγ for increasing N or M. Since in each point s the approximation error is the sum of two monotonic functions, namely the output values of the deterministic and stochastic neurons, we can conclude that the error peak moves monotonically in s towards sγ, both for increasing N or M. This argument implies that we are able to decrease the size of any critical region by increasing M for a given value of N, which is also the suggested choice to obtain larger output function slopes, K (see Subsec. III-C), or simply to provide a good erf approximation of the neuron output function (see Fig. 5). This result can be useful in all cases where the neuron, depending on the application, is forced by its inputs to work close to the critical vertices of WN or, more in general, far from its minimal-error portions (see again Fig. 6(b)). Apart from the considerations above on critical regions, for the purpose of designing the stochastic counterpart of a deterministic neuron we propose a simple, two-step equivalence method. Given a deterministic neuron with output function expressed by Eq. (19) or Eq. (17), featuring N inputs (including

⁵ In fact, the output function evaluation is independent of the order of presentation of the corresponding synaptic products.


threshold), a slope around A = 0 of value KS,E, and a maximum weight |h|, its equivalent stochastic counterpart can be determined by means of the following sequence:
1) Set the shape of the output function by finding from Eq. (31) the value of M returning K = hKS,E;
2) Determine from Eq. (37) the proper bitstream length LM returning the desired precision for the data representation.
This result allowed us to implement an automatic conversion procedure for trained deterministic neural networks, whose simulated characterization will be presented in the following Section.

V. SIMULATIONS

In this Section we show the results obtained by applying the equivalence method just described to a deterministic feed-forward, fully connected neural network (DFFNN). This topology was chosen for its simplicity and, at the same time, its wide use in the specific literature. We used two problems as application examples. For each of them we followed a three-step procedure, fully automated and without manual tuning:
1) Network training by means of a floating point, off-line minimization;
2) Conversion of the obtained synaptic weight values into a fixed point representation with the desired precision;
3) Conversion of each deterministic neuron into its stochastic counterpart.
The first application test was the replication of a template monochromatic figure, given the X-Y coordinates of its black and white points. Two different white shapes inside a black background were used, namely a circle and a triangle (see Fig. 8). The purpose of the problem was to visually compare the results obtained from a deterministic network with those of its stochastic equivalent. More specifically, we first trained two identical DFFNNs (one for the triangle, the other for the circle) using an off-line, back-propagation algorithm with floating point variables.
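Step 2 of the procedure (fixed point conversion) can be sketched as follows. The round-to-nearest and symmetric-clipping choices are our assumptions, with δB = 2^(−B+1) taken as the step of the B-bit grid on W = [−1, 1] (cf. the definition of δB in the previous Section):

```python
def quantize_weight(w, B):
    """Round a floating point weight to the nearest point of the B-bit
    fixed-point grid on W = [-1, 1], whose step is delta_B = 2**(-B + 1),
    clipping the result back into W."""
    delta = 2.0 ** (-B + 1)
    q = round(w / delta) * delta
    return max(-1.0, min(1.0, q))

def quantize_network(weights, B):
    # apply the conversion to every synaptic weight of a trained network
    return [quantize_weight(w, B) for w in weights]
```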
The networks featured two inputs (the (x, y) coordinates of each figure in W2), five neurons in the first hidden layer, two neurons in the second one, and one output neuron; all neurons possessed the same separation ability, K = 1. So defined, each resulting network had a total of 30 synapses including threshold synapses⁶, the latter connected to always-active inputs. We chose a 3-layer topology to set up a rather complex situation, so as to stress the impact of small deviations from ideality on the network sensitivity. The deterministic networks were iteratively trained to reproduce with minimum error a training set comprising NP = 100 input/output couples, {I, T}, which were iterated for NE = 5000 training epochs; the inputs I were taken from 10 × 10 grids, a point density considered representative of the figure values in W2 (see dots in Fig. 8).
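For concreteness, the circle training set can be generated as below; the radius r = 0.5 and the ±1 target coding are our assumptions, since the paper only specifies a white shape on a black background sampled on a 10 × 10 grid in W2:

```python
def circle_training_set(grid=10, r=0.5):
    """Build the {I, T} couples for the circle replication problem:
    inputs are the (x, y) points of a grid x grid lattice in W^2,
    targets are +1 inside the white circle and -1 on the black background."""
    pairs = []
    for i in range(grid):
        for j in range(grid):
            x = -1.0 + 2.0 * (i + 0.5) / grid   # cell-centered lattice point
            y = -1.0 + 2.0 * (j + 0.5) / grid
            t = 1.0 if x * x + y * y <= r * r else -1.0
            pairs.append(((x, y), t))
    return pairs
```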

The error function of the minimization algorithm can be written as:

E = \frac{1}{N_O N_P} \sum_{p=1}^{N_P} \sum_{n=1}^{N_O} \left( T_{n,p} - y_n(I_p) \right)^2 \quad (39)

⁶ Given the network topology and its input/neuron composition, and including among the inputs the threshold one, the synapse count is: (2 + 1) × 5 + (5 + 1) × 2 + (2 + 1) × 1 = 30.


i.e. as the mean square error of the difference between the neuron output y_n and its expected value T_n⁷. Correspondingly, the error variance reads:

\sigma^2 = \frac{1}{N_O N_P} \sum_{p=1}^{N_P} \sum_{n=1}^{N_O} \left[ \left( T_{n,p} - y_n(I_p) \right)^2 \right]^2 - E^2 \quad (40)

After the minimization of E, the obtained set of floating point weights was converted into fixed point representation. To do so, after the training procedure we repeated the network execution on a given input set using different data precisions, i.e. starting from Bmax = 16 bits down to Bmin = 2 bits. The adopted precision B was chosen as the lowest one respecting the bound:

E + \alpha\sigma \le \left( \frac{\Delta}{2} \right)^2 \quad (41)

where α = 3 and ∆ = 1, roughly meaning that we wanted to enforce 99% of the high-valued outputs above yH,min = 0.5, and 99% of the low-valued outputs below yL,max = −0.5 (recall that T ∈ {−1, 1} in our case). Finally, each weight was mapped onto W, and the corresponding stochastic neurons were designed using the equivalence scheme described in the previous Subsection. Simulation results computed over the whole W2 are shown in Fig. 9. As can be seen, the stochastic network shows output values that are very similar to those obtained from its deterministic counterpart, qualitatively identical in the case of the triangular shape. The worst performance of the stochastic network is obtained in the case of the circle. This can be ascribed to the variance deformation effect described in Subsec. III-B, inducing a non-ideal neuron separation which shows a stronger signature on curved shapes. In order to obtain additional information on the effect of the conversion errors, we introduced a second problem: given as inputs a sample of the values of x3 = d1x1 + d2x2, i.e. points lying on a plane in W3, the network had to replicate the plane's derivative values d1 = dx3/dx1 and d2 = dx3/dx2. Since each neuron input can span any value in W, the problem requires a large degree of precision. Furthermore, the input function is linear, a critical condition when dealing with stochastic neurons, which show better ability when approximating highly non-linear activation functions (see, for example, Fig. 5). In particular, the inputs for a specific plane were the values of x3 taken on a 7 × 7 grid of points in the {x1, x2} space, that is to say x1,2 = {−1 + 2i/7 : i = 1, . . . , 7}. The training set comprised 9 different planes, arising from all the possible choices of d1,2 = {−0.5, 0, 0.5}. The simulation was performed for the different values of precision B = {2, . . . , 8}, and for the activation function slope values K = {0.5, 1, 2}.
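The training inputs just described can be generated as follows (our own sketch of the sampling; function names are illustrative):

```python
def plane_samples(d1, d2, grid=7):
    # x3 = d1*x1 + d2*x2 sampled on the 7 x 7 grid x_{1,2} = -1 + 2i/7
    pts = [-1.0 + 2.0 * i / grid for i in range(1, grid + 1)]
    return [d1 * x1 + d2 * x2 for x1 in pts for x2 in pts]

def planes_training_set():
    # 9 planes from all choices of d1, d2 in {-0.5, 0, 0.5}; each couple
    # maps the 49 sampled x3 values to the target derivatives (d1, d2)
    ds = (-0.5, 0.0, 0.5)
    return [(plane_samples(d1, d2), (d1, d2)) for d1 in ds for d2 in ds]
```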
The inputs were provided both to the deterministic NN and to its stochastic equivalent. Both NNs were characterized by 7 × 7 = 49 inputs, 14 neurons in the hidden layer, and 2 neurons in the output layer, for a total of 730 synapses (threshold synapses included). Fig. 10 shows the resulting error plotted against B, both for the deterministic (solid line) and stochastic (dotted line) cases, and for different values of K. We can see that the mean stochastic error roughly follows the deterministic one, at least for relatively low values of B and as K increases. The tendency in K was expected, because neuron activation functions show a more saturating behavior for high K values, so that

⁷ NO = 1 in this case, since a single output neuron is present.


the stochastic approximation is more appropriate (see Fig. 7 and the related discussion in Sec. IV). On the other hand, the deterministic error increases with K, since highly non-linear activation functions are not optimal for the solution of a linear problem such as the one considered here. As far as B is concerned, it must be noticed that beyond a certain precision the stochastic error cannot decrease further. This happens because the error reaches the lower limit imposed by the conversion non-idealities. Given these considerations, since in this case the learning error bound was taken as Emax = 1/16 (dashed line in Fig. 10), so as to guarantee a proper neural network discrimination between the derivative values, the stochastic networks for K = 0.5 and K = 1 failed the learning condition for any choice of B. As a final comment, we must point out that the second experiment was specifically tailored to show the limits of a stochastic neuron. In fact, different simulations that we performed, involving digital input and output variables⁸, provided stochastic outputs indistinguishable from the deterministic ones. In other words, we can state that the deterministic-to-stochastic conversion returns optimal results when applied to the class of digital problems.

VI. IMPLEMENTATION

In this Section we briefly comment on the implementation aspects of the proposed neuron. To this purpose, in Fig. 11 we provide a simplified RTL scheme of one of its possible realizations inside an FPGA. Specifically, we focus here on the implementation of the output function, since binary-to-bitstream and bitstream-to-binary conversions are already well covered in the literature (see, for example, [6], [18]). Control signals unessential to the discussion are not shown. The scheme depicts the operations performed on bitstreams during the generic i-th evaluation step.
In this regard, from now on we will omit the index i to simplify the notation. As can be seen, the N − 1 actual neuron inputs x̂ are multiplied by the corresponding synaptic weights ŵ by means of XNOR operators, while the threshold ŵN is directly passed to the next stage, as its input is implicitly set to a constant “1”. Then, the synaptic products ŝ are transformed into a serial bitstream ŝser by the multiplexer, becoming a control signal acting as the U/D̄ signal of an UP/DOWN counter. This means that the counter progressively accumulates the value of the activation potential A∗A + A∗, and thus must be sized as H = ⌈log2(2MN + 1)⌉, where MN is the maximum absolute value of the activation potential given M evaluations of N inputs, the factor of 2 accounts for its sign, and the additional unit is needed to represent the zero value.

Every M evaluation steps the counter is reset to the value C(rst) = 011...1. Consequently, the MSB of the counter status C provides the condition A∗A + A∗ > 0. However, as previously discussed in Subsec. III-A, such a bit returns a biased output, ŷbia. Its compensation needs the introduction of an additional option, i.e. a bitstream ŷ0.5 featuring p = 0.5, selected as output when C = C(rst). The

multiplexer selection input is provided by the H-input NAND, performing the check on C.

⁸ In this context, by digital input and output variables we mean a category of signal relationships that require digital transfer functions to be computed, and that are thus suitable for highly saturating neuron activation functions.


As we can see, the required neuron logic is very limited, scaling linearly with N and logarithmically with MN as far as the “input” and “output” stages are concerned, respectively.

VII. CONCLUSIONS

In this paper we have investigated the problem of the equivalence between deterministic and stochastic neurons. All critical issues, namely the limited slope of the activation function, the limited weight range, the output noise and the neuron output bias, have been identified and quantitatively addressed, apart from the inherent space-dependence of the activation potential distribution, which was investigated mainly on a qualitative basis. The latter issue, however, has been shown to be limited to specific regions of the neuron synaptic input space, which can be arbitrarily reduced by increasing the slope of the activation function. A simple two-step equivalence methodology between deterministic and stochastic neurons has been proposed, which allows optimal control of the steepness and noise of the resulting stochastic neuron. Simulation results confirmed that good quality results can be achieved, especially for digital problems, thus supporting the proposed equivalence methodology as a valuable strategy for the proper design of stochastic neural networks.


REFERENCES

[1] H. Hikawa, “A new digital pulse-mode neuron with adjustable activation function,” IEEE Trans. Neural Networks, vol. 14, p. 236, 2003.
[2] ——, “A digital hardware pulse-mode neuron with piecewise linear activation function,” IEEE Trans. Neural Networks, vol. 14, p. 1028, 2003.
[3] N. Nedjah and L. Mourelle, “FPGA-based hardware architecture for neural networks: binary radix vs. stochastic,” in Proc. SBCCI ’03, 2003, p. 111.
[4] S. Bade and B. Hutchings, “FPGA-based stochastic neural networks - implementation,” in Proc. IEEE FPGAs for Custom Computing Machines, 1994, p. 189.
[5] M. Martincigh and A. Abramo, “A new architecture for digital stochastic pulse-mode neurons based on the voting circuit,” IEEE Trans. Neural Networks, vol. 16, p. 1685, 2005.
[6] M. van Daalen, P. Jeavons, and J. Shawe-Taylor, “A stochastic neural architecture that exploits dynamically reconfigurable FPGAs,” in Proc. IEEE Workshop on FPGAs for Custom Computing Machines, 1993, p. 202.
[7] M. Pearson, A. Pipe, B. Mitchinson, K. Gurney, C. Melhuish, I. Gilhespy, and M. Nibouche, “Implementing spiking neural networks for real-time signal-processing and control applications: a model-validated FPGA approach,” IEEE Trans. Neural Networks, vol. 18, p. 1472, 2007.
[8] N. Patel, S. Nguang, and G. Coghill, “Neural network implementation using bit streams,” IEEE Trans. Neural Networks, vol. 18, p. 1488, 2007.
[9] A. Nadas, “Binary classification by stochastic neural nets,” IEEE Trans. Neural Networks, vol. 6, p. 488, 1995.
[10] Y. Kondo and Y. Sawada, “Functional abilities of a stochastic logic neural network,” IEEE Trans. Neural Networks, vol. 3, p. 434, 1992.
[11] N. Nedjah and L. de Macedo Mourelle, “Stochastic reconfigurable hardware for neural networks,” in Proc. Euromicro Symposium on Digital System Design, 2003, p. 438.
[12] P. Burge, M. van Daalen, B. Rising, and J. Shawe-Taylor, “Stochastic bit-stream neural networks,” in Pulsed Neural Networks, W. Maass and C. Bishop, Eds. Cambridge: MIT Press, 1999, p. 337.
[13] S. Ghilezan, J. Pantovic, and J. Zunic, “Separating points by parallel hyperplanes - characterization problem,” IEEE Trans. Neural Networks, vol. 18, p. 1356, 2007.
[14] J. Freixas and X. Molinero, “The greatest allowed relative error in weights and threshold of strict separating systems,” IEEE Trans. Neural Networks, vol. 19, p. 770, 2008.
[15] M. van Daalen, J. Zhao, and J. Shawe-Taylor, “Real time output derivatives for on chip learning using digital stochastic bit stream neurons,” Electronics Letters, vol. 30, p. 1775, 1994.
[16] P. Jeavons, D. Cohen, and J. Shawe-Taylor, “Generating binary sequences for stochastic computing,” IEEE Trans. Information Theory, vol. 40, p. 716, 1994.
[17] J. Zhao, J. Shawe-Taylor, and M. van Daalen, “Learning in stochastic bit stream neural networks,” Neural Networks, vol. 9, p. 991, 1996.
[18] B. Brown and H. Card, “Stochastic neural computation I: computational elements,” IEEE Trans. Computers, vol. 50, p. 891, 2001.


FIGURE CAPTIONS

Figure 1: Contour plot of the output of a neuron (a) and corresponding bias (b). (N = 2, M = 1)
Figure 2: Contour plot of the output of a 2-input unbiased neuron featuring M = 1 (a) and M = 32 (b).
Figure 3: Separation ability, K, of a stochastic neuron plotted at varying N and M, both in the analytical (continuous line) and approximated (dotted line) cases.
Figure 4: Bitstream length correction factor, c, and its maximum value, cmax, expressed as functions of the reset periodicity M.
Figure 5: Analytical neuron output function (solid line) compared to its erf (dotted line) and bipolar sigmoid (dashed line) approximations.
Figure 6: Bias of the bipolar sigmoid approximation of a 2-input neuron featuring M = 1 (a) and M = 32 (b).
Figure 7: Bipolar sigmoid approximation error for two points positioned farther (a) and closer (b) to a critical point. The plot was obtained as a function of the reset periodicity, M, and varying the number of synaptic inputs, N.
Figure 8: The shape replication problem: back-propagation learning sets are indicated by dots.
Figure 9: Neural replication after back-propagation learning. Deterministic (a), (c) and stochastic (b), (d) results are compared. Data representation was set to B = 5 bits.
Figure 10: Execution error, E, computed after solving the planes derivatives problem, both for the deterministic (“D”, solid lines) and stochastic (“S”, dotted lines) cases. Plots are obtained at varying precision, B, and for different values of the neuron activation function steepness, K. The error bound for a successful learning procedure is also shown (dashed line).
Figure 11: Simplified RTL neuron scheme.

[Figure pages: plot data not reproducible in text form; recoverable panel information follows.]
Fig. 1. (a) Biased output; (b) Bias. (axes: s1, s2)
Fig. 2. (a) Unbiased output, M = 1; (b) Unbiased output, M = 32. (axes: s1, s2)
Fig. 3. (axes: log2 M, log2 K; curves labeled by N)
Fig. 4. (axes: log2 M, log2 c; curves: c, cmax)
Fig. 5. (a) N = 2, M = 1; (b) N = 2, M = 32; (c) N = 8, M = 1; (d) N = 8, M = 32. (axes: A, Y; curves: Analytical, Erf, Sigmoid)
Fig. 6. (a) M = 1; (b) M = 32. (axes: s1, s2)
Fig. 7. (a) Farther point; (b) Closer point. (axes: N, log2 M)
Fig. 8. (a) Triangle; (b) Circle. (axes: s1, s2)
Fig. 9. (a) Triangle, deterministic; (b) Triangle, stochastic; (c) Circle, deterministic; (d) Circle, stochastic. (axes: s1, s2)
Fig. 10. (axes: B, E; curves: D and S for K = 0.5, 1, 2, plus Bound)
Fig. 11. (RTL scheme: inputs x̂1..x̂N−1, weights ŵ1..ŵN, synaptic products ŝ1..ŝN, MUX, UP/DOWN counter, output ŷ)


requires the re-design of the very same application, where signal levels, precision and noise have to be controlled once again. Although the present literature is rich in characterizations of stochastic neurons (see [5], [9], [10] for example), to the best of our knowledge no comprehensive set of theoretical or practical guidelines able to state the equivalence between deterministic and stochastic neuron implementations has been presented yet. The purpose of this paper is to address this issue, presenting the critical aspects that must be examined in order to quantitatively guarantee this equivalence for a generic neuron. More specifically, the stochastic neuron non-idealities that have been taken into consideration, and that will be clarified in the remaining Sections, are the following:
1) The presence of an output bias which depends on the number of neuron inputs;
2) The dependence of the neuron output on the distribution of the activation potential, and not only on its expectation value;
3) The dependence of the activation function slope on the number of neuron inputs;
4) The bound of the synaptic weight values to the [0, 1] interval originating from their probabilistic nature, which limits the range of the activation potential excursion;
5) The role and control of the stochastic noise, needed to maintain a desired precision on the numerical output values.

II. THE STOCHASTIC NEURON

This Section introduces the main notation that will be used throughout the paper, and presents the stochastic neuron model.

A. Notation

Without loss of generality, we focused our investigation on the classical perceptron. As is known, a perceptron implements the following activation (or output) function:

y = \Phi\left( \sum_{n=1}^{N_I} w_n x_n - \theta \right) \quad (1)

where x_n is the n-th input among the N_I of the neuron, w_n its n-th synaptic weight, and θ the neuron's threshold. In general, the threshold θ may be considered an inherent property of the neuron. However, as commonly adopted, we chose to treat it as a weight associated to an additional, fictitious, always-active input xNI+1. Consequently, we can redefine wNI+1 ≜ −θ and simplify the notation as follows:

y = \Phi\left( \sum_{n=1}^{N} w_n x_n \right) = \Phi\left( \sum_{n=1}^{N} s_n \right) = \Phi(A(s)) \quad (2)

where N = NI + 1 is the total number of synaptic inputs, s is the vector of synaptic products s_n, defined in the RN space, and A(s) is the activation potential. In the remainder of this paper Φ will be treated as a function within the synaptic products space, i.e. Φ(s). Given a generic stochastic bitstream x̂ of length L, we define as x̂^(i) its i-th bit, i = 1, 2, . . . , L, and as I = {i : i = 1, . . . , L} the set of the indices of the subsequent neuron output evaluations. In a stochastic


neuron each quantity is defined as the expectation value over I. Thus, if x̂ is a stochastic bitstream it holds that x = E[x̂] = (1/L) Σ_{i=1}^{L} x̂^(i), where x ∈ [0, 1]. Nevertheless, we adopted an equivalent bipolar value representation in the interval W = [−1, 1], obtained by defining x = −1 + 2E[x̂], whose only difference, from the implementation viewpoint, is the need for an XNOR gate instead of an AND gate to perform the synaptic multiplication [6], [11], [12].

B. The Neuron Output Function

In this Subsection we describe the commonly adopted model for the stochastic neuron output function, which will provide a basis for the next Subsections. Defining WN as the synaptic product space of a neuron with N inputs, the locus of points featuring identical A(s) values is a parallel hyperplane in WN (see also [13], [14]). Ideally, the locus A(s) = 0 divides the space into two disjoint sets. The neuron's output function Φ must provide opposite values for the two sets, and must move monotonically from one value to the other, as steeply as possible. This implies that Φ must be an odd function with saturating behavior, such as the erf or the bipolar sigmoid functions. In a stochastic neuron, for each evaluation step i we can write the instantaneous activation potential A* as:

A^*(\hat{s}^{(i)}) \triangleq \sum_{n=1}^{N} \left( -1 + 2\hat{s}_n^{(i)} \right) = -N + 2\sum_{n=1}^{N} \hat{s}_n^{(i)} \quad (3)

i.e. the bipolar sum of all synaptic products. Consequently, A^*(\hat{s}^{(i)}) ∈ {−N + 2n : n = 0, . . . , N}, and it results that A(s) = E[A^*(\hat{s})].

The stochastic neurons presented in the specific literature commonly implement the following function of A*(ŝ^(i)) [1], [6], [11]:

\[ \hat{y}(\hat{s}^{(i)}) = \hat{\Phi}(A^*(\hat{s}^{(i)})) = \begin{cases} 1 & \text{if } A^*(\hat{s}^{(i)}) > 0 \\ 0 & \text{if } A^*(\hat{s}^{(i)}) \le 0 \end{cases} \tag{4} \]

where Φ̂ is the instantaneous output function that, being a step function, is discontinuous. It is the non-impulsive distribution of the synaptic products that returns an expected output function Φ = −1 + 2 E[Φ̂] of class C¹. Rewriting Eq. (4) within the domain of the expected value and converting the values into the W representation, we obtain:

\[ y(s) = -1 + 2\, E[\hat{y}(\hat{s})] = -1 + 2\, P[A^*(\hat{s}) > 0] = -1 + 2 \sum_{j=1}^{N} f_{A^*}(s, j) \tag{5} \]

where f_{A*}(s, j) = P[A*(ŝ) = j] is the discrete distribution of the activation potential, which is a function of s. Eq. (5) thus represents the discrete integration of the activation potential distribution. For large N this function approaches an erf function (see Subsec. II-D).

C. The Output Function Steepness

In order to improve the input separation ability of the neuron, i.e. to control the steepness of the activation function Φ(A) around A = 0, we exploited the same technique presented in [5], which we will refer to as the Reset Delay Technique (RDT).
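To make the model of Eqs. (3)-(5) concrete, the expected output can be estimated by direct Monte Carlo simulation of the bitstreams. The following Python sketch is illustrative only (the function name and parameters are ours, not from the paper); it assumes independent Bernoulli input bitstreams in the bipolar W representation:

```python
import random

def stochastic_neuron_output(s, L=1 << 16, seed=0):
    """Monte Carlo estimate of the expected output y(s) of Eq. (5).

    Each synaptic product s_n in [-1, 1] is represented by an independent
    Bernoulli bitstream with P[bit = 1] = (1 + s_n)/2; at each step the
    instantaneous potential A* of Eq. (3) is thresholded as in Eq. (4).
    """
    rng = random.Random(seed)
    p = [(1.0 + sn) / 2.0 for sn in s]
    ones = 0
    for _ in range(L):
        bits = [1 if rng.random() < pn else 0 for pn in p]
        a_star = -len(s) + 2 * sum(bits)   # Eq. (3): bipolar sum of the bits
        if a_star > 0:                     # Eq. (4): instantaneous step output
            ones += 1
    return -1.0 + 2.0 * ones / L           # back to the bipolar W interval

# For N = 1 the model reduces to y = s, so the estimate should track the
# input value up to the stochastic representation noise.
print(stochastic_neuron_output([0.5]))
```

For N > 1 and A(s) close to zero the same routine also exposes the output bias discussed in Subsec. III-A, since ties are resolved in favor of the zero output.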


According to the RDT, the value of A*(ŝ^(i)) is not reset to zero after each neuron evaluation, but is accumulated for a number of evaluations M > 1, which we named the reset periodicity, where m ∈ M = {1, 2, ..., M} represents the number of steps following the latest reset event, otherwise defined as the reset step. The expression

\[ m(i) = \mathrm{rem}(i-1, M) + 1 \tag{6} \]

provides the relation between an evaluation step i and its reset step m. An increase in M causes an increase in dΦ(A)/dA, the steepness of the output function [5]. However, in [5] the value of A* was constrained to the [−N, N] interval, which caused the steepness to saturate to a specific value for M → ∞. Differently, we want to avoid the steepness saturation by allowing A* to reach any value. Therefore, in the remainder of this Subsection we will extend the theoretical analysis of [5] to assess the properties of the modified output function and of its derivative. To this purpose, we define the instantaneous accumulated activation potential A*_A(ŝ^(i)) as:

\[ A^*_A(\hat{s}^{(i)}) = \sum_{j=1}^{m(i)-1} A^*(\hat{s}^{(i-j)}) \tag{7} \]

that is to say, the sum of all activation potentials evaluated after the most recent reset event. Consequently, Eq. (4) can be rewritten as:

\[ \hat{y}(\hat{s}^{(i)}) = \begin{cases} 1 & \text{if } A^*_A(\hat{s}^{(i)}) + A^*(\hat{s}^{(i)}) > 0 \\ 0 & \text{if } A^*_A(\hat{s}^{(i)}) + A^*(\hat{s}^{(i)}) \le 0 \end{cases} \tag{8} \]
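The accumulation of Eqs. (6)-(8) can be sketched in a few lines of Python (an illustrative sketch of ours: input bits are drawn fresh at every evaluation, and all N inputs share the same probability p). Running it for a slightly positive input shows the output moving closer to saturation as M grows, i.e. the steepening effect the RDT is meant to produce:

```python
import random

def rdt_output(p, N, M, L=1 << 17, seed=1):
    """Monte Carlo output of a stochastic neuron using the Reset Delay
    Technique: the instantaneous potential A* is accumulated and the
    accumulator is cleared every M evaluations (Eqs. (6)-(8))."""
    rng = random.Random(seed)
    acc = 0          # A*_A, the accumulated potential of Eq. (7)
    ones = 0
    for i in range(L):
        if i % M == 0:                     # reset step, cf. Eq. (6)
            acc = 0
        a_star = sum(1 if rng.random() < p else -1 for _ in range(N))
        acc += a_star
        if acc > 0:                        # Eq. (8), biased tie handling
            ones += 1
    return -1.0 + 2.0 * ones / L

# Same slightly positive input, two reset periodicities: the larger M
# yields an output much closer to the positive saturation value.
y1, y32 = rdt_output(0.55, 2, 1), rdt_output(0.55, 2, 32)
print(y1, y32)
```

Note that this sketch keeps the biased tie handling of Eq. (8); the compensation of Subsec. III-A would replace the `acc > 0` test with a random bit whenever the accumulated potential is exactly zero.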

In order to obtain the equivalent of Eq. (5), a brief explanation is required. Eq. (6) uniformly remaps the set I onto M. Thus, in a bitstream of length L a particular m manifests L/M times, where for simplicity, but with no loss of generality, we assume L/M integer. Then, we define as I_m = {i ∈ I : m(i) = m} the set of the evaluation step indices mapped onto a specific value of m, satisfying the requirements:

\[ I = \bigcup_{m=1}^{M} I_m \tag{9} \]

\[ I_{m_1} \cap I_{m_2} = \emptyset, \quad \forall\, m_1 \neq m_2 \tag{10} \]

and as ŝ_m the substream of ŝ having the evaluation step indices i ∈ I_m, m = 1, 2, ..., M, where s_m = s for all m. Defining A*_m = (A*_A + A*)|_m as the effective activation potential on step m, from Eq. (8) each substream ŝ_m will return a different expected potential value A_m(s_m), hence a different expected output y_m(s_m). Since all substreams feature the same length, we can transform the expectation over I into an average of the expectations over each I_m:

\[ y(s) = -1 + 2\, E[\hat{y}(\hat{s})] = -1 + \frac{2}{M}\sum_{m=1}^{M} E[\hat{y}(\hat{s}_m)] = \frac{1}{M}\sum_{m=1}^{M} y_m(s_m) = \frac{1}{M}\sum_{m=1}^{M} y_m(s) \tag{11} \]

where we have defined y_m(s_m) = −1 + 2 E[ŷ(ŝ_m)] as the expected output value of subset I_m. Being N_m = mN the number of contributions accumulated in A*_m, similarly to Eq. (5) we can rewrite Eq. (11) as:

\[ y(s) = -1 + \frac{2}{M}\sum_{m=1}^{M} P[A^*_m(\hat{s}_m) > 0] = -1 + \frac{2}{M}\sum_{m=1}^{M}\sum_{j=1}^{N_m} f_{A^*_m}(s, j) \tag{12} \]


The discussion above implies that we can interpret the effect of a delayed reset as an amplification by a factor m = A_m/A of the m-th expected activation potential value, which results in a different potential distribution, hence in different y_m output values, the latter being averaged over M to return the correct output y.

If we limit the space of the values of s to the diagonal of W^N, that is to say if we introduce the constraint s₁ = s₂ = ... = s_N, and if we define p = P[ŝ_n = 1], the distribution of the accumulated potential becomes binomial. Thus we can rewrite Eq. (12) as follows:

\[ y(p) = -1 + \frac{2}{M}\sum_{m=1}^{M} \sum_{k=0}^{\lceil N_m/2 \rceil - 1} \binom{N_m}{k} p^{N_m-k} (1-p)^{k} \tag{13} \]

which can be differentiated with respect to A = N(−1 + 2p), obtaining:

\[ \frac{dy}{dA} = \frac{1}{NM}\sum_{m=1}^{M} \sum_{k=0}^{\lceil N_m/2 \rceil - 1} \binom{N_m}{k} (N_m - k - N_m p)\, p^{N_m-k-1} (1-p)^{k-1} \tag{14} \]

whose real-time hardware calculation is clearly heavier compared to that of [15].
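Under the equiprobable-input constraint, Eqs. (13),(14) can be evaluated directly. The following Python sketch (ours, for illustration) reproduces, at p = 0.5, the slope value K(N = 2, M = 1) = 0.5 used later in the paper:

```python
from math import comb, ceil

def y_eq13(p, N, M):
    """Expected output of Eq. (13): equiprobable inputs, reset periodicity M."""
    total = 0.0
    for m in range(1, M + 1):
        Nm = m * N
        total += sum(comb(Nm, k) * p**(Nm - k) * (1 - p)**k
                     for k in range(ceil(Nm / 2)))
    return -1.0 + 2.0 * total / M

def dy_dA_eq14(p, N, M):
    """Derivative of Eq. (13) with respect to A = N(-1 + 2p), Eq. (14)."""
    total = 0.0
    for m in range(1, M + 1):
        Nm = m * N
        total += sum(comb(Nm, k) * (Nm - k - Nm * p)
                     * p**(Nm - k - 1) * (1 - p)**(k - 1)
                     for k in range(ceil(Nm / 2)))
    return total / (N * M)
```

Note that `y_eq13` retains the biased tie handling of Eq. (8): for N = 2, M = 1 it returns y(0.5) = −0.5 instead of 0, which is exactly the bias compensated in Subsec. III-A.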

It must be noticed that the above constraint on s could be considered rather limiting. However, Eqs. (13),(14) will be adopted in the following either for s = 0, where the limitation does not exist, or just to compare their behavior with the exact one, as obtained from Eq. (12).

D. Template Deterministic Output Functions

Since the typical outputs of a deterministic neuron are the erf or the bipolar sigmoid functions, we recall here their expressions and derivatives. These results will be useful for the comparison between deterministic and stochastic output functions presented in Sec. III. Given an evaluation step m ∈ [1, M], if N_m = mN ≫ 1 and all synaptic inputs are equiprobable, in force of the Central Limit Theorem the activation potential distribution f_{A*_m} approaches a Gaussian. Such a normal distribution features μ_m = A_m = mA and σ_m = σ̃√(mN), where σ̃ is the standard deviation of a single activation potential contribution. Assuming for simplicity to be in a continuous domain, the output function then reads:

\[ \Phi_E(A) = -1 + \frac{2}{M}\sum_{m=1}^{M} P[A^*_m(\hat{s}_m) > 0] = -1 + \frac{2}{M}\sum_{m=1}^{M} \int_0^{\infty} f_{A^*_m}(mA, x)\, dx \tag{15} \]

where, compared to Eq. (12), we can notice that the dependence of f_{A*_m} on s is replaced by the dependence on A_m. It can be easily seen that, after a trivial manipulation of the integral, we can write:

\[ \Phi_E(A) = -1 + \frac{2}{M}\sum_{m=1}^{M} \left( \frac{1}{2} + \int_0^{mA} f_{A^*_m}(0, x)\, dx \right) = \frac{2}{M}\sum_{m=1}^{M} \frac{1}{\sqrt{\pi}} \int_0^{mA/(\sqrt{2}\sigma_m)} e^{-t^2}\, dt \tag{16} \]

where the erf behavior of the output function is apparent. Additionally, we can rewrite Eq. (16) as:

\[ \Phi_E(A) = \frac{1}{M}\sum_{m=1}^{M} \mathrm{erf}\!\left( \sqrt{\frac{m}{2N\tilde{\sigma}^2}}\, A \right) = \frac{1}{M}\sum_{m=1}^{M} \mathrm{erf}\!\left( \frac{\sqrt{\pi}}{2}\, K_{E_m} A \right) \tag{17} \]

where we introduced K_{E_m} = √(2m/(πNσ̃²)). The corresponding output derivative is:

\[ \frac{d\Phi_E(A)}{dA} = \frac{1}{M}\sum_{m=1}^{M} \frac{d\Phi_{E_m}}{dA} = \frac{1}{M}\sum_{m=1}^{M} K_{E_m}\, e^{-\frac{\pi}{4} K_{E_m}^2 A^2} \tag{18} \]


which shows that K_{E_m} can be interpreted as the derivative of the m-th erf function computed in A = 0. Consequently, we can define the parameter K_E ≜ (1/M) Σ_{m=1}^{M} K_{E_m} as the output function derivative for A = 0.

Eqs. (17),(18), obtained for N_m ≫ 1, are approximate expressions which require a rather intensive computation. For this reason we may introduce a bipolar sigmoid approximation:

\[ \Phi_S(A) \triangleq -1 + \frac{2}{1 + e^{-2K_S A}} \tag{19} \]

whose derivative is, instead, quite simple:

\[ \frac{d\Phi_S(A)}{dA} = K_S\, (1 - \Phi_S^2(A)). \tag{20} \]

Similarly to the erf case, we purposely introduced here the parameter K_S to explicitly refer to the bipolar sigmoid derivative in A = 0. Both previous expressions will be used to analyze the equivalence between a stochastic neuron and its deterministic counterpart.

III. ADDRESSING THE OPEN ISSUES

This Section addresses the open issues that have been presented at the end of Sec. I, which will be analyzed to lay the foundation for the deterministic-to-stochastic equivalence.

A. Controlling The Neuron Bias

As previously mentioned, Eq. (8) shows the instantaneous relationship between the neuron output and the evaluation of its N synaptic products. It must be noticed that the activation function Φ̂ is not properly defined for A*_A(ŝ^(i)) + A*(ŝ^(i)) = 0. In fact, in presence of an even number of synaptic products N_m, the number of possible values of the activation potential that return ŷ^(i) = 0 exceeds by one the number of values returning ŷ^(i) = 1. This is due to the equal-sign attribution of Eq. (8), which favors the zero output¹. This results in a bias that is superimposed on the neuron output.

To cancel it, we can restate Eq. (8) as:

\[ \hat{y}^{(i)}(\hat{s}^{(i)}) = \begin{cases} 1 & \text{if } A^*_A(\hat{s}^{(i)}) + A^*(\hat{s}^{(i)}) > 0 \\ 1/2 & \text{if } A^*_A(\hat{s}^{(i)}) + A^*(\hat{s}^{(i)}) = 0 \\ 0 & \text{if } A^*_A(\hat{s}^{(i)}) + A^*(\hat{s}^{(i)}) < 0. \end{cases} \tag{21} \]

Since 1/2 is not a valid binary value, Eq. (21) must be interpreted in probabilistic terms: in presence of zero activation potential values, the output bit must be generated as a random variable having an expected value P = 1/2. Therefore, the unbiased version of Eq. (12) is:

\[ y(s) = -1 + \frac{2}{M}\sum_{m=1}^{M} \left( \frac{1}{2} P[A^*_m(\hat{s}_m) = 0] + \sum_{j=1}^{N_m} P[A^*_m(\hat{s}_m) = j] \right) = -1 + \frac{1}{M}\sum_{m=1}^{M} P[A^*_m(\hat{s}_m) = 0] + \frac{2}{M}\sum_{m=1}^{M}\sum_{j=1}^{N_m} f_{A^*_m}(s, j) \tag{22} \]

¹ Obviously, the dual situation would happen if the equal sign were attributed to the ŷ^(i) = 1 case.

Consequently, we can define:

\[ \beta(s) \triangleq \frac{1}{M}\sum_{m=1}^{M} \beta_m(s) = -\frac{1}{M}\sum_{m=1}^{M} P[A^*_m(\hat{s}_m) = 0] \tag{23} \]

as the s-dependent output bias that has been compensated by adopting the definitions of Eq. (21). Looking at the definition of the bias, it can be noticed that only the case M = 1 with N odd necessarily shows β(s) = β₁(s) = 0; in all other cases an even N_m = mN exists, hence at least one β_m(s) is not zero. Furthermore, in the case of equiprobable synaptic inputs the unbiased expression corresponding to Eq. (13) becomes:

\[ y(p) = -1 + \frac{2}{M}\sum_{m=1}^{M} \left( -\frac{\beta_m(p)}{2} + \sum_{k=0}^{\lceil N_m/2 \rceil - 1} \binom{N_m}{k} p^{N_m-k} (1-p)^{k} \right) \tag{24} \]

where the bias can be analytically calculated as:

\[ \beta_m(p) = \begin{cases} -\dbinom{N_m}{N_m/2} \left[ p(1-p) \right]^{N_m/2} & \text{if } N_m \text{ even} \\ 0 & \text{if } N_m \text{ odd.} \end{cases} \tag{25} \]

Finally, the unbiased derivative expression corresponding to Eq. (14) is:

\[ \frac{dy}{dA}(p) = \frac{1}{NM}\sum_{m=1}^{M} \left( -\frac{\beta'_m(p)}{2} + \sum_{k=0}^{\lceil N_m/2 \rceil - 1} \binom{N_m}{k} (N_m - k - N_m p)\, p^{N_m-k-1} (1-p)^{k-1} \right) \tag{26} \]

where

\[ \beta'_m(p) = \begin{cases} -\dbinom{N_m}{N_m/2} \dfrac{N_m}{2} (1 - 2p) \left[ p(1-p) \right]^{N_m/2 - 1} & \text{if } N_m \text{ even} \\ 0 & \text{if } N_m \text{ odd.} \end{cases} \tag{27} \]
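The compensated expressions (24),(25) can be checked numerically. In this illustrative Python sketch (helper names are ours) the unbiased output is exactly zero at p = 0.5, i.e. at A = 0, and is an odd function of A, as required:

```python
from math import comb, ceil

def beta_m(p, Nm):
    """Bias contribution of Eq. (25); non-zero only for even N_m."""
    if Nm % 2:
        return 0.0
    return -comb(Nm, Nm // 2) * (p * (1 - p)) ** (Nm // 2)

def y_unbiased(p, N, M):
    """Bias-compensated output of Eq. (24) for equiprobable inputs."""
    total = 0.0
    for m in range(1, M + 1):
        Nm = m * N
        total += -beta_m(p, Nm) / 2.0 + sum(
            comb(Nm, k) * p**(Nm - k) * (1 - p)**k
            for k in range(ceil(Nm / 2)))
    return -1.0 + 2.0 * total / M

# At p = 0.5 the compensation removes the bias entirely: y(0.5) = 0.
print(y_unbiased(0.5, 2, 4))
```

The odd symmetry y(1 − p) = −y(p) holds because, after compensation, the output equals P[A*_m > 0] − P[A*_m < 0] averaged over m.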

To show the impact of the bias on the neuron output behavior, we simulated a single stochastic neuron both in the biased and unbiased conditions. The resulting activation functions are shown in Fig. 1(a) for the case of a biased neuron (see Eq. (12)) with N = 2, M = 1, and in Fig. 2(a) for the case of its unbiased counterpart (see Eq. (22)). As can be seen, the presence of a bias shows up through the distortion of the output isolines (as in Fig. 1(a)), which instead should be linear as a function of the corresponding activation potential isolines (as in Fig. 2(a)).

Fig. 1(b) shows the neuron bias, computed as the difference between the biased and unbiased outputs. To quantify the relevance of its presence we analyzed the conditions under which it attains a maximum absolute value². Looking at Eq. (23) it can be observed that in the case of N even such a maximum occurs for those values of s that minimize β_m(s) = −P[A*_m(ŝ_m) = 0] for all m. This is equivalent to saying that A(s) must be zero, while the variance of the instantaneous activation potential σ²_{A*}(ŝ) is minimum (see Eq. (30) and its derivation in Subsec. III-B for an explanation of the activation potential variance behavior). The first condition imposes that the maximum absolute value belongs to the negative diagonal of Fig. 1(b) (hyperplane, in the multidimensional case); the second condition, instead, forces the maximum to the extremes of such a diagonal, where σ²_{A*}(ŝ) = 0 if the number of inputs is even. As a consequence, the bias value at these extremes is forced to be β^{max,even}(s) = −1 for all M, since P[A*_m(ŝ_m) = 0] = 1 for all m. If N is odd, instead, the bias absolute value is maximized if (N−1)/2 bits of the input vector have value +1, additional (N−1)/2 bits have value −1, and the last remaining one has value 0. In this way, the N−1 non-null bits sum to zero for the activation potential, while it is the remaining bit that determines the output behavior, as it would be in the case of a single-input neuron. Hence, in this case σ²_{A*}(ŝ) = 1 (see Eq. (30) once again). To compute the bias value we can resort to

² The bias is always negative, as shown in Eq. (23).

Eq. (25). Being N_m ≡ m and p = 1/2, we obtain:

\[ \beta_m^{max,odd} = \begin{cases} -\dbinom{m}{m/2}\, 2^{-m} & \text{if } m \text{ even} \\ 0 & \text{if } m \text{ odd} \end{cases} \tag{28} \]

which describes a bias that approaches zero for increasing m, hence M, values. In conclusion, the presence of the neuron bias can be quite relevant, especially if one considers that its maximum absolute value can be as high as 1, i.e. half the output excursion. On the other hand, from the implementation viewpoint a bias compensation can be easily obtained by detecting the A* + A*_A = 0 condition (see Eq. (21)): each time this is met, the neuron replaces the original output with a bit taken from a random sequence featuring p = 0.5.

B. The Distribution of the Activation Potential

As can be understood from Eq. (22), the actual output function depends on the distribution of the activation potential, not only on its expected value, as would be desirable according to Eq. (2). In addition, the approximation of Eq. (16) shows the dependence of the output function on both the expectation and variance of the activation potential distribution. In other words, if A* has a non-uniform distribution on an equipotential hyperplane, the output varies accordingly. Unfortunately, this is the case in all situations: any random binary stream shows a distribution which depends on the value it represents. In fact, since for any random variable v̂ the variance of its instances can be computed as σ²_v̂ = E[v̂²] − E[v̂]², for a binary bitstream x̂ with expectation μ we can write that σ²_x̂ = μ − μ², which in our representation, namely x = −1 + 2μ, returns:

\[ \sigma_{\hat{x}}^2 = \frac{1}{4}(1 - x^2) \tag{29} \]

Such a relation, parabolic in the case of a single synaptic product bitstream ŝ_n, forces the variance of the activation potential to be:

\[ \sigma_{A^*}^2 = 4 \sum_{n=1}^{N} \sigma_{\hat{s}_n}^2 = N - \sum_{n=1}^{N} s_n^2 \tag{30} \]

that is a paraboloid in W^N having its maximum value at s = 0⃗. The locus of points having fixed variance k is, thus, a circumference (a hypersphere, for N > 2) in W^N with radius |s| = √(N − k). As a consequence, under the assumption of a Gaussian distribution of A*, we can state that ∀(s_α, s_β) ∈ W^N, s_α ≠ s_β, with A(s_α) = A(s_β), it is f_{A*}(s_α) = f_{A*}(s_β) iff |s_α| = |s_β|. This means that equipotential points in W^N provide the same output if and only if they are equidistant from the center of W^N.
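Since every synaptic product has finite support, this deformation can be verified exactly by convolving A* distributions, with no Monte Carlo noise. The sketch below is ours (not the paper's procedure); it builds the exact pmf of the accumulated potential and shows that two equipotential points with different |s|, hence different variance by Eq. (30), return different outputs:

```python
def a_star_pmf(s):
    """Exact pmf of the instantaneous potential A* for synaptic products s
    (bipolar values in [-1, 1]), returned as a dict {value: probability}."""
    pmf = {0: 1.0}
    for sn in s:
        p = (1.0 + sn) / 2.0
        nxt = {}
        for v, q in pmf.items():
            nxt[v + 1] = nxt.get(v + 1, 0.0) + q * p        # bit = 1 -> +1
            nxt[v - 1] = nxt.get(v - 1, 0.0) + q * (1 - p)  # bit = 0 -> -1
        pmf = nxt
    return pmf

def convolve(f, g):
    out = {}
    for v1, q1 in f.items():
        for v2, q2 in g.items():
            out[v1 + v2] = out.get(v1 + v2, 0.0) + q1 * q2
    return out

def y_exact(s, M):
    """Unbiased expected output for arbitrary s and reset periodicity M:
    the potential accumulated over m steps is the m-fold convolution of
    the A* pmf; ties at zero contribute 1/2, as in Eq. (21)."""
    base = a_star_pmf(s)
    acc, total = {0: 1.0}, 0.0
    for _ in range(M):
        acc = convolve(acc, base)
        total += (sum(q for v, q in acc.items() if v > 0)
                  + 0.5 * acc.get(0, 0.0))
    return -1.0 + 2.0 * total / M

# Two equipotential points (A = 0.8) in W^2 at different distances from the
# center: the first has the smaller potential variance and, for A > 0, the
# larger output.
print(y_exact([0.8, 0.0], 8), y_exact([0.4, 0.4], 8))
```

For M = 1 and N = 2 the two outputs coincide (the unbiased output is linear in A, as noted for Fig. 6(a)); the difference appears as soon as M > 1.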


The resulting effect is an unavoidable deformation of the output function³, shown in Fig. 2(b) for the case N = 2 and M = 32. In fact, although equipotential (iso-A) hyperplanes have equation ds₂/ds₁ = −1, i.e. feature a linear behavior, the corresponding output isolines show the signature of the parabolic relationship of Eq. (30). The impact of this non-ideality on the behavior of the neuron is discussed in more detail in the next Section, where a deterministic function is associated to its stochastic neuron counterpart.

C. Controlling the Slope of the Neuron Output Function

As is known, the separation ability of a neuron depends on the maximum steepness of its output function slope (OFS). As already mentioned, in our case the slope of the activation function depends on the number of synaptic products, N, and on the reset periodicity, M. Furthermore, it is trivial to notice that the maximum of the OFS always lies at A = 0 (see Fig. 2(b)). However, due to the deformation effect mentioned above, the OFS on the A = 0 hyperplane is not constant, and possesses a minimum at s = 0⃗, the central point of W^N, which corresponds to the maximum of the variance of A* (see also the definition of K_E in Subsec. II-D). Defining the separation ability of the neuron through Eq. (26), i.e. as its derivative at s = 0⃗, after some manipulations we obtain:

\[ K(N, M) \triangleq \left. \frac{dy}{dA} \right|_{p=0.5}\!\!(N, M) = \frac{2}{NM} \sum_{m=1}^{M} \left( \lceil N_m/2 \rceil \prod_{n=1}^{\lceil N_m/2 \rceil} \frac{2n-1}{2n} \right) \approx \sqrt{\frac{8M}{9\pi N}} \tag{31} \]

where the final approximation, obtained using Eq. (18), holds for high values of M. Following the analysis of the previous Subsection, K can be considered a worst-case approximation of the derivative in A = 0. The behavior of K with respect to N and M is shown in Fig. 3, where both the exact values of K (continuous lines) and the approximated ones (dotted lines) were computed for N = {2, 4, 8, 16, 32} and M = {1, 2, ..., 4096}. Looking at Fig. 2(a), it can be noticed that for N = 2 (which is the minimum number of inputs, considering the threshold) the separation ability is relatively low. Thus larger M values are preferable (see Fig. 2(a), where M = 32). Since the neuron separation ability decreases with N, we can say that, on average, M ≫ 1 values are required, and thus the approximation of Eq. (31) can be widely used.

D. The Limited Range of the Weights

In a deterministic neuron, all weights can be represented in a [−h, h] interval, where h = max |w| ∈

ℝ⁺, while in the stochastic case the range is limited to W = [−1, 1] (if the bipolar representation is assumed, as it is in the present case). Consequently, the corresponding activation potentials span the intervals [−hN, hN] and [−N, N], respectively. Thus, in the stochastic case the activation potential is attenuated by a factor h with respect to the deterministic one. This evidence has a strong impact [10], and to understand its consequences we must analyze the behaviors of the erf and bipolar sigmoid functions in presence of an increased activation potential hA. If we indicate with the subscripts E and S the erf and the bipolar sigmoid cases, respectively, it can be easily noticed from Eq. (17) and Eq. (19) that:

\[ \Phi_{E,S}(hA, K_{E,S}) = \Phi_{E,S}(A, h K_{E,S}) \tag{32} \]

³ This non-ideality, although clearly identifiable under analysis of the neuron output, does not seem to be properly taken into account in the literature. In [12] there is a hint at the evidence that only linear output functions can be expressed as a function of the activation potential A only. In [16], the Authors discuss the generation of a bitstream with desired mean value, but do not mention the problem of its distribution. In [17], instead, the effect of the variation of the activation potential distribution of stochastic neurons is presented, but its impact on the neuron output is not commented upon.

meaning that an increment of the function slope K_{E,S} returns the same output that would be obtained incrementing the activation potential A by the same factor. This result allows us to compensate the limited output range of a stochastic neuron by increasing its K, whose value can be obtained using Eq. (31).

E. Output Noise Control

As mentioned, the representation of all numerical quantities in the stochastic case is affected by an inherent noise that limits the precision of the computation. Let us suppose we desire a certain precision in the bitstream representation x̂ of a generic quantity x. The variance of x, σ²_x, can be obtained from Eq. (29), which describes the variance of a single bitstream. Considering that such a variance is inversely proportional to the number of evaluations, which in our case is the length L of the bitstream, we have:

\[ \sigma_x^2 = \frac{1}{4L}(1 - x^2) \tag{33} \]

We propose to find the bitstream length L_B able to return a desired precision B for the representation of x, where B is expressed in terms of bits. According to Eq. (33), the variance depends on the represented value. Therefore, we focus on the worst condition, namely x = 0, which returns σ_x = 1/√(4L_B). If we define Δ = 2^{−B+1} as the distance between symbols that are adjacent in W, we want to guarantee that a sufficient number of represented values fall inside the interval [−Δ/2, Δ/2], where Δ is the quantization step, so as to correctly return the value x = 0. In other words, we want to enforce that:

\[ \alpha\, \sigma_x \le \frac{\Delta}{2} \tag{34} \]

where α sets an acceptable confidence interval (for example, if α = 1 we enforce that, statistically, 68% of the values fall into the [−Δ/2, Δ/2] interval). Substituting the σ_x and Δ expressions in Eq. (34) and rearranging, we obtain:

\[ L_B \ge \left( \alpha\, 2^{B-1} \right)^2 \tag{35} \]

which grounds the apparently empirical assumptions that are commonly adopted, and represents a quantitative generalization of such rules. For example, in [6] α = 1 was chosen, i.e. the bitstream length was set to L_B = 2^{2B−2}, which states that it was implicitly accepted that, in the same worst-case condition, 32% of the expectation values would fall outside the correct quantization interval. In force of Eq. (34),


instead, a better choice can be made, for example α = 2, which reduces the fraction of erroneous values to 5%.

The considerations above hold for the case of bits that are completely uncorrelated. A further analysis must be added in presence of a periodical reset. In fact the RDT, which has been proposed as a means to control the activation function slope, operates by introducing a correlation between subsequent output bits that superimposes a correlation noise on the neuron output. To compensate this effect, an additional increase of the bitstream length must be introduced. To determine a proper methodology to tailor such an increase, we conducted the analysis of the noise dependence on the input bitstream length in a worst-case condition, from which a more precise heuristics has been derived.

Let us assume we have an input bitstream of length L_B and a reset periodicity equal to M. In addition, let us (unrealistically) suppose that the neuron output function is such that all the M output substreams are totally correlated. As a consequence Eq. (11), providing the neuron response y as the average over the M substream outputs y_m, must be evaluated assuming that all y_m carry the same value. Therefore, we can state that y = ȳ, where we denoted with ȳ the generic y_m. Since ȳ is obtained through the expectation over L_B/M bits (see Subsec. II-B), in force of Eq. (33) we notice that the variance of each substream is M times higher than the variance corresponding to the case M = 1 (i.e. when the reset is performed at each evaluation step). Consequently, to recover the original variance, which is the lowest bound for a given L_B, the bitstream length must be set to L_{M,max} = c_max × L_B, where c_max ≜ c(M)_max = M is the worst-case length correction, since it has been derived in this hypothetical totally correlated case. When correlation is partial, as it is in the case of a generic RDT interval, the value for c(M) can be chosen from Eq. (21), providing the variance of the output at s = 0⃗, i.e. where the variance is maximum, and for different M values. Consequently, after making explicit the dependence of σ_y on M, we can write:

\[ c(M) = \frac{\sigma_y^2(M)}{\sigma_y^2(M=1)} \tag{36} \]

which provides the length coefficient to be used to lower the variance of the represented values to the desired level. Fig. 4 shows in a log-log plot the c_max = M (dashed line) and c(M) (solid line) behaviors⁴ in M. We can notice from the results that c(log₂ M) ≈ c_max(log₂ M − 1). Then, to compensate the correlation noise a value c(M) = c_max(M/2) = M/2 can be chosen, obtaining an approximation that is more than valid if M > 16. Finally, the corresponding minimum bitstream length can be computed:

\[ L_M = c(M)\, L_B \approx \frac{M}{2} L_B = M\, \alpha^2\, 2^{2B-3} \tag{37} \]

which can be widely used since M ≫ 1 in most cases (see Subsec. III-C). For the sake of completeness it must be noticed that L_M/M is integer by construction. Consequently, application of Eq. (37) enforces output substreams of equal lengths, a circumstance that guarantees the absence of errors as far as the computation of the output expected values is concerned (see Subsec. II-C).

⁴ Circles have been included to mark the c values that are obtained in correspondence of integer values of log₂ M.
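Eqs. (35),(37) translate directly into a sizing rule. The sketch below (with illustrative helper names of ours) reproduces, for α = 1, the commonly adopted length 2^{2B−2}:

```python
def bitstream_length(B, alpha):
    """Minimum bitstream length L_B of Eq. (35) for a precision of B bits,
    with confidence factor alpha, in the worst case x = 0."""
    return (alpha * 2 ** (B - 1)) ** 2

def rdt_bitstream_length(B, alpha, M):
    """Correlation-compensated length L_M of Eq. (37), using c(M) = M/2."""
    return M * alpha**2 * 2 ** (2 * B - 3)

# With alpha = 1, Eq. (35) gives the length 2^(2B-2) adopted in [6]; the
# RDT correction then multiplies the uncorrelated length by M/2.
print(bitstream_length(8, 1), rdt_bitstream_length(8, 2, 32))
```

Since both results are powers-of-two products, L_M/M is integer by construction, as required above.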


IV. EQUIVALENCE

Based on the results of the previous Section, we analyze here the problem of guaranteeing the equivalence between a deterministic neuron and its stochastic counterpart. We first compare the analytical output arising from Eq. (24) in the case of equiprobable inputs with those calculated using bipolar sigmoid and erf approximations featuring the same derivatives in A = 0 (this means that we are forcing K(N, M) = K_S = K_E; see Subsec. III-D). The purpose is to identify the regions where such an equivalence may lead to reasonable results, and to quantify the amount of the discrepancy.

Fig. 5 presents the comparison among the three functions described above, plotted for A ≥ 0 (the region A < 0 was omitted due to the odd symmetry of all functions), for a few different values of N, M. Rather limited values of both parameters were chosen, i.e. a critical condition was imposed, since for higher values the difference among the various behaviors was almost undetectable. Looking at the results, it is clear that both high N and high M values provide an activation potential distribution able to yield a saturating behavior. However, the erf approximation is acceptable for M = 32 regardless of N. Since, typically, M > 16, we can consider the erf an adequate approximation. The bipolar sigmoid approximation, instead, slightly diverges from the analytical behavior due to different saturation characteristics. Consequently, it should be used when computation time is critical, for example in a supervised learning procedure making use of the output function derivative, because its computation is much simpler in the bipolar sigmoid case (compare Eqs. (18),(20)). For the sake of completeness, it must be noticed that the previous analysis has been conducted in the case of equiprobable inputs, thus ignoring the deformation effect of the activation potential behavior shown in Subsec. III-B.

In order to evaluate the unavoidable bias arising both from such a deformation and from the use of the bipolar sigmoid approximation, in Fig. 6 we subtracted the outputs of Figs. 2(a), 2(b) from those obtained using two different sigmoid functions featuring K(N = 2, M = 1) = 0.5 and K(N = 2, M = 32) = 2.515, respectively. As already said, in the case M = 1, N = 2 the distribution of the activation potential is nearly uniform, returning an output function with linear behavior. This means that the output would descend towards ±1 without a clear saturation, returning a negative bias moving towards the point (−1, −1) and a positive one towards (1, 1) (see Fig. 6(a)). On the contrary, when M = 32 the activation potential distribution resembles a Gaussian, but it is affected by the strong distortion that the variance of the activation potential shows in W². Consequently, the output function behavior is more similar to that resulting from Eq. (16), which shows convex and concave behaviors for A > 0 and A < 0, respectively. The obtained bias (see Fig. 6(b)) is then maximum for A → 0 (i.e. along the negative diagonal of Fig. 6(b)) and σ → 0 (i.e. around (−1, 1) and (1, −1)), while in the central region the error is mainly due to the use of the bipolar sigmoid approximation (see also Fig. 5(b) around A = 1). In both cases the output function appears to be odd and monotone along directions orthogonal to A = 0 (i.e. along the main diagonal in Fig. 6, where a two-dimensional case is shown), which guarantees the correct separation ability of the neuron. A bias amount such as that of Fig. 6 can impact the use of online gradient descent learning procedures (e.g. back-propagation), since the output


derivative of a bipolar sigmoid function depends directly on the neuron output (see Eq. (20)). As can be seen in Fig. 6, the sign of the bias in the region of maximum error is equal to that of the output. In other words, the bias adds constructively to the bipolar sigmoid output. Consequently, since |y| ≥ |Φ(A)| and given Eq. (20), we can state that:

\[ \frac{d\Phi}{dA}(y) = K(1 - y^2) \le K(1 - \Phi(A)^2) \tag{38} \]

that is to say, the actual descent rate is lower than what is modeled, a circumstance representing a favorable damping condition for the learning procedure.

The analysis above has shown that the critical points in the (limited) W^N space are those featuring σ²_{A*} = 0 for A = 0 (see Fig. 6(b)), and that they happen to be some of the space vertices. The

conversion error is localized in critical regions around these points. Our next goal is to obtain some additional information on the extension of such critical regions with respect to N and M. In particular, we want to graphically demonstrate that the critical regions contract into the critical points at increasing N or M. To this purpose, we considered a neuron with an even number of inputs N, a reset delay M, and a fixed precision B. We evaluated the stochastic approximation error on a couple of specific points, namely sα = {−1, (1 − δB), −1, (1 − δB), . . . , −1, (1 − δB)} and sβ = {−1, (1 − 2δB), −1, (1 − 2δB), . . . , −1, (1 − 2δB)}, i.e. the points with distance dα = δB √(N/2) and dβ = δB √(2N) from the critical point sγ = {−1, 1, −1, 1, . . . , −1, 1}, where δB = 2^(−B+1) is the smallest value representable in W with precision B. We must notice that any permutation of the coordinates of each of the sα,β,γ returns the same output value, since the output function evaluation is independent of the order of presentation of the corresponding synaptic products. Consequently, our treatment holds for all critical regions of WN. Fig. 7 shows the error expressed as the absolute value of the difference between the deterministic output and the corresponding stochastic one, normalized to δB, for the case B = 7. As shown, the approximation error features a peak at varying N or M (the latter is plotted logarithmically in Fig. 7, in order to obtain a cleaner plot): comparing Figs. 7(a) and 7(b), which show respectively the error behavior in sβ (farther from sγ) and in sα (closer to sγ), it can be seen that the error peak moves towards sγ for increasing N or M. Since in each point s the approximation error is the sum of two monotonic functions, namely the output values of the deterministic and stochastic neurons, we can conclude that the error peak moves monotonically in s towards sγ, both for increasing N and increasing M. This argument implies that we are able to decrease the size of any critical region by increasing M for a given value of N, which is also the suggested choice to obtain larger output function slopes, K (see Subsec. III-C), or simply to provide a good erf approximation of the neuron output function (see Fig. 5). This result can be useful in all cases where the neuron, depending on the application, is forced by its inputs to work close to the critical vertices of WN or, more in general, far from its minimal-error portions (see again Fig. 6(b)). Apart from the considerations above on critical regions, for the purpose of designing the stochastic counterpart of a deterministic neuron we propose a simple, two-step equivalence method. Given a deterministic neuron with output function expressed by Eq. (19) or Eq. (17), featuring N inputs (including
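As a quick numerical check of the distances above, the point construction below follows the definitions of sα, sβ and sγ directly; N and B are free parameters, and only the N/2 positive coordinates of sγ are perturbed.

```python
import math

import numpy as np

def critical_distance(offset, n, b):
    """Euclidean distance from the critical point s_gamma = {-1, 1, ..., -1, 1}
    of the point obtained by lowering each +1 coordinate by offset * delta_B,
    where delta_B = 2**(-b + 1) is the smallest representable step."""
    delta_b = 2.0 ** (-b + 1)
    s_gamma = np.tile([-1.0, 1.0], n // 2)
    s = s_gamma.copy()
    s[1::2] -= offset * delta_b  # perturb the N/2 positive coordinates
    return float(np.linalg.norm(s - s_gamma))

# For N = 8, B = 7: d_alpha = delta_B * sqrt(N/2), d_beta = delta_B * sqrt(2N)
d_alpha = critical_distance(1, 8, 7)
d_beta = critical_distance(2, 8, 7)
```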

July 22, 2017

DRAFT


threshold), a slope around A = 0 of value KS,E, and a maximum weight |h|, its equivalent stochastic counterpart can be determined by means of the following sequence:
1) Set the shape of the output function by finding from Eq. (31) the value of M returning K = hKS,E;
2) Determine from Eq. (37) the proper bitstream length LM returning the desired precision of the data representation.
This result allowed us to implement an automatic conversion procedure for trained deterministic neural networks, whose simulated characterization is presented in the following Section.

V. SIMULATIONS

In this Section we show the results obtained by applying the equivalence method just described to the case of a deterministic feed-forward, fully connected neural network (DFFNN). This topology was chosen for its simplicity and, at the same time, its wide use in the specific literature. We used two problems as application examples. For each of them we followed a three-step procedure, fully automated and without manual tuning:
1) Network training by means of a floating point, off-line minimization;
2) Conversion of the obtained synaptic weight values into a fixed point representation with the desired precision;
3) Conversion of each deterministic neuron into its stochastic counterpart.
The first application test was the replication of a template monochromatic figure given the X-Y coordinates of its black and white points. Two different white shapes inside a black background were used, namely a circle and a triangle (see Fig. 8). The purpose of the problem was to visually compare the results obtained from a deterministic network and from its stochastic equivalent. In more detail, we first trained two identical DFFNNs (one for the triangle, the other for the circle) using an off-line, back-propagation algorithm with floating point variables.
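The two-step sequence described above can be sketched in code. Since Eqs. (31) and (37) are not reproduced in this excerpt, `slope_eq31` and `length_eq37` are hypothetical placeholders standing in for those expressions; the only property the sketch relies on is that K grows monotonically with M (cf. Fig. 3).

```python
def stochastic_equivalent(k_se, h, b, slope_eq31, length_eq37, m_max=2 ** 20):
    """Two-step deterministic-to-stochastic equivalence sketch.

    k_se        -- slope of the deterministic output function around A = 0
    h           -- maximum absolute synaptic weight
    b           -- desired data precision (bits)
    slope_eq31  -- placeholder for Eq. (31): K as a function of M
    length_eq37 -- placeholder for Eq. (37): bitstream length L_M for b bits
    """
    target_k = h * k_se
    # Step 1: smallest reset periodicity M whose slope reaches K = h * K_SE,
    # assuming slope_eq31 is monotonically increasing in M.
    m = 1
    while slope_eq31(m) < target_k:
        m += 1
        if m > m_max:
            raise ValueError("target slope not reachable")
    # Step 2: bitstream length giving the desired data precision.
    return m, length_eq37(m, b)

# Toy placeholders, for illustration only (NOT the paper's actual equations):
demo_slope = lambda m: 0.25 * m ** 0.5
demo_length = lambda m, b: m * 2 ** b

m, l_m = stochastic_equivalent(1.0, 1.0, 7, demo_slope, demo_length)
```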
The networks featured two inputs (the (x, y) coordinates of each figure in W2), five neurons in the first hidden layer, two neurons in the second one, and one output neuron; all neurons possessed the same separation ability, K = 1. So defined, each resulting network had a total of 30 synapses including threshold synapses6, the latter connected to always-active inputs. We chose a 3-layer topology to set up a rather complex situation, stressing the impact of small deviations from ideality on the network sensitivity. The deterministic networks were iteratively trained to reproduce with minimum error a training set comprising NP = 100 input/output couples, {I, T}, iterated for NE = 5000 training epochs; the inputs I were taken from 10 × 10 grids, a point density considered representative of the figure values in W2 (see dots in Fig. 8).
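A sketch of how such a training set can be built follows. The circle geometry (centered at the origin, radius 0.5) is a hypothetical choice made only for illustration, since Fig. 8 is not reproduced here; the grid density and bipolar targets match the text.

```python
import numpy as np

def circle_training_set(n=10, radius=0.5):
    """N_P = n*n input/output couples {I, T} on an n x n grid over
    W^2 = [-1, 1]^2. T = +1 for points inside the white shape, -1 for the
    black background (targets in {-1, 1}, as in the text). The center and
    radius are hypothetical values, used only for illustration."""
    xs = np.linspace(-1.0, 1.0, n)
    inputs = np.array([(x, y) for x in xs for y in xs])
    targets = np.where((inputs ** 2).sum(axis=1) <= radius ** 2, 1.0, -1.0)
    return inputs, targets

I, T = circle_training_set()  # 100 couples, matching N_P = 100 in the text
```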

The error function of the minimization algorithm can be written as:

$$E = \frac{1}{N_O N_P} \sum_{p=1}^{N_P} \sum_{n=1}^{N_O} \left( T_{n,p} - y_n(I_p) \right)^2 \qquad (39)$$

6 Given the network topology and its input/neuron composition, and including among the inputs the threshold one, the synapse count is as follows: (2 + 1) × 5 + (5 + 1) × 2 + (2 + 1) × 1 = 30.
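The synapse count quoted above generalizes to any fully connected topology in which every layer adds one threshold ("always-active") synapse per neuron; a minimal sketch:

```python
def synapse_count(layer_sizes):
    """Total synapses of a fully connected feed-forward network where each
    neuron receives one extra threshold synapse (the +1 term below)."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# First problem: 2 inputs, hidden layers of 5 and 2 neurons, 1 output neuron:
first = synapse_count([2, 5, 2, 1])    # (2+1)*5 + (5+1)*2 + (2+1)*1 = 30
# Second problem (plane derivatives): 49 inputs, 14 hidden, 2 outputs:
second = synapse_count([49, 14, 2])    # (49+1)*14 + (14+1)*2 = 730
```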


i.e. as the mean square error of the difference between the neuron output yn and its expected value Tn (here NO = 1, since we are in presence of a single output neuron). Correspondingly, the error variance reads:

$$\sigma^2 = \frac{1}{N_O N_P} \sum_{p=1}^{N_P} \sum_{n=1}^{N_O} \left[ \left( T_{n,p} - y_n(I_p) \right)^2 \right]^2 - E^2 \qquad (40)$$

After the minimization of E, the obtained set of floating point weights was converted into a fixed point representation. To do so, after the training procedure we repeated the network execution on a given input set using different data precisions, i.e. starting from Bmax = 16 bits down to Bmin = 2 bits. The adopted precision B was chosen as the lowest one respecting the bound:

$$E + \alpha\sigma \le \left( \frac{\Delta}{2} \right)^2 \qquad (41)$$

where α = 3 and ∆ = 1, roughly meaning that we wanted to enforce 99% of the high-valued outputs above yH,min = 0.5, and 99% of the low-valued outputs below yL,max = −0.5 (recall that T ∈ {−1, 1} in our case). Finally, each weight was mapped onto W, and the corresponding stochastic neurons were designed using the equivalence scheme described in the previous Subsection. Simulation results computed over the whole W2 are shown in Fig. 9. As can be seen, the stochastic network shows output values very similar to those obtained from its deterministic counterpart, qualitatively identical in the case of the triangular shape. The stochastic network performs worst in the case of the circle. This can be ascribed to the variance deformation effect described in Subsec. III-B, which induces a non-ideal neuron separation with a stronger signature on curved shapes. In order to obtain additional information on the effect of the conversion errors, we introduced a second problem: provided as inputs a sample of the values of x3 = d1 x1 + d2 x2, i.e. points lying on a plane in W3, to replicate its derivative values d1 = dx3/dx1 and d2 = dx3/dx2. Since each neuron input can span any value in W, the problem requires a large degree of precision. Furthermore, the input function is linear, a critical condition when dealing with stochastic neurons, which show better ability when approximating highly non-linear activation functions (see, for example, Fig. 5). In particular, the inputs of a specific plane were the values of x3 taken on a 7 × 7 grid of points in the {x1, x2} space, that is to say x1,2 = {−1 + 2 i/7 : i = 1, . . . , 7}. The training set comprised 9 different planes, resulting from all the possible choices of d1,2 = {−0.5, 0, 0.5}. The simulation was performed for the precision values B = {2, . . . , 8} and for the activation function slope values K = {0.5, 1, 2}.
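The precision-selection loop of Eqs. (39)-(41) can be sketched as follows. Here `run_network` is a stand-in for re-executing the trained network at a given fixed-point precision B, since the actual network code is not shown in this excerpt.

```python
import numpy as np

def error_stats(targets, outputs):
    """E (Eq. (39)) and sigma^2 (Eq. (40)) over all patterns and outputs."""
    sq = (np.asarray(targets) - np.asarray(outputs)) ** 2
    e = sq.mean()
    return e, (sq ** 2).mean() - e ** 2

def select_precision(run_network, targets, b_max=16, b_min=2,
                     alpha=3.0, delta=1.0):
    """Lowest B (scanning from b_max down to b_min) respecting Eq. (41):
    E + alpha * sigma <= (delta / 2)**2."""
    chosen = None
    for b in range(b_max, b_min - 1, -1):
        e, var = error_stats(targets, run_network(b))
        if e + alpha * np.sqrt(var) <= (delta / 2.0) ** 2:
            chosen = b  # keep descending: we want the lowest feasible B
    return chosen
```

With α = 3 the bound keeps roughly 99% of the outputs within ∆/2 of their ±1 targets, matching the thresholds yH,min = 0.5 and yL,max = −0.5 quoted above.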
The inputs were provided both to the deterministic NN and to its stochastic equivalent. Both NNs were characterized by 7 × 7 = 49 inputs, 14 neurons in the hidden layer, and 2 neurons in the output layer, for a total of 730 synapses (threshold synapses included). Fig. 10 shows the resulting error plotted against B, both for the deterministic (solid line) and stochastic (dotted line) cases, and for different values of K. We can see that the mean stochastic error roughly follows the deterministic one, at least for relatively low values of B and as K increases. The tendency in K was expected, because neuron activation functions show a more saturating behavior for high K values, so that


the stochastic approximation is more appropriate (see Fig. 7 and the related discussion in Sec. IV). On the other hand, the deterministic error increases with K, since highly non-linear activation functions are not optimal for the solution of a linear problem such as the one considered here. As far as B is concerned, it must be noticed that beyond a certain precision the stochastic error cannot decrease further, because it reaches the lower limit imposed by the conversion non-idealities. Given these considerations, and since in this case the learning error bound was taken as Emax = 1/16 (dashed line in Fig. 10) so as to guarantee a proper discrimination between the derivative values, the stochastic networks for K = 0.5 and K = 1 failed the learning condition for any choice of B. As a final comment, we must point out that the second experiment was specifically tailored to show the limits of a stochastic neuron. In fact, different simulations that we performed, involving digital input and output variables (i.e. signal relationships that require digital transfer functions to be computed, and thus are suitable for highly saturating neuron activation functions), provided stochastic outputs indistinguishable from the deterministic ones. In other words, in light of these considerations we can state that the deterministic-to-stochastic conversion returns optimal results when applied to the class of digital problems.

VI. IMPLEMENTATION

In this Section we briefly comment on the implementation aspects of the proposed neuron. To this purpose, Fig. 11 provides a simplified RTL scheme of one of its possible realizations inside an FPGA. Specifically, we focus here on the implementation of the output function, since binary-to-bitstream and bitstream-to-binary conversions are already well covered by the literature (see, for example, [6], [18]). Control signals unessential to the discussion are not reported. The scheme depicts the operations performed on bitstreams during the generic i-th evaluation step.
In this regard, from now on we omit the index i to simplify the notation. As can be seen, the N − 1 actual neuron inputs x̂ are multiplied by the corresponding synaptic weights ŵ by means of XNOR operators, while the threshold ŵN is directly passed to the next stage, as its input is implicitly set to a constant "1". Then, the synaptic products ŝ are transformed into a serial bitstream ŝser by the multiplexer, becoming a control signal acting as the U/D̄ signal of an UP/DOWN counter. This means that the counter progressively accumulates the value of the activation potential, and thus must be sized as H = ⌈log2(2MN + 1)⌉ bits, where MN is the maximum absolute value of the activation potential given M evaluations of N inputs, the factor of 2 accounts for its sign, and the additional unit is needed to represent the zero value.
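The counter sizing can be checked directly; a one-line sketch of H = ⌈log2(2MN + 1)⌉, with the operand range spelled out in the comment:

```python
import math

def counter_width(n, m):
    """Bits needed by the UP/DOWN counter: the activation potential after M
    evaluations of N inputs lies in [-M*N, +M*N], i.e. 2*M*N + 1 states."""
    return math.ceil(math.log2(2 * m * n + 1))

width = counter_width(8, 32)  # 2*32*8 + 1 = 513 states -> 10 bits
```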

Every M evaluation steps the counter is reset to the value C(rst) = 011...1. Consequently, the MSB of the counter status C provides the condition "activation potential > 0". However, as previously discussed in Subsec. III-A, such a bit returns a biased output, ŷbia. Its compensation requires the introduction of an additional option, i.e. a bitstream ŷ0.5 featuring p = 0.5, selected as output when C = C(rst). The multiplexer selection input is provided by the H-input NAND performing the check on C.
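The XNOR-based synaptic multiplication described above relies on the standard bipolar stochastic encoding, in which a value v ∈ [−1, 1] is carried by a bitstream whose ones probability is p = (v + 1)/2. A minimal simulation sketch follows; the bitstream length and sample values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(value, length):
    """Bipolar encoding: each bit is 1 with probability p = (value + 1) / 2."""
    return rng.random(length) < (value + 1.0) / 2.0

def decode(bits):
    """Inverse mapping: value = 2 * p - 1."""
    return 2.0 * bits.mean() - 1.0

# The XNOR of two independent bipolar bitstreams encodes the product of the
# values they carry, which is what the synaptic multipliers exploit:
x, w = 0.5, -0.8
product = decode(~(encode(x, 100_000) ^ encode(w, 100_000)))  # approx -0.4
```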


As we can see, the required neuron logic is very limited, scaling linearly with N in the "input" stage and logarithmically with MN in the "output" stage.

VII. CONCLUSIONS

In this paper we have investigated the problem of the equivalence between deterministic and stochastic neurons. All critical issues, namely the limited slope of the activation function, the limited weight range, the output noise, and the neuron output bias, have been identified and quantitatively addressed, apart from the inherent space-dependence of the activation potential distribution, which was investigated mainly on a qualitative basis. The latter issue, however, has been shown to be limited to specific regions of the neuron synaptic input space, which can be arbitrarily reduced by increasing the slope of the activation function. A simple two-step equivalence methodology between deterministic and stochastic neurons has been proposed, which allows optimal control of the steepness and noise of the resulting stochastic neuron. Simulation results confirmed that good quality results can be achieved, especially for digital problems, supporting the proposed equivalence methodology as a valuable strategy for the proper design of stochastic neural networks.


REFERENCES
[1] H. Hikawa, "A new digital pulse-mode neuron with adjustable activation function," IEEE Trans. Neural Networks, vol. 14, p. 236, 2003.
[2] ——, "A digital hardware pulse-mode neuron with piecewise linear activation function," IEEE Trans. Neural Networks, vol. 14, p. 1028, 2003.
[3] N. Nedjah and L. Mourelle, "FPGA-based hardware architecture for neural networks: binary radix vs. stochastic," in Proc. SBCCI '03, 2003, p. 111.
[4] S. Bade and B. Hutchings, "FPGA-based stochastic neural networks - implementation," in Proc. IEEE FPGAs for Custom Computing Machines, 1994, p. 189.
[5] M. Martincigh and A. Abramo, "A new architecture for digital stochastic pulse-mode neurons based on the voting circuit," IEEE Trans. Neural Networks, vol. 16, p. 1685, 2005.
[6] M. van Daalen, P. Jeavons, and J. Shawe-Taylor, "A stochastic neural architecture that exploits dynamically reconfigurable FPGAs," in Proc. IEEE Workshop on FPGAs for Custom Computing Machines, 1993, p. 202.
[7] M. Pearson, A. Pipe, B. Mitchinson, K. Gurney, C. Melhuish, I. Gilhespy, and M. Nibouche, "Implementing spiking neural networks for real-time signal-processing and control applications: a model-validated FPGA approach," IEEE Trans. Neural Networks, vol. 18, p. 1472, 2007.
[8] N. Patel, S. Nguang, and G. Coghill, "Neural network implementation using bit streams," IEEE Trans. Neural Networks, vol. 18, p. 1488, 2007.
[9] A. Nadas, "Binary classification by stochastic neural nets," IEEE Trans. Neural Networks, vol. 6, p. 488, 1995.
[10] Y. Kondo and Y. Sawada, "Functional abilities of a stochastic logic neural network," IEEE Trans. Neural Networks, vol. 3, p. 434, 1992.
[11] N. Nedjah and L. de Macedo Mourelle, "Stochastic reconfigurable hardware for neural networks," in Proc. Euromicro Symposium on Digital System Design, 2003, p. 438.
[12] P. Burge, M. van Daalen, B. Rising, and J. Shawe-Taylor, "Stochastic bit-stream neural networks," in Pulsed Neural Networks, W. Maass and C. Bishop, Eds. Cambridge: MIT Press, 1999, p. 337.
[13] S. Ghilezan, J. Pantovic, and J. Zunic, "Separating points by parallel hyperplanes - characterization problem," IEEE Trans. Neural Networks, vol. 18, p. 1356, 2007.
[14] J. Freixas and X. Molinero, "The greatest allowed relative error in weights and threshold of strict separating systems," IEEE Trans. Neural Networks, vol. 19, p. 770, 2008.
[15] M. van Daalen, J. Zhao, and J. Shawe-Taylor, "Real time output derivatives for on chip learning using digital stochastic bit stream neurons," Electronics Letters, vol. 30, p. 1775, 1994.
[16] P. Jeavons, D. Cohen, and J. Shawe-Taylor, "Generating binary sequences for stochastic computing," IEEE Trans. Information Theory, vol. 40, p. 716, 1994.
[17] J. Zhao, J. Shawe-Taylor, and M. van Daalen, "Learning in stochastic bit stream neural networks," Neural Networks, vol. 9, p. 991, 1996.
[18] B. Brown and H. Card, "Stochastic neural computation I: computational elements," IEEE Trans. Computers, vol. 50, p. 891, 2001.


FIGURE CAPTIONS
Figure 1: Contour plot of the output of a neuron (a) and corresponding bias (b). (N = 2, M = 1)
Figure 2: Contour plot of the output of a 2-input unbiased neuron featuring M = 1 (a) and M = 32 (b).
Figure 3: Separation ability, K, of a stochastic neuron plotted at varying N and M, both in the analytical (continuous line) and approximated (dotted line) cases.
Figure 4: Bitstream length correction factor, c, and its maximum value, cmax, expressed as functions of the reset periodicity M.
Figure 5: Analytical neuron output function (solid line) compared to its erf (dotted line) and bipolar sigmoid (dashed line) approximations.
Figure 6: Bias of the bipolar sigmoid approximation of a 2-input neuron featuring M = 1 (a) and M = 32 (b).
Figure 7: Bipolar sigmoid approximation error for two points positioned farther (a) and closer (b) to a critical point. The plot was obtained as a function of the reset periodicity, M, and varying the number of synaptic inputs, N.
Figure 8: The shape replication problem: back-propagation learning sets are indicated by dots.
Figure 9: Neural replication after back-propagation learning. Deterministic (a), (c) and stochastic (b), (d) results are compared. Data representation was set to B = 5 bits.
Figure 10: Execution error, E, computed after solving the plane-derivatives problem, both for the deterministic ("D", solid lines) and stochastic ("S", dotted lines) cases. Plots are obtained at varying precision, B, and for different values of the neuron activation function steepness, K. The error bound for a successful learning procedure is also shown (dashed line).
Figure 11: Simplified RTL neuron scheme.


[Fig. 1: (a) Biased output; (b) Bias]
[Fig. 2: (a) Unbiased output, M = 1; (b) Unbiased output, M = 32]
[Fig. 3: Separation ability log2 K vs. log2 M, at varying N]
[Fig. 4: Correction factors c and cmax vs. log2 M]
[Fig. 5: (a) N = 2, M = 1; (b) N = 2, M = 32; (c) N = 8, M = 1; (d) N = 8, M = 32]
[Fig. 6: (a) M = 1; (b) M = 32]
[Fig. 7: (a) Farther point; (b) Closer point]
[Fig. 8: (a) Triangle; (b) Circle]
[Fig. 9: (a) Triangle, deterministic; (b) Triangle, stochastic; (c) Circle, deterministic; (d) Circle, stochastic]
[Fig. 10: Error E vs. B for the deterministic and stochastic cases at K = 0.5, 1, 2, with the learning bound]
[Fig. 11: Simplified RTL neuron scheme]