In J. W, J. Yang, W. Gao and Y. Li (eds.): Young Computer Scientists. Tsinghua University Press, Beijing, China, July 1993.

ENHANCED THRESHOLD GATE FAN-IN REDUCTION ALGORITHMS①

Valeriu Beiu†,②, Jan Peperstraete†, and Rudy Lauwereins†,③

† Katholieke Universiteit Leuven, Department of Electrical Engineering, ESAT–ACCA, Kardinaal Mercierlaan 94, B-3001 Heverlee, Belgium. E-mail: [email protected]

ABSTRACT

The paper describes and improves on a Boolean neural network (NN) fan-in reduction algorithm, with a view to possible VLSI implementation of NNs using threshold gates (TGs). Constructive proofs are given for: (i) at least halving the size; (ii) reducing the depth from O(N) to O(log2 N). Finally, a fresh algorithm which reduces the size to polynomial is suggested.

1. INTRODUCTION

The paper is the result of ongoing work at KULeuven focused on reducing the complexity of Boolean NNs, with a view to their efficient VLSI implementation using TGs. "Reducing a NN" means that, after applying such an algorithm to an "input NN", the "output NN" will be "simpler" with respect to: (i) the fan-in of the neurons[6,7,15,19]; (ii) the precision of the weights[2–4,8]; and (iii) the approximation of the sigmoid output function[5]. We will discuss two algorithms for fan-in reduction. This particular problem is of great importance, as it shows one way to deal with high fan-in artificial neurons (high connectivity). When a NN is simulated (execution phase) or trained (learning phase), such aspects do not count. But for the VLSI designers who try to map the resulting NNs in silicon, this high connectivity is usually an obstacle.

There are two main trends for hardware implementation of neural networks: (i) analog, and (ii) digital[14,16]. This paper deals with TGs, which borrow from both of them, having binary inputs but analog summation. They are a challenging alternative to the classical Boolean solution due to a solid theoretical background from the 60s[17,23,24,26], new interest proven by many articles from the late 80s[27] and 90s[9–13,30–35], as well as proposals of implementation[5,19]. In section 2 we will present a divide and conquer algorithm for reducing the fan-in of symmetric functions (of any function, if we are allowed to repeat each input variable), and will improve on its efficiency in section 3, by showing how we can reduce even more the size and the depth of the output NN. Mathematical proofs and simulation results support the claims. Section 4 will briefly present the basic ideas of an even more efficient algorithm with respect to size. Some conclusions end the paper in section 5.

2. A DIVIDE AND CONQUER ALGORITHM

The basic ideas of the algorithm, introduced by the authors[6,7], are the division of the input variables in a first layer, and the joining of the results in two subsequent layers. The reduction algorithm has been designed for majority functions. As has been shown[24], any symmetric function can be built using only majority functions (see Fig.1). But any Boolean function (BF) can be considered as a symmetric function by repeating its input variables[1,35]. It is also known that LT_1 ⊂ MAJ_3[1,32], and also that LT^_k is equivalent with MAJ_k[32]. Here LT_1 is the class of BFs computed by linear TGs (LTGs) with arbitrary real weights; LT^_1 is the class of BFs computed by LTGs with weights bounded by a polynomial in the number of inputs (|w_i| ≤ N^c)[11]; LT^_k is the class of BFs computed by a polynomial size depth-k circuit of LT^_1 gates (the depth being the number of gates on the longest input–output path)[11,30]; MAJ_1 is the class of BFs computed by LTGs having ±1 weights (these compute functions analogous to MAJORITY gates)[21,31]; and MAJ_k is the class of BFs computed by a polynomial size depth-k circuit of MAJ_1 gates[22,32]. It should also be mentioned that more efficient constructions for symmetric functions than the one in Fig.1 are known[23,35]. Unfortunately they are not made only of majority functions.

Suppose that we have an N = 2^k fan-in neuron and accept only neurons with fan-in n ≤ 2^l (obviously n < N). The recursive equations (Fig.2) are[6,7]:

  N_G(k,1) = 2^{k-1} + (1 + 2^k) · N_G(k-1,1),    (1)
  N_L(k+1,l) = ⌈N_L(k,l)/2⌉ · 3 + ⌊N_L(k,l)/2⌋,    (2)

where N_G(k,l) is the size (number of gates) and N_L(k,l) is the depth (number of layers). Solving them, we obtain (we use lg instead of log2, and ln for log_e):

  N_G(k,1) = Σ_{i=1}^{k} 2^{i-1} · Π_{j=i+1}^{k} (1 + 2^j) = O(N^{lg N}),    (3)
  N_L(k,l) = 2 · 2^{k-l} - 1 = O(N/n).    (4)
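To make the recurrences concrete, the following short Python sketch (not part of the original derivation) evaluates eqs. (1), (3) and (4); the base case N_G(1,1) = 1 (a single 2-input gate) is an assumption of the sketch, chosen so that the values agree with Fig. 3 (67 gates, 7 layers for N = 8, n = 2).

```python
# Minimal sketch: the basic divide and conquer decomposition, eqs. (1), (3), (4).
# N = 2**k inputs, allowed fan-in n = 2**l; N_G = size (gates), N_L = depth (layers).

def NG_recurrence(k):
    """Size N_G(k,1) from eq. (1); base case N_G(1,1) = 1 is assumed."""
    if k == 1:
        return 1
    return 2**(k - 1) + (1 + 2**k) * NG_recurrence(k - 1)

def NG_closed(k):
    """Size N_G(k,1) from the closed form, eq. (3)."""
    total = 0
    for i in range(1, k + 1):
        prod = 1
        for j in range(i + 1, k + 1):
            prod *= 1 + 2**j
        total += 2**(i - 1) * prod
    return total

def NL_closed(k, l):
    """Depth N_L(k,l) from eq. (4)."""
    return 2 * 2**(k - l) - 1

if __name__ == "__main__":
    # N = 8, n = 2 (k = 3, l = 1): Fig. 3 reports Size = 67, Depth = 7.
    assert NG_recurrence(3) == NG_closed(3) == 67
    assert NL_closed(3, 1) == 7
```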

Figure 1. Classical two layered structure for computing any symmetric function with maximum N + 1 majority functions[24] (m ≤ N): first-layer gates 1 ... m receive the inputs x_1 ... x_N, and the output gate F combines them with + and − weights.
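A minimal sketch of this classical two-layer structure follows (one standard realization, not necessarily the exact circuit of [24]): the first-layer gates use unit weights and different thresholds (majority-like gates), and the output gate combines them with +1/−1 weights.

```python
# Sketch of the classical two-layer threshold-gate structure of Fig. 1 for an
# arbitrary symmetric function F: F(x) = 1 iff sum(x) lies in a given set of
# accepted counts. At most N first-layer gates plus the output gate are used.

def threshold_gate(bits, weights, threshold):
    return int(sum(w * b for w, b in zip(weights, bits)) >= threshold)

def two_layer_symmetric(bits, accepted_counts):
    N = len(bits)
    first_layer, signs = [], []
    out_threshold = 1
    c = 0
    while c <= N:
        if c in accepted_counts:
            a = c
            while c in accepted_counts:          # scan one maximal run [a, b] of accepted sums
                c += 1
            b = c - 1
            if a == 0:
                out_threshold -= 1               # "sum >= 0" is constant, fold it into the output
            else:
                first_layer.append(threshold_gate(bits, [1] * N, a))   # fires iff sum >= a
                signs.append(+1)
            if b < N:
                first_layer.append(threshold_gate(bits, [1] * N, b + 1))
                signs.append(-1)
        else:
            c += 1
    # output gate: +1/-1 weights over the first-layer outputs
    return threshold_gate(first_layer, signs, out_threshold)

if __name__ == "__main__":
    from itertools import product
    for x in product([0, 1], repeat=5):
        assert two_layer_symmetric(list(x), {1, 3, 5}) == sum(x) % 2       # odd parity
        assert two_layer_symmetric(list(x), {0, 2, 4}) == 1 - sum(x) % 2   # even parity
```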

① Research partly carried out under the Belgian Concerted Action Project "Applicable Neural Networks."
② On leave of absence from the "Politehnica" University of Bucharest, Department of Computer Science, Spl. Independentei 313, 77206 Bucharest, România.
③ Senior Research Assistant of the Belgian Fund for Scientific Research.


Figure 2. Division of the input variables (x_1 ... x_8) into two groups in a first layer, and "joining" of the intermediate results in the second layer (N = 8, n = 4).

Figure 3. Tree structure obtained after applying the algorithm twice (N = 8, n = 2); no enhancement. Depth = 7, Size = 67.

In the general case (Fig.3) the size is[6,7]:

  N_G(k,l) = [N_G(k,1) − N_{G2}(k,l)] / N_G(l,1) + N_{G2}(k,l)
           = Π_{i=l+1}^{k} (1 + 2^i) + Σ_{i=l+1}^{k} 2^{i-1} · Π_{j=i+1}^{k} (1 + 2^j) = O(N^{lg N} / n^{lg n}),    (5)

where N_{G2}(k,l) denotes the summation term (the gates of the "joining" layers). This size is superpolynomial with respect to N (n has been assumed to be constant). It compares well with classical Boolean decomposition, which has exponential growth. With respect to depth the proposed algorithm is linear, like Boolean decomposition, but the multiplying constant is lower (1/2).
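A short numerical check of eq. (5) (a sketch, under the reading of the formula given above) reproduces several entries of the "proposed algorithm" column of Table 1:

```python
# Sketch checking eq. (5) (general decomposition, N = 2**k inputs, fan-in n = 2**l)
# against entries of the "proposed algorithm" column of Table 1.

def prod_1p2(lo, hi):
    """Product of (1 + 2**j) for j = lo .. hi (empty product = 1)."""
    p = 1
    for j in range(lo, hi + 1):
        p *= 1 + 2**j
    return p

def NG_general(k, l):
    """Size N_G(k, l) from eq. (5)."""
    joining = sum(2**(i - 1) * prod_1p2(i + 1, k) for i in range(l + 1, k + 1))
    return prod_1p2(l + 1, k) + joining

if __name__ == "__main__":
    assert NG_general(3, 1) == 67        # N = 8,  n = 2
    assert NG_general(4, 2) == 229       # N = 16, n = 4
    assert NG_general(5, 3) == 841       # N = 32, n = 8
    assert NG_general(6, 3) == 54697     # N = 64, n = 8
```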

3. IMPROVING THE ALGORITHM

We have improved on these results in several successive steps. First we prove a very tight bound for the size (eq.3), to be used later. As the decomposition process used to obtain eq.1 was "uniform", treating all gates in an equal manner, an immediate improvement was to simplify the OR-gate decomposition. More sophisticated improvements can be realized if one looks carefully at the way the subfunctions are generated; thus some unused gates can be removed. A final improvement step has been to delete the gates performing the same functions.

3.1 A very tight bound

Starting from eq.3 we can rewrite the sum of products as:

  N_G(k,1) = (1 + 2^2)(1 + 2^3) ··· (1 + 2^k) × [2^0 + 2^1/(1 + 2^2) + ··· + 2^{k-1}/((1 + 2^2)(1 + 2^3) ··· (1 + 2^k))],    (6)

and use a truncated Taylor series expansion for ln(1 + 2^i) around 2^i:

  1 + 2^i < 2^{i + 1/(2^i · ln 2)},    (7)

so being able to prove:

  N_G(k,1) < 2^{k(k+1)/2 + (2^{k-1} - 1)/(2^k · ln 2) - 0.2846} < 2^{k(k+1)/2 + 0.4365} = 2^{0.4365} · √(N^{k+1}).    (8)
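The bound of eq. (8) can be checked numerically against the exact size of eq. (3); the sketch below simply assumes the constant 0.4365 as it appears in eq. (8).

```python
# Numerical check of the bound in eq. (8): N_G(k,1) < 2**(k*(k+1)/2 + 0.4365)
# = 2**0.4365 * sqrt(N**(k+1)), with N = 2**k and N_G(k,1) the exact size of eq. (3).

def NG_exact(k):
    total = 0
    for i in range(1, k + 1):
        p = 1
        for j in range(i + 1, k + 1):
            p *= 1 + 2**j
        total += 2**(i - 1) * p
    return total

if __name__ == "__main__":
    for k in range(2, 11):
        bound = 2 ** (k * (k + 1) / 2 + 0.4365)
        assert NG_exact(k) < bound   # the bound holds, with only a small constant-factor gap
```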

3.2 Simplifying decomposition

Using a partial binary tree decomposition for the OR-gates (Fig.4) leads to:

  N_G(k,1) = 2^k - 1 + 2^k · N_G(k-1,1),    (9)
  N_L(k+1,l) = N_L(k,l) + 2 + k - l,    (10)

instead of eq.1 and eq.2. Solving them we obtain:

  N_G(k,1) = 2^{k(k+1)/2} - 1 < 2^{k(k+1)/2} = √(N^{k+1}),    (11)
  N_L(k,l) = [(k-l)^2 + 3(k-l)]/2 + 1 = (1/2) lg^2(N/n) + (3/2) lg(N/n) + 1 = O(lg^2(N/n)),    (12)

having the proof that the size is reduced by 2^{0.4365} (eq.11 and eq.8), and that the depth decreases from linear to squared logarithmic (eq.4 and eq.12).
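A minimal sketch checking that the recurrences (9)-(10) indeed solve to the closed forms (11)-(12); the base cases N_G(1,1) = 1 and N_L(l,l) = 1 are assumptions of the sketch, not stated explicitly in the text.

```python
# Sketch: recurrences (9)-(10) of the simplified OR-gate decomposition and
# their closed forms (11)-(12). Base cases N_G(1,1) = 1 and N_L(l,l) = 1 assumed.

def NG_simplified(k):
    return 1 if k == 1 else 2**k - 1 + 2**k * NG_simplified(k - 1)      # eq. (9)

def NL_simplified(k, l):
    return 1 if k == l else NL_simplified(k - 1, l) + 2 + (k - 1) - l   # eq. (10)

if __name__ == "__main__":
    for k in range(2, 8):
        assert NG_simplified(k) == 2**(k * (k + 1) // 2) - 1            # eq. (11)
        for l in range(1, k):
            d = k - l
            assert NL_simplified(k, l) == (d * d + 3 * d) // 2 + 1      # eq. (12)
    # Fig. 4 (N = 8, n = 2): Size = 63, Depth = 6; Table 2 (N = 2**10, n = 2): 55 layers
    assert NG_simplified(3) == 63 and NL_simplified(3, 1) == 6 and NL_simplified(10, 1) == 55
```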

3.3 Removing unused gates

But not all the gates are used (the unused gates are the tinted circles in Fig.5)! As we compute twice all the possible sums of 1s, the number of gates we need is:

  3 · 2^{2i-3} + 2^{i+1},    (13)

instead of:

  2^i · (1 + 2^{i-2} + 2^{i-1}) = 2^{2i-1} + 2^{2i-2} + 2^i.    (14)

When i → ∞ this reduces the size by 2 (the minimum reduction being 8/7).

3.4 Deleting redundant gates

If now, in the first layer, we keep only one gate for each particular function (see Fig.6), we will have:

  2^{2i-3} + 2^{i+1},    (15)

instead of eq.13. This reduces the size by 3 (or 5/4 minimum).
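The gate-count reductions quoted above follow directly from the closed forms of eqs. (13)-(15); a small sketch:

```python
# Sketch of the per-group gate counts in eqs. (13)-(15), checking the reduction
# factors quoted in the text: eq.(14)/eq.(13) grows towards 2 (8/7 at its minimum),
# and eq.(13)/eq.(15) grows towards 3.

def gates_uniform(i):            # eq. (14): uniform decomposition
    return 2**(2*i - 1) + 2**(2*i - 2) + 2**i

def gates_unused_removed(i):     # eq. (13): after removing unused gates
    return 3 * 2**(2*i - 3) + 2**(i + 1)

def gates_redundant_deleted(i):  # eq. (15): after also deleting redundant gates
    return 2**(2*i - 3) + 2**(i + 1)

if __name__ == "__main__":
    assert gates_uniform(2) * 7 == gates_unused_removed(2) * 8           # minimum ratio 8/7
    assert abs(gates_uniform(30) / gates_unused_removed(30) - 2) < 1e-6
    assert abs(gates_unused_removed(30) / gates_redundant_deleted(30) - 3) < 1e-6
```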

Figure 4. Simplifying OR-gate decomposition – tinted circles (N = 8, n = 2). Depth = 6, Size = 63.

Figure 5. Removing the unused gates – tinted circles (N = 8, n = 2). Depth = 6, Size = 47.

Figure 6. Deleting redundant gates – tinted circles (N = 8, n = 2). Depth = 6, Size = 31.

Table 1. Number of limited fan-in TGs necessary to substitute a high fan-in TG (columns n = 2 ... 16); each entry lists, in order: (i) Boolean decomposition; (ii) the proposed algorithm; (iii) the enhanced algorithm.

  N = 4:    n = 2: 13 / 7 / 7
  N = 8:    n = 2: 253 / 67 / 31;  n = 4: 41 / 13 / 13
  N = 16:   n = 2: 65,535 / 1,147 / 511;  n = 4: 10,921 / 229 / 77;  n = 8: 583 / 25 / 25
  N = 32:   n = 2: 4.3E9 / 37,867 / 16,383;  n = 4: 7.2E8 / 7,573 / 2,493;  n = 8: 3.8E7 / 841 / 217;  n = 16: 1.39E5 / 49 / 49
  N = 64:   n = 2: 1.8E19 / 2.5E6 / 1.1E6;  n = 4: 3.1E18 / 4.9E5 / 1.6E5;  n = 8: 1.6E17 / 54,697 / 13,945;  n = 16: 6.0E14 / 3,217 / 689
  N = 128:  n = 2: 3.4E38 / 3.2E8 / 1.3E8;  n = 4: 5.7E37 / 6.4E7 / 2.0E7;  n = 8: 3.0E36 / 7.1E6 / 1.8E6;  n = 16: 1.1E34 / 4.1E5 / 88,305
  N = 256:  n = 2: 1.2E77 / 8.2E10 / 3.4E10;  n = 4: 1.9E76 / 1.6E10 / 5.2E9;  n = 8: 1.0E75 / 1.8E9 / 4.6E8;  n = 16: 3.8E72 / 1.1E8 / 2.3E7
  N = 512:  n = 2: 1.3E154 / 4.2E13 / 1.8E13;  n = 4: 2.2E153 / 8.4E12 / 2.7E12;  n = 8: 1.2E152 / 9.3E11 / 2.3E11;  n = 16: 4.4E149 / 5.5E10 / 1.1E10
  N = 1024: n = 2: 1.8E308 / 4.3E16 / 1.8E16;  n = 4: 2.9E307 / 8.6E15 / 2.7E15;  n = 8: 1.6E306 / 9.5E14 / 2.4E14;  n = 16: 5.8E303 / 5.6E13 / 1.2E13

Table 2. Number of necessary layers to substitute a high fan-in TG by limited fan-in TGs; each entry lists, in order: (i) Boolean decomposition; (ii) the proposed algorithm; (iii) the enhanced algorithm.

  N = 4:    n = 2: 5 / 3 / 3
  N = 8:    n = 2: 13 / 7 / 6;  n = 4: 5 / 3 / 3
  N = 16:   n = 2: 29 / 15 / 10;  n = 4: 13 / 7 / 6;  n = 8: 7 / 3 / 3
  N = 32:   n = 2: 61 / 31 / 15;  n = 4: 29 / 15 / 10;  n = 8: 17 / 7 / 6;  n = 16: 9 / 3 / 3
  N = 64:   n = 2: 125 / 63 / 21;  n = 4: 61 / 31 / 15;  n = 8: 39 / 15 / 10;  n = 16: 25 / 7 / 6;  n = 32: 15 / 3 / 3
  N = 128:  n = 2: 253 / 127 / 28;  n = 4: 125 / 63 / 21;  n = 8: 81 / 31 / 15;  n = 16: 57 / 15 / 10;  n = 32: 41 / 7 / 6;  n = 64: 23 / 3 / 3
  N = 256:  n = 2: 509 / 255 / 36;  n = 4: 253 / 127 / 28;  n = 8: 167 / 63 / 21;  n = 16: 121 / 31 / 15;  n = 32: 91 / 15 / 10;  n = 64: 65 / 7 / 6;  n = 128: 39 / 3 / 3
  N = 512:  n = 2: 1021 / 511 / 45;  n = 4: 509 / 255 / 36;  n = 8: 437 / 127 / 28;  n = 16: 249 / 63 / 21;  n = 32: 193 / 31 / 15;  n = 64: 151 / 15 / 10;  n = 128: 111 / 7 / 6;  n = 256: 65 / 3 / 3
  N = 1024: n = 2: 2045 / 1023 / 55;  n = 4: 1021 / 511 / 45;  n = 8: 679 / 255 / 36;  n = 16: 505 / 127 / 28;  n = 32: 399 / 63 / 21;  n = 64: 321 / 31 / 15;  n = 128: 257 / 15 / 10;  n = 256: 193 / 7 / 6;  n = 512: 115 / 3 / 3
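The "proposed" and "enhanced" entries of Table 2 follow directly from eqs. (4) and (12); as a consistency check, the sketch below regenerates the N = 2^10 row (the Boolean-decomposition column is taken from the table itself and is not recomputed).

```python
# Sketch: regenerating the "proposed" (eq. 4) and "enhanced" (eq. 12) depth entries
# of Table 2 for the row N = 2**10.

def NL_proposed(k, l):
    return 2 * 2**(k - l) - 1

def NL_enhanced(k, l):
    d = k - l
    return (d * d + 3 * d) // 2 + 1

if __name__ == "__main__":
    k = 10
    row = [(2**l, NL_proposed(k, l), NL_enhanced(k, l)) for l in range(1, k)]
    # expected from Table 2: n = 2 -> (1023, 55), ..., n = 512 -> (3, 3)
    assert row[0][1:] == (1023, 55) and row[-1][1:] == (3, 3)
```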

3.5 Results

For a better understanding of what we have accomplished so far, we have computed in Table 1 the size, and in Table 2 the depth, of an equivalent NN replacing one N-fan-in TG by n-fan-in TGs. We have included the values for Boolean decomposition, the proposed decomposition algorithm, and the enhanced decomposition algorithm.

4. USING SORTING NETWORKS

Another algorithm for reducing the fan-in of symmetric functions can be derived if we start from the basic definition of a symmetric function: "a function which depends only on the sum of its input variables" (the function is invariant to permutations of its input variables). It is well known that the evaluation of a symmetric function can be reduced to comparing the sum of the input variables with some constants[17,24,28]. It is thus clear that the basic operation is the SUM! There are many different ways to compute a sum. Out of these we should mention the ones summing with TGs[1,30,31]. But all of them use unbounded fan-in TGs. An alternate solution is to sort the inputs and detect the position in the sorted output string where zeros switch to ones[12]. This position is in fact equal to the sum of the inputs. Two separate blocks are needed: one to sort the inputs, and one to detect (search) the position of the 0 → 1 transition. That is why we will call it a "sort-and-search" algorithm (Fig.7); a toy illustration is sketched at the end of this overview.

A lot of work has been devoted to sorting algorithms[18], as well as to parallel sorting networks and their possible VLSI implementation[20,36]. The classic odd-even merge algorithm[18] can easily be realized out of two-element sorting cells (the primitive of all sorting networks), leading to (see Fig.8):

  N_G(k,1) = 2^k (2^k - 1)/2 cells = N(N - 1) TGs = O(N^2),    (16)
  N_L(k,1) = N = O(N).    (17)

While the depth of this sorting network is linear, the size has been reduced to polynomial. Fortunately, other sorting algorithms can be implemented more efficiently as sorting networks: Batcher's odd-even mergesort and other butterfly implementations of odd-even mergesort, Batcher's bitonic merge, or the balanced sorting network[25] (Fig.9) have:

  N_G(k,1) = O(N lg^2 N),    (18)
  N_L(k,1) = O(lg^2 N).    (19)

We should mention that there are even N_L(k,1) = O(lg N) algorithms like Ajtai-Komlos-Szemeredi (AKS), but their large multiplying constant does not make them of any practical use (N should be of the order of 10^100 for these algorithms to really become interesting). Lower constants for O(lg N) time algorithms are Leighton's columnsort[20] (which uses AKS) and Bilardi-Preparata bitonic sorting on a mesh-of-CCC (cube-connected cycles). But they are still too complicated. The interested reader should consult [37,38] for an overview of several algorithms.
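A toy illustration of the sort-and-search idea (a sketch, not the VLSI construction itself): on bits, the two-element sorting cell reduces to an AND and an OR gate, both 2-fan-in TGs; the simple odd-even transposition network of eqs. (16)-(17) sorts the inputs, and the position of the 0 → 1 transition gives the sum.

```python
# Toy sketch of "sort-and-search" evaluation of a symmetric function, using the
# two-element sorting cell as the only primitive: (min, max) = (AND, OR) on bits.
# The sorting network here is the simple odd-even transposition network
# (N layers of at most N/2 cells), as counted in eqs. (16)-(17).

def sorting_cell(a, b):
    """Two-input sorting cell: (min, max) = (AND, OR) for binary inputs."""
    return a & b, a | b

def odd_even_transposition_sort(bits):
    bits = list(bits)
    n = len(bits)
    for layer in range(n):                       # N layers ...
        for i in range(layer % 2, n - 1, 2):     # ... of at most N/2 cells each
            bits[i], bits[i + 1] = sorting_cell(bits[i], bits[i + 1])
    return bits

def symmetric_by_sort_and_search(bits, accepted_sums):
    """F(x) = 1 iff sum(x) is accepted; the sum is read off as the position of
    the 0 -> 1 transition in the sorted string (zeros first, then ones)."""
    s = odd_even_transposition_sort(bits)
    first_one = s.index(1) if 1 in s else len(s)   # search the 0 -> 1 transition
    return int((len(s) - first_one) in accepted_sums)

if __name__ == "__main__":
    from itertools import product
    # MAJORITY of 8 bits: accepted sums {4, ..., 8}
    for x in product([0, 1], repeat=8):
        assert symmetric_by_sort_and_search(x, set(range(4, 9))) == (sum(x) >= 4)
```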

Figure 7. Computing a symmetric function with the "sort-and-search" algorithm: the inputs x_1 ... x_N feed a SORTING NETWORK, whose ordered outputs x'_1 ... x'_N feed a SEARCHING TREE producing F.

Figure 8. Sorting network based on "odd-even merge" (N = 8, n = 2).

Figure 9. Shuffle-exchange sorting network[25] (N = 8, n = 2).

Also, from the VLSI point of view, the two extremes are: minimum area[29], Θ(2 lg N), which can be realized by two counters (one for the zeros and one for the ones); and minimum delay[36], which goes down to just 2 steps but requires a highly connected O(N^2) array of binary neurons (TGs).

As the search tree for the sorted sequence of bits can easily be implemented in size O(N) and depth O(lg N), it becomes clear that networks of 2-fan-in TGs of size O(N lg^2 N) and depth O(lg^2 N) can be built. In particular, for a majority function the search tree is just one TG. For N = 2^10 and n = 2^1 we have:

  N_G(10,1) = (2^10 / 2) · (10^2 + 10)/2 = 28160,
  N_L(10,1) = (10^2 + 10)/2 = 55.

As can be seen, this solution, while having the same delay as the previous one (55 layers), drastically reduces the number of 2-fan-in TGs (28160 instead of 1.8E16 for the example we have taken).
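The worked example can be reproduced by counting layers and cells of Batcher's odd-even mergesort; the sketch below assumes the standard layer count (lg^2 N + lg N)/2 and N/2 cells per layer, which matches the figures quoted above (28160 = 512 × 55).

```python
# Sketch reproducing the worked example for N = 2**k inputs: Batcher's odd-even
# mergesort with (k**2 + k)/2 comparator layers and N/2 two-fan-in cells per layer
# (assumed counts; the text only states the N = 2**10 numbers).

def batcher_layers(k):
    return (k * k + k) // 2

def batcher_cells(k):
    return (2**k // 2) * batcher_layers(k)

if __name__ == "__main__":
    assert batcher_layers(10) == 55       # N_L(10,1) = 55 layers
    assert batcher_cells(10) == 28160     # N_G(10,1) = 28160 two-fan-in cells
```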

5. CONCLUSIONS

Two algorithms for fan-in reduction have been introduced and analyzed. Both of them improve on the known ones with respect to size and depth complexity. The first one has superpolynomial size complexity, and we have shown how to reduce the size by a factor of at least 2.16 and up to more than 8, while the depth complexity has been decreased from linear to squared logarithmic. A better result is suggested by a second algorithm, which decreases the size to polynomial. The depth of the second algorithm is also squared logarithmic. A further decrease to logarithmic depth is possible, but it is precluded by the fact that the size in this case, while still polynomial (the size complexity is also decreased), will have a large constant making the solution impracticable.

To VLSI designers these results should be interesting as they directly relate to the area (A ≈ size) and the delay (T ≈ depth) of an integrated circuit, used to estimate its area-time (cost-performance) efficiency: AT^2. The algorithms can be used locally (one neuron at a time), or starting from the logical function to be implemented (the function can be extracted from an already trained NN).

Recently an algorithm for reducing the fan-in of BFs belonging to the F_{N,m} class has been proposed[8]. This is the class of BFs of N variables that have exactly m groups of ones. It has linear size and logarithmic depth, thus improving even on the second algorithm suggested in this paper. Still, it is worth mentioning that the F_{N,m} class algorithm cannot be used for any other BFs, while the algorithms for decomposing symmetric functions can be applied to any BF, with the penalty induced by replicating the variables.

REFERENCES

[1] N. Alon and J. Bruck, Explicit Construction of Depth-2 Majority Circuits for Comparison and Addition, Res. Rep. RJ 8300 (75661), IBM Almaden, San Jose, CA, 8/15/91.
[2] C. Alippi, Weight Representation and Network Complexity Reductions in the Digital VLSI Implementation of Neural Nets, Res. Note RN/91/22, Dept. CS, Univ. College London, February 1991.
[3] C. Alippi and M. Nigri, Hardware Requirements for Digital VLSI Implementation of Neural Networks, in Proc. of IJCNN'91 (Singapore), IEEE Press, 1873, 1991.
[4] T. Baker and D. Hammerstrom, Modifications to Artificial Neural Network Models for Digital Hardware Implementation, Tech. Rep. CS/E 88-035, Dept. CS&E, Oregon Graduate Center, 1988.
[5] V. Beiu, J.A. Peperstraete and R. Lauwereins, Using Threshold Gates to Implement Sigmoid Nonlinearity, in Proc. of ICANN'92 (Brighton), Elsevier Science Publishers, vol. 2, 1447, 1992.
[6] V. Beiu, J.A. Peperstraete and R. Lauwereins, Algorithms for Fan-In Reduction, in Proc. of IJCNN'92 (Beijing), IEEE and PHEI Press, vol. 3, 203, 1992.
[7] V. Beiu, J.A. Peperstraete and R. Lauwereins, Simpler Neural Networks by Fan-In Reduction, in Proc. of NeuroNimes'92 (Nimes), EC2, 589, 1992.
[8] V. Beiu, J.A. Peperstraete, J. Vandewalle and R. Lauwereins, Efficient Decomposition of Comparison and Its Applications, in Proc. of the European Symposium on Artificial Neural Networks (Brussels), D facto, 45, 1993.
[9] N.N. Biswas, T.V.M.K. Murthy and M. Chandrasekhar, IMS Algorithm for Learning Representations in Boolean Neural Networks, in Proc. of IJCNN'91 (Singapore), IEEE Press, 1123, 1991.
[10] N.N. Biswas and R. Kumar, A New Algorithm for Learning Representations in Boolean Neural Networks, Current Science, 59(1990), 12, 595.
[11] J. Bruck, Harmonic Analysis of Polynomial Threshold Functions, SIAM J. on Disc. Math., 3(1990), 2, 168.
[12] J. Bruck, personal communication, 1992.
[13] J. Bruck and R. Smolensky, Polynomial Threshold Functions, AC0 Functions and Spectral Norms, SIAM J. on Comput., 21(1992), 1, 33.
[14] H.P. Graf, E. Sackinger, B. Boser and L.D. Jackel, Recent Developments of Electronic Neural Nets in the USA and Canada, in Proc. of MicroNeuro'91 (Münich), Kyrill&Method Verlag, 471, 1991.
[15] T. Hofmeister, W. Hohberg and S. Köhling, Some Notes on Threshold Circuits, and Multiplication in Depth 4, preprint, June 1990.
[16] Y. Hirai, Hardware Implementation of Neural Networks in Japan, in Proc. of MicroNeuro'91 (Münich), Kyrill&Method Verlag, 435, 1991.
[17] S.T. Hu, Threshold Logic, Univ. of California Press, Berkeley, 1965.
[18] D.E. Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, Addison-Wesley, Reading, 1973.
[19] R. Lauwereins and J. Bruck, Efficient Implementation of a Neural Multiplier, in Proc. of MicroNeuro'91 (Münich), Kyrill&Method Verlag, 217, 1991.
[20] T. Leighton, Tight Bounds on the Complexity of Parallel Sorting, IEEE Trans. on Comp., C-34(1985), 4, 344.
[21] E. Mayoraz, On the Power of Networks of Majority Functions, in A. Prieto (ed.), Lecture Notes in Computer Science 540, Proc. of IWANN'91 (Grenade), Springer-Verlag, 78, 1991.
[22] E. Mayoraz, Representation of Boolean Functions with Democratic Networks, preprint, July 1992.
[23] R.C. Minnick, Linear-Input Logic, IRE Trans. on Electr. Comp., EC-10(1961), 3, 6.
[24] S. Muroga, Threshold Logic and Its Applications, John Wiley & Sons, New York, 1971.
[25] L. Rudolph, A Robust Sorting Network, IEEE Trans. on Comp., C-34(1985), 4, 326.
[26] N.P. Red'kin, Synthesis of Threshold Circuits for Certain Classes of Boolean Functions, Cybernetics (translation of Kibernetika), 6(1973), 5, 540.
[27] A. Sarje and N.N. Biswas, Testing Threshold Functions Using Implied Minterm Structure, Int. J. Systems Sci., 14(1983), 5, 497.
[28] C.L. Sheng, Threshold Logic, Academic Press, New York, 1969.
[29] A.R. Siegel, Minimum Storage Sorting Networks, IEEE Trans. on Comp., C-34(1985), 4, 355.
[30] K.-Y. Siu and J. Bruck, Neural Computation of Arithmetic Functions, Proc. of IEEE, 78(1990), 10, 1669.
[31] K.-Y. Siu and J. Bruck, On the Power of Threshold Circuits with Small Weights, SIAM J. on Disc. Math., 4(1991), 3, 423.
[32] K.-Y. Siu and J. Bruck, On the Dynamic Range of Linear Threshold Elements, SIAM J. on Disc. Math., to appear.
[33] K.-Y. Siu, J. Bruck and T. Kailath, Depth Efficient Neural Networks for Division and Related Problems, Res. Rep. RJ 7946 (72929), IBM Almaden, San Jose, CA, 01/25/91.
[34] K.-Y. Siu, V. Roychowdhury and T. Kailath, Computing with Almost Optimal Size Threshold Circuits, Tech. Rep., Information System Lab., Stanford Univ., June 12, 1990.
[35] K.-Y. Siu, V. Roychowdhury and T. Kailath, Depth-Size Tradeoffs for Neural Computation, IEEE Trans. on Comp., C-40(1991), 12, 1402.
[36] Y. Takefuji and K.-C. Lee, A Super-Parallel Sorting Algorithm Based on Neural Networks, IEEE Trans. on Comp., C-37(1990), 11, 1425.
[37] C.D. Thompson, The VLSI Complexity of Sorting, IEEE Trans. on Comp., C-32(1983), 12, 1171.
[38] L.E. Winslow and Y.-C. Chow, The Analysis and Design of Some New Sorting Machines, IEEE Trans. on Comp., C-32(1983), 7, 677.
Math., to appear. K.-Y. Siu, J. Bruck and T. Kailath, Depth Efficient Neural Networks for Division and Related Problems, Res. Rep. RJ 7946 (72929), IBM Almaden, San Jose, CA, 01/25/91. K.-Y. Siu, V. Roychowdhury and T. Kailath, Computing with Almost Optimal Size Threshold Circuits, Tech. Rep., Information System Lab., Stanford Univ., June 12, 1990. K.-Y. Siu, V. Roychowdhury and T. Kailath, Depth-Size Tradeoffs for Neural Computation, IEEE Trans. on Comp., C-40(1991), 12, 1402. Y. Takefuji and K.-C. Lee, A Super-Parallel Sorting Algorithm Based on Neural Networks, IEEE Trans. on Comp., C-37(1990), 11, 1425. C.D. Thompson, The VLSI Complexity of Sorting, IEEE Trans. on Comp., C-32(1983), 12, 1171. L.E. Winslow and Y.-C. Chow, The Analysis and Design of Some New Sorting Machines, IEEE Trans. on Comp., C-32(1983), 7, 677.