VLSI implementation of a Binary Neural Network: two case studies

Amine Bermak and Jim Austin
Advanced Computer Architecture Group, Department of Computer Science, University of York, Heslington YO10 5DD, York, United Kingdom
E-mail: [email protected]

Abstract

A comparison between a bit-level and a conventional VLSI implementation of a binary neural network is presented. This network is based on a Correlation Matrix Memory (CMM) that stores relationships between pairs of binary vectors. The bit-level architecture consists of an n × m array of bit-level processors holding the storage and computation elements. The conventional CMM architecture consists of a RAM holding the CMM storage and an array of counters. Since we are interested in the VLSI implementation of such networks, the hardware complexities and speeds of both the bit-level and the conventional architecture were compared using VLSI tools. It is shown that a significant speedup is achieved by the bit-level architecture, whose speed is not limited by the memory addressing delay. Moreover, the bit-level architecture is very simple and reduces the bus/routing area, making it suitable for VLSI implementation. Its main drawback compared to the conventional approach is its demand for a large number of adders when dealing with a large number of inputs.

keywords: Binary neural networks, VLSI implementation, bit-level architecture, internal storage processors.

1 Introduction

In real-time vision processing systems, the task is to recognize an object and check its geometric and physical properties against given specifications in order to determine whether or not the object is a target.

In this kind of application there is a large amount of data to be processed in real time, and the required processing speed can be very high, so special hardware architectures are indispensable for real-time processing. Binary Neural Networks (BNN) are at the heart of many successful vision processing systems [2]. These networks are easy to implement since they can be both taught and tested using boolean operations, and the prospect of building chips that make real-time decisions is particularly promising. Many BNN are based on a Correlation Matrix Memory (CMM) that stores relationships between pairs of binary vectors. Vision applications need large CMM memories, and both the size and the performance of the system are greatly affected by the method of memory implementation: off-chip memories can implement large memories, whereas on-chip memories provide high speed but limited memory size. The idea of using arrays at the bit level was developed by McCanny and McWhirter [8], who demonstrated that many of the components required in digital signal processing applications can be implemented as arrays of bit-level processing elements based on a gated full-adder function. Problems can therefore be treated at the bit level from the outset, and circuits can be constructed by tiling the silicon plane with an array of simple bit-level Processing Elements (PEs). This approach is becoming increasingly popular; for example, a bit-level architecture was recently developed by Y. Chan and S.Y. Kung for a block-matching application [4].

In this paper we propose an on-chip memory architecture for implementing the recall phase of BNN. The bit-level architecture is used to improve the performance of a conventional design in terms of speed and routing area, and it also overcomes the problem of memory size limitation. The architecture is arranged as an n × m array of bit-level processors, each integrating an internal storage element and a computing element. These bit-level processors are very simple, making the architecture suitable for VLSI implementation, and the memory size is limited only by the available chip area. Section 2 of this paper describes the computation required to train and recall from a Binary Neural Network. Section 3 describes both the direct approach, referred to as the `conventional design', and the bit-level architecture for implementing CMM networks. In Section 4, the simulated performance of both designs, in terms of hardware complexity and speed, is compared using VLSI CAD tools. A brief conclusion is presented in Section 5.

2 Binary Neural Networks

The binary neural network studied here contains a binary matrix, often called a Correlation Matrix Memory (CMM), that stores relationships between pairs of binary vectors. Its main advantages are that it can be both taught and tested in a single pass over the data using boolean operations, resulting in simple and fast processing. A further advantage of the approach is that it can still work under conditions such as missing inputs, unlike the MLP (Multi-Layer Perceptron) network. However, CMMs do have drawbacks: they may not generalise as well as an MLP network, nor can they solve as many problem classes as an MLP [7]. The brief explanation that follows describes the recall operation from a CMM; more details on both the recall and the training of a CMM can be found in [3]. Recall from the network is described by Equation 1. This is an inner product operation, where $P_k$ is the preprocessed input vector k, after undergoing a transpose, and M is the CMM. The output of this operation is a vector of integer values v.

$$v_k = \sum_{i=1}^{n} M\,P_{ki}^{T} \qquad (1)$$

The inner product means that each input `1' activates a column of weights in the CMM. These columns are then integer-accumulated, resulting in the summed values of vector v. An example of a network containing a single correlation is shown in Figure 1; Figure 2 shows the recall from a network that has been taught on many correlations.

The vector of summed values, $v_k$, is thresholded using either L-max thresholding or Willshaw thresholding. L-max thresholding sets the L highest summed values to `1'; all other summed values are set to `0'. Willshaw thresholding sets to `1' the summed values in vector v that equal W, where W is the number of bits set in the input pattern; the remaining summed values are set to `0'. An often-used variant of Willshaw thresholding sets to `1' all values in vector $v_k$ that are greater than or equal to W, and all others to `0'. The output of the thresholding is the original pattern, typically called the separator. It should be noted that the thresholded vector may actually contain more than one separator pattern, in which case further processing is required to extract the individual patterns; the technique used to achieve this, called Middle Bit Indexing (MBI), is described in detail in [6].
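To make these operations concrete, the following minimal Python sketch implements one-shot CMM teaching (the matrix is the boolean OR of the outer products of the input/separator pairs, the standard CMM teach rule), recall as in Equation 1, and both thresholding schemes. It is our own illustration under those assumptions, not the authors' implementation, and all function names are ours.

```python
# Minimal sketch of CMM teach, recall (Equation 1) and thresholding.
# Our own illustration; function names are not from the paper.

def teach(pairs, n, m):
    """One-shot teaching: M is the boolean OR of input/separator outer products."""
    M = [[0] * m for _ in range(n)]
    for p, s in pairs:                    # p: n-bit input, s: m-bit separator
        for i in range(n):
            for j in range(m):
                M[i][j] |= p[i] & s[j]
    return M

def recall(M, p):
    """Each active input bit selects a row/column of M; these are accumulated."""
    v = [0] * len(M[0])
    for i, bit in enumerate(p):
        if bit:
            for j, w in enumerate(M[i]):
                v[j] += w
    return v

def lmax_threshold(v, L):
    """Set the L highest summed values to 1 (ties may yield more than L ones)."""
    cut = sorted(v, reverse=True)[L - 1]
    return [1 if x >= cut else 0 for x in v]

def willshaw_threshold(v, p):
    """Set to 1 every summed value >= W, the number of set input bits."""
    W = sum(p)
    return [1 if x >= W else 0 for x in v]

# Teach one association and recall it: input 0110 -> separator 1010.
M = teach([((0, 1, 1, 0), (1, 0, 1, 0))], n=4, m=4)
assert willshaw_threshold(recall(M, (0, 1, 1, 0)), (0, 1, 1, 0)) == [1, 0, 1, 0]
```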

Figure 1: Recall from a sparse network (input recall pattern $P_k$, CMM M, summed values $v_k$, thresholded values $T_k$).

Figure 2: Recall from a saturated network (input recall pattern $P_k$, CMM M, summed values $v_k$, thresholded values $T_k$).

Figure 3: Conventional architecture. (A.) Block diagram: the recall pattern drives the address bus of the CMM memory, whose data bus feeds counters 1 to n; the counter outputs form the Sout bus of summed values. (B.) An example of a Standard Cell implementation including an SRAM and 32 6-bit counters.

3 VLSI implementation

The hardware previously developed by the group at York for CMM implementation is called C-NNAP [5] and PRESENCE [7]. The latest prototype [7] is based on a pipelined processor implemented using FPGA devices. The design has been implemented on a VME bus and is also being ported to a PCI-bus-based system. The main goal in developing this hardware is to increase the speed of the teach and recall operations, which is vital to the processing rate because the complexity of recall using standard software techniques results in slow execution rates. This section investigates the possibility of improving performance by implementing an ASIC with on-chip memory. We present both the conventional approach and the bit-level architecture for implementing CMM networks with on-chip memory.

3.1 Conventional approach

A direct way to carry out the computation of Equation 1 in hardware with on-chip memory is first to implement the CMM memory and then to use each data output to increment a counter; the number of counters equals the separator size. Figure 3.A shows a block diagram of the corresponding design. The addresses of the activated components of the input pattern are used directly to address the corresponding rows of the CMM and read out the appropriate words, and each data word read is transferred to the corresponding counter. Figure 3.B shows the corresponding implementation. This example implements a memory of 32 × 32 bits and 32 counters in 1.0 µm CMOS technology (ecpd10). A Standard Cell technique with automatic placement and routing under the CADENCE environment was used for this design. The memory, including the row decoder and the read/write amplifier, was generated using a cell compiler and then inserted into the standard cell design. In this example, the pads are placed far from the active area because of the high number of inputs/outputs; this problem could be overcome by time-multiplexing the inputs/outputs. The memory access time is 14.6 ns and the simulated maximum frequency of this design is estimated at 40 MHz. Table 1 reports the area of each region of the design represented in Figure 3.B. We can note from this table that the routing area cannot be neglected in the conventional approach.

region          area (mm²)   portion (%)
routing         0.24         8
Standard Cell   1.53         51
RAM memory      1.23         41
active area     3.00         100

Table 1: Active area requirements for the conventional CMM network (chip including an SRAM of 32 × 32 bits and 32 6-bit counters).

The speed of this design is limited by the memory addressing: because of the memory architecture, one and only one row of memory can be addressed at each clock cycle. In addition, the size of the memory in this kind of architecture is limited to the maximum size that the cell compiler can generate.
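As a behavioural sketch of this design (ours, not the actual netlist), the model below reads one CMM row per clock cycle and accumulates it into the counters, reproducing the p-cycle cost derived in Section 4:

```python
# Behavioural model of the conventional design (Figure 3.A): the recall
# pattern drives the address bus, one CMM row is read per clock cycle and
# accumulated into the counters. Our own sketch, not the actual netlist.

def conventional_recall(ram, pattern):
    """ram: list of CMM rows (words); pattern: recall pattern bits."""
    counters = [0] * len(ram[0])
    cycles = 0
    for addr, bit in enumerate(pattern):
        cycles += 1                      # one address is issued per cycle
        if bit:                          # active bits select a row to read
            for j, w in enumerate(ram[addr]):
                counters[j] += w         # each data bit increments a counter
    return counters, cycles

# With a p-bit pattern the recall costs p cycles, matching Section 4's
# p * (D_read + D_count) processing-time expression.
```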

Figure 4: Bit-level architecture. (A.) Block diagram: an array of PEs, each with a one-bit memory, connected by the address bus (recall pattern), the data bus (CMM) and the Sout bus (summed values). (B.) An example of the layout of a 4 × 8 architecture; device size 331 µm × 286 µm = 0.094 mm², including 784 transistors.

Figure 5: (A.) Block diagram of one PE, comprising a one-bit memory cell (with Data and Address inputs) and a full adder (with Sin/Cin inputs and Sout/Cout outputs). (B.) The corresponding layout.

3.2 Bit-level architecture

The bit-level architecture is proposed to overcome the problems of the conventional architecture noted above. From Equation 1, the recall from a CMM is based on two basic operations: first, an AND operation between the elements of the matrix M and the elements of the transposed pattern $P^T$; second, the addition of all the partial results of the AND operations. Explicit AND gates are avoided because the addresses of the activated components of the input pattern are used directly to select the corresponding rows of the CMM. The computation required by Equation 1 can therefore be implemented with storage elements and adders alone.
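At the gate level, each PE then reduces to a one-bit store feeding a full adder, in the spirit of the gated-full-adder cells of McCanny and McWhirter [8] and of the PE of Figure 5.A. The one-bit sketch below is our own illustration; in the actual design the partial sums are several bits wide.

```python
# One-bit view of a PE: the stored CMM bit, gated by the address bit, is
# added to the incoming sum/carry pair by a full adder. Our own sketch;
# the real PEs handle multi-bit partial sums.

def pe_bit(stored, addr, s_in, c_in):
    b = stored & addr                                 # gating by the address bit
    s_out = s_in ^ b ^ c_in                           # full-adder sum
    c_out = (s_in & b) | (b & c_in) | (s_in & c_in)   # full-adder carry
    return s_out, c_out
```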

The bit-level architecture is based on an array of processors, each of which stores one bit of the CMM and performs an adding operation. For clarity, the array represented in Figure 4.A consists of 3 × 3 processing elements. Each processor receives one bit of the address word (which corresponds to a pattern bit) from the address bus. If the received bit is active, the corresponding stored bit is selected and added to the partial sum received from the processor on the left; the resulting sum is then transmitted to the PE on the right. The address word is propagated vertically from the top to the bottom of the array, so an active address bit selects the whole corresponding column. All results are available in parallel on the right-hand side of the array, where each row corresponds to one neuron output. These results are then transmitted to the Willshaw or L-max threshold block.
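Functionally, the whole n × m array evaluates Equation 1 in a single pass. The behavioural sketch below (ours; the adders and carry chains are abstracted to integer addition, and timing is ignored) shows the data movement:

```python
# Behavioural model of the PE array of Figure 4.A: address bits propagate
# down the columns, partial sums ripple left to right along each row, and
# every row delivers one neuron's summed value in parallel on the right.
# Our own sketch; adders and carries are abstracted to integer addition.

def bitlevel_recall(cmm, pattern):
    """cmm[row][col]: the bit stored in PE (row, col); pattern gates columns."""
    outputs = []
    for row in cmm:
        s = 0                               # partial sum entering the row
        for col, stored in enumerate(row):
            if pattern[col]:                # active address bit selects column
                s += stored                 # PE adds its stored bit
        outputs.append(s)                   # emitted on the right-hand side
    return outputs
```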

Figure 4.B shows an example layout of 4 × 8 processors designed in 1.0 µm CMOS technology. The area of this design is 0.094 mm² and it integrates 784 transistors, which corresponds to a very high density of 2085 gates/mm² (counting four transistors per gate equivalent: 784/4 = 196 gates over 0.094 mm²). Table 2 reports the active area requirements for an example of 32 × 32 PEs. We can note from this table that the routing area is negligible, because routing between the cells is achieved within the cells themselves. Figure 5.B shows the layout of one PE: no routing area is needed between the memory cell and the adder.

region        area (mm²)   portion (%)
routing       0            0
Cells         4.2          100
active area   4.2          100

Table 2: Active area requirements for the bit-level architecture including 32 × 32 PEs.

The layout was designed using MAGIC [1], an interactive system for creating and modifying VLSI circuit layouts, with a full-custom 1.0 µm CMOS technology. The SPICE simulator was employed to verify both the logical correctness and the electrical behaviour of the circuit, from a SPICE netlist generated by MAGIC's extractor. This design methodology offers high precision, since the extracted netlist includes the parasitic capacitances to substrate and several kinds of internodal coupling capacitances. The processors were carefully arranged to obtain a high-density design: consecutive processors were flipped upside-down to share common `Gnd' and `Vdd' rails, which removes the space otherwise needed between two metal strips. Special attention was paid to the design of the adders used in each bit-level processor: an adder with equal carry and sum propagation times is advantageous, because the worst-case adding time depends on both paths.

4 Performance

Figures 6 and 7 show the dependency of silicon area and delay on both input and output size for the conventional and bit-level solutions; the threshold block is not included in these figures.

For a fixed number of inputs, the silicon area grows linearly for both solutions and at approximately the same rate (Figure 6.A). However, as the input size is increased, the silicon area grows approximately exponentially for the bit-level solution, compared with the linear growth of the conventional solution (Figure 6.B). As the output size is varied, the processing time remains constant for both solutions (Figure 7.A). The bit-level solution has a processing time approximately four times smaller for a 32-bit input, allowing a significant speedup of the computations (Figure 7.B). This speedup holds in general because in the bit-level architecture the addressing of the array is done in parallel, whereas in the memory configuration only one row can be addressed per clock cycle. If p is the number of bits of the input pattern, the processing time of the bit-level architecture is expressed as

$$D_{read} + (p - 1) \cdot D_{adder}$$

where $D_{read}$ and $D_{adder}$ are respectively the one-cell memory delay and the adder delay. The processing time of the conventional solution is expressed as

$$p \cdot (D_{read} + D_{count})$$

where $D_{count}$ is the counter delay.

There is a well-known VLSI implementation of a neural associative memory called SARAM [9], in which the neural processing and the memory are implemented on a single chip. The weights are stored in an internal Static RAM (SRAM) consisting of 256 words, each 64 bits wide. Table 3 compares the performance of SARAM and our designs.

device            techno        byte area (mm²)   density (gates/mm²)   clock cycles
SARAM (F)         CMOS 1.0 µm   0.013             1100                  2k+l+2
std. cell (E)     CMOS 1.0 µm   0.022             1000                  p
full custom (E)   CMOS 1.0 µm   0.078             2085                  1

Table 3: Performance comparison. F denotes fabricated device performance and E estimated device performance; p, k and l are respectively the number of bits of the input pattern, the number of input bits set and the number of output bits set.

The conventional design based on the standard cell technique has a silicon area close to that of the SARAM device, in contrast to the larger area of the full-custom design despite the latter's high gate density. However, the full-custom version allows higher-speed processing, since it requires only one clock cycle.
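To make the two processing-time expressions concrete, the snippet below evaluates them for a 32-bit input pattern. The delay values are illustrative assumptions of ours; only the 14.6 ns memory access time is taken from Section 3.1, and the one-cell read and adder delays are placeholders.

```python
# Processing-time expressions from Section 4. The delay values below are
# illustrative assumptions, except the 14.6 ns SRAM access time quoted for
# the conventional design in Section 3.1.

def t_bitlevel(p, d_read_cell, d_adder):
    return d_read_cell + (p - 1) * d_adder      # parallel addressing, ripple sum

def t_conventional(p, d_read, d_count):
    return p * (d_read + d_count)               # one row per clock cycle

p = 32
print(t_bitlevel(p, d_read_cell=5.0, d_adder=3.0))     # 98.0 ns (assumed delays)
print(t_conventional(p, d_read=14.6, d_count=10.0))    # 787.2 ns (assumed D_count)

# With these placeholder delays the bit-level design comes out roughly 8x
# faster; the paper's simulations (Figure 7) show about 4x for 32-bit inputs.
```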

Figure 6: Area comparison between the two solutions (silicon area in mm², ecpd10). (A.) is the comparison for a fixed input size (32 bits), varying the output size, while (B.) is the comparison for a fixed output size (32 bits), varying the input size.

Figure 7: Processing time comparison between the two solutions (ns). (A.) is the comparison for a fixed input size (32 bits), varying the output size, while (B.) is the comparison for a fixed output size (32 bits), varying the input size.

5 Conclusion

In this paper, we have presented two case studies of VLSI digital design for CMM networks. The hardware complexities and speeds of the bit-level and conventional CMM networks were compared using VLSI tools (CADENCE for the conventional architecture, MAGIC with the HSPICE simulator for the bit-level architecture), with particular attention to area and processing time. The main drawback of the bit-level architecture compared to the conventional approach is its demand for a large number of adders when dealing with a large number of inputs. However, the bit-level design offers the following features: (1) it allows a significant speedup of the computations; (2) it reduces bus/routing area and is easier to implement in full-custom VLSI; (3) it overcomes the limitation of the cell compilers used to generate memories in a conventional CMM network with on-chip memory, since each processor includes an internal storage element; the memory size is then limited only by the maximum chip size.

Acknowledgments

The authors would like to thank EPSRC, under grant number GR/K41090, for providing the support for the research described in this paper. Thanks are due to Ken Lees, Mick Turner and Nourredine Senouci for helpful comments on this paper.

References

[1] Magic tutorials. Computer Science Division, University of California, 1990.

[2] J. Austin. RAM-based neural networks. Progress in Neural Processing, 9, 1998.

[3] J. Austin and T.J. Stonham. An associative memory for use in image recognition and occlusion analysis. Image and Vision Computing, 5(4):251-261, 1987.

[4] Y. Chan and S.Y. Kung. Bit level block matching systolic arrays. In Proceedings of the 1995 International Conference on Application Specific Array Processors, 1995.

[5] J.V. Kennedy, J. Austin, R. Pack and B. Cass. A parallel processing architecture for binary neural networks. In International Conference on Neural Networks, pages 1037-1041, 1995.

[6] J. Kennedy. An exploration into an uncertain reasoning architecture. Technical report, The University of York, UK, 1995.

[7] J.V. Kennedy and J. Austin. A parallel architecture for binary neural networks. In MicroNeuro '97, 6th International Conference on Microelectronics for Neural Networks and Fuzzy Systems, pages 225-231, 1997.

[8] J. McCanny and J. McWhirter. Implementation of signal processing functions using 1-bit systolic arrays. Electronics Letters, 18:241-243, 1982.

[9] A. Heittman, J. Malin, C. Pintaske and U. Ruckert. Digital VLSI implementation of a neural associative memory. In MicroNeuro '97, 6th International Conference on Microelectronics for Neural Networks and Fuzzy Systems, pages 280-288, 1997.