An Architectural Study of a Massively Parallel Processor for Convolution-Type Operations in Complex Vision Tasks

Martin Franz    René Schüffny

Technische Universität Dresden, Institut für Grundlagen der Elektrotechnik und Elektronik, Mommsenstraße 13, 01069 Dresden, Germany
E-Mail: schueff[email protected]

Abstract. Complex vision tasks, e.g., face recognition or wavelet-based image compression, impose severe demands on computational resources to meet the real-time requirements of the applications. Clearly, the bottleneck in computation can be identified in the first processing steps, where basic features such as motion cues and Gabor or wavelet transform coefficients are computed from full-size images. This paper presents an architectural study of a vision processor, which was particularly designed to overcome this bottleneck.

1 Introduction

The computational requirements of complex vision tasks based on real-time image sequences impose severe demands on the processing hardware. The primary computational steps performed on the full-size image frames represent a bottleneck in processing. State-of-the-art processor hardware cannot cope in real time with the computational complexity imposed by spatio-temporal convolution, Gabor and wavelet transformation, motion estimation by block matching, and recurrent neural fields. To meet these requirements for a realistic image size, e.g. 512 × 512 pixels, a dedicated massively parallel processor architecture was investigated [7]. Our processor architecture is derived from the real-time computational and functional requirements of several representative complex vision applications, e.g. object recognition based on jets computed from Gabor transform coefficients [4], 3D vision for autonomous vehicles [2], wavelet transformation and vector quantization for video conference image compression [5], block-matching-based motion estimation for automotive applications, and recurrent neural fields [3]. However, the architecture provides sufficient generality to be widely applicable in machine vision. In the following section, an outline of the supported algorithms is given. Then the issue of reduced accuracy is addressed and the processor architecture is presented. Finally, the verification of the processor architecture is described and Verilog simulation results are discussed.

Published in ICANN 96 Proceedings, pp. 377-382, Springer-Verlag Berlin Heidelberg New York, 1996

2 Image Processing Algorithms

2.1 Gaussian Pyramid

The Gaussian pyramid provides a multiscale image representation, which can be exploited for reduced-complexity hierarchical processing, e.g. in object recognition and tracking. Assuming a separable Gaussian filter mask with five coefficients and an image size of N × N pixels, 10 N² multiplications and 8 N² additions are required for the first pyramid level. For each following level, the computational requirement is reduced by a factor of four.
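The separable scheme above can be sketched in a few lines of Python. This is a minimal functional model, not part of the original paper: the binomial mask [1, 4, 6, 4, 1]/16 and the edge-padding are assumed examples, since the paper only states that a separable five-coefficient Gaussian mask is used.

```python
import numpy as np

def gaussian_pyramid_level(image):
    """One pyramid level: separable 5-tap filtering, then 2x subsampling.

    The binomial mask [1, 4, 6, 4, 1] / 16 is an assumed example; the text
    only fixes a separable Gaussian mask with five coefficients.
    """
    mask = np.array([1, 4, 6, 4, 1], dtype=np.float64) / 16.0
    # Horizontal pass: 5 multiplications and 4 additions per pixel.
    padded = np.pad(image.astype(np.float64), ((0, 0), (2, 2)), mode="edge")
    horiz = sum(mask[k] * padded[:, k:k + image.shape[1]] for k in range(5))
    # Vertical pass: another 5 multiplications and 4 additions per pixel,
    # giving the 10 N^2 multiplications and 8 N^2 additions quoted above.
    padded = np.pad(horiz, ((2, 2), (0, 0)), mode="edge")
    smoothed = sum(mask[k] * padded[k:k + image.shape[0], :] for k in range(5))
    return smoothed[::2, ::2]  # subsample: the next level has N/2 x N/2 pixels
```

The factor-of-four reduction per level follows directly from the subsampling: each level has one quarter of the pixels of its predecessor.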

2.2 Wavelet Transformation

The wavelet transformation [1] is employed for preprocessing in image compression because of its salient entropy reduction properties. The wavelet decomposition can also serve to compute features, e.g. for texture analysis. From the computational point of view, the forward and backward wavelet transformations are based on one-dimensional convolution. For image processing, the one-dimensional convolution is applied in both image dimensions. In applications, 4 to 20 filter coefficients are used. Assuming a maximum number of 20 coefficients and an input image size of N × N pixels, the following amount of operations is required:

Operation                                            Multiplications  Additions
2 convolutions with 20 coefficients, N × N pixels    40 N²            39 N²
4 convolutions with 20 coefficients, N × N/2 pixels  40 N²            39 N²
Total                                                80 N²            78 N²
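The row/column convolve-and-subsample structure can be sketched as follows. This illustrative model (not from the paper) uses the 2-tap Haar filter pair purely for brevity; the 4- to 20-tap filters mentioned above slot into the same structure.

```python
import numpy as np

def wavelet_level(image):
    """One 2D wavelet decomposition level using the 2-tap Haar filter pair.

    Haar keeps the sketch short; longer analysis filters (4 to 20 taps)
    would replace the two-sample combinations below.
    """
    lo = np.array([1.0, 1.0]) / np.sqrt(2.0)   # analysis low-pass
    hi = np.array([1.0, -1.0]) / np.sqrt(2.0)  # analysis high-pass

    def analyze_rows(x):
        # 1D convolution along rows followed by dyadic subsampling.
        l = x[:, 0::2] * lo[0] + x[:, 1::2] * lo[1]
        h = x[:, 0::2] * hi[0] + x[:, 1::2] * hi[1]
        return l, h

    l, h = analyze_rows(image.astype(np.float64))
    ll, lh = (a.T for a in analyze_rows(l.T))   # columns of the low band
    hl, hh = (a.T for a in analyze_rows(h.T))   # columns of the high band
    return ll, lh, hl, hh  # four N/2 x N/2 subbands
```

Applying the same decomposition recursively to the ll subband yields the next level, on a quarter of the pixels.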

2.3 Gabor Transformation

This transformation is based on the two-dimensional complex Gabor function according to [2], which is mainly determined by the orientation ϑ and the wavenumber k. The Gabor transformation is defined by the convolution operator G on an image I(x, y) as follows:

    (G_{k,ϑ} I)(x₀, y₀) = Σ_{x′} Σ_{y′} g_{k,ϑ}(x′, y′) I(x₀ + x′, y₀ + y′)    (1)

The results are usually stored as column vectors, denoted as jets J(x₀, y₀):

    J(x₀, y₀) = { (G_{k,ϑ} I)(x₀, y₀) | k ∈ K, ϑ ∈ Θ },    (2)

where K is a set of different wavenumbers and Θ a set of different orientations. The jet J(x₀, y₀) contains information about the local wavenumber spectrum with respect to different orientations at the jet position in an image. A collection of jets thus provides a good representation of image features, which can serve for robust object recognition [4] and vergence control in active 3D vision applications [2]. For a complete Gabor transformation we have to convolve the whole picture with 40 masks (8 different orientations ϑ and 5 different wavenumbers k), each containing 11 × 11 complex coefficients. Assuming an input image size of N × N pixels, the following amount of operations is required:

Operation                                            Multiplications  Additions
40 convolutions with 242 coefficients, N × N pixels  9680 N²          9640 N²
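Equations (1) and (2) can be sketched directly for a single jet position. This illustrative code is not from the paper: the exact normalization of the Gabor mask g_{k,ϑ} in [2] is not reproduced, so the common plane-wave-times-Gaussian form with an assumed envelope width sigma is used instead.

```python
import numpy as np

def gabor_kernel(k, theta, size=11, sigma=2.5):
    """Complex Gabor mask g_{k,theta} on a size x size grid (assumed form)."""
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    wave = np.exp(1j * k * (x * np.cos(theta) + y * np.sin(theta)))
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return wave * envelope

def jet(image, x0, y0, K, Theta, size=11):
    """Jet J(x0, y0): responses (G_{k,theta} I)(x0, y0) for all k, theta (eq. 2).

    Each entry evaluates the sum of eq. (1) over the 11 x 11 neighbourhood.
    """
    r = size // 2
    patch = image[y0 - r:y0 + r + 1, x0 - r:x0 + r + 1].astype(np.float64)
    return np.array([np.sum(gabor_kernel(k, t, size) * patch)
                     for k in K for t in Theta])
```

With |K| = 5 wavenumbers and |Θ| = 8 orientations, the jet has 40 complex entries; computed densely over the image, this is exactly the 40 convolutions with 242 real coefficients counted above.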

2.4 Neural Fields

The computation of neural fields according to [3, 6] and of cellular neural networks by recurrent convolution with arbitrary neighbourhood radius is also supported and requires the following amount of operations:

Operation                                             Multiplications  Additions
N × N neural field with 7 × 7 weights, one iteration  49 N²            50 N²
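One such recurrent-convolution iteration can be sketched as follows. The sigmoid nonlinearity, the resting level h, and the wrap-around border are assumed choices for illustration, since the exact update equation of [3, 6] is not spelled out above; the operation count per iteration matches the table.

```python
import numpy as np

def neural_field_step(u, weights, h=-0.5):
    """One discrete-time iteration of an N x N neural field.

    u is the field activity, weights a 7 x 7 interaction kernel, h a resting
    level. Per iteration: 49 N^2 multiplications and about 50 N^2 additions.
    """
    r = weights.shape[0] // 2  # neighbourhood radius (3 for a 7 x 7 kernel)
    acc = np.zeros_like(u)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            # Recurrent convolution with wrap-around boundary (assumed).
            acc += weights[dy + r, dx + r] * np.roll(u, (dy, dx), axis=(0, 1))
    return 1.0 / (1.0 + np.exp(-(acc + h)))  # squashing nonlinearity
```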

3 Processor Design

In this section we derive a processor architecture for image processing based on the requirements of the previously described algorithms. We focused our work on an optimum implementation of the Gabor transformation, due to its complexity and the enormous amount of required computations.

3.1 Computational Requirements

All algorithms are implemented using fixed-point number representation to avoid area-consuming floating-point arithmetic blocks. Thus, the processor is designed for fixed-point computation. According to simulation results, the following accuracies were determined for the algorithms:

Algorithm               Cache Resolution  Coefficient/Weight Resolution  Result Resolution
Gaussian filtering      8 bit             8 bit                          8 bit
Wavelet transformation  8 bit             16 bit                         8 bit
Gabor transformation    8 bit             32 bit                         32 bit
Neural fields [3, 6]    32 bit            32 bit                         32 bit
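The fixed-point scheme can be made concrete with a small sketch. This is illustrative only: the Q1.7 format and the binomial mask are assumed examples, as the paper fixes only the bit widths, not the position of the binary point.

```python
import numpy as np

def to_fixed(coeffs, frac_bits=7):
    """Quantize filter coefficients to 8-bit signed fixed point (Q1.7 here).

    The Q-format is an assumed example; the text only fixes the width
    (8-bit coefficients for Gaussian filtering).
    """
    q = np.round(np.asarray(coeffs) * (1 << frac_bits)).astype(np.int64)
    return np.clip(q, -128, 127)

def fixed_point_dot(pixels, qcoeffs, frac_bits=7):
    """Integer multiply-accumulate, then normalization by a right shift,
    mirroring the wide accumulator with right-shift normalization below."""
    acc = int(np.dot(pixels.astype(np.int64), qcoeffs))  # wide accumulator
    return acc >> frac_bits  # shift the result back to pixel scale
```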

We concentrated on an optimization of the processor architecture with respect to the acceleration of convolutions. In our architecture the convolution computation for a pixel position is sequentially carried out by a single processing element, but 64 pixel positions are concurrently processed by 64 available processing units. Figure 1 shows the corresponding processor unit architecture.
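The scheme can be illustrated with a small functional model (not from the paper, and not cycle-accurate): each of the 64 PEs owns one output pixel of an 8 × 8 block; per clock cycle, one broadcast coefficient is multiplied with the pixel under each PE's cache cell and accumulated, and the window is then shifted by one pixel.

```python
import numpy as np

def convolve_block(image, mask, bx, by):
    """Mimic the 64-PE scheme for the 8 x 8 output block at (bx, by).

    Functional sketch of the mechanism in Fig. 1: all 64 PEs see the same
    coefficient each cycle, applied to the image window shifted by (mx, my).
    """
    mh, mw = mask.shape
    acc = np.zeros((8, 8))
    for my in range(mh):
        for mx in range(mw):
            # All 64 PEs access the window shifted by (mx, my);
            # the shift costs only eight new pixels from image memory.
            window = image[by + my:by + my + 8, bx + mx:bx + mx + 8]
            acc += mask[my, mx] * window  # one coefficient per cycle
    return acc  # 64 convolution results after mh * mw cycles
```

After mh × mw cycles the block holds 64 finished convolution results, which is why the I/O bandwidth stays minimal as long as the computation exhibits convolution-type locality.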

An on-chip image cache with a size of 8 × 8 pixels stores a window of the input image, which can be shifted in two dimensions. Each memory cell corresponds to a processor element (PE) containing a multiplier and an adder. Figure 1 shows the shifter cache and one of the 64 PEs. At each clock cycle, a coefficient is multiplied with the gray value of the corresponding pixel position and the result is added to the accumulator. Then, the window is shifted by one pixel in the input image. Only eight new pixels have to be read from the image memory to the cache for the next 64 operations. After the multiplication of the last coefficient, the convolution for 64 pixels is completed. Our architecture achieves a minimum I/O bandwidth along with an optimum computation rate, provided that convolution-type locality can be assumed for the computation.

Fig. 1: Calculation of Convolution Operations

3.2 Data Path

Figure 2 shows the general structure of the SIMD-array image processor, which contains 64 PEs, a command controller, an I/O controller and a central processor unit, which controls the program flow and computes special numerical operations, e.g. division. The shifter cache is not explicitly shown here, as each of its cells is considered as integrated in the corresponding slice. For efficient minimum or maximum computation, as required for matching algorithms, a 64-input comparator tree is implemented.

Fig. 2: Processor Architecture

The internal structure of the PEs is shown in figure 3. The coefficient memory contains 72 16-bit registers. Coefficient resolutions of 8, 16 and 32 bit are feasible; for the storage of a 32-bit coefficient, two registers are required. The shifter cache is implemented with four levels, allowing the concurrent storage of and access to four different images. The pixel values can be represented as 8-bit unsigned or 16-bit signed integers. The multiplier is configurable depending on the pixel and coefficient representation, e.g. for 8 × 32 bit and 16 × 16 bit multiplications in one clock cycle, or 16 × 32 bit in two cycles and 32 × 32 bit in four clock cycles. The partial products can be added to or subtracted from the accumulator content. Finally, the accumulator, with a resolution of 64 bit, can be normalized by right shifts.

The normalized result is moved to the result memory, which consists of 52 16-bit registers. The result will usually be transported from the result memory over the I/O bus (64-bit bus D) to the external memory, but it can also be used as input for following calculations. The architecture uses three 16-bit busses (A, B, C), which allow operand access in the shifter cache levels and in the coefficient or result memories. Coefficient as well as pixel value precisions can be configured. The I/O bus is connected to the coefficient memory in order to provide an initial loading of the mask coefficients. The abbreviation ROI stands for "region of interest" and denotes a bit array, which is used for local masking of numerical and logical operations, so that computations are limited to significant image sections. This bit array can be loaded from the external memory via the I/O bus.

Fig. 3: Processor Element
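The ROI mechanism amounts to a per-pixel write-enable. The following sketch (illustrative, not from the paper) shows the functional effect; the loading of the bit array over the I/O bus is not modelled.

```python
import numpy as np

def roi_masked_add(accumulator, increment, roi):
    """Accumulate only where the ROI bit array is set.

    Functional sketch of the ROI masking described above: a per-pixel bit
    gates the numerical operation, so that computations are limited to
    significant image sections.
    """
    roi = roi.astype(bool)
    return np.where(roi, accumulator + increment, accumulator)
```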


4 Processor Verification

4.1 Modelling in Verilog-XL

The processor architecture was completely implemented in Verilog-XL. A symbolic assembler for the generation of machine code was developed to facilitate processor programming and testing. The fixed instruction length of the RISC-type processor is 32 bit. The set of 51 processor commands contains 22 complex arithmetic commands, 29 instructions for data transfers, 12 commands for controlling the program flow, and 18 initialization instructions. As an example of a complex command, the MAC-and-shift instruction mula.r.r 32x16 shr sl8 rf127 rs8 roi23 is given.

4.2 Simulation Results

For a verification of the processor model, the Gaussian filtering, wavelet and Gabor transformations were realized in assembler code. Because of the long simulation time, the input picture was limited to 64 × 64 pixels. The cycle diagrams are shown in figure 4.

Fig. 4: Cycle Diagrams for a) Gaussian Filtering, b) Wavelet Transformation, c) Gabor Transformation

A Gaussian filter mask with five coefficients was used for the simulation. For the wavelet transformation, the Haar wavelet was chosen, which leads to two coefficients. This results in a very small amount of calculations in comparison to the Gabor transformation, which requires at least 16 masks with 11 × 11 coefficients. Figures 4a and 4b show that the ratio between I/O operations and calculations is not satisfying in these cases, where the processor mainly performs I/O operations. On the contrary, for the Gabor transformation the amount of I/O transfers is negligible in comparison to the number of calculations. An optimum acceleration is achieved for applications imposing a large amount of computations. Assuming a clock frequency of 100 MHz and a 512 × 512 input image, the following processor performance is obtained²:

             Gaussian Filtering  Wavelet Transformation  Gabor Transformation
Performance  405 frames/s        123 frames/s            10 frames/s

5 Conclusion

In this work, a massively parallel SIMD-array processor architecture with 64 processing elements was proposed and simulated at the behavioral level, which provides an optimum performance for convolution-type operations. Our processor represents a real-time computation platform for a number of significant front-end real-world vision tasks [1, 2, 3, 4, 5]. The Verilog simulation of our processor proved the feasibility and the correct functionality of the implementation. A performance improvement for algorithms with small masks could be achieved in a future revision by doubling the I/O-bus size.

References

1. Daubechies, I.: Ten Lectures on Wavelets, CBMS-NSF Regional Conf. Series in Appl. Math., Vol. 61, Society for Industrial and Appl. Math., Philadelphia, 1992
2. Theimer, W. M.; Mallot, H. A.: Binocular Vergence Control and Depth Reconstruction Using a Phase Method, ICANN Proceedings 1992, pp. 517-520, Elsevier Science Publishers B.V., 1992
3. Seelen, W. v.: A Neural Architecture for Autonomous Visually Guided Robots - Results of the NAMOS Project, Fortschrittsberichte VDI, Nr. 388, 1995
4. Wiskott, L.; Malsburg, C. v.: A Neural System for the Recognition of Partially Occluded Objects in Cluttered Scenes: A Pilot Study, International Journal of Pattern Recognition and Artificial Intelligence, Vol. 7, No. 4, 1993
5. Buhmann, J.; Kühnel, H.: Vector Quantization with Complexity Costs, IEEE Transactions on Information Theory, Vol. 39, No. 4, 1993
6. Chua, L. O.; Yang, L.: Cellular Neural Networks: Theory and Applications, IEEE Transactions on Circuits and Systems, Vol. 35, No. 10, 1988, pp. 1257-1290
7. Franz, M.: Entwurf eines Faltungsprozessors, Diploma Thesis, TU Dresden, Institut für Grundlagen der Elektrotechnik und Elektronik, 1996

² The processing speed of 10 frames/s, achieved for the Gabor transformation, is a pessimistic estimate, as in the application of face recognition the jet computation is carried out only for a small number of pixel feature positions.