IMPLEMENTATION OF SIGNAL PROCESSING FUNCTIONS USING 1-BIT SYSTOLIC ARRAYS

Indexing terms: Signal processing, Systolic arrays

The use of systolic arrays of 1-bit cells to implement a range of important signal processing functions is demonstrated. Two examples, a pipelined multiplier and a pipelined bit-slice transform circuit, are given. This approach has many important implications for silicon technology, and these are outlined briefly.

Introduction: Kung and co-workers¹,² have demonstrated how arrays of identical processors, with adjacent elements connected together to form a regular lattice, can be used in the pipelined computation of a number of important matrix operations. The resulting structures, which are referred to as systolic arrays, display a number of features which make them attractive for VLSI.³ The basic processor required in most of these circuits is a latched multiply accumulator, a single chip in current technology. In this letter we show how similar ideas can be applied to cells of much lower complexity and that many important signal processing functions can be implemented, in single chip form, using arrays of such cells. In each case the cell required comprises a full adder, some simple logic and a number of latches, the function of each circuit being determined mainly by the cell interconnection topology and the resulting data flow structure. Two examples are described below: a pipelined multiplier and a pipelined bit-slice transform circuit.

Pipelined multiplier circuit: A circuit suitable for the pipelined multiplication of two continuous streams of 4-bit positive numbers is shown in Fig. 1. It comprises a diamond-shaped array of 16 latched, gated full adder cells, each of which is connected only to its nearest neighbours. Heavy dots have been used to represent the latches, while dashed lines indicate the flow of data in the circuit.

The operation of each cell is illustrated in Fig. 1. On each cycle of a clock it receives inputs a, b, s', c' from neighbouring cells: a and b represent individual bits of the two numbers to be multiplied, s' is one bit of the accumulating sum of partial products and c' is a bit carried in from the previous stage of the calculation. The resulting value of s, the corresponding carry bit, c, and the input bits, a and b, are all stored in latches and then passed on to neighbouring cells on the next clock cycle. The value of s' is also latched on input to each cell.

The nth pair of numbers a(n) and b(n) to be multiplied are input to the circuit along the upper edges of the array with their constituent bits $a_k(n)$ and $b_k(n)$ ($k = 0, \ldots, 3$) staggered by means of external latches, as shown. The most significant bit of a(n) (i.e. $a_3(n)$) and the least significant bit of b(n) (i.e. $b_0(n)$) enter the circuit on one clock cycle, the second most significant bit of a(n) and the second least significant bit of b(n) on the next clock cycle and so on. This ensures that as each bit of a(n) moves across the array it meets every bit of b(n), one at each of the cells which it crosses. The kth bit of each partial product $a_{k-l}(n)\,b_l(n)$ is formed on one of the cells within the kth vertical column, and the kth bit of the product is formed by letting these components accumulate as $s_k(n)$ passes down that column. The sum bits $s'_k(n)$ and carry bits $c'_k(n)$ which enter the array boundaries are set equal to zero, and the carries which are generated at each stage within the array are simply passed to the left (the next most significant column) on the next clock cycle.

Any residual carry bits which leave the diamond-shaped interaction region across its lower boundary must be added into the sum of partial products in order to complete the multiplication. This can be accomplished by adding to the basic circuit an extra triangular region of cells as shown in Fig. 1. The least significant bit of a given product s(n) (i.e. $s_0(n)$) is obtained five clock cycles after the operation begins, the next most significant bit on the sixth clock cycle and so on, until the final bit (i.e. $s_7(n)$) emerges on the twelfth clock cycle. Although the latency of the device is twelve clock cycles, it is possible to start a new multiplication and to complete the product of a previous one every clock cycle.

The circuit shown in Fig. 1 is a pipelined carry-save multiplier, but it differs from previously proposed carry-save devices⁴,⁵ in one important respect. Each bit of b(n) interacts with only one bit of a(n) on a given clock cycle and so no broadcasting of data takes place. As discussed in the final Section, this has significant implications for detailed chip design, although it means that five latches per cell are required compared with three in previous circuits.

Fig. 1 Pipelined multiplier array
Cell function: s ← s' ⊕ (a·b) ⊕ c'; c ← (a·b)·s' + (a·b)·c' + s'·c'

ELECTRONICS LETTERS 18th March 1982 Vol. 18 No. 6 p. 241

Pipelined bit-slice transform circuit: A general n-point matrix × vector transform can be expressed in the form Wx = y, where W is an n × n matrix, x is an n-element vector of data values and y is an n-element output vector. In general, the elements of W will be multibit words, but in this example we assume that they can only take the values 1 or 0 and can therefore be represented in 1-bit form. A general m-bit transform could, of course, be implemented using m of these 'bit-slice' circuits in parallel. A systolic array which can perform a pipelined four-point bit-slice transform (i.e. n = 4) is shown in Fig. 2.
It consists of an array of processor cells, whose logic function is identical to that used in the multiplier.
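The cell function referred to here is the gated full adder given with Fig. 1. As a rough functional sketch only, it can be written out and checked against ordinary multiplication. The sketch is not a cycle-accurate model of the array: the latches and the staggering of the data are ignored, the triangular carry-absorbing region is approximated by extra rows whose b input is forced to zero, and the names `cell` and `multiply_carry_save` are illustrative rather than taken from the letter.

```python
def cell(a, b, s_in, c_in):
    """Gated full adder: add the gated bit a.b to the incoming s' and c'."""
    p = a & b                                      # partial-product bit (AND gate)
    s = s_in ^ p ^ c_in                            # s <- s' XOR (a.b) XOR c'
    c = (p & s_in) | (p & c_in) | (s_in & c_in)    # c <- (a.b).s' + (a.b).c' + s'.c'
    return s, c


def multiply_carry_save(a, b, nbits=4):
    """Untimed carry-save product of two nbits-wide positive integers.

    Assumption: one row of cells per bit of b, one column per bit of the
    product. This mirrors the arithmetic of the array, not its timing.
    """
    width = 2 * nbits
    s = [0] * width    # sum bits, indexed by column (bit significance)
    c = [0] * width    # carry bits waiting to enter each column
    row = 0
    # Rows 0..nbits-1 inject the partial products. Any further rows (b bit
    # forced to 0) stand in for the region that absorbs residual carries.
    while row < nbits or any(c):
        b_bit = (b >> row) & 1 if row < nbits else 0
        new_s, new_c = [0] * width, [0] * width
        for k in range(width):
            a_bit = (a >> (k - row)) & 1 if 0 <= k - row < nbits else 0
            new_s[k], carry = cell(a_bit, b_bit, s[k], c[k])
            if k + 1 < width:
                new_c[k + 1] = carry    # carry moves to the next column up
        s, c = new_s, new_c
        row += 1
    return sum(bit << k for k, bit in enumerate(s))
```

Because each cell satisfies s' + a·b + c' = s + 2c, every row adds exactly one shifted partial product to the running total, so `multiply_carry_save(13, 11)` returns 143.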

However, in this case the cells are interconnected on an orthogonal lattice. The data words $x_i$ are input to the array from the left on every second clock cycle with successive bits staggered. The least significant bit, $x_i^{(0)}$, enters the array one clock cycle ahead of the next significant bit, and so on. The bits move one cell to the right every clock cycle. For this example we have assumed that the incoming data are three-bit two's complement numbers which have been sign extended to five-bit two's complement form (the range of the answer) before entering the array. Bits representing the elements of the matrix W are organised so that they move down the array in the vertical direction as shown. The output words $y_i$ are initialised to zero on entering the array and move from right to left with bits similarly staggered. This means that the kth bit of a word, $y_i^{(k)}$, meets all of the terms required to form the sum

$$y_i^{(k)} = \sum_{j=1}^{n} W_{ij}\, x_j^{(k)}$$

where $n = 4$.
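As a hedged illustration of the bit-level sum above, the following sketch accumulates each output word one bit at a time, least significant bit first, through a row of n cells. It assumes that each cell holds its carry until the next more significant bit of the same word arrives, which is one reading of the carry handling in the array. Clock timing and data stagger are not modelled, and `cell` and `bit_slice_transform` are hypothetical names.

```python
def cell(w, x, y_in, c_in):
    """Same gated full adder as in the multiplier, with inputs w and x."""
    p = w & x                                      # product bit from the AND gate
    y = y_in ^ p ^ c_in                            # sum bit passed along the row
    c = (p & y_in) | (p & c_in) | (y_in & c_in)    # carry kept for the next bit
    return y, c


def bit_slice_transform(W, x, out_bits=5):
    """Compute y = W x for a 0/1 matrix W, one bit level at a time.

    x holds signed integers; they are sign extended to out_bits-bit
    two's complement, which must cover the range of the answer.
    """
    n = len(x)
    mask = (1 << out_bits) - 1
    x_ext = [v & mask for v in x]      # two's complement bit patterns
    result = []
    for i in range(n):
        carry = [0] * n                # one stored carry per cell in the row
        word = 0
        for k in range(out_bits):      # least significant bit first
            y_bit = 0                  # output bit enters the array as zero
            for j in range(n):
                x_bit = (x_ext[j] >> k) & 1
                y_bit, carry[j] = cell(W[i][j], x_bit, y_bit, carry[j])
            word |= y_bit << k
        if word >> (out_bits - 1):     # reinterpret the top bit as a sign bit
            word -= 1 << out_bits
        result.append(word)
    return result
```

With `W = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 1, 1], [0, 0, 0, 1]]` and `x = [3, -4, 2, -1]`, the call `bit_slice_transform(W, x)` returns `[4, -2, 0, -1]`, matching the direct matrix × vector product.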

The product $W_{ij}\,x_j^{(k)}$ is formed by passing these bits through the AND gate. Any carry bits which are generated in the course of this summation are latched vertically downwards as shown. Again, the carry-save principle is utilised; hence the reason for the stagger on the bits of the words $x_i$ and $y_i$. Having traversed the array, the output words $y_i$ are completely transformed and one bit of each significance level, $y_i^{(k)}$, emerges on every second clock cycle.

Fig. 2 Pipelined bit-slice transform array
Cell function: y ← y' ⊕ (w·x) ⊕ c'; c ← (w·x)·y' + (w·x)·c' + y'·c', where w = 1 or 0

Discussion: Both the circuits described above comprise a regular array of simple one-bit processor cells. Similar structures may be used to implement other signal processing functions,⁶ such as convolution and correlation. In particular, the Walsh transform operation (where the elements of W have the values −1 or +1) may be carried out on an array which is almost identical to that described in the preceding Section.

Circuits of this type have a number of important features which are extremely attractive for current, as well as future, silicon technology. In all circuits, interconnections between cells are short, geometrical and between nearest neighbours only. Long meandering data paths are thus avoided, allowing much more efficient use of active silicon area. Stray capacitances are also reduced to an absolute minimum, with the result that large drivers are not required to propagate data. Short interconnections also mean that transmission-line problems, which are becoming increasingly important as we move towards VLSI,³ are avoided. Since the circuits are pipelined at bit level, the maximum clock frequencies obtainable should be determined simply by the gate delay in a cell and should not depend on the number of cells. This in turn should give high data throughput rates.

Another extremely attractive feature of these circuits is their regularity. In all cases the design of complete circuits is factorised down to the design of a simple cell. This means that the circuit analysis, layout and subsequent testing of such chips should be reasonably straightforward. Moreover, as the structures naturally expand to any dimension, single chip modules could be cascaded together with the minimum of interface problems. Finally, it should be noted that the multidirectional data flow inherent in a 'systolic' system means that the only control circuitry required is a two-phase clock.

J. V. McCANNY
J. G. McWHIRTER
Royal Signals & Radar Establishment
St. Andrews Road
Great Malvern, Worcs. WR14 3PS, England

25th January 1982

References

1 KUNG, H. T.: 'The structure of parallel algorithms' in YOVITS, M. C. (Ed.): 'Advances in computers' (Academic Press, NY, 1980, Vol. 19)
2 KUNG, H. T., and LEISERSON, C. R.: 'Algorithms for VLSI processor arrays' in 'Introduction to VLSI systems' (Addison-Wesley, 1980), Chap. 8
3 MEAD, C. A., and CONWAY, L. A.: 'Introduction to VLSI systems' (Addison-Wesley, 1980)
4 JUMP, J. R., and AHUJA, S. R.: 'Effective pipelining of digital systems', IEEE Trans., 1978, C-27, pp. 855-865
5 HWANG, K.: 'Computer arithmetic: principles, architecture and design' (Wiley, New York, 1979), Chap. 6
6 MCCANNY, J. V., and MCWHIRTER, J. G.: 'Integrated circuits'. UK Patent Application 8125222, filed 18th August 1981

0013-5194/82/060241-03$1.50/0

SEMICONDUCTOR ALLOY RESONATOR-TYPE HF ISOLATORS

Indexing terms: Isolators, Semiconductor devices and materials

The isolators with helicon wave resonators made from doped n-InSb, $\mathrm{Cd}_x\mathrm{Hg}_{1-x}\mathrm{Te}$ and $\mathrm{Bi}_{1-x}\mathrm{Sb}_x$ are shown to widen the operating temperature range, to decrease the magnet system dimensions and to decrease the forward loss to 1.5 dB. The ranges of operating frequencies and temperatures for various types of resonators are discussed.

In a number of works the possibility of using semiconducting isolators as separate elements of radiotechnical networks has been considered.¹⁻⁴ The operating principle of such isolators is based on the dimensional resonance of HF helicon solid-state plasma waves. The semiconducting isolators are unique passive devices in the frequency band f = 0.5-400 MHz. Ferrite isolators are used at microwave frequencies. The main element of semiconducting isolators is a helicon resonator in the form of a thin semiconducting sample of thickness d. Up to now intrinsic InSb was used as the semiconducting resonator material. Because of the high mobility μ of electrons at room temperature, small helicon attenuation and negligible losses in the isolator can be achieved at relatively low magnetic induction, i.e. μB > 1 holds.

So far such semiconducting isolators have not found a wide application, for the following reasons. First, the isolators have a narrow frequency band and they become detuned by insignificant temperature variation because of the large temperature dependence of the electron concentration n and the related resonance frequency:

(1)

Secondly, intrinsic InSb with low electron concentration ($n \approx 14 \times 10^{22}\ \mathrm{m^{-3}}$ at 300 K) demands a comparatively large resonator thickness in the frequency range f < 20 MHz; this leads to an increase in the dimensions of the magnetic system in order to obtain the required induction in the working gap where the semiconducting resonator is located.

The possibility of applying donor-doped n-InSb was also considered.⁴ Doping increases electron concentration and reduces resonator temperature dependence; however, it also leads to a significant decrease of μ, and, as a consequence, stronger magnetic fields are necessary, and thus the dimensions of the magnetic system even increase.

It follows from the above that it is of interest to investigate the possibility of using semiconducting resonators made from narrow-gap semiconducting alloys which have high electron mobility under high electron concentrations. In the present work the isolators made from $\mathrm{Cd}_x\mathrm{Hg}_{1-x}\mathrm{Te}$, $\mathrm{Bi}_{1-x}\mathrm{Sb}_x$ and doped n-InSb are considered. Their comparison with the isolator made from intrinsic InSb is presented.

The isolators were constructed as double π filters² according to the principal scheme given by Fig. 1. The measurements of semiconducting plate resonators of various types, chosen in accordance with eqn. 1, were performed with the following semiconducting materials: (a) intrinsic InSb resonator, T = 300 K, (a₁) doped InSb resonator, T = 300 K, (b) $\mathrm{Cd}_x\mathrm{Hg}_{1-x}\mathrm{Te}$ resonator, T = 300 K and (c) $\mathrm{Bi}_{1-x}\mathrm{Sb}_x$ resonator, T = 11 K.

Fig. 1 Principal scheme of semiconductor isolator
A: semiconductor core

Fig. 2 Calculated (dashed curves) and experimental (points) forward and reverse loss of isolators of different type
Horizontal axis: magnetic field, T. Point markers distinguish resonators a, a₁, b and c.