On Customized Decimation Filter Implementation

Lirida A. de B. Naviner and Jean-François Naviner
GET/Télécom Paris - CNRS/LTCI, Département Communications et Electronique
46 rue Barrault, 75013 Paris, France
Email: {lirida.naviner, naviner}@enst.fr

Abstract: This paper deals with the customized implementation of decimation processors. Important design-flow decisions are considered from the point of view of their hardware impact. Several implementation approaches are presented and evaluated in terms of timing and basic operator requirements, together with a summary of the steps involved in an ad hoc implementation.

Keywords: Decimation filter, dedicated processors, hardware implementation, digital filter architecture.

The paper is organized as follows. Section 2 discusses the factors that affect the performance and cost of the filter implementation. Section 3 presents some special filters that are well suited to hardware implementation. Section 4 concerns matching the architecture to the required computation power. Several approaches to the granularity of the calculations are reviewed in Section 5. Section 6 summarizes the proposed steps of a customized filter implementation procedure. Finally, some conclusions are outlined in Section 7.

1. Introduction

Decimation is the conversion of the sampling rate from high to low, a necessary task in many applications [1–3]. For example, it has to be done in radio receivers before the signal is processed further for decoding (error detection and correction, equalization) in order to minimize power consumption, since the master clock then runs at a much lower rate [4, 5].

This paper deals with the customized implementation of decimation processors, which results from ad hoc filter design, structure, and architecture. Many papers on filter design are found in the literature [6, 7], and some filter implementations are reported in [8–11]. Many applications require linear phase, which is easily obtained only with FIR (Finite Impulse Response) filters. The structures and architectures in this work are devoted to this family of filters.

A non-recursive filter implementation can be seen as the calculation of the inner product $y$ of two vectors, $\vec{h}$ (the impulse response of the filter) and $\vec{x}$ (the signal data samples):

$$y = \vec{h} \cdot \vec{x} \tag{1}$$

$$\vec{h} = [\,h_0 \;\; h_1 \;\; \cdots \;\; h_i \;\; \cdots \;\; h_{N-2} \;\; h_{N-1}\,] \tag{2}$$

$$\vec{x} = [\,x_0 \;\; x_1 \;\; \cdots \;\; x_i \;\; \cdots \;\; x_{N-2} \;\; x_{N-1}\,] \tag{3}$$

The filter output at $t = n \times T_s$ (named the actual output and denoted $y_n$) is given by:

$$y_n = \sum_{i=0}^{N-1} h_i \cdot x_i \tag{4}$$

where $x_i = x[(n-i) \times T_s]$ and $T_s$ is the sampling period.
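As a minimal sketch of (4) in a decimation context, the Python fragment below computes only every $M$-th output of the inner product. The function name `decimate_fir`, the zero initial state, and the sample values are assumptions made for illustration; they are not taken from the paper.

```python
def decimate_fir(x, h, M):
    """Return y[n*M] = sum_i h[i] * x[n*M - i], following equation (4).

    Samples before the start of x are taken as zero (an assumption).
    """
    N = len(h)
    out = []
    for n in range(0, len(x), M):      # only every M-th output is computed
        acc = 0
        for i in range(N):
            if n - i >= 0:
                acc += h[i] * x[n - i]
        out.append(acc)
    return out


# Hypothetical 4-tap filter with integer coefficients (exact arithmetic)
h = [1, 1, 1, 1]
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = decimate_fir(x, h, M=2)
print(y)
```

Skipping the outputs that the downsampler would discard is what makes the allowed time per inner product grow with $M$.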

2. Filter Implementation Issues

2.1. The computation power

A very important parameter in the choice of a filter implementation is the computation power $P_c$, defined as the ratio between the number of operations to be performed, $N_{OP}$, and the time allowed for them, $T_A$:

$$P_c = \frac{N_{OP}}{T_A} \tag{5}$$

The operations counted in (5) depend on the granularity of the calculation. For example, an operation may be a multiplication (MULT) or an addition of $k$ operands (ADDk). $T_A$ is normally given in seconds or as a multiple of a relevant cycle period $T_{cycle}$.

Each approach described in this paper leads to an architecture and therefore to an intrinsic computation power $P_{ci}$. In the same way, each architecture leads to an intrinsic implementation cost $C_i$. A given approach should be chosen by finding the best match between the supplied computation power $P_{ci}$, the required computation power $P_{cr}$, and the intrinsic cost $C_i$. Structure and architecture choices are guided by the properties of the target hardware.

The next sections present some architectural approaches for customized sum-of-products implementation. We consider the direct structure, but the analysis may easily be applied to solutions based on the equivalent transposed or polyphase structures. With $M$ denoting the subsampling factor, the time allowed for the calculation of each filter output $y_{nM}$ is:

$$T_{inner} = M \times T_s \tag{6}$$

Two's-complement representation is assumed for the filter coefficients and the data samples, on $p_h$ and $p_x$ bits, respectively. Therefore, they can be written as:

$$h_i = -h_{i,p_h-1}\, 2^{p_h-1} + \sum_{k=0}^{p_h-2} h_{i,k}\, 2^k \tag{7}$$

$$x_i = -x_{i,p_x-1}\, 2^{p_x-1} + \sum_{k=0}^{p_x-2} x_{i,k}\, 2^k \tag{8}$$
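The weighting in (7) and (8) can be checked with a short Python sketch. The helper names `to_bits` and `twos_complement_value` are hypothetical, and integer (rather than fractional) scaling is assumed for simplicity.

```python
def to_bits(value, p):
    """Two's-complement bits of an integer in [-2**(p-1), 2**(p-1) - 1], LSB first."""
    return [(value >> k) & 1 for k in range(p)]


def twos_complement_value(bits):
    """Evaluate (8): bits[k] has weight 2**k, except the last (sign) bit, weighted -2**(p-1)."""
    p = len(bits)
    return -bits[p - 1] * 2 ** (p - 1) + sum(bits[k] * 2 ** k for k in range(p - 1))


bits = to_bits(-3, 4)                 # LSB first
value = twos_complement_value(bits)   # recovers the original integer
print(bits, value)
```

The negative weight of the most significant bit is the property that the r-bit and distributed arithmetic approaches of Section 5 rely on.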

The cost evaluations of the approaches are presented in Tables 1 to 4. The propagation time $T_p$ and the minimum period of the clock signal, $T_{clk}$, are also given for each approach.

2.2. Single-stage and multistage structures

The specification of a filter naturally leads to a single-stage structure. Supposing that the length of this single non-recursive filter is $N$, the computation power associated with filtering for a decimation by a factor $M$ can be approximated by

$$P_c = N \times \frac{f_s}{M}, \quad \text{with } f_s = \frac{1}{T_s} \tag{9}$$

An alternative structure is based on a cascade of several filters with lengths $N_1, N_2, \cdots, N_K$ associated with decimation factors $M_1, M_2, \cdots, M_K$ [12]. Equivalence of the structures can be obtained if the filter lengths satisfy $N = N_1 + N_2 M_1$, which leads to a computation power given by

$$P_c = N_1 \times \frac{f_s}{M_1} + N_2 \times \frac{f_s}{M_1 M_2}, \quad \text{with } f_s = \frac{1}{T_s} \tag{10}$$

Due to the relationship between $N$ and the lengths of the filters in the cascade, the computation power in (10) is lower than that in (9). In fact, the more stages there are in the cascade, the shorter the filters in the cascade need to be. The computation power for a cascade of $K$ filters that decimates an input signal by a factor $M = M_1 M_2 \cdots M_K$ is:

$$P_c = \sum_{i=1}^{K} \left( \frac{N_i}{\prod_{j=0}^{i-1} M_j} \right) \times f_s, \quad \text{with } M_0 = 1 \tag{11}$$

2.3. Computation noise

This noise results from an insufficient number of bits to represent the results of calculations. If truncation or rounding to $p_1$ bits is carried out after a calculation, the noise variance associated with the quantization error is given by (12). Notice that the variance at an adder output is the sum of the variances at the adder inputs. So, the effect of quantization on each product $h_i x_i$ is multiplied by $N$ in an $N$-tap filter. For this reason, it is preferable to quantize only at the output of the filter, that is, after the multi-operand addition.

$$\sigma_e^2 = \frac{2^{-2p_1}}{3} \tag{12}$$

Computation noise has an impact on the signal-to-noise ratio (SNR) that depends on the statistics of the signal $y$, as seen in (13), which is the general expression. With the quantization noise given by (12), the SNR can be expressed as (14).

$$SNR = 10 \log_{10} \frac{E\{y^2\}}{\sigma_e^2} \tag{13}$$

$$SNR = 10 \log_{10} E\{y^2\} + 6.02\, p_1 + 4.77 \tag{14}$$

The more powerful the signal is, the better the SNR is. On the other hand, an increase in signal power is accompanied by an increase in overflow probability. Hence, a tradeoff needs to be found between the power of the signal $y$ and the overflow probability for a given number of bits. For example, by scaling the filter coefficients by a factor $G_s$, the filter gain, and therefore the power of its output $y$, change in the same proportion.

2.4. Quantization noise on coefficients

If coding on $p_h$ bits is not enough for an exact representation of the coefficient values, the filter output contains an approximation $\hat{y}_n$ of the expected value $y_n$, given by:

$$\hat{y}_n = \sum_{i=0}^{N-1} \hat{h}_i \cdot x_i \tag{15}$$

$$\hat{y}_n = \sum_{i=0}^{N-1} (h_i + h_{ei}) \cdot x_i \tag{16}$$

The frequency response of the system can be seen as the sum of an ideal frequency response $H(f)$ and a spurious frequency response $H_e(f)$:

$$\hat{H}(f) = H(f) + H_e(f) \tag{17}$$

Rounding the coefficients to $p_h$ bits leads to $|H_e| \le N \times 2^{-p_h}$ and therefore to the maximal stopband attenuation given in (18). This bound shows that a tradeoff exists between the filter length and the number of bits used to represent the coefficients in order to ensure a minimal stopband attenuation.

$$A_{max} = 20 \log_{10} |H_e(f)| \le 20 \log_{10} \left( N \times 2^{-p_h} \right) \approx 20 \log_{10} N - 6\, p_h \ \text{dB} \tag{18}$$
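The bound in (18) can be checked numerically with the following Python sketch. It assumes coefficients in $[-1, 1)$ coded as $p_h$-bit two's-complement fractions (step $2^{-(p_h-1)}$, so the rounding error per coefficient is at most $2^{-p_h}$); the coefficient values and the helper names are hypothetical.

```python
import cmath
import math


def quantize(h, p_h):
    """Round each coefficient to a p_h-bit two's-complement fraction (step 2**-(p_h-1))."""
    step = 2.0 ** -(p_h - 1)
    return [round(c / step) * step for c in h]


def freq_response(h, f):
    """Frequency response at normalized frequency f (cycles per sample)."""
    return sum(c * cmath.exp(-2j * math.pi * f * i) for i, c in enumerate(h))


h = [0.02, 0.11, 0.23, 0.28, 0.23, 0.11, 0.02]   # hypothetical low-pass coefficients
p_h = 8
h_hat = quantize(h, p_h)
h_err = [q - c for q, c in zip(h_hat, h)]        # the error terms of (16)

bound = len(h) * 2.0 ** -p_h                     # N * 2**-p_h
worst = max(abs(freq_response(h_err, k / 200)) for k in range(101))
bound_db = 20 * math.log10(bound)                # right-hand side of (18)
print(worst <= bound, round(bound_db, 2))
```

Evaluating the spurious response on a frequency grid only samples $|H_e(f)|$, but the bound itself holds at every frequency because $|H_e(f)|$ never exceeds the sum of the individual coefficient errors.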

3. Simplified multiplications and multiplier-free implementation

3.1. With CSD coefficient representation

Canonic Signed Digit (CSD) is an alternative representation of signed numbers in which each digit takes the value +1, 0, or −1 (denoted by '1', '0', and '$\bar{1}$', respectively). It is obtained by replacing each sequence of $k$ consecutive '1's in the classical binary representation by a sequence of length $k+1$ containing one 1, one $\bar{1}$, and $k-1$ zeros, in order to generate a representation with the smallest number of non-zero digits. For example, "0111" is coded as '$100\bar{1}$'. This extension of the binary representation minimizes the number of partial products needed for a multiplication between two numbers. If CSD is used to code the filter coefficients, each product $m_i = h_i x_i$ reduces to a few additions, subtractions, and shift operations. An algorithm for the design of digital filters based on CSD is given in [13].
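The recoding and the resulting shift-and-add product can be sketched in Python as follows. The digit recurrence used here is the usual non-adjacent-form one, which is one common way to obtain CSD digits and is not taken from the paper; the function names are hypothetical and integer coefficients are assumed.

```python
def to_csd(value):
    """CSD digits (each -1, 0 or +1) of an integer, least significant digit first."""
    digits = []
    while value != 0:
        if value % 2 == 0:
            digits.append(0)
        else:
            d = 2 - (value % 4)    # +1 when value = 1 mod 4, -1 when value = 3 mod 4
            digits.append(d)
            value -= d
        value //= 2
    return digits


def csd_multiply(x, digits):
    """Multiply x by the CSD-coded coefficient using only shifts, adds and subtracts."""
    acc = 0
    for k, d in enumerate(digits):
        if d == 1:
            acc += x << k
        elif d == -1:
            acc -= x << k
    return acc


digits = to_csd(7)                  # binary 0111
product = csd_multiply(5, digits)   # 5 * 7 with one subtraction and one addition
print(digits[::-1], product)
```

For the coefficient 7, the three additions implied by "0111" are replaced by a single shifted addition and a single subtraction.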

3.2. With coefficient values given as powers of 2

If all coefficient values can be written as $2^j$ with $j \in \mathbb{Z}$, only additions of suitably shifted signal samples are necessary to calculate the filter output.

3.3. With sinc filters

A special case of multiplier-free filter is the sinc filter, whose transfer function is given by (19), where $M$ is an integer. When $M$ is a power of 2, the implementation can be done as described in the previous paragraph. Moreover, for any integer $M$, the sinc transfer function can be rewritten in such a manner that the numerator $(1 - z^{-M})$ and the denominator $(1 - z^{-1})$ are separated [14]. The denominator is followed by the downsampling ($\downarrow M$) and the numerator becomes $(1 - z^{-1})$, leading to an implementation based on only two registers, one adder, and one subtractor. The changes in the downsampling position and in the numerator expression follow from the commutative rule described in [12].

$$H(z) = \frac{1}{M} \sum_{i=0}^{M-1} z^{-i} = \frac{1}{M}\, \frac{1 - z^{-M}}{1 - z^{-1}} \tag{19}$$

3.4. With half-band filters

The frequency response of a Half-Band (HB) filter is symmetric with respect to one quarter of the sampling frequency [15]. From an application point of view, this constraint implies that HB filters are suitable for decimation by a factor $M = 2$. From an implementation point of view, the symmetry means that in an $N$-tap HB filter,

• if $N$ is odd, $h_{\frac{N-1}{2}}$ is equal to 0.5 and there are $\frac{N-1}{2}$ coefficients equal to zero.
• if $N$ is even, there are $\frac{N}{2}$ coefficients equal to zero.

If we take into account the symmetric impulse response characteristic of all linear-phase FIR filters, an $N$-tap HB filter requires only $\frac{N}{4}$ multiplications to evaluate $y_n$ in (4).

4. Adapting architecture and required computation powers

Figure 1: (a) Combinatorial processor and improvement of its computation power by (b) multiplexing and (c) pipelining.

4.1. By multiplexing

Multiplexing improves the perceived computation power of a processor by increasing the number of processors. Figure 1(b) presents an example for a multiplexing factor of 2. Supposing that the clock period is $T_{clk}$, each processor has $2 \times T_{clk}$ to perform its task. A control unit alternately supplies input data to processor $P_1$ and processor $P_2$. An improvement of the computation power by a factor $K$ is obtained with a multiplexing factor of $K$. The cost impact of multiplexing processors by a factor $K$ is approximated by $S_{KP} = K \times S_P$, where $S_P$ is the cost of the single processor $P$.

4.2. By pipelining

Pipelining improves the perceived computation power by inserting registers in order to cut down the combinatorial propagation time, as illustrated in Figure 1(c). If the combinatorial propagation time in (a) is $T_{p(P)}$, the perceived propagation time after register insertion, $T_{pipe}$, is the maximum of $\{T_{p1}, T_{p2}\}$. So input data can be supplied on a $T_{pipe} < T_{clk}$ time base. An improvement of the perceived computation power by a factor $K$ is obtained by splitting the processor into $K$ partial processors with $T_{pi} = \frac{T_p}{K}\ \forall i$. The cost impact of this improvement is $\sum_{i=1}^{K} S_{Ri}$, where $S_{Ri}$ is the cost of the register inserted after partial processor $i$.

5. Calculations granularity

Granularity refers to the degree of parallelism of the processing, both across the products $h_i x_i$ and within each input signal sample. To explain the former, we consider two cases: all products processed in parallel (parallel approach) or only one product processed at a time (sequential approach). Of course, these are "extreme" cases and the optimum (probably intermediate) choice is obtained with respect to the required computation power. The degree of parallelism for the signal samples concerns the number of bits of each sample considered at a time: $r$ bits (as in the r-bit representation approach) or one bit (as in the distributed arithmetic approach). For this explanation, we consider a parallel approach for the product processing.

5.1. For the sum of the products $h_i x_i$

5.1.1. Parallel approach

It consists of calculating all the products $m_i = h_i x_i$ simultaneously. Figure 2 illustrates the parallel approach implementation. $N$ registers store and supply the $N$ relevant samples. A multi-operand adder calculates $y = \sum_{i=0}^{N-1} m_i$. The time constraints of this structure for an $N$-tap filter and a decimation factor $M$ are:

$$T_p = T_{ADD2} + T_{MULT} + T_{ADDN} + T_{REG} \tag{20}$$

$$T_{clk} \le M \times T_s \tag{21}$$

Figure 2: Parallel approach implementation.

Table 1: Cost estimation for the parallel approach.

| OPERATOR | Number | Datawidth |
|---|---|---|
| REG | $N$ | $p_x$ |
| ADD2 | $N/2$ | $p_x$ |
| MULT | $N/2$ | $(p_h,\ p_x + 1)$ |
| ADD$_{N/2}$ | 1 | $p_h + p_x + \log_2(N)$ |
| REG | 1 | $p_h + p_x + \log_2(N)$ |

5.1.2. Sequential approach

It calculates the products $m_i = h_i x_i$ sequentially. This is a DSP-like (Digital Signal Processor) approach in the sense that the operative part consists of only a multiplier/accumulator (see Figure 3) and two selectors that supply the relevant data (signal sample $x_i$ and filter coefficient $h_i$) under the control of a controller. The time constraints of this structure for an $N$-tap filter and a decimation factor $M$ are:

$$T_p = T_{ADD2a} + T_{MULT} + T_{ADD2b} + T_{REG} \tag{22}$$

$$T_{clk} \le \frac{2M}{N} \times T_s \tag{23}$$

Figure 3: Sequential approach implementation.

Table 2: Cost estimation for the sequential approach.

| OPERATOR | Number | Datawidth |
|---|---|---|
| REG | $N$ | $p_x$ |
| ADD2a | 1 | $p_x$ |
| MULT | 1 | $(p_h,\ p_x + 1)$ |
| ADD2b | 1 | $p_h + p_x + \log_2(N)$ |
| REG | 1 | $p_h + p_x + \log_2(N)$ |

5.2. For the input signal samples $x_i$

5.2.1. r-bit signal representation approach

Let $r$ be an integer satisfying $r = \lceil p_x / M \rceil$, with $M$ the integer decimation factor, where $\lceil a \rceil$ means rounding $a$ to the nearest integer greater than or equal to $a$. We can calculate the inner product $\vec{h} \cdot \vec{x}$ in $M$ steps by taking into account only $r$ new bits of each $x_i$ in each step. Let the product $h_i x_i$ be written as:

$$h_i \cdot x_i = h_i \cdot \sum_{m=0}^{M-1} x_i^m\, 2^{mr} \tag{24}$$

where $x_i^m$ is a signed number related to the range of bits of $x_i$ given by

$$[\,x_{i,(m+1)r-1} \;\; \cdots \;\; x_{i,mr+1} \;\; x_{i,mr}\,] \tag{25}$$

and to the decimal value given by

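$$x_i^m = \left( -x_{i,(m+1)r-1}\, 2^{r-1} + \sum_{k_1=0}^{r-2} x_{i,mr+k_1}\, 2^{k_1} \right) + x_{i,mr-1}\, 2^0 \tag{26}$$

with $x_{i,-1} = 0$. The decimal value of $x_i^m$ is obtained by considering the equality

$$x_{i,mr-1}\, 2^{mr-1} = -x_{i,mr-1}\, 2^{mr-1} + x_{i,mr-1}\, 2^{mr} \tag{27}$$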
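The decomposition in (24) to (27) can be verified with a small Python sketch. The helper names are hypothetical, samples are assumed to be integers in two's complement, and when $r \times M$ exceeds $p_x$ the sketch relies on sign extension (which Python's arithmetic right shift provides).

```python
def bit(x, k):
    """Bit k of the two's-complement integer x; k = -1 yields 0, as stated after (26)."""
    return 0 if k < 0 else (x >> k) & 1


def rbit_digits(x, r, M):
    """Signed digits x^m for m = 0..M-1, following (26)."""
    digits = []
    for m in range(M):
        group = -bit(x, (m + 1) * r - 1) * 2 ** (r - 1)
        group += sum(bit(x, m * r + k1) * 2 ** k1 for k1 in range(r - 1))
        digits.append(group + bit(x, m * r - 1))
    return digits


def product_in_steps(h, x, r, M):
    """Accumulate h * x in M steps, one r-bit digit of x per step, as in (24)."""
    acc = 0
    for m, d in enumerate(rbit_digits(x, r, M)):
        acc += (h * d) << (m * r)
    return acc


digits = rbit_digits(-77, r=2, M=4)      # p_x = 8 bits split into 4 digits
result = product_in_steps(5, -77, r=2, M=4)
print(digits, result)
```

Each digit lies in $[-2^{r-1}, 2^{r-1}]$, so it needs $r + 1$ bits, which matches the $r + 1$ multiplier input width listed for this approach.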

Figure 4 shows an example of an r-bit approach implementation. The multipliers are designed to deal with the $x_i^m$ data (that is, sample data coded on $r$ bits). The time constraints of this structure for an $N$-tap filter and a decimation factor $M$ are:

$$T_p = T_{ADD2a} + T_{MULT} + T_{ADD_{N/2}} + T_{ADD2b} + T_{REG} \tag{28}$$

$$T_{clk} \le T_s \tag{29}$$

Figure 4: r-bit approach implementation.

Table 3: Cost estimation for the r-bit approach.

| OPERATOR | Number | Datawidth |
|---|---|---|
| REG | $N$ | $r$ |
| ADD2 | $N/2$ | $r$ |
| MULT | $N/2$ | $(p_c,\ r + 1)$ |
| ADD$_{N/2}$ | 1 | $p_c + r + \log_2(N)$ |
| ADD2 | 1 | $p_c + p_x + \log_2(N)$ |
| REG | 1 | $p_c + p_x + \log_2(N)$ |

5.2.2. Distributed arithmetic approach

Consider $\vec{h}$ and $\vec{x}$ a fixed and a variable data vector, respectively. Then (4) can be rewritten using (8) as:

$$y = \sum_{i=0}^{N-1} h_i \cdot \left( -x_{i,p_x-1}\, 2^{p_x-1} + \sum_{k=0}^{p_x-2} x_{i,k}\, 2^k \right) = \sum_{i=0}^{N-1} -h_i \cdot x_{i,p_x-1}\, 2^{p_x-1} + \sum_{i=0}^{N-1} \sum_{k=0}^{p_x-2} h_i \cdot x_{i,k}\, 2^k \tag{30}$$

Interchanging the order of the two summations over $i$ and $k$ in (30) gives (31), where $f_k(x_{0,k}, x_{1,k}, \cdots, x_{N-1,k}) = \sum_{i=0}^{N-1} h_i \cdot x_{i,k}$.

$$y = \sum_{i=0}^{N-1} -h_i \cdot x_{i,p_x-1}\, 2^{p_x-1} + \sum_{k=0}^{p_x-2} \sum_{i=0}^{N-1} h_i \cdot x_{i,k}\, 2^k = -f_{p_x-1}(x_{0,p_x-1}, x_{1,p_x-1}, \cdots, x_{N-1,p_x-1})\, 2^{p_x-1} + \sum_{k=0}^{p_x-2} f_k(x_{0,k}, x_{1,k}, \cdots, x_{N-1,k})\, 2^k \tag{31}$$

This means that the partial inner products are evaluated first (over the terms $i$ of the inner product) and are then added/accumulated (over the bits $k$ of $x_i$). Distributed arithmetic is a procedure for computing an inner product based on the fact that $f_k$ can take on only a finite number of values (limited to $2^N$). The possible values of $f_k$ are pre-computed and stored in a look-up table that can be implemented with a ROM. The data $(x_{0,k}, x_{1,k}, \cdots, x_{N-1,k})$ is used as the address of the ROM. The inner product computation consists of successively accessing the ROM and accumulating the results according to (31). This is illustrated in Figure 5. The time constraints of this structure for an $N$-tap filter and a decimation factor $M$ are:

$$T_p = T_{ROM} + T_{ADD2} + T_{REG} \tag{32}$$

$$T_{clk} \le T_s \tag{33}$$

Figure 5: Distributed arithmetic approach implementation.

Table 4: Cost estimation for the distributed arithmetic approach.

| OPERATOR | Number | Datawidth |
|---|---|---|
| REG | $N$ | $p_x$ |
| ROM | 1 | $w_{add} = \lceil p_x / M \rceil \times N$, $w_{data} = p_c + \lceil p_x / M \rceil + \log_2 N$ |
| ADD2 | 1 | $p_c + p_x + \log_2(N)$ |
| REG | 2 | $p_c + p_x + \log_2(N)$ |

Notice that the computation time depends on the word length of the data $x_i$. In fact, $p_x$ clock cycles are necessary to complete the inner product calculation. Also, when the distributed arithmetic approach is used for filtering with decimation by $M$, no more than $M$ cycles are available for each complete inner product calculation. The number of bits of each $x_i \in \vec{x}$ to be considered in each ROM access is:

$$p_{min} = \left\lceil \frac{p_x}{M} \right\rceil \tag{34}$$
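A minimal Python sketch of the look-up-table procedure in (31) is given below for the one-bit-per-access case. It assumes integer coefficients and $p_x$-bit two's-complement integer samples; the names `build_rom` and `da_inner_product` and the example values are hypothetical.

```python
def build_rom(h):
    """ROM[address] = sum of h[i] over the bits i set in address (the 2**N values of f_k)."""
    N = len(h)
    return [sum(h[i] for i in range(N) if (addr >> i) & 1) for addr in range(2 ** N)]


def da_inner_product(rom, x, p_x):
    """Inner product of h and x following (31), with one ROM access per bit position k."""
    acc = 0
    for k in range(p_x):
        addr = 0
        for i, sample in enumerate(x):
            addr |= ((sample >> k) & 1) << i    # bit k of every sample forms the address
        term = rom[addr] << k                   # f_k weighted by 2**k
        acc += -term if k == p_x - 1 else term  # the sign-bit plane is subtracted
    return acc


h = [3, -1, 4, 2]            # hypothetical fixed coefficients
x = [5, -7, 100, -128]       # hypothetical 8-bit samples
rom = build_rom(h)
y = da_inner_product(rom, x, p_x=8)
print(y)
```

The ROM has $2^N$ entries, so this form is practical only for short filters; handling several bits of each sample per access shortens the loop at the price of a wider address.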

6. Customized filter implementation steps

The steps involved in a customized filter implementation can be summarized as follows:

• Single/multistage choice: Determine $K$ and $\{N_1, N_2, \cdots, N_K\}$ which minimize (11). Use HB or sinc filters when relevant.
• Individual filter design (for each filter $H_i$ in the cascade)
  – Determine the scaling factor $G_s$ as a function of the actual filter gain, the expected SNR, and $p_1$, according to (14).
  – Determine the minimal value of $p_h$ that ensures the required $A_{max}$, as given in (18).
• Granularity choice
  – Determine the computation power required for each filter.
  – Consider multiplexing and pipelining. Given the relationships between $T_p$, $T_{clk}$, and $T_s$ for each approach in Section 5, eliminate those that cannot satisfy the required computation power (i.e. those whose performance is far from the required one).
  – Among the retained solutions, choose the one leading to the smallest cost.
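Two of the numerical choices in these steps can be sketched in Python: evaluating (11) for candidate cascades and finding the smallest $p_h$ allowed by the bound (18). The sketch evaluates (11) exactly as written, with $M_0 = 1$; the function names, the filter lengths, the decimation factors, and the 60 dB target are assumptions chosen for illustration only.

```python
import math


def computation_power(lengths, factors, fs):
    """Evaluate (11) as written: stage i is divided by M_0 * ... * M_(i-1), with M_0 = 1."""
    total = 0.0
    rate_divisor = 1                      # M_0 = 1
    for n_i, m_i in zip(lengths, factors):
        total += n_i / rate_divisor
        rate_divisor *= m_i               # divisor seen by the next stage
    return total * fs


def min_coeff_bits(n_taps, attenuation_db):
    """Smallest p_h with 20*log10(N * 2**-p_h) <= -attenuation_db, from the bound (18)."""
    p_h = 1
    while 20 * math.log10(n_taps * 2.0 ** -p_h) > -attenuation_db:
        p_h += 1
    return p_h


fs = 1.0                                             # normalized sampling frequency
single = computation_power([40], [8], fs)            # one stage, N = 40, M = 8
cascade = computation_power([8, 8], [4, 2], fs)      # N1 + N2 * M1 = 8 + 8 * 4 = 40
bits = min_coeff_bits(40, 60.0)
print(single, cascade, bits)
```

Repeating the first evaluation over the admissible sets of stage lengths and factors gives the minimization asked for in the first step; the retained cascade then fixes the required computation power used in the granularity choice.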

7. Conclusions

This paper focused on the implementation of FIR filters for decimation. We pointed out important parameters that have a direct impact on the cost and the performance of a customized implementation. Several architectures allowing ad hoc implementation, together with their evaluation, have also been presented. It is important to notice that the presented approaches can be combined in order to obtain a mixed solution that best matches the computation power required by the application with that supplied by the architecture.

References

[1] F. J. Harris, C. Dick, and M. Rice, Digital receivers and transmitters using polyphase filter banks for wireless communications, IEEE Transactions on Microwave Theory and Techniques, Volume: 51, Issue: 4, pp. 1395 - 1412, April, 2003.

[2] S. R. Norsworthy and R. E. Crochiere, Decimation and Interpolation for Sigma Delta Conversion, in Delta Sigma Data Converters: IEEE Press, 1997.

[3] C. Landone and M. B. Sandler, Digital filtering for 3D binaural sound, Proceedings of IEE Colloquium on Digital Filters: An Enabling Technology, pp. 9/1 - 9/8, April, 1998.

[4] W. Zheng, L. Cheng, S. Xin, and X. Xibin, Digital filter implementation for software radio, Proceedings of IEEE VTC Vehicular Technology Conference, Volume: 3, pp. 6 - 9, May, 2001.

[5] A. Ghazel, L. Naviner, and K. Grati, On design and implementation of a decimation filter for multistandard wireless transceivers, IEEE Trans. on Wireless Communications, Vol. 1, pp. 558 - 562, Oct., 2002.

[6] T. W. Parks and C. S. Burrus, Digital Filter Design, Wiley-Interscience, John Wiley & Sons, 1987.

[7] A. Nabavi, N. Babaii, and M. Lotfizad, Two new methods to design digital filters, Asia-Pacific Conference on Circuits and Systems, Volume: 2, pp. 515 - 518, Oct., 2002.

[8] A. T. Erdogan, M. Hasan, and T. Arslan, Algorithmic low power FIR cores, IEE Proceedings on Circuits, Devices and Systems, Volume: 150, pp. 155 - 160, June, 2003.

[9] J. Yougbeom and Y. Sejunf, Low-power CSD linear phase FIR filter structure using vertical common subexpression, Electronics Letters, Volume: 38, Issue: 15, pp. 777 - 779, July, 2002.

[10] G. C. Cardarilli, A. Del Rei, A. Nannarelli, and M. Re, Power characterization of digital filters implemented on FPGA, Proceedings of IEEE International Symposium on Circuits and Systems, Volume: 5, pp. V-801 - V-804, May, 2002.

[11] D. Alam and S. Lawson, VLSI implementation of a new bit-level pipelined architecture for 2-D allpass digital filters, IEEE International Symposium on Circuits and Systems, Volume: 1, pp. 724 - 727, May, 1995.

[12] S. Chu and C. S. Burrus, Optimum FIR and IIR Multistage Multirate Filter Design, IEEE Circuits and Systems Process, Vol. 2, No. 3, pp. 361 - 386, 1983.

[13] X. Xu and B. Nowrouzian, Local search algorithm for the design of multiplierless digital filters with CSD multiplier coefficients, Proceedings of IEEE Canadian Conference on Electrical and Computer Engineering, Vol. 2, pp. 811 - 816, May, 1999.

[14] S. Chu and C. S. Burrus, Multirate filter design using Comb filters, IEEE Trans. on Circuits and Systems, vol. CAS-31, pp. 913 - 924, 1984.

[15] A. N. Wilson Jr. and H. J. Orchard, A design method for half-band FIR filters, IEEE Trans. on Circuits and Systems I: Fundamental Theory and Applications, Vol.: 46, Issue: 1, pp. 95 - 101, Jan., 1999.