
SAD computation based on online arithmetic for motion estimation

J. Olivares (a), J. Hormigo (b), J. Villalba (b), I. Benavides (a) and E. L. Zapata (b)

(a) Dept. of Electrics and Electronics, University of Córdoba, Spain {olivares, el1bebej}@uco.es
(b) Dept. of Computer Architecture, University of Málaga, Spain {hormigo, julio, ezapata}@ac.uma.es

Abstract Block-based motion estimation is one of the critical tasks in today’s video compression standards such as H.26x, MPEG-1, -2 and -4. Most of the block-based motion estimation algorithms are based on computing the Sum of Absolute Differences (SAD) between corresponding elements in the candidate and reference blocks. In this paper an FPGA design is proposed for rapidly computing the minimum SAD. Two goals are achieved due to the use of online arithmetic (OLA): it is possible to implement a full 16 × 16 macroblock SAD in a single FPGA device; and it allows us to speed up computation by early termination of the SAD calculation when the candidate involved is bigger than the current reference SAD. Reconfigurable devices enable us to change 8 × 8 or 16 × 16 pixels per block quickly and easily. For a 16 × 16 SAD unit 1945 look–up tables (LUTs) are required at 425 MHz. A comparison with other related works is provided.

Keywords: motion estimation, FPGA, sum of absolute differences, online arithmetic

1 Introduction

Motion estimation (ME) plays an important role in today’s video coding and processing systems, since motion vectors provide critical information for temporal redundancy reduction. It has been widely used in the H.26x, MPEG-1, -2 and -4 video compression standards. Motion estimation is defined as searching for the best motion vector, being the displacement of the coordinates of the most similar block in the previous frame compared to the block in the current frame. Full–search block–matching is the most popular algorithm to perform ME, and it searches through every candidate location to find the best match. To do this, the current frame is partitioned into two–dimensional blocks (typically 8 × 8 or 16 × 16 pixel blocks) and a search window in the reference frame is defined. Each block of the current frame is compared with all the blocks of a previous frame within the same window. The final motion vector corresponds to the block with minimum distortion within the search window. The most commonly used metric to calculate the distortion is the Sum of Absolute Differences (SAD) [1], which adds up the absolute differences between corresponding elements in the candidate and reference block.

The heavy computational cost of block matching algorithms (BMA) can be a significant problem in real–time coding applications. To reduce computational complexity many fast algorithms have been proposed, which search a subset of candidate blocks [2][3]. Besides this, different architectures have been designed to speed up the associated massive arithmetic calculation [4][1]. However, the need for specialized hardware contradicts the flexibility demanded by current video coding systems. A feasible solution to this problem is to use a programmable processor core along with a field–programmable gate–array device (FPGA) which is in charge of performing critical tasks. The reasons for using FPGAs include the following advantages: increased flexibility and rapid adaptation to new developments; appropriate performance; and faster design times achieved by re-using IP cores and high-level design languages (such as VHDL).
In this context, our design is intended to speed up computation of the minimum SAD by its implementation in an FPGA (SAD processor in Figure 1), while a data dispatcher supplies the reference and candidate blocks to the FPGA device. An FPGA architecture to compute the minimum SAD is proposed in this paper. This design can be integrated with any BMA (full search or another efficient search strategy).

Figure 1: Motion estimation system (labels in the figure: dispatcher, candidate block and reference block, each marked 2N²b; SAD processor, printed as MAD PROCESSOR; minimum SAD; motion vector; core processor).

Despite the parallelism inherent to SAD, full parallel implementation has proved difficult, since it requires a large number of operands for typical block sizes (an 8 × 8 pixel block requires 128 8–bit operands, and a 16 × 16 pixel macroblock needs 512 8–bit operands). Due to the large amount of hardware, the computation of the SAD on only one row of a macroblock (16 × 1) is implemented on an FPGA device in [1], whose authors propose replicating or pipelining the design to obtain the 16 × 16 computation. Four FPGA chips with 1234 I/O pins each are used in [5] for a completely parallel design. On the other hand, the use of online arithmetic (OLA) for motion estimation is proposed in [6] to speed up the computation by early termination of the SAD calculation. A serial architecture (pixel by pixel) for 4×4 blocks is proposed in [6], based on ASIC implementation.

This paper is organized as follows: in Section 2 a brief description of the OLA techniques is provided; in Section 3 we deal with the computation of the minimum SAD using OLA; Section 4 presents the implementation of the proposed design in FPGA devices; the results of several simulations are shown in Section 5 to illustrate the clock cycles saved with early termination; a comparison with other works is described in Section 6; and finally, the most relevant results of this paper are summarized in Section 7.


2 Online arithmetic

Online arithmetic techniques have been considered as the solution to many signal processing problems, such as digital filtering, Fourier transform, and others [7]–[10]. Recent works have shown the suitability of OLA for FPGA designs [11]. The basic idea of OLA is to perform computations which overlap with the digit-by-digit communications of operands/results [7]. OLA algorithms operate in a digit-serial manner, beginning with the most significant digit (MSD). To generate the first digit of the result, δ + 1 digits of the input operands are needed. Thus, after δ digits of the operands are received, for each new digit of the operands, a new digit of the result is obtained. For this reason, δ is known as the online delay. Due to the online delay, after the last digits of the inputs are introduced into the system, a number of zero digits equal to the online delay have to be introduced to ensure a correct result.

The most-significant-digit-first mode of computation requires flexibility in computing digits on the basis of partial information about inputs. This is achieved by using a redundant representation system. In a redundant representation with radix r, each digit has more than r possible values. This permits several representations of a given value. Therefore, there is flexibility in choosing an output digit at a given step, so that a compensation can be introduced if needed. A signed-digit (SD) representation system [12] is used in this paper. In radix-2 SD representation, the digit set is {−1, 0, 1}. Two bits are required to represent each digit, as shown in Table 1. The first bit is negatively weighted and the second one is positively weighted. This number representation system eliminates the long carry propagation chains in the addition operation, although it requires the carry of the two previous digits.
In short, the advantages of using online arithmetic are as follows: it reduces the number of signal lines connecting modules due to its serial-digit character; the MSD-first computation allows subsequent calculations to occur at a much earlier stage; and it eliminates carry propagation chains, since it uses a redundant number representation system.
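The redundancy of the radix-2 SD system can be illustrated with a small software model. This is only a sketch under stated assumptions: the function names `digit_value` and `sd_value` are hypothetical, and digits are held as Python integers rather than as the two-bit codes used in hardware.

```python
def digit_value(neg_bit, pos_bit):
    """Decode one SD digit from its two bits (first bit negatively weighted,
    second bit positively weighted, as in Table 1)."""
    return pos_bit - neg_bit


def sd_value(digits):
    """Value of an MSD-first list of radix-2 signed digits in {-1, 0, 1}."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


# Both 00 and 11 encode the digit 0; 01 encodes +1 and 10 encodes -1.
codes = {(0, 1): 1, (0, 0): 0, (1, 1): 0, (1, 0): -1}
for (neg_bit, pos_bit), expected in codes.items():
    assert digit_value(neg_bit, pos_bit) == expected

# Redundancy: three different digit strings represent the same value, 3.
for digits in ([0, 1, 1], [1, 0, -1], [1, -1, 1]):
    assert sd_value(digits) == 3
```

Because several digit strings map to one value, an MSD-first operator can commit to an output digit early and compensate in later digits.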


3 Online computation of the minimum SAD

The goal of our FPGA design is to find which of the candidate blocks (supplied by the dispatcher) best matches the reference block. The most commonly used metric to determine the best match is the Sum of Absolute differences (SAD). Thus, our design computes the minimum SAD from among all the candidate blocks. To do this, a search iteration is performed for each candidate block. During each search iteration, the SAD corresponding to a candidate block is computed using all its pixels simultaneously. The value obtained is compared with the reference SAD (SADr) which is the minimum SAD computed before the current iteration. If the current SAD (SADc) is less than SADr, it is stored as SADr for the remaining search iterations. Both the SAD computation and comparison operation are performed using OLA techniques. This allows us to begin the comparison when the first digit of the SAD is obtained and to stop the computation early if the digits computed are sufficient to ensure that SADc is greater than SADr.
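The search iteration described above can be modelled at word level as follows. This is a hedged reference sketch, not the hardware design: blocks are assumed to be flat lists of pixel values, the names `sad` and `minimum_sad` are hypothetical, and the digit-level early termination of the online design is deliberately not modelled here.

```python
def sad(candidate, reference):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(c - r) for c, r in zip(candidate, reference))


def minimum_sad(reference, candidates):
    """Return (index of best candidate, minimum SAD) over all candidates."""
    best_index, sad_r = None, None           # sad_r plays the role of SADr
    for index, candidate in enumerate(candidates):
        sad_c = sad(candidate, reference)    # SADc of the current iteration
        if sad_r is None or sad_c < sad_r:   # keep SADc only if it is smaller
            best_index, sad_r = index, sad_c
    return best_index, sad_r


reference = [10, 20, 30, 40]
candidates = [[0, 0, 0, 0], [10, 21, 29, 40], [10, 20, 30, 45]]
print(minimum_sad(reference, candidates))
```

Such a model is mainly useful as a golden reference when checking a digit-serial implementation.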

3.1 Online SAD computation

Table 1: Digit codification in radix-2 signed-digit representation

Digit value    Digit representation
+1             01
 0             00
 0             11
-1             10

The SAD adds up the absolute differences between corresponding elements in the candidate and reference block

$$\mathrm{SAD} = \sum_{i=1}^{N} \sum_{j=1}^{N} |c_{i,j} - r_{i,j}|, \tag{1}$$

where $r_{i,j}$ are the elements of the reference block, and $c_{i,j}$ the elements of the candidate block. Thus, the computation of the SAD is divided into three steps:

- Compute the differences between corresponding elements, $d_{i,j} = c_{i,j} - r_{i,j}$
- Determine the absolute value of each difference, $|d_{i,j}|$
- Add all absolute values

We now describe how each of these operations is performed using online arithmetic, and how the pixel values are converted into radix-2 SD representation.

Conversion to SD representation and difference computation: In radix-2 signed-digit representation, each digit is composed of two bits, the first one negatively weighted and the second positively weighted. Thus, a signed-digit number can be interpreted as the difference between two unsigned numbers: the one composed of the positively weighted bits of each digit, minus the one composed of the negatively weighted bits. In fact, this difference must be computed to convert an SD number into a non-redundant representation. This property is used to simultaneously convert each pixel value into SD representation and compute the difference between the pixels of the reference block and the current block at no computational cost. In this way, each digit of the value $d_{i,j} = c_{i,j} - r_{i,j}$ is obtained in SD representation by simply taking the corresponding bit of $c_{i,j}$ as the positively weighted one and the corresponding bit of $r_{i,j}$ as the negatively weighted one, since $c_{i,j}$ and $r_{i,j}$ are unsigned numbers.

Absolute value: To compute the absolute value of $d_{i,j}$, the sign of this value has to be changed if $d_{i,j}$ is negative. In SD representation, the negation operation is performed by exchanging both bits of each digit. Since the MSD-first mode of computation is being used, the sign detection of $d_{i,j}$ is performed on-the-fly by checking whether the first non-zero digit of $d_{i,j}$ is positive (01) or negative (10). The digits of $d_{i,j}$ are received in MSD-first mode and go directly to the output when they are zero (00 or 11). If the first non-zero digit received is positive (01), this and all the remaining digits go directly to the output. If, however, the first non-zero digit received is negative (10), the bits of this and all the remaining digits are interchanged to obtain the output. The absolute value operation is performed with no online delay.
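The two steps just described (pairing the bits of the two pixels to form the SD difference, then taking the absolute value on the fly) can be sketched in software. This is an illustrative model under assumptions: digits are represented as integers in {−1, 0, 1} instead of two-bit codes, pixels are taken to be 8-bit, and the names `difference_digits`, `online_abs` and `sd_value` are hypothetical.

```python
def difference_digits(c, r, bits=8):
    """MSD-first SD digits of c - r: each bit of c is the positively weighted
    bit and each bit of r the negatively weighted bit of the same digit."""
    return [((c >> k) & 1) - ((r >> k) & 1) for k in range(bits - 1, -1, -1)]


def online_abs(digits):
    """Pass leading zeros through; if the first non-zero digit is negative,
    negate it and every following digit (bit exchange in hardware)."""
    out, sign = [], 0
    for d in digits:
        if sign == 0 and d != 0:
            sign = d                      # sign is fixed by the first non-zero digit
        out.append(-d if sign < 0 else d)
    return out


def sd_value(digits):
    """Value of an MSD-first list of radix-2 signed digits."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


d = difference_digits(5, 9)               # 5 - 9 = -4
assert sd_value(d) == -4
assert sd_value(online_abs(d)) == 4
```

Each output digit depends only on the digits already seen, which is why the operation adds no online delay.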

Figure 2: Online design for the sum of the absolute differences (a binary tree of OLA adders combining the inputs A to H, each supplied as a pair of positively and negatively weighted bit streams, into the partial sums AB, CD, EF, GH, then AD and EH, and finally AH).

Sum of absolute differences: The absolute differences of all the pixels of the current and reference blocks are computed in parallel. Thus, $N^2$ absolute difference blocks are required. An online adder tree is used to obtain the sum of all $|d_{i,j}|$ values. In Figure 2 this structure is shown for 4 × 4 pixels per block ($N = 4$). Each OLA adder in this figure corresponds to a standard SD online adder (see Figure 3). The number of addition steps of the complete adder tree is $\log_2(N^2)$. In radix-2 signed-digit representation, the online delay of the addition is two, i.e., the MSD of the result is obtained two cycles after the MSD of the inputs has been sent to the adder. Nevertheless, in our case the carry bit is used as the MSD of the result, and this digit is obtained one cycle earlier. Therefore, the online delay of the complete adder tree is $2\log_2(N^2)$, but the first digit of the result is obtained $\log_2(N^2)$ cycles earlier.

Figure 3: Online adder design.

3.2 Signed-digit online comparison

Once the first digit of the SAD corresponding to the current block is obtained, the comparison between the current SAD and the minimum SAD can begin. Thanks to the MSD-first mode of computation, an efficient comparison algorithm can be applied. Nevertheless, since SD representation allows several representations for a given value, the comparison operation between two values is not as simple as in conventional representations. In [13, 6] a comparison algorithm and its hardware implementation are proposed. The two SD numbers are first converted to sign-magnitude format and then a standard comparison is used. The magnitude computation and comparison are performed on-the-fly in an MSD-first manner. Nevertheless, this comparator has an online delay of two.

We propose a comparison algorithm with no online delay. It is based on the analysis of the sign of the difference between the two values to be compared. Thus, the online delay of two that the subtraction operation would introduce is avoided. Let us define the SD numbers A and B, where

$$A = \sum_{i=0}^{n-1} a_i \cdot 2^i, \qquad a_i \in \{-1, 0, 1\} \tag{2}$$

and B has a similar expression. The result of the operation A − B is

$$A - B = \sum_{i=0}^{n-1} (a_i - b_i) \cdot 2^i, \qquad a_i, b_i \in \{-1, 0, 1\} \tag{3}$$

Let R be the result of the difference

$$R = A - B = \sum_{i=0}^{n-1} r_i \cdot 2^i, \qquad r_i \in \{-2, -1, 0, 1, 2\} \tag{4}$$

When using an online comparator, the sign of R can be determined at digit k if the partial accumulated sum $R^k$ satisfies

$$|R^k| = \left| \sum_{i=k}^{n-1} r_i \cdot 2^i \right| \ge 2^{k+1} \tag{5}$$

Given the previous definition of $R^k$, R can be rewritten as

$$R = A - B = R^k + \sum_{i=0}^{k-1} r_i \cdot 2^i \tag{6}$$

Since

$$\left| \sum_{i=0}^{k-1} r_i \cdot 2^i \right| \le \sum_{i=0}^{k-1} 2 \cdot 2^i < 2^{k+1} \tag{7}$$

it is proved that

$$R^k \ge 2^{k+1} \Rightarrow R > 0 \tag{8}$$

$$R^k \le -2^{k+1} \Rightarrow R < 0 \tag{9}$$

If the condition in equation (5) is not met for any k > 0, the sign of R cannot be guaranteed until the last digit (k = 0). Let us define the normalized partial accumulated sum as $\hat{R}^k = R^k / 2^k$; the condition in equation (5) is then equivalent to

$$|\hat{R}^k| = \left| \sum_{i=k}^{n-1} r_i \cdot 2^{i-k} \right| \ge 2 \tag{10}$$

The value $\hat{R}^k$ can be computed using an online recurrence (note that k ranges from n − 1 down to 0)

$$\hat{R}^k = 2 \cdot \hat{R}^{k+1} + (a_k - b_k) \tag{11}$$

The value $\hat{R}^k$ only depends on its previous value and the current digits; thus an online comparator, as well as minimum or maximum algorithms, can be implemented with no online delay based on this computation. An online comparator requires the value $\hat{R}^k$ to be computed in each iteration, starting at k = n − 1 (MSD), until $|\hat{R}^k| \ge 2$ or k = 0. At this point, the decision is determined based on the sign of $\hat{R}^k$.
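The recurrence in equation (11) and its stopping rule can be checked with a short software model. This is a minimal sketch rather than the hardware comparator: operands are assumed to be MSD-first lists of signed digits of equal length, and the name `online_compare` is hypothetical.

```python
def online_compare(a_digits, b_digits):
    """Compare two SD numbers given MSD first.

    Returns (sign, digits_used), where sign is +1 if A > B, -1 if A < B and
    0 if A == B, and digits_used is how many digit pairs were consumed.
    """
    r = 0                                  # normalized partial accumulated sum
    used = 0
    for a, b in zip(a_digits, b_digits):
        r = 2 * r + (a - b)                # recurrence of equation (11)
        used += 1
        if abs(r) >= 2:                    # condition of equation (10): sign is settled
            break
    sign = (r > 0) - (r < 0)
    return sign, used


# 12 versus 3: decided after two digits. 8 versus 7: needs all four digits.
print(online_compare([1, 1, 0, 0], [0, 0, 1, 1]))
print(online_compare([1, 0, 0, 0], [0, 1, 1, 1]))
```

In this model the early exit is what allows a candidate whose SAD already exceeds the reference SAD to be abandoned before its low-order digits are produced.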

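Figure 4: [caption missing] Transition diagram of the comparator, with one state for each range of the normalized partial accumulated sum (greater than 1, equal to 1, 0 or −1, and less than −1) and transitions labelled with the value of $a_k - b_k$.

[...] When the condition SADc > SADr is detected, the computation is stopped and a new candidate block is requested. Otherwise, if the condition SADc < SADr is verified, SADc is stored in SADr once the least significant digit of the SAD has been calculated.

Figure 5: SAD processor architecture ($N^2$ absolute-difference blocks $|c_{i,j} - r_{i,j}|$, each fed digit by digit with $c_{i,j}$ and $r_{i,j}$; an $N^2$-operand OLA adder producing SADc; and a comparator, COMP, which compares SADc with SADr, outputs the minimum and raises the stop signal).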

The timing of the computation for the 4 × 4 SAD processor is shown in Figure 6. In each cycle, the outputs corresponding to the absolute value block ($|c_{i,j} - r_{i,j}|$), each of the four steps in the adder tree ($\Sigma(i)$), and the comparator (COMP) are represented. Regarding the comparator, the entry shown is not really its output, but rather the last digit used for the comparison. The zero digits represent the zero values which have to be introduced into the input due to the system’s online delay. Since each addition has an online delay of two, and the absolute value blocks and the comparator have no online delay, eight zeroes are required in this case. The worst case occurs when a new minimum SAD is found, and then 21 cycles are required for the full process, where the last cycle is run to store SADc in SADr. However, as Figure 6

Figure 6: Timing of the 4 × 4 SAD processor (rows: $|c_{i,j} - r_{i,j}|$, Σ(1) to Σ(4) and COMP, over cycles 1 to 21; a new computation can start after cycle 16; best case 9 cycles, worst case 21 cycles).

shows, a new SAD computation can start after 16 cycles (after the 8 digits and 8 zeroes are introduced), so this period is the maximum between two consecutive SAD computations. This period is reduced if the candidate SAD is rejected earlier. In the best case, this happens after analysing the MSD of the candidate SAD, i.e., after 9 cycles. Therefore, the number of cycles for a SAD computation and comparison is between 9 and 16 for a 4 × 4 SAD processor. This period ranges from 13 to 20 cycles for an 8 × 8 block, and from 17 to 24 cycles for a 16 × 16 block.

The design has been implemented on the Xilinx SPARTAN-II and VIRTEX-II FPGA families for three different block sizes. For compilation, simulation and implementation, we use the Xilinx ISE Series 5.2i. The main results of the implementation are shown in Table 2. The area/number of pixels ratio is relatively low, due to the serial-digit character of online computation. The maximum clock frequency is independent of block size because when the number of operators increases, only the number of parallel operations and the number of steps in the adder-tree increase. Although this value strongly depends on the technology used (as shown in Table 2), our results are very promising. Table 3 shows how the area and delay are distributed among the different parts of the design for the 16 × 16 SAD processor. Note that the percentage given refers to the total number of LUTs of the SAD processor. The maximum clock frequency of the global system is determined

Table 2: Area and clock frequency corresponding to different FPGA implementations.

                           SPARTAN-II    VIRTEX-II
Area (4-input LUTs)
  4x4 (16 pixels)          246           241
  8x8 (64 pixels)          603           595
  16x16 (256 pixels)       1982          1945
Maximum frequency (MHz)    231.24        424.99

Table 3: Distribution of LUTs and delay in the 16 × 16 SAD processor.

                            Time delay (ns)             Area
Parts                       SPARTAN-II    VIRTEX-II     LUTs    %
Absolute difference         3.675         1.839         1024    52.7%
Adder-tree                  4.325         2.353         768     39.5%
Comparator                  4.887         2.048         6       0.3%
Control and connectivity    -             -             146     7.5%

either by the delay of the comparator or the adder (although both values are similar), depending on the FPGA family, since the basic cells are slightly different. The area is mainly occupied by the absolute value blocks and the adder-tree, due to the large number of operands for this block size. The general performance of these implementations is shown in Table 4, where the number of SADs per second and the number of frames per second (fps) are given for a 640 × 480 pixel frame.


Table 4: Number of SAD calculations and frames per second.

                             SPARTAN-II                  VIRTEX-II
Block size    Window size    SAD (millions/s)    fps     SAD (millions/s)    fps
4x4           8x8            14.45               77.08   26.56               141.66
8x8           16x16          11.56               30.50   21.25               56.06
16x16         32x32          9.64                9.56    17.71               17.57
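The SAD rates in Table 4 are consistent with dividing the maximum clock frequency of Table 2 by the worst-case number of cycles between consecutive SADs. The sketch below reproduces that arithmetic. It rests on assumptions: the function names are hypothetical, the worst case is taken as 8 pixel bits plus $2\log_2(N^2)$ padding zeroes, and the best-case expression $2\log_2(N^2) + 1$ is simply a fit to the three cycle ranges quoted in the text. The fps column is not modelled.

```python
PIXEL_BITS = 8  # 8-bit pixels, as assumed throughout the paper


def cycles_per_sad(n):
    """(best, worst) cycles between consecutive SADs for an n x n block."""
    steps = (n * n).bit_length() - 1   # log2(n^2) adder-tree levels, n a power of two
    zeros = 2 * steps                  # online delay of two per adder level
    best = zeros + 1                   # fit to the quoted 9, 13 and 17 cycles
    worst = PIXEL_BITS + zeros         # all digits plus all padding zeroes
    return best, worst


def msads_per_second(freq_mhz, n):
    """Worst-case throughput in millions of SADs per second."""
    return freq_mhz / cycles_per_sad(n)[1]


for n in (4, 8, 16):
    print(n, cycles_per_sad(n),
          round(msads_per_second(231.24, n), 2),   # SPARTAN-II
          round(msads_per_second(424.99, n), 2))   # VIRTEX-II
```

Since the table is computed from the worst case, any early termination only increases the effective rate.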

5 Early termination of SAD calculation

Several video sequences have been processed to estimate the number of clock cycles saved. The parameters used are:

- 16x16 block size.
- 24x24 search window.
- Full-search block matching algorithm.
- 150 frames of each video have been evaluated.

The traditional model shown in Figure 5 uses a final comparator for the SAD comparison. A new model is proposed (as shown in Figure 7), which introduces several comparison levels into the adder tree to evaluate partial SAD information. It is possible that partial SADs of 64 pixels or 128 pixels of a 16x16 block are already greater than the reference SAD; if so, the SAD calculation can be stopped before running the entire number of cycles, which cannot be done with the traditional model. This property is demonstrated in the present section. The added cost of the new model is the area occupied by six new comparators. Nevertheless, each comparator only requires 6 LUTs, so together they involve less than 2% of the final area.

Figure 7: New comparators for partial SAD comparison (four OLA 64-input tree adders, each followed by an OLA comparator at the 64 PIXELS PROCESSED LEVEL; two OLA adders, each followed by an OLA comparator; and a final OLA adder with its OLA comparator).

Figure 8 shows the results obtained for three versions of the implemented algorithm: one with only one final comparator for ’256 PIXELS PROCESSED LEVEL’, called C256P; one with a final comparator plus two comparators for ’128 PIXELS PROCESSED LEVEL’, called C128P; and one with a final comparator plus two comparators for ’128 PIXELS PROCESSED LEVEL’ and four comparators for ’64 PIXELS PROCESSED LEVEL’, called C64P. The videos tested were:

- hall monitor.mpeg
- flower.mpeg
- tennis.mpeg
- coast guard.mpeg

The number of clock cycles saved for the C64P model ranges from 4.5% to 13%, in contrast to the conventional C256P model with only one comparator, which saves between 3.3% and 4.53% of the clock cycles. Introducing partial comparators thus improves the efficiency of the system.
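The idea behind the partial comparators can be sketched at word level: because every term of the SAD is non-negative, a partial sum that already exceeds the reference SAD guarantees that the full SAD does too. The model below is an assumption-laden illustration, not the digit-serial hardware; the name `rejection_level` and the grouping of pixels into consecutive runs of 64 are hypothetical choices.

```python
def rejection_level(abs_diffs, sad_r):
    """Return the smallest level (64, 128 or 256 pixels) at which some partial
    SAD already exceeds sad_r, or None if the candidate is not rejected."""
    assert len(abs_diffs) == 256                  # one 16 x 16 block
    p64 = [sum(abs_diffs[i:i + 64]) for i in range(0, 256, 64)]
    p128 = [p64[0] + p64[1], p64[2] + p64[3]]
    if any(p > sad_r for p in p64):               # four comparators (C64P only)
        return 64
    if any(p > sad_r for p in p128):              # two comparators (C128P and C64P)
        return 128
    if sum(p128) > sad_r:                         # final comparator (all models)
        return 256
    return None


block = [1] * 256                                 # every absolute difference is 1
print(rejection_level(block, 50), rejection_level(block, 100),
      rejection_level(block, 200), rejection_level(block, 256))
```

The earlier the level at which rejection occurs, the fewer adder-tree stages the digits have to traverse before the stop signal can be raised.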

Figure 8: Number of clock cycles saved

6 Comparison with other works

In this section we compare our design to other recent works, the main ones being [13, 6] and [1, 5]. The use of online arithmetic to compute the minimum SAD was proposed in [13, 6] for ASIC implementation. An SD–adder was used for the computation of the differences, whereas our approach does not use such hardware, since we merge this computation and the SD conversion, saving both time and area. Note that since a difference computation is required for each pixel, the amount of hardware saved is considerable. The authors consider independent bit planes and compute the summation of absolute differences for independent planes, starting from the most significant digit. The mathematical basis for this procedure is not correct, since the absolute value of a signed-digit number is not equal to the summation of the absolute values of its weighted digits. This is due to the fact that each digit can be positive or negative. This leads the authors to obtain a motion vector which is not correct in most cases. For details see [15]. Our approach also considers bit planes. Moreover, we take into account the dependence between bit planes (carry propagation and correct calculation of the absolute value), which leads to obtaining the best motion vector.

On the other hand, an algorithm based on online arithmetic is proposed in [13, 6] for the SAD comparison. The two SD numbers are first converted to sign-magnitude format, and then a standard comparison is used. The magnitude computation and comparison are performed on-the-fly in an MSD-first manner. Nevertheless, this comparator has an online delay of two and relatively high complexity. The main advantage of our design is that no online delay is required for the comparison operation, thus speeding up computation. Furthermore, our design is based on a simpler method involving less hardware cost. The authors do not provide enough data regarding their ASIC implementation to enable us to perform a quantitative comparison in terms of area and delay. According to [13, 6], the cycle time corresponds to one SD–adder plus one 2–to–1 MUX, one AND and one three-input OR gate. The cycle time of our design is only one SD–adder. Although our design is intended for an FPGA implementation, we estimate that an ASIC implementation of our design would significantly improve on the performance of the design in [13, 6].

In [1], the computation of the SAD for 16 pixels (SAD16), which is equivalent to a macroblock row for MPEG, is implemented on an FPGA device. The design is based on carry–save adders which perform the computation in parallel over all the digits of the data. According to the authors, the design is synthesized using FPGA Express from Synopsys by targeting the FLEX20KE family from Altera, obtaining an area of 1699 LUTs, and a maximum frequency of 197 MHz, with a latency of 19 cycles (96 ns).
The estimated bandwidth for this design is 50.4 Gbps and the estimated throughput is 197 million SADs per second. The results of our implementation using the VIRTEX-II family are used for comparison, since it provides similar performance. The worst case for our equivalent design (4 × 4 or 16 pixels) occurs when a new minimum SAD is found, and then 21 cycles are required to complete the full process (see Section 4); that is, to compute SAD16, compare the result with the previous minimum and store it; this lasts 49 ns at a frequency of 425 MHz. The bandwidth of our design is 27.2 Gbps, which is less than in [1], since data are serially transmitted. As shown in Table 4, the throughput is 26.56 million SADs per second, which is about seven times less than in [1]. On the other hand, our design only requires 241 LUTs, which is seven times less area than in [1].

However, the current compression standard systems require 16 × 16 blocks (and also 8 × 8 for MPEG-4). The authors of [1] state briefly how to extend the design to compute a 16 × 16 SAD in two ways. The first one is based on using 16 SAD16 units (one for each row) and a final adder tree. They estimate that 27 clock cycles are required. Nevertheless, the number of LUTs for the design is close to 30000, which does not seem feasible for the current FPGA devices. Our 16 × 16 design requires only 1945 LUTs, which is easily implemented on a single FPGA device. The second approach presented in [1] is based on reusing the SAD16 units to compute the SAD of all the 16 rows, which are buffered, to finally add them up. This involves 42 clock cycles with a larger area due to buffering and the fact that longer binary data (16 bits instead of 12 bits) must be supported. Moreover, the intrinsic pipeline behavior of the SAD16 units is eliminated. For a similar area, our design computes a SAD every 24 cycles in the worst case (including the comparison, see Section 4).

On the other hand, the solution proposed in [5] involves the use of four Altera STRATIX EP1S80 devices with 1234 I/O pins. This design uses 7765 LCs and requires 29 cycles for a SAD computation at 380 MHz. This means that our design obtains better performance regarding time while requiring far less hardware. We would like to emphasize that the previous comparisons refer to our worst case (16 cycles for 4 × 4 SAD and 24 cycles for 16 × 16 SAD).
However, in the best case the candidate is rejected after analysing the MSD of the candidate SAD; this involves only 9 cycles for 4 × 4 SAD and 17 cycles for 16 × 16 SAD (see Section 4). Moreover, the timing results used for our design include the comparison operation (which involves a few more clock cycles due to carry propagation), whereas the designs in [1] and [5] do not include this operation time.


7 Conclusion

An FPGA implementation of a motion estimation core based on the computation of the minimum SAD has been presented in this paper. The proposed core can be integrated with a full–search algorithm or any more efficient search strategy. The computation is carried out by using online arithmetic. The different operations involved in the SAD computation have been efficiently adapted to online arithmetic, and a new comparator design with no online delay has been proposed. This allows us to implement the design on a single FPGA device. The proposed core can speed up the computation by early termination of the SAD calculation when the candidate involved is bigger than the current SAD reference. Furthermore, the FPGA implementation of the design makes it possible to reconfigure the hardware to deal with 8 × 8 and 16 × 16 pixel blocks, according to the MPEG-4 standard requirements. We present the implementation’s delay and area details for 4×4, 8×8 and 16×16 pixel blocks. We also provide comparisons with other current related works demonstrating the advantages of using our design.

References

[1] S. Wong, S. Vassiliadis, S. Cotofana, “A Sum of Absolute Differences Implementation in FPGA Hardware”, 28th Euromicro Conference (EUROMICRO’02), pp. 183–188, Dortmund, Germany, 2002.
[2] J. Kim, S. Byun, Y. Kim, B. Ahn, “Fast Full Search Motion Estimation Algorithm Using Early Detection of Impossible Candidate Vectors”, IEEE Trans. on Signal Processing, vol. 50, pp. 2355–2365, Sep. 2002.
[3] Y. Chan, W. Siu, “An Efficient Search Strategy for Block Motion Estimation Using Image Features”, IEEE Trans. on Image Processing, vol. 10, pp. 1223–1238, Aug. 2001.
[4] S. B. Pan, S. S. Chae and R. H. Park, “VLSI Architecture for Block Matching Algorithms using Systolic Arrays”, IEEE Trans. Circuits Syst. Video Tech., vol. 6, pp. 67–73, Feb. 1996.
[5] S. Wong, B. Stougie, S. Cotofana, “Alternatives in FPGA-based SAD Implementations”, Proc. IEEE International Conf. on Field-Programmable Technology, pp. 449–452, 2002.
[6] C. Su and C. Jen, “Motion Estimation using MSD-first Processing”, IEE Proc. Circuits Devices System, vol. 150, no. 2, pp. 124–133, 2003.
[7] M. Ercegovac and T. Lang, “On-line Arithmetic for DSP Applications”, 32nd Midwest Symposium on Circuits and Systems, pp. 365–368, 1989.
[8] M. D. Ercegovac and T. Lang, “On-line Arithmetic: a Design Methodology and Applications in Digital Signal Processing”, in VLSI Signal Processing III, pp. 252–263, 1988. Reprinted in E. E. Swartzlander, Computer Arithmetic, Vol. 2, IEEE Computer Society Press Tutorial, Los Alamitos, CA, 1990.
[9] D. Lau, A. Schneider, M. D. Ercegovac, J. Villasenor, “FPGA-based Structures for Online FFT and DCT”, Proc. 7th IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 310–311, 1999.
[10] S. Rajagopal, J. Cavallaro, “On-line Arithmetic for Detection in Digital Communication Receivers”, 15th IEEE Symposium on Computer Arithmetic, pp. 257–265, 2001.
[11] R. McIlhenny, M. D. Ercegovac, “On the Design of an On-line FFT Network for FPGA’s”, 33rd Asilomar Conference on Signals, Systems, and Computers, vol. 2, pp. 1484–1488, 1999.
[12] A. Avizienis, “Signed Digit Number Representation for Fast Parallel Arithmetic”, IRE Trans. Electron. Comput., vol. EC-10, pp. 389–400, 1961.
[13] C. Su and C. Jen, “Motion Estimation Using On-Line Arithmetic”, IEEE Int. Symposium on Circuits and Systems (ISCAS-2000), pp. 683–686, May 28–31, 2000.
[14] J. Hormigo, J. Olivares, J. Villalba, I. Benavides, “New On-line Comparator with no Online Delay”, 8th World Multiconference on Systemics, Cybernetics and Informatics, 2004.
[15] J. Villalba, J. Hormigo, “Analysis of the Mistakes in the Paper Motion Estimation using MSD-first Processing, IEE Circ., Dev. & Syst, vol 150, no. 2, April 2003”, Internal Report, Depart. Computer Architecture, University of Málaga, Dec. 2004. http://www.ac.uma.es/cgibin/htgrep/pubsearch.cgi?isindex=Villalba,