Low Power Motion Estimation Based on ...

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Control Number: 6553

Low Power Motion Estimation Based on Probabilistic Computing

Charvi Dhoot‡∏, Lap-Pui Chau§∏, Shubhajit Roy Chowdhury‡ and Vincent J. Mooney&∏

‡International Institute of Information Technology, Hyderabad, India
§School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
∏Institute of Sustainable and Applied Infodynamics, Nanyang Technological University, Singapore
&School of Electrical and Computer Engineering, Georgia Institute of Technology, Georgia, USA

[email protected], [email protected], [email protected], [email protected]

Abstract—As CMOS technology driven by Moore’s law has approached device sizes in the range of five to twenty nanometers, noise immunity of such future technology nodes is predicted to decrease considerably, eventually affecting the reliability of computations through them. A shift in the design paradigm is expected from 100% accurate computations to probabilistic computing with accuracy dependent on the target application or circuit specifications. One model developed for CMOS technology that emulates the erroneous behaviour predicted is termed Probabilistic CMOS (PCMOS). In this paper, we propose a PCMOS-based architecture implementation for traditional motion estimation algorithms and show that up to 57% energy savings are possible for different existing motion estimation algorithms. Furthermore, algorithmic modifications are proposed that can enhance the energy savings to 70% with a PCMOS architectural implementation. About 1.8 to 5 dB improvement in PSNR under energy savings of 57% to 70% for two different motion estimation algorithms is shown establishing the resilience of the proposed algorithm to probabilistic computing over the comparable conventional algorithm.

Index Terms—Error resilient design, Low power design, Motion estimation, PCMOS Architecture, Probabilistic Computing

I. INTRODUCTION

Embedded systems have become an indispensable part of our everyday lives. Cell phones, tablets and other mobile computing, communication and entertainment devices are a respite in our increasingly migrant lives. The ubiquity of these devices demands faster processing and lower power consumption from them. The era is also one of high-end virtual connectivity, and consequently these embedded devices come equipped with several video applications such as videoconferencing, online storage, transmission and viewing of video data. Video compression caters to a majority of these applications. Motion estimation accounts for between 66% and 94% of the computational complexity of video compression [1]. Minimizing the power dissipation during motion estimation is one of the fundamental issues owing to the computational complexity of the operation. This paper focuses on low power design for motion estimation.

Historically, the VLSI industry has been guided by Moore’s law to double the number of transistors in a single die every one to two years. Device scaling has traditionally been accompanied by voltage scaling, whose quadratic relationship with energy leads to a large amount of energy savings. As the technology downscales approximately every two years, the transistor integration capacity doubles (Moore’s Law), gate delay reduces by 30%, energy per logic operation reduces by 65% and power consumption reduces by 50% [2]. As devices approach sizes below 22nm, random dopant fluctuations, subwavelength lithography and deep sub-micron noise such as thermal noise [3,4,5] are predicted to affect the reliability expected of CMOS devices. In order to continue the trend of device and supply voltage scaling despite the variability predicted to accompany device scaling, we may have to forgo the conventional approach of accurate computing in favor of a paradigm supporting error tolerance. Computing in the presence of noise sources resulting in erroneous outputs is termed probabilistic computing.

Signal processing applications, particularly those operating over image, video and voice, have an inherent ability to tolerate noise. Human perception can ignore the noise due to operations such as quantization over image, video or voice. These applications are therefore an ideal choice for exploiting energy savings with probabilistic computing.

In our previous work with motion estimation [6,7], we have applied probabilistic computing to several traditional motion estimation algorithms. The current work elaborates on circuit and implementation level details of the techniques proposed in [6,7] and compares them with an extensive survey of prior work in the domain of error tolerant low power application design in the presence of process variations and probabilistic computing. The work presented in this paper caters to the demand for low power video devices and looks at probabilistic computing based low power and high speed processing designs for motion estimation, which is known to be computationally the most intensive part of a video codec.

II. PRIOR WORK AND OUR CONTRIBUTION

In this section, we first give a broad outline of the prior work in subsection A. Then, in subsection B, we compare and contrast the research presented in this paper with the prior work explained in subsection A.

A. Literature survey

Low power motion estimation is a well-studied subject owing to the high computational intensity of motion estimation [8-18]. Algorithmic low-power motion estimation techniques focus on heuristics to reduce the number of macroblocks processed per motion vector [8]. These include employing a computationally less expensive distance criterion compared to the Sum of Absolute Differences (SAD) [14], adaptive search area dependent on motion characteristics [15], and adaptive pixel decimation for computation of SAD [16]. Several VLSI architectures have been proposed for motion

Copyright (c) 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to [email protected].


estimation with various trade-offs between gate count, input/output bandwidth and throughput [8, 19-21].

The aforementioned research work assumes that the underlying hardware will always function non-erroneously. This assumption has been true until CMOS device scaling reached the sub-nanometer regime, where predictions have been made that thermal noise [3,4,5] and process variations [2,22] with further scaling of device sizes will begin to cause “soft errors.” Here, soft errors are errors occurring probabilistically in computations made via the underlying hardware. Speculations about erroneous computation due to CMOS device size and voltage scaling have led to a large body of research on low power application design suggesting error tolerance and correction mechanisms [23-40]. These mechanisms work at various levels of design abstraction that can be broadly classified as Algorithm, Architecture, Logic and Circuit levels. Research efforts at the algorithmic and architectural levels are fairly closely tied to the application.

Prior work on error tolerant design is summarized in Fig. 1, which lists all relevant low power error tolerant motion estimation based research known to the authors of this paper at the time of submission. At the level of algorithm and architecture design, these include propositions of a parallel motion estimation architecture for error correction [23], input dependent variable clocking for correction of timing errors due to process variations [26-29], and a segmented architecture design for a motion estimation architecture that reduces critical path timing and invests the time saved in error correction [29].

Application Driven Design with Circuit Level Inaccuracies (levels shown in the figure: Algorithm Level, Architecture Level, Logic Level, Circuit Level; entries listed in the order they appear)
- Error tolerance/Algorithmic Noise Tolerance (ANT) [23,24]
- Soft Digital Signal Processing (DSP) [23,25]
- Significance driven computing [26]
- Input based Elastic Clocking [27,28]
- Voltage Scalable Metafunction Design [29]
- Strategic Re-computation and Triple Modular Redundancy [30]
- Energy-efficient PCMOS-based SoC Architectures [32]
- Probabilistic Boolean Logic [34]
- Maximizing probability of correctness through Markov Random field based logic circuit designs [33]
- Circuit design to enhance noise immunity [35]
- Adaptive Supply Voltage and Body Bias [36]
- Increase in yield by Error Tolerant Design [37]
- Use of Multiple Supply Voltages, Biased Voltage Over Scaling (BiVOS) [38,39,40]

Fig. 1: Prior work on error tolerant application design

Besides referring to prior research on low power motion estimation design with erroneous underlying hardware, Fig. 1 also refers to some noteworthy research propositions in the domain of low power error tolerant computing that work independently of the application being used. These include Biased Voltage Over Scaling (BiVOS), which allocates higher voltage to the resource (hardware block) with a more significant impact on the output [38], use of input dependent adaptive circuit supply voltage [39,40], and CMOS transistor level circuit design techniques for reduction in circuit error [33]. In this paper, we have proposed algorithmic modifications to existing and popularly used motion estimation algorithms as error correction mechanisms to reduce errors. The novelty of our work compared to prior work is discussed in detail in the following section.

B. Contribution in light of prior work

Algorithm and architecture level techniques proposed in prior work for low power error tolerant application design using voltage scaling [23-29] are best suited for circuits with a large delay imbalance between the critical path and the rest of the circuit paths. This is because paths with delay significantly smaller than the critical path can handle the additional delay introduced due to voltage scaling. Process variations predicted for future CMOS technology nodes, which mainly result in random increases in delay, can also be dealt with by the same approach. However, probabilistic computing accounts for the random soft errors predicted to be seen in future CMOS technologies due to circuit noise sources such as thermal noise [3,4,5]. The design techniques proposed in prior research are not suited for these types of errors, because such errors do not depend on circuit timing and cannot be dealt with by varying the operation time of the circuit, as has been the approach in [23-29].

In this paper we propose algorithmic modifications for two different motion estimation algorithms that work well with probabilistic computing and can save up to 70% of energy for different kinds of motion estimation algorithms. Also, the proposed approaches of a parallel motion estimation architecture for error correction [23], input dependent variable computational clock cycles [26-29] and voltage scalable metafunction design [29] all require significant computational overheads, whereas the modifications proposed in this paper have far less computational overhead. One of the modifications suggested in this paper requires a parallel computing architecture; however, its use is minimal and, hence, so is the computational overhead.

Unlike BiVOS, a circuit level design technique that requires four to five supply voltages and consequently incurs a large overhead in routing of multiple voltage planes, the error minimization mechanisms proposed in this paper do not use multiple supply voltages. Also, the error minimization through BiVOS compared to the conventional voltage scaling approach is much more significant for larger circuit sizes, e.g., 32- to 64-bit adders compared to the 8- to 16-bit adders used in motion estimation, thereby reducing the utility of BiVOS for motion estimation. Circuit level techniques such as adaptive supply voltage adjustments proposed in prior research require accurate fine-tuning of voltages during run-time, necessitated by the possibility of massive failures that can occur in circuits beyond a critical voltage scaling point. Accurate fine-tuning of the voltage supply might not be feasible due to inherent variations present in power supply routing.

Thus, in the light of prior work on low power error tolerant application design, the novel contributions of this paper can be stated as follows:

• Algorithmic modifications for two traditional motion estimation algorithms, proposed as error correction mechanisms for low power error tolerant motion estimation design that introduce minimal computational overhead. The modifications are proposed for two different types of motion estimation algorithms: the Full Search algorithm, the best algorithm in terms of the output video’s visual quality, and the Three Step Search algorithm [11], a sub-optimal algorithm which belongs to a class of hierarchical search motion estimation algorithms.
• A method to design and estimate errors through the entire motion estimation architecture based on prior work on the noise model proposed in [41,42], and to estimate errors through individual gates of the circuit as suggested in [43,44].
• The proposed techniques avoid usage of error correction schemes such as BiVOS or an adaptive supply voltage that require significant overheads for power supply routing due to closely spaced multiple voltage planes or accurate fine-tuning of supply voltage during run time.

III. BACKGROUND

A. Motion estimation

Motion estimation describes the movement of objects in a frame with respect to the previous frame through motion vectors. The video encoding of the frame is done through these vectors. At the decoder, the motion vectors are used to reconstruct the frame. The reconstruction process is known as motion compensation.

The approach most popularly employed to determine the required motion vectors is block-matching. In a block-matching approach, shown in Fig. 2, the current frame in the video sequence is divided into non-overlapping macro-blocks of size N×N pixels (e.g., N=16). A displacement motion vector (MV) is calculated for each macro-block in the current frame, with respect to the best match for the block in the previous frame. The search for the matching block is constrained within a search area of size (2p+1) × (2p+1) pixels, where the parameter p mostly varies from 7 to 31 (not standard; chosen for the type of video sequences to encode). An algorithm is used to determine the number and place of search points in the search area for an efficient search. The criterion for arriving at the best match out of the candidate macro-blocks is the Sum of Absolute Differences (SAD). SAD is calculated by summing up the absolute difference between pixel luminance values of the current block ‘a’ and the corresponding pixel luminance values of the candidate block ‘b’.

$$\mathrm{SAD} = \sum_{i}^{N} \sum_{j}^{N} \left| a(i,j) - b(i,j) \right|$$

[Fig. 2: Block-matching. Labels: Previous Frame, Current Frame, Current Block, Search Area, (x,y), (0,0), 2p+1.]

B. Motion estimation in video encoding

Fig. 3 illustrates video encoding where encoded frames can be characterized as either intra- or inter-coded frames. Intra-coded frames or I-frames are reference frames and are encoded via Discrete Cosine Transform (DCT) and quantization only. Inter-coded frames or P-frames are encoded via motion estimation as well as DCT and quantization. DCT and quantization are required in P-frames to encode the difference frame. In a video encoder, quantization leads to lossy compression. The amount of quantization to apply for encoding is governed by the Quantization Parameter (QP). The value for QP varies from 1 to 31 in an MPEG-2 encoder used for simulations in this paper. When a lower bitrate is desired, QP is increased, which increases the compression loss. In the event that the performance of motion estimation degrades, the video encoder either increases QP to lower the quality but maintain a constant bitrate, or, in applications where the quality cannot be compromised, it increases the bits required to encode the frame at a fixed QP. QP is controlled through rate control in the video codec receiving feedback from the entropy coding block. In most wireless and storage applications the bitrate is critical. Hence, in this paper, we consider low power motion estimation at a fixed bitrate only.

[Fig. 3: Video encoder. Blocks: DCT, Q, Entropy Coding, Rate Control, Q^-1, IDCT, Motion Compensation, Motion Estimation. Signals: F_I (input: current frame), F_Diff (difference frame), F_MC (motion compensated frame), F_P (decoded previous frame). Output: bit rate.]
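The relation between motion vectors, the motion compensated frame and the difference frame described in Sections III.A and III.B can be illustrated with a minimal Python sketch. The function names, the dictionary layout of `mvs`, and the sign convention of the motion vector (the best match lies at the block origin plus the vector in the previous frame) are assumptions made for illustration; they are not taken from the MPEG-2 codec used in the paper.

```python
def motion_compensate(prev, mvs, n):
    """Build the motion compensated frame from the previous frame.

    prev: 2-D list of luminance values (decoded previous frame).
    mvs:  dict mapping a block origin (row, col) to a motion vector (dy, dx);
          assumed convention: the best match lies at (row + dy, col + dx) in prev.
    n:    macro-block size (N in the text).
    """
    height, width = len(prev), len(prev[0])
    out = [[0] * width for _ in range(height)]
    for (r0, c0), (dy, dx) in mvs.items():
        for i in range(n):
            for j in range(n):
                # Copy the matched block of the previous frame into place.
                out[r0 + i][c0 + j] = prev[r0 + dy + i][c0 + dx + j]
    return out


def difference_frame(cur, pred):
    """Pixel-wise residual (current minus prediction) passed on to DCT and quantization."""
    return [[c - p for c, p in zip(cur_row, pred_row)]
            for cur_row, pred_row in zip(cur, pred)]
```

A poorer motion vector gives a larger residual, which is why degraded motion estimation costs either bits or quality at the quantization stage.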

C. Motion estimation algorithms

Block matching algorithms decide which search points to choose in the search area as candidate block locations for deciding the best matching block. In this section we briefly describe the two popularly used motion estimation algorithms used in this paper to show energy savings possible with PCMOS based computing. This is followed by a discussion of the block level functional details of a generic architecture of a motion estimation block and the part of this generic architecture that we target for energy savings in Section III.D. We then provide the specifications of how the part targeted for energy savings is designed for the motion estimation algorithms used in this paper for a PCMOS based implementation. In Section IV, we propose schemes for these algorithms that can further enhance the energy savings achievable with PCMOS based computing.

1) Full Search Block Matching Algorithm (FSBMA)

In FSBMA, all pixel locations in the search area are considered as candidate locations. This algorithm is preferred in video codecs such as H.264 as it yields the best results in terms of visual quality. However, it is also computationally very expensive, and thus many low power architectural implementation variants for this algorithm have been proposed [8-10]. We consider FSBMA to show the possible energy savings with probabilistic computing with minor degradation in the required quality of the output.

2) Three Step Search (TSS) algorithm

The TSS algorithm [11] subsamples the SAD search area in order to make an efficient search. TSS performs a successive coarse-to-fine search in three steps following the direction of minimum SAD in every step. The search strategy of the TSS algorithm is very popular, and several similar motion estimation algorithms have been proposed [45,46]. In TSS, the search area size is fixed to be (±7,±7) around the position of the current macro-block. TSS begins with an initial step size of Δ = 4. Nine candidate macro-block locations, positioned symmetrically around the position of the current macro-block, are first evaluated to find the winner candidate block. The winner candidate is decided to be the one with the minimum SAD among the nine candidate locations. The step size is then halved in the next step. The position of the candidate with the minimum SAD in the previous step is the new reference position for a finer search. The search is carried out in three steps, which explains the name ‘three step search.’ A possible outcome of the TSS algorithm is illustrated in Fig. 4. The positions of the candidate macro-blocks selected by the algorithm are known as search points. Example search points are shown in Fig. 4.

[Fig. 4: Three Step Search Algorithm. Example search points for steps 1, 2 and 3 on a grid spanning -7 to 7 in both directions, with the step size Δ and the search area marked.]

The number of search points evaluated by FSBMA versus TSS differs by a large amount. For a search area corresponding to p=7, the number of search points for FSBMA is 225 as compared to 25 for TSS.

D. Motion Estimation Architecture

The datapath unit is computationally the most intensive block of a motion estimation architecture and mainly consists of SAD calculations. In a full architectural implementation which includes the control logic, address generation unit and datapath unit, the datapath unit accounts for 60% and 75% of the involved processing of the motion estimation block for FSBMA and TSS respectively [9]. This is due to the large number of SAD calculations required for a video sequence, e.g., a movie of two hours would require SAD calculations for about a million blocks, and the number of additions required for SAD calculations would be on the order of a trillion! Therefore, in this paper, the effort for low power implementation is concentrated towards the datapath unit of the motion estimation block. The remaining part of the section describes the base datapath architectures used for FSBMA and TSS.
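The two search strategies of Section III.C, and their search point counts of 225 and 25 for p=7, can be reproduced with the following Python sketch. It is a plain software illustration under stated assumptions, not the hardware datapath discussed in this section: the function names, the frame representation as nested lists, the skipping of candidates that fall outside the frame, and the tie-breaking rule (the first minimum encountered wins) are all hypothetical choices.

```python
def sad(cur, prev, r0, c0, dy, dx, n):
    """SAD between the current block at (r0, c0) and the candidate block
    displaced by (dy, dx) in the previous frame."""
    return sum(abs(cur[r0 + i][c0 + j] - prev[r0 + dy + i][c0 + dx + j])
               for i in range(n) for j in range(n))


def in_frame(prev, r0, c0, dy, dx, n):
    """True if the displaced candidate block lies fully inside the previous frame."""
    return (0 <= r0 + dy and r0 + dy + n <= len(prev)
            and 0 <= c0 + dx and c0 + dx + n <= len(prev[0]))


def full_search(cur, prev, r0, c0, n, p):
    """FSBMA: evaluate every displacement in the (2p+1) x (2p+1) search area."""
    best, points = None, 0
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            if not in_frame(prev, r0, c0, dy, dx, n):
                continue
            points += 1
            cost = sad(cur, prev, r0, c0, dy, dx, n)
            if best is None or cost < best[0]:
                best = (cost, (dy, dx))
    return best[1], points


def three_step_search(cur, prev, r0, c0, n):
    """TSS: nine points per step around the current best, step size 4, 2, 1."""
    cy, cx = 0, 0
    best_cost = sad(cur, prev, r0, c0, 0, 0, n)
    points, step = 1, 4
    while step >= 1:
        best = (best_cost, cy, cx)
        for sy in (-step, 0, step):
            for sx in (-step, 0, step):
                if sy == 0 and sx == 0:
                    continue  # the centre was evaluated in the previous step
                dy, dx = cy + sy, cx + sx
                if not in_frame(prev, r0, c0, dy, dx, n):
                    continue
                points += 1
                cost = sad(cur, prev, r0, c0, dy, dx, n)
                if cost < best[0]:
                    best = (cost, dy, dx)
        best_cost, cy, cx = best
        step //= 2
    return (cy, cx), points
```

For a block far enough from the frame border, `full_search` with `p=7` evaluates 15 × 15 = 225 points, while `three_step_search` evaluates 1 + 3 × 8 = 25, since the centre of each step is reused from the previous one.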
These base architectures are first modified for a PCMOS-based implementation and then further changed to accommodate the fault tolerant schemes suggested for the algorithms in this paper.

1) Base architecture for FSBMA

Systolic arrays and their derivative architectures are widely used to implement datapath architectures for motion estimation when the block matching algorithm being used is FSBMA. The memory accesses for pixel data used in the computation of SAD are largely regular in nature, making systolic arrays an ideal choice for the datapath architecture in case of FSBMA. Fig. 5 shows a 1-D systolic array proposed in [47] which can process 25 frames per second for a block size N=16. This array architecture is chosen as the base architecture for our simulations. Note that this systolic array calculates SAD sequentially for the block comparisons, i.e., every 16th clock cycle SAD is calculated for a new comparison, and values are latched by the ACC block to accumulate SAD every clock cycle. The delay shown in Fig. 5 has to be given initially for correct SAD values to latch at the first block comparison.

The functionality of constituent blocks of the systolic array architecture is described in Fig. 6, where block AD computes the absolute difference of incoming pixel values and adds the result to the output of the previous AD block, block ACC is an accumulator which has to be set to zero before the computation of SAD for every new comparison, and block MIN computes the minimum in order to keep the least SAD per motion vector. The adders used in each of these blocks have bit widths ranging from 8 to 16 bits. These are in turn modeled as ripple carry adders. Also, the architecture has to be modeled as a sequential circuit running on a clock to facilitate memory accesses and computations with the correct timing. Thus, blocks AD, ACC and MIN require registers. At the gate level implementation, the basic building blocks of the systolic array architecture are full adders, D-flip flops, inverters and EXOR gates. Inverters are used in blocks AD and MIN for subtraction, and EXOR gates are used in block AD for absolute value computation.

[Fig. 5: Systolic array architecture. A chain of AD blocks with pixel inputs labeled p(i,j) and c(i,j), followed by a delay element and the ACC and MIN blocks.]

[Fig. 6: Block level details for the systolic array architecture. AD: inputs X, Y and a, output |X-Y|+a. ACC: inputs a and b, output a+b. MIN: inputs a and b, output min(a,b).]
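The block functions of Fig. 6 can be mirrored at bit level with the short Python sketch below, which builds block AD from a ripple carry adder made of full adders. It is a functional illustration only and is not cycle-accurate: registers, the initial delay and the MIN block are omitted. The exact way inverters and EXOR gates are wired is an assumption here (two's-complement subtraction followed by a conditional negation), as are all function names and the default bit widths.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(x, y, cin, width):
    """Add two width-bit values bit by bit; returns (sum mod 2**width, carry_out)."""
    result, carry = 0, cin
    for k in range(width):
        s, carry = full_adder((x >> k) & 1, (y >> k) & 1, carry)
        result |= s << k
    return result, carry


def abs_diff(x, y, width=8):
    """|x - y| using inverters and EXOR gates (assumed wiring)."""
    mask = (1 << width) - 1
    # Subtraction: x - y = x + NOT(y) + 1; the inverters supply NOT(y).
    diff, carry = ripple_carry_add(x, ~y & mask, 1, width)
    negative = 1 - carry  # no carry out means x < y
    # Conditional negation: EXOR every bit with the sign, then add the sign.
    flipped = diff ^ (mask if negative else 0)
    result, _ = ripple_carry_add(flipped, 0, negative, width)
    return result


def ad(x, y, a, width=16):
    """Block AD of Fig. 6: |X - Y| + a."""
    total, _ = ripple_carry_add(abs_diff(x, y), a, 0, width)
    return total


def block_sad(cur_block, cand_block, width=16):
    """SAD of one block comparison: a chain of AD blocks per row, then ACC."""
    total = 0  # ACC is set to zero before every new comparison
    for cur_row, cand_row in zip(cur_block, cand_block):
        partial = 0  # input of the first AD block in the chain
        for x, y in zip(cur_row, cand_row):
            partial = ad(x, y, partial, width)
        total, _ = ripple_carry_add(total, partial, 0, width)  # block ACC: a + b
    return total
```

With 8-bit pixels and N=16, the largest possible SAD is 256 × 255 = 65280, which fits in 16 bits and is consistent with the 8- to 16-bit adder widths stated above.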

2) Base architecture for TSS

Tree architectures are popular for Hierarchical Search Algorithms (HSA) such as TSS that perform a coarse-to-fine search, as the pixel data common between two candidate blocks is significantly reduced. The elements in the pipeline stages of the tree can be decreased or increased according to the processing requirements of the system being implemented. We choose a tree architecture suitable to process a frame size of 352×288 at 25 fps, which is required by most videophones today. The tree architecture used is shown in Fig. 7. Elements D, A, ACC and MIN function as absolute difference unit, adder, accumulator and comparator, respectively, as shown in Fig. 8. The tree architecture shown in Fig. 7 has 8 parallel absolute difference computing units (Block D), and it takes 32 clock cycles to compute the SAD for a block size of N=16.

[Fig. 7: Tree architecture for the Three-Step Search algorithm. Eight D blocks with inputs X and Y feed a tree of seven A blocks, followed by ACC and MIN.]

[Fig. 8: Elements of the tree architecture. D: inputs a and b, output |a–b|. A: inputs a and b, output a+b. ACC: inputs a and b, output a+b. MIN: inputs a and b, output min(a,b).]

E. Modeling and error estimation for PCMOS based architecture

Any generic architecture is built using logic gates; in PCMOS architectures, the logic gates are replaced by their probabilistic versions, termed Probabilistic Gates. Korkmaz et al. [41,42] proposed a model for a probabilistic gate by coupling noise sources at the outputs of the deterministic version of the gate being modeled. All simulations are performed in HSPICE using the Synopsys 90nm generic library. The approach used in [41,42] goes with the assumption that the equivalent noise source at the output estimates the impact of the noise present in each gate or transistor in an actual noisy circuit. These noise sources are modeled by a Voltage Controlled Voltage Source (VCVS) whose gain is determined by the distribution considered for modeling the noise. References [41,42] use a Gaussian distribution of an appropriate Root Mean Square (RMS) value to model the noise, assumed predominantly to be thermal. Fig. 9 shows a model of a noisy Probabilistic (Pr.) Full Adder (FA) built using a Deterministic (Det.) FA.

[Fig. 9: Probabilistic Full Adder. A full adder (FA) with inputs A, B and Cin; elements labeled Vn and Rn are coupled at the Cout and Sum outputs, giving Pr. Cout and Pr. Sum.]

The error rates for the outputs of a probabilistic circuit built using probabilistic gates are determined by simulating the circuit in HSPICE for a large number of inputs and comparing its output with the output of its deterministic version. The error rate is then calculated as the number of incorrect computations divided by the total number of times data is provided to the input. Using HSPICE simulations to determine error rates is a time consuming and tedious process for larger circuits. Singh et al. [43,44] proposed methods to predict the error rates observed at the outputs of PCMOS circuits for larger cascade circuits such as a Ripple Carry Adder (RCA) or a Wallace Tree Multiplier (WTM). In a cascade circuit, the interconnections of the gates account for filtering of the noise waveform. This phenomenon is described in detail in [44].

In the method proposed by Singh et al. [44] to predict the error rates of cascade circuits, the error rate for each gate in the circuit is measured using a three-stage model for the gate, shown in Fig. 10. The Pr. Gate is the noisy gate for which the error rates are being calculated and has been modeled as shown in Fig. 9 for the case of an FA. The Det. Gate in Fig. 10 is the non-noisy version of the Pr. Gate. The Filter used in Fig. 10 is a non-noisy version of the gate which is connected to the Pr. Gate in the actual cascade circuit. It is called a filter because a filtering of the noise waveform dependent on the propagation delay of the filter gate is observed in HSPICE simulations. The load is a non-noisy version of the gate connected to the filter in the cascade circuit. To determine the error rates for the Pr. Gate, the outputs of the Filter gates for the two configurations shown in Fig. 10 are compared. The error rate is calculated as the number of non-matching outputs divided by the total number of input vectors. With a three-stage modeling of all possible unique configurations in the circuit, the error rate for each gate in the circuit is determined.

C simulations of the circuit, in which gate outputs are falsely flipped according to the calculated error rates to determine the final error probability of the cascade circuit, result in error rates that match what is observed when the entire cascade circuit is simulated in HSPICE. Details of the simulation method used are provided in [44]. Most gate configurations (three-stage models) are repetitive in cascade circuits, largely reducing the number of unique configurations. This type of modeling for calculation of error rates is much less time consuming than HSPICE simulations for the entire cascade circuit and provides a means to map the behavior of probabilistic circuits into C code for testing their effect on applications such as motion estimation, JPEG, etc. Fig. 11 is taken from [44] and demonstrates the precision with which the three-stage model can track the actual error rate for larger circuits such as adders.

[Fig. 10: Three-stage model to estimate error rates for prob. circuits. Two configurations driven by the same inputs, one with the Det. Gate and one with the Pr. Gate, each followed by a Filter and a Load; the Filter outputs are compared.]

Fig. 11: Error rate measured for outputs of a 16-bit RCA
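The idea of mapping a probabilistic cascade circuit into software by falsely flipping gate outputs can be sketched as follows. The paper's simulations are written in C and use a separate error rate for each gate, obtained from its three-stage configuration; the Python sketch below is a simplification that assumes one hypothetical error probability `p_err` shared by every full adder output, and all function names are illustrative.

```python
import random


def noisy_full_adder(a, b, cin, p_err, rng):
    """Full adder whose sum and carry outputs are each flipped with probability p_err."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    if rng.random() < p_err:
        s ^= 1
    if rng.random() < p_err:
        cout ^= 1
    return s, cout


def noisy_rca(x, y, width, p_err, rng):
    """Ripple carry adder built from noisy full adders; returns width+1 output bits."""
    result, carry = 0, 0
    for k in range(width):
        s, carry = noisy_full_adder((x >> k) & 1, (y >> k) & 1, carry, p_err, rng)
        result |= s << k
    return result | (carry << width)


def bit_error_rates(width, p_err, trials, seed=0):
    """Error rate per output bit: incorrect computations divided by number of inputs."""
    rng = random.Random(seed)
    errors = [0] * (width + 1)
    for _ in range(trials):
        x, y = rng.getrandbits(width), rng.getrandbits(width)
        diff = noisy_rca(x, y, width, p_err, rng) ^ (x + y)  # 1 where a bit is wrong
        for k in range(width + 1):
            errors[k] += (diff >> k) & 1
    return [e / trials for e in errors]
```

Calling `bit_error_rates(15, p_err, 100000)` gives one rate per bit index of the 16-bit sum, which is the kind of per-bit profile that Fig. 11 reports; the values depend entirely on the assumed `p_err` and are not a reproduction of the paper's results.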

Fig. 11 shows a plot for a 15-bit Ripple Carry Adder that measures the error rate at every bit index of the output (16-bit Sum) in HSPICE, which is compared with predictions for these error rates made using a C code needing error probability of unique three stage configurations of the 15-bit RCA as input. The error rate is calculated for 100,000 input values for both the HSPICE simulation and C code based simulations. This process is repeated for different voltages, and, from Fig. 11, error rates measured through HSPICE agree with predictions made through the C based code.

IV. PROPOSED METHODOLOGIES

Motion estimation algorithms are largely derived using heuristics, and the performance of these algorithms is evaluated based on the image quality of the frame reconstructed through motion compensation. First, these algorithms are tested with PCMOS based implementation of their architectures alone. Voltage scaling, which determines the frequency of errors in the case of PCMOS, is limited to a value that provides acceptable degradation in the image quality. Error correction schemes that can enhance the scaling of voltage further for the motion estimation architecture than PCMOS based implementation alone are proposed for FSBMA and TSS in this section. The proposed error


Control Number: 6553 correction schemes differ widely for the two algorithms. This is because of the difference in their search strategies where FSBMA uses (2p+1)×(2p+1) (e.g., 225 in the case of p=7) search points whereas the number of search points for TSS is limited to 25. As with most motion estimation algorithms, the proposed error correction schemes rely on heuristics to justify their capacity to increase the fault tolerance of FSBMA and TSS with PCMOS-based implementation. This section describes the PCMOS modeling of the gates used in the FSBMA and TSS datapath architectures. An error correction scheme based on motion vector statistics is proposed for FSBMA to be used with the PCMOS based architecture proposed for FSBMA. This is followed by a discussion of a new algorithm we propose, Multiple Candidate Three Step Search (MC-TSS) based on TSS that outperforms TSS in terms of energy savings with a PCMOS based architectural implementation. A. PCMOS architectural modeling for FSBMA and TSS The gates identified in the architectures used are full adders, D-flip flops, EXOR and INV gates. These gates are modeled as probabilistic gates with the exception of EXOR and INV gates which account for a small percentage (less than 10%) of the total energy consumed in the systolic array architecture. The systolic array architecture for FSBMA in Fig. 5 and tree architecture in Fig. 7 consist of adders of bit widths ranging from 8 to 16 bits. These adders are modeled as ripple carry adders (RCAs). The architectures are required to run sequentially on a clock. So, blocks AD, A, ACC and MIN of Fig. 6 and Fig. 8 include internal registers to facilitate the sequential processing. These registers are modeled with the help of D-flip flops. The architecture for the TSS algorithm consists of Full Adders, D-flip flops, inverters and EXOR gates. The inverters and EXOR gates are used in block D of Fig. 8 for calculation of absolute difference. 
The architecture for the MC-TSS algorithm also requires NAND, MUX, NOR and INV gates, which are needed to implement the logic in Fig. 17 discussed later in this paper.

The modeling of PCMOS gates involves coupling a Gaussian noise source at the output of each gate, and error rates for the gates are measured using three stage modeling as described earlier in Section III.E. Each noise source is modeled as a Voltage Controlled Voltage Source (VCVS) with gain equal to the noise RMS. The noise RMS is chosen such that no errors are observed at the output of each probabilistic gate at the nominal supply voltage for the technology node used (noise RMS found empirically to be 0.2V; details in Section V). Errors begin to show up at the output of the probabilistic gate when its supply voltage is scaled down. By modeling the gates in such a manner we have tried to emulate what may be observed in a future technology node, say at 12nm.

In order to correctly model the error of the PCMOS architecture in the motion estimation code, unique three stage configurations are identified in the HSPICE circuit designed for the systolic array architecture. Error rates are measured for each gate of the architecture using the unique three stage configurations to which they map. The HSPICE simulations required to measure the error rates for noisy probabilistic gates are carried out using the Synopsys 90nm generic library with a nominal supply voltage of 1.2V. The calculations made through the architecture can then be modeled in C code as per the computations at the gate level in the circuit, and false bit flipping can be introduced in the C-code based calculations using the error probability values for the specific gate. This code is then used with the MPEG-2 video codec code for motion estimation. C code based simulations can thus be used to determine the error tolerance of motion estimation, thereby aiding in determining the appropriate operating supply voltage for the circuit. The energy consumption of the PCMOS architecture at the appropriate supply voltage is calculated using HSPICE simulations of the transistor level netlist of the entire architecture; simulation details are discussed in Section V. As the supply voltage is scaled down, the energy consumption of the architecture decreases but the magnitude of the errors increases.

B. Motion vector statistics based error correction scheme

For the implementation of a PCMOS based architecture for motion estimation, we scale down the supply voltage until the video quality reaches the limit we set on the allowed degradation in quality. With the proposed error correction scheme for FSBMA, however, larger voltage scaling is possible, which further lowers the energy consumption achievable with a PCMOS based motion estimation architecture. This error correction scheme is based on a property shared by most video sequences: the displacement of macro-blocks is typically very small over a set of adjacent frames. The background, fixed objects, etc., in a video frame hardly move over adjacent frames. Based on this idea, we can postulate that a large percentage of motion vectors will be equal to (0,0) and an even larger percentage will be concentrated within (±r, ±r), where r is a small value.
Preliminary analysis with test video sequences having different motion characteristics, ranging from slow to fast, showed that more than 50% of the motion vectors lie within an area of 5×5 in a search area of 23×23, with this value being as high as 97% in the case of videos with slow motion. The video sequences analyzed include sports video sequences with fast motion characteristics such as Stefan, which shows a game of tennis. This observation has been used in the past for proposing center biased searches [48,49]. We use it to propose the following scheme for correction of errors introduced by the PCMOS architecture.

Our scheme of error correction requires two parallel systolic array architecture implementations working at two different operating voltages. The search area is divided into two regions as shown in Fig. 12. Region 1 in Fig. 12, a smaller region of the search area, corresponds to the area more likely to have the best motion vector. Region 2, outside of region 1, is a region less likely to have the best motion vector. Of the two parallel systolic arrays, one always works at the nominal supply voltage of 1.2V for 90 nm technology and processes region 1, whereas the other systolic array works at a scaled voltage value and processes region 2.

[Fig. 12: Search area division. Region 1, of size (2r+1)×(2r+1), lies inside the (2p+1)×(2p+1) search area in the previous frame; region 2 is the remainder of the search area.]

The workload of the systolic array processing region 1 is much lower than the one

Copyright (c) 2013 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].


processing region 2. Details of how the supply voltage value for the second systolic array (running at a scaled voltage value) is determined are discussed in Section V.A. The idea is that the error-free systolic array operating at 1.2V is used to calculate the motion vector corresponding to the winner candidate block in region 1, whereas the error-prone systolic array operating at a scaled supply voltage value is used to determine the motion vector corresponding to the winner candidate block in region 2. The winner candidates from the two systolic arrays are compared to determine the best matching block.

The size of region 1 is decided using the quantization parameter (QP), which can be taken as feedback from the quantization block in the video codec. We know that QP in a video encoder coding at a fixed bitrate increases when the quality of motion estimation degrades. This property can be used to vary the size of region 1 as per the needs of the video sequence being encoded. The size of region 1 is controlled by the range parameter ‘r’. In the steps for the error correction scheme shown below, QPTH, η1 and η2 are parameters that facilitate the decision on what value of ‘r’ to choose based on the value of QP. Parameter r1 is the maximum value that ‘r’ can assume. The approach we use is as follows:

Step 1: Select an initial range parameter ‘r’.
Step 2: For each macro-block in the current frame,
Step 2.1: Divide the search area into two regions.
Step 2.2: Determine the best matching block in region 1 using FSBMA and the associated architecture maintained at 1.2V.
Step 2.3: Determine the best matching block in region 2 using FSBMA and the associated architecture running at a scaled supply voltage. Finally, calculate the SAD for the winner candidate in this region with an arithmetic unit maintained at 1.2V; note that this arithmetic unit at 1.2V is already present in the proposed architecture for the calculations required in Step 2.2.
Step 2.4: Compare the SADs of the best matching blocks from Step 2.2 and Step 2.3, and select the one with the smaller SAD as the final best matching macro-block. Store the motion vector corresponding to this block.
Step 3: After every 5th frame, review the range parameter based on the value of the Quantization Parameter (QP). If QP is found to deviate by more than an acceptable limit (limits defined by parameters η1 and η2), ‘r’ is changed:
If QP > η1·QPTH and r < r1, then r = r+1,
Else if QP < η2·QPTH and r > 0, then r = r−1,
Else r = r.
Continue from Step 2.
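The steps above can be sketched in a few lines of Python. This is only an illustrative model, not the authors' implementation: the function names (`sad`, `two_region_search`, `update_range`) and the `noisy_sad` hook standing in for the voltage-scaled systolic array are hypothetical, and the threshold tests in `update_range` assume the limits are η1·QPTH and η2·QPTH.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Exact sum of absolute differences between the n x n block of `cur`
    at (bx, by) and the block of `ref` displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def two_region_search(cur, ref, bx, by, n, p, r, noisy_sad=sad):
    """Steps 2.1-2.4. Region 1 (|dx|, |dy| <= r) is searched with the exact SAD;
    region 2 is searched with `noisy_sad` (hypothetical stand-in for the
    error-prone array). The region-2 winner is re-evaluated exactly before
    the final comparison."""
    best1 = None  # (exact SAD, motion vector) for region 1
    best2 = None  # (possibly erroneous SAD, motion vector) for region 2
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            if abs(dx) <= r and abs(dy) <= r:
                s = sad(cur, ref, bx, by, dx, dy, n)
                if best1 is None or s < best1[0]:
                    best1 = (s, (dx, dy))
            else:
                s = noisy_sad(cur, ref, bx, by, dx, dy, n)
                if best2 is None or s < best2[0]:
                    best2 = (s, (dx, dy))
    if best2 is None:  # r >= p: region 2 is empty
        return best1[1]
    # Step 2.3 (second half): exact SAD of the region-2 winner.
    exact2 = sad(cur, ref, bx, by, best2[1][0], best2[1][1], n)
    # Step 2.4: keep the candidate with the smaller exact SAD.
    return best1[1] if best1[0] <= exact2 else best2[1]


def update_range(r, qp, qp_th, eta1, eta2, r_max):
    """Step 3, assuming the limits are eta1*qp_th (grow r) and eta2*qp_th (shrink r)."""
    if qp > eta1 * qp_th and r < r_max:
        return r + 1
    if qp < eta2 * qp_th and r > 0:
        return r - 1
    return r
```

Because the region-2 winner is always re-checked with exact arithmetic, an erroneous region-2 result can displace the region-1 winner only if its true SAD is actually smaller.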

In the proposed scheme, QPTH, η1, η2 and r1 are determined empirically. The range parameter ‘r’ is reviewed after every 5th frame so that the results of the previous change in ‘r’ are reflected in the quality of motion estimation. The error correction in this scheme is in terms of selection of the correct motion vector, which is more likely to be present in region 1.

C. Multiple Candidate Three Step Search (MC-TSS) Algorithm

We propose Multiple Candidate Three Step Search (MC-TSS) in place of the conventional TSS algorithm to obtain higher energy savings than TSS when using a PCMOS based architecture. The search strategy for TSS begins with a very coarse search in the first step followed by a finer search in the subsequent steps. Due to this approach, any error in calculating the direction of search in the initial steps of the algorithm may result in a significant degradation in the output quality. So the algorithmic modification suggested for TSS to decrease errors caused by PCMOS based arithmetic needs to avoid making mistakes in the steps used to arrive at the best matching block. Also, TSS needs just 25 search points, so the proposed algorithmic modification needs to keep its computational overhead very low.

MC-TSS evaluates nine candidate locations in the first step to select the three winner candidate locations with the least SAD. The next step involves a finer search around all three winner candidates to select the next three winner candidates. In every step, the three locations with the least SAD are kept for the next finer search, so there may arise cases in which two or more winner locations belong to the same group, as can be seen in the top left corner of Fig. 13. The flow diagram for MC-TSS is shown in Fig. 14. We decided to keep three winner candidates in every step based on performance, rather than keeping fewer or more than three candidates; a comparative analysis of MC-TSS with two or more winner candidates is presented in Section V.B.3.

[Fig. 13: Multiple Candidate Three-Step Search algorithm. Search points over a search area spanning −7 to 7 in each direction, labeled by the step (1, 2 or 3) in which they are evaluated; Δ denotes the step size.]

Furthermore, we also suggest a modification in the calculation of SAD. SADh (where “h” stands for “half”) is calculated as

$$\mathrm{SAD}_h = \sum_{i=1}^{N/2} \sum_{j=1}^{N} \left| a(2i, j) - b(2i, j) \right|$$
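As a rough illustration of the SADh definition, the sketch below compares a full SAD with a half SAD. It assumes that the first index of a(2i, j) selects the row, so that only every second row of the block is used; the function names `sad_full` and `sad_half` are hypothetical.

```python
def sad_full(a, b):
    """Conventional SAD over all N x N pixels of blocks a and b."""
    return sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def sad_half(a, b):
    """SADh: SAD over alternate rows only (rows 2, 4, ..., N in 1-based
    indexing, i.e. indices 1, 3, ... here), assuming the first index of
    a(2i, j) is the row. This uses N*N/2 absolute differences."""
    return sum(
        abs(pa - pb)
        for ra, rb in zip(a[1::2], b[1::2])
        for pa, pb in zip(ra, rb)
    )
```

For a 16×16 block, `sad_full` sums 256 absolute differences and `sad_half` sums 128.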

SADh is calculated using only alternate pixel values of the candidate block. The use of alternate pixel values halves the computations used to calculate each SAD, thereby also halving the total number of computations for a given number of search points. With MC-TSS, the number of search points evaluated increases to 57, as shown in Fig. 13, compared to 25 for TSS. The reduction in SAD computations from using SADh in MC-TSS roughly balances the computational overhead introduced by the extra search points, so the total number of computations used in MC-TSS nearly equals the number used in the conventional TSS algorithm.

Start
1. Initialize the positions of the three winner candidates to (0,0) and SADM1, SADM2 and SADM3 to (2^16 − 1); Step := 0; Step Size (Δ) := 4.
2. Step := Step + 1.
3. Evaluate the nine candidates positioned symmetrically at (±Δ,0), (0,±Δ) and (±Δ,±Δ) around each of the three winner candidates.
4. Update the winner candidate positions and SADM1, SADM2 and SADM3.
5. If Step = 3 (YES): the winner candidate corresponding to SADM1 determines the motion vector; Stop. Otherwise (NO): Δ := Δ/2 and continue from 2.

Fig. 14: MC-TSS flow diagram
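The flow of Fig. 14 can be sketched as follows. This is a simplified software model under stated assumptions, not the hardware behaviour: `mc_tss`, `cost`, `keep` and `step` are hypothetical names, the winners are taken as the best `keep` positions among all points evaluated so far, and a cache avoids re-evaluating a position, so the number of distinct points counted here can be lower than the 57 quoted in the text when winner neighbourhoods overlap.

```python
def mc_tss(cost, keep=3, step=4):
    """Multiple-candidate three-step search.

    cost(dx, dy) returns the matching cost (e.g. SADh) of displacement (dx, dy).
    keep=3 models MC-TSS; keep=1 reduces to the conventional TSS.
    Returns (best displacement, number of distinct points evaluated).
    """
    winners = [(0, 0)]   # all winner candidates start at (0, 0)
    evaluated = {}       # position -> cost, in order of evaluation
    while step >= 1:     # step sizes 4, 2, 1
        for cx, cy in winners:
            for dy in (-step, 0, step):
                for dx in (-step, 0, step):
                    pos = (cx + dx, cy + dy)
                    if pos not in evaluated:
                        evaluated[pos] = cost(*pos)
        # Keep the `keep` lowest-cost positions seen so far (stable sort:
        # ties go to the position evaluated first).
        winners = sorted(evaluated, key=evaluated.get)[:keep]
        step //= 2
    return winners[0], len(evaluated)
```

With the search-point counts quoted in the text, MC-TSS with SADh needs 57 × 128 = 7296 absolute differences per macro-block, against 25 × 256 = 6400 for TSS with the full SAD.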

The tree architecture used for TSS shown in Fig. 7 can be adapted for MC-TSS through a simple modification. The number of comparator units used has to be increased to three for MC-TSS. These comparator units are used to compare the SAD of the current candidate macro-block, SADC, to the three least SADs of the previous step.

[Fig. 15: Comparator unit for MC-TSS. Comparator units MIN 1, MIN 2 and MIN 3 compare SADC with SADM1, SADM2 and SADM3.]

The register unit that is used to store the minimum SAD (MIN) has to be modified to store the three least SADs: SADM1, SADM2 and SADM3. The number of register units required to store the least SADs therefore increases to three for MC-TSS, as shown in Fig. 15. The movement of data between these registers depends on the outcome of the three comparators. The logic for the movement of data between the registers that store the three least SAD values is shown in Fig. 16.

IF SADC < SADM1 THEN
    SADM3 = SADM2
    SADM2 = SADM1
    SADM1 = SADC
ELSE IF SADC < SADM2 THEN
    SADM3 = SADM2
    SADM2 = SADC
ELSE IF SADC < SADM3 THEN
    SADM3 = SADC

Fig. 16: Logic for data movement between register units
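The register update of Fig. 16 amounts to inserting a new SAD into a sorted list of three. A minimal software sketch is given below; the function name `update_min_registers` and the tuple representation are hypothetical, and the initial value 2^16 − 1 is the one given in the flow diagram of Fig. 14.

```python
SAD_INIT = 2**16 - 1  # initial value of SADM1, SADM2 and SADM3 (Fig. 14)


def update_min_registers(regs, sad_c):
    """Apply the Fig. 16 logic to regs = (SADM1, SADM2, SADM3), which hold the
    three least SADs in ascending order, for a new candidate SAD `sad_c`."""
    m1, m2, m3 = regs
    if sad_c < m1:
        return (sad_c, m1, m2)   # shift M1 -> M2 -> M3, new value into M1
    if sad_c < m2:
        return (m1, sad_c, m2)   # shift M2 -> M3, new value into M2
    if sad_c < m3:
        return (m1, m2, sad_c)   # replace M3 only
    return regs                  # candidate is not among the three least
```

Feeding the candidate SADs 40, 10, 30, 50 and 20 in turn, starting from three registers at `SAD_INIT`, leaves the registers holding (10, 20, 30).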

The logic described in Fig. 16 can be implemented with the help of shift registers. In Fig. 17, Sign_Bit1, Sign_Bit2 and Sign_Bit3 are provided by the comparator unit. Fig. 17 describes the shift register unit for the jth bit of SADC (indicated in Fig. 15), SADM1, SADM2 and SADM3, and the gate level implementation of the logic required for the movement of SAD values between registers, dependent on the sign bits provided by the comparators. This unit is replicated sixteen times, once for each of the 16 bits of the SADs.

[Fig. 17: Shift register unit for data movement. For bit j, a 2x1 MUX feeds the D-flip flop holding SADM1(j), and 4x1 MUXes feed the D-flip flops holding SADM2(j) and SADM3(j); the inputs are SADC(j), Clk, Sign_Bit1, Sign_Bit2 and Sign_Bit3.]

V. RESULTS

A. Simulation Results for PCMOS based FSBMA

1) Energy and Error Estimation through HSPICE

The systolic array datapath architecture for FSBMA was described in Fig. 5. All energy and error estimation for this circuit is carried out using HSPICE, with a 90nm generic library from Synopsys. A transistor level circuit description of the entire architecture is provided to HSPICE. A clock frequency of 125 MHz was used, chosen so that the 1-D systolic array architecture can process a video with frame size 352×288 at 25 fps and block size N=16. The chosen frame rate and size conform to those required by many videophones today. Gate level descriptions of the blocks AD, ACC and MIN of Fig. 6 present in the architecture are shown in Fig. 18. The gates used in these blocks are Full Adders (FAs), D-flip flops (FFs), EXORs and Inverters. In addition to these gates, Level Shifters (LS) are required in the comparator block when the architecture is modeled as a PCMOS architecture. Using Level Shifters before the comparison of SAD values ensures non-erroneous behavior from the comparator. The requirement of non-erroneous comparison is explained later in this section.

The transistor level circuit used for full adders is a 24 transistor mirror adder circuit [50], and the circuit used for D-flip flops is described in [51]. The circuit used for the D-flip flop is a dynamic edge triggered flip flop which holds data only for a short period of time. This flip flop design suits the requirements of the architecture used, as the outputs from the registers are processed every clock cycle of short duration. Full adders and D-flip flops account for 90% of the total energy consumed, so noise based PCMOS models based on the method described in [41, 42] were developed only for full adders and D-flip flops. The RMS value for the noise sources was chosen such that no errors were observed at the nominal supply voltage of 1.2V. The noise RMS value chosen was 0.2V. This emulates a possibly noisy future technology node, e.g., 12 nm.

To calculate the error rate for the probabilistic full adders and flip flops used in the circuit, we first identify the unique three stage configurations of these gates in the architecture, as shown in Fig. 19. In Fig. 19(2), LOAD corresponds to all the gates to which the inverter is attached in the actual circuit. The error rate for each of these configurations is determined over a range of supply voltages from 1.2V down to 0.5V in steps of 0.05V. The lowest voltage used was 0.5V, which still met the delay requirements of the systolic array architecture used by the full search algorithm to process the video at the specified frame rate. The operability of the technology node we used, namely 90nm, at supply voltage values as low as 0.5V has been shown in [52].
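Once per-gate error rates are available, the bit-flip injection described in Section IV.A can be modeled in software. The sketch below is a hypothetical Python stand-in for that C-code model, not the authors' code: `noisy_rca` and its parameters are assumed names, each full adder's sum and carry outputs are assumed to flip independently, and the probabilities in `p_err` are placeholders for values that would come from the HSPICE three stage measurements.

```python
import random


def full_adder(a, b, cin):
    """Exact one-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def noisy_rca(x, y, width, p_err, rng, cin=0):
    """Ripple-carry addition of two `width`-bit values in which the sum and
    carry outputs of full adder i are each flipped with probability p_err[i]
    (assumed independent). A flipped carry propagates to the next stage.
    Returns a (width + 1)-bit result including the final carry."""
    result = 0
    carry = cin
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        if rng.random() < p_err[i]:
            s ^= 1          # erroneous sum bit
        if rng.random() < p_err[i]:
            carry ^= 1      # erroneous carry into the next stage
        result |= s << i
    return result | (carry << width)
```

With all probabilities set to zero the function reduces to exact addition, which corresponds to operation at the nominal supply voltage.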
At the scaled supply voltage, the blocks AD, ACC and MIN in the architecture become erroneous. However, block MIN is a comparator and needs to make correct decisions. Hence, block MIN was maintained at the supply voltage of 1.2V. The output of block ACC was level shifted to 1.2V using level shifters. The circuit used for level shifting is described in [53]. The energy consumption of the architecture has been calculated using HSPICE. The inputs to the architecture were test vectors from the standard test video sequences. The description of the video sequences is provided in Section V.A.2. For brevity, Case 1 will henceforth refer to FSBMA with a PCMOS based architecture, and Case 2 will refer to the error correction scheme of Section IV.B applied to FSBMA with a PCMOS based architecture. Case 1 requires the blocks AD and ACC of the architecture to run at only one of the voltage levels specified, and block MIN to run at 1.2V. Hence, energy consumption was calculated by simulating the entire systolic array architecture circuit in HSPICE, scaling down the supply voltage uniformly for all components of blocks AD and ACC to the required voltage level and using level shifters in block MIN. Case 2 requires that two parallel architectures be maintained at distinct voltage levels, 1.2V and a lower voltage value. If the average power consumed by the two architectures (circuits) for Case 2 at the different supply voltages is P1 and P2, and the percentage of calculations made through the two


architectures, determined by the range parameter ‘r’, are C1 and C2, then the dynamic energy consumption for Case 2 is estimated as (P1 C1 + P2 C2) T, where T is the time period of the clock cycle required per calculation. The values of C1 and C2 have been determined over 100 frames of a video sequence for a good estimate. Note that calculating power in this manner accounts only for the dynamic energy consumption of the circuit. Measuring only the dynamic energy consumption is justified when static power is known to account for less than 1% of the energy consumed by the electronic circuitry, which is the case for the 90nm technology library we used in this paper.

[Fig. 18: Gate level details of (a) AD (Absolute Difference) block, (b) ACC (Accumulator) block, (c) MIN (Comparator) block, and (d) MIN (Comparator) block with level shifters]

[Fig. 19: Unique three stage configurations of Full Adders (1-5) and flip flops (6-7) in the systolic array circuit]

For Case 2, additional control logic will be required to multiplex the data from the memory for the two systolic arrays, i.e., the pixel data of the search area pixels in region 1 (Fig. 12) and the current block data to be given to systolic array-1 working at 1.2V, and the pixel data in region 2 (Fig. 12) to be given to systolic array-2 working at a lower voltage. The control logic will have to facilitate one additional block memory access for systolic array-1, required for the calculation of the final motion vector. The additional memory access is for the pixel data of the winner candidate block in region 2, which needs to be evaluated by an error free architecture against the winner of region 1. Also, the area of the datapath architecture will be approximately doubled because of the presence of two systolic arrays, as shown in Fig. 20.

[Fig. 20: Motion estimation architecture details for error correction scheme. Blocks shown: Memory, Control Logic, and a Datapath Architecture containing Systolic Array-1 and Systolic Array-2.]
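The Case 2 energy estimate (P1 C1 + P2 C2) T given above can be written out as a small sketch. The names below are hypothetical, and the power values used in any call are placeholders rather than measured figures. The helper `region_fractions` additionally assumes that C1 and C2 follow purely from the geometry of the two regions for a fixed ‘r’, whereas the paper determines them over 100 frames because ‘r’ varies during encoding.

```python
def region_fractions(p, r):
    """Fractions of the (2p+1)^2 search points that fall in region 1 and
    region 2 for a fixed range parameter r (geometric assumption only)."""
    total = (2 * p + 1) ** 2
    c1 = (2 * r + 1) ** 2 / total
    return c1, 1.0 - c1


def case2_energy(p1, p2, c1, c2, t_clk):
    """Dynamic energy per calculation, (P1*C1 + P2*C2)*T, where p1 and p2 are
    the average powers of the arrays at 1.2V and at the scaled voltage,
    c1 and c2 the fractions of calculations on each array, and t_clk the
    clock period in seconds."""
    return (p1 * c1 + p2 * c2) * t_clk


T_CLK = 1.0 / 125e6  # clock period at the 125 MHz clock frequency (8 ns)
```

For p = 7 and r = 2, for example, region 1 holds 25 of the 225 search points, so about 11% of the SAD calculations run on the array at 1.2V.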

In the energy savings we quote, we account only for the energy savings in the datapath architecture, as it accounts for most (about 75%) of the processing involved in the control logic, datapath architecture and address generation unit of the motion estimation architecture in the case of the full search algorithm [9]; we do not consider the energy expended by memory.

2) Motion Vector Distribution

We analyzed the motion vector distribution for some standard video sequences characterized by different motion types. The test video sequences used are i) Susie, which has minor facial, head and shoulder movements, as in most news videos, ii) Mobile Calendar, which has both horizontally and vertically moving parts at a medium pace, e.g., a calendar moves upwards relative to the camera while a toy train moves from left to right in the same frame, iii) Flower Garden, which has only horizontal movement but faster motion than Mobile Calendar, and iv) Stefan, a sports video sequence with fast movement of the players on a court. Table 1 shows the percentage of motion vectors for these video sequences found within (±r, ±r), where r is varied from 0 to 2. The analysis in Table 1 shows that a significant percentage of motion vectors lie within a range of (±2, ±2). Even sports video sequences such as Stefan, which are characterized by fast motion, exhibit this characteristic. This analysis validates the idea behind the error correction scheme proposed in Section IV.B.

3) Effect of PCMOS-based architecture and error correction scheme on motion estimation

Table 2 shows the energy savings possible in the datapath architecture with Case 1 and Case 2 when degradation is constrained to be within 0.5 dB. For Case 1, the voltage of the architecture is scaled down for the different video sequences while the degradation in quality remains within the 0.5 dB target. The achievable voltage scaling differs between video sequences because of their different motion content. For implementation, however, an on the fly decision on the circuit supply voltage for different video sequences may not be practical, so we select the highest value amongst the scaled voltage values for the different video sequences. For Case 2, we fix the supply voltages of the two parallel Systolic Arrays (SA1 and SA2) at 1.2V and 0.55V. The quality degradation constraint of 0.5 dB is achieved in Case 2 by varying the range parameter ‘r’ (Fig. 12) according to QP. In the MPEG-2 video codec [54],


QP is different for each macro block in a frame. The video codec provides QP averaged over the macro blocks for each frame in the video sequence [54]. As mentioned earlier in Section IV.B, in our proposed error correction scheme we have to set a threshold for QP in order to determine whether the range parameter must be increased or decreased. We use the frame QP averaged over 100 frames of the video sequence, calculated for the error-free run of the video codec, which we term QPmean. Note that the average QP value is taken from the video codec and does not require any extra calculations. We multiply QPmean by the number of macro-blocks in the frame, in this case 396 (QPmean_scaled = 396 × QPmean). We set QPTH, η1 and η2 to QPmean_scaled, 1.0200 and 1.0065 respectively. The upper limit on ‘r’, i.e., ‘r1’, is set to 4, and the range parameter cannot cross this upper limit.

Table 1: Motion Vector Distribution

MV                 (0,0)     (±1,±1)   (±2,±2)
Susie              59.13%    87.69%    91.49%
Mobile Calendar    26.7%     96.54%    97.55%
Flower Garden      7.5%      54.08%    79.02%
Stefan             36.32%    48.12%    54.79%
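The statistic reported in Table 1 is the cumulative fraction of motion vectors whose components both lie within ±r. A minimal sketch of that computation is shown below; the function name `fraction_within` is hypothetical and any motion vector list passed to it would come from an error-free run of the motion estimator.

```python
def fraction_within(mvs, r):
    """Fraction of motion vectors (dx, dy) with |dx| <= r and |dy| <= r.
    r = 0 gives the share of zero motion vectors, i.e. the (0,0) column."""
    inside = sum(1 for dx, dy in mvs if abs(dx) <= r and abs(dy) <= r)
    return inside / len(mvs)
```

Since the region for r is contained in the region for r+1, the values in each row of Table 1 can only grow from left to right.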

Fig. 21 illustrates the improvement in average PSNR and visual quality of the motion compensated frames when the energy consumption of Case 1 and Case 2 is approximately the same. The architecture in Case 1 is maintained at a supply voltage of 0.65V for the comparison, and the two parallel architectures for Case 2 are maintained at 1.2V and 0.55V. Both cases account for the same energy savings of approximately 57% in Fig. 21. Fig. 22 shows motion vectors plotted over the frame they have been calculated for, namely, (a) the Base case, (b) Case 1 and (c) Case 2. The motion vectors in Fig. 22 have been scaled larger than their actual size for visibility. The randomness in the motion vectors for Case 1 at the same energy savings as that of Case 2 is apparent in the figure. The errors that Case 1 and Case 2 make in the calculation of motion vectors, in comparison with the Base case, are also visible in the figure.

B. Simulation Results for PCMOS based TSS

1) Energy and Error Estimation through HSPICE

To estimate the energy consumption of the PCMOS based datapath architecture, we develop a transistor level netlist for the architecture shown in Fig. 7 for TSS, and for Fig. 7 with the modifications suggested in Fig. 15 and Fig. 17 for MC-TSS. The architecture processes 25 fps for a frame size of 352×288. The gate level details of the constituent blocks of the architecture shown in Fig. 7, namely AD, A, ACC and MIN, are shown in Fig. 23. The gates used in these blocks are Full Adder (FA), D-flip-flop (FF), AND, EXOR and Inverter. In addition to these gates, OR and MUX gates are required by the architecture for MC-TSS. HSPICE simulations are carried out for the architecture for energy and error estimation, using the Synopsys 90 nm generic library. Noise modeling is done only for the Full Adders and D-flip-flops, as most of the processing occurs through these gates and they account for about 85-90% of the total energy consumption of the architecture. All the Full Adders in the circuits are designed using a 24-transistor mirror adder circuit [4], and the D-flip flops are designed using the edge-triggered D-flip flop circuit provided in [9].

[Fig. 21: Motion compensated frames for three different video sequences, (a) Susie, (b) Mobile Calendar and (c) Flower Garden, for Case 1 (left) and Case 2 (right) at approximately the same energy savings of 57%. PSNR for Case 1 / Case 2: (a) 30.08 dB / 35.23 dB, (b) 20.92 dB / 23.36 dB, (c) 23.13 dB / 25.22 dB.]

[Fig. 22: Motion vectors for video sequence Susie for (a) Base Case at 1.2V (no energy savings), (b) Case 1 at 0.65V (57% energy savings), and (c) Case 2 at 1.2V, 0.55V (56% energy savings)]

Table 2: Energy savings using PCMOS based implementation with PSNR loss less than 0.5 dB
(Base Case: no energy savings. Case 1: FSBMA. Case 2: FSBMA + Error Correction Scheme. PSNR values are averages in dB; supply voltages are in V.)

Video Sequence     Base PSNR   Case 1 PSNR   Case 1 Energy Savings   Case 1 Supply Voltage   Case 2 PSNR   Case 2 Energy Savings   Case 2 Average ‘r’   Case 2 Supply Voltage (SA1, SA2)
Susie              35.74       35.35         34%                     0.95                    35.23         56%                     2.73                 1.2, 0.55
Mobile Calendar    23.82       23.44         44%                     0.85                    23.36         57%                     2.18                 1.2, 0.55
Flower Garden      25.69       25.37         44%                     0.85                    25.22         57%                     2.42                 1.2, 0.55
Stefan             25.64       25.27         44%                     0.85                    25.12         52%                     4.00                 1.2, 0.55


The noise RMS (found empirically to be 0.2V) is chosen such that no errors are observed at the output of the gates at the nominal supply voltage of 1.2V. The supply voltage is then scaled down from 1.15V to 0.7V in steps of 0.05V. The voltage is not scaled any further for TSS, as the lowest voltage that kept the PSNR degradation below 0.5 dB was 0.85V. The range of voltages used does not violate the processing speed required of the tree architecture used for three-step search. Energy consumption is calculated by simulating the circuit of the architecture at the lowest scaled voltage that limits the PSNR degradation to 0.5 dB. The inputs to the architecture netlists are test vectors from standard video sequences, which are also used to test the performance of motion estimation. The HSPICE simulations required for error estimation are carried out by simulating the unique three stage models of the Full Adders and D-flip-flops in the architecture over the required voltage range; four unique three-stage configurations were found for Full Adders, and three (for TSS) or four (for MC-TSS) for D-flip flops. The unique three stage configurations present in the architecture are shown in Fig. 24. In Fig. 24(4), LOAD corresponds to all the gates the inverter is attached to in the actual circuit, which are shown in the blocks of the tree architecture in Fig. 23. As the number of gates that comprise the LOAD is large, they are not shown again in Fig. 24 due to space constraints.

2) Effect of PCMOS based Architecture and MC-TSS on Motion Estimation

The possible energy savings for TSS and MC-TSS with voltage scaling when the quality reduction is limited to 0.5 dB are tabulated in Table 3. In Table 3, Case 1 and Case 2 correspond to the PCMOS based architecture modeled for TSS and MC-TSS respectively. The video codec used for simulation is the MPEG-2 encoder [54].
The search area size is limited to ±7 for TSS, which makes it an appropriate algorithm for applications such as video conferencing that need faster processing and in which the movement of objects in the video is small. Thus, the video sequences considered for the results in Table 3 are slow and medium motion sequences in which the movement of objects is mostly limited to the search area of ±7. For this reason the video sequence Stefan, a fast-motion sequence that was used for PCMOS based FSBMA, is not used to show results for TSS. Instead, the video sequence Foreman, with head, lip and shoulder movement, is used for testing the performance. The circuit supply voltages corresponding to the possible energy savings with these video sequences are also provided in Table 3. The least voltage value that can be used varies across the video sequences because their motion characteristics differ. A common voltage can be selected depending on the type of videos the video compression unit will be required to process. In Table 3, however, we show the least voltage value that can be achieved for every video sequence with our PCMOS based architecture for TSS as well as MC-TSS, so that the differences in the performance of the two algorithms are more evident.

It can be seen from the results in Table 3 that the decrease in PSNR is small compared to the increase in energy savings. This is because the SAD is evaluated as a summation of 256 differences for a block size of 16. Probabilistic bit computation affects only a small number of these differences and, hence, the overall effect of incorrect computations is minimized. Note, however, that in the case of MC-TSS the number of differences per SAD computation is halved to 128 (the block size is still 16). We surmise that the main reason MC-TSS does better than TSS is that keeping three winners for each step increases the percentage of errors that MC-TSS can tolerate relative to TSS.
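To make the SAD argument concrete, the following sketch computes a full 256-term SAD and a 128-term SAD for a 16x16 block, and then perturbs a single difference to emulate one erroneous computation. It is only an illustration: the checkerboard sampling used for the halved SAD and the test blocks are assumptions, since the sampling pattern is not specified in this section.

```python
def sad(cur, ref):
    """Full SAD: one absolute difference per pixel (256 terms for 16x16)."""
    return sum(abs(c - r)
               for cur_row, ref_row in zip(cur, ref)
               for c, r in zip(cur_row, ref_row))


def sad_half(cur, ref):
    """SAD over every other pixel (128 terms for 16x16).

    ASSUMPTION: the checkerboard sampling pattern is chosen for illustration.
    """
    return sum(abs(cur[y][x] - ref[y][x])
               for y in range(len(cur))
               for x in range(len(cur[0]))
               if (x + y) % 2 == 0)


N = 16
# Hypothetical 8-bit test blocks (not taken from the paper's sequences).
cur = [[(7 * x + 13 * y) % 256 for x in range(N)] for y in range(N)]
ref = [[(5 * x + 11 * y + 3) % 256 for x in range(N)] for y in range(N)]

exact = sad(cur, ref)
# Emulate one erroneous difference by flipping bit 3 (weight 8) of one term.
diffs = [abs(c - r) for cur_row, ref_row in zip(cur, ref)
         for c, r in zip(cur_row, ref_row)]
diffs[0] ^= 8
perturbed = sum(diffs)

print("terms in full SAD:", len(diffs))
print("exact SAD:", exact, "perturbed SAD:", perturbed)
print("relative change: %.4f" % (abs(perturbed - exact) / exact))
```

A single corrupted term shifts the sum by at most the weight of the flipped bit, which is small relative to a sum over hundreds of terms unless the error lands in a high-order bit.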

Fig. 23: Gate level details of (a) AD (Absolute Difference) block (b) A (Adder) block (c) ACC (Accumulator) block (d) MIN (Comparator) block

Fig. 24: Unique three-stage configurations of Full Adders (1-4) and flip-flops (5-8) in the tree architecture circuit for TSS and MC-TSS algorithms

2) Working of MC-TSS and Comparison with TSS

MC-TSS is found to outperform TSS because, in MC-TSS, if an incorrect comparator decision caused by erroneous computation in the PCMOS based architecture leads to an incorrect selection of the winner candidate for a given step, it is quite likely that the correct winner candidate will be captured by the next best candidates. Table 4 shows the percentage of times the correct winner candidate is among the two, three and four best candidates at the operating voltages from Table 3 required for the 0.5 dB constraint on quality degradation. It can be seen from Table 4 that increasing the number of winner candidates of MC-TSS to four makes hardly any difference over keeping three: the chance of finding the correct winner candidate among the best four rather than the best three increases by only about 0.8 to 2.14 percentage points, and this trend continues for lower voltage values as well. Thus, keeping three winner candidates for every step in MC-TSS is an apt choice. Fig. 25 shows the improvement in visual quality when the energy consumption of the architectures for TSS and MC-TSS is approximately the same. Notice that in Fig. 25 the distortion is mostly in the moving parts of the frame. This is because incorrect computation of SAD values affects the decision for the moving parts of the frame more than for the stationary ones.
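The idea of carrying several winners forward can be sketched as a three-step search that keeps the `keep` best candidates after every step. This is a hedged illustration rather than the paper's exact MC-TSS: the function names, the way the candidates of all winners are pooled, and the noisy cost function in the example are assumptions. With `keep=1` the sketch reduces to a conventional three-step search.

```python
def block_sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n x n block at (bx, by) in cur and the block displaced by (dx, dy) in ref."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total


def multi_candidate_tss(cur, ref, bx, by, n=16, keep=3, cost=block_sad):
    """Three-step search (steps 4, 2, 1, so at most +/-7) keeping `keep` winners per step.

    ASSUMPTION: candidates around all current winners are pooled and the
    `keep` cheapest ones survive; the paper's MC-TSS may differ in detail.
    """
    height, width = len(ref), len(ref[0])
    winners = [(0, 0)]
    for step in (4, 2, 1):
        scored = {}
        for cx, cy in winners:
            for dy in (-step, 0, step):
                for dx in (-step, 0, step):
                    mx, my = cx + dx, cy + dy
                    inside = (0 <= bx + mx and bx + mx + n <= width and
                              0 <= by + my and by + my + n <= height)
                    if inside and (mx, my) not in scored:
                        scored[(mx, my)] = cost(cur, ref, bx, by, mx, my, n)
        # Stable sort: ties are resolved by evaluation order.
        winners = sorted(scored, key=scored.get)[:keep]
    return winners[0]


def noisy_cost(cur, ref, bx, by, dx, dy, n):
    """Hypothetical cost surface with its minimum at (6, -6) and one corrupted value."""
    if (dx, dy) == (4, -4):
        return 45  # emulates an erroneous SAD for the correct first-step winner
    return (dx - 6) ** 2 + (dy + 6) ** 2


blank = [[0] * 48 for _ in range(48)]
print("single winner :", multi_candidate_tss(blank, blank, 16, 16, keep=1, cost=noisy_cost))
print("three winners :", multi_candidate_tss(blank, blank, 16, 16, keep=3, cost=noisy_cost))
```

In this toy setup the corrupted cost makes the single-winner search discard the correct first-step candidate and end at (3, -6), while the three-winner search still carries that candidate forward and reaches (6, -6).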

Fig. 25: TSS (left) and MC-TSS (right) at approximately the same energy savings for video sequence (a) Susie (55% savings) (b) Mobile Calendar (70% savings) (c) Flower Garden (70% savings). Frame PSNR labels (TSS, MC-TSS): (a) 34.36 dB, 35.43 dB; (b) 21.73 dB, 23.21 dB; (c) 23.21 dB, 25.02 dB.

Fig. 26 shows the variation of average PSNR with the operating supply voltage of the architecture for four example video sequences. The decrease in PSNR as errors increase under voltage scaling is steeper for TSS than for MC-TSS. From Fig. 26, MC-TSS can be characterized as a more stable algorithm than TSS in handling errors from the underlying architecture. Fig. 27 is similar to Fig. 22 shown for FSBMA; it helps to visualize the errors in the calculated motion vectors for TSS and MC-TSS coupled with a PCMOS based architecture, compared with the error-free Base Case architecture for the TSS algorithm at 1.2 V.

Fig. 26: PSNR vs. voltage scaling for (a) Mobile, (b) Flower Garden, (c) Susie and (d) Foreman

Fig. 27: Motion vectors for video sequence Susie for (a) Base Case at 1.2 V (no energy savings), (b) TSS at 0.95 V (57% energy savings), (c) MC-TSS at 0.95 V (55% energy savings)

VI. DISCUSSION

In this paper we have focused on traditional motion estimation algorithms with a known number of arithmetic operations per block. Faster motion estimation algorithms have recently been proposed with an unknown number of operations per block; a probabilistic computing architecture for such algorithms would require a much more complex controller design, which we have not investigated. The design of PCMOS controllers is currently an open and unsolved area of research.

VII. CONCLUSION

The work in this paper established PCMOS based probabilistic computing as an effective means to achieve low power computing for motion estimation. Algorithmic modifications such as the error correction scheme proposed in this paper increase the possible energy savings with PCMOS based probabilistic computing by 12% in the case of FSBMA and 15% in the case of TSS. Furthermore, when compared at the same energy savings with a PCMOS based implementation alone, up to 5 dB and 1.8 dB improvement in PSNR is seen using the error correction scheme with the FSBMA and MC-TSS algorithms respectively. Thus, algorithmic modifications such as the motion vector statistics based error correction scheme and MC-TSS proposed in this paper can further enhance the energy savings possible with probabilistic computing.

Table 3: Energy savings for TSS, MC-TSS with PSNR loss less than 0.5 dB

| Video Sequence | Base Case (TSS), no energy savings: Avg. PSNR (dB) | Case 1 (TSS + PCMOS): Avg. PSNR (dB) | Case 1: Energy Savings | Case 1: Circuit Supply Voltage (V) | Case 2 (MC-TSS + PCMOS): Avg. PSNR (dB) | Case 2: Energy Savings | Case 2: Circuit Supply Voltage (V) |
|---|---|---|---|---|---|---|---|
| Susie | 35.64 | 35.21 | 40% | 1.05 | 35.43 | 55% | 0.95 |
| Mobile Calendar | 23.72 | 23.23 | 57% | 0.95 | 23.36 | 70% | 0.85 |
| Flower Garden | 25.2 | 24.74 | 57% | 0.95 | 25.02 | 70% | 0.85 |
| Foreman | 31.3 | 31.035 | 49% | 1.00 | 30.89 | 64% | 0.90 |
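As a quick arithmetic cross-check of Table 3, the short script below recomputes the PSNR loss of each case relative to the Base Case and the additional energy savings of MC-TSS over TSS. The values are transcribed from the table; the variable names are illustrative only.

```python
# (sequence, base PSNR, TSS PSNR, TSS savings %, MC-TSS PSNR, MC-TSS savings %)
TABLE3 = [
    ("Susie", 35.64, 35.21, 40, 35.43, 55),
    ("Mobile Calendar", 23.72, 23.23, 57, 23.36, 70),
    ("Flower Garden", 25.2, 24.74, 57, 25.02, 70),
    ("Foreman", 31.3, 31.035, 49, 30.89, 64),
]

losses = []
extra_savings = []
for name, base, psnr_tss, save_tss, psnr_mc, save_mc in TABLE3:
    loss_tss = base - psnr_tss   # PSNR loss of Case 1 (dB)
    loss_mc = base - psnr_mc     # PSNR loss of Case 2 (dB)
    losses.extend([loss_tss, loss_mc])
    extra_savings.append(save_mc - save_tss)
    print(f"{name}: TSS loss {loss_tss:.3f} dB, MC-TSS loss {loss_mc:.3f} dB, "
          f"extra savings {save_mc - save_tss} points")

print("largest PSNR loss (dB): %.3f" % max(losses))
```

Every entry stays within the 0.5 dB budget, and MC-TSS adds 13 to 15 percentage points of energy savings over TSS for these sequences.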


Table 4: Percentage of the times the correct winner candidate is present amongst 2, 3, or 4 best candidates

| Video Sequence | Circuit Voltage (V) | Best-2 | Best-3 | Best-4 |
|---|---|---|---|---|
| Susie | 0.95 | 93.2% | 98% | 98.8% |
| Mobile Calendar | 0.85 | 89% | 96% | 97.15% |
| Flower Garden | 0.85 | 87.6% | 94.21% | 96.35% |
| Foreman | 0.90 | 92.47% | 97.26% | 98% |
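A statistic of the kind reported in Table 4 could be gathered as sketched below: for each search step, check whether the candidate with the lowest error-free cost is among the k best candidates according to the error-prone costs. The function name `best_k_hit_rate` and the toy cost lists are hypothetical and are not the paper's measurement code or data.

```python
def best_k_hit_rate(true_costs, noisy_costs, k):
    """Fraction of trials in which the true winner is among the k best noisy candidates."""
    hits = 0
    for true_row, noisy_row in zip(true_costs, noisy_costs):
        # Index of the candidate that would win with error-free arithmetic.
        true_winner = min(range(len(true_row)), key=true_row.__getitem__)
        # Candidate indices ranked by the error-prone costs, cheapest first.
        ranked = sorted(range(len(noisy_row)), key=noisy_row.__getitem__)
        if true_winner in ranked[:k]:
            hits += 1
    return hits / len(true_costs)


# Three hypothetical trials with four candidates each.
true_costs = [[10, 20, 30, 40], [5, 9, 7, 8], [3, 1, 2, 4]]
noisy_costs = [[10, 20, 30, 40], [9, 5, 7, 8], [3, 2.5, 2, 4]]

for k in (1, 2, 3, 4):
    print(f"best-{k} hit rate: {best_k_hit_rate(true_costs, noisy_costs, k):.2f}")
```

The hit rate can only stay equal or grow with k, so the useful question is where it saturates, which is how the choice of three winner candidates is justified above.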

REFERENCES

[1] P. Kuhn, “Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation,” Boston, MA: Kluwer, 1999.
[2] S. Borkar, “Design Perspectives on 22nm CMOS and Beyond,” Proceedings of 46th ACM/IEEE Design Automation Conference, pp. 93-94, July 2009.
[3] L. B. Kish, “End of Moore’s law: Thermal (Noise) Death of Integration in Micro and Nano Electronics,” Physics Letters A, vol. 305, no. 3-4, pp. 158-168, 2002.
[4] K. Natori and N. Sano, “Scaling Limit of Digital Circuits Due to Thermal Noise,” Journal of Applied Physics, vol. 83, pp. 5019-5024, 1998.
[5] N. Sano, “Increasing Importance of Electronic Thermal Noise in Sub-0.1μm Si-MOSFETs,” The IEICE Transactions on Electronics, vol. E83-C, pp. 1203-1211, 2000.
[6] C. Dhoot, V. J. Mooney, L. P. Chau, and S. R. Chowdhury, “Low Power Motion Estimation with Probabilistic Computing,” International Symposium on Very Large Scale Integration (VLSI ’11), July 2011.
[7] C. Dhoot, V. J. Mooney, S. R. Chowdhury, and L. P. Chau, “Fault Tolerant Design for Low Power Hierarchical Search Motion Estimation Algorithms,” Proceedings of the IFIP Working Group 10.5 Very Large Scale Integration System-on-a-Chip (VLSI-SoC’11), Oct. 2011.
[8] M. A. Elgamel, A. M. Shams, and M. A. Bayoumi, “A Comparative Analysis of Low Power Motion Estimation VLSI Architectures,” in Proceedings of IEEE Workshop Signal Processing, pp. 149-158, Oct. 2000.
[9] R. S. Richmond II and D. S. Ha, “A Low-power Motion Estimation Block for Low Bit-rate Wireless Video,” Proceedings of IEEE Workshop Signal Processing, pp. 60-63, Aug. 2001.
[10] S. S. Lin, P. C. Tseng, and L. G. Chen, “Low-power Parallel Tree Architecture for Full Search Block-matching Motion Estimation,” Proceedings of the 2004 International Symposium on Circuits and Systems (ISCAS’04), vol. 2, pp. II-313-316, May 2004.
[11] T. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishiguro, “Motion Compensated Interframe Coding for Video-Conferencing,” Proc. of National Telecommunications Conference, New Orleans, pp. G5.3.1-5.3.5, Nov. 1981.
[12] F. Dufaux and F. Moscheni, “Motion Estimation Techniques for Digital TV: A Review and a New Contribution,” Proceedings of the IEEE, vol. 83, no. 6, pp. 858-876, Jun. 1995.
[13] P.-C. Tseng, Y. Chang, Y. Huang, H. Fang, C. Huang, and L. Chen, “Advances in Hardware Architectures for Image and Video Coding: A Survey,” Proceedings of the IEEE, vol. 93, no. 1, pp. 184-197, Jan. 2005.
[14] M. Ghanbari, “The Cross-search Algorithm for Motion Estimation,” IEEE Transactions on Communications, vol. 38, no. 7, pp. 950-953, Jul. 1990.
[15] J. Minocha and N. R. Shanbhag, “A low power data-adaptive motion estimation algorithm,” Proceedings of IEEE Workshop on Multimedia Signal Processing, pp. 685-690, Sep. 1999.
[16] Y. L. Chan and W. C. Siu, “New adaptive pixel decimation for block motion vector estimation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 1, pp. 113-118, Feb. 1996.
[17] B. Zeng, R. Li, and M. L. Liou, “Optimization of fast block motion estimation algorithms,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, no. 6, pp. 833-844, Dec. 1997.
[18] K. Sauer and B. Schwartz, “Efficient motion estimation using integral projections,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 5, pp. 513-518, Dec. 1996.
[19] S. Kim, Y. Kim, K. Yim, H. Chung, K. Choi, Y. Kim, and G. Jung, “A fast motion estimator for real-time systems,” IEEE Transactions on Consumer Electronics, vol. 43, no. 1, pp. 24-33, Feb. 1997.
[20] S. Dutta and W. Wolf, “A flexible parallel architecture adapted to block matching motion estimation algorithms,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 1, pp. 74-86, Feb. 1996.
[21] P. Lakamsani, “An architecture for enhanced three step search generalized for hierarchical motion estimation algorithms,” IEEE Transactions on Consumer Electronics, vol. 43, no. 2, pp. 221-227, May 1997.
[22] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De, “Parameter variations and impact on circuits and microarchitecture,” Proceedings of Design Automation Conference (DAC’03), pp. 338-342, June 2003.
[23] R. Hegde and N. R. Shanbhag, “Soft Digital Signal Processing,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 9, no. 6, pp. 813-823, Dec. 2001.
[24] H. Y. Cheong, I. S. Chong, and A. Ortega, “Computation Error Tolerance in Motion Estimation Algorithms,” IEEE International Conference on Image Processing (ICIP’06), pp. 3289-3292, Oct. 2006.
[25] G. V. Varatkar and N. R. Shanbhag, “Error-Resilient Motion Estimation Architecture,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 10, pp. 1399-1412, Oct. 2008.
[26] D. Mohapatra, G. Karakonstantis, and K. Roy, “Significance driven computation: A voltage-scalable, variation-aware, quality-tuning motion estimator,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED’09), pp. 195-200, Aug. 2009.
[27] D. Mohapatra, G. Karakonstantis, and K. Roy, “Low-power Process-Variation Tolerant Arithmetic Units Using Input-Based Elastic Clocking,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED’07), pp. 74-79, Aug. 2007.
[28] S. Ghosh, S. Bhunia, and K. Roy, “A New Paradigm for Low Power, Variation Tolerant Circuit Synthesis Using Critical Path Isolation,” IEEE/ACM International Conference on Computer-Aided Design (ICCAD’06), pp. 619-624, Nov. 2006.
[29] D. Mohapatra, V. K. Chippa, A. Raghunathan, and K. Roy, “Design of Voltage Scalable Meta-Functions for Approximate Computing,” Design, Automation and Test in Europe (DATE’11), pp. 1-6, Mar. 2011.
[30] R. Karri, K. Hogstedt, and A. Orailoglu, “Computer Aided Design of Fault-Tolerant VLSI Systems,” IEEE Design and Test of Computers, vol. 13, pp. 88-96, Aug. 2002.
[31] H. Y. Cheong and A. Ortega, “Motion Estimation Performance Models with Application to Hardware Error Tolerance,” in Proceedings of Visual Communication and Image Processing, pp. 1-12, Jan. 2007.
[32] L. N. Chakrapani, B. E. S. Akgul, S. Cheemalavagu, P. Korkmaz, K. V. Palem, and B. Seshasayee, “Ultra-Efficient (Embedded) SOC Architectures based on Probabilistic CMOS (PCMOS) Technology,” in Proceedings of Design, Automation and Test in Europe (DATE’06), pp. 1-6, Mar. 2006.
[33] K. Nepal, R. I. Bahar, J. Mundy, W. R. Patterson, and A. Zaslavsky, “Designing Logic Circuits for Probabilistic Computation in the Presence of Noise,” in Proceedings of 42nd Design Automation Conference (DAC’05), pp. 485-490, June 2005.
[34] L. N. Chakrapani and K. V. Palem, “A probabilistic boolean logic for energy efficient circuit and system design,” in Proceedings of 15th Asia South Pacific Design Automation Conference (ASP-DAC’10), pp. 628-635, Jan. 2010.
[35] L. Wang and N. R. Shanbhag, “Noise-tolerant Dynamic Circuit Design,” in Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS’99), vol. 1, pp. 549-552, Jul. 1999.
[36] R. Gonzalez, B. Gordon, and M. Horowitz, “Supply and threshold voltage scaling for low-power CMOS,” IEEE Journal of Solid-State Circuits, vol. 31, no. 3, pp. 395-400, Mar. 1999.
[37] S. H. Choi, B. C. Paul, and K. Roy, “Novel Sizing Algorithm for Yield Improvement under Process Variation in Nanometer Technology,” in Proceedings of 41st Design Automation Conference (DAC’04), pp. 454-459, Jul. 2004.
[38] J. George, B. Marr, B. E. S. Akgul, and K. V. Palem, “Probabilistic Arithmetic and Energy Efficient Embedded Signal Processing,” in Proceedings of the 2006 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES’06), pp. 158-168, Oct. 2006.
[39] I. S. Chong and A. Ortega, “Dynamic Voltage Scaling Algorithms for Power Constrained Motion Estimation,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’07), vol. 2, pp. 101-104, Apr. 2007.
[40] J. C. Chang and M. Pedram, “Energy Minimization Using Multiple Supply Voltages,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 5, no. 4, pp. 436-443, Dec. 1997.
[41] P. Korkmaz, B. E. S. Akgul, K. V. Palem, and L. N. Chakrapani, “Advocating noise as an agent for ultra-low energy computing: probabilistic complementary metal-oxide-semiconductor devices and their characteristics,” Japanese Journal of Applied Physics, vol. 45, no. 4, pp. 3307-3316, Apr. 2006.
[42] P. Korkmaz, B. E. S. Akgul, and K. V. Palem, “Energy, Performance and Probability Trade-offs for Energy-Efficient Probabilistic CMOS circuits,” IEEE Transactions on Circuits and Systems I, vol. 55, no. 8, pp. 2249-2262, Sep. 2008.
[43] A. Bhanu, M. S. K. Lau, K. V. Ling, V. J. Mooney III, and A. Singh, “A More Precise Model of Noise Based PCMOS Errors,” Fifth IEEE International Symposium on Electronic Design, Test & Applications (DELTA ’10), pp. 99-102, Jan. 2010.
[44] A. Singh, A. Basu, K. V. Ling, and V. Mooney, “Modeling Multioutput Filtering Effects in PCMOS,” Proceedings of the VLSI Design and Test Conference (VLSIDAT 2011), April 2011.
[45] C. Zhu, X. Lin, and L. P. Chau, “Hexagon based search pattern for fast block motion estimation,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 349-355, 2002.
[46] L. M. Po and W. C. Ma, “A novel four-step search algorithm for fast block motion estimation,” IEEE Transactions on Circuits and Systems for Video Technology, Jun. 1996.
[47] T. Komarek and P. Pirsch, “Array architectures for block matching algorithms,” IEEE Transactions on Circuits and Systems, vol. 36, no. 10, pp. 1301-1308, Oct. 1989.
[48] J. Y. Tham, S. Ranganath, M. Ranganath, and A. A. Kassim, “A novel unrestricted center-biased diamond search algorithm for block motion estimation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, pp. 369-377, Aug. 1998.
[49] R. Li, B. Zeng, and M. L. Liou, “A New Three-Step Search Algorithm for Block Motion Estimation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 4, no. 4, pp. 438-442, Aug. 1994.
[50] J. M. Rabaey, A. Chandrakasan, and B. Nikolić, Digital Integrated Circuits: A Design Perspective, 3rd ed. Prentice Hall, 2003.
[51] http://en.wikipedia.org/w/index.php?title=File:TSPC_FF_R.png License: Creative Commons Attribution 3.0, Contributors: Jon Guerber.
[52] R. Gonzalez, B. Gordon, and M. Horowitz, “Supply and Threshold Voltage Scaling for Low-Power CMOS,” IEEE Journal of Solid-State Circuits, vol. 31, no. 3, pp. 395-400, Mar. 1999.
[53] Kyoung-Hoi Koo, Jin-Ho Seo, Myeong-Lyong Ko, and Jae-Whui Kim, “A New Level-up Shifter for High Speed and Wide Range Interface in Ultra Deep Sub-Micron,” IEEE International Symposium on Circuits and Systems (ISCAS ’05), vol. 2, pp. 1063-1065, May 2005.
[54] MPEG-2 Test Model. http://www.mpeg.org/MPEG/MSSG/tm5

Charvi Dhoot received the B.Tech. and M.S. degrees in Electronics and Communication Engineering from IIIT, Hyderabad in 2010 and 2012 respectively. During the 2011-12 school year at IIIT, Hyderabad, she did her research on low power probabilistic computing based motion estimation architectures in collaboration with the Institute of Sustainable Nanoelectronics at Nanyang Technological University, Singapore. She is currently working as an Engineer in the RF department at Qualcomm India Pvt. Ltd., Hyderabad.

Vincent J. Mooney III (Senior Member, IEEE and Member, ACM) received the B.S. degree from Yale University in 1991, where he double majored in Electrical Engineering and Computer Science. He was a member of the 1989 Ivy League Championship football team for Yale and was one of 29 football players to be awarded the NCAA Postgraduate Scholarship upon his graduation in 1991. He received an M.S. degree in E.E. from Stanford University in 1994, an M.A. degree in Philosophy from Stanford in 1997, and the Ph.D. degree in E.E. from Stanford in June of 1998. He has worked at Bell Labs (Lucent), Allied Signal Aerospace VLSI Design Group, Hughes Network Systems, and Redwood Design Automation (acquired by Cadence). He is currently an Associate Professor in the School of Electrical and Computer Engineering and an Adjunct Associate Professor in the School of Computer Science, both at the Georgia Institute of Technology in Atlanta, GA. From August 2008 to July 2011 he was a Visiting Associate Professor at Nanyang Technological University, Singapore, where he served as Deputy Director of the Institute of Sustainable Nano-Electronics (ISNE). He is a recipient of the NSF Career Award. He has served as Program Chair of CASES and FPGAworld. He was General Chair of IFIP VLSI-SoC 2007. He has served as an Associate Editor of both the IEEE Transactions on VLSI and the ACM Transactions on Embedded Computing Systems. His research interests include computer-aided design of integrated circuits with a particular emphasis on hardware-software codesign, reconfigurable computing, power-aware and probabilistic architectures and circuits.

Lap-Pui Chau received the B.Eng. degree with first class honours in Electronic Engineering from Oxford Brookes University, England, and the Ph.D. degree in Electronic Engineering from Hong Kong Polytechnic University, Hong Kong, in 1992 and 1997, respectively. In June 1996, he joined Tritech Microelectronics as a senior engineer. In March 1997, he joined the Centre for Signal Processing, a national research centre at Nanyang Technological University, as a research fellow. He subsequently joined the School of Electrical & Electronic Engineering, Nanyang Technological University, as an assistant professor and is currently an associate professor. His research interests include fast signal processing algorithms, scalable video and video transcoding, robust video transmission, image representation for 3D content delivery, and image based human skeleton extraction. He has been involved in the organizing committees of international conferences including the IEEE International Conference on Image Processing (ICIP 2010, ICIP 2004) and the IEEE International Conference on Multimedia & Expo (ICME 2010). He is a Technical Program Co-Chair for Visual Communications and Image Processing (VCIP 2013) and the 2010 International Symposium on Intelligent Signal Processing and Communications Systems (ISPACS 2010). He was the chair of the Technical Committee on Circuits & Systems for Communications (TC-CASC) of the IEEE Circuits and Systems Society from 2010 to 2012, and the chairman of the IEEE Singapore Circuits and Systems Chapter from 2009 to 2010. He served as a member of the Singapore Digital Television Technical Committee from 1998 to 1999. He served as an associate editor for IEEE Transactions on Multimedia and IEEE Signal Processing Letters, and is currently serving as an associate editor for IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions on Broadcasting and the IEEE Circuits and Systems Society Newsletter. He is also an IEEE Distinguished Lecturer for 2009-2013 and a steering committee member of IEEE Transactions for Mobile Computing from 2011 to 2013.

Dr. Shubhajit Roy Chowdhury was born on August 27, 1981. He completed his Ph.D. at the Department of Electronics and Telecommunication Engineering, Jadavpur University in 2010. He is currently an Assistant Professor at the Centre for VLSI and Embedded Systems Technology, IIIT Hyderabad. Previously, he taught at Jadavpur University as a lecturer from 2006 to 2010. He is a member of the VLSI Society of India, and a life member of the Indian Statistical Institute, the Microelectronics Society of India and the Telemedicine Society of India. He is a member of the scientific, technical and editorial committee of the Engineering and Natural Sciences Division of the World Academy of Engineering, Science and Technology. He received university gold medals in 2004 and 2006 for his B.E. and M.E. respectively and the Altera Embedded Processor Designer Award in 2007, and has won four best paper awards. He was made a Fellow of the Society of Applied Biotechnology (FSAB) by the Society of Applied Biotechnology in 2012. He was also awarded the Young Engineers’ Award 2012-13 by the Institution of Engineers, India for his outstanding contribution in the field of Electronics and Telecommunication Engineering. He has published over sixty papers in international journals and conferences. He is a reviewer for IEEE Transactions on VLSI Systems, ACM Transactions on Design Automation of Electronic Systems, Journal of Medical Systems, Medical and Biological Engineering and Computing and other reputed journals. His research interests span the development of Biomedical Embedded Systems, VLSI architectures and ASIC design of intelligent signal processing circuits. He is keenly interested in the educational system and its necessary transformation.

Copyright (c) 2013 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].