IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 21, NO. 2, FEBRUARY 2011


Memory Bandwidth and Power Reduction Using Lossy Reference Frame Compression in Video Encoding

Ajit D. Gupte, Bharadwaj Amrutur, Mahesh M. Mehendale, Ajit V. Rao, and Madhukar Budagavi

Abstract—The large external memory bandwidth requirement leads to increased system power dissipation and cost in video coding applications. The majority of the external memory traffic in a video encoder is due to reference data accesses. We describe a lossy reference frame compression technique that can be used in video coding with minimal impact on quality while significantly reducing power and bandwidth requirements. The low cost, transformless compression technique uses a lossy reference for motion estimation to reduce memory traffic, and a lossless reference for motion compensation (MC) to avoid drift. Thus, it is compatible with all existing video standards. We calculate the quantization error bound and show that by storing the quantization error separately, the bandwidth overhead due to MC can be reduced significantly. The technique meets key requirements specific to the video encode application. Reductions of 24–39% in peak bandwidth and 23–31% in total average power consumption are observed for IBBP sequences.

Index Terms—Chroma pixels, DDR SDRAM, luma pixels, macroblock (MB), motion compensation (MC), motion estimation (ME), scalar quantization.

I. Introduction

IN VIDEO CODING, one or more decoded frames are used as reference frames. Reference frames are typically stored in an external memory such as double data rate synchronous dynamic random access memory (DDR-SDRAM). Accessing them results in large DDR bandwidth and power consumption. The energy spent in accessing a unit of data from the DDR can be as much as 50 times larger than accessing the same amount of data from on-chip static random access memories [1]. In multimedia applications, up to 50% of the total system power budget can be attributed to external memory accesses [2]. In the past, the focus has been on reducing DDR bandwidth in a video decoder system [3]. Compressing reference frames before storing them to the DDR can significantly reduce DDR traffic and power. While lossless compression would provide the best quality, the compression ratio that can be achieved is not fixed and is limited [4]. Further, the variable length coding associated with lossless compression increases the complexity of data access. Lossy compression can achieve a higher compression ratio even if it results in a small loss in quality [7], [9]. However, the traditional lossy compression approach can cause drift, since the encoder and decoder use nonidentical reference data.

Recently, video encoding has become an important feature of handheld devices such as mobile phones. In [5], a wavelet transform-based reference frame compression scheme was proposed, in which either lossy or lossless compression could be used. However, the complexity of efficiently accessing the compressed data was not considered. In [6], memory bandwidth optimizations based on data reuse and on-the-fly padding were proposed, but no reference frame compression was used. In [7], a lossy scalar quantization-based compression scheme [min-max scalar quantization (MMSQ)] was proposed to reduce encoder bandwidth. The compressed reference was stored in the DDR and was used for both motion estimation (ME) and motion compensation (MC). As MC used a lossy reference, the scheme is not compliant with the existing standards.

We propose a modification to this technique which is compliant with the existing video standards and, hence, avoids drift in a standard decoder. While we use the lossy reference for ME, MC is performed in a bit-accurate fashion. This is achieved by separately storing the quantization error. The impact of the compression scheme on ME quality is negligible. Also, the bit-accurate MC helps significantly in reducing the quality loss due to lossy compression at the same compression ratio.

In the next section, we analyze the DDR bandwidth requirement and the preferred qualities of a compression scheme in a video encoder. In Section III, we describe the proposed compression scheme. In Section IV, we analyze the video quality impact of the proposed algorithm. In Sections V and VI, we analyze the DDR bandwidth savings and system power reduction, respectively.

Manuscript received December 14, 2009; revised March 31, 2010 and June 2, 2010; accepted August 3, 2010. Date of publication January 13, 2011; date of current version March 2, 2011. This paper was recommended by Associate Editor G. G. (Chris) Lee. A. D. Gupte, M. M. Mehendale, and A. V. Rao are with Texas Instruments, Inc., Bengaluru, Karnataka 560043, India (e-mail: [email protected]). B. Amrutur is with the Indian Institute of Science, Bengaluru, Karnataka 560012, India. M. Budagavi is with Texas Instruments, Inc., Dallas, TX 75243 USA. Digital Object Identifier 10.1109/TCSVT.2011.2105599

II. DDR Bandwidth Requirement in Video Encode Application

A. Nature of DDR Traffic in Encoder

During ME, luma data from reference frames is fetched from the DDR. Most ME algorithms typically use a rectangular region called the "search window" within the reference

1051-8215/$26.00 © 2011 IEEE


Fig. 1. Sliding search window illustration.

frame to search for a good match with the current macroblock (MB). Search windows for neighboring MBs have a large overlap. Since MBs are coded in raster-scan fashion, horizontal overlap can be easily exploited. For example, Fig. 1 shows a "sliding search window" that can be created in on-chip memory. DDR bandwidth can be reduced further by exploiting search window overlap in the vertical direction; however, this requires large on-chip memory and hence is costly. We assume a sliding search window for our analysis. With a sliding search window, the traffic due to ME is proportional to the vertical search range.

Burst accesses from the DDR are much more efficient when data is accessed in a regular fashion, such as a rectangle. Also, the latency of access from the DDR is large. For efficient ME hardware utilization, the search region for a MB is fetched from the DDR and kept in local memory in advance. Hence, the rectangular search window described above is commonly used irrespective of the ME search pattern. For example, the full search method performs an exhaustive search, whereas algorithms such as UMHexagonS and EPZS [8] optimize the search by intelligently choosing positions in the search window, thereby reducing computations and local memory accesses. However, all these algorithms will still typically implement a rectangular search window in local memory.
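The sliding-window traffic argument above can be sketched numerically. The following is an illustrative model only (it ignores burst alignment and edge padding): with a sliding window, each new MB brings in just one MB-wide column of reference data, whose height is set by the vertical search range.

```python
def me_bytes_per_mb(vertical_search_range, mb_size=16):
    """New luma reference bytes fetched per macroblock with a sliding
    search window: only a fresh column of width mb_size pixels and
    height (2 * vertical_search_range + mb_size) enters the window,
    so ME traffic grows linearly with the vertical search range."""
    return mb_size * (2 * vertical_search_range + mb_size)

# Doubling the vertical range from +/-32 to +/-64 roughly doubles ME traffic.
print(me_bytes_per_mb(32), me_bytes_per_mb(64))
```

This makes concrete why the vertical search range, rather than the horizontal one, dominates the ME bandwidth once horizontal overlap is exploited.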

Fig. 2. Proposed ME bandwidth reduction using luma compression and error storage (MMSQ-EC). Only error data is fetched during MC.

Fig. 3. MMSQ-EC algorithm.

B. Constraints on the Compression Scheme

Reference frames are stored to and retrieved from the DDR in different orders. MBs in the current frame are processed in raster-scan fashion, while ME accesses reference data as vertically aligned MBs. Also, for MC, nondeterministic blocks of data that can span multiple MBs need to be read. The quality loss resulting from lossy compression should be small. In the next section, we describe the proposed scheme, which satisfies all these constraints.

III. Description of Proposed Scheme

In [7], reconstructed MBs undergo in-loop lossy compression using MMSQ (the MMSQ algorithm is described in Section III-A). The scheme provides a fixed compression ratio (e.g., 2:1) of reference frame data with small quality loss. We refer to this scheme as MMSQ-IL. In our proposed scheme (Fig. 2), the reference luma frame data is compressed using MMSQ. The compressed reference and the quantization error due to compression are stored to the DDR. For ME, the compressed reference is used, while for MC, only the error data is fetched from the DDR and is combined with the decompressed reference data already fetched for ME, to recreate bit-accurate reference pixels. This ensures standard compliance. The proposed method is referred to as MMSQ-EC.

A. MMSQ-EC Algorithm and Bound on Quantization Error

Fig. 3 explains the compression process for a block of pixels. For each block, the max and min values are found and stored as part of the compressed data (8 bits each). For each pixel, the difference between the pixel value and the min value (pelD) is quantized to n bits. The quantized pixels together with the max and min values form the compressed data block. For example, n = 3 gives a compression ratio of 2. We also calculate the quantization error "pelError" for each pixel. We show that the quantization error can be represented using a fixed number of bits, which is smaller than a full pixel. This keeps the DDR bandwidth needed for MC luma error accesses small and also enables random access. The bound on the quantization error is calculated as shown in Lemma 1 (Fig. 4). In practice, the error is slightly smaller than this bound. For example, with the maximum range of 256 and n = 3, the error is bounded as −16 ≤ pelError ≤ 15, as we found by exhaustive calculations.

A further optimization is made to limit the size of the error data compared to that shown in Lemma 1. If the error is too large, MMSQ-EC is replaced by truncation. For example, for n = 3, if range > 127, then the error size is 5 bits per pixel (b/p). This can be limited to 4 bits with truncation (Fig. 5). This not only reduces the MC bandwidth but also helps in efficient data packing in DDR memory. A 1-bit flag is maintained per 4 × 4 block to indicate the type of compression used.

Fig. 4. Bound on compression error.

Fig. 5. Use of truncation in case the MMSQ-EC error is too large.

Fig. 6. Rush-Hour PSNR versus bit rate at different b/p.
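The per-block MMSQ-EC procedure described in Section III-A (quantize pelD to n bits, keep the bounded error for MC, and fall back to truncation when range > 127) can be sketched as follows. This is our illustrative reading of the scheme, not the authors' exact hardware arithmetic: the quantizer step and reconstruction offset are assumptions, chosen so that the stated n = 3 error bound of −16 ≤ pelError ≤ 15 holds.

```python
def compress_block(block, n=3):
    """MMSQ-EC sketch for one 4 x 4 block of 8-bit pixels.
    Returns (mode_flag, lossy_reference, per_pixel_error); mode_flag is
    the 1-bit per-block flag distinguishing MMSQ from truncation."""
    lo, hi = min(block), max(block)
    rng = hi - lo
    if rng > 127:                                 # MMSQ error would need 5 b/p
        lossy = [p & 0xF0 for p in block]         # truncation: keep top 4 bits
        error = [p & 0x0F for p in block]         # 4-bit error per pixel
        return "truncate", lossy, error
    step = rng // (1 << n) + 1                    # split range into 2**n bins
    quant = [(p - lo) // step for p in block]     # the stored n-bit codes
    lossy = [lo + q * step + step // 2 for q in quant]  # decompressed reference
    error = [p - r for p, r in zip(block, lossy)]       # bounded, stored for MC
    return "mmsq", lossy, error

flat = [100, 103, 97, 110, 105, 99, 102, 101, 104, 98, 100, 107, 103, 96, 101, 105]
mode, lossy, error = compress_block(flat)
assert mode == "mmsq" and all(-16 <= e <= 15 for e in error)
# Bit-accurate MC: decompressed reference + stored error recreates each pixel.
assert [l + e for l, e in zip(lossy, error)] == flat
```

The final assertion is the key property of the scheme: ME sees only the lossy reference, while MC adds the stored error back and recovers every pixel exactly, which is what avoids drift.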

IV. Algorithm Performance

A. Experimental Setup

We carried out our experiments using the JM reference software [8]. The compression scheme as described was implemented in C and integrated with the JM reference code. A search range of ±32 was used for most experiments; a search range of up to ±64 was used for some experiments on sequences with high vertical motion. The "full search" and "EPZS" ME algorithms were used for quality comparisons. We used 15 different video sequences ranging from D1 (e.g., Coastguard and Mobile) to 1080p (e.g., Rush-Hour, Fish, and Pedestrian Area) resolution. Quantization parameter (QP) values of 20, 24, 28, 32, and 36 were used for all experiments. Due to lack of space, we report results using the Bjontegaard (BD) method [10] or for a specific QP value of 24 in this letter. At QP = 24, the video quality is reasonably good, and the differences between the algorithms are clearly brought out. Please refer to [21] for further details.

B. Varying the Compression Ratio and Block Size

The parameter n in Lemma 1 can be changed to get different compression ratios. Fig. 6 shows bit rate versus peak signal-to-noise ratio (PSNR) characteristics with different n values for the Rush-Hour sequence. Full search ME was used. With n = 3 or more, the performance is indistinguishable from the lossless reference performance. With full search ME, we observed an average bit rate loss of 0.74% over all the sequences for n = 3 at QP = 24. The peak bit rate loss at the same QP was 2.51%. The PSNR degradation at 3 b/p was negligible.

The bit rate increase due to the MMSQ-EC scheme is significantly smaller than that of the MMSQ-IL scheme reported in [7]. Since the MMSQ-IL scheme uses the lossy reference for MC as well, there is higher average residual energy left after MC, which leads to a larger bit rate. So, while [7] reported a 7.79% BD bit rate loss for the Mobile sequence at n = 4, MMSQ-EC results in a BD bit rate loss of only 0.17% at n = 4 and 0.55% at n = 3. Further, we observed that a compression block size of 4 × 4 pixels performed marginally better than other block sizes (we checked up to 16 × 16). Hence, we use n = 3 and block size = 4 × 4 for the rest of the experiments.

C. Comparison with Other Compression Schemes

Truncation and decimation schemes have been used in the past for reducing ME complexity [11], [12]. These schemes can also be used for reducing DDR bandwidth, since only part of the reference data needs to be accessed from the DDR. In this section, we compare the MMSQ-EC algorithm performance with decimation and truncation. To achieve the same bandwidth reduction as MMSQ-EC, decimation uses 2:1 subsampling, while truncation uses 4 bits for the lossy reference. Table I summarizes PSNR and bit rate results measured at a fixed QP value of 24 using full search ME. Since decimation approximates the ME cost function using only half the number of pixels, this scheme results in large degradation for sequences with high detail, such as Mobile Calendar, compared to MMSQ-EC. MMSQ-EC performs significantly better than truncation as well. The MMSQ-EC scheme shows less than 0.74% average bit rate increase compared to the lossless reference, with a peak bit rate loss of 2.51%; the average PSNR degradation with MMSQ-EC is negligible. With decimation, the average bit rate increase is 6.29%. Truncation shows an average bit rate increase of more than 50%. We also measured BDPSNR loss for QP ranging from 20 to 36 using both the full search and EPZS methods. For MMSQ-EC, the BDPSNR loss for full search and EPZS was 0.047 dB


and 0.065 dB, respectively. For decimation we got 0.34 dB and 0.56 dB BDPSNR loss, while for truncation, the loss was 1.27 dB and 1.25 dB, respectively.

V. Reduction in DDR Bandwidth

The additional MC bandwidth required for fetching error data reduces the net bandwidth saving due to MMSQ-EC. Optimizing MC data fetches for integer motion vectors [13] and caching them [14] can significantly reduce the DDR bandwidth. 2-D data organization reduces wasted traffic due to non-aligned accesses [15]. We used these optimizations in our bandwidth analysis. A simple "bounding-box" caching mechanism as described in [16] was assumed. A burst length of 4 was assumed for the 32-bit DDR, which means a minimum of 128-bit data accesses from the DDR. The reference frame data was stored in the DDR in units of 4 × 4 pixel blocks.

For calculating MC bandwidth, H.264 bit streams were generated using the JM reference software. The number of reference frames was restricted to one for "P" frames and two for "B" frames. For high-definition (HD) IPPP sequences, average MC traffic of 1.37 LMB-units was observed, while peak MC traffic of 1.93 LMB-units was observed for the SUNFL sequence with a search range of ±32 (1 LMB = 1 luma MB = 256 bytes). For HD IBBP sequences, average and peak MC traffic of 1.88 and 2.82 LMB-units, respectively, was observed. Percentage bandwidth savings for IBBP sequences are larger than for IPPP sequences.

Table II shows relative bandwidth improvements for IPPP and IBBP HD sequences with the MMSQ-EC scheme (peak MC bandwidth numbers were used). The bandwidth saving is about 17.81% for the IPPP sequences; for the IBBP sequences, it is higher at about 24.19%. The average bandwidth reduction is 31% for IBBP sequences and 23% for IPPP sequences (these are not shown in the tables but are calculated using the average MC bandwidth over all HD sequences). For video sequences with larger vertical motion, a larger vertical search range is necessary.
For example, we observed that for the SFISH sequence, the bit rate drops by more than 10% when the vertical search range is increased from 32 to 64 for IBBP coding. Several other sequences, including SFIRE and SFG, benefit significantly from a higher vertical search range. Fig. 7 illustrates the percentage bandwidth improvement as a function of the vertical search range for the H.264 standard for the SFISH IBBP sequence. At a vertical search range of ±64, the bandwidth saving with MMSQ-EC is more than 39%; the bandwidth dropped from 1015 Mbytes/s to 617 Mbytes/s in this case.

VI. Power Savings

A. DDR Memory Power Consumption

The power consumed in the DDR is a function of the currents drawn in the different functional or idle modes of operation [17]–[19], as shown in (1). The components P_WR and P_RD correspond to the power consumed by read/write burst accesses. The power consumed in opening and closing pages, P_ACT, depends upon the total data traffic and locality of

Fig. 7. Bandwidth savings as a function of search window height for the SFISH sequence.

accesses. P_ACT-STBY denotes the background power consumed by the DDR. P_ACT-PDN is the power consumed when the DDR is in a lower power state. P_R-DQ and P_W-DQ represent the power consumed in the DDR input/output (I/O)s and the video processor I/Os connected to the DDR memory, respectively; these two components are functions of the actual toggle count on the data buses and board-level termination. The DDR refresh power P_REF typically constitutes less than 5–10% of the total DDR power and is ignored in our analysis. We also ignore the potential power savings, possible with bandwidth reduction, from the additional idle time during which the DDR can be sent to a lower power state such as P_ACT-PDN:

P_Total = P_WR + P_RD + P_ACT + P_ACT-PDN + P_ACT-STBY + P_REF + P_R-DQ + P_W-DQ.    (1)
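Equation (1) is a simple additive model, which can be sketched directly. The component names below are ours; the values are taken from the IPPP no-compression column of Table II, with P_REF and P_ACT-PDN set to zero as in the analysis, and the single reported I/O figure assigned to the DQ terms as an assumption.

```python
def ddr_power_total(p):
    """Total DDR power per (1): sum of the burst, activation, power-down,
    standby, refresh, and I/O components (all values in mW)."""
    keys = ("WR", "RD", "ACT", "ACT_PDN", "ACT_STBY", "REF", "R_DQ", "W_DQ")
    return sum(p[k] for k in keys)

# IPPP, no compression (Table II); REF and ACT_PDN are ignored in the
# analysis, and the 12.20 mW I/O entry is lumped into the DQ terms here.
ippp = dict(WR=15.48, RD=57.26, ACT=2.37, ACT_PDN=0.0,
            ACT_STBY=12.88, REF=0.0, R_DQ=12.20, W_DQ=0.0)
print(round(ddr_power_total(ippp), 2))   # close to the 100.20 mW total
```

The sum lands within rounding distance of the 100.20 mW IPPP total reported in Table II, which is why only the WR, RD, ACT, and I/O terms change between the two columns of the table.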

We estimated the power savings due to MMSQ-EC assuming a Micron 32-bit 1 Gb low power double data rate memory [17], based on application notes [18], [19]. The read and write power was estimated using the bandwidth numbers presented in the previous section; the average MC bandwidth over all HD sequences was considered. For the purpose of calculating the number of page activations, the reference data was assumed to be mapped in tiled MB fashion in the DDR, as this reduces the number of page activations [20]. Luma error and chroma MC data were interleaved to avoid separate page activations. For deterministic data reads and writes, buffering and pre-fetching of page data (data is read/written in multiples of the DDR page size) was assumed, since it significantly reduces page activations as shown in [1]. The number of page activations for MC traffic was simulated for all the sequences. Fig. 8 shows the mapping of reference data to DDR pages.

Table II shows the estimated average DDR power numbers for IPPP and IBBP HD video sequences. Read/write power reduces linearly with bandwidth. The net reduction in page activations was negative due to the nondeterministic nature of MC traffic, leading to a slight increase in page activation power. DDR I/O power does not scale linearly with bandwidth because the compressed data is less correlated than the original reference, which results in a larger number of toggles per transaction on the DDR data bus.


TABLE I
ME Quality Comparison of MMSQ-EC Versus Lossless Reference, Decimation, and Truncation
(all numbers observed at a fixed QP of 24)

MMSQ-EC:
          Video       Bit Rate Increase from Lossless (%)   PSNR Loss from Lossless (dB)
Min       Riverbed    0.04                                  0
Max       Sunflower   2.51                                  0.06
Average               0.74                                  0.01

Truncation:
          Video       Bit Rate Increase from Lossless (%)   PSNR Loss from Lossless (dB)
Min       Riverbed    1.13                                  −0.02
Max       Sunflower   217.83                                0.09
Average               52.59                                 −0.015

Decimation:
          Video            Bit Rate Increase from Lossless (%)   PSNR Loss from Lossless (dB)
Min       Riverbed         0.06                                  0
Max       Mobile Calendar  32.93                                 0.12
Average                    6.29                                  0.05

TABLE II
Bandwidth and Power Savings with MMSQ-EC
(each cell: No Compression / MMSQ-EC)

Component         IPPP B.W. (Mbytes/s)   IPPP Power (mW)   IBBP B.W. (Mbytes/s)   IBBP Power (mW)
Write             94 / 94.48             15.48 / 15.56     31.33 / 31.49          5.16 / 5.19
Read              361.72 / 280.07        57.26 / 40.35     592.85 / 441.72        93.75 / 63.01
Standby           – / –                  12.88 / 12.88     – / –                  17.51 / 17.51
Activate          – / –                  2.37 / 2.82       – / –                  3.10 / 3.81
I/O               – / –                  12.20 / 11.09     – / –                  13.79 / 11.04
Compression h/w   – / –                  – / 1.11          – / –                  – / 1.35
Total             455.73 / 374.56        100.20 / 83.81    624.18 / 473.21        133.31 / 101.91
Saving (%)        17.81                  16.36             24.19                  23.56
Saving (mW)       –                      16.39             –                      31.40
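As a quick arithmetic cross-check of the IBBP power column of Table II (a sanity check we add here, not part of the original analysis), the component rows sum to the reported totals and saving:

```python
# IBBP power components in mW, read off Table II.
no_comp = {"Write": 5.16, "Read": 93.75, "Standby": 17.51,
           "Activate": 3.10, "I/O": 13.79}
mmsq_ec = {"Write": 5.19, "Read": 63.01, "Standby": 17.51,
           "Activate": 3.81, "I/O": 11.04, "Compression h/w": 1.35}
total_nc = sum(no_comp.values())          # 133.31 mW without compression
total_ec = sum(mmsq_ec.values())          # 101.91 mW with MMSQ-EC
saving_mw = total_nc - total_ec           # 31.40 mW, matching the table
saving_pct = 100 * saving_mw / total_nc   # about 23.6%
print(round(total_nc, 2), round(total_ec, 2), round(saving_mw, 2))
```

Note how only the Read, Activate, and I/O rows change materially between the columns; standby power is unaffected by the bandwidth reduction, which caps the achievable percentage saving.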

Fig. 8. DDR memory layout for reference data.

Fig. 9. MMSQ-EC compression hardware.

B. Power Consumed in the Compression Hardware

We estimated the power consumption of the MMSQ-EC compression and decompression hardware by carrying out power simulations on the synthesized register transfer level design using a 65 nm library. Figs. 9 and 10 show the compression and decompression hardware implementations. The 14K-gate design supports 1080p resolution at 30 f/s at 250 MHz. The total estimated power of the additional hardware was 1.11 mW for IPPP and 1.35 mW for IBBP sequences. As a result, the net power saving (averaged over HD sequences) with the MMSQ-EC scheme is about 16.36% for IPPP sequences and 23.56% for IBBP sequences. For high vertical motion sequences, the power saving percentages increase; for the SFISH sequence, the estimated net power saving is 31.66% when the vertical search range is ±64. Readers are referred to [1] for further details on bandwidth and power analysis.
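A back-of-envelope throughput check for the stated design point (1080p at 30 f/s, 250 MHz clock; the MB-aligned frame height of 1088 is our assumption) shows the cycle budget the 14K-gate design has per macroblock:

```python
# Cycle budget per macroblock for the compression hardware (illustrative).
mbs_per_frame = (1920 // 16) * (1088 // 16)    # 120 x 68 = 8160 macroblocks
mbs_per_second = mbs_per_frame * 30            # 1080p at 30 f/s
cycle_budget = 250_000_000 // mbs_per_second   # cycles available per MB
print(mbs_per_frame, cycle_budget)
```

Roughly a thousand cycles per macroblock is a comfortable budget for the simple per-pixel min/max, subtract, and shift operations that MMSQ-EC requires, consistent with the small gate count and the ~1 mW power estimate.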

Fig. 10. MMSQ-EC decompression hardware.

VII. Conclusion

A DDR bandwidth reduction technique for video encoders, using lossy reference frame compression based on compression error calculation, was presented. The fixed compression ratio, transform-free technique is low in complexity and is compliant with existing standards. The resulting quality loss is negligible. The algorithm showed significantly superior


performance compared to other lossy compression techniques such as truncation and decimation. The proposed technique gives 18% peak bandwidth reduction for IPPP sequences and 24% peak bandwidth reduction for IBBP sequences compared to encoding without reference frame compression. Net average power is reduced by 16.4% and 23.6% for IPPP and IBBP sequences, respectively. For sequences which benefit from a higher vertical search range, the bandwidth saving is 39% and the power saving is 32%.

Acknowledgment The authors would like to acknowledge H. Sanghvi of Texas Instruments, Inc., Bengaluru, Karnataka, India, for his support in MC bandwidth simulations.

References

[1] J. Trajkovic, A. V. Veidenbaum, and A. Kejariwal, "Improving SDRAM access energy efficiency for low power embedded systems," ACM Trans. Embedded Comput. Syst., vol. 7, no. 3, article 24, Apr. 2008.
[2] F. Catthoor, F. Franssen, S. Wuytack, L. Nachtergaele, and H. D. Man, "Global communication and memory optimizing transformations for low power signal processing systems," in Proc. Workshop VLSI Signal Process., Oct. 1994, pp. 178–187.
[3] T. Takizawa, J. Tajime, and H. Harasaki, "High performance and cost effective memory architecture for an HDTV decoder LSI," in Proc. Int. Conf. Acoust., Speech, Signal Process., vol. 4, Mar. 1999, pp. 1981–1984.
[4] L. Benini, D. Bruni, A. Macii, and E. Macii, "Memory energy minimization by data compression: Algorithms, architectures and implementation," IEEE Trans. Very Large Scale Integr. Syst., vol. 12, no. 3, pp. 255–268, Mar. 2004.
[5] C. Cheng, P. Tseng, C. Huang, and L. Chen, "Multi-mode embedded compression codec engine for power-aware video coding system," in Proc. IEEE Workshop Signal Process. Syst. Des. Implement., Nov. 2005, pp. 532–537.
[6] T.-C. Chen, S.-Y. Chien, Y.-W. Huang, C.-H. Tsai, C.-Y. Chen, T.-W. Chen, and L.-G. Chen, "Analysis and architecture design of an HDTV720p 30 frames/s H.264/AVC encoder," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 6, pp. 673–688, Jun. 2006.
[7] M. Budagavi and Z. Minhua, "Video coding using compressed reference frames," in Proc. Int. Conf. Acoust., Speech, Signal Process., Mar. 2008, pp. 1165–1168.
[8] JM Software Download. H.264/AVC Software Coordination [Online]. Available: http://iphome.hhi.de/suehring/tml
[9] S. Lei, "Compression and decompression of reference frames in a video decoder," U.S. Patent 6272180, Aug. 7, 2001.
[10] G. Bjontegaard, "Calculation of average PSNR differences between RD-curves," in Proc. 13th Meeting Video Coding Experts Group, Apr. 2001, document VCEG-M33.
[11] Z. L. He, C. Y. Tsui, K. K. Chan, and M. K. Liou, "Low power VLSI design for motion estimation using adaptive pixel truncation," IEEE Trans. Circuits Syst. Video Technol., vol. 10, no. 5, pp. 669–678, Aug. 2000.
[12] S. Pigeon and S. Coulombe, "Speeding up motion estimation in modern video encoders using approximate metrics and SIMD processors," in Proc. IEEE Symp. Ind. Electron. Appl., vol. 1, Oct. 2009, pp. 233–238.
[13] C. Y. Tsai, T.-C. Chen, T.-W. Chen, and L.-G. Chen, "Bandwidth optimized motion compensation hardware design for H.264/AVC HDTV decoder," in Proc. IEEE 48th Midwest Symp. Circuits Syst., Aug. 2005, pp. 1199–1202.
[14] J. H. Kim, G.-H. Hyun, and H.-J. Lee, "Cache organizations for H.264/AVC motion compensation," in Proc. 13th IEEE Int. Conf. Embedded Real-Time Comput. Syst. Appl., Aug. 2007, pp. 534–541.
[15] E. Jaspers and P. With, "Bandwidth reduction for video processing in consumer systems," IEEE Trans. Consumer Electron., vol. 47, no. 4, pp. 885–894, Nov. 2001.
[16] S. Nagori, "Statistically cycle optimized bounding box for high definition video coding," U.S. Patent 20070176938, Aug. 2, 2007.
[17] MT46H64M16/32LF: 1 Gb, 16/32 Bit Mobile DDR SDRAM [Online]. Available: http://www.micron.com
[18] Technical Note TN-47-04: Calculating Memory System Power for DDR2 [Online]. Available: http://www.micron.com
[19] Technical Note TN-46-12: Mobile DRAM Power-Saving Features and Power Calculations [Online]. Available: http://www.micron.com
[20] C. Wang, P. Chow, R. K. Sita, and P. L. Swan, "Tiled memory configuration for mapping video data and method thereof," U.S. Patent 7016418 B2, Mar. 21, 2006.
[21] A. D. Gupte, "Techniques for low power motion estimation in video encoders," Ph.D. thesis, Dept. Electric. Commun. Eng., Indian Institute of Science, Bengaluru, India, 2009.