
Memory Bandwidth Reduction in Video Coding Systems Through Context Adaptive Lossless Reference Frame Compression

Dieison Silveira, Gustavo Sanchez, Mateus Grellert, Vinicius Possani, Luciano Agostini
Group of Architectures and Integrated Circuits
Federal University of Pelotas
Pelotas, Brazil
{dssilveira, gfsanchez, mgdsilva, vnpossani, agostini}@inf.ufpel.edu.br

Abstract—This paper presents the Reference Frame Context Adaptive Variable-Length Compressor (RFCAVLC) for video coding systems. RFCAVLC aims to reduce the external memory bandwidth required by the motion estimation process. Six experiments were performed, all based on adaptations of the Huffman algorithm, and the best one achieved an average compression rate of more than 24% without any loss in quality for all targeted resolutions. This result is similar to those of the best solutions proposed in the literature, but ours is the only lossless one. The presented RFCAVLC splits the reference frames into 4x4 blocks and compresses these blocks using one of four static code tables in a context-adaptive way. An architecture that implements the encoder of the RFCAVLC solution was described in VHDL and synthesized to an Altera Stratix IV FPGA. The synthesis results indicate that this solution can be easily coupled to a complete video encoder with negligible hardware overhead and without compromising throughput.

Keywords—Video Coding; Motion Estimation; Bandwidth Reduction; Reference Frame Encoder

I. INTRODUCTION

Recent technological innovations have introduced an increasing demand for videos with better quality and higher resolutions. However, these videos require a large volume of data to be represented, and this information must be stored and eventually transmitted. In addition, digital videos are supported by many current electronic devices, such as smart phones and digital television set-top boxes. In this scenario, video coding is a fundamental process to make digital video support feasible. The video coding process comprises a variety of tools and techniques that were and still are being developed, so there is great research activity in this area. Through such techniques, digital videos are represented with a much smaller volume of data at the cost of little to no loss in quality. By fixing a particular set of coding tools, video coding standards are derived; H.264/AVC [1] is the most recent consolidated video coding standard [2]. Moreover, HEVC (High Efficiency Video Coding) is the video coding standard currently being developed by experts in the area [3]. HEVC aims to double the compression rate of H.264/AVC for the same video quality [3].


Motion Estimation (ME) is the most important process in recent video encoders, because most of the compression gains achieved during the entire coding process are produced at this stage. This is explained by the fact that ME exploits the similarities detected among temporally neighboring frames of a video and, since videos are usually displayed at 30 fps or more, such similarities occur frequently [2]. Nevertheless, the compression gains introduced by ME come at the cost of its complexity and of the significant external memory bandwidth required by this process. This bandwidth comes from the fact that the encoded frames (called reference frames) must be stored in an off-chip memory to be accessed again when ME is applied to the next frames. For this reason, an encoder design must take into account that one of the greatest bottlenecks in this type of system is the communication between the external memory and the ME core [4]. Besides being a bottleneck for coding performance, external memory bandwidth also implies unwanted energy consumption, a critical issue when battery-powered devices are considered.

Many solutions that aim to reduce the memory issues associated with ME can be found in the literature. Such solutions follow two main approaches: (1) external memory bandwidth reduction through data reuse strategies using caches; and (2) bandwidth reduction via reference frame compression before the frames are stored in the off-chip memory. The main advantage of reference frame compression over the other approach is that the former reduces the number of reading and writing accesses, while the latter reduces the number of reading accesses only. It is important to emphasize, however, that both solutions can be used together, further decreasing the number of accesses to the external memory.

This paper presents the Reference Frame Context Adaptive Variable-Length Compressor (RFCAVLC), a solution based on reference frame compression that reduces, on average, by more than 24% the volume of data stored in the off-chip memory, consequently decreasing the number of reading and writing operations required to access this data. The presented RFCAVLC is a hardware-friendly context-adaptive coder based on the Huffman algorithm [5]. Its coding process is performed without any information loss, avoiding drift. A software implementation of the proposed reference frame coder/decoder in C++ was used to analyze the compression gains of each explored adaptation. Additionally, this paper also presents the hardware design of the encoding side of the proposed solution.


Fig. 1 shows a simplified version of a video encoder (excluding intra-prediction, which is not the focus of this work), highlighting the proposed RFCAVLC architecture. The RFCAVLC module is located between the reference frames memory (off-chip) and the encoding core (ME/MC), so any communication between these components is intermediated by the RFCAVLC. All data written to the Reference Frames Memory is encoded by the RFCAVLC Encoder component. When the ME component requests data from this memory, the information is decoded by the RFCAVLC Decoder component before it is used.


[Figure 1: simplified encoder diagram with the Current Frame input, ME/MC, Residual Coding, Entropy Coding, Deblocking Filter, Reference Frame Selector and Reference Frames Memory; the RFCAVLC Encoder and Decoder, sharing Static Huffman Tables, sit between the Reference Frames Memory and the rest of the encoder.]

Figure 1. RFCAVLC in a typical video encoding block diagram

II. STATE OF THE ART AND RELATED WORKS

Video coding can be performed through different combinations of tools and techniques. Therefore, video-coding standards must be created in order to keep the coding and decoding sides coherent. Such standards explicitly define which procedures are adopted in each coding and decoding step. The most recent consolidated video-coding standard is H.264/AVC. This standard achieves greater compression rates than its predecessors, since it brought a number of innovations, such as quarter-pixel Fractional Motion Estimation and Variable Block Sizes, among others. Nonetheless, it demands high computational complexity and a large number of external memory accesses to do so [1]. Aside from H.264/AVC, video-coding experts are currently developing a new standard, HEVC, which has its final draft expected for January 2012. HEVC also introduces a plethora of innovations to the video-coding process, e.g., new coding structures (Tree Units, Coding Units, among others), Variable Size Transforms, and Arithmetic Entropy Coding. Its beta release is already achieving compression rates close to 50% higher than H.264/AVC for the same quality, but its complexity is also higher [3].

Regarding digital videos, an important trend is that resolutions grow constantly. Besides significantly increasing the time needed to encode and decode a video, high resolutions increase the data volume that flows between the ME core and the off-chip memory, consequently affecting coding performance. Moreover, there is a particular extension of H.264/AVC called MVC (Multiview Video Coding), related to the compression of 3D videos, a kind of media that has attracted significant interest on the market. 3D video coding adds a new kind of estimation to the coding process, the Disparity Estimation, which also requires a great off-chip memory bandwidth.

Within this scenario, solutions that reduce the off-chip memory bandwidth become imperative to any video coding system that prioritizes performance or that is energy-aware. Several solutions that address this matter are found in the literature [6-10]. The work proposed in [6] presents a bit-masking technique to reduce the power consumption of ME. Instead of using a fixed truncation, an adaptive bit-masking method determines the number of truncated bits dynamically, depending on picture quality. Power consumption is reduced by 58% on average, since the input bits with the highest switching activity are truncated with little loss in quality. On the other hand, the resolution of the videos used to obtain the presented results is not mentioned, and quality degradation increases significantly as video resolution grows.

Paper [7] applies an in-loop compression of reference frames. This technique serves the dual purpose of reducing memory bandwidth and memory size. The algorithm used to compress the reference frames is a fixed-length compression algorithm based on block scalar quantization [11], called the min-max scalar quantization (MMSQ) scheme. This scheme saves from 25% to 37.5% of the memory size used to store reference frames and an estimated 25% to 37.5% of the memory transfer bandwidth, depending on the applied configuration. Both configurations cause little quality degradation, but once again the resolution of the test videos was only 704x480 pixels, so results for 720p or higher resolutions cannot be estimated.

A solution based on the compression of 2x2 blocks through different prediction modes is presented in [8]. This is quite similar to the Intra prediction module of video encoders, but implemented in a simpler way. This solution is not only lossy but also introduces extra complexity to the coding process, since the best prediction mode for each 2x2 block must be calculated.

The work proposed in [9] presents different metrics to estimate the best matches among candidate blocks during the prediction stage. The solutions use high-performance single instruction, multiple data (SIMD) implementations, all based on approximations of the sum of absolute differences metric. These approximated metrics achieve high speed-ups, up to 11:1 in some cases, while sacrificing less than 0.1 dB of image quality on average. Yet again, the published results were obtained using videos of CIF (352x288 pixels) and QCIF (176x144 pixels) resolutions.

Work [10] proposes a solution based on the previously proposed MMSQ algorithm [7]. The improvement proposed in [10] reduces the losses of the MMSQ process by storing them, so that these errors can be returned to the corresponding blocks during the Motion Compensation (MC) process. This solution is interesting, but it is lossy and not compatible with architectures that apply the ME and MC processes jointly. Architectures that implement ME/MC combined can skip the external memory access performed by MC by storing the residual block at the moment ME finds the best match for the current block, reducing the off-chip memory bandwidth of the coding process.

The main contribution of this work is to propose a lossless reference frame compression method, preserving video quality for any targeted resolution. Additionally, since memory accesses are known to consume a significant amount of energy in many digital systems, the RFCAVLC is a suitable solution for energy-aware video encoders. Furthermore, the proposed solution is compliant with any currently known video-coding standard.

III. EVALUATED SOLUTIONS

The algorithm proposed in this work is a hardware-friendly adaptation of the Huffman coding algorithm, named the Reference Frame Context Adaptive Variable-Length Compressor (RFCAVLC). The Huffman algorithm is a well-known statistical data compression technique, widely used in image and text compressors [5]. RFCAVLC compresses each reference frame before it is stored in the off-chip memory. The solution also performs the decoding of previously RFCAVLC-compressed frames, since they will be used as references later in the coding process.

Traditionally, Huffman coding is implemented with binary trees that are created following specific rules, so that the data patterns that occur more often are coded with fewer bits. To do so, the algorithm requires a two-pass analysis: in the first pass, the symbol statistics are gathered; in the second, the algorithm builds a tree based on the results of the previous pass. This kind of implementation is not suitable for video-coding hardware systems, where throughput is a critical constraint. Therefore, hardware implementations of Huffman coders usually adopt a static Huffman table to bypass the statistical analysis.

This work proposes a context-adaptive solution using more than one static Huffman table to compress the reference frames. Each table is suitable for a different set of frame features, so a decision strategy must be applied. This idea is similar to that of the CAVLC entropy encoder of the H.264/AVC standard [2], but in a simplified fashion. The number of static Huffman tables has a direct impact on compression efficiency; however, more tables require more memory elements, so this trade-off must be carefully explored during design planning.

In this work, every table contains the codes for each possible luminance sample value. Since luminance samples are typically represented with eight bits, each table has 256 entries, one for each possible luminance value. The values of each table were obtained separately and correspond to the average occurrence of each luminance value, according to an analysis of eight 1080p (1920x1080 pixels) sequences with different motion and lighting characteristics: blue_sky, in_to_trees, man_in_car, pedestrian_area, riverbed, rolling_tomatoes, sunflower and tractor. All these video sequences are available in [12].
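To illustrate the offline step, the C++ sketch below builds one static Huffman table from a luminance histogram gathered over the training sequences. This is a minimal reconstruction using the textbook priority-queue method [5], not the authors' code, and all names in it are hypothetical.

#include <cstdint>
#include <queue>
#include <string>
#include <vector>

// One tree node: a leaf carries a luminance value (0..255), internal
// nodes carry value = -1. Built with the classic two-node merge rule.
struct Node {
    uint64_t freq;
    int      value;
    Node*    left  = nullptr;
    Node*    right = nullptr;
};
struct Cmp {
    bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
};

// Walks the finished tree and records the bit string of every leaf.
static void assignCodes(const Node* n, const std::string& prefix,
                        std::vector<std::string>& table) {
    if (!n) return;
    if (n->value >= 0) { table[n->value] = prefix.empty() ? "0" : prefix; return; }
    assignCodes(n->left,  prefix + "0", table);
    assignCodes(n->right, prefix + "1", table);
}

// hist[v] = occurrences of luminance value v in the training set.
// Returns a 256-entry table: table[v] is the codeword of value v.
std::vector<std::string> buildStaticTable(const std::vector<uint64_t>& hist) {
    std::priority_queue<Node*, std::vector<Node*>, Cmp> pq;
    for (int v = 0; v < 256; ++v)
        pq.push(new Node{hist[v] + 1, v});  // +1 so every value gets a code
    while (pq.size() > 1) {
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{a->freq + b->freq, -1, a, b});
    }
    std::vector<std::string> table(256);
    assignCodes(pq.top(), "", table);
    return table;
}

In the static variants below, this construction runs once per table at design time, so the hardware never performs the statistical pass.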

In this work, the following configurations were explored using the Huffman algorithm: (A) dynamic table; (B) one static table; (C) two static tables; (D) two static tables and 8x8 block size; (E) four static tables and 8x8 block size; and (F) four static tables and 4x4 block size. These configurations are explained in the following subsections. In the experiments that use block coding, each frame is divided into smaller square areas, and each area is associated with the most suitable table. For all experiments, 200 frames of the eight video sequences cited before were used. The results were obtained with software implemented in C++.

A. Dynamic Table

For each frame, a dynamic table was created and the corresponding frame was encoded using this table. This is the typical Huffman algorithm, but this solution is not suited for hardware implementation, because the two-pass encoding requires a long computation time. Nevertheless, this experiment was carried out to serve as a baseline for the other experiments. It obtained an average compression rate of 12.21%.

B. One Static Table

In this experiment, only one static table was used to encode every frame of the video. The eight videos were used to generate average statistics, which were used to assemble the static table. All video sequences were then encoded using this table. The average compression rate achieved was 7%, lower than in the previous experiment. This was expected, since in experiment A the table was dynamically rebuilt for each new frame.

C. Two Static Tables

In this experiment, two static tables were used to encode every frame of the video. The two tables were built using the same strategy as in the previous experiment, but now the sample values were divided into two groups: one for light frames (with higher sample values) and the other for dark frames (with lower sample values). To decide which table must be used, the average sample value of each frame is computed: if this average is smaller than 128, the first table is used; otherwise, the second table is used. The encoded information must also signal which table was selected. The average compression rate reached 9%. This is still lower than in experiment A, which was expected, because two static tables are still not better than one dynamic table. However, the use of two tables improved the compression when compared with experiment B, so this idea was explored further.

D. Two Static Tables and 8x8 Block Size

In this experiment, each frame is divided into blocks of 8x8 samples. The average sample value of each block is computed, and one of the two tables is selected to encode the block based on this average. In this case, 1 extra bit per block must be added to the bitstream to signal which table was used to encode the block. The static tables used in the previous experiment were reused here. By introducing block-level coding, the average compression rate rose to 14.85%, which is already better than experiment A with its dynamic tables. This motivated two further experiments: one using more static tables and the other using smaller block sizes. These experiments are presented below.

E. Four Static Tables and 8x8 Block Size

This experiment kept the frame division into 8x8 blocks, but four static tables were used instead of two. The methodology used to generate the four tables is similar to that of experiment B, but here the sample values were classified into four ranges. The average sample value is computed for each block, and then one of the four tables is selected; in this case, 2 extra bits per block are needed to signal which table was used. The decision is performed as follows: if the average of the current block is between 0 and 63, table 1 is selected; if it is between 64 and 127, table 2 is used; if it is between 128 and 191, the third table is consulted; lastly, if it is between 192 and 255, the block is encoded according to the fourth table. The compression rate reached 22.36% on average, better than the one obtained in experiment D.


F. Four Static Tables and 4x4 Block Size

The last experiment considered the frame division into blocks of 4x4 luminance samples and the use of four static tables. The static tables are the same as in the previous experiment, and the decision process is also the same. The difference is that the decision is more accurate, since smaller blocks are considered. This better accuracy is achieved at the cost of more control bits per sample: the previous experiment used two extra bits to indicate the table used for each 8x8 block, while here the same two bits indicate the table used for each 4x4 block. This experiment reached a compression rate of more than 24% on average, the best among all experiments. Therefore, the architecture designed in this work is based on it.
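As an illustration, the C++ sketch below encodes one 4x4 block in this final configuration: it derives the 2-bit table identifier from the block average and then appends the variable-length code of each of the 16 samples. All names and the bit-buffer layout are hypothetical, and the code tables are assumed to have been built offline as in the previous sketch.

#include <cstdint>
#include <vector>

// Hypothetical code entry: the codeword occupies the low 'len' bits.
struct Code { uint32_t bits; uint8_t len; };

// tables[t][v] = codeword of luminance value v in static table t (t = 0..3).
using CodeTable = std::vector<Code>;

// Appends 'len' bits, MSB-first, to a simple bit buffer (one bit per byte).
static void putBits(std::vector<uint8_t>& buf, uint32_t bits, int len) {
    for (int i = len - 1; i >= 0; --i)
        buf.push_back((bits >> i) & 1u);
}

// Encodes one 4x4 block: a 2-bit table identifier followed by the
// variable-length codes of the 16 luminance samples.
void encodeBlock(const uint8_t block[16], const CodeTable tables[4],
                 std::vector<uint8_t>& bitstream) {
    unsigned sum = 0;
    for (int i = 0; i < 16; ++i) sum += block[i];
    unsigned avg = sum >> 4;   // average of the 16 samples
    unsigned t = avg >> 6;     // 0..63 -> 0, 64..127 -> 1, 128..191 -> 2, 192..255 -> 3
    putBits(bitstream, t, 2);  // signal the selected table
    for (int i = 0; i < 16; ++i) {
        const Code& c = tables[t][block[i]];
        putBits(bitstream, c.bits, c.len);
    }
}

Packing this bit buffer into the 32-bit words that are written to the external memory is omitted here for brevity.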

IV. DETAILED ALGORITHM DESCRIPTION AND COMPRESSION RESULTS

The lossless reference frame compression proposed in this paper uses four static Huffman tables and a block size of 4x4 samples, as presented in the last section.

The 4x4 block size produced the best results among the evaluated experiments, but it is also very convenient because it is aligned with the H.264/AVC block size. Even when variable block sizes are used in the ME process [4], H.264/AVC defines that the reconstructed frames are stored in the reference frame memories as blocks of 4x4 samples. The variable-size blocks resulting from ME are grouped to reconstruct the frame through motion compensation. The differences between the original frame and the predicted one (called residues) are divided into 4x4 blocks, which are processed by the forward transform and quantization. These blocks then undergo the inverse quantization and transform and are added back to the prediction to generate the reference frame. The resulting blocks are first processed by the deblocking filter to reduce blocking artifacts, and then they are available to be written to the external memory [4]. The reconstruction path includes the forward and inverse transforms and quantization (Residual Coding in Fig. 1) to guarantee that coder and decoder use the same references, since quantization is a lossy stage in video coding [4].

Since the deblocking filter processes 4x4 blocks, the RFCAVLC can encode these blocks as they become available, without the need for any buffer or extra logic. The RFCAVLC encoder thus receives the 4x4 blocks from the deblocking filter, compresses each block and stores the result in the external memory. Conversely, when the encoder requests a frame from the off-chip memory to be used as reference for the ME, this frame must first be decoded by the RFCAVLC. Fig. 2 displays the block diagram of the RFCAVLC structure.

[Figure 2: RFCAVLC structure, with the Encoder, Decoder, table Selector and static Tables placed between the External Memory and the coding core; labeled signals include the reference frame, the current frame, the residual frame and the compressed residue.]

Figure 2. RFCAVLC block diagram.

During the RFCAVLC encoding process, the Selector module in Fig. 2 receives a 4x4 block as input. The average value of this block is calculated and the corresponding table identifier is sent as output. This identifier not only selects the table that is consulted but is also attached to the bits stored in the reference memory, so that it can be used later by the decoder.

The decoding phase consists in reading the coded frame from the reference frames memory and applying the reverse of the encoding process. The frame is interpreted as a list of 32-bit words, and translation is performed using a greedy strategy, i.e., bits are consumed from the list, one word at a time. The first two bits select the right table, and the remaining bits are decoded using it. If the bits of the current word are fully consumed before the decoding completes, the next word is read from the external memory and the decoding continues.
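A minimal C++ sketch of this greedy decoding loop is shown below, mirroring the hypothetical encoder sketch of Section III; the linear search over the 256 table entries stands in for whatever lookup structure a real implementation would use.

#include <cstdint>
#include <vector>

// Same hypothetical code entry as in the encoder sketch.
struct Code { uint32_t bits; uint8_t len; };

// Reads bits MSB-first from a list of 32-bit words, fetching the next word
// (from what stands in for the external memory) whenever one is exhausted.
class BitReader {
public:
    explicit BitReader(const std::vector<uint32_t>& words) : w_(words) {}
    int next() {                               // returns the next bit (0 or 1)
        if (pos_ == 32) { ++idx_; pos_ = 0; }  // current word consumed
        return (w_[idx_] >> (31 - pos_++)) & 1u;
    }
private:
    const std::vector<uint32_t>& w_;
    size_t idx_ = 0;
    int pos_ = 0;
};

// Decodes one 4x4 block: a 2-bit table identifier, then 16 prefix codes
// matched greedily against the selected table.
void decodeBlock(BitReader& br, const std::vector<Code> tables[4],
                 uint8_t block[16]) {
    unsigned t = br.next();          // first two bits: table identifier
    t = (t << 1) | br.next();
    for (int i = 0; i < 16; ++i) {
        uint32_t acc = 0;
        int len = 0;
        for (bool found = false; !found; ) {   // extend until a codeword matches
            acc = (acc << 1) | br.next();
            ++len;
            for (int v = 0; v < 256 && !found; ++v)
                if (tables[t][v].len == len && tables[t][v].bits == acc) {
                    block[i] = static_cast<uint8_t>(v);
                    found = true;
                }
        }
    }
}

Because Huffman codes are prefix-free, the inner loop always terminates at exactly one codeword boundary, which is what makes the greedy strategy safe.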

Table I presents the detailed results obtained with RFCAVLC for each video sequence cited before. The compression rate reached for each sequence also indicates the external memory bandwidth reduction achieved with RFCAVLC. As Table I shows, the compression rate varies among the video sequences. This is explained by the fact that some sequences present more homogeneous areas than others, which favors the compression achieved by the RFCAVLC solution. For example, the Man_in_car sequence is composed of many homogeneous dark areas, so its compression rate reached 36%. The remaining sequences achieved compression rates closer to the average of 24.55%.

TABLE I. RESULTS ACHIEVED FOR RFCAVLC

Sequence            Compression Rate (%)
Blue_sky            24.10
In_to_trees         21.49
Man_in_car          36.00
Pedestrian_area     23.19
Riverbed            21.53
Rolling_tomatoes    23.62
Sunflower           24.14
Tractor             22.33
Average             24.55

V. DESIGNED ARCHITECTURE

Fig. 3 presents the flowchart of the RFCAVLC compression algorithm, which is the focus of the architecture designed in this work. After a block is processed by the deblocking filter, it becomes available to the RFCAVLC encoding module. The RFCAVLC fetches a 4x4 block and calculates the average sample value of this block. Based on the resulting average, the most suitable table is used to compress the block. After that, the compressed block is stored in the external memory, and the same process is applied to the next blocks of the frame.

[Figure 3: flowchart with the steps Start, Get Frame, 4x4 Blocks Available?, Get Block, Average Calculation (>>4), the comparisons Average < 64?, Average < 128? and Average < 192?, and the actions Encode Samples with Table 1/2/3/4 and Write Results in Reference Memory.]

Figure 3. RFCAVLC table decision flowchart using 4 static tables

Fig. 4 presents the architectural design of the proposed RFCAVLC encoder using four static tables and 4x4 blocks. First, all the input block samples are summed in order to compute their average value. The sum of the 16 samples of a 4x4 block is performed by an adder tree with 31 adders in a purely combinational way. The adder tree receives samples with 8 bits each, delivering 13 bits at its output.

The result of this sum is shifted right by four bits, since there are 16 samples in a 4x4 block. With the average value, represented in 9 bits, the Table Selector in Fig. 4 sets the multiplexing unit to the correct position, so the correct table is consulted in the coding process for this block.

[Figure 4: encoder architecture with the Input Block, the Average Calculator, the Table Selector, the Sample Controller, the four static Tables and the Reference Memory.]

Figure 4. Reference Frame Encoder Architecture.

The Sample Controller in Fig. 4 decides which block sample is going to be encoded at each new cycle. The tables are designed as multiplexer trees whose select inputs are driven by the sample being encoded. The tables also contain the size in bits of the RFCAVLC code of each sample, which is used to write at the correct position of the Reference Memory. The whole process was designed to fit in a single cycle; hence, one luminance sample is encoded per cycle by the RFCAVLC.
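As a cross-check of this datapath, a bit-accurate C++ model of the average computation and table selection (a reconstruction for illustration, not the authors' verification code) could look like the following; the combinational adder tree is flattened into a loop here.

#include <cstdint>

// Software model of the encoder front end: sum the 16 samples (in hardware,
// a combinational adder tree), shift right by four bits (division by 16),
// and reduce the average to a 2-bit table selector that drives the
// multiplexer trees, mirroring the decision flowchart of Fig. 3.
unsigned selectTable(const uint8_t block[16]) {
    uint16_t sum = 0;  // wide enough for 16 x 8-bit samples
    for (int i = 0; i < 16; ++i)
        sum = static_cast<uint16_t>(sum + block[i]);
    uint8_t avg = static_cast<uint8_t>(sum >> 4);  // >>4 == divide by 16
    // avg < 64 -> table 0, < 128 -> table 1, < 192 -> table 2, else table 3
    return avg >> 6;
}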

VI. ARCHITECTURAL RESULTS AND COMPARISONS

The proposed architecture was described in VHDL and synthesized to the EP4S40G2F40I2 Altera Stratix IV FPGA device using the Quartus II synthesis tool from Altera [13]. Table II presents the synthesis results for this device.

TABLE II. SYNTHESIS RESULTS FOR A STRATIX IV FPGA

Frequency     ALUTs     Registers     Memory Bits
369.28 MHz    615 (