LAYERED SCREEN VIDEO CODING LEVERAGING ... - IEEE Xplore

LAYERED SCREEN VIDEO CODING LEVERAGING HARDWARE VIDEO CODEC

Dan Miao*

Jingjing Fu, Yan Lu, Shipeng Li

Chang Wen Chen

University of Science and Technology of China Hefei, China [email protected]

Media Computing Group Microsoft Research Asia Beijing, China {jifu,yanlu,spli}@microsoft.com

State University of New York at Buffalo Buffalo, NY, USA [email protected]

ABSTRACT

In this paper, we propose a layered screen video coding scheme built on existing video codecs, so that hardware video codecs can be leveraged for efficient screen video compression. In this scheme, screen video compression is performed as two-layer coding: base layer coding and enhancement layer coding. The screen video is first analyzed at both the frame and block levels to extract temporal and spatial information that guides the selection of coding content in each layer. The non-skip screen frames are compressed directly by the conventional video codec in the base layer, while the screen contents that are sensitive to video quality degradation are selected for improved coding in the enhancement layer. For the contents to be enhanced, two intra coding modes are designed to improve the quality of the compressed text/graphics contents and to suppress the artifacts introduced by chroma downsampling. The experimental results demonstrate that the proposed scheme improves screen video quality both objectively and subjectively at low cost in bitrate and computational complexity. Moreover, an average coding gain of 2.95 dB is achieved at high bitrates.

Index Terms: Screen video compression, content analysis, layered video coding

* This work was done while the author was with Microsoft Research Asia as a research intern.

1. INTRODUCTION

Recent years have witnessed the rapid development of the consumer electronics industry. It has become increasingly common for users to own multiple computing devices and to share computing and storage resources among them over a network connection. One general way to access and control a resource remotely is screen sharing: the remote screen interface is compressed as a sequence of bitmap images and transmitted to the local device for display, and the corresponding operations on the interface are transmitted back to the remote side for control. Based on screen sharing, many systems have been developed to facilitate remote resource access [1], such as remote desktop, wireless display, web conferencing, and online training. In these applications, the system delay caused by compression and transmission directly determines the user's interactive experience. Given the constraints on transmission bandwidth and coding complexity, compressing screen contents efficiently becomes a prevalent and critical problem.

One of the existing solutions for screen content compression is to compress the screen as natural images with standard codecs. For instance, JPEG is employed for screen image compression in Virtual Network Computing (VNC) [2], while H.264 is integrated in Onlive [3]. Standard video codecs are widely employed in video transmission systems and are well supported by optimized implementations in both software (e.g., X.264, FFmpeg, MSFT) and hardware. Moreover, with hardware acceleration, real-time compression can be achieved easily. However, a video codec designed for natural video may not perform well on screen video, for the following reasons.

Firstly, a screen video contains considerable text/graphics content synthesized by computer, such as Web pages, PDF files, and online games. Compared with natural images, text/graphics regions contain more sharp edges, and the geometries of these edges are usually complicated and irregular, while the transform-based coding used in most video codecs cannot handle this anisotropic correlation very well.

Secondly, for natural images the pixel values in the chroma channels change far more smoothly than those in the luminance channel. For efficient representation, the chroma data is therefore downsampled along both the horizontal and vertical directions to produce YUV420 data in the optimized implementations of video codecs, including H.264 and MPEG-2. Consequently, the screen frame has to be converted from RGB to YUV420 before compression because of this hardware constraint.
Since considerable screen contents are synthesized by computer, the color text/graphics regions contain rich chroma information, including sharp edges and complicated textures. Downsampling the chroma channels therefore introduces noticeable distortion and decreases the visual quality.

Given the challenges stated above, many approaches have been developed to compress screen video efficiently. Some coding schemes based on H.264 intra coding have been proposed for screen video compression [4] [5]. Ding et al. introduced a new intra mode to better exploit spatial correlation in text regions [4]. Although the rate distortion performance is improved, compatibility with the standard codec is compromised to some degree because the coding structure is modified. Rather than working within a standard codec, Shen et al. introduced a new framework for efficient screen coding [6]. In that scheme, the blocks in each screen frame are classified into image blocks and text blocks. Transform-based coding is employed for image block compression, while text blocks are entropy encoded after quantization in the pixel domain. The chroma channels of a text block are compressed together with luminance, without downsampling. Shen's scheme performs well on pure screen contents, but its performance decreases dramatically when natural video with high motion is embedded in the screen video. To address this problem, Wang et al. proposed a layered hybrid screen video codec [7]. In their work, the video region is segmented from the screen frame and compressed by a conventional video codec, and the rest of the screen is encoded in the pixel domain. Though high compatibility with the video codec is achieved, the video codec is not fully utilized if no video region is detected. In most cases, it is difficult to reconstruct the whole screen with only the video codec available.

Fig. 1. Framework of layered screen video coding. Encoder: content analysis of the screen video passes frame-level information to base layer encoding (video encoding, YUV420) and block-level information to enhancement layer encoding (coding mode decision; intra mode with transform domain or pixel domain encoding; skip/inter mode encoding). After transmission, base layer decoding (video decoding, YUV420) and enhancement layer decoding (transform domain, pixel domain, and skip/inter mode decoding) are merged into the reconstructed screen video.

In this paper, we propose a layered screen video compression scheme based on a video codec. As an extension of H.264/AVC, SVC [8] is well known for its layered coding structure, which achieves scalability in video coding, including in quality level. However, this coding structure is designed for conventional video and does not consider the properties of screen video.
In SVC, strong correlation exists between the layers: inter-layer predictions, including motion prediction and residual prediction, are introduced with a complicated prediction loop to pursue coding efficiency. In our coding scheme, a simple layered scheme with an open-loop coding structure is proposed, which can also achieve scalability in video quality. To fully utilize existing standard codec resources, the screen video is compressed by a conventional video encoder in the base layer, which works well for the natural image parts. To satisfy higher requirements on video quality, the original screen content is fed directly to enhancement layer coding, independently of the base layer. Specifically, the blocks sensitive to video quality improvement are selected for enhancement coding based on content analysis, and two enhancement coding methods are designed to improve the video quality. In this way, not only is the artifact introduced by chroma downsampling suppressed, but the degradation in the luminance channel caused by lossy transform-based coding is also reduced by recoding in the pixel domain. Note that a hardware video codec can be integrated directly into our coding scheme for screen video access systems.

2. PROPOSED FRAMEWORK

The screen video compression is performed as two-layer coding: base layer coding and enhancement layer coding. The framework is illustrated as a block diagram in Fig. 1. Before encoding, each screen frame is analyzed at the frame and block levels with respect to its inherent spatial characteristics and its temporal variation among neighboring frames, in order to guide the subsequent layered coding. In the base layer, the standard video codec is employed directly for non-skip screen frame coding, and each input frame is converted to YUV420 to stay compatible with the standard format. With the block-level side information generated by content analysis, blocks in the screen frame are enhanced selectively using intra coding. To address the distinct distortion types, two intra coding modes are designed: one improves the coding performance of text/graphic content, and the other suppresses the artifacts introduced by chroma downsampling. The first enhances both the luminance and chroma channels with pixel domain coding; the second enhances only the chroma channels with transform domain coding. To preserve the improved coding performance and save bits, a block with the same content as an enhanced block is set to skip/inter mode. In the decoder, the complete screen frame is reconstructed by the video decoder in the base layer.
In the enhancement layer, an intra mode block is decoded directly and padded into the video frame decoded from the base layer. For the skip and inter modes, the corresponding blocks from the previously decoded screen frame are copied into the current one to generate the final content.

The contributions of the proposed coding framework are threefold. Firstly, the layers are coded separately, which better guarantees scalability in video quality. Secondly, the scheme is robust to variations in screen video content thanks to the content analysis of screen video properties. Finally, high compatibility with conventional video codecs is achieved, and the scheme can be implemented in hardware easily, with only a little modification for enhancement layer coding.

3. CONTENT ANALYSIS

For efficient screen video coding, we first analyze the properties of screen video before introducing the selection rules for enhancement coding.

3.1. Screen video characteristics

In general, for a friendly and explicit presentation, the screen is carefully designed and composed of various components. Compared with natural video, screen video has distinct characteristics in both the temporal and spatial domains. In the temporal domain, the content of screen video is more stable than that of natural video, since users need time to take in information through reading. The layout of screen contents is often pre-defined and varies little during page scrolling. Therefore, the screen contents usually move with a global motion between neighboring frames.

In the spatial domain, screen content presents strong anisotropic features, especially in the text and graphics parts. To illustrate this, we show some example 16x16 blocks in Fig. 2. The edges of text/graphics content are much sharper than those in natural image regions, and the geometries of the edges are usually complicated and irregular. An example is shown in Fig. 2(b). Though the edges of text/graphics have a transition of several pixels around them due to shadow effects, the contrast between foreground and background is much higher than that in natural images. Because of this high contrast, the distribution of luminance values is discontinuous and sparse. From Fig. 2(e), we can see that the distribution of luminance values is almost continuous for the smooth image block, while large gaps exist between the non-zero counts in the text/graphics block. Moreover, for a high contrast block, chroma downsampling may introduce chromatic aberration around the sharp edges. Fig. 2(c) shows an example in which the block is converted from YUV 444 to YUV 420. In the edge region, a noticeable grey shadow is observed around the boundaries, especially for text/graphics.

3.2. Content selection for enhancement coding

To allocate bitrate efficiently in the enhancement layer, it is critical to decide which parts of the screen video need to be enhanced for video quality improvement. In our scheme, we select the contents to enhance by evaluating each content's impact on the subjective and objective quality of the screen video.

Fig. 2. From rows (a)-(e): original image, original block, block with chroma downsampling, luminance block, histogram of luminance. Columns: image (smooth region), image (edge region), text, graphic.
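The chroma downsampling artifact described in Section 3.1 can be reproduced with a minimal sketch. The 2x2 averaging filter and nearest-neighbor upsampling used here are assumptions chosen for simplicity (real codecs and color converters may use other filters), and the sample planes are made up for illustration.

```python
def downsample_420(plane):
    """Average each 2x2 neighbourhood of a chroma plane (even dimensions assumed)."""
    h, w = len(plane), len(plane[0])
    return [
        [
            (plane[y][x] + plane[y][x + 1]
             + plane[y + 1][x] + plane[y + 1][x + 1] + 2) // 4
            for x in range(0, w, 2)
        ]
        for y in range(0, h, 2)
    ]


def upsample_nearest(plane):
    """Replicate every chroma sample into a 2x2 block."""
    out = []
    for row in plane:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def max_abs_error(a, b):
    """Largest per-sample absolute difference between two planes."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


# A 4x4 chroma plane with a sharp vertical edge between columns 0 and 1,
# as might occur around colored text (illustrative values).
sharp = [[16, 240, 240, 240] for _ in range(4)]
# A smooth plane of the kind found in natural images (illustrative values).
smooth = [[100, 102, 104, 106] for _ in range(4)]

sharp_err = max_abs_error(sharp, upsample_nearest(downsample_420(sharp)))
smooth_err = max_abs_error(smooth, upsample_nearest(downsample_420(smooth)))
print(sharp_err, smooth_err)  # prints "112 1"
```

In this toy case the smooth plane survives the 4:2:0 round trip almost unchanged, while the sharp edge is smeared into an intermediate value, which corresponds to the shadow-like artifact noted for text/graphics blocks.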

The temporal impact of a content on video quality is determined by its duration. For instance, if a screen video is stable, the quality enhancement made in the first frame is preserved in the following stable frames. Accordingly, the overall video quality is improved by enhancement coding at only one frame. If the content moves with global motion, the improved quality can also be preserved by inter coding. For an active region with frequent content change, though the video quality can be enhanced frame by frame, the quality improvement can hardly be perceived because of the fast content change, especially for natural video. In our scheme, the stable contents are detected as enhancement candidates, and enhancement coding is performed only once in the stable period.

In the spatial domain, the impact on video quality is evaluated by the content's inherent texture. Regions with high contrast often suffer more distortion than smooth regions. Moreover, the human visual system is more sensitive to distortion in high contrast regions than in smooth regions. Therefore, high contrast blocks should be enhanced with higher priority. Note that there is high correlation between the luminance and chroma channels, as shown in Fig. 2(d). We therefore use the gradient value in the luminance channel to judge the contrast property.

Combining the temporal and spatial selection, the flow of the proposed selection algorithm is illustrated in Fig. 3. In the implementation, the current block is set as a skip block if its content is the same as that of the corresponding block in the previous frame. Since a content's stability is evaluated by its duration, which can be measured by the number of times the content is skipped in adjacent screen frames, we can distinguish stable content from active content by examining the number of skip blocks in the sequence. In our scheme, a stable period is considered detected if m consecutive blocks are all skip ones (m > 1). Frequently changing content is filtered out by this rule, and the stable content is highly likely to stay the same in the next several frames. In this condition, if no block at this position in the previous frames has been enhanced since the last non-skip block appeared, this block becomes the only enhancement coding candidate at this position in this stable period. After temporal filtering of the blocks, the gradient value in the luminance channel of each candidate block is calculated as the sum of the absolute differences between neighboring pixels within the block. If the value exceeds the threshold, the block is considered a high gradient block, which will be enhanced in the following module. Note that the threshold m can be adjusted dynamically according to the screen video contents.

Fig. 3. The flow of enhance coding selection algorithm. Temporal domain: is the current block a skip block; are the previous m blocks all skip ones; is any block enhanced after the last non-skip one (if yes, enhancement as skip mode). Spatial domain: is it a high gradient block (if yes, enhancement as intra mode). Otherwise, no processing.

An illustration is shown in Fig. 4 to demonstrate the content selection in the temporal domain. For one block in the n-th frame Fn, the block is detected as a skip block and the blocks at the same position in the previous m-1 frames are all skipped. No block at the same position has been enhanced after the nearest non-skip block, in the (n-m)-th frame. The block is then regarded as an enhancement coding candidate for further spatial checking.

Fig. 4. An example of enhance coding selection in temporal domain. The block at position (i,j) is non-skip in frame Fn-m and skip in frames Fn-m+1 through Fn; the block in Fn is the enhancement coding candidate.
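The selection rules above can be sketched as follows. This is an illustrative reading rather than the authors' implementation: the `BlockState` helper, the returned mode names, the gradient threshold of 1000, and the choice to sum both horizontal and vertical neighbor differences are all assumptions; only the skip count m and the overall flow come from the text.

```python
from dataclasses import dataclass


@dataclass
class BlockState:
    """Per-position history kept across frames (hypothetical helper)."""
    skip_run: int = 0       # consecutive skip blocks since the last non-skip block
    enhanced: bool = False  # an enhancement was already coded in this stable period


def luma_gradient(block):
    """Sum of absolute differences between adjacent luma pixels in a block.

    Both horizontal and vertical neighbours are counted (an assumption).
    """
    h, w = len(block), len(block[0])
    horiz = sum(abs(block[y][x] - block[y][x + 1])
                for y in range(h) for x in range(w - 1))
    vert = sum(abs(block[y][x] - block[y + 1][x])
               for y in range(h - 1) for x in range(w))
    return horiz + vert


def select(state, prev_block, cur_block, m=2, grad_threshold=1000):
    """Return 'none', 'intra', or 'skip' for the enhancement layer; update state."""
    if prev_block is None or cur_block != prev_block:
        # Non-skip block: the stable period restarts.
        state.skip_run = 0
        state.enhanced = False
        return "none"
    state.skip_run += 1
    if state.skip_run < m:
        return "none"   # not yet stable for m consecutive skip blocks
    if state.enhanced:
        return "skip"   # enhancement already coded in this stable period
    if luma_gradient(cur_block) > grad_threshold:
        state.enhanced = True
        return "intra"  # the single enhancement of this stable period
    return "none"       # stable but low gradient


# A static high-contrast 4x4 block observed over four frames (made-up data).
text = [[0, 255, 0, 255] for _ in range(4)]
state = BlockState()
decisions = []
prev = None
for frame_block in [text, text, text, text]:
    decisions.append(select(state, prev, frame_block))
    prev = frame_block
print(decisions)  # prints ['none', 'none', 'intra', 'skip']
```

With m = 2, the block is enhanced once, at the second consecutive skip, and later frames reuse that enhancement, which matches the temporal example of Fig. 4.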

4. LAYERED SCREEN VIDEO ENCODING

In this section, we introduce the details of the proposed layered screen video coding scheme.

4.1. Base layer encoding

To leverage existing coding resources, the screen video is compressed in the base layer by a conventional video codec, which works well for the natural image parts. Since the screen video may be stable over a certain period, there is no need to compress the same content frame by frame. If the current frame is exactly the same as the previous one, it is set as a skip frame and is not encoded; only the frame type is transmitted to the decoder. Otherwise, the non-skip frame is fed into the base layer encoder.
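The skip-frame rule of the base layer can be sketched in a few lines. The function name and the tuple format are hypothetical, and the actual video encoder call is left out: frames tagged `"code"` stand for those that would be passed to the conventional (for example, hardware) encoder.

```python
def base_layer_schedule(frames):
    """Yield ('skip', None) for a frame identical to its predecessor,
    otherwise ('code', frame) for a frame to be sent to the video encoder."""
    prev = None
    for frame in frames:
        if prev is not None and frame == prev:
            yield ("skip", None)   # only the frame type is transmitted
        else:
            yield ("code", frame)  # non-skip frame goes to the base layer encoder
        prev = frame


# Frames are modelled as raw byte strings (made-up data).
frames = [b"A", b"A", b"B", b"B", b"B"]
frame_types = [kind for kind, _ in base_layer_schedule(frames)]
print(frame_types)  # prints ['code', 'skip', 'code', 'skip', 'skip']
```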

Fig. 5. The quantization in pixel domain for mode decision. The luminance histogram (frequency versus pixel value) shows base colors BC0, BC1, and BC2, each covered by a window labeled QW; pixels outside the windows are escape pixels.

4.2. Enhancement layer encoding

The enhancement blocks selected above are enhanced by distinct coding modes based on content analysis.

4.2.1. Mode decision

In our enhancement coding, four coding modes are designed: two intra modes, an inter mode, and a skip mode. Each high gradient block selected above is set as an intra block, which is further classified according to whether it contains text/graphics content. As analyzed previously, text/graphics content usually has the highest contrast in the spatial domain, and this contrast is reflected in the luminance histogram. Based on this property, quantization is performed for each high gradient block as follows. The colors with peak histogram values are selected as base colors. Then, a window of equal size in the histogram is used to cover the colors near each base color, as shown in Fig. 5. All pixels within a window are quantized to the base color of that window. Some pixels may fall outside the windows. If the number of these escaped pixels is within a threshold, the block is considered a text/graphics block; otherwise, it is an image block. Two distinct intra coding schemes are designed for the two types of content according to their distortion patterns.

Besides the intra modes, skip and inter modes are also introduced to preserve the improved coding performance. A skip block that is not selected for enhancement is processed in skip mode. A region with global motion is set to inter mode if the current content is the same as a certain region in the previous frame, and the motion vector is transmitted to the decoder. An example is shown in Fig. 6. We can see that most blocks are set to skip/inter mode, since their content is the same as a certain enhanced region in the previous frame. Among the blocks that need enhancement, most text/graphics blocks are detected for pixel domain coding, and the others are identified as image blocks with edges and complicated texture for transform domain coding.
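The text/graphics versus image classification can be sketched as follows. The greedy peak picking, the window half-width of 8, and the escape threshold of 4 are assumptions made for illustration; the paper specifies only that histogram peaks become base colors, that equal-size windows surround them, and that the escaped pixel count is compared with a threshold. The default of four base colors follows the experimental setting reported in Section 5.

```python
from collections import Counter


def pick_base_colors(pixels, num_colors=4, half_window=8):
    """Greedily pick histogram peaks whose windows do not overlap."""
    hist = Counter(pixels)
    bases = []
    # Visit values from most to least frequent (ties broken by value).
    for value, _count in sorted(hist.items(), key=lambda kv: (-kv[1], kv[0])):
        if all(abs(value - b) > 2 * half_window for b in bases):
            bases.append(value)
        if len(bases) == num_colors:
            break
    return bases


def classify_block(pixels, num_colors=4, half_window=8, max_escapes=4):
    """Return (label, escape_count) for a flat list of luma values."""
    bases = pick_base_colors(pixels, num_colors, half_window)
    escapes = 0
    for p in pixels:
        nearest = min(bases, key=lambda b: abs(p - b))
        if abs(p - nearest) > half_window:
            escapes += 1  # pixel falls outside every window
    label = "text/graphics" if escapes <= max_escapes else "image"
    return label, escapes


# 64 luma samples of a text-like block: background, glyph, a few transitions.
text_pixels = [255] * 40 + [0] * 20 + [250] * 2 + [128] * 2
# 64 luma samples of a smooth ramp, standing in for a natural image block.
ramp_pixels = list(range(0, 256, 4))

print(classify_block(text_pixels))  # prints ('text/graphics', 0)
print(classify_block(ramp_pixels))  # prints ('image', 46)
```

The sparse histogram of the text-like block is fully covered by a few windows, whereas the ramp spreads its values across the whole range and leaves most pixels as escapes.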

Fig. 6. An example of mode decision. Blue: intra mode coding in pixel domain; green: intra mode coding in transform domain.

Fig. 7. The first frames of four sequences. Top left: metro; top right: snowwhite; bottom left: youtube; bottom right: yahoo.

4.2.2. Enhancement coding in pixel domain

In base layer coding, the quality of high contrast text/graphics blocks is degraded by chroma downsampling as well as by lossy transform coding in the luminance channel. Both the luminance and chroma components therefore need to be enhanced for better visual quality. Rather than encoding the difference between the original block and the reconstructed one, the original block is encoded directly in the pixel domain in all YUV channels. After quantization, a text/graphics block can be represented by base colors with an index map and several escaped pixels: the index values range from 0 to M, where the values 0 to M-1 identify base colors and M marks an escaped pixel. The base colors of each block are first sorted, and each base color is then predicted from the corresponding one in the sorted queue of the block at the same position in the previous frame. The residual is encoded by the entropy encoder. For index map coding, each index in the map is predicted from the left and up directions. Huffman coding is employed to create variable-length codes for the prediction pattern and the index difference in the pixel domain. The escaped pixels are encoded by the entropy encoder directly.

4.2.3. Enhancement coding in transform domain

Since the visual quality of high gradient image blocks suffers mainly from chroma downsampling, only the chroma channels need to be encoded in the enhancement layer. Compared with text/graphics content, the edges in the image parts are more regular, and a transform-based coding scheme can exploit the spatial correlation efficiently. Therefore, the JPEG codec is adopted directly to compress the chroma blocks.

5. EXPERIMENTAL RESULTS

To evaluate the effectiveness of the proposed layered coding scheme, we apply it to the compression of four screen video sequences captured at 30 fps with a resolution of 1280x768, shown in Fig. 7, which include desktop, webpage, and conventional video content. X.264 [9] is adopted as the video codec for base layer coding, and the default mode is set with a GOP structure of IBBBP and a GOP size of 250. In enhancement layer coding, the skip block number m is set to 2 and the number of base colors in pixel domain coding is set to 4. The X.264 video codec is selected as the reference scheme. Moreover, to further investigate our coding scheme, we introduce two revised enhancement coding schemes. In “Enhance Transform”, intra mode blocks are enhanced only by the transform-based scheme. In “Enhance NoSkip”, all high gradient blocks in each frame are enhanced by the two intra modes.

The rate distortion performance comparison results are shown in Fig. 8. The bitrate is measured in KB/frm, that is, kilobytes per frame. We can see that our coding scheme outperforms X.264 for all types of sequences. At high bitrates, thanks to the pixel domain coding for text/graphics blocks and the compensation coding for the chroma channels, a coding gain of 2.95 dB is achieved on average, and up to 4 dB is achieved for the metro sequence, in which the coding performance of X.264 is limited by the chroma information loss caused by chroma downsampling.

Comparing our coding scheme “Enhance” with “Enhance Transform”, we can see that ours outperforms the latter for the metro and yahoo sequences, because these sequences contain considerable text/graphics blocks. Compared with transform domain enhancement coding of these blocks, pixel domain coding in the YUV channels not only suppresses the artifacts caused by chroma downsampling but also improves the video quality in the luminance channel. For the snowwhite and youtube sequences, the results of the two schemes are similar, since the ratio of text/graphic blocks is small and the enhancement coding is performed mainly as chroma channel coding for the high gradient blocks.

Comparing our scheme with “Enhance NoSkip”, we can observe that our scheme outperforms it except for the yahoo sequence, since we spend as little bitrate as possible to obtain the improvement in PSNR. Note that “Enhance NoSkip” obtains a higher coding gain than ours at high bitrates in the yahoo sequence. That is because the webpage is scrolled frequently in this sequence.
In our scheme, these frames are not enhanced until the content is stable, whereas a large coding gain can be achieved by enhancement coding frame by frame. Although the objective quality can be improved in this way, the visual quality improvement is hardly noticeable because of the fast content changes.

The visual quality comparison at the same bitrate is shown in Fig. 9. Artifacts and chromatic aberration are visible around the edges of text/graphic content in the frame reconstructed by X.264, while the frame decoded by our scheme is almost the same as the original.

The complexity comparison results are shown in Table 1, measured on a PC with an Intel Xeon 2.27 GHz processor and 8 GB of memory. It is observed that our coding scheme saves 12% of the encoding time compared with X.264 on average, while the decoding time is also comparable with that of X.264. For the metro and yahoo sequences, since the screen frames in a stable period are skipped in video encoding, the encoding complexity is reduced correspondingly.

Fig. 8. Rate-Distortion performance comparison of schemes

Table 1. Complexity comparison of schemes (ms/frame)

Sequence     X.264 Enc.   X.264 Dec.   Ours Enc.   Ours Dec.   Ratio* Enc.   Ratio* Dec.
Metro        31.35        11.66        17.55       13.43       0.56          1.15
Snowwhite    68.01        16.23        68.29       19.09       1.01          1.18
Youtube      57.52        14.82        61.21       18.31       1.06          1.24
Yahoo        33.42        12.04        20.04       13.68       0.60          1.14
Average      47.57        13.68        41.77       16.13       0.88          1.18

Ratio* = Ours/Ref.

Fig. 9. Visual quality comparison
(a) yahoo, bitrate = 2.77 KB/frame: original; reconstruction by X.264 (PSNR = 34.6 dB); reconstruction by ours (PSNR = 35.3 dB)
(b) metro, bitrate = 2.80 KB/frame: original; reconstruction by X.264 (PSNR = 35.9 dB); reconstruction by ours (PSNR = 36.0 dB)

6. CONCLUSION

In this paper, we presented a layered compression scheme for screen video based on video codec hardware. We utilize the existing standard video codec to perform the base layer coding. Based on content analysis, enhancement layer coding with two coding modes is designed to improve the video quality of text/graphic content and to suppress the artifacts of chroma downsampling. Simulation results demonstrate that the proposed coding scheme improves the video quality both objectively and subjectively with low bitrate and complexity overhead.

7. REFERENCES

[1] R. W. Scheifler and J. Gettys, “The X Window System,” ACM Trans. on Graphics, 5(2), Apr. 1986.
[2] VNC, http://www.realvnc.com/docs/rfbproto.pdf
[3] Onlive, http://desktop.onlive.com
[4] W. Ding, Y. Lu and F. Wu, “Enable Efficient Compound Image Compression in H.264/AVC Intra Coding,” Proc. of ICIP 2007, pp. 337-340.
[5] A. Zaghetto and R. L. de Queiroz, “Segmentation-driven compound document coding based on H.264/AVC-INTRA,” IEEE TCSVT, vol. 16, pp. 1755-1760, 2007.
[6] H. Shen, Y. Lu, F. Wu and S. Li, “A High-performance remote computing platform,” IEEE PerWare 2009 in conjunction with IEEE PerCom 2009.
[7] S. Wang, J. Fu, Y. Lu, S. Li and W. Gao, “Content-Aware Layered Compound Video Compression,” Proc. IEEE ISCAS 2012, pp. 145-148.
[8] H. Schwarz, D. Marpe, and T. Wiegand, “Overview of the scalable video coding extension of the H.264/AVC standard,” IEEE TCSVT, vol. 17, no. 9, pp. 1103-1120, 2007.
[9] X.264, http://www.videolan.org/developers/x264.html