IEEE Transactions on Consumer Electronics

Cover page of submission for the

IEEE Transactions on Consumer Electronics Paper title: EFFICIENT VIDEO CODING INTEGRATING MPEG-2 AND PICTURE-RATE CONVERSION Authors and affiliation: F.J. de Bruijn1 , W.H.A. Bruls2, D. Burazerović2, and G. de Haan1 , 1 Video Processing and Visual Perception Group, 2 Storage Systems and Applications Group, Philips Research Laboratories, Eindhoven, The Netherlands.. ICCE session: THPM-18.1 COMPRESSION & MOTION ESTIMATION Number of pages: Six (6) pages Correspondence address: F.J. de Bruijn, Philips Research Laboratories, Video Processing and Visual Perception Group, Prof. Holstlaan 4, (WO-02) 5656 AA Eindhoven, The Netherlands. Tel.:+31-40-2742356 Fax.:+31-40-2742630 E-mail: [email protected]

EFFICIENT VIDEO CODING INTEGRATING MPEG-2 AND PICTURE-RATE CONVERSION F.J. de Bruijn∗, W.H.A. Bruls, D. Burazerović, and G. de Haan Philips Research Laboratories, Eindhoven, The Netherlands ABSTRACT

original B-frame data is preserved in the MPEG-stream. The addition of a proprietary extension to a standard MPEG-2 chain affects compatibility, so the combination is particularly useful for consumer applications that require local video storage, e.g. on harddisk.

We present an MPEG-2 compliant video codec using picture-rate up-conversion during decoding. The upconversion autonomously regenerates major parts of frames without vectorial and residual data. Consequently, the bitrate is greatly reduced.

2

NATURAL MOTION VS. MPEG

The limited efficiency gain of predictive coding over intra coding has two main causes: the motion-estimation process and the criterion to evaluate each locally predicted picturepart. In most MPEG-2 implementations, motion compensation is based on a computationally efficient derivation of fullsearch block matching (FSBM). The motion vectors resulting from FSBM minimise the blockwise difference between the prediction and the original. The blockwise difference is often calculated as the mean squared error (MSE) or as the mean absolute difference (MAD). In either case, the difference criterion does minimise the local residual data amount, but does not result in a true-motion estimate. Consequently, MPEG motion vectors may not necessarily describe the true object motion and tend to be temporally and spatially inconsistent. In practice, transmission of residual (difference) data is vital for artifact-free reconstruction. In contrast with MPEG, the temporally interpolated frames produced by NM have to result in high-quality motioncompensated predictions without addition residual information. To this end, the interpolated frames are based on estimates of the true motion, which are generated using a threedimensional recursive search (3DRS) block matcher instead of a full-search block matcher [1, 2]. The motion-vectors estimated using 3DRS exhibit a high degree of spatial and temporal consistency. The concept of motion-vector con-

1 INTRODUCTION The MPEG-2 video compression standard allows forward predictive and bidirectional predictive encoding, resulting in the generation of so called P- and B-frames, respectively. Motion-compensated predictive coding exploits the temporal correlation between consecutive frames. In practice, however, the average bit-rate of predictive frames is often not more than a factor of 4 lower than the bit rate of an intra (I) frame in the same group of pictures (GOP). Intuitively, we consider this factor of 4 to be somewhat disappointing given the visual similarity between consecutive frames and the quality offered by another motioncompensated prediction technique which is used in picturerate conversion and known as “Natural Motion” (NM) [1]. The NM-algorithm, developed by Philips for its highend 100Hz televisions, removes motion judder from filmoriginated video material. To this end, it generates additional intermediate pictures between the ones registered on the film instead of simply repeating earlier ones. This interpolation process shows a clear similarity with the generation of B-frames in MPEG, however, NM does not require the transmission of vectorial and residual data. The autonomous operation of the NM-process makes it an interesting addition to a video-compression system. We designed an application in which a NM-based algorithm is integrated with an MPEG-2 scheme. The NM-process is set up to generate alternative ‘B-frames’ based on an input of MPEG I- and P-frames, both during encoding and decoding. In the encoder, each NM output frame is compared with an original B-frame. A criterion, specifically designed for this task decides whether it is necessary to locally fall back on the original B-frame content in order to prevent visible errors. In this case, the vectorial and residual data of the

Figure 1: Frame of the renata sequence (left), inconsistent (middle) and consistent (right) motion vectors depicted by their horizontal vector component.

∗ E-mail: [email protected]

1

Figure 2: Normal decoded MPEG B-frame (left), and Natural Motion output on the same time instance (right) with a typical artifact that occurs after interpolation when 3DRS is used to estimate motion vectors over large temporal distances.

sistency is illustrated in Figure 1. The 3DRS-algorithm can be modified to minimise the blockwise difference between prediction and original. At the cost of loosing the motionvector consistency, the application of 3DRS has shown to enable a computationally more efficient MPEG-2 encoder implementation with better compression results, compared to (efficient) FSBM-variants [3, 4, 5].

3

true motion are not perceived in consistently moving areas. To illustrate this phenomenon, the distribution of MADvalues is depicted in Figure 3. It is clear that the MAD poorly correlates with the visibility of the artifacts shown in Figure 2. As an alternative, we propose to include the consistency of the motion vectors in our judgement of the MAD-values. To quantify the concept of vector inconsistency (VI) we determine the maximum absolute component-wise difference of a vector with surrounding motion vectors, E E +ξ, y +γ ) , y) − D(x (1) VI(x, y) = max D(x,

COMBINED SOLUTION

3.1 Detection of interpolation artifacts

(ξ,γ )∈

An possible solution to improve on standard MPEG is to skip frames during encoding, followed by an up-conversion using NM after decoding. In particular, by skipping only B-frames which are not re-used in the prediction process, error accumulation is avoided. Unfortunately, when NM is applied to interpolate frames over large temporal distances, which is the case when several B-frames are predicted from decoded I- or P-frames, visible errors may occur since 3DRS fails to track small fast moving objects correctly. As an example, Figure 2 shows a typical artifact that occurs during ball games. Here, the NM-algorithm was used to interpolate between two consecutive frames arriving at a rate of 12.5 frames/s. The incorrect motion estimate near small objects causes the objects to disappear periodically. To reliably regenerate ‘B-frames’ with NM during decoding, the occurrence of visible errors must be detected in the encoder. In practice, however, a simple pixel-wise comparison with MPEG B-frames causes an abundance of high MAD-values in almost any detailed area, even in the absence of visual errors. However, the application of 3DRS to frame rate conversion shows that small deviations from the

E is a 2-dimensional motion vector that describes where D the displacement between two consecutive frames, and

Figure 3: Map of MAD-values indicating the difference between the NMframe and the original MPEG B-frame in Figure 2.

2

commonly available, usually as part of a noise reduction system [6]. In practice, however, it was found that such algorithms failed to establish a reliable value of TMADlow . This is due to the fact that the ‘noise’ that remains in decoded MPEG-frames does not match the presumed Gaussian distribution. The systematically too low estimates of TMADlow , generated by these algorithms, even indicate almost a total absence of Gaussian noise after decompression. An empirically determined threshold level at TMADlow = 3 for an 8 bit luminance range, has shown to result in a reliable segmentation in inconsistent areas, for a large variety of video material. The threshold value in consistently moving areas, TMADhigh , must be such that seriously large match errors are not neglected. A potentially visible error is indicated by areas where the local MAD-error exceeds the local threshold,

is the spatial area of evaluation. Figure 4 shows a map of inconsistency values. In areas where the inconsistency is low, we choose to disregard large local MAD-errors, and in inconsistent areas we choose to consider even small local MAD-errors as visible. Each MAD-value is, therefore, thresholded with a value TMAD (x, y) which varies as a function of the local VI-value, ( TMADhigh , VI(x, y) < TVI TMAD (x, y) = (2) TMADlow , otherwise, with TVI [ pixels/s] some threshold. The threshold value was empirically chosen and relates to velocity of about TVI ≈ 80 pixels/s at a D1 resolution.

( Sfallback (x, y) =

1, MAD(x, y) > TMAD (x, y) 0, otherwise.

(3)

In case Sfallback (x, y) = 1, i.e., a visible error is detected, decision is taken to locally fall back on the original MPEG B-frame content.

Figure 4: Map of VI values corresponding to the motion-vector field used to estimate the image in Figure 2.

Figure 5: B-frame with the areas that must be preserved. Figure 6: The relevant areas of the decoded MPEG B-frame (top left), the complementary area of the frame generated by NM in the decoder (top right), and the mix of MPEG and NM output (bottom).

The value of TMADlow is such that the noise is ignored. Methods to determine the noise level in moving video are 3

W MPEG encoder

B-frame comparator DCT

Q

residue

VLC

video in

bitstream out

loc. dec. B-frames SAD

IDCT IQ MC pred

error criterion

I or P mem VI

vectors

Natural Motion NM B-frames

Sfallback

NM B-frames

loc. dec. B-frames loc. dec. I/P-frames

fall-back decision

B-frame comparator

fall-back decision

NM vectors

Sfallback

NM vectors

Figure 7: MPEG-encoder scheme combined with Natural Motion. The locally decoded stream of I- and P-frames is fed to the NM-algorithm which produces alternative ‘B-frames’. These NM ‘B-frames’ are compared with the locally decoded B-frames using the error criterion described below. The comparator controls the decision either to trust the NM-output or to send the MB with the local original B-frame data. The redundant MBs of the B-frame are skipped and the remaining MBs are reduced in bit rate by attenuating AC-coefficients with matrix W .

3.2 Integration with MPEG

is achieved by suppressing the DCT-coefficients in these MBs, i.e., by multiplying each coefficient with an attenuation factor. The resulting smaller coefficient values will map to shorter VLC-codewords. Clearly, the DC-coefficient can not be involved in the attenuation process. Furthermore, by suppressing only the AC-coefficients, more or less in proportion to their order, mostly the high-frequent content in a block is affected. This attenuation method has shown to successfully reduce the bit-rate during MPEG-2 transcoding [7]. The loss of sharpness and detail is controlled such that the MPEG-MBs blend in seamlessly with the areas generated by NM, which are generally somewhat softer that the original MPEG-content. The weighting coefficients are collected in the weighting matrix W = αW 0 , where α is a control-factor between 0 and 1, and W 0 is given by

There are several options to correct potentially visible errors at the decoder. An option could be to send correction data through a private data channel. However, MPEG-2 already offers an efficient way to describe the content and spatial location of the areas that require correction. So, instead of creating a separate stream of correction data we have chosen to preserve the corresponding area in the original B-frame, in case of a fall-back decision. By taking the decision on a macro-block (MB) basis we can exploit the MPEG-2 syntax, which offers an efficient way to skip several MBs within a so called slice. In case of a fall-back decision, Sfallback (x, y) = 1, the corresponding MB in the original B-frame is preserved. Otherwise, it is skipped prior to variable-length encoding, thus creating compact but reversibly decodable description of the spatial areas where NM failed. A normal MPEG-2 decoder deals with the skipped MBs as if they were generated under the regular ‘skip macro-block’ conditions. This means that the motion-vector data of the previously decoded MB is repeated, and the residual data is zero. Consequently, the skipped areas will look somewhat distorted. By checking for skipped macro-blocks during decoding, the MPEG-2 decoder with NM will recognise the skipped areas and overwrites the MPEG-output in these areas with the locally generated NM-output. Unfortunately, the MPEG-2 syntax imposes some restrictions to the use of skipped MBs, e.g., the first and the last MB in a slice must always be encoded. These mandatory MBs add up to the MBs that were preserved by our selection process, which may affect the coding efficiency of the system. However, even in case of a large number of preserved MBs, the bit-rate can still be significantly reduced. This

1.00 W0 =

0.91 0.91 0.83  0.83 0.71  0.67 0.48

0.91 0.91 0.91 0.83 0.71 0.67 0.50 0.48

0.91 0.91 0.83 0.71 0.67 0.50 0.45 0.31

0.91 0.77 0.71 0.67 0.53 0.45 0.32 0.30

0.77 0.77 0.63 0.53 0.43 0.33 0.29 0.22

0.77 0.63 0.56 0.42 0.34 0.29 0.23 0.21

0.59 0.56 0.40 0.36 0.28 0.24 0.20 0.18



0.59 0.38 0.37 0.26 . 0.25  0.19 0.19 0.17

(4) The pattern of decreasing matrix coefficients W0i j is piecewise constant along the diagonals to ensure an efficient VLC-representation after zig-zag-scanning the coefficient values. Figure 7 depicts an MPEG-2 encoder combined with the NM-algorithm in which the error criterion is incorporated. Instead of designing a modified MPEG-encoder, the entire process of skipping and filtering can also be realized as a transcoding process after first generating a regular MPEG-2 stream. 4

The results of our experiment are collected in Table 1. The depicted PSNR results are based on a comparison of the uncompressed sequence with the compressed sequence containing the original B-frames. We measure a bit-rate reduction with a factor between 1.41 and 1.54, regardless the complexity of the scene. The visual quality was considered to be comparable to the original MPEG-output during an informal subjective assessment. For each test sequence, the contribution of the I-, P- and B-frames to the total bit-rate is depicted in Figure 9, which shows that the summed bit-rate of the original B-frames easily constitutes up to 50% of the total bit-rate. Since up to 90% of the B-frame is replaced by NM, a considerable overall bit-rate reduction is achieved. 3 I P B 2.5

bitrate [Mbit/s]

2

Figure 8: Evaluated test sequences which originate either from a standard test set or from broadcast material. From top left to bottom right: soccer busy, soccer calm, horse, rugby, flower, cacao.

1.5

1

4

RESULTS

0.5

The performance of the hybrid algorithm has been compared with original MPEG-2. In both cases we used the same MPEG-2 GOP-structure of N = 24 and M = 4. The compression was performed after de-interlacing and downscaling the sequences to a progressive SIF sequence at 50 or 60 frames/s, both for the original and the hybrid algorithm. The result was judged for visual errors on a regular TV-set after up-scaling to the original SD-format and subsequent re-interlacing. A variety of sequences of 250 to 500 frames was used for the comparative test, most of which contain slow or fast camera pans and zooms, as well as scene changes. The test sequences are depicted in Figure 8.

0 soccer busy

5

soccer busy soccer calm horse rugby flower cacao

PSNR [dB] 35.7 37.2 31.9 35.1 31.0 33.4

bit-rate [Mbits/s] original with NM 1.77 1.22 1.15 0.76 2.53 1.80 2.50 1.70 1.99 1.30 1.99 1.36

horse

rugby

flower

cacao

CONCLUSIONS

A new bi-directional predictive video coding method has been introduced as an improvement on the MPEG-2 method, which achieves a bit-rate reduction of a factor 1.4 up to 1.5 at a comparable visual quality level. A new error criterion has been proposed to enable the application of true-motion vectors in motion-compensated predictive coding. The compression method can be implemented, either by modifying a standard MPEG-2 encoder, or by transcoding an existing MPEG-2 stream. In either case, the syntax of the produced bitstream remains MPEG-2 compliant.

Table 1: Measured PSNR and bit-rate values after compression with original MPEG-2 and modified MPEG-2 algorithm with NM.

sequence

soccer calm

Figure 9: The contribution of all the I-, P- and B-frames to the total bitrate, before (left column) and after (right column) omitting obsolete macroblocks in the B-frames. The total bitrate is indicated by the total height of each column.

bit-rate ratio 1.44 1.52 1.41 1.47 1.54 1.46

REFERENCES [1] G. de Haan, “IC for motion compensated deinterlacing, noise reduction and picture rate conversion,” IEEE Transactions on Consumer Electronics, pp. 617–624, Aug. 1999.

5

[2] G. de Haan, P. W. A. C. Biezen, H. Huijgen, and O. A. Ojo, “True-motion estimation with 3-D recursive search block matching,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 3, pp. 368–379, Oct. 1993.

storage based on D1 recorders, and to the DV camcorder standard. Later he participated in the realisation of the DVD+RW recording standard, and worked on the development of video compression algorithms for MPEG-encoding and -transcoding, resulting in several commercially available single chip MPEG encoders and codecs. As a principle scientist he currently focuses on video compression issues in storage systems. His work has resulted in some 40 patents and patent applications, and various commercially available ICs. He was the first place winner in the 2000 ICCE Outstanding Paper Award program.

[3] W. H. A. Bruls, A. van der Werf, R. P. Kleihorst, T. Friedrich, E. W. Salomons, and F. J. Jorritsma, “A single-chip MPEG2 encoder for consumer video storage applications,” in ICCE Digest of Technical Papers, pp. 262–263, June 1997. [4] W. H. A. Bruls, E. W. Salomons, A. van der Werf, R. Klein Gunnewiek, and L. Camiciotti, “A low-cost audio/video single-chip MPEG2 encoder for consumer video storage applications,” in ICCE Digest of Technical Papers, pp. 314–315, June 2000.

D˘zevdet Burazerović received his M.Sc. degree from Eindhoven University of Technology in 2000. He has been affiliated to Philips Research Eindhoven since 1999, first as an intern (working on assignments in the area of speech coding and processing), and since 2000 as a Research Scientist in the Storage System & Applications group. Currently, he is working on topics related to video coding and processing. His interests are multimedia signal processing, systems and standardisation.

[5] F. S. Rovati, D. Pau, E. Piccinelli, L. Pezzoni, and J.-M. Bard, “An innovative, high-quality and search window independent motion estimation algorithm and architecture for MPEG-2 encoding,” IEEE Transactions on Consumer Electronics, vol. 46, pp. 697–705, Aug. 2000. [6] G. de Haan, M. M. Larragy, O. A. Ojo, T. G. KwaaitaalSpassova, and R. J. Schutten, “TV noise reduction IC,” IEEE Transactions on Consumer Electronics, vol. 44, pp. 143–154, Feb. 1998.

Gerard de Haan received B.Sc.,M.Sc., and Ph.D. degrees from Delft University of Technology in 1977, 1979 and 1992 respectively. He joined Philips Research in 1979. Currently, he is a Research Fellow in the Video Processing & Visual Perception group of Philips Research Eindhoven, and a part-time full Professor at the Eindhoven University of Technology teaching “Video Processing for Multimedia Systems”. He has a particular interest in algorithms for motion estimation, scan rate conversion, and image enhancement. His work in these areas has resulted in several books, about 80 papers, some 50 patents and patent applications, and various commercially available ICs. He was the first place winner in the 1995 ICCE Outstanding Paper Awards program, the second place winner in 1997 and in 1998. In 2002, he received the Chester Sall Award for the second best paper in the IEEE Transactions on Consumer Electronics of the year 2001. The Philips ‘Natural Motion Television’ concept, based on his PhD-study received the European Video Innovation Award of the Year 95 from the European Imaging and Sound Association. In 2001, the successor of this concept ”Digital Natural Motion Television” received a Business Innovation Award from the Wall Street Journal Europe. Gerard de Haan is a Senior Member of the IEEE.

[7] R. K. Gunnewiek, W. H. A. Bruls, B. Hunnenman, and M. Holl, “A low-complexity MPEG-2 bit-rate transcoding algorithm,” in ICCE Digest of Technical Papers, pp. 316–317, June 2000.

BIOGRAPHY Frits de Bruijn received his B.Sc. degree at the Hogeschool Utrecht in Hilversum in 1989. In 1993 he graduated cum laude at the University of Twente in Enschede after having worked on subtractive X-ray imaging at the Dept. of Medical Physics of the University of Wisconsin in Madison in partial fulfilment of his M.Sc. degree. Currently he is finishing his Ph.D. research on compression of noisy medical images. In 1998 he joined the Video Processing & Visual Perception group of Philips Research Laboratories in Eindhoven where he currently is working as a Senior Scientist in the fields of video compression and scan-rate conversion. His interests are medical imaging, image enhancement, video compression, video-format conversion. Fons Bruls received his B.Sc. degree in electrical engineering at the Hogere Technische School Heerlen, Heerlen, The Netherlands, in 1978. In October 1978 he joined the Philips Research Laboratories in Eindhoven, The Netherlands. Until 1987 he was active in the field of integrated analog signal processing in bipolar processes. In 1987 he joined the Storage System & Applications group, working in the field of integrated digital video compression. He contributed to the realisation of HDTV video

6