Overview of Still-picture and Video Compression Standards

Roalt Aalmoes, Peter Bosch

Pegasus paper 95–3 December 21, 1995

University of Cambridge Computer Laboratory

Pegasus, ESPRIT BRA 6865, is a project of the University of Cambridge and the University of Twente. The project is aimed at the design of an operating systems architecture for scalable distributed multimedia systems and the development of a validating prototype, design and implementation of a distributed complex-object service and a global name service, mechanisms for the creation, communication, and rendering of fully digital multimedia documents in real time and in a distributed fashion (with support for full-screen digital motion video and digital hi-fi stereophonic sound), and the design and implementation of an application for the system — a digital TV director.

The Pegasus Papers can be obtained from the Pegasus Secretariat, Faculty of Computer Science, Vakgroep SPA, P.O. Box 217, 7500 AE Enschede, Netherlands. They can also be obtained through the World-Wide Web: http://www.pegasus.esprit.ec.org/default.html

Abstract

This paper discusses current video compression techniques and presents an overview of existing and emerging video compression standards. The standards discussed here include standards defined by dedicated committees as well as de facto standards.

1 Introduction

In the past years a number of compression standards have emerged, and more are now being developed. Although it would be useful to have a single general video compression standard, a growing number of standards is being developed because of enhanced processing power, dedicated hardware, new compression techniques, and networks with different bandwidths. Each compression standard supports a specific video application, so it is difficult to choose the correct standard for a given application. Just as there is no single best compression algorithm in general, there is no best video compression standard. Some applications (e.g. video-conferencing) require fast real-time encoding at the cost of the compression factor, while other applications (e.g. compressing a video stream for CD-ROM) want maximum compression and do not need real-time encoding, as long as decoding is real-time.

This paper describes the different compression techniques used in the available video compression standards. This simplifies the choice of the most suitable standard for a certain application. Furthermore, it allows an estimate to be made of the computational cost and the size of the video stream for a given standard.

A division is made between still-picture and video compression techniques. In Section 2, some commonly used terms and compression techniques for still-picture compression are explained. These techniques include ways to remove redundant information and information that is not visible to the human eye. In Section 3, the most important still-picture compression standards are discussed. Video compression techniques rely strongly on the techniques used in still-picture compression, but also incorporate prediction and motion-compensation algorithms. These algorithms, which remove redundant information between the different frames of a video stream, are discussed in Section 4. In Section 5, some existing and emerging video compression standards are discussed.

2 Still-picture compression techniques

2.1 Lossy and Lossless

Compression algorithms can be categorized in two groups: lossless and lossy compression. Lossless algorithms reproduce exactly the same bit pattern after decompression as the object had before it was compressed. These compression algorithms are used for text and computer binary files. Lossy compression algorithms, however, may lose some information during compression and decompression. In a good lossy compression algorithm for pictures, the lost information is not visible. Most lossy compression algorithms offer a quality setting that determines how much quality (information) may be sacrificed for a higher compression factor.

Lossy compression algorithms are useful for compression of sampled data, that is, analog data from a microphone or a camera that has been converted to a digital approximation. Because such data is already an approximation, lossy compression algorithms that change the data slightly are not catastrophic. Each cycle of lossy compression followed by decompression, however, causes additional quality loss, which is best avoided by reducing the number of compression-decompression operations on a picture. If a picture must be manipulated (in the image space), it is best stored as raw data between the image operations.

Video compression methods are generally lossy. Video streams need to be compressed aggressively to reduce the required bandwidth and storage capacity for the video stream. Since lossy algorithms reduce the stream the most, such algorithms are used for video stream compression. It does not really matter much if parts of a video stream are lost: since the video stream is continuous, the next frame may repair the lost frame part.

2.2 Color spaces

To define a video image, not only the resolution of the image must be specified but also the way the color information is stored. A gray-scale picture has only one color component: luminance [Poynton95]. For an 8-bit gray-scale image, the higher the value, the lighter the shade of gray. The value 0 represents black and the value 255 represents white.

To represent color pictures, three components are required. The most popular color space is RGB. In this color space, the R (red) component represents the amount of red in the pixel, the G (green) component the amount of green, and the B (blue) component the amount of blue. True-color pictures use 8 bits for each component and thus 24 bits per pixel.

Another well-known color space is YCbCr. The Y component represents the luminance, while the two chrominance components Cb and Cr determine the actual color. To convert from RGB to YCbCr, the three color component intensities of RGB determine the luminance Y. The Y value is a weighted sum of the three color intensities: for the same value, the green component is brighter than the red component, and the red component is brighter than the blue component. The Cb component is the blue component without the total luminance (B - Y). The Cr component is the red component without the total luminance (R - Y).

The Hue-Saturation-Brightness (HSB), Hue-Lightness-Saturation (HLS), or Hue-Saturation-Value (HSV) color space is based on specifying the colors numerically. The Hue component describes the pure color, the Saturation component describes the degree to which the pure color is diluted by white light, and the Brightness describes the brightness or luminance. The problem with this color space is that no reference is made to the linearity or non-linearity of the colors. To determine the lightness from RGB, the three color component values are simply averaged ((R + G + B)/3), although the visual luminance of green is much higher than the visual luminance of blue.

For (lossy) image compression, it is advised to convert the RGB color space to YCbCr [Wallace91]. The human eye is more sensitive to luminance than to chrominance, and by separating them, the luminance component can be encoded at a higher resolution than the chrominance components. In other words, fewer bits need to be encoded for the chrominance components.

The relation between the resolution of the luminance component and that of the chrominance components determines the picture format. A luminance component accompanied by two chrominance components that are down-sampled by two in both the horizontal and the vertical dimension is called the 4:2:0-format (see Figure 1) [Kleine95] [Filippini95]. If the chrominance components are only down-sampled by 2 in the horizontal direction, the format is called the 4:2:2-format (see Figure 2). Finally, the 4:1:1-format has its chrominance components horizontally down-sampled by 4 and has no down-sampling in the vertical dimension.

Figure 1. The 4:2:0-format (Y: 8 x 8; Cb and Cr: 4 x 4 each)

Figure 2. The 4:2:2-format (Y: 8 x 8; Cb and Cr: 4 x 8 each)
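The color conversion and chrominance down-sampling described in this section can be sketched as follows. This is a minimal illustration, not the method of any particular standard: the luminance weights (0.299, 0.587, 0.114) and the chrominance scale factors are the commonly used ITU-R BT.601 values, which the text above does not specify, and the 2 x 2 averaging filter and the function names are assumptions.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one RGB pixel (components 0..255) to (Y, Cb, Cr) floats."""
    # Weighted sum: green contributes most, blue least (assumed BT.601 weights).
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.564 * (b - y)  # blue minus luminance, scaled (assumed factor)
    cr = 0.713 * (r - y)  # red minus luminance, scaled (assumed factor)
    return y, cb, cr


def subsample_420(plane):
    """Halve a chrominance plane in both dimensions by averaging 2 x 2 samples.

    The plane is a list of rows; width and height are assumed to be even.
    """
    height, width = len(plane), len(plane[0])
    return [
        [
            (plane[i][j] + plane[i][j + 1]
             + plane[i + 1][j] + plane[i + 1][j + 1]) / 4.0
            for j in range(0, width, 2)
        ]
        for i in range(0, height, 2)
    ]
```

Applied to an 8 x 8 chrominance plane, `subsample_420` yields the 4 x 4 plane of Figure 1, while the luminance plane keeps its full resolution.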

2.3 DCT transformation

A transformation that is useful in image compression is the Discrete Cosine Transform (DCT) [Wallace91]. This transformation converts an n × n block of elements into another block of n × n coefficients. Each of these n × n coefficients represents a unique two-dimensional spatial frequency. The DCT is reversible by using the Inverse Discrete Cosine Transform (IDCT). The first coefficient, which has zero horizontal and vertical frequency, is called the DC-coefficient and is proportional to (with suitable scaling, equal to) the average value of the original elements. The other coefficients are called AC-coefficients and represent the non-zero two-dimensional spatial frequencies. The DCT and IDCT are lossless if the DCT-encoded data are stored with perfect accuracy. In practice, however, the coefficients are stored as integers, which can introduce small differences from the original data after IDCT decoding.

If the DCT is applied to blocks of pixels, the higher spatial frequency coefficients become (near) zero because most neighboring pixels differ little in value. If relatively more bits are used to encode the lower frequency coefficients than the higher frequency coefficients, a (lossy) compression method is created.
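To make the transform concrete, the following sketch computes a two-dimensional DCT and its inverse directly from the defining sums. It assumes the orthonormal DCT-II scaling, which is one common normalization and not necessarily the one used by a given standard; with this scaling the DC-coefficient of an n × n block equals n times the average value. The function names are illustrative.

```python
import math


def _scale(k, n):
    # Orthonormal DCT-II scale factor (assumed normalization).
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def _basis(i, k, n):
    # Cosine basis function of frequency k sampled at position i.
    return math.cos((2 * i + 1) * k * math.pi / (2 * n))


def dct_2d(block):
    """Forward 2-D DCT of a square block given as a list of rows."""
    n = len(block)
    return [
        [
            _scale(u, n) * _scale(v, n) * sum(
                block[x][y] * _basis(x, u, n) * _basis(y, v, n)
                for x in range(n) for y in range(n)
            )
            for v in range(n)
        ]
        for u in range(n)
    ]


def idct_2d(coeffs):
    """Inverse 2-D DCT, reversing dct_2d up to floating-point error."""
    n = len(coeffs)
    return [
        [
            sum(
                _scale(u, n) * _scale(v, n) * coeffs[u][v]
                * _basis(x, u, n) * _basis(y, v, n)
                for u in range(n) for v in range(n)
            )
            for y in range(n)
        ]
        for x in range(n)
    ]
```

For a completely flat block, all AC-coefficients come out as (near) zero and only the DC-coefficient carries information, which is the property that the compression methods in Section 3 exploit. This direct form needs on the order of n^4 operations per block; practical codecs use fast factorizations instead.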

2.4 Scalar Quantization

When people refer to "quantization", they usually mean scalar quantization and not Vector Quantization (VQ) (see Section 2.5). Scalar quantization is used to reduce the number of bits that are needed to store an integer. This is done by dividing the integer by a quantization factor and rounding the result to an integer before it is stored (in the example of Figure 3, the result is rounded down). To retrieve the integer, the stored (quantized) integer is multiplied by the quantization factor. This step is not lossless, as the density of the domain of the integer is reduced by the quantization factor.

Original value   Quantized value (factor 2)   Reproduced value (factor 2)
0                0                            0
1                0                            0
2                1                            2
3                1                            2
4                2                            4
5                2                            4
6                3                            6
7                3                            6

Figure 3. Quantization and de-quantization

An example of quantization is given in Figure 3. A 3-bit domain is quantized by factor 2, which reduces it by one bit to a 2-bit domain. After de-quantization, only the values 0, 2, 4 and 6 have the same value as before the quantization, but the other values are approximated by the nearest value to the original integer.
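The example of Figure 3 can be reproduced with a few lines of code. This is a minimal sketch that assumes rounding down (integer division), which is what the values in Figure 3 show; the function names are illustrative.

```python
def quantize(value, factor):
    """Quantize a non-negative integer by dividing and rounding down."""
    return value // factor


def dequantize(quantized, factor):
    """Reproduce an approximation of the original integer."""
    return quantized * factor


# Reproduce the three columns of Figure 3 for a 3-bit domain and factor 2.
table = [(v, quantize(v, 2), dequantize(quantize(v, 2), 2)) for v in range(8)]
```

Each entry of `table` holds the original, the quantized and the reproduced value; only the even originals are reproduced exactly.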

2.5 Vector Quantization

VQ techniques make use of codebooks in combination with a matrix of vectors to represent an image [Gray92]. Instead of referring to elements directly, elements are referenced via the codebook. To transmit an image, only the references to the codebook (the vectors) have to be sent. Many different VQ methods exist. For example, each vector may point to an RGB triplet that represents one pixel, or each vector may point to an n × n image block. The way the contents of the codebook are determined also varies. A common way to generate a codebook is to use a training set to find the "best" codes, i.e. the codes that occur most frequently. A codebook can also be calculated from the data that are being quantized. In this case, the codebook itself is transmitted together with the vectors.

Known video compression implementations based on VQ are Cinepak from Radius and Indeo 3.23 from Intel, both free for developers [Bryan95].
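As an illustration of the codebook idea, the sketch below encodes pixels as indices into a given codebook of RGB triplets and decodes them again. It assumes that the codebook is already available and that the closest entry is chosen by squared Euclidean distance; both the distance measure and the function names are assumptions, and the sketch does not reflect how Cinepak or Indeo work internally.

```python
def nearest_index(vector, codebook):
    """Return the index of the codebook entry closest to the vector."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vector, codebook[i])),
    )


def vq_encode(pixels, codebook):
    """Replace every pixel by a reference into the codebook."""
    return [nearest_index(pixel, codebook) for pixel in pixels]


def vq_decode(indices, codebook):
    """Look every reference up in the codebook again."""
    return [codebook[i] for i in indices]
```

With a codebook of 256 entries, each 24-bit pixel is replaced by an 8-bit reference; the decoded pixels are the codebook entries, not the original values, which is where the loss occurs.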

2.6 Entropy Encoding

An entropy encoding algorithm is a lossless compression method that encodes data based on their statistical characteristics. The term "entropy encoding" is used in the JPEG compression method (see Section 3.1), but it can apply to all compression algorithms that increase the "energy" or the information density of a message.

2.6.1 Huffman and Arithmetic compression

One of the first general-purpose lossless compression algorithms is Huffman-coding [Nelson91]. This method assigns shorter bit-patterns to characters that occur more frequently in the message and longer bit-patterns to characters that occur less often. The table that records the frequency of occurrence of each character is called the Huffman-table. This table is determined before encoding by analyzing the statistics of the data to be encoded. If the decoder is always used on data with the same statistical characteristics, the Huffman-table can be incorporated in the decoder. If the decoder is used to decode data with different statistical characteristics, the Huffman-table itself is sent prior to the encoded data.

Another well-known method for entropy encoding is adaptive Huffman-coding. This method outputs a bit-pattern for each character of the message, based on how often this character occurred among the previously encoded characters; a character that occurred more frequently in the past has a shorter bit-pattern. The Huffman-table of adaptive Huffman-coding is built up on the fly at the encoder, and it is rebuilt in the same way at the decoder. An advantage of this method is that the Huffman-table does not have to be transmitted.

A method related to Huffman coding that compresses data more than (adaptive) Huffman coding is arithmetic coding. This method can assign a fractional number of bits per character, instead of the whole number of bits used in Huffman-coding. The result of an arithmetic coded message is a number between 0 and 1. This number is multiplied by the number of characters in the message. The integer part is used for decoding the next character and the fraction for decoding the rest of the message. Because the coding table has for each character a range between two fractional numbers to choose from, it can 'choose' the best number. A drawback of the arithmetic coding algorithm is that it is patented, and therefore the less efficient Huffman-coding is mostly used instead.

2.6.2 LZW compression

Lempel-Ziv-Welch (LZW) is an entropy encoding technique developed by Terry Welch [Nelson91]. The best known implementations of LZW are the UNIX "compress" utility and CompuServe's Graphics Interchange Format (GIF). LZW is based on LZ77 and LZ78, which were developed by Lempel and Ziv. LZ77 and LZ78 are dictionary-based algorithms: they build up a dictionary of previously used strings of characters. The output stream of these encoders consists of characters or references to the dictionary. A combination of a reference with a character generates a new reference in the dictionary. For example, a reference to "Hi" in the dictionary followed by the character "s" results in a new reference "His".

LZW is an improvement over LZ78. LZW uses a table of entries with an index field and a substitution-string field. This dictionary is pre-loaded with every possible symbol in the alphabet. Thus, every symbol can be found in the dictionary by using a reference. The encoder searches the dictionary for the longest entry that matches the string at the input and stores the reference to this entry in the output stream. The matched string plus the first input symbol after it is added to the dictionary as a new entry. An example of the encoding of the word "bananas" is given in Table 1 and Table 2. The decoder reads the encoded stream and replaces each reference by the substitution string that is stored in the associated entry of the dictionary. The decoder also creates a new dictionary entry from the previously decoded string and the first symbol of the current one, so that its dictionary stays identical to that of the encoder.

Input symbol   Generated entry   Output
'b'
'a'            256 = "ba"        'b'
'n'            257 = "an"        'a'
'a'            258 = "na"        'n'
'n'
'a'            259 = "ana"       257
's'            260 = "as"        'a'
(end)                            's'

Table 1. LZW encoding of word "bananas"

Index   Substitution string
0       (char) 0
1       (char) 1
...     ...
255     (char) 255
256     "ba"
257     "an"
258     "na"
259     "ana"
260     "as"

Table 2. LZW dictionary after encoding/decoding of word "bananas"
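The "bananas" example can be checked with a small LZW encoder and decoder. This is a sketch under simplifying assumptions: the dictionary is pre-loaded with the 256 single-character strings, codes are kept as plain integers instead of being packed into bits, and the dictionary is allowed to grow without limit. The function names are illustrative.

```python
def lzw_encode(text):
    """Encode a string into a list of integer dictionary references."""
    dictionary = {chr(i): i for i in range(256)}  # pre-loaded alphabet
    next_code = 256
    current = ""
    output = []
    for symbol in text:
        candidate = current + symbol
        if candidate in dictionary:
            current = candidate  # keep extending the match
        else:
            output.append(dictionary[current])  # emit longest match
            dictionary[candidate] = next_code   # new entry: match + symbol
            next_code += 1
            current = symbol
    if current:
        output.append(dictionary[current])
    return output


def lzw_decode(codes):
    """Rebuild the string, recreating the encoder's dictionary on the fly."""
    if not codes:
        return ""
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    result = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            # Reference to the entry that is being created right now.
            entry = previous + previous[0]
        else:
            raise ValueError("invalid LZW code: %d" % code)
        result.append(entry)
        dictionary[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(result)
```

Encoding "bananas" yields the codes for 'b', 'a', 'n', then the reference 257, then 'a' and 's', which matches the output column of Table 1.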

2.7 Fractal compression

Fractal compression is one of the latest techniques in lossy image compression. Fractals are images that recursively contain themselves. They are defined by a number of transformations that include rescalings, rotations and dimensional flips. If you zoom into a fractal image, it appears that the image has an infinite resolution, but it is actually a part of the same image that reappears in itself. The idea behind fractal compression is to automatically find a fractal that resembles the image that must be compressed. A major advantage of fractal compression is the ability to decompress the image to any given resolution.

The first such algorithm was implemented by Arnaud Jacquin [Gailly95] and was capable of compression factors from 8:1 to 50:1 while retaining reasonable quality. This implementation searches for the combination of transformations that represents the image best. Unfortunately, the search for these transformations is computationally very intensive, which makes it unattractive for real-time image compression. Iterated Systems developed and sells a fractal-based compressor/decompressor, mainly used for CD-ROM encyclopedia applications.

2.8 Wavelet compression

A relatively new and promising development in the area of lossy compression is the use of the wavelet transformation [cody92] [press91]. An important characteristic of this transformation is that, if it is applied to a time-domain signal, it results in a representation that is localized in the time domain as well as in the frequency domain. Compared to the Fast Fourier Transform (FFT), which is of order $N \log_2(N)$ for $N$ elements, a fast wavelet transform is of order $N$ for the same number of elements.

The wavelet transformation converts a sample of $2^J$ values into $2^{J-1}$ approximation wavelet transform coefficients and $2^{J-1}$ detail wavelet transform coefficients. This transformation can be repeated over the generated approximation wavelet transform coefficients a number of times, until the minimum number of 2 approximation transform coefficients and $2^J - 2$ detail transform coefficients remain. The number of transformations is called the number of levels of the wavelet transformation. The wavelet transformation is invertible: applying the inverse wavelet transform a number of times (equal to the number of levels) to the generated wavelet coefficients recomposes the original sample.

Figure 4. Example of a Discrete Wavelet Transform (the DWT maps the input X0..X7 to the approximation transform coefficients S0..S3 and the detail transform coefficients D0..D3)

An example of a wavelet transform is given in Figure 4. In this example a Discrete Wavelet Transform (DWT) is applied to an array of 8 values. The result consists of 4 approximation transform coefficients S0..S3 (also called the smooth vector) and 4 detail transform coefficients D0..D3 (also called the detail vector). Next, the DWT is applied again to the approximation transform coefficients. All of the resulting coefficients, together with the detail transform coefficients from Figure 4, form the final wavelet coefficients.

Wavelet compression is obtained by storing only those coefficients of the wavelet transformation that have an amplitude above a certain threshold, together with the position of those coefficients in the transformed domain. Because the coefficients are also localized in the time domain, high-contrast edges are maintained at the cost of low-contrast areas. By using quantization and entropy encoding in combination with the wavelet transform, the number of bits needed to store the wavelet coefficients is further reduced.
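The multi-level transform described above can be sketched with the Haar wavelet, the simplest wavelet. The choice of the Haar wavelet and of the averaging/differencing normalization is an assumption made for illustration; the text does not state which wavelet is used. The function names are illustrative as well.

```python
def haar_step(values):
    """One DWT level: split 2k values into k smooth and k detail values."""
    half = len(values) // 2
    smooth = [(values[2 * i] + values[2 * i + 1]) / 2.0 for i in range(half)]
    detail = [(values[2 * i] - values[2 * i + 1]) / 2.0 for i in range(half)]
    return smooth, detail


def haar_dwt(values, levels):
    """Apply haar_step repeatedly to the smooth (approximation) vector."""
    smooth = list(values)
    details = []
    for _ in range(levels):
        smooth, detail = haar_step(smooth)
        details.append(detail)
    return smooth, details


def haar_idwt(smooth, details):
    """Invert haar_dwt by undoing the levels in reverse order."""
    values = list(smooth)
    for detail in reversed(details):
        restored = []
        for s, d in zip(values, detail):
            restored.extend([s + d, s - d])
        values = restored
    return values
```

For 8 input values, two levels leave 2 approximation coefficients and 4 + 2 = 6 detail coefficients, in line with the counts given above for J = 3. Compression would then drop or coarsely quantize the small detail coefficients before entropy encoding.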

3 Still-picture compression standards

Still-picture compression techniques take advantage of the spatial redundancy found in images: in most cases, pixels close to each other have the same color, or almost the same color.

3.1 JPEG

The JPEG standard [Wallace91] was developed by the Joint Photographic Experts Group, a collaboration between the former International Telegraph and Telephone Consultative Committee (CCITT; see footnote 1) and the International Organization for Standardization (ISO). The JPEG standard is now widely adopted in the world. There are four modes of operation:

Sequential encoding: This is the general mode, in which a picture is encoded from top to bottom.

Progressive encoding: In this mode, the picture builds up in multiple scans. After each scan, the picture gets sharper.

Lossless encoding: In this mode, the picture is compressed in a way that no data is lost after decompression. The algorithm used for lossless encoding is rather different from the one used in the sequential and progressive modes of operation.

Hierarchical encoding: In this mode, the image is encoded in different resolutions. Accessing a low-resolution version does not require decompression of the full-resolution version.

The JPEG encoder works on one color component at a time. For gray-scale pictures, which consist of only one component, the encoding is straightforward. For color pictures, every component is encoded separately, just like a gray-scale picture. The color components can be interleaved with each other or can be sent one after another, see Section 3.1.2.

3.1.1 Sequential encoding

The most common way to encode a JPEG picture is by using sequential encoding. An overview of the codec (compressor/decompressor) is given in Figure 5. For every component, the picture is divided into blocks of 8 × 8 pixels. Each block is transformed into another 8 × 8 block using a DCT function. The resulting transformed block consists of 64 unique two-dimensional spatial frequency coefficients, of which the higher frequency coefficients are very small or zero.

Footnote 1: The CCITT is now called the International Telecommunication Union (ITU).

Figure 5. Overview of the JPEG codec (JPEG-encoder: original picture, color space conversion, DCT, quantization, entropy encoding, compressed picture, with a data analysis step supplying the quantization table and the Huffman table; JPEG-decoder: compressed picture, entropy decoding, de-quantization, IDCT, color space conversion, decompressed picture, with its quantization table and Huffman table)

After the DCT transformation, the transformed block is quantized by using an 8 × 8 quantization table. This means that every coefficient is divided by its corresponding quantization value and the result is rounded to the nearest integer. Note that this step is lossy and removes data that may not be visible to the human eye. The resulting block of coefficients contains even more small or zero values. This block of coefficients is stored in a sequence according to a zig-zag route defined in the block, see Figure 6. The zig-zag sequence is chosen such that the low-frequency coefficients are stored first and the high-frequency coefficients last. Putting the high-frequency coefficients next to each other results in a series of low or zero values at the end of the sequence. This sequence is encoded efficiently by the entropy encoder.

Figure 6. Zig-zag coefficients sequence
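The zig-zag route of Figure 6 can be generated rather than stored as a table. The sketch below walks the anti-diagonals of the block and alternates the direction on each diagonal; it assumes the conventional route that starts at the DC-coefficient and moves to its right-hand neighbor first. The function names are illustrative.

```python
def zigzag_order(n=8):
    """Return the (row, column) positions of an n x n block in zig-zag order."""
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        # Primary key: the anti-diagonal r + c. Within a diagonal the
        # direction alternates: odd diagonals run top-right to bottom-left
        # (row ascending), even diagonals run bottom-left to top-right.
        key=lambda rc: (rc[0] + rc[1],
                        rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )


def zigzag_scan(block):
    """Flatten a quantized block into its zig-zag coefficient sequence."""
    return [block[r][c] for r, c in zigzag_order(len(block))]
```

For a quantized block in which only a few low-frequency coefficients are non-zero, `zigzag_scan` produces those values first, followed by one long run of zeros that the entropy encoder can represent compactly.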

The final step is entropy encoding of the created sequence. The quantized DC coefficients are treated a little differently from the AC coefficients: because the DC coefficients of adjacent blocks correlate strongly, not the value itself but the difference with the previous DC coefficient is used for entropy encoding. The entropy encoder is a mixture of a variable-length encoder and the Huffman or arithmetic encoder.

3.1.2 Color Component Interleaving

It is possible to interleave the different color components per frame. In interleaving mode, every color component is divided into a number of Minimum Coding Units (MCUs). Every MCU consists of i by j (where $1 \le i, j \le 2$) data units for each color component. A data unit is a block of 8 × 8 pixels that is converted to its DCT equivalent block of 8 × 8 coefficients. Data units in an MCU, and MCUs in a frame, are ordered left-to-right and top-to-bottom. The number of data units in an MCU may not exceed 10. Because of this restriction, not all non-interleaved JPEG compressed images can use interleaving.

For the 4:2:0-format, an MCU contains 4 data units of the Y component, 1 data unit of the Cb component, and 1 data unit of the Cr component, see Figure 7. The interleaved video stream of this example looks like this (subscripts give the column and row of a data unit):

$$MCU_1, MCU_2, \ldots = Y_{0,0}, Y_{1,0}, Y_{0,1}, Y_{1,1}, C_{B\,0,0}, C_{R\,0,0}, Y_{2,0}, Y_{3,0}, Y_{2,1}, Y_{3,1}, C_{B\,1,0}, C_{R\,1,0}, \ldots$$

Figure 7. Component interleaving using MCUs (panels: Y (8 x 8), Cb (4 x 4), Cr (4 x 4), with MCU 1 and MCU 2 marked)
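The order in which data units appear in an interleaved 4:2:0 stream can be generated with a few nested loops. This sketch assumes that data units are identified by (column, row) and that both MCUs and the data units inside an MCU are visited left-to-right and top-to-bottom, as stated in the text; the function name and the tuple representation are illustrative.

```python
def interleave_420(mcu_columns, mcu_rows):
    """List the data units of a 4:2:0 picture in interleaved MCU order.

    Each entry is (component, column, row). One MCU holds 2 x 2 data units
    of Y and one data unit each of Cb and Cr, i.e. 6 data units in total.
    """
    stream = []
    for my in range(mcu_rows):            # MCUs top-to-bottom
        for mx in range(mcu_columns):     # MCUs left-to-right
            for dy in range(2):           # Y data units inside the MCU
                for dx in range(2):
                    stream.append(("Y", 2 * mx + dx, 2 * my + dy))
            stream.append(("Cb", mx, my))
            stream.append(("Cr", mx, my))
    return stream
```

Calling `interleave_420(2, 1)` lists the twelve data units of the first two MCUs, six per MCU, which stays within the limit of 10 data units per MCU.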

3.1.3 Progressive encoding

Progressive encoding allows a user to transmit a picture in a number of scans. The first scan is a rough approximation of the picture, and every subsequent scan improves the picture. Progressive encoding uses the same compression techniques as sequential encoding. The progressive encoding mode, however, introduces a buffer between the quantization and entropy encoding steps that is large enough to store the whole DCT-encoded and quantized picture. The buffer is then entropy encoded in a number of scans. Two methods can be chosen to select the information per scan:

Spectral selection: a selection is made of the quantized coefficients that are transmitted. For example, in the first scan, only the DC coefficient and the first three AC coefficients (according to the zig-zag ordering) are transmitted, in the second scan the next 16 AC coefficients are transmitted, and in the final scan the last 44 AC coefficients are transmitted.

Successive approximation: a bit selection of every quantized coefficient is sent per scan instead of the whole quantized coefficient. For instance, in the first scan, the three most significant bits of all the quantized coefficients are transmitted, and in the second and final scan the remaining bits of the quantized coefficients are transmitted.

The two methods can be mixed, which enables the user to choose the "progression" in a very flexible way. The drawback of progressive encoding compared to sequential encoding is the extra buffer that is introduced in the encoder and the decoder, and a more computationally intensive decoder, as the de-quantization and IDCT processes must be executed again for each scan.

3.1.4 Lossless encoding

The JPEG lossless mode does not make use of the DCT transformation. Instead of DCT and quantization, it uses a prediction process that predicts the value of the current pixel from the values of the pixels to the left of and above it, see Figure 8. The selection value for the prediction (see Table 3) and the difference between the prediction and the actual pixel value are then sent to the entropy encoder. The entropy encoder can be either a Huffman encoder or an arithmetic encoder. For the Huffman encoder, the entropy encoding stage is almost identical to the DC-coefficient encoder in the sequential mode.

Figure 8. JPEG Lossless prediction for pixel X (neighbors: C above-left of X, B above X, A to the left of X)
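The prediction step can be sketched as a lookup of the formulas listed in Table 3. The sketch assumes integer pixel values, integer (floor) division for the halving steps, and a prediction of 0 for selection value 0 ("no prediction"); these details and the function names are assumptions for illustration.

```python
def predict(selection, a, b, c):
    """Predict pixel X from its neighbors A (left), B (above), C (above-left)."""
    predictors = {
        0: 0,                    # no prediction (assumed to mean 0)
        1: a,
        2: b,
        3: c,
        4: a + b - c,
        5: a + (b - c) // 2,
        6: b + (a - c) // 2,
        7: (a + b) // 2,
    }
    return predictors[selection]


def encode_pixel(x, selection, a, b, c):
    """Return the difference that is passed on to the entropy encoder."""
    return x - predict(selection, a, b, c)


def decode_pixel(difference, selection, a, b, c):
    """Recover the pixel exactly from the difference and the same neighbors."""
    return predict(selection, a, b, c) + difference
```

Because the decoder has already reconstructed A, B and C exactly, it computes the same prediction as the encoder, so adding the transmitted difference restores X without loss.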

Selection value   Prediction for X
0                 no prediction
1                 A
2                 B
3                 C
4                 A + B - C
5                 A + (B - C)/2
6                 B + (A - C)/2
7                 (A + B)/2

Table 3. Lossless prediction formula table

3.1.5 Hierarchical encoding

Hierarchical encoding allows a user to decode a low-resolution version of a picture, without decoding and down-sampling the whole encoded picture. The hierarchical mode is used in combination with the lossless, sequential, or progressive encoding mode. The hierarchical encoder performs a number of steps, see Figure 9. First, the picture is down-sampled a desired number of times by a factor of two in the horizontal dimension, in the vertical dimension, or in both dimensions. The result is the minimal resolution of the picture that can be retrieved by the decoder. Then, the encoder compresses this down-sampled picture using one of the sequential, progressive or lossless compression modes. This compressed image is sent to the outgoing video stream. After that, the encoder decompresses the compressed image, so that it has the same image as the decoder (see footnote 2). This image is up-sampled by a factor of 2 in the horizontal dimension, the vertical dimension, or both, using an interpolation filter that is also used in the decoder. The result is then compared with the original image down-sampled to the same resolution, or with the original image itself (without down-sampling) if it already has the same resolution. The difference found by this comparison is encoded using the same compression method as before. In this encoding, a different quantization table can be used, because the difference of two images has other (statistical) characteristics than an image itself. If the up-sampled image still has a lower resolution than the original image, the encoder can again up-sample, interpolate, compare with the (down-sampled) original image, and calculate a compressed difference-image, until the full resolution of the original image has been sent.

The drawback of hierarchical encoding is the need for picture buffers for each resolution that is sent at the encoder, and one extra buffer at the decoder. Furthermore, if the decoder wants the picture at the highest resolution, many more calculations must be made at the encoder and at the decoder than with sequential coding.

The Independent JPEG Group (IJG) developed public domain source code that supports the lossless, sequential and progressive operation modes. Hierarchical mode is not (yet) supported.

3.2 GIF

GIF is a lossless, 256-color, still-picture compression standard [Gailly95]. It uses a variation of the LZW compression technique called Variable-Length LZW. GIF is most suitable for images that have a small number of colors, such as computer generated graphics and cartoons. It is also useful for small images. The difference between the LZW compression method and the Variable-Length LZW used in GIF is that in the latter the size of the code that represents an entry in the table is increased by one bit when the table is full. If the code is 12 bits wide and the table is full, a special symbol (a clear code) is encoded to indicate that the table must be emptied and rebuilt from scratch.

Footnote 2: This decoding step can be optimized when lossless mode is used, or when an intermediate result is stored before the lossless entropy encoding step during encoding.

Figure 9. Hierarchical encoding in JPEG (first scan: the original image is down-sampled and compressed; the compressed image is decompressed, up-sampled and interpolated; second scan: the up-sampled reconstructed image is compared with the original image and the resulting difference image is compressed)

There are two widely used versions of GIF: 87a and 89a [Compuserv90]. The 89a version has some extensions to insert text into the picture, and comments and application specific codes in a GIF file, but the LZW algorithm used is not different.

4 Video compression techniques

The advantage of video compression standards over still-picture compression standards is that they make use not only of spatial redundancy but also of temporal redundancy, which can reduce the size of the video stream significantly. Temporal redundancy is the property of a video stream to show the same information (objects) over a certain period of time (a number of frames). In video compression algorithms, motion-compensation prediction techniques are used to scan previously sent frames for parts that are (about) the same as parts of the current frame that is being encoded.

4.1 Prediction

The most basic form of prediction checks whether a block of n × m pixels in the current frame is the same as the block at the same place in the previous frame. If there is no change, the data of this block is not encoded. Although this is a simple example, the implementation still requires some thought: what size is chosen for the block that is compared, and must the blocks be exactly the same, or is there a threshold value before the block is marked as changed?

In most implementations, prediction is combined with motion-compensated interpolation. If a block of data is not identical to a block of data in a previous frame, the best matching block is found and the difference is used for further compression. The resulting difference block compresses better than the original block of data. The area that is searched for a matching block determines the quality of the final prediction. The larger the search area, the larger the chance that a good match is actually found. However, most matching blocks are found close to the position of the original block, and increasing the search area also increases the computation time needed to find a match.

Bi-directional prediction searches not only a previous frame for a close-matching block, but also a next frame in the video stream. Another advantage of bi-directional prediction is that it can combine the prediction from a previous frame with the prediction from a next frame into an average prediction block.
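The block comparison, the threshold and the search area discussed above can be illustrated with a minimal sketch. It assumes frames stored as lists of rows of integer pixel values, an exhaustive search, and the sum of absolute differences as the matching criterion; none of these choices is prescribed by the text, and all function and parameter names are hypothetical.

```python
def sad(cur, ref, cx, cy, rx, ry, n):
    """Sum of absolute differences between an n x n block of `cur` at
    (cx, cy) and an n x n block of `ref` at (rx, ry)."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))


def best_match(cur, ref, bx, by, n=8, search=4):
    """Search `ref` within +/- `search` pixels of (bx, by) for the block
    that best matches the block of `cur` at (bx, by).
    Returns (dx, dy, cost)."""
    height, width = len(ref), len(ref[0])
    best = (sad(cur, ref, bx, by, bx, by, n), 0, 0)  # zero displacement
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = bx + dx, by + dy
            # Only consider candidate blocks that lie inside the frame.
            if 0 <= rx <= width - n and 0 <= ry <= height - n:
                cost = sad(cur, ref, bx, by, rx, ry, n)
                if cost < best[0]:
                    best = (cost, dx, dy)
    cost, dx, dy = best
    return dx, dy, cost


def encode_block(cur, ref, bx, by, n=8, search=4, threshold=0):
    """Skip the block if it is (nearly) unchanged, otherwise return the
    displacement of the best match and the difference block."""
    if sad(cur, ref, bx, by, bx, by, n) <= threshold:
        return ("skip",)
    dx, dy, _ = best_match(cur, ref, bx, by, n, search)
    residual = [[cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i]
                 for i in range(n)] for j in range(n)]
    return ("predicted", (dx, dy), residual)
```

The cost of the exhaustive search grows with the square of `search`, which is the trade-off between search area and computation time mentioned in the text.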

5 Video compression standards

5.1 Motion-JPEG

The (lossy) JPEG compression standard is so successful that it is also used for motion-picture compression [Lane95]. Although Motion-JPEG has not been declared a standard by any standardization committee, it is used by many vendors and may therefore be called a de facto standard. Motion-JPEG has no standard set of picture formats, nor is there agreement on the file format. Motion-JPEG encodes every frame as a sequential-mode JPEG picture, without making use of temporal redundancies in the video stream.

Motion-JPEG has a number of advantages. First, Motion-JPEG requires less computation time than other compression standards, because it has no motion-compensation algorithms. Second, random access to frames is possible, as every frame can be encoded and decoded independently of the other frames. The main disadvantage of Motion-JPEG is its poor compression factor, due to the lack of temporal redundancy reduction techniques.

5.2 MPEG-1

MPEG stands for Moving Picture Experts Group, a group concerned with the development of video compression standards [Gall91]. Although MPEG is also developing audio and synchronization standards as part of the MPEG standard, we only look at the video compression techniques used. MPEG-1 is the first standard of the MPEG group. It describes the way a video stream must be stored, but it does not specify how the coding (and decoding) must be done. The MPEG-1 standard is designed for (encoded) video streams of 1.5 Mbps, which is sufficient for CD-ROM applications. MPEG-1 supports encoding of the Standard Interchange Format (SIF), which has a resolution of 352 × 240 pixels at a rate of 30 frames per second for NTSC and a resolution of 352 × 288 pixels at a rate of 25 frames per second for PAL and SECAM. An MPEG video stream uses three different types of frames:

I-frames: The I or intra-picture frames are compressed independently of the other frames in the video stream.

P-frames: P or predicted frames store the difference between the current frame and the previously encoded P or I frame.

B-frames: B or bi-directional prediction frames use both the previous I or P frame and the next I or P frame to predict the current frame.

I, B and P frames are compressed differently. In I frames, compression is achieved by reducing spatial redundancy in the frame. P and B frames also use temporal redundancy reduction to improve the compression factor. Because B frames also make use of the next I or P frame as a reference, B frames have the highest compression factor. An MPEG video stream consisting of only I frames has, except for some quantization and Huffman-encoding details, the same compression factor as a Motion-JPEG video stream (using the same video format). However, I frames are important for random access, the ability to decode individual frames without decoding the whole video stream. A frame is divided into a number of 16 × 16 blocks of pixels called macro-blocks. A macro-block can be encoded in four different ways:

Intra-block encoding: No prediction is used.

Forward predicted encoding: The next I or P frame is searched for the 16 × 16 block of pixels that most closely resembles the current macro-block, see Figure 10. The difference between these blocks is used for further compression.

Backward predicted encoding: This encoding is the same as forward predicted encoding, except that blocks are searched in the previous I or P frame instead of the next frame.

Average encoding: Backward and forward predicted encoding are both used to find the two blocks of pixels that best resemble the current macro-block, see Figure 11. These two blocks are averaged and the difference with the current macro-block is used for further compression.

Figure 10. Forward prediction (diagram: frames at times T0 and T1)

I frames only use intra-block coding; P frames use either intra-block coding or backward predicted coding. B frames use any of the encoding modes.
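The four encoding modes and their availability per frame type can be sketched as below. The sketch follows the naming of this paper (forward refers to the next frame, backward to the previous frame) and assumes that motion search has already produced the best matching blocks. The cost measure, the rounding used for the average, and all names are assumptions made for illustration; an actual encoder is free to decide differently.

```python
# Modes allowed per frame type, as stated in the text.
ALLOWED = {
    "I": ("intra",),
    "P": ("intra", "backward"),
    "B": ("intra", "forward", "backward", "average"),
}


def residual(cur, pred):
    """Element-wise difference between the current block and a prediction."""
    return [[c - p for c, p in zip(crow, prow)]
            for crow, prow in zip(cur, pred)]


def energy(block):
    """Sum of absolute values, used here as a crude coding-cost estimate."""
    return sum(abs(v) for row in block for v in row)


def average(a, b):
    """Average of two prediction blocks, rounded to the nearest integer."""
    return [[(x + y + 1) // 2 for x, y in zip(arow, brow)]
            for arow, brow in zip(a, b)]


def choose_mode(frame_type, cur, prev_match=None, next_match=None):
    """Pick the allowed mode with the lowest estimated cost.
    Returns (mode, data_to_code)."""
    count = len(cur) * len(cur[0])
    mean = sum(v for row in cur for v in row) / count
    # Heuristic (assumption): the intra cost is the deviation from the mean.
    options = {"intra": (sum(abs(v - mean) for row in cur for v in row), cur)}
    predictions = {}
    if prev_match is not None:
        predictions["backward"] = prev_match
    if next_match is not None:
        predictions["forward"] = next_match
    if prev_match is not None and next_match is not None:
        predictions["average"] = average(prev_match, next_match)
    for mode, pred in predictions.items():
        if mode in ALLOWED[frame_type]:
            diff = residual(cur, pred)
            options[mode] = (energy(diff), diff)
    mode = min(options, key=lambda m: options[m][0])
    return mode, options[mode][1]
```

When the current block lies exactly halfway between its two matches, the average mode yields an all-zero difference block, which is the situation in which B frames gain most over P frames.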

Figure 11. Average prediction (diagram: frames at times T-1, T0 and T1)

After motion prediction, a macro-block must be compressed to reduce spatial redundancy. An 8 × 8 DCT is used, similar to the one found in JPEG. After the DCT, coefficients are stored in zig-zag order. These coefficients are quantized depending on the original encoding mode. For intra-block encoding, low spatial frequency coefficients are quantized with a lower quantization factor than high spatial frequency coefficients. For the other encoding modes, the coefficients are DCT-transformed differences of pixel blocks. Low frequencies of these blocks will be close to zero because of the applied prediction. Therefore, a different quantization matrix must be used than for intra-block-encoded DCT blocks.

MPEG also allows different quantization step sizes for different blocks. This is independent of the encoding mode (intra or predictive encoding). Different quantization step sizes allow the encoder to code certain blocks more accurately than others.

In general, an MPEG video stream consists of many B-frames, some P-frames and a few I-frames. The I-frames guarantee random access in the video stream. P-frames are also important because B-frames can only refer to I- or P-frames, not to other B-frames. After motion prediction, DCT and quantization, the output stream is entropy encoded by a variant of the variable-length compression technique found in JPEG.

MPEG-1 is tuned for compression of video streams that comply with a Constraint Parameter Bit stream (CPB), see Table 4. Video streams that use more bandwidth than this CPB allows may be encoded with MPEG-1, but the encoding is not necessarily efficient and support is not guaranteed by MPEG-1 hardware.
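The chain of DCT, quantization and zig-zag ordering can be sketched in a few lines. The two quantization matrices below are placeholders chosen only to show the idea (finer steps for low frequencies in intra blocks, a flat matrix for predicted blocks); they are not claimed to be the matrices of the MPEG-1 standard, and all names are hypothetical.

```python
import math

N = 8  # block size used by the DCT


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block (direct, unoptimized)."""
    def c(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

    out = [[0.0] * N for _ in range(N)]
    for v in range(N):          # vertical frequency
        for u in range(N):      # horizontal frequency
            s = sum(block[y][x]
                    * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                    * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                    for y in range(N) for x in range(N))
            out[v][u] = c(u) * c(v) * s
    return out


def zigzag_indices(n=N):
    """(row, column) pairs in zig-zag order, lowest frequencies first."""
    order = []
    for d in range(2 * n - 1):              # d = row + column
        rows = range(max(0, d - n + 1), min(d, n - 1) + 1)
        if d % 2 == 0:
            rows = reversed(rows)           # even diagonals run upwards
        order.extend((r, d - r) for r in rows)
    return order


# Placeholder matrices (assumptions, not taken from the standard).
INTRA_MATRIX = [[8 + 2 * (r + c) for c in range(N)] for r in range(N)]
PREDICTED_MATRIX = [[16] * N for _ in range(N)]


def scan_quantized_block(block, matrix, scale=1):
    """DCT, scalar quantization with a per-block scale, zig-zag scan."""
    coeffs = dct_2d(block)
    quantized = [[round(coeffs[r][c] / (matrix[r][c] * scale))
                  for c in range(N)] for r in range(N)]
    return [quantized[r][c] for r, c in zigzag_indices()]
```

The `scale` parameter plays the role of the per-block quantization step size: a larger value codes the block less accurately. After the zig-zag scan the non-zero values are concentrated at the start of the list, which is what makes the subsequent variable-length coding effective.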

Parameter                         Value
Horizontal size in pixels         720
Vertical size in pixels           576
Total macro-blocks per picture    396
Total macro-blocks per second     396*25 (=330*30)
Frame rate                        30 frames per second
Bit rate                          1.86 Mbps
Decoder buffer                    376832 bits

Table 4. MPEG-1 Constraint Parameter Bit stream
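The macro-block figures in Table 4 follow from the two SIF resolutions, as the short check below shows. It reads every entry of the table as an upper bound that must hold at the same time, which is an interpretation and not something the table states explicitly; the function names and the dictionary layout are hypothetical, and the decoder buffer is not modelled.

```python
# Limits copied from Table 4 (bit rate in bits per second, rounded as in
# the table).
CPB_LIMITS = {
    "width": 720,
    "height": 576,
    "macro_blocks_per_picture": 396,
    "macro_blocks_per_second": 396 * 25,
    "frame_rate": 30,
    "bit_rate": 1_860_000,
}


def macro_blocks(width, height, size=16):
    """Number of size x size macro-blocks needed to cover the picture
    (partial macro-blocks at the edges count as whole ones)."""
    return -(-width // size) * -(-height // size)


def within_cpb(width, height, frame_rate, bit_rate, limits=CPB_LIMITS):
    """True if a stream stays within every limit of the table."""
    per_picture = macro_blocks(width, height)
    return (width <= limits["width"]
            and height <= limits["height"]
            and per_picture <= limits["macro_blocks_per_picture"]
            and per_picture * frame_rate <= limits["macro_blocks_per_second"]
            and frame_rate <= limits["frame_rate"]
            and bit_rate <= limits["bit_rate"])


pal_sif = macro_blocks(352, 288)    # 22 * 18 macro-blocks
ntsc_sif = macro_blocks(352, 240)   # 22 * 15 macro-blocks
```

Here `pal_sif` is 396 and `ntsc_sif` is 330, so both SIF variants produce 396*25 = 330*30 = 9900 macro-blocks per second. Under this reading a 720 × 576 picture exceeds the macro-block limit even though its width and height are individually allowed.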


5.3 MPEG-2

The MPEG-2 standard is developed for high-end video applications that need a compressed video stream from 4 Mbps up to 100 Mbps [Kleine95] [Okubo95]. MPEG-1 is efficient for video streams that conform to its CPB, but may not be efficient for these higher-rate video streams. Furthermore, interlaced video streams, which are common in the television industry, are not easily converted to MPEG-1; MPEG-2 is better suited to these interlaced video streams. MPEG-2 video streams are nevertheless compatible with MPEG-1.

The MPEG-2 standard deals with video streams of different resolutions, which are divided into profiles and levels. The lowest level format is 352 × 288 pixels (PAL format) and the highest is 1920 × 1152 pixels (PAL format). The simplest profile does not use B-frames, is not scalable and uses a 4:2:0 luminance/chrominance format, while the high profile uses B-frames, is scalable and uses either a 4:2:0 or a 4:2:2 luminance/chrominance format.

5.4 MPEG-4

At this moment, MPEG-4 is still in development and no concrete algorithms or methods have been determined yet [Filippini95]. However, an outline of the goals of MPEG-4 is available. MPEG-4 is not just a compression standard; it will incorporate a description language that describes the contents of a video stream. It also distinguishes different objects, which enables the user to assign priorities to objects so that, for example, the foreground of a picture has a higher priority than the background. MPEG-4 intends to support a wide variety of video streams, from low-bandwidth to 3-dimensional video streams. An MPEG-4 stream combines tools, algorithms and profiles. These will determine how data is stored. For example, subtitles will be coded differently from other video objects. MPEG-4 is scheduled to become a standard at the end of 1998.

5.5 H.261

The CCITT developed the H.261 video compression standard, which is designed for video communications over ISDN networks [Liou91][Turletti93]. H.261 can handle video streams of p × 64 Kbps (where p = 1, 2, ..., 30), which matches the bandwidths available in ISDN. The H.261 standard supports the following two video formats:

Common Intermediate Format (CIF). This format has a resolution of 360 × 288 pixels for the luminance (Y) part of the video stream and a resolution of 180 × 144 pixels for each of the two chrominance parts (C_B and C_R) of the video stream;

Quarter-CIF (QCIF). This format contains a quarter of the information of a CIF video stream. This means that the luminance resolution is 180 × 144 pixels and the two chrominance resolutions are 90 × 72 pixels.

The maximum frame rate for an H.261 video stream is 30 frames per second. Every CIF or QCIF frame is coded as one picture, and each picture is divided into Groups Of Blocks (GOBs), see Figure 12. A QCIF picture has 3 GOBs, while a CIF picture has 12 GOBs. Each GOB consists of 3 × 11 Macro Blocks (MB). A Macro Block is composed of four 8 × 8 luminance blocks and two 8 × 8 chrominance blocks (C_B and C_R). A macro block can be compared to an MCU in JPEG.

Figure 12. Composition of an H.261 CIF (diagram: a CIF picture and a QCIF picture divided into GOBs, a GOB divided into MBs, and an MB divided into its Y, C_B and C_R blocks)
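The picture hierarchy can be checked with a few lines of arithmetic. The sketch assumes that a macro block covers 16 × 16 luminance pixels (four 8 × 8 blocks) and that the GOBs are arranged as 1 × 3 in QCIF and 2 × 6 in CIF; the arrangement is not given in the text and is an assumption here, as are all names. Under these assumptions the macro-block grid covers 352 × 288 luminance pixels for CIF, slightly narrower than the 360 pixels quoted above.

```python
MB_SIZE = 16                       # luminance pixels per macro block side
GOB_MB_COLS, GOB_MB_ROWS = 11, 3   # a GOB holds 3 x 11 macro blocks
BLOCKS_PER_MB = 4 + 2              # four Y blocks plus one C_B and one C_R


def picture_structure(gob_cols, gob_rows):
    """Counts for a picture made of gob_cols x gob_rows GOBs."""
    gobs = gob_cols * gob_rows
    macro_blocks = gobs * GOB_MB_COLS * GOB_MB_ROWS
    return {
        "gobs": gobs,
        "macro_blocks": macro_blocks,
        "blocks_8x8": macro_blocks * BLOCKS_PER_MB,
        "coded_luminance_size": (gob_cols * GOB_MB_COLS * MB_SIZE,
                                 gob_rows * GOB_MB_ROWS * MB_SIZE),
    }


# Assumed GOB layouts (hypothetical, see the lead-in).
QCIF = picture_structure(gob_cols=1, gob_rows=3)
CIF = picture_structure(gob_cols=2, gob_rows=6)
```

With these layouts a QCIF picture has 3 GOBs and 99 macro blocks, and a CIF picture has 12 GOBs and 396 macro blocks, four times as many.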

The H.261 encoder can operate in two modes. In the intra-frame mode, every 8 × 8 block is DCT-transformed, linearly quantized, and sent to the video multiplexer. In the inter-frame mode, every 8 × 8 block is also DCT-transformed and linearly quantized, but the result is first sent to a motion-compensator before it is sent to the video multiplexer. The motion-compensator compares the macro-block of the current frame with blocks of data from the previously sent frame. If the difference is below a pre-determined threshold, no data is sent for this block. Otherwise, the difference is DCT-transformed and linearly quantized. The final encoding step is the video multiplexer, which uses a variable word-length coder to reduce the bit stream even more. After the video multiplexer, the result is inserted in a transmission buffer, which controls the linear quantizer in order to regulate the outgoing bit stream.

H.261 is similar to MPEG with respect to the DCT encoding and quantization. During the standardization of MPEG-1, this was done on purpose to simplify implementations that encode or decode both H.261 and MPEG. The main difference between H.261 and MPEG-1 is that motion vectors in H.261 are restricted to at most 15 pixels away from the original place in the picture. Furthermore, no future prediction is used in H.261, which means that H.261 has no equivalent of the B-frames in MPEG.

INRIA Videoconferencing System (IVS) is an implementation of a video conferencing tool that uses H.261 for video compression. Vic is another video conferencing tool that supports H.261.
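The feedback from the transmission buffer to the quantizer can be illustrated with a toy model. Everything in it is an assumption made for illustration: the linear mapping from buffer fullness to step size, the range 1 to 31 for the step, and the crude rule that the number of bits produced for a frame shrinks in proportion to the step. The text only states that the buffer controls the quantizer, not how.

```python
def quantizer_step(buffer_bits, buffer_capacity, q_min=1, q_max=31):
    """Map buffer fullness linearly onto a quantizer step: the fuller the
    buffer, the coarser the quantization."""
    fullness = min(max(buffer_bits / buffer_capacity, 0.0), 1.0)
    return q_min + round(fullness * (q_max - q_min))


def simulate(frame_bits_at_q1, p=2, frames_per_second=30, capacity=20_000):
    """Return the quantizer step chosen for each frame on a p x 64 Kbps
    channel. `frame_bits_at_q1` lists the bits each frame would need at
    the finest step (hypothetical input)."""
    channel_bits_per_frame = p * 64_000 // frames_per_second
    buffer_bits = 0
    steps = []
    for bits in frame_bits_at_q1:
        q = quantizer_step(buffer_bits, capacity)
        produced = bits // q              # crude model of coarser coding
        buffer_bits += produced - channel_bits_per_frame
        buffer_bits = min(max(buffer_bits, 0), capacity)
        steps.append(q)
    return steps
```

In this model a burst of large frames fills the buffer, which raises the step size for the following frames until the channel has drained the buffer again.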

5.6 H.263

The ITU H.263 draft [ITULBC95] is an improvement over H.261 (see footnote 3). H.263 is developed for low-bandwidth communications over Plain-Old Telephone Systems (POTS), in particular 28.8 Kbps modems. Compared to H.261, the number of available picture formats is increased, the motion-compensation algorithm is improved, better entropy encoding is used, and a new frame type is introduced that allows a simple form of forward-prediction.

Footnote 3: At the moment of writing, H.263 is not yet a standard.

Three new video formats are added in H.263: a sub-QCIF format (128 × 96), a 4CIF format (704 × 576), and a 16CIF format (1408 × 1152). As in H.261, the total number of chrominance pixels is always half the number of luminance pixels. This means that for every 2 × 2 luminance pixels, one C_B and one C_R pixel is used.

H.263 also supports an unrestricted motion vector mode. In the default (restricted) motion vector mode, the block that is referenced must be fully inside the picture. In unrestricted motion vector mode, an arbitrary number of the referenced pixels may lie outside the picture. For every reference to such a pixel, the closest edge pixel is used instead.

The Advanced Prediction mode is also new in H.263. Instead of one motion vector for a 16 × 16 macro-block, four motion vectors for 8 × 8 blocks are used for prediction. Although this encoding uses more bits than encoding with one motion vector, the quality of the prediction improves significantly.

Another improvement in H.263 is the use of motion vectors that refer to half-pixel displacements instead of displacements of an integer number of pixels. To calculate the referenced sub-pixels, the values of the surrounding pixels are interpolated.

Besides Intra and Inter encoded frames, H.263 introduces PB-frames. The name is derived from the MPEG P and B frames. A PB-frame (see Figure 13) consists of two mixed frames: one "normal" P-frame and one bi-directional prediction frame. When a PB-frame is decoded, first the B-frame and after that the P-frame is shown. The P-frame can use Inter or Intra encoding modes, but the B-frame can only use the new forward or the older backward prediction mode and not the intra encoding mode. The B-frame can refer to the associated P-frame in the PB-frame, and to an average of the associated P-frame and the previously encoded P-frame.

Figure 13. B-frame of H.263 (diagram: frames at times T-1, T0 and T1, with the B frame and the P frame grouped into one PB-frame)
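Half-pixel motion vectors and the edge rule of the unrestricted motion vector mode can be combined in one small sketch. Coordinates are given in half-pixel units, references outside the picture fall back to the closest edge pixel, and sub-pixel values are obtained by averaging the surrounding pixels. The rounding rule and the function names are assumptions made for this sketch.

```python
def pixel_clamped(frame, x, y):
    """Pixel at (x, y); coordinates outside the picture are replaced by
    the closest edge pixel."""
    height, width = len(frame), len(frame[0])
    return frame[min(max(y, 0), height - 1)][min(max(x, 0), width - 1)]


def sample_half_pel(frame, x2, y2):
    """Value at position (x2 / 2, y2 / 2), with x2 and y2 in half-pixel
    units. Integer positions return the pixel itself; half positions
    return the rounded average of the two or four neighbours."""
    x0, y0 = x2 // 2, y2 // 2          # floor, also for negative values
    x1 = x0 + (x2 % 2)                 # right neighbour if x2 is odd
    y1 = y0 + (y2 % 2)                 # lower neighbour if y2 is odd
    total = (pixel_clamped(frame, x0, y0) + pixel_clamped(frame, x1, y0)
             + pixel_clamped(frame, x0, y1) + pixel_clamped(frame, x1, y1))
    return (total + 2) // 4


def predict_block(frame, bx, by, mvx2, mvy2, n=8):
    """n x n prediction for the block at (bx, by), displaced by a motion
    vector (mvx2, mvy2) given in half-pixel units."""
    return [[sample_half_pel(frame, 2 * (bx + i) + mvx2, 2 * (by + j) + mvy2)
             for i in range(n)] for j in range(n)]
```

For the 2 × 2 frame `[[10, 20], [30, 40]]`, the sample halfway between the two top pixels is 15 and the sample in the centre of all four pixels is 25; any position left of the picture on the top row returns 10.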

Test Model Near (TMN) is an implementation of the upcoming H.263 standard and is used as the test model for this standard. Its developers claim that it compresses a factor of two better than H.261 [Telenor95]. The source code of TMN is also available.


6 Summary

A number of picture and video compression techniques are discussed. For picture compression, a distinction is made between lossless and lossy methods. Lossless compression yields an exact copy of the original data after decompression. Lossy compression methods give up this requirement to obtain a much higher compression factor. Thus, the quality of lossy compression depends not only on the compression factor, as with lossless compression, but also on how closely the decompressed image resembles the original picture.

Lossless methods for compression are LZW, (adaptive) Huffman and DCT, although the latter loses this property during storage of the rounded coefficients. DCT transforms a block of n × n pixels into a matrix of n × n coefficients which represent spatial frequencies. Because most high-frequency coefficients are (near) zero, compression is attained. Quantization reduces the number of bits for data by reducing the density of the domain. Scalar quantization does this by dividing data by a quantization factor; decompression is done by multiplying with the same quantization factor. Vector quantization methods use a codebook to translate data to an index in the codebook; the collection of indexes and the codebook together are used to retrieve the original picture. Wavelet transform is another technique that transforms a time domain into a time-frequency domain. Compression is done by storing only some of the generated coefficients.

JPEG has four different operating modes. The lossy modes use a combination of the lossy DCT and quantization methods together with lossless entropy encoding methods to compress pictures. The progressive mode allows a user to send the picture in a number of scans so that the picture improves after each scan. The JPEG lossless mode uses a prediction-based method to compress a picture. Hierarchical mode enables a user to send different resolutions of a picture at the same time; low-end decoders will only decode the first scan or the first couple of scans, while high-end decoders decode all scans. GIF uses the LZW algorithm to compress 256-color pictures losslessly.

Video compression techniques not only make use of spatial redundancy to reduce the bitstream, but also use temporal redundancy found in consecutive video frames. In this way, the current frame is predicted from the previous and sometimes future frames, and only the difference between these frames and the current frame is encoded. All video compression standards discussed here use DCT transformation followed by quantization and entropy encoding.

A number of video compression standards are available. Motion-JPEG is a series of JPEG-compressed images stored one after another. Only spatial redundancy, not temporal redundancy, is reduced. The advantages of this method are the easy implementation and random access to individual frames. The disadvantage is the poor compression factor for video streams. MPEG-1 is a standard for (compressed) bitstreams of 1.5 Mbps. It uses three different types of frames: I-frames that store a frame independently of the others in a stream, P-frames that store the difference between the current and the previous I- or P-frame, and B-frames that use both the previous I- or P-frame and the next I- or P-frame in the video stream for prediction of the current frame. P- and B-frames use motion compensation techniques for their prediction. MPEG-2 is an enhancement of MPEG-1 that is optimized for higher bit rates and better video resolutions than MPEG-1. MPEG-4 is still in development but will make it possible to distinguish individual objects in video streams.

Standards designed for ISDN and POTS telecommunication are H.261 and H.263, respectively. H.261 has two picture formats: CIF has a resolution of 360 × 288 and QCIF is a quarter of the CIF resolution. H.261 frames are encoded in one of two modes: in intra-frame mode, the frame is compressed individually; in inter-frame mode, the difference with the previous frame is calculated (using motion compensation) and stored in the outgoing video stream. H.263 is an improvement over H.261 which supports more resolutions, among them 4CIF (704 × 576) and 16CIF (1408 × 1152). Furthermore, it introduces a PB-frame that enables a simple form of forward-prediction. The motion prediction algorithm is also improved in H.263.

References

[Bryan95] John Bryan. Compression scorecard. Byte, 20(5):107–12, May 1995.
[cody92] Mac A. Cody. The fast wavelet transform. Dr. Dobb's Journal, 17(4):16–28, April 1992.
[Compuserv90] CompuServe Inc. Graphics interchange format (sm) version 89a. Technical report. CompuServe, Incorporated, Columbus, Ohio, 1990.
[Filippini95] Luigi Filippini. MPEG informations, questions and answers, 31 July 1995. http://www.crs4.it/luigi/MPEG/mpegfaq.html.
[Gailly95] Jean-loup Gailly. comp.compression frequently asked questions, 28 September 1995. ftp://rtfm.mit.edu/pub/usenet/news.answers/compressionfaq/part[1-3].
[Gall91] Didier Le Gall. MPEG: A video compression standard for multimedia applications. Communications of the ACM, 34(4):46–58, April 1991.
[Gray92] Robert M. Gray, Pamela C. Cosman, and Eve A. Riskin. Image compression and tree-structured vector quantization. In James A. Storer, editor, Image and text compression, Communications and information theory, pages 3–34. Kluwer Academic Publishers, 1992.
[ITULBC95] R. Schaphorst. Draft recommendation H.263. Technical report LBC-95251. ITU-T, 3 October 1995.
[Kleine95] G. Kleine. Digitale televisie met behulp van mpeg-2-kompressie. Elektuur, 35(9):68–75, September 1995.
[Lane95] Tom Lane. JPEG-faq, part 1, 28 May 1995. ftp://rtfm.mit.edu/pub/usenet/news.answers/jpeg-faq/part1.
[Liou91] Ming L. Liou. Overview of the p*64 kbit/s video coding standard. Communications of the ACM, 34(4):60–3, April 1991.
[Nelson91] Mark Nelson. The data compression book. M & T Publishing, Incorporated, 501 Galveston Drive, Redwood City, CA 94063-4728, U.S.A., 1991.
[Okubo95] Sakae Okubo, Ken McCann, and Andrew Lippmann. MPEG-2 requirements, profiles and performance verification: framework for developing a generic video coding standard. Signal Processing: Image Communication, pages 201–9, July 1995.
[Poynton95] Charles A. Poynton. Frequently asked questions about colour, 1995. ftp://ftp.inforamp.net/pub/users/poynton/doc/colour/.
[press91] William H. Press. Wavelet Transforms. Harvard-Smithsonian Center for Astrophysics, 1991. Preprint No. 3184.
[Telenor95] Karl O. Lillevold. Digital video coding at Telenor R&D, 11 November 1995. Homepage on internet.
[Turletti93] Thierry Turletti. H.261 software codec for videoconferencing over the internet. Technical report 1834. INRIA, January 1993.
[Wallace91] Gregory K. Wallace. The JPEG still picture compression standard. Communications of the ACM, 34(4):30–44, April 1991.