Journal of Electronic Imaging 9(4), 521 – 533 (October 2000).

Adaptive motion-compensated video coding scheme towards content-based bit rate allocation

Jianping Fan, David K. Y. Yau, Walid G. Aref, and A. Rezgui
Purdue University, Department of Computer Sciences, West Lafayette, Indiana 47907
E-mail: [email protected]

Paper 99061 received Nov. 8, 1999; revised manuscript received May 1, 2000; accepted for publication May 12, 2000. 1017-9909/2000/$15.00 © 2000 SPIE and IS&T.

Abstract. An adaptive motion-compensated video coding scheme based on structural video component segmentation and coding complexity analysis is proposed in this paper. The bits are allocated more efficiently among the different frame types and the different video components. A novel scene cut detection algorithm is proposed for partitioning the input video sequence into a set of shots, and each shot may be encoded as one or multiple GOPs according to its length. Moreover, the positions of the reference frames (I and P frames) in a video shot are adapted to improve the temporal predictability among frames and provide high coding efficiency, and thus high picture quality at the same bit rate. More bits are allocated to these reference frames to provide high quality of the reconstructed pictures. The remaining frames in a video shot are encoded as bidirectional interpolation frames (B frames) and can be quantized more coarsely because they have high temporal predictability and are not used as references. The bits allocated to the three frame types (I, P, and B frames) are further distributed among the different video components to avoid coding artifacts. Experimental results show that the proposed adaptive video coding scheme is more efficient than traditional fixed-GOP coding algorithms and may be an effective development of present adaptive coding techniques. © 2000 SPIE and IS&T. [S1017-9909(00)00504-3]

1 Introduction

Motion-based prediction and compensation for image sequence compression have been found to be very successful, and they have matured into the MPEG standards. Since the temporal correlation of consecutive frames is exploited for data compression, a section with high temporal activity (e.g., high temporal unpredictability induced by rapidly moving objects or active camera motion) requires many more code bits for a given picture quality than a smooth section (e.g., small local motion that can be compensated completely). For video coding in storage applications, e.g., the video compact disk (VCD) and the digital video disk (DVD), high picture quality is strongly expected, but relative delays in the encoding procedure can be tolerated because the encoding and decoding procedures are not performed in real time.


Therefore, a more effective adaptive MPEG encoder can be developed to improve the temporal predictability among frames and provide high coding efficiency, and thus high picture quality at the same bit rate. The only constraint on such adaptive MPEG encoders is that their bit streams must follow the MPEG standard bit stream syntax, so that a universal decoder (hardware player) can decode the bit streams generated by different adaptive or fixed MPEG encoders.

The MPEG video bit stream syntax provides several properties that can be exploited to allocate the bits more efficiently among different frame types and different video components. One property is that the boundaries of GOPs (the number of frames in a GOP) can be adapted to the scene changes of the video sequence. The frames encoded in the same GOP should be selected from the same video shot to provide high temporal predictability among frames, because the basis for MPEG motion-based prediction and compensation is lost at shot boundaries. One video shot may be encoded as multiple GOPs according to its length, or multiple I frames may be used in a GOP to provide good random access. Therefore, a video coding scheme with adaptive GOP size can improve the temporal predictability among frames and provide high coding efficiency, and thus high picture quality at the same bit rate. Another property is that the quantization parameter Q_p can be exploited to modulate the step size of the quantizer, distributing the bits more efficiently and providing high picture quality. The parameter Q_p, an integer in the range [1, 31], is the only number that can be changed inside a frame to control the bit rate allocation; thus, the bits can be distributed more efficiently among different frame types and different video components to avoid coding artifacts.

Many researchers have suggested motion-compensated video coding algorithms with adaptive perceptual quantization. Puri and Aravind proposed an adaptive video coding system that exploits the masking of the human visual system for bit rate allocation among the three frame types (I, P, and B frames) in a GOP and classifies the macroblocks according to their variance.1


Gonzales and Viscito presented an adaptive quantization algorithm that characterizes the activity of macroblocks according to their DCT coefficients.2 More recently, Lee and Dickinson proposed an adaptive motion-compensated interpolation technique that exploits the temporal masking of the human visual system for bit rate allocation, where the interval between successive reference frames is adapted by minimizing the coding complexity of a GOP.3 Several adaptive bit rate allocation algorithms have also been suggested on the basis of dependent quantization and rate-distortion criteria.4-7 However, all these existing adaptive video coding schemes only try to allocate the bits more efficiently within a GOP whose size is fixed, so the improvements are limited. A video sequence can be partitioned into a set of video shots, and the idea of the video shot has been exploited for representing and indexing video sequences;8 hence the ideal adaptive video coding scheme should be able to adapt the boundaries of GOPs (the sizes of the GOPs) to the shot boundaries that indicate temporal unpredictability among successive frames. Moreover, the human visual system perceives a video sequence through its components, so an efficient bit rate allocation algorithm should be able to distribute the same bits according to the different video components to avoid coding artifacts. Segmentation-based video coding schemes are currently widely studied because they can provide higher quality of the reconstructed pictures by distributing the same bits more efficiently according to the different video components.9 Extracting meaningful real video objects from a video sequence is still not an easy task, but detecting structural video components at block resolution is simple yet useful for improving the performance of block-based video coding schemes by distributing the bits according to the coding complexity of the different video components.10 Therefore, an efficient adaptive video coding scheme is developed in this paper, where the boundaries of GOPs are adapted to the boundaries of the detected video shots and the positions of the references (I and P frames) in a GOP are adapted to provide high temporal predictability and coding efficiency.

This paper is organized as follows. In Sec. 2 the coding complexity measure is defined. A novel scene cut detection algorithm is proposed in Sec. 3. In Sec. 4, the reference position algorithm is proposed for adaptive video coding, where the algorithms for structural video segmentation and motion clustering are developed. The bit allocation algorithm is introduced in Sec. 5. The experimental results are given in Sec. 6. Section 7 contains concluding remarks.

2 Coding Complexity Measure

The quality with which an image sequence can be coded at a given bit rate largely depends on its coding complexity. The coding complexity of a video sequence includes two types: spatial coding complexity inside a frame and temporal coding complexity among frames. Since human beings are the final judges of the reconstructed pictures, an efficient coding complexity measure should incorporate the principles of the human visual system as much as possible.

A human being uses semantic units (regions) to perceive and recognize the various video components instead of looking at them pixel by pixel (or block by block); thus, the coding complexity of a video sequence can be determined by the spatial and temporal activities of its components. Taking a three-component image model11 (the object, the background, and the edges between them) into account, the spatial coding complexity can be measured by the number of blocks corresponding to dominant edges and to homogeneous objects (or background), because the human visual system is more sensitive to errors in these regions. Therefore, the video analysis technique should be able to detect the homogeneous regions among the texture regions and to distinguish the dominant edges from the textures efficiently. The main difficulty is finding a feature that separates the texture regions from the dominant edges and the monotone regions simultaneously. We first distinguish the homogeneous regions from the texture and dominant edge regions; the dominant edges are then separated from the texture regions through a merging and simplification procedure.

The temporal coding complexity (unpredictability) can be measured by the number of unpredictable blocks, because these blocks induce high motion compensation errors. Temporally unpredictable blocks may be induced by scene cuts (the basis for motion-based prediction and compensation is lost), camera motion (newly appearing objects and background for which no correspondence can be found in the previous reference), fast moving objects (uncovered background for which no correspondence can be found in the previous reference, and moving edges whose four blocks cannot be compensated efficiently by one motion vector), and textured stationary background (detected as changed regions because of camera jittering). Therefore, an effective temporal coding complexity analysis system should be able to detect scene cuts, camera motion, temporal changes of video components, and dominant edges and textures.

3 Scene Cut Detection

Since the video contents change around a shot boundary, the frames behind a scene cut cannot be predicted from a reference located before the scene cut by the MPEG motion-based prediction and compensation technique. The ideal adaptive video coding scheme should therefore adapt the boundaries of GOPs to the boundaries of the video shots. Each video shot should be taken as an independent coding unit to improve the temporal predictability among the frames inside a GOP and thus provide high coding efficiency. Ideally, a GOP should include frames selected from the same video shot, and one video shot may be encoded as multiple GOPs, or multiple I frames may be used inside a GOP, to provide random access. Figure 1 shows the block diagram of the proposed adaptive motion-compensated video coding scheme. The coding system first obtains the shot boundaries and adapts the positions of the reference frames to improve the temporal predictability among frames. The scene cuts are detected by comparing the color histogram differences among successive frames.8

Fig. 1 The block diagram of this proposed adaptive video coding scheme.

Many approaches to video shot detection have been proposed in the past.8,12 A comparison and a survey of these approaches are also available.13,14 Scene cut detection algorithms that work directly on compressed MPEG bit streams have been developed as well,15,16 and a scene cut detection algorithm based on K-means data clustering has also been proposed.17 In this paper, we propose a new scene cut detection algorithm based on two-class data classification, in which the optimal threshold for the classification is determined automatically by a fast search technique.

Let H_i(k) denote the color histogram of the current frame i and H_j(k) the color histogram of its previous frame j. The color histogram difference between them is then defined as18

  HD = \sum_{k=0}^{M} \frac{[H_i(k) - H_j(k)]^2}{[H_i(k) + H_j(k)]^2},   (1)

where k is one of M potential color levels. If the color histogram difference HD between the current frame and its previous frame is larger than an optimal threshold \bar{T}, the current frame is detected as a scene cut and a new video shot is generated automatically. The relationships among successive frames in a video sequence can thus be partitioned into two opposite classes, scene cut versus nonscene cut, on the basis of their color histogram differences and the optimal threshold \bar{T}. Since entropic thresholding techniques have been confirmed to be very efficient at obtaining the optimal threshold for such two-class data classification problems,19-22 an automatic scene cut detection algorithm can be derived.

The proposed scene cut detection algorithm first calculates the color histogram differences among the successive frames of the input video sequence. The probability P_{nsc}(i) for nonscene cut, where the color histogram difference HD between successive frames is equal to i (0 \le i \le T), is defined as

  P_{nsc}(i) = \frac{f_i}{\sum_{h=0}^{T} f_h},   0 \le i \le T,   (2)

where f_i denotes the number of frames whose color histogram difference HD with their previous frame is equal to i, and \sum_{h=0}^{T} f_h represents the total number of frames whose color histogram differences with their previous frames lie in the range 0 \le i \le T. The probability P_{sc}(i) for scene cut, where the color histogram difference HD between successive frames is equal to i (T+1 \le i \le M), is defined as

  P_{sc}(i) = \frac{f_i}{\sum_{h=T+1}^{M} f_h},   T+1 \le i \le M.   (3)

The entropies for the nonscene cut and scene cut frames are defined as

  H_{nsc}(T) = -\sum_{i=0}^{T} P_{nsc}(i) \log P_{nsc}(i),   nonscene cut,   (4)

  H_{sc}(T) = -\sum_{i=T+1}^{M} P_{sc}(i) \log P_{sc}(i),   scene cut.   (5)

The optimal threshold \bar{T} for scene cut detection is determined automatically by maximizing the following criterion function19-22:

  H(\bar{T}) = \max_{T=0,1,\ldots,M} \{H_{nsc}(T) + H_{sc}(T)\}.   (6)
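For concreteness, the exhaustive search of Eqs. (2)-(6) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the histogram f of HD values (one integer level per frame pair) and the use of the natural logarithm are assumptions.

```python
import numpy as np

def class_entropy(f_slice):
    """Entropy of one class, as in Eqs. (4)-(5); empty bins contribute zero."""
    total = f_slice.sum()
    p = f_slice[f_slice > 0] / total
    return -np.sum(p * np.log(p))

def brute_force_threshold(f):
    """Exhaustive maximization of Eq. (6). f[i] counts the frame pairs
    whose histogram difference HD equals level i, i = 0..M. Each of the
    M candidate thresholds costs O(M), hence O(M^2) overall."""
    f = np.asarray(f, dtype=float)
    M = len(f) - 1
    best_T, best_H = 0, -np.inf
    for T in range(M):                 # keep both classes nonempty
        if f[:T + 1].sum() == 0 or f[T + 1:].sum() == 0:
            continue
        H = class_entropy(f[:T + 1]) + class_entropy(f[T + 1:])
        if H > best_H:
            best_T, best_H = T, H
    return best_T
```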

The search complexity for obtaining the optimal threshold \bar{T} is bounded by O(M^2), because it takes O(M) computation time to obtain the two entropies for each candidate threshold and there are M candidates. To reduce the search burden of determining the optimal threshold \bar{T}, an efficient search algorithm is proposed that exploits recursive iterations in calculating the probabilities P_{nsc}(i), P_{sc}(i) and the entropies H_{nsc}(T), H_{sc}(T); the computation burden stems from repeatedly recalculating the normalization parts. We first define the total numbers of pairs for nonscene cut and scene cut [the normalization parts used in Eqs. (2) and (3)] when the threshold is set to T:

  P_0(T) = \sum_{h=0}^{T} f_h,   P_1(T) = \sum_{h=T+1}^{M} f_h.   (7)

The corresponding totals for nonscene cut and scene cut at T+1 can be calculated as

  P_0(T+1) = \sum_{h=0}^{T+1} f_h = P_0(T) + f_{T+1},
  P_1(T+1) = \sum_{h=T+2}^{M} f_h = P_1(T) - f_{T+1}.   (8)

The recursive iteration property of the corresponding entropies can be further exploited:

  H_{nsc}(T+1) = -\sum_{i=0}^{T+1} \frac{f_i}{P_0(T+1)} \log \frac{f_i}{P_0(T+1)}
               = \frac{P_0(T)}{P_0(T+1)} H_{nsc}(T) - \frac{f_{T+1}}{P_0(T+1)} \log \frac{f_{T+1}}{P_0(T+1)} - \frac{P_0(T)}{P_0(T+1)} \log \frac{P_0(T)}{P_0(T+1)},

  H_{sc}(T+1) = -\sum_{i=T+2}^{M} \frac{f_i}{P_1(T+1)} \log \frac{f_i}{P_1(T+1)}
              = \frac{P_1(T)}{P_1(T+1)} H_{sc}(T) + \frac{f_{T+1}}{P_1(T+1)} \log \frac{f_{T+1}}{P_1(T+1)} - \frac{P_1(T)}{P_1(T+1)} \log \frac{P_1(T)}{P_1(T+1)}.   (9)

One can see that the recursive iterations reduce the calculation of the probabilities and entropies to adding an increment, so the search burden is reduced to O(M). For the partition at threshold T+1, the exhaustive search algorithm always recalculates the normalization parts from 0 to T+1; the proposed fast algorithm updates them from T to T+1, so the search burden is heavily reduced. The temporal relationships among successive frames in a video sequence are then partitioned into two opposite classes on the basis of their color histogram differences and the optimal threshold \bar{T}:

  HD > \bar{T}:  scene cut,
  HD \le \bar{T}:  nonscene cut.   (10)

The video shots of the corresponding sequence are then generated automatically, and a video shot can be encoded as one or multiple GOPs according to its length. The scene cut frames and the initial frame of each GOP should be encoded as intraframe references (I frames).
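A sketch of the O(M) recursion of Eqs. (7)-(9), hedged the same way as above (the integer binning of HD is an assumption): after a one-time initialization, each step updates the class totals and entropies from the increment f_{T+1} alone.

```python
import numpy as np

def xlogx(v):
    return v * np.log(v) if v > 0 else 0.0

def fast_threshold(f):
    """Maximize Eq. (6) with the incremental updates of Eqs. (7)-(9):
    P0, P1, H_nsc, H_sc are carried from T to T+1, so after one O(M)
    initialization the whole scan costs O(M) instead of O(M^2)."""
    f = np.asarray(f, dtype=float)
    M = len(f) - 1
    P0, P1 = f[0], float(f[1:].sum())            # Eq. (7) at T = 0
    Hnsc = 0.0                                   # a single bin has zero entropy
    Hsc = -sum(xlogx(v) for v in f[1:] / P1)     # H_sc(0), computed once
    best_T, best_H = 0, Hnsc + Hsc
    for T in range(M - 1):                       # step T -> T+1
        ft = f[T + 1]
        P0n, P1n = P0 + ft, P1 - ft              # Eq. (8)
        if P1n <= 0:
            break
        # Eq. (9): entropy updates driven by the increment f_{T+1} only
        Hnsc = (P0 / P0n) * Hnsc - xlogx(ft / P0n) - xlogx(P0 / P0n)
        Hsc = (P1 / P1n) * Hsc + xlogx(ft / P1n) - xlogx(P1 / P1n)
        P0, P1 = P0n, P1n
        if Hnsc + Hsc > best_H:
            best_T, best_H = T + 1, Hnsc + Hsc
    return best_T
```

On the same histogram, `fast_threshold` should return the same \bar{T} as the brute-force search above; only the cost differs.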

4 Reference Frame Position Algorithm

In the proposed adaptive video coding scheme, only the key frames in a video shot are encoded as references (I or P frames). The key frames are detected according to the following criteria.8

a. Shot-based criterion: Given a video shot, the scene cut frame should be taken as a key frame and encoded as an intraframe reference (I frame); whether more than one key frame needs to be chosen in a shot depends on the following activity analysis of camera motion and temporal change of video components.

b. Camera-based criterion: Global camera motion is one important source of video content change and should be taken as a feature for key frame selection. For a zooming-like video shot, at least two frames, indicating the beginning and the end of the zooming-like camera motion, are selected to represent the global and the focused views of the video contents. For a panning-like video shot, the number of selected key frames depends on the magnitude of the panning-like camera motion.

c. Activity-based criterion: Another important source of video content change among frames is actively moving objects, which should be taken as a feature for key frame selection and coding complexity analysis. To analyze the activity of moving objects, the temporal changes of the video components are exploited.

The first criterion is satisfied by the proposed scene cut detection algorithm; the activities of moving objects and camera motion for key frame detection are further determined by a structural video component segmentation and motion clustering procedure.

4.1 Spatial Segmentation

Spatial segmentation is performed on each frame to extract the structural image components at block resolution. Since the local variance is a simple but efficient indication of image detail, the dominant edge information can be measured by (1) the locations of high variance of the intensity values and (2) the related average intensity values at these locations. The local variance and the average gray level, which indicate the local details and the average properties of the image, are therefore selected as features for structural image component segmentation, and a two-dimensional (2D) entropic thresholding technique is developed for determining the optimal segmentation vectors. The average gray level (indexed as MEAN in the following) of the current block at (m,n), which denotes the average property of the block, is defined as

  MEAN(m,n) = \frac{1}{64} \sum_{x=0}^{7} \sum_{y=0}^{7} I(x,y,t_n),   (11)

where I(x,y,t_n) is the gray level of the pixel at position (x,y), and (m,n) is the address of the block in the x and y directions. The local variance (indexed as LCON in the following) of the current block at (m,n), which indicates the local details of the block, is given by

  LCON(m,n) = \sum_{x=0}^{7} \sum_{y=0}^{7} |I(x,y,t_n) - MEAN(m,n)|^2.   (12)
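A compact way to compute Eqs. (11)-(12) for all 8 × 8 blocks at once is sketched below. Frame sides that are multiples of 8 are assumed, and the later quantization of MEAN/LCON into the W × S scatterplot bins is left out.

```python
import numpy as np

def block_features(frame):
    """MEAN and LCON of every 8x8 block, Eqs. (11)-(12).
    `frame` is a 2-D gray-level array; returns two arrays indexed by
    the block address (m, n)."""
    h, w = frame.shape
    blocks = frame.astype(float).reshape(h // 8, 8, w // 8, 8).swapaxes(1, 2)
    mean = blocks.mean(axis=(2, 3))                                   # Eq. (11)
    lcon = ((blocks - mean[..., None, None]) ** 2).sum(axis=(2, 3))   # Eq. (12)
    return mean, lcon
```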

In the 2D parameter scatterplot (MEAN, LCON) depicted in Fig. 2 (we further index MEAN as M and LCON as C in the following), the origin is defined as the upper left-hand corner; the average gray level increases from left to right, and the local variance increases from top to bottom. Since the largest gray level is W and the largest local variance level is S, there are SW elements in this 2D scatterplot. Each element represents the co-occurrence f_{i,j} of the corresponding pair (M_i, C_j) [where (M_i, C_j) denotes the pair in the 2D parameter scatterplot that has MEAN = i and LCON = j]. By means of two thresholds T and L, this 2D scatterplot is partitioned into four quadrants. Since the regions (blocks) interior to homogeneous objects or background contribute mainly to the quadrants with relatively low local variance, while the edges and textures contribute mainly to the regions with relatively high local variance, quadrants 0 and 1 contain the distributions of the homogeneous background and the homogeneous objects. The remaining quadrants, which have relatively high local variance, contain the distributions of the blocks near the dominant edges and the texture regions.

Fig. 2 The 2D scatterplot for 2D entropic thresholding.

The a priori probability of a pair (M_i, C_j) is given by the number of its co-occurrences f_{i,j} (f_{i,j} is equal to the number of blocks in the image that have MEAN = i and LCON = j) divided by the total number of blocks in the image. Regarding the two classes as independent distributions, the probabilities for the homogeneous background and the homogeneous objects can be defined as

  P_{B_{i,j}}(T,L) = \frac{f_{i,j}}{\sum_{h=0}^{T} \sum_{k=0}^{L} f_{h,k}},   (13)

  P_{O_{i,j}}(T,L) = \frac{f_{i,j}}{\sum_{h=T+1}^{W-1} \sum_{k=0}^{L} f_{h,k}}.   (14)

The entropies for the two classes are given by

  H_B(T,L) = -\sum_{i=0}^{T} \sum_{j=0}^{L} P_{B_{i,j}}(T,L) \log P_{B_{i,j}}(T,L),

  H_O(T,L) = -\sum_{i=T+1}^{W-1} \sum_{j=0}^{L} P_{O_{i,j}}(T,L) \log P_{O_{i,j}}(T,L).

The optimal threshold vector (\bar{T}, \bar{L}) selected for performing the image component partition has to satisfy the following criterion function:

  H(\bar{T}, \bar{L}) = \max_{T=0,1,\ldots,W-1;\ L=0,1,\ldots,S-1} \{H_B(T,L) + H_O(T,L)\}.   (15)
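The straightforward search of Eq. (15) can be sketched as below; it recomputes both quadrant entropies for every (T, L), which is exactly the O(W^2 S^2) burden that Sec. 4.2 removes. The co-occurrence array F (MEAN level × LCON level) is assumed to have been accumulated from the block features; this is an illustration, not the authors' code.

```python
import numpy as np

def quad_entropy(q):
    """Entropy of one quadrant's normalized co-occurrence mass."""
    p = q[q > 0] / q.sum()
    return -np.sum(p * np.log(p))

def threshold_2d(F):
    """Exhaustive maximization of Eq. (15). F[i, j] counts blocks with
    MEAN = i and LCON = j (shape W x S). Both quadrant entropies are
    recomputed for every pair, so the cost is O(W^2 S^2)."""
    F = np.asarray(F, dtype=float)
    W, S = F.shape
    best_T, best_L, best_H = 0, 0, -np.inf
    for T in range(W - 1):
        for L in range(S - 1):
            q0 = F[:T + 1, :L + 1]      # quadrant 0: homogeneous background
            q1 = F[T + 1:, :L + 1]      # quadrant 1: homogeneous objects
            if q0.sum() == 0 or q1.sum() == 0:
                continue
            H = quad_entropy(q0) + quad_entropy(q1)   # H_B + H_O
            if H > best_H:
                best_T, best_L, best_H = T, L, H
    return best_T, best_L
```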

4.2 2D Fast Algorithm

Finding the maximum of Eq. (15) has a computation burden bounded by O(W^2 S^2), because it takes O(WS) computation time to obtain the two entropies for each pair (MEAN, LCON) and there are WS such pairs. An efficient search algorithm is therefore proposed that exploits recursive iterations in calculating the probabilities P_{B_{i,j}}(T,L), P_{O_{i,j}}(T,L) and the entropies H_B(T,L), H_O(T,L); again, the computation burden comes from repeatedly recalculating the normalization parts. We first define the total numbers of pairs in quadrants 0 and 1 (shown in Fig. 2) as

  P_0(T,L) = \sum_{i=0}^{T} \sum_{j=0}^{L} f_{i,j},   P_1(T,L) = \sum_{i=T+1}^{W-1} \sum_{j=0}^{L} f_{i,j}.   (16)

The corresponding totals in quadrants 0 and 1 for partitioning the feature space at (T, L+1) can be calculated as

  P_0(T,L+1) = \sum_{i=0}^{T} \sum_{j=0}^{L+1} f_{i,j} = P_0(T,L) + \sum_{i=0}^{T} f_{i,L+1} = P_0(T,L) + P'(T,L+1),   (17)

  P_1(T,L+1) = \sum_{i=T+1}^{W-1} \sum_{j=0}^{L+1} f_{i,j} = P_1(T,L) + \sum_{i=T+1}^{W-1} f_{i,L+1} = P_1(T,L) + P^*(T,L+1).   (18)

The recursive iterations included in calculating P'(T,L+1) and P^*(T,L+1) can be further exploited by the 1D fast search algorithm described in Sec. 3. The recursive iteration property of the two corresponding entropies can also be exploited:

  H_B(T,L+1) = -\sum_{i=0}^{T} \sum_{j=0}^{L+1} \frac{f_{i,j}}{P_0(T,L+1)} \log \frac{f_{i,j}}{P_0(T,L+1)} = \frac{P_0(T,L)}{P_0(T,L+1)} H_B(T,L) + H'(T,L+1),   (19)

  H_O(T,L+1) = -\sum_{i=T+1}^{W-1} \sum_{j=0}^{L+1} \frac{f_{i,j}}{P_1(T,L+1)} \log \frac{f_{i,j}}{P_1(T,L+1)} = \frac{P_1(T,L)}{P_1(T,L+1)} H_O(T,L) + H^*(T,L+1),   (20)

where

  H'(T,L+1) = -\frac{P_0(T,L)}{P_0(T,L+1)} \log \frac{P_0(T,L)}{P_0(T,L+1)} - \sum_{i=0}^{T} \frac{f_{i,L+1}}{P_0(T,L+1)} \log \frac{f_{i,L+1}}{P_0(T,L+1)},

  H^*(T,L+1) = -\frac{P_1(T,L)}{P_1(T,L+1)} \log \frac{P_1(T,L)}{P_1(T,L+1)} - \sum_{i=T+1}^{W-1} \frac{f_{i,L+1}}{P_1(T,L+1)} \log \frac{f_{i,L+1}}{P_1(T,L+1)}.   (21)

The recursive iterations included in calculating H'(T,L+1) and H^*(T,L+1) can again be exploited by the 1D fast search algorithm described in Sec. 3. The iteration relationships for partitioning at (T+1, L) can be exploited in the same way: the corresponding totals in quadrants 0 and 1 for partitioning the feature space at (T+1, L) can be calculated as

  P_0(T+1,L) = \sum_{i=0}^{T+1} \sum_{j=0}^{L} f_{i,j} = P_0(T,L) + \sum_{j=0}^{L} f_{T+1,j} = P_0(T,L) + P'(T+1,L),

  P_1(T+1,L) = \sum_{i=T+2}^{W-1} \sum_{j=0}^{L} f_{i,j} = P_1(T,L) - \sum_{j=0}^{L} f_{T+1,j} = P_1(T,L) + P^*(T+1,L),   (22)

and the recursive iteration property of the two corresponding entropies is

  H_B(T+1,L) = -\sum_{i=0}^{T+1} \sum_{j=0}^{L} \frac{f_{i,j}}{P_0(T+1,L)} \log \frac{f_{i,j}}{P_0(T+1,L)} = \frac{P_0(T,L)}{P_0(T+1,L)} H_B(T,L) + H'(T+1,L),   (23)

  H_O(T+1,L) = -\sum_{i=T+2}^{W-1} \sum_{j=0}^{L} \frac{f_{i,j}}{P_1(T+1,L)} \log \frac{f_{i,j}}{P_1(T+1,L)} = \frac{P_1(T,L)}{P_1(T+1,L)} H_O(T,L) + H^*(T+1,L),   (24)

where

  H'(T+1,L) = -\frac{P_0(T,L)}{P_0(T+1,L)} \log \frac{P_0(T,L)}{P_0(T+1,L)} - \sum_{j=0}^{L} \frac{f_{T+1,j}}{P_0(T+1,L)} \log \frac{f_{T+1,j}}{P_0(T+1,L)},

  H^*(T+1,L) = -\frac{P_1(T,L)}{P_1(T+1,L)} \log \frac{P_1(T,L)}{P_1(T+1,L)} + \sum_{j=0}^{L} \frac{f_{T+1,j}}{P_1(T+1,L)} \log \frac{f_{T+1,j}}{P_1(T+1,L)}.

The 1D fast searching algorithm can further exploit the recursive iterations in calculating H'(T+1,L) and H^*(T+1,L). The 2D calculations are thus transformed into 1D calculations, and the 1D fast searching algorithm reduces the residual calculation burden further by adding only the increment each time, so the search burden is heavily reduced and bounded by O(2(W+S)).

The blocks in the current frame can first be classified into four groups on the basis of the determined optimal threshold pair (\bar{T}, \bar{L}): homogeneous background, homogeneous objects, textures, or edges. However, the dominant edge blocks cannot be distinguished from the texture blocks by the thresholding procedure alone. Since texture is essentially a neighborhood property, the distinction between dominant edge blocks and texture blocks is made by exploiting their different connection properties in a second-order neighborhood, as shown in Fig. 3. The dominant edge blocks connect to their neighbors in only two directions of their second-order neighborhood, as shown in Fig. 4. Texture blocks, on the other hand, connect to their neighbors in more than two directions, so texture blocks can be combined to form meaningful texture object or background regions of large size.

Fig. 3 The second-order neighborhood of the current block.

Fig. 4 The possible connections of the current edge block.
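The connection rule can be sketched as follows. This is an interpretation rather than the authors' code: the input mask of edge-or-texture blocks and the reading "at most two connected directions marks a dominant edge" are assumptions.

```python
import numpy as np

# The eight directions of the second-order neighborhood (Fig. 3).
DIRS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def split_edge_texture(mask):
    """mask[m, n] is True for blocks labeled edge-or-texture by the 2D
    thresholding. A block connecting such neighbors in at most two
    directions is kept as a dominant edge; more than two directions
    marks texture (assumed reading of the rule illustrated in Fig. 4)."""
    H, W = mask.shape
    edge = np.zeros_like(mask)
    texture = np.zeros_like(mask)
    for m in range(H):
        for n in range(W):
            if not mask[m, n]:
                continue
            dirs = sum(1 for dm, dn in DIRS
                       if 0 <= m + dm < H and 0 <= n + dn < W
                       and mask[m + dm, n + dn])
            if dirs <= 2:
                edge[m, n] = True
            else:
                texture[m, n] = True
    return edge, texture
```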

Therefore, this majority filtering procedure can distinguish the dominant edge blocks from the texture blocks efficiently, and the extracted image components play an important role in bit rate allocation. The spatial coding complexity of the current frame can also be determined automatically on the basis of the structural image components.

4.3 Temporal Segmentation

The temporal segmentation procedure is performed only on the nonscene cut frames to exploit the temporal relationships of the obtained image components among frames. The frame difference contrast (referred to as FCON) of the current block at (m,n), which indicates the temporal correlation between the pixels of the current block and their co-located pixels in the previous frame, is given by

  FCON(m,n) = \sum_{x=0}^{7} \sum_{y=0}^{7} |I(x,y,t_n) - I(x,y,t_{n-1})|^2,   (25)

where I(x,y,t_n) is the gray level of the pixel at (x,y) in the current block, and I(x,y,t_{n-1}) is the gray level of its co-located pixel in the previous frame. The probability for a temporally unchanged region that has FCON = i (0 \le i \le F) is defined as

  P_{uc}(i) = \frac{f_i}{\sum_{h=0}^{F} f_h}.   (26)

The probability for a temporally changed region that has FCON = i (F+1 \le i \le N) is defined as

  P_c(i) = \frac{f_i}{\sum_{h=F+1}^{N} f_h}.   (27)

The entropy for the unchanged regions (0 \le i \le F) can be defined as

  H_{uc}(F) = -\sum_{i=0}^{F} P_{uc}(i) \log P_{uc}(i),   (28)

and the entropy for the changed regions (F+1 \le i \le N) as

  H_c(F) = -\sum_{i=F+1}^{N} P_c(i) \log P_c(i).   (29)

The optimal temporal segmentation threshold \bar{F} is determined by the following criterion function:

  H(\bar{F}) = \max_{F=0,1,2,\ldots,N} \{H_{uc}(F) + H_c(F)\}.   (30)

The relationships between the blocks in the current frame and their co-located blocks in the previous frame can then be classified into two opposite groups on the basis of their frame difference contrasts and the optimal threshold \bar{F}:

  FCON(m,n) < \bar{F}:  unchanged block,
  FCON(m,n) \ge \bar{F}:  changed block.   (31)

The 1D fast searching algorithm described in Sec. 3 can also be used here to reduce the calculation burden to O(N).
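Pulling Eqs. (25) and (31) together, the temporal change mask can be sketched as below. The rounding of FCON into integer levels before building the histogram, and the reuse of a 1-D entropic threshold search such as the one sketched in Sec. 3, are assumptions.

```python
import numpy as np

def temporal_change_mask(cur, prev, search_threshold):
    """FCON of Eq. (25) per 8x8 block, then the split of Eq. (31).
    search_threshold: any 1-D entropic threshold search over a histogram
    (e.g., fast_threshold from Sec. 3). Returns True for changed blocks."""
    d = (cur.astype(float) - prev.astype(float)) ** 2
    h, w = d.shape
    fcon = d.reshape(h // 8, 8, w // 8, 8).swapaxes(1, 2).sum(axis=(2, 3))
    levels = np.rint(fcon).astype(int)            # integer FCON levels (assumed binning)
    F_bar = search_threshold(np.bincount(levels.ravel()))
    return fcon >= F_bar                          # Eq. (31)
```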

4.4 Structural Video Components

The motion estimation procedure, based on 16 × 16 macroblocks, is performed only on the macroblocks located in the change mask.23 Since each macroblock consists of four 8 × 8 blocks, a macroblock is detected as changed if at least one of its four blocks is located in the change mask. After the motion estimation procedure, the temporal and spatial segmentation results are integrated to provide the structural video components. This fusion procedure is based on some common rules.24-26 The uncovered background, which has high motion compensation errors, includes the blocks that are detected as changed regions by the temporal segmentation procedure but are taken as part of a large background region by the spatial segmentation procedure. The homogeneous moving objects, which have low motion compensation errors, consist of the blocks that are located in the change mask but are detected as object regions by the spatial segmentation procedure. Newly appearing objects are also detected as changed regions by the temporal segmentation technique, but they have high motion compensation errors because no correspondences can be found in their previous reference. Texture background and texture moving objects also have high motion compensation errors, like the newly appearing objects, but the motion vectors of their connected macroblocks are very similar. Therefore, the changed regions can be further partitioned into uncovered background induced by fast moving objects, moving objects, newly appearing objects or background induced by the global motion of the camera, texture stationary background induced by camera jittering, and moving edges. The unchanged regions can be partitioned into stationary background and still objects. The temporal coding complexity of the current frame can thus be determined automatically on the basis of the structural video components.

4.5 Motion Clustering

The type of camera motion can be determined by analyzing the distribution of the directional motion vectors of the dominant regions; the ideal patterns for zooming, panning, and tilting are shown in Fig. 5. Therefore, the camera motion type is first determined on the basis of the distribution of the estimated motion vectors. Two macroblocks B_1 and B_2, called a "macroblock pair" as shown in Fig. 6, are then used to compute three global motion parameters: Z (zooming), H (panning), and V (tilting). The major difference between this camera motion clustering algorithm and the technique presented by Jozawa et al.27 is that the camera motion types are first determined by analyzing the distribution of the motion vectors. This step is very important for obtaining correct camera motion vectors, because the zooming center cannot always be assumed to lie at the picture center. Similar equations are then used for calculating the camera motion vectors. For each macroblock pair, the three global parameters are obtained through the following formulas:

  Z = \frac{2P_1 - X_1 P_2 - Y_1 P_3}{X_2^2 + Y_2^2},

  H = \frac{-2X_1 P_1 + [2(I_1^2 + I_2^2) + Y_2^2] P_2 + X_1 Y_1 P_3}{2(X_2^2 + Y_2^2)},

  V = \frac{-2Y_1 P_1 + [2(J_1^2 + J_2^2) + X_2^2] P_3 + X_1 Y_1 P_2}{2(X_2^2 + Y_2^2)},   (32)

where

  P_1 = I_1 V_{1x} + I_2 V_{2x} + J_1 V_{1y} + J_2 V_{2y},   P_2 = V_{1x} + V_{2x},   P_3 = V_{1y} + V_{2y},   (33)

  X_1 = I_1 + I_2,   X_2 = I_1 - I_2,   Y_1 = J_1 + J_2,   Y_2 = J_1 - J_2,   (34)

and (I_1, J_1) and (I_2, J_2) are the coordinates of the centers of the two blocks B_1 and B_2, respectively; (V_{1x}, V_{1y}) is the motion vector of B_1 and (V_{2x}, V_{2y}) is the motion vector of B_2.

Fig. 5 The principles of the directional motion vectors for ideal zooming, panning, and tilting.

Fig. 6 The block pair for calculating the global motion vectors; (m,n) is the determined zoom center.

First, pattern 1 block pairs are selected, in which the two blocks are symmetrical about the center of focus (I_2 = -I_1, J_2 = -J_1); the three camera motion vectors are calculated for all possible block pairs of pattern 1, and the frequencies of the obtained values of each parameter (Z, H, and V) are counted. Second, pattern 2 block pairs are selected, in which the two blocks are symmetrical about the horizontal axis of the image (I_2 = I_1, J_2 = -J_1); the calculations are carried out for all possible block pairs of pattern 2, and the frequencies of these values are counted. Third, pattern 3 block pairs are selected, in which the two blocks are symmetrical about the vertical axis of the image (I_1 = -I_2, J_1 = J_2). Finally, the three frequency sets obtained through patterns 1, 2, and 3 are merged into one, and for each parameter the most frequent value is chosen as the final value of the corresponding camera motion vector.
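For one macroblock pair, Eqs. (32)-(34) translate directly into code. This sketch assumes the symmetric form of the V numerator used above when repairing the garbled equation; a quick consistency check is that a pure zoom z about the focus center (pattern 1, motion vectors V = (zI, zJ)) returns (z, 0, 0), and a pure pan h returns (0, h, 0).

```python
def global_motion(b1, b2):
    """Zoom/pan/tilt (Z, H, V) from one macroblock pair, Eqs. (32)-(34).
    b1 and b2 are ((I, J), (Vx, Vy)): block-center coordinates and the
    block's estimated motion vector."""
    (I1, J1), (V1x, V1y) = b1
    (I2, J2), (V2x, V2y) = b2
    P1 = I1 * V1x + I2 * V2x + J1 * V1y + J2 * V2y      # Eq. (33)
    P2 = V1x + V2x
    P3 = V1y + V2y
    X1, X2 = I1 + I2, I1 - I2                           # Eq. (34)
    Y1, Y2 = J1 + J2, J1 - J2
    D = X2 ** 2 + Y2 ** 2
    Z = (2 * P1 - X1 * P2 - Y1 * P3) / D                # Eq. (32)
    H = (-2 * X1 * P1 + (2 * (I1**2 + I2**2) + Y2**2) * P2 + X1 * Y1 * P3) / (2 * D)
    V = (-2 * Y1 * P1 + (2 * (J1**2 + J2**2) + X2**2) * P3 + X1 * Y1 * P2) / (2 * D)
    return Z, H, V
```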

4.6 Reference Frame Determination

Based on the determined scene cuts, the structural video components (which provide information on the activities of moving objects), and the camera motion vectors, the key frames in a video shot can be detected automatically on the basis of the three criteria given at the beginning of this section. The key frames in the first video shot of "table tennis" are given in Fig. 7, where frame 67 is detected as a scene cut. In the proposed adaptive video coding scheme, these key frames are encoded as reference frames (I or P frames) with more bits to provide high picture quality. Moreover, I (intraframe) frames are used in three cases: for the scene cut frames that indicate the beginning of a new video shot; for the first frames of the GOPs where a video shot has been partitioned into multiple GOPs and encoded separately; and for frames inside a GOP that are used to support other MPEG features such as fast forward display, backward display, and random access. The remaining key frames in a video shot are encoded as forward prediction references (P frames). The residual (nonkey) frames in a video shot are encoded as B (bidirectional interpolation) frames.

Fig. 7 The determined key frames (including the scene cut) in the first video shot of "table tennis" and the visual interface of the video analysis subsystem.

5 Bit Rate Allocation

In MPEG-based coding schemes, I frames are used as references not only for B frames but also for P frames, and P frames are in turn used as references for B frames; thus the quality of an I frame is very important and influences the quality of the following frames until a new I frame is set. The key frames that are encoded as references (I or P frames) represent the major contents of the video shot, and more bits should be allocated to them to provide high picture quality. The remaining frames in a video shot, encoded as B frames, can be quantized coarsely because they have high temporal predictability and are not used as references. Taking advantage of the temporal masking of the human visual system for bit rate allocation, the bit ratio among the three frame types (I, P, and B) is scaled by a parameter x: D_I = C_I x, D_P = C_P x, D_B = C_B x, where D_I, D_P, and D_B are the bits allocated to I, P, and B frames. The ratio among the parameters C_I, C_P, and C_B represents the optimal bit ratio among the three frame types for good picture quality; they were determined by experimental simulations as C_I = 290, C_P = 130, and C_B = 30, with which good results are obtained.

Few key frames are selected for a temporal region that has little camera motion and no very active moving objects; more key frames should be selected for a temporally active region with active camera motion or fast moving objects. The bit rate allocation among the GOPs of a video sequence and among the different frame types in a GOP therefore becomes an interesting problem. In a constant bit rate transmission system, the parameter x should be adapted to the changes of the video components and to the target bit rate. The total bit budget is first distributed among the GOPs of the video sequence according to their coding complexity (number of reference frames) and length (total number of I, P, and B frames):

  R = \sum_{i=1}^{M} R_i,   (35)

where M indicates the total number of GOPs of the corresponding video sequence, and R_i is the bit budget allocated to the ith GOP, depending on the length of the ith GOP and the number of key frames inside it. Denoting the numbers of I, P, and B frames in the ith GOP by N_I, N_P, and N_B, we have the relation

  R_i = N_I D_I + N_P D_P + N_B D_B,   (36)

where N_B is determined by the constraint

  N_B = N - (N_I + N_P),   (37)

with N the total number of frames in the ith GOP and (N_I + N_P) the total number of selected key frames (encoded as references); N_I also indicates the number of I frames. Based on the bits allocated to the three frame types, their initial quantization parameters Q_P are selected from different candidate strings on the basis of the parameter x, where a string comprises a set of Q_P values with one value for each x region.

The determined quantization parameters Q_P can be further adapted to the spatial coding complexity of the different video components to distribute the allocated bits more efficiently inside a frame. For I frames, only the spatial segmentation procedure is performed, and the number of blocks of each image component is obtained. These image components include homogeneous background, homogeneous objects, texture background, texture objects, and the dominant edges among them; their block counts are denoted by N_B, N_O, N_TB, N_TO, and N_E. The bit ratios for these major image components are scaled by a parameter y: D_B = C_B y, D_O = C_O y, D_TB = C_TB y, D_TO = C_TO y, D_E = C_E y, where D_B, D_O, D_TB, D_TO, and D_E are the bits allocated to these image components. The guiding principle for selecting the parameters C_B, C_O, C_TB, C_TO, and C_E is that more bits should be allocated to the dominant edge blocks to avoid blocking artifacts, because the human visual system is very sensitive to errors on strong edges, whereas a bright (or dark) object (background) can be quantized more coarsely than a midluminance region. The texture objects (background) can also be quantized more coarsely because the human visual system is less sensitive to errors in texture regions than in homogeneous regions. Since the bits allocated to an I frame are denoted by D_I, we have the relation

  y = \frac{D_I}{N_B C_B + N_O C_O + N_{TB} C_{TB} + N_{TO} C_{TO} + N_E C_E}.   (38)

The corresponding quantization parameters Q'_P for these major image components can then be selected from the candidate strings on the basis of the parameter y. In our experiments, these candidate strings are set as {Q_min, Q_P - 4, Q_P - 2, Q_P - 1, Q_P, Q_P + 1, Q_P + 2, Q_P + 4, Q_max}, where Q_min is the minimum candidate and Q_max is the maximum candidate. Longer strings may provide more accurate quantization parameters for the different image components, but they result in a more expensive search.

The blocks of the P and B frames are classified into eight major video components: homogeneous moving objects, homogeneous stationary background (or objects), texture stationary background (or objects), texture moving objects, uncovered background, newly appearing background (or objects), and the dominant edges among them. The bit ratios for these video components are obtained in the same way as in Eq. (38). The guiding principle for selecting the scale parameters of these eight video components is that more bits should be allocated to the locally unpredictable blocks such as uncovered background, newly appearing objects (or background), and moving object edges. Fast moving objects in front of a flat background are also encoded finely in order to avoid edge busyness and color bleeding. The final quantization parameters for these eight video components are also selected by searching the candidate strings.
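The frame-level and component-level budgets of Eqs. (36) and (38) are simple linear solves. A sketch follows; the component names and any C_k values a caller passes in are assumptions, since the paper fixes only C_I = 290, C_P = 130, and C_B = 30.

```python
# The paper's experimentally chosen frame-type weights.
C_FRAME = {"I": 290, "P": 130, "B": 30}

def gop_scale(R_i, n_frames):
    """Solve Eq. (36), R_i = sum_t N_t * C_t * x, for the scale x."""
    return R_i / sum(C_FRAME[t] * n for t, n in n_frames.items())

def component_budgets(D_I, counts, weights):
    """Eq. (38): split an I-frame budget D_I over image components.
    counts[k] is the block count N_k, weights[k] the scale C_k."""
    y = D_I / sum(counts[k] * weights[k] for k in counts)
    return {k: weights[k] * y for k in counts}

# Example: a 15-frame GOP with 1 I, 3 P, and 11 B frames and budget R_i:
#   x = gop_scale(R_i, {"I": 1, "P": 3, "B": 11}); D_I = C_FRAME["I"] * x.
```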

Table 1 The average test results for SIF "table tennis" as a comparison (with target bit rate 14 520 bits/frame).

  Scheme                        Fixed GOP scheme    Adaptive GOP scheme
  SNR (Y) (dB)                  32.65               33.42
  Search times (per frame)      11 486              4366
  Used bits (Y) (bits/frame)    14 538              14 510

Table 3 The average test results for CIF "carphone" as a comparison (with target bit rate 15 120 bits/frame).

  Scheme                        Fixed GOP scheme    Adaptive GOP scheme
  SNR (Y) (dB)                  31.65               32.58
  Search times (per frame)      13 578              4982
  Used bits (Y) (bits/frame)    15 200              14 852

6 Experimental Results

To evaluate the real performance of the proposed adaptive video coding scheme, we have tested many video sequences. We show the average test results of three video sequences that are well known in the video coding community, namely "table tennis," "carphone," and "foreman." The scene cut detection procedure is first performed to generate the video shots of the input sequences, and the coarse video components are then obtained by the proposed structural video segmentation technique. The key frames in a video shot are determined on the basis of the obtained video components and camera motion. The average test results of the proposed video coding scheme with adaptive GOP size are compared with those of a traditional MPEG encoder with fixed GOP size. We select three quantities as the efficiency measures: signal-to-noise ratio (SNR), search time for motion estimation, and the number of bits used. Since the chrominance components U and V are at half the resolution of the luminance component Y, the average bit rate and SNR are calculated only on the luminance component Y (see Tables 1-3).

The segmentation results for "table tennis" and "foreman" at block resolution are shown in Figs. 8 and 9. Figure 8(a) shows the reference frame 1 of "table tennis;" Fig. 8(b) gives the obtained changed regions and changed edge regions for frame 37 with respect to its reference frame 1 (more bits should be allocated to these regions to provide high picture quality); and Fig. 8(c) shows the changed regions and changed edge regions for frame 39 with respect to the same reference. Figures 8(d) and 8(e) are the reconstructed pictures for frames 37 and 39 at the bit rate of 14 520 bits/frame. Figure 9(a) is the reference frame 1 of "foreman;" Figs. 9(b) and 9(c) give the changed regions of frames 11 and 15 with respect to reference frame 1 after the compensation of camera motion; and Figs. 9(d) and 9(e) are the reconstructed pictures for frames 11 and 15 at the bit rate of 14 560 bits/frame.

Table 2 The average test results for CIF "foreman" as a comparison (with target bit rate 14 560 bits/frame).

  Scheme                        Fixed GOP scheme    Adaptive GOP scheme
  SNR (Y) (dB)                  29.28               30.52
  Search times (per frame)      14 560              4820
  Used bits (Y) (bits/frame)    14 586              14 450

Fig. 8 (a) The reconstructed reference frame 1 of "table tennis;" (b) the obtained changed regions and changed edge regions for frame 37 with respect to its reference frame 1; (c) the changed regions and changed edge regions for frame 39 with respect to its reference frame 1; (d) the reconstructed picture for frame 37; and (e) the reconstructed picture for frame 39.



Fig. 9 (a) The reconstructed reference frame 1 of "foreman;" (b) the changed regions of frame 11 with respect to its reference frame 1 after the compensation of camera motion; (c) the changed regions of frame 15 with respect to its reference frame 1 after the compensation of camera motion; (d) the reconstructed picture for frame 11; and (e) the reconstructed picture for frame 15.

7 Conclusions

An adaptive motion-compensated video coding scheme for video storage applications such as VCD and DVD is proposed in this paper, in which the bits are allocated more efficiently among the different frame types and the different video components. The boundaries of GOPs are adapted to the boundaries of the video shots of the input sequence, and the positions of the reference frames are adapted to the activity of the moving objects and the camera motion, improving the temporal predictability among frames and providing high coding efficiency, and thus high picture quality, at the same bit rate. Since the video sequence is partitioned into a set of video shots and these shots are encoded as separately accessible units, the compressed video sequence (bit stream) can be accessed accurately through these video shots; this property is very attractive for multimedia database applications such as content-based indexing, retrieval, and access control.

Acknowledgments

The authors would like to thank the anonymous reviewers for their helpful comments toward making this paper more readable. This work is supported by the National Science Foundation (NSF) under Contract No. IRI-9619812.

References
1. A. Puri and R. Aravind, "Motion-compensated video coding with adaptive perceptual quantization," IEEE Trans. Circuits Syst. Video Technol. 1(4), 351-360 (1991).
2. C. Gonzales and E. Viscito, "Motion video adaptive quantization in the transform domain," IEEE Trans. Circuits Syst. Video Technol. 1(4), 374-378 (1991).
3. J. Lee and B. W. Dickinson, "Temporally adaptive motion interpolation exploiting temporal masking in visual perception," IEEE Trans. Image Process. 3, 513-526 (1994).
4. B. Girod, "The information theoretical significance of spatial and temporal masking in video signals," Proc. SPIE 1077, 178-187 (1990).
5. K. Ramchandran, A. Ortega, and M. Vetterli, "Bit allocation for dependent quantization with application to multiresolution and MPEG video coders," IEEE Trans. Image Process. 3, 533-545 (1994).
6. J. Fan and F. Gan, "Adaptive motion-compensated interpolation based on spatiotemporal segmentation," Signal Process. Image Commun. 12, 59-70 (1998).
7. W. Ding and B. Liu, "Rate control of MPEG video coding and recording by rate-quantization modeling," IEEE Trans. Circuits Syst. Video Technol. 6, 12-20 (1996).


8. H. J. Zhang, A. Kankanhalli, and S. W. Smoliar, "Automatic partitioning of full-motion video," Multimedia Systems 1, 10-28 (1993).
9. D. Hepper, "Efficiency analysis and application of uncovered background prediction in a low bit rate image coder," IEEE Trans. Commun. 36, 1578-1584 (1990).
10. J. Fan, L. Zhang, and F. Gan, "Spatiotemporal segmentation based on two-dimensional spatiotemporal entropic thresholding," Opt. Eng. 36, 2845-2851 (1997).
11. X. Ran and N. Farvardin, "A perceptually motivated three-component image model, Part I: Description of the model," IEEE Trans. Image Process. 4, 401-415 (1995).
12. D. Swanberg, C.-F. Chang, and R. Jain, "Knowledge guided parsing in video databases," Proc. SPIE 2416, 13-24 (1992).
13. J. S. Boreczky and L. A. Rowe, "A comparison of video shot boundary detection techniques," J. Electron. Imaging 5, 122-128 (1996).
14. G. Ahanger and T. D. C. Little, "A survey of technologies for parsing and indexing digital video," J. Visual Commun. Image Represent. 7, 28-43 (1996).
15. B.-L. Yeo and B. Liu, "Rapid scene analysis on compressed video," IEEE Trans. Circuits Syst. Video Technol. 5, 533-544 (1995).
16. Y. Juan and S.-F. Chang, "Scene change detection in an MPEG compressed video sequence," Digital Video Compression: Algorithms and Technologies, Proc. SPIE 2419, 14-25 (1995).
17. B. Gunsel, A. M. Tekalp, and P. J. van Beek, "Content-based access to video objects: temporal segmentation, visual summarization, and feature extraction," J. Electron. Imaging 7, 592-604 (1998).
18. N. V. Patel and I. K. Sethi, "Video shot detection and characterization for video databases," Pattern Recogn. 30, 583-592 (1997).
19. J. Fan, R. Wang, L. Zhang, D. Xing, and F. Gan, "Image sequence segmentation based on 2D temporal entropy," Pattern Recogn. Lett. 17, 1101-1107 (1996).
20. J. Fan, G. Fujita, J. Yu, K. Miyanohana, T. Onoye, N. Ishiura, L. Wu, and I. Shirakawa, "Hierarchical object-oriented image and video segmentation algorithm based on 2D entropic thresholding," Proc. SPIE 3561, 141-151 (1998).
21. A. Brink, "Thresholding of digital images using two-dimensional entropies," Pattern Recogn. 25, 803-808 (1992).
22. N. Pal and S. Pal, "Entropic thresholding," Signal Process. 16, 97-108 (1989).
23. J. Fan and F. Gan, "Motion estimation based on uncompensability analysis," IEEE Trans. Image Process. 6, 1584-1587 (1997).
24. T. Meier, K. N. Ngan, and G. Crebbin, "Reduction of blocking artifacts in image and video coding," IEEE Trans. Circuits Syst. Video Technol. 9, 490-500 (1999).
25. A. A. Alatan, L. Onural, M. Wollborn, R. Mech, E. Tuncel, and T. Sikora, "Image sequence analysis for emerging interactive multimedia services: the European COST 211 framework," IEEE Trans. Circuits Syst. Video Technol. 8, 802-813 (1998).
26. T. Aach and A. Kaup, "Bayesian algorithms for adaptive change detection in image sequences using Markov random fields," Signal Process. Image Commun. 7, 147-160 (1995).
27. H. Jozawa, K. Kamikura, A. Sagata, K. Kotera, and H. Watanabe, "Two-stage motion compensation using adaptive global MC and local affine MC," IEEE Trans. Circuits Syst. Video Technol. 7(1), 75-85 (1997).

Jianping Fan obtained his MS degree in theoretical physics from Northwestern University. In 1997, he obtained his PhD in information engineering from the Shanghai Institute of Optics and Fine Mechanics, Chinese Academy of Sciences. He spent two years in the Department of Information Systems Engineering at Osaka University, Osaka, Japan, as a JSPS researcher working on VLSI design for very low bit rate CODEC systems. He is now a researcher in the Department of Computer Sciences at Purdue University, West Lafayette. He is the author of over 20 technical papers in major international conferences and journals and holds two Chinese national awards for the Promotion of Science and Technology. His current research interests include adaptive video coding, watermarking, video object extraction, semantic modeling, and indexing for multimedia database applications.

David K. Y. Yau received his BSc (first class honors) degree from the Chinese University of Hong Kong, and his MS and PhD degrees from the University of Texas at Austin, all in computer sciences. From 1989 to 1990, he was with the Systems and Technology group of Citibank, NA. He was the recipient of an IBM graduate fellowship, and is currently an assistant professor of computer sciences at Purdue University. He received an NSF CAREER award in 1999 for research in network and operating system architectures and algorithms for quality of service control.

Walid G. Aref received his PhD degree in computer science from the University of Maryland, College Park, in 1993. Since then, he has worked with the Matsushita Information Technology Laboratory and the University of Alexandria, Egypt. Currently, Dr. Aref is an associate professor in the Department of Computer Sciences, Purdue University. His current research interests include efficient query processing and optimization algorithms and data mining in spatial and multimedia databases. He is a member of the IEEE Computer Society and the ACM.

Abdelmounaam Rezgui received his MS degree from the University of Sciences and Technology (USTHB, Algeria) in 1995. In his thesis, he designed a fault-tolerant kernel for multicast communication in distributed systems, in which the basic one-to-one transmission mode was extended to offer a uniform distributed environment where communicating entities can be processes or groups of processes. His professional career began with serving as a lecturer at the Department of Computer Science of the USTHB from 1996 to 1998. In October 1998, he joined the PIP Laboratory at La Faculté Polytechnique de Mons (FPMs, Belgium), where he was involved in the design, implementation, and performance analysis of parallel multithreaded solvers for large banded linear systems. In August 1999, he joined the Indiana Center for Database Systems in the Department of Computer Sciences at Purdue University, Indiana, USA. His main current research interests include video segmentation and object extraction, high performance video servers, and multimedia databases.
