AUTOMATIC OBJECT DETECTION IN VIDEO SEQUENCES WITH CAMERA IN MOTION

Ninad Thakoor, Jean Gao and Huamei Chen
Computer Science and Engineering Department, The University of Texas at Arlington, TX 76019, USA

ABSTRACT

Automatic moving object detection/extraction has been explored extensively by the computer vision community. Unfortunately, the majority of this work has been limited to stationary cameras, for which background subtraction is the dominant methodology. In this paper we present a technique that tackles the moving-camera case, which is the situation most often encountered in real life for target tracking, surveillance, and similar applications. Instead of focusing on two adjacent frames, our object detection rests on three consecutive video frames: a backward frame, the frame of interest, and a forward frame. First, optical-flow-based simultaneous iterative camera motion compensation and background estimation is carried out on the backward and forward frames. The differences between the camera-motion-compensated backward and forward frames and the frame of interest are then tested against the estimated background models for intensity change detection. Next, these change detection results are combined to obtain the approximate shape of the moving object. Experimental results for a video sequence with a moving camera are presented.

1. INTRODUCTION

Moving object extraction or detection is an important preprocessing stage for problems such as object tracking, object-based video representation and coding, 3-D structure from 2-D shape, and shape-based object classification. Numerous techniques have been proposed in the literature over the past decades. These approaches can be broadly classified as either optical flow based [2]-[5] or change detection based [6]-[10].

The optical-flow-based object detection approach of Wang and Adelson [2] used an affine model to describe the flow field, and adaptive K-means clustering was carried out over individual pixels to minimize the squared distance between the cluster centers and the optical flow vectors. Borshukov et al. [3] combined the affine clustering approach with the dominant motion approach of Bergen et al. [12]; their method uses the residual error of the affine motion model over the optical flow as the criterion for a multistage merging procedure. Altunbasak et al. [4] used region-based affine clustering with color information, which yields motion boundaries that match the color segmentation boundaries. Celasun et al. [5] applied a 2-D mesh-based framework with optical flow to define coarse object boundaries; these boundaries were then refined by a constrained maximum-contrast path search to obtain the segmented objects. Objects extracted by optical-flow-based methods are regions that follow the same motion model. Thus objects exhibiting articulated motion are split into more than one object, and additional processing is required to extract the meaningful object in such circumstances.

As an example of a change-detection-based approach, the W4 visual surveillance system [7] applied the change difference of three consecutive frames together with a background model built from several seconds of video to detect the foreground region. The moving object segmentation algorithm proposed in [9] employed a background registration technique to construct a reliable background image from accumulated frame difference information; this model was then compared with the current frame to extract the foreground object. Kim and Hwang [8] utilized the edge map of the frame difference and a background edge map to separate the moving object from the video. These approaches [7], [8], [9] assume a stationary camera. Tsaig and Averbuch [10] applied a region labeling approach for the extraction of a moving object captured by a moving camera: frames are first segmented by watershed segmentation, and the resulting regions are then classified by MRF-based classification.

In this paper, we present a moving object extraction approach in which optical-flow-based background estimation and camera motion compensation are followed by intensity change detection on two consecutive displaced frame differences. The proposed method handles the moving-camera, moving-object situation, which is of common interest, to extract the object. First we model the camera motion and generate camera-motion-compensated frames. During the compensation process, we also generate an estimate of the background. Using this estimate to build a background model, we carry out intensity change detection on the displaced frame difference (DFD).

We combine these changes for the two consecutive DFDs to extract the moving areas in the center frame. This moving-area information is then merged with region-boundary information to obtain the final result. Figure 1 gives an overview of the method.

Figure 1: Overview of the proposed method. For each neighboring frame, camera motion is compensated, the DFD is calculated and moving regions are detected; the two moving-region masks are combined (AND) and merged with boundaries obtained by color segmentation to give the detected object.

2. CAMERA MOTION MODELING

Comparing corresponding pixels in two frames is one of the simplest techniques for detecting motion between the frames. For a video taken by a stationary camera, the corresponding pixel position in the next frame is the same as in the current frame. For a video captured by a moving camera, however, this correspondence cannot be established without knowing the nature of the camera motion. In this section we develop an approach to determine the motion of the camera.

Consider two frames I(x, y, t) and I(x, y, t ± δt) captured by a moving camera. These frames contain a foreground object in motion and a static background. Let the camera motion vectors between the frames be (CV_x, CV_y) and the object motion vectors be (OV_x, OV_y). For a moving-camera, moving-object video, the apparent motion at each pixel is the combination of camera motion and object motion. Hence the motion vectors (FV_x, FV_y) for the frames can be expressed as:

FV_x(x, y) = CV_x(x, y) + OV_x(x, y),   (1)
FV_y(x, y) = CV_y(x, y) + OV_y(x, y).   (2)

Since object motion is absent for background pixels, we can further write:

CV_x(x, y) = FV_x(x, y),   (3)
CV_y(x, y) = FV_y(x, y).   (4)

All the motions above are 2-D frame motions, i.e., the projections of the corresponding 3-D motions onto the image plane. All motions are calculated with respect to frame I(x, y, t), i.e., in this frame the object and the camera are assumed to be stationary; this frame is referred to as the reference frame.

An approximation of the frame motion can be found by measuring the optical flow. We use the Lucas-Kanade method [15] with the modifications of [16] to obtain the optical flow between the frames. Without loss of generality, one can assume that the object does not cover the corners of the image. In that case the dominant motion at the image corners is the relative motion between the camera and the background, and the camera motion can be modeled based on this observation. We apply an affine motion model to describe the camera motion, defined as:

CV_x(x, y) = (a_1 - 1)x + a_2 y + a_3,   (5)
CV_y(x, y) = a_4 x + (a_5 - 1)y + a_6,   (6)

where a_1, a_2, a_3, a_4, a_5 and a_6 are the affine motion parameters. The affine model for the camera motion is initialized from the optical flow at the four corners of the image. In matrix form, for every pixel (x, y) with B(x, y) = 1:

\begin{bmatrix} x & y & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & x & y & 1 \end{bmatrix}
\begin{bmatrix} a_1 & a_2 & a_3 & a_4 & a_5 & a_6 \end{bmatrix}^T
=
\begin{bmatrix} x + FV_x(x, y) \\ y + FV_y(x, y) \end{bmatrix},   (7)

where B(x, y) is the background mask, initialized as

B(x, y) = \begin{cases} 1 & \text{if } (x, y) \in \text{image corners}, \\ 0 & \text{otherwise}. \end{cases}   (8)

Estimates of the affine parameters, \hat{a}_1, \hat{a}_2, \hat{a}_3, \hat{a}_4, \hat{a}_5 and \hat{a}_6, can be obtained as the linear least-squares solution of Eq. (7).
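As a concrete illustration of Eqs. (5)-(7), the following sketch (plain NumPy; our own illustration, not the authors' code, with variable names of our choosing) stacks one row pair of Eq. (7) per background pixel and solves for the six affine parameters by linear least squares:

```python
import numpy as np

def fit_affine_camera_motion(flow_x, flow_y, bg_mask):
    """Least-squares fit of the affine camera-motion model, Eq. (7).

    flow_x, flow_y : (H, W) optical-flow components FV_x, FV_y.
    bg_mask        : (H, W) boolean mask B(x, y); initially the four
                     corner windows, later the refined background mask.
    Returns the parameter vector (a_1, ..., a_6) of Eqs. (5)-(6).
    """
    ys, xs = np.nonzero(bg_mask)              # pixel coordinates with B = 1
    n = xs.size

    # Build the 2n x 6 design matrix of Eq. (7): one row pair per pixel.
    A = np.zeros((2 * n, 6))
    A[0::2, 0] = xs; A[0::2, 1] = ys; A[0::2, 2] = 1.0
    A[1::2, 3] = xs; A[1::2, 4] = ys; A[1::2, 5] = 1.0

    # Right-hand side: x + FV_x(x, y) and y + FV_y(x, y).
    b = np.empty(2 * n)
    b[0::2] = xs + flow_x[ys, xs]
    b[1::2] = ys + flow_y[ys, xs]

    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return params                              # (a_1, a_2, a_3, a_4, a_5, a_6)
```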

In the presence of outliers, the performance of linear least-squares estimation deteriorates. Conditions such as lost features, a textureless background, or the object occupying one of the corners give rise to outliers. When some of the corner areas are textureless, the optical flow obtained for those areas can be incorrect. In another scenario, if the object is not at the center of the frame and occupies part of a corner area, the motion of the object is treated as background motion. The lost-feature problem arises for parts of the background that move out of the frame in the next frame due to the camera motion; optical flow cannot be determined for these lost features. To handle the outliers, we obtain robust estimates of the affine motion parameters via iteratively reweighted least squares [1]. This method is based on weighted least squares. For the first iteration all samples are weighted equally, so the weighted least-squares solution reduces to the ordinary linear least-squares solution. Residuals are then calculated for this model: outlier samples, which do not fit the model well, have large residuals, whereas samples fitting the model have small residuals. In the next iteration of weighted least squares, the outliers are weighted less than the samples fitting the model, so they have less effect on the new model estimate. This process is repeated until the affine motion model converges. An important point to note is that even in the presence of these outlier conditions, individually or in combination, our assumption that the dominant motion at the corners is the background motion does not fail, and we are still able to estimate the camera motion model.

3. BACKGROUND MOTION ESTIMATION

From the above estimate of the camera motion model, we can obtain the background motion vectors (\hat{CV}_x, \hat{CV}_y) as:

\hat{CV}_x(x, y) = (\hat{a}_1 - 1)x + \hat{a}_2 y + \hat{a}_3,   (9)
\hat{CV}_y(x, y) = \hat{a}_4 x + (\hat{a}_5 - 1)y + \hat{a}_6.   (10)
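The robust parameters \hat{a}_1, ..., \hat{a}_6 used in Eqs. (9)-(10) come from the iteratively reweighted least-squares procedure described at the end of Section 2. Below is a minimal sketch of one such IRLS loop (Huber-style weights [1], operating on the design matrix A and right-hand side b of Eq. (7) built as in the previous sketch); it is our own illustration, not the authors' exact implementation:

```python
import numpy as np

def irls(A, b, n_iter=10, k=1.5):
    """Iteratively reweighted least squares with Huber-style weights [1].

    A, b are the design matrix and right-hand side of Eq. (7), built as in
    the previous sketch. The first pass weights all samples equally (plain
    least squares); later passes down-weight samples whose residuals are
    large, so outliers barely influence the final affine estimate.
    """
    w = np.ones(len(b))                      # equal weights: ordinary LS
    params = np.zeros(A.shape[1])
    for _ in range(n_iter):
        sw = np.sqrt(w)
        params, *_ = np.linalg.lstsq(A * sw[:, None], sw * b, rcond=None)
        r = b - A @ params                               # current residuals
        scale = 1.4826 * np.median(np.abs(r)) + 1e-12    # robust scale (MAD)
        w = np.minimum(1.0, k * scale / (np.abs(r) + 1e-12))  # Huber weights
    return params                            # robust (a_1, ..., a_6)
```

The returned vector plays the role of (\hat{a}_1, ..., \hat{a}_6) in Eqs. (9)-(10).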

The squared difference SQD between the optical flow (FV_x, FV_y) and the estimates (\hat{CV}_x, \hat{CV}_y) is examined at each pixel to classify the pixel as background or foreground:

SQD(x, y) = \{\hat{CV}_x(x, y) - FV_x(x, y)\}^2 + \{\hat{CV}_y(x, y) - FV_y(x, y)\}^2.   (11)

Since object motion (OV_x, OV_y) is absent in background areas, Eq. (11) becomes:

SQD(x, y) = \{\hat{CV}_x(x, y) - CV_x(x, y)\}^2 + \{\hat{CV}_y(x, y) - CV_y(x, y)\}^2.   (12)

Eq. (12) gives the residuals of the camera motion model. Background pixels fit this motion model and have low residual values, so pixels with low SQD can be assigned to the background. SQD is thresholded to generate a new background mask:

B(x, y) = \begin{cases} 1 & \text{if } SQD(x, y) < B_{th}, \\ 0 & \text{otherwise}, \end{cases}   (13)

where B_{th} is the background detection threshold. Since outliers do not fit the model and have large residuals, they are eliminated from the background mask. We then refine the estimate of the affine camera motion model based on this newly obtained background mask by reinserting the current background estimate into Eq. (7). Using this camera motion model, correspondence between pixels of frames I(x, y, t) and I(x, y, t ± δt) can be established. In the following section, we discuss how the motion model obtained here is used to compute DFDs and to extract the object.

4. OBJECT EXTRACTION BY DISPLACED FRAME DIFFERENCE

Once we have the camera motion model, we can compensate for the camera motion. Compensation gives us a pixel-by-pixel correspondence similar to that of a sequence taken by a stationary camera, so we can compare corresponding pixels to detect changes. The camera-motion-compensated image for frame I(x, y, t ± δt) is calculated from the final model estimate as:

I_c(x, y, t ± δt) = I(x - CV^*_x(x, y), y - CV^*_y(x, y), t ± δt).   (14)

Since the motion vectors (CV^*_x, CV^*_y) are real-valued, sub-pixel calculation of image intensities is required to obtain the motion-compensated image; this can be done with any suitable interpolation technique. Comparison between the reference frame I(x, y, t) and the compensated frame I_c(x, y, t ± δt) is done by taking the difference between the two frames, called the displaced frame difference (DFD). Given two frames of a video sequence, I(x, y, t_1) and I(x, y, t_2) (with t_1 < t_2), the forward DFD at time t_1 is given by:

D_{(t_1,t_2)}(x, y, t_1) = I(x, y, t_1) - I_c(x, y, t_2),   (15)

and the backward DFD at time t_2 by:

D_{(t_1,t_2)}(x, y, t_2) = I_c(x, y, t_1) - I(x, y, t_2).   (16)
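To make Eqs. (9)-(16) concrete, here is a minimal NumPy/SciPy sketch (our own illustration; bilinear interpolation is chosen as the "suitable interpolation technique", and function names are ours): it predicts the background flow from the estimated affine parameters, thresholds SQD to update the background mask, warps a neighboring frame toward the reference frame, and forms the DFD.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def affine_flow(params, shape):
    """Background motion field (CV^_x, CV^_y) of Eqs. (9)-(10)."""
    a1, a2, a3, a4, a5, a6 = params
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    cvx = (a1 - 1.0) * xs + a2 * ys + a3
    cvy = a4 * xs + (a5 - 1.0) * ys + a6
    return cvx, cvy

def background_mask(flow_x, flow_y, cvx, cvy, bth):
    """Threshold the squared deviation SQD, Eqs. (11)-(13)."""
    sqd = (cvx - flow_x) ** 2 + (cvy - flow_y) ** 2
    return sqd < bth

def compensate(frame, cvx, cvy):
    """Camera-motion-compensated frame, Eq. (14), with bilinear sampling."""
    ys, xs = np.mgrid[0:frame.shape[0], 0:frame.shape[1]]
    coords = np.stack([ys - cvy, xs - cvx])   # sample at (y - CV_y, x - CV_x)
    return map_coordinates(frame.astype(float), coords, order=1, mode='nearest')

def dfd(reference, compensated):
    """Displaced frame difference, Eqs. (15)-(16)."""
    return reference.astype(float) - compensated
```

In the paper's iterative scheme, the affine fit, the background-flow prediction and the mask update of Eq. (13) alternate until the model converges; the final parameters are then used for the compensation and DFD steps above.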

For an object moving against a plain background, the pixels of a frame belong to four different situations:

• Common background for both frames, S_{(t_1,t_2)}(t_1):

S_{(t_1,t_2)}(t_1) ∩ O_{(t_1)}(t_1) = ∅,   (17)
S_{(t_1,t_2)}(t_1) ∩ O_{(t_2)}(t_1) = ∅.   (18)

• Overlap of the moving object positions, L_{(t_1,t_2)}(t_1):

L_{(t_1,t_2)}(t_1) = O_{(t_1)}(t_1) ∩ O_{(t_2)}(t_1).   (19)

• Newly uncovered background, U_{(t_1,t_2)}(t_1):

O_{(t_1)}(t_1) = L_{(t_1,t_2)}(t_1) ∪ U_{(t_1,t_2)}(t_1),   (20)
L_{(t_1,t_2)}(t_1) ∩ U_{(t_1,t_2)}(t_1) = ∅.   (21)

• Newly covered background, C_{(t_1,t_2)}(t_1):

O_{(t_2)}(t_1) = L_{(t_1,t_2)}(t_1) ∪ C_{(t_1,t_2)}(t_1),   (22)
L_{(t_1,t_2)}(t_1) ∩ C_{(t_1,t_2)}(t_1) = ∅.   (23)

Figure 2: (a) Compensated frame I_c(x, y, t_1); (b) frame I(x, y, t_2); (c) compensated frame I_c(x, y, t_3); (d) various areas for DFD D_{(t_1,t_2)}(t_2); (e) various areas for DFD D_{(t_2,t_3)}(t_2).

From Figure 2, the set W is defined as:

W = S_{(t_1,t_2)}(t_1) ∪ O_{(t_1)}(t_1) ∪ O_{(t_2)}(t_1).   (24)

Let the area in which changes take place be denoted by the set E_{(t_1,t_2)}(t_1), which can be expressed as:

E_{(t_1,t_2)}(t_1) = O_{(t_1)}(t_1) ∪ O_{(t_2)}(t_1).   (25)

For a slowly moving object this area is approximately the same as O_{(t_1)}(t_1) or O_{(t_2)}(t_1), but for a fast-moving object this is not true. Thus the exact object shape cannot be extracted by change detection on a single DFD. To overcome this problem, we combine the forward and backward DFDs of the reference frame, based on the method proposed in [14].

Now, for the same sequence as above, consider three consecutive frames I(x, y, t_1), I(x, y, t_2) and I(x, y, t_3). Our goal is to extract the object shape at time t_2. For this purpose, we first calculate the backward and forward DFDs for the center frame, i.e. D_{(t_1,t_2)}(t_2) and D_{(t_2,t_3)}(t_2) respectively. From Figure 2 we can write for D_{(t_1,t_2)}(t_2):

W = S_{(t_1,t_2)}(t_2) ∪ O_{(t_1)}(t_2) ∪ O_{(t_2)}(t_2),   (26)

and similarly for D_{(t_2,t_3)}(t_2):

W = S_{(t_2,t_3)}(t_2) ∪ O_{(t_2)}(t_2) ∪ O_{(t_3)}(t_2).   (27)

The areas in which changes take place in the two DFDs can be written as:

E_{(t_1,t_2)}(t_2) = U_{(t_1,t_2)}(t_2) ∪ O_{(t_2)}(t_2) = O_{(t_1)}(t_2) ∪ O_{(t_2)}(t_2),   (28)
E_{(t_2,t_3)}(t_2) = C_{(t_2,t_3)}(t_2) ∪ O_{(t_2)}(t_2) = O_{(t_2)}(t_2) ∪ O_{(t_3)}(t_2).   (29)

From Figure 2 we can see that the area common to the changed areas E_{(t_1,t_2)}(t_2) and E_{(t_2,t_3)}(t_2) is the object area O_{(t_2)}(t_2). Thus we can extract the moving object as:

E_{(t_1,t_2)}(t_2) ∩ E_{(t_2,t_3)}(t_2) = O_{(t_2)}(t_2).   (30)
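Operationally, Eq. (30) amounts to intersecting two binary change masks computed from the backward and forward DFDs. A minimal sketch follows (our own illustration; a simple global threshold on |DFD| stands in for the change test against the estimated background model):

```python
import numpy as np

def change_mask(dfd, threshold):
    """Binary change-detection mask E from a displaced frame difference.

    A global threshold on |DFD| is used here as a stand-in for the
    background-model test described in the paper.
    """
    return np.abs(dfd) > threshold

def extract_object(dfd_backward, dfd_forward, threshold=20.0):
    """Approximate object support O_(t2)(t2) via Eq. (30): intersection
    of the backward and forward change masks of the center frame."""
    e_back = change_mask(dfd_backward, threshold)   # E_(t1,t2)(t2)
    e_fwd = change_mask(dfd_forward, threshold)     # E_(t2,t3)(t2)
    return e_back & e_fwd                           # O_(t2)(t2)
```

In the full method, the resulting mask is further refined with the color-segmentation boundary information before the final object is produced.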

5. EXPERIMENTAL RESULTS

The moving object extraction approach discussed in this paper was implemented and tested on a variety of sequences. A hierarchical implementation of the Lucas-Kanade algorithm [15], [16] with three levels of hierarchy was used to determine the optical flow during the camera motion modeling stage. The camera motion model was initialized by a robust affine motion model fitted to the optical flow of 20×20 square windows at the four corners of the reference frame. This window size was selected to withstand the loss of features during camera motion: with a smaller window and larger camera motions, most of the features in these windows might be lost, leading to failure of the background detection process.

We use a "Car sequence" to explain the steps and demonstrate the performance of our algorithm. This is a typical outdoor surveillance video. The object, i.e. the car, exhibits rapid, rigid motion. The camera motion in this sequence is also rapid, in order to keep the moving object approximately centered. Part of the background is detailed, but part of it, i.e. the road, is textureless. Figure 3 illustrates the camera motion compensation and background estimation process for this sequence. Even though the bottom corners of the frames are textureless and some of the corner features are lost due to the rapid camera motion, camera motion compensation and background estimation are successful. Figure 3(d) and (e) show the first and final iteration results of the background estimation.
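For reference, pyramidal Lucas-Kanade flow at the four corner windows can be obtained with an off-the-shelf implementation such as OpenCV. The sketch below shows one possible setup under the stated configuration (20×20 corner windows, three pyramid levels); it is not the authors' original code, and the grid spacing is our own choice:

```python
import numpy as np
import cv2

def corner_window_flow(prev_gray, next_gray, win=20, step=4):
    """Pyramidal Lucas-Kanade flow for points in the four corner windows.

    prev_gray, next_gray : uint8 grayscale frames.
    Returns the sampled point positions, their flow vectors, and a status
    flag per point (0 where the feature was lost, e.g. moved out of frame).
    """
    h, w = prev_gray.shape
    # Sample a regular grid of points inside each win x win corner window.
    pts = [(x, y)
           for y0 in (0, h - win) for x0 in (0, w - win)
           for y in range(y0, y0 + win, step)
           for x in range(x0, x0 + win, step)]
    prev_pts = np.array(pts, dtype=np.float32).reshape(-1, 1, 2)

    next_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, prev_pts, None,
        winSize=(15, 15), maxLevel=2)   # maxLevel=2 -> three pyramid levels
    flow = (next_pts - prev_pts).reshape(-1, 2)   # (FV_x, FV_y) per point
    return prev_pts.reshape(-1, 2), flow, status.ravel()
```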


Figure 3: Camera motion compensation for the car sequence: (a) center frame of the sequence; (b) frame 3 of the sequence; (c) optical flow from the middle frame to the 3rd frame; (d) initial estimate of the background; (e) final estimate of the background; (f) camera-motion-compensated 3rd frame.

In Figure 3(d) and (e), white areas represent the background and black areas are either outliers or belong to the moving object. These results are quite satisfactory for this sequence, apart from the bottom-right and bottom-left edges of the frame and the top of the car. One can see in Figure 3(c) that the optical flow in these regions is estimated incorrectly. The bottom-right and bottom-left edges of the frame have no texture information, which leads to incorrect optical flow, and the errors at the top of the car arise from the car's transparent windows. Additionally, the phenomenon of the object's optical flow attaching to the textureless background is visible around the object boundaries.

Figure 4 shows the various steps of the object extraction process from the forward and backward DFDs. We can see in this figure that the moving object is properly detected, although its boundaries are not well defined. After we combine the region boundary information with Figure 4(e), we obtain the final object extraction result shown in Figure 4(h). As our method does not eliminate the object's shadow, the shadow is detected as part of the object. The shape extraction results are good, apart from the front of the car, where a small region of background is attached to the object.

Figure 4: Object extraction for the car sequence: (a) backward DFD for the red channel; (b) forward DFD for the red channel; (c) changes detected in (a); (d) changes detected in (b); (e) moving areas estimated in the center frame; (f) color segmentation results; (g) moving edges; (h) extracted object.

6. CONCLUSIONS

Moving object extraction applications can be divided into various classes, each with different video analysis requirements [17]. In this paper we presented an automatic moving object extraction approach for video captured by a moving camera, which can be utilized for a range of applications. The approach combines motion information, in the form of optical flow, with change detection. The experimental results presented show the suitability of our approach for a variety of applications such as indoor and outdoor surveillance as well as object-based video coding. The accuracy of the object boundaries makes our technique appropriate for future shape-based classification and structure-from-shape problems.

7. REFERENCES

[1] P. J. Huber, Robust Statistics. New York: Wiley, 1981.

[2] J. Y. A. Wang and E. H. Adelson, "Representing moving images with layers," IEEE Transactions on Image Processing, vol. 3, no. 5, pp. 625-638, Sept. 1994.

[3] G. D. Borshukov, G. Bozdagi, Y. Altunbasak, and A. M. Tekalp, "Motion segmentation by multistage affine classification," IEEE Transactions on Image Processing, vol. 6, no. 11, pp. 1591-1594, Nov. 1997.

[4] Y. Altunbasak, P. E. Eren, and A. M. Tekalp, "Region-based parametric motion segmentation using color information," Graphical Models and Image Processing, pp. 13-23, Jan. 1998.

[5] I. Celasun, A. M. Tekalp, M. H. Gökçetekin, and D. M. Harmanci, "2-D mesh-based video object segmentation and tracking with occlusion resolution," Signal Processing: Image Communication, vol. 16, no. 10, pp. 949-962, Aug. 2001.

[6] A. Elgammal, R. Duraiswami, D. Harwood, and L. S. Davis, "Background and foreground modeling using nonparametric kernel density estimation for visual surveillance," Proceedings of the IEEE, vol. 90, no. 7, pp. 1151-1163, July 2002.

[7] I. Haritaoglu, D. Harwood, and L. S. Davis, "W4: real-time surveillance of people and their activities," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 809-830, Aug. 2000.

[8] C. Kim and J.-N. Hwang, "Fast and automatic video object segmentation and tracking for content-based applications," IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 2, pp. 122-129, Feb. 2002.

[9] S.-Y. Chien, S.-Y. Ma, and L.-G. Chen, "Efficient moving object segmentation algorithm using background registration technique," IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 7, pp. 577-586, July 2002.

[10] Y. Tsaig and A. Averbuch, "Automatic segmentation of moving objects in video sequences: a region labeling approach," IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 7, pp. 597-612, July 2002.

[11] J. Fan, Y. Ji, and L. Wu, "Automatic moving object extraction toward content-based video representation and indexing," Journal of Visual Communication and Image Representation, vol. 12, no. 3, pp. 306-347, Sept. 2001.

[12] J. R. Bergen, P. J. Burt, and K. Hanna, "Dynamic multiple-motion computation," in Artificial Intelligence and Computer Vision, pp. 147-156, Amsterdam: Elsevier, 1992.

[13] J. O. Street, R. J. Carroll, and D. Ruppert, "A note on computing robust regression estimates via iteratively reweighted least squares," The American Statistician, vol. 42, no. 2, pp. 152-154, May 1988.

[14] M.-P. Dubuisson and A. K. Jain, "Object contour extraction using color and motion," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 471-476, June 1993.

[15] B. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proceedings of the Image Understanding Workshop, pp. 121-130, 1981.

[16] J. Shi and C. Tomasi, "Good features to track," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 593-600, 1994.

[17] P. L. Correia and F. Pereira, "Classification of video segmentation application scenarios," IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 735-741, May 2004.