Layers-Based Image Segmentation Incorporating Motion Estimation with Static Segmentation

Yu Huang, Heinrich Niemann
Chair for Pattern Recognition, Dept. of Computer Science
University of Erlangen-Nürnberg, 91058 Erlangen, Germany
[email protected], [email protected]

Abstract

This paper addresses the problem of motion segmentation using a multilayer representation. First, a coarse optic flow field is estimated with the robust Simultaneous Over-Relaxation (SOR) algorithm, while an intensity segmentation is computed by a watershed algorithm. The image is divided into nonoverlapping rectangular tiled regions, and within these subregions affine motion models are fitted to the optic flow. Using the ISODATA clustering algorithm, the affine models are grouped into a small number of motion models. Finally, based on the registration errors obtained by image warping on each watershed segment, the support maps of the layers are derived. Experimental results on real images demonstrate the efficiency of the method.

Keywords: Optic flow, ISODATA algorithm, watershed, robust estimation, static segmentation.

1. Introduction

The segmentation of image sequences into regions or ``objects`` has received a lot of attention in recent years. Applications such as object tracking, compact representation of videos for indexing, video coding, and structure from motion can benefit from a meaningful segmentation. However, the problem remains unsolved, being an ill-posed ``chicken-and-egg`` problem. There are two major groups of approaches for separating image sequences into multiple significant scene structures and objects [Sawhney, 1996]. The first group lets multiple models compete simultaneously for the description of every motion measurement; in the second group, multiple models are estimated sequentially by solving for a dominant model at each stage. Most motion segmentation methods operate at the pixel level and either ignore spatial constraints or lead to complex and computationally demanding algorithms [Sawhney, 1996][Wang, 1994]. To improve motion segmentation, a number of researchers have recently attempted to combine an initial static segmentation with motion, exploiting spatial coherence to obtain more robust results [Weiss, 1996][Patras, 1998]. In this paper, we present a simultaneous motion segmentation method that incorporates a static segmentation computed with the watershed algorithm. In Sect. 2, we present related work and background. In Sect. 3, motion analysis and model fitting are described. The motion segmentation method incorporating the static segmentation is presented in Sect. 4. Finally, experimental results are reported in Sect. 5 and concluding remarks are given in Sect. 6.

2. Background

Sequential motion segmentation methods based on a dominant motion model can fail in the absence of a dominant motion. They also fail to assign similar motions to different layers because there is no competition among the motion models. Hence, sequential methods are applied only in simple cases such as single-object tracking, camera motion compensation (for stabilization and registration), or background/foreground segmentation. In this paper we address the problem of simultaneous motion segmentation; related work is reviewed first. Sawhney and Ayer [Sawhney, 1996] realize multiple motion segmentation using the EM algorithm under a mixture model. Wang and Adelson [Wang, 1994] apply the k-means clustering algorithm to group the optic flow vectors into layers of consistent affine motion. Neither method exploits the intensity information present in the image; in fact, the actual mapping of pixels to layers is difficult. Weiss [Weiss, 1996] adds spatial constraints to the mixture formulation, obtaining a variant of the EM algorithm that yields more robust segmentation results. Patras et al. [Patras, 1998] put forward a motion-based segmentation method combined with a static watershed segmentation: they perform dominant motion estimation in each watershed segment and merge the segments according to their motion information. For an image of a natural scene the watershed algorithm typically outputs thousands of segments, so motion estimation on all of these segments is computationally expensive. The task of the motion analysis module here is to find the major motion models appearing in the image sequence. We therefore prefer to estimate motion in nonoverlapping rectangular tiled regions obtained by uniformly dividing the image. This step is independent of the static segmentation, but the final motion segmentation is still derived from a measure computed on the watershed segments.
The motion clustering approach we use is the ISODATA algorithm instead of the k-means algorithm; it behaves more robustly than the latter. Experimental results show that the proposed scheme is efficient.

3. Motion Analysis

3.1 Dense optic flow estimation

The first step is the estimation of a coarse optic flow field. Here we use the method proposed by Black and Anandan [Black, 1993], which applies a Simultaneous Over-Relaxation (SOR) algorithm within a robust M-estimation framework and yields highly accurate dense optic flow estimates. Below we outline the algorithm. The interframe motion is defined as

f(x, t+1) = f(x − u(x), t),   (1)

where f(x, t) is the brightness function at time instant t, x = (x, y) is the coordinate of an image pixel, and u(x) = (u(x, y), v(x, y))^T is the flow vector. Let f_x, f_y, f_t be the partial derivatives of the brightness function with respect to x, y and t. To estimate the horizontal and vertical image velocities, we minimize the following objective function, composed of a data term and a spatial smoothness term,

min_{(u,v)} E_M = Σ_x [ λ_D ρ(u f_x + v f_y + f_t, σ_D) + λ_S Σ_{z∈G(x)} ρ(u(x) − u(z), σ_S) ],   (2)

where G(x) denotes the four nearest neighbors of pixel x, λ_D and λ_S control the relative importance of the data and spatial terms respectively, σ_D and σ_S are scale parameters, and ρ is a robust error norm, taken here to be the Lorentzian form [Black, 1993],

ρ(x, σ) = log(1 + x² / (2σ²)).   (3)

The SOR iteration update equations for minimizing E_M at step n+1 are simply written as

u^{n+1} = u^n − ω (1 / T(u)) ∂E_M/∂u,  with T(u) ≥ ∂²E_M/∂u²,   (4)

v^{n+1} = v^n − ω (1 / T(v)) ∂E_M/∂v,  with T(v) ≥ ∂²E_M/∂v²,   (5)

where ω is the over-relaxation parameter, 0 < ω < 2; here ω = 1.995. The algorithm begins by constructing a Gaussian pyramid (here with three levels). At the coarsest level the flow vectors are initialized to zero. The number of iterations per level is chosen as 10. When the estimated flow is interpolated to the next level, it is used to warp the first image towards the second image, and at the current level only the change in the flow vectors is estimated by the iterative update scheme. In all experiments in this paper we set λ_D = 10, λ_S = 1, σ_D = 10/√2, σ_S = 1/√2.
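The per-pixel robust update of Eqs. (2)-(5) can be sketched as follows. This is an illustrative reimplementation, not the authors' C code: the function names (`lorentzian_psi`, `sor_flow_step`) are our own, the sweep is Jacobi-style (all pixels updated simultaneously) for vectorization, and ψ denotes the derivative of ρ, with ψ(x, σ) = 2x/(2σ² + x²) and ψ'(0) = 1/σ² for the Lorentzian.

```python
import numpy as np

def lorentzian_psi(x, sigma):
    """Derivative of the Lorentzian norm rho(x, sigma) = log(1 + x^2 / (2 sigma^2))."""
    return 2.0 * x / (2.0 * sigma ** 2 + x ** 2)

def sor_flow_step(u, v, fx, fy, ft, lam_d=10.0, lam_s=1.0,
                  sig_d=10 / np.sqrt(2), sig_s=1 / np.sqrt(2), omega=1.995):
    """One sweep of the robust flow update, Eqs. (2)-(5)."""
    H, W = u.shape

    def smooth_grad(w):
        # sum of psi over the 4-neighbour differences (border replicated)
        p = np.pad(w, 1, mode='edge')
        s = np.zeros_like(w)
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            s += lorentzian_psi(w - p[1 + dy:1 + dy + H, 1 + dx:1 + dx + W], sig_s)
        return s

    resid = u * fx + v * fy + ft                     # data residual per pixel
    grad_u = lam_d * fx * lorentzian_psi(resid, sig_d) + lam_s * smooth_grad(u)
    grad_v = lam_d * fy * lorentzian_psi(resid, sig_d) + lam_s * smooth_grad(v)
    # T(u), T(v) bound the second derivative: psi'(0) = 1/sigma^2 for the Lorentzian
    t_u = lam_d * fx ** 2 / sig_d ** 2 + 4 * lam_s / sig_s ** 2
    t_v = lam_d * fy ** 2 / sig_d ** 2 + 4 * lam_s / sig_s ** 2
    return u - omega * grad_u / t_u, v - omega * grad_v / t_v
```

In the full method this sweep runs 10 times per pyramid level, with the flow interpolated between levels and used to warp the first image towards the second.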

3.2 Fitting affine models to the dense optic flow

We divide the image uniformly into nonoverlapping rectangular tiled subregions (normally 8x8 or 16x16 regions when small objects appear) and derive an affine motion model in each subregion by robust fitting. The model is written as

u(x; a) = ( u(x, y), v(x, y) )^T = ( a_0 + a_1 x + a_2 y, a_3 + a_4 x + a_5 y )^T,   (6)

where a = (a_0, a_1, a_2, a_3, a_4, a_5)^T are the parameters of the affine model. This model is valid when the depth variation within an individual region is small compared with its distance from the camera. We again use the SOR method, now to minimize the following two objective functions,

min_{a_0, a_1, a_2} E_1 = Σ_x ρ(a_0 + a_1 x + a_2 y − u(x, y), σ),   (7)

min_{a_3, a_4, a_5} E_2 = Σ_x ρ(a_3 + a_4 x + a_5 y − v(x, y), σ),   (8)

where ρ is now taken to be the Geman-McClure form [Black, 1993],

ρ(x, σ) = x² / (x² + σ²).   (9)

The SOR method lowers the scale parameter σ according to the schedule σ^{n+1} = 0.85 σ^n; the effect is similar to simulated annealing [Eberhart, 1996]. In this paper we set σ initially to 2√3 and finally to 0.5√3.
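The robust fit of Eqs. (7)-(9) with the annealed scale can also be sketched with iteratively reweighted least squares (IRLS) using Geman-McClure weights; this is our own stand-in for the paper's SOR minimizer, and the function name `fit_affine_robust` is hypothetical.

```python
import numpy as np

def fit_affine_robust(xs, ys, vals, sigma0=2 * np.sqrt(3),
                      sigma_min=0.5 * np.sqrt(3), n_iter=20):
    """Robustly fit vals ~ a0 + a1*x + a2*y under the Geman-McClure norm (Eq. 9).

    IRLS sketch: sigma is annealed by a factor 0.85 per iteration, as in the text.
    """
    A = np.stack([np.ones_like(xs), xs, ys], axis=1)
    a = np.linalg.lstsq(A, vals, rcond=None)[0]        # least-squares start
    sigma = sigma0
    for _ in range(n_iter):
        r = A @ a - vals
        # Geman-McClure weight psi(r)/r = 2 sigma^2 / (r^2 + sigma^2)^2
        w = 2 * sigma ** 2 / (r ** 2 + sigma ** 2) ** 2
        Aw = A * w[:, None]
        a = np.linalg.solve(A.T @ Aw, Aw.T @ vals)     # weighted normal equations
        sigma = max(0.85 * sigma, sigma_min)
    return a

# e.g. fit the horizontal flow component in one tiled subregion,
# with 10% gross outliers standing in for independently moving pixels:
yy, xx = np.indices((16, 16))
u_true = 0.5 + 0.01 * xx - 0.02 * yy                   # synthetic affine flow
u_obs = u_true.ravel().copy()
u_obs[::10] += 5.0
a = fit_affine_robust(xx.ravel().astype(float), yy.ravel().astype(float), u_obs)
```

The outliers are strongly downweighted as σ shrinks, so the recovered parameters stay close to the generating affine model despite the contaminated flow.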

4. Image Segmentation

4.1 Motion clustering with ISODATA

The task now is to cluster the affine motion models of the subregions in the affine parameter space; at the same time, the corresponding rectangular subregions are merged if they belong to the same cluster. Here we use the ISODATA algorithm [Kaufman, 1990]. The adaptive k-means algorithm used in [Wang, 1994] can merge two cluster centers when their distance is small enough. The ISODATA algorithm, in addition to this merging property, can also split a cluster when its covariance becomes large enough. Moreover, in the kernel of our ISODATA the k-means step is replaced by the k-medoid algorithm [Kaufman, 1990]. The k-medoid algorithm is resistant to outliers because the center of each cluster is chosen among the elements of the input data, resulting in a more robust clustering. We use the min-max technique to generate the initial cluster centers: the first two centers are the pair of points with the largest distance, and each subsequent center is the point with the maximal value of the minimal distances to the already selected centers. In our experiments the initial number of clusters can vary from 2 to 6 with similar results, using 5 iterations. After motion clustering, we re-estimate each cluster's affine model over the merged subregions belonging to that cluster in order to obtain more accurate model parameters.
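The clustering step above can be sketched as follows. This is an illustrative skeleton, not the paper's implementation: the function names (`minmax_init`, `isodata_medoid`) and the merge/split thresholds are our own assumptions, and the split test uses the mean distance to the medoid as a simple stand-in for a covariance criterion.

```python
import numpy as np

def minmax_init(D, k):
    """Min-max seeding: start with the two farthest points, then repeatedly
    add the point whose minimal distance to the chosen centers is maximal."""
    centers = list(np.unravel_index(np.argmax(D), D.shape))
    while len(centers) < k:
        centers.append(int(np.argmax(D[:, centers].min(axis=1))))
    return centers

def isodata_medoid(X, k0, merge_thresh, split_thresh, n_iter=5):
    """ISODATA-style clustering of affine parameter vectors with a k-medoid kernel."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = minmax_init(D, k0)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        # k-medoid update: the member minimizing total intra-cluster distance
        new_medoids = []
        for c in range(len(medoids)):
            idx = np.where(labels == c)[0]
            if len(idx):
                sub = D[np.ix_(idx, idx)]
                new_medoids.append(int(idx[np.argmin(sub.sum(axis=1))]))
        medoids = new_medoids
        # merge medoids that are closer than merge_thresh
        kept = []
        for m in medoids:
            if all(D[m, m2] > merge_thresh for m2 in kept):
                kept.append(m)
        medoids = kept
        # split a cluster whose spread around its medoid is too large
        labels = np.argmin(D[:, medoids], axis=1)
        for c, m in enumerate(list(medoids)):
            idx = np.where(labels == c)[0]
            if D[idx, m].mean() > split_thresh:
                medoids.append(int(idx[np.argmax(D[idx, m])]))
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels
```

Starting with more clusters than motions present is harmless: seeds landing in the same motion cluster are merged, which is the behavior the text relies on when the initial number of clusters varies from 2 to 6.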

4.2 Static segmentation using watershed

For static segmentation, the watershed algorithm from mathematical morphology is a powerful method. Early watershed algorithms were developed to process digital elevation models and are based on local neighborhood operations on square grids. Improved gradient-following methods have been devised to overcome plateaus on square pixel grids [Gauch, 1999]. Other approaches use ``immersion simulations`` to identify watershed segments by flooding the image with water starting at the intensity minima [Vincent, 1991]. Here we implement the single-scale gradient-following method, which is part of the work presented in [Gauch, 1999]: (1) Identify the local intensity minima, which define the bottoms of the watersheds; a pre-smoothing filter is needed to eliminate plateaus in the image. (2) Calculate the image gradient: the 8-neighbors of each point are searched to determine the most steeply uphill and most steeply downhill directions, which are encoded and stored in a temporary image. (3) Partition the input image into watersheds: for each of the remaining points in the image, the gradient information is used to follow the downhill path to some intensity minimum, and the identifier of this extremum is recorded in the output pixel corresponding to the starting point. Note that the input image here is the gradient image computed from the partial derivatives of the intensity image. A severe drawback of watershed computation is over-segmentation, so watershed merging is performed by thresholding the difference of the mean intensity values of adjacent subregions.
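The gradient-following steps (1)-(3) can be sketched as below. This is a simplified illustration under our own assumptions: the function name `watershed_labels` is hypothetical, pre-smoothing is omitted, and the toy input has no plateaus, so every non-minimum pixel has a strictly downhill 8-neighbour.

```python
import numpy as np

def watershed_labels(img):
    """Label each pixel with the identifier of the minimum reached by
    steepest-descent (8-neighbour) gradient following."""
    H, W = img.shape
    labels = np.full((H, W), -1, dtype=int)
    next_label = 0
    for sy in range(H):
        for sx in range(W):
            path, y, x = [], sy, sx
            while labels[y, x] == -1:
                path.append((y, x))
                # most steeply downhill 8-neighbour (includes the pixel itself)
                v, ny, nx = min((img[ny, nx], ny, nx)
                                for ny in range(max(0, y - 1), min(H, y + 2))
                                for nx in range(max(0, x - 1), min(W, x + 2)))
                if v >= img[y, x]:
                    break           # local minimum: bottom of a watershed
                y, x = ny, nx
            if labels[y, x] == -1:  # new minimum found -> new basin identifier
                labels[y, x] = next_label
                next_label += 1
            for py, px in path:     # record the basin for the whole descent path
                labels[py, px] = labels[y, x]
    return labels

# toy surface with two basins, minima at (2, 2) and (7, 7):
yy, xx = np.indices((10, 10))
img = np.minimum((yy - 2) ** 2 + (xx - 2) ** 2, (yy - 7) ** 2 + (xx - 7) ** 2)
lab = watershed_labels(img)
```

In the actual method the input is the gradient-magnitude image of the intensity image, which is why the raw output is heavily over-segmented and needs the merging step described above.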

4.3 Layers-based image segmentation

We determine each region's motion measure as the MAE (mean absolute error) of the difference between the warped image and the original image, i.e.

M_ij = Σ_{x∈R_i} | f(x, t+1) − f_{w_j}(x, t) | / C_i,   (10)

where f_{w_j}(x, t) is the image f(x, t) warped with the affine motion parameters of the j-th motion cluster, and C_i is the number of pixels in subregion R_i. The binary support maps that describe the regions of each affine model are then defined as

z_ij = 1 if M_ij < M_it for all t ≠ j, and z_ij = 0 otherwise.   (11)

If needed, several iterations (here 3) are performed, each consisting of one step of re-estimating the affine motions within each support map and one step of re-determining the binary support maps. If the support-map area (the number of pixels with value 1) of some cluster is too small (less than 200), we delete this motion cluster and estimate the support maps again. Unlike [Sawhney, 1996][Wang, 1994], we do not use a median filter, yet still obtain a satisfactory segmentation result; this suggests that the patch-based mapping to layers is more robust than a pixel-based one.
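Equations (10)-(11) amount to warping frame t by each cluster's affine model, scoring each watershed segment against every warped prediction, and assigning the segment to the best model. A minimal sketch follows; the function names (`warp_affine`, `assign_layers`) are our own, nearest-neighbour sampling replaces proper interpolation for brevity, and the small-support deletion and re-estimation loop of the text is omitted.

```python
import numpy as np

def warp_affine(img, a):
    """Predict frame t+1 from frame t by sampling f(x - u(x), t) (Eq. 1),
    with u(x) given by the affine model of Eq. (6); nearest-neighbour sampling."""
    H, W = img.shape
    yy, xx = np.indices((H, W))
    u = a[0] + a[1] * xx + a[2] * yy
    v = a[3] + a[4] * xx + a[5] * yy
    sx = np.clip(np.rint(xx - u).astype(int), 0, W - 1)
    sy = np.clip(np.rint(yy - v).astype(int), 0, H - 1)
    return img[sy, sx]

def assign_layers(f0, f1, segments, models):
    """Eqs. (10)-(11): per-segment MAE against each warped prediction, then
    winner-take-all support maps (returned here as one integer label image)."""
    errs = np.stack([np.abs(f1 - warp_affine(f0, a)) for a in models])
    labels = np.zeros(segments.shape, dtype=int)
    for s in np.unique(segments):
        mask = segments == s
        mae = errs[:, mask].mean(axis=1)    # M_ij for each motion model j
        labels[mask] = int(np.argmin(mae))  # z_ij = 1 for the winning model
    return labels

# synthetic pair: the left half translates right by one pixel, the right half is static
rng = np.random.default_rng(0)
f0 = rng.random((20, 20))
yy, xx = np.indices(f0.shape)
f1 = f0[yy, np.clip(xx - (xx < 10).astype(int), 0, 19)]
segments = (xx >= 10).astype(int)                  # two hand-made "watershed" segments
models = [np.array([1.0, 0, 0, 0, 0, 0]), np.zeros(6)]
lab = assign_layers(f0, f1, segments, models)
```

Because the decision is made per segment rather than per pixel, a single noisy pixel cannot flip the layer assignment, which is the robustness the text attributes to patch-based mapping.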

5. Experimental results

We implemented our method in C on an SGI workstation. The segmentation of a pair of consecutive images requires about 30 seconds. Clustering and static segmentation account for only a small part of the computational cost; the coarse flow estimation takes about 8 seconds, and most of the time is spent on affine model fitting. Below we show two results on real images.

5.1 Moving camera without moving objects

We use two consecutive frames (frames 15 and 16) of the standard MPEG ``Flower Garden`` sequence. In this scene the tree, the flower bed and the row of houses move towards the left, but at different velocities. The image size is 352x236. Because the flower bed and the tree are similar in intensity value and thus easily merge into the same region, we lower the threshold of the watershed merging. The final number of motion clusters is 3, as illustrated in Fig. 1, where different intensity values are assigned to different layers. There are a few errors in poorly textured regions such as the sky and in some regions near the image border. The estimated affine motion parameters of the 3 layers are listed in Table 1.

Table 1: Affine parameters of the 3 layers of Fig. 1

Layer     a[0]        a[1]        a[2]        a[3]       a[4]        a[5]
1        -0.840598   -0.000653   -0.000360   0.021484   0.000168    0.000094
2        -4.101958    0.004176   -0.011013   0.474621   0.000765   -0.002019
3        -2.248218   -0.001550    0.000043   0.051635  -0.000115    0.000007

Fig. 1: Segmentation of the ``Flower Garden`` sequence. (1) 15th frame, (2) optic flow, (3) affine clustering, (4) watershed merging, (5) support map, (6) 1st layer object, (7) 2nd layer object, (8) 3rd layer object.

5.2 Moving camera with moving objects

Here two consecutive frames (frames 8 and 9) of the standard MPEG ``Coast Guard`` sequence are used. The scene contains four ``objects``: the shore, a big ship, a small ship and the water. The image size is 352x240. The camera follows the small ship, and the water behaves in a rather random way locally. The final number of motion clusters is 3; the result is illustrated in Fig. 2, where different intensity values are assigned to different layers. The small ship is clustered into the water layer because its apparent motion is very small and difficult to distinguish from the water motion. Besides, there are a few errors in the water and in the upper-left corner of the image. The estimated affine motion parameters of the 3 layers are listed in Table 2.

Table 2: Affine parameters of the 3 layers of Fig. 2

Layer     a[0]       a[1]        a[2]        a[3]       a[4]        a[5]
1         2.999030   0.004739   -0.031250   0.093677  -0.001135    0.007489
2         0.589943  -0.000588    0.000033   0.001103   0.000000    0.000000
3         1.332318   0.000005    0.000000   0.003659  -0.000026    0.000001

Fig. 2: Segmentation of the ``Coast Guard`` sequence. (1) 9th frame, (2) optic flow, (3) affine clustering, (4) watershed merging, (5) support map, (6) 1st layer object, (7) 2nd layer object, (8) 3rd layer object.

6. Conclusions

We have presented a framework for multilayer-based image segmentation that efficiently combines static segmentation and motion information. In the future, we will employ a tracking algorithm over the image sequence to ensure a coherent segmentation through time.

Acknowledgements

This research is partially supported by the Humboldt Research Foundation.

References

[Black, 1993] Black M J, Anandan P. (1993). A framework for the robust estimation of optical flow. ICCV'93: 231-236.
[Eberhart, 1996] Eberhart R et al. (1996). Computational Intelligence PC Tools. Academic Press.
[Gauch, 1999] Gauch J. (1999). Image segmentation and analysis via multiscale gradient watershed hierarchies. IEEE T-IP, 8(1): 69-79.
[Kaufman, 1990] Kaufman L, Rousseeuw P J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
[Patras, 1998] Patras I et al. (1998). An iterative motion estimation-segmentation method using watershed segments. Proc. of ICIP'98, Illinois, USA.
[Sawhney, 1996] Sawhney H S, Ayer S. (1996). Compact representations of videos through dominant and multiple motion estimation. IEEE T-PAMI, 18(8): 814-830.
[Vincent, 1991] Vincent L, Soille P. (1991). Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE T-PAMI, 13(6): 583-589.
[Wang, 1994] Wang J, Adelson E. (1994). Representing moving images with layers. IEEE T-IP, 3(5): 625-638.
[Weiss, 1996] Weiss Y, Adelson E. (1996). A unified mixture framework for motion segmentation: incorporating spatial coherence and estimating the number of models. Proc. of CVPR'96.