2012 26th IEEE International Conference on Advanced Information Networking and Applications

Video Editing Using Motion Inpainting

Joseph C. Tsai
Department of Computer Science and Information Engineering, Tamkang University, Danshui Dist., New Taipei City, Taiwan
[email protected]

Timothy K. Shih
Department of Computer Science and Information Engineering, National Central University, Jhongli City, Taoyuan County, Taiwan
[email protected]

Kanoksak Wattanachote
Department of Computer Science and Information Engineering, National Central University, Jhongli City, Taoyuan County, Taiwan
[email protected]

Kuan-Ching Li
Department of Computer Science and Information Engineering, Providence University, Taichung City, Taiwan
[email protected]

Abstract - In this paper, we demonstrate a new motion inpainting technique that allows users to change the dynamic texture of a video background for special effect production. For instance, dynamic textures of fire, smoke, water, clouds, and others can be edited through a series of automatic algorithms. Motion estimation of global and local textures is used, and video blending is applied in conjunction with a color balancing technique. The editing procedure searches for suitable patches in irregularly shaped blocks to reproduce a realistic dynamic background, such as a large waterfall, a fire scene, or a smoky background. The technique is suitable for making science fiction movies. We demonstrate the original and the falsified videos on our website at http://163.13.127.36/www/AINA12. Although video falsifying may raise moral concerns, our intention is to create special effects for the movie industry.

Index Terms - Video Editing, Motion Inpainting, Object Tracking, Motion Estimation, Special Effect Production.

I. INTRODUCTION

The digital camera has played an important role in our lives in recent years; most families use this kind of equipment to store their memories. For this reason, image editing algorithms have received considerable attention not only from image processing researchers but also from people in general. More and more image editing software has been developed in the last several years for repairing defects in digital photos. Techniques such as image inpainting and image segmentation can be used in this kind of software, and because the software is easy to operate, people can modify their photos with little effort. Although image processing algorithms have advanced quickly, video processing approaches still leave considerable room for development.

Video processing is also a popular research area. Users in general want to make new videos from their own footage; in effect, this is special effect generation. Special effect technologies are widely used in the film industry to produce spectacular movies, but video post-processing is a task that costs time and demands attention. Most films with special effects are created with 3D models. The creation of 3D models and animations can rely on existing software tools and hardware tracking devices. However, creating outdoor special effects is hard, especially in dangerous areas. We propose an approach called motion inpainting to generate the dynamic background in the original footage. It helps users create a new video with a spectacular scene.

In the last several years, generating dynamic textures in a video has been widely investigated. Most methods use a 3D model to simulate the dynamic texture. In detail, 3D model simulation exploits the physical properties of each kind of dynamic texture; in other words, this approach starts from the physical laws behind the natural phenomena. In [1], D. Q. Nguyen et al. proposed a physically based method for modeling and animating fire. The fire is simulated with the incompressible Navier-Stokes equations. The developed model is a physically based one for a vaporized fuel reacting to form hot gaseous products, together with a related model for the expansion that takes place when a solid fuel is vaporized into a gaseous state. The particle system is the main tool used by physical methods; it permits the simulation of fuzzy phenomena such as fire, water, and smoke. The synthesized results are always convincing, and they can move, change form, and be controlled in other ways. Although the physically based approach has many advantages, it is computationally very expensive. Moreover, such software is generally hard for users to operate, and the generated results do not necessarily fit the users' original videos.

The other approach is the image-based method. It aims to generate the dynamic texture using patches from other locations, either inter-frame or intra-frame. Its biggest advantage is that it is simple to process: it only needs one model to analyze the original video, or it uses the information analyzed from each frame. Typically, we want to use this kind of approach to generate additional patches or dynamic textures, because it yields additions whose color, illumination, and structure are the most similar to the original clips. In other words, the generated video looks more natural.


Dynamic textures produced by a physical model will differ from the target video, even when the motion and shape of the synthesized texture are good enough. For this reason, the image-based approach is better suited to processing existing videos or video games.

In [2], A. Rav-Acha et al. propose an approach to control the dynamic flow of a video. The authors build dynamic mosaics by sweeping the aligned space-time volume of the input video with a time-front surface and generating a sequence of time slices. They use the volume to evolve the time front of the video and to analyze the chronological order of events; the analysis helps control the order of events in the video. In other words, their dynamic texture synthesis can generate a dynamic background panorama. R. Costantini et al. propose another approach to dynamic texture generation. In [3], they use higher-order SVD analysis to synthesize dynamic textures. The model reduces the dimensionality of the original video; a lower-dimensional model is easier to analyze and maintains the continuity of the synthesis result.

We propose a new method to generate dynamic texture in a target area of a video. The method is based on inpainting: we generate the target dynamic texture by extracting the most similar patch clips from other frames of the same video. Dynamic textures are too complex for the original inpainting algorithms to handle directly. Many inpainting algorithms have been proposed in the last decade, such as [4, 5, 6]. In [4], the authors use an LDS-based dynamic texture synthesis scheme as the basic method for filling the hole, but it still needs training to recognize which patch is the most similar to the target hole.

In this paper, we use a new patch searching algorithm to maintain temporal continuity. We describe the algorithm in Section II, analyze the inpainted patches in Section III, show experimental results in Section IV, and conclude in Section V.

II. PATCH SEARCHING AND MOTION INPAINTING

To complete the hole in the dynamic background, we have to search for the most similar area from other locations or other frames. Unlike traditional inpainting algorithms, a new patch searching method is used in this paper. In addition, inserting the patch into the missing hole is a challenge in its own right: how to make the generated patch blend with the dynamic background while maintaining continuity is an important issue in this research.

A. Patch Searching

In the traditional image inpainting algorithm [5], patch searching is performed within the same frame. Video inpainting algorithms such as [6] search for the best patch in neighboring frames according to motion information. Although many algorithms have been proposed in recent years, these patch searching methods are not suitable for dynamic backgrounds: the dynamic textures are so disordered that the traditional searching methods cannot find the most similar patch. We use three kinds of information as the model to compute the similarity of each patch.

To process a video with a dynamic background, we view the video as a volume. In Fig. 1, the video volume is denoted Φ³ and each frame in the video is Φ².

Fig. 1. (a) The video volume Φ³. (b) A patch in frame t, with feature points p1, ..., p5 on its contour Ω². (c) A patch in frame t', with feature points q1, ..., q5 on its contour χ². The two patches lie in different frames and carry several feature points on their contours.

The figure illustrates that the patches to be recovered are not of the same size and shape. The feature points on the contours are obtained by the SIFT algorithm [7]. In Fig. 1 (b) and (c), the feature points on the contours are not at the same locations in the two patches; this is because we use the SIFT algorithm to extract the feature points from the frames and cannot constrain its output. We therefore match the feature points of the current frame against the feature points of other frames, and once the best matching is found, we use it as the model of patch similarity. Fig. 1 (a) represents the volume of the frames, denoted Φ³, i.e., two spatial dimensions plus time. The patches from the frames can be combined into Ω³. Let ||Ω³|| denote the number of frames of Ω³, i.e., how many frames the target object is tracked in the video. Let t be the starting frame number of Ω³; the target volume thus exists from video frame t to frame t+||Ω³||. We use Ω²_t, its contour δΩ²_t, and the feature points {p1, p2, ..., pj} to search for a best-fit region χ²_t' on frame t', where δχ²_t' is the contour of χ²_t' and the set of feature points {q1, q2, ..., qk} is obtained from it. In general, j and k are not the same, but for simplicity we select the best k = j feature points. To find the best-fit patch, we propose the following algorithm:
1. For the set of feature points {p1, p2, ..., pj}, define a set of patches {Ψ_p1, Ψ_p2, ..., Ψ_pj}. Similarly, define {Ψ_q1, Ψ_q2, ..., Ψ_qj}.
2. Search for the best fit {Ψ_q1, Ψ_q2, ..., Ψ_qj} from frame t-i to frame t+i.
2.1 Dmax = |B(χ²_f)| / k, where k = 16 and |B(χ²_f)| is the maximum of the height and width of the bounding box of χ²_f.


3. Use Dmax as the search distance to locate the best-fit feature point qi on frame f.
4. For each corresponding pair of feature points pi and qi, define:
4.1 SSD(Ψ_pi, Ψ_qi): the sum of squared differences of pixels in the CIELab color space.
4.2 diffe(e(Ψ_pi), e(Ψ_qi)): the difference of edge maps, where e(Ψ_pi) is the edge map of Ψ_pi and diffe() counts the number of mismatched pixels.
4.3 diffm(m(Ψ_pi), m(Ψ_qi)): the difference of motion maps, where m(Ψ_pi) is the motion map of Ψ_pi computed by optical flow.
5. PatchDiff(Ψ_pi, Ψ_qi) = SSD(Ψ_pi, Ψ_qi) · α + diffe(e(Ψ_pi), e(Ψ_qi)) · β + diffm(m(Ψ_pi), m(Ψ_qi)) · γ, where α + β + γ = 1.0; α, β, and γ are weights we set for each term in our experiments.
6. χ²_t' = arg min over all candidate χ²_t' of Σ over all (pi, qi) of PatchDiff(Ψ_pi, Ψ_qi), where pi ∈ δΩ²_t, qi ∈ δχ²_t', and t-i ≤ t' ≤ t+i. This equation finds the best-fit candidate region χ²_t' on frame t', i.e., the one whose sum of differences is minimal.
After the above steps, we obtain the best-fit patches to insert into the target area. One problem remains, however: the timing of the patch may not fit the original video. Because the best-fit patch is not in the same frame as the target frame, we have to extend the patch clips in time to obtain a complete result.
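To make the three similarity terms in steps 4-6 concrete, the following sketch shows one way PatchDiff could be computed with OpenCV and NumPy. It is only an illustration under our own assumptions: the helper name patch_diff, the Canny thresholds, the Farneback optical-flow parameters, and the default weights α = 0.4, β = 0.3, γ = 0.3 are not taken from the paper.

```python
import cv2
import numpy as np

def patch_diff(patch_p, patch_q, prev_p, prev_q, alpha=0.4, beta=0.3, gamma=0.3):
    """Hypothetical sketch of PatchDiff (step 5): color SSD + edge-map
    difference + motion-map difference, with alpha + beta + gamma = 1.0.
    patch_p/patch_q are BGR patches of equal size; prev_p/prev_q are the
    co-located patches from the previous frame, used for the optical flow."""
    # 4.1 SSD of pixels in the CIELab color space
    lab_p = cv2.cvtColor(patch_p, cv2.COLOR_BGR2LAB).astype(np.float32)
    lab_q = cv2.cvtColor(patch_q, cv2.COLOR_BGR2LAB).astype(np.float32)
    ssd = np.sum((lab_p - lab_q) ** 2)

    # 4.2 edge-map difference: count mismatched edge pixels (Canny as the edge map)
    gray_p = cv2.cvtColor(patch_p, cv2.COLOR_BGR2GRAY)
    gray_q = cv2.cvtColor(patch_q, cv2.COLOR_BGR2GRAY)
    edge_p = cv2.Canny(gray_p, 50, 150)
    edge_q = cv2.Canny(gray_q, 50, 150)
    diff_e = np.count_nonzero(edge_p != edge_q)

    # 4.3 motion-map difference: dense optical flow (Farneback) per patch
    flow_p = cv2.calcOpticalFlowFarneback(cv2.cvtColor(prev_p, cv2.COLOR_BGR2GRAY),
                                          gray_p, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    flow_q = cv2.calcOpticalFlowFarneback(cv2.cvtColor(prev_q, cv2.COLOR_BGR2GRAY),
                                          gray_q, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    diff_m = np.sum(np.abs(flow_p - flow_q))

    # step 5: weighted combination of the three cues
    return alpha * ssd + beta * diff_e + gamma * diff_m
```

Step 6 would then evaluate this score for every candidate region χ²_t' in frames t-i, ..., t+i and keep the one with the minimal sum over all matched feature-point patches.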

B. Video Extension in Time

After we obtain the searched patch, there are two issues to consider. The first is the length of the target patch clip: since each patch is not found over the same number of frames, we have to make the length of the target patch clips match the original video. Fig. 2 shows an example in which the searched patch is inserted into the target area without extending the length of the patch clips; because the patch clips are too short, the generated video still contains the damaged area. On the right side of the figure, the later half of the video is not inpainted.

Fig. 2. The inpainting result without video extension in time.

In order to extend the length of the selected patch video, we looked for a method suitable for our system. Many approaches have been proposed in the past ten years. Video textures [8] were proposed to find a loop in a video. This method needs to analyze the similarity between every pair of frames in the video; after this step, a loop in the video can be found and used to create a longer video. However, the method has two limitations that prevent us from adopting it. First, it needs enough frames to analyze; if the video has too few frames, the result becomes discontinuous. Second, the dynamic texture cannot be too complex; if it is, the similarity analysis is disturbed and produces wrong results.

Another approach was proposed for changing the size of an image: S. Avidan et al. proposed a novel method for image resizing [14]. In that algorithm, the authors use the energy of the image to look for the lowest-energy seams and use these seams to change the size of the image. In order to extend the video in time, we extend this concept; the algorithm is well suited to extending the patch video. The original algorithm computes over the whole frame, which costs too much time when applied to video. When we extend the concept of seam carving into 3D, we obtain a lowest-energy curve (i.e., a seam in an image becomes a curve in a video). This method also avoids the problems of [8], so we obtain a better extension result. The following is the algorithm we use (a sketch of the per-plane seam search appears after the list):
1. Extract the x-t planes from the video. Since we want to extend the video in time, we process on the x-t planes.
2. Choose one x-t plane at random and search for the lowest-energy seam on this plane.

Seam_plane1 = minD(plane1_(x,1)) + Σ_neighbor(x,y-1) minD(plane1_(x,y))

The above formula shows how the seam is computed, where minD() returns the minimum energy in the current scan line and neighbor() refers to the neighbors of the result at the previous level. We use this formulation to maintain the continuity of the seam. Fig. 3 shows the result of extracting the seam from the plane; the red line is the seam. We extend this seam into a curve.

Fig. 3. The result of the seam extracted from the plane. The red line is the seam.

3. Use the seam of the first plane to guide the generation of the seams of all the other planes; the minimum-energy curve is then obtained.
4. Insert the additional curve into the volume.
5. Repeat the above steps until the target number of additional frames is reached.
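As an illustration of step 2, the sketch below finds a lowest-energy seam on a single x-t plane with the usual dynamic-programming formulation of seam carving [14]. The gradient-magnitude energy and the function name lowest_seam_on_plane are our own assumptions; the paper does not prescribe a particular energy function.

```python
import numpy as np

def lowest_seam_on_plane(plane):
    """Find a lowest-energy seam on one x-t plane (2D array: rows = time,
    columns = x). Returns one column index per row, forming a connected path,
    in the spirit of seam carving [14]. Sketch only; the energy choice is ours."""
    # simple gradient-magnitude energy (an assumption, not from the paper)
    gy, gx = np.gradient(plane.astype(np.float32))
    energy = np.abs(gx) + np.abs(gy)

    T, X = energy.shape
    cost = energy.copy()
    # dynamic programming: each row may continue from the three neighbors above
    for t in range(1, T):
        left = np.roll(cost[t - 1], 1)
        right = np.roll(cost[t - 1], -1)
        left[0] = np.inf
        right[-1] = np.inf
        cost[t] += np.minimum(np.minimum(left, cost[t - 1]), right)

    # backtrack from the cheapest cell in the last row
    seam = np.empty(T, dtype=np.int64)
    seam[-1] = int(np.argmin(cost[-1]))
    for t in range(T - 2, -1, -1):
        x = seam[t + 1]
        lo, hi = max(0, x - 1), min(X, x + 2)
        seam[t] = lo + int(np.argmin(cost[t, lo:hi]))
    return seam
```

Steps 3-5 would then propagate this seam across the other x-t planes to form a minimum-energy curve through the volume and duplicate it to lengthen the clip.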

Once the video has been extended in time, we can ensure that the hole in every frame is inpainted. In the next step, we insert the patch clips into the target area.

C. Patch Insertion

From the previous sections we know the number of frames to be inserted into the inpainting area, and we have a patch video that can be extended to match the length of the source video. If we simply paste the patch into the source frames, the final result looks very strange. To solve this problem, we use the concept of the graph cut algorithm [9]; the goal of using graph cut is to make the inserted patch seamless with the background. The following is the algorithm (a sketch of the graph-cut refinement in step 2 follows the list):
1. According to the target volume Ω³, track the volume for ||Ω³|| frames. Copy χ²_t' ||Ω³|| times to obtain an initial χ³_t'.
2. Use graph cut to refine χ³_t'.
2.1 For each Ω²_t+i and χ²_t'+i, 0 ≤ i ≤ ||Ω³||, apply graph cut to refine δχ²_t'+i into δ'χ²_t'+i.
2.2 Each 2D contour δ'χ²_t'+i is reshaped from its original contour δχ²_t'+i.
3. Assemble all δ'χ²_t'+i, 0 ≤ i ≤ ||Ω³||, into an optimal cylindrical surface δχ^3. Finally, we use χ^3 to inpaint Ω³, with a smooth motion filter applied to the boundary.
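As an illustration of step 2, the sketch below refines the patch boundary on one frame with a pixel-level min-cut, assuming the PyMaxflow package (maxflow), which the paper does not name. The cost model (color disagreement between patch and background as the cut cost, with hard constraints pinning the patch core to one label and the outer ring to the other) is a simplification of the drag-and-drop pasting formulation in [9], not the paper's exact energy.

```python
import numpy as np
import maxflow  # PyMaxflow; an assumed dependency, not named in the paper

def refine_contour(background, patch, core_mask, ring_mask):
    """One graph-cut refinement (step 2.1) for a single frame.
    background, patch: HxWx3 images of the same region.
    core_mask:  pixels that must come from the patch (inside the hole).
    ring_mask:  pixels that must stay background (outside the search band).
    Returns a boolean mask: True where the patch should be used."""
    h, w = core_mask.shape
    # per-pixel disagreement between patch and background; cutting through
    # pixels where the two sources differ strongly is expensive
    diff = np.linalg.norm(background.astype(np.float32) -
                          patch.astype(np.float32), axis=2)

    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes((h, w))
    # 4-connected smoothness edges weighted by the local disagreement
    g.add_grid_edges(nodes, weights=diff, symmetric=True)

    # hard constraints: ring -> background (source), core -> patch (sink)
    inf = 1e9
    g.add_grid_tedges(nodes, inf * ring_mask.astype(np.float32),
                      inf * core_mask.astype(np.float32))
    g.maxflow()
    # True for nodes on the sink side, i.e., pixels taken from the patch
    return g.get_grid_segments(nodes)
```

Step 3 would then stack the refined 2D contours δ'χ²_t'+i over the tracked frames into the surface δχ^3 before blending with the smooth motion filter.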

After the above steps, we obtain the final result of motion inpainting. In the next section, we analyze the inpainted patches.

III. ANALYSIS OF THE PATCH

To make sure that the patch clips we found are the best fit to the dynamic texture of the background, we use several methods to check their similarity. In the last section, we used the graph cut algorithm to keep the contour of the inserted patch seamless with the background: the main idea was to use the textures of the inserted patch and the information from the background to find the best contour between the two layers. Although graph cut is well suited to finding a seamless contour, motion information is not considered in that algorithm. As a result, the motion of the textures around the contour may not be consistent. In a river video, for example, if the motion vectors around the inpainting area point to the right, it looks very strange when the motion vectors of the inserted patches point to the left. Figure 4 shows such a sample.

Fig. 4. The motion map of an inpainting result: (a) the inpainting result; (b) the motion vectors. The red arrows are the motion vectors in the inpainting area; the white arrows are the motions around this area.

Observing the motion vectors in Fig. 4 (b), the motion vectors of the inpainted area are not similar to the motion map of the surroundings: the red arrows, which denote the motion vectors in the inpainting area, point in the direction opposite to the motion of the surroundings. To solve this problem, we propose an analysis function as a double check.

A. Analysis

We use a motion estimation algorithm to analyze the motion vectors in the inpainted area and in its surroundings. Although we already used motion vectors as one reference for finding the best-fit patches in Section II, it is only one of three cues; the motion of the inpainting area may therefore still disagree with the motion of its neighborhood. In this section, we use motion estimation to check the motion vectors. Many motion estimation algorithms have been proposed in the last two decades and applied in many research areas, such as video compression [10], object tracking [11], and human motion analysis [12]. To analyze the motion vectors of the dynamic textures, we choose optical flow [13], the same method as in the last section. The advantage of optical flow is that it extracts not only the motion information of the target patches but also information about the structure of the background. We compute the motion vectors of each patch in the inpainting area and its surroundings, and use this information only to verify that the inpainting result is correct. The following is the algorithm we use to analyze the motion map of the inpainting results (a sketch follows the list):
1. Compute the motion vectors of the pixels in the inpainting area χ³ and in the 5-pixel-wide band surrounding χ³.
2. Average all motion vectors inside χ³.
3. Check whether the average motion vector of the inpainting area is similar to the motion vectors of the surroundings.
3.1 If it is similar, output the result.
3.2 Otherwise, search for the patch again.
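A minimal sketch of this consistency check, assuming dense Farneback optical flow and a cosine-similarity test; the function name check_motion_consistency, the ring construction via dilation, and the threshold value are our assumptions, not parameters given in the paper.

```python
import cv2
import numpy as np

def check_motion_consistency(prev_frame, frame, hole_mask, ring_width=5, thresh=0.5):
    """Compare the average flow inside the inpainted area with the average
    flow in a band of ring_width pixels around it (steps 1-3).
    Returns True if the motions are consistent, False if a re-search is needed."""
    gray0 = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray1 = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(gray0, gray1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

    # ring_width-pixel band around the hole, obtained by dilating the mask
    hole = hole_mask.astype(bool)
    kernel = np.ones((2 * ring_width + 1, 2 * ring_width + 1), np.uint8)
    ring = cv2.dilate(hole.astype(np.uint8), kernel).astype(bool) & ~hole

    inside = flow[hole].mean(axis=0)   # step 2: average motion inside the hole
    around = flow[ring].mean(axis=0)   # step 1: average motion of the surroundings

    # step 3: direction agreement via cosine similarity of the two mean vectors
    denom = np.linalg.norm(inside) * np.linalg.norm(around) + 1e-6
    return float(np.dot(inside, around) / denom) > thresh
```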

With the above steps, we can make sure that the motions of the inpainting area and its surroundings are consistent. If they are not, we use a modified patch searching method to search again, described in the following subsection.

B. Patch Re-Search

In this subsection, we modify the original formula proposed in Section II. To emphasize the importance of motion and make the motion information the main reference in the patch search, we increase the weight of the motion term. In Section II-A the weights satisfy α + β + γ = 1.0, where γ weights the motion information; it was originally 0.3, and we raise it to 0.6 for the second search. In this way the motion information becomes more important and the search returns a more accurate result (a short usage sketch is given at the end of this subsection). In the next section, we show some results for the motion vectors in the inpainting area and its surroundings.
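Continuing the hypothetical patch_diff sketch from Section II-A, the re-search only changes the weight of the motion term; everything else stays the same. The weight values other than the 0.3 to 0.6 change are our illustrative defaults, not values stated in the paper.

```python
# first pass: balanced weights (alpha + beta + gamma = 1.0)
score = patch_diff(patch_p, patch_q, prev_p, prev_q,
                   alpha=0.4, beta=0.3, gamma=0.3)

# second pass, triggered when the motion-consistency check fails:
# raise the motion weight gamma from 0.3 to 0.6 and renormalize the rest
score = patch_diff(patch_p, patch_q, prev_p, prev_q,
                   alpha=0.2, beta=0.2, gamma=0.6)
```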

IV. EXPERIMENTAL RESULTS

We demonstrate our video results in Fig. 6. Fig. 6 (a) is a waterfall video in which we want to make the waterfall more spectacular; in Fig. 6 (b), we remove the rock on the left side and synthesize waterfall at that location. Fig. 6 (c) shows a bear standing in a river; we use the proposed method to remove the bear, and the resulting video can be composited with other objects to create a new video. Motion inpainting is very useful for creating a more spectacular scene with a dynamic background. Fig. 6 (e), (g), and (i) are three dynamic texture videos: the first is smoke, the second is a submarine volcanic eruption, and the third is a conflagration; the corresponding results are shown in Fig. 6 (f), (h), and (j). Some results for the motion map are shown in Fig. 5.

Fig. 5. Analysis by motion vectors: (a) the inpainting results; (b) the motion map.

In Fig. 5, the images on the right side are the motion maps of the inpainting results. The red arrows are the motion vectors in the inpainting area, and the white arrows are the motion vectors in the original background. We can observe that the directions of the red arrows are almost the same as those of the white arrows; on the left side, the textures are also seamless. According to these experiments, the results are good enough for dynamic backgrounds.

Fig. 6. Experimental results (see our demos at http://163.13.127.36/www/AINA12): (a) original video; (b) inpainting result; (c) original video; (d) bear removed; (e) original video; (f) additional smoke on the right side; (g) original video; (h) a bigger eruption; (i) original video; (j) person removed.

V. CONCLUSIONS

This paper proposes a new motion inpainting algorithm for processing dynamic textures. We use a new patch searching method to find the best-fit patch. The work is challenging because we have to consider the structure, the motion, and the continuity of each frame. After obtaining the best-fit patch clips, we extend the clips in time to ensure that the inserted material is at least as long as the original video. Finally, a seamless blending algorithm is used to blend the layers. The main contribution of this paper is the motion inpainting mechanism, with which we can create a more spectacular scene. The method can also be used in post-production, and the software is easy to use: users only need to choose the location or object to be removed, and the system processes the video automatically. In the future, we plan to build an authoring tool based on this function for creating falsified videos.

REFERENCES

[1] D. Q. Nguyen, R. Fedkiw, and H. W. Jensen, "Physically Based Modeling and Animation of Fire," in SIGGRAPH 2002.
[2] A. Rav-Acha, Y. Pritch, D. Lischinski, and S. Peleg, "Dynamosaics: Video Mosaics with Non-Chronological Time," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 58-65, San Diego, USA, 2005.
[3] R. Costantini, L. Sbaiz, and S. Süsstrunk, "Higher Order SVD Analysis for Dynamic Texture Synthesis," IEEE Transactions on Image Processing, 17(1), pp. 42-52, 2008.
[4] C. W. Lin and N. C. Cheng, "Video background inpainting using dynamic texture synthesis," in Proceedings of the 2010 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1559-1562, 2010.
[5] A. Criminisi, P. Perez, and K. Toyama, "Object removal by exemplar-based inpainting," in IEEE Computer Vision and Pattern Recognition, vol. 2, pp. 721-728, 2003.
[6] T. K. Shih, N. C. Tang, and J. N. Hwang, "Exemplar-based Video Inpainting without Ghost Shadow Artifacts by Maintaining Temporal Continuity," IEEE Transactions on Circuits and Systems for Video Technology, 2008.
[7] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 60(2), 2004.
[8] A. Schödl, R. Szeliski, D. H. Salesin, and I. Essa, "Video Textures," in SIGGRAPH 2000, pp. 489-498, 2000.
[9] J. Jia, J. Sun, C. Tang, and H. Shum, "Drag-and-Drop Pasting," ACM Transactions on Graphics (TOG), 25(3), pp. 631-637, 2006.
[10] I. E. G. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next-Generation Multimedia, John Wiley & Sons Ltd, 2003.
[11] K. Hariharakrishnan and D. Schonfeld, "Fast object tracking using adaptive block matching," IEEE Transactions on Multimedia, 7(5), pp. 853-859, 2005.
[12] J. K. Aggarwal and Q. Cai, "Human Motion Analysis: A Review," in IEEE Nonrigid and Articulated Motion Workshop, 1997.
[13] J. L. Barron, D. J. Fleet, and S. Beauchemin, "Performance of optical flow techniques," International Journal of Computer Vision, 12, pp. 43-77, 1994.
[14] S. Avidan and A. Shamir, "Seam Carving for Content-Aware Image Resizing," ACM Transactions on Graphics, 26(3), Article 10, July 2007.