Histogram Based Split and Merge Framework for Shot Boundary Detection

D.S. Guru and Mahamad Suhil

Department of Studies in Computer Science, Manasagangothri, Mysore, India
[email protected], [email protected]

Abstract. In this paper, we propose a non-parametric approach for shot boundary detection in videos. The proposed method exploits the split and merge framework through the use of color histograms. Initially, every frame of the input video sequence undergoes color quantization, and a color histogram is then computed for every quantized frame. The split and merge process is driven by Fisher's linear discriminant criterion function and, after several iterations, yields a set of subsequences that are taken to be the shots present in the given video. The proposed method is experimentally tested on video samples from the TrecVid 2002 dataset and from YouTube. We obtain an overall performance of 85.5% precision, 87.1% recall and 86.1% F-measure on the dataset used. A comparative study of the proposed approach with contemporary research works is also carried out.

Keywords: color quantization, color histograms, split and merge, Fisher's linear discriminant analysis, shot boundary detection.

1  Introduction

Over the past two decades, owing to the rapid development of digital storage technology and available bandwidth, activities such as storing, sharing and searching multimedia data over the internet have become indispensable components of our lives. Among all types of multimedia data, video is the most frequently used, since it preserves both the visual and temporal behaviors of the objects present in a scene. However, searching for a video of particular interest is difficult and time consuming due to the size of the multimedia databases available on the web. To manage such huge multimedia databases, there have been a number of attempts over the last two decades towards the development of automated content based video indexing and retrieval (CBVR) systems [1-6]. A shot boundary is the location where the transition from one shot to the subsequent shot takes place in a video [7]. Among the various steps involved in a CBVR system, shot boundary detection is dealt with very carefully: it is the first step in the system, and its accuracy is essential for further activities such as key frame extraction, video indexing, dimensionality reduction and representation of the video [6].

R. Prasath and T. Kathirvalavakumar (Eds.): MIKE 2013, LNAI 8284, pp. 180–191, 2013. © Springer International Publishing Switzerland 2013

Abrupt and gradual transitions are the two types of shot transitions found in videos [9]. In the literature, we can find several good attempts to solve the problem of shot boundary detection based on various representation techniques. Some of the major approaches are: histogram based approaches [10-11], where the similarity and continuity of the frames in a sequence are measured with the help of histogram differences to arrive at the locations of maximum discontinuity; block based approaches [12], where each video frame is examined at the block level to extract local features that are matched with the corresponding blocks of subsequent frames to identify shot changes; model based approaches [13-14], where a model is trained to identify the possible shots; cluster based approaches [15-17], where the frame sequence is grouped into several clusters and every cluster is checked for the possibility of being a shot; non-parametric approaches [18], where shot boundaries are detected without recourse to any parameters such as thresholds; compressed domain approaches [19-21], where the video is processed in its compressed domain so that the cost of decompression is avoided; and fusion based approaches [22-24], where several approaches are combined in different ways to exploit the advantages of various popular techniques. Features such as color, texture, shape, sketch, SIFT, motion vectors and edges, in the spatial as well as in transformed domains (Fourier, cosine, wavelets, eigenvalues, etc.), are widely used in different combinations in many popular approaches. Regardless of the extensive investigations made and the copious techniques proposed, shot boundary detection is still an active area of research with many challenges [25-27].
This is because researchers have not yet arrived at a universal and robust model for shot boundary detection that serves as an ideal solution for videos of any modality and any degree of complexity. With this insight, we attempt to solve the problem of shot boundary detection using a split and merge framework. In our method, we use a memory efficient color histogram representation to represent the video frames in reduced dimensions. We then exploit the split and merge framework introduced in [28] for automatic shot detection from videos. The proposed method is experimentally validated on videos from the TrecVid dataset and on videos downloaded from the internet. A qualitative comparative analysis of the proposed approach with contemporary research works is also carried out. The rest of the paper is organized as follows: Section 2 presents the proposed shot boundary detection method. Experimental results and analysis are discussed in Section 3. Conclusion and future work are given in Section 4.

2  Proposed Method

In this section, we present our method for shot boundary detection, which applies the split and merge framework to color histogram features used as video frame representatives.


Given a video, there are on average 25 frames per second, leading to a very long sequence. Handling and processing such a long sequence of high dimensional frames demands considerable computing time and space. Hence, dimensionality reduction needs to be performed by extracting only the discriminating features. Fig. 1 shows the major steps involved in the proposed method. Given a video, we represent every video frame using color histogram features. The sequence of video frames is then treated as a sequence of feature vectors, which is fed to the split and merge framework and split into several subsequences, such that each subsequence is complete in its own sense and capable of being an individual unit of the given video. These subsequences are later shown to be the shots present in the video. So, once the split and merge process on the sequence of frame representatives is complete, we can easily identify the boundaries of every shot of that video.

Fig. 1. Block Diagram of the Proposed Method

2.1  Color Histogram for Video Frame Representation

In [28], the authors reduce the dimensionality of the input video frame sequence using Haralick texture features and spectral clustering. Initially, every frame is divided into 32×32 blocks, and Haralick texture features are extracted from each block. The extracted texture features of all blocks of a frame are then clustered using spectral clustering, so that any natural region that was earlier divided by the block partitioning can be recombined by comparing the texture features of corresponding blocks. With this, a video frame of size m×n is represented with the help of only k feature vectors corresponding to the k clusters of the


frame, and the dimension of the extracted feature vectors is much smaller than that of the original frame. The major disadvantage of the work proposed in [28] is the amount of time required to process each video frame, since every frame is processed at several levels to reduce its dimensionality. In our present method, we extract features from each video frame through a color histogram, which can be computed in less time. Color is one of the most significant properties of images: images and videos can be efficiently indexed through the color information present in them. Color information can be extracted more easily than most other image features, and color is, up to a certain level, invariant to transformations such as translation, rotation, mirroring and scaling. We are motivated by the work of Mas et al. [29], where a color histogram based shot boundary detection method is proposed. The method is based on the image bit plane slicing philosophy [30]. Given a gray scale digital image with 8 bits per pixel, not all 8 bits contribute equally to the appearance of the image; the higher order bits affect it significantly more than the lower order bits. Hence, eliminating a certain number of least significant bits from each pixel can reduce the memory requirements of the image, which is also a quantization method used in image compression. The authors of [29] use the same philosophy to compress video frames. Every video frame in RGB color space, represented with 24 bits/pixel, is considered, and the 4 least significant bits of each of the R, G and B channels are removed. After this elimination, the frame needs only 12 bits/pixel for representation. From the quantized video frame, the authors then create a color histogram.
Having efficiently represented every video frame with only 50 percent of its original size, a color histogram for the reduced 12 bit equivalent frame is created. Since there are only 2^12 (= 4096) possible values for each pixel in the quantized video frame, the number of histogram bins required is only 4096, which is far fewer than the 2^24 bins the original frames would require. We make use of color histograms generated in this way as video frame representatives in our work. Fig. 2 shows a few sample images in RGB color space with 24 bits/pixel and their corresponding 12 bits/pixel representations. It is evident from the 12 bits/pixel images that all important components of the image remain visible even after eliminating the 12 least significant bits. Thus, given a video frame of any dimension, we can generate a color histogram based representative for it with only 4096 values corresponding to the 4096 different color bins. So, a video of dimension N×(m×n×3), where N is the number of frames and each frame is composed of three m×n matrices, one per color band in RGB color space, can be reduced to N feature vectors of 4096 values each.
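The quantization and histogram construction described above can be sketched as follows; this is a minimal NumPy illustration, where the function name and the order in which the three 4-bit channels are packed into a 12-bit code are our own choices, not prescribed by [29]:

```python
import numpy as np

def frame_histogram_12bit(frame):
    """frame: H x W x 3 uint8 RGB array.
    Drop the 4 least significant bits of each channel and build a
    normalized 4096-bin color histogram (2^12 quantized colors)."""
    q = frame >> 4                                  # each channel now in [0, 15]
    # pack the three 4-bit channels into one 12-bit code per pixel
    codes = (q[..., 0].astype(np.int32) << 8) | (q[..., 1] << 4) | q[..., 2]
    hist = np.bincount(codes.ravel(), minlength=4096).astype(np.float64)
    return hist / hist.sum()                        # normalize by pixel count
```

Normalizing by the pixel count makes histograms of frames with different resolutions directly comparable, which is convenient when they are later treated as feature vectors.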


Fig. 2. Sample video frames: (a) with 24 bits/pixel; (b) with 12 bits/pixel

2.2  Shot Boundary Detection Using Split and Merge Framework

We now apply the split and merge process to the input video sequence of N frames, represented by N color histograms as explained in Section 2.1. We first present an overview of the split and merge framework introduced in [28]. In this framework, the authors approach the shot boundary detection problem in a new way using the split and merge concept. Split and merge is a well-known algorithmic strategy for solving complex problems, in which a problem is subdivided into smaller subproblems through successive splitting and merging until each subproblem becomes atomic, i.e., solvable without any difficulty. In this framework, the same analogy is used to arrive at the shots of a given video. The long sequence of video frames is repeatedly subdivided into smaller subsequences


through split and merge using their representatives, until the video is divided into many smaller subsequences, each dissimilar from the others, with a high Fisher criterion value between every two adjacent subsequences. Fisher's linear discriminant (FLD) analysis is chosen in the framework because FLD is one of the best known techniques in the literature for finding the optimal projection axis between two classes of data projected onto a space [31-32]. If the data points are well separated in the projected space, then the value of the criterion function for the two classes will be maximal. In the case of video frames, if we project the samples of the two subsequences being considered for a split onto some lower dimensional space and find the optimal projection axis between the two classes of points, then the Fisher criterion between well-separated subsequences will be maximal. Finally, the resulting small subsequences are declared to be the shots present in the given video.
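As an illustration, the Fisher criterion between two adjacent subsequences of histogram vectors can be computed as sketched below. This is a simplified version under our own assumptions: the small `eps` regularization term, added so that the within-class scatter matrix is always invertible, is our addition and not part of the formulation in [28]:

```python
import numpy as np

def fisher_criterion(A, B, eps=1e-6):
    """Fisher criterion J for two adjacent subsequences of frame
    histograms, A (n1 x d) and B (n2 x d). The FLD projection axis is
    w = Sw^-1 (m1 - m2); J is the ratio of between-class to
    within-class scatter of the data projected onto w."""
    m1, m2 = A.mean(axis=0), B.mean(axis=0)
    S1 = (A - m1).T @ (A - m1)               # class scatter of A
    S2 = (B - m2).T @ (B - m2)               # class scatter of B
    Sw = S1 + S2 + eps * np.eye(A.shape[1])  # regularized within-class scatter
    w = np.linalg.solve(Sw, m1 - m2)         # optimal projection axis
    pA, pB = A @ w, B @ w                    # 1-D projections of both classes
    between = (pA.mean() - pB.mean()) ** 2
    within = ((pA - pA.mean()) ** 2).sum() + ((pB - pB.mean()) ** 2).sum() + eps
    return between / within
```

Two subsequences drawn from different shots yield a much larger J than two subsequences drawn from the same shot, which is exactly the property the split and merge decisions rely on.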

Fig. 3. Split and Merge Framework

A video sequence, represented by its histogram representatives, is given to the framework for recursive splitting and merging. Initially, the entire sequence of frames is split into exactly two subsequences. That division is then validated by dividing each subsequence again into exactly two and comparing the value of the Fisher criterion function between each pair of adjacent subsequences with the value obtained in the previous split. Only if the values between the current pairs of adjacent subsequences are higher than those at the previous level is the process allowed to continue with each smaller subsequence; otherwise, the corresponding split is rolled back, and the two smaller subsequences are combined back into a single subsequence. Along with splitting, in each iteration we also check whether any two adjacent subsequences can be combined into one using the same Fisher criterion. That is, if the


value of the Fisher criterion function increases or remains unchanged when two adjacent subsequences are combined, then they are merged into a single subsequence. The reason we may split a sequence into two subsequences, or merge two smaller subsequences into one, is as follows: whenever a shot boundary is present in a given video sequence, splitting at that boundary and projecting the two parts onto a lower dimensional space yields two distinct clusters, owing to high between-class scatter and low within-class scatter. After several split and merge actions, repeated (in the worst case) down to subsequences containing a single frame each, the resulting small subsequences of the given video can be shown to be the shots present in the video.
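The recursive split-with-rollback step can be sketched as below. This is a deliberately simplified sketch, not the full algorithm of [28]: it omits the separate merge pass, and `best_split`, `min_len` and the stopping rule are our own illustrative choices. `j` is any Fisher-style criterion function taking two subsequences:

```python
import numpy as np  # features are rows of frame histograms (numpy arrays)

def best_split(X, j):
    """Return (index, J) for the split point that maximizes the
    criterion j between the two resulting subsequences."""
    scores = [(x, j(X[:x], X[x:])) for x in range(1, len(X))]
    return max(scores, key=lambda s: s[1])

def split_segment(X, start, j, j_parent=0.0, min_len=2):
    """Recursively split the subsequence X (whose first frame sits at
    absolute position `start`). A split is kept only if its J value
    exceeds the parent's; otherwise it is rolled back and the
    subsequence stays whole. Returns (start, end) shot boundaries."""
    if len(X) < 2 * min_len:
        return [(start, start + len(X))]
    x, jx = best_split(X, j)
    if jx <= j_parent:                       # roll back: no better division
        return [(start, start + len(X))]
    return (split_segment(X[:x], start, j, jx, min_len)
            + split_segment(X[x:], start + x, j, jx, min_len))
```

On a sequence containing one clear shot change, the top-level split lands on the boundary, and further splits inside each homogeneous half fail to raise J and are rolled back, leaving exactly two segments.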

Fig. 4. Variations in J value with respect to frame sequence

Fig. 4 shows the curve of the Fisher criterion value J over a sample video frame sequence of 500 frames. A point (x, y) on the curve gives the value y of J obtained when the frame sequence of n frames is projected as two classes, the first consisting of the subsequence from frame 1 to frame x and the second of the subsequence from frame x+1 to frame n. The locations where the J values are maximal indicate that the corresponding two subsequences are well separated in the projected space.

3  Experimentation and Analysis

To evaluate the suitability of the proposed approach, we experiment on seven different video samples. Four are taken from the TrecVid 2002 dataset (available at http://www.open-video.org/) and three are downloaded from YouTube: a news video, a cricket video and an animated movie. Manually identified shots in each test video sequence are taken as ground truth. The dataset and ground truth details are given in Table 1.


Table 1. Dataset and ground truth

Dataset Source       Serial No.   Caption                                                                     Frames   Length (s)   Transitions
1. TrecVid 2002      1            17 Days: The Story of Newspaper History in making (1945)                    4000     137          15
                     2            1955 Chevrolet Screen Ads                                                   3900     135          12
                     3            6 1/2 Magic Hours                                                           4800     166          16
                     4            According to Plan: The Story of Modern Sidewalls for the Homes of America   5200     180          17
2. Internet Videos   5            News                                                                        2760     120          30
                     6            Cricket                                                                     2000     80           16
                     7            Animated Movie                                                              3000     131          34

The results obtained by the proposed approach on the test dataset are given in Table 2. We use the following popular evaluation measures to evaluate the results of the proposed method:

Precision (P) = D / (D + FA),   Recall (R) = D / (D + MD),   F-measure (F) = 2 * P * R / (P + R)

where D is the number of shots correctly detected, MD is the number of shots missed, and FA is the number of false alarms. We can see from Table 2 that, on average, the overall performance of the proposed method is good. Among the seven videos considered, the highest F-measure is achieved on the fourth video, from the TrecVid dataset, with only one missed shot and one false alarm. This is because the video is free from object motion and contains few gradual transitions. The method performs worst on video 6, a cricket video. This is because cricket videos contain a large amount of editing effects, such as score boards or match summaries displayed suddenly during the broadcast, zooming effects, strong illumination effects from the outdoor environment, object motion, camera motion and the use of multiple cameras.
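The three measures follow directly from these counts; the small helper below (the function name `prf` is our own) reproduces, for example, the row for video 4 (D = 16, MD = 1, FA = 1), where all three measures come to 0.941:

```python
def prf(D, MD, FA):
    """Precision, recall and F-measure from the number of correctly
    detected shots D, missed detections MD and false alarms FA."""
    P = D / (D + FA)
    R = D / (D + MD)
    F = 2 * P * R / (P + R)
    return P, R, F
```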

Table 2. Experimental results of the proposed method

Serial No.   D    MD   FA   Precision (P)   Recall (R)   F-measure (F)
1            12   3    2    0.857           0.800        0.828
2            11   1    3    0.786           0.917        0.846
3            13   3    2    0.867           0.813        0.839
4            16   1    1    0.941           0.941        0.941
5            26   4    3    0.897           0.867        0.881
6            14   2    4    0.778           0.875        0.824
7            30   4    5    0.857           0.882        0.870
Average                     0.855           0.871        0.861

Table 3. Qualitative comparative analysis

Method: [33]
  Video representation: Gray-level frames are divided into 16×16 macro blocks; block motions are estimated, and a plot of the differences between two successive motion intensities is used.
  Parametric: Yes
  Parameters used: Number of blocks per frame; number of adjacent blocks compared for motion estimation; T: matching-error threshold between macro blocks.
  Remarks: Block motion estimation requires comparison with 5 neighboring blocks; the matching-error threshold requires training.

Method: [34]
  Video representation: Histogram of each of the R, G and B color bands in RGB space.
  Parametric: Yes
  Parameters used: d: number of histogram bins; n: number of blocks; r: range of neighboring frames involved in SBD; r1: range of frames involved in computing the continuity feature; Tp: threshold for similarity between histograms of two frames; T: range of frames involved in classifying SBDs; T0: threshold for monochrome value; f: neighborhood range; Ts: threshold for similarity value; Tα: threshold for distance; Tm: threshold for monochrome values.
  Remarks: Although high accuracy is assured, the large number of parameters makes the method complex, and fixing them requires rigorous training. Two SVMs are used, each with complexity O(n^3), where n is the training-set size.

Method: [29]
  Video representation: Color histogram from a color-quantized frame.
  Parametric: Yes
  Parameters used: W: window size used for convolving the histogram-difference signal; a threshold to detect cuts; α: weight factor in the range [0, 1]; size of the structuring element for signal convolution and morphological operations.
  Remarks: The use of color histograms after color quantization makes the method memory efficient, but the parameters and the multi-stage processing of the signal make it a slightly complex approach.

Method: Proposed approach
  Video representation: Color histogram from a color-quantized frame.
  Parametric: No
  Parameters used: Nil
  Remarks: In addition to being memory efficient, the approach is non-parametric through the use of the split and merge framework.


To demonstrate the merits of the proposed method, a qualitative comparative analysis with the contemporary research works [29, 33-34] is given in Table 3. The work in [33] is considered because it uses block motion estimation to compute a motion vector for each video frame, a technique proven effective with high accuracy. The work in [34] is considered because it uses a variant of the color histogram creation process to extract features from video frames. Another approach that resembles ours in terms of the features used is that of [29], where similar histogram features represent the video; however, [29] uses a histogram-difference curve to locate shot boundaries, with post-processing based on convolution, derivatives and morphological operators. In our model, we treat the histograms as representatives of the video frames and apply the split and merge process to segment the video into its constituent shots. It can be noted from the comparative analysis that the proposed approach is well suited to real time applications, as it is free from parameters, making it non-parametric in addition to being time efficient.

4  Conclusion

In this work, we have exploited the split and merge framework for shot boundary detection. The idea underlying this framework is to view the video as contiguous groups of frames, where frames within a group are more similar to each other and better preserve temporal continuity than frames in different groups; these groups are precisely the shots present in the video. Shot boundary detection then amounts to identifying the places where two contiguous groups of frames meet. Here, the video is given to the split and merge framework in the form of color histogram representatives obtained by applying color quantization to each video frame. Experiments are conducted on video samples from the TrecVid 2002 dataset and YouTube. The proposed method is evaluated using precision, recall and F-measure, achieving 85.5% precision, 87.1% recall and 86.1% F-measure on the test dataset. A qualitative comparative analysis with contemporary research works shows that the proposed method is better in several aspects.

References

1. Idris, F., Panchanathan, S.: Review of image and video indexing techniques. J. Vis. Commun. Image Represent. 8(2), 146–166 (1997)
2. Brunelli, R., Mich, O., Modena, C.M.: A survey on the automatic indexing of video data. J. Vis. Commun. Image Represent. 10, 78–112 (1999)
3. Koprinska, I., Carrato, S.: Temporal video segmentation: a survey. Signal Processing: Image Communication 16(5), 477–500 (2001)
4. Lefevre, S., Holler, J., Vincent, N.: A review of real-time segmentation of uncompressed video sequences for content-based search and retrieval. Real-Time Imaging 9(1), 73–98 (2003)


5. Patel, B.V., Shah, B.B.: Content based video retrieval systems. Int. J. Ubi Comp. 3(2), 13–30 (2012)
6. Kanagavalli, R., Duraiswamy, K.: A study on techniques used in digital video for shot segmentation and content based video retrieval. European Journal of Scientific Research 69(3), 370–380 (2012)
7. Mittal, A., Cheong, L., Sing, L.: Robust identification of gradual shot-transition types. In: Proceedings of 2002 International Conference on Image Processing, vol. 2, pp. 413–416 (2002)
8. Patel, N.V., Sethi, I.K.: Video shot detection and characterization for video databases. Pattern Recognition 30, 583–592 (1997)
9. Yuan, J., Wang, H., Xiao, L., Zheng, W., Li, J., Lin, F., Zhang, B.: A formal study of shot boundary detection. IEEE Trans. on Circuits and Systems for Video Technology 17(2), 168–186 (2007)
10. Zhang, C., Wang, W.: A robust and efficient shot boundary detection approach based on fisher criterion. In: Proceedings of the 20th ACM International Conference on Multimedia (MM 2012), pp. 701–704. ACM, New York (2012)
11. Onur, K., Ugur, G., Ozgur, U.: Fuzzy color histogram-based video segmentation. Computer Vision and Image Understanding 114(1), 125–134 (2010)
12. Abdelati, M.A., Ben, A.A., Mtibaa, A.: Video shot boundary detection using motion activity descriptor. J. Telecommun. 2(1), 54–59 (2010)
13. Chen, W., Zhang, Y.: Parametric model for video content analysis. Pattern Recogn. Lett. 29(3), 181–191 (2008)
14. Massimiliano, A., Chianese, A., Moscato, V., Sansone, L.: A formal model for video shot segmentation and its application via animate vision. Multimedia Tools Appl. 24(3), 253–272 (2004)
15. Damnjanovic, U., Izquierdo, E., Grzegorzek, M.: Shot boundary detection using spectral clustering. In: 15th European Signal Processing Conference (EUSIPCO 2007), Poznan, Poland, pp. 1779–1783 (2007)
16. Wang, P., Liu, Z., Yang, S.: Investigation on unsupervised clustering algorithms for video shot categorization. Journal of Soft Comput. 11(4), 355–360 (2006)
17. Yuchou, C., Lee, D.J., Yi, H., James, A.: Unsupervised video shot detection using clustering ensemble with a color global scale-invariant feature transform descriptor. J. Image Video Proc. 1, 1–10 (2008)
18. Manjunath, S., Guru, D.S., Suraj, M.G., Harish, B.S.: A non-parametric shot boundary detection: an Eigen gap based approach. In: Proceedings of Fourth Annual ACM Bangalore Conference, vol. 1, pp. 1030–1036
19. Wang, H., Divakaran, A., Vetro, A., Chang, S.F., Sun, H.: Survey of compressed-domain features used in audio-visual indexing and analysis. J. Visual. Commun. Image Represent. 14, 150–183 (2003)
20. Bruyne, S.D., Deursen, D.V., Cock, J.D., Neve, W.D., Lambert, P., Walle, R.V.D.: A compressed-domain approach for shot boundary detection on H.264/AVC bit streams. Signal Processing: Image Communication 23, 473–489 (2008)
21. Chen, J., Ren, J., Jiang, J.: Modelling of content-aware indicators for effective determination of shot boundaries in compressed MPEG videos. Multimedia Tools Appl. 54(2), 219–239 (2011)
22. Jacobs, A., Miene, A., Ioannidis, G.T., Herzog, O.: Automatic shot boundary detection combining color, edge, and motion features of adjacent frames (2004), http://wwwnlpir.nist.gov/projects/tvpubs/tvpapers04/ubremen.pdf


23. Chang, Y., Lee, D.J., Hong, Y., Archibald, J.: Unsupervised video shot detection using clustering ensemble with a color global scale-invariant feature transform descriptor. J. Image Video Process. 9, 10 (2008)
24. Philips, M., Wolf, W.: A multi-attribute shot segmentation algorithm for video programs. Telecommunication Systems 9(3-4), 393–402 (1998)
25. Boreczky, J.S., Rowe, L.A.: Comparison of video shot boundary detection techniques. J. Electron Imaging 5(2), 122–128 (1996)
26. Alan, F.S., Palu, O., Aiden, R.D.: Video shot boundary detection: Seven years of TRECVid activity. Comput. Vis. Image Und. 114(4), 411–418 (2010)
27. Mishra, R., Singhai, S.: A review on different methods of video shot boundary detection. International Journal of Management IT and Engineering 2(9), 46–57 (2012)
28. Guru, D.S., Suhil, M., Lolika, P.: A novel approach for shot boundary detection in videos. In: Multimedia Processing, Communication and Computing Applications. LNEE, vol. 213, pp. 209–220. Springer (2013)
29. Mas, J., Fernandez, G.: Video shot boundary detection using color histogram (2003), http://www-nlpir.nist.gov/projects/tvpubs/tvpapers03/ramonlull.paper.pdf
30. Gonzalez, R.C., Woods, R.E.: Digital Image Processing. PHI Learning Private Limited, New Delhi (2008)
31. Welling, M.: Fisher linear discriminant analysis. Max Welling's class notes in machine learning, http://www.ics.uci.edu/~welling/classnotes/classnotes.html
32. Nagabhushana, P., Guru, D.S., Shekara, B.H.: (2D)2 FLD: An efficient approach for appearance based object recognition. Neurocomputing 69, 934–940 (2006)
33. Atmel, A.M., Abdessalem, B.A., Abdellatif, M.: Video shot boundary detection using motion activity descriptor. Journal of Telecommunications 2(1), 54–59 (2010)
34. Zhang, C., Wang, W.: A robust and efficient shot boundary detection approach based on fisher criterion. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 701–704 (2012), doi:10.1145/2393347.2396291