J. Vis. Commun. Image R. 24 (2013) 544–551


Visual attention guided video copy detection based on feature points matching with geometric-constraint measurement

Duan-Yu Chen *, Yu-Ming Chiu
Department of Electrical Engineering, Yuan Ze University, Chung-Li, Taiwan

Article info

Article history: Received 26 June 2012; Accepted 3 April 2013; Available online 13 April 2013

Keywords: Visual attention; Video copy detection; Feature point; Geometric constraint; Spatiotemporal analysis; Delaunay triangulation; Video copy attacks; Similarity measurement

Abstract

In this paper, to efficiently detect video copies, the focus of interest in videos is first localized based on 3D spatiotemporal visual attention modeling. Salient feature points are then detected in the visual attention regions. Prior to evaluating the similarity between source and target video sequences using feature points, a geometric-constraint measurement is employed for bi-directional point matching in order to remove noisy feature points while retaining robust feature point pairs. Video matching is thereby transformed into a frame-based time-series linear search problem. The proposed approach achieves a high detection rate under distinct video copy attacks and thus shows its feasibility in real-world applications.

© 2013 Elsevier Inc. All rights reserved.

1. Introduction

While innovative and light-weight applications have been developed for mobile platforms, huge volumes of data, particularly video clips, are easily shared between users and even edited and reused with or without copyright authorization. The issue of how to effectively detect video copies has therefore been raised again recently. In this paper, we focus on content-based video copy detection. This research field has been investigated for several years; however, how to extract representative features for video matching is still a challenging problem. In the literature, feature descriptors such as the Harris corner detector [16], SIFT [17], SURF [15], and bag-of-words models [13,18] are used for image or video matching. Feature point descriptors such as SIFT work well in object recognition, but for content-based video matching the inexact matching of video content makes similarity computation based on feature point matching even more challenging. To overcome this problem, a geometric constraint is applied to feature point matching. In addition, based on our observations, intentional editing of video copies tends to preserve the focus of interest, so computing the similarity of video clips within visual attention regions benefits video copy detection. Detecting visual saliency successfully can substantially reduce the computational complexity of the video copy detection process.

According to a study conducted by cognitive psychologists [10], the human visual system picks salient features from a scene. Psychologists believe this process emphasizes the salient parts of a scene and, at the same time, disregards irrelevant information. However, this raises the question: what parts of a scene should be considered "salient"? To address this question, several visual saliency (or attention) models have been proposed in the last decade [1–5,7,8]. Based on the type of attention pattern adopted, the models can be roughly categorized into two classes: bottom-up approaches, which extract image-based saliency cues, and top-down approaches, which extract task-dependent cues. Usually, extracting task-dependent cues requires a priori knowledge of the target(s), which is difficult to obtain; therefore, we focus on a bottom-up approach in this work. In [8], Itti and Koch reviewed different computational models of visual attention and presented a bottom-up, image-based visual attention system. Previously, Itti et al. [5] had proposed one of the earliest saliency-based computational models for determining the most attractive regions in a scene, using the contrasts in color, intensity and orientation of images as cues to represent local conspicuity. Ma et al. [2] used the motion vector fields in MPEG bitstreams to build a motion attention map directly. In addition, some approaches have extended the spatial attention model from images to videos, in which motion plays an important role. To find motion activity in videos, such models compute structural tensors [1] or estimate optical flows in successive frames directly [3].


Fig. 1. Overview of the proposed approach.

Zhai and Shah [3] construct both spatial and temporal saliency maps and fuse them in a dynamic fashion to produce an overall spatiotemporal attention model. The model first computes the correspondence between points of interest and then detects the temporal saliency using homography. Li and Lee [1] proposed a model of spatiotemporal attention for shot matching. Under this approach, a motion attention map generated from temporal structural tensors and a static attention map are used simultaneously to determine the degree of visual saliency. The weights of the feature maps are varied according to their contrast in each video frame. However, with this mechanism, the degree of visual saliency can be biased by the static attention map when the static features have higher contrast than the motion features. Therefore, in this work, as illustrated in Fig. 1, we present a new visual saliency model based on dynamic features, which extracts salient feature points from the 3D spatiotemporal volumes of video sequences. As a result, we are able to capture the dynamics of a video sequence. The novelty of our work is the mechanism developed to locate salient regions, especially the proper salient extent determined on the constructed motion attention map using a novel measurement. In addition, geometric-constrained feature point correspondence is employed for conducting bi-directional point matching in order to remove noisy feature points while retaining robust feature point pairs. Consequently, video matching is transformed into a frame-based time-series linear search problem. The remainder of this paper is structured as follows. In the next section, we introduce the proposed visual saliency model. Section 3 describes the geometric-constraint measurement for feature point matching. Section 4 presents the similarity measurement. Section 5 details the experimental results. Finally, we present our conclusion in Section 6.

2. Visual attention modeling

To locate the salient regions in a video efficiently, we first detect the salient points in the video's corresponding spatiotemporal 3D volume. The points are then used as seeds to search the extent of the salient regions in the constructed motion attention map, in which the extent of the salient regions is determined by finding a motion map that corresponds to the maximum entropy. In the following, we describe the process for detecting spatiotemporal salient points, and explain how we generate a motion attention map. We then introduce the proposed selective visual attention model, which is based on finding the maximum entropy.

2.1. Detection of spatiotemporal salient points

The spatiotemporal Harris detector, proposed by Laptev and Lindeberg [4], extends Harris and Stephens' corner detector [6] to consider the time axis. Similar to the operation performed in the spatial domain, the spatiotemporal Harris detector is based on a 3 × 3 second-moment matrix μ for a given spatial scale σ_l and temporal scale τ_l (τ_l is set to 7 in the experiment). The matrix is composed of first-order spatial and temporal derivatives averaged with a Gaussian weighting function g(x, y, t : σ_i^2, τ_i^2), i.e.,

\mu = g(x, y, t : \sigma_i^2, \tau_i^2) * \begin{pmatrix} L_x^2 & L_x L_y & L_x L_t \\ L_x L_y & L_y^2 & L_y L_t \\ L_x L_t & L_y L_t & L_t^2 \end{pmatrix},   (1)

where the integration scales are \sigma_i^2 = s\sigma_l^2 and \tau_i^2 = s\tau_l^2; L_j is a first-order Gaussian derivative along the j axis; and the spatiotemporal separable Gaussian kernel is defined as

g(x, y, t : \sigma_l^2, \tau_l^2) = \exp\left(-(x^2 + y^2)/2\sigma_l^2 - t^2/2\tau_l^2\right) \Big/ \sqrt{(2\pi)^3 \sigma_l^4 \tau_l^2}.   (2)

To detect points of interest, points that have significant corresponding eigenvalues \lambda_1, \lambda_2, \lambda_3 of \mu are considered salient. The saliency function is defined as

H = \det(\mu) - k\,\mathrm{trace}^3(\mu) = \lambda_1\lambda_2\lambda_3 - k(\lambda_1 + \lambda_2 + \lambda_3)^3,   (3)

where k is a tunable sensitivity parameter. The salient points detected by Eq. (3) are illustrated in Fig. 2(b). We observe that the majority of salient points are located near the boundary of an attended object due to the intrinsic nature of corners. In contrast, points with relatively low saliency are located inside the moving objects and usually correspond to consistent motion. To generate effective seeds for searching the appropriate extent of moving regions, we employ the centroid of every salient region instead of the commonly used local maxima. The map shown in Fig. 2(b) is first thresholded by its mean and then processed by morphological operations: erosion with a disk structuring element of radius 4, followed by dilation with a disk structuring element of radius 2. The resulting regions corresponding to the salient points are shown in Fig. 2(c), and the centroid of each region is taken as a seed for the subsequent search task.
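As an illustration of Eqs. (1)–(3) and the seed-generation step above, the following Python/SciPy sketch computes a per-voxel spatiotemporal Harris response on a grayscale frame volume, thresholds it by its mean, applies the erosion/dilation clean-up, and returns region centroids as seeds. It is only a minimal sketch: the spatial scale σ_l, the integration factor s and the sensitivity k are assumed placeholder values (the text only fixes τ_l = 7), and it is not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import (gaussian_filter, binary_erosion, binary_dilation,
                           label, center_of_mass)

def disk(radius):
    # Disk-shaped structuring element for the morphological clean-up.
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    return (x * x + y * y) <= radius * radius

def spatiotemporal_harris_seeds(volume, sigma_l=1.5, tau_l=7.0, s=2.0, k=0.005):
    """volume: float array of shape (T, H, W); returns (response H, per-frame seeds)."""
    # First-order Gaussian derivatives L_t, L_y, L_x at local scales (sigma_l, tau_l).
    L = gaussian_filter(volume, sigma=(tau_l, sigma_l, sigma_l))
    Lt, Ly, Lx = np.gradient(L)

    def gi(a):
        # Averaging at the integration scales sigma_i = sqrt(s)*sigma_l, tau_i = sqrt(s)*tau_l.
        return gaussian_filter(a, sigma=(np.sqrt(s) * tau_l,
                                         np.sqrt(s) * sigma_l,
                                         np.sqrt(s) * sigma_l))

    # Entries of the second-moment matrix of Eq. (1).
    Axx, Ayy, Att = gi(Lx * Lx), gi(Ly * Ly), gi(Lt * Lt)
    Axy, Axt, Ayt = gi(Lx * Ly), gi(Lx * Lt), gi(Ly * Lt)

    # Saliency H = det(mu) - k * trace(mu)^3 (Eq. (3)), written out per voxel.
    det = (Axx * (Ayy * Att - Ayt * Ayt)
           - Axy * (Axy * Att - Ayt * Axt)
           + Axt * (Axy * Ayt - Ayy * Axt))
    H = det - k * (Axx + Ayy + Att) ** 3

    seeds = []
    for t in range(H.shape[0]):
        # Threshold by the mean, then erosion (radius 4) and dilation (radius 2).
        mask = H[t] > H[t].mean()
        mask = binary_erosion(mask, structure=disk(4))
        mask = binary_dilation(mask, structure=disk(2))
        labels, n = label(mask)
        # Centroid of every surviving region serves as a search seed.
        seeds.append(center_of_mass(mask, labels, range(1, n + 1)))
    return H, seeds
```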

Fig. 2. (a) The original video frame; (b) pixel corner responses detected by using 3D Harris corner detection; (c) the thresholded salient map of (b).


Fig. 3. Attended regions detected by the proposed approach: (a), (c), (e), (g) and (i) original video frame; (b), (d), (f), (h) and (j) detected saliency map.


2.2. Motion attention map

After detecting the seeds, we compute a motion attention map to determine the extent of searching. Based on the observation in Section 2.1 that regions located inside moving objects usually correspond to consistent motion, our goal is to find regions that contain consistent motion near the search seeds. The optical flow (u, v, w) in the neighborhood of a search seed can be estimated by solving the following structural tensor equation [9]:

\mu \cdot [u\ v\ w]^{T} = 0_{3\times 1},   (4)

where \mu is the matrix defined in Eq. (1). Ideally, there would be multiple motions within the Gaussian-smoothed neighborhood with spatial scale σ_i and temporal scale τ_i when rank(μ) is three. If there is consistent motion in a windowed area, rank(μ) would be equal to two. In the other two cases, i.e., when rank(μ) equals 0 or 1, the motion of an image structure cannot be derived from Eq. (4) directly. However, in real videos, μ always has full rank. Therefore, the normalized and continuous measure defined in [1] is used to quantify the degree of rank deficiency of the matrix. Let the eigenvalues of μ be λ_1 ≥ λ_2 ≥ λ_3. The continuous rank-deficiency measure d_μ is defined as

d_\mu = \begin{cases} 0, & \mathrm{trace}(\mu) < \gamma \\ \lambda_3^2 \big/ \left(\tfrac{1}{2}\lambda_1^2 + \tfrac{1}{2}\lambda_2^2 + \varepsilon\right), & \text{otherwise} \end{cases}   (5)

where ε is a constant used to avoid division by zero and the threshold γ handles the case rank(μ) = 0. Since we want to find regions with consistent motion, regions with high and low values of d_μ are not considered attended regions; regions with median values are the targets of interest. Therefore, we apply a median filter f_μ to suppress regions with multiple motions (the high- and low-valued regions) and keep regions with consistent motion. As a result, we obtain the motion attention map d_μ^{f_μ} for further processing. The motion attention map d_μ^{f_μ} complements the saliency map generated by H defined in Eq. (3); hence, a proper combination of these two maps can determine the real extent of a salient area. To combine the two complementary salient maps, the seeds produced in the first step, described in Section 2.1, are used as starting points to search the appropriate extent in the motion attention map.
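To make Eqs. (4) and (5) concrete, the sketch below evaluates the continuous rank-deficiency measure d_μ from the eigenvalues of the per-pixel second-moment matrix and applies the median filter f_μ. The threshold γ, the constant ε and the filter size are assumed placeholders, not values taken from the paper.

```python
import numpy as np
from scipy.ndimage import median_filter

def motion_attention_map(mu, gamma=1e-4, eps=1e-8, filter_size=5):
    """mu: array of shape (H, W, 3, 3) holding the second-moment matrix of Eq. (1)
    at every pixel of one frame; returns the median-filtered map d_mu^{f_mu}."""
    # Eigenvalues are returned in ascending order; reorder so lam1 >= lam2 >= lam3.
    lam = np.linalg.eigvalsh(mu)                 # shape (H, W, 3)
    lam1, lam2, lam3 = lam[..., 2], lam[..., 1], lam[..., 0]

    # Continuous rank-deficiency measure of Eq. (5).
    trace = mu[..., 0, 0] + mu[..., 1, 1] + mu[..., 2, 2]
    d_mu = lam3 ** 2 / (0.5 * lam1 ** 2 + 0.5 * lam2 ** 2 + eps)
    d_mu[trace < gamma] = 0.0

    # Median filtering keeps regions of consistent motion (mid-range values).
    return median_filter(d_mu, size=filter_size)
```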

2.3. Visual attention modeling

To find the extent of attended regions, we take the configuration of the motion attention map that corresponds to the maximum entropy as our target. Entropy is an indicator of the degree of disorder: when the internal energy of a system increases, its entropy increases, and stability is usually reached when the molecules within the system are uniformly distributed, at which point the entropy reaches its maximum. We adopt the exponential entropy defined by Pal and Pal [12] to avoid the case where the probability approaches zero. For each motion attention map, the most appropriate scale S_e is determined by a center-surround approach that extends a local region from a seed (x_s, y_s) in the motion attention map toward its four neighboring directions and checks whether the evaluated entropy of the local region e reaches a maximum. Therefore, for each region centered at the seed (x_s, y_s), S_e is defined by

S_e = \arg\max_e \left\{ H_{\exp}(e, (x_s, y_s), t) \cdot W_{\exp}(e, (x_s, y_s), t) \right\},   (6)

where

H_{\exp}(e, (x_s, y_s), t) = \sum_{v \in D} p_{v,e,(x_s,y_s),t} \cdot \exp\left(1 - p_{v,e,(x_s,y_s),t}\right)   (7)

and

W_{\exp}(e, (x_s, y_s), t) = \left| H_{\exp}(e, (x_s, y_s), t) - H_{\exp}(e-1, (x_s, y_s), t) \right|.   (8)

D is the set of values of d_μ^{f_μ} that correspond to the histogram distribution in a local region e around a seed (x_s, y_s) in the motion attention map at time t. Eq. (7) defines the exponential entropy in a local region e, and the probability mass function p_{v,e,(x_s,y_s),t} is defined as the histogram of pixel values at time t for scale e at position (x_s, y_s) and value v, which belongs to D. The values of the map are normalized to [0, 1] and the range is split into 20 bins for histogram computation; the number of bins should not be too high, or the entropy could not properly reflect the changes of appearance across different search scales. Using this approach, the most appropriate scale S_e is obtained by iteratively computing the exponential entropy weighted by the entropy change between successive scales e. A larger value of Eq. (7) indicates more significant differences in the map d_μ^{f_μ}. Therefore, searching the proper extent exhaustively, we choose the maximum difference, as described in Eq. (6), and consider it the boundary of the salient region. The results of salient region detection are demonstrated in Fig. 3: the red boxes are derived by using the entropy maximization process to search for the proper scale in the motion attention map, and Fig. 3(b), (d), (f), (h) and (j) show the saliency maps generated by exponential entropy maximization. The attended regions are determined by finding the extent that covers the moving target well.
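The center-surround scale search of Eqs. (6)–(8) could be sketched as follows: a window is grown around a seed on the normalized motion attention map, a 20-bin histogram inside the window provides the probability mass function of Eq. (7), and the scale maximizing the entropy-weighted entropy change is kept. The growth step and the maximum extent are assumptions; only the 20-bin histogram follows the text.

```python
import numpy as np

def exponential_entropy(patch, bins=20):
    # Exponential entropy of Pal and Pal over a 20-bin histogram (Eq. (7)).
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    return float(np.sum(p * np.exp(1.0 - p)))

def best_scale(d_mu, seed, step=4, max_extent=80):
    """Grow a square window around `seed` on the normalized motion attention
    map `d_mu` and return the extent maximizing H_exp * W_exp (Eq. (6))."""
    ys, xs = seed
    h, w = d_mu.shape
    best_e, best_score, prev_H = step, -np.inf, None
    for e in range(step, max_extent + 1, step):
        y0, y1 = max(0, ys - e), min(h, ys + e + 1)
        x0, x1 = max(0, xs - e), min(w, xs + e + 1)
        H = exponential_entropy(d_mu[y0:y1, x0:x1])
        if prev_H is not None:
            # Eq. (6): entropy weighted by its change across scales (Eq. (8)).
            score = H * abs(H - prev_H)
            if score > best_score:
                best_score, best_e = score, e
        prev_H = H
    return best_e
```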

3. Geometric-constraint feature point measurement

Once feature points have been detected via Eq. (1), to achieve robust point matching we adopt bi-directional matching to select seed point pairs (stable matches) for the remaining tasks [14]. It is based on the observation that true positive matches of distinctive features between the source image and the target image are more likely to be bi-directional, i.e., to hold from source to target and vice versa. In other words, if the relationship between matches is unidirectional, the match could be unstable or incorrect and needs to be removed. Therefore, we assume that if there are few stable matches, the corresponding frames are very likely to be unrelated. The features finally matched by the bi-directional matching method are taken as seed points, e.g., the blue points shown in Fig. 4. The geometric-constraint measurement is invariant to translation, rotation and scale transformations due to the robust feature points obtained from bi-directional matching; it is also invariant to affine transformation and robust to partially perspective transformation. The triangle-constraint measurement is briefly introduced as follows. The triangles in I_A and their correspondences in I_B are obtained, and the case shown in Fig. 4(b) is used to explain the process. Without loss of generality, we denote the vertexes 3, 4 and 6 by a, b and c (a′, b′ and c′) for the left (right) triangle, respectively. The features are limited to the triangles Δabc and Δa′b′c′, which means that only the sets of features inside the triangles (the 'Δ' markers in Fig. 4(b)), denoted as P_A and P_B, are involved, while the feature points outside the triangles (the red '*' markers in Fig. 4(b)) are not considered yet. For each feature point P_i in Δabc, the relationship between P_i and the vertexes of Δabc is

P_i = a + \beta(b - a) + \gamma(c - a),   (9)

where β and γ are the scale coefficients of the vectors (b − a) and (c − a), respectively. Since the three vertexes of Δabc are known, the parameters K can be computed easily by

K = \begin{pmatrix} \alpha \\ \beta \\ \gamma \end{pmatrix} = \begin{pmatrix} x_a & x_b & x_c \\ y_a & y_b & y_c \\ 1 & 1 & 1 \end{pmatrix}^{-1} \begin{pmatrix} x_i \\ y_i \\ 1 \end{pmatrix},   (10)


Fig. 4. Example of the triangle-constraint measurement. (a) Illustration of the Delaunay algorithm; (b) Illustration of the geometric-constraint.

Fig. 5. Point correspondences after applying the triangle-constraint measurement: (a) original video frame; (b) video frame under a brightness-modification attack.

where α = 1 − β − γ. The same parameters hold for the relationship between the estimated point P_e and the vertexes of Δa′b′c′ in I_B if the triangles are a true correspondence. Therefore, we can estimate the coordinates of P_e by

\begin{pmatrix} x_e \\ y_e \\ 1 \end{pmatrix} = \begin{pmatrix} x_{a'} & x_{b'} & x_{c'} \\ y_{a'} & y_{b'} & y_{c'} \\ 1 & 1 & 1 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \\ \gamma \end{pmatrix}.   (11)


Fig. 6. Demonstration of the selected video dataset [13].

Table 1
Eight video copy attacks applied in the experiment.

Type           Description
Brightness     Enhance the brightness by 20%
Compression    Set the compression quality at 50%
Noise          Add random noise (10%)
Resolution     Change the frame resolution to 120 × 90 pixels
Cropping       Crop the top and bottom frame regions by 10% each
Zoom-in        Zoom in the frame 10%
Slow motion    Halve the video speed
Fast forward   Double the video speed
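For reference, the transformations in Table 1 could be approximated with OpenCV and NumPy as in the sketch below; the actual editing tools and exact parameter semantics (e.g., how the 10% random noise is generated) are not specified in the paper, so this is only an assumed reproduction.

```python
import cv2
import numpy as np

def brightness(frame):                       # enhance brightness by 20%
    return np.clip(frame.astype(np.float32) * 1.2, 0, 255).astype(np.uint8)

def compression(frame):                      # re-encode with JPEG quality 50%
    ok, buf = cv2.imencode('.jpg', frame, [cv2.IMWRITE_JPEG_QUALITY, 50])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)

def noise(frame):                            # add roughly 10% random noise
    n = np.random.uniform(-0.1 * 255, 0.1 * 255, frame.shape)
    return np.clip(frame.astype(np.float32) + n, 0, 255).astype(np.uint8)

def resolution(frame):                       # resample to 120 x 90 pixels
    return cv2.resize(frame, (120, 90))

def cropping(frame):                         # crop top and bottom by 10% each
    h = frame.shape[0]
    return frame[int(0.1 * h):int(0.9 * h), :]

def zoom_in(frame):                          # zoom in by 10%, keep original size
    h, w = frame.shape[:2]
    y, x = int(0.05 * h), int(0.05 * w)
    return cv2.resize(frame[y:h - y, x:w - x], (w, h))

def slow_motion(frames):                     # halve the speed by repeating frames
    return [f for f in frames for _ in range(2)]

def fast_forward(frames):                    # double the speed by dropping frames
    return frames[::2]
```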

Considering robustness to noise and distortions, the area within R pixels of P_e (R = 3 in our experiments) is regarded as a candidate area, and the feature points in this area are candidate features C (the green 'Δ' markers in Fig. 4(b)). The similarity between P_i and the candidate feature C_j is defined by

s_j = 1.5^{-(dist_j / R)^2} \cdot D_i^{T} D_{c_j},   (12)

where dist_j is the Euclidean distance between C_j and P_i, D_i and D_{c_j} are the descriptors of P_i and C_j respectively, and j = 1, 2, ..., |C|. If the maximum similarity value over all the feature points in C is greater than a predefined threshold τ (τ is set to 0.4 in the experiment), the corresponding feature point pair is considered a temporary match. After all feature points in P_A are processed, a set T containing all the temporary matches between P_A and P_B is obtained. The temporary matches are considered correct matches if the set T satisfies

|T| > k \cdot \min\{|P_A|, |P_B|\},   (13)

where k is set to 0.3 in our experiments. Otherwise, the triangle pair and the temporary matches between them are discarded. The approach performs well based on the observation that if a pair of triangles is a true correspondence, the points will approximately satisfy the triangle constraint due to the accurate vertexes; otherwise, they differ significantly. Furthermore, if all the related triangles of a vertex are discarded, the vertex is removed.
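A compact sketch of the triangle-constraint verification of Eqs. (9)–(13) is given below. It assumes L2-normalized descriptors (e.g., SIFT) so that the dot product in Eq. (12) acts as the descriptor similarity, and it assumes the bi-directionally matched seed points that supply the triangle vertexes have already been computed; it is not the authors' implementation.

```python
import numpy as np

def barycentric(p, a, b, c):
    # Solve Eq. (10) for K = (alpha, beta, gamma) from the triangle vertexes.
    M = np.array([[a[0], b[0], c[0]],
                  [a[1], b[1], c[1]],
                  [1.0,  1.0,  1.0]])
    return np.linalg.solve(M, np.array([p[0], p[1], 1.0]))

def transfer(K, a2, b2, c2):
    # Eq. (11): estimate P_e in the second image from the same coefficients.
    M2 = np.array([[a2[0], b2[0], c2[0]],
                   [a2[1], b2[1], c2[1]],
                   [1.0,   1.0,   1.0]])
    return (M2 @ K)[:2]

def triangle_matches(PA, PB, tri_A, tri_B, R=3.0, tau=0.4, k=0.3):
    """PA, PB: lists of (xy, descriptor) inside the two triangles.
    tri_A, tri_B: corresponding vertex triples (a, b, c) and (a', b', c').
    Returns the temporary matches if Eq. (13) holds, otherwise an empty list."""
    matches = []
    for xy, desc in PA:
        K = barycentric(xy, *tri_A)
        pe = transfer(K, *tri_B)
        best, best_j = 0.0, None
        for j, (xy2, desc2) in enumerate(PB):
            dist = np.linalg.norm(np.asarray(xy2) - pe)
            if dist <= R:
                # Eq. (12): distance-weighted descriptor similarity.
                s = 1.5 ** (-(dist / R) ** 2) * float(np.dot(desc, desc2))
                if s > best:
                    best, best_j = s, j
        if best > tau and best_j is not None:
            matches.append((xy, PB[best_j][0]))
    # Eq. (13): keep the triangle pair only if enough temporary matches survive.
    return matches if len(matches) > k * min(len(PA), len(PB)) else []
```

In practice, the candidate search within R pixels could be accelerated with a k-d tree; a linear scan is kept here only for clarity.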

4. Video similarity measurement

Let Q = {q_i | i = 1, 2, ..., n_Q} be a query sequence with n_Q frames, where q_i is the i-th query frame; and let T = {t_j | j = 1, 2, ..., n_T} be a target sequence with n_T frames, where t_j is the j-th target frame, with n_Q ≪ n_T.

Fig. 7. Recall evaluation using distinct temporal window size n_w.

Table 2
Performance comparison using precision (P) and recall (R).

Type           [19] R / P         [20] R / P         [11] R / P         Ours R / P
Brightness     0.8065 / 0.8065    0.8710 / 0.1627    1.0000 / 0.9394    0.9895 / 1.0000
Compression    0.9355 / 0.9355    0.9032 / 0.1228    0.9677 / 0.9091    0.9903 / 1.0000
Noise          0.9032 / 0.9032    0.8065 / 0.1042    1.0000 / 1.0000    0.9666 / 1.0000
Resolution     0.9032 / 0.9032    0.6129 / 0.0465    0.8065 / 0.9259    0.9702 / 1.0000
Cropping       0.3226 / 0.3226    0.7742 / 0.1127    0.9677 / 0.9375    0.9967 / 1.0000
Zoom-in        0.8387 / 0.8387    0.6774 / 0.0824    0.9355 / 0.9355    0.9926 / 1.0000
Slow motion    0.0645 / 0.0645    0.9355 / 0.1835    0.9355 / 1.0000    1.000 / 1.000
Fast forward   0.1935 / 0.1935    0.9355 / 0.0755    1.0000 / 0.8378    1.000 / 1.000


Fig. 8. Experimental results of detection rate using the triangle constraint and OMM (original matching method), respectively, for different categories from the INRIA dataset.

A sliding window is used to scan over T to search for a subsequence whose content is identical or similar to Q. Let W = {t_j, t_{j+1}, ..., t_{j+n_w−1}} be a subsequence of T extracted by a sliding window of n_w frames (n_w is set to 110 in the experiment), denoted as a window sequence. Let q_i p = {q_i p_1, q_i p_2, ..., q_i p_n | q_i p_n = (X_n, Y_n)} denote the n matched point pairs of the i-th frame in the query sequence; likewise, within the sliding window we have w_i p = {w_i p_1, w_i p_2, ..., w_i p_n | w_i p_n = (X′_n, Y′_n)}, as demonstrated in Fig. 5. The centroids q_i p_c and w_i p_c are then computed from q_i p and w_i p, respectively. The histogram q_i h = {q_i h_1, q_i h_2, ..., q_i h_n} of each frame is obtained from the distances between the centroid q_i p_c and each point in q_i p; similarly, the histogram w_i h = {w_i h_1, w_i h_2, ..., w_i h_n} is obtained from the distances between w_i p_c and each point in w_i p. Finally, the similarity between q_i h and w_i h is defined by the Jaccard coefficient as

J(q_i h, w_i h) = \frac{|q_i h \cap w_i h|}{|q_i h \cup w_i h|} = \frac{\sum_{n=1}^{N} \min(q_i h_n, w_i h_n)}{\sum_{n=1}^{N} \max(q_i h_n, w_i h_n)}.   (14)

Consequently, the similarity between the query and target video sequences is obtained by accumulating the Jaccard coefficients of each frame and is defined as

J(Q, W) = \frac{\sum_{i=1}^{I} J_i}{I},   (15)

where I is the number of matching frames and J_i denotes the Jaccard coefficient of the i-th frame. Using this measure, a video copy is regarded as detected if J(Q, W) > ρ; the threshold ρ is set to 0.9 empirically.
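A minimal sketch of the frame- and video-level similarity of Eqs. (14) and (15): for each frame, the histogram of distances from the matched points to their centroid is built, two such histograms are compared with the generalized Jaccard coefficient, and the per-frame coefficients are averaged over the window. The bin count and distance range of the histogram are assumptions; the paper does not state them.

```python
import numpy as np

def distance_histogram(points, bins=16, max_dist=200.0):
    # Histogram of distances between each matched point and the centroid (Section 4).
    pts = np.asarray(points, dtype=np.float64)
    centroid = pts.mean(axis=0)
    d = np.linalg.norm(pts - centroid, axis=1)
    hist, _ = np.histogram(d, bins=bins, range=(0.0, max_dist))
    return hist

def jaccard(qh, wh):
    # Eq. (14): generalized Jaccard coefficient of two histograms.
    denom = np.maximum(qh, wh).sum()
    return np.minimum(qh, wh).sum() / denom if denom > 0 else 0.0

def video_similarity(query_points, window_points):
    """query_points / window_points: per-frame lists of matched point coordinates.
    Returns J(Q, W) of Eq. (15), i.e., the mean per-frame Jaccard coefficient."""
    J = [jaccard(distance_histogram(q), distance_histogram(w))
         for q, w in zip(query_points, window_points) if len(q) and len(w)]
    return sum(J) / len(J) if J else 0.0

# A copy is reported when video_similarity(...) exceeds the threshold rho = 0.9.
```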

5. Experimental results

In the experiments, we use the MUSCLE-VCD-2007 dataset, the benchmark dataset of the TRECVID video copy detection task [13], as demonstrated in Fig. 6, which includes distinct video types such as human subjects, animals, plants, sports, buildings, and outdoor scenes. In total, 100 video sequences are selected and eight types of video copy attacks are applied, as presented in Table 1. All video sequences were converted to 320 × 240 pixels and resampled to 2 fps, since frames within a scene are usually near-duplicates. The threshold of Eq. (15) used to determine whether a video copy is detected is set to 0.9 for all types of video copy attacks.

5.1. Quantitative evaluations

Before comparing with benchmark approaches, the temporal window size n_w has to be chosen for optimal performance. In Fig. 7, we observe that the optimal performance, with precision up to 100%, is achieved when n_w is set to 110. For performance comparison, several state-of-the-art works [11,19,20] are selected. The quantitative evaluation is presented in Table 2. It is worth noting that, among all video attacks, our approach outperforms these approaches, particularly in precision. The good performance benefits from robust feature point matching under the geometric constraint in visual attention regions, because the similarity between source and target video sequences is evaluated within the focus of interest of general users; frame matching outside these regions could result in false alarms. To show the effect of the geometric-constraint measurement, as presented in Fig. 8, the approach with geometric constraints achieves better performance than that without them. Particularly for the viewpoint-angle attack, the two approaches differ by about 6% in detection rate, since viewpoint changes result in relatively worse feature point correspondence. With geometric constraints applied, false point matching pairs are reduced and the detection rate improves accordingly.

5.2. Time complexity analysis

Among the related works used for performance comparison, Chiu and Wang's approach [11] outperforms the others; therefore, in this section, we compare the time complexity with [11]. In [11], the cost of computing the Jaccard coefficient comprises (1) O(n_W · L) for constructing H_W by summing n histograms of L dimensions, and (2) O(L) for calculating the histogram intersection and union between Q and W; thus, the total cost is O((n_W + 1) · L). The cost of computing the min-hash similarity between Q and W becomes O(3g · lg(g · n_W) + k · lg(k) + k), where each frame maintains g min-hash values and k is the length of the min-hash signature of three-dimensional vectors. In our proposed approach, the complexity of computing visual attention is O(n_W · f_h · f_w · 2), where f_h is the frame height and f_w is the frame width. Computing feature points and geometric constraints in each frame costs O(n_W · f_h · f_w · 2 · q_ip · w_ip), where q_ip is the number of feature points in the i-th frame of Q and w_ip is the number of feature points in the i-th frame of W. Computing the histograms costs O(n_W · f_ip · 2), where f_ip is the number of matched feature points in the i-th frame, and computing the Jaccard coefficient costs O(n_W · f_ip · f_ip). The total cost is therefore O(n_W(2(f_h · f_w(1 + q_ip · w_ip) + f_ip) + f_ip^2)). In terms of the dominant term of the time complexity, ours is n_W · f_h · f_w, which is larger than the dominant term in [11].

6. Conclusion

In this paper, a robust video copy detector has been proposed. To efficiently detect video copies, the focus of interest in videos is localized based on 3D spatiotemporal visual attention modeling. Salient feature points are then detected in the visual attention regions. Prior to evaluating the similarity between source and target video sequences using feature points, a geometric-constraint measurement is employed for bi-directional point matching in order to remove noisy feature points. Video matching is thereby transformed into a frame-based time-series linear search problem. Compared to state-of-the-art works, our proposed approach achieves a promisingly high detection rate under distinct video copy attacks and thus shows its feasibility in real-world applications.

References

[1] S. Li, M.C. Lee, An efficient spatiotemporal attention model and its application to shot matching, IEEE Trans. Circuits Syst. Video Technol. 17 (10) (2007) 1383–1387.
[2] Y.F. Ma, L. Lu, H.J. Zhang, M. Li, A user attention model for video summarization, in: Proc. ACM Multimedia, Dec. 2002, pp. 533–541.
[3] Y. Zhai, M. Shah, Visual attention detection in video sequences using spatiotemporal cues, in: Proc. ACM Multimedia, Oct. 2006, pp. 815–824.
[4] I. Laptev, T. Lindeberg, Space-time interest points, in: Proc. IEEE International Conference on Computer Vision, Oct. 2003, pp. 432–439.
[5] L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. 20 (11) (1998) 1254–1259.
[6] C. Harris, M. Stephens, A combined corner and edge detector, in: Alvey Vision Conference, 1988, pp. 147–151.
[7] V. Navalpakkam, L. Itti, An integrated model of top-down and bottom-up attention for optimizing detection speed, Proc. IEEE CVPR 2 (2006) 2049–2056.


[8] L. Itti, C. Koch, Computational modeling of visual attention, Neuroscience 2 (2001) 1–11.
[9] B. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in: Proc. International Joint Conference on Artificial Intelligence, 1981, pp. 674–679.
[10] W. James, The Principles of Psychology, Harvard Univ. Press, Cambridge, Massachusetts, 1980/1981.
[11] C.Y. Chiu, H.M. Wang, Time-series linear search for video copies based on compact signature manipulation and containment relation modeling, IEEE Trans. Circuits Syst. Video Technol. (2010) 1603–1613.
[12] N.R. Pal, S.K. Pal, Entropy: a new definition and its applications, IEEE Trans. Syst. Man Cybern. 21 (5) (1993) 1260–1270.
[13] TRECVID 2010 Guidelines, http://www-nlpir.nist.gov/projects/tv2010/tv2010.html#ccd.
[14] X. Guo, X. Cao, Triangle-constraint for finding more good features, in: International Conference on Pattern Recognition, 2010, pp. 1393–1396.
[15] H. Bay, T. Tuytelaars, L. Van Gool, SURF: speeded up robust features, Lect. Notes Comput. Sci. (2006) 404–417.
[16] C. Harris, M.J. Stephens, A combined corner and edge detector, in: Alvey Vision Conference, vol. 20, 1988, pp. 147–152.
[17] D. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vision (2004) 91–110.
[18] S. Zhang, Q. Tian, G. Hua, Q. Huang, S. Li, Descriptive visual words and visual phrases for image applications, in: Proc. ACM Int'l Conf. Multimedia, Beijing, China, Oct. 19–24, 2009, pp. 75–84.
[19] T.C. Hoad, J. Zobel, Detection of video sequence using compact signatures, ACM Trans. Inform. Syst. 24 (1) (2006) 1–50.
[20] O. Chum, J. Philbin, M. Isard, A. Zisserman, Scalable near identical image and shot detection, in: Proc. ACM Int'l Conf. Image and Video Retrieval (CIVR), Amsterdam, The Netherlands, Jul. 9–11, 2007.