IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 23, NO. 10, OCTOBER 2013


Robust Background Subtraction for Network Surveillance in H.264 Streaming Video

Bhaskar Dey and Malay K. Kundu, Senior Member, IEEE

Abstract—The H.264/Advanced Video Coding (AVC) standard is the industry standard in network surveillance, offering the lowest bitrate for a given perceptual quality among MPEG and proprietary codecs. This paper presents a novel approach for background subtraction in bitstreams encoded in the Baseline profile of H.264/AVC. Temporal statistics of the proposed feature vectors, describing macroblock units in each frame, are used to select potential candidates containing moving objects. From the candidate macroblocks, foreground pixels are determined by comparing the colors of corresponding pixels pair-wise with a background model. The basic contribution of the current work compared to the related approaches is that it allows each macroblock to have a different quantization parameter, in view of the requirements of variable as well as constant bitrate applications. Additionally, a low-complexity technique for color comparison is proposed, which enables us to obtain pixel-resolution segmentation at a negligible computational cost compared to those of classical pixel-based approaches. Results showing striking comparison against those of proven state-of-the-art pixel-domain algorithms are presented over a diverse set of standardized surveillance sequences.

Index Terms—Background subtraction, compressed domain algorithm, H.264/AVC, video surveillance.

I. Introduction and Motivation

IN recent years, there has been considerable interest in the use of network surveillance over internet protocol for a wide range of indoor and outdoor applications. This is driven by the advent of technology enabling the replacement of analog closed-circuit television (CCTV) systems with network cameras, coupled with increasing private and public security concerns. However, the limitations and deficiencies, together with the costs associated with human operators in monitoring the overwhelming multitude of streaming feeds, have created urgent demands for automated video surveillance solutions.

Manuscript received September 4, 2012; revised January 9, 2013 and February 20, 2013; accepted February 26, 2013. Date of publication March 28, 2013; date of current version September 28, 2013. This paper was recommended by Associate Editor S. Takamura. This paper has supplementary downloadable material available at http://ieeexplore.ieee.org, provided by the authors. This includes five AVI format movie clips, and executable codes used to generate the results presented in the article. This material is 62 MB in size. B. Dey is with the Center for Soft Computing Research, Indian Statistical Institute, Kolkata 700108, India (e-mail: [email protected]). M. K. Kundu is with the Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCSVT.2013.2255416

Background subtraction [1]–[3] is a fundamental process used in most video-based surveillance systems. Classical pixel-based techniques for background subtraction use raw information about each pixel. However, most video transmitted over networks is an encoded bitstream, such as those produced by MJPEG, MPEG-x, and H.26x-compliant encoders. Therefore, such algorithms necessitate prior decoding of each frame, involving substantial overhead in terms of computation as well as memory space. These algorithms have decent segmentation performance, but most of them fail to meet real-time constraints due to the computation involved in processing each pixel individually in every frame. Under the circumstances, it is desirable that new background subtraction techniques be able to process encoded video, of considerably smaller size, directly for segmentation. Such techniques, the so-called compressed domain algorithms, avert significant computational effort by reusing aggregated information about blocks of pixels, already available in the coded syntax, to localize object motion in frames. This motivates the development of a background subtraction algorithm for compressed bitstreams encoded in the latest video coding standards like H.264 [4].

The H.264/AVC standard brings new possibilities in the field of video surveillance [5]. It outperforms previous coding standards toward its goals of supporting 1) a significantly lower bitrate for a given perceived level of visual quality; 2) error-robust video coding tools that can adapt to changing network conditions; and 3) low-latency capabilities. Latency is the total time it takes to encode, transmit, decode, and display the video at the destination. Interactive video applications require that latency be extremely small.

In this paper, a novel feature vector is proposed that effectively describes macroblock (MB) data in compressed video. The temporal statistics of the feature vectors are used to select a set of potential MBs that are occupied fully or partially by moving objects. From the set of coarsely localized candidate MBs, foreground pixels are detected by comparing corresponding pixel colors pair-wise with a background model. The proposed method is embedded into the decoding process and obtains pixel-level segmentation at the cost of a negligible overhead.

The subsequent sections are organized as follows. Related work is presented in Section II. Section III presents the feature vector abstraction for MB data. Section IV introduces


the proposed method. Experimental results are presented in Section V, and Section VI concludes the paper.

II. Related Work

Although there exist several techniques for moving object segmentation in the MPEG domain, algorithms that work on H.264 compressed video are relatively few. Thilak and Creusere [6] presented a system to track targets in H.264 video which relied heavily on prior knowledge of the size of the target. Zeng et al. [7] proposed an algorithm to segment moving objects from the sparse motion vector (MV) field using a block-based Markov random field (MRF). The limitation of this approach is that it is only applicable to video sequences with a stationary background. Liu et al. [8] used a complex binary partition tree to segment the normalized MV field into motion-homogeneous regions. The complexity of the algorithm increases drastically with a noisy MV field. Solana-Cipres et al. [9] incorporated fuzzy linguistic concepts that use MVs and MB decision modes. Fei and Zhu [10] introduced mean shift clustering using the normalized MV field and partitioned block size for moving object segmentation. This algorithm fails to detect objects with slow or intermittent motion. You et al. [11] proposed a probabilistic spatio-temporal macroblock filtering (PSMF) and linear motion interpolation for object tracking in P-frames. The linearity assumption holds good for group of pictures (GOP) sizes of less than ten frames and slow-moving targets.

Most related techniques that work in the H.264 compressed domain are based solely on the MV field. MVs, however, do not necessarily correspond to true object motion, and such techniques end up wrongly classifying dynamic components (e.g., waving tree branches, fountains, ripples, camera jitter) as foreground. Poppe et al. [12] proposed an alternative technique that relied on the total number of bits required to encode a given MB. This technique fails to detect slow-moving objects.

A common drawback of the related approaches is the assumption of a fixed quantization parameter (QP) for all MBs, thereby limiting them to variable bit rate (VBR) applications with uncontrolled bit rate. However, in most streaming video applications, a predetermined constant output bit rate is desired. These applications, referred to as constant bit rate (CBR) applications, ensure a target bit rate by carefully selecting a different QP for each MB. Unlike the related approaches, our algorithm allows each MB to have a different QP in consideration of the stringent bandwidth requirements of a practical surveillance scenario. Second, we perform filtering on the aggregate result of MB-level object segmentation obtained jointly from the MV field and the residual data, rather than applying separate filtering procedures on either or both of the features in isolation. This helps us to reduce computation while relying more on the consensus between the MB features. The performance of the related approaches is largely restricted to coarse MB-level segmentation. In contrast, we introduce a low-complexity technique for comparing pixel colors (YCbCr) in order to obtain pixel-resolution segmentation of sequences with a highly dynamic background.

III. Proposed Macroblock Feature Vector

A. H.264 Preliminaries

A bitstream encoded in H.264 consists of a sequence of pictures or frames, each of which is split into MB units covering a fixed area of 16 × 16 pixels. An intracoded or I-macroblock is predicted from spatially neighboring samples of previously coded blocks in the same frame. On the other hand, most MBs are predicted using references to one or more previously coded blocks of pixels in past frames or a combination of past and future frames. A bi-predictive or B-macroblock may use past and/or future frames as references, while a predictive or P-macroblock may use past frames only. These are collectively called intercoded MBs, which require each referred block to be indicated using an MV with the corresponding reference frame index. Additionally, the difference between the predicted and the target MB, i.e., the prediction error or residual, is transformed, quantized, and entropy coded. Thus, the MVs as well as the residual component, bearing complementary information about the same MB, hold the key to an effective feature vector representation.

For network surveillance, the profile most commonly used for encoding is the Baseline. It claims minimal computational resources and has a very low latency, ideal for live surveillance feeds on low-end devices. The profile is a simplified implementation of the H.264 standard based only on I- and P-frames. I-frames are coded using I-macroblocks only and require more bits to encode than any other frame type. P-frames, which are mainly coded using P-macroblocks, may occasionally contain I-macroblocks, particularly in areas of complex motion. A video sequence starts necessarily with an I-frame. It mostly consists of P-frames, with I-frames spaced at regular intervals. The number of I-frames in a coded video is negligible; hence, they are not considered in the proposed method.

B. Feature #1: Mean of Absolute Transformed Differences (MATD)

As the name indicates, MATD is the mean of the absolute values of the quantized discrete cosine transform (DCT) coefficients in a coded MB. We formulate MATD using the statistical properties of DCT coefficients. In the literature, it is found to be most appropriate and convenient to model the distribution of DCT coefficients by Laplacian distributions [13]–[15]. A Laplacian probability density function (pdf) is given by

$$p(z) = \frac{1}{2b}\exp\!\left(-\frac{|z|}{b}\right), \quad z \in \mathbb{R} \qquad (1)$$

where b > 0 is the Laplacian parameter which defines the width of the pdf and z is the value of a given DCT coefficient. In H.264 baseline encoding, an integer implementation of the 4 × 4 block DCT operates on a similar-sized block of residual or target data. The resulting block is subsequently scaled and quantized element-wise by a quantization matrix. The quantization scheme most commonly adopted is uniform quantization with a dead zone [16]. In this process, the value z of an input coefficient is, in general terms, quantized as

$$k = \left\lfloor \frac{|z| + f}{Q} \right\rfloor \operatorname{sgn}(z) \qquad (2)$$


TABLE I
Scalar Multiplier q0

mod(QP,6)    n=0      n=1      n=2
0            0.6250   0.6250   0.6423
1            0.6875   0.7031   0.6917
2            0.8125   0.7812   0.7906
3            0.8750   0.8984   0.8894
4            1.0000   0.9766   0.9882
5            1.1250   1.1328   1.1364

where $q(QP, n) = q_0\!\left(\operatorname{mod}(QP,6),\, n\right)\, 2^{\lfloor QP/6 \rfloor}$

Fig. 1. Relation between the input coefficient z and the reconstructed output zk for quantization step size Q and rounding offset f.

where k ∈ Z represents the quantization level or index that is actually transmitted by the source encoder, Q > 0 is the uniform quantization step size outside the dead-zone interval, and f ∈ [0, Q/2] is the rounding offset that controls the width of the dead zone. Typically, f is set to Q/6 for P-frames in an effort to approximate the distribution of DCT coefficients within a quantization interval. The decoder reconstructs the quantized output z_k of the input coefficient z as

$$z_k = kQ. \qquad (3)$$

Fig. 1 illustrates the relationship between z and z_k. Formally, it may be verified that the value z of any input coefficient is mapped to kQ in the quantization process, such that
1) k = 0 if z ∈ (−Q + f, Q − f);
2) k = −1, −2, −3, ... if z ∈ [(k − 1 + f/Q)Q, (k + f/Q)Q);
3) k = 1, 2, 3, ... if z ∈ [(k − f/Q)Q, (k + 1 − f/Q)Q).

Let P(z_k) be the probability that z is mapped to z_k. Using (1) and substituting for f (= Q/6) in the limiting expressions of the above intervals, we get

$$P(z_k) = \begin{cases}
\displaystyle\int_{(k-\frac{1}{6})Q}^{(k+\frac{5}{6})Q} p(z)\,dz = \exp(-2r(1+3k))\sinh 3r; & \text{if } k > 0\\[8pt]
\displaystyle\int_{-\frac{5}{6}Q}^{\frac{5}{6}Q} p(z)\,dz = 1-\exp(-5r); & \text{if } k = 0\\[8pt]
\displaystyle\int_{(k-\frac{5}{6})Q}^{(k+\frac{1}{6})Q} p(z)\,dz = \exp(-2r(1-3k))\sinh 3r; & \text{if } k < 0
\end{cases} \qquad (4)$$

where

$$r = Q/(6b). \qquad (5)$$

The H.264 standard does not specify Q directly for each coefficient separately, but rather uses a quantization parameter QP, whose relationship to Q for each 4×4 block of DCT coefficients of an MB is given by the quantization matrix

$$\mathbf{Q} = \begin{bmatrix}
q(QP,0) & q(QP,2) & q(QP,0) & q(QP,2) \\
q(QP,2) & q(QP,1) & q(QP,2) & q(QP,1) \\
q(QP,0) & q(QP,2) & q(QP,0) & q(QP,2) \\
q(QP,2) & q(QP,1) & q(QP,2) & q(QP,1)
\end{bmatrix} \qquad (6)$$

q0 being a scalar multiplier as defined in Table I [17]. It is noted in (6) that exactly 1/2 of the transformed coefficients are quantized with a step size equal to q(QP,2), and the remaining with q(QP,0) and q(QP,1) equally. Thus, r can take only three possible values, q(QP,0)/6b, q(QP,1)/6b, or q(QP,2)/6b, for given values of QP and b. Given the fact that the quantized coefficients z_k of an MB (both I and P types) are entropy coded, the lower bound on the average bit rate (bits/coefficient) may be expressed as

$$H = -\sum_{k=-\infty}^{+\infty} P(z_k)\log_2 P(z_k). \qquad (7)$$

Using (4)–(6), the above expression simplifies to

$$H(r) = \frac{\exp(-2r)\left[\,6r + \left(1-\exp(-6r)\right)\left(2r-\ln\sinh 3r\right)\right]}{\sinh 3r\,\ln 4} - \left(1-\exp(-5r)\right)\log_2\left(1-\exp(-5r)\right). \qquad (8)$$

The total bits B required in coding the DCT data of an MB may be computed as the product of H and the total number of DCT coefficients. For sequences encoded in the YCbCr 4:2:0 baseline format, an MB is represented by 256 luminance (Y), 64 red chrominance (Cr), and 64 blue chrominance (Cb) samples, giving a total of 384 samples. Therefore, we have

$$B = 384\left[\tfrac{1}{4}H\!\left(\tfrac{q(QP,0)}{6b}\right) + \tfrac{1}{4}H\!\left(\tfrac{q(QP,1)}{6b}\right) + \tfrac{1}{2}H\!\left(\tfrac{q(QP,2)}{6b}\right)\right]
= 96H\!\left(\tfrac{q(QP,0)}{6b}\right) + 96H\!\left(\tfrac{q(QP,1)}{6b}\right) + 192H\!\left(\tfrac{q(QP,2)}{6b}\right). \qquad (9)$$

Table I is used to express the quantization step sizes q(QP,1)/6b and q(QP,2)/6b as scalar multiples of q(QP,0)/6b, so that the right-hand side of (9) may be expressed in one unknown. For known values of B and QP (obtained from the MB header), we use numerical methods to evaluate q(QP,0)/6b, as it is difficult to derive an exact closed-form solution by analytical means. The statistical prediction of MATD for the current MB is formulated as the mean of absolute values of z_k

$$\mathrm{MATD} = \sum_{k=-\infty}^{\infty} |z_k|\,P(z_k) = 2\sum_{k=1}^{\infty} |z_k|\,P(z_k).$$
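A sketch of how the predicted MATD might be recovered from B and QP along the lines just described: q_step() follows Table I, H() is (8), bits() is (9), and the Laplacian parameter b is found by bisection because no closed form exists. The bracketing interval, iteration count, and series truncation are illustrative choices, not values from the paper; in practice the result would be tabulated offline, indexed by (B, QP).

```c
#include <math.h>

static const double q0_tab[6][3] = {         /* Table I scalar multipliers */
    {0.6250, 0.6250, 0.6423}, {0.6875, 0.7031, 0.6917},
    {0.8125, 0.7812, 0.7906}, {0.8750, 0.8984, 0.8894},
    {1.0000, 0.9766, 0.9882}, {1.1250, 1.1328, 1.1364}};

static double q_step(int QP, int n)          /* q(QP,n) = q0(mod(QP,6),n) * 2^(QP/6) */
{
    return q0_tab[QP % 6][n] * (double)(1 << (QP / 6));
}

static double H(double r)                    /* entropy per coefficient, eq. (8) */
{
    double t = 6*r + (1 - exp(-6*r)) * (2*r - log(sinh(3*r)));
    return exp(-2*r) * t / (sinh(3*r) * log(4.0))
           - (1 - exp(-5*r)) * log2(1 - exp(-5*r));
}

static double bits(int QP, double b)         /* eq. (9): 384 coefficients per MB */
{
    return 96*H(q_step(QP,0)/(6*b)) + 96*H(q_step(QP,1)/(6*b))
         + 192*H(q_step(QP,2)/(6*b));
}

/* Solve bits(QP, b) = B for b by bisection, then predict MATD by truncating
 * the series 2 * sum_{k>=1} kQ * P(z_k) for each of the three step sizes. */
static double predict_matd(double B, int QP)
{
    double lo = 1e-3, hi = 1e4;              /* assumed bracketing interval for b */
    for (int it = 0; it < 60; it++) {
        double mid = 0.5 * (lo + hi);
        if (bits(QP, mid) > B) hi = mid;     /* bits grow with b (entropy rises) */
        else                   lo = mid;
    }
    double b = 0.5 * (lo + hi), matd = 0.0;
    const double share[3] = {0.25, 0.25, 0.50};  /* coefficient shares of q(QP,0..2) */
    for (int n = 0; n < 3; n++) {
        double Q = q_step(QP, n), r = Q / (6*b);
        for (int k = 1; k < 256; k++)        /* truncated series, illustrative length */
            matd += share[n] * 2.0 * k * Q * exp(-2*r*(1 + 3*k)) * sinh(3*r);
    }
    return matd;
}
```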


Using (4)–(6), the above expression of predicted MATD simplifies to the form shown in (10)

$$\mathrm{MATD} = \frac{q(QP,0)\exp\!\left(-\frac{q(QP,0)}{3b}\right)}{8\sinh\frac{q(QP,0)}{2b}} + \frac{q(QP,1)\exp\!\left(-\frac{q(QP,1)}{3b}\right)}{8\sinh\frac{q(QP,1)}{2b}} + \frac{q(QP,2)\exp\!\left(-\frac{q(QP,2)}{3b}\right)}{4\sinh\frac{q(QP,2)}{2b}}. \qquad (10)$$

Substituting the values of q(QP,0)/6b and its scalar multiples q(QP,1)/6b and q(QP,2)/6b in (10), we finally obtain MATD. Using the fact that B and QP can only assume non-negative integer values from a limited range, we construct a lookup table containing precomputed values of MATD (indexed by B and QP). This enables us to bypass the entropy decoding and dequantization of individual coefficients that would otherwise be required to compute MATD directly, i.e., the actual MATD. The correlation between the actual MATD and the predicted MATD is explored in Section V.

C. Feature #2: Sum of Normalized Motion Vector Magnitudes (SNMVM)

SNMVM represents the motion-compensated information associated with the MVs of an MB. In H.264, two new coding characteristics are introduced in motion compensation: variable partition size and multiple reference frames. Each intercoded MB may be predicted using a range of block sizes. Accordingly, the MB is split into one, two, or four MB partitions using either 1) one 16×16 partition (covering the whole MB); 2) two 8×16 partitions; 3) two 16×8 partitions; or 4) four 8×8 partitions. If the partition size is chosen as 8×8, then each 8×8 block, henceforth a sub-MB, is split into one, two, or four sub-MB partitions (either one 8×8, two 4×8, two 8×4, or four 4×4 sub-MB partitions). Each partition or sub-MB partition of a P-macroblock has an MV (mvx_i, mvy_i) pointing to an area of the same size in a reference frame, which is used to predict the current (say ith) partition. Each partition in a given MB may be predicted from different reference frames. However, the sub-MB partitions within an 8×8 sub-MB share the same reference frame. Let us denote the reference index of the ith partition of an MB as f_i. A P-frame having frame number t is predicted from a list of previous frames, where reference index 0 denotes frame (t−1), reference index 1 denotes frame (t−2), and so on. In order to obtain uniformity among the partitions, we normalize each of them by f_i and assign fixed weights according to the ratio of the MB area each represents. The weight w_i of the ith partition is defined as the fraction of the partition size contributing to the total MB area, i.e., 256. Assuming a total of p partitions in the current MB, the computation process of SNMVM is described in Fig. 2. It may be noted that the value of SNMVM for an I-macroblock in a P-frame is taken as zero. The computation involved in this step amounts to four multiplications/divisions and four additions for each partition, the total number of partitions being no more than 16 for any given MB.

Fig. 2. Computation of SNMVM.
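The exact per-partition formula is given in Fig. 2 of the original paper, which did not survive extraction. The sketch below is therefore an assumption consistent with the description above: each partition contributes its area-weighted MV magnitude, normalized by the temporal distance (f_i + 1, since reference index 0 denotes the previous frame); the struct and function names are illustrative.

```c
#include <math.h>

typedef struct {
    int w, h;          /* partition size in pixels, e.g., 16x8, 8x8, 4x4      */
    int mvx, mvy;      /* motion vector components of this partition          */
    int ref_idx;       /* reference index f_i: 0 -> frame (t-1), 1 -> (t-2)   */
} Partition;

/* Assumed form: SNMVM = sum_i w_i * |mv_i| / (f_i + 1), with w_i = area/256.
 * I-macroblocks inside a P-frame get SNMVM = 0, as stated in the text. */
static double snmvm(const Partition *part, int p, int is_intra)
{
    if (is_intra) return 0.0;
    double s = 0.0;
    for (int i = 0; i < p; i++) {
        double wi  = (double)(part[i].w * part[i].h) / 256.0;
        double mag = sqrt((double)part[i].mvx * part[i].mvx +
                          (double)part[i].mvy * part[i].mvy);
        s += wi * mag / (double)(part[i].ref_idx + 1);
    }
    return s;
}
```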

D. Feature Vector Notation

Formally, let T be the total number of MBs in each frame. MBs are numerically addressed using an MB index idx ∈ {0, 1, ..., T−1} in raster-scan order, starting with zero for the MB at the top-left corner of a frame. A P-frame MB having frame number t at location idx is described by a vector

$$\vec{p}_{t,idx} = [x,\; y]^T \qquad (11)$$

where components x ≥ 0 and y ≥ 0 are the values of MATD and SNMVM, respectively, for the given MB.

E. Initialization and Incremental Update of Covariance

In the proposed approach, static and dynamic components present at different locations in a scene background are modeled with a temporally weighted local covariance $\Sigma_{idx}$, ∀idx ∈ {0, 1, ..., T−1}. Feature vectors computed for each idx are stored in T corresponding first-in first-out (FIFO) buffers, each having a predefined capacity M. A new MB vector is inserted at the rear. However, if the buffer is full, the vector at the front, being the least relevant in the temporal order of vectors in the buffer, is removed to make room for the new one. Corresponding to each position j ∈ {1 (front), 2, 3, ..., M (rear)} in a buffer containing a vector, a predefined weight $W_j = j/M$ is assigned in order of temporal relevance. Without loss of generality, let us assume that buffer[idx] is in a state in which all n vectors from the sequence $\{[x_i, y_i]^T\}_{i=1}^{n}$ have been inserted, with the nth vector currently at the rear. If n > M, this would have resulted in the removal of the previous (n−M) vectors. Let $\sigma_x^2$ and $\sigma_y^2$ denote the respective weighted variances of $\{x_i\}_{i=\max(1,n-M+1)}^{n}$ and $\{y_i\}_{i=\max(1,n-M+1)}^{n}$ currently accumulated in the buffer. Also, let $\sigma_{xy}$ denote the covariance between the same sets of values. Accordingly, the required covariance matrix $\Sigma_{idx}$ is expressed as

$$\Sigma_{idx} = \begin{bmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{bmatrix}. \qquad (12)$$

The weighted variance $\sigma_x^2$ is defined as

$$\sigma_x^2 = \frac{\sum_{i=\max(1,n-M+1)}^{n} W_{M-n+i}\,(x_i - \bar{x})^2}{\sum_{i=\max(1,n-M+1)}^{n} W_{M-n+i}} = \overline{x^2} - \bar{x}^2 \qquad (13)$$

where

$$\bar{x} = \frac{\sum_{i=\max(1,n-M+1)}^{n} W_{M-n+i}\, x_i}{\sum_{i=\max(1,n-M+1)}^{n} W_{M-n+i}} \qquad (14)$$

and

$$\overline{x^2} = \frac{\sum_{i=\max(1,n-M+1)}^{n} W_{M-n+i}\, x_i^2}{\sum_{i=\max(1,n-M+1)}^{n} W_{M-n+i}}. \qquad (15)$$

Similarly, we have

$$\sigma_y^2 = \overline{y^2} - \bar{y}^2 \qquad (16)$$

$$\sigma_{xy} = \overline{xy} - \bar{x}\,\bar{y} \qquad (17)$$

where

$$\bar{y} = \frac{\sum_{i=\max(1,n-M+1)}^{n} W_{M-n+i}\, y_i}{\sum_{i=\max(1,n-M+1)}^{n} W_{M-n+i}} \qquad (18)$$

$$\overline{y^2} = \frac{\sum_{i=\max(1,n-M+1)}^{n} W_{M-n+i}\, y_i^2}{\sum_{i=\max(1,n-M+1)}^{n} W_{M-n+i}} \qquad (19)$$

and

$$\overline{xy} = \frac{\sum_{i=\max(1,n-M+1)}^{n} W_{M-n+i}\, x_i y_i}{\sum_{i=\max(1,n-M+1)}^{n} W_{M-n+i}}. \qquad (20)$$

Direct computation of $\Sigma_{idx}$ using (13), (16), and (17) following each insertion would be computationally prohibitive and inefficient, as most of the samples in the buffer remain unaltered between subsequent insertions. Hence, recursive counterparts of (14), (15), (18), (19), and (20) are formulated in order to facilitate online update of $\Sigma_{idx}$. The recursive update of $\bar{x}$ is considered as follows. Let $[S_x]_{n-1}$ denote the sum of x-components in the buffer prior to insertion of the nth vector. If 1 < n ≤ M, the nth vector is simply appended at the rear; otherwise (n > M), the same insertion process will additionally cause the (n−M)th vector to be removed from the front. In either case, the insertion operation causes the existing weighted sum of x-components to decrease by $([S_x]_{n-1}/M)$ and increase by $x_n$. The adjusted sum is finally normalized by the sum $[W]_n$ of predefined weights to obtain $\bar{x}$. Equations (21) and (22) summarize the initialization and recursive update process of $\Sigma_{idx}$ for the nth insertion

$$\left[\begin{array}{c} S_{x^ay^b} \\[2pt] \overline{x^ay^b} \end{array}\right]_n =
\begin{cases}
\left[\, x_1^a y_1^b,\; x_1^a y_1^b \,\right]^T; & \text{if } n = 1\\[8pt]
\left[\begin{array}{c} [S_{x^ay^b}]_{n-1} + x_n^a y_n^b \\[4pt] \dfrac{[\overline{x^ay^b}]_{n-1}\,[W]_{n-1} - [S_{x^ay^b}]_{n-1}/M + x_n^a y_n^b}{[W]_{n-1} + W_{M-n+1}} \end{array}\right]; & \text{if } 1 < n \le M\\[18pt]
\left[\begin{array}{c} [S_{x^ay^b}]_{n-1} - x_{n-M}^a y_{n-M}^b + x_n^a y_n^b \\[4pt] \dfrac{[\overline{x^ay^b}]_{n-1}\,[W]_{n-1} - [S_{x^ay^b}]_{n-1}/M + x_n^a y_n^b}{[W]_{n-1}} \end{array}\right]; & \text{if } n > M
\end{cases} \qquad (21)$$

where the ordered pair (a, b) ∈ {(1,0), (2,0), (0,1), (0,2), (1,1)}. The recursive counterpart of (14) corresponds to (a, b) = (1,0); similarly, the recursive counterparts of (15), (18), (19), and (20) may be obtained from (21) by substituting for (a, b) each of the ordered pairs (2,0), (0,1), (0,2), and (1,1), respectively. The update procedures of $\bar{x}$, $\overline{x^2}$, $\bar{y}$, $\overline{y^2}$, and $\overline{xy}$ are followed by an adjustment of $[W]_n$ as

$$[W]_n = \begin{cases} 1; & \text{if } n = 1 \\ [W]_{n-1} + W_{M-n+1}; & \text{if } 1 < n \le M \\ [W]_{n-1}; & \text{if } n > M. \end{cases} \qquad (22)$$

The overall process of updating $\Sigma_{idx}$ requires no more than 14 multiplications, ten divisions, and 21 additions.
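A sketch of how the buffered moments of (12)–(22) could be maintained. The ring-buffer bookkeeping, the fixed capacity value, and the use of double precision are illustrative choices, not details given by the paper; the update logic follows the reconstructed (21)–(22) above.

```c
#include <math.h>

#define M_CAP 300                              /* buffer capacity M (assumed value) */

typedef struct { double x, y; } FeatVec;       /* [MATD, SNMVM]^T, eq. (11) */

typedef struct {
    FeatVec ring[M_CAP];   /* FIFO contents                          */
    int head, n;           /* index of the front slot; insertions so far */
    double S[5];           /* running plain sums  S_{x^a y^b}        */
    double mean[5];        /* running weighted means                 */
    double W;              /* running weight sum [W]_n               */
} CovBuffer;                                    /* use a zero-initialized instance */

static const int EA[5] = {1, 2, 0, 0, 1};       /* exponent pairs (a,b) of eq. (21) */
static const int EB[5] = {0, 0, 1, 2, 1};

static void cov_insert(CovBuffer *c, FeatVec v)
{
    int n = ++c->n;
    FeatVec old = c->ring[c->head];             /* front vector; used only when n > M */

    for (int m = 0; m < 5; m++) {
        double xn = pow(v.x, EA[m]) * pow(v.y, EB[m]);
        if (n == 1) { c->S[m] = c->mean[m] = xn; continue; }
        /* weighted sum drops by S/M (each weight slides down by 1/M)
           and gains the new sample with weight W_M = 1, eq. (21) */
        double wsum = c->mean[m] * c->W - c->S[m] / M_CAP + xn;
        if (n <= M_CAP) {
            c->S[m]   += xn;
            c->mean[m] = wsum / (c->W + (double)(M_CAP - n + 1) / M_CAP);
        } else {
            c->S[m]   += xn - pow(old.x, EA[m]) * pow(old.y, EB[m]);
            c->mean[m] = wsum / c->W;
        }
    }
    if (n == 1)           c->W = 1.0;                                  /* eq. (22) */
    else if (n <= M_CAP)  c->W += (double)(M_CAP - n + 1) / M_CAP;

    if (n <= M_CAP) {
        c->ring[n - 1] = v;                     /* append at the rear */
    } else {
        c->ring[c->head] = v;                   /* overwrite the evicted front slot */
        c->head = (c->head + 1) % M_CAP;
    }
}

/* Assemble the 2x2 covariance matrix of eq. (12) from the weighted moments. */
static void cov_matrix(const CovBuffer *c, double S2[2][2])
{
    double mx = c->mean[0], my = c->mean[2];
    S2[0][0] = c->mean[1] - mx * mx;            /* sigma_x^2, eq. (13) */
    S2[1][1] = c->mean[3] - my * my;            /* sigma_y^2, eq. (16) */
    S2[0][1] = S2[1][0] = c->mean[4] - mx * my; /* sigma_xy,  eq. (17) */
}
```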

IV. Proposed Method

A. System Overview

As highlighted in the introductory section, the proposed method operates at two levels of granularity: 1) performing a coarse MB-level segmentation of each frame by selecting a set of potential MBs that are occupied fully or partially by parts of a moving object, and 2) performing a finer pixel-level segmentation of the selected MBs by eliminating pixels that are similar to the corresponding background model.

The flowchart of the proposed method is shown in Fig. 3. In order to compute the temporal covariance of each MB in a frame, T identical FIFO buffers (indexed by idx) are used. The MB-level binary segmentations of the current and the previous frame are stored in arrays τ1[0 : T−1] and τ0[0 : T−1], respectively. Before parsing a new frame, the contents of τ1 are copied to τ0 and τ1 is initialized to zero (block A). As usual, frame numbers are denoted by t. The parameters B, QP, and the MVs, which are parsed from the current MB header (block B), are used to compute its feature vector representation as described in Section III (block C). Using the first N (N < M, say 100) frames, a median background frame is initialized (Section IV-D). At the same time, feature vectors ∀idx ∈ {0, 1, ..., T−1} are inserted (block H) into the corresponding buffers, i.e., buffer[idx]. The temporally weighted mean $\vec{\mu}_{idx}$ and covariance $\Sigma_{idx}$ are computed online (block I) using (21) and (22). The buffered feature vectors for a given MB are characteristic of the locally varying background model. If the total number of vectors in a buffer at a given idx reaches N (block D), any incoming vector $\vec{p}_{t,idx}$ is considered for further insertion based on a criterion set as

$$\tau_1[idx] = \begin{cases} 1\ \text{(foreground candidate)}; & \text{if } \sqrt{MD/T} > th_1 \\ 0\ \text{(background)}; & \text{otherwise} \end{cases} \qquad (23)$$

where $MD = (\vec{p}_{t,idx} - \vec{\mu}_{idx})^T\, \Sigma_{idx}^{-1}\, (\vec{p}_{t,idx} - \vec{\mu}_{idx})$ is the Mahalanobis distance and th_1 = 0.225 is a predetermined threshold. If τ1[idx] = 0, $\vec{p}_{t,idx}$ is inserted into buffer[idx]. Otherwise, the current MB is selected as one of the probable candidates expected to contain part(s) of a foreground object. Should the buffer already be full (block G), the vector at the front is removed (block F) prior to the insertion of $\vec{p}_{t,idx}$ (block H) at the rear.
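A sketch of the MB-level test of (23). The √(MD/T) form follows the threshold range quoted later in Section IV-D; the handling of a near-singular covariance and all names are illustrative assumptions.

```c
#include <math.h>

/* Returns tau_1[idx]: 1 = foreground candidate, 0 = background.
 * p  : current feature vector [MATD, SNMVM]
 * mu : temporally weighted mean, S : 2x2 covariance of eq. (12)
 * T  : number of MBs per frame, th1 = 0.225 per the text */
static int classify_mb(const double p[2], const double mu[2],
                       const double S[2][2], int T, double th1)
{
    double det = S[0][0] * S[1][1] - S[0][1] * S[1][0];
    if (fabs(det) < 1e-12) return 0;           /* degenerate model: treat as background */

    /* inverse of the 2x2 covariance matrix */
    double inv00 =  S[1][1] / det;
    double inv11 =  S[0][0] / det;
    double inv01 = -S[0][1] / det;

    double dx = p[0] - mu[0], dy = p[1] - mu[1];
    double MD = inv00 * dx * dx + 2.0 * inv01 * dx * dy + inv11 * dy * dy;

    return sqrt(MD / (double)T) > th1;         /* eq. (23) */
}
```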


The coarse MB-level segmentation so obtained is filtered (block J) using the procedure described in Section IV-B. Pixels constituting the set of filtered MBs are further used to obtain precise object segmentation (block K), details of which are presented in Section IV-D.

Fig. 4 shows the scatter plots of the buffered feature vectors corresponding to different MB locations (highlighted in white) at idx = 330 and idx = 591 for frame t = 1896 of the Fall sequence. The location at 591 depicts stationary background. As shown in the corresponding scatter diagram, this is modeled by a highly dense cluster with a low $|\Sigma_{591}|$. The same sequence also portrays a highly dynamic component (waving tree branches) in the background at location 330, which is characterized by a sparsely distributed scatter with a higher value of $|\Sigma_{330}|$.

Fig. 3. Flowchart of the proposed method.

Fig. 4. Modeling background variance in the proposed feature space.

B. Multistage Filtering for Noisy Macroblocks

In (23), the MB-level binary decision τ1[idx], ∀idx ∈ {0, 1, ..., T−1}, may erroneously misclassify some MBs as foreground that do not correspond to true object motion (false positives) and vice versa (false negatives). Such noisy MBs contribute to flickering speckles in the final segmentation result. In order to reduce misclassification and suppress such false positives, we use a multistage filter bank in which the output of a 3×3×2 spatio-temporal median filter (STMF) is cascaded into a 3×3 Gaussian filter. For frames of width u MBs and height v MBs, as shown in Fig. 5, the spatio-temporal support for the candidate MB at idx in frame t is highlighted with shaded blocks. The output of the STMF is computed as the median of τ0[idx], τ1[idx], τ1[idx − 1], τ1[idx + 1], τ1[idx − u], and τ1[idx + u]. The STMF often erroneously removes MBs containing the boundaries of slow-moving objects. In addition, the output of the STMF may occasionally contain small holes or gaps in segmented regions that correspond to very large moving objects (comprising half of the entire frame or more). Consequently, a Gaussian filter is applied on the STMF output to obtain a smooth segmentation mask at the MB level. It may be noted that an optimized algorithm for computing the above median (of six binary values) would require no more than nine comparisons. The symmetric Gaussian filter additionally requires three multiplications and eight additions.
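A sketch of the two filtering stages. The paper specifies only the filter sizes, the six-sample median, and the operation counts; the tie-breaking rule for an even number of binary samples, the treatment of out-of-frame neighbors, and the separable binomial kernel used for the Gaussian stage are assumptions.

```c
/* Median of the six binary labels tau0[idx], tau1[idx] and the four spatial
 * neighbors of idx in a u-by-v MB grid.  Out-of-frame neighbors are treated
 * as background and ties are resolved toward foreground (assumptions). */
static int stmf_median6(const unsigned char *tau0, const unsigned char *tau1,
                        int idx, int u, int v)
{
    int T = u * v;
    int sum = tau0[idx] + tau1[idx];
    sum += (idx % u > 0)     ? tau1[idx - 1] : 0;   /* left  */
    sum += (idx % u < u - 1) ? tau1[idx + 1] : 0;   /* right */
    sum += (idx >= u)        ? tau1[idx - u] : 0;   /* above */
    sum += (idx + u < T)     ? tau1[idx + u] : 0;   /* below */
    return sum >= 3;
}

/* 3x3 Gaussian smoothing of the MB-level mask, assumed separable kernel
 * [0.25 0.5 0.25]; borders are handled by simply skipping outside samples. */
static void gaussian3x3(const float *in, float *out, int u, int v)
{
    static const float k[3] = {0.25f, 0.5f, 0.25f};
    for (int y = 0; y < v; y++)
        for (int x = 0; x < u; x++) {
            float acc = 0.f;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    int yy = y + dy, xx = x + dx;
                    if (yy < 0 || yy >= v || xx < 0 || xx >= u) continue;
                    acc += k[dy + 1] * k[dx + 1] * in[yy * u + xx];
                }
            out[y * u + x] = acc;
        }
}
```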


Fig. 5. Illustration of the 3×3×2 support for the STMF (enlarged insets).

Fig. 6. Scatterplot of actual MATD versus predicted MATD (with regression lines by the method of least squares). (a) VBR with fixed QP = 25. (b) CBR at 1024 kb/s.

Fig. 7. Each row shows one example from each category. The categories are, from top to bottom: baseline (highway), camera jitter (boulevard), dynamic background (fountain01), intermittent object motion (sofa), shadow (cubicle), and thermal (library).

C. Proposed Color Comparison Technique

In the proposed method, a low-complexity color comparison technique is used to determine whether two given colors are similar or different. Considering a given pair of pixels with YCbCr color coordinates $X_F \equiv (Y_F, Cb_F, Cr_F)$ and $X_B \equiv (Y_B, Cb_B, Cr_B)$, respectively from the current and the background frame, let the luminance differential be $\Delta_Y = |Y_F - Y_B|$ and the chrominance differential be $\Delta_C = |Cb_F - Cb_B| + |Cr_F - Cr_B|$. Decision thresholds $t_Y = C_1 + C_2|\Sigma_{idx}|$ for luminance and $t_C = C_3 + C_4|\Sigma_{idx}|$ for chrominance are used, which scale linearly with $|\Sigma_{idx}|$ for a given MB. Constants $C_1 = 18$, $C_2 = 81$, $C_3 = 0.5$, and $C_4 = 4.2$ are empirically set for dynamic background sequences. $X_F$ and $X_B$ are considered different, and the pixel corresponding to $X_F$ is classified as foreground, if $\Delta_Y > t_Y$ and $\Delta_C > t_C$. This is in contrast to most existing techniques, where the similarity of pixels is determined based on the Euclidean (straight-line) distance between a given pair of points in RGB color space. It is obvious that the proposed color comparison involves fewer computations (additions and subtractions only) than Euclidean distance based techniques. The method benefits from the inherent advantage of the YCbCr color space, i.e., the decoupling of luminance (brightness) and chrominance (color) signals, which are perceived independently by the human visual system. As brightness is separated from color, the space is affected less by visible shadows.
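A sketch of the color test just described, with $|\Sigma_{idx}|$ read as the determinant of the MB's covariance matrix and the constants taken from the text; the function name and argument layout are illustrative.

```c
#include <stdlib.h>

/* Returns 1 if the current pixel (yF,cbF,crF) differs from the background
 * model pixel (yB,cbB,crB); detS is |Sigma_idx| of the containing MB. */
static int pixel_is_foreground(unsigned char yF, unsigned char cbF, unsigned char crF,
                               unsigned char yB, unsigned char cbB, unsigned char crB,
                               double detS)
{
    const double C1 = 18.0, C2 = 81.0, C3 = 0.5, C4 = 4.2;

    int dY = abs((int)yF - (int)yB);                          /* luminance differential  */
    int dC = abs((int)cbF - (int)cbB) + abs((int)crF - (int)crB); /* chrominance differential */

    double tY = C1 + C2 * detS;     /* thresholds scale linearly with |Sigma_idx| */
    double tC = C3 + C4 * detS;

    return (dY > tY) && (dC > tC);
}
```

Only absolute differences, additions, and comparisons are involved, which is the source of the cost advantage over Euclidean-distance tests noted above.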

D. Background Model Initialization and Update Process

In order to obtain a foreground-background classification of the pixels of an input frame, a background model is initialized and continually updated to incorporate gradual changes in the scene. For initialization of the background model, the first N frames of the input sequence are divided into ten equal-sized groups, from each of which one frame is randomly picked (sub-sampled) for the computation of a temporal median frame. This frame is used as the initial background model corresponding to frame t = N+1. For subsequent frames (i.e., t > N), pixels constituting the set of (filtered) candidate MBs in the current frame are compared with the corresponding pixels in the background model. If the pixels are found to be different (as discussed in Section IV-C), the corresponding pixel in the current frame is labeled as foreground. The remaining pixels in the frame constitute the background. This produces a precise pixel-level segmentation of the current frame. If the number of pixels labeled as foreground exceeds ∼20% of the total pixels in a given MB, the pixel-level comparison process is repeated for the colocated MB in the following frame. This typically enforces interframe continuity of the object segmentation masks.
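A sketch of the background initialization just described, for a single (luma) plane. The random draw of one frame per group and the median over ten samples follow the text; frame storage, the sorting-based median, and all names are illustrative.

```c
#include <stdlib.h>

static int cmp_u8(const void *a, const void *b)
{
    return (int)*(const unsigned char *)a - (int)*(const unsigned char *)b;
}

/* frames: array of N pointers to width*height luma planes; bg: output plane */
static void init_background(unsigned char **frames, int N,
                            int width, int height, unsigned char *bg)
{
    const int G = 10;                          /* ten equal-sized groups, per the text */
    int pick[10];
    for (int g = 0; g < G; g++) {              /* one randomly picked frame per group */
        int lo = g * N / G, hi = (g + 1) * N / G;
        pick[g] = lo + rand() % (hi - lo > 0 ? hi - lo : 1);
    }
    for (int p = 0; p < width * height; p++) { /* per-pixel temporal median of 10 samples */
        unsigned char v[10];
        for (int g = 0; g < G; g++) v[g] = frames[pick[g]][p];
        qsort(v, G, 1, cmp_u8);
        bg[p] = (unsigned char)((v[4] + v[5] + 1) / 2);
    }
}
```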


MBs containing parts of slow-moving objects often go undetected. All MBs in a frame for which τ1[idx] = 0 and $\sqrt{MD/T} \in (0.02, 0.12)$ represent the background. Pixels constituting such MBs are used to update the corresponding pixels $B_t(x, y)$ in the existing background using (24). The number of such MBs in a given frame, say β, is practically very small.

$$B_{t+1}(x, y) = \alpha I_t(x, y) + (1 - \alpha)\,B_t(x, y) \quad \text{for } t > N \qquad (24)$$

where $I_t(x, y)$ is the pixel's intensity value for frame t and α = 0.08 is a predefined learning rate that determines the tradeoff between stability and quick update.
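A sketch of the running-average update of (24) applied to one 16×16 MB, with α = 0.08 as given in the text. Keeping the background samples in float (to avoid rounding drift) is an implementation choice, not something the paper specifies.

```c
/* cur: current decoded plane; bg: background plane of the same geometry;
 * (mb_x, mb_y): MB coordinates in MB units; stride: plane width in pixels. */
static void update_background_mb(const unsigned char *cur, float *bg,
                                 int stride, int mb_x, int mb_y)
{
    const float alpha = 0.08f;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++) {
            int p = (mb_y * 16 + y) * stride + (mb_x * 16 + x);
            bg[p] = alpha * cur[p] + (1.0f - alpha) * bg[p];   /* eq. (24) */
        }
}
```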

V. Experimental Results and Discussion

Following the discussion in Section III-B, we provide a statistical comparison of the actual MATD against the predicted MATD values of all P-frame MBs selected at regular intervals from the Traffic sequence. The sequence was encoded in VBR as well as CBR mode, as reported in Fig. 6. The values of actual MATD are found to be greater than the corresponding predicted MATD values, owing to the fact that the latter is modeled using the entropy criterion, which is the theoretical lower bound on the average bitrate. It is observed that the actual MATD and the predicted MATD values are very highly correlated (correlation coefficient ρ > 0.98). In (23), we used the Mahalanobis distance, which is invariant under arbitrary nonsingular linear transformations of the form shown in Fig. 6 (with regression equations). Thus, the predicted MATD qualifies as a convenient surrogate for the actual MATD insofar as the discriminative aspect of MATD is concerned.

The proposed algorithm was implemented in C and integrated into the H.264/AVC MB decoding module of Libavcodec, an open source audio/video codec library developed as part of the FFmpeg [18] project. We evaluate our approach on the entire benchmark dataset provided for the Change Detection Challenge 2012 [19]. As the proposed method uses pixel information to achieve pixel-level accuracy in the final stage of segmentation, we consider it fair and obvious to compare our results with those of proven state-of-the-art (SoA) pixel-based approaches [20]–[24]. All 31 sequences of the dataset were encoded in: 1) VBR with a fixed QP = 25 for all MBs; and 2) CBR with a target bit rate of 1024 kb/s. The encoder configuration was set as follows: Baseline profile with YCbCr 4:2:0 (progressive format) 8-bit chroma subsampling, GOP size varying in [1, 250], rate-distortion optimization enabled, and the range of MVs (using hexagon-based search) was [−16…16] × [−16…16] for three reference frames. The rate of decoding frames was fixed at 25 frames per second (fps).

The background subtraction masks of the proposed method for a few selected frames are shown in Fig. 7 to enable qualitative evaluation against the specified ground-truth masks. For quantitative evaluation, a set of seven evaluation metrics defined in [19], namely recall, specificity, false-positive rate (FPR), false-negative rate (FNR), percentage of wrong classification (PWC), F-measure, and precision, has been used together with the average processing speed. An exhaustive comparison of the proposed method with those of [20]–[24] (applied on the original input frames prior to encoding, with the default parameters defined in each work) is summarized in Table II. Subscripts indicate the rank of the corresponding figures in the indicated evaluation category. It is noticed that rankings based on FPR = (1 − specificity) and FNR = (1 − recall) are identical to those based on specificity and recall, respectively; hence, all evaluation metrics except FPR and FNR were given equal weightage for the computation of the average rank.

TABLE II
Quantitative Evaluation (Results of All Categories Combined)

Metric                Proposed (CBR)  Proposed (VBR)  SC-SOBS [20]  PBAS [24]  ViBe [21]  GMM [22]  KDE [23]
Average Recall             0.58995        0.59534        0.80161      0.55046    0.69603    0.50727   0.74422
Average Specificity        0.99253        0.99174        0.98316      0.99561    0.98555    0.99472   0.97577
Average FPR                0.00753        0.00834        0.01696      0.00441    0.01455    0.00532   0.02437
Average FNR                0.41015        0.40474        0.19841      0.44966    0.30403    0.49287   0.25582
Average PWC                2.50442        2.52633        2.40881      2.83855    2.54484    3.10516   3.46027
Average F-Measure          0.64845        0.65263        0.72831      0.60756    0.64854    0.59047   0.67192
Average Precision          0.79863        0.78844        0.73165      0.84081    0.69506    0.82282   0.68437
Average speed (fps)        443.11         401.72         7.37         9.25       121.23     30.14     8.86
Average Rank               3.1667         3.3333         3.5000       4.0000     4.1667     4.6667    5.1667

Prior to empirical evaluation, it is important to realize that every codec can deliver a varying degree of output video quality for a given set of input frames. Any degradation of visual data introduced by "lossy" compression will inevitably remain visible through any further processing of the content. Notwithstanding the encoding options, which considerably affect the performance of a compressed-domain algorithm, the proposed method delivers better overall performance even when pitted against SoA pixel-based techniques.

Background segmentation is but one component of a potentially complex computer vision system. Therefore, in addition to being accurate, a successful technique must consume as few CPU cycles and as little memory as possible. An algorithm that segments perfectly but is computationally expensive is useless, because insufficient processing resources will remain to do anything useful with its results in real time. The most notable aspect, in this regard, is the comparison of average processing speeds in Table II. The computing speeds were recorded for videos of 720 × 420 resolution on a personal computer powered by an Intel Core i7-2600 3.40 GHz CPU with 16 GB of RAM (no dedicated hardware or graphics processing unit was used). It is evident that the proposed method runs significantly faster than any of the reported SoA techniques. The computation per MB involved in each step of the proposed method (described in Section III) is bounded by a constant. Consequently, the complexity of the overall process is


$$O\!\left(T + c_1\beta + c_2\sum_{idx=0}^{T-1}\tau_1[idx]\right)$$

where $\sum_{idx=0}^{T-1}\tau_1[idx]$ denotes the number of candidate MBs that require pixel-level processing, T is the total number of MBs per frame, and $c_1$, $c_2$ are constants. Arguably, the running time scales linearly with T, incurring only a negligible overhead $\kappa = c_1\beta + c_2\sum_{idx=0}^{T-1}\tau_1[idx] \le T$ in addition to the regular decoding cost claimed by each frame.

VI. Conclusion

We introduced a novel approach for background subtraction on videos encoded in the Baseline profile of H.264/AVC. The proposed method is aptly built for real-time network streaming applications in consideration of variable/constant bit rate options under practical bandwidth constraints. It also proved to be robust to a diverse set of real-world (nonsynthetic) surveillance sequences.

Acknowledgment

The authors would like to thank the anonymous reviewers and the associate editor for their valuable comments that significantly improved the quality of this paper.

References

[1] Y. Benezeth, P.-M. Jodoin, B. Emile, H. Laurent, and C. Rosenberger, "Comparative study of background subtraction algorithms," J. Electron. Imaging, vol. 19, no. 3, pp. 1–12, Jul. 2010.
[2] S. Elhabian, K. El-Sayed, and S. Ahmed, "Moving object detection in spatial domain using background removal techniques—State-of-art," Recent Patents Comput. Sci., vol. 1, pp. 32–54, Jan. 2008.
[3] T. Bouwmans, F. El Baf, and B. Vachon, "Statistical background modeling for foreground detection: A survey," in Handbook of Pattern Recognition and Computer Vision, vol. 4. Singapore: World Scientific, 2010, ch. 3, pp. 181–199.
[4] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560–576, Jul. 2003.
[5] "H.264 video compression standard—New possibilities within video surveillance," White Paper, Axis Communications Inc., Mar. 2008.
[6] V. Thilak and C. D. Creusere, "Tracking of extended size targets in H.264 compressed video using the probabilistic data association filter," in Proc. EUSIPCO, 2004, pp. 281–284.
[7] W. Zeng, J. Du, W. Gao, and Q. M. Huang, "Robust moving object segmentation on H.264/AVC compressed video using the block-based MRF model," Real-Time Imaging, vol. 11, no. 4, pp. 290–299, Aug. 2005.
[8] Z. Liu, Z. Zhang, and L. Shen, "Moving object segmentation in the H.264 compressed domain," Opt. Eng., vol. 46, no. 1, p. 017003, Jan. 2007.
[9] C. Solana-Cipres, G. Fernandez-Escribano, L. Rodriguez-Benitez, J. Moreno-Garcia, and L. Jimenez-Linares, "Real-time moving object segmentation in H.264 compressed domain based on approximate reasoning," Int. J. Approx. Reasoning, vol. 51, pp. 99–114, Sep. 2009.
[10] W. Fei and S. Zhu, "Mean shift clustering-based moving object segmentation in the H.264 compressed domain," IET Image Process., vol. 4, no. 1, pp. 11–18, Feb. 2010.
[11] W. You, M. S. H. Sabirin, and M. Kim, "Moving object tracking in H.264/AVC bitstream," in Proc. MCAM, vol. 4577, 2007, pp. 483–492.
[12] C. Poppe, S. D. Bruyne, T. Paridaens, P. Lambert, and R. V. D. Walle, "Moving object detection in the H.264/AVC compressed domain for video surveillance applications," J. Vis. Commun. Image Representation, vol. 20, pp. 428–437, May 2009.
[13] S. R. Smoot and L. A. Rowe, "Study of DCT coefficient distributions," in Proc. SPIE Symp. Electron. Imaging, 1996, pp. 403–411.


[14] E. Y. Lam and J. W. Goodman, "A mathematical analysis of the DCT coefficient distributions for images," IEEE Trans. Image Process., vol. 9, no. 10, pp. 1661–1666, Oct. 2000.
[15] W. Wu and B. Song, "DC coefficient distributions for P-frames in H.264/AVC," ETRI J., vol. 33, no. 5, pp. 814–817, Oct. 2011.
[16] G. J. Sullivan and S. Sun, "On dead-zone plus uniform threshold scalar quantization," in Proc. SPIE Vis. Commun. Image Process., vol. 5960, no. 2, Jul. 2005, pp. 1041–1052.
[17] I. E. Richardson, The H.264 Advanced Video Compression Standard, 2nd ed. London, U.K.: Wiley, 2010, p. 191.
[18] F. Bellard. (2002, Apr. 26). FFmpeg [Online]. Available: http://ffmpeg.org/
[19] N. Goyette, P.-M. Jodoin, F. Porikli, J. Konrad, and P. Ishwar, "Changedetection.net: A new change detection benchmark dataset," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2012, pp. 1–8.
[20] L. Maddalena and A. Petrosino, "The SOBS algorithm: What are the limits?," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2012, pp. 21–26.
[21] O. Barnich and M. Van Droogenbroeck, "ViBe: A universal background subtraction algorithm for video sequences," IEEE Trans. Image Process., vol. 20, no. 6, pp. 1709–1724, Jun. 2011.
[22] P. KaewTraKulPong and R. Bowden, "An improved adaptive background mixture model for real-time tracking with shadow detection," in Proc. 2nd Eur. Workshop Adv. Video-Based Surveillance Syst., 2001, pp. 149–158.
[23] A. Elgammal, D. Harwood, and L. S. Davis, "Non-parametric model for background subtraction," in Proc. 6th Eur. Conf. Comput. Vis., 2000, pp. 751–767.
[24] M. Hofmann, P. Tiefenbacher, and G. Rigoll, "Background segmentation with feedback: The pixel-based adaptive segmenter," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2012, pp. 38–43.

Bhaskar Dey received the B.Tech. degree in information technology from the University of Kalyani, Kalyani, India, in 2007, and the M.Tech. degree in the same field from the University of Calcutta, Kolkata, India, in 2009. He is currently pursuing the Ph.D. degree at the Center for Soft Computing Research, Indian Statistical Institute, Kolkata, India. His current research interests include compressed domain video and image analysis, machine vision, and pattern recognition.

Malay K. Kundu (M'90–SM'99) received the B.Tech., M.Tech., and Ph.D. (Tech.) degrees in radio physics and electronics from the University of Calcutta, Kolkata, India. In 1982, he joined the Indian Statistical Institute (ISI), Kolkata, India, as a Faculty Member, where he is currently a Professor with the Machine Intelligence Unit. He was the Head of the Machine Intelligence Unit from 1993 to 1995 and, from 2004 to 2006, the Professor In-charge (Chairman) of the Computer and Communication Sciences Division, ISI. He is the Co-Principal Investigator and Acting In-Charge of the Center for Soft Computing Research: A National Facility (funded by the Department of Science and Technology, Government of India) at ISI, Kolkata. He has contributed three book volumes and about 140 research papers in well known and prestigious archival journals, international refereed conferences, and edited monograph volumes. He holds nine U.S. patents, two international patents, and two E.U. patents. His current research interests include image processing and analysis, soft computing, content-based image retrieval, digital watermarking, wavelets, genetic algorithms, machine vision, fractals, and very large scale integration design for digital imaging. Dr. Kundu was a recipient of the Sir J. C. Bose Memorial Award of the Institute of Electronics and Telecommunication Engineers, India, in 1986, and the prestigious VASVIK Award for industrial research in the field of electronic sciences and technology in 1999. He is a Fellow of the International Association for Pattern Recognition, USA, the Indian National Academy of Engineering, the National Academy of Sciences, and the Institute of Electronics and Telecommunication Engineers, India. He is a Founding Life Member and Vice President of the Indian Unit for Pattern Recognition and Artificial Intelligence, the Indian wing of the International Association for Pattern Recognition.