MORTAL: Multiple Objects Realtime Tracking And Learning

Praveen Palanisamy
Masters Student at Carnegie Mellon University, Pittsburgh, USA.
Email: [email protected]

ABSTRACT

This paper proposes a real-time system that can track and learn multiple independent objects. The Multiple Objects Realtime Tracking And Learning (MORTAL) system uses shared-memory multiprocessing to parallelize object tracking. MORTAL takes less than 40 milliseconds to track up to 3 arbitrary objects simultaneously at a resolution of 320x240 on Intel's dual-core Ivy Bridge processor. It achieves over 11 frames per second while simultaneously tracking four independent objects of various sizes, shapes and appearances. The results show the ability of the proposed system to track multiple independent objects efficiently at different scales, poses and lighting conditions, under occlusions, and when the objects of interest move at a fast pace.

Keywords: Multiple Object Tracking, Online Tracking.
Mathematics Subject Classification: 93C85, 68T40, 68T45
Computing Classification System: I.4

1. INTRODUCTION

Object tracking can be defined as the process of estimating the trajectory of an object in the image plane as it moves. In the last few years, significant progress has been made in object tracking. Several robust, real-time trackers have been developed, such as MTT (Zhang 2012), LOT (Oron 2012), TLD (Kalal 2011), a Bayesian particle-filter method for tracking objects in dynamic scenes (Zhang 2013), and object tracking via color optical flow (Sidram & Bhajantri 2014). The survey on object tracking (Yilmaz, Javed & Shah 2006) reveals that the assumptions used to make the tracking problem tractable, for example, smoothness of motion, minimal amount of occlusion, illumination constancy, high contrast with respect to the background, etc., are violated in many realistic scenarios and therefore limit a tracker's usefulness in applications like automated surveillance, human-computer interaction, video retrieval, traffic monitoring, and vehicle navigation. Though recent state-of-the-art object tracking algorithms like Real-time Compressive Tracking (Zhang, Zhang & Yang 2012) and SCM (Zhong, Lu & Yang 2012) achieve real-time performance, they cannot be extended to track multiple independent objects. One of the main problems encountered in multiple object tracking is the low frame rate achieved. In multiple object localization at over 100 fps (Taylor & Drummond 2009), no frame-to-frame tracking is performed; the objects are only localised.

The proposed MORTAL system aims at tracking two or more objects in real time. It focuses on methodologies for tracking any object in general and is not confined to tracking a particular class of objects such as faces or persons. The survey on object tracking (Yilmaz, Javed & Shah 2006) points out that tracking-based algorithms typically assume that the motion of the object is smooth and fail if the motion is abrupt or if the object moves out of view. Detection-based algorithms assume that an object is known in advance and require a training stage. Support vector tracking (Avidan 2001) and Bayesian networks (Park & Agarwal 2004) proposed multi-view appearance models. They train a set of classifiers to learn the different views of an object. One limitation of multi-view appearance models is that the appearances in all views are required ahead of time. The proposed system aims at tracking objects whose motion need not be smooth and which may go out of view of the capturing device. It does not need an offline training stage; it tracks and learns objects online.

The rest of the paper is organised as follows. Section 2 describes the MORTAL system in detail. It starts by describing the system initialization in 2.1 and moves on to detail the internal representations in 2.2. Tracking, modelling, detection and learning are explained in 2.3, 2.4, 2.5 and 2.6 respectively. The tracking results and how the proposed system tackles various challenges are presented in section 3.

2. DESCRIPTION OF MORTAL

The object(s) of interest, i.e. the object(s) to be tracked, are defined by the user using a bounding box. The first object to be tracked is defined in the first frame, and the other object(s) to be tracked can be specified either in the first frame or in any subsequent frame, as and when the objects appear.

2.1 Initialization Phase

When a new object is specified to the MORTAL system for tracking, it spawns a new thread which runs concurrently with the already spawned threads, including the master thread. Since the maximum number of threads that can run on a computer is bounded only by the available memory (roughly, the available virtual address space divided by the per-thread stack size), there is theoretically no limit on the number of objects that can be tracked by the MORTAL system. For each new object, an object model is initialized using the user-selected patch. The tracker (section 2.3) is initialized with the initial state of the object. An initial object detector is obtained by training the object detector (section 2.6) to detect the appearance represented in the object model.
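The per-object threading scheme can be sketched as follows. This is a minimal illustration, not the paper's implementation: the paper uses OpenMP, while this sketch uses `std::thread`, and all type and function names (`TrackedObject`, `track_object`, `run_trackers`) are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical per-object state; the real MORTAL model holds patches,
// classifiers and a bounding box (sections 2.2-2.6).
struct TrackedObject {
    int id = 0;
    int frames_processed = 0;
};

// Hypothetical worker: in the real system this loop would run the tracker
// and detector for one object on every incoming frame.
void track_object(TrackedObject& obj, int num_frames) {
    for (int f = 0; f < num_frames; ++f)
        ++obj.frames_processed;
}

// Spawn one concurrent tracking thread per object, mirroring the
// one-thread-per-object design described above.
int run_trackers(std::size_t num_objects, int num_frames) {
    std::vector<TrackedObject> objects(num_objects);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < num_objects; ++i) {
        objects[i].id = static_cast<int>(i);
        workers.emplace_back(track_object, std::ref(objects[i]), num_frames);
    }
    for (auto& w : workers) w.join();

    int total = 0;
    for (const auto& o : objects) total += o.frames_processed;
    return total;  // total frames processed across all objects
}
```

Each object owns its state, so the worker threads need no locking; only the final aggregation touches all objects after the joins.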

After the initialization phase, the tracker estimates the motion of each object in parallel, based on the previous state of the objects, and yields a result for each object. The detector (section 2.5), which runs in parallel with the tracker for each object, returns a number of hypotheses for the location of each object. The new state of the objects is deduced in the decision-making phase (section 2.7). The MORTAL system learns about the objects by collectively analyzing the outputs of the detector, the tracker and the new state of each object. The error is estimated and the detector is updated to prevent the occurrence of these errors in upcoming frames.

2.2 Representation of the Objects

2.2.1 State

The state of an object is represented by a rectangle enclosing the object (bounding box). The bounding boxes are labelled at the bottom-right corner with the object IDs. The aspect ratio of a bounding box remains fixed at the value set when it is initialized by the user. The scale and location (x, y coordinates) parameters define the bounding boxes.
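A minimal sketch of this state representation, with illustrative field names (the paper does not specify its data layout): location and scale are the free parameters, while the user-initialized size fixes the aspect ratio.

```cpp
// Sketch of the state in section 2.2.1: a bounding box parameterized by
// location (x, y) and scale, with the aspect ratio fixed at initialization.
struct BoundingBox {
    double x = 0, y = 0;       // top-left corner in image coordinates
    double scale = 1.0;        // relative to the user-initialized box
    double init_w = 0;         // user-initialized width
    double init_h = 0;         // user-initialized height (init_w/init_h fixes the aspect ratio)

    double width()  const { return init_w * scale; }
    double height() const { return init_h * scale; }
    double aspect() const { return init_w / init_h; }  // invariant under scaling
};
```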

2.2.2 Appearance

Each of the objects at any instant is represented by an image patch P[i], where i is the ID of the object. The patches are sampled from the bounding boxes of the respective objects and then re-sampled to a normalized patch resolution of 15 × 15 pixels. The similarity criterion between two patches of a particular object is defined as:

S(P_i, P_j) = 0.5 (NCC(P_i, P_j) + 1)    (2)

where NCC(P_i, P_j) is the Normalized Correlation Coefficient of the two patches of size 15 × 15.
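The similarity criterion of equation (2) can be sketched in C++ as follows, with patches stored as flat vectors of grayscale values (an illustrative representation; function names are not from the paper). The NCC maps to [-1, 1], so the affine transform yields a similarity in [0, 1].

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Normalized Correlation Coefficient of two equal-sized patches stored as
// flat vectors of gray values. Both patches must have nonzero variance.
double ncc(const std::vector<double>& a, const std::vector<double>& b) {
    const std::size_t n = a.size();
    double ma = 0, mb = 0;
    for (std::size_t i = 0; i < n; ++i) { ma += a[i]; mb += b[i]; }
    ma /= n; mb /= n;
    double num = 0, va = 0, vb = 0;
    for (std::size_t i = 0; i < n; ++i) {
        num += (a[i] - ma) * (b[i] - mb);
        va  += (a[i] - ma) * (a[i] - ma);
        vb  += (b[i] - mb) * (b[i] - mb);
    }
    return num / std::sqrt(va * vb);
}

// Equation (2): maps NCC from [-1, 1] to a similarity in [0, 1].
double patch_similarity(const std::vector<double>& a, const std::vector<double>& b) {
    return 0.5 * (ncc(a, b) + 1.0);
}
```

Identical patches score 1, perfectly anti-correlated patches score 0.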

2.3 Tracking of Objects

The tracker of the MORTAL system estimates the displacements of a set of the most reliable points within each object's bounding box. The filtering of points based on their reliability is depicted in Figure 1. The reliability is calculated using the Sum of Squared Differences (SSD).
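The median-based filtering of point displacements used in this section can be sketched as follows. This is a minimal illustration under assumptions: the function names and the threshold value are not from the paper, and the per-point reliability scores (SSD) are assumed to be computed elsewhere.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Median of a set of point displacements (helper).
double median(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    const std::size_t n = v.size();
    return n % 2 ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

// Keep only the points whose displacement agrees with the object's overall
// motion, i.e. whose residual against the median displacement is within a
// threshold. The threshold value is an illustrative assumption.
std::vector<double> filter_reliable(const std::vector<double>& disp,
                                    double thresh) {
    const double dm = median(disp);
    std::vector<double> kept;
    for (double dp : disp)
        if (std::fabs(dp - dm) <= thresh) kept.push_back(dp);
    return kept;
}
```

A point whose displacement deviates strongly from the median (e.g. one stuck on the background) is discarded before the object's motion is estimated.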

The motion of each point is then estimated using a 2-level pyramidal implementation of the Lucas-Kanade tracker (Bouguet 1999). A failure detection mechanism is implemented in the tracker to reduce the false claims of the tracker. A failure is flagged for a point when the residual |δ_p − δ_m| exceeds a threshold, where δ_p represents the displacement of a single point and δ_m represents the median of the displacements of all points for the particular object under consideration. The tracker re-initializes the point locations in every frame for every object.

Figure 1: Filtering of points for Tracking

2.4 Object Modeling

Each object is modelled using a collection of positive and negative patches. It is stored as a dynamic data structure and represents the object's appearance and surroundings. Mixture models of the pixels inside the region are computed, and the background is represented by a mixture of Gaussians. The models of the objects are passed using shared memory buffers, which improves performance since no Operating System involvement is necessary. The model of the i-th object can be represented by:

M[i] = { P+[1], P+[2], ..., P+[m], P-[1], P-[2], ..., P-[n] }

where P+[j] is the j-th positive patch of the i-th object and P-[k] is the k-th negative (background) patch of the i-th object.

Relative similarity is used for Nearest Neighbour (NN) classification (Wen 2011). The relative similarity is defined as:

S_r(P, M[i]) = S+(P, M[i]) / (S+(P, M[i]) + S-(P, M[i]))

where S+(P, M[i]) represents the similarity of an arbitrary patch P with the positive nearest neighbour and S-(P, M[i]) represents the similarity of P with the negative nearest neighbour. The similarity between two patches, S(P_i, P_j), is defined in equation (2). A new patch P is classified as a positive patch and used for updating the model if and only if S_r(P, M[i]) > 0.7. The confidence threshold of 0.7 was arrived at experimentally.
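The relative-similarity classification can be sketched as follows, assuming the similarities of patch P against the model's positive and negative patches have already been computed with equation (2); function names are illustrative, not from the paper.

```cpp
#include <algorithm>
#include <vector>

// Relative similarity S+ / (S+ + S-), where S+ and S- are the best
// similarities of the patch to the positive and negative patches of the
// model (nearest-neighbour similarities). Both lists must be non-empty.
double relative_similarity(const std::vector<double>& pos_sims,
                           const std::vector<double>& neg_sims) {
    const double sp = *std::max_element(pos_sims.begin(), pos_sims.end());
    const double sn = *std::max_element(neg_sims.begin(), neg_sims.end());
    return sp / (sp + sn);
}

// A patch is accepted into the model only when its relative similarity
// exceeds the experimentally chosen confidence threshold of 0.7.
bool accept_patch(const std::vector<double>& pos_sims,
                  const std::vector<double>& neg_sims) {
    return relative_similarity(pos_sims, neg_sims) > 0.7;
}
```

A patch that is much closer to a positive exemplar than to any negative one (e.g. S+ = 0.9, S- = 0.1) clears the 0.7 threshold; an ambiguous patch (S+ ≈ S-) does not.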

2.5 Detection of Objects

The object detector can be regarded as a method to localize the appearances of the various objects represented in the object models. MORTAL uses a matching scheme based on histogrammed intensity patches (Taylor & Drummond 2009).

Figure 2: Multiple object localization by the detector

A cascaded classifier architecture, commonly used in face detection (Viola & Jones 2001), is used by the detector. The three stages of classification by the detector are shown in figure 3. The variance filter rejects patches with variance less than 0.8 times the variance of the first patch, which was derived from the user-defined bounding box of the particular object. The patches of an object that were not rejected by the variance filter are passed on to an ensemble classifier (Dietterich 2000).

Figure 3: Cascaded Classifiers used by the detector
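The variance-filter stage of the cascade can be sketched as follows; the patch representation and function names are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// Gray-value variance of a patch stored as a flat vector.
double variance(const std::vector<double>& p) {
    double m = 0;
    for (double v : p) m += v;
    m /= p.size();
    double var = 0;
    for (double v : p) var += (v - m) * (v - m);
    return var / p.size();
}

// First cascade stage: reject candidate patches whose variance is below
// 0.8 times the variance of the object's user-initialized patch.
bool passes_variance_filter(const std::vector<double>& candidate,
                            const std::vector<double>& initial_patch) {
    return variance(candidate) >= 0.8 * variance(initial_patch);
}
```

This cheaply discards low-texture regions (e.g. uniform background) before the more expensive ensemble and NN stages run.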

The Ensemble Classifier learns from a set of classifiers and combines the predictions of multiple classifiers. A random fern classification is employed at this stage.

Figure 4: Parallel FERN Classifier

It was shown (Ozuysal, Fua & Lepetit 2007) that ferns outperform random-tree-based implementations. The semi-naïve structure of ferns allows a simple and fast implementation which can be scaled up by increasing the number of ferns. With a single fern, it is necessary to use a large number of features to achieve satisfactory results (Ozuysal, Fua & Lepetit 2007). Limiting the number of ferns to 8 and the number of features per fern to 10 gave optimal performance. The parallel FERN classifier stage of MORTAL is shown in Figure 4. The probability, given by

P(y = 1 | x) = (1/m) Σ_{k=1..m} P_k(y = 1 | x)

is obtained from the feature values, where y = 1 is the event that the patch under consideration has a positive class label for the corresponding object and m is the number of ferns. To increase the robustness of MORTAL to image noise and shift, the patch is convolved with a Gaussian kernel with a standard deviation of 3 pixels. The last stage of the cascaded classifiers is the NN classifier. The NN classifier classifies a patch as an object if and only if

S_r(P, M[i]) > 0.7

The threshold of 0.7 was arrived at experimentally.

2.6 Training by Learning

The training phase, which is the learning component of the system, is inspired by two well-known learning approaches, namely boosting (Freund 2001) and example-based learning (Sung & Poggio 1998), which are combined using P-N learning (Kalal, Matas & Mikolajczyk 2010). The training samples for the initial detector of each object are derived from the initial bounding box supplied by the user for the corresponding object. Warped views of the entire objects from different viewpoints are generated by varying the scale, shift and in-plane rotation of the object boxes.

Figure 5: Training samples generated
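The warped-view generation above can be sketched by randomly sampling perturbation parameters for the initial box. The perturbation ranges below are illustrative assumptions, not values from the paper, and applying each warp to the image is left to the imaging pipeline.

```cpp
#include <random>
#include <vector>

// One warp of the initial bounding box: a shift, a scale change and an
// in-plane rotation, as described in section 2.6.
struct Warp {
    double shift_x, shift_y;  // pixels
    double scale;             // relative to the initial box
    double rotation_deg;      // in-plane rotation in degrees
};

// Sample `count` random warps. The parameter ranges are assumptions
// chosen only for illustration.
std::vector<Warp> generate_warps(int count, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> shift(-5.0, 5.0);
    std::uniform_real_distribution<double> scale(0.95, 1.05);
    std::uniform_real_distribution<double> rot(-10.0, 10.0);
    std::vector<Warp> warps;
    warps.reserve(count);
    for (int i = 0; i < count; ++i)
        warps.push_back({shift(rng), shift(rng), scale(rng), rot(rng)});
    return warps;
}
```

Each warp, applied to the user-initialized patch, yields one synthetic positive training sample for the initial detector.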

With the initial detector the MORTAL system starts tracking and keeps discovering new poses and appearances of the objects to increase the generalization of the object detector. It also keeps discovering negative training samples for the detector to discriminate the objects against the background clutter.

2.7 Decision Making

This is the final phase of the MORTAL system's processing, performed on each frame of the input stream. MORTAL constantly analyses the output of the tracker and detector. The detector errors for each object are estimated and the object model is updated to prevent the errors from occurring in the future. The final state of each object is deduced in this phase by combining the results from the detection and tracking phases. The obtained bounding box is the maximally confident box.

3. RESULTS

The proposed system was implemented in C++. The OpenMP API was used for multi-platform shared-memory multiprocessing. The system was tested on a 2.53 GHz dual-core Intel processor. The proposed system is evaluated on a real-time video input stream at 320 × 240 resolution and on the publicly available Bolt running sequence to demonstrate tracking of arbitrary objects. The results presented below demonstrate the robustness of the proposed system and how it tackles various challenges faced while tracking objects.

3.1 Scale changes

The proposed system tracks the objects at different scales. Figure 6 demonstrates the scale-invariant tracking, where the system tracks the objects even when they are very close to or far away from the capturing device.

Figure 6: MORTAL tracking objects at different scales

3.2 Occlusions In multiple object tracking, occlusions between the objects are common. Figure 7 demonstrates the ability of the system to handle occlusions efficiently.

Figure 7: MORTAL handling occlusions

3.3 Appearance and Viewpoint changes

Since no pre-trained object classifier or model is available for online tracking, online trackers face the severe problem of appearance and viewpoint changes. Figure 8 demonstrates the ability of the proposed system to handle such changes.

Figure 8: MORTAL handling appearance and viewpoint changes (2nd object is rotated 360°)

3.4 Disappearance of Object(s)

All the objects may not appear in every frame of the input video stream. An efficient object tracker should not fail or under-perform in such scenarios. Figure 9 demonstrates the proposed system handling the disappearance of an object without the decrease in frame rate that usually accompanies such scenarios.

Figure 9: MORTAL handling disappearance of objects (2nd object is out of view)

3.5 Fast Movements and Motion Blur

The object(s) being tracked may move rapidly, which poses a challenge to trackers. Figure 10 shows a series of consecutive frames in which one of the tracked objects moves very quickly. The proposed system efficiently tracks the objects, handling fast movements and motion blur. Also, note that the frame rate achieved is not affected by such fast movement of the objects.

Figure 10: MORTAL handling fast movements and motion blur

3.6 Illumination changes

Due to environmental conditions and other variations, the illumination of the objects and/or the background may vary rapidly. Figure 11 shows the proposed system tracking the objects accurately even when a direct beam of light (torch light) is directed toward the objects which abruptly changes the illumination pattern.

Figure 11: MORTAL handling illumination variation (torch light).

Figure 12 shows the results of the tracker on the video sequence of the London Olympics 2012 100 m final. The tracker tracks 3 sprinters, namely Usain Bolt (yellow, lane 7), Yohan Blake (yellow, lane 5) and Richard Thompson (orange, lane 2). The video sequence had a total of 1800 frames. The average frame rate achieved throughout the video sequence was 12.4 FPS. The precision, recall and F-score of MORTAL's tracking results against the ground truth are also shown in figure 12, which demonstrates the high accuracy and reliability of MORTAL's results.

Figure 12: MORTAL tracking results on the Bolt running video sequence.

3.7 Real-time Performance

The proposed system achieves real-time performance in tracking even when it faces occlusion, scale variations, viewpoint changes, illumination changes and disappearance of up to 3 objects. The frame rate achieved when tracking four different objects simultaneously was 11.1 frames per second, as shown in figure 13.

Figure 13: MORTAL tracking 4 objects at 11FPS.

4. CONCLUSION

A novel method for robust, multiple independent object tracking was developed and the results of the implementation were presented. Several challenges commonly faced in object tracking were addressed, and the proposed system was shown to perform under occlusions between the objects, appearance changes, scale variations, disappearances and fast movement of the objects. The precision, recall and F-score achieved by the proposed system with respect to the ground truth demonstrate the high tracking accuracy of the system. The parallel implementation on the CPU achieves near real-time performance for simultaneous tracking of up to 4 independent objects of different sizes and appearances, thus exceeding current multiple object tracking performance.

5. REFERENCES

Avidan, S 2001, 'Support Vector Tracking', IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 8, 1064-1072.
Bouguet, JY 1999, 'Pyramidal Implementation of the Lucas Kanade Feature Tracker', Technical report, Intel Microprocessor Research Labs.
Dietterich, TG 2000, 'Ensemble Methods in Machine Learning', First International Workshop on Multiple Classifier Systems, Springer Berlin Heidelberg, 1-15.
Freund, Y 2001, 'An Adaptive Version of the Boost by Majority Algorithm', Machine Learning, Kluwer Academic Publishers, 102-113.
Kalal, Z, Matas, J & Mikolajczyk, K 2010, 'P-N Learning: Bootstrapping binary classifiers by structural constraints', IEEE Conference on Computer Vision and Pattern Recognition, 49-56.
Kalal, Z, Matas, J & Mikolajczyk, K 2011, 'Tracking-Learning-Detection', IEEE Transactions on Pattern Analysis and Machine Intelligence, 34, 7, 1409-1422.
Oron, S, Bar-Hillel, A, Levi, D & Avidan, S 2012, 'Locally Orderless Tracking', CVPR, IEEE, 1940-1947.
Ozuysal, M, Fua, P & Lepetit, V 2007, 'Fast keypoint recognition in ten lines of code', IEEE Conference on Computer Vision and Pattern Recognition, 1-8.
Park, S & Agarwal, JK 2004, 'A hierarchical Bayesian network for event recognition of human actions and interactions', Multimedia Systems, Springer-Verlag, 10, 2, 164-179.
Sidram, MH & Bhajantri, NU 2014, 'Exploiting Regression Line and Doyle's Distance to Track the Object via Color Optical Flow', International Journal of Tomography & Simulation, 27, 3, CESER Publications.
Sung, K-K & Poggio, T 1998, 'Example-Based Learning for View-Based Human Face Detection', IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1, 39-51.
Taylor, S & Drummond, T 2009, 'Multiple target localisation at over 100 fps', British Machine Vision Conference.
Viola, P & Jones, M 2001, 'Rapid object detection using a boosted cascade of simple features', IEEE Conference on Computer Vision and Pattern Recognition, 1, 511-518.
Wen, G 2011, 'Relative nearest neighbors for classification', International Conference on Machine Learning and Cybernetics, IEEE, 2, 773-778.
Yilmaz, A, Javed, O & Shah, M 2006, 'Object tracking: A survey', ACM Computing Surveys, ACM, 38, 4.
Zhang, K, Zhang, L & Yang, M-H 2012, 'Real-time Compressive Tracking', ECCV, 1305-1312.
Zhang, T, Ghanem, B, Liu, S & Ahuja, N 2012, 'Robust Visual Tracking via Multi-task Sparse Learning', CVPR, IEEE, 2042-2049.
Zhang, Y & Wang, X 2013, 'A Target Tracking Algorithm Based on Particle Filter', International Journal of Applied Mathematics and Statistics, 51, 22, CESER Publications.
Zhong, W, Lu, H & Yang, M-H 2012, 'Robust Object Tracking via Sparsity-based Collaborative Model', CVPR, IEEE, 1838-1845.