2011 20th European Conference on Circuit Theory and Design (ECCTD)

People Recognition and Tracking Methods for Control of Viewpoint in CCTV Systems

Adam Dabrowski, Damian Cetnarowicz, Paweł Pawłowski, Mateusz Stankiewicz

Poznań University of Technology, Faculty of Computing, Chair of Control and System Engineering, Division of Signal Processing and Electronic Systems, Piotrowo 3, 60-965 Poznań, Poland

{adam.dabrowski, damian.cetnarowicz, pawel.pawlowski, mateusz.stankiewicz}@put.poznan.pl

Abstract. This paper presents several methods for application in CCTV systems in order to support operators' duties. These methods are based on video processing only. As a result, the attention of the operator can be attracted with an alarm signal, additional information about objects can be delivered, or automatic control of PTZ cameras can be applied. The considered methods include: people detection, silhouette extraction, multiple camera signal processing, and moving object tracking.


Keywords: Silhouette extraction, multiple camera, viewpoint, people detection, object tracking

1 Introduction

Monitoring systems are often aided by automation procedures and computers that assist with given tasks. Unfortunately, CCTV (closed circuit television) and surveillance systems still noticeably lack automation algorithms. The video operator must watch many monitor screens and look for potentially dangerous situations. After hours of such monotonous work he / she can react with a delay or even miss some events. One helpful proposal is a system that detects people and tracks them by moving a PTZ (pan-tilt-zoom) camera or by selecting the best view from a multiple camera system. PTZ cameras are very often programmed to scan a selected area in a cycle. During an emergency the operator breaks the cycle and controls the camera manually, but due to hardware and human limits he / she can typically follow the moving object with only one of the cameras. If there are several interesting objects to follow, he / she must choose only one. A system equipped with the presented solutions can follow as many objects as there are cameras available and help the operator to fix his / her attention on the best viewpoint. The main task of the system is identification of the characteristic features of the followed object (especially a person), computation of the motion vector, and finally, control of the camera position or selection of the best view, so as to keep the followed object close to the center of the screen.

2 Methods for moving people recognition

Detection of people in CCTV systems can use a variety of well-established methods. In general, any object detection algorithm can be used. Moreover, we can exploit the fact that the object is moving and thus also use motion detection techniques. These methods must be supported by additional algorithms that distinguish between object types and select people only. Detection of objects and of their motion can be realized separately or jointly. An object can be detected using characteristics such as its color or shape; then only a single frame is needed for the analysis, but the typical quality of CCTV material (especially at night) is too low to rely on a single feature. Alternatively, the object can be detected from its velocity vectors or from the difference between the current frame and an earlier estimated background, as sketched below. The known methods differ in complexity and in the quality of their results; to be used in a real-time system, they must be carefully selected and optimized.
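As a minimal illustration of the background-difference detection mentioned above (not the authors' exact pipeline), the following Python/OpenCV sketch maintains a running-average background and thresholds the difference against the current frame. The input file name, update rate, and threshold are illustrative assumptions:

```python
import cv2
import numpy as np

# Sketch: motion detection by differencing the current frame against a
# running-average background estimate. Parameter values are illustrative.
cap = cv2.VideoCapture("cctv.avi")  # hypothetical recording
ret, frame = cap.read()
background = np.float32(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))

while True:
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # slowly adapt the background to gradual illumination changes
    cv2.accumulateWeighted(gray, background, 0.01)
    diff = cv2.absdiff(gray, cv2.convertScaleAbs(background))
    # pixels that differ strongly from the background are marked as motion
    _, motion_mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    cv2.imshow("motion", motion_mask)
    if cv2.waitKey(1) == 27:  # Esc quits
        break
```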

2.1 Foreground moving object estimation

The main issue in moving people recognition is the extraction of the foreground (the moving objects) from a complex background. A further step is to estimate characteristic features identifying each person, such as the length of the step. For the experiments, recordings were gathered in the laboratory and in an urban environment. As shown in Fig. 1, the foreground model is estimated from the loaded frame using a GMM (Gaussian mixture model). In the next step the background model is built based on the foreground statistical model. The foreground model does not represent the exact shape of the moving object; the background, in contrast, is modeled exactly, so the silhouette is extracted by subtracting the estimated background image from the original frame. The final image can be rated as a good estimate of the moving person (Fig. 2). The implemented algorithm works in real time, hence it can be utilized as part of a more complex system. The OpenCV library was used to implement the image processing [4]. This cross-platform computer vision library suits real-time applications very well. OpenCV also includes a statistical machine learning module, so use in other areas (e.g., labeling and motion tracking, gesture recognition, human-computer interfaces) is possible as well.
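A compact sketch of GMM-based silhouette extraction in the spirit of this section is shown below. It assumes the modern cv2 API with the MOG2 subtractor (a Gaussian-mixture background model); the original 2011 work used the OpenCV 1.x interface, and the file name and parameters here are assumptions:

```python
import cv2

# GMM background model: each pixel is classified foreground/background.
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=True)

cap = cv2.VideoCapture("corridor.avi")  # hypothetical recording
while True:
    ret, frame = cap.read()
    if not ret:
        break
    fg_mask = subtractor.apply(frame)             # per-pixel GMM classification
    # drop shadow pixels (marked 127 by MOG2), keep confident foreground
    _, fg_mask = cv2.threshold(fg_mask, 200, 255, cv2.THRESH_BINARY)
    background = subtractor.getBackgroundImage()  # current background estimate
    # subtracting the estimated background from the frame exposes the silhouette
    silhouette = cv2.absdiff(frame, background)
    silhouette = cv2.bitwise_and(silhouette, silhouette, mask=fg_mask)
    cv2.imshow("silhouette", silhouette)
    if cv2.waitKey(1) == 27:
        break
```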

Fig. 1. Foreground moving object extraction


Fig. 2. Results of extraction: recorded frame (left column), extracted background (middle column), and estimated person (right column)

3 Moving people tracking

The detection of people and of their movement is the initial stage of tracking. The tracking must be supported by additional methods for clear-cut object classification. This is required for several reasons: a person detected in one frame and a person detected separately in another frame may be two different persons. If two persons pass each other, a simple tracking algorithm can mismatch them, and the object classifiers and object descriptors will then be incorrect. As already mentioned, in order to follow an object we typically use information about its color and / or shape. In the color representation we mainly use the hue (H) component of the HSV (hue, saturation, value) or HSL (hue, saturation, lightness) representation. During initialization we either specify a given color (hue) value together with its maximal allowed variation, or select an object to follow. The algorithm then compares color components or histograms of the analyzed area [5]. Instead of a color, we can follow the contour of the person, computed by means of an edge extractor, and look for the shape of a person in the following frames [8]. The proposed system implements two motion tracking algorithms: the Horn & Schunck method and the CAMSHIFT method. The Horn & Schunck method [3] is based on the optical flow equation; it seeks a motion field that satisfies this equation with minimum between-pixel variation of the flow vectors, as sketched below. The CAMSHIFT (continuously adaptive mean shift) method is based on the MEANSHIFT algorithm and uses color characteristics of the tracked object [1, 2].
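The following is a compact, textbook-style sketch of the Horn & Schunck iteration [3], not the paper's optimized real-time implementation; the derivative filters, smoothness weight, and iteration count are illustrative assumptions:

```python
import cv2
import numpy as np

def horn_schunck(prev, curr, alpha=1.0, iters=100):
    """Dense optical flow between two grayscale frames (illustrative)."""
    prev = prev.astype(np.float32) / 255.0
    curr = curr.astype(np.float32) / 255.0
    # spatial and temporal image derivatives
    Ix = cv2.Sobel(curr, cv2.CV_32F, 1, 0, ksize=3)
    Iy = cv2.Sobel(curr, cv2.CV_32F, 0, 1, ksize=3)
    It = curr - prev
    u = np.zeros_like(curr)
    v = np.zeros_like(curr)
    # standard Horn & Schunck neighborhood-averaging kernel
    kernel = np.array([[1, 2, 1], [2, 0, 2], [1, 2, 1]], np.float32) / 12.0
    for _ in range(iters):
        u_avg = cv2.filter2D(u, -1, kernel)
        v_avg = cv2.filter2D(v, -1, kernel)
        # update minimizing brightness-constancy error plus smoothness term
        t = (Ix * u_avg + Iy * v_avg + It) / (alpha**2 + Ix**2 + Iy**2)
        u = u_avg - Ix * t
        v = v_avg - Iy * t
    return u, v  # per-pixel motion field (flow vectors)
```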

3.1 CAMSHIFT algorithm

The implemented CAMSHIFT algorithm uses only the hue of the full color description. This simplifies the processing and makes it less sensitive to lighting variations. The probability distribution of the person's location is approximated by the color histogram of the tracked person. The MEANSHIFT algorithm then looks for the maximum of the probability density within the search window and moves the window center to the centroid. If the movement of the search window is small, the algorithm finishes; otherwise it looks for a new maximum of the probability density in the new search window. Using these results, the new position and size of the search window are calculated. The tracked person can move in all directions, including towards the camera, so in the window it may appear not only at another position but also smaller or larger. The MEANSHIFT algorithm is therefore supplemented with additional conditions and actions. The algorithm operates properly if the tracked object has a uniform color that contrasts strongly with the background. Very good results can be achieved when tracking parts of the human body such as the face or a hand [7]. When the tracked object gets near the edge of the screen, the camera position is automatically adjusted to compensate for the movement. The quality of tracking obtained with the CAMSHIFT algorithm is good [6]. The system manages to track very fast movements as long as the tracked object does not leave the camera field of view (the maximum speed of the PTZ camera is the only limit). There are no problems with tracking objects that change shape or size due to changing orientation and / or distance from the camera (see Fig. 3).
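A minimal hue-histogram CAMSHIFT sketch in the spirit of this section is given below, following the classic OpenCV usage pattern [1, 2]. The video source, the initial search window (here assumed to come from the foreground extraction stage or from the operator), and the saturation/value mask bounds are assumptions:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("ptz.avi")        # hypothetical camera stream
ret, frame = cap.read()
x, y, w, h = 300, 200, 80, 160           # illustrative initial window
roi = frame[y:y + h, x:x + w]
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
# keep only reasonably saturated, bright pixels when building the hue model
mask = cv2.inRange(hsv_roi, np.array((0., 60., 32.)), np.array((180., 255., 255.)))
hist = cv2.calcHist([hsv_roi], [0], mask, [180], [0, 180])  # hue channel only
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
# stop after 10 iterations or when the window moves less than 1 pixel
term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

track_window = (x, y, w, h)
while True:
    ret, frame = cap.read()
    if not ret:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # back-projection: per-pixel likelihood of belonging to the tracked person
    backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    rot_box, track_window = cv2.CamShift(backproj, track_window, term)
    pts = np.int32(cv2.boxPoints(rot_box))  # adaptively sized, rotated window
    cv2.polylines(frame, [pts], True, (0, 255, 0), 2)
    cv2.imshow("tracking", frame)
    if cv2.waitKey(1) == 27:
        break
```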

Fig. 3. Tracking a moving person

The most difficult issue is the influence of the intensity of the ambient light on the process of finding the tracked person. The background color also influences the results. The main limitation of the CAMSHIFT method is its reliance on color to track the object; hence it cannot be used with black-and-white or infrared cameras.

4 Multiple camera object tracking

The second part of the presented system uses a tracking algorithm that recognizes the same moving object seen by many cameras. Additionally, the location of the detected object is visualized on a map. First, the combination of multiple cameras for automatic detection of moving objects is considered. It is assumed that these cameras observe adjacent parts of a large area or the same area from different sides. Such situations occur both outdoors and indoors when a moving object passes the cameras in a sequence. The point is that observation of the moving object requires more than one camera, either because the area is larger than the frame of a particular camera or because cameras have to be placed in more than one position due to a complex shape of the area.



The general aim is to use multiple cameras to view moving objects in the area of interest; the view of the chosen object does not have to be as detailed as in the case of smart rooms, where the system captures gestures of people (in the latter case the ratio of the number of cameras to the observed area is typically larger). In the scenario described above, the video signal from all cameras is transmitted to one place, where the operator observes areas such as streets, car parks, corridors, or supermarkets on video monitors (in general, a public space equipped with a CCTV security system). It is important to emphasize that such an intelligent system is not a substitute for the CCTV operator but an extension, which offers the operator new functionality. The block diagram of the proposed idea is depicted in Fig. 4. First, the video signals from the cameras are processed independently: moving objects are detected and then described numerically. These numerical descriptions are delivered to the next stage, where they are compared with each other in order to decide which object is seen by two or more cameras. In this way the system can count how many different objects are seen by the multiple cameras.

Fig. 4. Block diagram of processing signals from multiple cameras: each camera feeds a moving object extraction and object description stage; the descriptions enter object matching, best view selection, and transformation of coordinates, with output to the video monitor and the map

The numerical description (descriptor) is based on the color and size of the detected object. Assuming that the observed object can be characterized by a dominating color, the object can be represented by this color alone. This assumption appears quite practical, as it was positively confirmed in preliminary tests. Given a certain moving object, its average color is calculated in the HSV space. The size of the object is used to eliminate moving objects that are very small on the video screen; for example, moving objects are considered only when their height is at least 30% of the frame (video screen) height. The comparison of candidate matching objects is based on the Euclidean distance between the calculated color vectors in the HSV space, as in the sketch below. The smaller the distance, the higher the probability that the same object is seen by different cameras.
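A minimal sketch of this descriptor and matching step follows. The function names and the distance threshold are assumptions for illustration; only the 30% height criterion comes from the text:

```python
import cv2
import numpy as np

def describe(frame, mask):
    """Descriptor: average HSV color of the masked (detected) object."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    return np.array(cv2.mean(hsv, mask=mask)[:3])

def is_large_enough(obj_height, frame_height):
    """Size filter from the text: keep objects at least 30% of frame height."""
    return obj_height >= 0.3 * frame_height

def same_object(desc_a, desc_b, max_dist=40.0):
    """Match two per-camera descriptors by Euclidean distance in HSV space;
    max_dist is an assumed tuning parameter."""
    return np.linalg.norm(desc_a - desc_b) < max_dist
```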

This matching allows tracking of a chosen object. The information about which camera view contains the selected object can be used to visualize the moving object on a map of the monitored area. First, the coordinate system of the frame has to be transformed into the coordinate system of the map, as shown in Fig. 5. This is the so-called affine transformation; straight lines in one coordinate system are transformed to straight lines in the other, and only angles change. After the affine transformation, the object that moves in the camera view can be visualized on the map. An example of this transformation is shown in Fig. 6, where the object is marked with a blue circle. In order to determine the affine transformation, the camera has to be calibrated. In the reported experiment a chessboard was used. The calibration procedure requires that the chessboard be placed in the monitored scene in such a way that its sides overlap with the directions of the coordinate axes of the map and the location of a distinguished field of the chessboard is known; this is illustrated in Fig. 6. Given an image of the chessboard acquired by the camera, the coordinates of its corners are marked by the user. In this way the locations of the corners are known both in the camera view coordinate system and in the map coordinate system, which allows calculation of the transformation matrix, as sketched in the code below. The reported experiment showed that additional information can be obtained from a CCTV system by means of video signal processing. Using multiple views from the monitoring cameras, moving objects such as people or cars can be identified. This is a kind of local identification: the person's identity is not established; the system recognizes the appearance of an unknown moving object and tracks its movement as long as the object is seen by at least one camera. The experiment was performed using a cardboard model of a real corridor, with webcams and the OpenCV library [4]. The aim of this experiment was to demonstrate the correctness of the proposed procedure.
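The frame-to-map mapping can be sketched as follows: three chessboard corners marked by the user in the camera view, together with their known map positions, determine the 2x3 affine matrix. All coordinate values and the helper name to_map are illustrative assumptions:

```python
import cv2
import numpy as np

# Three corresponding points: chessboard corners in the camera image
# and their known locations on the map (illustrative coordinates).
frame_pts = np.float32([[210, 380], [455, 372], [215, 120]])  # image pixels
map_pts   = np.float32([[100, 650], [520, 740], [100, 400]])  # map pixels
A = cv2.getAffineTransform(frame_pts, map_pts)  # 2x3 transformation matrix

def to_map(point):
    """Transform one (x, y) frame coordinate into map coordinates."""
    x, y = point
    return tuple(A @ np.array([x, y, 1.0]))

# Example: where the tracked object's frame position lands on the map.
print(to_map((300, 250)))
```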

Fig. 5. Two coordinate systems: that of the frame image (red axes, left image only) and that of the map image (green axes)

Fig. 6. Map of the monitored area with the marked localization of the detected object

5 Conclusions

The presented methods can improve the quality of CCTV monitoring. The results were verified both in a modeled and in a real environment. The authors are currently optimizing the algorithms to obtain proper results under a variety of environmental conditions, such as different weather or lighting. Additionally, a preliminary stage of system consolidation was performed; the algorithms are now being prepared for acceleration with GPGPU (general-purpose computing on graphics processing units).




References

[1] Allen, J. G., Xu, R. Y. D., Jin, J. S.: Object tracking using CamShift algorithm and multiple quantized feature spaces. In: Proc. of the Pan-Sydney Area Workshop on Visual Information Processing (VIP '05), Australian Computer Society, Australia, 2004, pp. 3-7
[2] Bradski, G.: Computer vision face tracking for use in a perceptual user interface. Intel Technology Journal, Q2 1998, Microcomputer Research Lab, Intel Corporation, Santa Clara, CA
[3] Horn, B. K. P., Schunck, B. G.: Determining optical flow. Institute report AIM-572, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1980
[4] Intel Corporation: OpenCV, http://opencvlibrary.sourceforge.net
[5] Li, J., et al.: Color based multiple people tracking. In: 7th Int. Conf. on Control, Automation, Robotics and Vision (ICARCV), 2002
[6] Pawłowski, P., Borowczyk, K., Marciniak, T., Dąbrowski, A.: Real-time object tracking using motorized camera. In: IEEE Signal Processing Conference SPA 2009, Poznań, 2009, pp. 225-228
[7] See, A. K. B., Kang, L. Y.: Face detection and tracking utilizing enhanced CAMSHIFT model. International Journal of Innovative Computing, Information and Control, Vol. 3, No. 3, 2007
[8] Yokoyama, M., Poggio, T.: A contour-based moving object detection and tracking. In: 2nd Joint IEEE International Workshop, 2005

This paper was prepared within the INDECT project.
