J Intell Robot Syst (2008) 52:5–43 DOI 10.1007/s10846-007-9201-6

Multi-Camera Human Activity Monitoring

Loren Fiore · Duc Fehr · Robert Bodor · Andrew Drenner · Guruprasad Somasundaram · Nikolaos Papanikolopoulos

Received: 5 November 2007 / Accepted: 7 December 2007 / Published online: 29 January 2008 © Springer Science + Business Media B.V. 2007

Abstract With the proliferation of security cameras, the approach taken to monitoring and placement of these cameras is critical. This paper presents original work in the area of multiple camera human activity monitoring. First, a system is presented that tracks pedestrians across a scene of interest and recognizes a set of human activities. Next, a framework is developed for the placement of multiple cameras to observe a scene. This framework was originally used in a limited X, Y, pan formulation but is extended here to include height (Z) and tilt. Finally, an active dual-camera system for task recognition at multiple resolutions is developed and tested. All of these systems are tested under real-world conditions and are shown to produce usable results.

Keywords Pedestrian tracking · Camera placement · Human activity recognition · Surveillance

1 Introduction

Faced with the changing social and political climate in the world today, many governments and private companies are turning to video surveillance, among other measures, in an attempt to keep people safe. Video cameras are often chosen because they are inexpensive and, if used correctly, can provide large amounts of information. This, however, is also their greatest downfall.

This work has been supported by the NSF through grants #IIS-0219863, #CNS-0224363, #CNS-0324864, #IIP-0443945, #CNS-0420836, #IIP-0726109, and #CNS-0708344. L. Fiore · D. Fehr · R. Bodor · A. Drenner · G. Somasundaram · N. Papanikolopoulos (B) University of Minnesota, CSE, 200 Union Street SE, Minneapolis, MN 55455, USA e-mail: [email protected]


These cameras must be placed effectively to view a scene or area so that the amount of usable information is maximized. Also, with increasing numbers of cameras comes the need for an increasing number of employees to monitor them. These two problems are the motivating force behind the work presented in this paper. In the following sections, systems are presented for tracking pedestrian motion, optimizing camera placement in two and three dimensions, and monitoring pedestrian activity. The results from several simulations and real-world experiments are also given in order to convey the effectiveness of the methodologies.

This work first deals with the tracking of pedestrians through a single-viewpoint camera system. Such a system is limited by the fact that only one camera is used: the camera is forced to have a wide-angle view of the scene, which reduces the resolution of tracked individuals, and a single camera is not robust to occlusions. Multiple-viewpoint systems are an obvious extension and solution to this problem. It will be shown, however, that an ad-hoc "intuitive" placement of multiple cameras observing a scene may not observe pedestrian activities in an optimal manner. Therefore, a framework is presented that addresses such issues. The framework works both in a limited form, which assumes the height and the tilt of the cameras to be fixed, and in a more general form, in which the height, tilt, and focal length of the camera are considered. Finally, a multiple-viewpoint system is discussed and demonstrated, which consists of a single wide-angle camera mounted alongside a single pan/tilt/zoom camera. This system combines the benefits of a single-viewpoint wide-angle camera system with one that can take high-resolution imagery of the activity in question. In the next section, some of the previous work in this field is reviewed and compared to the methods presented here.

2 Related Work

The use of computer vision to understand the behavior of human beings is a very rich and interesting field. This problem requires recognizing human behavior and understanding intent from observations alone. This is a difficult task, even for humans to perform, and misinterpretations are common. However, recognizing human activities from video has been widely studied in many areas. Some of the application areas include pedestrian and transportation safety [9, 36, 39], surveillance [5, 23, 28, 37, 50], crime prevention [25], human–computer interaction [15, 61], and even interpreting sign language [52]. This field is at a junction between, and uses components of, many different areas of research including tracking, control, motion classification, observability, and camera placement. This section summarizes work in these areas that is relevant to the problem we address.

2.1 Single Viewpoint-Based Surveillance

Activity recognition is challenging for many reasons. Many of the technologies that activity recognition relies upon, such as robust foreground segmentation, human subject tracking, and occlusion handling, are all active research areas with many unsolved problems.

Another important consideration is the issue of scale. Every recognition task has an appropriate level of detail at which measurements will best resolve the activity of


interest. Pers et al. [48] found that the level of detail used to observe human motion plays a critical role in the effectiveness of the classification results for a large set of activities. If vision-based activity recognition at two levels of detail is considered, a low resolution recognition task can be defined as one that requires only information about the body as a whole. For example, center of mass, area, volume, and velocity of the entire body are examples of low resolution measurements under this definition. In contrast, a high resolution task requires measurement of individual body parts and their relations to one another. Examples of high resolution measurements include pose, silhouette, position, and velocity of individual body parts such as hands, feet, head, and face.

For many surveillance, safety monitoring, and transportation applications, low resolution measurement is generally sufficient. This level allows for the detection of loitering, trespassing, etc. [9, 23, 25, 27, 28, 37, 45]. A great deal of work has been done in this area. Solutions have been attempted using a wide variety of methods (e.g., optical flow [20], Kalman filtering, Hidden Markov models, statistical learning machines [29], etc.) and modalities (e.g., single camera, stereo [3], infra-red, etc.), including combinations of these [17]. In addition, there has been work in various aspects of the issue, including single pedestrian tracking and group tracking (Fig. 1).

For automated surveillance applications, tracking is a fundamental component. The pedestrian must first be tracked before activity recognition can begin. The majority of papers detail methods that only track a single person [4, 14, 15, 22, 61]. Most of these involve indoor domains for purposes of gesture recognition [15] and user interfacing [61]. Both [14] and [22] use pan-tilt rigs to track a single individual indoors. Many motion recognition systems track a single person indoors for both gesture recognition and whole body motion recognition [4, 15, 60]. In [4] a single individual is tracked indoors for the purpose of motion recognition.

Tracking groups and their interactions over a wide area has been addressed to a limited extent. Maurin et al. [36] used optical flow to track crowd movements, both day and night, around a sports arena. Haritaoglu et al. [28] tracked groups as well as individuals by developing different models of pedestrian actions. They attempted to identify individuals among groups by segmenting the heads of people in the group "blob." In addition, Beymer and Konolige [5], as well as McKenna et al. [37], have developed methods that track multiple individuals simultaneously.

Fig. 1 Sample images of human activities taking place in outdoor scenes


However, for many applications other than surveillance and tracking, a high resolution view of the individual is necessary to resolve the activity being engaged in. Human–computer interfacing requires accurate measurement of the position and pose of the hands as well as other body parts [15]. The same is true for sign language interpretation [52]. Articulated motion recognition requires a detailed view of the location of all parts of the person [4, 8, 10, 11, 34, 49, 50]. As a result, a system that could provide these measurements at many levels of resolution in multiple environments would be desirable.

2.2 Multiple Viewpoint Approaches

Another issue that plays a fundamental role in real-world surveillance and motion recognition is that of optimal camera placement for the purpose of maximizing the observability of the motions taking place. Proper camera placement for the purpose of optimizing the sensor's ability to capture information about a desired environment or task has been studied considerably. In [46], O'Rourke provides an in-depth theoretical analysis of the problem of maximizing camera coverage of an area, where the camera fields of view do not overlap (the so-called "art gallery" problem). In recent years, research has been done to extend the framework to include limited field of view cameras and to incorporate resolution metrics into the formulation. Fleishman et al. [24] further refined the art gallery framework by introducing a resolution quality metric. In addition, Isler et al. [31] extended the formulation of the minimum guard coverage art gallery problem to incorporate minimum-set cover. They derived reduced upper bounds for two cases of exterior visibility in two and three dimensions.

The method proposed here differs from the art gallery framework in several important ways. The focus of this work is on task observability and on capturing target motions, while the art gallery framework tries to optimize coverage area, which corresponds to minimizing overlapping views. The approach presented in this paper does not optimize for minimum view overlap; in fact, to accomplish these tasks, overlapping camera views are often necessary. However, the work that is done can be related to the notion of "strong visibility of guard patrol" as described by O'Rourke. In some ways, this is the dual problem: moving cameras and fixed points of interest to observe, whereas in this paper (largely) fixed cameras with moving subjects are assumed. Moreover, the art gallery framework makes many fundamental assumptions that differ considerably from this framework. It assumes that the subject(s) of observation (in O'Rourke's case, the gallery walls) is known a priori. It generally also assumes omni-directional cameras and that all non-occluded camera placements are equally good. These assumptions are fundamental to the art gallery problem, and affect the approach so fundamentally that they make it a different problem.

In the field of robotics, vision sensor planning has been studied to aid in task planning and visual servoing tasks. In [1] Abrams et al. develop a system to perform dynamic sensor planning for a camera mounted on a moving robotic arm in order to compute optimal viewpoints for a pre-planned robotic grasping task. Nelson and Khosla [41] introduce a modified manipulability measure in order to reduce the constraints on the tracking region of eye-in-hand systems while avoiding singularities


and joint limits. They also studied dynamical sensor placement within this context, and introduced the concept of the resolvability ellipsoid to direct camera motion in real-time in order to maintain servoing accuracy [42, 43]. Sharma and Hutchinson [51] also introduce a quantitative sensory measure, perceptibility, in order to improve positioning and control of manipulator systems.

Visual servoing uses tracking, estimation, and control methods to actively move a camera to adapt to changing conditions in the scene. Most often, visual servoing is used to maintain a moving subject of interest in the camera's field of view. Many examples of such active-camera systems have been developed in the literature [13, 14, 16, 18, 32, 40, 41, 59]. Stillman et al. [55] developed a multi-camera face tracking system for indoor applications. This system could control two pan/tilt/zoom cameras using four computers communicating over TCP/IP. Daniilidis et al. [16] provide a very comprehensive study of the problems that human vision systems solve when servoing. They go on to test a stereo pan/tilt system with a variety of algorithms for visual servoing. Murray and Basu [40] demonstrate a system that uses a single pan/tilt camera to track a moving object using an edge flow metric. Another approach uses an image mosaicing technique to track moving objects with a single pan/tilt camera indoors [32]. Collins et al. [13] developed a system that tracked a moving figure using pan/tilt cameras alone. This system used a kernel-based tracking approach to overcome the apparent motion of the background as the camera moved. Zhou et al. [63] combine this system with a fixed camera for the purpose of surveillance applications. Mittal and Huttenlocher [38] develop a system for objects moving in an area that is observed by a moving pan/tilt/zoom camera. Their system uses mixture models to construct a background mosaic. Piexoto et al. [47] present a multiple camera system for tracking indoor human activities. This system combines a fixed wide-angle camera with a stereo pan/tilt camera system. An optical flow approach is used to segment the wide-angle view and provide tracking subjects for the stereo pair. A similar stereo pan/tilt/vergence system is described in [64]. Lyons [33] describes a discrete event model based system to control zooming. Similarly, Tordoff and Murray [58] use a Kalman filter model to actively control zoom in order to maintain bounded measurement error of the subject.

In the work in this paper, a dual-camera system is described that is intended to accomplish the activity recognition task at multiple levels of resolution. The system is comprised of a wide-angle, fixed field of view camera coupled with a computer-controlled pan/tilt/zoom-lens camera to make detailed measurements of people for applications that require high-resolution measurements. This method uses measurements taken from the image of the wide-angle lens camera to compute the positioning control instructions for the pan/tilt/zoom camera. This approach differs from classical visual servoing methods mentioned above, where the pan/tilt/zoom camera would itself be used to control its position. Some advantages to this approach are: (1) lower bandwidth is required to maintain observability, since the object of interest moves across the image more slowly in the wide-angle view (the observability problem is eased), and (2) simpler foreground segmentation methods than those necessitated by a moving camera (optical flow, motion energy, etc.) can be applied, since the wide-angle camera is fixed relative to the background. However, this method puts the burden on the calibration step, in that accurate camera calibration and cross-calibration are required. Another advantage of this approach is that the entire scene remains observable by the wide-angle camera at all times, allowing


detection of higher-priority events if they happen, and more sophisticated planning of tracking schedules.

In [56], Tarabanis et al. present a planning method to determine optimal camera placement given task-specific observational requirements such as field of view, visibility, and depth of field. In addition, Yao and Allen [62] formulate the problem of sensor placement to satisfy feature detectability constraints as an unconstrained optimization problem, and apply tree-annealing to compute optimal camera viewpoints in the presence of noise. Olague and Mohr [44] consider the problem of optimal camera placement for 3D measurement accuracy of parts located at the center of view of several cameras. They demonstrate good results in simulation for known static objects. In [12], Chen and Davis develop a resolution metric for camera placement considering occlusions. In addition, Denzler et al. [19] develop a Kalman filter based approach for selecting optimal intrinsic camera parameters for tracking applications. They demonstrate results for actively adapting focal length while tracking a rigid object. The method discussed in the following sections differs from these because it considers the joint observability of a set of tasks. In addition, this method considers task uncertainty: the locations of the tasks that we are attempting to view are not known a priori, and change with time as the subjects move through the scene.

Another important set of multi-camera systems are those that incorporate only fixed cameras. These systems generally focus on the problem of tracking subjects across cameras and finding correspondences of subjects between cameras. Two good examples of this work appear in [59] and [54]. These methods do not consider observability, and camera placement is generally ad-hoc. It is quite possible that this set of methods could benefit from the camera placement theory introduced in this paper.

3 Single Viewpoint Scene Monitoring

3.1 Overview

This research involved human activity recognition and tracking based on low resolution measurements of position and velocity. For the purposes of this work, "activity" was defined as a set of human actions over a period of time. Multiple outdoor experiments were conducted. The initial set of these was run on 320 by 240 pixel-resolution images on a computer with a Pentium™ II 450 MHz single processor and 128MB of RAM. In addition, the computer incorporated a Matrox™ Genesis board for video capture. All images were taken using a Sony™ Digital8 video camera. Later experiments were run using the hardware configuration described below: image processing of the 720 by 480 pixel-resolution images was done on a 2.66 GHz Pentium™ IV computer with 1GB of RAM, and video was captured using a Hauppauge™ WinTV-GO capture card.

3.2 Low Resolution Human Activity Monitoring

This process was developed to track pedestrians and detect the occurrence of certain pedestrian activities. The measurements taken are low resolution because a single


camera was used to observe the entire scene and all the pedestrians within it from a significant distance. The process is comprised of the following steps:

1. Camera calibration
2. Video capture
3. Processing individual frames
   (a) Foreground segmentation
   (b) Foreground region clustering and labeling
   (c) Tracking each foreground region (pedestrian)
4. Low resolution analysis of human behavior
   (a) Signals issued to operator if necessary

3.2.1 Calibration and Foreground Segmentation

The camera was first calibrated to determine its intrinsic and extrinsic parameters. The method of calibration was that of Masoud described in [35]. The calibration involves selecting parallel lines and other features in the ground plane of the image. (This calibration method assumes that the subject being tracked moves on a plane.) The result of the calibration is the homography transformation matrix between image coordinates and world coordinates for the camera, as well as the intrinsic and extrinsic (position and orientation) parameters of the camera.

Digital video was captured at 30 frames per second. The video was then separated into sequences of single frames, and all analysis was done on the individual frames. Each frame was processed to segment the moving (foreground) objects from the static (background) objects in the image. Robust image segmentation is an active area of research, and many methods for solving this problem have been proposed. The Mixture of Gaussians (MoG) approach [2] was chosen for this process. Several other background segmentation methods were tried [21, 36, 53], but it was found that the MoG method provided the best overall performance with the least computational burden. (Foreground-based methods such as [60] were not attempted because of the need for large training sets.)

In pedestrian traffic monitoring sequences of this type, most of the image corresponds to background, and the foreground regions appear as a set of pixel clusters scattered around the image. Under optimal circumstances, each cluster corresponds to a single pedestrian (Fig. 2). For each foreground object, the following attributes are calculated per frame: width, height, aspect ratio, area, bounding box, and centroid position. In addition, the centroid velocities of the foreground object in x and y are computed over multiple frames. These measures are "smoothed" over time through the use of a Kalman filter. The attributes for each foreground object were used in the subsequent analysis steps.
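The per-frame segmentation and blob measurement just described can be summarized in a short sketch. This is a minimal illustration rather than the authors' implementation: OpenCV's MOG2 background subtractor stands in for the Mixture of Gaussians model of [2], and the video filename, blob-size threshold, and morphological clean-up step are assumptions.

    import cv2
    import numpy as np

    # Stand-in for the Mixture of Gaussians background model of [2].
    mog = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                             detectShadows=False)
    cap = cv2.VideoCapture("courtyard.avi")   # hypothetical input sequence

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        fg = mog.apply(frame)                                   # foreground mask
        fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))

        # Cluster foreground pixels into labeled blobs and measure each one.
        n, labels, stats, centroids = cv2.connectedComponentsWithStats(fg)
        for i in range(1, n):                                   # label 0 = background
            x, y, w, h, area = stats[i]
            if area < 50:                                       # assumed blob-size threshold
                continue
            cx, cy = centroids[i]
            aspect = w / float(h)
            # width, height, aspect ratio, area, bounding box, and centroid are
            # the per-frame attributes listed above; centroid velocity would be
            # obtained by differencing (or Kalman-filtering) (cx, cy) over frames.
            print(i, w, h, aspect, area, (x, y, w, h), (cx, cy))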

Fig. 2 Surveillance images and associated foreground objects

3.2.2 Tracking Pedestrians and Capturing Pedestrian Images

The goal of this stage is to segment and extract the image of each pedestrian from all appearances in the image sequence. This "pedestrian image sequence" data can then be used in the later stages of the system to provide information to motion recognition algorithms to classify the pedestrian's motion. The following steps are applied to accomplish this goal:

1. A stable bounding box is established around each pedestrian and tracked smoothly throughout the video sequence (Fig. 3).
2. The image of the pedestrian within the bounding box is captured.
3. The individual images are combined into movie files (Fig. 4).

The bounding box is calculated to surround the pedestrian blob in the image, with the same aspect ratio as the blob itself. This module can thus track a pedestrian's motion. Figure 4 shows some example image sequences generated. In addition to these low resolution measurements, high resolution images can also be taken using camera #2, as described in a later section.

Fig. 3 Pedestrian tracked across two frames of an image sequence

3.2.3 Pedestrian Activity Recognition Based on Pedestrian Position and Velocity

This component estimates the pedestrian motion based on the speed and position of the pedestrian. The basic assumption is that much of the pedestrian's activities can be ascertained by measuring these simple features.


Fig. 4 Samples of two sequences of pedestrian images

Measuring these values provides two advantages over articulated motion analysis: (1) the measurements can be made in real-time, and (2) they are relatively robust to poor image quality. The process had several components:

1. Track each pedestrian throughout the scene using the Kalman filter estimates.
2. Record the position and velocity state.
3. Develop a position and velocity path characteristic for each pedestrian. This is done using the Kalman filter predictions of the future state.
4. Issue a signal to the user under the following conditions:
   (a) Pedestrian moves into an area of interest.
   (b) Pedestrian moves above a normal walking speed.
   (c) Pedestrian loiters in the area for a long period of time.
   (d) Pedestrian falls down.

The detected pedestrian activities were categorized based on an activity prioritization (Table 1). This prioritization was used to determine the level of significance of the activity, whether to generate a signal to a human operator, or, for simultaneous activities, which activity to focus on in the case of the dual-camera system. While the measurements of position and velocity are taken from the image (and thus in image coordinates), the classification is based on the position and velocity of each pedestrian in the real world. This transition from image to world coordinates is accomplished by transforming all points measured in the image by multiplying them by the homography matrix determined in the camera calibration step mentioned above:

P_W = H P_I,    (1)

where P_W is the point in the world coordinate system (on the plane of the ground), P_I is the point in the image plane, and H is the homography matrix that converts between the two coordinate systems.
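A minimal sketch of how Eq. 1 might be applied in practice is given below. The homography H and the example pixel coordinates are placeholders, the 30 fps rate matches the capture rate given above, and the 2.25 m/s threshold is the walking-speed limit reported in the experiments of Section 3.3.

    import numpy as np

    def image_to_world(H, p_image):
        """Apply Eq. 1, P_W = H * P_I, using homogeneous coordinates."""
        p = np.array([p_image[0], p_image[1], 1.0])
        P = H @ p
        return P[:2] / P[2]                      # ground-plane (X, Y) in world units

    # Example: world-space speed of a tracked centroid between two frames.
    H = np.eye(3)                                # placeholder calibration homography
    p_prev, p_curr = (312, 201), (318, 203)      # placeholder image centroids
    w_prev, w_curr = image_to_world(H, p_prev), image_to_world(H, p_curr)
    speed = np.linalg.norm(w_curr - w_prev) * 30.0   # meters per second at 30 fps
    if speed > 2.25:                             # walking-speed threshold (Section 3.3)
        print("signal: pedestrian moving above normal walking speed")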

Table 1 Activity prioritization (higher-numbered activities are of higher priority)

  Activity                            Priority level
  Walking                             0
  Stopped                             1
  Running                             2
  Loitering                           3
  Moving into an area of interest     4
  Falling                             5
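In code, the prioritization of Table 1 amounts to a simple lookup. The sketch below shows one way the table might be used to choose which of several simultaneous activities the dual-camera system should focus on; the dictionary mirrors the table, while the key names and tie-breaking behavior are assumptions.

    # Priority levels from Table 1 (higher number = higher priority).
    PRIORITY = {"walking": 0, "stopped": 1, "running": 2,
                "loitering": 3, "area_of_interest": 4, "falling": 5}

    def focus_target(detections):
        """Pick the (pedestrian_id, activity) pair with the highest priority.

        Ties are broken arbitrarily by max(); the paper does not specify a rule.
        """
        return max(detections, key=lambda d: PRIORITY[d[1]])

    # focus_target([(1, "walking"), (2, "falling")]) -> (2, "falling")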


For the purposes of testing the method, a section of each courtyard was assigned to be an area of interest, and the software was programmed to generate a signal if any pedestrians entered that region. In addition, the software calculated the speed of each pedestrian and generated a signal if any pedestrian exceeded the set speed threshold for walking (experiments indicated a top walking speed of 2.25 m/s). In each of these cases, a pedestrian motion image was captured to provide information for further study (Fig. 5).

3.3 Experimental Results

3.3.1 Low Resolution Activity Recognition

The systems were tested in several outdoor courtyards where there was a continuous flow of pedestrian traffic. The figures below show some of the images taken during these tests of the low resolution system, with the pedestrian motion paths in the image space superimposed on them. In each case, the right figure shows a map of the pedestrian paths and motion classifications in world coordinates.

3.3.1.1 First courtyard

• Sequence 1 This sequence tracked two pedestrians crossing the courtyard in different directions. Pedestrian #2 purposefully tripped and fell down during the sequence, spending 9.4 s fallen (as indicated by the dark areas in Fig. 6 and in Fig. 7). In addition, pedestrian #2 later entered the area of interest for a total time of 4.6 s. The motion path of pedestrian #2 indicates a large divergence midway through the image. The pedestrian did not actually take this path; this error results from a very sudden change in the outdoor lighting conditions, which caused a problem for the background segmentation method we used.

Fig. 5 Snapshot of the system in operation. Pedestrian #2 has entered an area of interest and generated a signal

Fig. 6 Tracked pedestrian image and map of motion

Fig. 7 Surveillance image of pedestrian #2 fallen down

Fig. 8 Tracked pedestrian image and map of motion

Fig. 9 Tracked pedestrian image and map of motion


• Sequence 2 This sequence tracked three pedestrians crossing the courtyard. Pedestrian #3 walked through the area of interest at the end of the sequence (Fig. 8). In this test, pedestrian #3 spent 5.6 s in the area of interest.
• Sequence 3 This sequence tracked two pedestrians crossing the courtyard. Pedestrian #3 was on a bicycle (Fig. 9). The dark area in the above figure shows pedestrian #4 loitering in the courtyard for 22.3 s. The motion path for pedestrian #3 shows the pedestrian moving rapidly throughout the sequence.

Each pedestrian's velocity was calculated and used as part of the activity recognition. One limitation of our system is that since we do not directly observe the motion of the pedestrian through articulated motion analysis, our system does not distinguish between pedestrians moving at the same speed through different means, such as a bicyclist and a runner.

3.3.1.2 Second courtyard

• Sequence 1 This sequence tracked a pedestrian walking across the courtyard (Fig. 10).
• Sequence 2 This sequence tracked another pedestrian walking across the courtyard (Fig. 11).
• Sequence 3 This sequence tracked a pedestrian running through a part of the courtyard (Fig. 12).
• Sequence 4 This sequence tracked a pedestrian walking and running through the area of interest at the top of the courtyard image (Fig. 13).
• Sequence 5 This sequence tracked a pedestrian loitering for an extended period of time (Fig. 14).

Fig. 10 Sequence 1 in the second courtyard. Image of tracked pedestrian and map of pedestrian motion and velocity

Fig. 11 Sequence 2 in the second courtyard. Image of tracked pedestrian and map of pedestrian motion and velocity

Fig. 12 Sequence 3 in the second courtyard. Image of tracked pedestrian and map of pedestrian motion and velocity

Fig. 13 Sequence 4 in the second courtyard. Image of tracked pedestrian and map of pedestrian motion and velocity


Fig. 14 Sequence 5 in the second courtyard. Image of tracked pedestrian and map of pedestrian motion and velocity

• Sequence 6 This sequence tracked a pedestrian who walked up to the bench and then lay down. Note that the system classified the pedestrian as loitering once he or she had remained on the bench for an extended period (Fig. 15).

Fig. 15 Sequence 6 in the second courtyard. Image of tracked pedestrian and map of pedestrian motion and velocity

3.4 Limitations

The system described above provides a means to monitor an outdoor area, classify pedestrian activities into basic categories, and detect certain events of interest. The ability of the system to accurately measure the position and velocity of many simultaneous subjects has been demonstrated, along with the classification of their activities in real-time. The system is able to detect certain behaviors such as loitering and trespassing, as well as pedestrians who have fallen.

While such a system is able to do basic activity classification, it is limited in its effectiveness because of the simplicity of the approach. Many of the tasks that constitute effective monitoring, such as articulated motion recognition, gait recognition, occlusion handling, etc., cannot be achieved, or achieved robustly, from a single viewpoint.


Fig. 16 Classification accuracy vs. angle for classifier [34]

This is true for both humans [30, 57] and machine perception systems (Fig. 16). As a result, multiple-camera systems are generally needed in order to robustly accomplish activity recognition in these settings.

4 Camera Placement

4.1 Overview

The huge amount of information that a multiple camera surveillance system collects can be overwhelming for the operators who have to use the system, which has motivated the idea of automating both the data collection and the data processing. However, this amount of information requires an enormous amount of computational power to be processed properly. If the data processing can be simplified by placing the cameras in such a way that they capture "good" data, the time necessary for processing the data can be drastically reduced. The motivation for this part of the work is to simplify the image processing by calculating a good camera position.

The algorithm for camera placement developed in this work is based on [6] and focuses on optimizing a set of target motion paths in a dynamic and unpredictable environment, in which the subjects change over the course of time. The considered paths are not known a priori. The method tries to optimize the set of paths as a whole, without attempting to optimize every individual path. The method does not reconstruct surfaces or objects and does not use internal models of the subjects for reference. These are the main differences between this method and previous algorithms.

The data collection of the different motion paths is achieved by using a modified pedestrian tracker as described earlier in this work. After the data collection, the method of Masoud et al. [35] is used to transform the image data into a world coordinate frame. Lines are fit to these trajectories via a linear least-squares approximation.


From these trajectory approximations, the optimal camera placement is computed using the following approach. First, each path is parameterized as a vector \vec{s}_j defined as

\vec{s}_j = [\phi_j \; x_j \; y_j \; l_j]^T

where \phi_j is the orientation of path j, (x_j, y_j) are the coordinates of the path's center, and l_j is the length of the path. Two constraints are then introduced in order to define mathematically optimal positions for the cameras:

• A distance constraint, and
• A foreshortening constraint.

The distance constraint is used to place cameras neither too far from, nor too close to, the path. A camera placed too far away can miss certain details of the tracked subject; placed too close, it can certainly capture details, but it does not cover the complete path. A minimum distance d_0 is therefore introduced; this is the distance from which the camera can see the entire path. This minimum distance is computed using the similar-triangle relationship

d_0 = \frac{l_j f}{w}    (2)

where l_j is the length of the path, f the focal length of the camera, and w the width of the camera's image sensor. A second factor that influences the observability is the angle between the path and the view direction of the camera; this angle can be seen in Fig. 17. Taking both of these factors into account, an objective function G_{ij} for each camera-path pair is defined as follows:

G_{ij} = \begin{cases} 0 & \text{if } d_{ij} < d_0 \\ \dfrac{d_0^2}{d_{ij}^2} \cos\theta_{ij} \cos\phi_{ij} \cos\alpha_{ij} \cos\beta_{ij} & \text{otherwise} \end{cases}    (3)
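A direct transcription of Eqs. 2 and 3 is sketched below; the angles are assumed to be supplied by the geometric relations of Eqs. 7-11, and consistent units are assumed throughout.

    import numpy as np

    def g_ij(d_ij, theta, phi, alpha, beta, l_j, f, w):
        """Objective for one camera-path pair (Eqs. 2 and 3).

        d_ij: camera-to-path distance; theta, phi, alpha, beta: the angles of
        Figs. 17 and 18, in radians; l_j: path length; f: focal length; w:
        width of the image sensor.
        """
        d0 = l_j * f / w                         # Eq. 2: minimum viewing distance
        if d_ij < d0:
            return 0.0                           # the whole path is not visible
        return (d0 ** 2 / d_ij ** 2) * (np.cos(theta) * np.cos(phi)
                                        * np.cos(alpha) * np.cos(beta))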

Fig. 17 Description of the points used in Eqs. 7 and 8. This also shows the definition of the angles θ and φ


Fig. 18 Description of the points used in Eqs. 9, 10 and 11. This also shows the definition of the angles α, β, and γx

The indices i and j refer to the camera and the path, respectively. The angles θ, φ, α, and β are defined in Figs. 17 and 18. The cost function to be optimized for one camera i is then

V_i = \sum_{j=1}^{m} G_{ij}    (4)

where m is the number of paths and G_{ij} is defined by Eq. 3.

When using multiple camera systems, the formulation becomes slightly more complex. The cameras have to be placed jointly in order to achieve an optimal solution. To ensure that a single optimum is found, it would be necessary to search over all camera parameters at the same time, which is computationally very expensive. The number of computation steps is on the order of (km)^n, where m is the number of paths, k the number of parameters searched per camera, and n the number of cameras. For systems with many cameras, this computation quickly becomes infeasible. As a result, an iterative approach was chosen, which has only linear complexity with respect to the number of cameras, O(kmn). This method is much faster than the previous approach. The cost function for one camera is different from the previous one and is defined as follows:

V_i = \sum_{j=1}^{m} G_{ij} \prod_{k=1}^{i-1} \left( 1 - G_{kj} \right)    (5)

The inversion of the observability values of the previously placed cameras, (1 - G_{kj}), enables the method to "push" the next camera into locations that have the lowest observability. When using many cameras, the sum of the cost functions for each individual camera gives the final total cost function to optimize:

V = \sum_{i=1}^{n} V_i.    (6)
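The iterative placement of Eqs. 5 and 6 can be sketched as a greedy loop over a discretized set of candidate viewpoints. The observability() helper is assumed to evaluate Eq. 3 for one camera-path pair (for instance by combining the g_ij sketch above with the angle relations that follow); everything else here is a minimal illustration, not the authors' implementation.

    def place_cameras(n_cameras, candidates, paths, observability):
        """Greedy multi-camera placement (Eqs. 5 and 6).

        candidates: sequence of candidate viewpoints u = (X, Y, gamma_z, ...);
        observability(u, path): assumed helper returning G in [0, 1] for one
        camera-path pair (Eq. 3).
        """
        placed, total_v = [], 0.0
        # residual[j] holds the product over already-placed cameras k of
        # (1 - G_kj), i.e. how poorly path j is observed so far.
        residual = [1.0] * len(paths)

        for _ in range(n_cameras):
            best_u, best_v, best_g = None, -1.0, None
            for u in candidates:
                g = [observability(u, p) for p in paths]
                v = sum(gj * rj for gj, rj in zip(g, residual))     # Eq. 5
                if v > best_v:
                    best_u, best_v, best_g = u, v, g
            placed.append(best_u)
            total_v += best_v                                       # contributes to Eq. 6
            residual = [rj * (1.0 - gj) for rj, gj in zip(residual, best_g)]

        return placed, total_v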


All the variables in the objective function G_{ij} depend on six extrinsic and one intrinsic camera parameter. These can be stacked into a "generalized" viewpoint, which is described in a 7-dimensional space:

\vec{u}_i = [X_i \; Y_i \; Z_i \; \gamma_{x_i} \; \gamma_{y_i} \; \gamma_{z_i} \; f]^T

where the first six parameters are the extrinsic parameters of the camera (position and orientation), and f is the focal length.

4.2 Assumptions and Simplifications

As can be seen, this approach has many dimensions and is computationally expensive; to combat this, several simplifications are made. Firstly, the roll (rotation around the y axis) is assumed to have no effect on the observability. Furthermore, the system is considered to have an overhead view of the scene (it is mounted on rooftops or on ceilings), which removes the height (Z) degree of freedom. As the height is fixed, the tilt (\gamma_x) also becomes fixed, since both of these parameters are strongly related. The final assumption considers all cameras to have the same constant focal length f. This assumption makes it possible to compute the optimal position of each camera independently of the order in which the cameras are placed. This independence comes from the fact that, for each camera-path pair, the objective function G_{ij} (Eq. 3) uses the same minimal distance d_0 (Eq. 2), because the focal length f is constant. When all these assumptions are taken into account, the viewpoint is reduced to the following:

\vec{u}_i = [X_i \; Y_i \; \gamma_{z_i}]^T.

With this reduced vector, only \theta_{ij} and \phi_{ij} remain relevant for further optimization purposes. In order to run the optimization on the viewpoints, \theta_{ij} and \phi_{ij} have to be expressed in terms of X, Y, and \gamma_z. This is achieved using basic geometry. The resulting equations are the following:

\cos\theta = \frac{\vec{PC} \cdot \vec{PS'}}{\|\vec{PC}\| \, \|\vec{PS'}\|}    (7)

\cos\phi = \frac{\vec{CP} \cdot \vec{n}}{\|\vec{CP}\| \, \|\vec{n}\|}.    (8)

The points P, C, and S are represented in Fig. 17. P is the center of the path, C is the center of the image plane, and S' is the transformation of S, the start of the path, through a rotation of \pi/2 about P (S' = R_{\pi/2}(S)). The normal to the image plane is the vector \vec{n}. This vector and the aforementioned points are considered to lie on the ground plane; this assumption can be made since pedestrians, and not flying objects, are the target of observation (Z = 0).
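For the reduced viewpoint, Eqs. 7 and 8 can be evaluated directly from the camera position (X, Y), its pan γ_z, and the path's center and start points. The sketch below follows the geometric description above; representing the image-plane normal as the in-plane unit vector (cos γ_z, sin γ_z) and clamping negative cosines to zero are assumptions.

    import numpy as np

    def rot90_about(point, center):
        """S' of Eq. 7: rotate a point by pi/2 about the path center."""
        d = np.asarray(point, float) - np.asarray(center, float)
        return np.asarray(center, float) + np.array([-d[1], d[0]])

    def view_angles(camera_xy, gamma_z, path_center, path_start):
        """cos(theta) and cos(phi) of Eqs. 7 and 8 for one camera-path pair.

        All points lie in the ground plane (Z = 0), as assumed in the text.
        """
        P = np.asarray(path_center, float)
        C = np.asarray(camera_xy, float)
        S_rot = rot90_about(path_start, P)                # S'
        PC, PS = C - P, S_rot - P
        n = np.array([np.cos(gamma_z), np.sin(gamma_z)])  # viewing direction (assumed)
        cos_theta = PC @ PS / (np.linalg.norm(PC) * np.linalg.norm(PS))   # Eq. 7
        cos_phi = (P - C) @ n / np.linalg.norm(P - C)                     # Eq. 8
        return max(cos_theta, 0.0), max(cos_phi, 0.0)     # clamp: an assumption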


4.3 Experimental Results

Simulation results for this approach can be found in [6] and [7], as well as results using an ATRV-Jr robot. Another series of tests was devised using a similar overhead fixed camera to observe the scene, but with the addition of a set of small remotely controlled eROSI robots to act as the cameras to be repositioned around the scene. The setup for these experiments consisted of a tripod-mounted Firewire camera, which captured video at 320 by 240 pixel resolution, connected to a laptop running the pedestrian tracking and camera optimization software. The laptop was connected to each of the eROSI robots by means of a wireless Bluetooth serial link. The eROSI robots support remote motion commands via this Bluetooth link, as well as sensor feedback.

The camera used in the experiments was used both to track pedestrians through the scene and to determine the position and orientation of each eROSI in the scene so that accurate relocation was possible. This was done by placing a two-colored construction paper marker on top of each eROSI. The two colors allowed the eROSI-finding algorithm to detect not only the position of the eROSIs but also their orientation. The colors were placed front to back, one for the direction in which the robot was facing and the other opposite that. The positions of the robots were detected by segmenting the colors and then clustering them into groups based on spatial constraints.

For the tests involving robots, the search range of the optimizer was limited to areas which were feasible for the robot to enter. For instance, the algorithm did not search within walls or furniture for optimal camera positions, because the robots would have no way of positioning themselves in these areas. The feasible range was input into the program by means of a graphical user interface presented to the user before the program is run.

In Fig. 19b, the centroids of the tracked individuals from one such test are displayed over a single frame of the video. In this test, two volunteers walked in an indoor setting and two eROSIs were moved into the optimal positions within the room. Figure 19a and c contain images detailing the starting and ending locations, respectively, of the eROSI robots. The robots were marked with orange markers to indicate the front of the robot and blue markers to indicate the rear. The lines that are present in the figures give an idea of the approximate fields of view of the robots.

A second series of experiments was done in which the track/optimize/move section of the code was placed in a loop. This allowed the algorithm to adapt to changing motion paths in the scene, since it was fed new motion paths at even intervals. The robots are then assured of always viewing the current paths in an optimal manner. In this demonstration of the looping code, one set of paths and a single eROSI robot are used. The subject in this experiment paced along the far side of the room; then, after the initial optimization and positioning, changed to pace along the side of the room. Figure 20 shows the centroid of the tracked person in each of these two path configurations. Figure 20 also shows how the eROSI moves from its initial starting location to each of the two optimal locations. The eROSI used for this experiment had an orange and blue marker attached to it to indicate direction, orange for the front and blue for the rear.


Fig. 19 Demonstration of the results for a set of two paths. a Initial robot positions. b Trajectories collected. c Final robot positions observing the paths



Fig. 20 Demonstration of results for a change in path pattern over time. a First set of trajectories. b Second set of trajectories. c Change of robot position from initial location. d Change of robot position to view the second set of paths


In summary, the software that was developed for these experiments consists of components that carry out the following major steps (a short sketch of this loop follows the list):

1. Track pedestrians across a scene and gather motion paths.
2. Compute the optimal placement of N cameras.
3. Command the eROSIs to move to these N locations.
4. (optional) Repeat the algorithm.
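A minimal sketch of the looping version of these steps is given below. track_paths(), optimal_viewpoints(), and send_goal() are hypothetical stand-ins for the pedestrian tracker, the placement optimizer of this section, and the Bluetooth motion command sent to an eROSI; none of these names come from the paper.

    def monitoring_loop(n_robots, track_paths, optimal_viewpoints, send_goal,
                        interval_s=60.0, repeat=True):
        """Track -> optimize -> move loop (steps 1-4 above).

        The three callables are hypothetical stand-ins: a tracker that gathers
        motion paths for interval_s seconds, the camera-placement optimizer,
        and the command link that drives one eROSI robot to (x, y, pan).
        """
        while True:
            paths = track_paths(interval_s)                   # step 1
            viewpoints = optimal_viewpoints(paths, n_robots)  # step 2
            for robot_id, (x, y, pan) in enumerate(viewpoints):
                send_goal(robot_id, x, y, pan)                # step 3
            if not repeat:                                    # step 4 (optional)
                break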

4.4 Limitations and Experimental Difficulties

During the experiments, several factors were found to affect the results. Two of the most challenging problems were lighting changes and color-tracking errors. Despite the robustness of the Mixture of Gaussians method, occasional sudden changes in lighting or reflective surfaces would greatly influence the segmentation. In these cases, a subject would be mistracked and spurious points would be introduced into the data sets. The spurious points would lead to a least-squares linear approximation of the motion curve that was entirely useless when it came to optimization. This problem was tackled to some extent by introducing a blob-size threshold for the tracking algorithm. However, the appropriate threshold had to be determined experimentally for each trial.

The other problem was the localization of the eROSI robots. The lighting changes described earlier would also create difficulties in detecting certain colors in the images. This would lead to inaccurate current-location estimates, which in turn would result in incorrect final placement. The most dramatic example of this during testing occurred when two of the robots collided.


5 Camera Placement in 3-D

5.1 Overview

In an attempt to generalize the results, several of the assumptions of the previous method were removed. The height Z and the tilt angle \gamma_x have been reintroduced into the equations, and in a second step the impact of introducing the focal length f has been investigated. When Z and \gamma_x are no longer fixed, they affect the objective function G_{ij} through the angles \alpha and \beta in Eq. 3. Using simple geometry in a right triangle and the points described in Fig. 18, these angles can be found relative to the parameters Z and \gamma_x.

Fig. 21 Objective surface for the extended method. In both (a) and (b), the lines represent the path and the optimal position is represented by a dot. a First set of test paths. b Second set of test paths



P is the center of the path, C is the center of the image plane, and C_p is the projection of C on the ground:

\cos\alpha = \cos(\beta - \gamma_x) = \cos\beta \cos\gamma_x + \sin\beta \sin\gamma_x    (9)

\cos\beta = \frac{C_p P}{C P}    (10)

\sin\beta = \frac{C_p C}{C P}    (11)
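A short sketch of Eqs. 9-11 follows; it assumes the tilt γ_x is measured from the horizontal, with the camera center C at height Z directly above its ground projection C_p.

    import numpy as np

    def alpha_beta(camera_ground_xy, Z, gamma_x, path_center_xy):
        """Angles alpha and beta of Eqs. 9-11.

        camera_ground_xy: ground projection C_p of the camera center C;
        Z: camera height; gamma_x: tilt; path_center_xy: P.
        """
        d = np.linalg.norm(np.asarray(camera_ground_xy, float)
                           - np.asarray(path_center_xy, float))   # |C_p P|
        hyp = np.hypot(d, Z)                                       # |C P|
        cos_beta, sin_beta = d / hyp, Z / hyp                      # Eqs. 10 and 11
        beta = np.arctan2(Z, d)
        alpha = beta - gamma_x                                     # from Eq. 9
        return alpha, beta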

Fig. 22 These three figures show the positions in the optimal camera placement with different focal lengths for the first set of test paths. The positions of the different cameras are almost on an imaginary circle whose radius grows with the focal length. a Focal length = 0.008 mm, b focal length = 0.014 mm, c focal length = 0.020 mm


After the introduction of these parameters, the impact of the focal length was investigated. The cameras are still assumed to be homogeneous in order to retain the assumption that the cameras can be placed independently.

5.2 Simulation Results

The simulation with free parameters \alpha and \beta gives the expected results (Fig. 21). We find the same optimal positions as in the previous method, with an optimal height of Z = 0, since this value gives \cos\beta = 1. In future work, the objective function G_{ij} (Eq. 3) will have to be changed, for instance with new constraints, in order to deal with this result.


Fig. 23 These three figures show the positions in the optimal camera placement with different focal lengths for the second set of test paths. Even with this different set of paths, the cameras' optimal positions remain intuitive. a Focal length = 0.008 mm, b focal length = 0.014 mm, c focal length = 0.020 mm


The simulation for the focal length f places the cameras on circular shapes around the set of paths. The radius of this circular shape changes with the focal length f. This is because of the definition of the G_{ij} function (Eq. 3), which becomes largest when the cameras are placed at a distance d_0 from the paths. Thus, as the focal length changes, d_0 changes, and the corresponding positions remain on the circle of minimal distance (Figs. 22 and 23).

5.3 Computation in 3-D

During the work on removing assumptions, a new question arose: would there be a difference if the angles were calculated in 3-D instead of with the previous 2-D computation? In fact, when looking at the objective function G_{ij} (Eq. 3) and the four angles defined in this formula, it could be argued that the expression could be simplified to just two angles. Instead of using two projections, one from above and one from the side, in order to get four angles for the optimization process, the switch to 3-D space enables the computation of just two angles. Figure 24 illustrates the idea of this approach. Instead of using projections, the important angles are calculated directly in three-dimensional space. One new angle (\theta_{3D}) is the angle between the normal vector of the path and the vector of the dashed line that links the center of the path to the center of the image plane. The other angle (\phi_{3D}) is the angle between the normal of the image plane and the opposite of the vector used in the previous description.

Fig. 24 Angle calculations in 3-D space


When using the same coordinates as previously (P is the center of the path, C is the center of the image plane, and S is the start of the path), with \vec{n}_\theta the normal vector to the path and \vec{n}_\phi the normal vector to the image plane, the equations are the following:

\cos\theta_{3D} = \frac{\vec{PC} \cdot \vec{n}_\theta}{\|\vec{PC}\| \, \|\vec{n}_\theta\|}    (12)

\cos\phi_{3D} = \frac{\vec{CP} \cdot \vec{n}_\phi}{\|\vec{CP}\| \, \|\vec{n}_\phi\|}.    (13)

The normal vector of the path is defined by

\vec{n}_\theta = [n_{x\theta} \; n_{y\theta} \; n_{z\theta}]^T

and has to comply with the following three constraints:

• The length of the normal vector must be equal to half the length of the path:

  \|\vec{n}_\theta\| = \frac{1}{2} l_j = \|\vec{PS}\|.

• The angle between the ground plane and the normal vector must be \delta and satisfy the following equation:

  (n_{x\theta}^2 + n_{y\theta}^2) \sin^2\delta = n_{z\theta}^2 \cos^2\delta.

• The normal and the path must form a right angle:

  \vec{n}_\theta \cdot \vec{PS} = 0.

The normal vector to the image plane is defined by

\vec{n}_\phi = [\cos\gamma_x \cos\gamma_z \;\; \cos\gamma_x \sin\gamma_z \;\; \sin\gamma_x]^T.

The advantage of this method is the introduction of a parameter \delta that can be chosen arbitrarily. This enables the cameras to be positioned at an angle from the target that is completely task-specific. An angle \delta = 0 gives exactly the same result as the projection method, whereas \delta = \pi/2 makes an overhead position the best possible camera position (Fig. 25). The previous method was conceived in order to monitor the stride of people, and as such had to place cameras so that they could see people from the side. With this new method, either the side or the top view can be selected, or any other view in between that suits the intended observation purpose.
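The construction of the path normal for a chosen δ can be sketched directly from the three constraints above. Of the two admissible horizontal perpendiculars to the path, the sign chosen here is arbitrary (in practice one would take the one facing the candidate camera); that choice, and the convention that the path lies horizontally, are assumptions.

    import numpy as np

    def path_normal(P, S, delta):
        """n_theta satisfying the three constraints above (P, S are 3-D points)."""
        PS = np.asarray(S, float) - np.asarray(P, float)
        u = PS / np.linalg.norm(PS)                        # unit vector along the path
        h = np.array([-u[1], u[0], 0.0])                   # horizontal, perpendicular to PS
        n_unit = np.cos(delta) * h + np.sin(delta) * np.array([0.0, 0.0, 1.0])
        return np.linalg.norm(PS) * n_unit                 # length l_j / 2 = |PS|

    def angles_3d(C, P, S, delta, gamma_x, gamma_z):
        """cos(theta_3D) and cos(phi_3D) of Eqs. 12 and 13."""
        C, P = np.asarray(C, float), np.asarray(P, float)
        n_theta = path_normal(P, S, delta)
        n_phi = np.array([np.cos(gamma_x) * np.cos(gamma_z),
                          np.cos(gamma_x) * np.sin(gamma_z),
                          np.sin(gamma_x)])                # unit length by construction
        PC = C - P
        cos_t = PC @ n_theta / (np.linalg.norm(PC) * np.linalg.norm(n_theta))  # Eq. 12
        cos_p = (P - C) @ n_phi / np.linalg.norm(P - C)                        # Eq. 13
        return cos_t, cos_p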


Fig. 25 Description of the angle δ projected from the side. When δ = 0, we have the same configuration as in the earlier method (Fig. 18)

5.4 Experimental Results

The setup for the tests of the placement in three dimensions consisted of a Firewire camera connected to a laptop, just as in the experiments of the previous section. Two miniDV cameras were placed on tripods in order to simulate a group of two robots, each with the ability to change its view of the scene in both rotation (\gamma_z) and tilt (\gamma_x), as well as in the three spatial dimensions (X, Y, Z). The tracking was done as in the previous tests, using a Mixture of Gaussians background segmentation method combined with a Kalman filter. The points found were then transformed to the world coordinate system, and lines were fit to this data using a method of least squares. The optimization was then performed over the world space using the three-dimensional objective function, and a pair of optimal camera locations was found. The tripods were then moved by hand into the correct locations. Video was taken from each of the miniDV cameras before and after the movement in order to get a sense of the improvement in subject observability.

Fig. 26 Paths obtained in experiment. The red dots show the approximate locations of the cameras after optimization


Fig. 27 Objective function plots showing the linear approximations of the paths and the objective function values. The pink asterisk and the line show the camera location in X, Y, Z and the pan. a Placement for the first camera. b Placement for the second camera


The tripods provided for the experiment had a Z variation of 1.4 to 1.6 m, and a tilt variation of 0 to 1.57 rad. The focal lengths of the cameras were also used to determine d_0 (Eq. 2). The centroids of the subjects tracked by the software are shown in Fig. 26. Erroneous points are removed from the computations by the use of a minimum length constraint on the paths.


Fig. 28 Snapshots of the paths from the DV cameras before and after optimizing for view. a Initial view of the horizontal path. b Initial view of the vertical path. c Final view of the horizontal path. d Final view of the vertical path


Since it was known that humans were to be tracked in the experiment, the paths were given a fixed Z value of 1 m, approximately half the height of an average person. Figure 27 shows the results of the optimization. Figure 28 shows some snapshots from the initial and final views of the scene. Finally, an eye chart was carried during the test by the subjects being tracked. Figure 29 shows cropped images of the eye chart from both the initial and final positions of the tripods, and Table 2 shows and compares the number of recognizable characters present. The images in Fig. 29 have been scaled to a common size in order to facilitate better comparisons.

Fig. 29 Scaled images of the eye chart taken from the miniDV cameras before (a, c) and after (b, d) the cameras were moved. These images clearly show the improvement in the resolution of the observed paths

Table 2 Recognizable characters

  Path no.   Initial view   Optimized view   Total characters
  1          5              15               35
  2          6              17               35

5.5 Limitations and Experimental Difficulties

The three-dimensional experiments were affected by lighting changes in the same manner as the two-dimensional tests. These changes would lead to spurious points and, in some cases, loss of subject tracking. The second greatest obstacle with the three-dimensional approach is the computational complexity. Since this approach uses all the dimensions of the formulation, the amount of computational effort required to compute the optimal placements is much higher than that of the two-dimensional optimization strategy.

6 High Resolution Measurements with a Dual-Camera System

To further improve upon the results of the fixed-lens camera placement, a dual-camera system placed in each location was studied in order to incorporate pan, tilt, and zoom and thus further improve the observability of the subjects. A dual-camera system was developed that tracks subjects, measures position and velocity, and attempts to classify each individual's activity based on the tracking information. The system is comprised of a wide-angle, fixed field of view camera coupled with a computer-controlled pan/tilt/zoom-lens camera to take detailed measurements of people for applications that require high resolution measurements (Fig. 30). Since the cameras are mounted side by side, they effectively share the same view of the scene. However, the zoom feature of the second camera allows it to compensate for the distance to the scene and capture high resolution measurements of subjects. The method described uses measurements taken from the image of the wide-angle lens camera to compute the positioning control instructions for the pan/tilt/zoom (PTZ) camera. The entire scene remains observable by the wide-angle camera at all times, allowing detection of higher-priority events if they happen and the focusing of the pan/tilt/zoom camera onto that part of the scene.

Fig. 30 Dual cameras mounted side by side


Effectively, this system addresses the focus-of-attention problem by pointing the zoom camera at subjects whose activities are estimated to be of greatest interest. Once measurements are made, activity classification is done through tracking and analysis of the position and velocity of the subject in the scene.

The dual-camera system was developed by mounting two cameras side by side to view the same area. The first camera (camera #1) was a Panasonic™ GP-KR222 digital video camera, with a fixed field of view, 12 mm focal length Rainbow™ auto-iris lens. The second camera (camera #2) was a Sony™ EVI-D70 pan/tilt/zoom camera. This camera has an 18× optical zoom and a wide pan and tilt range (340° pan, 120° tilt). The pan/tilt/zoom control driver for camera #2 was run on a 1.6 GHz Pentium™ IV computer with 512MB of RAM. The two computers communicated over TCP/IP on a 100 Mbps switched network. This configuration was chosen so that the camera systems could be controlled remotely from a central computer.

6.1 Calibration

The cameras were individually calibrated for each scene to determine their intrinsic and extrinsic parameters. In addition, the cameras were cross-calibrated to create a common coordinate system. This allowed a point in the image of camera #1 to be translated to a point in the world coordinate system of camera #2. This transformation was crucial for sending control commands to camera #2. The calibration for each camera was done using the method of Masoud as described above. The result was a series of homography transformation matrices between the cameras, as well as the intrinsic and extrinsic (position and orientation) parameters of each camera. Davis and Chen [18] use a similar model to calibrate pan/tilt cameras as part of a surveillance network (Fig. 31).

6.2 Foreground Object Segmentation and Measurement

This dual-camera system can be used to make high resolution measurements of activities of many kinds. For the purposes of this work, it was decided to focus the outdoor experiments on the problem of pedestrian tracking and activity classification. An example image of this kind is shown in Fig. 32, along with the foreground object tracking. Images captured by camera #1 were processed as described above in Section 3.

Fig. 31 Multi-camera cross-calibration to common world plane coordinate system


Fig. 32 Image from camera #1 and corresponding foreground object detection

When the low resolution activity recognition system (Fig. 33) detected an activity of interest, camera #2 could be used to zoom in on the activity and capture high resolution images of the event. An outline of the logic for using the system is as follows:

Camera #1:
    Capture image
    Segment foreground objects
    For each foreground object
        Track the object across frames
        Make low resolution measurements of the object
        If (activity of interest is detected)
            Send pan, tilt, zoom commands to camera #2

Camera #2:
    If (receive new pan, tilt, zoom commands)
        Move camera to new position and zoom settings
        Capture high resolution image
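A minimal sketch of the camera #1 side of this loop is given below. The segment_and_track and compute_pan_tilt_zoom callables stand in for the segmentation, tracking, and geometry described in this section, the speed threshold for an "activity of interest" is an invented placeholder, and the commands are shown pushed over a plain TCP socket rather than through the Player interface actually used (Section 6.4).

import json
import socket

PTZ_HOST, PTZ_PORT = "192.168.1.20", 6665    # placeholder address of the camera #2 host

def send_ptz_command(pan_deg, tilt_deg, zoom):
    """Push one pan/tilt/zoom command to the machine driving camera #2."""
    msg = json.dumps({"pan": pan_deg, "tilt": tilt_deg, "zoom": zoom}).encode()
    with socket.create_connection((PTZ_HOST, PTZ_PORT), timeout=0.1) as sock:
        sock.sendall(msg)

def monitor(frames, segment_and_track, compute_pan_tilt_zoom, speed_threshold=2.0):
    """Camera #1 loop: track subjects and cue camera #2 on activities of interest."""
    for frame in frames:
        for track in segment_and_track(frame):        # yields per-subject tracks with a .speed field
            if track.speed > speed_threshold:          # e.g. a running subject is "of interest"
                pan, tilt, zoom = compute_pan_tilt_zoom(track)
                send_ptz_command(pan, tilt, zoom)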

Fig. 33 Low-resolution measurements of subject

Fig. 34 Schematic diagram of distributed camera control

Fig. 35 Several sample corresponding image sets from both cameras in an outdoor setting. The first five images were taken by camera #1 and the last five images were taken by camera #2. Faces were obscured to protect individual privacy


Fig. 36 Image sequences of pan/tilt/zoom camera focusing attention on a person and zooming in

6.3 Positioning Camera #2

Positioning camera #2 required sending pan, tilt, and zoom commands to the device. To compute these values, the kinematics of camera #2 were calculated. From the forward kinematics, the inverse kinematics were computed and solved for the pan and tilt angles. The zoom command was then computed directly from the size of the foreground object in camera #1's image. The pan and tilt commands are computed by converting the centroid of the 2D foreground object in camera #1's image, $P_{C1}^{2D}$, into a point in camera #2's world 3D coordinate system, $P_{C2}^{3D}$, as follows:

$$P_{C1}^{2D} = \begin{bmatrix} x & y \end{bmatrix}^T \qquad (14)$$

$$P_{C2}^{3D} = T_2^{-1}\, T_{wc1}\, H_{C1}\, P_{C1}^{2D} \qquad (15)$$

where:
• $T_2$ maps camera #2's 3D coordinate system to camera #1's 3D coordinate system,
• $T_{wc1}$ maps between camera #1's ground plane and camera #1's 3D coordinate system,
• $H_{C1}$ is the homography matrix for camera #1, which maps the image plane to the ground plane.

The vector $P_{C2}^{3D}$ is thus:

$$P_{C2}^{3D} = \begin{bmatrix} X & Y & Z \end{bmatrix}^T. \qquad (16)$$

Fig. 37 Image sequence of pan/tilt/zoom camera tracking a running person
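A compact sketch of Eqs. 14-16 and the subsequent inverse kinematics is given below. The transforms are assumed to be available from calibration as a 3×3 homography $H_{C1}$ and 4×4 homogeneous matrices $T_{wc1}$ and $T_2$; the spherical pan/tilt model (including its axis convention) and the zoom heuristic are illustrative assumptions, not the paper's exact kinematic solution.

import numpy as np

def ptz_command(centroid_xy, H_C1, T_wc1, T_2, object_height_px, target_height_px=240.0):
    """Map a camera #1 image centroid to pan/tilt/zoom settings for camera #2 (Eqs. 14-16)."""
    x, y = centroid_xy
    p_ground = H_C1 @ np.array([x, y, 1.0])                                # image plane -> ground plane
    p_ground /= p_ground[2]                                                # normalize homogeneous coordinates
    p_world_c1 = T_wc1 @ np.array([p_ground[0], p_ground[1], 0.0, 1.0])    # ground plane -> camera #1 frame
    X, Y, Z, _ = np.linalg.inv(T_2) @ p_world_c1                           # camera #1 frame -> camera #2 frame

    # Assumed spherical pan/tilt model about camera #2's origin (axis convention is illustrative).
    pan = np.degrees(np.arctan2(X, Y))
    tilt = np.degrees(np.arctan2(Z, np.hypot(X, Y)))

    # Assumed zoom heuristic: scale so the subject roughly fills a target image height, capped at 18x.
    zoom = float(np.clip(target_height_px / max(object_height_px, 1.0), 1.0, 18.0))
    return pan, tilt, zoom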


Fig. 38 Image sequences of pan/tilt/zoom camera tracking a walking person

6.4 Position Control Driver

Camera #2 is controlled through the Player robot device interface architecture [26]. Because Player provides a network interface to the supported hardware, it was possible to run the video processing and the pan/tilt/zoom device control on separate computers. The computer that processed the video from camera #1 determined the location of the subject and computed control commands in joint coordinates (pan, tilt, zoom). These commands were sent to a Player server running on the second computer via a TCP/IP connection (Fig. 34). Commands were sent for every frame in which the subject had moved since the previous frame.

These commands were issued to the camera through a custom driver that operated within the Player architecture. The driver was multi-threaded so that it could asynchronously receive commands from Player while at the same time monitoring the movements of the camera, to ensure that the device was progressing to its specified goal location. The driver was designed so that whenever a new command was received, it replaced any existing command that the pan/tilt/zoom camera was executing (i.e., the camera would forego its previous destination in favor of immediately moving to the new destination); a small illustrative sketch of this behavior is given below. The system ran at full video frame rate (30 fps).

6.5 Dual-Camera Measurements

The dual-camera system was tested in the second outdoor pedestrian courtyard. This setting tested the ability of the system to accurately position camera #2 at zoom levels of up to 18×. A system similar to the one used in these tests could be used to aid in pedestrian safety and crime prevention at many sites. In addition, the high resolution images from the pan/tilt/zoom camera can be used for more sophisticated activity recognition such as articulated motion classification, gait recognition, and so forth.
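Returning to the driver of Section 6.4, its "newest command wins" behavior can be sketched with a small threading example. This is not the actual Player driver code; move_toward stands in for whatever low-level protocol steps the EVI-D70, and the 10 ms polling period is an arbitrary choice.

import threading
import time

class PTZDriver:
    """Sketch of a driver that always pursues the most recently received goal."""

    def __init__(self, move_toward):
        self._move_toward = move_toward      # callable that steps the hardware toward (pan, tilt, zoom)
        self._goal = None
        self._lock = threading.Lock()
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, pan, tilt, zoom):
        """Called from the network side; replaces whatever goal is currently being executed."""
        with self._lock:
            self._goal = (pan, tilt, zoom)

    def _run(self):
        while True:
            with self._lock:
                goal = self._goal
            if goal is not None:
                self._move_toward(*goal)     # make progress toward the latest goal
            time.sleep(0.01)                 # poll for new commands (~100 Hz, arbitrary)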

Table 3 Pan/tilt/zoom position error (pixels)

            Mean    Std. Dev.
X           4.14    2.91
Y           8.77    5.11


Fig. 39 Tracking error

Figure 35 shows numerous sample image pairs from the working system. As people moved through the courtyard, all individuals were tracked by the image processing system of camera #1. These individuals were then followed by camera #2, and high-resolution images were acquired. Figures 36, 37, and 38 show image sequences of the pan/tilt/zoom camera tracking moving figures.

The system tracked hundreds of people moving through the courtyard over several hours. Motion at varying speeds was observed and tracked successfully; the system has demonstrated the ability to accurately track objects moving as fast as a motor scooter. Table 3 shows the tracking error of the pan/tilt/zoom camera over 98 frames, taken when the view is zoomed in to capture the full height of the person. Figures 37 and 38 show sample images used to compute these results. The error was computed by measuring the distance between the center of the image and the centroid of the subject, which was segmented by hand (Fig. 39).
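The summary statistics in Table 3 can be reproduced schematically as below. The centroid values are placeholders, since the hand-segmented per-frame data are not listed in the paper.

import numpy as np

# Placeholder values: hand-segmented subject centroids for a few zoomed-in frames
# and the image center of camera #2.
centroids = np.array([[322.0, 251.0], [315.0, 238.0], [327.0, 244.0]])   # (x, y) per frame
image_center = np.array([320.0, 240.0])

errors = np.abs(centroids - image_center)        # per-axis pixel error, one row per frame
print("Mean (X, Y):     ", errors.mean(axis=0))
print("Std. Dev. (X, Y):", errors.std(axis=0))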

7 Conclusion

This paper has investigated the problem of pedestrian tracking and presented usable algorithms for human motion recognition. This field also provides an opening into the problem of camera placement. A method for solving this problem was given in this paper, and it was shown to perform well in real-world situations. This method can also be generalized to the problem of placing distributed robot teams. As has been shown, an implementation for two robots has already been achieved, and its expansion to upwards of 20 robots is currently underway.


References

1. Abrams, S., Allen, P.K., Tarabanis, K.A.: Dynamic sensor planning. In: Proceedings of the IEEE International Conference on Intelligent Autonomous Systems, pp. 206–215. Pittsburgh, PA, Feb. (1993)
2. Atev, S., Masoud, O., Papanikolopoulos, N.: Practical mixtures of Gaussians with brightness monitoring. In: Proceedings of the IEEE International Conference on Intelligent Transportation Systems (2004)
3. Azarbayejani, A., Pentland, A.: Real-time self-calibrating stereo person tracking using 3D shape estimation from blob features. In: Proceedings of the IEEE International Conference on Pattern Recognition (1996)
4. Ben-Arie, J., Wang, Z., Pandit, P., Rajaram, S.: Human activity recognition using multidimensional indexing. IEEE Trans. Pattern Anal. Mach. Intell. 24(8), 1091–1104, August (2002)
5. Beymer, D., Konolige, K.: Real-time tracking of multiple people using continuous detection. In: International Conference on Computer Vision (1999)
6. Bodor, R.: Multi-camera human activity recognition in unconstrained indoor and outdoor environments. Ph.D. thesis, University of Minnesota, August (2005)
7. Bodor, R., Drenner, A., Janssen, M., Schrater, P., Papanikolopoulos, N.: Mobile camera positioning to optimize the observability of human activity recognition tasks. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. Edmonton, Canada, Aug. (2005)
8. Bodor, R., Jackson, B., Masoud, O., Papanikolopoulos, N.: Image-based reconstruction for view-independent human motion recognition. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Oct. (2003)
9. Bodor, R., Jackson, B., Papanikolopoulos, N.: Vision-based human tracking and activity recognition. In: Proceedings of the 11th Mediterranean Conference on Control and Automation, June (2003)
10. Bregler, C.: Learning and recognizing human dynamics in video sequences. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (1997)
11. Cao, D., Masoud, O., Boley, D., Papanikolopoulos, N.: Online motion classification using support vector machines. In: IEEE International Conference on Robotics and Automation, Apr. (2004)
12. Chen, X., Davis, J.: Camera placement considering occlusion for robust motion capture. Technical Report CS-TR-2000-07, Stanford University (2000)
13. Collins, R., Amidi, O., Kanade, T.: An active camera system for acquiring multi-view video. In: Proceedings of the 2002 International Conference on Image Processing, Sept. (2002)
14. Cretual, A., Chaumette, F., Bouthemy, P.: Complex object tracking by visual servoing based on 2D image motion. In: Proceedings of the IAPR International Conference on Pattern Recognition, Australia, Aug. (1998)
15. Cutler, R., Turk, M.: View-based interpretation of real-time optical flow for gesture recognition. In: Proceedings of the Third IEEE Conference on Face and Gesture Recognition. Nara, Japan, Apr. (1998)
16. Daniilidis, K., Krauss, C., Hansen, M., Sommer, G.: Real-time tracking of moving objects with an active camera. Real-Time Imaging 4, 3–20 (1998)
17. Darrell, T., Gordon, G., Harville, M., Woodfill, J.: Integrated person tracking using stereo, color, and pattern detection. Int. J. Comput. Vis. 37(2), 175–185, June (2000)
18. Davis, J., Chen, X.: Calibrating pan-tilt cameras in wide-area surveillance networks. In: Proceedings of the IEEE International Conference on Computer Vision (2003)
19. Denzler, J., Zobel, M., Niemann, H.: On optimal camera parameter selection in Kalman filter based object tracking. In: Proceedings of the 24th DAGM Symposium on Pattern Recognition, pp. 17–25. Zurich, Switzerland (2002)
20. Efros, A., Berg, A., Mori, G., Malik, J.: Recognizing action at a distance. In: Proceedings of the IEEE Conference on Computer Vision (2003)
21. Elgammal, A., Duraiswami, R., Harwood, D., Davis, L.: Background and foreground modeling using nonparametric kernel density estimation for visual surveillance. Proc. IEEE 90, 1151–1163 (2002)
22. Eveland, C., Konolige, K., Boles, R.: Background modeling for segmentation of video-rate stereo sequences. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Santa Barbara, CA (1998)
23. Fablet, R., Black, M.J.: Automatic detection and tracking of human motion with a view-based representation. In: European Conference on Computer Vision, May (2002)


24. Fleishman, S., Cohen-Or, D., Lischinski, D.: Automatic camera placement for image-based modeling. In: Proceedings of Pacific Graphics, vol. 99, pp. 12–20 (1999)
25. Gasser, G., Bird, N., Masoud, O., Papanikolopoulos, N.: Human activity monitoring at bus stops. In: Proceedings of the IEEE Conference on Robotics and Automation (2004)
26. Gerkey, B., Vaughan, R.T., Howard, A.: The player/stage project: tools for multi-robot and distributed sensor systems. In: Proceedings of the 11th International Conference on Advanced Robotics, pp. 317–323. Coimbra, Portugal, June (2003)
27. Grimson, W., Stauffer, C., Romano, R., Lee, L.: Using adaptive tracking to classify and monitor activities in a site. In: Proceedings of the International Conference on Computer Vision and Pattern Recognition, June (1998)
28. Haritaoglu, I., Harwood, D., Davis, L.: W4: real-time surveillance of people and their activities. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 809–830 (2000)
29. Heisele, B., Verri, A., Poggio, T.: Learning and vision machines. Proc. IEEE 90(7), 1164–1177 (2002)
30. Hodgins, J., O'Brien, J., Tumblin, J.: Do geometric models affect judgements of human motion? In: Proceedings of the Graphics Interface Conference. Canada, May (1997)
31. Isler, V., Kannan, S., Daniilidis, K.: VC-dimension of exterior visibility. IEEE Trans. Pattern Anal. Mach. Intell. 26(5), 667–671 (2004)
32. Lin, C., Wang, C., Chang, Y., Chen, Y.: Real-time object extraction and tracking with an active camera using image mosaics. In: Proceedings of the IEEE International Workshop on Multimedia Signal Processing, Dec. (2002)
33. Lyons, D.: Discrete event modeling of misrecognition in PTZ tracking. In: Proceedings of the IEEE Conference on Advanced Video and Signal Based Surveillance. Miami, FL, July (2003)
34. Masoud, O., Papanikolopoulos, N.P.: A method for human action recognition. Image Vis. Comput. 21(8), 729–743 (2003)
35. Masoud, O., Papanikolopoulos, N.P.: Using geometric primitives to calibrate traffic scenes. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. Japan, Oct. (2004)
36. Maurin, B., Masoud, O., Papanikolopoulos, N.: Monitoring crowded traffic scenes. In: Proceedings of the IEEE International Conference on Intelligent Transportation Systems. Singapore, Sept. (2002)
37. McKenna, S., Jabri, S., Duric, Z., Wechsler, H.: Tracking interacting people. In: Proceedings of the Conference on Automatic Face and Gesture Recognition. Grenoble, France, Mar. (2000)
38. Mittal, A., Huttenlocher, D.: Scene modeling for wide area surveillance and image synthesis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. South Carolina, June (2000)
39. Mori, H., Charkari, M., Matsushita, T.: On-line vehicle and pedestrian detections based on sign pattern. IEEE Trans. Ind. Electron. 41(4), 384–391 (1994)
40. Murray, D., Basu, A.: Motion tracking with an active camera. IEEE Trans. Pattern Anal. Mach. Intell. 16(5), 449–459 (1994)
41. Nelson, B., Khosla, P.K.: Increasing the tracking region of an eye-in-hand system by singularity and joint limit avoidance. In: Proceedings of the IEEE International Conference on Robotics and Automation, vol. 3, pp. 418–423 (1993)
42. Nelson, B., Khosla, P.K.: Integrating sensor placement and visual tracking strategies. In: Proceedings of the 1994 IEEE International Conference on Robotics and Automation, vol. 2, pp. 1351–1356 (1994)
43. Nelson, B., Khosla, P.K.: The resolvability ellipsoid for visual servoing. In: Proceedings of the 1994 IEEE Conference on Computer Vision and Pattern Recognition, pp. 829–832 (1994)
44. Olague, G., Mohr, R.: Optimal 3D sensor placement to obtain accurate 3D point positions. In: Proceedings of the Fourteenth International Conference on Pattern Recognition, vol. 1, pp. 16–20, August (1998)
45. Oliver, N., Rosario, B., Pentland, A.: A Bayesian computer vision system for modeling human interactions. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 831–843 (2000)
46. O'Rourke, J.: Art Gallery Theorems and Algorithms. Oxford University Press, New York (1987)
47. Peixoto, P., Batista, J., Araujo, H., de Almeida, A.: Combination of several vision sensors for interpretation of human actions. In: Proceedings of the Conference on Experimental Robotics (2000)
48. Pers, J., Vuckovic, G., Dezman, B., Kovacic, S.: Human activities at different levels of detail. In: Proceedings of the Czech Pattern Recognition Society Computer Vision Winter Workshop, pp. 27–32, Feb. (2003)


49. Polana, R., Nelson, R.: Nonparametric recognition of nonrigid motion. Technical report, University of Rochester (1994)
50. Rosales, R., Sclaroff, S.: 3D trajectory recovery for tracking multiple objects and trajectory guided recognition of actions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June (1999)
51. Sharma, R., Hutchinson, S.: Motion perceptibility and its application to active vision-based servo control. IEEE Trans. Robot. Autom. 13(4), 607–617 (1997)
52. Starner, T., Pentland, A.: Real-time American Sign Language recognition from video using hidden Markov models. In: Proceedings of the IEEE International Symposium on Computer Vision (1995)
53. Stauffer, C., Grimson, W.: Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 747–757, August (2000)
54. Stauffer, C., Tieu, K.: Automated multi-camera planar tracking correspondence modeling. In: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, vol. 1, June (2003)
55. Stillman, S., Tanawongsuwan, R., Essa, I.: A system for tracking and recognizing multiple people with multiple cameras. In: Proceedings of the International Conference on Audio and Video Based Biometric Person Authentication, March (1999)
56. Tarabanis, K.A., Tsai, R.Y., Kaul, A.: Computing occlusion-free viewpoints. IEEE Trans. Pattern Anal. Mach. Intell. 18(3), 273–292, Mar. (1996)
57. Thornton, I., Pinto, J., Shiffrar, M.: The visual perception of human locomotion. Cogn. Neuropsychol. 15(6), 535–552 (1998)
58. Tordoff, B.J., Murray, D.W.: Resolution vs. tracking error: zoom as a gain controller. In: Proceedings of the International Conference on Computer Vision and Pattern Recognition, June (2003)
59. Ukita, N., Matsuyama, T.: Incremental observable-area modeling for cooperative tracking. In: Proceedings of the International Conference on Pattern Recognition, Sept. (2000)
60. Viola, P., Jones, M., Snow, D.: Detecting pedestrians using patterns of motion and appearance. Int. J. Comput. Vis. 63, 153–161 (2005)
61. Wren, C., Azarbayejani, A., Darrell, T., Pentland, A.: Pfinder: real-time tracking of the human body. IEEE Trans. Pattern Anal. Mach. Intell. 19(7), 780–785, July (1997)
62. Yao, Y., Allen, P.: Computing robust viewpoints with multi-constraints using tree annealing. IEEE Trans. Pattern Anal. Mach. 2, 993–998 (1995)
63. Zhou, X., Collins, R., Kanade, T., Metes, P.: A master-slave system to acquire biometric imagery of humans at distance. In: Proceedings of the ACM International Workshop on Video Surveillance (2003)
64. Zobel, M., Denzler, J., Niemann, H.: Binocular 3-D object tracking with varying focal lengths. In: Proceedings of the IASTED International Conference Signal Processing, Pattern Recognition, and Applications, July (2002)