Multimedia Tools and Applications, 27, 215–228, 2005
© 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands.

Common Visual Cues for Sports Highlights Modeling

M. BERTINI, A. DEL BIMBO, AND W. NUNZIATI

Abstract. Automatic annotation of semantic events allows effective retrieval of video content. In this work, we present solutions for highlights detection in sports videos. The proposed approach exploits the typical structure of a wide class of sports videos, namely those related to sports that are played in delimited venues with playfields of well-known geometry, such as soccer, basketball, swimming, and track and field disciplines. For these sports, we present a modeling scheme based on a limited set of visual cues and on finite state machines that encode the temporal evolution of highlights, which is of general applicability to this class of sports. Visual cues encode position and speed information coming from the camera and from the objects/athletes present in the scene, and are estimated automatically from the video stream. Algorithms for model checking and for visual cue estimation are discussed, as well as applications of the representation to different sport domains.

Keywords: semantic annotation, highlights detection, sport videos, visual cues

1. Introduction

Every day, a huge amount of sports video footage is recorded by television broadcasters. To provide effective archiving and retrieval of the video content, all this material should be annotated with respect to its semantic content, producing metadata that is attached to the video data and stored in databases, so that a user can then retrieve shots and sequences of interest. For example, a broadcasting company may be interested in the annotation of relevant actions to produce a video summary of an event for its sport programmes, or to provide only the sequences of certain actions to mobile video phone services. Therefore, there is ample motivation for developing tools that support the annotation process, automatically providing descriptions of the most relevant events and other in-depth information; in the following we will refer to these events as "highlights", as suggested by the language commonly used in the sport domain. Although the video stream includes multiple media (i.e. image, audio, speech and text), most highlights can be detected based only on the information extracted from the raw image stream. In fact, text captions appear in edited video and provide information on major changes in the competition (such as a change in the score), but do not occur for other interesting highlights (such as shots on goal in soccer games). Audio information (such as crowd cheering) indicates only that "something" interesting is happening but does not indicate the specific event; moreover, it may not always be available. A preliminary analysis of soccer videos has also shown that often the sound of the crowd is even unrelated to the ongoing action, e.g. when the supporters are cheering their favourite team. Information extracted from the commentary (if present) requires strong localization, in that it is highly dependent on the language and the voice of the speaker, thus requiring a huge effort to adapt and train the speech recognition system. In the past few years, there has been a lot of research on automatic annotation of highlights in sports videos. However, due to the complexity of the problem and to the substantial differences between sport types, the problem has always been addressed by developing ad-hoc solutions for a particular sport genre. Several application-specific approaches have been proposed so far, each tailored to a particular sport domain and dedicated to modeling and recognizing a selected set of events in that domain. Generalization has been achieved only for simple types of events. Herzog et al. [1] proposed a representation of soccer events based on course diagrams that must be traversed in order to identify an event. The focus was in that case on automatic language generation rather than on automatic annotation of events, so the system works with synthetic or manually generated trajectories of players. In [14], typical tennis players' swings are detected using domain-specific rules that map low-level measurements into high-level contents. In [17], a decision tree is used to model nine meaningful events of basketball games. Bayes networks are used in [7] to model and classify football plays; the representation is very detailed, exploiting in depth the knowledge of the domain. To obtain such precision, trajectories of players and ball are entered manually, not automatically extracted from the video stream. The same sport has been studied using HMMs to detect play/break shots in [10].
This type of classification can be generalized to other team games, as proposed in [16] for the soccer domain, but the proposed approach does not appear suitable for the recognition of structured, higher-level events. MPEG motion vectors were used in [9] to detect highlights. In particular, the fact that fast camera motion is observed in correspondence with typical soccer highlights, such as shots on goal or free kicks, was exploited. In [15], a hierarchical E-R framework was proposed for modeling the domain knowledge. The model uses 3D data on the position of players and ball that are obtained from microwave sensors or multiple video cameras. Recognition of relevant soccer highlights (free kicks, corner kicks, and penalty kicks) was presented in [2]: Hidden Markov Models encoding players' distribution and camera motion were used to discriminate between the three highlights. More recently, in [6], Ekin et al. performed generic highlight detection in soccer video using both shot sequence analysis and shot visual cues. In particular, they assume that the presence of highlights can be inferred from the occurrence of one or several slow motion shots and from the presence of shots where the referee and/or the goal box is framed. In this paper we improve on existing work by proposing fully automatic solutions for the annotation of highlights based on raw visual information only, which exploit a limited number of common visual cues to detect highlights of interest in many sports. Visual cues are extracted only when required by the model checking algorithm, thus optimizing the processing. The proposed representation can easily be extended to include other visual cues, which can be defined to model more domain-specific highlights. We define sport highlights as atomic entities at the semantic level. They have a limited temporal extension and can be modeled as the spatio-temporal concatenation of specific events.


The paper is organized as follows. In Section 2, we briefly introduce common peculiarities of sports videos that can be exploited for modeling highlights. In Section 3, we discuss the solution adopted for modeling highlights, comparing the finite state machine (FSM) approach with other formalisms, most notably those referred to as probabilistic graphical models. We then provide details on the estimation of visual cues in Section 4, and on the model checking algorithm in Section 5. We report on the application of the proposed approach to various sport genres in Section 6, showing experimental results as well. Conclusions and future work are discussed in Section 7.

2. Sports videos structure and highlights

Each sport shows its peculiarities, both in the particular structure of the competition and in the structure of the footage imposed by producers. Sports can be classified according to how they are played and according to the playfield where they are played. In particular, we distinguish among team sports (teams face each other sharing the same playground), competitions (two or more players compete, each having their own part of the playground) and individual competitions (athletes participate individually, one after the other); we also distinguish sports that are played in delimited venues with playfields of well-known geometry from those that employ circuits or are played in open fields. Table 1 shows a classification of some of the most important sports, based on the type of playfield and on the class of the sport itself. This classification is useful to derive general rules related to the occurrence of highlights, so that common visual cues can be defined and used to model events that support the detection task. In the following, we will concentrate on games played on a playfield, whether they are team sports, competitions or individual performances. In all these categories, sports are usually filmed using (among others) a fixed main camera that can span the whole playfield and that is kept focused on the most interesting part of the play by the director. The motion of this camera is always highly correlated with the content of the scene, and this motion can in turn be estimated from the imagery. Moreover, the position of the main camera w.r.t. the playfield is typical for a given sport, allowing us to exploit distinct patterns in the resulting views. It will be shown that highlights in these sports can be detected reliably by using visual cues related to the imaged motion of the main camera, and to the region of the playfield that is framed at each time instant by the main camera itself.
These can be regarded as speed (how the camera is moving) and position information of the main camera (what the camera is looking at). Doing so, we define two main visual cues that can be used to model each phase of a given highlight, regardless of the particular sport. Other visual cues, such as players' typical deployments, related to distinctive details of a particular sport/highlight, can be used to improve the discriminating power of the model and to detect highlights that are more specific to the domain. The classification is also useful to define typical highlights of various sports that the system should be able to recognize. For example, in team games there is usually a "point attempt", as well as a "kick off" and (sometimes) a "free shot". These highlights can be appropriately detected using playfield and players' position. For individual races, there are always "start" and "end" events, which are usually the most interesting moments in the competition (e.g. a 5,000 meters Olympic race), as well as "overtaking" events that can be detected using playfield and camera motion.

Table 1. Classification of sports.

                          Playfield                                Circuit/field
Team games                Soccer, basketball, football, rugby
Competitions              Races, swimming, cycling track, tennis   Cycling, motorsports
Individual competitions   Jumps, throws                            Golf

3. Modeling highlights

Following the introductory analysis presented in the previous section and moving to a more formal level, we cast the problem of highlights modeling into a problem of detecting events from temporal sequences. In fact, a generic highlight can be regarded as a concatenation of consecutive phases of the competition. Each phase typically occurs in a distinct zone of the playfield, while transitions between phases are related to the motion of objects such as the ball and/or the athletes. In our approach, we model highlights using FSMs. Each highlight is described by a directed graph G^h = (S^h, E^h), where S^h is the set of nodes representing the states and E^h is the set of edges representing the events. Events indicate transitions from one state to the other: they capture the relevant steps in the progression of the play or of the race, such as moving from one part of the playfield to a different one, accelerating or decelerating, etc. This dynamic evolution is first manually encoded into knowledge models. Figure 1 shows how the essential phases of a shot on goal highlight have been represented as an FSM. For the implementation of the highlight models in the classification engine, knowledge models are turned into operational models, whose parameters are derived from inspection of videos. Transitions on the graph edges are defined by combinations of visual cue descriptors through logic and relational operators.
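A minimal sketch of such a highlight model as a finite state machine: a directed graph G^h = (S^h, E^h) whose edges carry guard predicates over the visual cues, plus a wrapper showing how a minimum-duration constraint on a transition can be enforced. State names, guards and threshold values below are hypothetical illustrations, not the paper's exact operational model.

```python
# Toy FSM for highlight models: each state has outgoing edges guarded by
# predicates over the current visual cues. All names here are hypothetical.

class HighlightFSM:
    def __init__(self, transitions, start="INIT", success="SUCCESS"):
        # transitions: {state: [(guard, next_state), ...]}
        self.transitions = transitions
        self.success = success
        self.state = start

    def step(self, cues):
        """Advance on the first outgoing edge whose guard holds for `cues`."""
        for guard, nxt in self.transitions.get(self.state, []):
            if guard(cues):
                self.state = nxt
                break
        return self.state

    def recognized(self):
        return self.state == self.success

def with_min_duration(guard, min_frames):
    """Make a guard fire only after holding for min_frames consecutive frames."""
    run = {"n": 0}
    def wrapped(cues):
        run["n"] = run["n"] + 1 if guard(cues) else 0
        return run["n"] >= min_frames
    return wrapped

# Toy shot-on-goal model: play advances toward the goal box (zone Z2), then
# sustained fast camera motion marks the shot.
shot = HighlightFSM({
    "INIT": [(lambda c: c["zone"] == "Z3", "ADVANCING")],
    "ADVANCING": [(with_min_duration(
        lambda c: c["zone"] == "Z2" and c["motion"] > 1.0, 2), "SUCCESS")],
})
```

Feeding the model one cue dictionary per frame drives it through its states; reaching SUCCESS recognizes the highlight.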

Figure 1. The informal model of a shot on goal highlight in the soccer domain.


Figure 2(a) and (b) show examples of operational models for a simple and a complex FSM in two distinct domains (namely, soccer and swimming). The playfield zone that is framed, the camera motion, and the position of soccer players are the cues used in the state transition conditions. Time constraints, for example a minimum temporal duration, can be applied to state transitions. Since these operators are used to model temporal sequences of facts, they can be compared to those introduced in other temporal reasoning models. In particular, they correspond to the before and during operators as defined in Allen's temporal algebra. Logical symbols (and, or, not) are used to combine visual cues extracted from the video stream. For example, the sentence "play is around the goal box" of the shot on goal in figure 1 is modeled by the expression (μ1 < M < μ2) ∧ Z2. This expression is made up of two constraints, one related to the motion of the play (μ1 < M < μ2, where M is the camera motion magnitude and μ1, μ2 are thresholds), and the other (Z2) related to the currently framed playfield zone (the part of the playfield surrounding the goal box, as shown in figure 4). FSMs appear to be better suited to model highlights in sports than other methods that have been successfully used for automatic annotation. In principle, our problem can be cast into an equivalent one using other rule-based (such as decision-tree) or probabilistic approaches, such as graphical models [8]. In particular, standard HMMs perform well when dealing with single-object actions (such as recognition of a hand movement [11]), but tend to fail when discrete observation variables may assume a high number of different values and come from multiple sources of observation (as in the case of games where visual cues are observed over the whole playfield).
In fact, in that case the number of parameters grows exponentially with the dimension of the observations, requiring a very large training set in order to cover virtually all the possible occurrences of a certain event. This problem can be tackled by factorizing state and observation variables, and hence deriving a more complicated HMM topology such as coupled HMMs [5] or a generic dynamic Bayesian network [13]. However, in these cases other problems remain unsolved: in particular, it is hard to encode an explicit temporal duration into the model [12], while this is trivial using an FSM. Moreover, the absence of a "null" model (the silence model in HMM speech recognizers [4]) raises the temporal segmentation problem: it is not known when an event starts or ends. Some mechanism of sequence alignment or output probability resetting can be introduced, but these mechanisms are difficult to implement and unlikely to work.

4. Estimation of visual cues

4.1. Camera motion

As the main camera is in a fixed position, a 3-parameter image motion estimation algorithm capturing horizontal and vertical translations and isotropic scaling is sufficient to get a reasonable estimate of camera pan, tilt and zoom. The motion estimation algorithm is an adaptation to the sports video domain of the algorithm reported in [3], which is based on corner tracking and motion vector clustering. As it works with a selected number of salient image locations, the algorithm can cope with the large displacements due to fast camera motion. The algorithm employs deterministic sample consensus to perform a statistical motion grouping. This is


Figure 2. (a) Soccer shot model and swimming start model: the edges report the camera motion and playfield zones needed for each state transition; if the END state is reached, the highlight is recognized. (b) Soccer free kick and swimming turning models: Z1 is the right turning wall, Z2 is the central swimming pool zone, and Z3 is the left turning wall. Some transitions in both sports require a minimum time length.


Figure 3. Left: typical shot action (soccer); at time t_start the ball is kicked toward the goal post. Right: typical turning action (swimming); at time t_turning, the athletes perform a turn. Symbols on the right identify low, medium and high motion.

Figure 4. Playfield partitions for soccer (top), basketball (middle) and swimming (bottom).

particularly effective for clustering multiple independent image motions, and is therefore suitable, in the specific case of sports videos, for separating camera motion from the motion of individual athletes. Once the camera motion is estimated, it is first filtered with a low-pass filter of length 10, with normalized cut frequency equal to 0.2. This is necessary to remove noise due to measurement errors. The sequences (one for each independent motion of the camera) are then quantized, to obtain a rough description of the speed of the competition. The parameters of the quantization are derived from inspection of videos, and can vary for different sports.
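The camera-motion cue pipeline described above can be sketched as follows: a least-squares fit of the 3-parameter model (isotropic scale s, translations tx, ty) to tracked corner correspondences, smoothing with a length-10 low-pass FIR (here a Hamming-windowed sinc at the stated cutoff, an assumed filter design), and quantization into coarse speed levels. The quantization thresholds are illustrative; the paper derives them per sport, and its estimator additionally clusters independent motions by sample consensus, which is omitted here.

```python
# Sketch of the camera-motion cue pipeline: fit, smooth, quantize.
import numpy as np

def fit_motion_3param(pts, pts_next):
    """Fit (s, tx, ty) of x' = s*x + tx, y' = s*y + ty from (N, 2) corner
    positions in consecutive frames, by linear least squares."""
    pts, pts_next = np.asarray(pts, float), np.asarray(pts_next, float)
    A = np.zeros((2 * len(pts), 3))
    A[0::2, 0] = pts[:, 0]; A[0::2, 1] = 1.0   # rows for x' = s*x + tx
    A[1::2, 0] = pts[:, 1]; A[1::2, 2] = 1.0   # rows for y' = s*y + ty
    return np.linalg.lstsq(A, pts_next.reshape(-1), rcond=None)[0]

def lowpass(signal, num_taps=10, cutoff=0.2):
    """Length-10 windowed-sinc low-pass FIR with unit DC gain."""
    n = np.arange(num_taps) - (num_taps - 1) / 2.0
    h = np.sinc(2 * cutoff * n) * np.hamming(num_taps)
    h /= h.sum()
    return np.convolve(signal, h, mode="same")

def quantize_speed(magnitude, thresholds=(0.5, 1.5)):
    """Map motion magnitude to 0 (low), 1 (medium) or 2 (high)."""
    return np.digitize(magnitude, thresholds)
```

On synthetic correspondences generated by a known pan and zoom, the fit recovers the parameters exactly, since the model is linear in (s, tx, ty).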

4.2. Playfield zone estimation

To estimate the playfield zone, the playfield is first partitioned into several, possibly overlapping zones, defined so that the change from one to another indicates a change in the action (for example, in the case of soccer, an attack action that enters the goal box develops from zone Z3, to either Z1 or Z4, and eventually to zone Z2). In general, a typical camera view is associated with each playfield zone, and we exploit common patterns of these views to recognize which zone is framed. Figure 4 shows how the soccer, swimming and basketball playfields have been partitioned. The recognition process is carried out as follows: first, the playfield region and lines are extracted. Then, we derive numeric descriptors of these two entities, related to the orientation and length of lines and to the shape of the playfield region. The descriptors are then used to feed a set of Naïve Bayes classifiers, each of which is dedicated to recognizing one of the playfield zones. Since the classifiers have identical structure, large differences in their outputs are likely to be significant, so we choose the classifier with the highest output probability as the one identifying the zone currently framed. Figure 5 shows the extraction of playfield zone shape and lines for the soccer, swimming, cycling track and basketball cases, respectively. For each sport, the left image shows the original frame, and the right image shows the extracted playfield shape and lines. The playfield shape is obtained from color analysis and binarization. The bitmap is processed using K-fill, flood fill, erosion and dilation, and the playfield shape is finally represented as a polygon. Playfield lines are obtained from the edge image and the playfield shape.
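A toy version of the zone selection step: one Gaussian Naïve Bayes scorer per zone evaluates the frame descriptors, and the zone with the highest score wins. The two descriptors (dominant line angle, playfield region area fraction) and the per-zone parameters are invented for illustration; the paper's descriptors and training procedure are not reproduced here.

```python
# Toy Gaussian Naive Bayes zone selector: pick the zone whose model gives the
# descriptors the highest log-likelihood. All parameter values are invented.
import math

def gaussian_log_likelihood(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def classify_zone(descriptors, zone_models):
    """zone_models: {zone: [(mean, var), ...]} aligned with the descriptors."""
    scores = {}
    for zone, params in zone_models.items():
        scores[zone] = sum(gaussian_log_likelihood(x, m, v)
                           for x, (m, v) in zip(descriptors, params))
    return max(scores, key=scores.get)

# Two hypothetical zones described by (dominant line angle in degrees,
# playfield region area as a fraction of the frame).
models = {
    "Z1": [(90.0, 25.0), (0.30, 0.01)],   # goal area: vertical lines, small region
    "Z3": [(0.0, 25.0), (0.70, 0.01)],    # midfield: horizontal lines, large region
}
```

Because every zone classifier has the same structure, comparing their (log-)outputs directly is meaningful, matching the selection rule described above.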

Figure 5. (a) Original soccer image. (b) Soccer playfield shape and lines. (c) Original swimming image. (d) Swimming pool shape and lanes. (e) Original indoor cycling image. (f) Indoor cycling circuit shape and lines. (g) Original Basketball image. (h) Basketball playfield shape and lines.

4.3. Athletes' position and speed

Players' position is instrumental to the recognition of those highlights that are characterized by a typical deployment of players on the playfield. Examples are all "free shot" highlights in team sports: free kicks in soccer, free throws in basketball, games' kick offs, and so on. For these highlights, both camera and players are usually still at the start of the action, allowing a robust estimation of the position of players, so that typical configurations of the deployment can be recognized. We exploit the knowledge of the actual planar model of the playfield to automatically estimate the transformation (referred to as a planar homography) which maps any imaged playfield point (x, y) onto the real playfield point (X, Y). Players are first detected as "blobs" from each frame by color differencing. Then, we use an elliptic template to identify their position in the frame. The bottom-end point of each detected template is remapped onto the playfield model through the estimated homography. The image-to-model registration process can be summarized as follows:

1. Image line selection. From the extracted playfield line segments, select a four-element subset (image lines set).
2. Hypothesis generation. From the straight lines of the model, select a four-element subset (model lines set). Estimate the transformation satisfying the line homography from the image lines set to the model lines set.
3. Hypothesis validation. Remap all the playfield line segments of the image according to the homography estimated in the previous step. Select the solution with the minimum model matching error, evaluated by measuring the distance (by accumulation of line-point distances) of each remapped line segment to the closest line of the playfield model.

The output of the registration process is then used to build a compact representation of how the players are deployed, such as a histogram of occupation of typical zones.
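The final remapping step can be sketched as follows: each detected player's bottom template point is projected onto the playfield model through the estimated 3×3 planar homography H, and the remapped positions are summarized as an occupancy histogram over named zones. The matrix H and the zone boxes below are made-up examples, not calibration data.

```python
# Remap image points through a planar homography and build a zone-occupancy
# histogram. H and the zone boxes are hypothetical examples.
import numpy as np

def remap_point(H, x, y):
    """Apply homography H to image point (x, y); returns model-plane (X, Y)."""
    X, Y, w = H @ np.array([x, y, 1.0])
    return X / w, Y / w

def occupancy_histogram(points, zones):
    """Count points per zone; zones: {name: (xmin, xmax, ymin, ymax)}."""
    counts = {name: 0 for name in zones}
    for X, Y in points:
        for name, (x0, x1, y0, y1) in zones.items():
            if x0 <= X <= x1 and y0 <= Y <= y1:
                counts[name] += 1
    return counts

# Example H: a pure scaling from image to model coordinates (a real estimated
# H also encodes perspective in its last row).
H = np.array([[0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0],
              [0.0, 0.0, 1.0]])
```

The resulting per-zone counts are exactly the kind of compact deployment descriptor fed to the FSM engine as a visual cue.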
Figure 6 shows the zones of the playfield (in dark) where players are expected to be found in the case of a free kick in soccer. The presence or absence of players in the darkest areas helps discriminate among three classes of free kicks, namely penalty kicks, corner kicks and free kicks with wall, whereas the presence or absence of players in the lighter areas is less relevant. The values measured in each zone are used as visual cues by the FSM engine. Motion information about the objects and/or athletes present in the scene is obtained from the camera motion estimation algorithm. We cluster the motion magnitude and direction of pixels that are moving independently. This measure is sufficient to detect the characteristic acceleration and deceleration of groups of players, which are common behaviours when the action is changing in some way. Figure 7 shows one of these typical situations in basketball games, occurring when the defending team steals the ball. Players of both teams suddenly run towards the other midfield, defenders become attackers and vice versa; this is quite different from the normal start of the action, when defenders put the ball in play from the playfield end line.
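A toy detector for the collective-acceleration pattern just described: given the mean speed magnitude of independently moving pixels per frame, it flags frames where activity jumps from low to high. The thresholds are invented for illustration and would, like the other parameters, be tuned per sport from video inspection.

```python
# Flag frames where group motion magnitude jumps from "low" to "high",
# a rough proxy for the sudden collective sprint of a turnover.
# Threshold values are hypothetical.
import numpy as np

def detect_speed_bursts(speed, low=0.3, high=1.0):
    """Return frame indices where speed rises from below `low` to above `high`."""
    speed = np.asarray(speed, float)
    bursts = []
    for t in range(1, len(speed)):
        if speed[t - 1] < low and speed[t] > high:
            bursts.append(t)
    return bursts
```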


Figure 6. For the purpose of free kick classification in soccer, the playfield has been partitioned into a number of regions by quantizing the x and y coordinates into 4 and 5 levels, respectively. The relevance of the areas w.r.t. the problem of placed kick classification is coded by the shading (dark: very relevant; light: less relevant).

Figure 7. Top: Two keyframes of a turnover highlight in basketball. Bottom: Speed magnitude of moving pixels in the horizontal direction, used to model players’ activity.

5. Model checking

Operational models are used to perform model checking over the highlight FSMs. The combinations of the estimated visual cue measures are checked against each highlight model in parallel. To reduce the computational costs associated with the extraction of each cue, only the cues that are needed to perform state transitions at each state are


extracted. The model checking algorithm works as follows: in the main loop, the visual features used to detect the occurrence of highlights (e.g. line segments, players' blobs, playfield color) are extracted from each frame. Cue descriptors are computed for each highlight, and the constraints associated with transitions from the current state are checked. If a constraint is verified, the current state is updated. Whenever the entire model is checked successfully, the stream segment is annotated with the highlight, reporting its initial and final time instants. The following pseudo-code shows the model checker.

main()
    for each feature f do
        ref_counter[f] = 0;        /* ref_counter[f] counts models requiring f */
    for each highlight h do
        init_model(h);
    while (!end_of(video_stream))
        for each feature f do
            if (ref_counter[f] > 0) compute(f);
        for each highlight h do
            if (check_model(h))
                annotate(m.highlight_type, m.in, m.out);
        current_frame++;

init_model(int h)
    model[h].status = INIT;
    for each feature f do
        for each k such that ((e^h_{0,k} ∈ E^h) ∧ (e^h_{0,k} requires f)) do
            ref_counter[f]++;

boolean check_model(int h)
    boolean rv = FALSE;
    if (occurred_event(e^h_{i,j}))
        if (s^h_i == INIT) model[h].in = current_frame;
        if (s^h_j == FAIL) j = 0;
        else if (s^h_j == SUCCESS)
            model[h].out = current_frame;
            rv = TRUE;
            j = 0;
        F_i = ∪_{k | e^h_{i,k} ∈ E^h} { f | detection of e^h_{i,k} requires f };
        F_j = ∪_{k | e^h_{j,k} ∈ E^h} { f | detection of e^h_{j,k} requires f };
        F_dec = F_i − F_j;    /* set of features whose evaluation is no longer required */
        F_inc = F_j − F_i;    /* set of required features that were not evaluated so far */


        for each feature f ∈ F_dec do ref_counter[f]--;
        for each feature f ∈ F_inc do ref_counter[f]++;
        model[h].status = s^h_j;
    return rv;

In the supervising algorithm, each model is first initialised (i.e. its state is set to INIT), and the features that are relevant for the initial state are identified. In particular, ref_counter[f] tracks the number of models for which the current state requires extraction of feature f. The features required to update at least one FSM are then extracted, and cues are evaluated for each incoming frame. Each model analyzes the cue stream to verify whether a relevant event occurred. If so, the model updates its state; consequently, the list of relevant features is updated. Two sets, F_dec and F_inc, are evaluated: the former contains features that are no longer required by the model when leaving state s^h_i, while the latter contains additional features that are required by the model when entering the new state s^h_j. Whenever a model progresses from the INIT state, the current frame number is stored, to mark the beginning of a possible highlight; whenever the model succeeds (i.e. a highlight is identified), the current frame number is also stored, to mark the end of an actual highlight.
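The supervising loop can be rendered as a compact, runnable sketch: per-frame features are extracted only when at least one model's current state needs them (the reference counting above is realized here implicitly, via the union of needed features), and each model reports the in/out frames of a recognized highlight. The feature names and the toy model are hypothetical, and FAIL states are omitted for brevity.

```python
# Runnable sketch of the model-checking loop with lazy feature extraction.

class Model:
    def __init__(self, name, transitions):
        # transitions: {state: [(feature, predicate, next_state), ...]}
        self.name, self.transitions = name, transitions
        self.state, self.t_in = "INIT", None

    def needed_features(self):
        """Features required by the outgoing edges of the current state."""
        return {f for f, _, _ in self.transitions.get(self.state, [])}

    def step(self, frame_no, cues):
        """Fire the first satisfied edge; return an annotation on SUCCESS."""
        for feature, pred, nxt in self.transitions.get(self.state, []):
            if pred(cues[feature]):
                if self.state == "INIT":
                    self.t_in = frame_no          # possible highlight begins
                self.state = nxt
                if nxt == "SUCCESS":
                    ann = (self.name, self.t_in, frame_no)
                    self.state, self.t_in = "INIT", None
                    return ann
                break
        return None

def check_models(models, frames, extractors):
    annotations = []
    for frame_no, frame in enumerate(frames):
        # Extract only the features some model currently needs.
        needed = set().union(*(m.needed_features() for m in models))
        cues = {f: extractors[f](frame) for f in needed}
        for m in models:
            ann = m.step(frame_no, cues)
            if ann:
                annotations.append(ann)
    return annotations
```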

6. Experimental results

Experimental results are presented for highlights of different sports, in particular for team games (soccer, basketball, rugby, volleyball) and for competitions (swimming, cycling track). Test videos were acquired from Digital Video (DV) tapes, recorded at full PAL resolution and 25 fps. To assess the validity of the approach in terms of detection rates and robustness w.r.t. the different styles enforced by broadcasters, video sequences were selected from European and world championships of the last few years, broadcast by Canal+, Sky, BBC and RAI (Italian television); the total duration of videos for each sport is about 1 hour for soccer, 20 minutes for swimming and cycling, and 30 minutes for basketball, volleyball and rugby. A manual annotation was first performed (using visual inspection and common sports definitions) to obtain a ground truth on the whole dataset. We then applied the automatic detection system, comparing results with the ground truth data. Tables 2 and 3 report the results for highlights defined using only playfield zone and camera motion information, for the team games and competition sports classes, respectively. It can be seen that even these somewhat loosely defined models show good recognition rates, confirming the validity of the two main cues introduced so far. However, a few cases (such as turnover in soccer and transition in basketball) also display high false detection rates. We believe that for these cases other cues should be introduced, to discriminate between these highlights and other less interesting situations. In a second set of experiments, we also used visual cues related to information on the position of athletes. These have been used in particular to recognize several types of "free


Table 2. Classification results for highlights modeled with playfield zone and camera motion information only (team games).

                        Detect.   Correct   Miss.   False
Soccer forward launch      36        32       1       4
Soccer shot on goal        18        14       1       4
Soccer turnover            20        10       3      10
Rugby drop kick            11        10       2       1
Basketball fast break      12         9       1       3
Basketball transition      14         9       2       5
Volley serve               13        12       3       1

Table 3. Classification results for highlights modeled with playfield zone and camera motion information only (competition sports).

                        Detect.   Correct   Miss.   False
Swimming start              9         7       2       2
Swimming turn               5         4       1       0
Swimming arrival            9         8       1       1
Cycling track lap           6         6       2       0
Cycling track sprint        3         3       1       0

Table 4. Classification results for highlights that use cues on the typical deployment of players (team games).

                        Detect.   Correct   Miss.   False
Soccer free kick            9         7       2       2
Soccer penalty kick         5         4       1       0
Soccer corner kick          9         8       1       1
Soccer kick off             9         8       1       1
Basketball free throw      10         8       0       2
Rugby line-out              7         5       1       2
Rugby kick off              6         6       2       0

shots" in team games. Table 4 shows the recognition results. The higher precision is due to the precise measures obtained through homography estimation and the projection of players onto the playfield model.

7. Conclusions

In this paper we have presented a general approach to model and automatically detect semantically meaningful highlights in sport videos, based on finite state machines and


model checking. We described a set of visual cues that are common to classes of sports and that can be used to determine the transitions between the states of the FSMs. Although the proposed cues have proved to model several highlights in different sports appropriately, their adaptation to a specific sport type is fine-tuned by hand, with respect to a number of parameters and thresholds that must be derived from inspection of videos. Of these parameters, the most important are those related to the motion model and to the partition of the playfield. We are currently working on improving the selection of these parameters, making them learnable from examples, regardless of the sport type.

References

1. E. Andre, G. Herzog, and T. Rist, "On the simultaneous interpretation of real-world image sequences and their natural language description: The system SOCCER," in Proc. 8th European Conference on Artificial Intelligence (ECAI'88), 1988, pp. 449–454.
2. J. Assfalg, M. Bertini, A. Del Bimbo, W. Nunziati, and P. Pala, "Soccer highlights detection and recognition using HMMs," in Proc. Int'l Conf. on Multimedia and Expo (ICME 2002), Switzerland, 2002.
3. G. Baldi, C. Colombo, and A. Del Bimbo, "A compact and retrieval-oriented video representation using mosaics," in Proc. 3rd International Conference on Visual Information Systems (VISual99), LNCS, Springer, Amsterdam, The Netherlands, June 1999, pp. 171–178.
4. Y. Bengio, "Markovian models for sequential data," Neural Computing Surveys, Vol. 2, pp. 129–162, 1998.
5. M. Brand, N. Oliver, and A. Pentland, "Coupled hidden Markov models for complex action recognition," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, 1997.
6. A. Ekin, A. Murat Tekalp, and R. Mehrotra, "Automatic soccer video analysis and summarization," IEEE Transactions on Image Processing, Vol. 12, No. 7, pp. 796–807, 2003.
7. S.S. Intille and A.F. Bobick, "Recognizing planned, multi-person action," Computer Vision and Image Understanding, Vol. 81, No. 3, pp. 414–445, 2001.
8. M.I. Jordan, Learning in Graphical Models, MIT Press, Cambridge, 1999.
9. R. Leonardi and P. Migliorati, "Semantic indexing of multimedia documents," IEEE MultiMedia, Vol. 9, No. 2, pp. 44–51, 2002.
10. M. Mottaleb and G. Ravitz, "Detection of plays and breaks in football games using audiovisual features and HMM," in Proc. Ninth Int'l Conf. on Distributed Multimedia Systems, Sept. 2003, pp. 154–160.
11. V. Pavlovic, R. Sharma, and T. Huang, "Visual interpretation of hand gestures for human-computer interaction: A review," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, 1997.
12. L.R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, Vol. 77, No. 2, pp. 257–286, 1989.
13. S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, Englewood Cliffs, NJ, 1995.
14. G. Sudhir, J.C.M. Lee, and A.K. Jain, "Automatic classification of tennis video for high-level content-based retrieval," in Proc. Int'l Workshop on Content-Based Access of Image and Video Databases (CAIVD'98), 1998, pp. 81–90.
15. V. Tovinkere and R.J. Qian, "Detecting semantic events in soccer games: Towards a complete solution," in Proc. Int'l Conf. on Multimedia and Expo (ICME 2001), 2001, pp. 1040–1043.
16. L. Xie, P. Xu, S.-F. Chang, A. Divakaran, and H. Sun, "Structure analysis of soccer video with domain knowledge and hidden Markov models," in Proc. IEEE Int'l Conference on Acoustics, Speech, and Signal Processing (ICASSP'02), May 2002, pp. 4096–4099.
17. W. Zhou, A. Vellaikal, and C.C.J. Kuo, "Rule-based video classification system for basketball video indexing," in Proc. ACM Multimedia 2000 Workshop, 2000, pp. 213–216.