Comparative Study of Methods for Recognizing an Unknown Person's Action from a Video Sequence

Takayuki Hori*a, Jun Ohya a, Jun Kurumisawa b

a Graduate School of Global Information and Telecommunication Studies, Waseda University, 1011 Okuboyama, Nishi-Tomida, Honjo-shi, Saitama, 367-0035 Japan;
b Dept. of Policy Informatics, Chiba University of Commerce, 1-3-1 Kounodai, Ichikawa-shi, Chiba, 272-8512 Japan

ABSTRACT

This paper proposes a tensor decomposition based method that can recognize an unknown person’s action from a video sequence, where the unknown person is not included in the database (tensor) used for the recognition. The tensor consists of persons, actions and time-series image features. For the observed unknown person’s action, one of the actions stored in the tensor is assumed. Using the motion signature obtained from the assumption, the unknown person’s actions are synthesized. The actions of one of the persons in the tensor are replaced by the synthesized actions. Then, the core tensor for the replaced tensor is computed. This process is repeated for the actions and persons. For each iteration, the difference between the replaced and original core tensors is computed. The assumption that gives the minimal difference is the action recognition result. As the time-series image feature stored in the tensor and extracted from the observed video sequence, a feature based on the contour shape of the human body silhouette is used. To show its validity, the proposed method is experimentally compared with the Nearest Neighbor rule and a Principal Component Analysis based method. Experiments using seven kinds of actions performed by 33 persons show that the proposed method achieves better recognition accuracies for the seven actions than the other methods.

Keywords: Computer Vision, Human Motion Recognition, Human Motion Analysis, Tensor Decomposition, N-mode SVD, Motion Signature, Core Tensor

1. INTRODUCTION

Human motion analysis continues to be an increasingly active research area in computer vision and computer graphics. It can be applied to surveillance, as described in [1]. Surveillance research is being propelled by the increased availability of inexpensive computing power and image sensors, as well as by the inefficiency of manual surveillance and monitoring systems. Growing awareness of security has produced many surveillance applications. The more classical types of surveillance concern the automatic monitoring and understanding of locations through which a large number of people pass, such as airports and subways. The task is still challenging, because the research area contains a number of hard problems such as pose inference and self-occlusion.

The study of motion in image sequences is a typical research topic in computer vision. Motion is a powerful feature of image sequences, revealing the dynamics of scenes by relating spatial image features to temporal changes. The task of motion analysis, in particular human motion recognition, remains a challenging and fundamental problem of computer vision.

There are several approaches to human motion recognition. From sequences of 2D images, optical flows can be computed [2]. Previous work of this kind was oriented to motion estimation of a rigid body, whereas the human body is non-rigid and moves around; it can therefore be difficult to apply optical flow based approaches to human motion. A new approach should treat the human body as an articulated chain and also as an elastic object, which requires more sophisticated models and more complex algorithms. Hidden Markov Models (HMMs), which had been successfully used for speech recognition, were applied to recognizing human actions from a video sequence [3]. HMMs can handle changes in the time-length of actions to some extent, but cannot achieve good recognition accuracy for unknown persons, who were not used in the learning procedure needed for constructing the HMMs for each recognition category [3].

Image Processing: Algorithms and Systems VII, edited by Jaakko T. Astola, Karen O. Egiazarian, Nasser M. Nasrabadi, Syed A. Rizvi, Proc. of SPIE-IS&T Electronic Imaging, SPIE Vol. 7245, 72450V © 2009 SPIE-IS&T · CCC code: 0277-786X/09/$18 · doi: 10.1117/12.805745 SPIE-IS&T/ Vol. 7245 72450V-1

According to Vasilescu [4], people have characteristic motion signatures that individualize their movements. These characteristics are analogous to handwritten signatures, so the signatures can be extracted from example motions. The ability to perceive motion signatures seems well grounded from an evolutionary perspective. Vasilescu’s approach addresses the motion analysis, synthesis and recognition problem using techniques from numerical statistics. The mathematical basis of the approach is a technique known as n-mode analysis, which was first proposed by Tucker [5] and subsequently developed by Kapteyn et al. [6, 7], among others. This multilinear analysis subsumes as special cases the simple, linear (1-factor) analysis associated with conventional SVD (singular value decomposition) and principal components analysis (PCA), as well as the incrementally more general bilinear (2-factor) analysis that has recently been investigated in the context of computer vision [8]. Subsuming conventional linear analysis as a special case, multilinear analysis emerges as a unifying mathematical framework suitable for addressing a variety of computer vision problems [9].

Within Vasilescu’s framework, corpora of motion capture data spanning multiple people and actions are best organized as higher-order arrays or tensors, which define multilinear operators over a set of vector spaces. Unlike the matrix case, for which the existence and uniqueness of the singular value decomposition (SVD) is assured, the situation for higher order tensors is not as simple. There are multiple ways to orthogonally decompose tensors [10]. However, one multilinear extension of the matrix SVD to tensors is most natural. Vasilescu’s approach applies this N-mode SVD to extract human motion signatures among the other constitutive factors inherent in human movement. Vasilescu’s method was used to synthesize a simple stair ascending-descending walk.

Concerning motion recognition, Vasilescu’s method can achieve the following two recognition tasks:

(1) To identify who, among the known persons, performed a known action.
(2) To recognize a known person’s unknown action (one of the actions to be recognized).

Therefore, Vasilescu’s method cannot recognize the action performed by an unknown person, who is not included in the database used for the recognition process. A further constraint was that human motion was measured by a motion capture system, not by a computer vision based approach.

The aim of this paper is to accurately classify the action being performed by an unknown person from a video sequence using a computer vision based approach. Before the recognition process, the known persons’ actions are observed by a camera, and image features are extracted from each frame of the video sequences. The tensor, which consists of persons, actions and time-series image features, is then constructed by storing the observed data of the known persons. During the recognition process, an unknown person’s action is observed by the camera, and one of the actions stored in the tensor is assumed. This assumption is needed for computing the motion signature. Using the motion signature, the unknown person’s actions are synthesized. The actions of one of the persons in the tensor are replaced by the synthesized actions. Then, the core tensor for the replaced tensor is computed. This process is repeated for the actions and persons. For each iteration, the difference between the replaced and original core tensors is computed. The assumption that gives the minimal difference is the action recognition result.

In this paper, we explore the effectiveness of the above-mentioned recognition method using the Lt-s Feature [13], which is a contour shape based feature. More specifically, we compare our proposed method with the Nearest Neighbor rule and a Principal Component Analysis based method.

The rest of this paper is organized as follows. Section 2 explains the tensor decomposition method, in particular the mathematics used by our proposed recognition method. Section 3 elaborates on our proposed recognition method. Section 4 explains the above-mentioned image features. Section 5 shows the experimental results with discussions. Section 6 concludes this paper.

2. TENSOR DECOMPOSITION

Basically, tensors are a generalization of the concept of a vector. A tensor can be considered to be a multidimensional or N-way array of data and as such is useful for the description of higher order quantities, e.g. multivariate data [4]. In this paper, we denote vector quantities by bold lower case letters ($\mathbf{a}, \mathbf{b}$), scalar quantities by lower case letters ($a, b$), matrices by bold upper case letters ($\mathbf{A}, \mathbf{B}$), and tensor quantities by calligraphic letters ($\mathcal{A}, \mathcal{B}$). Unless explicitly stated otherwise, throughout this paper $i, j$ refer to indices (counters) and $I, J, K, L$ denote index upper bounds.

In multilinear algebra an Nth order tensor is written as $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}$, and its elements are indexed as $a_{i_1, i_2, \dots, i_N}$. An Nth order tensor has N mode spaces. For example, in the case of a matrix (N = 2), two mode spaces exist: a row space and a column space. In tensor terminology a matrix can be defined in terms of a set of mode-1 vectors (column vectors) or as a set of mode-2 vectors (row vectors), e.g., the column-wise mode-1 representation $\mathbf{B} = [\mathbf{b}_{j_1} \dots \mathbf{b}_{j_N}]$, where an element of the matrix, $B_{ij}$, has a row index $i$ and a column index $j$. In the case of a third order tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ (N = 3), three mode spaces exist, where mode-1 corresponds to the column space, mode-2 to the row space, and mode-3 to the depth space.

2.1 Tensor Unfolding

Deriving the N-mode SVD requires an appropriate generalization of the link between the column (row) vectors and the left (right) singular vectors of a matrix. To formalize this idea, we define the “matrix unfolding” of a given tensor, i.e., matrix representations of that tensor in which all the column vectors are stacked one after the other [4]. A tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}$ can be represented in matrix form, $\mathbf{A}_{(n)}$, which is the result of unfolding (flattening) the tensor along mode $n$, where $n = 1, 2, \dots, N$. Tensor unfolding can be considered as splitting a tensor into mode-n vectors and rearranging these vectors column-wise to form a matrix. Fig. 1 shows how a 3rd order tensor is unfolded along the mode-1 ($I_1$), mode-2 ($I_2$) and mode-3 ($I_3$) dimensions to form matrices $\mathbf{A}_{(1)}$ of size $I_1 \times I_2 I_3$, $\mathbf{A}_{(2)}$ of size $I_2 \times I_3 I_1$ and $\mathbf{A}_{(3)}$ of size $I_3 \times I_1 I_2$.

[Figure 1 panels: Mode-1 (front-back), Mode-2 (top-bottom), Mode-3 (left-right)]
Figure 1: A tensor can be unfolded in three ways to obtain matrices comprising its mode-1, mode-2 or mode-3 vectors.
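As an illustration of the unfolding just described, the following is a minimal NumPy sketch, not the authors' implementation. The helper names `unfold` and `fold` are hypothetical, modes are numbered from 0 rather than 1, and the column ordering produced by `reshape` is an assumption that may differ from the ordering drawn in Fig. 1. Only the matrix sizes and the row space are guaranteed to match.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-`mode` unfolding: rows index the chosen mode, columns all others."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def fold(matrix, mode, shape):
    """Inverse of `unfold` for a tensor whose full shape is `shape`."""
    rest = [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(matrix.reshape([shape[mode]] + rest), 0, mode)

# Third-order example with I1 = 2, I2 = 3, I3 = 4.
A = np.arange(24, dtype=float).reshape(2, 3, 4)
shapes = [unfold(A, n).shape for n in range(3)]
print(shapes)  # sizes I1 x I2*I3, I2 x I3*I1, I3 x I1*I2
```

Folding an unfolded tensor with the same mode and shape returns the original array, which is what allows a result computed on the matrix form to be carried back to tensor form.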

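2.2 Tensor Decomposition

In tensor notation the mode-n product between a tensor $\mathcal{A}$ and a matrix $\mathbf{M}$ is expressed as $\mathcal{B} = \mathcal{A} \times_n \mathbf{M}$. In terms of tensor unfolding this can be solved as:

$$\mathbf{B}_{(n)} = \mathbf{M}\,\mathbf{A}_{(n)} \tag{1}$$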
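Eq. (1) can be turned directly into code: unfold, multiply, fold back. The sketch below assumes NumPy and a hypothetical helper name `mode_n_product`, with modes numbered from 0. The `einsum` call is used only as an independent check of the result.

```python
import numpy as np

def mode_n_product(tensor, matrix, mode):
    """Return tensor x_mode matrix, computed through B_(n) = M A_(n)."""
    moved = np.moveaxis(tensor, mode, 0)               # bring mode n to the front
    unfolded = moved.reshape(tensor.shape[mode], -1)   # A_(n)
    product = matrix @ unfolded                        # B_(n) = M A_(n)
    new_shape = (matrix.shape[0],) + moved.shape[1:]
    return np.moveaxis(product.reshape(new_shape), 0, mode)  # fold back

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3, 4))
M = rng.standard_normal((5, 3))      # acts on the second mode (size 3)
B = mode_n_product(A, M, 1)

# Independent check: B[i, m, k] = sum_j M[m, j] * A[i, j, k]
reference = np.einsum('ijk,mj->imk', A, M)
```

The product replaces the size of the chosen mode by the number of rows of the matrix, so here a 2 × 3 × 4 tensor becomes 2 × 5 × 4.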

where $\mathbf{A}_{(n)}$ is the matrix that results from unfolding the tensor $\mathcal{A}$ in direction n (mode-n), and the tensor $\mathcal{B}$ is found by folding the matrix $\mathbf{B}_{(n)}$ back into tensor representation. As stated previously, a matrix has two associated modes, a vector row space and a vector column space. Application of SVD to a matrix $\mathbf{B}$ decomposes the matrix into the product of an orthogonal column space $\mathbf{U}_1$, a diagonal singular value matrix $\mathbf{\Pi}$ and an orthogonal row space $\mathbf{U}_2$, which is written as:

$$\mathbf{B} = \mathbf{U}_1 \mathbf{\Pi} \mathbf{U}_2^{\top} \tag{2}$$

Using the mode-n product, Eq. (2) can be rewritten without the need of a generalized transpose as:

$$\mathbf{B} = \mathbf{\Pi} \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \tag{3}$$
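The equivalence of Eq. (2) and Eq. (3) can be checked numerically. This is a small sketch under the same assumptions as before (NumPy, a hypothetical `mode_n_product` helper, 0-based modes): a matrix is treated as a second-order tensor and rebuilt from its singular value matrix by two mode products.

```python
import numpy as np

def mode_n_product(tensor, matrix, mode):
    """Return tensor x_mode matrix, computed through B_(n) = M A_(n)."""
    moved = np.moveaxis(tensor, mode, 0)
    product = matrix @ moved.reshape(tensor.shape[mode], -1)
    return np.moveaxis(product.reshape((matrix.shape[0],) + moved.shape[1:]), 0, mode)

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 3))

# Standard SVD, Eq. (2): B = U1 Pi U2^T
U1, s, U2_t = np.linalg.svd(B, full_matrices=False)
Pi, U2 = np.diag(s), U2_t.T

# Mode-product form, Eq. (3): B = Pi x1 U1 x2 U2 (modes 0 and 1 in code)
rebuilt = mode_n_product(mode_n_product(Pi, U1, 0), U2, 1)
```

The mode-1 product multiplies from the left (`U1 @ Pi`) and the mode-2 product multiplies by the transpose from the right (`... @ U2.T`), which is why no explicit transpose appears in Eq. (3).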

For tensors, standard SVD cannot be utilized; therefore, for N > 2, HOSVD (Higher-Order SVD, alternatively known as N-mode SVD) can be used. Like SVD, which decomposes a matrix into two orthogonal spaces and a singular value matrix, HOSVD decomposes a tensor into N orthogonal mode spaces $\mathbf{U}_1, \mathbf{U}_2, \dots, \mathbf{U}_N$ and a core tensor $\mathcal{Z}$. Using HOSVD [12], a tensor can be represented as the mode-n product between these N orthogonal subspaces and the core tensor $\mathcal{Z}$; the core tensor itself is given by Eq. (4).

$$\mathcal{Z} = \mathcal{A} \times_1 \mathbf{U}_1^{\top} \times_2 \mathbf{U}_2^{\top} \dots \times_n \mathbf{U}_n^{\top} \dots \times_N \mathbf{U}_N^{\top} \tag{4}$$

The core tensor, $\mathcal{Z}$, governs the interactions between the subspace (mode) matrices and is analogous to the singular value matrix of standard SVD; however, it does not have a diagonal structure and is a full tensor. For example, HOSVD on a 3rd order tensor (N = 3) decomposes the tensor into 3 orthogonal mode spaces and a core tensor of order 3, as illustrated in Fig. 2.

[Figure 2 panels: full-size tensor and matrix; reduced tensor and matrix]
Figure 2: An N-mode SVD orthogonalizes the N vector spaces associated with an order-N tensor (N=3).

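The HOSVD algorithm for tensor decomposition as presented in [4] is given as:

For n = 1 to N:
1. Unfold the tensor $\mathcal{A}$ along dimension n to find the matrix $\mathbf{A}_{(n)}$.
2. Apply SVD to the matrix $\mathbf{A}_{(n)}$.
3. Set $\mathbf{U}_n$ to the left-hand column space matrix of the SVD.

Solve the core tensor using the equation:

$$\mathcal{D} = \mathcal{Z} \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \dots \times_n \mathbf{U}_n \dots \times_N \mathbf{U}_N \tag{5}$$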
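The steps above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation used in the paper: the function names are hypothetical, modes are 0-based, the thin SVD (`full_matrices=False`) is a choice made here, and the core tensor is obtained by projecting onto the transposed mode matrices as in Eq. (4).

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-`mode` unfolding (column ordering is an implementation choice)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_n_product(tensor, matrix, mode):
    """Return tensor x_mode matrix, computed through B_(n) = M A_(n)."""
    moved = np.moveaxis(tensor, mode, 0)
    product = matrix @ moved.reshape(tensor.shape[mode], -1)
    return np.moveaxis(product.reshape((matrix.shape[0],) + moved.shape[1:]), 0, mode)

def hosvd(tensor):
    """Return (core tensor Z, [U_1, ..., U_N]) for an order-N tensor."""
    factors = []
    for n in range(tensor.ndim):
        # Steps 1-3: unfold, apply SVD, keep the left singular vectors.
        U, _, _ = np.linalg.svd(unfold(tensor, n), full_matrices=False)
        factors.append(U)
    core = tensor
    for n, U in enumerate(factors):
        core = mode_n_product(core, U.T, n)   # Eq. (4)
    return core, factors

def reconstruct(core, factors):
    """Rebuild the tensor from core and mode matrices, Eq. (5)."""
    out = core
    for n, U in enumerate(factors):
        out = mode_n_product(out, U, n)
    return out

rng = np.random.default_rng(0)
D = rng.standard_normal((4, 3, 6))            # toy persons x actions x features
Z, (U1, U2, U3) = hosvd(D)
D_rebuilt = reconstruct(Z, [U1, U2, U3])
```

With no truncation of the mode matrices, the reconstruction reproduces the input tensor up to floating-point error, and each mode matrix has orthonormal columns.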

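2.3 Motion Synthesis

Suppose motion sequences of several people are given. Vasilescu defines a data set tensor $\mathcal{D}$ of size H × E × G, where H is the number of people, E is the number of action classes, and G is the length of the time-series data. We apply the N-mode SVD algorithm to decompose this tensor as $\mathcal{D} = \mathcal{Z} \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \times_3 \mathbf{U}_3$. Denoting $\mathbf{U}_1, \mathbf{U}_2, \mathbf{U}_3$ as $\mathbf{P}, \mathbf{A}, \mathbf{T}$ respectively, the core tensor follows from Eq. (4) as:

$$\mathcal{Z} = \mathcal{D} \times_1 \mathbf{P}^{\top} \times_2 \mathbf{A}^{\top} \times_3 \mathbf{T}^{\top} \tag{6}$$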

The people matrix $\mathbf{P} = [\mathbf{p}_1 \dots \mathbf{p}_n \dots \mathbf{p}_H]^{\top}$, whose person-specific row vectors $\mathbf{p}_n^{\top}$ span the space of person parameters, encodes the per-person invariances across actions. Thus $\mathbf{P}$ contains the human motion signatures. The action matrix $\mathbf{A} = [\mathbf{a}_1 \dots \mathbf{a}_m \dots \mathbf{a}_E]^{\top}$, whose action-specific row vectors $\mathbf{a}_m^{\top}$ span the space of action parameters, encodes the invariances for each action across different people. The time sequence matrix $\mathbf{T}$ contains the image features. The tensor

$$\mathcal{B} = \mathcal{Z} \times_2 \mathbf{A} \times_3 \mathbf{T} \tag{7}$$

contains a set of basis matrices for all the motions associated with particular actions. The tensor

$$\mathcal{C} = \mathcal{Z} \times_1 \mathbf{P} \times_3 \mathbf{T} \tag{8}$$

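contains a set of basis matrices for all the motions associated with particular people. After extracting $\mathcal{Z}$, $\mathbf{A}$ and $\mathbf{T}$ we have a generative model that can observe motion data $\mathcal{D}_u$ of an unknown (new) person performing one of these actions (action a) and synthesize the remaining actions for this unknown person. The signature $\mathbf{p}$ for the unknown person is solved as follows. The unknown person’s observed motion data $\mathcal{D}_u$, a 1 × 1 × T tensor, is represented by

$$\mathcal{D}_u = \mathcal{B}_a \times_1 \mathbf{p}^{\top} \tag{9}$$

In Eq. (9),

$$\mathcal{B}_a = \mathcal{Z} \times_2 \mathbf{a}_a^{\top} \times_3 \mathbf{T} \tag{10}$$

where $\mathbf{a}_a$ is the a-th vector of the action matrix $\mathbf{A} = [\mathbf{a}_1 \dots \mathbf{a}_E]^{\top}$. Flattening the tensor $\mathcal{D}_u$ in Eq. (9) in the people mode yields the matrix $\mathbf{D}_{u(\text{people})}$, actually a row vector, which can be denoted as $\mathbf{d}_u^{\top}$. Therefore, in terms of flattened tensors, Eq. (9) can be written as

$$\mathbf{d}_u^{\top} = \mathbf{p}^{\top} \mathbf{B}_{a(\text{people})} \tag{11}$$

where $\mathbf{p}$ is the motion signature, and $\mathbf{B}_{a(\text{people})}$ is the matrix obtained by flattening $\mathcal{B}_a$ in the people mode. The motion signature $\mathbf{p}$ for the unknown person is given by

$$\mathbf{p}^{\top} = \mathbf{d}_u^{\top} \mathbf{B}_{a(\text{people})}^{-1} \tag{12}$$

Using $\mathbf{p}$ and $\mathcal{B}$ in Eq. (7), all the E actions for the unknown person, $\mathcal{D}_p$, are synthesized as follows:

$$\mathcal{D}_p = \mathcal{B} \times_1 \mathbf{p}^{\top} \tag{13}$$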
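Eqs. (7) and (9) to (13) can be illustrated with a short NumPy sketch. It is a sketch under assumptions, not the authors' code: all function names are hypothetical, modes are 0-based, and the inverse in Eq. (12) is replaced by the Moore-Penrose pseudo-inverse (`np.linalg.pinv`) because the flattened basis matrix is in general not square. The check uses a person who is already in the toy tensor, for whom the recovered signature should equal that person's row of P.

```python
import numpy as np

def unfold(tensor, mode):
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_n_product(tensor, matrix, mode):
    moved = np.moveaxis(tensor, mode, 0)
    product = matrix @ moved.reshape(tensor.shape[mode], -1)
    return np.moveaxis(product.reshape((matrix.shape[0],) + moved.shape[1:]), 0, mode)

def hosvd(tensor):
    factors = [np.linalg.svd(unfold(tensor, n), full_matrices=False)[0]
               for n in range(tensor.ndim)]
    core = tensor
    for n, U in enumerate(factors):
        core = mode_n_product(core, U.T, n)
    return core, factors

def motion_basis(D):
    """Return the basis tensor B = Z x2 A x3 T of Eq. (7) and the people matrix P."""
    Z, (P, A, T) = hosvd(D)
    return mode_n_product(mode_n_product(Z, A, 1), T, 2), P

def signature(B, d_u, action):
    """Eq. (12) with a pseudo-inverse (an assumption made for this sketch)."""
    B_a = B[:, action, :]          # B_a of Eq. (10), flattened in the people mode
    return d_u @ np.linalg.pinv(B_a)

def synthesize(B, p):
    """Eq. (13): all E actions for signature p, returned with shape (E, G)."""
    return mode_n_product(B, p[np.newaxis, :], 0)[0]

rng = np.random.default_rng(0)
D = rng.standard_normal((4, 3, 10))     # toy data: 4 persons, 3 actions, 10 features
B, P = motion_basis(D)
p = signature(B, D[2, 1], action=1)     # observe person 2 performing action 1
D_p = synthesize(B, p)                  # synthesize all three actions
```

In this noise-free setting the synthesized actions coincide with the stored actions of person 2. For a genuinely new person the synthesis is only as good as the multilinear model's ability to represent that person.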

3. RECOGNIZING AN UNKNOWN PERSON’S ACTION

Given motion sequences of people, the tensor $\mathcal{D}$ is constructed, where H (rows) is the number of people, E (columns) is the number of action classes, and G (depth) is the number of sequence samples (time-series image features), as shown in Fig. 3. Since the unknown person whose action is observed is not included in the tensor $\mathcal{D}$, the basic idea of our proposed recognition method is to find, in the tensor $\mathcal{D}$, the action most similar to the observed action. Note that the Nearest Neighbor rule appears suitable for this aim, but the experimental results presented in Section 5 show the necessity of a better method.

[Figure 3 axis label: Action Classes (A1, A2, ..., An, ..., AE)]
Figure 3: Motion Database Structure.

As shown in the flow chart of the proposed algorithm in Fig. 4, $\mathcal{D}_u$, a 1 × 1 × T tensor, is the observed motion data of the unknown person. The core tensor $\mathcal{Z}$ of the tensor $\mathcal{D}$ is computed by Eq. (6). The action of $\mathcal{D}_u$ is not known in advance, because it is the action to be recognized. Therefore, the action of $\mathcal{D}_u$ is assumed to be j (j = 1, 2, ..., E), which is one of the E actions to be recognized. Then, by replacing a in Eq. (10) with the assumed action j, the following equation is obtained.

$$\mathcal{B}_j = \mathcal{Z} \times_2 \mathbf{a}_j^{\top} \times_3 \mathbf{T} \tag{14}$$

Since $\mathcal{D}_u$ is a row vector, it can be denoted as $\mathbf{d}_u^{\top}$. The motion signature $\mathbf{q}$ for the unknown person’s assumed action j is obtained by replacing a in Eq. (12) with j, as follows.

$$\mathbf{q}^{\top} = \mathbf{d}_u^{\top} \mathbf{B}_{j(\text{people})}^{-1} \tag{15}$$

By substituting the obtained motion signature $\mathbf{q}$ into Eq. (13), all of this unknown person's actions, $\mathcal{D}_q$, are synthesized by

$$\mathcal{D}_q = \mathcal{B} \times_1 \mathbf{q}^{\top} \tag{16}$$

The recognition result is found by searching for the best assumption $j^*$ among the E actions for the action of $\mathcal{D}_u$. For this, we need an evaluation criterion. This paper utilizes the core tensor for this aim, because similar tensors should have similar core tensors. However, if we construct a new tensor $\mathcal{D}'$ by simply appending the synthesized actions $\mathcal{D}_q$ to the original tensor $\mathcal{D}$, the core tensor $\mathcal{Z}'$ of the appended tensor $\mathcal{D}'$ and the original core tensor $\mathcal{Z}$ of $\mathcal{D}$ cannot be compared directly, because the sizes of $\mathcal{Z}$ and $\mathcal{Z}'$ are different. To solve this issue, the E actions of one person i (i = 1, 2, …, H) in the original tensor $\mathcal{D}$ are replaced by the synthesized actions $\mathcal{D}_q$, so that the new (replaced) tensor $\mathcal{D}''$ and the original tensor $\mathcal{D}$ have the same size. For each of the E action assumptions j, this replacement is repeated for all the H persons in $\mathcal{D}$. For each of the HE replacements, the core tensor $\mathcal{Z}''$ of the replaced tensor $\mathcal{D}''$ is computed, and the difference between $\mathcal{Z}$ and $\mathcal{Z}''$ is computed as the evaluation criterion. Specifically, the difference Dif is computed by the following equation as the sum of the absolute values of the element-by-element differences between the two core tensors.

$$\mathit{Dif} = \sum_{k=1}^{H} \sum_{l=1}^{E} \sum_{m=1}^{G} \left| Z''_{klm} - Z_{klm} \right| \tag{17}$$

The replacement that gives the minimal value of the difference Dif is expected to correspond to the case in which the synthesized actions are very similar to the replaced actions. Thus, the assumed action $j^*$ for this particular replacement is determined as the recognition result. Note that the person $i^*$ whose actions are the most similar to the observed action $\mathcal{D}_u$ can also be known from this particular replacement. This algorithm can be performed according to the following procedure as shown in Fig. 4.

[Flow chart of Fig. 4; recoverable box labels: Start; Load tensor database D; Compute core tensor Z of D; Initialize i = 0, j = 0 (i: person, j: action), Diffmin = large number; Compute motion signature q using j; Synthesize all actions Dq using the motion signature; Replace i's actions by Dq; Compute new core tensor Z"; Dif < Diffmin?; Diffmin = Dif, j* = j; i = i + 1]
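The procedure described in this section can be sketched end to end as follows. This is an illustrative NumPy sketch under several assumptions and is not the authors' implementation: function names are hypothetical, modes are 0-based, the inverse of Eq. (15) is replaced by a pseudo-inverse, and the sign of every singular vector is fixed by a convention added here (largest-magnitude entry positive) so that core tensors of different tensors are comparable. The toy check observes a person who is already in the tensor, which only demonstrates the mechanics of the search, not accuracy on unknown persons.

```python
import numpy as np

def unfold(tensor, mode):
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_n_product(tensor, matrix, mode):
    moved = np.moveaxis(tensor, mode, 0)
    product = matrix @ moved.reshape(tensor.shape[mode], -1)
    return np.moveaxis(product.reshape((matrix.shape[0],) + moved.shape[1:]), 0, mode)

def hosvd(tensor):
    factors = []
    for n in range(tensor.ndim):
        U, _, _ = np.linalg.svd(unfold(tensor, n), full_matrices=False)
        # Assumption (not in the paper): make each column's largest-magnitude
        # entry positive, removing the sign ambiguity of the SVD.
        pivot = np.abs(U).argmax(axis=0)
        U = U * np.sign(U[pivot, np.arange(U.shape[1])])
        factors.append(U)
    core = tensor
    for n, U in enumerate(factors):
        core = mode_n_product(core, U.T, n)
    return core, factors

def recognize(D, d_u):
    """Return (minimal Dif, assumed action j*, replaced person i*)."""
    H, E, _ = D.shape
    Z, (P, A, T) = hosvd(D)                                  # core tensor of D
    B = mode_n_product(mode_n_product(Z, A, 1), T, 2)        # Eq. (7)
    best = (np.inf, None, None)
    for j in range(E):                                       # assume action j
        q = d_u @ np.linalg.pinv(B[:, j, :])                 # Eq. (15), pinv assumed
        D_q = mode_n_product(B, q[np.newaxis, :], 0)[0]      # Eq. (16), shape (E, G)
        for i in range(H):                                   # replace person i
            D_rep = D.copy()
            D_rep[i] = D_q
            Z_rep, _ = hosvd(D_rep)                          # new core tensor Z''
            dif = np.abs(Z_rep - Z).sum()                    # Eq. (17)
            if dif < best[0]:
                best = (dif, j, i)
    return best

rng = np.random.default_rng(0)
D = rng.standard_normal((4, 3, 10))          # toy data: 4 persons, 3 actions
dif_min, j_star, i_star = recognize(D, D[2, 1])
```

The search computes one HOSVD per (action, person) pair, i.e. HE decompositions per observation, so its cost grows with the size of the database.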