Human Activity Recognition using Temporal Templates and Unsupervised Feature-Learning Techniques

A project report submitted to
MANIPAL UNIVERSITY
For Partial Fulfillment of the Requirement for the Award of the Degree of
Bachelor of Engineering in Information Technology

by
Arpit Jain (Reg. No. 090911345)
Satakshi Rana (Reg. No. 090911281)

Under the guidance of
Dr. Sanjay Singh
Associate Professor
Department of Information and Communication Technology
Manipal Institute of Technology
Manipal, India

May 2013

I dedicate my thesis to my parents for their support and motivation.
Arpit Jain

For my parents and my sister, Shreyaa.
Satakshi Rana


DECLARATION

I hereby declare that this project work entitled Human Activity Recognition using Temporal Templates and Unsupervised Feature-Learning Techniques is original and has been carried out by us in the Department of Information and Communication Technology of Manipal Institute of Technology, Manipal, under the guidance of Dr. Sanjay Singh, Associate Professor, Department of Information and Communication Technology, M. I. T., Manipal. No part of this work has been submitted for the award of a degree or diploma either to this University or to any other University.

Place: Manipal
Date: 24-05-2013

Arpit Jain

Satakshi Rana


CERTIFICATE This is to certify that this project entitled Human Activity Recognition using Temporal Templates and Unsupervised Feature-Learning Techniques is a bonafide project work done by Mr. Arpit Jain and Ms. Satakshi Rana at Manipal Institute of Technology, Manipal, independently under my guidance and supervision for the award of the Degree of Bachelor of Engineering in Information Technology.

Dr. Sanjay Singh
Associate Professor
Department of I & CT
Manipal Institute of Technology
Manipal, India

Dr. Preetham Kumar
Head, Department of I & CT
Manipal Institute of Technology
Manipal, India


ACKNOWLEDGEMENTS

We hereby take the privilege to express our gratitude to all the people who were directly or indirectly involved in the execution of this work, without whom this project would not have been a success. We extend our deep gratitude to the lab assistants and Vinay sir for their co-operation and for helping us download the dataset for our project. We thank Dr. Sanjay Singh for his timely suggestions and guidance, and Dr. Preetham Kumar for being a constant source of inspiration. Our hearty thanks to Rahul, our classmates and teachers who have supported us in all possible ways. Our deepest gratitude to our family and friends who have been supportive all the while.


ABSTRACT

Computer vision includes methods for acquiring, processing, analyzing, and understanding images. Applications of this field include detecting events, controlling processes, navigation, modelling objects or environments, automatic inspection and many more. Activity recognition is one application of computer vision that aims to recognize the actions and goals of one or more agents from a series of observations of the agents' actions and the environmental conditions. The goal of the project is to train an algorithm to automate the detection and recognition of human activities performed in video data. The project can be utilized in scenarios such as surveillance systems, intelligent environments, sports play analysis and web-based video retrieval. In the first phase of the project, actions like bending, side galloping and hand waving were recognized using a temporal-template matching technique called the Motion History Image (MHI) methodology. The MHI method is widely used to represent the history of temporal changes involved in the execution of an activity: the intensity of the pixels of a scalar-valued image is varied depending upon the motion history. MHI suffers from a prominent drawback: when self-occluding or over-writing activities are encountered, it fails to generate a vivid description of the course of the activity, because in repetitive activities the current action overwrites or deletes the information of the previous action. In the project, we devised and implemented a methodology which overcomes this problem of conventional MHI. In our mechanism, we utilize the red, green and blue channels of three different 3-channel images to represent human movement. In the methodology used, feature extraction from the Motion History Image and the Motion Energy Image was performed using the seven Hu moments, and training and classification were then performed using the k-NN algorithm.


The second phase of the project focuses on a recent development in machine learning, known as "Deep Learning", which applies the concept of biologically inspired computer vision to the task of unsupervised feature learning. Stacked convolutional Independent Subspace Analysis (ISA) is used for unsupervised feature extraction on the Hollywood2 dataset. The features are then used to classify the videos in the dataset into their appropriate classes using a Support Vector Machine (SVM). The system is capable of recognizing 12 different activities performed by humans.


Contents

Acknowledgements                                              iv
Abstract                                                       v
List of Tables                                                ix
List of Figures                                               xi
Abbreviations                                                 xi
Notations                                                   xiii

1 Introduction                                                 1
  1.1 Literature Survey                                        2
  1.2 Report Organisation                                      5

2 Recognizing Human Activities using Temporal Templates       6
  2.1 Motion-Energy Image (MEI)                                7
  2.2 Motion History Image (MHI)                               8
  2.3 Feature Extraction                                       9
  2.4 Dataset                                                 11
  2.5 Training and Classification                             12
  2.6 Summary                                                 14

3 3-Channel Motion History Images                             15
  3.1 Failure of MHI in case of Repetitive Actions            15
  3.2 Improved MHI Methodology                                16
  3.3 Proposed Algorithm & Results Obtained                   20
  3.4 Summary                                                 24

4 Unsupervised Feature Learning for Activity Recognition      25
  4.1 Related Work                                            25
  4.2 PCA Whitening                                           26
  4.3 Independent Subspace Analysis                           27
  4.4 Stacked ISA for Video Domain                            29
  4.5 Experimental Setup                                      30
    4.5.1 Dataset                                             30
    4.5.2 Framework for Classification                        31
    4.5.3 Results                                             33
  4.6 Summary                                                 34

5 Snapshots                                                   35

6 Conclusion and Future Work                                  40

References                                                    42

Project Detail                                                45

List of Tables

2.1 Action Dataset                                            12
4.1 Hollywood2 Action Dataset                                 31
4.2 Performance of System                                     34

List of Figures

1.1 Approaches for Activity Recognition                        3
2.1 MEI for Hand Wave                                          7
2.2 Different cases of MHI formation for Hand Waving Activity  8
2.3 Feature Extraction from MHI and MEI                       12
3.1 MHI Failure: In case of Repetitive Action                 16
3.2 Initially when no activity is performed                   18
3.3 Activity starts for the first time                        19
3.4 Hands raised in the anti-clockwise direction              19
3.5 Actions performed in the clockwise direction              19
3.6 Hand raised upwards for the second time                   20
3.7 Graph 1                                                   21
3.8 Graph 2                                                   22
4.1 ISA Network                                               27
4.2 Stacked ISA                                               29
4.3 Stacked ISA for video data                                30
5.1 Graphical User Interface                                  35
5.2 Bending Activity                                          36
5.3 MHI and MEI for Bending Activity                          36
5.4 Side Galloping Activity                                   37
5.5 MHI and MEI for Side Galloping Activity                   37
5.6 Training Completion                                       37
5.7 Test Bending Activity                                     38
5.8 Test Side Galloping Activity                              38
5.9 Actions in Hollywood2 Dataset                             38
5.10 Compute Features                                         39
5.11 Classification Results                                   39

ABBREVIATIONS

MHI   Motion History Image
MEI   Motion Energy Image
ISA   Independent Subspace Analysis
SVM   Support Vector Machine
CNN   Convolutional Neural Network
PCA   Principal Component Analysis

NOTATIONS

I(x, y)     : Pixel value at (x, y) location
I(x, y, t)  : Pixel value of the t-th frame at (x, y) location
Mij         : Moment of order (i + j)
ηij         : Central moment of order (i + j)
xi          : i-th training sample
λi          : i-th eigenvalue
U           : Left-singular vector after singular value decomposition

Chapter 1

Introduction

Computer vision is a field that includes methods for acquiring, processing, analyzing, and understanding images and, in general, high-dimensional data from the real world in order to produce numerical or symbolic information, e.g., in the form of decisions [1]. A theme in the development of this field has been to duplicate the abilities of human vision by electronically perceiving and understanding an image. Applications range from tasks such as industrial machine vision systems which, say, inspect bottles speeding by on a production line, to research into artificial intelligence and computers or robots that can comprehend the world around them. Computer vision is concerned with the theory behind artificial systems that extract information from images. The image data can take many forms, such as video sequences, views from multiple cameras, or multi-dimensional data from a medical scanner. The classical problem in computer vision is that of determining whether or not the image data contains some specific object, feature, or activity. Activity recognition is one of the sub-fields of computer vision that aims to recognize the actions and goals of one or more agents from a series of observations of the agents' actions and the environmental conditions [2]. Human activity recognition focuses on accurate detection of human activities based on a pre-defined activity model. It can be exploited to great societal


benefits, especially in real-life, human-centric applications such as eldercare and healthcare [3].

1.1 Literature Survey

The various approaches for recognizing human activities are classified into single-layered approaches and hierarchical approaches [4]. Single-layered approaches are further classified into two types depending on how they model human activities: space-time approaches and sequential approaches. Space-time approaches view the input as a 3-D (XYT) volume, while sequential approaches view and interpret the input video as a sequence of observations. Space-time approaches are further divided into three categories based on the features they use from the 3-D space-time volumes: space-time volumes, trajectories and local interest point detectors. Sequential approaches are classified depending on whether they use exemplar-based or model-based recognition methodologies. In single-layered approaches, each activity corresponds to a cluster containing image sequences for that activity. These clusters are categorized into classes, each having some definite property, so when an input is given to the system, algorithms and methodologies like neighbour-based matching [5], template matching [6] and statistical modelling [7] are applied to categorize the input activity into its appropriate class. Hierarchical approaches work on the concept of divide and conquer, in which a complex problem is solved by dividing it into several sub-problems; the sub-activities are used to identify the main complex activity. These approaches are classified on the basis of the recognition methodologies they use: statistical, syntactic and description-based. Statistical approaches construct statistical state-based models, like the layered Hidden Markov Model (HMM), to represent and recognize high-level human activities [8].

Similarly, syntactic approaches use a grammar syntax such as a stochastic context-free grammar (SCFG) to model sequential activities [9]. Description-based approaches represent human activities by describing sub-events of the activities and their temporal, spatial and logical structures [10]. Fig. 1.1 summarizes this hierarchy-based taxonomy of the approaches used in human activity recognition.

Figure 1.1: Approaches for Activity Recognition

Ke et al. [11] used segmented spatio-temporal volumes to model human activities; their system applied hierarchical mean-shift to clusters of similarly coloured voxels to obtain several segmented volumes. The motivation is to find the actor volume segments automatically and to measure their similarity to the action model. Recognition is done by searching for a subset of over-segmented spatio-temporal volumes that best matches the space of the action model. Their system recognized simple actions such as hand waving and boxing from the KTH action database. Laptev [12] recognized human actions by extracting sparse spatio-temporal interest points from videos. They extended the local feature detectors (Harris) commonly used for object recognition in order to detect interest points in the space-time volume. Motion patterns such as a change in the direction of an object, splitting and merging of an image structure, and collision/bouncing of objects are detected as a result. In their work, these features were used to distinguish a walking person from complex backgrounds. Bobick and Davis [13] constructed a real-time action recognition system

using template matching. Instead of maintaining the 3-dimensional space-time volume of each action, they represented each action with a template composed of two 2-dimensional images: a binary Motion Energy Image and a scalar Motion History Image. The two images are constructed from a sequence of foreground images, which essentially are weighted 2-D projections of the original 3-D (XYT) volume. By applying a traditional template matching technique to a pair of (MEI, MHI), their system was able to recognise simple actions like sitting, arm waving and crouching. However, the MHI method suffers from a serious drawback: when self-occluding or overwriting actions are encountered it leads to severe recognition failure [14]. This failure results because, when a repetitive action is performed, the same pixel location is accessed multiple times, due to which the previously stored information in the pixel gets overwritten or deleted by the current action. In order to address this issue we implemented a novel technique for creating motion history images that are capable of representing self-occluding and overwriting actions. Our methodology overcomes the limitation of representing repetitive activities and thus improves on the conventional MHI method. Previous work on action recognition has focused on adapting hand-designed local features, such as SIFT [15] or HOG [16], from static images to the video domain. Andrew Y. Ng et al. [17] proposed a methodology to learn features directly from video data using unsupervised feature learning. They used an extension of the Independent Subspace Analysis algorithm to learn invariant spatio-temporal features from unlabelled video data. By replacing hand-designed features with learnt features they achieved classification results superior to previously published results on the Hollywood2 [18], UCF [19], KTH [20] and YouTube [21] action recognition datasets. In our project we have worked on Deep Learning techniques such as stacking and convolution to learn hierarchical representations. The hierarchical features are extracted using an unsupervised learning technique to classify 12 categories of human activities using Support

Vector Machines.

1.2 Report Organisation

The rest of the report is organized as follows:

• Chapter 2 gives a detailed description of activity recognition using a temporal-template matching technique, the Motion History Image (MHI) methodology. Construction of MHI and MEI images, feature extraction, and training of the algorithm on the Weizmann dataset are shown for classification of an input video sequence as a bending or side galloping activity.

• Chapter 3 discusses the drawbacks of the MHI method and proposes a novel approach to overcome them. The algorithm for the proposed approach and simulated results are provided.

• Chapter 4 gives an overview of previous hand-coded approaches used for activity recognition. Activity recognition on the Hollywood2 Action Dataset is then discussed using recent approaches to unsupervised feature learning. Stacked Independent Subspace Analysis (ISA) is trained for 12 different classes of activities and a framework for classification is provided. The results obtained and a comparison with previous results are given.

• Chapter 5 provides snapshots of the GUI and the system's performance results.

• Chapter 6 concludes the report and gives the future work.


Chapter 2

Recognizing Human Activities using Temporal Templates

The Motion History Image (MHI) methodology is a representation and recognition approach that decomposes motion-based recognition into first describing where there is motion (the spatial pattern) and then describing how the motion is moving. In the methodology, a binary Motion-Energy Image (MEI) and a scalar-valued Motion-History Image (MHI) are constructed. The MEI represents where motion has occurred in an image sequence and the MHI depicts how the motion has occurred. Taken together, the MEI and MHI can be considered a two-component version of a temporal template, a vector-valued image where each component of each pixel is some function of the motion at that pixel location. These view-specific templates are matched against stored models of views of known movements. The initial step in the construction of the MEI and MHI involves background subtraction and obtaining silhouettes. The captured image frame from the video is first converted into a grey-scale image using Eq. 2.1, where the pixel value Y′ of the grey image is determined from the RGB components of the original image:

Y′ = 0.2126R + 0.7152G + 0.0722B    (2.1)

At each step, the current grey image (src1) is compared with the previous grey image (src2). Their difference, stored in (dst), identifies the changes that help in detecting motion in the video. The difference is calculated using Eq. 2.2:

dst(I)c = |src1(I)c − src2(I)c|    (2.2)

Then a binary threshold is applied to construct the silhouette, which is the outline of the moving region, using Eq. 2.3:

dst(x, y) = maxvalue   if src(x, y) > threshold
          = 0          otherwise                                   (2.3)
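The pre-processing chain of Eqs. 2.1-2.3 can be sketched with OpenCV as below. This is a minimal illustration, not the project's actual code: the function name, threshold value and variable names are assumptions, and OpenCV's grey-scale conversion uses Rec. 601 weights, which differ slightly from the Rec. 709 weights of Eq. 2.1.

import cv2

def silhouette_from_frames(prev_gray, frame, thresh=30, maxvalue=255):
    """Hypothetical helper: grey-scale conversion, frame differencing and thresholding."""
    # Eq. 2.1 (approximately): convert the current frame to a grey-scale image
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Eq. 2.2: absolute difference between the current and previous grey images
    diff = cv2.absdiff(gray, prev_gray)
    # Eq. 2.3: binary threshold produces the silhouette of the moving region
    _, sil = cv2.threshold(diff, thresh, maxvalue, cv2.THRESH_BINARY)
    return gray, sil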

2.1 Motion-Energy Image (MEI)

Motion-Energy Images (MEI) are binary cumulative motion images that are computed from the start frame to the last frame of the video sequence. The MEI represents the region where movement occurred in the video data. Let I(x, y, t) be an image sequence and let D(x, y, t) be a binary image sequence indicating regions of motion; then the binary MEI is defined as:

E_τ(x, y, t) = ∪_{i=0}^{τ−1} D(x, y, t − i)    (2.4)

The duration τ is critical in defining the temporal extent of a movement. Figure 2.1 shows the MEI image for a hand waving activity.
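As a rough sketch of Eq. 2.4 (assuming the silhouettes are stored as a list of equally sized binary arrays; the function name is illustrative, not from the project), the MEI reduces to a logical OR over the last τ silhouette frames:

import numpy as np

def motion_energy_image(silhouettes, tau):
    """Union (logical OR) of the most recent `tau` binary silhouette frames (Eq. 2.4)."""
    recent = np.stack(silhouettes[-tau:], axis=0)      # shape: (tau, H, W)
    return recent.any(axis=0).astype(np.uint8) * 255   # binary MEI as a 0/255 image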

Figure 2.1: MEI for Hand Wave


2.2 Motion History Image (MHI)

In a motion history image, the intensity of each pixel is varied based on how recently an activity was performed at that location: the more recent the activity, the higher its intensity. This variation of intensities helps in finding the direction of motion and the course of the action. For the formation of the MHI image, each pixel's intensity at location (x, y) at time t is given by:

MHI(x, y, t) = ts                if silhouette(x, y, t) ≠ 0
             = 0                 if silhouette(x, y, t) = 0 and MHI(x, y, t−1) < (ts − d)      (2.5)
             = MHI(x, y, t−1)    otherwise

Here in Eq. 2.5:
MHI(x, y, t): value of the pixel at location (x, y) at time t
ts: timestamp denoting the current time
d: duration for which motion is tracked

Every traced pixel of the image will satisfy one of the four cases described below.

Figure 2.2: Different cases of MHI formation for Hand Waving Activity


• Case 1: Point 1 in Fig. 2.2 corresponds to a pixel where motion has just occurred in the current silhouette frame, i.e. silhouette(x, y, t) ≠ 0. Therefore the corresponding pixel in the MHI image is set to the value of the timestamp, i.e. MHI(x, y, t) = timestamp.

• Case 2: Point 2 represents a pixel where motion has not occurred in the current silhouette frame, i.e. silhouette(x, y, t) = 0, and whose pixel value at time t−1 is less than (timestamp − duration), i.e. MHI(x, y, t − 1) < (timestamp − duration); therefore the corresponding pixel in the MHI image is set to 0.

• Case 3: At the third point there was no motion in the current silhouette frame, but its previous value MHI(x, y, t − 1) is not less than (timestamp − duration), and therefore the pixel's previous value is retained in the MHI image. Since the timestamp increases continuously, the intensity of this pixel is relatively lower than that of point 1.

• Case 4: At point 4, motion never occurred throughout the execution of the activity, i.e. silhouette(x, y, t) = 0 for all values of the timestamp ts. Therefore the corresponding pixel's value in the MHI image is set to 0.
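The four cases can be captured in a few lines of array code. The sketch below is an assumed numpy formulation of Eq. 2.5 (function and variable names are illustrative, not from the project), with `mhi` and `sil` being same-sized arrays and `ts`, `d` the timestamp and duration:

import numpy as np

def update_mhi(mhi, sil, ts, d):
    """One per-frame update of the scalar MHI according to Eq. 2.5."""
    out = mhi.astype(np.float32).copy()
    moving = sil != 0
    out[moving] = ts                            # Case 1: motion in the current frame
    stale = (~moving) & (out < (ts - d))
    out[stale] = 0                              # Cases 2 and 4: too old, or never moved
    return out                                  # Case 3: remaining pixels keep their value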

2.3 Feature Extraction

Transforming the input data into a set of features is called feature extraction. If the features are carefully chosen, it is expected that the feature set will capture the relevant information from the input data, so that the desired task can be performed using this reduced representation instead of the full-size input. Feature extraction involves simplifying the amount of resources required to

describe a large set of data accurately. When performing analysis of complex data, one of the major problems stems from the number of variables involved. Analysis with a large number of variables generally requires a large amount of memory and computation power, or a classification algorithm which over-fits the training sample and generalizes poorly to new samples. Feature extraction is a general term for methods of constructing combinations of the variables to get around these problems while still describing the data with sufficient accuracy. The statistical descriptions of a set of MEI and MHI images for each view/movement combination are given to the system using moment-based features. An image moment is a particular weighted average (moment) of the image pixels' intensities, or a function of such moments, usually chosen to have some attractive property or interpretation. Simple properties of the image which are found via image moments include its area (or total intensity), its centroid, and information about its orientation. For an image with pixel intensities I(x, y), the raw image moments Mij are calculated by:

Mij = Σ_x Σ_y x^i y^j I(x, y)    (2.6)

Central moments are defined as:

μpq = Σ_x Σ_y (x − x̄)^p (y − ȳ)^q I(x, y)    (2.7)

where:

x̄ = M10 / M00    (2.8)
ȳ = M01 / M00    (2.9)

Moments ηij, where i + j ≥ 2, can be constructed to be invariant to both translation and changes in scale by dividing the corresponding central moment by the properly scaled (00)th moment, using the following formula:

ηij = μij / μ00^(1 + (i+j)/2)    (2.10)

It is possible to calculate moments which are invariant under translation, changes in scale, and also rotation. Most frequently used are the Hu set of invariant moments [22]:

I1 = η20 + η02                                                                    (2.11)
I2 = (η20 − η02)² + 4η11²                                                         (2.12)
I3 = (η30 − 3η12)² + (3η21 − η03)²                                                (2.13)
I4 = (η30 + η12)² + (η21 + η03)²                                                  (2.14)
I5 = (η30 − 3η12)(η30 + η12)[(η30 + η12)² − 3(η21 + η03)²]
     + (3η21 − η03)(η21 + η03)[3(η30 + η12)² − (η21 + η03)²]                      (2.15)
I6 = (η20 − η02)[(η30 + η12)² − (η21 + η03)²] + 4η11(η30 + η12)(η21 + η03)        (2.17)
I7 = (3η21 − η03)(η30 + η12)[(η30 + η12)² − 3(η21 + η03)²]
     − (η30 − 3η12)(η21 + η03)[3(η30 + η12)² − (η21 + η03)²]                      (2.18)

Using these seven Hu moments, seven features are extracted from each of the MHI and MEI images. The resulting 1 × 14 vector is given to the system for training and classification, as shown in Fig. 2.3.
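In OpenCV the same invariants are available directly via cv2.HuMoments; the helper below is a hedged sketch of how the 14-dimensional feature vector could be assembled (the function name and argument names are assumptions, not the project's code):

import cv2
import numpy as np

def hu_feature_vector(mhi, mei):
    """Seven Hu moments from the MHI plus seven from the MEI -> 1 x 14 vector."""
    f_mhi = cv2.HuMoments(cv2.moments(mhi.astype(np.float32))).flatten()
    f_mei = cv2.HuMoments(cv2.moments(mei.astype(np.float32))).flatten()
    return np.concatenate([f_mhi, f_mei])   # feature vector used for training/classification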

2.4 Dataset

The system was trained and tested on the Weizmann dataset [23] and on a dataset recorded in the Information and Communication Technology Department of MIT, Manipal. The Weizmann dataset contains seven training videos each for the bending and side galloping activities, and one test video for each activity. The ICT dataset contains three training videos and one test video for each activity. In total there are 20 training videos and 4 test videos for the two classes of actions, namely bending and side galloping. The training data used is summarized in Table 2.1.

Figure 2.3: Feature Extraction from MHI and MEI

Table 2.1: Action Dataset

  Dataset            Activity          Training Videos   Test Videos
  Weizmann Dataset   Bending                  7               1
                     Side Galloping           7               1
  Manipal Dataset    Bending                  3               1
                     Side Galloping           3               1
  All Samples                                20               4

2.5 Training and Classification

Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces


an inferred function, which can be used for mapping new examples. In the project, the training examples provided to the system form an n × 14 matrix, where n is the number of training videos and 14 is the number of features from the MHI and MEI images of each video. Along with this, an n × 1 matrix is provided that specifies the class of each training video. Therefore, corresponding to each training video, the 14 features of the video and its class label are given to the system. In pattern recognition, the k-nearest neighbour algorithm (k-NN) is a non-parametric method for classifying objects based on the closest training examples in the feature space [24]. An object is classified by a majority vote of its neighbours, with the object being assigned to the class most common amongst its k nearest neighbours (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its nearest neighbour. When an unlabelled input video is given to the system, MHI and MEI images are constructed for the video. From these MHI and MEI images, features are extracted using the seven Hu moments. The system then computes the Euclidean distance between the features of the input video and the features of the labelled training videos using the equation:

d(p, q) = d(q, p) = √((q1 − p1)² + (q2 − p2)² + · · · + (qn − pn)²)    (2.20)

Given the Euclidean distances between the unlabelled input video and the training videos, the training videos are arranged in ascending order of distance. A voting algorithm is then applied to the top k training videos to determine the class of the unlabelled input video sequence. The system is trained on the Weizmann dataset and tested on real-time video, and is successfully able to recognize bending and side-galloping activities.
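A minimal numpy sketch of this classification step is given below, assuming `train_X` is the n × 14 feature matrix, `train_y` the corresponding label vector and `query` the 14-dimensional feature vector of the test video (names are illustrative):

import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, query, k=3):
    """Majority vote over the k training videos closest in Euclidean distance (Eq. 2.20)."""
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))   # distance to every training video
    nearest = np.argsort(dists)[:k]                          # indices of the k nearest neighbours
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]                        # most common class label wins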


2.6 Summary

The chapter provided a detailed overview of MHI methodology, feature extraction, training and classification modules involved in the activity recognition from video sequences. Training was performed on Weizmann dataset and testing was done for real-time videos. The system was made capable of recognizing bending and side galloping activities.


Chapter 3

3-Channel Motion History Images

The Motion History Image (MHI) methodology is a very popular method in activity recognition. However, it suffers from a serious drawback: it fails when repetitive and over-writing activities are encountered. In this chapter we demonstrate the failure of the MHI method and provide a novel methodology which overcomes the drawbacks of the conventional MHI method. The algorithm of our proposed methodology and simulated results are given.

3.1 Failure of MHI in case of Repetitive Actions

In the case of repetitive actions, two or more intensities need to be stored at a single location. This results in overwriting and deletion of the previous intensity stored at that pixel. This is shown in Fig. 3.1, where a waving action is performed first in the anti-clockwise direction and then in the clockwise direction. The resulting image does not lead to any conclusion, as the previous information of the anti-clockwise movement has been erased by the clockwise movement. This shows that MHI completely fails when overwriting and self-occluding actions are performed. This drawback of the MHI method cannot be overlooked because overwriting and self-occluding actions are very common and are encountered many times in activity recognition. To address this issue in MHI, we have devised a methodology which not only characterizes how the motion occurred but is also able to represent repetitive activities successfully.

Figure 3.1: MHI Failure: In case of Repetitive Action

3.2 Improved MHI Methodology

In our mechanism, the initial steps of background subtraction and formation of silhouettes remain the same as in the normal MHI method. However, instead of using a single scalar-valued image as in normal MHI, we use the red, green and blue channels of three different 3-channel images. A single frame of the activity video is taken and three different images, namely MHIR, MHIG and MHIB, are computed. Initially, when no activity is being performed, the pixel intensities of all three images are null, as shown in Fig. 3a-3c. To form the motion history images, each pixel of the silhouette frame is examined together with the corresponding pixel of MHIR, MHIG and MHIB. When motion occurs at a particular pixel location of the silhouette frame, the corresponding pixel values of MHIR, MHIG and MHIB are checked. If the value is null for all the channels, it implies that motion has occurred for the first time at that pixel location. In that case, the motion is represented using the red channel of MHIR, as in Fig. 3d-3i, where MHIR, MHIG and MHIB respectively are shown for

the movement of an arm upwards in the anti-clockwise direction. Here each pixel's RGB value in MHIR is determined in accordance with the equations:

R(x, y, t) = ts                if silhouette(x, y, t) ≠ 0
           = 0                 if silhouette(x, y, t) = 0 and R(x, y, t−1) < (ts − d)      (3.1)
           = R(x, y, t−1)      otherwise

G(x, y, t) = 0    (3.2)

B(x, y, t) = 0    (3.3)

When motion occurs at a particular pixel location in the silhouette frame and the corresponding pixel's red channel value in MHIR is greater than a specified threshold, while the blue and green channel values in MHIB and MHIG are null, a repetitive action is being performed at that pixel location. Therefore, we represent that action using the green channel of MHIG, and MHIR remains unchanged. In this way, there is no overwriting of the previous action in MHIR by the current action. Fig. 3j-3l shows the MHI images when an action is performed first in the anti-clockwise direction and then in the clockwise direction. Each pixel's RGB value in MHIG is varied according to the equations:

R(x, y, t) = 0    (3.4)

G(x, y, t) = ts                if silhouette(x, y, t) ≠ 0
           = 0                 if silhouette(x, y, t) = 0 and G(x, y, t−1) < (ts − d)      (3.5)
           = G(x, y, t−1)      otherwise

B(x, y, t) = 0    (3.6)

If the repetitive action is continued a second time at a pixel location in the silhouette frame, the corresponding pixel's green channel value in MHIG will not be null. In such a situation the blue channel of MHIB is utilized to represent this action, and MHIR and MHIG are unchanged. Therefore the previous actions in MHIR and MHIG are not over-written by the current action in MHIB. Fig. 3m-3o shows the MHI images when an action is performed in the anti-clockwise direction, as depicted in MHIR, then in the clockwise direction, as shown in MHIG, and finally again in the anti-clockwise direction, as shown in MHIB. Each pixel's RGB value in MHIB is determined using the equations:

R(x, y, t) = 0    (3.7)

G(x, y, t) = 0    (3.8)

B(x, y, t) = ts                if silhouette(x, y, t) ≠ 0
           = 0                 if silhouette(x, y, t) = 0 and B(x, y, t−1) < (ts − d)      (3.9)
           = B(x, y, t−1)      otherwise

Therefore our methodology is effective in representing not only simple non-repetitive actions but also complex self-occluding and overwriting activities.
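A simplified numpy reading of Eqs. 3.1-3.9 is sketched below; it assumes the three channel images are kept as separate arrays, and all function and variable names are illustrative rather than the project's actual code:

import numpy as np

def update_3channel_mhi(mhi_r, mhi_g, mhi_b, sil, ts, d):
    """Write the first pass to red, the first repetition to green, the second to blue."""
    moving = sil != 0
    first  = moving & (mhi_r == 0) & (mhi_g == 0) & (mhi_b == 0)   # motion seen for the first time
    second = moving & (mhi_r != 0) & (mhi_g == 0) & (mhi_b == 0)   # first repetition at this pixel
    third  = moving & (mhi_r != 0) & (mhi_g != 0) & (mhi_b == 0)   # second repetition at this pixel
    mhi_r[first], mhi_g[second], mhi_b[third] = ts, ts, ts
    for m in (mhi_r, mhi_g, mhi_b):                                # fading rule of Eqs. 3.1, 3.5, 3.9
        m[(~moving) & (m < (ts - d))] = 0
    return mhi_r, mhi_g, mhi_b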

Figure 3.2 (panels Fig. 3a-3c): Initially when no activity is performed
Figure 3.3 (panels Fig. 3d-3f): Activity starts for the first time
Figure 3.4 (panels Fig. 3g-3i): Hands raised in the anti-clockwise direction
Figure 3.5 (panels Fig. 3j-3l): Actions performed in the clockwise direction
Figure 3.6 (panels Fig. 3m-3o): Hand raised upwards for the second time

Each figure shows the MHIR, MHIG and MHIB images respectively.

3.3 Proposed Algorithm & Results Obtained

Graph 1 (Fig. 3.7) shows the number of active pixels per frame for MHIR, MHIG and MHIB, where the number of active pixels is the count of pixels whose value is not null. When the hand moves upwards, the number of active pixels per frame of the MHIR image increases continuously and its graph rises, whereas the graphs for MHIG and MHIB remain at zero. When the hand then stops for some time and no motion is performed, the graph of MHIR decreases as the count of active pixels decreases in accordance with Eq. 3.1. Thereafter, when the hand moves downwards, the MHIR image becomes constant, since a repetitive action is encountered, and this new action is represented in MHIG. The graph of MHIG increases while the action is performed in the downward direction, and then starts decreasing when the hand stops for some time. When the hand moves again in the upward direction, the MHIG image becomes constant, and so does its graph, and this new action is represented in MHIB. The graph for MHIB increases continuously while the hand moves upwards and then starts decreasing when the hand stops.

Figure 3.7: Graph 1

In Graph 2 (Fig. 3.8), the summation of pixel values in a 20 × 20 box centred at location (287, 193) is plotted for each frame of MHIR, MHIG and MHIB. Initially, when the hand is moving in the upward direction, since motion is occurring


for the first time, the action is represented in MHIR, whereas MHIG and MHIB remain null images. The summation of pixel values in the box for the MHIR image is initially null, implying that the motion is occurring outside the box. Later the MHIR graph increases with the frame number, implying that the hand is moving inside the box. At frame no. 33 the graph for MHIR starts decreasing; this happens due to the fading of the pixels of the box once the hand has moved out of the box. From the 50th frame the hand starts moving in the downward direction, so the MHIR image becomes constant and this new action is represented in the MHIG image. While this repetitive action is performed, the MHIR graph remains constant and MHIG remains null until the downward movement of the hand is performed inside the box. At frame no. 207, the MHIG graph starts increasing, implying that the motion is occurring inside the box. After frame no. 216, the MHIG graph starts decreasing due to the fading of the pixels once the hand has moved out of the box. Thereafter, when an action is again performed in the upward direction, MHIG becomes constant and the new action is represented in MHIB. Here also the MHIB graph remains null until the action is performed inside the box.

Thereafter the MHIB graph increases for the frames in which the action is performed inside the box, and then decreases due to the fading of the pixels once the hand has moved out of the box.
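For reference, the two quantities plotted in Graphs 1 and 2 can be computed per frame roughly as below (a sketch only; the box indexing assumes (x, y) = (column, row) image coordinates, and the names are illustrative):

import numpy as np

def active_pixel_count(mhi):
    """Graph 1: number of pixels whose value is not null."""
    return int(np.count_nonzero(mhi))

def box_sum(mhi, cx=287, cy=193, half=10):
    """Graph 2: sum of pixel values in a 20 x 20 box centred at (287, 193)."""
    return float(mhi[cy - half:cy + half, cx - half:cx + half].sum())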

Figure 3.8: Graph 2

Algorithm 1 describes the proposed methodology using which the simulated results were obtained.


Algorithm 1 Improved MHI Algorithm
Data: silhouette, MHIR, MHIG, MHIB, timestamp, duration
for x: 0 to image.width do
    for y: 0 to image.height do
        valR = MHIR(x, y); valG = MHIG(x, y); valB = MHIB(x, y)
        if silhouette(x, y) then
            if valR = 0 & valG = 0 & valB = 0 then
                valR = timestamp
                goto setvalue
            end
            if valR != 0 & valG = 0 & valB = 0 then
                valG = timestamp
                goto setvalue
            end
            if valR != 0 & valG != 0 & valB = 0 then
                valB = timestamp
                goto setvalue
            end
        else
            if valR != 0 & valG = 0 & valB = 0 then
                valR = valR