
Multimedia Explorer: Content Based Multimedia Exploration

Sujal S Wattamwar, Surjeet Mishra, and Hiranmay Ghosh, Senior Member, IEEE
Innovation Labs Delhi, Tata Consultancy Services
{sujal.wattamwar, surjeet.mishra, hiranmay.ghosh}@tcs.com

Abstract

In this paper, we present an innovative way for effective interaction of users with multimedia contents. We propose a novel framework which enables the creation of content information through structural, semantic and media feature based descriptors compliant with the MPEG-7 standard. The architecture offers content based search and personalized presentation using SMIL. The content based search exploits the MPEG-7 compliant content description to support spatio-temporal query constructs.

1. INTRODUCTION

Today, a large amount of multimedia content is available on the Internet, and more is being added continuously. Social networks and special interest groups such as myspace, facebook and youtube enable users to share personal photos and videos. Portals such as BlinkxTV and those of broadcasting companies contain video footage of world events. However, searching for a specific image or a video segment of interest in any of these portals is a Herculean task for a user. A media instance, such as a still image or a video, is a complex entity. The mise-en-scene objects in a multimedia data stream interact with each other in space and time to connote meaningful events, and very often the media properties of the objects, such as their color and texture, convey meaningful information. Thus, the metadata based interaction with media data that is commonly supported on these portals is grossly inadequate, and there is a need to generate a more complete description of multimedia content to enable effective interaction with a media library.

The problem of effective multimedia retrieval has caught the attention of research groups as well as industry. Video tagging tools such as vimeo, clicktv, etc. enable a more granular content description for videos by allowing a user to annotate specific sections of a video. However, these tools provide little assistance to the user in creating such annotations and do not correlate the annotations associated with different scenes in a video. Another limitation of such tools is that they do not allow annotating specific objects in a video. A few media management solutions, such as Interactive Media Manager (IMM) from Microsoft, and video annotation tools, such as VideoAnnEx [5] from IBM, provide automatic segmentation and some object recognition to assist annotation. IMM proposes the use of an ontology for broadening a user query. A collaborative tool, DiVA [8], allows automatic video segmentation, object recognition and object based video annotation. However, these tools do not relate the semantic annotations with the structural description and the media properties for meaningful retrieval.

We propose a multimedia exploration framework that assists users in creating a content description comprising structural, media feature based and semantic information. The different components of this description are correlated for effective retrieval. Moreover, we provide a platform for a user to view specific segments of the media content, to facilitate annotation and contextual utilization. It is also possible to create a personalized presentation by combining media segments from different artifacts; such personalized presentations are dynamically generated during the interaction process. While the framework is general enough to cater to different forms of multimedia content, such as still images, videos and multimedia presentations, we have presently implemented a prototype for videos.

Despite advances in computer vision, we believe that semantic annotations can best be provided by human beings. This motivates us to create a human-in-the-loop architecture. The structural and feature based descriptions are generated automatically through machine processing of the video content, and the semantic descriptions are obtained from manual annotations. It is possible to automate some of the annotation functions by plugging automatic object detection [11] and scene classification algorithms [7] into the framework. The framework also provides for the inclusion of ontology based query refinement. The framework presented in this paper can be useful in several application systems; we briefly discuss two such applications, a portal for promoting and selling documentaries and an e-learning portal, for which work is ongoing.

The paper is organized as follows. Section 2 gives a system overview. Section 3 describes the system in detail. Section 4 gives the implementation details of Multimedia Explorer. Section 5 presents various illustrative examples. Section 6 discusses various applications, and Section 7 concludes the paper.

2. SYSTEM OVERVIEW

Effective interaction with multimedia content requires the creation of a comprehensive content description. Effectiveness here relates to the overall user experience with the content usage, how well user needs are satisfied, etc. The multimedia content model proposed in MPEG-7 [1] comprises structural, media feature based and semantic descriptions, and Multimedia Explorer relies on the creation of such a content description for the effective utilization of media contents.

We define two distinct roles in the system: content creator and content user. The content creator interacts with the video to create a content description. Multimedia Explorer automatically creates a structural and media feature based description of the video and provides tools that help the user complement it with semantic annotations. The structural description of a video is created by segmenting the video into shots, and the feature based description is created by extracting media features for its constituent elements. Users provide semantic annotations for the video, for each of its shots, and for any visual objects that may be of importance.

Multimedia Explorer provides an option for content based search and navigation in the video collection. Users can search for relevant video content by submitting queries in the form of text or of a visual object in a video. They receive the videos, or segments of interest, relevant to their queries and can navigate through them. The relevant segments from different videos are stitched together to create a virtual presentation.
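For concreteness, a minimal sketch of how such a content description might be modelled in code is shown below. The class and field names are hypothetical illustrations; they do not reproduce the MPEG-7 data types or the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ObjectAnnotation:
    """Semantic annotation of a visual object marked in a representative frame."""
    label: str                          # free text entered by the content creator
    bbox: Tuple[int, int, int, int]     # (x, y, width, height) in frame coordinates


@dataclass
class Shot:
    """Structural unit produced by automatic shot-boundary detection."""
    start_frame: int
    end_frame: int
    representative_frame: str           # path/URL of the key frame image
    color_features: List[float] = field(default_factory=list)  # media feature description
    objects: List[ObjectAnnotation] = field(default_factory=list)


@dataclass
class Scene:
    """A set of contiguous, semantically related shots grouped by the annotator."""
    shots: List[Shot]
    free_text_annotation: str = ""      # semantic description supplied manually


@dataclass
class VideoDescription:
    url: str                            # the video itself may reside anywhere on the network
    free_text_annotation: str = ""      # abstract-level description of the whole video
    scenes: List[Scene] = field(default_factory=list)
```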

3. DESCRIPTION OF THE WORK

The overall architecture of the Multimedia Explorer framework is shown in Figure 1.

Figure 1 Multimedia Explorer Framework

3.1 Storage

The key elements in the central storage of Multimedia Explorer are the video descriptions, with which all the executable modules interact. The actual videos are referred to by their URLs and can be present anywhere in the network. They are temporarily downloaded from external sources and can optionally be stored on the local host (subject to the satisfaction of IPR issues). The content descriptions are stored in MPEG-7 format, which provides a rich set of standardized tools to describe multimedia content for human users and for automatic systems that process audiovisual information. The architecture has a provision for using the Multimedia Web Ontology Language (M-OWL) [4] for semantic interpretation of the multimedia descriptors.

3.2 Segmentation and Feature Extraction

When a video is ingested into the system, this module extracts the structural information and some feature descriptors, which are required for the later annotation and retrieval phases. First, the video is processed to extract the frames and to detect the shot boundaries. The shots are detected on the basis of the difference in the visual features of successive frames in the video. The normalized difference of the visual features is measured for subsequent frames at a regular interval; we have observed that an interval of 20 frames provides the best shot detection performance. A sliding window of user-defined size is used to localize the comparison of color features, and the mean distance (i.e. the difference in normalized visual features) is calculated for each window. This mean value is used to implement adaptive thresholding. In addition, other information is extracted for each shot, such as the path of the video, the representative frame (an image that summarizes the complete shot), the free text annotation, and the visual features of the representative frame. This information constitutes the feature based description.

3.3 Annotation

In spite of the advances in computer vision technology, it is not possible to generate complete semantic annotations automatically for unrestricted types of videos. Therefore, we keep a provision for manual content annotation in this human-in-the-loop system. In this phase, the video is described at different levels, which helps users interact with it. The annotation process in Multimedia Explorer is shown in Figure 2.
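As a rough illustration of the sliding-window comparison and adaptive thresholding described in Section 3.2, the sketch below computes normalized color-histogram distances between frames sampled at a fixed interval and flags a boundary where the distance stands out from the local mean. The sampling interval, window size and threshold factor are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np

def color_histogram(frame: np.ndarray, bins: int = 8) -> np.ndarray:
    """L1-normalized joint RGB histogram of a frame (H x W x 3, uint8)."""
    hist, _ = np.histogramdd(
        frame.reshape(-1, 3), bins=(bins, bins, bins), range=((0, 256),) * 3
    )
    hist = hist.flatten()
    return hist / hist.sum()

def detect_shot_boundaries(frames, sample_interval=20, window_size=10, k=2.0):
    """Flag a shot boundary where the histogram distance between successive
    sampled frames exceeds an adaptive threshold derived from a sliding window."""
    sampled = frames[::sample_interval]
    hists = [color_histogram(f) for f in sampled]
    # Normalized distance between consecutive sampled frames (range 0..1).
    dists = [0.5 * np.abs(hists[i] - hists[i - 1]).sum() for i in range(1, len(hists))]

    boundaries = []
    for i, d in enumerate(dists):
        window = dists[max(0, i - window_size):i]
        if not window:
            continue
        local_mean = float(np.mean(window))
        # Adaptive threshold: declare a boundary when the current distance is
        # much larger than the local mean distance (factor k is illustrative).
        if d > k * max(local_mean, 1e-6):
            boundaries.append((i + 1) * sample_interval)  # approximate frame index
    return boundaries
```

In practice the frames would be decoded from the video with a standard video library before being passed to this function.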

Figure 2 Scene and object based annotation

The video as a whole is described with a Free Text Annotation, analogous to the abstract or description traditionally available with video collections. Multimedia Explorer enables a user to add more granular annotations. A 'scene' in a video is a set of contiguous, semantically related shots. Multimedia Explorer provides the facility to define the scenes in a video by combining adjacent and related shots and to annotate each of the scenes separately. MPEG-7 supports both structured and free text annotations. Structured annotations include answers to various questions such as who, when and what, and are more convenient for satisfying structured queries. Free text annotations, on the other hand, are more intuitive and natural, and can carry additional information not captured by the pre-defined fields of structured annotations. After weighing the pros and cons, we have favored free text annotations for scene description. Moreover, the important visual objects in each of the representative frames can be marked and textually described.
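Purely for illustration, the snippet below serializes a scene-level free text annotation and one marked object into an MPEG-7-flavoured XML fragment. The element names are simplified, the fragment is not validated against the MPEG-7 schema, and it is not claimed to match the descriptions that Multimedia Explorer actually produces.

```python
import xml.etree.ElementTree as ET

def scene_description(scene_id: str, annotation: str, obj_label: str, bbox):
    """Build a simplified, MPEG-7-flavoured description of one annotated scene."""
    segment = ET.Element("VideoSegment", id=scene_id)

    # Free text annotation chosen for scene description (Section 3.3).
    text = ET.SubElement(segment, "TextAnnotation")
    ET.SubElement(text, "FreeTextAnnotation").text = annotation

    # A marked visual object with its bounding rectangle in the representative frame.
    region = ET.SubElement(segment, "StillRegion")
    ET.SubElement(region, "Label").text = obj_label
    x, y, w, h = bbox
    ET.SubElement(region, "Box").text = f"{x} {y} {w} {h}"

    return ET.tostring(segment, encoding="unicode")

print(scene_description("scene_3", "An astronaut rides the lunar roving vehicle.",
                        "astronaut", (120, 80, 60, 140)))
```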
Such an object description takes the form of a structured annotation. Figure 2 depicts scene and object annotations of a video. The scenes are shown as a vertical array of thumbnail images of the representative frames, each of which can be clicked to review the scene prior to annotation. Scene annotations can be entered in the text boxes next to the thumbnails. The panel towards the right serves both for playing the scenes and for annotating objects. An object, e.g. the astronaut in the figure, is marked by drawing a bounding rectangle around it and entering text in a text box that appears below the panel.

3.4 Retrieval

The input query from the user can take two different forms: text and image.

3.4.1 Textual query

In this case, the user types some text in a query box provided with the system. The query can be one or more keywords, a phrase, or a natural language sentence. This input query is matched against the descriptions of the videos stored in the MPEG-7 files. Our work on specifying queries has been motivated by [6], which introduces an MPEG-7 query language and describes the background and requirements as well as the main architectural concepts and the associated MPEG-7 Query Format XML schema types. In finding the relevance of the video descriptions with respect to the input query, only the keywords, and not the stop words, are taken into consideration. Stop words are words that carry very little significance in the context of the text; in general, they include articles, prepositions and pronouns such as a, an, the and and. We used the list of stop words from [12] and adapted it to our requirements. In this text based search, the query is first preprocessed for stop-word removal. The keywords are then stemmed before searching; stemming is performed using Porter's algorithm [14] and is appropriate for search because the morphological variants of a word usually have similar semantic interpretations. The relevance is calculated using the classical probabilistic model [10]. We explored several retrieval models, namely the Boolean model, the vector space model and the probabilistic model. The Boolean model gives a binary score, which is not desirable because retrieval is then based on a binary decision criterion with no notion of partial matching. The vector space model assumes the independence of index terms, which is also not desirable. The probabilistic model gives the probability that a document (textual description) is relevant to the user query and ranks the documents according to their relevance; it is therefore well suited to this framework. The Binary Independence Retrieval (BIR) model estimates the probability that a specific document dm will be judged relevant with respect to a specific query qk. The probabilistic approach is used because the user need cannot be expressed precisely.
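As a minimal sketch of this textual query pipeline, the code below performs stop-word removal and Porter stemming (using NLTK's stemmer as a stand-in for [14] and a toy stop-word list in place of the adapted list from [12]) and computes a textbook BIR weight without relevance feedback. It is not claimed to reproduce the authors' exact scoring.

```python
import math
from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "an", "the", "and", "of", "in", "on", "is", "are", "to"}  # toy list
stemmer = PorterStemmer()

def preprocess(text: str) -> set:
    """Lower-case, drop stop words, and stem the remaining keywords."""
    tokens = [t for t in text.lower().split() if t.isalpha() and t not in STOP_WORDS]
    return {stemmer.stem(t) for t in tokens}

def bir_score(query: str, doc_text: str, docs: list) -> float:
    """Textbook BIR relevance score of one textual description against a query.

    With no relevance information, P(term | relevant) is taken as 0.5 and
    P(term | non-relevant) is estimated from the collection, giving an
    IDF-like weight log((N - n_t + 0.5) / (n_t + 0.5)) per matching term.
    """
    n_docs = len(docs)
    doc_terms = preprocess(doc_text)
    score = 0.0
    for term in preprocess(query):
        if term not in doc_terms:
            continue
        n_t = sum(1 for d in docs if term in preprocess(d))
        score += math.log((n_docs - n_t + 0.5) / (n_t + 0.5))
    return score
```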

The probabilistic score is first calculated for each scene, and the score of the complete video is then derived from the scene scores:

S_video = max(S_scene_1, S_scene_2, ..., S_scene_K)
S_scene_i = Σ_j S_object_ij

where K is the number of scenes, S_video is the video score, S_scene_i is the score of scene i (i = 1, 2, ..., K), and S_object_ij is the score of the j-th visual object in the i-th scene. The individual scores are calculated using the BIR model. The videos and the corresponding scenes whose scores exceed a threshold value, chosen by a domain expert, are retrieved. The retrieved videos are ranked according to their scores and presented to the user, along with the scenes from each video that are relevant to the query, through a personalized presentation.

The contents of a media instance can be interpreted in many different ways, and it is not possible to capture all these perspectives during the annotation process. Thus, a dynamic and contextual interpretation of the available annotations is useful for multimedia retrieval. IMM and some other solutions use an ontology for this purpose. However, the semantics of multimedia content is also implied by the spatial and temporal interactions of the scene objects, so there is a need to explore the relations among the objects in a video. For example, a particular scene may contain an astronaut on a lunar vehicle. This scene can be interpreted as moon exploration, which may not be explicitly specified in the annotation; by exploring the spatial relation between the astronaut and the lunar vehicle, we can arrive at this semantic interpretation. The spatio-temporal properties, such as the object coordinates and timestamps stored in the MPEG-7 files, can lead to such semantic interpretations. These relations between objects are called spatio-temporal relations. The temporal relations between objects can be explored using Allen's interval relations [13], which are summarized in Figure 3. We have implemented a subset of Allen's relations for supporting temporal queries. For example, let A = [t11, t12] and B = [t21, t22] be the time intervals of two objects A and B. Then "A precedes B" means t12 < t21.
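To make the score aggregation and the temporal predicates concrete, the following sketch combines object-level scores into scene and video scores as defined above and implements a small subset of Allen's interval relations. The particular subset and the function names are illustrative choices, not the exact set implemented in Multimedia Explorer.

```python
from typing import List, Tuple

def scene_score(object_scores: List[float]) -> float:
    """S_scene_i = sum of the BIR scores of the visual objects in scene i."""
    return sum(object_scores)

def video_score(scene_scores: List[float]) -> float:
    """S_video = maximum over the scores of its K scenes."""
    return max(scene_scores) if scene_scores else 0.0

# --- A small subset of Allen's interval relations for temporal queries ---
Interval = Tuple[float, float]  # (start, end) timestamps of an object's appearance

def precedes(a: Interval, b: Interval) -> bool:
    """A ends strictly before B starts: t12 < t21."""
    return a[1] < b[0]

def meets(a: Interval, b: Interval) -> bool:
    """A ends exactly when B starts: t12 == t21."""
    return a[1] == b[0]

def overlaps(a: Interval, b: Interval) -> bool:
    """A starts before B, they share an interior period, and A ends first."""
    return a[0] < b[0] < a[1] < b[1]

def during(a: Interval, b: Interval) -> bool:
    """A lies strictly inside B."""
    return b[0] < a[0] and a[1] < b[1]

# Example: the astronaut's appearance overlaps the lunar vehicle's appearance.
astronaut, vehicle = (12.0, 20.0), (18.0, 35.0)
print(overlaps(astronaut, vehicle))   # True
```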