AUTOMATIC SUGGESTION OF PRESENTATION IMAGE FOR STORYTELLING

Yu Liu†, Tao Mei‡ and Chang Wen Chen†

†State University of New York at Buffalo, NY, USA
‡Microsoft Research Asia, Beijing, P. R. China
†{yliu44,chencw}@buffalo.edu, ‡[email protected]

(This work was performed when Yu Liu was visiting Microsoft Research Asia as a research intern.)

ABSTRACT

Digital storytelling applications are playing an increasingly important role in people's daily life. In contemporary storytelling applications, such as PowerPoint presentations and macro/micro blogs, good presentation images are highly desired by content creators to boost their presentations in an intuitive and attractive way. Existing studies, however, have not yet addressed the challenging problem of how to select the most appropriate presentation images for storytelling. In this paper, we formulate the problem of presentation image suggestion (given a textual query) as selecting images that maximize visual and semantic diversity from the web image search results of suggested queries. The proposed framework consists of two novel components: 1) click-through-based query suggestion, which suggests textual queries by searching for relevant queries in a constructed query graph that reflects diverse aspects of a given query, and 2) query-based image selection, which selects the most appropriate presentation images by keeping semantic relevance while maximizing visual diversity and quality, using a novel Conditional Random Field (CRF) model over both individual and correlation characteristics. We evaluate the proposed approach by comparing with several baselines and through a thorough subjective survey. The evaluations show inspiring results for automatic suggestion of presentation images for storytelling.

Index Terms— Multimedia presentation, query suggestion, image suggestion, storytelling

1. INTRODUCTION

Storytelling, a traditional yet important means for humans to share experiences, has become more and more popular due to recent advances in digital media technologies. We have witnessed in recent years the strong emergence of various types of modern storytelling applications, ranging from formal PowerPoint presentations to everyday macro/micro blogs. One of the key challenges in storytelling is how to select appropriate images to enrich the information in a plain textual story.

Fig. 1. Difference between (a) normal web-searched images and (b) presentation images. Though both come from the same query health, the images in (b) are more diverse and appealing for presentation than those in (a).

When authoring multimedia content, such as presentation slides or blogs, people need to search for appropriate images to enrich their text. These illustrations and background images, which we call presentation images, are sought to help storytellers express their intentions more clearly and attractively. On one hand, as the old saying goes, "one picture is worth a thousand words": images can convey much richer information, such as a poster for a movie [1] or a summary of social media [2] or events [3]. On the other hand, beautiful pictures make stories more attractive and visually appealing. It is no overstatement that good presentation images play a critical role in storytelling.

Nevertheless, automatic suggestion of presentation images has often been overlooked in existing research. Most users search on commercial search engines by keywords. However, user-provided queries are usually so short that searching is often ambiguous, especially for general or abstract topics [4][5]. Conventional image suggestion approaches predominantly rely on image search using the direct user query with a single semantic sense [6], leaving the ambiguity problem unsolved. Moreover, these approaches are motivated mainly by keeping simple relevance between text and image, rather than by enriching information for the purpose of storytelling. There have also been efforts to suggest visual queries to improve the user search experience [7].

Fig. 2. Framework of the proposed presentation image suggestion approach.

Such visual query suggestions are able to alleviate the ambiguity problem, yet they still lack the ability to enrich information for the purpose of storytelling due to their intrinsic relevance-oriented design. Therefore, a technique for automatic presentation image suggestion is still much needed.

1.1. The Criteria for Presentation Images for Storytelling

In contrast to conventional image suggestion problems, which focus purely on text-image relevance and predominantly rely on a single semantic sense, image suggestion for storytelling is a completely different issue. We call this new technique presentation image suggestion: suggest images that are not only relevant but also diverse in topic, to encompass the rich setting of storytelling, and of high visual quality, to be attractive to the audience. Thus, we emphasize the following properties of presentation images, which act as our design criteria:

1) Relevance. The suggested images should be relevant to the topic. This is fundamentally important for telling a coherent and reasonable story.

2) Semantic Diversity. The suggested images must be semantically diverse, in contrast to conventional image search/annotation, which for this reason suffers from the ambiguity problem [8]. In other words, diverse images carry multiple, rich meanings for inclusion in the flow of storytelling. We expect such suggestions to enrich user stories by bringing more knowledge and inspiring content.

3) Visual Diversity. The suggested images must be visually diverse to maximize the information they can carry for storytelling [9][10]. Visually similar images lead to redundancy and impair the user experience.

4) Visual Quality. The suggested images should be visually appealing to help generate an attractive story. In other words, only images with high visual quality meet the aesthetic requirement for presentation [11].

From the above analysis, we can see that these four criteria are not only necessary but also sufficient to distinguish presentation images from the search or suggestion results of commercial search engines, which target just one or two criteria. For example, when someone is looking for presentation images

with general, especially abstract, topics like health, a commercial image search engine will return results like those in Figure 1 (a). Although relevant to the topic, they are less likely to match user expectations than those in Figure 1 (b), which are more diverse and appealing for storytelling. Based on these four criteria, we can see that traditional image suggestion approaches cannot accomplish this task: they focus on only one or two of the criteria and thus cannot be directly adopted for presentation images.

1.2. Overview of the Proposed Framework

To address all four criteria for suggesting presentation images, we develop a novel two-step approach, illustrated in Figure 2. The first step, click-through-based query suggestion, suggests multiple queries that expand the semantics of the user's input query. This step is designed according to criteria 1) and 2): the suggested queries should be relevant and diverse. The second step, Conditional Random Field (CRF) based image suggestion, first collects relevant candidate images by searching the suggested queries on an image search engine, and then solves a labeling problem with a pre-trained CRF model to select the best ones. We design this second step mainly according to criteria 3) and 4): the images should be both visually diverse and aesthetic. Finally, we seamlessly combine the two steps so that the most appropriate presentation images are suggested.

The major contributions of this work include:

• We develop the first framework to address automatic presentation image suggestion for storytelling, an important issue overlooked in previous research.

• We provide a complete analysis of the particular needs of storytelling and properly identify four criteria for the unique problem of presentation image suggestion.

• We design a novel two-step computational approach that satisfies the individual as well as the collective criteria via the maximum likelihood principle.

The remainder of this paper is organized as follows. Section 2 introduces the proposed automatic presentation image suggestion framework in detail. Section 3 presents the experiments and user studies. Section 4 concludes with a summary and an outlook on future work.

2. PRESENTATION IMAGE SUGGESTION

2.1. Click-through-based Query Suggestion

When authoring multimedia content, such as presentation slides or blogs, people usually search online with a few query words. However, user-provided queries are typically so short that keyword-based searching is often ambiguous, especially for general or abstract topics [4]. Conventional search-based image suggestion approaches rely predominantly on the user input and are thus critically challenged by the ambiguity issue. For instance, the user query health may reflect multiple diverse search intentions, such as physical health, mental health, or medical health. To address this problem, we propose a novel query suggestion approach that expands the semantics of the user input. We first construct a query graph to represent the relationships between queries. Then, we adopt a three-step search scheme to suggest proper queries that meet criteria 1) and 2) from the introduction.

Query Graph. Different queries hold different power in the search for images. Click-through data [12] has been shown to capture the intrinsic semantic query-image relation. We can construct the query-query relation by using the images as a "bridge": if two queries generate two sets of clicked images that overlap, i.e., "co-click" images, we believe the two queries are semantically closely related. Taking advantage of this intrinsic property, we define the semantic similarity between two queries as the cosine similarity between their clicked-image count vectors: $S(q_i, q_j) = (\vec{c}_i \cdot \vec{c}_j)/(|\vec{c}_i||\vec{c}_j|)$, where $q_i$ and $q_j$ denote two queries, and $\vec{c}_i$ and $\vec{c}_j$ are their clicked-image count vectors. Taking all queries in the data as nodes, and the connections described by this similarity as edges, we construct a query relationship graph G over the click-through data, as shown in Figure 2. Because the scale of the click-through data is large (as explained in Section 3.1), the graph contains most of the search queries that users are likely to issue. It therefore offers candidate queries of arbitrary length on virtually any topic, including the longer queries typical presentation users provide.
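For concreteness, here is a minimal sketch of this click-through-based similarity. The (query, image_id, click_count) log layout and the function names are assumptions for illustration, not the paper's implementation:

    import math
    from collections import defaultdict

    def build_click_vectors(click_logs):
        # Aggregate (query, image_id, click_count) triples into sparse
        # per-query clicked-image count vectors c_q.
        vectors = defaultdict(lambda: defaultdict(int))
        for query, image_id, count in click_logs:
            vectors[query][image_id] += count
        return vectors

    def query_similarity(c_i, c_j):
        # Cosine similarity S(q_i, q_j) between two sparse count vectors.
        shared = set(c_i) & set(c_j)  # the "co-clicked" images
        dot = sum(c_i[k] * c_j[k] for k in shared)
        norm_i = math.sqrt(sum(v * v for v in c_i.values()))
        norm_j = math.sqrt(sum(v * v for v in c_j.values()))
        return dot / (norm_i * norm_j) if norm_i and norm_j else 0.0

Edges of G would then connect query pairs whose similarity is sufficiently high; the paper does not specify a sparsification rule, so that detail is left open here.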

Query Suggestion. Once the construction of the query relation graph G is complete, a user query can be accepted and query suggestion carried out according to criteria 1) and 2). We observe that queries sharing the same semantics are closely connected as clusters in the graph, which helps achieve the relevance required by criterion 1). In the example shown in Figure 2, the query nodes medical health and medicine for health are more closely connected to each other than to the others, representing the semantics of medical health. Therefore, we can discover sub-semantics by dividing the neighbors of the user query node into multiple clusters, so that each cluster represents a meaning distinct from the other clusters. This enables us to find the diverse semantics required by criterion 2). Based on these ideas, we implement query suggestion in the following three steps:

i) Searching for Relevant Queries. Since the graph G represents the query-query relationship, relevant queries can be obtained by searching the neighbors of an input query in G. Given a user input query q0, we first locate it in the graph G, e.g., the red node with input query health in Figure 2. The allocation is done by matching query words with edit distance. Then, we search the neighborhood of q0 to construct a sub-graph GN containing all the relevant queries, denoted as blue nodes in Figure 2.

ii) Semantic Clustering. As discussed above, among the relevant queries (blue nodes), closely connected ones share the same semantics while separated ones carry different semantics. To identify these diverse semantics, we cluster all relevant nodes in the sub-graph GN after removing the user node q0 and its edges. For the clustering scheme, we adopt the label propagation principle reported in [13]; the number of clusters need not be set beforehand.

iii) Gist Phrase Extraction and Query Suggestion. We observe that a phrase that frequently appears in one cluster but not in the others can be considered representative of that cluster; we call it a gist phrase. Because the cluster semantics are both relevant and diverse, the gist phrases meet criteria 1) and 2) and are therefore suitable as the output of the query suggestion step. To implement this step, we treat each cluster as one document and apply the tf-idf technique to detect the highest-scoring terms as gist phrases, as sketched below.
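A minimal sketch of step iii), treating each cluster as one document and ranking terms by tf-idf. Tokenization, the scoring variant, and the use of single terms as stand-ins for phrases are assumptions:

    import math
    from collections import Counter

    def gist_phrases(clusters, top_k=1):
        # clusters: one list of query strings per semantic cluster found
        # by label propagation. Each cluster is treated as one document.
        docs = [Counter(t for q in cluster for t in q.lower().split())
                for cluster in clusters]
        df = Counter(t for doc in docs for t in doc)  # document frequency
        results = []
        for doc in docs:
            total = sum(doc.values())
            tfidf = {t: (c / total) * math.log(len(docs) / df[t])
                     for t, c in doc.items()}
            results.append(sorted(tfidf, key=tfidf.get, reverse=True)[:top_k])
        return results

Terms that appear in every cluster receive an idf of zero, so only cluster-specific phrases survive, which is exactly the gist-phrase property described above.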

2.2. CRF-based Image Selection

Based on the suggested queries, which are relevant and semantically diverse, we are now ready to search for images that are likewise relevant and diverse in meaning. However, images coming directly from a commercial search engine usually meet neither criterion 3), high visual diversity for storytelling, nor criterion 4), high quality for presentation. A careful selection process is therefore necessary and must be explicitly designed for presentation images.

We formulate image selection as a labeling problem in which each image is labeled "selected" or "unselected," based on maximizing the likelihood of properly suggested presentation images. We adopt a CRF-based model for the following reasons. First, selection among images of relevant queries has a "neighboring effect": the selection of one image affects neighboring images, whether or not they belong to the same query, and a CRF model can take this context into account. Second, image quality is an individual property, while diversity is a mutual correlation among a population of images. Since a CRF is capable of considering both of these characteristics in order to select the best images a user desires, we adopt CRF modeling to accomplish the task of image selection.

To build a proper CRF model, we consider both high visual diversity and high visual quality, per criteria 3) and 4), respectively. On one hand, the model should select high-quality images, as required by criterion 4); specifically, we define presentation images as near professional-quality photos with rich color, smooth texture and minimal synthetic elements [14]. On the other hand, the model should select images that maximize visual diversity, as required by criterion 3). Meanwhile, we should also preserve the semantic diversity achieved in the first step of query suggestion. In summary, we should select images that are high in quality, visually diverse and semantically diverse.

Based on the above requirements, we build an undirected graph with images as nodes and the visual and semantic similarities as edges. A CRF is then defined on the observations of images and search profiles (x, q) and the random variables of labels l, with the Markov property, i.e., the conditional probability of a random variable depends only on its neighbors (i, j):

$$P(\mathbf{l} \mid \mathbf{x}, \mathbf{q}; \beta, \lambda) = \frac{1}{Z} \exp\Big\{ \sum_i f(x_i; \beta)\, l_i + \sum_{i,j} R(x_i, x_j, q_{ij}; \lambda)\, l_i l_j \Big\}.$$

Taking the logarithm of the conditional probability gives the conditional likelihood function:

$$L(\mathbf{l}) = -\log Z + \sum_i f(x_i; \beta)\, l_i + \sum_{i,j} R(x_i, x_j, q_{ij}; \lambda)\, l_i l_j, \qquad (1)$$

where $\mathbf{l} = (l_1, l_2, \ldots, l_N)$ is the label vector, in which $l_i = 1$ indicates that image $x_i$ is selected and $l_i = 0$ otherwise. N is the number of images directly returned by the search engine for the suggested queries; we select the M most appropriate images (M = 10 in our experiments). The function f represents image quality. $q_{ij} = (t_i, t_j, p_i, p_j)$ is the search profile of $x_i$ and $x_j$, so R represents the diversity between two images $x_i$ and $x_j$ given their associated queries $t_i$, $t_j$ and search ranks $p_i$, $p_j$. β and λ are vectors of model parameters learned in the training stage. By this definition, maximizing the likelihood function yields a set of M selected images with both maximized image quality and maximized visual diversity, so the proposed CRF modeling meets the requirements of criteria 3) and 4).

The quality function f is defined as

$$f(x_i; \beta) = \beta_1 f_c(x_i) + \beta_2 f_t(x_i) + \beta_3 f_n(x_i), \qquad (2)$$

where $f_c$, $f_t$ and $f_n$ measure the colorfulness, texture smoothness and synthetic content of image $x_i$, respectively. These are high-level features used in professional photograph quality assessment [14] and hence ensure the quality of the suggested images as defined in criterion 4).
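To illustrate formula (2), here is a hedged sketch of the quality term. The three feature functions below are simple stand-ins (Hasler-Suesstrunk colorfulness, an inverse-gradient smoothness proxy, and a distinct-color ratio), not the exact high-level features of [14]:

    import numpy as np

    def f_colorfulness(img):
        # img: H x W x 3 RGB array with values in [0, 1].
        # Hasler-Suesstrunk colorfulness, used here as a stand-in for f_c.
        r, g, b = img[..., 0], img[..., 1], img[..., 2]
        rg = r - g
        yb = 0.5 * (r + g) - b
        return float(np.hypot(rg.std(), yb.std())
                     + 0.3 * np.hypot(rg.mean(), yb.mean()))

    def f_texture_smoothness(img):
        # Inverse mean gradient magnitude: higher means smoother texture.
        gray = img.mean(axis=2)
        gy, gx = np.gradient(gray)
        return float(1.0 / (1.0 + np.hypot(gx, gy).mean()))

    def f_non_synthetic(img):
        # Ratio of distinct quantized colors to pixels; synthetic graphics
        # (charts, clip art) tend to score low, photographs high.
        q = (img * 15).astype(int).reshape(-1, 3)
        return len({tuple(px) for px in q}) / len(q)

    def quality(img, beta):
        # f(x_i; beta) = beta_1 f_c + beta_2 f_t + beta_3 f_n, formula (2).
        return (beta[0] * f_colorfulness(img)
                + beta[1] * f_texture_smoothness(img)
                + beta[2] * f_non_synthetic(img))

Because the weights β are learned, the model can calibrate these raw feature scales against the manually labeled training selections.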

The diversity function R takes into account both visual and semantic diversity, to achieve criterion 3) as discussed previously:

$$R(x_i, x_j, q_{ij}; \lambda) = \lambda_1 R_v(x_i, x_j) + \lambda_2 R_s(q_{ij}), \qquad (3)$$

where $R_v$ measures visual diversity as the distance between image hashes, $R_v(x_i, x_j) = \|h(x_i) - h(x_j)\|$, with $h$ a 42-dimension hash vector computed as in [15]. $R_s$ describes the pairwise semantic relation of the images based on their search-engine ranks:

$$R_s(q_{ij}) = \begin{cases} 1/(1 + e^{|p_i - p_j|}), & t_i = t_j \\ 0, & t_i \neq t_j \end{cases} \qquad (4)$$
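A sketch of the diversity term in formulas (3) and (4); any fixed-length perceptual hash can stand in for the 42-dimension moment-based hash of [15]:

    import math
    import numpy as np

    def visual_diversity(h_i, h_j):
        # R_v(x_i, x_j) = ||h(x_i) - h(x_j)||, with h(.) an image hash.
        return float(np.linalg.norm(np.asarray(h_i, float)
                                    - np.asarray(h_j, float)))

    def semantic_relation(t_i, t_j, p_i, p_j):
        # R_s(q_ij), formula (4): nonzero only for images of the same
        # suggested query, and larger when their search ranks are close.
        if t_i != t_j:
            return 0.0
        return 1.0 / (1.0 + math.exp(min(abs(p_i - p_j), 50)))  # clamp exp

    def diversity(h_i, h_j, t_i, t_j, p_i, p_j, lam):
        # R = lambda_1 R_v + lambda_2 R_s, formula (3); lambda is learned.
        return (lam[0] * visual_diversity(h_i, h_j)
                + lam[1] * semantic_relation(t_i, t_j, p_i, p_j))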

Note that formula (1) acts as the objective function in both the training and testing stages. In the training stage, we learn the parameters β and λ that maximize L by gradient descent, using training images with manual selections as ground-truth labels. In the testing stage, given the input image candidates, we select those with maximum likelihood as the system output.

With both the query relation graph and the CRF image selection model, we are ready to suggest presentation images for storytelling. First, when the user provides a textual query, multiple semantically related queries are suggested, as described in Section 2.1. Then, powered by a commercial image search engine, we search candidate images for each of the suggested queries. Using the trained CRF model described in Section 2.2, the candidate images are finally filtered, and the best ones are output by the proposed presentation image suggestion system.

3. EXPERIMENTAL RESULTS

3.1. Experimental Settings

Datasets. We apply a two-step training process corresponding to the proposed two-step framework. First, we construct the query relation graph G on a large click-through dataset, the Clickture dataset [12]. It consists of more than 20 million click-through logs; each log contains a user query with an associated clicked image and click count. The dataset contains 21,800,792 distinct queries, 943,454 distinct images and 77,872,117 clicks in total. The dataset is large enough that the constructed graph covers most frequently searched user queries. Second, the CRF model presented in Section 2.2 requires image search results as input. We randomly pick 132 frequently searched user queries and obtain their suggested queries using the query suggestion method. We then collect the training images {xi}, the associated queries {ti} and ranks {pi}. Finally, we manually label the images with {li} as ground truth, indicating whether each image is eligible to be a presentation image for the given query. In total, the crawled dataset contains 638 suggested queries and 29,012 images.

Baselines. Since presentation image suggestion has not been addressed in existing research, we compare different variations of the proposed method to validate the significance of its key components, i.e., query suggestion and image selection. The first variation searches without query suggestion (NoQS); the input query instead goes directly to Step 2. The second variation searches without image selection (NoIS); the output of Step 2 is the top searched images without selection. The third variation searches without either query suggestion or image selection (NoQSNoIS); this is simply a commercial search engine.
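For completeness, here is a sketch of the test-time selection step that the NoIS and NoQSNoIS baselines skip. The paper does not specify the maximization procedure for formula (1), so a simple greedy scheme over the unnormalized objective is assumed:

    def select_images(n, M, f_scores, R):
        # Greedily choose M of n candidates to maximize the unnormalized
        # objective of formula (1):
        #   sum_i f_scores[i] * l_i + sum_{i,j} R[i][j] * l_i * l_j,
        # where f_scores[i] = f(x_i; beta) and R[i][j] = R(x_i, x_j, q_ij; lambda).
        selected, remaining = [], set(range(n))
        while len(selected) < M and remaining:
            def gain(i):
                # Marginal increase from flipping l_i to 1.
                return f_scores[i] + sum(R[i][j] + R[j][i] for j in selected)
            best = max(remaining, key=gain)
            selected.append(best)
            remaining.remove(best)
        return selected

Greedy selection is only a heuristic for the combinatorial maximization, but it mirrors the intent of the objective: each added image must justify itself through its quality and its pairwise terms with the images already chosen.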

Fig. 3. Image suggestion results for the input query power of will by (a) the proposed method and the baselines (b) NoIS, (c) NoQS and (d) NoQSNoIS.

3.2. User Study Tasks

We carry out two sets of user studies. The first, a preference study, observes general users' preferences between our scheme and the three baseline schemes, to validate the contribution of each component of the system. The second, a subjective survey, explores subjective opinions on the details of the proposed system, for a deeper and more thorough evaluation. In both studies, we invited 20 subjects (7 female, 13 male, aged 19-35, all frequent users of presentation software) to test the system. In each study, a subject is asked to run several tests (2-5). The evaluation processes are as follows.

Preference Study. We ask subjects to input arbitrary queries of their own and present them with three pairs of image suggestion results, i.e., each of the three baselines vs. the proposed method. We then ask about their preferences in five aspects: relevance, semantic diversity, visual diversity, quality, and overall satisfaction.

Subjective Survey. In this study, the subjects input arbitrary queries and are then presented with results from the system, including both query suggestions and image suggestions. Each subject grades the system by rating each of the following questions with an integer from 1 (worst) to 10 (best):

Q1) How relevant are the suggested queries?
Q2) How diverse are the suggested queries' semantics?
Q3) Do these queries enrich knowledge of your topic?
Q4) How is the query suggestion's overall performance?
Q5) How is the quality of the suggested images for presentation or storytelling?
Q6) How relevant are these suggested images?
Q7) How diverse are these suggested images?
Q8) Do the suggested images enrich knowledge of your story?
Q9) Will these images inspire your story ideas?
Q10) How is the overall performance of this system?

Compared with the preference study, the subjective survey questions are designed to evaluate the system more deeply and thoroughly regarding both query suggestion and image suggestion, including relevance (Q1, Q6), semantic diversity (Q2), visual diversity (Q7), image quality (Q5), the ability to enrich knowledge for storytelling (Q3, Q8, Q9) and overall impression (Q4, Q10).

Table 1. Preference of our system over baselines

                        NoQS     NoIS     NoQSNoIS
    Relevance           61.1%    79.6%    59.3%
    Semantic Diversity  83.3%    79.6%    90.7%
    Visual Diversity    81.5%    87.0%    90.7%
    Image Quality       83.3%    96.3%    85.2%
    Overall             66.7%    85.2%    70.4%

Table 2. Average rating of each question

    No.  Aspect    Question                   Rating
    1    Queries   Relevance                  8.89
    2    Queries   Semantic Diversity         8.08
    3    Queries   Enrichment on Knowledge    8.31
    4    Queries   Overall Performance        8.56
    5    Images    Quality                    9.48
    6    Images    Relevance                  8.48
    7    Images    Visual Diversity           8.33
    8    Images    Enrichment on Knowledge    8.31
    9    Images    Inspiration on Ideas       8.02
    10   Images    Overall Performance        8.64

3.3. Evaluation and Discussion

Preference Study Results. We collected 54 preference results from the subjects in this study. Table 1 shows the results, which indicate that users generally prefer our system over the three variations, validating the contribution of each system component. Lacking the query suggestion component, NoQS and NoQSNoIS trail in semantic diversity by a large margin. Without the image selection part, NoIS and NoQSNoIS fall far behind in visual diversity and image quality. We also notice that the proposed scheme achieves even higher relevance than NoIS, which implies that the image selection component also helps increase the relevance of the system output. For relevance, however, users prefer the proposed scheme, NoQS and NoQSNoIS almost equally; this is because those baselines use the direct user input, so the suggested queries cannot be more relevant than the user input itself. Comparing the overall preference for our system over NoQS and over NoIS (66.7% vs. 85.2%), we find that users are less tolerant of bad image selection than of missing query suggestion. These results confirm the significance of each component of the two-step design characterized by the four criteria. A set of image suggestion results from the proposed scheme and the baselines is illustrated in Figure 3. For the same user query power of will, the proposed system produces much better presentation images than all three baselines.

Fig. 4. Subjective survey results

Subjective Survey Results. We collected 61 survey results in this study. Table 2 lists the average rating for each question, showing that the subjects are satisfied with both the query and image suggestion results in all important aspects, especially image quality. Users also find the proposed system useful in enriching their knowledge and inspiring new ideas, which is likewise essential for storytelling. Figure 4(a), (b) and (c) display the distribution of all ratings, each subject's average rating, and each survey's average rating, respectively. They demonstrate high overall satisfaction with the system, both from the individual subject perspective and across the study. The distribution of the ratings on semantic and visual diversity is shown separately in Figure 4(d). Overall, the two user studies confirm that the proposed system achieves the desired performance characterized by the four criteria, and verify its usefulness and effectiveness for presentation image suggestion.

4. CONCLUSION AND FUTURE WORK

We have developed a novel system capable of suggesting presentation images for storytelling, a problem that has not been adequately explored by existing multimedia research. We defined four important criteria that characterize presentation images, and implemented a system that selects presentation images from a user-provided text query. Through preliminary user studies, we have verified the usefulness of the criteria and the effectiveness of the proposed system. At present, the system accepts only short text input from a user's topic keywords. We are working to expand the system to handle long text input. We are

also working on expanding beyond simple input of words or phrases to multimodal structured input, including text paragraphs, images and videos.

5. REFERENCES

[1] Y. Wang, T. Mei, and X. S. Hua, "Community discovery from movie and its application to poster generation," MMM, 2011.
[2] W. Yin, T. Mei, and C. W. Chen, "Automatic generation of social media snippets for mobile browsing," ACM MM, 2013.
[3] C. Gan, N. Wang, Y. Yang, D. Y. Yeung, and A. G. Hauptmann, "DevNet: A deep event network for multimedia event detection and evidence recounting," CVPR, 2015.
[4] C. Carpineto and G. Romano, "A survey of automatic query expansion in information retrieval," ACM Computing Surveys, 2012.
[5] T. Mei, Y. Rui, S. Li, and Q. Tian, "Multimedia search reranking: A literature survey," ACM Computing Surveys, 2014.
[6] C. C. Wu, T. Mei, W. H. Hsu, and Y. Rui, "Learning to personalize trending image search suggestion," ACM SIGIR, 2014.
[7] Z. J. Zha, L. Yang, T. Mei, M. Wang, and Z. Wang, "Visual query suggestion," ACM MM, 2009.
[8] T. Mei, Y. Wang, X. S. Hua, S. Gong, and S. Li, "Coherent image annotation by learning semantic distance," CVPR, 2008.
[9] M. L. Paramita, M. Sanderson, and P. Clough, "Diversity in photo retrieval: Overview of the ImageCLEFPhoto task 2009," CLEF, 2009.
[10] B. Ionescu, A. Popescu, H. Muller, M. Menendez, and A. Radu, "Benchmarking result diversification in social image retrieval," ICIP, 2014.
[11] X. Yang, T. Mei, Y. Xu, Y. Rui, and S. Li, "Automatic generation of visual-textual presentation layout," ACM Trans. Multimedia Computing, Communications and Applications, 2015.
[12] X. S. Hua, L. Yang, J. Wang, J. Wang, M. Ye, K. Yang, Y. Rui, and J. Li, "Clickage: Towards bridging semantic and intent gaps via mining click logs of search engines," ACM MM, 2013.
[13] U. N. Raghavan, R. Albert, and S. Kumara, "Near linear time algorithm to detect community structures in large-scale networks," Physical Review E, 2007.
[14] Y. Ke, X. Tang, and F. Jing, "The design of high-level features for photo quality assessment," CVPR, 2006.
[15] Z. Tang, Y. Dai, and X. Zhang, "Perceptual hashing for color images using invariant moments," Applied Mathematics and Information Sciences, 2012.