Gaze based quality assessment of visual media understanding

Jean Martinet∗, Adel Lablack∗, Stanislas Lew∗, Chabane Djeraba∗

∗ LIFL - University of Lille, France

Abstract— Visual media are among the most widely used in our societies. With the increasing demand for digital image and video technologies in applications such as communication, advertising, or entertainment, there is a growing need for assessment tools to evaluate the quality of visual media understanding. It is necessary to quantify the adequacy between an audience's perception of a visual media and the original message or idea that the media creator intended to transmit. The aim of our work is to build a framework for measuring the quality of a visual media, that is to say its ability to transmit the original idea of the creator, and possibly to give recommendations to the creator about how to better design the media. Based on the recorded gaze data of people viewing the media, we design qualitative indicators that help assess the perception of the media by a target audience.

I. INTRODUCTION

With the increasing demand for digital image and video technologies in applications such as communication, advertising, or entertainment, there is a growing need for assessment tools to evaluate the quality of visual media understanding. It is necessary to quantify the adequacy between the audience's understanding of a visual media and the original message or idea that the media creator intended to transmit. Indeed, when advertising agencies and film makers produce a visual media (an image or a movie), the authors carefully choose their subject, scene, and settings with the objective of transmitting a precise message to the viewer. How can one be sure that the intended message is correctly received by the audience? And how can one verify that the most important items displayed in the material are well perceived by the audience? Because of the physiology of the human eye and of the human vision system, only a restricted area of the scene, the one falling on the fovea region, can be perceived at a time.

For applications and products that target human consumers, it is desirable to have metrics that predict the perceived visual quality as it would be measured with human subjects. Quality assessment of visual media understanding aims at quantifying, by means of quality metrics, how well the audience understands a visual media, including still pictures and image sequences. Providing such an evaluation tool is crucial for controlling the audience's perception of the media in existing and emerging multimedia systems. Such tools are especially important in constrained environments, for instance when the media is an advertisement (still image or video) meant to be viewed in a passing place. They also have the potential to impact next-generation systems by providing objective metrics to be used during the design and testing stages, thereby reducing the need for extensive evaluation with human subjects. With such a tool, media producers could potentially save time by designing suitable media, following the recommendations of the system. The tool could for instance state that a given item is not correctly seen in a given shot, and would be better seen if placed at a given location at a given moment.

This paper presents a first step towards gaze-based quality assessment of visual media understanding. This step consists in recording and clustering gaze points from persons viewing the visual media, and then elaborating several estimators to analyze these data. The promising results of ongoing experiments are presented. We briefly review in the following section some related research on gaze analysis and its applications. Then we describe in Section III our modeling of the problem in terms of basic quality descriptors that are combined into a global estimator. Section IV describes experiments that we carried out to demonstrate the usefulness of the proposed approach.

II. RELATED WORK

With the recent development of low-cost gaze tracker devices, the possibility of taking advantage of the information conveyed by gaze has opened many research directions: in image compression, where users' gaze is used to set variable compression ratios at different places in an image; in marketing, for detecting products of interest to customers; in civil security, for detecting drowsiness or lack of concentration in persons operating machinery such as motor vehicles or air traffic control systems; and in human-computer interaction. In the latter, for instance, the user's gaze is used as an input device complementary to traditional ones such as the mouse and the keyboard, notably for disabled users.

A. Gaze analysis

The analysis of gaze has been studied for over a century in several disciplines, including physiology, psychology, psychoanalysis, and cognitive sciences.
The objective is to analyze the eye saccades and fixations of persons watching a given scene, in order to extract several kinds of information. During visual perception, human eyes move and successively fixate the most informative parts of the image [1]. Attention is the cognitive process of selectively concentrating on one aspect of the environment while ignoring others. For images and video, visual attention is at the core of visual perception, because it drives the gaze to salient points in the scene.

B. Applications

The main applications of gaze analysis in computer science include the localization of regions of interest for compression purposes. For instance, Osberger and Maeder [2] have defined a way to identify perceptually important regions in an image based on human visual attention and eye movement characteristics. In a similar way, Itti and Koch [3] have developed a visual attention system for scene analysis based on the early primate vision system. Later, Stentiford [4] applied visual attention to similarity matching.

Besides, some works are oriented towards estimating the visual quality of a media in terms of low-level metrics [5], [6]. Such metrics can estimate the distortion after reconstructing a media encoded with lossy compression. The low-level criteria used are objective and evaluate the quality of the media itself at a local level. Our work aims at evaluating the subjective, higher-level process of viewing a media. This process, which is driven by visual attention, is influenced both by low-level aspects (color, texture, shape, orientation) and by the topology of objects in the scene (and their interpretation, which involves cognition).

III. BUILDING AUDIENCE TRACK PATHS FOR ANALYSIS

In order to detect and track users' gaze, it is necessary to employ a gaze tracking device able to determine the fixation point of a user on a screen from the position of their eyes. Non-intrusive gaze tracking systems usually require a static camera capturing the face of the user and detecting the direction of their gaze with respect to a known position. A gaze tracker system is usually composed of an infra-red light source directed towards the user's eyes, a static infra-red camera, a display device, and software providing an interface between them.

A. From raw data...

Gaze trackers provide the horizontal and vertical coordinates of the point of regard relative to the display device.
Thus, one can obtain a sequence of points corresponding to the sampled positions of the eye direction. These points pi correspond to triplets of the form (xi, yi, ti) and reflect the scanpath for a given user (see Figure 2). The scanpath consists of a sequence of eye fixations (the gaze is kept still at a location, yielding regions with a high density of points), separated by eye saccades (fast movements, yielding large spaces with only a few isolated points). The points in the set Pu for a user u participating in our experiment are to be categorized into two classes: fixations and saccades.
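As an illustration, the separation of raw gaze samples into fixations and saccades described in the next subsection can be sketched with a simple velocity-threshold procedure. This is a minimal sketch: the threshold values and the pixel-based (rather than angular) velocity are our assumptions for this example, not the parameters used in our experiments.

```python
# Sketch of a velocity-threshold fixation identifier, assuming gaze samples
# are (x, y, t) triplets in pixels and seconds.  Threshold values are
# illustrative assumptions.
from math import dist  # Python >= 3.8

def identify_fixations(points, v_max=50.0, min_duration=0.1):
    """Group consecutive samples whose velocity stays below v_max
    (pixels/second) into fixations; drop groups shorter than min_duration."""
    groups, current = [], [points[0]]
    for p, q in zip(points, points[1:]):
        dt = q[2] - p[2]
        velocity = dist(p[:2], q[:2]) / dt if dt > 0 else float("inf")
        if velocity < v_max:
            current.append(q)          # still within the same fixation
        else:
            groups.append(current)     # a saccade ends the current group
            current = [q]
    groups.append(current)
    # keep only groups long enough to count as a fixation
    return [g for g in groups if g[-1][2] - g[0][2] >= min_duration]
```

Dedicated eye-tracking toolkits implement more robust variants of this identification (e.g. with adaptive thresholds), but the principle is the same: a velocity threshold splits the sample stream, and a duration threshold eliminates insignificant groups.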

B. ...to identified fixations

An essential part of scanpath analysis and processing is the identification of fixations and saccades. Indeed, fixations and saccades are often used as the basic source for the various metrics used to interpret eye movements [7], [8] (number of fixations, number of saccades, duration of the first fixation, average amplitude of saccades, etc.). The most widespread identification technique computes the velocity of each point (defined as the angular speed of the eye in degrees per second). The velocity of a point corresponds to the distance that separates it from its predecessor or successor. The separation of points into fixations and saccades is achieved using a velocity threshold: consecutive points p_i and p_{i+1} separated by a distance below the threshold are grouped into what is considered to be an eye fixation f_j. All f_j form the set F_u of fixations for the user u, and F denotes the set of all fixations for all users. Another threshold, on the minimal duration of a fixation, allows insignificant groups to be eliminated.

C. Multi-user data merging

Once the fixations are identified for all users and grouped into the set F, a clustering process reduces the spatial characteristics of the fixations to a limited set of clusters K_i. The clustering is achieved with unsupervised techniques, such as K-means, because the spatial distribution of the fixation points is unknown and their number is finite. The obtained clusters define the locations where most of the audience has looked.

D. Working hypothesis

Together with practitioners and human scientists, we have set a working hypothesis stating that a message in a visual media is likely to be best understood when most people in the audience are able to see directly the main target objects at given times. The target objects are to be selected by the media producer. This working hypothesis has inspired the definition of the following indicators, which help measure metrics about the scanpaths.

E. Fixation distribution indicators

Based on the above-defined set of clusters K_i, we can now define dedicated indicators to estimate features of the distribution of the fixation points. We have identified the following indicators for fixation point analysis:

• The average total number of fixations: the average total number of fixations \bar{n} over all participants is defined by:

    \bar{n} = \frac{1}{||U||} \sum_{u \in U} ||F_u||    (1)

where U denotes the set of all participants (hence its cardinal ||U|| is the number of participants), and ||F_u|| is the number of fixation points for participant u.

• The average duration of fixations: the average duration of fixations \bar{\Delta} is obtained with the following formula:

    \bar{\Delta} = \frac{1}{||F||} \sum_{f \in F} \Delta(f)    (2)

where \Delta(f) is the duration of the fixation f, obtained by subtracting the timestamp of the first sample point in f from the timestamp of the last sample point in f.

• The maximum duration of fixations: the maximum duration of fixations \Delta_{max} is defined to be:

    \Delta_{max} = \max_{f \in F} \Delta(f)    (3)

• The average duration of the first fixation: the average duration of the first fixation \bar{\Delta}_1 is defined as:

    \bar{\Delta}_1 = \frac{1}{||U||} \sum_{u \in U} \Delta(f_1^u)    (4)

where f_1^u is the first fixation of participant u.

• The average scanpath length: an example of scanpath is given in Figure 1. The scanpath length L_u for a given participant is the sum of the lengths of all segments between consecutive fixation points:

    L_u = \sum_{i=1}^{||F_u||-1} dist(f_i, f_{i+1})    (5)

Then the average scanpath length \bar{L} is:

    \bar{L} = \frac{1}{||U||} \sum_{u \in U} L_u    (6)

Fig. 1. Definition of the scanpath length.

• The average scanpath duration (for still images): the average scanpath duration \bar{D} represents the average time spent by participants to explore the displayed media:

    \bar{D} = \frac{1}{||U||} \sum_{u \in U} D_u    (7)

where D_u is the scanpath duration for participant u, obtained by subtracting the timestamp of the first sample point in F_u from the timestamp of the last sample point in F_u.

• The average scanpath convex area: the scanpath convex area S_u for a participant is given by the surface of the convex hull of their fixation points, S_u = convexHullSurface(F_u). Then the average scanpath convex area \bar{S} is:

    \bar{S} = \frac{1}{||U||} \sum_{u \in U} S_u    (8)

• The average number of regressions: Figure 2 shows an example of a regression. The number of regressions R_u is the number of fixations f_i whose incoming and outgoing segments form an angle of less than 90°: R_u = ||\{ f_i \mid \widehat{f_{i-1} f_i f_{i+1}} < 90° \}||. Then the average number of regressions is \bar{R} = \frac{1}{||U||} \sum_{u \in U} R_u.

Fig. 2. A regression corresponds to the case where three consecutive fixation points form an angle of less than 90°.

• Gini coefficient: the Gini coefficient is a measure of statistical dispersion widely used in economics to estimate the inequality of wealth or income distribution. In our application, after tessellating the displayed media into rectangular patches, we use the following definition of the Gini coefficient:

    G = \sum_{i,j} \left( \frac{\phi_{i,j}}{||F||} \right)^2    (9)

where ||F|| is the total number of fixation points, and \phi_{i,j} \in [0, ||F||] is the number of fixation points in the patch (i, j). The value of the Gini coefficient, which belongs to [0, 1], quantifies the degree of concentration (or dispersion) of the set of fixation points in the image. A value close to 1 indicates that the points are strongly concentrated in a few patches; values close to 0 mean that the points are strongly dispersed over the image.

We believe that the indicators listed above are useful for estimating metrics about the scanpaths across participants, although the list is not exhaustive. We describe in the following section the experimental setting for measuring gaze data from participants, and show some examples to illustrate our approach.
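As an illustration, several of these indicators can be computed in a few lines. The sketch below assumes fixations are stored per participant as (x, y, t_start, t_end) tuples; the size of the patch grid used for the Gini coefficient is an arbitrary choice for this example.

```python
# Sketch of some of the fixation distribution indicators, assuming fixations
# are stored per participant as (x, y, t_start, t_end) tuples.  The 8x8 patch
# grid for the Gini coefficient is an illustrative assumption.
from math import dist

def indicators(fixations_per_user, width, height, patches=8):
    users = list(fixations_per_user)
    all_fix = [f for fs in fixations_per_user.values() for f in fs]
    n_bar = len(all_fix) / len(users)                               # Eq. (1)
    delta_bar = sum(t1 - t0 for _, _, t0, t1 in all_fix) / len(all_fix)  # Eq. (2)
    # average scanpath length: sum of segment lengths, averaged over users
    l_bar = sum(
        sum(dist(a[:2], b[:2]) for a, b in zip(fs, fs[1:]))
        for fs in fixations_per_user.values()
    ) / len(users)
    # Gini coefficient over a grid of rectangular patches, Eq. (9)
    counts = {}
    for x, y, _, _ in all_fix:
        key = (int(x * patches / width), int(y * patches / height))
        counts[key] = counts.get(key, 0) + 1
    gini = sum((c / len(all_fix)) ** 2 for c in counts.values())
    return n_bar, delta_bar, l_bar, gini
```

The remaining indicators follow the same pattern; the convex-hull area, for instance, would typically be delegated to a computational-geometry routine rather than computed by hand.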

IV. EXPERIMENTS

We have used a single-camera gaze tracker setting, based on the Pupil-Centre/Corneal-Reflection (PCCR) method to determine the gaze direction. The video camera is located below the computer screen and monitors the subject's eyes. No attachment to the head is required, but the head still needs to remain static. A small, low-power infrared light-emitting diode (LED) is embedded in the infrared camera and directed towards the eye. The LED generates the corneal reflection and causes the bright pupil effect, which enhances the camera's image of the pupil. The centers of both the pupil and the corneal reflection are identified and located, and trigonometric calculations allow projecting the gaze point onto the image.

Given this experimental setting, we have recorded the gaze of 10 persons participating in our tests. Participants were successively shown images and were asked to watch the presented scenes attentively. All gaze information has been recorded and processed according to the description given in Section III. We show in the remainder of this section some results of our approach for still images and for movies.

to evaluate the impact of their advertisements during sport events, and how best to position them. These two examples show keyframes taken from broadcast sport events (soccer and tennis). For this specific application, if we want to focus on the specific advertisement areas of the media while disregarding other, irrelevant parts, the model could benefit from a supplementary estimator counting the hits of the fixation points in advertisement areas:

• The average number of relevant hits: the number of relevant hits H_u^r in a specific image/frame region r for a participant is given by:

    H_u^r = ||\{ f_i \mid f_i \in r \}||

Then the average number of relevant hits is:

    \bar{H}^r = \frac{1}{||U||} \sum_{u \in U} H_u^r    (10)
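A minimal sketch of this estimator, assuming an axis-aligned rectangular region given as (x0, y0, x1, y1) and fixations given as (x, y) pairs (both are assumptions made for this illustration):

```python
# Sketch of the relevant-hits estimator, Eq. (10): count fixations falling
# inside a rectangular advertisement region, averaged over participants.
def relevant_hits(fixations_per_user, region):
    x0, y0, x1, y1 = region
    hits = [
        sum(1 for (x, y) in fs if x0 <= x <= x1 and y0 <= y <= y1)
        for fs in fixations_per_user.values()
    ]
    return sum(hits) / len(hits)   # average over participants
```

In practice the advertisement region would come from the media producer's annotations, and could be an arbitrary polygon rather than a rectangle.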



A. Still images

Figure 3 shows three images from our database, representing advertising posters for specific social or scientific events. For the first two posters, we highlight the recorded scanpath. The scanpath, which is superimposed on the pictures, is composed of several points of regard, each linked to the previous one to represent the path. On the third poster, we have superimposed the result of clustering the fixation points, which defines a heat map of the most viewed locations in the poster, giving an indication of whether the main information is seen or not. This information should be matched against the requirements of the media producer, e.g. to check whether the list of sponsors is well seen.

Figure 4 shows another example, a movie poster (the movie is entitled Into the Blue, directed by John Stockwell in 2005). The left image shows the recorded scanpath, and the right image shows the clustered points as disks (the size of a disk represents its relative importance in the poster, according to cluster features such as the number of fixation points in the cluster and their variance from the centroid). This example demonstrates that the movie title is hardly seen by the target audience.
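The clustering step that produces the heat map and the disks can be sketched with a minimal K-means in plain Python. The number of clusters and the naive initialisation are illustrative assumptions; in practice a library implementation with proper seeding would be preferable.

```python
# Minimal K-means sketch for clustering fixation points (Section III-C).
# The naive initialisation (first k points) is an assumption for brevity.
from math import dist

def kmeans(points, k, iterations=20):
    centroids = points[:k]                      # naive initialisation
    for _ in range(iterations):
        # assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # recompute centroids; keep the old one if a cluster is empty
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters
```

The size of each returned cluster can then be mapped to a disk radius, as in the right image of Figure 4.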

Fig. 4. Another example for a movie poster, with the scanpath displayed on the left, and clustered fixation points displayed on the right.

The last example for static images, shown in Figure 5, illustrates how our approach can help advertising designers

Fig. 5. Illustration of our approach for evaluating the impact of advertisement campaigns during sport events.

B. Movies

We now apply our approach to movies. The material that we have used is a 250 advertising movie of Coca Cola¹. Figure 6 shows a sampled sequence of keyframes from the movie, with the superimposed fixation tracks from all 10 participants. Based on the previously defined indicators, we have estimated a dispersion value for each frame in the video, denoting how spread out the fixations from all participants are. The higher this dispersion value, the more undefined the focus of attention; the lower, the more precise it is. We show in Figure 7 a plot of the evolution of the dispersion value against time for this short movie. Under the graph are highlighted the time intervals where the dispersion value is below a threshold, denoting moments of concentrated focus in the video. Sampled keyframes extracted from the video at such moments are also displayed.

The dispersion value reveals the structure of the movie. It is composed of a fast sequence of short-length shots, with a few longer shots in between, where the faces of the characters can be clearly seen. The intervals of low dispersion (i.e. good focus) correspond to the longer shots, and the intervals of high dispersion (i.e. uncertain focus) correspond to chains of short-length shots, with corresponding hard cuts in which the last

¹ This movie has been submitted to the workshop as an attached document. It is also available from http://www.lifl.fr/∼martinej/martinet09gazebasedWCVIM.MP4. Please refer to the original movie to experience the dynamics of the recorded fixation points.
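The detection of low-dispersion intervals can be sketched as follows. Using the mean distance to the centroid as the per-frame dispersion measure is an assumption made for this illustration; our actual dispersion value is built from the indicators of Section III.

```python
# Sketch of a per-frame dispersion analysis, assuming each frame carries the
# list of (x, y) fixation points from all participants.  The mean distance to
# the centroid as a dispersion measure is an illustrative assumption.
from math import dist

def dispersion(points):
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return sum(dist((x, y), (cx, cy)) for x, y in points) / len(points)

def focused_intervals(frames, threshold):
    """Return (start, end) frame-index intervals where dispersion < threshold."""
    intervals, start = [], None
    for i, pts in enumerate(frames):
        if dispersion(pts) < threshold:
            start = i if start is None else start
        elif start is not None:
            intervals.append((start, i - 1))
            start = None
    if start is not None:
        intervals.append((start, len(frames) - 1))
    return intervals
```

The returned intervals correspond to the highlighted segments under the graph of Figure 7; keyframes can then be sampled from them.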

Fig. 3. Example of three images from our database, representing advertising posters for specific social or scientific events.

Fig. 6. Sampled sequence of keyframes from the movie, with superimposed fixation tracks from 10 participants.

frame of a shot is replaced by (followed by) the first frame of the next shot, with no transition. In such situations, the focus of attention shifts from the last salient location in the first shot towards the first salient location in the second shot. Due to individual physiological variations, the duration of the gaze shift is not the same for all participants, which explains the high dispersion values. We can also notice that the last shot corresponds to a static display of the advertised product alone, in red on a white background. In this situation, the dispersion value is very low.

V. DISCUSSION

A. Highlight

In our last example, we highlight the fact that a simple analysis of the results for the test movie makes it possible to reveal the structure of the movie, and also to formulate the following recommendation for advertising movie designers: the focus of attention is better preserved for most people in the audience when salient locations coincide between two consecutive shots. Indeed, when the salient locations differ between two consecutive shots, the gaze is naturally shifted, with varying durations for different people.

B. Validation of our approach

To the best of our knowledge, no prior work has been done for the purpose of evaluating the quality of a visual media from gaze points recorded from the audience. As a consequence, it is difficult to compare our findings with other approaches. However, we think that the satisfaction of media producers implementing our approach would be a first qualitative validation.

An alternative solution to evaluate how people may discover a visual media is to use saliency maps [3] to simulate the human vision process. Saliency maps have been widely used in the computer vision field to solve the attention selection problem. The purpose of a saliency map is to represent the conspicuity (as a scalar value) of each location of the image. From such a map, it is possible to simulate the successive fixation points of a user on the image by selecting salient locations. Hence another direction is to compare the result of merging all fixation points from all participants with the result obtained with saliency maps.
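The comparison with saliency maps discussed in this section could, for instance, correlate per-patch fixation counts with per-patch saliency values. The sketch below uses a plain Pearson correlation on two flattened patch grids; the input maps are toy stand-ins, not the output of an actual saliency model such as Itti and Koch's.

```python
# Sketch of comparing a merged fixation-density map with a saliency map over
# the same patch grid, using Pearson correlation.  Both maps are given as
# flat lists of per-patch values; their content here is a toy assumption.
def correlation(map_a, map_b):
    n = len(map_a)
    mean_a, mean_b = sum(map_a) / n, sum(map_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(map_a, map_b))
    var_a = sum((a - mean_a) ** 2 for a in map_a)
    var_b = sum((b - mean_b) ** 2 for b in map_b)
    return cov / (var_a * var_b) ** 0.5
```

A correlation close to 1 would indicate that the recorded audience fixations agree with the model's predicted salient locations; a low correlation would flag media where attention diverges from the prediction.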

VI. CONCLUSION

We have presented a framework for measuring the quality of a visual media, that is to say its ability to transmit the original idea of the creator. This framework is based on the natural gaze of people watching static images or dynamic scenes. It includes qualitative indicators of the nature of the collected scanpaths, to evaluate how well the message is received by the audience.

A. Towards a global quality estimator

The work presented here is part of ongoing work aiming at defining useful tools for visual media designers. As stated in Section III, we have carried out this work based on the hypothesis, from practitioners and human scientists, that a message in a visual media is likely to be best understood when

Fig. 7. Evolution of the dispersion value against time for the short movie. Under the graph are highlighted the time intervals where the dispersion value is below a threshold, denoting moments of concentrated focus in the video. Sampled keyframes extracted from the video at such moments are also displayed.

most people in the audience are able to see directly the main target objects at given times. We have defined a number of indicators for estimating metrics related to the way the audience perceives the visual media. An important part of this work will consist in providing an explicit matching between our indicators and cognitive interpretations in terms of quality of media understanding. For this purpose, it is necessary to test the indicators individually, and to further define a sound way to combine them into a global quality estimator.

B. Large scale settings

An active research area in computer vision aims at developing gaze tracking software operating from a simple webcam. From an accurate face detection, it is possible to determine the position of the eyes, and therefore to find the iris. Eye detection can be based on (among other methods) template matching, appearance classification, or feature detection. In template matching methods, a generic eye model is first created based on the eye shape, and a template matching process is then used to search for eyes in the image. Appearance-based methods detect eyes using a classifier trained on a large number of image patches representing the eyes of several users under different orientations and illumination conditions. Feature detection methods exploit the visual characteristics of the eyes (such as edges, the intensity of the iris, or color distributions) to identify distinctive features around the eyes.

The availability of such techniques is relevant to our research activity in the sense that it will enable the deployment of the presented approach at a large scale, using different types of display devices (computer screens, TVs, shopping mall LCD displays, etc.). We believe that in the near future, webcam-based

trackers will permit such large scale experiments to further validate our approach. Providing such an evaluation tool is crucial for controlling the audience's perception of the media in existing and emerging multimedia systems. Hence the possibility of automatically gathering and analyzing naturally obtained feedback from the audience is a promising proposition.

VII. ACKNOWLEDGMENTS

This work has been supported by the French National Research Agency (ANR) through the project ANAFIX (2006-RIAM-026-01).

REFERENCES

[1] A. L. Yarbus. Eye Movements and Vision. Plenum Press, New York, 1967.
[2] W. Osberger and A. J. Maeder. Automatic identification of perceptually important regions in an image using a model of the human visual system. In International Conference on Pattern Recognition, Brisbane, Australia, 1998.
[3] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, 1998.
[4] F. Stentiford. An attention based similarity measure with application to content based information retrieval, 2003.
[5] S. Winkler. Visual quality assessment using a contrast gain control model. In IEEE Signal Processing Society Workshop on Multimedia Signal Processing, pages 527–532, 1999.
[6] S. Winkler. Issues in vision modeling for perceptual video quality assessment. Signal Processing, pages 231–252, 1999.
[7] R. J. Jacob and K. S. Karn. Eye tracking in human-computer interaction and usability research: Ready to deliver the promises. In The Mind's Eyes: Cognitive and Applied Aspects of Eye Movements, Elsevier Science, Oxford, UK, 2004.
[8] A. Poole, L. J. Ball, and P. Phillips. In search of salience: A response-time and eye-movement analysis of bookmark recognition. In Conference on Human-Computer Interaction (HCI), pages 19–26, 2004.