Experiencing Audio/Video Quality: an investigation into the relationship between perceived video quality and involvement

Nele Van den Ende

Eindhoven University of Technology, Department of Industrial Engineering & Innovation Sciences

Invitation to attend the public defence of my dissertation on Tuesday 15 December 2015 at 16:00. The defence takes place in the Auditorium of the TU/e; afterwards there is a reception, to which you are also warmly invited. Nele Van den Ende, Unilever R&D, Quarry Rd East, Bebington CH63 3JW, UK, 0044 151 641 1481, [email protected]

A catalogue record is available from the Eindhoven University of Technology Library
ISBN: 978-94-6295-303-1
NUR: 778
Keywords: Human-Technology Interaction, Quality of Experience, Quality of Service, perceived video quality, involvement, psychometric approach, method development, questionnaire development
Printed by Proefschriftmaken.nl || Uitgeverij BOXPress
© Nele Van den Ende (2015)
© Cover images: Elke Van den Ende (2015)
All rights reserved. No part of this book may be reproduced, stored in any retrieval system or transmitted in any form or by any means, electronic or mechanical, photocopying, recording, or otherwise, without prior permission of the author.

Experiencing Audio/Video Quality: an investigation into the relationship between perceived video quality and involvement

DISSERTATION

to obtain the degree of doctor at the Eindhoven University of Technology, on the authority of the Rector Magnificus, prof.dr.ir. F.P.T. Baaijens, for a committee appointed by the Board for Doctorates, to be defended in public on Tuesday 15 December 2015 at 16:00

by

Nele Van den Ende

born in Jette, Belgium

This dissertation has been approved by the supervisors, and the composition of the doctoral committee is as follows:
Chair: prof.dr. I.E.J. Heynderickx
1st supervisor: prof.dr. D.G. Bouwhuis
2nd supervisor: prof.dr. W.A. IJsselsteijn
Co-supervisor: dr. L.M.J. Meesters (WUR)
Members: prof.dr. H. de Ridder (TUD)
prof.dr. D.K.J. Heylen (UT)
dr.ir. R.H. Cuijpers
prof.dr. P. Markopoulos

The research or design described in this dissertation has been carried out in accordance with the TU/e Code of Scientific Conduct.

Dedicated to all participants of psychological experiments

Contents

1 Introduction 9
1.1 Quality of Experience framework 11
1.2 Overview of models that propose concrete measures 13
1.3 Quality of Experience with Multimedia Content 15
1.4 Criteria for the determination of important Influencing Factors 18
1.5 Involvement as salient affective Human Influencing Factor 19

2 Theoretical underpinnings 25
2.1 Human visual system 26
2.2 MPEG-2 Encoding 27
2.3 Motion and perceived video quality 29
2.4 Wireless Network Properties 33
2.5 Streaming Video Adaptation Methods 35
2.6 Subjective Video Quality Assessment 38
2.7 Constructing a Scale for affective constructs 39
2.8 Next steps 43

3 Developing an operational definition for involvement with audio/video material 47
3.1 Involvement in the literature 49
3.2 Concept mapping 53
3.3 Study 1: generating statements 54
3.4 Study 2: structuring statements 56
3.5 Study 3: interpretation of statements 59
3.6 Results and discussion 62
3.7 Conclusion 67

4 Development of the Involvement Questionnaire 73
4.1 Constructing a Scale 75
4.2 Cognitive Interviews 76
4.3 Testing the Item Pool via an Online Survey 78

5 Experimental investigation into the relation between the constructs of involvement and perceived video quality 99
5.1 Discriminant validity 101
5.2 Measuring audio/video quality 102
5.3 Method 103
5.4 Results 108
5.5 Discussion and conclusion 118

6 General Discussion & Conclusion 125

A: Screenshots of the online survey used to gather data for the first validation of the involvement questionnaire 136
B: Scree plots across all multimedia fragments 138
C: Overview of initial exploratory factor analyses 139
D: Iterative exploratory factor analyses across all multimedia fragments 149

Summary 152
List of Publications 154
Curriculum Vitae 155
Acknowledgements 156

Chapter 1: Introduction


Gauntlett & Hill (1999) presented the findings of a revelatory study on the TV viewing behaviour of 500 people over a period of five years. Its results provided a comprehensive picture of the multifaceted interaction of people of all ages with their TV and its content, and clearly showed that in people's daily lives TV is more than just a box with moving pictures. As Gauntlett & Hill (1999)'s book shows, TV can act as an environmental source, i.e. as background noise. TV can also act as a regulative source, i.e. a punctuation of time and activity, or a structural use of TV watching, e.g. allowing people time to unwind and relax. Watching TV also has relational uses: it can facilitate communication, encourage social learning, or be used as an avoidance strategy. It is highly likely that multimedia content today, regardless of whether it is consumed via a TV, a PC, a tablet or a mobile, still fulfils similar roles. That is to say, sometimes multimedia content is a background to people's lives, and at other times the content is closely monitored. When content is closely monitored, it is quite often because people are looking for, among other things, relaxation, emotional or experiential escape, stress relief, entertainment, information, companionship (managing solitude), or disengagement from others (Gauntlett & Hill, 1999; Hassenzahl, 2008; See-To, Papagiannidis, & Cho, 2012). Establishing what experiences people are looking for is what Quality of Experience (QoE) is all about.

However, when multimedia content is shown to users, many things can go wrong in the delivery chain, especially when the content is delivered via wireless networks (e.g. images may freeze, or visual artefacts may become visible). The ultimate goal is for nothing to go wrong, but as there are many factors at play, it is realistic to accept that if something can go wrong, it may go wrong. Hence, it is important to have strategies in place which prevent users from noticing that something has gone wrong, and it is necessary for content providers to understand how to best invest their resources in optimizing multimedia quality. These strategies rely on optimizing factors, and until recently this optimization has focused almost exclusively on Quality of Service (QoS). QoS is defined as the totality of characteristics of a telecommunications service that bear on its ability to satisfy stated and implied needs of the user of the service (ITU-T Rec. E.800, 2008). QoS research traditionally centres on notions of network performance or other system-level performance, i.e. physical levels of performance, and does not tend to concern itself with context of usage or user characteristics (Le Callet, Möller, & Perkis, 2013). Quality of Experience (QoE), on the other hand, is wider in scope, and aims to include assessment of system performance as coloured by its context, culture, expectations, preferences, etc. (Le Callet, et al., 2013).

The next section gives an overview of a QoE framework, followed by an overview of concrete models. Next, the specific case of QoE for multimedia content is discussed, followed by an overview of research on factors that have already been defined as important in QoE. The chapter ends with a proposal for criteria one can use to determine whether or not a factor is important in quality research, the proposal for my own concrete model, and how such a concrete model can be achieved. Such a model can then be used to fill in knowledge gaps within the QoE framework detailed in the next section. The measures that accompany a concrete model should then also be suitable for supporting further exploration of the QoE framework.

1.1 Quality of Experience framework

A recently agreed-upon working definition of QoE by the Qualinet consortium (Le Callet, et al., 2013) states that it is “the degree of delight or annoyance of the user of an application or service. It results from the fulfilment of their expectations with respect to the utility and/or enjoyment of the application or service in the light of the user’s personality and current state.” This definition of QoE will be adopted throughout this thesis. Given that delivering multimedia content to the user often happens under complex circumstances, optimizing QoE such that the user does not notice any performance drops is an important goal; otherwise users might switch to different services, applications or content. However, in order to optimize QoE it is important to know what the salient factors needing optimization are. This can only be achieved through the creation of a framework which provides guidance on factors that can be improved, depending on the situation or the main goal of investigators working to optimize QoE.

Le Callet et al. (2013) provide a good starting point for a framework. There are three kinds of Influencing Factors - human, system and context - which are related to content experience at four different levels: perceptual, interaction, usage and service. These Influencing Factors are linked to the content experience levels in an orthogonal fashion (see Figure 1). This orthogonal scheme allows for the creation of hypotheses about factors that are in principle independent of each other. As detailed in the Qualinet white paper (Le Callet et al., 2013), Influencing Factors are defined as “any characteristic of a user, system, service, application or context whose actual state or setting may have influence on Quality of Experience for the user”. The Influencing Factors (IFs) are divided into three categories:

1. Human Influencing Factors, defined as any variant or invariant property or characteristic of a human user. This includes, but is not limited to, demographics, socio-economic information, physical and mental constitution, and emotional state. The category distinguishes between low-level processing, such as physical and emotional responses, and higher-level processing, such as conscious interpretation and judgment.

2. System Influencing Factors, defined as the properties and characteristics that determine the technically produced quality of the application or service, related to media capture, coding, transmission, storage, rendering, reproduction/display, and the communication of the information from the content production. The category distinguishes between four levels: content related, media related, network related and device related.

3. Context Influencing Factors, defined as factors that embrace any situational property describing people's environment, which can be at different levels of magnitude or dynamic patterns of occurrence. Three contexts are specifically distinguished: physical, temporal and technical/informational.
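One way to make the orthogonal categories-by-levels scheme concrete is to model each Influencing Factor as a (category, level) pair; a minimal Python sketch of this idea follows, where the class and attribute names are illustrative, not Qualinet's:

```python
from dataclasses import dataclass
from enum import Enum

class IFCategory(Enum):          # the three Qualinet IF categories
    HUMAN = "human"
    SYSTEM = "system"
    CONTEXT = "context"

class ExperienceLevel(Enum):     # the four content-experience levels
    PERCEPTUAL = "perceptual"
    INTERACTION = "interaction"
    USAGE = "usage"
    SERVICE = "service"

@dataclass
class InfluencingFactor:
    name: str
    category: IFCategory
    level: ExperienceLevel       # orthogonal: any category can pair with any level

# Hypotheses can then target specific cells of the scheme, e.g.:
packet_loss = InfluencingFactor("packet loss", IFCategory.SYSTEM, ExperienceLevel.PERCEPTUAL)
involvement = InfluencingFactor("involvement", IFCategory.HUMAN, ExperienceLevel.USAGE)
```

Because category and level vary independently, each (category, level) cell can be studied while holding the other dimension fixed, which is what makes the scheme useful for hypothesis generation.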

Figure 1: A schematic representation of the Qualinet QoE model (Le Callet, et al., 2013) as discussed above, as presented by Redi, Zhu, de Ridder, & Heynderickx (2015). As can be seen, the bottom line introduces the difference between the reference content and the content actually delivered, which contributes to the final QoE judgment. The weighting of the difference is based on the configuration of the specific IFs, represented by the computing units (circles in the model). This configuration depends on the application area. The green computing unit indicates the influence of System IFs at the perception level, as most research in the QoE domain so far has focused on this specific aspect of QoE (Redi et al., 2015).

As one can see, this framework is wide open, and still does not define what the salient IFs are. Of course, the IFs can depend on the application domain. However, it is still important to create concrete models with good operational definitions, which satisfactorily explain the relationship between the IFs in the model. This means that the definitions of the factors, together with their relationships, allow for the creation of empirically testable hypotheses. As Martens & Martens (2001) and Barakovic & Skorin-Kapov (2013) argue, when it comes to measuring quality, there are potentially many characteristics and factors which can be measured. The challenge is to find the relevant ones, and to define a way to measure them. This allows the creation of a concrete model, which then clarifies what, where, when and how to collect relevant data to lead to optimization of QoE.

1.2 Overview of models that propose concrete measures

As the literature shows, measuring and optimising QoE is a complex matter, and much remains unclear, especially when it comes to Human IFs, which often involve concepts that are difficult to define. Barakovic & Skorin-Kapov (2013)'s overview of the currently available models and frameworks shows that very few QoE models are concrete and testable.

According to Redi et al. (2015), de Ridder & Endrikhovski (2002)'s FUN-model is a concrete model which also marks a paradigm shift in the QoS field. Early QoS research often made the implicit assumption that the human visual system and quality judgement remained constant over time and between users, and that context or content would not influence image or video quality judgements (Redi et al., 2015). However, de Ridder & Endrikhovski (2002)'s paper showed that these assumptions were invalid. This led to further research concerning higher-level HVS features, such as visual attention, as well as higher-order cognitive processing and contextual factors (Redi, et al., 2015). In turn, this research led to the realization that either the scope of QoS needed widening, or that a different framework was needed, i.e. Quality of Experience.

de Ridder & Endrikhovski (2002)'s FUN-model of image quality assumes that there are three major properties, called constraints by the authors, which determine image quality. The first is fidelity, defined as the degree of apparent match of the reproduced image with the external reference (i.e. the original, which can be anything from the natural scene to the unprocessed image). The second constraint is usefulness, or the degree of visibility of details in a reproduced image. It is always important to place usefulness into a context, to determine why people are looking at a certain image. The third and last constraint is naturalness, defined as the degree of apparent match between the reproduced image and internal references of the user of the images. de Ridder & Endrikhovski (2002) propose that overall perceived image quality can be modelled as a weighted sum of the three constraints, with weights ranging from 0 to 1. This model acknowledges that perceived image quality may vary at any moment, depending on the content, context and user, which is consistent with the QoE definition as stated in section 1.1 and as defined in the Qualinet white paper (Le Callet, et al., 2013).
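Written out, the FUN-model's weighted sum takes the following form; the weight symbols are illustrative notation, not the authors' own:

$$ Q_{\mathrm{image}} = w_F \, F + w_U \, U + w_N \, N, \qquad w_F,\, w_U,\, w_N \in [0, 1], $$

where $F$, $U$ and $N$ denote the fidelity, usefulness and naturalness judgements respectively, and the weights may shift with content, context and user.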


A so-called high-level overview model was proposed by Laghari, Connelly, & Crespi (2012). The aim of the model is to allow adaptation to various contexts. The authors drew on many different fields, as they believe that QoE needs to capture people's aesthetic and hedonic needs (Hassenzahl, 2008), expanding on the technology-centric QoS approach. While this is not a concrete model with proposed measures, it is very clear with regard to which variables are considered important, as well as how the authors see the relationships between these variables. Contrary to de Ridder & Endrikhovski (2002)'s model, however, they make the assumption that human physical and psycho-physical factors such as the HVS and reaction times are constant over time and between people, which has been shown not to be the case (e.g. Westerink, 1989; Sternberg, 2009; Fedorovskaya & De Ridder, 2013).

Laghari et al. (2012)'s model defines four domains: human, context, technology, and business. Each domain has three levels of abstraction: first, it has an entity; second, the entity can have several roles; and third, each role can have multiple attributes/characteristics. Laghari et al. (2012) propose that technological, business, and contextual characteristics all directly influence QoE, but are moderated by human characteristics (such as interest and social context) and content. They also posit possible relationships between the technology and business domains, between the context and technology domains, and between the context and business domains; these relationships may also indirectly influence QoE. Taking a closer look at Laghari et al. (2012)'s model, they propose that the salient, albeit high-level, characteristics that determine QoE in the human domain are divided into subjective and objective factors. By subjective factors they mean reflections and aspects of human perceptions, intentions, and needs, such as ease of use, joy of use, usefulness, satisfaction, annoyance, and boredom. Objective factors are seen as physiological factors (such as electroencephalographic (EEG) measures, heart rate, blood pressure, respiration and skin conductivity) as well as measures of cognitive factors (such as memory, attention, reaction time, and task performance). Relating this model to the Qualinet framework (Le Callet, et al., 2013) shows that the human domain is firmly planted within the Human IFs, while the technology domain is synonymous with the System IFs. Both the business and the contextual domain would then be part of the Context IFs, as the Context IFs cover the economic aspect, as well as physical, temporal, and information aspects.

Geerts et al. (2010) also proposed a comprehensive QoE framework, with the express intention to relate existing measures (or research methods) to aspects of their QoE framework, as an important factor for business success is an optimal match between QoS and QoE. Their aim is to go beyond the often-used Mean Opinion Score (MOS), which mainly addresses the system/network/application aspects of QoE (or QoS), and to look at which other measures could relate more user-centred aspects of QoE to the technical aspects of QoS. Geerts et al. (2010) propose to use, for example, the AttrakDiff questionnaire and psycho-physiological measurement tools (e.g. heart rate, galvanic skin response (GSR)) to better measure several QoE aspects, as well as to determine which of these measures represent salient aspects of their QoE framework. The framework Geerts et al. (2010) proposed has some interesting facets currently not explicitly covered in the Qualinet framework (Le Callet, et al., 2013), such as abandonment of watching or using multimedia content. In typical controlled laboratory studies, abandoning the content is not an option for the participant, but it would certainly be interesting to look into further, as it may give further ideas about which Human and Context IFs are salient for predicting continued use of a service/application in the current interactive media landscape.

Another interesting facet of this QoE framework (Geerts, et al., 2010) is its distinction between micro- and macro-temporality. They borrow this notion from Karapanos, Zimmerman, Forlizzi, & Martens (2009), and refer to micro-temporality as a use process which happens at a specific moment in time, usually delineating a short experience with specific content. Macro-temporality, on the other hand, is more concerned with the prolonged use of a process/application/service and looks at how familiarity impacts the experience. Applied to QoE, studies generally look at multimedia content on a micro-temporal scale, as they often use short video clips (less than 5 minutes long). Karapanos et al. (2009) argue that initial experiences are coloured mostly by hedonic aspects, whereas prolonged experience with a service or application may become more tied to aspects of meaningfulness in a user's life. Geerts et al. (2010) do not offer a measure for these temporal aspects, but it is important to note that they make the explicit assumption that QoE is changeable, partially depending on three interrelated context factors: socio-cultural (interplay between social structure and members of society), situational (interpretation of situations in everyday life), and interactional (interaction between the user, tools and tasks). This points towards the need to develop or find a measure (or suite of measures) that can capture this temporal fluidity and can be used multiple times throughout a study.

1.3 Quality of Experience with Multimedia Content

As the Qualinet white paper (Le Callet et al., 2013) explicitly states, QoE depends on the context of use, which is usually determined by the application domain, i.e. when and how content will be viewed and used by people. Hence, the recommendation of Le Callet et al. (2013) is to use a more specialized definition, specifically targeted to take the application's requirements or needs into account. For the purpose of this thesis, Quality of Experience with Multimedia Content is thus defined as the extent to which the primary needs of the user/observer/consumer/person, in a specific context while experiencing the multimedia content (regardless of application or service), are satisfied.

It is likely that affective factors will play a big role in delivering said satisfaction. For the purpose of this thesis, affective factors are seen as factors that relate to and influence people's mood and emotions. Cognitive factors will likely depend on the primary (self-expressed) need to experience the multimedia. For example, if the primary need is distraction or enjoyment, cognitive factors are likely to play a minor role. If the primary need is to learn something, cognitive factors such as perceived learning success and information retention will play a dominant role. Both affective and cognitive factors clearly belong to the Human IFs, but there is currently no agreement on which affective and cognitive factors are important.

Content itself is a somewhat difficult case: in the Qualinet QoE framework (Le Callet, et al., 2013) the importance of the role of content is acknowledged, but under the current definition the meaningful aspects of content are left out, and only the technical aspects of content (e.g. movement, colour depth, texture, 2D/3D) are taken into account. However, several studies have attempted to test the influence of content directly. Fedorovskaya & De Ridder (2013) described research that investigated why an image is appealing.

The conclusions were that the majority of the reasons why participants decided an image was appealing (or not) related to people, composition and subject, i.e. the content of the image. Ketyko et al. (2010)'s results showed that, in trying to predict QoE, evaluation of content accounted for more than 20% of the predictive model. However, they found no correlation between the evaluation of content and the QoS measures. This could mean that other factors aside from content play a role as well, such as context, device, interaction, and artefacts. Alternatively, it could mean that they did not ask the right questions to accurately represent content, or how content can influence people.

Measuring the meaningful aspects of content directly is not easy. However, it is possible to measure the effect content has on people. Liking, for example, has been investigated by Kortum & Sullivan (2004, 2010). Their investigation into the relationship between content and perceived video quality (PVQ) showed a significant positive correlation: content that participants rated higher in liking received significantly higher PVQ ratings. Kortum & Sullivan (2004) hypothesised that if a viewer likes the content, this leads to involvement with the content, which in turn may leave the viewer with fewer cognitive resources to make a quality judgement. Alternatively, it may be that their judgement about the content quality from a 'meaning' perspective flows over into their video quality judgement, as a sort of halo effect (Kortum & Sullivan, 2010). Antons, Arndt, De Moor, & Zander (2015) also investigated what they called 'likability of multimedia content'. However, they did not report a significant correlation between likability and perceived video quality. This could be because they also included valence measures in their research, which did yield significant results (see the next paragraphs). Thus, the relationship between likability and perceived video quality could be modulated by other factors, e.g. valence. One important thing to note here is that both Kortum & Sullivan (2004) and Ketyko et al. (2010) asked only one question to investigate liking or content influence. When measuring a concept with only one question, it is important to validate beforehand that this question indeed reliably measures the intended concept; otherwise it is not possible to draw conclusions about the wider concept under investigation, but only about the particular question asked. Given that content seems such a big determinant of QoE, it is worthwhile to investigate whether it is possible to create a validated and reliable measure which can capture aspects of content, such as liking and involvement, to relate to QoS measures.

As mentioned above, valence has also been investigated, e.g. by Antons et al. (2015), whose results showed that valence, as measured by the Self-Assessment Manikin (SAM), had a significant positive correlation with perceived video quality (PVQ), i.e. if a participant had a higher positive emotional rating for a particular video, they would most likely also rate the perceived video quality higher. Note that the causal relationship between valence and PVQ is unclear, and that so far only a correlational relationship has been established. Other valence-related factors, such as mood before watching multimedia content (as measured by the Pick-a-Mood scale) and memories associated with the content (negative, positive or neutral), directly influenced a participant's emotional responses. Valence might therefore qualify as a good candidate for a salient Human IF in the Qualinet QoE framework (Le Callet, et al., 2013).

Other positive affective factors have been investigated by De Moor et al. (2014), Pinson, Sullivan, & Catellier (2014), and Dobrian et al. (2011). De Moor et al. (2014) used a wide range of measures to represent affective and cognitive factors, with the intent to explore which factors showed relationships with perceived video quality (measured by a 5-point Absolute Category Rating (ACR) scale and acceptability (yes/no)). Their results showed a significant relation between PVQ and focused attention (a cognitive factor) and felt involvement (an affective factor), constructs adapted from the User Engagement scale (O'Brien & Toms, 2010). Hence, De Moor et al. (2014) concluded that perceptible impairments in a video can form a barrier to being engaged or involved with content, which in turn may lead to attrition of users/viewers for a platform or service providing multimedia content. Similarly, Pinson et al. (2014) investigated the relationship between perceived quality of multimedia content (measured with the typical 5-point ACR for overall quality, audio quality and video quality) and interest in the subject matter (as a proxy for opinion about content). Their results showed that interest in the subject matter explained 10% of the differences in perceived quality, but there was no consistent trend that interest also drives higher quality scores.

Dobrian et al. (2011) also investigated the relation between objective video quality metrics and engagement, but used a completely different approach. They assume that engagement is a reflection of user involvement and interaction (not further defined), and as such use play-time per video, defined as how much of a video the person is assumed to have watched, and number of views/videos watched as proxy measures of engagement. Their results showed that the buffering ratio of a video (the fraction of the total session time spent buffering the multimedia content) had a consistent decreasing effect on the amount of play-time, i.e. more time needed to buffer the content significantly decreased the play-time. Additionally, bitrate is important for retention when it comes to people watching live sports content, i.e. a higher bitrate leads to better retention of viewers. Join time, i.e. the time between pressing play and the video actually starting to play, also played a role in how long in total a viewer would watch something. However, as Dobrian et al. (2011) did not want to rely on subjective measures, they make assumptions around engagement that have not been tested. To create a fuller picture, it is necessary to test their assumptions with subjective measures such as observation of users or self-report measures.

Negative affect has also been investigated, for instance by Verdejo et al. (2010), who showed that boredom correlated with the quality of the mobile network, i.e. if the speed of the network was slow, boredom ratings went up. However, as there was no clear definition of boredom, it cannot be determined whether the participants in their study interpreted boredom more along the valence scale (bored-relaxed), the arousal scale (dull-jittery) or both (Bradley & Lang, 1994).
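Dobrian et al. (2011)'s two system-side metrics follow directly from the definitions above; the symbols are illustrative notation, not the authors' own:

$$ \mathrm{buffering\ ratio} = \frac{t_{\mathrm{buffering}}}{t_{\mathrm{session}}}, \qquad \mathrm{join\ time} = t_{\mathrm{playback\ starts}} - t_{\mathrm{play\ pressed}}, $$

where $t_{\mathrm{session}}$ is the total session time; a higher buffering ratio and a longer join time were both associated with reduced play-time.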


In summary, the following affective factors have shown significant relationships with PVQ: valence, interest, liking, involvement, engagement, satisfaction, and boredom. These factors can currently be tested with several self-report measures, such as the SAM, and with scales adapted from other areas, such as the User Engagement scale (O'Brien & Toms, 2010). There are some other potential measures, such as EEG (Antons et al., 2015), eye-tracking (De Moor et al., 2014) and other physiological (skin conductance, heart rate) or observational (e.g. sitting posture, smiling, resistance to distraction, laughing, verbal utterances) measures, which are currently under investigation. These could prove useful in the future, although they are perhaps obtrusive or not easy to use outside of a lab setting. However, it is important to note that most of the self-report measures (as well as the other potential measures) reported here have not been validated for use within the QoE context. The next section discusses some criteria which can be used to determine whether a factor is a salient IF, as well as what is necessary to determine whether a measure is a good proxy for a factor.

1.4 Criteria for the determination of important Influencing Factors

Currently there are no clear criteria to determine whether something is a salient factor, or a measure for a salient factor. To optimize QoE, it is important to determine these important factors, so that practical steps can be taken when delivering multimedia content. As Barakovic & Skorin-Kapov (2013) state, the challenge is not just to optimize QoE; the question is what to improve, at which point in the process, when (and how often), and how. Martens & Martens (2001) support this by stating that, given a specific purpose and time-frame, only some properties of the quality one is trying to measure will be relevant. In order to operationalize these quality properties for use in QoE management, the investigator needs to go through two phases. The first phase is about finding the quality criteria through explorative data analysis; the second is to use these criteria as specifications with control limits, through confirmative and predictive data analysis (Martens & Martens, 2001).

To clarify the concept, Martens & Martens (2001) propose four different notions of quality. The first is the notion of quality as qualitas, i.e. it describes the inherent characteristics or properties of an object (Q1). The second definition covers the expression of quality where the object's excellence or goodness is intuitively evaluated by humans (Q2). The third definition assumes that quality has practical relations between its properties and human needs, i.e. the quality of an object can satisfy stated or implied needs, and these quality properties can be standardized (Q3). The fourth and last definition looks at quality as a subjectively experienced event (this is also the assumption made in the Qualinet white paper (Le Callet, et al., 2013)), i.e. quality is experienced through an event, and can be communicated to others (Q4). The relationship between these definitions is proposed to be as follows: Q1, an object's inherent properties, interacts with Q2, the perceived excellence/goodness, which then results in Q3, defining needs and quality standards. For each person then interacting with this object, the process is experienced as an event or action (Q4).

Applying these four definitions to QoE with multimedia content could look something like this:

Q1. Qualitas: find the characteristic properties that describe the essential nature of, and variations in, supplying multimedia content to users.

Q2. Excellence or goodness: assuming users are watching the multimedia content with intent, find which content the users like the most (or which satisfies their primary needs best).

Q3. Standards: (a) quality criteria: find which sensory/affective/cognitive quality criteria are relevant to predict users' continued preference for a service or product which supplies multimedia content; (b) during production or supply of the multimedia content, use the criteria defined as relevant and make sure that the content stays within the limits of its quality specifications.

Q4. Event: watch your multimedia content of choice and enjoy it.

Within the QoE community, a lot of progress has been made with defining characteristic properties which describe the System IFs. However, less progress has been made with the Human or Context IFs. As the previous section shows, quite a wide range of factors is being investigated. However, when finding and defining quality criteria, starting from an operational definition gives much greater power to decide whether the measures the investigator used (a) actually measured the IF under investigation, and (b) whether the IF is sufficiently relevant to be seen as a salient factor within the QoE framework (Le Callet, et al., 2013). Hence, this thesis will focus on investigating affective factors (Human IFs) within QoE through the use of an operational definition which can be clearly linked to a conceptual model including network QoS IFs.

1.5 Involvement as salient affective Human Influencing Factor

Currently, there appear to be two ways of investigating affective and cognitive factors. One is to take a wide range of measures that have been shown to effectively assess affective and cognitive factors in other fields, use all of those measures, and analyze which factors can explain variance between quality judgements. The other is a more narrow approach: taking one (or several) factor(s) that the researchers think make sense, based on previous research or research in adjoining fields such as UX or cognitive psychology. The authors then define the factor(s), develop measures, and link their results back to QoS measures. An example of the first strategy is the research by De Moor et al. (2014) (also see section 1.3), as they used the PAM scale, the SAM questionnaire, two constructs adapted from the User Engagement scale (O'Brien & Toms, 2010), EEG, eye-tracking, and facial expression in order to go beyond typical standardized System IF measures such as the MOS, and explore affective aspects of QoE in relation to PVQ.

An example of the second strategy is the QoE research reported by See-To et al. (2012). The authors created an operationalized framework based on User Experience (UX) and flow foundations. They define UX as follows: “the entire set of effects that is elicited by the interaction between a user and a product [or service], including the degree to which all our senses are gratified (aesthetic experience), the meanings we attach to the product (experience of meaning), and the feelings and emotions that are elicited (emotional experience)”. This is then operationalized by defining that engagement influences enjoyment, and that engagement and enjoyment together influence satisfaction with the multimedia content. Engagement is defined as 'occurring when a person is psychologically immersed in a video, referring to the perceptual focus on mediated information and the avoidance of stimuli that do not belong to the multimedia offering (e.g. unrelated cognitions or external stimuli)' (See-To, et al., 2012). Enjoyment is defined as the extent to which the activity of watching the video is perceived to be enjoyable in its own right, independent of device, encoding or QoS performance. Engagement, enjoyment and satisfaction were measured by a self-reported questionnaire, with four items per construct (adapted from other questionnaires; see See-To, et al. (2012) for details).

Both strategies have their advantages and disadvantages. The first strategy is very explorative, and can potentially capture a wide range of important IFs. However, it also assumes that methods validated for other applications and fields will still validly and reliably measure the constructs they were designed for when used in a QoE context. The second strategy has the advantage of building an understanding of exactly what the composition of factors looks like, and how they each contribute to the end-construct. This strategy could miss influencing factors, though, if the operational definitions are not properly explored and validated. Based on recommendations from Trochim (1989) and Hassenzahl (2008), and following the example of e.g. Strohmeier, Jumisko-Pyykko, & Kunze (2010), O'Brien & Toms (2010) and See-To et al. (2012), this thesis will utilize the second strategy, and approach understanding the Human IFs of QoE through an initial exploratory approach to define one factor which, based on current research in the field, is deemed to be an important Human IF. The initial stage will then be followed by the creation of a method which can be used to relate Human IFs to System IFs.

The question then becomes which factor to investigate! Based upon the research reviewed in this section, as well as section 1.3, engagement would be a good candidate. However, within the field of UX, engagement has been defined as a quality of user experience with technology that is characterized by challenge, aesthetic and sensory appeal, feedback, novelty, interactivity, perceived control and time, awareness, motivation, interest and affect (O'Brien & Toms, 2008), which they later refined into a scale to measure user engagement in online shopping environments (O'Brien & Toms, 2010).


O'Brien & Toms (2010)'s scale comprises six factors: perceived usability, aesthetics, novelty, felt involvement, focused attention and endurability. As mentioned in section 1.3, De Moor et al. (2014) used two adapted scales from O'Brien & Toms (2010), namely felt involvement and focused attention, which the authors combined into an engagement variable. See-To, et al. (2012) also defined engagement, and used an adapted scale from Qin, Patrick Rau, & Salvendy (2009). However, both of these scales were developed for other contexts, and especially within UX, engagement is seen as having a large element of interaction. Usually, when testing video quality, users have no or very limited interaction with the content. At most, they have the power to decide not to watch the content, to turn the content off, or to switch to other content. Based on the cited evidence, engagement looks like a very important Human IF, but those findings might be confounded by a third factor, namely involvement. Involvement is present in most definitions and measures of engagement, but has not been studied methodically as a stand-alone factor within the field of QoE with multimedia content. Hence, it is currently unclear how much involvement really contributes to QoE, independently of engagement. For the purpose of this thesis, involvement is differentiated from engagement by stating that involvement is possible without interaction, but engagement is not. To paraphrase IJsselsteijn (2004), it is possible to separate the effects of emotional involvement with multimedia content from interacting with said content.

In this context, a finding by Gauntlett & Hill (1999) is of particular relevance. They detail that their participants frequently wrote about watching content they found strangely compelling. Participants described it as compelling viewing: even when knowing what would come next, they could not stop watching. The current viewpoint is that this compelling feeling occurs frequently when people watch multimedia content, and that it is a salient factor for QoE. For the purpose of this thesis, this compelling feeling will be called involvement. The challenge then lies in how to analyze this compelling feeling, such that it becomes measurable and can be related to other, already defined, salient factors, such as Perceived Video Quality.

The aim of the thesis is therefore to create a conceptual model for involvement with audio/video content, with measures that can be used to further determine the relationship between System IFs at the network level and Human IFs at the affect level. This thesis further focuses on investigating Involvement with Multimedia Content as shown on large (>30") screens, and delivered via a wireless network connection. This will allow us to explore the relationship between Perceived Video Quality (PVQ) as a representative measure of manipulated QoS variables and Involvement as a Human IF for QoE. Based on the research summarized in the previous sections, the model for involvement is expected to explore several key components: first, there are behaviours such as laughing, crying, and not paying attention to stimuli other than the multimedia content, i.e. visible signs of valence or arousal. Next, there are the aspects of suspension of disbelief, attention, interest, empathy, joy, liking, motivation, and perceived time. Furthermore, it is also important to take negative emotions such as boredom into account. Much like O'Brien & Toms (2008) propose in their model of engagement, it is expected that involvement falls on a continuum from no/low involvement to high involvement, and users may go through different stages of involvement several times while watching multimedia content.

Before moving on to the experimental parts of the thesis, Chapter 2 reviews in detail the theoretical backgrounds of the technological variables and PVQ measurement tools. With regard to the technological variables, the adaptation methods used in this thesis to introduce temporal or spatial artefacts in video content are an important aspect. There is some research (e.g. McCarthy, Sasse, & Miras, 2004) showing that, depending on the type of content, optimizing the video stream by introducing either temporal or spatial artefacts can influence PVQ. Hence, depending on the involvement with multimedia content, it may be important for network and content providers to know which adaptation strategy is optimal. Chapter 3 lays the theoretical foundation for the conceptual model of involvement with multimedia content. Next, Chapter 4 covers the development of a scale for the involvement construct: a pool of items is created, tested across a large variety of multimedia content, and a new version is presented. In Chapter 5 the scale developed in Chapter 4 is evaluated for reliability and dimensionality; it is also used to investigate the connection with PVQ, to research whether temporal or spatial artefacts introduce a different relationship, depending on content. Chapter 6 completes this thesis with a discussion of the relation between involvement and perceived video quality within the Qualinet QoE framework and definition.


References

Antons, J.-N., Arndt, S., De Moor, K., & Zander, S. (2015). Impact of perceived quality and other influencing factors on emotional video experience. In 2015 Seventh International Workshop on Quality of Multimedia Experience (QoMEX). IEEE.
Barakovic, S., & Skorin-Kapov, L. (2013). Survey and challenges of QoE management issues in wireless networks. Journal of Computer Networks and Communications, 2013, 28.
Bradley, M. M., & Lang, P. J. (1994). Measuring emotion: The self-assessment manikin and the semantic differential. Journal of Behavior Therapy and Experimental Psychiatry, 25(1), 49-59.
De Moor, K., Mazza, F., Hupont, I., Ríos Quintero, M., Mäki, T., & Varela, M. (2014). Chamber QoE: a multi-instrumental approach to explore affective aspects in relation to Quality of Experience. In Proc. SPIE 9014, Human Vision and Electronic Imaging XIX. SPIE.
de Ridder, H., & Endrikhovski, S. (2002). Image quality is FUN: Reflections on fidelity, usefulness and naturalness. Paper presented at the Society for Information Display International Symposium.
Dobrian, F., Sekar, V., Awan, A., Stoica, I., Joseph, D., Ganjam, A., et al. (2011). Understanding the impact of video quality on user engagement. In Proceedings of the ACM SIGCOMM 2011 Conference.
Fedorovskaya, E. A., & De Ridder, H. (2013). Subjective matters: from image quality to image psychology.
Gauntlett, D., & Hill, A. (1999). TV Living: Television, Culture and Everyday Life. Oxon, UK: Routledge.
Geerts, D., De Moor, K., Ketyko, I., Jacobs, A., Van den Bergh, J., Joseph, W., et al. (2010). Linking an integrated framework with appropriate methods for measuring QoE. In Quality of Multimedia Experience (QoMEX), 2010 Second International Workshop on.
Hassenzahl, M. (2008). User experience (UX): towards an experiential perspective on product quality. In Proceedings of the 20th International Conference of the Association Francophone d'Interaction Homme-Machine.
Hektner, J. M., Schmidt, J. A., & Csikszentmihalyi, M. (2006). Experience Sampling Method: Measuring the Quality of Everyday Life. Thousand Oaks: SAGE.
IJsselsteijn, W. (2004). Presence in Depth. PhD thesis, Eindhoven University of Technology, Eindhoven.
ITU-R. (2012). Recommendation BT.500-13, Methodology for the subjective assessment of the quality of television pictures. International Telecommunication Union.
ITU-T. (2008). Recommendation E.800, Definitions of terms related to quality of service. International Telecommunication Union.
Karapanos, E., Zimmerman, J., Forlizzi, J., & Martens, J.-B. (2009). User experience over time: an initial framework. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.
Ketyko, I., De Moor, K., De Pessemier, T., Verdejo, A. J., Vanhecke, K., Joseph, W., et al. (2010). QoE measurement of mobile YouTube video streaming. In Proceedings of the 3rd Workshop on Mobile Video Delivery.
Kortum, P., & Sullivan, M. (2004). Content is king: The effect of content on the perception of video quality. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting.
Kortum, P., & Sullivan, M. (2010). The effect of content desirability on subjective video quality ratings. Human Factors: The Journal of the Human Factors and Ergonomics Society, 52(1), 105-118.
Laghari, K. U. R., Connelly, K., & Crespi, N. (2012). Toward total quality of experience: A QoE model in a communication ecosystem. IEEE Communications Magazine, 50(4), 58-65.
Le Callet, P., Möller, S., & Perkis, A. (2013). Qualinet white paper on definitions of Quality of Experience. Output from the fifth Qualinet meeting, Novi Sad, March 12, 2013. European Network on Quality of Experience in Multimedia Systems and Services.
Martens, H., & Martens, M. (2001). Multivariate Analysis of Quality: An Introduction. Chichester: John Wiley.
McCarthy, J. D., Sasse, A., & Miras, D. (2004). Sharp or smooth? Comparing the effects of quantization vs. frame rate for streamed video. Paper presented at CHI 2004, Vienna, Austria.
O'Brien, H. L., & Toms, E. G. (2008). What is user engagement? A conceptual framework for defining user engagement with technology. Journal of the American Society for Information Science and Technology, 59(6), 938-955.
O'Brien, H. L., & Toms, E. G. (2010). The development and evaluation of a survey to measure user engagement. Journal of the American Society for Information Science and Technology, 61(1), 50-69.
Pinson, M. H., Sullivan, M., & Catellier, A. A. (2014). A new method for immersive audiovisual subjective testing. In Eighth International Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM 2014).
Qin, H., Patrick Rau, P.-L., & Salvendy, G. (2009). Measuring player immersion in the computer game narrative. International Journal of Human-Computer Interaction, 25(2), 107-133.
Redi, J., Zhu, Y., de Ridder, H., & Heynderickx, I. (2015). How passive image viewers became active multimedia users. In C. Deng, L. Ma, W. Lin & K. N. Ngan (Eds.), Visual Signal Quality Assessment (pp. 31-72). Springer International Publishing.
See-To, E. W. K., Papagiannidis, S., & Cho, V. (2012). User experience on mobile video appreciation: How to engross users and to enhance their enjoyment in watching mobile video clips. Technological Forecasting and Social Change, 79(8), 1484-1494.
Sternberg, R. (2009). Cognitive Psychology (5th ed.). Belmont, CA: Wadsworth.
Strohmeier, D., Jumisko-Pyykko, S., & Kunze, K. (2010). Open Profiling of Quality: A mixed method approach to understanding multimodal quality perception. Advances in Multimedia, 2010, Article ID 658980.
Trochim, W. M. K. (1989). An introduction to concept mapping for planning and evaluation. Evaluation and Program Planning, 12, 1-16.
Verdejo, A. J., De Moor, K., Ketyko, I., Nielsen, K. T., Vanattenhoven, J., De Pessemier, T., et al. (2010). QoE estimation of a location-based mobile game using on-body sensors and QoS-related data. In Wireless Days (WD), 2010 IFIP.
Westerink, J. H. D. M. (1989). Influences of subject expertise in quality assessment of digitally coded images. SID International Symposium Digest of Technical Papers, 20, 124-127.


Chapter 2: Theoretical underpinnings

Before moving on to the experimental part of this thesis, an overview of the theoretical background of several of this thesis' aspects is offered. To understand the background of the development of the technological variables and the measurement tools, the characteristics of the human visual system are essential, since these characteristics facilitate video compression. While auditory characteristics are also important, the auditory system will not be discussed, as this thesis does not manipulate audio but keeps audio quality constant across the presented material. The characteristics of the human visual system will be discussed before continuing with the technical details of MPEG video coding. Next, motion in MPEG video coding is discussed. As this thesis discusses perceived video quality (PVQ) from a network point of view, the properties of wireless networks, together with two possible optimization methods to mitigate potentially occurring problems, are also examined. Next, standardized subjective PVQ measures are presented, and the chapter finishes with a section on how to construct a self-report measure (e.g. a scale) for affective constructs. The aim of this thesis is to then use this measure together with PVQ to investigate underlying relationships between Human and System IFs, resulting in a reliable model which can be used within the Qualinet Quality of Experience framework (see Section 1.1; Le Callet, Möller, & Perkis (2013); Redi, Zhu, de Ridder, & Heynderickx (2015)) to improve the overall Quality of Experience with Multimedia Content for users.

2.1 Human visual system

The brain and the physiological layout/structure of the eye are responsible for our perception and interpretation of visual stimuli. Two important properties of the Human Visual System (HVS) for video encoding are contrast sensitivity and masking, and both will be discussed below.

2.1.1 Contrast sensitivity

The eye is sensitive to luminance and chrominance intensities. Luminance concerns the light-dark variations, and chrominance can be expressed in polar coordinates: θ denoting the colour, or hue, and r the saturation, or the intensity of the colour. Taking into account the opponent-process theory of colour, these coordinates can also be mapped onto two Cartesian components: one for perceiving red-green variations, and another for blue-yellow variations. Variations in luminance can be completely characterized by four properties: spatial frequency, phase, orientation and amplitude, or modulation depth. In the contrast sensitivity diagram (Figure 1) only spatial frequency and modulation depth are shown. Spatial frequencies are expressed in cycles/degree (Pennebaker & Mitchell, 1993).
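The polar-to-Cartesian mapping of the chrominance coordinates described above can be written out explicitly; the component symbols are illustrative notation, not from the cited sources:

$$ C_{\mathrm{rg}} = r \cos\theta, \qquad C_{\mathrm{by}} = r \sin\theta, $$

where $C_{\mathrm{rg}}$ and $C_{\mathrm{by}}$ are the red-green and blue-yellow opponent components, θ the hue, and r the saturation.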

Figure 1: Sensitivity of the HVS towards luminance and chrominance intensity variations, from Pennebaker & Mitchell (1993).

Figure 1 gives a graphical overview of the sensitivity of the eye for luminance and chrominance intensity variations. For natural stimuli, high spatial frequencies correspond to fine details, while low spatial frequencies are associated with cruder features (Goldstein, 1996). Spatial sensitivity for the luminance (grayscale) components is significantly better than spatial sensitivity for the chrominance (colour) components (Pennebaker & Mitchell, 1993). Hence, for video compression, the grayscale, or luminance, components are represented with more precision than the colour components.

For the purpose of video compression, a digital image can be decomposed into a set of waveforms with particular spatial frequencies for both the luminance and the chromatic channels (described by Discrete Cosine Transform coefficients). The decomposition makes it possible to capture and separate information the eye can perceive from information it cannot. Information the eye cannot perceive is then discarded, and the image is reconstructed using only information the eye can perceive (Pennebaker & Mitchell, 1993). This can substantially reduce the amount of data needed to store a digital image.

2.1.2 Masking

Masking is a form of contrast adaptation, in which a pattern becomes obscured to the eye when it is masked by a superimposed, higher-contrast pattern (Bruce, Green, & Georgeson, 1996). This can also happen when the masking pattern is adjacent in space and/or time. It is thus possible to reduce the image quality of the frames right after a scene change without the reduction being detected, provided that the spatial resolution of the frames is restored within about half a second (Seyler & Budrikis, 1965; Yuen & Wu, 1998).

2.2 MPEG-2 Encoding

MPEG compression (Jack, 1995) was specifically designed to compress a video once (encoding) and then play it back (decoding) many times on many platforms. The MPEG encoding process requires about one hundred times the computing power necessary to decode the given video. The quality of a compressed video is determined by the resolution of the original video source and the bitrate (channel bandwidth) allowed or available after compression.

2.2.1 Image compression

In MPEG, image compression starts with dividing the video stream into frames. Each frame represents an image. A frame is further divided into macro blocks, the basic building blocks for MPEG (Mitchell, Pennebaker, Fogg, & LeGall, 1996). A macro block consists of luminance (grayscale) and chrominance (colour) components, which are composed of 8x8 arrays of picture elements. A common luminance/chrominance ratio in a macro block is 4:2, meaning that for every four luminance components there are two chrominance components (Mitchell et al., 1996). A frame, therefore, consists of a contiguous sequence of macro blocks.

The original macro blocks are represented in shades of gray in the pixel domain. The pixel data from the pixel domain are transformed by a discrete cosine transformation (DCT; Mitchell et al., 1996) to frequencies from black to white in the spatial frequency domain. Subsequently, the frequencies are de-correlated and compressed independently. DCT is a lossless transformation of the data, meaning that the image information is preserved and that the transformation does not visibly affect the video. After the discrete cosine transformation, the spatial frequencies are quantised. Quantisation means that the data is represented with less precision than the DCT coefficients, i.e. there is a many-to-one mapping. High quantisation values result in a less precise representation of the DCT coefficients, but also in higher compression of the frequency data. After quantisation, coding redundancies can be removed from the data, leading to further compression (Mitchell et al., 1996). Quantisation is a lossy transformation, which means that the original image information cannot be retrieved anymore. Quantisation can thus visibly affect the video sequence.
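A minimal sketch of this transform-quantise-reconstruct pipeline is given below, assuming a single 8x8 luminance block and a uniform quantiser step; the actual MPEG-2 quantisation matrices and the subsequent entropy coding are omitted.

    import numpy as np
    from scipy.fftpack import dct, idct

    def dct2(block):
        # 2-D type-II DCT with orthonormal scaling: lossless and invertible.
        return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

    def idct2(coeffs):
        return idct(idct(coeffs, axis=0, norm='ortho'), axis=1, norm='ortho')

    rng = np.random.default_rng(0)
    block = rng.integers(0, 256, size=(8, 8)).astype(float)   # one 8x8 pixel block

    coeffs = dct2(block)                  # lossless: idct2(coeffs) equals block
    step = 16.0                           # illustrative quantiser step
    quantised = np.round(coeffs / step)   # lossy many-to-one mapping
    reconstructed = idct2(quantised * step)

    # The reconstruction error is nonzero: the discarded precision is gone.
    print(np.abs(block - reconstructed).max())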

2.2.2 Dynamic compression

If the video stream contains motion, MPEG compression can be improved via motion compensation (Mitchell et al., 1996). Motion compensation is the process of coding only the differences between moving areas of a previous (or future) reference frame and the current frame (Mitchell et al., 1996). The process of estimating motion in a frame and assigning motion vectors is called motion estimation. An area of a frame is compared to the next frame, and as soon as a match is found between the areas of the coded and reference frame, a vector is assigned, indicating the displacement of that part of the frame. The reference and the coded frame are then subtracted to obtain the residual coding, which usually contains much less information than the original frame (Mitchell et al., 1996).

There are three kinds of frames in an MPEG-coded video stream. Intra-coded or I-frames are independent, without any reference to other frames. I-frames allow random access points within the video stream, and as such they occur about twice every second. Predictive-coded or P-frames are coded with respect to the previous I- or P-frame in the stream, and bi-directionally predictive-coded or B-frames are coded with respect to both a previous and a next I- or P-frame in the stream. Figure 2 shows a Group of Pictures (GOP), which consists of one I-frame and several P- and B-frames. Frames do not only differ in their relation to other frames, but also in absolute size (in bits): an I-frame generally has a much larger size than a B-frame.

Motion compensation for MPEG coding improves the compression of P- and B-frames by removing temporal redundancies between frames. This is possible because, within a short sequence of the same scene, most objects in a frame (e.g. landscapes, people) usually remain in the same location, while those that do move only move a short distance. Motion estimation assists this by estimating beforehand the position of objects as they move from one video frame to another (Jack, 1995).
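The block-matching idea behind motion estimation can be sketched as follows; this is a schematic exhaustive search with the sum of absolute differences (SAD) as match criterion, whereas real encoders use much faster search strategies.

    import numpy as np

    def motion_vector(ref, cur, y, x, block=16, search=8):
        # Find where the (block x block) area of `cur` at (y, x) best matches
        # in `ref`, minimising the sum of absolute differences (SAD).
        target = cur[y:y + block, x:x + block].astype(float)
        best, best_sad = (0, 0), np.inf
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                yy, xx = y + dy, x + dx
                if yy < 0 or xx < 0 or yy + block > ref.shape[0] or xx + block > ref.shape[1]:
                    continue
                candidate = ref[yy:yy + block, xx:xx + block].astype(float)
                sad = np.abs(candidate - target).sum()
                if sad < best_sad:
                    best_sad, best = sad, (dy, dx)
        # `best` is the motion vector; the matched area minus `target` is the
        # residual that actually gets coded.
        return best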

Figure 2: Display order of a Group of Pictures (GOP). Frames are not always transmitted in this order, and then have to be reordered by the MPEG decoder.

2.3 Motion and perceived video quality

As mentioned in the previous Section, motion can be an important factor in perceived video quality. To outline the importance of motion, theories of motion perception are discussed first, followed by how motion is encoded in MPEG-2. Motion perception deals with the acquisition of perceptual knowledge about the motion of objects and surfaces in images. Such perceptual knowledge can be acquired in two different ways: directly, from analysis of retinal motion, or indirectly, by inferring motion from changes over time in the retinal position of objects or their features. An analysis of retinal motion signals does not necessarily lead to perception of a moving object. Vection, induced by visual means, is the sensation that the observer, rather than the objects around him or her, is moving (e.g. looking out the window of a non-moving train while the train on the next platform starts to move often induces the feeling that one's own train has started moving). Optic flow is related to vection and occurs when the observer is moving while objects and the environment are static. Optic flow provides the observer with knowledge about objects' distance, speed, layout and shape, and about the environment in general, and also gives observers helpful information for controlling their movement (Derrington, Allen, & Delicato, 2004).

Motion perception for video sequences deals with the direction and speed of the displacements of objects and their environment, and involves apparent motion. Apparent motion is indirectly perceived motion when the stimuli are not moving at all. As Watson & Ahumada (1985) state, when "an image is presented briefly and rapidly at a sequence of closely spaced positions, it may appear indistinguishable from a smoothly moving image". This phenomenon, of course, forms the basis of film and television, with film usually denoted as 'movie'. Watson & Ahumada (1985) further state that apparent motion has several important properties. For example, there must be a stimulus for motion to be perceived, and since objects move with direction and speed, perceived motion is local and specific for spatial frequencies. Furthermore, brief exposures are sufficient to induce a motion sensation, and the visual system can adapt to motion. Additionally, humans have differential contrast sensitivity to moving patterns (Watson & Ahumada, 1985): they respond differently to images with low spatial and high temporal frequencies (rapid movement) than to images with high spatial and low temporal frequencies (slow movement). In summary, video sequences always induce apparent motion: separate images are shown at such a frame rate that objects and persons in the images appear to move.

For the stationary eye, motion is perceived when the image of a stimulus, an object, is moving across the retina. This may be ambiguous, as it may also imply that the observer is moving with respect to objects. This ambiguity is resolved when the background of the object remains static, which holds as long as the background is not at infinity or, in practical situations, is relatively close to the observer. In situations where observers move in a particular direction, the horizon (at infinity) remains static while all objects in front of the horizon move in a direction opposite to that of the observer. The relative speed of motion of the passing objects, however, indicates both the motion of the observer and the observer-object distances.

Motion can also be perceived using two types of eye movements: detection and tracking. Detection is a rapid eye movement between objects, also defined as a saccade. Saccades come into play when the eye needs to fixate on an object in the periphery, e.g. while reading, or when something moves in the periphery of the visual field (Sekuler & Blake, 2002). Motion, especially in the peripheral visual field, will attract attention (Winkler, van den Branden-Lambrecht, & Kunt, 2001). Shebilske (1986) proposed that natural event perception can be based on an integration of the light-based information (incoming to the eye) and signals from the HVS to the eye muscles that update the brain's representation of visual space. Hence, there are many connections between the retinotopic map and the human visual system, creating what is called the spatiotopic map, but this is not a one-to-one representation of the rods and cones in our eyes (Sekuler & Blake, 2002). Tracking motion uses the same principles, and is achieved through a combination of sensory information from neurons that code the direction and speed of moving objects and cognitive expectations about said moving object (Sekuler & Blake, 2002).


Currently it is generally assumed that there are two different subsystems for analyzing retinal motion (Albright & Stoner, 1995; Derrington et al., 2004). The "first-order" or "short-range" motion-filtering system is based on sensors which respond mainly to motion of luminance or colour patterns, and which are thus responsible for encoding the location, speed and direction in which an object moves. The "second-order" or "feature-tracking" system complements this by including contrast variations along dimensions such as texture, binocular disparity and luminance contrast modulation. Combining these theories shows that perceiving motion in video sequences depends on inducing apparent motion, on tracking objects, and on being able to switch attention when new objects need to be fixated. Video coding thus needs to take these theories into account; how this happens is detailed in the next Section.

2.3.1 Motion in video coding

The amount of motion, together with spatial detail, among other factors, determines the complexity of coding video sequences at a certain level of video quality (Wolf & Webster, 1997). Accordingly, it takes more bits to code a video with a high amount of motion or with more detail. In MPEG-2, motion compensation is used to deal with coding motion. Motion compensation involves motion prediction by coding only the differences between moving areas of a previous (or future) reference frame and the current frame (Mitchell et al., 1996). A block of frame i is compared to the block of frame j that should match in the same place. If the block is found in a different place, a motion vector is assigned, indicating the displacement of that part of the frame. The code from frame j is then subtracted from frame i to obtain the residual coding, which requires much less information than the original frame.

In video recordings there is usually a combination of in-scene motion and camera motion. Hence, for the purpose of this thesis, two types of motion are distinguished (see Figure 3). In-scene motion entails objects moving within a frame while the camera remains stationary. Camera motion entails objects appearing to move because the camera has been moved, while the objects themselves are stationary.

Figure 3: Motion vectors induced by in-scene motion (top) and camera motion (bottom). In-scene motion (top) means that the camera remains stable, whereas camera motion (bottom) means that the camera does not remain stable.

2.3.2 Objective video quality and motion

Lan, Nguyen, & Hwang (1999) divided motion into local motion and global motion. Local motion is defined as all object motion that affects a significant proportion of the image, which can be related to the in-scene motion defined in the previous Section. Global motion is defined as motion that affects the whole image and includes zooms, pans and rotations of the camera, which can be related to the camera motion described in the previous Section. Additional compression in the MPEG video coding scheme could be introduced by dividing motion into global and local, and by using motion analysis for scene adaptation. In other words, it should be possible to encode video sequences differently based on the amount and extent of motion over a span of frames; global motion, or camera motion, should then have a higher priority when encoding and compressing a video sequence than local, or in-scene, motion.

By using motion analysis, picture-type identification, motion-vector interpolation, and adaptive quantization, Lan et al. (1999) built a motion-analysis-based adaptive MPEG encoder. They measured the performance of the encoder by means of the Signal-to-Noise Ratio (SNR) of the luminance component of the video sequence. They found that using fewer reference frames in low-motion areas and more reference frames in high-motion areas reduced the overall bitrate while maintaining the same SNR. The reduction in the variable bitrate ranged from 2% to 14% compared to a fixed-bitrate coding scheme: low-motion parts of video sequences could be represented with less data than the high-motion parts, without changing the SNR value. The bitrate savings were thus modest but real, while the average SNR remained reasonably constant. Especially for video sequences with little camera and in-scene motion, the bitrate could be reduced while objective video quality was maintained.
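The SNR figure used by Lan et al. (1999) is computed on the luminance component; as a hedged illustration (shown here in its common peak-signal form, which may differ in detail from their exact metric), such a measure can be computed as follows.

    import numpy as np

    def psnr(original, degraded, peak=255.0):
        # Peak signal-to-noise ratio (in dB) between two luminance frames.
        diff = original.astype(float) - degraded.astype(float)
        mse = np.mean(diff ** 2)
        if mse == 0:
            return float('inf')   # frames are identical
        return 10.0 * np.log10(peak ** 2 / mse)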

2.3.3 MPEG induced effects

In the previous Sections, several effects that impact perceived video quality have been mentioned. The effects induced in the series of experiments reported here are either of a spatial nature or of a temporal nature. Spatial effects are expected to show up as blocking and blur; temporal effects are reflected in jerkiness. Blocking is introduced by discontinuities at the boundaries of neighbouring blocks in a frame that is being encoded. How severe the effect is depends on the content of the blocks, as well as on the masking properties of the HVS (see Figure 4, top) (Yuen & Wu, 1998). The effect usually remains hidden when it occurs in spatially active, bright, or very dark areas. Blur appears when there is a loss of spatial detail and a reduction in the sharpness of edges in moderately to highly spatially active regions of a frame (see Figure 4, bottom) (Yuen & Wu, 1998). Jerkiness is defined as time-discrete 'snapshots' of the original continuous scene, strung together as a disjointed sequence (Yuen & Wu, 1998). This gives the impression of stationary images, interspersed with sudden movement.

Figure 4: At the top an example of blocking, at the bottom an example of a blurred image.

2.4 Wireless Network Properties

The attraction of a wireless network is that there are no wires running throughout the house, which can become tangled, snapped or confused, so people can enjoy multimedia content anywhere in their home. A wireless local area network (WLAN) is implemented via protocols. These protocols enable communication between electronic devices, such as tablets, mobile phones, laptops, PCs, game consoles and set-top boxes for TV sets. The protocols to create WLANs are standardized by the IEEE 802.11 working group, who create standards as well as recommended-practice documents (IEEE, 2015). Generally speaking, for network communication, data is formatted into packets, which are then transported between the electronic devices connected to the network. Packets are formatted to suit the requirements for transmission, and contain only one type of data (e.g. audio, video) (Mitchell et al., 1996).

Irrespective of these advantages, there are several problems associated with wireless connections. Wireless network bandwidth is often unpredictable and fluctuates because of, for example, interference from neighbouring wireless networks or microwave ovens. For WLANs, the following problems are often seen: packet loss, lower throughput and variable throughput. High packet loss can create visual artefacts (as shown in Figure 5). Lower throughput means that the wireless connection allows sending far less data than the video stream requires: a wireless network does not always run at maximum capacity, so the actual bitrate may be lower than, for example, the specified 54 Mbps for 802.11g. Variable throughput is mostly caused by interference from consumer electronic devices (e.g. microwave ovens), which lowers the available bitrate.

Figure 5: Example of problems that can be experienced when streaming video material, in this case packet combination.

Such problems can impair the user experience of watching multimedia anywhere in the home. When the available bandwidth is lower than the bitrate of the video stream, the video quality is impaired. Depending on bitrate fluctuations and decoder implementations, users then see a black screen or blocking artefacts. It is also possible that the moving images freeze, i.e. the last frame shown is repeated; when this happens often, the film seems to flicker. The MPEG-2 standards do not specify all details of encoding and decoding, which makes it difficult to predict what will happen when there is network interference, unless one knows which decoder has been used. All of these effects, however, influence the user experience negatively. It is therefore imperative to find a way to dynamically adapt the bitrate of the video stream to fit the available bandwidth. The methods used in this thesis to dynamically adapt audio/video streams to optimize QoE are discussed in the next Section.

2.5 Streaming Video Adaptation Methods

For this thesis, the focus is on I-Frame Delay (IFD) and Signal-to-Noise Ratio scalability (SNR scalability). The I-Frame Delay approach (Kozlov, van der Stok, & Lukkien, 2005) is an implementation of a frame-dropping algorithm in the MPEG-2 domain; this video adaptation method introduces mainly temporal distortions. SNR scalability (Jarnikov, 2007) deals with variable-throughput problems by dividing the original video into several layers; this video adaptation method introduces mainly spatial distortions. Both techniques are used in Chapter 5 to introduce specific video perturbations, which are then used to assess perceived video quality.

The presence of disturbances may not always be perceived, however. In the visual modality, people may selectively focus their attention, which can literally narrow their vision to that which is relevant for them, the so-called inattentional blindness phenomenon. Things that are not deemed relevant are thus less likely to be perceived, but those are not the only things that go unnoticed. Change blindness describes how people fail to notice an otherwise conspicuous change in visual information (Simons & Levin, 1998). The implication for video quality is that errors and artefacts in video sequences might not be perceived due to change blindness phenomena. Motion also draws people's attention (Goldstein, 1996), and might draw attention away from artefacts in a video sequence, which are then no longer perceived.

2.5.1 I-Frame Delay approach

The I-Frame Delay (IFD) approach has been developed to deal with the problems, described in the previous Section, that wireless networks encounter (Burza et al., 2007). When there is network congestion and the buffer containing the remaining frames overflows, IFD drops the least important frames (B-frames), to ensure that the frames containing the most important information are received. The decoder then repeats the last frame, so users perceive 'jerky movement' in the movie; however, there are no blocking artefacts due to partial frame loss. The technical advantage is that only the sender of the video has to be modified for the bitrate adaptation. Two parts have to be added: a tagger to identify packets as they come along, and a dropper to decide which packets can be dropped and which are allowed to move on. The IFD buffer, in this case, should be at least large enough to contain two I-frames. A graphical representation is given in Figure 6, where the buffer contains waiting packets (denoted as W) and packets being sent (denoted as S). Packets being sent are the most important ones; waiting packets are the ones that can possibly be dropped; non-video packets contain metadata and audio information (Burza et al., 2007). Non-video packets always have the same priority as important video packets. The introduction of IFD in a video stream reduces the available bandwidth for video somewhat, because it introduces processor overhead costs. Without IFD, however, a wirelessly transported video stream suffers from spatial artefacts, loss of audio and 'hiccups' where the movie stops and starts randomly (Kozlov et al., 2005).

Figure 6: Network packets in the IFD buffer. The numbers indicate the priority given to the packets, with a higher number indicating a higher priority. Non-video packets are given a low number (2 or lower), which denotes that their priority is always higher than that of video packets (Burza, Kang, & van der Stok, 2007). C stands for incoming packets, W stands for packets that are waiting to be sent and S stands for packets that are being sent.
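The dropper's core decision rule can be sketched as follows; this is a simplified illustration of the frame-dropping idea only (the packet structure, tagging and timing of the real IFD implementation are omitted, and the names used here are ours, not those of Burza et al.).

    from collections import deque

    # Illustrative drop order: B-frames are dropped first, then P-frames;
    # I-frames and non-video packets (audio, metadata) are never dropped.
    DROP_ORDER = {'B': 0, 'P': 1}

    class IFDBuffer:
        def __init__(self, capacity):
            self.capacity = capacity
            self.queue = deque()      # packets waiting to be sent

        def push(self, packet):       # packet = (frame_type, payload)
            self.queue.append(packet)
            while len(self.queue) > self.capacity:
                droppable = [p for p in self.queue if p[0] in DROP_ORDER]
                if not droppable:
                    break             # only I-frames / non-video left: keep all
                # Drop the most expendable waiting packet (a B before a P).
                victim = min(droppable, key=lambda p: DROP_ORDER[p[0]])
                self.queue.remove(victim)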

2.5.2 Signal-to-Noise Ratio scalability approach

SNR scalability was also developed to deal with variable-throughput problems (Jarnikov, 2007). It deals with variable bandwidth by dividing the original video into several layers: one base layer and (if possible) several enhancement layers. When there is interference in the network and the available bitrate drops abruptly, one or more enhancement layers can be dropped. This ensures that, though the image quality of the video is not always perfect, users at least always have a moving image on their screen. The advantage of this method is that it can be used by standard non-scalable MPEG-2 decoders. The base layer provides the basic video quality, and the enhancement layers increase the video quality. The base layer does not depend on the enhancement layers, and the frames within an enhancement layer are also not dependent on each other. A base layer and its enhancement layers can be stored separately and then be transmitted in a single combined stream or in separate streams. Short-term bandwidth variations can be dealt with on a frame-by-frame basis, because frames within enhancement layers are independent; this independence is ensured when the layer consists of B-frames with an I-frame at the beginning of the sequence and a P-frame at the end. Figure 7 shows an adaptation for long-term bandwidth variations, illustrating that an adjustment to the layer configuration always takes some time to become effective. Overhead costs, such as extra decoders and processor cycles, are also present, and depend on the size of the base layer, the overall bitrate of the video and the number of enhancement layers introduced.


Figure 7: Usage possibilities for the SNR scalability adaptation technique. Time is measured in seconds, bandwidth in Mb. The thin black line represents the current (variable) bitrate of the wireless network. The continuous thick black line represents the bitrate of the base layer; the dotted thick black line represents the additional bitrate of the enhancement layer.

Within this thesis, the focus will be on a special case of SNR scalability, called Transcoding (TC; Brouwers, 2006). TC uses a feedback loop to rescale the bitrate of a video sequence on the fly if there is not enough bandwidth available. In that case, the DCT coefficients of the I-frames and the residual coding of the P- and B-frames are re-quantized, which lowers the bitrate. Re-quantization is done by dividing the DCT coefficients by a positive number; the frame rate stays the same, but the overall bitrate of each frame is decreased. TC can introduce distortions which affect the spatial detail in the video. As a consequence, especially blockiness and blurriness can be perceived by observers, but other spatial artefacts, such as ringing, can also appear. Temporal artefacts can appear as well, because the re-quantization also influences motion.

To conclude, scalable video uses more bits than non-scalable video to reach the same level of video quality, but scalable video is able to overcome network problems that non-scalable video cannot. Which of the adaptation methods is best used under changeable circumstances, however, cannot be determined without subjective assessment. Several subjective assessment methods for perceived video quality have already been standardized, and these are discussed in the next Section.
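The re-quantization step itself is simple to illustrate; the sketch below (with an illustrative factor, and omitting the feedback loop and rate control of the actual transcoder) shows how coarsening the coefficients leaves fewer distinct values, and hence fewer bits to code.

    import numpy as np

    def requantize(dct_coeffs, factor):
        # Coarsen DCT coefficients by a positive factor: a many-to-one
        # mapping that leaves fewer distinct values per block.
        assert factor > 0
        return np.round(dct_coeffs / factor) * factor

    rng = np.random.default_rng(0)
    coeffs = rng.normal(scale=50.0, size=(8, 8))
    coarse = requantize(coeffs, factor=20.0)
    print(len(np.unique(coeffs.round(2))), len(np.unique(coarse)))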


2.6 Subjective Video Quality Assessment

The ITU-R BT.500-13 (2012) recommendation details the standardized methodology for the subjective assessment of the quality of television pictures. It specifies methods for laboratory studies, several of which are examined in this Section. First of all, there is a difference between assessments that establish a system's performance when showing a video stream under the best conditions, and assessments that look at a system's ability to preserve quality under less than optimal conditions related to transmission, coding or another form of image processing. Since the following experiments are only concerned with the second option, only those characteristics are discussed. The most recent table for selecting a test method can be found in the ITU-R BT.500-13 (2012).

Test methods feature two characteristics: rating and stimuli. With respect to rating, viewers are requested to rate a video sequence either at the end of the sequence, or continuously and immediately while the sequence is shown (see Table 1 for examples of rating scales). With respect to stimuli, there are double-stimulus and single-stimulus methods. For double-stimulus methods a reference is always present, whereas there is no reference for single-stimulus methods.

Testing the conditions under which a new system is useful is not easy. First, variables that are meaningful for testing these conditions (i.e. video material, impairments) have to be determined. The Double-Stimulus Impairment Scale (DSIS) is the preferred method to accomplish this (ITU-R BT.500-13, 2012). The procedure of the DSIS works as follows: observers are shown an unimpaired video reference and are then presented with the impaired version. Afterwards they have to rate the quality of the impaired video on the impairment scale, while keeping the reference video in mind. Observers are usually asked to judge video pairs in this manner for up to half an hour. An important claim is that the stability of the results from this test is greater when small impairments are used than when large impairments are used (ITU-R BT.500-13, 2012).

When it is not possible to provide the full range of possible video quality in the test conditions, it is generally preferable to use the Double-Stimulus Continuous Quality-Scale (DSCQS). Here again, observers are shown a pair of videos, where one is an unimpaired reference and the other is the processed version, or is shown by the system under evaluation. This time, however, assessors are not told which video is the reference, and they have to indicate their opinion of both videos on continuous quality scales.

Table 1. Quality and impairment scales as detailed in the ITU-R BT.500-13 (ITU-R, 2012).

Five-grade scales
Quality         Impairment
5  Excellent    5  Imperceptible
4  Good         4  Perceptible, but not annoying
3  Fair         3  Slightly annoying
2  Poor         2  Annoying
1  Bad          1  Very annoying


The scales are printed in pairs because of the double video presentation. Results from this test have to be interpreted as difference scores, not as absolute scores. Consequently, it is invalid to associate the collected scores with a single quality description. However, as argued in Chapter 1, and as assumed in the Qualinet Quality of Experience (QoE) Framework (Le Callet et al., 2013), perceived video quality is only one part of QoE; a range of other factors, such as Human and Social Influence Factors (IFs), can influence QoE as well. Self-report measures can be used to represent both Human and Social IFs, and the next Section gives a short summary of how to construct such a self-report measure reliably and validly.
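As an illustration of this difference-score analysis (the numbers and variable names are invented for the example, and the 0-100 continuous scale is an assumption in line with common DSCQS practice):

    import numpy as np

    # One entry per observer: rating of the reference video and of the
    # processed version, both marked on a continuous quality scale (0-100).
    ref_scores = np.array([82.0, 75.0, 90.0, 68.0, 77.0])
    test_scores = np.array([60.0, 58.0, 71.0, 55.0, 63.0])

    # DSCQS results are interpreted as difference scores, never as absolute
    # quality judgements of either video on its own.
    differences = ref_scores - test_scores
    print(differences.mean())   # mean quality difference for this condition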

2.7 Constructing a Scale for affective constructs

As mentioned previously, PVQ only investigates what people think of the shown video quality. It does not investigate the Human IFs, affective or cognitive, that potentially influence QoE with multimedia content. To be able to investigate underlying relationships between Human and System IFs, it is thus necessary to determine what the important Human IFs are, and whether measures for them are already available. If such measures are not available, it becomes necessary to create them. When it comes to measuring affect, there is a wide range of possibilities: self-report methods such as scales, ethnographic methods such as observation, and physiological measures such as electroencephalographic (EEG) recordings or heart rate. Behavioural and physiological measures allow for direct measurement of the viewer experience. However, while behavioural and physiological measurements can be very informative, the data is often very noisy and difficult to interpret; without an appropriate self-report measure for the construct under research, an accurate interpretation of behavioural or physiological measures would be quite a challenge. Therefore, the decision was made to develop a scale to measure the construct of involvement with audio/video content.

Once one decides to create a scale, it is important to set a number of defining characteristics of the questionnaire under development. This Section provides an overview of these characteristics and of the decisions scale-builders have to go through. A summated rating scale is a questionnaire consisting of items with scores that can be summed. It is assumed that, if the rating scale is reliable and valid, the sum of the items reflects, for example, people's opinions, attitudes or emotions. A summated rating scale consists of responses based upon agreement, evaluation or frequency (Spector, 1992). Options for items with agreement responses are often bipolar and symmetrical around a neutral point (e.g. neutral, slightly, moderately, very much). For items with evaluation responses, respondents are required to rate statements along good-bad dimensions, often without a neutral option (e.g. terrible, inferior, passable, good, excellent). Frequency responses require an indication of how often or how many times something has happened or will happen (e.g. rarely, seldom, occasionally, most of the time).

Spector (1992) defines four characteristics of a summated rating scale. First, there must be multiple items, otherwise there is nothing to sum or combine. Second, each item should measure a property with an underlying, quantitative measurement continuum; this makes it possible to quantify the property that is measured. Third, there is no right answer: the scale only asks people about their opinions in terms of agreement, not about knowledge or ability. Finally, each item is a statement, and participants are asked to rate each statement.

Agreement responses can be collected through Likert-scale items or semantic differential items. Likert-scale items are items where the questions are in the form of a statement and respondents are required to state their level of agreement (Hosker, 2002). An example statement could be "I like the weather today", and the Likert scale gives people the option to agree or disagree with this statement. Semantic differentials use a bipolar scale, where the ends of the scale are contrasting adjectives, such as good-bad (Hosker, 2002). The underlying assumption for semantic differentials is that the bipolar scale presents a continuum, and people are asked to position themselves on that continuum. Likert-scale items can be quantified in several ways, which also depends on whether the construct of interest is unipolar (i.e. the construct varies from high to low) or bipolar (i.e. the construct varies from negative to positive). For unipolar scales, response choices are numbered consecutively from low to high (e.g. 1 through 5) (Spector, 1992).

When these characteristics have been determined, the items for the item pool can be written. Loevinger (1957) states that, content-wise, items should be chosen in such a way that they sample all the possible content which might comprise the construct under research, according to all known alternative theories of said construct. Clark & Watson (1995) advise writing the initial pool of items such that the item pool is broader and more comprehensive than the theoretical construct under research. Furthermore, one should include items covering content that will ultimately be shown to be only tangentially related (or even unrelated) to the core construct (Clark & Watson, 1995). They argue that psychometric analysis will identify weak and unrelated items, but that logically the same analysis cannot find items that should have been included; to guard against the latter, overinclusion of items is a good strategy. Loevinger (1957) also recommends trying to keep the number of items belonging to each content area proportional to the importance of that content area for the construct, although this can be quite difficult, given that in most cases these proportions will be unknown.

2.7.1 Constructing the item pool

When writing items for the item pool, Spector (1992) specifies that a good item is clear, concise, unambiguous and as concrete as possible. To guide the writing of items with those properties, Spector (1992) provides five rules. First, it is important to make sure that only one idea is expressed per item. If the statement from the previous example had been "The weather is sunny and warm today", it would not be evident whether people agree with the sunny and warm part, only the sunny part or only the warm part. Second, using both positively and negatively worded items can reduce bias caused by response tendencies (always agreeing or disagreeing, regardless of the content of the items). Third, avoiding colloquialisms, expressions and jargon will result in a scale that is not limited to a particular population or time-frame. Fourth, consider the reading level of the population one wants to reach, since it is imperative that all respondents are able to read and understand the items. And fifth, try to avoid the use of negatives to reverse the wording of an item, since negatives can easily be misinterpreted by the reader.

A pilot is recommended to see whether the initial items in the item pool satisfy the four properties of clarity, conciseness, unambiguity and concreteness. It is important to instruct pilot participants to be as critical as possible towards the items, and to ask them to indicate which items are, according to them, ambiguous or confusing. They should also be instructed to point out items which cannot be measured with the given response possibilities. As an example, consider the statement "It rains" with a 5-point Likert scale as response possibility: it either rains or it does not, so this would be more appropriate as a yes/no question.
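To make the summing and the handling of negatively worded items concrete, a small sketch follows (the 5-point format and the item layout are assumptions for illustration, not a prescription from the cited authors).

    import numpy as np

    # One respondent's answers to five 5-point Likert items (1 = disagree,
    # 5 = agree); items 3 and 5 are negatively worded.
    responses = np.array([5, 4, 2, 5, 1])
    negatively_worded = np.array([False, False, True, False, True])

    # Reverse-score negatively worded items (6 = scale maximum + 1) so that a
    # high score always expresses the same direction of the construct.
    scored = np.where(negatively_worded, 6 - responses, responses)
    print(scored, scored.sum())   # the summated rating for this respondent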

2.7.2 Characteristics of a questionnaire

Before administering a questionnaire, the researcher must determine the number of participants required to achieve reliable results. To estimate the required sample size, there are a number of different guidelines. Clark & Watson (1995) advocate a minimum of 300 respondents, while Costello & Osborne (2005) advise the use of a 20:1 subject-to-item ratio. MacCallum, Widaman, Zhang & Hong (1999) argue that recommendations regarding sample size tend to be based on a misconception, namely that every study needs the same minimum number of subjects (or minimum subject-to-item ratio) to achieve stable and precise results. Instead, they proposed a theoretical framework to guide the choice of sample size. Sample size depends on several conditions of the scale under development, such as the level of communality of the items and the level of overdetermination of the factors. The communality of an item is the portion of variance of that item that is accounted for by the common factors in the factor analysis. Overdetermination is defined as the degree to which each factor is represented by a sufficient number of items (e.g. at least four items that clearly load on a factor), and is partially dependent on the ratio of the number of items to the number of factors. If the communalities of items are consistently high, the impact of a small sample size is greatly reduced (MacCallum et al., 1999): if the variance of the items is well explained by the factor analysis, a large sample size is less necessary. MacCallum et al. (1999) advocate a minimum communality value of .6 per item for smaller sample sizes (i.e. fewer than 100 subjects) to achieve stable and precise results. When communalities become smaller, a larger sample size and higher overdetermination of the factors become more important.

When the participants' responses to the items are available, there are still three steps to take in the development of the questionnaire: analysis of the validity, assessment of the reliability, and the compilation of norms on the instrument. Compilation of norms is a very time-consuming process, involving widely ranging population groups, and falls beyond the scope of this thesis. The validity and reliability steps are discussed below.

Validity is the extent to which the constructed scale actually measures the construct it was developed for, and examining the validity of the scale is consequently a critical step. According to Clark & Watson (1995), the goal of scale development is to maximize validity rather than reliability. Hence, checking whether the developed questionnaire really measures the defined construct is an important step in the scale development process. Spector (1992) states that validation is the most difficult part of scale development: aside from testing hypotheses about the construct, the scale developer also tests hypotheses about the scale. A typical solution for scale developers is to develop hypotheses about the causes, effects and correlates of the construct, which are then tested through the scale. If empirical support for the construct is found, validity of the scale is implied (Spector, 1992). Factor analysis can be used to verify the validity of a developed scale, by empirically addressing whether the items in the questionnaire measure the intended attributes. The basic idea of factor analysis is to reduce the number of items in a scale through analysis of the patterns of covariance or correlation among the items; analysis of these patterns indicates which items belong to which factors.
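To make the notions of communality and factor loadings concrete: assuming a respondents-by-items matrix of standardised scores, communalities can be read off the estimated loadings, for example with scikit-learn (a sketch under assumed, placeholder data; the thesis does not prescribe a particular package or estimation method).

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(1)
    data = rng.normal(size=(200, 10))   # placeholder: 200 respondents x 10 items

    fa = FactorAnalysis(n_components=2).fit(data)
    loadings = fa.components_.T         # items x factors

    # Communality of an item = the portion of its variance explained by the
    # common factors, i.e. the sum of its squared loadings.
    communalities = (loadings ** 2).sum(axis=1)
    print(communalities.round(2))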

Reliability means that the results of a test are reproducible. Various forms of reliability can be considered, of which Spector (1992) recommends two for creating a good scale. First, test-retest reliability addresses consistency of measurements across time, i.e. it tests the assumption that, if the construct does not change, people should get similar scores when testing is repeated. The second recommended test is internal-consistency reliability, which addresses whether multiple items in the same scale intercorrelate with one another; the assumption is that the multiple items are constructed to measure the same construct. Since the scale constructed here makes exactly this assumption, internal-consistency reliability will be used to assess reliability. Good measures for addressing the internal consistency of a scale are the coefficient α and the average inter-item correlation (Clark & Watson, 1995). The coefficient α measures the interrelatedness of the items in a scale (i.e. whether items have high communalities and thus little unique variance) (Cortina, 1993). The coefficient α depends on the number of items, so it is equally important to look at the standard error of the correlations in the item intercorrelation matrix (Cortina, 1993). This standard error reflects the precision of the coefficient α and gives further information about the dimensionality of the scale: if it is greater than zero, there is usually a departure from unidimensionality in the scale (Cortina, 1993). Considering the construct of involvement, it is plausible to expect a departure from unidimensionality. While keeping these limitations and assumptions in mind, the coefficient α (in the form of Cronbach's α) was still used to assess the reliability of the scale.


The average inter-item correlation also gives an indication of the unidimensionality of a scale (Clark & Watson, 1995). Clark & Watson (1995) recommend a range of .15 to .50 for the average inter-item correlation, depending on how broadly or narrowly the measured construct is defined. While this recommended range might seem low, Clark & Watson (1995) state that increasing the internal consistency of a scale beyond a certain point might detract from its validity. Their reasoning is that highly intercorrelated items are probably redundant and yield little extra information; such items might even make the scale narrower than the construct, which is not a desirable outcome.
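Both statistics are straightforward to compute from a scored items matrix; the sketch below uses the standard formula for coefficient α (the data layout and function names are ours).

    import numpy as np

    def cronbach_alpha(items):
        # items: n_respondents x n_items matrix of item scores.
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_variances = items.var(axis=0, ddof=1).sum()
        total_variance = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1.0)) * (1.0 - item_variances / total_variance)

    def average_interitem_correlation(items):
        r = np.corrcoef(np.asarray(items, dtype=float), rowvar=False)
        off_diagonal = r[np.triu_indices_from(r, k=1)]
        return off_diagonal.mean()    # recommended range: roughly .15 to .50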

2.8 Next steps

Building on the theoretical discussions outlined above, and working towards a valid and reliable model proposal for QoE, Chapter 3 lays the theoretical foundation for the conceptual model of involvement with audio/video content. Next, Chapter 4 covers the development of a scale for the involvement construct. Chapter 5 establishes the relation between the construct of involvement and perceived audio/video quality. Chapter 6 completes this thesis with a discussion of the relation between involvement and perceived video quality within a larger QoE framework.


References

Albright, T. D., & Stoner, G. R. (1995). Visual motion perception. Proceedings of the National Academy of Sciences, 92(7), 2433-2440.
Brouwers, C. J. J. (2006). A real-time SNR scalable transcoder for MPEG-2 video streams. Master's thesis, Eindhoven University of Technology.
Bruce, V., Green, P. R., & Georgeson, M. A. (1996). Visual perception: Physiology, psychology and ecology (3rd ed.). UK: Psychology Press.
Burza, M., Kang, J., & van der Stok, P. (2007). Adaptive streaming of MPEG-based audio/video content over wireless networks. Journal of Multimedia, 2(2), 17-27.
Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7(3), 309-319.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78(1), 98-104.
Costello, A. B., & Osborne, J. W. (2005). Best practices in exploratory factor analysis: Four recommendations for getting the most from your analysis. Practical Assessment, Research & Evaluation, 10(7), 1-9.
Derrington, A. M., Allen, H. A., & Delicato, L. S. (2004). Visual mechanisms of motion analysis and motion perception. Annual Review of Psychology, 55, 181-205.
Goldstein, E. B. (1996). Sensation and perception. USA: Wadsworth.
Hosker, I. (2002). Social statistics: Data analysis in social science explained. Somerset, UK: Studymates.
IEEE. (2015). IEEE 802.11TM Wireless Local Area Networks. The Working Group for WLAN Standards. Retrieved 13.08.2015, from http://www.ieee802.org/11/
ITU-R. (2012). BT.500-13, Methodology for the subjective assessment of the quality of television pictures. International Telecommunication Union.
Jack, K. (1995). Video demystified: A handbook for the digital engineer (3rd ed.). Eagle Rock, VA: LLH Technology Publishing.
Jarnikov, D. (2007). QoS framework for video streaming in home networks. Eindhoven University of Technology, Eindhoven.
Kozlov, S., van der Stok, P., & Lukkien, J. (2005). Adaptive scheduling of MPEG video frames during real-time wireless video streaming. Paper presented at the Sixth IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM'05).
Lan, A. Y., Nguyen, A. G., & Hwang, J.-N. (1999). Scene-context-dependent reference-frame placement for MPEG video coding. IEEE Transactions on Circuits and Systems for Video Technology, 9(3), 478-489.
Le Callet, P., Möller, S., & Perkis, A. (2013). Qualinet white paper on definitions of Quality of Experience. Output from the fifth Qualinet meeting, Novi Sad, March 12, 2013. European Network on Quality of Experience in Multimedia Systems and Services.
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635-694.


MacCallum, R. C., Widaman, K. F., Zhang, S. B., & Hong, S. H. (1999). Sample size in factor analysis. Psychological Methods, 4(1), 84-99.
Mitchell, J. L., Pennebaker, W. B., Fogg, C. E., & LeGall, D. J. (1996). MPEG video compression standard. New York, NY: Chapman & Hall.
Pennebaker, W. B., & Mitchell, J. L. (1993). JPEG still image data compression standard. USA: Van Nostrand Reinhold.
Redi, J., Zhu, Y., de Ridder, H., & Heynderickx, I. (2015). How passive image viewers became active multimedia users. In C. Deng, L. Ma, W. Lin & K. N. Ngan (Eds.), Visual signal quality assessment (pp. 31-72). Springer International Publishing.
Sekuler, R., & Blake, R. (2002). Perception. USA: McGraw-Hill.
Seyler, A., & Budrikis, Z. (1965). Detail perception after scene changes in television image presentations. IEEE Transactions on Information Theory, 11(1), 31-43.
Simons, D. J., & Levin, D. T. (1998). Failure to detect changes to people in a real-world interaction. Psychonomic Bulletin and Review, 5(4), 644-649.
Spector, P. E. (1992). Summated rating scale construction: An introduction (Vol. 082). USA: Sage Publications.
Watson, A. B., & Ahumada, A. J., Jr. (1985). Model of human visual-motion sensing. Journal of the Optical Society of America A, 2(2), 322-342.
Winkler, S., van den Branden-Lambrecht, C. J., & Kunt, M. (2001). Vision and video: Models and applications. In C. J. van den Branden-Lambrecht (Ed.), Vision models and applications to image and video processing. Kluwer Academic Publishers.
Wolf, S., & Webster, A. A. (1997). Subjective and objective measures of scene criticality. International Telecommunication Union.
Yuen, M., & Wu, H. R. (1998). A survey of hybrid MC/DPCM/DCT video coding distortions. Signal Processing, 70(3), 247-278.


Chapter 3: Developing an operational definition for involvement with audio/video material

This chapter is based on the following peer-reviewed publications:

Van den Ende, N., Sekulovski, D., Hoonhout, J., & Meesters, L. J. M. (2008). Attributes underlying involvement with video material. Proceedings of the 1st international conference on Designing interactive user experiences for TV and video, 143-146.

Van den Ende, N., Hoonhout, J., & Meesters, L. M. J. (2008). Involvement in video material: Concept mapping. Proceedings of the 22nd British HCI Group Annual Conference on People and Computers: Culture, Creativity, Interaction - Volume 2, 155-156.

Elizabeth is watching her favourite television program. Suddenly the video becomes very blurry and is out of sync with the audio. Elizabeth is annoyed, but as she really wants to know what is going to happen next, she continues to watch the program and tries to follow what happens as well as possible. Another scenario could be the following: Elizabeth is watching a documentary, while mentally going over things she still has to do tomorrow. Suddenly, the video starts to flicker and she sees quite some 'blocking'. Since she is not that interested in the subject of the documentary, she channel-surfs until she finds a channel that has no problems showing her multimedia content properly.

As Powers' (2008) Perceptual Control Theory (PCT) states, humans go through a continual cycle of perception, and of action based upon perception. The basic tenet of PCT is that behaviour is the control of perception. In experiencing something, e.g. viewing a video, persons compare a 'standard' (what they want) with their perception (what they are experiencing at that moment). The larger the discrepancy between the two, the more effort (behavioural action) persons will spend in trying to minimize the discrepancy. The above examples illustrate several possible cycles of PCT applied to watching audio/video material. In both examples, involvement is an important construct that influences the action taken by the viewer.

This chapter focuses on how involvement can be a useful construct within the Qualinet Quality of Experience (QoE) framework (Le Callet, Möller, & Perkis, 2013). As mentioned in Chapter 1, the Qualinet QoE model does not yet define what the salient Influence Factors (IFs) are, which limits the support it offers multimedia content providers in successfully implementing optimization strategies for displaying content via wireless networks and enabling a perfect QoE for the user. As motivated in Chapter 1, involvement is a good candidate for an important Human IF, as these IFs refer to any variant or invariant property or characteristic of a human user.

There are several arguments for focusing on involvement. Firstly, there is a lot of interest in investigating Human IFs, but there is a lack of methodological research into measures specifically validated for use within QoE research. Secondly, while engagement looks like a good candidate for further research, it is important to keep in mind that, within UX research, engagement is defined as having an element of constant interaction (O'Brien & Toms, 2010; O'Brien & Toms, 2008), whereas interaction is not a necessary prerequisite for involvement. Furthermore, we posit that, to paraphrase IJsselsteijn (2004), it is possible to separate the effects of emotional involvement with multimedia content from interaction with said content. Finally, Gauntlett & Hill (1999) reported that their participants quite often wrote about how they would feel compelled to keep watching some content. Hence, it is believed that involvement is one of the key cognitive mediating factors between content and QoE, and that it is possible to show a relationship between involvement and perceived video quality (PVQ). Such a concrete and detailed model of involvement should support the Qualinet QoE framework in furthering the understanding of how and which Human IFs and System IFs combine to create a pleasurable and memorable QoE for the user.


However, the influence of involvement as a mediating factor between content and viewing experience can only be studied when there is a definition of involvement and a reliable and valid tool to measure it. Currently, no unanimous definition exists; so, to arrive at a first operational definition, the term 'involvement' and the way it can be applied to experiencing multimedia content were critically deconstructed. The first step towards this goal was to conduct a multidisciplinary literature review. Next, attributes that potentially characterize involvement were further investigated through the approach of concept mapping (Trochim, 1989). This first entailed conducting semi-structured interviews to generate statements, followed by card sorting to create an understanding of the relations between those statements. The chapter ends with the presentation of a model of the factors for, and an operational definition of, involvement with audio/video material, together with suggestions for further work to refine the model. Both the model and the definition are expected to contribute to the development of an instrument for measuring involvement.

3.1 Involvement in the literature

Involvement has been investigated within various theories, such as flow (Csikszentmihalyi, 1990), User Experience (UX) (Hassenzahl, 2007; O'Brien & Toms, 2008) and media psychology (Klimmt & Vorderer, 2003). Involvement has also been explored in various application areas, such as predicting buying behaviour (Zaichkowsky, 1986), presence in 3D multimedia content (Schubert, 2003; Witmer & Singer, 1998) and immersion in video games (Jennett et al., 2008). This research has led to involvement being presented as a construct on its own (Zaichkowsky, 1985), or as part of a larger construct, such as presence (Schubert, 2003; Witmer & Singer, 1998) or engagement (O'Brien & Toms, 2008). In all of these studies, the definition and measures of involvement varied depending on the area one wanted to use involvement for, but some attributes overlap between the theoretical frameworks and the application areas. Hence, while there is no definition of involvement with audio/video content yet, it is clear that such a definition will need to accommodate a number of different aspects or attributes.

Flow and immersion are constructs which, besides presence and involvement, are frequently encountered in games research (Csikszentmihalyi, 1990; Malone, 1982; Sweetser & Wyeth, 2005). Flow is characterized as "being completely involved in an activity for its own sake. The ego falls away. Time flies. Every action, movement, and thought follows inevitably from the previous one, like playing jazz. Your whole being is involved, and you're using your skills to the utmost" (Csikszentmihalyi, 1990). Based on research by Csikszentmihalyi (1990), Sweetser & Wyeth (2005) studied immersion, which they defined as 'deep but effortless involvement in a game'. According to them, immersion is expressed by people through a loss of concern for self and everyday life, an altered sense of time, forgetting that players are participating through a medium, making players linger, and drawing players into the narrative with characters, storyline and background. Witmer & Singer (1998) define immersion as "… a psychological state characterized by perceiving oneself to be enveloped by, included in, and interacting with an environment that provides a continuous stream of stimuli and experiences". Although immersion and involvement seem linked, it is important to notice that watching video material is in general a passive activity that does not necessarily demand interaction with the content. This marks an important difference between involvement and immersion, as immersion is normally accompanied by some form of physical interaction.

Similarly, in UX research, engagement is characterized as having a felt-involvement factor (O'Brien & Toms, 2010; O'Brien & Toms, 2008). This factor represents how drawn into an activity participants felt, as well as their assessment of how much fun the experience was. This fits Hassenzahl's (2008) characterization of UX along two dimensions: interactive products are perceived and judged on both pragmatic and hedonic qualities. The pragmatic quality deals with supporting 'do-goals' such as 'send a text', whereas the hedonic quality of an interactive product refers to its ability to support 'be-goals' such as 'feeling happy'. Engagement and involvement are both more related to the hedonic quality of an interactive product than to the pragmatic quality. And, similarly to immersion, engagement is assumed to have a component of constant interaction, whereas involvement can be separated from the interaction.

Another direction in involvement research is presented by Klimmt & Vorderer (2003), who argue that experiencing involvement is realized via a perceptual focus on mediated information, while avoiding or suppressing stimuli that are not important for the mediated information. In other words, when somebody is experiencing media information with a high degree of involvement, perception, thought and emotion are maximally directed towards the media information, while distractors are ignored as much as possible. The definition of Klimmt & Vorderer (2003) thus almost matches the definition of selective attention, and does not concern other possible affective or hedonic aspects of involvement. If involvement can be seen as low or high, two levels could be identified: "… a distant, analytical way of witnessing the events presented by the medium (low involvement) and, in contrast, a fascinated, emotionally and cognitively engaged way of enjoying the media information (high involvement)" (Klimmt & Vorderer, 2003). Attention may turn out to be one of the necessary factors for involvement to occur. Klimmt & Vorderer's (2003) definition also points at differences in processing style. That is to say, involvement is seen as a variable experience, depending on individuals, media content and situations. All these factors contribute to whether a person remains aware of an experience being mediated by technology or not.

In consumer behaviour research, an accepted definition of involvement is "a person's perceived relevance of the object based on inherent needs, values and interests" (Celsi & Olson, 1988; Zaichkowsky, 1985). Zaichkowsky (1985) proposed the Personal Involvement Inventory (PII), a semantic differential scale to measure involvement with products. Semantic differential scales call for participants to indicate their reaction on a continuum given by bipolar adjectives, e.g. good – bad, cold – warm, etc.


the PII include: ‘important – unimportant’, ‘irrelevant – relevant’, ‘uninterested – interested’, and ‘undesirable – desirable’. The assumption of the PII is that the level of involvement varies on a bipolar scale, from low to high involvement. However, most of the adjectives tend to describe unipolar constructs, which raises the possibility that the level of involvement varies on a unipolar scale instead of a bipolar scale. Furthermore, this research was not concerned with media content but only with physical products (e.g. red wine, jeans).

Witmer & Singer (1998) developed the Immersive Tendencies Questionnaire (ITQ), intended to measure tendencies of people to become involved in everyday activities and their ability to focus on a specific activity. The scale is intended to measure both of these together, and tendencies and abilities are expressed by a single number. However, it is debatable whether involvement and focus should really be measured by the same scale, and whether there are not better ways to measure focus than self-reporting via a questionnaire. Witmer & Singer (1998) provided a definition for involvement, from which they developed a subscale with the same description in the ITQ: “Involvement is a psychological state experienced as a consequence of focusing one’s energy and attention on a coherent set of stimuli or meaningfully related activities and events. Involvement depends on the degree of significance or meaning that the individual attaches to the stimuli, activities or events. … Involvement can occur in practically any setting or environment and with regard to a variety of activities or events; however, the amount of involvement will vary according to how well the activities and events attract and hold the observer’s attention.” According to this definition, factors necessary for involvement are attention (controlled-voluntary and automatic) and the meaningfulness or significance of the stimuli to the observer. The meaningfulness or significance of stimuli is related to perceived relevance as introduced by Celsi & Olson (1988) and Zaichkowsky (1985). An additional component, not mentioned here, might be the readiness to undertake action based on the involvement.

Schubert (2003) developed the Igroup Presence Questionnaire (IPQ) to further explore the multi-dimensional nature of presence, and argues that one of the dimensions necessary for presence is involvement. This is in accordance with our paraphrasing of IJsselsteijn (2004), in stating that involvement can be felt regardless of the technology on offer. Schubert (2003) characterised involvement as how much a user has the feeling of being focused on and aware of a virtual environment rather than of the real world; in the IPQ, involvement is measured by means of four questions.

The presence literature and the questionnaires used to study involvement lead to the conclusion that involvement can be characterised as a consequence of the relation between a person and an object or activity, as well as the meaningfulness or significance of this relation (Witmer & Singer, 1998). It has also been proposed that there is an affective and an analytical style of involvement (Klimmt & Vorderer, 2003; Nillesen, 1988). During analytical involvement, people are aware that their experience is mediated. During affective involvement, people no longer feel that their experience is mediated. Considering that there is still no consensus about how involvement should be defined, understanding exactly how it applies to different


conditions is not straightforward. It is thus possible to characterize involvement as an experience which is expressed through emotions. These emotions are based upon a direct link between an external object and an internal experience, requiring most or all available processing capabilities.

Table 1: Attributes of involvement suggested by theoretical underpinnings from Flow theory, UX research, media psychology, as well as previous research in other areas of application, such as presence, video games and buying behaviour.

attribute                               attribute of involvement?
altered sense of time                   yes
attention                               yes
focus                                   yes
interest                                yes
loss of awareness of external world     yes
meaningfulness                          yes
positive affect / pleasure              yes
disengagement                           possible
high/low levels                         possible
importance                              possible
negative affect / emotions              possible
perceived relevance                     possible
resistance to distraction               possible

Building upon the literature reviewed, involvement attributes are defined as characteristics that influence, or are a component of, the involvement with audio/video content. Involvement attributes are thus products of users’ connection with audio/video content and are assumed to depend on what the user finds innately compelling. Table 1 summarizes all of the attributes that have been mentioned and the theoretical framework or application area they came from. Based on the synthesis of these attributes, the first proposition for a definition of involvement is as follows:

Involvement is a Human IF of QoE characterized by attributes of attention, focus, interest, loss of awareness of the external world and resistance to distraction, meaningfulness, positive and negative affect, perceived relevance, and an altered sense of time. It can be expressed on one dimension, going from low to high; and is a variable experience, changing across time depending on the individual, the media content and the situation.


However, is this definition of involvement complete? And are all these attributes important enough that they need to be incorporated in a measurement method? It is therefore necessary to further investigate the construct of involvement with multimedia content. This should make it possible to determine whether all attributes of involvement have been identified, as well as which attributes should be taken into account for the creation of a valid and reliable measure. The following section explains the rationale behind the method we used to arrive at an operational definition from which to create a measure.

3.2 Concept mapping

To create an operational definition of involvement, we used the so-called concept mapping method. Concept mapping as described by Trochim (1989) is a structured method which can be used to develop a conceptual framework, e.g. an elaborated definition of a construct, which is what is needed in this study. The process works as follows (and is illustrated by Figure 1): first, statements are generated, e.g. through brainstorms or interviews. These statements ideally represent the entire conceptual space for the current topic of interest, i.e. involvement with multimedia material. The next step is to examine how these statements relate to each other, e.g. through card sorting. Card sorting is a procedure whereby participants are given a set of pieces of paper (preferably thicker paper, hence the name cards) with one statement written or printed per card. Participants are then asked to sort the cards into different piles, in accordance with the instructions from the researcher. Card sorting can be done with several restrictions, and Trochim (1989) advises to start with unstructured card sorting. For unstructured card sorting, participants are usually instructed to sort the cards into piles in a way that makes sense to them. Structured card sorting can also be used, in which case participants are given a certain number of piles to make. A statistical analysis is then conducted to create a representation or visualization of the relations between the statements. The studies reported in this chapter used both unstructured and structured card sorting. The next step consists of the interpretation of the representation. Participants are asked to come in groups, such that the interpretation is done together with several others. Participants are offered one or more visualizations of the results and asked to interpret those together, e.g. come up with labels for clusters. For the study reported in Section 3.5, participants were also asked to re-structure the statements, i.e. go through one more round of card sorting, and to make sure that all participants in the group were satisfied with both the card sorting and the labels given to the clusters. This produced more data to represent the statements and made it possible to examine the stability of the relations between the statements.
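To make the structuring step concrete, the sketch below shows one way the card sorting results can be encoded and turned into a dissimilarity matrix of the kind used in the analyses that follow. It is an illustration only; the data layout and function name are assumptions, not the original analysis code.

```python
import numpy as np

def dissimilarity_matrix(sorts, n_statements):
    """Turn card sorts into a statement-by-statement dissimilarity matrix.

    `sorts` holds one card sort per participant; each sort is a list of
    piles, and each pile is a list of 0-based statement indices.
    """
    together = np.zeros((n_statements, n_statements))
    for sort in sorts:
        for pile in sort:
            for i in pile:
                for j in pile:
                    together[i, j] += 1
    # Fraction of participants who piled each pair together, inverted:
    # 0 = always sorted together, 1 = never sorted together.
    return 1.0 - together / len(sorts)

# Toy example: two participants sorting five statements.
sorts = [
    [[0, 1], [2, 3, 4]],    # participant 1: two piles
    [[0, 1, 2], [3], [4]],  # participant 2: three piles
]
print(np.round(dissimilarity_matrix(sorts, 5), 2))
```

Statements that most participants pile together end up with a dissimilarity close to 0, which is exactly the input a hierarchical clustering algorithm needs.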


Figure 1. Structure of the concept mapping process as applied in this paper. There were two iterations, with the first one using an individualized process to structure statements, and the second one using groups to structure statements.

It is assumed that using concept mapping will produce an operational definition of involvement which will be useful for the creation of a measurement tool. This measurement tool, in turn, is assumed to measure involvement with multimedia in a valid and reliable way. It is further hypothesized that the operational definition of involvement will consist of several facets. These facets are expected to represent the meaningfulness of the material to the observer, the focusing of attention by the observer, cognitive and affective processes related to the feeling of whether an experience is mediated or not, and action readiness. To this end, the next step was to generate statements related to multimedia material and involvement.

3.3 Study 1: generating statements

While Trochim (1989) recommends a brainstorm as a first step in the concept mapping process, the decision was made to use semi-structured interviews instead. The purpose of the interviews was twofold: first, to gain more insight into how people watch television (behavioural analysis) and second, to query people for terms which could be used for further research. Interview questions were based on the ITQ questions and Presence Questionnaire (Witmer & Singer, 1998), as well as other definitions of involvement as previously discussed (Klimmt & Vorderer, 2003; Nillesen, 1988; Perse, 1998).

3.3.1 Participants

Six participants were interviewed: two women and four men (age range 17 - 34), all native Dutch speakers. Education level ranged from attending high school to a university degree.

3.3.2 Procedure

Participants were invited to participate in this experiment via email. As soon as they accepted, a meeting was arranged to take place in their home, so the experimenter could also observe which arrangements they used for watching multimedia material. Participants were asked to sign a consent form which stated that they agreed to audio-recording of the


interview, and to have pictures taken of the set-up they used to watch multimedia material (e.g. television set, PC). After asking some warm-up questions, participants were asked general questions about involvement, such as: did they ever feel so involved in a program, movie or a book that other people had trouble getting their attention? This should indicate whether they ever feel involvement at all, and at which level (Witmer & Singer, 1998). Next, they were asked to describe their favourite program on television (including movies), and whether or not they felt involved, and if so, why they were involved. They were also asked to give words that described how they watched video material. At the end of the interview, the experimenter asked whether they had any questions left. If so, all questions were answered, and participants were thanked for their time. Interviews lasted an average of 30 minutes.

3.3.3 Interview results

All interviews were transcribed and analysed, resulting in a set of statements which were used in Study 2. Words that came up during the interviews to describe watching video material were: absorption, background, being bored, being occupied with, challenge, cosy together, curiosity, empathise with favourites, entertainment, escapism, evolving characters, excitement, focused, forgetting the world, fun, funny, interesting, laughing, no influence, not self-absorbed, nothing else to do, observing people in the video, pastime, relaxing, sadness, scary moment, solving the puzzle, wanting to know what happens next. During the interviews, most people mentioned involvement, but they also mentioned two other important concepts (translated from Dutch): sympathise with (meevoelen met) and being absorbed in/caught up (opgaan in). Participants were also asked to give their own definition of involvement, sympathising with and being absorbed in. One participant described being absorbed as “You forget the whole world. You’re only busy with what happens now, where characters are going, and what the plot would be”. According to the same participant, being absorbed differs from being involved in the influence one can exercise: only when one can influence the content of a story can one be really involved in something. Another participant remarked “I become absorbed in something because it’s relaxing and because I’m curious about what happens, but I wouldn’t talk about what happened. If you do talk about the content of a program, that is when you are involved”. This participant also mentioned that video material that induced involvement included documentaries discussing issues such as climate issues and animal abuse. These induced involvement because one could feel the content had real consequences: one knows these events really happen; they are not fictional. Based on participants’ descriptions and verbalizations regarding television and behaviour during the interviews, a list of statements was generated. This list could then be presented to other participants to structure the statements.


3.4 Study 2: structuring statements

The goal of card sorting is to examine the relations between statements. Because all interviews had taken place in Dutch, but most of the literature is in English, it was decided to create two sets of statements. The operational definition is intended to be internationally useful; the aim was therefore to execute the study in English. Keeping in mind that expressions do not necessarily translate easily from one language to another, it was decided to use both languages. Hence, to control for these potential differences in translation, participants with Dutch or English as their native language were recruited and given the Dutch or English statements respectively. If the results from both statement sets are similar, then the assumption can be made that the translation was sufficiently accurate and that cultural differences between those two languages are minimized in the statements.

3.4.1 Stimuli

Statements obtained in the semi-structured interviews in Study 1 and a literature review served as stimuli for a pilot card sorting task. Several behavioural statements which were not mentioned during the interviews, but which we thought to be of interest, such as talking while watching video material or fidgeting, were added. To add more words to the ‘sympathise with’ and ‘absorbed in’ categories, synonyms of these forms were taken from a thesaurus. In total, fifty statements which expressed a kind of behaviour pertaining to involvement or watching television were written on cards. Examples are: replaced embarrassment, talking to the TV, fiddling, understanding, commiserate, identify with, sympathise with, feel compassion, concern, and horrified. To prepare for the card sorting, each statement was printed on a laminated 4x10.5 cm card.

3.4.2 Pilot

In a pilot test, four male participants were given the 50 cards with instructions for an unstructured sorting task (see Section 3.2 for more details). Based on the hierarchical cluster analysis of these pilot study results, remarks from the subjects, and the previously mentioned literature review, 56 statements were printed on cards for the main card sorting task. From the initial 50 stimuli, three stimuli were taken out (interested, kinship and luxe) and nine stimuli were added (boring, challenge, communication, curiosity, frightening, immersed, relaxing, suspension of disbelief and turning off the tv). Five stimuli were changed based on remarks of the pilot participants to make them easier to understand (‘clapping hands’ became ‘clapping your hands’, ‘replaced embarrassment’ became ‘feeling embarrassment for something somebody else did’, ‘flow’ became ‘flow experience’, ‘fiddling’ became ‘fidgeting’ and ‘relation’ became ‘relationship’). The 56 statements can be found in Appendix 1.


3.4.3 Participants, main study

Thirty-three subjects participated in this study: 17 native Dutch speakers (nDs) and 16 native English speakers (nEs). Of the native Dutch speaking participants, 9 were women and 8 men; their mean age was 33 (SD = 16.5), with a range between 17 and 63. Of the native English speaking participants, 8 were women and 8 men; their mean age was 36 (SD = 12), with a range between 25 and 65. None of these participants participated in the pilot, but four of them took part in Study 1, reported in Section 3.3.

3.4.4 Procedure

The participants were asked whether they preferred a home-visit or to come to the High-Tech Campus Eindhoven or Eindhoven University of Technology, The Netherlands. If participants opted to come to Eindhoven, they were seated in a quiet meeting room and offered something to drink before starting with the individual card sorting task. If participants preferred a home-visit, the experimenter asked them to make sure a quiet room was available for the experiment. Participants were then asked to fill out a demographics questionnaire, which asked about their native language, gender, date of birth, country of origin, the number of TVs in their house, number of people in their household, and their highest level of education. Next, participants were given instructions which included the focus of the experiment – to find out the most important aspects of television and behaviour – and how to carry out the card sorting exercise. If there were any questions, they were encouraged to ask them. The complete set of index cards was then given to the participants for an unstructured card sorting task, together with post-it notes and a pen to write labels for their piles. Participants were also told that each statement could only be placed in one pile, all 56 statements could not be placed on one single pile and all statements could not be in their own pile (i.e. 56 piles was not allowed either, although some statements may be in a pile by themselves). Labels could be written either during the card sorting or at the end, whichever the participant preferred. While participants were busy with the unstructured card sorting task, the experimenter stayed as unobtrusively as possible in the room to be at hand should any questions arise. Once the unstructured card sorting task and labelling was completed, the experimenter collected the post-it notes and wrote the numbers of the cards assigned to the pile on them. Next, participants were given the complete set of cards again. For their last assignment, participants were asked to execute a structured card sorting task, which consisted of sorting the cards into eight piles. The only other restriction here was that each statement could only be placed in one pile. Again post-it notes and a pen were available to write labels for the piles, and when the structured card sorting task was finished the experimenter gathered the post-it notes and wrote the numbers of the cards assigned to


the pile on them. Any remaining questions were answered, and participants were thanked for their time.

3.4.5 Representation of statements

Analysis of the card sorting was done through agglomerative hierarchical cluster analysis (Gordon, 1999). To include the information from all the participants’ clusterings, a nonmetric 56x56 dissimilarity matrix between the statements was built from the participants’ clustering results. To start, the analyses were done separately for each condition in a 2x2 (structure x native language) design. Statements that were clustered together more often by the participants were given a lower dissimilarity, and statements that were clustered together less often a higher dissimilarity. Based on the generated dissimilarity matrix, the agglomerative cluster analysis was performed using the average linkage method (Sneath & Sokal, 1973): the dissimilarity between two groups of items was computed as the average of the dissimilarities of all pairs of items across the groups. This procedure was repeated for every combination of the conditions. To determine a good number of clusters, the elbow criterion was used, i.e. the number of clusters was chosen such that adding another cluster did not lower the average cluster variance much compared to the previous solution.

To compute the similarity between the computed sets of clusters, a similarity measure inspired by a set distance measure, the Rand Index (RI) (Gordon, 1999), was used. The Rand Index is a number between 0 and 1: a value of 0 corresponds to no similarity between cluster sets, while a value of 1 corresponds to identical cluster sets. To estimate the significance of the computed similarity between cluster sets, a set of simulations was run. Ten thousand random cluster sets with the same number of statements as the original cluster sets were generated, and the RI was computed for each pair of simulated cluster sets. This produced bounds for the value of the RI under random clustering with the same cluster sizes. Given these values, a significant similarity between cluster sets is established when the computed RI is above the bounds.
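As an illustration of the clustering step, the sketch below runs average-linkage agglomerative clustering on such a dissimilarity matrix with SciPy and prints the quantity inspected for the elbow criterion. The stand-in data and the helper function are assumptions made to keep the example self-contained.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def mean_within_cluster_dissimilarity(D, labels):
    """Average dissimilarity between members of the same cluster."""
    vals = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        if len(idx) > 1:
            sub = D[np.ix_(idx, idx)]
            vals.append(sub[np.triu_indices(len(idx), k=1)].mean())
    return float(np.mean(vals)) if vals else 0.0

# Stand-in for the 56x56 dissimilarity matrix built from the card sorts.
rng = np.random.default_rng(0)
pts = rng.random((56, 3))
D = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))

# Average linkage on the condensed (upper-triangular) form of D.
Z = linkage(squareform(D, checks=False), method='average')

# Elbow criterion: find the number of clusters after which adding another
# cluster no longer lowers the within-cluster dissimilarity by much.
for k in range(2, 12):
    labels = fcluster(Z, t=k, criterion='maxclust')
    print(k, round(mean_within_cluster_dissimilarity(D, labels), 3))
```

In the actual analyses, the matrix built from the participants’ card sorts would take the place of the stand-in `D`.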

3.4.6 Comparison between native Dutch and English speakers

The comparison of the clusters for native Dutch and English speakers showed that these were significantly similar, for both unstructured (U) and structured (S) sorting (RIU = .71, 95% boundsU = [.61, .65]; RIS = .75, 95% boundsS = [.64, .68]). As mentioned in the previous paragraph, the bounds provided are the values obtained for simulated cluster sets with the same cluster sizes as the unstructured and structured cluster analyses. The RI values need to be above these bounds; otherwise there is not sufficient similarity between the cluster sets to be able to use them in one overall analysis. As both RI values are above the boundary values, it is possible to say with confidence that the English and Dutch native speaker sets are significantly similar, and that the data can be pooled.
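For readers who want to reproduce this kind of comparison, a minimal sketch of the Rand Index and the simulated bounds is given below. Shuffling one solution’s labels while keeping its cluster sizes fixed is one way to obtain bounds of the kind described in Section 3.4.5; the toy labels are illustrative.

```python
import numpy as np
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of statement pairs on which two clusterings agree: both
    place the pair in the same cluster, or both place it in different ones."""
    agree, total = 0, 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += int(same_a == same_b)
        total += 1
    return agree / total

def simulated_bounds(labels_a, labels_b, n_sim=10_000, seed=0):
    """95% bounds on the RI when one labelling is randomly shuffled,
    i.e. under random clustering with the same cluster sizes."""
    rng = np.random.default_rng(seed)
    shuffled = np.array(labels_b)
    sims = []
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        sims.append(rand_index(labels_a, shuffled))
    return np.quantile(sims, [0.025, 0.975])

# Toy example: two 3-cluster solutions over 12 statements.
a = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
b = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 0]
lo, hi = simulated_bounds(a, b, n_sim=1000)
print(f"RI = {rand_index(a, b):.2f}, random bounds = [{lo:.2f}, {hi:.2f}]")
```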


Although the number of clusters for the unstructured sorting differed between language groups (5 for nDs, 7 for nEs), our results show that the content of these clusters was very much alike. The nDs sorting showed that there was one cluster which contained about half of the stimuli. For the nEs sorting, this one cluster had split up into three smaller clusters, making a total of 7 clusters instead of 5. Hence, for the remainder of the analysis the data from both speaker sets were pooled.


3.4.7 Results

Pooling the clustering results of Dutch and English participants, for unstructured and structured sorting respectively, produced new dissimilarity matrices. Again, an agglomerative cluster analysis was performed, using average linkage, and the elbow criterion was used to find an appropriate number of clusters. For both unstructured and structured sorting, seven clusters was the optimum. Table 2 shows the seven clusters from the pooled sorting, which appears to be a good trade-off between having not too many clusters and still being able to discriminate between the contents of the cards offered to participants. Looking at Table 2, cluster 1 could be seen as a reflection of focused attention. Furthermore, meaningfulness (or its absence) could be seen in clusters 2, 3 and 4. Cognitive processes are also indicated in cluster 3, while the feeling of mediated experience is probably most suggested by cluster 5. Affective processes are reflected in clusters 4 and 6. Cluster 7 seems to be a reflection of participants’ expression that those statements do not belong with the rest of the statements, and hence probably also not with the operational definition of involvement. To find out whether participants interpreted these clusters similarly, participants from this experiment were invited to come back for an interpretation session, which is described in Study 3.

3.5 Study 3: Interpretation of statements

In Study 3, we tested our hypothesis that the current cluster solution was stable. Participants worked in groups to label the presented clusters in agreement with each other. Participants were then invited to re-structure the statements through an unstructured card sorting task, and again to label the piles they came up with in agreement with each other.

3.5.1 Stimuli

The same 56 stimuli were used as selected for the main card sorting task in the previous experiment. The presented statements were already clustered, following the structured sorting outcome of Study 2. Cards were arranged in rows on a table, where each row represented a cluster (see Figure 2).


Table 2: The seven clusters of the 56 statements for both unstructured and structured sorting, with Dutch and English participants pooled.

cluster 1
  similar for both unstructured and structured sorting: absorbed in, caught up in, flow experience, immersed, involvement, losing track of time, suspension of disbelief
  different for unstructured sorting: clapping your hands, gratification, make predictions, talking to the TV

cluster 2
  similar for both unstructured and structured sorting: boring, fidgeting, indifference, preoccupied, this is really nothing, turning off the TV
  different for unstructured sorting: diversion
  different for structured sorting: can’t handle this any longer

cluster 3
  similar for both unstructured and structured sorting: communication, comprehension, curiosity, discuss content, informed, interest

cluster 4
  similar for both unstructured and structured sorting: affinity, association, commiserate, empathy, feel compassion, feeling embarrassment for something somebody else did, identify with, involvement, sympathise with, understanding
  different for unstructured sorting: concern
  different for structured sorting: feelings

cluster 5
  similar for both unstructured and structured sorting: amusement, entertained, excited, happiness, laughing, relaxing
  different for structured sorting: clapping your hands, dazzled, diversion, make predictions, talking to the TV

cluster 6
  similar for both unstructured and structured sorting: appalled, confused, crying, disgusting, frightening, horrified, sad, scared, startled, terrified
  different for unstructured sorting: feelings
  different for structured sorting: concern

cluster 7
  similar for both unstructured and structured sorting: challenge, compulsion
  different for unstructured sorting: can’t handle this any longer, dazzled
  different for structured sorting: gratification

3.5.2 Participants

Participants from Study 2 were asked to participate again, but this time in sessions with two to three participants. Not all participants were available, which is why only a subset of the original group of participants remained. A total of 10 native Dutch speakers (nDs) participated, 6 women and 4 men; their mean age was 32 (SD = 16), with a range between 18 and 63. A total of 7 native English speakers (nEs) participated, 2 women and 5 men; their mean age was 31.5 (SD = 5.5), with a range between 26 and 41. These are low numbers of participants; however, as specified by Trochim (1989), such numbers are enough to provide a diversity of opinions and enable sufficient discussion.

3.5.3 Procedure

Participants were invited to come to the sessions in seven small groups. The sessions took place in meeting rooms. When entering the meeting rooms, participants were notified that the sessions would be recorded. If they did not agree with this, they could immediately quit the experimental session; otherwise, they were required to sign a consent form. All participants signed the consent form. Sessions were videotaped to be able to record


Figure 2: An example of how cards were presented to participants in Study 3, and how participants wrote down labels for clusters.

the discussion that would follow the instructions; if labels for piles turned out not to be understandable, this should help to recapture their meaning. When participants entered, the index cards were already placed on the table, in the structured sorting outcome of the previous card sorting procedure. This was done by positioning the index cards with the statements on the table such that the clusters were visually clearly separated. Participants first received instructions which recalled the focus of the research (behaviour and television) and explained that the session would be conducted in three parts. First, participants were asked to go through the piles individually and to label each cluster with a short word or sentence. Once this was finished, they could discuss their labels with the group and label all piles in such a way that everybody was satisfied with the labels. When the labelling was finished, the experimenter took pictures of the labels and the cards. The labels were then removed and the cards shuffled. In the second part of the session, participants were instructed to sort the cards into piles in a way that made sense to them (Trochim, 1989), again applying unstructured card sorting. Restrictions were the same as in Study 2. After sorting, participants were asked to write down labels for the piles on provided pieces of paper. If participants felt that statements important for behaviour and television were missing, they could write them down on the provided index cards and sort them with the rest of the cards. Instructions stated explicitly that they could use labels already produced in part 1. Again, everybody in the group had to be satisfied with both the piles of cards and the labels for the cards.


When the participants finished the sorting and labelling, the experimenter again took pictures of the labels and the piles. Any remaining questions from participants were answered, after which participants were thanked for their cooperation and compensated for their time.

3.6 Results and discussion

There are several outcomes to look at: the re-structuring of the clusters, the labelling of the clusters, and the analysis of the latest unstructured card sorting. Participants also contributed a couple of new statements during the last sorting of the cards. Because not all participants sorted these statements, the additions and their placement will be discussed and interpreted at the end of this section.

3.6.1 New clusters

Each group of participants made at least one change to the statements in the originally presented clusters, with a maximum of 26 changes (almost half of the words) and a mean of 13 changes. Analysis of the card sorting was again done by agglomerative hierarchical cluster analysis (Gordon, 1999). The agglomerative method does not handle a dissimilarity of 0, so several words which all subjects clustered together were pre-grouped and considered as one word in the rest of the analysis. The words always grouped together were scared, appalled and disgusting; sad and crying; this is really nothing and fidgeting; and amusement and laughing. Hence, only 51 data points, instead of 56, were used. The results showed that five statements changed to different clusters. Only cluster 2 did not change; all other clusters either gained or lost one or two statements (see Table 3 for details). The five statements which switched positions to other clusters are make predictions, understanding, concern, gratification and feeling embarrassment for something somebody else did.

To gain further insight into how the re-structured clusters relate to each other, a dendrogram was made. A dendrogram is a tree diagram which illustrates how statements cluster together when a hierarchical clustering algorithm is used. Figure 3 shows the dendrogram produced by the hierarchical cluster analysis of the re-structured card sorting task. Generally speaking, clusters come together later than in Study 2, which indicates a better separation. Beginning from the bottom, clusters 6 and 2 come together first. Both clusters capture negative emotions and states. This could indicate that negative affect is often a precursor of disinterest. Next to join are clusters 1 and 5, or the captivated and positive affect clusters. Clusters 1 and 5 are also composed of actions and behaviours, mostly positive. While this is not necessarily an indication that positive affect is necessary for some kind of captivated feeling, it appears as though participants see those two going together naturally. However, clusters 7 and 5 also join together. This could mean that, although most participants see those statements as not belonging together at all, other participants think those statements belong to cluster 5. At the top, clusters 3 and 4 join together, and are mostly concerned with


subject matter. This could be an indication that empathy and informative interest together create meaningfulness of multimedia material.
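A dendrogram such as Figure 3 can be drawn directly from the linkage result; the sketch below uses stand-in data, since the actual 56-statement dissimilarities are not reproduced here.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Stand-in data: in the actual analysis, Z is the average-linkage result
# for the (pre-grouped) statements and `statements` holds their texts.
rng = np.random.default_rng(1)
pts = rng.random((10, 3))
D = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
Z = linkage(squareform(D, checks=False), method='average')
statements = [f'statement {i}' for i in range(10)]

fig, ax = plt.subplots(figsize=(6, 5))
dendrogram(Z, labels=statements, orientation='right', ax=ax)
ax.set_xlabel('distance where statements cluster')
plt.tight_layout()
plt.show()
```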

Table 3. Results of the structured and restructured group sorting.

cluster 1
  similar for both structured and restructured sorting: absorbed in, caught up in, flow experience, immersed, involvement, losing track of time, suspension of disbelief
  different for restructured sorting: make predictions

cluster 2
  similar for both structured and restructured sorting: boring, can’t handle this any longer, fidgeting, indifference, preoccupied, this is really nothing, turning off the TV

cluster 3
  similar for both structured and restructured sorting: communication, comprehension, curiosity, discuss content, informed, interest
  different for restructured sorting: understanding

cluster 4
  similar for both structured and restructured sorting: affinity, association, commiserate, empathy, feel compassion, feelings, identify with, relationship, sympathise with
  different for structured sorting: feeling embarrassment for something somebody else did, understanding
  different for restructured sorting: concern

cluster 5
  similar for both structured and restructured sorting: amusement, clapping your hands, dazzled, diversion, entertained, excited, happiness, laughing, relaxing, talking to the TV
  different for structured sorting: make predictions
  different for restructured sorting: gratification

cluster 6
  similar for both structured and restructured sorting: appalled, confused, crying, disgusting, frightening, horrified, sad, scared, startled, terrified
  different for structured sorting: concern
  different for restructured sorting: feeling embarrassment for something somebody else did

cluster 7
  similar for both structured and restructured sorting: challenge, compulsion
  different for structured sorting: gratification

3.6.2 Labelling

Table 4 shows the labels given by participants (all labels that Dutch participants gave have been translated to English as accurately as possible). As can be seen from the Table, cluster 1 was given a range of diverse labels, just like cluster 3. Statements in cluster 1 point towards focused attention and unconsciousness of the experience being mediated. Capturing this in a few words is not easy, but for now the decision was made to call it the captivation cluster. Cluster 3 could be characterized as the informative interest cluster. Disinterest cluster is probably a good name for cluster 2, since disinterest can have different causes but tends to produce the same effects in the end: the channel is changed or the TV is switched off. Cluster 4 labels seemed to centre around empathy and feeling related, so we named it the relatedness cluster.


Figure 3. Dendrogram of the clusters after agglomerative hierarchical cluster analysis of the restructured card sorting (7 clusters). The experimental statements are listed on the vertical axis; the horizontal axis shows the distance at which statements cluster.

Cluster 5 could be seen as the positive affect cluster, and cluster 6 as the negative affect cluster. Cluster 7 is probably best characterized as the irrelevant cluster. To summarize, the clusters from the restructured group sorting will be named captivation, informative interest, disinterest, relatedness, positive affect and negative affect. To find out whether participants were satisfied with these clusters, they were asked to change the given clusters, if necessary, such that they were satisfied with the formation of the clusters. This also included labelling the clusters again. Participants often used the same labels, but changed the words that belonged to them. Table 5 shows the labels given in the second part of the workshop. The labels are grouped according to the content of the clusters participants made and the meaning of the label itself.


Table 4. The labels given to the structured clusters offered to the participants of Study 3, pooled over the seven groups.

cluster 1: Flow; Captivated; I’m ‘taken away’; Emotion-bound subject; Passive involvement
cluster 2: Boredom (twice); Lack of involvement; Negative experience – time to do something else; I want to do something else; Missed purpose
cluster 3: Documentary; Informative; Indicators of interest in content; Learning/growing; Depth; News item
cluster 4: Empathy (twice); Put yourself in somebody else’s shoes; Reality program; Understanding; An emotional experience; Emotion evoking TV; Compassionate emotions
cluster 5: Reality TV; Positive actions; Passive amusement; Indicators of enjoyment or being entertained; Having a great time; Passive influence of TV; Terms to do with a shared TV watching experience
cluster 6: Negative feelings (twice); Bad emotions; Negative emotions & emotional reactions; Negative emotions, which affect you anyway; Negative effect; I don’t like it but I like it
cluster 7: Rest; Experience; Not suitable; No link; Doubt; Impulsive/cheeky/mischievous; Motivators for continued use; Possible reaction on informative TV; Relaxation; Positive effect

Table 5. Labels given by groups of Study 3 to their own cluster formations, grouped according to the content of the clusters and the meaning of the labels. The labels given include: attractiveness; want to know what happens next; positive experiences; involvement; involvement with content; involvement with (an)other/s; indicators of interest; informative; non-emotional information providing; documentary; depth; using my brain; boredom; indicators of disinterest; this is nothing; I’m ready to stop!; losing you; lost you; missed goal; occupation; flow; programs that take a hold on you; Drawn into my TV!; caught you; empathy; emotion evoking TV; TV makes you feel emotions; An emotional experience; reality TV; entertainment; passive entertainment; Having a great time!; positive feelings; positive influence of TV; positive effect of TV on you; positive aspect of TV watching; you actively interact with TV; rest; negative feelings; negative feelings that compel you to keep watching; negative emotions and reactions; negative effect TV has on us; I don’t like it but I like it!; aroused actions (doubts); possible reaction to informative TV; don’t fit; doesn’t fit.

For example, all groups had a cluster with lack of involvement items, but group 4 differentiated between items meaning that the program was about to lose your interest and items meaning that the program had already lost your interest. Another example is the irrelevant cluster: 5 groups made one, 2 groups did not. Labels given by the participants were very helpful in coming up with names for the clusters that covered their content. Before finalizing these cluster names, an analysis of the reformed clusters was performed, which will be discussed in the next paragraph.


3.6.3 Statements added by participants

As stated at the beginning of this section, several groups of participants took the opportunity to add one or more statements to their re-sorted piles. These statements were the following: edge of your seat, looking forward to the next episode, family and sleeping. Edge of your seat refers to a way of sitting when multimedia material really has your focused attention, and was sorted in a pile called ‘Positive effect of TV on you’. Combining all information, clusters 1 through 7 were named captivation (1), disinterest (2), informative interest (3), relatedness (4), expressions of involvement (5), negative affect (6) and irrelevant (7). The final discussion and conclusion concerning the clusters and the implications for the model of the involvement construct will be continued in the next section.

3.7 Conclusion

Summarizing, concept mapping was applied through the use of semi-structured interviews to gather statements, card sorting to structure and represent statements, and hierarchical cluster analysis to interpret and visualize the results. Using this strategy showed that the construct of involvement with audio/video material can be divided into six attributes, i.e. captivation, disinterest, informative interest, relatedness, expressions of involvement, and negative affect. Relating these six attributes to the attributes from Table 1 (see Table 6) shows that relatedness is a new attribute, unpredicted by the literature review from Section 3.1. However, relatedness has been proposed as an aspect of a positive UX and is seen by Self-Determination Theory as a core human need (Hassenzahl, 2008). Hence, relatedness has been identified as incorporating meaningfulness and perceived relevance, as well as the ability to feel close and connected with either people or situations in the given multimedia content.


Table 6: Comparison of the attributes suggested by theoretical underpinnings and application areas in Table 1 versus the results from concept mapping.

attribute from theories and application areas    attribute based on concept mapping approach
altered sense of time                            captivation
attention                                        captivation
focus                                            captivation
interest                                         informative interest
loss of awareness of external world              captivation
meaningfulness                                   relatedness
positive affect / pleasure                       expressions of involvement
disengagement                                    disinterest
high/low levels                                  (not identified directly)
importance                                       relatedness
negative affect / emotions                       negative affect
perceived relevance                              relatedness
resistance to distraction                        captivation

One element that has not been identified directly is the high and low levels of involvement as argued by Klimmt & Vorderer (2003). However, the assumption has already been made that involvement varies on one dimension. Hence, it may not be necessary for this attribute to be identified as such through concept mapping, as long as the proposed measure is able to distinguish between lower and higher levels of involvement. The proposed involvement attributes do support the process model of engagement from O’Brien & Toms (2008), in which people start with a point of engagement and continue into feeling engaged. Eventually this results in disengagement and, potentially, the decision to re-engage with the activity. A similar process model could easily be exemplified for involvement: a user starts to watch a television program they like (point of involvement) and stays involved until the commercial break (lack of involvement). The user pays little attention during the commercial break, but becomes re-involved as soon as their television program starts again. The disinterest attribute especially should be able to show whether users are close to becoming uninvolved with the shown content. The operational definition of involvement with audio/video material from Section 3.1 can thus be rewritten as follows:

Involvement is a Human IF of QoE characterized by attributes of captivation, expressions of involvement, informative interest, relatedness, negative affect and disinterest. It can be expressed on one dimension, going from low to


high; and is a variable experience, changing across time depending on the individual, the media content and the situation.

To measure involvement, it is necessary to investigate the relationship between the attributes, as well as their contributions to the overall construct. As Figure 4 shows, it is possible to group the attributes together into three higher-order attributes: informative interest and relatedness contribute to subject matter; captivation and expressions of involvement contribute to positive actions/behaviours; and negative affect and disinterest contribute to negative actions/behaviours. Together, these three higher-order attributes then combine to determine people’s involvement with audio/video content.

Figure 4. Hypothetical model for involvement with audio/video content: the six attributes (disinterest and negative affect under negative actions/behaviours; expressions of involvement and captivation under positive actions/behaviours; informative interest and relatedness under subject matter) feed into three higher-order attributes, which together determine involvement with audio/video content. Further research is necessary to investigate whether it is really possible to group the six attributes together as such.

It remains to be seen, however, whether the relationship between the attributes is as straightforward as the hierarchical cluster model and Figure 4 suggest. The relation between these attributes needs further exploration, so at this point the model is still under development. The model does, however, provide a good base for further research, especially for application purposes. This is important, since the development of a measurement instrument to assess involvement was the motivation behind the research described here. The model of involvement makes it possible to decide on the form of the measurement instrument for involvement. Following Witmer & Singer (1998) and O’Brien & Toms (2010), a scale appears to be an appropriate measurement tool for involvement. Further work thus includes the creation of an item pool for a scale, based on the current model. Such an item pool can serve a twofold purpose. First, the item pool serves as the start of the creation of a scale. Second, the analysis of responses to the items can be used to evaluate the current model and address its validity and reliability. Items for the item pool could be formulated on the basis of statements from the clusters and dimensions, and of the new statements added by the participants. Examples of possible items could be “I felt


like laughing out loud”, or “I would have turned the TV off if I could”. Since the current model is meant for a large variety of multimedia content as available on TV, the measurement tool should reflect this variety. This also implies that the measurement tool needs to be validated with a variety of multimedia material; otherwise the results cannot be used for validation of the involvement model.


References

Celsi, R. L., & Olson, J. C. (1988). The role of involvement in attention and comprehension processes. The Journal of Consumer Research, 15(2), 210-224.

Csikszentmihalyi, M. (1990). Flow: The Psychology of Optimal Experience. New York: Harper & Row.

Gauntlett, D., & Hill, A. (1999). TV Living: Television, Culture and Everyday Life. Oxon, UK: Routledge.

Gordon, A. D. (1999). Classification (2nd ed., Vol. 82). Boca Raton, USA: Chapman & Hall/CRC.

Hassenzahl, M. (2007). The hedonic/pragmatic model of user experience. Paper presented at Towards a UX Manifesto, COST294-MAUSE affiliated workshop.

Hassenzahl, M. (2008). User experience (UX): Towards an experiential perspective on product quality. Paper presented at the Proceedings of the 20th International Conference of the Association Francophone d’Interaction Homme-Machine.

IJsselsteijn, W. (2004). Presence in Depth. Doctoral thesis, Eindhoven University of Technology, Eindhoven.

Jennett, C., Cox, A. L., Cairns, P., Dhoparee, S., Epps, A., Tijs, T., et al. (2008). Measuring and defining the experience of immersion in games. International Journal of Human-Computer Studies, 66(9), 641-661.

Klimmt, C., & Vorderer, P. (2003). Media psychology “is not yet there”: Introducing theories on media entertainment to the presence debate. Presence: Teleoperators and Virtual Environments, 12(4), 346-359.

Le Callet, P., Möller, S., & Perkis, A. (2013). Qualinet white paper on definitions of Quality of Experience. Output from the fifth Qualinet meeting, Novi Sad, March 12, 2013. European Network on Quality of Experience in Multimedia Systems and Services.

Malone, T. W. (1982). Heuristics for designing enjoyable user interfaces: Lessons from computer games. Paper presented at the SIGCHI Conference on Human Factors in Computing Systems, Gaithersburg, Maryland, USA.

Nillesen, J. P. H. (1988). Involvement en reclameverwerking. In A. E. Bronner (Ed.), Jaarboek MarktOnderzoekAssociatie (pp. 165-187). Haarlem, The Netherlands: De Vrieseborch.

O’Brien, H. L., & Toms, E. G. (2008). What is user engagement? A conceptual framework for defining user engagement with technology. Journal of the American Society for Information Science and Technology, 59(6), 938-955.

O’Brien, H. L., & Toms, E. G. (2010). The development and evaluation of a survey to measure user engagement. Journal of the American Society for Information Science and Technology, 61(1), 50-69.

Perse, E. M. (1998). Implications of cognitive and affective involvement for channel changing. Journal of Communication, 48(3), 49-68.

Powers, W. T. (2008). Living Control Systems III: The Fact of Control. New Canaan, CT: Benchmark Publications.

Schubert, T. (2003). The sense of presence in virtual environments: A three-component scale measuring spatial presence, involvement, and realness. Zeitschrift für Medienpsychologie, 15(2), 69-71.

Sneath, P. H. A., & Sokal, R. R. (1973). Numerical Taxonomy: The Principles and Practice of Numerical Classification. San Francisco: W. H. Freeman & Co.


Sweetser, P., & Wyeth, P. (2005). GameFlow: A model for evaluating player enjoyment in games. ACM Computers in Entertainment, 3(3), 3-24.

Trochim, W. M. K. (1989). An introduction to concept mapping for planning and evaluation. Evaluation and Program Planning, 12, 1-16.

Witmer, B. G., & Singer, M. J. (1998). Measuring presence in virtual environments: A presence questionnaire. Presence: Teleoperators and Virtual Environments, 7(3), 225-240.

Zaichkowsky, J. L. (1985). Measuring the involvement construct. The Journal of Consumer Research, 12(3), 341-352.

Zaichkowsky, J. L. (1986). Conceptualizing involvement. Journal of Advertising, 15(2), 4-14.


Chapter 4: Development of the Involvement Questionnaire


The focus of this thesis is to investigate the relationship between Human Influencing Factors and System Influencing Factors (IFs) within the Qualinet Quality of Experience framework (Le Callet, Möller, & Perkis, 2013), through the use of a concrete model with an operational definition of involvement (Human IF), which can then be related to System IFs. The theoretical foundation for creating an operational definition of involvement with audio/video material was described in Chapter 3, which concludes with the following definition:

Involvement is a Human IF of QoE characterized by attributes of relatedness, informative interest, captivation, expressions of involvement, negative affect and disinterest. It can be expressed on one dimension, going from low to high; and is a variable experience, changing across time depending on the individual, the media content and the situation.

This definition allows moving forward with the creation of a measure based upon six clearly identified constructs. Hence, the goal of this Chapter is to develop a scale that measures actual involvement with audio/video content, and to refine the involvement construct with the results from the reliability and validity assessments. As mentioned in Chapter 3, there are several questionnaires that address involvement in some way (Zaichkowsky, 1985; O’Brien & Toms, 2010; See-To, Papagiannidis, & Cho, 2012; Witmer & Singer, 1998). However, these questionnaires either address different situations (e.g. involvement with products; Zaichkowsky, 1985), or they have not yet been validated for use within a QoE application area (De Moor et al., 2014; See-To et al., 2012). Currently, no behavioural measures have been proposed to assess the involvement construct. While there have been several reports of using eye-tracking, facial expressions and electroencephalographic (EEG) measures (Antons, Arndt, De Moor, & Zander, 2015; De Moor et al., 2014), so far no conclusive results linking these measures to either System IFs or involvement have been reported. However, behavioural measurements have been successfully applied in related research areas such as stress (Wilson & Sasse, 2000a, 2000b), player experience in digital games (van den Hoogen, IJsselsteijn, & de Kort, 2008) and quality of perception (eye-tracking; Gulliver & Ghinea, 2006). Nevertheless, research by van den Hoogen et al. (2008), Antons et al. (2015) and De Moor et al. (2014) indicates that there is no easy one-to-one correlation between the behavioural measurements taken and the subject’s frustration with or enjoyment of the media. To aid the interpretation of behavioural measurements, it can be helpful to have self-report measures which assess the same construct. The correlation between self-report measures and behavioural measurements can give a clearer picture of what is being measured. For example, both Antons et al. (2015) and De Moor et al. (2014) relied on self-report measurements such as the Self-Assessment Manikin scale (Lang, 1980) to relate Human IFs to System IFs. Questionnaires also make it possible to gain a more accurate description of the semantic attributes, and are easier to administer. While behavioural measurements can


be informative about participant behaviour and experience, the data is often very noisy and difficult to interpret. Since there is no appropriate existing self-report method to measure the nature of involvement with audio/video content, accurate interpretation of behavioural measures could be quite a challenge. Therefore, the decision was made to develop a questionnaire to measure the construct of involvement. In questionnaire development, the step after creating a well-defined construct is to make an item pool consisting of questions based on the operational definition of the construct (Loevinger, 1957; Spector, 1992). The items then require testing of their wording (e.g. through cognitive interviewing), such that the item pool can be updated. Subsequently, the items are compiled in a questionnaire or scale, and tested for reliability and validity. Assessing the reliability and validity of a scale is done by having a large number of participants fill out the questionnaire. The data allows further analysis of the involvement with audio/video content construct and verifies whether the construct is stable or needs refining.

4.1 Constructing a Scale

The intention for this thesis is to create a summated rating scale, such that the final rating shows where a person falls on the involvement with audio/video content continuum. This assumes that the involvement construct consists of a single continuum. The decision was made to measure involvement through agreement responses, as the statements used in Chapter 3 lend themselves best to responses based on agreement (e.g. turning off the tv, feeling excited, losing track of time) and relate closely to behavioural characteristics. Likert-scale items were adopted, because the statements from the previous Chapter are generally not suitable for creating effective contrasting adjectives. The involvement construct appears to be unipolar, varying from low to high. Hence, based on Spector (1992), the response options for the items range from 1 (strongly disagree) to 7 (strongly agree).

The number of items for the item pool, and the items themselves, were written and selected based on the guidelines set out in Section 2.7. To test whether the initial items in the item pool are clear, concise, concrete and free of ambiguity, cognitive interviews were used; the results are presented in the next Section. The next iteration of the item pool was tested via an online survey, as discussed in Section 4.3 (see Section 2.7.2 for the rationale). Validity and reliability (see Section 2.7.2 for a full discussion of the rationale and method) were also examined, based on the following hypotheses:

- Involvement within one person depends on the audio/video content the subject is shown.
- Involvement with audio/video content can be decomposed into the six attributes identified in the previous Chapter: captivated, disinterest, informative interest, relatedness, negative affect and expressions of involvement.
- The developed scale allows the identification of where people fall on the involvement continuum, where a low score on the scale indicates low involvement and a high score indicates high involvement with the shown audio/video content.
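To make the intended scoring concrete, the sketch below shows how a summated rating of this kind is computed once reverse-keyed items are taken into account. It is a minimal Python illustration; the item identifiers and keys are hypothetical, not the final inQ scoring key.

```python
# Minimal sketch of summated-rating scoring: reverse-keyed items (the 'neg'
# items, cf. Table 2) are flipped on the 1-7 agreement range before summing,
# so that a higher total always means higher involvement.
def involvement_score(responses, reverse_keyed, scale_max=7):
    """responses: dict mapping item id -> rating in 1..scale_max."""
    total = 0
    for item, rating in responses.items():
        if item in reverse_keyed:
            rating = (scale_max + 1) - rating  # e.g. 7 -> 1, 2 -> 6
        total += rating
    return total

# Example with two positively keyed items and one reverse-keyed item:
print(involvement_score({"q12": 6, "q15": 7, "q22": 2}, reverse_keyed={"q22"}))  # 19
```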

4.2 Cognitive Interviews

4.2.1 Method

PARTICIPANTS
Four people were invited to participate in the cognitive interviews. Participants' backgrounds were in psychology, human-technology interaction and computer science. Two females and two males participated, with mother tongues of English, Dutch, German and Ukrainian/Russian. Their mean age was 35 years, with a standard deviation of 4.5.

TESTS
Fifty new English-language items, largely based on the model in Chapter 3, were prepared and compiled into the involvement questionnaire. Participants could answer the items on a 7-point Likert scale¹ ranging from strongly disagree to strongly agree (see Table 2 for the exact wording of the items); this agreement-based response format follows the approach introduced by Likert (1932). Three items not based on the involvement construct were inserted to determine whether the aspects they represent, image quality and plotline, are appropriate for this questionnaire. Image quality was represented by the items "I enjoyed the images in this video" and "I enjoyed the graphics in the video". Participants were questioned about whether they perceived a plotline or not via the item "There was an obvious plot".

To find out how participants interpreted the questions, a semi-structured interview strategy was used with comprehension or interpretation probes. Participants were asked which questions were unclear to them, how they would interpret them, and how they would rewrite them to make them clearer. After those questions were covered, they were asked to go over the remaining questions and state how they interpreted them. They were also asked whether they had seen some of the video fragments before, and whether they had found any spelling or grammar mistakes.

¹ A 7-point Likert scale was chosen because it allows a reasonable number of gradations between responses, and it was hoped that this would improve discrimination between total involvement scores.

STIMULI
Five short video fragments, as described in de Hesselle (2006), were presented to the participants.

APPARATUS
The stimuli were shown on a Dell Latitude E6500 notebook, using VLC media player 1.0.0 (2010). The same notebook was used to record the cognitive interviews with Audacity® 1.2.6 (2010), a free, open-source program for recording and editing sounds.

PROCEDURE
Participants read the instructions and, if they had no further questions, were shown the first video fragment. They filled out the involvement questionnaire, and the process was repeated another four times until all video fragments had been shown and the involvement questionnaires filled out. Next, the cognitive interview started: the experimenter started the recording program, tested whether it recorded properly, and asked the questions detailed in the previous Section. At the end of the session, participants were thanked for their participation and escorted out of the room.

4.2.2 Results and Discussion

All interviews were transcribed, and the verbalized interpretations of the participants were placed in a database with items as rows and participants as columns. Items were judged to be cohesive if the interpretations of the participants contained the same words, or synonyms, to explain their interpretation. Judgements about the similarity of interpretations were verified by asking an expert with a background in scale development to look at the questions and the interpretations; the expert agreed with all the original cohesion judgements. Table 1 shows examples of items judged to be interpreted differently and similarly.

Ten questions were interpreted differently by the participants and were therefore taken out of the questionnaire. These included "I enjoyed the images in this video" and "I enjoyed the graphics in the video", which received a mixed interpretation of aesthetics and image quality, and whose interpretation also differed based on whether the content was a cartoon or not. "I laughed out loud" and "I feel sleepy" were also removed, since two participants pointed out that these questions were hard to answer using a Likert scale. "Watching this video gave me a new understanding of the subject" was removed as well, since there were three other questions with the same content.

Table 2 shows both item pools, the changes made to the item pool, and the dimension from which each item originates; it also indicates whether an item contributes positively or negatively towards involvement. The new version of the involvement questionnaire ended up with thirty-eight questions. The items in the second item pool were randomized and numbered from one to thirty-eight, and all subsequent versions of the item pool use this numbering for referral. The developed scale will henceforth be referred to as the involvement questionnaire, or the inQ.

4.3 Testing the Item Pool via an Online Survey

As Spector (1992) mentions, once the items have been tested a first time to ensure they are clear, concise, concrete and free of ambiguity, the scale should be presented to a larger number of participants. For this purpose, the inQ was implemented as an online survey, to reach as many people as possible in a short amount of time. This also allowed testing with a more multicultural sample of participants, which is important since the intended population is everybody who is able to experience audio/video quality: a rather large and diverse population.

4.3.1 Method

The online survey was set up to find out which questions in the inQ are relevant for a variety of multimedia material. An overview follows detailing the participants, the multimedia stimuli, the online survey platform, the scales, and the procedure the participants were asked to follow.

PARTICIPANTS
The online survey was finished by 107 people: 63 male and 44 female. Mean age was 31 years, with a standard deviation (SD) of 7.6 (min = 18, max = 57). On average, participants owned 1.5 TV sets; 8.4% of the participants did not own a TV set and 7.5% owned more than three. Participants who did not own a TV set watched multimedia on their PC or notebook; one participant also mentioned a projector with a Wii attached, and one other participant mentioned watching television at their friends' home. For 39 participants English was their mother tongue, with the other languages being Dutch (38), German (10), French (6), Chinese (2), Italian (2), Russian (2), Polish (2), Finnish (1), Greek (1), Tamil (1), Bahasa Indonesia (1), Spanish (1) and Macedonian (1). Most participants (34.6%) watched between 5 and 10 hours of multimedia in the week before completing the online survey; 8.4% indicated that they watched less than 1 hour of multimedia, and 11.2% indicated that they watched more than 20 hours.

MULTIMEDIA STIMULI
To gather feedback on a wide range of multimedia, nine different kinds of multimedia footage were selected. From each multimedia source, a one-minute fragment was selected; Table 3 details the selected fragments. One additional multimedia source was selected to serve as training for participants, i.e. to go through the questionnaire once and become acquainted with the questions.

Table 1. Examples of items: item 7 was interpreted differently, item 22 similarly.

Question | Participant 1 | Participant 2 | Participant 3 | Participant 4
7. I felt emotional. | Did anything in the scene I just saw trigger an emotion? So it was whether it was AN emotion, as opposed to whether it was a good emotion or a bad emotion. | Did the content make me sort of emotional; can be both positive & negative. | I associate the word with negative emotions. | I interpret this as negative.
22. This video really fascinated me. | Related to my interest in the content. | I took this literal too. Was I fascinated or ..? | If I was fascinated by it. | This is a more personal level, so what do I really find fascinating? This asks more about my preference.

Table 2. Items used in the first and the second item pool. If an item remained the same in both item pools, the second item pool column mentions 'Idem'; items dropped after the cognitive interviews are marked 'Not used'. The 'Key' column indicates whether an item contributes positively (pos) or negatively (neg) to involvement.

Nr. | First item pool | Second item pool | Attribute | Key
1. | I felt like laughing out loud. | Idem | expressions of involvement | pos
2. | I felt relaxed. | Idem | expressions of involvement | pos
3. | While watching the video I was getting restless. | Idem | disinterest | neg
4. | The video held my attention. | Idem | captivated | pos
5. | I enjoyed the images in the video. | This video was aesthetically appealing to me. | - | pos
6. | This video totally took me away to another world. | This video took me away to another world. | captivated | pos
7. | I felt disappointed when the video finished. | Idem | expressions of involvement | pos
8. | While watching the video I was shifting position often. | Idem | disinterest | neg
9. | This video had me on the edge of my seat. | This video sequence had me on the edge of my seat. | expressions of involvement | pos
10. | I felt more informed after watching this video. | Idem | informative interest | pos
11. | I lost track of time. | Idem | captivated | pos
12. | The video entertained me. | Idem | expressions of involvement | pos
13. | This video gave me new ideas to think about. | Idem | informative interest | pos
14. | I was totally absorbed. | Idem | captivated | pos
15. | This video really fascinated me. | Idem | captivated | pos
16. | This video made me curious to learn more. | I would like to know what happens next. | expressions of involvement | pos
17. | I felt amused. | Idem | expressions of involvement | pos
18. | I felt excited after watching this video. | I felt excited while watching this video. | expressions of involvement | pos
19. | I felt like talking to the television. | Idem | expressions of involvement | pos
20. | I felt involved. | Idem | captivated | pos
21. | I sympathized with what was going on. | Idem | relatedness | pos
22. | I would have turned the tv off if I could. | Idem | disinterest | neg
23. | I could relate to the situation. | Idem | relatedness | pos
24. | There was an obvious plot. | There was a coherent storyline. | captivated | pos
25. | I was happy when the video finished. | Idem | disinterest | neg
26. | I felt disgusted after watching this video. | I felt upset after watching this video. | negative affect | pos
27. | I felt embarrassed because of the way somebody acted in the video. | Idem | negative affect | pos
28. | Watching this video made me feel scared. | Idem | negative affect | pos
29. | I forgot where I am. | I forgot where I was. | captivated | pos
30. | I felt compassion for what was going on. | Idem | relatedness | pos
31. | This video gave me interesting information. | Idem | informative interest | pos
32. | I was bored because of the video. | The content of this video does not interest me in any way. | informative interest | neg
33. | I was making predictions about what would come next. | Idem | expressions of involvement | pos
34. | Watching this video made me feel sad. | Idem | negative affect | pos
35. | I would have liked it if the video continued. | Idem | expressions of involvement | pos
36. | I was concerned about the situation in the video. | Idem | relatedness | pos
37. | I felt confused after watching this video. | Idem | negative affect | neg
38. | I felt like switching channels, to see if there was something better on tv. | Idem | disinterest | neg
39. | Watching this video taught me something new. | Not used | informative interest | pos
40. | Watching this video gave me a new understanding of the subject. | Not used | informative interest | pos
41. | I felt emotional. | Not used | relatedness | pos
42. | I laughed out loud. | Not used | expressions of involvement | pos
43. | I enjoyed the images in the video. | Not used | - | pos
44. | I didn't think about anything else while watching. | Not used | captivated | pos
45. | I was willing to believe in the world shown in this video. | Not used | captivated | pos
46. | I enjoyed the graphics in the video. | Not used | - | pos
47. | This video helped me to learn about myself. | Not used | informative interest | pos
48. | This video gave me something to talk about later. | Not used | informative interest | pos
49. | I smiled while watching the video. | Not used | expressions of involvement | pos
50. | I feel sleepy. | Not used | disinterest | neg

All fragments were encoded in Flash video, with their native aspect ratio of 4:3 or 16:9 kept constant. The re-encoding was done to ensure that artefacts would show as little as possible while streaming the fragments over the internet to participants. Stimuli were presented in four different randomised orders.

APPARATUS
To present the questions and fragments to participants, the platform Formdesk (2010) was used. Formdesk is an online tool with which forms and surveys can be created. Participants were also asked to indicate which platform and browser they used, to collect further information about their media provider. Of the participants, 45% used a desktop computer and 55% used a laptop. Almost all participants used an LCD/TFT flat screen (93.5%), with the most frequent resolutions being 1280x1024 (46%) and 1024x768 (22%). Most participants used Mozilla Firefox (60%), although Internet Explorer (29%), Safari (6.5%), Google Chrome (4%) and Opera (1%) were also used.

SCALES
Demographic questions. To collect information about the participants' background, a number of demographic questions were asked, covering age, gender, nationality, mother tongue, highest finished education, current occupation, the number of TV sets they owned (if none, what they used to watch multimedia; if one or more, where these were located), the number of hours of multimedia watched in the past week, which browser they were using (Firefox, IE, etc.), which kind of screen they were using, and which screen resolution they were using.

Involvement questionnaire. As discussed in the previous Section, 38 questions remained to be validated (see Table 2). Before each administration of the inQ, participants were also asked whether they had already seen the fragment and, if so, approximately when, and where they thought it came from (an example answer could be '6 months ago, Star Wars'). Appendix 1, Figures 1-3 give examples of how questions were presented.


Table 3. Description of the ten audio/video fragments selected for the online survey.

Fragment | Description
training | Historical drama, Cranford, season 1, episode 1. Women in mid-19th-century outfits are talking inside a house.
documentary_GM | Nature's Great Events: The Great Melt (BBC). Two polar bears sit on drift ice while a narrator comments on their circumstances.
soap opera_GG | Gilmore Girls, season 6, episode 13. Mother, daughter and two grandparents are fighting about who should pay the (grand)daughter's college tuition.
feel-good_LA | Opening credits of Love Actually. The arrivals gate of an airport is shown, with a voice-over mentioning how love actually is everywhere.
science-fiction_SW | Star Wars Episode II: Attack of the Clones. An arena where robots are encircling a small number of people is shown. Another figure demands that these people join his cause.
cartoon_M | Mulan. A little dragon wakes up a girl and tells her that she will be late for soldier training.
drama_MB | My Bodyguard. A young man and his motorcycle are being beaten up.
news_CNN | Clip from CNN news. Discussion about the legalisation of marijuana.
sports_T | Wimbledon 2008. Final tennis match.
boring_CM | Clip produced by The Chris Moyles Show. Somebody is mowing their lawn.

PROCEDURE
Participants were invited to participate through email, which was distributed by means of several email lists (e.g. the British HCI mailing list, the ISN Eindhoven mailing list). The email also announced that three twenty-euro Amazon vouchers would be divided amongst the participants who finished the entire online survey. Once participants clicked on the link to the online survey, they were asked to pick a username and a password; this was necessary to make sure that each participant received a unique ID, in order to secure anonymity for all participants. Once they had created the username and password, the next screen came up, showing instructions (see Appendix 1). After the instructions, the demographic questions were shown, along with a request to answer them. Next, instructions for filling out the questions with the seven-point Likert scale appeared on the screen (see Appendix 1). The training fragment appeared, and the respondents went through the inQ for the first time. After that they were redirected to the multimedia source sequence that had been established for them, and requested to give their username and password one more time. Then the fragments appeared; after each fragment participants were again requested to fill in whether they had seen it before or not, and then the inQ. At the end they were thanked for their participation. Three randomly selected participants received a twenty-euro Amazon voucher.

4.3.2 Results

DATA SCREENING
A missing value analysis showed that one participant did not fill out the inQ for three multimedia fragments. It was not possible to determine why this happened, but since it made the other data of this participant suspect, all data of this participant were removed from further analysis. Three participants indicated that they only heard audio for one or two multimedia fragments; for those fragments, their data were deleted and not replaced.

Further inspection was used to determine whether the remaining data were missing at random. This is important because it is only advisable to replace missing values if it can be determined whether they are missing completely at random (MCAR) or missing at random (MAR) (Tabachnick & Fidell, 2006). To test for MCAR and MAR, SPSS Missing Value Analysis (MVA) was used. MCAR is assumed when the missing values are randomly distributed across all data; MAR is assumed when missing values are not randomly distributed across all data but are randomly distributed within one or more subsamples (e.g. gender, sequence of fragments shown). MCAR can be assumed for all fragments except the soap-opera and the boring movie fragment, as all other fragments had a statistically non-significant result, indicating that the pattern of missing data does not diverge from randomness. For the soap-opera fragment, MAR can be assumed: the MVA results showed a statistically significant departure from MCAR, but the pattern of missing data was not dependent on the dependent variable. For the boring movie fragment, MAR could not be assumed, as the MVA analysis showed that the missing data were related to the dependent variable, which implies non-ignorable missingness: the data are not randomly missing, and the missing values cannot be predicted from other variables (e.g. gender, order of fragments shown).

According to Tabachnick & Fidell (2006), the expectation maximization method for replacing missing values is particularly appropriate for statistical analyses that do not rely on inferential statistics, such as exploratory factor analysis. Expectation maximization (EM) employs maximum likelihood estimation, which picks estimates for the missing values that have the greatest chance of reproducing the observed data (Garson, 2009). Hence, all missing values were replaced with the expectation maximization method.

Next, normality was inspected via histograms, created for each item and each fragment, which allowed inspection of the shape of the distribution of each item. Few items showed a normal distribution, which means that the technique chosen for factor analysis should not rely on the assumption of normality.

Outliers per item were defined as isolated values that exceeded plus or minus three standard deviations from the item mean. Of the 342 items participants were requested to fill out, 13 items had one or two outliers; in total, 23 outliers were detected. One participant accounted for 5 of those outliers, but nothing out of the ordinary could be detected in the demographic variables. The exploratory factor analyses (EFA) were repeated with and without this participant; since no differences between the results were detected, the analysis was continued using the data from this participant.

Linearity was inspected via a matrix of bivariate scatterplots. If items are linearly related and normally distributed, the scatterplots should show an oval shape of points (Tabachnick & Fidell, 2006). Most variables showed a relation; however, since few variables were normally distributed, rampant heteroscedasticity was to be expected (Tabachnick & Fidell, 2006). Nevertheless, it was decided not to perform data transformations, since transformed variables can be harder to interpret in the results of the analysis.

The next step was calculating factorability and the average communality per multimedia fragment. This was done across all fragments to ensure that enough participants filled out the scale and that the data would be sufficiently reliable and valid. Per multimedia fragment, the ratio of participants to items is approximately 2.5 to 1. While this is on the low side, there are other factors which determine whether enough data is gathered to achieve stable and precise results (MacCallum, Widaman, Zhang, & Hong, 1999), as discussed in Section 5.2.
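To illustrate the idea behind EM replacement of missing values, the sketch below implements a simplified EM-style imputation loop for a participants-by-items score matrix. It is not the SPSS MVA algorithm: a full EM implementation would additionally correct the covariance estimate for the uncertainty of the imputed values, and this sketch assumes each row has at least one observed score.

```python
# Simplified EM-style imputation for a matrix X (np.nan marks missing cells).
import numpy as np

def em_impute(X, n_iter=50):
    X = X.astype(float).copy()
    miss = np.isnan(X)
    X[miss] = np.take(np.nanmean(X, axis=0), np.where(miss)[1])  # start: column means
    for _ in range(n_iter):
        mu = X.mean(axis=0)              # M-step: refit mean and covariance
        sigma = np.cov(X, rowvar=False)
        for i in np.where(miss.any(axis=1))[0]:
            m, o = miss[i], ~miss[i]     # missing / observed positions in row i
            # E-step: conditional expectation of the missing scores given the
            # observed scores under a multivariate normal model.
            s_oo = sigma[np.ix_(o, o)]
            s_mo = sigma[np.ix_(m, o)]
            X[i, m] = mu[m] + s_mo @ np.linalg.solve(s_oo, X[i, o] - mu[o])
    return X
```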

Table 4. Factorability and communality per multimedia fragment: the N used for each EFA, the factorability reached through the Kaiser-Meyer-Olkin measure (KMO), and the average of the final communalities (AFC) with their standard deviation (SD).

Multimedia fragment | N | KMO | AFC (SD)
sports_T | 104 | .831 | .616 (.176)
news_CNN | 103 | .835 | .626 (.158)
drama_MB | 103 | .797 | .640 (.135)
documentary_GM | 105 | .836 | .597 (.136)
science-fiction_SW | 104 | .882 | .613 (.178)
soap-opera_GG | 106 | .900 | .659 (.156)
feel-good_LA | 105 | .852 | .606 (.133)
cartoon_M | 104 | .877 | .629 (.151)
boring_CM | 106 | .751 | .590 (.153)
overall | 942 | .951 | .591 (.169)


Factorability is determined through the Kaiser-Meyer-Olkin (KMO) measure, which looks at both the correlation matrix and the anti-image correlation matrix (which contains the negatives of the partial correlations). The KMO value varies between 0 and 1.0, and it is advised to make sure that the overall KMO value is above .60 before proceeding with an EFA (Tabachnick & Fidell, 2006). For the current experiment, no KMO values were below .60, and most were considerably higher (see Table 4 for details). Average communalities were also calculated, as advised by MacCallum et al. (1999), who recommend a minimum communality value of .5 per item for a sample of between 100 and 200 participants. As Table 4 shows, average communalities are well above .5, and the hypothesized factors are well overdetermined: MacCallum et al. (1999) advise a minimum of 6 or 7 items per factor, and each of the hypothesized factors has at least 9 items. Therefore, the decision was made to continue with the current dataset.

DISCRIMINATIVE POWER ANALYSIS
To generate a first overview of the ability of the inQ to differentiate between participants, discriminative power analysis was used (Hosker, 2002). The discriminative power (DP) was calculated per item, per multimedia fragment, by taking the highest 25% and the lowest 25% of the scores (here, the 25 lowest and the 25 highest scores). A weighted mean per quartile group is calculated (see Table 5), and the DP is obtained by subtracting the two means; a larger difference between the means indicates a more effective item.

Table 5. Example of the calculation of the DP. The DP for this scale item is 4.36 - 1.52 = 2.84.

quartile | number of participants | freq. score 1 | freq. score 2 | freq. score 3 | freq. score 4 | freq. score 5 | weighted sum | weighted mean
1st | 25 | 12 | 13 | 0 | 0 | 0 | 38 | 1.52
4th | 25 | 0 | 0 | 0 | 16 | 9 | 109 | 4.36
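In code, the DP computation of Table 5 amounts to comparing an item's mean score in the lowest and highest quartile groups. A small sketch follows, assuming (as is common for this statistic) that the quartile groups are formed on participants' total scores:

```python
import numpy as np

def discriminative_power(item_scores, total_scores, group_size=25):
    order = np.argsort(total_scores)           # rank participants by total score
    low = item_scores[order[:group_size]]      # 1st quartile group
    high = item_scores[order[-group_size:]]    # 4th quartile group
    return high.mean() - low.mean()            # larger DP = more effective item

# Reproducing the example of Table 5 from the score frequencies:
low_group = np.repeat([1, 2], [12, 13])        # weighted mean 1.52
high_group = np.repeat([4, 5], [16, 9])        # weighted mean 4.36
print(high_group.mean() - low_group.mean())    # DP = 2.84
```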

To see which items have the highest and most consistent DP, the 25 highest-scoring items for each multimedia fragment were compared. The comparison shows that questions 3, 14, 16, 21, 25, and 33 are consistent across all 9 multimedia fragments. Furthermore, questions 7, 12, 23, 30, 35, and 38 are consistent across 8 multimedia fragments, and questions 5, 9, 15, 20, 22, 24, 32, and 36 are consistent across 7 multimedia fragments. The remaining items were also compared to see whether any items score consistently low on the DP. While no items score low across all 9 multimedia fragments, questions 19 and 27 score low across 8 multimedia fragments, and questions 1 and 28 score low across 7 multimedia fragments.

Table 6 shows the items in this order, which can help in deciding which questions to discard or maintain once the EFAs have been run.

EXPLORATORY FACTOR ANALYSIS
From the available EFA techniques, principal factors analysis is least affected by non-normality and has therefore been used for all subsequent EFAs in this Chapter (Tabachnick & Fidell, 2006). Considering that the attributes in the involvement construct might not be orthogonal to each other, it is sensible to use an oblique rotation method, i.e. one that allows correlations between factors (Tabachnick & Fidell, 2006). Therefore, principal factors analysis was augmented with promax, an oblique rotation method. Oblique rotation enables examination of the correlations between factors; unless there are correlations above .32 (indicative of 10% or more overlap in variance among factors), oblique rotation should not be used, and one should switch to orthogonal rotation instead (Tabachnick & Fidell, 2006). To be able to compare results across multimedia fragments, exploratory factor analyses were performed per multimedia fragment; an exploratory factor analysis was also conducted for the complete data set, since the end goal was to produce a scale which is valid and reliable for multiple kinds of multimedia fragments.

Selecting the appropriate number of factors can be done in several different ways. Retaining all factors with an eigenvalue greater than 1 is a frequently used rule of thumb; hence, an eigenvalue greater than 1 was required as an initial cut-off point in the EFA. Scree plots and parallel analysis were then used to make a decision about the actual cut-off point for the number of factors (Tabachnick & Fidell, 2006) (see Appendix 2, Figure 4 for an example of a scree test). The scree test, developed by Cattell (Tabachnick & Fidell, 2006), plots the eigenvalues against the number of factors; to determine the cut-off point, one looks for the point where a straight line drawn through the points changes its slope. However, this is not an exact science, and it involves the researcher's judgment, which can be preconceived. Table 7 summarizes the results from the scree test.

Another method to establish the number of factors is parallel analysis, first proposed by Horn (Tabachnick & Fidell, 2006). O'Connor (2000) constructed a program for SPSS which can be used to conduct parallel analysis. The program randomly generates a data set with the same number of cases (N) and variables (items), performs several factor analyses (without rotation) while keeping track of the eigenvalues, and finally averages the eigenvalues per factor obtained in the simulations. It is advisable to retain only the factors whose eigenvalues exceed the averaged eigenvalues from the randomly generated data set. Considering that the data in the current dataset are not normally distributed, the decision was made to use permutations of the raw data, rather than randomly generated data (O'Connor, 2000). The parallel analysis shows that, with permutations of the raw data from each multimedia fragment, 6 factors with an eigenvalue above 1 were found (see Table 8); for the overall parallel analysis across all multimedia fragments, 4 factors were found.

Table 6. Questions which come out as either positively or negatively consistent when comparing DPs across multimedia fragments. The last column lists the fragments for which an item was not consistent: for the positively consistent items, the fragments where the item had less discriminatory power; for the negatively consistent items, the fragments where the item had more discriminatory power.

Nr | Question | Attribute | Fragments

Positively consistent across 9 multimedia fragments
q3 | While watching the video I was getting restless. | disinterest | not applicable
q14 | I was totally absorbed. | captivation | not applicable
q16 | I would like to know what happens next. | expressions of involvement | not applicable
q21 | I sympathized with what was going on. | relatedness | not applicable
q25 | I was happy when the video finished. | disinterest | not applicable
q33 | I was making predictions about what would come next. | expressions of involvement | not applicable

Positively consistent across 8 multimedia fragments
q7 | I felt disappointed when the video finished. | expressions of involvement | sports_T
q12 | The video entertained me. | expressions of involvement | boring_CM
q23 | I could relate to the situation. | relatedness | science-fiction_SW
q30 | I felt compassion for what was going on. | relatedness | documentary_GM
q35 | I would have liked it if the video continued. | expressions of involvement | boring_CM
q38 | I felt like switching channels, to see if there was something better on tv. | disinterest | documentary_GM

Positively consistent across 7 multimedia fragments
q5 | This video was aesthetically appealing to me. | - | drama_MB, documentary_GM
q9 | This video sequence had me on the edge of my seat. | expressions of involvement | news_CNN, soap-opera_GG
q15 | This video really fascinated me. | captivation | drama_MB, boring_CM
q20 | I felt involved. | captivation | soap-opera_GG, boring_CM
q22 | I would have turned the tv off if I could. | disinterest | documentary_GM, feel-good_LA
q24 | There was a coherent storyline. | captivation | science-fiction_SW, cartoon_M
q32 | The content of this video does not interest me in any way. | informative interest | documentary_GM, feel-good_LA
q36 | I was concerned about the situation in the video. | relatedness | sports_T, feel-good_LA

Inconsistent across 8 multimedia fragments
q19 | I felt like talking to the television. | expressions of involvement | boring_CM
q27 | I felt embarrassed because of the way somebody acted in the video. | negative affect | drama_MB

Inconsistent across 7 multimedia fragments
q1 | I felt like laughing out loud. | expressions of involvement | soap-opera_GG, cartoon_M
q28 | Watching this video made me feel scared. | negative affect | drama_MB, documentary_GM

Table 7. Number of factors indicated by the scree test on the EFA per multimedia fragment and across all multimedia fragments.

Multimedia fragment | number of factors
sports_T | 2-4
news_CNN | 2-4
drama_MB | 3-4
documentary_GM | 3-4
science-fiction_SW | 2-3
soap-opera_GG | 2-3
feel-good_LA | 2-3
cartoon_M | 2-4
boring_CM | 2-3
overall | 3-6

Taking together the data from the scree test and the parallel analyses, it was decided to use 4 factors as a maximum cut-off point. However, to get a first insight into the factors, the EFA was performed with the instruction to retain factors as long as the eigenvalue before rotation is greater than 1. It is also necessary to check whether promax is really warranted as rotation technique, i.e. whether there are correlations above .32 within the first 4 factors. Appendix 3, Table 10 shows that there is at least one correlation above .32 for each multimedia fragment, and across all multimedia fragments. Therefore, the decision was made to run all EFAs with principal factor analysis and promax, to maximize factor loadings and interpretation possibilities. All pattern matrices from the first EFAs, with loadings at the factor cut-off point, are reported in Appendix 4. The decision was made to count only items with a factor loading of .400 or higher. Cronbach's α (Cortina, 1993) and the average inter-item correlations (Clark & Watson, 1995) were calculated for each fragment using the whole item pool; Cronbach's α was also calculated for the first four factors in each EFA to examine the reliability of the factors.

The factor loading matrices in Appendix 3 show that items do not load systematically on the same factor for different fragments. Therefore, another strategy to create a stable questionnaire was devised: an exploratory factor analysis, with principal factor analysis and promax, was run across all multimedia fragments (see Appendix 4 for statistical details). The results show that questions 5 (This video was aesthetically appealing to me) and 24 (There was a coherent storyline) did not load on any of the six factors found; these questions were removed for the next EFA. Furthermore, question 19 (I felt like talking to the television) loads on two factors, so it was removed as well. The discriminative power analysis also showed that questions 1 (I felt like laughing out loud), 27 (I felt embarrassed because of the way somebody acted in the video), and 28 (Watching this video made me feel scared) have a low DP, so those questions were also removed for the next EFA.
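The pipeline described above can be approximated outside SPSS. The sketch below uses the Python factor_analyzer package as a stand-in for the procedures actually used; the random stand-in data and column names are assumptions for the sake of a runnable example.

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_kmo

# Stand-in for one fragment's ~104 participants x 38 items of 1-7 responses.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 8, size=(104, 38)),
                  columns=[f"q{i}" for i in range(1, 39)])

kmo_per_item, kmo_overall = calculate_kmo(df)   # proceed only if overall > .60

# Principal factors extraction with promax (oblique) rotation, 4-factor cut-off.
efa = FactorAnalyzer(n_factors=4, method="principal", rotation="promax")
efa.fit(df)

loadings = pd.DataFrame(efa.loadings_, index=df.columns)
salient = loadings[(loadings.abs() >= 0.400).any(axis=1)]  # .400 loading cut-off
factor_corr = efa.phi_  # factor correlations; promax only warranted above .32
```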


Table 8. Eigenvalues obtained by the parallel analysis. Eigenvalues greater than 1 are marked with an asterisk.

factor | drama_MB | news_CNN | sports_T | documentary_GM | science-fiction_SW | soap-opera_GG | feel-good_LA | boring_CM | cartoon_M | overall
1 | 13.82* | 13.31* | 10.39* | 11.43* | 13.80* | 16.43* | 11.25* | 9.83* | 13.52* | 13.54*
2 | 2.64* | 3.53* | 4.63* | 3.94* | 4.59* | 3.50* | 4.52* | 3.73* | 4.22* | 3.82*
3 | 1.62* | 1.65* | 2.42* | 2.63* | 1.49* | 1.49* | 1.95* | 2.16* | 1.73* | 1.98*
4 | 1.29* | 1.30* | 1.48* | 1.38* | 1.35* | 1.28* | 1.82* | 1.89* | 1.22* | 1.54*
5 | 1.24* | 1.15* | 1.32* | 1.16* | 1.02* | 1.14* | 1.60* | 1.56* | 1.12* | 0.93
6 | 1.03* | 0.95 | 1.20* | 1.05* | 0.88 | 0.98 | 1.28* | 1.29* | 0.97 | 0.58
7 | 0.87 | 0.90 | 0.88 | 0.93 | 0.71 | 0.71 | 0.98 | 1.11* | 0.87 | 0.50
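For reference, O'Connor-style parallel analysis with permutations of the raw data can be written compactly; a sketch, assuming `data` is a participants-by-items numpy array:

```python
import numpy as np

def parallel_analysis(data, n_perms=100, seed=0):
    rng = np.random.default_rng(seed)
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    perm_eigs = np.empty((n_perms, data.shape[1]))
    for p in range(n_perms):
        # Shuffle each item's scores independently, destroying correlations
        # while preserving each item's (non-normal) distribution.
        shuffled = np.column_stack([rng.permutation(col) for col in data.T])
        perm_eigs[p] = np.sort(np.linalg.eigvalsh(np.corrcoef(shuffled, rowvar=False)))[::-1]
    threshold = perm_eigs.mean(axis=0)   # averaged permuted eigenvalues
    # Retain factors whose observed eigenvalue exceeds the permuted average.
    return int(np.sum(observed > threshold)), observed, threshold
```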

Maintaining a balance between the number of items necessary to cover the whole involvement construct and the compactness of the scale was an important consideration in the process. For practical purposes, the inQ needs to be as compact as possible: a compact scale increases the possibility for future research to test a wide range of fragments without participants getting bored or tired because of the test situation, rather than because of the offered fragments. To reduce the item pool further, communalities and Cronbach's α were inspected again. Questions 37 (I felt confused after watching this video) and 8 (While watching the video I was shifting position often) both have a low communality, and removing them improved Cronbach's α for the factor from .860 to .892.
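Cronbach's α itself is straightforward to compute once the item matrix is assembled. A sketch, with `items` a participants-by-items array of (reverse-scored where needed) responses for one factor; re-running it with an item left out reproduces checks such as the .860 to .892 improvement reported above:

```python
import numpy as np

def cronbach_alpha(items):
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                           # number of items in the factor
    item_vars = items.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars / total_var)
```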

Table 9. The items remaining in the inQ.

Nr. | Question | Factor | Loading | Attribute
6 | This video took me away to another world. | 1 | .653 | captivation
7 | I felt disappointed when the video finished. | 1 | .655 | captivation
9 | This video sequence had me on the edge of my seat. | 1 | .866 | captivation
12 | The video entertained me. | 1 | .745 | captivation
14 | I was totally absorbed. | 1 | .711 | captivation
15 | This video really fascinated me. | 1 | .709 | captivation
16 | I am curious about what happens next. | 1 | .632 | captivation
17 | I felt amused. | 1 | .781 | captivation
18 | I felt excited while watching this video. | 1 | .940 | captivation
35 | I would have liked it if the video continued. | 1/2 | .541/.450 | captivation/disinterest
3 | While watching the video I was getting restless. | 2 | .647 | disinterest
22 | I would have turned the tv off if I could. | 2 | .867 | disinterest
25 | I was happy when the video finished. | 2 | .878 | disinterest
32 | The content of this video does not interest me in any way. | 2 | .785 | disinterest
38 | I felt like switching channels, to see if there was something better on tv. | 2 | .810 | disinterest
10 | I would discuss this video with others. | 3 | .945 | informative interest
13 | I could easily understand what was going on. | 3 | .843 | informative interest
31 | This video gave me interesting information. | 3 | .903 | informative interest
2 | I felt relaxed. | 4 | -.501 | negative affect
26 | I felt upset after watching this video. | 4 | .808 | negative affect
34 | Watching this video made me feel sad. | 4 | .749 | negative affect
36 | I was concerned about the situation in the video. | 4 | .618 | negative affect
21 | I sympathized with what was going on. | 5 | .800 | relatedness
23 | I could relate to the situation. | 5 | .727 | relatedness
30 | I felt compassion for what was going on. | 5 | .704 | relatedness


Factor 1 seems overdetermined: 15 items load on it. Therefore, the decision was made to take items out of factor 1. Participants in the cognitive interviews already had comments about questions 4 (The video held my attention), 11 (I lost track of time) and 29 (I forgot where I was): they stated that of course the video held their attention, since that was the task asked of them, and, regarding questions 11 and 29, that the multimedia fragments were too short to lose track of time or to forget where you are. Considering that neither question 4 nor question 11 consistently made it to the top in the DP analysis, these comments are probably also valid for one-minute multimedia fragments, and these items were removed from the questionnaire. Question 33 (I was making predictions about what would come next) has a communality value below .3 and was removed from the questionnaire. Finally, question 20 (I felt involved) was removed, because it was already covered by other, less abstract, questions (e.g. 'This video took me away to another world.').

A final EFA was run to determine the resulting structure of the questionnaire (see Appendix 4, Table 22). Looking at the attributes the questionnaire started from, questions 10 and 13 needed to be rewritten to better represent the attribute 'informative interest': question 10 was changed to "I would discuss this video with others", and question 13 to "I could easily understand what was going on". Question 16 was rewritten to represent curiosity more explicitly, and was changed to "I am curious about what happens next". The last exploratory factor analysis shows that question 35 loads on both factor 1 and factor 2; question 35 was retained because of theoretical interest. Table 9 shows the new version of the inQ, which question belongs to which factor, and the factor loadings. The newest version consists of five factors, although six attributes were theorized.

Discussion and Conclusion

In summary, the results showed that the scores for the inQ differed within participants, depending on the audio/video fragment shown. However, the six previously hypothesized clusters did not replicate into six factors; therefore, the involvement with audio/video content construct needs to be updated. Almost all questions from factor 1 could be characterized via the captivated and expressions of involvement clusters; the decision was therefore made to name factor 1 captivation. Factor 2 reflected items that were based on disinterest, while items loading on factor 3 reflected informative interest. Factor 4 items represented negative affect, and the items loading on factor 5 all represented relatedness. The updated operational definition for involvement with audio/video content thus becomes:

Involvement is a Human IF of QoE characterized by attributes of relatedness, informative interest, captivation, negative affect and disinterest. It can be expressed on one dimension, going from low to high; and is a variable experience, changing across time depending on the individual, the media content and the situation.


The results detailed in the previous Section showed that only captivated and expressions of involvement came together in one factor. It may thus have been premature to assume that informative interest and relatedness, or negative affect and disinterest, feed into two larger attributes. The hypothesized model from Chapter 3 changes accordingly: it is now assumed that each attribute from the operational definition feeds into involvement directly, without joining up into larger attributes.

Having established the operational definition and model of involvement with audio/video content, as well as having created a measure, it now becomes easier to compare these against other definitions and measures, such as those proposed by See-To et al. (2012) or De Moor et al. (2014). As mentioned in Chapter 1, See-To et al. (2012) created an operational definition of engagement: engagement occurs when a person is psychologically immersed in a video, referring to the perceptual focus on mediated information and the avoidance of stimuli that do not belong to the multimedia offering (e.g. unrelated cognitions or external stimuli). However, for the purpose of this thesis, engagement needs a component of interaction (see also O'Brien & Toms (2008)), and it can be argued that their engagement is really involvement. An interesting point in See-To et al.'s (2012) model is that their engagement and enjoyment remain separate factors in their analysis. However, they did not include questions around negative affect or relatedness, which could potentially influence results differently, as the research detailed in the previous Sections showed that both of those also play a role in the involvement with audio/video content model presented here.

The study reported by De Moor et al. (2014) also included a factor the authors called engagement, which was based upon two factors from O'Brien & Toms (2010): focused attention and felt involvement. However, participants were not allowed to interact with the shown multimedia content, and hence engagement seems a misnomer for this factor, especially given the definition of engagement adopted in this thesis. Nonetheless, the factors focused attention and felt involvement as used by De Moor et al. (2014) certainly fit within the involvement with audio/video content model reported in this Chapter: comparing the questions, both focused attention and felt involvement are covered by the captivation attribute. This is especially interesting when considering that De Moor et al. (2014) reported a moderate but significant positive correlation of both focused attention and felt involvement with judged overall (multimedia) quality. To be precise, higher overall quality scores corresponded to higher reported focused attention, as well as higher reported felt involvement. Several other measures, assumed to fall under Human IFs, correlated significantly with overall quality as well: pleasure (positive correlation), interest (positive), expectations (positive) and annoyance (negative). Relating these measures to the operational definition of involvement with audio/video content demonstrates that interest is also covered, and seen as an attribute of involvement rather than as a stand-alone Human IF. Pleasure could be seen as being covered by the captivation attribute, and while annoyance is not directly covered, it can clearly be perceived as being part of negative affect.


Based on the research from Gauntlett & Hill (1999), where participants reported feeling compelled to watch content even if they knew what to expect, it was assumed that expectations about the meaning of the content would not be a direct attribute of involvement, and therefore would not need inclusion in the operational definition or the model. Given the results, this seems an appropriate decision. However, expectations about the technical aspects of the content, i.e. an error-free presentation, may certainly influence overall quality scores. The question as posed by De Moor et al. (2014) is ambiguous as to whether participants were to score the meaning or the technical aspects of the content. Given that the aim of De Moor et al. (2014) was to determine which measures could be added to standardized measures as detailed in ITU-R BT.500-13 (2012), it is important to disambiguate such questions; without such disambiguation, it is not possible to determine whether the answers reflect a Human IF or a System IF.

Combining the results from See-To et al. (2012) and De Moor et al. (2014), relating the inQ as a Human IF to System IFs has a high chance of success, as both studies reported a significant correlation between their Human IF and System IF measures. The next Chapter will further examine how the inQ relates to perceived video quality. The perceived video quality (PVQ) will be seen as a measure for the level of perception as manipulated by System IFs (Redi, Zhu, de Ridder, & Heynderickx, 2015); the inQ then represents a measure for the level of perception as manipulated by Human IFs (Redi et al., 2015). The relationship between PVQ and the inQ can thus be investigated, and their contributions to QoE can be explored. A secondary aim for Chapter 5 is to address the stability of the involvement with audio/video content model. The validity of the inQ also needs to be further addressed: while face validity is high, other kinds of validity, such as discriminant validity (Spector, 1992), have not been tested yet. The aim is to address all of these issues in the next Chapter.


References

Antons, J.-N., Arndt, S., De Moor, K., & Zander, S. (2015). Impact of perceived quality and other influencing factors on emotional video experience. In 2015 Seventh International Workshop on Quality of Multimedia Experience (QoMEX) Proceedings. IEEE.
Audacity® 1.2.6 [Computer software]. Retrieved 17.01.2010 from http://audacity.sourceforge.net/?lang=en
Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7(3), 309-319.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78(1), 98-104.
De Moor, K., Mazza, F., Hupont, I., Ríos Quintero, M., Mäki, T., & Varela, M. (2014). Chamber QoE: A multi-instrumental approach to explore affective aspects in relation to Quality of Experience. In Proc. SPIE 9014, Human Vision and Electronic Imaging XIX. SPIE.
Gauntlett, D., & Hill, A. (1999). TV living: Television, culture and everyday life. Oxon, UK: Routledge.
Gulliver, S. R., & Ghinea, G. (2006). Defining user perception of distributed multimedia quality. ACM Transactions on Multimedia Computing, Communications, and Applications, 2(4), 241-257.
Hosker, I. (2002). Social statistics: Data analysis in social science explained. Somerset, UK: Studymates Ltd.
ITU-R. (2012). Recommendation BT.500-13, Methodology for the subjective assessment of the quality of television pictures. International Telecommunication Union.
Lang, P. J. (1980). Behavioral treatment and biobehavioral assessment: Computer applications. In J. Sidowski, J. Johnson, & T. Williams (Eds.), Technology in mental health care delivery systems (pp. 119-137). Norwood, NJ: Ablex.
Le Callet, P., Möller, S., & Perkis, A. (2013). Qualinet white paper on definitions of Quality of Experience. Output from the fifth Qualinet meeting, Novi Sad, March 12, 2013. European Network on Quality of Experience in Multimedia Systems and Services.
Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 140, 1-55.
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635-694.
MacCallum, R. C., Widaman, K. F., Zhang, S., & Hong, S. (1999). Sample size in factor analysis. Psychological Methods, 4(1), 84-99.
O'Brien, H. L., & Toms, E. G. (2008). What is user engagement? A conceptual framework for defining user engagement with technology. Journal of the American Society for Information Science and Technology, 59(6), 938-955.
O'Brien, H. L., & Toms, E. G. (2010). The development and evaluation of a survey to measure user engagement. Journal of the American Society for Information Science and Technology, 61(1), 50-69.
O'Connor, B. P. (2000). SPSS and SAS programs for determining the number of components using parallel analysis and Velicer's MAP test. Behavior Research Methods, Instruments, & Computers, 32(3), 396-402.
Redi, J., Zhu, Y., de Ridder, H., & Heynderickx, I. (2015). How passive image viewers became active multimedia users. In C. Deng, L. Ma, W. Lin, & K. N. Ngan (Eds.), Visual signal quality assessment (pp. 31-72). Springer International Publishing.
See-To, E. W. K., Papagiannidis, S., & Cho, V. (2012). User experience on mobile video appreciation: How to engross users and to enhance their enjoyment in watching mobile video clips. Technological Forecasting and Social Change, 79(8), 1484-1494.
Spector, P. E. (1992). Summated rating scale construction: An introduction (Vol. 82). Newbury Park, CA: Sage Publications.
Tabachnick, B. G., & Fidell, L. S. (2006). Using multivariate statistics (5th ed.). Prentice Hall.
van den Hoogen, W. M., IJsselsteijn, W. A., & de Kort, Y. A. W. (2008). Exploring behavioral expressions of player experience in digital games. Paper presented at the Workshop on Facial and Bodily Expressions for Control and Adaptation of Games (ECAG).
Wilson, G. M., & Sasse, M. A. (2000a). Do users always know what's good for them? Utilizing physiological responses to assess media quality. Paper presented at HCI 2000: People and Computers XIV: Usability or Else!
Wilson, G. M., & Sasse, M. A. (2000b). Listen to your heart rate: Counting the cost of media quality. In A. M. Paiva (Ed.), Affective interactions: Towards a new generation of computer interfaces (Vol. 1814, pp. 9-20). Berlin, Germany: Springer.
Witmer, B. G., & Singer, M. J. (1998). Measuring presence in virtual environments: A presence questionnaire. Presence: Teleoperators and Virtual Environments, 7(3), 225-240.
Zaichkowsky, J. L. (1985). Measuring the involvement construct. The Journal of Consumer Research, 12(3), 341-352.


Chapter 5: Experimental investigation into the relation between the constructs of involvement and perceived video quality


Previous chapters showed that it is possible to define and measure the construct of involvement with audio/video content. As the aim of this thesis is to investigate the relationship between Human Influencing Factors (IFs) and System Influencing Factors within the Qualinet Quality of Experience (QoE) framework (Le Callet, Möller, & Perkis, 2013), the next step is to determine which System IFs will be investigated. As mentioned in Chapter 1, the focus will be on the network level of the System IFs.

When transmitting multimedia content to TVs, MPEG compression techniques are often used. Different MPEG compression techniques (e.g. motion compensation, quantization) produce different kinds of artefacts in the final video output; in perceptual experiments, different MPEG variables are therefore manipulated to measure the resulting perceived video quality. Similarly, the different optimization techniques used to adjust for variable bandwidth in a network introduce different kinds of artefacts. The final combination of effects on the multimedia content, and whether it is visible, depends on the resource constraints and the methods used.

For the purpose of this thesis, the focus will be on optimization methods used to stream multimedia content across wireless networks. The underlying encoding of the multimedia content under investigation will be MPEG-2 (see also Chapter 2, Section 3). The optimization methods used are detailed in Chapter 2, Section 5. Briefly, I-Frame Delay (IFD) (Kozlov, van der Stok, & Lukkien, 2005) and Transcoding (TC) (Brouwers, 2006) will be used to introduce specific distortions: IFD mainly introduces temporal distortions, while the SNR scalability applied in TC mainly introduces spatial distortions.

Previous research comparing temporal versus spatial distortions has shown discrepancies in results which could be explained by Human IFs. For example, McCarthy, Sasse, & Miras (2004) looked at the effects of quantization versus those of frame rate on perceived video quality (PVQ) for streamed high-motion video sequences on small screens (more specifically, CIF-sized soccer video footage). They concluded that it is not always necessary to have a high frame rate when streaming high-motion video. This is contrary to findings from Wang, Speranza, Vincent, Martin, & Blanchfield (2003), who noted that reducing the frame rate for video sequences with slow motion decreases PVQ only slightly, whereas reducing the frame rate for video sequences with faster motion (soccer, among others) decreased PVQ significantly. It is possible that this is due to the target audience: McCarthy et al. (2004) had soccer fans as observers, whereas Wang et al.'s (2003) pool of subjects were students from an introductory psychology course. Claypool, Claypool, & Damaa (2006) also investigated the effect of frame rate and resolution on observer perception and performance in first-person shooter games. Observers indicated that both lower resolution and lower frame rate were seen as lower quality; however, lower resolution did not negatively affect game performance, whereas lower frame rate did. This supports the Qualinet QoE framework's differentiation between Human, System and Context IFs, as Claypool et al. (2006) clearly showed that the setting in which video material is watched, and the objective with which it is watched, are important determinants of PVQ.
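The contrast between the two artefact classes can be made tangible with a toy simulation. The sketch below is an illustration of the principle only, not the IFD or TC implementations used in this thesis; greyscale frames are assumed to be numpy arrays.

```python
import numpy as np
from scipy.fft import dctn, idctn

def simulate_frame_drops(frames, keep_every):
    """Hold every keep_every-th frame: a proxy for IFD-like temporal jerkiness."""
    out = frames.copy()
    for i in range(len(frames)):
        out[i] = frames[(i // keep_every) * keep_every]
    return out

def simulate_coarse_quantization(frame, step):
    """Coarsely quantize 8x8 DCT blocks: a proxy for the blockiness and
    blurriness of SNR-scalability-style transcoding."""
    h, w = frame.shape
    out = np.zeros_like(frame, dtype=float)
    for y in range(0, h - h % 8, 8):
        for x in range(0, w - w % 8, 8):
            coeffs = dctn(frame[y:y+8, x:x+8].astype(float), norm="ortho")
            out[y:y+8, x:x+8] = idctn(np.round(coeffs / step) * step, norm="ortho")
    return np.clip(out, 0, 255)

# Example: a 30-frame 64x64 clip, degraded temporally versus spatially.
clip = np.random.default_rng(0).integers(0, 256, size=(30, 64, 64)).astype(float)
jerky = simulate_frame_drops(clip, keep_every=5)           # temporal artefact
blocky = simulate_coarse_quantization(clip[0], step=80.0)  # spatial artefact
```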


With the support of Philips Research the two optimization methods, IFD (Kozlov et al., 2005) and TC (Brouwers, 2006), were implemented, as well as a method to simulate their effects (de Hesselle, 2006). This provides the opportunity to systematically expose people to the same temporal or spatial distortions in controlled surroundings. Hence, it becomes possible to further investigate PVQ while distinguishing between ratings for reference material versus material adapted with either temporal or spatial artefacts. The Qualinet QoE framework (Le Callet et al., 2013) does not only draw upon IFs; QoE can also be decomposed into perceptual features (see Chapter 1, Figure 1). A QoE feature is defined as a perceivable, recognized and nameable characteristic of the individual's experience of a service which contributes to its quality. QoE features are classified under four different levels: perception, interaction, usage situation, and service. These levels are not assumed to be independent of each other, but their exact dependencies have not yet been established. For the purpose of this thesis the focus is on the perception level, as the aim is not to assess interaction with television sets or services, and the situation is limited to displaying audio/video content on large screens for individual consumption. As mentioned in the previous paragraph, within the possible System IFs, the focus will be on the network related level. The manipulation of the optimization methods should, on the feature perception level, mainly result in either jerkiness (a temporal artefact) or blockiness and blurriness (spatial artefacts). Based upon the research carried out by e.g. De Moor et al. (2014), Kortum & Sullivan (2010), Wechsung, Schulz, Engelbrecht, Niemann, & Möller (2011) and Palhais, Cruz, & Nunes (2012), the expectation is that there will be a positive relationship between the involvement with audio/video content model and PVQ, i.e. higher PVQ ratings will correspond to higher inQ scores. Furthermore, given the results reported by McCarthy et al. (2004), Wang et al. (2003) and Claypool et al. (2006), it is highly possible that the relationship between the involvement with audio/video content model and PVQ is moderated by the manipulated distortions (temporal versus spatial). Specifically, depending on the kind of content offered, and the individual's level of involvement, said individual may prefer temporal over spatial artefacts, or vice versa. To investigate these hypotheses, a study was executed which drew upon measures previously developed for both involvement with audio/video content and PVQ. A variety of multimedia content was employed, to allow the collection of the data necessary both to validate the inQ as a measure of involvement with audio/video content and to further explore its relationship to perceived video quality. This chapter concludes with a discussion of the reported study, and of how its results fit in the Qualinet QoE framework.

5.1

Discriminant validity

In the previous chapter, the validity of the inQ and the involvement construct was tested by means of factor analysis. According to Spector (1992), validation is the most difficult part of scale development, because, simultaneously with testing hypotheses about the construct, the


scale developer also tests hypotheses about the scale. A typical solution for scale developers is to develop hypotheses about the causes, effects and correlates of the construct, which are then tested through the scale. If empirical support for the construct is found, validity of the scale is implied (Spector, 1992). Relevant kinds of validity for scale development are discriminant validity, criterion-related validity, and convergent validity (Spector, 1992). For this thesis, the focus will be on discriminant validity, which involves the notion that measures of different constructs should relate only moderately to one another (Spector, 1992); this will be tested in this chapter. Criterion-related validity involves comparing scores from the developed scale with other variables or criteria (e.g. scores on other scales or comparisons between different identifiable groups of respondents). To assess discriminant validity, a correlation matrix must be created with the factors from the inQ and the measure for perceived video quality as variables. The expectation is that the factors from the inQ will correlate only moderately (below .4) with each other and with the measure for perceived video quality.
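In practice this check amounts to inspecting a single correlation matrix. A minimal sketch in Python; the data frame, its column names and all values are invented for illustration:

    import numpy as np
    import pandas as pd

    # Invented example scores: one row per participant/fragment combination,
    # with the three inQ factor sums and the perceived video quality rating.
    scores = pd.DataFrame({
        "inq_factor1": [45, 30, 52, 28, 40],
        "inq_factor2": [12, 9, 15, 8, 14],
        "inq_factor3": [14, 10, 17, 9, 12],
        "pvq":         [4, 2, 5, 2, 3],
    })

    corr = scores.corr()
    print(corr.round(2))

    # Discriminant validity criterion used in this chapter: the off-diagonal
    # correlations between distinct constructs should stay below .4.
    upper = corr.values[np.triu_indices_from(corr.values, k=1)]
    print("criterion met:", bool((np.abs(upper) < .4).all()))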

5.2

Measuring audio/video quality

Several studies reported in previous chapters, such as De Moor et al. (2014), Ketyko et al. (2010), Kortum & Sullivan (2004, 2010) and See-To, Papagiannidis, & Cho (2012), have used absolute category rating (ACR) scales as recommended by the ITU-R (2012). Instead of the recommended 10-second video clips, though, these studies experimented with 1 to 5 minute long video or audio/video content. The argument behind this is twofold: first, when individuals watch content at home, this generally lasts longer than 10 seconds; and second, 10 seconds may not be enough to generate and record Human IFs with current measures. Hence, the decision was made to continue using longer audio/video clips for the study reported in this chapter. Additionally, the inQ was developed for audio/video content, whereas the ITU-R (2012) recommendations are to use video material only. Reviewing the literature on combining audio and video quality yielded results from Beerends & de Caluwe (1999), showing that people rate video content higher when high quality audio content is presented together with the video content. Furthermore, video content presented together with low quality audio content caused a lowering of the ratings for perceived video quality. ITU-T Recommendation P.911 (Subjective audiovisual quality assessment methods for multimedia applications, 1998) specifically states that video quality in audio/video content dominates the overall perceived audiovisual quality. Hands (2004) presented a basic multimedia quality model, which shows that the influence of audio and video quality on the overall quality is multiplicative. However, the weight given to audio or video quality in the multiplication depends on the audio/video content. If the audio/video content mainly consists of talking heads, the audio quality weighs slightly heavier than the video quality. For content with higher levels of motion, video quality appears to be more important (Beerends & de Caluwe, 1999; Hands, 2004). Previous research by Procter, Hartswood, McKinlay & Gallacher (1999) also showed that degradation of video and audio quality (i.e. brief periods of jerky motion, reduced audio clarity and lapses in audio-video synchronization through 30-40% packet loss) affects participants' uptake of emotional content more than of the factual information available in multimedia content. Procter et al. (1999) did not enquire how participants themselves felt, but asked participants to judge the emotional state of people in the shown multimedia content. Participants who had to judge degraded multimedia content felt less confident about their reports of the emotions displayed by the actors in the multimedia content. Participants were also required to answer questions regarding factual content, but here both groups were equally confident about their answers. Hence, Procter et al. (1999) conclude that degradation of multimedia content has more effect on emotional content than on factual content. Therefore, the decision was made to use audio/video content in the study reported in this chapter, but to only create distortions in the video and to maintain good audio quality.
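The multiplicative structure can be made explicit. Below is a hedged sketch of such a model; the functional form follows the multiplicative idea in Hands (2004), but the coefficients are illustrative stand-ins, not his fitted values:

    def audiovisual_quality(mos_audio, mos_video, a=1.0, b=0.15):
        """Toy multiplicative audiovisual quality model: overall quality is
        driven by the product of the audio and video MOS. In Hands (2004)
        the coefficients depend on the content type (talking heads versus
        high-motion material); a and b here are illustrative only."""
        return a + b * mos_audio * mos_video

    # Poor audio drags down the experience of good video, and vice versa:
    print(audiovisual_quality(mos_audio=4.5, mos_video=4.5))  # ~4.04
    print(audiovisual_quality(mos_audio=2.0, mos_video=4.5))  # 2.35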

5.3

Method

5.3.1

Hypotheses

Three main effects are expected. First, it is predicted that the inQ will be able to distinguish people who are more involved in the shown audio/video content from people who are less involved. Second, the scores for the inQ are expected to correlate significantly and positively with ratings for PVQ. Third, for audio/video fragments that rely more on temporal information, fragments with spatial artefacts will receive a higher score on perceived video quality than fragments with temporal artefacts. An interaction effect is also predicted: participants with a high involvement score may react either negatively or positively when rating the perceived video quality of audio/video fragments with artefacts. Participants with a low involvement score are expected to always rate audio/video fragments with artefacts lower than reference fragments without artefacts, as has been found by e.g. Wang et al. (2003) and De Moor et al. (2014).


5.3.2

Participants

One hundred people participated, selected via the JFS participant database and the ISN mailing list. Nobody was colour-blind (tested with the Ishihara test), and visual acuity was 1/1 for 85%, 1/.8 or .8/1 for 12%, and .8/.8 for 3%. Of the subjects, 45 were female and 55 were male, with a mean age of 33.78 (SD = 17, min = 17, max = 82). On average, participants watched 5-10 hours of television per week before they participated; 19 participants indicated that they watched more than 20 hours of television per week. Participants had a median of 2 TV sets in their homes. Dutch was the mother tongue of 90 participants. None of the subjects had seen a previous version of the involvement questionnaire before participating.

5.3.3

Environment and equipment

The experiment was carried out in the Game Experience lab, at the IPO building on the campus of the Eindhoven University of Technology. The fragments were played using a Dell Optiplex 755 PC. This system was connected to the television set with an HDMI to DVI cable. Video was shown on a Philips 42" PFL 9632D full HD-LCD television set. Audio was streamed from the PC to JBL E60 speakers, via a Harman Kardon AVR 245 receiver. The native resolution of the television set was 1920x1080. However, the audio/video fragments were progressive SD quality, so a resolution of 1024x768 was used instead to show the videos. The height of the image was 27 cm, and the distance between the seats and the television was therefore 190 cm, with a horizontal viewing angle of 14-15° (compliant with ITU-R BT.500-13; ITU-R, 2012).

Figure 1. Seating arrangements for the participants (above) and display position (below).

EXPERIMENTAL INVESTIGATION INTO THE RELATION BETWEEN THE CONSTRUCTS OF INVOLVEMENT AND PERCEIVED VIDEO QUALITY

The room was furnished as much as possible as a living room to make the participants feel comfortable and to represent a home situation similar to where participants would normally watch television (see Figure 1).

Table 1. New scenes for this experiment (snapshot stills not reproduced here).

Scene: training_LL
Description: Shots of lechwe (antelopes) in the wild are shown, while a voice-over narrates that during the floods, the lechwe are safe from predators. Shots of the savanna under water are shown, while the narrator continues that the water prevents predators from moving fast enough to catch the antelopes. The fragment ends with a herd of moving antelopes storming over the savanna, with instrumental music playing in the background.

Scene: thriller_1408
Description: A man stands at the window, waving and trying to catch somebody's attention in an apartment building across the road. The person across the road mirrors his movements, and at about 45 seconds into the fragment, the man at the window finds that he is looking at himself. Then he notices some movement behind the other person, and turns to a person with an axe behind him. The man jumps and rolls over the bed, while the person with the axe follows him. He then cowers in the corner, while the person with the axe comes closer…

Scene: documentary_SA
Description: Shots of impoverished houses and people are shown, with a voice-over narrating about the changes in life for white people in South Africa, where some white people claim to be victims of affirmative action or reverse discrimination. At about 30 seconds, a community center canteen is shown where people further discuss the effects of reverse discrimination. The fragment ends with a shot of new buildings being erected, and a comment on black economic empowerment.

5.3.4

Stimuli

Based on results from Chapter 4, the following multimedia fragments were used again: science-fiction_SW, sports_T and feel-good_LA (see Chapter 4, Section 3.1, Table 3). Two new scenes were added: a scary scene (from the movie '1408', thriller_1408) and a documentary (from WorldFocus, documentary_SA). A new documentary scene was used as the training video (from the BBC series with David Attenborough, training_LL). The new scenes are described in Table 1. All scenes were 1 minute long, and for the fragments with induced artefacts, those were introduced during the last 30 seconds. The reference audio/video fragments were all presented at their highest possible bitrate (see Table 2). In a comparison of the bitrates, 2.4 Mbps was the lowest, so to introduce the artefacts, all fragments were MPEG-2 encoded at this base rate. Audio was MP3 coded at 128 kbps. Next, to create temporal artefacts, IFD was used to take out the 8 B-frames of the GOP(12). It was then determined how much this reduced the bitrate per scene, by looking at the average size of the B-frames in the video fragment with Elecard StreamEye (2010). The percentage of reduction was then used for SNR scalability, to realize approximately the same bitrate in Mbps. The stimulus set thus consisted of 5 reference fragments, 5 fragments modified with SNR scalability and 5 fragments modified with IFD.
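The bitrate matching between the two methods can be illustrated as follows; the frame sizes below are made up, whereas the real averages were read from Elecard StreamEye:

    BASE_RATE = 2.4  # Mbps; the common MPEG-2 encoding rate for all fragments

    # Hypothetical average encoded frame sizes (kilobits) for one scene.
    avg_size = {"I": 400.0, "P": 180.0, "B": 90.0}
    gop = "IBBPBBPBBPBB"  # GOP(12): 1 I-, 3 P- and 8 B-frames

    gop_bits = sum(avg_size[frame] for frame in gop)
    b_share = 8 * avg_size["B"] / gop_bits  # fraction of bits in B-frames

    # IFD drops all B-frames, removing b_share of the bits; the transcoder
    # is therefore set to approximately the same effective rate.
    snr_target = BASE_RATE * (1 - b_share)
    print(f"B-frame share: {b_share:.0%}, SNR target: {snr_target:.2f} Mbps")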

Table 2. Bitrate of the audio/video fragments.

scene                 original Mbps    Mbps during the last 30 seconds for modified fragments
                                       SNR scalability    IFD
training              8.85             0.96               2.40
sports_T              4.52             1.11               2.40
science-fiction_SW    6.52             1.24               2.40
feel-good_LA          5.55             1.04               2.40
thriller_1408         2.43             1.09               2.40
documentary_SA        8.19             1.17               2.40

5.3.5

Tests

In the experiment two scales were used: the inQ (as detailed in Chapter 4) and the Five-Grade Quality Scale (ITU-R, 2002, 2012). All participants first rated the fragments with the inQ, and next rated the fragments twice on the Five-Grade Quality Scale (FGQS). For measuring perceived video quality, the single stimulus with multiple repetitions method (SSMR) was used and fragments were rated on a categorical quality scale labelled with the adjectives "excellent", "good", "fair", "poor" and "bad" (see Figure 2) (ITU-R BT.500-13, 2012).

Figure 2: Overview of the ITU-R BT.500-13 (2012) five-grade quality scale.

Figure 3: Presentation of fragments during the perceived video quality rating (fragment number shown for 3 s, followed by the 60 s audio/video fragment, of which the last 30 s could contain temporal or spatial artefacts, and 7 s of rating time).

The SSMR was adapted such that after participants saw the fragments for the first time, they filled out the inQ. The number of the fragment was indicated for 3 seconds, and participants received as much time as necessary afterwards to fill out the inQ (on average between 1 and 2 minutes). While rating perceived video quality (during the 2nd and 3rd presentation), the number of the fragment was shown for 3 seconds, and after the fragment, 7 seconds were provided for participants to state their opinion on the quality scale (see Figure 3). For this part of the experiment, the audio/video fragments were presented twice. In total, participants saw each audio/video fragment three times. Each test session used three different randomized orders of the fifteen multimedia fragments. For the 100 participants, 30 different random orders were used. A demographic questionnaire was used to collect information about the participants, such as their age, gender, nationality, mother tongue, highest finished education, current occupation, number of televisions, and number of hours spent watching multimedia in the last week.

5.3.6

Procedure

Participants were welcomed to the lab, and checked for visual acuity and colour-blindness. If people had a visual acuity lower than .8/.8 or were colour-blind, they were thanked for coming and told that they could unfortunately not participate, given the constraints of the experiment. After the visual measurement, participants were given a booklet which started with demographic questions. After filling out the questions, they were asked to read the instructions for the first part of the experiment, which showed what the rating scale for the inQ looked like and how to interpret the scale. Next, they received a training session in which three multimedia fragments were shown in order to become familiar with the inQ. Any questions during the training session concerning the words used in the inQ were answered by the experimenter. After the training session, participants were shown the 15 multimedia fragments, and were asked to fill out the inQ after each fragment². Participants got a break after filling out the inQ for all fragments. After about five minutes maximum, participants were asked to sit down again, and to read the instructions for the next part. The instructions detailed that they were now supposed to rate the fragments for perceived video quality, and gave an explanation of the quality scale. After reading and indicating that the instructions were clear, the training session for the perceived video quality was started. Participants were shown the same training fragments as for the inQ, but now rated perceived video quality. The experimenter observed during the training session whether participants indicated on the scale that they perceived quality differences. If no more questions arose, participants were left alone and shown the same 15 multimedia fragments, randomized, and rated each fragment on perceived video quality. After this, another small break was given, during which participants could take another drink or snack if requested. After the break, participants again saw the 15 multimedia fragments, randomized, and rated them on perceived video quality. At the end, participants were debriefed and received 15 euros for their participation.

² While watching audio/video content, people may re-watch content from time to time. However, this is not necessarily the case. To create a situation where repetitions would least influence inQ scores, the decision was made to gather the inQ scores before the perceived video quality scores.

5.4

Results

Eight participants took more than one hour to fill out the inQ. These participants were not invited to participate in the second part of the experiment, because the experimenter judged that this would have been too fatiguing.

5.4.1

Confirmatory factor analysis for the inQ

DATA SCREENING

One participant accidentally did not fill out the inQ for the last fragment he saw. These questions were coded as missing values. Further analysis showed that there were twenty-five missing values, which appeared to be missing completely at random (MCAR; Tabachnick & Fidell, 2006). Hence, the same strategy as in Chapter 4, Section 4.3.2 was followed, and missing values were replaced through the expectation maximization method. Furthermore, all questions with a negative key were turned positive (3, 22, 25, 32 and 38). Normality was inspected via histograms, created per item per fragment to allow inspection of the shape of the distribution. Skewness and kurtosis were also inspected, which showed that especially item 13 ('I could easily understand what was going on.') was consistently skewed towards agree and highly agree. 176 items had a skewness or kurtosis value greater than 1 or smaller than -1 (as calculated by SPSS), which is 47% of the total number of items. To be able to treat the data as interval rather than categorical and to diminish the skewness and kurtosis, a logarithmic transformation with base 10 was used, as this reverted the data to normality, which would hopefully make interpretation and correlation with other measures easier (Tabachnick & Fidell, 2006). Factorability was determined through the Kaiser-Meyer-Olkin (KMO) measure. Although the KMO was consistently above .6, the average communalities (MacCallum, Widaman, Zhang, & Hong, 1999) were actually lower than in the previous factor analysis. This


was due to the fact that the communalities for item 13 were consistently below .2. Item 13 was therefore excluded from the CFA (Tabachnick & Fidell, 2006). As Table 3 shows, average communalities without item 13 are well above .5 and although the standard deviations were larger than was expected, the three hypothesized factors are still overdetermined. Therefore, the decision was made to continue with the current dataset.
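The screening steps described above (reverse-keying the negative items, flagging skewed items, and the base-10 log transform) can be sketched as follows; the data frame and its contents are simulated stand-ins for the real inQ responses:

    import numpy as np
    import pandas as pd
    from scipy import stats

    # Simulated stand-in for the inQ responses: one row per participant,
    # one column per item, coded 1..7 on the agreement scale.
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.integers(1, 8, size=(100, 38)),
                      columns=[f"q{i}" for i in range(1, 39)])

    # Reverse-key the negatively worded items (max + min - x flips 1..7),
    # so that a higher score always indicates more involvement.
    for item in ["q3", "q22", "q25", "q32", "q38"]:
        df[item] = 8 - df[item]

    # Flag items whose skewness or kurtosis falls outside [-1, 1].
    skew, kurt = df.apply(stats.skew), df.apply(stats.kurtosis)
    flagged = df.columns[(skew.abs() > 1) | (kurt.abs() > 1)]

    # Base-10 log transform to pull skewed items back towards normality.
    df_log = np.log10(df)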

Table 3. The factorability scores reached through KMO, and the average of the final communalities (AFC) with their standard deviation (SD) for the 15 fragments, with the corresponding number of participants (N).

fragment                         N     KMO    AFC (SD)
sports_T, reference              100   .891   .581 (.143)
sports_T, TC                     100   .906   .580 (.159)
sports_T, IFD                    100   .904   .567 (.170)
science-fiction_SW, reference    100   .914   .598 (.163)
science-fiction_SW, TC           100   .885   .543 (.193)
science-fiction_SW, IFD          100   .906   .561 (.167)
feel-good_LA, reference           99   .919   .550 (.164)
feel-good_LA, TC                 100   .897   .577 (.179)
feel-good_LA, IFD                100   .913   .615 (.149)
thriller_1408, reference         100   .912   .566 (.170)
thriller_1408, TC                100   .885   .563 (.156)
thriller_1408, IFD               100   .896   .581 (.190)
documentary_SA, reference        100   .906   .549 (.157)
documentary_SA, TC               100   .881   .496 (.185)
documentary_SA, IFD              100   .902   .552 (.182)

CONFIRMATORY FACTOR ANALYSIS

To execute a CFA, Mplus was used. The 5-factor solution from Chapter 4 (Section 4.3.2, Table 9) was tried first, but did not converge. Therefore, again an exploratory factor analysis strategy was followed. For this, the whole dataset was used, and the participants were coded as a within-subject factor. The robust maximum likelihood (MLR) estimator was used, which estimates standard errors and a chi-square statistic that are robust to non-normality and non-independence of observations (Muthén & Muthén, 2010). Again, analogous to Chapter 4, Section 4.3.2, the analysis showed that the factors were related, and hence an oblique rotation method (geomin) was used. In step 1, all questions which did not load on any factor or loaded on all factors were taken out. These were q26, q10, q13 and q34. In step 2, all questions which loaded on more than one factor were taken out: q31, q18, q36, q2 and q3. This resulted


in a 3-factor solution, which was then tested with a CFA. Mplus also provides modification indices indicating what would change in a model to provide a better fit with the observed data. These showed that q15 should also be removed, since it loaded on two factors. The results of the CFA are detailed in Table 4. As can be seen, factor 1 combines nine items from the attributes captivation and disinterest, with one item from informative interest. Factor 1 captures both positive and negative elements of involvement, and could also be construed as more externalized than internalized. Factor 2 consists of three items, which all come from the attribute relatedness. Factor 3 consists of three items as well, but these items all come from the attribute captivation. However, all three items could be seen as internal reactions, since comments from participants indicated that they did not interpret 'sitting on the edge of my seat' literally. Cronbach's α for the factors is .944, .798 and .808 for factors 1 to 3 respectively.

Table 4. Last version of the inQ.

Nr   Question                                                                      Factor   Loading   Attribute
7    I felt disappointed when the video finished.                                  1        .807      captivation
12   The video entertained me.                                                     1        .836      captivation
16   I am curious about what happens next.                                         1        .754      captivation
17   I felt amused.                                                                1        .747      captivation
22   I would have turned the tv off if I could.                                    1        .828      disinterest
25   I was happy when the video finished.                                          1        .854      disinterest
32   The content of this video does not interest me in any way.                    1        .778      informative interest
35   I would have liked it if the video continued.                                 1        .910      captivation
38   I felt like switching channels, to see if there was something better on tv.   1        .862      disinterest
21   I sympathized with what was going on.                                         2        .870      relatedness
23   I could relate to the situation.                                              2        .595      relatedness
30   I felt compassion for what was going on.                                      2        .818      relatedness
6    This video took me away to another world.                                     3        .706      captivation
9    This video sequence had me on the edge of my seat.                            3        .744      captivation
14   I was totally absorbed.                                                       3        .859      captivation
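The Cronbach's α values reported above follow directly from the item variances; a small self-contained helper (the toy data are only there to make the snippet runnable):

    import pandas as pd

    def cronbach_alpha(items: pd.DataFrame) -> float:
        """Cronbach's alpha for a set of items (rows = respondents):
        alpha = k/(k-1) * (1 - sum(item variances) / variance(item sum))."""
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1).sum()
        total_var = items.sum(axis=1).var(ddof=1)
        return k / (k - 1) * (1 - item_vars / total_var)

    # Toy demonstration; with the real data one would pass, for instance,
    # the nine factor-1 items df[["q7", "q12", "q16", ..., "q38"]].
    demo = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 2, 4, 5], "c": [1, 3, 3, 4]})
    print(round(cronbach_alpha(demo), 3))  # 0.95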

To investigate whether there were different effects for the scenes or adaptation methods, a 5x3 (scene x adaptation method) repeated measures ANOVA was carried out on the sum of the items. For factor 1, participants could have a minimum score of 9 and a maximum score of 63. For factors 2 and 3, participants could have a minimum score of 3 and a maximum score of 21. Repeated measures analyses were carried out separately for each factor, using SPSS GLM.
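The same 5x3 design can also be analysed outside SPSS; a sketch with statsmodels' repeated-measures ANOVA on simulated data (all names and numbers below are placeholders for the real factor sums):

    import numpy as np
    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    # Simulated long-format data: one row per participant x scene x method.
    rng = np.random.default_rng(1)
    scenes = ["sports_T", "science-fiction_SW", "feel-good_LA",
              "thriller_1408", "documentary_SA"]
    methods = ["reference", "TC", "IFD"]
    rows = [(p, s, m, rng.integers(9, 64))  # factor 1 sums range from 9 to 63
            for p in range(100) for s in scenes for m in methods]
    df_long = pd.DataFrame(rows, columns=["participant", "scene",
                                          "adaptation", "factor1_sum"])

    # 5 x 3 repeated-measures ANOVA, analogous to the SPSS GLM analysis.
    res = AnovaRM(df_long, depvar="factor1_sum", subject="participant",
                  within=["scene", "adaptation"]).fit()
    print(res)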


For factor 1 there were main effects of both scene, F(4, 95) = 9.757, p < .001, and adaptation method, F(2, 97) = 18.806, p < .001. Follow-up pairwise comparisons for scene showed that the mean scores for sports_T and documentary_SA were significantly lower than those for science-fiction_SW, feel-good_LA or thriller_1408 (a in Figure 4). Furthermore, the factor 1 mean score of sports_T was also significantly lower than that of documentary_SA (b in Figure 4). Pairwise comparisons for the adaptation method showed that the reference was scored significantly higher than TC (** in Figure 4), and that TC was scored significantly higher than IFD (* in Figure 4). For factor 2 there was only a main effect of scene, F(4, 95) = 27.570, p < .001. Pairwise comparisons showed that the mean scores for sports_T, science-fiction_SW and thriller_1408 were significantly lower than those for either feel-good_LA or documentary_SA (a in Figure 5). Furthermore, the factor 2 mean score of documentary_SA was also significantly lower than that of feel-good_LA (b in Figure 5). For factor 3 there were main effects of both scene, F(4, 95) = 23.029, p < .001, and adaptation method, F(2, 97) = 13.887, p < .001. Follow-up pairwise comparisons for scene showed that the mean score for sports_T was significantly lower than those of feel-good_LA, science-fiction_SW and thriller_1408. The mean scores for documentary_SA and feel-good_LA were also significantly lower than those for either science-fiction_SW or thriller_1408.

Figure 4. Average scores for the mean participant scores of the scenes, with the adaptation methods as parameter, for Factor 1. a: significant difference between sports_T and documentary_SA versus science-fiction_SW, feel-good_LA and thriller_1408. b: significant difference between sports_T and documentary_SA. **: significant difference between the reference and TC / IFD. *: significant difference between TC and IFD.


Figure 5. Average scores for the mean participant scores of the scenes, with the adaptation methods as parameter, for Factor 2. a: significant difference between feel-good_LA and documentary_SA versus sports_T, science-fiction_SW, and thriller_1408. b: significant difference between feel-good_LA and documentary_SA.

Additionally, the factor 3 mean score of science-fiction_SW was also significantly lower than that of thriller_1408 (see Figure 6). Pairwise comparisons for the adaptation method showed that the reference was scored significantly higher than TC, and that TC was scored significantly higher than IFD. As the results show, there is a very different pattern for the scenes per factor. For example, feel-good_LA has a high mean score for factor 2, but a low mean score for factor 3. Thriller_1408 showed the opposite pattern, with a high mean score for factor 3 and a low mean score for factor 2. Science-fiction_SW showed a similar pattern to thriller_1408, but with a lower mean score on factor 3. Sports_T has, on average, the lowest mean scores for all factors, which could indicate that participants were less inclined to become involved with the material. Documentary_SA also has lower mean scores for factors 1 and 3, but showed a higher mean score for factor 2, indicating that participants could empathize with the situation as expressed in the material. Factors 1 and 3 also showed significant differences between the adaptation methods. However, for factor 2 the adaptation methods do not appear to make a difference.


Figure 6. Average scores for the mean participant scores of the scenes, with the adaptation methods as parameter, for Factor 3. a: significant difference between sports_T and science-fiction_SW, feel-good_LA, and thriller_1408. b: significant difference between documentary_SA and feel-good_LA versus science-fiction_SW and thriller_1408. c: significant difference between science-fiction_SW and thriller_1408. **: significant difference between the reference and TC / IFD. *: significant difference between TC and IFD.

5.4.2

Perceived video quality

To be able to gain further insights into the relation between involvement and perceived video quality, it is necessary to first analyse the data obtained to measure perceived video quality. As mentioned in Section 5.3.6, participants were required to judge the stimuli twice. This allowed checking participants' consistency. First, non-parametric correlations between judgements 1 and 2 were checked for each participant. Participants who did not complete both judgements were counted as missing data, and participants who had less than 1/1 visual acuity were also not taken into account, as it cannot then be assumed that they were able to perceive the stimuli similarly. This left 81 participants. It was decided to take only data from participants who had a Spearman's rho correlation of .6 or higher (N = 15, p < .01) between both judgements. In the end, data from 69 participants were usable. Data from the Five-Grade Quality Scale (ITU-R, 2012) are ordinal in nature, and as such can be analysed with Thurstone's law of categorical judgement (Torgerson, 1958). Hence, ThurCatD (Boschman, 2000) was used to analyse the perceived video quality data. The ThurCatD model allowed checking whether there were still significant differences between the presentations. Furthermore, it also allows a comparison across the fragments (see Figure 7).
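ThurCatD implements Thurstone's law of categorical judgement including dispersion and standard error estimates; the probit scaling at its core can be sketched as below. This is a simplified least-squares solution under unit dispersions with toy counts, not the actual ThurCatD algorithm:

    import numpy as np
    from scipy.stats import norm

    # counts[i, c]: how often stimulus i received category c
    # (ordered bad .. excellent). Toy numbers; the real input would be the
    # pooled FGQS judgements of the 69 consistent participants.
    counts = np.array([
        [ 2, 10, 30, 40, 18],   # e.g. a reference fragment
        [20, 35, 30, 10,  5],   # e.g. an IFD fragment
        [10, 25, 40, 20,  5],   # e.g. a TC fragment
    ])

    # Cumulative proportions P(response <= boundary g), last column dropped
    # (it is always 1); clipped so the probit stays finite.
    p = counts.cumsum(axis=1) / counts.sum(axis=1, keepdims=True)
    z = norm.ppf(np.clip(p[:, :-1], 0.005, 0.995))

    # Least-squares solution of z[i, g] = t[g] - s[i] with mean(s) = 0:
    t = z.mean(axis=0)               # category boundary estimates
    s = t.mean() - z.mean(axis=1)    # stimulus scale values

    print("boundaries:", t.round(2))
    print("scale values:", s.round(2))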


Figure 7: Estimated scale values of the 69 participants. The category boundaries ("bad" to "excellent") are indicated with the lines. a: significant difference between sports_T and science-fiction_SW, feel-good_LA, and thriller_1408. **: significant difference between the reference and TC / IFD. *: significant difference between TC and IFD.

Using the estimated standard errors and correlation matrix, significance can be assessed with confidence interval testing as described by Finch & Cumming (2009). Because the estimated scale values for the two presentations did not differ significantly from each other, the data per fragment were pooled and used together for further analysis. As can be seen in Table 5, the reference was always rated higher than either TC or IFD. Furthermore, the estimated boundaries also indicate that all reference fragments were rated excellent or good, whereas most modified fragments were rated fair or poor. There was no significant overall difference between TC and IFD, but it is possible that this was caused by feel-good_LA, where IFD was scored higher than TC. Post-hoc analysis showed that scores for TC were significantly higher than those for IFD for sports_T, science-fiction_SW, thriller_1408 and documentary_SA (t(68) = 5.434, p < .001; t(68) = 4.393, p < .001; t(68) = 7.312, p < .001; and t(68) = 3.151, p < .001, respectively). For feel-good_LA, post-hoc analysis showed that scores for IFD were significantly higher than those for TC, t(68) = -3.596, p < .001. Additionally, the perceived video quality of sports_T was scored significantly lower than that of science-fiction_SW, feel-good_LA and thriller_1408 (t(68) = -3.007, p < .01; t(68) = -3.996, p < .001; and t(68) = -3.627, p < .01, respectively). The model fit was tested with Mosteller's chi-square, which assesses whether the scale values from the model predict the observed data well (Boschman, 2000). The overall model fit was deemed acceptable, Mosteller chi-square = 108.5763, df = 87, upper tail p-value = 0.0586. However, further investigation showed that the model fit for science-fiction_SW and thriller_1408 was not acceptable. To that end, subject correlation matrices were created for both scenes separately, as we hypothesized that there were opposing groups who had different preferences regarding IFD and TC.


Table 5. Results of the paired-samples t-tests for main effects of scene and adaptation method. Data were pooled from the results from ThurCatD. N = 69, df = 68.

comparison         t
reference - TC     7.714**
reference - IFD    8.813**
TC - IFD           1.165

** p < .001

For science-fiction_SW, two groups were found, one with 33 and one with 22 subjects, which leaves 14 subjects who did not correlate, had missing values or were not easily placed in either group. Analysing the results for the two groups with ThurCatD showed a good model fit, Mosteller chi-square = 32.3209, df = 33, upper tail p-value = 0.5007. Since there were no significant differences between the presentations, the data were pooled and again analysed with ThurCatD. The model fit was good, Mosteller chi-square = 16.5337, df = 15, upper tail p-value = 0.3475. As Figure 8 shows, group 1 rated TC significantly higher than IFD (t(32) = 4.200, p < .001), while group 2 rated IFD significantly higher than TC (t(21) = -2.228, p < .05). Furthermore, group 1 rated TC significantly higher than group 2 (t(44) = 5.607, p < .001). For IFD, this was reversed, since group 1 rated IFD significantly lower than group 2 (t(51) = -6.863, p < .001). For thriller_1408, two groups were found, one with 42 and one with 6 subjects, which leaves 21 subjects who did not correlate with each other, had missing values or were not easily placed in either group.

Figure 8: Scale values as estimated by ThurCatD for the 2 groups for science-fiction_SW. Group 1 includes 33 participants, group 2 has 22 participants. The category boundaries are indicated with the lines. *: significant difference between TC and IFD. a/b: significant difference between group 1 and 2 for TC and IFD respectively.


Analysing the results for the two groups with ThurCatD showed a good model fit, Mosteller chi-square = 31.7148, df = 33, upper tail p-value = 0.5310. Since there were no significant differences between the presentations, the data were pooled and again analysed with ThurCatD. The model fit was acceptable, Mosteller chi-square = 20.9168, df = 15, upper tail p-value = 0.1395. As Figure 9 shows, group 1 rated TC significantly higher than IFD (t(41) = 3.295, p < .01). Group 2 rated IFD higher than TC, but this was not a significant difference (t(5) = -1.639, p > .05). Furthermore, there was a trend towards a significant difference between group 1 and group 2 for TC, where group 1 rated TC higher than group 2 (t(7) = 2.245, p < .1). For IFD, this was reversed, since group 1 rated IFD significantly lower than group 2 (t(8) = -6.362, p < .001).
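The thesis does not spell out which t-test variant underlies the between-group comparisons, but the strongly reduced degrees of freedom (e.g. t(8) for groups of 42 and 6 participants) are consistent with Welch's unequal-variances correction. A sketch on invented data:

    import numpy as np
    from scipy import stats

    # Invented per-participant scale values for the two preference groups.
    rng = np.random.default_rng(2)
    group1_ifd = rng.normal(-0.5, 0.4, 42)  # group 1: lower IFD ratings
    group2_ifd = rng.normal(0.8, 0.3, 6)    # group 2: higher IFD ratings

    # Welch's t-test drops the equal-variance assumption; with unbalanced
    # groups its degrees of freedom shrink towards the smaller group.
    t, p = stats.ttest_ind(group1_ifd, group2_ifd, equal_var=False)
    print(f"t = {t:.3f}, p = {p:.4f}")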

Figure 9: Estimated scale values of the 2 groups for thriller_1408. Group 1 includes 42 participants, group 2 has 6 participants. The category boundaries are indicated with the lines. *: significant difference between TC and IFD. b: significant difference between group 1 and 2 for IFD.

5.4.3

Testing discriminant validity of perceived video quality and involvement

The main hypothesis of this chapter is that there is a relation between involvement and perceived video quality. To examine this relation, the correlations between the sums of the inQ factor scores and the perceived video quality scale values were investigated. If the same construct is measured by the factors and the scale values, they should correlate above .6. The null hypothesis is that they are not measuring the same construct, but are different parts of a larger construct. Hence, the expectation under the null hypothesis is a correlation below .4. Only the 69 consistent participants from the PVQ analysis were used for this part of the analysis. As Table 7 shows, there is a relatively low correlation between the factors from the inQ, and only factor 1 from the inQ correlates significantly with PVQ. While these correlations are above .4, they still explain no more than 26% of the variance. Hence, for the purposes of this analysis, the assumption is made that the inQ and the Five-Grade Scale measure different QoE aspects. Further analysis with