Usability evaluation methods: Mind the gaps

Estelle de Kock
School of Computing, University of South Africa, Pretoria, South Africa
[email protected]

Judy van Biljon
School of Computing, University of South Africa, Pretoria, South Africa
[email protected]

Marco Pretorius
School of Computing, University of South Africa, Pretoria, South Africa
[email protected]

ABSTRACT
The strengths and weaknesses of heuristic evaluation have been well researched. Despite known weaknesses, heuristic evaluation is still widely used since formal usability testing (also referred to as empirical user testing) is more costly and time-consuming. What has received less attention is the type of information heuristic evaluation conveys in comparison with empirical user testing supported by eye tracking and user observation. If usability methods are combined, it becomes even more important to distinguish the information contributed by each method. This paper investigates the application of two usability evaluation methods, namely heuristic evaluation and empirical user testing supported by eye tracking, to the website of a learning management system with the intent of discovering the difference in the usability information yielded. Heuristic evaluation, as an inspection method, is accepted to be fundamentally different from empirical user testing. This paper contributes to a deeper understanding of the nature of the differences by identifying the kinds of usability problems identified through each method. The findings should be of interest to researchers, designers and usability practitioners involved in website design and evaluation.

Categories and Subject Descriptors
H.1.2 [User/Machine Systems]: Human factors, Human information processing, Software psychology; J.4 [Social and Behavioural Sciences]: Economics, Psychology, Sociology.

General Terms
Human Factors, Design, Experimentation.

Keywords
Heuristic evaluation, user testing, eye-tracking, usability

1. INTRODUCTION
Usability evaluation ranges from formal, quantitative experiments with large sample sizes and complex test designs to informal qualitative studies with a small number of participants [18, 30]. Empirical usability evaluation captures the performance of users doing prescribed tasks and then measures the construct of usability in terms of objective measures, such as time and errors, and subjective measures, such as user satisfaction [25]. Heuristic evaluation (HE) denotes the situation where one or more experts evaluate a system's design for adherence to a set of usability guidelines [20]. The method has documented weaknesses, such as the reliability of the effectiveness measure [17, 24]. However, despite these known weaknesses, heuristic evaluation is widely used since it generally requires fewer resources than testing with users. The other method used in this study is empirical user testing. Eye tracking is used as a tool to supplement the user testing. Eye tracking usability data is based on the fact that eye-movement analysis can contribute to our understanding of cognitive processes [4]. Eye tracking is known to be time and resource intensive, and its objectivity has been questioned since the analysis of patterns is mostly based on the opinion and interpretations of the individual evaluator [7]. However, eye tracking provides an additional data source that can add to the reliability of the usability testing if triangulated with the user testing data and post-test questionnaire data.

In this paper we compare usability data obtained from the same interactive website through two different usability evaluation methods (UEMs), namely HE and user testing supported by eye tracking (UTE). To avoid bias, the two studies were done independently by different usability evaluators and the results were compared by a third researcher. To provide a reasonable basis for comparison, we review the basic components of usability evaluation and propose a basic set of criteria for usability evaluation used in this study. We then compare the findings from the HE and UTE evaluations respectively against these criteria to deduce the differences between the usability information obtained. The scope of the paper is limited to the comparison of the findings from two different usability evaluation methods on one interactive website. The remainder of section 1 expands on the purpose and motivation of this paper in section 1.1 and the organisation in section 1.2.

1.1 Purpose and Motivation
Usability is a rich and complex construct that should ideally be measured with more than one UEM [9, 10]. Given time and cost constraints, only one evaluation method is used in many cases. Selecting one method necessitates comparison of usability methods and consideration of the criteria for comparison. Much research has been done on comparing usability evaluation methods. Desurvire [5] and Desurvire et al. [6] investigated the effectiveness of usability inspection methods in comparison with empirical testing and the issue of heuristic evaluation versus usability testing. Karat [14] compared empirical testing with inspection methods and found that HE was reliable and predictive of laboratory testing methods. However, given the time that has elapsed since these studies were done and the continual development in technologies and methodologies, we believe that it is useful to revisit the issue of comparing UEMs, specifically for identifying fundamental differences in the knowledge yielded by the two methods. Law and Hvannberg [17] propose that UEMs be evaluated in terms of the validity and thoroughness of the usability problems identified. This highlights the goal of HE, namely to identify usability problems. However, when we evaluate the usability of a website we also need to consider the ISO 9241-11 components of effectiveness, efficiency and user satisfaction [1] as measured in user testing. Acknowledging that HE is fundamentally different from UTE, our research question pertains to the difference between the usability information obtained from HE and UTE respectively.

1.2 Organisation of this Paper
Section 2 discusses literature on the UEMs of HE and UTE respectively and provides the background for comparing and appreciating the difference between the findings obtained from the two methods. Section 3 describes the heuristic evaluation performed in this study, while section 4 describes the user testing and eye tracking performed on the same website. Section 5 presents and compares the results of the two studies. Section 6 discusses the findings and section 7 reflects on the contribution of this paper and concludes with suggestions for future work.

2. USABILITY EVALUATION THEORY
Nielsen and Molich [22] enumerate four basic user interface evaluation methods: formally, by some analysis technique; automatically, by a computerised procedure; empirically, by testing users performing experiments; and heuristically. Heuristic evaluation is a UEM where evaluators inspect a user interface against a set of guidelines to identify usability problems, that is, violations of any item in the guidelines [18]. Jeffries and Desurvire [12] recommend that usability inspection methods such as heuristic evaluation and cognitive walkthroughs be used to augment usability testing, by applying the inspection method early in the design, or as discount methods when resources such as time, money and trained evaluators are scarce. This section reviews literature on usability evaluation methods, focusing on HE and UTE as the two methods under consideration. Eye tracking is generally performed in conjunction with user testing. Section 2.1 discusses HE as a usability evaluation method, highlighting some strengths and weaknesses, while sections 2.2 and 2.3 do the same for user testing and eye tracking. Section 2.4 looks at research towards standardising the construct of usability in a single score.

2.1 Heuristic Evaluation as method
Heuristic evaluation originated from the need to find a simpler, less resource-intensive alternative to usability testing [17, 21]. Among the various heuristic evaluation approaches available, Wheeler Atkinson et al. [31] selected four, namely Nielsen's ten usability heuristics [20], Shneiderman's Eight Golden Rules of Interface Design [26], Tognazzini's First Principles of Interaction Design [29] and a set of principles based on Edward Tufte's visual display work, as the basis for their Multiple Heuristic Evaluation Table (MHET) [31]. As noted, usability is measured in terms of effectiveness, efficiency and user satisfaction [1], but guidelines developed for heuristic evaluation provide useful explanations of why users will probably not be effective, efficient or satisfied. Ssemugabi and De Villiers [27] propose a comprehensive set of evaluation criteria for Web-based learning that apply to learning management systems as well. Considering their criteria together with the MHET, we selected a subset of these criteria as the basis for the heuristic evaluation of the learning management system. Our study focuses on a Web service for assignment submission and therefore not all the criteria for educational Websites are applicable. The selected heuristics are depicted in Table 1.

Table 1: Guidelines for heuristic evaluation

| Heuristic | Explanation | Additional References |
| Software-User Interaction | Informs user of system status and task completions. | [20] |
| Learnability | Supports timely and efficient learning of software features. | [29] |
| Cognition Facilitation | Supports the cognitive limitations of the user. | [21, 27, 29] |
| User Control and Software Flexibility | Respond to user action and adaptivity. | [29] |
| System-Real World Match | Match users' expectations, familiarisation, and fit with the intended user group. | [20, 31] |
| Graphic Design | Graphical elements, colours, aesthetics. | [29] |
| Navigation and Exiting | Facilitate software exploration and provide outlets to terminate actions. | [20, 26, 27, 31] |
| Consistency | Provide standard and reliable terminology, actions and layouts. | [26] |
| Defaults | Provides guidance related to the use of default information. | [29] |
| Help and Documentation | Providing users with help files and documentation. | [21, 27, 29] |
| Error Management | Prevents, identifies, diagnoses and offers corrective solutions. | [13, 18, 31] |

HE has received criticism on the reliability of the effectiveness measure [11], the large influence of rater experience [17], the subjective interpretation of the results and the lack of theoretical underpinning [16]. Despite the documented limitations of HE, it is still a popular method since evaluators often lack the time, resources or expertise to implement formal methods, automatic testing or empirical user testing (usability testing).

2.2 User testing as method
According to Jakob Nielsen [19], usability is a quality attribute that assesses how easy user interfaces are to use. ISO 9241 defines usability as the effectiveness, efficiency and satisfaction with which specified users achieve specified goals in particular environments [1]. Usability testing involves measuring the performance of users on tasks with regard to the ease of use, the task completion time, and the user's perception of the experience of the software application [23]. Formal usability testing is an empirical method that requires the design of a formal usability experiment that is carried out under controlled conditions. Usability testing can be conducted within a usability laboratory or by means of field observations. The usability evaluation for this research was conducted by means of a formal usability evaluation in the usability laboratory of the School of Computing at the University of South Africa.

2.3 Eye tracking
Eye tracking is based on the fact that a record of a person's eye movements while doing a task provides information about the nature, sequence and timing of the cognitive operations that take place [3]. Eye tracking studies have been used in diagnosing the effectiveness of Website designs, with point-of-interest detection (fixation) and information transmission via eye movement (scan path) as two main indicators [2]. Based on this relation between cognition and eye behaviour, the trace of navigation pathways and user attention patterns is used to study the cognitive processes involved in reading [3], picture perception [2], visual search [4], problem solving [16], face perception [15] and many other tasks. Eye tracking data is mostly used in conjunction with user testing and video recording. Therefore the eye tracking visualisations (usually gaze plots and heat maps) can be triangulated with the user testing and the evaluator observations from the recordings, as described in the methodology for usability and eye tracking by Pretorius and Calitz [24].
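To make the fixation and dwell-time measures concrete, the sketch below shows how fixation records exported from an eye tracker could be aggregated per area of interest (AOI), for example the instruction block on a page. This is a minimal illustration only: the record layout, AOI coordinates and sample values are assumptions made for the example, not the Tobii export format or the data used in this study.

```python
# Minimal sketch: aggregate exported fixation records per area of interest (AOI).
# The record layout and the AOI rectangle are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Fixation:
    x: int            # horizontal screen position in pixels
    y: int            # vertical screen position in pixels
    duration_ms: int  # fixation duration in milliseconds

# Hypothetical AOI: the instruction block on an assignment submission page.
AOIS = {"instructions": (40, 120, 980, 260)}  # (left, top, right, bottom)

def aoi_summary(fixations, aois):
    """Count fixations and total dwell time inside each AOI."""
    summary = {name: {"fixations": 0, "dwell_ms": 0} for name in aois}
    for f in fixations:
        for name, (left, top, right, bottom) in aois.items():
            if left <= f.x <= right and top <= f.y <= bottom:
                summary[name]["fixations"] += 1
                summary[name]["dwell_ms"] += f.duration_ms
    return summary

if __name__ == "__main__":
    recorded = [Fixation(100, 150, 220), Fixation(500, 600, 310), Fixation(900, 200, 180)]
    print(aoi_summary(recorded, AOIS))  # {'instructions': {'fixations': 2, 'dwell_ms': 400}}
```

Aggregations of this kind underlie the heat maps (total dwell per region) and gaze plots (ordered fixations) referred to in the results sections below.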

2.4 Usability components
The usability construct has been expressed in terms of the effectiveness, efficiency and satisfaction of the users who performed specified tasks [17]. The objective measures, namely effectiveness and efficiency, are measured by conducting tests that record the time, number of errors and completion rate on specified tasks. Subjective measures such as user satisfaction are often measured with post-test questionnaires. Sauro [25] proposes the standardization of usability metrics into a single score that he formulates as:

Usability = Average satisfaction + Efficiency (time) + Effectiveness (number of errors and completion rate)

Arguably this may be seen as simplistic, but we propose it as a starting point in striving towards a common understanding of the concept of usability, as needed to compare different usability methods.
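As a rough illustration of how such a single score could be computed, the sketch below standardises each component against a target value and averages the results, broadly in the spirit of Sauro and Kindlund's single-score approach [25]. The target values, spreads and example inputs are illustrative assumptions, not the published procedure or the data from this study.

```python
# Rough sketch: combine satisfaction, efficiency and effectiveness into one
# standardised score. Targets and spreads are illustrative assumptions only.
from statistics import NormalDist

def z(value, target, spread, higher_is_better=True):
    """Standardise a raw metric against a target value and a typical spread."""
    score = (value - target) / spread
    return score if higher_is_better else -score

def single_usability_score(satisfaction, task_time_s, completion_rate, errors_per_task):
    components = [
        z(satisfaction, target=4.0, spread=0.7),                         # mean of 5-point Likert items
        z(task_time_s, target=60.0, spread=25.0, higher_is_better=False),
        z(completion_rate, target=0.78, spread=0.12),
        z(errors_per_task, target=1.0, spread=1.0, higher_is_better=False),
    ]
    mean_z = sum(components) / len(components)
    # Report as a 0-100 percentile so the single score is easy to communicate.
    return 100 * NormalDist().cdf(mean_z)

# Example with values in the range reported for the Word-submission task (Table 4).
print(round(single_usability_score(4.2, 68.67, 0.9, 0.4), 1))
```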

Jeffries and Desurvire [12] classify the usability problems identified according to severity. They found that HE identified one third of the severe problems, two thirds of the moderate severity problems and 80% of the least severe problems. Fu et al. [8] argue that usability problems are skill-based, rule-based or knowledge-based. Another important perspective on usability evaluation is the kind of knowledge it renders. UTE involves both observation and empirical user testing. Empirical testing will provide knowledge on what the participants did and how they did it. Observation will provide knowledge on when they took certain actions. Considering HE and the method of experts evaluating the system, it follows that experts base their evaluations on estimates or probabilities of what typical users would do and how they would react to the system, based on their expertise and the heuristic guideline. HE involves knowledge about how participants would react and when they would react in a specific way, but in some cases it also provides reasons why this would be a problem. HE could identify the cause (why, when), whereas UTE focuses on the symptom or effect (how, what).

3. HEURISTIC EVALUATION
The heuristic evaluations for this study were done by usability experts. They were provided with tools including a checklist and screen dumps. The checklist (Appendix A) is a list of heuristics and was derived using the Multiple Heuristic Evaluation Table [31]. Wheeler Atkinson et al. [31] compiled the MHET, a comprehensive set of heuristic concepts that integrates existing approaches, as set out in section 2.1. Provision was made in the checklist for evaluators to add to the list of problems if necessary. Although a task list was used to structure the evaluation, every screen was also inspected separately.

3.1 Participant profile
The five evaluators, who are all lecturers at the same university, are experts in human-computer interaction, have a background in Computer Science and/or Information Systems, and are also knowledgeable about the assignment submission system. Their experience in undertaking heuristic evaluation ranged between 4 and 8 years (5, 5, 7, 8 and 4 years respectively).

3.2 Procedure
The evaluations were done on the evaluators' own computers in their own offices. The researcher briefed each evaluator on how to use the tools for the evaluation of the website. The evaluation material (tools) consisted of a task list, the 'Checklist for Heuristic Evaluations' (Appendix A), a bundle of screen dumps of the website and a blank writing pad. Each evaluator was required to complete an informed consent form. The researcher was on standby (a phone call away) to handle questions and problems with the use of the site or the evaluation material. A post-evaluation interview was conducted to get suggestions and comments regarding the procedure and the checklist.

3.3 Data Collection
Although every evaluator was given a checklist and screen dumps, they were allowed to use any method for data notation. All the evaluators made use of the checklist and the screen dumps. In the post-evaluation interview, all the evaluators felt confident that they had recorded most heuristic problems.

4. CONDUCTING USER TESTING AND EYE TRACKING
The eye tracking for this study was done in a usability laboratory. The usability laboratory consists of an observer room and a participant room, separated by a one-way mirror. The participant room is equipped with a 17” TFT monitor with a resolution of 1280x1024 and a Tobii 1750 eye tracker, allowing the eye movements of participants on the screen to be recorded. A 9-point eye tracking calibration was used at all times. The participant profile and the procedure will now be discussed.

4.1 Participant profile
The intended user group is students who have to submit assignments online. A screening questionnaire was used to select the participants for this evaluation. This questionnaire captured the prospective participant's experience with the learning management system, computer experience, age and gender. From the 23 questionnaires completed, 10 participants were selected. The selection was gender balanced and included an equal number of experts and non-experts regarding Web usage skills. One student was below 20; six were between 21 and 25; one between 26 and 30; and two above 30.

4.2 Procedure

One participant was tested at a time. On arrival, the participant was briefed about the experiment and the equipment to be used. The details of the material to be recorded were explained and the participant was required to complete an informed consent form. The participants were then briefed about the website. The task list was explained after which testing commenced. Participants were asked to comment when they were looking for something and could not find it; when they liked something particular about the Website; and when they disliked something particular. Finally a post-test questionnaire was given to participants followed by a debriefing where the participants were thanked and given the opportunity to view the data.

4.3 Data collection

Live video recordings were captured, including the screen, the participant's face and mouse/keyboard movements. Notes were taken during the test and a full evaluation of the video was done at a later stage. Audio of the participant or the test administrator speaking was included with the video files; the eye tracking video recordings included a cursor indicating the participant's eye movements; and eye tracking data files were stored. A post-test questionnaire (using a five-point Likert scale) was used to collect participants' perceptions of the user interface and the system.

5. RESULTS
We present the results from the HE in section 5.1 and those from the UTE in section 5.2, while section 5.3 compares the results.

5.1 Heuristic evaluation results
Table 2 depicts the results of the HE. Three evaluators identified the problem that the site contained too much information. Only one of the five evaluators noted that the error messages were not clear and that the terminology used could be a severe problem. We observed that experienced evaluators missed severe usability errors, while those errors were identified by less experienced but more committed evaluators. From our observation, evaluator commitment is an undervalued attribute in heuristic evaluations. The more committed evaluators did not only focus on completing the given tasks but also explored and tested the system. The result was that more problems were discovered.

HE identified critical problems in the design of the multiple-choice questions. If the user gave the wrong answer to 'Number of questions', it could result in missing some questions. If the user looked at the FAQ or clicked the Back button after filling in the answer sheet but before submitting it, all the answers would disappear and the user would have to repeat the whole process.

Table 2: Results of HE

| Heuristic | Explanation | No of problems found by evaluators |
| Software-User Interaction | Informs user of system status and task completions. | 8 |
| Learnability | Supports timely and efficient learning of software features. | 2 |
| Cognition Facilitation | Supports the cognitive limitations of the user. | 10 |
| User Control and Software Flexibility | Respond to user action and adaptivity. | 2 |
| System-Real World Match | Match users' expectations, familiarisation, and fit with the intended user group. | 2 |
| System-Software Interaction | Multiple applications on the same system. | 2 |
| Graphic Design | Graphical elements, colours, aesthetics. | 6 |
| Navigation and Exiting | Facilitate software exploration and provide outlets to terminate actions. | 6 |
| Consistency | Provide standard and reliable terminology, actions and layouts. | 5 |
| Defaults | Provides guidance related to the use of default information. | 1 |
| Help and Documentation | Providing users with help files and documentation. | 6 |
| Error Management | Prevents, identifies, diagnoses and offers corrective solutions. | 3 |

5.2 UTE results
Task: Submit a MS Word file
In the first task, where participants were asked to submit a Microsoft Word assignment, the usability results showed a 90% completion rate. The results depicted in Table 4 show that four of the 10 participants made errors during this task; three needed assistance to continue (severe error); and one was not able to complete the task even with assistance. There is a large variation in time, showing that some users were much more efficient than others. Going back to the biographical data, we found computer experience to be the differentiator; this becomes even clearer in the next section.

Table 4: Summary of results with UTE

| Completed % | Errors | Assistance needed | Min Time (s) | Max Time (s) | Mean time (s) |
| 90% | 4 | 3 | 27 | 161 | 68.67 |
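The summary measures in Table 4 can be derived from per-participant task logs along the following lines; the record layout and the sample values below are hypothetical stand-ins, not the actual test data.

```python
# Sketch: derive a Table 4 style summary from per-participant task records.
# The sample records are hypothetical, not the logs captured in this study.
from statistics import mean
from collections import namedtuple

Record = namedtuple("Record", "completed errors assisted time_s")
records = [
    Record(True, 0, False, 27), Record(True, 0, False, 35),
    Record(True, 0, False, 41), Record(True, 0, False, 48),
    Record(True, 0, False, 52), Record(True, 0, False, 60),
    Record(True, 1, False, 75), Record(True, 2, True, 95),
    Record(True, 3, True, 120), Record(False, 4, True, 161),
]

completion_rate = sum(r.completed for r in records) / len(records)
made_errors = sum(1 for r in records if r.errors > 0)
needed_assistance = sum(1 for r in records if r.assisted)
times = [r.time_s for r in records]

print(f"Completed: {completion_rate:.0%}")             # 90%
print(f"Participants who made errors: {made_errors}")  # 4
print(f"Assistance needed: {needed_assistance}")       # 3
print(f"Min/Max/Mean time (s): {min(times)}/{max(times)}/{mean(times):.1f}")
```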

The assignment submission page (Figure 1) contains instructions on how to submit assignments, indicated by a red rectangle. Figure 1 shows this assignment submission screen with an eye tracking heat map. A heat map shows the fixations of a participant, where the "hot" colours indicate the areas most fixated on by the participant. It can clearly be seen that the instructions were not read. Eye tracking showed that no participant read the complete instructions on the assignment landing page. Where participants did have fixations on the instruction area, these were very few, not covering complete sentences. Throughout the task list, eye tracking data showed that the majority of participants did not read the instructions. Figure 2 shows an example where a participant did not read the instructions. The instructions are indicated by a red rectangle. The phrase "Here is your final chance to check that your assignment is correct" is included in these instructions. Only one of the ten participants read these instructions.

Figure 1: Participant heat map – not reading instructions

Figure 2: Participant gaze plot – not reading instructions

Task: Submit a PDF file
The participant had to submit a written assignment (a PDF file) for this task. To do this, participants had to select the PDF file format from a drop-down box. The usability results showed a 70% completion rate; however, all participants made errors, with four participants needing assistance. The error that all participants made was that they did not select the appropriate file format (PDF). When the participants wanted to proceed with the task without selecting the correct file format, the following error message was displayed:

"ERROR: The type of file does NOT match the selected file type. (PDF!=DOC)"

Five non-expert participants received this error at least twice; one participant received the error five times. One participant was given assistance after receiving the error four times and shook her head in frustration during the task. Two participants submitted Word files instead of the PDF file after receiving the error twice. One participant entered his own name under file name. After receiving the error six times, instructions were given on where to select the file format. The participant proceeded to guess the file format, by selecting each one available, until PDF was reached and accepted. This shows that the file format and PDF terminology were not understood. Three expert participants received the error message once only, correcting their mistake very quickly. Figure 3 shows the eye movements while the error message was on the screen, in the form of a gaze plot. A gaze plot shows the participant's scan path, with fixations, while completing the task. The purpose of this screen capture is to see if participants read the error message. The figure also shows the fixations of a participant just after they clicked "OK" on the error message up to the point of the next click. This was done to see where they looked after reading the error message. All participants read the full error message. Fixations can be seen on the "File Name" textbox and "Browse" button. None of the non-expert participants fixated on the words "File Format" or on the drop-down box arrow where the file format had to be selected. This indicates that the error message was not comprehended.

Figure 3: Participant scan path - fixations on "File Name" textbox and "Browse" button

5.3 Comparison of results
When comparing the data obtained from heuristic evaluation with the data obtained from usability testing supported by eye tracking, the differences in the origin and processing of the data need to be stated. The data obtained from the usability testing with eye tracking method is based on direct measures of participants' performance in using the system, whereas the HE data relies on the heuristic guidelines and the interpretation of the evaluator as influenced by their experience and commitment.

In HE the errors are identified by considering the heuristic guidelines and the probable response of the user to the system's design. The user profile is routinely included with HE, but the evaluator determines to what degree it will be taken into consideration. The user profile can also influence the severity rating given to a usability problem.

The eye tracking expert defined 'severe errors' as those points where intervention was needed for a participant to continue. The UTE evaluation produced six severe errors, i.e. 'show stoppers'. The HE evaluators predicted three of these six errors as severe. However, they also picked up three other problems that could have severe consequences. For example, when answering the multiple-choice questions the student has to map the answers from alphabetic to numeric, and this could result in wrong answers; the number of questions for the assignment must be filled in by the student and, if this number is wrong, some of the questions will be omitted.

The HE evaluators were requested to do the severity rating of a problem together with the identification of the problem, as we thought this would be more efficient than asking them to do it later. Unfortunately, the HE evaluators omitted the severity rating in many cases. With this implementation of HE, it would probably have been better to do the severity rating separately. In both types of evaluation subjectivity is involved in what is identified and what is ignored; therefore the evaluator is the key element in both methods. In HE more than one evaluator can be used, and this allows for triangulation to standardise the evaluations, whereas UTE usually has only one evaluator.

In HE user reactions are predicted based on the heuristic guidelines. Usability testing, on the other hand, comprises the measures of efficiency (time) and effectiveness (number of errors and completion rate) as well as the survey results on user satisfaction, based on testing with real users. The evaluator can influence the findings by his/her observations and recommendations, but this is always triangulated with the objective measures. It follows that we are dealing with different sets of data, and yet these different sets are used to evaluate the usability of the same website. In an attempt to find common ground for comparison, we grouped the errors found with the HE evaluation into the following five error categories: interaction, cognition, layout, help and documentation, and error management, and then calculated the percentage each group contributed to the total number of errors found by the HE method.

UTE focuses on the tasks performed while HE takes a broader view that goes beyond the tasks specified. For example, only HE identified the limitation that the system accepted documents in MS Word 2003 format only, when MS Word 2007 was widely used. We then checked whether the error categories apply to the errors identified under UTE. As noted, the processes of identifying errors differ, but looking at the errors listed in Table 5 it appears that some observations, from different origins, can lead to the same conclusion. For example, the HE comment 'Unnecessary text on screen – too much textual noise' and the UTE comment 'Did not read complete instruction before completing task' both imply the existence of redundant information. This was confirmed by the UTE observation that participants did not read the complete instructions and yet completed the task.

Table 5: Comparing the dispersion of errors found within the data collected from each method

| Heuristic | No of problems found by Heuristic evaluators | No of problems found by UTE |
| INTERACTION | | |
| Software-User Interaction | 8 | 1 |
| User Control and Software Flexibility | 2 | 2 |
| System-Software Interaction | 2 | 0 |
| Navigation and Exiting | 6 | 1 |
| COGNITION | | |
| Cognition Facilitation | 10 | 7 |
| Learnability | 2 | 7 |
| System-Real World Match | 2 | 3 |
| LAYOUT | | |
| Graphic Design | 6 | 0 |
| Consistency | 5 | 1 |
| Defaults | 1 | 0 |
| HELP AND DOCUMENTATION | 6 | 2 |
| ERROR MANAGEMENT | 3 | 1 |
| Total | 53 | 25 |

Interestingly, the UTE measures of effectiveness and efficiency contradict the UTE post-test questionnaire results (the subjective measure). Eye tracking data showed that the majority of participants ignored on-screen instructions and then had problems in completing the tasks. Participants read the error messages but still repeated the mistakes, showing that participants did not comprehend the error messages. Based on the number of errors participants made and the assistance required, this website was not easy to use. Despite this evidence of serious usability problems, participants scored "the information given is useful in completing the tasks" high (median 5, standard deviation 0.49) in the post-test questionnaire.

Keeping the noted differences in mind, the graph depicted in Figure 4 provides the dispersion of the problems identified by each method. The UTE errors were mostly found in the category of 'cognition', which encompasses 'cognition facilitation', 'learnability' and 'system-real world match'. This was followed by 'interaction' and then 'help and documentation'. Considering HE, the most errors were identified in 'interaction', followed by 'cognition' and 'layout'. The results depicted in Figure 4 could be influenced by the usability problems in the system tested and cannot be generalised. However, our future research will investigate the possibility that HE generally provides a more equal error distribution.

Figure 4: Comparing the grouped dispersions of errors as percentages
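As a worked example of the grouping behind Figure 4, the short sketch below recomputes the percentage contribution of each category from the counts in Table 5; the category membership follows the grouping shown in that table.

```python
# Recompute the grouped error percentages behind Figure 4 from the Table 5 counts.
# Each heuristic maps to (problems found by HE, problems found by UTE).
CATEGORIES = {
    "Interaction": {"Software-User Interaction": (8, 1),
                    "User Control and Software Flexibility": (2, 2),
                    "System-Software Interaction": (2, 0),
                    "Navigation and Exiting": (6, 1)},
    "Cognition": {"Cognition Facilitation": (10, 7),
                  "Learnability": (2, 7),
                  "System-Real World Match": (2, 3)},
    "Layout": {"Graphic Design": (6, 0),
               "Consistency": (5, 1),
               "Defaults": (1, 0)},
    "Help and Documentation": {"Help and Documentation": (6, 2)},
    "Error Management": {"Error Management": (3, 1)},
}

def category_percentages(method_index):
    """Percentage of all problems per category for one method (0 = HE, 1 = UTE)."""
    totals = {cat: sum(counts[method_index] for counts in rows.values())
              for cat, rows in CATEGORIES.items()}
    grand_total = sum(totals.values())
    return {cat: round(100 * n / grand_total, 1) for cat, n in totals.items()}

print("HE :", category_percentages(0))   # Interaction 34.0, Cognition 26.4, Layout 22.6, ...
print("UTE:", category_percentages(1))   # Cognition 68.0, Interaction 16.0, Help and Documentation 8.0, ...
```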

The lack of effectiveness and efficiency shown in these results makes it difficult to believe that participants found the overall use of this website satisfactory. Furthermore, the identification of usability problems is influenced by the perception of the users. We have seen the difference in the effectiveness and efficiency as experienced by non-experts and experts. This opens the door to many kinds of errors, especially sampling errors.

In this study the HE was done per screen and then collated. We found this useful in helping programmers to locate the problems for correction. The HE evaluation provides a systematic, guaranteed coverage of the site since every screen is inspected according to the heuristic evaluation guideline. According to our findings the experience and especially the commitment of the HE evaluator should not be underestimated.

6. DISCUSSION
The method of HE is based on identifying usability errors. Neither effectiveness, efficiency nor user satisfaction is measured explicitly. HE provided a list of 53 errors in comparison to the 25 identified through UTE. This supports the findings of Jeffries and Desurvire [12] that heuristic evaluators find more problems than usability tests. Considering Law and Hvannberg's criteria of validity and thoroughness, it seems that HE is more thorough in terms of covering all problems at a meta-level. On the other hand, UTE is better at evaluating surface-level and page-level problems and at proving validity through triangulation of the objective and subjective measures.

In UTE the usability and eye tracking results obtained (low efficiency, effectiveness and user satisfaction) contradict the post-test questionnaire results, where participants rated the site positively. A discrepancy between the existence of usability problems identified through observation and the positive results from questionnaire data has also been reported by Van Greunen and Wesson [30]. Triangulation between the data captured through different tools is an important advantage of UTE. Our findings support Sauro [25] in questioning the validity of methods where only subjective or objective measures are used, especially if that method is only a questionnaire.

Thimbleby [28] warns against the use of empirical findings that may be valid but devoid of theoretical underpinning, as that may lead to faulty conclusions. He quotes an example from the eighteenth century where empirical results showed that salve put on a weapon healed a wound caused by that type of weapon. The reason for this fallacy was that the application procedure was so unhygienic that a patient was better off having the salve on the weapon rather than on the wound. Therefore we need to think carefully before accepting questionnaire results that are not substantiated by objective measures or linked to any interaction theory. A lack of theoretical underpinning, especially concerning the cognitive model of users interacting with the system, is exactly the criticism brought against HE by Law and Hvannberg [17].

7. CONCLUSION
This study compared the usability data obtained from heuristic evaluation with that obtained from user testing supported by eye tracking. The study comprised a heuristic evaluation and user testing supplemented with eye tracking performed on the same website of a learning management system. The main contribution of this paper is to focus attention on the fundamental differences between the usability data provided by heuristic evaluation and usability testing supported by eye tracking. The commitment of the evaluators (seen as separate from experience) was identified as an important factor in HE. Heuristic evaluation is a usability inspection method while UTE is a usability evaluation method, but despite this fundamental difference both are accepted methods used to evaluate websites. Therefore we find it important to consider the difference between the methods in order to understand the difference between the information gained from them. Firstly, the goals are different: HE aims to identify usability errors while user testing supplemented with eye tracking focuses on determining effectiveness, efficiency and user satisfaction. Secondly, the data capturing and processing are different: in HE the data is obtained from the report by the heuristic evaluator. This report is heavily influenced by the expertise and commitment of the evaluator and there is no means of triangulation except using more than one evaluator – this is indeed recommended throughout the literature [12]. In UTE the data is captured through objective measures involving the testing of the user and also by subjective measures such as a questionnaire and the observation of the evaluator. Accepting that the expertise of the evaluator remains a factor, there is the opportunity for triangulating between the objective and subjective measures.

Thirdly, there is a difference in the level of information rendered. UTE deals mostly with the 'what' and the 'how' information while HE focuses on a meta-level, considering 'why' and 'when'. This finding is supported by the work of Fu et al. [8], who found that UTE is more effective in finding usability problems associated with the knowledge-based level of performance while HE is more effective in identifying usability problems associated with skill-based and rule-based knowledge.

Apart from revisiting the differences between HE and UTE, the findings from this paper also contribute to sharpening awareness of the different kinds of information that usability methods provide. The heuristic evaluation guidelines used in this study were developed through the review of several prominent heuristic evaluation guidelines with a focus on learning management systems and may therefore be of value to developers and evaluators of learning management systems. Empirical user testing was grouped with eye tracking to get the added value from the eye tracking data; since our aim was to investigate the kinds of data yielded rather than to compare the strengths and weaknesses of the methods, we found this justified. The generalisability of the findings is limited by the fact that this study was done on one website only. Future work will aim to repeat this research on more websites to validate the findings and improve our understanding of the optimal use and combination of usability evaluation methods.

8. REFERENCES
[1] ISO 9241-11. 1998. Guidance on Usability. [cited 15 March 2009]. Available from: http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=16883
[2] Äijö, R., Mantere, J. 2001. Are Non-Expert Usability Evaluations Valuable? In 18th International Symposium on Human Factors in Telecommunications (HfT'01), Bergen, Norway.
[3] Aula, A., Majaranta, P., Räihä, K.J. 2005. Eye-Tracking Reveals the Personal Styles for Search Result Evaluation. Interact 2005, LNCS 3585, 1058–1061.
[4] Bednarik, R., Tukiainen, M. 2006. An eye-tracking methodology for characterizing program comprehension processes. In ETRA 2006, San Diego, California: ACM.
[5] Desurvire, H.W. 1994. Faster, Cheaper!! Are Usability Inspection Methods as Effective as Empirical Testing? In Nielsen, J. and Mack, R.L. (eds.): Usability Inspection Methods. John Wiley & Sons, New York, NY.
[6] Desurvire, H.W., Kondziela, J.M., Atwood, M.E. 1992. What is gained and lost when using evaluation methods other than empirical testing. Proceedings of the HCI'92 Conference on People and Computers VII, 89–102.
[7] Ehmke, C., Wilson, S. 2007. Identifying Web Usability Problems from Eye-Tracking Data. Proceedings of the 21st British HCI Group Annual Conference on HCI 2007: People and Computers XXI, University of Lancaster, United Kingdom, 119–128.
[8] Fu, L., Salvendy, G., Turley, L. 2002. Effectiveness of user testing and heuristic evaluation as a function of performance classifications. Behaviour & Information Technology, 21(2), 137–143.
[9] Gray, W.D., Salzman, M.C. 1998. Damaged Merchandise? A Review of Experiments That Compare Usability Evaluation Methods. Human-Computer Interaction, 13, 203–261.
[10] Hartson, H.R., Terence, S.A., Williges, R.C. 2001. Criteria for Evaluating Usability Evaluation Methods. International Journal of Human-Computer Interaction, 13(4), 373–410.
[11] Hertzum, M., Jacobsen, N.E. 2003. The Evaluator Effect: A Chilling Fact About Usability Evaluation Methods. International Journal of Human-Computer Interaction, 15(1), 183–204.
[12] Jeffries, R., Desurvire, H.W. 1992. Usability Testing vs. Heuristic Evaluation: Was there a contest? ACM SIGCHI Bulletin, 24(4).
[13] Jeffries, R., Miller, J.R., Wharton, C., Uyeda, K.M. 1992. User interface evaluation in the real world: a comparison of four techniques. ACM SIGCHI Bulletin, 24(4), 39–41.
[14] Karat, C.M., Campbell, R., Fiegel, T. 1992. Comparison of empirical testing and walkthrough methods in user interface evaluation. Proceedings of the ACM CHI'92 Conference on Human Factors in Computing Systems, 397–404.
[15] Karn, K.S., Jacob, R.J.K. 2003. Eye tracking in human-computer interaction and usability research: Ready to deliver the promises. In The Mind's Eye: Cognitive and Applied Aspects of Eye Movement Research. Elsevier, Amsterdam.
[16] Kasarskis, P., Stehwien, J., Hickox, J., Aretz, A. 2001. Comparison of expert and novice scan behaviours during VFR flight. 11th International Symposium on Aviation Psychology, Columbus, OH: The Ohio State University. [cited 16 April 2009]. Available from: http://www.humanfactors.uiuc.edu/Reports&PapersPDFs/isap01/proced01.pdf
[17] Law, E.L., Hvannberg, E.T. 2004. Analysis of strategies for improving and estimating the effectiveness of heuristic evaluation. In Proceedings of NordiCHI'04, Tampere, Finland: ACM.
[18] Nielsen, J. 1992. Finding usability problems through heuristic evaluation. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Monterey, California, United States, 373–380.
[19] Nielsen, J. 2003. Jakob Nielsen's Alertbox: Usability 101. [cited 15 April 2009]. Available from: http://www.useit.com/alertbox/20030825.html
[20] Nielsen, J. 1994. Heuristic evaluation. In Nielsen, J. and Mack, R.L. (eds.): Usability Inspection Methods. John Wiley & Sons, New York, 22–62.
[21] Nielsen, J., Levy, J. 1994. Measuring Usability: Preference vs. Performance. Communications of the ACM, 37, 66–76.
[22] Nielsen, J., Molich, R. 1990. Heuristic evaluation of user interfaces. CHI'90 Proceedings, 249–256.
[23] Preece, J., Rogers, Y., Sharp, H. 2002. Interaction Design: Beyond Human-Computer Interaction. John Wiley & Sons, Inc.
[24] Pretorius, M.C., Calitz, A.P., van Greunen, D. 2005. The Added Value of Eye Tracking in the Usability Evaluation of a Network Management Tool. In Proceedings of the 2005 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists on IT Research in Developing Countries, White River, South Africa.
[25] Sauro, J., Kindlund, E. 2005. A Method to Standardize Usability Metrics into a Single Score. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Portland, Oregon, USA, 401–409.
[26] Shneiderman, B. 1998. Designing the User Interface: Strategies for Effective Human-Computer Interaction. Addison Wesley Longman, Reading, MA.
[27] Ssemugabi, S., De Villiers, R. 2007. A comparative study of two usability evaluation methods using a web-based e-learning application. Proceedings of the 2007 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists on IT Research in Developing Countries, Port Elizabeth, South Africa, 132–142.
[28] Thimbleby, H. 2007. Press On. MIT Press, Cambridge.
[29] Tognazzini, B. 2003. First Principles of Interaction Design. [cited 8 June 2009]. Available from: www.asktog.com/basics/firstPrinciples.html
[30] van Greunen, D., Wesson, J. 2001. Formal usability testing - Informing design. [cited 6 June 2009]. Available from: http://osprey.unisa.ac.za/saicsit2001/Electronic/paper20.pdf
[31] Wheeler Atkinson, B.F., Bennet, T.O., Bahr, G.S., Nelson, M.W. 2007. Development of a Multiple Heuristic Evaluation Table (MHET) to Support Software Development and Usability Analysis. [cited 25 May 2009]. Available from: http://research.fit.edu/carl/documents/MHET.pdf

Appendix A

Guidelines for Heuristics Evaluation (based on the Multiple Heuristic Evaluation Table)

1. Software-User Interaction – informs user of system status and task completions
   • The user is aware of his/her progress through a task.
   • The user feels confident while doing a task.
   • During the task, the user feels confident that he/she will be able to complete the task successfully.
   • The user is provided with appropriate and necessary feedback.
   • Invalid input generates feedback that informs the user why the input is incorrect.
   • The user is given information on how to enter data.
   • Other
2. Learnability – supports timely and efficient learning of software features
   • The instructions are clear on every screen.
   • The instructions are easy to understand.
   • Other
3. Cognition Facilitation – supports the cognitive limitations of the user
   • The display is free of clutter.
   • There is irrelevant information.
   • The user receives timely information.
   • The user receives clear instructions.
   • The symbols and terminology used support meaningful comprehension.
   • Other
4. User Control and Software Flexibility – respond to user action and adaptivity
   • The user feels in charge of the software.
   • The user is encouraged to explore the software.
   • The (visual) environment supports the user in focussing on the task at hand.
   • Other
5. System-Real World Match – match users' expectations, familiarization, fit intended user group
   • The user understands the terminology in the instructions.
   • The user understands the terminology in the error messages/feedback.
   • The icons are easily understandable.
   • The GUIs are easily understandable.
   • Other
6. Graphic Design – graphical elements, colours, aesthetics
   • All the text in black font is easy to read.
   • All the text in coloured font is easy to read.
   • The graphics convey information clearly.
   • The icons convey information clearly.
   • The graphics provide text information with mouseovers.
   • Commands and icons that cancel each other are physically separated.
   • The layout is satisfactory.
   • Other
7. Navigation and Exiting – facilitate software exploration and provide outlets to terminate actions
   • The 'Cancel' option is available to escape from an operation.
   • The user can reverse actions easily.
   • The user can back track to a previous option.
   • Other
8. Consistency – provide standard and reliable terminology, actions and layouts
   • The software uses multiple words to refer to the same object or option.
   • The positioning of the instructions is appropriate.
   • The positioning of the instructions is consistent.
   • The system is consistent in the execution of actions.
   • Other
9. Defaults – provides guidance related to the use of default information
   • Default information is provided in response/edit boxes.
   • It is easy to adjust the initial default settings.
   • The defaults provide information or examples that relate to the type of input required.
   • Other
10. System-Software Interaction – multiple applications on the same system
   • The software supports the capability to have multiple applications open.
   • Other
11. Help and Documentation – providing users with help files and documentation
   • The help files provide task-oriented information.
   • The help file/s provide relevant and concise information.
   • The help messages are brief.
   • The help messages are informative.
   • It is easy to find a solution to a problem.
   • The instructions are represented in an ordered list of concrete steps.
   • Other
12. Error Management – prevents, identifies, diagnoses and offers corrective solutions
   • Error messages are clear and informative.
   • User errors are prevented.
   • User errors are detected.
   • Other