Usability in Practice: Formative Usability Evaluations — Evolution and Revolution

Janice (Ginny) Redish (moderator), Redish & Associates, Inc., +1-301-229-3039, [email protected]
Randolph G. Bias (moderator), Austin Usability, +1-512-474-0004, [email protected]
Robert Bailey, Computer Psychology, Inc., +1-801-201-2002, [email protected]
Rolf Molich, DialogDesign, +45 4717 1731, [email protected]
Joe Dumas, Oracle Corporation, +1-781-744-0300, [email protected]
Jared M. Spool, User Interface Engineering, +1-978-374-8300, [email protected]

ABSTRACT
Formative evaluation is a collection of "find-and-fix" usability engineering methods, focused on identifying usability problems before a product is completed. In this forum, four experienced usability professionals will address different aspects of formative evaluations:

• which methods are most effective,
• how to maximize the chances of effecting change and implementing the usability recommendations,
• the importance of the usability professional's relationship with the product developer, and
• the importance of developing a science of user interface design, to minimize the need for iterative evaluations.

KEYWORDS
Usability engineering, usability methods, inspection methods, usability report, user interface design, iterative design, user-centered design, consensus building.

INTRODUCTION
The first tenet of usability engineering is to test early and often. Applying usability engineering methods – both user testing and inspection methods – during the development of a web site or other user interface is key to identifying potential usability problems at a time when they are most likely to be fixed. Formative evaluation is "user testing with the goal of learning about the design to improve its next iteration" [12]. Think of it as a collection of "find-and-fix" usability engineering methods, focused on identifying usability problems before a product is completed. Formative evaluation is often contrasted with summative evaluation, which affords a quantitative comparison between a product (often a completed product) and a competitive product or a quantitative standard (i.e., measurable usability objectives).

In this forum, four experienced usability professionals come at the topic of formative evaluations from four distinct directions. First, Robert Bailey offers a summary of various usability engineering methods and empirically evaluates some of them for their effectiveness. He concludes that several are wanting and cautions that our discipline as a whole only appears easy. Rolf Molich argues for the importance of the usability of the usability test report itself and presents a very useful list of reporting practices that make usability test reports less usable. Molich recommends a practical, collaborative data-reduction technique designed to maximize the chances that our usability recommendations will be implemented. Joe Dumas takes Molich's attention to developers' receptivity to usability data a step further, arguing for the importance of the relationship between the usability engineer and the product developer. He highlights the pros and cons of various usability engineering methods for developing this relationship. Jared Spool concludes by arguing that if we did a better job of building a science of usability, we would not need to spend so much time and money on all of these iterative tests.

ROBERT BAILEY: THE EFFECTIVENESS OF USABILITY TEST AND EVALUATION METHODS
When designing usable websites, professional usability specialists first ensure that the early prototypes are as good as they can be by using parallel design techniques and applying research-based guidelines. Second, they understand and appropriately apply the following four categories of usability test and evaluation methods. This talk focuses on the second part – issues related to applying usability testing and evaluation methods.


The research literature on usability testing continues to fall fairly well into four categories: (a) automated evaluations, (b) inspection evaluations, (c) operational evaluations, and (d) performance testing. Ideally, a designer would complete a prototype and then perform the first evaluation using an automated evaluation method, such as TANGO [5, 6]. After correcting any unearthed problems, the designer would conduct an inspection evaluation, such as a heuristic evaluation or a cognitive walkthrough. After fixing those problems, the designer would conduct a true human performance test by having representative participants complete typical tasks in a simulated environment. All identified problems would be corrected and the website posted. Once the site was active, the designer would collect operational data, such as actual download times, frequency counts of accessed links and pages, and user complaints, and then make further changes.

There can be major problems in trying to implement this testing approach. On average, automated evaluations can accurately distinguish only about 67% of the good pages from the "not-good" pages [6]. Using the heuristic evaluation method to find problems does not seem to be too difficult; rating the severity of each problem, however, is a major difficulty [8]. The most important question, though, is, "How many of the usability problems identified in a heuristic evaluation are true usability problems?" Three recently published research papers provide some insight into the validity of heuristic evaluations [2, 14, 17]. The articles discuss usability evaluation in three quite different domains, with very similar results. Based on a review of these papers, Bailey [1] suggested that when professional evaluators conduct heuristic evaluations, the most likely outcome is that about half of the problems identified will be true problems and half will be false positives, while another 20% of the usability problems will be missed altogether. Similar findings have been reported in studies of the validity of cognitive walkthroughs [7, 15].

Molich and his colleagues [10, 11] have helped us better understand the limitations of even our best usability testing method, performance testing. They had several independent usability testing facilities conduct performance tests on exactly the same software systems. The results showed considerable variability among the testing groups. In their first study, for example, only one problem was found by all of the testing teams, and over 90% of the problems found by each team were found only by that team. A more recent study [9] had six usability testing teams conduct usability tests on a prototype of a system. Consistent with Molich's findings, none of the problems was found by every team, and a large proportion of the problems were found by only one team.
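The cross-team variability reported in these studies comes from comparing the sets of problems that each team reported. As a minimal sketch of that kind of analysis (the team names and problem sets below are invented for illustration; they are not data from [9], [10], or [11]), one can count how many problems were found by every team and how many were unique to a single team:

```python
# Illustrative only: invented problem sets, not the actual CUE or Kessner data.
from collections import Counter

reports = {
    "Team A": {"login unclear", "search too slow", "no error message"},
    "Team B": {"login unclear", "jargon in labels", "no error message"},
    "Team C": {"no error message", "contrast too low", "jargon in labels"},
}

all_problems = set().union(*reports.values())
counts = Counter(p for problems in reports.values() for p in problems)

found_by_all = [p for p in all_problems if counts[p] == len(reports)]
found_by_one = [p for p in all_problems if counts[p] == 1]

print(f"{len(found_by_all)} of {len(all_problems)} problems found by every team")
print(f"{len(found_by_one)} of {len(all_problems)} problems found by only one team")
```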


One major problem with many performance tests is that we use too few participants. Nielsen [13] proposed that usability testers need to test with only five users to discover 85% of the usability problems. Bailey [1] calculated that to be 90% confident of finding the usability problems that will affect 99% of users requires more like 112 representative test participants. Spool and Schroeder [16] reported that with five users they identified only about 35% of the problems in a website.

Taken together, the findings of these studies show that there is considerable need for improvement in the overall usability test and evaluation process. Contrary to some stated beliefs, effective usability testing and evaluation is extremely difficult to do well. As a discipline, we need fewer "discount" methods and more research-based, truly valid methods for finding true usability problems.
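Nielsen's five-user figure [13] is based on the cumulative problem-discovery model P = 1 - (1 - p)^n, where p is the proportion of users affected by a problem and n is the number of test participants. The sketch below reproduces that arithmetic with Nielsen's published 31% detection rate and shows how quickly the required sample grows for rarer problems; the 1% frequency used here is an illustrative choice, not necessarily the assumption behind Bailey's 112-participant figure.

```python
import math

def p_found(p_per_user: float, n_users: int) -> float:
    """Probability that at least one of n users hits a problem that
    affects a proportion p_per_user of users: 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p_per_user) ** n_users

def users_needed(p_per_user: float, confidence: float) -> int:
    """Smallest n with p_found(p_per_user, n) >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_per_user))

# Nielsen's figure: problems hit by ~31% of users, tested with 5 users (~84%).
print(f"5 users, p = 0.31: {p_found(0.31, 5):.0%} of such problems found")

# A much rarer problem (illustrative 1% frequency) needs far more participants (230).
print(f"90% confidence for a 1% problem: {users_needed(0.01, 0.90)} users")
```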

ROLF MOLICH: MAKING USABILITY TEST REPORTS USABLE
In the CUE-2 project [3, 10], nine independent organizations evaluated the usability of the same website, Microsoft Hotmail. The results document wide differences in the selection and application of methodology, the resources applied, the problems reported, and the reporting techniques. One of the goals of the CUE-2 study was to gain insight into the everyday practices of usability professionals, so we examined the content and format of the usability test reports. The results are shown in Table 1.

Team                           A    B    C    D    E    F    G    H    J
# Pages                        16   36   10   5    36   19   18   11   22
Executive summary?             Yes  Yes  No   No   No   Yes  No   Yes  Yes
# Screen shots in report       10   0    8    0    1    2    1    2    0
# Levels in severity scale     2    2    3    1    2    1    1    3    4
# Problems reported            32   149  18   10   67   76   41   18   25
# Positive findings reported   0    8    4    7    24   25   22   4    6

Table 1: Important characteristics of the nine CUE-2 usability test reports. The table shows that many of the participating teams did not follow basic advice on good reporting, such as inclusion of an executive summary, use of a severity scale, and inclusion of positive findings.



Good and usable reporting practices are outlined, for example, in the standard textbook A Practical Guide to Usability Testing [4]. We found a number of reporting practices that made the reports less usable:

• Report too long; too many problems reported. A usability report that describes 75 or even 150 usability problems is difficult to read and difficult to sell to developers and designers. Recommendation: Unless another agreement has been made with the development team, report only a manageable number of problems, perhaps 20 to 60. Prioritize the full list of usability problems so that only the most important ones are reported.

• No executive summary. Recommendation: Include a one-page executive summary listing the three most important positive findings, the three most important problems, and the three most important managerial actions you recommend from the test.

• No severity classification of problems. Some reports did not distinguish between disastrous problems and minor details. Recommendation: Classify each problem on a three-level scale: disastrous, serious, minor.

• No indication of how many users encountered a problem (frequency). Recommendation: Report how many test participants encountered each problem, for example "5 out of 7." Avoid phrases like "most test participants" or "71% of the test participants."

• No positive findings. One report started by saying, "Generally, the users were very happy about Hotmail." The rest of the report contained more than 30 problem descriptions without any positive findings to substantiate the initial claim. Recommendation: Strive for a ratio of at least one positive finding for every three problems.

• Unattractive, unprofessional layout. The layout is important for selling the results to busy developers. Recommendation: Work with a designer and your development teams to define an attractive and usable standard format for your test reports. Follow the standard consistently. (A minimal record format reflecting these recommendations is sketched below.)
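As a purely illustrative sketch (not a CUE-2 artifact and not prescribed by [4]), the recommendations above can be read as a minimal schema for a single reported finding: a three-level severity, an observed frequency such as "5 out of 7," and room for positive findings:

```python
# Illustrative sketch only: one way to record a usability finding so that the
# reporting recommendations above (severity, frequency, positives) are captured.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Severity(Enum):
    DISASTROUS = 1
    SERIOUS = 2
    MINOR = 3

@dataclass
class Finding:
    description: str
    severity: Optional[Severity]  # None for a positive finding
    participants_affected: int    # e.g. 5 ...
    participants_total: int       # ... out of 7
    positive: bool = False

    def frequency(self) -> str:
        """Report frequency as a count, e.g. '5 out of 7', not a percentage."""
        return f"{self.participants_affected} out of {self.participants_total}"

example = Finding(
    description="Participants could not find the sign-out link.",
    severity=Severity.SERIOUS,
    participants_affected=5,
    participants_total=7,
)
print(example.frequency())  # "5 out of 7"
```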

       

Communicating Results: The KJ Method
The primary purpose of a usability test is to cause beneficial improvements to the user interface – not to write a good usability report. In other words, if the development team disregards your usability report, then you have failed, no matter how "perfect" your usability test report is. Good communication skills are essential for today's usability professionals. Here is a problem-communication technique that is superior to traditional communication of usability problems through paper reports and video tapes:



• Work with the development team to define user profiles and typical user tasks. This helps to ensure buy-in.

• Invite development team members to watch usability tests and take notes.

• Immediately after the last usability test, gather the developers who have watched one or more tests.

• Ask each participant to write down each major observed problem on a colored index card. No discussions – this is a brainstorm.

• Put all index cards on a large, sticky board. No discussions!

• Read each other's cards silently. Add additional problems as desired.

• Sort problems by area. Eliminate duplicates, but only if there is total agreement.

• Name each group. The group names can be used as chapter names in the usability report.

• Vote for the most important problems. Each participant has ten votes and can distribute them freely among the problems they consider most important. (A minimal tally of such votes is sketched below.)
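As a small illustration of the final voting step (the names, cards, and votes below are invented, not data from an actual KJ session), tallying the distributed votes and sorting by total immediately yields the prioritized problem list:

```python
# Illustrative only: invented cards and votes from a hypothetical KJ session.
from collections import defaultdict

# Each participant distributes exactly ten votes across the problems
# they consider most important.
votes = {
    "Anna":  {"login unclear": 6, "no error message": 4},
    "Bjorn": {"login unclear": 3, "jargon in labels": 5, "no error message": 2},
    "Carol": {"jargon in labels": 7, "contrast too low": 3},
}

totals: dict = defaultdict(int)
for ballot in votes.values():
    assert sum(ballot.values()) == 10, "each participant has exactly ten votes"
    for problem, n in ballot.items():
        totals[problem] += n

for problem, total in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{total:>2} votes  {problem}")
```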

The advantages of this approach are:

• Results are immediately available. There is no need to wait for the usability report, although you may write a report anyway for archival purposes. The development team can start making changes to the interface immediately, based on the consensus reached at the meeting.

• Buy-in. The development team plays an important role in defining the usability test results.

Some usability professionals don't like this technique. They want to "decide" what happened during the usability test and draw the necessary conclusions. I don't think this is a sound attitude towards usability. Our role as usability professionals is to make usability happen – to act as humble catalysts for a complicated process.


The KJ method is named after the Japanese anthropologist Jiro Kawakita (denoted K-J by the Japanese custom of placing the family name first), who developed a method for synthesizing large amounts of data, including voice-of-the-customer data, into manageable chunks based on themes that emerge from the data themselves.

JOE DUMAS: EVALUATION METHODS AND THE RELATIONSHIP WITH DEVELOPERS
When considered as part of a user-centered or usability engineering process, formative evaluation almost always contributes to design. The availability of easy-to-learn software prototyping tools and the resurgence of paper prototyping have made it possible to push evaluation earlier and earlier in the design lifecycle. Evaluation methods, therefore, can be assessed on their ability to be used early and to contribute to design.

There is, however, another important factor in assessing usability evaluation methods: their ability to facilitate the working relationship between usability specialists and developers. In the long term, the most important factor in the willingness of developers to make design changes is the relationship they have with usability professionals. If the relationship is based on trust and mutual respect, the likelihood that developers will make design changes is increased. An added benefit, once you have formed a positive relationship, is that designers will be more likely to incorporate usability methods into their development process.

What builds trust and mutual respect? A few of the activities include:

• A sense of shared goals to make the product better.

• Bringing developers into planning and conducting the evaluation, analyzing the data, and communicating the results. Working hard together solidifies the relationship.

• Learning as much as you can about the product, so that developers see that you understand what the product does.

• Conveying the attitude that the evaluation is not of them but of the design.

What harms relationships?

• Surprises, especially when they bring bad news.

• A defensive attitude, something I call "usability paranoia," which is the irrational fear of being ignored.

• Using the evaluation data as a weapon with which to punish developers.


This notion that it is the relationship that counts most can be used to assess the value of usability evaluation methods. The best methods are the ones that provide the most opportunities to build a positive working relationship with developers, but all of the methods offer at least some opportunities. Here is a brief evaluation of the opportunities each method provides for building relationships.

Usability Testing
Testing provides the most opportunities for collaboration:

• During test planning, build a collaborative relationship by working together on user profiles, the screening questionnaire, the task list, and the wording of scenarios. As you work, convey the message that you care about making the product better and that you appreciate the hard work of developers.

• Set expectations. If you think there are going to be lots of usability problems, convey that expectation to developers and explain that finding problems is what the test is all about.

• During testing, help developers understand what they are seeing in pilot and test sessions. Make it clear that all products have usability issues, especially ones with no previous user testing. Focus discussion on the underlying causes of problems and on the fact that several problems can have the same source. We always tell test participants that we are testing the product, not them; we should tell developers the same thing and point out the product's good points as well as its weak ones.

• During analysis and reporting, educate rather than preach: not all problems are equally important, and systemic problems require global solutions.

• After the testing is over, check in on the progress of solutions, even if your only purpose is to keep the relationship going.

Walkthroughs
Group walkthroughs with developers present an opportunity for team building:

• One of the unique aspects of walkthroughs is that usability specialists and developers uncover usability problems together.

• The usability specialist has an opportunity to explain the reasons for problems and how two or more problems are related.

• Rational discussions and negotiations about changes show developers how we think and how much we respect them.

!"    #$   • Reviews by usability experts have the greatest potential to disrupt the relationship with developers.




• Developers don't get to see the problems as usability specialists find them.

• Written problem descriptions need to be carefully worded so as not to offend developers.

• Finding many problems that are low in severity can make the usability specialist appear to be picky and not sharing in common goals.

• When the reviewers don't agree on problems or severity ratings, developers may be properly suspicious about whether the method is valid.

• Solutions to problems that usability specialists propose may be impossible to implement or may have already been rejected.

Usability specialists need to be trained in people skills instead of usability paranoia.

JARED SPOOL: TOWARD A SCIENCE OF USER INTERFACE DESIGN
If we built bridges the way we build information systems, here is how the process would work:

1. First, you build the bridge. Get it built as quickly as possible. Previous bridge-building experience is not required, nor is any sort of training in bridge building.
2. Take a car full of people and drive it across the bridge.
3. Watch as the car plunges into the water.
4. Make a note of the point where the bridge failed the car and brainstorm a fix.
5. Implement the fix.
6. Repeat steps 1-5 until cars no longer plunge into the depths.

So, who wants to go across the bridge?

What is even more amazing is that the above process, when translated to information systems, describes the "ideal" scenario – one where you actually test designs before you open them up for general use. The common scenario is far worse: testing is done only "when resources are available," and systems are put into production without any validation or iteration.

Fortunately, bridge building is not done this way. There is a fundamental understanding of the basics of bridge building. Bridge builders (also known as engineers) know a lot about materials, stress, constraints, and aerodynamics. Where did they get this information? From years of exploration, design, research, and failures.

Winston Churchill once said, "This is not the end, nor is this the beginning of the end. This is the end of the beginning." In the world of information systems usability, we are just starting down the road that other disciplines have traveled. What is missing is fundamental research. This kind of information is critical to the development of usable systems. Designing through iteration, while the best technique we have to date, is costly and inefficient. Designers should know what a usable design will look like before they have started building, let alone after development is done but before any testing has begun. To get the information they need, we must create a composite understanding of how design affects behavior.

In our work at User Interface Engineering, we have focused on researching the relationship between design and behavior. Since 1988, we have been setting the standard for high-quality usability research, looking at many different aspects of the engineering and design problem space. One recent undertaking has been research that ties specific design elements to bottom-line results.

In recent studies, we have looked closely at how shoppers make purchase decisions. By modifying our analysis techniques to focus on shoppers with a strong intention to complete a purchase, we can see where the design of an e-commerce site prevents them from completing their task. For example, in a recent study, we gave 30 shoppers money to purchase products they told us they needed, from sites we knew to carry those products. Even though these shoppers were highly motivated to complete the sale, only 30% of the shopping expeditions in the study ended in a purchase. In the process, we discovered over 280 design obstacles that prevented purchases from completing.

In analyzing the design obstacles, we started to notice patterns. For example, at some point in a shopping expedition, a shopper is confronted with a list of products, either from a search engine or from choosing a category link. While designers tell us they expect users to look at several items on the list before purchasing, we discovered that 66% of all purchases happened when users looked at only one item on the list. Moreover, the more items users looked at on a list, the less likely they were to make a purchase.

We refer to the process of looking at multiple items on a list as "pogosticking." (We derived the name from users repeatedly going up and down through the hierarchy of the site, as if they were jumping on a pogo stick.) Users were far more likely to pogostick on some sites than on others. When we studied the designs of these sites carefully, we found that the sites where users didn't pogostick included crucial information in the items described on the list, while the sites where users did pogostick were missing that information. Our analysis allowed us to determine the information our shoppers needed to prevent pogosticking and enhance purchasing. We are now researching further to understand how to identify the critical information for different types of products (clothing, CDs, cars, pet supplies).

Research such as this can help us start to catalogue the behaviors influenced by specific designs. From this, we can begin to formulate the fundamentals that will allow us to build information systems that are usable from the start, without having to iterate over uninformed initial designs.
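A minimal sketch of the kind of analysis behind these figures (the session records below are invented, not User Interface Engineering data): group shopping sessions by how many list items the shopper viewed and compare purchase rates across those groups.

```python
# Illustrative only: invented shopping sessions, not actual study data.
from collections import defaultdict

# Each record: (number of items viewed on a product list, purchased?)
sessions = [
    (1, True), (1, True), (1, False), (2, True),
    (2, False), (3, False), (4, False), (5, False),
]

by_depth = defaultdict(list)
for items_viewed, purchased in sessions:
    by_depth[items_viewed].append(purchased)

# More pogosticking (more items viewed) should show up as a lower purchase rate.
for items_viewed in sorted(by_depth):
    outcomes = by_depth[items_viewed]
    rate = sum(outcomes) / len(outcomes)
    print(f"viewed {items_viewed} item(s): {rate:.0%} of sessions ended in a purchase")
```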

REFERENCES
[1] Bailey, R.W. (2001), User Interface Update - 2001.
[2] Catani, M.B. and Biers, D.W. (1998), Usability evaluation and prototype fidelity: Users and usability professionals, Proceedings of the Human Factors and Ergonomics Society 42nd Annual Meeting.
[3] CUE home page, http://www.dialogdesign.dk/cue.html
[4] Dumas, J.S. and Redish, J.C. (1999/1993), A Practical Guide to Usability Testing, revised edition, Bristol, UK: Intellect.
[5] Ivory, M.Y., Sinha, R.R., and Hearst, M.A. (2000), Preliminary findings on quantitative measures for distinguishing highly rated information-centric web pages, 6th Conference on Human Factors & the Web.
[6] Ivory, M.Y., Sinha, R.R., and Hearst, M.A. (2001), Empirically validated web page design metrics, Proceedings of CHI 2001, 53-60.
[7] Jacobsen, N.E. and John, B.E. (2000), Two case studies in using cognitive walkthroughs for interface evaluation, Computer Science Technical Report Abstracts.
[8] Jacobsen, N.E., Hertzum, M., and John, B.E. (1998), The evaluator effect in usability studies: Problem detection and severity judgments, Proceedings of the Human Factors and Ergonomics Society 42nd Annual Meeting, 1336-1340.
[9] Kessner, M., Wood, J., Dillon, R.F., and West, R.L. (2001), On the reliability of usability testing, CHI 2001 Poster.
[10] Molich, R., Kaasgaard, K., Karyukina, B., Schmidt, L., Ede, M., van Oel, W., and Arcuri, M. (1999), Comparative evaluation of usability tests, CHI 99 Extended Abstracts, ACM Press, 83-84.
[11] Molich, R., Bevan, N., Curson, I., Butler, S., Kindlund, E., Miller, D., and Kirakowski, J. (1998), Comparative evaluation of usability tests, Proceedings of the Usability Professionals' Association.
[12] Nielsen, J., Usability Laboratories: A 1994 Survey, http://www.useit.com/papers/uselabs.html
[13] Nielsen, J. (2000), Why you only need to test with five users, http://www.useit.com/alertbox/20000319.html
[14] Rooden, M.J., Green, W.S., and Kanis, H. (1999), Difficulties in usage of a coffeemaker predicted on the basis of design models, Proceedings of the Human Factors and Ergonomics Society, 476-480.
[15] Spencer, R. (2000), The streamlined cognitive walkthrough method, working around social constraints encountered in a software development company, CHI 2000 Conference Proceedings, 353-359.
[16] Spool, J. and Schroeder, W. (2001), Testing web sites: Five users is nowhere near enough, CHI 2001 Conference Proceedings.
[17] Stanton, N.A. and Stevenage, S.V. (1998), Learning to predict human error: Issues of acceptability, reliability and validity, Ergonomics, 41(11), 1737-1747.


