Guideline Aggregation: Web Accessibility Evaluation for Older Users

Giorgio Brajnik

Yeliz Yesilada and Simon Harper

Dip. di Matematica e Informatica, Università di Udine, Udine, Italy

School of Computer Science, University of Manchester Manchester, UK

[email protected]

[email protected]

ABSTRACT

Web site evaluation methodologies and validation engines take the view that all accessibility guidelines must be met to gain compliance. Problems exist in this regard, as contradictions within the rule set may arise, and the type of impairment or its severity is not isolated. The Barrier Walkthrough (BW) method goes some way toward addressing these issues by enabling barrier types derived from guidelines to be applied to different user categories, such as motor or hearing impairment. In this paper, we use set theory to create a validation scheme for older users by combining barrier types specific to motor impaired and low vision users, thereby creating a new "older users" category from the results of this set addition. To evaluate this approach, we conducted a BW study with four pages, 19 expert and 49 non-expert judges. This study shows that the BW generates reliable data for the proposed aggregated user category and shows how experts and non-experts evaluate pages differently. The study also highlights a limitation of the BW by showing that a better aggregated user category would have been created by associating a severity level of disability with each impairment type. By extending the BW with these impairment levels, we argue that the BW would become more useful for validating Web pages when dealing with users with multiple disabilities, and thus we would be able to create a "Personalised Validation and Repair" method.

Categories and Subject Descriptors H.5.2 [Information Interfaces and Presentation]: User Interfaces— Evaluation/methodology, input devices and strategies, user-centred design, interaction styles; K.4.2 [Computers and Society]: Social Issues—Handicapped persons/special needs, assistive technologies for persons with disabilities

General Terms Human Factors, Experimentation

Keywords Web accessibility guideline, evaluation, older users

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. W4A2009 - Technical, April 20-21, 2009, Madrid, Spain. Co-Located with the 18th International World Wide Web Conference. Copyright 2009 ACM 978-1-60558-561-1 ...$5.00.

1. INTRODUCTION

Web site evaluation methodologies and validation engines take the view that all accessibility guidelines must be met to gain compliance in order to achieve "universal accessibility". Problems exist in this regard, as contradictions within the rule set may arise, and the type of impairment or its severity is not isolated. The consequence is that only generic conclusions can be drawn, potentially covering many diverse impairments simultaneously but failing to specify how severe a problem is with respect to each kind of impairment. The Barrier Walkthrough (BW) method goes some way toward addressing these issues by enabling guidelines to be applied to different user categories such as motor impairment, hearing impairment, low vision, blindness, cognitive impairment, etc. [3, 2]. Adoption of the BW yields accessibility problems grouped by user categories and, within these, by severity levels. Thus the evaluator can state that, for example, a missing "skip-links" feature has a certain severity with respect to blind people, and another with respect to motor impaired people. The BW assumes that an accessibility barrier is, or is not, appropriate with respect to a user category, and that user categories are disjoint. For example, the missing "skip-links" feature is a barrier appropriate to blind people, but not to cognitively impaired ones. The consequence is that when the question "how do you evaluate a page for users with multiple disabilities?" arises, it is difficult to get an equally detailed and specific answer: another assessment has to be carried out, this time with respect to a new user category that has to be characterized in terms of appropriate barriers. This is exactly the case for older people: because of the normal aging process, older people experience problems related to vision, hearing, motor skills and cognition [15, 1, 23]. Therefore, we can think of older users as a group of Web users that share problems with a number of primitive user categories. In this paper we investigate the consequences of a simple notion of aggregated user category, based on collecting the union of the accessibility problems that were found for primitive categories. More specifically, we assume that the accessibility barriers that are appropriate for older users are those that are appropriate either to motor impaired users or to users with low vision. Although we are aware that this is a very simplistic approach for defining older users, we believe it can go a long way toward addressing accessibility issues. To investigate this approach, we conducted a BW study with four pages, 19 expert and 49 non-expert judges, and analysed how the correctness of evaluations and the reliability of ratings changed across different conditions (primitive against derived user categories, and expert against non-expert judges). We also studied whether the strength of certain conclusions (for example, which page is better) changed depending on the conditions.

The benefits of aggregation are twofold:
1. When we have an aggregated user category, such as older users with multiple disabilities, this approach ensures that pages are validated against all of their disabilities.
2. On the basis of assessments carried out with respect to a set of primitive user categories, many combinatorial mixes of barriers can be generated through aggregation, and the collected data can be reused without requiring additional assessments.
In brief, this approach paves the way towards assessments of accessibility customized to fit very detailed profile descriptions of target users, thereby improving the effectiveness of accessibility engineering.

2. BARRIER WALKTHROUGH

The barrier walkthrough (BW) method was first introduced in [3] as an analytical technique based on the heuristic walkthrough [22]. An evaluator has to consider a number of predefined possible barriers, which are interpretations and extensions of well-known accessibility principles; they are assessed in context so that appropriate conclusions about user effectiveness, productivity, satisfaction and safety can be drawn, and severity scores can be derived. For the BW, the context comprises user categories (like blind users), website usage scenarios (like using a given screen reader), and user goals (corresponding to use cases). An accessibility barrier is any condition that makes it difficult for people to achieve a goal when using the website in the specified context. A barrier can be described in terms of i) the user category involved, ii) the type of assistive technology being used, iii) the goal that is being hindered, iv) the features of the pages that raise the barrier, and v) further effects of the barrier on payoff functions. The BW prescribes that severity is graded on a 1–2–3 scale (minor, major, critical), and is a function of impact (the degree to which the user goal cannot be achieved within the considered context) and persistence (the number of times the barrier shows up while a user is trying to achieve that goal). Potential barriers to be considered are derived by interpretation of relevant guidelines (WCAG, http://www.w3.org/TR/WAI-WEBCONTENT/) and principles [7]; more details are available at [2]. There are two major benefits of the BW compared to a conformance review. First, by listing possible barriers grouped by user categories, evaluators are more constrained in determining whether a barrier actually occurs. Secondly, by forcing evaluators to consider usage scenarios, an appropriate context is available to them for rating the severity of the problems found. Experimental evaluations of the BW [3] showed that it is more effective than conformance reviews in finding more severe problems and in reducing false positives; however, it is less effective in finding all the possible accessibility problems. Other studies showed how the BW can be used as a basis for measuring the accessibility level of a website rather than its conformance level [4].
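To make the ingredients of a barrier concrete, the sketch below records the elements i)–v) plus a severity rating as a small data structure; the field names and example values are ours, purely illustrative, and not part of the BW specification.

```python
from dataclasses import dataclass

# Illustrative record of one BW barrier assessment; field names are our own.
@dataclass
class BarrierAssessment:
    barrier_type: str          # e.g. "Skip links not implemented"
    user_category: str         # i) e.g. "blind", "low vision", "motor impaired"
    assistive_technology: str  # ii) technology in use in the scenario
    hindered_goal: str         # iii) the user goal being hindered
    page_feature: str          # iv) the page feature raising the barrier
    severity: int              # 0 = absent, 1 = minor, 2 = major, 3 = critical

example = BarrierAssessment(
    barrier_type="Skip links not implemented",
    user_category="motor impaired",
    assistive_technology="keyboard-only navigation",
    hindered_goal="reach the main content quickly",
    page_feature="long repeated navigation menu before the content",
    severity=2,
)
print(example)
```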

3. BARRIER AGGREGATION

The main purpose of this study is to investigate the consequences of aggregating primitive user categories. To understand whether the approach works, we frame the problem as a comparison between the kinds of conclusions that can be drawn from primitive user categories and those that can be drawn from the aggregated ones. If we find that aggregation obfuscates some conclusions, or that some distinctions that are significant in the primitive categories become very uncertain using the aggregated data, then we would conclude that the approach does not work. More specifically, we define the category "older users" as users


that face a common set of barriers with either low vision or motor impaired users. We then investigate the following questions:
1. What are the common barriers for the low vision, motor impaired and aggregated older user categories?
2. How does the reliability of the results change for primitive user categories and for the derived ones?
3. Which barriers are rated correctly and truly?
4. Do expert and non-expert judges evaluate pages differently?
5. Which page is better in terms of accessibility with respect to primitive and derived user categories?
From these answers we draw conclusions as to how aggregation of primitive user categories affects the kinds of claims that can be made regarding the accessibility of websites, and how aggregation affects differences in performance between expert and non-expert evaluators.

3.1 Participants

Nineteen expert judges (15 males and four females, mean age 40, sd=11.4) and 49 non-expert judges (38 males and 11 females, mean age 23.9, sd=3.98) participated in our study. As can be seen from Table 1, our expert participants are highly experienced in testing websites for accessibility: 53% had worked as a Web accessibility consultant, and 63% of them had assessed 10 or more websites in the last 6 months. On the other hand, our non-expert judges were mainly students who were at that time taking a course on Web accessibility and Web evaluation, and none of them had worked as a Web accessibility consultant.

                                                          Expert        Non-expert    All
Subjective knowledge rating (1: very low – 5: very high)  4.6 (sd=0.6)  2.3 (sd=0.9)  2.9 (sd=1.3)
Worked as a consultant                                    53%           none          15%
Tested more than 10 websites in the last 6 months         63%           2%            19%

Table 1: Demographics data of the 68 participants (19 experts and 49 non-experts).

When we consider our global data (i.e., the data of both expert and non-expert judges) and investigate the relationship between the rating of knowledge of Web accessibility, having worked as a Web accessibility consultant, and the number of websites tested, we see that these are significantly related. As expected, there is a moderate association between being a consultant and the number of websites tested (χ2(1) = 23.7, p < 0.0001, Cramer's φ = 0.59), and a stronger relationship between the subjective rating of knowledge of Web accessibility and the number of websites tested (χ2(4) = 34.8, p < 0.0001, φ = 0.72).
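A sketch of how such an association between two categorical variables can be computed; the 2x2 contingency counts below are invented for illustration and are not our study data.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Invented 2x2 table: rows = worked as a consultant (no/yes),
# columns = tested more than 10 websites in the last 6 months (no/yes).
table = np.array([[45, 5],
                  [3, 15]])

chi2, p, dof, expected = chi2_contingency(table, correction=False)

# Cramer's phi for a 2x2 table: sqrt(chi2 / n).
n = table.sum()
phi = np.sqrt(chi2 / n)

print(f"chi2({dof}) = {chi2:.1f}, p = {p:.4g}, phi = {phi:.2f}")
```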

3.2 Materials

The same set of materials was used with both expert and non-expert judges. The following four pages were used: 1. the "I love god father movie" Facebook group; 2. The Godfather at IMDB; 3. Hall's Harbour Quilts, Halifax; 4. Sam's Chop House, Manchester. The first two are professionally designed, template-based pages, while the last two are simple HTML pages that are most probably handcrafted. We chose these pages because they are typical and representative of both groups. The Facebook and Internet Movie Database (IMDB) pages are in the top 100 most widely used pages ranked by Alexa (http://www.alexa.com). Even though the last two pages are not in the top


100, these pages are typical long-tail pages [21]. They are not as widely used as Facebook or IMDB, but they are within the interests of a small community. In this study, each judge evaluated one page, except for two expert judges who evaluated two pages each; the pages assigned to judges were randomised. Barriers tested in the study can be found at [2]. Each judge was given a sheet with a randomized list of barriers to counterbalance order effects. The same list was repeated once for each of the primitive user categories considered.

Judge        Total evaluations   Time (min.)   Effort      Confidence   Productivity
Expert       21                  107 (75)      3.4 (0.9)   3.9 (0.9)    3.3 (0.9)
Non-expert   51                  299 (110)     3.4 (0.8)   2.5 (0.9)    2.9 (0.7)
All          72                  243 (133)     3.4 (0.8)   2.9 (1.1)    3.0 (0.8)

Table 2: Subjective ratings of the judges (1 = very low, 5 = very high; numbers are the simple means and those in parentheses are the standard deviations).

3.3 Procedure

When participants agreed to take part in this study, they were given a judge number and asked to follow the instructions on the experiment Web page (study page: http://hcw.cs.manchester.ac.uk/research/riam/experiments/riam-samba.php). They completed the study in their own time and working environment. They were asked to follow a procedure with the following three stages:
1. Introduction Part: Participants were asked to read an information sheet and to answer some screening questions about demographics and expertise.
2. Main Part: Using the given judge number, participants were first asked to download the corresponding barriers sheet and to evaluate the appropriate Web page by filling in that sheet. They were allowed to use any evaluation tool, browser extension or technique they liked. Participants were asked to evaluate each barrier with respect to low vision and motor impaired users. For each barrier and user category, they were asked to check whether that barrier exists. If it did not exist, they were asked to enter 0 or leave the cell blank; if it existed, they were asked to specify the severity on a three-point scale (1=minor, 2=significant, 3=critical) and also to explain the rationale for their rating.
3. Conclusion Part: Participants were asked to fill in a post-evaluation questionnaire. This questionnaire aims to capture how long it took to complete the study, the tools and techniques used, and participants' subjective ratings of the level of effort and productivity required and of their confidence in their evaluations.

3.4 Results

In total there were 21 evaluations by our expert judges and 50 by our non-expert judges. Table 2 summarises the total number of evaluations along with the mean values and standard deviations of the subjective ratings and completion times of our judges. As suggested by the table, our expert judges spent significantly less time (107 vs 299 min., two-tailed t-test T = 8.59(53.9), p < 0.0001, with a large effect size d = 2.23), and they found themselves slightly more productive (3.3 vs 2.9 on a scale from 1=very low to 5=very high, T = 2.12(30), p = 0.042, d = 0.55) and more confident (3.9 vs 2.5, T = 6.3(39.7), p < 0.0001, d = 1.63) than non-experts. In the post-evaluation questionnaire, participants were asked to list the tools and techniques they used to evaluate the Web pages, and Table 3 shows the top five most commonly used tools among our expert and non-expert judges.
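The comparisons above are of the kind sketched below: Welch's two-sample t-test (suggested by the fractional degrees of freedom reported) together with Cohen's d as an effect size. The two samples are randomly generated stand-ins, not the recorded completion times.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Stand-in completion times in minutes; invented, not the study data.
expert_times = rng.normal(107, 75, size=21)
non_expert_times = rng.normal(299, 110, size=51)

# Welch's t-test (unequal variances).
t, p = ttest_ind(expert_times, non_expert_times, equal_var=False)

def cohens_d(a, b):
    """Cohen's d using a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

d = abs(cohens_d(expert_times, non_expert_times))
print(f"t = {t:.2f}, p = {p:.4g}, d = {d:.2f}")
```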

Common Barrier Types
What are the common barriers for low vision, motor impaired and aggregated older user categories?

In studies like this, an important design decision is how to identify the correct answers provided by participants. In our case this is needed in order to be able to identify the set of true barriers, given a page and a user category.

Tool                    URL
WAT toolbar             www.visionaustralia.org.au
WAVE                    wave.webaim.org
Web Developer toolbar   webdevelopertoolbar.com
W3C Markup validator    validator.w3.org
Jaws                    www.freedomscientific.com

Table 3: Top five tools used by the judges.

Luckily, this time we could draw on the opinions and judgments provided by 19 leading experts in the field (several of them were recruited among attendees of ASSETS 2008). However, as often happens with accessibility, there was also disagreement. To cope with such unavoidable subjectivity, we adopted a majority rule to determine when a barrier was correctly rated. More specifically, given a page and a user category, a barrier is correctly rated if the majority of the experts who rated it agreed on its severity. Hence, given a page and a user category, the set of true barriers is given by the barriers with positive severity that were correctly rated. As a consequence, the set of true barriers for a set of pages is the union of the corresponding sets; and the set of true barriers for the aggregated user category is also the union of the corresponding barriers (see Table 4).

User Group            True Barrier Types
Low vision (LV)       LV.all = Union(IMDB.LV, Quilts.LV, Facebook.LV, Sams.LV)
Motor Impaired (MI)   MI.all = Union(IMDB.MI, Quilts.MI, Facebook.MI, Sams.MI)
Older Users           Union(LV.all, MI.all)

Table 4: Method used to identify true barrier types (i.e., IMDB.LV means all true barrier types on IMDB).

Based on this method, with our global data, if we consider only barrier types with severity>0, 27 true barrier types were identified out of a total of 61 unique barrier types. These barrier types are listed in Table 5 and the details of these barriers can be found in [2]. Of these, 17 barrier types were identified for motor impaired people and 25 for users with low vision; some (but not all) barriers are common to these two primitive categories.
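A minimal sketch of the majority rule and the union-based aggregation described above; the function names and the toy ratings are ours, for illustration only.

```python
from collections import Counter

def correctly_rated(ratings):
    """Majority rule: return the agreed severity if the majority of the
    experts who rated this barrier gave the same severity, else None."""
    severity, votes = Counter(ratings).most_common(1)[0]
    return severity if votes > len(ratings) / 2 else None

def true_barriers(ratings_by_barrier):
    """ratings_by_barrier maps a barrier type to the list of expert severities
    for one page and one user category; true barriers are correctly rated
    barriers with severity > 0."""
    result = set()
    for barrier, ratings in ratings_by_barrier.items():
        severity = correctly_rated(ratings)
        if severity is not None and severity > 0:
            result.add(barrier)
    return result

# Toy ratings for a single page (invented, not the study data).
page_lv = {"Inflexible page layout": [2, 2, 3], "Mouse events": [0, 0, 1]}
page_mi = {"Mouse events": [3, 3, 2], "Links/button are too small": [1, 1]}

lv_all = true_barriers(page_lv)   # in the study: union over all four pages
mi_all = true_barriers(page_mi)
older_all = lv_all | mi_all       # aggregated "older users" category
print(sorted(older_all))
```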

Reliability
How does the reliability of the results change for primitive user categories and for the derived ones?

Reliability is the extent to which independent evaluations produce the same result; it is an important property for evaluation methods, as it guarantees consistency in results. In order to assess the reliability of the BW, we use reproducibility and agreement.


1. Ambiguous links
2. Cascading menu
3. Dynamic menu in JavaScript
4. Forms with no LABEL tags
5. Functional images lacking text
6. Images used as titles
7. Inflexible page layout
8. Insufficient visual contrast
9. Internal links are missing
10. Layout tables
11. Links/button are too small
12. Links/button too close to each other
13. Long URIs
14. Minimize markup
15. Mouse events
16. Moving content
17. New windows
18. No keyboard shortcuts
19. No page headings
20. No stylesheet support
21. Page Size Limit
22. Scrolling
23. Skip links not implemented
24. Text cannot be resized
25. Too many links
26. Using stylesheets
27. Valid markup

Table 5: Common true barriers for older people, i.e., barrier types identified for low vision (LV) users, motor impaired (MI) users, or both.

Figure 1: Mean values of reproducibility for global data (LV - low vision and MI - motor impaired users).

Website     Mean Reproducibility            Agreement
            LV      MI      Older           LV      MI      Older
Facebook    0.33    0.38    0.35            0.22    0.19    0.21
Sams        0.14    0.28    0.21            0.27    0.28    0.27
IMDB        0.18    0.39    0.28            0.21    0.34    0.28
Quilts      0.40    0.40    0.40            0.27    0.35    0.31

Table 6: Reliability with respect to Web pages (sorted by agreement on "Older").

Reproducibility measures the variability of the ratings of a barrier type and is defined as follows (the same definition was used in [22]): the reproducibility of a barrier type, given a page, a user category and a set of ratings by different judges for that barrier on that page with respect to that user category, is r = max{0, 1 - sd/M}, where M is the mean of the weighted severities and sd is their standard deviation (sd/M is often called the coefficient of variation). (Severity levels can be 0, 1, 2 or 3, but these are categorical values and not numbers: severity 3 does not mean that a barrier is three times more critical than a severity 1 barrier; therefore we transformed them into weighted severities as follows: 0, 1, 5, 9. In this way we are assuming that every critical barrier has the same impact on a user.) When reproducibility is close to 1, the standard deviation is very small compared to the mean; in our case this implies that the variability of the weighted ratings between our judges is low. Another measure of reliability is the level of agreement between judges. Rather than simply computing the mean of the correlations between pairs of judges (which reflects only relative agreement), given all the ratings that a set of judges gave to all the barriers with respect to a page and a user category, the intraclass correlation coefficient was applied to the weighted severity ratings. This index, ranging from -1 to 1, measures both the relative and the absolute agreement between judges and provides a different measure of reliability.

Figure 1 shows the reproducibility data for each user category. We can see that relative differences between pages that show up for primitive user categories also show up for the aggregated category. It is also easy to see that reproducibility with respect to motor impaired users is consistently higher than that with respect to low vision users. The mean value of reproducibility over all websites is 0.31 for older users, 0.26 for users with low vision and 0.36 for motor impaired users. When testing whether reproducibility differs significantly from page to page (using all the ratings for the derived user category, or only those of a primitive category), we see that: for motor impaired users, pages are not significantly different from each other; for low vision, only Quilts differs from Sams; for older users, again only Quilts differs from Sams. All other pairs under each condition are not significantly different in terms of mean reproducibility (pairwise Wilcoxon test with global α = 0.05 and Holm correction was used).

When we look at our data globally, the following six barrier types are the most reproducible for older users (considering only those with severity>0): 1. Inflexible page layout, 2. Scrolling, 3. Long URIs, 4. Valid markup, 5. Internal links are missing, 6. Cascading menu; and these are the least reproducible ones: 1. Fields validation, 2. Rich images badly positioned, 3. Too short timings, 4. ASCII art, 5. Window without user controls, 6. Frame without title. Further investigation is needed to understand why these barrier types differ. Table 6 shows the reproducibility and agreement figures for each page. As can be seen from this table, for all user categories Quilts has the highest and Sams has the lowest reproducibility value. Similarly, for low vision people, our judges had the highest agreement on Quilts and Sams, and the lowest agreement on IMDB. For motor impaired people, the highest agreement was on Quilts and the lowest on Facebook. For older users, the highest agreement was on Quilts and the lowest on Facebook. While it is not surprising that agreement and reproducibility differ a little, it is worth noting that the ranking of pages that we get when focussing on a primitive user category is close to that for older users, both for agreement and for reproducibility. Reproducibility results with respect to judge types are discussed in the "Expert vs. non-expert judges" section below.
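A small sketch of this reproducibility index for one barrier type; the judge ratings below are invented, and the handling of an all-zero mean is our own assumption, since the definition above leaves that case implicit.

```python
import statistics

WEIGHT = {0: 0, 1: 1, 2: 5, 3: 9}  # severity level -> weighted severity

def reproducibility(severities):
    """r = max(0, 1 - sd/M) over the weighted severities that different judges
    gave to one barrier type on one page for one user category."""
    weighted = [WEIGHT[s] for s in severities]
    mean = statistics.mean(weighted)
    if mean == 0:
        return 1.0  # assumption: every judge rated the barrier as absent
    sd = statistics.stdev(weighted)
    return max(0.0, 1 - sd / mean)

# Invented ratings from five judges for one barrier type.
print(round(reproducibility([2, 2, 3, 2, 1]), 2))
```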

True Barriers: accuracy and sensitivity
Which barriers are rated correctly and truly?

Based on the notion of correct rating of a barrier introduced above, given a set of severity ratings, the error rate is the proportion of ratings that are incorrect.


Judge         Low vision   Motor impaired   Older
Expert        14.4         10.7             12.5
Non-expert    19.7         14.4             17.0
Global data   18.1         13.3             15.7

Table 7: Error rates (as percentages) in identifying true barriers.

                   Facebook   IMDB   Quilts   Sams
lv.expert          1          4      2        3
mi.expert          1          4      3        2
older.expert       1          4      2        3
lv.non-expert      2          3      1        4
mi.non-expert      3          1      2        4
older.non-expert   3          2      1        4
lv.global          2          4      1        3
mi.global          3          2      1        4
older.global       2          3      1        4

Table 8: Rankings of pages according to the error rate, under different conditions ("older.global" means on older users data by all the judges; "mi.expert" means on motor impaired data rated by experts; etc.).

Table 7 shows a summary of the error rates for our user categories. When we look at the data globally, over a total of 8784 ratings, the error rate for motor impaired users is 13.3%, for low vision users 18.1% and overall 15.7%. It is readily seen from Table 7 that moving from primitive to aggregated user categories preserves certain patterns; for example, the difference in error rates between types of judges. Table 8 shows how pages are ordered with respect to the error rates. It is worth noting that within groups of judges (either all of them, experts only or non-experts only) the ranks do not change much when moving from the primitive user categories to the aggregated one. Furthermore, not all of these differences between error rates are significant (pairwise χ2 test with Holm adjustment and global α = 0.05); but when they are, they tend to be for the same pairs of pages across primitive vs. aggregated user categories.

On the basis of the notion of correct rating of a barrier introduced above, we can define ways to measure the correctness of an evaluation (given a judge, a page and a user category). Given a page and a user category, we define the true barriers (TB) as the set of all correctly rated barriers with severity>0 that judges found, and, given a page, a user category and a judge, the found barriers (FB) as the set of barriers with severity>0 reported by that judge (regardless of whether they are correctly rated or not). These sets can be used to define three indexes:

Accuracy A = |TB∩FB| / |FB| is the proportion of reported barriers that are also correct.
Sensitivity S = |TB∩FB| / |TB| is the proportion of all the true barriers that were reported.
F-measure F = 2·A·S / (A+S) is the harmonic mean of A and S, a balanced combination of the two summarizing the correctness of an evaluation.

When we also look at the quality of the evaluations, for older users the F-measure values are: 0.295 for Facebook, 0.508 for IMDB, 0.462 for Quilts and 0.546 for Sams. This tells us that the quality of the evaluation was best on Sams and worst on Facebook. Table 9 shows the overall F-measure values for our user categories and judge types. An ANOVA test tells us that F-measure, which is normally distributed, is mostly affected by judge type (expert/non-expert) and by page, but not by user category; there are no significant interactions.
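A minimal sketch of these three indexes computed from a true-barrier set TB and a judge's found-barrier set FB; the sets below are invented for illustration.

```python
def accuracy(tb, fb):
    """Proportion of reported barriers that are also true barriers."""
    return len(tb & fb) / len(fb) if fb else 0.0

def sensitivity(tb, fb):
    """Proportion of the true barriers that were reported."""
    return len(tb & fb) / len(tb) if tb else 0.0

def f_measure(tb, fb):
    """Harmonic mean of accuracy and sensitivity."""
    a, s = accuracy(tb, fb), sensitivity(tb, fb)
    return 2 * a * s / (a + s) if a + s else 0.0

# Invented example: 27 true barriers; a judge reports 20 barriers, 15 of them true.
tb = set(range(27))
fb = set(range(15)) | {100, 101, 102, 103, 104}
print(round(accuracy(tb, fb), 3), round(sensitivity(tb, fb), 3), round(f_measure(tb, fb), 3))
```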

Judge         Low vision    Motor impaired   Older
Expert        58 [51, 66]   54 [45, 63]      56 [51, 62]
Non-expert    40 [35, 44]   41 [35, 47]      40 [37, 44]
Global data   45 [41, 49]   45 [40, 50]      45 [42, 48]

Table 9: F-measure values and 95% confidence intervals (as percentages) for user categories and judge types.

              Reproducibility                              Agreement
Judge Type    Low vision    Motor impaired   Older         Low vision   Motor impaired   Older
Expert        0.51 (0.49)   0.60 (0.48)      0.56 (0.49)   0.28         0.28             0.28
Non-expert    0.33 (0.47)   0.44 (0.49)      0.38 (4.82)   0.25         0.30             0.28

Table 10: Reliability with respect to judge types.

It is also worth noticing that the 95% confidence intervals around the mean value of F-measure, whose widths represent the amount of uncertainty we have about the true value, shrink when we move from a primitive to the aggregated category, leading to more precise results (in this case for F-measure). Also, the fact that the interval for expert judges is disjoint from that of non-expert judges is preserved when we move to the aggregated category.

Expert vs. non-expert judges
Do expert and non-expert judges evaluate pages differently?

Based on the definition of true barriers given above, our non-expert judges did not identify all the barriers that were identified by our expert judges. Across all pages, for older users our expert judges identified 27 barrier types and our non-expert judges identified 24 of these. In brief, they did not identify the following three barrier types: 1. Forms with no LABEL tags, 2. Moving content, 3. No stylesheet support. The same trend exists for our primitive user groups: for low vision users our non-expert judges missed two barrier types, and for motor impaired users they missed four barrier types.

Figure 2 shows the average weighted severity of experts, of non-experts and globally, for all user categories on all pages. Experts gave more extreme ratings than non-experts. For example, for low vision users, on Facebook expert judges gave significantly lower ratings (W = 109926, p < 0.0004) and on IMDB they gave significantly higher ratings (W = 135384, p < 0.0001). For motor impaired users, on Facebook expert judges gave significantly lower ratings (W = 111869, p < 0.0017) and on IMDB they gave significantly higher ratings (W = 128041, p < 0.018).

When we look at the reproducibility data, ratings by experts are more reproducible (see Table 10 and Figure 3). Application of the Wilcoxon test shows that the reproducibility of experts and non-experts differs significantly, both for the primitive groups and for older users. For older users, the reproducibility for experts (M=0.56) and non-experts (M=0.38) is significantly different (p < 0.0001). For low vision users, the difference between experts (M=0.51) and non-experts is significant (p < 0.0001). Finally, for motor impaired users, the difference between experts (M=0.60) and non-experts (M=0.44) is also statistically significant (p < 0.0025).

In terms of true barriers, compared to expert judges our non-expert judges had higher error rates, which means they identified fewer true barriers. As can be seen from Table 7, this is consistent across our primitive user categories, and the difference also appears at the older users level.


Figure 2: Weighted severity (Wsev) globally, of experts and non-experts (LV - low vision, MI - motor impaired).

Page        Low vision    Motor impaired   Older
Facebook    0.60          0.51             0.55
IMDB        0.69          0.57             0.63
Quilts      0.63          0.55             0.59
Sams        0.93          0.77             0.85
Mean        0.70 (1.92)   0.59 (1.81)      0.645 (1.86)

Table 11: Average Weighted Severity.


BW Results — which page is better?
Which page is better in terms of accessibility with respect to primitive and derived user categories?

Figure 3: Reproducibility by judge type, page and user categories (LV - low vision, MI - Motor impaired).

When we look at the weighted severities and consider our global data, for older users the average weighted severity across all pages is 0.64 (sd=1.86); for low vision users it is 0.70 (sd=1.92) and for motor impaired people it is 0.59 (sd=1.81). According to the Wilcoxon test, the difference between motor impaired and low vision users is significant (p < 0.0001). Similarly, the differences between motor impaired and older people (p < 0.0025), and between low vision and older people (p < 0.0035), are significant. If we look at the data globally again, Sams has the highest weighted severity for both low vision (M=0.93) and motor impaired people (M=0.77), and Facebook has the smallest weighted severity for both low vision (M=0.60) and motor impaired users (M=0.51); see Table 11. When we look at our global data and the overall weighted severity levels {0, 1, 5, 9}, if we add all of them and then compare pages according to their relative proportions, we get the figures listed in Table 12. For example, of all the weighted severities for the low vision user category, Facebook has 21% of the weighted severities, IMDB has 25%, etc.


Website     Low vision   Motor impaired   Older Users
Facebook    0.21         0.21             0.21
IMDB        0.25         0.24             0.24
Quilts      0.26         0.27             0.27
Sams        0.28         0.27             0.27
Total       1.00         1.00             1.00

Table 12: Proportion of weighted severities across pages.

        Facebook           IMDB               Quilts             Sams
Sev     LV   MI   Older    LV    MI   Older   LV    MI   Older   LV    MI   Older
1       98   84   182      138   79   217     114   77   191     113   85   198
2       63   53   116      88    81   169     65    67   132     67    44   111
3       27   23   50       20    16   36      41    33   74      45    44   89

Table 13: Distribution of severities per Web page for global data (LV - low vision and MI - motor impaired users).

For the older users data, we can say that Facebook's value is significantly smaller than IMDB's (p < 0.0005), Quilts' (p < 0.0001) and Sams' (p < 0.0001), and IMDB's is significantly smaller than Quilts' (p < 0.0104) and Sams' (p < 0.0010). However, the difference between Sams and Quilts is not significant, so they cannot be ordered. Therefore, for older users the ordering of the pages is as follows (≻ means "is better than"): Facebook ≻ IMDB ≻ Quilts and Sams. We can see that also in this case the ranking of the websites is preserved when moving from primitive to aggregated user categories. If we consider the barrier types identified on the pages and look at the global data, for older users the judges identified more different types of barriers on Sams (17 barrier types), followed by IMDB (16 barrier types), then Quilts (8 barrier types) and finally Facebook (5 barrier types). When we look at the distribution of severity levels for each page, for older users we can see the following trend (Table 13 and Figure 2): (i) IMDB has the most barriers with severity 1 and Facebook has the least; (ii) IMDB has the most barriers with severity 2 and Sams has the least; (iii) Sams has the most barriers with severity 3 and IMDB has the least.
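A sketch of how per-page proportions such as those in Table 12 can be derived from the sums of weighted severities; the per-page sums below are invented stand-ins, not the study totals.

```python
# Invented sums of weighted severities per page for one user category.
weighted_sums = {"Facebook": 840, "IMDB": 1000, "Quilts": 1060, "Sams": 1120}

total = sum(weighted_sums.values())
proportions = {page: round(value / total, 2) for page, value in weighted_sums.items()}

# Each page's share of all the weighted severities for that user category.
print(proportions)
```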

4. DISCUSSION

This study demonstrates that the evaluation framework we used to identify and measure certain properties of accessibility assessments is viable. First, being an expert is associated with evaluating many websites, rating oneself as knowledgeable, requiring significantly less time to do the job, and doing it with less effort and more confidence. Second, the notion of "correct answer" can be framed as a majority rule to cope with inevitable disagreements; this can be done even if the severity scale is multilevel. Third, the quality of an assessment can be measured through error rates or through F-measure, an index that reflects both how many false positives and how many false negatives occurred. Such indexes, which measure two different aspects of the "quality" of an assessment (the error rate is closely related to accuracy), can be easily defined in terms of the notion of "correct answer". Fourth, reliability can be monitored using either a measure of variability of the ratings (reproducibility) or agreement among evaluators (the intraclass correlation coefficient); both indexes reflect differences in expertise. Fifth, using weighted severities is a simple way to compare the accessibility levels of websites (though more sophisticated indexes than simple proportions can be defined [4]).

The outcomes of this study suggest that by using the results of the primitive user categories we can also infer similar conclusions for an aggregated user category; in particular for the category "Older users" seen as a combination of "Low vision" with "Motor impaired". This study shows that even though our judges agreed that there are common barrier types for low vision and motor impaired users, some barrier types are still specific to particular user categories (see Table 5). This shows the importance of considering different user categories and their specific needs when pages are evaluated for accessibility. By aggregating the barrier types for motor impaired and low vision users, we address all barriers related to these disabilities. Our data also shows that non-experts, compared to experts, fail to identify some barrier types; however, aggregation does not worsen this phenomenon, since with respect to older users the number of missed barrier types is comparable to that of the original primitive user categories. We showed that some barrier types are more reproducible than others. This is probably because the interpretation and/or application of some barrier types is more difficult or more subjective than that of others. But both measures of reliability produce the same ranking of pages in the primitive user categories as in the aggregated one. In addition, differences between experts and non-experts that show up in the primitive user categories also show up in the aggregated category. The quality of assessments, measured with error rate and F-measure, changes from the primitive user categories to the aggregated category, but important distinctions are preserved. This is the case, for example, for the difference in error rate between experts and non-experts; similarly for F-measure, where aggregation reduces the uncertainty about its true mean value. We also saw that the error rate of our judges was smaller with respect to motor impaired users than with respect to low vision users. Further investigation is needed to find out why. When using weighted severities to compare pages, aggregation preserves differences in the mean values of the sums of all severities and of the proportions. This holds also for differences between experts and non-experts.

Our data also show that experts are more judgemental than non-experts. A possible interpretation is that non-experts preferred to give middle-range ratings to be on the safe side, while experts make stronger claims about the accessibility of the pages (see Figure 2). Finally, because of the average age difference between our expert and non-expert judges, one could think that the non-experts are more familiar with pages like Facebook and that there could be some bias in the way they evaluate such pages. However, our results show that non-experts rated Facebook worse than Quilts, and it is the experts who rated Facebook better than the other pages (see Figure 2). Therefore, we do not think familiarity is a factor that affects the evaluation results, although a further study is required to confirm this. Furthermore, we believe our non-expert judges are more representative of the entire population of Web developers, who in reality tend to perform accessibility evaluations on their own (young developers, as opposed to experienced evaluators).

5. RELATED WORK

The work presented in this paper has connections to research in accessibility for older users and Web accessibility evaluation.

Older users. The world's older population (a term that has been defined in numerous ways; [20] defines it as "over 58") is expected to exceed one billion by 2020 [15]. Research shows that approximately 50% of the older population suffers from disabilities, such as motor impairment, that hinder social interaction [8, 9]. Disability increases with age, and in the UK 50% of those over 65 are disabled (http://www.bbc.co.uk/commissioning/).


As the population ages, the financial requirement to work more years increases, but age-related disability becomes a bar to employment [17]. At present, only 15% of the 65+ age group use the Internet (in the UK: http://www.statistics.gov.uk/), but as the population ages this number will increase significantly. Older users are a diverse group, often experiencing multiple functional limitations; therefore devising a universal strategy for improving their Web experience is not a trivial task. Nevertheless, we argue that only by focusing on the particular needs of older users and tackling particular disabilities can their needs be fulfilled. Research shows that because of the normal aging process, older people experience problems related to vision (i.e., decreasing ability to focus on near tasks, colour perception, contrast sensitivity, etc.), hearing (inability to hear high-pitched sounds), motor skills (diseases like arthritis and Parkinson's disease are common among older users and cause mobility issues) and cognition (older adults may suffer from diseases such as dementia or Alzheimer's disease, and may also suffer from mild cognitive impairments such as difficulty remembering names and numbers) [15, 1, 23]. According to the W3C's literature review on accessibility for older people, many requirements of older people are already addressed by Web accessibility guidelines [1]. Considering this and multiple disabilities, we can say that our work here is a step towards addressing the requirements of older users, but it obviously provides a partial solution. As part of our future work, we are planning to also consider barriers for cognitive and hearing disabilities in our union set. To create a sufficient union set, we also have to be careful not to over-emphasise extreme forms of disability, as they do not address the requirements of most older users [16]. Furthermore, we also need to consider some other barriers, not directly related to disability, that affect the technology use of older users. For example, [15] indicates that the biggest barrier to technology use by older persons is not ageing-related functional impairment, but rather hesitation to explore due to fear of the unknown and of the consequences of incorrect actions. In [6] the authors highlight that attitude and aptitude can vary significantly across age groups; for example, [19] show that the barrier is not age, but the respondent's idea that older people cannot or do not use computers. Therefore, devising a specific set of barriers for older people could help to address these issues as well.

Web Accessibility Evaluation. Aggregation of barrier types, the notion we discuss in this paper, is clearly part of a method to evaluate web accessibility and definitely affects its outcomes. We applied this notion to the BW, but other methods exist, including standards review, user testing, subjective assessments and screening techniques [12, 7]. These methods differ in terms of their effectiveness, efficiency and usefulness, but so far little research has been carried out to analyze these properties. Several studies of usability evaluation methods have shown that user testing methods may fail to yield consistent results when performed by different evaluators [18, 13], and that inspection-based methods are not free of shortcomings either [22, 25, 5, 11]. Although accessibility and usability are two different properties, there is no reason to assume that the kind of uncertainty and mishaps that apply to usability evaluation methods should not apply to accessibility evaluation methods as well. In fact, the relation among accessibility, conformance and usability is a complex and somewhat fuzzy one. The W3C/WAI model of accessibility aims at universal accessibility: it assumes that website conformance to WCAG (Web Content Accessibility Guidelines) is the key precondition to that, and it hypothesizes that accessibility is entailed by a conformant website if the tools used by the web developer (including CMSs) are conformant to ATAG (Authoring Tools Accessibility Guidelines), and if the browser and assistive technology used by the end user are conformant to UAAG (User Agent Accessibility Guidelines) [14]. Because these two conditions are not under the control of the web developer, s/he cannot guarantee accessibility. Empirical evidence shows that the link between conformance and accessibility is missing, i.e. even conformant websites may fail to be accessible [7]. Which accessibility problems are identified and how their severity is rated are two aspects of accessibility investigations that lack substantial standardization, leading to low reproducibility of results. Yet these two aspects are the core of many accessibility guidelines. Ideally, a good method is a dependable tool that yields accurate predictions of all the accessibility problems that may occur in a website. This is why methods can be compared in terms of such criteria as accuracy (the percentage of reported problems that are true problems), sensitivity (the percentage of the true problems being reported), reliability (the extent to which independent evaluations produce the same results), efficiency (the amount of resources expended to carry out an evaluation that leads to specified levels of effectiveness and usefulness), usefulness (the effectiveness and usability of the produced results) and the method's usability (how easily it can be understood, learned and remembered by evaluators); for more details the reader is referred to [22, 10, 11]. We expect many of these issues to be relevant also to aggregation, regardless of its underlying method.

6. CONCLUSIONS AND FUTURE WORK

In this paper, we combined barrier types specific to motor impaired and low vision users, thereby creating a new "older users" category. Although we realise that this is a simplistic approach to defining older users, we believe that the aggregation of primitive user categories is a promising approach. To explore the consequences of such an approach we conducted a study with 67 judges who evaluated four pages for their accessibility using the BW. While we do not claim that we validated the idea of using aggregation to infer accessibility problems with respect to aggregated user categories, we think that several important effects of aggregation were exposed. In brief, this study shows that there are common barrier types between low vision and motor impaired users, but there are also some barrier types that are specific to each user category. When we have an aggregated user category such as older users with multiple disabilities, this approach ensures that pages are validated against their multiple disabilities. We also show that, for the specific aggregation we investigated, important differences that can be found for primitive user categories also show up in the aggregated one, suggesting that aggregation might be a way to produce valid conclusions. However, older people experience problems related not only to vision and motor skills, but also to hearing and cognition [15, 1, 23]. We could easily extend this approach to also include cognitive and hearing disabilities among our primitive user categories; aggregating all these barrier types would then give better coverage of the barrier types that older people face. A more powerful extension of the aggregation mechanism could, however, address another limitation. In the current formulation of the BW we have a number of primitive user categories, which include motor impaired, blind, low vision, etc. However, there is no way to express the fact that a barrier type is more or less appropriate to a user category, other than saying that it is or it is not. For example, the barrier "Links are too close to each other" is very appropriate to people with a mild motor impairment (who might still use a pointing device), but its appropriateness decreases as the person uses the keyboard to interact.


This observation agrees with studies indicating that the requirements of older users cannot be met if we focus on severe disabilities [16]. In addition, the appropriateness of barriers actually also depends on the level of experience that the user has with the browser, with the assistive technology, with the operating system, with features of the user interface, and with accessibility features in the page. Not many users know how to use the keyboard proficiently to interact with the browser, how to jump directly to a certain link using the keyboard, etc. The BW, and our notion of aggregation, cannot be used to consider user profiles where expertise is considered explicitly. We envision several ways to extend the aggregation idea to cope with these requirements as well. One of them is to use a fuzzy definition of the appropriateness of a barrier to a user category, so that the relationship becomes a degree of appropriateness rather than being binary. We believe that by extending the BW with these impairment levels, the BW would become more useful for evaluating web pages when dealing with users with multiple disabilities, and could thus be used to produce personalized accessibility evaluations; this is similar in spirit to what Vigo et al. discussed in [24], but achieved in a different way. Finally, as part of our future work, we are planning to compare our results with the results of the W3C's literature review on accessibility for older people [1]. This will be an alternative way of validating our approach.

7. ACKNOWLEDGEMENT

This work is part of a collaboration between the UK EPSRC funded RIAM project (EP/E002218/1), and the University of Udine. As such the authors would like to thank both organisations for their continued support. We would also like to thank all our participants for their valuable time and effort, and two anonymous reviewers for their help in improving this paper.

8. REFERENCES

[1] A. Arch. Web accessibility for older users: A literature review. W3C, 2008. http://www.w3.org/TR/wai-age-literature.
[2] G. Brajnik. Barrier walkthrough: Heuristic evaluation guided by accessibility barriers. http://users.dimi.uniud.it/~giorgio.brajnik/projects/bw/bw.html.
[3] G. Brajnik. Web accessibility testing: When the method is the culprit. In 10th International Conference on Computers Helping People with Special Needs (ICCHP), 2006.
[4] G. Brajnik and R. Lomuscio. Samba: a semi-automatic method for measuring barriers of accessibility. In Assets '07: Proceedings of the 9th international ACM SIGACCESS conference on Computers and accessibility, pages 43–50, New York, NY, USA, 2007. ACM.
[5] G. Cockton and A. Woolrych. Understanding inspection methods: lessons from an assessment of heuristic evaluation. In A. Blandford and J. Vanderdonckt, editors, People & Computers XV, pages 171–192. Springer-Verlag, 2001.
[6] K. P. Coyne and J. Nielsen. Web usability for senior citizens: 46 design guidelines based on usability studies with people age 65 and older. Nielsen Norman Group Report, 2002.
[7] DRC. Formal investigation report: web accessibility. Disability Rights Commission, www.drc-gb.org/publicationsandreports/report.asp, April 2004. Visited Jan. 2006.
[8] A. D. Fisk, W. A. Rogers, N. Charness, S. J. Czaja, and J. Sharit. Designing for Older Adults. CRC, 2004.

[9] J. Fozard. The handbook of the psychology of ageing, chapter Vision and hearing in aging, pages 150–170. Academic, 1990.
[10] W. Gray and M. Salzman. Damaged merchandise: a review of experiments that compare usability evaluation methods. Human–Computer Interaction, 13(3):203–261, 1998.
[11] H. R. Hartson, T. S. Andre, and R. C. Williges. Criteria for evaluating usability evaluation methods. Int. Journal of Human-Computer Interaction, 15(1):145–181, 2003.
[12] S. Henry and M. Grossnickle. Just Ask: Accessibility in the User-Centered Design Process. Georgia Tech Research Corporation, Atlanta, Georgia, USA, 2004. On-line book: www.UIAccess.com/AccessUCD.
[13] M. Hertzum, N. Jacobsen, and R. Molich. Usability inspections by groups of specialists: Perceived agreement in spite of disparate observations. In CHI 2002 Extended Abstracts, pages 662–663. ACM, ACM Press, 2002.
[14] B. Kelly, D. Sloan, S. Brown, J. Seale, H. Petrie, P. Lauke, and S. Ball. Accessibility 2.0: people, policies and processes. In W4A '07: Proc. of the 2007 international cross-disciplinary conference on Web accessibility (W4A), pages 138–147, New York, NY, USA, 2007. ACM.
[15] S. H. Kurniawan. Ageing. In S. Harper and Y. Yesilada, editors, Web Accessibility: A Foundation for Research, Human-Computer Interaction Series, chapter 5, pages 47–58. Springer, London, 1st edition, September 2008.
[16] S. Milne, A. Dickinson, A. Carmichael, D. Sloan, R. Eisma, and P. Gregor. Are guidelines enough? An introduction to designing web sites accessible to older people. IBM Systems Journal, 44(3):557–571, 2005.
[17] J. Mitchell, R. Adkins, and B. Kemp. The effects of aging on employment of people with and without disabilities. Rehabilitation Counseling Bulletin, 49(3), 2006.
[18] R. Molich, N. Bevan, I. Curson, S. Butler, E. Kindlund, D. Miller, and J. Kirakowski. Comparative evaluation of usability tests. In Proc. of the Usability Professionals Association Conference, Washington, DC, June 1998.
[19] A. Morris, J. Goodman, and H. Brading. Internet use and non-use: views of older adults. Universal Access in the Information Society, 61(1):43–57, 2007.
[20] T. Nichols, W. Rogers, A. Fisk, and L. West. How old are your participants? An investigation of age classifications as reported in human factors. Human Factors and Ergonomics Society Annual Meeting Proceedings, Aging, pages 260–261, 2001.
[21] B. Schwartz. The Paradox of Choice: Why More Is Less. Harper Perennial, 2005.
[22] A. Sears. Heuristic walkthroughs: Finding the problems without the noise. International Journal of Human-Computer Interaction, 9:213–234, 1997.
[23] C. Tobias. Computers and the elderly: A review of the literature and directions for future research. In Proceedings of the Human Factors Society 31st annual meeting, pages 866–870, Santa Monica, CA, 1987.
[24] M. Vigo, A. Kobsa, M. Arrue, and J. Abascal. User-tailored web accessibility evaluations. In HyperText 2007, pages 95–104, Manchester, UK, Sept. 2007. ACM.
[25] A. Woolrych and G. Cockton. Assessing heuristic evaluation: mind the quality, not just the percentages. In Proc. of HCI 2000, pages 35–36, 2000.
