
Research Paper

Is Relevance Relevant? User Relevance Ratings May Not Predict the Impact of Internet Search on Decision Outcomes

ENRICO W. COIERA, VICTOR VICKLAND

Abstract

Objective: A common measure of Internet search engine effectiveness is its ability to find documents that a user perceives as 'relevant'. This study sought to test whether user-provided relevance ratings for documents retrieved by an Internet search engine correlate with the decision outcome after use of a search engine.

Design: 227 university students were asked to answer four randomly assigned consumer health questions, then to conduct an Internet search on one of two randomly assigned search engines of different performance, and then to answer the question again.

Measurements: Participants were asked to provide a relevance score for each document retrieved, as well as a pre- and post-search answer to each question.

Results: User relevance rankings had little or no predictive power. Relevance rankings were unable to predict whether the user of a search engine could correctly answer a question after search, and could not differentiate between two search engines with statistically different performance in the hands of users. Only when users had strong prior knowledge of the questions, and the decision task was of low complexity, did relevance appear to have modest predictive power.

Conclusions: User-provided relevance rankings taken in isolation seem to be of limited to no value when designing a search engine that will be used in a general-purpose setting. Relevance rankings may have a place in situations in which experts provide the rankings, and decision tasks are of complexity commensurate with the abilities of the raters. A more natural metric of search engine performance may be a user's ability to accurately complete a task, as this removes the inherent subjectivity of relevance rankings and provides a direct and repeatable outcome measure that correlates directly with the performance of the search technology in the hands of users.

J Am Med Inform Assoc. 2008;15:542-545. DOI 10.1197/jamia.M2663.

Affiliation of the authors: Centre for Health Informatics, University of New South Wales, New South Wales, Australia.

This research was supported by Australian National Health and Medical Research Council (NH&MRC) project grant 300435.

Correspondence: Enrico W. Coiera, University of New South Wales, Centre for Health Informatics, UNSW 2055, Australia; e-mail: <[email protected]>.

Received for review: 11/12/07; accepted for publication: 02/21/08.

Introduction

Internet search technologies have rapidly become essential tools, both in support of scientific research and in everyday life. While search engines are unquestionably powerful, there remain substantial challenges in engineering tools that reliably identify the scientific material most likely to answer a specific question. For developers of search engines, finding a reliable method to evaluate system performance remains a challenge. In general, there are two broad approaches to search engine evaluation. First, one can extrapolate a system's performance on the Web by testing its ability to retrieve known documents from a finite and curated test set.1 Second, in the common circumstance where no such test set exists, one can estimate a search engine's performance by measuring whether documents retrieved from the Web are perceived to be relevant by a group of users.2 These relevance ratings are used to generate precision and recall curves, which can then be used to benchmark the performance of a search engine. Indeed, it has been said that "relevance" is one of the central concepts of the information retrieval sciences.3 The true meaning of relevance remains the subject of some controversy, and others have noted that judgments of relevance are inherently subjective, varying between individuals, professional groups, and tasks. Consequently, researchers have focused on identifying how much concordance there can be between the relevance estimates of different groups, under varying circumstances.4,5

While it is recognized that counting the number of relevant documents retrieved does not directly measure whether those documents actually satisfied a user's information needs,6 it is surprising that no one appears to have examined how effectively such proxy relevance measures predict the outcome of the decision made after conducting a search. In other words, how likely are individuals whose search results contain 'relevant' documents to perform better on a decision task than those whose searches return more 'irrelevant' material? In this work, we report an experiment that tests whether user-provided relevance ratings actually correlate with a successful decision outcome after search (Figure 1). If we are to develop computational systems that can actively support decision-making, then we need to develop models not just of the process of seeking and reviewing documents, but also of the consequent impact on decision-making.7

Figure 1. Traditionally, relevance studies have estimated relevance scores but have not examined whether relevance is predictive of performance on decision tasks once a search is done.

Methods

To test the hypothesis that user-assigned relevance judgments are a good predictor of the performance of a search engine in supporting a user to identify and apply information, we performed an online experiment. To control for the potential impact of search engine performance, user knowledge, and task, we asked subjects to answer four randomly sequenced questions about the diagnosis or treatment of conditions, while using one of two search engines, which were also randomly assigned for each question.

We recruited 227 undergraduate and postgraduate participants from three separate university campuses over an eight-week period, by advertising on notice boards and distributing flyers by hand. As an incentive, we offered each participant A$20 upon completion of the experiment. Participants enrolled in and completed the experiment using a web browser, from a computer and location of their own choosing. Each participant had their own unique user identification. For each question they were asked to answer, participants were randomly assigned to one of two versions of a specialist search engine.8 These search engines had identical user interfaces but different performance characteristics. One version of the system (the 'clinical search engine') was optimized to find health information that could support diagnosis or treatment decisions, and had undergone previous trials to evaluate its effectiveness in supporting such searches.9 The second version (the 'general search engine') was designed to find only general educational materials, and was expected to perform significantly less effectively in identifying documents that could specifically answer diagnosis or treatment questions.

The experiment asked participants to answer four questions (Table 1), which were delivered to them in random order. The questions were designed so that there was variation in prior knowledge and difficulty across the decision tasks. Participants were asked to provide an answer for each question before and after using the assigned search engine. The search engines retrieved a maximum of 10 documents in response to keywords provided by participants, and subjects were asked to provide a relevance score for each document retrieved, on a scale of 0 to 3 (0 indicated an irrelevant document and 3 a completely relevant document). A page of 10 results could therefore receive a maximum relevance score of 30.

Participants produced a data set of 808 search sessions comprising before- and after-search answers together with confidence and relevance rankings. Receiver operating characteristic (ROC) curves, which plot sensitivity (the true positive rate) against 1 - specificity (the false positive rate),2 were calculated to test the ability of relevance rankings to predict the answers provided by participants, controlling for search engine used, question, and prior knowledge of participants.

Table 1. Questions and Expected Answers

Question 1. We hear of people going on low carbohydrate and high protein diets, such as the Atkins diet, to lose weight. Is there evidence to support that low carbohydrate, high protein diets result in greater long-term weight loss than conventional low energy, low fat diets? Expected answer: No.

Question 2. Breast cancer is one of the most common types of cancer found in women. Is there evidence indicating an increased chance of developing breast cancer for women who have a family history of breast cancer? Expected answer: Yes.

Question 3. Many people use complementary therapies when they are sick or as preventive measures. Is there evidence to support the taking of vitamin C supplements to help prevent the common cold? Expected answer: No.

Question 4. We know that we can catch AIDS from bodily fluids, such as from needle sharing, having unprotected sex and breast-feeding. We also know that some diseases can be transmitted by mosquito bites. Is there evidence that we can get AIDS from a mosquito bite? Expected answer: No.

Figure 2. Distribution of relevance scores assigned by participants to the total pool of documents retrieved for all questions (minimum zero and maximum relevance 30).
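To make the scoring scheme concrete, the short sketch below computes the page-level relevance score described above. It is written in Python, which was not part of the study; the ratings shown are hypothetical.

    # Sketch: computing the page-level relevance score used in this study.
    # Each retrieved document is rated 0-3; a page of up to 10 results can
    # therefore score at most 30. The ratings below are hypothetical.
    def page_relevance_score(ratings):
        """Sum per-document relevance ratings (each 0-3) for one result page."""
        if len(ratings) > 10:
            raise ValueError("A result page contained at most 10 documents")
        if any(r not in (0, 1, 2, 3) for r in ratings):
            raise ValueError("Each rating must be an integer from 0 to 3")
        return sum(ratings)

    print(page_relevance_score([3, 2, 0, 1, 3, 2, 0, 0, 1, 2]))  # -> 14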

Table 2. Comparative Performance Answering Questions after Using Clinical and General Search Engines

                          Clinical Search Engine (N = 415)    General Search Engine (N = 393)
Correct before search     203 (48.9%)                         197 (50.1%)
Correct after search      276 (66.5%)                         212 (53.9%)
Improvement               17.6%                               3.8%


Table 3. Comparative Performance Answering Individual Questions after Using Both Search Engines Combined

                          Question 1 (N = 193)      Question 2 (N = 204)      Question 3 (N = 209)       Question 4 (N = 202)
Correct before search     60 (31.0%)                187 (91.6%)               37 (17.7%)                 116 (57.4%)
Correct after search      72 (37.3%)                175 (85.7%)               98 (46.8%)                 143 (70.7%)
Improvement               6.3%                      -5.9%                     29.1%                      13.3%
Chi-square (p)            χ² = 1.435 (p = 0.231)    χ² = 2.967 (p = 0.085)    χ² = 39.180 (p = 0.0001)   χ² = 7.192 (p = 0.0073)

Prior knowledge of the population for a question was estimated from the percentage of correct pre-search answers provided for that question. ROC values were calculated as the area under the curve (AUC) using MedCalc v.9.1, where an AUC of 1 indicates a perfectly predictive measure and 0.5 indicates no correlation between a measure and an outcome.
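For readers wishing to reproduce this style of analysis, the sketch below shows how a page-level relevance score can be tested as a predictor of a correct post-search answer via the area under the ROC curve. It is a minimal Python sketch with hypothetical session records and field names of our own choosing; the study itself used MedCalc v.9.1, not Python.

    # Sketch: testing whether a page-level relevance score predicts a correct
    # post-search answer, using the area under the ROC curve (AUC).
    # Session records below are hypothetical illustrations only.
    from sklearn.metrics import roc_auc_score

    # Each session: the summed relevance score for the result page (0-30) and
    # whether the participant answered correctly after searching (1 = correct).
    sessions = [
        {"relevance_score": 24, "correct_after": 1},
        {"relevance_score": 7,  "correct_after": 0},
        {"relevance_score": 18, "correct_after": 0},
        {"relevance_score": 11, "correct_after": 1},
        # ... one record per search session (808 in the study)
    ]

    scores = [s["relevance_score"] for s in sessions]
    outcomes = [s["correct_after"] for s in sessions]

    # AUC = 1.0 means relevance perfectly separates correct from incorrect
    # answers; AUC = 0.5 means relevance carries no predictive information.
    auc = roc_auc_score(outcomes, scores)
    print(f"AUC of relevance score as predictor of a correct answer: {auc:.3f}")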

Results

There was significant variation in the relevance scores assigned by participants to the document sets they retrieved to support their decisions (Figure 2). Participants answered 49.5% of questions correctly before searching, and there was a statistically significant improvement of 10.8% in their answers, to 60.4%, after using a search engine (CI = 6.08-15.72, p = 0.0001). Little or no correlation was found between user-provided relevance rankings and the final answer (n = 808, AUC = 0.580, p = 0.0001). As anticipated, participants using the clinical search engine improved their answers significantly more after search (17.6%) than those using the general search engine (3.8%), a difference of 13.8% (CI = 9.68-17.92, p = 0.0001) (Table 2). Relevance rankings did not predict decision outcomes for users of either the clinical search engine (n = 415, AUC = 0.552, p = 0.0767) or the general search engine (n = 393, AUC = 0.557, p = 0.0501). Further, relevance rankings could not differentiate between the performance of users of the two search engines: the difference in AUC between the engines was 0.005 and was not statistically significant (z-statistic = 0.1198, p = 0.9046). As expected, participants did not perform equally well across all four questions (Table 3). Participants improved their answers after search on Question 1 by 6.3% (p = 0.231), Question 3 by 29.1% (p = 0.0001), and Question 4 by 13.3% (p = 0.0073); performance on Question 2 was worse after search by 5.9% (p = 0.085).
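The confidence interval quoted above for the overall improvement is consistent with a standard unpaired two-proportion normal approximation; the paper does not state the exact method, so the sketch below is an assumption rather than a reconstruction of the authors' calculation. With the reported proportions it yields an interval very close to the published 6.08-15.72.

    # Sketch: 95% confidence interval for a difference between two proportions,
    # using an unpaired normal approximation over the 808 search sessions.
    # The paper does not specify its method, so this is an assumption.
    import math

    def diff_of_proportions_ci(p1, n1, p2, n2, z=1.96):
        """Return (difference, lower, upper) for p2 - p1, in percentage points."""
        diff = p2 - p1
        se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
        return 100 * diff, 100 * (diff - z * se), 100 * (diff + z * se)

    # Proportion of correct answers before (49.5%) and after (60.4%) search.
    diff, lo, hi = diff_of_proportions_ci(0.495, 808, 0.604, 808)
    # Prints roughly "10.9 points (95% CI 6.08 to 15.72)"; the paper reports
    # 10.8%, presumably computed from unrounded proportions.
    print(f"Improvement {diff:.1f} points (95% CI {lo:.2f} to {hi:.2f})")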

Analysis by question showed that relevance rankings were unable to predict decision outcome for Questions 1 to 3 and had only weak predictive power for Question 4: Question 1 (n = 193, AUC = 0.555, p = 0.2012); Question 2 (n = 204, AUC = 0.515, p = 0.7815); Question 3 (n = 209, AUC = 0.489, p = 0.7796); and Question 4 (n = 202, AUC = 0.707, p = 0.0001). Comparison of the ROC curves generated by relevance rankings for each question showed that the AUC could not differentiate among Questions 1 to 3, but did distinguish Question 4 from the remainder (Table 4).
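The pairwise comparisons reported in Table 4 are consistent with a z-test for two independent AUCs, z = |AUC1 - AUC2| / sqrt(SE1^2 + SE2^2); this formula is our inference from the published values, not a method stated in the paper. A minimal sketch using the Table 4 figures:

    # Sketch: z-test for the difference between two independent AUCs, using the
    # AUC and standard error values published in Table 4. The formula assumed
    # here (no correlation term) reproduces the reported z-statistics closely.
    import math
    from scipy.stats import norm

    def compare_auc(auc1, se1, auc2, se2):
        """Return (difference, z, two-sided p) for two independent ROC curves."""
        diff = abs(auc1 - auc2)
        z = diff / math.sqrt(se1 ** 2 + se2 ** 2)
        p = 2 * (1 - norm.cdf(z))
        return diff, z, p

    # Question 3 (AUC 0.489, SE 0.040) versus Question 4 (AUC 0.707, SE 0.037).
    diff, z, p = compare_auc(0.489, 0.040, 0.707, 0.037)
    print(f"Difference = {diff:.3f}, z = {z:.4f}, p = {p:.5f}")
    # Output is close to the published values: 0.218, z = 4.0008, p = 0.00006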

Limitations

Questions in this experiment were simulated, and only four in number. While there was a clear distribution of prior knowledge across questions, as measured by pre-search answers, a larger or more representative sample of questions might produce results different from those reported here. However, the sample size of our experiment was sufficiently large to generate clear differences of strong statistical significance. Subjects in this study were university students and may not be representative of other populations; however, it is unclear why this population would assign relevance ratings differently from other populations. We characterized our study population's prior knowledge by their ability to answer questions before searching, which should allow other studies to be compared directly with ours.

Discussion

Despite their widespread use as a metric of search engine performance, user-generated relevance rankings had little to no predictive power in our experiments. Relevance rankings were unable to predict whether the user of a search engine could correctly answer a question after having completed a search. They were not able to differentiate between two search engines with statistically different performance characteristics in the hands of users, a distinction one would anticipate to be of major importance to the designer of a search engine.

Table 4. Comparison of Receiver Operating Characteristic Curves Generated Using Relevance Rankings

AUC for each question: Question 1: AUC = 0.555, SE = 0.043, P = 0.2012; Question 2: AUC = 0.515, SE = 0.058, P = 0.7815; Question 3: AUC = 0.489, SE = 0.040, P = 0.7796; Question 4: AUC = 0.707, SE = 0.037, P = 0.0001.

Pairwise comparisons of the curves:
Question 1 vs. Question 2: difference = 0.04, z = 0.5540, p = 0.5795
Question 1 vs. Question 3: difference = 0.066, z = 1.1238, p = 0.2610
Question 2 vs. Question 3: difference = 0.026, z = 0.3690, p = 0.7121
Question 1 vs. Question 4: difference = 0.152, z = 2.6794, p = 0.0073
Question 2 vs. Question 4: difference = 0.192, z = 2.7908, p = 0.0052
Question 3 vs. Question 4: difference = 0.218, z = 4.0008, p = 0.00006


Table 5. Distribution of Study Questions, Based upon Participants' Prior Knowledge of the Correct Answer Before Searching, and the Complexity of the Decision Task

                          Low Complexity    High Complexity
Low Prior Knowledge       Question 3        Question 1
High Prior Knowledge      Question 4        Question 2

We asked our study participants to answer four questions of varying difficulty, for which the participants had varying prior knowledge, as judged by the accuracy of their pre-search answers. For three of these four questions, relevance rankings showed no ability to predict how well participants could answer a question after search, nor could the relevance rankings distinguish how well participants might perform on the different questions. Relevance rankings did distinguish one question from the remaining three, and it is interesting to speculate about what may have made relevance rankings useful for that question. In Table 5, the four questions are classified according to participants' prior knowledge of the correct answer (where low is <50% correct pre-search and high is >50% correct pre-search), and a subjective assessment by the authors of decision task complexity. Complexity relates to the number of information-processing sub-tasks needed to answer a question.10 For example, Question 4 (Can mosquitoes pass on AIDS?) was ranked as low complexity because it requires a simple yes/no answer that could be found in a single text. In contrast, Question 1 (Does the Atkins diet work?) was ranked as high complexity because there are conflicting views and evidence, and arriving at an answer requires synthesis and analysis of several documents. Question 4 thus appears to be a low-complexity decision task on which participants also had high prior knowledge. This suggests that relevance rankings may have a place in situations in which the rankings are provided by experts in the domain, and where the decision tasks are of complexity commensurate with the abilities of those providing the ranking. It also suggests that further research into the interaction between prior knowledge and task complexity will be of great interest.

However, based upon our experimental data, there is now evidence that user-provided relevance rankings taken in isolation may be of limited to no value when evaluating a search engine that will be used in a general-purpose setting, by a variety of users, and on a variety of tasks. Community relevance ratings are also used as a measure of the value of documents by web sites that adopt a social computing model to help searchers find information that may help answer questions.


Our results cast doubt on the value of such an approach. An alternative method, in which a community provides direct feedback on the best answer to different questions, rather than providing proxy relevance ratings for documents, does seem to improve decision-making.11

We believe that a more accurate and natural metric of search engine performance is a user's ability to accurately complete a task, such as identifying a piece of information or answering a question. Such measures remove the inherent subjectivity of relevance rankings, and provide a direct and repeatable outcome measure that correlates directly with the performance of the search technology in the hands of users. In situations in which direct measures are not feasible, composite proxy measures that combine relevance with a number of other factors, such as task complexity or prior knowledge, may have some predictive power, and this is an important area for future research.

Finally, relevance ratings are just one example of a proxy or surrogate marker. Our results reiterate the caution that should be taken when using surrogate markers to demonstrate differences, especially if, as in the case of relevance, the assumed correlation with the final outcome may never have been formally validated.

References

1. Meng W, Liu K, Yu C, Wu W, Rishe N. Estimating the usefulness of search engines. Paper presented at: 15th International Conference on Data Engineering (ICDE'99); 1999; Sydney, Australia.
2. Hersh W. Information Retrieval: A Health and Biomedical Perspective. 2nd ed. New York: Springer-Verlag; 2003.
3. Mizzaro S. Relevance: the whole history. J Am Soc Inf Sci 1997;48(9):810-32.
4. Kagolovsky Y, Mohr JR. A new approach to the concept of "relevance" in information retrieval (IR). Paper presented at: MEDINFO 2001; Amsterdam.
5. Hripcsak G, Rothschild A. Agreement, the F-measure, and reliability in information retrieval. J Am Med Inform Assoc 2005;12(3):296-8.
6. Hersh WR, Hickam DH. How well do physicians use electronic information retrieval systems? A framework for investigation and systematic review. JAMA 1998;280(15):1347-52.
7. Lau A, Coiera E. A Bayesian model that predicts the impact of Web searching on decision making. J Am Soc Inf Sci Technol 2006;57(7):873-80.
8. Coiera E, Walther M, Nguyen K, Lau NH. An architecture for knowledge-based and federated search of online clinical evidence. J Med Internet Res 2005 Oct 24;7(5):e52.
9. Westbrook J, Coiera E, Gosling AS. Do online information retrieval systems help experienced clinicians answer clinical questions? J Am Med Inform Assoc 2005;12(3):315-21.
10. Sintchenko VS, Coiera E. Which clinical decisions benefit from automation? A task complexity approach. Int J Med Inform 2003;70:309-16.
11. Lau A, Coiera E. Impact of web searching and social feedback on consumer decision making: a prospective online experiment. J Med Internet Res 2008;10(1):e2.