Optimal IR: How Far Away?


Xiangdong An[1], Xiangji Huang[2], and Nick Cercone[1]

[1] Department of Computer Science and Engineering, York University, Toronto, ON M3J 1P3, Canada
    [email protected], [email protected]
[2] School of Information Technology, York University, Toronto, ON M3J 1P3, Canada
    [email protected]

Abstract. There is a gap between what a human user has in mind and what (s)he can get from an information retrieval (IR) system through queries. We say an IR system is perfect if it always provides users with what they want, whenever that is available in the corpus, and optimal if it presents what it finds to the users in an optimal way. In this paper, we empirically study how far we still are from optimal IR and from perfect IR, based on the runs submitted to the TREC Genomics track 2007. We assume perfect IR always achieves a score of 100% under the given evaluation methods. Optimal IR is simulated by optimizing the submitted runs with respect to the evaluation methods provided by TREC. The average performance difference between the submitted runs and the perfect or optimal runs can then be obtained. Given the annual average performance improvement achieved by re-ranking in the literature, we estimate how far away we are from optimal and perfect IR. The study indicates that we are about 7 years away from optimal IR and about 16 years away from perfect IR. These are by no means exact distances, but they do give us a partial perspective on where we are along the IR development path. The study also provides the lowest upper bound on IR performance improvement achievable by re-ranking.

1 Introduction

An IR system automatically finds the information that matches the information needs users express through their queries. We say an IR system is perfect if it always finds the information, whenever available in the corpus, that matches the users' information needs, and optimal if it always presents what it finds in an optimal way (with respect to relevancy). A critical difference between a perfect IR system and an optimal one is that an optimal system may fail to find all relevant information and may present irrelevant information. A perfect system involves much more sophisticated techniques than an optimal one: to be perfect, an IR system must understand natural language well, since queries expressing users' needs are generally written in natural language, and based on that understanding it must find exactly the relevant information and present it in an optimal way. To be optimal, an IR system only needs to present its results in an optimal way. We say retrieval results are presented in an optimal way if they are ranked properly by their relevancy to the query.

None of the existing IR systems can be considered optimal in general; an existing system can, however, be brought closer to optimal by re-ranking. In this paper, we empirically study how far away we are from optimal and perfect IR based on runs submitted to the TREC (Text REtrieval Conference) Genomics track. We assume perfect IR always achieves a performance of 100% under the given evaluation methods, and we simulate optimal IR with runs optimized against the TREC evaluation. The performance difference between the submitted runs and the optimal runs or perfect IR can then be calculated. Based on the annual average performance improvement achieved by re-ranking in the literature, we can estimate how far away we are from optimal or perfect IR. The study may give us some idea of where we are along the IR development path and a partial perspective on future IR development.

On the other hand, some researchers have tried to improve their retrieval results by re-ranking [1-3]. How much could they potentially improve their results this way; that is, what is the lowest upper bound on improvement by re-ranking? This empirical study provides an answer that may help in understanding re-ranking. There may be no agreed standard on levels of relevancy or on the optimality of a ranking, obtaining an optimal ranking may be intractable [4, 5], and queries and information expressed in natural language can be ambiguous; nevertheless, we assume the queries and evaluation used by the TREC Genomics track [6] are fair and proper. We assume a re-ranking obtained by always selecting the most relevant information unit, with ties broken arbitrarily, is optimal, which has been shown to be generally reasonable [4, 5].

The rest of the paper is organized as follows. Section 2 details our empirical study method. Experimental results are given in Section 3 and discussed and analyzed in Section 4. Section 5 concludes.

2 Method

2.1 Dataset

We base this study on the 63 runs submitted by 26 groups to the TREC 2007 Genomics track.* The Genomics track of TREC, which ran from 2003 to 2007, provided a common platform for evaluating the methods and techniques proposed by various research groups for biomedical IR. From 2003 to 2005, the track focused on document-level retrieval for question answering, scored by document mean average precision (MAP) (called the document measure in this paper). In its last two years (2006 and 2007), the track implemented and focused on a new task, passage retrieval, where a passage could range from a phrase to a sentence or a paragraph of a document and had to be continuous [7]. The task was evaluated on two performances: the passage-level retrieval performance and the aspect-level retrieval performance. The passage-level retrieval performance in 2006 was rated by the amount of overlap between the returned passages and the passages deemed relevant by the judges (called

* A total of 27 groups submitted 66 runs, but we obtained only 63 runs from 26 groups for this study.

the passage measure in this paper), and in 2007 was scored by treating each character in each passage as a ranked document (called the passage2 measure), to address the "doubling score by breaking passages in half" problem of the passage measure [6]. The aspect-level performance was rewarded by the number of relevant aspects reached and penalized by the number of non-relevant passages ranked higher than the novel passages (called the aspect measure). The relevant aspects related to each topic (question) were, in 2006, a set of MeSH terms (entities) assigned by the judges, and in 2007, a set of answer entities picked from the pool of nominated passages deemed relevant by the judges. A passage is novel if it contains relevant aspects not present in any passage ranked higher than it. Note that under the aspect measure, no penalty is applied if an irrelevant passage is ranked higher than a redundant (i.e., relevant but not novel) passage.

In the IR field, some researchers consider the relevance judgment to encompass both the topical match between an information need and a document (or an information unit) and the novelty of the document [8, 9]. They use "topicality" to refer to the subjective judgment of whether a document is related to the subject area of the user's information need, and "novelty" for the degree to which the content of the document is new and beyond what the user already knows. Other researchers simply consider the relevance judgment to be a topicality judgment, under which relevant documents can be either novel or redundant [4, 5]. We follow the latter usage in this paper.

For the question-answering task of the TREC 2007 Genomics track, there was a list of 36 topics in the form of questions to be answered from a collection of 162,259 HTML-formatted documents drawn from the electronic distribution of 49 genomics-related journals from Highwire Press (www.highwire.org).
All 36 questions were selected from information-needs statements provided by surveyed working biologists, after being screened against the corpus to ensure that relevant passages were present. The desired entity type for answering each question, such as genes, proteins, diseases, or mutations, was designated. For example, for the question "What [GENES] are genetically linked to alcoholism?", the answers would be passages that relate one or more entities of type GENE to alcoholism. To answer a question, up to 1000 passages could be nominated. Each nominated passage had to include the document ID, the starting byte offset of the passage in the document, the passage length, the rank, and the rank value. Each research group could submit a set of 3 runs, each including the nominated passages for all 36 topics; eventually, a total of 66 runs from 27 groups were submitted. The judges identified the gold passages and located the answer entities for each topic from the pool of nominated passages, and the performance of each run was then evaluated against these gold passages and assigned answer entities. A total of 4 measures were used to evaluate each run: the passage2 measure, the aspect measure, the passage measure, and the document measure. The document-level performance was evaluated via the passage retrieval results: a document was considered relevant for a topic if it contained a relevant passage for that topic. Since document retrieval was evaluated via passage retrieval, and the passage2 measure was an improvement over the passage measure, we base our empirical study on the two passage retrieval performance measures: passage2 and aspect.
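For concreteness, the nominated-passage record described above might be modeled as follows. This is our own sketch; the field names and the sample values are illustrative, not part of the official submission format.

```python
from dataclasses import dataclass

@dataclass
class NominatedPassage:
    doc_id: str        # document ID within the corpus
    offset: int        # starting byte offset of the passage in the document
    length: int        # passage length
    rank: int          # position in the ranked list (1 = highest)
    rank_value: float  # system-assigned relevance score

# A run maps each topic to its ranked list of up to 1000 nominated passages
# (topic number and passage values below are made up for illustration).
run = {200: [NominatedPassage("example-doc-id", 1542, 361, 1, 0.93)]}
```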

2.2 Measure passage2 & optimization

Algorithm 1, summarized from the TREC 2007 Genomics track scoring program, shows how the passage2 measure works.

Algorithm 1: Passage2 evaluation

Input: {nominatedPassageSet[topic]}, {goldPassageSet[topic]}.
Output: Average passage2 precision by topic.

 1  for each Topic do
 2      nume = 0; deno = 0; sumPrecision = 0.0;
 3      for each nominated Passage do
 4          if no relevant characters then
 5              deno += passageLength;
 6          else
 7              for each character do
 8                  if irrelevant or novel then
 9                      deno += 1;
10                      if novel then
11                          nume += 1;
12                          sumPrecision += nume / deno;
13                      end
14                  end
15              end
16          end
17      end
18      count = numCharactersInGoldPassages[Topic];
19      averagePrecision[Topic] = sumPrecision / count;
20  end
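The character-level bookkeeping of Algorithm 1 can be sketched in a few lines of Python. The data layout (passages represented as sets of character positions) and all names are our own simplification, not the official scoring program:

```python
def passage2_precision(nominated, gold_chars, total_gold_chars):
    """Passage2 score for one topic, following Algorithm 1.

    nominated: passages in rank order, each a set of character positions.
    gold_chars: set of character positions covered by the gold passages.
    total_gold_chars: total number of characters in the gold passages.
    """
    nume = deno = 0
    sum_precision = 0.0
    seen = set()  # gold characters already used for scoring
    for passage in nominated:
        if not passage & gold_chars:
            deno += len(passage)       # irrelevant passage: full-length penalty
            continue
        for ch in sorted(passage):     # character by character
            if ch not in gold_chars:
                deno += 1              # irrelevant character
            elif ch not in seen:       # novel character: penalty + reward
                seen.add(ch)
                deno += 1
                nume += 1
                sum_precision += nume / deno
            # redundant gold character: no effect
    return sum_precision / total_gold_chars
```

Nominating exactly the gold passages, in any order, yields a score of 1.0 under this sketch, in line with the analysis of line 12 that follows.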

From Algorithm 1, the nominated list of passages for each topic is processed from the top (ranked highest) to the bottom (ranked lowest). If a nominated passage is not relevant (i.e., contains no relevant characters), it is penalized by increasing deno by the passage length (line 5), since deno is used as the denominator when calculating the sum of precision sumPrecision (line 12). Otherwise, the passage is processed character by character (line 7). If a character is not within the corresponding gold passage range (irrelevant), only deno is increased by 1 (lines 8-9); if the character is within the corresponding gold passage range and has not yet been used for scoring (novel), both deno and nume are increased by 1 (lines 8-11), which amounts to a reward since nume is used as the numerator in calculating sumPrecision. Nothing is done if the character is within the corresponding gold passage range but has already been used for scoring. From the equation at line 12, if all relevant passages are ranked higher than the irrelevant ones, and all relevant passages fall within their respective gold passage ranges, sumPrecision equals the sum of the lengths of all relevant passages:

    sumPrecision = \sum_{p_i} len(p_i),

where p_i is a relevant passage. This is because, in that case, the penalties applied at line 5 are never incurred in calculating sumPrecision, and the penalty applied at line 9 is always paired with the reward at line 11. Therefore, if all gold passages for Topic are nominated exactly, averagePrecision[Topic] equals 100%.

From the analysis above, to get (an approximation to) the optimal passage2 ranking for a nominated run, we first need to push all irrelevant passages behind all the relevant ones; this can be done by checking whether a nominated passage overlaps any gold passage. Secondly, we should rank the relevant passages so that the highest performance score is achieved. This, however, may be a hard problem. We adopt a heuristic that ranks higher the relevant passages containing a higher ratio of relevant characters. That is, we order all nominated relevant passages by

    r-ratio(p_i) = numRelevantCharacters(p_i) / len(p_i),

where p_i is a relevant passage. The higher its r-ratio, the higher the passage is ranked, with ties broken arbitrarily.

2.3 Measure aspect & optimization

Algorithm 2, summarized from the Genomics track scoring program, shows the details of the aspect-level performance evaluation.

Algorithm 2: Aspect evaluation

Input: {nominatedPassageSet[topic]}, {goldPassageSet[topic]}.
Output: Average aspect precision by topic.

 1  for each Topic do
 2      nume = 0; deno = 0; sumPrecision = 0.0;
 3      for each nominated Passage do
 4          if there are any relevant aspects then
 5              if numNewAspects > 0 then
 6                  nume += 1; deno += 1;
 7                  sumPrecision += numNewAspects * nume / deno;
 8              end
 9          else
10              deno += 1;
11          end
12      end
13      count = numUniqueAspects[Topic];
14      averagePrecision[Topic] = sumPrecision / count;
15  end
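Algorithm 2 can likewise be sketched in Python. The data layout (each nominated passage reduced to the set of relevant aspects it contains) and all names are our own simplification:

```python
def aspect_precision(nominated_aspects, num_unique_aspects):
    """Aspect score for one topic, following Algorithm 2.

    nominated_aspects: for each nominated passage in rank order, the set of
        relevant aspects it contains (empty set = irrelevant passage).
    num_unique_aspects: number of distinct relevant aspects for the topic.
    """
    nume = deno = 0
    sum_precision = 0.0
    seen = set()  # aspects already covered by higher-ranked passages
    for aspects in nominated_aspects:
        if aspects:                    # relevant passage
            new = aspects - seen
            if new:                    # novel passage: reward
                seen |= new
                nume += 1
                deno += 1
                sum_precision += len(new) * nume / deno
            # redundant passage: no effect on the score
        else:
            deno += 1                  # irrelevant passage: penalty
    return sum_precision / num_unique_aspects
```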

From Algorithm 2, the nominated list of passages for each topic is processed from the top (ranked highest) to the bottom (ranked lowest). Any nominated passages containing relevant aspects are considered relevant (line 4); all other passages are considered irrelevant and are simply penalized by increasing deno by 1 (line 10). Any relevant

passages that contain new aspects are considered novel and are rewarded by increasing both nume and deno by 1 (line 6). Nothing is done for relevant passages that contain only previously seen aspects; that is, redundant passages have no impact on the score. The variable sumPrecision is updated only upon novel passages (line 7). The aspect-level performance, averagePrecision, is finally calculated for each topic (line 14); it is actually equivalent to the recall of the relevant aspects [4]. From the equation at line 7, if all irrelevant passages are placed after the novel ones, sumPrecision equals the number of distinct aspects contained in all passages nominated for the topic:

    sumPrecision = \sum_{p_i} numNewAspects(p_i) = numUniqueAspects[Topic],

where p_i is a novel passage. This is because, in that case, nume/deno = 1 always holds. Therefore, the maximum value for averagePrecision is 100%. From this analysis, we can get an approximation to the optimal aspect ranking for each nominated run by always selecting the passage with the most new aspects, with ties broken arbitrarily.

2.4 Perfect runs

We assume the perfect passage2 results for each topic are obtained by ordering all gold passages by their r-ratio, and the perfect aspect results for each topic by ordering the gold passages by the number of new aspects they contain, with ties broken arbitrarily. The perfect result under either the passage2 or the aspect measure yields an MAP of 100%.
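The two re-ranking heuristics used to approximate the optimal runs (highest r-ratio first for passage2, in Section 2.2; greedy most-new-aspects for aspect, in Section 2.3) can be sketched as follows. The function names and data shapes are our own:

```python
def optimal_passage2_rerank(passages, r_ratio):
    """Push irrelevant passages behind relevant ones, then order the
    relevant ones by decreasing r-ratio (ties broken arbitrarily)."""
    relevant = sorted((p for p in passages if r_ratio(p) > 0),
                      key=r_ratio, reverse=True)
    irrelevant = [p for p in passages if r_ratio(p) == 0]
    return relevant + irrelevant

def optimal_aspect_rerank(passage_aspects):
    """Greedily pick the passage with the most not-yet-seen aspects.

    passage_aspects: list of (passage_id, aspect_set) pairs in rank order.
    Returns passage_ids in the re-ranked order.
    """
    remaining = list(passage_aspects)
    seen, order = set(), []
    while remaining:
        best = max(remaining, key=lambda pa: len(pa[1] - seen))
        remaining.remove(best)
        seen |= best[1]
        order.append(best[0])
    return order
```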

3 Experimental results

We first re-rank all 63 submitted runs to obtain their respective passage2-optimal and aspect-optimal runs, and then compare the performances of the two sets of optimal runs with those of the 63 submitted runs. Figure 1 shows the passage2 performances of the submitted runs and the respective passage2-optimal runs: every passage2-optimal run performs better than its submitted run, and mostly much better. Figure 2 shows the relative improvements of the passage2-optimal runs over the respective submitted runs, from as little as 162% to as much as 1165%, with a mean of 475%. Figure 3 shows the distribution of these relative improvements; most lie between 300% and 500%.

Figure 4 shows the aspect performances of the submitted runs and the respective aspect-optimal runs. It indicates that all the aspect-optimal runs perform better

than the respective submitted runs, and mostly much better. Figure 5 shows the relative improvements of the aspect-optimal runs over the respective submitted runs, from as little as 63% to as much as 1252%, with a mean of 359%. Figure 6 shows the distribution of these relative improvements; most lie between 100% and 300%.

Fig. 1. Performances of the submitted and the passage2 optimal runs on the measure Passage2.

Fig. 2. Relative improvements of the passage2 optimal runs over the respective submitted runs.

Finally, we show how much the optimal runs need to be improved to become perfect (i.e., their distances from perfect). Figure 7 shows the relative improvements of the perfect runs over the respective passage2-optimal runs: apart from 3 runs (31, 46, 47) that need exceptionally large improvements, all runs need improvements of up to 1639%. This is consistent with Figure 8, which shows the relative improvements of the perfect runs over the respective aspect-optimal runs: the same 3 outliers need exceptionally large improvements, while all others need improvements of up to 413%. Referring back to Figures 1 and 4, the 3 outlier runs not only perform worst among all runs but do so by an exceptional margin. This indicates that some very poor retrieval results are neither ranked well nor contain enough relevant information.

Fig. 3. Histogram of relative improvements of the passage2 optimal runs over the respective submitted ones.

Fig. 4. Performances of the submitted and the aspect optimal runs on the measure Aspect.

4 Discussion and analysis

From the experimental results presented above, the optimal runs generally perform much better than the respective submitted runs on both the passage2 and the aspect measure.

Fig. 5. Relative improvements of the aspect optimal runs over the respective submitted runs.

These optimal runs represent the lowest upper bound on the performance the respective submitted runs could achieve by re-ranking. If we take all submitted runs as a whole to represent the current average IR technology in genomics, we may take the average of the mean improvements on passage2 and aspect, (475% + 359%)/2 = 417%, as the average improvement reachable by re-ranking across the measures. The average relative improvement of 417% may therefore be considered the gap, bridgeable by re-ranking, between current average IR technology and optimal IR in genomics.

Different levels of IR performance improvement through re-ranking have been reported [1-3]. Some improvements are quite small (2%-5%) [2], and some could only be made on a small subset of predefined topics [1]. It is reported [3] that a performance improvement of 27.09% was once achieved on TREC 2007 Genomics track data. If we assume this is the best performance improvement that can be made within one year, we can compute how many years we are away from optimal IR from the equation

    1 + 417% = (1 + 27.09%)^x,

where x is the number of years we are away from optimal IR. This gives

    x = ln 5.17 / ln 1.2709 = 6.85.
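The calculation can be reproduced directly; the function name below is our own:

```python
import math

def years_to_close(gap, annual_rate):
    # Solve (1 + gap) = (1 + annual_rate) ** x for x.
    return math.log(1 + gap) / math.log(1 + annual_rate)

# 417% gap at 27.09% improvement per year
print(round(years_to_close(4.17, 0.2709), 2))  # -> 6.85
```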

This is certainly not an exact measure of our distance from optimal IR; too many crucial factors determining the progress of IR development remain highly uncertain. It does tell us, however, that there is probably still a long way to go before optimal IR, even from an optimistic view. We can similarly calculate how far perfect IR is from optimal IR. The distance between the optimal IR and the perfect IR might be considered the amount of

Fig. 6. Histogram of relative improvements of the aspect optimal runs over the respective submitted runs.

effort we need to put into techniques other than re-ranking, such as query understanding and matching. We use the average of the mean perfect improvements on the passage2 and aspect measures to represent the average improvement of the perfect runs over the optimal ones: (1357% + 159%)/2 = 758%. The distance x in years then follows from

    1 + 758% = (1 + 27.09%)^x,

which gives x = 8.97. This indicates that we need more time to go from optimal to perfect than from the current state to optimal. In other words, from now on, we need about 6.85 + 8.97 = 15.82 years to reach perfect IR. It can easily be shown that exactly the same result is obtained if the average relative improvement needed to make the submitted runs perfect is used directly to calculate the distance between perfect IR and the submitted runs.

5 Conclusion

It has been about half a century since automated information retrieval systems were first implemented in the 1950s. Where are we now, and how far away are we from optimal or perfect IR? We certainly cannot give an exact answer to such questions, since too many crucial factors remain highly uncertain in determining IR

Fig. 7. Relative improvements of the perfect runs over the respective passage2 optimal runs.

Fig. 8. Relative improvements of the perfect runs over the respective aspect optimal runs.

development progress. In this paper, based on some assumptions, we empirically studied how much effort we may still need to reach optimal and perfect IR. The study indicates that we still have a long way to go to make existing systems optimal, and an even longer way to make them perfect, even from an optimistic view. This work is the first to experimentally study the lowest upper bound on performance improvement by re-ranking, which may not have been realized by the relevant researchers. The study indicates that the improvements reported in the literature are quite marginal relative to the large room for improvement offered by optimal rankings. How to bring performance improvements close to this upper bound deserves further study.

References

1. Yang, L., Ji, D., Tang, L.: Document re-ranking based on automatically acquired key terms in Chinese information retrieval. In: COLING'04. (2004) 480-486
2. Shi, Z., Gu, B., Popowich, F., Sarkar, A.: Synonym-based query expansion and boosting-based re-ranking: a two-phase approach for genomic information retrieval. In: TREC-2005. (2005)
3. Hu, Q., Huang, X.: A reranking model for genomics aspect search. In: SIGIR'08. (2008) 783-784
4. Zhai, C., Cohen, W.W., Lafferty, J.: Beyond independent relevance: methods and evaluation metrics for subtopic retrieval. In: SIGIR'03. (2003) 10-17
5. Clarke, C.L.A., Kolla, M., Cormack, G.V., Vechtomova, O., Ashkan, A., Büttcher, S., MacKinnon, I.: Novelty and diversity in information retrieval evaluation. In: SIGIR'08. (2008) 659-666
6. Hersh, W., Cohen, A., Roberts, P.: TREC 2007 Genomics track overview. In: TREC-2007. (2007) 98-115
7. Hersh, W., Cohen, A., Roberts, P., Rekapalli, H.K.: TREC 2006 Genomics track overview. In: TREC-2006. (2006) 68-87
8. Boyce, B.: Beyond topicality: a two stage view of relevance and the retrieval process. Information Processing & Management 18 (1982) 105-109
9. Xu, Y., Yin, H.: Novelty and topicality in interactive information retrieval. Journal of the American Society for Information Science and Technology 59 (2008) 201-215