Methods for Ranking Information Retrieval Systems Without Relevance Judgements

Shengli Wu
Department of Computer and Information Sciences, University of Strathclyde, Glasgow, UK
[email protected]

Fabio Crestani
Department of Computer and Information Sciences, University of Strathclyde, Glasgow, UK
[email protected]

ABSTRACT
In this paper we present some new methods for ranking information retrieval systems without relevance judgements. The common ground of these methods is the use of a measure we call reference count. Extensive experimentation was conducted to evaluate the effectiveness of the proposed methods, using several standard Information Retrieval evaluation measures for the ranking, namely average precision, R-Precision, and precision at different document cut-off levels. We also compared the effectiveness of the proposed methods with the method proposed by Soboroff et al. The experimental results show that the proposed methods are effective, and in many cases more effective than Soboroff et al.'s method.

Keywords
Information Retrieval, Ranking, Relevance Judgement, Performance Evaluation

1. INTRODUCTION

Evaluating the performance of information retrieval systems usually takes a great deal of human effort, since such a process requires a test collection composed of a set of documents, a set of query topics, and a set of relevance judgements indicating which documents are relevant to which topic. Among these, relevance judgement is a laborious task when the set of retrieved documents is large. TREC has devoted itself to the evaluation of participating information retrieval systems for over ten years [7]. One of the most interesting evaluation techniques used in TREC is the pooling method employed for relevance judgements, whose purpose is to reduce human effort. In TREC, each participating system reports the 1000 top-ranked documents for each topic. Of these, only the top 100 from each system are collected into a pool for human assessment.


The evaluation is conducted under the assumption that all relevant documents are in the pool. Some studies have been published on the reliability of such an evaluation process [9, 12, 3]. Two aspects are supposed to be the major factors that could affect its quality. One is the pooling method, since many retrieved documents remain unchecked by human assessors and are assumed to be irrelevant. The other is assessor disagreement, which happens when several individuals hold opposite views when judging relevance. However, these studies found that neither of the two aspects affects the outcome of the TREC evaluation process significantly.
Soboroff et al. proposed a way of ranking information retrieval systems without relevance judgements from assessors [6]. Their method looks similar to TREC's, except that they randomly pick a set of documents from the pool as relevant documents rather than involving a judgement process by assessors. Their experiments show that the system rankings produced by their method correlate positively and significantly with the actual TREC rankings.
In previous work on data fusion we studied different information retrieval systems querying the same collection of documents [11]. We used the reference count method to rank the participating information retrieval systems and, based on that ranking, assigned a weight to each system in order to improve the performance of data fusion. The experimental results showed that some of the data fusion algorithms based on the estimated weights are competitive with CombMNZ (one of the best known data fusion algorithms [5]), which suggests that the estimation process correlates positively and significantly with the official TREC evaluation results. However, in [11] we focused mainly on data fusion, and the ranking of information retrieval systems was done in an empirical way. Only a simple experiment was carried out to evaluate the ranking process itself, so no substantial conclusion could be drawn from it. We consider it interesting to go a little further and, at the same time, to compare with Soboroff et al.'s method, which has been widely accepted in the Information Retrieval community.
As pointed out in [6], such methods for ranking systems automatically would not be useful in an environment like TREC, which already has a very successful methodology for conducting relevance assessment and building test collections, following the Cranfield tradition. However, they could be useful in an environment like the World-Wide Web (WWW).

In the WWW, the document collection is tremendously large and is updated continuously and frequently, so a methodology such as TREC's is not feasible. Another area of application of automatic ranking of systems is data fusion, as shown in [11]. Various data fusion algorithms, such as linear combination models and others [5, 11, 4, 10], could benefit from it if a suitable weight could be provided for each component retrieval system.
The paper is structured as follows. In Section 2 we describe the proposed ranking methods and the evaluation measures used to test their effectiveness. The results of the experimentation using non-interpolated average precision are presented in Section 3, while Section 4 presents the results using R-Precision and precision at different document cut-off values. A discussion of these results is presented in Section 5. The conclusions of the study are reported in Section 6.

2. RANKING METHODS AND MEASURES FOR EVALUATION

To evaluate the performance of an information retrieval system, we use a measure called reference count. Suppose we have a given query and a number of information retrieval systems working on a common document collection. Taking a certain number of top documents returned by a retrieval system, we can sum up the occurrences of these documents in the results of all the other retrieval systems. In this way, each retrieval system gets a score, the reference count, for this particular query. For example, suppose we have five retrieval systems Ri (1≤i≤5) and a query Q, and each system Ri returns a list of documents Li (1≤i≤5) for that query. Each Li includes 1000 documents (di,1, di,2, ..., di,1000). Let us consider L1. For any d1,j (1≤j≤1000), we count its occurrences o(d1,j) in all the other document lists (L2, L3, L4, and L5). Here we call d1,j the original document, while its counterparts in all the other document lists are called reference documents (of d1,j). If d1,1 appears in L2 and L4, but not in L3 and L5, then o(d1,1) = 2. We then compute the sum S1(1000) = Σ_{j=1}^{1000} o(d1,j), which we call L1's total reference count for the given query Q. We can do the same for the other lists as well. More generally, we can consider only the top n documents: for example, for Li, the reference count of the top 100 documents is Si(100) = Σ_{j=1}^{100} o(di,j), for 1≤i≤5. Based on the reference count obtained for each retrieval system following the above process, we can rank these retrieval systems. We call this method Basic reference count, or Basic for short.
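As an illustration, the following minimal Python sketch shows how the Basic reference count could be computed for a single query, assuming each run is available as a ranked list of document identifiers; the function and variable names are ours and purely illustrative.

```python
from typing import Dict, List

def basic_reference_counts(runs: Dict[str, List[str]], n: int = 1000) -> Dict[str, int]:
    """Compute Si(n): the total reference count of each run's top n documents
    for one query.  `runs` maps a system identifier to its ranked list of
    document identifiers."""
    # Pre-compute the set of documents returned by each run, for fast lookup.
    doc_sets = {sys_id: set(docs) for sys_id, docs in runs.items()}
    scores = {}
    for sys_id, docs in runs.items():
        total = 0
        for d in docs[:n]:
            # o(d): the number of *other* runs that also retrieved document d.
            total += sum(1 for other, retrieved in doc_sets.items()
                         if other != sys_id and d in retrieved)
        scores[sys_id] = total
    return scores

# Toy example with three runs for a single query: the runs whose documents
# appear most often in the other runs get the highest reference counts.
runs = {"sysA": ["d1", "d2", "d3"],
        "sysB": ["d2", "d1", "d4"],
        "sysC": ["d5", "d2", "d6"]}
print(basic_reference_counts(runs, n=3))  # {'sysA': 3, 'sysB': 3, 'sysC': 2}
```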

This basic approach can be improved in two respects: by differentiating the positions of the reference documents, and by differentiating the position of the original document. In the Basic method, when calculating the reference count o(di,j) of di,j, we always add 1 to o(di,j) for any appearance of one of di,j's reference documents, without distinguishing the position in which that reference document appears. For example, suppose d1,1 is an original document and d2,1000 and d3,1000 are its two reference documents, while d1,2 is another original document with two reference documents d2,1 and d3,1. Although d2,1000 and d3,1000 appear at the very end of L2 and L3, d2,1 and d3,1 appear at the top of L2 and L3. This strongly suggests that d1,2 is more likely to be relevant to the query than d1,1, yet the Basic method ignores the difference and calculates o(d1,1) = o(d1,2) = 2. Since all results submitted to TREC are ranked document lists, such non-discrimination is obviously not a good solution. The other aspect is related to the calculation of Si: every o(di,j) has the same weight of 1, so we do not distinguish whether it is the 1st or the 1000th document of the list that is being considered. However, the position of a relevant document may be important in many cases. We present some variations that change either or both of these aspects.
Variation 1 (V1) assigns different weights to reference documents in different positions. For any original document di,j, the formula

o(di,j) = Σ w(dm,n)    (1)

is used to calculate the weighted reference count, where the sum runs over all reference documents dm,n of di,j (that is, m ≠ i and dm,n = di,j) and w(dm,n) = K − n. K is a constant that was set empirically to 1501 in our experiments (other values between 1000 and 1600 were tested as well; the results are similar for any value between 1200 and 1600, while 1000 is not as good as a value in that range). Thus a reference document in the first place of a list contributes a weight of 1500, and one in the 1000th place a weight of 501.
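A possible sketch of the V1 weighting, reusing the toy runs above; K is set to 1501 as in the text, and the helper name is ours.

```python
def v1_reference_counts(runs, n=1000, K=1501):
    """Variation 1: each appearance of a reference document at rank r in
    another run contributes a weight of K - r instead of 1."""
    ranks = {sys_id: {d: r for r, d in enumerate(docs, start=1)}
             for sys_id, docs in runs.items()}
    scores = {}
    for sys_id, docs in runs.items():
        total = 0
        for d in docs[:n]:
            for other, rank_of in ranks.items():
                if other != sys_id and d in rank_of:
                    total += K - rank_of[d]   # w(d) = K - rank of the reference document
        scores[sys_id] = total
    return scores
```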

The second variation (V2) uses the same o(di,j) as in the Basic method but assigns different weights to the different original documents. We use

Si(1000) = Σ_{j=1}^{1000} wj · o(di,j)    (2)

to calculate Si. The weight wj is determined by the following formula (for 1 ≤ m ≤ 200 and 1 ≤ k ≤ 4):

wj = Z(200) − Z(m − 1)      if j = 5m;
wj = w5m − 1/m + 5/j        if j = 5m − k.    (3)

where Z(m) is the harmonic number Z(m) = 1 + 1/2 + 1/3 + ... + 1/m, with Z(0) = 0. Here is an explanation of the above formula. Suppose the relevant documents are distributed evenly in the 1000-document list, with an average precision of 0.20. One hypothetical situation is that the relevant documents are located at positions 5, 10, 15, ..., 1000; then the average precision = (1/5 + 2/10 + 3/15 + ... + 200/1000)/200 = 0.2. Let us consider the contribution of each relevant document to this measure. For the 5th document, it is (1/5 + 1/10 + 1/15 + ... + 1/1000)/200 = (1 + 1/2 + 1/3 + ... + 1/200)/1000; for the 10th document, it is (1/10 + 1/15 + ... + 1/1000)/200 = (1/2 + 1/3 + ... + 1/200)/1000; for the 1000th document, it is 1/(1000·200). Because 1/1000 is a common factor, we omit it. We then get wj = Σ_{i=m}^{200} 1/i = Z(200) − Z(m − 1) for j = 5m. If one of the relevant documents is not at position 5m but at j = 5m − k (1 ≤ k ≤ 4), while all the other relevant documents keep the same positions as before, then the relevant document at position 5m − k contributes to the whole average precision with wj = w5m − 1/m + 5/j. Thus we obtain the above formula.
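The weights wj can be computed directly from partial sums of the harmonic series. The sketch below is one possible implementation of Equations 2 and 3; the function names are ours and illustrative.

```python
def Z(m: int) -> float:
    """Z(m) = 1 + 1/2 + ... + 1/m, with Z(0) = 0."""
    return sum(1.0 / i for i in range(1, m + 1))

def original_document_weight(j: int) -> float:
    """wj of Equation 3 for an original document at rank j (1 <= j <= 1000)."""
    m = (j + 4) // 5                  # j lies in the block ending at rank 5m
    w_5m = Z(200) - Z(m - 1)
    if j == 5 * m:
        return w_5m
    return w_5m - 1.0 / m + 5.0 / j   # j = 5m - k with 1 <= k <= 4

def v2_total(ref_counts) -> float:
    """Si of Equation 2, where ref_counts[j-1] holds o(di,j) for j = 1, 2, ..."""
    return sum(original_document_weight(j) * o
               for j, o in enumerate(ref_counts, start=1))
```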

The third variation (V3) consists of assigning different weights to the reference documents and to the original documents at the same time; that is, Equation 1 is used to calculate o(di,j) and Equation 2 is used to calculate Si. The fourth variation (V4) consists of using each document's normalised score in place of w(dm,n) in Equation 1, and then using Equation 2 to calculate Si. Since Basic and all its variations are based on the same measure, the reference count, we will use "reference count-based method(s)" to indicate any (or all) of these methods.
We now describe Soboroff et al.'s method in more detail. Since both their experiment and ours are done with TREC data, we describe it in TREC terms. Every year TREC provides 50 topics (queries) and a collection of documents. For each topic, every participating system submits one or several runs, each consisting of up to 1000 retrieved documents per topic. Just as NIST did, Soboroff et al. took the top 100 documents per topic from the runs of every system to form the pool for that topic. Duplicates are not removed, so frequently retrieved documents have a larger chance of being selected. (Soboroff et al. compared two alternatives: removing all duplicates before the random selection of pseudo-relevant documents, and keeping them, as we do here. They found that keeping duplicates performs better.) Then a certain percentage (10%) of the documents in the pool is picked at random to form a set of pseudo-relevant documents (pseudo-qrels). Note that a slight difference exists between their original method and the one we use here. In their original experiment, they calculated the mean and standard deviation of the number of relevant documents appearing in the pool per topic from the official TREC figures, and then selected documents from the pool at a percentage drawn from a normal distribution with that year's mean and standard deviation. We consider that such information may not be available in many cases; besides, the comparison remains fair, since our methods do not depend on such information either. Finally, using the pseudo-qrels, Soboroff et al.'s technique evaluates all the runs by executing the standard TREC evaluation procedure (trec_eval).
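A minimal sketch of this pseudo-qrels construction, under the simplifications just described (a pool of the top 100 documents per run, duplicates kept, 10% drawn at random); the function name and signature are ours.

```python
import random

def pseudo_qrels(runs, pool_depth=100, fraction=0.10, seed=None):
    """Pool the top `pool_depth` documents of every run (keeping duplicates,
    so frequently retrieved documents are more likely to be drawn) and select
    a fraction of the pool at random as pseudo-relevant documents."""
    rng = random.Random(seed)
    pool = [d for docs in runs.values() for d in docs[:pool_depth]]
    k = max(1, round(fraction * len(pool)))
    return set(rng.sample(pool, k))
```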

After producing a ranking of retrieval systems with any of the above-mentioned methods, we need to evaluate the effectiveness of that ranking. This can be done by comparing it with the official TREC ranking to check how different the two are. The Spearman rank correlation coefficient and the Kendall Tau correlation coefficient are two widely used measures for comparing two rankings. The Spearman rank correlation coefficient is defined as [2]:

R = 1 − (6 Σ di²) / (n³ − n)    (4)

where di is the rank difference of item i and n is the number of items ranked. Two rankings are identical when the coefficient is 1, and in reverse order when the coefficient is −1. The Kendall Tau correlation coefficient is defined as:

t = 2(C − D) / (n(n − 1))    (5)

where C stands for the number of concordant pairs and D stands for the number of discordant pairs. The value range of t is [−1, 1], with 1 for identical rankings and −1 for opposite rankings.

In [6], the Kendall Tau correlation coefficient was used. In our experiment, we use the Spearman rank correlation coefficient, for two reasons. First, it is a good alternative, given that the Kendall Tau correlation coefficient has already been used in similar experiments. Second, we observed that the two coefficients differ in certain situations. For example, consider the two rankings {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} and {10, 9, 1, 2, 3, 4, 5, 6, 7, 8}: here R = 0.0182, while t = 0.2444. When such a ranking is used for data fusion, taking the two poorest systems as the two best systems and putting heavy weights on them could greatly deteriorate the fusion result, even though the rest of the ranking is in perfect order. It seems that in such cases the Spearman rank correlation coefficient is more reasonable than the Kendall Tau correlation coefficient, since the former penalises large ranking differences more heavily than the latter.
Usually people are more interested in the top and bottom ends of a ranking than in its middle part, since the two ends are more distinctive; we hypothesise that this is the case for ranking information retrieval systems too, and a similar consideration applies to data fusion. We therefore introduce a measure of ranking accuracy for the top (or bottom) n systems. Suppose we have two rankings: one (L) is produced by our method and the other (S) is the official TREC ranking, which we use as the standard. We compare the top n systems in these two rankings and use the Accuracy A(n) to measure the percentage of the top n systems in L that appear in the top n of S. For example, if L = {1, 3, 5, 8, 10, ...} and S = {1, 2, 3, 4, 5, ...}, then A(1) = 1, A(2) = 1/2, A(3) = 2/3, and so on. For comparing the bottom n systems, we reverse the order of both lists first and then do the same calculation. We define the Average Accuracy as AA(n) = (1/n) Σ_{i=1}^{n} A(i).
In the following we treat different runs of a system as different information retrieval systems, for convenience. However, runs of the same system are more likely to be similar to each other than runs of genuinely different systems are. We will discuss this case later.
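For completeness, the sketch below shows how the measures just introduced could be computed; the names are ours, and the example at the end reproduces the R = 0.0182 and t = 0.2444 values mentioned above.

```python
from itertools import combinations

def spearman(rank_a, rank_b):
    """Equation 4.  rank_a and rank_b list the same items, each in ranked order."""
    pos_b = {item: i for i, item in enumerate(rank_b)}
    n = len(rank_a)
    d2 = sum((i - pos_b[item]) ** 2 for i, item in enumerate(rank_a))
    return 1 - 6 * d2 / (n ** 3 - n)

def kendall_tau(rank_a, rank_b):
    """Equation 5: concordant minus discordant pairs, normalised."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    c = d = 0
    for x, y in combinations(rank_a, 2):
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            c += 1
        else:
            d += 1
    return 2 * (c - d) / (len(rank_a) * (len(rank_a) - 1))

def accuracy(ours, official, n):
    """A(n): fraction of our top n systems that appear in the official top n."""
    return len(set(ours[:n]) & set(official[:n])) / n

def average_accuracy(ours, official, n):
    """AA(n) = (1/n) * sum of A(i) for i = 1..n."""
    return sum(accuracy(ours, official, i) for i in range(1, n + 1)) / n

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
b = [10, 9, 1, 2, 3, 4, 5, 6, 7, 8]
print(round(spearman(a, b), 4), round(kendall_tau(a, b), 4))  # 0.0182 0.2444
```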

3. EXPERIMENTAL RESULTS WITH NON-INTERPOLATED AVERAGE PRECISION

Experiments were carried out with 5 TREC collections (TREC 3, TREC 5, TREC 6, TREC 7, and TREC 2001). The six methods compared include our five methods and Soboroff et al.'s (indicated as "Random Selection" or "RS" in all tables and figures). In Table 1 we present the results when all runs of the participating systems are taken. Each item in the table is the mean Spearman rank correlation coefficient over the 50 topics (queries) of that year. The reference count-based methods can be divided into two groups. Group one includes Basic and V1, which have similar performance. Group two includes V2, V3, and V4, which also show similar performance. V4 is the best on average among these five methods. However, it seems that none of them is as good as Soboroff et al.'s method: the average performance difference between V4 and RS is 8.3%.
Another aspect worthy of consideration is the case in which only a few information retrieval systems are involved. The experiment is conducted for all 5 TREC collections, and for each collection it is repeated 20 times over 50 queries. We hypothesise that similarity among several runs of the same component system may affect the ranking process, especially when the number of systems is low.

Table 1: Mean correlation coefficient for average precision, all systems submitted to TREC

Year (No. of systems)   TREC3 (40)   TREC5 (61)   TREC6 (74)   TREC7 (103)   TREC2001 (97)   Average
Basic                   0.2463       0.3179       0.3090       0.2967        0.2793          0.2898
V1                      0.2481       0.3264       0.3155       0.3035        0.2878          0.2963
V2                      0.5481       0.3779       0.3706       0.3282        0.3767          0.4003
V3                      0.5665       0.3889       0.3825       0.3446        0.4009          0.4167
V4                      0.5866       0.4205       0.3840       0.3821        0.4128          0.4372
RS                      0.6273       0.4293       0.4359       0.4111        0.4633          0.4734

[Figure 1: Mean correlation coefficients of 3-15 diversified systems. Mean correlation coefficient (y-axis) versus number of systems (x-axis) for Basic, V1, V2, V3, V4, and RS.]

In such circumstances, several runs of the same component system may share more identical documents in their results than runs from different component systems do. Therefore, runs of the same component system could be over-promoted in the ranking list. A similar situation may happen with Soboroff et al.'s method: because irrelevant documents co-exist in runs of the same information retrieval system, the chance of taking those documents as pseudo-relevant documents increases. In the following, only one run is selected for any participating system for any query. Runs selected from different systems in this way are referred to as diversified systems, and all systems involved in the experiment are diversified systems selected randomly (a possible selection procedure is sketched below). Figure 1 shows the mean correlation coefficient over 20 runs, 50 queries, and 5 TREC collections. As in Table 1, the performances of Basic and V1 are always very close to each other, and they do not vary much with the number of systems; in general they do not perform well compared with the other methods. V2, V3, and V4 have very similar curves all the way from 3 to 15 systems, except with 3 or 4 systems, where V4 is not as good as the other two. The RS curve increases with the number of systems at a considerable rate, but starts from a rather low value. When the number of systems is 3-5, its performance is even worse than that of Basic and V1; when the number of systems is no more than 9, its performance is not as good as that of V2, V3, and V4; after that, it becomes the best. For 3-9 systems, V3 outperforms RS by 45.6% on average; for 10-15 systems, RS outperforms V3 by 6.5%.
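A possible sketch of this selection, assuming the runs are grouped by the participating system that produced them; the grouping input and the names are assumptions on our part.

```python
import random

def sample_diversified(runs_by_system, k, seed=None):
    """Randomly keep k participating systems and one randomly chosen run for
    each of them, so that no two selected runs come from the same system."""
    rng = random.Random(seed)
    systems = rng.sample(sorted(runs_by_system), k)
    return {s: rng.choice(runs_by_system[s]) for s in systems}
```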

Experiments were also carried out to evaluate the ranking performance of our methods and of Soboroff et al.'s method at both ends of the ranking. Due to space limitations, we omit the detailed experimental results; instead, we briefly discuss them and give some major observations. We take all runs of the participating systems and measure the Average Accuracy for the top 5 and 10 and the bottom 5 and 10 systems; 5 collections and 50 queries are used as before. Generally speaking, the average accuracies of all methods are very low: the rankings at the top end are so poor (for most methods, AA is between 0.0 and 0.1) that they are of little use. The good news is that the bottom end is much better for all six methods (with AA between 0.28 and 0.58). At the top end, RS does the best job; at the bottom end, V4 is the best. We observe that for ranking at the two ends, the differences among the reference count-based methods are much smaller than the differences in mean correlation coefficients in Table 1.
Once again, let us consider the situation where only diversified systems are involved. In each run, we randomly chose 20 systems. This time the mean average accuracy at the top end is considerably higher for diversified systems than for all systems. Therefore, the similarity of systems must affect the top end ranking heavily; conversely, the effect seems much smaller for the bottom end ranking. Even with the improved performance on diversified systems, the top end ranking is less effective than the bottom end for all six methods. With diversified systems, V4 is still the best; V3 and RS follow.

4. RANKING SYSTEMS WITH ALTERNATIVE MEASURES

For evaluating the performance of information retrieval systems, many measures are available. For example, TREC uses precision averaged at 11 standard recall levels, average precision over all relevant documents, precision at 9 document cut-off values, and R-Precision at the same time [8]. To test the adaptability of our ranking methods, we use two of the most frequently used measures, R-Precision and precision at 4 different document cut-off values (10, 30, 50, and 100), to do the ranking. In this experiment, we made some small necessary changes to Soboroff et al.'s method. For both measures we do the same as before, randomly selecting 10% of the documents in the pool as the pseudo-qrels. For precision at n documents, the estimation process is then straightforward. For R-Precision, we assume that the number of relevant documents for each query is 100 and estimate the R-Precision of each retrieval system on that basis. For the reference count-based methods, we use Basic and V1 as in Section 3. We also introduce another variation of Basic, V4', which counts each reference document's normalised score as in V4 but does not apply the second half of V4 (the weighting of original documents by Equation 2). For R-Precision, only the top 100 documents are used as original documents. For precision at the 4 document cut-off values, the particular cut-off value (10, 30, 50, or 100) determines the number of original documents used. All experiments are conducted with the same 5 TREC collections as in Section 3.
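As an illustration, the estimation used for the adapted baseline could look like the following sketch, where the pseudo-relevant set is built as in the earlier sketch and the assumption of 100 relevant documents per query is the one stated in the text; the function names are ours.

```python
def precision_at(ranked_docs, pseudo_rel, n):
    """Precision at cut-off n against a set of pseudo-relevant documents."""
    return sum(1 for d in ranked_docs[:n] if d in pseudo_rel) / n

def r_precision(ranked_docs, pseudo_rel, assumed_r=100):
    """R-Precision, assuming the number of relevant documents per query is 100."""
    return precision_at(ranked_docs, pseudo_rel, assumed_r)
```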

Table 2: Mean correlation coefficient for R-Precision, all systems submitted to TREC

Year      TREC3    TREC5    TREC6    TREC7    TREC2001   Average
Basic     0.3364   0.3711   0.4359   0.4760   0.3138     0.3866
V1        0.6355   0.4253   0.4979   0.5035   0.4393     0.5003
V4'       0.6203   0.4303   0.4758   0.5002   0.4302     0.4914
RS        0.6127   0.4108   0.4380   0.4657   0.4073     0.4669

Table 3: Mean correlation coefficient for all systems submitted to TREC, precision at 100 documents

Year      TREC3    TREC5    TREC6    TREC7    TREC2001   Average
Basic     0.2968   0.3980   0.4781   0.5295   0.4064     0.4218
V1        0.6418   0.4583   0.5459   0.5785   0.5528     0.5555
V4'       0.6245   0.4652   0.5292   0.5735   0.5409     0.5467
RS        0.6235   0.4436   0.4968   0.5244   0.5227     0.5222

[Figure 2: Correlation coefficients for 3-20 systems, R-Precision. Mean correlation coefficient (y-axis) versus number of systems (x-axis) for Basic, V1, V4', and RS.]

Table 2 shows the results for R-Precision, for all systems submitted to TREC. On average, V4' outperforms RS by 5.2% and V1 outperforms RS by 7.2%. Figure 2 shows the experimental results for 3-20 diversified systems over 20 runs and 50 queries. In Figure 2, we can see that the performance of Basic is quite flat. The performance of the three other methods (V1, V4', and RS) increases with the number of systems at first; when the number of systems reaches 10, no further obvious increase can be observed and they stabilise at the 0.45-0.5 level. RS starts with a very low value, so at first the gap between it and V1 and V4' is large; as the number of systems increases, the gap becomes smaller. However, there is still a small difference in favour of V1 and V4' when we reach 20 systems. V1 is always the best and outperforms RS by 20.1% on average.
For precision at the 4 document cut-off values (10, 30, 50, and 100), the results are quite similar, so we only present the results for precision at 100 documents. Table 3 shows the system rankings for all runs of participating systems submitted to TREC. V1 and V4' outperform RS in every TREC year. On average, V1 outperforms RS by 6.4% and V4' outperforms RS by 4.7%. Figure 3 shows the mean correlation coefficient with 3-20 diversified systems over 50 queries and 20 runs. Again, V1 is the best and outperforms RS by 29.9% on average.

[Figure 3: Correlation coefficients for 3-20 diversified systems, precision at 100 documents. Mean correlation coefficient (y-axis) versus number of systems (x-axis) for Basic, V1, V4', and RS.]

5. DISCUSSION

One interesting question is why the reference count is so informative for ranking information retrieval systems under different measures. Since a proper and complete answer is very difficult to give, we discuss some observations drawn from the experimental results. Though data fusion and ranking retrieval systems by reference count are different applications, they exploit the same phenomenon, which occurs in many current information retrieval systems. Let us first recall some conclusions from data fusion. Multiple-evidence techniques are widely used in data fusion for improving the effectiveness of systems [1], and several combination algorithms have been proposed [5, 4, 11]; one of the best is CombMNZ [5]. Some researchers have identified that, in order for multiple evidence to improve effectiveness, the retrieved sets must have a higher relevance overlap than non-relevance overlap. Lee [4] experimented on TREC 3 and showed that there was a 125% difference between relevance and non-relevance overlap. Though the exact figure could vary in different situations, we believe that such a difference indeed exists. A higher reference count, in general or over the top n documents, means that a system's result contains more documents that also appear in the other systems' rankings. Applying the multiple-evidence principle in the opposite direction, systems with a higher reference count are likely to have more relevant documents in their rankings than systems with a lower reference count. However, the reference count is not very good for ranking the top end. We believe that an explanation is related to the fact that a system at the very top of a large set of systems usually retrieves some (but not all) relevant documents that few other systems have in their results.

That is to say, such systems are somewhat peculiar. On the other hand, the reference count is quite good for the bottom end, which suggests that if a system is really unpopular it is likely to be among the poorest. This result could be useful for data fusion with estimated weights: we may divide the systems into two categories, "average" and "poor", rather than the three categories used in [11].
If we compare Table 1 with Tables 2 and 3, we find that the reference count works best for precision at 100 documents, next best for R-Precision, and worst for average precision. We conjecture that the major reasons for this are the following:
• The estimation of precision at different document cut-off values is straightforward.
• The estimation of R-Precision is a little more difficult than that of precision at different document cut-off values, since the number of relevant documents for a topic is not known.
• The estimation of average precision is the most intricate, because the positions of relevant documents are very important and the reference count is not accurate enough to locate them.

6. CONCLUSIONS

In this paper we have presented a number of methods for ranking information retrieval systems without relevance judgements. All the proposed methods use a measure called reference count. Experimentation has been conducted to evaluate their effectiveness and to compare it with that achieved by the method proposed by Soboroff et al. The experimentation has been carried out with 5 TREC collections and three different performance measures. We can summarise the results on average performance as follows:
• The experimentation shows that the system rankings produced by our methods correlate positively and significantly with the official TREC rankings under the measures of non-interpolated average precision, R-Precision, and precision at different document cut-off values. This demonstrates that the reference count is informative for estimating the performance of retrieval systems under different measures.
• When using the measure of average precision, Soboroff et al.'s method outperforms our methods when 10 or more systems are considered. However, when the number of systems is 9 or fewer, two of our methods considerably outperform Soboroff et al.'s method.
• When using the measures of R-Precision and precision at different document cut-off values, two of our methods outperform Soboroff et al.'s method in all cases.
Therefore, we have reason to conclude that the methods proposed in this paper are effective and are a good alternative for ranking retrieval systems under different measures. They could be useful for data fusion in meta search engines for the WWW.
One aspect that may affect the effectiveness of our ranking methods is the similarity between some of the participating systems.

If some of the systems (such as different runs of the same system) are much more similar to each other than the others are, the performance usually deteriorates. Therefore, for reliable performance, we should ensure that such systems are not involved.

7. REFERENCES

[1] N. J. Belkin, P. B. Kantor, E. A. Fox, and J. A. Shaw. Combining the evidence of multiple query representations for information retrieval. Information Processing and Management, 31(3):431-448, 1995.
[2] J. K. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In Proceedings of the ACM SIGMOD International Conference, pages 479-490, Philadelphia, USA, May 1999.
[3] G. Cormack, C. Palmer, and C. Clarke. Efficient construction of large test collections. In Proceedings of the 21st Annual International ACM SIGIR Conference, pages 282-289, Melbourne, Australia, August 1998.
[4] J. H. Lee. Analysis of multiple evidence combination. In Proceedings of the 20th Annual International ACM SIGIR Conference, pages 267-275, Philadelphia, Pennsylvania, USA, July 1997.
[5] J. A. Shaw and E. A. Fox. Combination of multiple searches. In Proceedings of the 3rd Text Retrieval Conference (TREC-3), pages 267-275, Gaithersburg, Maryland, USA, April 1995.
[6] I. Soboroff, C. Nicholas, and P. Cahan. Ranking retrieval systems without relevance judgements. In Proceedings of the 24th Annual International ACM SIGIR Conference, pages 66-73, New Orleans, Louisiana, USA, September 2001.
[7] E. M. Voorhees, editor. Proceedings of the 10th Text Retrieval Conference, Gaithersburg, Maryland, USA, November 2001. National Technical Information Service of USA.
[8] E. M. Voorhees and D. K. Harman, editors. Proceedings of the 5th Text Retrieval Conference, Gaithersburg, Maryland, USA, November 20-22, 1996.
[9] E. M. Voorhees and D. K. Harman, editors. Proceedings of the 8th Text Retrieval Conference, Gaithersburg, Maryland, USA, November 1999. National Technical Information Service of USA.
[10] C. C. Vogt and G. W. Cottrell. Fusion via a linear combination of scores. Information Retrieval, 1(3):151-173, October 1999.
[11] S. Wu and F. Crestani. Data fusion with estimated weights. In Proceedings of the Eleventh International Conference on Information and Knowledge Management, McLean, Virginia, USA, November 2002.
[12] J. Zobel. How reliable are the results of large-scale retrieval experiments? In Proceedings of the 21st Annual International ACM SIGIR Conference, pages 307-314, Melbourne, Australia, August 1998.