Ranking algorithms for costly similarity measures

Robin Burke
Dept. of Information Systems and Decision Sciences
California State University, Fullerton
[email protected]

Abstract. Case retrieval for e-commerce product recommendation is an application of CBR that demands particular attention to efficient implementation. Users expect quick response times from on-line catalogs, regardless of the underlying technology. In FindMe systems research, the cost of metric application has been a primary impediment to efficient retrieval. This paper describes several types of general and special-purpose ranking algorithms for case retrieval and evaluates their impact on retrieval efficiency with the Entree restaurant recommender.

1 Introduction

Case-based product recommendation is an important application of CBR technology for electronic commerce catalogs and on-line shopping [1]. Some of the chief problems in this application area are those of scale and efficiency. To meet the demands of Internet e-commerce and compete with alternative technologies, CBR product recommendation must be efficient without losing its accuracy. Moreover, a recommender must be able to cope with the dynamic nature of a product catalog, as prices, availabilities, and contents are all subject to frequent change. Although much case-based reasoning research has concentrated on improving the efficiency of case retrieval, many approaches do not handle case bases whose contents are dynamic.

Building on the work of Shimazu et al. [10], Schumacher and Bergmann [7] have implemented e-commerce CBR retrieval on top of relational database systems, achieving impressive retrieval efficiency in a dynamic context. They work from the query to describe in SQL the region in feature space that would yield the most similar cases, using case density estimation to determine the region's desired size. This work concentrates only on retrieval, however, and the retrieved results must still be ranked. Additionally, the work assumes a flat attribute-value representation.

Entree is a case-based restaurant recommendation system available on-line¹ [2,3,4]. In this and related systems, we have found that metric application itself can be the chief source of inefficiency in retrieval. It is therefore important to create retrieval algorithms that minimize the number of applications of metrics. This paper introduces the problem of metric application that arises in FindMe systems, describes several possible ranking algorithms applied to the problem, and evaluates them using the Entree restaurant data.

¹ http://infolab.ils.nwu.edu/entree/

2 Local and Global Similarity Measures

Entree and other FindMe systems have case representations that use single-value and set-based representations of attributes. Set-valued attributes are quite common. For example, the cuisine of a restaurant is a set-valued attribute because a restaurant may serve several different kinds of food or incorporate multiple culinary influences. Suppose we define Ai as a set-valued attribute of a case and ai as the set of values associated with it. FindMe systems use the following local similarity metric for set-valued attributes:

    sim(Ai, Bi) = | ai ∩ bi | / | ai |

For example, consider restaurants D, E and F, with the following values for the cuisine feature:

    D: { Thai }
    E: { Thai, Chinese, Vietnamese }
    F: { Chinese }

This asymmetric overlap metric looks at the size of the feature-set intersection relative to the size of the set in the query case. So, for example, the similarity between D and E is maximal, because E overlaps with everything found in D. The similarity between E and D is less, because D contains only 1/3 of what is found in E. The metric is intentionally asymmetric, under the assumption that a user looking for Thai food would be satisfied by a restaurant that served both Thai and Chinese, but one looking for a restaurant that combined French and Japanese influences would not necessarily be happy with a restaurant that focused on one or the other.

Metrics of this type present some difficulty in case retrieval. The calculation is not efficient to perform, having complexity |ai| * |bi|. Because the metric is asymmetric, the distances between cases do not form a metric space in which cases can be ordered. If we wish to cache the distances between cases, we have to store as many as c² values, where c is the size of the case base. Moreover, in a dynamic database, cached values would have to be recalculated frequently. The dynamics of the product database therefore rule out approaches that rely on similarity caching, such as case-retrieval nets [5] and k-d trees [6]. The alternative that we have adopted is to compute the metric at retrieval time, but to design our algorithms to minimize the number of such computations.
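For illustration, the asymmetric overlap metric is easy to sketch in Python; the function and example values below are illustrative, not Entree's actual code:

    def overlap_sim(query_values, candidate_values):
        # Fraction of the query case's values covered by the candidate.
        if not query_values:
            return 0.0
        return len(query_values & candidate_values) / len(query_values)

    d = {"Thai"}
    e = {"Thai", "Chinese", "Vietnamese"}
    print(overlap_sim(d, e))   # 1.0: e covers everything in d
    print(overlap_sim(e, d))   # 0.33...: d covers only a third of e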

2.1 Global similarity measures

FindMe systems use a global similarity metric in which local similarity measures are combined in a priority ordering. For example, restaurants are compared first with respect to cuisine; then the quality dimension is used to break ties between otherwise similar restaurants, with additional considerations applied to achieve finer-grained distinctions. This type of metric is somewhat different from the weighted combination typically found in case-based reasoning systems, but it has the important advantage that it is easy to engineer. A hierarchy of the importance of different product characteristics is simple to generate from basic domain considerations, and it is easy to envision what the consequences of any change would be (for example, putting price before cuisine instead of the other way around). By contrast, the setting of weights in weighted metrics has always been recognized as a difficult problem. A hierarchical metric can always be turned into a weighted combination by using weights that increase exponentially with priority.
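To make the last point concrete, here is a small sketch, under the assumption (not made above) that local metrics return integer scores bounded by some constant B; weights that grow by a factor of (B+1) per priority level then reproduce the lexicographic order of the hierarchy exactly:

    B = 10   # assumed upper bound on any local metric's integer score

    def lexicographic(scores):
        # scores listed highest-priority first; tuples compare lexicographically
        return tuple(scores)

    def weighted(scores):
        # weights (B+1)**(m-1), ..., (B+1), 1 increase exponentially with priority
        m = len(scores)
        return sum(s * (B + 1) ** (m - 1 - i) for i, s in enumerate(scores))

    a = [7, 2, 9]   # e.g. (cuisine, quality, price) scores for one case
    b = [7, 3, 0]   # ties a on the first metric, wins on the second
    assert (lexicographic(a) < lexicographic(b)) == (weighted(a) < weighted(b))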

3 Sorting algorithms

Earlier publications on Entree [2] described a bucket sorting algorithm made possible by restricting metrics to range only over small integer values. That algorithm is also discussed below, but experience has shown that this limitation on metric formulation is too restrictive. This paper considers the general case, where the similarity function can return a real-numbered value.

The simplest solution for retrieving the best cases is to evaluate all of the metrics and sort. This Sort algorithm can be formalized as follows:

    Sort (q, C, M)
      Let q be the query case
      For each case c in the case base C
        Create a vector of similarity values vc
        For each metric mi in the similarity function M
          vc[i] = mi(q, c)
      Sort (C, compare)

where the following comparison function is used:

    compare (c1, c2)
      for i = 0..|v|
        if vc1[i] > vc2[i] then return "c1 greater"
        if vc2[i] > vc1[i] then return "c2 greater"
      return "c1 = c2"

If an efficient sorting algorithm such as Quicksort is used, the complexity of sorting will be O(c log c), where c is the number of cases. The cost of metric application is m*c, where m is the number of metrics.
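In a language with lexicographically comparable tuples, Sort is nearly a one-liner. The following Python sketch assumes metrics is a list of functions mi(q, c); the key tuple is computed exactly once per case, so the metric cost is the full m*c:

    def sort_cases(q, cases, metrics):
        # Python compares key tuples lexicographically, matching compare() above;
        # reverse=True puts the most similar case first.
        return sorted(cases,
                      key=lambda c: tuple(m(q, c) for m in metrics),
                      reverse=True)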

One feature of Entree's retrieval (shared with other e-commerce case retrieval systems) is that it aims to return only a small number of responses, so that the user is not overwhelmed with alternatives. In Entree, the maximum number of restaurants retrieved is 10. When only a few of the top cases are needed, it is possible to perform a selection operation over the set of cases and return the top items:

    Select (q, C, M)
      Let q be the query case
      For each case c in the case base C
        Create a vector of similarity values vc
        For each metric mi in the similarity function M
          vc[i] = mi(q, c)
      For i = 1..k
        answer[i] = null
        For j = 1..|C|
          if compare (cj, answer[i]) returns "cj greater" then answer[i] = cj
        remove answer[i] from C

using the comparison function described above. This has efficiency O(c * k), where k is the desired number of cases returned, which is better than sorting for small values of k. It is, however, equivalent to the sort algorithm as far as metric applications are concerned: each metric must be evaluated for each case to compute the score against which to apply the maximum-value test.

In practice, the whole case base would never need to be sorted or selected from. It is possible to extract a subset of the case base as the region where similar cases are to be found. This initial step can be performed with database or table lookup and can increase the efficiency of later sorting. In Entree, a simple inverted-index technique was used, in which the system retrieved any case that had any possible overlap with the query case. While this technique has the advantage of not ruling out any similar cases, it was found to be excessively liberal, retrieving an average of 87% of the case base. In our later work [3], we implemented a database technique similar to Schumacher and Bergmann's [7] for retrieving this initial set.
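A bounded heap is a convenient stand-in for the k scanning passes of Select. This Python sketch (again assuming a metrics list of functions) returns the top k cases at the same m*c metric cost:

    import heapq

    def select_cases(q, cases, metrics, k=10):
        # full similarity vectors are computed up front, as in Select
        keyed = [(tuple(m(q, c) for m in metrics), i, c)
                 for i, c in enumerate(cases)]
        # the index i breaks ties so case objects are never compared directly
        return [c for _, _, c in heapq.nlargest(k, keyed)]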

3.1 Metric Application On Demand

Since metric computations are to be minimized, one of the major concerns regarding the Sort algorithm is that many of the metric computations are wasted. If we can determine that c1 > c2 on the basis of the first value in the value vector, then the rest of the values are not needed; indeed, they may never be needed to compare c1 against any other case. Because of the nature of the similarity measure, such metric applications can be omitted without harming the quality of the sort.

This insight suggests an "on-demand" approach to metric application. Rather than computing all metric values first and sorting later, let the algorithm compute only those metric values needed to actually perform the comparisons required by the sort operation. The Metric Application on Demand Sort (MADSort) algorithm below also uses a standard sort procedure, but it skips the initial step of pre-computing the metric values. Its comparison routine only computes a new similarity value if that value is actually needed. For example, consider a candidate set of three cases c1, c2, and c3, and two metrics m1 and m2. When c1 and c2 are compared for sorting, the algorithm computes m1(q, c1) and m1(q, c2) and discovers that c1 has the higher value. When c1 is compared against c3, the value of m1(q, c1) does not need to be recomputed.

    MADSort (q, C, M)
      For each case c in the case base C
        Create an empty vector of similarity values vc
      Sort (C, compare_on_demand)

    compare_on_demand (c1, c2)
      for i = 0..|v|
        if vc1[i] is not yet computed, vc1[i] = mi(q, c1)
        if vc2[i] is not yet computed, vc2[i] = mi(q, c2)
        if vc1[i] > vc2[i] then return "c1 greater"
        if vc2[i] > vc1[i] then return "c2 greater"
      return "c1 = c2"
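A Python sketch of the same idea follows; the memo table guarantees that each metric is applied to a given case at most once, and the helper names are illustrative:

    from functools import cmp_to_key

    def madsort(q, cases, metrics):
        cache = {id(c): [None] * len(metrics) for c in cases}

        def value(c, i):
            row = cache[id(c)]
            if row[i] is None:                   # apply the metric on demand
                row[i] = metrics[i](q, c)
            return row[i]

        def compare_on_demand(c1, c2):
            for i in range(len(metrics)):
                v1, v2 = value(c1, i), value(c2, i)
                if v1 != v2:
                    return -1 if v1 > v2 else 1  # more similar sorts first
            return 0

        return sorted(cases, key=cmp_to_key(compare_on_demand))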

This algorithm's complexity depends on the "discrimination power" of the individual metrics mi. In the worst case, every metric returns the same value for every case, and the algorithm is no different from the simple Sort. However, a similarity function with this property is useless in practice, since it performs no discrimination, and so the worst case does not arise.

One way to analyze the complexity is in terms of the discriminating power ρm of each metric m, which we can define as the ability of the metric to divide the case base into equivalence classes. Assume that a metric creates equivalence classes that are equal in size across the case base. We can then interpret ρ as follows: ρ = 1 implies that the metric creates a single equivalence class of size c, performing no discrimination; ρ = 1/c describes a metric that creates equivalence classes of size 1, thereby imposing a total order on the case base. The probability that recourse to a higher metric will be required for two randomly chosen cases, c1 and c2, depends on the size of the equivalence classes created by the metric. Assuming equal-size classes, a metric with power ρ yields classes of size c*ρ. c2 will fall into the same class as c1 according to the proportion between the size of the equivalence class after c1 is removed (c*ρ − 1) and the size of the whole case base, also after c1 is removed (c − 1). The probability that the two are in the same class is therefore

    (cρ − 1) / (c − 1)                                               (1)

If c is large, this equation can be simplified to

    (cρ − 1) / c = ρ − 1/c ≈ ρ                                       (2)

The probability that we will have to apply j metrics is the product of the probabilities of needing to apply all of the metrics up to j:

    Π_{i=1..j} ρ_{i−1}                                               (3)

We always have to apply at least one metric, even if it is a perfect one. So, the expected number of metrics that would have to be applied to compare two randomly chosen cases is

    e = 1 + Σ_{j=1..m−1} Π_{i=1..j} ρ_{i−1}                          (4)

where m is the number of metrics.

Each additional comparison adds to the expected number of metrics applied to a case. When a second comparison is made, additional metric applications will only be needed if previously drawn distinctions are not enough. The probability of this happening is the product of the ρ values of the already-applied metrics. If more metrics are needed, then we can use the same considerations as in Equation 4 to calculate the expected number:

    e′ = Π_{k=1..e} ρ_k ⋅ (1 + Σ_{j=e+1..m−1} Π_{i=1..j} ρ_{i−1})    (5)

The total expected number of metric applications is therefore the sum of this recurrence over all comparisons performed. There is no simple closed-form solution, but the characteristics of the function are easy to evaluate numerically. An efficient sorting algorithm like Quicksort performs c log c comparisons, so each case is compared against log c other cases. A candidate set of size 1024 would therefore entail 10 comparisons per case. If all of the metrics in the global similarity function had ρ = 0.1, then in the 10th round of comparisons the expected number of metrics applied to each case would be 1.55, for a total metric application count of 1593, only a quarter of the full m*c needed by the Sort algorithm. If, on the other hand, the metrics are weaker, with ρ = 0.75 for example, we would expect that by the 9th round of comparison all of the metrics would already have been applied.

The same on-demand technique can be applied to create a MADSelect algorithm, using the Select algorithm as above but with the on-demand comparison function. The analysis is similar except that there are k rounds of comparison, where k is the number of answers to be returned.

Since the number of metric applications is a simple statistic to collect in the operation of a retriever, this derivation also provides a way to measure the discriminatory power of a metric or an entire similarity function. We can examine the total number of metric applications and the number of candidates retrieved, and calculate a combined ρ for the whole similarity function. This value is the overall ρ for only the first m−1 metrics, since ρm does not contribute to Equation 4. We can also derive ρ for each metric in the function in isolation, simply by running the same experiment with a simplified similarity metric. This technique is applied to the Entree similarity measure in the experiments below.
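As a sanity check on Equation 4, the equivalence-class model is easy to simulate. In this sketch, each metric is assumed to leave a random pair of cases tied with the same probability ρ; the values of ρ and m are illustrative, not measured Entree figures:

    import random

    def metrics_applied(rho, m):
        # metrics applied when comparing one random pair of cases
        applied = 0
        for _ in range(m):
            applied += 1
            if random.random() >= rho:    # the metric discriminated: stop
                break
        return applied

    rho, m, trials = 0.1, 6, 100_000
    simulated = sum(metrics_applied(rho, m) for _ in range(trials)) / trials
    analytic = 1 + sum(rho ** j for j in range(1, m))   # Equation (4)
    print(simulated, analytic)   # both close to 1.111 for these values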

3.2 Tournament tree

Another approach to minimizing the number of metric applications is to use a data structure that allows selection and comparison to be interleaved. One such data structure is the tournament tree [8]. A tournament tree is a binary tree in which the parent of two nodes is determined by comparing the two children and copying the "winner" to the parent. Through successive comparisons, the root of the tree is the overall winner. To extract the top k elements from such a tree, the winning leaf is replaced with a null value and the log c competitions up to the root are recomputed. Figure 1 shows a simple tournament tree example. Node A is the winner against both B and C, and becomes the root node. When it is removed, B is the new overall winner and is bubbled to the top. The top k elements can be found in order using only k log c comparisons, rather than the c log c required for a full sort or the k*c that would be needed in a selection algorithm.

Fig. 1. Tournament tree sort

Here is a tournament tree selection algorithm that returns only the top k cases:

    Create a vector t0 of size c = |C| containing the elements of the case base
    For each case
      Create a vector of similarity values v
    Loop
      c = c/2
      For each adjacent pair cl, cr in ti
        compare_on_demand (cl, cr)
        if (cl > cr) copy cl to ti+1
        else copy cr to ti+1
    Until c = 1
    For each case to be output
      Copy the root case c0 to the output
      Replace c0 in t0 with a null value
      Recompute all the parents of c0 in the tree
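A Python sketch of tournament-tree selection with on-demand comparison follows. The implicit-array layout and helper names are illustrative, and cases are assumed to be distinct objects:

    def tournament_top_k(q, cases, metrics, k):
        cache = {id(c): [None] * len(metrics) for c in cases}

        def value(c, i):
            row = cache[id(c)]
            if row[i] is None:
                row[i] = metrics[i](q, c)        # applied on demand
            return row[i]

        def better(c1, c2):                      # lexicographic "winner"
            if c1 is None:
                return c2
            if c2 is None:
                return c1
            for i in range(len(metrics)):
                v1, v2 = value(c1, i), value(c2, i)
                if v1 != v2:
                    return c1 if v1 > v2 else c2
            return c1

        n = 1                                    # pad the leaves to a power of two
        while n < len(cases):
            n *= 2
        tree = [None] * n + list(cases) + [None] * (n - len(cases))
        for i in range(n - 1, 0, -1):            # play the initial tournament
            tree[i] = better(tree[2 * i], tree[2 * i + 1])

        out = []
        for _ in range(min(k, len(cases))):
            winner = tree[1]                     # current overall winner
            out.append(winner)
            leaf = tree.index(winner, n)         # winner's leaf slot
            tree[leaf] = None
            node = leaf // 2
            while node >= 1:                     # replay log(c) matches to the root
                tree[node] = better(tree[2 * node], tree[2 * node + 1])
                node //= 2
        return out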


There are only k log c comparisons performed in this selection routine, and with on-demand metric application, the number of metric applications is e*k*log c. Table 1 shows how all the algorithms discussed so far compare with respect to metric applications.

Table 1. Expected number of metric applications

    Algorithm          No. of metric applications
    Sort/Select        m ⋅ c
    MADSort            e ⋅ c ⋅ log c
    MADSelect          e ⋅ k ⋅ c
    Tournament tree    e ⋅ k ⋅ log c

4 Experiments

While these theoretical results suggest the possible benefit of alternate ranking algorithms, we also performed experiments to quantify these benefits. Since our object is to minimize the number of times that metrics are applied, we use the number of metric applications performed by each algorithm as the primary evaluation measure, but run times were also examined. Five algorithms were tested: Sort, Select, MADSort, MADSelect, and Tournament Tree. Each of the 675 Chicago restaurants was used as a query against the Entree database, in a leave-one-out protocol, and the results averaged. This protocol was repeated for different result set sizes: 10, 50 and 100.

Figure 2 shows the results, counting the average number of metric applications. (Select is omitted since it has the same number of metric applications as Sort.) MADSelect outperforms all the other algorithms at small retrieval sizes, but its advantage decreases with the size of the return set, equaling the Tournament Tree algorithm at set size 100. This result is somewhat surprising given the difference in the expected number of metric applications shown in Table 1. With an average number of retrieved candidates of 585, log c = 9.19: one would expect Tournament Tree to be far superior. And with k = 10, MADSort should have an edge that increases with k. The explanation has to do with which cases are being compared. MADSelect operates over the cases in k passes. In the first pass, on average, half of the comparisons performed by MADSelect will be against the best candidate. Cases that are close to the best will have to be fully evaluated to determine that they are not better, and cases that are far away will be ruled out with few applications. Each successive pass follows the same pattern, but the good cases will have already been fully evaluated. The Tournament Tree algorithm is not at all selective about which cases are being compared: cases are compared to their neighbors. MADSort improves only slightly (but significantly, p < …).

5 Bucket Sort

As noted in Section 3, earlier work on Entree [2] used a bucket sorting algorithm that applies when every metric is restricted to a small range of integer values. Each round of ranking bins the current candidate set into buckets b1, b2, ... in decreasing order of the next metric's score and keeps only the best-scoring buckets, choosing the smallest threshold t such that

    Σ_{i=1..t} |bi| > max                                            (6)

In other words, we include enough buckets to be sure of having max cases to return. The first t buckets are used in the next round; the other buckets can be discarded. This algorithm quickly reduces the set of candidate cases and thereby radically reduces the number of metric applications below what is found in the MADSelect and Tournament Tree algorithms.

Figures 4 and 5 show the results of comparing this technique against these two algorithms. When the result set is small, the average number of metric applications is reduced to 68% of MADSelect's (829 vs. 1218). As the number of cases returned increases, Bucket Sort loses some ground to the Tournament Tree, but not nearly as much as MADSelect does. For run times, this effect is even more dramatic: the gap between Bucket Sort and Tournament Tree shrinks, while MADSelect falls far behind.
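Since the full specification of the algorithm appears in [2], the following Python sketch is a reconstruction from the description above, assuming integer-valued metrics; it refines an ordered partition of the candidates one metric at a time and prunes trailing buckets per Equation 6:

    from collections import defaultdict

    def bucket_rank(q, cases, metrics, max_out):
        # groups is an ordered partition: earlier groups rank strictly higher
        groups = [list(cases)]
        for m in metrics:
            refined = []
            for group in groups:                 # refine each group separately
                buckets = defaultdict(list)
                for c in group:
                    buckets[m(q, c)].append(c)   # one metric application per case
                for score in sorted(buckets, reverse=True):
                    refined.append(buckets[score])
            kept, groups = 0, []
            for g in refined:                    # Equation (6): keep just enough
                groups.append(g)                 # leading buckets to cover max_out
                kept += len(g)
                if kept >= max_out:
                    break                        # remaining buckets are discarded
        return [c for g in groups for c in g][:max_out]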

Fig. 4. Comparison of Bucket Sort against MADSelect and Tournament Tree (number of metric applications, by cases returned: 10, 50, 100)

Fig. 5. Comparison of Bucket Sort against MADSelect and Tournament Tree (run time in milliseconds, by cases returned: 10, 50, 100)

6 Related and Future Work

The issue of metric application cost has not surfaced as a major issue in case-based reasoning, partly because of the use of indexing methods that cache similarity information, and partly because most systems do not use the kind of prioritized similarity function described here. Bridge [11] has tackled the problem of generalizing global similarity functions, and (with Ferguson) in [12] also examined prioritized combinations of similarity metrics such as those found in Entree. This research elaborates an elegant framework of partial-order-based similarity functions and rules for combining them. One of the interesting aspects of Bridge's work is the notion of generalized prioritization: a way of relaxing a strict hierarchy of metrics, maintaining priority relations between metrics that are looser than those found in Entree and discussed in this paper. Future work in FindMe systems will examine more general (and typical) global similarity formulations, such as Bridge's generalized prioritization and linear combination.

For linear combinations of metrics, we lose the advantage of a static hierarchy of local similarity influence. Suppose we are comparing two cases c and d using a hierarchical metric that combines local metrics A and B. If metric A finds c > d, we know that the results of metric B cannot have any effect on their ordering. In a linear combination, this is not typically the case. However, consider a simple global similarity metric S that combines two local measures using weights:

    S(c1, c2) = w1 ⋅ s1(c1, c2) + w2 ⋅ s2(c1, c2)                    (7)

Suppose that w1 >> w2, and that we are ranking two cases c and d against a query case q. If we find that c scores well with s1 and d scores poorly, it may be mathematically impossible for d to outrank c, even if d gets the maximum score obtainable from s2 and c gets the lowest. In this case, we can use the MAD technique to skip evaluating s2. This technique would only be worthwhile if the cost of applying a metric significantly exceeded the overhead of testing for the possibility of its exclusion, and if the weights in the metric were sufficiently skewed to make it possible to benefit from this optimization. Future work will examine this optimization more fully. Finally, given the proven benefit of Bucket Sort over the other techniques, we are also interested in extending this binned technique to non-integer-valued metrics.
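A sketch of such an exclusion test, under the additional assumption (not stated above) that every local measure returns values in [0, 1]:

    def order_decided(partial_c, partial_d, weights, applied):
        # weighted score from the first `applied` metrics of each case
        done_c = sum(w * s for w, s in zip(weights[:applied], partial_c))
        done_d = sum(w * s for w, s in zip(weights[:applied], partial_d))
        # d can gain at most the remaining weight mass (each s_i <= 1),
        # while c can only gain or stay put (each s_i >= 0)
        return done_c >= done_d + sum(weights[applied:])

    # with w1 >> w2, a large lead on s1 makes s2 moot:
    print(order_decided([0.9], [0.1], weights=(10, 1), applied=1))   # True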

Conclusion

This paper has examined the problem of managing the cost of metric application in hierarchical metric combinations. Such hierarchical metrics are not as common as weighted-sum similarity functions, but they have proven effective in the catalog search domains found in FindMe systems. Several algorithms were discussed: in the general case, a selection algorithm that caches metric results demonstrated the best performance for small result set sizes, while an algorithm based on the tournament tree showed the best performance for larger result sets. For special formulations of the local similarity metrics, the previously-published bucket sort algorithm shows substantial improvement. Finally, the analysis of the on-demand sorting algorithm allows us to examine metric performance in terms of discriminatory power, capturing both the overall performance of a global similarity measure and the capabilities of its individual local metrics. This capability can assist knowledge engineers in the creation of effective metrics for case-based retrieval systems.

Acknowledgments

Entree was developed at the University of Chicago in collaboration with Kristian Hammond, with the support of the Office of Naval Research under grant F49620-88D-0058. Many others contributed to the FindMe effort at the University of Chicago, including Terrence Asselin, Kai Martin, Kass Schmitt, Robb Thomas, and Ben Young. I would also like to acknowledge David Aha and four anonymous ICCBR reviewers for their assistance in improving this paper.

References

[1] Vollrath, I., Wilke, W., Bergmann, R.: Intelligent Electronic Catalogs for Sales Support: Introducing Case-Based Reasoning Techniques to On-Line Product Selection Applications. In Roy, R., Furuhashi, T., Chawdhry, P.K. (Eds.): Advances in Soft Computing: Engineering Design and Manufacturing. Springer, London, 1999.
[2] Burke, R., Hammond, K., Young, B.: The FindMe Approach to Assisted Browsing. IEEE Expert, 12(4), 32-40, 1997.
[3] Burke, R.: The Wasabi Personal Shopper: A Case-Based Recommender System. In Proceedings of the 11th National Conference on Innovative Applications of Artificial Intelligence, 844-849. AAAI, 1999.
[4] Burke, R.: Knowledge-Based Recommender Systems. In Kent, A. (Ed.): Encyclopedia of Library and Information Systems, Vol. 69, Supplement 32. Marcel Dekker, New York, 2000.
[5] Lenz, M., Burkhard, H.-D.: Case Retrieval Nets: Basic Ideas and Extensions. In Görz, G., Hölldobler, S. (Eds.): KI-96: Advances in Artificial Intelligence. Springer, Berlin, 1996.
[6] Wess, S., Althoff, K.-D., Derwand, G.: Using k-d Trees to Improve the Retrieval Step in Case-Based Reasoning. In Wess, S., Althoff, K.-D., Richter, M.M. (Eds.): Topics in Case-Based Reasoning, 167-181. Springer-Verlag, Berlin, 1994.
[7] Schumacher, J., Bergmann, R.: An Efficient Approach to Similarity-Based Retrieval on Top of Relational Databases. In Blanzieri, E., Portinale, L. (Eds.): Advances in Case-Based Reasoning: Proceedings of the European Workshop on Case-Based Reasoning, EWCBR-00, 273-284. Springer-Verlag, Berlin, 2000.
[8] Horowitz, E., Sahni, S.: Fundamentals of Data Structures. Computer Science Press, Rockville, MD, 1981.
[9] Cormen, T., Leiserson, C., Rivest, R.: Introduction to Algorithms. MIT Press, Cambridge, MA, 1990.
[10] Shimazu, H., Kitano, H., Shibata, A.: Retrieving Cases from Relational Data-Bases: Another Stride Towards Corporate-Wide Case-Base Systems. In Proceedings of the 1993 International Joint Conference on Artificial Intelligence, 909-914. IJCAI, 1993.
[11] Bridge, D.: Defining and Combining Symmetric and Asymmetric Similarity Measures. In Smyth, B., Cunningham, P. (Eds.): Advances in Case-Based Reasoning: Proceedings of the European Workshop on Case-Based Reasoning, EWCBR-98, 52-63. Springer, Berlin, 1998.
[12] Ferguson, A., Bridge, D.: Partial Orders and Indifference Relations: Being Purposefully Vague in Case-Based Retrieval. In Blanzieri, E., Portinale, L. (Eds.): Advances in Case-Based Reasoning: Proceedings of the European Workshop on Case-Based Reasoning, EWCBR-00, 74-85. Springer, Berlin, 2000.