Evaluating Retail Recommender Systems via Retrospective Data: Lessons Learnt from a Live-Intervention Study

Carmen M. Sordo-Garcia, M. Benjamin Dias, Ming Li, Wael El-Deredy and Paulo J. G. Lisboa

Abstract— Performance evaluation via retrospective data is essential to the development of recommender systems. However, it is necessary to ensure that the evaluation results are representative of live, interactive behaviour. We present a case study of several common evaluation strategies applied to data from a live intervention. The intervention is designed as a case-control experiment applied to two cohorts of consumers (active and non-active) from an online retailer. This results in four binary hit rate indicators of live performance to compare with evaluation strategies applied to the same basket data as was available immediately prior to the recommendations being made, treating them as historical data. It was found that in this case none of the standard evaluation strategies predicted comparable binary hit rates to those observed during the live intervention. We argue that they may not sufficiently represent live, interactive behaviour to usefully guide system development with retrospective data. We present a novel evaluation strategy that consistently provides binary hit rates comparable to the live results, which seems to mirror the actual operation of the recommender more closely, paying particular attention to the principles and constraints that are expected to apply.

Key Words— Recommender Systems, Performance Evaluation, Model Selection & Comparison, Business Applications, Lessons Learnt

Carmen M. Sordo-Garcia is with the School of Psychological Sciences, University of Manchester, Zochonis Building - Oxford Road, Manchester M13 9PL, UK (phone: +44-1234-222-377; fax: +44-1234-248-010; email: [email protected]). M. Benjamin Dias is with the Unilever Corporate Research Centre, Colworth Park, Sharnbrook, Bedford MK44 1LQ, UK (phone: +44-1234-248-156; fax: +44-1234-248-010; email: [email protected]). Ming Li is with the Unilever Corporate Research Centre, Colworth Park, Sharnbrook, Bedford MK44 1LQ, UK (phone: +44-1234-222-218; fax: +44-1234-248-010; email: [email protected]). Wael El-Deredy is with the School of Psychological Sciences, University of Manchester, Zochonis Building - Oxford Road, Manchester M13 9PL, UK (phone: +44-161-275-2566; fax: +44-161-275-2685; email: [email protected]). Paulo J. G. Lisboa is with the School of Computing and Mathematical Sciences, Liverpool John Moores University, Byrom Street, Liverpool L3 3AF, UK (phone: +44-151-231-2225; fax: +44-151-231-3724; email: [email protected]).

I. INTRODUCTION

Recommender systems are increasingly being used to help people filter the overwhelming amount of information they encounter on a daily basis. As a result, over the past decade or so, there has been growing interest in research into recommender algorithms (e.g. [1], [2], [3], [4], [5], [6], [7], [8], [9]). However, apart from a few exceptions (e.g. [10], [11], [12]), the problem of evaluating recommender systems on retrospective data, in particular for model selection, remains relatively uncharted territory. A recommender system is an embodiment of an automated dialogue with a human user.

Therefore, evaluating the performance of a recommender system essentially requires feedback from the user. However, during the development phase we do not have access to such feedback, as we only have retrospective data to work with. This is an inherent limitation of the design of recommender systems: historical data cannot, by definition, accurately estimate the propensity for take-up of recommended items that were not in the shopping basket. Therefore, it is common practice (e.g. [10], [11], [12]) to withhold some of the test data to be predicted given the remainder (i.e. treating some of the test data as observations and a withheld subset as the target). Many metrics (e.g. measuring accuracy, coverage, novelty and serendipity) have been used via this technique to estimate the level of performance that can be expected if and when the system is deployed (see [10] for a comprehensive review of metrics used in evaluating recommender systems). However, the entire process of evaluating the potential of recommender systems on retrospective data is rarely, if at all, validated via a corresponding evaluation on interactive (live) data.

In this report we present several possible evaluation strategies and evaluation metrics for estimating the potential live performance of a recommender system via retrospective data. Most importantly, we have had a unique opportunity to validate the different evaluation strategies, as well as the evaluation metrics, on data generated via a case-control experiment carried out in a real-world e-commerce scenario rather than a controlled lab environment. This opportunity was made possible by our collaboration with the exclusively online Swiss supermarket LeShop. We have used the live intervention study carried out on www.LeShop.ch to validate each of the different evaluation strategies and to select the most suitable one.

We begin by setting out the context of our work. The experimental design of our live intervention study is then described. This is followed by a description of the evaluation metrics used and the results from the live intervention. We then describe the different evaluation strategies for working with retrospective data, followed by the results from the analysis on retrospective data. Finally, we provide a comparative analysis of the different evaluation strategies and conclude with some details of our ongoing and future work on improving our recommender system.

II. BACKGROUND

We are working in collaboration with the exclusively online Swiss supermarket LeShop to deploy a recommender system on their website.

The business objective of this recommender system is to provide personalized item recommendations to all customers at checkout (i.e. when they hit the checkout button, just before confirming their order), with the aim of increasing the number of items purchased (i.e. increasing sales volume). The customer-centric aim of the system is to make the recommendations relevant to the customer, in order to achieve a win-win outcome. Prior to launching our live recommender system, as part of the development phase, we were faced with the task of evaluating the potential performance of different prospective recommender algorithms via retrospective data. This process requires an evaluation strategy, which defines which portion of the test data is to be withheld as the target. In the literature, the two most common approaches for evaluating recommender systems via retrospective data appear to be cross-validation (including the leave-one-out approach) [10] and selecting the target set at random [3]. However, we found experimentally that different evaluation strategies provided a different performance ranking of the recommender algorithms being considered. Therefore, the main aim of our initial work reported here was to identify an evaluation strategy that consistently provides the same performance rank order of the recommender algorithms as that observed during a live implementation. A brief description of the data that was available to us for this work is provided next.

A. The Data

In order to carry out this work, LeShop provided us with link-anonymized¹ shopping basket data from actual historical records comprising individual transactions. The data consist of binary indicators representing the purchase by an individual of a particular item available at the supermarket. The training data used for the work presented in this report comprised a sample of ∼8 million transactions, referring to purchases by ∼30,000 individuals of ∼25,000 items. The resulting user-by-item binary matrix is very sparse, which potentially compromises the estimation of conditional probabilities for future purchases; such sparsity is typical of transaction data in retail stores (cf. the data in [3]). In order to mitigate the effects of sparseness, we aggregated items into categories, and only those categories comprising 80% of the total spending (as computed over the training data) are considered in the model. We refer to these as the ‘Active’ categories. This process of item aggregation produced ∼200 active categories. There is also substantial variation in the shopping frequency of individual customers. Therefore, two groups of customers were identified: active and non-active. The active customers are those who have made more than five shopping episodes over the previous six months, while the remainder are the non-active customers. Different combinations of modelling and evaluation using the active and non-active groups of customers were considered.

¹ Link-anonymized means using dummy indices that can, in principle, be traced back to individual users.
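For illustration, the following is a minimal Python sketch of this preprocessing step: deriving the ‘Active’ categories (those covering 80% of total spend), splitting customers into active and non-active (more than five shopping episodes), and building the aggregated user-by-category binary matrix. It is not the pipeline used for this work; the column names (user_id, category, spend, basket_id) and the use of pandas are assumptions made here for the sake of the example.

```python
# Illustrative sketch only; column names and data layout are assumed.
import pandas as pd

def active_categories(transactions: pd.DataFrame, coverage: float = 0.80) -> set:
    """Categories that together account for `coverage` of total spending."""
    spend = transactions.groupby("category")["spend"].sum().sort_values(ascending=False)
    cumulative_share = spend.cumsum() / spend.sum()
    return set(cumulative_share[cumulative_share <= coverage].index)

def split_customers(transactions: pd.DataFrame, min_baskets: int = 5):
    """Active customers have more than `min_baskets` shopping episodes in the window."""
    episodes = transactions.groupby("user_id")["basket_id"].nunique()
    active = set(episodes[episodes > min_baskets].index)
    non_active = set(episodes.index) - active
    return active, non_active

def user_by_category(transactions: pd.DataFrame, categories: set) -> pd.DataFrame:
    """Binary user-by-category matrix, aggregating all baskets per customer."""
    kept = transactions[transactions["category"].isin(categories)]
    counts = kept.groupby(["user_id", "category"]).size().unstack(fill_value=0)
    return (counts > 0).astype(int)
```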

Another well-known characteristic of shopping habits is the occurrence of seasonal behaviour. Its effect is partly avoided by aggregating shopping baskets for each customer over the training period². The category selection and basket aggregation increase the density of non-zero entries in the data matrix of users by items (which is now a user-by-category matrix). The development of our recommender systems focused on the design data set comprising historical purchase records for the period from October 2005 to March 2006, with April 2006 reserved for out-of-sample testing. The selection of this period was determined by the timing of the intervention study, which was run live from May to July 2006. The design data matrix comprises approximately 30,000 users and 210 active categories. We used these data to build an item-to-item (i.e. category-to-category) personalized recommender system to be deployed on the www.LeShop.ch website.

² The data is aggregated only during training. The test data is not aggregated.

III. THE LIVE INTERVENTION STUDY

The aim of our live intervention was to set up a case-control experiment in order to test the hypothesis that, when making recommendations, personalization performs significantly better than providing the same set of recommendations to everyone. Therefore, we split the users into two different groups, the “Test” and “Control” groups. We randomly assigned half of the “Active” users (and, likewise, half of the “Non-Active” users) to the “Test” cell, and the remaining half formed the “Control” cell. We were thus able to perform a four-way comparative analysis of our system: “Active” vs. “Non-Active” and “Test” vs. “Control”. The models used to generate the recommendations provided to the customers in each cell are described next.

A. The Models

Initially we explored a variety of recommender algorithms, including user-based collaborative filtering [11], [13], item-based collaborative filtering [1], [5], maximum entropy models [7] and Markov decision processes [6]. However, regardless of the metric used, the evaluation strategies commonly used in the literature provided us with a different performance ranking of the recommender algorithms being considered. Therefore, we opted to use different criteria for model selection: instead of selecting the most accurate models, in each case we selected the model that was the simplest, easiest to implement and fastest to run online. As such, in line with our aim to prove that personalization works, we used the Naïve Bayes algorithm [14] in our test model to provide a simple form of personalization, recommending items tailored to the contents of the current basket. Our control model was chosen to be one that recommends the same (globally) most popular items (not currently in the basket) to all of the users. The algorithms behind the models may be defined as follows:

• The Control Model Algorithm: The Marginal Distribution
The algorithm we considered for the control model is the simplest method of estimating the probability of purchase of an item, namely the marginal distribution. Clearly this is not personalized, as the recommendations comprise simply the most popular items not present in the basket. For each item I_k, the probability of purchase is

    P(I_k) = \frac{1}{c} \sum_{i=1}^{c} \#(B_i \cap I_k) ,    (1)

where B is the set of baskets in the training data and c is the number of baskets in B.

• The Test Model Algorithm: Naïve Bayes
In order to incorporate the information contained in the items already in the basket at checkout, we can use Bayesian statistical techniques to compute the posterior probability P(I_k | B): the probability that item I_k will be purchased given that we also know what is in the basket at checkout, B. The simplest Bayesian statistical technique we can use for this is Naïve Bayes [14]. In this approach the conditional probability of the item I_k is computed as

    P(I_k \mid B) \propto P(I_k) \cdot P(B \mid I_k) = P(I_k) \cdot \prod_{i=1}^{s} P(I_i \mid I_k) ,    (2)

where I_1, ..., I_s are the items in B. P(I_k) is the prior probability of purchasing item I_k, P(I_i | I_k) is the likelihood of item I_i being bought given that item I_k was bought, and both are computed using the frequency of past purchases in the training data. This method personalizes the recommendations to the given shopping basket, which is the simplest form of personalization (cf. people who bought the items currently in the basket also bought ...).
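To make the two scoring rules concrete, here is a minimal Python sketch of Equations (1) and (2) over training baskets represented as sets of category identifiers. It is purely illustrative rather than the production system described in the next subsection (which was implemented in Java); the function names and the Laplace-style smoothing constant alpha are additions made here to keep the sketch self-contained and to avoid zero counts.

```python
# Illustrative sketch of the control and test scoring rules (Eqs. 1 and 2).
from collections import Counter
from itertools import combinations

def fit_counts(baskets):
    """Single and pairwise occurrence counts over the training baskets (sets)."""
    item_count, pair_count = Counter(), Counter()
    for basket in baskets:
        item_count.update(basket)
        for a, b in combinations(sorted(basket), 2):
            pair_count[(a, b)] += 1
            pair_count[(b, a)] += 1
    return item_count, pair_count, len(baskets)

def control_scores(item_count, n_baskets):
    """Control model: marginal purchase probability P(I_k), Eq. (1)."""
    return {item: count / n_baskets for item, count in item_count.items()}

def test_scores(basket, item_count, pair_count, n_baskets, alpha=1.0):
    """Test model: Naive Bayes posterior P(I_k | B) up to a constant, Eq. (2).
    `alpha` is an assumed smoothing constant, not a detail from the paper."""
    scores = {}
    for item, count in item_count.items():
        score = count / n_baskets                      # prior P(I_k)
        for evidence in basket:                        # product of P(I_i | I_k)
            score *= (pair_count[(evidence, item)] + alpha) / (count + alpha)
        scores[item] = score
    return scores

def recommend(scores, basket, k=3):
    """Top-k items not already in the basket."""
    candidates = {i: s for i, s in scores.items() if i not in basket}
    return sorted(candidates, key=candidates.get, reverse=True)[:k]
```

In this sketch, recommend(control_scores(...), basket) would yield the Control recommendations and recommend(test_scores(basket, ...), basket) the Test recommendations.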

B. Deploying the Recommender System

Our recommender system was deployed on the LeShop website (www.LeShop.ch), providing three item-level recommendations to all of their customers at checkout. In order to validate our evaluation process for retrospective data, as previously discussed, we deployed our recommender system via a case-control experimental design. The Test cell received recommendations via the Naïve Bayes algorithm, while the Control cell received the most popular items as their recommendations. Since both of our models are built on category-level data, they can only recommend categories. However, during the live intervention, the recommendations to customers were required at item-level. Therefore, during deployment, the most popular item (over the training data) in each of the recommended categories was presented to the customers. Our recommender system was implemented in Java and deployed as a JAR library, which was integrated into the LeShop website. The system was tested on a Pentium 4, 3.80 GHz Windows machine with 1 GB of RAM and was shown, on average, to provide the required three recommendations in 3-4 milliseconds. These recommendation speeds were matched during testing on the actual LeShop servers.

IV. THE BINARY HIT RATE

The performance measure we use for evaluating our recommender systems is the binary hit rate, defined as the rate at which customers accept at least one of our recommendations. The binary hit rate is equivalent to the session acceptance rate used by Thor et al. in [15] and similar to the approximation of recall used by Sarwar et al. in [16], where the number of recommendations is fixed. If we have m baskets in our test data, the binary hit rate h_b is given by

    h_b = \frac{1}{m} \sum_{i=1}^{m} \delta\left( \#(B_i \cap R_i) \right) ,    (3)

where B_i is the set of items in basket i, R_i is the set of recommendations provided to the owner of basket i, and

    \delta(x) = \begin{cases} 1, & x \ge 1 \\ 0, & \text{otherwise.} \end{cases}    (4)

A. The Comparison Metric

Given two different models, we may use the binary hit rate described above to compute the potential performance of each model. The next step is then to compare the models. For this purpose we use the Pearson Chi-Squared statistic [17].
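The binary hit rate of Equations (3) and (4) can be computed directly from logged baskets and recommendations. A minimal sketch, assuming baskets and the recommendations shown for them are available as parallel lists of sets (a representation chosen here, not specified in the paper):

```python
# Illustrative sketch of the binary hit rate h_b of Eq. (3).
def binary_hit_rate(baskets, recommendations):
    """`baskets` and `recommendations` are parallel lists of sets."""
    hits = sum(1 for basket, recs in zip(baskets, recommendations)
               if basket & recs)          # delta(#(B_i ∩ R_i)) of Eq. (4)
    return hits / len(baskets)
```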

TABLE I
THE BINARY HIT RATES CORRESPONDING TO THE LIVE INTERVENTION STUDY.

               May                      June                     July
         T      C      All        T      C      All        T      C      All
A      5.57%  3.46%  4.51%      7.57%  4.32%  5.95%      5.48%  4.36%  4.93%
nA     3.15%  2.70%  2.93%      4.80%  3.35%  4.09%      2.92%  3.05%  2.99%
All    4.41%  3.10%  3.76%      6.35%  3.89%  5.13%      4.31%  3.75%  4.04%

V. RESULTS FROM LIVE INTERVENTION

Evaluating a recommender system via a live intervention study is more straightforward than evaluation on retrospective data. Here we do not require an evaluation strategy, as the accepted recommendations are clearly defined by what the customers actually purchased. The binary hit rates from our live intervention study, which ran over the three months of May, June and July, are given in Table I. All results are presented as a four-way comparative analysis of “Active” vs. “Non-Active” and “Test” vs. “Control”. In order to condense the results tables and make them easy to read, we have used the following abbreviations: T: Test group, C: Control group, A: Active customers, nA: non-Active customers. In all cases, the Test vs. Control and Active vs. non-Active comparisons were significant (p ≤ 0.05).
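The significance statement above rests on the Pearson Chi-Squared comparison described in Section IV-A. A minimal sketch of such a comparison between two cells, using per-basket hit/miss counts, is given below; the counts in the example are hypothetical and the use of scipy is a choice made here for illustration.

```python
# Illustrative sketch of a Pearson chi-squared comparison [17] between two cells.
from scipy.stats import chi2_contingency

def compare_cells(hits_a, n_a, hits_b, n_b):
    """2x2 contingency test on 'accepted at least one recommendation' counts."""
    table = [[hits_a, n_a - hits_a],
             [hits_b, n_b - hits_b]]
    chi2, p_value, dof, expected = chi2_contingency(table)
    return chi2, p_value

# Hypothetical example: Test cell with 441 hits in 10000 baskets vs.
# Control cell with 310 hits in 10000 baskets.
# chi2, p = compare_cells(441, 10000, 310, 10000)
```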



VI. THE EVALUATION FRAMEWORK FOR RETROSPECTIVE DATA

When evaluating the potential performance of a recommender system on retrospective data, as previously mentioned, it is common practice to withhold some of the test data to be predicted given the remainder. However, the particular portion of data withheld as the target is never clearly defined, and in most cases is simply selected at random [3]. It is also common to use a leave-one-out approach [10]. Furthermore, we found experimentally that, regardless of the metric used, different evaluation strategies provided us with a different performance ranking of the recommender algorithms being considered. Therefore, we set out to identify a set of clearly defined and meaningful evaluation strategies that may be used for this purpose.

We began by defining a Bayesian framework in which the recommender system may be evaluated. In this framework, we assume that each test basket is divided into two segments: the evidence and the target. The evidence segment is assumed to consist of the contents of the basket at the time the recommender system is invoked. For example, since we make recommendations at checkout, the evidence segment is assumed to be the contents of the basket at checkout. The target segment is assumed to be the recommendations that the user would have accepted, had they been offered. In this framework, it is also assumed that, given the contents of a test basket, the recommender system assigns a probability of purchase to all of the available items not currently in the basket. The top few items (in our case the top three) are then considered to be the recommendations provided to the customer. Thus, the recommender system with the higher potential of success is expected to be the one able to recommend more of the target items. Given this definition of the analysis framework, it is clear that the composition of the target (and hence the evidence) segment will have a significant impact on the predicted performance of the recommender system. We therefore defined a two-step process for generating the evidence and target segments from a retrospective test basket: first, we arrange the items in each test basket according to a specified ordering; then we split each ordered test basket into two segments, the evidence and the target. The different ways in which the test baskets may be ordered and split are described next.

A. Ordering the Test Baskets

Having considered all the possible orderings of the test baskets and their implied assumptions about the nature of the purchasing process, we identified the following three as the most suitable for the evaluation process:

• Temporal Ordering: Here, the purchaser is assumed to follow a sequential process. We use the ordering generated in the real-life shopping scenario and place the items in the basket in the order in which they were actually bought.

• Random Ordering: Here, we simply place the items in the basket in a random order, as we assume that customers make purchasing decisions at random.

• Popularity Ordering: Here, we assume that at check-out the customers have already made purchase decisions on popular items, and so the evaluation needs to concentrate on the least popular ones. Thus, we use the training data to compute the global popularity of each item and use this popularity measure to place the items in each test basket in descending popularity order. As a result, the first item in this ordering is the most popular item in the basket, while the last item is the least popular.

B. Splitting the Test Baskets

Once an ordered test basket is generated, our next task is to split it into the evidence and the target segments. Here we define the segments in terms of the composition of the target segment. There are two options for this: the target segment may be defined as the last n items, or as the last n% of items, both with respect to the given order.

C. Restricting Items That May Be Recommended

When using the popularity ordering we assume that at check-out the customers have already made purchase decisions on popular items, and so the evaluation should focus on the least popular ones. Therefore, it is necessary to restrict the set of items from which recommendations can be made to those items that are (globally) less popular (as defined over the training data) than the least popular item in the evidence portion of the basket. This ensures that the models recommend the most likely of the less popular items, which is what the target set is expected to comprise.
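As an illustration, the following sketch combines the ordering, splitting and restriction steps of Sections VI-A to VI-C. It is a minimal Python sketch under naming assumptions made here (order_basket, split_basket, candidate_items), not the evaluation code used in the study.

```python
# Illustrative sketch of building one evaluation strategy from Section VI.
import random

def order_basket(basket_items, ordering, popularity=None):
    """basket_items: items in temporal purchase order; popularity: global counts."""
    if ordering == "temporal":
        return list(basket_items)
    if ordering == "random":
        return random.sample(list(basket_items), len(basket_items))
    if ordering == "popularity":            # most popular first, least popular last
        return sorted(basket_items, key=lambda i: popularity[i], reverse=True)
    raise ValueError(ordering)

def split_basket(ordered, last_n=None, last_fraction=None):
    """Target = last n items, or last n% of items, of the ordered basket."""
    n = last_n if last_n is not None else max(1, int(round(last_fraction * len(ordered))))
    return ordered[:-n], ordered[-n:]       # (evidence, target)

def candidate_items(all_items, evidence, popularity, restrict=False):
    """With the popularity restriction, only items less popular than the least
    popular evidence item may be recommended (Section VI-C)."""
    candidates = set(all_items) - set(evidence)
    if restrict:
        threshold = min(popularity[i] for i in evidence)
        candidates = {i for i in candidates if popularity[i] < threshold}
    return candidates
```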

VII. RESULTS FROM RETROSPECTIVE DATA

In this section we present the results generated from the experiments carried out on the retrospective data. Once again, all results are presented as a four-way comparative analysis of “Active” vs. “Non-Active” and “Test” vs. “Control”.

A. Leave-One-Out Analysis

In order to benchmark our evaluation process for recommender systems on retrospective data against the current literature, we also performed a leave-one-out analysis. The results of the leave-one-out analysis are given in Table II.

TABLE II
THE BINARY HIT RATES CORRESPONDING TO A LEAVE-ONE-OUT ANALYSIS CARRIED OUT ON THE RETROSPECTIVE DATA.

               May                      June                     July
         T      C      All        T      C      All        T      C      All
A      0.46%  2.23%  1.35%      0.69%  2.48%  1.58%      0.72%  2.35%  1.52%
nA     0.52%  1.90%  1.20%      0.78%  2.09%  1.42%      0.80%  1.91%  1.35%
All    0.49%  2.07%  1.28%      0.73%  2.31%  1.51%      0.76%  2.14%  1.44%

B. Analysis via Our Evaluation Strategies

In our framework for evaluating a recommender system on retrospective data, an evaluation strategy is defined as a combination of a method for ordering and a particular split of the test baskets. As described in Section VI-A, we considered three methods for ordering the test baskets: temporal, random and popularity. When splitting the test baskets, we varied the percentage of the target set, as described in Section VI-B, from 10% to 50% in intervals of 10%, and we also took the last 3 items of each basket as the target set, as the live system provided three recommendations. Thus, we considered a total of 18 potential evaluation strategies (3 orderings × 5 percentage splits + 3 orderings × last 3 items). For each strategy, we computed the binary hit rate as described in Section IV. The results from our analysis of six of these evaluation strategies (3 orderings × last 10% split and 3 orderings × last 3 items) on the retrospective data for the month of May are given in Tables III and IV. The results for the months of June and July, as well as for the other percentage splits, were similar and hence are not presented here.

TABLE III
THE BINARY HIT RATES CORRESPONDING TO THE THREE OPTIONS FOR ORDERING THE TEST BASKETS, WITH A TARGET SET COMPRISING THE LAST 10% OF THE TEST BASKETS, FOR THE MONTH OF MAY.

         Popularity Ordering         Temporal Ordering           Random Ordering
         T       C       All         T       C       All         T       C       All
A      19.75%  14.36%  17.06%      5.09%  17.38%  11.22%      7.03%  29.93%  18.45%
nA     16.29%  12.75%  14.55%      5.73%  14.48%  10.04%      6.63%  25.97%  16.16%
All    18.23%  13.66%  15.96%      5.38%  16.12%  10.70%      6.86%  28.21%  17.45%

TABLE IV
THE BINARY HIT RATES CORRESPONDING TO THE THREE OPTIONS FOR ORDERING THE TEST BASKETS, WITH A TARGET SET COMPRISING THE LAST 3 ITEMS OF THE TEST BASKETS, FOR THE MONTH OF MAY.

         Popularity Ordering         Temporal Ordering           Random Ordering
         T       C       All         T       C       All         T       C       All
A      21.01%  16.44%  18.71%      7.90%  20.89%  14.44%      9.72%  34.99%  22.44%
nA     20.50%  15.47%  18.00%      8.68%  19.08%  13.84%     10.34%  32.46%  21.33%
All    20.77%  16.00%  18.34%      8.26%  20.06%  14.16%     10.01%  33.83%  21.92%

VIII. COMPARATIVE ANALYSIS OF THE EVALUATION STRATEGIES

In this section we present a detailed comparative analysis of the different evaluation strategies that we tested. The aim of this analysis is to select the evaluation strategy that is most suitable for evaluating recommender systems via retrospective data. Given such an evaluation strategy, we can use it to perform algorithm and model selection via retrospective data, without the need to conduct a live experiment each time.

We also benchmark our results against the leave-one-out approach commonly used in the literature.

The first point that attracts our attention is the result of the leave-one-out approach, since it consistently provides an incorrect ranking of the Test and Control cells compared with the live results. It would therefore not be suitable for model selection via retrospective data for a check-out recommender system like ours. This is an important finding, as this evaluation strategy is widely used in the literature for model selection on retrospective data. Next, we see that, regardless of the splitting option used, the random and temporal orderings produce similar results, in terms of model ranking, to the leave-one-out approach, although with the random and temporal orderings the absolute values are much higher than both the live and the leave-one-out results. Furthermore, in some cases, all three of these approaches (i.e. leave-one-out and the random and temporal orderings) also provide an incorrect ranking of the Active and non-Active cells.

Interestingly, however, the popularity ordering approach provides the correct ranking (in comparison to the live results) of the Test and Control cells, as well as of the Active and non-Active cells. Furthermore, this result is consistent across all three months and across all splitting options. However, it does not hold unless the restriction referred to in Section VI-C is applied. Thus, it appears that the only hypothesis regarding the shopping process supported by the data is the one relating to the popularity ordering. As such, it appears that at check-out the customers have already made purchase decisions on popular items, and so the evaluation should focus on the less popular ones. The strategy used to evaluate the performance of recommender systems should therefore follow similar principles to those that apply to the live or planned system.

Analyzing the different splits, we discover that the results obtained with the last 3 items and the last 10% splits are almost equivalent.

Furthermore, the different percentage splits provide similar results to the last 10% split, with the only difference being in the absolute values (i.e. the rankings of the different cells remain the same, although the absolute values differ). The results also show that considering a lower percentage of the basket as the target improves the absolute values of the results. Thus, it appears that we should use a target set with a similar number of items to that in the live system.

In general, we see that (apart from the leave-one-out approach) all the results obtained with the different strategies overestimate the performance of the models as compared to the live results. This is expected, as the retrospective analysis is carried out at category-level, while in the live system we made recommendations at individual item-level. We are currently working on developing a category-to-item mapping that may be used in the retrospective analysis, in order to provide more accurate performance measures for the models. However, this is not a problem for our purposes, because for model selection an accurate model ranking is more important than accurate absolute performance measures.

IX. CONCLUSIONS AND FUTURE WORK

Evaluating the potential performance of a recommender system requires feedback from the user. However, we do not have access to such feedback during the development phase. Therefore, it is common practice to withhold some of the test data as the targets, to be predicted using the remainder as observations. However, the entire process of evaluating recommender systems on retrospective data is rarely validated via a live case-control experiment. In this report, we present the results from a live intervention study carried out on the website of the Swiss online supermarket LeShop (www.LeShop.ch). Here we designed and deployed a recommender system to all of their customers, providing each customer with three relevant recommendations at checkout, in order to increase the number of items purchased. The live intervention was designed as a case-control experiment. In our initial live experiment, model selection was based on simplicity, ease of implementation and speed, as each of the existing evaluation strategies for retrospective data provided a different ranking of the models that were considered. As such, we used a Naïve Bayes recommender for the Test cell and simply recommended the most popular items to the Control cell. The analysis of our live intervention shows that the Naïve Bayes recommender significantly outperforms the Control model. This result supports our hypothesis that, when making recommendations, even the simplest form of personalization significantly outperforms a non-personalized system.

Our work then focused on developing an evaluation strategy that may be used for model selection via retrospective data. We defined a number of possible strategies for evaluating the potential of a recommender system via retrospective data. These are formed by combining an ordering with a split of the test baskets, and in some cases restricting the items that may be recommended. We estimated the binary hit rate for each strategy and presented the results as a four-way comparative analysis of “Active” vs. “Non-Active” and “Test” vs. “Control”.

We used the Pearson Chi-Squared statistic [17] for the comparative analysis. We also used the results from our live intervention study on www.LeShop.ch to validate the different strategies and identify the most appropriate one, and benchmarked our evaluation strategies against the leave-one-out approach for comparison with the current literature. We found that the leave-one-out approach consistently provided an incorrect ranking of the different cells, as compared to the live results. The most appropriate evaluation strategy was to re-order the items in each test basket in descending popularity order and then define the last 3 items (or, equivalently, the last 10% of items) as the target, providing the remainder as evidence. This evaluation strategy consistently provides a four-way comparative analysis of “Active” vs. “Non-Active” and “Test” vs. “Control” that is comparable to the live results. As expected, the retrospective analysis carried out at category-level generally overestimated the performance of the models as compared to the live system, which provided item-level recommendations. However, this is not a problem, because for the purpose of model selection an accurate model ranking is more important than accurate absolute performance measures.

Two conclusions are drawn from this study. The first is that not all evaluation methods are equivalent, and several methodologies may not be sufficiently representative of online behaviour to usefully guide system development with retrospective data. The second is that performance evaluation from historical records should closely mirror the actual or planned operation of the recommender system. In particular, attention should be given to the principles that are expected to apply (e.g. recommendations at check-out are expected to focus on less popular items) and to the constraints that apply in the live system (e.g. at check-out, customers have already made purchase decisions on the more popular items). It is also important to avoid additional constraints that may be unwittingly present in some of the test baskets, even if this means reducing the test base. For example, if the recommender provides three recommendations, test baskets with a target set of fewer than three items should be excluded from the analysis.

Having identified a suitable evaluation strategy for use with retrospective data, our future work is taking us in three parallel directions. Firstly, we are experimenting with more sophisticated personalized recommender algorithms, in order to improve the performance of our recommender system. In parallel, we are seeking to further reduce the effects of seasonality by means of new modelling techniques. We are also working on introducing a category-to-item mapping that may be used in the retrospective analysis, in order to prevent overestimating the performance of the models. Future interventions will extend the case-control study to compare different recommender systems, again correlating their performance online with that of the same methods applied retrospectively to the same baskets up to the point where a recommendation was made.

ACKNOWLEDGEMENTS

We would like to thank Dominique Locher and his team at LeShop (www.LeShop.ch - The No. 1 e-grocer of Switzerland since 1998) for providing us with the data for this work, and for providing us with the opportunity to carry out the live intervention study on their website, without which none of the work presented in this paper would have been possible. We would also like to thank Michael Spence from Tessella Support Services plc (3 Vineyard Chambers, Abingdon, Oxfordshire, OX14 3PX, UK) for his help with implementing our online recommender system.

REFERENCES

[1] G. Linden, B. Smith, and J. York, “Amazon.com recommendations: item-to-item collaborative filtering,” IEEE Internet Computing, vol. 7, no. 1, pp. 76–80, 2003.
[2] M. Balabanovic and Y. Shoham, “Fab: Content-based, collaborative recommendation,” Communications of the ACM, vol. 40, pp. 66–72, 1997.
[3] C.-N. Hsu, H.-H. Chung, and H.-S. Huang, “Mining skewed and sparse transaction data for personalized shopping recommendation,” Machine Learning, vol. 57, pp. 35–59, 2004.
[4] R. Burke, “Hybrid recommender systems: Survey and experiments,” User Modeling and User-Adapted Interaction, vol. 12, pp. 331–370, January 2002.
[5] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Item-based collaborative filtering recommendation algorithms,” in 10th International World Wide Web Conference, May 2001, pp. 285–295.
[6] G. Shani, D. Heckerman, and R. I. Brafman, “An MDP-based recommender system,” J. Mach. Learn. Res., vol. 6, pp. 1265–1295, 2005.
[7] X. Jin, Y. Zhou, and B. Mobasher, “A maximum entropy web recommendation system: combining collaborative and content features,” in KDD ’05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. New York, NY, USA: ACM Press, 2005, pp. 612–617.
[8] B. N. Miller, I. Albert, S. K. Lam, J. A. Konstan, and J. Riedl, “MovieLens unplugged: experiences with an occasionally connected recommender system,” in The 8th International Conference on Intelligent User Interfaces, 2003, pp. 263–266.
[9] M. Pazzani and D. Billsus, “Learning and revising user profiles: The identification of interesting web sites,” Machine Learning, vol. 27, pp. 313–331, 1997.
[10] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl, “Evaluating collaborative filtering recommender systems,” ACM Transactions on Information Systems (TOIS), vol. 22, no. 1, pp. 5–53, 2004.
[11] J. S. Breese, D. Heckerman, and C. Kadie, “Empirical analysis of predictive algorithms for collaborative filtering,” in Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI-98), G. F. Cooper and S. Moral, Eds. San Francisco: Morgan Kaufmann, 1998, pp. 43–52.
[12] K. H. L. Tso and L. Schmidt-Thieme, “Evaluation of attribute-aware recommender system algorithms on data with varying characteristics,” in 10th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, Lecture Notes in Computer Science 3918, Springer, 2006, pp. 831–840.
[13] J. L. Herlocker, J. Konstan, A. Borchers, and J. Riedl, “An algorithmic framework for performing collaborative filtering,” in ACM Conference on Research and Development in Information Retrieval (SIGIR ’99), 1999, pp. 230–237.
[14] I. Rish, “An empirical study of the naive Bayes classifier,” in IJCAI Workshop on Empirical Methods in Artificial Intelligence, 2001.
[15] A. Thor, N. Golovin, and E. Rahm, “Adaptive website recommendations with AWESOME,” The VLDB Journal, vol. 14, no. 4, pp. 357–372, 2005.
[16] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl, “Analysis of recommendation algorithms for e-commerce,” in 2nd ACM Conference on Electronic Commerce (EC’00). New York: ACM Press, 2000, pp. 285–295.
[17] A. Agresti, An Introduction to Categorical Data Analysis, ser. Wiley Series in Probability and Statistics. John Wiley & Sons, 1996.