INTIMATE: A Web-Based Movie Recommender Using Text Categorization

Harry Mak, Irena Koprinska and Josiah Poon
School of Information Technologies, University of Sydney, Australia
{hhmak, irena, josiah}@it.usyd.edu.au

Abstract

This paper presents INTIMATE, a web-based movie recommender that makes suggestions by using text categorization (TC) to learn from movie synopses. The performance of various feature representations, feature selectors, feature weighting mechanisms and classifiers is evaluated and discussed. INTIMATE was also compared with a feature-based movie recommender. One key finding of this study is a heuristic that indicates when one recommender performs better than the other. The results show that the text-based approach outperforms the feature-based approach when the ratio of the number of user ratings to the vocabulary size is high.

1. Introduction

Companies, governments and even individuals spend huge amounts of money and energy creating and presenting information on their web pages; everyone has become a knowledge producer for his/her fellow netizens. Some of this information is valuable to strangers across the globe because it contains useful advice that matters before a decision is made, e.g. buying a car, going to a movie, or choosing a piece of software. However, this influx of readily available and easily accessible information creates the problem of information overload. Hence, there is a need for intermediaries that alleviate the increasing demands on a user's time and cognitive load. Recommendation systems are one such tool.

Recommendation systems suggest items of interest (such as books, movies, CDs, news, pictures, etc.) by using statistical and machine learning techniques to learn from examples of a user's likes and dislikes. There are two main approaches: collaborative filtering and content-based recommendation. The majority of existing systems are based on collaborative filtering. They keep a database of user preferences, find a group of users with preferences similar to those of the target user (based on statistical correlations of the ratings) and suggest items the group members liked, e.g. Movielens [10], DVD Express [5], and Amazon [1]. Collaborative filtering has two main disadvantages. Firstly, the system may not be able to find users with preferences similar to those of the target user. Secondly, there may not be enough user ratings to make reliable recommendations.

Content-based recommendation systems overcome the limitations of collaborative filtering by making suggestions based on the content of the items and the target user's ratings. Two content-based approaches have been proposed: feature-based and text categorization-based. Feature-based recommendation systems [3,12] extract important features from the item descriptions (e.g. genre or leading actor/actress in a movie recommender) and learn a user profile (classifier) from a set of feature vectors pre-classified according to the user's ratings. However, choosing representative features and encoding them appropriately is not an easy task. Text categorization systems learn from thousands of features (words or phrases), and recent research has shown that it is possible to build effective classifiers from such representations [13]. Several systems using text categorization (TC) have been developed; they have been applied to recommend web pages [12], news documents [3] and books [11].

In this paper, we extend previous research in the following ways. Firstly, we apply learning for text categorization to the domain of movie recommendation. In addition to the standard bag-of-words representation, we explore semantically richer representations such as nouns and noun phrases. We also evaluate the performance of different feature representations, feature selectors, feature weighting techniques and classification algorithms. Finally, we compare the text categorization-based and feature-based approaches.

2. INTIMATE

INTIMATE is an INTelligent Internet Movie Adviser using TExt categorization. It is an extension of the IMR system developed by eliteAI [8]. Before a recommendation can be made, the user is asked to rate a minimum number of movies using one of six categories: terrible, bad, below average, above average, good and excellent. The movies come from the Internet Movie Database [7]. The user profile is updated by storing each rating together with a representation of the movie synopsis. The text of each synopsis is pre-processed prior to machine learning; details of the pre-processing are given in §3.2. The selected terms are then assigned weights corresponding to their importance in the movie synopsis. Together with the rating, the vector of weighted terms forms a training example for an inductive machine learning algorithm. When the user requests a recommendation, the machine learning algorithm learns from the set of pre-classified examples (in the form of rated movie synopses). After reviewing the recommendations, the user can correct the ratings of incorrectly predicted movies and retrain the system to obtain a new suggestion.
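To make this pipeline concrete, the sketch below shows how one rated synopsis could be turned into a training example. It is a minimal illustration only: the function names (tokenize, build_example) and the raw-count weighting are assumptions of this sketch, not part of INTIMATE, which uses the binary and tfidf schemes described in §3.2.

```python
from collections import Counter

RATINGS = ["terrible", "bad", "below average", "above average", "good", "excellent"]

def tokenize(text):
    """Lower-case the synopsis and split it into word tokens."""
    return [w.strip(".,;:!?\"'()").lower()
            for w in text.split() if w.strip(".,;:!?\"'()")]

def build_example(synopsis, rating, vocabulary):
    """Turn one rated synopsis into a (feature vector, label) pair.

    The vector holds one weight per vocabulary term; a raw term count
    stands in here for the binary/tfidf weighting described in §3.2.
    """
    counts = Counter(tokenize(synopsis))
    features = [counts[term] for term in vocabulary]
    label = RATINGS.index(rating)
    return features, label

# Usage: the vocabulary would be built from all training synopses.
vocab = ["alien", "romance", "detective"]
x, y = build_example("A detective hunts an alien in the city.", "good", vocab)
print(x, y)  # [1, 0, 1] 4
```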

3. Experimental Setup

3.1 Corpora Used for Evaluation

To evaluate the performance of INTIMATE, we used 1628 movies stored in the IMDB on the web. Ten users who rated between 148 and 405 movies were selected from the EachMovie database [6]. The corpus for each user consists of the movies rated by that user. Most of the current research in TC uses the Reuters newswire stories as a benchmark data set. There are, however, important differences between the style of a news article and that of a movie synopsis. Reuters news stories contain a lot of factual material; the writing style tends to be formal and direct and uses a restricted vocabulary to aid quick understanding [14]. Movie synopses, on the other hand, are written in a more casual and indirect style. The categories in movie classification (i.e. the user's ratings) are also much noisier than the target topics in the Reuters data. Thus, movie synopses should be harder to categorize than Reuters documents. One of our goals was to see how well TC results translate to the classification of movie synopses.

3.2 Pre-processing of the Corpora

The first step in constructing a movie recommendation system based on TC is to transform the movie synopses into a representation suitable for the learning algorithms. This involves representing each document as a feature vector: choosing a feature representation, selecting features and assigning them weights. We have investigated three approaches to feature representation (bag of words, bag of nouns and noun phrases), three techniques for feature selection (document frequency, information gain and mutual information) and two for feature weighting (binary and tfidf).

Stemming and Stop-Words Removal. All unique words in the training corpora were first identified, and then stemming and stop-word removal were applied. Stemming is the removal of suffixes from words to generate word stems; it maps several morphological forms of one stem to a common feature. We used Lovins' stemming algorithm [9]. Stop-words are frequent words that are assumed to carry no information value (e.g. prepositions, conjunctions, etc.). Stemming and stop-word removal significantly reduce the number of unique words and are computationally cheap.

Feature Representation. After identifying all unique words in the training corpora (and applying stemming and stop-word removal as described above), each movie synopsis is represented by a vector that contains a weight for every: a) word, b) noun, or c) noun phrase. Each term is weighted according to its importance in the synopsis. Bag of words is the simplest and most frequently used feature representation in TC. The use of nouns is motivated by results showing that nouns are the most important discriminators for information retrieval compared to the other parts of speech [4]. Noun phrases use sequences of words. A phrase (an n-gram) not only preserves the word order but also maintains the syntactic relationships. Recent research has shown that n-grams of length up to three can improve performance over unigrams [7, 10]. Our goal was to investigate whether noun phrases can improve the performance.

Feature Selection. Removing the less informative and noisy terms reduces the computational cost and improves the classification performance. All corpus features are ranked according to a feature selection criterion, and those with a value higher than a threshold are selected. We used three ranking criteria: document frequency, information gain and mutual information. Document frequency (DF) is the simplest method for feature reduction; it is highly scalable as it has linear complexity. Information gain (IG) measures the number of bits of information gained for category prediction by knowing the presence or absence of a feature in a document [2]; IG is one of the most successful feature selection techniques used in TC [14]. Mutual information (MI) is often used in statistical natural language processing to compute associations between two words.

Feature Weighting. The two most popular feature weighting mechanisms, binary and term frequency-inverse document frequency (tfidf), have been integrated into INTIMATE. In the first method, weights are either 0 or 1, denoting the absence or presence of a term in the movie synopsis. The tfidf function [13] assigns higher weights to features that appear frequently in a document (i.e. treats them as representative of the document content), but balances this by reducing the weight of features that appear in many documents (i.e. are less discriminating).

Machine Learning Algorithms. Three well-known machine learning techniques are used to build a classifier for each user: k-nearest neighbour, decision trees and Naïve Bayes. We have incorporated the WEKA [16] implementations into INTIMATE.
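To illustrate the selection and weighting steps above, here is a minimal sketch of DF-based feature selection followed by tfidf weighting. The tf * log(N/df) form is one common tfidf variant; the paper does not specify which variant INTIMATE uses, so treat that choice as an assumption of this sketch.

```python
import math
from collections import Counter

def document_frequency(docs):
    """Count, for each term, how many documents it appears in."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df

def select_by_df(docs, threshold=2):
    """DF feature selection: keep terms appearing in >= threshold documents."""
    df = document_frequency(docs)
    return sorted(t for t, n in df.items() if n >= threshold)

def tfidf_vector(doc, vocabulary, df, n_docs):
    """Weight each vocabulary term by tf * log(N / df) -- an assumed
    standard tfidf variant, not necessarily the one used in the paper."""
    tf = Counter(doc)
    return [tf[t] * math.log(n_docs / df[t]) if df[t] else 0.0
            for t in vocabulary]

# Usage on three toy "synopses" (already stemmed and stop-worded):
docs = [["alien", "ship", "crew"], ["crew", "mutiny"], ["alien", "crew", "planet"]]
vocab = select_by_df(docs, threshold=2)        # ['alien', 'crew']
df = document_frequency(docs)
print(tfidf_vector(docs[0], vocab, df, len(docs)))
```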

3.3 Evaluation Metrics

To evaluate the performance of INTIMATE, we used 10-fold stratified cross-validation for each user corpus and examined four performance measures: accuracy, recall, precision and the F1 measure. Although accuracy is the most common measure in machine learning, it may be misleading when used alone, because each class typically has a very small number of positive examples and a large number of negative examples. Hence, recall and precision are the most popular performance measures in TC. To obtain an overall measure of effectiveness, they are typically combined using the F1 measure, which gives equal weight to recall and precision: F1 = (2 * recall * precision) / (recall + precision). F1 values were calculated for each user and category, and the results were averaged across the categories (micro-averaging).
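As a concrete reference, the sketch below computes a micro-averaged F1 under the usual definition, in which true-positive, false-positive and false-negative counts are pooled across categories before precision and recall are computed; the paper does not spell out its pooling details, so this is an assumption.

```python
def micro_f1(y_true, y_pred, categories):
    """Micro-averaged F1: pool per-category TP/FP/FN, then combine."""
    tp = fp = fn = 0
    for c in categories:
        tp += sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp += sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn += sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Usage with the six rating categories (0 = terrible .. 5 = excellent):
print(micro_f1([4, 5, 1, 4], [4, 4, 1, 5], range(6)))  # 0.5
```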

4. Results and Discussion

4.1 Comparison of Different Pre-processing Steps

Feature Representation. The classification results are very similar across the different feature representations (i.e. bag of words, nouns and noun phrases), with nouns-only having the smallest variance. The vocabulary size of the bag of words was 5291, of nouns 3258, and of noun phrases 6297. Thus, the nouns representation is the superior one, as it offers high classification performance with a vocabulary that is 38% smaller than the bag of words and 48% smaller than the noun phrases. Our results are consistent with [3], who found that in the domain of newspaper retrieval a noun representation reduced the average recall and precision by less than 1% compared to the bag-of-words representation, while allowing significant savings in disk space and computational time, as nouns made up 40% of all words.

Feature Selectors. The overall performance using DF, IG and MI is very similar, and only slightly better (1-4%) than without feature selection. Among the feature selectors studied, DF is the best choice, as it is computationally very efficient and much simpler than information gain and mutual information. Yang and Pedersen [15] also showed that DF performed similarly to IG on the Reuters data. However, they found that MI performed poorly compared to the other two selectors and attributed this to MI's bias towards rare terms and its sensitivity to probability estimation errors.

Classifier Performance. The overall performance of k-Nearest Neighbor, Decision Trees and Naïve Bayes is very similar, at around 60-65%; the performance of the algorithms varies within a range of 4-7%. Decision Trees are the best in terms of accuracy and F1 measure, followed by k-Nearest Neighbor and Naïve Bayes. Classification time is also an important factor to consider. Being a lazy learner, k-Nearest Neighbor does not keep a classifier structure in memory; as a result, it takes more time at classification than Decision Trees and Naïve Bayes. In summary, there is little evidence that one classifier is significantly superior to the others, although decision trees are a marginal favorite for the categorization of movie synopses.
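For illustration, the comparison in this section could be reproduced along the following lines. The paper uses the WEKA [16] (Java) implementations; the scikit-learn stand-ins below (KNeighborsClassifier, DecisionTreeClassifier, MultinomialNB) are an assumption of this sketch, not the authors' setup.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB

def compare_classifiers(X, y):
    """10-fold stratified CV accuracy for the three classifiers in §3.2.

    X is a document-term matrix (binary or tfidf weights), y the user's
    ratings. scikit-learn stands in here purely for illustration; the
    paper used the WEKA implementations.
    """
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    models = {
        "k-NN": KNeighborsClassifier(n_neighbors=5),
        "Decision Tree": DecisionTreeClassifier(random_state=1),
        "Naive Bayes": MultinomialNB(),  # suited to non-negative term weights
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")

# Usage: compare_classifiers(X_tfidf, ratings)
```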

4.2 TC vs. Feature-Based Approach

We compared our text-based movie recommendations with the feature-based recommendations produced by IMR [8]. IMR uses the following features: genre (comedy, drama, fiction, action, documentary, etc.), Maltin's rating, classification (G, PG, M, MA, R, or not classified), director, leading actor and actress, awards won, awards nominated, and country of origin. The accuracy for each of the 10 users is summarized in Figure 1. The results for the TC approach are the more reliable, as they are averaged over all 10-fold cross-validation runs for each user using the various feature representations, selection and weighting schemes and classifiers. The results for the feature-based approach are averaged across three 10-fold cross-validation runs using the different classifiers.

[Figure 1: bar chart of average accuracy (%) per user_id for the feature-based and text categorization approaches, over users 1404, 1885, 10, 2109, 130, 1812, 1624, 139, 1935 and 1797.]

Figure 1. Accuracy results of the feature-based and TC approaches for each user

The feature-based approach appears to be more accurate than the TC approach for seven of the ten users, by 2-10%. However, for users 10, 130 and 139, the TC approach significantly outperforms the feature-based one, reaching accuracies of 90% and above. The TC approach thus performs very well under particular conditions, while the feature-based approach is better for most of the users, but only by a small amount. We investigated this further in order to identify the conditions under which TC performs best.

4.3 Effect of the Ratings-to-Vocabulary Ratio on the Performance of the Text-Based Approach

An RV ratio is calculated for each user: the ratio of the number of Ratings to the raw Vocabulary size. We found no necessary correlation between the number of movies rated and the vocabulary size. The RV ratios of the 10 users are shown in Figure 2. The solid vertical line divides the graph into two parts: low RV ratio and high RV ratio. As can be seen, the TC approach significantly outperforms the feature-based approach for corpora with a high RV ratio.

[Figure 2: two panels of normalised values per user_id -- "User Profiles with a High Ratings to Vocabulary Ratio" and "User Profiles with a Low Ratings to Vocabulary Ratio" -- plotting the normalised ratings-to-vocabulary ratio against the average accuracy of all results.]

Figure 2. Average accuracies with user profiles divided into high and low according to RV ratio.

To verify this finding, the RV ratio was computed for 30 other randomly selected users. From this sample, the three users with the lowest RV ratios and the three users with the highest RV ratios were selected, and classifiers were trained for each of the six user profiles. The results confirm that the TC approach performs significantly better (in terms of accuracy, recall, precision and F1 measure) than the feature-based approach for users with a high RV ratio. Hence, the RV value can be used to predict the performance of a text-based recommender before the profile is built (i.e. before a classifier is trained). A closer examination of the results reveals why TC performs better with a high RV ratio. The RV ratio reflects the quality of the training set. A high RV value corresponds to a large number of ratings over a small vocabulary. The small vocabulary arises because some movie synopses are prepared by amateurs and contain fewer distinctive words, i.e. the style is more focused and concise; such documents resemble the news articles in Reuters. Consequently, learning from these examples is more accurate and reliable. A low RV value, on the other hand, means the training set has very few examples and a high number of features; learning in this scenario typically yields a less accurate and less reliable outcome.
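A minimal sketch of the resulting heuristic is given below. The paper reports only that a high RV ratio favors TC; the numeric cutoff used here is hypothetical, chosen solely to make the sketch runnable.

```python
def rv_ratio(num_ratings, vocabulary_size):
    """Ratings-to-Vocabulary ratio for one user profile."""
    return num_ratings / vocabulary_size

def recommend_approach(num_ratings, vocabulary_size, cutoff=0.1):
    """Pick a recommender by the RV heuristic.

    The paper gives no numeric threshold; cutoff=0.1 is a hypothetical
    value used only to make this sketch executable.
    """
    if rv_ratio(num_ratings, vocabulary_size) >= cutoff:
        return "text categorization"   # high RV: TC outperforms
    return "feature-based"             # low RV: feature-based is safer

# Usage: a user with 300 ratings over a 2500-word raw vocabulary.
print(recommend_approach(300, 2500))   # text categorization (RV = 0.12)
```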

5. Conclusion

In this paper we presented INTIMATE, an intelligent web-based movie recommendation adviser based on TC of movie synopses. In contrast to collaborative filtering, text-based recommendation systems are able to recommend items that are new or have not been rated by enough users, and to make recommendations to users with unique interests. They also overcome the limitations of feature-based recommenders by being able to learn from thousands of features. The experimental results confirm the validity of a number of existing TC techniques in the domain of movie synopses. The TC-based recommendation system was shown to perform very well under particular conditions, while the feature-based approach was better for most of the users, but only by a small amount. In particular, we found that the text categorization approach significantly outperformed the feature-based one for corpora with a high RV ratio. This is a simple and computationally inexpensive heuristic for deciding whether the more costly but more accurate TC approach should be attempted; it not only saves a lot of time, but also yields high-quality recommendations suited to the synopses' characteristics. Future work includes a comparison with collaborative filtering and the development of a hybrid approach that combines the strengths of the feature-based, text-based and collaborative filtering approaches. We will also investigate the appropriateness of the heuristic in other kinds of domains. Another interesting extension is the integration of incremental learning algorithms.

6. References

[1] Amazon, http://www.amazon.com
[2] K. Aas, L. Eikvil, "Text Categorization: A Survey", TR, Norwegian Computing Center, June 1999.
[3] D. Billsus, M. Pazzani, "A Personal News Agent that Talks, Learns and Explains", Third Intern. Conf. on Autonomous Agents (Agents '99), Seattle, Washington, 1999.
[4] A. Chowdhury, M.C. McCabe, "Improving Information Retrieval Systems using Parts of Speech Tagging", TR-98-48, Institute for Systems Research, Univ. of Maryland, 1993.
[5] DVD Express, http://www.dvdexpress.com
[6] EachMovie, Compaq Systems Research Labs, 1998, http://www.research.compaq.com/SRC/eachmovie/
[7] Internet Movie Database, http://www.imdb.com
[8] IMR by eliteAI, http://www.it.usyd.edu.au/~eliteai
[9] J. Lovins, "Development of a Stemming Algorithm", Mechanical Translation and Comp. Linguistics, 11, pp. 22-31.
[10] Movielens, http://movielens.umn.edu/
[11] R. J. Mooney, L. Roy, "Content-Based Book Recommending Using Learning for Text Categorization", Fifth ACM Conf. on Digital Libraries, 2000.
[12] M. Pazzani, J. Muramatsu, D. Billsus, "Syskill & Webert: Identifying Interesting Web Sites", AAAI-96, pp. 54-61, 1996.
[13] F. Sebastiani, "Machine Learning in Automated Text Categorization", ACM Computing Surveys, 34(1), pp. 1-47, 2002.
[14] S. Scott, S. Matwin, "Feature Engineering for Text Classification", 16th ICML, San Francisco, USA, 1999.
[15] Y. Yang, J. Pedersen, "A Comparative Study on Feature Selection in Text Categorization", ICML, 1997.
[16] WEKA 3, http://www.cs.waikato.ac.nz/ml/weka/