
A Semantic Content-based Recommender System for Cross-Language Personalization

Pasquale Lops, Cataldo Musto, Fedelucio Narducci,
Marco De Gemmis, Pierpaolo Basile, Giovanni Semeraro

Dept. of Computer Science, University of Bari, Italy
[email protected]

ABSTRACT

The exponential growth of the Web is the most influential factor contributing to the increasing importance of cross-lingual text retrieval and filtering systems. Indeed, relevant information exists in different languages, so users need to find documents in languages different from the one the query is formulated in. In this context, an emerging requirement is to sift through the increasing flood of multilingual text: this poses a renewed challenge for designing effective multilingual Information Filtering systems.

Content-based filtering systems adapt their behavior to individual users by learning their preferences from documents that were already deemed relevant. The learning process aims to construct a profile of the user that can later be exploited in selecting/recommending relevant items. User profiles are generally represented using keywords in a specific language. For example, if a user likes movies whose plots are written in Italian, a content-based filtering algorithm will learn a profile for that user containing Italian words, thus failing to recommend movies whose plots are written in English, although they might be definitely interesting. Moreover, keywords suffer from typical Information Retrieval-related problems such as polysemy and synonymy.

In this paper, we propose a language-independent content-based recommender system, called MARS (MultilAnguage Recommender System), that builds cross-language user profiles by shifting the traditional text representation based on keywords to a more complex language-independent representation based on word meanings. The proposed strategy relies on a knowledge-based word sense disambiguation technique that exploits MultiWordNet as sense inventory. As a consequence, content-based user profiles become language-independent and can be exploited for recommending items represented in a language different from the one used in the content-based user profile.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00.

Experiments conducted in a movie recommendation scenario show the effectiveness of the approach.

Categories and Subject Descriptors

H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Dictionaries, Indexing methods, Linguistic processing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Information Filtering

General Terms

Algorithms, Experimentation

Keywords

Cross-language Recommender System, Content-based Recommender System, Word Sense Disambiguation, MultiWordNet

1. INTRODUCTION

Nowadays the amount of information we have to deal with usually exceeds the amount we can process effectively. For this reason, user modeling and personalized information access are becoming essential to present only (or first) the information that appears relevant or somehow related to the information need of the target user. Information Filtering (IF) systems are rapidly emerging in this context, since they help to carry out this task effectively. These systems adapt their behavior to individual users by learning their preferences and storing them in a user profile. Filtering algorithms, exploiting the information stored in user profiles, perform a progressive removal of non-relevant content according to information about user interests, preferences or specific needs.

Specifically, the content-based filtering approach [1] analyzes a set of documents (textual descriptions of items previously rated as relevant by an individual user) and builds a model or profile of user interests based on the features (usually keywords) that describe the target objects. The profile is then exploited to recommend new relevant items. If the profile accurately reflects user preferences, the information access process can be effective, since the profile

could be used to filter search results, by deciding whether a user is interested in a specific item/document or not and, in the negative case, preventing it from being displayed.

On the other hand, these approaches have to deal with at least two kinds of problems. First, traditional keyword-based profiles are unable to capture the semantics of user interests, because they are primarily driven by string matching operations. If a string, or some morphological variant of it, is found in both the profile and the document, a match is made and the document is considered relevant. However, string matching suffers from problems of polysemy, the presence of multiple meanings for one word, and synonymy, multiple words with the same meaning. The result is that, due to synonymy, relevant information can be missed if the profile does not contain the exact keywords occurring in the documents, while, due to polysemy, wrong documents can be deemed relevant.

Another relevant problem related to string matching approaches is the strict connection with the user's language: an English user, for example, frequently interacts with information written in English, so her (keyword-based) profile of interests mainly contains English terms. In order to receive suggestions of items whose textual description is in a different language, she must explicitly express her preferences on items in that specific language as well. This means that the information already stored in the user profile cannot be exploited to provide suggestions for items whose description is provided in other languages, even though they may share common features (e.g. an Italian and an English movie sharing the same features, but with plots written in two different languages). Semantic analysis and its integration in personalization models is one of the most innovative and interesting approaches to tackle these problems.
More accurate user profiles might be represented using concepts expressing user interests, and those concepts might be represented using keywords in different languages. Concepts remain the same across different languages, while the terms used for describing them in each specific language may change. In order to construct language-independent profiles, a Word Sense Disambiguation (WSD) algorithm is needed for assigning the right meaning to (polysemous) words. Furthermore, a linguistic resource able to map concepts across different languages is also needed. The main idea presented in this paper is to exploit MultiWordNet [2] as a bridge between different languages. MultiWordNet associates a unique identifier to each possible sense (meaning) of a word, regardless of the original language. In this way we can build user profiles based on MultiWordNet senses and exploit them in order to provide cross-language recommendations.

The paper is organized as follows. Section 2 analyzes related work in the area of cross-language filtering and retrieval. Section 3 presents the architecture of the system proposed in the paper, while Sections 4 and 5 describe the process of building language-independent documents and profiles. Experiments carried out in a movie recommendation scenario are described in Section 6. Conclusions and future work are drawn in the last section.
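The intuition can be sketched with a toy example (all synset identifiers and the word-to-synset mapping below are invented for illustration): two keyword profiles in different languages share no terms, while their synset-based counterparts coincide.

```python
# Toy illustration: keyword-based profiles in two languages share no terms,
# while synset-based profiles overlap. The synset IDs and the mapping are
# invented for the example; a real system would query MultiWordNet.
profile_en_kw = {"world", "mankind", "dog"}
profile_it_kw = {"mondo", "umanita", "cane"}

# Hypothetical MultiWordNet-style mapping: word -> language-independent synset ID
to_synset = {
    "world": "n5477412", "mondo": "n5477412",
    "mankind": "n5477412", "umanita": "n5477412",
    "dog": "n123456", "cane": "n123456",
}

profile_en_bos = {to_synset[w] for w in profile_en_kw}
profile_it_bos = {to_synset[w] for w in profile_it_kw}

print(profile_en_kw & profile_it_kw)    # no shared keywords: set()
print(profile_en_bos & profile_it_bos)  # shared synsets bridge the languages
```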

2. RELATED WORK

To the best of our knowledge, the topic of Cross-Language and Multilanguage Information Filtering has not yet been properly investigated in the literature.

An attempt to define an effective multilingual information filtering system is proposed in [3]. The system is based on fuzzy set theory. More specifically, the semantic content of multilingual documents is represented using a set of universal content-based topic profiles, encapsulating all feature variations among multiple languages. Using the co-occurrence statistics of a set of multilingual terms extracted from a parallel corpus (a collection of documents containing identical text written in multiple languages), fuzzy clustering is applied to group semantically-related multilingual terms into topic profiles. The basic intuition is that translated versions of the same text are linguistic variants of the same topic(s); hence, multilingual terms occurring within corresponding translated documents are semantically related to the same topic(s). The main disadvantage of this approach, with respect to ours, is the need for a large parallel corpus.

Recently, the Multilingual Information Filtering task at CLEF 2009 (http://www.clef-campaign.org/2009.html) has introduced the issues related to cross-language representation in the area of Information Filtering. Damankesh et al. [4] propose the application of the theory of Human Plausible Reasoning (HPR) in the domain of filtering and cross-language information retrieval. The system utilizes plausible inferences to infer new, unknown knowledge from existing knowledge, in order to retrieve not only documents which are indexed by the query terms but also those which are plausibly relevant.

The state of the art in the area of cross-language Information Retrieval is undoubtedly richer, and can certainly help in designing effective cross-language Information Filtering systems. Oard [5] gives a good overview of the approaches to cross-language retrieval. Moreover, Salton [6] showed that cross-language retrieval was nearly as effective as mono-lingual retrieval, when carefully constructed thesauri are used.
However, such a study is hardly realistic by current standards, since manually indexing large datasets is impracticable. The approach proposed in [7] exploits Latent Semantic Indexing for cross-language retrieval. The use of Singular Value Decomposition was investigated by Furnas et al. [8], who created a multidimensional indexing space for a parallel corpus of English documents and their French translations. Ballesteros et al. [9] underlined the importance of phrasal translation in cross-language retrieval and explored the role of phrases in query expansion via local context analysis and local feedback. Soergel [10] focused his attention on the use of thesauri and ontologies as knowledge bases in cross-language retrieval. In his work he provided a general introduction to thesaurus functions, structure, and construction, with particular attention to the problems of multi-lingual thesauri. He proposed the creation of environments for distributed collaborative knowledge base development as a way to make large-scale knowledge-based systems more affordable.

The most recent approaches to Cross-Language Retrieval mainly rely on the use of large corpora like Wikipedia. Potthast et al. [11] introduce CL-ESA, a new multilingual retrieval model for the analysis of cross-language similarity. The approach is based on Explicit Semantic Analysis (ESA) [12], extending the original model to cross-lingual retrieval settings. Furthermore, Juffinger et al. [13] recently presented the cross-language retrieval system developed for the Robust

WSD Task at CLEF 2008 (http://www.clef-campaign.org/2008.html).

Finally, Gonzalo et al. [14] discuss ways in which EuroWordNet (EWN) [15] can be used in multilingual information retrieval activities, focusing on two approaches to Cross-Language Text Retrieval that exploit the EWN database as a large-scale multilingual semantic resource. The first approach indexes documents and queries in terms of the EuroWordNet Inter-Lingual-Index, thus turning term weighting and query/document matching into language-independent tasks. The second approach describes how the information in the EWN database can be integrated with a corpus-based technique, thus allowing retrieval of domain-specific terms that may not be present in the multilingual database.

3. GENERAL ARCHITECTURE OF MARS

MARS (MultilAnguage Recommender System) is a system capable of generating recommendations, provided that descriptions of items are available in textual form. Item properties are represented in the form of textual slots. For example, a movie can be described by slots title, genre, actors, summary. Figure 1 depicts the main components of the MARS general architecture: the Content Analyzer, the Profile Learner, and the Recommender.

Figure 1: General architecture of MARS

In this work, the Content Analyzer is the main module involved in designing a language-independent content-based recommender system. It introduces semantics into the recommendation process by analyzing documents in order to identify relevant concepts representing their content. This process selects, among all the possible meanings (senses) of each (polysemous) word, the correct one according to the context in which the word occurs. In this way, documents are represented using concepts instead of keywords, in an attempt to overcome the problems caused by natural language ambiguity and by the diversity of languages. The final outcome of the pre-processing step is a repository of disambiguated documents. This semantic indexing is strongly based on natural language processing techniques, such as Word Sense


Disambiguation [16], and heavily relies on linguistic knowledge stored in lexical ontologies. In this work, the Content Analyzer relies on the MultiWordNet lexical ontology [2].

The generation of the cross-language user profile is performed by the Profile Learner, which infers the profile as a binary text classifier. The Profile Learner implements a supervised learning technique to infer a probabilistic model of user interests from disambiguated documents rated by the user in the training phase. Finally, the Recommender exploits the cross-language user profile to suggest relevant items by matching concepts contained in the semantic profile against those contained in the (previously disambiguated) documents to be recommended. The user might receive recommendations in her mother tongue, or in languages she knows; this is a decision of the specific application in which the recommender is integrated.
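The three-stage flow above can be sketched as follows. This is a deliberately simplified stand-in: the real Content Analyzer relies on META/JIGSAW for disambiguation, and the real Profile Learner is a Naive Bayes classifier; every function body here is a placeholder invented for illustration.

```python
# Minimal sketch of the MARS pipeline: Content Analyzer -> Profile Learner
# -> Recommender. All bodies are toy stand-ins for the real components.

def content_analyzer(raw_docs):
    """Stand-in for WSD: a real analyzer returns bags of synset IDs."""
    return [set(doc.lower().split()) for doc in raw_docs]

def profile_learner(bos_docs, ratings, threshold=3):
    """Collect features from positively rated documents into a profile."""
    profile = set()
    for bos, rating in zip(bos_docs, ratings):
        if rating > threshold:
            profile |= bos
    return profile

def recommender(profile, candidates):
    """Rank candidates by overlap with the profile (stand-in for P(c+|d))."""
    return sorted(candidates, key=lambda bos: len(profile & bos), reverse=True)

docs = ["space adventure robot", "romantic drama paris", "robot war space"]
analyzed = content_analyzer(docs)
profile = profile_learner(analyzed, [5, 1, 4])  # ratings on a 5-point scale
ranked = recommender(profile, analyzed)
```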

4. BUILDING LANGUAGE-INDEPENDENT DOCUMENTS

Semantic indexing of documents is performed by the Content Analyzer, which relies on META (Multi Language Text Analyzer) [17], a tool able to deal with documents in English and Italian. The goal of the semantic indexing step is to obtain a concept-based document representation. To this purpose, the text is first tokenized; then, for each word, possible lemmas (as well as their morpho-syntactic features) are collected. Part-of-speech (POS) ambiguities are solved before assigning the proper sense (concept) to each word. This step requires the identification of a repository for word senses and the design of an automated procedure for defining word-concept associations.

As regards the first issue, in this work the semantic indexing module exploits MultiWordNet as sense repository. MultiWordNet is a multilingual lexical database that supports the following languages: English, Italian, Spanish, Portuguese, Hebrew, Romanian and Latin. Similarly to WordNet, the basic building block of MultiWordNet is the synset (SYNonym SET), a structure containing sets of words with synonymous meanings, which represents a specific meaning of a word. Some words have several different meanings, and some meanings can be expressed by several different word forms: polysemy and synonymy can be viewed as complementary aspects of this mapping. In addition, MultiWordNet is organized by semantic relations between meanings, and since meanings are represented by synsets, semantic relations are pointers between synsets.

In MultiWordNet the Italian WordNet is strictly aligned with English WordNet 1.6 [18]. Italian synsets are created in correspondence with English WordNet synsets, whenever possible, and semantic relations are imported and preserved from the corresponding English synsets. Thus, when a relation occurs between two English WordNet synsets, the same relation occurs between the corresponding Italian synsets.
However, some lexical idiosyncrasies between languages may exist, such as lexical gaps and denotation differences, but the multilingual hierarchy implemented in MultiWordNet is able to overcome these problems [2]. An example of mapping between an English and an Italian synset is reported in Table 1, in which the (synonymous) word forms used to define a specific meaning are reported, together with the gloss, that is, the description of that specific meaning in free text.

Table 1: Mapping between an English and an Italian synset

Language | Synset                                                                     | Gloss
English  | world, human race, humanity, humankind, human beings, humans, mankind, man | all of the inhabitants of the earth
Italian  | mondo, umanità, uomo, genere umano, terra                                  | insieme degli abitanti della terra, il complesso di tutti gli esseri umani

As regards the issue of performing the word-concept association, META implements a WSD algorithm, called JIGSAW [19], that takes as input a document encoded as a list of h words in order of their appearance, and returns a list of k MultiWordNet synsets (k ≤ h), in which each synset s is obtained by disambiguating the target word w. The goal of JIGSAW consists in assigning a word w occurring in a document d its appropriate meaning or sense s, by exploiting the context C in which w is found. The context C for w is defined as a set of words that precede and follow w. The sense s is selected from a predefined set of possibilities, usually known as sense inventory. In the proposed algorithm, the sense inventory is obtained from MultiWordNet.

JIGSAW is based on the idea of combining three different strategies to disambiguate nouns, verbs, adjectives and adverbs. The main motivation behind our approach is that the effectiveness of a WSD algorithm is strongly influenced by the POS tag of the target word. An adaptation of the Lesk dictionary-based WSD algorithm has been used to disambiguate adjectives and adverbs [20], and an adaptation of the Resnik algorithm has been used to disambiguate nouns [21], while the algorithm we developed for disambiguating verbs exploits the nouns in the context of the verb, as well as the nouns both in the glosses and in the phrases that MultiWordNet utilizes to describe the usage of a verb. The complete description of the adopted WSD strategy is published in [22, 19].
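The gloss-overlap idea underlying the Lesk-style component can be sketched as follows; the actual JIGSAW algorithm is far richer (POS-specific strategies, weighted contexts), and the sense inventory below is an invented toy, not MultiWordNet data.

```python
# Minimal Lesk-style disambiguation sketch: pick the sense whose gloss
# shares the most words with the context. Synset IDs and glosses are
# invented for the example.

SENSES = {
    "bank": [
        ("n100", "financial institution that accepts deposits"),
        ("n200", "sloping land beside a body of water"),
    ],
}

def disambiguate(word, context):
    """Return the synset ID whose gloss overlaps most with the context."""
    best_id, best_overlap = None, -1
    for synset_id, gloss in SENSES.get(word, []):
        overlap = len(set(gloss.split()) & set(context))
        if overlap > best_overlap:
            best_id, best_overlap = synset_id, overlap
    return best_id

context = ["the", "river", "water", "flowed", "past", "the", "bank"]
print(disambiguate("bank", context))  # -> n200 (the riverside sense)
```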

4.1 Document Representation

The WSD procedure implemented in the Content Analyzer makes it possible to obtain a synset-based vector space representation, called bag-of-synsets (BOS), that is an extension of the classical bag-of-words (BOW) model. In the BOS model, a synset vector, rather than a word vector, corresponds to a document. MARS is able to suggest potentially relevant items to users, as long as an item having m properties can be represented in the form of M textual slots. The text in each slot is represented in the BOS model by counting separately the occurrences of a synset in the slots in which it occurs. More formally, let s be the index of the slot; the document d is reduced to M bags of synsets, one for each slot:

    d_s = <t_{s1}, t_{s2}, ..., t_{sn_s}>,  s = 1...M

where t_{si} is the i-th synset in slot s of the document, and n_s is the total number of synsets in slot s of document d. For each bag of synsets d_s, the corresponding synset-frequency vector is computed:

    f_s = <w_{s1}, w_{s2}, ..., w_{sn_s}>,  s = 1...M

where w_{si} is the weight of the synset t_{si} in slot s of the document, and can be computed in different ways: it can be the frequency of synset t_{si} in s, or a more complex feature weighting score. Note that the adoption of slots does not jeopardize the generality of the approach, since the case of documents not structured into slots corresponds to having just a single slot in our document representation strategy.

By invoking META on a text t, we get META(t) = (x, y), where x is the BOS containing the synsets obtained by applying JIGSAW on t, and y is the corresponding synset-frequency vector.

Figure 2: Example of synset-based document representation

Figure 2 provides an example of the representation for Stanley Kubrick's movie A Clockwork Orange, corresponding to the Italian translation Arancia Meccanica. The textual description of the plot is provided both for the Italian and the English version. The Content Analyzer produces the BOS containing the concepts extracted from the plot. The MultiWordNet-based document representation creates a bridge between the two languages: in a classical keyword-based approach the two plots would share only the term Beethoven, while the adoption of the synset-based approach allows a greater overlap (six shared synsets).

The following section describes how BOS-indexed documents in different languages are used in a content-based information filtering scenario for learning accurate language-independent user profiles.
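The per-slot BOS construction can be sketched as follows; here plain tokens stand in for the synset IDs that META/JIGSAW would produce, and the slot names are invented for the example.

```python
from collections import Counter

# Sketch of the bag-of-synsets (BOS) representation: one synset-frequency
# vector per slot, with occurrences counted separately per slot.

def disambiguate_to_synsets(text):
    """Placeholder: a real system maps words to MultiWordNet synset IDs."""
    return text.split()  # here tokens play the role of synset IDs

def to_bos(document):
    """document: dict slot_name -> text; returns dict slot_name -> Counter."""
    return {slot: Counter(disambiguate_to_synsets(text))
            for slot, text in document.items()}

movie = {"title": "n101 n102", "plot": "n101 n101 n103"}
bos = to_bos(movie)
print(bos["plot"]["n101"])  # counted separately per slot -> 2
```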

5. PROFILE LEARNER

The generation of the cross-language user profile is performed by the Profile Learner, which infers the profile as a binary text classifier [23], since each document has to be

classified as interesting or not with respect to the user preferences. Therefore, the set of categories is restricted to c+, the positive class (user-likes), and c−, the negative one (user-dislikes). The induced probabilistic model is used to estimate the a posteriori probability P(c|dj) of document dj belonging to class c. The algorithm adopted for inferring user profiles is a Naïve Bayes text learning approach, widely used in content-based recommenders; it is not presented here because it is already described in [24]. What we would like to point out here is that the final outcome of the learning process is a probabilistic model used to classify a new document, written in any language, into the class c+ or c−.

Given a new document dj, the model computes the a posteriori classification scores P(c+|dj) and P(c−|dj) by using the probabilities of the synsets contained in the user profile, estimated in the training phase by exploiting both the ratings provided by users and the synset frequencies. Each user provided ratings on items using a 5-point Likert scale ranging from MIN (strongly dislikes) to MAX (strongly likes). Items whose ratings are greater than (MIN+MAX)/2 are assumed to be liked by the user and included in the positive training set, while items with lower ratings are included in the negative training set.

Besides its content, the profile contains the user identifier and the a priori probabilities of liking or disliking an item (P(c+), P(c−)). Moreover, the profile is structured in two main parts: profile-like contains features describing the concepts that characterize relevant items, while the features in profile-dislike should help in filtering out non-relevant items. Each part of the profile is structured in slots, resembling the same representation strategy adopted for documents. Each slot reports the features (MultiWordNet synset identifiers) occurring in the training examples, with the corresponding frequencies computed in the training step. Frequencies are used by the Bayesian learning algorithm to induce the classification model (i.e. the user profile) exploited to suggest relevant items in the recommendation phase.

The profile learning process for user u starts by selecting all items (disambiguated documents) and the corresponding ratings provided by u. Each item falls into either the positive or the negative training set depending on the user rating, in the same way as previously described in this section. Then, given a new (previously disambiguated) document dj, the recommendation step consists in computing the a posteriori classification score P(c+|dj), used to produce a ranked list of potentially interesting items, from which the items to be recommended can be selected.

Figure 3 provides an example of cross-language recommendations provided by MARS. The user likes a movie with an Italian plot, and the concepts extracted from the plot are stored in her synset-based profile. The recommendation step exploits this representation to suggest a movie whose plot is in English. The classical matching between keywords is replaced by a matching between synsets, making it possible to suggest movies even if their descriptions do not contain shared terms.
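The scoring step can be sketched as follows. This is a hedged toy version of the Naïve Bayes approach described in [24]: the actual MARS learner works on per-slot synset frequencies, while this sketch uses a single bag of synsets per document, with Laplace smoothing; synset IDs and labels are invented for the example.

```python
import math
from collections import Counter

# Toy Naive Bayes over bags of synsets: estimate priors and per-class
# synset frequencies, then score new documents via log P(c) + sum log P(t|c).

def train(docs, labels):
    """docs: list of synset-ID lists; labels: '+' (likes) or '-' (dislikes)."""
    prior = Counter(labels)
    freq = {"+": Counter(), "-": Counter()}
    for doc, label in zip(docs, labels):
        freq[label].update(doc)
    vocab = set(t for doc in docs for t in doc)
    return prior, freq, vocab

def score(model, doc, c):
    """Log of P(c) * prod_t P(t|c), with Laplace smoothing."""
    prior, freq, vocab = model
    total = sum(freq[c].values())
    logp = math.log(prior[c] / sum(prior.values()))
    for t in doc:
        logp += math.log((freq[c][t] + 1) / (total + len(vocab)))
    return logp

docs = [["n1", "n2"], ["n1", "n3"], ["n4", "n5"]]
labels = ["+", "+", "-"]
model = train(docs, labels)
print(score(model, ["n1", "n2"], "+") > score(model, ["n1", "n2"], "-"))  # -> True
```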

6. EXPERIMENTAL EVALUATION

The goal of the experimental evaluation was to measure the predictive accuracy of language-independent (cross-language) user profiles built using the BOS model. More specifically, we wanted to test 1) whether user profiles learned

using examples in a specific language can be effectively exploited for recommending items in a different language, and 2) whether the accuracy of the cross-language recommender system is comparable to that of the monolingual one.

Figure 3: Example of a cross-language recommendation

Experiments were carried out in a movie recommendation scenario, and the languages adopted in the evaluation phase are English and Italian. We do not provide experiments for evaluating whether synset-based profiles outperform keyword-based ones, because we have already shown this result in the context of a movie recommendation scenario [25].

6.1 Users and Dataset

The experimental work has been carried out on the MovieLens dataset (http://www.grouplens.org), containing 100,000 ratings provided by 943 different users on 1,628 movies. The original dataset does not contain any information about the content of the movies, so the content information for each movie was crawled from both the English and the Italian version of Wikipedia. In particular, the crawler gathers the Title of the movie, the name of the Director, the Starring actors, and the Plot.

It is worth noting that the English and Italian movie descriptions are not mere translations of one another: different contributors have provided the English and Italian movie plots, thus different sets of terms have been used (see Table 2). This means that a simple approach in which user profiles are learned on items in a specific language, and recommendations are provided by suggesting items properly translated into another language, may not be effective. The same holds for the approach in which profiles are learned on items properly translated into the language for which recommendations should be provided.

The text in each slot has been tokenized and stemmed, and the POS tag has been identified for running the WSD algorithm. A named entity recognition procedure has been performed, and stopwords have been eliminated just from

the plots. Table 2 reports the number of features used to represent movies. We can notice that the WSD procedure allows a reduction of the number of features used to represent movies (roughly 9% for English, 16% for Italian).

Table 2: Features used to represent movies

Slot     | #words (eng) | #words (ita) | #synsets (eng) | #synsets (ita)
Title    |          875 |        1,023 |            763 |            913
Director |          642 |          642 |            525 |            498
Starring |        2,535 |        2,535 |          2,490 |          2,234
Plot     |       25,846 |       21,932 |         23,330 |         18,268
Total    |       28,898 |       26,132 |         27,108 |         21,913

In order to learn accurate user profiles, we did not perform the evaluation for users who provided fewer than 20 ratings. Moreover, we selected all the movies for which both the English and the Italian description is available. To sum up, after this processing the dataset contained 40,717 ratings provided by 613 different users on 520 movies.
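The filtering step above can be sketched as follows; all identifiers and ratings are invented for the example, and the minimum-ratings threshold is lowered from the paper's 20 to keep the toy data small.

```python
from collections import defaultdict

# Sketch of the dataset filtering: keep only movies with both an English
# and an Italian description, then only users with enough usable ratings.

ratings = [("u1", "m1", 4), ("u1", "m3", 5), ("u2", "m1", 5), ("u2", "m2", 3)]
has_both_descriptions = {"m1", "m3"}  # movies with EN and IT plots
MIN_RATINGS = 2                       # 20 in the actual experiments

by_user = defaultdict(list)
for user, movie, vote in ratings:
    if movie in has_both_descriptions:
        by_user[user].append((movie, vote))

# Drop users with too few usable ratings
kept = {u: rs for u, rs in by_user.items() if len(rs) >= MIN_RATINGS}
print(kept)  # -> {'u1': [('m1', 4), ('m3', 5)]}
```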

• Exp#3 – ENG-ENG: profiles learned on movies with English description and recommendations produced on movies with English description; • Exp#4 – ITA-ITA: profiles learned on movies with Italian description and recommendations produced on movies with Italian description. We executed one experiment for each user in the dataset. The ratings of each specific user and the content of the rated movies have been used for learning the user profile and measuring its predictive accuracy, using the aforementioned measures. Each experiment consisted of: 1. selecting ratings of the user and the description (English or Italian) of the movies rated by that user; 2. splitting the selected data into a training set Tr and a test set Ts; 3. using Tr for learning the corresponding user profile by exploiting the:

6.2 Design of the Experiment

• English movie descriptions (Exp#1);

User profiles are learned by analyzing the ratings stored in the MovieLens dataset. Each rating was expressed as a numerical vote on a 5-point Likert scale, ranging from 1 = strongly dislike to 5 = strongly like.

MARS is conceived as a text classifier, thus its effectiveness was evaluated by classification accuracy measures, namely Precision and Recall [23]: Precision (Pr) is defined as the number of relevant selected items divided by the number of selected items, and Recall (Re) as the number of relevant selected items divided by the total number of relevant items. The Fβ measure, a combination of Precision and Recall, is also used to obtain an overall measure of predictive accuracy (β sets the relative degree of importance attributed to Pr and Re) [26]. Since users should trust the recommender, it is important to reduce false positives. It is also desirable to provide users with a short list of relevant items (even if not all the possible relevant items are suggested), rather than a long list containing a greater number of relevant items mixed with non-relevant ones. Therefore, we set β = 0.5. These measures were adopted because we are interested in measuring how relevant a set of recommendations is for a user. In the experiment, an item is considered relevant for a user if the rating is greater than or equal to 4, while MARS considers an item relevant for a user if the a-posteriori probability of the class likes is greater than 0.5. The dataset used in the experiment is balanced in terms of positive and negative ratings (57% positive, 43% negative).

We designed two different experiments, depending on 1) the language of the items used for learning profiles, and 2) the language of the items to be recommended:

• Exp#1 – ENG-ITA: profiles learned on movies with English descriptions, and recommendations provided on movies with Italian descriptions;

• Exp#2 – ITA-ENG: profiles learned on movies with Italian descriptions, and recommendations provided on movies with English descriptions.

We compared the results against the accuracy of classical monolingual content-based recommender systems:

• Exp#3 – ENG-ENG: profiles learned and recommendations provided on movies with English descriptions;

• Exp#4 – ITA-ITA: profiles learned and recommendations provided on movies with Italian descriptions.

For the cross-language experiments, a single run was performed for each user. It consisted in learning the user profile from the training set Tr, by exploiting the:

• English movie descriptions (Exp#1);
• Italian movie descriptions (Exp#2);

and in evaluating the predictive accuracy of the induced profile on the test set Ts, using the aforementioned measures, by exploiting the:

• Italian movie descriptions (Exp#1);
• English movie descriptions (Exp#2).

In the same way, a single run for each user was performed for computing the accuracy of the monolingual recommender systems, but the process of learning user profiles from Tr and evaluating the predictive accuracy on Ts was carried out using descriptions of movies in the same language, either English or Italian. The methodology adopted for obtaining Tr and Ts was 5-fold cross validation [27].
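The evaluation measures described above can be sketched in a few lines of code. The item names, ratings, and posterior probabilities below are invented for illustration; only the β = 0.5 setting and the relevance thresholds (rating ≥ 4, P(likes) > 0.5) come from the experimental setup.

```python
# Illustrative sketch of the Precision / Recall / F-beta evaluation.
# Items, ratings and posteriors are made up; thresholds follow the paper.

def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F_beta = (1 + beta^2) * Pr * Re / (beta^2 * Pr + Re)."""
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Ground truth: an item is relevant if the user's rating is >= 4.
ratings = {"movie_a": 5, "movie_b": 2, "movie_c": 4, "movie_d": 1, "movie_e": 4}
relevant = {item for item, r in ratings.items() if r >= 4}

# MARS-style selection: recommend if P(likes | item) > 0.5 (probabilities invented).
posteriors = {"movie_a": 0.9, "movie_b": 0.6, "movie_c": 0.7, "movie_d": 0.2, "movie_e": 0.4}
selected = {item for item, p in posteriors.items() if p > 0.5}

pr = len(relevant & selected) / len(selected)  # relevant selected / selected
re = len(relevant & selected) / len(relevant)  # relevant selected / relevant
print(round(pr, 3), round(re, 3), round(f_beta(pr, re), 3))
```

Note that when Pr = Re, Fβ collapses to that common value for any β; the β = 0.5 weighting only matters when Precision and Recall diverge.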

6.3 Discussion of Results
Results of the experiments, averaged over all the users, are reported in Table 3.

Table 3: Experimental Results

Experiment         Pr     Re     Fβ
Exp#1 – ENG-ITA    59.04  96.12  63.98
Exp#2 – ITA-ENG    58.77  95.87  63.70
Exp#3 – ENG-ENG    60.13  95.16  64.91
Exp#4 – ITA-ITA    58.73  96.42  63.71

The main outcome of the experimental session is that the strategy implemented for providing cross-language recommendations is quite effective. More specifically, user profiles learned using examples in a specific language can be effectively exploited for recommending items in a different language, and the accuracy of the approach is comparable to that of the configurations in which the learning and recommendation phases are performed on the same language. This means that the goal of the experiment has been reached.

The best result was obtained in Exp#3, that is, a classical monolingual recommender system using English content both for learning user profiles and for providing recommendations. This result was expected, due to the higher accuracy of the JIGSAW WSD algorithm for English with respect to Italian: the error introduced in the disambiguation step for representing documents as bags-of-synsets hurts the performance of the Profile Learner. It is worth noting that the result of Exp#4, related to movies whose descriptions are in Italian, is nevertheless quite satisfactory. The result of the second experiment, in which Italian movie descriptions are used for learning profiles that are then exploited for recommending English movies, is also satisfactory. This confirms the effectiveness of the approach designed for providing cross-language personalization.

The approach proposed in this paper might seem too elaborate, because of the several operations needed to represent documents as bags-of-synsets. Translating every term in the documents into other languages might appear to be the most straightforward solution to cross-language personalization. However, polysemous terms can hurt the effectiveness of the translation process and, as a consequence, the filtering performance. Furthermore, due to linguistic and cultural differences, not all terms have direct translations in all languages, even if conceptually related terms with slightly different meanings are always available. Translating every term in the user profile would be an even easier solution, but the translation process might not be effective due to the lack of context for those terms. In any case, a syntactic approach based on the translation of documents into different languages can fail to produce effective cross-language recommendations, because the documents to be recommended might be represented using keywords different from those contained in the user profile. This is the case in our experiments.
To sum up, a merely syntactic approach to cross-language information filtering is hardly feasible, and semantic approaches for representing text seem to be necessary. This is even more evident from the results of the experiments, in which the cross-language recommendations are accurate even though the movie descriptions are not direct translations from English to Italian or vice versa. In our opinion, the strongest aspects of the approach presented in this work are the following:

1. the process of learning content-based user profiles does not depend on the specific language in which training examples are represented. Indeed, the Profile Learner is able to process documents represented as bags-of-synsets, independently of the language used;

2. the Content Analyzer, which represents documents as bags-of-synsets, relies on a WSD algorithm that exclusively needs a lexical resource such as MultiWordNet. That resource must contain the association between word forms and word meanings, and the alignment between senses in different languages. This alignment allows representing documents in different languages without any additional processing;

3. the WSD algorithm used by the Content Analyzer is totally unsupervised, thus it does not need any parallel corpus for associating senses with words in different languages;

4. the BOS model allows a semantic representation of user profiles that captures the concepts underlying user interests, independently of the words and the language used for expressing them. This means that a single semantic profile is sufficient to represent user interests in several languages, while a syntactic approach would need a different profile for each language.

Another point of discussion in evaluating our approach is the comparison with other filtering techniques that might be adopted for cross-language personalization. In this context, a collaborative filtering algorithm is inherently multilingual, because it does not rely on the content of items: it just exploits the users' rating style (the sets of ratings given by users on the same items) for providing recommendations. The problem with this approach is that the similarity between users is only computable if they have rated items in common. Moreover, the number of ratings obtained from users is usually very small compared to the number of ratings that must be predicted. This problem, called sparsity, has a negative effect on the predictions, because it affects the selection of neighbors: it is likely that in many cases the overlap of rated items between two users will be minimal. This problem is even more evident in a multilingual environment, where most users are likely to rate items in their mother tongue: this is also a cultural aspect, which depends on the type of content users generally access. Users often prefer to interact with content in their mother tongue, especially for items such as movies or books. This makes it even more complicated to find neighbors for providing effective recommendations. The approach presented in this paper has the main advantage of providing cross-language recommendations even for users who rated items in a single language, as well as for users who rated items in different languages.
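The language independence of the bag-of-synsets representation can be illustrated with a minimal sketch. The tiny sense inventory, the synset identifiers, and the overlap score below are invented stand-ins; a real system would use MultiWordNet and a WSD step to choose among candidate senses.

```python
# Toy sketch of the bag-of-synsets (BOS) idea: words in different languages
# map to shared, language-independent synset ids, so a profile learned from
# English text can score an Italian document. Inventory and ids are invented.
from collections import Counter

SENSE_INVENTORY = {
    # (language, word form) -> synset id (aligned across languages)
    ("en", "movie"): "syn#0001", ("it", "film"):   "syn#0001",
    ("en", "actor"): "syn#0002", ("it", "attore"): "syn#0002",
    ("en", "war"):   "syn#0003", ("it", "guerra"): "syn#0003",
}

def bag_of_synsets(tokens, lang):
    """Map tokens to synsets; unknown words are skipped here
    (a real system would run WSD over candidate senses)."""
    return Counter(SENSE_INVENTORY[(lang, t)]
                   for t in tokens if (lang, t) in SENSE_INVENTORY)

# Profile learned from English descriptions...
profile = bag_of_synsets(["movie", "actor", "war"], "en")
# ...matched against an Italian description of another movie.
doc = bag_of_synsets(["film", "guerra"], "it")

# Fraction of the document's synsets covered by the profile.
overlap = sum((profile & doc).values()) / sum(doc.values())
print(overlap)
```

Because both texts collapse to the same synset ids, the English-learned profile fully covers the Italian document even though the two share no surface keywords, which is precisely what a purely syntactic (keyword or translation-based) match would miss.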
Finally, another remarkable advantage of our approach is that providing users with cross-language recommendations can facilitate serendipitous encounters. Content-based recommender systems suffer from the over-specialization problem: when the system can only recommend items that score highly against a user's profile, the user is limited to receiving items similar to those already rated. Over-specialized systems can prevent serendipitous discoveries from happening, but the provision of cross-language recommendations can help users discover something new and unexpected, which might also be in another language.

7. CONCLUSIONS AND FUTURE WORK
This paper presented a semantic content-based recommender system for providing cross-language recommendations. The key idea is to provide a bridge among different languages by exploiting a language-independent representation of documents and user profiles based on word meanings, called bag-of-synsets. The assignment of the right meaning to words is based on a WSD algorithm that exploits MultiWordNet as sense repository. Experiments were carried out in a movie recommendation scenario, and the main outcome is that the accuracy of cross-language recommendations is comparable to that of classical (monolingual) content-based recommendations. In the future, we plan to investigate the effectiveness of MARS on different domains and datasets. More specifically, we are working to extract cross-language profiles by gathering information from social networks, such as Facebook, LinkedIn, and Twitter, in which information is generally expressed in different languages.

8. REFERENCES

[1] M. J. Pazzani and D. Billsus, "Content-Based Recommendation Systems," in The Adaptive Web, ser. Lecture Notes in Computer Science, P. Brusilovsky, A. Kobsa, and W. Nejdl, Eds., vol. 4321, 2007, pp. 325–341, ISBN 978-3-540-72078-2.
[2] E. Pianta, L. Bentivogli, and C. Girardi, "MultiWordNet: developing an aligned multilingual database," in Proc. of the 1st Int. WordNet Conference, Mysore, India, 2002, pp. 293–302.
[3] R. Chau and C.-H. Yeh, "Fuzzy multilingual information filtering," in FUZZ '03: 12th IEEE International Conference on Fuzzy Systems, 2003, pp. 767–771.
[4] A. Damankesh, J. Singh, F. Jahedpari, K. Shaalan, and F. Oroumchian, "Using Human Plausible Reasoning as a Framework for Multilingual Information Filtering," in CLEF 2009: Proc. of the 9th Workshop of the Cross-Language Evaluation Forum.
[5] D. W. Oard, "Alternative Approaches for Cross-Language Text Retrieval," in AAAI Symposium on Cross-Language Text and Speech Retrieval. American Association for Artificial Intelligence, 1997, pp. 154–162.
[6] G. Salton, "Experiments in multi-lingual information retrieval," Inf. Process. Lett., vol. 2, no. 1, pp. 6–11, 1973.
[7] B. Rehder, M. L. Littman, S. T. Dumais, and T. K. Landauer, "Automatic 3-language cross-language information retrieval with latent semantic indexing," in TREC, 1997, pp. 233–239.
[8] G. W. Furnas, S. Deerwester, S. T. Dumais, T. K. Landauer, R. A. Harshman, L. A. Streeter, and K. E. Lochbaum, "Information retrieval using a singular value decomposition model of latent semantic structure," in SIGIR '88: Proc. of the 11th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval. ACM, 1988, pp. 465–480.
[9] L. Ballesteros and W. B. Croft, "Phrasal translation and query expansion techniques for cross-language information retrieval," in SIGIR '97: Proc. of the 20th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval. ACM, 1997, pp. 84–91.
[10] D. Soergel, "Multilingual thesauri and ontologies in cross-language retrieval," in AAAI Symposium on Cross-Language Text and Speech Retrieval. American Association for Artificial Intelligence, 1997.
[11] M. Potthast, B. Stein, and M. Anderka, "A Wikipedia-based multilingual retrieval model," in Advances in Information Retrieval, Proc. of the 30th European Conference on IR Research, ECIR 2008, Glasgow, UK, March 30–April 3, 2008, ser. LNCS, vol. 4956. Springer, 2008, pp. 522–530.
[12] E. Gabrilovich and S. Markovitch, "Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis," in IJCAI, M. M. Veloso, Ed., 2007, pp. 1606–1611.
[13] A. Juffinger, R. Kern, and M. Granitzer, "A Wikipedia-Based Multilingual Retrieval Model," in Evaluating Systems for Multilingual and Multimodal Information Access, 2009, pp. 155–162.
[14] J. Gonzalo, F. Verdejo, C. Peters, and N. Calzolari, "Applying EuroWordNet to Cross-Language Text Retrieval," Computers and the Humanities, vol. 32, no. 2–3. Springer Netherlands, 1998, pp. 185–207.
[15] P. Vossen, "Introduction to EuroWordNet," Computers and the Humanities, vol. 32, no. 2–3, pp. 73–89, 1998.
[16] C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. Cambridge, US: The MIT Press, 1999, ch. 16: Text Categorization, pp. 575–608.
[17] P. Basile, M. de Gemmis, A. Gentile, L. Iaquinta, P. Lops, and G. Semeraro, "META - MultilanguagE Text Analyzer," in Proceedings of the Language and Speech Technology Conference - LangTech 2008, February 28–29, 2008, Rome, Italy, 2008, pp. 137–140.
[18] G. Miller, "WordNet: An On-Line Lexical Database," International Journal of Lexicography, vol. 3, no. 4, 1990 (Special Issue).
[19] P. Basile, M. de Gemmis, A. Gentile, P. Lops, and G. Semeraro, "UNIBA: JIGSAW algorithm for Word Sense Disambiguation," in Proc. of the 4th ACL 2007 Int. Workshop on Semantic Evaluations (SemEval-2007), Prague, Czech Republic. Association for Computational Linguistics, June 23–24, 2007, pp. 398–401.
[20] S. Banerjee and T. Pedersen, "An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet," in CICLing '02: Proc. of the 3rd Int. Conf. on Computational Linguistics and Intelligent Text Processing. London, UK: Springer-Verlag, 2002, pp. 136–145.
[21] P. Resnik, "Disambiguating noun groupings with respect to WordNet senses," in Proceedings of the Third Workshop on Very Large Corpora. Association for Computational Linguistics, 1995, pp. 54–68.
[22] G. Semeraro, M. Degemmis, P. Lops, and P. Basile, "Combining Learning and Word Sense Disambiguation for Intelligent User Profiling," in Proceedings of the 20th International Joint Conference on Artificial Intelligence, M. M. Veloso, Ed., 2007, pp. 2856–2861, ISBN 978-1-57735-298-3.
[23] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys, vol. 34, no. 1, 2002.
[24] M. de Gemmis, P. Lops, G. Semeraro, and P. Basile, "Integrating Tags in a Semantic Content-based Recommender," in Proc. of the 2008 ACM Conf. on Recommender Systems, RecSys 2008, Lausanne, Switzerland, October 23–25, 2008, pp. 163–170.
[25] M. Degemmis, P. Lops, and G. Semeraro, "A Content-collaborative Recommender that Exploits WordNet-based User Profiles for Neighborhood Formation," User Modeling and User-Adapted Interaction: The Journal of Personalization Research (UMUAI), vol. 17, no. 3, pp. 217–255, 2007. Springer Science + Business Media B.V.
[26] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl, "Evaluating Collaborative Filtering Recommender Systems," ACM Transactions on Information Systems, vol. 22, no. 1, pp. 5–53, 2004.
[27] R. Kohavi, "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection," in Proc. of IJCAI-95, 1995, pp. 1137–1145.