DYNAMIC MODELING AND LEARNING USER PROFILE IN PERSONALIZED NEWS AGENT

A Thesis by DWI HENDRATMO WIDYANTORO

Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE

May 1999

Major Subject: Computer Science

DYNAMIC MODELING AND LEARNING USER PROFILE IN PERSONALIZED NEWS AGENT

A Thesis by DWI HENDRATMO WIDYANTORO

Submitted to Texas A&M University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE

Approved as to style and content by:

John Yen (Chair of Committee) Thomas Ioerger (Member) Reza Langari (Member)

Wei Zhao (Head of Department)

May 1999

Major Subject: Computer Science

ABSTRACT

Dynamic Modeling and Learning User Profile in Personalized News Agent. (May 1999) Dwi Hendratmo Widyantoro, B.S., Institut Teknologi Bandung Chair of Advisory Committee: Dr. John Yen

Finding relevant information effectively on the Internet is a challenging task. Although the information is widely available, exploring Web sites and finding information relevant to a user's interest can be a time-consuming and tedious task. As a result, many software agents have been employed to perform autonomous information gathering and filtering on behalf of the user. One of the critical issues in such an agent is the capability of the agent to model its users and adapt itself over time to changing user interests. In this thesis, a novel scheme is proposed to learn a user profile. The proposed scheme is designed to handle multiple domains of long-term and short-term user interests simultaneously, which are learned through positive and negative user feedback. A 3-descriptor interest category representation is developed to achieve this objective. Using such a representation, the learning algorithm is derived by imitating human personal assistants doing the same task. Based on experimental evaluation, the scheme performs very well and adapts quickly to significant changes in user interest.


To my wife Lina Handayani and my daughter Risma Cahyani Widyantoro.

ACKNOWLEDGMENTS

I would like to express my gratitude to Dr. John Yen, my advisory committee chair, who has provided his valuable time to guide me in this research. I learned how to conduct good research from him. I would also like to thank Dr. Thomas Ioerger, my committee member, for the fruitful discussions and his constructive suggestions regarding my research. He also spent a great amount of time helping me revise the writing of this thesis. Special thanks go to Dr. Reza Langari, my other committee member, for his support and encouragement. Another special thank you goes to the members of the Computer Science software agents group: Jianwen Yin, Magy Seif El-Nasr, Linyu Yang, Anna Zacchi and Chris Standish, for the stimulating discussions and brainstorming. The research in this thesis is supported in part by the Army Research Laboratory (ARL) project under grant number DAAL01-97-M-0235. I would therefore like to thank Dr. James Wall at the Texas Center for Applied Technology, Texas A&M University, and LTC Robert J. Hammel at the Army Research Laboratory, who have driven the completion of this work. Finally, I would like to express my deepest gratitude to the Ministry of Education and Culture of the Republic of Indonesia, which provided me a two-year fellowship to pursue a master's degree through its Education Engineering Development Project (EEDP). I am truly indebted for the opportunity to study abroad.


TABLE OF CONTENTS

CHAPTER                                                               Page

I    INTRODUCTION ................................................... 1
     A. Background .................................................. 1
     B. Objective ................................................... 2
     C. The Proposed Approach ....................................... 2
     D. Contribution ................................................ 3
     E. Organization of Writing ..................................... 4

II   LITERATURE REVIEW .............................................. 6
     A. Document Text Analysis ...................................... 6
        1. Keyword Weighting ........................................ 7
           a. Normalized Term Frequency Weighting Scheme ............ 8
           b. Term Frequency Inverse Document Frequency
              Weighting Scheme ...................................... 9
        2. Measuring Feature Vector Similarity ...................... 9
     B. Previous Works in Information-Filtering System .............. 10
        1. Ad Hoc Approaches ........................................ 12
        2. Neural Network Approaches ................................ 13
        3. Genetic Algorithm Approaches ............................. 15
     C. Limitation of Previous Works ................................ 16

III  MODELING AND LEARNING USER PROFILE ............................. 18
     A. Modeling User Profile ....................................... 18
        1. Rationale behind User Profile Modeling ................... 18
        2. The Structure of User Profile ............................ 21
     B. Information Filtering ....................................... 23
     C. Learning User Profile in Personalized News Agent ............ 24
        1. Learning Algorithm ....................................... 24
        2. Measuring Interest Category Relevance .................... 25
        3. Updating Feature Vector .................................. 25
        4. Learning Explicit Feedback ............................... 27
        5. Learning New Interest .................................... 30
        6. Learning Implicit Feedback ............................... 32

IV   SENSITIVITY ANALYSIS ........................................... 35
     A. Learning Rate Sensitivity ................................... 35
     B. Threshold Sensitivity ....................................... 38

V    PERSONALIZED NEWS AGENT ........................................ 40
     A. Personalized News Agent Architecture ........................ 40
     B. News Agent Functionality .................................... 42
     C. Agent's Component Implementation ............................ 44
        1. News Retriever ........................................... 44
        2. Learning Module .......................................... 46
        3. Meta Search Engine ....................................... 48
        4. User Interface Module .................................... 48
           a. Processing New User Registration ...................... 49
           b. Opening Personalized News Page ........................ 49
           c. Handling User's Feedback .............................. 50
           d. Performing Search Request ............................. 51

VI   PERFORMANCE EVALUATION ......................................... 52
     A. Performance Evaluation Criteria ............................. 52
        1. Accuracy ................................................. 53
        2. Normalized Distance-based Performance Measure ............ 53
        3. Spearman Rank-order Correlation Coefficient .............. 54
        4. Average Document Score ................................... 55
        5. Precision versus Recall .................................. 55
     B. Methodology ................................................. 56
        1. The Procedure of Experiment .............................. 56
        2. Scenario of Feedback ..................................... 60
           a. Rationale for Feedback Scenario ....................... 60
           b. Algorithm for Feedback Scenario ....................... 62
     C. Performance Comparison ...................................... 65
     D. Experimental Results ........................................ 66
        1. Accuracy ................................................. 69
        2. Normalized Distance-based Performance Measure ............ 73
        3. Spearman Rank-order Correlation Coefficient .............. 76
        4. Average Document Score ................................... 76
        5. Precision versus Recall .................................. 76

VII  CONCLUSION ..................................................... 80
     A. Summary of Research ......................................... 80
     B. Future Work ................................................. 81
        1. Extending the 3-descriptor Scheme ........................ 82
        2. Incorporating Collaborative Filtering .................... 82
        3. Extending the Learning Capability ........................ 83
        4. Direct Manipulation of Profile ........................... 84

REFERENCES .......................................................... 85

APPENDIX A .......................................................... 88

APPENDIX B .......................................................... 89

APPENDIX C .......................................................... 91

APPENDIX D .......................................................... 92

VITA ................................................................ 93

LIST OF TABLES

TABLE                                                                 Page

I    The difference of systems based on learning process and
     learning approaches. ........................................... 11

II   The difference of systems based on the types of feedback. ...... 12

III  Parameters of precision and recall. ............................ 56

IV   The accuracy of the 3-descriptor model over different
     number of keywords. ............................................ 68

V    The accuracy of the 3-descriptor model over different
     document sources used for feedback. ............................ 72

VI   The accuracy of the 3-descriptor model over different
     number of recommendations. ..................................... 72

LIST OF FIGURES

FIGURE                                                                Page

1    A 3-descriptor representation of user profile. ................. 21
2    The nature of changing long-term interests. .................... 28
3    A new category from positive feedback. ......................... 32
4    A new category from negative feedback. ......................... 33
5    The effect of learning rate on the change of document's
     ranking. ....................................................... 37
6    The effect of threshold on the proliferation of new
     categories. .................................................... 39
7    An architecture of the personalized news agent. ................ 41
8    A main entry of the personalized news agent. ................... 42
9    An example of a personalized page. ............................. 43
10   A diagram of experiment procedure. ............................. 57
11   A relation between user satisfaction and learning rate. ........ 63
12   A relation between document score and learning rate. ........... 64
13   The accuracy of the 3-descriptor model over threshold. ......... 67
14   The accuracy of system's recommendation. ....................... 70
15   The average accuracy of the 3-descriptor model. ................ 70
16   The system's performance on ndpm. .............................. 73
17   The average performance of the 3-descriptor model on ndpm. ..... 74
18   The system's performance on Spearman rank-order
     correlation coefficient. ....................................... 75
19   The performance of the 3-descriptor model on Spearman
     coefficient. ................................................... 75
20   The system's performance on average document score. ............ 77
21   The performance of the 3-descriptor model on average
     document score. ................................................ 77
22   Precision versus recall. ....................................... 78

CHAPTER I

INTRODUCTION

A. Background

Recent advances in networking and Internet technology have driven the rapid development and spread of the World Wide Web. Not only have many information sources become available, but the explosion of information has also raised another subtle problem. Using existing search engines, relevant information is relatively easy to obtain. However, getting to the right site is still a time-consuming task. Given a list of Web sites that is superficially relevant, a considerable amount of time is still required to check out and surf each of the sites before the truly relevant information is found. Further, the focus of searching is occasionally diverted by the appearance of other seemingly interesting Web sites, which adds more searching time before getting back on the right track. Therefore, although searching tools ease the finding of information, the main problem, exploring Web sites, remains. What people really need is a software agent that explores information sources and delivers only the most relevant documents to its users. A news agent is a software agent for this purpose: a computer program that performs information gathering and filtering on behalf of its users. It maintains profiles of its users that represent their current interests. According to these profiles, a set of relevant documents retrieved from the World Wide Web is recommended to its users. Additionally, the agent performs autonomous information gathering. It surfs the Internet periodically and collects news articles from online newspapers and magazines during the night.

The journal model is The Computer Journal.

The articles are then examined, and the meta content of those articles that are relevant according to its user profiles is kept in a local repository. When a user signs on to the system, the agent proactively offers a list of ranked articles that might be of interest to the user. Moreover, the agent adapts to the dynamics of the user's interests. It learns changes in user needs and interests through user feedback. Users are able to provide explicit feedback expressing whether they like or dislike the articles recommended to them. Using this feedback, the profile of the user is modified accordingly. The agent also learns the user's implicit feedback by observing the user's actions while using the system.

B. Objective

User modeling is one of the essential components of an intelligent information gathering and filtering system. The model represents current user interests, and therefore keeping this model up to date is a major concern. The objective of the research in this thesis is to develop a dynamic model of the user profile and a learning algorithm that adapt to changing user information needs and interests over time. The model is designed to be applicable in an Internet-based personalized information provider system, or more specifically in a personalized news agent.

C. The Proposed Approach

This thesis proposes a new scheme for modeling and learning user profiles. The proposed scheme consists of a user profile structure and its learning algorithm. The structure of the user profile is designed to accommodate the representation of multiple interest categories, long-term and short-term interests, as well as the recognition of explicit interest areas and their levels of interest. Based on the structure, the learning

algorithm is further derived by imitating a human personal assistant doing the same task. The proposed approach employs a 3-descriptor scheme to represent an interest category. One descriptor maintains long-term interest and the other two descriptors deal with short-term learning. In learning from a feedback, all three descriptors are affected. A positive feedback increases the interest level of both the long-term and the positive descriptors and decreases the interest level of the negative descriptor. On the contrary, a negative feedback increases the interest level of both the long-term and the negative descriptors and reduces the interest level of the positive descriptor. The update of the long-term descriptor is modeled to reflect the reluctance of an interest to change after learning in the long run. Meanwhile, the update of the positive and negative descriptors is modeled to reflect reactive behavior on a short-term interest.

D. Contribution

The main contribution of this research, which distinguishes it from other related works, is the introduction of the 3-descriptor scheme to represent a category of interest and its learning algorithm to handle the interactions among the descriptors. Compared to other similar works that mainly use a single-descriptor model for interest category representation, the 3-descriptor scheme offers some advantages. It allows handling exceptions of interest within an interest category and learning long-term as well as short-term interests simultaneously. The only disadvantage of this approach is a more complex profile structure and update scheme, which is not an issue in this problem domain. On the contrary, although the single-descriptor model has a simple update scheme, it also has some

limitations. In learning from a feedback, a single-descriptor approach can either easily forget previously learned interests (with a high learning rate) or be slow to learn a new interest (with a low learning rate). Furthermore, it fails to represent exceptions of interest within a broader scope of interest category. Consequently, long-term and short-term learning cannot be handled simultaneously in a single-descriptor scheme. Hence, the 3-descriptor scheme proposed in this thesis extends the learning capability of previous approaches so that it is not only able to cover multiple interest categories [1, 2, 3, 4], perform online learning [4] and learn implicit feedback [4, 5], but is also able to handle exceptions within an interest category as well as maintain changing long-term and short-term interests simultaneously.

E. Organization of Writing

For clarity of presentation, the rest of this thesis is organized as follows. A literature review of the studies and works related to this thesis research is given in Chapter II. Chapter III describes the proposed idea of the research in detail. The description includes the rationale behind the design of the proposed method, the detailed model and its learning algorithm, and the application of the model to information filtering. An experimental analysis of the proposed model is discussed in Chapter IV. The analysis focuses on the sensitivity of the parameters that govern the behavior of the model. In Chapter V, a prototype system employing the proposed model is presented as a proof-of-concept demonstration. The architecture, system functionality and detailed implementation are described in this chapter. The performance evaluation of the model is discussed in Chapter VI, preceded by a discussion of the methodology used for the performance evaluation and the experimental procedure. Afterwards,

the experimental results are presented and discussed. Finally, the summary and future direction of this research are presented in Chapter VII.

CHAPTER II

LITERATURE REVIEW

Some basic alternative techniques need to be considered in the development of a user model and algorithm for information filtering. Those alternatives are choosing a representation of documents and interests, and selecting an appropriate approach for the learning algorithm. This chapter describes those alternatives in the first two sections. The first section presents some techniques used to represent and analyze a document. The second section describes some systems that use different approaches in their algorithms to learn user interests for information filtering. In the last section, a brief discussion of the limitations of previous works is presented.

A. Document Text Analysis

The representation of a document is commonly based on the vector space model. In this model, each document is represented by a vector in n-dimensional space. Given a document to be learned, a mechanism is required to transform the document into an n-dimensional vector, where each dimension represents a distinct term extracted from the document. The vector is called the document feature vector, which is a representation of the document in a hyperspace. An interest area is defined with the same representation, based on the document feature vector. Document representation in the vector space model raises two issues: weighting the terms and measuring feature vector similarity. The first issue deals with assigning weights to terms according to their degree of importance, while the second issue concerns finding a way to determine the closeness between two documents or interests.

1. Keyword Weighting

One of the important tasks in textual document analysis is to extract the terms from a document and assign them weights that reflect their presumed importance for the purpose of content identification. H. P. Luhn, one of the pioneers in automatic indexing, said that measuring word significance by use-frequency is justified by the fact that an author usually repeats certain words as he advances or varies his arguments and as he elaborates on an aspect of a subject [6]. Furthermore, when distinct terms in a textual document are arranged in decreasing order of their frequency of occurrence, the occurrence characteristics of the vocabulary can be characterized by the constant rank-frequency law of Zipf. The law states that the product of a term's frequency and its rank order will be approximately equal to the product of the frequency of another term and its rank [7]. Based on Zipf's law, Luhn conjectured that the relevant terms extracted from a document text would peak in the middle-frequency range, and further proposed to use terms with medium frequency, because high- and low-frequency terms are not good content identifiers [6]. Many methods have been derived to weight keywords in textual documents based on these basic considerations. Among those are term frequency, probabilistic weighting, inverse document frequency, signal weighting, discrimination value, TFIDF, Latent Semantic Indexing (LSI), etc. [8, 9, 10, 11]. Simple term frequency has a normalization problem: it cannot distinguish between a long and a short document [8]. The remaining methods, however, require prior knowledge: a collection of documents in a database as a reference for keyword weighting, which is difficult to extend in a dynamically changing database [8]. The following describes two popular heuristics for keyword weighting, namely normalized term frequency and term frequency inverse document frequency (TFIDF).

a. Normalized Term Frequency Weighting Scheme

Normalized term frequency is the most suitable method for keyword weighting in a domain-independent system. A set of new keywords can be introduced without extra effort, and unused keywords can be automatically removed from the system's domain. For a system that explores and works in an environment where the content of information varies and can change dynamically over time, this method works very well. Given a document d, after preprocessing steps¹, the weight of keyword i, $w_i^d$, is computed as the frequency of keyword occurrence $f_i^d$ divided by the length of the document, where document length is the total number of keywords extracted from the document. Therefore, the feature vector of a document d can be represented as $fv_d = \{(k_1^d, w_1^d), (k_2^d, w_2^d), \ldots, (k_n^d, w_n^d)\}$, where $k_i^d$ is keyword i and $w_i^d$ is defined by equation 2.1:

$$w_i^d = \frac{f_i^d}{\sum_{j=1}^{n} f_j^d} \qquad (2.1)$$

with $\sum_{i=1}^{n} w_i^d = 1$ (normalized form).

The assignment of a weight to a keyword does not need prior knowledge obtained from a document collection. It can be derived directly from the document text itself and thus is very flexible for use in a domain-independent system. However, this weighting scheme does not take into account the context of a keyword in a document and hence does not precisely reflect the degree of keyword importance.

¹Removing common words, stemming words and identifying bigrams.
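The thesis gives no implementation, but the scheme is simple to sketch. The following Python fragment is a minimal, illustrative version; the function name, the tokenizer and the stopword handling are assumptions, and the stemming and bigram identification mentioned in the footnote are omitted.

```python
import re
from collections import Counter

def normalized_tf(text, stopwords=frozenset()):
    # Tokenize and drop common words; stemming and bigrams are omitted here.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if t not in stopwords]
    counts = Counter(tokens)
    length = sum(counts.values())  # document length = number of keywords kept
    # Equation 2.1: weight = keyword frequency / document length.
    return {kw: freq / length for kw, freq in counts.items()}

# The weights of the resulting feature vector sum to 1 (normalized form).
fv = normalized_tf("news agents filter news articles for users",
                   stopwords=frozenset({"for"}))
assert abs(sum(fv.values()) - 1.0) < 1e-9
```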

b. Term Frequency Inverse Document Frequency Weighting Scheme

TFIDF is a well-known keyword-weighting scheme of the vector space model in the information retrieval literature. Under TFIDF, term importance is proportional to the occurrence frequency of the term in each document and inversely proportional to the number of documents in the collection in which the term occurs [9]. It assumes that keywords appearing in fewer documents discriminate better than ones appearing in more documents. Therefore, keywords with smaller document frequency are assigned a higher degree of importance than ones with larger document frequency. The weight of keyword i in a document d is then the product of the term frequency $tf_i^d$ and the inverse document frequency $idf_i$, as shown in the following equation:

$$w_i^d = tf_i^d \cdot idf_i \qquad (2.2)$$

where $idf_i = \log \frac{N}{df_i}$, N is the number of documents in the document collection, and $df_i$ is the document frequency of term i, defined as the number of documents in the collection that contain the term. The term frequency $tf_i^d$ is the number of occurrences of the term in document d. To avoid favoring long documents, the term frequency is usually normalized by dividing it by the length of the document (see equation 2.1).
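A similarly hedged sketch of TFIDF weighting follows; the names and the shape of the inputs are assumptions, and a full document collection is presumed available up front, which is exactly the prior-knowledge requirement noted earlier.

```python
import math

def tfidf_weights(doc_tf, doc_freq, num_docs):
    """Equation 2.2: w = tf * idf, with idf = log(N / df).

    doc_tf:   normalized term frequencies of one document (equation 2.1)
    doc_freq: term -> number of documents in the collection containing it
    num_docs: N, the size of the document collection
    """
    return {kw: tf * math.log(num_docs / doc_freq[kw])
            for kw, tf in doc_tf.items()}
```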

2. Measuring Feature Vector Similarity

As stated earlier, a similarity measure is required to determine the closeness between two documents or interests. Many similarity measures have been derived to describe the proximity of two vectors, such as the Dice, Jaccard, overlap, asymmetric and cosine measures [9]. Among those measures, cosine is the most widely used similarity

measure. For this reason, this measure will be employed for measuring the similarity of two vectors in this research. The cosine similarity measures the difference in direction indicated by two vectors, irrespective of length. This difference is the angle between the two vectors. Therefore, the cosine similarity measure calculates the cosine of the angle between two vectors in hyperspace and is defined as follows:

$$SIM(fv_i, fv_j) = \frac{\sum_k fv_{ik} \cdot fv_{jk}}{\sqrt{\sum_k fv_{ik}^2} \, \sqrt{\sum_k fv_{jk}^2}} \qquad (2.3)$$

If the two feature vectors representing two different documents are totally dissimilar, the value of the measure will be zero. As the cosine value increases, the similarity of the two vectors also increases. As the value approaches one, the two feature vectors become coincident and the confidence that the two documents are similar to one another is high. In information filtering, a document is considered relevant to a query or an interest area if the cosine similarity of the two in the vector space representation is close to one.
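Equation 2.3 operates on sparse vectors, so only shared keywords contribute to the dot product. A minimal sketch, assuming the dictionary-based feature vectors produced by the earlier fragments:

```python
import math

def cosine_similarity(fv_a, fv_b):
    # Equation 2.3: dot product over shared keywords, divided by the norms.
    dot = sum(fv_a[k] * fv_b[k] for k in fv_a.keys() & fv_b.keys())
    norm_a = math.sqrt(sum(w * w for w in fv_a.values()))
    norm_b = math.sqrt(sum(w * w for w in fv_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is totally dissimilar to anything
    return dot / (norm_a * norm_b)
```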

B. Previous Works in Information-Filtering System

Many systems have been developed to learn user profiles for information-filtering purposes. They mostly use the vector space model to represent documents and user interests, and employ the standard weighting schemes described in the previous section. The significant difference among the systems is in the algorithm used to update the profile during the learning process. Some systems employ learning algorithms adopted from neural networks and genetic algorithms, while others develop their own ad hoc approaches. Another difference among the systems can be seen in their learning process. Most systems allow incremental learning, where the learning process can be performed


anytime whenever new information is available. Meanwhile, only a few systems use batch learning, where the learning process is performed periodically or once a certain amount of information has been obtained. Additionally, the type of feedback used to learn user interests also characterizes the systems. User feedback can be positive or negative, and explicit or implicit. Positive and negative feedback indicate that the user likes or dislikes an item, respectively. Implicit feedback is feedback that is inferred by the system from the user's actions. Table I summarizes nine systems or works and their differences based on the learning process and the learning approaches.

Table I. The difference of systems based on learning process and learning approaches.

                       Learning Process        Learning Approach
  System             Incremental   Batch    Genetic Alg.   Neural Network
  McEligot et al.                    x                           x
  Wiener et al.                      x                           x
  Menczer et al.         x                        x
  INFOS                  x
  Fab                    x
  WebMate                x
  PIN                    x                                       x
  Amalthea               x                        x
  NewT                   x                        x

Their differences based on the feedback type used to learn the user profile are illustrated in Table II.

Table II. The difference of systems based on the types of feedback.

  System             Positive   Negative   Explicit   Implicit
  McEligot et al.       x                     x
  Wiener et al.         x          x          x
  Menczer et al.        x          x          x
  INFOS                 x          x          x
  Fab                   x          x          x          x
  WebMate               x                     x
  PIN                   x          x          x          x
  Amalthea              x          x          x
  NewT                  x          x          x

network approaches. Fitness or accumulated energy is a typical performance measure in systems using genetic algorithm. Based on those measures, it is dicult to compare the performance among the systems. Besides the di erence on their evaluation criteria, they use a set of di erent data that may result in di erent performance even if the same performance evaluation criterion is used. The following subsections describe previous works in more detail, which are categorized based on their learning approaches. 1. Ad Hoc Approaches Mock developed INFOS (Intelligent News Filtering Organizational System) that uses feature vectors to represent user interests [12]. INFOS is an intelligent information ltering system that reduces user's load in selecting interesting articles from Usenet news articles. The system learns automatically by adapting its internal user model. It uses the content of the article and the collaboration with other users for the learning process. The features are manipulated through keyword-based and knowledge-based

13 techniques. The hybrid approach is claimed to improve accuracy over keyword approach, support domain knowledge and retain the system's scalability. Chen developed WebMate, an agent that helps users to e ectively browse and search the web [1]. WebMate keeps track of user interests in di erent domains through multiple vectors weighted using TFIDF technique. The domains are learned automatically as users give a positive example. In this system, a new domain category is created for every feedback if the number of domains is still below its upper limit. If the number of domains has reached its maximum limit, the document to be learned will be used to modify a vector with the greatest similarity. WebMate also introduces the use of Trigger Pair model to automatically extract keywords for document search re nement. Balabanovic developed Fab, an adaptive system for Web page recommendation [5]. It represents a user pro le with a single feature vector weighted using TFIDF method. The vector weight is updated using explicit positive or negative user's feedback by increasing or decreasing the weight based on the type of feedback and the learned document feature vector. In its successor, the system has been extended to allow multi-topic user interests representation [2]. In the latest version, the system provides an interface for the users to manage the information presented to them. By monitoring the user clicking and drag-and-drop actions, the system learns implicit feedback from its users. 2. Neural Network Approaches Several neural network techniques have been explored to learn user pro le. Wienar et al. used nonlinear neural network for topic spotting [3]. In their model, nonlinear neural network classi ers extend logistic regression by modeling higher-order term interaction and non-linear decision boundary in the input space. They propose to use

14 modular architecture with task-directed LSI 2 representation to improve performance of the network. In this architecture, a learning problem is decomposed into a set of smaller problems. The rst component on the architecture is meta topic network, which is a network trained on the full training set. The second component is a set of network groups corresponding to the meta topics. Each group is a separate network trained using only the corpus that corresponds to the meta topic for that group. Given a document as the input, the meta topic network determines the document's topic on high-level, and based on the output of this network, appropriate local network can be deployed. McElligot and Sorensen employ two-layer connectionist model to lter news article on the Internet [13]. In addition to keywords, their system uses the context in which it occurs to represent a user pro le. Their system operates in two modes. The rst mode is the learning phase, where sample documents are presented to the system. The second mode is the comparison phase, where retrieved documents from news sources are ltered out to the user. The system then enters the learning phase to learn additional interesting document. In their two-layer connectionist model, the rst layer is used to receive the input article and build the internal representation of it using script structures created from statistical analysis of word combination frequency. The output of this layer is fed into the input of the second layer. This layer acts as supervisor nodes, which monitor the activity of the rst layer during comparison mode. Another neural network technique so-called Adaptive Resonance Associative Map (ARAM) is incorporated by Tan and Teo to personalize their news system known as 2Latent Semantic Indexing (LSI) is a weighting technique which reduces the di-

mension of a representation using a technique called Singular Value Decomposition (SVD).

15 Personalized Information Network (PIN) [4]. ARAM organizes information into categories according to their semantic similarities. A user pro le is modeled by associating each of the information categories with a relevance factor. In this system, keyword weights and the intensity level of a user interest are encoded directly as the network weights. The weight of a keyword is computed using normalized term frequency. The model allows online learning and integrates user-de ned keywords and system-learned knowledge in a single framework. 3. Genetic Algorithm Approaches A successful application of genetic algorithm to learn user interest is introduced by Sheth [14]. The ltering information agent developed in his system, NewT, is modeled as a population of pro le. Each of the pro les actively searches for documents and recommends them that are close to its pro le. Either positive or negative user's feedback is used to change the pro le by increasing or decreasing the tness of pro le. The pro le contains information about the URL addresses of articles, and other information such as the author, number of line, news group's category etc. The news sources of this system are Usenet network news available on the Internet. The extension of NewT is Amalthea which is developed by Moukas and Zacharia. Amalthea is an evolving, multi agent ecosystem for personalized information ltering, discovery, and monitoring of information sources [15]. This system introduces two different types of agents, namely information ltering agents and information discovery agents. The information ltering agents are responsible for system's personalization and keeping track of user's interests. The information discovery agents are responsible for the fetching of information that the user is interested in and the learning of information resources. Both types of agents evolve by means of genetic algorithm in separate population. Their relationship and interaction are based on simple economic

16 model. The information agents receive ratings directly from the user depending on their performance. Based on these ratings, they assign some credit to the information discovery agent that retrieves the information. The user's rating will a ect their tness and thus a ects their survivability to evolve. User interest in Amalthea is encoded as a feature vector weighted using TFIDF method. This vector is augmented with other information such as author and long-term interest eld. The latter is activated if keywords given to the ltering agents are explicitly speci ed by the user. Another approach inspired by arti cial life theory has been proposed by Menczer et. al [16]. Evolving agents on the approach represent user interest as keywords and the agents evolve based on their energy. Energy, in their term, is a single currency by which agents survive, reproduce, and die. Agents go through a simple cycle in which they receive input from both environment and internal state, perform some computations and act. The action can reduce the agent's energy, but may result in energy intake. Agents that perform task better than average reproduce more and colonize the population. When the agent's energy exceeds a xed threshold value, it produces an o spring by cloning itself. However, it will die as its energy is reduced below zero. Depending on how well a document presented to a user meets the user's interests, the feedback provided by the user can increase or decrease the agent's energy.

C. Limitation of Previous Works Although the performance of systems on previous works improves after learning from user feedback, most of them do not address the e ectiveness of their approaches to adapting to changing interests. Except in works by Sheth and Moukas, their evaluation assumes that the user's interests to be approximated does not change during the

17 evaluation process. In real world, however, this assumption is not completely true because the user's interests always change over time. The changes can be distinguished from the ones of long-term and short-term interests. Long-term interests are interests that result from an accumulation of experiences in a long time-span, while short-term interests are interests from events on day-to-day basis. An approach developed by Sheth and Moukas to address this problem using genetic algorithm is not e ective. Although their systems are able to adapt to the changes of user's interests, it requires a large number of generations and therefore a great amount of e ort from the user to bring the system back to an equilibrium state. Therefore, the capability to adapt to the changes of user's interests e ectively for both long-term and short-term interests simultaneously over time is still the unsolved problem in previous works.

18 CHAPTER III MODELING AND LEARNING USER PROFILE The capability to model and learn user pro le is at the heart of a personalized news agent. Therefore, the main research issue is to designing the pro le structure and its learning algorithm that adapt to changing user interests. In this chapter, the issue will be addressed. Section A discusses the underlying concept behind the modeling user pro le and its structure. Based on the underlying concept, section B presents the use of model and user pro le representation for information ltering. The detail of the learning algorithm is described furthermore in the last section.

A. Modeling User Pro le 1. Rationale behind User Pro le Modeling Software agent performs its task on behalf of its users. An analogy for this situation is human personal assistants (e.g. personal secretary) that always observe the behavior and the needs of their employer. Therefore, the way an agent approximates the model of its users should rudimentarily resemble human personal assistants doing the same task. Based on this assumption, the agent behavior is de ned intuitively. The agent learns user's interests from explicit and implicit feedback. Explicitly, the user tells the agent whether he/she likes or dislikes a document recommended for him/her. Through explicit feedback, the agent gets an idea of the attitude of its user on a piece of information. With this knowledge, the model of its user is then modi ed according to the user's preference. The advantage of explicit feedback is that the agent is con dent about the information needs of its users and thus appropriate action can be taken. A strong positive

19 or negative feedback should result in a signi cant change of pro le. If a feedback indicates that a user really does not like a piece of information from the agent's recommendation, a similar information should not appear in future recommendation. In contrast, a high-rating document should make similar documents to be presented in top ranking among other documents. This poses a reactive behavior of the agent to adapt to a rapid change of its user's interests through strong negative or positive feedback. Besides this behavior, the agent should posses a capability to handle moderate changes of user interests. Through feedback with moderate intensity level, the pro le is updated gradually. The e ect of a moderate negative feedback is to lower the ranking of documents having similar content. Conversely, a moderate positive feedback will put similar documents into higher ranked documents. Consistently giving feedback with moderate intensity level many times will take e ect like a strong feedback. The disadvantage of explicit feedback is to burden users with cognitive tasks. Some researchers believe that an ideal software agent should avoid these tasks. The agent should be able to infer the user's interests implicitly only by observing the user's actions. However, accurately inferring user interests through implicit feedback is a very dicult task and requires other sophisticated tools that are not available today (e.g. a tool that is able to recognize user's facial expression). Although some Simpli ed approaches have been developed to enable this capability, e.g. by observing the duration of user's reading activity, the approaches are still inaccurate to infer user's interests due to the pitfalls of the method itself. Therefore, the need of explicit feedback from the users is inevitable to obtain accurate prediction despite its disadvantage. The capability of agent to infer users' interests from implicit feedback makes the agent more friendly and humanly. In a personalized news agent, an implicit feedback

20 with a low con dence can still be acquired by assuming the following argument. Given a list of article titles, each with sample sentences from the rst paragraph of the article, users are able to predict roughly the content of the articles. If the users select and read the articles on the list, it can be inferred implicitly that in some degree they are interested with the content of the article. Conversely, unread articles during a period of time or one session of news articles presentation can be considered as uninteresting. Based on the articles the users have read or not read, their pro les are modi ed. The learning rate is set to a low value to avoid the e ect of misinterpretation if the assumption fails to hold. The assumption holds in some sense with a low con dence because there are also some exception conditions. Unread articles do not always mean that the articles are not interesting. It could happen if the users are still busy with other things and do not have time to read the articles that might be interesting for them. Similarly, if the users read articles recommended for them, it may be due to their instant curiosity about the content of the articles. After reading the articles, the users might not be interested anymore and do not give explicit negative feedback. The beauty of this approach is that the exception conditions rarely happen. If they really happen, the change is not signi cant so that it does not a ect the user pro le as a whole. However, if the users consistently read articles of the same topic many times without giving positive feedback, the small change from the implicit feedback will accumulate and the ranking of the article goes up gradually. The contrary situation also occurs for consistently unread articles. Users may have multiple interest categories at the same time. Each interest category represents one domain of interest. The number of categories can grow or shrink to adapt to the dynamics of user's interests. If a new area of interest is introduced to the agent by a user, a new category representing the new interest will be created. The positive or negative feedback will change the interest level of

21

[Fig. 1. A 3-descriptor representation of user profile: a profile contains categories Category_1 ... Category_m; each category has a positive, a long-term and a negative descriptor with interest levels $w_p$, $w_{lt}$ and $w_n$; each descriptor holds keywords $k_i$ with weights $w_i$.]

a category dynamically. The interest level of a category receiving negative implicit feedback decreases gradually, and at some point the category is removed from the user profile. Furthermore, an interest category may cover a broad scope and in some cases contain exceptions of interest within the category.

2. The Structure of User Profile

The basic structure of an interest category representation is a feature vector. It consists of a list of keywords weighted according to their degree of importance. The feature vector describes an area of interest that is part of an interest category. In the 3-descriptor scheme, an interest category C is composed of three descriptors: a positive descriptor $d_p^c$, a negative descriptor $d_n^c$, and a long-term descriptor $d_{lt}^c$. Figure 1 illustrates the structure of a user profile using this scheme. The positive and negative descriptors

maintain feature vectors learned from documents with positive and negative feedback respectively, while the long-term descriptor maintains the feature vector of documents obtained from both types of feedback. Each descriptor also has a descriptor weight representing the interest level of the corresponding descriptor in the interest category. The interest weights $w_p^c$, $w_n^c$ and $w_{lt}^c$ describe the level of interest of the positive, negative and long-term descriptors of interest category C, respectively. The values of $w_p^c$ and $w_n^c$ range from zero to one, representing the degree of interestingness and uninterestingness respectively. The value of $w_{lt}^c$ ranges from -1 to 1. A negative value in the long-term descriptor indicates that the interest area is uninteresting, and a positive value indicates that it is interesting. In addition to the long-term descriptor, a document counter dCount is maintained to keep the total number of documents that have been observed for an interest category. Further, a flag status is provided to infer implicit feedback. The flag status $flag^c$ indicates whether at least one article in interest category C has been read during a period or one session of news presentation. The flag status of a category is set to one if an article corresponding to the interest category was read by the user. Formally, the representation of an interest category C can be written as follows.

(3.1)

As mentioned previously, users may have multiple interest categories. The pro le P of a user having n interest categories is then represented as: P rofilep = fCategory1p; Category2p; :::; Categorynpg

(3.2)

With such a pro le structure, a learning algorithm to model user pro les is developed to accommodate intuitively de ned agent behavior as described earlier.

23

B. Information Filtering

Based on the profile structure described in the previous section, the information-filtering process selects the n most relevant documents from a set of documents. For each document in the set, the level of user interest in the document's content is assessed according to the match with an interest category of the profile and the degree of interest in that category. The assessment is calculated as a numeric value that is assigned to the document as its score with respect to the profile being considered. Based on these scores, the documents are ranked, and the n most relevant documents are obtained from the n top-ranked documents. The document score ranges from -1 to 1. A positive score indicates that the document is interesting to some degree. Conversely, a document with a negative score is uninteresting. Given a document feature vector $fv_d$, the score of $fv_d$ based on a profile P is computed as follows.

1. Find the most relevant category C in user profile P. The definition of relevant category will be explained in a later section.

2. Calculate the score of each descriptor in the interest category C with the greatest relevance:

$$w_{pos} = w_p^c \cdot Sim(d_p^c, fv_d)$$
$$w_{neg} = w_n^c \cdot Sim(d_n^c, fv_d) \qquad (3.3)$$
$$w_{long} = w_{lt}^c \cdot Sim(d_{lt}^c, fv_d)$$

3. Compute the final document score as follows:

$$Score(P, fv_d) = \max(w_{long}, w_{pos}) + \min(w_{long}, -w_{neg}) \qquad (3.4)$$

Since each interest category has positive and negative interest, and the descriptors of the two interests have opposite meanings, the final value of the document score is a fusion between the scores of the positive interest $w_{pos}$ and the negative interest $w_{neg}$. The score of the long-term interest $w_{long}$ contributes to either the positive or the negative interest depending on the sign of its value.
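A sketch of this scoring step, using the hypothetical InterestCategory structure and cosine_similarity helper introduced above; relevance() is the category relevance measure defined with equation 3.5 in the next section.

```python
def score_document(profile, fv_d):
    # Step 1: the category with the greatest relevance (equation 3.5, below).
    category = max(profile, key=lambda c: relevance(c, fv_d))
    # Step 2: per-descriptor scores (equation 3.3).
    w_pos = category.positive.weight * cosine_similarity(category.positive.vector, fv_d)
    w_neg = category.negative.weight * cosine_similarity(category.negative.vector, fv_d)
    w_long = category.long_term.weight * cosine_similarity(category.long_term.vector, fv_d)
    # Step 3: fuse positive, negative and long-term interest (equation 3.4).
    return max(w_long, w_pos) + min(w_long, -w_neg)
```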

C. Learning User Profile in Personalized News Agent

The learning algorithm developed in this research allows incremental and online learning, which enables reactive learning as well as learning over the long run. For clarity of presentation, the learning algorithm is first presented as a high-level description, and the details of the algorithm follow.

1. Learning Algorithm

The learning process in a personalized news agent relies on feedback provided by a user. Using the feedback information, which reflects the user's attitude toward a news article, the system changes the internal profile representation of the corresponding user so that the feedback will be incorporated in future information filtering. The feedback consists of a feedback type fbType, a document to be learned $fv_d$, and a learning rate $\alpha$. The feedback type can be positive or negative, representing whether the user likes or dislikes the document's content. The learning rate describes the strength of the user's preference. In general, the algorithm to learn a user profile P is defined as follows.

1. Find the most relevant category C in user profile P
2. If Relevance(C, $fv_d$) < $\theta$ then
3.     LearnNewCategory(P, fbType, $fv_d$, $\alpha$)
4. Else
5.     LearnUserFeedback(P, fbType, $fv_d$, $\alpha$)
6. End if
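A sketch of this dispatch in Python, under the same assumed data structures as before; relevance, learn_new_category and learn_user_feedback are filled in by the remaining subsections, and all names are illustrative.

```python
def learn(profile, fb_type, fv_d, alpha, theta):
    # Find the most relevant category, if any (algorithm line 1).
    category = max(profile, key=lambda c: relevance(c, fv_d), default=None)
    if category is None or relevance(category, fv_d) < theta:
        learn_new_category(profile, fb_type, fv_d, alpha)    # lines 2-3
    else:
        learn_user_feedback(category, fb_type, fv_d, alpha)  # lines 4-5
```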

Given a feedback, a search is performed to find the most relevant interest category for the document to be learned. If the greatest relevance of an interest category falls below a threshold value $\theta$, a new interest category is created. Otherwise, the interest category is modified according to the feedback type, the document feature vector and the learning rate. In a personalized news agent, the explicit positive and negative feedback as well as the implicit positive feedback are triggered by the users, while the implicit negative feedback is triggered by the system when new articles are recommended.

2. Measuring Interest Category Relevance

Given a document to be evaluated, the category relevance measure is used to find the best matching category in a profile. Category relevance is a measure of the similarity between the feature vector of a document and an interest category in the user's profile. Employing cosine similarity, the relevance between an interest category C and a document feature vector $fv_d$ is defined as follows:

$$Relevance(C, fv_d) = \max\{Sim(d_p^c, fv_d), Sim(d_n^c, fv_d), Sim(d_{lt}^c, fv_d)\} \qquad (3.5)$$

Thus, category relevance is the maximum similarity between a document feature vector and either the positive, the negative or the long-term descriptor. In principle, it measures the relevance of the information content of a document rather than the document's interest.

3. Updating Feature Vector

Reactive learning is required to handle a strong feedback so that it takes effect immediately in future recommendations. To allow this behavior, the modification of the feature vector in an interest category may consider the contribution of the feature vector to be learned.

The formula for modifying the feature vector should also be general enough to handle other feedback. Taking all of this into account, the update of the feature vector in an interest category is performed in general as follows:

$$w_{i(C)} = \begin{cases} w_{i(C)} \cdot (1-\alpha) + w_{j(doc)} \cdot \alpha & \text{if } k_{i(C)} = k_{j(doc)} \\ w_{i(C)} \cdot (1-\alpha) & \text{if for all } k_{j(doc)},\; k_{j(doc)} \neq k_{i(C)} \\ w_{j(doc)} \cdot \alpha & \text{if for all } k_{i(C)},\; k_{i(C)} \neq k_{j(doc)} \end{cases} \qquad (3.6)$$

where $w_{i(C)}$ is the i-th weight of either $d_p^c$, $d_n^c$ or $d_{lt}^c$ in category C, $w_{j(doc)}$ is the j-th weight of $fv_d$, $k_{i(C)}$ is the i-th keyword of either $d_p^c$, $d_n^c$ or $d_{lt}^c$ in category C, $k_{j(doc)}$ is the j-th keyword of $fv_d$, and $\alpha \in [0,1]$ is the learning rate.

If both the descriptor of the interest category and the feature vector of the document contain the same keyword, the keyword weights from both are incorporated into the computation. The learning rate $\alpha$ adjusts the contribution of each vector depending on the intensity level of the desired change; its value ranges between zero and one. The learning rate of a strong feedback is high, while the learning rate of an implicit feedback is low. If either the descriptor or the document feature vector lacks the corresponding keyword, only the weight from the one having the keyword is incorporated in the update. Depending on which one owns the keyword, a learning rate of $\alpha$ or $(1-\alpha)$ is used to compute the new weight.
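Because a missing keyword contributes a weight of zero, the three cases of equation 3.6 collapse into a single expression over the union of keywords. A minimal sketch, mutating the descriptor's vector in place:

```python
def update_vector(vec, fv_d, alpha):
    # Equation 3.6 over the union of keywords; a keyword absent from one
    # side simply contributes weight 0 to the blend.
    for kw in set(vec) | set(fv_d):
        vec[kw] = vec.get(kw, 0.0) * (1 - alpha) + fv_d.get(kw, 0.0) * alpha
    return vec
```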

4. Learning Explicit Feedback

As described earlier, a profile is composed of a set of interest categories, each representing a different domain of interest. The feature vectors of the three descriptors in an interest category (i.e., positive, negative and long-term) represent only the area of interest. The interest level of a category is encoded as the interest weight of each descriptor. Because the descriptors represent the same interest category, there is a trade-off between the interest levels of the positive and negative descriptors: increasing the interest level of the positive descriptor might reduce the interest level of the negative descriptor, and vice versa. Additionally, the modification of the long-term descriptor should differ from that of the short-term descriptors (i.e., the positive and negative descriptors), due to the different nature of long-term and short-term interests. Short-term interests tend to change reactively, accommodating changes of interest over a short period, while long-term interests change gradually over a long time span. Given a feedback $fv_d$ that corresponds to an interest category C, the modification of C is performed as follows.

1. Update $w_{lt}^c$, the interest weight of the long-term descriptor. To allow an unlimited gradual change and capture the reluctance of long-term interests to change after learning in the long run, a bipolar sigmoid logistic function governs the change of the long-term descriptor's interest weight. The function ranges from -1 to 1, approaching the lower and upper limits as the argument goes to $-\infty$ and $+\infty$ respectively. Figure 2 illustrates the use of the bipolar sigmoid logistic function to govern the change of $w_{lt}^c$ as a function of its current value and the learning rate $\alpha$. First, the current value $w_{lt}^c(old)$ (the ordinate) is projected onto the abscissa. Second, the learning rate is added to the abscissa value for

[Fig. 2. The nature of changing long-term interests: the bipolar sigmoid logistic curve $f(x) = \frac{2}{1+e^{-\sigma x}} - 1$, with the current weight $w_{lt}(old)$ and the updated weight $w_{lt}(new)$ marked on the curve.]

positive feedback, or subtracted from the abscissa value for negative feedback. The new abscissa value is then projected back to its ordinate value, giving the new value $w_{lt}^c(new)$. The following equations implement the modification of the weight of the long-term descriptor:

$$w_{lt}^c(new) = \begin{cases} f(f^{-1}(w_{lt}^c(old)) + \alpha) & \text{for positive feedback} \\ f(f^{-1}(w_{lt}^c(old)) - \alpha) & \text{for negative feedback} \end{cases} \qquad (3.7)$$

where f(x) is the bipolar sigmoid logistic function:

$$f(x) = \frac{2}{1 + e^{-\sigma x}} - 1 \qquad (3.8)$$

2. Update the long-term descriptor $d_{lt}^c$ with the learned document feature vector $fv_d$ using equation 3.6, setting the learning rate in equation 3.6 to $\alpha$, where:

$$\alpha = \frac{1}{dCount + 1} + \epsilon \qquad (3.9)$$

As more feedback is learned, the contribution of the most recently learned document becomes smaller, and so the previously learned interests are preserved. The constant $\epsilon$ is set to a small value (e.g., 0.05) to prevent a complete stoppage of learning, since $\alpha$ would otherwise converge to 0 in the limit (after learning many documents). Thus, it allows the long-term descriptor to keep learning regardless of the number of previously learned examples.

3. Increase the number of documents dCount in the long-term descriptor by one.

4. Update the positive descriptor for positive feedback, or the negative descriptor for negative feedback, using equation 3.6. Use an appropriate learning rate depending on the intensity level of the feedback.

5. Update the interest level of the positive (negative) descriptor. Depending on the value of the learning rate, the interest level of this descriptor increases from its current value toward one. This ensures that the corresponding descriptor's interest weight (positive or negative) is increased, supporting reactive learning of short-term interests:

$$w_p^c(new) = w_p^c(old) + (1 - w_p^c(old)) \cdot \alpha \quad \text{for positive feedback, or} \qquad (3.10)$$

$$w_n^c(new) = w_n^c(old) + (1 - w_n^c(old)) \cdot \alpha \quad \text{for negative feedback.} \qquad (3.11)$$

6. Compute the similarity of the learned document feature vector $fv_d$ with the negative descriptor for positive feedback, or with the positive descriptor for negative feedback, and reduce the interest weight of that descriptor in proportion to the similarity between the two vectors. This scheme guarantees that no two opposite descriptors (positive and negative) containing similar feature vectors have close interest weight values. The following formulas are used for this update:

$$w^c_{n(new)} = w^c_{n(old)}\left(1 - \alpha\, Sim(d^c_n, fv_d)\right) \quad \text{for positive feedback, or} \qquad (3.12)$$

$$w^c_{p(new)} = w^c_{p(old)}\left(1 - \alpha\, Sim(d^c_p, fv_d)\right) \quad \text{for negative feedback.} \qquad (3.13)$$
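Steps 4 through 6 operate only on the short-term interest weights. A minimal Python sketch of the weight arithmetic follows; it assumes the cosine similarities $Sim(d^c_p, fv_d)$ and $Sim(d^c_n, fv_d)$ have already been computed, and the names are illustrative.

```python
def update_short_term_weights(w_p, w_n, alpha, sim_p, sim_n, positive):
    """Equations 3.10-3.13: reinforce the descriptor matching the feedback
    type and damp the opposite descriptor in proportion to how similar the
    learned document is to it."""
    if positive:
        w_p = w_p + (1.0 - w_p) * alpha      # eq. 3.10: move toward one
        w_n = w_n * (1.0 - alpha * sim_n)    # eq. 3.12: damp by Sim(d_n, fv_d)
    else:
        w_n = w_n + (1.0 - w_n) * alpha      # eq. 3.11
        w_p = w_p * (1.0 - alpha * sim_p)    # eq. 3.13: damp by Sim(d_p, fv_d)
    return w_p, w_n
```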

Depending on the type of feedback (i.e. positive or negative), all steps except step 2 are modified accordingly. In step 1, the interest level of the long-term descriptor increases for positive feedback and decreases for negative feedback. In step 2, the contribution of the document feature vector depends on the number of documents already observed in the category. The learning rate in steps 4 and 5 is chosen according to the intensity level of the feedback: strong feedback uses a high value, and vice versa. In step 6, positive feedback reduces the interest level of the negative descriptor; conversely, negative feedback reduces the interest level of the positive descriptor.

5. Learning New Interest

A new category is added to a user profile if the feature vector of the learned document cannot be classified into any existing category in the profile. To determine whether a document's content belongs to a particular category, a threshold value $\theta$ is defined. If the similarity between a learned document and the descriptor of greatest relevance across the profile's categories is less than $\theta$, a new category representing the new interest is created alongside the existing categories. Let $fv_d$ be the feature vector of the document to be learned and $\alpha$ be the learning rate. The new category is created as follows:

1. Create a template for the new category.

2. Set the long-term descriptor to the feature vector of the learned document and set the descriptor's interest level using equation 3.7, where the interest level $w_{lt(old)}$ is initially zero. This results in the following assignment: $d^c_{lt} = fv_d$ and $w^c_{lt} = f(\alpha)$ for positive feedback, or $d^c_{lt} = fv_d$ and $w^c_{lt} = f(-\alpha)$ for negative feedback, where $f(x)$ is the bipolar sigmoid logistic function.

3. Set the positive descriptor for positive feedback, or the negative descriptor for negative feedback, to the feature vector of the learned document, and set the interest level of the corresponding descriptor using equation 3.10 for positive feedback or 3.11 for negative feedback. The initial value of $w^c_{p(old)}$ or $w^c_{n(old)}$ is zero, so these variables receive the following assignments: $d^c_p = fv_d$ and $w^c_p = \alpha$ for positive feedback, or $d^c_n = fv_d$ and $w^c_n = \alpha$ for negative feedback.

4. Set the negative (positive) descriptor for positive (negative) feedback to an empty vector and set the interest level of the corresponding descriptor to zero: $d^c_n = \{\}$ and $w^c_n = 0$ for positive feedback, or $d^c_p = \{\}$ and $w^c_p = 0$ for negative feedback. A sketch of these four steps follows the list.
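The four steps can be collected into one constructor. The sketch below reuses f from the earlier sketch and represents a descriptor as a (feature vector, interest weight) pair; the dictionary layout and the flag field (used by the implicit feedback handling of the next section) are illustrative assumptions, not the prototype's actual data structures.

```python
def create_category(fv_d: dict, alpha: float, positive: bool) -> dict:
    """Steps 1-4: seed a brand-new interest category from a single document.
    fv_d maps terms to normalized weights; f is the bipolar sigmoid (eq. 3.8)."""
    return {
        "long_term": (dict(fv_d), f(alpha if positive else -alpha)),
        "positive":  (dict(fv_d), alpha) if positive else ({}, 0.0),
        "negative":  ({}, 0.0) if positive else (dict(fv_d), alpha),
        "dCount": 1,   # one document learned so far
        "flag": 1,     # assumed read-activity flag, see section 6
    }
```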

Fig. 3. A new category from positive feedback (positive descriptor = $fv_d$, long-term descriptor = $fv_d$ with weight $f(\alpha)$, negative descriptor = $\{\}$).

In the above steps, the idea of the algorithm is clear. The long-term descriptor is assigned the learned document's feature vector on both positive and negative feedback. However, the assignment of the positive and negative descriptors is distinguished according to the type of feedback. On positive feedback, the positive


descriptor is set to the feature vector of the learned document and the negative descriptor is left empty. Similarly, on negative feedback the positive descriptor is left empty and the negative descriptor is assigned the feature vector of the learned document. Figures 3 and 4 illustrate the above steps pictorially.

Fig. 4. A new category from negative feedback (positive descriptor = $\{\}$, long-term descriptor = $fv_d$ with weight $f(-\alpha)$, negative descriptor = $fv_d$).

6. Learning Implicit Feedback

Unlike explicit feedback, which can result in significant changes to the descriptors and their interest levels, implicit feedback affects them only slightly. An implicit positive feedback is triggered as a user opens and reads an article presented to them. This feedback yields a very small increase in the interest levels of the long-term and positive descriptors; the more articles are read, the more the weights increase. Further, the decrease in the interest level of the negative descriptor caused by implicit positive feedback is proportional to the similarity of the opened document


with the negative descriptor. However, since the confidence in this feedback is low, the reduction of the interest level is limited to a small range. The following steps modify the user profile on implicit positive feedback.

1. Apply the explicit feedback learning described earlier using a low learning rate $\alpha$.

2. If $flag = 0$, set the value of $flag$ to one.

If over a period of time no article of a particular interest category has been read, a penalty is given to the corresponding interest category by triggering an implicit negative feedback. The steps to modify the user profile on implicit negative feedback are given below:

$$w^c_{p(new)} = \begin{cases} \gamma\, w^c_{p(old)} & \text{if } flag = 0 \\ w^c_{p(old)} & \text{otherwise} \end{cases} \qquad (3.14)$$

$$w^c_{n(new)} = \begin{cases} \gamma\, w^c_{n(old)} & \text{if } flag = 0 \\ w^c_{n(old)} & \text{otherwise} \end{cases} \qquad (3.15)$$

$$w^c_{lt(new)} = \begin{cases} \gamma\, w^c_{lt(old)} & \text{if } flag = 0 \\ w^c_{lt(old)} & \text{otherwise} \end{cases} \qquad (3.16)$$

where the value of the $\gamma$ parameter approaches one (e.g. 0.97). If the flag is equal to one, set the flag to zero. The penalty results in a reduction of the interest level on all three descriptors. If this happens frequently, the interest levels of the three descriptors will fall near zero, and at some point the corresponding interest category will be removed from the profile.
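The penalty of equations 3.14 through 3.16 is a uniform decay of the three interest weights. A minimal sketch, assuming the category layout of the earlier sketch and a decay constant of 0.97:

```python
GAMMA = 0.97  # decay parameter of equations 3.14-3.16; close to one

def implicit_negative_feedback(category: dict) -> None:
    """If no article of this category was read in the last period (flag == 0),
    decay all three interest weights by GAMMA; otherwise clear the flag so
    that inactivity in the next period triggers the penalty."""
    if category["flag"] == 0:
        for key in ("positive", "negative", "long_term"):
            vec, w = category[key]
            category[key] = (vec, GAMMA * w)
    else:
        category["flag"] = 0
```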

CHAPTER IV

SENSITIVITY ANALYSIS

Because of the complex interaction among model parameters and the variety of the input data, an experimental sensitivity analysis was conducted instead of an analytical one. This chapter presents the results of two experiments that observe the sensitivity of the two major model-governing parameters. One experiment observes the sensitivity of the learning rate in changing a document's ranking position. The other observes the sensitivity of the threshold value on the proliferation of new categories. The results characterize the model, and the model's behavior is then set through an appropriate choice of parameter values.

A. Learning Rate Sensitivity

Experimental analysis of learning rate sensitivity tries to answer the following question: given a positive or negative feedback with learning rate $\alpha$, how much will the ranking position of a document change after the system has learned the feedback? The experiment uses about 40 to 50 HTML documents about money, retrieved from online newspapers and magazines such as UsaToday, Fortune and Business Week. Let these documents be document set D. The following are the steps of the experiment observing the sensitivity of the learning rate using positive feedback.

1. Generate a profile by giving random positive and negative feedback with equal probability using documents from document set D.

2. Rank all documents in D based on the profile generated in step 1.

3. Select the lowest-ranked document d and use it for feedback with the learning rate value under observation.

4. Learn the feedback.

5. Rank all documents with the modified profile and record the new ranking position of document d.

6. Repeat steps 1 through 5 using 40 other learning rates ranging from 0.0 to 1.0.

The procedure above is also used, with a slight modification, to observe the sensitivity of the learning rate using negative feedback: the highest-ranked document is selected instead of the lowest one, and negative feedback is given in step 3 instead of positive feedback.

Figure 5 shows the experimental results, expressed as percentiles and averaged over 50 experiments using 50 different document sets. Each document set is formed by random selection from 100 documents about the same topic. As illustrated in the figure, the curve has a smooth transition over the learning rate. A learning rate below 0.2 or above 0.8 does not result in a significant change in ranking position. Between those two values, the rate of change of the ranking position is almost constant with respect to the learning rate. The results suggest what learning rate to use for a given expected effect: a learning rate above 0.8 for strong feedback, between 0.4 and 0.6 for moderate feedback, and around 0.3 for weak feedback.

Fig. 5. The effect of learning rate on the change of a document's ranking (ranking change, -100% to 100%, vs. learning rate, for positive and negative feedback).

The results also suggest that inferring from implicit positive feedback, which is very weak feedback, should use a learning rate below 0.2. Such a rate changes the profile only slightly and causes no significant impact if the assumption behind inferring implicit positive feedback fails to hold. However, it gradually changes the user's profile and yields a significant effect if users are consistent in their interests, so that frequent implicit positive feedback is generated. In this experiment, the effect of feedback on the ranking of other documents was also observed. For each learning rate value there is a slight shift in the relative ranking of some documents, but in general the constellation of documents in the ranking does not change. Since all documents used are of the same topic, this shows that the scheme handles exceptions within an interest category nicely. The measurement loop of this experiment is sketched below.
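The following sketch outlines steps 1 through 6 of the procedure; build_profile, rank and learn are placeholders standing in for the system's actual routines, which are not spelled out in the thesis text.

```python
def rank_shift(docs, alpha, build_profile, rank, learn):
    """Steps 1-5 for a single learning-rate value: returns the new ranking
    position of the previously lowest-ranked document after one positive
    feedback at rate alpha."""
    profile = build_profile(docs)             # step 1: random +/- feedback
    ranked = rank(profile, docs)              # step 2
    d = ranked[-1]                            # step 3: lowest-ranked document
    learn(profile, d, alpha, positive=True)   # step 4
    return rank(profile, docs).index(d)       # step 5

# Step 6: sweep learning-rate values between 0.0 and 1.0, e.g.
# shifts = [rank_shift(D, a / 40, build_profile, rank, learn) for a in range(41)]
```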


B. Threshold Sensitivity

The threshold value, as mentioned previously, determines the proliferation of new categories within a profile. If the threshold is set to zero, only one category is created, regardless of the number of documents used to form the profile. In contrast, setting the threshold to one results in as many categories in the profile as there are documents used for feedback; in that case, the learning process resembles an instance-based learning technique. Intuitively, a profile with more categories derived from the same document set is better than one with fewer categories. However, a profile with more categories requires more memory to maintain and slows processing during feedback learning and information filtering. This becomes a real problem for learning in the long run as more and more documents are learned. Therefore, there is a trade-off among space, processing time and learning accuracy that needs to be considered.

An experimental analysis of threshold sensitivity was conducted to answer the following question: given a threshold value, how many categories will be generated when learning from a document set? In this experiment, 561 documents covering about six broad categories are used as the feedback source. The learning rate is chosen randomly, and positive and negative feedback are learned with equal probability. After 100 documents have been used to generate the profile, the number of categories is counted. The experiment is repeated for threshold values ranging from 0.0 to 1.0. Figure 6 summarizes the results, averaged over 30 experiments for each threshold value. The ordinate r expresses the ratio of the number of categories in the profile to the number of documents used for feedback. A small ratio is obtained

Fig. 6. The effect of threshold on the proliferation of new categories (r = #categories in profile / #documents used for feedback, plotted against threshold).

for threshold values lower than 0.25. The ratio increases almost linearly between thresholds of 0.3 and 0.7. The increase slows after 0.7 as the threshold approaches one, but the ratio reaches exactly one at a threshold of 1.0.

CHAPTER V

PERSONALIZED NEWS AGENT

This chapter describes the implementation of the personalized news agent, including the agent's architecture, its functionality and the detailed process of each of the agent's components.

A. Personalized News Agent Architecture

Figure 7 shows the general architecture of the personalized news agent. It consists of four main components: the news retriever, the learning module, the meta search engine and the user interface. All components reside on the Web server except the user interface, which is partly on the client side. The system manages two types of data: user profiles and meta documents.

The news retriever explores and collects Web pages from online newspapers and magazines. Its activity is initiated by a scheduler that runs as a background process. The scheduler determines when the news retriever starts collecting news articles and from which online newspapers the Web pages should be retrieved. After all documents have been retrieved from the Internet, they are filtered according to the system's user profiles and a personalized Web page is rebuilt for each user. The meta information of the selected documents is stored in a local repository, and unused documents are deleted afterwards.

The learning module builds and maintains a user model. It creates an initial profile for a new user from pre-defined topics the user selects when filling out a subscription form. Based on the user's explicit or implicit feedback on the news recommended to him/her, the learning module modifies the corresponding user model so that the model adapts to the user's needs and interests. The feedback learned by

Fig. 7. An architecture of the personalized news agent (user interface, learning module, meta search engine, news retriever and scheduler on the Web server, with the user profile and local repository as data stores, connected to WWW search engines and to online newspapers and magazines).

the system can be positive or negative, with some degree of confidence. The meta search engine is provided in this architecture to allow users to explore new interests. It searches through several existing search engines. Users can open a Web page returned by the meta search engine and give explicit feedback so that the system learns the content of that page. The user interface manages the interaction between the user and the system. On the server side, it creates an HTML interface on the fly to convey information and messages between the system and the user through an Internet browser (e.g. Netscape, Internet Explorer, Mosaic, etc.). It also redirects messages sent over the Internet from users to the appropriate system components. On the client side (i.e. the user's desktop), the user interface provides a means for users to send messages for particular actions, e.g. giving feedback or reading news.


Fig. 8. The main entry of the personalized news agent.

B. News Agent Functionality

On the front page, shown in Figure 8, the news agent offers a set of news topics (e.g. technology, sports, weather, etc.) for general users. Each topic contains a list of related articles ranked by the similarity of the article to the topic. These topics are the system's pre-defined profiles, whose structure is the same as that of a user profile. Users cannot provide feedback on these articles because they are provided for general users with various interests. However, registered users who have signed on to the system may use the articles to provide feedback so that the system can learn their interests. Another function on the main entry is the meta search engine, which searches for information on the World Wide Web through different search engines available on the Internet. Additionally, new users can subscribe through this page to obtain an account, and registered users can sign on to open their personalized page.


Fig. 9. An example of a personalized page.

Once a registered user has logged in, a personalized news page is presented to the user, as illustrated in Figure 9. The news page contains a list of titles along with sample sentences extracted from the opening paragraphs of the corresponding articles. The article titles are ordered according to their match with the user profile. Right next to each article title is a number representing the system's prediction of the user's interest level in the article. The top of the page contains pointers to the article lists of the pre-defined topics as well as some utilities. The utilities allow users to display their profile (under the link Profile) or ask the system to rebuild their personalized page based on the latest profile (under the link Rebuild). Just before the first article title, an input form is provided to search for pages using the meta search engine. Five different search engines can be used in this implementation: Lycos, Excite, HotBot, Yahoo, and Altavista. Users can use any one of these search engines or all of them.

As a user opens an article, an implicit positive feedback is sent to the system automatically. Furthermore, when the user reads an article, a list of pre-defined user comments is provided at the top of the window, enabling the user to give explicit feedback on the article he/she reads. The comments are listed below.

1. "I like it, always show me article like this" for strong positive feedback.

2. "Interesting" for moderate positive feedback.

3. "Not Bad" for weak positive feedback.

4. "Not interesting" for moderate negative feedback.

5. "Never show me article like this" for strong negative feedback.

In addition to the above feedback, the system sends implicit negative feedback to a user profile that shows no reading or learning activity over a period of time.

C. Agent's Component Implementation

The implementation of the agent's components uses several programming tools, including C, Perl and HTML. HTML serves as the interface for interacting with users through the browser. Perl scripts pass information sent from the user's browser to the appropriate agent components. The C language is used to implement the news retriever, the learning module and the meta search engine.

1. News Retriever

The news retriever periodically retrieves news articles from online newspapers and magazines. This module runs as a background process, and the retrieval process from

the Internet is initiated by a scheduler. At start-up time, the scheduler loads a news retrieval schedule defined in a file. The file (see Appendix A) contains the name of each entry site, its URL address, the maximum depth of exploration, the type of news source (daily, weekly or monthly) and the time of retrieval. Given an entry site, the document-retrieval process starts from depth level zero. The site is explored by following HTML document links until a certain depth limit is reached. Based on exploration experiments, following links to depth level two is enough to obtain representative documents from any online news source in an acceptable time; exploring to depth level three or more results in a very long retrieval process (e.g. more than 3 hours). During the exploration of a site, visited pages are recorded to avoid retrieving the same page more than once. The document-retrieval process then proceeds in a depth-first-search manner until no more pages can be explored.

Once a document has been retrieved and its links have been extracted, a quick check determines whether the document contains a news article. A simple heuristic detects the existence of a reasonably long paragraph in the document: in this implementation, a minimum of 80 characters within a paragraph tag is required for a document to be eligible as containing a news article. If the document does not contain a news article, it is removed.

After all documents from an online source have been retrieved, the next step is to extract the content of each retrieved document. First, the HTML tags, JavaScript and the information within links are removed; since document processing tries to extract the content of the news article, the information contained in document links is not needed. The output of this process is the title, heading and paragraphs of the article in a text file. Noise sometimes occurs if the

HTML document contains other information that survives this removal and is not part of the article's content (e.g. a copyright notice).

The next stage converts the plain-text news article into a document feature vector. At this step, common words (e.g. and, or, thus, etc.) are removed using a stop list (see Appendix B); the stop list used in this prototype contains 293 words. After each extracted word is stemmed using Porter's stemming algorithm, word frequencies are counted and occurrences of two-word phrases are detected and counted. A two-word phrase is identified when the same two successive words occur at least twice during word counting. The feature vector of the document is then obtained by normalizing the word and phrase frequency counts; a sketch of this conversion appears at the end of this subsection. The feature vector, the document's title, sample content from the first reasonably long paragraph, and the document's URL address are saved as the meta document (see Appendix D).

The last stage performed by the news retriever is to score and rank each document according to the profiles of its users. For each user, the score of each document is calculated based on the user's profile, and the top 30 documents are kept, ordered by decreasing document score. This information is used to generate a personalized page when the corresponding user signs on to the system. Once retrieval and processing of the retrieved documents have been completed, the corresponding online source is marked with a status flag to indicate that it has been explored. The scheduler then waits for the next retrieval schedule.
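The conversion stage lends itself to a compact sketch. The Python code below assumes a stop-word set and a stemming function (standing in for Porter's algorithm) are supplied, and normalizes by the total count; the exact normalization used by the prototype is not spelled out in the text.

```python
import re
from collections import Counter

def to_feature_vector(text: str, stop_words: set, stem) -> dict:
    """Remove stop words, stem, count unigrams plus repeated two-word
    phrases, and normalize the counts into a feature vector."""
    words = [stem(w) for w in re.findall(r"[a-z]+", text.lower())
             if w not in stop_words]
    counts = Counter(words)
    # a two-word phrase qualifies if the same successive pair occurs >= twice
    bigrams = Counter(zip(words, words[1:]))
    counts.update({f"{a} {b}": n for (a, b), n in bigrams.items() if n >= 2})
    total = sum(counts.values()) or 1
    return {term: n / total for term, n in counts.items()}
```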

2. Learning Module

The main task of the learning module is to learn the user profile and adapt to changes in the user's interests. As mentioned previously, the system learns the profiles of its users from explicit and implicit feedback. Explicit positive or negative feedback is obtained from the user's comments, and implicit positive feedback is sent by the user's browser as the user opens a document on the personalized page. Additionally, the learning of implicit negative feedback is triggered by a scheduler once a day, at the beginning of the document-retrieval process.

The initial profile of a new user is created from the information provided during the registration process. In the subscription form, the user is given a list of interests to choose from. The options cover various interest categories such as sports, finance, technology, weather, health and world news. For each interest category, the system has a corresponding feature vector acquired by averaging the feature vectors of documents in the category. These feature vectors are used to build the initial profile of the new user after the completed subscription form is submitted.

When learning feedback, the learning rate for strong explicit feedback is set to 0.9. This ensures that, after learning the feedback, similar documents in future information filtering will be scored high for positive feedback or low for negative feedback. The learning rate for moderate explicit feedback is set to 0.5 to allow moderate changes to the user's profile, and a learning rate of 0.2 is used for weak positive feedback so that only a slight change results. To give an effect of gradual change for implicit positive feedback, a learning rate below 0.2 is used. When this module processes feedback, the source of the document to be learned can be a meta document in the system's local repository or a Web page from the Internet: the former if the feedback is given on a document provided by the system, the latter if it is given on a search result from the meta search engine.

3. Meta Search Engine

Searching through the meta search engine is carried out by sending a search command to a search engine available on the Internet and parsing the search results. As stated before, five different search engines are used in the construction of the meta search engine. Each has a different search query syntax and a different search output format; as a result, separate code is written for each search engine to query it and parse its output. After a search engine's response is received by the meta search engine, the titles and URL addresses are extracted and a customized meta search engine output is generated. In the customized results, a small piece of code is inserted into every link to a Web page. When the link is clicked, this code invokes a Perl script that generates a frame for feedback controls, allowing the user to rate a document from the search results and ask the system to learn from it.

4. User Interface Module

On the client side, the user interface takes the form of HTML documents running in the user's browser. The interface in the browser allows users to open the front page, register, sign on to the personalized page, send search queries, open news articles, and give feedback. Some of the user interface code is static, but most of it is generated dynamically on the fly. On the server side, the user interface consists of Perl scripts activated from the user interface in the user's browser. The scripts parse the information sent by the user and process it according to the user's request. A request can be processing a subscription form or user feedback, opening a personalized news page, searching for a Web page, validating a user's identification and password, or handling other utility

commands such as looking up the content of a user profile and rebuilding the customized news page based on the latest profile. Some of the scripts work independently, while others invoke the learning module and the meta search engine for further processing.

a. Processing New User Registration

Through the front page, a new user can subscribe and obtain an account for the personalized page feature. A subscription form is issued for the user to complete at registration time. Some information must be supplied, such as the user identification and password, while other fields are optional. The user is also provided with a list of pre-defined interest categories to choose from according to his/her preference. When the user presses the submit button, the information on the form is sent to the server and the corresponding script file is invoked. The script parses the information and checks the validity of the user's identification and password. The information provided in the subscription form is valid if the user's identification is unique and the two given passwords agree. All information is then added to a user table stored in an Oracle database. Processing of the registration form continues by extracting the user's interest information and invoking the learning module to create a new user profile for the information-filtering process. After the processing of the subscription form is completed, the user is notified.

b. Opening Personalized News Page

Registered users can open their personalized page after providing a valid user identification and password. Validation is performed by cross-checking them against the user table. If there is no problem in this process, a personalized news page is created on the fly, using the

output obtained from the news retriever. The page contains a list of 30 news titles and sample news content, ordered by decreasing user preference. The title of each news article serves as a link that opens the corresponding article, and a small piece of code inserted into this link invokes a Perl script. This script generates a frame for feedback control in the user's browser.

c. Handling User's Feedback

When a user opens a news article by clicking its title on the personalized page, a frame with two regions is generated in the user's browser. The top frame, about 15% of the client area, is used for feedback control, while the rest of the client area displays the news article. In the bottom frame, users can scroll up and down the news article without affecting the appearance of the feedback control in the top frame. In the feedback control area, five different ratings are provided for user feedback, ranging from strong positive feedback on the far left to strong negative feedback on the far right. A small code is embedded in each of the comments to identify the type of feedback and the document identifier used for the feedback. The document identifier can be the name of a meta document stored in the system's local repository or the URL address of the Web page opened in the bottom frame. Clicking a comment invokes a Perl script on the server to process the user's feedback. Like the other Perl scripts, the script used for feedback processing first parses the feedback information sent from the user's browser. The learning module is then invoked for further processing, and a message is sent to the user at the end of processing. At this point, the user can see the effect of the feedback by clicking the Rebuild or Profile link.

d. Performing Search Request

The search process is invoked through a search form available on any page in the user's browser. The Perl script handling the search is invoked when the user presses the go button on the search form. The search is performed according to the search engine specified by the user. If the user specifies only one search engine, the query is sent to that engine by invoking the meta search engine, and the search results are sent back to the user's browser. However, if all search engines are requested, each engine is queried sequentially: once results are received from one engine, they are merged with the results of the previous engines and the next engine is queried.

CHAPTER VI

PERFORMANCE EVALUATION

A performance evaluation was conducted to measure the performance of the proposed scheme and learning algorithm. Various performance measures can be used for this purpose; depending on the approach adopted, different techniques may use different evaluation criteria and experiment procedures. The first two sections of this chapter describe the criteria used to evaluate the model's performance and the procedure for performing the experiments. The use of several criteria is intended to observe the behavior of the model and its learning algorithm from different perspectives. The experimental results are then described and analyzed for each of the performance criteria.

A. Performance Evaluation Criteria

Five different performance criteria are employed to measure and observe the behavior of the proposed scheme: accuracy, ndpm (normalized distance-based performance measure), the Spearman rank-order correlation coefficient, average document score, and precision versus recall. These experiments basically involve filtering a set of documents using a learned profile and comparing the result to the documents selected by the target profile. To avoid confusion, some terms are defined first. The user profile Pu denotes the target user profile to be modeled by the system. The system's user profile Ps refers to the user model learned by the system. Additionally, some document sets involved in the computation are defined. Dretr denotes the set of documents recommended by the system. Dnews denotes the set of documents that serves as the source from

which Dretr is filtered using Ps. Further, another document set represents the documents preferred by the target user profile: this set is Dpref, which is also filtered from Dnews, using Pu. The document sets Dretr and Dpref contain the same number of documents.

1. Accuracy

Accuracy refers to the proportion of documents that the user and the system agree on choosing from the same document set. The two document sets used for comparison are Dpref and Dretr, both filtered from Dnews as the reference. The formal definition of accuracy is expressed in the following equation.

$$Accuracy(D_{pref}, D_{retr}) = \frac{1}{|D_{retr}|} \sum_{d \in D_{retr}} f(d) \qquad (6.1)$$

where $f(d) = 1$ if $d \in D_{pref}$, and 0 otherwise. The user and the system each rate the documents and take the n highest-ranked ones. The accuracy is then the percentage of documents common to the n documents chosen by the user and by the system.

2. Normalized Distance-based Performance Measure

User studies have shown that human judges are better and more consistent over time at making relative judgments than absolute judgments. ndpm measures the distance between the user's and the system's relative rankings [17]. The measure is defined over pairs of documents in the ranking. In the system, the relative ranking of every pair of documents in the set being evaluated is derived from the numeric ratings computed using equation 3.4. Let $\succ$ be a preference relation on a document set Dretr such that if di and dj

represent two different documents and P is the profile used to rate the documents, then $d_i \succ d_j \Leftrightarrow Score(P, d_i) > Score(P, d_j)$. If $\succ_u$ is the ranking predicted by the target user profile and $\succ_s$ is the ranking predicted by the system's user profile, then the distance between two documents $d_i, d_j \in D_{retr}$ with respect to these rankings is defined as follows:

$$\delta_{\succ_u,\succ_s}(d_i, d_j) = \begin{cases} 2 & \text{if } (d_i \succ_u d_j \wedge d_j \succ_s d_i) \vee (d_i \succ_s d_j \wedge d_j \succ_u d_i) \\ 1 & \text{if } \neg(d_i \succ_s d_j \vee d_j \succ_s d_i) \wedge d_i \succ_u d_j \\ 0 & \text{otherwise} \end{cases} \qquad (6.2)$$

Based on the user's and the system's partial orderings of documents, the ranking distance of a document pair is two if the user's and system's orderings of the pair contradict one another, and one if the system is indifferent on a pair for which the user has a preference. The full measure is obtained by summing the ranking distance over all document pairs in the set and normalizing by twice the total number of document pairs. Performance under this measure is best when the ndpm value approaches zero.

3. Spearman Rank-order Correlation Coefficient

Let Ui be the rank of document di based on the user's ranking, and Si be its rank based on the system's ranking. The Spearman rank-order correlation coefficient rs is then defined by the following equation.

$$r_s = 1 - \frac{6 \sum_{i=1}^{|D_{retr}|} (U_i - S_i)^2}{|D_{retr}|^3 - |D_{retr}|} \qquad (6.3)$$

Thus, the coefficient rs is derived from the distance between the user's and system's ordinal rankings. According to this performance measure, perfectly agreeing user and system rankings have a coefficient of 1. Conversely, a coefficient

of -1 results from two rankings with opposite orderings. Rank correlation has the advantage of being more robust and more resistant to unanticipated defects in the data.

4. Average Document Score

The document score represents a profile's rating of the document being evaluated. It is the raw score derived directly from equation 3.4, and its value ranges from -1 to 1; negative and positive scores indicate that the document is uninteresting or interesting, respectively. If the needs and interests of a human user can be approximated by the user profile model (i.e. the target user profile), the document score reflects the user's absolute judgment of the document. Given a document set Dretr, the performance measure based on the average document score is given below.

$$Score(P_u, D_{retr}) = \frac{1}{|D_{retr}|} \sum_{d \in D_{retr}} Score(P_u, fv_d) \qquad (6.4)$$

The average document score is computed over every document in the recommendation set. The scoring is performed by the user, and performance under this measure is best when the average document score approaches one.

5. Precision versus Recall

Precision versus recall is a typical performance measure in information retrieval systems. The procedure to evaluate performance using this measure differs from those described earlier. Table III describes the parameters used to compute precision and recall, assuming relevance is measured as a binary value. Based on these parameters, precision and recall are computed as follows.

$$Precision = \frac{a}{a+b} \qquad (6.5)$$


Table III. Parameters of precision and recall.

                Relevant   Irrelevant
Retrieved          a           b
Not Retrieved      c           d

$$Recall = \frac{a}{a+c} \qquad (6.6)$$

where a is the number of relevant documents retrieved, a + b is the number of retrieved documents, and a + c is the total number of relevant documents in the collection. Precision is the proportion of retrieved documents that are relevant, while recall is the proportion of all relevant documents in the collection that are retrieved. The precision of the retrieved documents is then measured at each recall value, ranging from zero to one. The two are usually inversely related: adjusting parameters to raise precision often lowers recall, and vice versa.
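To make the first three criteria concrete, the following sketch computes them from plain dictionaries of scores and ranks; the function names are illustrative and not part of the thesis prototype.

```python
from itertools import combinations

def accuracy(d_pref: set, d_retr: list) -> float:
    """Equation 6.1: fraction of recommended documents the user also chose."""
    return sum(1 for d in d_retr if d in d_pref) / len(d_retr)

def ndpm(user_scores: dict, sys_scores: dict) -> float:
    """Equation 6.2 summed over all document pairs and normalized by
    2 * #pairs; the dicts map a document id to its profile score."""
    docs = list(user_scores)
    dist, pairs = 0, 0
    for di, dj in combinations(docs, 2):
        u = user_scores[di] - user_scores[dj]
        s = sys_scores[di] - sys_scores[dj]
        if u * s < 0:
            dist += 2          # contradictory orderings
        elif s == 0 and u != 0:
            dist += 1          # system indifferent, user has a preference
        pairs += 1
    return dist / (2 * pairs)

def spearman(user_rank: dict, sys_rank: dict) -> float:
    """Equation 6.3 from two rank assignments (1-based ranks)."""
    n = len(user_rank)
    d2 = sum((user_rank[d] - sys_rank[d]) ** 2 for d in user_rank)
    return 1 - 6 * d2 / (n ** 3 - n)
```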

B. Methodology

1. The Procedure of Experiment

Figure 10 illustrates the procedure used to evaluate the model's performance. In this experiment, artificial users are used in place of human users. The advantages of using artificial users are twofold: first, to ensure that users are consistent in rating document relevance based on their profiles; second, to cut down the evaluation

Fig. 10. A diagram of the experiment procedure (a simulated user generates feedback from a target user profile according to a feedback scenario; the system learns the feedback into its user profile, retrieves and ranks documents from the document collection, and both profiles evaluate the ranked documents).

time, which could take several months with human users. It is believed that if the model can adapt to the dynamics of a user's interests in a simulated environment, it will also be able to adapt to human users in the real world, given enough information sources. Figure 10 shows the procedure for measuring the model's performance based on accuracy, ndpm, Spearman rank-order correlation and average document score. The steps of the experiment are as follows, with details for each step given afterward; a sketch of the overall loop follows the list.

1. Generate a target user profile Pu randomly and set the system's user profile Ps to an empty profile.

2. Rank all documents in the document collection using the target user profile and select the n highest-ranked documents Dext.

3. Retrieve a document set Dnews by selecting m documents randomly from the document collection.

4. Rank the retrieved documents Dnews using the system's user profile and select the n highest-ranked documents Dretr.

5. Measure the average document score, Spearman rank-order correlation coefficient and ndpm of the document set Dretr using the target and system's user profiles.

6. Rank the retrieved documents Dnews using the target user profile and select the n highest-ranked documents Dpref.

7. Measure the accuracy of the system's prediction using the document sets Dretr and Dpref.

8. Give feedback by providing positive and negative examples from the document sets Dext, Dpref and Dretr to the system.

9. Learn the feedback and modify the system's user profile.

10. Do steps 3 through 9 k times.

11. After k evaluations, invert the target user profile and repeat step 10.
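A compact outline of one trial follows; rank, learn, invert and gen_target stand in for the system's routines, which are not given in the text, and accuracy is the function from the earlier metrics sketch.

```python
import random

def run_trial(collection, rank, learn, invert, gen_target,
              k=20, m=200, n=10):
    """One trial of steps 1-11: k evaluations, profile inversion, then
    k more evaluations; returns the per-evaluation accuracies."""
    p_u = gen_target(collection)                    # step 1: target profile
    p_s = {}                                        # empty system profile
    d_ext = rank(p_u, collection)[:n]               # step 2
    accs = []
    for _phase in range(2):                         # before/after inversion
        for _ in range(k):                          # step 10
            d_news = random.sample(collection, m)   # step 3
            d_retr = rank(p_s, d_news)[:n]          # step 4
            # step 5: score, ndpm and Spearman measured here (omitted)
            d_pref = rank(p_u, d_news)[:n]          # step 6
            accs.append(accuracy(set(d_pref), d_retr))   # step 7
            learn(p_s, d_ext, d_pref, d_retr, p_u)  # steps 8-9
        p_u = invert(p_u)                           # step 11
    return accs
```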

The document collection used in this experiment consists of articles in HTML format collected from 12 different online newspapers and magazines (e.g. UsaToday, USNews, etc.) at different times. The collection contains 1427 documents covering six general topics: world news, financial news, health, weather, technology and sports. The length of each document varies, with an average of 228 distinct terms after pre-processing (e.g. removing HTML tags, applying the stop list, identifying bigrams, etc.).

The target user profile in step 1 is generated by providing feedback to an initially empty profile using randomly chosen documents from the document collection. For each experiment, the number of feedback instances used to generate the target user profile varies between 20 and 30. Positive and negative feedback are equally probable, with a random learning rate between zero and one. The target user profile is the one to be approximated by the system, which starts with an empty system user profile.

In step 3, the 200 documents of the set Dnews are selected randomly from the document collection. This step is performed in every evaluation cycle to simulate the changing information content of online newspapers and magazines. The document set Dretr generated in step 4 represents the personalized news articles filtered using the system's user profile. To measure the accuracy of the system's prediction, a document set Dpref is generated by rating the set Dnews using the target user profile; Dpref represents the documents preferred by the user from the same set Dnews. The percentage of documents common to Dretr and Dpref is the accuracy of the system in predicting the user's information needs. The other three performance criteria are measured only on the document set Dretr.

In step 9, the documents used for feedback come from the system's recommendation, the system's exploration and the user's exploration. System's recommendation documents are news articles filtered by the system specifically for a user based on the system's user profile; thus, each user has a different document set for this information source. System's exploration documents are news articles retrieved by the system based on the system's pre-defined profiles that are preferred by the user and are not part of the system's recommendation documents. These profiles cover topics intended for

general users, such as weather, finance and world news. User's exploration documents are news articles the user finds on the Internet using the meta search engine. These three feedback sources are represented by the document sets Dretr, Dpref and Dext, respectively. Because the system learns the user's interests only through feedback, a good feedback scenario is required to guide the system in approximating its user model. In step 11, after the first 20 evaluations and learning from 19 rounds of feedback, the target user profile is inverted. This scenario is intended to observe the system's adaptability to a significant change of user interest. The inversion of the target user profile is performed by swapping the positive and negative descriptors along with their interest weights, and changing the sign of the interest level of the long-term descriptor.

2. Scenario of Feedback

The feedback scenario is designed to reflect the behavior of an ideal human user teaching the system his/her interests. Since the system relies on user feedback to modify its user model, the way feedback is given determines whether the system can approximate the target user profile effectively. Thus, users play an important role in guiding the system as it learns its user model.

a. Rationale for Feedback Scenario

For each feedback, three parameters need to be supplied: the feedback type (i.e. positive or negative), the document to be evaluated, and the learning rate. The feedback type depends on the user's judgment of a document: a document that is interesting from the user's point of view is used for positive feedback; otherwise it is used for negative feedback. Further, the significance of the document determines the intensity with which the system learns the user's interests related

to the document provided. If the document matches the user's current hot topic, a high learning rate should be given. Similarly, if the user dislikes a document and wants the same or similar documents not to appear in the future, a high learning rate should also be given, but with negative feedback.

Besides the similarity of a document to the user's interests, the source of the document used for feedback is also important to consider. The system will not learn effectively if the document used for positive feedback comes from a document set the system itself provided: since that set is generated using the system's user model, giving positive feedback on it introduces no new concept. Conversely, an uninteresting document supplied by the system is a good example for learning negative feedback; the system will identify the corresponding document category in its user model as a negative interest and rate documents similar to that category low. However, negative feedback is unnecessary if the document is not from the system's recommendation: for a better user model, there is no need to teach the system with a negative example as long as the system does not treat it as a positive example in its user model. If the system maintains document sets containing its recommendations to other users, interesting documents from those sources that were not recommended for a specific user are a good source of positive feedback, allowing the system to learn a new concept not covered in the current user model. Another source of effective feedback is documents from the user's own browsing that cannot be found in the document sets managed by the system; from this source, the system learns to broaden its scope of interesting topics beyond its current domain.

Another important aspect is the user's judgment of the documents. If most documents in the recommended set are uninteresting, the system needs to learn more, through more feedback with a high learning rate. However, if most of

the documents match the user's current interests, the system's need to learn is low. In this case, the user is very satisfied with the system's performance and the probability of giving feedback is very low.

b. Algorithm for Feedback Scenario

Two scenarios are developed to handle the two different feedback sources. The first scenario is used when the document used for feedback comes from the system's recommendation. The algorithm for the first scenario is as follows.

1. Compute the average document score in Dretr using Pu.
2. If the average score is positive then
3.   Do either step 4 or step 5 with equal probability.
4.   If the score of the lowest-ranked document in Dretr is negative, give negative feedback with learning rate lrate1.
5.   If the score of the highest-ranked document in Dretr is positive, give positive feedback with learning rate lrate2.
6.   Pick another document with a positive document score at random and give positive feedback with learning rate lrate2.
7. Else
8.   Give negative feedback using the lowest-ranked document with learning rate 0.9.
9.   If the score of the highest-ranked document is positive, give positive feedback with learning rate 0.9.
End

Fig. 11. A relation between user satisfaction and learning rate (lrate1 vs. user satisfaction).

The learning rate lrate1 in step 4 is derived from the following function (see Figure 11):

$$lrate1 = \frac{1}{e^{3\,Score(P_u, D_{retr})}} \qquad (6.7)$$

The function is chosen to capture the relation between user satisfaction and the system's need to learn from negative feedback. If the user satisfaction level is low, the learning rate is set high according to equation 6.7; conversely, a low learning rate is used if the user satisfaction level is high. In step 5, the learning rate lrate2 is derived directly from the score of the document used for feedback, based on the target user profile.

The second scenario is used for feedback obtained from the system's and the user's exploration. The following steps make up the algorithm for this scenario.

1. Compute the average document score in Dretr using Pu.

Fig. 12. A relation between document score and learning rate (lrate3 vs. document score).

2. Set the probability of giving feedback, prob1, to a value derived from equation 6.7.

3. Pick the highest-ranked document from a document set that is not in Dretr and, with probability prob1, give positive feedback with learning rate lrate3.

The probability prob1 that feedback is given is set with reference to user satisfaction: feedback is unlikely if the user satisfaction level is high, and likely otherwise. The learning rate lrate3 for positive feedback is set according to equation 6.8 (see Figure 12).

$$lrate3 = \frac{2}{1 + e^{-5\,Score(P_u, d_i)}} - 1 \qquad (6.8)$$

Both equations 6.7 and 6.8 are chosen to describe the non-linear relation between their two parameters. A sketch of the two rate functions and the exploration scenario follows.
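In the sketch below, give_feedback is a placeholder for the system's learning entry point, candidates are (document, score) pairs, and clamping lrate1 to one when it is used as a probability is an added assumption of this sketch.

```python
import math
import random

def lrate1(avg_score: float) -> float:
    """Equation 6.7: the lower the user satisfaction, the higher the rate."""
    return math.exp(-3.0 * avg_score)

def lrate3(score: float) -> float:
    """Equation 6.8: a bipolar sigmoid of the document's target-profile score."""
    return 2.0 / (1.0 + math.exp(-5.0 * score)) - 1.0

def exploration_feedback(candidates, avg_score, give_feedback):
    """Second scenario: with probability prob1 (eq. 6.7, clamped to one),
    give positive feedback on the best exploration document at rate lrate3."""
    prob1 = min(1.0, lrate1(avg_score))
    doc, score = max(candidates, key=lambda c: c[1])
    if random.random() < prob1:
        give_feedback(doc, positive=True, rate=lrate3(score))
```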

C. Performance Comparison

To compare the performance of the 3-descriptor and single-descriptor schemes, the user profile learning algorithm employed in the WebMate system is used as the baseline [1]. The algorithm is chosen for the following reasons. First, it has a similar user profile representation: a profile in WebMate is a list of interest categories, each representing a different domain of interest, with feature vectors weighted using TFIDF. However, an interest category in WebMate is represented as a single-descriptor model rather than with the 3-descriptor approach. Second, it clusters the training examples based on the greatest similarity between a learned example and one of its interest categories. The difference between WebMate's clustering strategy and the one developed in this thesis is that in this thesis a new cluster is created based on a threshold value and the number of clusters is unlimited, whereas the number of clusters in WebMate is fixed: a new cluster for a newly learned example is created only while the number of clusters in the profile is below its upper limit. The following is the detailed WebMate algorithm for learning the user profile, adapted for the purpose of this performance comparison.

1. Pre-process the HTML document by removing the HTML tags, deleting stop words, stemming words and identifying bigrams (two-word phrases).

2. Weight the extracted keywords using TFIDF and let the resulting vector be $V_i$.

3. If $|V| < N$ then $V \leftarrow V \cup \{V_i\}$.

4. Otherwise, calculate the cosine similarity between $V_i$ and every vector in $V$.

Let $V_m$ be the vector in $V$ with the greatest similarity. Then compute $V_{m(new)} = V_{m(old)} + V_i$ for positive feedback, or $V_{m(new)} = V_{m(old)} - V_i$ for negative feedback (optional).

5. Sort the new weights and keep the highest M elements.

To make the algorithm comparable with the one developed in this thesis, two-word phrase identification has been added to the original algorithm. Furthermore, the reward for keywords appearing in the document's title and header applied in the original algorithm has been eliminated in the modified algorithm. An important difference is that WebMate was not originally designed to learn from negative feedback; to make the comparison fair, a version of the system that can do this was implemented. Negative feedback is learned by subtracting the feature vector of a learned document from the matching category in the profile. In the following sections, 1-descriptor-A denotes the modified single-descriptor algorithm that learns from both positive and negative feedback, 1-descriptor-B denotes the one that learns only from positive feedback, and 3-descriptor refers to the algorithm developed in this research. A sketch of the baseline update appears below.
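The adapted baseline reduces to a short vector-update routine. The sketch below uses sparse dictionaries for the TFIDF vectors; the defaults N = 10 and M = 90 reflect the settings reported later in this chapter and are otherwise assumptions.

```python
import math

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def webmate_update(v_list: list, v_i: dict, positive: bool,
                   n_max: int = 10, m_keep: int = 90) -> None:
    """Steps 3-5 of the baseline: start a new cluster while room remains,
    otherwise fold the example into the most similar cluster and keep the
    m_keep highest-weighted terms."""
    if len(v_list) < n_max:                            # step 3
        v_list.append(dict(v_i))
        return
    v_m = max(v_list, key=lambda v: cosine(v, v_i))    # step 4
    for term, w in v_i.items():
        v_m[term] = v_m.get(term, 0.0) + (w if positive else -w)
    for term in sorted(v_m, key=v_m.get, reverse=True)[m_keep:]:   # step 5
        del v_m[term]
```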

D. Experimental Results

The experimental results described in this section are averaged over 28 trials, each consisting of 40 evaluations. Figure 13 depicts the accuracy of the 3-descriptor model over threshold values, using 40, 90 and 190 keywords. The keywords are weighted using TFIDF and taken as the n highest-weighted terms after a document has been pre-processed (e.g. stop list removal, word stemming, etc.).

Fig. 13. The accuracy of the 3-descriptor model over threshold (accuracy, %, vs. threshold for 40, 90 and 190 keywords).

Based on these results, the performance peak is achieved at a threshold of 0.25, and performance tends to degrade at higher threshold values. This tendency is a bit surprising: one would expect performance to improve as the threshold rises, even though a higher threshold is undesirable due to the larger profile size. However, the experimental result is a fortunate coincidence, because no performance-versus-memory trade-off is needed to determine the best threshold value.

Table IV describes the accuracy of the 3-descriptor model over different numbers of keywords used to represent an area of interest, with the threshold set to 0.25. The second and third columns are the averages of accuracy over the first and second 20 evaluations, representing the performance before and after the profile inversion respectively. The table shows that the performance peak is obtained with 90 keywords, and performance degrades with fewer or more than 90. Having fewer keywords removes important keywords, while using too many tends to include unimportant keywords or noise. Those

conditions result in an inaccurate representation of the document's content.

Table IV. The accuracy of the 3-descriptor model over different numbers of keywords.

Number of    Before Profile   After Profile   Average
Keywords     Inversion        Inversion
20           53.14            46.65           49.89
30           55.11            46.51           50.81
40           56.12            46.47           51.29
60           57.54            48.65           53.10
70           57.46            49.10           53.28
90           57.82            48.98           53.40
130          54.61            51.11           52.86
190          55.75            49.46           52.60
220          54.00            44.27           49.13

Based on the above experimental results, for the purpose of the performance comparison between the 3-descriptor and single-descriptor approaches, the number of keywords used to represent a document's content is limited to 90. The maximum number of clusters for the single-descriptor model is set to 10, and a threshold of 0.25 is used for the 3-descriptor scheme. Using the results of the threshold sensitivity experiment in Chapter IV, and applying about 2 or 3 feedbacks in each evaluation cycle according to the feedback scenario, a trial of the 3-descriptor model consisting of 40 evaluations with a threshold of 0.25 yields a profile whose number of categories is, by the end of the evaluation, close to that used in the single-descriptor scheme.

In addition to comparing the performance of the 3-descriptor and single-descriptor models, the experiments are also intended to observe the performance of the 3-descriptor model under two different weighting schemes: normalized term frequency (nTF) and TFIDF. The performance of the two is then plotted

over different threshold values. In this observation, the performance at each threshold value is averaged over 38 evaluations across 30 trials. The first and the twentieth evaluations are not counted because at these points no feedback has been given to the system, so the system has not yet learned: in the first evaluation the system profile is still empty, and in the twentieth evaluation the system has not yet learned from the significant change of the target profile when the performance is measured.

1. Accuracy

Figure 14 illustrates the performance comparison between the 3-descriptor and single-descriptor models under the accuracy criterion. In general, the system employing the 3-descriptor interest category representation outperforms the one using the single-descriptor model. Furthermore, as illustrated in Figure 15, the 3-descriptor model with the TFIDF weighting scheme is better than normalized term frequency at every threshold value. Figure 14 shows that the performance of the 3-descriptor scheme increases rapidly after the first feedback but does not reach a steady state afterwards. This is because a different document set is used in each evaluation; the next set may not provide a document of a previously learned interest, and may introduce documents containing a new interest that has not yet been learned. At the twenty-first evaluation, the performance drops to its lowest values. This is due, in the experiment scenario, to the significant change of the user's interests after the target user profile is inverted. It takes a while to adapt to this sudden change before the performance reaches a more stable state. In the single-descriptor model, however, after the target user profile is inverted, the performance never recovers to its pre-inversion level, although there is a trend of increasing performance at the end of the evaluation.


[Figure: accuracy (0-80%) versus evaluation number (0-40) for the 3-descriptor, 1-descriptor-A, and 1-descriptor-B models.]

Fig. 14. The accuracy of the system's recommendations.

[Figure: average accuracy (20-70%) versus threshold (0-0.7) for the TFIDF and nTF weighting schemes.]

Fig. 15. The average accuracy of the 3-descriptor model.

For the 3-descriptor model, this phenomenon might be caused by the effect of long-term learning before the target profile is inverted. Raising the interest level of the long-term descriptor from a negative value after the profile is inverted takes more effort than starting from an empty profile. The positive and negative descriptors allow the system to adapt quickly to the sudden change, but the documents previously learned into the long-term descriptor inhibit this reactive behavior, because the long-term descriptor's interest level increases and decreases only slowly. The bipolar sigmoid function used to govern the change of this descriptor appears to work very well in representing the system's reluctance to change after a long run of learning. The inhibiting role of past-learned long-term interests becomes very clear from the results of the single-descriptor model: without special treatment of short-term and long-term interests, the system adapts slowly to a change of interest after learning in the long run.

In summary, compared with the single-descriptor model, the significance of the 3-descriptor scheme is most obvious when the user's interests change suddenly. The 3-descriptor model adapts very quickly to the change of interests, whereas the single-descriptor model that learns only from positive feedback (1-descriptor-A) adapts very slowly. The single-descriptor model that also learns from negative feedback (1-descriptor-B) improves on this, but its performance is still below that of the 3-descriptor model.

In addition to the previously described experiment scenario, other experiments were run using different document sources for the feedback. The previous results use three different document sources: the system's recommendation, the system's exploration, and the user's exploration. In these experiments, 10 trials were performed using feedback from the system's recommendation and the system's exploration, and another 10 trials used only documents from the system's recommendation.
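Before turning to these results, the role of the bipolar sigmoid described above can be illustrated with a minimal sketch. The accumulation variable and the update form below are assumptions for illustration, not the exact formulation used by the learning algorithm.

import math

def bipolar_sigmoid(x):
    """Bipolar sigmoid, ranging over (-1, 1)."""
    return (1.0 - math.exp(-x)) / (1.0 + math.exp(-x))

def update_long_term(count, feedback):
    """Hypothetical long-term update: the interest level is the sigmoid
    of an accumulated feedback count. Each feedback (+1 or -1) moves
    the level less and less as |count| grows, so a level driven negative
    over many evaluations needs correspondingly many positive examples
    to recover, which is the reluctance discussed above."""
    count += feedback
    return count, bipolar_sigmoid(count)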


Table V. The accuracy of the 3-descriptor model over different document sources used for feedback.

Feedback Sources    S1     S1, S2    S1, S2, S3
TFIDF               16%    50%       53%
normalized tf       17%    34%       41%
random profile       7%     6%        7%

S1 = system's recommendation; S2 = system's exploration; S3 = user's exploration.

Table VI. The accuracy of the 3-descriptor model over different numbers of recommendations.

#Recommendations    Top 10    Top 20    Top 30
TFIDF               53%       50%       48%
normalized tf       41%       40%       45%
random profile       7%        9%       15%

With the threshold value set to 0.25, the results of these experiments are shown in Table V. A significant performance drop results from the experiments that use only documents from the system's recommendation; the highest performance is achieved by including the other two sources.

Finally, the experiment was varied by increasing the number of articles recommended by the system. The results are shown in Table VI. The performance of the model using the TFIDF weighting scheme tends to decrease as the number of recommendations increases, while the opposite tendency holds for the system using a random profile. For a random profile, the number of recommendations corresponds directly to the probability of getting the right document: picking 10, 20, and 30 documents out of 200 gives probabilities of 0.05, 0.1, and 0.15 of getting a correct document, which are very close to the results above.


[Figure: ndpm (0.2-0.7) versus evaluation number (0-40) for the 3-descriptor, 1-descriptor-A, and 1-descriptor-B models.]

Fig. 16. The system's performance on ndpm.

2. Normalized Distance-based Performance Measure

The experimental results based on the ndpm criterion are shown in Figure 16. The performance of the single-descriptor models (1-descriptor-A and 1-descriptor-B) is worse than that of the 3-descriptor model. The erratic results of the 3-descriptor model on this measure can be explained as follows. By learning from feedback, the system's performance on the ndpm criterion improves, but not significantly, because the way the user profile is modified is not designed to fine-tune the relative ranking of the user's interests. Further, the long-term descriptor maintains only the average of the learned document feature vectors, while the positive and negative descriptors behave reactively and can potentially forget the feature vectors learned from previous feedback.
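To make the criterion concrete, the following is a minimal sketch of ndpm in the spirit of Yao [17]: over all document pairs on which the user expresses a strict preference, a pair the system ranks in the opposite order costs 1, a pair the system ties costs 1/2, and the total is normalized by the number of such pairs, so lower is better. The exact tie handling and the variable names here are assumptions.

from itertools import combinations

def ndpm(user_scores, system_scores):
    """user_scores and system_scores map each document to a score."""
    contradicted, tied, total = 0, 0, 0
    for a, b in combinations(user_scores, 2):
        if user_scores[a] == user_scores[b]:
            continue  # the user expresses no preference on this pair
        total += 1
        if user_scores[a] < user_scores[b]:
            a, b = b, a  # orient the pair so the user prefers a to b
        if system_scores[a] < system_scores[b]:
            contradicted += 1  # system ranks the pair the wrong way
        elif system_scores[a] == system_scores[b]:
            tied += 1  # system leaves the pair unresolved
    return (contradicted + 0.5 * tied) / total if total else 0.0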


[Figure: ndpm (0.2-0.5) versus threshold (0-0.7) for the TFIDF and nTF weighting schemes.]

Fig. 17. The average performance of the 3-descriptor model on ndpm.

Consequently, the performance on this measure fluctuates depending on the current values of the positive and negative descriptors. The performance is in general better than random because the short-term descriptors contain the feature vectors of the highest and lowest interest: documents of positive and negative interest will be ranked high and low respectively, but not necessarily in the correct order relative to other documents. This improves the system's performance over the single-descriptor model, but not optimally with respect to the ndpm criterion. On this criterion, as shown in Figure 17, the 3-descriptor model using the TFIDF weighting scheme on average outperforms the one using normalized term frequency.

[Figure: Spearman coefficient (-0.2 to 1) versus evaluation number (0-40) for the 3-descriptor, 1-descriptor-A, and 1-descriptor-B models.]

Fig. 18. The system's performance on the Spearman rank-order correlation coefficient.

[Figure: Spearman coefficient (0-1) versus threshold (0-0.7) for the TFIDF and nTF weighting schemes.]

Fig. 19. The performance of the 3-descriptor model on the Spearman coefficient.

3. Spearman Rank-order Correlation Coefficient

Figure 18 shows that the experimental results based on the Spearman rank-order correlation are consistent with the other criteria. The 3-descriptor model outperforms the single-descriptor scheme, with the same explanation as for the ndpm evaluation criterion. On average, as illustrated in Figure 19, the 3-descriptor approach weighted using TFIDF is better than the one using nTF.

4. Average Document Score

Figure 20 shows the system's performance based on the average document score. The results are similar to those under the accuracy evaluation criterion: the 3-descriptor model is better than the single-descriptor scheme, and the single-descriptor scheme that learns from both positive and negative feedback outperforms the one that learns only from positive feedback. Unlike under the other evaluation criteria, however, the 3-descriptor scheme using the normalized term frequency weighting scheme outperforms TFIDF at almost every threshold value, as shown in Figure 21.

5. Precision versus Recall

Measuring the model's performance in terms of precision versus recall uses a different experimental procedure. In this experiment, 494 documents of recognizable topic were used as a test set, of which 25 documents are about a weather-related topic. Another set of 36 documents on the weather-related topic was used as a training set. To ease the identification of a document's topic, the name of each document file is encoded to represent the topic it contains. The experimental procedure is as follows; a sketch of the computation in step 3 appears after the list.

1. Generate a user profile by giving positive feedback using n documents selected randomly from the training set. The learning rate for each feedback in the 3-descriptor model is set to a random number between 0 and 1.

2. Sort the documents in the test set using the profile obtained in step 1.

3. Starting from the highest-ranked document, search the ranked list for documents on the weather topic, continuing to lower-ranked documents until all 25 weather documents are found. For each document found, compute the recall and precision values.
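The computation in step 3 can be sketched as follows; is_weather is a hypothetical predicate standing in for the topic encoding in the file names.

def precision_recall_points(ranked_docs, is_weather, n_relevant=25):
    """Walk down the ranked test set; each time a weather document is
    found, record recall (found so far over all relevant documents)
    and precision (found so far over documents inspected so far)."""
    points, found = [], 0
    for rank, doc in enumerate(ranked_docs, start=1):
        if is_weather(doc):
            found += 1
            points.append((found / n_relevant, found / rank))
            if found == n_relevant:
                break  # all relevant documents located
    return points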

[Figure: average document score (-0.4 to 0.4) versus evaluation number (0-40) for the 3-descriptor, 1-descriptor-A, and 1-descriptor-B models.]

Fig. 20. The system's performance on average document score.

[Figure: average document score (0-0.6) versus threshold (0-0.7) for the TFIDF and nTF weighting schemes.]

Fig. 21. The performance of the 3-descriptor model on average document score.

[Figure: precision (0-100%) versus recall (10-100%) for the 3-descriptor and 1-descriptor-B models.]

Fig. 22. Precision versus recall.

The results of this experiment, illustrated in Figure 22, are averaged over 250 runs: user profiles were built using 1, 4, 10, 20, and 36 documents from the training set, and the experiment for each particular number of documents was repeated 50 times.

The experimental results show that for a static interest built only from positive examples, the performance of the single-descriptor and 3-descriptor models, when both use the TFIDF weighting scheme, is almost identical. This suggests that using the 3-descriptor model in a typical information-filtering system does not degrade the performance of the single-descriptor approach; moreover, it extends that approach by learning a changing interest effectively, as described earlier.

CHAPTER VII

CONCLUSION

A. Summary of Research

In a personalized information-filtering system, modeling the user profile is one of the essential features. The model represents the user's current interests, and keeping the model up to date is therefore the main research issue. A number of models and algorithms for learning a user profile have been proposed to address this issue, each from a different point of view. Among the models proposed in previous work, only a few have addressed the problem of changing long-term and short-term user interests, and their solutions are still ineffective and cannot handle both kinds of interest simultaneously.

The approach proposed in this thesis, the 3-descriptor scheme, is feasible and effective for learning the dynamics of a user's long-term and short-term interests. The system learns quickly from an empty profile and adapts gradually to a significant change of user interest, so the problem of adapting long-term and short-term interests simultaneously can be handled with the 3-descriptor scheme. Based on the experimental sensitivity analysis, the learning algorithm employed is able to maintain long-term interests while remaining adaptive enough to accommodate changing short-term interests. A logical consequence of this capability is the system's ability to distinguish different levels of interest within an interest category.

The scheme and the algorithm work on the vector space model, and the model can be improved by using an appropriate term-weighting scheme. The performance evaluation using two different heuristics shows that the TFIDF weighting scheme is better than normalized term frequency on most evaluation criteria.

Taking the inverse document frequency into account, as the TFIDF heuristic does, is to some degree effective for expressing the context of a term in a document.

In a personalized news agent that learns its user's interests from user feedback, the users play an important role. The learning process will be effective only if the users are aware of their own interests and provide feedback that is relevant to those interests. Although undesirable, some user understanding of how the system works is therefore crucial, which implies that the users' cognitive load increases when using and teaching the system to learn their interests. This burden can be reduced by inferring implicit feedback from the user's actions, and the disadvantage of inferring implicit feedback is minimized by applying only a minimal profile change when learning from this type of feedback. Hence, taking both explicit and implicit feedback may be considered in order to obtain the advantages of both.

B. Future Work

The 3-descriptor scheme allows modeling and learning a user's changing long-term and short-term interests, and the current evaluation indicates that the approach is feasible. Embedding the proposed approach in the personalized news agent implemented in this thesis serves as a proof-of-concept demonstration. A lesson learned from this work is that the development of a method for information gathering and filtering should be considered an integral part of developing the application that uses the method. This work can be extended in several future directions that address concepts in information gathering and filtering.

1. Extending the 3-descriptor Scheme

The ability of the 3-descriptor model to learn changing long-term and short-term interests comes from the additional information maintained in the user profile: for each interest category, the profile holds long-term interest information as well as positive and negative short-term interest information. The scheme can be extended by incorporating n descriptors on the long-term side, yielding an (n+2)-descriptor interest category representation: two descriptors for positive and negative short-term interests, and n descriptors for long-term interests. The additional descriptors are intended to allow a graduated change of the long-term interests. Given a learning rate α in this scheme, the second, third, and nth descriptors, in both the positive and negative descriptors, will be modified with learning rates α^2, α^3, and α^n respectively. This enhancement appears to be a good way to capture a smoother transition of long-term interests over different ranges of time scales; a sketch of the cascaded update follows. The disadvantage of the enhancement is obvious: the space required for profile representation becomes larger. The questions that arise are how much improvement, if any, results from this enhancement, and whether it is worth the price of the increased profile size. Answering these questions requires more research and experiments in the future.
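The cascaded update can be sketched as follows, under the assumed reading that a base learning rate alpha (between 0 and 1) is attenuated geometrically, so the k-th long-term descriptor moves with rate alpha**k and deeper descriptors change ever more slowly. The moving-average update form is an assumption for illustration.

def update_long_term_chain(descriptors, doc_vector, alpha):
    """descriptors: a list of n term-weight dicts, shallowest first.
    doc_vector: the term-weight dict of the feedback document."""
    for k, desc in enumerate(descriptors, start=1):
        rate = alpha ** k  # geometric attenuation: deeper layers drift less
        for term, weight in doc_vector.items():
            old = desc.get(term, 0.0)
            desc[term] = old + rate * (weight - old)  # move toward the example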

2. Incorporating Collaborative Filtering

The filtering scheme employed in this thesis is classified as content-based filtering, in which the information-filtering process is based solely on the content of the document being evaluated: the content extracted from a document is matched against the profile and ranked according to the degree of match. A completely different approach, called collaborative or social filtering [18], can also be taken into account in the design of personalized information filtering. Instead of using the content of a document, social filtering takes advantage of the experience of other users. A number of techniques have been developed to perform collaborative filtering, each solving the problem from a different point of view. The challenging issues in this area are to unify the various techniques into a single framework and to find new social-filtering schemes that solve as-yet-unobserved problems. Another issue worth exploring is finding an elegant way to integrate social filtering with content-based filtering, since both have advantages and disadvantages. When should a system switch from content-based to collaborative filtering, and vice versa? Is there a way to integrate them so that both approaches can be used simultaneously, taking the advantages of both filtering schemes? Thorough research is needed to answer these questions; one naive integration is sketched below.
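Purely as an illustration of the simultaneous-use question raised above, the following sketch blends the content-based score with a collaborative score, trusting neighbors more as more of them have rated the document. The blending rule and all names here are assumptions, not a proposal evaluated in this thesis.

def blended_score(content_score, neighbor_ratings, max_weight=0.5):
    """content_score: the profile-matching score for a document.
    neighbor_ratings: ratings of the same document by similar users."""
    if not neighbor_ratings:
        return content_score  # no social evidence: fall back to content
    collab = sum(neighbor_ratings) / len(neighbor_ratings)
    # weight grows with the amount of social evidence, capped at max_weight
    w = max_weight * min(1.0, len(neighbor_ratings) / 10.0)
    return (1 - w) * content_score + w * collab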

3. Extending the Learning Capability

The personalized news agent developed in this thesis learns a user profile, which is modified based on the feedback from its users. From the system's point of view, the learning capability could be expanded to learn attributes or tendencies of information sources. With this capability, the agent would be expected to identify useful information sources for future use and to avoid exploiting sources that cannot provide interesting information with respect to its user profile, thereby improving the effectiveness of the information-retrieval process. This is a very challenging issue because information sources on the World Wide Web can be heterogeneous in their domain expertise, unstructured, and subject to change over time.

4. Direct Manipulation of the Profile

The filtering process developed in this thesis is driven by user profiles, which are learned automatically as the users provide feedback. The learning process is syntax-based; no semantic clues are considered. Furthermore, the extraction of keywords relies on the frequency of a keyword's occurrence in a document, so some unimportant keywords may still end up in the user's profile. In the current implementation, the user has no way to change the keywords except by providing a positive or negative example. Advanced users may want to refine their profiles by removing keywords they consider unimportant and adding other important ones, which would surely improve the system's performance. It is therefore desirable to give advanced users the option of refining their profiles directly.

REFERENCES

[1] Chen, L. and Sycara, K. (1998) WebMate: A Personal Agent for Browsing and Searching. Proceedings of the Second International Conference on Autonomous Agents, St. Paul, MN, May, pp. 132-139. ACM Press, New York, NY.

[2] Balabanovic, M. (1998) Learning to Surf: Multi-Agent Systems for Adaptive Web Page Recommendation. Ph.D. Thesis, Department of Computer Science, Stanford University, Palo Alto, CA.

[3] Wiener, E., Pedersen, J. and Weigend, A. (1995) A Neural Network Approach to Topic Spotting. Fourth Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, April, http://www.stern.nyu.edu/~aweigend/Research/Papers/BEFORENYU/topicspotting.ps.Z.

[4] Tan, A. and Teo, C. (1998) Learning User Profiles for Personalized Information Dissemination. Proceedings of the 1998 International Joint Conference on Neural Networks, Anchorage, AK, May, pp. 183-188. IEEE, New York, NY.

[5] Balabanovic, M. (1997) An Adaptive Web Page Recommendation Service. Proceedings of the First International Conference on Autonomous Agents, Marina del Rey, CA, February, pp. 378-385. ACM Press, New York, NY.

[6] Luhn, H. P. (1958) The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 2(2), 159-165.

[7] Zipf, G. K. (1949) Human Behavior and the Principle of Least Effort. Addison-Wesley, Reading, MA.

[8] Kowalski, G. (1997) Information Retrieval Systems: Theory and Implementation. Kluwer Academic Publishers, Boston, MA.

[9] Salton, G. and McGill, M. J. (1983) Introduction to Modern Information Retrieval. McGraw-Hill, New York, NY.

[10] Joachims, T. (1997) A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN, July, pp. 143-151. Morgan Kaufmann Publishers, San Francisco, CA.

[11] Deerwester, S., Dumais, S., Furnas, G., Landauer, T. and Harshman, R. (1990) Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41, 391-407.

[12] Mock, K. J. (1996) Hybrid Hill-Climbing and Knowledge-based Techniques for Intelligent News Filtering. Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, OR, August, pp. 48-53. AAAI Press, Menlo Park, CA.

[13] McElligott, M. and Sorensen, H. (1994) An Evolutionary Connectionist Approach to Personal Information Filtering. Proceedings of the Fourth Irish Neural Network Conference, Dublin, Ireland, September, pp. 141-146.

[14] Sheth, B. D. (1993) A Learning Approach to Personalized Information Filtering. M.S. Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA.

[15] Moukas, A. and Zacharia, G. (1997) Evolving a Multi-agent Information Filtering Solution in Amalthaea. Proceedings of the First International Conference on Autonomous Agents, Marina del Rey, CA, February, pp. 394-403. ACM Press, New York, NY.

[16] Menczer, F., Willuhn, W. and Belew, R. K. (1994) An Endogenous Fitness Paradigm for Adaptive Information Agents. CIKM'94 Workshop on Intelligent Information Agents, Gaithersburg, MD, December, http://www.csee.umbc.edu/conference/cikm/1994/iia/papers/menczer-cameraready.ps.

[17] Yao, Y. (1995) Measuring Retrieval Effectiveness Based on User Preference of Documents. Journal of the American Society for Information Science, 46, 133-145.

[18] Goldberg, D., Nichols, D., Oki, B. M. and Terry, D. (1992) Using Collaborative Filtering to Weave an Information Tapestry. Communications of the ACM, 35, 61-70.

APPENDIX A

NEWS SOURCE INFORMATION

The following is the definition of the information sources.

# daily news format
# src_id=URL,depth,daily,hour:min
usatoday=http://www.usatoday.com/,2,daily,3:00
people=http://www.pathfinder.com/people/,2,daily,3:30
latimes=http://www.latimes.com/HOME/NEWS/FRONT/,2,daily,4:00
usnews=http://www.usnews.com/usnews/home.htm,2,daily,4:30
internetwnd=http://www.internetworld.com/,2,daily,5:00

# weekly news format
# src_id=URL,depth,weekly,weekday,hour:min
pcweek=http://www.zdnet.com/pcweek/,2,weekly,1,2:00
pcmag=http://www.zdnet.com/pcmag/,2,weekly,2,2:00
time=http://www.pathfinder.com/time/,2,weekly,2,2:30
businessweek=http://www.businessweek.com/contents.htm,2,weekly,3,0:00

# monthly news format
# src_id=URL,depth,monthly,monthday,hour:min
fortune=http://www.pathfinder.com/fortune/,2,monthly,15,0:00
windows=http://www.zdnet.com/wsources/,2,monthly,14,0:00
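A minimal sketch of code that reads the format above, assuming comma-separated fields exactly as listed in the comments; the dictionary keys are illustrative names, not identifiers from the thesis implementation.

def parse_source_line(line):
    """Parse one source definition; returns None for blanks/comments."""
    line = line.strip()
    if not line or line.startswith('#'):
        return None
    src_id, rest = line.split('=', 1)
    fields = rest.split(',')
    entry = {'id': src_id, 'url': fields[0], 'depth': int(fields[1]),
             'schedule': fields[2]}
    if fields[2] == 'daily':      # URL,depth,daily,hour:min
        entry['time'] = fields[3]
    elif fields[2] == 'weekly':   # URL,depth,weekly,weekday,hour:min
        entry['weekday'], entry['time'] = int(fields[3]), fields[4]
    else:                         # URL,depth,monthly,monthday,hour:min
        entry['monthday'], entry['time'] = int(fields[3]), fields[4]
    return entry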

APPENDIX B

STOP LIST

The following is a list of stop words that have been modified from a stop list obtained on the Internet.

a able about above across act after again against ago air all almost along already also although always am among amp an and another any any anything are area aren around as asked at available back be became because become been before began behind being best better between big body both boy brought but by called came can cannot car case co copyright could did didn do does don done door down during each early either end enough even ever every far feet felt few field find first five for four free from gave get give given go god going good got had half has have having he heard held help her here high him himself his how however if important in inc into is it its itself just kind knew know known large last later least left less let like look looked made major make making man many may me might mind more most mr mrs much must my name nbsp never new next night no not nothing now number of off often old on once one only open or order other others our out over own page part past per perhaps pm put quite rather re real really report right said same saw say says second see seemed seems seen set several shall she should show since small so some something still such sure than that the their them themselves then there these they thing things think this those though thought three thus time times to today too took toward turned two under until up upon us use very want war was way we well went were what when where whether which while white who whole whose why will with within without would would year years yet york you young your

APPENDIX C

EXAMPLE OF INTEREST CATEGORY CONTENT

Interest Category

Long-term Descriptor (wlt = 0.76130):
server 0.07982, nt 0.06454, user 0.04902, system 0.04234, manag 0.03728, solari 0.03337, network 0.03201, applic 0.03104, window 0.02641, microsoft 0.02594, product 0.02529, internet 0.02498, enterpris 0.02494, servic 0.02296, secur 0.02252, version 0.02108, disk 0.01959, pc 0.01882, data 0.01834, sun 0.01767, winnt 0.01708, perform 0.01578, sqribe 0.01559, base 0.01559, report 0.01559, ..., java 0.01213, exodu 0.01213, windownt 0.01087

Positive Descriptor (wp = 0.99500):
solari 0.11194, sun 0.05926, nt 0.05512, server 0.0496, upgrad 0.04609, bit 0.03292, manag 0.03035, solariupgrad 0.02634, isp 0.02634, network 0.02226, includ 0.02126, enterpris 0.01975, bitsolari 0.01975, sourc 0.01975, perform 0.01975, reliabl 0.01975, custom 0.01975, file 0.01969, featur 0.01718, microsoft 0.01568, releas 0.01317, releasbit 0.01317, internet 0.01317, internetservic 0.01317, san 0.01317, hardwar 0.01317, offer 0.01317, easi 0.01317

Negative Descriptor (wn = 0.31230):
secur 0.05285, enterpris 0.04472, server 0.04065, user 0.04065, web 0.04065, base 0.03659, product 0.03252, system 0.03252, exodu 0.02846, data 0.02846, java 0.02846, project 0.02439, tool 0.02439, spider 0.02033, gemston 0.02033, sqribeenter 0.02033, enabl 0.02033, analysi 0.02033, manag 0.02033, map 0.02033, creat 0.01626, hyperion 0.01626, servic 0.01626, network 0.01626, nt 0.01626, transact 0.01626, internet 0.01626, version 0.01626

APPENDIX D

EXAMPLE OF META DOCUMENT

40 # number of keywords
nt      0.10811

server 0.08784 window 0.08784 microsoft 0.04730 zdnet 0.04054 current 0.03378 system 0.03378 name 0.03378 currentnt 0.03378 offer 0.02027 product 0.02027 advanc 0.02027 supportsmp 0.02027 advancserver 0.02027 ntserver 0.02027 link 0.01351 namescheme 0.01351 notewindow 0.01351 cluster 0.01351 ntconsum 0.01351 ....... datacentserver 0.01351 desktop 0.01351 http://www.zdnet.com/windows/stories/main/0,4728,2155790,00.html Windows 2000: Coming To Stores Near You In 1999

Microsoft would not provide any pricing or delivery details, beyond noting Windows 2000 pricing would not deviate much from current NT 4.0 pricing. Officials did note, however, that Windows 2000 Advanced Server would likely be cheaper than the current



VITA

Dwi Hendratmo Widyantoro graduated cum laude from the Institut Teknologi Bandung, Indonesia, in 1991, obtaining his B.S. in Computer Science. His undergraduate thesis research was in the area of Artificial Intelligence, focusing on the construction of an inference engine for non-implicit equations. In 1993, he joined the Informatics Department of the Institut Teknologi Bandung as a teaching staff member. In early 1997, he was awarded a two-year fellowship from the Ministry of Education and Culture of the Republic of Indonesia to pursue an M.S. degree in the United States of America. He can be reached through the Department of Computer Science of Texas A&M University. His permanent address is Jl. Puri Cipageran Indah I Blok A-36, Cimahi 40511, Indonesia.