Combining Relevance Feedback and Genetic Algorithms in an Internet Information Filtering Engine

Guy Desjardins & Robert Godin

Université du Québec à Montréal { [email protected] , [email protected] }

Abstract

Ever since the advent of the public Internet, the quantity of available information has been rising rapidly. One of the most important uses of this public network is to find information, and in such a huge and unstable collection, the greatest problem today is to find relevant information. This paper presents the development of IntellAgent, an Internet information filtering agent whose search engine uses a hybrid evolutionary algorithm to optimize the user profile. The algorithm combines the well-known relevance feedback process with a genetic algorithm. This paper describes the specifics of the combination in detail and reports on its effectiveness measured on the TREC collection. The agent builds its own collection statistics incrementally as it analyses the documents of the collection. Two construction methods are tested: one using all terms of all documents and a second using only the profile terms. The hybrid algorithm is tested against both methods. The results show that the hybrid algorithm performs better in recall than the relevance feedback process alone and better in precision than the genetic algorithm alone, revealing that the combination builds upon the strengths of both processes. Using the incremental statistics collection built with the profile-matching terms shows the same level of effectiveness but a slightly different recall/precision ratio.

Keywords: Intelligent agent; Information filtering; Genetic algorithm; Hybrid; Incremental statistics

1. Introduction

The number of computers linked by the Internet grew from 5 million to 65 million over the past six years, and surveys estimated that there were over 200 million users in September 1999. The second most frequent use of the network is information retrieval (after e-mail). Today the Internet is the biggest public collection of documents available worldwide. Because its information changes continuously, it is also the most unstable collection. Information filtering is concerned with finding information in unstable collections of documents such as the Internet. In the information filtering domain, the user query is called a profile. An agent usually builds the profile from examples provided by the user; the query is thus not a list of words to search for but rather combinations of words extracted from various examples. Two of the major problems to solve are optimizing the significance of the profile and obtaining accurate collection statistics for the term frequency calculations. For the latter, Callan (1996) proposed two methods: use the collection statistics of a similar domain sample, or build one's own incrementally. The first method can be used in a domain-specific engine; the second is more suitable for a general search engine, and we selected it for the development of a prototype information filtering agent, IntellAgent. Ribeiro et al. (1994) proposed a classification of the various optimization techniques used for optimizing the user profile (see figure 1). Enumerative and analytic techniques have shown limited effectiveness because the solution space is too vast. Yet the relevance feedback process has shown good results and can be classified as an iterative analytic method. Recently, guided-random techniques such as genetic algorithms and neural networks have been proposed as alternative solutions for information filtering.

A number of researchers have combined algorithms in search of a better solution (Yang & Korfhage, 1993; Sheth, 1994; Chen & Kim, 1994).

[Figure 1: Optimization techniques (source: Ribeiro, 1994). A taxonomy of search techniques: calculus-based methods (direct: Fibonacci, Newton; indirect), guided-random methods (simulated annealing; evolutionary algorithms: evolutionary strategies, genetic algorithms), and enumerative methods (dynamic programming).]

When developing IntellAgent, we aimed to find such a value-added solution by combining a relevance feedback process with a genetic algorithm. The objective was to find a combination that would yield better results than each of its component processes alone. Section 2 reviews the background of this work. Section 3 describes the functions of the search engine. Section 4 describes in detail the combination of the relevance feedback process and the genetic algorithm. Section 5 describes the two methods for building the collection statistics incrementally. Section 6 presents the experimental results. Section 7 concludes.

2. Background

Genetic algorithms have been applied to various problems since Holland first introduced them in 1975, particularly in the 1980s. They borrow their process from the Darwinian natural process of survival: genetic recombination changes individuals over generations, and nature selects the fittest individuals to survive. The overall result is a community better adapted to its environment. It is a continuous process since the environment itself changes over time. The changes are made by recombining the genetic codes of two individuals. The analogy in information filtering makes use of the vector space model to represent the documents. Using this model, a document can be represented by a vector of its unique words, the terms vector (t), along with their frequencies (f) (see figure 2). A weights vector (w) can be calculated from the frequencies of the terms. Under this analogy, the genetic algorithm treats a term as a gene, a document as an individual and the profile as the community. After recombining the terms of the two parent documents, an objective function acts as the survival process to decide whether or not to keep the two generated child documents in the profile. The relevance feedback process has also been used successfully by a number of researchers (Chen et al, 1995; Sheth, 1994; Yang & Korfhage, 1993; Yuwono & Lee, 1996b). Applied to the vector space model, this process changes the weights of the terms according to the user feedback whenever the agent proposes a document. The firing vectors in the profile have their weights increased or decreased when the user judges the proposed document relevant or irrelevant; those modified vectors become stronger or weaker and influence the next retrievals.

[Figure 2: Vector space model. Each user example document is indexed into a terms vector (t1, t2, t3, ..., tn), a frequencies vector (f1, f2, f3, ..., fn) and a weights vector (w1, w2, w3, ..., wn).]

While relevance feedback is used to adapt the profile as the agent retrieves documents, the genetic algorithm is usually used to optimize the profile once, at the beginning of the search process. In Beagle (Ferguson, 1995), once the profile is optimized using a genetic algorithm, it stays static during the active search phase. In NEWT (Sheth, 1994), a genetic algorithm is used to optimize the initial profile and relevance feedback is used thereafter to make the profile evolve with the user feedback. In GANNET (Chen & Kim, 1994), a genetic algorithm is used in the initial phase to train a neural network which is then used during the active search phase. In IntellAgent, the genetic algorithm is also used to optimize the initial profile, but it is further used to re-optimize the profile as it evolves with the user relevance feedback. During the active search phase, both processes modify the profile.

3. Search engine

This section gives an overview of the IntellAgent processes and reviews the basic components of the search engine. It describes the use of the vector space model and introduces the various computations.

3.1 Process overview

First, the documents provided by the user are translated into vectors, which form the profile. Then the first collection statistics are calculated and the genetic algorithm optimizes this initial profile. IntellAgent needs at least two distinct example documents in order to perform the initial optimization, since the genetic process needs at least two parents. In the active phase, the agent retrieves a new document, translates it into the vector model and performs the similarity calculations against the profile. Whenever the document is found similar enough to at least one vector of the profile, the agent proposes it to the user. The user replies with a relevance judgment and the agent modifies the weights of the firing vectors accordingly. Then the genetic algorithm re-optimizes the modified profile and the agent proceeds with the next iteration. If the proposed document is judged relevant, the agent adds its vector to the profile, which modifies the profile further. In that case, the collection statistics are recalculated.
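As an illustration only, the following Python sketch captures this active-phase loop. All helper names (vectorize, similarity, ask_user, and so on) are placeholders injected by the caller; this is a reading of the description above, not IntellAgent's actual code.

```python
def active_phase(documents, profile, corpus_stats, vectorize, similarity,
                 ask_user, apply_feedback, evolve, threshold=0.058):
    """Sketch of the active search phase (section 3.1); helpers are injected."""
    for raw in documents:
        doc = vectorize(raw, corpus_stats)                  # index the next document
        fired = [v for v in profile if similarity(doc, v) > threshold]
        if not fired:
            continue                                        # not similar enough: skip
        feedback = ask_user(raw)                            # +1 relevant, -1 irrelevant
        for v in fired:
            apply_feedback(v, doc, feedback)                # adjust the firing vectors
        if feedback > 0:
            profile.append(doc)                             # relevant: add to the profile
            corpus_stats.update(doc)                        # recalculate collection statistics
        evolve(profile, corpus_stats)                       # GA re-optimizes the profile
    return profile
```

The 0.058 default mirrors the similarity threshold reported in section 6.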

[Figure 3: IntellAgent functional diagram. The agent indexes the next document from the Internet, computes its similarity against the user profile built from the query documents, proposes the document when the similarity S exceeds the threshold h, applies the user's relevance feedback (f+/f-) to the profile and runs the genetic algorithm evolution.]

3.2 Vector space model

In IntellAgent, each document is represented by four vectors:
• the terms vector contains the terms of the document after stopword removal and stemming;
• the frequencies vector contains the frequencies of the terms in the document;
• the weights vector contains the normalized weights of the terms, calculated by a traditional (tf x idf) formula;
• the feedback vector contains the cumulative feedback factors of the terms, calculated when a document is proposed by the agent.

The weights and the feedback factors are kept separately in order to better control the combination of the relevance feedback process and the genetic algorithm.
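Purely as an illustration of this four-vector representation, one way to carry a profile entry around in code is sketched below; the class and field names are assumptions, not the paper's data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ProfileVector:
    """One document/profile entry as described in section 3.2 (illustrative only)."""
    terms: List[str]                     # stemmed terms, stopwords removed
    frequencies: Dict[str, int]          # term -> frequency in the document
    weights: Dict[str, float]            # term -> normalized tf x idf weight
    feedback: Dict[str, float] = field(default_factory=dict)  # cumulative feedback factors
```

Keeping weights and feedback factors in separate fields mirrors the paper's reason for the split: the two evolve under different processes.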

3.3 Weight computation

IntellAgent makes use of a stopword process to eliminate useless words, then truncates the remaining words to their basic stem. The frequencies of the remaining terms are calculated and the weights are computed using a well-known tf x idf (term frequency x inverse document frequency) formula (Salton & Buckley, 1991). The formula is normalized to compensate for long documents, using the maximum-frequency normalization variant:

w_{ik} = \frac{\left(0.5 + 0.5\,\frac{tf_{ik}}{\max_p tf_{ip}}\right)\log\frac{N}{n_k}}{\sqrt{\sum_k \left[\left(0.5 + 0.5\,\frac{tf_{ik}}{\max_p tf_{ip}}\right)\log\frac{N}{n_k}\right]^2}}

where tf_{ik} is the frequency of term k in document i, idf_k = log(N/n_k), N is the total number of documents in the corpus, and n_k is the number of documents that include term k.
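For concreteness, a small Python function implementing this weighting scheme might look as follows; the dictionary-based interface is an assumption of the sketch, and every term is assumed to have a document frequency of at least one.

```python
import math

def tf_idf_weights(freqs, doc_freqs, n_docs):
    """Normalized tf x idf weights, maximum-frequency variant (section 3.3).

    freqs:     term -> frequency of the term in the document (tf_ik)
    doc_freqs: term -> number of corpus documents containing the term (n_k)
    n_docs:    total number of documents seen so far (N)
    """
    max_tf = max(freqs.values())
    raw = {k: (0.5 + 0.5 * tf / max_tf) * math.log(n_docs / doc_freqs[k])
           for k, tf in freqs.items()}
    norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0   # guard against a zero norm
    return {k: v / norm for k, v in raw.items()}
```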

3.4 Objective function

One similarity function often selected as the objective function under the vector space model is the scalar product of two vectors:

S(D_i, P_j) = \sum_k w_{ik} \cdot w_{jk}

where D and P represent the document and the profile respectively, and the subscript k ranges over the common terms only.

This function is used both to optimize the profile and to fire documents. It computes the similarity between vectors within the profile, or between the profile's vectors and the document vector under analysis. In the latter case, the document is fired if at least one vector of the profile is found similar enough, i.e. its similarity is higher than a predetermined threshold.
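In code, the scalar product over common terms and the firing test could be written as below; this is a sketch that assumes the profile is simply a list of weight dictionaries.

```python
def similarity(weights_a, weights_b):
    """Scalar product over the terms common to both vectors (section 3.4)."""
    return sum(weights_a[k] * weights_b[k] for k in weights_a.keys() & weights_b.keys())

def is_fired(doc_weights, profile_weights, threshold=0.058):
    """A document fires if at least one profile vector is similar enough."""
    return any(similarity(doc_weights, v) > threshold for v in profile_weights)
```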

3.5 Fitness function

The fitness function is used by the genetic algorithm to select the best-fit parents for the next generation. It is also used to determine which parents to replace when the profile size reaches a predetermined maximum. The fitness function is defined as the average similarity measured through time:

F(P_i) = \frac{\sum_k S(Dp_k, P_i)}{\#Dp}

where S(Dp_k, P_i) is the similarity between the profile's vector i and the kth relevant document proposed by the agent, and #Dp is the total number of relevant documents proposed by the agent.

3.6 Relevance feedback

Whenever the agent proposes a document, the user judges its relevance and replies 1 if it is relevant or -1 if it is not. The agent uses this information to modify the weights of the firing vectors in the profile. The weights are modified according to the formula:

w^p_{ik} = w^p_{ik} + \alpha \cdot f \cdot w^d_{k}

where the feedback power α is a predetermined parameter between 0 and 1, w^p are the weights of the firing vectors of the profile, w^d are the weights of the proposed document, and f is the user feedback. The subscript k ranges over the common terms only; the subscript i ranges over the vectors of the profile. The relevance feedback is a competitive process in which useful terms have their strength reinforced and useless terms have their strength reduced. The more a term proves useful, the higher its influence on future retrievals, and vice versa.
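As a sketch of this update (not the paper's code), the feedback contribution for a firing vector's common terms could be accumulated as follows. Storing it in a separate feedback dictionary reflects the separation described in section 3.2; treating the effective strength as weight plus accumulated feedback is our assumption.

```python
def apply_feedback(profile_weights, profile_feedback, doc_weights, feedback, alpha=0.20):
    """Accumulate alpha * f * w_d for the common terms of a firing profile vector (section 3.6)."""
    for k in profile_weights.keys() & doc_weights.keys():
        profile_feedback[k] = profile_feedback.get(k, 0.0) + alpha * feedback * doc_weights[k]

def effective_weight(profile_weights, profile_feedback, term):
    """Assumed combination: base tf x idf weight plus the cumulative feedback factor."""
    return profile_weights.get(term, 0.0) + profile_feedback.get(term, 0.0)
```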

3.7 Genetic algorithm

Unlike traditional optimization processes, genetic algorithms work from many initial solutions simultaneously to reach a near-optimal solution (Goldberg, 1989). They follow a structured process for exchanging information randomly. The two main operators are crossover, which exchanges genes between the parents to create two new individuals, and mutation, which mutates a random gene. IntellAgent uses a four-section crossover in which the terms of the two parent vectors in sections one and three are exchanged. The sections are selected randomly. The mutation operator makes a term disappear from, or introduces a new term into, the offspring.

[Figure 4: Crossover operator. The term positions of the two parent vectors are split into sections and the marked sections are exchanged to produce two offspring.]

The genetic algorithm first selects the two most-fit parents according to the fitness function. Then it proceeds with the crossover operation, adds the offspring to the profile and recalculates the average similarity of the whole profile. This process goes on until one of the following events occurs (a sketch of the loop follows the list):
• there are no more parents available to process;
• the average similarity decreased with the last generation;
• the maximum number of crossovers allowed has been reached, a parameter expressed as a percentage of the size of the profile;
• the maximum size allowed for the profile has been reached, a parameter expressed as a number of vectors.
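The loop below is a hedged sketch of this evolution step. It assumes the profile is a list of weight dictionaries and that the four-section crossover and the mutation operator are supplied by the caller; the exact rule for exhausting the parents is our interpretation of the stopping conditions above.

```python
import random

def fitness(vec, relevant_docs, similarity):
    """Average similarity to the relevant documents proposed so far (section 3.5)."""
    return sum(similarity(d, vec) for d in relevant_docs) / len(relevant_docs) if relevant_docs else 0.0

def evolve(profile, relevant_docs, similarity, crossover, mutate,
           max_size=30, max_crossover_rate=0.60, mutation_rate=0.01):
    def fit(v):
        return fitness(v, relevant_docs, similarity)

    parents = sorted(profile, key=fit, reverse=True)          # most fit parents first
    avg = sum(map(fit, profile)) / len(profile)
    generated = []
    for n in range(int(max_crossover_rate * len(profile))):   # crossover budget
        if 2 * n + 1 >= len(parents):
            break                                              # no more parents to process
        children = crossover(parents[2 * n], parents[2 * n + 1])
        for child in children:
            if len(profile) < max_size:
                profile.append(child)                          # grow the profile
            else:
                weakest = min(profile, key=fit)                # replace the weakest vector
                profile[profile.index(weakest)] = child
            generated.append(child)
        new_avg = sum(map(fit, profile)) / len(profile)
        if new_avg < avg:
            break                                              # average similarity decreased
        avg = new_avg
    for child in generated:
        if random.random() < mutation_rate:                    # occasional mutation
            mutate(child)
    return profile
```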

When the last event occurs, the genetic algorithm does not stop but rather starts replacing vectors in the profile. In doing so, it selects the two weakest individuals, according to the fitness function, to be replaced by the offspring. The mutation process occurs randomly on one of the genetically generated vectors. A parameter sets the mutation rate, expressed as a percentage of the number of genetically generated vectors. Generally, the selection of individuals follows rules whereas the selection of genes is randomized. This is why it is called a structured process for exchanging information randomly, or a guided-random process.

4. A hybrid algorithm

The novelty in IntellAgent is that the relevance feedback process and the genetic algorithm influence each other continuously. Thus both algorithms affect future retrievals after each proposed document, unlike Beagle (Ferguson, 1995), NEWT (Sheth, 1994) and GANNET (Chen & Kim, 1994). Here is how the two algorithms are combined (refer to figure 5 below). First, any new document analysis generates new terms and updates the frequencies of existing terms in the corpus statistics. This changes the idf factors, so the weights of the profile vectors must be recalculated. Second, after the similarity calculations, if the document is found similar enough to at least one of the profile's vectors, the document is proposed to the user. The feedback changes the weights of the firing vectors in the profile, changing the dynamics again for future retrievals. Third, if the proposed document is judged relevant by the user, it is added to the profile, changing both the tf and the idf factors. Fourth, the genetic algorithm optimizes that new profile by adding newly generated vectors to it, changing the idf factors again.
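The ordering of these four effects can be summarized in code roughly as follows; the helper names (index_into, recompute_weights, and so on) are assumptions used only to make the sequence explicit, not IntellAgent's actual interfaces.

```python
def handle_document(raw, profile, corpus_stats, index_into, recompute_weights,
                    similarity, ask_user, apply_feedback, evolve, threshold=0.058):
    """Sequence of effects from section 4 for one incoming document (illustrative)."""
    doc = index_into(corpus_stats, raw)             # 1. new terms / updated frequencies change idf,
    recompute_weights(profile, corpus_stats)        #    so profile weights are recalculated
    fired = [v for v in profile if similarity(doc, v) > threshold]
    if fired:
        f = ask_user(raw)                           # 2. feedback moves the firing vectors
        for v in fired:
            apply_feedback(v, doc, f)
        if f > 0:
            profile.append(doc)                     # 3. relevant document joins the profile,
            corpus_stats.add(doc)                   #    changing both tf and idf factors
        evolve(profile, corpus_stats)               # 4. GA adds generated vectors (idf changes),
        recompute_weights(profile, corpus_stats)    #    so non-genetic weights are updated again
```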

[Figure 5: Events diagram. A new document adds terms and updates frequencies in the corpus statistics; idf changes trigger the weights calculation, which updates the profile weights; the user relevance feedback updates the feedback factors; a new relevant document adds its vector to the profile and updates the frequencies; and the genetic document generation adds vectors to the profile, whose entries carry terms, frequencies, weights and feedbacks.]

The frequencies of the genetic vector terms are initialized to zero and their weights are taken from the parents. These weights will never change since they have no frequency, but their feedback factors will make them evolve. That is why the feedback factors are kept in a separate vector. The genetically generated vectors influence the idf factors in the corpus statistics, and the weights of the non-genetic vectors are then recalculated. In summary, the relevance feedback influences future retrievals by directly modifying the weights of the terms. The genetic algorithm influences future retrievals in two ways: by adding new combinations of terms to the profile and by modifying the inverse document frequencies in the corpus statistics, which affects the weights of the non-genetic vectors. The relevance feedback process makes the profile evolve by changing the relative importance of the terms within each vector. The genetic algorithm mainly makes the profile evolve by adding new combinations of terms, which bring in different term relations.

A document could thus be fired based on a genetic vector only, rather than on an original vector provided by the user. The relevance feedback process introduces competition at the term level within each vector: a term proven useful, as judged by the user, gets its weight increased. The genetic algorithm introduces competition at the vector level: a vector proven useless in the past will eventually disappear from the profile, while a vector proven useful will survive and multiply by passing its genetic code to its offspring.

5. Incremental collection statistics

Testing different methods for building the incremental collection statistics was not part of the original objectives of this experiment, but it soon appeared that this issue was important for improving performance. The genetic algorithm increases the number of calculations dramatically, and reducing the corpus size was the best alternative for coping with the GA computational cost. All tf x idf algorithms work with the corpus statistics, which are needed for the idf calculations. In traditional information retrieval, the collection of documents is static, so the statistics can be calculated in advance and stored for later use by a search engine. In information filtering the collection is unknown in advance, so the collection statistics have to be updated incrementally as the search engine goes through the collection. IntellAgent was first programmed with an incremental update of the collection statistics using all terms of each document of the collection. Based on the work of Callan (1996), we alternatively computed the incremental update of the statistics using only the terms of the collection that matched at least one term of the profile. This reduced the total number of terms to one fourth of the original size and cut the processing time by two thirds. We wanted to further test the hybrid algorithm with that alternate method, to ensure a similar level of effectiveness, before adopting it. The results are detailed at the end of the next section.
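A minimal sketch of such an incremental statistics store, assuming a simple document-frequency table and an optional restriction to the profile's terms (the names are ours, not the paper's):

```python
class IncrementalStats:
    """Incremental collection statistics (section 5). If profile_terms is given,
    only terms matching the profile are tracked, which shrinks the tracked
    vocabulary (the paper reports roughly a four-fold reduction in terms and
    a two-thirds cut in processing time on its sub-collection)."""

    def __init__(self, profile_terms=None):
        self.n_docs = 0
        self.doc_freqs = {}                      # term -> number of documents containing it
        self.profile_terms = set(profile_terms) if profile_terms else None

    def add_document(self, doc_terms):
        self.n_docs += 1
        for term in set(doc_terms):
            if self.profile_terms is None or term in self.profile_terms:
                self.doc_freqs[term] = self.doc_freqs.get(term, 0) + 1
```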

6. Experiment results

The experiment was conducted as an ad hoc type of test using the TREC (Text REtrieval Conference) categorized collection of documents. We selected a sub-collection of 7532 documents from the TREC-6 collection along with five topics to be searched for: #301, #306, #319, #337 and #347. The selection was made to ensure a sufficient number of relevant documents for each topic, allowing the agent to adapt over time. The number of relevant documents in that sub-collection ranged from 24 to 129. The parameters of the algorithms were set as follows:
• relevance feedback power α = 0.20;
• maximum crossover rate = 60 %;
• maximum number of vectors in the profile = 30;
• mutation rate = 1 %;
• similarity threshold = 0.058.

Since the similarity threshold was fixed, the recall and precision metrics were measured simultaneously.

TREC topic          Total/Average    #301     #306     #319     #337     #347
TREC relevant #          271          129       56       29       24       33
Agent fired #            758           66      166      117      186      223
Agent relevant #          82            5       19       13       18       27
% Recall               30.26         3.88    33.93    44.83    75.00    81.82
% Precision            10.82         7.58    11.45    11.11     9.68    12.11

Table 1 : Relevance feedback results

TREC topic          Total/Average    #301     #306     #319     #337     #347
TREC relevant #          271          129       56       29       24       33
Agent fired #           3931         1167     1024      590      509      641
Agent relevant #         222           92       55       22       21       32
% Recall               81.92        71.32    98.21    75.86    87.50    96.97
% Precision             5.65         7.88     5.37     3.73     4.13     4.99

Table 2 : Genetic algorithm results

TREC topic          Total/Average    #301     #306     #319     #337     #347
TREC relevant #          271          129       56       29       24       33
Agent fired #           1814          639      545      157      217      256
Agent relevant #         210           97       52       13       21       27
% Recall               77.49        75.19    92.86    44.83    87.50    81.82
% Precision            11.58        15.18     9.54     8.28     9.68    10.55

Table 3 : Hybrid algorithm results

For the relevance feedback alone, the average precision is good but the average recall is very low. The recall results are quite variable among the topics; it seems that the relevance feedback process is unstable and topic dependent. The genetic algorithm alone yielded a very good average recall but a low average precision, and its results seem stable across the topics. The hybrid algorithm yielded a better average precision than the other two and a better average recall than the relevance feedback process. Its average recall is still within the genetic algorithm's range but a little below it. The detailed results showed the same stability as with the genetic algorithm. A t-test with significance level α = .1 and df = 4 (5 TREC topics - 1) showed that the hybrid algorithm has significantly better recall than the relevance feedback process with no significant difference in precision, and significantly better precision than the genetic algorithm with no significant difference in recall. The test of the hybrid algorithm with the alternate method for building the collection statistics shows a better average recall with a lower average precision (see table 4). The overall results show the same stability among the topics.

Kind of corpus                    Metric          Weighted average    #301     #306     #319     #337     #347
Based on all collection terms     % Recall             77.49         75.19    92.86    44.83    87.50    81.82
(66010 terms)                     % Precision          11.58         15.18     9.54     8.28     9.68    10.55
Based on profile terms            % Recall             90.04         92.25    83.93    89.66    79.17   100.00
(14148 terms)                     % Precision           9.89         13.11     9.14     7.07     7.76     7.66

Table 4 : Comparative results for incremental corpus statistics building

It may seem a little surprising that the recall is better. This could be explained by noting that the profile-matching-terms method concentrates the search more on the useful combinations of terms, eliminating useless terms at the beginning of the process, although the overall results showed more or less the same level of effectiveness. One can argue that changing the level of the threshold parameter could bring back the same recall/precision ratio. The performance gain obtained by using the statistics built from the profile-matching terms only was based on a collection of 7532 documents. Needless to say, the bigger the collection, the bigger the performance gain will be, and the results will tend toward the same values as the number of terms in the corpus tends to its upper bound.

7. Conclusion

In this work, a combination of relevance feedback and genetic algorithms was studied for information filtering purposes. The hybrid algorithm developed was tested within the IntellAgent search engine using a subset of the TREC collection. The results show that the hybrid algorithm is significantly better in recall than the relevance feedback process alone and significantly better in precision than the genetic algorithm alone. Using an alternate method, based only on the terms matching the profile, for building the incremental collection statistics cut the processing time to a third. Surprisingly, it also improved the overall recall. Both the genetic algorithm and the hybrid algorithm showed more stable results than the relevance feedback process across the topics. The preliminary results of this experiment highlight the potential of a hybrid algorithm combining traditional relevance feedback methods and genetic algorithms for information filtering. To further support these results, more tests with a larger collection of documents and more topics are needed. Also, conducting independent tests for each parameter, measuring precision and recall, would give more insight into the relative effect of these parameters on retrieval effectiveness.

References

Allan, J. (1996). Incremental Relevance Feedback for Information Filtering. Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 270-278).

Belkin, N.J. & Croft, W.B. (1992). Information Filtering and Information Retrieval: Two Sides of the Same Coin? Communications of the ACM (Vol. 35, No. 12, pp. 29-37).

Callan, J. (1996). Document Filtering with Inference Networks. Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 262-269).

Chen, H. (1994). Machine Learning for Information Retrieval: Neural Networks, Symbolic Learning, and Genetic Algorithms. MIS Department, University of Arizona, http://ai.bpa.arizona.edu/papers/mlir93/mlir93.html.

Chen, H. & Kim, J. (1994). GANNET: Information Retrieval Using Genetic Algorithms and Neural Nets. University of Arizona, http://ai.bpa.arizona.edu/papers/gannet93.html.

Chen, H. et al. (1995). A Machine Learning Approach to Inductive Query by Examples: An Experiment Using Relevance Feedback, ID3, Genetic Algorithms, and Simulated Annealing. University of Arizona, http://ai.bpa.arizona.edu/papers/expert94.html.

Cheong, F-C. (1996). Internet Agents: Spiders, Wanderers, Brokers, and Bots. New Riders Publishing, Indianapolis, Indiana.

Ferguson, S. (1995). BEAGLE: A Genetic Algorithm for Information Filter Profile Creation. University of Alabama, http://www/cis/uab/edu/info/grads/sf/papers/cs692.report.html.

Genesereth, M.R. & Ketchpel, S.P. (1994). Software Agents. Communications of the ACM (Vol. 37, No. 7, pp. 48-53).

Goldberg, D.E. (1989). Genetic Algorithms in Search, Optimization & Machine Learning. Addison-Wesley, ISBN 0-201-15767-5.

Jacobs, P.S. et al. (1993). A Boolean Approximation Method for Query Construction and Topic Assignment in TREC. Second Annual Symposium on Document Analysis and Information Retrieval, IEEE (pp. 191-200).

Ribeiro, J.L. et al. (1994). Genetic-Algorithm Programming Environments. IEEE (pp. 28-43).

Salton, G. & Buckley, C. (1991). Global Text Matching for Information Retrieval. Science (Vol. 253, p. 974).

Salton, G. & McGill, M.J. (1983). Introduction to Modern Information Retrieval. Computer Science Series, McGraw-Hill (pp. 120-122).

Sheth, B. (1994). NEWT (News Tailor). MIT Media Lab, Autonomous Agents Group, http://lcs.www.media.mit.edu/groups/agents/papers/newt-thesis/main.html.

Singhal, A. et al. (1996). Pivoted Document Length Normalization. Department of Computer Science, Cornell University, Ithaca, NY.

Srinivas, M. & Patnaik, L.M. (1994). Genetic Algorithms: A Survey. IEEE (pp. 17-26).

Yan, T.W. & Garcia-Molina, H. (1993). Index Structures for Information Filtering Under the Vector Space Model. Department of Computer Science, Stanford University, ICDE (pp. 337-347).

Yang, J-J. & Korfhage, R.R. (1993). Effects of Query Term Weights Modification in Document Retrieval: A Study Based on a Genetic Algorithm. Second Annual Symposium on Document Analysis and Information Retrieval, IEEE (pp. 271-285).

Yuwono, B. & Lee, D.L. (1996a). Search and Ranking Algorithms for Locating Resources on the World Wide Web. Proceedings of the 12th International Conference on Data Engineering (pp. 164-171).

Yuwono, B. & Lee, D.L. (1996b). WISE: A World Wide Web Resource Database System. IEEE Transactions on Knowledge and Data Engineering (Vol. 8, No. 4, pp. 548-554).