Adaptive Filtering of Newswire Stories using Two-level Clustering

David Eichmann(1,2) and Padmini Srinivasan(1,3)
(1) School of Library and Information Science
(2) Computer Science Department
(3) Department of Management Sciences
The University of Iowa
{david-eichmann, padmini-srinivasan}@uiowa.edu

Abstract

Adaptive filtering of news is an area of information retrieval gaining substantial interest as services become more available on the Internet. This paper reports on a number of experiments involving a two-level clustering approach using a variety of techniques including threshold adaptation, topic vocabulary adaptation and both noun phrase and named entity adaptation. Our goal in this exploratory research is to empirically compare alternative configurations of our filtering approach that will allow us to better understand the relative value of the component subsystems.

Keywords: adaptive filtering, document clustering, part-of-speech tagging, named entity extraction

1 – Introduction

What does an individual do when reading an information stream such as a daily newspaper? It is indeed a rare user who doggedly peruses every single item present. More commonly, users read some articles in depth, give others a quick scan and avoid the rest entirely. Clearly the user’s interests strongly influence the choices made. These decisions are “filtering” decisions, with the user operating on the results. There may be filters looking for items that will be read in depth, while other filters look for topics of medium interest, and still others may have a “serendipity” factor, keeping an eye out for new and interesting items. Users’ filters have other characteristics: they may be short-lived or long-standing. They may be at different levels of abstraction; compare the filter for “current statistics on foot and mouth disease” to the one for “international trade barriers.” They may be fixed or evolving. As an example of the latter, take the case of the user who starts off interested in “Florida’s vote counts” during the recent US presidential elections and a month or so later becomes more interested in “the different kinds of ballot machines in the US” and later more specifically in “electronic voting methods.” Relative priorities on filters may also change based on the current context. Our top-ranked filters while at work are likely to be quite different from those while at home. And of course filters may be imposed; for example, a parent requiring particular filters on a child’s reading, or a government filtering out allegedly “dangerous” materials. Clearly, when users filter a temporal stream of information covering a heterogeneous collection of topics, the decisions behind their selections are rather complex.

Text filtering systems have taken on the challenge of making such decisions for the user. The only input given to such an automated filtering system is a description of the user’s interests, with perhaps a few example relevant items. Each time the filtering system becomes aware of a new item, it has to make a decision: is the item relevant enough to retrieve for the user? With filtering systems it is also assumed that, when shown a new item, the user is both willing and capable of providing a relevance judgement. This judgement offers the system an opportunity to adapt to the user’s information need. Indeed, it is this opportunity that makes “adaptive filtering” an exciting area for research and development. How should one change the internal representation, i.e., the profile of the user’s interest, given this feedback? How should one change the filtering strategy? Adaptive filtering systems show both differences and similarities in their responses to these basic questions.

The TREC filtering track makes certain assumptions to constrain the adaptive filtering problem and make it more tractable. These assumptions are also made to support comparisons between filtering systems.
First, it is assumed that the user’s information need is reasonably long-standing, lasting at least as long as the duration of the test database. Second, it is assumed that users do not go back in time to read older materials; that is, decisions made regarding current information (e.g., today’s newspaper) may not be changed at a later date. In reality, a current news item may motivate a user to look for its origins in older issues. Finally, it is also assumed that user profiles are independent of each other. In other words, what the system learns via one topic is not allowed to influence the filtering strategy for a different topic. In contrast, for real users interactions may freely occur between the topic filters. For example, a user may be simultaneously interested in both ‘genetically engineered agricultural products’ and ‘strawberry production’. Her first topic may lead to documents about strawberries that have been genetically modified to raise their resistance to frost. These documents may be of interest to her second topic as well, and may in fact cause her to modify it to “the effect of temperature and frost on strawberry production.” Users interested in multiple topics may actually benefit from any serendipitous or intentional interactions between them. A simple extension of this argument suggests that a foreseeable advantage of the next generation of adaptive filtering systems is the ability to exploit relationships between topics and across users.

This paper explores the adaptive information filtering problem, specifically the task of filtering a large, temporally ordered collection of documents given an initial set of topic definitions. We do this via TRECcer, our text filtering system. TRECcer adopts a two-level dynamic clustering strategy that uses the vector space model and involves Rocchio-based adaptation. We present a systematic exploration of different features of TRECcer as a means of comparing, within a single framework, the effects of varying similarity thresholds and lexical features, including noun phrases and named entities. We do not consider the problem of identification or discovery of topics from a stream of documents, viewing this as a distinct task with different, equally valid, performance criteria.

2 – Background and Related Work

2.1 – Recent Research

Filtering as a task has its roots in SDI, the Selective Dissemination of Information, a service introduced as early as 1950 when most retrieval technologies operated as batch systems. SDI services, in which users automatically receive current information matching their interest profile, are now a common feature of operational retrieval systems such as DataStar and SilverPlatter. The limitation of SDI in these operational systems is that it is the user’s responsibility to ensure that the interest profile is current and appropriate for the search system. In contrast, research within the TREC filtering track aims to shift the responsibility more towards the system. This includes decisions regarding profile creation, maintenance and filtering strategy. Since its introduction in 1994, the filtering track has evolved to include different subtasks such as batch filtering, routing and adaptive filtering. The annual TREC filtering reports, both overviews and individual participant reports, are available at the TREC web site (http://trec.nist.gov/pubs/). A variety of adaptive filtering approaches may be observed in the last TREC session, including methods based on the popular Rocchio algorithm, k-nearest neighbors, clustering and support vector machines (Robertson and Hull, 2000). Some key aspects investigated include feature selection, threshold adaptation and the value of relevant information over time.

A closely related notion of tracking is also being explored within TDT, the Topic Detection and Tracking workshop series. A key difference between TREC filtering and TDT tracking is that TDT user interests tend to be event based while in TREC they are more topic based. Although seemingly similar, topic versus event orientation can generate significant differences in the nature of relevance decisions, which in fact drive the whole adaptation process. Moreover, TDT ‘defines’ a topic by means of a small number of example items, a strategy that has only recently been used to augment the more standard descriptive format of topics in TREC (an example of which can be found in Figure 1). Finally, TDT tracking is not adaptive, since relevance judgments are not available as the run proceeds. Despite these differences, at a fundamental level the goals of TREC filtering and TDT topic tracking are very close.

Filtering defined a bit more broadly has captured strong interest outside the TREC and TDT domains as well. An active area of research is collaborative filtering (Balabanovic 1997, Delgado 1998, Resnick 1994, Shardanand 1995), where user evaluations of information sources (or products and services) are combined in some way and used as recommendations for new users. Such filters are meaningful when dealing with a large network of users, which allows us to capitalize on users having similar interests and tastes. Collaborative filtering supports the development of recommender systems such as Ringo (Shardanand 1995) and Grouplens (Resnick 1994).
Recent research in which software agents maintain their own social networks, utilizing not only individual recommendations but also the ratings of the recommenders (Singh et al., 2001), offers very interesting avenues for extending collaborative filters. The approach to adaptation discussed here, as in TREC, does not assess the ability of a community to succeed at recommendation, but rather addresses the effect that incremental revelation of topic knowledge can have upon a filtering system’s performance. Indeed, the adaptive filtering experiments done in the TREC framework commonly result in new positive judgements not retrieved by other tasks operating on the same corpus.

There are also other interesting problems related to information filtering, such as concept drift, described generally as the phenomenon of changes in the content of the information stream over time. Concept drift can clearly impact the effectiveness of user profiles for filtering. One response to concept drift is to use a windowing approach in which the window is selected so as to contain the most relevant context for the current learning task, as suggested by Widmer and Kubat (1995). A related notion proposed recently is that of the “half-life” of a training document; a corresponding weighting scheme used in TREC for the profile updating function gives a higher weight to more recent documents (Arampatzis et al. 2000). Decay functions have also been explored by Balabanovic (1997) and Taylor et al. (1997). Also of interest is the phenomenon of concept shift, where there is a sudden change in the user’s interest profile as opposed to the gradual one implied by concept drift (Klinkenberg and Joachims 2000, Klinkenberg and Renz 1998, Lam and Mostafa 2001). Our use of a set of secondary clusters of documents associated with a topic provides some support for drift: new secondary clusters are typically created through the entire period of analysis, and frequently have little similarity with clusters formed early in a run. We do not, however, attempt to assess these effects in this paper.

2.2 – TRECcer: Filtering Approach

TRECcer, our two-level dynamic clustering system, is designed to simultaneously process multiple pre-specified topics (Eichmann et al., 1998; Eichmann and Srinivasan, 1999a; Eichmann et al., 1999b; Eichmann and Srinivasan, 2000). Note that a topic may be specified by a textual description and/or by a few sample relevant documents. Each topic description generates a topic vector, which starts a primary cluster for the topic. As documents arrive in a temporally ordered sequence, their vectors are compared with the set of cluster vectors. A document that is sufficiently similar to a topic cluster is added to that cluster. Documents attracted into a primary cluster participate in a topic-specific second-level clustering process, yielding what we refer to as secondary clusters. Thus the newly attracted document may join an existing secondary cluster or create a new one.
This new document is declared, i.e., retrieved for the user, depending upon the secondary cluster’s status as described below. Note that as the information stream is processed, a given document may enter the primary clusters of zero, one or more topics. Documents that do not enter a topic’s primary cluster are not considered for retrieval against that topic any further, irrespective of their true relevance status. (It is at this point in our architectural framework that the semantics of TRECcer as a document filterer differ from those of the configuration we use in TDT-style topic detection: the topic detector proceeds to create a new primary cluster, using the current document as a seed.) There are three thresholds used by the system. A document is added to a primary cluster if its similarity with the cluster’s centroid* is above the primary threshold, p. A document joins a secondary cluster if its similarity with the cluster’s centroid is above the secondary threshold, s. A document whose maximal secondary cluster similarity is less than s forms a new secondary cluster for that primary. Finally, a (potentially new) secondary cluster declares (retrieves) the newly entered document if the similarity between the secondary cluster and its parent primary cluster is above the declaration threshold, d.

Document and Cluster Representations

As documents arrive they are indexed using TF*IDF weights normalized by document length. Terms are stemmed using Porter’s algorithm (Porter 1980). Cluster vectors are also weighted using TF*IDF. We limit document and cluster vectors to the best 100 and 200 stems, respectively. Vectors are regenerated every time the underlying object changes; for example, when a cluster gains a new document, its centroid vector is recalculated. IDF weights are updated periodically, at 1000-document intervals. This allows us to avoid the need for a secondary corpus from which to generate term frequencies. We encounter some drift in IDF weights at the beginning of a run, but nothing substantial, and the system quickly stabilizes.

Adaptation Strategies

Adaptation has been built into TRECcer in several ways. First, the primary cluster profile for each topic can be adapted with feedback using Rocchio’s method. The Rocchio approach is generally used to transform a query vector in response to relevance judgements obtained for retrieved documents [20]. Details of our Rocchio-based primary profile adaptation are provided in Section 6.1. In general, three term vectors jointly represent the topic profile at the primary cluster level: the first represents the original topic, and the remaining two represent the relevant and non-relevant documents retrieved. Thus, as each new document is judged for relevance, the topic profile is adapted by adding the document’s terms to the appropriate vector of the topic profile.

A second dimension of adaptation is present in the secondary clusters, which we describe using a coloration scheme. When a secondary cluster is first created it has no color. However, when it declares a document (which happens when its similarity with the parent primary cluster is above d, the declaration threshold), then based on the user judgement on that document the cluster obtains a color. If the document just declared is non-relevant (off-topic), the cluster is colored red. This means that although the cluster continues to participate in future clustering processes, it does not declare documents any further. If a secondary cluster just declared a relevant (on-topic) document, it is colored green. This means that the cluster continues to declare documents that are attracted to it later. Of course, it is quite possible for a green cluster to declare a non-relevant document; in this case, the non-relevant document spawns off its own red cluster and the parent secondary cluster continues on as a green cluster.

A third dimension of adaptation is associated with the three thresholds. TRECcer has the ability to monitor performance and respond by adjusting one or more of these thresholds. In a later section we show experiments that explore one such adaptive configuration of TRECcer.

* We define a centroid for both a document and a cluster to be a vector of occurring terms with associated weights.
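The two-level clustering and coloration logic described above can be sketched as follows. This is an illustrative simplification rather than the authors’ implementation: centroids are not recomputed as clusters grow, and a plain dot product stands in for whatever similarity TRECcer computes over its TF*IDF vectors; all function names are ours.

```python
def dot(u, v):
    """Dot-product similarity over sparse term-weight vectors (dicts)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def filter_step(doc_vec, topic, p, s, d, sim=dot):
    """One filtering step for one topic: return the secondary cluster that
    declares the document, or None if the document is not declared.
    `topic` is {"centroid": vector, "secondary": [clusters]}, where each
    cluster is {"centroid": vector, "color": None | "green" | "red"}."""
    if sim(doc_vec, topic["centroid"]) < p:
        return None                      # never enters this topic's primary cluster
    # find the most similar existing secondary cluster
    best, best_sim = None, -1.0
    for cluster in topic["secondary"]:
        c_sim = sim(doc_vec, cluster["centroid"])
        if c_sim > best_sim:
            best, best_sim = cluster, c_sim
    if best is None or best_sim < s:     # start a new, colorless secondary cluster
        best = {"centroid": dict(doc_vec), "color": None}
        topic["secondary"].append(best)
    # red clusters keep clustering but never declare again
    if best["color"] != "red" and sim(best["centroid"], topic["centroid"]) >= d:
        return best                      # declared: the relevance judgement on the
                                         # document then colors the cluster
    return None
```

After a declaration, the caller would set the returned cluster’s color to green or red according to the user’s judgement, reproducing the coloration scheme above.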

3 – Assessment of Performance

3.1 – Measures

Several measures have been used for testing the performance of filtering methods. Since we are interested in adaptive filtering, we limit this discussion to the corresponding subset of measures. A typical approach for evaluation is to use a linear utility measure:

Utility(T) = w1 * |TP| - w2 * |FP|

where |X| is the cardinality of set X, TP is the set of relevant documents retrieved (true positives), FP is the set of non-relevant documents retrieved (false positives), and w1 and w2 represent the value of retrieving a single relevant document and the cost of retrieving a single non-relevant document, respectively. Specific measures derived from this general scheme that have been tried in the adaptive filtering track of TREC include LF1, with w1 = 3 and w2 = 2, and LF2, with w1 = 3 and w2 = 1 (both were tried in TREC-7). Clearly, a greater weight on w1 places greater emphasis on recall over precision. An even more recall-oriented measure is F3, with w1 = 4 and w2 = 1 (TREC-7). Filtering by a utility function is equivalent to filtering by an estimated probability of relevance, as shown by Lewis (1995). For instance, LF1 is equivalent to retrieving a document if its probability of being relevant is greater than 0.4, while for LF2 the probability should be greater than 0.25. Non-linear utility measures such as the following have also been explored:

NF1 = 6*|TP|^0.5 - |FP|
NF2 = 6*|TP|^0.8 - |FP|
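As a concrete illustration, these utilities and the break-even probability of relevance implied by a linear utility can be computed as follows (the function names are ours, not TREC’s):

```python
def linear_utility(tp, fp, w1, w2):
    """General linear utility: w1*|TP| - w2*|FP|."""
    return w1 * tp - w2 * fp

def lf1(tp, fp): return linear_utility(tp, fp, 3, 2)   # TREC-7 LF1
def lf2(tp, fp): return linear_utility(tp, fp, 3, 1)   # TREC-7 LF2
def f3(tp, fp):  return linear_utility(tp, fp, 4, 1)   # TREC-7 F3

def nf(tp, fp, exponent):
    """Non-linear utilities: NF1 uses exponent 0.5, NF2 uses 0.8."""
    return 6 * tp ** exponent - fp

def breakeven_probability(w1, w2):
    """Retrieval is worthwhile when p*w1 - (1-p)*w2 > 0, i.e. when
    p > w2/(w1+w2): 0.4 for LF1 and 0.25 for LF2, as noted in the text."""
    return w2 / (w1 + w2)
```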


The motivation behind these non-linear measures is that the greater the number of relevant documents already retrieved, the lower the utility of the next relevant document retrieved. More recently (TREC-9), a more directly precision-oriented measure was used:

T9P = |TP| / Maximum(TargetNumber, |R|)

where R is the set of documents retrieved and TargetNumber is the number of documents to be retrieved over the temporal duration of the database. Although in experiments TargetNumber is fixed across all topics, the intent behind this number is to capture topic-specific user expectations regarding the volume of relevant material present in the collection and the number of documents the user wants. The trouble with the Utility score given above is its range of values: at the lower end it is bounded, roughly, by the negative of the collection size, if we assume that the number of relevant documents for a topic tends to be very much smaller than the number of non-relevant ones. Since a filtering system may retrieve any number of non-relevant documents, a different measure, T9U, was introduced in TREC-9 that places a floor on the penalty of false positive decisions via MinimumU:

T9U = Maximum((2*|TP| - |FP|), MinimumU)

It may be observed that although bounded below, T9U is still unbounded above, since the upper bound depends on the number of relevant documents in the collection. In the TDT tracking task, performance is measured by the cost faced by the user:

C_DET = C_MISS * P_MISS * P_target + C_FA * P_FA * P_non-target

where C_MISS and C_FA are the costs incurred due to a miss (false negative) and a false alarm (false positive), respectively. P_MISS and P_FA are the conditional probabilities of a miss and a false alarm, given that the document is relevant and non-relevant, respectively. P_target and P_non-target are the a priori probabilities of a document being relevant and non-relevant, respectively, and thus sum to 1. P_MISS increases as recall falls, while P_FA increases as precision falls.
C_DET is generally normalized by the cost of the better trivial system (declaring every document or declaring none), so that a normalized score of 1 corresponds to trivial performance:

(C_DET)_NORM = C_DET / min {C_MISS * P_target, C_FA * P_non-target}
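The TDT cost computation can be sketched directly from these formulas; the defaults below are the TDT2000 parameter values quoted in the text, and the function names are ours:

```python
def c_det(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Detection cost: C_MISS*P_MISS*P_target + C_FA*P_FA*P_non-target."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def c_det_norm(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Cost normalized by the better trivial system (declare all / declare none)."""
    trivial = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return c_det(p_miss, p_fa, c_miss, c_fa, p_target) / trivial
```

A system that declares nothing (P_MISS = 1, P_FA = 0) scores exactly 1 after normalization, so normalized scores below 1 indicate performance better than trivial.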


This measure was used in TDT2000 for tracking, with P_target = 0.02, C_MISS = 1.0 and C_FA = 0.1 (TDT2000). This combination of costs places the emphasis on high recall.

3.2 – Summarizing across topics

There are two broad strategies for summarizing performance. First, one may calculate the selected measure (any of the utility scores, the precision measure or the cost) for each topic and average the scores across topics (macro averaging). Alternatively, one may calculate totals (such as |TP| and |FP|) across all topics and then calculate the selected measure (micro averaging). Unfortunately, these averages can be somewhat problematic when there are no bounds on the scale of possible values for the measure. Take, for example, the LF1 score: a system can have its macro-average LF1 pulled down severely if it performs very poorly on a few of the topics. One option for summarizing performance across topics is to use a scaled utility function such as that proposed by Hull (TREC-7, 1998):

us(T) = {max(u(T), U(s)) - U(s)} / {MaxU(T) - U(s)}

Here us(T) and u(T) are the scaled and original utility scores on topic T, respectively, MaxU(T) is the maximum possible utility for topic T, and U(s) is the utility obtained if we assume that s non-relevant documents (and no relevant documents) are retrieved. The scaled utility score lies between 0 and 1 and can be averaged more confidently across topics. The problem is in setting an appropriate value for s. In TREC-7, s ranged from 25 to 100, and the scaled utility scores at each s were averaged across topics. Finally, the difference of the average scaled utility score from the baseline performance computed using zero retrieved documents was measured.
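Hull’s scaling can be sketched directly from the formula above; here U(s) is computed for a linear utility with penalty weight w2, and the function names are ours:

```python
def u_of_s(s, w2):
    """U(s): utility when s non-relevant and no relevant documents are retrieved."""
    return -w2 * s

def scaled_utility(u_topic, max_u, u_s):
    """us(T) = (max(u(T), U(s)) - U(s)) / (MaxU(T) - U(s)), which lies in [0, 1]."""
    return (max(u_topic, u_s) - u_s) / (max_u - u_s)
```

For example, under LF1 a topic with 10 relevant documents has MaxU(T) = 30; with s = 25, U(s) = -50, so a raw utility of 20 scales to (20 + 50) / 80 = 0.875.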

4 – Experimental Design

Our corpus comprises 242,407 Associated Press newswire stories generated over the period February 1988 to December 1990. User queries comprise the 50 topics used in TREC-7 for the filtering track, ranging in scope from counternarcotics activities to the nature of the PostScript printer language. An example topic is given in Figure 1. The stories in the corpus are processed in temporal order against each of the topics. Stories published on the same day are processed in the order in which they appear in the file that contains them. No aggregating or pooling of documents is done in any way.


Tipster Topic Description
Number: 011
Domain: Science and Technology
Topic: Space Program
Description: Document discusses the goals or plans of the space program or a space project of any country or organization.
Narrative: To be relevant, a document must discuss the goals or plans of a space program (e.g. the Space Station Freedom) or space project (e.g. Shuttle mission 29-A) and identify the organization sponsoring the program.
Concept(s):
1. Shuttle, Space Plane, space station
2. Magellan, planetary explorer, satellites
3. vehicle launch
4. NASA, Ariane, European Space Agency (ESA)
5. Astronaut, Cosmonaut
6. Explorer, Dicsovery, Columbia, Mir
7. Cape Canaveral, Star City
8. space
Factor(s):
Definition(s):
Space program - coherent set of initiatives to exploit outer space (e.g., the National Aeronautic and Space Agency (NASA) has a manned space program).
Space project - a specific mission to exploit outer space (e.g., a specific launch of the Space Shuttle).

Figure 1. A Sample Topic Definition (Note the misspelling of Discovery in Concept 6.)

When the system decides that a story is relevant, it may declare it, and it learns the relevance of the declared story at that time. Stories are judged on-topic or off-topic, or are unjudged (which for evaluation purposes is counted as off-topic). This third category is necessary because the pool of judgements for this corpus with these topics is not complete. These judgements were generated by batch retrieval systems prior to TREC-7 and supplemented with a single round of additional judgement assessments for those stories declared by TREC-7 filtering systems. The corpus, with its subject area and the format of its topic definitions, offers a reasonable approximation of situations in which a user drawn from the general public is interested in having a customized feed of news stories.


In order to evaluate the performance of our system over time, we take periodic snapshots of system state at intervals of 5,000 stories. At each snapshot we update frequencies and calculate statistics sufficient to generate the measures discussed above. Our main interest in this series of experiments is to study the trade-offs and scoring sensitivity across a number of parameters available in TRECcer. We therefore ran each configuration of the system presented later across a range of primary threshold values, 0.10 to 0.50 in 0.05 increments. In some cases we also have runs involving a primary threshold as low as 0.05 and a secondary threshold as high as 0.75. These outlier runs are not exhaustive across the parameter space, but were done only to explore extremes where they might be interesting. We have selected as our evaluation measure:

T9U = Maximum((3*|TP| - 2*|FP|), -100)

This is a modification of the TREC-9 utility measure, in which the weights were 2 and 1 on |TP| and |FP| respectively. It provides a middle ground between the high-precision evaluations done in TREC-7 with LF1 and the somewhat more recall-oriented evaluations done in later TREC filtering evaluations. We specifically chose not to pursue a truly recall-oriented measure like that used in TDT because we felt that a real user would not tolerate the number of false positives that such a measure accommodates. The T9U scores are averaged across topics to give MeanT9U.
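The evaluation measure and its macro average can be sketched as follows (function names ours):

```python
def t9u(tp, fp, floor=-100):
    """Our modified TREC-9 utility: max(3*|TP| - 2*|FP|, -100)."""
    return max(3 * tp - 2 * fp, floor)

def mean_t9u(counts_per_topic):
    """Macro average of T9U over (|TP|, |FP|) pairs, one pair per topic."""
    return sum(t9u(tp, fp) for tp, fp in counts_per_topic) / len(counts_per_topic)
```

The floor at -100 bounds the damage that a single badly behaved topic can do to the macro average.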

5 – Results

Our goal, as mentioned above, is to explore a variety of configurations of TRECcer using a reasonable range of parameter values. Unless otherwise indicated, we present the four best runs within each configuration and also specify the range of parameter values tested. Each line graph has the snapshot number on the X axis and the T9U score on the Y axis; snapshots are taken after every 5,000 processed documents. Unless otherwise specified, the runs within a graph are indistinguishable, i.e., they are not significantly different as judged using error bars. (The error bar graphs are not presented due to space limits.)

5.1 – Simple Threshold Filtering

Perhaps the most obvious approach to filtering is to set a single similarity threshold and process the document stream for the topics. Documents that cross this threshold are retrieved. In TRECcer this corresponds to setting the primary threshold and not using the rest of the


[Figure 2 appears here: MeanT9U score (Y axis) versus snapshot number (X axis) for runs with p = 0.30, 0.35, 0.40, 0.45 and 0.50.]

Figure 2. Simple Threshold Filtering, All Responsive Runs

system, i.e., the secondary clustering components. Similarity between a topic Ti and a document Dj is computed as:

Similarity(Ti, Dj) = sim(ti, dj)

where ti and dj are the representation vectors for Ti and Dj, respectively, and sim() is the dot product function. The obvious difficulty with this approach is that the threshold must be established with little prior knowledge (perhaps some estimate of term frequencies on some other corpus). Figure 2 shows how challenging this can be. Of the ten runs in the parameter range 0.05 ≤ p ≤ 0.50, only five exceed the floor function in the utility measure at any time during the run, and of those five only two (p = 0.45 and p = 0.50) manage to generate any interesting level of performance over the full run.

It is useful to consider the density of on- and off-topic documents over time, since these numbers constrain the potential gain or loss across any snapshot or sequence of snapshots. Figure 3 shows the density of on- and known off-topic documents in the AP corpus. Gains in utility in this simple approach roughly correspond to increases in on-topic document density over time, but losses exhibit little correspondence to off-topic document density. This is due in great measure, for snapshots 1 and 17, to contributions to the judgement pool by filtering systems in TREC-7 that declared large numbers of documents early in the two phases of that year’s filtering run in attempts to train on the topics. In most cases, the actual similarity of a given document to a given topic is quite low for the negative judgments in these two snapshots.

[Figure 3 appears here: number of on- and off-topic documents (Y axis) versus snapshot number (X axis).]

Figure 3. On- and Off-Topic Document Density by Snapshot (every 5000 documents)
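The single-threshold baseline of this section reduces to a one-line decision rule; the sketch below uses a sparse dot product as sim(), with our own naming rather than the TRECcer code:

```python
def dot(u, v):
    """Dot product over sparse term-weight vectors (dicts)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def simple_filter(stream, topic_vec, p, sim=dot):
    """Retrieve every document whose similarity to the topic exceeds p."""
    return [doc for doc in stream if sim(topic_vec, doc) > p]
```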

6 – Adaptation Via Topic Terminology

It is generally recognized that a retrieval system must be adaptive to perform well in a filtering task, particularly when little or no domain knowledge is available. This section presents the results of experiments with Rocchio-based adaptation of topics and with the ‘pure’ two-level clustering scheme that we use as the foundation of our main architecture. We then combine the two techniques into a hybrid architecture and derive a variation by considering only the vocabulary distinct to the on- and off-topic documents.

6.1 – Rocchio Adaptation with Simple Threshold Filtering (Rocchio_Simple)

Much of the penalty associated with the simple thresholding approach of the previous section can be attributed to the nature of the topic definition and its relationship to the corpus vocabulary.


The Rocchio method has frequently been used to enhance retrieval performance through the expansion of a query vector with vocabulary drawn from strongly matching documents that are known to be relevant or non-relevant. The challenge, of course, is that these high-ranking documents are not known until the completion of the run, and filtering decisions must be made starting with the first document. Our next configuration extends the simple approach by modifying the primary cluster for each topic Ti to support two additional term vectors, one for terms from on-topic documents (ri) and one for terms from off-topic documents (nri). As relevance judgements become available for retrieved documents, these two vectors adapt as reflected in the following similarity measure for topic Ti and document Dj:

Similarity(Ti, Dj) = α * sim(ti, dj) + β * sim(ri, dj) - σ * γ * sim(nri, dj)

where sim(x, y) is a cosine measure of similarity between vectors x and y, with α = 1.0, β = 0.5, γ = 0.5, and σ = 0 if |ri| = 0 and 1 otherwise. Including σ allows us to avoid suppressing potentially relevant documents through the third term in the case where we have found off-topic documents but no on-topic documents. Note that we leave the original topic vocabulary intact. As in the previous case, this configuration of TRECcer does not use the secondary clustering mechanism.

Figure 4 shows the best 4 runs* with the primary threshold ranging between 0.05 and 0.5 in 0.05 increments. Note that there is a distinct learning phase in the early snapshots as the Rocchio-based system adapts to the judgements of similar but off-topic documents. These learning phases can appear at any point in a run, triggered by a sequence of highly relevant stories whose vocabulary is sufficiently distinct from that already learned to cause the documents in the sequence to exceed the primary threshold. Learning phases can be beneficial or detrimental to performance.
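The adapted similarity above can be sketched as follows; `cosine` operates on sparse term-weight dictionaries, and the σ gate suppresses the negative term until at least one on-topic document has been judged (an illustration with our own names, not the authors’ code):

```python
import math

def cosine(u, v):
    """Cosine similarity over sparse term-weight vectors; 0 for empty vectors."""
    num = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return num / (nu * nv) if nu and nv else 0.0

def rocchio_similarity(t_vec, r_vec, nr_vec, d_vec,
                       alpha=1.0, beta=0.5, gamma=0.5):
    """alpha*sim(t,d) + beta*sim(r,d) - sigma*gamma*sim(nr,d),
    with sigma = 0 while the relevant-document vector is still empty."""
    sigma = 0.0 if not r_vec else 1.0
    return (alpha * cosine(t_vec, d_vec)
            + beta * cosine(r_vec, d_vec)
            - sigma * gamma * cosine(nr_vec, d_vec))
```

With an empty relevant vector the off-topic term contributes nothing, so early off-topic judgements cannot suppress documents before any on-topic evidence exists.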
Note that the falling score in snapshots 1-6 is quickly offset by a positive learning phase, followed by a negative and then a substantial positive phase. Using snapshots as a performance presentation technique has proven to be very useful in understanding system performance (or lack thereof). System performance is typically reported

*

We will use the same scale on the Y axis for all configurations to aid in reader comparison. This sometimes implies, as here for the 0.05 threshold run, that an ‘n-best’ run does not perform well enough to plot on the graph.
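As a concrete illustration, the adapted measure can be sketched as follows. This is a minimal sketch assuming sparse term vectors stored as Python dicts; the function names (`cosine`, `rocchio_similarity`) are ours, not drawn from the TRECcer implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term vectors (dicts term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rocchio_similarity(t, r, nr, d, alpha=1.0, beta=0.5, gamma=0.5):
    """Similarity(Ti, Dj) = alpha*sim(ti,dj) + beta*sim(ri,dj) - sigma*gamma*sim(nri,dj).

    sigma stays 0 while no on-topic documents have been seen (r is empty),
    so negative evidence alone cannot suppress a candidate document.
    """
    sigma = 0.0 if not r else 1.0
    return (alpha * cosine(t, d)
            + beta * cosine(r, d)
            - sigma * gamma * cosine(nr, d))
```

With r empty, the third term vanishes regardless of how much off-topic vocabulary has accumulated in nr, which is exactly the guard σ provides.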


[Figure 4. Rocchio Adaptation with Simple Thresholding. Mean T9U score (Y axis, -4 to 8) vs. snapshot (X axis, 0 to 50) for runs rocchio_simple p=0.40, p=0.50, p=0.45 and p=0.05.]

through the score achieved by a system at some point in time, usually, and arbitrarily, when the available data are exhausted. This point can have little correlation to the density and distribution of relevant and non-relevant documents for a given topic, and is problematic when there are broad variations in the smoothness of the performance graph. Consider Figure 4, where choosing snapshot 15, for example, as a termination point would lead to very different conclusions regarding performance than does snapshot 48. Furthermore, while the plots in Figure 4 are somewhat regular in smoothness (although exhibiting distinguishable local fluctuations), plots can exhibit very pronounced variations in performance over relatively short snapshot intervals (as seen in Figure 6, for example). General trends can be recognized in the plots, but clearly the choice of one termination point over another could completely alter the perception of relative system (or parameter) performance. For comparisons with previously reported results on the same data, we have chosen the score reported in the final snapshot as our basis for 'final' assessment of the performance of a configuration and/or set of parameters. While this sometimes yields a best performance from a configuration that is clearly exhibiting a degradation in performance at the time (consider p = 0.40 in Figure 4), our general sense is that this approach usually selects relatively good performers.


6.2 – Two-level Clustering

The Rocchio approach to weighting aggregates all on-topic terminology into a single vector and all off-topic terminology into a single vector, losing any finer distinctions between documents. Our original interest in filtering as a task was in the benefit that might be derived from adapting the human/computer hybrid Scatter-Gather scheme (Cutting 1992, Cutting 1993) to a dynamic framework in which the number of clusters formed is arbitrary. Our two-level clustering configuration treats a topic as the seed for a primary cluster. Documents exceeding the similarity threshold for the primary cluster are then matched against the set of secondary clusters (if any) associated with it. The most similar secondary cluster (assuming the similarity crosses the secondary threshold) assimilates the new document, adapting its term vector accordingly. If the document does not join an existing secondary cluster, it creates its own. This two-level scheme allows us to set three distinct thresholds, as mentioned earlier: the primary threshold acts as a 'gatekeeper' for the topic; the secondary threshold controls the coherency of a secondary cluster; and the declaration threshold controls the decision on whether to declare a newly added document as potentially relevant. Figure 5 shows the best four runs for this approach. For brevity, we use the notation [p, s, d] to denote a combination using p as the primary threshold, s as the secondary threshold and d as the declaration threshold. Note that the runs in Figure 5 exhibit the same general learning curves as those in Figure 4 from the Rocchio experiments. Interestingly, the [0.15, 0.15, 0.15] run in Figure 5 is unique in that, despite a low threshold combination, by the end of the run it manages to perform as well as the other runs due to the nature of secondary cluster formation and declaration.
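The two-level assignment logic described above can be sketched as follows. This is an illustrative sketch, not the TRECcer code: term vectors are plain dicts, `sim` is any similarity function (cosine in our system), and the declaration policy for a brand-new secondary cluster (here, reusing the primary similarity) is only one of several possible policy choices.

```python
def assign_document(doc_vec, primary_vec, secondaries,
                    p_thresh, s_thresh, d_thresh, sim):
    """Two-level assignment sketch. Mutates `secondaries` (a list of
    secondary-cluster term vectors) and returns the declaration decision."""
    p_score = sim(primary_vec, doc_vec)
    if p_score < p_thresh:                        # primary threshold: gatekeeper
        return False
    best, best_score = None, -1.0
    for cluster in secondaries:                   # most similar secondary cluster
        score = sim(cluster, doc_vec)
        if score > best_score:
            best, best_score = cluster, score
    if best is not None and best_score >= s_thresh:
        for term, w in doc_vec.items():           # assimilate: adapt its term vector
            best[term] = best.get(term, 0.0) + w
        decl_score = best_score
    else:
        secondaries.append(dict(doc_vec))         # document seeds its own cluster
        decl_score = p_score                      # assumed policy for new clusters
    return decl_score >= d_thresh                 # declaration threshold
```

Raising `s_thresh` forces more coherent (and more numerous) secondary clusters; raising `d_thresh` makes declaration more conservative without affecting cluster formation.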
6.3 – Two-Level Clustering with Rocchio Adaptation

Combining the techniques of the previous two sections into a single hybrid architecture yields a system that can run with its primary threshold set to a level that provides potentially better recall, while using the declaration scheme on the secondary clusters to avoid over-declaring off-topic documents to the detriment of utility. At the same time, the Rocchio-based adaptation is intended to make the system more responsive to the topic. Note that adaptation occurs only at the primary cluster level, although in theory it could also be applied to the secondary clusters. Figure 6 shows the best four runs with Rocchio parameters α = 1.0, β = 0.5, γ = 0.5. Note that while the best runs for primary-


[Figure 5. Two-Level Clustering Configuration Performance. Mean T9U score vs. snapshot (0-50) for runs Two-level [0.35 0.35 0.35], [0.20 0.40 0.40], [0.15 0.15 0.15] and [0.25 0.25 0.25].]

only Rocchio were in the range 0.40 ≤ p ≤ 0.50 (with the best at p = 0.40) and the best runs for pure two-level clustering were in the range 0.15 ≤ p ≤ 0.35 (with the best at p = 0.20), the best runs for the hybrid scheme are in the range 0.10 ≤ p ≤ 0.15, with the best at p = 0.15. It is interesting to note, however, that the final utility score for the hybrid scheme falls between the scores for the single-technique systems.

6.4 – Two-Level Clustering with Differential Adaptation

The performance of the Rocchio-based hybrid scheme in the previous section exhibits a distinct flattening in the second half of each of the runs compared to the gains achieved in the latter portion of the first half. Given the density of on- and off-topic documents across snapshots (as shown in Figure 3), we would expect to see some tail-off in gain towards the latter portion of the runs, but not the flattening that we see in Figure 6. To establish whether we were experiencing a dampening of the positive contribution of the declared on-topic documents by a corresponding negative contribution from declared off-topic (or unjudged!) documents, we generated a variation of the hybrid scheme that includes in the positive and negative term vectors for each topic's primary cluster only those terms that appear exclusively in


[Figure 6. Two-Level Clustering with Rocchio Adaptation. Mean T9U score vs. snapshot (0-50) for runs Two-Level Rocchio [0.15 0.40 0.40], [0.10 0.40 0.40], [0.15 0.35 0.35] and [0.10 0.35 0.35].]

declared on- and off-topic documents respectively; we refer to these as differential vectors. Figure 7 shows the best four runs for the differential configuration. These runs exhibit the same general shape as those in Figure 6, but the best performing parameter set does not reach the level of performance of the best Rocchio hybrid run. On the other hand, the four differential runs correspond more strongly to one another, indicating that the differential configuration is somewhat less sensitive to flawed tuning by a user.
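The construction of the differential vectors can be sketched as follows, under the assumption that documents are sparse term-weight dicts; `differential_vectors` is a hypothetical helper name, not the name used in our system.

```python
def differential_vectors(on_topic_docs, off_topic_docs):
    """Build differential Rocchio vectors: keep only terms that appear
    exclusively on one side of the relevance judgements.
    Returns (r, nr) as dicts of summed term weights."""
    on_terms = set().union(*(d.keys() for d in on_topic_docs)) if on_topic_docs else set()
    off_terms = set().union(*(d.keys() for d in off_topic_docs)) if off_topic_docs else set()
    only_on, only_off = on_terms - off_terms, off_terms - on_terms
    r, nr = {}, {}
    for doc in on_topic_docs:
        for term, w in doc.items():
            if term in only_on:                   # term never seen off-topic
                r[term] = r.get(term, 0.0) + w
    for doc in off_topic_docs:
        for term, w in doc.items():
            if term in only_off:                  # term never seen on-topic
                nr[term] = nr.get(term, 0.0) + w
    return r, nr
```

Terms shared by on- and off-topic documents drop out of both vectors, which is what prevents the positive and negative contributions from dampening one another.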

7 – Parts-of-Speech and Named Entities

Linguistic ambiguity frequently, but erroneously, increases similarity through incorrect matching of different senses of a word (e.g., office as a room and office as a political position). We have experimented previously with part-of-speech tagging and named entity extraction as part of lexical analysis, where the diversity of the context made real comparison difficult (Eichmann et al. 1998, Eichmann and Srinivasan 1999a, Eichmann et al. 1999b, Eichmann and Srinivasan 2000). This section presents a number of configurations using enriched vocabularies, with the intent of laying a foundation for future experimentation.


[Figure 7. Two-level Clustering with Differential Adaptation. Mean T9U score vs. snapshot (0-50) for runs Two-Level Differential [0.15 0.45 0.45], [0.15 0.40 0.40], [0.15 0.50 0.50] and [0.15 0.30 0.30].]

7.1 – Two-Level Clustering with Rocchio Adaptation and Noun Phrases

Our first lexically enriched configuration involves processing both documents and topics with a part-of-speech tagger derived from Brill's system (Brill). The tagger analyses each sentence, annotating the words with their respective parts of speech (noun, adjective, etc.). We then build term vectors comprised of the noun phrases of length greater than or equal to two words. These are maintained separately from the original stemmed, untagged term vectors. Our noun phrases include not only contiguous sequences of unstemmed nouns, but also certain 'glue' vocabulary such as 'of', 'of the' and 'and', to support recognition of phrases such as 'Secretary of State', 'Ministry of the Interior' and 'American Telephone and Telegraph' as complete noun phrases. Table 1 shows a sample of the noun phrases generated by this process. Note that allowing noun phrases that contain trivial words such as 'of' occasionally generates phrases of dubious value, such as 'plans of the space program' for topic 11. One of the challenges for general application of linguistic techniques is avoiding too many of these constructs while maintaining domain independence. The 'curb production' phrase for topic 22 indicates a failure of the tagger to recognize 'curb' as a verb rather than a noun.
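The glue-word noun phrase assembly can be sketched as follows. This is a simplification of the Brill-derived tagger pipeline described above: it assumes input as (token, Penn-style POS tag) pairs and treats any of 'of', 'the' and 'and' as glue, which slightly over-generates relative to the exact patterns we use.

```python
GLUE = {"of", "the", "and"}  # connectors allowed inside a phrase

def noun_phrases(tagged):
    """Extract multi-word noun phrases from (token, POS-tag) pairs.
    A phrase is a maximal run of nouns (tags starting with 'NN'),
    optionally joined by glue words, and must start and end with a noun."""
    phrases, current = [], []
    for tok, tag in tagged + [(".", ".")]:        # sentinel flushes the last run
        if tag.startswith("NN"):
            current.append(tok)
        elif tok.lower() in GLUE and current:
            current.append(tok)                   # tentatively keep the glue word
        else:
            while current and current[-1].lower() in GLUE:
                current.pop()                     # phrase must end with a noun
            if len(current) >= 2:                 # keep only multi-word phrases
                phrases.append(" ".join(current))
            current = []
    return phrases
```

The trailing-glue trimming is what keeps a fragment like 'production of' from surviving on its own, while 'Ministry of the Interior' comes through intact.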


Table 1: Noun Phrases Detected in Sample Topics

  Topic 7 (U. S. Budget Deficit): auction quota, budget deficit, budget shortfall, defense budget, excess of expenditures, exchange rate, government subsidies, spending cuts, tax increase, tax reform

  Topic 11 (Space Program): Space Plane, Space Shuttle, Space Station Freedom, plans of the space program, space program, space project, space station, vehicle launch

  Topic 22 (Counternarcotics): curb production, curb production of drugs, drug lords, drug money, drug trafficking, entry of illegal drugs, production of illegal drugs

We then extend the definition of similarity to

Similarity(Ti, Dj) = α * sim(ti, dj) + β * sim(rpi, dpj) + β * sim(ri, dj) - 0.5 * σγ * sim(nri, dj) - 0.5 * σγ * sim(nrpi, dpj)

where rpi is the vector of noun phrases extracted from the topic and on-topic documents, nrpi is the vector of noun phrases from off-topic documents, and dpj is the vector of document noun phrases. This is the Rocchio measure from Section 6.1 with an additional term for positive (and topic) noun phrases (rpi) and an additional term for negative noun phrases (nrpi) that splits its contribution with the negative terms.

Figure 8 shows the results for a set of runs within this configuration. Table 2 provides distributional data extracted from snapshot 16 comparing the two-level Rocchio adaptation configuration with and without noun phrases. Given the distinctly poor performance using [0.15, 0.40, 0.40], with large numbers of unjudged documents relative to declared positive and negative documents exceeding the primary threshold as shown in the table, we also tried [0.25, 0.40, 0.40] and two higher-threshold runs of [0.25, 0.75, 0.75] and [0.35, 0.75, 0.75]. Note the relatively high number of declared unjudged documents for run [0.15, 0.50, 0.50], even though that run had the second highest utility of the runs attempted.

[Figure 8. Two-Level Clustering with Rocchio Adaptation and Noun Phrases. Mean T9U score vs. snapshot (0-50) for runs Two-Level Rocchio+NP [0.25 0.50 0.50], [0.15 0.50 0.50], [0.25 0.75 0.75] and [0.35 0.75 0.75].]

Table 2: Declaration Counts at Snapshot 16, Two-Level Clustering with Rocchio Adaptation

  Configuration                    Declared Positives   Declared Negatives   Declared Unjudged
  Without NP [0.10, 0.40, 0.40]            37                   44                    6
  Without NP [0.15, 0.40, 0.40]            36                   35                    7
  With NP [0.15, 0.40, 0.40]              218                  211                  433
  With NP [0.25, 0.40, 0.40]              254                  258                  283
  With NP [0.15, 0.50, 0.50]              115                  128                   81
  With NP [0.25, 0.50, 0.50]              126                  153                  159
  With NP [0.25, 0.75, 0.75]               37                   36                    8
  With NP [0.35, 0.75, 0.75]               19                   31                    6

7.2 – Two-Level Clustering with Rocchio Adaptation and Named Entities

As part of our work on a question answering system for TREC-8, we developed a named entity extractor that couples into the part-of-speech tagger, using a set of patterns and heuristics to categorize noun phrases into:

• Persons – names and titles drawing from a number of cultures (currently Anglo, Chinese, Arab, Hebrew, Indian, Japanese, Latino and Russian)*;
• Organizations – various entities drawn from sources such as the CIA Fact Book, Fortune 500 companies, etc.;
• Locations – countries, cities, lakes, etc. drawn from the CIA Fact Book and other sources;
• Events – months, days of the week, holidays, etc.; and
• Medical terminology – MeSH subject headings and related terms (this category was disabled for the experiments described here).

* These were generated by running searches for baby names on the Web. While clearly not exhaustive, they proved, with only a handful of exceptions, to cover the person names appearing in our tagger development data, which comprised a mix of Associated Press, Wall Street Journal and Financial Times newswire stories.

Each entity category generates a separate representation vector for a document. Each topic maintains an on-topic vector and an off-topic vector for each category, with the on-topic vectors initialized with the entities recognized from the topic itself. The Rocchio-based similarity function is then

Similarity(Ti, Dj) = α * sim(ti, dj) + Σk [ βk * sim(rpik, dpjk) - γk * sim(nrpik, dpjk) ]

where the sum ranges over the entity categories k, rpik and nrpik are the on- and off-topic entity vectors for category k, and dpjk is the document's vector for that category, with α = 1.0; β = 0.3, γ = 0.15 for persons, organizations and noun phrases; and β = 0.1, γ = 0.05 for locations and events. Figure 9 shows the results of the best four runs using the two-level clustering configuration with Rocchio-based adaptation and named entities. Note that the parameters and final utility scores are roughly comparable to those achieved by the two-level Rocchio system without named entities.

[Figure 9. Two-Level Clustering with Rocchio Adaptation and Named Entities. Mean T9U score vs. snapshot (0-50) for runs Two-Level Rocchio+NE [0.15 0.40 0.40], [0.15 0.45 0.45], [0.15 0.50 0.50] and [0.15 0.35 0.35].]

7.3 – Two-Level Clustering with Rocchio Adaptation and only On-Topic Named Entities

Reflecting on the specificity of the recognized entities for the topics led us to attempt an asymmetric variation in scoring:

Similarity(Ti, Dj) = α * sim(ti, dj) + Σk βk * sim(rpik, dpjk)

with the weights the same as in the symmetric entity function, but with only the on-topic entity vectors used. Figure 10 shows the best four runs using this variation. Notice that while at first glance there seems to be only a minor improvement in scores, the highest scoring run is now [0.15, 0.45, 0.45] rather than [0.15, 0.40, 0.40] as for the symmetric measure. Both [0.15, 0.45, 0.45] and [0.15, 0.50, 0.50] exhibit noticeable gains in utility in the asymmetric configuration compared to the symmetric configuration.

[Figure 10. Two-Level Clustering with Rocchio Adaptation and On-Topic Named Entities. Mean T9U score vs. snapshot (0-50) for runs Two-level Rocchio+NE_O [0.15 0.45 0.45], [0.15 0.50 0.50], [0.15 0.40 0.40] and [0.15 0.15 0.15].]
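The per-category entity similarity, in its asymmetric (on-topic only) form, can be sketched as below. The category names and the `entity_similarity` helper are illustrative, with the per-category β weights taken from the text.

```python
def entity_similarity(topic_vec, doc_vec, on_topic_entities, doc_entities, sim,
                      alpha=1.0, beta=None):
    """Asymmetric entity similarity sketch: the base term-vector similarity
    plus one positively weighted term per entity category, using only the
    on-topic entity vectors. `sim` is any vector similarity function."""
    if beta is None:  # per-category weights from the text (noun phrases included)
        beta = {"person": 0.3, "organization": 0.3, "noun_phrase": 0.3,
                "location": 0.1, "event": 0.1}
    score = alpha * sim(topic_vec, doc_vec)
    for category, r_vec in on_topic_entities.items():
        score += beta.get(category, 0.0) * sim(r_vec, doc_entities.get(category, {}))
    return score
```

Dropping the negative entity terms means a topic can never be penalized for off-topic entities, which is the asymmetry the section describes.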

8 – Threshold Adaptation

Our two-tiered approach to filtering provides a number of adaptive tuning dimensions involving modification of:

• the primary membership threshold;
• the secondary membership threshold;
• the declaration threshold; and
• the values of α, β, and γ when doing Rocchio-based similarity

as well as a number of policy choices (e.g., whether to treat unjudged documents as off-topic, or whether to use the current document's or the secondary cluster's term vector(s) in declaration decisions). The Rocchio parameters have been studied extensively elsewhere, and we saw little advantage to be gained in modifying them on the fly: this technique works by altering the emphasis of particular (on- and off-topic) vocabulary, not by the weights associated with that vocabulary. The thresholds seemed to offer more potential. However, adapting the primary threshold did little but limit the number of documents processed in subsequent system logic. The contents of Figure 2 imply that while loss via false positives can be contained with a high primary threshold, little gain in true positives is to be had. Given the performance patterns exhibited in the previous sections, we then considered adaptation of the secondary membership threshold. However, this threshold's primary function is the control of secondary cluster coherency, that is, just how jointly similar the members of a secondary cluster are. Raising the threshold for secondary cluster membership beyond that of the configurations in the previous sections does not generate a gain in score, but actually lowers the score through the formation of multiple clusters from documents that would comprise a coherent, and positive, cluster at the lower threshold. We therefore chose as our approach the incremental adaptation of the secondary cluster declaration threshold. To minimize the complexity of the scheme, we constrained the adaptation to increases in the threshold only. The system compares the performance score for a given topic during a snapshot to the score for that topic in the previous snapshot.
If the difference between the new and the old score is greater than or equal to zero, the declaration threshold for that topic remains at its current value. Otherwise the threshold d is raised using the following rules at snapshot i:*

• if Ui < -9 and boost < 3, set d = 1.3 * d and set boost to 3;
• else if Ui < -7 and boost < 2, set d = 1.2 * d and set boost to 2;
• else if Ui < -5 and boost < 1, set d = 1.1 * d and set boost to 1;
• else set d = d + 0.01.

The boost value, initially set to 0, ensures that each of the first three rules is triggered at most once. So if the utility decreases across snapshots and the current utility is, for example, -30, the first rule triggers only if it has not been activated before; otherwise the last rule, which raises d gradually, is activated. Without boost controlling the larger increments, d can quickly reach a value too high for further retrieval. We set a ceiling of 0.45 for the declaration threshold unless the number of non-relevant (off-topic and unjudged) documents exceeds 100, in which case we set the declaration threshold to 0.55 in an attempt to minimize the damage from a poorly performing topic.

* These rules were developed via initial experiments using a subset of the Wall Street Journal dataset.

8.1 – Two-Level Clustering with Rocchio Adaptation and Declaration Adaptation

For our first experiments with declaration threshold adaptation, we began by choosing the best performing run in the two-level clustering configuration with Rocchio adaptation, [0.15, 0.4, 0.4], and added declaration threshold adaptation to it. Keeping the primary threshold fixed, we varied the secondary threshold over 0.4, 0.45 and 0.5. Since we are adapting the declaration threshold, we initialize its value to 0.25 in all three runs; we represent such runs by the scheme [p, s, (d)]. We also ran a fourth configuration, [0.1, 0.5, (0.25)], to examine the effect of lowering the primary threshold within this architecture.

[Figure 11. Two-Level Rocchio with Declaration Adaptation. Mean T9U score vs. snapshot (0-50) for runs Two-Level Rocchio [0.10 0.50 (0.25)], [0.15 0.50 (0.25)], [0.15 0.45 (0.25)] and [0.15 0.40 (0.25)].]

As shown in Figure 11, there is a noticeable gain in performance that is remarkably consistent across the top four configurations, [0.10, 0.50, (0.25)], [0.15, 0.50, (0.25)], [0.15, 0.45, (0.25)] and [0.15, 0.40, (0.25)]. All of these runs perform better than any of


the runs presented thus far. Note that we now extend our parameterization notation to indicate via parentheses when a parameter is set to the specified initial value and then potentially adapted by the system during the run. Table 3 contains a breakdown, at regular intervals (roughly one year of data), of the adaptation of thresholds. By the end of the run, 25 of the 50 topics have increased their declaration thresholds above 0.30 and 12 have reached the ceiling of 0.45. Of the 15 topics with non-negative utility, only 7 are still running with 0.25 ≤ d < 0.30, indicating that there are few 'easy' topics in the corpus. Compare this with Table 4, for [0.10, 0.40, 0.40], the best performing two-level Rocchio configuration without adaptation of the declaration threshold, where at the end of the run 20 of 50 topics had generated no declared documents. The declaration adaptation configuration avoids the disabling of topics through high secondary and declaration thresholds, which is how the hybrid configurations achieve their utility. Instead, by keeping the secondary threshold high enough to maintain secondary cluster cohesion and disabling only those topics which prove 'hard', we can achieve a substantial gain in utility.

Table 3: Distribution of Topics for Two-Level Rocchio with Adaptation of Declaration Threshold [0.10, 0.50, (0.25)]

                      Snapshot 16            Snapshot 32            Snapshot 49
  Decl. Threshold     Total  T9U>0  T9U=0    Total  T9U>0  T9U=0    Total  T9U>0  T9U=0
  0.25 ≤ d < 0.30      38      9      6       30      8      3       25      4      3
  0.30 ≤ d < 0.35       4      2      0        6      5      0        6      4      0
  0.35 ≤ d < 0.40       4      1      0        5      1      0        6      4      0
  0.40 ≤ d < 0.45       2      0      0        3      1      0        1      0      0
  d = 0.45              2      0      0        6      0      0       12      0      0
  Total                50     12      6       50     15      3       50     12      3

Table 4: Distribution of Topics for Two-Level Rocchio [0.10, 0.40, 0.40]

                      Snapshot 16            Snapshot 32            Snapshot 49
  Decl. Threshold     Total  T9U>0  T9U=0    Total  T9U>0  T9U=0    Total  T9U>0  T9U=0
  d = 0.40             50      7     32       50     12     26       50     15     20
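The declaration-threshold adaptation rules described at the start of this section can be sketched as follows. The function name is illustrative, and treating the 0.55 case as a higher ceiling (rather than a direct assignment) is our reading of the text.

```python
def adapt_declaration_threshold(d, boost, utility_now, utility_prev,
                                non_relevant_count=0):
    """Raise a topic's declaration threshold d when its utility falls across
    snapshots, limiting each large multiplicative boost to a single firing.
    Returns the new (d, boost) pair."""
    if utility_now - utility_prev >= 0:
        return d, boost                           # score held steady or improved
    if utility_now < -9 and boost < 3:
        d, boost = 1.3 * d, 3
    elif utility_now < -7 and boost < 2:
        d, boost = 1.2 * d, 2
    elif utility_now < -5 and boost < 1:
        d, boost = 1.1 * d, 1
    else:
        d += 0.01                                 # gradual fallback increment
    ceiling = 0.55 if non_relevant_count > 100 else 0.45
    return min(d, ceiling), boost
```

Because `boost` only ever increases, a topic that has already taken its largest jump falls through to the 0.01 increments, which is what keeps d from racing past the point of further retrieval.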


8.2 – Two-Level Rocchio Configuration with Declaration Adaptation and Noun Phrases

Given the improvement exhibited by the Rocchio-based configurations using declaration threshold adaptation, we ran the noun phrase configurations from Section 7.1 with threshold adaptation active. For this experiment we raised the ceiling on the declaration threshold from 0.45 to 0.75. As shown in Figure 12, there is a marked improvement in performance for the high initial threshold configurations, but a decline in performance in the remaining configurations*. Table 5 shows that the counts of declared documents for the lower threshold configurations are still very high, with roughly comparable numbers in each category. The higher threshold configurations, however, retrieved only roughly half as many unjudged as on-topic documents. Comparing these configurations to those in Section 7.1 using Table 2, we see that the higher threshold adaptive configurations retrieved from two to four times as many documents in each category, but with the same proportions across categories. Examining Table 6, we see that half of the topics have zero utility for [0.35, 0.75, (0.55)]. The aggregate utility for the configuration is concentrated in the seven topics with positive utility. This is in distinct contrast to the Rocchio [0.10, 0.50, (0.25)] configuration, where there were twelve topics with positive utility and only three with zero utility.

Table 5: Declaration Counts at Snapshot 16, Two-Level Rocchio with Declaration Adaptation and Noun Phrases

  Configuration            Declared Positives   Declared Negatives   Declared Unjudged
  NP [0.15, 0.50, (0.35)]         314                  257                  268
  NP [0.25, 0.50, (0.35)]         297                  275                  320
  NP [0.25, 0.75, (0.55)]          67                  106                   32
  NP [0.35, 0.75, (0.55)]          81                  114                   32

8.3 – Two-Level Rocchio with Declaration Adaptation and Named Entities

We then turned to adapting the declaration thresholds for the named entity configuration. The only difference between this configuration and that discussed in Section 7.2 is the threshold adaptation logic discussed at the beginning of Section 8. As seen in Figure 13, only two runs exceeded our utility plotting cut-off of -5. The [0.15, 0.45, (0.25)] run, on the other hand, had the highest end point utility of the entire sequence of experiments. Examining Table 7, we see that by the end of the run

* Note that the run for [0.15, 0.50, (0.35)] fails to complete in our current environment due to exhausted memory resources.


[Figure 12. Two-Level Rocchio with Declaration Adaptation and Noun Phrases. Mean T9U score vs. snapshot (0-50) for runs Two-Level Rocchio+NP [0.35 0.75 (0.55)], [0.25 0.75 (0.55)], [0.15 0.50 (0.35)] and [0.25 0.50 (0.35)].]

Table 6: Distribution of Topics for Two-Level Rocchio with Declaration Adaptation and Noun Phrases [0.35, 0.75, (0.55)]

                      Snapshot 16            Snapshot 32            Snapshot 49
  Decl. Threshold     Total  T9U>0  T9U=0    Total  T9U>0  T9U=0    Total  T9U>0  T9U=0
  0.55 ≤ d < 0.60      41      5     31       36      5     25       33      2     24
  0.60 ≤ d < 0.65       2      1      0        3      3      0        4      3      0
  0.65 ≤ d < 0.70       0      0      0        1      1      0        2      1      0
  0.70 ≤ d < 0.75       0      0      0        0      0      0        1      1      0
  d = 0.75              7      1      0       10      1      0       10      0      1
  Total                50      7     31       50     10     25       50      7     25

only 16 of the topics have not increased their declaration threshold above 0.30, and 14 have reached 0.45 or above. Furthermore, unlike the adaptive Rocchio run, the topics with positive utility span the entire range of declaration threshold values, indicating that the entity configuration maintains the ability to retrieve documents over time, even when faced with challenging topics.


[Figure 13. Two-Level Rocchio with Declaration Adaptation and Named Entities. Mean T9U score vs. snapshot (0-50) for runs Two-Level Rocchio+NE [0.15 0.45 (0.25)], [0.15 0.50 0.25], [0.15 0.30 0.15] and [0.15 0.35 0.15].]

Table 7: Distribution of Topics for Entity [0.15, 0.45, (0.25)]

                      Snapshot 16            Snapshot 32            Snapshot 49
  Decl. Threshold     Total  T9U>0  T9U=0    Total  T9U>0  T9U=0    Total  T9U≥0  T9U=0
  0.25 ≤ d < 0.30      28      7      8       20      5      4       16      4      2
  0.30 ≤ d < 0.35       7      1      0        7      4      0        6      3      0
  0.35 ≤ d < 0.40       7      1      0        9      2      0        7      3      0
  0.40 ≤ d < 0.45       5      0      0        5      1      0        7      1      0
  0.45 ≤ d              3      0      0        9      1      0       14      2      0
  Total                50      9      8       50     14      4       50     14      2

9 – Analysis of Topic Specific Recall and Precision

The constraints imposed in filtering present challenges for comparison against the more traditional measures of precision and recall. This section ties our system to this perhaps more familiar framework by examining system performance in a number of configurations. Figure 14a shows per-topic recall and precision for three runs: two two-level Rocchio runs, [0.10, 0.25, 0.25] and [0.10, 0.40, 0.40], and one two-level Rocchio run with declaration adaptation, [0.10, 0.40, (0.25)]. Note


[Figure 14a. Per Topic Recall / Precision Performance: rocchio [0.10 0.40 0.40], rocchio [0.10 0.25 0.25], rocchio_adapt [0.10 0.40 (0.25)]. Figure 14b. Per Topic Gain / Loss Comparison: recall gain/loss vs. precision gain/loss of the adaptive run relative to each fixed-threshold run.]

that the adaptive declaration run shares its secondary threshold with the higher-precision run and its declaration threshold with the lower-precision run of the first group. Note that there is a wide range in precision, with averages of 0.19, 0.15 and 0.23, respectively. Recall performance is conservative, with averages of 0.07, 0.06 and 0.11, respectively. Figure 14b shows the per-topic gain or loss for the adaptive run relative to the fixed threshold runs. Topics in the upper right quadrant demonstrate improved performance in both recall and precision for the adaptive run; topics in the lower left quadrant demonstrate degraded performance in both. The adaptive run shows very mixed performance relative to the lower-precision run, with a number of topics in each of these two quadrants. The adaptive run shows fairly uniform improvement in recall relative to the higher-precision run, but a large range of gain and loss in precision. Turning to entity performance, Figure 15a shows the precision / recall performance on a per-topic basis using entity similarity for runs of [0.15, 0.45, 0.45], [0.15, 0.25, 0.25] and [0.15, 0.45, (0.25)]. Comparing the precision / recall graphs for the Rocchio and entity approaches (Figures 14a and 15a), the relative performance of the adaptive run against the higher- and lower-precision runs is fairly similar in the two approaches. Contrasting the gain/loss performance in Figures 14b and 15b bears this observation out, the only obvious distinction being the loss in recall of the adaptive system against the higher-precision fixed threshold run for entity similarity.
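For reference, the per-topic quantities plotted in these comparisons can be computed from a topic's declaration counts as sketched below; the helper names are ours, not from our evaluation scripts.

```python
def precision_recall(declared_relevant, declared_total, relevant_total):
    """Per-topic precision and recall from a topic's declaration counts."""
    precision = declared_relevant / declared_total if declared_total else 0.0
    recall = declared_relevant / relevant_total if relevant_total else 0.0
    return precision, recall

def gain_loss(adaptive_pr, fixed_pr):
    """(recall gain, precision gain) of an adaptive run over a fixed-threshold
    run, as plotted in the gain/loss panels; positive values favour adaptation."""
    (p_a, r_a), (p_f, r_f) = adaptive_pr, fixed_pr
    return r_a - r_f, p_a - p_f
```

A topic falling in the upper right quadrant of a gain/loss plot simply has both tuple components positive.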


[Figure 15a. Per Topic Recall / Precision Performance Using Entities: entity_pos [0.15 0.45 0.45], entity_pos [0.15 0.25 0.25], entity_adapt [0.15 0.45 (0.25)]. Figure 15b. Per Topic Gain / Loss Comparison, Entity Similarity.]

The noun phrase runs clarify some of the performance questions. The [0.25, 0.50, 0.50] runs in Figure 16a appear fairly similar to those in the other recall / precision plots. The [0.35, 0.75, 0.75] run, however, plots only three topics off the origin. Contrasting the gain / loss in Figure 16b, we see that only a few topics for the [0.35, 0.75, (0.55)] adaptive run improve over the [0.25, 0.50, 0.50] non-adaptive run, but there are numerous topics with substantial gain over the [0.35, 0.75, 0.75] non-adaptive run. This is clearly a configuration where adaptation has a major impact.

[Figure 16a. Per Topic Recall / Precision Performance Using Noun Phrases. Figure 16b. Per Topic Gain / Loss Comparison, Noun Phrase Similarity. Runs shown: nounphrase_rocchio [0.35, 0.75, 0.75], nounphrase_rocchio [0.25, 0.50, 0.50] and nounphrase_adapt [0.35, 0.75, 0.55]. Plots omitted.]

10 – Conclusions

The experimental results presented here bear out our experience with filtering and tracking in TREC and TDT. The experience of participants in the TREC evaluations involving this corpus and set of topics was distinctly one of seeking to achieve performance better than a decision procedure that simply returned not-relevant for all documents (i.e., ‘it is better to do nothing’). Relevant document densities were such that learning a topic sometimes entailed massive penalties – greater, in some cases, than could be overcome by perfect performance beyond that point. A two-level approach, when combined with some form of vocabulary adaptation, can break through the ‘it is better to do nothing’ barrier and achieve overall positive utility, even in the face of challenging topics.

More importantly, as shown in section 8.1, these techniques are reasonably stable across a range of parameters, allowing a system processing newswire-like documents to capitalize upon the tuning results of Figure 11, for example. Performance is quite comparable across a combined range of primary and secondary thresholds – starting anywhere in this region yields relatively acceptable performance. Further experimentation involving more diverse document and topic types is a natural next step.

We feel that the noun phrase and entity configurations are still in development. The entity-based technique in particular shows serious promise for gains in utility.

Also open to further research is the question of the effect of localized irregularities in performance on the overall assessment of a system. While not typical, we have seen particular combinations of configuration and parameter settings exhibit marked ‘jitter’ in performance over rather short periods of time. A more detailed examination of judgment distributions might yield insight into this issue.

Finally, the overall configuration of our system as reported here is positioned to be extremely conservative with respect to the declaration of documents as potentially relevant. For this corpus and this set of topics, that yields behavior that avoids massive penalties, but with the limitation that adaptation is always from the perspective of monotonically increasing thresholds. This policy decision is likely to prove too conservative for corpus / topic combinations with higher densities of relevant documents. Further experimentation with bidirectional adaptation of thresholds is also warranted.
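The conservative, monotonically increasing threshold policy described above can be sketched roughly as follows. The trigger rule (precision over recently judged deliveries), the step size, and the cap are illustrative assumptions, not the paper's exact adaptation procedure.

```python
def adapt_threshold(threshold, judged, step=0.05, max_threshold=0.95):
    """Raise the dissemination threshold when recent deliveries were mostly
    non-relevant; never lower it (the monotonic policy discussed above).

    judged: list of booleans, True for a delivered document judged relevant.
    Returns the (possibly raised) threshold.
    """
    if not judged:
        return threshold  # no evidence yet, leave the threshold alone
    precision = sum(judged) / len(judged)
    if precision < 0.5:  # too many false alarms slipped through
        threshold = min(threshold + step, max_threshold)
    return threshold

# Two low-precision batches in a row each nudge the threshold upward.
t = 0.25
t = adapt_threshold(t, [False, False, True])
t = adapt_threshold(t, [False, True, False])
print(t)
```

Bidirectional adaptation, as suggested above for denser corpora, would add a symmetric rule that lowers the threshold when precision is high but too few documents are being delivered.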
