Interactive Document Retrieval Using Faceted ... - Semantic Scholar

3 downloads 182 Views 182KB Size Report
COMPAQ. Peter.Anick@digital.com ..... among terms has a long history in Information Retrieval. [12]. In the late ... Digital Equipment. Corp., 1996. [11] Beaulieu ...
Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999 Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999

Interactive Document Retrieval using Faceted Terminological Feedback Peter Anick and Suresh Tipirneni COMPAQ [email protected]

Abstract We present a new interactive paradigm for iterative information retrieval based on the automatic construction of a faceted representation for a document subset. Facets are chosen heuristically based on lexical dispersion, a measure of the number of different terms with which a word co-occurs within noun phrases appearing in the document set. The phrases in which these terms occur serve to provide a set of attributes, specializations, related concepts, etc., for the so-identified “facet” terms. The resulting representation serves the dual purpose of providing a concise, structured summary of the contents of a result set as well as presenting a set of terms for engaging in interactive query reformulation.

The bulk of the “dialog” in current search interfaces is a simple query refinement loop, in which a user enters a search expression, evaluates the results returned, and then adjusts the search expression accordingly, if necessary. However, evaluating search results can be a time and energy consuming task. Through surveying a potentially long list of titles and summaries, a user must not only evaluate the likely relevance of documents to his/her information need but also • • • •

1. Introduction Satisfying an information need through on-line full-text searching is often an iterative process in which a user successively modifies an initial query in response to the results obtained. This seemingly simple task is often complicated in practice by the user’s lack of knowledge of the subject matter that generated his/her information need, lack of knowledge of the extent of database content with respect to that need, and the difficulty of translating even a well-defined need into an effective linguistic formulation [1]. The challenge, for system designers, is to augment the range of user/system dialog while keeping the complexity of the interface to a minimum. The popular success of current “natural language” search interfaces comes in large part from the fact that users can interact relatively successfully through a simple interface without fully understanding how the system actually works. As Marchionini warns, “End users want to achieve their goals with a minimum of cognitive load and a maximum of enjoyment. …humans seek the path of least cognitive resistance and prefer recognition tasks to recall tasks.” [2]

assess the likelihood that the database will eventually be able to satisfy the information need, or part of it, assess the degree to which the current query formulation has expressed the need, learn about the information space and the vocabulary used to describe the domain within this particular database, and ultimately decide on an appropriate query reformulation strategy, if necessary.

A number of approaches have been proposed for helping a user more effectively explore the contents of a database or result list, including document clustering, visualization techniques, tilebars, and the like [3]. A separate set of techniques have been developed for assisting in query reformulation, including on-line thesauri, key term extraction, and relevance feedback [4]. There is some evidence from recent studies comparing “opaque” and “penetrable” relevance feedback suggesting that allowing end-users to interact directly with terminological feedback improves not only actual (measured) retrieval performance but also perceived performance, trust in the system, and subjective usability [5]. In this paper, we will present an iterative model of information retrieval in which the user is encouraged to explore the information space through the presentation of faceted terminological feedback. Unlike a simple list of terms, a faceted classification structures the domain of discourse around salient dimensions [6]. While a true faceted classification of a subject field would involve considerable intellectual effort to codify the domain objects, their properties, behavior, interactions, and

0-7695-0001-3/99 $10.00 (c) 1999 IEEE

1

Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999 Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999

operations on them, we have devised a heuristic approach that uses statistical and natural language processing techniques to automatically identify a set of lexical dimensions that roughly characterize the contents of a result set. Our algorithm is based on the observation that lexical items signifying key concepts within a domain often have a greater tendency to co-occur with a multiplicity of other concepts within certain syntactic contexts such as noun phrases. We exploit this notion of “lexical dispersion” to generate a terminological summary of a result list at query-time. This faceted representation serves simultaneously as a brief conceptual synopsis of the result list and a source of terminology for iterative, recognition-driven query reformulation. This paper is organized as follows. In section 2, we motivate the “lexical dispersion hypothesis” as a basis for the identification of key terminology and describe how it can be employed in on-line information retrieval. In section 3, we describe a web-based interface built for the Paraphrase retrieval system, illustrating the use of automatically generated faceted terminological feedback for a subset of the TREC-6 Financial Times database. Section 4 delves into implementation issues and presents a number of parameters that can affect the results. In Section 5, we discuss this work in relation to other approaches, such as thesauri and relevance feedback. In Section 6 we present our conclusions and directions for future research.

2. The Lexical Dispersion Hypothesis In English, as in other languages, new concepts are often expressed not as new single words, but as concatenations of existing nouns and adjectives. This is especially noticeable in technical language, where long chains of nouns are not uncommon (“central processing unit”, “byte code interpreter”) and appears to be one of the primary characteristics of sublanguages such as medical briefs and naval messages, which have domain specific grammars for combining terms into terse expressions of complex concepts (e.g., [7], [8]). Compound terms also permeate everyday language. Noun compounds are regularly used to encode ontological relationships -- an “oak tree”, for example, isa tree. But noun compounds may encode many other kinds of relationships as well [9]. In “tree rings”, rings are a property of the tree. In “tree roots”, the roots are a part of the tree. One would therefore expect that documents dealing with trees would contain many phrases containing the word “tree”, (e.g., tree diseases, tree bark, tree sap, tree roots, pine tree, coniferous tree) since such compounds linguistically serve to identify subordinate categories, attributes, and other relationships within the domain of trees.

Based on the above observations, we define the lexical dispersion hypothesis as the proposition that a word’s lexical dispersion – the number of different terms that a word appears with within certain syntactic constructions within a document set - can be used as a diagnostic for identifying key concepts, or “facets” of that document set. For our present purposes, we are interested not only in those terms with high dispersion but also the phrases in which they occur, as these serve to provide a set of attributes, specializations, related concepts, etc., for the so-identified “facet” terms. To test the viability of using lexical dispersion to capture facets of a document collection, we applied part-of-speech tagging and noun phrase parsing to three large databases covering different subject areas – the results of a webcrawl on jazz music (30,000 articles) , a second intranet web-crawl on PC software (8,000 articles), and a 90,000 article subset of the TREC-6 Financial Times database – to estimate the degree of lexical dispersion that actually 1 occurs in typical collections. We found that for all databases, a significant number of terms had high dispersion rates. As expected, the domain specific databases (jazz music and computing) had a greater incidence of noun phrases and a correspondingly greater number of terms with high lexical dispersion. But even the somewhat more generic Financial Times database showed a significant number of terms with high dispersion. Of the 706,228 unique terms in this database, 64 of them occurred in over 1825 different noun phrases, 8192 occurred in at least 22 different phrases, and 16,384 words occurred in at least 7 phrases. Figure 1 shows the 20 terms with the highest dispersion in the Financial Times database, along with the number of different phrases they occurred in. These terms represent some of the most salient concepts recurring throughout the Financial Times corpus, and provide a concise answer to the question, “What is this corpus about?” For all three databases, the top terms, ranked by dispersion, tended to represent key concepts within the respective databases. Our hope was that the lexical dispersion hypothesis would also generate useful facet-like terms within the more constrained context of result lists for individual queries. In a series of experiments on the three databases, we have found that this appears to be the case, although we have had to refine the algorithm in a number of respects from a simple ranking by dispersion size. These refinements will be described in Section 4. In the next section, we first illustrate how the facets are put to use in a working prototype, the Paraphrase interface. 1

We used the pattern ? * as our syntactic context for computing lexical dispersion.

0-7695-0001-3/99 $10.00 (c) 1999 IEEE

2

Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999 Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999

3. Paraphrase The goal of the Paraphrase project is to enhance the effectiveness and usability of standard search engines through a richer system/user dialog. Paraphrase is implemented on top of the AltaVista search engine [10], which supports natural language queries and the ranking of result lists. Our interface is implemented as a web application, using TCL as a CGI scripting language, making runtime calls to the AltaVista search engine. In 2 the version of Paraphrase presented in this paper , we exploit the lexical dispersion hypothesis described above to compute sets of facets and phrases for describing result sets and browsing the database.

As a second example to illustrate this point, figure 3 presents the set of facets for the query “cooking”. A user unfamiliar with the database contents might be unsure whether the topic would be found at all in the Financial Times. However, one can readily surmise, from skimming the list of facets returned, that the database does indeed deal with the topic, from a wide variety of angles.

3.2 Facets and Query Reformulation

3.1 Facets as a descriptive device Examples of the use of facets/phrases as a descriptive device are presented in figures 2-5, showing the Paraphrase interface running on a 90,000 document subset of the TREC-6 Financial Times database. The user interface consists of a single web page composed of four frames. In the upper right frame is a text box in which the user may enter natural language queries. Below that is a frame for the display of results. On the left are two frames for displaying terminology. The scrollable upper frame contains up to 10 select boxes, each labeled with a single term label (i.e., the facet name). Each select box expands into a frequency sorted menu of phrases containing the label term. Below that is another frame in which the user is invited to enter a term. The result of entering a term is a new select box which, on expansion, displays a menu of phrases containing the entered term that occur most frequently in the current result list. Figure 2 shows the screen after the user has entered the query “indonesia”. The facets presented in the upper lefthand frame include terms such as indonesian, export, and government, all of which reflect generic topics that occur in a variety of more specific contexts throughout the result list. Figure 3 lists the actual phrases associated with these facets, which the user would see upon clicking on the corresponding select box. Terms associated with the facet indonesian include products (tuna), occupations (farmer), and attributes (debt). Those associated with government include roles (critic), attributes (corruption), and specific governments (jakarta). While the facets and 2

phrasally associated concepts certainly do not constitute an exhaustive catalogue of the result list content, they are nevertheless an informative approximation, in effect a surrogate table of contents for the result list, from which the user can gain some insight into the subject matter of the result set without necessarily wading through the title list or pulling up sample articles.

A previous version [29] used phrases associated with cluster key terms as a feedback device for query refinement within a clustered database.

The most common approach to interactive query reformulation, based either on a thesaurus or related terms derived via relevance feedback is to present a user with a list of candidate terms and ask the user to choose all that are relevant (e.g., [5], [11]). These are then added en masse to the query. However, in the faceted structure constructed in Paraphrase, the phrases associated with a facet often represent alternative specializations or attributes of the facet term. As a result, we developed a different model of iterative search that is more in keeping with the interpretation of facet values as a set of alternatives for exploring the document space. Specifically, we wanted to make it easy for users to iterate through a set of alternatives, one at a time, while maintaining their original query context. To do so, we divided the search expression into two distinct conceptual components – a main topic and a focus. The user’s initial search expression is taken as the main topic, and from its result list a set of facets is computed and presented, as described above. The user can then browse the facets and choose one to add to the focus. At this point, the system constructs a new query from the concatenation of (1) the main topic terms, (2) the facet phrase (as a single contiguous term), and (3) the individual components of the facet phrase. Because of AltaVista’s ranking algorithm, which grants rarer terms higher weights, documents containing the full phrase are more likely to bubble to the top of the result list. By also including the components of the phrase in the query, we enhance recall as well; component terms need not appear adjacently to contribute to the result. When the user selects a different facet value as the focus, the previous focus value is superseded. As a result, the user can browse the database by iterating through facet

0-7695-0001-3/99 $10.00 (c) 1999 IEEE

3

Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999 Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999

phrases one at a time, reshuffling the ranked result list with each mouse click. The facet list itself is not recomputed each time the user chooses a new facet value as the focus. Thus, in the “cooking” query illustrated in figure 4, the user can select one cheese type after another from the cheese facet, doing in effect a series of small high precision searches. Alternatively, the user may simply add the facet “cheese” itself, if the sharper divisions are unnecessary to the information sought. Two other operations are available in the interface, as 3 displayed in figure 5 , which shows the interface after the user has selected a focus term. The user may append the current focus terms to the main topic. In this way, the current focus terms become part of the stable context to which new focus terms may then be applied. Or the user may choose to replace the main topic with the current focus. These two operations allow the search to evolve over time, without necessarily clearing the text box and starting over. A new facet set is computed as a result of either of these operations. The user may at any time enter a term into the bottom left frame to produce a list of phrases from the result set that contain the input term. The resulting select box may then be used just as the items in the frame above to alter the focus of the query. We expect this option to be adopted over time once the user begins to appreciate the informational value of phrases for finding related terms. For example, in the case of the “wildlife extinction” query shown in figure 5, one could enter the word “bill” to get a listing of names of legislative bills having to do with the topic.

4. Implementation Issues As described above, our aim is the automatic extraction of a small number of facets which capture key conceptual dimensions within the result set. While the lexical dispersion hypothesis provides a simple heuristic for capturing such terms, the actual implementation must take into account a number of practical factors, most of which we are continuing to explore. In this section, we will describe some of the issues around selecting facets.

4.1 Choice of phrasal patterns A large number of syntactic patterns can encode valid semantic relationships. Nominal concepts may be connected by verbs or prepositions, often expressing the 3

See the options listed below the Focus term in the upper right hand frame.

same relationships encoded in noun compounds. Even simple co-occurrence within a sentence, paragraph, or document often implies some semantic relationship, and, in fact, many automatically constructed thesauri use such co-occurrences, filtered through statistics computed over the corpus as a whole, as the basis for postulating relationships among terms ([12], [13]). The advantages of using noun compounds is that they are relatively easy to recognize – a pattern matcher scanning a tagged corpus can readily detect sequences of nouns – and tend to encode strong, long-lived semantic relationships. As a result, it is generally safe to assume that nouns appearing in a compound constitute valid and useful relationships without the need for corpus-wide statistical verification. On the other hand, not all useful semantic relationships are encoded as compounds. Therefore, depending on noun compounding alone will inevitably lead to gaps. As adjectives are commonly used to specialize concepts, we found it valuable to also include noun phrases preceded by an adjective modifier, such as “international law” or “ancient history.” This also captures compounds which are formed with the adjectival form of a morphologically related noun, as in “French hotel” or “literary criticism.” Many nouns and adjectives that frequently occur in phrases have little value for information retrieval, however. To filter out such candidate phrases, we have had to construct a large list of noiseword modifiers and head terms. These include quantitative nouns and adjectives, such as cardinal or ordinal numbers, words like “many”, “some”, “amount”, temporal nouns like “year”, and qualitative adjectives, such as “significant” and “reasonable.” We have conducted some experiments on extracting facets from prepositional and verb phrases. However, there is considerable noise in this data, making it difficult to extract facets reliably based solely on a term’s dispersion within these syntactic constructions.

4.2 Facet ranking The most straightforward method of applying the lexical dispersion hypothesis to the selection of facets is to rank all the terms appearing within a result set by the number of different noun phrases they occur in and select the top n as facets. We found two drawbacks to this direct approach. First, it is possible for one or two long articles on a specific subject to contribute a large number of distinct phrases containing a specific term, even though the term may otherwise be relatively rare in the corpus. Second, depending upon the subject matter of the database, there may be some terms that are likely to appear in many phrases within practically any result set within the database. For example, in our jazz music

0-7695-0001-3/99 $10.00 (c) 1999 IEEE

4

Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999 Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999

database of web articles, almost every query resulted in the terms “jazz” and “music” showing up in the facet list. A weighting algorithm which favors terms which appear in many articles within the result set corrects the first of these problems, but tends to aggravate the second. As a result, we devised a two pass algorithm, which first uses lexical dispersion to select a set of n (e.g. 50) candidate facets, then uses a tf.idf measure to rerank the candidates. Tf.idf is commonly used to weight terms based on a combination of their density within a document (term frequency) and scarcity in the corpus as a whole (inverse document frequency) [14]. For our purposes, the “document” is a pseudo document composed by appending the sets of unique terms contained within each document of the result list. Applying this weighting has the desired effect of promoting facets that occur more widely throughout the result set, while demoting terms that are too prevalent throughout the corpus as a whole.

4.3 Other parameters A number of other parameters can affect the outcome of the choice of facets. These are briefly described below. •







Dispersion cut off – the minimum number of different phrases a term must appear in to qualify as a facet candidate. This can be safely set as a low number (2 or 3), since facet candidates will be ranked by their dispersion. Result set cut off – the number of documents from the result list from which the facets are to be computed. This number can have a significant effect on the eventual facets output by the algorithm. Since the result list is ranked, relevance of facets to the query can be increased by limiting the extracted phrases to top ranked documents. For the examples in this paper, we use a cut-off of 100. However, this is a prime area for further research, as the optimal cut-off is likely to differ from query to query. Candidate cut off – the number of top ranked facet candidates from the first pass to be reranked by tf.idf. This parameter reflects a trade-off between two measures of what constitutes a good facet. On the one hand, a facet should have many phrases associated with it. On the other hand, it should be a term that is particularly relevant to the result set at hand, not simply a common term throughout the database. In our examples, we used a cut-off of 50. Line cut off – The number of lines from each article in the result set from which phrases are drawn. In long documents in which the topic changes, many of the phrases in the article may actually be unrelated to the query. Since AltaVista’s ranking strategy favors terms which appear near the beginning of articles, we

make the assumption that query terms are more likely to occur near the front of highly ranked articles. Thus, unrelated phrases can be heuristically eliminated by considering only the first n lines of a document.

5. Discussion Henninger and Belkin have argued that online information retrieval ought to be perceived as a problem solving process: Information retrieval systems must not only provide efficient retrieval, but must also support the user in describing a problem that s/he does not understand well. The process is not only one of providing a good query language, but also one of supporting an iterative dialog model. As users query and browse the repository, they learn more about the problem and potential solutions and therefore refine their conceptualization of the problem. The information being sought differs from that being sought at the beginning of the session. The user is engaged in a problem-solving session in which the problem to be solved, that of finding relevant information, evolves and is refined through the process of seeing the results of the intermediate queries. [15]

The approach presented in this paper supports such a view of online information retrieval. It not only provides terminological feedback to help the user learn about the information space but also encourages browsing and discovery through its recognition-driven search paradigm. The interface design has been motivated by the desire to minimize both the amount of screen real estate required and the number of new operations the user must understand to engage in a productive user/system dialog. The hierarchical presentation of terms allows a great quantity of terminological information to be placed effectively in a small space, thereby making it possible to integrate the terms frames directly onto the search/results page. The user is free to ignore the new functionality and continue to use the search engine in the standard way, or use the terms display simply as a diagnostic device for evaluating the contents of a result set. Query reformulation introduces three new operations to the user, and further testing is needed to evaluate how well users can adapt to the notion of dividing a search expression into a main topic and focus. However, there is evidence from both query logs and user interviews that users often employ such a narrowing strategy even in their interactions with standard search interfaces, first probing the database with a short, general query to test what information may be available, then augmenting the query with more specific terms based on the results from the initial “test” (e.g., [16]). Boolean search engines have often included a similar device in which a set of articles retrieved by one query could be used as a restricted context for a subsequent subquery, and some “natural

0-7695-0001-3/99 $10.00 (c) 1999 IEEE

5

Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999 Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999

language” ranking systems (e.g., Infoseek) offer a similar capability. In Paraphrase, however, “focusing” is not done with respect to a subset of articles but rather with respect to the entire database. Therefore, new articles may be introduced into the result list with the addition of focus terms. The model of retrieval supported by Paraphrase is one in which a search is conducted as a sequence of many small high-precision searches. This contrasts with the traditional notion of a search within the Information Retrieval research community, where systems are typically evaluated on measures such as recall/precision curves (e.g.,[17]). These measures have led to a model of search in which the goal is the construction of a single “mega-query” which can capture everything there is about a potentially complex information need. It remains to be seen under what circumstances one search style is superior to another. However, it is well known that users vary widely in their preferred strategies; supporting multiple strategies is therefore likely to be a good thing, provided the complexity of the interface does not become too great. Indeed, the approach presented here complements the two most well-known approaches to query reformulation – thesaurus feedback and relevance feedback. Thesauri, either manually or automatically constructed, tend to reflect global static relationships among terms, as opposed to the context-sensitive relationships computed through run-time lexical dispersion analysis. While lexical dispersion often tends to pick up morphological variants (e.g., “indonesia” and “indonesian”), it is not a device particularly suited for detecting synonyms. Interactive relevance feedback provides a focused, dynamic set of potential query expansion terms. However, these terms are typically based on the user’s selection of one or two “relevant” articles and therefore do not reflect the contents of the result set as a whole. Paraphrase’s facet analysis is most useful for probing the database contents or browsing along various conceptual dimensions. If a particularly interesting article is encountered, relevance feedback may be the more appropriate strategy for finding similar documents. We therefore see lexical dispersion not as a substitute for these other techniques but rather as yet another device in the growing constellation of tools useful for the construction of the kind of multi-strategy interactive interface envisioned by Henninger and Belkin [15].

5.1 Related Work

among terms has a long history in Information Retrieval [12]. In the late ‘80s, Church [20] and others showed how corpus linguistic techniques over very large corpora using metrics like mutual information could be exploited to help automate the work of lexicographers in identifying semantic classes. Ruge [21] and Grefenstette [22] applied corpus linguistic techniques to the automatic construction of thesauri for use in information retrieval. Strzalkowski [23] utilized the notion of dispersion in a formula to compare the specificity of semantically related terms for the automatic construction of a lexical domain map. Nakagawa [24] used the heuristic that simple Japanese nouns that are part of many compound nouns make good index words to automate the extraction of index terms for computer manuals. The REALIST hyperterm system [25] and AI-STARS user interface [26] made on-line databases of phrases available to end-users for interactive query reformulation. Cooper and Byrd‘s Lexical Navigation system [27] provided extensive interactive browsing and query formulation capabilities through a network of terminology derived from automatic corpus analysis. Bruza‘s Hyperindex Search Engine [28] allowed users to browse through a lattice of search terms composed of noun compounds of various levels of specificity. Anick and Vaithyanathan’s Paraphrase interface to a clustered document database [29] employed cluster descriptor terms as “facets” which expanded to sets of phrases containing the facet terms.

6. Conclusions We have presented a new interactive paradigm for iterative information retrieval based on the automatic construction of a faceted representation for a document subset. This representation serves the dual purpose of providing a concise, structured summary of the contents of a result set as well as presenting a set of terms for engaging in interactive query reformulation. Future research will be directed primarily in two areas. First, there is a clear need to perform user studies to assess the effectiveness of the interactive paradigm. Which features will users adopt if left to their own devices? Is some training required? Are the right operators provided? Do users understand the notion of main topic and focus or is another metaphor needed? Second, we believe there are opportunities for further tuning of the facet selection algorithm, making it more sensitive to the specificity of the query and perhaps even capitalizing on syntactic contexts other than noun phrases to ferret out key terms and relationships that do not appear in noun compounds.

The PLEXUS [18] and OAKASSIST [19] interfaces used (manually constructed) knowledge organized into facets in order to help users formulate their queries. The use of co-occurrence statistics to discover semantic relationships

0-7695-0001-3/99 $10.00 (c) 1999 IEEE

6

Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999 Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999

7. References [1] Belkin, N. J., R. N. Oddy, and H. M. Brooks, “ASK for Information Retrieval”, Journal of Documentation, 38:2, 1982. [2] Marchionini, Gary, “Interfaces for End-User Information Seeking”, JASIS, 43(2):156-163, 1992. [3] Rao, Ramana et al, “Rich Interaction in the Digital Library”, Communications of the ACM, 38:4, pp. 29-39, 1995. [4] Salton, Gerard, Automatic Text Processing – The Transformation, Analysis and Retrieval of Information by Computer, Addison-Wesley, Reading, MA, 1989. [5] Koenemann, J., and Belkin, N. J., “A Case for Interaction: a Study of Interactive Information Retrieval Behavior and Effectiveness”, in Proceedings of the Human Factors in Computing Systems Conference (CHI'96), ACM Press, New York, 1996. [6] Vickery, B. C., Classification and Indexing in Science, Butterworths, London, 1958. [7] Jacquemin, Christian and Jean Royaute, “Retrieving Terms and their Variants in a Lexicalized Unification-Based Framework”, in Proceedings of SIGIR’94, pp. 132-141, 1994. [8] Marsh, Elaine, “A Computational Analysis of Complex Noun Phrases in Navy Messages”, in Proceedings of COLING’84, pp. 505-508, 1984. [9] Finin, Timothy W., “The Interpretation of Nominal Compounds in Discourse”, Technical Report MS-CIS-1982-3, University of Pennsylvania, 1982. [10] The AltaVista Search Revolution. Digital Equipment Corp., 1996. [11] Beaulieu, Micheline, Thien Do, Alex Payne and Susan Jones, “ENQUIRE Okapi Project”, British Library Research and Innovation Report 17, Jan. 1997. [12] Spark Jones, Karen, Automatic Keyword Classification for Information Retrieval, Butterworths, London, 1971. [13] Jing, Yufeng and W. Bruce Croft, “An Association Thesaurus for Information Retrieval”, in Proceedings of RIAO’94, pp. 146-160, 1994. [14] Salton, Gerard, “Another Look at Automatic Text-Retrieval Systems”, Communications of the ACM, 29:7, pp. 648-656, 1986. [15] Henninger, Scott and Nicholas Belkin, “Interface Issues and Interaction Strategies for Information Retrieval Systems”, in Proceedings of the Human Factors in Computing Systems Conference (CHI'96), ACM Press, New York, 1996.

[16] Cool, Colleen et al, “Information Seeking Behavior in New Searching Environments”, TIPSTER Phase III Research Project Report, Rutgers University, 1997. [17] Harman, Donna, “Overview of the First TREC Conference”, in Proceedings of SIGIR’93, pp. 36-47, 1993. [18] Vickery, A and H. M. Brooks, “PLEXUS – The Expert System for Referral”, Information Processing & Management, 23(2):99-117, 1987. [19] Borgman, C. L., D. O. Case, and C. T. Meadow, “The Design and Evaluation of a Front-End Interface for Energy Researchers”, JASIS, vol. 40, pp. 99-109, 1989. [20] Church, Kenneth Ward and Patrick Hanks, “Word Association Norms, Mutual Information, and Lexicography”, Computational Linguistics, 16:1,1990, pp. 22-29. [21] Ruge, Gerda, “Experiments on Linguistically based Term Associations”, in Proceedings of RIAO’91, pp. 528-545, 1991. [22] Grefenstette, Gregory, “Use of Syntactic Context to Produce Term Association Lists for Text Retrieval”, in Proceedings of SIGIR’92, pp.89-97, 1992. [23] Strzalkowsi, T., “Building a Lexical Domain Map from Text Corpora”, in Proceedings of COLING’94, pp. 604-610, 1994. [24] Nakagawa, Hiroshi, “Extraction of Index Words from Manuals”, in Proceedings of RIAO’97, pp.598-611. [25] Ruge, G, C. Schwarz and G. Thurmair, “A Hyperterm System Based on Natural Language Processing”, International Forum on Information and Documentation, 15(3):3-8, 1990. [26] Anick, Peter G., Jeffrey D. Brennan, Rex. A. Flynn, David R. Hanssen, Bryan Alvey, and Jeffrey M. Robbins, “A Direct Manipulation Interface for Boolean Information Retrieval via Natural Language Query”, in Proceedings of SIGIR’90, pp. 135150, 1990. [27] Cooper, James W. and Roy J. Byrd, “Lexical navigation: Visually prompted query expansion and refinement”, in Digital Libraries’97. [28] Bruza, P.D. and S. Dennis, “Query Reformulation on the Internet: Empirical Data and the Hyperindex Search Engine”, in RIAO’97, pp. 488-499, 1997. [29] Anick, Peter and Shiv Vaithyanathan, “Exploiting Clustering and Phrases for Context-based Information Retrieval”, in Proceedings of SIGIR’97, pp.314-323, 1997.

0-7695-0001-3/99 $10.00 (c) 1999 IEEE

7

Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999 Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999

Figures 8781 8703 7743 5575 4793 4351 4305 4183 4182 3953 3813 3594 3462 3408 3393 3381 3303 3237 3185 3113

group market company business international bank system government price industry investment service economic european policy management rate share trade political

Figure 1. Terms with highest lexical dispersion (within noun phrases)in the 90K article Financial Times database.

0-7695-0001-3/99 $10.00 (c) 1999 IEEE

8

Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999 Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999

Figure 2. Paraphrase interface showing the facet terms (in upper left hand frame) in response to the query “indonesia”.

0-7695-0001-3/99 $10.00 (c) 1999 IEEE

9

Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999 Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999

indonesian indonesian government indonesian troop indonesian company indonesian debt indonesian value indonesian tuna indonesian shrimp indonesian rupiah loan indonesian producer indonesian plywood indonesian official indonesian naval commander indonesian island indonesian fishermen indonesian farmer indonesian exporter indonesian export indonesian conglomerate indonesian central bank indonesian businessmen indonesian business indonesian banking sector

export export quota export market export earnings world export tuna export rubber export oil export net export export value export tax revenue export requirement export quality standard export proceeds export increase export growth export commitment energy export commodity export coffee export indonesian export batam island export zone

governme nt indonesian government government official government supporter government report government minister government expenditure government decision government critic government corruption government concern government committee central government singapore government company singapore government local government official jakarta government indonesian government general government government spending dutch government

Figure 3. Phrases associated with the facets indonesian, export, and government, in response to the query “indonesia”.

0-7695-0001-3/99 $10.00 (c) 1999 IEEE

10

Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999 Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999

Figure 4. Paraphrase interface showing facets for the query “cooking”, with the “cheese” facet expanded to show its phrases.

0-7695-0001-3/99 $10.00 (c) 1999 IEEE

11

Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999 Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999

Figure 5. Paraphrase interface, after the user has entered the query “wildlife extinction” and selected the phrase “logging area”.

0-7695-0001-3/99 $10.00 (c) 1999 IEEE

12