
Arazy & Woo/Enhancing Information Retrieval

RESEARCH ARTICLE

ENHANCING INFORMATION RETRIEVAL THROUGH STATISTICAL NATURAL LANGUAGE PROCESSING: A STUDY OF COLLOCATION INDEXING1

By:

Ofer Arazy
The University of Alberta
3-23 Business Building
Edmonton, AB T6G 2R6
CANADA

Carson Woo
Sauder School of Business
University of British Columbia
2053 Main Mall
Vancouver, BC V6T 1Z2
CANADA

Abstract

Although the management of information assets (specifically, of the text documents that make up 80 percent of these assets) can provide organizations with a competitive advantage, the ability of information retrieval (IR) systems to deliver relevant information to users is severely hampered by the difficulty of disambiguating natural language. The word ambiguity problem is addressed with moderate success in restricted settings, but continues to be the main challenge for general settings, characterized by large, heterogeneous document collections.

In this paper, we provide preliminary evidence for the usefulness of statistical natural language processing (NLP) techniques, and specifically of collocation indexing, for IR in general settings. We investigate the effect of three key parameters on collocation indexing performance: directionality, distance, and weighting. We build on previous work in IR to (1) advance our knowledge of key design elements for collocation indexing, (2) demonstrate gains in retrieval precision from the use of statistical NLP for general-settings IR, and, finally, (3) provide practitioners with a useful cost-benefit analysis of the methods under investigation.

Keywords: Document management, information retrieval (IR), word ambiguity, natural language processing (NLP), collocations, distance, directionality, weighting, general settings

1 Veda Storey was the accepting senior editor for this paper. Praveen Pathak served as a reviewer. The associate editor and two additional reviewers chose to remain anonymous.

Introduction

Text repositories make up 80 percent of organizations’ information assets (Chen 2001), with the knowledge encoded in electronic text now far surpassing that available in data alone (Spangler et al. 2003). Consequently, the use of information technology to manage documents remains one of the most important challenges facing IS managers (Sprague 1995, p. 29). Information retrieval (IR) systems enable users to access relevant textual information in documents. Despite their rapid adoption, current IR systems are struggling to cope with the diversity and sheer volume of information (Baeza-Yates and Ribeiro-Neto 1999). The performance of IR systems is often less than satisfactory, especially when applied to information searches in general settings, such as networked organizations or the World Wide Web, which are often characterized by large, unrestricted collections and multiple domains of interest. This highlights the need for scalable IR techniques (Jain et al. 1998).

MIS Quarterly Vol. 31 No. 3, pp. 525-546/September 2007

In spite of evidence suggesting that the use of collocations is essential for representing text documents (Church and Hanks 1990; Lewis and Spärck-Jones 1996), the use of collocations and word co-occurrence patterns to improve IR effectiveness continues to be controversial (Baeza-Yates and Ribeiro-Neto 1999).

The effectiveness of IR systems is determined by the extent to which the representations of documents actually capture their meaning. The main challenge inhibiting the performance of current retrieval systems is word ambiguity (Baeza-Yates and Ribeiro-Neto 1999): the gap between the way in which people think about information, and the way in which it is represented in IR systems. IR systems require that both documents and user requests be formulated in words, which are inherently ambiguous (Deerwester et al. 1990).

This paper makes two contributions to theory and practice of IR. First, we identify key parameters that affect collocation indexing performance, namely collocation directionality and the distance between the words making up collocations. Second, we demonstrate empirically that the use of statistical collocation indexing can substantially enhance IR from large and heterogeneous collections.

The objective of this research is to gain an understanding of the potential and limitations of automatically generating meaningful representations of textual documents. Our goal is to identify IR techniques that are both suitable for general settings and deliver relevant information. Accordingly, we focus on statistical IR techniques which are scalable to large, open environments. Statistical natural language processing (statistical NLP) methods “have led the way in providing successful disambiguation in large scale systems using naturally occurring text” (Manning and Schutze 2003, p. 19). In order to represent the meaning of documents, statistical NLP uses information contained in the relationships between words, including how words tend to group together. Our study focuses on one specific statistical NLP technique: collocation indexing. A collocation is a word combination which carries a distinct meaning (i.e., a concept). Collocation indexing refers to the process of extracting collocations from natural text and representing document meaning through these collocations (instead of through words). Collocations are extracted by grouping together words that appear close together in the text (the simplest example of collocation is a phrase, such as “artificial intelligence”). Firth (1957, p. 11) coined the phrase, “You shall know a word by the company it keeps,” suggesting that words could be disambiguated by analyzing their patterns of co-occurrence. Statistical collocation indexing provides a good solution to the word ambiguity problem (Morita et al. 2004), and this approach is robust (even in the presence of error and new data) and generalizes well (Manning and Schutze 2003). Collocations extracted from large text corpora have been shown to convey both syntactic and semantic information (Maarek et al. 1991), thus collocation-based IR systems can more accurately address users’ requests.


The paper continues as follows. The next section provides a context for this study, briefly reviewing the problem of content representation in information retrieval; the third section addresses scientific rigor by positioning the work within the context of Information Science and Computational Linguistics; the fourth section lays out our research questions; the fifth section presents the proposed research method, detailing the evaluation procedure; the sixth section presents the results from our experiments; the seventh section discusses the findings of this study; and the eighth section concludes the paper with some future research directions.

Information Retrieval, Representations, and Constructs

Information retrieval systems represent documents and queries through profiles (also referred to as indexes). A profile is a short-form representation of a document, easier to manipulate than the entire document, and serves as a surrogate at the retrieval stage. Next, matching is performed by measuring the similarity between the document and query profiles, returning to the user a ranked list of documents that are predicted to be of relevance to the query. IR effectiveness is measured through precision, the percentage of retrieved documents which are relevant to the user, and recall, the percentage of the total relevant documents in the collection which are included in the set of returned documents. Recall and precision are generally complementary, and negatively correlated. There are situations in which recall will be most important to a user; for example, in retrieving every possible piece of information regarding a specific medicine (even at the cost of retrieving nonrelevant information). However, precision is considered more important to most search tasks, since users are primarily interested in exploring only the documents at the top of the results list (Lewis and Spärck-Jones 1996). Since users’ intentions are unknown during system design, the general performance of a system could be evaluated using measures that combine precision and recall. Two such measures are popular: the precision–recall curve (plotting a precision value for each recall level), and the harmonic mean, or F measure,

$$F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
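As an illustration of the effectiveness measures defined above, the following minimal Python sketch computes precision, recall, and the F measure for a single query. The function name and the document identifiers are hypothetical, and returning zero when a set is empty is a convention assumed here, not something prescribed by the paper.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F) for one query.

    retrieved: iterable of document ids returned by the system
    relevant:  iterable of document ids judged relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved

    # Assumed convention: an empty set yields 0.0 rather than an error.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical example: 4 documents retrieved, 3 relevant overall, 2 in common.
p, r, f = precision_recall_f(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
```

In this example, precision is 2/4, recall is 2/3, and the F measure is their harmonic mean, 4/7.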

Another important aspect of IR systems is their efficiency. IR efficiency is estimated by (1) an algorithm’s computational complexity and (2) the storage space of document profiles. Efficient systems are capable of processing large numbers of documents, and are thus suitable for a wide variety of business settings. In testing the performance of collocation indexing, we will consider both effectiveness (precision and recall) and efficiency (complexity and storage space).

IR Representations

Research in IR systems investigates four different types of artifacts: instantiations (IR software systems), methods (algorithms for developing the IR system), IR models (representations of documents and queries), and constructs (the indexing units used to represent documents and queries). Representations in IR are defined in terms of constructs (i.e., the indexing units that define the semantic space) and models, which describe how documents and queries are mapped onto the semantic space and how their similarity is measured in that space. There are several reasons why we focus in this study on IR representations and, more specifically, on constructs. First, IR effectiveness is primarily determined by the extent to which representations actually capture the meaning of documents. Second, most IR research tends to focus on retrieval models, algorithms, and the development of systems, while the remaining type of artifact, constructs, has been overlooked to a large extent. An investigation of IR indexing units can only be performed in the context of a specific retrieval model. We use the vector space model (Salton et al. 1975) in our study, as this is the de facto standard in the design of commercial retrieval systems. Popular alternative models include the probabilistic model (Robertson and Spärck-Jones 1976) and language models (Ponte and Croft 1998). The fundamental difference between these alternative models and the vector space model is in the way in which indexing terms are weighted. Because the probabilistic model requires initial training of the system, it is less appropriate for large, domain-independent collections.

Language models are also domain-dependent and, to date, have not been applied to general settings. In the vector space model, documents and queries are represented as weighted vectors of indexing units (i.e., a list of terms and their associated weights). Geometrically, the indexing units define the semantic space, and documents and queries are represented by points in that space (see Figure 1). The similarity of query and document indexes is commonly calculated through the cosine of the angle between the two index vectors (i.e., the dot product of the two normalized vectors) (Baeza-Yates and Ribeiro-Neto 1999; Salton and McGill 1983).2
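The cosine matching described above can be sketched in a few lines of Python. This is an illustrative sketch only: the sparse-dictionary representation of profiles, the function names, and the toy weights are assumptions made for the example, not details of the prototype system.

```python
import math


def cosine_similarity(doc, query):
    """Cosine of the angle between two sparse profiles.

    doc, query: dicts mapping an indexing unit to its weight.
    """
    # Dot product over the indexing units the two profiles share.
    dot = sum(w * query[unit] for unit, w in doc.items() if unit in query)
    norm_doc = math.sqrt(sum(w * w for w in doc.values()))
    norm_query = math.sqrt(sum(w * w for w in query.values()))
    if norm_doc == 0 or norm_query == 0:
        return 0.0  # assumed convention for empty profiles
    return dot / (norm_doc * norm_query)


def rank(docs, query):
    """Return document ids ordered by decreasing similarity to the query."""
    return sorted(docs, key=lambda d: cosine_similarity(docs[d], query), reverse=True)


# Toy two-dimensional semantic space with indexing units A and B.
docs = {"doc1": {"A": 1.0, "B": 3.0}, "doc2": {"A": 2.0, "B": 1.0}}
query = {"A": 2.0, "B": 0.5}
ranking = rank(docs, query)
```

Here doc2 points in nearly the same direction as the query and is therefore ranked first, regardless of vector length.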

Constructs in IR

The major challenge for IR research on constructs is in proposing indexing units that are meaningful, yet can be efficiently extracted (automatically) from text (Lewis and Spärck-Jones 1996). We discuss two different types of indexing units used in IR models: tokens and concepts.

Tokens as Indexing Units

Traditionally, single-word terms, referred to as tokens, were used to represent a document’s meaning. For example, this paper may be indexed by the following set of tokens (information, systems, research, retrieval, text). Token indexing is usually accomplished through (1) statistical analysis of word frequency for the entire corpus, and removal of high-frequency words which in themselves do not carry meaning (e.g., the, of, and at); (2) stemming, by removing word prefixes and suffixes; (3) removal of infrequent tokens and the generation of a token list; and (4) indexing of each document (and later, of each query), using tokens from the list (Baeza-Yates and Ribeiro-Neto 1999). Token indexing is useful for excluding words that do not carry distinctive meaning; however, tokens do not solve the problem of word ambiguity. For example, a person interested in computer viruses submitting the query “virus” may be presented with documents describing biological viruses, resulting in low precision. Techniques for automatically generating token indices rely on the early works of Luhn (1958), and are still used in current commercial IR systems. Despite the popularity of token indexing, it is widely believed that this approach “is inadequate for text content representation in IR” (Salton and Buckley 1991, p. 21), and therefore higher-level indexing units are essential for large-scale IR (Mittendorf et al. 2000).

2 We tested the cosine function by comparing it to a Euclidean distance function and found that the performance of the two functions was indistinguishable.
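The four token-indexing steps listed in this section can be sketched as follows. This is a simplified illustration under stated assumptions: the stop-word list is a tiny hand-picked set rather than one derived from corpus statistics, `naive_stem` is a crude suffix stripper standing in for a real stemming algorithm, and the threshold `min_doc_freq` and all function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: a toy stop-word list; real systems use a much larger list.
STOP_WORDS = {"the", "of", "and", "at", "a", "in", "to", "is"}


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer, not a faithful one."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Steps 1 and 2: drop stop words, then stem what remains."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


def build_token_index(docs, min_doc_freq=2):
    """Steps 3 and 4: prune infrequent tokens, then index each document."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    # Document frequency: number of documents in which each token occurs.
    doc_freq = Counter(t for toks in tokenized.values() for t in set(toks))
    vocabulary = {t for t, df in doc_freq.items() if df >= min_doc_freq}
    profiles = {
        doc_id: Counter(t for t in toks if t in vocabulary)
        for doc_id, toks in tokenized.items()
    }
    return vocabulary, profiles


docs = {
    "d1": "the retrieval of documents",
    "d2": "indexing documents and queries",
    "d3": "retrieval systems index a document",
}
vocabulary, profiles = build_token_index(docs)
```

With this toy collection, only tokens that occur in at least two documents survive pruning, so each profile holds raw counts over a three-token vocabulary.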


[Figure 1 plot: axes are Indexing Unit A and Indexing Unit B; Doc. #1 (with coordinates W1,A and W1,B), Doc. #2, and the Query are shown as points in this space.]

Figure 1. IR Models (illustrated through a simple semantic space, defined by two indexing units, A and B, onto which the query and documents are mapped)

Concepts as Indexing Units

There is a substantial body of evidence from the fields of linguistics, semantics, education, and psychology that people process and represent knowledge using concepts. Conceptual (versus token-based) document representations have the potential to alleviate word ambiguity; yet, automatically generating such representations is extremely difficult, as language is complex and patterns of word usage do not follow strict rules (Manning and Schutze 2003). Two classes of natural language processing (NLP) approaches have been used to automatically extract concepts from text by grouping words into sets of higher-level abstractions (i.e., latent concepts): linguistic and statistical. Linguistic NLP relies on formal models of language to analyze the text using syntactic and/or semantic methods. Although linguistic analysis can be performed automatically, this is often not effective in general settings, because syntactic analysis is extremely ambiguous and computers struggle to determine the syntactic structure of arbitrary text (Lewis and Spärck-Jones 1996). Alternatively, linguistic analysis can be complemented by manual rule creation and hand-tuning, but hand-coded syntactic constraints and preference rules are time-consuming to build, do not scale up well, and are not easily portable across domains (Chen 2001).3 Furthermore, linguistic NLP performs poorly when evaluated on naturally occurring text (Manning and Schutze 2003).

3 One popular approach for addressing the word ambiguity problem is word-sense disambiguation, where preconstructed lexical resources, such as WordNet (Miller 1995), are used to disambiguate terms appearing in documents and queries. Despite the use of these external resources, in general this method has not proved effective in IR (Stokoe et al. 2003).



The alternative approach, statistical NLP, is founded on the philosophies of Firth (1957), who argued that the meaning of a word is defined by the patterns of its use and circumstances (i.e., the words surrounding it). Statistical NLP can be used to extract meaningful patterns of words and these collocations can then be utilized to represent the meaning of text documents. This enables the automatic processing of very large and heterogeneous text collections. We will, therefore, adopt the statistical NLP approach for our study.

Collocations in Information Retrieval

From linguistic and semantic points of view, the idea that a combination of words is often a better representation of a concept than the individual words is well accepted (Church and Hanks 1990). A collocation4 is defined as “an expression consisting of two or more words that corresponds to some conventional way of saying things” (Manning and Schutze 2003, p. 151). Collocations include compounds (hard-disk), phrasal verbs [adjacent (get-up) and nonadjacent (knock … door)], and other stock phrases (bacon and eggs). Collocations can be used as indexing units, in addition to single-word tokens, since the frequency of a collocation is correlated with the importance of the collocation in the text (Luhn 1958; Maarek et al. 1991). The discrepancy between the potential of collocations and their actual use may be due to the challenge of automatically extracting meaningful collocations using statistical methods.

4 Also referred to as term associations, term co-occurrence, term combinations, second-order features, higher-order features, or simply phrase.


The untapped potential of collocation indexing makes it an area where statistical NLP—given its scalability to large collections—can make important contributions to IR (Lewis and Spärck-Jones 1996; Manning and Schutze 2003; Mittendorf et al. 2000; Strzalkowski et al. 1996).

Collocations and Information Retrieval Performance

One reason that collocation analysis has been under-utilized in IR may be that, traditionally, collocations were associated with linguistic analysis, complex models, and artificial intelligence implementations, and thus were regarded as most suitable for very small and restricted problems (Salton and Buckley 1991).5 To address these limitations, statistical techniques have been investigated. The early works on statistical collocation indexing (e.g., Fagan 1987; Lesk 1969; van Rijsbergen 1977) were restricted to small and domain-specific collections. These early studies yielded inconsistent, minor, and insignificant performance improvements (Fagan 1989). Later works, investigating collocation extraction for much larger and diverse collections, usually in the context of the Text Retrieval Conference (TREC; e.g., Carmel et al. 2001), have also yielded weak and inconclusive results (Khoo et al. 2001; Mittendorf et al. 2000), and the effectiveness of collocation-based IR continues to be controversial (Baeza-Yates and Ribeiro-Neto 1999). Lewis and Spärck-Jones (1996), in their survey of NLP in IR, argue that “automatically combining single indexing terms into multi-word indexing phrases or more complex structures has yielded only small and inconsistent improvements” (p. 94). In TREC, phrase (i.e., adjacent collocation) indexing has been shown to provide only minor (2 to 4 percent) improvements in precision (Spärck-Jones 1999). Other experiments (e.g., Mitra et al. 1997) show that collocations alone perform no better than tokens, and that enhancing token indexes with collocations results in minor (1 to 5 percent) precision gains. In some specific cases, collocations were able to yield small, still insignificant, improvements (e.g., 10 percent gains in Buckley et al. 1996).

In summary, to date, no method for employing collocations in general settings has yielded substantial and significant precision improvements over token-based representation on an accepted benchmark. Recently, the focus of collocation indexing research has shifted toward advanced retrieval models. The language modeling approach to IR provides a well-studied theoretical framework that has been successful in other fields. However, language models (similarly to the vector space and probabilistic IR models) assume independence of indexing terms, and attempts to integrate co-occurrence information into the language models have not shown consistent improvements (e.g., Alvarez et al. 2004; Jiang et al. 2004; Miller et al. 1999). We believe that the major factor inhibiting the performance of collocation-based IR is the insufficient knowledge we have regarding the key parameters that affect collocation performance. In the following sections, we identify such parameters and discuss their effect on IR performance.

5 Furthermore, syntactic collocation indexing, in most cases, has failed to yield improvements beyond traditional token indexing or statistically extracted collocations (e.g., Croft et al. 1991; Mitra et al. 1997; Strzalkowski et al. 1996). Semantic collocation extraction methods have also failed to prove useful in IR (Khoo et al. 2001).

Collocation Extraction Parameters and IR Performance

Hevner et al. (2004) noted that designing useful artifacts is complex due to the need for creative advances in areas in which existing theory is often insufficient. Extant theory on collocations provides little practical guidance on the design of collocation-based retrieval systems. Collocation theory, aside from discussions on grammatical features (which are irrelevant for statistical NLP), does not prescribe a procedure for distinguishing meaningful collocations from meaningless word combinations. A survey of collocation literature revealed three salient characteristics of collocations: directionality (i.e., whether the ordering of collocation terms should be preserved), distance (i.e., the proximity of the two words comprising the collocation), and weighting (i.e., the algorithm for assigning weights to collocations in document and query profiles), as discussed below.

Collocation Directionality

Collocation theory does not provide a definite answer on the issue of directionality, as in some cases (e.g., artificial intelligence) the ordering of collocation terms is essential for preserving the meaning, while in other cases (e.g., Africa and elephants) directionality does not impact meaning (Church and Hanks 1990). Directionality is one of the key features that distinguishes linguistic from statistical collocations: while linguistic collocations preserve token order, most statistical approaches normalize collocations to a nondirectional form. Most works in IR on collocation indexing have treated collocations as nondirectional (e.g., Fagan 1989; Mitra et al. 1997). However, Fagan (1989) reported on problems with nondirectional collocations, and there have been a few exceptions where directional collocations were used in IR (e.g., Mittendorf et al. 2000). The effect of directionality on IR performance has not been tested in the context of the vector space model, although evidence from experiments with language models (Srikanth and Srihari 2002) suggests that directional collocations are more precise (by 5 to 10 percent) than nondirectional collocations.

The Distance Between Collocation Terms

The intensity of links between words, commonly operationalized through distance, reflects the semantic proximity of the words. Capturing that semantic proximity is important for IR effectiveness, and it is generally recognized that collocation distance is of critical importance (Fagan 1987, 1989). IR research on collocation extraction assumes that co-occurrence of words within tight structural elements (i.e., a sentence) conveys more meaning than within less tight structural elements (i.e., paragraphs or sections). Thus, research on collocation extraction has been dominated by within-sentence analysis. Empirical analysis justifies restricting collocation extraction to combinations of words appearing in the same sentence. Martin et al. (1983) found that 98 percent of syntactic combinations associate words that are within the same sentence and are separated by five words or fewer. Fagan (1987) found that restricting the extraction of collocations to a five-token distance window is almost as effective as extracting collocations within a sentence with no such restriction, supporting the findings of Martin et al. Others that followed (e.g., Carmel et al. 2001; Maarek et al. 1991) employed a five-word window for extracting collocations. Collocations binding terms across sentence boundaries also may be of importance (e.g., doctor-nurse or airport-plane commonly co-occur across sentences).

Fagan (1987) provided initial evidence showing that, for a small domain-specific corpus, across-sentence collocations are more effective than within-sentence collocations. Two recent studies in related fields support Fagan’s earlier findings. Mittendorf et al. (2000; routing task; the probabilistic IR model) report that collocations combining terms across sentences were just as effective as those within a sentence, while document-level collocations proved less effective than sentence- and paragraph-level collocations. Srikanth and Srihari (2002; the set-based IR model) investigated a text window for combining terms, and attained optimal performance with a 50 to 80 word window, well beyond the limits of a sentence. Although these studies were carried out in a context different from ours (e.g., small collection, different IR models, different tasks), they do suggest that across-sentence collocations could potentially enhance the performance of the vector space model for large text collections.

Another important aspect, which has not been investigated in IR literature, is the importance of distance within a given structural element: that is, whether collocation terms separated by few words are more meaningful than collocations at larger distances (still within the same sentence). Studies to date have tended to regard all collocations within the predefined text window (usually a five-word window, after Martin et al. 1983) equally, and failed to take into account the fact that more proximate collocations are likely to carry more semantics.

To summarize our discussion of collocation distance, the exact relationship between physical and semantic proximity requires further investigation. First, should collocation extraction be restricted to within-sentence co-occurrence, or should it include across-sentence collocations? Second, should more weight be assigned to closer collocations (within the same structural element)?

Collocation Weighting

A document (or query) profile includes a list of indexing units, each associated with a weight. Indexing units’ weights determine the position of the document in the semantic space. With tokens as indexing units, there is a substantial body of research on weighting schemes (Salton and McGill 1983). It is now commonly accepted that an effective scheme should assign a weight based on (1) a local, document-specific factor, such as token frequency in the document (often normalized, to counter the variations in document length), and (2) a global, corpus-level factor, such as the number of documents and the frequency of the token in the entire collection. The token weight in the document profile is correlated positively with the local factor, and negatively with the global factor.
Different weighting schemes could be used with the vector space model, and the de facto standard is term frequency – inverse document frequency (tf-idf), defined formally as follows. Let $N$ be the total number of documents in the collection and $df_i$ be the number of documents in which the index term $k_i$ appears. Let $tf_{i,j}$ be the raw frequency of term $k_i$ in the document $d_j$. Then, the normalized frequency $ntf_{i,j}$ of term $k_i$ in the document $d_j$ is given by

$$ntf_{i,j} = \frac{tf_{i,j}}{\max_{l} tf_{l,j}}$$

and is referred to as the term frequency (TF) factor. The maximum, $\max_{l} tf_{l,j}$, is computed over all terms which are mentioned in the text of the document $d_j$. The inverse document frequency (IDF) factor for $k_i$ is given by

$$IDF_i = \log \frac{N}{df_i}$$

Thus,

$$w_{i,j} = TF \times IDF = ntf_{i,j} \times \log \frac{N}{df_i}$$
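The tf-idf definition above translates directly into code. The sketch below is illustrative only: the function name, the list-of-terms input format, and the toy collection are assumptions, and the logarithm is taken to base 2 as one possible choice since the definition leaves the base open.

```python
import math
from collections import Counter


def tf_idf_profiles(docs):
    """Compute w[i, j] = ntf[i, j] * log2(N / df[i]) for every document.

    docs: dict mapping a document id to its list of index terms.
    Returns a dict mapping each document id to {term: weight}.
    """
    n_docs = len(docs)
    # df[i]: number of documents in which term i appears.
    df = Counter(term for terms in docs.values() for term in set(terms))

    profiles = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)                   # raw frequency tf[i, j]
        max_tf = max(tf.values(), default=0)  # max over all terms l of tf[l, j]
        profiles[doc_id] = {
            term: (freq / max_tf) * math.log2(n_docs / df[term])
            for term, freq in tf.items()
        }
    return profiles


# Toy collection of four already-tokenized documents.
docs = {
    "d1": ["virus", "virus", "computer"],
    "d2": ["virus", "cell"],
    "d3": ["computer", "network"],
    "d4": ["cell"],
}
weights = tf_idf_profiles(docs)
```

In this collection, "network" appears in one of four documents and so receives the largest IDF factor, while a term that appeared in every document would receive a weight of zero.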

Tf-idf has been reported to increase IR precision up to 70 percent, when compared to simple tf (i.e., weight is assigned based on the frequency of a term in the document; Salton and Buckley 1988). With collocations as indexing units, however, there is no well-accepted scheme and the development of weighting schemes for collocations is at early phases of research. The main challenge for collocation weighting is that the vector space model assumes terms are independent, and thus does not lend itself to modeling inter-term dependencies, which are essential for handling collocations (Strzalkowski et al. 1996). Two alternatives for alleviating this problem are possible: (1) use the simple tf weighting scheme for collocations (e.g., Khoo et al. 2001), based on the fact that collocation frequency is correlated with the importance of the collocation (Luhn 1958; Maarek et al. 1991), or (2) adapt a more effective scheme, such as tf-idf, to suit collocations. There have been various attempts to adapt the tf-idf weighting scheme to collocations (e.g., Alvarez et al. 2004; Buckley et al. 1996; Church and Hanks 1990; Maarek et al. 1991; Mitra et al. 1997; Strzalkowski et al. 1996), but these schemes do not address the term-independence problem. Furthermore, performance gains from these tf-idf adaptations were not substantial, and the weighted collocation indexes were usually only 5 percent more effective than weighted token indexes (Jiang et al. 2004; Mitra et al. 1997). Among the various suggestions, the scheme proposed by Church and Hanks (1990) stands out for its theoretical grounding. Church and Hanks’ proposal is based on Shannon’s information theory, where the mutual information (MI) of two words $x$ and $y$ with occurrence probabilities $P(x)$, $P(y)$, and co-occurrence probability $P(x, y)$ is given by the following expression.6

$$I(x, y) = \log \frac{P(x, y)}{P(x) \times P(y)}$$

Mutual information compares the “information” captured in the collocations to that captured by the single terms making up the collocation. When applied to IR, word probabilities are estimated by their occurrence frequencies. To date, this information-theoretic approach to collocation weighting has not been tested on a large corpus.

6 Alternative bases for the log functions are possible. In our implementation (see details in the “Research Method” section), we used a base of 2.

Research Questions

This study is intended to advance our understanding of collocation indexing in IR in general settings. We address two key questions that IS design should consider (Hevner et al. 2004): (1) Does the artifact work? (2) What are the characteristics of the environment in which it works? We first conduct an examination of the effects of key parameters (i.e., collocation directionality, distance, and weighting), and then investigate the extent to which collocation indexing enhances retrieval performance beyond traditional token-based indexing.

Research Method

In order to study the effect of collocations on IR performance, we developed a prototype IR system with collocation indexes for documents and queries. We conducted a series of laboratory experiments to address the research questions presented earlier. To study IR constructs (i.e., tokens and collocations) within a retrieval system, we fixed the models, methods, and instantiations, discussed earlier, in the following ways:

• Model: we employed the standard vector space model, with two alternative weighting schemes: simple tf and tf-idf. For collocations, we also tested the mutual information adaptation to tf-idf.

• Methods: we used the cosine function for matching indexes, and statistical NLP methods for extracting collocations (where collocations were restricted to two-word combinations, a common and practical approach that is scalable to large and heterogeneous collections7).

• Instantiations: we restricted the prototype system to only the core features described above (vector space model, collocation indexing, and cosine function for matching). This allowed us to focus on the core questions of this study, and control for the effect of exogenous factors.

7 Mitra et al. (1997) tested the performance of two-word collocations against an index that included both two-word and three-word collocations. They found that incorporating three-word collocations adds less than 1 percent to precision.
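To make the mutual information measure of Church and Hanks (1990), referred to in the weighting discussion, more concrete, the following sketch estimates it from raw counts using base-2 logarithms. It illustrates the measure itself, not the way the prototype combines it with tf-idf, which is not specified in this section. The function name, its parameters, and the choice of `total` as the number of observations used to turn counts into probabilities are all assumptions.

```python
import math


def mutual_information(pair_count, x_count, y_count, total):
    """Estimate I(x, y) = log2( P(x, y) / (P(x) * P(y)) ) from counts.

    pair_count: number of co-occurrences of x and y
    x_count, y_count: number of occurrences of x and of y
    total: number of observations used to normalize the counts (assumption)
    """
    if min(pair_count, x_count, y_count) <= 0 or total <= 0:
        raise ValueError("all counts must be positive")
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts: the pair co-occurs ten times more often than chance.
strong = mutual_information(pair_count=20, x_count=50, y_count=40, total=1000)
# The same words co-occurring exactly as often as independence predicts.
chance = mutual_information(pair_count=2, x_count=50, y_count=40, total=1000)
```

A value near zero indicates that the two words co-occur about as often as they would if they were independent; larger positive values indicate a stronger association.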




Data

In order to simulate a real-life general setting, we studied the research questions on a large and heterogeneous collection: the Text Retrieval Conference (TREC) database. The TREC collection is becoming a standard for comparing IR models (Baeza-Yates and Ribeiro-Neto 1999). TREC includes (1) a document collection, (2) a predefined set of queries, and (3) predetermined manual assessments for the relevance of each document to each of the queries. For each approach tested, the retrieval prototype system processes the documents and queries to produce a list of predicted relevant documents for each query, which is then compared to manual relevance assessments. Effectiveness is measured using precision, recall, the recall–precision graph, and the F measure. We used disks 4 and 5 of the TREC collection, which include 528,030 documents from a variety of sources8 and cover a variety of topics, 100 queries, and manually constructed binary relevance judgments (i.e., either relevant or nonrelevant) for all documents on each of the queries.

Implementation

To test our hypotheses, we generated several alternative representations (for each document and query) that differ only in terms of directionality and distance between collocation terms.

We first processed the TREC collection documents to generate a token list, using stop-word removal (with SMART’s system common words list [Buckley et al. 1996]), stemming (with Porter’s [1980] algorithm), and removal of tokens that appear in fewer than six documents,9 arriving at 72,354 unique tokens. We then generated document and query profiles with these tokens. We used the resulting token index (1) to compare the performance of collocations to the traditional token-based approach, and (2) as the starting point for the collocation extraction process.

To extract collocations, we created “tokenized documents” by replacing words with tokens, while preserving sentence boundaries. We then extracted two-token collocations, based on the co-occurrence of these tokens in the text. We applied several alternative procedures that differed in terms of collocation distance, directionality, and weighting. For example, in one scheme we grouped two-token combinations that appeared within the same sentence, while in another scheme we grouped collocations across sentences. Likewise, in one scheme we preserved the ordering of collocation terms while in another scheme we disregarded term ordering.

Efficiency is a challenge in collocation processing, since the number of possible two-word combinations is in the billions.10 To address this issue, we restricted collocation extraction in each document to combinations of only the 20 most frequent tokens.11 We further restricted the number of collocations by combining the collocations extracted from each document into one cumulative list, and pruning collocations of low frequency from the list. Similar to Mitra et al. (1997), we ranked the cumulative list based on the number of documents in which each collocation appeared, and removed those with the lowest frequency, thus restricting the list to one million unique collocations. In order to have a level playing field, all of the collocation lists generated under different combinations of distance, directionality, and weighting were pruned to one million collocations. Table 1 provides examples of extracted collocations. We then indexed documents and queries with the remaining set of collocations. The outcome of this process was a collocation profile for all documents and queries in the test collection. We generated several alternative sets of profiles, based on directionality, distance, and weighting settings.

Measures

Both precision and recall were measured for the 10, 20, 30, and 100 top ranked documents (i.e., Precision[10], Precision[20], …; Recall[10], Recall[20], …). Similarly, we employed the precision–recall graph and the F measure for the 10, 20, 30, and 100 top ranked documents.

Experimental Design

We conducted three experiments: Experiment #1 to test the effect of collocation indexing parameters (directionality, distance, and weighting), Experiment #2 to determine the best collocation indexing scheme, and Experiment #3 to test whether collocations can enhance IR performance.

8 Documents from the Los Angeles Times (1989-1990), Financial Times (1991-1994), Federal Register (1994), and the Foreign Broadcast Information Service (1996).

9 A similar threshold was employed in other studies (e.g., Deerwester et al. 1990).

10 With the total number of unique tokens in our data set at 72,354, the total number of unique collocations could potentially reach 5 billion, highlighting the need for an efficient collocation extraction algorithm.

11 The choice of the top 20 tokens is based on scalability considerations, and is justified by the fact that the collocation list is later trimmed to only the most frequent collocations, and frequent collocations regularly are made up of frequent tokens.
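The extraction and pruning procedure described in the Implementation section can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the function names, the list-of-sentences document layout, the choice to skip pairs of identical tokens, and the toy data are all assumptions. The sketch covers within-sentence extraction only.

```python
from collections import Counter


def extract_collocations(sentences, window=5, directional=True, top_tokens=20):
    """Count two-token collocations inside a sliding token window.

    sentences: one tokenized document, as a list of token lists, so that
    sentence boundaries are preserved. Only the document's `top_tokens`
    most frequent tokens take part in collocations.
    """
    token_counts = Counter(tok for sent in sentences for tok in sent)
    frequent = {tok for tok, _ in token_counts.most_common(top_tokens)}
    pairs = Counter()
    for sent in sentences:
        for i, first in enumerate(sent):
            if first not in frequent:
                continue
            # Partners lie at most window - 1 positions to the right.
            for second in sent[i + 1 : i + window]:
                if second not in frequent or second == first:
                    continue  # skipping identical tokens is an assumption
                if directional:
                    key = (first, second)  # term order preserved
                else:
                    key = tuple(sorted((first, second)))  # order disregarded
                pairs[key] += 1
    return pairs


def prune_by_document_frequency(per_document_pairs, max_size):
    """Keep the `max_size` collocations that appear in the most documents."""
    document_frequency = Counter()
    for pairs in per_document_pairs:
        document_frequency.update(pairs.keys())  # one count per document
    return {pair for pair, _ in document_frequency.most_common(max_size)}


doc_1 = [["fund", "corp", "gain"], ["corp", "fund"]]
doc_2 = [["fund", "corp"]]
directional_pairs = extract_collocations(doc_1, directional=True)
nondirectional_pairs = extract_collocations(doc_1, directional=False)
kept = prune_by_document_frequency(
    [directional_pairs, extract_collocations(doc_2)], max_size=1
)
```

With directionality preserved, (fund, corp) and (corp, fund) are counted as distinct collocations; without it they collapse into a single unit. In the study the cumulative list was cut to one million collocations, which corresponds to `max_size=1_000_000` in this sketch.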


Table 1. Examples of the Frequent Collocations Extracted from the TREC Database (Disks 4 and 5)

Token 1   Words it Represents    Token 2   Words it Represents
Fund      funding, funded, …     Corp      corporate, corporation, …
Grade     grading, grades, …     Math      math, …
Gain      gains, gaining, …      Lab       lab, …

In Experiment #1, we employed a 2 × 2 × 3 design (directionality: direct/indirect; distance: within/across sentence; weighting: tf/tf-idf/MI) to test the effect of each parameter, as well as the interaction effects. Since query length (i.e., the number of terms in the query) has been reported to impact IR performance considerably, we controlled for this effect. We processed each document to generate a collocation index by extracting two-word combinations (using the standard procedure of restricting word-pairs to a five-token window,12 and “sliding” this window over the entire document). We calculated the frequency of a collocation in a document by counting all instances of that collocation. Across-sentence collocation extraction was performed similarly, using a window of five sentences (and excluding collocations within the same sentence). Weighting was tested with simple tf13 and two variations of tf-idf: the classic formula (as used for tokens) and the mutual information adaptation for tf-idf, as discussed earlier.

In Experiment #2, we built on the results of the first experiment to try to design an optimal collocation indexing scheme. We performed two types of analysis, focusing on collocation distance. First, we studied in detail the impact of distance within structural elements (e.g., whether collocations separated by one term yield higher performance than collocations separated by two terms), and tried designing a scheme that assigns higher weight to more proximate collocations within the same structural element (e.g., a sentence). Second, we worked to combine within-sentence and across-sentence schemes.

Experiment #3 tested the optimal collocation scheme from the previous experiment against the traditional token-based scheme, while controlling for query length. Since collocations are used to complement tokens in document indexes, rather than to replace tokens (e.g., Carmel et al. 2001; Mitra et al. 1997; Strzalkowski et al. 1996), we proceeded to combine the token and collocation schemes, and tested the combined scheme against the “pure” schemes of tokens and collocations.

In IR research, the differences between competing designs are often small and not statistically significant; thus, most IR works provide no analysis of statistical significance. For instance, the IR literature on collocations reports only minor effects for distance, directionality, or weighting (usually differences below 5 percent), and rarely tests for statistical significance. Despite the statistical insignificance of results from these works, they are considered important contributions to the cumulative knowledge in the field, and are reported in academic publications. Nonetheless, we will test and report the statistical significance for our experiments.14

Results and Analysis

In order to appreciate the results of the algorithms we studied, it is useful to provide performance measures for random retrieval. On average, a query in our test set had 558 relevant documents, out of the total set of 528,030 documents. Thus, the precision (i.e., percentage of relevant documents) for any randomly selected set is 558 ÷ 528,030 = 0.0011.

Experiment #1: Collocation Directionality, Distance, and Weighting

The 2 × 2 × 2 (i.e., directionality × distance × weighting) factorial ANOVA analysis for Experiment #1 (for each of the performance measures) revealed no statistically significant effects. We, therefore, report below only on the performance levels for each directionality-weighting-distance combination.15

12 We also experimented with not restricting collocation distance within a sentence. Results were almost identical to those of a five-term window.

13 We also tested a normalized tf scheme (i.e., term frequency normalized by the highest term frequency in the document) to account for document length, but results were indistinguishable from the simple tf, and therefore are not reported.

14 We will model our analysis after Khoo et al. (2001), one of the rare IR papers that reports the statistical significance of results, and employ one-sided significance tests.

15 Similar to Khoo et al. (2001), our comparison included only queries in which the profile contains at least one collocation. Out of the 100 queries provided with the TREC database (numbered 351 through 450), we employed the 84 queries that included at least one indexing unit (excluding queries 361, 362, 364, 365, 367, 369, 392, 393, 397, 405, 417, 423, 425, 427, 439, and 444).
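The cutoff-based measures used throughout the results can be sketched as follows. The section does not spell out the F formula, so this sketch assumes the balanced F measure (the harmonic mean of precision and recall); the function name and the toy ranking are likewise assumptions. The last line reproduces the random-retrieval baseline from the figures stated above.

```python
def precision_recall_f(ranked_doc_ids, relevant_ids, k):
    """Precision[k], Recall[k], and F[k] for one query.

    ranked_doc_ids: document ids ordered by decreasing similarity.
    relevant_ids: set of ids judged relevant (binary judgments).
    F is assumed to be the balanced harmonic mean of precision and recall.
    """
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy query: 4 relevant documents, 2 of them retrieved in the top 5.
ranking = ["d3", "d7", "d1", "d9", "d2"]
relevant = {"d1", "d2", "d4", "d8"}
p5, r5, f5 = precision_recall_f(ranking, relevant, k=5)

# Random-retrieval baseline: 558 relevant documents among 528,030.
random_precision = 558 / 528_030
```

Averaging each measure over all queries at k = 10, 20, 30, and 100 yields figures of the kind reported in the tables that follow.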




Table 2. The Effects of Directionality, Distance, and Weighting on Collocation Indexing

(P = Precision, R = Recall, F = F measure; the number is the count of top ranked documents.)

Scheme                        Weight   P10     P20     P30     P100    R10     R20     R30     R100    F10     F20     F30     F100
Direct / Within-Sentence      tf       0.273   0.226   0.190   0.104   0.054   0.085   0.099   0.149   0.091   0.124   0.130   0.123
Direct / Within-Sentence      tf-idf   0.288   0.239   0.196   0.110   0.058   0.090   0.103   0.153   0.096   0.130   0.135   0.128
Direct / Across-Sentence      tf       0.268   0.212   0.185   0.107   0.037   0.056   0.070   0.131   0.065   0.088   0.101   0.118
Direct / Across-Sentence      tf-idf   0.276   0.224   0.202   0.113   0.039   0.060   0.077   0.137   0.068   0.094   0.111   0.123
Nondirect / Within-Sentence   tf       0.240   0.199   0.173   0.098   0.049   0.074   0.039   0.142   0.081   0.108   0.117   0.116
Nondirect / Within-Sentence   tf-idf   0.260   0.218   0.181   0.100   0.053   0.0978  0.092   0.147   0.088   0.116   0.122   0.119
Nondirect / Across-Sentence   tf       0.242   0.205   0.181   0.106   0.048   0.071   0.087   0.149   0.080   0.106   0.117   0.124
Nondirect / Across-Sentence   tf-idf   0.260   0.224   0.193   0.112   0.052   0.076   0.091   0.153   0.086   0.113   0.124   0.129

The results show a small-to-moderate effect for the weighting scheme: tf-idf is superior to tf for both precision (0.002 – 0.020, or 2.3 to 9.3 percent) and recall (0.002 – 0.007, or 2.6 to 10.4 percent) measures (and thus for the F measure, 0.003 – 0.010, or 2.6 to 9.8 percent). This result is in line with previous studies, which demonstrated that tf-idf is not as effective for collocations as it is for tokens. The results for the mutual information adaptation of tf-idf were indistinguishable from the results for standard tf-idf, and thus are excluded from Table 2. Thus, despite the theoretical grounding for the mutual information scheme, it does not seem to improve on standard tf-idf in practice. Figure 2 demonstrates the effect of weighting on F[10].

No clear effect is visible for collocation distance (for either precision or recall), and within-sentence and across-sentence schemes perform similarly. While current collocation-extraction practices commonly restrict collocation extraction to within-sentence word combinations, our findings indicate that across-sentence collocations are equally important, thus suggesting that both types of collocations should be incorporated into an effective collocation indexing scheme.

Collocation directionality has a positive impact on precision (directional is up to 0.033, or 14.8 percent better than nondirectional), thus supporting findings in related areas. Directionality seems to have a mixed effect on recall and the F measure. These findings are consistent across tf and tf-idf weighting schemes.

Although distance and directionality alone do not demonstrate large effects, the interaction effect between distance and directionality does provide important insights. For within-sentence, directional collocations yield higher precision (0.006 – 0.033, or 6.7 to 13.4 percent) and recall (0.005 – 0.011, or 9.0 to 19.6 percent) than nondirectional collocations. For across-sentence, precision is still higher with directional collocations (although the effect size is smaller than for within-sentence; only up to 0.026, or 10.8 percent), while recall is substantially higher for nondirectional collocations (0.011 – 0.018, or 10.7 to 24.7 percent). These results are consistent across both weighting schemes. The graphs in Figure 3 illustrate the interaction effect of distance and directionality on P[10] and R[10], when using tf-idf weighting (the interaction effects are not statistically significant).

Figure 2. F[10]: tf Versus tf-idf for Different Combinations of Collocation Directionality and Distance
(Bar chart of F[10] under tf and tf-idf for the four directionality/distance combinations.)

Figure 3. Distance and Directionality Interaction Effect on Precision[10] and Recall[10] Using tf-idf Weighting
(Two panels, P[10] and R[10]; x-axis: within sentence versus across sentence; series: directional, nondirectional.)

Experiment #2: An In-Depth Exploration of Collocation Distance

In the second experiment, we explored the effect of distance in more detail in an effort to improve performance. We initially explored the effect of distance within structural elements. Previous studies treated all collocations appearing within the same structural element (whether sentence or paragraph) similarly, and we wanted to test whether collocations binding proximate terms are more important than distant collocations (all within the same element, for example, a sentence).

Distance Effects for Within-Sentence Collocations

For within-sentence, we generated several alternative collocation indexes, one for each distance, and then compared their performance (see Figure 4). For example, in one scheme we extracted only collocations of adjacent terms, in the next only collocations at a one-token difference, and so on, up to a five-token distance. The results clearly show that collocations combining adjacent terms (i.e., phrases) are the most precise (a similar effect was observed for recall and for the F measure), and that the relative effectiveness decreases as distance increases. These findings were consistent across both weighting schemes. A combination of these collocations into an integrative within-sentence scheme, using the industry-standard procedure of treating all collocations equally (i.e., the within-sentence scheme we explored in Experiment #1), was substantially more effective than any one of the distinct schemes.16 Nevertheless, our findings show that shorter-distance collocations yield better performance, and thus suggest that weighting all collocations equally might not be optimal. The relative weighting of collocations at different distances could be obtained by assigning the highest weight to the collocations that yielded the highest precision in our experiment (collocations of adjacent terms), and low weight to collocations that resulted in lower precision (collocations combining terms at two-, three-, four-, and five-term distances). We explored several alternative combinations of within-sentence collocations (i.e., polynomial, logarithmic, and exponential), but failed to obtain gains beyond the standard equal-weight scheme. This suggests that the problem is not simple, and thus warrants further research.

16 The integrative scheme yielded significant gains over the adjacent collocations scheme (i.e., zero token difference): Precision[10] and Precision[30] gains of 21 percent (p < 0.05), and Precision[15] and Precision[20] gains of 34 percent (p < 0.01).

Figure 4. The Effect of Distance Within a Sentence on Precision (Using tf Weighting)
(x-axis: distance between collocation terms, 0 to 5 terms; y-axis: precision; series: Precision[10], Precision[20], Precision[30].)

Distance Effects for Across-Sentence Collocations

We compared the performance of collocations extracted at adjacent sentences, one sentence difference, and so on, up to a five-sentence difference (similarly to our analysis of within-sentence collocations). Figure 5 illustrates the effects on precision (similar effects were observed for recall and for the F measure, and these results were consistent across weighting schemes). The results show that distance does not play a large role across sentence boundaries, and, contrary to our expectations, the number of sentences separating collocation terms appears to have very little effect on retrieval performance. These findings suggest that a collocation scheme that combines all across-sentence collocations should assign even weights to collocations at all distances.
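The idea of weighting collocation occurrences by the distance between their terms can be sketched as follows. The text names only the families of combinations that were tried (polynomial, logarithmic, and exponential) and gives no formulas, so every functional form, the `rate` parameter, and the function names below are assumptions chosen purely for illustration.

```python
import math


def distance_weight(distance, scheme="equal", rate=0.5):
    """Weight of one collocation occurrence whose terms are `distance`
    tokens apart (0 means adjacent terms).

    The concrete formulas are illustrative assumptions; the study reports
    only that polynomial, logarithmic, and exponential schemes were explored.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    if scheme == "equal":
        return 1.0  # the standard scheme: all distances count the same
    if scheme == "polynomial":
        return 1.0 / (1 + distance) ** 2
    if scheme == "logarithmic":
        return 1.0 / (1.0 + math.log2(1 + distance))
    if scheme == "exponential":
        return math.exp(-rate * distance)
    raise ValueError(f"unknown scheme: {scheme}")


def weighted_frequency(distances, scheme="equal"):
    """Distance-weighted frequency of one collocation in one document."""
    return sum(distance_weight(d, scheme) for d in distances)


# One collocation seen three times: twice adjacent, once three tokens apart.
occurrences = [0, 0, 3]
equal_tf = weighted_frequency(occurrences, "equal")
decayed_tf = weighted_frequency(occurrences, "exponential")
```

Under the equal scheme the weighted frequency reduces to the plain occurrence count used in Experiment #1, which is also the scheme the results favor for across-sentence collocations.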

Combining Within and Across-Sentence Collocation Indexing Schemes

Findings from Experiment #1 suggest that both within- and across-sentence collocations are essential for collocation indexing, and our analysis suggests that, in both schemes, all collocations within a structural element should be weighted equally. Thus we proceeded to combine the two schemes we studied in Experiment #1: within-sentence (restricting collocations to a five-term window; all collocations weighted equally) and across-sentence (restricting collocations to a five-sentence window; all collocations weighted equally). Since, in general, directional collocations proved superior to nondirectional, we performed this analysis using directional collocations. We explored various linear combinations of the two directional schemes,17 and obtained maximum effectiveness with query-document distance D = 0.7 × D_within-sentence + 0.3 × D_across-sentence for tf weighting, and D = 0.6 × D_within-sentence + 0.4 × D_across-sentence for tf-idf. Table 3 describes the performance for the combined scheme, and the gains it yields over each of the individual schemes.18 The combined scheme is superior to any of the individual schemes on most performance measures, and the gains are consistent for both tf and tf-idf weighting schemes (effect sizes, in percentages, are presented in Table 3). The combinations yield some precision gains (up to 0.027). Recall of the

17 We employ the common approach for combining schemes (Mitra et al. 1997; Strzalkowski et al. 1996). We initially preserved separate indexes for both schemes, performed query-document matching separately for each, and then combined the matching results into one query-document similarity value.

18 We employed the 88 queries that, for these schemes, included at least one indexing unit (excluding queries 361, 362, 365, 369, 393, 397, 405, 417, 423, 425, 427, and 439).

Figure 5. The Effect of Distance for Across-Sentence Collocations (Using tf Weighting)
(x-axis: sentence difference, from within-sentence through a five-sentence difference; y-axis: precision; series: Precision[10], Precision[20], Precision[30].)

tfidf

0.095

0.149

–0.1%

0.8%

5.2%

F100

0.081

F30

0.053

F20

Recall10

0.107

F10

Precision100

0.190

Recall100

Precision30

0.231

Recall20

Precision20

0.277

Recall30

tf

Scheme within- and acrosssentence

Precision10

Weight

Table 3. Combining Within- and Across-Sentence Collocation Indexing Schemes (Using Directional Collocations)

0.089

0.120

0.127

0.125

2.2%

versus within-sentence

6.6%

6.8%

4.0%

7.6%

2.8%

3.4%

1.7%

versus across-sentence

8.4%

14.0%

7.5%

5.5%

50.3%* 52.7%** 42.6%* 19.1%

43.5%

42.7%* 30.9%

11.2%

within- and acrosssentence

0.303

0.247

0.208

0.112

0.057

0.085

0.099

0.155

0.096

0.127

0.134

0.130

versus within-sentence

10.3%

8.2%

10.7%

6.6%

2.9%

–0.2%

0.9%

5.8%

4.1%

1.9%

4.1%

6.3%

versus across-sentence

15.1%

15.1%

7.9%

4.2%

53.4%* 50.5%** 35.3%* 18.4%

47.3%* 41.4%* 26.4%

6.6%

10.2%

Asterisks indicate statistical significance of the results (using a one-sided T test): *p < 0.1, **p < 0.05.

combined scheme is slightly better than that of within-sentence collocations (up to 0.002 higher), and substantially better (0.025) than that of across-sentence collocations (the gains for Recall[10], Recall[20], and Recall[30] are statistically significant). The F measure is somewhat better than that of within-sentence (up to 0.002 gains), and substantially better (0.024 – 0.033) than that of across-sentence (statistically significant for F[20] with tf, and for F[10] and F[20] with tf-idf). The results for Precision[10] and Recall[10] (using tf-idf weighting) are illustrated in Figure 6.
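The combination step can be sketched as a weighted sum of the two separately computed query-document scores, using the shares reported for the combined scheme (0.7/0.3 for tf and 0.6/0.4 for tf-idf). The function names, the dictionary layout, and the toy scores are assumptions of this sketch.

```python
# Share given to the within-sentence score, as reported for each weighting.
WITHIN_SHARE = {"tf": 0.7, "tf-idf": 0.6}


def combine_scores(within_score, across_score, within_share):
    """Linear combination of two query-document scores."""
    if not 0.0 <= within_share <= 1.0:
        raise ValueError("within_share must lie in [0, 1]")
    return within_share * within_score + (1.0 - within_share) * across_score


def rank_documents(within_scores, across_scores, weighting):
    """Rank documents by the combined score, best first.

    Each argument maps a document id to the score obtained by matching the
    query against one index; a missing document counts as a score of 0.
    """
    share = WITHIN_SHARE[weighting]
    doc_ids = set(within_scores) | set(across_scores)
    combined = {
        doc_id: combine_scores(
            within_scores.get(doc_id, 0.0), across_scores.get(doc_id, 0.0), share
        )
        for doc_id in doc_ids
    }
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical scores for three documents.
within = {"d1": 0.5, "d2": 0.1}
across = {"d2": 0.9, "d3": 0.2}
ranking_tf = rank_documents(within, across, "tf")
ranking_tfidf = rank_documents(within, across, "tf-idf")
```

Shifting the share from 0.7 to 0.6 is enough to change the top-ranked document in this toy case, which shows why the share has to be tuned separately for each weighting scheme.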

Figure 6. Precision[10] and Recall[10]: Combining Within- and Across-Sentence Collocations
(Two panels comparing the within-sentence, across-sentence, and combined schemes.)

Experiment #3: Token and Collocations

In the third experiment, we compared collocation-based indexing (using the best collocation scheme identified in Experiment #2) to the traditional token-based indexing. We initially compared the two distinct indexing schemes, tokens and collocations, and then integrated the two to obtain an effective combined scheme.

Comparing Collocations Against Tokens

Table 4 compares the best collocations scheme against the traditional token scheme, for both tf and tf-idf weighting. Overall, the collocations scheme alone is superior in precision to tokens, while the effect on recall depends on the weighting scheme. In line with the findings from previous studies, tf-idf works well for tokens (0.031 – 0.045, or 18 to 32 percent, precision gains; and 0.017 – 0.059, or 27 to 40 percent, recall gains), and to a lesser extent for collocations (0.005 – 0.026, or 4 to 9 percent, precision gains; and 0.004 – 0.006, or 4 to 7 percent, recall gains). When using simple tf weighting, collocations outperform tokens: precision (up to 0.051, or 25 percent, and statistically significant19), recall (up to 0.007, or 10 percent), and F (up to 0.014, or 14 percent). When using tf-idf, the precision advantage of collocations over tokens is diminished (up to 0.036, or 14 percent), and recall levels are lower (0.009 – 0.053, or 9 to 26 percent) than those of tokens. Figure 7 illustrates the effect of weighting on the F measure.

The relative performance of collocations depends on the length of the result list we employ. The performance of collocations is better at the top rankings, and this effect (using tf-idf weighting) is illustrated in Figure 8. Performance at the top of the ranked list is of key importance, since searchers are usually interested in exploring only the top-ranked documents.

In order to gain additional insights into the factors driving performance, we analyzed all queries in detail. We found that the number of terms in a query has a substantial effect on performance. For collocations, query length has a large positive effect on precision (long queries are roughly 30 percent more precise), and a small negative effect on recall (roughly 7 percent). For tokens, on the other hand, we observe a reverse effect: long queries yield substantially higher recall (roughly 30 percent), and no clear gains for precision. This interesting effect has not been reported in previous studies; it has no simple intuitive explanation, and thus remains an issue for future research.

Table 4. Comparing Collocations (Using the Combined Within- and Across-Sentence Scheme) to Tokens for tf and tf-idf Weighting

(P = Precision, R = Recall, F = F measure; the number is the count of top ranked documents.)

Weight   Scheme         P10     P20     P30     P100    R10     R20     R30     R100    F10     F20     F30     F100
tf       Tokens         0.226   0.185   0.159   0.098   0.049   0.074   0.089   0.149   0.080   0.106   0.114   0.118
tf       Collocations   0.277   0.231   0.190   0.107   0.053   0.081   0.095   0.149   0.089   0.120   0.127   0.125
tf-idf   Tokens         0.267   0.230   0.202   0.129   0.066   0.094   0.121   0.208   0.106   0.133   0.151   0.160
tf-idf   Collocations   0.303   0.247   0.208   0.112   0.057   0.085   0.099   0.155   0.096   0.127   0.134   0.130

19 Precision[10] and Precision[20] for collocations are (respectively) 0.051 (or 23 percent) and 0.046 (or 25 percent) better than tokens, and the results are statistically significant (p < 0.1).

Figure 7. Comparing Tokens to Collocations and the Effect of Weighting Scheme on F Measure
(y-axis: F measure at F10, F20, F30, and F100; series: tokens and collocations under tf and tf-idf.)

Figure 8. Tokens Versus Collocations: Precision for Top Ranked Documents and Recall Precision Curve Using tf-idf
(Two panels: precision at P10, P20, P30, and P100; precision against recall; series: tokens, collocations.)

Using Collocations to Enhance Token-Based Indexing

Since in practice collocations are used to complement tokens, rather than to replace token indexing, we proceeded to combine the tokens and collocations schemes,20 exploring various linear combinations, and reaching an optimum with a 60:40 collocations/tokens ratio (for both tf and tf-idf weighting). The effect for F[10] is illustrated in Figure 9. Table 5 presents the results for the optimal collocations/tokens combination, when compared to the distinct collocations and tokens schemes. The combined collocations/tokens scheme is superior to the distinct schemes, for both tf and tf-idf weighting (effect sizes, in percentages, are presented in Table 5). With tf weighting, the combination yields moderate gains over collocations (precision: 0.016 – 0.025; recall: 0.008 – 0.025), and substantial gains over tokens (precision: 0.025 – 0.073, and statistically significant; recall: 0.012 – 0.025). With tf-idf, the combination still yields considerable gains over both tokens (precision: 0.023 – 0.084, and statistically significant; recall: 0.004 – 0.012) and collocations (precision: 0.040 – 0.048; recall: 0.013 – 0.064). Figure 10 illustrates the gains for the combined scheme (using tf-idf weighting).

20 For combining tokens and collocations we employed the same methodology we used for combining the two collocations schemes, that is, combining query-document similarity scores.

Figure 10 and Table 5 demonstrate that collocations can enhance the precision of the traditional token-based indexing substantially. Using the industry-standard tf-idf weighting, Precision[10] for the combined scheme is 0.084 (or 31.5 percent) higher than that of the baseline tokens scheme, whereas Precision[20] and Precision[30] gains are 0.062 (or 27 percent) and 0.051 (or 25 percent) respectively (and all these effects are statistically significant). These gains are substantially higher than those reported in the literature, and to the best of our knowledge, no similar gains for collocations were obtained over a large-scale domain-independent test bed.
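The relationship between the absolute and relative gains quoted above can be checked with a few lines of arithmetic. The sketch below recomputes the tf-idf precision gains of the combined scheme over tokens from the three-decimal values reported in Tables 4 and 5; the function and variable names are assumptions.

```python
def relative_gain(new_value, baseline):
    """Gain of new_value over baseline, in percent of the baseline."""
    return (new_value - baseline) / baseline * 100.0


# tf-idf precision values as printed in Table 4 (tokens) and Table 5 (combined).
tokens = {"P10": 0.267, "P20": 0.230, "P30": 0.202}
combined = {"P10": 0.351, "P20": 0.292, "P30": 0.253}

absolute_gains = {k: round(combined[k] - tokens[k], 3) for k in tokens}
percent_gains = {k: round(relative_gain(combined[k], tokens[k]), 1) for k in tokens}
```

The absolute differences come out as 0.084, 0.062, and 0.051, matching the text. The recomputed percentages agree with the reported ones to within about half a percentage point, as expected when the inputs are rounded to three decimals.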


Figure 9. Combining Collocations and Tokens with Various Linear Combinations, the Impact on F[10]
(x-axis: collocations/tokens ratio, from collocations only through 9/1, 8/2, …, 1/9 to tokens only; y-axis: F[10]; series: tf, tf-idf.)

Table 5. Combining Collocations and Tokens (Using the 60:40 Collocations/Tokens Combination)

(P = Precision, R = Recall, F = F measure; the number is the count of top ranked documents.)

Weight   Scheme                            P10       P20       P30       P100      R10     R20     R30      R100      F10     F20      F30      F100
tf       Collocations/Tokens Combination   0.299     0.247     0.215     0.123     0.061   0.090   0.109    0.174     0.101   0.132    0.144    0.144
tf       Combination versus Tokens         32.2%**   33.5%**   35.3%**   26.3%**   24.8%   21.5%   22.5%    16.9%     26.0%   24.7%    26.8%*   22.4%*
tf       Combination versus Collocations   7.8%      6.9%      12.9%     15.0%     13.8%   10.6%   14.6%    16.5%     12.7%   9.6%     14.0%    15.7%
tf-idf   Collocations/Tokens Combination   0.351     0.292     0.253     0.152     0.070   0.106   0.132    0.219     0.117   0.156    0.174    0.180
tf-idf   Combination versus Tokens         31.5%**   27.2%**   25.6%**   17.6%     5.4%    13.8%   9.5%     5.3%      9.8%    17.4%    15.0%    12.5%
tf-idf   Combination versus Collocations   15.7%     18.4%     21.9%*    35.8%**   23.0%   24.5%   33.4%*   41.7%**   21.8%   22.9%*   29.4%*   38.2%***

Asterisks indicate statistical significance of the results (using a one-sided T test): *p < 0.1; **p < 0.05; ***p < 0.01

Figure 10. Effectiveness of the Combined Tokens/Collocations
(x-axis: P10, P20, P30, P100; y-axis: precision, tf-idf weighting; series: tokens, collocations, collocations and tokens.)

Discussion Collocation and IR Effectiveness Although theory in linguistics suggests that collocations are essential for capturing meaning, disambiguating terms, and enhancing IR precision, to date there has been no empirical evidence in the literature demonstrating that collocations can enhance retrieval substantially. This study investigated whether and how statistical collocation indexing can benefit IR in terms of enhancing users’ ability to access textual information from large and heterogeneous collections. Linguistic theory provides little guidance on operationalizations of collocations and on ways to extract collocations from text effectively. Accordingly, we investigated three key collocations indexing parameters, and tested their impact on IR effectiveness. Two parameters, collocations directionality and distance, are applicable for any retrieval model, and thus our findings related to these parameters could be generalized to alternative IR models. The third parameter, the weighting scheme, is model specific, and thus findings related to this parameter are restricted to the vector space model. Collocations distance plays an important role in determining collocations performance. While the computational linguistics literature suggests that collocations extraction should be restricted to grammatically-bound elements (i.e., terms appearing in the same sentence; see Carmel et al. 2001; Maarek et al. 1991; Strzalkowski et al. 1996), the results from Experiment 1 indicate that collocations both within and across sentences are important for retrieval performance. These

findings present the first large-scale demonstration of the importance of across-sentence collocations for the vector– space model, support recent findings in related areas—Mittendorf et al. (2000) with the probabilistic model and Srikanth and Srihari (2002) with the set-based model—and suggest that collocation extraction should be performed both within and across sentence boundaries. While Mittendorf et al. report on difficulties in designing an effective combination of within and across sentence collocations, we found the combination moderately superior to any of the individual schemes on most performance measures (the combination yielded both precision (4 to 15 percent) and recall (6 to 53 percent) gains). Further explorations of collocation distance in Experiment 2 suggest that within a sentence, proximate collocations are more useful than distant ones, and a collocation indexing scheme should assign weights to collocations based on their distance, thus challenging the current practices of weighting all within-sentence collocations equally. Nevertheless, our initial exploration of such a weighting scheme did not yield performance gains. Across sentence boundaries, on the other hand, distance does not seem to impact IR effectiveness, thus a collocation indexing scheme should weight all acrosssentence collocations equally. The literature does not provide a clear answer on the effect of collocation directionality on IR performance. In Experiment 1, we found that directional collocations are superior (in both precision and recall), while for across sentence we observe a mixed effect: directional collocations provide small precision gains, while nondirectional yields substantially higher recall.




This finding could be explained by the fact that collocations extracted within a sentence usually capture lexico-syntactic relations (mostly noun/verb relations), and thus preserving collocation directionality is important. Across-sentence collocations, on the other hand, usually capture noun-noun relations (e.g., doctor/nurse), which are symmetric. Since the overall performance of directional collocations is superior to that of nondirectional collocations, we recommend that directional collocations be employed.

The weighting of indexing units in document profiles has been the subject of considerable research in IR (Baeza-Yates and Ribeiro-Neto 1999; Salton and McGill 1983) and, in general, advanced weighting schemes have yielded substantial performance gains. Consistent with findings from previous studies, we found that the effect of tf-idf weighting on collocations (4 to 9 percent precision and recall gains) is substantially lower than the effect observed for traditional token-based indexing (18 to 32 percent precision gains, and 27 to 40 percent recall gains). According to Strzalkowski et al. (1996), the cause of this effect is the term-independence assumption that is implicit in tf-idf weighting. Collocations capture dependencies between terms, and thus violate the term-independence assumption. Previous works have explored various adaptations of tf-idf to collocations with little success (e.g., Alvarez et al. 2004; Buckley et al. 1996; Mitra et al. 1997; Strzalkowski et al. 1996). We implemented an adaptation of tf-idf that is grounded in Shannon’s information theory and was proposed by Church and Hanks (1990), only to find that performance levels are indistinguishable from those of traditional tf-idf weighting. Recent advancements in work with language models (a newer retrieval model that seeks to replace the traditional vector space model) try to relax the term-independence assumption (Jiang et al. 2004; Miller et al. 1999; Srikanth and Srihari 2002), thus allowing for modeling of term dependencies, but these “attempts to integrate positional information into the language-modeling approach to IR have not shown consistent significant improvements” (Jiang et al. 2004, p. 1). Thus, to date, it seems that there is no effective method for incorporating term dependencies into existing retrieval models, and there is no collocation weighting scheme that is better than tf-idf.

Comparing the performance of collocations to that of tokens in Experiment 3 showed that collocations are more effective with the simple tf weighting (precision is 10 to 25 percent higher, and recall is up to 10 percent higher), while with tf-idf we observe a mixed effect: precision is higher for collocations at the top of the ranked list (14 percent for Precision[10]) and higher for tokens when using a long list (tokens 13 percent higher at Precision[100]), while recall is consistently higher for tokens (9 to 26 percent). Thus, collocations are a more effective indexing unit (as evident in the tf results), but due to the lack of an effective collocation weighting scheme, collocations and tokens perform similarly when using tf-idf.

In practice, collocations are utilized to enhance token indexing (e.g., Carmel et al. 2001; Strzalkowski et al. 1996), and we reached the optimal performance with a 60:40 collocations/tokens combination. This combined collocations/tokens method yielded substantial performance gains over the traditional token-based indexing, for both tf (precision: 26 to 35 percent and statistically significant; recall: 17 to 25 percent) and tf-idf (precision: 18 to 32 percent and statistically significant; recall: 5 to 14 percent) weighting. We conclude that, even with industry-standard weighting (i.e., tf-idf), collocations could be used to enhance precision substantially and recall moderately.

Our study provides the first strong evidence for the usefulness of statistical collocation indexing that is performed over a large and heterogeneous collection. We attribute our success to the settings of our collocation extraction parameters (i.e., distance and directionality), and specifically to the incorporation of across-sentence collocations. It is worth noting that the reported effectiveness levels are lower than those attained by state-of-the-art retrieval engines, as these retrieval engines incorporate many additional features (beyond collocation indexing).21
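One way to realize a collocations/tokens combination is to score each document separately on its collocation profile and on its token profile, and then blend the two scores. The sketch below assumes cosine similarity over sparse vectors and a linear 60:40 blend. Only the 60:40 ratio comes from the text above; the linear form, the function names, and the dictionary layout are assumptions made for illustration.

```python
import math


def cosine(query_vec, doc_vec):
    """Cosine similarity between two sparse vectors stored as dicts
    mapping an indexing unit (token or collocation) to its weight."""
    dot = sum(w * doc_vec.get(unit, 0.0) for unit, w in query_vec.items())
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0
    return dot / (q_norm * d_norm)


def combined_score(q_tokens, d_tokens, q_collocs, d_collocs, colloc_share=0.6):
    """Blend collocation-based and token-based similarity.

    colloc_share=0.6 mirrors the 60:40 collocations/tokens ratio; the
    linear combination itself is an assumption of this sketch.
    """
    colloc_sim = cosine(q_collocs, d_collocs)
    token_sim = cosine(q_tokens, d_tokens)
    return colloc_share * colloc_sim + (1.0 - colloc_share) * token_sim
```

With this form, a document that matches the query only on tokens can score at most 0.4, while a document that also matches on collocations can reach 1.0, so collocation matches dominate the ranking without discarding token evidence.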

Collocation and IR Efficiency

Hevner et al. (2004) suggest that design should consider cost-benefit tradeoffs. Because inefficient IR techniques will not scale up to large and heterogeneous collections, the efficiency of collocation indexing becomes critical. IR is performed in two steps, preprocessing and real-time processing, with different efficiency considerations for each. In preprocessing (i.e., collocation extraction), we are mainly concerned with scalability to large volumes of documents, and less concerned with the exact timing of these processes, as most of them are only performed once or at relatively long intervals.

21 For instance, the IBM Juru system, which competed at TREC-10 (Carmel et al. 2001), in addition to using collocations, allows users to define stopwords (excluded from indexing), as well as a set of proper names. Juru also processes information included in hyperlinks pointing to the document being indexed. In addition, Juru extracts information from web pages citing the page that is currently being indexed (and this feature alone has been shown to improve precision by 20 percent). Despite all these enhancements, the Precision[10] results attained by Juru are only 20 to 25 percent higher than the results reported here.


Table 6. Summarizing the Relative Effectiveness and Efficiency of Token and Collocation Indexing

| Tokens, Collocations, and Combinations | Effectiveness: Precision | Effectiveness: Recall | Efficiency†: Complexity | Efficiency†: Storage |
|---|---|---|---|---|
| Tokens | Low | Medium | Low, O(T) | Medium, 1 gigabyte |
| Collocations | Medium (up to 14% over tokens) | Low (9% to 26% under tokens) | Medium, O(T × (Log(T))²) | Low, 0.5 gigabyte |
| Tokens and Collocations | High (18% to 32% over tokens) | High (5% to 32% over tokens) | Medium, O(T × (Log(T))²) | High, 1.5 gigabyte |

† The complexity of the matching process is similar for both tokens and collocations, and is therefore omitted from the table.

The complexity of token indexing for one document is O(T), where T is the number of tokens in the document, while the complexity of collocation indexing is O(T × (Log(T))²).22 Although both are suitable for very large document collections, the collocation indexing complexity is somewhat higher.

Real-time processing occurs when a user submits a query, which makes computational efficiency critical so as not to keep the user waiting for the system’s response. In matching a query to documents, both the token-based and collocation-based models employ inverted matrixes (Salton and McGill 1983), where a query is matched only with documents containing at least one query indexing unit. Thus, the complexity of the matching process is similar for both tokens and collocations, and linear in the number of documents that contain query indexing units.

Storage requirements are also an important constraint. Collocation indexing has been shown to require less storage space than token indexing (Maarek et al. 1991), and in our study, collocation indexing required approximately 50 percent less storage space than token indexing. Of course, expanding the number of unique items (tokens or collocations) will change this ratio. Table 6 summarizes the relative efficiency and effectiveness for tokens and collocations in our experiments.
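As a rough illustration of the real-time matching step described in this section, the sketch below builds an inverted structure from document profiles and scores a query by visiting only those documents that share at least one indexing unit with it. The dictionary layout, the dot-product scoring, and the function names are assumptions; an indexing unit may be a token (a string) or a collocation (a pair of strings).

```python
from collections import defaultdict


def build_inverted_index(doc_profiles):
    """Invert {doc_id: {unit: weight}} into {unit: {doc_id: weight}}."""
    index = defaultdict(dict)
    for doc_id, profile in doc_profiles.items():
        for unit, weight in profile.items():
            index[unit][doc_id] = weight
    return index


def match(query_profile, index):
    """Score documents by dot product with the query.

    Only postings of units that occur in the query are visited, so a
    document sharing no unit with the query is never touched.
    """
    scores = defaultdict(float)
    for unit, q_weight in query_profile.items():
        for doc_id, d_weight in index.get(unit, {}).items():
            scores[doc_id] += q_weight * d_weight
    # Highest-scoring documents first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The same two functions serve tokens and collocations alike, which is consistent with the observation that matching cost depends on the number of documents containing query indexing units rather than on the type of unit.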

22 The reported complexity is based on the following extraction algorithm: (1) process the text file to register the location (sentence and word positions) of all relevant tokens (in our case, the 20 most frequent tokens), (2) for each relevant token instance, register co-occurrences with all other relevant tokens, and (3) for each collocation, sum up all co-occurrence instances. It should be noted that the aim of this study was not to develop the most efficient collocation extraction algorithm, and it is likely that more efficient algorithms could be designed.
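The three steps of this extraction algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name, the max_sentence_gap and directional parameters, and the input format (a list of tokenized sentences) are hypothetical, and the simple nested loop does not reproduce the reported O(T × (Log(T))²) complexity.

```python
from collections import Counter


def extract_collocations(sentences, relevant, max_sentence_gap=0, directional=True):
    """Count co-occurrences of relevant tokens.

    sentences:        list of sentences, each a list of tokens
    relevant:         set of tokens eligible for collocations
    max_sentence_gap: 0 keeps within-sentence pairs only; larger values
                      also admit pairs that span sentence boundaries
    directional:      if True, (a, b) and (b, a) are distinct collocations
    """
    # Step 1: register sentence and word positions of relevant tokens.
    positions = [
        (s_idx, w_idx, token)
        for s_idx, sentence in enumerate(sentences)
        for w_idx, token in enumerate(sentence)
        if token in relevant
    ]

    # Steps 2 and 3: pair each instance with the later instances that are
    # within range, and sum the co-occurrences per collocation.
    counts = Counter()
    for i, (s1, _, first) in enumerate(positions):
        for j in range(i + 1, len(positions)):
            s2, _, second = positions[j]
            if s2 - s1 > max_sentence_gap:
                break  # positions are in text order, so the rest are farther
            if first == second:
                continue  # a token does not form a collocation with itself
            pair = (first, second) if directional else tuple(sorted((first, second)))
            counts[pair] += 1
    return counts
```

Setting max_sentence_gap to 0 and directional to True corresponds to directional within-sentence collocations, while a positive gap with directional set to False admits nondirectional pairs that cross sentence boundaries.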

Common practice, as well as our findings, suggests that an index combining tokens and collocations is more effective than a pure token or pure collocation index. A disadvantage of the combined token-collocation indexing procedure is its lower efficiency; however, these efficiency differences are not crucial, and the combined approach could easily scale up to very large collections. Thus, combined token-collocation indexing is still recommended for large and heterogeneous collections.

Conclusion

Despite the understanding that collocations are essential for representing the meaning of natural language text, the effectiveness of collocation indexing in IR has remained in doubt. Prior studies were unable to provide solid evidence for the ability of collocations to provide substantial effectiveness gains in retrieval of textual information from large and heterogeneous document collections. This study makes four contributions to our understanding of the role collocations can play in general-settings IR. First, our investigation of the effect of distance clearly shows that (1) within a sentence, collocation distance is associated with effectiveness, thus suggesting that the common practice of assigning equal weight to all within-sentence collocations is not optimal, and (2) existing theory and practice restricting collocation extraction to elements co-occurring within the same sentence should be revisited, as across-sentence collocations provide additional effectiveness gains. Second, we demonstrate the interaction effect between distance and directionality, suggesting that directionality is effective within sentence boundaries, while across sentences nondirectional collocations are superior. Third, our exploration of weighting schemes for collocations and the disappointing results obtained with the mutual information tf-idf adaptation illustrate the challenge of modeling collocations within existing retrieval models. Finally, our results show that collocation indexing can substantially enhance IR precision, despite the inappropriateness of tf-idf weighting for collocations. These findings, in addition to contributing to the existing body of knowledge regarding collocations, have immediate implications for IR practitioners in providing guidelines for the design of effective collocation extraction schemes.

Although this study enhances our understanding of collocation indexing, several issues warrant further research.

• Future research should attempt to identify an effective weighting scheme for collocations.
• Future research should try to design a collocation extraction scheme that, within a sentence, assigns more importance to proximate collocations.
• To generalize the results, similar experiments should be repeated for additional large and heterogeneous text collections.
• In these experiments, we employed only one retrieval model, the vector space model. Although insights regarding collocation extraction parameters (i.e., distance and directionality) are model-independent, similar experiments should be repeated with alternative retrieval models in order to provide stronger evidence for the value of collocations in IR.
• To verify the usefulness of collocations in real-life settings, it is important that collocation indexing models be studied in the context of specific applications, particular domains, and real business tasks.

The amount of knowledge encoded in electronic text far surpasses that available in data alone. However, the ability to take advantage of this wealth of knowledge is just beginning to meet the challenge. Businesses that can take advantage of this potential will surely gain a competitive advantage through better decision-making and increased efficiencies (Spangler et al. 2003, p. 192).

To date, organizations trying to process textual information into meaningful knowledge in a general setting have faced a difficult choice: they could either (1) employ very simplistic processing methods that are scalable, yet error-prone and ambiguous, or (2) use complex, domain-specific analysis, which is not scalable to large collections and not portable across domains. This research provides important implications for practitioners, demonstrating that a third approach that balances the earlier two is now available: employing scalable and domain-independent statistical NLP methods, which address the problem of word ambiguity and enable users to access textual information effectively.

Despite the potential role of document management systems in enhancing organizational competitiveness, the design of such systems has been under-explored in the management information systems literature. Sprague (1995, p. 30) has argued that “harnessing information technology to manage documents is one of the most important challenges facing IS managers,” and encouraged a shift in the focus of MIS research from database to information retrieval systems. We support Sprague’s appeal and call for increased attention to the design of IR systems in MIS research.

Acknowledgments

We thank Reginald Oake, Edmund Szeto, and Andy Leung for their work in implementation and testing. We are thankful to David Patient for providing feedback on earlier drafts of the paper. This research was funded by a grant from the Natural Sciences and Engineering Research Council of Canada.

References

Alvarez, C., Langlais, P., and Nie, J. Y. “Word Pairs in Language Modeling for Information Retrieval,” in Proceedings of the 7th Conference on Computer Assisted Information Retrieval (RIAO), Avignon, France, April 26-28, 2004, pp. 686-705.
Baeza-Yates, R., and Ribeiro-Neto, B. Modern Information Retrieval, ACM Press, New York, 1999.
Buckley, C., Singhal, A., Mitra, M., and Salton, G. “New Retrieval Approaches Using SMART: TREC 4,” in Proceedings of the Fourth Text Retrieval Conference (TREC-4), D. K. Harman (ed.), NIST Special Publication 500-236, Gaithersburg, MD, November 1-3, 1996 (http://trec.nist.gov/pubs/trec4/t4_proceedings.html).
Carmel, D., Amitay, E., Herscovici, M., Maarek, Y., Petruschka, Y., and Soffer, A. “Juru at TREC 10: Experiments with Index Pruning,” in Proceedings of the 10th Text Retrieval Conference, E. M. Voorhees and D. K. Harman (eds.), NIST Special Publication 500-250, Gaithersburg, MD, November 13-16, 2001 (http://trec.nist.gov/pubs/trec10/t10_proceedings.html).
Chen, H. Knowledge Management Systems: A Text-Mining Perspective, University of Arizona, Tucson, AZ, 2001 (http://dlist.sir.arizona.edu/483/01/chenKMSi.pdf).


Church, K. W., and Hanks, P. “Word Association Norms, Mutual Information, and Lexicography,” Computational Linguistics (16), 1990, pp. 22-29.
Croft, W. B., Turtle, H. R., and Lewis, D. D. “The Use of Phrases and Structured Queries in Information Retrieval,” in Proceedings of the 14th Annual Conference on Research and Development in Information Retrieval (ACM SIGIR), ACM Press, New York, 1991, pp. 32-45.
Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. “Indexing by Latent Semantic Analysis,” Journal of the American Society for Information Science (41:6), 1990, pp. 391-407.
Fagan, J. L. “The Effectiveness of a Nonsyntactic Approach to Automatic Phrase Indexing for Document Retrieval,” Journal of the American Society for Information Science (40:2), 1989, pp. 115-132.
Fagan, J. L. Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison of Syntactic and Nonsyntactic Methods, unpublished Ph.D. thesis, Department of Computer Science, Cornell University, Ithaca, NY, 1987.
Firth, J. “A Synopsis of Linguistic Theory 1930-1955,” in Studies in Linguistic Analysis, Philological Society, Oxford, UK, 1957, pp. 1-32; reprinted in Selected Papers of J. R. Firth 1952-1959, F. Palmer (ed.), Longman, London, 1968.
Hevner, A., March, S., Park, J., and Ram, S. “Design Science in Information Systems Research,” MIS Quarterly (28:1), 2004, pp. 75-105.
Jain, H., Ramamurthy, K., Ryu, H., and Yasai-Ardekani, M. “Success of Data Resource Management in Distributed Environments: An Empirical Investigation,” MIS Quarterly (22:1), March 1998, pp. 77-29.
Jiang, M., Jensen, E., Beitzel, S., and Argamon, A. “Effective Use of Phrases in Language Modeling to Improve Information Retrieval,” paper presented at the Eighth Symposium on AI & Math Special Session on Intelligent Text Processing, Fort Lauderdale, FL, January 4-6, 2004.
Khoo, C., Myaeng, S., and Oddy, R. “Using Cause-Effect Relations in Text to Improve Information Retrieval Precision,” Information Processing and Management (37), 2001, pp. 119-145.
Lesk, M. E. “Word-Word Associations in Document Retrieval Systems,” American Documentation (20:1), 1969, pp. 27-38.
Lewis, D., and Spärck-Jones, K. “Natural Language Processing for Information Retrieval,” Communications of the ACM (39:1), 1996, pp. 92-101.
Luhn, H. P. “A Business Intelligence System,” IBM Journal of Research and Development (2:4), 1958, pp. 314-319.
Maarek, Y., Berry, D., and Kaiser, G. “An Information Retrieval Approach for Automatically Constructing Software Libraries,” IEEE Transactions on Software Engineering (17:8), 1991, pp. 800-813.
Manning, C., and Schutze, H. Foundations of Statistical Natural Language Processing (6th ed.), MIT Press, Cambridge, MA, 2003.
Martin, W. J. R., Al, B. P. F., and van Strenkenburg, P. J. G. “On the Processing of Text Corpus: From Textual Data to Lexicographical Information,” in Lexicography: Principles and Practice, R. R. K. Hartmann (ed.), Academic Press, London, 1983, pp. 77-87.
Miller, D. R., Leek, T., and Schwartz, R. M. “A Hidden Markov Model Information Retrieval System,” in Proceedings of the 22nd ACM Conference on Research and Development in Information Retrieval (SIGIR’99), ACM Press, New York, 1999, pp. 214-221.
Miller, G. A. “WordNet: A Lexical Database,” Communications of the ACM (38:11), 1995, pp. 39-41.
Mitra, M., Buckley, C., Singhal, A., and Cardie, C. “An Analysis of Statistical and Syntactic Phrases,” in Proceedings of the Fifth Conference on Computer Assisted Information Retrieval (RIAO), Montreal, Canada, June 25-27, 1997, pp. 200-214.
Mittendorf, E., Mateev, B., and Schauble, P. “Using the Co-occurrence of Words for Retrieval Weighting,” Information Retrieval (3:3), 2000, pp. 243-251.
Morita, K., Atlam, E. S., Fuketra, M., Tsuda, K., Oono, M., and Aoe, J. “Word Classification and Hierarchy Using Co-occurrence Word Information,” Information Processing and Management (40:6), 2004, pp. 957-972.
Ponte, J. M., and Croft, W. B. “A Language Modeling Approach to Information Retrieval,” in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 1998, pp. 275-281.
Porter, M. F. “An Algorithm for Suffix Stripping,” Program (14:3), 1980, pp. 130-137.
Robertson, S. E., and Spärck-Jones, K. “Relevance Weighting of Search Terms,” Journal of the American Society for Information Science (27:3), 1976, pp. 129-146.
Salton, G., and Buckley, C. “Automatic Text Structuring and Retrieval: Experiments in Automatic Encyclopedia Searching,” in Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Chicago, October 13-16, 1991, pp. 21-30.
Salton, G., and Buckley, C. “Term Weighting Approaches in Automatic Text Retrieval,” Information Processing and Management (24:5), 1988, pp. 513-523.
Salton, G., and McGill, M. J. Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
Salton, G., Wong, A., and Yang, C. S. “A Vector Space Model for Automatic Indexing,” Communications of the ACM (18:11), 1975, pp. 613-620.
Spangler, S., Kreulen, J. T., and Lessler, J. “Generating and Browsing Multiple Taxonomies Over a Document Collection,” Journal of Management Information Systems (19:4), Spring 2003, pp. 191-212.
Spärck-Jones, K. S. “Summary Performance Comparisons TREC-2 Through TREC-8,” in Proceedings of the 8th Text Retrieval Conference, E. M. Voorhees and D. K. Harman (eds.), NIST Special Publication 500-246, Gaithersburg, MD, November 17-19, 1999 (http://trec.nist.gov/pubs/trec8/t8_proceedings.html).
Sprague, R. H. “Electronic Document Management: Challenges and Opportunities for Information Systems Managers,” MIS Quarterly (19:1), March 1995, pp. 29-49.




Srikanth, M., and Srihari, R. “Biterm Language Models for Document Retrieval,” in Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, 2002, pp. 425-426.
Stokoe, C., Oakes, M. P., and Tait, J. “Word Sense Disambiguation in Information Retrieval Revisited,” in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, 2003, pp. 159-166.
Strzalkowski, T., Guthrie, L., Karlgren, J., Leistensnider, J., Lin, F., Perez-Carballo, J., Straszheim, T., Wang, J., and Wilding, J. “Natural Language Information Retrieval: TREC-5 Report,” in Proceedings of the Fifth Text Retrieval Conference, E. M. Voorhees and D. K. Harman (eds.), NIST Special Publication 500-238, Gaithersburg, MD, November 20-22, 1996 (http://trec.nist.gov/pubs/trec5/t5_proceedings.html).
Van Rijsbergen, C. J. “A Theoretical Basis for the Use of Co-occurrence Data in Retrieval,” Journal of Documentation (33), 1977, pp. 106-119.



About the Authors

Ofer Arazy is an assistant professor at the University of Alberta. He holds a B.Sc. (in Industrial Engineering) and an MBA, and worked for seven years in IT project management. Ofer earned his Ph.D. at the University of British Columbia in 2004. His research approach is grounded in Design Science, and his thesis investigated design principles for information retrieval systems. Additional areas of interest are multi-agent systems, recommendation systems, and social informatics.

Carson Woo is an associate professor and Stanley Kwok Professor of Business at the University of British Columbia. He earned his Ph.D. in Computer Science from the University of Toronto in 1988. Carson served as president of the Workshop on Information Technology and Systems (WITS) Inc. from 2004 to 2006 and on the editorial boards of several journals. His main research interest is in providing methods and tools for supporting the change and evolution of information systems from the business and organizational perspective.