Intelligent Techniques for Effective Information Retrieval (A Conceptual Graph Based Approach)

A Thesis Submitted to the

University of Allahabad for the Degree of

Doctor of Philosophy in

Science

By

Tanveer J. Siddiqui

J K Institute of Applied Physics & Technology
Department of Electronics & Communication
University of Allahabad
Allahabad-211002

2005

Certificate

Certified that the thesis entitled "Intelligent Techniques for Effective Information Retrieval (A Conceptual Graph Based Approach)" submitted for the degree of Doctor of Philosophy in Science by Ms. Tanveer J. Siddiqui has been carried out under my supervision and that this work has not been submitted elsewhere for a degree.

(Uma Shanker Tiwary) Associate Professor Indian Institute of Information Technology Allahabad


Acknowledgements

This thesis would not have been possible without the aid and collaboration of many people whom I wish to thank and remember.

My supervisor Dr. Uma Shanker Tiwary, the person who has had the most direct impact on shaping the research in this thesis, for his advice and support when I needed it. He allowed me to pursue the research area I was interested in but still took keen interest in helping me improve my work. For that, I express my deep sense of gratitude to him; I count his guidance as a distinct privilege.

Stephen E. Robertson, for showing keen interest in answering my queries in spite of a very busy schedule. He not only answered my queries but also offered valuable suggestions. I thank him with my deepest sense of gratitude.

Manoj K. Singh, for helping me in the tough task of literature gathering.

Thorsten Brants, for providing an opportunity to use the TnT part-of-speech tagger.

Judith P. Dick, Michael Buckland and A. Gelbukh, for extending their support in achieving technical depth in the subject.

Mandar Mitra, for the moments he spent in clearing my doubts.

Padmini Srinivasan, for the time she spent working for me.

My brothers and sister, who helped me in bearing the strain of this whole exercise. A special word of thanks to my elder brother for his constant inspiration.

My parents, especially my mother, for giving me the support and freedom to pursue my own goals. I pay my deepest regards and indebtedness to them.

Finally, I would like to thank all my colleagues in the department for their kind cooperation and encouragement.

(Tanveer J. Siddiqui)

Table of Contents

Acknowledgments
List of Figures
List of Tables

1. Introduction
   1.1. Introduction
   1.2. Major Issues in Information Retrieval
   1.3. Statement and Scope of the Work
   1.4. Literature Review
      1.4.1. Statistical Techniques in IR
      1.4.2. Intelligent Techniques in IR
         1.4.2.1. Conceptual Graph Based Methods
         1.4.2.2. Query Expansion and Refinement
         1.4.2.3. Latent Semantic Indexing
         1.4.2.4. Genetic and Connectionist Approaches
         1.4.2.5. Relation Matching
      1.4.3. Natural Language Processing
         1.4.3.1. Phrase Extraction
         1.4.3.2. Word Sense Disambiguation
         1.4.3.3. Knowledge Bases
   1.5. Organization of the Thesis

2. Information Retrieval and Conceptual Graphs: Basic Concepts
   2.1. Introduction
   2.2. Information Retrieval Models
      2.2.1. Boolean Model
      2.2.2. Probabilistic Model
      2.2.3. Vector Space Model
      2.2.4. Semantic Based Models
   2.3. Conceptual Graphs
      2.3.1. Definition and Notation
      2.3.2. Conceptual Graph Representation
      2.3.3. Conceptual Graph Operations
   2.4. Evaluation of IR Systems
      2.4.1. Effectiveness Measures
         2.4.1.1. Precision and Recall
         2.4.1.2. Other Effectiveness Measures
      2.4.2. User-Centered Evaluation

3. Basic Framework: Document and Query Representation, Matching and Similarity Measures
   3.1. Introduction
   3.2. Vector Space Model
      3.2.1. Indexing
         3.2.1.1. Stemming
         3.2.1.2. Stop Word Elimination
         3.2.1.3. Term Weighting
            3.2.1.3.1. Term Frequency
            3.2.1.3.2. Inverse Document Frequency
            3.2.1.3.3. Document Length
            3.2.1.3.4. Combining the Three Factors
      3.2.2. Indexing Algorithm
      3.2.3. Similarity Measures
   3.3. Conceptual Graphs in IR
      3.3.1. Conceptual Graph as a Tool for Meaning Representation
      3.3.2. Conceptual Graph for NL Document Representation
      3.3.3. Representing Documents as a Set of CGs
         3.3.3.1. An Alternative to Concept Hierarchy Derivation
         3.3.3.2. Identifying Relations
         3.3.3.3. The Algorithm
         3.3.3.4. Example
      3.3.4. Conceptual Graph Matching Algorithm

4. Techniques for Improving Effectiveness in IR
   4.1. Introduction
   4.2. A Hybrid Model to Improve Relevance in Document Retrieval
      4.2.1. CG-based Core Model
      4.2.2. The Retrieval Model
         4.2.2.1. Vector Space Model
            4.2.2.1.1. The Document Collection
            4.2.2.1.2. Model Description
            4.2.2.1.3. Methodology
            4.2.2.1.4. Selection of the Appropriate Model
         4.2.2.2. Conceptual Graph Model
            4.2.2.2.1. Document Representation
            4.2.2.2.2. CG Similarity Measure
            4.2.2.2.3. Computing the Intersection Graph
      4.2.3. Experimental Design
         4.2.3.1. Experiment 1
            4.2.3.1.1. The Experiment
            4.2.3.1.2. Results and Discussions
         4.2.3.2. Experiment 2
            4.2.3.2.1. Document Collection
            4.2.3.2.2. The Experiment
            4.2.3.2.3. Results and Discussions
         4.2.3.3. Experiment 3
   4.3. Conclusions

5. Automatic Query Expansion through CG
   5.1. Introduction
   5.2. Models for Relevance Feedback
   5.3. CG-based Expansion
   5.4. Experiments and Results
      5.4.1. The Experiment
      5.4.2. Results and Discussions
   5.5. Conclusions

6. Capturing Semantics through Relation Matching
   6.1. Introduction
   6.2. The Proposed Retrieval Model
   6.3. New CG Similarity Measures
   6.4. Experiments and Results
      6.4.1. The Experiment
      6.4.2. Results and Discussions
   6.5. Conclusions

7. Conclusions and Future Work
   7.1. Conceptual Graph based Techniques
   7.2. Main Contributions
   7.3. Limitations and Future Work

References
List of Publications
Appendix
   A. Stop Words
   B. I. Format of Untagged and Tagged Files
      II. Penn Treebank Tagset
   C. Conceptual Relations (conrels)

List of Figures

Figure 2.1: Basic information retrieval process
Figure 2.2: Conceptual Graph of sentence 2.1
Figure 2.3: A concept with referent
Figure 2.4: CFG description of Linear Form
Figure 2.5: Conceptual Graph of sentence 2.2
Figure 2.6: Derivation graph of a conceptual graph v
Figure 2.7: Recall-Precision (R-P) graph
Figure 3.1: Sample documents
Figure 3.2: Results of application of indexing steps
Figure 3.3: Steps in preprocessing documents
Figure 3.4: Representation in two-dimensional vector space
Figure 3.5: The meaning triangle
Figure 3.6: General architecture of conceptual graph construction
Figure 3.7: A subset of conceptual relations
Figure 3.8: Sample document text
Figure 3.9: Conceptual graph for the sample document
Figure 4.1: Retrieval model
Figure 4.2: Recall-Precision curve for Model 1 vs. Model 2
Figure 4.3: Recall-Precision curve for Model 1 vs. Model 3
Figure 4.4: Recall-Precision curve for Model 1 vs. Model 4
Figure 4.5: Recall-Precision curve for Model 1 vs. Model 5
Figure 4.6: Comparison of R-P behavior of Model 6 with Models 1 & 7
Figure 4.7: Comparison of seven different retrieval models
Figure 4.8: R-P curve for the CACM collection (averaged over 64 queries) for BM25 and the modified tf-idf weighting scheme
Figure 4.9: Part of the concept hierarchy
Figure 4.10: Intersection of conceptual graphs G1 and G2
Figure 4.11: R-P curve for query 63 of the CACM-3204 collection
Figure 4.12: R-P curve for CACM query 30
Figure 4.13: Recall-Precision curve for CACM query 19
Figure 4.14: Average R-P curve for CACM queries 12, 19, 30 and 63
Figure 4.15: Sample queries and documents
Figure 4.16: Ranks of documents after first and second stage
Figure 4.17: Ranking after first and second stage of retrieval for queries 2 & 3
Figure 4.18: Ranking of first 10 documents returned by LYCOS after second stage
Figure 5.1: Sample queries from the CACM collection
Figure 5.2: Average R-P curves for a subset of CACM queries
Figure 5.3: Average Recall-Precision curves for a subset of CACM queries for techniques (a), (b) and blind expansion
Figure 6.1: Sample fragments involving the use of "which"
Figure 6.2: Identifying and representing triplets
Figure 6.3: A subset of replaceable terms
Figure 6.4: Comparison of R-P curves for the CACM-3204 collection for the vector model and our model using CG Similarity Measure I
Figure 6.5: R-P curve for the modified vector model and our model using CGSim II
Figure 6.6: R-P behavior of a subset of CACM queries using CGSim II
Figure 6.7: R-P curve for the modified vector model and our model using CGSim III
Figure 6.8: R-P curve for the modified vector model and our model using CGSim IV

List of Tables

Table 2.1: Different types of referents
Table 2.2: Relevant and retrieved documents
Table 2.3: IR test collections
Table 3.1: Vector representation of sample documents after stemming
Table 3.2: Calculating weight with different options for the three weighting factors
Table 3.3: Syntactic patterns for noun phrases
Table 4.1: Retrieval results of various models (11AvgP)
Table 4.2: The retrieval algorithm
Table 4.3: Ranking of relevant documents for CACM queries 12 and 23
Table 4.4: Percentage increase in precision
Table 4.5: First ten results returned by LYCOS for the query "Genetic algorithm for information retrieval"
Table 4.6: Titles of documents considered in Experiment 2
Table 5.1: 11AvgP scores for a subset of CACM queries
Table 6.1: Calculating similarity between the query and the documents
Table 6.2: Comparison of retrieval results for a subset of CACM queries

Chapter 1
Introduction

'Imagination is more important than knowledge. Knowledge is limited. Imagination encircles the world.'
Albert Einstein

1.1. Introduction

"We are producing more data every three years than humankind has in its entire history" [93]. This line gives a glimpse of the speed with which information is being made available today, and of the resulting problem of information explosion. Confronted with this problem, information retrieval (IR) systems must employ intelligent techniques to provide effective access to such a huge amount of information. Particularly with the emergence of the World Wide Web, users have access to a very large number of documents, and more and more information services (news services, libraries, electronic mail, etc.) are moving online in order to provide prompt and easy access. Most of this information is textual in nature, and the ever-increasing size of information sources has made it difficult for people to find relevant textual material. Often the information that reaches the user in response to a query does not match the user's interest and merely ends up overloading her/him; the user must then manually separate the interesting information from the "noise" [3]. This creates an urgent demand for more effective information retrieval systems that perform intelligent retrieval. This thesis makes a humble effort in that direction.


This thesis is the result of a research effort to capture semantics (albeit at a low level) and integrate it into IR systems. On the one hand, the efficiency of statistical methods cannot be ignored; on the other, the need for improved relevance (in the semantic sense) has to be satisfied. This motivated us to explore techniques that attempt to improve document and query representation, e.g. by capturing relationships between terms through Natural Language Processing (NLP) techniques. The relevance of a document can be judged effectively if intelligent techniques are introduced to capture semantics, say with the help of conceptual graphs (CGs), and to use it in the representation and matching processes. This effort culminated in a hybrid model with both syntactic and semantic features.

Research on relevance improvement in information retrieval takes mainly two directions [182]. First, assuming that a user query is an inappropriate representation of her/his information need, how to expand the query in order to obtain a better representation of that need. Second, given a user query, how to determine which documents in a given collection are relevant. It can be argued that improved representation is needed for both documents and queries, and that the identification of relevant documents is itself dependent on their representation: if some information about document content is lost in the representation process, it cannot be recovered later, say during matching (relevance assessment). This work explores both dimensions.

The information retrieval problem is concerned with finding documents that satisfy users' needs in response to a query. IR deals with unstructured data; retrieval is performed based on the content of a document rather than its structure, and IR systems usually return a ranked list of documents. Information retrieval components have traditionally been incorporated into different types of information systems, including database management systems, bibliographic text retrieval systems, question-answering systems and, more recently, search engines. Current approaches for accessing large text collections can be broadly classified into two categories.


The first category consists of approaches that construct a topic hierarchy (as in Yahoo). This helps users locate documents of interest manually by traversing the hierarchy. However, it requires manual classification of new documents within the existing taxonomy, which makes it cost-ineffective and inapplicable given the rapid growth of documents on the Web. At present, there is a huge gap between the amount of information available online and the information accessible via topic hierarchies, which clearly shows the inefficacy of these approaches in handling the information explosion problem. The second category consists of approaches that rank the retrieved documents according to relevance. Users are not interested in huge amounts of information, but in precise, accurate and relevant information, and relevance cannot be judged simply on the basis of term occurrence. Most existing retrieval systems still rely on standard retrieval models (e.g. Boolean, standard vector and probabilistic) that treat both documents and queries as sets of unrelated terms. These models have the advantage of being simple, scalable and computationally feasible, but they do not offer an accurate and complete representation: they ignore semantic and contextual information in the retrieval process. It is difficult to identify useful documents simply on the basis of the words used by the author of a document, as words may mean different things in different contexts. As pointed out in [186]: "It is today impossible to retrieve all documents pertaining to a particular subject, because such documents do not share a common set of keywords and because current search engines do not address semantics or context". Realizing the inadequacy of current approaches to information retrieval, this thesis aims at investigating intelligent techniques that help in retrieving information effectively. "Intelligent IR is just good IR" [164], in which the programs (i.e. the representation, comparison and interaction methods implemented in the system) result in effective performance; techniques that improve these aspects lead to intelligent retrieval. Our focus is mainly on semantic techniques. However, building a complete semantic understanding of text requires human-like processing of text and is beyond the scope of this thesis.


The objective of this work is to classify documents as relevant or non-relevant with respect to a standing query with more accuracy but less overhead. A detailed and accurate semantic interpretation is not needed for this classification [43]; this fact distinguishes the IR application from other NLP applications like machine translation, summarization and question answering. A middle ground has been taken by investigating techniques that make use of the limited semantic knowledge needed to define the relevance of a document and that can be easily extracted from the text. This also helps in dealing with the scalability issue, one of the most important factors in the design of information retrieval systems. These techniques allow search and retrieval systems to:

- improve document and/or query representation
- address document semantics
- improve ranking of retrieved documents
- adapt queries based on relevance feedback
- improve retrieval performance

Finally, realizing that information is being produced in such quantity and at such a rate that no single technique can remedy all problems, we propose a hybrid approach to information retrieval and evaluate one such model. To improve the efficacy of an IR system, we need a better understanding of the issues involved in information retrieval and of the problems associated with existing IR systems; we can then outline where the application of these techniques can provide significant benefit. This exactly defines the scope of the thesis. In the rest of this chapter, we first discuss the issues involved and the problems associated with current approaches to information retrieval. In Section 1.3, the problem statement is presented and an overview of the techniques evaluated by us is provided; this overview also serves as a summary of the core technical contributions of this work. In Section 1.4, we briefly review some of the previous research aimed at similar tasks. Section 1.5 describes the organization of the dissertation.


1.2. Major Issues in Information Retrieval

There are a number of issues involved in the design and evaluation of IR systems which need to be understood. In this section, we briefly discuss some of them.

The first important issue is the choice of a representation for the document. Most human knowledge is coded in natural language, but natural language is difficult to use as a knowledge representation language for computer systems. Most current retrieval models are therefore based on keyword representation. This representation creates problems during retrieval due to polysemy, homonymy and synonymy. Polysemy is the phenomenon of a lexeme carrying multiple meanings, so keyword matching may not always amount to word sense matching [112]. Homonymy is an ambiguity in which words that appear the same have unrelated meanings; such ambiguity makes it difficult for a computer to automatically determine the conceptual content of documents. Synonymy creates a problem when a document is indexed with one term, the query contains a different term, and the two terms share a common meaning. Previous studies indicate that human beings tend to use different expressions to convey the same meaning [9]; recent work on developing extensive lexicons is an attempt to improve the situation [95]. Traditional retrieval models ignore semantic and contextual information in the retrieval process [37], [178]. This information is lost in the extraction of keywords from the text and cannot be recovered by the retrieval algorithms. Improving IR demands an improved representation of text.

A related issue is the inappropriate characterization of queries by users. There can be many reasons for the vagueness and inaccuracy of users' queries, for instance their lack of knowledge of the subject or the inherent vagueness of natural language itself. Users may fail to include relevant terms in the query or may include irrelevant ones, and an inappropriate or inaccurate query leads to poor retrieval performance. The problem of ill-specified queries can be dealt with by modifying or expanding queries.


An effective technique based on user interaction is relevance feedback. Improving the representation of documents and/or queries is thus central to improving IR.

In order to satisfy a user's request, an IR system matches the document representation with the query representation. How to match the two representations is another issue. A number of similarity measures have been proposed to quantify the similarity between a query and a document so as to produce a ranked list of results, and the selection of an appropriate similarity measure is a crucial issue in the design of an IR system.

Evaluating the performance of IR systems is also one of the major issues in IR. There are many aspects of evaluation, the most important being the effectiveness of the system. Recall and precision are the most widely used measures of effectiveness in the IR community. As improving effectiveness is the underlying theme in evaluating any technique and is one of the core issues in this thesis, a detailed discussion of effectiveness measures is given in the following chapter (Section 2.4.1).

As the major goal of IR is to find documents relevant to a query, understanding what constitutes relevance is also an important issue; the evaluation of IR systems relies on the notion of relevance. Relevance is subjective in nature [143]. Only the user can tell the true relevance, and it is not possible to measure this "true relevance"; one may, however, define degrees of relevance. Relevance has usually been treated as a binary concept, whereas it is really a continuous function (a document may be exactly what the user wants, or only closely related to it). Current evaluation techniques do not support this continuity, attractive as it seems. A number of relevance frameworks have been proposed in [142], including the system, communication, psychological and situational frameworks. The most inclusive is the situational framework, which is based on the cognitive view of the information-seeking process and considers the importance of situation, context, multi-dimensionality and time. A survey of relevance studies can be found in [100].


Most evaluations of IR systems so far have been done on document test collections with known relevance judgments. We have adopted the same evaluation method for our work and selected the CACM-3204 collection.

The large size of document collections also complicates text retrieval. Further, users may have varying needs: some require answers of limited scope, while others require documents of wide scope. These different needs can require that different and specialized retrieval methods be employed; however, this issue has not been addressed in this work.

We attempt to handle some of these problems by proposing techniques, outlined in the next section, to improve the representation of documents and queries, and by incorporating new similarity measures. Information retrieval models based on these representations and similarity measures have been proposed and evaluated in this work.
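As a concrete reference for the recall and precision measures mentioned above (and discussed in detail in Section 2.4.1), the following small Python sketch computes both measures from a set of retrieved and a set of relevant document identifiers; the identifiers used in the example are purely hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Set-based effectiveness measures: precision = |rel & ret| / |ret|,
    recall = |rel & ret| / |rel|."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant            # relevant documents actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 4 of 10 retrieved documents are relevant, out of 8 relevant overall:
print(precision_recall(range(10), [0, 2, 5, 7, 11, 12, 13, 14]))  # (0.4, 0.5)
```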

1.3. Statement and Scope of the Work

The present research is focused on intelligent information retrieval. In order to be retrieved, information first needs to be represented in a way amenable to processing, and the choice of representation puts constraints on the retrieval process. The hypothesis is that through an improved representation of the text, and by incorporating more contextual information from the document and/or query into the matching process, it is possible to improve the effectiveness of retrieval. In particular, NLP techniques may be used to capture semantics in the representation of documents and queries, and the relationships between terms in the document. We argue that:

(1) A word out of context does not provide useful information about its importance in the text.

(2) Semantic descriptions can be derived only by considering the way terms are used in the document; the semantic information is contained in the relationships between words, not in individual words.


(3) The real challenge posed by IR is not the need for deep-level semantic processing of text, but the processing of large volumes of unrestricted text.

(4) Considering the contextual information existing in the query, and utilizing it in the expansion process, can improve query representation.

Along these lines, we investigate syntactic and semantic techniques that can be easily integrated with existing statistical IR techniques. In order to improve the representation of documents and queries, we capture semantic aspects of documents along with the statistical aspects. The statistical representation used in this work is the vector space representation; to make use of the best possible statistical representation, an investigation of different term-weighting schemes may be useful. The knowledge representation formalism used to represent the semantic aspects of documents, so that the retrieval system can understand and make use of them, is the conceptual graph (CG). The choice of the CG formalism has a number of advantages. It provides a way to extract and represent the meaning of natural language text; CG theory provides a framework in which all the components of an IR system can be represented adequately; it can easily be extended to accommodate specific knowledge and needs, without a revision of its semantics; and, more importantly, it leads to a scalable representation. An automatic procedure to arrive at a simplified conceptual graph representation of documents and queries has been introduced. This representation captures semantics by considering the relationships between concepts and by the use of a type hierarchy/replaceable terms. However, the development of a system to capture a truly conceptual representation is beyond the scope of the current work. It is still very difficult to index at a conceptual level, and it will remain a challenge because of the need to transform text into meaning, syntax into semantics.

The major questions to be answered by this research are:

1. How can the performance of existing statistical models be improved, and can they be used for the development of a new, semantically oriented hybrid model?


2. Can a conceptual graph be used for representing the content of a textual document? How can a CG-based model be used for the information retrieval task?

3. Can a CG-based model be used to extend the capabilities of existing statistical retrieval systems?

4. Can conceptual graphs be used for modifying query representation?

5. How do term similarity and relational similarity affect retrieval performance? How can the terms (concepts) and the relations be extracted from documents using NLP techniques?

The answers to these questions, and the new CG similarity measures proposed in the thesis, could also have implications for other related tasks like information extraction, document summarization and question answering.

In this work, an attempt has been made to combine the benefits of both the vector space model and the CG-based retrieval model by proposing a two-stage retrieval model. The first stage uses the vector space model to retrieve documents quickly, and the second stage re-ranks the retrieved documents using the CG-based representation. A query expansion technique based on this model has been introduced, and two different expansion strategies have been investigated. The first strategy uses the CG-based ranking obtained in the second stage of retrieval to improve the feedback set. The second strategy further uses the CG-based representation to obtain concepts (terms) related to the query concepts for expansion. The objective is to restrict the documents and concepts used in the expansion process to those most closely related to the query, which helps reduce the drift of the query's focus after expansion. Yet another model proposed in this work attempts to integrate term matching and relation matching in the retrieval process: new CG similarity measures have been proposed for identifying relational similarity between documents and the query, and a flexible expression has been used to combine the two similarities.
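The two-stage model outlined above can be summarized in a short sketch. This is not the implementation developed in Chapter 4, only an illustration of the control flow; vs_index.rank, build_conceptual_graph and cg_similarity are hypothetical placeholders for the vector space ranker and the CG components described later.

```python
def two_stage_retrieval(query, vs_index, top_k=50):
    """Stage 1: fast statistical retrieval; Stage 2: CG-based re-ranking."""
    # Stage 1: the vector space model quickly narrows the collection
    # down to a small candidate set.
    candidates = vs_index.rank(query)[:top_k]

    # Stage 2: candidates are re-ranked by conceptual graph similarity,
    # bringing semantic (relational) evidence into the final ordering.
    query_cg = build_conceptual_graph(query)
    rescored = [(doc, cg_similarity(query_cg, build_conceptual_graph(doc)))
                for doc in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

The design rationale is that the expensive CG matching is applied only to the small candidate set, keeping the overall cost close to that of a purely statistical system.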


1.4. Literature Review

Research in IR is not new; it dates back to the 1960s, when text retrieval systems based on full-text indexing of words were introduced. Since then many different fields and techniques have emerged. IR approaches can be broadly classified into two major categories, statistical and semantic, and most of the early systems were statistical in nature. Different techniques have been proposed by researchers to enhance the performance of IR systems; in the following sections we briefly review some of them.

1.4.1. Statistical Techniques in Information Retrieval

Luhn is considered the first person to advance the notion of automatic indexing of documents based on their contents [88]. He proposed that the frequency of words and their rank order provide a useful measure of word significance. Luhn's early ideas about indexing have continued to be relevant: almost all of the indexing techniques in use today have term frequency as one of their components. The SMART system, developed at Cornell, is one of the earliest IR systems based on fully automatic indexing. Over the years, Salton's SMART project [131] contributed many important notions to the development of the vector space model, e.g. term weighting schemes, relevance feedback, various similarity measures and clustering methods. The introduction of the notion of inverse document frequency (idf) in term weighting [160], which is based on Zipf's law, was another notable contribution to automatic indexing. A series of experiments successfully demonstrated that combinations of term frequency (tf) and inverse document frequency (idf) improved the performance of IR systems [132], and a family of tf-idf weighting schemes has been introduced in the IR literature. Several years of SMART experience with automatic term weighting, investigating distinct combinations of term-weighting schemes with and without normalization, are summarized in [130].


Other major contributions to the development of term weighting and similarity measures were due to [170], [133] and [158]. A critical review of current methods in automatic indexing is given in [11].

An alternative to the vector space model is the probabilistic model. Probabilistic indexing was proposed by [90]; the intent of probabilistic IR was to deal with the uncertainty of the retrieval process. Major contributions to the development of the probabilistic model were made by [123], who experimented with various term-weighting formulae on the manually indexed Cranfield collection. Other notable research in probabilistic indexing was due to [33], [34]. The inverse document frequency measure was heavily used in probabilistic indexing as well. More complex term distributions were also tried, including the 2-Poisson model [114]; however, this distribution poses problems, as it requires the estimation of many parameters. Some simple approximations to the 2-Poisson model for probabilistic indexing were introduced in [124]. Probabilistic models form the basis of the INQUERY and Okapi systems, which participated in many Text REtrieval Conferences (TREC).

One of the oldest and most traditional models is the Boolean model, which has been used successfully over the years, and several extensions of it have been proposed. The MMM model [49] and the Paice model [109] are based on fuzzy set theory. Among the best performing extended Boolean models are the P-norm model [135], which like the vector space model has a geometric basis, and the fuzzy set model suggested by Paice [109]. The effectiveness of the P-norm model has been verified experimentally by [135], [49].
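To make the tf-idf family of schemes concrete, here is a minimal Python sketch of one common variant (raw term frequency multiplied by idf = log(N/df)) together with the cosine similarity measure used to compare the resulting vectors. The thesis itself evaluates several weighting variants in Chapter 4, so this is illustrative only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Weight each term by tf * log(N / df)."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: tf * math.log(N / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0
```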

1.4.2. Intelligent Techniques in IR

Most of the early work focused on the standard retrieval models (Boolean, probabilistic and vector space). More recently, methods that try to capture more information about each document, and thereby achieve better performance, have been researched and established in IR systems. Researchers have proposed enhancements of the classical models based on specific techniques from different fields.


These techniques include query expansion, latent semantic indexing [36], [46], and artificial intelligence (AI) techniques like connectionist approaches [7], [17], [26], genetic algorithms (GAs) [58], [107], [176], [31] and natural language processing [163], [150], [111]. We provide a brief review of these techniques in the following sections; considering the importance of NLP techniques in IR, we discuss them under a separate heading.

1.4.2.1. Conceptual Graph Based Methods

Conceptual graphs are very closely related to natural language and hence can be used for representing text. Such a representation holds the promise of extracting more information from documents by explicitly capturing the relationships between terms, unlike word-statistical approaches that merely count nouns and noun phrases. Early evidence of the use of conceptual graphs includes the CoDHIR [89], DR-LINK [83], RELIEF [108] and ITELS [38] systems, among others; CGKAT [92] and WebKB [91] use conceptual graphs to index document elements (chapters, paragraphs, etc.). Rama and Srinivasan [115] underlined the utility of conceptual roles for information retrieval, and their work provides strong evidence in support of their use. Marega and Pazienza [89] also emphasized the contextual role of words in the CoDHIR (COntent-Driven Hypertext Information Retrieval) system and concluded that it yields an improvement in retrieval precision over traditional IR technologies. They used conceptual graphs in CoDHIR to represent semantic information extracted from text [89]; their work consisted in identifying the contextual roles of words and extending the vector model to consider compound descriptors (contextual role - word). The DR-LINK (Document Retrieval using LINguistic Knowledge) system [83], [104] is one of the serious attempts to use conceptual graphs to represent documents. It uses conceptual graphs to extract and use semantic relations for information retrieval, and involves processing and representing text at the lexical, syntactic, semantic and discourse levels to achieve intelligent retrieval at a level beyond more traditional approaches. The image retrieval system RELIEF (Relational Logical approaches based on Inverted Files) [108] uses conceptual graphs as its indexing language. Martin and Eklund [91] argued in favor of general knowledge representation languages for indexing web documents and suggested the use of concise and easily comprehensible CGs, arguing that the CG representation has advantages over metadata languages based on the Extensible Markup Language (XML). ITELS (Intelligent TErminology Learning System) [38] is an intelligent tutoring system aimed at helping Bulgarians learn English terminology; it makes use of conceptual and linguistic knowledge and uses conceptual graphs to represent conceptual relations between concepts. A number of conceptual graph matching algorithms have been proposed by researchers for IR applications; Chapter 3 (Section 3.3.4) of this dissertation contains a review of these algorithms. The major source of difficulty in the use of conceptual graphs is the development of an automated system to extract the CG representation of text. This work attempts to overcome this difficulty by proposing a simplified CG model for IR and using it to enhance (and/or complement) the capabilities of existing statistical information retrieval models instead of replacing them.
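For readers unfamiliar with the formalism, a conceptual graph is a bipartite structure in which relation nodes connect concept nodes. The following sketch shows one minimal way such a structure might be held in memory; it is an illustration only, not the representation developed in Chapter 3.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    type_: str             # concept type, e.g. "Cat"
    referent: str = "*"    # generic referent by default

@dataclass
class ConceptualGraph:
    concepts: list = field(default_factory=list)
    # Each relation is (name, source concept index, target concept index).
    relations: list = field(default_factory=list)

# "A cat is sitting on a mat": [Cat] <- (agnt) <- [Sit] -> (loc) -> [Mat]
g = ConceptualGraph(
    concepts=[Concept("Cat"), Concept("Sit"), Concept("Mat")],
    relations=[("agnt", 1, 0), ("loc", 1, 2)],
)
```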

1.4.2.2. Query Expansion and Refinement

A user query is often an inappropriate representation of the user's need and fails to retrieve all the relevant documents; this has been recognized as a major difficulty in information retrieval systems [60]. A significant amount of IR research has therefore been devoted to refining and expanding the original query based on the documents retrieved by it. Query refinement and expansion may involve adding new terms, removing poor terms and adjusting the weights of the original query terms. The majority of the work in this direction has focused on ranked retrieval systems, although some of it has been adapted to the Boolean model [60]. Relevance feedback is the single most effective technique for query reformulation ([73], [127]). The early research in this direction was carried out by Rocchio [127] and Ide [69]; the work of [123] and [68] focused on the probabilistic model. Automatic query expansion via relevance feedback is an effective technique [121], but the requirement for explicit relevance judgments interferes with users' information-seeking behavior [75]. An improvement over relevance feedback is pseudo relevance feedback, which does not require explicit relevance judgments from users: the IR system assumes that the top-ranked documents are relevant and automatically expands the query by adding terms from them. A survey of relevance feedback can be found in [128]. Buckley et al. [134] demonstrated the success of query expansion in the TREC environment, while the work of Belkin et al. [8] focused on query reformulation techniques in an interactive environment. An alternative approach to query improvement is to expand the query with terms that are highly correlated with the original query terms [73]. Such terms can be found in a thesaurus; unfortunately, available thesaurus-like resources are not suitable for most collections, and automatic query expansion based on thesauri has not been very successful [21].
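The feedback techniques surveyed above largely descend from Rocchio's vector space formulation, in which the query vector is moved toward the centroid of the relevant documents and away from the non-relevant ones. A minimal sketch, assuming all vectors share one term space; in pseudo relevance feedback the "relevant" set is simply the top-ranked documents.

```python
import numpy as np

def rocchio(query_vec, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: q' = alpha*q + beta*mean(rel) - gamma*mean(nonrel).
    rel_docs / nonrel_docs are 2-D arrays of document vectors."""
    q = alpha * query_vec
    if len(rel_docs):
        q = q + beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs):
        q = q - gamma * np.mean(nonrel_docs, axis=0)
    return np.clip(q, 0.0, None)   # negative term weights are usually dropped
```

The alpha/beta/gamma values shown are common defaults, not prescriptions from this thesis.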

1.4.2.3. Latent Semantic Indexing

The use of Latent Semantic Indexing (LSI) in information retrieval is based on the assumption that there is some underlying "hidden" semantic structure in the pattern of word usage across documents, beyond surface-level word choice. LSI attempts to identify this hidden semantic structure through statistical techniques and then uses it for representing and retrieving information. This is done by modeling the association between terms and documents based on how terms co-occur across documents: the term-document vector space is transformed into a more compact latent semantic space. It is believed that in the vector space of reduced dimensionality, words referring to related concepts are collapsed into the same concept vector, which handles the problems arising out of synonymy. LSI performs retrieval on the basis of meaning, is completely automatic, and has been applied successfully in many information retrieval systems [36], [46]. However, it is costly in terms of computation and requires re-computation if many new terms are added.
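Computationally, LSI is usually realized with a truncated singular value decomposition of the term-document matrix. A minimal sketch with NumPy, illustrating the dimensionality reduction rather than any particular system cited above:

```python
import numpy as np

def lsi(term_doc_matrix, k=2):
    """Project a term-document matrix into a k-dimensional latent space."""
    U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    term_coords = U[:, :k]                     # terms in latent space
    doc_coords = (np.diag(s[:k]) @ Vt[:k]).T   # documents in latent space
    return term_coords, doc_coords
```

Queries are folded into the same k-dimensional space and compared to doc_coords with an ordinary similarity measure such as cosine.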

1.4.2.4. Genetic and Connectionist Approaches

The convergence between AI and IR came about from two directions: AI researchers who selected IR as an application area, and IR researchers intermingling traditional retrieval techniques with methods developed in AI research in an attempt to improve retrieval performance. In this section, we briefly review genetic and connectionist approaches in IR. Genetic algorithms have been used in various ways to enhance query descriptions, to improve document representation and to build document clusters. Gordon [58] adopted a genetic algorithm (GA) to obtain better representations of documents; he successfully demonstrated that a GA produces better document descriptions than the representation obtained by the probabilistic model, and used these results to build document clusters for queries. A hybrid methodology integrating inductive learning and neural networks is discussed in [31]. New crossover operators were introduced, and the effects of population size and the number of generations studied, in [176]. Neural networks have found applications in document classification, recognizing co-occurring words and relevance feedback [26]. However, all of these techniques are computationally intensive; knowledge bases and NLP can be used instead to enhance the performance of IR systems, e.g. by providing methods for indexing and query expansion.

1.4.2.5. Relation Matching

Relation matching has also been used in information retrieval. One way of using relations is to expand the query with terms related to it through different types of relations, e.g. the synonym or hypernym/hyponym relation. The term driving, for instance, can be expanded using:

- Synonyms: traveling, steering, etc.
- Hyponyms: motoring
- Objects: car, truck, vehicle, etc.

Unless term expansion is done very carefully, precision degrades [21]. Other approaches include the use of proximity search and multi-word phrases. The proximity search approach is based on the assumption that related words tend to appear close together; but words appearing together may express different relationships, an often-quoted example being "Library School" versus "School Library". Multi-word phrase matching is simpler, and existing methods for single-term matching can be applied to multi-word terms. However, it fails to capture variations in syntactic structure unless phrases have been normalized: for example, "extraction of roots" might be transformed into "root extraction". Further, the use of multi-word phrases has not yielded significant improvement [82], [97]. Tree matching has also been used for relation matching; though it is more flexible and captures syntactic variations successfully, it is more complex, and the optimal tree matching technique for information retrieval is yet to be developed [74]. A number of researchers have used syntactic relation matching [35], [39], [151], but the results reported are not very encouraging [153]: syntactic relations may fail to identify the similarity between semantic relations expressed using different syntactic structures. This suggests that the use of semantic relations can provide better results. Some researchers have concentrated on particular types of semantic relations [84], [76]. Liu's work [84] focused on case relations, whereas Khoo [76] concentrated on cause-effect relations and reported a small but significant improvement in retrieval results; however, causal relation matching did not yield better results than word-proximity matching. In [54], semantic relations between the members of compound nouns have been studied. Semantic processing requires domain knowledge; identifying and coding this knowledge is a labor-intensive and time-consuming task, and it is difficult to port the knowledge to a different domain. If any significant improvement in retrieval is to be achieved, non-specific procedures have to be developed for identifying semantic relations.
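As an illustration of relation-based expansion of the kind discussed above, the following sketch gathers synonyms and hyponyms of a word from WordNet [95] through the NLTK interface (the WordNet corpus must be installed). As the text notes, such candidates must be filtered carefully or precision will degrade.

```python
from nltk.corpus import wordnet as wn

def related_terms(word):
    """Collect synonym and hyponym lemmas of all senses of `word`
    as candidate expansion terms."""
    synonyms, hyponyms = set(), set()
    for synset in wn.synsets(word):
        synonyms.update(lemma.name() for lemma in synset.lemmas())
        for hypo in synset.hyponyms():
            hyponyms.update(lemma.name() for lemma in hypo.lemmas())
    return synonyms - {word}, hyponyms

print(related_terms("driving"))
```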

1.4.3. Natural Language Processing Approaches

Processing textual information available in electronic form, and retrieving information intelligently in response to users' queries, has emerged as one of the great challenges in information retrieval. Natural language processing can play a vital role in both the storage and the retrieval of documents. There are several interdependent levels of analysis in natural language processing:

- Phonological level
- Morphological level
- Lexical level
- Syntactic level
- Semantic level
- Discourse level
- Pragmatic level

Pragmatic level

The phonological level is concerned with analysis of speech sounds e.g. phonemes and is of little interest in textual IR. The morphological level deals with meaning units i.e. morphemes. It is concerned with analysis of the different forms of a given word in terms of its prefixes, roots and suffixes. This level of NLP has been traditionally incorporated into IR systems. Stemming techniques that reduce words to some root forms (stems) for query-document similarity are example of this level of processing. The next higher level is the lexical level that deals with word level processing involving analysis of structure and meaning of words and part of speech tagging etc. Lexical operation in IR includes elimination of stop words, generation and use of thesauri for expanding queries and handling abbreviations and acronyms. Part of speech tagging is another lexical processing that is now being used in IR. It is commonly used operation in NLP, but not widely known in traditional IR. The next level is the syntactic level that deals with the grammar and structure of sentences. There can be many possible structure of a sentence. Identification of correct structure among various alternatives requires higher-level knowledge. The attempts to use syntactic analysis in understanding meaning of natural language were based on the assumption that the meaning is inherent in the syntactic structure [110]. The limitation of the meaning, which can be drawn from the syntactic analysis, is discussed by Salton et al. in [137]. Syntactic level processing has been rarely used in traditional IR. Identification of phrasal units is an example of this level of processing that has been used in IR. Although sophisticated parsers have been developed, but

17

Chapter 1

Introduction

usually statistical methods such as co-occurrence and proximity method has been preferred over NLP, for phrase identification in IR. The semantic level is concerned with meaning of units larger than words such as clauses and sentences. It involves the use of contextual knowledge to represent meaning. Word sense disambiguation is a task that requires semantic level of processing. This is because a word can be disambiguated only in the context of larger textual units in which it is being used. Due to sophisticated level of processing and the need of real world and domain specific knowledge most of the IR system preferred statistical keyword matching to semantic level processing. The discourse level processing attempts to interpret the structure and meaning of even larger units, e.g. at paragraph and at document level, in terms of words, phrases, cluster and sentences. The highest level is the pragmatic level that deals with outside world knowledge (i.e. knowledge external to document and/or query). There is no evidence of pragmatic level analysis in IR. Even in AI, research at this level is only experimental in nature. Although NLP is difficult, its potential benefits have caused researchers to investigate the use of both syntactic and semantic processing. In [151], e.g. use of syntax in IR has resulted in increase in retrieval effectiveness. Anaphor resolution has also been used to extend the capabilities of statistical techniques. However, the study performed in [152] and [174] has been disappointing. There are contending views regarding the use of lexical methods in IR. The supporters consider IR as an early stage of questionanswering and believe that this will improve retrieval. The opponents believe that IR and other areas of language processing are two different processes. They suspect the usefulness of any attempt of meaning representation in IR and believe that one can get quite good result with statistical, probabilistic or vector space techniques. Salton et al. [137] remarked that to analyze the meaning of a text automatically, syntactic analysis must be supplemented by a knowledge base consisting of world knowledge and semantic knowledge, and that it was impossible to put semantic and world knowledge in the computer. After examining the potential contribution of knowledge based techniques (Natural language processing and expert systems, in particular) SparckJones[161] also pointed out that although AI techniques can contribute to specialized

18

Chapter 1

Introduction

systems, but it is important to note that one should not overestimate the power of these techniques for IR. She remarked that for really hard tasks it will not be possible to replace humans by machines and argued that many information processing tasks are rather shallow linguistics tasks, which do not involve elaborate reasoning or complex knowledge. As direct application of NLP results in lack of robustness and efficiency [163], it is almost always used with existing systems for indexing, query expansion and modification, and document categorization, etc. In the following subsections, we briefly review the work in various areas of IR, where NLP techniques have found useful applications.

1.4. 3.1. Phrase Extraction Term and document weighting schemes, normalization, term stemming and common word elimination have been explored in depth in current bibliography. But the optimal representation of a text document still remains an open research question. Some

systems make use of phrases or even n-grams instead of words to represent

documents and queries. Phrases are often used as descriptor terms of a document. Traditionally statistical approaches such as co-occurrence, proximity etc. have been used for phrase identification. Though these methods are simple and efficient, but fail to capture variations in forms of phrases. Alternatively, syntactic methods to phrase extraction have been used. Syntactic phrases satisfy certain syntactic relations rather than being simple collocation of words. They capture actual linguistic relations and are expected to be semantically more meaningful. However, pure syntactic methods of phrase identification have not been found to be very effective. Though these methods are able to capture the variations in forms of a phrase and can better handle with varying degree of evidence provided by each form, but can result in identification of a large number of phrases having little importance for characterizing the topic of a given document or query [35]. Hence a combination of syntactic and statistical methods has been proposed in [44],[81], [82] where the use of greater degree of sophisticated NLP and statistical processing for identifying phrase in users’ queries, as compared to documents, has been suggested. This is particularly because


Further, such processing helps in a more appropriate understanding of the user's need than would be possible by statistical means alone. However, the fact that users' queries are usually short [70] complicates the process of understanding. This realization has led to the development of various query reformulation techniques in IR, by means of thesauri, relevance feedback or NLP analysis. More recently, it has been reported that a simple noun-phrase-based system performs roughly as well as a state-of-the-art corpus-trained key phrase extractor [5]. An analysis of statistical and syntactic phrases has been carried out in [98]; the study suggests that statistical and syntactic phrases give comparable performance and that the use of phrases does not have a major effect on precision at top ranks.

1.4.3.2. Word Sense Disambiguation
Word sense disambiguation is another area where syntactic and semantic techniques have been employed to improve IR performance [177], [172], [138], [139], [146]. Word sense ambiguity creates a problem particularly when the query under consideration is very short [138]. It has been considered a cause of poor performance in IR, and it seems natural that if words are disambiguated accurately then IR performance will increase. A number of studies have been aimed at investigating the effect of disambiguation on IR. The results so far, with few exceptions, have been disappointing. The first large-scale tests of using disambiguation in IR were conducted by Voorhees [172] and Wallis [177]. The sense disambiguator developed by Voorhees was based on the WordNet thesaurus [95]; the experiments she conducted on the CACM, CISI, TIME, MED and CRAN collections resulted in a drop in IR performance. Wallis' disambiguator replaced words by their dictionary definitions; experiments conducted on the CACM and TIME collections failed to find any significant improvement in IR performance. The reason for this poor performance may be that when the query is not very small, the co-occurring terms in the query themselves provide disambiguation, and when the query is very small, the context is insufficient for correct disambiguation. Augmenting short queries with domain-specific context and selecting the most frequent sense for an ambiguous term may ameliorate the problem.


The first positive evidence for word sense disambiguation was due to Schütze [146], who proposed a disambiguator based on the computationally expensive approach of clustering co-occurrences within the collection and reported an improvement in performance from 7% to 14% on average. In [138], Sanderson compared and summarized most of the previous work in disambiguation and pointed out that the reasons for failure included highly skewed sense frequencies, the collocation problem, inaccurate sense disambiguation, etc. One of the problems in evaluating disambiguation effects, felt by many researchers, is the lack of a term weighting function that considers word senses. More recently, significant increases in performance after disambiguation have been reported [162], [85], [77]. The work in [85] utilized WordNet to disambiguate query terms and observed significant improvement over the best known results on three TREC collections (TREC-9, 10 & 12) for short queries. [77] also observed improved retrieval effectiveness with their root sense tagging method on a large-scale TREC collection.

1.4.3.3. Knowledge Bases
Knowledge bases have also been used in IR systems to enhance performance, e.g. by providing methods for query expansion. A knowledge base is a representation and collection of knowledge, usually specific to an application area [39]. Knowledge bases are represented using semantic networks, rules, frames, etc. More recently, ontologies have received increasing interest, particularly with the emergence of the semantic web. An ontology is an exhaustive and rigorous conceptual schema for a given domain, in which concepts are organized hierarchically by semantic relations besides the subsumption relation. Ontologies attempt to support knowledge reuse and sharing by explicitly encoding a shared understanding of a domain. Ontology design has become an active research area in artificial intelligence in recent years. The application of ontologies in specific domains has been discussed by many researchers, including [147], [149], [106]. Nevertheless, their usefulness in a general domain is a matter of debate, and how a general-purpose ontology can be utilized for improving information retrieval is still an open question.


There are both opponents and proponents of the use of general-purpose ontologies in IR and other NLP-related tasks. The opponents argue that people believe different things and speak differently, and therefore cannot have the same ontology. The proponents reply that people may believe different things, may speak different languages, and may use different words and phrases to refer to the same concepts even in the same language, but this does not mean that they refer to different concepts. They believe that most of these differences can be traced to a mixing of ontology, language and knowledge, and can be dealt with by separating the naming and definition of concepts from language, knowledge and belief. Studies conducted on the concept acquisition process and on the social/linguistic interaction behavior of human beings also suggest that a general-purpose ontology is not very helpful in the learning process or for achieving intelligence. Further, ontology development, like lexicon development, is itself an extremely difficult and time-consuming task. One notable contribution toward the automation of ontology development is by [181], who developed an algorithm that classifies terms in a document, based on the probability of their co-occurrence in any given document, hierarchically into concept maps. However, a truly semantic ontology is still far off.

Some of the available knowledge bases
In this section we briefly discuss some of the freely available lexicons and ontologies that have found useful applications in IR research. WordNet ([95], [96]) is one of the most widely used knowledge bases in IR research. It was developed at Princeton University and was inspired by psycholinguistic theories. Information is organized as sets of synonymous words called synsets. Some 54,000 word forms are organized into 49,000 synonym sets, each representing one base concept. These synsets are linked with each other by means of lexical and semantic relations, including synonymy, hypernymy/hyponymy, antonymy and meronymy/holonymy. Nouns and verbs are organized in hierarchies, and adjectives are organized in clusters around head synsets.
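The synset organization just described can be explored programmatically. The following minimal sketch, not part of the original study, assumes the NLTK interface to WordNet (the corpus must be downloaded first); it merely illustrates the synsets and the hypernymy/meronymy relations mentioned above.

from nltk.corpus import wordnet as wn   # assumes nltk and its wordnet corpus are installed

# Every sense of "cup" is a synset, i.e. a set of synonymous word forms.
for s in wn.synsets("cup"):
    print(s.name(), "-", s.definition())

# Lexical and semantic relations are methods on the synset object.
cup = wn.synsets("cup")[0]
print(cup.hypernyms())        # more general concepts (hypernymy)
print(cup.part_meronyms())    # parts of a cup (meronymy)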


LDOCE (Longman Dictionary of Contemporary English) and Roget's thesaurus are some of the other lexical resources used in IR research. Cyc is a well-known and quite comprehensive ontology [30]. It includes a foundation ontology and many domain-specific ontologies, called microtheories. A subset of Cyc known as OpenCyc has been released for open access. SUMO (Suggested Upper Merged Ontology) is another comprehensive upper ontology, developed by an IEEE working group, which is freely available [125]. It is extended with many domain ontologies and a complete set of links to WordNet.

Existing lexical resources and knowledge bases are not sufficient for in-depth processing of natural language. Attempts have been made to develop knowledge bases (lexicons, taxonomies, ontologies, etc.) automatically and several methods have been proposed; these methods, however, are still immature. Developing knowledge bases manually is a labor-intensive task and requires a great deal of knowledge. These constraints typically restrict knowledge bases to specific domains. The cost of applying NLP to a large amount of unrestricted text can be reduced by a two-step process: a coarse ranked retrieval of candidate documents using statistical and shallow NLP techniques, followed by more sophisticated NLP applied only to the much smaller list of highly ranked documents retrieved in the first stage. In this work we attempt to make use of limited semantic knowledge, wherever possible, to improve the retrieval performance of statistical methods. By combining both approaches, the benefits of each are achieved in a single system.

1.5. Organization of the Thesis
The thesis is organized into seven chapters, including the present chapter, which introduced the IR problem, presented a brief review of the work done in the field and provided an overview of our work. An outline of the remaining chapters follows.


In chapter 2, the basic concepts involved in text representation, including conceptual graphs, are introduced. This includes a brief discussion of traditional information retrieval models, the definition and notation of conceptual graphs, the basic operations performed on CGs, and measures of IR performance. The purpose is to establish the basic terminology and cover the issues that will be used in the subsequent chapters.

In chapter 3, a detailed description of the representation of documents as vectors of term weights and as sets of conceptual graphs is provided. First, document representation in the vector space model is discussed. The vector space model constitutes the base model for the experimental work carried out in subsequent chapters. The purpose is not just to provide a theoretical background for the model, but also to present a working model which is used in the rest of the experimental work and which serves as a baseline for assessing the performance of the techniques proposed in later chapters. Next, all related aspects (theoretical as well as implementational) of the conceptual graph based retrieval model are discussed. How CGs can be used to represent natural language semantics is explained with the help of examples. A theoretical framework for using the conceptual graph (CG) representation for IR tasks is presented, and the construction of CGs from English language text is elaborated in detail. Finally, a brief survey of CG matching algorithms is given.

In chapter 4, an intelligent technique to improve retrieval performance is investigated. First, the performance of seven different combinations of term weighting schemes is experimentally investigated on the CACM-3204 document collection, and the best performing model is used in the first stage of retrieval. In order to get a good understanding of the term weighting schemes, two more test document collections, namely ADI and MEDLINE, are investigated. Next, a hybrid information retrieval model that uses a CG-based representation to improve retrieval performance is proposed. The proposed model is essentially a two-stage retrieval model, which takes advantage of the attractive characteristics of the vector space model to quickly and efficiently handle large amounts of unconstrained text in the first stage, and then makes use of more sophisticated techniques to improve the retrieved results.


We conduct three experiments to investigate its usefulness. The first experiment is conducted on the CACM-3204 collection. The second is performed on our own small document collection, CGDOC, consisting of the titles and abstracts of papers. The third is performed using the top 10 results returned by the LYCOS search engine. The experimental investigation suggests that the model has the potential to improve retrieval performance.

In chapter 5, two pseudo-relevance feedback strategies based on conceptual graphs are presented and evaluated. The first strategy uses conceptual graphs to construct the relevance feedback set; terms from the documents in this set are then used for expansion. The second strategy further uses CGs to select the terms to be added from the feedback set. Both strategies are compared with blind expansion following an initial retrieval using the vector space model.

In chapter 6, we extend our work by proposing a technique that integrates relation matching with keyword matching. Unlike the model proposed in chapter 4, which is a two-stage model, this is a single-stage model. It makes use of simple heuristics and the transitivity of relations in the matching process. The conceptual graph based representation is used to identify relational similarity between documents and the query. New CG similarity measures are proposed to capture relational similarity and used for evaluating the performance of the proposed model. These similarity measures are a novel contribution of this thesis.

Finally, chapter 7 presents a quick review of the work carried out in this thesis and lists the major findings. We conclude by outlining other potential applications of our work and suggesting future research directions.


Chapter 2
Information Retrieval and Conceptual Graphs: Basic Concepts

2.1. Introduction
Information retrieval (IR) is concerned with the organization, storage, retrieval and evaluation of information relevant to a user's query. A user having an information need formulates a request in the form of a query, written in natural language. The retrieval system responds by retrieving documents that seem relevant to the query. Figure 2.1 illustrates the basic information retrieval process. Retrieval is performed by matching the query representation with the document representations; if the matching process views a document as sufficiently similar to the query, the document is assumed relevant and returned to the user.

Figure 2.1: Basic Information Retrieval Process


However, this is just an engineering account of an IR system. The basic question is: what constitutes the information in the documents and the queries? This, in turn, is related to the problem of representing documents and queries. Several different models have been developed so far, which differ in the way documents and queries are represented and retrieval is performed. Some of these models consider documents as sets of terms and perform retrieval merely on the basis of the presence or absence of one or more query terms in the documents. Others represent documents as vectors of term weights and perform retrieval based on a numeric score assigned to each document, representing the similarity between the query and the document. An alternative retrieval model, considered in this work, is the conceptual graph based model, which represents documents as networks of interrelated concepts. This representation captures more of the information in the documents than the vector space representation. In this chapter, the basic concepts involved in both models are discussed; the chapter gives the necessary background and context for understanding the research presented in subsequent chapters. Section 2.2 discusses the information retrieval process and provides a brief description of information retrieval models. Section 2.3 introduces conceptual graphs, describes various notations and representations, and discusses the basic operations performed on them. Implementation aspects are dealt with in the next chapter. Finally, section 2.4 considers the issues involved in the evaluation of IR systems.

2.2. Information Retrieval Models
This section offers a brief description of various information retrieval models. A general introduction to IR models is given first. Next, the three classical information retrieval (IR) models, namely the boolean, vector space and probabilistic models, are briefly discussed. As the vector space model is also used for the experimental work in this thesis, a detailed discussion of it, covering both its principles and its implementation, is provided separately in the next chapter.


Basically, an IR model is a pattern that defines several aspects of the retrieval procedure, for example, how the documents and queries are represented, how the system retrieves documents relevant to a query, and how the retrieved documents are ranked. Several different IR models have been developed so far. These models can be classified as [39]:

• Classical models of IR
• Non-classical models of IR
• Alternative models of IR

All the early models of IR were based on mathematical knowledge that was easily recognized and well understood. Retrieval is performed on the basis of similarity, probability, boolean operations, etc. These models are simple, efficient and easy to implement, and almost all existing commercial systems are based on them; that is why they are called the classical models of IR. Non-classical models perform retrieval based on principles other than those used in the classical models; they are best exemplified by models based on special logic techniques, situation theory or the concept of interaction. The third category, alternative models, consists of enhancements of the classical models that make use of specific techniques from other fields. The cluster model, the fuzzy model and the latent semantic indexing (LSI) model are examples of alternative models of IR. The application of document clustering to IR is based on the cluster hypothesis, which states: "closely associated documents tend to be relevant to the same requests" [170]. The literature published in the field covers various areas, such as the development of efficient algorithms for document clustering and the application of document clustering as a post-retrieval document visualization technique [62], [129]. Recent research [140] shows that the cluster hypothesis is disputable, at least in the case of fixed clustering.


The classical models of IR form the basis of almost all commercial systems in use today. The three classical models, namely the boolean, probabilistic and vector space models, are briefly discussed below, followed by a discussion of semantic approaches to IR.

2.2.1. Boolean Model
The boolean model is based on boolean logic and classical set theory. In this model, documents are represented as sets of keywords, usually stored in an inverted file. An inverted file is a list of keywords together with the identifiers of the documents in which they occur. Users are required to express their queries as boolean expressions consisting of keywords connected with the boolean logical operators (AND, OR, NOT). Retrieval is performed based on whether or not a document contains the query terms. Boolean retrieval models have been used in IR systems for a long time. They are simple, efficient and easy to implement, and perform well in terms of recall and precision if the query is well formulated. However, the model suffers from certain drawbacks. First, it is not able to retrieve documents that are only partly relevant to the query; all information is 'to be or not to be'. Second, users seldom express their queries in the pure boolean expressions this model requires. Further, the model cannot produce a ranked list of relevant documents: it merely distinguishes between the presence and absence of keywords and fails to weigh the relative importance of keywords in a document. To overcome these weaknesses, a number of extensions of the boolean model have been proposed [48]. These models attempt to strengthen the standard boolean model by avoiding a strict interpretation of the boolean operators, by handling uncertainty in the indexing process and by providing a ranked list of documents in response to a query.
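To make the mechanics concrete, the following is a minimal sketch of boolean retrieval over an inverted file in Python; the toy documents and the pre-parsed query are our own illustrative assumptions, not part of any system described in this thesis.

from collections import defaultdict

def build_inverted_file(docs):
    # inverted file: keyword -> identifiers of the documents containing it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_retrieve(index, op, t1, t2):
    # evaluate a single boolean operator over two keywords
    a, b = index.get(t1, set()), index.get(t2, set())
    if op == "AND": return a & b
    if op == "OR":  return a | b
    if op == "NOT": return a - b        # t1 AND NOT t2
    raise ValueError(op)

docs = {1: "information retrieval systems",
        2: "conceptual graph based retrieval",
        3: "boolean logic"}
index = build_inverted_file(docs)
print(boolean_retrieve(index, "AND", "retrieval", "graph"))   # {2}

Note how the output is a set, not a ranking: the model only distinguishes the presence of keywords from their absence.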

2.2.2. Probabilistic Model
The probabilistic model attempts to rank documents by their probability of relevance, given a query. Retrieval is based on whether the probability of relevance of a document (relative to a query) is higher than its probability of non-relevance and exceeds a threshold value.


Given a query, the documents and a cut-off value, the model first calculates the probabilities that each document is relevant and irrelevant to the query; it then ranks the documents whose probability of relevance is at least as high as their probability of irrelevance, and retrieves those documents whose probability of relevance in the ranked list exceeds the cut-off value. Major contributions to the development of the probabilistic model are due to [90], [123], [171] and [126]. Different mathematical methods for calculating the probabilities of relevance and irrelevance, as well as their properties and applications, are discussed in [171], [22], [51].

Most systems assume that terms are independent when estimating probabilities. This assumption allows for accurate estimation of parameter values and helps in reducing the computational complexity of the model. However, it appears to be inaccurate [86], as terms in a given domain usually tend to co-occur; for example, 'match point' is more likely to co-occur with 'tennis' than with 'cricket'. A method of incorporating term dependence in probabilistic IR, to compensate for the weakness of the term-independence assumption, has been proposed in [12]. A comparison of the boolean and probabilistic models can be found in [86], and a comprehensive review of the probabilistic model in IR is presented in [32]. A bayesian network version of the probabilistic model forms the basis of the INQUERY system [22]. The probabilistic model, like the vector space model, can produce results for partly matched queries. Nevertheless, the model still has difficulties, one of which is determining the threshold value for the initially retrieved set; the number of documents deemed relevant to a query is usually too small for the probabilities to be estimated accurately.
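As a concrete illustration, the sketch below computes the well-known Robertson/Sparck Jones relevance weight, one common instantiation of the probabilistic model under the term-independence assumption; it is our own simplified example, and the thesis does not commit to this particular formula.

import math

def rsj_weight(N, n, R, r):
    # N: collection size, n: documents containing the term,
    # R: known relevant documents, r: relevant documents containing the term;
    # the 0.5 terms are the usual smoothing.
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

def retrieval_status_value(doc_terms, query_terms, stats):
    # a document's score is the sum of the weights of the query terms it
    # contains; stats maps a term to its (N, n, R, r) counts
    return sum(rsj_weight(*stats[t]) for t in query_terms if t in doc_terms)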

2.2.3. Vector Space Model
The vector space model represents both the documents and the queries as term weight vectors in a vector space. The ranking algorithm computes the similarity between the document and query vectors to yield a retrieval score for each document and produce a ranked list of retrieved documents. Being used in this work, the vector space model is elaborated in chapter 3.


2.2.4. Semantic Based Models
Traditional approaches to information retrieval represent documents as sets of keywords, where the mutual semantic relationships or underlying common notions of these words are completely ignored. A better performance is expected from a retrieval system that considers semantic aspects. The use of stemming can be considered a first step in this direction, though its success is limited. The semantics of individual words can be brought in using a thesaurus; this requires the existence of a vocabulary in which synonymous and semantically related words are collected. Another way to bring semantic understanding into the retrieval process is to consider the relationships between words: a word is perceived in the context in which it appears and through the relationships it bears with other words appearing in that context. This resembles the cognitive activity of concept formation. A concept is simply an idea [37], a thinking unit. Concept formation takes place gradually, as a person grows up. When we meet a certain concept in language, expressed as a word, we think of every aspect related to this concept; as we come to know a new concept, we connect it with what is already in the brain to form a network of concepts. The same approach can be used for representing documents and queries, i.e. as sets of interrelated concepts. Conceptual graphs are a knowledge representation formalism that makes this possible.

2.3. Conceptual Graphs
Conceptual graphs (CGs) are the basic building blocks of conceptual structures. CGs evolved out of the conceptual structures theory set down by John F. Sowa in [154] and later revisited in [155]. Sowa's conceptual structures are a notation for knowledge representation, for use in text analysis, that is highly expressive, mathematically well founded and computationally tractable. CGs are based on the logic of Charles Sanders Peirce, Tesnière's dependency graphs and the semantic networks of AI. Peirce developed an algebraic notation for logic but was never satisfied with it; he always believed that graph notations are more flexible and readable. Peirce experimented with graph notation for logic and proposed existential graphs (EGs).


He stated rules of inference and defined an interpreter for them. An EG is a graphical representation of logic which can represent both modality and quantification; however, it does not capture the details of linguistic structure correctly [183]. Existential graphs have a single canonical form instead of the multiple synonymous sentences found in languages with built-in operators, such as English and predicate calculus, which makes it difficult to translate natural language sentences conveniently. Peirce called the graphic notation "the luckiest finding of my career" and believed it to be the "logic of the future" [157]. Peirce was proved right when several developers and theorists formed a group called the "Peirce Project" to implement and extend Peirce's logic. CGs are a synthesis of Peirce's logic with Tesnière's dependency graphs and semantic networks: a graphic notation for logic based on EGs, but with extended features that support more direct translations to natural languages. The advantage of a graph notation is that diverse graph-manipulation operations are available, and it is easy to determine the global status.

2.3.1. Definition and Notation
Definition 1 (Conceptual Graph). "A conceptual graph is a finite, bipartite graph. It has two types of nodes: concept nodes and relation nodes" [154]. Concept nodes represent entities, attributes, states and events, while relation nodes show how concepts are interconnected. Figure 2.2 shows a conceptual graph of the sentence:

A cup is on the table.        (2.1)

Figure 2.2: Conceptual graph of sentence (2.1)

The conceptual graph in figure 2.2 consists of two concept nodes connected by a conceptual relation. A single concept node may constitute a conceptual graph; however, a single relation node cannot.


In the graph, boxes represent concept nodes and circles represent conceptual relations. In text, the linear notation is usually preferred over the graphic notation; boxes are replaced with square brackets and circles with parentheses. The linear form of the graph shown in figure 2.2 is:

[Cup] → (On) → [Table]

Concept nodes have two fields: a type field and a referent field. The two fields are separated by a colon: in the box, the type field is shown on the left and the referent field on the right, as in figure 2.3. Concepts that do not identify a specific individual are called generic concepts; the referent part of these concepts is omitted. The existential quantifier (∃) is assumed to apply to concepts with a blank referent field, as in [Cup]. For individual concepts, the referent field holds a specific entity, such as a name.

Figure 2.3: A Concept with referent

Table 2.1 lists the kinds of referents and their descriptions. Concepts are organized in a hierarchy called the type hierarchy. The type hierarchy constitutes a partial ordering and becomes a type lattice when all the intermediate types are introduced. The type lattice has the universal type ⊤ at the top and the absurd type ⊥ at the bottom. Although the type hierarchy represents a model of the real world, or at least of the domain, its significance is more implementational than ontological [37]. The choice of types depends on the nature of the domain and on specific requirements.


Kind of referent          Example                      Description
Universal                 [Student: ∀]                 Every student
Generic or existential    [Student] or [Student: *]    A student or some student
Definite referent         [Student: #]                 The student
Named individual          [Student: Zeeshan]           Student Zeeshan
Singular                  [Student: @1]                Exactly one student
Generic set               [Students: {*}]              Students or some students
Counted set               [Student: {*}@3]             Three students
Set of individuals        [Student: {Sana, Aman}]      Sana and Aman
Question                  [Student: ?]                 Which student
Measure                   [distance: 6 km]             Distance of 6 km

Table 2.1: Different types of referents
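A minimal machine representation of the notions introduced so far can be sketched in Python (the class and field names below are our own illustrative choices, not a prescribed implementation): concept nodes carry a type and an optional referent, and relation nodes carry a type and an ordered list of the concepts their arcs touch.

class Concept:
    def __init__(self, ctype, referent=None):
        # referent None models a generic concept (implicit ∃)
        self.type, self.referent = ctype, referent

class Relation:
    def __init__(self, rtype, args):
        # args is ordered: arc order is significant in a CG
        self.type, self.args = rtype, args

# "A cup is on the table":  [Cup] -> (on) -> [Table]
cup, table = Concept("Cup"), Concept("Table")
g = [Relation("on", [cup, table])]    # a CG as a list of relation nodes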

There exists a defined mapping between a conceptual graph and a corresponding FOPL (first-order predicate logic) formula. A conceptual graph can easily be translated into its predicate logic equivalent using the formula operator Φ. This operator maps the boxes to quantified variables with monadic predicates specifying their types. Circles are mapped to predicates with each arc as one of their arguments: the arrow pointing towards the circle becomes the first argument and the arrow pointing away becomes the second. When applied to figure 2.2, Φ assigns a variable x to represent the concept [Cup] and a variable y to represent the concept [Table]. The type labels CUP and TABLE are represented as monadic predicates cup(x) and table(y), and the relation 'on' is represented by the dyadic predicate on(x, y). The equivalent predicate calculus representation is:

(∃x)(∃y) (cup(x) ^ table(y) ^ on(x, y))

Now consider the CG representation of the sentence 'Danish is playing football':


g: [Person: Danish] ← (agnt) ← [Play] → (obj) → [Football]

When the translation operator Φ is applied to g, it yields the first-order predicate calculus formula:

Φg = ∃x ∃y ∃z (person(x) ^ play(y) ^ football(z) ^ agnt(y, x) ^ obj(y, z) ^ name(x, 'Danish'))
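For simple, non-nested graphs such as these, the operator Φ can be sketched directly on the Concept/Relation encoding introduced in section 2.3.1. This toy version, our own illustration, ignores individual referents (so name predicates such as name(x, 'Danish') would need extra handling) and is only meant to show the mapping.

def phi(relations):
    # assign one quantified variable per distinct concept node
    concepts, variables = {}, iter("xyzuvw")
    for rel in relations:
        for c in rel.args:
            if id(c) not in concepts:
                concepts[id(c)] = (next(variables), c)
    quants = "".join(f"(∃{v})" for v, _ in concepts.values())
    atoms = [f"{c.type.lower()}({v})" for v, c in concepts.values()]   # monadic type predicates
    atoms += [f"{rel.type}({', '.join(concepts[id(c)][0] for c in rel.args)})"
              for rel in relations]                                    # one predicate per relation
    return quants + " (" + " ^ ".join(atoms) + ")"

print(phi(g))    # (∃x)(∃y) (cup(x) ^ table(y) ^ on(x, y))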

2.3.2. Conceptual Graph Representation
CGs are formally defined in an abstract syntax that is independent of any notation, but they can be represented in three different forms: the linear form (LF), the display form (DF) and the conceptual graph interchange form (CGIF). The display form has the advantage of being more intuitive for the average user to comprehend. The linear form is a compact, readable form normally used in text: concepts are represented by square brackets instead of boxes, and conceptual relations by parentheses instead of circles. In the linear form, one concept is selected as the head concept. The head concept should be central to the proposition being represented; most of the other concepts in the assertion are related to it. To obtain the CG representation of a sentence, usually the main verb of the sentence is selected as the head concept; alternatively, the head concept can be the concept with the maximum number of related nodes. A '–' separates the head node from the other conceptual relations and concepts occurring in the graph. Other CGs may be embedded within the main graph in a similar fashion. A formal context-free grammar description of the linear form, as given by Sowa [154], is shown in figure 2.4. The first production rule of the grammar says: "The nonterminal CGGRAPH represents a conceptual graph consisting of either a CONCEPT followed by an optional relational link (RLINK) or a RELATION followed by a required concept link (CONLINK)" [154]. Recall that a single concept node constitutes a conceptual graph but a single relation node does not. Both DF and LF are designed for communication with humans or between humans and machines. CGIF was developed for communication between machines that use CGs as their internal representation.


CGIF has a concrete syntax, which makes it usable for implementation. In CGIF, co-reference labels are used to represent the arcs.

Figure 2.4: CFG description of Linear Form

The LF representation of sentence (2.1) is:

[Cup] → (On) → [Table]

In CGIF, sentence (2.1) can be represented as:

[Cup: *x] [Table: *y] (On ?x ?y)

The symbols *x and *y are called defining labels; the matching symbols ?x and ?y are bound labels that indicate references to the same instance of a cup x or a table y. CGIF also permits concepts to be nested inside relation nodes; nesting concepts helps reduce the number of co-reference labels:

(On [Cup] [Table])

For communication with systems that use other internal representations, CGIF can be translated into different logical languages such as the Knowledge Interchange Format (KIF); hence it is better to use CGIF for the storage and retrieval of CGs. The KIF representation of the above is:

(exists ((?x Cup) (?y Table)) (On ?x ?y))


Although DF, LF, CGIF and KIF look quite different, their semantics is defined by the same logical foundations. They can all be translated to a statement of the following form in typed predicate calculus:

(∃x: Cup)(∃y: Table) on(x, y)

Any statement expressed in one of these notations can be automatically translated into a logically equivalent statement in any of the others. Figure 2.5 shows the display form of a CG representing the English sentence:

Sana goes to school by bus.        (2.2)

Figure 2.5: Conceptual graph of sentence (2.2)

The graph has three conceptual relations: (agnt) relates [Go] to the agent Sana, (dest) relates [Go] to the destination school, and (inst) relates [Go] to the instrument bus. In DF, concepts are represented by boxes and conceptual relations by circles or ovals. The linear form is intended as a more compact notation than DF, but with good human readability. Following is the LF for figure 2.5:

[Go]–
  (agnt)->[Person: Sana]
  (dest)->[School]
  (inst)->[Bus].


In this form, the concepts are represented by square brackets and the conceptual relations by parentheses. A hyphen at the end of a line indicates that the relations attached to the concept continue on subsequent lines. Following is the CGIF of sentence (2.2):

[Go: *x] [Person: Sana *y] [School: *z] [Bus: *w] (agnt ?x ?y) (dest ?x ?z) (inst ?x ?w)

or

[Go: *x] (Agnt ?x [Person: Sana]) (Dest ?x [School]) (Inst ?x [Bus])

2.3.3. Conceptual Graph Operations
Before discussing the operations that can be performed on conceptual graphs, we first introduce the notion of a support and give a more formal definition of a conceptual graph. Each conceptual graph is defined in relation to a support, which defines syntactic constraints and provides background knowledge about a specific application domain. A support consists of:

TC, a set of concept types, which is a finite lattice with ⊤ as supremum and ⊥ as infimum;
TR, a set of relation types; TC and TR are disjoint;
M, a set of individual markers for concept types, together with a generic marker *;
Σr, a signature Σr = (r, n, C1, ..., Cn) for each relation type r of arity n, where Σi(r) denotes the ith concept type in r. The signature of a relation type thus fixes its arity and shows the greatest concept types it can link.

More formally, a CG can be defined as:

Definition 2 (Conceptual Graph). A CG g = (R, C, E, ord, label) is a bipartite (recall that conceptual graphs are not necessarily connected) and finite graph with C ≠ ∅. R and C denote its relation and concept nodes, and E is the set of edges; the edges adjacent to each relation node r are totally ordered by the function ord, and the ith neighbor of r in g is denoted gi(r). Every concept node in the conceptual graph has a label defined by the mapping label.


A label of a concept node c ∈ C is a pair label(c) = (c, m(c)) with c ∈ TC and m(c) ∈ M, where M is a finite set of individual markers.

There are four basic operations that can be performed on conceptual graphs [157]:

Copy: creates a conceptual graph v as an exact copy of another conceptual graph u.

Restrict: let c be a concept of v that has a constant or an existential quantifier as its referent. Then a conceptual graph w can be derived by restricting c by type or by referent: restriction by type replaces the type label of c with some subtype, and restriction by referent replaces an existential quantifier with a constant.

Join: let c1 be a concept of u and c2 a concept of v, where neither c1 nor c2 is nested inside a context and both have identical type and referent fields. Then the graph w obtained by deleting c2 and linking to c1 all arcs of conceptual relations that had previously been linked to c2 is called a join of u and v.

Simplify: a conceptual graph can be simplified by deleting duplicate conceptual relations. Two conceptual relations r1 and r2 are duplicates if they are of exactly the same type and each arc of one relation is attached to the same concept as the corresponding arc of the other.

A new conceptual graph v can be derived from other conceptual graphs by applying the copy, join and restrict operations. The conceptual graph v thus derived is said to be a specialization of every graph u from which it has been derived; more formally, this is written v ≤ u. Every specialization sequence from a graph u to v is associated with a projection from u to v. A derivation sequence for the graph v representing "A person x loves another person y" from the graphs u1 and u2 is shown in figure 2.6; in this case v is a specialization of both u1 and u2, i.e. v ≤ u1 and v ≤ u2. In conceptual graph theory, two alternative notions were introduced: graph derivation and projection. The basic rule for graph derivation is the restriction rule, in which a concept C is restricted to a sub-concept C' (C' ≤ C), together with suitable substitutions of individuals for the existential quantifiers. As the main operation for deriving graphs is restriction, Sowa's idea was to obtain the graph derivation h ≤ g (h derives from g) by projecting all the concept nodes of the starting graph g onto nodes of the final graph h. Thus, both graph derivation and projection proceed in the same direction [2].
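The restrict and join operations can be sketched on the same encoding used earlier; the toy type hierarchy below is our own stand-in for a real support, and the code is illustrative rather than a full implementation (it does not, e.g., check contexts).

def subtype(a, b, hierarchy={"Person": "Animate", "Love": "State"}):
    # stub: in a real system this consults the type lattice of the support
    return a == b or hierarchy.get(a) == b

def restrict(concept, new_type=None, new_referent=None):
    if new_type is not None and subtype(new_type, concept.type):
        concept.type = new_type                 # restriction by type
    if new_referent is not None and concept.referent is None:
        concept.referent = new_referent         # restriction by referent
    return concept

def join(u, v, c1, c2):
    # c1 (in u) and c2 (in v) must have identical type and referent fields;
    # every arc previously linked to c2 is re-linked to c1
    assert (c1.type, c1.referent) == (c2.type, c2.referent)
    for rel in v:
        rel.args = [c1 if a is c2 else a for a in rel.args]
    return u + v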

u1: [Animate]←(expr)←[State]              u2: [Act]→(obj)→[Entity]
        restrict                                  restrict
u11: [Animate]←(expr)←[Love]              u21: [Love]→(obj)→[Entity]
                    join of u11 and u21
u3: [Animate]←(expr)←[Love]→(obj)→[Entity]
                    restrict
u4: [Person]←(expr)←[Love]→(obj)→[Entity]
                    restrict
v:  [Person]←(expr)←[Love]→(obj)→[Person]

Figure 2.6: Derivation sequence for the conceptual graph v

If a graph v is derivable from another graph u (i.e. v ≤ u), then there exists a projection mapping π: u → v with the following properties:

• For each concept c in u, πc is a concept in v such that type(πc) ≤ type(c); if c is an individual concept, then referent(πc) = referent(c).

• For each relation node r in u, its image πr is a relation node in v such that type(πr) = type(r); if the ith arc of r is linked to a concept c in u, then the ith arc of πr must be linked to πc in v.

πu is a sub-graph of v called a projection of u in v. However, projection alone is not sufficient for IR, because it retrieves only exact answers; IR requires a search mechanism that retrieves and ranks approximate answers [55].


The computational complexity involved in graph derivation also makes it impractical for IR. IR researchers have proposed a number of CG similarity measures for retrieval; we discuss a few of them in the next chapter. In chapter 6 of this thesis, we propose and evaluate new similarity measures suitable for information retrieval and other related tasks.
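For completeness, a brute-force sketch of testing whether a projection from u to v exists is given below; it is exponential in the worst case (real CG matchers prune heavily) and serves only to make the two properties above operational.

def projects(u, v, subtype):
    # u, v: lists of Relation nodes; subtype(a, b) tests a ≤ b in the type lattice
    def extend(rel_u, rel_v, cmap):
        # try to map rel_u onto rel_v, consistently with the concept mapping so far
        if rel_u.type != rel_v.type or len(rel_u.args) != len(rel_v.args):
            return None
        new = dict(cmap)
        for cu, cv in zip(rel_u.args, rel_v.args):
            if not subtype(cv.type, cu.type):
                return None                      # type(πc) must be ≤ type(c)
            if cu.referent is not None and cu.referent != cv.referent:
                return None                      # individual referents must agree
            if new.get(id(cu), cv) is not cv:
                return None                      # same concept mapped to two images
            new[id(cu)] = cv
        return new

    def search(i, cmap):
        if i == len(u):
            return True
        return any((m := extend(u[i], rv, cmap)) is not None and search(i + 1, m)
                   for rv in v)

    return search(0, {})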

2.4. Evaluation of IR Systems
The evaluation of an IR system is the process of assessing how well the system meets the information needs of its users [173]. Evaluating an information retrieval system is a difficult task, as it combines issues from a number of areas including cognition, statistics and man-machine interaction [168]. IR evaluation models can be broadly classified into system-driven models and user-centered models. The system-driven model focuses on measuring how well the system can rank documents, while user-centered evaluation attempts to measure the user's satisfaction with the system. There are a number of aspects of an IR system that can be evaluated, such as the coverage of the collection, the time lag, the presentation format and the user effort required. The aspect that has gained wide acceptance in IR research, and that is also central to this thesis, is the effectiveness of the IR system, i.e. its ability to retrieve relevant documents in response to a user query. A number of effectiveness measures have been formulated in the IR literature (e.g. [170]); we discuss a few of them in the following section. To give a better understanding of the relationship between the various aspects of the retrieval process and the different measures, the correlations between pairs of measures are estimated in [175].

2.4.1. Effectiveness Measures
Effectiveness is purely a measure of the ability of the system to satisfy the user in terms of the relevance of the documents retrieved. Aspects of effectiveness include whether the documents being returned are relevant to the user, whether they are presented in order of relevance, and whether a significant proportion of the relevant documents in the collection is being returned to the user.


Attempts have been made to quantify effectiveness and a number of measures have been proposed. The most commonly used measures of effectiveness are precision and recall, both of which are based on relevance judgments.

2.4.1.1. Precision and Recall
Precision is defined as the proportion of the retrieved documents that are relevant; it can be seen as the probability that a retrieved document is relevant. Recall is the proportion of the relevant documents that have been retrieved, and can be regarded as the probability that a relevant document is retrieved. Precision measures the accuracy of the system, and recall its exhaustiveness. They are computed as:

Precision = (number of relevant documents retrieved) / (total number of documents retrieved)

Recall = (number of relevant documents retrieved) / (total number of relevant documents in the collection)

                   Relevant      Non-relevant
Retrieved          A ∩ B         Ā ∩ B           B
Not retrieved      A ∩ B̄         Ā ∩ B̄           B̄
                   A             Ā

Table 2.2: Relevant and Retrieved Documents

Referring to table 2.2, precision and recall are given as:

Precision = |A ∩ B| / |B|        Recall = |A ∩ B| / |A|

where A is the set of relevant documents and B is the set of retrieved documents; |A| is the number of relevant documents in the collection and |B| is the number of documents retrieved.
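These set-based definitions translate directly into code; the tiny collection below is an invented example used only to exercise the formulas.

def precision(A, B):
    # A: set of relevant documents, B: set of retrieved documents
    return len(A & B) / len(B) if B else 0.0

def recall(A, B):
    return len(A & B) / len(A) if A else 0.0

A = {1, 3, 5}          # relevant
B = {1, 2, 3}          # retrieved
print(precision(A, B), recall(A, B))   # 0.666..., 0.666...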

It is clear from these definitions that the total number of relevant documents in the collection must be known in order to calculate recall. The amount of user effort and time this requires makes it almost impossible in most actual operating environments. To provide a framework for the evaluation of IR systems, a number of test collections have been developed (Cranfield, CACM, TREC, etc.). These document collections are accompanied by a set of queries and relevance judgments. Such test collections have made it possible for IR researchers to evaluate their experimental approaches efficiently and to compare the effectiveness of their systems with those of others. Table 2.3 presents basic statistics for a number of test collections.

Collection      Number of documents     Number of queries
Cranfield       1400                    225
CACM            3204                    64
CISI            1460                    112
LISA            6004                    35
TIME            423                     83
ADI             82                      35
MEDLINE         1033                    30
TREC-1          742,611                 100

Table 2.3: IR test collections

There exists a trade-off between precision and recall, even though high values of both parameters are preferred. A number of researchers have discussed the relationship between recall and precision [19], [28], [57], [10]. Some have modeled precision and recall as continuous functions [57], while others [10] have described them in terms of a two-Poisson discrete model. [19] studied the relationship between precision and recall and suggested that a two-stage, or more generally a multi-stage, retrieval procedure is likely to achieve the goal of improving both precision and recall simultaneously, even though the trade-off between them cannot be avoided.

Figure 2.7: Recall-Precision graph

In order to evaluate the performance of an IR system, recall and precision are almost always used together. Precision values are calculated at different recall levels and a recall-precision graph like the one shown in figure 2.7 is plotted. As a retrieval system is evaluated over several queries, such a graph is usually plotted using precision figures averaged over all queries, as described in [133], [170]. The most standard method of deriving a recall-precision (R-P) graph is to calculate precision values at a set of recall points. Considering that a system may not always retrieve all the relevant documents, and that the number of relevant documents is not the same for all queries, interpolated values are often used.


The interpolation used in TREC states that the precision at a given recall level is the highest known precision at any recall level greater than or equal to the given level. Usually a precision value is interpolated at each of the 11 standard recall levels rj ∈ {0.0, 0.1, 0.2, 0.3, ..., 0.9, 1.0} as:

P(rj) = max P(r)  for  r ≥ rj

where P(r) is the precision at recall point r.
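The interpolation rule can be stated compactly in code. The sketch below, our own illustration, takes a ranked list of document identifiers and the set of relevant identifiers, records a (recall, precision) point at each relevant document found, and interpolates at the 11 standard levels.

def eleven_point_precision(ranked, relevant):
    points, hits = [], 0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / i))   # (recall, precision)
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in (j / 10 for j in range(11))]

print(eleven_point_precision([1, 4, 2, 5, 3], relevant={1, 2, 3}))
# ≈ [1.0, 1.0, 1.0, 1.0, 0.667, 0.667, 0.667, 0.6, 0.6, 0.6, 0.6]

Averaging the non-interpolated precision values at the relevant documents, over all queries, yields the mean average precision (MAP) discussed in the next section.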

2.4.1.2. Other Effectiveness Measures
Single-score precision measures for IR systems have also been proposed. Two such measures are average precision and R-precision. Average precision is the average of the precision values at different recall levels. To evaluate the performance of an IR system, the average precision is calculated for a number of queries; the mean of these values gives the mean average precision (MAP). Geometrically, MAP is the area below the non-interpolated recall-precision curve [175]. Alternatively, the 11-point interpolated average precision (11AvgP) can also be used for calculating MAP, though the non-interpolated measure has the advantage that it rewards systems that give high ranks to relevant documents. The R-precision is the precision after R documents have been retrieved, where R is the total number of documents relevant to the query. A number of other measures have been proposed for assessing retrieval effectiveness [170], [4], [42]. As defined in the previous section, recall is undefined if there is no relevant document in the collection. An alternative measure is fallout, which may be regarded as the inverse of recall; it is undefined only if all the documents in the collection are relevant [133]. It is the ratio of non-relevant documents retrieved to non-relevant documents in the collection:


Fallout = (number of non-relevant documents retrieved) / (number of non-relevant documents in the collection)

Referring to table 2.2, fallout is computed as:

Fallout = |Ā ∩ B| / |Ā|

Considering the fact that different users may have different ideas of relevance and system performance, some measures have been developed that let the user decide whether she is more interested in recall or in precision. For instance, the utility measure U is defined as:

U = α·Nr + β·N̄r + δ·Nn + γ·N̄n

where Nr is the number of relevant documents retrieved, N̄r the number of relevant documents not retrieved, Nn the number of non-relevant documents retrieved, N̄n the number of non-relevant documents not retrieved, and α, β, δ and γ are positive weights specified by the user.

This measure was later simplified by considering only the retrieved documents [82]. The E-measure combines recall and precision in a single score [170] and is defined as:

E = 1 − ((1 + β²)·P·R) / (β²·P + R)

where P is the precision, R the recall, and β the relative importance of P compared with R. Swets [167] developed a model that received some attention in the literature; however, none of these alternative measures has received widespread acceptance. In [13], query-sensitive similarity measures were introduced for the calculation of inter-document relationships; the experimental investigations made there suggest that these measures have the potential to increase the effectiveness of a cluster-based information retrieval system.
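Returning to the E-measure, a small worked example (its complement 1 − E is the familiar F-measure):

def e_measure(P, R, beta=1.0):
    return 1 - ((1 + beta**2) * P * R) / (beta**2 * P + R)

print(e_measure(0.5, 0.25))   # 0.666...: E decreases as effectiveness increases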

2.4.2. User Centered Evaluation
The system-driven model is still the dominant approach to evaluation in IR research: the evaluation is made on a test collection having known relevance judgments. These relevance judgments are usually provided by domain experts and are binary, objective, topical and static in nature, lacking the user's viewpoint. Studies have revealed major disagreements among the experts themselves in providing relevance judgments [63], [65]. Further, judgments of relevance are affected not only by the expertise of the judge, but also by the order of the documents [41], [145]. It has been argued that relevance is not fixed but varies over time [94]. The meaning and the relevance of a document can be different for different users and can be inferred only in the context of the user's situation; relevance, therefore, is subjective, dynamic and multi-dimensional in nature [99], [61], [100]. Another drawback of the system-driven (test-collection) approach is that it removes the end users from the retrieval process, substituting them with the queries and judgments provided with the test collection. Although this allows fast experimentation, it makes it difficult to evaluate the effect of interactive IR techniques and is suitable only for non-interactive environments [40]. In an interactive setting, a user normally starts with a query and goes through many refinements before eventually getting the desired documents; the test-collection approach poses problems in such an environment. As the performance of an IR system will eventually be measured in terms of its ability to retrieve documents relevant to the query, it seems realistic to follow a user-centered approach to evaluation, which would give a much more direct measure of the overall goal. These limitations have led researchers to realize the need for interactive, user-centered methods of evaluation [165], [141], [6], [14]. A number of alternative measures have been proposed for interactive IR, including relative relevance (RR), ranked half-life (RHL) and cumulated gain.


Details of these measures can be found in [66], [14], [71]. The subjective nature of interactive IR has been highlighted in [14], where an attempt was made to integrate cognitive theory into IR evaluation; however, the effort in this direction has been very limited. A task-oriented, user-centered, non-interactive evaluation methodology was proposed in [117], in which the basic unit of evaluation is the task instead of the query. More recently, an interactive IR evaluation model has been proposed in [15], [16]; its key elements are the use of realistic scenarios (simulated work tasks) and alternative performance measures such as relative relevance (RR) and ranked half-life. User-centered evaluation methods, however, are expensive both in terms of time and in terms of resources. A properly designed user-centered evaluation, with few exceptions, requires a sufficiently large representative sample of the actual users of retrieval systems; the systems to be compared must be equally well developed and equipped with appropriate user interfaces, and the subjects must be trained on these systems. Further, it is difficult to develop a standard interactive evaluation methodology that allows comparison across different systems and users [117]. Because of these considerations, recall and precision are still the most popular and standard measures for evaluating IR system performance, even though they have been criticized for their inability to reflect the dynamic, situational and subjective nature of the information seeking process and to handle users' evaluation criteria. We have also used these most widely accepted evaluation measures, viz. recall and precision, in our work.


Chapter 3
Basic Framework: Document and Query Representation, Matching and Similarity Measures

3.1. Introduction
Documents themselves are not information. In order to perform retrieval, the documents and the query must first be represented in a computable form. The query representation is then compared with the document representations to obtain a similarity value, on the basis of which retrieval is performed. Different retrieval models represent documents and queries in different forms and use different similarity values. This chapter discusses the issues involved in the design and development of the feature-vector representation used in the vector space model and of the conceptual graph representation (used as the semantic model in this work), as well as the similarity measures involved in both cases. It includes the details of both models, building the framework to be used in the hybrid model discussed in chapter 4. First, the vector space model is discussed in section 3.2, and an automatic indexing procedure is given. Next, in section 3.3, the use of conceptual graphs in information retrieval is elaborated: conceptual graph matching algorithms proposed by various researchers are briefly reviewed and an automatic procedure for representing text as a set of conceptual graphs is proposed.


3.2. Vector Space Model
The vector space model (VSM) is one of the major classical and well-studied retrieval models. It represents the documents and the queries as vectors of features representing the terms that occur within them [131], [130]. Each document is thus characterized by a boolean or numerical vector. These vectors are represented in a multi-dimensional space in which each dimension corresponds to a distinct term in the corpus of documents being characterized. In the simplest form, each feature takes the value zero or one, indicating the absence or presence of a term in a document or query. More generally, a numerical value is assigned to each feature, usually some function of the frequency of the term. Given a finite set of n documents D = {d1, d2, d3, ..., dn} and a finite set of m terms T = {t1, t2, t3, ..., tm}, each document is represented by a column vector of weights (w1j, w2j, w3j, ..., wij, ..., wmj)^t, where wij is the weight of term i in document j. The document collection as a whole is represented by an m × n term-document matrix:

    | w11  w12  ...  w1j  ...  w1n |
    | w21  w22  ...  w2j  ...  w2n |
W = | ...  ...       ...       ... |
    | wi1  wi2  ...  wij  ...  win |
    | wm1  wm2  ...  wmj  ...  wmn |

Different term-weighting functions have been introduced by researchers [130]; we discuss and experimentally investigate some of them in chapter 4. The next subsection discusses how a document collection can be reduced to a matrix of term weights automatically.
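As a small illustration of the term-document matrix, the sketch below builds sparse weight vectors with the classic tf-idf function and ranks documents by cosine similarity. Both the weighting and the toy documents are illustrative choices of ours; the weighting schemes actually compared in this thesis are investigated in chapter 4.

import math
from collections import Counter

def tfidf_vectors(docs):
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))   # document frequencies
    n = len(docs)
    weight = lambda tf, t: tf * math.log(n / df[t])          # w_ij = tf * idf
    return [{t: weight(tf, t) for t, tf in Counter(doc).items()}
            for doc in tokenized], df, n

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

docs = ["conceptual graph retrieval", "vector space retrieval model", "graph theory"]
vecs, df, n = tfidf_vectors(docs)
query = {t: math.log(n / df[t]) for t in ("graph", "retrieval")}
print(sorted(range(n), key=lambda i: -cosine(query, vecs[i])))   # [0, 2, 1]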

3.2.1. Indexing
For a small document collection, it may be possible for the IR system to scan the actual documents to decide whether they are relevant to a query. For a large document collection, however, this poses practical problems. Hence, the raw document collection is usually transformed into an easily accessible representation.


collection is transformed into an easily accessible representation. The process of transforming document text into such a representation is known as indexing. A number of indexing techniques exist; most of them involve identifying good document descriptors, such as keywords or terms, to describe the information content of the documents. A good descriptor is one that helps in describing the content of a document and in discriminating it from the other documents in the collection. Indexing thus simply means representing text (the query and the document) as a set of terms whose combined semantic meaning is equivalent, in some sense, to the content of the original text. Multi-word terms can also be used to represent a dimension of the vector space. These multi-word terms can be obtained by looking for frequently appearing sequences of words in the documents [29], [20], through N-grams or POS tagging [72], by applying NLP to identify meaningful phrases [163], or by hand crafting. Some of these techniques simply extract phrases based on adjacency information, while others attempt to capture underlying semantic concepts. A probabilistic model for automatically extracting English noun phrases without part-of-speech tagging or any syntactic analysis has been proposed in [27]. The use of part-of-speech tagging can help in extracting meaningful sequences of words. It helps in handling sense ambiguity, as words are assigned part-of-speech tags based on their local (sentential) context; this may be particularly useful in cross-lingual information retrieval. Statistical approaches to phrase extraction are more efficient, but they fail to handle word order changes and structural variations, which can be handled more appropriately by syntactic approaches.

In TREC conferences, the method used for phrase extraction is as follows:

1. Any pair of adjacent non-stop words is regarded as a potential phrase.
2. The final list of phrases is composed of those pairs of words that occur in, say, 25 or more documents in the collection.

The results of using multi-word phrases are mixed. Croft et al. [35], and Lewis and Jones [82], found them no more effective than single-word terms, whereas improved accuracy has been reported in [29]. The Cornell group reported an increase in retrieval


precision; however, the improvement due to phrases went down from 7% to 1% [98]. More recently, Mittendorf et al. [97] also remarked that the addition of phrases yields only a moderate improvement over purely word-based retrieval, or no improvement at all. As no significant improvement has been reported through the use of multi-word phrases, we have considered vector spaces defined by single-word terms only in our work. The choice of index terms and weights is a difficult theoretical (e.g. linguistic or semantic) and practical problem [39], and several techniques can be used to cope with it. Luhn [87] assumed that the frequency of word occurrence in an article gives a meaningful identification of its content and hence can be used to extract words to represent the document. Usually these words go through specific forms of lexical processing, such as stemming and stop word elimination, before they become indexing features. As considering all the forms of terms appearing in the documents would make the representation voluminous, these lexical steps are quite useful in reducing the number of terms used to represent document content. After the application of these lexical steps, weights are assigned to the remaining terms.

3.2.1.1. Stemming

Stemming normalizes morphological variants, though in a crude manner. It removes suffixes from words to reduce them to a common root form; e.g. the words "compute", "computing", "computes" and "computer" are all reduced to the same word stem "comput". The terms representing the dimensions of the vector space are thus stems, not the actual words. The most widely used stemming algorithm was developed by Porter [113]. Table 3.1 gives an example of the representation of the sample documents, shown in figure 3.1, after stemming. One of the problems associated with stemming is that it throws away useful distinctions. In some cases it helps by conflating similar terms, resulting in increased recall; in others it is harmful, resulting in reduced precision (e.g. when documents containing the term 'computation' are returned in response to the query "personal computer"). A comparison of various stemming methods and the un-stemmed representation for retrieval can be found in [50].


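To see the conflation concretely, the following snippet runs the variants above through an off-the-shelf Porter implementation (NLTK's, an external library used here purely for illustration and not part of the system described in this thesis):

```python
from nltk.stem import PorterStemmer  # assumes NLTK is installed

stemmer = PorterStemmer()
for word in ("compute", "computing", "computes", "computer", "computation"):
    print(word, "->", stemmer.stem(word))
# all five reduce to the stem "comput", which is why a query about
# "personal computer" can also pull in documents about "computation"
```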

3.2.1.2. Stop Word Elimination

Another common lexical processing step for index terms is the elimination of stop words. Stop words are high-frequency words that have little semantic weight and are thus unlikely to help the retrieval process; such words have no topical specificity. Typical examples of stop words are articles ("a", "an", "the") and prepositions (e.g. "in", "of", "for", "at"). Eliminating these words can result in a considerable reduction in text volume without losing any significant information. The drawback of eliminating stop words, however, is that it can sometimes remove terms useful for searching, for instance the stop word "A" in "Vitamin A". A list of stop words for the CACM collection is given in Appendix A.

Figure 3.1: Sample documents

Stemmed terms   Doc1   Doc2   Doc3
inform            0      0      1
intellig          0      0      1
model             1      1      0
probabilist       0      1      0
retriev           0      1      1
space             1      0      0
techniqu          0      0      1
vector            1      0      0

Table 3.1: Vector representation of sample documents after stemming


3.2.1.3. Term Weighting

Each term that is selected as an indexing feature for a document may act as a discriminator between that document and all the other documents in the corpus. As discussed earlier, Luhn [87] attempted to quantify the discriminating power of terms through their frequency of occurrence (term frequency) within the document. Luhn postulated that the most discriminating (content bearing) terms are mid-frequency terms: high frequency terms are function words and can be discarded, and low frequency words can also be discarded, as they are obscure and less likely to appear in a query. This postulate can be refined by noting that:

1. The more often a document contains a given word, the more that document is about the concept represented by that word.
2. The fewer the documents in a collection in which a term occurs, the more discriminating that term is.

3.2.1.3.1. Term Frequency (TF)

The first factor above simply means that terms that occur more frequently in a document (function words excepted) represent its meaning more strongly than those occurring less frequently, and hence should be given higher weights. In the simplest form, this weight is the raw frequency of the term in the document.

3.2.1.3.2. Inverse Document Frequency (IDF)

The second factor noted above considers the term distribution across the document collection. Terms occurring in few documents are useful for distinguishing those documents from the rest of the collection; conversely, terms that occur frequently across the entire collection are less helpful in discriminating among documents. This calls for a measure that favors terms appearing in fewer documents. The fraction n/ni, where n is the total number of documents in the collection and ni is the number of documents in which term i occurs, gives this measure. It assigns the lowest weight, 1, to a term that appears in all documents and the


highest weight of n to a term that occurs in only one document. As the number of documents in any collection is usually large, the log of this measure is usually taken, resulting in the following form of the inverse document frequency (idf) term weight:

$$idf_i = \log\left(\frac{n}{n_i}\right)$$

Inverse document frequency attaches more importance to more specific terms: if a term occurs in all documents in the collection, its idf is 0. A number of researchers have attempted to include term distribution in the weighting function [159], [131] to give a more accurate quantification of term importance. Sparck Jones [159] showed experimentally that a weight of log(n/ni) + 1, termed inverse document frequency (idf), leads to more effective retrieval. Later researchers combined the term frequency (tf) and the inverse document frequency (idf) weights, resulting in a family of tf × idf weighting schemes having the following general form:

$$w_{ij} = tf_{ij} \times \log\left(\frac{n}{n_i}\right)$$

According to this scheme, the weight of term i in document j is the product of the within-document frequency of the term and the log of its inverse document frequency within the collection. The tf × idf weighting scheme thus combines both 'local' and 'global' statistics to assign term weights. Salton et al. [135] noted that a high-performance term weighting system assigns large weights to terms that occur frequently in particular documents but rarely in the rest of the collection, because such terms are able to distinguish the items in which they occur from the remainder of the collection. The tf × idf weighting scheme fulfils this criterion. Many variations of the tf × idf measure have been reported [130] that normalize the tf and idf factors in different ways to allow for variations in document length. One way to normalize tf is to divide it by the frequency of the most frequent term in the document. This kind of normalization, often termed maximum normalization, yields a value between 0 and 1. Normalization is needed because using


absolute (raw) term frequency to weight terms favors longer documents over shorter ones. After frequency normalization, the weight of a term in a given document depends on its frequency of occurrence relative to the other terms in the same document, instead of its absolute frequency. Similarly, idf can be normalized by dividing it by the logarithm of the collection size (n):

$$w_{ij} = \frac{tf_{ij}}{\max_i(tf_{ij})} \times \log\left(\frac{n}{n_i}\right) \bigg/ \log(n)$$

3.2.1.3.3. Document Length (DL)

A third factor that may affect the weighting function is document length. A term appearing the same number of times is more valuable in a short document than in a long one. This is because long documents often use the same term repeatedly, which inflates the term frequency factor. In addition, a long document uses numerous distinct terms, which increases the number of word matches between a query and a long document. Both these factors increase the chances of a long document being retrieved over shorter ones. To compensate for these effects, document length normalization of term weights is often used; this diminishes the advantage that long documents have in retrieval. Different choices for these three factors yield different term weighting schemes. Figure 3.2 shows the result of applying these indexing steps to a sample document, where raw term frequency has been used to assign weights to terms.

3.2.1.3.4. Combining the three factors

Any term weighting scheme can be represented by a triple ABC. The letter A in this triple represents how the term frequency component is handled, B indicates how the inverse document frequency component is incorporated, and C represents the document length normalization component. Possible options for each of the three dimensions of the triple are shown in Table 3.2. Different combinations of options can be used to represent the document and the query vectors. The retrieval models


themselves can then be represented by a pair of triples, such as nnn.nnn (doc = "nnn", query = "nnn"), where the first triple corresponds to the weighting strategy used for the documents and the second to the weighting strategy used for the query.

Figure 3.2: Results of application of indexing steps

There are many possible ways to compute each component, as shown in Table 3.2. One useful variation in computing term frequency reflects the fact that the first occurrence of a term is more important than successive repeated occurrences. Thus, term frequency can be computed as 0.5 + 0.5 × (tfij / max tf), in which normalization is achieved by dividing tf by the maximum tf value of any term in the document, or as ln(tfij) + 1.0, known as logarithmic term frequency. The former is called 'augmented normalized term frequency'; it causes term frequency to vary between 0.5 and 1. The problem with 'maximum normalization' and 'augmented normalization' of the tf component is that a single term with an unusually high frequency in a document may degrade the weights of the other terms significantly. This effect is less pronounced for augmented term frequency, as the highest frequency term cannot degrade the frequency of any other term below 0.5.


Term frequency within document (A)

  n   tf = tf_ij                                            raw term frequency
  b   tf = 0 or 1                                           binary weight
  a   tf = 0.5 + 0.5 × (tf_ij / max tf in D_j)              augmented term frequency
  l   tf = ln(tf_ij) + 1.0                                  logarithmic term frequency
  L   tf = (ln(tf_ij) + 1.0) / (1.0 + ln(mean tf in D_j))   average term frequency based normalization

Inverse document frequency (B)

  n   wt = tf                                               no conversion
  t   wt = tf × ln(n / n_i)                                 multiply tf by idf

Document length (C)

  n   w_ij = wt                                             no conversion
  c   w_ij = wt / sqrt( Σ_i wt_i² )                         cosine normalization
  u   w_ij = wt / ((1 − slope) × mean(nt) + slope × nt)     pivoted unique normalization

where nt is the number of unique terms in the document and mean(nt) is the average number of unique terms per document in the collection.

Table 3.2: Calculating weights with different options for the three weighting factors

The logarithmic term frequency reduces the effect of unusually high-frequency terms within a document, and also the importance of raw term frequency in a document collection consisting of documents with significant variations in length. It actually decreases the effect of all sorts of variations in term frequency, since

$$\frac{\log(tf_1) + 1}{tf_1} < \frac{\log(tf_2) + 1}{tf_2}$$

for any two term frequencies tf1 > tf2 > 0. For example, for tf1 = 4 and tf2 = 2, (ln 4 + 1)/4 ≈ 0.60 while (ln 2 + 1)/2 ≈ 0.85, so the contribution per occurrence falls as the raw frequency grows.


Different choices of A, B and C for the query and document vectors yield different retrieval models, for example ntc-ntc, lnc-ltc, etc. The choices for the term frequency component are "n" (use the raw term frequency), "b" (binary, i.e. neglect term frequency: term frequency is 1 if the term is present in the document, 0 otherwise), "a" (augmented normalized term frequency), "l" (logarithmic term frequency) and "L" (logarithmic term frequency normalized by average term frequency). The options for the inverse document frequency component are "n" (use 1.0, i.e. ignore the idf factor) and "t" (use idf). The options listed in Table 3.2 for document length normalization are "n" (no normalization), "c" (cosine normalization) and "u" (pivoted unique normalization). For cosine normalization, every element of the term weight vector is divided by the Euclidean length of the vector. This is called 'cosine normalization' because the length of the normalized vector is 1 and its projection on any axis of the document space gives the cosine of the angle between the vector and that axis. Pivoted unique normalization is based on normalizing the overall vector length by a factor dependent on the number of unique terms. The widely known weighting scheme ntc-ntc normalizes both the document and the query term weights into the range 0–1 and may prove beneficial. The weighting scheme lnc-ltc means that document term weights are computed as the product of the logarithmic term frequency (l) of the given term and 1.0 (n), with cosine normalization (c) of the document vector; the query term weights are computed in the same way except that each query term weight is also multiplied by the idf (t) of the given term in the document collection. More recent weighting schemes integrate document length within the weighting formula, yielding more complex retrieval models, for example the Okapi probabilistic search model [119] and the doc="Lnu" model [53]. In TREC-3, the three systems with the best base performance were Okapi, INQUERY [18] and Cornell's Smart [134], with the best performance reported by the Okapi system. Okapi uses the BM25 weighting algorithm introduced by the developers of the probabilistic model during TREC-2 [118] and TREC-3 [119]. Robertson and Walker [124] motivated the best match (BM)


algorithms by the probabilistic model and by some simple approximations to the 2-Poisson model, but indicated that their result was as much guided by experimentation as by theory [67]. In [148], Singhal et al. pointed out that the success of BM25 on the TREC collection was mainly due to the violation of the implicit assumption that "relevance of a document is independent of its length": in the TREC collection, longer documents have a higher chance of being judged relevant than shorter documents. It remains to be established whether this characteristic holds for other document collections as well. With the usual cosine normalization, short documents are favored over long documents, particularly when the short documents are single-topic documents relevant to the query and the longer documents are multi-topic documents of which only one topic is relevant to the specific query [78]. Singhal et al. [148] proposed 'pivoted normalization' to handle this problem. Considerable research effort has been devoted to refining term weighting methods; as a result, a large number of term weighting schemes have been proposed in the IR literature. Further details regarding weighting functions can be found in [130], [133], [171].
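As an illustration of the triple notation, the following minimal Python sketch computes "lnc" document weights and "ltc" query weights; the helper names are invented for this example, and the component definitions follow Table 3.2.

```python
import math

def lnc_ltc(doc_tf, query_tf, doc_freq, n_docs):
    """Return (document, query) weight vectors under the lnc.ltc scheme."""
    def cosine_normalize(vec):
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()}

    # l: logarithmic tf, n: no idf, c: cosine normalization
    doc = cosine_normalize({t: math.log(tf) + 1.0 for t, tf in doc_tf.items()})
    # l: logarithmic tf, t: multiply by idf, c: cosine normalization
    query = cosine_normalize({
        t: (math.log(tf) + 1.0) * math.log(n_docs / doc_freq[t])
        for t, tf in query_tf.items()})
    return doc, query
```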

3.2.2. Indexing Algorithm

In order to perform retrieval using the vector space model, the documents in the collection are preprocessed to obtain vectors of term weights. A simple automatic method for obtaining an indexed representation of the documents is given below:

Step 1. Tokenization: This step extracts the individual terms (words) in the document, converts all words to lower case and removes punctuation marks. The output of this stage is a representation of the document comprising a stream of terms.

Step 2. Stop word elimination: Removes words that occur very frequently across the document collection and carry little content, e.g. "are", "is", "this" (see Appendix A for the list of stop words used in this work).

Step 3. Stemming: Reduces the remaining terms to their linguistic root forms to obtain index terms.


Step 4. Term weighting: Assigns weights to terms according to their importance in the document, in the collection, or some combination of both.

Figure 3.3: Steps in preprocessing documents

Figure 3.3 explains these preprocessing steps. Queries in the collection are also processed in the same way. Retrieval is based on the similarity measurement between the documents and the query.
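A minimal end-to-end sketch of these four steps in Python is given below; the crude suffix-stripping function merely stands in for a real stemmer such as Porter's, and the stop word list is truncated for brevity.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "is", "are", "this", "of", "in", "for"}  # truncated

def crude_stem(word):
    # stand-in for the Porter stemmer: strip a few common suffixes
    for suffix in ("ing", "es", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_document(text):
    tokens = re.findall(r"[a-z]+", text.lower())           # Step 1: tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]    # Step 2: stop words
    return Counter(crude_stem(t) for t in tokens)          # Step 3: stemming

def weight_collection(texts):
    tf_vectors = [index_document(t) for t in texts]
    n = len(tf_vectors)
    df = Counter(term for vec in tf_vectors for term in vec)
    # Step 4: tf x idf term weighting
    return [{t: tf * math.log(n / df[t]) for t, tf in vec.items()}
            for vec in tf_vectors]
```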

3.2.3. Similarity Measures

The vector space model represents documents and queries as vectors in a multidimensional space. Retrieval is performed by measuring the 'closeness' of the query vector to each document vector; the documents can then be ranked according to the numeric value of their similarity to the query. In the vector space model, the documents selected are those geometrically closest to the query according to some measure. The model relies on the intuitive notion that similar vectors represent semantically related documents. Figure 3.4 gives an example of document and query representation in a two-dimensional vector space whose dimensions correspond to the two index terms ti and tj. The sets of terms occurring in the


document and the query are shown. Documents d1, d2 and d3 are represented in this space using term weights as coordinates. The angles between these documents and the query are denoted θ1, θ2 and θ3 respectively.

Figure 3.4: Representation in two-dimensional vector space

The simplest way of comparing a document and a query is to count the number of terms they have in common. One quite often employed similarity measure is the 'inner product' between the query and the document vector, given by:

$$sim(d_j, q_k) = \langle d_j, q_k \rangle = \sum_{i=1}^{m} w_{ij} \times w_{ik}$$

where m is the number of terms used to represent the documents in the collection. Other measures attempt to normalize the similarity value by the lengths of the document and the query, e.g. the Dice coefficient, Jaccard's coefficient and the cosine coefficient. The Dice coefficient defines the similarity between query k and document j as:

$$sim(d_j, q_k) = \frac{2 \sum_{i=1}^{m} w_{ij} \, w_{ik}}{\sum_{i=1}^{m} w_{ij}^2 + \sum_{i=1}^{m} w_{ik}^2}$$

Jaccard's coefficient is defined as:

$$sim(d_j, q_k) = \frac{\sum_{i=1}^{m} w_{ij} \, w_{ik}}{\sum_{i=1}^{m} w_{ij}^2 + \sum_{i=1}^{m} w_{ik}^2 - \sum_{i=1}^{m} w_{ij} \, w_{ik}}$$

The cosine measure is one of the most commonly used similarity measures in IR. It computes the cosine of the angle between the document and the query vector, giving a similarity value between 0 and 1. A minimum value of 0 (angle 90°) indicates that the vectors are unrelated (i.e. have no terms in common), and a value of 1 (angle 0°) means that the vectors point in the same direction. If dj and qk are the document and the query vector respectively, the cosine similarity is computed as:

$$sim(d_j, q_k) = \frac{(d_j, q_k)}{\lVert d_j \rVert \, \lVert q_k \rVert} = \frac{\sum_{i=1}^{m} w_{ij} \, w_{ik}}{\sqrt{\sum_{i=1}^{m} w_{ij}^2} \times \sqrt{\sum_{i=1}^{m} w_{ik}^2}}$$

One of the problems with cosine similarity is that it produces quite low similarity values for long, multi-topic documents, as noted by many researchers [20], [79], [148]. Lee [79] attempted to solve this problem by merging the results of cosine and term frequency normalization. Buckley et al. [20] tried to combine local similarity (sentence and paragraph similarity) with global similarity. Singhal et al. [148] proposed pivoted normalization and improved term frequency normalization to handle the document length bias; pivoted normalization counterbalances the advantage that a short document dealing entirely with the query topic has over a long document dealing with the query topic and several other non-relevant topics. The weighting scheme proposed by them reduces the importance of the local similarity measures. Passage retrieval is another way of handling multi-topic long documents [23]; passages can be logical sections, e.g. paragraphs or sentences [180], [136], or fixed-size, overlapping or disjoint textual windows [1], [23]. These weighting schemes are well suited to the retrieval environment but pose problems in a routing environment, where no fixed document collection exists. Usually a training set of documents is used to compute the statistics in such an environment, and it is assumed that


the subsequent documents arriving at the system will have the same statistical properties as the training set.

3.3. Conceptual Graphs in IR

The previous section gave a detailed discussion of the vector space model used in this work. In order to apply the techniques described in subsequent chapters to text documents, we need to represent documents in a way that captures more semantics than terms alone. The knowledge representation formalism used to make this possible is the conceptual graph, which has been introduced in chapter 2. The remainder of this chapter discusses the use of conceptual graphs in information retrieval.

3.3.1. Conceptual Graph as a tool for meaning representation

In Sowa's work, a concept may be abstract or concrete. The roots of Sowa's analysis of conceptualization lie in Ogden and Richards' meaning triangle (see figure 3.5), which relates an object in the world to the concept, thought or idea it evokes; both the concept and the referent are in turn related to lexical tokens or symbols. The meaning triangle is a formalized representation of the belief that people remember particular

Figure 3.5: The meaning triangle [155] (Sowa, 1984)


facts and general principles. In a given communication, a word's specific meaning is intended as a specific instance of the general class; mapping a word to general and specific objects is the translation of symbol to meaning. Sowa outlined the use of CGs as a knowledge representation language [154]: "Conceptual graphs form a knowledge representation language based on linguistics, psychology and philosophy". Sowa proposed an abstract model that can be used at different levels. At a conceptual level, it can be the basis for a specialized communication language between specialists of different disciplines involved in a common cognitive task. At an implementation level, it can be the basis for a common representation tool used by a complex system integrating knowledge and databases, inference engines, sophisticated human-computer interfaces, etc. [24]. CGs were initially designed to facilitate natural language analysis and understanding and as a way to represent typed logic. Since they are similar to semantic networks, they have also been applied to knowledge representation. Sowa describes conceptual graphs as logical forms that state relationships between concepts and so represent meaning [154].

3.3.2. Conceptual Graphs for Natural Language Document Representation

The CG is a good formal language for representing the meaning of natural language sentences [183]. Conceptual graphs express meaning in a precise, readable and computable way, which makes them applicable to more effective information and knowledge processing. Knowledge in general is better represented in a declarative style than in a procedural one, and CGs are well adapted to this paradigm. Specific features of the CG formalism allow the handling of most linguistic problems, such as the treatment of referents and context. With their direct mapping to language, CGs serve as an intermediate language for translating computer-oriented formalisms to and from natural language. With their


graphic notation, they serve as a readable but formal design and specification language. The CG formalism is suitable for modeling both the data in a database and the queries posed to it; this makes CG theory applicable to IR research. Applications that use the CG formalism range from database interfaces to text retrieval and natural language processing. CGs can provide a format that unifies the linguistic need for context with the knowledge base.

Figure 3.6: General Architecture of Conceptual graph Construction

3.3.3. Representing Documents as a set of Conceptual Graphs

Natural language processing techniques have been used to create the conceptual graph representation of textual documents. Figure 3.6 depicts the general architecture of conceptual graph construction. Documents are first tagged; the tagged representation is then parsed to generate a structured representation, and a conceptual graph is constructed for each parsed sentence. For constructing the conceptual graph, syntactic patterns in a


sentence are identified. A sentence itself is thus considered as consisting of many fragments (syntactic patterns), and a conceptual graph is constructed for each such pattern. Subordinate clauses are treated as independent sentences and their conceptual graphs are constructed in the same way. These various segments are related to each other through the relationships identified by verbs, prepositions, etc.

Figure 3.7: A subset of conceptual relations

3.3.3.1. An Alternative to Concept Hierarchy Derivation

As discussed in section 2.3.3, each conceptual graph is defined in relation to a support. We have used a small set of basic relations (TR), shown in fig. 3.7, which can be identified from syntactic patterns. The terms appearing in a document are all


considered individual markers, and all are of the same type "concept", which is the only type in the set of concept types. Thus, the support is ({"concept"}, TR, T), where T is the set of terms used to represent documents. A partial concept hierarchy has been created manually and used for the work reported in chapters 4 and 5. As an alternative, a set of terms that can be substituted for each other has been maintained and utilized for the work reported in chapter 6. Similarly, we have used some heuristics to replace one conceptual relation with another in certain situations.

3.3.3.2. Identifying Relations

Sowa [154] provided a catalogue of fundamental conceptual relations (conrels); Appendix C of this dissertation contains a subset of that catalogue with appropriate descriptions and examples where needed. The basic types of relations in which a verb is involved are 'agnt' (agent) and 'ptnt' (patient) or 'obj' (object); however, verbs can exhibit many different semantic roles. We have used syntactic patterns to identify the constituents of a sentence. Some of the syntactic patterns used for noun phrases are shown in Table 3.3; the conceptual graphs of these noun phrases depend on their syntactic structure. For pattern (6), we create concepts representing the noun and the adjective and link them by an 'attr' relation. For example, consider the pattern:

An (DT) efficient (JJ) structure (NN)

Its CG will be:

[structure] → (attr) → [efficient]

The determiners inform about the type of referents; however, we have not considered referents while creating CGs. Complex nominals can be represented as a single concept, as in [101]. Following this notation, the representation of pattern (9) is a single concept node whose concept corresponds to the noun sequence with the nouns separated by a dash (-). The representation of 'The {DT} information {NN} retrieval {NN}' would therefore be:


1. NP → NN
2. NP → NNP
3. NP → PRP
4. NP → DT NN
5. NP → NP PP
6. NP → DT JJ NN
7. NP → NNP NNP
8. NP → CD NNS
9. NP → DT NN NN
10. NP → NP SBAR

Table 3.3: Syntactic patterns for noun phrases

[information-retrieval]

However, this representation creates problems during matching: it fails to match 'retrieval of information' or 'retrieve information'. In order to capture these structural variations, we represent both nouns as separate concepts and combine them by a generic relation 'mod'. Its CG representation is therefore:

[retrieval] → (mod) → [information]

To handle structural variations we create additional CGs corresponding to the different variants. For documents, we create just a single conceptual graph for noun sequences, using the 'mod' relation; for queries, we additionally include alternative conceptual graphs corresponding to the possible structural variants. In the case of an adjective qualifying a compound noun sequence (a pattern such as JJ NN NN), the last noun in the sequence is used in the 'attr' relation. For example, the conceptual graph generated for 'flexible {JJ} information {NN} retrieval {NN}' is:

[flexible] ← (attr) ← [retrieval] → (mod) → [information]


This gives an exact match with 'flexible retrieval'. Pattern (10) is transformed into a CG by creating concept nodes corresponding to the words tagged NN and CD and linking them with a generic relation 'MEAS'. In Sowa's original notation, this is represented by referents. Many different semantic roles are subsumed by 'MEAS': it may represent quantity as in "10 passengers", distance as in "10 kilometers", or weight as in "5 kilograms". All prepositional relations have been expressed using a generic relation 'PMOD'. The actual semantic relations represented by these generic relations can differ. For example, the prepositional relation specified using 'at' can be assigned the semantic role 'location' in the phrase 'arrived at the station' but not in 'available for sale at a reasonable cost'. Similarly, the semantic relations 'source' and 'destination' could replace the prepositional relation (PMOD) representing the relationships expressed by the prepositions 'from' and 'to' in the phrase 'flight from Delhi to Chennai'. Some constraints need to be checked while assigning semantic roles to relations: for example, a common requirement for a constituent to fill the role of 'agent' is that it be of type animate or organization, while a constituent filling the role of 'instrument' has to be inanimate. It is difficult to identify semantic relations automatically. For example, the preposition 'for' is involved in a different semantic relation between the constituent concepts in each of the following sentences:

(i) Danish bought a book for Daniya.
(ii) Adiba bought a barbie doll for 500 rupees.
(iii) The train left for Agra.
(iv) The customer asked for quick service.
(v) There is no justification for attacking the findings of the judge.

The basic patterns for identifying prepositional phrases are:

PP → IN NP
PP → TO NP

A CG is constructed for the noun phrase, and the preposition is used to mark the relation between this phrase and the rest of the relevant phrases in the sentence.


Compound sentences involving wh-determiners (which and that) are broken into two independent sentences, and conceptual graphs are created for the constituents of these sentences. The wh-determiners are replaced by the concepts they refer to wherever possible; these concepts provide a means for relating the constituent sentences. The level of semantic detail to be captured within the conceptual graph depends on the nature of the application and the domain in which it is to be used. The generic relations used to represent relations between prepositions, verbs and noun concepts can be replaced by semantic relations with the help of a relation ontology. The assignment of semantic roles, however, is a domain dependent process that is very difficult to perform automatically. To some extent, traditional parsing approaches suggest semantic relations corresponding to syntactic ones, but they require that all possible semantic roles be recognized and manually encoded in grammars and lexicons. Developing such grammars is both time consuming and tedious, and such systems usually work well only within a limited domain [166]. Resorting to a manual solution for the assignment of semantic roles may be viable in a specific domain, but for IR in general it is almost impossible, particularly because of the size of the document collections with which an IR system has to deal. Further, a detailed automatic analysis of the semantics of text requires full text analysis, a predefined description of every word expected in the input text, and at least an upper-level ontology. Full text analysis poses problems when dealing with large document collections, and ontologies are still in the stage of development. Some existing lexical resources, such as WordNet, do provide both descriptions of words and a number of semantic relations between them, but they are still not sufficient. For example, it is not possible with WordNet to identify the is_a relationship that exists between 'concurrent program' and 'parallel program', nor to identify that 'genetic algorithm' and 'neural networks' are AI techniques. This clearly shows the limitations of existing lexical resources. We argue that a deep semantic analysis of documents is neither possible nor desirable for the general IR problem. The amount of text to be analyzed is so huge that it


seems impractical to carry out detailed semantic interpretation, and existing lexical resources cannot be utilized effectively across different domains. In a specific domain, it may be feasible to build a domain ontology by hand, because of the following characteristics of domain-specific sentences identified by [185]:

(1) Vocabulary is limited.
(2) Word usage has patterns.
(3) Semantic ambiguities are rare.
(4) Terms and jargon of the domain appear frequently.

These characteristics have the obvious implication that the relations used are limited and that sentences having similar meanings usually have similar patterns, thereby making it relatively easy to construct an ontology pertaining to a specific domain. In an unrestricted domain, however, this is a Herculean task. Therefore, we focus on techniques that make use of information that can be extracted easily, and that can be found within the document itself, without requiring much domain knowledge. Rather than treating a document as a bag of words bearing no relation to each other, we treat it as a set of words involved in different types of relationships with each other. For the purpose of IR, we avoid detailed semantic interpretation of text and rely on a CG model that can be extracted easily from the text without requiring much domain knowledge.

3.3.3.3. The Algorithm

The following is a step-by-step description of how the CG representation of a document is arrived at:

1. The Trigrams 'n' Tags (TnT) part-of-speech tagger has been used in this work. (Use of TnT was possible through an evaluation license agreement.) This tagger requires that the input be converted into a particular format, so the first step is to preprocess the document collection to convert it into a form acceptable to the TnT tagger (see Appendix B-I for the tagged and untagged file formats).


2. Next, the documents are tagged using the TnT tagger (see Appendix B-II for details of the tag set used). The tagged representation of a document is processed to make certain substitutions and modifications. These include the removal of determiners, pre-determiners, modal words, wh-determiners and wh-pronouns, and of a few patterns like 'such that' and 'so that' ('as well as' is replaced with 'and'), etc.

3. The tagged and processed document is input to a sentence extractor, which extracts sentences. Each extracted sentence of the tagged text is then passed through four modules, each devoted to identifying certain types of relationships between concepts:

• Preposition and adverb handler: extracts prepositional and adverbial relations.
• Noun handler: extracts relations between noun sequences and cardinals.
• Adjective handler: extracts adjectival relations.
• Verb handler: extracts relations between a verb and its subject and object.

Similar steps are followed in obtaining the CG representation of a query. Additionally, we add replaceable terms to get alternative CG fragments. For example, suppose "article", "literature", "paper" and "information" are in the set of replaceable terms for "document"; then, while processing the query "Role of computers in retrieval of scientific documents", additional CGs will be added corresponding to each of these terms. This addition is not counted towards the length of the original query CG. Nor do we repeat the whole query; instead, additional relations corresponding to those existing between "document" and the other concepts appearing in the query are added, replacing "document" with the terms in the above set.
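The following Python sketch suggests how two of these handlers might operate over (word, tag) pairs; the function names, the simplified patterns and the triple output format are invented for this illustration and are far cruder than the modules described above.

```python
def adjective_handler(tagged):
    """Attach an adjective to the head (last) noun of the following noun
    sequence: 'flexible information retrieval' yields
    [retrieval] -> (attr) -> [flexible]."""
    triples = []
    for i, (word, tag) in enumerate(tagged):
        if tag == "JJ":
            j = i + 1
            while j + 1 < len(tagged) and tagged[j + 1][1].startswith("NN"):
                j += 1
            if j < len(tagged) and tagged[j][1].startswith("NN"):
                triples.append((tagged[j][0], "attr", word))
    return triples

def noun_handler(tagged):
    """Adjacent noun pair NN NN yields [noun2] -> (mod) -> [noun1]."""
    triples = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1.startswith("NN") and t2.startswith("NN"):
            triples.append((w2, "mod", w1))
    return triples

sentence = [("flexible", "JJ"), ("information", "NN"), ("retrieval", "NN")]
print(adjective_handler(sentence))  # [('retrieval', 'attr', 'flexible')]
print(noun_handler(sentence))       # [('retrieval', 'mod', 'information')]
```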

3.3.3.4. Example

We now explain these steps by considering the conceptual graph representation of the sample document shown in figure 3.8.


Figure 3.8: Sample Document text

The tagged representation of the document is given below (%% marks a comment):

%%document 1
%%title
analysis {NN} of {IN} the {DT} role {NN} of {IN} the {DT} computer {NN} in {IN} the {DT} reproduction {NN} and {CC} distribution {NN} of {IN} scientific {JJ} papers {NNS} . {.}
%%abstract
the {DT} american {JJ} chemical {JJ} society {NN} has {VBZ} begun {VBN} an {DT} analysis {NN} of {IN} the {DT} role {NN} of {IN} the {DT} computer {NN} in {IN} related {JJ} aspects {NNS} of {IN} the {DT} reproduction {NN} , {,} distribution {NN} , {,} and {CC} retrieval {NN} of {IN} scientific {JJ} information {NN} . {.}

Each sentence of the tagged text is then passed through the four modules discussed above. We now show the outputs produced by these modules on the sentences of the sample document. The first sentence:

Analysis {NN} of {IN} the {DT} role {NN} of {IN} the {DT} computer {NN} in {IN} the {DT} reproduction {NN} and {CC} computer {NN} in {IN} the {DT} distribution {NN} of {IN} scientific {JJ} papers {NNS} .

The LF of the CG produced for the above sentence by module 1 is:

[Analysis] → (PMOD_OF) → [role: #] → (PMOD_OF) → [computer: #] –
    → (PMOD_IN) → [reproduction] → (PMOD_OF) → [papers: #]
    → (PMOD_IN) → [distribution] → (PMOD_OF) → [papers: #]

No CG is produced by the second and fourth modules in this case. The CG produced by the adjective handler is:

[paper] → (attr) → [scientific]


The second sentence:

The {DT} American {JJ} chemical {JJ} society {NN} has {VBZ} begun {VBN} an {DT} analysis {NN} of {IN} the {DT} role {NN} of {IN} the {DT} computer {NN} in {IN} related {JJ} aspects {NN} of {IN} the {DT} reproduction {NN} , distribution {NN} , and {CC} retrieval {NN} of {IN} scientific {JJ} information {NN} .

Module 1:

[Analysis] → (PMOD_OF) → [role: #] → (PMOD_OF) → [computer: #] → (PMOD_IN) → [aspects] –
    → (PMOD_OF) → [reproduction] → (PMOD_OF) → [information: #]
    → (PMOD_OF) → [distribution] → (PMOD_OF) → [information: #]
    → (PMOD_OF) → [retrieval] → (PMOD_OF) → [information: #]

Module 3:

[society: #] –
    → (attr) → [American]
    → (attr) → [chemical]
[aspect] → (attr) → [related]
[information] → (attr) → [scientific]

Module 4:

[begin] –
    → (agnt) → [society]
    → (ptnt) → [analysis]

The display form of the resulting conceptual graph is shown in figure 3.9.


Figure 3.9: Conceptual graph for the sample document

3.3.4. Conceptual Graph Matching Algorithms

Most of the research work on CGs has focused on graph theory and graph algorithms [2], [154], which have been considered the essence of CG theory [25]. Researchers have proposed sound and complete graph derivations with respect to first-order logic semantics and graph theory [56], [179], but these algorithms are non-deterministic, and it seems unreasonable to use non-deterministic graph derivation algorithms for IR applications. Some efficient graph algorithms have been proposed through the use of the CG normal form [103], [2]. However, there is a widespread opinion that any reasonably complete method of structural comparison is not computationally affordable [103]. A number of researchers have attempted to use CGs in IR [37], [83] and have proposed computationally tractable CG matching algorithms. While the general sub-graph isomorphism problem is known to be computationally intractable, matching CGs containing conceptual information appears practical.


Rau [116] proposed a CG matching algorithm that uses semantic relations between concepts to retrieve relevant information in the absence of an exact match. Her algorithm employs the semantic information in the inheritance hierarchy for partial matching; however, no syntactic information was taken into account to retrieve relevant information in the absence of an exact match. Levinson and Ellis [80] developed algorithms that can search a lattice of graphs in logarithmic time, using a partially ordered hierarchy to reduce the search space. The hierarchy, however, did not cover full generalization or full specialization when a complete graph, rather than a sub-graph, is considered. Myaeng and López-López [105] presented a flexible algorithm for CG partial matching. Later, a flexible criterion was proposed by Montes-y-Gómez et al. [101] to quantify the approximate matching expressed in sub-graphs. The similarity measure they proposed is analogous to the Dice coefficient, a well-known similarity measure for weighted keyword representations of documents, and consists of two components: conceptual similarity and relational similarity. The conceptual similarity measures the similarity between the concepts and actions referred to in two pieces of knowledge; the relational similarity measures how similar the contexts of the common concepts are in the two pieces of knowledge. This similarity measure was later modified [102] to take into account synonymy and subtype/supertype relationships between the concepts and relations used in the CGs, and to allow different weights for different types of nodes (entity, action and attribute). Liddy and Myaeng [83] also used the CG matching algorithm proposed in [105] and attempted to normalize the score by taking "connectivity" into account; their experimental results show that (1.05)^(1-x) is the best normalization factor, where x is the number of test units that contain one or more matching CG fragments. Liu [84] investigated partial relation matching: instead of trying to match concept-relation-concept triples, he attempted to match individual concepts with their semantic roles.
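As a rough illustration of the conceptual similarity component, the following Python sketch computes a Dice-style overlap between the concept sets of two graphs; this is a deliberate simplification, and the actual measure in [101] also weights the matched concepts and combines the result with relational similarity.

```python
def conceptual_similarity(graph1_concepts, graph2_concepts):
    """Dice-style overlap between the concept sets of two CGs."""
    c1, c2 = set(graph1_concepts), set(graph2_concepts)
    if not c1 or not c2:
        return 0.0
    return 2 * len(c1 & c2) / (len(c1) + len(c2))

# concepts of a query CG vs. concepts of a document CG
print(conceptual_similarity(
    {"retrieval", "information", "computer"},
    {"retrieval", "information", "papers", "analysis"}))  # 0.571...
```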


In [184], Yang et al. presented a CG matching algorithm that allows both partial and exact matching; it was later modified in [183] for CGIF. [183] described an efficient storage and retrieval system for CGs (STORET) built on top of a commercial DBMS. The retrieval system supported by STORET allows exact as well as semantically and syntactically partial matching, and a type hierarchy is utilized to facilitate semantic partial matching. STORET allows the user to control the degree of matching in such situations by selecting a degree of matching (DOM) and a degree of inheritance (DOI). DOM is calculated as the ratio of the number of matched relations to the number of relations in the query graph; DOI indicates the inheritance length in the type hierarchy. The two measures specify the closeness of the query graph to a matched graph from different points of view. A number of conceptual graph matching algorithms have thus been proposed and utilized for information retrieval; however, attempts to integrate these measures with existing statistical similarity measures have been very limited. Chapter 6 of this thesis discusses an information retrieval model that performs retrieval based on a cumulative measure.


Chapter 4

Techniques for Improving Effectiveness in Information Retrieval

4.1. Introduction

The basic framework of the IR system, having the conceptual graph at its core, has already been discussed in chapter 3. It is now necessary to investigate and develop ways to improve the effectiveness of the system and to face the questions asked in chapter 1. In response to those questions, improvement has been attempted in three directions. The first and foremost is the use of conceptual graph structures, which, through a graph intersection mechanism, give rise to intelligent retrieval through semantic and contextual search. It should also be noted that the use of natural language processing techniques to derive the relations constituting the CG brings intelligence into the process. However, all semantics-based methods require a high computational cost to be paid; the vector space model can reduce this cost, and hence an effort towards a hybrid model is a natural step (section 4.2). The investigation of various vector space models with different weighting schemes, depending on the types of documents and queries, is the effort in the second direction. The third direction of effectiveness improvement comprises automatic query expansion through CGs for relevance feedback, as discussed in the next chapter.


4.2. A Hybrid Model to Improve Relevance in Document Retrieval

Although syntactic and semantic approaches have shown promise for increasing effectiveness in IR, so far neither approach has succeeded in handling a large set of documents in an unrestricted domain. This is in contrast to the vector space model, which handles such text efficiently. We have tried to combine the attractive characteristics of the two approaches by proposing a novel hybrid information retrieval model. The model works in two stages: the first stage is statistical and the second is based on semantics. We first downsize the document collection for a given query using the vector space model and then use a conceptual graph (CG) based model to rank the documents. The main objective is to investigate the use of conceptual graphs as a precision tool in the second stage. The use of CGs brings semantic features, through the concept type hierarchy, and syntactic relations into the ranking process, resulting in improved relevance. Three experiments have been conducted to demonstrate the feasibility and usefulness of our model. A test run was made on the CACM-3204 collection, where we observed an increase of 34.8% in precision for a subset of the CACM queries. The second experiment was performed on a test collection specifically designed to test the strength of our model in situations where the same terms are used in different contexts; improved relevance was observed in this case also. The application of this approach to results retrieved from LYCOS showed a significant improvement. The model proposed by us is both efficient and scalable. The rest of this section is organized as follows. Section 4.2.1 reviews the problems associated with statistical models and discusses the situations where the CG-based model can provide significant benefits; we also review some of the earlier approaches based on semantics. It is not possible to give a comprehensive overview at this juncture, so we point out only the work that is closely related to ours and that influenced our thinking. In section 4.2.2, we present the details of our model, including a discussion of the various vector space models, the CG-based model, and the


CG similarity measure used in the experiments carried out in this chapter. Section 4.2.3 is devoted to the experimental investigations made by us.

4.2.1. CG-based Core Model

Conceptual graphs are very closely related to natural language and were originally developed for representing the meaning of text. Such a representation holds the promise of extracting more information from documents by explicitly capturing the relationships between terms, unlike word-statistical approaches that merely count nouns and noun phrases. This fact suggests the use of conceptual graphs in information retrieval: with such a representation, we should be able to improve precision. A CG-based model is capable of retrieving documents pertaining to a more specialized topic when a query for a more general one is made. This is possible because a conceptual graph is formally defined on a support, which includes, among other things, a type hierarchy. The model also keeps intact the contextual information present in the document and/or query, and hence is able to differentiate documents in cases where statistical models fail to do so. Consider two documents having fragments like:

1. … genetic algorithm for information retrieval …
2. … genetic algorithm, neural network, … information retrieval …

and a query "genetic algorithm for information retrieval". Both documents contain the terms "genetic", "algorithm", "information" and "retrieval". If we represent these two document fragments as sets of terms, traditional vector space models will return both of them as equally relevant, though fragment 1 is actually more relevant to the query. The distinction is possible in such cases through the use of conceptual graphs. The proposed model is also able to handle polysemy implicitly, as in the following two sentences:

The Allahabad bank is situated near river Ganga.


and

I stayed in a hotel situated near the bank of river Ganga.

and a query "Bank situated near Ganga". Though it is clear from the local context that the first sentence is more relevant to the query, this distinction is lost when the sentences are treated as bags of words. Because a CG aims to capture the relationships between concepts, it holds the promise of improving ranking in such situations. The preceding example demonstrates the strength of our model in handling the problem of polysemy: the terms appearing in the periphery of an ambiguous term often provide the information necessary to disambiguate it, and the consideration of relationships helps in filtering out documents involving incorrect senses. Earlier attempts to include semantics were based on latent semantic indexing [36], [46] and natural language processing [163], [150] techniques. The use of latent semantic indexing (LSI) in information retrieval is based on the assumption that there is some underlying "hidden" semantic structure in the pattern of word usage across documents. LSI attempts to identify this hidden semantic structure through statistical techniques and then uses it for representing and retrieving information. However, it is costly in terms of computation and requires re-computation if many new terms are added. Agreeing with [37], who argued that conceptual retrieval is a precision tool and not an all-purpose device, we propose and evaluate a two-stage hybrid information retrieval model to handle the problems associated with purely statistical or purely semantic models. Retrieval in the first stage is performed using the vector space model; the second stage ranks the documents retrieved in the first stage based on their CG representation. The model thus takes advantage of the efficiency of the vector space model and the versatility of the CG model. Unlike LSI, the CG-based model followed by us has the advantage of being scalable. In a traditional information retrieval system, relevancy simply refers to the degree to which the query terms are present or absent in a document; our model builds a notion of "relevancy" based on an understanding of the semantics of terms. It ranks documents based on their relational and conceptual


similarity to the query. This form of relevancy corresponds more closely to the user's mental model. For CG-based ranking, we capture the relationships existing among the concepts in a sentence; the corresponding conceptual graphs are stored for each document during the preprocessing stage. First, the vector space model is used to retrieve a set of potentially relevant documents quickly, and then relevance judgment is made based on the conceptual graphs. The ranking in the second stage makes use of the quantified measure proposed by Montes-y-Gómez et al. in [101] to match conceptual graphs; this measure combines conceptual and relational similarity. The CG representation of the documents is prepared along with the vector space representation. Graph derivation has not been used, and a relatively small number of relations is captured. All these factors contribute to keeping our model computationally simple and add generality to it.
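A minimal sketch of this two-stage pipeline is given below; the similarity functions are passed in as parameters, and all names are illustrative rather than part of the implemented system.

```python
def hybrid_retrieve(query_vec, query_cg, doc_vecs, doc_cgs,
                    vsm_sim, cg_sim, k=100):
    """Stage 1: rank all documents by vector-space similarity and keep the
    top k. Stage 2: re-rank the survivors by conceptual graph similarity."""
    stage1 = sorted(doc_vecs,
                    key=lambda d: vsm_sim(query_vec, doc_vecs[d]),
                    reverse=True)[:k]
    return sorted(stage1,
                  key=lambda d: cg_sim(query_cg, doc_cgs[d]),
                  reverse=True)
```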

4.2.2. The Retrieval Model

Fig. 4.1 shows our retrieval model. Two models are used in conjunction: (i) the vector space model and (ii) the conceptual graph model.

Figure 4.1: Retrieval Model


4.2.2.1. Vector Space Model

The vector space model has already been discussed in chapter 3; a comprehensive overview of various term weighting schemes and their impact on retrieval effectiveness is provided in this section. In order to have the best possible retrieval during the first stage, the performance of seven different retrieval models, based on different combinations of document and query term weighting schemes, has been experimentally investigated. The test collection used in this study is the CACM-3204 collection. In the subsequent subsections, we discuss the characteristics of the test collection, describe the models considered, explain the methodology, and discuss the model selected for use in the first stage.

4.2.2.1.1. The Document Collection (CACM Collection)

The document collection used in this work is the CACM-3204 collection, a test collection of 64 queries and 3204 documents. The documents contain title, author, abstract, bibliographic and citation information; we have considered the title and abstract sections for indexing purposes. Out of the 64 queries, only 52 have relevant documents in the collection. A list of 429 stop words is also provided with the collection. Document #1559 has the highest average term frequency, with an average frequency of 2.83. The most frequent term in any document is "program", with a maximum of 21 occurrences in document #3077. The average document length is 28.57 words (excluding stop words) and the largest document contains 192 words. The term "algorithm" has the maximum document frequency of 1313.

4.2.2.1.2. Model Description

This section describes the models investigated.

Model 1 (doc = "atn", query = "ntc")

This model uses augmented term frequency combined with inverse document frequency to weight document terms, and raw term frequency combined with inverse document frequency and cosine normalization to weight query terms. The expressions used for term weighting are:

Document-term weight ("atn"):

$$w_{ij} = \left(0.5 + 0.5 \times \frac{tf_{ij}}{\max tf_j}\right) \times \log\left(\frac{n}{n_i}\right)$$

Query-term weight ("ntc"):

$$w_{ik} = \frac{w_{ik}^*}{\sqrt{\sum_{i=1}^{m} {w_{ik}^*}^2}}, \quad \text{where } w_{ik}^* = tf_{ik} \times \log\left(\frac{n}{n_i}\right)$$

Query-document similarity:

$$\sum_i w_{ij} \times w_{ik}$$
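As a concrete illustration (not code from the thesis), Model 1's weights can be computed as below; the aligned-list data layout and function names are assumptions made for brevity.

    import math

    def atn_weight(tf, max_tf, n_docs, df):
        # Augmented term frequency times idf, no length normalization ("atn").
        return (0.5 + 0.5 * tf / max_tf) * math.log(n_docs / df)

    def ntc_weights(tfs, dfs, n_docs):
        # Raw tf times idf, cosine-normalized ("ntc"); tfs and dfs are lists
        # aligned over the same terms.
        raw = [tf * math.log(n_docs / df) for tf, df in zip(tfs, dfs)]
        norm = math.sqrt(sum(w * w for w in raw))
        return [w / norm if norm else 0.0 for w in raw]

    def similarity(doc_weights, query_weights):
        # Inner-product query-document similarity over aligned term positions.
        return sum(wd * wq for wd, wq in zip(doc_weights, query_weights))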

Model 2 (doc = "atn", query = "atc")

In this model both the document and the query terms are weighted using augmented normalized term frequency and inverse document frequency. The length of the query vector is normalized. The exact expressions used to assign term weights are:

Document-term weight ("atn"):

$$w_{ij} = \left(0.5 + 0.5 \times \frac{tf_{ij}}{\max tf_j}\right) \times \log\left(\frac{n}{n_i}\right)$$

Query-term weight ("atc"):

$$w_{ik} = \frac{w_{ik}^*}{\sqrt{\sum_{i=1}^{m} {w_{ik}^*}^2}}, \quad \text{where } w_{ik}^* = \left(0.5 + 0.5 \times \frac{tf_{ik}}{\max tf_k}\right) \times \log\left(\frac{n}{n_i}\right)$$

Query-document similarity:

$$\sum_i w_{ij} \times w_{ik}$$

Model 3 (doc = "atc", query = "atc")

In this model both the document and query terms are weighted using augmented normalized term frequency with length normalization incorporated. The term weighting expressions are as follows:

Document-term weight ("atc"):

$$w_{ij} = \frac{w_{ij}^*}{\sqrt{\sum_{i=1}^{m} {w_{ij}^*}^2}}, \quad \text{where } w_{ij}^* = \left(0.5 + 0.5 \times \frac{tf_{ij}}{\max tf_j}\right) \times \log\left(\frac{n}{n_i}\right)$$

Query-term weight ("atc"):

$$w_{ik} = \frac{w_{ik}^*}{\sqrt{\sum_{i=1}^{m} {w_{ik}^*}^2}}, \quad \text{where } w_{ik}^* = \left(0.5 + 0.5 \times \frac{tf_{ik}}{\max tf_k}\right) \times \log\left(\frac{n}{n_i}\right)$$

Query-document similarity:

$$\sum_i w_{ij} \times w_{ik}$$

Model 4 (doc = "atc", query = "ntc")

This model weights document terms using normalized augmented term frequency, and query terms using raw term frequency combined with inverse document frequency. The lengths of both the document and the query vectors are normalized. The expressions used to compute the weights of the document and query terms are given below:

Document-term weight ("atc"):

$$w_{ij} = \frac{w_{ij}^*}{\sqrt{\sum_{i=1}^{m} {w_{ij}^*}^2}}, \quad \text{where } w_{ij}^* = \left(0.5 + 0.5 \times \frac{tf_{ij}}{\max tf_j}\right) \times \log\left(\frac{n}{n_i}\right)$$

Query-term weight ("ntc"):

$$w_{ik} = \frac{w_{ik}^*}{\sqrt{\sum_{i=1}^{m} {w_{ik}^*}^2}}, \quad \text{where } w_{ik}^* = tf_{ik} \times \log\left(\frac{n}{n_i}\right)$$

Query-document similarity:

$$\sum_i w_{ij} \times w_{ik}$$

Model 5 (doc = "ntc", query = "ntc")

The expressions used by this model to compute term weights are:

Document-term weight ("ntc"):

$$w_{ij} = \frac{w_{ij}^*}{\sqrt{\sum_{i=1}^{m} {w_{ij}^*}^2}}, \quad \text{where } w_{ij}^* = tf_{ij} \times \log\left(\frac{n}{n_i}\right)$$

Query-term weight ("ntc"):

$$w_{ik} = \frac{w_{ik}^*}{\sqrt{\sum_{i=1}^{m} {w_{ik}^*}^2}}, \quad \text{where } w_{ik}^* = tf_{ik} \times \log\left(\frac{n}{n_i}\right)$$

Query-document similarity:

$$\sum_i w_{ij} \times w_{ik}$$

Model 6 (doc = "ltc", query = "ltc")

This model uses logarithmic term frequency combined with inverse document frequency and length normalization to weight both document and query terms, as given by the following expressions:

Document-term weight ("ltc"):

$$w_{ij} = \frac{w_{ij}^*}{\sqrt{\sum_{i=1}^{m} {w_{ij}^*}^2}}, \quad \text{where } w_{ij}^* = (\log(tf_{ij}) + 1.0) \times \log\left(\frac{n}{n_i}\right)$$

Query-term weight ("ltc"):

$$w_{ik} = \frac{w_{ik}^*}{\sqrt{\sum_{i=1}^{m} {w_{ik}^*}^2}}, \quad \text{where } w_{ik}^* = (\log(tf_{ik}) + 1.0) \times \log\left(\frac{n}{n_i}\right)$$

Query-document similarity:

$$\sum_i w_{ij} \times w_{ik}$$

Model 7 (doc = "nnn", query = "nnn")

This model uses raw term frequency to weight both the document and query terms.

Document-term weight ("nnn"): $w_{ij} = tf_{ij}$

Query-term weight ("nnn"): $w_{ik} = tf_{ik}$

Query-document similarity:

$$\sum_i w_{ij} \times w_{ik}$$

4.2.2.1.3. Methodology

To evaluate the performance of the retrieval models discussed in the previous subsection, the documents in the collection are represented as term-weight vectors using the automatic procedure discussed in chapter 3 (section 3.2.2). The queries in the collection go through the same processing steps to obtain the query vectors. The similarity between the query and the document vectors is then calculated. No cutoff has been used; all documents having a non-zero similarity value are considered. For each query, precision at the observed recall points is evaluated and then used to linearly interpolate precision at the standard eleven recall points. The average precision is then calculated. This gives 64 precision values, one for each of the 64 queries in the collection. The mean of these precisions is then calculated to yield a single value, which is used as the measure of performance. To plot the precision-recall curve, the precision values at each of the 11 recall points are averaged over all the queries.
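As a concrete illustration, a minimal sketch of this evaluation is given below. One hedge: the thesis linearly interpolates precision between observed recall points, whereas the sketch uses the more common ceiling interpolation (maximum precision at any recall at or above each standard point), so its numbers are a close approximation rather than an exact reproduction of the thesis procedure.

    def eleven_point_avg_precision(ranking, relevant):
        # ranking: ranked list of doc ids; relevant: set of relevant doc ids.
        observed, hits = [], 0
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                observed.append((hits / len(relevant), hits / rank))  # (recall, precision)
        points = []
        for r in [i / 10 for i in range(11)]:
            # Interpolated precision at r: best precision at any recall >= r.
            candidates = [p for rec, p in observed if rec >= r]
            points.append(max(candidates) if candidates else 0.0)
        return sum(points) / 11

    # Example: 3 relevant docs, two found at ranks 1 and 3.
    print(eleven_point_avg_precision(["d1", "d9", "d3", "d7"], {"d1", "d3", "d5"}))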

4.2.2.1.4. Selection of the Appropriate Model

The retrieval performance of these models on the CACM collection is listed in Table 4.1. The 11-point (0.0, 0.1, ..., 0.9, 1.0) average precision (11-AvgP) has been used to assess retrieval performance. The doc = "atn", query = "ntc" model (model 1) exhibits the best retrieval performance for this collection, with an average precision of 29.61%. The recall-precision curve of each model has been compared with this model. In order to gain a better understanding of the term weight factors, test runs have also been made on the ADI and MEDLINE collections. Fig. 4.2 and fig. 4.3 compare the recall-precision curves of model 2 and model 3 with that of model 1, respectively. The performance of model 1 and model 2 is quite close. The only difference between these two models is that model 2 uses augmented term frequency for query term weights, which gives lower weight to subsequent occurrences of terms compared to their first occurrence. If all the terms in a query occur just once, there is no difference between model 1 and model 2; there are 21 such queries in the collection, and both models exhibit identical performance on them. The drop in the performance of model 3 compared to model 2 indicates that normalizing document length has a negative impact on performance for this collection. This is because the documents in the CACM collection are small and the variation in document length is not significant. However, on the MEDLINE collection, where documents are large and vary in length, document length normalization improves performance.

Model                            CACM (% change)    MEDLINE     ADI
Averaged over:                   64 queries         30 queries  35 queries
1. doc = "atn", query = "ntc"    29.61              25.17       45.95
2. doc = "atn", query = "atc"    27.72 (-6.38)      26.58       46.01
3. doc = "atc", query = "atc"    22.81 (-22.96)     39.87       47.53
4. doc = "atc", query = "ntc"    22.85 (-22.83)     39.46       47.05
5. doc = "ntc", query = "ntc"    23.59 (-20.33)     38.79       47.59
6. doc = "ltc", query = "ltc"    22.67 (-23.43)     24.44       47.62
7. doc = "nnn", query = "nnn"    12.47 (-57.88)     23.46       32.62

Table 4.1: Retrieval results of various models (averaged over all the queries, 11-point interpolated average precision)

The traditional tf-idf weighting scheme (model 5) does not provide satisfactory results for the CACM collection. Again, document length normalization is primarily responsible for this. The recall-precision curve for this model is compared with model 1 in figure 4.5. The average precision in this case is 23.59%.


Figure 4.2: Recall-Precision curve for Model 1 vs. Model 2

Figure 4.3: Recall-Precision curve for Model 1 vs. Model 3


Model 6 is the second worst performing model for the CACM-3204 collection, with an average precision of 0.2267. The recall-precision curve of this model is shown in figure 4.6 along with model 1 and model 7. As discussed earlier, the logarithmic term frequency reduces the impact of all sorts of frequency variations in a document. The use of length normalization for both documents and queries further reduces the differences in term weights. This makes it difficult to discriminate relevant documents, particularly in a collection of abstracts like CACM, resulting in the poor performance of this model. The reason for the worst-case behavior of model 7 is obvious. This model simply ranks documents based on the raw term frequencies of query and document terms. The rank of the returned documents is thus mainly decided by high-frequency document terms that happen to be present in the query as well (usually quite frequent terms like "system", "algorithm", "procedur", "comput", etc.), irrespective of whether the other query terms appear in the document at all. Fig. 4.4 compares the performance of model 4 with model 1. The mean average precision for this model is 0.2285. Model 7 exhibits poor performance on the MEDLINE and ADI collections as well.

The recall-precision behavior of all the models is compared in fig. 4.7. Model 1 outperforms all other models at all recall points. This model does not incorporate document length normalization and uses raw term frequency for queries. As discussed earlier, these factors help in discriminating relevant from non-relevant documents in the CACM collection. Using document length normalization along with frequency normalization reduces the values of the term weights, making it difficult to differentiate among documents. The raw query term frequency increases the chances of retrieving documents containing the query terms, and the augmented document term frequency reduces the impact of multiple occurrences of a term in a document, thereby reducing the chances of the ranking being dominated by a single high-frequency document term that happens to be common with the query. In the rest of the discussion, this model is referred to as the "modified vector" space model. The performance of the "modified vector" space model has been found to be better than that of a more recent weighting scheme, namely BM25 [52], which has reportedly the best performance in the TREC environment [118], [119]. The recall-precision behavior of these two models is compared in figure 4.8. The mean average precision with the BM25 scheme is 24.9%. The following approximation of the BM25 term weighting function has been used in this comparison:

Document term weights:

$$w_{ij} = \frac{tf_{ij} \times (k_1 + 1)}{k_1 \times \left((1 - b) + b \times \frac{dl}{avdl}\right) + tf_{ij}} \times \log\left(\frac{n - n_i + 0.5}{n_i + 0.5}\right)$$

Query term weights: $w_{ik} = tf_{ik}$

Query-document similarity: $\sum_i w_{ij} \times w_{ik}$

where $tf_{ij}$ is the frequency of term $t_i$ in document $d_j$, $n$ is the number of documents in the collection, $n_i$ is the number of documents in the collection containing the term $t_i$, $dl$ is the document length, and $avdl$ is the average document length in the collection. The two coefficients $k_1$ and $b$ were empirically set to 2 and 0.1. As the "modified vector" space model exhibits the best performance, it has been selected for use in the first stage of retrieval and constitutes the base model for the rest of the work carried out in this thesis, against which the performance of the proposed techniques has been judged.
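For reference, a minimal sketch of this BM25 approximation in Python is given below, with k1 = 2 and b = 0.1 as stated; the dict-based data layout and function names are illustrative assumptions.

    import math

    def bm25_weight(tf, df, n_docs, dl, avdl, k1=2.0, b=0.1):
        # Variable names mirror the formula above.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5))
        return tf * (k1 + 1) / (k1 * ((1 - b) + b * dl / avdl) + tf) * idf

    def bm25_score(query_tf, doc_tf, df, n_docs, dl, avdl):
        # Query-document similarity: sum of w_ij x w_ik with w_ik = tf_ik.
        return sum(bm25_weight(doc_tf[t], df[t], n_docs, dl, avdl) * qtf
                   for t, qtf in query_tf.items() if t in doc_tf)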


Figure 4.4: Recall-Precision curve for Model 1 vs. Model 4

Figure 4.5: Recall-Precision curve for Model 1 vs. Model 5


Figure 4.6: Comparison of Recall-precision behavior of model 6 with model 1 & 7

Figure 4.7: Comparison of seven different retrieval models


Figure 4.8: Recall-Precision curve for CACM collection (averaged over 64 queries) for BM25 scheme and “modified vector” model

4.2.2.2. Conceptual Graph Model

A CG is formally defined over a support S. The support is (TC, TR, T), where TC is the set of concept types, TR is the set of relations, and T is the set of terms used to represent the documents. A concept type "concept" has been included in TC to denote the type of the terms, which have all been treated as individual markers. A part of the concept hierarchy used in the matching process is shown in fig. 4.9. During matching, the conceptual graphs of the sentences in a document are compared with the query CG. Instead of graph derivation, a comparison method proposed in [101] has been used for this purpose. This, together with the small number of relations identified by us, helps keep our model computationally simple, thereby making it suitable for use on top of the first stage. The similarity values between the query and sentence CGs are combined in order to give a single score to the document. We propose a combination of the average and maximum similarity values for this.

Figure 4.9: Part of the concept hierarchy

4.2.2.2.1. Document Representation

The CG representation of documents has been obtained by identifying relationships among the concepts occurring in a sentence. The relations captured include the relations between the constituent nouns of complex nominals and the relations between a verb and the constituents surrounding it, mainly AGNT (agent or subject), PTNT (patient), INSTR (instrument) and prepositional relations. As discussed in chapter 3, "MOD" has been used as a generic relation to represent the relations between the constituent nouns in compound noun sequences, and "PMOD" has been used to represent all prepositional relationships. Other relations used include ATTR, MNR, AMNT, BETW, etc. (see Appendix C).


Alternate representations of nominalized verbs have not been maintained for documents. Instead, alternate representations of the query CGs have been prepared manually. Considering that users' queries are usually short, it seems practically feasible to perform a detailed semantic interpretation of queries and thereby automate this step. Further, in order to capture some of the structural variations, noun phrases involving "of" have been normalized. The rules for normalizing these phrases are given below:

(i) {NN}1 of {IN} {NN}2 → {NN}2 {NN}1
(ii) {JJ} {NN}1 of {IN} {NN}2 → {JJ} {NN}2 {NN}1
(iii) {NN}1 of {JJ} {NN}2 → {JJ} {NN}2 {NN}1
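A sketch of these rules as regular-expression rewrites over the tagged text is given below (its examples mirror those discussed next). Two assumptions are flagged: rule (iii) is treated as if its "of" carries the {IN} tag, consistent with the first two rules, and NN/NNS/NNP are matched alike, as the examples suggest. Chained patterns such as "A of B of C" need an application order this sketch does not address.

    import re

    NOUN = r"(\w+\s+\{NNP?S?\})"   # matches "word {NN}", "{NNS}", "{NNP}", "{NNPS}"
    ADJ = r"(\w+\s+\{JJ\})"
    OF = r"of\s+\{IN\}"

    RULES = [
        # Rule (iii): noun1 of adj noun2 -> adj noun2 noun1 (most specific first).
        (re.compile(rf"{NOUN}\s+{OF}\s+{ADJ}\s+{NOUN}"), r"\2 \3 \1"),
        # Rule (ii): adj noun1 of noun2 -> adj noun2 noun1.
        (re.compile(rf"{ADJ}\s+{NOUN}\s+{OF}\s+{NOUN}"), r"\1 \3 \2"),
        # Rule (i): noun1 of noun2 -> noun2 noun1.
        (re.compile(rf"{NOUN}\s+{OF}\s+{NOUN}"), r"\2 \1"),
    ]

    def normalize(tagged):
        for pattern, repl in RULES:
            tagged = pattern.sub(repl, tagged)
        return tagged

    print(normalize("Extraction {NN} of {IN} Roots {NNS}"))
    # -> Roots {NNS} Extraction {NN}
    print(normalize("Systematic {JJ} classification {NN} of {IN} devices {NNS}"))
    # -> Systematic {JJ} devices {NNS} classification {NN}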

The first rule says that patterns like "noun1 {NN} of {IN} noun2 {NN}" in the tagged representation of the document are replaced with noun sequences in reversed order, i.e. noun2 {NN} noun1 {NN}. Similarly, the second rule handles the case where such a pattern is preceded by an adjective.

Example: Consider the tagged representations

(i) Extraction {NN} of {IN} Roots {NNS}
(ii) Survey {NN} of {IN} trend {NNP} of {IN} Development {NNP}
(iii) Systematic {JJ} classification {NN} of {IN} devices {NNS}

These fragments will be replaced with "Roots {NNS} Extraction {NN}", "Development {NNP} trend {NNP} Survey {NN}" and "Systematic {JJ} devices {NNS} classification {NN}" respectively. However, in certain cases these rules produce incorrect patterns. Consider the tagged representation of the following pattern from document #42 of the CACM collection:

New {NN} method {NNP} of {IN} computation {NNP} of {IN} square {NNP} roots {NNS}

Application of the normalization rules will replace the above pattern with:

Square {NNP} roots {NNS} computation {NNP} new {NN} method {NNP}

whereas it should be:

New {NN} square {NNP} roots {NNS} computation {NNP} method {NNP}.

This is because "new" has been assigned the tag "NN" (noun), whereas it should be tagged "JJ" (adjective). The CGs are stored in the form of triplets as:

(rel c1 c2)

where 'rel' is the relation, and 'c1' and 'c2' are the concepts participating in this relation. Each CG is assigned an id and is represented by a list of triplets. In order to speed up the matching process, a table containing concepts and, for each concept, the list of document and CG ids in which it participates may be used. This structure considerably reduces the number of conceptual graphs to be considered and makes it possible to retrieve the potentially matching document CGs quickly.
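A minimal sketch of this storage scheme and concept index follows, assuming triples are Python tuples; the class and field names are illustrative, not from the thesis.

    from collections import defaultdict

    class CGStore:
        def __init__(self):
            self.triples = {}                        # (doc_id, cg_id) -> [(rel, c1, c2)]
            self.concept_index = defaultdict(set)    # concept -> {(doc_id, cg_id)}

        def add(self, doc_id, cg_id, triples):
            self.triples[(doc_id, cg_id)] = triples
            for rel, c1, c2 in triples:
                self.concept_index[c1].add((doc_id, cg_id))
                self.concept_index[c2].add((doc_id, cg_id))

        def candidates(self, query_concepts):
            # CGs sharing at least one concept with the query; this is what
            # cuts down the number of graphs that must be matched.
            found = set()
            for c in query_concepts:
                found |= self.concept_index.get(c, set())
            return found

    store = CGStore()
    store.add("d42", 0, [("agnt", "exploit", "farmer"), ("ptnt", "exploit", "worker")])
    print(store.candidates({"farmer", "retrieval"}))   # -> {('d42', 0)}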

4.2.2.2.2. CG Similarity Measure

In the second stage of the retrieval process, we compare the query CG with the conceptual graphs of those sentences that contain query concepts. Instead of using graph derivation for matching the query and sentence CGs, the conceptual graphs are compared using the similarity measure proposed by Montes-y-Gómez et al. [101]. Given two texts represented by the conceptual graphs G1 and G2 and their intersection graph Gc, this measure defines the similarity s between them in terms of their conceptual similarity sc and relational similarity sr.

The conceptual similarity measures how many concepts the two graphs G1 and G2 have in common. It is calculated using an expression analogous to the dice coefficient, a widely known similarity measure used in IR:

$$s_c = \frac{2 \times n(G_c)}{n(G_1) + n(G_2)}$$

where n(G) is the number of concepts in the graph G. The value of sc lies in the range 0 to 1. A value of 0 indicates that the two graphs have no concepts in common, and a value of 1 indicates that they contain exactly the same concepts.


The relational similarity measures the similarity of the contexts of the common concepts in both graphs. It is defined as the proportion between the degree of connection of the concept nodes in Gc and the degree of connection of the same nodes in the initial graphs G1 and G2:

$$s_r = \frac{2 \times m(G_c)}{m_{G_c}(G_1) + m_{G_c}(G_2)}$$

where m(Gc) is the number of arcs in the graph Gc, and mGc(G) is the number of arcs in the immediate neighborhood of the graph Gc in the graph G. The two similarity values are combined to give the similarity s between the graphs G1 and G2 as:

$$s = s_c \times (a + b \times s_r)$$

The values of the coefficients a and b depend on the structure of the documents and are computed as:

$$a = \frac{2 \times n(G_c)}{2 \times n(G_c) + m_{G_c}(G_1) + m_{G_c}(G_2)}, \qquad b = 1 - a$$

There can be a number of sentences in a document having a non-zero CG similarity with the query CG, yielding one similarity value for each. The various similarity values obtained are combined into a single score for the document as:

$$S = \alpha \times \frac{\sum_{i=1}^{c} s_i}{c} + \beta \times \max(s_i)$$

where c is the number of CGs in the document having one or more concepts in common with the query CG. The first factor in this expression ensures that a document having a number of identical repetitions of the query concepts will score more than one having a single identical and many different occurrences. The second factor ensures that a document having a single exact match and many partially matching fragments has a fair chance of being ranked better. A high value of α improves the ranking of a document having many repeated occurrences of partially matched fragments. A high value of β improves the ranking of a document having a single matching CG fragment. We have given equal weights to the average and maximum similarity values (α = β = 0.5). These values have been set empirically.
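A minimal sketch of the similarity computation just defined, assuming the four intersection quantities have already been computed (for instance by the sketch in the next subsection); α = β = 0.5 as in the text, and the function names are illustrative.

    def cg_similarity(n_g1, n_g2, n_gc, m_gc, m_gc_in_g1, m_gc_in_g2):
        # Conceptual similarity (dice-style over concept counts).
        sc = 2 * n_gc / (n_g1 + n_g2)
        # Relational similarity over the neighborhood of Gc.
        denom = m_gc_in_g1 + m_gc_in_g2
        sr = 2 * m_gc / denom if denom else 0.0
        # Structure-dependent coefficients a and b = 1 - a.
        a = 2 * n_gc / (2 * n_gc + denom) if n_gc else 0.0
        return sc * (a + (1 - a) * sr)

    def document_score(sims, alpha=0.5, beta=0.5):
        # Combine per-sentence CG similarities: weighted average plus maximum.
        if not sims:
            return 0.0
        return alpha * sum(sims) / len(sims) + beta * max(sims)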

Figure 4.10: Intersection of conceptual graphs G1 and G2

4.2.2.2.3. Computing the Intersection Graph

The intersection graph Gc of two conceptual graphs G1 and G2 consists of the following elements:

(i) all concept nodes appearing in both conceptual graphs G1 and G2;
(ii) all relation nodes that appear in both initial graphs and join common concept nodes.

This can be easily computed using the triple representation. Consider the conceptual graphs shown in fig. 4.10. These conceptual graphs will be stored as:

G1: (r1 A B) (r2 A C) (r2 C G) (r1 C F) (r3 B D) (r4 B E) (r5 D E) (r3 E H)
G2: (r1 I B) (r2 I A) (r3 B D) (r4 B E) (r2 D E) (r2 A C) (r6 D J) (r4 E H)


The intersection graph Gc is computed by extracting (i) all triples having all elements in common and (ii) triples having one or both concepts in common (with a null value introduced as the relation). A null value in the first place signifies a concept-only match. If only a single concept matches, a null is placed in the second or third place, as the case may be. The intersection graph is:

Gc: (r2 A C) (r3 B D) (r4 B E) (null E H)

The number of concepts n(G) and the number of relations m(G) can be easily computed by counting the number of distinct concepts and the number of triples (except those having a null in the first place), respectively. The number of arcs in the immediate neighborhood of Gc, mGc(G), is computed by counting those triples in G that have at least one concept in common with Gc.
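The four quantities used in the similarity measure can be computed directly from the triple lists. The sketch below is one reading of the prose above, under the assumption that the null-relation triples can be represented implicitly by the set of common concepts (which leaves n(Gc), m(Gc) and the neighborhood counts unchanged); function names are illustrative.

    def intersection_sizes(g1, g2):
        # g1, g2: lists of (rel, c1, c2) triples.
        # Returns n(Gc), m(Gc), mGc(G1), mGc(G2).
        def concepts(g):
            return {c for _, c1, c2 in g for c in (c1, c2)}
        common = concepts(g1) & concepts(g2)          # concept nodes of Gc
        matched = [t for t in g1 if t in g2]          # fully matched triples (arcs of Gc)
        def neighborhood(g):
            # Arcs of G touching at least one concept of Gc.
            return sum(1 for _, c1, c2 in g if c1 in common or c2 in common)
        return len(common), len(matched), neighborhood(g1), neighborhood(g2)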

4.2.3. Experimental Design

In order to test the effectiveness of our model and to compare the improvement with other models, three experiments have been performed. A test run is made on the CACM-3204 collection. The manual handling of the concept type hierarchy and of nominalized verbs in queries restricts us from performing a large-scale automatic investigation of the model. Hence, a subset of the CACM queries has been selected. This subset includes the smallest query in the collection (query #19), queries with precision better than the average of the collection (query #12, #19, #63), a query performing worse than the average of the collection (query #23) and queries close to the average performance of the collection (query #12, #30). This still requires the consideration of approximately 700 documents. To test the capability of our retrieval model to capture context information and improve ranking, we performed another experiment on our own document collection, specifically designed to contain documents in which similar terms are used in different contexts (CGDOC). In an attempt to investigate a possible application of our approach on top of existing retrieval systems, a third experiment has been performed using the top 10 results returned by the LYCOS search engine. The retrieval algorithm is shown in Table 4.2.


Algorithm Retrieval()
 1. Start
 2. Read query.
 3. Extract query terms and prepare the query vector (qv).
 4. Prepare query CGs.                       // alternate representations possible
 5. for i = 1 to N do                        // N is the collection size
        read term weight vector dvi for document i
        si.score = Σ dvi × qvi
        si.doc = i
    end for
 6. sort(s)                                  // sort on score to get a ranked list of documents
 7. cut_off = get_maxrecall(s, qrels)        // for experiments 2 & 3 the cutoff is specified
                                             // as a number of documents
 8. c = count of documents up to the cutoff value   // i.e. number of documents retrieved
                                                    // for the highest recall value
 9. for i = 1 to c do
        for each CG k in document si.doc having a matching concept
            cgfk = CG similarity value between the query CG and the kth CG
        end for
        savg = Σ cgfk / m                    // m = total number of CGs having matching concepts
        smax = max(cgfk)
        cgsi.score = a × savg + b × smax     // a = b = 0.5
        cgsi.doc = si.doc
    end for
10. sort(cgs)                                // to get a ranked list of documents
11. end

Table 4.2: Retrieval algorithm for the hybrid model
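For readers who prefer runnable code, the following is a hedged Python rendering of the algorithm in Table 4.2. The get_maxrecall cutoff logic is simplified to a caller-supplied cutoff, cg_sim stands for any sentence-level CG similarity (for instance one assembled from the intersection and similarity sketches earlier in this chapter), and the data layout (dicts with 'id', 'vector', 'cgs') is an assumption.

    def hybrid_retrieve(query_vec, query_cg, docs, cutoff, cg_sim, alpha=0.5, beta=0.5):
        def dot(q, d):
            # Stage-1 inner product between query and document term weights.
            return sum(w * d.get(t, 0.0) for t, w in q.items())
        def concepts(g):
            return {c for _, c1, c2 in g for c in (c1, c2)}
        qc = concepts(query_cg)
        # Stage 1: "modified vector" space ranking, truncated at the cutoff.
        stage1 = sorted(docs, key=lambda d: dot(query_vec, d["vector"]),
                        reverse=True)[:cutoff]
        # Stage 2: re-rank by CG similarity (average + maximum of sentence scores).
        rescored = []
        for d in stage1:
            sims = [cg_sim(query_cg, g) for g in d["cgs"] if concepts(g) & qc]
            score = (alpha * sum(sims) / len(sims) + beta * max(sims)) if sims else 0.0
            rescored.append((score, d["id"]))
        return sorted(rescored, reverse=True)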


4.2.3.1. Experiment 1

The document collection used in the first experiment is CACM-3204. The objective of this experiment is to assess the performance of our model on an existing document collection with known relevance judgments.

4.2.3.1.1. The Experiment

To test the effectiveness of our two-stage retrieval model, the precision after the first and second stages of retrieval has been compared. First, documents are retrieved using the "modified vector" space model in stage 1; these documents are then re-ranked using the CG-based representation. Documents up to the highest recall point have been considered (subject to a minimum of 100). The ranking of the relevant documents after the first and second stages for query 12 and query 23 ("Distributed computing structures and algorithms") is shown in Table 4.3. The total number of relevant documents in the collection for queries 12 and 23 is 5 and 4, respectively. Table 4.4 lists the precision after the first and second stages and the percentage improvement for the queries considered in this experiment. The average recall-precision graph for the subset of four queries is shown in figure 4.14. Query 23 has been excluded from this average, as this single query might contribute a lot to the average performance.

4.2.3.1.2. Results and Discussions

We have used the "modified vector" model in our experiment. Table 4.3 shows that the ranking of the relevant documents for queries 12 and 23 of the CACM-3204 collection improved after the second stage. The vector model gives high ranks to documents containing frequent occurrences of terms like 'operations' and 'operator', or containing phrases like 'portable random generator'.


Query 12: Portable operating system

Document# (Relevant)   Rank (first stage)   Rank (second stage)
1523                   391                  60
2080                   34                   30
2246                   4                    3
2629                   26                   25
3127                   1                    1

Query 23: Distributed computing structures and algorithms

Document# (Relevant)   Rank (first stage)   Rank (second stage)
2578                   64                   43
2849                   29                   2
3137                   36                   33
3148                   47                   1

Table 4.3: Ranking of relevant documents for CACM queries 12 and 23

Similarly, for query 23 the vector model gives high ranks to documents pertaining to "probability distribution", "tree structure", etc. The CG-based model improves the ranking by eliminating these documents, resulting in improved precision. Figures 4.11, 4.12 and 4.13 show the 11-point standard recall-precision curves for queries 63, 30 and 19 of the CACM collection after the first and second stages of retrieval. As shown in Table 4.4, precision improved after the second stage for all five queries. The maximum increase in precision, 970.4%, was observed for query #23: during the first stage of retrieval, the first relevant document for this query was obtained only after retrieving 29 documents, whereas with CG-based ranking this value was reduced to 1. The minimum improvement, 3.7%, was observed for query 12. For query 19, the precision rose from 48.6% to 59.5%, a 22.4% increase. The recall-precision curve averaged over the subset is plotted in figure 4.14. The mean average precision for the four queries after the first and second stages of retrieval was 46.4% and 62.6% respectively, an improvement of 34.8%. This is a substantial improvement.


Query #   Precision (%) Stage 1   Precision (%) Stage 2   Increase in Precision (%)
12        42.7                    44.3                    3.7
19        48.6                    59.5                    22.4
23        5.4                     57.8                    970.4
30        35.2                    51.5                    46.7
63        44.8                    74.5                    66.3

Table 4.4: Percentage increase in precision

Figure 4.11: Recall-Precision curve for query 63 of the CACM-3204 collection


Figure 4.12: Recall-Precision curve for query 30 of the CACM-3204 collection

Figure 4.13: Recall-precision curve for CACM query 19


Figure 4.14: Average Recall-Precision curve for the subset of CACM queries

4.2.3.2. Experiment 2

In the second experiment, we considered our own document collection, specially designed to contain identical terms used in different areas. The objective was to test the retrieval effectiveness of our model in an environment where differentiating among documents may be difficult solely by statistical means.

4.2.3.2.1. Document Collection (CGDOC)

There are 65 documents in this collection, all abstracts of scientific papers. The average document length in the collection is 71.9 words, excluding stop words. The smallest and largest documents in the collection contain 12 and 136 words, respectively. Fig. 4.15 lists the titles of a few of the documents and sample queries from this collection.


Query #1. Genetic algorithm for information retrieval
Query #2. Fuzzy information retrieval
Query #3. Information retrieval using conceptual graph

Doc #2. An Efficient Information Retrieval Method in WWW Using Genetic Algorithms
    Abstract: The information retrieval framework based on ... starting WWW host.
Doc #9. An extended inverted file approach for information retrieval
    Abstract: In information retrieval ... object-oriented DBMS.
Doc #11. Genetic algorithm based redundancy resolution of robot manipulators
Doc #13. Genetic algorithms: A Survey
    Abstract: Genetic algorithms provide ... dynamics, and deception.
Doc #14. Genetic Algorithm and Graph Partitioning
    Abstract: Hybrid genetic algorithms ... supporting the experimental results.
Doc #21. An Image Retrieval Method Based on a Genetic Algorithm
    Abstract: This paper describes image retrieval ... representation of images.
Doc #22. Evolutionary Reinforcement of User Models in an Adaptive Search Engine ...
Doc #24. Genetic algorithm approach to image segmentation using morphological operations
    Abstract: This paper presents an approach for image ... images are presented.
Doc #25. Multiprocessor Document Allocation: A Genetic Algorithm Approach
Doc #27. Probabilistic and genetic algorithm for document retrieval ...
Doc #42. Intelligent Agents and its Applications in Information Retrieval
    Abstract: There is an increased amount ... filtering and gathering systems.
...
Doc #44. Applied Genetic Algorithms in Information Retrieval ...
Doc #47. A test of Genetic algorithms in relevance feedback ...

Figure 4.15: Sample queries and documents

4.2.3.2.2. The Experiment

As there can be many sentences in a document containing query terms, a combination of the average and maximum similarity values has been used for ranking. To better explain the effect of our CG-based approach, we compare the rankings obtained after the first and second stages of retrieval instead of forcing our own relevance judgments. The titles of the documents under consideration for query 1 are listed in fig. 4.15.

Figure 4.16: Ranks of documents after first and second stage

4.2.3.2.3. Results and Discussions

The effect of the hybrid model is that relevant documents that were ranked low by the vector model are shifted up in the ranking. The ranking of the top 10 documents after the first and second stages of retrieval for query 1 ("Genetic algorithm for information retrieval") is shown in fig. 4.16. It can easily be verified from the titles that the documents ranked 3 and 5 by the first stage (Doc #13 and #42) are not as useful as Doc #22 and Doc #27, which are ranked 7 and 9 respectively. The application of CGs improves the ranking by bringing Doc #22 and Doc #27 into the top five documents, resulting in improved precision. The vector model, unlike the CG model, fails to identify the similarity between "document retrieval" and "information retrieval" and gives a low rank to Doc #27. Fig. 4.17 shows the rankings after the first and second stages of retrieval for queries 2 and 3 listed in fig. 4.15. The titles of the documents being considered can be found in Table 4.6. Similar improvements in ranking have been achieved for the other queries as well.

Figure 4.17: Ranking after first and second stage of retrieval for query 2 and 3.


Figure 4.18: Ranking of first ten documents returned by LYCOS after second stage

4.2.3.3. Experiment 3

In order to investigate a possible application of our approach in a web search environment, we conducted a small experiment. We simply took the first ten results returned by the LYCOS search engine for query #1 of our collection (see Table 4.5), constructed conceptual graphs for the fragments of sentences returned by the search engine, and compared them with the query graph. LYCOS has been used because it is based on simple keyword search; unlike other search engines, it does not consider the position of keywords in the document, the number of inbound hyperlinks, etc. in the ranking process. This corresponds more closely to the retrieval model used in the first stage of our retrieval. The result of this comparison is then used for ranking, and is shown in figure 4.18. A significant improvement has been achieved in this case as well, even though we used only a very small piece of context information (highlighted in Table 4.5), as we do not have access to the actual document representations.

4.3. Conclusions

The performance of different combinations of document and query term weighting schemes has been experimentally investigated, and a conceptual graph-based technique, relying on a hybrid information retrieval model, has been proposed. The technique has shown potential for improving retrieval performance. The model first retrieves a set of documents potentially relevant to the query using the modified vector model and then ranks the documents based on a conceptual understanding of the terms. We observed an increase of 34.8% in precision for a subset of CACM queries. The experiment performed on our own document collection has also shown significant improvement in the ranking of the retrieved documents. This is because of the semantic considerations brought into the second stage of retrieval through conceptual graphs. A specific query provides useful relationships among the query terms that help in improving the ranking. The proposed model helps in quickly short-listing the relevant documents from a large document set without hampering efficiency. The application of the CG-based model in the second stage brings conceptual understanding into the relevance judgment. Relevance based on semantics corresponds more closely to the user's mental model, resulting in improved user acceptance. In order to make our CG model more efficient, we have used a new scoring function to obtain a single document score. The form of the graph matching function used in this work keeps the computational cost low, and the CG representation used by us is easily scalable. The proposed model thus takes advantage of both the efficiency of the syntactic approach and the accuracy of the semantic approach. It yields appreciable improvement when applied to the search results returned by LYCOS, which suggests that the CG-based model can be used as a precision tool with existing search and retrieval systems.


1. Machine Learning for Information Retrieval: Neural Networks, Symbolic...
   … Results of genetic algorithms testing Machine Learning for Information Retrieval : Neural … algorithms in information retrieval . In [48 … presented a genetic algorithms based approach for...
   More results from: ai.bpa.arizona.edu/papers/mlir93/mlir93.html, February 3, 2004 - 163 KB, *3 (0.639)

2. The Art Site on the World Wide Web: McLaughlin
   … electronic palettes, generated from genetic algorithms, or captured by digital camera … fractal animations, genetic algorithm animations, morphs, ray trace … retrieving files from the site)....
   cwis.usc.edu/dept/annenberg/artfinal.html, March 11, 2004 - 90 KB, 6 (0.531)

3. text mining and web-based information retrieval reference
   … Web Mining, Information Retrieval and … Fast Look-up Algorithm for Structural … Web Mining: Information and Pattern … adaptive genetic algorithm /programming … statistics to information ...
   filebox.vt.edu/users/wfan/text_mining.html, February 2, 2004 - 23 KB, 9 (0.477)

4. Effective Information Retrieval Using Genetic Algorithms Based...
   … Effective Information Retrieval Using Genetic Algorithms Based Matching … relevant information from these … improving retrieval performance … and recall) for retrieval … the area of...
   www.computer.org/proceedings/hicss/0493/04932/0493201..., August 22, 2002 - 10 KB, *1 (0.853)

5. Machine Learning/Genetic Algorithm group - Dipartimento di...
   … representation for ML, such as exploitation … to also include Genetic Algorithms and Neural Networks … Mobile Agents for automatic information retrieval , network management … Learning Agents...
   www.di.unito.it/~mluser/, February 3, 2004 - 8 KB, 7 (0.522)

6. Personal Information Intake Filtering
   … Recent work on genetic algorithms for information retrieval [Gordon] focused … program uses genetic algorithms for evolving symmetrical … B Cousins. "Information Retrieval from Hypertext...
   www.baclace.net/Resources/ifilter1.html, July 27, 2001 - 42 KB, 2 (0.752)

7. Free C/C++ Sources for Numerical Computation
   … Description : many genetic algorithm optimisation libraries … Description : Objects for doing genetic algorithm optimization Name … analytic statistics for the TREC IR trials Comments :...
   cliodhna.cop.uop.edu/~hetrick/c-sources.html, July 24, 2002 - 97 KB, 4 (0.560)

8. Home Page for Haym Hirsh
   … Directors: Institute for the Study of … machine learning, information retrieval , data mining … engineering design, genetic algorithms , knowledge representation … Publications Information ...
   www.cs.rutgers.edu/~hirsh/, March 8, 2004 - 18 KB, 5 (0.560)

9. Artificial Life
   … Related Topics Genetic Algorithms Agent technologies … computation such as genetic algorithms (GAs), evolutionary … stands for "Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for...
   www.insead.fr/CALT/Encyclopedia/ComputerSciences/AI/a..., January 12, 2004 - 56 KB, 10 (0.460)

10. Information and Communication R&D Center
    Information and Communication … Fuzzy Theory, Genetic Algorithm , Artificial Life … JAPANESE) Information Retrieval (IR) (JAPANESE … Service Other Information Abstract … TOWN GUIDE for ...
    www.ricoh.co.jp/rdc/ic/index_e.html, September 30, 1999 - 3 KB, 8 (0.494)

The final entry of each result is the rank (score) obtained through CG.

Table 4.5: First ten results returned by LYCOS for the query "Genetic algorithm for information retrieval"


Doc #1. An Efficient Storage and Retrieval System for Conceptual Graphs
Doc #4. The RELIEF Retrieval System (image retrieval system based on conceptual graphs)
Doc #8. Compiling Conceptual Graphs
Doc #9. An extended inverted file approach for information retrieval
Doc #16. PRIME-GC. A medical information retrieval prototype on the Web
Doc #28. Application of fuzzy set theory to extend Boolean information retrieval
Doc #29. Fuzzy functional dependency and its application to approximate data querying
Doc #31. Improving the Performance of Existing Information Retrieval Systems Using a Software Agent
Doc #34. Flexible Comparison of Conceptual Graphs
Doc #35. CG-DESIRE: Formal Specification Using Conceptual Graphs
Doc #36. Comparison of Conceptual Graphs
Doc #37. Fuzzy conceptual graphs for matching images of natural scenes
Doc #39. Text mining at detail level using conceptual graphs
Doc #43. Knowledge Representation Using Fuzzy Petri Nets - Revisited
Doc #49. Fuzzy integral as a basis for the interpretation of flexible queries involving monotonic aggregates
Doc #50. Fuzzy Content-Based Retrieval in Image Databases
Doc #56. An information retrieval model based on vector space method by supervised learning

Table 4.6: Titles of documents considered in Experiment 2


Chapter 5

Automatic Query Expansion through Conceptual Graph

5.1. Introduction

Another direction in which our work is focused is improving effectiveness by improving the query representation. It is a well-known fact that the user's initial query generally does not retrieve all the relevant documents [17]. The initial query is based on the user's knowledge of the subject as well as of the document collection and the system. However, most users have very little knowledge about the system and/or the document collection and are unclear about what they can find in the collection. Therefore, they usually proceed by trial and error: they present a query to the system and then use the retrieved documents to refine the initial query. Automatic query expansion is an attempt to automate this process. A query can be expanded by adding terms selected by referring to thesauri [64] or by consulting users through the relevance feedback technique. The work performed in [64] is based on the use of relations defined in WordNet, such as synonyms, hypernyms and hyponyms, to expand queries. However, the results have not been satisfactory. Many information retrieval researchers use pseudo relevance feedback, which is relevance feedback without user intervention [121], [20], [47]. Nevertheless, this technique is quite sensitive to the initial retrieval [98]. If its quality is good, the terms added to the query will be related to the topic and the expanded query will be a more appropriate representation of the user's intent. This will improve the retrieval performance. On the other hand, if the initial retrieval quality is poor, unrelated terms will be added to the query, resulting in degraded performance: a problem known as 'query drift'. We attempt to exploit the strength of conceptual graphs in handling 'query drift'. The rest of the chapter is organized as follows. Section 5.2 introduces models for applying relevance feedback, while section 5.3 discusses the CG-based query expansion technique. The experimental investigation is elaborated in section 5.4.

5.2. Models for Relevance Feedback

Query expansion adds terms to the original query, readjusts the weights of terms and may drop some terms. Terms may be added to the user's initial query using a thesaurus. Alternatively, the query may be modified based on relevance information. Relevance feedback is a well established query expansion technique [127], [69], [45]. In relevance feedback, a query is modified using information from a previously retrieved ranked list of documents that have been judged for relevance by the user. Robertson and Jones [120] pointed out that the power of relevance feedback comes from expanding the query by adding new terms, not from re-weighting. There are two well-known models for applying relevance feedback:

(i) Ide's method [69]: This method adds all the terms of the relevant documents in the set provided for relevance feedback and removes the terms of the first irrelevant document in the set. The modified query vector is constructed as:

$$q_{new} = q_{old} + \sum_{d_i \in R} d_i - s$$

where $q_{old}$ is the original query vector, $q_{new}$ is the new query, $d_i$ is the vector of a relevant document in R, and $s$ is the vector of the first irrelevant document in the feedback set.

(ii) Rocchio's method [127]: Rocchio's method consists of moving the initial query vector toward the centroid of the relevant documents and away from the centroid of the non-relevant documents. It attempts to estimate "the optimal" user query through relevance feedback. This can be described by the following equation:

$$q_{new} = \alpha \times q_{old} + \beta \times \sum_{d_i \in R} d_i - \gamma \times \sum_{d_i \in I} d_i$$

where $d_i$ is a relevant or irrelevant document obtained by manual or automatic feedback during the initial retrieval, I is the set of irrelevant documents, and α, β and γ are coefficients, set by trial and error.

In real relevance feedback, the user is asked to give relevance information, and this information is used to revise the query. The requirement for explicit relevance judgments interferes with users' information seeking behavior [75], and users are often not willing to provide feedback information. Hence, this step is often replaced with automatic relevance feedback, termed "pseudo relevance feedback" or "retrieval feedback" [134], where the few top-ranked retrieved documents are assumed relevant without user intervention and then used to reformulate the query. In pseudo relevance feedback, the modified query vector is calculated by dropping the negative terms appearing in Ide's and Rocchio's equations.
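A minimal sketch of both reformulation rules over {term: weight} vectors is given below; the Rocchio coefficient values shown are illustrative defaults, since the text says these coefficients are set by trial and error.

    from collections import Counter

    def ide(q, relevant, first_irrelevant):
        # q_new = q_old + sum of relevant vectors - first irrelevant vector.
        new_q = Counter(q)
        for d in relevant:
            new_q.update(d)
        new_q.subtract(first_irrelevant)
        return {t: w for t, w in new_q.items() if w > 0}

    def rocchio(q, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
        # q_new = alpha*q_old + beta*sum(relevant) - gamma*sum(irrelevant).
        new_q = Counter({t: alpha * w for t, w in q.items()})
        for d in relevant:
            for t, w in d.items():
                new_q[t] += beta * w
        for d in irrelevant:
            for t, w in d.items():
                new_q[t] -= gamma * w
        return {t: w for t, w in new_q.items() if w > 0}

    # Pseudo relevance feedback drops the negative term, e.g.
    # rocchio(q, top_ranked_docs, irrelevant=[]).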

5.3. CG-based Expansion

Query expansion following retrieval feedback may degrade performance if the top-ranked documents retrieved during the initial run are not relevant. Expanding the query by adding terms from irrelevant documents, or terms from relevant documents that are not closely related to the query terms, moves the query representation away from what may be the "optimal" query representation. This alters the focus of the query, i.e. causes 'query drift'. We attempt to reduce this drift by including semantics in the expansion process through conceptual graphs. The conceptual graph representation of documents has been used to decide which documents from the initially retrieved set, and which terms from these documents, will be used for expanding the query. First, an initial run is performed to get a small set of documents, and then scores are assigned to the documents based on their similarity with the query CG.


This type of similarity corresponds more closely to the user's intent, improving the chance that a relevant document is included in the feedback set. Expanding the query by adding terms from this set results in a more appropriate expansion. Two different strategies for CG-based expansion have been evaluated: (a) the first CG-based expansion strategy uses Ide's approach to expand the query; (b) the second strategy consists of expanding the query by adding only those terms that participate in a document CG having a matching query concept. The following steps explain our algorithm:

1. Perform an initial run to get a set of 20 documents (DS).
2. Assign a CG score to each document in DS.
3. Rank the documents according to their CG score.
4. Assume the top N documents to be relevant and use them to expand the query, either as
   (a) $q_{new} = q_{old} + \sum_{d_i \in R} d_i$, or
   (b) add the participating concepts from the document CGs to the query and re-compute the query vector.
5. Perform retrieval using the modified query.

Query 12: portable operating systems
Query 14: find all discussions of optimal implementations of sort algorithms for database management applications
Query 19: Parallel algorithms
Query 30: Articles on text formatting systems, including "what you see is what you get" systems. Examples: t/nroff, scribe, bravo.
Query 42: Computer performance evaluation techniques using pattern recognition and clustering
Query 63: Algorithms for parallel computation, and especially comparisons between parallel and sequential algorithms.

Figure 5.1: Sample queries from the CACM collection
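A minimal sketch of expansion strategy (b) from the algorithm above follows; the triple layout and the weight assigned to newly added concepts are assumptions (the thesis re-computes the query vector after adding the concepts).

    def cg_expand(query_vec, query_concepts, feedback_docs, new_weight=1.0):
        # feedback_docs: for each of the top-N documents after CG re-ranking,
        # the pooled list of (rel, c1, c2) triples of its CGs.
        expanded = dict(query_vec)
        for triples in feedback_docs:
            for _, c1, c2 in triples:
                # Only triples touching a query concept contribute new terms.
                if c1 in query_concepts or c2 in query_concepts:
                    for c in (c1, c2):
                        expanded.setdefault(c, new_weight)  # weight is an assumption
        return expanded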


5.4. Experiment and Results

5.4.1. The Experiment

In order to test the performance of the proposed expansion strategies and to compare them with blind expansion following retrieval feedback, two experiments have been performed. Fig. 5.1 lists sample queries from the collection. Retrieval effectiveness is assessed using the eleven-point average precision score (11-AvgP). In the first experiment, we initially retrieve a set of 20 documents. These documents are then ranked on the basis of their CG similarity values. From the resulting ranked list, the top five documents are used to expand the query using Ide's approach. As shown in chapter 4, CG-based ranking has the potential to improve the ranking; this increases the chances of more relevant documents being included in the feedback set. A comparison has been made with blind expansion following retrieval feedback. Figure 5.2 shows the average performance for a subset of CACM queries. In the second experiment, the query was expanded by adding the concepts that appear in document CGs having a matching query concept. This restricts the concepts used for expansion to those that are closely related to the query concepts. A comparison of this strategy with the earlier one and with blind expansion has been made. Figure 5.3 shows the average recall-precision curves for this comparison.

Retrieval Run                                          11-AvgP   (% improvement)
Original Query (Baseline)                              0.3371
Blind expansion (Ide's)                                0.1867    (-44.61%)
CG-based expansion (Ide's)                             0.4247    (+25.98%)
CG-based expansion (participating concepts added)      0.5257    (+55.94%)

Table 5.1: Average 11-AvgP scores for a subset of CACM queries


5.4.2. Results and Discussions

Table 5.1 compares the 11-point average precision over CACM queries 12, 14, 19, 30, 42 and 63 for the three strategies against the original unexpanded queries. The CG-based expansion following Ide's approach yields an 11-AvgP of 0.4247, a 25.98% improvement over the original queries. Blind expansion yields an 11-AvgP of 0.1867, a decrease in performance. This is because in blind expansion all the terms from the documents in the relevance feedback set are added, and some of these documents are not relevant. This results in the retrieval of a number of documents containing terms added from non-relevant documents, some of which get high ranks, producing a drop in performance. The expansion (Ide's) based on a relevance feedback set determined by CG-based ranking improves the situation by forcing some of the non-relevant documents out of the feedback set. The terms being added thus come from documents more closely related to the query. This reduces the chance of altering the focus of the original query, resulting in improved performance. The second CG-based strategy produced the best 11-AvgP value of 0.5257, an improvement of 55.94% over the original queries and 23.78% over our first CG-based strategy. This significant improvement is achieved by considering, through the CGs, the context of the query terms within the document during the expansion process. Thus, the terms used for expansion are not all the terms from the related documents, but only those that are most closely related to the query terms.


Figure 5.2: Average recall-precision curves for a subset of CACM queries

Figure 5.3: Average Recall-Precision curves for a subset of CACM queries for techniques (a), (b) and blind expansion


5.5. Conclusions

We considered the main problem associated with automatic query reformulation, namely 'query drift', and presented a CG-based approach to relevance feedback that overcomes it. Two strategies have been evaluated. Both resulted in significant improvements in precision for our test run, largely because of the consideration of the semantic aspects of words in the expansion process through the CG representation. We base our findings on an empirical investigation made in a limited setting. Our approach can be quite useful in legal and medical domains, where both high precision and high recall are desirable.


Chapter 6

Capturing Semantics through Relation Matching

6.1. Introduction

A significant amount of research has been devoted to improving the performance of keyword-based retrieval systems. Most of the earlier research attempts centered on refining and developing keyword matching algorithms [74]. Despite all this research effort, the effectiveness of retrieval systems is still rather low [159]. This motivates researchers to explore new directions for IR. Researchers believe that IR performance can be improved through improved search procedures [182]. Keyword-based systems perform retrieval based merely on the presence or absence of terms (keywords) in the document. This is surprising, because meaning does not come merely from putting words together; it is the arrangement of words in a sentence that gives it its meaning. Another motivating factor that has focused attention on intelligent and semantic techniques in recent years is the emergence of the semantic web. One of the problems identified by semantic web researchers is that the relative importance of developing foundational "units of meaning" is unclear [169]. In order to capture meaning, semantic web researchers have focused on the development of ontologies. However, it is extremely laborious and time consuming to develop ontologies manually. The semantic web aims at capturing "true relevance", and one way to do this is to understand the semantics of terms in context. This demands a representation that considers the relationships between terms. In relation matching, both the terms and the relations between them, as expressed in the query, are matched with the terms and relations in the documents. The system first finds the concepts within the document that match the query terms and then checks whether the relations expressed by these matching concepts are the same as in the query. A word out of context does not provide useful information about its importance in the text. Further, correct semantic and pragmatic interpretation is possible only by accounting for the relationships between terms. It makes sense, therefore, that considering relationships in the matching process can improve retrieval effectiveness. This implicitly brings semantics into the matching process, thereby making intelligent retrieval possible. There can be sentences comprising the same set of words involved in different relationships.

As an example, consider the following two sentence fragments:

The farmer exploits worker…

and

Exploitation of farmers and workers…

These two sentences have entirely different meanings, though they are constituted from the same set of words. It is the relationships existing between the words that give them different meanings. Thus, considering the local context in which the words are being used helps in capturing the underlying semantics. In the following sections, we first introduce our model in section 6.2. In section 6.3, we propose new CG similarity measures. Section 6.4 presents the experimental investigation, and the conclusions are given in section 6.5.

6.2. The Proposed Retrieval Model

Our model performs retrieval based on term similarity and relational similarity. Vector space and CG-based representations of documents are used for identifying term and relational similarity, respectively. In this section, we discuss how documents and queries are represented, how these representations are compared in each case, and finally how the two similarity values are combined to give a single document score for retrieval.


We propose a retrieval model that takes advantage of the simplicity and efficiency of the vector space model, but enhances its effectiveness through two techniques: (a) relationships captured through underspecified syntactic analysis, and (b) consideration of a set of replaceable terms to expand queries. The latter is an improvement over thesaurus-based expansion, which often fails to add appropriate terms. The model combines relation and term matching. Our approach is simple and flexible and exploits any existing relational similarity between the document and the query to improve retrieval performance. The vector space model used in the previous chapter remains an integral part of this work, and the same approach is used to obtain the vectors of term weights representing the documents and queries. However, some modifications have been made to the CG-based retrieval model used in chapter 4. These modifications include:

• Automatic handling of statements involving "which".
• Use of replaceable terms instead of a concept type hierarchy.
• Use of heuristics and transitivity of relations to handle structural variations.

The steps followed to construct the CGs are:

(1) The documents are first tagged using the TnT tagger.

(2) The tagged output is then processed to:

(a) Rewrite compound statements involving "which" as two simple fragments, by substituting the referring instances and including additional fragments representing prepositional relations between terms in the constituent fragments. For example, consider the fragments in fig. 6.1.


(i) Techniques NNS are VBP described VBN which WDT place VBP the DT burden NN of IN the DT programmed JJ logic NN on IN system NN programs NNS … (from Doc #46 of the CACM collection)

(ii) The DT automatic JJ way NN in IN which WDT an DT internal JJ representation NN for IN each DT newly RB created VBN substring NN is VBZ stored VBN sequentially RB in IN a DT block NN of IN common JJ storage NN … (Doc #1062)

Figure 6.1: Sample fragments involving the use of "which"

Fragment (i) will be transformed to give:

(ia) Techniques NNS are VBP described VBN.

and

(ib) Techniques NNS place VBP the DT burden NN of IN the DT programmed JJ logic NN on IN system NN programs NNS …

In the case of (ii), the following additional fragment will be added: "ways for storing internal representation". This yields a semantically correct relationship between 'ways' and 'store', and 'in which' will be ignored during triple extraction. We have extracted syntactic rules for this purpose. When applied to the CACM collection, approximately 70% of the 1442 statements involving "which" were reduced correctly to simple fragments. We feel the exceptional cases will not create much of a problem, as queries undergo identical treatment.

(b) Delete determiners, wh-determiners, wh-pronouns, reflexives, and patterns such as "as well as" (replaced by "and"), "such as", etc.

(3) Conceptual graphs are then constructed for each document. Detailed semantic interpretation has not been attempted, as a deep and complete understanding of the text is not necessary for information retrieval. This helps in keeping the approach computationally simple. The CGs are stored in the form of triples as:


(rel c1 c2), where 'rel' is the relation and 'c1', 'c2' are the concepts participating in this relation. For each sentence of each document, the triples are identified and stored as shown in figure 6.2. For each document, the frequency of each triple and the maximum frequency of any triple in the document are recorded. In order to keep our approach simple and practically feasible, no type hierarchy has been considered in the CG representation of documents and queries, unlike the model used in chapter 4. Using existing resources as an alternative to a type hierarchy may yield such a large number of relation triples that precision suffers. In order to cover disjoint concepts that share similar meaning, a set of replaceable terms has been maintained and used during the construction of the query CG to yield additional edges in the graph. Fig. 6.3 lists some of these terms. These alternate representations can be presented to the user and his/her acceptance obtained.

Text fragment: "A methodology for calculating and optimizing system performance"

Tagged representation: A DT methodology NNP for IN calculating VBG and CC optimizing VBG system NNP performance NNP.

CG representation:
(pmod methodology calculate)
(pmod methodology optimize)
(mod performance system)
(obj calculate performance)
(obj optimize performance)

Figure 6.2: Identifying and representing triples
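To make the triple representation concrete, the following is a minimal Python sketch of how the extracted triples of a document might be indexed, recording the frequency of each triple and the maximum triple frequency needed by the similarity measures of section 6.3. The function and variable names are ours; the thesis does not prescribe a particular implementation.

    from collections import Counter

    def index_triples(doc_triples):
        """Build a frequency index for one document's triples.

        doc_triples: list of (rel, c1, c2) tuples extracted from the
        document's CG, e.g. for the fragment in Fig. 6.2:
            [("pmod", "methodology", "calculate"),
             ("pmod", "methodology", "optimize"),
             ("mod",  "performance", "system"),
             ("obj",  "calculate",   "performance"),
             ("obj",  "optimize",    "performance")]
        """
        triple_freq = Counter(doc_triples)               # frequency of each triple
        max_freq = max(triple_freq.values(), default=1)  # max frequency of any triple
        return triple_freq, max_freq

Storing the triples as a frequency table keeps matching a simple lookup, in line with the goal of keeping the approach computationally simple.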


Figure 6.3: A subset of replaceable terms
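Since the actual term list of Fig. 6.3 is not reproduced here, the sketch below uses hypothetical replaceable-term pairs purely for illustration. It shows one plausible way the table could be applied during query CG construction to generate the additional edges mentioned above.

    # Hypothetical examples only; the real pairs are those listed in Fig. 6.3.
    REPLACEABLE = {
        "fast": {"quick"},
        "paper": {"article"},
    }

    def expand_query_triples(query_triples):
        """Add alternate triples whose concepts are replaceable terms."""
        expanded = set(query_triples)
        for rel, c1, c2 in query_triples:
            for alt1 in {c1} | REPLACEABLE.get(c1, set()):
                for alt2 in {c2} | REPLACEABLE.get(c2, set()):
                    expanded.add((rel, alt1, alt2))
        return expanded

The expanded edges can then be shown to the user for acceptance before retrieval, as described above.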

Syntactic variations have been captured through simple heuristics. The heuristics considered are:

(i) When the root of a concept is a verb, a match with an 'obj' relation will be accepted for a triple involving a 'mod' relation, provided the participating concepts and their order are the same. This heuristic allows "information retrieval" to be matched with "retrieve information".

(ii) If the concepts appearing in the query triple and the document triple are the same, then an exact match is allowed between the 'pmod' and 'mod' relations. This heuristic allows an exact match of "retrieval of scientific documents" with "scientific documents retrieval". Thus, it eliminates the need to normalize such phrases, which sometimes yields incorrect relationships when handled automatically.

(iii) If the two concepts appearing in the query and the document triple are the same, then an exact match is allowed between the 'mod' and 'attr' relations. This heuristic handles cases where an adjective is tagged as a noun, resulting in a "mod" relation in some places and an "attr" relation in others.

(iv) When the two concepts appearing in the query and the document triples are the same, an exact match is allowed between the 'attr' and 'mnr' relations. The 'attr' relation holds between a noun and its qualifying adjective, whereas the 'mnr' relation holds between a verb and its modifier. This heuristic captures structural variants like "fast retrieval" and "retrieve fast".


(v) A search for a query triple (rel1 arg1 arg3) is satisfied if there exist triples (rel1 arg1 arg2) and (rel1 arg2 arg3). Using this transitivity of relations during matching allows "concept type hierarchy" to be matched with "concept hierarchy". A sketch of how these heuristics might be implemented is given below.
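The following Python sketch is one way heuristics (i)-(v) could be coded over (rel, c1, c2) tuples. It simplifies heuristic (i) by omitting the verb-root test, and the names are ours rather than the thesis's.

    # Relation pairs that heuristics (i)-(iv) allow to match, provided
    # the participating concepts (and their order) are the same.
    EQUIVALENT_RELS = {
        ("obj", "mod"), ("mod", "obj"),    # (i)  "retrieve information" ~ "information retrieval"
        ("pmod", "mod"), ("mod", "pmod"),  # (ii) prepositional vs. noun-compound phrasing
        ("mod", "attr"), ("attr", "mod"),  # (iii) adjective mis-tagged as noun
        ("attr", "mnr"), ("mnr", "attr"),  # (iv) "fast retrieval" ~ "retrieve fast"
    }

    def triples_match(q, d):
        """Heuristics (i)-(iv): exact match, or an allowed relation swap."""
        if q == d:
            return True
        (q_rel, q_c1, q_c2), (d_rel, d_c1, d_c2) = q, d
        return (q_c1, q_c2) == (d_c1, d_c2) and (q_rel, d_rel) in EQUIVALENT_RELS

    def match_with_transitivity(q, doc_triples):
        """Heuristic (v): (rel a c) also matches via (rel a b) and (rel b c)."""
        if any(triples_match(q, d) for d in doc_triples):
            return True
        rel, a, c = q
        mids = {d[2] for d in doc_triples if d[0] == rel and d[1] == a}
        return any((rel, b, c) in doc_triples for b in mids)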

6.3. New CG Similarity Measures

In order to perform retrieval, we identify relational similarity and vector similarity. To get the relational similarity between a query and a document, the query CG is matched with the document CG. This matching is done for each textual unit. A textual unit can be a single sentence, a window containing a predetermined number of sentences, or the whole document. In this study, we have evaluated four CG similarity measures for identifying the relational similarity between a query and a document, taking the entire document as the textual unit.

CG Similarity measure I (CGSim I)

In the first CG similarity measure (CGSim I) proposed by us, the relational similarity (SR) is obtained by dividing the sum of the matching triple frequencies by the product of the query length (m) and the maximum frequency of any triple in the document. The exact expression used to obtain the relational similarity SR between query k and document j is:

SR(j, k) = Σi (match(QTRPL[k], dtrpl[i, j]) × dtrpl_freq[i, j]) / (m × maxi(dtrpl_freq[i, j]))

where QTRPL is a vector containing the query triples of query k; match(QTRPL[k], dtrpl[i, j]) is a function that returns 1 if the ith triple in document j matches any of the query triples, and 0 otherwise; dtrpl_freq[i, j] is the frequency of the ith triple in document j; maxi(dtrpl_freq[i, j]) is the maximum frequency of any triple in document j; and m is the query length (i.e. the number of triples in the query CG).
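As a concrete illustration, CGSim I translates almost directly into Python using the frequency index and triples_match helper sketched earlier; again, the names are ours and this is a sketch, not the thesis's implementation.

    def relational_similarity(query_triples, triple_freq, max_freq):
        """CGSim I: sum of matching triple frequencies, normalized by
        (query length m) x (maximum triple frequency in the document)."""
        m = len(query_triples)            # number of triples in the query CG
        if m == 0 or max_freq == 0:
            return 0.0
        matched = sum(
            freq
            for triple, freq in triple_freq.items()
            if any(triples_match(q, triple) for q in query_triples)
        )
        return matched / (m * max_freq)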


The vector similarity (SV) between document j and query k is computed as the inner product of the query and document vectors:

SV(j, k) = Σi (wij × wik)

These two similarity measures are then combined to get a single score as:

S = α × SV + β × SR

where 0
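A minimal sketch of the final scoring step, with term-weight vectors held as dictionaries mapping terms to weights; α and β enter as parameters, and their admissible range is the one given by the condition in the text.

    def vector_similarity(query_weights, doc_weights):
        """SV: inner product of term-weight vectors (dicts: term -> weight)."""
        return sum(w * doc_weights.get(t, 0.0) for t, w in query_weights.items())

    def combined_score(sv, sr, alpha, beta):
        """S = alpha * SV + beta * SR."""
        return alpha * sv + beta * sr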