Cooperative Distributed Dynamic Lexicons for WWW Text Documents TR-IISc-CSA-ISL-KRK-99-06

K.R.K. Murthy Intelligent Systems Lab Dept. of Computer Science and Automation Indian Institute of Science Bangalore - 560 012 India E-mail: [email protected] Ph: (91)-(80)-3092907 Fax: (91)-(80)-3602911

S.S. Keerthi Dept. of Mechanical and Production Engg. National University of Singapore 10, Kent Ridge Crescent Singapore 119260 E-mail: [email protected]


June 6, 2015

Abstract

Lexicons play an important role in NLP, information retrieval, and information filtering. The important tasks carried out by lexicons in these applications are word sense disambiguation (WSD), word translation, and providing a basis for similarity measures between words and documents. The importance of lexicons has become more prominent in the light of the internet, which has become a source of enormous, wide-ranging information, especially text documents prepared in various languages. In this report, we argue that the present-day lexicon approach is not suitable for distributed, dynamic, and unfocused environments like the internet, owing to its static, focused and limited applicability. We then present the concept of a cooperative distributed dynamic lexicon (CDDLex), which can cope with the nature of the internet. CDDLexs may be cooperatively developed and maintained by several users distributed over the internet. We suggest information filtering communities [0] as a framework to develop and maintain CDDLexs. We also present an architecture and methodologies to implement and maintain CDDLexs, and to apply them to mono-lingual, multi-lingual, and cross-lingual retrieval, machine translation, and document summarization.


0.1 Introduction

The internet has accommodated exponentially growing information since its inception and has become a source of enormous, distributed, diverse information, presented in the form of text, images, video and audio. For the purposes of this report, we consider text documents only. Text documents appear in several natural languages. The volume and nature of text documents on the internet demand the use of information retrieval [0,0] and NLP processes [0] such as machine translation, natural language parsing, and document summarization. Lexicons are needed for word sense disambiguation (WSD) [0,0,0,0,0,0,0,0,0,0], word translation [0], and for defining word-word and document-document similarity measures, which are fundamental tasks in the above applications. Several lexicons such as WordNet [0] and EDR [0], and machine readable dictionaries such as Roget's [0], are available for WSD tasks. Several methodologies have been developed for machine translation [0] and for multi-lingual [] and cross-lingual [] information retrieval. Various similarity measures [0] also use lexicons as a basis.

But the lexicons in use are general purpose, static and bulky, which makes them ill-suited to these tasks in a massive, diverse, and dynamic environment like the internet: a lexicon for the internet must meet the vast needs of diverse users in a dynamic, massive environment. This observation has led us to develop a new concept called the cooperative distributed dynamic lexicon (CDDLex). A CDDLex has several components distributed over the internet, and the components may be maintained by several users or services independently of each other. A component describes the semantics and usages of words for a limited information space on the internet, and also describes relations between words. These components are accessed and suitably combined for the task at hand.
Several application approaches are possible, such as the user-oriented, author-oriented and broker-oriented approaches. We propose the concept network (CNW) [0] as a model for the components of a CDDLex, and information filtering communities (IFCs) [0] as a framework to develop and maintain them. CNWs capture the semantics of an information space and can be easily built and maintained, which are fundamental requirements of a CDDLex component model. The IFCs


provide a framework to share information on the semantics of various information spaces and their overlaps.

The report is organized as follows. Section 0.2 briefly describes the structure and operation of CNWs, and section 0.3 introduces the concept of IFCs. Section 0.4 describes lexicons and MRDs, their structure and approaches to constructing them, and discusses the role of lexicons in text-document tasks such as information retrieval and NLP. Section 0.5 elicits the problems with the present-day approach on the internet, presents the new lexicon proposal, i.e. the CDDLex, for use on the internet, and discusses its feasibility by proposing CNWs as its building blocks. Section 0.6 describes various approaches to the use of CDDLexs, and section 0.7 discusses various issues that must be resolved for effective implementation and use of CDDLexs. Section 0.8 concludes the report with future research directions.

0.2 CNW: An Introduction

We now present a representational methodology called the concept network, which helps in building information filtering systems that satisfy the characteristics derived above; it is one possible model that can meet those requirements. This representation tool is part of the instructable information filtering agent that we are currently developing. A concept network is a network of the concepts present in documents. Its building blocks, their operation, and the structure of the concept network are discussed in detail below; this section provides only a brief description, and a detailed discussion appears in [0]. Figure 1 shows a sample sketch of a concept network. It has three distinct kinds of items, namely keyPhrases KP1...KP9, concepts C1...C6, and context filters Ct1...Ct10. KeyPhrases are features observed in a document and are also called fundamental concepts. Each concept is a topic of discourse in the document. As shown in the figure, the arrangement of concepts is hierarchical, which is very natural in document analysis. For example, the concept sports could be derived using the concepts baseball, tennis and cricket; here baseball, tennis and cricket are called input concepts of the concept sports. Further, the baseball concept can be derived using the concepts US-baseball and Japanese-baseball. This example supports the hierarchical arrangement of concepts. Therefore the entire CNW is

divided into several levels, and each concept belongs to one of them. A concept receives inputs from concepts belonging to levels lower than its own. The 0th level, the lowest level of the CNW, accommodates keyPhrases only.

Figure 1: Concept Network. Concepts: C1...C6; keyPhrases (fundamental concepts): KP1...KP9; context filters: Ct1...Ct10. The network has four layers, 0th to 3rd.
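The hierarchical derivation of concepts described above can be sketched in a few lines. The following is a minimal sketch assuming a simple boolean activation model with no context filters; the concept and keyPhrase names (sports, home run, ...) are illustrative, not taken from the report.

```python
# Level-0 keyPhrases observed in a hypothetical document.
key_phrases = {"home run", "pitcher", "wicket"}

# Each concept lists its input concepts (or keyPhrases) from lower
# levels; here a concept fires if any of its inputs fires.
concepts = {
    "us-baseball": {"home run", "pitcher"},
    "cricket":     {"wicket", "bowler"},
    "baseball":    {"us-baseball", "japanese-baseball"},
    "sports":      {"baseball", "cricket", "tennis"},
}

def active(node: str) -> bool:
    """A node is active if it is an observed keyPhrase or if any
    of its input concepts is active (hierarchical derivation)."""
    if node in key_phrases:
        return True
    return any(active(inp) for inp in concepts.get(node, ()))

print(active("sports"))   # True: baseball fires via us-baseball
```

In a full CNW, each edge in `concepts` would additionally pass through a context filter before contributing to the receiving concept.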

Context filters are employed on those inputs of a concept that are expected to carry noise. Different concepts receiving inputs from the same input concept receive them through different context filters, because the definition of noise in a concept's output depends on which concept is receiving input from it. The noise carried by an input is contributed by two factors of uncertainty. The first is wrong sense, i.e. occurrence of an input in a sense not required by the concept; the second is incorrect relative position, i.e. improper occurrence of an input with respect to the other inputs of the concept, which can cause a wrong relevance judgment. The wrong-sense problem is further divided into the pseudo-sense problem and the broad-sense problem: the pseudo-sense problem arises when a concept occurs but not in the required broad sense, and the broad-sense problem arises when a concept occurs in the required broad sense but not exactly in the required specific sense. Apart from the above factors of uncertainty, another factor, called implicit presence of a keyPhrase, also contributes to the noise carried by keyPhrases: the keyPhrase is absent from the document but the topic it represents is present. Context filters handle all the above factors of uncertainty to reduce the noise carried by inputs, using their context of occurrence, which is defined by other concepts, phrases called

contextPhrases, and their positions of occurrence in the document. A context filter is divided into four consecutive stages to handle the factors described above: the first stage handles the implicit-presence problem, the second the pseudo-sense problem, the third the broad-sense problem, and the fourth the incorrect-relative-position problem. This order can be justified from a correctness viewpoint, a meaningfulness viewpoint, or both. The implicit-keyPhrase problem is handled by the first stage because handling any other factor before it would be nullified: unlike the stages handling the other factors, the output of this stage can be TRUE even if the input is FALSE for a given concept position. This is primarily a correctness issue; the order of the other stages follows from the meaningfulness viewpoint. Dividing context filters into these four stages helps in several ways: first, good comprehensibility, i.e. if a position of a concept is filtered out, one knows the reason; second, more decision-making power, depending on the decision models used for the stages; and third, a smaller credit assignment problem, i.e. if a right position is filtered out, or a wrong position is not, one can identify which stage is responsible and either correct it or supply feedback. For a detailed description of context filters refer to [0,0]. A document is represented as a network (concept network) of activations; a detailed description of how documents are represented will appear in forthcoming technical reports. The decision models used in both concepts and the various stages of context filters are decision lists [Rivest]. A "decision list" is a list L of pairs {(f1, v1), (f2, v2), ..., (fr, vr)} where each fi is a test, each vi is a class label, and fr is the constant function TRUE.
A decision list L defines a classification function as follows: for any input x, L(x) is defined to be vj, where j is the least index such that fj(x) = TRUE. The tests can be anything, such as perceptron tests, CNFs, DNFs or conjuncts of attributes. One further point about the description of CNWs: since RDF [0,0] has been introduced and CNWs have a clean, regular structure, CNWs may be naturally described in the RDF framework.
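The Rivest-style decision list defined above can be made concrete in a short sketch. The tests and labels below are hypothetical; the only structural requirement from the definition is that the last test fr is the constant function TRUE, guaranteeing a default label.

```python
def make_decision_list(pairs):
    """pairs: [(f1, v1), ..., (fr, vr)] with fr = lambda x: True.
    Returns L(x) = vj for the least index j with fj(x) = TRUE."""
    def classify(x):
        for test, label in pairs:
            if test(x):
                return label
        raise ValueError("last test must be the constant TRUE")
    return classify

# Tests here are simple conjuncts of attributes (illustrative).
L = make_decision_list([
    (lambda x: "cricket" in x and "bat" in x, "sports"),
    (lambda x: "bat" in x,                    "zoology"),
    (lambda x: True,                          "unknown"),   # fr
])

print(L({"bat", "cave"}))      # zoology: second test fires first
print(L({"cricket", "bat"}))   # sports: first test fires
```

Because evaluation stops at the first satisfied test, the ordering of the pairs encodes priority among possibly overlapping rules.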


0.3 An Introduction to IFCs

An information filtering community (IFC) is a group of people who explore similar streams of documents to satisfy their information needs and share their experiences of working on those streams. A member of an IFC discusses various noise factors in filtering, explores possible solutions, and shares various helpful representational building blocks1 for personalized information filtering with the other members; this is collaboration at the representation stage. Privacy of final interests can be preserved, since those are not shared with other members of any IFC: only partial information that does not reveal a member's interests is shared, and only to the extent of his willingness. In this way an IFC can balance privacy and sharability. Each member of a given IFC can belong to other IFCs, i.e. there can be considerable overlap among IFCs. A person therefore interacts with different IFCs for different components of his information filtering system; these components generally correspond to the user's different topics of interest.

0.4 Lexicons and Documents

In this section, we introduce lexicons and the basis for their development. We also introduce various ways of generating lexicons, with their advantages and disadvantages. We then present applications of lexicons and how lexicons are applied to these tasks.

0.4.1 Lexicons and dictionaries

A lexicon is a component of an NLP system that contains information (semantic and grammatical) about individual words and word strings [0]. Examples of lexicons are WordNet [0] and the EDR Electronic Dictionary [0]. The assumptions behind WordNet and EDR (according to Miller [0]) are: (1) lexical knowledge (a modest part of commonsense knowledge [0]) is necessary for computers to process human knowledge; (2) linguistic knowledge can be separated from

1. The process of representing documents contains blocks for processing them, extracting features from them, and measuring those features. These blocks are called representational building blocks, or components of the representational methodology.


commonsense knowledge; (3) lexical knowledge can be compiled using a large but finite number of factual propositions (definitions and linguistic relations); (4) word forms can be distinguished from the concepts they are used to express; and (5) lexicalized concepts can be organized by semantic relations. A lexicon provides the following semantic relationships between words: synonymy, antonymy, hyponymy (and conversely hypernymy), and meronymy (and conversely holonymy), as well as the syntactic categories of words. A lexicon is also able to identify polysemy. Apart from these semantic relationships and syntactic categories, an individual lexicon may provide other features. For example, WordNet (a mono-lingual English lexicon) also provides semantic relationships such as troponymy and entailment, and EDR (an English-Japanese, Japanese-English bi-lingual lexicon) provides information on semantic and surface-level co-occurrences of concepts. A machine readable dictionary (MRD) is a typical printed dictionary available in machine readable form [0]; examples of MRDs are OALD [0], COBUILD [0] and LDOCE [0]. The schemes proposed in the literature for constructing lexicons fall into the following categories: (1) MRD-based schemes [0], in which a lexicon is derived using the semantic and syntactic information, together with the structure and context information, in MRDs; (2) fluent-speaker-based schemes [0], in which fluent speakers of the language are engaged to develop the lexicon; and (3) corpus-and-MRD-based schemes [0], in which lexicons are derived from MRDs while limiting their application to the corpus at hand. An MRD-based or manually constructed lexicon may not be meaningful or sufficient for certain document collections, since either the dictionary or a fluent speaker may fail to exhaust all the senses used and explore all their contexts.
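The semantic relationships listed above can be pictured as a simple record structure per word. The following is a minimal sketch; the entries are illustrative and are not taken from WordNet or EDR.

```python
# One hypothetical lexicon entry: senses plus the semantic
# relations named in the text (synonymy, hypernymy, meronymy, ...).
lexicon = {
    "car": {
        "pos": "noun",
        "senses": ["motor vehicle", "railway carriage"],   # polysemy
        "synonyms":  {"automobile"},
        "antonyms":  set(),
        "hypernyms": {"vehicle"},          # a car is a vehicle
        "hyponyms":  {"hatchback"},        # a hatchback is a car
        "meronyms":  {"wheel", "engine"},  # parts of a car
        "holonyms":  {"fleet"},            # a car is part of a fleet
    },
}

def related(word, relation):
    """Look up one semantic relation for a word; empty if unknown."""
    return lexicon.get(word, {}).get(relation, set())

print(sorted(related("car", "meronyms")))   # ['engine', 'wheel']
```

A real lexicon attaches such relations to individual senses rather than to whole words, but the flat form suffices to illustrate the relation types.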
The lexicons generated by these two methods are general; here the third method, corpus-based lexicon generation, comes to the rescue. Since a lexicon generated by this method is tailored to a certain document collection, it works well on that collection; but a corpus-based lexicon may not be applicable in domains other than its own. These lexicons fall into one of the following classes: (1) general purpose lexicons, built using MRDs or fluent speakers of a language, such as WordNet, which are built to contain

general information about the words in the language; (2) application-specific lexicons [0], built for application domains, such as natural language parsing, where finding a corpus may be difficult; and (3) corpus-specific lexicons [0].

0.4.2 Applications of lexicons

Lexicons have been extensively applied to several text-document tasks: information retrieval [0,0,0,0], information filtering [0], and NLP (machine translation [0], natural language parsing [0] and dictation). Fundamental problems in all these tasks are word sense disambiguation (WSD), for which lexicons have been widely applied in several ways [0,0,0,0,0,0], and word translation, for which multi-lingual lexicons and dictionaries have been applied [0,0,0]. Another task that needs a lexicon, at least in some methodologies, is finding the similarity between two given documents [0]. Several researchers have applied lexicons in information retrieval to improve retrieval effectiveness over statistical retrieval schemes by disambiguating the words both in the query and in the documents [0]; this is done either automatically [0,0,0] or by prompting information seekers to choose some of the senses indicated by the lexicon [0]. Lexicon-based similarity measures are used to find the similarity between the query and documents [0]; documents are listed in descending order of their similarity to the query. In the other applications, WSD is a key task, and it may be done using lexicons. Information filtering is another application of lexicons [0]. Machine translation is yet another: word translation, one of its key tasks, is carried out by mapping a word in the source language to a word whose sense matches that of the source word. This involves finding the sense of the source-language word and mapping that sense to a destination-language word; the process is depicted in figure 2. Clearly this needs word sense disambiguation, which may require a lexicon. Matching and map generation are done using bi-lingual or multi-lingual dictionaries.
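The word-translation pipeline just described (disambiguate the source word, then map the chosen sense to a target-language word) can be sketched as follows. All dictionary entries are illustrative; WSD here is a naive context-overlap heuristic, not a method from the report.

```python
# Sense inventory of a hypothetical English source word, each sense
# described by a set of context keywords.
senses = {
    "bank": {
        "riverbank":   {"water", "river", "shore"},
        "institution": {"money", "loan", "account"},
    },
}

# Bilingual map: (word, sense) -> French word. Hypothetical entries.
bilingual = {
    ("bank", "riverbank"):   "rive",
    ("bank", "institution"): "banque",
}

def translate(word, context):
    """WSD by context overlap, then sense-to-word mapping."""
    best = max(senses[word], key=lambda s: len(senses[word][s] & context))
    return bilingual[(word, best)]

print(translate("bank", {"loan", "money", "interest"}))   # banque
```

The two dictionaries correspond to the "Match" and "Map" steps of figure 2: the sense inventory supports matching, and the bilingual map produces the destination-language word.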
Apart from the WSD task, several syntactic disambiguation tasks, such as reference disambiguation [] and part-of-speech disambiguation [], also have to be carried out, and these may need lexicons. Natural language parsing [0] is another application in which both semantic and syntactic

Figure 2: Word Translation: a word of language X is disambiguated (WSD), its sense is matched, and the sense is mapped to a list of equivalent words of language Y.

disambiguation have to be carried out. Another application of a similar kind is document summarization []. All these methods use existing general purpose lexicons or custom-built lexicons [0]. WSD may be carried out either by context-based rules or by finding similarities between the senses and the context of the word.
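One common family of lexicon-based similarity measures, of the kind mentioned above, scores two words by the path length between them in a hypernym ("is-a") hierarchy. The tiny taxonomy below is an assumed illustration, not data from any particular lexicon.

```python
hypernym = {           # child -> parent ("is-a") edges
    "cat": "mammal", "dog": "mammal",
    "mammal": "animal", "sparrow": "bird", "bird": "animal",
}

def ancestors(w):
    """The chain from w up to the taxonomy root, inclusive."""
    path = [w]
    while w in hypernym:
        w = hypernym[w]
        path.append(w)
    return path

def path_distance(a, b):
    """Edges from a to b through their lowest common ancestor;
    smaller distance means more similar words."""
    pa, pb = ancestors(a), ancestors(b)
    for i, node in enumerate(pa):
        if node in pb:
            return i + pb.index(node)
    return None   # no common ancestor in this toy taxonomy

print(path_distance("cat", "dog"))      # 2, via mammal
print(path_distance("cat", "sparrow"))  # 4, via animal
```

A similarity score can be derived from the distance, e.g. 1 / (1 + distance), and aggregated over word pairs to compare a query with a document.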

0.5 Cooperative Distributed Dynamic Lexicons

In this section we present a new concept called the cooperative distributed dynamic lexicon (CDDLex) as a solution to the problems faced by present-day lexicons and their approaches. We first present the problems with present-day lexicons for internet applications, then propose the concept of CDDLexs as a solution to these problems, and lastly propose CNWs as the building blocks of CDDLexs.

0.5.1 Problems in the present lexicon approaches

The deficiencies of the present lexicon approaches arise from users with diverse interests and needs using the internet as a common platform to gather or disseminate information, which leads to boundaryless usage of words. Further problems arise from the dynamic and bulky composition of the web document collection and of user needs. This subsection describes the important deficiencies of the present-day lexicon approaches by drawing up the characteristics of the internet.


0.5.A. Characteristics of the web: Before analyzing the deficiencies of the present approaches to lexicon development and use for the web, we must know the lexicon-related characteristics of the web: the nature of the documents it holds, the nature of its users, and the nature of its language usage. The following discussion presents these characteristics.

1. Multiple languages: Web documents are presented in many natural languages, such as English, French, Chinese and Sanskrit. A document in one language may be useful even to a user who does not know that language. Because of the multiplicity of languages, and space and maintenance constraints, an author cannot present a document in all languages; the resort is machine translation at the user's site, which needs WSD and word translation. The multiplicity of languages also demands systems for cross-lingual and multi-lingual retrieval, which in turn need query translation and WSD. Translating a word requires disambiguating it in the source language and finding the appropriate word in the destination language.

2. Multi-lingual documents: The multiplicity of languages, and authors' knowledge of multiple languages, cause multi-lingual documents to appear on the web. In a multi-lingual document, the context of a word is described by the multi-lingual text around it, which may require translating parts of the text into a language the user knows.

3. Bulkiness of vocabulary: The internet is a library accommodating documents of many of the world's languages and domains (such as history, computers, archeology and literature), and each language and each domain contributes a huge vocabulary. Together, these two factors give the internet document collection an enormous vocabulary.

4. Cross-domain usage of vocabulary: A word may be used in different domains in different senses, and all these domains appear uniformly (i.e. without any bias) on the internet, in a single collection. Therefore, to find the sense of a word, domain-specific context must be considered along with context that rules out senses belonging to other domains. This is clearly more complicated, and tends to be more error-prone in the presence of increased noise, than in single-domain collections.

5. Diverse senses of concepts: As discussed above, cross-domain usage gives rise to an increased number of senses compared with limited-domain usage of a word. Another contributing factor is that even within a limited domain, if it is not too narrow, one word can appear in a vast number of senses, because any sense can be narrowed or broadened. The accurate sense depends on the user, the document and the task. This fact is restated in [0] as follows: humans often cannot agree about which of a given collection of senses is being used in a particular sentence.

6. Dynamic composition of the web: The composition of the web (documents, domains and languages) does not remain static: documents and domains may be deleted or added, and users may join or withdraw from a domain.

7. Dynamic senses of concepts: The dynamic nature of the web changes senses, because of changing usage of words, because new domains may use existing words in senses specific to them, or both. This also includes the deletion or inclusion of slang and jargon, and the cross-domain exchange of words.

8. Structured and unstructured documents: The web accommodates both structured and unstructured documents. Some documents are partially structured, i.e. some portions carry tags while the rest is expository text.

9. Diverse contexts of a word: In structured documents, a word borrows its context not only from its vicinity but also from farther parts of the document, and even from documents pointing to the document [0].

10. Multi-domain documents: Many documents on the web, even short ones, contain information about multiple domains; for example, personal home pages and bibliography documents contain text from multiple domains. This requires tiling a document before processing its text.

11. Mutual definition of senses: Because of the huge number of domains and the cross-domain usage of words, the definitions of the senses of two words may be mutually dependent. In other words, a word defines the meaning of a context, and vice versa.

0.5.B. Demands on the lexicons of the web: General purpose lexicons can meet 60% of the knowledge needed, and the rest has to be acquired through domain-specific lexicons such as corpus-based lexicons. The following discussion presents a few demands on the general purpose and corpus-based lexicons of the web, and paves the way for the concept of CDDLexs.

1. Bulky lexicon: A bulky vocabulary, a vast number of senses, a huge number of domains and messy context specifications make a universal lexicon impossible both to develop and to maintain.

2. Dynamic lexicon: Since the composition of the web and the needs of its users are dynamic, a lexicon aimed at the web should be dynamic. This is not possible with present-day lexicon approaches.

3. Recurrent definition of senses: Because of the cross-domain usage of words and the huge number of domains, the sense of a word and its context may be mutually dependent, as pointed out in item 11 of section 0.5.A. This requires even WSD to be recurrent, or to rely on approximations.

4. Multi-lingual lexicons: As items 1 and 2 of section 0.5.A show, multi-lingual lexicons are needed for machine translation and for considering multi-lingual context. Given the bulky composition of the web, in both languages and domains, a single corpus-based multi-lingual lexicon is nearly impossible.

5. Maintainability: Because of its bulkiness and dynamic nature, no person or group of persons can keep maintaining a universal lexicon.

6. Freedom and access difficulties: Even if a single bulky lexicon existed, a user would not have the freedom to update it to meet his needs. The lexicon would either reside at the user's site, in which case it must be updated whenever the original version changes, or reside at a particular site, in which case accessing a bulky lexicon may be costly.

The following subsection proposes a lexicon composed of several dynamic, distributed, independent components; section 0.6 presents a scheme and methodologies for using it.

0.5.2 CDDLexs

The problems of bulky vocabulary, cross-domain usage of words, recurrent definition of senses, and diversity of senses may be reduced by breaking the document space into small subspaces. A lexicon defined on each such subspace does not face the hurdle of bulkiness.

If the lexicons developed on these small subspaces are easily upgradable, then the dynamic nature of composition and senses can be handled well. The user dependency of senses, maintainability, and access problems can be solved if we allow several lexicons to be defined on a document subspace and have them maintained at several user sites, i.e. distributed over the web. The following proposal solves the problems described in section 0.5.1 by incorporating these characteristics. As a solution to some of the problems presented earlier, we suggest cooperative distributed dynamic lexicons (CDDLexs) for the web. A CDDLex is a combination of several components; each component is a lexicon addressing a small portion of the web document space and the vocabulary space, and represents only limited knowledge about the vocabulary on its document subspace. These components are distributed over the web at various user sites. Each component has an owner who maintains it, and who thereby takes care of the dynamic nature of the web as far as the component's coverage is concerned. The limited document subspace, vocabulary subspace, and sense subspace make each component easy to construct and maintain. A distributed, user-maintained lexicon solves the problems of user dependency, the dynamic web, and access, since the owner of a component knows his needs and constructs and maintains the component accordingly. Appropriate selection of the component model is needed for focusing on the various subspaces. Figure 3 shows the architecture of a CDDLex, with three spaces: (1) the document space, the documents on the internet; (2) the vocabulary space, the vocabulary used in the document space; and (3) the sense space, the set of senses in which the document space uses the vocabulary. The relationships between the spaces are shown as labelled arrows.
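The component structure just described can be sketched as data: each component carries a document-subspace test and a word-to-sense mapping, and the lexicon consults all components whose subspace covers the document at hand. All filters and senses below are illustrative.

```python
# Two hypothetical CDDLex components, each owning a document
# subspace (here a crude keyword test) and a small sense map.
components = [
    {   # a finance-oriented component
        "doc_filter": lambda doc: "market" in doc,
        "senses": {"bank": "financial institution"},
    },
    {   # a geography-oriented component
        "doc_filter": lambda doc: "river" in doc,
        "senses": {"bank": "sloping land beside water"},
    },
]

def lookup(word, doc):
    """Return the senses offered by all components covering doc."""
    return [c["senses"][word]
            for c in components
            if c["doc_filter"](doc) and word in c["senses"]]

doc = {"river", "bank", "fishing"}
print(lookup("bank", doc))   # ['sloping land beside water']
```

Because a word's senses come only from components whose document subspace matches, each component stays small while the composition covers the whole space.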
The document space gives rise to the vocabulary space, and the document space maps each word of the vocabulary space to some senses in the sense space. We consider only senses that are mapped to at least one word of the vocabulary space and that are present in at least one document of the document space; the spaces thus contain only items present in the document collection. Figure 3 also shows that a CDDLex is a composition of several components (C1 through C6), each representing a small document subspace and covering a limited subspace of the senses of a small subspace of the vocabulary space, as applied to the focused document subspace.

Figure 3: Architecture of a CDDLex: C1 to C6 are components of the CDDLex; D1 to D4 are document subspaces; V1 to V4 are vocabulary subspaces; S1 to S4 are sense subspaces. Each CDDLex component represents one subspace of each space.

That is, a component finds the sense of a word, where the word belongs to the component's vocabulary subspace on its document subspace and the sense belongs to the sense subspace the component considers. For example, component C1 maps a word of V1 to one (or none) of the senses of S1 as used in D1. The document, vocabulary and sense subspaces covered by different components may overlap. Figure 4 shows all possible overlaps between two components with respect to their subspaces; each pair of components falls into one of the eight regions shown. For example, the component pair (C1, C2) falls in region 7, and the pair (C1, C4) falls in region 4. Each region captures a property of the web collection. For example, region 1 indicates document overlap only, implying that two components falling in it cover non-overlapping vocabulary and senses of a shared document subspace; region 2 means that two components address the same vocabulary on non-overlapping documents for non-overlapping senses. Each component takes care of cross-domain usage of words while giving the context specification of the senses of words in its document subspace. Focusing on a certain document space may be achieved in two ways: (1) the filter-component scheme; and (2) the combined-component scheme. The two schemes are suggested by the observation in Fig (a) of figure 5.
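The eight overlap regions of figure 4 arise from three binary overlap dimensions. The sketch below enumerates them; the text fixes only regions 1, 2, 7 and 8, so the numbering of regions 3 to 6 is an assumed encoding.

```python
def overlap_region(d: bool, v: bool, s: bool) -> int:
    """Map (D-overlap, V-overlap, S-overlap) between two CDDLex
    components to a region number 1..8; region 7 = overlap in
    every space, region 8 = no overlap in any space."""
    table = {
        (True,  False, False): 1,   # documents only
        (False, True,  False): 2,   # vocabulary only
        (False, False, True):  3,   # senses only (assumed number)
        (True,  True,  False): 4,   # (assumed number)
        (True,  False, True):  5,   # (assumed number)
        (False, True,  True):  6,   # (assumed number)
        (True,  True,  True):  7,   # overlap in every space
        (False, False, False): 8,   # no overlap in any space
    }
    return table[(d, v, s)]

print(overlap_region(True, True, True))     # 7
print(overlap_region(False, False, False))  # 8
```

Since each dimension is independent, the three booleans always select exactly one of the eight regions.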

Figure 4: Space of overlaps of any two components of a CDDLex; any pair of components falls into one of the eight regions of the figure. Region 8 indicates no overlap in any space, and region 7 indicates overlap in every space.

A CDDLex component may be viewed as having two parts: one for finding senses and lexical relationships, and the other, a filter, for focusing on a document subspace. If we keep these two parts separate, with the component following the filter, we obtain the filter-component scheme shown in figure 6: a document is passed to the CDDLex component only when its filter accepts it, and the filter is designed to accept documents belonging to the document subspace covered by the respective component. If we combine the two parts, we obtain the combined-component scheme shown in Fig (b) of figure 5; the filter and component then need not appear as separate blocks. Based on the overlaps between the document subspaces of the various components and the document at hand, more than one component may be selected for analyzing a document. To cope with multi-domain documents (documents belonging to multiple domains), the document must be segmented, or tiled, and the appropriate components applied to each segment. The segments may overlap within a document; in that case, the decision-making process on the overlapping portions must be constructed carefully. Segmentation, in turn, can follow two schemes: (1) intra-component segmentation, in which a CDDLex component itself finds the regions of the document appropriate to its document subspace; and (2) extra-component segmentation, in which the document is segmented by external means and the individual segments are passed to the appropriate

[Figure 5 shows, in Fig (a), a CDDLex component split into a basic filter and a basic layer, and, in Fig (b), the combined scheme in which the filter and basic layer form one apparent component.]

Figure 5: Fig (a) shows the two parts of a CDDLex component, namely the filter and the basic component; Fig (b) shows the combined component scheme.
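The filter-component scheme of figures 5 and 6 can be sketched as a small routing step (all filter predicates and component names are hypothetical): a document reaches a CDDLex component only when that component's filter accepts it, and overlapping document subspaces mean several components may be selected.

```python
# Toy sketch of the filter-component scheme: each CDDLex component is
# paired with a filter; a document is passed to a component only when
# its filter accepts the document. Overlapping document subspaces let
# several filters accept the same document.

def route(document, filter_component_pairs):
    """Return the components whose filters accept `document`."""
    return [comp for (filt, comp) in filter_component_pairs if filt(document)]

pairs = [
    (lambda d: "loan" in d,  "finance-component"),
    (lambda d: "river" in d, "geography-component"),
    (lambda d: "bank" in d,  "general-component"),
]

print(route("the bank approved the loan", pairs))
# both the finance and the general components accept this document
```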

[Figure 6 shows a document routed through switches S1 to S3 and filters 1 to 3 to CDDLex components 1 to 3.]

Figure 6: Filter-component scheme for focusing on a document subspace: closure of a switch indicates that the document belongs to the document subspace of the respective filter.

components. Figure 7 shows the extra component segmentation scheme. The next subsection presents the requirements of a component model for CDDLexs, proposes the concept network as such a model, and discusses its applicability for realizing a lexicon.

0.5.3  CNWs for CDDLexs

Based on the description given in the earlier subsections of the nature of the internet and the requirements of lexicons on it, we can derive the properties of the components of such a lexicon.



Figure 7: Extra component segmentation scheme.

The components should have: (1) the ability to handle multi-lingual words and contexts, needed for multi-lingual documents and word translation tasks; (2) the ability to focus on a small document space, needed for any component; (3) the ability to provide only a small set of senses for a given word, since a component provides only a limited number of senses; (4) the ability to incorporate lexical knowledge of the language; (5) the ability to narrow or broaden any sense of any word, since a component model must meet the requirements of diverse users; and (6) the property that a maintainer can construct and update them, needed for sharing components and for coping with the dynamic nature of the web.

We suggest the concept network (CNW) [0] as a model for the components of CDDLexs. An introduction to its structure and operation is presented in section 0.2. CNWs can be developed and updated under the sharable instructable agents model [0] in the framework of information filtering communities, as described in [0]; an introduction to IFCs is given in section 0.3. Note that there may be many models for CDDLex components; the CNW model, developed for the instructable information filtering agent that we are building, is one such possible model. The basic model suggested in [0] has to undergo several minor modifications, presented in later sections, for effective use in CDDLexs. The following paragraphs explain how the CNW satisfies the various requirements of a CDDLex component model.

1. Multi-lingual words and contexts: Since the CNW model uses words, word strings, or character strings as its features, language independent application is not difficult. It can even behave as a multi-lingual lexicon, since the label of a concept can be made a set of words together with the respective language labels.
Figure 8 shows how a word or word string of one language can be translated into another language; the scheme shown there translates one or more of the words A, B, C of language X to a word D of language Y.

Figure 8: An example lexicon component for translation: one or more words of language X are translated to the word D of language Y and the word G of language Z, through concept nodes and context filter nodes.

The translation scheme may be described by appropriately defining the scheme language. One such language is {(L11, L12), (L21, L22), ...}: a set of pairs whose first element is a set of source words and whose second element is the set of translations of the words in the first element. These sets need not contain words of the same language. Each label given in the scheme specification is the label of one of the concept nodes; subscripts of labels indicate their languages. One possible task of figure 8 is {({AX, BX}, {EY}), ({BX, CX}, {FY}), ({AX, BX, CX}, {DY, GZ})}. This says that A and B of language X are translated to E of language Y in the presence of a certain context, indicated by the corresponding context filters; B and C of language X are likewise translated to F of language Y; and A, B, C of X are translated to D of Y and G of Z. A word may be translated to multiple words of other languages, and the labels of several nodes may coincide.

2. Focusing on a document subspace: Focusing on a document subspace was discussed in the earlier section. The CNW model can also serve as a filter [0], so two consecutive CNWs can easily be merged into each other. Therefore the CNW, used as a component model, serves both the filter-component and the combined component schemes for focusing on a document subspace.

3. Focusing on vocabulary and sense subspaces: The context filters allow one to define only the required senses of a word, which is nothing but mapping a set of words into a predefined set of senses. Similarly, the words involved in a lexicon task are determined by the fundamental layer of the CNW, which is under the control of the user. Therefore the CNW model allows focusing both on the vocabulary and on the sense subspaces.
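The scheme language just described can be sketched as a small lookup, representing each labelled word as a (label, language) pair and reducing the context filters to a single predicate (all names and the predicate are illustrative assumptions):

```python
# Toy sketch of the translation scheme language of figure 8:
# a scheme is a list of pairs (source set, target set), where each
# element is a (word, language) pair. The context filters that gate
# each translation are reduced here to one predicate on the source set.

scheme = [
    ({("A", "X"), ("B", "X")},             {("E", "Y")}),
    ({("B", "X"), ("C", "X")},             {("F", "Y")}),
    ({("A", "X"), ("B", "X"), ("C", "X")}, {("D", "Y"), ("G", "Z")}),
]

def translate(source_words, scheme, context_ok=lambda src: True):
    """Return all target (word, language) pairs whose source set is
    contained in `source_words` and whose context filter fires."""
    out = set()
    for src, tgt in scheme:
        if src <= set(source_words) and context_ok(src):
            out |= tgt
    return out

# A and B of language X yield E of Y; adding C also fires the
# entries producing F of Y, D of Y, and G of Z.
print(translate({("A", "X"), ("B", "X")}, scheme))
```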

4. Lexical knowledge: Lexical knowledge comprises the semantic and syntactic relationships between words or word strings. Any network or graphical model can represent the semantic relationships and syntactic roles of words. Being a network model by nature, and able to take the context of appearance into account in a powerful way through the context filters on links, the CNW model is well suited to lexicon modeling. Figures 9 and 10 show how semantic and syntactic relationships can be represented by a CNW model.

Figure 9: A CNW scheme for representing semantic relationships between words or word strings; concept nodes (words) W1 to W8 are connected through context filter nodes to the higher-level nodes W^1, W^2, and W.

Figure 9 shows semantic relationships between various words. The relationships may be identified precisely by using a scheme language. An example language is {(S11, R1, S12), (S21, R2, S22), ...}, meaning that the words of the sets S11 and S21 obey the relationships R1 and R2 with the word sets S12 and S22 respectively. For example, figure 9 encodes the scheme {({W1, W2, W3, W4}, R, {W^1, W}), ({W5, W6, W7, W8}, R, {W^2, W}), ({W^1, W^2}, R, {W})}. The relation R may be any of the lexical relationships synonym, hypernym, holonym, or meronym. A single relation need not hold throughout the scheme; the relations only need to be mutually compatible. The relations indicated hold only under the conditions specified by the context filters and nodes.

Figure 10 shows how syntactic relations may be defined using the CNW component model. The words of the fundamental layer obey the syntactic label given by the second-level node under the conditions specified by the respective context filters.

Figure 10: A CNW scheme for representing syntactic definitions of words or word strings; concept nodes (words) W1 to W4 are connected through context filter nodes to a syntactic label node.

The language to specify the scheme is similar to the one above: each element of the set is a pair whose first element is a set of words and whose second element is the syntactic label together with the label of the syntactic node.

5. Granularity of senses: Senses of any granularity can be generated, thanks to the staged context filter structure on links and the graphical structure of the CNW model. This is described with an example in [0].

6. Construction and maintenance: Construction and maintenance of a CNW are simple. The architecture may easily be supplied, as is evident from the figures. The decision making processes are derived using the SIIFA model [0], the learning paradigms suggested in [0], and the decision list algorithms presented in [0]. The SIIFA framework helps even in crafting the architecture of a CNW.

The above discussion shows that the CNW model for CDDLex components is powerful and easy to construct and maintain. Compared to a general purpose lexicon such as WordNet, it is more powerful on its document subspace for the user who crafted it; that is, it combines the power of corpus based lexicons with the comprehensibility of general purpose lexicons. The next subsection explains how this distributed lexicon scheme can be used in practice, presenting several application approaches in one framework.
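The semantic scheme language can be sketched directly as data (the relation names and node labels are illustrative; the context filter conditions are omitted):

```python
# Toy sketch of the semantic-relationship scheme language:
# a scheme is a list of triples (S1, R, S2), meaning the words of S1
# stand in relation R to the words of S2, subject to context filters
# (omitted here). "Wa"/"Wb" stand in for the higher-level nodes.

scheme = [
    ({"W1", "W2", "W3", "W4"}, "hypernym", {"Wa"}),
    ({"W5", "W6", "W7", "W8"}, "hypernym", {"Wb"}),
    ({"Wa", "Wb"},             "hypernym", {"W"}),
]

def related(word, relation, scheme):
    """Return the words that `word` is related to by `relation`."""
    out = set()
    for s1, r, s2 in scheme:
        if r == relation and word in s1:
            out |= s2
    return out

print(related("W2", "hypernym", scheme))  # the intermediate node above W2
print(related("Wa", "hypernym", scheme))  # the top node
```

Note that different triples could carry different relation names, matching the remark above that the relations need only be mutually compatible.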


0.6  Application Approaches of CDDLexs

This section presents schemes to maintain and organize the components of CDDLexs, schemes to access CDDLexs, and their applications to various tasks on internet documents.

0.6.1  Maintaining components in CDDLex

There are mainly three ways of developing and maintaining the components of CDDLexs, and they may be complementary to each other: (1) standard place components (SPCs); (2) distributed place components (DPCs); and (3) broker place components (BPCs). These are used to meet the general needs of users of well-defined groups, since there is the possibility of customization for the group.

SPCs may be maintained at places like mailing list servers, news group servers, or community page servers, either collaboratively by several users as described in [0] or by an authorized person. An SPC at a server may be directly applicable to all documents that originate from it or are related to it.

DPCs are used for general documents and are maintained by many individual users. The lexical knowledge represented by a user's DPC corresponds to the document subspace explored by that user, and its vocabulary and sense subspaces are tailored to that user. A DPC can become an SPC over time if its popularity increases.

BPCs are maintained by brokers of CDDLex components for their users. A broker may maintain a set of components, or pointers to DPCs and SPCs which meet the needs of its users. A broker receives a request for CDDLex components, from a document or a user, in terms of documents or queries, and returns a set of components or component addresses which meet the request.

Cooperation between the various components may be arranged in several ways; one such scheme is to establish and operate them as in IFGs. The concept of a CDDLex IFG, presented below, is developed for this purpose.
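As a rough sketch of a broker's role (the registry format, the addresses, and the keyword-overlap matching rule are all hypothetical assumptions):

```python
# Hypothetical sketch of a BPC broker: it keeps (description, address)
# records for SPCs and DPCs and, given a query describing the document
# subspace at hand, returns the addresses of matching components.

class Broker:
    def __init__(self):
        self.records = []  # (keyword set describing the subspace, address)

    def register(self, keywords, address):
        self.records.append((set(keywords), address))

    def find(self, query_words):
        """Return addresses of components whose description overlaps the query."""
        q = set(query_words)
        return [addr for kw, addr in self.records if kw & q]

broker = Broker()
broker.register({"finance", "banking"}, "spc://mailinglist.example/finance")
broker.register({"geography", "rivers"}, "dpc://user.example/geo")

print(broker.find({"banking", "loans"}))
```

A real broker would of course match on richer descriptions of document, vocabulary, and sense subspaces rather than bare keyword overlap.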

0.6.A. Information flow graph for CDDLexs: The concept of IFGs has been suggested in [0] as a means of establishing implicit community cooperation in distributed information dissemination, access, and maintenance systems. The important characteristics needed to apply IFGs to a task are: distributed task handling elements, and the opportunity to establish cooperation between them, which includes the ability to form semantic connections between them and to pass information items. CDDLex satisfies all the requirements for forming another IFG: the components of a CDDLex process documents in a distributed manner, and they explore overlapping subspaces of documents, vocabulary, and senses. The information items that can flow are of two types: (1) a document looking for appropriate components, whose flow establishes the seeking information flow; and (2) new explorations of the various spaces, which establish the corrective information flow. Connections may be established during the component seeking process, which includes the seeking information flow and searching for components, and by serendipity through IFG monitoring agents.

The CDDLex IFG gives rise to popularity ratings for its components, which in turn generate authorities and hubs as described in [0]. SPCs may become authorities, since many DPCs may point to them; BPCs may become hubs or authorities, since many components point to them and they in turn point to many SPCs and DPCs. Social and regional components may also be linked to each other to meet the requirements of social or regional communities in the CDDLex IFG; social and regional groupings are important for lexicon generation, as stated in []. Each component, i.e. each CDDLex IFG node, is characterized by the document, vocabulary, and sense subspaces spanned by it. The links are characterized by the aggregation of the above spaces over the components up to N levels following the descendant component; the aggregation can easily be done by merging components, since they are CNWs, which are easily mergeable.

0.6.2  Operating CDDLex

Now we present various approaches for applying CDDLexs in various applications. These are the user oriented, author oriented, and broker oriented approaches; they offer flexibility of application and can be applied in combination. The following paragraphs present the three approaches in detail at the proposal level.

Figure 11 shows a scheme of operation of CDDLex. The blocks document spaces finder, CDDLex component discoverer, CDDLex component combiner, and user CDDLex base, put together, are called a CDDLex engine. The first block segments the document and finds the corresponding document subspace for each segment; the second takes each (segment, document subspace) pair and generates CDDLex components for the segment; the third combines the components of each segment into a single component and outputs (segment, component) pairs; and the fourth provides a database of information on document subspaces and CDDLex components, along with access to the external CDDLex bases available on the internet.

Figure 11: A scheme for working with CDDLex: on the author side, the author CDDLex engine augments the document; on the user side, the user CDDLex engine (document spaces finder, CDDLex component discoverer, CDDLex component combiner, and local CDDLex base, with access to brokers, standard knowledge, and the external CDDLex base) feeds the application.

CDDLex engines exist both on the user's side and on the author's side. The engine employed at an author's site is called the author CDDLex engine; the engine on the user side is called the user CDDLex engine. The two differ in function and context of usage: the author CDDLex engine takes care of what the author provides on the internet and to whom, while the user CDDLex engine takes care of what subspaces the user explores and what he wants out of them. Both engines have local CDDLex bases and have access to the distributed CDDLex base and to brokers on the internet. Since many users are also authors of documents and vice versa, many internet users may well run both engines.

The document accessed by a user may be modified at the author's site to include information about its use. Document augmentation is done by accessing the CDDLex engine residing at the author's site: the augmenting document block, which contains the author CDDLex engine and a document augmentation block, analyzes the document and augments it with information about its segments and the associated CDDLex components.
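A minimal sketch of this pipeline (all function names, the topic test, and the name-joining "combiner" are illustrative assumptions standing in for the real finder, discoverer, and CNW-merging combiner):

```python
# Hypothetical sketch of the CDDLex engine pipeline: the document
# spaces finder yields (segment, subspace) pairs, the component
# discoverer attaches components, and the combiner merges the
# components of each segment into a single one.

def spaces_finder(document):
    # one (segment, subspace) pair per sentence, keyed by a toy topic test
    for segment in document.split("."):
        segment = segment.strip()
        if segment:
            subspace = "finance" if "loan" in segment else "general"
            yield segment, subspace

def component_discoverer(subspace, cddlex_base):
    return cddlex_base.get(subspace, [])

def component_combiner(components):
    # merging CNW components is reduced here to joining their names
    return "+".join(components) if components else "none"

def cddlex_engine(document, cddlex_base):
    out = []
    for segment, subspace in spaces_finder(document):
        comps = component_discoverer(subspace, cddlex_base)
        out.append((segment, component_combiner(comps)))
    return out

base = {"finance": ["C1", "C4"], "general": ["C2"]}
print(cddlex_engine("The bank approved the loan. We went home.", base))
# -> [('The bank approved the loan', 'C1+C4'), ('We went home', 'C2')]
```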

Once the document is accessed by the user, it is directed to the user's CDDLex engine for finding its segments and their respective document subspaces. The output of the user CDDLex engine is fed to the application module for processing the document. The application module may avail itself of standard knowledge, i.e. other knowledge bases required for processing the documents, such as general purpose lexicons, translators, and syntactic analyzers.

Based on the operation of the above scheme, one can devise several approaches for generating the lexicon component for the task at hand. We propose three approaches, which need not be mutually exclusive; they are distinguished by who is actively involved in component generation: (1) the author oriented approach, in which the author plays a role in component access and generation; (2) the user oriented approach, which employs the user's preferences and ideas to generate components; and (3) the broker oriented approach, in which a broker is contacted to find the right components for the task and the documents at hand. All three are discussed below by explaining the processes that take place in the various blocks of figure 11.

1. Author oriented approach: In this approach, the author plays an important role in generating the CDDLex components for his documents. The author of a document adds the addresses or descriptions of the components, document subspaces, or brokers to be used for carrying out tasks on the document; document subspaces may be described by specifying the filters of components. In effect, the author gives pointers to the components which best capture his senses in the documents, or points to a broker who best describes his usage and senses. In this approach, either the author or the author CDDLex engine dominates the process.
In this approach, the author either personally augments the documents and keeps them for access, or the author CDDLex engine analyzes and augments them. The first variant is more static, i.e. the metadata about the document does not change, since authors do not engage in this activity frequently; the second is dynamic, i.e. changes in the CDDLex environment appear immediately at the next access of documents obtained through it. The first variant is computationally less intensive than the second. The static nature of manual augmentation may be alleviated, to some extent, by naming a broker who takes care of generating components; this is the broker oriented approach described below.

On the user side, before task execution, the user agent retrieves the data given in the document and uses them for the task. The document space finder uses the (segment, document subspace) pairs mentioned in the document if they are given; otherwise, it generates the pairs on its own, by contacting appropriate brokers mentioned by the user or the author, or by discovering them on the internet. The component finder takes the (segment, document subspace) pairs output by the document subspace finder and obtains the corresponding components as specified by the author or the user; here too, either external or internal CDDLex bases are searched for components, or a broker is contacted for them. The component combiner takes the (segment, components) pairs and combines them to form a lexicon either for the whole document or for each segment individually.

From the above discussion, it is clear that the components may be DPCs, SPCs, or BPCs, depending on the author, the user, and the document. This approach may run into the critical mass problem, since the author of each document has to specify the CDDLex components to use. Once critical mass is reached, the approach is effective, since the author knows which document space his documents belong to and which components do well on them. Developing critical mass is not a problem if the document passes through standard document streams or collections such as mailing lists, or if the author employs an author CDDLex engine.

2. User oriented approach: In this approach, the user specifies the filters, components, and brokers to use for finishing the task at hand; the user wants to see a document from his own viewpoint, apart from what the author says. This is also needed in the following cases: (1) the author does not provide any information; and (2) the user's agent wants to measure the discrepancy between the user's and the author's views of a document, which may help the agent further explore the web for information gathering and component collection. The components retrieved may be combined with the components retrieved from the author's specification. In this approach the user CDDLex engine dominates the processing, as opposed to the author oriented approach. The components may be maintained with the user in the user CDDLex base, or may be retrieved using either a broker or the external CDDLex base, based on the document, as described earlier. Therefore these may be DPCs, SPCs, or BPCs, though in general they will be DPCs or BPCs. This approach does not run into the critical mass problem, but it can cause irritation to the user, which may be eliminated, to some extent, in the IFCs framework.

3. Broker oriented approach: In this approach, the user retrieves a document through a broker or agent which monitors the DPCs and SPCs. The agent suggests the required elements (components, filters, or document subspaces) based on its experience and the document; the broker may be specified either by the user or by the author. Since the agent or broker has been in the field for a long time, it has considerable experience of the nature of documents, CDDLex components, and their maintainers. This avoids both the critical mass problem of the author oriented approach and the irritation problem of the user oriented approach.

The next subsection presents various applications related to IR, IF, and NLP, and indicates which of the above approaches is useful for each task.

0.6.3  Applications

Now we present how the above schemes can be applied to problems related to information retrieval, translation, document summarization, and similarity computation. The following paragraphs present the subtasks of these applications in detail at the proposal level. As discussed in section 0.4, the basic problems of these domains are word sense disambiguation, word translation, similarity measurement, and text generation.

1. Information retrieval: Information retrieval involves three varieties of databases: mono-lingual, cross-lingual, and multi-lingual document databases; the complexity of the retrieval task increases in that order. Three tasks have to be carried out: (1) word sense disambiguation; (2) query translation, i.e. translating the query into the language of the collection; and (3) query elaboration, i.e. expanding the query to represent the information need of the user within the limits of the collection. In mono-lingual retrieval, we only need to disambiguate and elaborate the query, which may be done with the user oriented approach to lexicon component generation; moreover, documents have to be seen from the viewpoint of the user, which also supports the user oriented approach. The actual ways of doing this are described in [0]. In multi-lingual and cross-lingual retrieval, the query has to be translated into the appropriate languages after its keywords are disambiguated. Disambiguation and elaboration may be done as above; translation may be carried out as described in section 0.5 and in figure 8. Since the user may not understand the other languages, disambiguation, elaboration, and translation may proceed in repeated fashion. Thus, compared to mono-lingual retrieval, only the translation phase is added. As in mono-lingual retrieval, component generation may follow either the user oriented or the broker oriented approach, as described in [0].

2. Translation: In this report we consider finding the right replacement of a word of the source language in the destination language. This follows a scheme similar to that of cross-lingual document retrieval: words are replaced by analyzing the context of each word using the lexicon component derived with one or more of the approaches described in the previous section; the translation scheme itself was described in figure 8. A reasonable way of translating a document is to use the standard knowledge available at the user's site for normal translation, together with the lexicon components derived using one of the above application approaches. If the document belongs to a document subspace that is new to the user, then the author or broker oriented approaches should be used; otherwise, the user oriented approach gives a translation that will be palatable to the user.

3. Summarization: A document is segmented and labelled with the labels of the nodes of the lexicon component; the labels also carry the context analysis of the segments they represent, and segments with higher-level labels contain further lower-level labels. Using these observations, one can train a system to generate text from the assigned labels, which is nothing but a summary of the document. There are several text generation systems, such as []. Since the summary has to appear from the viewpoint of the user, either the user oriented or the broker oriented approach may be used for generating components, together with whatever the author proposes, if anything, for the segmenting and labelling tasks.

4. Lexicon based similarity measures: Similarity measures exist for finding the similarity between documents and between words, and lexicons have been applied to these tasks []. The challenge lies in finding the similarity between entities of different languages. Since the similarity between documents depends on the user who is looking at them, the user or broker oriented approaches may be more appropriate.
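A toy sketch of lexicon based similarity over several components (the weights, the overlap measure, and the weighted-average combination rule are illustrative assumptions; how best to combine per-component similarities is left open later in this report):

```python
# Hypothetical sketch: each relevant component reports a similarity
# between two documents on its own subspace, and the per-component
# scores are combined by a weighted average.

def combined_similarity(doc1, doc2, components):
    """`components` is a list of (weight, similarity_function) pairs."""
    total = sum(w for w, _ in components)
    return sum(w * sim(doc1, doc2) for w, sim in components) / total

def word_overlap(d1, d2):
    # Jaccard overlap of the word sets, a stand-in for one component
    a, b = set(d1.split()), set(d2.split())
    return len(a & b) / len(a | b)

components = [
    (2.0, word_overlap),            # a component the user trusts more
    (1.0, lambda d1, d2: 0.5),      # a component with a fixed toy score
]

print(combined_similarity("bank loan rate", "bank loan term", components))
```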

0.7  Issues in CDDLexs

The ideas proposed in this report are high-level ones. To realize the effectiveness of CDDLex, several improvements over the proposed ideas have to be incorporated and various additional features have to be added. We list a few of them below.

1. Naming standards: The nodes of CNWs may be named arbitrarily if there is no convention to follow. Arbitrary naming may degrade the retrieval of CDDLex components; therefore, there should be a naming convention for nodes (especially higher-level nodes) which machines can interpret with respect to their sense and their coverage of the document space. There is also a need to work on the component specification language described in section 0.5; only a few of the cases are described in this report, and one has to explore all cases and come up with a good description language.

2. Context specification: Context specification of a word in the proposed CNW depends purely on its physical location in the document, which is not adequate for structured documents like internet documents. Additional factors for context specification are discussed below:

• Document structure: Web documents have many tags and much structure, which matters for correctly describing context. For example, the context described in a section heading carries more information about a word in that section than the surrounding words do.

• Graph structure: Unlike traditional document collections, the web collection has a graph structure in which documents are nodes and URLs are links. The analysis of a document depends on the analysis of the links pointing to it []; similarly, the context of a word may depend on the content of the documents pointed to by its document. All these factors have to be considered in the context specification of a word.

• Metadata: Metadata described in documents also provides good information for discovering the context of a document and of the words in it.

The above features have to be included in the CNW model, and the CDDLex operation scheme has to be modified to cope with the enhanced component model.

3. IFC issues: Since IFCs play a crucial role in developing and maintaining CDDLexs, all the issues listed for effective application of IFCs, given in [0], have to be resolved. Beyond those, one has to address issues like the explicit definition of communities and the setting of standards for them; explicit community definitions are needed since lexicons are crucial for many applications.

4. CDDLex IFG issues: At present, cooperation between the components of a CDDLex is passive, i.e. the maintainers of components have to initiate the component updating process. Active cooperation between the components would enhance their performance and maintainability, since they may be addressing overlapping spaces; the CDDLex IFG has been proposed for exactly this purpose. Several issues remain to be resolved: describing information items; describing characteristics; and the other IFG issues described in [0].

5. Identifying non-standard important components: A component may not be present at a standard place but may prove important because of its correctness, up-to-dateness, and wide applicability. Finding such components, informing the corresponding components, and keeping them at a standard place is a useful operation for a CDDLex. The active cooperation enhancement suggested above may be useful for this task, which may be done by analyzing a component's popularity in the CDDLex IFG.
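A small sketch of the document-structure factor (the weights and window size are arbitrary assumptions): words taken from a section heading are given a higher context weight than body words near the target word.

```python
# Hypothetical sketch of structure-aware context specification:
# heading words get a fixed high weight; body words within a
# +/- `window` span of the target word get weight 1.0.

def weighted_context(heading, body, target, window=3, heading_weight=3.0):
    """Return {context word: weight} for `target`, mixing heading
    words with nearby body words."""
    context = {w: heading_weight for w in heading.split() if w != target}
    words = body.split()
    for i, w in enumerate(words):
        if w == target:
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j != i:
                    context[words[j]] = max(context.get(words[j], 0.0), 1.0)
    return context

ctx = weighted_context("interest rates", "the bank raised its rates", "bank")
print(ctx)
# "rates" keeps the heading weight; plain body neighbours get 1.0
```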
6. Summarized view of a CDDLex: It may be useful to give a summarized view of a CDDLex by combining its components. Such a view of the vocabulary, sense spaces, and context specifications shows the cross-domain usage of words and their range; it helps in developing better components of the CDDLex, in establishing active cooperation among its components, and in labelling some of its components as important ones.

7. Author oriented approach issues: In the author oriented approach, developing and maintaining the author CDDLex engine is a major issue. Related issues are its scheme of operation, and learning and describing the author's authoring styles. If the engine can take feedback from its users, it should use this feedback to customize or personalize the document augmentation process for them. The role of the author CDDLex engine in the CDDLex IFG has to be analyzed for architecting it better.

8. User oriented approach issues: Here the user CDDLex engine plays the key role in the user oriented component generation task. Its issues are similar to those of the author CDDLex engine: designing its scheme of operation; learning and describing the user's information space and his behavior in it; and analyzing its behavior in the CDDLex IFG. Apart from the engine issues, one has to resolve how to combine CDDLex knowledge with standard knowledge; since lexicons serve several applications with common subtasks such as word sense disambiguation, translation, and document similarity, it is important to develop these systems so that they can be incorporated into bigger systems.

9. Broker/agent oriented approach issues: In this approach, a broker or agent suggests subspaces of documents using user related or author related information. For this, the broker must be able to describe and learn users' or authors' behaviors and their subspaces of exploration or participation respectively; this description and learning are important issues to resolve. In addition, the architecture of the broker and its participation in the CDDLex IFG have to be developed.

10. Rewards for component maintainers: Rewards have to be supplied to good component maintainers and providers. The nature of the rewards and the schemes for awarding them have to be devised; finding such maintainers is another issue to handle.


0.8  Conclusions and Future Research

In this report we have presented the problems with the use of static lexicons or dictionaries in the context of internet whose characteristics are also presented in this report. We have presented cooperative distributed dynamic lexicon (CDDLex) as a solution to the above problems and presented its feasibility by the use of concept networks [0] in the framework of information filtering communities [0]. Then we have presented various application methodologies for applying in various circumstances which gives flexibility of usage and elimination of critical mass problem. Finally we have presented various issues that have to be resolved for the effective implementation and application of CDDLexs. The CDDLex proposal gives rise to lot of other research directions apart from those listed in the previous section. They are listed below: 1.Using conventional lexicons in CDDLexs: The presence of conventional lexicons does not need to be ignored out in the lieu of CDDLexs! Conventional lexicons provide rich stable information about the language and its usage; combining them with the CDDLexs may enhance the performance of lexicon related tasks. 2.IFGs and CDDLexs: Information flow graphs [0] provide framework for automatic information flow for timely dissemination of information. The information flow in IFGs and their framework may be combined with the CDDLexs for improved operation of them. 3.Combining similarities: Instead of combining various components for finding similarity between two documents or words, one can find their similarities on the relevant components and combine them for getting the final similarity between documents. The issue is finding sensible way of combining similarities. 4.Combining decisions: This is similar to the above direction. If we can find a sensible way of combining decisions of various relevant components, we do not need to combine components themselves for classifying a word. Association for Computational linguistics, Proc. 
References

Association for Computational Linguistics, Proc. of the 28th Annual Meeting of the ACL (ACL-90), Pittsburgh, PA, 1990.
Giuseppe Attardi, Sergio Di Marco and Davide Salvi, Categorization by Context, Journal of Universal Computer Science, Vol. 4, No. 9, 1998.

Andres Bartroli, Jean-Francois Groff, Claudia Mozzafari and Werner Staub, A Multi-lingual Lexicon on the Web, In Proc. of 1st Intl. WWW Conf., 1994. http://www1.cern.ch/WWW94/Welcome.html
Brown, P.F., Della Pietra, S.A., Della Pietra, V.J. and Mercer, R.L., Word-Sense Disambiguation Using Statistical Methods, In Proc. of the ACL Meeting, Berkeley, 1991, pp. 264-270.
CYC, WordNet and EDR: Critiques and Responses, CACM, Nov. 1995, Vol. 38, No. 11, pp. 45-48.
Ling Cao, Mun-Kew Leong, Ying Lu and Hwee-Boon Low, Searching Heterogeneous Multilingual Bibliographic Sources, In Proc. of 7th Intl. WWW Conf., 1998. http://www7.scu.edu.au/
Isaac Cheng and Robert Wilensky, An Experiment in Enhancing Information Access by Natural Language Processing, Technical Report CSD-97-963, University of California, Berkeley, July 1997.
Kenneth W. Church and Lisa F. Rau, Commercial Applications of Natural Language Processing, CACM, Nov. 1995, Vol. 38, No. 11, pp. 71-79.
Collins COBUILD Dictionary of the English Language, 1987.
Douglas B. Lenat, CYC: A Large-Scale Investment in Knowledge Infrastructure, CACM, Nov. 1995, Vol. 38, No. 11, pp. 33-38.
OCLC Forest Press, Dewey Decimal System Home Page, http://www.oclc.org/oclc/fp/index.htm (October 1998).
Toshio Yokoi, The EDR Electronic Dictionary, CACM, Nov. 1995, Vol. 38, No. 11, pp. 42-44.
ACL, Proc. of the 14th Intl. Conf. on Computational Linguistics (COLING-92), Nantes, France, 1992.


Johannes Furnkranz, Tom Mitchell and Ellen Riloff, A Case Study in Using Linguistic Phrases for Text Categorization on the WWW, In M. Sahami (ed.), Learning for Text Categorization: Papers from the 1998 AAAI/ICML Workshop, pp. 5-13, Madison, WI, AAAI Press, 1998.
Terrance Goan, Nels Belson and Oren Etzioni, A Grammar Inference Algorithm for the World Wide Web, In Proc. of the AAAI 1996 Spring Symposium on Machine Learning in Information Access, 1996. http://www.aaai.org/Symposia/Spring/1996/sssparticipation-96.html
Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, In Proc. of 7th Intl. WWW Conf., 1998. http://www7.scu.edu.au/
Stephen J. Green, Automated Link Generation: Can We Do Better Than Term Repetition?, In Proc. of 7th Intl. WWW Conf., 1998. http://www7.scu.edu.au/
Louise Guthrie, James Pustejovsky, Yorick Wilks and Brian M. Slator, The Role of Lexicons in Natural Language Processing, CACM, Jan. 1996, Vol. 39, No. 1, pp. 63-72.
Hirst, G., Semantic Interpretation and the Resolution of Ambiguity, Cambridge University Press, England, 1987.
Marti A. Hearst and Hinrich Schutze, Customizing a Lexicon to Better Suit a Computational Task, In Proc. of the SIGLEX Workshop on Acquisition of Lexical Knowledge from Text, 1993.
Hutchins, W.J. and Somers, H.L., An Introduction to Machine Translation, Academic Press, London, 1992.
Ide, N.M. and Veronis, J., Word Sense Disambiguation with Very Large Neural Networks Extracted from Machine Readable Dictionaries, In Proc. of the 13th Intl. Conf. on Computational Linguistics (COLING-90), 1990.
Charlotte Jenkins, Mike Jackson, Peter Burden and Jon Wallis, Automatic RDF Metadata Generation for Resource Discovery, In Proc. of 8th Intl. WWW Conf., May 1999.

K. Sparck Jones, Automatic Keyword Classification for Information Retrieval, Archon Books, 1971.
ACL, Proc. of the 13th Intl. Conf. on Computational Linguistics (COLING-90), Helsinki, 1990.
Jon Kleinberg, Authoritative Sources in a Hyperlinked Environment, In Proc. of the 9th ACM-SIAM Symposium on Discrete Algorithms (SODA), January 1998.
Krovetz, R. and Croft, W.B., Lexical Ambiguity and Information Retrieval, ACM Transactions on Information Systems, 10(2), pp. 115-141, 1992.
Longman Dictionary of Contemporary English, Longman Group Ltd., 1978.
Claudia Leacock and Martin Chodorow, Combining Local Context and WordNet Similarity for Word Sense Identification, In WordNet: An Electronic Lexical Database, pp. 285-303, MIT Press, Cambridge, MA, 1998.
Michael Lesk, Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone, In Proc. of the 1986 SIGDOC Conf., pp. 24-26, 1986.
David D. Lewis and Karen Sparck Jones, Natural Language Processing for Information Retrieval, CACM, Jan. 1996, Vol. 39, No. 1, pp. 92-101.
Li, X., Szpakowicz, S. and Matwin, S., A WordNet-based Algorithm for Word Sense Disambiguation, In Proc. of IJCAI-95, pp. 1368-1374, 1995.
Frank Meng and Wesley W. Chu, Database Query Formation from Natural Language Using Semantic Modeling and Statistical Keyword Meaning Disambiguation, Technical Report 990003, University of California, Los Angeles, April 1999.
Miller, G., WordNet: An On-line Lexical Database, International Journal of Lexicography, 3:4, 1990. http://www.cogsci.princeton.edu/~wn/
Miller, G., WordNet: A Lexical Database for English, CACM, Nov. 1995, Vol. 38, No. 11, pp. 39-41.
Kenrick Jefferson Mock, Intelligent Information Filtering via Hybrid Techniques: Hill Climbing, Case-Based Reasoning, Index Patterns and Genetic Algorithms, PhD Dissertation, University of California, Davis, 1996.

Murthy, K.R.K. and Keerthi, S.S., Concept Network: A New Document Representation Scheme, Technical Report TR-IISc-CSA-ISL-KRK-99-03, Dept. of Computer Science and Automation, Indian Institute of Science, India, 1999.
Murthy, K.R.K. and Keerthi, S.S., Use of Decision Lists in Information Filtering, Technical Report TR-IISc-CSA-ISL-KRK-99-11, Dept. of Computer Science and Automation, Indian Institute of Science, India, 1999.
Murthy, K.R.K. and Keerthi, S.S., Feedback Handling in Instructable Information Filtering Agent, Technical Report TR-IISc-CSA-ISL-KRK-99-10, Dept. of Computer Science and Automation, Indian Institute of Science, India, 1999.
Murthy, K.R.K. and Keerthi, S.S., Context Filters for Document-Based Information Filtering, In Proc. of the International Conference on Document Analysis and Recognition (ICDAR'99), pp. 709-712, 1999.
Murthy, K.R.K. and Keerthi, S.S., Community Representation and Personalized Filtering, Technical Report TR-IISc-CSA-ISL-KRK-99-02, Dept. of Computer Science and Automation, Indian Institute of Science, India, 1999.
Murthy, K.R.K. and Keerthi, S.S., Information Flow Graphs for WWW, Technical Report TR-IISc-CSA-ISL-KRK-99-08, Dept. of Computer Science and Automation, Indian Institute of Science, India, 1999.
Murthy, K.R.K. and Keerthi, S.S., Information Gathering in the Framework of Information Filtering Communities, Technical Report TR-IISc-CSA-ISL-KRK-99-04, Dept. of Computer Science and Automation, Indian Institute of Science, India, 1999.
Murthy, K.R.K. and Keerthi, S.S., Instructable Information Filtering Agent: An Instructable Agents Model for Information Filtering, Technical Report TR-IISc-CSA-ISL-KRK-99-07, Dept. of Computer Science and Automation, Indian Institute of Science, India, 1999.
Murthy, K.R.K. and Keerthi, S.S., Developing Personalized View of WWW Using Concept Networks, Technical Report TR-IISc-CSA-ISL-KRK-99-05, Dept. of Computer Science and Automation, Indian Institute of Science, India, 1999.


Neff, M.S. and McCord, M.C., Acquiring Lexical Data from Machine-Readable Dictionary Resources for Machine Translation, In Proc. of the 3rd Intl. Conf. on Theoretical and Methodological Issues in MT, pp. 85-91, Austin, TX, 1990.
Oxford Advanced Learner's Dictionary of Current English, Oxford University Press, 1942.
Ted Pedersen, An Introduction to Machine Translation, Technical Report 93-CSE-17, Dept. of Computer Science and Engineering, Southern Methodist University, 1993.

R. Swick, E. Miller, B. Schloss and D. Singer, Resource Description Framework (RDF), http://www.w3.org/RD, October 1998.
D. Brickley, R. Guha and A. Layman, Resource Description Framework (RDF) Schema Specification, http://www.w3.org/TR/WD-rdf-schema, October 1998.
R. Richardson and A.F. Smeaton, Automatic Word Sense Disambiguation in a KBIR Application, The New Review of Document and Text Management, Vol. 1, pp. 299-319, 1995; proceedings of the BCS-IRSG colloquium, Manchester, April 1995.
R. Richardson, A.F. Smeaton and J. Murphy, Using WordNet as a Knowledge Base for Measuring Semantic Similarity between Words, Presented at the AICS Conference, Trinity College, Dublin, September 1994.
Roget's Machine Readable Dictionary.
Gerard Salton, Automatic Text Processing, Addison-Wesley, 1989.
E. Schweighofer and W. Winiwarter, Intelligent Information Retrieval: KONTERM - Automatic Representation of Context Related Terms within a Knowledge Base for a Legal Expert System, In Proc. of the 25th Anniversary Conference of the Istituto per la documentazione giuridica of the CNR: Towards a Global Expert System in Law, Cedam, Milano, 1996.
Toshihiro Takada, Multi-lingual Information Exchange through the World-Wide Web, In Proc. of 1st Intl. WWW Conf., 1994. http://www1.cern.ch/WWW94/Welcome.html
Jason M. Whaley, An Application of Word Sense Disambiguation to Information Retrieval, Technical Report PCS-TR99-352, Dartmouth College, June 1999.

Yarowsky, D., Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora, In Proc. of COLING-92, Nantes, pp. 454-460, 1992.
Zernik, U. (ed.), Lexical Acquisition: Exploiting On-line Resources to Build a Lexicon, Erlbaum, Hillsdale, NJ, 1991.
