An Ontology-Driven Approach for Semantic Information Retrieval on the Web

ANTONIO M. RINALDI, University of Napoli Federico II

The concept of relevance is a hot topic in the information retrieval process. In recent years the extreme growth of digital documents brought to light the need for novel approaches and more efficient techniques to improve the accuracy of IR systems to take into account real users' information needs. In this article we propose a novel metric to measure the semantic relatedness between words. Our approach is based on ontologies represented using a general knowledge base for dynamically building a semantic network. This network is based on linguistic properties and it is combined with our metric to create a measure of semantic relatedness. In this way we obtain an efficient strategy to rank digital documents from the Internet according to the user's interest domain. The proposed methods, metrics, and techniques are implemented in a system for information retrieval on the Web. Experiments are performed on a test set built using a directory service having information about analyzed documents. The obtained results compared to other similar systems show an effective improvement.

Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Dictionaries, linguistic processing, thesauruses; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Information filtering, query formulation, retrieval models, search process, selection process; I.7.5 [Document and Text Processing]: Document Capture—Document analysis

General Terms: Algorithms, Design, Experimentation, Performance

Additional Key Words and Phrases: Ontologies, semantic relatedness metrics, WordNet

ACM Reference Format: Rinaldi, A. M. 2009. An ontology-driven approach for semantic information retrieval on the Web. ACM Trans. Internet Technol. 9, 3, Article 10 (July 2009), 24 pages. DOI = 10.1145/1552291.1552293 http://doi.acm.org/10.1145/1552291.1552293

1. INTRODUCTION

The production of digital contents is currently one of the most rapidly growing processes in the information age. This implies the creation of a plethora of information with related problems in organizing, managing, and searching in
digital document catalogs. One of the most representative examples of this scenario is the World Wide Web. The search on the Web for information relevant to a user is extremely complex; even if search engines assist in the information retrieval (IR) process, they are usually far from completely satisfying the request for a desired search. From a general point of view, the IR process can require a considerable amount of time, which the user pays for in terms of accuracy. These inconveniences happen because traditional search engines return results using several search strategies, which often are independent of the context in which the terms are used. Results are acceptable in terms of number of pages returned and retrieval speed, but they are often wrong or inaccurate because they are not related to the context of the user's query. As an example, let us suppose that a user wants to find information about Lincoln, the famous car brand; by inserting into a search engine the keyword "Lincoln", he will obtain pages related to the automobile domain as well as pages about the history and politics domains.

The retrieval accuracy is a problem related to the representation and organization of knowledge in the documents, and from a structural point of view it is strictly related to the actual organization of the Web. A current vision of the future Web is the Semantic Web, defined as: "an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation" [Berners-Lee et al. 2001]. A universal implementation of the Semantic Web as a complete substitution for the actual Web is still far from happening. Therefore, we think that it could be useful to have a system that analyzes documents from a contextual point of view for more accurate information retrieval. We try to give a solution to the problem of information retrieval on the Web using an approach based on a measure of semantic relatedness, applied to evaluate the relevance of a document with respect to a query in a given context and built on the concepts of lexical chains, ontologies, and semantic networks. The proposed methods, metrics, and techniques are implemented in a system called DySE (Dynamic Semantic Engine). DySE implements a context-driven approach in which the keywords are processed in the context of the information in which they are retrieved, in order to solve semantic ambiguity and to give a more accurate retrieval based on the real interests of the user.

This article is organized as follows: Section 2 presents the main works in the different fields related to our context of interest; a top-level view of our system is shown in Section 3; Section 4 introduces the architecture of the system and describes the role of each component; in Section 5 the information extraction strategy is described, while Section 6 discusses the strategy used for ranking the results; experimental results and conclusions are reported in Sections 7 and 8, respectively.

2. BACKGROUND AND RELATED WORKS

In this section we introduce some relevant works in several research fields related to our interests, which represent a large theoretical background. In order to better understand the different dimensions of our work, we will point out
the differences between our approach and those described. In the first subsection we introduce the concept of ontology, the basic conceptual tool of our investigation; in the second we explain the notion of relevance and we discuss some metrics of semantic relatedness and approaches to measuring the relevance of a document with respect to the user's interest; in the last subsection we describe some implemented systems that are based on the same theoretical background. We choose to introduce and discuss several concepts and works, both old and new, in order to have a strong background and a complete context of comparison for our proposed approach.

2.1 Ontologies

In the last two decades, ontological aspects of information have acquired a strategic value. These aspects are intrinsically independent of information codification, so the information itself may be isolated, recovered, organized, and integrated with respect to its content. A formal definition of ontology is proposed in Gruber [1993], according to whom "an ontology is an explicit and formal specification of a shared conceptualization"; conceptualization refers to an abstract model of a specified reality in which the component concepts are identified; explicit means that the type of concepts used and the constraints on them are well defined; formal refers to the ontology property of being machine-readable; shared is about the property of an ontology of capturing consensual knowledge, accepted by a group of persons, not only by a single one. We also consider other definitions of ontology; in Neches et al. [1991] "an ontology defines the basic terms and relations comprising the vocabulary of a topic area, as well as the rules for combining terms and relations to define extensions to the vocabulary." This definition indicates a way to proceed in order to construct an ontology: (1) identification of the basic terms and their mutual relations; (2) agreement on the rules that arrange them; (3) definition of terms and relations among concepts. From this perspective, an ontology does not include just the terms that are explicitly defined in it, but also those that can be derived using defined rules and properties. In our work, the ontology can be seen as a set of terms and the relations between them, denoting the concepts that are used in a domain. In the next sections we describe some systems, methods, and techniques based on this conceptual knowledge representation, and we highlight the differences from our approach using the same representation.

2.2 Relevance, Semantic Relatedness Metrics, and Linguistic Structures

In our context, the concept of relevance of information has great importance. We can divide relevance into two main classes [Harter 1992; Saracevic 1975; Swanson 1986], called objective (system-based) and subjective (human (user)-based) relevance. Objective relevance can be viewed as a topicality measure, that is, a direct match between the topic of the retrieved document and the one defined by the query.
Several studies on human relevance show that many other criteria are involved in the evaluation of the IR process output [Barry 1998; Park 1993; Vakkari and Hakala 2000]. Subjective relevance refers to the intellectual interpretations carried out by users, and it is related to the concepts of aboutness and appropriateness of the retrieved information. According to Saracevic [1996], five types of relevance exist: an algorithmic relevance between the query and the set of retrieved information objects; a topicality-like type, associated with the concept of aboutness; cognitive relevance, related to the user's information need; situational relevance, depending on the task interpretation; and motivational and affective relevance, which is goal-oriented. Furthermore, we can say that relevance has two main features defined at a general level: multidimensional relevance, which refers to how relevance can be perceived and assessed differently by different users, and dynamic relevance, which instead refers to how this perception can change over time for the same user. These features have a great impact on information retrieval systems, which generally do not have a user model and are not adaptive to individual users. Moreover, an interesting point of view on the concept of relevance is in Schutz [1970], where it refers to the connection between information and "theme" (the present object or aspect of concentration), having as a base a "horizon," that is, the whole knowledge of a sentient agent.

On the other hand, we must define some techniques to measure the relevance of an object with respect to a given topic. We consider an approach based on semantic relatedness. To clarify this way of proceeding, in accord with Boyce et al. [1994], we can give the following definition: "measurement, in most general terms, can be regarded as the assignment of numbers to objects (or events or situations) in accord with some rules (measurement function). The property of the objects that determines the assignment according to that rule is called magnitude, the measurable attribute; the number assigned to a particular object is called its measure, the amount or degree of its magnitude. It is to be noted that the rule defines both the magnitude and the measure." The concept of semantic relatedness somehow refers to the perceived relations between words and concepts. Several metrics have been presented in the literature in order to measure the semantic relatedness of two words. These metrics can be collected in the following categories.

— Dictionary-based. Dictionaries are a natural linguistic information source for people's knowledge about the world; they form a knowledge base in which the headwords are defined by means of other headwords and/or their derivatives;
— Thesaurus-based. These metrics use a thesaurus in which words are related to concepts and each word is referred to a category by an index structure;
— Semantic network-based. These metrics use semantic networks, which are graphs in which the nodes are the concepts and the arcs represent relations between concepts;
— Integrated approach. This approach considers additional knowledge sources to enrich the information already present in the network.

An exhaustive overview of the metrics based on these approaches can be found in Budanitsky [1999], and a new approach for measuring semantic similarity is proposed in Li et al. [2003]. Following the cited overview, we focus our attention on three metrics that are extremely significant in our field of interest and in our theoretical background. In fact, they use an approach based on semantic networks, taking into account different features of these informative structures. The described metrics have been implemented for a comparison with our system.

In Rada et al. [1989], the authors propose an approach to improve bibliographic retrieval in the biomedical literature. The authors use MeSH (medical subject headings), a hierarchical semantic network of over 15,000 terms. The main assumption of this approach is that the number of edges between terms in the semantic network is a measure of the conceptual distance between terms. Wu and Palmer [1994] introduce a metric based on the verb category in the context of machine translation. They project verbs and their compounds onto something they call conceptual domains, dividing the different senses of verbs and placing them into different domains. In these domains, the authors define a measure of similarity between concepts, based on path lengths (in number of nodes), common superconcepts, and distance from the hierarchy root. Leacock and Chodorow [1998] proposed a metric based on the count of the number of links between two sets of terms, or synonyms, representing the same concept. The metric is applied to a noun hierarchy connected to a single root, so that a path always exists between nodes.

These metrics obtained very good results, but some authors [Lee et al. 1993] observed that they are insufficient to represent conceptual distance in complex semantic networks that have relations other than hierarchical ones, and that the choice of a specific context simplifies the problem of ensuring the homogeneity of the hierarchy. Moreover, the authors consider only single lexical categories, losing the wholeness of a linguistic representation. We propose a different metric, which tries to solve these problems using a linguistic approach, taking into account, for example, all the lexical categories, specific domain ontologies, and several semantic network features to compute the path between concepts.

Moreover, semantic relatedness measures use linguistic structures to perform their computations: lexical chains. The notion of lexical chain derives from research in the area of textual cohesion in linguistics [Halliday and Hasan 1976]. Cohesion involves relations between words that connect different fragments of the text. Lexical cohesion is the most common type of cohesion. It can be expressed by repetitions, synonyms, and hyponyms, or by other linguistic relations between words, such as whole-part, object-property, and so on. A lexical chain is a sequence of related words in the text, spanning short (adjacent words or sentences) or long distances (the entire text). A chain is independent of the grammatical structure of the text; it is a list of words that captures a
portion of the cohesive structure of the text. Computing the lexical chains allows identification of the main topics of a document. A large number of researchers have used lexical chains for information retrieval and related areas. Morris and Hirst [1991] were the first to suggest the use of lexical chains to explore the structure of texts. Kazman et al. [1996] used lexical chains to index videoconference transcriptions by topic. Stairmand [1996] used lexical chains in the construction of both a typical IR system and a text segmentation system, while Green [1997] developed a technique to automatically generate hypertext links. The strategy used to build lexical chains will be discussed in the following sections.

2.3 Semantic IR Systems

In this section we focus our attention on several projects related to content-based and semantic discovery that have been developed for managing knowledge and retrieving documents from a semantic point of view. These systems represent some examples that inspired us in the definition of our framework and system functionalities.

Baziz et al. [2005] describe the use of ontologies for information retrieval. Their proposed approach consists of identifying important concepts in documents using two criteria, co-occurrence and semantic relatedness, and then disambiguating them via an external general purpose ontology (WordNet). Matching the ontology and a document results in a set of scored concept-senses (nodes) with weighted links; this is the semantic representation of a document. The SAFARI project [Shek et al. 1998] focuses on developing and integrating techniques for content-based mining of image archives and semantic information source matching in an agent-based foundation. Similarity-based search is discussed in various papers related to Web and text mining, as in Xu et al. [2002], Lee and Yang [2001], and Srihari et al. [2000]. The integration of agent technology and ontology is proposed in Weihua [2002] as having an important impact on the effective use of Web services. However, to the best of our knowledge, there has been little significant attempt to apply ontological models to IR. Fabriani et al. [2001] proposed a text processing system for building a domain ontology. The IntelliZap system [Finkelstein et al. 2002] executes context searches related to words marked by the user. It is based on a client-server paradigm, where the application client (launched on the user's computer) captures the context close to the text marked by the user. The algorithm on the server, using semantic networks and semantic relatedness measures, analyzes the words selecting the most important contexts, performs word sense disambiguation, and makes expanded queries to submit to the search engines. A reranking module orders the results retrieved by the search engines according to the semantic similarity between the search engines' results and the original context. Gaizauskas and Humphreys [1997] describe a system called LaSIE (Large Scale Information Extraction system) based on text analysis: it translates single phrases into a logical form in order to have a weak discourse model of the
entire text. To represent reality, the authors use a world model which is built from an ontology. The nodes are object classes or phrases, whereas the phrase nodes are only leaf nodes. The world model is initially empty and is enriched with a semantic representation of the text. Moldovan and Mihalcea [2000] expand the proposed system and improve searches on the Web, attempting to extract from query results only the information that is important for the user. The input query or phrase expressed in natural language is sent to a lexical processing module. The word or phrase boundaries are detected by the tokenization process. The system labels the words using a version of Brill's Tagger. A phrase parser divides every phrase into several members, such as nouns and verbs. After stopword deletion, the system uses some keywords to represent the main concept of the phrase. The other process steps are word sense disambiguation (WSD), query expansion, and postprocessing.

The project proposed in Kerschberg et al. [2003] gives a methodology and an architecture for a system based on an agent, WebSifter II, which captures the semantics of a user search and expands the initial query into a series of queries for traditional search engines, cataloging the results according to the weights specified by the user. The authors propose a model for information retrieval based on an ontological structure called Weighted Semantic Taxonomy Tree (WSTT). In this approach the user's interests are represented by a hierarchical concept tree with weights associated via user interaction. In order to solve the problem deriving from word polysemy (a single term can have several meanings), the authors use the concept of "word sense" in WordNet. Therefore the user, interacting with the system, can associate with every concept a list of terms semantically related to it. The SCORE system (Semantic Content Organization Retrieval Engine) [Sheth et al. 2002] gives support to the definition of ontologies used by system agent software in order to analyze documents. These agents use rules based on expressions combined with several metadata extraction techniques for structured and semistructured documents. SCORE supports four fundamental properties that are the core of its technology. The technique used for automatic classification can help to cluster the documents in one or more categories and extracts the metadata referring to one or more contexts. An interesting work is Varelas et al. [2005], where some approaches to computing semantic similarity between concepts in an ontology using their relationships are investigated. The proposed method is capable of detecting similarities between documents containing semantically similar but not necessarily lexicographically similar terms. This approach has been evaluated in the retrieval of images and documents on the Web. The experimental results are carried out with several semantic similarity methods for computing the conceptual similarity between natural language terms using WordNet.

From a general point of view, intelligent information retrieval systems allow a more accurate information search in a complex scenario such as the Web. In the last few years Web agents, cataloged, defined, and evaluated following several criteria [Jansen et al. 2006], have been widely used to help users during their searches, showing information relevant to the user's context of interest. Approaches taken by the various information agents include
algorithmics, information filtering, collaborative filtering, and information integration. Algorithmics approaches generally consist of optimizing algorithms and data structures to achieve faster and more accurate results. Information filtering approaches help select relevant information for a given user at a given time and in a given context. This can be done either by filtering out irrelevant information or by actively recommending useful information. Collaborative filtering focuses on identifying users with similar preferences and using their opinions to provide recommendations for information and information sources. Finally, information integration is a process whereby an agent tries to achieve its goals by combining information from multiple heterogeneous information sources. Often these approaches take into account detailed knowledge about the user context and the navigational environment, defining a model to describe the user's preferences [Anand et al. 2007]. Our method follows an information filtering approach where the interaction with the user is limited to the definition of a query composed of two components: a subject keyword, used to specify a specific interest, and a domain keyword, which provides the domain of interest for the user. On the other hand, in our methodology, the interaction with other search engines is strictly limited to a fetching step to collect documents from the Web. This procedure allows us to avoid defining user profiles and tracking actions during Web navigation (clickstream), and it simplifies the system processes as a whole. The details of the query formulation and information analysis strategies used in our system are explained in the following sections.

3. DYSE OVERVIEW

In our vision, Web search can be enhanced using a hybrid approach that takes into account both syntactic and semantic information in a system that has an ontology as its horizon of knowledge. We suggest using a query structure formed by a list of terms to retrieve (subject keywords) and a domain of interest (domain keyword) to better represent the different components of the IR process (user interests, objects to retrieve). For example, if a user wants to get information about the famous jazzman Miles Davis, we have as subject keywords:=Davis and domain keyword:=music. The system can then retrieve pages that are interesting from the user's perspective, without considering ones related to the Davis Cup, which pertains to the sports domain. In our system the horizon of knowledge is WordNet [Miller 1995], a general knowledge base organized from a linguistic point of view. A brief description of this knowledge source is given in the following sections. Even if WordNet has several shortcomings in some conceptual domains, it is one of the most used linguistic resources in the research community.

The primary goal of our work is to design a system capable of retrieving and ranking results taking into account the semantics of the pages. This system should be able to perform the following tasks.

— Fetching. Searching Web documents containing the keywords specified in the query. This task can be accomplished using traditional search engines.
— Preprocessing. Removing from Web documents all those elements that do not represent useful information (HTML tags, scripts, applets, etc.);
— Mining. Analyzing the documents' content from a semantic point of view, assigning a score with respect to the query;
— Reporting. Ranking and returning the documents relevant to the query.

We now describe an example to introduce our framework and its relations with the proposed architecture (shown in Figure 1). By means of the system interface, the user submits a query following the structure previously described. The subject keywords are used in the fetching step, where a number of pages are fetched from traditional search engines (Altavista, Yahoo, Google) and then preprocessed by the modules described in Section 4. On the other hand, the domain keyword is passed to the miner, and an ad hoc module builds a semantic network dynamically extracted from WordNet following the algorithm presented in Section 5. In the document analysis step, lexical chains are obtained by intersecting the extracted semantic network with each preprocessed page. A global rank is assigned to each page using the metric described in Section 6. From a high level point of view, the proposed procedure is simple and follows a modular approach; moreover, it is completely automatic and the interaction with the system only occurs during the query formulation step. In the following sections we describe the proposed system in more detail, explaining our methods and techniques.

4. THE SYSTEM ARCHITECTURE

The proposed system is based on several services. In this context each software module performs the actions described in the previous section, considering the semantic meaning of the Web documents. Figure 1 presents a complete architectural view of the proposed system.

4.1 Search Engine Wrapper

The Search Engine Wrapper gets the query and adapts it to the specific syntax of the search engines using the Query Adapter Module, thus creating the query string for each of the chosen search engines. In order to achieve a high level of transparency, the Search Engine Wrapper submits the adapted query to the search engines, by means of the Search Engine Submitter, in order to obtain the page of Web links. After this phase, the Parser analyzes this page in order to retrieve the links that are contained in it (a minimal sketch of the query adaptation is given below).

4.2 Web Fetcher

The Web Fetcher retrieves the pages related to the links and stores them in the Web Repository. The pages are retrieved by the Web Catcher, while the Repository Builder inserts them in the Web Repository. A Web site often has a presentation page that is composed of animations, images, and so on. Currently we suppose that these objects do not give useful information to our system. The Web Fetcher retrieves, by default, the first two levels in the site structure and stores them using the same hierarchy, starting from the main link. In an analogous way, it stores the pages with frames.
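As an illustration of the query adaptation performed by the Search Engine Wrapper, the following is a minimal sketch; the URL templates and parameter names are placeholders, not the engines' actual interfaces.

from urllib.parse import urlencode

# Illustrative sketch of the Query Adapter: the subject keywords of the user
# query are adapted to the syntax of each chosen search engine. Templates
# and parameter names are placeholders only.
ENGINE_TEMPLATES = {
    "google":    "http://www.google.com/search?{params}",
    "yahoo":     "http://search.yahoo.com/search?{params}",
    "altavista": "http://www.altavista.com/web/results?{params}",
}

def adapt_query(subject_keywords, engine):
    """Build the query string submitted to a single search engine."""
    params = urlencode({"q": " ".join(subject_keywords)})
    return ENGINE_TEMPLATES[engine].format(params=params)

if __name__ == "__main__":
    for engine in ENGINE_TEMPLATES:
        print(adapt_query(["Davis"], engine))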
Fig. 1. System architecture.

4.3 Document Preprocessor

After the Search Engine Wrapper and the Web Fetcher perform their actions, we have in the Web Repository a set of Web pages related to the user query. From a general point of view, a Web page is composed of several parts. It is clear that the semantic content of a Web page relies on the body tag; metatags have a particular importance because they give a synthetic description of the page. The HTML language defines some tags to organize a Web page. A user inserts information in these tags and organizes contents in a structured way. In our
system we try to catch the different levels of information considering
— title;
— meta tag description;
— meta tag keywords;
— body.

The Document Preprocessor analyzes the page and divides it into those components, storing them in the Preprocessed Web Pages repository. In this step stop words are deleted and the remaining words are tagged and stemmed. In the tagging phase we use the Monty Tagger, and the stemming is obtained by means of the WordNet morphological processor.

4.4 Miner

The Miner analyzes, from a semantic point of view, the pages cleaned and stored in the Preprocessed Web Pages repository; its core is the Dynamic Semantic Network (DSN). The DSN is created by the DSN Builder, which generates it from WordNet by means of the domain keyword submitted by the user in the query submission step, following an ad hoc algorithm described in the next section. This network represents the domain of interest of the user, and using it the Miner processes the information necessary to analyze the semantic content of a page and measures the relations between documents and the user's information needs represented by the DSN. In order to compute this similarity we implement a metric that takes into account both syntactic and semantic components in the document analysis step. The proposed metric is used by the Global Grader module, and its output is a list of reranked pages shown to the user. The details of the mining procedure are explained in the next sections.

5. THE PROPOSED INFORMATION EXTRACTION ALGORITHM

In this section we describe our proposed algorithm to extract information from Web documents, and we analyze in detail all the components of our implemented modules.

5.1 The Dynamic Semantic Network (DSN)

In the proposed system, the implementation of the ontology is obtained by means of a semantic network (DSN), dynamically built using a dictionary based on WordNet [Miller 1995]. WordNet organizes its terms using linguistic properties. Moreover, every domain keyword may have several meanings (senses) due to the property of polysemy, so a user can choose the proper sense of interest. In WordNet these senses are organized in synsets composed of synonyms; therefore, once the sense is chosen (the appropriate synset), it is possible to take into account all the possible terms (synonyms) that are present in the synset. Beyond synonymy, we consider other linguistic properties, applied to the typology of the considered terms, in order to have a strongly connected network. A semantic network is often used as a form of knowledge representation: in accordance with the definition in Lee et al. [1993], it is a graph consisting of nodes, which represent concepts, and edges, which represent semantic relations between concepts.
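As a minimal illustration of this representation (node names, relations, and weights below are only examples), such a network can be encoded as an adjacency map whose edges carry the linguistic relation and its weight:

from collections import defaultdict

# Illustrative in-memory representation of a semantic network: each node
# (e.g., a WordNet synset name) maps to its outgoing edges, and each edge
# records the target node, the linguistic relation, and the relation weight.
class SemanticNetwork:
    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(target, relation, weight)]

    def add_edge(self, source, target, relation, weight):
        # Store the link in both directions so that paths can be traversed
        # regardless of the original orientation of the relation.
        self.edges[source].append((target, relation, weight))
        self.edges[target].append((source, relation, weight))

    def neighbors(self, node):
        return self.edges[node]

if __name__ == "__main__":
    dsn = SemanticNetwork()
    dsn.add_edge("religion", "faith", "synonymy", 1.0)
    dsn.add_edge("religion", "christianity", "hyponym", 0.9)
    print(dsn.neighbors("religion"))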
Table I. DSN Generation Algorithm

//------------------------------------------------------------------
// DSN creation algorithm
//
// INPUT:  Main Synset: represents the synset chosen by the user
//
// OUTPUT: Synset List: the list returned by the function.
//         It contains all the DSN synsets
//------------------------------------------------------------------
Synset List CreateDSN(Main Synset) {
    Add Main Synset to Synset List
    Load from WordNet the Category terms of Main Synset
    Add found synsets to Synset List
    While (Synset List not at EOF) Do {
        Load from WordNet all hyponyms of all synsets in Synset List
        Add found synsets to Synset List
    }
    While (Synset List not at EOF) Do {
        Load from WordNet all synsets linked to all synsets in Synset List
            using all linguistic properties (counting hyponymy and hypernymy out)
        Add found synsets to Synset List
    }
    return Synset List
}

We propose a dynamic construction of the semantic network via interaction with WordNet. As previously specified, a user interacts with the system by means of a semantic query, specifying the subject keywords and the domain keyword. The DSN is built starting from the domain keyword, which represents the context of interest for the user. We then consider all the component synsets and construct a hierarchy based only on the hyponymy property; the last level of our hierarchy corresponds to the last level of WordNet. After this first step we enrich our hierarchy by considering all the other kinds of relationships in WordNet. Based on these relations we can add other terms to the hierarchy, obtaining a highly connected semantic network. The algorithm to extract the DSN is described in pseudo-code in Table I.

We now introduce an example to better explain the proposed algorithm. We suppose that a user is interested in retrieving documents about the religion domain. She submits the word religion as the domain keyword. The system passes the domain keyword to the DSN Builder and fetches from WordNet the synset Religion. Following the algorithm, the DSN Builder links to the synset Religion all the other synsets linked by the category terms property, which belong to related topical classes (knowledge domains). Starting from these synsets we add only hyponyms to the initial semantic network. The process of adding hyponyms stops at the last level of the hyponymy hierarchy in WordNet. After this step we add all the other synsets directly related to the synsets already extracted, considering all the linguistic properties (see Table II), counting hyponymy and hypernymy out.
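The same procedure can be prototyped on top of NLTK's WordNet interface. The sketch below is illustrative and assumes a recent NLTK with the WordNet corpus installed; in particular, the relation methods used for the "category terms" and enrichment steps are our assumptions and may need adjusting to the WordNet/NLTK version in use.

from nltk.corpus import wordnet as wn

# Illustrative transcription of the DSN generation algorithm of Table I.
def create_dsn(main_synset):
    dsn = {main_synset}

    # Category terms: synsets belonging to the topical class (knowledge
    # domain) identified by the main synset.
    dsn.update(main_synset.in_topic_domains())

    # Hyponymy: descend the hyponym hierarchy down to its last level.
    for s in list(dsn):
        dsn.update(s.closure(lambda x: x.hyponyms()))

    # Enrichment: synsets directly linked to those already extracted through
    # the other linguistic properties (hyponymy and hypernymy excluded).
    relations = (
        "member_holonyms", "part_holonyms", "substance_holonyms",
        "member_meronyms", "part_meronyms", "substance_meronyms",
        "attributes", "also_sees", "similar_tos", "causes", "entailments",
    )
    for s in list(dsn):
        for rel in relations:
            dsn.update(getattr(s, rel)())
    return dsn

if __name__ == "__main__":
    religion = wn.synsets("religion")[0]  # the sense chosen by the user
    print(len(create_dsn(religion)))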
Table II. Property Weights

Property                      Weight
Antonymy                      0.8
Attribute                     0.7
Category Domain               1
Cause                         0.6
Derived                       0.8
Entailed by                   0.7
Entailment                    0.7
Hypernym                      0.9
Hyponym                       0.9
Member Holonym                0.5
Member Meronym                0.5
Member of Category Domain     1
Nominalization                0.7
Part Holonym                  0.7
Part Meronym                  0.7
Principle of                  0.7
See Also                      0.6
Similar To                    0.5
Substance Holonym             0.5
Substance Meronym             0.5
Synonymy                      1

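For use in the later sketches, the weights of Table II can be transcribed into a simple lookup table (values copied from the table as reconstructed above):

# Weights of the linguistic properties (Table II), as a plain lookup table.
PROPERTY_WEIGHTS = {
    "antonymy": 0.8, "attribute": 0.7, "category_domain": 1.0, "cause": 0.6,
    "derived": 0.8, "entailed_by": 0.7, "entailment": 0.7, "hypernym": 0.9,
    "hyponym": 0.9, "member_holonym": 0.5, "member_meronym": 0.5,
    "member_of_category_domain": 1.0, "nominalization": 0.7,
    "part_holonym": 0.7, "part_meronym": 0.7, "principle_of": 0.7,
    "see_also": 0.6, "similar_to": 0.5, "substance_holonym": 0.5,
    "substance_meronym": 0.5, "synonymy": 1.0,
}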
Figure 2 shows an example of DSN: in particular, Figure 2(a) reports the religion domain DSN; the concepts are the nodes and the properties are the arcs, labeled using different colors; Figure 2(b) shows an OWL representation of this DSN, putting in evidence several synsets and linguistic properties.

We assign to the linguistic properties, represented by arcs between the nodes of the DSN, a weight \sigma_i, in order to express the strength of the relation. The weights are real numbers in the [0,1] interval and their values have been set as described later. To calculate the relevance of a term, we assign a weight to each term in the DSN considering the polysemy property, which can be considered as a measure of the ambiguity in the use of a word, since it can assume several senses. Thus we define the centrality of the term i as:

\Omega(i) = \frac{1}{poly(i)},    (1)

poly(i) being the polysemy (number of senses) of i. As an example, the word music has five senses in WordNet, so the probability that it is used to express a specific meaning is equal to 1/5. Therefore we build a lexical chain on the retrieved Web pages using the DSN. Each word in the page that matches any of the terms in the DSN is a lexical chain component, and the links between the components are the relations in the DSN.
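A minimal sketch of the centrality of Equation (1) and of the chain construction just described, assuming NLTK's WordNet for the polysemy counts and a set dsn_terms of words covered by the DSN:

from nltk.corpus import wordnet as wn

# Illustrative computation of term centrality (Equation 1) and of a lexical
# chain: the words of a preprocessed page that also belong to the DSN.
def centrality(word):
    """1 / polysemy(word); 0 if the word is not in WordNet."""
    senses = len(wn.synsets(word))
    return 1.0 / senses if senses else 0.0

def lexical_chain(page_tokens, dsn_terms):
    return [w for w in page_tokens if w in dsn_terms]

if __name__ == "__main__":
    print(centrality("music"))  # the text above reports five senses, i.e., 0.2
    tokens = ["music", "trumpet", "cup", "jazz"]
    print(lexical_chain(tokens, {"music", "trumpet", "jazz"}))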
Fig. 2. An example of DSN.

6. THE PROPOSED SYSTEM SCORING

Given a conceptual domain, in order to discriminate the interesting pages from the others by using a DSN, it is necessary to define a scoring scheme able to assign a score to each page on the basis of its syntactic and semantic content. First we define the proposed grade for a generic document, and then we adapt it to a Web page. The measure considers two types of information: one in order to take into account syntactic information, based on the concepts of document word frequency and centrality in a domain of interest,
and another one in order to consider the semantic component, calculated on each pair of words in the document. Using the Syntactic-Semantic Grade (SSG), we can define the relevance of a word in a considered conceptual domain and in a single document as the word weight. We use a hybrid approach with statistical and semantic information. Using a statistical approach as in Weiss et al. [1996], we extend it with semantic information considering Equation 1. In this way we divide the terms into classes on the basis of their centrality:

SSG_{i,k} = \frac{(0.5 + 0.5\,(TF_{i,k}/TF_{max,k}))\,\Omega_i}{\sqrt{\sum_{i \in k} (0.5 + 0.5\,(TF_{i,k}/TF_{max,k}))^2\,(\Omega_i)^2}},    (2)

k being a lexical chain related to the k-th document, i being the i-th term, TF_{i,k} being the term frequency of i in k, TF_{max,k} being the maximum term frequency in k, and \Omega_i being the centrality of i defined in Equation 1. This formula gives us statistical information about the analyzed document but, using the term centrality, we obtain a more accurate definition of the role of the term in the document.

The other measure is based on a combination of the path length (l) between pairs of terms and the depth (d) of their subsumer (the first common ancestor), expressed as number of hops. The correlation between the terms is their semantic relatedness, and it is computed through a nonlinear function. The choice of a nonlinear function to express the semantic relatedness between terms derives from several considerations. The values of path length and depth, based on their definition, may range from 0 to infinity, while the relatedness between two terms should be expressed as a number in the [0, 1] interval. In particular, when the path length decreases towards 0, the relatedness should monotonically increase towards 1, while it should monotonically decrease towards 0 when the path length goes to infinity. We need a scaling effect with respect to the depth, because words in the upper levels of a semantic hierarchy express more general concepts than the words in a lower level. We use a nonlinear function for scaling down the contribution of subsumers in an upper level and scaling up those in a lower one. Given two words, w_1 and w_2, the length l of the path between w_1 and w_2 is computed using the DSN and it is defined as:

l(w_1, w_2) = \min_j \sum_{i=1}^{h_j(w_1, w_2)} \frac{1}{\sigma_i},    (3)

where j spans all the paths between w_1 and w_2, h_j(w_1, w_2) being the number of hops in the j-th path and \sigma_i being the weight assigned to the i-th hop in the j-th path with respect to the hop's linguistic property. We assign to each linguistic property a weight that represents the expressive power of the considered relation. We argue that not all the properties have the same strength when they link concepts or words (this difference is related to the nature of the considered linguistic property).
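The best path of Equation (3) can be computed with a standard shortest-path search in which traversing an arc labeled with a property of weight sigma costs 1/sigma. A minimal sketch over a toy graph (the sigma values correspond to properties of Table II, but the graph itself is invented):

import heapq

# Illustrative computation of Equation (3): the minimum-cost path between
# two words in the DSN, where an edge with property weight sigma costs
# 1/sigma. `graph` maps a node to a list of (neighbor, sigma) pairs.
def best_path_length(graph, source, target):
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue
        for neighbor, sigma in graph.get(node, ()):
            nd = d + 1.0 / sigma
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(queue, (nd, neighbor))
    return float("inf")

if __name__ == "__main__":
    # Toy DSN fragment: a direct "similar to" arc (sigma = 0.5) competes with
    # a two-hop synonymy/hyponym path (sigma = 1.0 and 0.9).
    graph = {
        "X": [("Y", 0.5), ("Z", 1.0)],
        "Z": [("X", 1.0), ("Y", 0.9)],
        "Y": [("X", 0.5), ("Z", 0.9)],
    }
    print(best_path_length(graph, "X", "Y"))  # the direct arc wins: 1/0.5 = 2.0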
Fig. 3. Best path evaluation.

We found this intuition also in other works, such as Sussna [1993] and Castano et al. [2003]; we set our weights following the values in the cited articles and spreading them to similar linguistic properties. These weights are real numbers in the [0,1] interval and their values are set by experiments and validated, from a strength comparison point of view, by experts. Table II shows the considered relations and the corresponding weights. Using this formula we find the best path, which represents the conceptual distance between two words; we consider not only a geometric distance (number of hops) but also a logical proximity, that is, the kind of properties linking words/concepts. The aim of this comparison is to put in evidence the different relational power of the considered linguistic properties: two words linked by the synonymy property are more semantically related than two words linked by the similar to property. In Figure 3 an example of paths between concepts X and Y is proposed. As we can see, the arcs are labeled with their linguistic properties, \sigma, and the concepts have a common subsumer, S, at a distance of 8 levels from the WordNet root. Supposing that \sigma_i = \sigma_j = 0.8 and \sigma_t = 0.3, the best path is the one traversing Z, with a value of l = 1.58.

The depth, d, of the subsumer of w_1 and w_2 is also computed using WordNet. For this, only the hyponymy and hypernymy relations (the IS-A hierarchy) are considered; d(w_1, w_2) is computed as the number of hops from the subsumer of w_1 and w_2 to the root of the hierarchy. Given these considerations, we selected an exponential function that satisfies the previously discussed constraints; our choice is also supported by the studies of Shepard [1987], who demonstrated that exponential decay functions are a universal law in psychological science.
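For the WordNet side of this computation, the depth of the subsumer can be obtained directly from the IS-A hierarchy; a minimal sketch with NLTK (the synsets used in the example are only illustrative):

from nltk.corpus import wordnet as wn

# Illustrative computation of d(w1, w2): the number of hops from the lowest
# common subsumer of the two concepts to the root of the WordNet IS-A
# hierarchy. Assumes NLTK with the WordNet corpus installed.
def subsumer_depth(synset1, synset2):
    subsumers = synset1.lowest_common_hypernyms(synset2)
    if not subsumers:
        return 0
    # min_depth() is the length of the shortest hypernym path to the root.
    return min(s.min_depth() for s in subsumers)

if __name__ == "__main__":
    print(subsumer_depth(wn.synset("dog.n.01"), wn.synset("cat.n.01")))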
We are now in the position of introducing the definition of Semantic Grade (SeG), which extends the metric proposed in Li et al. [2003]:

SeG(\nu) = \sum_{(w_i, w_j)} e^{-\alpha\, l(w_i, w_j)} \cdot \frac{e^{\beta\, d(w_i, w_j)} - e^{-\beta\, d(w_i, w_j)}}{e^{\beta\, d(w_i, w_j)} + e^{-\beta\, d(w_i, w_j)}},    (4)

\nu being the considered document, (w_i, w_j) being the pairs of words in the lexical chain, and \alpha \geq 0 and \beta > 0 being two scaling parameters whose values have been defined by experiment. We explicitly point out that the choice of using the formula proposed in Li et al. [2003] is based on the consideration that it takes into account all the principles of our methodology, and in particular the questions related to a linguistic approach. The main difference from the original implementation is in the estimation of the path between words; in fact we use an approach based on the best path, which is evaluated by exploiting the different linguistic relations (and their strength). According to the system architecture, these formulas are computed for each Web page element as described in Section 4, considering the page composed of four elementary documents, namely title, description, keywords, and body (t, d, k, b). Both metric components are computed for each of these elements. We can finally define the Semantic and the Syntactic-Semantic Grade of a Web page \pi as follows:

SeG(\pi) = \sum_{i \in \{t,d,k,b\}} SeG_i(\pi.i)    (5)

SSG(\pi) = \sum_{i \in \{t,d,k,b\}} SSG_i(\pi.i).    (6)

Now we define the Global Grade of a Web page. Given a Web page \pi, the Global Grade of \pi is defined as

GG(\pi) = SSG(\pi) + SeG(\pi).    (7)
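To show how the grades of Equations (2) and (4)-(7) fit together, the following is an illustrative sketch. The per-element aggregation of the term grades, the alpha and beta defaults, and the helper callables (centrality, path_len, depth, sketched in the previous sections) are our assumptions; the article determines the scaling parameters experimentally.

import math

# Illustrative combination of the grades of Equations (2) and (4)-(7).
# `centrality`, `path_len`, and `depth` are callables implementing Omega(w),
# l(w1, w2), and d(w1, w2); alpha and beta are placeholder scaling values.
def ssg(chain, term_freq, centrality):
    """Syntactic-Semantic Grade of one page element, summed over its terms."""
    if not chain:
        return 0.0
    tf_max = max(term_freq[w] for w in chain)
    weights = [(0.5 + 0.5 * term_freq[w] / tf_max) * centrality(w) for w in chain]
    norm = math.sqrt(sum(w * w for w in weights))
    return sum(weights) / norm if norm else 0.0

def seg(chain, path_len, depth, alpha=0.2, beta=0.6):
    """Semantic Grade of one page element (Equation 4), summed over word pairs."""
    total = 0.0
    for i, w1 in enumerate(chain):
        for w2 in chain[i + 1:]:
            # (e^{bd} - e^{-bd}) / (e^{bd} + e^{-bd}) equals tanh(b * d).
            total += math.exp(-alpha * path_len(w1, w2)) * math.tanh(beta * depth(w1, w2))
    return total

def global_grade(elements, term_freq, centrality, path_len, depth):
    """Global Grade (Equation 7) over the four elements t, d, k, b."""
    return (sum(ssg(c, term_freq, centrality) for c in elements.values()) +
            sum(seg(c, path_len, depth) for c in elements.values()))

if __name__ == "__main__":
    freqs = {"music": 3, "jazz": 2, "trumpet": 1}
    elements = {"title": ["music"], "description": [], "keywords": [],
                "body": ["music", "jazz", "trumpet"]}
    # Toy stand-ins for the helpers sketched in the previous sections.
    print(global_grade(elements, freqs,
                       centrality=lambda w: 0.2,
                       path_len=lambda a, b: 2.0,
                       depth=lambda a, b: 6))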

In the next section we show our experiments in detail, explaining the strategy defined to build a general test set and the standard parameters used to measure the performance of our system.

7. EXPERIMENTAL RESULTS

In this section we discuss the results of the experiments carried out to evaluate the performance of our approach from a quantitative point of view, in particular the precision of our results. We use a test set collection to evaluate our system. The test collection is a set of documents, queries, and a list of relevant documents. We use it to compare the results of our system using the ranking strategies described. To evaluate the correctness of the retrieval system outputs, the outputs must be compared with a ground truth. Ground truth is determined by humans. In text retrieval, where the system may be evaluated using corpora consisting of tens of thousands of documents, it would be almost literally impossible to judge the relevance of all documents with respect to all queries used in the evaluation.
For this reason we chose to perform the evaluation task using a test set of previously cataloged pages, to be compared with the classification given by our system using the proposed metric. We assume that a Web page is relevant for a query if it is in a category related to the user's domain keyword (e.g., domain keyword:=astronomy → Yahoo Category:=Directory>Science>Astronomy>Solar System>Planets>Mars). We built the test set by means of interaction with the directory service of the search engine Yahoo (www.yahoo.com). The directory service provides the category assigned to each Web page. In this way we have a relevance assessment against which to compare the results. The test collection has more than 10,000 pages, in which about 100 queries use words with a high polysemic value, so that the documents belong to different categories. We chose keywords about general and specific subjects in order to have a more general test set, taking into account, on the one hand, questions related to having few terms in a document and, on the other hand, small or large (in terms of number of concepts) semantic networks built from the user domain of interest. The whole test set has been used during the experiments, and in Table III a portion of our test set collection is shown together with some examples of its organization. We remark that the system query has two components (subject keyword and domain keyword); therefore there are 13 queries in the test set example shown.

It is important to have standard parameters for IR system evaluation. For this reason we use precision and recall curves. Recall is the fraction of (all) relevant material that is returned by the search; precision is a measure of the number of relevant documents in the set of all documents returned by a search. We compare our system results with other metrics used to measure the semantic relatedness between words. These metrics have been generally discussed in Section 2 and we now describe their implementation.

Leacock and Chodorow [1998]. This approach is based on the length of the shortest paths between noun concepts in an is-a hierarchy. The shortest path is the one that includes the fewest intermediate concepts. This value is scaled by the depth D of the hierarchy, where depth is defined as the length of the longest path from a leaf node to the root node of the hierarchy. Thus, their measure of similarity is defined as follows:

sim_LCh(c_1, c_2) = -\log\,[\min(length(c_1, c_2)) / (2D)],

where min(length(c_1, c_2)) is the shortest path length (having the minimum number of nodes) between the two concepts and D is the maximum depth of the taxonomy. To avoid singularities, the authors measure path lengths in nodes rather than edges, so synonyms (members of the same synset) are one unit of distance apart from each other, and they introduce a hypothetical global root node above the unique beginners in WordNet to ensure the existence of a path between any two nodes.
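For reproducing this kind of comparison, implementations of this measure, and of the two path-based measures discussed after Table III, are available, for instance, in NLTK's WordNet interface; a minimal sketch, assuming the WordNet corpus is installed:

from nltk.corpus import wordnet as wn

# Baseline WordNet similarities as exposed by NLTK: Leacock-Chodorow (above),
# plus a shortest-path-based measure and Wu-Palmer (both discussed below).
car = wn.synset("car.n.01")
jaguar = wn.synset("jaguar.n.01")      # the wild-cat sense

print(car.lch_similarity(jaguar))      # Leacock and Chodorow
print(car.path_similarity(jaguar))     # shortest-path based (Rada-like)
print(car.wup_similarity(jaguar))      # Wu and Palmer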

Table III. Test Set Example

Subject   Domain       Yahoo Category                                                                                                   Doc
Mars      Astronomy    Directory>Science>Astronomy>Solar System>Planets>Mars                                                           212
Mars      Mythology    Directory>Society and Culture>Mythology and Folklore>Mythology>Greek>Gods and Goddesses>Ares(Mars)              5
Davis     Music        Directory>Entertainment>Music>Artists>By Genre>Jazz>By Instrument>Trumpet>Davis, Miles (1926-1991)              12
Davis     Sport        Directory>Recreation>Sports>Tennis>Tournaments>Davis Cup                                                        11
Jaguar    Animal       Directory>Science>Biology>Zoology>Animals, Insects, and Pets>Mammals>Cats>Wild Cats>Jaguars                     7
Jaguar    Car          Directory>Recreation>Automotive>Makes and Models>Jaguar                                                         36
Jaguar    Sport        Directory>Recreation>Sports>Football (American)>Leagues>National Football League (NFL)>Teams>Jacksonville Jaguars   29
Lincoln   History      Directory>Arts>Humanities>History>U.S. History>By Subject>Presidency>Presidents>Lincoln, Abraham (1809-1865)    57
Lincoln   Car          Directory>Recreation>Automotive>Makes and Models>Ford                                                           328
Madonna   Music        Directory>Entertainment>Music>Artists>By Genre>Rock and Pop>Madonna                                             34
Madonna   Religion     Directory>Society and Culture>Religion and Spirituality>Faiths and Practices>Christianity>People>Saints>Virgin Mary   11
Apache    Computer     Directory>Computers and Internet>Software>Internet>World Wide Web>Servers>Unix>Apache                           18
Apache    Helicopter   Directory>Government>Military>Aviation>Helicopters>AH-64 Apache                                                 9

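The ground-truth criterion described above (a page is relevant when its directory category matches the query's domain keyword) can be approximated, in its simplest exact-match form, as follows; the category strings follow the format of Table III:

# Illustrative (simplified) relevance check used to build the ground truth:
# a page counts as relevant for a query when its Yahoo directory category
# path contains the domain keyword as one of its levels.
def is_relevant(category_path, domain_keyword):
    levels = [level.strip().lower() for level in category_path.split(">")]
    return domain_keyword.lower() in levels

if __name__ == "__main__":
    path = "Directory>Science>Astronomy>Solar System>Planets>Mars"
    print(is_relevant(path, "Astronomy"))  # True
    print(is_relevant(path, "Music"))      # False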
Rada et al. [1989]. The authors define the conceptual distance between any two concepts as the shortest path through a semantic network. They evaluate this technique using MeSH, a hierarchical semantic network of biomedical concepts that at that time consisted of about 15,000 terms organized into a nine-level hierarchy. This measure is similar in spirit to approaches that rely on spreading activation, and it works relatively well due to the fact that the network consists of concepts with broader-than relationships, which include both is-a and part-of relationships. In this technique, the number of edges in the shortest path between the two concepts under consideration gives the measure of similarity. Their distance between two terms is thus defined as:

dist_Rada(t_i, t_j) = \min(length(t_i, t_j)).

Wu and Palmer [1994]. The measure of similarity defined by these authors is based on path lengths; however, they focus on the distance from a concept to the root node. This measure finds the distance to the root of the most specific node that intersects the paths of the two concepts in the is-a hierarchy. This intersecting concept is the most specific concept that the two have in common (lcs, the lowest common subsumer). The distance of the lcs is then scaled by the sum of the distances of the individual concepts to that node.
Fig. 4. Recall-Precision curve.

The measure of similarity between two concepts is formulated as follows:

sim_WP(c_1, c_2) = \frac{2 \cdot depth(lcs(c_1, c_2))}{length(c_1, lcs(c_1, c_2)) + length(c_2, lcs(c_1, c_2)) + 2 \cdot depth(lcs(c_1, c_2))},

where depth is the distance from the concept node to the root of the hierarchy and length is the distance (in number of nodes) of the path between concepts. The authors describe this measure relative to a verb taxonomy, but in fact it applies equally well to any part of speech, as long as the concepts are arranged in a hierarchy.

We observe a good improvement with respect to the other metrics, as shown in Figure 4, where we can see high precision values at low recall, which is a desirable condition in information retrieval systems. Typically, text retrieval systems are capable of producing ranked results, with the documents that the system judges more likely relevant ranked at the top of the list. The ranked output results are shown in a recall-precision curve, with plotted points that represent precision at various recall percentages. Such a curve is likely to show very high precision at 10% recall, perhaps 50% precision at 50% recall (for a challenging retrieval task), and a long tail-out toward 100% recall. In our case the precision continued to be high at 50% recall and showed the same trend as the other metrics starting from 80% recall. In the analysis of the test results we notice that we have a good score for pages about some topics, while for some others we have low accuracy. We argue that this is due to the general dictionary considered (WordNet) and to the fact that only small-sized ontologies can be extracted from such a dictionary for specific conceptual domains. This idea is also supported by the analysis of the system log, where we noticed a detailed DSN and several matching terms in the documents with a good score. Even if the recall-precision curve shows good performance for our approach, these kinds of parameters could have some problems in terms of misunderstandings in the analysis step [Baeza-Yates and Ribeiro-Neto 1999] due to trade-offs in some criteria.
Fig. 5. F1-measure curve.

Other methods have been proposed to take into account the trade-off between precision and recall using a single value. A popular measure that combines precision and recall is their weighted harmonic mean, the traditional F-measure or balanced F-score [van Rijsbergen 1980]:

F = \frac{2PR}{P + R}.

This is also known as the F_1-measure, because recall and precision are evenly weighted. It is a special case of the general F_\beta-measure (for non-negative real values of \beta):

F_\beta = \frac{(\beta^2 + 1)PR}{\beta^2 P + R},

where the setting of \beta = 1 entails the same relevance of precision and recall during the experiment evaluation process. The F_1-measure assumes a high value only when both precision and recall are high. Therefore, determination of the maximum value for F_1 can be interpreted as an attempt to find the best possible compromise between precision and recall. The performance of an IR system can be evaluated in terms of the trend of the F_1-measure curve. In Figure 5, the curve of our proposed metric is clearly better than those of all the other considered metrics. In this figure the F_1 values are on the y-axis and the averages of all precision and recall values for each query at 20 positions are on the x-axis.

From a general point of view the experiments show promising results. We argue that this result depends on some novel intuitions in our approach. In fact, the proposed framework gives us a more accurate measure of semantic relatedness between concepts using a combination of different grades of information about document content, related to the syntactic and semantic roles of all parts of speech. This is due to a more effective formula, which takes into account the centrality of terms in a given document and statistical information about them, a nonlinear combination of the best path (considering the strength of linguistic properties), and the specialization of a concept, computed as its distance from the root of the WordNet hierarchy. Moreover, the use of a more efficient algorithm to build the semantic network achieves a better representation of the user interest domain, adding more terms and using other linguistic properties rather than only common nouns and verbs or hypernym/hyponym links.


From a general point of view the experiments show promising results. We argue that this outcome depends on some novel intuitions in our approach. The proposed framework gives a more accurate measure of semantic relatedness between concepts by combining different grades of information about document content, related to the syntactic and semantic roles of all parts of speech. This is achieved through a more effective formula, which takes into account the centrality of terms in a given document and statistical information about them, a nonlinear combination of the best path (weighted by the strength of the linguistic properties involved), and the specialization of a concept, computed as its distance from the root of the WordNet hierarchy. Moreover, the use of a more efficient algorithm to build the semantic network achieves a better representation of the user interest domain, adding more terms and using linguistic properties beyond common nouns and verbs or hypernym/hyponym relations.

8. CONCLUSION

A system for semantic information retrieval has been presented. The Semantic Web is a new approach to organizing information and it represents an area of great interest for the international research community, but it is still far from a large-scale implementation. In this work we have proposed a system for information retrieval based on ontologies, a dynamic semantic network, and lexical chains, defining a strategy for scoring and ranking results by means of a novel metric that measures the semantic relatedness between words. Our approach has several novelties, in particular the use of a general knowledge base from which specific domain ontologies are extracted; moreover, the proposed semantic relatedness metric performs very well compared with other metrics in the literature on a general test set. The results of our experiments are promising and encourage new efforts in this direction, but some aspects of our approach should be further investigated. In particular, we are considering the possibility of introducing some form of normalization for the semantic component with respect to the length of the lexical chains and the size of the documents, and we are improving the accuracy of the comparison of our system with other metrics. Other related topics could also be investigated: (1) using relevance feedback techniques and user characteristics to improve result precision; (2) considering multimedia information in order to perform the IR task on features other than textual ones; (3) defining a formal linguistic model to represent our ontologies; (4) using standard languages to describe ontologies, improving their reusability and sharing; (5) inferring relevant documents and related terms to obtain specialized ontologies to merge with our DSN.

REFERENCES

ANAND, S. S., KEARNEY, P., AND SHAPCOTT, M. 2007. Generating semantically enriched user profiles for Web personalization. ACM Trans. Internet Technol. 7, 4, 22.
BAEZA-YATES, R. AND RIBEIRO-NETO, B. 1999. Modern Information Retrieval. Addison-Wesley, Reading, MA.
BARRY, C. L. 1998. Document representations and clues to document relevance. J. Amer. Soc. Inform. Sci. 49, 14, 1293–1303.
BAZIZ, M., BOUGHANEM, M., AUSSENAC-GILLES, N., AND CHRISMENT, C. 2005. Semantic cores for representing documents in IR. In Proceedings of the ACM Symposium on Applied Computing (SAC'05). ACM Press, 1011–1017.
BERNERS-LEE, T., HENDLER, J., AND LASSILA, O. 2001. The Semantic Web: A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities. Sci. Amer. 284, 5, 28–37.
BOYCE, B. R., MEADOW, C. T., AND KRAFT, D. H. 1994. Measurement in Information Science. Academic Press Inc.
BUDANITSKY, A. 1999. Lexical semantic relatedness and its application in natural language processing. Tech. rep., Department of Computer Science, University of Toronto.
CASTANO, S., FERRARA, A., AND MONTANELLI, S. 2003. H-match: An algorithm for dynamically matching ontologies in peer-based systems. In Proceedings of the International Workshop on Semantic Web and Databases (SWDB). 231–250.


FABRIANI, P., MISSIKOFF, M., AND VELARDI, P. 2001. Using text processing techniques to automatically enrich a domain ontology. In Proceedings of the ACM International Conference on Formal Ontology in Information Systems (FOIS'01). 270–284.
FINKELSTEIN, L., GABRILOVICH, E., MATIAS, Y., RIVLIN, E., SOLAN, Z., WOLFMAN, G., AND RUPPIN, E. 2002. Placing search in context: The concept revisited. Trans. Inform. Syst. 20, 1, 116–131.
GAIZAUSKAS, R. AND HUMPHREYS, K. 1997. Using a semantic network for information extraction. J. Natural Lang. Eng. 3, 2/3, 147–169.
GREEN, S. 1997. Automatically generating hypertext by computing semantic similarity. Ph.D. thesis, Department of Computer Science, University of Toronto.
GRUBER, T. R. 1993. A translation approach to portable ontology specifications. Knowl. Acquis. 5, 2, 199–220.
HALLIDAY, M. AND HASAN, R. 1976. Cohesion in English. Longman.
HARTER, S. P. 1992. Psychological relevance and information science. J. Amer. Soc. Inform. Sci. 43, 9, 602–615.
JANSEN, B. J., MULLEN, T., SPINK, A., AND PEDERSEN, J. 2006. Automated gathering of Web information: An in-depth examination of agents interacting with search engines. ACM Trans. Internet Technol. 6, 4, 442–464.
KAZMAN, R., AL-HALIMI, R., HUNT, W., AND MANTEI, M. 1996. Four paradigms for indexing video conferences. IEEE MultiMedia 3, 1, 63–73.
KERSCHBERG, L., KIM, W., AND SCIME, A. 2003. A personalizable agent for semantic taxonomy-based Web search. In Lecture Notes in Artificial Intelligence. Springer, 3–31.
LEACOCK, C. AND CHODOROW, M. 1998. Combining local context and WordNet similarity for word sense identification. In WordNet: An Electronic Lexical Database, C. Fellbaum, Ed. The MIT Press, Cambridge, MA, Chapter 11, 265–283.
LEE, C.-H. AND YANG, H.-C. 2001. Text mining of bilingual parallel corpora with a measure of semantic similarity. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics. IEEE, 470–475.
LEE, J., KIM, M., AND LEE, Y. 1993. Information retrieval based on conceptual distance in is-a hierarchies. J. Docum. 49, 2, 188–207.
LI, Y., BANDAR, Z., AND MCLEAN, D. 2003. An approach for measuring semantic similarity between words using multiple information sources. IEEE Trans. Knowl. Data Eng. 15, 4, 871–882.
MILLER, G. A. 1995. WordNet: A lexical database for English. Comm. ACM 38, 11, 39–41.
MOLDOVAN, D. I. AND MIHALCEA, R. 2000. Using WordNet and lexical operators to improve Internet searches. IEEE Internet Comput. 4, 1, 34–43.
MORRIS, J. AND HIRST, G. 1991. Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computat. Ling. 17, 1, 21–48.
NECHES, R., FIKES, R., FININ, T., GRUBER, T., PATIL, R., SENATOR, T., AND SWARTOUT, W. R. 1991. Enabling technology for knowledge sharing. AI Mag. 12, 3, 36–56.
PARK, T. 1993. The nature of relevance in information retrieval: An empirical study. Library Quart. 63, 3, 318–351.
RADA, R., MILI, H., BICKNELL, E., AND BLETTNER, M. 1989. Development and application of a metric on semantic nets. IEEE Trans. Syst. Man Cybern. 19, 1, 17–30.
SARACEVIC, T. 1975. Relevance: A review of and a framework for thinking on the notion in information science. J. Amer. Soc. Inform. Sci. 26, 6, 321–343.
SARACEVIC, T. 1996. Relevance reconsidered. In Proceedings of the 2nd International Conference on Conceptions of Library and Information Science: Integration in Perspective (CoLIS2), P. Ingwersen and N. Pors, Eds. The Royal School of Librarianship, 201–218.
SCHUTZ, A. 1970. Reflections on the Problem of Relevance. Yale University Press, New Haven, CT.
SHEK, E., VELLAIKAL, A., DAO, S., AND PERRY, B. 1998. Semantic agents for content-based discovery in distributed image libraries. In Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries. IEEE, 19–23.
SHEPARD, R. N. 1987. Towards a universal law of generalisation for psychological science. Science 237, 1317–1323.
SHETH, A., BERTRAM, C., AVANT, D., HAMMOND, B., KOCHUT, K., AND WARKE, Y. 2002. Managing semantic content for the Web. IEEE Internet Comput. 6, 4, 80–87.


SRIHARI, R., RAO, A., HAN, B., MUNIRATHNAM, S., AND XIAOYUN, W. 2000. A model for multi-model information retrieval. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME'00). Vol. 2. IEEE, 701–704.
STAIRMAND, M. A. 1996. A computational analysis of lexical cohesion with applications in information retrieval. Ph.D. thesis, Centre for Computational Linguistics, UMIST, Manchester.
SUSSNA, M. 1993. Word sense disambiguation for free-text indexing using a massive semantic network. In Proceedings of the 2nd International Conference on Information and Knowledge Management (CIKM'93). ACM Press, 67–74.
SWANSON, D. 1986. Subjective versus objective relevance in bibliographic retrieval systems. Library Quart. 56, 4, 389–398.
VAKKARI, P. AND HAKALA, N. 2000. Changes in relevance criteria and problem stages in task performance. J. Docum. 56, 5, 389–398.
VAN RIJSBERGEN, C. J. 1980. Information Retrieval, 2nd Ed. Butterworths.
VARELAS, G., VOUTSAKIS, E., RAFTOPOULOU, P., PETRAKIS, E. G., AND MILIOS, E. E. 2005. Semantic similarity methods in WordNet and their application to information retrieval on the Web. In Proceedings of the 7th Annual ACM International Workshop on Web Information and Data Management (WIDM'05). ACM Press, 10–16.
WEIHUA, L. 2002. Ontology supported intelligent information agent. In Proceedings of the 1st International IEEE Symposium on Intelligent Systems. IEEE, 383–387.
WEISS, R., VELEZ, B., AND SHELDON, M. A. 1996. HyPursuit: A hierarchical network search engine that exploits content-link hypertext clustering. In Proceedings of the 7th ACM Conference on Hypertext (HYPERTEXT'96). ACM Press, 180–193.
WU, Z. AND PALMER, M. 1994. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL-94). 133–138.
XU, H., MITA, Y., AND SHIBATA, T. 2002. Intelligent Internet search applications based on VLSI associative processors. In Proceedings of the Symposium on Applications and the Internet (SAINT'02). 230–237.

Received October 2008; accepted October 2009
