DLOnto: A Semantic Information Model for Digital Libraries

Ming Zhang, Zhihong Deng, Sukai Ding, Dongqing Yang
School of Electronic Engineering and Computer Science, Peking University, Beijing, 100871, China
{mzhang, zhdeng, skding, ydq}@db.pku.edu.cn

Abstract. In this paper, we propose DLOnto, a Semantic Information Model for Digital Libraries based on the theory of the Semantic Web, and give its formal definition together with the Ontology operations defined on it. The model is constructed on the basis of the Chinese Classification and Thesaurus for Libraries, extended using WordNet. We use DLOnto to implement query expansion and give an algorithm for measuring semantic rank. Experimental results show that DLOnto greatly improves the semantic relevance of query results.

1 Introduction

As a key infrastructure of Internet2, digital libraries have developed rapidly in recent years. One of the key challenges is to help users find the resources they need more efficiently among the abundant content held in the heterogeneous repositories of digital libraries. Ontology, as a concept modeling tool that can represent information systems at the semantic and knowledge level, has captured the attention of many researchers and plays an important role in Knowledge Engineering, Digital Libraries, Software Reuse, Information Retrieval (IR), the Semantic Web, the Interoperability of Heterogeneous Information, and so on. In this paper, we propose DLOnto, a Semantic Information Model for Digital Libraries, and a semantic query algorithm that works well for semantic ranking in Digital Libraries. Section 2 introduces Semantic Information Models and the Semantic Web; Section 3 gives a comprehensive description of Ontology and of the Chinese Classification and Thesaurus for Libraries; Section 4 presents DLOnto, an Ontology-based Semantic Information Model for Digital Libraries; Section 5 presents an Ontology-based semantic query algorithm; and Section 6 gives the summary and future work.

2 Semantic Information Models

An Information Model, or Data Model, describes the information resources within an organization and the relationships between them. It supports data modeling, describes the design requirements of databases and document storage, and provides views for resource managers [1]. A Semantic Information Model provides a natural mechanism to support database design and, at the same time, expresses data and their relationships more exactly [2]. The Integration Definition for Function Modeling (IDEF) family provides several methods for constructing Semantic Information Models. IDEF4 uses an object-oriented modeling method, and IDEF5 is an emerging Ontology-based semantic modeling method, but it is not yet mature [3].

The Semantic Web proposed by Tim Berners-Lee is built on the eXtensible Markup Language (XML), which supports user-defined tags, and the Resource Description Framework (RDF), which makes data representation more flexible [4]. Ontology, the core of the Semantic Web, is used to describe the relations among all kinds of resources with rich semantic information. It thereby makes information computer-understandable, so that software agents can retrieve and access heterogeneous, distributed information on the Internet and high-level, knowledge-based applications can be implemented.

The Protégé project of Stanford University supplies a tool for constructing and revising Ontologies [5]. This tool is very powerful and fully supports the OWL specifications, but it does not support constructing and mapping multiple Ontologies. Jena, a Semantic Web project of HP Laboratories, provides a group of APIs that support constructing Ontologies and their instances dynamically and supply simple semantic query and reasoning functions [6]. SEKT (Semantically-Enabled Knowledge Technologies) [7], an important European Semantic Web project, focuses on knowledge discovery and description and gives less consideration to applications. The MindSwap project pays little attention to semantic query and focuses on the Ontologies of Internet resources and Web Services [8].

From the viewpoint of knowledge sharing, an Ontology can be viewed as an explicit, conceptualized explanation and description of existing concepts and relationships. It makes explicit the conceptual models hidden in our minds or programs and greatly reduces misunderstanding of the concepts and logical relationships in a problem domain.

3 Ontology and the Chinese Classification and Thesaurus for Libraries

In philosophy, Ontology describes the essence of existence; it is a systematic account of existence and is the counterpart of Epistemology. Epistemology studies the essence and origin of knowledge and subjective cognition, while Ontology focuses on existence.

Knowledge Engineering researchers adopted this concept to capture knowledge when developing knowledge systems [9]. In Computer Science, it took a fairly long time to arrive at a clear definition of Ontology. In 1993, Gruber gave the most popular definition: “An Ontology is a specification of a conceptualization” [10]. The various other definitions of Ontology are explanations from different viewpoints. They complement each other and continually extend the application range of Ontologies, and they all embody the point made by Gruber’s definition that “An Ontology is a formal conceptual model which describes the existence, and is the explicit specification of shared conceptualization” [11]. The Knowledge Grid incorporates epistemology and ontology to reflect human cognition characteristics; exploits social, ecological, and economic principles; and adopts the techniques and standards developed during work toward the next-generation Web [12].

Topic Map, WordNet, and Classification & Thesaurus are three typical Ontologies in Digital Libraries. Topic Map is a standard for describing knowledge structures and electronic indices and a methodology for constructing knowledge management systems. It classifies and navigates information resources in a unified manner, and hence builds meaningful information networks on top of all kinds of resources and data [13]. Topic Map takes advantage of the concept of term indices and the features of networks, integrates Topic, Association and Occurrence, controls the acquisition and browsing of information with style sheets, and describes various browsing levels in detail. It expresses the complicated relationships in knowledge so as to create virtual knowledge maps of the information on the Internet and help users make effective use of electronic resources.
WordNet is an online lexical reference system whose development was led by Professor George Miller of Princeton University and carried out jointly by psychologists, linguists and computer scientists. WordNet has 129,509 words, 99,643 synsets and 391,885 binary semantic relations between words [14]. WordNet is built with sets of synonymous words (synsets) as the basic unit, so a user who has a certain concept in mind can find proper words to express it. Synsets in WordNet are linked by relationships including upper-lower, part-whole and inheritance.

To sum up, the purpose of an Ontology is to obtain, describe and present the knowledge of a certain domain, establish a common understanding of that knowledge, fix the recognized terms, and give explicit definitions of these terms and their relationships at different levels of formalization. Generally speaking, an Ontology has two characteristics: it is both stable and dynamic. “Stable” denotes that what an Ontology reflects is a conceptual model and that it does not refer to dynamic actions. “Dynamic” means that its contents and service objects change over time and that different Ontologies can be defined and constructed for different domains [15].

Taking the subject term as the basic structural unit, the Classification Scheme and the Thesaurus are two kinds of knowledge management tools, distilled by generations of librarians from their intelligence and experience, that represent the structure of human knowledge. Their ability to organize knowledge has been demonstrated and enriched over more than 200 years of development, and they are widely used in document annotation, book shelving, category organization and retrieval services. In fact, classifications and thesauri can both be viewed as concepts in an Ontology [13]. With mappings between corresponding items, the Classification and Thesaurus forms a bidirectional index. That is, subjects are listed under categories and classification numbers are listed under subjects, but these correspondences are not equivalences because of restrictions of the original Classification Scheme and Thesaurus. The combination is therefore relatively loose. The “Chinese Classification and Thesaurus for Libraries” is such a combination, with more than 50,000 categories and 210,000 terms (subjects and keywords) from all fields of philosophy, social science and natural science [13].

There are three kinds of concepts in the Chinese Classification and Thesaurus: Category, Subject (formal term) and Keyword (informal term, that is, a synonymous or similar term of a subject term, also called an “entry term”). Categories form hierarchies identified by category numbers. A subject falls into some categories and a keyword belongs to some subjects. Generally, every subject has one keyword with the same name. Upper-lower, Equivalence and See-also are the three relationships between these concepts. A Broader Term (BT) is the generalization, or broader sense, of its Narrower Terms (NT), and an NT is the specialization, or narrower sense, of its BT. U/UF denotes the Use/Used-For relationship between a subject and other subjects. Related terms guide users to other related subjects (see-also) in order to extend the retrieval range.
The following is a segment of the Chinese Classification and Thesaurus for Libraries; the text in parentheses is commentary:

TP3 Computing Technology and Computer Technology (category)
  TP30 General Problems (category)
  TP31 Computer Software (category)
    TP311 Programming and Software Engineering (category)
      TP311.1 Programming and the Theory of Program Correctness (category)
        TP311.12 Data Structure (category)
          Data Structure (a concept under category TP311.12)
            Data Structure (a keyword of the concept above)
            Abstract Data Structure (a concept under the concept Data Structure)
        TP311.13 Database Theory and System (category)

The concepts and relationships in the example are as follows:

Keyword: “Data Structure” is an instance of a keyword.
Subject: “Data Structure” and “Abstract Data Structure” are instances of subjects.
Keyword-Subject relation: “Abstract Data Structure - Data Structure” is an instance of a keyword-subject relation.
Subject Hierarchy: as “Abstract Data Structure” is the lower concept of the subject “Data Structure”, “Abstract Data Structure - Data Structure” is an example of a subject hierarchy relationship.
Relevance Relation of Subjects: “see-also” between subjects is an instance of a relevance relation of subjects.

Category: “TP311 Programming and Software Engineering” is an instance of a category.
Category Hierarchy: “TP311.1 - TP311.12” is an instance of a category hierarchy relationship.
Subject-Category relation: “Data Structure - TP311.12” is an instance of a subject-category relation.

We manually map the concepts in the Chinese Classification to the corresponding English terms, retrieve nouns and noun phrases from the WordNet online dictionary as keyword candidates, and add them to the Chinese Classification. Statistics on the basic concepts are given in Table 1.

Table 1. Statistics of terms and their relations

                                    Total      Chinese Classification   WordNet
Keywords                            422,480    217,508                  204,972
Subjects                            310,738    195,314                  115,424
Keyword-subject relationships       422,821    217,508                  205,313
Subject hierarchy relationships     160,589     65,758                   94,831
Subject relevance relationships     210,826     41,665                  169,161

Among these terms, subjects can be viewed as semantic concepts. A Category is neither a semantic concept nor a keyword, but it can be treated as a keyword in query expansion. The total number of categories in the Chinese Classification is 49,602, including 425 TP (Automation Technology and Computing Technology) categories. There are 1,305 categories in ACM and 5,242 keywords in the TP category.
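The figures in Table 1 can be cross-checked with a few lines of Python. The sketch below assumes that the table is read with the rows Keywords, Subjects and the three relationship types, and with the columns Total, Chinese Classification and WordNet; the dictionary keys and the function name are illustrative and are not part of DLOnto.

```python
# Counts copied from Table 1 as (total, Chinese Classification, WordNet).
# The row/column reading of the table is an assumption stated in the lead-in.
TABLE_1 = {
    "keywords": (422_480, 217_508, 204_972),
    "subjects": (310_738, 195_314, 115_424),
    "keyword-subject relationships": (422_821, 217_508, 205_313),
    "subject hierarchy relationships": (160_589, 65_758, 94_831),
    "subject relevance relationships": (210_826, 41_665, 169_161),
}


def inconsistent_rows(table):
    """Return the names of rows whose total differs from the sum of both sources."""
    return [
        name
        for name, (total, chinese_classification, wordnet) in table.items()
        if total != chinese_classification + wordnet
    ]


print(inconsistent_rows(TABLE_1))  # prints [] when every total matches
```

Under this reading every total equals the sum of the Chinese Classification and WordNet counts, so the terms taken from WordNet appear to be counted as additions to, not overlaps with, the original terms.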

4 DLOnto - a Semantic Information Model for Digital Libraries

We introduce ideas from the Semantic Web into information retrieval, propose a semantic information model for Digital Libraries, and discuss its formalization and semantic query. We first give the formal definition of DLOnto, a Semantic Information Model for organizing the resources of Digital Libraries.

Definition 1: The logical structure of DLOnto, the resource Ontology for Digital Libraries, can be viewed as a triple, DLOnto = <C, P, R>, described in detail as follows:
(1) The elements of set C are called concepts, and the elements of set P are called properties. The intersection of these two sets is empty.
(2) R := {Rbtp, Ruf, Rrt} is the set of relationships between concepts. Rbtp, Ruf and Rrt are all subsets of C × C. Rbtp is the upper-lower relationship, and Rbtp(C1, C2) indicates that C2 is a sub-concept of C1. Upper-lower is a partial-ordering relationship and is transitive; it is denoted by “≥”. Ruf is the equivalence relationship, and Ruf(C1, C2) indicates that C1 is the subject of keyword C2. Rrt is the see-also relationship between concepts, and Rrt(C1, C2) indicates that C1 and C2 are relevant concepts. Rrt is symmetric.
(3) Rbtg is defined as the inverse mapping of Rbtp and is denoted by “≤”. Ruse(C1, C2) is the inverse mapping of Ruf(C1, C2). Rbtg is also transitive.

Some Ontology operations are defined on DLOnto as follows.

Definition 2: Direct_Specialization
Direct_Specialization(C1, C2) ⇔ Rbtp(C1, C2) ∨ Ruf(C1, C2) ∨ Rrt(C1, C2)
Direct_Specialization(C1, C2) means that C1 and C2 satisfy the Direct_Specialization relationship, that is, C2 is a specialization of C1. Direct_Specialization is the set of pairs that satisfy any one of the three relationships above.

Definition 3: Specialization
Specialization(C1, C2) ⇔ Direct_Specialization(C1, C2) ∨ ∃X∈C (Specialization(C1, X) ∧ Specialization(X, C2))
Specialization is the transitive closure of Direct_Specialization.

Definition 4: Direct_Generalization
Direct_Generalization(C1, C2) ⇔ Rbtg(C1, C2) ∨ Ruse(C1, C2)

Definition 5: Generalization
Generalization(C1, C2) ⇔ Direct_Generalization(C1, C2) ∨ ∃X∈C (Generalization(C1, X) ∧ Generalization(X, C2))
Generalization is the transitive closure of Direct_Generalization.

We unify the Classification Scheme and the Thesaurus. Categories are treated as upper concepts, subjects as the intension of categories, and metadata as the extension and instances of concepts. We keep the original hierarchy of categories and subjects, including keywords. Subjects belong to some categories, and hence form the structure illustrated in Figure 1.
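Definitions 2 to 5 can be illustrated with a small Python sketch. It is a minimal illustration under stated assumptions, not the implementation behind DLOnto: the class `DLOntoRelations`, the helper `transitive_closure` and the sample pairs are hypothetical, and the sample treats the link from category TP311.12 to the subject “Data Structure” as an Rbtp pair purely for demonstration.

```python
class DLOntoRelations:
    """Toy store for the relations of Definition 1 (all names are illustrative)."""

    def __init__(self):
        self.btp = set()  # (C1, C2): C2 is a sub-concept of C1
        self.uf = set()   # (C1, C2): C1 is the subject of keyword C2
        self.rt = set()   # see-also pairs, kept symmetric

    def add_rt(self, c1, c2):
        # Rrt is symmetric, so both directions are stored.
        self.rt.update({(c1, c2), (c2, c1)})

    def direct_specialization(self):
        # Definition 2: union of Rbtp, Ruf and Rrt.
        return self.btp | self.uf | self.rt

    def direct_generalization(self):
        # Definition 4: Rbtg and Ruse are the inverses of Rbtp and Ruf.
        return {(c2, c1) for (c1, c2) in self.btp | self.uf}


def transitive_closure(pairs):
    """Smallest transitive relation containing `pairs` (naive fixed-point iteration)."""
    closure = set(pairs)
    while True:
        step = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if step <= closure:
            return closure
        closure |= step


def build_example():
    """Hypothetical fragment based on the TP311.1 example segment."""
    onto = DLOntoRelations()
    onto.btp.add(("TP311.1", "TP311.12"))
    onto.btp.add(("TP311.12", "Data Structure (subject)"))
    onto.uf.add(("Data Structure (subject)", "Data Structure (keyword)"))
    return onto


onto = build_example()
specialization = transitive_closure(onto.direct_specialization())  # Definition 3
generalization = transitive_closure(onto.direct_generalization())  # Definition 5
print(("TP311.1", "Data Structure (keyword)") in specialization)   # True
print(("Data Structure (keyword)", "TP311.1") in generalization)   # True
```

The fixed-point loop is chosen for readability; for an Ontology with hundreds of thousands of terms a graph traversal per query concept would likely be preferable to materializing the whole closure.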

[Figure 1. Diagram labels: Classification Scheme; Category; Subject; Keyword; See also; Metadata: Instance of Concepts; Layer of Category Concepts; Layer of Subject Concepts.]

Fig. 1. The structure of DLOnto and metadata in Digital Libraries

In order to unify the Classification Scheme and the Thesaurus, we first define a top class, Lexicon. Lexicon has a property called hasType, whose value can only be one of CATEGORY, SUBJECT and KEYWORD. The cardinality of the hasType property is restricted to 1. A Lexicon may have some properties called hasParent. Every node in DLOnto inherits from Lexicon, that is, every category, subject and keyword is defined as a subclass of Lexicon. Hierarchy is embodied by the hasParent property. Equivalence is also embodied by hasParent, because equivalence exists only between a keyword and its subject, i.e., a keyword is a synonym of its parent node. Note that when defining the hasParent properties of a keyword, we restrict the referred class to one whose hasType property equals SUBJECT.

An example of the ontology structure of the Chinese Classification and Thesaurus is shown in Figure 2. A solid line in Figure 2 corresponds to a hasParent property of the child node. The hierarchy in OWL is represented by hasParent properties. The equivalence relationship is actually the hierarchy between keywords and their subjects and is also represented by hasParent properties. The relevance relationship between subjects is denoted by name in OWL, where “name” is the ID of a relevant subject. A subject may have several relevant subjects, which are listed in the way above. A concept node annotated with metadata is no longer an abstract concept but a knowledge node containing literature instances.
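The Lexicon design described above can be mirrored in a short Python sketch. It only assumes what the text states (a single hasType value, any number of hasParent links, and keyword parents restricted to subjects); the names `LexType`, `Lexicon`, `add_parent` and `ancestors` are hypothetical, and the sketch is not the OWL encoding used in DLOnto.

```python
from dataclasses import dataclass, field
from enum import Enum


class LexType(Enum):
    CATEGORY = "CATEGORY"
    SUBJECT = "SUBJECT"
    KEYWORD = "KEYWORD"


@dataclass
class Lexicon:
    name: str
    has_type: LexType  # a single value, mirroring the cardinality-1 restriction
    has_parent: list = field(default_factory=list)

    def add_parent(self, parent):
        # A keyword may only refer to a node whose hasType is SUBJECT.
        if self.has_type is LexType.KEYWORD and parent.has_type is not LexType.SUBJECT:
            raise ValueError("the parent of a keyword must be a SUBJECT")
        self.has_parent.append(parent)


def ancestors(node):
    """Return (name, type) of every node reachable through hasParent, nearest first."""
    result, queue, visited = [], list(node.has_parent), set()
    while queue:
        parent = queue.pop(0)
        if id(parent) in visited:
            continue
        visited.add(id(parent))
        result.append((parent.name, parent.has_type.value))
        queue.extend(parent.has_parent)
    return result


# Hypothetical fragment following the TP311.12 example.
tp311_1 = Lexicon("TP311.1", LexType.CATEGORY)
tp311_12 = Lexicon("TP311.12", LexType.CATEGORY)
subject = Lexicon("Data Structure", LexType.SUBJECT)
keyword = Lexicon("Data Structure", LexType.KEYWORD)

tp311_12.add_parent(tp311_1)
subject.add_parent(tp311_12)
keyword.add_parent(subject)

print(ancestors(keyword))
```

Because hierarchy and equivalence are both carried by hasParent, one traversal is enough to walk from a keyword through its subject up to the enclosing categories.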

[Figure 2. Node labels: TP3 Computing Technology and Computer Technology (Category); TP30 General Problems (Category); TP31 Computer Software (Category); TP311 Programming and Software Engineering (Category); TP311.1 Programming and the Theory of Program Correctness (Category); TP311.12 Data Structure (Category); TP311.13 Database Theory and System (Category); Data Structure (Subject); Algorithm (Subject); Data Structure (Keyword); Abstract Data Structure (Keyword). Edge labels: btp, btg, uf, use, rt.]

Fig. 2. The hierarchy of the Chinese Classification and Thesaurus

The following segment of OWL is the definition of “TP311.12 Data Structure” in DLOnto.



5 Semantic Query based on DLOnto

The queries that users submit to Information Retrieval systems are commonly very short, often one or two words. Researchers try to improve recall and precision through query expansion. There are two kinds of query expansion technologies: automatic (or semi-automatic) technologies [16] and human-participant technologies [17]. Users often cannot express exactly what they want when using IR systems, so it is necessary to apply query expansion before submitting their queries. Query expansion proceeds in two steps: first, expanding the original query with new keywords; second, reassigning weights to the words in the expanded query.

There are three major methods of query expansion, based on users’ preferences [18], user feedback [19], and global information [20]. These methods expand users’ queries in different ways in order to bring the query results close to users’ intentions. The methods based on users’ preferences are efficient and easy to implement, but users have to register their profiles in advance, and whether users trust the IR system is an open issue. The shortcoming of user feedback is that most users do not want to give feedback and that the workload of the IR system is very heavy.

The following are some good implementations of methods based on global information about query results. Bruza and Dennis use a method based on Natural Language Processing in their hyperindex system [21]. Their system runs on top of a web search engine. It parses the titles of retrieved documents to find query phrases linked by words such as “in”, “of”, “with”, and “as”. New phrases are presented to users in a structured form indicating whether they are restrictions or expansions of the original query. The authors claimed that their system handled document titles effectively. In 1999, Anick and Tipirneni introduced a technology that could retrieve query words reflecting the topics of documents [22].
For this purpose, they selected words with high lexical dispersion, that is, words that appear together with many other different words. They experimented on a group of articles from the Financial Times and concluded that the words with the highest lexical dispersion were “market”, “group” and “company”. Using lexical dispersion, they selected suitable words from the retrieved documents as candidate query expansions to present to users. Not only the words with the highest lexical dispersion but also the phrases containing them were presented to users. The authors provided a tool that uses this technology to analyze the logs of IR systems.

“Concept lattices” are used to assist query expansion dynamically: initial query results are analyzed to form global concept information, which is then used to optimize the query [23]. “Concept lattices” are also used in query navigation.

The work above shows that semantic query is a very promising direction. Most systems present dynamically generated query expansions to users in a simple list. Actually, concept hierarchies can organize the expansions neatly. For example, if a user queries “operating systems”, “Unix” would be presented as a relevant word and, at the same time, a document containing many occurrences of “real-time” and “embedded” should be retrieved.

Concepts and the relationships between them, as defined by an Ontology, are the basis of semantic query. We use DLOnto to assist users in semantic query. With the help of DLOnto, users can express their demands using computer-understandable terms with exact semantic meaning. The semantic query expansion focuses on locating concepts and also takes advantage of the metadata. It consists of three procedures. First, “Essential Information” gets the original metadata, including Author, Title, and Keywords. Second, “Vocabulary Assistant” extracts the Synonymous Terms, Broader Terms, and Narrower Terms. Third, “Specialize Query” defines the final query form, including metadata. This query expansion with three levels of abstraction is suitable for reuse: “Essential Information” concentrates on the query interface; “Vocabulary Assistant” deals with locating concepts; “Specialize Query” is tied to the target information system.
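The two generic steps of query expansion (adding terms, then reassigning weights) can be sketched in Python for a thesaurus that exposes synonymous, broader and narrower terms. This is a minimal sketch and is not the Algorithm 1 given in this paper: the `Entry` class, the toy thesaurus content, the `expand_query` function and the weight values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Terms related to one query term (hypothetical structure)."""
    synonyms: list = field(default_factory=list)
    broader: list = field(default_factory=list)
    narrower: list = field(default_factory=list)


# Toy data for illustration only; not taken from DLOnto.
TOY_THESAURUS = {
    "operating systems": Entry(
        synonyms=["OS"],
        broader=["computer software"],
        narrower=["Unix"],
    ),
}

# Assumed weights: expansion terms count less than the user's own words.
WEIGHTS = {"original": 1.0, "synonym": 0.9, "narrower": 0.7, "broader": 0.5}


def expand_query(terms, thesaurus, weights=WEIGHTS):
    """Step 1 adds related terms; step 2 gives each term a weight by its origin."""
    expanded = {}

    def add(term, kind):
        # Keep the highest weight when a term is reached in several ways.
        if weights[kind] > expanded.get(term, 0.0):
            expanded[term] = weights[kind]

    for term in terms:
        add(term, "original")
        entry = thesaurus.get(term)
        if entry is None:
            continue  # unknown terms are kept but not expanded
        for synonym in entry.synonyms:
            add(synonym, "synonym")
        for narrower in entry.narrower:
            add(narrower, "narrower")
        for broader in entry.broader:
            add(broader, "broader")
    return expanded


print(expand_query(["operating systems"], TOY_THESAURUS))
```

Keeping the relation type alongside each expansion term makes it possible to present the expansions grouped by hierarchy level instead of as a flat list.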
Algorithm 1: Vocabulary Assistant
Input: Keywords stored in Term[1..n];
Output: The expanded concept set from original Keywords;
Procedure:
for (i=1; i