Journal of Information Systems and Communication ISSN: 0976-8742 & E-ISSN: 0976-8750, Volume 3, Issue 1, 2012, pp. 138-142. Available online at http://www.bioinfo.in/contents.php?id=45

DOMAIN ONTOLOGY BASED KNOWLEDGE REPRESENTATION FOR EFFICIENT INFORMATION RETRIEVAL

LAKSHMI TULASI R.1, GOUDAR R.H.2*, SREENIVASA RAO M.3 AND DESAI P.D.4

1Department of CSE, QIS College of Engineering & Technology, Ongole, India.
2Department of CSE, Graphic Era University, Dehradun, India.
3Director, School of IT, JNTU, Hyderabad and Dean CIHL, IIIT-H, India.
4Department of CSE, BVB College of Engg. and Tech., Hubli, India.
*Corresponding Author: Email- [email protected]

Received: January 12, 2012; Accepted: February 15, 2012

Abstract- The advent of the semantic web marked the growth of ontologies. Ontologies and the semantic web were designed to let users sit back and bask in the glow of seamless efficiency. Information retrieval plays a crucial role in getting the right information, and getting it requires an efficient search engine that can convert the query entered by the user into a semantically correct one. Another important consideration is that the expanded query should carry the same semantics as the query the user entered. The generation of ontologies, and the creation of metadata attributing information to them, is the key to knowledge access. In this paper we take all these factors into consideration and attempt to improve the relevancy of results in a search system for domain-specific documents. We also describe how semantic knowledge can be combined with knowledge management and with IE and IR, and propose a model of a Grid Environment in which more than one domain can be merged into a single ontology able to answer queries about any of the merged domains.

Keywords- Semantic Web, Knowledge Base, Information Retrieval (IR), Ontology, Intelligent Web, Next Generation Web.

Citation: Lakshmi Tulasi R., et al. (2012) Domain Ontology Based Knowledge Representation for Efficient Information Retrieval. Journal of Information Systems and Communication, ISSN: 0976-8742 & E-ISSN: 0976-8750, Volume 3, Issue 1, pp. 138-142.

Copyright: Copyright©2012 Lakshmi Tulasi R., et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction
It is generally accepted that search engines based on IR alone (keyword and phrase matching between the query and the index) tend to offer high recall and low precision: the user is faced with too many results, and many of them are irrelevant. The main reason for this is the failure to handle polysemy (a word that has two or more distinct meanings) and synonymy (two different words that have the same meaning). The use of ontologies and associated metadata allows users to express their queries more precisely, avoiding these problems. Users can choose ontological concepts to define their query, or select from a set of returned concepts following a search in order to refine it. This can improve the accuracy of a search. The average search engine query on the web, in one survey, was 2.2 words long [1]. Having submitted a query, the user is presented with a ranked list of documents relevant to the query. We therefore suggest that the future of search engines lies in supporting more of the information management process, as opposed to incremental and modest improvements to relevance

ranking of documents. In this approach, software supports the process of actually reading and analysing relevant documents, rather than merely listing them and leaving the rest of the information analysis task to the user. Corporate knowledge workers need information defined by its meaning, not by text strings. They also need information relevant to their interests and to their current context. They need to find not just documents, but sections, information entities and even digests of information created from multiple documents. The generation of ontologies, and the creation of metadata attributing information to them, is the key to accessing domain-specific knowledge entities. Current knowledge management systems have significant weaknesses.

A. Searching information
Existing keyword-based searches can retrieve irrelevant information that includes certain terms used in different meanings. They also miss information when different terms with the same meaning are used to describe the desired content.


On the other hand, exploitation of the interrelationships between selected pieces of information (which can be facilitated by the use of ontologies) can put otherwise isolated information into a meaningful context. The implicit structures so revealed help users use and manage information more efficiently.

B. Extracting information
Currently, human browsing and reading are required to extract relevant information from information sources. The explicit representation of the semantics of data, accompanied by domain theories (i.e. ontologies), will enable a web that provides a qualitatively new level of service. It will weave together a large network of human knowledge and complement it with machine processability.

Creating a Knowledge Base
Ontologies offer a way to cope with heterogeneous representations of web resources. The domain model implicit in an ontology can be taken as a unifying structure for giving information a common representation and semantics, so that a semantic repository of information spanning multiple heterogeneous information sources can be created. Ontologies are simply another class of model, although rather more complex than those normally used. Many search engines today, when requested to search for a particular string, do not just return the appropriate documents but also highlight the occurrences of the string within each document. Using coreference resolution, a search for 'George W Bush' would also highlight locations in the documents where 'the President' or 'Mr Bush' occurred, or might even highlight 'he' or 'him' where the pronoun referred to George Bush. Figure 1 illustrates the five basic operations of information extraction, applied to a simple text:

"Rama announced yesterday that it will make Shyam its next African base, expanding its route network to 18 in an investment worth around 100m. The airline says it will deliver 1.8 million passengers in the first year of the agreement, rising to five million by the seventh year."

a. entities: Rama, Shyam
b. mentions: it = Rama, the airline = Rama, it = the airline
c. descriptions: African base
d. relations: Shyam base_of Rama
e. events: investment (100m)

Fig. 1- Extracting information.

Along with the ontology, containing the classes and properties, we also have a 'knowledge base' which contains the instances and the specific property instantiations. Together the ontology and the knowledge base make up a semantic repository. Whether the two parts are stored separately, e.g. in two distinct relational databases, depends on the practicalities of the implementation. As a text is analysed and named entities are identified, hyperlinks are established to the instances in the knowledge base. Figure 2 illustrates a piece of text and an associated semantic repository. The repository contains both the ontology (classes) and the knowledge base (instances), together with the linkages between text and knowledge base. How such a repository can be used to, for example, enhance the search experience is illustrated in a subsequent section.
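As a concrete illustration, the sketch below models a tiny semantic repository in Python: knowledge-base instances typed by ontology class, and mentions in a text hyperlinked to those instances. The data structures, field names and character offsets are illustrative assumptions, not the implementation described in the paper.

# A minimal sketch of a semantic repository: ontology classes plus a
# knowledge base of instances, with text mentions linked to instances.
# All names and structures here are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Instance:
    name: str                        # knowledge-base instance, e.g. "Rama"
    cls: str                         # its ontology class, e.g. "Airline"
    aliases: set[str] = field(default_factory=set)

@dataclass
class Mention:
    span: tuple[int, int]            # character offsets in the source text
    instance: Instance               # hyperlink into the knowledge base

text = "Rama announced yesterday that it will make Shyam its next African base."
rama = Instance("Rama", cls="Airline", aliases={"the airline", "it"})
shyam = Instance("Shyam", cls="City")

# Coreference resolution lets the pronoun "it" link to the same
# knowledge-base instance as the name "Rama".
mentions = [Mention((0, 4), rama), Mention((30, 32), rama), Mention((43, 48), shyam)]

for m in mentions:
    print(text[m.span[0]:m.span[1]], "->", m.instance.name)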

Fig. 2- Linking texts to knowledge base (Source: BT Technology Journal, Vol. 23, No. 3)

Ontologies and the semantic web were designed to let users sit back and bask in the glow of seamless efficiency. Yet the query a user inputs is typically a short one which is not expanded, so it is difficult to retrieve the appropriate information every time. Even a query that is long enough may fail to retrieve the semantically correct information. When a query is given using just one or two words it is not possible to recover the exact semantics of what the user wants, and the precision and recall parameters fall to a great extent. Hence we first expand the query, trying to recover what it actually means, and then search for the required results. Ontologies, a major component of the semantic web, are defined as a representation of a shared conceptualization of a particular domain.

Proposed Algorithm for Next Generation Web
For our design we have built the University Ontology shown in Fig. 3. To build the ontology, we first list all the classes and subclasses. We then use object properties such as isTaughtBy, teaches and takes to indicate the relationships between the various classes. For instance, ComputerScienceModule is linked to AcademicStaff using the isTaughtBy relationship. Similar conditions hold for the other classes.
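A minimal sketch of how such an ontology could be expressed programmatically is shown below, using the rdflib Python library. The namespace IRI and the exact subset of classes are placeholders for illustration; this is not necessarily how the authors built their ontology.

# A sketch of part of the University Ontology in RDF/OWL using rdflib.
# The namespace IRI is a hypothetical placeholder.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

UNI = Namespace("http://example.org/university#")
g = Graph()
g.bind("uni", UNI)

# Classes and subclasses from Fig. 3 (a representative subset).
for cls in ["Module", "ComputerScienceModule", "EconomicsModule",
            "MathsModule", "AcademicStaff", "Student"]:
    g.add((UNI[cls], RDF.type, OWL.Class))
for sub in ["ComputerScienceModule", "EconomicsModule", "MathsModule"]:
    g.add((UNI[sub], RDFS.subClassOf, UNI.Module))

# Object properties linking the classes, e.g. Module isTaughtBy AcademicStaff.
g.add((UNI.isTaughtBy, RDF.type, OWL.ObjectProperty))
g.add((UNI.isTaughtBy, RDFS.domain, UNI.Module))
g.add((UNI.isTaughtBy, RDFS.range, UNI.AcademicStaff))
g.add((UNI.teaches, RDF.type, OWL.ObjectProperty))
g.add((UNI.teaches, OWL.inverseOf, UNI.isTaughtBy))

print(g.serialize(format="turtle"))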

Fig. 3- University Ontology

Initially we restrict the algorithm to accept a maximum of two words as a query from the user, because of implementation complexity. Then it expands the query to make a meaningful one by


adding properties, instances, classes etc. from the ontology, depending on their relationships. The algorithm consists of the following steps (a minimal code sketch follows Table 1):
- The user enters a query of at most two words from the domain.
- Expand the query using the properties, relationships, instances etc.
- If the entered query is a concept, expand it in terms of its properties, subclasses and instances. For example, if the concept is Room, it can be expanded in terms of its properties hasRoom, hasRent or ofSize, in terms of its instances such as place, or in terms of its subclasses such as SingleRoom, DoubleRoom, DeluxeRoom, Suite etc. The expanded query will be (Room) and (hasRoom or hasRent or ofSize) or (SingleRoom or DoubleRoom or DeluxeRoom or Suite).
- If the entered query is a property, expand the query in terms of its domain and range. If the query is hasRoom, the expanded one will be (Room and hasRoom) or (hasRoom and Accommodation).
- Create a meaningful query with the relations.
- Create a sub-query by removing all conjunctions and prepositions (like and, or, some, exactly etc.).
- Prioritize the words in the sub-query.
- Search, using a search engine, for the documents containing the words in the query.
- Rank the documents: the document which contains the maximum occurrences of the words has the highest priority.

Table 1- Expanded version of Query

Sr No | Type | Query | Expanded version of Query
1 | Class | Department | Department and (ComputerScienceModule or EconomicsModule or MathsModule) and (isTaughtBy only AcademicStaff)
2 | Property | takes | UndergraduateStudent and takes and (ComputerScienceModule or EconomicsModule or MathsModule)
3 | Instance | Student1 | Student1 and UndergraduateStudent or (takes only 2 Things)
4 | Class-Class | Module-GraduateStudent (when they don't map to the same superclass) | GraduateStudent and takes and (Module or ComputerScienceModule or EconomicsModule or MathsModule)
4 | Class-Class | EconomicsModule-MathsModule (when they map to the same superclass) | (EconomicsModule and MathsModule) or isTaughtBy only AcademicStaff
5 | Class-Property | Module-isTaughtBy | Module or isTaughtBy (AcademicStaff or Prof1 or Prof2 or Prof3 or Prof4 or Prof5 or Prof6 or Prof7 or Prof8 or Prof9)
6 | Class-Instance | AcademicStaff-Prof1 | AcademicStaff and taughtBy some Prof1
7 | Property-Property | teaches-taught_by | AcademicStaff and teaches or taughtBy some (ComputerScienceModule or EconomicsModule or MathsModule)
8 | Property-Instance | teaches-Prof1 | Prof1 and teaches and (ComputerScienceModule or EconomicsModule or MathsModule)
9 | Instance-Instance | Student1-Student2 | UndergraduateStudent and (Student1 or Student2) and takes exactly 2 things
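The following is a minimal sketch of the two single-word expansion rules (concept and property), using the Room example from the algorithm above. The dictionaries are a hypothetical stand-in for lookups against a real ontology API, not the authors' implementation; the section that follows explains each of the nine query types in detail.

# Sketch of the expansion step for single-word queries. The lookup
# tables below are hypothetical stand-ins for a real ontology API.
CLASSES = {
    "Room": {
        "properties": ["hasRoom", "hasRent", "ofSize"],
        "subclasses": ["SingleRoom", "DoubleRoom", "DeluxeRoom", "Suite"],
    },
}
PROPERTIES = {
    "hasRoom": {"domain": "Accommodation", "range": "Room"},
}

def expand(term: str) -> str:
    """Expand a one-word query by the concept/property rules above."""
    if term in CLASSES:
        c = CLASSES[term]
        return (f"({term}) and ({' or '.join(c['properties'])})"
                f" or ({' or '.join(c['subclasses'])})")
    if term in PROPERTIES:
        p = PROPERTIES[term]
        return f"({p['range']} and {term}) or ({term} and {p['domain']})"
    return term  # unknown terms fall back to plain keyword search

print(expand("Room"))
# (Room) and (hasRoom or hasRent or ofSize) or (SingleRoom or DoubleRoom or DeluxeRoom or Suite)
print(expand("hasRoom"))
# (Room and hasRoom) or (hasRoom and Accommodation)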

Explanation of how the algorithm will work
a. Class: When the query given is a class, we expand the query in terms of its equivalent classes, properties and inferred subclasses.
b. Property: When the query given is a property, we expand the query in terms of its domain and range.
c. Instance: When the query given is an instance, we expand the query in terms of its superclass, its properties and the other instances related to it.
d. Class-Class: When the query given is class-class, we first check whether the classes have a common ancestor. If they do, the query is expanded in terms of their properties. If they have different superclasses, they are expanded in terms of their connecting properties. Documents which contain nothing in common are left out.
e. Class-Property: Here the class should be either the domain or the range of the property. If it is the domain (range), the query is expanded in terms of the range (domain), including all its inferred and equivalent classes.
f. Class-Instance: Here the instance could be a child of the same class or of a different class. If it is a child of the same class, they are expanded in terms of their properties and superclasses; otherwise they are expanded in terms of their superclasses and their connecting properties.
g. Property-Property: Whenever two properties are given, we first check the domain on which they are defined and expand them in terms of that domain and its subclasses.
h. Property-Instance: If the property and instance are related, the query is expanded in terms of their relationship; otherwise it becomes a keyword-based search.
i. Instance-Instance: Here we look at the class from which the instances have evolved and expand using their connecting properties, the specific properties of each individual and the class to which they belong.

Experimentation and Results
Figure 4 shows the comparison between the proposed semantic search and ordinary search, using the queries expanded in Table 1. The precision and recall parameters were estimated manually. The average improvement of the proposed next-generation search over normal search was found to be 49%. Precision and recall are set-based measures; that is, they evaluate the quality of an unordered set of retrieved documents. To evaluate ranked lists, precision can be plotted against recall after each retrieved document, as shown in the example below. To facilitate computing average performance over a set of topics, each with a different number of relevant documents, individual topic precision values are interpolated to a set of standard recall levels.
Recall is a measure of the ability of a system to present all relevant items: Recall = number of relevant items retrieved / number of relevant items in the collection.
Precision is a measure of the ability of a system to present only relevant items: Precision = number of relevant items retrieved / total number of items retrieved.
Using the interpolation rule, the interpolated precision for all standard recall levels up to .5 is 1, the interpolated precision for recall


levels .6 and .7 is .75, and the interpolated precision for recall levels .8 or greater is .27. The average increase over ordinary search is 49%; the comparison is shown in Fig. 4 below.
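A short sketch of this interpolation rule: the interpolated precision at a standard recall level r is the maximum precision observed at any recall of r or higher. The ranked relevance judgements in the example are hypothetical, not the experimental data behind Fig. 4.

# Interpolated precision at standard recall levels: for each level r,
# take the maximum precision at any recall >= r. The judgements below
# are a hypothetical run, not the paper's data.
def interpolate(relevance, total_relevant):
    points = []                      # (recall, precision) at each hit
    hits = 0
    for rank, relevant in enumerate(relevance, start=1):
        if relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    levels = [i / 10 for i in range(11)]
    return {r: max((p for rc, p in points if rc >= r), default=0.0)
            for r in levels}

ranked = [True, True, False, True, False, False, True]  # hypothetical
print(interpolate(ranked, total_relevant=5))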

Fig. 4- Comparison between Ordinary and Proposed Semantic Search

Web Services With Grid Environment
Web services are self-contained, self-describing software applications that can be published, discovered, located and invoked across the web using standard web protocols. Nowadays, basic support for web services is provided through the SOAP (Simple Object Access Protocol), WSDL (Web Service Description Language) and UDDI (Universal Description, Discovery and Integration) specifications [9], which address message format, service description, and service publishing and lookup. The company CBDI states that "when creating an agile service based business, design must be based on an intelligent appreciation not just of immediate requirement but also of dynamics of demand. This produces convergence between business strategy and design and technology strategy and design". Foster and Kesselman proposed a distributed computing infrastructure for advanced science and engineering, called the Grid, a term coined by analogy with the electric power grid. The main focus of the Grid was originally to obtain high performance; in recent years the focus has shifted away from the high-performance aspect towards a definition of the grid problem as flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions and resources, which we refer to as virtual organizations. Once an ontology has been created and can be queried, it can be extended to a "Grid Environment", in which more than one domain is merged into a single ontology able to answer queries about any of the merged domains, as shown in Fig. 5; the proposed methodology for the grid environment is shown in Fig. 6. Web services and grid computing are technologies that go hand in hand these days: the grid is a very dynamic environment, because new services may be added to it or may cease to exist.
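A minimal sketch of this merging step, again using rdflib: two domain graphs are combined into a single graph that can then be queried as one ontology. The file names are hypothetical placeholders, and RDF/XML serialization is assumed.

# Sketch of merging two domain ontologies into one queryable graph.
# File names are hypothetical placeholders; RDF/XML format is assumed.
from rdflib import Graph

merged = Graph()
for path in ["university.owl", "hotel.owl"]:
    merged += Graph().parse(path, format="xml")   # graph addition merges triples

# Any information about either merged domain can now be queried together.
q = """
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    SELECT ?cls WHERE { ?cls a owl:Class }
"""
for row in merged.query(q):
    print(row.cls)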

Fig. 5- Two Services which lead to Next Generation Web

Fig. 6- Proposed Methodology to Grid Environment

At times, queries may require combining ontologies which are very complex; creating the grid environment may then take weeks. The idea of creating a grid was to share computations and resources across the web. Grid infrastructure started to leverage existing web service standards, with some extensions, in order to build grid-based systems on top of a pure Service Oriented Architecture (SOA). Today we can share almost anything across the web using the grid: images, databases, computations, people, archives etc. The grid components require information about functionality and interfaces; this information must have an agreed meaning, which is supported by metadata. This has been the key for grid services.

Need For IE And IR In Next Generation Web
In natural language processing (NLP), Information Extraction (IE) is a type of information retrieval whose goal is to automatically extract structured information, i.e. categorized and contextually and semantically well-defined data from a certain domain, from unstructured machine-readable documents. IE should be able to get the latest information and combine it with the existing knowledge base, so that up-to-date information is available to customers. Many IE systems have been developed since the 1960s. Naomi Sager at New York University developed an IE system that extracted hospital discharge information from patient records; the output was a structured form suitable for database management. IE is a division of NLP, whose function is to analyse, understand and generate human languages, and which is in turn built on technologies like Artificial Intelligence. For information to be readily available, the interface between the customer and the services must be flexible, since users' demands are dynamic and even the types of users keep varying. An IR component will retrieve from the web all those pages which contain the topic we have searched for; an IE component will then filter this data and extract only the information that is required. That is, an IR component's result will contain a lot of unrelated information too, whereas IE will return only the related data. Hence the task of IR is to select the page and the task of IE is to select the relevant data. Fig. 7 demonstrates how the two agents work hand in hand. Here IR can be the service consumer, while IE can be the


service provider, as it responds by giving the right information for the query. The information returned by IE will be in some form of formatted data, such as XML. In this architecture the user first gives a query to the WWW. The crawlers search the web through IR agents for all links which contain this topic; these links are then given to the IE component, whose job is to filter out all the unwanted data and retain only what is needed. If these agents are efficient enough to integrate the information, they can display an abundant amount of related information, thereby making the system an intelligent one. This architecture can take advantage of many SOA characteristics, such as reusability and integration. IR extracts, from a large collection such as a grid (usually stored on computers), the text that satisfies an information need. The information on the computers may be structured data, like a Boolean expression, or unstructured data, like another document; there is far more unstructured data than structured data. In IR we have a set of users who put forward queries to retrieve a subset of the collection. For example: the document should be about the semantic web and the author should be Tim Berners-Lee. In this example the query is about a certain topic and the author field should contain the specified value. IR components such as web crawlers can be used to scour the web for the relevant information; IE is then required to convert the unstructured data into a structured format.
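The sketch below illustrates this IR-then-IE hand-off: an IR step selects pages mentioning the topic, and an IE step turns the surviving unstructured text into structured records (here, extracting an author name). The URL, pattern and record fields are illustrative assumptions, not the paper's system.

# Sketch of the IR -> IE pipeline: IR selects the pages, IE extracts
# the relevant data as structured records. Names here are illustrative.
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    source: str
    author: Optional[str]

def ir_select(pages: dict, topic: str) -> dict:
    """IR agent: keep only pages that mention the topic at all."""
    return {url: text for url, text in pages.items()
            if topic.lower() in text.lower()}

def ie_extract(pages: dict) -> list:
    """IE agent: convert unstructured text into structured records."""
    records = []
    for url, text in pages.items():
        match = re.search(r"by ([A-Z][A-Za-z\- ]+)", text)
        records.append(Record(url, match.group(1) if match else None))
    return records

pages = {"http://example.org/a":
         "An article on the semantic web by Tim Berners-Lee."}
print(ie_extract(ir_select(pages, "semantic web")))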

Fig. 7- Information Retrieval and Information Extraction agents of a Web Service System

Conclusion
The method proposed in this paper is an attempt to retrieve related documents for a domain, where the domain is captured in the form of an ontology. The experimental results show that the proposed scheme improves precision and recall substantially.

Future Directions
When the query is too generic, the expanded query may be too lengthy. In such cases, keyword search applications which limit the length of the query may not work. An appropriate interface therefore has to be designed for ontology-guided search.

References
[1] Davies N.J. (2008) BT Technol. J., 18(1), 62-63.
[2] Maxwell C. (2007) BT Technol. J., 18(1), 55-56.
[3] Lenat D.B. and Guha R.V. (2002) Building Large Knowledge-Based Systems: Representation and Inference in the Cyc Project, Addison-Wesley.
[4] Agirre E., Ansa O., Hovy E. and Martinez D. (2000) 14th European Conference on Artificial Intelligence.
[5] Steinbach M., Karypis G. and Kumar V. (2000) Workshop on Text Mining, 109-110.
[6] Hotho A., Staab S. and Stumme G. (2003) ECML/PKDD, LNAI 2838, 217-228.
[7] Mladenic D. and Grobelnik M. (2006) Web mining: from web, 77-96.
[8] Stojanovic L., Maedche A., Motik B. and Stojanovic N. (2005) European Conference on Knowledge Engineering and Management, 285-300.
[9] Davies N.J., Fensel D. and Richardson M. (2008) BT Technol. J., 22(1), 118-130.
[10] Duke A., Davies N.J. and Richardson M. (2005) BT Technol. J., 23(3), 191-201.
[11] Berners-Lee T., Hendler J. and Lassila O. (2004) Scientific American.
[12] Toma I., Roman D., Iqbal K., Fensel D. and Decker S. (2006) First International Conference, IEEE.
[13] Arroyo S. and López-Cobo J.M. (2009) Int. J. Metadata, Semantics and Ontologies, 1(1), 76-82.
[14] Conlon S.J., Hale J.G., Lukose S. and Strong J. (2010) The Journal of Computer Information Systems, 48(3), 74.
[15] Pallos M.S. (2001) EAI Journal, 32-35.
