Semantic Ontology Tools in IS Design

R.A. Meersman 1
Vrije Universiteit Brussel (VUB)
Building G/10, Pleinlaan 2, B-1050 Brussels, Belgium
[email protected]

Abstract. The availability of computerized lexicons, thesauri and "ontologies" – we discuss this terminology – makes it possible to formalize semantic aspects of information as used in the analysis, design and implementation of information systems (and in fact of general software systems) in new and useful ways. We survey a selection of relevant ongoing work, discuss different issues of semantics that arise, and characterize the resulting computerized information systems, called CLASS for Computer-Lexicon Assisted Software Systems. The need for a "global" common ontology (lexicon, thesaurus) is conjectured, and some desirable properties are proposed. We give a few examples of such CLASS systems and indicate avenues of current and future research in this area. In particular, certain problems can be identified with well-known existing lexicons such as CYC and WordNet, as well as with sophisticated representation and inference engines such as KIF or SHOE. We argue nevertheless that large public lexicons should be simple, i.e. their semantics should become implicit by agreement among "all" users, and ideally completely application independent. In short, the lexicon or thesaurus then becomes the semantic domain for all applications.

1. Introduction

The formal treatment of semantics in information systems has always been an important and difficult issue – though often implicit or even tacit – in the analysis, design and implementation of such systems. The literature abounds with attempts to represent the meaning of data and information for use in and by databases and their applications in computerized information systems. We contend that computer-based "ontologies" (we shall make this term precise in what follows, but use it in the meantime to cover lexicons and thesauri) provide a useful way of formalizing the semantics of represented information. In fact we shall argue that such ontologies can, in principle, actually be the semantic domain for an information system in a very concrete and useful manner. The caveat "in principle" is needed since current ontologies (lexicons, thesauri) are not (yet) constructed or available in forms that quite suit this purpose. Modern distributed object (DO) technology such as CORBA [OMG95], DCOM [Red97], etc. makes it ever more likely, and desirable, that objects or agents

1 This research was partially supported by the ESPRIT Project "TREVI", nr. 23311, under the European Union 4th Framework Programme.

performing interesting services "meet" interested consumers and applications without prior knowledge of each other's design and precise functionality. Mediators, wrappers and other devices have been proposed to help make this possible, but these are not at present "runtime" solutions. The sheer multitude of such objects/agents in the future dictates that the necessary agreements will have to be negotiated in real time as the "meeting" occurs. Short of a galaxy-wide standard for programming and data modeling, it is hard to imagine that these agreements can occur without the use of sophisticated and general lexicons, thesauri or ontologies. These will need to be (a) very large resources, possibly only one per language, but (b) with a rather simple, "semantics-less" structure that makes them nearly completely independent of any domain-specific application. Naturally, this necessitates that a priori and mostly implicit agreements must instead exist about the contents (individual entries) of such a vast, natural-language-covering ontology, but it is to be expected that this will still be much easier and more stable than dealing with the need for individual and local mini-agreements about the "real world" itself. Even if one fails to reach a complete match between one's terms and those in the ontology, the benefits would likely outweigh the cost of the required compromise. At present, only a few and partial resources of a related kind exist, either as research prototypes (e.g. the very popular [WordNet], see also [Fel98]) or, as is now the case for [CYC], as a proprietary development that does not (yet) constitute a wide standard. Many efforts and technologies at this moment contribute towards the feasibility of such common ontologies. For instance, the DO technology mentioned above has already led to a degree of unexpected standardization of the terminology of business processes – and in fact of the processes themselves – as materialized in e.g.
[SAP], [Baan] and related products. Another example is XML [BPS98], which may provide a widely used – if limited – vehicle for "semantic" tagging of webpages and -sites. This ability is already being usefully exploited e.g. in [SHOE]; see [HHL98]. The required resource will need to be a (very) large database residing on the Internet; its structure needs to be defined. We argue that it should be as simple as possible to allow a maximum of application independence; in fact, in a rather precise sense, for communication-based systems this intuitive requirement forms the dual concept to the very well-known concept of data independence that drives all database technology today. Needless to say, the increasing availability of very large numbers of data sources, all developed autonomously, and eventually the presence of many software agents on the internet, will make common ontological resources attractive for establishing communication on-line with little or no human interaction. Commercial applications may be imagined in the areas of data warehousing, on-line analytical processing (OLAP), electronic trading and commerce, etc. See e.g. also [L&G94] for a number of other possible uses for CYC® that apply to all such ontologies. The rest of this paper is structured as follows. We start in the next section by motivating the need for a common "global" ontology with an example. A general introductory survey of the study of semantics, notably in the context of information systems, follows in Sec. 3, followed in Sec. 4 by a quick survey of some relevant work on lexicons and ontologies. We then combine these in Sec. 5, discussing

their use, again, in information systems and methodologies in particular. We conclude in Sec. 6 with some comments about possible future research.

2. Motivating Example

A number of distributed software agents are playing around a blackboard communication facility. Some of these agents are Publisher, Categorizer, Intaker, Matcher, and User Agent. There may be several of each, and others, active at any given time. As we look, User Agent happens to be servicing a subscriber who wants to set up an on-line news website on basketball; it has selected a suitable website template according to a user profile, and posted a request on the blackboard based on this. User Agent has a domain-specific thesaurus of such templates to choose from. The subscriber wishes to title the website "Hoops Today". For the sake of the example, assume the template (or the subscriber) is somewhat simplistic, and in a fictitious syntax looks like

    website
        has       layout =
        has       title = "Hoops Today"
        has       content = on-line_news < stream
        has       subject = basketball+
        not       subject = ( basketball_player– , "sex" ~ )
        is-a      place < virtual
        part_of   internet

The intended meaning is that the desired website has on-line news content, limited to be a stream; the subject must be "basketball" or maybe something more general (the +); but it must not contain tidbits involving basketball players or more specific individuals (the –) mentioned together with their favorite pastime or a synonym for it (the ~). A website generalizes to a place that is limited to be virtual, and is in a part-of relationship to the notion "internet". The tokens "has", "is-a", "part-of" are called roles, and they determine how the term following them has to be interpreted; "has content" in our thesaurus means that content is an attribute of website. Note that the last two items would in general not be application-specific properties of a website; however, we didn't find "website" in our favorite lexicon, so we had to provide these ourselves in our domain thesaurus. (In fact "web site" surprisingly does not appear today in WordNet, a popular lexicon we use in several examples further on.) We stress again that the syntax above is entirely arbitrary and "constructed" here only for the purpose of illustrating some ideas behind this example. Clearly the interpretation of roles, and of operators like + when they are domain- or application-specific, is the internal responsibility of the User Agent. We say that they belong to its ontology. This is a general principle; agents may of course communicate with each other, but must do so only through a simple "flat" lexicon devoid of domain-specific semantics. The role "is-a", for instance, describes in general a domain-independent relationship and occurs in almost every known lexicon, thesaurus or "ontology", and should be inherited from and interpreted at that level.
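The division of responsibility just described can be made concrete in a few lines of Python. This is only an illustrative sketch: the role names and operators come from the fictitious syntax above, while the data representation and the helper function are invented here.

```python
# Each entry of the request is a (role, term, expression) triple.  The
# interpretation of roles and of operators like +, -, ~ is internal to
# the User Agent; the request itself is just flat data.
request = [
    ("has",     "title",    '"Hoops Today"'),
    ("has",     "content",  "on-line_news < stream"),
    ("has",     "subject",  "basketball+"),          # + : or more general
    ("not",     "subject",  "basketball_player-"),   # - : or more specific
    ("is-a",    "place",    "virtual"),
    ("part_of", "internet", None),
]

def attributes(req):
    """Apply one role interpretation: 'has X' means X is an attribute
    of the described object (per the domain thesaurus)."""
    return [term for role, term, _ in req if role == "has"]

print(attributes(request))  # ['title', 'content', 'subject']
```

An agent with a different ontology could walk the same triples and interpret "is-a" or "part_of" instead; the data itself carries no semantics.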

After this intermezzo, let's turn back now to our little system where the agents are waiting, and quickly run through a possible scenario for it. For the sake of further simplicity we shall ignore the "not", "is-a" and "part-of" roles and just concentrate on content type and required subject. The reader may easily imagine scenarios that involve these properties as well. A Publisher in the meantime has triggered on the posted request, knowing it can fulfill it provided it can find the right inputs. It will need to receive these as data coming from other agents; in this case it looks for NITF-formatted news text and, since it has experience with the media and news items, also for a relevance ranking with respect to the subject "basketball". Note that Publisher needs no specialized basketball knowledge; it only uses the fact that basketball is a subject. Publisher now itself posts two messages on the blackboard, the first asking for NITF-formatted on-line news, the other asking for a relevance factor of news articles in relation to the subject basketball. Two agents trigger on these requests. The Intaker agent can supply the NITF-formatted text, so the agent commits to this. This commitment is accepted by the Publisher, which removes this part of its request. The second agent is a Matcher, which can compute the relevance provided it can receive categorization data on the subject basketball. It posts a request for such categorization on the blackboard. A rampant Categorizer passes by and decides it could partially fulfill this request, given the right inputs. It cannot fulfill the exact categorization required, but after consulting a general-purpose lexicon (such as WordNet, for example) it finds it can satisfy a more general categorization on the subject of "sports". Since fulfillment is only partial, the posting agents will leave their request on the blackboard until, maybe, in the end the User Agent commits, if it decides relevance is high enough.
For instance, look at Fig. 1, which shows the (partially flattened and edited) entry for the specialization hierarchy containing sports and basketball in WordNet. If Categorizer were to generalize to (another) "court game", the high fan-out under that term might reduce the relevance noticeably. More requests will be posted until finally Publisher can make a partial commitment that is accepted by the User Agent, and now a series of parameterized steps is sent to the system for execution. Note that all the steps in the process made use of domain-specific and general "data" knowledge residing in thesauri, while the "operations" knowledge necessary for the use, application or rewriting of the rules was kept inside the agents. This is of course intentional and makes the system flexible and robust, e.g. to allow run-time reconfigurations caused by new (types of) agents appearing at the table (see Figure 1). Certain aspects of the above system, organization and scenario of operation are currently under investigation as part of a development related to the TREVI Project [TREVI] at VUB STARLab. The current implementation of TREVI (in EU Esprit Project #23311) is a personalized electronic news server system that is designed to operate on a continuous basis, supporting a high volume of subscribers. On the basis of a user profile the system provides the subscriber with a personalized electronic news service through a personalized website, an email message service, or a custom-made client application. The system is unusually modular, and a limited ability to process meta-data already allows adding new functionality (module types) to the system at runtime. User profiles are mapped onto series of processing steps,

each corresponding to a parameterized stateless operation of a module type. The system performs merging and parallelization of the series behind the scenes to achieve an optimized execution scenario. The current system prototype uses a Java/CORBA backbone architecture over a distributed set of (NT) servers.
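The post-and-commit protocol that drives the scenario can be reduced to a toy blackboard. The agent names follow the example; the matching logic and data layout below are invented purely for illustration, not taken from TREVI.

```python
# Minimal blackboard: agents post requests; other agents that can
# supply a matching service commit to them.
blackboard = []

def post(agent, wants):
    blackboard.append({"by": agent, "wants": wants, "committed": None})

def trigger(agent, offers):
    """An agent scans the board and commits to requests it can serve."""
    for req in blackboard:
        if req["committed"] is None and req["wants"] in offers:
            req["committed"] = agent

# User Agent posts the website request; Publisher triggers on it and
# posts its own two sub-requests.
post("UserAgent", "website")
post("Publisher", "NITF_news")
post("Publisher", "relevance(basketball)")

trigger("Intaker", {"NITF_news"})
trigger("Matcher", {"relevance(basketball)"})

# The original request stays open until Publisher itself can commit.
open_requests = [r["wants"] for r in blackboard if r["committed"] is None]
print(open_requests)  # ['website']
```

Note that, as in the scenario, no agent needs to know any other agent's internals: coordination happens entirely through the terms posted on the board.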

sport, athletics -- (an active diversion requiring physical exertion and competition)
  => contact sport -- (a sport that necessarily involves body contact between opposing players)
  => outdoor sport, field sport -- (a sport that is played outdoors)
  => gymnastics -- (a sport that involves exercises intended to display strength and balance and agility)
  => track and field -- (participating in athletic sports performed on a running track or on the field associated with it)
  => skiing -- (a sport in which participants must travel on skis)
  => water sport, aquatics -- (sports that involve bodies of water)
  => rowing, row -- (the act of rowing as a sport)
  => boxing, pugilism, fisticuffs -- (fighting with the fists)
  => archery -- (the sport of shooting arrows with a bow)
  => sledding -- (the sport of riding on a sled or sleigh)
  => wrestling, rassling, grappling -- (the sport of hand-to-hand struggle between unarmed contestants who try to throw…)
  => skating -- (the sport of gliding on skates)
  => racing -- (the sport of engaging in contests of speed)
  => riding, horseback riding, equitation -- (riding a horse as a sport)
  => cycling -- (the sport of traveling on a bicycle or motorcycle)
  => bloodsport -- (sport that involves killing animals (especially hunting))
  => athletic game -- (a game involving athletic activity)
    => ice hockey, hockey, hockey game -- (a game played on an ice rink by two opposing …)
    => tetherball -- (a game with two players who use rackets to strike a ball that is tethered …)
    => water polo -- (a game played in a swimming pool by two teams of swimmers …)
    => outdoor game -- (an athletic game that is played outdoors)
    => court game -- (an athletic game played on a court)
      => handball -- (a game played in a walled court or against a single wall by two …)
      => racquetball -- (a game played on a handball court with short-handled rackets)
      => fives -- ((British) a game resembling handball; played on a court with a front wall …)
      => squash, squash racquets, squash rackets -- (a game played in an enclosed court by two …)
      => volleyball, volleyball game -- (a game in which two teams hit an inflated ball over …)
      => jai alai, pelota -- (a Basque or Spanish game played in a court with a ball …)
      => badminton -- (a game played on a court with light long-handled rackets used to volley a shuttlecock over a net)
      => basketball, basketball game -- (a game played on a court by two opposing teams of 5 players; points are scored by throwing the basketball through an elevated horizontal hoop)
        => professional basketball -- (playing basketball for money)

Fig. 1. Sports hierarchy including "basketball" obtained from the [WordNet] lexicon (partly edited and collapsed for simplification)
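The relevance-dilution effect of generalizing within this hierarchy can be sketched with a toy fragment of Fig. 1. The inverse-fan-out penalty below is an invented heuristic for illustration, not TREVI's actual ranking method.

```python
# Toy fragment of the Fig. 1 hierarchy: term -> list of direct hyponyms.
hierarchy = {
    "sport":         ["contact sport", "athletic game", "racing"],
    "athletic game": ["ice hockey", "water polo", "court game"],
    "court game":    ["handball", "squash", "volleyball", "badminton",
                      "basketball"],
    "basketball":    ["professional basketball"],
}

def fanout(term):
    return len(hierarchy.get(term, []))

def relevance_after_generalizing(term):
    """Invented heuristic: a categorization on a more general term is
    diluted by the number of sibling specializations under it."""
    return 1.0 / max(fanout(term), 1)

# Generalizing from "basketball" to "court game" reduces relevance,
# because of the high fan-out under "court game":
print(relevance_after_generalizing("basketball"))  # 1.0
print(relevance_after_generalizing("court game"))  # 0.2
```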

[TREVI] is an example of a project that uses lexicon technology in several ways. We may call such systems CLASS-type, for Computer-Lexicon Assisted Software Systems. The design implied in the example above would make such technology fundamental to its operation. As should be immediately obvious, because of the heterogeneous nature of agent technology and the "openness" of possible requests, formal agreement on the meaning of the terms (and if possible processes) is indeed primordial. We believe that thesauri (domain-specific ontologies) and the availability of global lexicons will make this possible, by providing a pragmatically usable

substitute for a formal, "reductionist" definition of information system semantics. We explain terminology and these principles in the next section.

3. Semantics in Information Systems

For convenience and ease of presentation we shall assume here that an information system is merely defined in a strict model-theoretic paradigm by a pair (S, P), where S denotes a conceptual schema and P denotes a population of instances that "satisfies" S in an intuitive but well-defined sense (viz. logically, P is a model for S). Informally speaking, S typically contains "sentence patterns" in terms of types or classes, including constraints or "business domain rules", while P contains instances of these sentences – ground atomic formulas – in terms of proper nouns, numbers and other lexical elements of the universe of discourse. P is typically implemented as a database, for instance. This is further complemented by an active component, the information processor, which manipulates the instances of P such that consistency, i.e. satisfaction of S, is maintained. A customary implementation of an information processor is a DBMS, often supplemented by a set of application programs that implement user-desired functionality, consistency requirements defined in S, or both. While admittedly this is a seriously simplified view (e.g. it ignores most aspects of dynamics and events in the application domain, and is – rather incorrectly – suggestive of a centralized, non-distributed architecture), it will do to illustrate many of the basic issues encountered when dealing with the meaning of the represented and stored information. Any treatment of semantics will associate the contents of an information system with a "real world" for the purpose of creating an understanding of these contents for the system's users. It will view the entire information system, i.e.
both its conceptual schema and its population of instances, as a language (of well-formed formulas), and must deal with the following rather obvious but essential principles:
• any semantics formalism for such a language must describe a form of relationship (usually a mapping) between the symbols used in the syntactical constructs of this language and the entities (objects, events, relationships, …) in an observed "real world";
• the "real world" can only enter into the semantics formalism as a representation itself, usually in the form of a collection of signs. Such a representation is sometimes called a conceptualization [G&N87]. This in turn implies an unavoidable a priori choice of, and dependence on, alphabet, symbols, etc., from which the formalism however has to abstract;
• any semantics must be the result of an agreement among involved domain experts, designers, implementers, and present and future users of the resulting system, which in fact does nothing but implement this agreement.
It is especially this last principle that distinguishes semantics in the context of an information system from e.g. the formal treatment of semantics in the context of programming languages [vLe90][Sch86]. The latter is axiomatic and reductionist by the very nature of the interpreter – the computer – which can be assumed completely understood. Explicit agreements (except, perhaps, on the machine's axioms and rules

of inference) are therefore unnecessary. In the world of information systems, reductionist approaches to semantics do not work so well, as they would necessitate an axioms-plus-logic solution to the "commonsense world" problem in AI [e.g. H&M85]. Indeed, understanding of the operation of the information processor itself becomes largely irrelevant, certainly compared to the need to associate meaning with the input and output streams of the system. These information streams are the very purpose of that system and contain (actually, are constituted of) references to real-world entities. In programming terms, it is the data manipulated by the program, rather than the program itself, that has the prime responsibility of carrying the semantics. Such references therefore, unlike for programming languages, do require explicit agreement among the observers of that real world. Obviously, formally registering such agreement in general for any realistic domain is a major undertaking, rapidly becoming more difficult, and even impossible in the usual logical sense, as the number of parties (users, designers, experts, …) to the supposed agreement grows. Therefore the information systems community has, out of necessity, often adopted a pragmatic approach to semantics (after all, quite many of these systems do seem to work usefully), giving rise to a large variety of semi-formal or empirical semantic formalisms in the literature [Mee97].
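The (S, P) view above can be made concrete in a few lines. Here S is reduced to type declarations plus one "business rule", and the information processor simply refuses updates that would violate S. All names and the constraint are illustrative, not taken from the text.

```python
# S: a conceptual schema -- sentence patterns (types) plus a constraint.
schema = {
    "Employee": {"name": str, "age": int},
    "constraint": lambda inst: inst["age"] >= 18,  # a "business rule"
}

# P: a population of ground instances that must satisfy S.
population = []

def insert(inst):
    """The information processor: maintains satisfaction of S, i.e.
    P remains a model for S after every update."""
    typed = all(isinstance(inst[a], t)
                for a, t in schema["Employee"].items())
    if typed and schema["constraint"](inst):
        population.append(inst)
        return True
    return False

print(insert({"name": "Ada", "age": 36}))  # True
print(insert({"name": "Kid", "age": 7}))   # False -- would violate S
print(len(population))                     # 1
```

Notice that nothing in this sketch says what "Employee" or "age" *mean*; that meaning lives entirely in the agreement among the system's users, which is exactly the point of the third principle above.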

4. Lexicons, Thesauri and Ontologies

"What is an Ontology?" is the title of one of Tom Gruber's Web pages, and the short answer he provides there on the next line is: "a specification of a conceptualization" [Gru93]. This is a rather good and usable definition – although "ontology" strictly speaking as an English word is not a count noun, so in principle we can say neither "an ontology" nor form its plural… It is both simple and precise, if we take a definition of conceptualization in the sense of e.g. [G&N87] or [G&F92]. There, a conceptualization is merely an abstract, very elementary representation of a Universe of Discourse (application domain) in terms of lists of objects, mathematical relationships and functions, denoted in symbols (signs) that correspond one-to-one with "real-world" objects, facts, etc. (In fact the definition used is somewhat more complex, i.e. a conceptualization may be seen as a logical theory, but we shall not adopt that notion here.) See for instance the formal semantics of [KIF] for an illustration of the use of this concept. Clearly, even if a conceptualization as a substitute for the "real world" is described as independently as possible of a representation formalism, at the very least some – hopefully minimal – kind of modeling, structuring and symbol conventions and denotations must be adopted, i.e. agreed on. This however could be relatively easy, at least for the constants of the language. Indeed, computer lexicons, or ontologies, or thesauri (see below), i.e. indexed databases of lexical terms related in context, may form a good basis for a pre-agreed, common conceptualization of the "real world" – if they are properly structured and made sufficiently complete. In other words, when again viewing a semantics for an information system as a mapping (called "interpretation" [G&N87]) from its formal representation language to a "part of the real world", a well-defined lexicon etc.
may then be substituted for, and serve as the actual semantic domain, becoming the domain (in the mathematical

sense) of the semantics interpretation mapping, instead of that "part of the real world".

Notes on Terminology

In the literature on computational linguistics and on ontology in AI, the different uses of the terms "ontology", "lexicon", "thesaurus", "vocabulary", "dictionary" etc. give rise to some confusion and incompatibilities. According to the New Shorter Oxford dictionary, and taking the more specific alternatives of each definition, a lexicon is "a complete set of elementary meaningful units in a language"; a thesaurus is a "classified list of terms, esp. keywords, in a particular field for use in indexing or information retrieval"; and ontology (without an article) is "the study of being", "the list of what is/exists" [Bro93]. Both lexicons and thesauri may seem to be special forms of dictionaries or vocabularies. Most approaches to "ontologies" in the relevant AI literature on e.g. [CYC], [KIF], [SHOE], [ASIS], etc. would therefore seem to lead to tools that are closer to the linguistic concept of thesauri (even if these tools are not a priori domain-specific, as e.g. [GRASP] is for works of art). We will therefore, for the purpose of this paper, say that
• an ontology is a specified conceptualization as described above;
• a thesaurus is a domain-specific ontology, e.g. Business, Naïve_Physics, Contract_Law, …, or an application(s)-specific ontology, e.g. Inventory_Control, Airline_Reservations, Conference_Organization, …;
• a lexicon is a language-specific ontology, e.g. English, Esperanto, Bosnian, …
All ontologies may therefore be listed as logical theories, or even simply as "dictionaries" in a sense defined below. Admittedly, for a thesaurus the distinction between application-specific and domain-specific can be vague and must often be left to intuition, especially since ontologies may cover a set of applications.
Nevertheless this level-like distinction is a customary one in many IS modeling methodologies where applications are developed within a shared domain (see also the discussion in Sec. 4). So under the definition above a lexicon is more general – encompasses more entries, facts, … – than a domain-specific thesaurus, which is more general than an application-specific thesaurus. Swartout et al. [SPK96] further distinguish domain ontologies and theory ontologies, the latter expressing fundamental and unchanging truths about the world, such as time, space and mathematics, as logical theories, while the former are typically large sets of facts about a given (application) domain. As an example of a (domain-specific) thesaurus, we show in Fig. 2 a small fragment of the one the Reuters news agency uses to classify news items by subject. This ontology is shared by many applications and used in particular in the [TREVI] project (where Reuters plc. is a partner) to enrich news items, after subject identification, with background information based on this classification and according to user profiles. (This was the inspiration for the Motivating Example.)

TOPICS
  CORPORATE
    Accounts
    Acquisitions
    Advertising
    Annual Results
    Asset Transfers
    Bonds
    Capacity/Plant
    Capital
    Competition/Anti-Trust/Restrictive Practices
    …
    Health
    Human Interest
    International Elections
    International Relations
    Obituaries
    Religion
    Science/Technology
    Sports
    Travel/Tourism
    Weather

Fig. 2. Fragment of Reuters' "Business" Thesaurus: Corporate Topics. ©Reuters plc

Another small fragment is shown in Fig. 3 below. Such a thesaurus may actually be seen as composed of elementary entries which we shall define as lexons in the next Section. For instance, a lexon in Fig. 3 might be ⟨Business, Industries, Category, Brewing⟩, stating that within the context of Business the term Industries has a relationship with the term Brewing, which plays the role of Category in it. Another lexon could be read off Fig. 2 in the same way.
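A lexon as used here is just a 4-tuple of context, head term, role, and tail term, and a "flat" thesaurus is just a set of such tuples. The sketch below illustrates this; the tuple ordering and the query helper are one plausible convention, not fixed by the text.

```python
from collections import namedtuple

# A lexon: within a context, term1 is related to term2, and term2
# plays the given role in that relationship.
Lexon = namedtuple("Lexon", ["context", "term1", "role", "term2"])

brewing = Lexon("Business", "Industries", "Category", "Brewing")

# A tiny "flat" thesaurus is then simply a set of lexons:
thesaurus = {
    brewing,
    Lexon("Business", "Industries", "Category", "Banking"),
    Lexon("Business", "Topics", "Category", "Acquisitions"),
}

def categories_of(term, ctx, lexons):
    """All terms playing the Category role for `term` in context `ctx`."""
    return sorted(l.term2 for l in lexons
                  if l.context == ctx and l.term1 == term
                  and l.role == "Category")

print(categories_of("Industries", "Business", thesaurus))
# ['Banking', 'Brewing']
```

Because the structure carries no logic of its own, any inference over these tuples is the responsibility of the agent querying them, in line with the "flat lexicon" principle of Sec. 2.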

INDUSTRIES
  Accountancy
  Adhesives
  Advertising
  Aerospace
  Agricultural Machinery
  Agriculture
  Air Transport
  Aircraft Leasing
  Airports
  Alarms and Signalling Equipt.
  Animal Feed
  Armaments
  Baking
  Banking
  Basic Electrical Equipment
  Basic Industrial Chemicals
  Bread & Biscuit Making
  Book Publishing
  Brewing
  Building Products
  Bus & Coach Services
  …
  Retail Chemists
  Retailer - General
  Retailer - Specialist
  Road Haulage
  Rubber & Plastics
  Sales Promotion
  Shipbuilding
  Shipping
  Shipping - Support Services
  Soft Drinks
  Software (Applications)
  Software (Systems)
  Sugar
  Telecom Service
  Telecoms Equipment
  Textiles
  Theatre Venues
  Timber Products
  Tobacco
  Tourism - (Building Construction)
  …

Fig. 3. Fragment of Reuters' "Business" Thesaurus: Industries. ©Reuters plc

It will be readily obvious even from these tiny fragments that the roles are often neither obvious nor the same for all elements of a sublist, as an information analyst would expect: in fact this poses in general a major problem with many of today's thesauri and lexicons. It turns out that a rather convenient way of representing the lexons in an ontology during its design phase is by using the ORM notation (Object-Role Modeling [Hal95], earlier known as NIAM [VvB82]). The NIAM/ORM method has its roots in using natural language agreement as the basis for information analysis, uses mainly binary relationships to represent knowledge, and has always had roles as first-class citizens. Commercial design tools are available for it [VISIO], but extensions have to be added to handle large classifications at the "type" level. Another such useful graphical notation, especially for representing theory knowledge in ontologies, is Sowa's Conceptual Structures [Sow84], further elaborated and applied to ontology in [Sow99]. In the case of ORM a small fragment of the above thesauri looks like Fig. 4, with a self-understood graphical notation. Each binary relationship corresponds to a lexon in the ontology.

[Fig. 4 is a diagram: ORM binary relationships among the terms Business, Industries, Topics, Politics, Corporate, Economics, Litigation, Markets and Acquisitions, connected through role pairs such as about_subject/subject_of, with_sub/sub_of and with_attr/attr_of.]

Fig. 4. Example of an ORM binary relationship diagram

Most ontology work in the AI literature so far tends to equip ontologies with fairly sophisticated logical machinery. An elegant example is [KIF], developed at Stanford by a team around Mike Genesereth, which allowed implementation as an open ontology in the KSL system (a.k.a. Ontolingua). An example of a KIF theory is shown in Fig. 5 (extracted from www-ksl.stanford.edu/knowledge-sharing/ontologies/). The underlined terms denote hyperlinks to other theories or classes. KIF implements an essentially reductionist philosophy but boasts a well-defined formal declarative semantics [G&F92].

The openness of KIF results from a well-defined manner in which (distributed) users may add their own application or "general" domain ontologies to the system and so expand it. By leaving out this logic machinery we reduce the ontology to a "less intelligent" kind of dictionary (such as in the thesaurus examples in Figs. 2 and 3) and leave the responsibility for handling semantics with the interpreter, viz. the agents that use the ontology. We shall see in the next Section that there are a number of advantages to this "dumber is smarter" principle. CYC® is an example of a (large) instance ontology that operates in this manner, although it is also equipped with a separate, limited inference engine, and a manipulation language (CycL) which allows fairly general first-order formulas to be stated over CYC®'s constants. CYC® is organized in so-called microtheories, which group related and interacting constants in what is the equivalent of contexts. Fig. 6 shows the entry for "Skin" and its parent "AnimalBodyPart", both to be interpreted as elements of a microtheory (context) on physiology (copied from the public part of CYC®, www.cyc.com/cyc-2-1/intropublic.html).

Class BINARY-RELATION
  Defined in theory: Kif-relations
  Source code: frame-ontology.lisp
  Slots on this class:
    Documentation: A binary relation maps instances of a class to
      instances of another class. Its arity is 2. Binary relations
      are often shown as slots in frame systems.
    Subclass-Of: Relation
  Slots on instances of this class:
    Arity: 2
  Axioms:
    (<=> (Binary-Relation ?Relation)
         (And (Relation ?Relation)
              (Not (Empty ?Relation))
              (Forall (?Tuple)
                      (=> (Member ?Tuple ?Relation)
                          (Double ?Tuple)))))

Fig. 5. A KIF built-in theory for Binary Relations (a subtype of Relations)
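Read set-theoretically, the axiom of Fig. 5 just says that a binary relation is a non-empty relation all of whose member tuples are pairs ("Doubles"). That reading can be checked mechanically; the sketch below is our own paraphrase, not KIF's formal semantics.

```python
def is_binary_relation(rel):
    """Fig. 5's axiom, read over a set of tuples: the relation is
    non-empty, and every member tuple has arity 2 (is a 'Double')."""
    return len(rel) > 0 and all(len(t) == 2 for t in rel)

print(is_binary_relation({("a", 1), ("b", 2)}))   # True
print(is_binary_relation(set()))                  # False (Empty)
print(is_binary_relation({("a", 1, "extra")}))    # False (not a Double)
```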

CYC® unquestionably is a massive resource which has only begun to be mined, and it may even be expected that precisely the simplicity and relative "flatness" of its underlying vocabularies, rather than its inferencing capabilities, will make it attractive as such. In "flat" ontologies (dictionaries), i.e. ones devoid of application semantics, the roles (expressing relations, functions) in the lexons become the carriers of most of the application semantics: whenever the using agent "applies" a lexon, it executes the

interpretation associated with the role internally (as part of the agent, not as part of any inferencing inside the ontology). In the #$Skin example above, the lexon roles are "isa" and "genls" and are of course interpreted by an internal CYC® "reasoner" which "owns" their semantics. See also for instance again the WordNet example in Sec. 2. for another illustration of this concept.
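The division of labour just described — a "dumb" flat lexon store, with role semantics implemented entirely inside the using agent — might be sketched as follows (an illustration of the principle only, not of CYC®'s actual interfaces; all names are ours):

```python
# Ontology side: a flat store of lexons (term, role, term) with no
# inference machinery of its own.
LEXONS = {
    ("#$Skin", "isa", "#$AnimalBodyPartType"),
    ("#$Skin", "genls", "#$AnimalBodyPart"),
    ("#$AnimalBodyPart", "genls", "#$OrganismPart"),
}

# Agent side: the agent, not the ontology, "owns" the semantics of the
# role "genls" and chooses to interpret it as transitive generalization.
def generalizations(term, lexons=LEXONS):
    result, frontier = set(), {term}
    while frontier:
        t = frontier.pop()
        for (t1, role, t2) in lexons:
            if t1 == t and role == "genls" and t2 not in result:
                result.add(t2)
                frontier.add(t2)
    return result

assert generalizations("#$Skin") == {"#$AnimalBodyPart", "#$OrganismPart"}
```

A different agent could interpret the very same lexons non-transitively, or ignore "genls" altogether — which is exactly the point of keeping the ontology itself semantics-free.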

#$Skin
A (piece of) skin serves as outer protective and tactile sensory covering for (part of) an animal's body. This is the collection of all pieces of skin. Some examples include TheGoldenFleece (an entire skin) and YulBrynnersScalp (a small portion of his skin).
isa: #$AnimalBodyPartType
genls: #$AnimalBodyPart #$SheetOfSomeStuff #$VibrationThroughAMediumSensor #$TactileSensor #$BiologicalLivingObject #$SolidTangibleThing
some subsets: (4 unpublished subsets)

#$AnimalBodyPart
The collection of all the anatomical parts and physical regions of all living animals; a subset of #$OrganismPart. Each element of #$AnimalBodyPart is a piece of some live animal and thus is itself an instance of #$BiologicalLivingObject. #$AnimalBodyPart includes both highly localized organs (e.g., hearts) and physical systems composed of parts distributed throughout an animal's body (such as its circulatory system and nervous system). Note: severed limbs and other parts of dead animals are NOT included in this collection; see #$DeadFn.
isa: #$ExistingObjectType
genls: #$OrganismPart #$OrganicStuff #$AnimalBLO #$AnimalBodyRegion
some subsets: #$Ear #$ReproductiveSystem #$Joint-AnimalBodyPart #$Organ #$MuscularSystem #$Nose #$SkeletalSystem #$Eye #$RespiratorySystem #$Appendage-AnimalBodyPart #$Torso #$Mouth #$Skin #$DigestiveSystem #$Head-AnimalBodyPart (plus 16 more public subsets, 1533 unpublished subsets)

Fig. 6. CYC® constants: #$Skin generalizes to #$AnimalBodyPart. ©Cycorp, Inc.

5. Linking Semantics, Ontologies, and Information Systems Methodologies

As indicated earlier, there are good arguments for keeping ontologies (especially general, global ones such as lexicons) as simple as possible. "Universal truths" could be incorporated in general knowledge bases of this kind, but it is unlikely that they will be used explicitly very often in inferring application-specific knowledge, any more than quantum mechanics is needed explicitly to describe material business processes.

In relational database theory the distinction between the model-theoretic and the proof-theoretic approaches has been well understood since a seminal paper by Reiter [Rei84]. Although Reiter makes a strong and convincing case for the superiority of the proof-theoretic paradigm because of its greater "semantic richness", it is noteworthy that databases today still follow the model-theoretic paradigm. In fact, proof-theoretic (Datalog, or deductive) databases, while very elegant and formally satisfying [AHV95], flared only briefly in the commercial marketplace. The explanation is rather simple, and methodological in nature. In the proof-theoretic scheme of things, rules, constraints, and other ways of representing "more world knowledge" tend to get pushed into the database schema; reasoning with them requires extending or converting the DBMS with inference capability. However, getting agreement among users and designers about so-called "universal unchanging truths" – if such facts can be identified at all in a pragmatic business environment – and locking them into the database schema is far more difficult, and rarer, than obtaining agreement about (static) data structures. In fact, "exceptions often become the rule", defeating the purpose of the new capabilities of the DBMS, which are then soon abandoned. In the model-theoretic paradigm the issue is in practice tackled, though in general not solved, using layers (or levels) of knowledge representation, rather analogously to the levels of ontology defined above in Sec. 4, viz. general > domain > application. In [ISO90] an early case (the "onion model") was made for layering real-world knowledge in this manner from the abstract to the specific, even within a domain. While no implementation was given there and then, the resulting architecture relates nicely to some current ontology principles.
In particular it leads to a requirement for ontologies to be extendable with domain- and application-specific thesauri, as is possible for instance in KIF. However, most application-specific knowledge then ends up in insulated mini-extensions or hard-coded in the application programs. One partial remedy is to let the information system's conceptual schema allow for a layer of application- or even domain-specific constraints, from which filters may be generated that screen incoming data for conformance to these constraints. Reasoning about these constraints is however rare in information systems; at best some CASE tools performed a limited constraint consistency check, see e.g. the RIDL* analyzer for NIAM in [DMV88] or a similar module for ORM in InfoModeler [VISIO]. Usable and teachable methodologies for constructing and maintaining consistent, simple and practical lexicons and thesauri are very much lacking at the moment, as are methodologies for designing and developing CLASS-type systems. This is obviously a difficult problem, but "global" lexicons must carry authority if they are to be usable as semantic domains. All existing lexicons, especially the larger domain-independent ones, suffer from important deficiencies, such as the frequently inconsistent semantics of the (in our terminology) implicit classification "roles". Most modular ontology construction techniques, such as those proposed by [KIF], [SHOE] and its derivative [ASIS], organize knowledge in fairly small "chunks" to be added to a more fundamental ontology. It is rather striking, though perhaps not so surprising, that this requires a kind of modeling not unlike the building of a "pattern data model" [Fow97] for the domain or application,

and produces similar results. Even the constraints (i.e. mostly business rules) find a place there. In the case of a thesaurus, constraints are indeed likely to stay domain- or application-specific – as pointed out earlier, ample methodological experience with the development of classical information systems – pre-CLASS, so to speak – shows it is usually very hard to get users to agree on business rules, even on relatively mundane things like the unique identification of business objects. Whichever the ontology construction method, it will in one way or another need to respect the (coarse) layering into fundamentals, domain and application, and possibly finer layers as well within applications, and eventually within domains. This coarse knowledge layering is symbolically depicted in Fig. 7. Note that the application level (not the domain level) also needs to take care of "local" encodings such as abbreviations used in a particular company, etc.

[Fig. 7 is a diagram distinguishing three knowledge levels, pairing language knowledge with example knowledge at each level:

Language level:
Π(collection) = {"several things grouped together"1, "a publication containing a variety of works"2, "request for a sum of money"3, "the act of gathering something together"4}
≅1(collection) = {"aggregation", "accumulation", "assemblage"}
<1(collection) = {"bottle collection", "art collection", "battery", "library", "universe", "stamp collection", ...}

Domain level:
Π(museum) = {"a depository for collecting and displaying objects having scientific or historical or artistic value"1}
with example lexons relating the terms museum, art collection, painting and sculpture through the roles "contains" and "displays".

Application level:
Π(Pid) = {"painting identification"1}]

Fig. 7. Knowledge levels in an ontology

In general, and among other issues, the question will arise how to decide in which level to include a certain information occurrence. As a methodological step one could suggest simple abstraction starting from the lowest level: if a lexon is interpreted by several different applications, it becomes a candidate for generalization into the domain level, i.e. into a thesaurus for that domain. The language level is a nearly static one and must be kept "completely" domain-independent if possible. One of the drawbacks of, for instance, WordNet is its unclear separation between the language level and the domain level: some entries contain relations that are domain-dependent, whereas this domain information is not included for others.
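The abstraction step just suggested — promote a lexon to the domain level once more than one application interprets it — can be sketched as a simple query over recorded lexon usage (all names and the example data are hypothetical, for illustration only):

```python
def promotion_candidates(usage):
    """usage maps each lexon (term1, role, term2) to the set of
    applications that interpret it. A lexon interpreted by more than one
    application becomes a candidate for the domain-level thesaurus."""
    return {lexon for lexon, apps in usage.items() if len(apps) > 1}

usage = {
    ("museum", "displays", "painting"): {"catalogue-app", "ticketing-app"},
    ("painting", "has", "Pid"): {"catalogue-app"},   # stays application-level
}
assert promotion_candidates(usage) == {("museum", "displays", "painting")}
```

In practice the promotion would of course remain a candidate list for human review, since two applications may attach different interpretations to the same role.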

6. Conclusion and Avenues of Research

Lexicons, thesauri and other ontologies will need to become available to make semantic communication between distributed systems practically possible. The process of reaching the agreements needed for establishing meaningful cooperation will at the very least be strongly simplified by the presence of such "universally" accepted resources with nearly implicit semantics. Ideally, domain experts, users and designers will be able to reach dependable (even if partial) agreement on concepts used or needed, based on their listed use in context. In the spin-off research activity from TREVI, loosely described in the example of Sec. 2, we hope to obtain useful feedback on some of these ideas from the CLASS prototype to be constructed. It will provide information on the feasibility of structuring application and domain ontologies (possibly as lexon databases) to be plugged into a very simple, very easy to use, but very large and "semantics-less" future lexicon architecture (also implemented as a lexon database, using only "universal" roles). The use of databases allows large collections of lexons to be easily indexed, extracted and grouped around an application concept, mapped to other languages, etc. At this moment lexons are quite elementary, denoted γ(t₁ r t₂) where t₁ and t₂ are terms, r is a role in a (binary) relationship, and γ is a label (term) for a context. Contexts may however be terms too, and so be connected through their own relationships. This simplicity will of course eventually have to be paid for by more complex domain and application thesauri, and especially by their more complex interpreters. The enormous success of a basic public lexicon like [WordNet] clearly shows the need for such "experimentation material" for many new kinds of projects, which in turn may lead to the availability of truly dependable, standardized and "authoritative" ontologies in a reasonable timeframe.
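The elementary lexon form γ(t₁ r t₂), with contexts themselves usable as terms, maps naturally onto a single database table; the following sketch shows one way the indexing and grouping described above could work (the schema, table name and example lexons are our own illustrations, not the TREVI prototype):

```python
import sqlite3

# One lexon γ(t1 r t2) per row: context label γ, terms t1 and t2, role r.
# Because contexts are terms too, a context may appear in the t1/t2
# columns of other lexons and so be connected through its own relationships.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE lexon (context TEXT, t1 TEXT, role TEXT, t2 TEXT)")
db.executemany("INSERT INTO lexon VALUES (?,?,?,?)", [
    ("museums", "museum", "displays", "painting"),
    ("museums", "art collection", "contains", "painting"),
    ("domains", "museums", "specializes", "culture"),  # a context as a term
])

# Lexons are easily indexed and grouped around an application concept:
rows = db.execute(
    "SELECT t1, role, t2 FROM lexon WHERE context = ?", ("museums",)
).fetchall()
assert ("museum", "displays", "painting") in rows
```

An ordinary index on (context, t1) would then make context-scoped lookups cheap even for very large lexon collections.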

Acknowledgements
The author gratefully acknowledges discussions about TREVI and related subjects with Peter Stuer of VUB STARLab, who provided the idea at the basis of the example in Sec. 2, and with Dirk Deridder, who contributed the WordNet examples and insights as well as the diagram in Fig. 7.

Literature References

[AHV95] Abiteboul, S., Hull, R. and Vianu, V.: Foundations of Databases. Addison-Wesley, Reading MA (1995).
[Baan] van Es, R. (ed.): Dynamic Enterprise Innovation: Establishing Continuous Improvement in Business. Baan Business Innovation (January 1998).
[BPS98] Bray, T., Paoli, J. and Sperberg-McQueen, C.M.: Extensible Markup Language (XML). World Wide Web Consortium (W3C). Available at www.w3.org/TR/1998/rec-xml-19980210.html (1998).
[Bri98] Brinkkemper, S.: Global Process Management. In: [Baan] (1998).
[Bro93] Brown, L. (ed.): The New Shorter Oxford Dictionary of the English Language. Clarendon Press, Oxford (1993).
[DMV88] De Troyer, O., Meersman, R.A. and Verlinden, P.: RIDL* on the CRIS Case, a Workbench for NIAM. In: Computer Assistance during the Information Systems Life Cycle, T.W. Olle, A. Verrijn-Stuart and L. Bhabuta (eds.), North-Holland, Amsterdam (1988).
[Fel98] Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. MIT Press (1998).
[Fow97] Fowler, M.: Analysis Patterns: Reusable Object Models. Addison-Wesley, Reading MA (1997).
[G&F92] Genesereth, M.R. and Fikes, R.E.: Knowledge Interchange Format Reference Manual. Stanford Computer Science Department Report (1992).
[G&N87] Genesereth, M.R. and Nilsson, N.J.: Logical Foundations of Artificial Intelligence. Morgan Kaufmann Publishers, Palo Alto CA (1987).
[Gru93] Gruber, T.R.: A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, Vol. 5(2), 199-220 (1993).
[Hal95] Halpin, T.: Conceptual Schema and Relational Database Design. Prentice-Hall (1995).
[HHL98] Heflin, J., Hendler, J. and Luke, S.: Reading Between the Lines: Using SHOE to Discover Implicit Knowledge from the Web. In: Proceedings of the AAAI-98 Workshop on AI and Information Integration (1998); www.cs.umd.edu/projects/plus/SHOE/shoe-aaai98.ps (accessed Feb. 1999).
[H&M85] Hobbs, J.R. and Moore, R.C. (eds.): Formal Theories of the Commonsense World. Ablex, Norwood NJ (1985).
[Hun88] Hunnings, G.: The World and Language in Wittgenstein's Philosophy. SUNY Press, New York (1988).
[ISO90] Anon.: Concepts and Terminology of the Conceptual Schema and the Information Base. ISO Technical Report TR9007, ISO, Geneva (1990).
[K&T98] Keller, G. and Teufel, T.: SAP R/3 Process Oriented Implementation. Addison-Wesley, Reading MA (1998).
[L&G94] Lenat, D.B. and Guha, R.V.: Ideas for Applying CYC. Unpublished; www.cyc.com/tech-reports/ (accessed Feb. 1999) (1994).
[Mee94] Meersman, R.: Some Methodology and Representation Problems for the Semantics of Prosaic Application Domains. In: Methodologies for Intelligent Systems, Z. Ras and M. Zemankova (eds.), Springer-Verlag, Berlin (1994).
[Mee97] Meersman, R.: An Essay on the Role and Evolution of Data(base) Semantics. In: Database Application Semantics, R. Meersman and L. Mark (eds.), Chapman & Hall, London (1997).
[OMG95] CORBA: Architecture and Specification v.2.0. OMG Publication (1995).
[Red97] Redmond, F.E. III: DCOM: Microsoft Distributed Component Object Model. IDG Books, Foster City CA (1997).
[Rei84] Reiter, R.: Towards a Logical Reconstruction of Relational Database Theory. In: On Conceptual Modeling: Perspectives from AI, Databases, and Programming Languages, M.L. Brodie, J. Mylopoulos and J.W. Schmidt (eds.), Springer-Verlag, New York (1984).
[Sch86] Schmidt, D.A.: Denotational Semantics: A Methodology for Language Development. Allyn & Bacon (1986).
[Sow99] Sowa, J.F.: Knowledge Representation: Logical, Philosophical and Computational Foundations. PWS Publishing, Boston MA (1999) [in preparation].
[SPK96] Swartout, W., Patil, R., Knight, K. and Russ, T.: Towards Distributed Use of Large-Scale Ontologies. In: Proceedings of the 10th Knowledge Acquisition for Knowledge-Based Systems Workshop (KAW'96) (1996).
[VISIO] InfoModeler, now part of Visio Enterprise Modeler, Visio Corp.
[VvB82] Verheyen, G. and van Bekkum, P.: NIAM, aN Information Analysis Method. In: IFIP Conference on Comparative Review of Information Systems Methodologies, T.W. Olle, H. Sol and A. Verrijn-Stuart (eds.), North-Holland (1982).
[vLe90] van Leeuwen, J. (ed.): Formal Models and Semantics. Handbook of Theoretical Computer Science, Vol. B. Elsevier, Amsterdam (1990).

Internet References to Cited Projects

[ASIS] The ASIS Project, http://wave.eecs.wsu.edu/WAVE/Ontologies/ASIS/ASISontology.html
[TREVI] The EU 4th Framework Esprit project TREVI, http://trevi.vub.ac.be/
[CYC] The CYC® Project and products, www.cyc.com/
[GRASP] The GRASP Project, www.arttic.com/GRASP/public/News03/News03/GRASP_Ontology.html
[KIF] The Knowledge Interchange Format (also known as KSL, Ontolingua), http://ontolingua.stanford.edu
[SHOE] The Simple HTML Ontology Extensions Project, www.cs.umd.edu/SHOE/
[WordNet] The WordNet Project, www.cogsci.princeton.edu/~wn/