
Component-Based Support for Building Knowledge-Acquisition Systems

Mark A. Musen, Ray W. Fergerson, William E. Grosso, Natalya F. Noy, Monica Crubézy, and John H. Gennari
Stanford Medical Informatics
Stanford University
Stanford, California 94305-5479 U.S.A.
[email protected]
http://www.smi.stanford.edu/projects/protege

During the past decade, there has been increasing consensus within the knowledge-based–systems community on appropriate conceptual components for building intelligent computer programs. Intelligent systems are now generally construed in terms of both domain ontologies and abstract problem-solving methods that operate on knowledge bases defined in terms of those ontologies. There has been less consensus, however, regarding how to optimize the operational components and the user interfaces of tools that assist developers in the construction of knowledge-based systems. For the most part, such lack of consensus is to be expected, given that domain considerations often dominate how knowledge can best be entered, browsed, and updated in any computer-based tool.

In our research group at Stanford University, we acknowledge both the central importance and the great variability of domain-specific idioms that can enhance the functionality of knowledge-acquisition tools. Our work seeks specific ways to harness this variability and to allow developers to take advantage of alternative approaches. Our latest development in a series of computer-based knowledge-acquisition systems is known as Protégé-2000. Protégé-2000 is a “meta-tool” that helps users to construct domain-specific knowledge-acquisition systems that application experts can use to enter and browse the content knowledge of electronic knowledge bases. The architecture of Protégé-2000 assumes that, as with Web browsers, users will want to enhance and custom-tailor the system’s behavior by means of a variety of “plug-ins.” These plug-ins are modular pieces of program code that add new functionalities to Protégé-2000 in well-circumscribed ways. Developers can contribute new Protégé-2000 plug-ins to a library maintained on the Internet, and can freely download new plug-ins to augment the behavior of their own knowledge-acquisition systems constructed using Protégé-2000.

Our approach establishes a new kind of knowledge-acquisition enterprise: one of building knowledge-acquisition–tool components that can be shared among a large community of users over the Internet. Protégé-2000’s modular architecture greatly expands the kinds of systems that can be assembled to address specific knowledge-acquisition tasks, and offers the possibility that future knowledge-acquisition systems can be better tailored to the particular requirements of end users.

1. The Protégé Lineage

Since the 1980s, our research group has been working on both a series of knowledge-acquisition workbenches and an associated methodology for building intelligent computer-based systems. Each of these workbenches has been called “Protégé,” although the capabilities of each system have increased from one generation to the next [1]. Each new incarnation of the Protégé approach has explored the consequences of relaxing additional knowledge-acquisition constraints, while attempting to make more declarative and more explicit the components from which developers can build knowledge-based systems.

The first incarnation of the system, known as PROTÉGÉ [2], assumed a fixed problem-solving method [3] (namely, episodic skeletal-plan refinement [4]). The PROTÉGÉ system allowed developers to instantiate an ontology that defined the abstract data on which the episodic skeletal-plan refinement operated (i.e., the method ontology [5]) with corresponding domain concepts. PROTÉGÉ then used the instantiated method ontology to generate a domain-specific knowledge-acquisition system with which application experts could then enter the specific content knowledge on which the episodic skeletal-plan refinement method would operate to solve particular application tasks. In particular, when a user instantiated the method ontology with concepts from the domain of clinical trials in a particular medical specialty, PROTÉGÉ could construct an interactive knowledge-acquisition system tailored to the problem of acquiring specifications for individual clinical trials in that area of medicine.

Next, our laboratory built PROTÉGÉ-II [6], which relaxed the assumption that the system could deal only with a single problem-solving method. PROTÉGÉ-II allowed developers to create and edit a separate domain ontology, and to select from a library an appropriate piece of software that would automate a well-defined problem-solving method [3]. The developer encoded the relationships between concepts in the domain ontology and the data on which the problem-solving method operated by creating declarative mappings between the domain ontology and the method ontology for the selected problem-solving method [5]. Like the original PROTÉGÉ system, PROTÉGÉ-II used the general domain concepts described by the developer to generate an application-specific knowledge-acquisition tool for entry of the detailed content knowledge.

PROTÉGÉ-II was developed for use under the NeXTSTEP operating system, which ultimately limited the size of its user community. A subsequent version of our knowledge-acquisition workbench, known as Protégé/Win, was created for use under Microsoft’s 32-bit Windows operating systems. Protégé/Win not only served as an excellent development environment for a wide range of medical knowledge-based systems created at Stanford [7], but also was adopted by dozens of knowledge-engineering groups around the world.
Protégé/Win was also the first version to separate from the domain ontology the presentation information required to generate the user interface for the associated knowledge-acquisition tool; this separation was essential to allow for the flexibility in constructing knowledge-acquisition interfaces that is made possible in the current version of our system.

The most recent entry in the series of Protégé systems is known as Protégé-2000. Protégé-2000 is written entirely in Java and therefore runs on a wide range of platforms. Like its predecessors, Protégé-2000 is designed to allow developers to construct intelligent systems from libraries of reusable domain ontologies and problem-solving methods [8]. The major advance in Protégé-2000 is that the system is constructed in an open, modular fashion. Not only is Protégé-2000 used to construct component-based knowledge-based systems, but also Protégé-2000 itself is a component-based system. System builders can add new functionality to Protégé-2000 by creating appropriate “plug-ins.” These plug-ins run the gamut from simple user-interface enhancements to whole application systems that can integrate tightly with the underlying Protégé-2000 framework. In the next section, we review the classes of plug-ins that are supported by our current knowledge-acquisition workbench.

2. Protégé-2000 Plug-ins

The Protégé-2000 system has been designed from the beginning as an open foundation upon which developers can build tailored knowledge-acquisition functionality. Our vision has been for a knowledge-acquisition framework that would support the work of widely distributed builders of intelligent systems, each of whom could add new features to Protégé-2000 as the need arose. Our long-term goal is to support virtual communities of developers who will contribute additional capabilities to Protégé-2000 as new requirements are determined, allowing other users of our system to take advantage of their disparate enhancements.

The Protégé-2000 architecture currently supports a wide range of plug-ins. There are several classes of components that developers can add to the system to expand its capabilities. User-interface widgets handle display and input of data of particular types in domain- or task-specific ways. Alternative back ends for archival storage enable users to store knowledge bases in the formats that fit best with their environment. Utility programs for knowledge-acquisition tasks provide support for more elaborate knowledge-acquisition approaches such as accessing and importing knowledge from on-line resources and building new knowledge bases by integrating existing ones.
Entire end-user applications that operate on Protégé knowledge bases can be plugged into the system as special “tabs.”

2.1 Domain-specific user-interface widgets

Frequently, application domains adopt particular visual metaphors that are helpful in entering and browsing domain information. Protégé-2000 makes it quite simple to create small Java components that handle specific knowledge-acquisition input–output tasks and to associate those programs with particular elements of generated knowledge-acquisition tools. For example, in a domain where the user must select the name of a particular geographic region from a pre-enumerated list, the default Protégé-2000 behavior would be to create a drop-down list containing the names of the relevant regions. It may be preferable, however, to present the end user with an image depicting a map, and to allow the user to click on the appropriate map location. By creating a map widget, the developer can override the system’s default presentation, and can provide the knowledge-acquisition tool with a more natural visual metaphor for entry of this particular element of the knowledge base. Analogously, developers can create extremely intuitive user-interface widgets for any individual or group of data types supported by Protégé-2000. In formal usability experiments, our group has demonstrated significant advantages when knowledge-acquisition tools provide such custom-tailored interface components [9].

2.2 Alternative “back ends” for archival storage

Early versions of Protégé would store all ontologies and knowledge bases as flat files encoded in the CLIPS knowledge-representation language. Protégé-2000 still supports the use of ASCII files adopting CLIPS syntax for archival storage, but also allows developers to add new “back end” support for alternative storage formats. With this kind of plug-in capability, we have created several alternative facilities for archival storage. For example, Protégé-2000 can read and write ontologies and knowledge bases to any server that supports the Open Knowledge Base Connectivity (OKBC) protocol [10]. Thus, via OKBC, Protégé-2000 can become a user-friendly, graphical knowledge-acquisition system for popular servers such as Ontolingua or LOOM. Protégé-2000, via another plug-in, can read and write ontologies and knowledge bases to most relational database systems by using the JDBC protocol for sending SQL statements. The ability to access a relational database for archival storage provides considerable scalability.
For example, we currently are using Protégé-2000 to develop an ontology of human anatomy that contains close to 50,000 concepts.

In the past year, we have enhanced Protégé-2000 to read and write ASCII files encoded using the Resource Description Framework (RDF) of the World Wide Web Consortium [11]. We expect that RDF Schema will have an increasing importance in representing ontologies in the years ahead, and that RDF will be used pervasively to encode machine-interpretable annotations to Web pages. We therefore anticipate an expanding role for Protégé-2000 in encoding the knowledge that will allow software agents to access information on the Internet in a wide range of e-commerce and information-processing applications.

Because plug-ins for the different archival storage formats are present in the system simultaneously, Protégé-2000 provides a convenient translation facility that is transparent to the user. Files can be read in using RDF and then written out to an OKBC server or to a relational database; files archived in CLIPS can be written out in XML or RDF, and so on.

Protégé-2000 itself has an application program interface that allows other programs to call Protégé-2000 as a knowledge server in its own right.
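The translation facility described above follows from every back end reading into, and writing from, one shared in-memory model. The Java sketch below illustrates that idea only. The `Frame` class, the `StorageBackEnd` interface, and both toy text formats are hypothetical stand-ins invented for this example; they are not the Protégé-2000 API, and they are not the real CLIPS, RDF, or OKBC encodings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory frame: a named instance with slot values. */
final class Frame {
    final String name;
    final Map<String, String> slots = new LinkedHashMap<>();
    Frame(String name) { this.name = name; }
}

/** Hypothetical back-end contract (an assumption, not the Protégé-2000 API). */
interface StorageBackEnd {
    List<Frame> read(String text);
    String write(List<Frame> frames);
}

/** Toy flat-file format: one frame per line, "name;slot=value;slot=value". */
final class LineBackEnd implements StorageBackEnd {
    public List<Frame> read(String text) {
        List<Frame> frames = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (line.isEmpty()) continue;
            String[] parts = line.split(";");
            Frame frame = new Frame(parts[0]);
            for (int i = 1; i < parts.length; i++) {
                String[] kv = parts[i].split("=", 2);
                frame.slots.put(kv[0], kv[1]);
            }
            frames.add(frame);
        }
        return frames;
    }

    public String write(List<Frame> frames) {
        StringBuilder out = new StringBuilder();
        for (Frame f : frames) {
            out.append(f.name);
            for (Map.Entry<String, String> e : f.slots.entrySet()) {
                out.append(';').append(e.getKey()).append('=').append(e.getValue());
            }
            out.append('\n');
        }
        return out.toString();
    }
}

/** Toy triple format: one "name slot value" statement per line. */
final class TripleBackEnd implements StorageBackEnd {
    public List<Frame> read(String text) {
        Map<String, Frame> byName = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            if (line.isEmpty()) continue;
            String[] t = line.split(" ", 3);
            // Group statements about the same subject into one frame.
            byName.computeIfAbsent(t[0], Frame::new).slots.put(t[1], t[2]);
        }
        return new ArrayList<>(byName.values());
    }

    public String write(List<Frame> frames) {
        StringBuilder out = new StringBuilder();
        for (Frame f : frames) {
            for (Map.Entry<String, String> e : f.slots.entrySet()) {
                out.append(f.name).append(' ').append(e.getKey())
                   .append(' ').append(e.getValue()).append('\n');
            }
        }
        return out.toString();
    }
}

final class BackEndDemo {
    /** Translation is just "read with one back end, write with another". */
    static String translate(String text, StorageBackEnd from, StorageBackEnd to) {
        return to.write(from.read(text));
    }

    public static void main(String[] args) {
        String flat = "Heart;partOf=Thorax;kind=Organ\n";
        // Prints "Heart partOf Thorax" and "Heart kind Organ" on separate lines.
        System.out.print(translate(flat, new LineBackEnd(), new TripleBackEnd()));
    }
}
```

With this arrangement, adding a new format needs one new class, not one converter per pair of formats. The toy triple format drops frames that have no slots, a limitation a real back end would have to handle.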

2.3 Utility programs for knowledge acquisition

The standard Protégé-2000 system divides its user interface into a number of “tabs” that support functions such as editing ontologies, refining the layout of the ontology-specific knowledge-acquisition tool, and creating knowledge bases by instantiating a domain ontology. Developers can create plug-ins that present additional “tabs” to the user, providing support for additional knowledge-acquisition tasks. One such tab allows users to access online information resources such as the WordNet dictionary at Princeton University and the Unified Medical Language System (UMLS) created by the United States National Library of Medicine [12]. When developers are creating and modifying ontologies, it is particularly useful not only to browse these online resources within Protégé-2000, but also to import portions of these resources directly into the evolving ontology. When creating medical ontologies, for example, it can be extremely helpful to verify the existence of a concept within UMLS, or to import entire sub-trees of the UMLS directly into a nascent Protégé-2000 ontology.

Other Protégé-2000 tabs have quite advanced functionality. For instance, an entire subsystem known as PROMPT [13] is a plug-in that helps developers to merge and align different domain ontologies. PROMPT can analyze related domain ontologies that may arise from different sources, and can make suggestions to the developer regarding how concepts in the two ontologies might be similar or even equivalent. If the user wishes to generate a single consensus ontology from the different inputs, PROMPT makes specific recommendations to the developer regarding how the concepts in the input ontologies might be effectively combined or related to one another.

Another exciting development involves tabs that extend the capabilities of the Protégé-2000 knowledge-representation system. We currently are developing additional plug-ins that allow us to augment the underlying Protégé-2000 OKBC-compatible knowledge-representation system with logical axioms written in the Knowledge Interchange Format (KIF). These axioms allow ontology builders to specify semantic relationships among classes and attributes represented in the frame system. Protégé-2000 then can verify that these axioms are satisfied at knowledge-acquisition time, pointing out errors to the knowledge-base developer.

2.4 End-user applications

Just as developers can create new tabs for Protégé-2000 that incorporate new functionality for knowledge acquisition, they can add to Protégé-2000 tabs that include entire application programs. When an application program takes as input ontologies or knowledge bases that can change over time, it is particularly helpful to build that program as a tab that can get immediate access to the relevant knowledge structures directly through the Protégé-2000 infrastructure. For example, our group has developed a computer system to assist medical researchers in defining the particular eligibility criteria that will determine the characteristics of patients whom the researchers wish to enroll in clinical trials of new, experimental therapies [14]. Each clinical trial requires its own specific eligibility criteria, but the medical researchers must select those criteria from a comprehensive ontology of eligibility criteria that developers have predefined using Protégé-2000. Because the ontology of eligibility criteria undoubtedly will evolve over time, it makes sense to embed the application program for selecting the eligibility criteria within the Protégé-2000 system.
Users can modify the eligibility-criteria ontology and immediately see their changes take effect in the associated application program. At the same time, by creating the application program as a tab within Protégé-2000, the program can take advantage of the broad support that Protégé-2000 provides for implementing graphical user interfaces that display and process elements of the domain ontology and its related knowledge bases.

3. A New Vision for Knowledge Acquisition

In recent years, there has been considerable interest in building knowledge-based systems from reusable components [15]. After many years of debate over the “correct” abstractions for modeling and for implementing intelligent computer programs, the community has achieved consensus that domain ontologies and reusable problem-solving methods are the most appropriate building blocks [8]. The advantages of conceptualizing and engineering intelligent systems using well-defined, previously debugged components are quite clear.

Protégé-2000 comprises a unique software infrastructure that extends the notion of component-based architectures to knowledge-acquisition systems themselves. Protégé-2000 provides the opportunity to study what happens when we allow knowledge-acquisition systems to be assembled rapidly from libraries of reusable components. Protégé-2000 supports a wide range of plug-ins that include novel visual metaphors to support entry and editing of ontologies and knowledge bases, novel input–output modules to allow the system to read and write to a variety of servers, and entire application programs. We do not claim that our development team can even anticipate all the services that such plug-ins ultimately may provide users of our system.

Integration of new plug-ins into Protégé-2000 is simple. The appropriate Java code is simply loaded with the basic Protégé-2000 software. Creating new plug-ins for Protégé-2000 is also straightforward. During the past year, several of our academic colleagues have developed their own custom-tailored Protégé-2000 plug-ins. In addition to our team at Stanford University, colleagues at the University of Linköping, the University of Karlsruhe, the German Institute for Artificial Intelligence (DFKI) in Kaiserslautern, the University of Victoria, the University of Washington, and the University of California (Irvine) have all programmed their own special plug-ins for use with Protégé-2000.
With continued input from a worldwide community of collaborators, we anticipate that we soon will have a facility in which scores of knowledge-acquisition components will be downloadable for incorporation within a seemingly endless variety of highly customized knowledge-acquisition tools. Ultimately, the functionality that constitutes the Protégé approach will be determined not by the Protégé-2000 infrastructure, but rather by the wide array of knowledge-acquisition components that will populate our knowledge-acquisition–tool library. Protégé-2000 offers a platform to which a wide international community of researchers already is contributing, and that soon will define itself in terms of the available set of plug-ins, not in terms of the current system’s capabilities.
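To make the plug-in idea concrete, the following Java sketch shows one way a framework could let a contributed component override a default behavior, using the map-widget example from Section 2.1. It is an illustration under stated assumptions: `SlotWidget`, `WidgetRegistry`, and the widget classes are hypothetical names invented here, the widgets return descriptive strings where a real widget would draw a user interface, and none of this reflects the actual Protégé-2000 plug-in API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical widget contract: presents one slot of a knowledge-acquisition form. */
interface SlotWidget {
    String render(String slotName);
}

/** Stand-in for the default presentation of a pre-enumerated list. */
final class DropDownWidget implements SlotWidget {
    public String render(String slotName) {
        return "drop-down list for " + slotName;
    }
}

/** Stand-in for a domain-specific widget contributed as a plug-in. */
final class MapWidget implements SlotWidget {
    public String render(String slotName) {
        return "clickable map for " + slotName;
    }
}

/** Hypothetical registry: plug-ins override the default on a per-slot basis. */
final class WidgetRegistry {
    private final SlotWidget defaultWidget = new DropDownWidget();
    private final Map<String, SlotWidget> overrides = new HashMap<>();

    /** Associates a contributed widget with one slot. */
    void register(String slotName, SlotWidget widget) {
        overrides.put(slotName, widget);
    }

    /** Returns the contributed widget if one exists, else the default. */
    SlotWidget widgetFor(String slotName) {
        return overrides.getOrDefault(slotName, defaultWidget);
    }
}

final class WidgetDemo {
    public static void main(String[] args) {
        WidgetRegistry registry = new WidgetRegistry();
        registry.register("region", new MapWidget());

        // Prints "clickable map for region"
        System.out.println(registry.widgetFor("region").render("region"));
        // Prints "drop-down list for diagnosis"
        System.out.println(registry.widgetFor("diagnosis").render("diagnosis"));
    }
}
```

The framework code depends only on the small `SlotWidget` contract, so new widgets can be added without changing the registry or the other widgets.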

4. Accessing Protégé-2000

Protégé-2000 may be downloaded from our Web site (http://www.smi.stanford.edu/projects/protege). The source code is distributed under an open-source license. The library of Protégé-2000 plug-ins is growing, and may be browsed online (http://smiweb.stanford.edu/projects/protege/protege2000/tabs/index.html).

Acknowledgments

Continued development of Protégé-2000 is supported in part by a grant from Spawar, and by a grant from Fast Track Systems, Inc.

References

[1] Grosso, W.E., Eriksson, H., Fergerson, R.W., Gennari, J.H., Tu, S.W., and Musen, M.A. Knowledge modeling at the millennium: The design and evolution of Protégé-2000. In: Proceedings of the Twelfth Knowledge Acquisition for Knowledge-Based Systems Workshop. Banff, Alberta, Canada, October, 1999.
[2] Musen, M.A. Automated Generation of Model-Based Knowledge-Acquisition Tools. London: Pitman, Artificial Intelligence Research Notes Series, 1989.
[3] Fensel, D. and Straatman, R. The essence of problem-solving methods: Making assumptions to gain efficiency. International Journal of Human–Computer Studies, 48:181–215, 1998.
[4] Tu, S.W. and Musen, M.A. Episodic refinement of episodic skeletal-plan refinement. International Journal of Human–Computer Studies, 48:475–497, 1998.
[5] Gennari, J.H., Tu, S.W., Rothenfluh, T.E., and Musen, M.A. Mapping domains to methods in support of reuse. International Journal of Human–Computer Studies, 41:399–424, 1994.
[6] Eriksson, H., Shahar, Y., Tu, S.W., Puerta, A.R., and Musen, M.A. Task modeling with reusable problem-solving methods. Artificial Intelligence, 79:293–326, 1995.
[7] Musen, M.A. Domain ontologies in software engineering: Use of Protégé with the EON architecture. Methods of Information in Medicine, 37:540–550, 1998.
[8] Musen, M.A. Ontology-oriented design and programming. In: Cuena, J., Demazeau, Y., Garcia, A., and Treur, J., eds. Knowledge Engineering and Agent Technology. Amsterdam: IOS Press, in press.
[9] Noy, N.F., Grosso, W.E., and Musen, M.A. Knowledge-acquisition interfaces for domain experts: An empirical evaluation of Protege-2000. In: Proceedings of the Twelfth International Conference on Software Engineering and Knowledge Engineering (SEKE 2000), Chicago, Illinois, July, 2000.
[10] Chaudhri, V.K., Farquhar, A., Fikes, R., Karp, P.D., and Rice, J.P. OKBC: A programmatic foundation for knowledge base interoperability. In: Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), American Association for Artificial Intelligence. Madison, Wisconsin, July, 1998.
[11] World Wide Web Consortium. Resource Description Framework (RDF), online: http://www.w3.org/RDF/
[12] Li, Q., Shilane, P., Noy, N.F., and Musen, M.A. Ontology acquisition from online knowledge sources. In: Proceedings of the AMIA 2000 Annual Symposium, American Medical Informatics Association. Los Angeles, California, November, 2000, submitted.
[13] Noy, N.F. and Musen, M.A. PROMPT: Algorithm and tool for automated ontology merging and alignment. In: Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000), American Association for Artificial Intelligence. Austin, Texas, July 30–August 3, 2000.
[14] Rubin, D.L., Gennari, J.H., Srinivas, S., Yuen, A., Kaizer, H., Musen, M.A., and Silva, J.S. Tool support for authoring eligibility criteria for cancer trials. In: Proceedings of the AMIA Annual Symposium. American Medical Informatics Association, Washington, DC, November, 1999, pp. 369–373.
[15] Musen, M.A. and Schreiber, A.T. Architectures for intelligent systems based on reusable components. Artificial Intelligence in Medicine, 7:189–199, 1995.