Developing Biomedical Ontologies Collaboratively

Natalya F. Noy, PhD,1 Tania Tudorache, PhD,1 Sherri de Coronado, MS, MBA,2 Mark A. Musen, MD, PhD1
1 Stanford Center for Biomedical Research, Stanford University, Stanford, CA
2 National Cancer Institute, Center for Biomedical Informatics and Information Technology

Abstract

The development of ontologies that define entities and relationships among them has become essential for modern work in biomedicine. Ontologies are becoming so large in their coverage that no single centralized group of people can develop them effectively, and ontology development becomes a community-based enterprise. In this paper we present Collaborative Protégé, a prototype tool that supports many aspects of community-based development, such as discussions integrated with the ontology-editing process, chats, and annotation of changes. We have evaluated Collaborative Protégé in the context of the NCI Thesaurus development. Users have found the tool effective for carrying out discussions and recording design rationale.

Ontology Development Becomes Collaborative

Modern biomedicine is largely a knowledge-based enterprise, dominated by the advent of high-throughput experiments, huge data sets, and the need for both humans and computers to make sense of massive quantities of data. Most biomedical researchers today can no longer imagine processing and integrating these data without the help of terminologies and ontologies. Ontologies convey the biomedical meaning of experiments and data in a computer-accessible format, and they permit integrating data and knowledge from many sources.

Recent developments are dramatically changing how biomedical scientists build terminologies and ontologies. First, as ontologies become mainstream within biomedicine, they are being developed collaboratively by increasingly large groups of scientists. Second, ontologies are becoming so large (e.g., 80K concepts in the NCI Thesaurus) that no centralized group can develop them effectively. Hence, organizations such as the NCI Center for Bioinformatics (NCICB) “outsource” some of their ontology development to the scientific community at large. Finally, in the last one or two years, many users have become familiar and comfortable with the concept of user-contributed content, both in their personal and professional lives (cf. Web 2.0). As a result of these trends, the development of many biomedical ontologies is a community enterprise.

The following list contains just a small sample of notable biomedical ontologies that use some structured collaborative process for their development today:

• The Gene Ontology (GO)1 provides terminology for description of gene products in model-organism databases in terms of their associated biological processes, cellular components, and molecular functions in a species-independent manner.
• The National Cancer Institute’s Thesaurus (NCI Thesaurus) is a biomedical reference ontology that covers areas of basic cancer biology, translational science, and clinical oncology developed at the NCICB.2 Recently NCICB has launched the Biomedical Grid Terminology (BiomedGT),3 a terminology product to enable the wider biomedical research community to participate directly and collaboratively in extending and refining the terminology on which they depend.
• The Ontology for Biomedical Investigations (OBI)4 is a federated ontology that describes biological and medical experiments.
• BIRNLex5 is a controlled terminology for annotation of data resources (such as structural and functional image data) for the Biomedical Informatics Research Network (BIRN).
• RadLex6 is an effort to develop a standard terminology for radiology, sponsored by the Radiological Society of North America.

In all of these efforts a community of users contributes to the ontology development, either by commenting on and discussing the current version of the ontology, or directly by making changes.

Tool Support For Collaborative Development

Despite this move to community-based development of biomedical ontologies and terminologies, very little tool support for such development exists today. Discussion tools comprise mostly mailing lists and message boards (as used by OBI, GO, BIRNLex, and many others). Whereas these forums provide some archiving capability, the content of the discussion is not linked to the ontology itself.
The discussions, the alternatives considered, and the design rationale are separate from the concepts to which they refer. It is often difficult for ontology authors to find and correlate the discussion with ontology content. The users of ontologies cannot easily get an overview of those portions of the ontologies that are under active discussion or development as opposed to those that appear stable and less likely to change.

AMIA 2008 Symposium Proceedings Page - 520

Projects employ a variety of synchronization mechanisms, few, if any, of which have been designed for ontology synchronization. For example, GO and OBI use systems for version management of software code (SVN and CVS) to maintain the versions, to enable active editors to check in new versions, and to find differences between versions. Because such systems were not designed for versioning ontologies, they are cumbersome to use for this purpose. For instance, diff services to compare versions of software code assume the use of linear text files, and will fail when used for ontologies, which may be serialized in a variety of ways; ontology developers require a structural or semantic diff.7 Many biomedical ontologies are very large and are not modularized.8 Thus, curators who want to lock a version for editing have to lock the whole ontology, and no one else can edit it in the meantime.

Recently, wiki software has gained popularity as a way of soliciting community participation and feedback. A platform known as LexWiki currently is at the core of community-based development of BiomedGT. LexWiki enables users to browse an ontology, to make comments, or to propose changes to (usually text-based) definitions. The BiomedGT curators then open this annotated ontology in Protégé and perform the actual edits there. Wikis provide a natural forum for discussions, and the provenance information for suggested changes is easy to archive. Wikis also enable programmatic extensions, and developers have added capabilities such as class hierarchy browsing, autocompletion, and other features. Wikis, however, are not intended for ontology development, and users cannot easily edit class definitions using this kind of framework. For example, in BiomedGT, curators must switch to Protégé to make the actual changes. It is difficult for a wiki environment to support ontology editing directly, since text-based wikis cannot perform even simple semantic checks on the data being entered.

To support collaborative ontology development, we have developed Collaborative Protégé, a tool that we present and evaluate in this paper.

Collaborative Protégé

Our laboratory has developed Protégé, a widely used open-source technology for developing and managing terminologies, ontologies, and knowledge bases.9, 10 Collaborative Protégé is an extension of Protégé that enables users who develop an ontology collaboratively to hold discussions, chat, and annotate ontology components and changes, all as an integral part of the ontology-development process itself.

Collaborative Protégé works in a client–server mode, with the shared ontology residing on a server, and users accessing it simultaneously from distributed Protégé clients. In addition to the standard ontology-editing features of Protégé, Collaborative Protégé provides the features described below.

Features of Collaborative Protégé

Annotation of ontology components and ontology changes: Users can attach notes (that we call annotations) to ontology components or descriptions of ontology changes. These annotations can record, for example, the rationale for a modeling decision, note a new task, or provide a literature reference. Each annotation contains the name of its author, a timestamp, and a reference to the relevant ontology component or ontology change that it annotates. We support several types of annotations, such as Question, Proposal, and Comment.

Discussion threads: Users can reply to annotations by others, thus forming a discussion thread. A discussion thread is a set of annotations that are linked to one another, with messages in the thread having the same properties as other annotations: they are attached to ontology components or changes, have an author and a timestamp, and can be of different types. We also support general discussion threads, which are attached to the ontology as a whole and not to a specific ontology component.
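The annotation and discussion-thread features can be pictured as a small data model. The following Python sketch is only an illustration under stated assumptions: the names `Annotation`, `reply`, `thread_depth`, and `thread_size` are hypothetical and are not part of Protégé or its Java API. It shows an annotation carrying an author, a timestamp, a type, and a target, with replies inheriting the target so that a thread stays attached to the same ontology component.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

# The three annotation types named in the text; the tool may support others.
ANNOTATION_TYPES = {"Question", "Proposal", "Comment"}


@dataclass
class Annotation:
    """Hypothetical model of one note attached to an ontology component or change."""
    author: str
    timestamp: datetime
    kind: str
    text: str
    # Name of the annotated component or change; None means the whole ontology.
    target: Optional[str] = None
    replies: List[Annotation] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.kind not in ANNOTATION_TYPES:
            raise ValueError(f"unknown annotation type: {self.kind}")

    def reply(self, author: str, text: str, kind: str = "Comment",
              when: Optional[datetime] = None) -> Annotation:
        """Add a reply that is attached to the same target as this annotation."""
        child = Annotation(author, when or datetime.now(), kind, text, self.target)
        self.replies.append(child)
        return child


def thread_depth(root: Annotation) -> int:
    """Number of messages on the longest reply chain, counting the root."""
    return 1 + max((thread_depth(r) for r in root.replies), default=0)


def thread_size(root: Annotation) -> int:
    """Total number of messages in the thread."""
    return 1 + sum(thread_size(r) for r in root.replies)


root = Annotation("editor_a", datetime(2008, 1, 15, 9, 0), "Question",
                  "Is the asserted function correct?", target="Gene_Product")
first = root.reply("editor_b", "It looks wrong to me.")
first.reply("editor_a", "I will propose a fix.", kind="Proposal")
print(thread_depth(root), thread_size(root))  # prints: 3 3
```

Representing a general discussion thread needs no extra type in this sketch: an annotation whose `target` is `None` stands for a note on the ontology as a whole.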

Change proposals and voting: Users can create proposals for changes and call for a vote on the proposal by creating an annotation of a specially designated type, Proposal. After a user creates a proposal, others can vote on it. Users can accompany each vote in favor or against the proposal with a comment explaining their motivation.

Browsing annotations for a class: As a user browses the ontology and selects a class, he can see all the class changes (its concept history) and all annotations and discussions for that class. An icon next to the class name in the class-hierarchy browser indicates that there are annotations for this class (cf. Figure 1).

Search and filtering of annotations: Users can search and filter annotations based on a number of criteria. The annotations can be filtered by their authors, the date they were made, the type of annotation, and so on.

Chat: Collaborative Protégé integrates a chat functionality that allows users who are connected to a Protégé server to exchange live messages.
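To make the voting and filtering features concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names `Note`, `Vote`, `filter_notes`, and `tally` are hypothetical, and the rule that only a voter's most recent vote counts is our own choice, since the text does not say how repeated votes are handled.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Note:
    """Hypothetical, flattened view of an annotation for search purposes."""
    author: str
    made_on: date
    kind: str  # e.g. "Comment", "Question", "Proposal"
    text: str


def filter_notes(notes: Iterable[Note], author: Optional[str] = None,
                 kind: Optional[str] = None,
                 since: Optional[date] = None) -> List[Note]:
    """Keep the notes that satisfy every criterion that was supplied."""
    return [n for n in notes
            if (author is None or n.author == author)
            and (kind is None or n.kind == kind)
            and (since is None or n.made_on >= since)]


@dataclass(frozen=True)
class Vote:
    voter: str
    in_favor: bool
    comment: str = ""  # optional explanation of the voter's motivation


def tally(votes: Iterable[Vote]) -> Tuple[int, int]:
    """Return (for, against), counting only each voter's latest vote (assumption)."""
    latest = {v.voter: v for v in votes}
    counts = Counter("for" if v.in_favor else "against" for v in latest.values())
    return counts["for"], counts["against"]


notes = [
    Note("editor_a", date(2008, 1, 14), "Comment", "Definition needs review."),
    Note("editor_b", date(2008, 1, 16), "Proposal", "Move class under a new parent."),
    Note("editor_a", date(2008, 1, 18), "Comment", "Agreed at the meeting."),
]
recent = filter_notes(notes, author="editor_a", since=date(2008, 1, 15))
print([n.text for n in recent])  # prints: ['Agreed at the meeting.']

votes = [Vote("a", True), Vote("b", False, "Too disruptive."),
         Vote("c", True), Vote("b", True, "Changed my mind.")]
print(tally(votes))  # prints: (3, 0)
```

Passing `None` for a criterion leaves it unconstrained, so the same function covers filtering by author, by date, by type, or by any combination of the three.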


Hyperlinks to ontology components from text: There is particular value in using Protégé itself for discussion threads and chats (rather than traditional chat and mail software): users can add hyperlinks to ontology components in the text of annotations and chat messages. If the user types @’ClassName’, we create a hyperlink to that class in the ontology. Thus, when a line in the chat refers to a particular class, the reader can simply click on the link and open the class definition.

Embedding in other applications: Developers can easily access the information about changes, discussions, and all the other collaborative features through a Java API. Thus, other applications can use these features in a straightforward way.

Figure 1 shows a screenshot of the Protégé ontology editor with Collaborative Protégé enabled. The two left columns are the traditional Protégé view of the class hierarchy (left column) and the definition of the selected class (middle column). The right column provides the collaborative capabilities: discussion threads related to the selected class; annotations of changes; filtering and search of changes and annotations; and a chat tool.

Figure 1. The Protégé user interface, with the Collaborative Protégé plug-in. This screen capture shows the OWL Classes tab, in which the user edits and browses the classes that describe a domain ontology, here the NCI Thesaurus. The left panel shows the class tree; the middle panel displays the form for entering and viewing the description of the selected class (Gene_Product), as a collection of attributes; the right column shows the discussion among users about this class.

Representing changes and annotations

We use a declarative representation of ontology changes and annotations, and their respective metadata, such as authors and timestamps. As users make changes and create annotations, this information is stored as instances in the Changes and Annotations Ontology (CHAO).11 The ontology contains a class corresponding to every type of change that is possible in the corresponding ontology language, and to each type of annotation that the tool supports. Annotations can point (by means of property values) to the components in the domain ontology, to other annotations (if the annotation is a reply in a discussion), or to the changes that they annotate. For each domain ontology, this set of CHAO instances is stored separately, constituting the CHAO knowledge base for that particular domain ontology.

Figure 2 shows the main components of Collaborative Protégé. A user browses and edits a shared domain ontology in a Protégé client, a desktop application running on his computer (cf. Figure 1 for the client user interface). The shared ontology resides on a Protégé server and the client uses the API to access the ontology. As the user makes a change in the user interface, it is reflected on the server. At the same time, Protégé creates CHAO instances corresponding to these changes on the server. For example, suppose a user has added a new property to a class. There is an Add_Property class in CHAO. When the user adds a property in his domain ontology, an instance of the Add_Property class is created in the corresponding CHAO knowledge base.
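The change-tracking mechanism and the @’ClassName’ link syntax described in this section can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Protégé implementation: `Server`, `ChangeInstance`, `add_property`, and `linked_classes` are hypothetical names, the domain ontology is reduced to a dictionary of class names and property names, and the link pattern accepts both straight and typographic apostrophes because the text does not specify which are required. Only the change-class name `Add_Property` comes from the paper.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Set

# Assumed link syntax: @'ClassName' with straight or typographic apostrophes.
LINK_PATTERN = re.compile(r"@['\u2018\u2019]([^'\u2018\u2019]+)['\u2018\u2019]")


def linked_classes(text: str) -> List[str]:
    """Return the class names referenced as links in an annotation or chat line."""
    return LINK_PATTERN.findall(text)


@dataclass
class ChangeInstance:
    """Hypothetical stand-in for an instance of a CHAO change class."""
    change_class: str  # e.g. "Add_Property"
    component: str     # the domain-ontology class that was changed
    author: str
    timestamp: datetime
    details: Dict[str, str]


@dataclass
class Server:
    """Toy server holding the domain ontology and, separately, its change log."""
    domain: Dict[str, Set[str]] = field(default_factory=dict)
    chao: List[ChangeInstance] = field(default_factory=list)

    def add_property(self, cls: str, prop: str, author: str) -> ChangeInstance:
        # 1. Apply the edit to the shared domain ontology.
        self.domain.setdefault(cls, set()).add(prop)
        # 2. Record the edit as an instance in the separate change knowledge base.
        instance = ChangeInstance("Add_Property", cls, author,
                                  datetime.now(), {"property": prop})
        self.chao.append(instance)
        return instance


server = Server()
server.add_property("Gene_Product", "has_biochemical_function", "editor_a")
print(server.domain)                # only ontology content, no change metadata
print(server.chao[0].change_class)  # prints: Add_Property
print(linked_classes("Compare @'Gene_Product' with @'CD47'."))
# prints: ['Gene_Product', 'CD47']
```

Keeping `domain` and `chao` as separate collections mirrors the separation stated in the text: the change instances describe the domain ontology but are not stored inside it.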

Note that the instances describing changes and annotations are not part of the domain ontology itself. Thus when publishing the ontology, the authors can easily choose to publish it without the annotations and changes, if they don’t want to make their discussions or complete change history public.

Evaluation

We have performed a formative evaluation of the Collaborative Protégé prototype in the development of the NCI Thesaurus. Four NCI editors participated in the evaluation for a two-week period. These editors were the regular editors of the NCI Thesaurus who were asked to use the Collaborative Protégé plug-in as part of their routine. Evaluation participants have been using Protégé regularly for a long time. The only change in their workflow for the evaluation was the addition of the collaborative component. After the evaluation period, we analyzed the changes and annotations in order to understand what types of comments the users make and how they use Collaborative Protégé. We also administered a usability questionnaire, asking users to assess how helpful the collaborative features were in their work, which features they used most, which features they found easy or difficult to use, and what they would like to see in future versions.

Results

Over the period of two weeks the editors made approximately 40 changes to the ontology, most of which were edits on the OWL annotation properties. Over the same period, they entered 43 annotations. All annotations were annotations on classes, and none on changes. Most annotations (40 out of 43) were of the type Comment. The maximum depth of a discussion thread was 6 messages; the maximum number of replies at one level was 7. The users attached comments to classes and also had discussion threads at the level of the ontology itself, not attached to any class.

We have observed different uses for collaborative features, some of which we had not envisioned.
These uses include (with the quotes from the discussions in parentheses): (1) discussion of the modeling issues for specific classes (“CD47 seems to have a mistake among its asserted biochemical function…”); (2) discussion of future directions for the ontology (“We need a plan to review business rules”); (3) issue tracker and a “to do” list (“John, will you fix this?”); (4) notes on scheduling meetings and meeting agenda (“We should discuss this at our bi-weekly meeting.”) and meeting updates (“Our next meeting is scheduled for…”); (5) discussion of Collaborative Protégé and its features (“Discussion and Annotation Windows need a save button.”).

In the usability survey, evaluation participants agreed that Collaborative Protégé was helpful in their work. They highlighted discussion threads and chat as the most helpful features. They felt that Collaborative Protégé features would be most useful in their work for carrying out discussions to reach consensus and for tracking outstanding issues and problems. The participants found the voting feature difficult to use. They also raised concerns that the Collaborative Protégé interface takes some workspace away from the regular editing interface.

Discussion

Analysis of the results, even from this fairly short period of usage of Collaborative Protégé, points to several interesting issues. First, users found Collaborative Protégé useful and engaged in the discussion actively, producing about as many discussion items (43 annotations) as changes in the ontology itself (approximately 40). Second, the innovative use of Collaborative Protégé features points to the versatility of the tool. In fact, some of these uses prompted us to consider new features for the tool. For example, we might link the tool to a calendar application, to enable integration of discussions with scheduling ontology-review meetings and setting meeting agenda. Then, at the meeting, users can quickly and easily access the ontology components that they planned to discuss, as well as the discussion that took place on-line.

One of the surprising findings for us was that users do not add annotations to changes, but annotate only ontology components. Even the rationale for changes themselves is recorded at the level of the ontology component, and not the changes. This observation suggests that users think in terms of ontology components rather than changes.

Figure 2. The main components of Collaborative Protégé. Users edit an ontology in a Protégé client; changes are sent to the server where the domain ontology resides. In addition to performing the change, Protégé creates an instance of a class in CHAO, describing this change.

In Collaborative Protégé, facilities for reaching consensus, recording design rationale, and noting outstanding issues are an integral part of the process of ontology browsing and editing. As users examine a class in the ontology, they can immediately see all the discussion and questions pertaining to this class, whether there was any contention in its definition, and the alternatives that the authors considered. An editor, when coming upon a class that, he feels, must be changed, can post a request immediately, in the context of this class. This dual advantage of context-sensitivity and archival character of annotations adds the greatest value to Collaborative Protégé compared to discussion lists and issue trackers that are separate from an ontology environment.

There are many outstanding issues, however, that we must address in order to support truly collaborative ontology development. First, more open environments, where anyone can join the editing process, will need more advanced support for determining trust and credibility of various users. Given the complex nature of a task such as ontology editing, poor entries may result not only from malicious intent but also from simple incompetence. Users must be able to see who made the changes and when, to read a comment by the author of this change, to understand what the state of the knowledge base was when the change was made, and to access the concept history.

Currently, Protégé has only rudimentary support for different user roles. However, such support is essential, as collaborative scenarios require users with different roles: for example, not all users in a project may have the privileges to create change proposals or to make the changes in the ontology. Other users may be allowed only to comment on proposals. We plan to analyze the different roles that the biomedical ontology-development projects employ and add such support in future versions.

Finally, as we studied the different workflows that the projects described in the introduction to this paper used, we concluded that developers of biomedical ontologies need tools that are flexible enough to work with different workflows. For instance, a group of users working together on developing an ontology in the context of a specific project will have different requirements compared to an open community developing a lightweight taxonomy that anyone can edit. In some cases, tools should support specific protocols for making changes, where some users can propose changes, others can discuss and vote on them, and other users can perform the changes. At the other end of the spectrum are settings where anyone can make any changes immediately. Thus, tools need to support different mechanisms for building consensus, depending on whether the environment is more open or more controlled.

Conclusion

The results of our initial evaluation demonstrate that Collaborative Protégé, even in its prototype form, already provides significant value to the developers of large biomedical ontologies. Users are generally satisfied with the tool, find it useful in their work, and, in addition to using the features to support activities that the tool developers had in mind, they also use the features in innovative ways that address their specific needs and workflows. While Collaborative Protégé is one of the first tools to integrate consensus-building mechanisms directly into the ontology-development tool, the features described in this paper are only the first step. Much remains to be done to support fully the dynamic, distributed, and open development of biomedical ontologies that modern medicine requires.

Acknowledgments

This work was supported in part by a contract from the U.S. National Cancer Institute. Protégé is a national resource supported by grant LM007885 from the United States National Library of Medicine.

References

1. GO Consortium. Creating the gene ontology resource: design and implementation. Genome Res 2001, 11, (8).
2. Sioutos, N., et al. NCI Thesaurus: a semantic model integrating cancer-related clinical and molecular information. J Biomed Inf. 2007, 40, (1).
3. BiomedGT. http://biomedgt.org/
4. OBI Consortium. http://obi.sourceforge.net/consortium/index.php
5. BIRN. What is BIRNLex? http://xwiki.nbirn.net/xwiki/bin/view/+BIRN-OTFPublic/About+BIRNLex
6. Rubin, D. L.; Noy, N. F.; Musen, M. A. Protégé: A Tool for Managing and Using Terminology in Radiology Applications. J. of Digital Imaging 2007.
7. Noy, N. F.; Musen, M. A. Ontology Versioning in an Ontology-Management Framework. IEEE Intelligent Systems 2004, 19, (4), 6-13.
8. Seidenberg, J.; Rector, A. Web Ontology Segmentation: Analysis, Classification and Use. 15th Intl. WWW Conf., Edinburgh, Scotland, 2006.
9. Protégé. http://protege.stanford.edu
10. Gennari, J., et al. The Evolution of Protégé: An Environment for Knowledge-Based Systems Development. Intl. J. of Human-Computer Interaction 2003, 58, (1).
11. Noy, N. F.; Chugh, A.; Liu, W.; Musen, M. A. A Framework for Ontology Evolution in Collaborative Environments. Intl. Semantic Web Conf., Athens, GA, 2006.