On designing Controlled Natural Languages for Semantic Annotation

Brian Davis, Pradeep Varma, Siegfried Handschuh and Laura Dragan¹, and Hamish Cunningham²

1 Digital Enterprise Research Institute, National University of Ireland, Galway
{brian.davis, pradeep.varma, siegfried.handschuh, laura.dragan}@deri.org
2 Sheffield NLP Group, University of Sheffield
[email protected]

Extended Abstract

Abstract. Manual semantic annotation is a complex and arduous task, both time-consuming and costly, and it often requires specialist annotators. (Semi-)automatic annotation tools attempt to ease this process by detecting instances of classes within text and relationships between classes; however, their usage often requires knowledge of Natural Language Processing (NLP) and/or formal ontological descriptions. This challenges researchers to develop user-friendly annotation environments within the knowledge acquisition process. Controlled Natural Languages (CNLs) offer an incentive to novice users to annotate their documents while simultaneously authoring them, in a user-friendly manner, while shielding them from the underlying complex knowledge representation formalisms. CNLs have already been successfully applied within the context of ontology authoring, yet very little research has focused on CNLs for semantic annotation. We describe the design and implementation of two approaches to user-friendly semantic annotation, based on Controlled Language for Information Extraction tools, which permit non-expert users to semi-automatically both author and annotate meeting minutes and status reports using controlled natural language.

1 Introduction

The Semantic Web endeavors to bring structure to the meaningful content of webpages, creating an environment where software agents can roam freely from web resource to web resource, readily carrying out sophisticated tasks for users. Consequently, the real power of the Semantic Web will be realised upon the creation of many software agents that collect web content from a range of diverse sources, process the information and exchange the results with other agents. In order for the Semantic Web to become a reality, we need, as a primus inter pares, semantic data. The process of providing semantic data is very often referred to as semantic annotation, because it frequently involves the embellishment of existing data, i.e. the text, with semantic metadata, which can subsequently describe the associated text. Hence semantic annotation is one of the core challenges for building the Semantic Web and, by extension, the Semantic Desktop.

1.1 Controlled Natural Languages and Semantic Annotation

Manual semantic annotation, however, is a complex and arduous task, both time-consuming and costly, and it often requires specialist annotators. (Semi-)automatic annotation tools attempt to ease this process by detecting instances of classes within text and relationships between classes; however, their usage often requires specialist knowledge of Natural Language Processing (NLP) and/or formal ontological descriptions. This challenges researchers to develop user-friendly annotation environments within the knowledge acquisition process. Controlled Natural Languages (CNLs) offer an incentive to novice users to annotate their documents while simultaneously authoring them, in a user-friendly manner, while shielding them from the underlying complex knowledge representation formalisms. "Controlled Natural Languages are subsets of natural language whose grammars and dictionaries have been restricted in order to reduce or eliminate both ambiguity and complexity."³

The use of CNLs for ontology authoring and population is by no means a new concept and it has already evolved into quite an active research area [1]. However, very little research has focused on CNLs for semantic annotation. A natural overlap exists between tools used for ontology creation and for semantic annotation. For instance, the Controlled Language for Information Extraction (CLIE) technology permits ontology creation and population by mapping both concept definitions and instances of concepts to an ontological representation using a CNL called CLOnE - Controlled Language for Ontology Editing [2]. However, there is a subtle difference between the process of ontology creation and population and that of semantic annotation. We describe semantic annotation as "a process as well as the outcome of the process. Hence it describes i) the process of addition of semantic data or metadata to the content given an agreed ontology and ii) it describes the semantic data or metadata itself as a result of this process" [3]. Of particular importance here is the notion of the addition or association of semantic data or metadata to content.

1.2 Latent Annotation

As with any annotation environment, a major drawback is that, in order to create metadata about a document, the author must first create the content and then annotate it in an additional, a posteriori annotation step. In the context of our annotator we seek to merge the authoring and annotation steps into one. This process differs from classic a posteriori annotation and results in a new type of annotation, which we call latent annotation. Latent comes from the Latin word with identical spelling, whose etymology is derived from the Latin verb latere (to lie hidden), a nod to a posteriori (later, what comes after)⁴.

3 http://www.ics.mq.edu.au/~rolfs/controlled-natural-languages/
4 http://www.myetymology.com/latin

1.3 Controlled Language ANNotation: CLANN and Habitability

This paper describes the design and implementation of two user-friendly approaches to applying CNL to semantic annotation, which we call CLANN - Controlled Language ANNotation. Both are based on the CLOnE (Controlled Language for Ontology Editing) language [4], and both permit non-expert users to semi-automatically author and annotate meeting minutes and status reports using controlled natural language. CLANN I is more automatic and sacrifices expressiveness (with respect to controlling the manipulation and creation of metadata) for the sake of usability, while in contrast CLANN II prioritizes expressiveness (as in control over metadata manipulation) over usability.

Uncovering the correct balance between expressiveness and usability is related to the habitability problem [5]. A Natural Language Interface (NLI) is considered habitable if users can express everything needed to complete a task using language they would expect the system to understand. A second aspect of the habitability problem, one often overlooked within the CNL community itself, is Chomsky's distinction between competence and performance [6]. Human linguistic competence can be described as the set of strict rules of a language's grammar (in this case English grammar), while performance consists of the uses we make of competence. In simpler terms, how information is written using the grammar is a measure of competence, and what information could be written using the grammar is a measure of performance. The design of CNLs is often driven by competence alone, while the second aspect of habitability states that an NLI should attempt to account for both.

The remainder of this paper is organized as follows: Section 2 discusses our use case, the target domain, the design and engineering of the domain ontologies and the design and implementation of both CLANN annotators and their corresponding CNLs. Section 3 discusses related work. Finally, Section 4 offers conclusions, as well as ongoing and future work.

2 CLANN: Design and Implementation

In this section, we describe our use case, deployment scenarios, the target domain ontology and the overall architecture of both CLANN annotators, which are realized as GATE pipelines [7]. Furthermore, we briefly discuss the design and grammars of both types of CLANN CNL and provide examples.

2.1 A Use Case for Controlled Natural Language for Semantic Annotation

The reader should note that CNLs cannot offer a panacea for manual semantic annotation as a whole, since it is unrealistic to expect users to annotate every textual resource using CNL. However, there are certain use cases where CNLs can offer an attractive alternative as a means of manual semantic annotation, particularly in contexts where a controlled vocabulary or terminology is implicit, such as health care patient records, business vocabulary and reporting. Our domain use case focuses on project administration tasks such as taking minutes during a project team meeting and writing weekly status reports. Very often such note-taking tasks can be repetitive and boring. In our scenario the user is a member of a research group which in turn is part of an integrated EU research project. Based on pre-defined templates, the user simultaneously authors and annotates his/her meeting minutes or status reports in CNL, using a semantic note-taking tool - SemNotes⁵, which is an application available for Nepomuk-KDE⁶ - the KDE instance of the Social Semantic Desktop. The metadata is available for querying and aggregation immediately after creation. The scenario is not limited to the KDE Desktop or the semantic desktop. Other scenarios involve using Nepomuk Lite - a lightweight version of the platform-independent, OSGi-based Nepomuk implementation, using a Google Web Toolkit⁷ based interface. For the purposes of this paper we will focus on the Nepomuk-KDE and SemNotes scenario.

2.1.1 Nepomuk-KDE: Nepomuk-KDE is the KDE instance of the EU-funded Integrated Project Nepomuk⁸, which aims to provide a full implementation of the standards and APIs defined in Nepomuk on the KDE Desktop. The (Social) Semantic Desktop envisioned a new type of collaboration infrastructure intersecting with research across the Semantic Web, Peer-to-Peer (P2P) Networks, and Online Social Networking, culminating in a "novel collaborative environment, enabling the creation, sharing and deployment of data and metadata" [8]. The Nepomuk project provided a specification for a social semantic desktop framework, which has continued beyond the lifespan of the project in the form of the OSCAF⁹ foundation. OSCAF aims to foster interoperability between different implementations and to publish standards within the community established around the original Nepomuk project.
2.1.2 SemNotes: SemNotes is a desktop-based tool for note-taking. It is implemented for the K Desktop Environment¹⁰ for Linux and uses the Nepomuk-KDE implementation of the Semantic Desktop. The architecture is plugin-based and extensible, so new plugins can be easily added to the application. SemNotes is a semantic application because it uses the ontologies developed in the Nepomuk project¹¹ to define its data structures¹². The vast majority of notes are short and contain only text, hence it was decided that not only the metadata about the notes should be stored in the RDF repository, but also the notes themselves. For storage, SemNotes uses the RDF repository provided by Nepomuk.

An important feature of the application is the linking of resources mentioned in the text to the notes. As an application for the Semantic Desktop, SemNotes has direct access to all the resources (such as people, places and projects) available in the local RDF repository. This allows us to query the repository while the user is typing and identify references in the text¹³. The newly discovered relations are stored in the central repository and thus can be accessed by all semantically-aware applications, not just SemNotes. The user benefits from improved search and filtering of notes, easy contextual browsing, better interlinking of data and the discovery of new information from that available on his/her computer.

The SemNotes plugins currently available include: a tag cloud, a timeline and a list of linked resources for visualizing metadata about the notes; export to and import from files for backup of individual notes; and an RDF exporter plugin for bulk backup. In addition to these, we provide a keyword extraction¹⁴ plugin that allows the user to tag the note with the extracted keywords, the most relevant being suggested as tags. We intend to wrap CLANN technology as a plugin for SemNotes. An overview of SemNotes can be seen in Figure 1.

5 http://smile.deri.ie/projects/semn
6 http://nepomuk.kde.org/
7 http://code.google.com/webtoolkit/
8 http://nepomuk.semanticdesktop.org
9 http://www.oscaf.org
10 http://kde.org/ (27/04/2009)
11 http://nepomuk.semanticdesktop.org/ontologies/ (18/03/2009)
12 The notes are instances of http://www.semanticdesktop.org/ontologies/2007/11/01/pimo#Note

Fig. 1. Overview of SemNotes

13 The data that can be linked to the notes depends greatly on the data available on the desktop as well as on the ontologies that are loaded in the system.
14 http://smile.deri.ie/projects/keyphrase-extraction (23/03/2009)

2.2 CLANN I and II Overview

As mentioned earlier, this work focuses on two approaches to Controlled Language ANNotation - CLANN. Both CLANN annotators are implemented in GATE, build on the existing advantages of the CLOnE [4] software and input controlled language, and share the common features below:

1. Both annotators require only one interpreter or runtime environment, the Java 1.5 JRE.
2. As far as possible, CLANN I and CLANN II are grammatically lax; in particular it does not matter whether the input is singular or plural (or even in grammatical agreement).
3. Both types of CLANN are compact; the user can create any number of instance properties or instances in one sentence.
4. Both types of CLANN are relatively easy to learn by following examples and a small style guide, without having to study elaborate expressions of formal syntax.
5. Both CLANN annotators are a form of latent semantic annotation - simultaneous authoring and annotation.
6. Both annotators share a common template for meeting minutes.
7. Finally, both CLANN annotators share a common Ontology API based on CLOnE.

In our scenario both CLANN annotators are anchored to existing semi-structured data, such as an AgendaTitle, Scribe or ActionItem, based on the predefined meeting minutes or status report templates described below:

Template for Type I and Type II annotators
Meeting Date:
Project Name:
Attendees:,
Chair:
Scribe:
Agenda Items:
Agenda Title:
Comment:.+
RoundTable:
Comment:.+
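To make the role of the template concrete, the following is a minimal Python sketch of how template lines could be split into field/value pairs before any CNL parsing takes place. It is not the authors' implementation, which runs as a GATE pipeline; the function names `parse_minutes` and `cnl_candidates` are hypothetical, and treating the field labels as a fixed, case-sensitive list is an assumption.

```python
import re

# Field labels taken from the template above; treating them as a fixed,
# case-sensitive list is an assumption made for this sketch.
TEMPLATE_FIELDS = [
    "Meeting Date", "Project Name", "Attendees", "Chair", "Scribe",
    "Agenda Items", "Agenda Title", "Comment", "RoundTable",
]

FIELD_RE = re.compile(
    r"^(%s):\s*(.*)$" % "|".join(re.escape(f) for f in TEMPLATE_FIELDS)
)


def parse_minutes(text):
    """Return a list of (field, value) pairs, one per recognised template line."""
    pairs = []
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            pairs.append((match.group(1), match.group(2).strip()))
    return pairs


def cnl_candidates(pairs):
    """Return the non-empty Comment values, i.e. the candidate CNL sentences."""
    return [value for field, value in pairs if field == "Comment" and value]
```

Lines that do not start with a known field label are simply skipped here; a real annotator would more likely attach them to the preceding field.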

2.2.1 CLANN I: Design and Implementation: The annotator architecture contains a standard GATE pipeline¹⁵ (see Figure 2) which includes the following language processing resources: the GATE English tokenizer, the Hepple POS tagger, a morphological analyzer, and a gazetteer list component for recognizing useful key-phrases, such as structured elements from the templates and reserved CNL phrases. Any sentences preceded by a Comment: element, for example, are considered candidates for controlled language parsing. Any remaining tokens from the CNL sentence which are not recognized as reserved CNL key-phrases are used as names to generate links to ontological objects. This is followed by a standard Named Entity (NE) transducer, in order to recognize useful NEs, and a preprocessing JAPE¹⁶ finite state transducer (FST) for identifying quoted strings, chunking Noun Phrases (NPs) and additional preprocessing. A second gazetteer list lookup is applied to identify trigger phrases associated with NEs which intersect with quoted and unquoted NP annotation spans. Additional feature values are then added to the NP chunks to indicate the appropriate class to which each NP chunk should be linked as an instance. The last FST parses the CNL from the text and generates the metadata. The current tool is bootstrapped via the Nepomuk Core Ontologies¹⁷, and the application currently creates and populates a meeting minutes/status report ontology, MEMO¹⁸, which references the user's Personal Information Model Ontology (PIMO)¹⁹ via the GATE Ontology API. Each meeting minute note should follow the pre-defined template. The template is parsed initially to extract the inherent metadata about the meeting.

15 General Architecture for Text Engineering, see http://gate.ac.uk/

Fig. 2. CLANN I pipeline

Each valid sentence in CLANN I matches exactly one syntactic rule and, as mentioned earlier, consists of reserved key-phrases (verb phrases, fixed expressions and punctuation marks) as well as chunks (which, similar to noun phrases, are used to name instances). Similar to CLOnE, the language has quoted chunks: a series of words enclosed in quotes (e.g. "the PhD proposal"). Quoted chunks permit the capture of multi-word expressions as instances. They also permit the use of reserved words that would otherwise be detected by the reserved gazetteer lookup.

16 Java Annotations Pattern Engine
17 http://www.semanticdesktop.org/ontologies/
18 http://ontologies.smile.deri.ie/2009/02/27/docs/
19 http://www.semanticdesktop.org/ontologies/2007/11/01/pimo/
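The interaction between quoted chunks and the reserved key-phrase lookup can be illustrated with a small Python sketch. It is only an approximation of the behaviour described above, not the GATE gazetteer and JAPE implementation: the `RESERVED` list is hypothetical (the actual gazetteer entries are not given here), and the function names are invented for illustration. The key point is that quoted spans are cut out first, so reserved words inside quotes are never looked up.

```python
import re

# Hypothetical reserved key-phrases; the real list lives in a GATE gazetteer.
RESERVED = ["to visit", "to complete", "to submit", "to work"]

QUOTED_RE = re.compile(r'"([^"]+)"')


def segment(sentence):
    """Split a sentence into ('quoted' | 'reserved' | 'text', value) parts.

    Quoted chunks are removed before the reserved lookup runs, so reserved
    words that appear inside quotes are treated as part of the chunk.
    """
    parts = []
    pos = 0
    for match in QUOTED_RE.finditer(sentence):
        parts.extend(_scan_unquoted(sentence[pos:match.start()]))
        parts.append(("quoted", match.group(1)))
        pos = match.end()
    parts.extend(_scan_unquoted(sentence[pos:]))
    return parts


def _scan_unquoted(text):
    """Label reserved key-phrases in text that lies outside any quotes."""
    pattern = "|".join(
        re.escape(p) for p in sorted(RESERVED, key=len, reverse=True)
    )
    parts = []
    pos = 0
    for match in re.finditer(r"\b(?:%s)\b" % pattern, text):
        before = text[pos:match.start()].strip(" .")
        if before:
            parts.append(("text", before))
        parts.append(("reserved", match.group(0)))
        pos = match.end()
    rest = text[pos:].strip(" .")
    if rest:
        parts.append(("text", rest))
    return parts
```

For instance, `segment('Dirk to review "the plan to visit Galway"')` keeps "to visit" inside the quoted chunk rather than reporting it as a reserved phrase.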

Fig. 3. CLANN I visualised in GATE

An example syntactic rule is given below:

– (? )+

The rule is built from three kinds of element: chunks or quoted chunks; reserved verb phrases and paraphrases derived from corpus analysis; and prepositions, as annotated by the POS tagger. Finally, (? )+ matches one or more prepositional adjuncts, e.g. "for the EU" or "in Work Package 3000". Hence the above rule would match the following sentences:

Comment: Marco to visit "University of Karlsruhe".
Comment: Dirk to complete paper by "Sunday 21st June" for "International Semantic Web Conference".

The above rule extracts the instances as arguments. The reader should note that, prior to this stage, the standard NE transducer and the post-processing NP transducer (see Figure 2) will have collected additional information about each chunk. So Marco and Dirk are associated with a PIMO Person, while "Sunday 21st June" will be recognized as a Date. Using similar techniques, "University of Karlsruhe" would be recognized as an Organization and "International Semantic Web Conference" would be recognized as a Conference. The verb phrases to visit and to complete are then used to identify the relevant properties to link the recognized instances.

Other features of CLANN I include nominal co-reference using the Alias: rule, which allows the user to express the same instance in varied forms. It also enables the user to use shorthand when taking minutes. In addition, CLANN I attempts to incorporate elements of language performance in order to make the language more habitable (see Section 1). Table 1 provides additional examples of the language and grammar.

Table 1. Excerpt of CNL Type I grammar with examples

Sentence Pattern: +
Example: Ambrosia to submit "her PhD Proposal" during "the next week".
         Dirk to work on "the E-Health Proposal" with Ambrosia.
Parsed pattern: (Ambrosia) (to submit) ((her PhD Proposal)) (during (the next week)).
                (Dirk) (to work) (on (the E-Health Proposal)) (with (Ambrosia)).

Sentence Pattern: Alias:; Alias:;
Parsed pattern: Creates "D6.7" as an alias for "CII Deliverable 6.7".
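The parsed patterns in Table 1 suggest one possible reading of the CLANN I rule: a chunk, a reserved verb phrase, and then one or more adjuncts, each an optional preposition followed by a chunk. The Python sketch below implements that reading with regular expressions. It is an assumption-laden illustration rather than the JAPE grammar itself: the verb phrase and preposition lists are hypothetical, unquoted chunks are limited to single words, and `parse_comment` is an invented name.

```python
import re

# Hypothetical word lists; the real ones come from gazetteers and POS tags.
VERB_PHRASES = ("to visit", "to complete", "to submit", "to work")
PREPOSITIONS = ("by", "for", "on", "with", "during", "in")

# A chunk is a quoted string or, as a simplification, a single unquoted word.
CHUNK = r'(?:"[^"]+"|\w+)'
PREP = "|".join(PREPOSITIONS)

# Assumed rule shape: chunk, verb phrase, then (preposition? chunk)+
RULE_RE = re.compile(
    r'^(?P<subject>%s)\s+(?P<verb>%s)(?P<adjuncts>(?:\s+(?:(?:%s)\s+)?%s)+)\s*\.?$'
    % (CHUNK, "|".join(VERB_PHRASES), PREP, CHUNK)
)
ADJUNCT_RE = re.compile(r'(?:(%s)\s+)?(%s)' % (PREP, CHUNK))


def parse_comment(line):
    """Parse a 'Comment:' line into subject, verb and adjuncts, or return None."""
    if not line.startswith("Comment:"):
        return None
    match = RULE_RE.match(line[len("Comment:"):].strip())
    if match is None:
        return None
    adjuncts = [
        (prep or None, chunk.strip('"'))
        for prep, chunk in ADJUNCT_RE.findall(match.group("adjuncts"))
    ]
    return {
        "subject": match.group("subject").strip('"'),
        "verb": match.group("verb"),
        "adjuncts": adjuncts,
    }
```

Each adjunct is returned as a (preposition, chunk) pair, with `None` as the preposition for a direct object. Mapping the verb phrase to an ontology property and the chunks to typed instances is left out, since that depends on the NE and gazetteer stages described earlier.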

2.2.2 CLANN II: Design and Implementation: The CLANN II architecture (see Figure 4) is similar in design to CLANN I in that it shares the same language processing resources for tokenisation, sentence splitting, POS tagging and morphological analysis. CLANN II uses the same template as CLANN I; however, the Comment: element does not exist and the sentences themselves are not written in controlled language. In CLANN II the user can write any sentence without restriction under the heading of an Agenda Item. What differs in CLANN II is that the user can use snippets of controlled language to associate metadata with a particular piece of text. Snippets of CNL are identified within square brackets using [....]. The CLANN II CNL snippets themselves are similar to CLOnE with minor changes. A preprocessing finite state transducer (FST) similar to that of CLANN I is applied to extract values associated with template elements. In addition, text associated with the CNL snippets is also parsed at this stage. The final stage in the pipeline consists of a JAPE transducer which pulls out the instances and properties to produce triples, ignoring the unassociated text. CLANN II shares the same API for ontology manipulation as CLANN I and consults the ontology in a similar manner. Similar to CLOnE and CLANN I, the language in CLANN II has quoted chunks: a series of words enclosed in quotes ("..."). This allows the user to associate metadata with more than one word. Example syntactic rules are shown below:

[is a classname]

where [is a classname] corresponds to a snippet of CNL. Hence:

Dirk[is a Person] to complete paper by "Sunday 21st June"[is a Date] for "International Semantic Web Conference"[is a Conference].

[Chunk property Chunk]

The above rule allows the user to simply embed a sentence in CNL in order to create relation metadata. This approach also allows users to handle adjuncts with much greater ease, such as associating the Date instance "Sunday 21st June" with paper:

["to complete" same as toComplete]
Dirk[is a Person] to complete paper[is a Document] by "Sunday 21st June"[is a Date] for "International Semantic Web Conference"[is a Conference].
[Dirk "to complete" paper].
[Paper hasEndDate "Sunday 21st June"].

Note that when creating instances of properties, the controlled language will recognize pre-existing annotations, i.e. paper and "Sunday 21st June". In order to use a property in the CLANN II CNL, the user must either use the appropriate label for the property, found by inspecting the ontology (in this case toComplete is part of the ontology), or alternatively use the alias preprocessing command to create a more natural substitute for the property.

Another major difference between CLANN I and CLANN II is that the user can also create and manipulate classes, subclasses and class properties. Suppose the user is unsatisfied with the association of paper to Document and would prefer to associate the text with an instance of a non-existent class ConferencePaper. CLANN II permits the creation of new classes on an ad hoc basis using the following rule:

[ is a subclass of ]

resulting in the following:

["Conference Paper" is a subclass of Document]
Dirk[is a Person] to complete paper[is a "Conference Paper"] by "Sunday 21st June"[is a Date] for "International Semantic Web Conference"[is a Conference].
[Dirk toComplete paper].

We refer the reader to Table 2 for further examples of the CLANN II language.
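As a rough illustration of how the bracketed snippets could be turned into statements, the Python sketch below extracts instance, subclass, alias and chunk-property-chunk snippets from free text and ignores everything outside the brackets. It is a simplified stand-in for the JAPE transducer, not a description of it: the function `extract`, the tuple representation of statements and the restriction of unquoted chunks to single words are all assumptions made for this example.

```python
import re

# A token is a quoted string or a single unquoted word (a simplification).
TOKEN = r'"[^"]+"|\w+'
INSTANCE_RE = re.compile(r'(%s)\s*\[is an? (%s)\]' % (TOKEN, TOKEN))
SNIPPET_RE = re.compile(r'\[([^\]]+)\]')


def unquote(token):
    """Remove the surrounding double quotes from a quoted chunk."""
    return token.strip('"')


def extract(text):
    """Return (statements, aliases) found in CLANN II style text."""
    statements = []
    aliases = {}
    # Instance snippets attach to the chunk immediately before the bracket.
    for chunk, cls in INSTANCE_RE.findall(text):
        statements.append((unquote(chunk), "is a", unquote(cls)))
    # Free-standing snippets: alias, subclass, or chunk-property-chunk.
    for body in SNIPPET_RE.findall(text):
        if body.startswith(("is a ", "is an ")):
            continue  # instance snippet, already handled above
        tokens = [unquote(t) for t in re.findall(TOKEN, body)]
        if len(tokens) == 4 and tokens[1:3] == ["same", "as"]:
            aliases[tokens[0]] = tokens[3]
        elif len(tokens) == 6 and tokens[1:5] == ["is", "a", "subclass", "of"]:
            statements.append((tokens[0], "is a subclass of", tokens[5]))
        elif len(tokens) == 3:
            subject, prop, obj = tokens
            # Replace an aliased property with its ontology label, if any.
            statements.append((subject, aliases.get(prop, prop), obj))
    return statements, aliases
```

Because aliases are resolved in reading order, an alias snippet only affects property snippets that follow it, which matches the order used in the examples above. Checking that a property label such as toComplete actually exists in the ontology is outside the scope of this sketch.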

Table 2. Excerpt of CNL Type II grammar with examples

Sentence Pattern
[is a ].
[ is a subclass of ].
[ ject>].