Towards Multilingual, Termontological Support in Ontology Engineering

Koen Kerremans and Rita Temmerman
Department of Applied Linguistics – Centrum voor Vaktaal en Communicatie (CVC)
Erasmushogeschool Brussel
Trierstraat 84, B-1040 Brussels, Belgium
{koen.kerremans;rita.temmerman}@ehb.be
http://cvc.ehb.be

Introduction

More than ever before, terminology as a discipline is becoming multidisciplinary. On the one hand, it is inspired by new developments in other disciplines such as computational linguistics, artificial intelligence or database management. On the other hand, it can be at the service of other disciplines. The example we would like to elaborate on is how a multilingual terminographic analysis can contribute to the quality of the output of ontology engineers. This example is taken from the FFPOIROT project¹, a European research project in the 5th framework which aims at compiling, for several languages (Dutch-Belgium, French-France, Italian-Italy and English-UK), a computational knowledge repository (i.e. an ontology) for the financial forensics domain. The method of termontography was developed by CVC Brussels in order to add a multilingual layer to this ontology.

This paper is structured as follows. In section 1 we briefly explain some fundamental issues related to DOGMA modelling and show how our terminological resource is used by DOGMA ontology engineers in the FFPOIROT project. In section 2 we elaborate on the problems that may arise when trying to match the content of our terminological resource with the ontology developed in DOGMA. In section 3 we motivate the need for a framework of domain-specific categories that precedes the termontographic analysis. Section 4 summarises the termontography method, and in section 5 we draw some general conclusions with respect to the integration of multilingual, termontological support in an ontology engineering framework.

¹ IST 2001-38248. For more information, see http://www.ffpoirot.org.


1 What is DOGMA?

Developing Ontology-Guided Mediation for Agents (DOGMA) is a research initiative in which various theories, methods and tools related to ontology design and usage are being studied and developed (Jarrar and Meersman 2002)². An ontology in the DOGMA engineering framework is represented by means of a set of lexons. Lexons are extracted from natural language texts (solely in English), dictionaries, thesauri, glossaries or general-purpose lexicons like WordNet. A lexon is a grouping element composed of a starting term (i.e. headword), a role (i.e. relation) and a second term (i.e. tail). In the lexon member state adopt/adopted law, the headword member state plays the role of ‘adopting’ vis-à-vis the second term law, whereas, vice versa, the latter is assigned the role ‘adopted by’ when related to the former. This particular type of conceptual modelling is called Object-Role Modelling (ORM). In ORM the world is viewed in terms of objects and the roles they play (Halpin 2001).

The lexon member state adopt/adopted law does not hold for all situations in reality in which a law is being adopted. For instance, in some situations a law may be adopted by a national government. Therefore, a lexon is always further specified by a context (which is usually a reference to a text or even a subsection or paragraph in a text). The lexon member state adopt/adopted law corresponds to a situation in reality which is always true within the context of the value added tax (VAT) Sixth Directive (77/388/EEC), a European directive dealing with the harmonisation of the VAT legislations of the different European member states. From a terminological point of view, the English term law has, in this particular context, the following correspondents in the other languages of the FFPOIROT project: disposition législative (French), disposizione legislativa (Italian) and wettelijke maatregel (Dutch).
The task of the terminographers is thus to make sure that, given its context, each term in a lexon base is further refined with term correspondents. The purpose of this process is to provide the eventual ontology with a multilingual layer. A term in a lexon base is the lexical representation of a concept. It may be either linguistic (i.e. an existing term in natural language like law) or non-linguistic (i.e. a label like BusinessActivity constructed by the ontology engineer) and can have only one meaning given the context in which it occurs. Each term is further specified by a gloss in natural language stating the intended meaning of this term in a given context. In the FFPOIROT project it is the terminographers who should provide this gloss. In the end, the terminological resource, consisting of terms, translations and descriptions in natural language, should be combined with the content of the lexon base. However, as we will see in the next section, this process is hampered in different ways.
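As an illustration only, the lexon described above and its multilingual refinement could be sketched in Python as follows. The class names, field names and the `co_role` label are assumptions made for this sketch; they are not taken from the DOGMA tools. The gloss is left empty because the text does not quote one.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Lexon:
    """One lexon: a context plus headword, role, inverse role and tail."""
    context: str    # reference to the text in which the lexon holds
    headword: str   # starting term
    role: str       # role the headword plays towards the tail
    co_role: str    # inverse role, read from tail to headword (assumed name)
    tail: str       # second term

    def read_forward(self) -> str:
        return f"{self.headword} {self.role} {self.tail}"

    def read_backward(self) -> str:
        return f"{self.tail} {self.co_role} {self.headword}"


@dataclass
class TermEntry:
    """A term of the lexon base as refined by the terminographers."""
    term: str
    context: str
    gloss: str = ""                                   # intended meaning in this context
    correspondents: dict = field(default_factory=dict)  # language -> term


vat_directive = "VAT Sixth Directive (77/388/EEC)"
lexon = Lexon(vat_directive, "member state", "adopt", "adopted", "law")

law = TermEntry(
    term="law",
    context=vat_directive,
    correspondents={
        "French": "disposition législative",
        "Italian": "disposizione legislativa",
        "Dutch": "wettelijke maatregel",
    },
)

print(lexon.read_forward())
print(sorted(law.correspondents))
```

Keeping the context as a field of both the lexon and the term entry reflects the requirement that a term has only one meaning within a given context.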

² For a general description of DOGMA, see http://www.starlab.vub.ac.be/research/dogma.htm.


2 Mapping the terminology base to the ontology

Whereas terminographers construct the terminological resource by starting from a multilingual corpus of textual material, DOGMA ontology builders in FFPOIROT concentrate solely on English textual material. Consequently, the terminological resource may contain terms which do not have a match in the ontology, as they refer to concepts that do not appear in the English textual material. For instance, in the domain of value added tax (VAT) law, the category paraphrased in English as ‘VAT deduction on copyright publications’ (section 285bis in the French VAT legislation) only appears in the VAT law in France. In the French, Italian and Irish VAT legislations there is a special kind of export license, lexicalised in Italian as ‘esportatori abituali’ (section 8c in the Italian VAT legislation), which does not have a correspondent in the legislations of the other EU member states.

Note that corresponding categories are or may be perceived differently in different legal settings. Although the Dutch-Belgium term ‘vrijstelling’ and the English-UK term ‘zero-rated’ both refer to transactions in which a supplier has the right to deduct VAT, it does not follow that both terms cover exactly the same list of possible transactions. Another example is the category lexicalised in English as ‘taxable event’, which is defined in article 10 of the Sixth Directive but implemented differently in the legislations of the European member states (see e.g. section 6 of the Italian VAT legislation, section 269 of the French VAT legislation or section 6(2) of the UK VAT legislation). Terminographers who work on the domain-specific, multilingual textual material may detect variations in domains and between related categories and may represent these variations in a way which is ideal for immediate ontology upload.
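The mismatch described in this section can be illustrated with a small Python sketch. The data structures are hypothetical: the category labels and legislations are those mentioned in the text, while the assumption that the English-based ontology contains only ‘taxable event’ is made purely for the example.

```python
# Categories found by the terminographers in the multilingual corpus,
# with the legislations in which the text above locates them.
terminology = {
    "taxable event": ["Italy", "France", "UK"],
    "VAT deduction on copyright publications": ["France"],
    "esportatori abituali": ["France", "Italy", "Ireland"],
}

# Assumption for this sketch: categories the ontology engineers derived
# from English textual material only.
english_ontology = {"taxable event"}


def unmatched_categories(terminology, ontology_categories):
    """Return the categories of the terminological resource that have
    no counterpart in the ontology, in sorted order."""
    return sorted(c for c in terminology if c not in ontology_categories)


for category in unmatched_categories(terminology, english_ontology):
    print(category, "->", ", ".join(terminology[category]))
```

A plain label comparison like this only finds missing categories. It cannot detect the second kind of problem discussed above, where a category exists in both resources but covers different transactions in different legislations.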
In order to benefit from an approach which adds to the conceptualised model the multilingual and cultural diversity within a certain domain of interest, we propose to start from a categorisation framework containing, in an initial stage, all the culture- and language-independent categories of the domain, called units of understanding. This will be motivated in the next section. A unit of understanding can be described or defined in any natural language by a terminographer who has acquired sufficient insight into the domain through the analysis of textual information. Ideally, the development of a terminological resource has been preceded by a user requirements report (Temmerman and Kerremans 2003b).

3 The need for a categorisation framework

In order to set up a conceptualised model of a particular domain, one needs to have substantial insight into the categories and intercategorial relationships that exist independently of any culture or language in the domain of interest. For instance, with respect to an application that needs to detect fraudulent intra-community transactions, it would be essential to know beforehand which sections in the VAT legislation need to be included in the conceptualised model of the domain. In order to acquire that insight, one may ask field experts to set up a visualisation of the VAT regulatory domain, a semantic network-like structure in which the relevant culture-independent categories (or units of understanding) and intercategorial relationships are presented. The category paraphrased in English as ‘transactions for which no VAT is required’ is culture-independent, as all the European VAT legislations contain a section on particular transactions for which one does not have to pay VAT. By presenting this category in a semantic network-like structure, showing its relations to other categories (including its hierarchical ones) in the domain of interest, terminographers as well as DOGMA ontology modellers would derive that:

• all the European VAT legislations contain a section describing transactions for which no VAT is required

• the unit of understanding denoting a category of transactions for which no VAT is required is further divided into four subcategories: transactions in which the supplier does not have the right to deduct VAT, transactions in which the supplier has the right to deduct VAT, transactions that occur outside the territory of the VAT legislation at stake and transactions that are outside the scope of VAT

• the sections in the VAT legislations describing these different units of understanding need to be included in the conceptualised model of the domain of interest

• the terminology referring to each unit of understanding in particular needs to be taken up in the multilingual terminological database
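The hierarchy listed above could be held in a simple tree, sketched here in Python under stated assumptions: the identifiers (`U1`, `U1.1`, ...) are an invented scheme, and the class and function names are hypothetical. The two terms attached to `U1.2` are the ones section 2 describes as referring to transactions in which the supplier has the right to deduct VAT.

```python
from dataclasses import dataclass, field


@dataclass
class UnitOfUnderstanding:
    uid: str            # invented identifier scheme, for illustration only
    description: str    # paraphrase, here in English
    children: list = field(default_factory=list)

    def walk(self, depth=0):
        """Yield (depth, unit) pairs, each parent before its children."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


framework = UnitOfUnderstanding(
    "U1", "transactions for which no VAT is required",
    [
        UnitOfUnderstanding(
            "U1.1", "supplier does not have the right to deduct VAT"),
        UnitOfUnderstanding(
            "U1.2", "supplier has the right to deduct VAT"),
        UnitOfUnderstanding(
            "U1.3", "outside the territory of the VAT legislation at stake"),
        UnitOfUnderstanding(
            "U1.4", "outside the scope of VAT"),
    ],
)

# Terminology attached to units: (language, term) -> unit identifier.
term_to_unit = {
    ("Dutch-Belgium", "vrijstelling"): "U1.2",
    ("English-UK", "zero-rated"): "U1.2",
}


def terms_for(uid):
    """Return all terms attached to the unit with the given identifier."""
    return sorted(t for (_, t), unit in term_to_unit.items() if unit == uid)


for depth, unit in framework.walk():
    print("  " * depth + f"{unit.uid}: {unit.description}")
```

Because terms are keyed by language as well as by form, two legislations can attach different terms to the same unit without implying that the terms are fully equivalent.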

Starting from a categorisation framework in a multilingual ontology engineering project has many advantages. For instance, terminographers will share a framework with the DOGMA ontology modellers to which they can map their terminology. Using this common reference framework will facilitate (semi-automatic) mapping of the two knowledge resources. Each term mapped to a particular unit of understanding will receive the particular ID code (or categorial label) of that unit of understanding. The results of this process are shown in a terminological resource holding ontological information (henceforward: termontological database/resource). A second advantage of this approach is the fact that terminographers may detect culture-specific categories in the multilingual domain-specific corpus which need to be added to the categorisation framework (based on the feedback of domain experts). Moreover, by comparing each term to its corresponding category in the framework, they may detect slight meaning differences which are due to differences in culture (Kerremans et al. 2003).

4 Termontography: a unit of understanding approach

The ideas formulated above clarify the importance of integrating the multilingual and cultural diversity of a given domain in a conceptualised model. For that, CVC has worked out an approach called termontography. Termontography can be summarised as a terminological approach in which one structures multilingual terminological knowledge, retrieved from a textual corpus, according to a culture-independent and task-oriented framework of domain-specific knowledge which can be further refined in a culture-specific layer. The multilingual terminological knowledge is reflected in a termontological resource which contains terms referring to categories in the framework and descriptions of possible meaning variations between terms referring to the same category in the framework.

Termontography is a ‘functional’ approach (Agirre et al. 2000, Temmerman 2000). The content and structure of the termontological resource are the result of a careful analysis of its purpose, the requirements of its users (Temmerman and Kerremans 2003b) and the scope of the domain of interest (analysis phase). The analysis of the latter results in the categorisation framework (section 3), which is used as a starting-point for the extraction of terminology and co-texts from a domain-specific multilingual corpus (search phase). This results in a first version of a termontological database/resource which may be further specified with other information (refinement phase). After that, the database is checked for consistency (verification phase) and the ‘termontographer’ verifies whether the content of the database meets the requirements specified in the analysis phase (validation phase). We refer to Kerremans et al. (2003) for a more elaborate discussion of the termontography method.

5 How to represent degrees of correspondence?

Term correspondents in several languages are aligned by means of the categorisation framework and are placed in the same termontological record (with the identification code and gloss of the unit of understanding at the top). At the moment it is not yet decided in what format the possible degrees of correspondence between similar terms should be presented to DOGMA modellers. One may consider using controlled language, strict templates, lexons or a description of the degrees of correspondence in natural language, similar to the ‘interconceptual relations’ (relations interconceptuelles) field in the dictionary on retailing of Dancette and Réthoré (2000).
Taking an example from this dictionary, the termontological record referring to the unit of understanding paraphrased in English as ‘part of the package that carries the information about the product it contains’, may be structured as follows:

Unit of Understanding: part of the package that carries the information about the product it contains

English-UK: label
Language-specific features: is not easily removed

English-UK: tag
Language-specific features: is easily removed

…

Figure 1: Example of a termontological record
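The record of Figure 1 could be represented as a small data structure, sketched below in Python. This is only one possible encoding: the class and method names are hypothetical, the ‘language-specific features’ are kept as free text (one of the candidate formats, none of which has been chosen yet), and the ID code is omitted because the figure does not show it.

```python
from dataclasses import dataclass, field


@dataclass
class TermVariant:
    language: str
    term: str
    features: str   # language-specific features, kept as free text here


@dataclass
class TermontologicalRecord:
    unit_description: str                        # gloss of the unit of understanding
    variants: list = field(default_factory=list)

    def terms(self, language):
        """All terms recorded for one language."""
        return [v.term for v in self.variants if v.language == language]

    def term_with_feature(self, language, features):
        """Return the term whose recorded features match exactly, else None."""
        for v in self.variants:
            if v.language == language and v.features == features:
                return v.term
        return None


record = TermontologicalRecord(
    "part of the package that carries the information "
    "about the product it contains",
    [
        TermVariant("English-UK", "label", "is not easily removed"),
        TermVariant("English-UK", "tag", "is easily removed"),
    ],
)

print(record.terms("English-UK"))
print(record.term_with_feature("English-UK", "is easily removed"))
```

Exact string matching on free-text features is deliberately naive. A controlled language or strict template would make such lookups more reliable, which is part of the open format question discussed in this section.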


The English description of this unit of understanding follows the ID-code, automatically derived from the categorisation framework. Note that this description could be given in any other human language as it denotes a culture-independent category. According to the record, there are two terms in English (UK) which match the description at the top of the record: label and tag. If no further specification is required, these two terms could be seen as interchangeable labels (i.e. synonyms) referring to a certain unit of understanding in the categorisation framework. However, for a machine translation application, for instance, it is essential to know in which situations the unit of understanding should be referred to as ‘label’ and in which situations as ‘tag’. The ‘language-specific features’ specification in the termontological record therefore adds the language-specific usage of each term to the description of the unit of understanding.

Note that such a specification could also be relevant in order to describe related sections in the national VAT legislations. For instance, knowing that all the (European) VAT legislations appear to contain a unit of understanding ‘transactions which are outside the scope of VAT’ does not imply at all that the national legislations have the same list of transactions which can be classified according to that category (section 2). This example suggests that a multilingual modelling approach can benefit from the input of the termontological resource, as it contains the multilingual and cultural complexity of the domain of interest.

Conclusion

In this paper we discussed how a multilingual terminological resource holding ontological information can support ontology engineering. By focussing on our participation as terminographers in the FFPOIROT project, we showed how a multilingual termontological analysis can contribute to the quality of the ontology output obtained in the DOGMA ontology engineering framework.

As DOGMA ontology modellers only work on English textual material in the FFPOIROT project and terminographers analyse the multilingual corpus, mapping between the lexon base and the termontological resource will result in an ontology in which a clear distinction is made between categories and relationships (or roles) which are language-independent and categories and relationships which are not. Integrating the multilingual and cultural diversity in the ontology may be improved if a specific format is chosen for the representation of possible meaning variations between term correspondents. At present, this has not yet been decided. A common domain-specific and task-oriented reference framework of language-independent categories or units of understanding should reduce the number of consistency problems when the resources are being mapped. This is one of the principles of termontography, which is referred to as a unit of understanding approach. Our ongoing research now turns to the development of a software workbench which should support this approach.

Bibliography

(Agirre et al. 2000) Agirre, E., Arregi, X., Artola, X., De Ilarraza, A.D., Sarasola, K. and Soroa, A. 2000. "A Methodology for Building Translator-oriented Dictionary Systems". Machine Translation 15, 295-310.

(Aussenac-Gilles et al. 1995) Aussenac-Gilles, N., Bourigault, D., Condamines, A. and Gros, C. 1995. "How can knowledge acquisition benefit from terminology?". In: Proceedings of the 9th Knowledge Acquisition for Knowledge Based System Workshop (KAW '95), Banff, Canada.

(Dancette and L'Homme 2001) Dancette, J. and L'Homme, M.-C. 2001. "Modélisation des relations sémantiques dans un dictionnaire spécialisé bilingue". VIe Journées scientifiques de l'AUPELF-UREF, Beyrouth, Méta, 385-399.

(Dancette and Réthoré 2000) Dancette, J. and Réthoré, C. 2000. Dictionnaire Analytique de la Distribution. Analytical Dictionary of Retailing. Montréal: Les presses de l'Université de Montréal.

(Halpin 2001) Halpin, T. 2001. Information Modeling and Relational Databases. From Conceptual Analysis to Logical Design. Salt Lake City: North Face University.

(Jarrar and Meersman 2002) Jarrar, M. and Meersman, R. 2002. "Formal Ontology Engineering in the DOGMA Approach". 1st International Conference on Ontologies, Databases and Application of Semantics (ODBASE '02), Lecture Notes in Computer Science, Vol. 2519, 1238-1254. Berlin: Springer-Verlag.

(Kerremans et al. 2003) Kerremans, K., Temmerman, R. and Tummers, J. 2003. "Representing multilingual and culture-specific knowledge in a VAT regulatory ontology: support from the termontography approach". In: Robert Meersman & Zahir Tari (eds.) OTM 2003 Workshops. Tübingen: Springer Verlag.

(Meyer 2001) Meyer, I. 2001. "Extracting knowledge-rich contexts for terminography". In: Bourigault et al. (eds.) Recent advances in computational terminography. Amsterdam: John Benjamins. 279-302.

(Temmerman 2000) Temmerman, R. 2000. Towards New Ways of Terminology Description. The sociocognitive approach. Amsterdam: John Benjamins.

(Temmerman and Kerremans 2003a) Temmerman, R. and Kerremans, K. 2003. "Termontography: Ontology Building and the Sociocognitive Approach to Terminology Description". Prague CIL17 conference.

(Temmerman and Kerremans 2003b) Temmerman, R. and Kerremans, K. 2003. "User requirements in Terminography". Paper presented at the 2nd International Terminology Conference – CIT, Lisbon (http://www.fcsh.unl.pt/termip/cit2003/indexen.htm).
