
Subword Approach For Acquiring and Cross-Linking Multilingual Specialized Lexicons

Philipp Daumke, Stefan Schulz, Kornél Markó
Freiburg University Hospital, Department of Medical Informatics, Freiburg, Germany

Abstract

We present a new subword-based approach to automatically translating biomedical terms from one language to another. The approach may support the creation of new multilingual biomedical lexicons and enable cross-linking between different languages. Using subwords, i.e. morphologically meaningful units, instead of full words significantly reduces the number of lexical entries needed to sufficiently cover a specific language and domain. The language transfer between queries and documents is based on these subwords, as well as on lists of word-n-grams that are generated from large monolingual corpora and serve as look-up tables for different target languages. First tests were done for the translation of German terms into English.

1. Introduction

The automatic translation of biomedical terms between different languages using some sort of aligned word lists poses a big challenge whenever the coverage of terms or the linkage between these terms is not comprehensive. Particularly for languages such as German, Finnish or Swedish, which are characterized by a high frequency of compounds, an exhaustive list of cross-linked biomedical terms is not yet available.

This paper presents a new approach to automatically translating biomedical terms from one language to another. It combines a lexicon-based and a corpus-based approach and is able to translate both dictionary and out-of-dictionary biomedical terms. At its core lies a multilingual subword lexicon that contains semantically minimal, morpheme-style units called subwords. Language-specific subwords are linked by intralingual as well as interlingual synonymy and grouped into language-independent equivalence classes. Using an interlingua significantly reduces the number of entries that are needed to sufficiently cover the biomedical domain. Our approach additionally exploits large monolingual word lists that are easily acquired from the web for many languages. These lists are analyzed with regard to term frequencies and correspondences of word orders.

2. Morpho-Semantic Indexing

The MORPHOSAURUS system is based on the assumption that neither fully inflected nor automatically stemmed words constitute the appropriate granularity level for lexicalized content description. Especially in scientific sublanguages, we observe a high frequency of complex word forms such as in ‘pseudo⊕hypo⊕para⊕thyroid⊕ism’. To properly account for particularities of ‘medical’ morphology, the notion of subwords was introduced as self-contained, semantically minimal units.

Subwords are assembled in a multilingual dictionary and thesaurus, which contain their entries, special attributes and semantic relations between them. Subwords are listed as entries together with their attributes such as language and subword type (stem, prefix, suffix, invariant). Each lexicon entry is assigned to one or more morpho-semantic identifiers (MIDs) representing the corresponding synonymy classes. Intra- and interlingual semantic equivalence are judged within the context of medicine only.

Figure 1 depicts how source documents (top-left) are converted into an interlingual representation by a three-step morpho-semantic indexing procedure. First, each input word is orthographically normalized (top-right). Next, words are segmented into sequences of subwords or left unaffected when they cannot be decomposed into subwords (bottom-right). Finally, each meaning-bearing subword is replaced by a language-independent semantic identifier, its MID, thus producing the interlingual output representation of the system (bottom-left). MIDs which co-occur in both document fragments appear in bold face.

Figure 1: Morpho-Semantic Indexing Pipeline

In Pierre Zweigenbaum, Stefan Schulz, and Patrick Ruch, editors, LREC 2006 Workshop on Acquiring and Representing Multilingual, Specialized Lexicons: the Case of Biomedicine. Genova, Italy, 2006. ELDA.
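The three-step indexing procedure of Section 2 can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the toy `LEXICON`, the MID labels and the longest-match-first segmentation strategy are hypothetical and are not taken from the MORPHOSAURUS implementation, and diacritic stripping merely stands in for the orthographic normalization rules.

```python
import unicodedata
from functools import lru_cache

# Hypothetical toy lexicon: subword -> MID.
# None marks a subword that is not meaning-bearing and gets no MID.
LEXICON = {
    "pseudo": "#pseudo",
    "hypo": "#hypo",
    "para": "#para",
    "thyroid": "#thyroid",
    "ism": None,
    "pancreat": "#pancreas",
    "itis": "#itis",
}


def normalize(word):
    """Step 1: lower-case and strip diacritics (stand-in for the real rules)."""
    decomposed = unicodedata.normalize("NFKD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def segment(word):
    """Step 2: split a word into lexicon subwords, trying longer pieces first.

    Returns a tuple of subwords, or None when no full decomposition exists.
    """
    @lru_cache(maxsize=None)
    def split(start):
        if start == len(word):
            return ()
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in LEXICON:
                rest = split(end)
                if rest is not None:
                    return (piece,) + rest
        return None

    return split(0)


def index(word):
    """Step 3: replace each meaning-bearing subword by its MID."""
    normalized = normalize(word)
    subwords = segment(normalized)
    if subwords is None:
        return [normalized]  # word is left unaffected
    return [LEXICON[s] for s in subwords if LEXICON[s] is not None]
```

With this toy lexicon, `index("Pseudohypoparathyroidism")` yields the MIDs for `pseudo`, `hypo`, `para` and `thyroid`, while a word with no decomposition, such as `heparin`, is returned in its normalized form.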

3. Term Translation

3.1. Creating Subword Lists

In a preparation phase we acquired large domain-specific (medical) corpora in different languages from the Web, including abstracts from medical journals indexed in Medline¹ as well as different online health portals such as Mayo Clinic² or Netdoctor³. These corpora are normalized by removing HTML tags and stop words, transforming characters with diacritics into 7-bit ASCII by applying language-specific transliteration rules, and removing all non-7-bit-ASCII tokens. Subsequently, these normalized corpora are tokenized into word-n-grams (henceforth, target words). We limited n to values between 1 and 3, resulting in lists of surface words, word bigrams and trigrams. These temporary lists are sorted and deduplicated, counting the number of occurrences of each entry. Table 1 lists the number of generated word-n-grams for ENGlish, GERman, PORtuguese, SPAnish, FREnch and SWEdish.

Language   Unigrams   Bigrams    Trigrams
ENG        528k       30,257k    97,673k
GER        467k       4,101k     5,530k
POR        138k       3,899k     7,058k
SPA        125k       2,382k     3,746k
FRE        85k        1,129k     1,796k
SWE        47k        423k       782k

Table 1: Number of Generated Target Queries in Different Languages (k = 1000)

The target words are now sent to the morpho-semantic normalization routine, which assigns a sequence of MIDs to this input (cf. Figure 1). The resulting language-specific target lists contain triples of the form (target words, frequency, MIDs). Due to the frequent occurrence of subword permutations between languages (e.g. German "Bluthochdruck" (literally "blood high pressure") vs. English "high blood pressure"), bigrams and trigrams on the interlingual MID layer are ordered alphabetically. Table 2 shows a small subset of the English target list.

Target Words    Freq    MIDs
...             ...     ...
side            111k    #side
side effects    76k     #effect #side
pancreatitis    9k      #itis #pancreas
heparin         574     #heparin
...             ...     ...

Table 2: Extract of the English Target List

3.2. Producing Translations
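The matching described in this section runs against the target lists of Section 3.1. As a rough illustration of what such a list looks like, the following Python sketch counts word-n-grams (n from 1 to 3) and attaches alphabetically ordered MIDs to each one. The `WORD_MIDS` lookup and all function names are hypothetical stand-ins for the morpho-semantic normalization routine; stop-word removal and transliteration are omitted, and MIDs are sorted for every n-gram size for simplicity.

```python
from collections import Counter

# Hypothetical word -> MIDs lookup standing in for morpho-semantic normalization.
WORD_MIDS = {
    "side": ["#side"],
    "effects": ["#effect"],
    "high": ["#high"],
    "blood": ["#blood"],
    "pressure": ["#pressure"],
}


def word_ngrams(tokens, max_n=3):
    """Yield all word-n-grams of the token list for n = 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def build_target_list(sentences, max_n=3):
    """Return (target words, frequency, MIDs) triples, most frequent first."""
    counts = Counter()
    for sentence in sentences:
        counts.update(word_ngrams(sentence.lower().split(), max_n))

    target_list = []
    for ngram, freq in counts.most_common():
        # Words missing from the lookup are kept as they are.
        mids = sorted(m for word in ngram for m in WORD_MIDS.get(word, [word]))
        target_list.append((" ".join(ngram), freq, mids))
    return target_list


corpus = ["high blood pressure", "side effects", "side"]
entries = {words: (freq, mids) for words, freq, mids in build_target_list(corpus)}
```

Because the MIDs of each entry are sorted, two terms whose subwords appear in a different order end up with the same MID sequence, which is what allows a permuted compound to be matched against its counterpart in another language.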

When a term T_orig is sent to our translation tool (with specified term language and desired target language), T_orig is transformed to its MID representation T_MID. Subsequently, T_MID is iteratively matched against the MIDs in the target list of the desired language, starting with the MID sequence that corresponds to the first n words of T_orig (n