Translation Selection by Combining Multiple

November 28, 2003 14:15 WSPC/162-IJCPOL

00090

International Journal of Computer Processing of Oriental Languages, Vol. 16, No. 3 (2003) 219–239
© Chinese Language Computer Society & World Scientific Publishing Company

Translation Selection by Combining Multiple Measures for Sense Disambiguation and Word Selection

HYUN AH LEE∗, JUNTAE YOON† AND GIL CHANG KIM‡

Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology, 373-1, Kusong-Dong, Yusong-District, Taejon, 305-701, Korea
∗ [email protected][email protected][email protected]

Translation selection is a process to select, from a set of target language words corresponding to a source language word, the most appropriate one that conveys the correct sense of a source word and makes the target language sentence more natural. In this paper, we propose a hybrid method for translation selection that exploits a bilingual dictionary and a target language corpus. Based on the ‘word-to-sense and sense-to-word’ relationship between a source word and its translations, our method selects translation through two levels: sense disambiguation of a source word and selection of a target word. For translation selection, we introduce three measures: sense preference and sense probability for sense disambiguation, and word probability for word selection. The first one is based on knowledge from a bilingual dictionary, and the others are calculated using statistics from a target language corpus. We evaluated our method and results showed that our method selects more appropriate target words with knowledge extracted from easily obtainable resources.

Keywords: Machine Translation; Translation Selection; Sense Disambiguation; Knowledge Extraction; Mono-Bilingual Dictionary; Target Language Corpus.

1. Introduction

Translation selection is a process to select, from a set of target language words corresponding to a source language word, one that conveys the correct sense of the source word and makes the target language sentence more fluent. Translation selection is a key problem in machine translation (MT), since the quality of translation varies considerably with the results of translation selection.

† Daumsoft Inc., Daechi-dong 946-12, Gangnam-gu, Seoul, 135-280, Korea.


The difficulty of translation selection, and of machine translation in general, is that they link two different languages and thus require more complex knowledge than problems concerning only one language. Knowledge acquisition is therefore a critical problem for many MT systems and translation selection methods, including rule-based, knowledge-based and statistical methods, since the hand-crafted knowledge or bilingual corpora they rely on are not easily available. To ease knowledge acquisition, several studies have utilized a bilingual dictionary or a monolingual corpus, but some of them have limited practical usage and others are liable to select incorrect translations.

In this paper, we propose a hybrid translation selection method for transfer-based or statistical machine translation. To select a target word, we disambiguate the sense of a source word and then select the most appropriate one among the various target words that correspond to the resolved sense. By dividing translation selection into these two sub-problems, we can select translations using automatically obtained knowledge. Knowledge for sense disambiguation and word selection is extracted from a mono-bilingual dictionary and target language monolingual corpora. From the dictionary, we extract clues for sense disambiguation and sense-to-word mapping information. From the target language corpora, we extract statistics of target word co-occurrence. Though our method can be applied to any pair of languages, in this paper we focus on English-Korean translation.

To achieve effective translation selection, we introduce three measures: sense preference, sense probability and word probability. The first two are measures for sense disambiguation, and the last is a measure for word selection. In a bilingual dictionary, example sentences are listed for each sense division of a source word. We define the sense preference as the level of similarity between those examples and an input sentence.

In a bilingual dictionary, target words are also recorded for each sense division. As the set of those words can model each sense, we can calculate the probability of each sense from the co-occurrence between target words of a source word and target words of its contextual words. We define this estimated probability as the sense probability. Lastly, the word probability is defined as the probability of selecting a word from the set of translations with the same sense, which is computed using co-occurrence in the target language. Combining the sense preference, the sense probability and the word probability, we compute the translation preference for each target word. Target words with the highest translation preference are chosen as results.

The rest of the paper is organized as follows. Related work is reviewed in Sec. 2. In Sec. 3, we discuss the ‘word-to-sense and sense-to-word’ relationship, and detail our method of combining sense disambiguation and word selection with three measures. We extract knowledge for translation selection from a mono-bilingual dictionary and target language monolingual corpora. The knowledge extraction method is shown in Sec. 4. In Sec. 5, we describe the computing model of the sense preference, the sense probability and the word probability. To evaluate our method, we have conducted various experiments, and the results are shown in Sec. 6. Finally, we discuss our results and conclude in Sec. 7.

2. Previous Work

For translation selection, classical systems with a transfer-based method have generally relied on hand-crafted rules, lexicons or knowledge bases, in which contextual words or their semantics serve as conditions to choose target words from source words [1, 2]. For example, to select a target word for a source language verb, Egedi and Palmer [3] constrain the semantic features of object nouns in a target language, and Kim et al. [4] use a list of exemplary object nouns in a source language. Although those rule-based methods are intuitive and reliable, they have difficulty in acquiring rules or knowledge.

To overcome the difficulty of knowledge acquisition, some studies have attempted to automatically extract rules or knowledge from existing resources such as a machine-readable dictionary (MRD) [5, 6]. However, those methods did not show practical results, since they were concerned with limited types of words or were applicable only when extra knowledge sources, such as a large multilingual knowledge base or a bilingual corpus, already exist.

As masses of language resources have become available, many statistical methods have been attempted for translation selection. Approaches based on statistical machine translation extract lexical information or word-class information from a bilingual corpus and calculate the probability of a translation with this information [7, 8]. However, as Koehn and Knight [12] pointed out, a bilingual corpus is hard to acquire and, even when available, does not provide sufficient information for translation. Hence, statistical methods based on a bilingual corpus are generally not preferred in practical systems.

Dagan and Itai [9] proposed a method for sense disambiguation and translation selection that uses word co-occurrence in a target language corpus. Based on this method, some recent approaches have proposed advanced probabilistic models that exploit monolingual corpora [10, 11]. They extract clusters of target language words by using their co-occurrence in target language corpora. Since those clusters contain words that show a similar distribution of contextual words, the words are expected to have similar meanings and to help reduce the data sparseness of word co-occurrence. Prescher et al. [10] scored an accuracy of 49.4% when selecting an English translation of a German noun. They conducted an additional experiment with the semantically most distant translations, and scored an accuracy of 68.2%. Based on a binarization comparison method, they compared the accuracies of their method with those of various sense disambiguation methods, and the comparison showed that their method results in higher accuracies than the others.

Those target language based methods could relieve the knowledge acquisition problem, since they need only a monolingual corpus and simple mapping information between a source word and its target words. However, they are apt to select an incorrect translation because of the ambiguity of target word senses for individual source words. Consider the translation of the English phrase ‘have a car’. In the phrase, ‘car’ is translated into the Korean word ‘cha’, which has two different meanings: car and tea. In addition, the word ‘have’ has many senses, for example possess, eat, and drink, which are mapped into ‘gaji-da’, ‘meok-da’ and ‘masi-da’, respectively. From these examples, we get three possible translations, ‘cha-reul gaji-da’, ‘cha-reul meok-da’ and ‘cha-reul masi-da’, among which the correct one is ‘cha-reul gaji-da’. However, it turns out that the co-occurrence frequency of ‘cha’ and ‘masi-da’ in a corpus dominates the co-occurrence frequencies of all other translations. Consequently, a target language based method generates the incorrect translation ‘cha-reul masi-da’, which means ‘drink tea’.
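The failure described above can be reproduced with a minimal sketch of a purely target language based selector. The candidate words come from the ‘have a car’ example; the co-occurrence counts, the table names and the function `select_by_cooccurrence` are hypothetical and were invented only to illustrate the effect, not measured from any corpus.

```python
from itertools import product

# Hypothetical (object noun, verb) co-occurrence counts in a Korean corpus.
# The numbers are invented for illustration only.
COOCCURRENCE = {
    ("cha", "gaji-da"): 12,
    ("cha", "meok-da"): 3,
    ("cha", "masi-da"): 85,
}

# Word-to-word translation candidates taken from the 'have a car' example.
TRANSLATIONS = {
    "have": ["gaji-da", "meok-da", "masi-da"],
    "car": ["cha"],
}


def select_by_cooccurrence(verb, noun):
    """Pick the (noun, verb) translation pair with the highest count."""
    candidates = product(TRANSLATIONS[noun], TRANSLATIONS[verb])
    return max(candidates, key=lambda pair: COOCCURRENCE.get(pair, 0))


best_noun, best_verb = select_by_cooccurrence("have", "car")
print(f"{best_noun}-reul {best_verb}")  # cha-reul masi-da
```

Because the selector never consults the source-side sense of ‘car’, the tea reading of ‘cha’ contributes its counts and the ‘drink’ translation wins under these assumed counts.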

3. Translation Selection through Sense Disambiguation and Word Selection

As mentioned in the previous section, most previous methods for translation selection have tried to select a target word directly from all translations of a source word. In this section, we show the defect of the ‘word-to-word’ translation and propose a translation selection method based on the ‘word-to-sense and sense-to-word’ relationship. In our method, translations are selected by combining three measures for sense disambiguation and word selection.
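As a rough sketch of how the two levels might fit together, the following code scores each target word by a sense-level score multiplied by a word-level probability. The grouping of the Korean translations of ‘break’ under sense labels, all numeric scores, the weight `alpha`, and the particular combination (a weighted sum of the two sense measures times the word probability) are assumptions made for illustration; the actual computing model is the one described in Sec. 5.

```python
# Sense-to-word mapping for 'break'. The sense labels and the grouping of
# translations under them are assumed for illustration.
SENSE_TO_WORDS = {
    "smash": ["kkaetteuri-da", "busu-da"],
    "hurt": ["dachi-da"],
    "violate": ["eogi-da", "beomha-da"],
}


def translation_preference(sense_pref, sense_prob, word_prob, alpha=0.5):
    """Rank target words by (sense score) * (word probability).

    The weighted sum and the product are assumed stand-ins for the
    combination step, not the formula of the paper.
    """
    scores = {}
    for sense, words in SENSE_TO_WORDS.items():
        # Level 1: sense disambiguation from the two sense measures.
        sense_score = alpha * sense_pref[sense] + (1 - alpha) * sense_prob[sense]
        # Level 2: word selection among translations sharing that sense.
        for word in words:
            scores[word] = sense_score * word_prob[word]
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented scores for an input such as 'break a rule'.
sense_pref = {"smash": 0.2, "hurt": 0.1, "violate": 0.7}
sense_prob = {"smash": 0.3, "hurt": 0.1, "violate": 0.6}
word_prob = {
    "kkaetteuri-da": 0.6, "busu-da": 0.4,
    "dachi-da": 1.0,
    "eogi-da": 0.7, "beomha-da": 0.3,
}

ranking = translation_preference(sense_pref, sense_prob, word_prob)
print(ranking[0][0])  # eogi-da
```

The point of the sketch is the structure: a word can only score highly if its sense is supported by the source-side context, so a frequent target word attached to the wrong sense is penalized.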

3.1. ‘Word-to-word’ versus ‘word-to-sense and sense-to-word’ relationship

Most previous approaches to translation selection select a target word directly from a source word. This direct mapping can be referred to as the ‘word-to-translation’ or ‘word-to-word’ relationship. Figure 1 shows the ‘word-to-word’ relationship between the English verb ‘break’ and its Korean translations. Based on this direct relationship, previous approaches could easily obtain rules or statistics. Owing to the ‘word-to-word’ relationship, writing rules or lexicons in a transfer-based method becomes much easier than in an interlingua-based method or a knowledge-based method, because the latter need knowledge for deep semantic analysis of input sentences. Knowledge acquisition is further simplified by utilizing

[Figure 1. ‘Word-to-word’ relationship from ‘break’ to its Korean translations: kkaetteuri-da, busu-da, muneotteuri-da, sangcheohna-da, kkeokk-da, dachi-da, beomha-da, eogi-da, alli-da, chimipha-da, taljuha-da, …]




break [breik] v. (broke, broken) vt.
1 cause (something) to come to pieces by a blow or strain; destroy; crack; smash […]: ~ a cup […]; ~ a glass to [into] pieces […]; ~ a thread […]; ~ one's arm […]; ~ a stick in two […]; I heard the rope ~. […]; The river broke its bank. […]
2 hurt; injure […]: ~ the skin […]
3 put (something) out of order; make useless by rough handling, etc. […]: ~ a watch […]; ~ a line […]; ~ the peace […]
4 fail to keep or follow; act against (a rule, law, etc.); violate; disobey […]