Mach Translat (2006) 20:147–166 DOI 10.1007/s10590-007-9022-1 ORIGINAL PAPER

Improving phrase-based statistical machine translation with morphosyntactic transformation Thai Phuong Nguyen · Akira Shimazu

Received: 11 October 2006 / Accepted: 3 April 2007 / Published online: 6 September 2007 © Springer Science+Business Media B.V. 2007

Abstract  We present a phrase-based statistical machine translation approach which uses linguistic analysis in the preprocessing phase. The linguistic analysis includes morphological transformation and syntactic transformation. Since the word-order problem is solved using syntactic transformation, there is no reordering in the decoding phase. For morphological transformation, we use hand-crafted transformational rules. For syntactic transformation, we propose a transformational model based on a probabilistic context-free grammar. This model is trained using a bilingual corpus and a broad-coverage parser of the source language. This approach is applicable to language pairs in which the target language is poor in resources. We considered translation from English to Vietnamese and from English to French. Our experiments showed significant BLEU-score improvements in comparison with Pharaoh, a state-of-the-art phrase-based SMT system.

Keywords  Phrase-based statistical MT · Reordering · Preprocessing · Morphological transformation · Syntactic transformation · Pharaoh decoder · BLEU score

1 Introduction

In the field of statistical machine translation (SMT), several phrase-based SMT models (Och and Ney 2004; Marcu and Wong 2002; Koehn et al. 2003) have achieved state-of-the-art performance. These models have a number of advantages in comparison with the original IBM SMT models (Brown et al. 1993) such as word choice,

T. P. Nguyen (B) · A. Shimazu School of Information Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi-shi, Ishikawa 923-1292, Japan e-mail: [email protected]


Fig. 1 The architecture of our SMT system

idiomatic expression recognition, and local restructuring. These advantages result from moving from words to phrases as the basic unit of translation. Although phrase-based SMT systems have been successful, they have some potential limitations when it comes to modelling word-order differences between languages. The reason is that phrase-based systems make little or only indirect use of syntactic information. In other words, they are still “nonlinguistic”: tokens are treated as words, phrases can be any sequence of tokens (and are not necessarily phrases in any syntactic sense), and reordering models are based solely on movement distance (Och and Ney 2004; Koehn et al. 2003), not on phrase content. Another limitation is the sparse-data problem, because acquiring bitext is difficult and expensive. Since in phrase-based SMT differently inflected forms of the same word are often treated as different words, the problem is more serious when one or both of the source and target languages is an inflectional language.

In this paper, we describe our study for improving SMT by using linguistic analysis in the preprocessing phase to attack these two problems (Fig. 1). The word-order problem is solved by parsing source sentences and then transforming them into the target-language structure. After this step, the resulting source sentences have the word order of the target language. In the decoding phase, the decoder searches for the best target sentence without reordering source phrases. The sparse-data problem is solved by splitting the stem and the inflectional suffix of a word during translation. The preprocessing procedure, which carries out morphosyntactic transformations and which is applied to source sentences in both the training and testing phases, takes in a source sentence and performs the following five steps:

1. Parse the source sentence.
2. Transform the syntactic tree.
3. Analyze the words at leaf nodes morphologically into lemmas and suffixes.
4. Apply morphological transformation rules.
5. Extract the surface string.
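The five steps above can be sketched in miniature as follows. The canned parse tree, the NP reordering, and the single "-s" analysis rule are illustrative assumptions only, not the paper's actual parser, transformational model, or hand-crafted rules.

```python
# A toy sketch of the five-step preprocessing procedure. Trees are
# (label, children) pairs; preterminal children are (POS, [word]).

def parse(sentence):
    # Step 1: parse the source sentence (canned result for this example).
    return ("NP", [("JJ", ["famous"]), ("NN", ["places"])])

def transform(tree):
    # Step 2: transform into target (Vietnamese-like) word order:
    # premodifiers of an NP move after the head noun.
    label, children = tree
    if label == "NP":
        children = children[-1:] + children[:-1]
    return (label, children)

def analyze(word, pos):
    # Step 3: morphological analysis into (lemma, suffix); nouns in "-s" only.
    if pos == "NN" and word.endswith("s"):
        return (word[:-1], "+pl")
    return (word, "")

def apply_morph_rules(lemma, suffix):
    # Step 4: morphological transformation; the plural marker precedes
    # the noun in the target word order.
    return [suffix, lemma] if suffix else [lemma]

def extract_surface(tree):
    # Step 5: extract the surface string.
    tokens = []
    for pos, words in tree[1]:
        for w in words:
            tokens.extend(apply_morph_rules(*analyze(w, pos)))
    return " ".join(tokens)

print(extract_surface(transform(parse("famous places"))))  # "+pl place famous"
```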


For syntactic transformation, we propose a transformational model (Nguyen and Shimazu 2006) based on a probabilistic context-free grammar (CFG). The model’s knowledge is learned from bitext in which the source text has been parsed.1 The model can be applied to language pairs in which the target language is poor in resources in the sense that it lacks syntactically-annotated corpora and good syntactic parsers. In our experiments, we mainly considered translation from English to Vietnamese. Since there are significant differences between English and Vietnamese, this language pair is appropriate to demonstrate the effectiveness of the proposed method. We used Pharaoh (Koehn 2004) as a baseline phrase-based SMT system. Our experiments showed significant improvements in BLEU score (Papineni et al. 2001). We analyzed these experiments with respect to morphological transformation, syntactic transformation, and their combination. However, since our English–Vietnamese corpora are small, we carried out other experiments for the English–French language pair on the large Europarl corpus (Koehn et al. 2003). Within the range of data size with which we experimented, we found that improvements made by syntactic transformation over Pharaoh do not change significantly as the corpus scales up.2 Moreover, an SMT system with syntactic transformation needs shorter maximum phrase length to achieve the same translation quality as the baseline system. The rest of this paper is organized as follows: in Sect. 2, background information is presented. Section 3 describes the syntactic transformation. Section 4 presents the morphological transformation. Finally, Sect. 5 discusses our experimental results.

2 Background

2.1 Phrase-based SMT

The noisy channel model is the basic model for phrase-based SMT (Koehn et al. 2003), as in (1).

(1) argmax_e P(e|f) = argmax_e [P(f|e) × P(e)]

The model can be described as a generative story.3 First, a target-language sentence e is generated with probability P(e). Second, e is segmented into phrases e_1, ..., e_I (assuming a uniform probability distribution over all possible segmentations). Third, e is reordered according to a distortion model. Finally, source-language phrases f_i are generated under a translation model P(f|e) estimated from the bilingual corpus. Though other phrase-based models follow a joint distribution model (Marcu and Wong 2002), or use log-linear models (Och and Ney 2004), the basic architecture of phrase segmentation, phrase reordering, and phrase translation remains the same.

1 We use a statistical parser trained on the Penn Wall Street Journal treebank (Marcus et al. 1993).
2 Works which use morphological transformation for SMT have a property of vanishing improvement (Goldwater and McClosky 2005).
3 We follow the convention in Brown et al. (1993), designating the source-language sentence as f and the target-language sentence as e.
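As a minimal illustration of the argmax in (1), the sketch below scores candidate target sentences by P(f|e) × P(e). The probability tables are invented for illustration, not estimated from any corpus; a real translation model would itself decompose into phrase-level probabilities.

```python
# Toy noisy-channel selection, Eq. (1): pick the candidate e maximizing
# P(f|e) * P(e). All probabilities below are hypothetical.

def best_translation(f, candidates, tm_prob, lm_prob):
    return max(candidates, key=lambda e: tm_prob(f, e) * lm_prob(e))

tm = {("maison bleue", "blue house"): 0.4,    # P(f|e), hypothetical
      ("maison bleue", "house blue"): 0.5}
lm = {"blue house": 0.3, "house blue": 0.01}  # P(e), hypothetical

e_star = best_translation("maison bleue",
                          ["blue house", "house blue"],
                          lambda f, e: tm.get((f, e), 0.0),
                          lambda e: lm.get(e, 0.0))
print(e_star)  # "blue house": 0.4 * 0.3 beats 0.5 * 0.01
```

Note how the language model overrules the slightly better channel score of the ungrammatical candidate, which is exactly the division of labour the generative story describes.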


2.2 Approaches to exploiting syntactic information for SMT The first approach can be called “SMT by parsing”. Yamada and Knight (2001) proposed an SMT model that used syntax information in the target language alone. The model was based on a tree-to-string noisy channel model, and the translation task was transformed into a parsing problem. Melamed (2004) used synchronous CFGs for parsing both languages simultaneously. Melamed’s study showed that syntax-based SMT systems could be built using synchronous parsers. Chiang (2005) proposed a hierarchical phrase-based SMT model which was formally a synchronous CFG. His grammar was learned from bitext without any syntactic annotation. Since phrases were encoded directly in CFG rules (seen as hierarchical phrases), and there were only two nonterminal symbols, the grammar relied very much on lexical knowledge. His SMT system achieved better performance than the Pharaoh system (Koehn et al. 2003) on a Chinese–English translation task. Several other researchers have proposed translation models which incorporate dependency representation of the source and target languages (Ding and Palmer 2005; Quirk et al. 2005). These models required that input sentences be parsed by a dependency parser. An advantage in comparison with the phrase-based approach was that the dependency structure could capture nonlocal dependencies between words. Since it was difficult to specify reordering information within elementary dependency structures, a separate reordering model was used. Quirk et al. (2005) reported better BLEU scores than the Pharaoh system on an English–French translation task. Charniak et al. (2003) proposed an alternative approach to using syntactic information for SMT. The method employed an existing statistical parsing model as a language model within an SMT system. Experimental results showed improvements in accuracy over a baseline syntax-based SMT system. 
A fourth approach to the use of syntactic knowledge is to focus on the preprocessing phase. Xia and McCord (2004) proposed a preprocessing method to deal with the word-order problem. During the training of an SMT system, rewrite patterns were learned from bitext by employing both a source-language and a target-language parser. Then, at testing time, the patterns were used to reorder the source sentences so as to make their word order similar to that of the target language. The method achieved improvements over a baseline French–English SMT system. Collins et al. (2005) proposed reordering rules for restructuring German clauses. The rules were applied in the preprocessing phase of a German–English phrase-based SMT system. Their experiments showed that this method could also improve translation quality significantly. Our study differs from those of Xia and McCord (2004) and Collins et al. (2005) in several important respects. First, our transformational model is based on a probabilistic CFG, while neither of the previous studies used probability in their reordering method. Second, our transformational model is trained using bitext and only a source-language parser, while Xia and McCord (2004) employed parsers of both the source and target languages. Third, we use syntactic transformation in combination with morphological transformation. Last, we consider translation from English to Vietnamese and from English to French. Reranking (Shen et al. 2004; Och et al. 2004) is a frequently used postprocessing technique in SMT. However, most of the improvement in translation quality has come


from the reranking of nonsyntactic features, while the syntactic features have produced very small gains (Och et al. 2004).

2.3 Morphological analysis for SMT

According to our observations, most research on this topic has focused on preprocessing. Al-Onaizan et al. (1999) reported a study of Czech–English SMT which showed that improvements could be gained by utilizing morphological information. Some Czech processing tools, such as a morphological analyzer, part-of-speech (POS) tagger, and lemmatizer, are required. Ordinary Czech text can be changed in several different ways, including word lemmatization, attachment of morphological tags to lemmas, or the use of pseudowords. Each technique can be used separately or in combination with others to preprocess source texts before training or testing. Niessen and Ney (2004) presented a number of useful techniques using morphological and shallow syntactic information to improve German–English SMT. These techniques include separating German verb prefixes, splitting German compound words, annotating some frequent function words with POS tags, merging phrases, and treating unseen words using their less specific forms. Another study, exploiting morphological analysis for Arabic–English SMT, was reported in Lee (2004). The method required the morphological analysis of the Arabic text into morphemes, and the POS tagging of the bitext. After aligning the Arabic morphemes to English words, the system determined whether to keep each affix as a separate item, merge it back into the stem, or delete it. The choice of an appropriate operation relies on the consistency of the English POS tags that the Arabic morphemes are aligned to. Goldwater and McClosky (2005) recently used the techniques proposed in Al-Onaizan et al. (1999) for the Czech–English language pair, with some refinements, and analyzed the usefulness of each morphological feature. They also proposed a new word-to-word alignment model to use with the modified lemmas. Experimental results showed that the most significant improvement was achieved by combining the modified lemmas with the pseudowords.

2.4 Vietnamese language features

This section describes some phenomena specific to the Vietnamese language. The first is word segmentation. Like a number of other Asian languages such as Chinese, Japanese and Thai, Vietnamese has no word delimiter. The smallest unit in the construction of Vietnamese words is the syllable. A Vietnamese word can be a single word (one syllable) or a compound word (more than one syllable). A space is a syllable delimiter but not a word delimiter in Vietnamese. A Vietnamese sentence can often be segmented in many ways. Example (2) shows a Vietnamese sentence with two possible segmentations.4 Obviously, Vietnamese word segmentation is a nontrivial problem.

4 For clarity, in the following sections, we use the underscore character to connect the syllables of Vietnamese compound words.


(2) Học sinh học sinh học.
    a. Học_sinh  học    sinh_học
       pupil     learn  biology
    b. Học_sinh  học_sinh  học
       pupil     pupil     learn

The second phenomenon is morphology. Vietnamese is a noninflectional language. Most English inflected word forms can be translated into a Vietnamese phrase. First, the word form is analyzed morphologically into a lemma and an inflectional suffix. Then the lemma is translated into a Vietnamese word which is the head of the phrase, and the suffix into a Vietnamese function word which precedes/follows and modifies the head word. English derivative words often correspond to Vietnamese compound words. Example (3) shows a number of English words and their translations.

(3) a. books → book-s → những [plural] cuốn sách [book]
    b. working → work-ing → đang [-ing] làm việc [work]
    c. changeably → change-ably → thay đổi [change] được [-ably]

The third difference is word order. Vietnamese has an SVO sentence form similar to English (4a). However, wh-movement is significantly different between Vietnamese and English. In English, a wh-question starts with an interrogative word, while in Vietnamese, the interrogative word is not moved to the beginning of a wh-question (4b). In addition, most Vietnamese yes/no questions end with an interrogative word, while English yes/no questions do not (4c). In phrase composition, Vietnamese word order is quite different from English. The main difference is that in order to make an English phrase similar in word order to Vietnamese, we often have to move its premodifiers to follow the head word (4d).

(4) a. Tôi nhìn anh ấy.
       I   see  him
    b. Anh ấy yêu ai?
       he     love who
       ‘Who does he love?’
    c. Anh yêu cô ấy phải không?
       you love her  interrog
       ‘Do you love her?’
    d. quyển sách của bạn    anh ấy
       book       ’s  friend his
       ‘his friend’s book’

3 Syntactic transformation

One major difficulty in the syntactic transformation task is ambiguity. There can be many different ways to reorder a CFG rule. For example, the rule (5a) in English can become (5b) or (5c) in Vietnamese.5

5 NP: noun phrase, DT: determiner, JJ: adjective, NN: noun.


(5) a. NP → DT JJ NN
    b. NP → DT NN JJ
    c. NP → NN JJ DT

For the phrase a nice girl, the reordering in (5b) is most appropriate, while for the phrase this weekly radio, (5c) is correct. Lexicalization of CFG rules is one way to deal with this problem. We therefore propose a transformational model which is based on probabilistic decisions and also exploits lexical information.

3.1 Transformational model

Suppose that S is a given lexicalized tree of the source language (whose nodes are augmented to include a word and a POS label). S contains n applications of lexicalized CFG rules LHS_i → RHS_i, 1 ≤ i ≤ n. We want to transform S into the target-language word order by applying transformational rules to the CFG rules. A transformational rule is represented as (LHS → RHS, RS), a pair consisting of an unlexicalized CFG rule and a reordering sequence (RS). For example, the rule (6a) implies that the CFG rule (6b) in the source language can be transformed into the rule (6c) in the target language.

(6) a. (NP → JJ NN, 1 0)
    b. NP → JJ NN
    c. NP → NN JJ

Since the possible transformational rule for each CFG rule is not unique, there can be many transformed trees. The problem is how to choose the best one. Suppose that T is a possible transformed tree whose CFG rules are annotated as LHS_i → RHS'_i, the result of reordering LHS_i → RHS_i using a transformational rule (LHS_i → RHS_i, RS_i). Using the Bayes formula, we have (7).

(7) P(T|S) = P(S|T) × P(T) / P(S)

The transformed tree T* which maximizes the probability P(T|S) will be chosen. Since P(S) is the same for every T, and T is created by applying a sequence Q of n transformational rules to S, we can write (8).

(8) Q* = argmax_Q [P(S|T) × P(T)]

The probability P(S|T) can be decomposed into (9),

(9) P(S|T) = ∏_{i=1}^{n} P(LHS_i → RHS_i | LHS_i → RHS'_i)

where the conditional probability P(LHS_i → RHS_i | LHS_i → RHS'_i) is computed with the unlexicalized form of the CFG rules. Moreover, we set the constraint in (10).

(10) Σ_{RHS_i} P(LHS_i → RHS_i | LHS_i → RHS'_i) = 1

To compute P(T), a lexicalized probabilistic CFG (LPCFG) can be used. LPCFGs are sensitive to both structural and lexical information. Under an LPCFG, the probability of T is (11).

(11) P(T) = ∏_{i=1}^{n} P(LHS_i → RHS'_i)

Since application of a transformational rule only reorders the RHS symbols of a CFG rule, we can rewrite (8) as (12).

(12) Q* = { RS_i* : RS_i* = argmax_{RS_i} [P(LHS_i → RHS_i | LHS_i → RHS'_i) × P(LHS_i → RHS'_i)], i = 1, ..., n }

Suppose that a lexicalized CFG rule has the form (13),

(13) F(h) → L_m(l_m) ... L_1(l_1) H(h) R_1(r_1) ... R_k(r_k)

where F(h), H(h), R_i(r_i), and L_i(l_i) are all lexicalized nonterminal symbols; F(h) is the LHS symbol or parent symbol; h is the pair of a head word and its POS label; H is the head child symbol; and R_i(r_i) and L_i(l_i) are right and left modifiers of H. Either k or m may be zero, and both are zero in unary rules. Since the number of possible lexicalized rules is huge, direct estimation of P(LHS → RHS') is not feasible. Fortunately, some LPCFG models (Collins 1999; Charniak 2000) can compute a lexicalized rule's probability efficiently by using the rule-markovization technique (Collins 1999; Charniak 2000; Klein and Manning 2003). Given the LHS, the generation process of the RHS can be decomposed into three steps:

1. Generate the head constituent label with probability P_H = P(H|F, h).
2. Generate the right modifiers with probability (14),

   (14) P_R = ∏_{i=1}^{k+1} P(R_i(r_i)|F, h, H)

   where R_{k+1}(r_{k+1}) is a special symbol added to the set of nonterminal symbols. The grammar model stops generating right modifiers when this symbol is generated (STOP symbol).
3. Generate the left modifiers with probability (15),

   (15) P_L = ∏_{i=1}^{m+1} P(L_i(l_i)|F, h, H)

   where L_{m+1}(l_{m+1}) is the STOP symbol.

This is zeroth-order markovization (the generation of a modifier does not depend on previous generations); higher orders can be used if necessary. The probability of a lexicalized CFG rule now becomes (16).

(16) P(LHS → RHS') = P_H × P_R × P_L

The LPCFG which we used in our experiments is Collins' Grammar Model 1 (Collins 1999).
We implemented this grammar model with some linguistically motivated refinements for nonrecursive noun phrases, coordination, and punctuation (Collins 1999; Bikel 2004). We trained this grammar model on a treebank whose syntactic trees resulted from transforming source-language trees. In the next section, we will show how we induced this kind of data.
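The markovized decomposition in (14)–(16) can be sketched with toy conditional tables. The rule, the numbers, and the table layout are invented for illustration; Collins' Model 1 conditions on richer history (distance and subcategorization features) than this sketch does.

```python
# Zeroth-order markovized probability of a lexicalized rule, Eqs. (14)-(16):
# P(LHS -> RHS') = P_H * P_R * P_L, with STOP terminating each modifier list.

STOP = "STOP"

def rule_prob(parent, head_label, left_mods, right_mods, P_head, P_left, P_right):
    # parent = (label, (head word, POS)); modifiers are lexicalized symbols.
    p = P_head[(head_label, parent)]               # P(H | F, h)
    for r in right_mods + [STOP]:                  # Eq. (14), including STOP
        p *= P_right[(r, parent, head_label)]
    for l in left_mods + [STOP]:                   # Eq. (15), including STOP
        p *= P_left[(l, parent, head_label)]
    return p

NP_book = ("NP", ("book", "NN"))
P_head = {("NN", NP_book): 0.8}
P_right = {(STOP, NP_book, "NN"): 0.9}
P_left = {(("DT", ("the", "DT")), NP_book, "NN"): 0.5,
          (STOP, NP_book, "NN"): 0.6}

# NP(book) -> DT(the) NN(book): one left modifier, no right modifiers.
p = rule_prob(NP_book, "NN", [("DT", ("the", "DT"))], [], P_head, P_left, P_right)
print(round(p, 3))  # 0.8 * 0.9 * 0.5 * 0.6 = 0.216
```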


Fig. 2 Inducing transformational rules

3.2 Training

The required resources and tools include a bilingual corpus, a broad-coverage statistical parser of the source language, and a word-alignment program such as GIZA++ (Och and Ney 2000). First, the source text is parsed by the statistical parser. Then the source text and the target text are aligned in both directions using GIZA++. Next, for each sentence pair, source syntactic constituents and target phrases (which are sequences of target words) are aligned. From this hierarchical alignment information, transformational rules and a transformed syntactic tree are induced. Then the probabilities of transformational rules are computed. Finally, the transformed syntactic trees are used to train the LPCFG.

Figure 2 shows an example of inducing transformational rules for English–Vietnamese translation of the sentence (17a). The source and target sentences are in the middle of the figure, on the left. The source syntactic tree is at the upper left of the figure. The source constituents are numbered. Word links are represented by dotted lines. Words and aligned phrases of the target sentence are represented by lines (in the lower left of the figure) and are also numbered. Word alignment results, hierarchical alignment results, and induced transformational rules are in the lower right part of the figure. The transformed tree is at the upper right.

(17) a. Are computers basically simple devices?
     b. về cơ bản máy tính là một thiết bị đơn giản?
        basically  computer  be  plural  device  simple?

To determine the alignment of a source constituent, link scores between its span and all of the target phrases are computed using the formula in (18) (Xia and McCord 2004),

(18) score(s, t) = links(s, t) / (words(s) + words(t))

where s is a source phrase and t is a target phrase; links(s, t) is the total number of source words in s and target words in t that are aligned together; and words(s) and words(t) are, respectively, the numbers of words in s and t. A threshold is used to filter out bad alignment possibilities. After the link scores have been calculated, the target phrase with the highest link score, and which does not conflict with the already-chosen phrases, is selected. Two target phrases do not conflict if they are separate or if one contains the other. We assumed that there are only one-to-one links between source constituents and target phrases. We used a number of heuristics to deal with ambiguity. For source constituents whose span contains only one word which is aligned to many target words, we choose the best link based on the intersection of directional alignments and on word link score. When applying formula (18) to determine the alignment of a source constituent, if several target phrases had the same highest link score, we used an additional criterion:

– for every word outside s, there is no link to any word of t;
– for every word outside t, there is no link to any word of s.

Given a hierarchical alignment, a transformational rule can be computed for each constituent of the source syntactic tree. For a source constituent X with children X_0, ..., X_n and their aligned target phrases Y, Y_0, ..., Y_n (in which the Y_i are sorted increasingly according to the index of their first word), the conditions for inducing a transformational rule are as follows:

– the Y_i are adjacent to each other;
– Y contains Y_0, ..., Y_n but no other target phrase.

Suppose that f is a function in which f(j) = i if X_i is aligned to Y_j. If the conditions are satisfied, a transformational rule (X → X_0 ... X_n, f(0) ... f(n)) can be inferred. For example, in Fig. 2, the constituent SQ [9] has four children AUX_0 [1], NP_1 [6], ADVP_2 [7], and NP_3 [8]. Their aligned target phrases are Y [9], Y_0 [1], Y_1 [2], Y_2 [3], and Y_3 [8].6 From the alignment information 1–3, 6–2, 7–1, and 8–8, the function f is determined (19a).
Since the target phrases satisfy the above conditions, the transformational rule (19b) is induced.

(19) a. f(0) = 2, f(1) = 1, f(2) = 0, and f(3) = 3
     b. (SQ → AUX NP ADVP NP, 2 1 0 3)

For a sentence pair, after transformational rules have been induced, the source syntactic tree is transformed. Constituents which do not have a transformational rule remain unchanged (all constituents of the source syntactic tree in Fig. 2 have a transformational rule). Their corresponding CFG rule applications are marked as untransformed and are not used in training the LPCFG. The conditional probability for a pair of rules is computed using the maximum likelihood estimate (20).

(20) P(LHS → RHS | LHS → RHS') = Count(LHS → RHS, LHS → RHS') / Count(LHS → RHS')

In training the LPCFG, a larger number of parameter classes have to be estimated, such as the head parameter class, the modifying nonterminal parameter class, and the modifying terminal parameter class. Very useful details for implementing Collins' Grammar Model 1 are given in Bikel (2004).

6 For clarity, we use the Y symbol instead of the Vietnamese phrases.
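The link score (18), the induction of a reordering sequence as in (19), and the count-based estimate (20) can be sketched together as follows. All spans, links, and counts are invented toy data; the real procedure additionally applies the threshold, conflict checks, and heuristics described above.

```python
# Toy sketch of Eq. (18), rule induction as in (19), and the MLE in (20).
# Word alignments are (source_index, target_index) pairs; spans are
# (start, end) inclusive word positions.

from collections import Counter

def link_score(s_span, t_span, links):
    # Eq. (18): aligned word pairs inside both spans over total span length.
    inside = sum(1 for (i, j) in links
                 if s_span[0] <= i <= s_span[1] and t_span[0] <= j <= t_span[1])
    return inside / ((s_span[1] - s_span[0] + 1) + (t_span[1] - t_span[0] + 1))

def induce_rs(child_target_spans):
    # Sort the children's target phrases by their first word's index; the
    # resulting child indices are the reordering sequence f(0) ... f(n).
    return [i for i, _ in sorted(enumerate(child_target_spans),
                                 key=lambda p: p[1][0])]

links = [(0, 1), (1, 0)]                    # two crossing word links
print(link_score((0, 1), (0, 1), links))    # 2 / (2 + 2) = 0.5

# SQ -> AUX NP ADVP NP, with hypothetical target spans for each child:
print(induce_rs([(4, 4), (2, 3), (0, 1), (5, 7)]))   # [2, 1, 0, 3], as in (19)

# Eq. (20): P(source rule | transformed rule) from pair counts.
pairs = Counter({("NP -> JJ NN", "NP -> NN JJ"): 3,   # reordered by (1, 0)
                 ("NP -> NN JJ", "NP -> NN JJ"): 1})  # left unchanged
denom = sum(c for (src, tgt), c in pairs.items() if tgt == "NP -> NN JJ")
print(pairs[("NP -> JJ NN", "NP -> NN JJ")] / denom)  # 3 / 4 = 0.75
```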


3.3 Application

After it has been trained, the transformational model is used in Step 2 of the preprocessing procedure (Sect. 3.1) for an SMT system. Given a source syntactic tree, first the tree is lexicalized by associating each nonterminal node with a word and a POS label (computed bottom-up, through the head child).7 Next, the best sequence of transformational rules is computed by formula (12). Finally, by applying the transformational rules to the source tree, the best transformed tree is generated.

4 Morphological transformation

In this research, we restricted morphological analysis to inflectional phenomena.8 English inflected words were analyzed morphologically into a lemma and an inflectional suffix. Deeper analysis (such as segmenting a derived word into prefixes, stem, and suffixes) was not used. We experimented with two techniques (Al-Onaizan et al. 1999): the first lemmatizes English words (lemma transformation);9 the second treats inflectional suffixes as pseudowords (which normally correspond to Vietnamese function words) and reorders them into Vietnamese word order if necessary (pseudoword transformation), as shown in (21), where (21a) shows the source sentence, (21b) the lemmatized sentence, and (21c) the sentence with pseudowords.

(21) a. He has travelled to many famous places.
     b. He have travel to many famous place.
     c. He +sg3 have +pp travel to many famous +pl place.

Our morphological features are listed in full in Table 1. In the next section, we will describe our experimental results in two cases: morphological transformation alone (lemma or pseudoword), and in combination with syntactic transformation.

5 Experiments

5.1 Experimental settings

We carried out experiments on translation from English to Vietnamese and from English to French. For the first language pair, we used two small corpora: one collected from computer text books (named “Computer”) and the other collected from grammar books (named “Conversation”).
For the second language pair, we used the freely available Europarl corpus (Koehn et al. 2003). Data sets are described in Tables 2, 3, and 4. For quick experimental turnaround, we used only a part of the Europarl corpus for training. We created a test set by choosing sentences randomly from the common test part (Koehn et al. 2003) of this corpus.

7 For a CFG rule, the head symbol in the RHS was determined using heuristic rules (Collins 1999).
8 This is due to the morphological analyzer that we used (Sect. 5.1).
9 In the rest of the paper, the terms “lemma” and “lemma transformation” are used interchangeably.


Table 1 Morphological features

Feature  Description                              Vietnamese word order  Example
+pl      Plural noun                              +pl noun               +pl book
+sg3     3rd-person singular present-tense verb   +sg3 verb              +sg3 love
+ed      Past-tense verb                          +ed verb               +ed love
+ing     Present participle verb                  +ing verb              +ing love
+pp      Past participle verb                     +pp verb               +pp love
+er      Comparative adjective/adverb             adj/adv +er            small +er
+est     Superlative adjective/adverb             adj/adv +est           small +est
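The lemma and pseudoword transformations of Sect. 4 can be sketched for sentence (21) as follows. The tiny analyzer below (an irregular-form table plus one "-s" rule with an ad-hoc exception) is a crude illustrative stand-in for the rule-based analyzer of Pham et al. (2003), not a description of it.

```python
# Crude toy morphological analyzer, just enough to cover sentence (21).
# Real analysis needs a lexicon; these rules are illustrative assumptions.

IRREGULAR = {"has": ("have", "+sg3"), "travelled": ("travel", "+pp")}

def analyze(word):
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("s") and word not in ("famous",):  # ad-hoc exception
        return (word[:-1], "+pl")
    return (word, "")

def lemma_transform(sentence):
    # Lemma transformation (21b): keep lemmas, drop the suffixes.
    return " ".join(analyze(w)[0] for w in sentence.split())

def pseudoword_transform(sentence):
    # Pseudoword transformation (21c): keep suffixes as separate tokens,
    # placed in the Vietnamese order given in Table 1 (all features except
    # +er/+est precede their head word).
    out = []
    for w in sentence.split():
        lemma, suf = analyze(w)
        out.extend([suf, lemma] if suf else [lemma])
    return " ".join(out)

s = "He has travelled to many famous places"
print(lemma_transform(s))       # He have travel to many famous place
print(pseudoword_transform(s))  # He +sg3 have +pp travel to many famous +pl place
```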

Table 2 Corpora and data sets

Corpus        Sentence pairs  Training set  Dev test set  Test set
Computer      8,718           8,118         251           349
Conversation  16,809          15,734        403           672
Europarl      740,000         95,924        2,000         1,122

Table 3 Corpus statistics of the English–Vietnamese translation task

                         Computer              Conversation
                         English  Vietnamese   English  Vietnamese
Sentences                8,718    8,718        16,809   16,809
Average sentence length  20.0     21.5         8.5      8.0
Number of tokens         173,442  187,138      143,373  130,043
Number of types          8,829    7,145        9,314    9,557

Table 4 Corpus statistics of the English–French translation task

                         Training               Test
                         English    French      English  French
Sentences                95,924     95,924      1,122    1,122
Average sentence length  27.8       32.4        28.0     32.0
Number of tokens         2,668,158  3,109,276   31,448   36,072
Number of types          29,481     39,661      4,548    5,174

A number of tools were used in our experiments. Vietnamese sentences were segmented using a word-segmentation program (Nguyen et al. 2003). For learning phrase translations and decoding, we used Pharaoh (Koehn 2004), a state-of-the-art phrase-based SMT system which is available for research purposes. For word alignment, we used the GIZA++ tool (Och and Ney 2000). For learning language models, we used the SRILM toolkit (Stolcke 2002). For MT evaluation, we used the BLEU measure (Papineni et al. 2001) calculated by NIST script version 11b. For the parsing task, we used Charniak's (2000) parser and another program (Johnson 2002) for recovering empty nodes and their antecedents in syntactic trees. For morphological analysis, we used the rule-based morphological analyzer described in Pham et al. (2003).
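Since BLEU is central to the evaluation below, here is a compact single-reference sketch of the measure: the geometric mean of modified n-gram precisions, times a brevity penalty. The official NIST script additionally handles multiple references and corpus-level statistics that this toy omits.

```python
# Sentence-level BLEU sketch (Papineni et al. 2001): clipped n-gram
# precisions up to 4-grams, geometric mean, brevity penalty. Unsmoothed,
# so any zero precision makes the whole score zero.

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    c, r = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(c, n), ngrams(r, n)
        clipped = sum(min(count, ref[g]) for g, count in cand.items())
        total = max(sum(cand.values()), 1)
        if clipped == 0:
            return 0.0
        log_prec += math.log(clipped / total) / max_n
    bp = 1.0 if len(c) > len(r) else math.exp(1 - len(r) / max(len(c), 1))
    return bp * math.exp(log_prec)

print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 2))  # 1.0
```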


Table 5 Unlexicalized CFG rules (UCFGRs), transformational rule groups (TRGs), and ambiguous groups (AGs)

Corpus        UCFGRs  TRGs    AGs
Computer      4,779   3,702   951
Conversation  3,634   2,642   669
Europarl      14,462  10,738  3,706

5.2 Training the transformational model

The transformational model was trained on each corpus, resulting in a large number of transformational rules and an instance of Collins' Grammar Model 1. We restricted the maximum number of syntactic trees used for training the transformational model to 40,000. Table 5 shows the statistics resulting from learning transformational rules. On all three corpora, the number of transformational rule groups learned is smaller than the corresponding number of CFG rules. One reason is that many CFG rules appear only once or a few times, and their hierarchical alignments did not satisfy the conditions for inducing a transformational rule. Another reason is that some CFG rules required nonlocal transformation.[10]

[10] That is, transformation carried out by reordering subtrees rather than by reordering CFG rules. Fox (2002) investigated this phenomenon empirically for the French–English language pair. Knight and Graehl (2005) presented a survey of tree automata, which may be a useful way to deal with nonlocal transformation.

5.3 BLEU scores

In each experiment, we ran Pharaoh's trainer with its default settings. We then used Pharaoh's minimum-error-rate training script to tune the feature weights so as to maximize the system's BLEU score on the development test set. Experimental results on the test sets are shown in Table 6, which gives the BLEU scores of the Pharaoh system (baseline) and of systems formed by combining Pharaoh with various types of morphosyntactic preprocessing.

In the first row, the baseline score on the Computer corpus is higher than that on the Conversation corpus, owing to differences between the corpora. Rows 2 and 3 show the scores when morphological transformation was used; each is better than the corresponding baseline score. For the Computer corpus, the pseudoword score is higher than the lemma score; for the Conversation corpus, the opposite is the case. Since the Computer corpus (drawn from computer books) represents written language, its morphological features are translated quite closely into Vietnamese. In contrast, those features are translated more freely in most of the Conversation corpus, which represents spoken language. The elimination of morphological features (by lemma transformation) is therefore less harmful on the Conversation corpus than on the Computer corpus.

Row 4 shows the BLEU scores when syntactic transformation is used. On each corpus, the syntax score is higher than the baseline score, and also higher than the score achieved
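The BLEU scores reported throughout this section are corpus-level geometric means of modified n-gram precisions multiplied by a brevity penalty (Papineni et al. 2001). As a rough illustration of how such scores are computed, here is a minimal single-reference sketch (not the evaluation script used in the experiments):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with one reference per sentence.

    Assumes every modified n-gram precision is nonzero (true for
    reasonably large test sets); otherwise the log below fails.
    """
    match = [0] * max_n
    total = [0] * max_n
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        cand_len += len(cand)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            c_counts, r_counts = ngrams(cand, n), ngrams(ref, n)
            # Clipped ("modified") n-gram matches against the reference
            match[n - 1] += sum(min(c, r_counts[g]) for g, c in c_counts.items())
            total[n - 1] += max(len(cand) - n + 1, 0)
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    brevity = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity * math.exp(log_prec)
```

A candidate identical to its reference scores 1.0; any reordering or word change lowers one or more n-gram precisions and hence the score.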


T. P. Nguyen, A. Shimazu

Table 6 BLEU scores

                          Computer   Conversation   Europarl
1  Baseline                 47.00       35.47         26.41
2  Lemma                    46.88       36.19           –
3  Pseudoword               48.50       35.56           –
4  Syntax                   50.00       38.12         28.02
5  Lemma+syntax             50.03       38.76           –
6  Pseudoword+syntax        51.94       38.83           –

by the system with morphological transformation. The last two rows, 5 and 6, show the scores for the morphosyntactic combinations. The combination of lemma and syntax is not very effective: on both corpora the lemma+syntax score is only slightly higher than the score with syntax alone, and the lemma+syntax improvement is no better than the sum of the individual improvements. However, on both corpora, the improvement made by combining pseudoword and syntax is better than the sum of the individual improvements.

Table 6 also contains experimental results with the Europarl corpus (rows 1 and 4).[11] The improvement made by syntactic transformation is only 1.61%, whereas on the Vietnamese corpora the corresponding improvements are 3.00% and 2.65%. The differences between these values can be explained in two ways. First, since we are addressing the word-order problem, the improvement can be expected to be larger for language pairs that differ more in word order; to our knowledge, Vietnamese and English differ more in word order than French and English do. Second, by using phrases as the basic unit of translation, phrase-based SMT captures local reordering quite well when there is a large amount of training data.

5.4 Significance tests

To test the statistical significance of our results, we chose the sign test (Lehmann 1986).[12] We selected a significance level of p ≤ 0.05. The Computer test set was divided into 23 subsets (15 sentences per subset), and the BLEU metric was computed separately on each subset. The translation system with preprocessing was then compared to the baseline system over these subsets. For example, the system with pseudoword transformation had a higher score than the baseline system on 17 subsets, while the baseline system had a higher score on 6 subsets. With the chosen significance level of p ≤ 0.05 and 23 subsets, the critical value is 7, so we can state that the improvement made by the system with pseudoword transformation was statistically significant. The same experiments were carried out for the other systems (see Table 7).

[11] To our knowledge, English and French are both considered weakly inflected languages; therefore we did not use morphological transformation for this language pair.
[12] The sign test was also used by Collins et al. (2005).
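The sign-test decision above can be reproduced in a few lines. This is an illustrative sketch (the helper name `sign_test_p` is ours, not from the paper): it computes the two-sided exact binomial p-value for a given win/loss split, under the null hypothesis that each subset is equally likely to favour either system.

```python
import math

def sign_test_p(wins, losses):
    """Two-sided sign-test p-value: probability of a split at least this
    extreme when each of n = wins + losses subsets independently favours
    either system with probability 1/2 (ties discarded)."""
    n = wins + losses
    k = min(wins, losses)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Pseudoword vs. baseline on the Computer corpus: 17 wins, 6 losses
print(sign_test_p(17, 6))  # about 0.035, significant at p <= 0.05
```

Since the losing side won only 6 of 23 subsets, below the critical value of 7, the p-value falls under 0.05; a split such as 14 wins to 8 losses would not be significant.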


Phrase-based statistical machine translation with morphosyntactic transformation


Table 7 Sign tests: figures show the number of subsets on which the score was better than the baseline

                          Computer   Conversation   Europarl
Subsets                      23          22            22
Sentences per subset         15          30            51
Critical value                7           6             6
1  Lemma                     12          14             –
2  Pseudoword                17*         12             –
3  Syntax                    20*         20*           17*
4  Lemma+syntax              18*         21*            –
5  Pseudoword+syntax         21*         21*            –

*significant at p ≤ 0.05

In rows 1 and 2, only the improvement gained by pseudoword transformation on the Computer corpus is statistically significant; the other improvements achieved by morphological transformation are inconclusive. In contrast, all the improvements gained by syntactic transformation and by the morphosyntactic combinations are statistically significant (rows 3–5).

In addition to the tests reported so far, two further tests were carried out to verify the improvement of the pseudoword+syntax combination over syntax alone. Pseudoword+syntax scored higher than syntax alone on 18 (out of 23) subsets with the Computer corpus and on 16 (out of 22) subsets with the Conversation corpus. In both cases the improvement is statistically significant, so the combination is beneficial.

5.5 Some analyses of the performance of syntactic transformation

Figure 3 displays the individual n-gram precisions when syntactic transformation is used. Unigram precision increases less than the others. These numbers confirm that the translation quality of long phrases increases, and that syntactic transformation has a greater influence on word order than on word choice. In (22)–(24) we show some examples of better translations generated by the system using syntactic transformation. In each case, (a) shows the English source, (b) the reference translation (with literal gloss), (c) the baseline translation, and (d) the translation after syntactic transformation.

(22)

a. How many feet should I buy?
b. tôi nên mua bao nhiêu feet ?
   (I should buy how many feet)
c. bao nhiêu feet nên tôi mua được không ?
d. tôi nên mua bao nhiêu feet ?

(23)

a. Yeah, but I can't read all the characters.
b. đúng, nhưng tôi không thể đọc hết các ký tự.
   (yeah but I cannot read all plural character)
c. vâng, nhưng tôi không thể đọc được các ký tự quen thuộc.
d. vâng, nhưng tôi không thể đọc được các ký tự.


Fig. 3 N-gram precision

(24)

a. By pushing out pins in various combinations, the print head can create alphanumeric characters.
b. bằng việc đẩy các kim ra theo nhiều tổ hợp khác nhau, đầu in có thể tạo ra các ký tự chữ số và chữ cái
   (by -ing push plural pin out in many combination different, print-head can create plural character numeric and alphabetic)
c. kim trong những tổ hợp khác nhau, dấu in có thể tạo ra các ký tự chữ cái và chữ số muốn ra bởi.
d. bởi việc đẩy các kim ra theo những tổ hợp khác nhau, dấu in có thể tạo ra các ký tự chữ số và chữ cái.

A limitation of the syntactic transformation model is that it cannot handle nonlocal transformation. Handling this kind of transformation could improve the BLEU score further. We give a linguistically motivated example here. In English–Vietnamese translation, wh-movement is a nonlocal transformation (see Sect. 2.4). In a syntactic tree with an SBARQ symbol at the root, the wh-constituent (WHNP, WHADJP, WHPP, or WHADVP) is always co-indexed with a null element. The tree is therefore transformed by removing the wh-constituent from its SBARQ mother and substituting it for its co-indexed null element, as in (25):

(25) (SBARQ (WHNP-1 (WP what)) (SQ (AUX 's) (NP (DT the) (NN baby)) (VP (VBG doing) (NP (-NONE- *-1)))) (. ?))
→ (SBARQ (SQ (AUX 's) (NP (DT the) (NN baby)) (VP (VBG doing) (WHNP-1 (WP what)))) (. ?))

We used this "rule-based" technique in combination with normal syntactic transformation on the Conversation corpus. The BLEU score improved from 38.12% to


Table 8 Effect of maximum phrase length on translation quality (BLEU score)

Maximum phrase length          2       3       4       5       6
Pharaoh                      21.71   24.84   25.74   26.19   26.41
Syntactic transformation     24.10   27.01   27.74   27.88   28.02

Table 9 Effect of training-set size on translation quality (BLEU score)

Training-set size            10K     20K     40K     80K     94K
Pharaoh                     21.84   23.35   24.43   25.43   25.74
Syntactic transformation    23.65   25.67   26.86   27.52   27.74

38.98% (the baseline score was 35.47%). This example suggests that there is room for improving translation quality by dealing with nonlocal transformation.
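The wh-movement rewrite in (25) is straightforward to express as a tree transformation. The sketch below is ours, not the paper's implementation: trees are encoded as nested lists (a label followed by children, with leaves as [POS, word] pairs), and the function names and encoding are illustrative assumptions.

```python
# Sketch of the "rule-based" wh-movement step: under an SBARQ node,
# remove the co-indexed wh-constituent and substitute it for the node
# dominating its co-indexed null element (-NONE- *-i).

def is_leaf(node):
    return len(node) == 2 and isinstance(node[1], str)

def move_wh(tree):
    """Recursively apply wh-movement to every SBARQ subtree."""
    if is_leaf(tree):
        return tree
    label, children = tree[0], tree[1:]
    if label == "SBARQ":
        # Find a co-indexed wh-constituent, e.g. WHNP-1
        wh = next((c for c in children
                   if c[0].startswith("WH") and "-" in c[0]), None)
        if wh is not None:
            index = wh[0].rsplit("-", 1)[1]
            rest = [c for c in children if c is not wh]
            children = [replace_trace(c, "*-" + index, wh) for c in rest]
    return [label] + [move_wh(c) for c in children]

def replace_trace(node, trace, wh):
    """Replace the node dominating the matching null element with wh."""
    if is_leaf(node):
        return node
    kids = node[1:]
    if (len(kids) == 1 and is_leaf(kids[0])
            and kids[0][0] == "-NONE-" and kids[0][1] == trace):
        return wh
    return [node[0]] + [replace_trace(c, trace, wh) for c in kids]
```

Applied to the tree on the left of (25), this yields the tree on the right: the WHNP lands in object position inside the VP, matching Vietnamese word order.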

5.6 Maximum phrase length

Table 8 displays the performance of the baseline SMT system and of the SMT system with syntactic transformation for various maximum phrase lengths (using the Europarl data sets). The translation quality of both systems improves as the maximum phrase length increases. The second system achieves high performance with a short maximum phrase length, while the first requires a longer maximum phrase length to reach similar performance. The improvement of the SMT system with syntactic transformation over the baseline decreases slightly as the maximum phrase length increases. This experiment suggests two conclusions. First, a maximum phrase length of three or four is sufficient for the SMT system with syntactic transformation. Second, the baseline SMT system relies on long phrases to solve the word-order problem, while the other system uses syntactic transformation for that purpose.

5.7 Training-set size

In this section, we report BLEU scores and decoding times for various training-set sizes (in sentence pairs). In this experiment, we used the Europarl data sets and a maximum phrase length of four. Table 9 shows an improvement in BLEU score of about 2% at every training-set size; that is, the improvement over Pharaoh does not decrease significantly as the training set scales up. Note that improvements from morphological analysis in SMT have been observed to vanish as training data grows (Goldwater and McClosky 2005). Table 10 shows that, for all training sets, the decoding time of the SMT system with syntactic transformation is about 5–6% of that of the Pharaoh system. This is an advantage of monotone decoding, and the time saved helps offset the cost of syntactic analysis and transformation.


Table 10 Effect of training-set size on decoding time (seconds/sentence)

Training-set size            10K    20K    40K    80K    94K
Pharaoh                     1.98   2.52   2.93   3.45   3.67
Syntactic transformation    0.10   0.13   0.16   0.19   0.22

6 Conclusion

We have presented an approach to incorporating linguistic analysis, including syntactic parsing and morphological analysis, into SMT. Through experiments, we have demonstrated that this approach can significantly improve phrase-based SMT. For syntactic transformation, we have proposed a transformational model based on a probabilistic CFG and a technique for inducing transformational rules from source-parsed bitext. Our method can be applied to language pairs in which the target language is poor in resources. We have carried out various experiments with the English–Vietnamese and English–French language pairs. For English–Vietnamese translation, we have shown that combining morphological and syntactic transformation achieves a better result than either individually. On the Europarl corpus, within the range of data sizes with which we have experimented, we have found that the improvement made by syntactic transformation over Pharaoh does not change significantly as the corpus scales up. Moreover, a phrase-based SMT system with syntactic transformation needs a shorter maximum phrase length to achieve the same translation quality as the conventional phrase-based system.

In the future, we would like to apply this approach to other language pairs in which the differences in word order are greater than those of English–Vietnamese and English–French. We would also like to extend the transformational model to deal with nonlocal transformations.

Acknowledgements This work was supported by the COE project. We would like to thank Dr Dinh Dien and his VCL (Vietnamese Computational Linguistics) group, who gave us permission to use the Computer corpus, a part of their English–Vietnamese Corpus. The Conversation corpus was collected by a data group of Lac Viet Computing Corporation in Vietnam. We would like to thank Philipp Koehn for the use of the Pharaoh software. We would also like to thank colleagues in the Natural Language Processing Laboratory at JAIST: Dr Nguyen Le Minh, Dr Le Anh Cuong, Mr Nguyen Tri Thanh, and Mr Nguyen Van Vinh for their helpful discussions.

References

Al-Onaizan Y, Curin J, Jahr M, Knight K, Lafferty J, Melamed D, Och F-J, Purdy D, Smith NA, Yarowsky D (1999) Statistical machine translation. Final report, JHU Summer Workshop 1999, Johns Hopkins University, Baltimore, MD
Bikel DM (2004) Intricacies of Collins' parsing model. Comput Ling 30:479–511
Brown PF, Della Pietra SA, Della Pietra VJ, Mercer RL (1993) The mathematics of statistical machine translation. Comput Ling 22:39–69
Charniak E (2000) A maximum-entropy-inspired parser. In: 1st Meeting of the North American Chapter of the Association for Computational Linguistics, Seattle, Washington, pp 132–139


Charniak E, Knight K, Yamada K (2003) Syntax-based language models for statistical machine translation. In: MT Summit IX: Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp 40–46
Chiang D (2005) A hierarchical phrase-based model for statistical machine translation. In: 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, Michigan, pp 263–270
Collins M (1999) Head-driven statistical models for natural language parsing. PhD thesis, University of Pennsylvania, Philadelphia, PA
Collins M, Koehn P, Kučerová I (2005) Clause restructuring for statistical machine translation. In: 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, Michigan, pp 531–540
Ding Y, Palmer M (2005) Machine translation using probabilistic synchronous dependency insertion grammars. In: 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, Michigan, pp 541–548
Fox H (2002) Phrasal cohesion and statistical machine translation. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (EMNLP-02), Philadelphia, PA, pp 304–311
Goldwater S, McClosky D (2005) Improving statistical MT through morphological analysis. In: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, BC, pp 676–683
Johnson M (2002) A simple pattern-matching algorithm for recovering empty nodes and their antecedents. In: 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, pp 136–143
Klein D, Manning CD (2003) Accurate unlexicalized parsing. In: 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, pp 423–430
Knight K, Graehl J (2005) An overview of probabilistic tree transducers for natural language processing. In: Gelbukh AF (ed) Computational linguistics and intelligent text processing, 6th International Conference CICLing 2005, Mexico City, Mexico. Springer, Berlin, Germany, pp 1–24
Koehn P (2004) Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In: Frederking RE, Taylor KB (eds) Machine translation: from real users to research, 6th Conference of the Association for Machine Translation of the Americas, AMTA 2004, Washington, DC. Springer, Berlin, pp 115–124
Koehn P, Och FJ, Marcu D (2003) Statistical phrase-based translation. In: HLT-NAACL: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Edmonton, Alberta, Canada, pp 127–133
Lee Y-S (2004) Morphological analysis for statistical machine translation. In: HLT-NAACL 2004: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Boston, Massachusetts, Companion Volume: Short Papers, Student Research Workshop, Demonstrations, Tutorial Abstracts, pp 57–60
Lehmann EL (1986) Testing statistical hypotheses, 2nd edn. Springer, Berlin, Germany
Marcu D, Wong W (2002) A phrase-based, joint probability model for statistical machine translation. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (EMNLP-02), Philadelphia, PA, pp 133–139
Marcus MP, Santorini B, Marcinkiewicz MA (1993) Building a large annotated corpus of English: the Penn Treebank. Comput Ling 19:313–330
Melamed ID (2004) Statistical machine translation by parsing. In: 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, pp 653–660
Nguyen TP, Nguyen VV, Le AC (2003) Vietnamese word segmentation using Hidden Markov Model. In: Proceedings of the International Workshop for Computer, Information, and Communication Technologies in Korea and Vietnam, Hanoi, Vietnam, pp 13–17
Nguyen TP, Shimazu A (2006) Improving phrase-based SMT with morpho-syntactic analysis and transformation. In: AMTA 2006: Proceedings of the 7th Conference of the Association for Machine Translation of the Americas: Visions for the Future of Machine Translation, Cambridge, Massachusetts, pp 138–147
Niessen S, Ney H (2004) Statistical machine translation with scarce resources using morpho-syntactic information. Comput Ling 30:181–204
Och FJ, Gildea D, Khudanpur S, Sarkar A, Yamada K, Fraser A, Kumar S, Shen L, Smith D, Eng K, Jain V, Jin Z, Radev D (2004) A smorgasbord of features for statistical machine translation. In: HLT-NAACL 2004: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Boston, Massachusetts, pp 161–168


Och FJ, Ney H (2000) Improved statistical alignment models. In: 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong, China, pp 440–447
Och FJ, Ney H (2004) The alignment template approach to statistical machine translation. Comput Ling 30:417–449
Papineni KA, Roukos S, Ward T, Zhu WJ (2001) BLEU: a method for automatic evaluation of machine translation. Technical report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY
Pham NH, Nguyen LM, Le AC, Nguyen PT, Nguyen VV (2003) LVT: an English–Vietnamese machine translation system. In: First National Conference on Fundamental and Applied Research in Information Technology (FAIR-03), Hanoi, Vietnam, pp 173–180
Quirk C, Menezes A, Cherry C (2005) Dependency treelet translation: syntactically informed phrasal SMT. In: 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, Michigan, pp 271–279
Shen L, Sarkar A, Och FJ (2004) Discriminative reranking for machine translation. In: HLT-NAACL 2004: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Boston, Massachusetts, pp 177–184
Stolcke A (2002) SRILM: an extensible language modeling toolkit. In: International Conference on Spoken Language Processing, Denver, Colorado, pp 901–904
Xia F, McCord M (2004) Improving a statistical MT system with automatically learned rewrite patterns. In: 20th International Conference on Computational Linguistics, Geneva, Switzerland, pp 508–514
Yamada K, Knight K (2001) A syntax-based statistical translation model. In: Association for Computational Linguistics 39th Annual Meeting and 10th Conference of the European Chapter, Toulouse, France, pp 523–529
