EVALUATION OF QUALITY METRICS FOR MACHINE TRANSLATION: A CASE STUDY OF ENGLISH-HINDI MACHINE TRANSLATION

TRIVENI LAL PAL1, KAMLESH DUTTA2
NATIONAL INSTITUTE OF TECHNOLOGY HAMIRPUR, INDIA
1 [email protected], [email protected]

ABSTRACT

Though human judgments are considered the gold standard for quality evaluation of machine translation systems, automatic metrics have received significant attention in recent years because human judgments are both labor intensive and time consuming. In this paper we present a brief survey of the major works in the field (for Hindi only). Our focus is to analyze the suitability of different tools for English-Hindi translation systems, with particular attention to pronominal anaphora resolution. A case study of two publicly available English-Hindi translation systems, 'Google' and 'MaTra2', is presented. Different tools have been considered as quality metrics. We have taken one hundred bilingual (English and Hindi) sentences containing pronouns from different sources such as news, blogs and government organizations, and applied the quality metrics to the Google and MaTra2 machine translations (as candidate translations). Most types of pronouns (demonstrative, reflexive, possessive, etc.) have been considered in the analysis. The study shows that the translations produced by Google are better with respect to discourse, while MaTra2 handles case markers better. The study also reflects that BLEU (with suitable modifications) and METEOR are better quality metrics for Hindi, with METEOR somewhat better suited to morphologically rich, free word order languages such as Hindi.

Keywords: Candidate Sentence, Machine Translation, Precision, Quality Metric, Reference Translation

INTRODUCTION

Humans are native speakers of natural languages and the end users of any translation output, so their judgment is the benchmark against which any automatic metric is assessed. Unlike quality evaluation methods that involve human judgments, automated tools provide rapid and repeatable feedback at relatively low cost. An automated quality metric produces a numerical value representing the goodness of a translation. Evaluation of a quality metric for machine translation is, in general, a matter of correlation with human judgment: the metric computes a numerical score for a translated sentence (the candidate sentence), and this score is then correlated with the human judgment for the same sentence. The correlation coefficient thus obtained represents how close the metric is to human judgment. Banerjee describes correlation, consistency, sensitivity, generality and reliability as five attributes that a good automatic metric must possess (Banerjee et al., 2005). Correlation means that, for a metric to be good, it must be highly correlated with human judgment. It must be sensitive to differences between machine translations, even small ones. Consistency means that it must give similar results for a given machine translation system on similar text. The metric must also be general, in the sense that it works for text from different domains.

Dutta et al. (2009) highlight the importance of anaphora resolution for machine translation by evaluating three existing machine translation systems: AnglaHindi by IIT Kanpur, MaTra2 by CDAC Mumbai, and the Google translation system. They also highlight the role of pronominal divergence in machine translation and conclude that "pronominal divergence can help in identifying anaphoric and non-anaphoric occurrences of pronoun". Dutta et al. (2009) further show that machine translation is affected by improper anaphora resolution, and point out specific issues in the different translation systems.

Google Translate, on the one hand, is unable to resolve the ambiguity between nominative and ergative forms of subject pronouns; MaTra2, on the other hand, fails to specify the correct forms of pronouns occurring in the object position. AnglaHindi has problems choosing the correct reflexive pronoun. In this paper, we extend the study performed by Dutta et al. (2009): we present different tools for evaluating translation quality and illustrate them with examples. We also compute numerical quality scores for two English-Hindi translation systems (Google and MaTra2) using the BLEU and METEOR quality metrics, and analyze the outputs with pronominal anaphora in mind. Where a tool cannot be implemented for English-Hindi translation systems, we try to identify the possible reason(s). This paper is divided into two main parts. A literature survey of existing tools for quality evaluation is presented in the first part. The second part concentrates on a case study of different English-Hindi translation systems and presents the corresponding results of the comparative study.

EXISTING TOOLS FOR QUALITY EVALUATION

In this section we present some existing automatic tools for quality evaluation of translation systems. We restrict ourselves to the tools that are, in one way or another, suitable for Hindi, and we discuss their suitability for Hindi.

The BLEU

BLEU is the simplest and most popular metric for quality evaluation. It was developed for English and was the first to report high correlation with human judgment. It is based on the premise that "the closer a machine translation is to a professional human translation, the better it is" (Papineni et al., 2002). The metric calculates n-gram precision. Precision is computed at the sentence level and then averaged over the whole corpus to obtain the overall (final) score. At the corpus level, it shows high correlation with human judgment (Culy et al., 2003; Papineni et al., 2002). There can be many plausible machine translations of a given source sentence; they may differ in word choice, or in word order (in free word order languages) even when the same words are chosen. This fundamental problem of multiple legitimate translations led IBM researchers to develop, in 2001, the BLEU (BiLingual Evaluation Understudy) evaluation measure, which provides a partial solution to it (Papineni et al., 2002). The baseline BLEU metric considers more than one reference translation. The primary task in implementing BLEU is to compare the n-grams of the candidate translation with the n-grams of the reference translations and count the number of matches. These matches are position independent; the more matches, the better the translation (Papineni et al., 2002). To compute precision, one simply counts the number of candidate translation words (unigrams) that occur in any reference translation and divides by the total number of words in the candidate translation. Since BLEU precision depends entirely on the number of candidate word matches, over-generation of reasonable words must be handled, otherwise a poor translation may still receive high precision. The IBM researchers take care of this by considering a reference word exhausted once a matching candidate word has been found.
They call this modified n-gram precision (Papineni et al., 2002). To address word order (if the words are not ordered according to the grammar of the language, the translation will be absurd even though the BLEU precision may be high), the researchers compute n-gram matches for increasing values of n. The value of n influences the precision score: modified n-gram precision decays roughly exponentially with n, so the modified unigram precision is much larger than the modified bigram precision, which in turn is much larger than the modified trigram precision. Finally, the geometric mean of the precisions is taken so that the precisions at different n can be averaged. A brevity penalty is also introduced to penalize short sentences (since short sentences may yield absurdly high precision). Finally, the BLEU score takes the form (Papineni et al., 2002):

B = PF · exp( Σ_n x_n · log p_n )    (1)

where PF is the brevity penalty factor, x_n is the weight of the n-gram, and p_n is the precision of the n-gram.
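To make the computation concrete, the following is a minimal Python sketch of the modified n-gram precision, brevity penalty and final score described above. It is an illustration of the general BLEU recipe, not the authors' implementation or the official scoring tool; the uniform weights, the handling of a zero precision, and the transliterated example tokens are simplifying assumptions made here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each reference n-gram can be 'used' only once."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For every candidate n-gram, the clip is its maximum count over all references.
    max_ref_counts = Counter()
    for ref in references:
        for gram, cnt in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], cnt)
    clipped = sum(min(cnt, max_ref_counts[gram]) for gram, cnt in cand_counts.items())
    return clipped / sum(cand_counts.values())

def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(candidate, references, max_n=4):
    """B = PF * exp(sum_n x_n log p_n) with uniform weights x_n = 1/max_n."""
    weights = [1.0 / max_n] * max_n
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:        # geometric mean collapses to zero
        return 0.0
    log_sum = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty(candidate, references) * math.exp(log_sum)

# Example with hypothetical transliterated tokens (for readability only).
cand = "main apni mej par use pustak dekha".split()
refs = ["maine tumhari mej par uski pustak dekhi".split()]
print(round(bleu(cand, refs, max_n=2), 3))
```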

Implementation of BLEU for English-Hindi Translation Systems

The BLEU metric is a modified n-gram precision measure which uses a number of reference translations to measure the goodness of a candidate translation. However, BLEU cannot be applied to Hindi as such, for two reasons. First, like other Indian languages, Hindi does not have much material in electronic form, so the multiple-reference requirement of BLEU cannot be fulfilled. Secondly, Hindi is a free word order language, unlike English, and therefore does not possess a fixed syntactic structure. Chatterjee et al. made some improvements over the original BLEU metric so that it can be applied to quality assessment of English-Hindi translation systems (Chatterjee et al., 2007). First, the authors identify some fixed syntactic structure despite the free word order nature of Hindi. For this they find groups of words (such as noun/pronoun + parsarg, verb group, adjective group + noun/pronoun, adverb + verb sequence) with fixed internal order: within each group the words occur in a fixed order, while the groups themselves may permute freely in a given sentence. The identification of these word groups is important because almost all permutations of these word groups are allowed in standard Hindi without changing the core meaning of the sentence. This imposes some order on a valid Hindi sentence and can therefore be used to identify grammatically acceptable sentences, corresponding to the familiar notion of translation fluency. Secondly, they use a single reference to work around the unavailability of multiple references in Hindi, and to compensate for not using multiple references they incorporate recall measures.
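As a rough illustration of the word-group idea (not Chatterjee et al.'s actual algorithm), the sketch below chunks a transliterated sentence into local word groups using a hypothetical parsarg list, keeps the internal order of each group, and compares the groups as an order-insensitive multiset, so that group permutations do not lower the score.

```python
from collections import Counter

# Hypothetical parsarg (postposition) list, used only for this illustration.
PARSARGS = {"ne", "ko", "se", "ka", "ki", "ke", "mein", "par"}

def local_word_groups(tokens):
    """Greedily attach a following parsarg to the preceding word,
    keeping the internal order of each group fixed."""
    groups, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i + 1] in PARSARGS:
            groups.append((tokens[i], tokens[i + 1]))
            i += 2
        else:
            groups.append((tokens[i],))
            i += 1
    return groups

def group_precision(candidate, reference):
    """Fraction of candidate groups that also occur in the reference,
    irrespective of where each group appears in the sentence."""
    cand = Counter(local_word_groups(candidate))
    ref = Counter(local_word_groups(reference))
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    return matched / max(1, sum(cand.values()))

# The two orderings below differ only by group permutation, so the score is 1.0.
ref = "ram ne mohan ko kitab di".split()
cand = "mohan ko ram ne kitab di".split()
print(group_precision(cand, ref))
```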

The NIST

The NIST metric is derived from the BLEU metric, with some modifications. BLEU gives equal weight to each n-gram when calculating n-gram precision, without considering how informative the n-gram is. NIST, in contrast, also takes into account how informative a particular n-gram is: a rarer n-gram that is correctly matched receives more weight than a frequently occurring one. For example, the bigram "reference requirement" is given more weight than the bigram "of the" when both are correctly matched.
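The information weight that NIST assigns to an n-gram can be sketched as below. The log2 ratio of the (n-1)-gram count to the n-gram count follows the usual description of NIST, but the tiny corpus, the unigram convention and the code as a whole are an illustrative approximation made here, not the official NIST scoring tool.

```python
import math
from collections import Counter

def ngram_counts(corpus_sentences, n):
    """Counts of all n-grams over a list of whitespace-tokenized sentences."""
    counts = Counter()
    for sent in corpus_sentences:
        toks = sent.split()
        for i in range(len(toks) - n + 1):
            counts[tuple(toks[i:i + n])] += 1
    return counts

def info_weight(ngram, corpus_sentences):
    """log2( count of the (n-1)-gram prefix / count of the full n-gram ):
    rare n-grams with frequent prefixes receive a higher weight."""
    n = len(ngram)
    full = ngram_counts(corpus_sentences, n)[ngram]
    if full == 0:
        return 0.0
    if n == 1:  # convention used here: weight unigrams against the total token count
        total = sum(len(s.split()) for s in corpus_sentences)
        return math.log2(total / full)
    prefix = ngram_counts(corpus_sentences, n - 1)[ngram[:-1]]
    return math.log2(prefix / full)

corpus = ["the reference requirement of the metric",
          "the reference translation of the corpus",
          "the weight of the n gram"]
print(info_weight(("reference", "requirement"), corpus))  # 1.0: rarer bigram, frequent prefix
print(info_weight(("of", "the"), corpus))                 # 0.0: bigram as frequent as its prefix
```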

The METEOR

The BLEU metric calculates its score on the basis of n-gram precision only and ignores recall. METEOR (Metric for Evaluation of Translation with Explicit ORdering) also considers unigram recall in calculating the score. Lavie et al. (2007) show that metrics based on recall achieve higher correlation with human judgment than metrics based on precision alone, such as BLEU and NIST. The metric uses a weighted harmonic mean of the unigram precision and unigram recall scores. METEOR has other features as well, such as stem matching and synonym matching: a word is considered matched if it matches a synonym rather than the exact word form. For example, the word "score" matched against its synonym "mark" in the translation is counted as a match. Banerjee presented METEOR with improved correlation with human judgments (Banerjee et al., 2005) and discussed various issues with BLEU: lack of recall (the fixed brevity penalty in BLEU does not adequately compensate for the lack of recall), use of higher-order n-grams (only useful for languages with a fixed sentence structure), lack of explicit word matching between translation and reference, and use of geometric averaging of n-grams (the BLEU score is zero whenever one of the component n-gram scores is zero).
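A minimal sketch of METEOR-style scoring is given below: the recall-weighted harmonic mean of unigram precision and recall and the chunk-based fragmentation penalty follow the commonly cited description of the metric, but the exact-match-only greedy alignment (no stemming or synonym lookup) and the hard-coded parameters are simplifications made here, not the released METEOR system.

```python
def meteor_like_score(candidate, reference, alpha=0.9, beta=3.0, gamma=0.5):
    """Exact-match-only approximation of a METEOR score:
    F-mean of unigram P/R, discounted by a fragmentation penalty."""
    # Greedy one-to-one exact alignment between candidate and reference tokens.
    ref_used = [False] * len(reference)
    alignment = []  # (candidate index, reference index)
    for ci, tok in enumerate(candidate):
        for ri, rtok in enumerate(reference):
            if not ref_used[ri] and tok == rtok:
                ref_used[ri] = True
                alignment.append((ci, ri))
                break
    m = len(alignment)
    if m == 0:
        return 0.0
    precision = m / len(candidate)
    recall = m / len(reference)
    # Recall-weighted harmonic mean of precision and recall.
    fmean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # Chunks: maximal runs of matches that are contiguous in both strings.
    alignment.sort()
    chunks = 1
    for (c1, r1), (c2, r2) in zip(alignment, alignment[1:]):
        if c2 != c1 + 1 or r2 != r1 + 1:
            chunks += 1
    penalty = gamma * (chunks / m) ** beta
    return fmean * (1 - penalty)

# Example with hypothetical transliterated tokens (for readability only).
cand = "main apni mej par use pustak dekha".split()
ref = "maine apni mej par uski pustak dekhi".split()
print(round(meteor_like_score(cand, ref), 3))
```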

Implementation of METEOR for English-Hindi Translation Systems

Gupta et al. presented METEOR-Hindi, an automatic evaluation metric for machine translation with Hindi as the target language, by making appropriate changes to METEOR's alignment algorithm and scoring technique (Gupta et al., 2010). The authors discuss various problems in applying the BLEU metric to languages that are morphologically rich and have relatively free word order, like Hindi. They calculate precision and recall scores for Exact Match, Stem Match, Synonym Match, LWG, POS and Clause, and then combine these scores as a weighted linear sum to obtain METEOR-Hindi.
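The combination step can be pictured as a weighted linear sum over the module-level scores. The module names below follow Gupta et al. (2010), but the per-module scores and the weights are placeholders invented for this sketch, not the values used in that paper.

```python
# Hypothetical per-module F-scores for one candidate sentence (placeholders only).
module_scores = {
    "exact":   0.62,
    "stem":    0.08,
    "synonym": 0.05,
    "lwg":     0.10,   # local word group match
    "pos":     0.07,
    "clause":  0.04,
}

# Hypothetical weights summing to 1.0; the actual weights are tuned by the authors.
module_weights = {
    "exact": 0.40, "stem": 0.15, "synonym": 0.10,
    "lwg": 0.15, "pos": 0.10, "clause": 0.10,
}

meteor_hindi_like = sum(module_weights[m] * s for m, s in module_scores.items())
print(round(meteor_hindi_like, 4))
```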

CASE STUDY OF ENGLISH-HINDI TRANSLATION SYSTEMS (GOOGLE AND MATRA2)

We considered one hundred English sentences (specifically containing pronouns) from different sources and obtained Hindi translations of them from the Google and MaTra2 translators. Applying the BLEU and METEOR quality metrics, we evaluated the candidate sentences against standard reference (human) translations. The reference translations considered here were obtained from five professional human translators. BLEU and METEOR scores for the different translation systems are compared in Table 4, and both metrics are also correlated with human judgments in Table 5. Some, but not all, of the sentences considered are shown in Table 1 to Table 3.

Possessive Pronouns

Table 1: Examples of Possessive Pronouns

English: I saw her book on your table.
Google: मैं अपनी मेज पर उसे पुस्तक देखा.
MaTra2: मैने तुम्हारा मेज पर उसका किताब देखी |

English: The house is theirs and its paint is flaking.
Google: घर और उनकी अपनी पेंट flaking है.
MaTra2: घर ठे◌रस है और उसका रंग परत कर रहा है |

English: My dog is better than their dog.
Google: मेरा कुत्ता अपने कुत्ते से बेहतर है.
MaTra2: मेरा कुत्ता उनका कुत्ता से अच्छा है |

English: Sewing is a hobby of mine.
Google: सिलाई मेरा एक शौक है.
MaTra2: सिलाई खान की एक अभिरुचि है |

English: I need to find out who's wallet this is.
Google: मुझे पता है जो बटुआ है इसकी जरूरत है.
MaTra2: मैं जरूरत हूँ f◌ दोउ करते हैं w◌ो झोला करता है |

Reflexive Pronouns

Reflexive pronouns end in "…self" or "…selves" and refer back to the subject of an action verb; they function as various types of objects. For example, 'myself', 'yourself', 'himself', 'herself' and 'itself' are singular reflexive pronouns, while 'ourselves', 'yourselves' and 'themselves' are plural. As Table 2 below shows, reflexive pronouns are not handled properly by either translation system.

Table 2: Examples of Reflexive Pronouns

English: She bought herself a new purse for her new job.
Google: वह खुद उसे नई नौकरी के लिए एक नया बटुआ खरीदा है.
MaTra2: उन्होने खुद उसका नया काम के लिये एक नया बटुआ खरीदा |

English: They managed themselves very well as members of the conference panel.
Google: वे खुद सम्मेलन पैनल के सदस्य के रूप में बहुत अच्छी तरह से प्रबंधित.
MaTra2: वे खुद की सदस्य के रूप में सम्मेलन पैनल का बहुत अच्छी तरह की देखबाल किये |

English: He loved himself too much, and never thought about anyone else.
Google: उसने अपने आप को बहुत ज्यादा प्यार करता था, और किसी और के बारे में कभी नहीं सोचा.
MaTra2: उन्होने कोई अन्य के बारे में पसन्द किया खुद बःउत बहुत |

English: Sally thought to herself, this would be a very nice day for a picnic!
Google: सैली खुद के लिए सोचा, यह एक पिकनिक के लिए एक बहुत अच्छा दिन होगा!
MaTra2: धावा ने खुद को सोचा यह एक लघु यात्रा के लिये एक बहुत अच्छा दिन होगा |

Demonstrative Pronouns

In the first sentence of Table 3, 'this' and 'that' are demonstrative pronouns indicating the proximity of an object to the speaker, but Google Translate is not able to resolve the pronoun 'that', as shown in Table 3: it treats 'that' as a subordinating conjunction rather than a demonstrative pronoun. MaTra2, on the other hand, resolves it somewhat better in this sentence. Similarly, in the last sentence, "I really like that", Google resolves 'that' as a subordinating conjunction rather than a demonstrative, whereas MaTra2 resolves it as expected.

Table 3: Examples of Demonstrative Pronouns

English: I like this better than that.
Google: मैं इस तरह से है कि बेहतर.
MaTra2: मैं यह उनसे अच्छा पसन्द करती हूँ |

English: This is great.
Google: यह महान है.
MaTra2: यह बड़ा है |

English: These look tasty.
Google: ये स्वादिष्ट देखो.
MaTra2: ये स्वादिष्ट देखता है |

English: I really like that.
Google: मैं सच है कि पसंद है.
MaTra2: मैं यथार्थ में वह पसन्द करता है |

COMPARISON BETWEEN BLEU AND METEOR METRICS

We tested the two metrics (BLEU and METEOR) on 100 sentences specifically containing different pronouns. The English sentences were taken from different sources (news, blogs, government sites, etc.) and translated into Hindi by two publicly available translators, Google and MaTra2. The quality of the candidate translations was then assessed with the two quality metrics, BLEU and METEOR. The results obtained are presented in Table 4.

Table 4: Comparison between BLEU and METEOR scores

PRONOUN | BLEU (GOOGLE) | BLEU (MATRA2) | METEOR (GOOGLE) | METEOR (MATRA2)
Possessive Pronouns | 0.98214561 | 0.89541023 | 0.61205420 | 0.45115875
Reflexive Pronouns | 0.74105286 | 0.54201258 | 0.54823601 | 0.32012458
Demonstrative Pronouns | 0.54217821 | 0.75421365 | 0.35642130 | 0.40153217

Table 5 shows the correlation of the two metrics with human judgment.

Table 5: Correlation of the two metrics with human judgment

PRONOUN | BLEU | METEOR
Possessive Pronouns | 0.19037641 | 0.71478452
Reflexive Pronouns | 0.25874636 | 0.69124510
Demonstrative Pronouns | 0.20547963 | 0.61232412
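Correlation figures of the kind shown in Table 5 are typically obtained with Pearson's correlation coefficient between per-category metric scores and human ratings; the paper does not state which coefficient was used, so treat the following as an assumption. The score lists in the sketch are made up for illustration, not the actual data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative only: per-sentence metric scores vs. human adequacy ratings (1-5).
metric_scores = [0.61, 0.45, 0.72, 0.33, 0.58]
human_ratings = [4, 3, 5, 2, 4]
print(round(pearson(metric_scores, human_ratings), 3))
```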

Figure 1 shows the correlation of BLEU and METEOR with human judgment for the different pronoun types. We can see that METEOR is more highly correlated with human judgment than BLEU, so METEOR can be considered the better quality metric for Hindi.

Figure 1: Correlation of BLEU and METEOR with human judgment for possessive, reflexive and demonstrative pronouns (y-axis: correlation, 0 to 0.8).

CONCLUSION

Since Hindi is a free word order and morphologically rich language, quality metrics based on a fixed syntactic structure cannot be applied to it as such. The METEOR metric takes care of the free word order and morphological richness of Hindi and can be seen as a more effective metric than BLEU. The study reflects that BLEU (with suitable modifications) and METEOR are better quality metrics for Hindi, with METEOR somewhat better suited to morphologically rich, free word order languages such as Hindi. From the case-study point of view, the study shows, first, that the translations produced by Google are better with respect to discourse (as it uses a statistical, empirical approach to machine translation), whereas MaTra2 handles case markers better. Secondly, demonstrative pronouns are not resolved properly by Google, while MaTra2 resolves them somewhat better.

ACKNOWLEDGMENT

We gratefully acknowledge the support of the Ministry of Human Resource Development (MHRD) and the Department of Computer Science and Engineering, N.I.T. Hamirpur (H.P.).

REFERENCES

A. Agarwal and A. Lavie, (2008). "METEOR, M-BLEU and M-TER: Evaluation Metrics for High-Correlation with Human Rankings of Machine Translation Output", Proceedings of the ACL Workshop on Statistical Machine Translation.

A. Lavie and A. Agarwal, (2007). "METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments", Proceedings of the ACL 2007 Workshop on Statistical Machine Translation.

A. Lavie and M. Denkowski, (2010). "The METEOR Metric for Automatic Evaluation of Machine Translation", Machine Translation.

A. Gupta, S. Venkatapathy and R. Sangal, (2010). "METEOR-Hindi: Automatic MT Evaluation Metric for Hindi as a Target Language", In Proceedings of ICON-2010: 8th International Conference on Natural Language Processing, Macmillan Publishers, India.

C. Culy and S. Riehemann, (2003). "The Limits of N-Gram Translation Evaluation Metrics", In Proc. of the MT Summit IX.

K. Dutta, N. Prakash and S. Kaushik, (2009). "Application of Pronominal Divergence and Anaphora Resolution in English-Hindi Machine Translation", Research journal "POLIBITS" Computer Science and Computer Engineering with Applications, Issue 39, pp. 55-58.

K. Papineni, S. Roukos, T. Ward and W. Jing Zhu, (2002). "Bleu: a Method for Automatic Evaluation of Machine Translation", In Proc. 40th Annual Meeting of the ACL, pp. 311-318.

M. Denkowski and A. Lavie, (2010). "METEOR-NEXT and the METEOR Paraphrase Tables: Improved Evaluation Support for Five Target Languages", Proceedings of the ACL 2010 Joint Workshop on Statistical Machine Translation and Metrics MATR.

M. Denkowski and A. Lavie, (2010). "Extending the METEOR Machine Translation Evaluation Metric to the Phrase Level", Proceedings of NAACL/HLT.

M. Denkowski and A. Lavie, (2011). "Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems", Proceedings of the EMNLP 2011 Workshop on Statistical Machine Translation.

N. Chatterjee, A. Johnson and M. Krishna, (2007). "Some Improvement over the BLEU Metric for Measuring Translation Quality for Hindi", In Proc. of the International Conference on Computing: Theory and Applications.

S. Banerjee and A. Lavie, (2005). "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments", In Proc. of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization.

S. Condon, M. Arehart, C. Doran, D. Parvaz, J. Aberdeen, K. Megerdoomian, and B. Oshika, (2010). "Automated Metrics for Speech Translation".

http://www.cdacmumbai.in/matra/index.jsp

http://en.wikipedia.org/wiki/Evaluation_of_machine_translation

http://hindiseekho-i.blogspot.com/

http://translate.google.co.in/?hl=en&tab=TT