ENGLISH-MALAYALAM STATISTICAL MACHINE TRANSLATION

A PROJECT REPORT

Submitted by

RAHUL-C (CB207CN009)

in partial fulfillment for the award of the degree of MASTER OF TECHNOLOGY IN COMPUTATIONAL ENGINEERING AND NETWORKING

AMRITA SCHOOL OF ENGINEERING, COIMBATORE

AMRITA VISHWA VIDYAPEETHAM, COIMBATORE – 641 105

JULY 2009

AMRITA VISHWA VIDYAPEETHAM AMRITA SCHOOL OF ENGINEERING, COIMBATORE, 641105

BONAFIDE CERTIFICATE

This is to certify that the mini-project report entitled "ENGLISH-MALAYALAM STATISTICAL MACHINE TRANSLATION", submitted by RAHUL-C (Reg No CB207CN009) in partial fulfillment of the requirements for the award of the degree of Master of Technology in COMPUTATIONAL ENGINEERING AND NETWORKING, is a bonafide record of the work carried out under my guidance and supervision at Amrita School of Engineering, Coimbatore, during the period 2007-2009.

Project Guide
Dr. K.P. SOMAN

Head of the Department
Dr. K.P. SOMAN
Professor & Head, CEN

This project report was evaluated by us on ......................

INTERNAL EXAMINER

EXTERNAL EXAMINER


AMRITA VISHWA VIDYAPEETHAM AMRITA SCHOOL OF ENGINEERING, COIMBATORE, 641105 DEPARTMENT OF COMPUTATIONAL ENGINEERING & NETWORKING

DECLARATION

I, RAHUL-C (Reg No CB207CN009), hereby declare that this project report, entitled "ENGLISH-MALAYALAM STATISTICAL MACHINE TRANSLATION", is a record of the original work done by me under the guidance of Dr. K.P. SOMAN, H.O.D, CEN, Amrita School of Engineering, Coimbatore, and that this work has not formed the basis for the award of any degree / diploma / associateship / fellowship or a similar award to any candidate in any University, to the best of my knowledge.

RAHUL-C

Place:
Date:

COUNTERSIGNED

Dr. K.P. SOMAN
Head of the Department
Computational Engineering and Networking


ACKNOWLEDGEMENT

Let the Almighty Lord be praised for His compassion, whose ample grace has helped me in the successful completion of this project.

I would like to take this opportunity to extend my most sincere gratitude to all those who provided their assistance and co-operation during the project work on "ENGLISH-MALAYALAM STATISTICAL MACHINE TRANSLATION".

I express my deep sense of gratitude to Dr. K.P. SOMAN, Head of the Department of Computational Engineering and Networking and my project guide, for his timely advice and constant encouragement and guidance throughout the progress of the project. His valuable thoughts and suggestions were critical at all stages of this work.


I would also like to thank Mr. Loganathan, Mr. Rakesh Peter, Mr. Saravanan and Mr. Anand Kumar, Research Associates, CEN, for their active discussions with me about the topic and their insightful comments and constructive suggestions to improve the quality of this project work.

I express my special thanks to Mr. Dinunath K., my classmate, who was with me from the start of the project, for the interesting discussions and the ideas he shared. Thanks also to all my friends for sharing so many wonderful moments.

Last but not least, I express my thanks to all the staff members of our department, and to my parents and friends, who always stood by me with their valuable suggestions and help.


ABSTRACT

This report presents an in-depth study of statistical machine translation (SMT). Statistical machine translation is a data-oriented statistical framework for translating text from one natural language to another based on knowledge extracted from a bilingual corpus. Until now there have been few efforts towards building an English-Malayalam machine translation system. Such a system could play a big role in bridging the language gap and enabling the common Keralite to keep abreast of recent technical advancements by providing an interface to the Web in his own language. As a first step towards developing such a system, we built a parallel corpus of English-Malayalam sentences. After creating a well-aligned corpus, we use tools such as SRILM, GIZA++ and MOSES to develop a machine translation system from English to Malayalam. The main effort lies in developing the parallel corpus: the larger the parallel corpus, the more accurate the translated output. SMT systems use a combination of one or more translation models and a language model. In this project we explore how a direct, well-aligned English-Malayalam corpus performs in an SMT system, how decoding and evaluation are carried out, and how the translation can be improved by adding rules and by adding morphological information to the Malayalam data. The two ideas which have proven most effective are (i) reordering the English source sentence according to Malayalam syntax, and (ii) applying root-suffix separation to both English and Malayalam words. The first is done by applying simple transformation rules to the English parse tree produced by the Stanford Dependency Parser. The second is implemented using a morphological analyzer. This approach achieves good performance and better results than the baseline phrase-based system. Our approach avoids parsing the target language (Malayalam), making it suitable for statistical machine translation from English to Malayalam, since parsing tools for Malayalam are currently not available.


TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
ORGANIZATION OF THE PROJECT

1  INTRODUCTION
   1.1  Machine Translation & the Statistical Approach
        1.1.1  A brief history of MT
        1.1.2  Approaches to MT
        1.1.3  Statistical Machine Translation

2  STATE OF THE ART
   2.1  Word-based Translation Models
        2.1.1  IBM Translation & Alignment Models
        2.1.2  Training & Decoding Tools
   2.2  Phrase-based Translation Models
        2.2.1  P(E | M) Phrase-based Translation Model
   2.3  Tuple-based Translation Models
   2.4  Statistical Word Alignment
        2.4.1  Evaluating Word Alignment
   2.5  Use of Linguistic Knowledge into SMT
   2.6  Machine Translation Evaluation
        2.6.1  Automatic Evaluation Metrics
               2.6.1.1  BLEU Score
               2.6.1.2  NIST Score
               2.6.1.3  mWER
               2.6.1.4  mPER

3  STATISTICAL MACHINE TRANSLATION
   3.1  Background
        3.1.1  Model
        3.1.2  Word Alignment
        3.1.3  Methods for Learning Phrase Translations
        3.1.4  Och and Ney
   3.2  Decoder
        3.2.1  Translation Options
        3.2.2  Core Algorithm
        3.2.3  Recombining Hypotheses
        3.2.4  Beam Search
        3.2.5  Future Cost Estimation
        3.2.6  N-Best Lists Generation
   3.3  Factored Translation Models
        3.3.1  Motivating Example: Morphology
        3.3.2  Decomposition of Factored Translation
        3.3.3  Statistical Model
   3.4  Confusion Networks Decoding
        3.4.1  Confusion Networks
        3.4.2  Word Lattices
   3.5  Factors, Words, Phrases
        3.5.1  Factors
        3.5.2  Words
        3.5.3  Factor Types
        3.5.4  Phrases
   3.6  Morphology
        3.6.1  Malayalam Verb Morphology

4  INSIGHT INTO ENGLISH-MALAYALAM SMT
   4.1  The Bilingual N-gram Translation Model
        4.1.1  Reviewing X-grams for Language Modeling
        4.1.2  Training from Parallel Data
        4.1.3  Tuple Definition
               4.1.3.1  Monotonicity vs. Word Reordering
               4.1.3.2  An Initial Reordering Strategy
        4.1.4  N-gram Implementation
               4.1.4.1  Modeling Issues
   4.2  Moses
        4.2.1  Starting with Moses
        4.2.2  Getting Started with the Moses Software
        4.2.3  Roadmap
        4.2.4  Decoding
               4.2.4.1  Tuning for Quality
               4.2.4.2  Tuning for Speed
               4.2.4.3  Limit on Distortion (Reordering)
        4.2.5  Factored Models
               4.2.5.1  Moses Decoder in Parallel
        4.2.6  Advanced Features of the Decoder
               4.2.6.1  Lexicalized Reordering Models
               4.2.6.2  Maintaining Stack Diversity
               4.2.6.3  Cube Pruning
               4.2.6.4  Multiple Translation Tables
               4.2.6.5  Pruning the Translation Table
        4.2.7  Training
               4.2.7.1  Preparing Training Data
               4.2.7.2  Cleaning the Corpus
               4.2.7.3  Factored Training
               4.2.7.4  Decoding Steps
               4.2.7.5  Building a Language Model
               4.2.7.6  Tuning

5  IMPLEMENTATION
   5.1  Toolkits Used
        5.1.1  SRILM: Stanford Research Institute Language Model
        5.1.2  GIZA++: Training of Statistical Translation Models
        5.1.3  MOSES
        5.1.4  BLEU: Bilingual Evaluation Understudy
   5.2  Development of Parallel Corpus
   5.3  Running the System
   5.4  Rule-Based Reordering & Morphological Processing
        5.4.1  Syntactic Information
        5.4.2  Sandhi (Euphonic Combination) Rules
        5.4.3  Tense Markers
        5.4.4  Morphological Information
   5.5  PERL
   5.6  GUI

6  RESULTS AND EVALUATION

7  CONCLUSION & FUTURE WORK

REFERENCES

APPENDIX

LIST OF FIGURES

1.1  Machine Translation pyramid
2.1  Generative process underlying IBM models
2.2  Phrase extraction from a certain word-aligned pair of sentences
3.1  Phrase Reordering
3.2  Word alignment
3.3  Hypothesis stacks
3.4  Generation of an Arc
3.5  Factored Translation Models
3.6  Illustration of Factored Translation Models
4.1  Training a translation FST from parallel data: flow diagram
4.2  Tuple extraction from a certain word-aligned pair of sentences
4.3  Three alternative tuple definitions from a given word alignment
4.4  A Road Map of SMT
5.1  Steps for developing the language model
5.2  Snapshot of the mapping file developed
5.3  Preprocessing of source data for training
5.4  Stanford Parser Output
5.5  Modified Dependency Tree
5.6  GIZA Alignment
5.7  Morph Analyzer & Reordering
5.8  Syntactic and Morphological Processing
5.9  GUI for English-Malayalam SMT

LIST OF TABLES

6.1  Glossary of Corpus used
6.2  Results of evaluation

ORGANIZATION OF THE PROJECT

Chapter 1 is a brief introduction to Machine Translation and the Statistical Approach, including a brief history of Machine Translation. It also lists approaches to Machine Translation and gives a detailed study of Statistical Machine Translation.

Chapter 2 describes the state of the art of SMT. It covers Word-based Translation Models, Phrase-based Translation Models, Tuple-based Translation Models, Statistical Word Alignment, the Use of Linguistic Knowledge into SMT, and Machine Translation Evaluation.

Chapter 3 provides an in-depth study of Statistical Machine Translation. It explores in detail the Decoder, Factored Translation Models, Confusion Network Decoding and Malayalam Morphology, and explains Factors, Words and Phrases.

Chapter 4 gives an insight into English-Malayalam SMT. It covers the Bilingual N-gram Translation Model and Moses. For Moses, we go through Decoding, Factored Models, Advanced Features of the Decoder, and the step-by-step training process.

Chapter 5 briefly describes how the project was implemented. It includes a brief description of the toolkits used, the methodology for developing the parallel corpus, how to run an SMT system, and how Rule-Based Reordering and Morphological Processing enhance the translation output. The language used for programming is Perl, and the Graphical User Interface for this project is developed in Java.

Chapter 6 contains the results and evaluation of the project.

Chapter 7 presents the conclusion and future work of the project.


CHAPTER 1

INTRODUCTION

The information society we live in is undoubtedly a globalized and multilingual one. Every day, hundreds of thousands of documents are generated, and in many cases one or several translations of them are needed in order to cover the linguistic variety of the target population. The majority of work carried out by professional translators involves non-literary documents (technical reports, legal and financial documents, user manuals, political debates, meeting minutes, and so on), where translation tends to be mechanical and domain-specific. However, the high translation cost in terms of money and time is a bottleneck that prevents all information from being easily spread across languages. Apart from that, the growth and rising popularity of the internet has given users access to practically any written, visual and audio material from anywhere in the world. Still, the language barrier is the only obstacle preventing this vast amount of information from being fully shared by all users.

In this context, automatic or machine translation (MT) services are becoming more and more attractive. Several companies are already using and offering automatic translation software, and thousands of users automatically translate web content on a daily basis, even though translation performance is still far from perfect. Additionally, many research efforts are being focused on speech-to-speech machine translation, and the seemingly unreachable goal of automatically translating spoken language is nearer than ever before.

To a large extent, much of the optimism shared in the MT research community nowadays has been caused by the revival of statistical approaches to machine translation, or in other words, the birth of purely Statistical Machine Translation (SMT). In contrast to previous approaches based on linguistic knowledge representation, SMT is based on large amounts of human-translated example sentences (parallel corpora), from which a set of statistical models describing the translation process is estimated.

1.1 Machine Translation and the Statistical Approach

1.1.1 A brief history of MT

The beginnings of statistical machine translation (SMT) can be traced back to the early fifties, closely related to the ideas from which information theory arose and inspired by work on cryptography during World War II. According to this view, machine translation was conceived as the problem of finding a sentence by decoding a given "encrypted" version of it. At that time, machine translation was seen as a quite simple and feasible task, basically consisting of automatically reading dictionary entries in order to translate the input sentence into a hypothetical universal language, from which the target sentence could be generated. The first public Russian-English system was presented at Georgetown University in 1954, and despite its very restricted domain (with a vocabulary of around 250 words), the promising prospects of rapid improvement, and undoubtedly the Cold War political context, led the United States government to make strong investments in emergent machine translation technologies. Many research projects were consequently devoted to MT during the late 1950s. However, as the complexity of the linguistic phenomena involved in the translation process and the computational limitations of the time became apparent, enthusiasm faded quickly. Despite much research effort, by the turn of the decade results had fallen short of all expectations.

To crown it all, two negative reports had a dramatic impact on MT research. On the one hand, the Bar-Hillel report concluded that Fully Automatic High-Quality Translation was an unreachable goal and that research efforts should be focused on less ambitious tasks, such as computer-assisted translation tools. On the other hand, the controversial 1966 ALPAC report concluded that machine translation was of poor quality and twice as expensive as human translation, effectively causing MT research to all but vanish.

During the 1970s, the focus of MT activity switched from the United States to Canada and Europe, especially due to the growing demand for translations within their multicultural societies. Meteo, a fully automatic system translating weather forecasts, had great success in Canada, and meanwhile the European Commission installed a French-English MT system called Systran. Other research projects, such as Eurotra, Ariane and Susy, broadened the scope of MT objectives and techniques, and rule-based approaches emerged as the accepted way towards successful MT quality. Throughout the 1980s many different types of MT systems appeared, the most prevalent being those using an intermediate semantic language, such as the Interlingua approach.

In the early 1990s, the progress made by the application of statistical methods to speech recognition inspired the introduction by IBM researchers of purely statistical machine translation models. The dramatic increase in computational power and the increasing availability of written translated texts allowed the development of statistical and other corpus-based MT approaches. Many academic tools turned into useful commercial translation products, and several translation engines were quickly offered on the World Wide Web. Today, while commercial MT systems are not error-free, their use is widespread and there is a growing demand for high-quality automatic translation. Regarding research, essentially the whole research community has moved towards corpus-based techniques, which have systematically outperformed traditional knowledge-based techniques in most performance comparisons. Every year more research groups embark on SMT experimentation, and a regained optimism regarding future progress seems to be shared among the community.

1.1.2 Approaches to MT

Several criteria can be used to classify machine translation approaches, yet the most popular classification is based on the level of linguistic analysis (and generation) required by the system to produce translations. Usually, this is expressed graphically by the machine translation pyramid in Figure 1.1.

Figure 1.1: Machine Translation pyramid

Generally speaking, the bottom of the pyramid represents those systems which do not perform any kind of linguistic analysis of the source sentence in order to produce a target sentence. Moving upwards, we find the systems which carry out some analysis (usually by means of morphosyntax-based rules). Finally, at the top of the pyramid, a semantic analysis of the source sentence turns the translation task into generating a target sentence according to the obtained semantic representation. Aiming at a bird's-eye survey rather than a complete review, each of these approaches is briefly discussed next, before delving into the statistical approach to machine translation.


Direct translation

This approach solves translation on a word-by-word basis, and it was followed by the early MT systems, which included only a very shallow morphosyntactic analysis. Today, this preliminary approach has been abandoned, even in the framework of corpus-based approaches.

Transfer-based translation

The rationale behind the transfer-based approach is that, once we grammatically analyze a given sentence, we can map this analysis onto the grammatical representation of the sentence in another language. In order to do so, rules to convert source text into some structure, rules to transfer the source structure into a target structure, and rules to generate target text from it are needed. Lexical rules need to be introduced as well. Usually, rules are collected manually, thus involving a great deal of expert human labor and knowledge of the comparative grammar of the language pair. Apart from that, when several competing rules can be applied, it is difficult for the system to prioritize them, as there is no natural way of weighting them. This approach was widely followed in the 1980s, and despite much research effort, high-quality MT was only achieved for limited domains.

Interlingua-based translation

This approach advocates the deepest analysis of the source sentence, reaching a language of semantic representation named Interlingua. This conceptual language, which needs to be developed, has the advantage that, once the source meaning is captured by it, in theory it can be expressed in any number of target languages, so long as a generation engine exists for each of them. Though conceptually appealing, several drawbacks make this approach impractical. On the one hand, creating a conceptual language capable of bearing the particular semantics of all languages is an enormous task, which in fact has only been achieved in very limited domains. Apart from that, the requirement that the whole input sentence be understood before translating it has proved to make these engines less robust to the grammatical incorrectness of informal language, or of input produced by an automatic speech recognition system.

Corpus-based approaches

In contrast to the previous approaches, these systems extract the information needed to generate translations from parallel corpora that contain many sentences which have already been translated by human translators. The advantage is that, once the required techniques have been developed for a given language pair, in theory it should be relatively simple to transfer them to another language pair, so long as sufficient parallel training data is available. Among the many corpus-based approaches that sprang up at the beginning of the 1990s, the most relevant ones are example-based MT (EBMT) and statistical MT (SMT), although the differences between them are constantly under debate.

Example-based MT makes use of parallel corpora to extract a database of translation examples, which are compared with the input sentence in order to translate it. By choosing and combining these examples in an appropriate way, a translation of the input sentence can be produced. In SMT, this process is accomplished by relying on purely statistical parameters and a set of translation and language models, among other data-driven features. Although this approach initially worked on a word-to-word basis and could therefore be classified as a direct method, nowadays several engines attempt to include a certain degree of linguistic analysis in the SMT approach, slightly climbing up the aforementioned MT pyramid. The following section further introduces the statistical approach to machine translation.


1.1.3 Statistical Machine Translation

Statistical machine translation (SMT) is a machine translation paradigm in which translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The statistical approach contrasts with the rule-based approaches to machine translation as well as with example-based machine translation. The first ideas of statistical machine translation were introduced by Warren Weaver in 1949, including the idea of applying Claude Shannon's information theory. Statistical machine translation was re-introduced in 1991 by researchers at IBM's Thomas J. Watson Research Center and has contributed to the significant resurgence of interest in machine translation in recent years. As of 2006, it is by far the most widely studied machine translation paradigm.

Since its revival more than a decade ago, when IBM researchers presented the Candide SMT system, the statistical approach to machine translation has seen increasing interest among both the natural language and speech processing research communities. Three factors mainly account for this increasing interest:

• There is a growing availability of parallel texts (though this applies, in general, only to major languages in terms of presence on the internet), coupled with increasing computational power. This enables research on statistical models which, in spite of their huge number of parameters (or probabilities) to estimate, are sufficiently represented in the data.

• The statistical methods are more robust to speech disfluencies or grammatical faults. As no deep analysis of the source sentence is done, these systems seek the most probable translation hypothesis given a source sentence, assuming the input sentence is correct.


• Last but not least, shortly after their introduction, these methods proved at least as good as, or even better than, rule-based approaches in various evaluation campaigns. A clear example is the German project Verbmobil, which concluded that preliminary statistical approaches outperformed other approaches on which research had been focused for many years.

At the turn of the 21st century, apart from Verbmobil, other projects and consortiums (C-STAR, LC-STAR and FAME, among others) involving many research centers have focused on SMT and its applications to text and speech translation tasks. Recently, the European project TC-STAR (Technology and Corpora for Speech to Speech Translation) has had among its main objectives the achievement of significant performance improvements in the statistical machine translation approach. The most often cited benefits of statistical machine translation over traditional paradigms are the following:

• Better use of resources
  o There is a great deal of natural language in machine-readable format.
  o Generally, SMT systems are not tailored to any specific pair of languages.
  o Rule-based translation systems require the manual development of linguistic rules, which can be costly and which often do not generalize to other languages.

• More natural translations

The ideas behind statistical machine translation come out of information theory. Essentially, a document is translated according to the probability p(m | e) that a string m in the native language (Malayalam) is the translation of a string e in the foreign language (English). Generally, these probabilities are estimated using techniques of parameter estimation.
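As a rough illustration of this decision rule (not part of the project's actual system), the following Perl sketch scores a few hypothetical Malayalam candidates by the product P(e | m) · P(m) and picks the best one; all candidate strings and probability values are invented.

#!/usr/bin/perl
use strict;
use warnings;

# Minimal sketch of the noisy-channel decision rule:
# choose the candidate m that maximizes P(e|m) * P(m).
# The candidates and probabilities below are invented for illustration.
my @candidates = (
    { mal => 'innale njaan oru sinima kantu', p_e_given_m => 0.020, p_m => 0.010 },
    { mal => 'njaan innale sinima kantu',     p_e_given_m => 0.015, p_m => 0.008 },
    { mal => 'oru sinima innale',             p_e_given_m => 0.001, p_m => 0.020 },
);

my ($best, $best_score) = (undef, -1);
for my $cand (@candidates) {
    my $score = $cand->{p_e_given_m} * $cand->{p_m};
    ($best, $best_score) = ($cand->{mal}, $score) if $score > $best_score;
}
print "best translation: $best (score $best_score)\n";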


CHAPTER 2

STATE OF THE ART

2.1 Word-based translation models

In word-based translation, the translated elements are words. Typically, the number of words in the translated sentence differs from that of the source sentence due to compound words, morphology and idioms. The ratio of the lengths of the translated word sequences is called fertility, which tells how many foreign words each native word produces. Simple word-based translation cannot handle language pairs with fertility rates different from one. To make word-based translation systems manage, for instance, high fertility rates, the system can be extended to map a single word to multiple words, but not vice versa. For instance, if we are translating from French to English, each word in English can produce zero or more French words, but there is no way to group two English words producing a single French word. An example of a word-based translation system is the freely available GIZA++ package, which includes training programs for the IBM models as well as the HMM model and Model 6.

Word-based translation is not widely used today compared to phrase-based systems; however, most phrase-based systems still use GIZA++ to align the corpus. The alignments are then used to extract phrases or induce syntactic rules, and the word alignment problem is still actively discussed in the community. Because of the importance of GIZA++, several distributed implementations of it are now available online.

Statistical machine translation is based on the assumption that every sentence m in a target language is a possible translation of a given sentence e in a source language. The main difference between two possible translations of a given sentence is the probability assigned to each, which is to be learned from a bilingual text corpus. The first SMT models applied these probabilities to words, therefore considering words to be the translation units of the process.

2.1.1 IBM translation and alignment models

Supposing we want to translate a source sentence e into a target sentence m, we can follow a noisy-channel approach (regarding the translation process as a channel which distorts the target sentence and outputs the source sentence), defining statistical machine translation as the optimization problem expressed by:

\hat{m} = \arg\max_{m} P(m \mid e)    (2.1)

Typically, Bayes' rule is applied, obtaining the following expression:

\hat{m} = \arg\max_{m} P(e \mid m) \, P(m)    (2.2)

This way, translating e becomes the problem of detecting which m (among all possible target sentences) scores best given the product of two models: P(m), the target language model, and P(e | m), the translation model. Although it may seem less appropriate to estimate two models instead of just one (considering that P(m | e) and P(e | m) are equally difficult to estimate), the use of such a target language model justifies the application of Bayes' rule, as this model helps penalize non-grammatical target sentences during the search.

Whereas the language model, typically implemented using N-grams, was already being used successfully in speech processing and other fields, the translation model was first presented by introducing a hidden variable a to account for the alignment relationships between words in each language, as in equation 2.3:

p(e \mid m) = \sum_{a} p(e, a \mid m) = p(J \mid m) \prod_{j=1}^{J} p(a_j \mid e_1^{j-1}, a_1^{j-1}, m) \; p(e_j \mid e_1^{j-1}, a_1^{j}, m)    (2.3)

where e_j stands for the word in position j of the source sentence e, J is the length of this sentence (in number of words), and a_j stands for the alignment of word e_j, i.e. the position in the target sentence m where the word which aligns to e_j is placed. The set of model parameters, or probabilities, is to be automatically learnt from parallel data. In order to train this huge number of parameters, the EM algorithm is used with increasingly complex models. These models are widely known as the five IBM models, and are inspired by the generative process described in Figure 2.1, which interprets the model decomposition of equation 2.3. Conceptually, this process states that for each target word, we first find how many source words will be generated (following a model denoted as fertility); then, we find which source words are generated from each target word (lexicon or word translation probabilities); and finally, we reorder the source words (according to a distortion model) to obtain the source sentence.

Figure 2.1: Illustration of the generative process underlying IBM models

These models are expressed as:

• n(φ | m), the Fertility model, which accounts for the probability that a target word m_i generates φ_i words in the source sentence

• t(e | m), the Lexicon model, representing the probability of producing a source word e_j given a target word m_i


• d(π | τ, φ, m), the Distortion model, which models the probability of placing a source word in position j given that the target word it is generated from is placed in position i of the target sentence (also used with inverted dependencies, and then known as the Alignment model)

IBM Models 1 and 2 do not include fertility parameters, so the likelihood distributions are guaranteed to achieve a global maximum. Their difference is that Model 1 assigns a uniform distribution to alignment probabilities, whereas Model 2 introduces a zero-order dependency on the position in the source. A modification of Model 2 that introduced first-order dependencies in the alignment probabilities, the so-called HMM alignment model, was later presented with successful results. Model 3 introduces fertility, and Models 4 and 5 introduce more detailed dependencies in the alignment model to allow for jumps, so that all of them must be numerically approximated and not even a local maximum can be guaranteed.
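To make the idea of learning such parameters from parallel data concrete, here is a minimal Perl sketch of IBM Model 1 EM training, estimating lexical translation probabilities t(e | m) only (no fertility or distortion). It is an illustrative toy, not the GIZA++ implementation, and the tiny corpus is invented.

#!/usr/bin/perl
use strict;
use warnings;

# Toy parallel corpus (invented): each pair is [source words, target words].
my @corpus = (
    [ [qw(a film)],   [qw(oru sinima)] ],
    [ [qw(a house)],  [qw(oru veedu)]  ],
    [ [qw(the film)], [qw(aa sinima)]  ],
);

# Initialize t(e|m) uniformly over all co-occurring word pairs.
my %t;
for my $pair (@corpus) {
    my ($src, $tgt) = @$pair;
    for my $e (@$src) { $t{$e}{$_} = 1.0 for @$tgt; }
}
for my $e (keys %t) {
    my $n = scalar keys %{ $t{$e} };
    $t{$e}{$_} = 1.0 / $n for keys %{ $t{$e} };
}

# EM iterations: collect expected counts, then re-normalize.
for my $iter (1 .. 10) {
    my (%count, %total);
    for my $pair (@corpus) {
        my ($src, $tgt) = @$pair;
        for my $e (@$src) {
            # Distribute the probability mass of e over the target words of this pair.
            my $z = 0;
            $z += $t{$e}{$_} for @$tgt;
            for my $m (@$tgt) {
                my $delta = $t{$e}{$m} / $z;
                $count{$e}{$m} += $delta;
                $total{$m}     += $delta;
            }
        }
    }
    # M-step: t(e|m) = count(e,m) / total(m).
    for my $e (keys %count) {
        for my $m (keys %{ $count{$e} }) {
            $t{$e}{$m} = $count{$e}{$m} / $total{$m};
        }
    }
}

printf "t(%s|%s) = %.3f\n", 'film', 'sinima', $t{'film'}{'sinima'};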

Word Alignment

As explicitly introduced as a model parameter in the IBM formulation, word alignment is a function from source positions j to target positions i, so that a(j) = i. This definition implies that the resulting alignment solutions will never contain many-to-many links, but only many-to-one links, as only one function result is possible for a given source position j. Although this limitation does not account for many real-life alignment relationships, in principle the IBM models can address it by estimating the probability of generating source words from the empty target word, which accounts for source words without a real target counterpart. However, as we will see in the following section, many current SMT systems do not use the IBM model parameters in their training schemes, but only the most probable alignment (found using a Viterbi search) given the estimated IBM models. Therefore, in order to obtain many-to-many word alignments, alignments from source-to-target and target-to-source are usually performed, and symmetrisation strategies have to be applied.


2.1.2 Training and decoding tools

A stack decoder for IBM Model 2, based on the A* search algorithm, was presented early on. In 1999, the Johns Hopkins University summer workshop research team on SMT released GIZA (as part of the EGYPT toolkit), a tool implementing IBM model training from parallel corpora and best-alignment Viterbi search, together with a decoder for Model 3. This was a breakthrough in that it enabled many other teams to join SMT research easily. In 2001 and 2003 improved versions of this tool were released, named GIZA++.

2.2 Phrase-based translation models

In phrase-based translation, the restrictions of word-based translation are reduced by translating sequences of words into sequences of words, where the lengths of the two sequences can differ. These sequences of words are called blocks or phrases, but they are typically not linguistic phrases; rather, they are phrases found by statistical methods from the corpus. Restricting the phrases to linguistic phrases has been shown to decrease translation quality. By the turn of the century it became clear that in many cases specifying translation models at the level of words was inappropriate, as much local context was lost during translation. Newer approaches therefore describe their models over longer units, typically sequences of consecutive words (phrases).

2.2.1 P(E | M): The phrase-based translation model

The job of the translation model, given a Malayalam sentence M and a foreign (English) sentence E, is to assign a probability that M generates E. While we can estimate these probabilities by considering how each individual word is translated, modern statistical machine translation is based on the intuition that a better way to compute these probabilities is by considering the behavior of phrases. As we see in Fig. 2.2, entire phrases often need to be translated and moved as a unit. The intuition of phrase-based statistical machine translation is to use phrases (sequences of words) as well as single words as the fundamental units of translation.

Fig 2.2: Phrase extraction from a certain word-aligned pair of sentences

There is a wide variety of phrase-based models; in this section we sketch the model of Koehn et al. (2003). We use an example to show how the phrase-based model computes the probability P(Yesterday I saw a film | Innale njaan oru sinima kaNTu). The generative story of phrase-based translation has three steps. First we group the Malayalam source words into phrases m1, m2, ..., mI. Next we translate each Malayalam phrase mi into an English phrase ei. Finally, each of the English phrases is (optionally) reordered.

The probability model for phrase-based translation relies on a translation probability and a distortion probability. The factor Φ(ei | mi) is the translation probability of generating the English phrase ei from the Malayalam phrase mi. The reordering of the English phrases is handled by the distortion probability d. Distortion in statistical machine translation refers to a word having a different ('distorted') position in the English sentence than it had in the Malayalam sentence; it is thus a measure of the distance between the positions of a phrase in the two languages.

The distortion probability in phrase-based machine translation is the probability of two consecutive Malayalam phrases being separated in English by a span (of English words) of a particular length. More formally, the distortion is parameterized by d(a_i − b_{i−1}), where a_i is the start position of the foreign (English) phrase generated by the i-th Malayalam phrase m_i, and b_{i−1} is the end position of the foreign (English) phrase generated by the (i−1)-th Malayalam phrase m_{i−1}. We can use a very simple distortion probability, in which we simply raise some small constant α to the distortion:

d(a_i − b_{i−1}) = \alpha^{|a_i − b_{i−1} − 1|}

This distortion model penalizes large distortions by giving lower and lower probability the larger the distortion. The final translation model for phrase-based machine translation is:

P(E \mid M) = \prod_{i=1}^{I} \Phi(e_i, m_i) \, d(a_i − b_{i−1})    (2.4)

Let us consider the following particular set of phrases for our example sentences:

Position     1           2        3      4         5
Malayalam    Innale      njaan    oru    sinima    kantu
English      Yesterday   I        a      film      saw

Since the phrases do not all follow each other directly in order, the distortions are not all 1, and the probability P(E|M) can be computed as:

P(E|M) = P(Yesterday | Innale) × d(1) × P(I | njaan) × d(1) × P(a | oru) × d(2) × P(film | sinima) × d(2) × P(saw | kantu) × d(3)    (2.5)
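The following Perl sketch evaluates equation 2.4 for this worked example. The phrase translation probabilities and the distortion constant α are invented placeholders; only the structure of the computation is meant to be illustrative.

#!/usr/bin/perl
use strict;
use warnings;

# Sketch of equation 2.4: P(E|M) = product over phrases of phi(e_i, m_i) * d(a_i - b_{i-1}).
# Phrase translation probabilities and the distortion constant are invented for illustration.
my $alpha = 0.5;   # small distortion constant

# Each entry: [ malayalam phrase, english phrase, phi(e|m), distortion argument ]
my @phrases = (
    [ 'Innale', 'Yesterday', 0.30, 1 ],
    [ 'njaan',  'I',         0.40, 1 ],
    [ 'oru',    'a',         0.50, 2 ],
    [ 'sinima', 'film',      0.60, 2 ],
    [ 'kantu',  'saw',       0.25, 3 ],
);

my $p = 1.0;
for my $ph (@phrases) {
    my ($m, $e, $phi, $dist) = @$ph;
    my $d = $alpha ** abs($dist - 1);   # d(x) = alpha^{|x - 1|}
    $p *= $phi * $d;
}
printf "P(E|M) = %.6g\n", $p;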

In order to use the phrase-based model, we need two more things. We need a model of decoding, so we can go from a surface English string to a hidden Malayalam string. And we need a model of training, so we can learn parameters. Let us turn first to training. The main set of parameters that needs to be trained is the set of phrase translation probabilities Φ(e_i, m_i). These parameters, as well as the distortion constant α, could be set if only we had a large bilingual training set in which each English sentence was paired with a Malayalam sentence, and if furthermore we knew exactly which phrase in the English sentence was translated by which phrase in the Malayalam sentence. We call such a mapping a phrase alignment. The table of phrases above showed an implicit alignment of the phrases for this sentence, for example 'Innale' aligned with 'Yesterday'. If we had a large training set with each pair of sentences labeled with such a phrase alignment, we could just count the number of times each phrase pair occurred and normalize to obtain probabilities:

\Phi(e, m) = \frac{count(e, m)}{\sum_{e'} count(e', m)}    (2.6)

We could store each phrase pair (e, m), together with its probability Φ(e, m), in a large phrase translation table. Alas, we do not have large hand-labeled phrase-aligned training sets. But it turns out that we can extract phrases from another kind of alignment called a word alignment. A word alignment differs from a phrase alignment in that it shows exactly which English word aligns to which Malayalam word inside each phrase. We can visualize a word alignment in various ways.
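As a small illustration of equation 2.6, the Perl sketch below turns phrase-pair counts into relative-frequency translation probabilities. The counts are invented; in practice they would come from phrase pairs extracted from a word-aligned corpus.

#!/usr/bin/perl
use strict;
use warnings;

# Sketch of equation 2.6: relative-frequency estimation of phrase translation
# probabilities from phrase-pair counts. The counts below are invented; in a real
# system they would come from phrase pairs extracted from word-aligned data.
my %count = (
    'sinima' => { 'film' => 8, 'a film' => 2, 'movie' => 5 },
    'oru'    => { 'a' => 12, 'one' => 3 },
);

my %phi;   # phi{m}{e} = count(e,m) / sum over e' of count(e',m)
for my $m (keys %count) {
    my $total = 0;
    $total += $_ for values %{ $count{$m} };
    for my $e (keys %{ $count{$m} }) {
        $phi{$m}{$e} = $count{$m}{$e} / $total;
    }
}

for my $m (sort keys %phi) {
    for my $e (sort keys %{ $phi{$m} }) {
        printf "phi(%s | %s) = %.3f\n", $e, $m, $phi{$m}{$e};
    }
}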

2.3 Tuple-based translation models

Without loss of generality, an alternative approach to SMT is to view translation as a stochastic process maximizing the joint probability p(e, m) instead of the conditional probability p(e | m), leading to the following decomposition:

m_1^I = \arg\max_{m_1^I} \{ p(e_1^I, m_1^I) \} = \arg\max_{m_1^I} \left\{ \prod_{n=1}^{N} p\big( (e, m)_n \mid (e, m)_1, \ldots, (e, m)_{n-1} \big) \right\}    (2.7)

where (e, m)_n is the n-th bilingual unit, or tuple, of a given tuple sequence which generates monotonically both the training source and target sentences. Under this approach, the translation of a given source unit (e, m)_n is conditioned on a previous bilingual context (e, m)_1, ..., (e, m)_{n-1}, which in practice must be limited in length.


2.4 Statistical word alignment

Even though the IBM word-based translation models include the alignment model as part of a whole translation scheme, word alignment can also be defined as an independent natural language processing task. In fact, most current new-generation translation models treat word alignment as a result that is independent of the translation model. The task of automatic word alignment focuses on detecting, given a parallel corpus, which tokens or sets of tokens from each language are connected together in a given translation context, thus revealing the relationship between these bilingual units. Among its many applications in natural language processing, such as bilingual dictionary extraction or transfer-rule learning, word alignment is particularly crucial in the context of statistical machine translation, where it represents an essential block in the learning process of current statistical translation models. In fact, it is reasonable to expect that a correct generation of word alignment will show a positive correlation with translation quality.

2.4.1 Evaluating word alignment

In order to evaluate the quality of the word alignment task, the Alignment Error Rate (AER) is commonly used. This measure requires a manual alignment reference (also called a gold standard), indicating which source words should be linked to which target words for each reference sentence. Due to the ambiguity of the alignment task, two link types are allowed during manual tagging, namely Sure links (which must be present for the alignment to be correct) and Possible links (which may be present for the alignment to be correct, but are not compulsory). With these definitions, Recall, Precision and AER are defined as:

recall = \frac{|A \cap S|}{|S|}, \qquad precision = \frac{|A \cap P|}{|A|}, \qquad AER = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}    (2.8)

where A is the hypothesis alignment, S is the set of Sure links in the gold standard reference, and P includes the set of Possible and Sure links in the gold standard reference. It has been shown that the percentage of Sure and Possible links in the gold standard reference has a strong influence on the final AER result, favoring high-precision alignments when Possible links outnumber Sure links, and favoring high-recall alignments otherwise. A well-founded criterion is to produce Possible links only when they allow combinations which are considered equally correct, as a reference with too many Possible links suffers from a loss of resolution, causing several different alignments to be rated equally. Taking into account that the notion of word alignment quality depends on the application, authors have reviewed standard scoring metrics for full-text alignment, given explanations of how to use them better, and suggested a strategy to build a reference corpus particularly adapted to applications.
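The following Perl sketch computes precision, recall and AER as in equation 2.8 for a toy hypothesis alignment against invented Sure and Possible link sets.

#!/usr/bin/perl
use strict;
use warnings;

# Sketch of equation 2.8: precision, recall and AER for a hypothesis alignment A
# against Sure (S) and Possible (P) gold links. Links are encoded as "src-tgt"
# position pairs; the sets below are invented for illustration.
my @A = qw(1-1 2-2 3-4 4-3);        # hypothesis links
my @S = qw(1-1 2-2 4-3);            # sure gold links
my @P = qw(1-1 2-2 3-3 3-4 4-3);    # possible gold links (S is a subset of P by convention)

my %is_sure = map { $_ => 1 } @S;
my %is_poss = map { $_ => 1 } @P;

my $a_and_s = grep { $is_sure{$_} } @A;   # |A intersect S|
my $a_and_p = grep { $is_poss{$_} } @A;   # |A intersect P|

my $recall    = $a_and_s / scalar(@S);
my $precision = $a_and_p / scalar(@A);
my $aer       = 1 - ($a_and_s + $a_and_p) / (scalar(@A) + scalar(@S));

printf "recall = %.3f, precision = %.3f, AER = %.3f\n", $recall, $precision, $aer;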

2.5 Use of linguistic knowledge into SMT

Although SMT systems initially did not incorporate any linguistic analysis and worked at the surface level of word forms, an increasing number of research efforts are introducing a certain degree of linguistic knowledge into the statistical framework. At this point, the pair of languages involved and their respective linguistic properties are crucial to justify a certain approach and explain its results. Therefore, the idea that a good statistical translation model for a certain pair of languages can be used for any other pair is confronted with the view that the goodness of such a model may be, at least in part, dependent on the specific language pair. Of course, conclusions will more easily hold for languages sharing many linguistic properties. To illustrate this, consider translating from English into Malayalam. While a certain vocabulary reduction of the source language may be useful in this direction, since many English words may translate to the same Malayalam word (due to morphological derivations which are not present in Malayalam), this same technique can be useless when translating in the opposite direction.

POS information has been used to improve the statistical alignment quality of the HMM-based model by introducing an additional lexicon probability for POS tags in both languages, although this does not actually go beyond full forms. In other work, words that share the same base form are considered equal in the EM training of the alignment models, resulting in an AER reduction. Regarding translation modeling, preliminary work on the source data is needed for increased translation performance. Transformations include issues such as compound word separation, reordering of separated verb prefixes (which are placed after the object in Malayalam) or mapping words to word-plus-POS to distinguish articles from pronouns, among others. A number of researchers have proposed other translation models where the translation process involves syntactic representations of the source and/or target languages. These models have radically different structures and parameterizations from N-gram-based (or phrase-based) models for SMT.


2.6 Machine Translation evaluation

It is well known that machine translation is a very hard task to evaluate automatically. Usually, evaluation is performed by computing some kind of similarity measure between the translation hypothesis and a set of human reference translations, which represent the expected output of the system. The fact that there are several correct alternative translations for any input sentence adds complexity to this task: whereas a higher correlation with the human references indicates better quality, we cannot theoretically guarantee that a low correlation with the available set of references means bad translation quality, unless we have all possible correct translations available. Therefore, it is generally accepted that all automatic metrics comparing hypotheses with a limited set of manual reference translations are pessimistic. Yet, instead of providing an absolute quality score, automatic measures are claimed to capture progress during system development and to correlate well statistically with human intuition. Next, the most widely used MT evaluation measures are introduced, namely BLEU, NIST, mWER and mPER. Other measures, which have not been used during this project work, are only referenced.

2.6.1 Automatic evaluation metrics

2.6.1.1 BLEU score

Arguably the most widely used evaluation measure today, BLEU (an acronym for BiLingual Evaluation Understudy) was introduced by IBM, and is always referred to a given n-gram order (BLEU_n, with n usually being 4). The metric works by measuring the n-gram co-occurrence between a given translation and the set of reference translations and then taking the weighted geometric mean. BLEU is specifically designed to approximate human judgment at the corpus level and can perform badly if used to evaluate the quality of isolated sentences. BLEU_n is defined as:

BLEU_n = \exp\left( \frac{\sum_{i=1}^{n} bleu_i}{n} + length\_penalty \right)    (2.9)

where bleu_i and length_penalty are cumulative counts (updated sentence by sentence) referred to the whole evaluation corpus (test and reference sets). Even though these matching counts are computed on a sentence-by-sentence basis, the final score is not computed as a cumulative score, i.e. it is not computed by accumulating per-sentence scores. Equations 2.10 and 2.11 show the bleu_n and length_penalty definitions, respectively:

bleu_n = \log\left( \frac{Nmatched_n}{Ntest_n} \right)    (2.10)

length\_penalty = \min\left\{ 0,\; 1 - \frac{shortest\_ref\_length}{Ntest_1} \right\}    (2.11)

Finally, Nmatched_i, Ntest_i and shortest_ref_length are also cumulative counts (updated sentence by sentence), defined as:

Nmatched_i = \sum_{n=1}^{N} \sum_{ngr \in S} \min\left\{ N(test_n, ngr),\; \max_{r}\{ N(ref_{n,r}, ngr) \} \right\}    (2.12)

where S is the set of n-grams of size i in sentence test_n, N(sent, ngr) is the number of occurrences of the n-gram ngr in sentence sent, N is the number of sentences to evaluate, test_n is the n-th sentence of the test set, R is the number of different references for each test sentence, and ref_{n,r} is the r-th reference of the n-th test sentence.

Ntest_i = \sum_{n=1}^{N} \big( length(test_n) - i + 1 \big)    (2.13)

shortest\_ref\_length = \sum_{n=1}^{N} \min_{r}\{ length(ref_{n,r}) \}    (2.14)

From the description of BLEU, we can conclude the following:

• BLEU is a quality metric defined in the range between 0 and 1, 0 meaning the worst translation (which does not match the references in any word), and 1 the perfect translation.

• BLEU is mostly a measure of precision, as bleu_n is computed by dividing the matching n-grams by the number of n-grams in the test (not in the reference). In this sense, a very high BLEU could be achieved with a short output, so long as all its n-grams are present in a reference.

• The recall or coverage effect is weighted through the length penalty. However, this is a very rough approach to recall, as it only takes lengths into account.

• Finally, the weight of each effect (precision and recall) is not clear, making it very difficult to know from a given BLEU score whether the provided translation lacks recall, precision or both.

Note that slight variations of these definitions have led to alternative versions of the BLEU score, although the literature considers BLEU a unique evaluation measure and no distinction among versions is made.
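For illustration, the Perl sketch below computes a simplified sentence-level BLEU-style score (clipped n-gram precisions up to order 4, geometric mean, brevity penalty) for a single hypothesis against a single reference. It is only a sketch of the idea, not the official mteval or multi-bleu implementation used for reporting results.

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min);

# Simplified BLEU-style score for one hypothesis against one reference
# (clipped n-gram precisions up to order 4, geometric mean, brevity penalty).
sub ngram_counts {
    my ($tokens, $n) = @_;
    my %c;
    for my $i (0 .. scalar(@$tokens) - $n) {
        $c{ join ' ', @{$tokens}[$i .. $i + $n - 1] }++;
    }
    return \%c;
}

sub bleu {
    my ($hyp, $ref, $max_n) = @_;
    my @h = split ' ', $hyp;
    my @r = split ' ', $ref;
    my $log_sum = 0;
    for my $n (1 .. $max_n) {
        my $hc = ngram_counts(\@h, $n);
        my $rc = ngram_counts(\@r, $n);
        my ($matched, $total) = (0, 0);
        for my $g (keys %$hc) {
            $total   += $hc->{$g};
            $matched += min($hc->{$g}, $rc->{$g} // 0);   # clipped counts
        }
        return 0 if $total == 0 || $matched == 0;         # avoid log(0)
        $log_sum += log($matched / $total);
    }
    my $bp = @h < @r ? exp(1 - @r / @h) : 1;              # brevity penalty
    return $bp * exp($log_sum / $max_n);
}

printf "BLEU = %.4f\n",
    bleu('yesterday i saw a film', 'yesterday i saw a good film', 4);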

2.6.1.2 NIST score

The NIST evaluation metric is based on the BLEU metric, but with some alterations. Whereas BLEU simply calculates n-gram precision, giving equal importance to each n-gram, NIST calculates how informative a particular n-gram is: the rarer a correct n-gram is, the more weight it is given. NIST also differs from BLEU in its calculation of the brevity penalty, so that small variations in translation length do not impact the overall score as much. Again, the NIST score is always referred to a given n-gram order (NIST_n, usually with n being 4), and it is defined as:

NIST_n = \left( \sum_{i=1}^{n} nist_i \right) \cdot nist\_penalty\!\left( \frac{test_1}{ref_1 / R} \right)    (2.15)

where nist_i and nist_penalty(ratio) are cumulative counts (updated sentence by sentence) referred to the whole evaluation corpus (test and reference sets). Even though these matching counts are computed on a sentence-by-sentence basis, the final score is not computed as a cumulative score. The ratio value computed using test_1, ref_1 and R expresses the relation between the number of words of the test set (test_1) and the average number of words of the reference sets (ref_1 / R); in other words, the relation between the translated number of words and the expected number of words for the whole test set. Equations 2.16 and 2.17 show the nist_n and nist_penalty definitions, respectively:

nist_n = \frac{Nmatch\_weight_n}{Ntest_n}    (2.16)

nist\_penalty(ratio) = \exp\left( \frac{\log(0.5)}{\log^2(1.5)} \cdot \log^2(ratio) \right)    (2.17)

Finally, Nmatch_weight_i is also a cumulative count (updated sentence by sentence), defined as:

Nmatch\_weight_i = \sum_{n=1}^{N} \sum_{ngr \in S} \left( \min\left\{ N(test_n, ngr),\; \max_{r}\{ N(ref_{n,r}, ngr) \} \right\} \cdot weight(ngr) \right)    (2.18)

where weight(ngr) weights every n-gram according to the identity of the words it contains, expressed as follows:

weight(ngr) = \begin{cases} -\log_2\left( \frac{N(ngr)}{N(mgr)} \right) & \text{if } mgr \text{ exists} \\ -\log_2\left( \frac{N(ngr)}{Nwords} \right) & \text{otherwise} \end{cases}    (2.19)

where mgr is the n-gram consisting of the same words as ngr except for the last word, N(ngram) is the number of occurrences of the n-gram ngram in the reference sets, and Nwords is the total number of words in the reference sets.

The NIST score is a quality score ranging from 0 (worst translation) to an unlimited positive value. In practice, this score ranges between 5 and 12, depending on the difficulty of the task (languages involved and test set length). From its definition, we can conclude that NIST favours translations that have the same length as the average reference translation. If the provided translation is perfect but 'short' (for example, if it is the result of choosing the shortest reference for each sentence), the resulting NIST score is much lower than that of another translation with a length closer to that of the average reference.
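The information weighting of equation 2.19 can be sketched as follows in Perl: n-gram counts are collected over the reference set, and each n-gram is weighted by how unpredictable its last word is given its history. The reference sentences are invented and the snippet is illustrative only.

#!/usr/bin/perl
use strict;
use warnings;

# Sketch of the NIST information weight (equation 2.19):
# weight(ngr) = -log2( N(ngr) / N(mgr) ), where mgr drops the last word of ngr,
# falling back to -log2( N(ngr) / Nwords ) for single words.
my @references = ('yesterday i saw a film', 'i saw a good film yesterday');

my %count;      # n-gram counts over all references
my $nwords = 0;
for my $ref (@references) {
    my @w = split ' ', $ref;
    $nwords += @w;
    for my $n (1 .. 4) {
        for my $i (0 .. @w - $n) {
            $count{ join ' ', @w[$i .. $i + $n - 1] }++;
        }
    }
}

sub weight {
    my ($ngr) = @_;
    my @w = split ' ', $ngr;
    my $denom;
    if (@w > 1) {
        my $mgr = join ' ', @w[0 .. $#w - 1];     # ngr minus its last word
        $denom = $count{$mgr};
    }
    $denom = $nwords unless defined $denom;       # fall back to total word count
    return -log($count{$ngr} / $denom) / log(2);  # log base 2
}

printf "weight('a film') = %.3f\n", weight('a film');
printf "weight('film')   = %.3f\n", weight('film');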

2.6.1.3 mWER

Word Error Rate (WER) is a standard speech recognition evaluation metric, where the problem of multiple references does not exist. For translation, its multiple-reference version (mWER) is computed on a sentence-by-sentence basis, so that the final measure for a given corpus is based on the cumulative WER for each sentence. This is expressed in equation 2.20:

mWER = \frac{\sum_{n=1}^{N} WER_n}{\sum_{n=1}^{N} Avrg\_Ref\_Length_n} \cdot 100    (2.20)

where N is the number of sentences to be evaluated. Assuming we have R different references for each sentence, the average reference length for a given sentence n is defined as:

Avrg\_Ref\_Length_n = \frac{\sum_{r=1}^{R} Length(Ref_{n,r})}{R}    (2.21)

Finally, the WER cost for a given sentence n is defined as:

WER_n = \min_{r} LevDist(Test_n, Ref_{n,r})    (2.22)

where LevDist is the Levenshtein distance between the test sentence and the reference being evaluated, assigning an equal cost of 1 to deletions, insertions and substitutions. All lengths are computed in number of words. mWER is a percentage error metric, thus defined in the range 0 to 100, 0 meaning a perfect translation (matching at least one reference for each test sentence). From the mWER description, we can conclude that the score tends to slightly favour shorter translations over longer translations. This can be explained by considering that the absolute number of errors (found as the Levenshtein distance) is divided by the average sentence length of the references, so that a mistake of one word with respect to a long reference is overweighted in contrast to a mistake of one word with respect to a short reference. Suppose we have three references of lengths 9, 11 and 13 (average length 11). If we have a translation which is equal to the shortest reference except for one mistake, we obtain a score of 1/11 (where, in fact, the error could be considered higher, as it is one mistake over 9 words, that is, 1/9).
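A minimal Perl sketch of the per-sentence computation behind equations 2.20-2.22 is given below: a word-level Levenshtein distance against each reference, the minimum taken, and the result normalized by the average reference length. The sentences are invented.

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min sum);

# Word-level Levenshtein distance between two sentences (array refs of words).
sub lev_dist {
    my ($a, $b) = @_;
    my @prev = (0 .. scalar(@$b));
    for my $i (1 .. scalar(@$a)) {
        my @cur = ($i);
        for my $j (1 .. scalar(@$b)) {
            my $cost = $a->[$i - 1] eq $b->[$j - 1] ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# One test sentence with two references (a real corpus would have many sentences).
my @test = split ' ', 'yesterday i saw a film';
my @refs = ( [split ' ', 'yesterday i saw a movie'],
             [split ' ', 'i saw a film yesterday'] );

my $wer     = min( map { lev_dist(\@test, $_) } @refs );
my $avg_len = sum( map { scalar @$_ } @refs ) / scalar @refs;
my $mwer    = 100 * $wer / $avg_len;       # single-sentence case of equation 2.20
printf "WER = %d, mWER = %.2f%%\n", $wer, $mwer;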

2.6.1.4 mPER

Similar to WER, the so-called Position-Independent Error Rate (mPER) is computed on a sentence-by-sentence basis, so that the final measure for a given corpus is based on the cumulative PER for each sentence. This is expressed as:

mPER = \frac{\sum_{n=1}^{N} PER_n}{\sum_{n=1}^{N} Avrg\_Ref\_Length_n} \cdot 100    (2.23)

where N is the number of sentences to be evaluated. Assuming we have R different references for each sentence, the average reference length for a given sentence n is defined as in equation 2.21. Finally, the PER cost for a given sentence n is defined as:

PER_n = \min_{r} \big( Pmax(Test_n, Ref_{n,r}) \big)    (2.24)

where Pmax is the maximum of:

• POS = the number of words in the reference that are not found in the test sentence (recall)

• NEG = the number of words in the test sentence that are not found in the reference (precision)

In this case, the number of words includes repetitions. This means that if a certain word appears twice in the reference but only once in the test, then POS = 1.
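The per-sentence PER cost of equation 2.24 amounts to a bag-of-words comparison, as in the following Perl sketch with invented sentences.

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(max);

# Sketch of the per-sentence PER cost (equation 2.24): bag-of-words comparison
# between test and reference, counting repetitions.
sub per_cost {
    my ($test, $ref) = @_;                  # array refs of words
    my (%t, %r);
    $t{$_}++ for @$test;
    $r{$_}++ for @$ref;
    my ($pos, $neg) = (0, 0);
    # POS: reference words missing from the test (counting multiplicity).
    for my $w (keys %r) {
        my $miss = $r{$w} - ($t{$w} // 0);
        $pos += $miss if $miss > 0;
    }
    # NEG: test words missing from the reference (counting multiplicity).
    for my $w (keys %t) {
        my $miss = $t{$w} - ($r{$w} // 0);
        $neg += $miss if $miss > 0;
    }
    return max($pos, $neg);
}

my @test = split ' ', 'i saw a film yesterday';
my @ref  = split ' ', 'yesterday i saw a good film';
printf "PER cost = %d\n", per_cost(\@test, \@ref);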


CHAPTER 3

STATISTICAL MACHINE TRANSLATION

3.1 Background

Statistical machine translation as a research area started in the late 1980s with the Candide project at IBM. IBM's original approach maps individual words to words and allows for the deletion and insertion of words. Lately, various researchers have shown better translation quality with the use of phrase translation. Phrase-based MT can be traced back to Och's alignment template model, which can be re-framed as a phrase translation system. Other researchers augmented their systems with phrase translation, such as Yamada, who uses phrase translation in a syntax-based model. Marcu introduced a joint-probability model for phrase translation. At this point, most competitive statistical machine translation systems use phrase translation, such as the CMU, IBM, ISI and Google systems, to name just a few. Phrase-based systems came out ahead in a recent international machine translation competition.

Of course, there are other ways to do machine translation. Most commercial systems use transfer rules and a rich translation lexicon. Until recently, machine translation research was focused on knowledge-based systems that use an interlingua representation as an intermediate step between input and output. There are also other ways to do statistical machine translation. There is some effort in building syntax-based models that either use real syntax trees generated by syntactic parsers, or tree transfer methods motivated by syntactic reordering patterns. The alternative phrase-based methods differ in the way the phrase translation table is created, which we explore in the coming sections.

3.1.1 Model The figure below illustrates the process of phrase-based translation. The input is segmented into a number of sequences of consecutive words (so-called phrases). Each phrase is translated into a Malayalam phrase, and Malayalam phrases in the output may be reordered.

Fig 3.1 Phrase reordering

In this section, we will define the phrase-based machine translation model formally. The phrase translation model is based on the noisy channel model. We use Bayes rule to reformulate the translation probability for translating an English sentence e into a Malayalam sentence m as

\arg\max_m p(m \mid e) = \arg\max_m p(e \mid m) \, p(m)        (3.1)

This allows for a language model p(m) and a separate translation model p(e | m). During decoding, the English input sentence e is segmented into a sequence of I phrases e_1^I. We assume a uniform probability distribution over all possible segmentations. Each English phrase e_i in e_1^I is translated into a Malayalam phrase m_i. The Malayalam phrases may be reordered. Phrase translation is modeled by a probability distribution \phi(e_i \mid m_i). Recall that, due to the Bayes rule, the translation direction is inverted from a modeling standpoint. Reordering of the Malayalam output phrases is modeled by a relative distortion probability distribution d(start_i, end_{i-1}), where start_i denotes the start position of the English phrase that was translated into the i-th Malayalam phrase, and end_{i-1} denotes the end position of the English phrase that was translated into the (i-1)-th Malayalam phrase. We use a simple distortion model d(start_i, end_{i-1}) = \alpha^{|start_i - end_{i-1} - 1|} with an appropriate value for the parameter \alpha. In order to calibrate the output length, we introduce a factor \omega (called word cost) for each generated Malayalam word, in addition to the trigram language model p_LM. This is a simple means to optimize performance. Usually, this factor is larger than 1, biasing toward longer output. In summary, the best Malayalam output sentence m_best given an English input sentence e according to our model is

m_{best} = \arg\max_m p(m \mid e) = \arg\max_m p(e \mid m) \, p_{LM}(m) \, \omega^{length(m)}        (3.2)

where p(e | m) is decomposed into

p(e_1^I \mid m_1^I) = \prod_{i=1}^{I} \phi(e_i \mid m_i) \, d(start_i, end_{i-1})        (3.3)
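As an illustration of how equations 3.2 and 3.3 combine, the following sketch (with hypothetical probabilities and parameter values) assembles the log of the model score for one fixed segmentation; a real decoder searches over segmentations and reorderings as described in the next sections.

import math

def model_logscore(phrase_pairs, target_len, lm_logprob, alpha=0.6, omega=1.1):
    # phrase_pairs: (start, end, phi) per source phrase, in the order in which the
    # corresponding Malayalam phrases are produced (0-based source positions).
    # Returns log( prod_i phi_i * d_i  *  p_LM(m)  *  omega^target_len ).
    score = lm_logprob + target_len * math.log(omega)
    prev_end = -1                                   # position before the first source word
    for start, end, phi in phrase_pairs:
        score += math.log(phi)                                  # phrase translation probability
        score += abs(start - prev_end - 1) * math.log(alpha)    # distortion cost d
        prev_end = end
    return score

# Monotone two-phrase example with hypothetical probabilities.
print(model_logscore([(0, 1, 0.5), (2, 3, 0.25)], target_len=4, lm_logprob=-6.0))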

3.1.2 Word Alignment

Methods for extracting a phrase translation table from a parallel corpus start with a word alignment. Word alignment is an active research topic. At this point, the most common tool to establish a word alignment is the toolkit GIZA++. This toolkit is an implementation of the original IBM Models that started statistical machine translation research. However, these models have some serious drawbacks. Most importantly, they allow at most one Malayalam word to be aligned with each English word. To resolve this, some transformations are applied. First, the parallel corpus is aligned bidirectionally, i.e., English to Malayalam and Malayalam to English. This generates two word alignments that have to be reconciled. If we intersect the two alignments, we get a high-precision alignment of high-confidence alignment points. If we take the union of the two alignments, we get a high-recall alignment with additional alignment points.
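The bidirectional reconciliation can be illustrated with plain set operations over alignment points; the alignments below are hypothetical.

e2m = {(0, 0), (1, 2), (2, 1)}           # English -> Malayalam Viterbi alignment
m2e = {(0, 0), (1, 2), (3, 2)}           # Malayalam -> English alignment, flipped to (e, m)

high_precision = e2m & m2e               # intersection: high-confidence points
high_recall    = e2m | m2e               # union: additional points

print(sorted(high_precision))            # [(0, 0), (1, 2)]
print(sorted(high_recall))               # [(0, 0), (1, 2), (2, 1), (3, 2)]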


3.1.3 Methods for Learning Phrase Translations

Most of the recently proposed methods use a word alignment to learn a phrase translation table. We discuss three such methods in this section and one exception. First, the exception: Marcu and Wong establish phrase correspondences directly in a parallel corpus. To learn such correspondences, they introduced a phrase-based joint probability model that simultaneously generates both the source and target sentences in a parallel corpus. Expectation Maximization learning in Marcu and Wong's framework yields both (i) a joint probability distribution \phi(m, e), which reflects the probability that phrases m and e are translation equivalents, and (ii) a joint distribution d(i, j), which reflects the probability that a phrase at position i is translated into a phrase at position j. To use this model in the context of our framework, we simply marginalize the joint probabilities to conditional probabilities. This approach is consistent with the approach taken by Marcu and Wong themselves, who use conditional models during decoding.

3.1.4 Och and Ney

Och and Ney propose a heuristic approach to refine the alignments obtained from GIZA++. At a minimum, all alignment points of the intersection of the two alignments are maintained. At a maximum, the points of the union of the two alignments are considered. To illustrate this, see the figure below: the intersection points are black, and the additional points in the union are shaded grey.


Fig 3.2 Word alignment

Och and Ney explore the space between intersection and union with expansion heuristics that start with the intersection and add additional alignment points. The decision about which points to add may depend on a number of criteria:
• In which alignment does the potential alignment point exist, English-to-Malayalam or Malayalam-to-English?
• Does the potential point neighbor already established points? Does neighboring mean directly adjacent (block distance), or also diagonally adjacent?
• Is the Malayalam or the English word that the potential point connects unaligned so far? Are both unaligned?
• What is the lexical probability for the potential point?

Och and Ney are ambiguous in their description about which alignment points are added in their refined method. We reimplemented their method for Moses, so we describe our interpretation here. Our heuristic proceeds as follows: we start with the intersection of the two word alignments and only add new alignment points that exist in the union of the two word alignments. We also always require that a new alignment point connects at least one previously unaligned word. First, we expand to only directly adjacent alignment points. We check for potential points starting from the top right corner of the alignment matrix, checking for alignment points for the first Malayalam word, then continue with alignment points for the second Malayalam word, and so on. This is done iteratively until no alignment point can be added anymore. In a final step, we add non-adjacent alignment points, with otherwise the same requirements. We then collect all aligned phrase pairs that are consistent with the word alignment: the words in a legal phrase pair are aligned only to each other, and not to words outside. The set of bilingual phrases BP can be defined formally as:

BP(f_1^J, e_1^I, A) = \{ (f_j^{j+m}, e_i^{i+n}) : \forall (i', j') \in A : j \le j' \le j+m \leftrightarrow i \le i' \le i+n \}        (3.4)
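A compact sketch of the consistency check in equation 3.4 is given below; it omits the usual extension to unaligned boundary words, and the toy sentence pair and alignment are invented for illustration only.

def extract_phrase_pairs(src, tgt, A, max_len=4):
    # A is a set of (source_pos, target_pos) links, positions are 0-based.
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # target positions linked to the chosen source span
            tps = [t for (s, t) in A if i1 <= s <= i2]
            if not tps:
                continue
            j1, j2 = min(tps), max(tps)
            # consistency: no word of tgt[j1..j2] may link outside src[i1..i2]
            if all(i1 <= s <= i2 for (s, t) in A if j1 <= t <= j2):
                pairs.append((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs

src = "he reads a book".split()
tgt = "avan oru pustakam vaayikkunnu".split()      # illustrative Malayalam gloss
A = {(0, 0), (1, 3), (2, 1), (3, 2)}
print(extract_phrase_pairs(src, tgt, A))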

Given the collected phrase pairs, we estimate the phrase translation probability distribution by relative frequency:

\phi(e \mid m) = \frac{count(e, m)}{\sum_{e} count(e, m)}

No smoothing is performed, although lexical weighting addresses the problem of sparse data. Tillmann proposes a variation of this method: he starts with phrase alignments based on the intersection of the two GIZA++ alignments and uses points of the union to expand these.
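The relative-frequency estimate can be computed directly from the extracted phrase-pair counts, as in the following sketch with hypothetical counts.

from collections import Counter, defaultdict

pair_counts = Counter({("house", "viiT"): 8, ("home", "viiT"): 2, ("nest", "kuuT"): 3})

phi = defaultdict(dict)
for (e, m), c in pair_counts.items():
    phi[m][e] = c / sum(cnt for (e2, m2), cnt in pair_counts.items() if m2 == m)

print(phi["viiT"])    # {'house': 0.8, 'home': 0.2}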

Venugopal, Zhang, and Vogel

Venugopal et al. (ACL 2003) also allow for the collection of phrase pairs that are violated by the word alignment. They introduce a number of scoring methods that take consistency with the word alignment, lexical translation probabilities, phrase length, etc. into account. Zhang et al. (2003) propose a phrase alignment method that is based on word alignments and tries to find a unique segmentation of the sentence pair, as is done directly by Marcu and Wong. This enables them to estimate joint probability distributions, which can be marginalized into conditional probability distributions. Vogel et al. (2003) review these two methods and show that combining phrase tables generated by different methods improves results.

3.2 Decoder

This section describes the Moses decoder from a more theoretical perspective. The decoder was originally developed for the phrase model. At that time, only a greedy hill-climbing decoder was available, which was insufficient for our work on noun phrase translation. The decoder implements a beam search and is roughly similar to work by Tillmann (PhD, 2001) and Och (PhD, 2002). In fact, by reframing Och's alignment template model as a phrase translation model, the decoder is also suitable for his model, as well as other recently proposed phrase models. We start this section by defining the concept of translation options, then describe the basic mechanism of beam search and its necessary components: pruning and future cost estimates. We conclude with background on n-best list generation.

3.2.1 Translation Options

Given an input string of words, a number of phrase translations could be applied. We call each such applicable phrase translation a translation option. This is illustrated in the figure below, where a number of phrase translations for the example input sentence Maria no daba una bofetada a la bruja verde are given. These translation options are collected before any decoding takes place. This allows a quicker lookup than consulting the whole phrase translation table during decoding. The translation options are stored with the following information:
• First English word covered
• Last English word covered
• Malayalam phrase translation
• Phrase translation probability

Note that only the translation options that can be applied to a given input text are necessary for decoding. Since the entire phrase translation table may be too big to fit into memory, we can restrict ourselves to these translation options to overcome such computational concerns. We may even generate a phrase translation table on demand that only includes valid translation options for a given input text. This way, a full phrase translation table (that may be computationally too expensive to produce) may never have to be built.
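The following sketch (with a hypothetical in-memory phrase table) shows how such translation options, carrying exactly the four fields listed above, can be collected for one input sentence before decoding.

from collections import namedtuple

TranslationOption = namedtuple("TranslationOption",
                               ["first", "last", "target_phrase", "prob"])

def collect_options(words, phrase_table, max_len=3):
    options = []
    for i in range(len(words)):
        for j in range(i, min(i + max_len, len(words))):
            src = " ".join(words[i:j + 1])
            for tgt, p in phrase_table.get(src, []):    # only phrases present in the input
                options.append(TranslationOption(i, j, tgt, p))
    return options

phrase_table = {"the house": [("viiT", 0.7)], "house": [("viiT", 0.6), ("pura", 0.3)]}
print(collect_options("the house".split(), phrase_table))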

3.2.2 Core Algorithm

The phrase-based decoder we developed employs a beam search algorithm. The Malayalam output sentence is generated left to right in the form of hypotheses. Starting from the initial hypothesis, the first expansion is the English word E1, which is translated as M1. We can generate new hypotheses from these expanded hypotheses. Following the back pointers of the hypotheses we can read off the (partial) translations of the sentence.

Let us now describe the beam search more formally. We begin the search in an initial state where no English input words are translated and no Malayalam output words have been generated. New states are created by extending the Malayalam output with a phrasal translation that covers some of the English input words not yet translated. The current cost of the new state is the cost of the original state multiplied with the translation, distortion and language model costs of the added phrasal translation. Note that we use the informal concept of cost analogously to probability: a high cost is a low probability.


Each search state (hypothesis) is represented by:
• A back link to the best previous state (needed for finding the best translation of the sentence by back-tracking through the search states)
• The English words covered so far
• The last two Malayalam words generated (needed for computing future language model costs)
• The end of the last English phrase covered (needed for computing future distortion costs)
• The last added Malayalam phrase (needed for reading the translation from a path of hypotheses)
• The cost so far
• An estimate of the future cost

Final states in the search are hypotheses that cover all English words. Among these, the hypothesis with the lowest cost (highest probability) is selected as the best translation. The algorithm described so far can be used for exhaustively searching through all possible translations. In the next sections we will describe how to optimize the search by discarding hypotheses that cannot be part of the path to the best translation. We then introduce the concept of comparable states, which allows us to define a beam of good hypotheses and prune out hypotheses that fall outside this beam.
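A minimal sketch of such a search state is given below; the field names are illustrative, and the recombination key anticipates the criteria of the next subsection.

from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

@dataclass
class Hypothesis:
    back: Optional["Hypothesis"]     # back link to the best previous state
    covered: FrozenSet[int]          # English input positions covered so far
    last_two: Tuple[str, ...]        # last two Malayalam words generated
    last_src_end: int                # end of the last English phrase covered
    last_phrase: str                 # last added Malayalam phrase
    cost: float                      # cost accumulated so far (negative log probability)
    future: float = 0.0              # estimate of the future cost

    def recombination_key(self):
        # two hypotheses agreeing on this key can be recombined (see next subsection)
        return (self.covered, self.last_two, self.last_src_end)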

3.2.3 Recombining Hypotheses

Recombining hypotheses is a risk-free way to reduce the search space. Two hypotheses can be recombined if they agree in:
• The English words covered so far
• The last two Malayalam words generated
• The end of the last English phrase covered

If there are two paths that lead to two hypotheses that agree in these properties, we keep only the cheaper hypothesis, i.e., the one with the lower cost so far. The other hypothesis cannot be part of the path to the best translation, and we can safely discard it. Note that the inferior hypothesis can still be part of the path to the second best translation. This is important for generating n-best lists.

3.2.4 Beam Search

While the recombination of hypotheses as described above reduces the size of the search space, this is not enough for all but the shortest sentences. Let us estimate how many hypotheses (or states) are generated during an exhaustive search. Considering the possible values for the properties of unique hypotheses, we can estimate an upper bound for the number of states by N \approx 2^{n_f} |V_e|^2 n_f, where n_f is the number of English input words and |V_e| is the size of the Malayalam vocabulary. In practice, the number of possible Malayalam words for the last two words generated is much smaller than |V_e|^2. The main concern is the exponential explosion from the 2^{n_f} possible configurations of English words covered by a hypothesis. Note that this causes the problem of machine translation to become NP-complete and thus dramatically harder than, for instance, speech recognition.

In our beam search we compare the hypotheses that cover the same number of English words and prune out the inferior hypotheses. We could base the judgment of what an inferior hypothesis is on the cost of each hypothesis so far. However, this is generally a very bad criterion, since it biases the search towards first translating the easy part of the sentence. For instance, if there is a three-word English phrase that easily translates into a common Malayalam phrase, this may carry much less cost than translating three words separately into uncommon Malayalam words. The search will prefer to start the sentence with the easy part and discount alternatives too early. So, our measure for pruning out hypotheses in our beam search does not only include the cost so far, but also an estimate of the future cost. This future cost estimation should favor hypotheses that have already covered difficult parts of the sentence and have only easy parts left, and discount hypotheses that covered the easy parts first. Given the cost so far and the future cost estimation, we can prune out hypotheses that fall outside the beam. The beam size can be defined by threshold and histogram pruning. A relative threshold cuts out a hypothesis with a probability less than a factor of the best hypothesis. Histogram pruning keeps a certain number n of hypotheses. This type of pruning is not risk-free. If the future cost estimates are inadequate, we may prune out hypotheses on the path to the best scoring translation. In a particular version of beam search, A* search, the future cost estimate is required to be admissible, which means that it never overestimates the future cost. Using best-first search and an admissible heuristic allows pruning that is risk-free. In practice, however, this type of pruning does not sufficiently reduce the search space.

The figure below gives pseudo-code for the algorithm we used for our beam search. For each number of English words covered, a hypothesis stack is created. The initial hypothesis is placed in the stack for hypotheses with no English words covered. Starting with this hypothesis, new hypotheses are generated by committing to phrasal translations that cover previously unused English words. Each derived hypothesis is placed in a stack based on the number of English words it covers.

initialize hypothesisStack[0 .. nf];
create initial hypothesis hyp_init;
add to stack hypothesisStack[0];
for i = 0 to nf-1:
  for each hyp in hypothesisStack[i]:
    for each new_hyp that can be derived from hyp:
      nf[new_hyp] = number of English words covered by new_hyp;
      add new_hyp to hypothesisStack[nf[new_hyp]];
      prune hypothesisStack[nf[new_hyp]];
find best hypothesis best_hyp in hypothesisStack[nf];
output best path that leads to best_hyp;

We proceed through these hypothesis stacks, going through each hypothesis in the stack, deriving new hypotheses for this hypothesis and placing them into the appropriate stack (see the figure below for an illustration). After a new hypothesis is placed into a stack, the stack may have to be pruned by threshold or histogram pruning, if it has become too large. In the end, the best hypothesis of the ones that cover all English words is the final state of the best translation. We can read off the Malayalam words of the translation by following the back links in each hypothesis.

Fig 3.3: Hypothesis stacks

3.2.5 Future Cost Estimation

Recall that for excluding hypotheses from the beam we do not only have to consider the cost so far, but also an estimate of the future cost. While it is possible to calculate the cheapest possible future cost for each hypothesis, this is computationally so expensive that it would defeat the purpose of the beam search.


The future cost is tied to the English words that are not yet translated. In the framework of the phrase-based model, not only may single words be translated individually, but also consecutive sequences of words as a phrase. Each such translation operation carries a translation cost, language model costs, and a distortion cost. For our future cost estimate we consider only translation and language model costs. The language model cost is usually calculated by a trigram language model. However, we do not know the preceding Malayalam words for a translation operation. Therefore, we approximate this cost by computing the language model score for the generated Malayalam words alone. That means, if only one Malayalam word is generated, we take its unigram probability. If two words are generated, we take the unigram probability of the first word and the bigram probability of the second word, and so on. For a sequence of English words multiple overlapping translation options exist. We just described how we calculate the cost for each translation option. The cheapest way to translate the sequence of English words includes the cheapest translation options. We approximate the cost for a path through translation options by the product of the cost for each option. The cheapest path for a sequence of English words can be quickly computed with dynamic programming. Also note that if the English words not covered so far are two (or more) disconnected sequences of English words, the combined cost is simply the product of the costs for each contiguous sequence. Since there are only n(n+1)/2 contiguous sequences for n words, the future cost estimates for these sequences can be easily precomputed and cached for each input sentence. Looking up the future costs for a hypothesis can then be done very quickly by table lookup. This has considerable speed advantages over computing future cost on the fly.
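The span-based precomputation can be sketched with a small dynamic program over contiguous source spans; the per-span option costs below are hypothetical negative log costs (translation times language model estimate).

import math

def future_cost_table(n, span_cost):
    # span_cost[(i, j)] = best cost of any single translation option covering words i..j.
    INF = math.inf
    best = {(i, j): span_cost.get((i, j), INF) for i in range(n) for j in range(i, n)}
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length - 1
            for k in range(i, j):                       # split span into i..k and k+1..j
                best[(i, j)] = min(best[(i, j)], best[(i, k)] + best[(k + 1, j)])
    return best

costs = {(0, 0): 2.0, (1, 1): 1.5, (2, 2): 3.0, (1, 2): 2.5, (0, 1): 4.0}
print(future_cost_table(3, costs)[(0, 2)])   # 2.0 + 2.5 = 4.5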


3.2.6 N-Best Lists Generation

Usually, we expect the decoder to give us the best translation for a given input according to the model. But for some applications, we might also be interested in the second best translation, third best translation, and so on. A common method in speech recognition that has also emerged in machine translation is to first use a machine translation system such as our decoder as a base model to generate a set of candidate translations for each input sentence. Then, additional features are used to rescore these translations. An n-best list is one way to represent multiple candidate translations. Such a set of possible translations can also be represented by word graphs or forest structures over parse trees. These alternative data structures allow for a more compact representation of a much larger set of candidates. However, it is much harder to detect and score global properties over such data structures.

3.2.6.1 Additional Arcs in the Search Graph

Recall the process of state expansions. The generated hypotheses and the expansions that link them form a graph. Paths branch out when there are multiple translation options for a hypothesis from which multiple new hypotheses can be derived. Paths join when hypotheses are recombined. Usually, when we recombine hypotheses, we simply discard the worse hypothesis, since it cannot possibly be part of the best path through the search graph (in other words, part of the best translation). But since we are now also interested in the second best translation, we cannot simply discard information about that hypothesis. If we did, the search graph would only contain one path for each hypothesis in the last hypothesis stack (which contains the hypotheses that cover all English words). If we store the information that there are multiple ways to reach a hypothesis, the number of possible paths also multiplies along the path when we traverse backward through the graph. In order to keep the information about merging paths, we keep a record of each such merge that contains:
• the identifier of the previous hypothesis
• the identifier of the lower-cost hypothesis
• the cost from the previous to the higher-cost hypothesis

The figure below gives an example of the generation of such an arc: in this case, hypotheses 2 and 4 are equivalent with respect to the heuristic search, as detailed above. Hence, hypothesis 4 is deleted. But since we want to keep the information about the path leading from hypothesis 3 to 2, we store a record of this arc. The arc also contains the cost added from hypothesis 3 to 4. Note that the cost from hypothesis 1 to hypothesis 2 does not have to be stored, since it can be recomputed from the hypothesis data structures.

Fig 3.4 Generation of an arc

3.2.6.2 Mining the Search Graph for an n-Best List

The graph of the hypothesis space can also be viewed as a probabilistic finite state automaton. The hypotheses are states, and the records of back-links and the additionally stored arcs are state transitions. The probability scores added when expanding a hypothesis are the costs of the state transitions. Finding the n best paths in such a probabilistic finite state automaton is a well-studied problem.

3.3 Factored Translation Models

The current state-of-the-art approach to statistical machine translation, the so-called phrase-based model, is limited to the mapping of small text chunks (phrases) without any explicit use of linguistic information, be it morphological, syntactic, or semantic. Such additional information has been demonstrated to be valuable when integrated into pre-processing or post-processing. However, a tighter integration of linguistic information into the translation model is desirable for two reasons:
• Translation models that operate on more general representations, such as lemmas instead of surface forms of words, can draw on richer statistics and overcome the data sparseness problems caused by limited training data.
• Many aspects of translation can be best explained on a morphological, syntactic, or semantic level. Having such information available to the translation model allows the direct modeling of these aspects. For instance, reordering at the sentence level is mostly driven by general syntactic principles, local agreement constraints show up in morphology, etc.


Therefore, we developed a framework for statistical translation models that tightly integrates additional information. Our framework is an extension of the phrase-based approach. It adds additional annotation at the word level. A word in our framework is no longer only a token, but a vector of factors that represent different levels of annotation.

3.3.1 Motivating Example: Morphology

One example to illustrate the shortcomings of the traditional surface word approach in statistical machine translation is the poor handling of morphology. Each word form is treated as a token in itself. This means that the translation model treats, say, the word house completely independently of the word houses. Any instance of house in the training data does not add any knowledge to the translation of houses. In the extreme case, while the translation of house may be known to the model, the word houses may be unknown and the system will not be able to translate it. This problem shows up strongly in Malayalam, which is a morphologically rich, agglutinative language with a lot of inflections, just as it constitutes a significant problem for other morphologically rich languages such as Arabic, German, Czech, etc.

Thus, it may be preferable to model translation between morphologically rich languages on the level of lemmas, and thus pool the evidence for different word forms that derive from a common lemma. In such a model, we would want to translate lemma and morphological information separately, and combine this information on the output side to ultimately generate the output surface words. Such a model can be defined straightforwardly as a factored translation model. See the figure below for an illustration of this model in our framework.

While we illustrate the use of factored translation models on such a linguistically motivated example, our framework also applies to models that incorporate statistically defined word classes, or any other annotation.

3.3.2 Decomposition of Factored Translation

The translation of factored representations of input words into the factored representations of output words is broken up into a sequence of mapping steps that either translate input factors into output factors, or generate additional output factors from existing output factors. In this model the translation process is broken up into the following three mapping steps:
• Translate input lemmas into output lemmas
• Translate morphological and POS factors
• Generate surface forms given the lemma and linguistic factors


Factored translation models build on the phrase-based approach, which defines a segmentation of the input and output sentences into phrases. Our current implementation of factored translation models follows strictly the phrase-based approach, with the additional decomposition of phrase translation into a sequence of mapping steps. Since all mapping steps operate on the same phrase segmentation of the input and output sentence into phrase pairs, we call these synchronous factored models.

Let us now take a closer look at one example, the translation of the one-word phrase 'houses' into Malayalam. The factored representation of 'houses' in English is: surface form houses | lemma house | part-of-speech NN | count plural | case nominative | gender neutral. The three mapping steps in our morphological analysis and generation model may provide the following applicable mappings:

• Translation: Mapping lemmas
  – house -> viiT, pura, keTTiTaM, kuuT
• Translation: Mapping morphology
  – NN|plural-nominative-neutral -> NN|plural, NN|singular
• Generation: Generating surface forms
  – viiT|NN|plural -> viiTukaL~
  – viiT|NN|singular -> viiT
  – kuuT|NN|plural -> kuuTukaL~

We call the application of these mapping steps to an input phrase expansion. Given the multiple choices for each step (reflecting the ambiguity in translation), each input phrase may be expanded into a list of translation options. The input phrase houses | house | NN | plural-nominative-neutral may be expanded as follows:

• Translation: Mapping lemmas
  – { ?|viiT|?|?, ?|pura|?|?, ?|keTTiTaM|?|?, ?|kuuT|?|? }
• Translation: Mapping morphology
  – { ?|viiT|NN|plural, ?|pura|NN|plural, ?|keTTiTaM|NN|plural, ?|kuuT|NN|plural, ?|viiT|NN|singular, ... }
• Generation: Generating surface forms
  – { viiTukaL~|viiT|NN|plural, purakaL~|pura|NN|plural, keTTiTangngaL~|keTTiTaM|NN|plural, kuuTukaL~|kuuT|NN|plural, viiT|viiT|NN|singular, ... }
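The expansion of this example can be sketched as the composition of the three mapping tables. The tables below simply restate the mappings listed above, and the generation step is allowed to fail for combinations it does not cover.

from itertools import product

lemma_map = {"house": ["viiT", "pura", "keTTiTaM", "kuuT"]}
morph_map = {"NN|plural-nominative-neutral": ["NN|plural", "NN|singular"]}
surface_map = {("viiT", "NN|plural"): "viiTukaL~",
               ("viiT", "NN|singular"): "viiT",
               ("pura", "NN|plural"): "purakaL~",
               ("keTTiTaM", "NN|plural"): "keTTiTangngaL~",
               ("kuuT", "NN|plural"): "kuuTukaL~"}

def expand(lemma, morph):
    options = []
    for out_lemma, out_morph in product(lemma_map[lemma], morph_map[morph]):
        surface = surface_map.get((out_lemma, out_morph))
        if surface is not None:                  # the generation step may fail
            options.append(f"{surface}|{out_lemma}|{out_morph}")
    return options

print(expand("house", "NN|plural-nominative-neutral"))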

3.3.3 Statistical Model

Factored translation models follow closely the statistical modeling approach of phrase-based models (in fact, phrase-based models are a special case of factored models). The main difference lies in the preparation of the training data and the type of models learned from the data.

3.3.3.1 Training

The training data (a parallel corpus) has to be annotated with the additional factors. For instance, if we want to add part-of-speech information on the input and output side, we need to obtain part-of-speech tagged training data. Typically this involves running automatic tools on the corpus, since manually annotated corpora are rare and expensive to produce. Next, we need to establish a word-alignment for all the sentences in the parallel training corpus. Here, we use the same methodology as in phrase-based models (symmetrized GIZA++ alignments). The word alignment methods may operate on the surface forms of words, or on any of the other factors. In fact, some preliminary experiments have shown that word alignment based on lemmas or stems yields improved alignment quality.


Each mapping step forms a component of the overall model. From a training point of view this means that we need to learn translation and generation tables from the word-aligned parallel corpus and define scoring methods that help us to choose between ambiguous mappings. Phrase-based translation models are acquired from a word-aligned parallel corpus by extracting all phrase-pairs that are consistent with the word alignment. Given the set of extracted phrase pairs with counts, various scoring functions are estimated, such as conditional phrase translation probabilities based on relative frequency estimation or lexical translation probabilities based on the words in the phrases. In our approach, the models for the translation steps are acquired in the same manner from a word-aligned parallel corpus. For the specified factors in the input and output, phrase mappings are extracted. The set of phrase mappings (now over factored representations) is scored based on relative counts and word-based translation probabilities. The tables for generation steps are estimated on the output side only. The word alignment plays no role here. In fact, additional monolingual data may be used. The generation model is learned on a word-for-word basis. An important component of statistical machine translation is the language model, typically an n-gram model over surface forms of words. In the framework of factored translation models, such sequence models may be defined over any factor, or any set of factors. For factors such as part-of-speech tags, building and using higher order n-gram models (7-gram, 9-gram) is straight-forward.

3.3.3.2 Combination of Components

As in phrase-based models, factored translation models can be seen as the combination of several components (language model, reordering model, translation steps, generation steps). These components define one or more feature functions that are combined in a log-linear model:

p(m \mid e) = \exp \sum_{i=1}^{n} \lambda_i h_i(m, e)        (3.5)

To compute the probability of a translation m given an input sentence e, we have to evaluate each feature function h_i. For instance, the feature function for a bigram language model component is (where |m| is the number of words m_i in the output sentence m):

h_{LM}(m, e) = p_{LM}(m) = p(m_1) \, p(m_2 \mid m_1) \cdots p(m_{|m|} \mid m_{|m|-1})        (3.6)

We now consider the feature functions introduced by the translation and generation steps of factored translation models. The translation of the input sentence e into the output sentence m breaks down into a set of phrase translations (e_j, m_j). For a translation step component, each feature function h_\tau is defined over the phrase pairs (e_j, m_j), given a scoring function \tau:

h_\tau(m, e) = \sum_{j} \tau(e_j, m_j)        (3.7)

For a generation step component, each feature function h_\gamma, given a scoring function \gamma, is defined over the output words m_k only:

h_\gamma(m, e) = \sum_{k} \gamma(m_k)        (3.8)

The feature functions follow from the scoring functions (\tau, \gamma) acquired during the training of translation and generation tables. For instance, recall our earlier example: a scoring function for a generation model component is a conditional probability distribution between input and output factors. The feature weights \lambda_i in the log-linear model are determined with the usual minimum error rate training method.
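The combination in equation 3.5 amounts to a weighted sum of feature values inside an exponential, as in this sketch with hypothetical feature values and weights.

import math

def loglinear_score(features, weights):
    return math.exp(sum(weights[name] * value for name, value in features.items()))

features = {"lm": -4.2, "translation": -3.1, "generation": -1.0, "distortion": -0.5}
weights  = {"lm": 0.5, "translation": 0.3, "generation": 0.1, "distortion": 0.1}
print(loglinear_score(features, weights))   # unnormalized model score for one candidate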


3.3.3.3 Efficient Decoding

Compared to phrase-based models, the decomposition of phrase translation into several mapping steps creates additional computational complexity. Instead of a simple table lookup to obtain the possible translations for an input phrase, multiple tables now have to be consulted and their contents combined. In phrase-based models it is easy to identify the entries in the phrase table that may be used for a specific input sentence. These are called translation options. We usually limit ourselves to the top 20 translation options for each input phrase. The beam search decoding algorithm starts with an empty hypothesis. Then new hypotheses are generated by using all applicable translation options. These hypotheses are used to generate further hypotheses in the same manner, and so on, until hypotheses are created that cover the full input sentence. The highest scoring complete hypothesis indicates the best translation according to the model.

How do we adapt this algorithm for factored translation models? Since all mapping steps operate on the same segmentation, the expansions of these mapping steps can be efficiently pre-computed prior to the heuristic beam search, and stored as translation options. For a given input phrase, all possible translation options are thus computed before decoding. This means that the fundamental search algorithm does not change. However, we need to be careful about a combinatorial explosion of the number of translation options given a sequence of mapping steps. In other words, the expansion may create too many translation options to handle. If one or many mapping steps result in a vast increase of (intermediate) expansions, this may become unmanageable. We currently address this problem by early pruning of expansions, and by limiting the number of translation options per input phrase to a maximum number, by default 50. This is, however, not a perfect solution.


3.4 Confusion Networks Decoding

Machine translation input currently takes the form of simple sequences of words. However, there are increasing demands to integrate machine translation technology into larger information processing systems with upstream natural language and/or speech processing tools (such as named entity recognizers, automatic speech recognizers, morphological analyzers, etc.). These upstream processes tend to generate multiple, erroneous hypotheses with varying confidence. Current MT systems are designed to process only one input hypothesis, making them vulnerable to errors in the input. We extend current MT decoding methods to process multiple, ambiguous hypotheses in the form of an input lattice. A lattice representation allows an MT system to arbitrate between multiple ambiguous hypotheses from upstream processing so that the best translation can be produced. As a lattice usually has a complex topology, an approximation of it, called a confusion network, is used instead. The extraction of a confusion network from a lattice can be performed by means of the publicly available lattice-tool contained in the SRILM toolkit.

3.4.1 Confusion Networks

A Confusion Network (CN), also known as a sausage, is a weighted directed graph with the peculiarity that each path from the start node to the end node goes through all the other nodes. Each edge is labeled with a word and a (posterior) probability. The total probability of all edges between two consecutive nodes sums up to 1. This is not a strict constraint from the point of view of the decoder; any score can be provided. A path from the start node to the end node is scored by multiplying the scores of its edges. If the previous constraint is satisfied, the product represents the likelihood of the path, and the sum of the likelihoods of all paths equals 1.


Between any two consecutive nodes, at most one special word _eps_ can be inserted; _eps_ words allow paths of different lengths. Any path within a CN represents a realization of the CN. Realizations of a CN can differ in terms of either their sequence of words or their total score. It is possible that two (or more) realizations have the same sequence of words but different scores. Word lengths can also differ due to the presence of _eps_.

3.4.2 Word Lattices

A word lattice is a directed acyclic graph with a single start point and edges labeled with a word and a weight. Unlike confusion networks, which additionally impose the requirement that every path must pass through every node, word lattices can represent any finite set of strings (although this generality makes word lattices slightly less space-efficient than confusion networks). However, in general a word lattice can represent an exponential number of sentences in polynomial space. Moses can decode input represented as a word lattice and, in most useful cases, do this far more efficiently than if each sentence encoded in the lattice were decoded serially. When Moses translates input encoded as a word lattice, the translation it chooses maximizes the translation probability along any path in the input.

3.4.2.1 How to represent lattice inputs

Lattices are encoded by ordering the nodes in a topological order (there may be more than one way to do this; in general, any one is as good as any other). Then, proceeding in order through the nodes, each node lists its outgoing edges and any weights associated with them. For each edge, the first element is the word label, the second number is the probability associated with the edge, and the third number is the distance to the next node.
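For illustration, a small lattice in this style might look as follows. The exact input syntax expected by Moses should be checked against its documentation, so treat this as an approximation of the idea rather than a definitive file format.

lattice = (
    (("he", 1.0, 1),),                          # node 0: one outgoing edge
    (("reads", 0.7, 1), ("read", 0.3, 1)),      # node 1: two competing hypotheses
    (("a", 0.6, 1), ("the", 0.4, 1)),           # node 2
    (("book", 1.0, 1),),                        # node 3: last edge reaches the end node
)
for node in lattice:
    print(node)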


3.5 Factors, Words, Phrases

Moses is an implementation of a factored translation model. This means that each word is represented by a vector of factors, which are typically the surface word, part-of-speech tag, etc. It also means that the implementation is a bit more complicated than a non-factored translation model. This section intends to provide some documentation of how factors, words, and phrases are implemented in Moses.

3.5.1 Factors

The class Factor implements the most basic unit of representing text in Moses. In essence it is a string. Factors do not know about their own type; this is referred to as the Factor Type when needed. The factor type is implemented as a size_t, i.e. an integer. What a factor really represents (be it a surface form or a part-of-speech tag) does not concern the decoder at all. All the decoder knows is that there are a number of factors that are referred to by their factor type, i.e. an integer index. Since we do not want to store the same strings over and over again, the class FactorCollection contains all known factors. The class has one global instance, and it provides the essential functions to check if a newly constructed factor already exists and to add a factor. This enables the comparison of factors by the cheaper comparison of pointers to factors. Think of the FactorCollection as the global factor dictionary.

3.5.2 Words

A word is, as we said, a vector of factors. The class Word implements this. As a data structure, it is an array of pointers to factors. This does require the code to know what the array size is, which is set by the global MAX_NUM_FACTORS. The word class implements a number of functions for comparing and copying words, and for addressing individual factors.


Again, a word does not know how many factors it really has. So, for instance, when you want to print out a word with all its factors, you also need to provide the factor types that are valid within the word.
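The design can be mimicked in a few lines of Python (this is only a sketch of the idea, not the actual Moses C++ classes): a global collection interns each factor string once, and a word is a fixed-size vector of references to factors.

class FactorCollection:
    # Global dictionary of factors; returns the same object for the same string.
    def __init__(self):
        self._factors = {}

    def add(self, text):
        return self._factors.setdefault(text, text)   # the interned string acts as the factor

MAX_NUM_FACTORS = 4
collection = FactorCollection()

def make_word(surface=None, pos=None):
    word = [None] * MAX_NUM_FACTORS                    # the factor type is the index
    if surface is not None:
        word[0] = collection.add(surface)
    if pos is not None:
        word[1] = collection.add(pos)
    return word

w1 = make_word("viiTukaL~", "NN")
w2 = make_word("viiTukaL~")
print(w1[0] is w2[0])    # True: both words share the same factor object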

3.5.3 Factor Types

This is a good place to note that referring to words gets a bit more complicated. If more than one factor is used, it does not mean that all the words in the models have all the factors. Take again the example of a two-factored representation of words as surface form and part-of-speech. We may still use a simple surface word language model, so for that language model, a word only has one factor. We expect the input to the decoder to have all factors specified, and during decoding the output will have all factors of all words set. The process may not be a straight-forward mapping of the input word to the output word, but it may be decomposed into several mapping steps that either translate input factors into output factors, or generate additional output factors from existing output factors. At this point, keep in mind that a Factor has a FactorType and a Word has a vector of factors, but these are not internally stored with the Factor and the Word. Related to factor types is the class FactorMask, which is a bit array indicating which factors are valid for a particular word.

3.5.4 Phrases

Since decoding proceeds by the translation of input phrases to output phrases, a lot of operations involve the class Phrase. Phrases know a little bit more about their role in the world. For instance, they know whether they are input or output phrases, which they reveal with a function call to Phrase::GetDirection(). Since the total number of input and output factors is known to the decoder (it has to be specified in the configuration file moses.ini), phrases are also a bit smarter about copying and comparing.

The Phrase class implements many useful functions, and two other classes are derived from it:
• The simplest form of input, a sentence as a string of words, is implemented in the class Sentence.
• The class TargetPhrase may be somewhat misleadingly named, since it not only contains an output phrase, but also a phrase translation score, a future cost estimate, a pointer to the source phrase, and potentially word alignment information.

3.6 Morphology

Morphology is the study of morphemes and their arrangements in forming words. Morphemes are the minimal meaningful units which may constitute words or parts of words. In spoken language, morphemes are composed of phonemes, the smallest linguistically distinctive units of sound. Examples are re-, de-, un-, -ish, -ly, -ceive and -mand, and tie, boy and like, as found in receive, demand, untie, boyish and likely. Morphology deals with all combinations that form words or parts of words. There are two broad classes of morphemes: stems and affixes. The stem is the "main morpheme" of the word, supplying the main meaning, e.g. eat in eat+ing, or 'naTakk' in 'naTakkunnu' (walks). Affixes add "additional" meanings.

Morphemes

Morphs are the phonological/orthographical realization of morphemes. A single morpheme may be realized by more than one morph. In such cases, the morphs are said to be allomorphs of a single morpheme. The following examples demonstrate the concept of morphemes and their realization as morphs. For example, in Malayalam, the morph 'kaL' is the realization of the morpheme denoting plurality:

kuTTi + PL -> kuTTikaL~ 'boys'
pakshi + PL -> pakshikaL~ 'birds'

But when the same morpheme is attached to a different word, it is realized as a different morph:

maraM + PL -> marangngaL~ 'trees'
janaM + PL -> janangngaL~ 'people'

So the same morpheme can be realized by different morphs in a language. These different morphs of the same morpheme are called allomorphs. Free morphemes like town and dog can appear with other lexemes (as in town hall or dog house) or they can stand alone, i.e. "free". Free morphemes are morphemes which can stand by themselves as single words, for example maraM ('tree') and aabharaNaM ('jewelry') in Malayalam. Bound morphemes (or affixes) never stand alone. They always appear attached to other morphemes; "un-", for instance, appears only together with other morphemes to form a lexeme. Bound morphemes in general tend to be prefixes and suffixes. The word "unbreakable" has three morphemes: "un-" (meaning not X), a bound morpheme; "-break-", a free morpheme; and "-able", a bound morpheme. "un-" is also a prefix and "-able" is a suffix; both are affixes. Bound morphemes are used to fulfill grammatical functions as well as to serve as productive elements for forming new words. Examples are kaL and ooTu (the sociative case marker, realized in English as 'with') in Malayalam, which are used for forming plural forms and denoting a case relationship respectively.

Inflection

Inflection is the combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class as the original stem, and usually filling some syntactic function, e.g. the plural of nouns: table (singular), table+s (plural). Inflection is productive:

cat, cats, cats'
kuTTi (boy), kuTTikaL~ (boys), kuTTikaLuTe (boys')


The meaning of the resulting word is easily predictable. Inflectional morphemes modify a word's tense, number, aspect, and so on (for example, dog written with the plural marker morpheme -s becomes dogs).

Derivation

Derivation is the combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning that is hard to predict exactly. In the case of derivation, the POS of the new derived word may change; e.g. 'destroyer' is a noun whereas the root 'destroy' is a verb. Derivation is not always productive. Derivational morphemes can be added to a word to create (derive) another word: the addition of "-ness" to "happy," for example, gives "happiness." In Malayalam, derivational morphology is very common and productive. Clauses in English like "the one who did" are translated to a single word, "ceytayaaL", which is a noun derived from the verb "cey" by concatenating it with tense and nominative suffix morphemes. New words can also be derived by concatenating two root morphemes.

Compounding

Compounding is the joining of two or more base forms to form a new word. For instance, the two nouns "paNaM" (money) and "peTTi" (box) can be fused to create "paNappeTTi" (box of money). Such root-root fusions are very common in written Malayalam. Semantic interpretation of compound words is even more difficult than with derivation: almost any syntactic relationship may hold between the components of a compound.


3.6.1 Malayalam Verb Morphology

'daatu' (Base) is the utterance denoting the action, for example:

• iLaki -> base iLak
• naTannu -> base naTa

We have to consider the changes effected in the base according to the different senses associated with the action, namely their form, character, time, manner, etc.

Base forms:

The Bases which, in addition to their own meanings, give an extra meaning like persuasion are designated as 'pRayoojaka' (Causative) Bases. Those which do not have such a special meaning are called 'keevala' (Simple) Bases. Thus Bases are by nature of two types: 'bharikkunnu' (rules) is a Simple Base and 'bharippikkunnu' (causes to rule) is a Causative Base.

Simple: ezhutunnu (writes)    Causative: ezhutikkunnu (causes to write)

Nature:

Certain Simple Bases have the form of Causative Bases. Such Bases are called 'kaarita', and others are called 'akaarita'. Thus kaaritas are Bases which are simple in meaning but Causative in form. Whether a Base is kaarita or akaarita has to be understood from the text of the Base itself. This difference is restricted to Bases ending in a vowel or a 'cil'.

kaaritas: paThikkunnu (learn), piTikkunnu (catch)
akaaritas: paaTunnu (sing), kaaNunnu (see)


Tense

There are morphologically distinct tenses in the language, and these are labeled as 'past', 'present' and 'future'. The combination of the three tenses with different aspects and moods is used for a given time specification. Tense is the last feature marked on the verb form and follows causative and aspect suffixes. Past tense is marked by '-i' added to the verb root or derived stem, or by '-u' preceded by one or another of a range of consonants or consonant sequences. The selection of the appropriate past tense suffix depends on a combination of morphological and phonological conditioning.

Present tense is marked by '-unnu' suffixed to the verb root or derived stem. The future tense is marked by '-uM' (occasionally 'uu') suffixed to the verb root or derived stem. The use of 'uu' is restricted to sentences in which one element carries the emphatic particle '-ee'.

Use of formally Distinct Tenses

Universal time reference: For universal time reference both future and present tense forms occur. The influence of English is sometimes cited for the use of the present tense. For the habitual actions of an individual, the future is normally used, though the use of the present imperfective or the simple present is not entirely excluded. Eg: avan divaseena iviTe varuM (He comes here daily). For generic statements, the present is generally used, though the future too is possible. Eg: suuryan kizhakk udikkuM (The sun rises in the east); paSu paal~ tarunnu (The cow gives milk).

Reference to present time: Reference to the present is by present tense forms. If the ongoing nature of the present action is being noted, an imperfective aspect will be used.


Eg: kuTTi karayunnuNT (The child is crying).

Reference to past time: It is by means of past tense forms. Eg: njaan innale maTangngi vannu (I returned yesterday). To stress that much time has passed since an event occurred, the past perfect is used. Eg: paNTorikkal njaan iviTe vannirunnu (I came here once long ago).

Future time: A variety of tense, aspectual and modal forms are used to refer to future states, events and actions. The purposive infinitive may be used in conjunction with the verb 'pookuka' (go). The future tense is marked by '-uM' (occasionally 'uu') suffixed to the verb root or derived stem. The use of 'uu' is restricted to sentences in which one element carries the emphatic particle '-ee'. If the latter is added to the verb, it is suffixed to an infinitive form (-uka) and is followed by -uLLuu. Eg: avan naaLe varuM (He will come tomorrow); avan varukayeeyuLLuu (He will just come).

The modal suffix -aaM, suffixed directly to the verb stem, may be used to refer to future time. If used with a first person singular or first person plural pronoun, it expresses a commitment to a future course of action or willingness to do something. When used with a third person subject, it indicates a possibility. To express a casual or indifferent consent with a first person subject, the form -eekkaaM is suffixed to a past tense stem. Eg: inn mazha peyyaaM (It may rain today); njaan vanneekkaaM (I may come).


The combination of an infinitive with 'pookuka' is frequently used to refer to a future event, particularly if intention is to be indicated. Eg: njaan kooleejil~ ceeraan~ pookunnu (I am going to join college). Imperfective forms are used for future time reference. Eg: naaLe ellaaruM pookunnuNT (Everyone is going tomorrow).

Tense Markers

For tense, the suffixes '-i', '-uM' and '-unnu' denote past, future and present respectively.

Example:

Base                  Past      Future     Present
iLak (to move)        iLaki     iLakuM     iLakunnu
minn (to glitter)     minni     minnuM     minnunnu
vilas (to shine)      vilasi    vilasuM    vilasunnu
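As a deliberately simplified sketch of the regular pattern in the table, the three suffixes can be concatenated directly to the base; the sandhi changes ('tu', 'ttu', 'ccu', etc.) described next are ignored here.

TENSE_SUFFIX = {"past": "i", "future": "uM", "present": "unnu"}

def inflect(base, tense):
    # Toy concatenation of the tense suffix; real Malayalam morphology also
    # involves the morphophonemic changes discussed in the following paragraphs.
    return base + TENSE_SUFFIX[tense]

for base in ["iLak", "minn", "vilas"]:
    print(base, inflect(base, "past"), inflect(base, "future"), inflect(base, "present"))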

If the Base ends in a vowel or a cil, the past suffix will be 'tu' instead of 'i'.

E.g.: kaaN (to see) -> kaNTu
tozh (to bow) -> tozhutu
tin (to eat) -> tinnu (tintu)

In kaarita Bases, the 't' of the suffix 'tu' is geminated to form 'ttu'. In akaarita Bases it becomes 'ntu' by the addition of 'n'.

E.g.: koTu (to give) -> koTuttu
maNa (to smell) -> maNattu
ceeR~ (to join) -> ceer~NTu = ceer~nnu
karaL (to gnaw) -> karaLNTu = karaNTu


In Bases which end in a labial 'a', it is 'ntu' even when the Base is kaaritam. When the Base ends in a palatal vowel, in both the 'ntu' and 'ttu' forms of the suffix 'tu', there will be palatalisation and nasal assimilation, making them 'ccu' and 'ññu' respectively.

E.g.: aLi (to melt) -> aLinjnju
kara (to cry) -> karanjnju
aTi (to beat) -> aTiccu
vai (to place) -> vaccu

Bases which are 'eekamaatra', i.e. containing only a short vowel, and ending in k, R or t, take the suffix 'tu' as the past marker. It dissolves in k, R, t and gives the effect of their gemination.

E.g.: puk (to enter) + tu = pukku
aR (to break) + tu = aRRu

Aspect

Aspectual forms are built up from three different bases: (i) the infinitive; (ii) the verbal root; (iii) the adverbial participle. For the verbs 'ceyyuka' (do) and 'koTukkuka' (give), the corresponding types of bases are obtained.

Perfect aspect

There are two sets of forms to indicate a past situation that has relevance to a later time. Both sets of forms take the adverbial participle as base.

Forms of the perfect aspect: The first form is '-irikkuka', which has some status as a 'being' verb. The present, past and future exponents of this variety of the perfect are '-irikkunnu', '-irunnu' and '-irikkuM'. Eg: vannirikkunnu (has come), vannirunnu (had come), vannirikkuM (will have come). The second combines '-iTT' and 'uNT' to give the following forms for the three tenses.

Eg: vanniTTuNT (has come), vanniTTuNTaayirunnu (had come), vanniTTuNTaakuM (will have come).

One slightly complex form, with '-aayirikkuM' added to the stem of the second form, indicates possibility. Eg: vanniTTuNTaayirikkuM (may have come).

Use of the perfect aspect

Present result of a past situation: There are two different forms, which are interchangeable. Eg: avan pooyiTTuNT (He has gone); kuTTikaL~ vannirikkunnu (The children have come). The past tense of a perfect is used when there is a reference to a time in the past to which a still earlier past situation had relevance. Eg: DookTar~ ettunnatin munpe roogi mariccirunnu (Before the doctor arrived, the patient had died). The future perfect fulfils the same sort of function with regard to future time. Eg: ippooL~ ellaaM kazhinjnjirikkuM (Everything will now be over).

A situation holding at least once: This indicates a situation that has held at least once in the period leading up to the present. Eg: njaan~ taajmahaaL~ kaNTiTTuNT (I have seen the Taj Mahal).

A situation still continuing: To indicate explicitly a situation that began in the past and is still continuing, (i) the present or present progressive is used, and (ii) the adverbial will usually be an '-aayi' form. Eg: ayaaL~ kuRe neeramaayi kaattirikkunnu (kaattukoNTirikkunnu) (He has been waiting for a long time).


CHAPTER 4
INSIGHT INTO ENGLISH-MALAYALAM SMT

4.1 The Bilingual N-gram Translation Model

4.1.1 Reviewing X-grams for Language Modeling

A language model can be defined as the probability of a certain sequence of tokens (usually words), and it is a broadly used model which plays a very important role in several statistical pattern matching tasks, ranging from speech recognition to machine translation and other language-related tasks defined on posterior probability frameworks. In order to estimate the probability of a certain sequence of M words w_1, w_2, ..., w_M as defined in equation 4.1, the most widespread method is the n-gram model, which reduces the exponentially increasing history space to the n - 1 previous words, as shown in equation 4.2.

p(w_1 w_2 \ldots w_M) = \prod_{i=1}^{M} p(w_i \mid w_1 \ldots w_{i-1})        (4.1)

p(w_1 w_2 \ldots w_M) \approx \prod_{i=1}^{M} p(w_i \mid w_{i-n+1} \ldots w_{i-1})        (4.2)
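For concreteness, equation 4.2 with n = 2 (a bigram model) can be evaluated as in the following sketch; the probabilities are hand-set, and a real model would be estimated and smoothed with a toolkit such as SRILM.

bigram = {("<s>", "he"): 0.2, ("he", "reads"): 0.1, ("reads", "books"): 0.05,
          ("books", "</s>"): 0.3}

def sentence_prob(words, model):
    p = 1.0
    for w1, w2 in zip(["<s>"] + words, words + ["</s>"]):
        p *= model.get((w1, w2), 1e-6)        # crude floor instead of proper smoothing
    return p

print(sentence_prob("he reads books".split(), bigram))   # 0.2 * 0.1 * 0.05 * 0.3 = 3e-4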

Still, even for a moderate vocabulary size of K = 1000 words, the number of model parameters to estimate for trigrams (n = 3) is very large (more than 10^9), and most of them are unobserved in the training material. In this case, the maximum likelihood estimator is not appropriate, and smoothing techniques are required. An alternative approach to language modelling, named X-grams, has been presented. Three main conclusions can be drawn from this work, namely:

• n-grams can be implemented efficiently by means of a Finite-State Automaton (FSA). Each state represents a conditioning history (w_{i-n+1} ... w_{i-1}) which has been seen in the training material, and each arc contains a word w_i which has followed it, together with the associated language model probability p(w_i | w_{i-n+1} ... w_{i-1}). This way, the language model score of a certain sentence can be computed by traversing the full sentence through the FSA and multiplying probabilities.

• State merging techniques can be used to smooth probabilities and achieve a language model with low perplexity. Two criteria to merge the state defined by history (w_{i-n+1} ... w_{i-1}) into the smaller-history state defined by (w_{i-n+2} ... w_{i-1}) are introduced, resulting in language models with low perplexity values. These criteria are:
  – The states are merged if the longest-history state has occurred fewer than k_min times in the training material.
  – The states are merged if the divergence between their output probability distributions p = p(w | w_{i-n+1} ... w_{i-1}) and q = q(w | w_{i-n+2} ... w_{i-1}) is smaller than a certain threshold f, where divergence is defined as in equation 4.3, a well-known information-theoretic function:

D(p \| q) = \sum_{j=1}^{J} p(j) \log \frac{p(j)}{q(j)}        (4.3)

• History size can vary depending on the words involved. When n-gram models increase their history size n, probabilities estimated for long histories may not be always reliable. On the other hand, even when reliable, perhaps these give no additional information with respect to probabilities conditioned on smaller histories. As the X -gram term denotes, the presented state merging techniques produce a model conditioned on variable-size histories, capturing long histories only whenever these bear some relevant information. 4.1.2 Training from parallel data

Assuming we are able to extract a set of tuples from a given parallel text, we can use X-grams to estimate the bilingual model and, by means of a non-intrusive modification of the Viterbi search algorithm, perform statistical machine translation. The training procedure from a parallel text is graphically represented in Figure 4.1.

Figure 4.1: Training a translation FST from parallel data (flow diagram).

The first preliminary step is the preprocessing of the parallel data, so that it is sentence aligned and tokenized. By sentence alignment we refer to the division of the parallel text into sentences and the alignment of source sentences to target sentences. This alignment is usually (though not always) monotone and, even though it mostly produces one-to-one links (one source sentence aligned to one target sentence), it can also include one-to-many, many-to-one, and many-to-many links. By tokenization, we refer to separating punctuation marks, collapsing numerical expressions into a single token and, in general, simple normalization strategies that reduce vocabulary size without information loss (i.e. which can be reversed if required). Then, word alignment is performed by estimating IBM translation models from the parallel data and finding the Viterbi alignment according to them. This process is carried out using the GIZA++ toolkit.


Finally, before estimating the bilingual X-grams, tuples need to be extracted from the word-aligned data. Given a word-aligned sentence pair, this process segments it into a sequence of tuples, respecting two crucial constraints:

• Monotonicity. The resulting segmentation can be traced sequentially to produce back both the source and the target sentences.

• Minimal tuple size. For the resulting model to be less sparse, or in other words, for tuples to have the biggest generalizing power, we are interested in the shortest tuples (in number of source and target words) which respect the previous constraint.

This process is illustrated in Figure 4.2, where four tuples are extracted. It is worth noting that A→W is not extracted given the previous constraints, and that tuples with an empty target side are allowed (source word D translates to no target word).

Figure 4.2: Tuple extraction from a word-aligned pair of sentences.

Given a word alignment, these restrictions define a unique set of tuples except in one situation: when the resulting tuple contains no source word (a NULL-source tuple). In order to reuse such units when decoding new sentences, the search would have to allow units to be generated from no input word, which is not the case. Therefore, these units cannot be allowed, and a decision must be taken to re-segment tuples in these cases.


4.1.3 Tuple definition: from one-to-many to many-to-many

During the training of a joint-probability bilingual translation model, and especially during tuple extraction, certain considerations need to be taken into account. On the one hand, one has to decide whether to structurally allow tuples to be one-to-many or many-to-many. If we restrict tuples to a one-to-many structure, only one source word will be allowed for each tuple (independently of the number of target words), whereas if we allow them to have a more general many-to-many structure, they can have any number of source and target words.

Figure 4.3: Three alternative tuple definitions from a given word alignment.

Two alternative units have been presented, which are graphically represented in Figure 4.3. Firstly, all units are required to be one-to-many (type I), and whenever many source words are linked to the same target word, or if there is a crossed dependency, only the last source word carries the full translation (whereas the previous ones are left without translation). A second option is to force all tuples which have an empty target side to be linked to the next tuple with a non-empty target side, thus allowing the introduction of many-to-many units (type II). This solution has been reported to behave significantly worse in all translation experiments, because the resulting finite-state transducers are much bigger and, consequently, the assigned probability distributions are poorly estimated. We experimented with two alternative tuple definitions: on the one hand, the one-to-many structure defined as type I; on the other, we allowed for many-to-many tuples, but only if the automatic word alignment had linked the words (be it through a many-to-one link or due to a crossed dependency). In practice, this is equivalent to type II, except for those tuples comprised of one source word without a link to any target word. These tuples are kept in the model, as shown in Figure 4.3 (type III).

4.1.3.1 Monotonicity vs. Word reordering

Given the need for a sequence of tokens related to the X-gram formulation, the very basic trait of tuples is their monotonicity. In other words, when following them in monotone order, tuples must produce both the source and target sentence in the correct word order. However, each language expresses concepts in a different order and, depending on the pair of languages involved, translation will not be a monotone process. In fact, this is also captured in the definition of the IBM translation and alignment models, where crossed dependencies are expected to occur. Therefore, if two languages are close in word order, alignment will tend to be monotone, and the tuples extracted from the training data will be shorter (include fewer words) and easier to re-use during decoding. On the contrary, if the languages differ strongly in word order, alignments will have so-called long-distance links, forcing the extraction of long tuples (including many words) and leading to a sparser bilingual model with a bigger tuple vocabulary.

4.1.3.2 An initial reordering strategy

In order to improve the generation of bilingual tuples during FST training, indexes referring to the relative positions of the target-language words were introduced for the most frequent crossing patterns. For example, for pattern (1,2)(2,1), we swapped the order of the target words, introducing an index on the last target word to preserve the original link. This is expressed by:

\begin{cases} s_i \rightarrow t_{j+1} \\ s_{i+1} \rightarrow t_j \end{cases} \;\Rightarrow\; \begin{cases} s_i \rightarrow t_j \\ s_{i+1} \rightarrow t_{j+1}(-1) \end{cases} \qquad (4.4)

where s_i is the source word in position i and t_j is the target word in position j. The same applies to the second most frequent pattern (1,2)(2,0)(3,1), as the middle source word is aligned to no target word. For the third pattern (1,3)(2,1)(2,2), the same can be done after joining the two last words of the target sentence into a single token:

\begin{cases} s_i \rightarrow t_{j+2} \\ s_{i+1} \rightarrow t_j \\ s_{i+1} \rightarrow t_{j+1} \end{cases} \;\Rightarrow\; \begin{cases} s_i \rightarrow t_j \\ s_{i+1} \rightarrow t_{j+1} t_{j+2}(-1) \end{cases} \qquad (4.5)
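To make the idea behind equation 4.4 concrete, here is a small Python sketch (illustrative only; the data structures are hypothetical stand-ins, not the UPC software, and where exactly the index is attached is a notational detail). The two crossed target words are swapped so that the alignment becomes monotone, and the displaced word is tagged with "(-1)" so that the original order can be restored after translation.

```python
def swap_crossed_pair(target, j):
    """Apply the (1,2)(2,1) transformation sketched in equation 4.4.

    target : list of target-language tokens
    j      : 0-based position of the first word of the crossed pair
    """
    reordered = list(target)
    # Swap the two words; the word moved back one position gets the '(-1)' tag.
    reordered[j], reordered[j + 1] = reordered[j + 1] + "(-1)", reordered[j]
    return reordered

def restore_order(tokens):
    """Undo the '(-1)' tags: move each tagged word forward one position."""
    out = list(tokens)
    for k, tok in enumerate(out):
        if tok.endswith("(-1)") and k + 1 < len(out):
            out[k], out[k + 1] = out[k + 1], tok[:-4]
    return out

# Hypothetical crossed pair at target positions 1 and 2.
tgt = ["t0", "t1", "t2", "t3"]
swapped = swap_crossed_pair(tgt, 1)   # ['t0', 't2(-1)', 't1', 't3']
print(swapped)
print(restore_order(swapped))         # ['t0', 't1', 't2', 't3']
```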

4.1.4 N-gram implementation

The X-gram implementation of the joint-probability bilingual model evolved into a standard N-gram implementation by the end of 2004. The motivation behind this change was twofold:

• Software incapacity to deal with large amounts of data. The X-gram implementation by means of a Finite-State Transducer was achieved by adapting a speech recognition toolkit developed at UPC during the late 1990s, which was not prepared to work with vocabulary sizes beyond a hundred thousand words. This posed a strong limitation on scaling the translation system to large-vocabulary tasks, unless a substitute X-gram translation software was fully reprogrammed.


• Availability of tools implementing large-vocabulary language modeling with the most advanced smoothing strategies. At around the same time, SRI International released a freely available language-modeling toolkit (SRILM). This collection of C++ libraries, executable programs, and helper scripts was designed to allow both production of and experimentation with statistical language models for speech recognition and other applications, supporting the creation and evaluation of a variety of language model types based on N-gram statistics (including many smoothing strategies), among other related tasks.

However, another implication of this change was the need for a new decoder. Since the FST architecture had been discarded in favour of hash tables, the Viterbi search implemented on the FST architecture was not useful anymore.

4.1.4.1 Modeling issues

In this section we focus on the N-gram implementation of the bilingual tuple model, conducting a study of its main modeling aspects and their effects on translation quality and model size. These aspects include:
• a study on pruning strategies
• a study on smoothing strategies

Pruning strategies

As we have seen in the previous section, the bilingual translation model becomes huge when the training material grows. Pruning strategies are therefore needed. Pruning can be defined as any technique that decides to discard certain training material as useless. This decision must be a hard one, i.e. taken before having any knowledge of the test set. Usually, pruning reveals the classical trade-off between efficiency and performance: whereas strong pruning produces small-sized models which can be used much more efficiently, performance usually falls off. In addition, a balanced degree of pruning can sometimes yield a more efficient model at no performance cost (or even with an improvement in performance). Several pruning strategies can be devised, but here we divide them into two types (which can, of course, be combined):

• N-gram pruning. A classic pruning strategy in language modelling is to perform the N-gram modeling subject to restrictions on the N-gram counts extracted from the training data, so that, for instance, a certain tuple may participate in a certain trigram while not participating in any bigram.

• Tuple pruning. This approach refers to any technique which takes a hard decision on the tuple vocabulary, deciding a priori which tuples are allowed to belong to the tuple vocabulary and which are not. Accordingly, tuple pruning is a special case of N-gram pruning in which, for a given discarded tuple, all N-gram counts in which this tuple participates are set to 0 (for all N).

N-gram pruning

During language model estimation, one can assume that all N-grams occurring less than a certain number of times are discounted to zero. This strategy is very often used for large N values and, as a consequence, yields smaller models with improved performance. The idea behind it is to consider long N-grams occurring just once in the training material as no more informative than those N-grams which do not occur at all (and which are accounted for by smoothing the model probabilities).


Smoothing the bilingual model

It has been shown in the ASR research field that relying solely on a maximum likelihood estimate may not be the best option when performing language modelling, especially for infrequent events. Consider a trigram which occurs just once in the training material. Given that the number of possible trigrams is huge and that only a few will occur in the training data, it may be unreasonable to perform maximum likelihood estimation for such events, as all unseen events would receive zero probability. Smoothing refers to all techniques which redistribute probability from seen events to unseen events. This is usually done by removing a certain amount of probability mass from seen events and reserving it for the wide range of unseen combinations of words. Several alternative smoothing techniques have been applied to language modeling; their performance is usually contrasted, for different training corpus sizes, in terms of perplexity. This measure is related to the probability that the model assigns to test data, and is defined as in equation 4.6:

PP_p(T) = 2^{H_p(T)} \qquad (4.6)

where H_p(T) is the cross-entropy of the language model on the data T, defined as:

H_p(T) = -\frac{1}{W_T} \log_2 p(T) \qquad (4.7)

and where W_T is the number of words in the test data T. Smoothing techniques include additive smoothing, the Good-Turing estimate, Jelinek-Mercer smoothing, Katz smoothing, Witten-Bell smoothing, absolute discounting and Kneser-Ney smoothing. Furthermore, a slight modification of Kneser-Ney smoothing has been presented that renders the best performance. Being out of the scope of this project work, the details of each smoothing formulation and implementation are omitted here. Still, due to the bilingual nature of our translation model, a comparative experiment was conducted in order to assess which smoothing was best suited for bilingual N-gram modeling. Instead of defining a perplexity score for the translation task, we opted for evaluating translation performance directly. The modified Kneser-Ney smoothing achieves the best translation scores in both translation directions, and all internal experiments across various tasks have followed the same tendency. All in all, the results lead to the conclusion that the bilingual model behaves similarly to a standard (monolingual) language model when it comes to smoothing techniques.
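Equations 4.6 and 4.7 can be computed directly once a language model provides per-word probabilities; a minimal Python sketch (the probability values below are hypothetical):

```python
from math import log2

def perplexity(word_probs):
    """Equations 4.6/4.7: PP = 2^H, with H the average negative log2 probability.

    word_probs : per-word probabilities p(w_i | history) assigned by the
                 language model to the words of a test text T.
    """
    w_t = len(word_probs)                                     # number of words in T
    cross_entropy = -sum(log2(p) for p in word_probs) / w_t   # H_p(T)
    return 2 ** cross_entropy                                 # PP_p(T)

# Hypothetical per-word probabilities for a 5-word test text.
print(perplexity([0.1, 0.05, 0.2, 0.01, 0.1]))
```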

4.2 MOSES

4.2.1 Starting with Moses

Moses is a statistical machine translation system that allows us to automatically train translation models for any language pair. All we need is a collection of translated texts (a parallel corpus).
• beam-search: an efficient search algorithm quickly finds the highest probability translation among the exponential number of choices
• phrase-based: the state-of-the-art in statistical machine translation allows the translation of short text chunks
• factored: words may have a factored representation (surface forms, lemma, part-of-speech, morphology, word classes, ...)


Features

• Moses is a drop-in replacement for Pharaoh, the popular phrase-based decoder, with many extensions.
• Moses allows the decoding of confusion networks, enabling easy integration with ambiguous upstream tools, such as automatic speech recognizers.
• Moses features novel factored translation models, which enable the integration of linguistic and other information at many stages of the translation process.

The released software includes a command-line executable which can be used for decoding. However, if we want the additional scripts to create factored phrase tables and to train the weights of the models, we need to download the complete snapshot via SVN. This repository also contains regression tests, which are useful for anyone interested in enhancing the decoder.

4.2.2 Getting Started with the Moses Software

Required additional software

Moses requires a language model toolkit to build. At the moment, Moses supports:
• the SRILM language modeling toolkit
• the IRST language modeling toolkit
These toolkits provide commands to estimate and compile language models, and are in some ways complementary. See the corresponding documentation to learn more about their main features. Moses provides interfaces towards both of them; depending on your needs, you can choose one of them, but we suggest getting both.


4.2.3 Roadmap

Figure 4.4: Roadmap

• Chart parse-decoding: to support hierarchical models and syntax-based translation models
• Depth-first decoding: to provide anytime algorithms for decoding
• Forced decoding: to compute scores for provided output
• Suffix-array translation models: an alternative way to store large rule-sets without the need to translate them
• Maximum entropy translation models: translation models that incorporate additional source-side and context information for scoring translation rules

Notable additions in 2009
• Language model server

Notable additions in 2008
• Specification of reordering constraints with XML
• Early discarding pruning
• Randomized language models
• Output of search graph
• Cube pruning

4.2.4 Decoding a Simple Translation Model

The model consists of two files:
• phrase-table: the phrase translation table
• moses.ini: the configuration file for the decoder

The translation tables are the main knowledge source for the machine translation decoder. The decoder consults these tables to figure out how to translate input in one language into output in another language. Being a phrase translation model, the translation tables do not only contain single-word entries, but also multi-word entries. These are called phrases, but this concept means nothing more than an arbitrary sequence of words, with no sophisticated linguistic motivation.

4.2.4.1 Tuning for Quality

The key to good translation performance is having a good phrase translation table, but some tuning can be done with the decoder. The most important is the tuning of the model parameters. The probability cost that is assigned to a translation is a product of the probability costs of four models:
• the phrase translation table,
• the language model,
• the reordering model, and
• the word penalty.


Each of these models contributes information about one aspect of the characteristics of a good translation:
• The phrase translation table ensures that the Malayalam phrases and the English phrases are good translations of each other.
• The language model ensures that the output is fluent Malayalam.
• The distortion model allows for reordering of the input sentence, but at a cost: the more reordering, the more expensive the translation.
• The word penalty provides a means to ensure that the translations do not get too long or too short.
Each of the components can be given a weight that sets its importance. Mathematically, the cost of translation is:

p(m \mid e) = \phi(e \mid m)^{weight_{\phi}} \times LM(m)^{weight_{LM}} \times D(m,e)^{weight_{d}} \times W(m)^{weight_{w}} \qquad (4.8)

The probability p(m|e) of the Malayalam translation m given the English input e is broken up into four models: the phrase translation probability φ(e|m), the language model LM(m), the distortion model D(m,e), and the word penalty W(m) = exp(length(m)). Each of the four models is weighted by a weight. The weighting is provided to the decoder with the four parameters weight-t, weight-l, weight-d, and weight-w. The default settings for these weights are 1, 1, 1, and 0; these are also the values in the configuration file moses.ini. Setting these weights to the right values can improve translation quality.
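As a small illustration of how the weighted combination in equation 4.8 is evaluated (the component scores below are hypothetical numbers, not output of Moses):

```python
from math import exp

def translation_cost(phrase_prob, lm_prob, distortion_prob, length,
                     weight_t=1.0, weight_l=1.0, weight_d=1.0, weight_w=0.0):
    """Equation 4.8: product of the four model costs, each raised to its weight."""
    word_penalty = exp(length)          # W(m) = exp(length(m))
    return (phrase_prob ** weight_t *
            lm_prob ** weight_l *
            distortion_prob ** weight_d *
            word_penalty ** weight_w)

# Hypothetical scores for one candidate Malayalam translation of length 4.
print(translation_cost(phrase_prob=0.02, lm_prob=0.001,
                       distortion_prob=0.5, length=4))
```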

4.2.4.2 Tuning for Speed

Let us now look at some additional parameters that help to speed up the decoder. Unfortunately, higher speed usually comes at the cost of translation quality. The speed-ups are achieved by limiting the search space of the decoder; by cutting out part of the search space, we may no longer be able to find the best translation.


Translation Table Size

One strategy to limit the search space is to reduce the number of translation options used for each input phrase, i.e. the number of phrase translation table entries that are retrieved. If the phrase translation table is learned from real data, it contains a lot of noise, so we are really interested only in the most probable options and would like to eliminate the others. There are two ways to limit the translation table size: by a fixed limit on how many translation options are retrieved for each input phrase, and by a probability threshold which specifies that the phrase translation probability has to be above some value.

Hypothesis Stack Size (Beam)

A different way to reduce the search is to reduce the size of the hypothesis stacks. For each number of English words translated, the decoder keeps a stack of the best (partial) translations. By reducing this stack size the search will be quicker, since fewer hypotheses are kept at each stage, and therefore fewer hypotheses are generated. With small stack sizes or small thresholds we risk search errors, meaning the generation of translations that score worse than the best translation according to the model. By a worse translation, we mean worse scoring according to our model; whether it is actually a worse translation in terms of translation quality is another question. However, the task of the decoder is to find the best-scoring translation. If worse-scoring translations are of better quality, then this is a problem of the model and should be resolved by better modeling.


4.2.4.3 Limit on Distortion (Reordering)

The basic reordering model implemented in the decoder is fairly weak. Reordering cost is measured by the number of words skipped when English input phrases are picked out of order. The total reordering cost is computed as D(m,e) = -Σ_i d_i, where d_i for each phrase i is defined as d_i = abs(last word position of previously translated phrase + 1 - first word position of newly translated phrase). This reordering model is suitable for local reorderings: they are discouraged, but may occur with sufficient support from the language model. Large-scale reorderings, however, are often arbitrary and affect translation performance negatively. By limiting reordering we can not only speed up the decoder; often translation performance is increased as well. Reordering can be limited to a maximum number of words skipped (maximum d) with the switch -distortion-limit. Setting this parameter to 0 means monotone translation (no reordering); if you want to allow unlimited reordering, use the value -1.
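The distortion cost just described can be written down directly; the following Python fragment (a sketch, with hypothetical phrase positions) computes d_i for each phrase and the total cost D.

```python
def distortion_cost(phrase_spans):
    """Total distance-based reordering cost D = -sum(d_i).

    phrase_spans : list of (first_pos, last_pos) input-word spans, given in the
                   order in which the phrases are translated (1-based positions).
    d_i = abs(last position of previous phrase + 1 - first position of current phrase)
    """
    total, prev_last = 0, 0      # before the first phrase, the "previous last" is 0
    for first, last in phrase_spans:
        total += abs(prev_last + 1 - first)
        prev_last = last
    return -total

# Hypothetical decoding order: words 1-2, then 5-6, then 3-4 of the input.
print(distortion_cost([(1, 2), (5, 6), (3, 4)]))   # -(0 + 2 + 4) = -6
```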

4.2.5 Factored Models

4.2.5.1 Moses decoder in parallel

Since decoding large amounts of text takes a long time, you may want to split the text into blocks of a few hundred sentences (or fewer) and distribute the task across a Sun GridEngine cluster. This is supported by the script moses-parallel.pl, which is run as follows:

moses-parallel.pl -decoder decoder -config cfgfile -i input -jobs N [options]

Use absolute paths for your parameters (decoder, configuration file, models, etc.).
• decoder is the file location of the Moses binary used for decoding
• cfgfile is the configuration file of the decoder
• input is the file to translate
• N is the number of processors you require


• options are used to overwrite parameters provided in cfgfile. Among them, overwrite the following two parameters for n-best generation:
  – -n-best-file: output file for the n-best list
  – -n-best-size: size of the n-best list

4.2.6 Advanced Features of the Decoder

4.2.6.1 Lexicalized Reordering Models

The standard reordering model used for phrase-based statistical machine translation is conditioned only on movement distance and nothing else. However, some phrases are reordered more frequently than others; an English adjective like beautiful typically gets switched with the preceding noun when translated into Malayalam. Hence, we want to consider a lexicalized reordering model that conditions reordering on the actual phrases. One concern, of course, is the problem of sparse data: a particular phrase pair may occur only a few times in the training data, making it hard to estimate reliable probability distributions from these statistics. Therefore, in the lexicalized reordering model, we only consider three reordering types: (m) monotone order, (s) switch with the previous phrase, or (d) discontinuous. How can we learn such a probability distribution from the data? Again, we go back to the word alignment that was the basis for our phrase table. When we extract each phrase pair, we can also extract its orientation type in that specific occurrence. Looking at the word alignment matrix, we note for each extracted phrase pair its corresponding orientation type. The orientation type can be detected by checking for a word alignment point to the top left or to the top right of the extracted phrase pair; an alignment point to the top left signifies that the preceding Malayalam word is aligned to the preceding English word. The orientation type is defined as follows:

• monotone: if a word alignment point to the top left exists, we have evidence for monotone orientation.

• swap: if a word alignment point to the top right exists, we have evidence for a swap with the previous phrase.

• discontinuous: if there is neither a word alignment point to the top left nor to the top right, we have neither monotone order nor a swap, and hence evidence for discontinuous orientation.

• bidirectional: Certain phrases may not only indicate whether they themselves are moved out of order, but also whether subsequent phrases are reordered. A lexicalized reordering model for this decision can be learned in addition, using the same method.

• e and m: Out of sparse-data concerns, we may want to condition the probability distribution only on the English phrase (e) or the Malayalam phrase (m).

• monotonicity: To further reduce the complexity of the model, we might merge the orientation types swap and discontinuous, leaving a binary decision about the phrase order.

These variations have been shown to be occasionally beneficial for certain training corpus sizes and language pairs. Moses allows the arbitrary combination of these decisions to define the reordering model type (e.g. bidirectional-monotonicity-f). A small sketch of how the orientation type can be detected from the alignment points is given after this list.
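The sketch below is an illustrative Python fragment, not the Moses implementation itself; it assumes a set of (English, Malayalam) word-alignment points and a phrase pair given by its English and Malayalam word spans.

```python
def orientation(phrase_pair, alignment):
    """Classify a phrase pair as monotone (m), swap (s) or discontinuous (d).

    phrase_pair : (e_start, e_end, m_start, m_end), word indices of the
                  English and Malayalam spans of the extracted phrase pair
    alignment   : set of (e_index, m_index) word-alignment points
    """
    e_start, e_end, m_start, m_end = phrase_pair
    if (e_start - 1, m_start - 1) in alignment:   # alignment point to the top left
        return "m"                                # monotone with previous phrase
    if (e_end + 1, m_start - 1) in alignment:     # alignment point to the top right
        return "s"                                # swapped with previous phrase
    return "d"                                    # discontinuous

# Hypothetical alignment for a 4-word sentence pair.
alignment = {(0, 0), (1, 2), (2, 1), (3, 3)}
print(orientation((1, 1, 2, 2), alignment))   # top-right point (2, 1) exists -> 's'
```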

4.2.6.2 Maintaining stack diversity

The beam search organizes and compares hypotheses based on the number of English words they have translated. Since they may have different English words translated, we use future score estimates about the remaining sentence translation score. Instead of comparing such apples and oranges, we could also organize hypotheses by their exact English word coverage. The disadvantage of this is that it would require an exponential number of stacks, but with reordering limits the number of stacks is only exponential with regard to maximum reordering distance. Such coverage stacks are implemented in the search, and their maximum size is specified with the switch -stack-diversity, which sets the maximum number of hypotheses per coverage stack.


The actual implementation is a hybrid of coverage stacks and word-count stacks: the stack diversity is a constraint on which hypotheses are kept on the traditional stack. If the stack diversity limits leave room for additional hypotheses according to the stack size limit, then the stack is filled up with the best hypotheses, using the score so far and the future score estimate.

4.2.6.3 Cube Pruning

Cube pruning, as described by Liang Huang and David Chiang (2007), has been implemented in the Moses decoder, in addition to the traditional search algorithm. The code offers developers the opportunity to implement different search algorithms using an extensible framework. Cube pruning is faster than the traditional search at comparable levels of search errors; to get faster performance than the default Moses setting at roughly the same quality, suitable parameter settings have to be used. With cube pruning, the size of the stack has little impact on performance, so it should be set rather high. The speed/quality trade-off is mostly regulated by the cube pruning pop limit, i.e. the number of hypotheses added to each stack. Stacks are organized by the number of English words covered, so they may differ by which words are covered. You may also require that a minimum number of hypotheses is added for each word coverage (they may still be pruned out, however). This is done using the switch -cube-pruning-diversity MINIMUM, which sets the minimum. The default is 0.

4.2.6.4 Multiple Translation Tables

Moses allows the use of multiple translation tables, and there are two different ways in which they can be used:


• both translation tables are used for scoring: This means that every translation option is collected from each table and scored by each table. This implies that each translation option has to be contained in each table: if it is missing in one of the tables, it can not be used. • either translation table is used for scoring: Translation options are collected from one table, and additional options are collected from the other tables. If the same translation option (in terms of identical input phrase and output phrase) is found in multiple tables, separate translation options are created for each occurrence, but with different scores.

4.2.6.5 Pruning the Translation Table

The translation table contains all phrase pairs found in the parallel corpus, which includes a lot of noise. To reduce the noise, recent work by Johnson et al. has suggested pruning out unlikely phrase pairs.

4.2.7 Training

4.2.7.1 Preparing Training Data

Training data has to be provided sentence aligned (one sentence per line), in two files: one for the English sentences and one for the Malayalam sentences. A few other points have to be taken care of:
• Unix commands require the environment variable LC_ALL=C
• one sentence per line, no empty lines
• sentences longer than 100 words (and their corresponding translations) have to be eliminated (note that a shorter sentence length limit will speed up training)
• everything lowercased (use lowercase.perl)


4.2.7.2 Cleaning the corpus

The script clean-corpus-n.perl is a small script that cleans up a parallel corpus so that it works well with the training script. It performs the following steps:
• removes empty lines
• removes redundant space characters
• drops lines (and their corresponding lines) that are empty, too short, too long or violate the 9-1 sentence ratio limit of GIZA++
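A rough Python sketch of these cleaning steps follows. This is not clean-corpus-n.perl itself; the 100-word limit and the 9-1 ratio are taken from the description in this chapter, and the example sentences are hypothetical.

```python
def clean_parallel_corpus(src_lines, tgt_lines, max_len=100, ratio=9.0):
    """Drop sentence pairs that are empty, too long, or violate the length ratio."""
    cleaned = []
    for src, tgt in zip(src_lines, tgt_lines):
        src_toks = src.split()                 # split() also collapses redundant spaces
        tgt_toks = tgt.split()
        if not src_toks or not tgt_toks:
            continue                           # empty line on either side
        if len(src_toks) > max_len or len(tgt_toks) > max_len:
            continue                           # too long for GIZA++ training
        if (len(src_toks) > ratio * len(tgt_toks) or
                len(tgt_toks) > ratio * len(src_toks)):
            continue                           # violates the 9-1 sentence ratio
        cleaned.append((" ".join(src_toks), " ".join(tgt_toks)))
    return cleaned

# Hypothetical two-sentence corpus; the second pair is dropped (empty target).
en = ["He is  running on the road", "This line has no translation"]
ml = ["avan roadilkuuTe ooTikkoNTirikkunnu", ""]
print(clean_parallel_corpus(en, ml))
```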

4.2.7.3 Factored Training

For training a factored model, you will specify a number of additional training parameters:
--alignment-factors FACTORMAP
--translation-factors FACTORMAPSET
--reordering-factors FACTORMAPSET
--generation-factors FACTORMAPSET
--decoding-steps LIST

Alignment factors

It is usually better to carry out the word alignment (steps 2-3 of the training process) on more general word representations with richer statistics; even successful word alignment with words stemmed to 4 characters has been reported. For factored models, this suggests that word alignment should be done only on either the surface form or the stem/lemma. The factors that are used during word alignment are set with the alignment-factors switch. The switch requires a FACTORMAP as argument, for instance 0-0 (using only factor 0 from the source and target language) or 0,1,2-0,1 (using factors 0, 1, and 2 from the source language and 0 and 1 from the target language). Typically we may want to train the word alignment using surface forms or lemmas.


Translation factors

The purpose of factored translation model training is to create one or more translation tables between subsets of the factors. All translation tables are trained from the same word alignment and are specified with the switch translation-factors.

Reordering factors

Reordering tables can be trained with reordering-factors. Syntax is the same as for translation factors.

Generation factors

Finally, we also want to create generation tables between target factors. Which tables to generate is specified with generation-factors, which takes a FACTORMAPSET as a parameter. This time the mapping is between target factors, not between source and target factors. One example is generation-factors 0-1, which creates a generation table between factors 0 and 1.

4.2.7.4 Decoding steps

The mapping from source words in factored representation into target words in factored representation takes place in a number of mapping steps (either using a translation table or a generation table). These steps are specified with the switch decoding-steps LIST. For example decoding-steps t0,g0,t1,t2,g1 specifies that mapping takes place in form of an initial translation step using translation table 0, then a generation step using generation table 0, followed by two translation steps using translation tables 1 and 2, and finally a generation step using generation table 1. (The specific names t0, t1,... are automatically assigned to translation tables in the order you define them with translation-factors, and likewise for g0 etc.)


Training Step 1: Prepare Data

The parallel corpus has to be converted into a format suitable for the GIZA++ toolkit. Two vocabulary files are generated and the parallel corpus is converted into a numberized format. The vocabulary files contain words, integer word identifiers and word count information; these are present for both English and Malayalam. Then a sentence pair file is generated. A sentence pair consists of three lines: first the frequency of this sentence, which in our training process is always 1 (this number can be used for weighting different parts of the training corpus differently); the two lines below contain the word ids of the English and the Malayalam sentence. From the sequence we can recognize which source word is mapped to which target word. GIZA++ also requires words to be placed into word classes. This is done automatically by calling the mkcls program. Word classes are only used for the IBM reordering model in GIZA++.

Training Step 2: Run GIZA++

GIZA++ is a freely available implementation of the IBM models. We need it as an initial step to establish word alignments. Our word alignments are taken from the intersection of the bidirectional runs of GIZA++, plus some additional alignment points from the union of the two runs. Running GIZA++ is the most time-consuming step in the training process; it also requires a lot of memory (1-2 GB of RAM is common for large parallel corpora). GIZA++ learns the translation tables of IBM Model 4, but we are only interested in the word alignment file: in this file, after some statistical information and the English sentence, the Malayalam sentence is listed word by word, with references to the aligned English words.


Each Malayalam word may be aligned to multiple English words, but each English word may only be aligned to at most one Malayalam word. This one-to-many restriction is reversed in the inverse GIZA++ run.

Training on really large corpora

GIZA++ is not only the slowest part of the training; it is also the most critical in terms of memory requirements. To deal better with these requirements, it is possible to run a preparation step on parts of the data, which involves an additional program called snt2cooc. For practical purposes, all we need to know is that the parts n switch may allow training on large corpora that would not be feasible otherwise (a typical value for n is 3).

Training in parallel

Using the parallel option will fork the script and run the two directions of GIZA++ as independent processes. This is the best choice on a multi-processor machine. If we have only single-processor machines and still wish to run the two GIZA++ runs in parallel, the following trick can be used; support for this is not fully user friendly, and some manual involvement is essential.
• First we start training the usual way with the additional switches last-step 2 and direction 1, which runs the data preparation and one direction of the GIZA++ training.
• When the GIZA++ step has started, start a second training run with the corresponding switches for the other direction. This runs the second GIZA++ run in parallel and then continues the rest of the model training.

Training Step 3: Align Words

To establish word alignments based on the two GIZA++ alignments, a number of heuristics may be applied. The default heuristic grow-diag-final starts with the intersection of the two alignments and then adds additional alignment points. Other possible alignment methods are:
• intersection
• grow (only add block-neighbouring points)


• grow-diag (without the final step)
• union
• srctotgt (only consider word-to-word alignments from the source-target GIZA++ alignment file)
• tgttosrc (only consider word-to-word alignments from the target-source GIZA++ alignment file)
Alternative alignment methods can be specified with the -alignment switch. A small sketch of the basic symmetrization operations is given below.
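The sketch is a simplified Python illustration (the alignments shown are hypothetical); the grow/grow-diag variants additionally add neighbouring points with heuristics not reproduced here.

```python
def symmetrize(src_to_tgt, tgt_to_src, method="intersection"):
    """Combine two directional GIZA++ alignments.

    src_to_tgt : set of (src, tgt) links from the source-to-target run
    tgt_to_src : set of (src, tgt) links from the target-to-source run
                 (already written in (src, tgt) order)
    """
    if method == "intersection":
        return src_to_tgt & tgt_to_src
    if method == "union":
        return src_to_tgt | tgt_to_src
    if method == "srctotgt":
        return set(src_to_tgt)
    if method == "tgttosrc":
        return set(tgt_to_src)
    raise ValueError("grow/grow-diag variants need the neighbour-adding heuristic")

# Hypothetical directional alignments for a short sentence pair.
e2m = {(0, 0), (1, 1), (2, 2)}
m2e = {(0, 0), (1, 1), (2, 3)}
print(symmetrize(e2m, m2e, "intersection"))   # {(0, 0), (1, 1)}
print(symmetrize(e2m, m2e, "union"))          # additionally contains (2, 2) and (2, 3)
```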

Training Step 4: Get Lexical Translation Table

Given this alignment, it is quite straightforward to estimate a maximum likelihood lexical translation table: we estimate w(m|e) as well as the inverse w(e|m) word translation table, as sketched below.
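The following Python fragment is an illustrative maximum-likelihood estimate of w(m|e) from symmetrized word alignments (the aligned pair shown is hypothetical; NULL alignments and the inverse table are omitted for brevity).

```python
from collections import defaultdict

def lexical_table(aligned_pairs):
    """Estimate w(m|e) = count(m, e) / count(e) from word-aligned sentence pairs.

    aligned_pairs : list of (english_words, malayalam_words, links), where
                    links is a set of (e_index, m_index) alignment points.
    """
    cooc = defaultdict(float)
    e_count = defaultdict(float)
    for e_words, m_words, links in aligned_pairs:
        for e_i, m_i in links:
            cooc[(m_words[m_i], e_words[e_i])] += 1.0
            e_count[e_words[e_i]] += 1.0
    return {pair: c / e_count[pair[1]] for pair, c in cooc.items()}

# Hypothetical aligned pair (transliterated Malayalam on the target side).
pairs = [(["he", "came"], ["avan", "vannu"], {(0, 0), (1, 1)})]
print(lexical_table(pairs))   # w(avan|he) = 1.0, w(vannu|came) = 1.0
```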

Training Step 5: Extract Phrases

In the phrase extraction step, all phrases are dumped into one big file. Each line of this file contains an English phrase, a Malayalam phrase, and the alignment points; alignment points are pairs (English, Malayalam). Also, an inverted alignment file extract.inv is generated and, if the lexicalized reordering model is trained (the default), a reordering file extract.o.

Training Step 6: Score Phrases

Subsequently, a translation table is created from the stored phrase translation pairs. The two steps are separated because, for larger translation models, the phrase translation table does not fit into memory. To estimate the phrase translation probability φ(m|e) we proceed as follows: first, the extract file is sorted; this ensures that all Malayalam phrase translations for an English phrase are next to each other in the file. Thus we can process the file one English phrase at a time, collect counts and compute φ(m|e) for that English phrase e. To estimate φ(e|m), the inverted file is sorted, and then φ(e|m) is estimated one Malayalam phrase at a time.


Next to the phrase translation probability distributions φ(e|m) and φ(m|e), additional phrase translation scoring functions can be computed, e.g. lexical weighting, word penalty, phrase penalty, etc. Currently, lexical weighting is added for both directions and a fifth score is the phrase penalty. Thus, five different phrase translation scores are computed:
• phrase translation probability φ(e|m)
• lexical weighting lex(e|m)
• phrase translation probability φ(m|e)
• lexical weighting lex(m|e)
• phrase penalty (always exp(1) = 2.718)

Word-to-word alignment

An enhanced version of the scoring script outputs the word-to-word alignments between e and m as they appear in the files (extract and extract.inv) generated in the previous training step, Extract Phrases. These two directional alignments are reported in the third and fourth fields, and the scores in the fifth field. In the third field, each word of the source phrase e is associated with the word(s) of the target phrase m, or with nothing; vice versa in the fourth field. As the two word-to-word alignments come from one word alignment, the two fields represent the same information; however, they are independent in principle. Hence, you can change them as you please; for example, you could replace them with two different word-to-word alignments (source-to-target and target-to-source).

Training Step 7: Build reordering model

By default, only a distance-based reordering model is included in the final configuration. This model gives a cost linear in the reordering distance: for instance, skipping over two words costs twice as much as skipping over one word. However, additional conditional reordering models may be built. These are conditioned on specified factors (in the source and target language) and learn different reordering probabilities for each phrase pair (or just for the English phrase). Possible configurations are:


• msd vs. monotonicity. MSD models consider three different orientation types: monotone, swap, and discontinuous. Monotonicity models consider only monotone or non-monotone; in other words, swap and discontinuous are lumped together.

• e vs. em. The model may be conditioned on the English phrase (e), or on both the English phrase and the Malayalam phrase (em).

• unidirectional vs. bidirectional. For each phrase, its ordering with respect to the previous phrase is considered. For bidirectional models, the ordering of the next phrase with respect to the current phrase is modeled as well.

This gives us the following possible configurations:
• msd-bidirectional-em (default)
• msd-bidirectional-m
• msd-em
• msd-f
• monotonicity-bidirectional-em
• monotonicity-bidirectional-e
• monotonicity-em
• monotonicity-e
and of course distance. Which reordering model is used (and built during the training process, if necessary) can be set with the switch reordering, e.g. -reordering distance or -reordering distance,msd-bidirectional-em. The distance model is always included, so there is no need to specify it. The number of features created with a lexical reordering model depends on the type of the model: an msd model has three features, one each for the probability that the phrase is translated monotone, swapped, or discontinuous; a monotonicity model has only one feature. If a bidirectional model is used, then the number of features doubles, one for each direction.

Training Step 8: Build generation model

In this training step, the generation model is created. It maps between target-side factors (as specified with generation-factors) and is learned from the target side of the corpus.


Training Step 9: Create Configuration File

As a final step, a configuration file for the decoder is generated with all the correct paths for the generated models and a number of default parameter settings. This file is called moses.ini. We will also need to train a language model.

4.2.7.5 Building a Language Model

4.2.7.5.1 Language Models in Moses

The language model should be trained on a corpus that is suitable to the domain. If the translation model is trained on a parallel corpus, then the language model should be trained on the output side of that corpus, although using additional training data is often beneficial. Our decoder works with one of the following:
• the SRI language modeling toolkit, which is freely available
• the IRST language modeling toolkit, which is freely available and open source
• the RandLM language modeling toolkit, which is freely available and open source
In order to let Moses rely on one toolkit or another, it has to be compiled with the proper option:
• –with-srilm=
• –with-irstlm=
• –with-randlm=
In the Moses configuration file, the type (SRI/IRST/RandLM) of the LM is specified through the first field of the lines devoted to the specification of LMs.

4.2.7.5.2 Building a LM with the SRI LM

A language model can be created by calling: - ngram-count -text CORPUS_FILE -lm SRILM_FILE


The command also works on compressed (gz) input and output. There are a variety of switches that can be used; it is good to use -interpolate -kndiscount.

4.2.7.5.3 IRST LM Toolkit

Moses can also use language models created with the IRSTLM toolkit. The IRSTLM toolkit handles LM formats which permit reducing both storage and decoding memory requirements and saving time in LM loading. In particular, it provides tools for:
• building (huge) LMs
• quantizing LMs
• compiling LMs (possibly quantized) into a binary format
• accessing binary LMs through the memory mapping mechanism
• querying class and chunk LMs

4.2.7.5.4 Building Huge Language Models

Training a language model from huge amounts of data can be very memory- and time-expensive. The IRSTLM toolkit features algorithms and data structures suitable to estimate, store, and access very large LMs. Typically, LM estimation starts with the collection of n-grams and their frequency counts. Then, smoothing parameters are estimated for each n-gram level, infrequent n-grams are possibly pruned and, finally, an LM file is created containing n-grams with probabilities and back-off weights. This procedure can be very demanding in terms of memory and time if applied to huge corpora. IRSTLM provides a simple way to split LM training into smaller and independent steps, which can be distributed among independent processes.


The procedure relies on a training script that makes little use of computer memory and implements the Witten-Bell smoothing method (an approximation of the modified Kneser-Ney smoothing method is also available). First, create a directory under the working directory, where the script will save a number of temporary files; then simply run the script.

4.2.7.5.5 Binary Language Models

We can convert our language model file (created either with the SRILM ngram-count command or with the IRSTLM toolkit) into a compact binary format with the command:

> compile-lm language-model.srilm language-model.blm

Moses compiled with the IRSTLM toolkit is able to properly handle that binary format. The binary format allows LMs to be efficiently stored and loaded. The implementation privileges memory saving rather than access time.

4.2.7.5.6 Quantized Language Models

Before compiling the language model, we can quantize its probabilities and back-off weights. The resulting language model requires less memory because all its probabilities and backoff weights are now stored in 1 byte instead of 4. No special setting of the configuration file is required: Moses compiled with the IRSTLM toolkit is able to read the necessary information from the header of the file.

4.2.7.5.7 Memory Mapping

It is possible to avoid the loading of the LM into the central memory by exploiting the memory mapping mechanism. Memory mapping permits the decoding process to directly access the (binary) LM file stored on the hard disk.


In order to activate the access through the memory mapping, simply add the suffix .mm to the name of the LM file (which must be stored in the binary format) and update the Moses configuration file accordingly.

4.2.7.5.8 Chunk Language Models

A particular processing is performed whenever fields are supposed to correspond to micro tags, i.e. the per-word projections of chunk labels. The processing aims at collapsing the sequence of micro tags defining a chunk to the label of that chunk. The chunk LM is then queried with ngrams of chunk labels, in an asynchronous manner with respect to the sequence of words, as in general chunks consist of more words. The collapsing operation is automatically activated if the sequence of micro tags is: (TAG TAG+ TAG+ ... TAG+ TAG) Or TAG (TAG+ TAG+ ... TAG+ TAG) Both those sequences are collapsed into a single chunk label as long as (TAG / TAG (, TAG+ and TAG) are all mapped into the same label CHNK. The map into different labels or a different use/position of characters in the lexicon of tags prevent the collapsing operation.

4.2.7.5.9 RandLM

If you really want to build the largest LMs possible (for example, a 5-gram trained on the whole of the Gigaword Corpus), then you should look at RandLM. This takes a very different approach from either SRILM or IRSTLM: it represents LMs using a randomised data structure. This can result in LMs that are ten times smaller than those created using SRILM (and also smaller than IRSTLM), but at the cost of making decoding about four times slower. If we use RandLM, then we need to split our corpus of sentences into multiple blocks and run the decoder on each one in parallel. Since we are saving such a lot of memory, this results in roughly the same overall decoding speed as when using SRILM, while still benefiting from a reduced memory footprint.

4.2.7.6 Tuning

The training script train-factored-model.perl produces a configuration file moses.ini which has default weights of questionable quality. That is why we need to obtain better weights by optimizing translation performance on a development set. This is done with the tuning script mert-moses-new.pl. This new version of the minimum error rate training script is based on new C++ software; the new MERT implementation is stand-alone open-source software, and the only interaction between Moses and it is given by the script mert-moses-new.pl itself. This implementation stores feature scores and error statistics in separate files for each n-best list (at each iteration), and uses these files to optimize the weights. At the moment weight optimization can be based on either BLEU or PER. Most features of the old script mert-moses.pl are maintained, and new ones are added. The script is run as follows:

mert-moses-new.pl input-text references decoder-executable decoder.ini

Parameters:
• input-text and references are the development set on which translation performance is optimized. The tuning script tries to find translations for input-text that best resemble the reference translations in references. The script also works with multiple reference files; these have to be called [references]0, [references]1, [references]2, etc.
• decoder-executable is the location of the decoder binary to be used
• decoder.ini is the location of the configuration file to be used


CHAPTER 5
IMPLEMENTATION

5.1 TOOLKITS USED

5.1.1 SRILM: Stanford Research Institute Language Model

SRILM is a toolkit for building and applying various statistical language models (LMs). It requires a large monolingual corpus (in our case Malayalam) in a well-aligned manner (one sentence per line). The main objective of SRILM is to support language model estimation and evaluation. Estimation means creating a model from training data; evaluation means computing the probability of a test corpus, conventionally expressed as the test-set perplexity. SRILM is based on N-gram statistics, and its main tools are ngram-count and ngram. The three main functionalities are:
– Generate the n-gram count file from the corpus
– Train the language model from the n-gram count file
– Calculate the test data perplexity using the trained language model

Fig 5.1: Steps for developing language model


5.1.2 GIZA++: Training of statistical translation models

GIZA++ is an extension of the program GIZA (part of the SMT toolkit EGYPT), which was developed by the Statistical Machine Translation team during the 1999 summer workshop at the Center for Language and Speech Processing, Johns Hopkins University. The program includes the following extensions to GIZA:
• IBM Model 4
• IBM Model 5
• alignment models depending on word classes
• an implementation of the HMM alignment model: Baum-Welch training, Forward-Backward algorithm, empty word, dependency on word classes, transfer to fertility models, etc.
• variants of Model 3 and Model 4 which allow the training of the parameter
• various smoothing techniques for fertility and distortion/alignment parameters
• significantly more efficient training of the fertility models
• a correct implementation of pegging as described in (Brown et al. 1993), with a series of heuristics to make pegging sufficiently efficient

5.1.3 MOSES

Moses is a statistical machine translation system that allows us to automatically train translation models for any language pair; all we need is a collection of translated texts (a parallel corpus). Moses is a drop-in replacement for Pharaoh, the popular phrase-based decoder, with many extensions. Moses features novel factored translation models, which enable the integration of linguistic and other information at many stages of the translation process. In Moses a phrase-based translation model consists of:
• phrase-table: the phrase translation table
• moses.ini: the configuration file for the decoder

The key to good translation performance is having a good phrase translation table, but some tuning can be done with the decoder. The most important is the tuning of the model parameters. The probability cost that is assigned to a translation is a product of the probability costs of four models, viz. the phrase translation table, the language model, the reordering model, and the word penalty. Each of these models contributes information about one aspect of the characteristics of a good translation:
• The phrase translation table ensures that the English phrases and the Malayalam phrases are good translations of each other.
• The language model ensures that the output is fluent Malayalam.
• The distortion model allows for reordering of the input sentence, but at a cost: the more reordering, the more expensive the translation.
• The word penalty provides a means to ensure that the translations do not get too long or too short.

The basic reordering model implemented in the decoder is fairly weak. Reordering cost is measured by the number of words skipped when English input phrases are picked out of order. The total reordering cost is computed as D(m,e) = -Σ_i d_i, where d_i for each phrase i is defined as d_i = abs(last word position of previously translated phrase + 1 - first word position of newly translated phrase).

5.1.4 BLEU: Bilingual Evaluation Understudy

BLEU is an automatic evaluation method whose metric is based on n-gram co-occurrence: the intrinsic quality of MT output is judged by comparing its n-grams with human reference translations. For English to Indian language translation, however, BLEU is not entirely appropriate, since it is not good for rough translations, and Indian languages tend to produce only rough matches against a reference because they are morphologically rich and carry a lot of suffixes.
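For reference, here is a compact Python sketch of sentence-level BLEU as described above (modified n-gram precision with a brevity penalty). It is an illustration with a single reference and no smoothing, not the official evaluation script, and the example sentences are hypothetical.

```python
from collections import Counter
from math import exp, log

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference (no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)          # clipped n-gram precision
    if min(precisions) == 0:
        return 0.0                                  # no smoothing: any zero precision gives 0
    brevity = min(1.0, exp(1 - len(ref) / len(cand)))
    return brevity * exp(sum(log(p) for p in precisions) / max_n)

# Hypothetical candidate and reference; scores roughly 0.81.
print(bleu("the cat is on the mat today", "the cat is on the mat"))
```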

5.2 DEVELOPMENT OF PARALLEL CORPUS

The main problem in machine translation for Malayalam is the non-availability of a large amount of parallel corpora, though there is a smaller amount of text in specific domains (such as education, history, sports, etc.). We also need the data in Unicode format, and hence some mapping has to be done to develop a Unicode corpus. We therefore collected Malayalam-medium and English-medium texts of the Kerala state syllabus in PDF format. The Malayalam PDF data is not in Unicode format, so we created a mapping file and wrote Perl code for converting the data into Unicode. A snapshot of the mapping file is shown in Fig 5.2. The mapping file considers not only single characters but also character combinations, and the entries are arranged in decreasing length; this is done because, while mapping, the longest entries must be matched first and only then the shorter ones. A small sketch of this longest-match-first conversion is given after the figure.

Fig 5.2: Snapshot of the mapping file developed
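The sketch below is written in Python rather than the Perl used in the project, and the mapping entries are invented placeholders rather than the actual font-to-Unicode table; it only illustrates the longest-match-first idea.

```python
def build_mapper(mapping):
    """Return a function that converts legacy-font text to Unicode,
    always trying the longest mapping entries first."""
    # Sort keys by decreasing length so combination characters match
    # before the single characters they contain.
    keys = sorted(mapping, key=len, reverse=True)

    def convert(text):
        out, i = [], 0
        while i < len(text):
            for key in keys:
                if text.startswith(key, i):
                    out.append(mapping[key])
                    i += len(key)
                    break
            else:                      # no entry matched: copy the character through
                out.append(text[i])
                i += 1
        return "".join(out)

    return convert

# Placeholder mapping: the two-character entry 'Aa' must be tried before 'A'.
demo_map = {"Aa": "\u0d06", "A": "\u0d05"}
convert = build_mapper(demo_map)
print(convert("Aa A"))
```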

After obtaining the parallel corpus, we need to align it so that for each English sentence there is a corresponding Malayalam sentence on the other side; in this way we aligned more than 15,000 lines. For the language model we need Malayalam monolingual data, which also has to be in Unicode format and arranged with one sentence per line; we developed about 35,000 such lines. Since Malayalam is a morphologically rich language, there are many suffixes for the different tenses. In order to cover these forms we also manually created simple English sentences of the kind used in day-to-day life, along with their corresponding Malayalam. This corpus covers different grammatical usages and consists of more than 1,000 lines.

5.3 RUNNING THE SYSTEM

First we need to develop the language model for the target language (i.e. Malayalam) using the monolingual corpus we developed. This is done in order to tell the system which language we are generating and to give it an idea of that language's grammatical structure; we created a trigram language model with the Kneser-Ney smoothing technique. The second step is the training phase, in which we make use of both GIZA++ and Moses; the input is the parallel corpus (English-Malayalam) together with the Malayalam language model file. Training performs the GIZA++ alignment using IBM Models 1-5, first word alignment and then phrase alignment, and creates phrase translation tables for English to Malayalam and vice versa. Reordering (distortion) is done on the source language for correct mapping into the target language. The third step is decoding or testing, in which we gave test sentences in three styles:


1. Lines which are contained in the training data.
2. Combinations of mixed words from 2 or 3 lines (i.e. the words appear in the training data but the lines themselves do not).
3. Sentences taken on the fly (i.e. they may or may not contain words seen in the training data).

5.4 Rule-Based Reordering & Morphological Processing

We can improve the accuracy of English to Malayalam statistical machine translation by incorporating rule-based reordering and morphological information. The main ideas which have proven very effective are (i) reordering the English source sentence according to Malayalam syntax, and (ii) using root-suffix separation on both English and Malayalam words. The first is done by applying simple transformation rules on the English parse tree given by the Stanford Dependency Parser; the second is developed using a morph analyzer. This approach achieves good performance and better results than the plain phrase-based system, and it avoids the use of parsing for the target language (Malayalam), since parsing tools for Malayalam are currently not available. Statistical machine translation for Malayalam gives poor results if we provide the parallel corpus directly, for the following reasons: (i) English follows SVO (Subject-Verb-Object) word order but Malayalam follows SOV (Subject-Object-Verb) word order; (ii) Malayalam is morphologically quite rich; and (iii) huge tagged parallel corpora are not available for the English-Malayalam language pair. So the technique of including rule-based reordering and morphological processing in statistical machine translation (SMT) for Malayalam gives more accuracy. It is possible to combine phrase-based models with a group of restructuring rules on the English sentence, producing a reordered English sentence as per the Malayalam word order, and also to incorporate morphological information in English and Malayalam by using a morph analyzer.

5.4.1 Syntactic Information

Phrase-based models work well only when the source and target languages have nearly the same word order. Differences in word order are handled in phrase-based models by distortion probabilities, but the reordering they produce is not satisfactory for language pairs whose word orders differ greatly. English follows SVO order, whereas Malayalam follows SOV. We can remove this 'distortion load' from the phrase-based model by changing the word order of the English sentence into Malayalam word order before the SMT system processes it. Our preprocessing is done as follows.

Fig 5.3 Preprocessing of source data for training

The source English data is preprocessed as follows. First it is given to the Stanford dependency parser, which produces a directed graph whose relations express the semantic dependencies among the words of the sentence; the scheme uses 55 relations. A hierarchical structure is formed in which the main relation is always at the root. The relations include argument relations such as subject, object, objects of prepositions and clausal complements; modifier relations such as adjectival, adverbial, participial and infinitival modifiers; and other relations such as coordination, conjunct, expletive and punctuation. The parser output is then given to a tree-processing module, which builds a modified dependency tree by handling auxiliary verbs, prepositions, conjunctions, etc., so that the tree is suitable for the target language; the meaning of the combined words this produces can be identified by calculating probabilities from the parallel corpora.


A suitable traversal of this modified tree then gives the reordered sentence. The whole process can be illustrated by the following example.

English: He is running on the road

Malayalam: അവൻ റോഡിൽകൂടെ ഓടിക്കൊണ്ടിരിക്കുന്നു (avan roadilkuuTe ooTikkoNTirikkunnu)

The output of the Stanford parser for this English sentence is shown in Fig 5.4. Tree processing is then carried out, producing combined words that can easily be learned from the corpus:

1) Handling auxiliary verbs: remove the auxiliary relation and combine the auxiliary with its verb, e.g. aux(running, is) becomes is_running.

Figure 5.4 Stanford Parser Output

2) Handling prepositions/conjunctions: extract the preposition from the relation and attach it to the parent/child, e.g. prep_on(running, road) becomes prep(running, road_on).

From this we can form a modified dependency tree as shown in Fig 5.5. We will not find the meaning of combined words like 'is_running' or 'on road' in any dictionary; this is where SMT helps us, since their meaning can be identified by calculating probabilities from the parallel corpus. Such combined forms therefore give us meanings beyond those found in the dictionary, and many such combinations can be discovered with simple heuristics. The reordering itself must follow the word order of the target language.
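As an illustration of this tree-processing step, the following is a minimal Perl sketch that rewrites the two relation types discussed above on the parser's typed-dependency output; the input format shown (one relation per line, without token indices) is a simplification assumed for the example, and the sketch is not the interface of the reordering program described in this report.

```perl
#!/usr/bin/perl
# Illustrative rewriting of Stanford typed dependencies (format assumed: rel(head, dep) per line).
use strict;
use warnings;

while (my $line = <DATA>) {
    chomp $line;
    if ($line =~ /^aux\((\w+),\s*(\w+)\)$/) {
        # 1) Auxiliary verbs: combine the auxiliary with its verb,
        #    e.g. aux(running, is) -> is_running
        print "combined: $2_$1\n";
    }
    elsif ($line =~ /^prep_(\w+)\((\w+),\s*(\w+)\)$/) {
        # 2) Prepositions: pull the preposition out of the relation name and attach it
        #    to the child, e.g. prep_on(running, road) -> prep(running, road_on)
        print "prep($2, $3_$1)\n";
    }
    else {
        print "$line\n";   # all other relations are passed through unchanged
    }
}

__DATA__
nsubj(running, He)
aux(running, is)
prep_on(running, road)
det(road, the)
```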

Fig 5.5 Modified Dependency Tree

Malayalam: അവൻ റോഡിൽകൂടെ ഓടിക്കൊണ്ടിരിക്കുന്നു (avan roadilkuuTe ooTikkoNTirikkunnu)

Reordered English: He the on road is running

The reordering program was written in Perl; it traverses the tree and generates the reordered output, covering the 55 grammatical relations in the Stanford dependency scheme. The original English sentence is thus reordered so that it has the same grammatical order as the Malayalam sentence.

5.4.2 Sandhi (Euphonic Combination) Rules

In Malayalam, mainly four types of sandhi change occur when two morphemes combine. After a suffix is stripped from a verb, sandhi rules must be applied in order to recover the stem. The rule types are:

1. Elision
2. Augmentation
3. Substitution
4. Reduplication (Gemination)

5.4.2.1 Elision Rules

When two articulated sounds join together, one sound is dropped. Eg: kaaRR + aTiccu = kaaRRaTiccu

1. When a case suffix beginning with a vowel follows, the final 'u' is dropped. Eg: atu + alla = atalla
2. When a case suffix begins with a vowel, the final 'a' of the adjectival particle is lost. Eg: peRRa + amma = peRRamma
3. When the form alla inflects for a case suffix, its final 'a' is dropped. Eg: alla + engkil = allengkil
4. When a suffix beginning with a vowel follows the forms aaTTe, aate, uTe, uuTe, their final 'e' is dropped. Eg: nilkkaTT + aviTe = nilkkaTTaviTe; taraate + aayi = taraataayi; vazhiyiluuTe + ooTi = vazhiyiluuTooTi

5.4.2.2 Augmentation Rules

1. The glide 'y' is added when a vowel follows a vowel-ending word. Eg: pana + oola = panayoola; kaLi + alla = kaLiyalla
2. When the suffix is added, its consonant is doubled after the stem. Eg: cila + kunnu = cilakkunnu
3. The glide 'v' is added when a vowel follows a rounded vowel. Eg: tiru + ooNam = tiruvooNam; puu + ampan = puuvampan
4. When a suffix beginning with a vowel follows a verb stem, the glide 'y' is added. Eg: peruki + a = perukiya
5. A glide ('v' or 'y') appears after the demonstrative roots 'a', 'i', 'e'. Eg: a + an = avan; i + aaL~ = iyaaL~; e + iTaM = eviTaM

5.4.2.3 Substitution Rules

1. When the dental 'n' is followed by 'c', it changes into the palatal 'nj'. Eg: nin + caritam = ninjcaritam
2. After the retroflex 'N', the dentals 'n' and 't' change into the retroflexes 'N' and 'T'. Eg: kaN + niiR = kaNNiiR; taN + taaR = taNTaaR
3. When 'l' is followed by 'm' or 'n', it is replaced by 'n'. Eg: nal + ma = nanma; nel + maNi = nenmaNi
4. This change does not happen in all places; sometimes the nasal becomes 'l' instead. Eg: pin + kaalam = pilkkaalam
5. When '-am' is followed by certain suffixes, the 'm' remains as it is. Eg: dhanam + uNT = dhanamuNT
6. When a vowel is suffixed after the future tense marker '-um', the 'm' is substituted by 'v'. Eg: tarum + aan = taruvaan; varum + aan = varuvaan; pookum + aan = pookuvaan
7. When a vowel is suffixed after the anusvaaram, it is substituted by 'tt'. Eg: kulam + il = kulattil; dhanam + aal = dhanattaal
8. Sometimes when 'ooTu' is added to the case form, 'tt' is substituted. Eg: dhanam + ooTu = dhanattooTu
9. Likewise, a vowel suffixed after the anusvaaram gives the substituted 'tt'. Eg: aayiram + aaNT = aayirattaaNT
10. Before a suffix beginning with 'k', the nasal is replaced by the velar nasal 'ng'. Eg: nin + kaL = ningngaL; maram + kaL = marangngaL


5.4.2.4 Reduplication (Gemination)

1. After the demonstrative roots a, i, e (cuTTezhuttu), the following sound is geminated. Eg: a + kaaryaM = akkaaryaM; i + kaalaM = ikkaalaM; e + kaalavuM = ekkaalavuM
2. When an adjectival form is followed by another word, the first letter of the second word is doubled. Eg: niila + kall = niilakkall; maru + paRampu = maruppaRampu
3. In copulative compounds no change occurs. Eg: kara + caraNangngaL = karacaraNangngaL
4. Doubling does not occur in determinative compounds when the first form is a stem. Eg: niRa + paRa = niRapaRa
5. The following sound is doubled when a word is suffixed after the forms aal, il, kal and e. Eg: kaNNaal~ + kaNTu = kaNNaalkkaNTu; vannaal + kaaNaaM = vannaalkkaaNaaM
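To make the flavour of these rules concrete, here is a minimal Perl sketch that applies just two of them (the elision of a final 'u' before a vowel, and the gemination after the cuTTezhuttu a, i, e) to transliterated morpheme pairs. It is illustrative only, covers nothing like the full rule set, and the transliteration scheme and subroutine name are assumptions for the example.

```perl
#!/usr/bin/perl
# Illustrative sandhi joining of two transliterated morphemes (only two rules shown).
use strict;
use warnings;

sub sandhi_join {
    my ($left, $right) = @_;

    # Elision: a final 'u' is dropped when the suffix begins with a vowel,
    # e.g. atu + alla = atalla
    if ($left =~ /u$/ && $right =~ /^[aeiou]/) {
        $left =~ s/u$//;
        return $left . $right;
    }

    # Reduplication (gemination): after the cuTTezhuttu a, i, e the next
    # consonant is doubled, e.g. a + kaaryaM = akkaaryaM
    if ($left =~ /^[aie]$/ && $right =~ /^([^aeiou])/) {
        return $left . $1 . $right;
    }

    # Default: plain concatenation (e.g. copulative compounds)
    return $left . $right;
}

print sandhi_join('atu',  'alla'),          "\n";   # atalla
print sandhi_join('a',    'kaaryaM'),       "\n";   # akkaaryaM
print sandhi_join('kara', 'caraNangngaL'),  "\n";   # karacaraNangngaL
```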


5.4.3 Tense markers

5.4.3.1 Past tense markers

1. -i
Eg: paaTi, caaTi, ooTi, vilasi, pooyi, aaTi, niinti, kiTTi, poTTi, aaTTi, kaRangngi, tuTangngi, etc.
The past tense marker -i occurs with roots that have a consonant-ending stem and no alternant form. The elision rule applies here, and -i is added directly to the stem.

2. -RRu
Eg: tooRRu, viRRu, peRRu, aRRu
The past tense marker -RRu occurs with these stems. When 'i' is added to the verb stem it changes into 't' (the substitution rule), and sometimes 't' changes into 'RRu'.

3. -tu
Eg: ceytu, uzhutu, koytu, neytu, peytu, etc.
The past tense marker -tu occurs with these stems; after the final sound the vowel 'u' occurs, and no sandhi rule applies.

4. -Tu
Eg: kaNTu, koNTu, uNTu, etc.
The past tense marker -Tu occurs after the consonant 'N'. The substitution rule applies here.

5. -ttu
Eg: eTuttu, koTuttu, aRuttu, etc.
The marker -ttu occurs with these roots; after 'R' no vowel sound occurs. In the transitive form the marker 'tu' becomes 't' and is then geminated into '-tta'.

6. -ccu
Eg: ciriccu, paThiccu, kuLiccu, aTaccu, veRaccu, aTiccu, etc.


The past tense marker -ccu occurs after vowel-ending roots (of the shape cvcv); no sandhi rule applies here.

7. -ntu
Eg: ventu, nontu
The past tense marker -ntu occurs after 'n'; it appears after the stems /no/ and /ve/, which have alternant forms. The augmentation (addition) rule applies here, and 'na' changes into -ntu.

8. -nnu
Eg: vannu, konnu, etc.
The marker -nnu occurs after these roots. The elision rule applies and 'n' is dropped. Exceptional case: no change occurs after 'R', e.g. vaLaRnnu, taLaRnnu.

9. -njnju
Eg: paRanjnju, karanjnju, carinjnju, nananjnju, eRinjnju, aranjnju, etc.
The marker -njnju occurs after vowel-ending stems; it is added directly to the stem and no sandhi rule applies.

5.4.3.2 Present tense markers

-unnu
Eg: aaTunnu, tarunnu. Here 'u' occurs after the stem, so the addition (augmentation) rule applies.
Eg: aTikkunnu, paThikkunnu. In this case the gemination rule applies: the link morph 'kk' is added after the stem.
Eg: tinnunnu. Here the final 'n' of the verb geminates into 'nn'.


5.4.3.3 Future tense markers

-um
Eg: varum, tarum. The marker -um occurs after consonant-ending stems; the final 'u' of the stem is dropped, so the elision rule applies.
Exceptional case: a final 'L' or 'y' of the stem is doubled. Eg: ceyyum, kollum, koyyum, peyyum, tallum.
Exceptional case: for vowel-ending roots, the link morph -kk is added before the future tense marker -um. Eg: aRukkum, ceRukkum, karikkum, etc.
Exceptional case: in some cases the final 'u' is dropped when the suffix begins with a vowel, so the elision rule applies. Eg: uzhum, tozhum, etc.
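The following Perl sketch illustrates how a very small part of this analysis, recognising a few of the past tense markers listed above and splitting them from the stem, might look on transliterated input. The marker list, the longest-match strategy and the 'stem + marker' output format are assumptions made for the illustration; this is not the implementation of the morph analyzer used in this work.

```perl
#!/usr/bin/perl
# Illustrative root-suffix separation for a few Malayalam past tense markers (transliterated input).
use strict;
use warnings;

# A handful of the past tense markers discussed above, longest first so that
# e.g. 'ciriccu' is split as ciri + ccu rather than ciric + cu.
my @past_markers = qw(njnju ccu ttu nnu ntu RRu Tu tu i);

sub split_past {
    my ($word) = @_;
    for my $m (@past_markers) {
        if ($word =~ /^(.+)\Q$m\E$/) {
            return "$1 + $m";          # stem + tense marker
        }
    }
    return $word;                      # no known marker found
}

print split_past($_), "\n" for qw(paaTi ciriccu kaNTu eTuttu paRanjnju vannu);
# paaT + i,  ciri + ccu,  kaN + Tu,  eTu + ttu,  paRa + njnju,  va + nnu
```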

5.4.4 Morphological Information

If an SMT system can make use of morphological information, the accuracy of the translated output is higher for a morphologically rich language like Malayalam. Another advantage is that the required size of the training data is reduced. To include morphological information, we use a morphological analyzer for both English and Malayalam. Morphological analysis identifies the constituent parts of words: for example, the word kuTTikaL ('boys') consists of two constituents, kuTTi ('boy') and kaL (the plural marker, like English '-s'). The analysis primarily consists in breaking words into these parts and establishing the rules that govern their co-occurrence. The source and target parallel data pass through the blocks shown in Fig 5.7: morph analysis and reordering are applied on both the source and the target side, and the detailed combined blocks are shown in Fig 5.8. The reordered English source sentence is given to the English morph analyzer, and the target sentence to the Malayalam morph analyzer. The outputs of both are then given to GIZA++ for training and word alignment, and the phrase table created is given to the decoder. The test data is processed in the same way, so that grammatical variants can be handled even when the corresponding surface forms are not present in the corpus. The GIZA++ alignment is as shown below.

Fig 5.6 GIZA++ alignment

After decoding we need a morph generator to recombine the stems and suffixes of the Malayalam output sentences.
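To illustrate the kind of factoring that is fed to GIZA++ on both sides (and that the morph generator must later undo on the Malayalam output), here is a minimal Perl sketch that splits the English plural '-s' and the Malayalam plural marker kaL. The heuristics and file-free interface are assumptions for the example only; the real analyzers cover far more morphology than this.

```perl
#!/usr/bin/perl
# Illustrative root-suffix separation on both sides before alignment (heuristics are toy examples).
use strict;
use warnings;

# English side: split a plural noun into stem + suffix, e.g. boys -> boy s
sub split_en {
    my ($w) = @_;
    return "$1 s" if $w =~ /^(\w+?)s$/ && length($w) > 3;
    return $w;
}

# Malayalam side (transliterated): split the plural marker kaL, e.g. kuTTikaL -> kuTTi kaL
sub split_ml {
    my ($w) = @_;
    return "$1 kaL" if $w =~ /^(.+)kaL$/;
    return $w;
}

print join(' ', map { split_en($_) } qw(the boys came)), "\n";   # the boy s came
print join(' ', map { split_ml($_) } qw(kuTTikaL vannu)), "\n";  # kuTTi kaL vannu
```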

Fig 5.7 Morph Analyzer & Reordering

All of these blocks can be combined as shown in Fig 5.8. The system we implemented trains the phrase-based model described in this section on a 20,000-sentence training corpus. For the Malayalam language model we compared various n-gram models and found the trigram model with modified Kneser-Ney smoothing to perform best; a larger monolingual corpus of about 40,000 lines was used to learn the language model, and the SRILM toolkit was used for the language modelling experiments. The development corpus was used to set the weights of the language model, the distortion model, the phrase translation model, etc. using minimum error rate training. Decoding was performed using Moses. POS tagging was done on the English corpus, the Stanford parser was used for parsing, and the reordering program was written in Perl.
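Minimum error rate training is typically driven by a script distributed with Moses; the one-line driver below is a sketch under the assumption that mert-moses.pl is used, with placeholder file names and paths, and the exact interface should be checked against the installed Moses version.

```perl
#!/usr/bin/perl
# Hypothetical MERT invocation (paths and file names are placeholders).
use strict;
use warnings;

# mert-moses.pl takes the development input, the reference translations,
# the decoder binary and the initial moses.ini, and tunes the model weights.
system('mert-moses.pl dev.en dev.ml /usr/local/bin/moses train/model/moses.ini '
     . '--working-dir mert-work') == 0
    or die "MERT failed: $?";
```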

Fig 5.8 Syntactic and Morphological Processing


5.5 PERL

Perl (Practical Extraction and Report Language) is a programming language that can be used for a large variety of tasks. A typical simple use of Perl is extracting information from a text file and printing a report, or converting a text file into another form, but Perl also provides a large number of tools for quite complicated problems, including systems programming. Programs written in Perl are called Perl scripts, whereas the term 'the Perl program' refers to the system program named perl that executes Perl scripts. Perl was originally created by Larry Wall, then a programmer at Unisys Corporation. It is a dynamic programming language similar to C in many respects, is used widely in system administration, web development and networking tasks, and is known as a straightforward, easy-to-use language. Perl is implemented as an interpreted (not compiled) language, so a Perl script tends to use more CPU time than a corresponding C program; on the other hand, computers keep getting faster, and writing something in Perl instead of C tends to save the programmer's time. In SMT we are handling text, i.e. words of a natural language, and dealing with patterns (morphemes) in words is easy in Perl. The program for mapping files into Unicode is written in Perl. Perl can be called from a Java program, and the graphical user interface (GUI) of our system is designed in Java.
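As a flavour of the Unicode mapping program mentioned above, here is a minimal Perl sketch that converts a few transliterated (Roman) Malayalam letters to Unicode code points. The tiny mapping table and the longest-match loop are placeholders for illustration only; the real converter (cf. the Payyans tool referenced at the end of this report) handles the full script and font-specific mappings, including chillu letters, which are not handled here.

```perl
#!/usr/bin/perl
# Illustrative Roman-to-Unicode mapping for a few Malayalam letters (mapping table is a placeholder).
use strict;
use warnings;
use utf8;
binmode(STDOUT, ':encoding(UTF-8)');

# A tiny sample mapping; a real table covers the whole script.
my %map = (
    'a'  => "\x{0D05}",   # അ
    'va' => "\x{0D35}",   # വ
    'n'  => "\x{0D28}",   # ന
);

sub roman_to_unicode {
    my ($text) = @_;
    my $out  = '';
    my @keys = sort { length($b) <=> length($a) } keys %map;   # longest match first
    while (length $text) {
        my $matched = 0;
        for my $k (@keys) {
            if (index($text, $k) == 0) {
                $out .= $map{$k};
                substr($text, 0, length $k) = '';
                $matched = 1;
                last;
            }
        }
        unless ($matched) {               # pass through anything we do not know
            $out .= substr($text, 0, 1);
            substr($text, 0, 1) = '';
        }
    }
    return $out;
}

print roman_to_unicode('avan'), "\n";     # അവന (approximate; the final chillu form is not handled)
```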

5.6 GUI

A graphical user interface (GUI) is a type of user interface that allows people to interact with electronic devices such as computers through images rather than text commands. A GUI offers graphical icons and visual indicators, as opposed to text-based interfaces, typed commands or text navigation, to represent the information and actions available to the user; actions are usually performed by direct manipulation of the graphical elements. The GUI Builder in the NetBeans IDE enables Java GUI applications to be created visually. Our GUI is created in Java: the user enters English text and, on clicking the 'Translate' button, the corresponding Malayalam sentences are generated. The GUI contains two text areas, one for the English input lines and one for the output. Java and Perl are compatible in the sense that a Perl script can be called from a Java program; here, the Perl program for Roman-to-Unicode conversion (and vice versa) is linked to the Java GUI.

Fig 5.9 GUI for English – Malayalam SMT


CHAPTER 6
RESULTS AND EVALUATION

Since no English to Malayalam translator has yet been developed, this is an interesting area of research. The results produced by our system are quite interesting.

Corpus                       #Sentences    #Words
Training                     10,000        1,28,584
Development                  750           10,580
Test                         400           6,220
Monolingual (Malayalam)      39,034        9,53,238

Table 1.1 Corpus used

1. For the first set of test lines (sentences that appear in the training data), the system translates completely and the output is very accurate.
2. For the second set (sentences combining words from two or three training lines, so the words are in the training data but the whole line is not), it translates well; but since Malayalam is a relatively free word order, morphologically rich and agglutinative language (each root can combine with multiple morphemes to produce different word forms), the root words are correct while the inflections sometimes differ, which yields grammatically incorrect sentences.
3. In the third case (sentences made up on the fly, which may or may not contain words from the training data), the system translates whatever words occur in the training data and leaves the remaining words unchanged.

Evaluation

We evaluated the test corpus using the BLEU toolkit, and the result is quite astonishing:

1. 100% accuracy for sentences that are in the training corpus.


2. 70-80% accuracy for sentences that are combinations of words from different lines of the parallel corpus.
3. 25-50% accuracy for sentences that are completely outside the training corpus, which we still need to work on.

We can see that using syntactic information brings a substantial improvement over the baseline phrase-based system, while the impact of the morphological information is less visible in BLEU, since BLEU does not reward rough translations.

Technique                          Evaluation Metric (BLEU)
Baseline                           13.15
Baseline + syntax                  15.60
Baseline + syntax + morphology     16.10

Table 1.2 Results of evaluation

We find that including syntactic and morphological information brings substantial improvements in translation fluency.
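The report does not name the exact BLEU implementation used; the sketch below shows one common way such scores are obtained, by calling the multi-bleu.perl script that ships with Moses from a small Perl driver. The script path and the reference/hypothesis file names are assumptions.

```perl
#!/usr/bin/perl
# Hypothetical scoring driver: the report only says "the BLEU toolkit" was used;
# multi-bleu.perl (distributed with Moses) is shown here as one common option.
use strict;
use warnings;

my $bleu_script = 'mosesdecoder/scripts/generic/multi-bleu.perl';  # assumed path
my $reference   = 'test.ml';        # reference Malayalam translations (assumed file name)
my $hypothesis  = 'test.out.ml';    # decoder output (assumed file name)

# multi-bleu.perl reads the hypothesis on stdin and takes the reference file as an argument.
system("perl $bleu_script $reference < $hypothesis") == 0
    or die "BLEU scoring failed: $?";
```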


CHAPTER 7
CONCLUSION

In this project work we have presented an effective methodology for English-Malayalam phrase-based SMT. The results show that significant improvements are possible by adding syntactic and morphological information to the plain corpus. Since all Indian languages follow SOV order and are relatively rich in morphology, the methodology presented should be applicable to English to Indian language SMT in general; and because morphological and parsing tools are not widely available for Indian languages, an approach like this one, which minimises the use of such tools for the target language, is quite handy. In future work, since Malayalam is highly morphologically rich and agglutinative, we intend to improve the performance of the morphological analyzer. Because the parallel corpus is still deficient, we also aim to collect more corpora from different domains so that a wider vocabulary is covered. Finally, since BLEU does not handle rough translations well, we need to explore other evaluation techniques as well.


REFERENCES

1. Kevin Knight. 1999. A Statistical MT Tutorial Workbook. Prepared in connection with the JHU summer workshop, April 30, 1999.
2. P. F. Brown, S. A. Della Pietra, V. J. Della Pietra and R. L. Mercer. 1993. The Mathematics of Statistical Machine Translation. Computational Linguistics.
3. C. Manning and H. Schutze. 1999. Foundations of Statistical NLP.
4. P. F. Brown, S. A. Della Pietra, V. J. Della Pietra and R. L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation.
5. Kevin Knight. 1997. Automating Knowledge Acquisition for Machine Translation.
6. Ye-Yi Wang and Alex Waibel. 1997. Decoding Algorithm in Statistical Machine Translation.
7. Franz Josef Och, Nicola Ueffing and Hermann Ney. 2001. An Efficient A* Search Algorithm for Statistical Machine Translation.
8. Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical Phrase-Based Translation.
9. Franz Josef Och and Hermann Ney. 2004. The Alignment Template Approach to Statistical Machine Translation.
10. Galley, Hopkins, Knight and Marcu. 2004. What's in a Translation Rule?
11. Yamada and Knight. 2001. A Syntax-Based Statistical Translation Model.
12. Kenji Yamada and Kevin Knight. 2001. A Decoder for Syntax-Based Statistical MT.
13. Och and Ney. 2002. Discriminative Training and Maximum Entropy Models for Statistical Machine Translation.
14. Shen, Sarkar and Och. 2004. Discriminative Reranking for Machine Translation.
15. Papineni, Roukos, Ward and Zhu. 2001. BLEU: A Method for Automatic Evaluation of Machine Translation.
16. P. Brown, S. Della Pietra, V. Della Pietra and R. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2), 263-311.
17. P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin and E. Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. ACL 2007, Demonstration Session, Prague, Czech Republic.
18. Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş and Dániel Varga. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, 24-26 May 2006.
19. http://www.statmt.org/ -- includes an introduction to research, conferences, corpora and software listings.
20. http://www.statmt.org/moses/
21. http://www-nlp.stanford.edu/links/statnlp.html -- includes links to freely available statistical machine translation software.
22. http://code.google.com/p/giza-pp/
23. http://www.cs.cmu.edu/Eqing/
24. http://urd.let.rug.nl/tiedeman/OPUS/
25. http://wiki.smc.org.in/Payyans -- Payyans Unicode converter
