
Sanskrit-Russian Statistical Machine Translation: Using Bi-lingual POS Tagged Corpora

Sudhir Kumar Mishra
AAI Group, CDAC Pune, Pune, India
[email protected]

Ainura Asamidinova
American University of Central Asia, Bishkek, Kyrgyzstan
[email protected]

Abstract: This paper proposes a model of a Sanskrit-to-Russian Machine Translation System (SaRS). To produce better-quality statistical machine translation, a combined feature-based and trigram-based approach is used. The closeness of Russian and Sanskrit has been noted by many linguists in Russia and India. Sanskrit, the ‘holy’ language of the Vedas, is a proto-language of the Indo-European languages, and the Slavic languages share many similarities with it; according to Guseva (2002), the Russian language has its genetic roots in Sanskrit. We propose an architecture that systematically exploits the structural transfer and robust generation capabilities of a statistical machine translation system. The preprocessing and postprocessing structure is what distinguishes the proposed model from other SMT models.

I. Introduction
SMT is a data-oriented statistical approach to translating documents from one language to another, based on knowledge extracted from a bilingual corpus. The most important and challenging task in statistical machine translation (SMT) is creating a large, well-aligned parallel corpus to train the system. Tagging tasks are typically performed by manually annotating a document and training a classifier on it. The annotation can be done at the word level or the phrase level, depending on a system's specific requirements. In this paper we concentrate on both word- and phrase-level tagging of the Source Language (SL) and the Target Language (TL). The annotation process is expensive and difficult because it requires human expertise in both languages.
The idea behind SMT comes from information theory. A text is translated according to the probability distribution p(r|s): the probability of translating a sentence ‘s’ in the SL ‘S’ (Sanskrit) into a sentence ‘r’ in the TL ‘R’ (Russian). The distribution p(r|s) has been approached in a number of ways. By Bayes' theorem (Wikipedia), if p(s|r) and p(r) denote the translation model and the language model respectively, then p(r|s) ∝ p(s|r)p(r). The translation model p(s|r) is the probability that the source sentence is a translation of the target sentence, i.e. the way sentences in ‘r’ get converted into sentences in ‘s’.

Previous Work
The first public Russian-to-English Machine Translation System (MTS) was demonstrated at Georgetown University in 1954 with only six grammatical rules and a vocabulary of around 250 words (Hutchins, 1986). Later many research projects were devoted to MTS. However, as

the complexity of the linguistic phenomena involved in the translation process, together with the computational limitations of the approach, became apparent, enthusiasm faded quickly. Most competitive SMT systems, such as those of IBM and Google, use phrase-based approaches.

System Architecture
The language model, the translation model and the decoder are the main components of the system. The basic difference from other systems lies in its preprocessing and postprocessing models. Data cleaning is a manual task because identifying abbreviations and proper nouns together with their grammatical information is difficult. Spelling mistakes are also handled at the preprocessing stage, since foreign words and unknown words (not available in the SaRS database) cannot be distinguished automatically. Loss or addition of any extra information is likewise handled manually. After the cleaning and alignment stages, the model's most important scanning component identifies the presence of numbers, dates and phrases in the input document, if any. Number identification is required because numbers are written in many different ways in notifications, references, orders, clauses, etc.; a similar requirement holds for date identification.

Fig-1: System Design. The pipeline of the system is:
INPUT TEXT → Data Cleaning + Data Alignment → Number Identification + Date Identification + Phrase Identification → POS Tagging + Parsing/Chunker + Tri-gram → Structural Transformation → Language Model + Translation Model → Alignment + Phrase Extraction → Decoder → OUTPUT
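The stages of Fig-1 can be read as a sequential pipeline. The sketch below is only an illustration of that flow in Python; the helper names and the toy number/date detection are hypothetical and are not part of the actual SaRS implementation, in which cleaning and alignment are performed manually.

import re

# Illustrative sketch of the Fig-1 pipeline (hypothetical helper names).

def clean_text(text):
    # Preprocessing stand-in: normalise whitespace. In SaRS, abbreviations,
    # proper nouns and spelling are corrected manually at this stage.
    return re.sub(r"\s+", " ", text).strip()

def mark_numbers_and_dates(tokens):
    # Scanning-stage stand-in: flag numbers/dates so they can be handled
    # separately from ordinary words.
    return [("NUM_OR_DATE" if re.fullmatch(r"\d+([./-]\d+)*", t) else "WORD", t)
            for t in tokens]

def pipeline(text):
    tokens = clean_text(text).split()
    marked = mark_numbers_and_dates(tokens)
    # POS tagging, chunking, trigram formation, structural transformation
    # and decoding with the language and translation models would follow here.
    return marked

print(pipeline("A reliable 32-bit system released on 15.01.2024"))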


Phrase identification, Part-of-Speech (POS) tagging and parsing, disambiguation of POS at the syntactic level, chunking and tri-gram identification (one tag to the left, the central tag, and one tag to the right) form the main research approach that makes this study different from other works. The proposed approach requires training on the POS tokens of corresponding words in the SL and TL instead of training on the SL and TL words themselves.

Alignment in Parallel Corpora
After cleaning the bi-lingual data, word/phrase alignment is the first step in most present-day approaches to SMT. Following previous work (Adam Lopez, 2008), there are two types of alignment: sentence alignment and word/phrase alignment. Sentence alignment establishes correspondence between the sentences of the parallel corpus; it is many-to-many in general, because of the differing syntax and structure of the two languages. Word/phrase alignments are relationships between words in a pair of corresponding sentences; the alignment links each SL word to its aligned TL word. Complete sentence-level alignment is possible only if the translation is correct and consistent, with no information added in the TL and none lost. The alignment is many-to-many in general and can be seen as a bipartite graph between the words of the two sentences. For the primary research work, we manually cleaned, aligned and POS-tagged 100 sentences from the IT domain to build a bilingual Sanskrit-Russian corpus. An example of an aligned Sanskrit-Russian sentence pair:
E: You have a reliable and fast 32-bit operating system in your hands.
R: V vashih rukah nadezhnaya i bystraya 32-razryadnaya operacionnaya sistema.

S: bhavata karayo eka⋅ vi♣vasan×ya⋅ druta⋅ ca dv tri⋅♣adbi↑apram a⋅ sa calanatantra⋅ The POS tagging of above sentence pair can be seen as following: NO

Russian

Sanskrit

1 2 3

V= PREP, INDCL Vashih= PSPRN, LOC, 2, PL Rukah= N, LOC, F, PL

-

4

-

5 6 7

Nadezhnaja= ADJ, NOM, F, SG, LNG i= CONJ, INDCL Bystraja= ADJ, NOM, F, SG, LNG

भवत

= PSPRN, GEN, M, SG करयो = N, LOC, M, PL एकम् = NUM, ACC, M, SG व सनीयं = ADJ, NOM, F, SG च = CONJ, INDCL ितं ु =ADJ, NOM, NEU, SG 3

8 9 10 11

2-razrjadnaja= ADJ, NOM, F, SG, LNG Operacionnaja= ADJ, NOM, F, SG, LNG Sistema= N, NOM, F, SG -

ा ऽंशटूमाणं =

ADJ, NOM, NEU, SG सचलनं = ADJ, NOM, NEU, SG तऽं N, NOM, NEU, SG वतते = V, PSV, PRS, 3, SG

The word alignment is divided into three major functions according to the proposed model: (1) finding the most likely translations of an SL word, irrespective of word positions (translation model); (2) aligning positions in the SL sentence with positions in the TL sentence (distortion model); and (3) determining how many TL words are generated by one SL word, or how many SL words generate a single TL word, where an SL word may sometimes generate no TL word and a TL word may be generated by no SL word (fertility model).
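As an illustration of this three-way factorisation, the toy sketch below scores a single hypothetical word alignment with separate translation, distortion and fertility tables; all probability values and word forms are invented for exposition and are not SaRS parameters.

from math import prod

# Toy factorisation of an alignment score into translation, distortion and
# fertility components (all values are invented for illustration).
t_prob = {("tantram", "sistema"): 0.6}   # P(SL word | TL word): Sanskrit given Russian
d_prob = {(10, 10): 0.5}                 # P(SL position | TL position)
f_prob = {("sistema", 1): 0.8}           # P(number of SL words generated | TL word)

def alignment_score(links):
    # links: list of (sl_word, sl_pos, tl_word, tl_pos, fertility_of_tl_word)
    return prod(
        t_prob.get((sl_w, tl_w), 1e-6)
        * d_prob.get((sl_i, tl_i), 1e-6)
        * f_prob.get((tl_w, fert), 1e-6)
        for sl_w, sl_i, tl_w, tl_i, fert in links
    )

print(alignment_score([("tantram", 10, "sistema", 10, 1)]))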

Part-of-Speech (POS) Tagging
There are two main approaches to POS tagging. Rule-based tagging relies on hand-written rules to disambiguate words that admit more than one possible tag. This approach requires human experts to craft the rules, and tagging is typically performed by manual annotation, which makes the annotation process expensive. A stochastic approach relies on frequency, probability or statistics. The simplest stochastic approach finds the most frequently used tag for a specific word in the annotated training data and uses that tag for the word in un-annotated text (a minimal sketch of such a baseline tagger is given after the tag table below). The annotation can be done at the word/phrase level or at the sentence level, depending on the requirements of the research work; in this work both levels of tagging are used for better output. Broad-level tagging can be divided into the following four types:
• Named Entity Recognition (NER Tagging): identification of person names, organizations, places, expressions of date and time, quantities, and others.
• Part of Speech Tagging (POS Tagging): each word in a given sentence is assigned a part-of-speech tag that depends on the word and its context within the sentence.
• Noun Phrase Chunker: a noun phrase chunker extracts noun phrases from sentences.
• Morphology Analysis: each word of a sentence is classified according to its internal structure and origin.
The word is the smallest semantic unit of a sentence in most languages, and understanding the nature of individual words is important for understanding the deeper level of the sentence. 51 tags are used in the deep-level tagging of Sanskrit and Russian. These are as follows:

TAG    | Description
ABL    | Ablative
ACC    | Accusative
ACT    | Active voice
ADJ    | Adjective
ADV    | Adverb
BRF    | Short form
CONJ   | Conjunction
CPR    | Comparative degree
CST    | Causative
DAT    | Dative
DG     | Digital
DL     | Dual
DMPRN  | Demonstrative pronoun
DTPRN  | Determinative pronoun
F      | Feminine
FUT    | Future
FW     | Foreign word
GEN    | Genitive
GER    | Gerund
IMPRF  | Imperfect aspect
INDCL  | Indeclinable
INPRN  | Interrogative pronoun
INST   | Instrumental
LNG    | Long form
LOC    | Locative/prepositional
M      | Masculine
N      | Noun
NEG    | Negation
NEU    | Neutral
NOM    | Nominative
NPRN   | Negative pronoun
NUM    | Numerical
PL     | Plural
PNCT   | Punctuation marks
PPRN   | Personal pronoun
PREP   | Preposition
PRF    | Perfect aspect
PRN    | Pronoun
PRS    | Present
PRT    | Participle
PRTC   | Particle
PSPRN  | Possessive pronoun
PST    | Past
PSV    | Passive voice
REFL   | Reflexive
RFPRN  | Reflexive pronoun
RLPRN  | Relative pronoun
SG     | Singular
SMB    | Symbol
SPR    | Superlative degree
V      | Verb
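The simplest stochastic tagger mentioned above can be sketched as follows; the tiny training set is invented and uses tags from the table, whereas actual SaRS tagging relies on the manually annotated corpus.

from collections import Counter, defaultdict

# Toy annotated data: (word, tag) pairs with tags drawn from the table above.
training = [("sistema", "N"), ("sistema", "N"), ("i", "CONJ"),
            ("i", "PRTC"), ("bystraja", "ADJ")]

tag_counts = defaultdict(Counter)
for word, tag in training:
    tag_counts[word][tag] += 1

def most_frequent_tag(word, default="N"):
    # Assign the tag seen most often with this word in training;
    # unknown words fall back to a default tag.
    counts = tag_counts.get(word)
    return counts.most_common(1)[0][0] if counts else default

print([most_frequent_tag(w) for w in ["sistema", "i", "drutam"]])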

Language Model
The language model p(r) for this system is based on a trigram model, while the translation model p(s|r) has to be trained from a sentence-aligned Sanskrit-Russian parallel corpus. The probability p(s|r) is the probability that the Sanskrit word ‘s’ is a translation of the Russian word ‘r’. In the IBM approach the translation model itself consists of three models: the translation model, the distortion model and the fertility model.
A statistical language model (LM) is a model that predicts the probability of a sequence of words,
P(w1 ... wn) = P(w1) P(w2|w1) ... P(wn|w1, ..., wn-1).
The most common statistical model is the N-gram language model, which approximates the probability P(wi|w1, ..., wi-1) as P(wi|wi-N+1, ..., wi-1), i.e., the probability of a word is predicted using at most N-1 preceding words. For instance, the simplest example of an N-gram language model is the unigram model, which predicts the probability of a word sequence as
P(w1 ... wn) = P(w1) P(w2) P(w3) ... P(wn).
A bi-gram model predicts the word-sequence probability as
P(w1 ... wn) = P(w1) P(w2|w1) P(w3|w2) ... P(wn|wn-1).
Bi-gram probabilities (as well as N-gram probabilities for any N) can be estimated by counting their frequencies in a text corpus:
P(wi | wi-1) = count(wi-1, wi) / count(wi-1).
A trigram is a collection of three consecutive POS tags; the current POS in a trigram depends on the previous POS and the next POS of the same trigram. The trigram is used to specify the boundary of the middle POS, so in this step the POS tags are converted into trigrams. The training on the aligned corpus is based on the trigram model over the disambiguated POS token sequences of the Sanskrit and Russian sentences. The probability of the current POS token is calculated as
P(current POS | previous POS, next POS) = count(previous POS, current POS, next POS) / count(previous POS, next POS).
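Both counting estimates above can be computed directly from tagged data. The sketch below estimates bigram word probabilities and the quantity P(current POS | previous POS, next POS) from small invented sequences; the data are illustrative only and are not the SaRS corpus.

from collections import Counter

# Invented toy sequences for illustration.
words = ["the", "system", "is", "fast", "the", "system", "is", "reliable"]
pos_tags = ["PSPRN", "N", "NUM", "ADJ", "CONJ", "ADJ", "ADJ", "ADJ", "N", "V"]

# Bigram LM estimate: P(w_i | w_i-1) = count(w_i-1, w_i) / count(w_i-1).
hist_counts = Counter(words[:-1])
bigram_counts = Counter(zip(words[:-1], words[1:]))

def p_bigram(prev, cur):
    return bigram_counts[(prev, cur)] / hist_counts[prev] if hist_counts[prev] else 0.0

# POS boundary model: P(cur | prev, next) = count(prev, cur, next) / count(prev, next).
trigram_counts = Counter(zip(pos_tags, pos_tags[1:], pos_tags[2:]))
context_counts = Counter(zip(pos_tags, pos_tags[2:]))

def p_pos(prev, cur, nxt):
    ctx = context_counts[(prev, nxt)]
    return trigram_counts[(prev, cur, nxt)] / ctx if ctx else 0.0

print(p_bigram("system", "is"), p_pos("ADJ", "ADJ", "N"))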

Translation model
The word/phrase-based model will be developed as described by Koehn (Koehn et al., 2003) and as implemented in MOSES (Koehn et al., 2007). The input to a phrase-based translation system is an SL sentence with N words, x1, x2, ..., xN. A phrase table is used to define the set of possible phrases for the sentence: each phrase is a tuple p = (s, t, e), where (s, t) are indices representing a contiguous span in the SL sentence (with s ≤ t) and e is a target-language string consisting of a sequence of TL words. For example, the phrase p = (2, 5, the dog) specifies that the words x2 ... x5 have the translation “the dog” in the phrase table. Each phrase p has a score g(p) = g(s, t, e); this score is typically calculated as a log-linear combination of features (see, e.g., Koehn et al. (2003)).
A major difference between machine translation and sentence retrieval is that machine translation assumes there is little, if any, overlap between the vocabularies of the two languages, whereas sentence retrieval depends heavily on that overlap. With the Berger and Lafferty formulation in equation 1, the probability of a word translating to itself is estimated as a fraction of the probability of the word translating to all other words. Because the probabilities must sum to one, if there are any other translations for a given word, its self-translation probability will be less than 1.0.
SMT assumes that a sentence in the source language can be translated into several sentences in the target language. We denote a Sanskrit sentence by ‘s’ and a Russian sentence by ‘r’. Each translation generated by an SMT system has an assigned probability P(r|s). The goal of an SMT system is to produce the best target-language translation r* for a source-language sentence s:
r* = argmax_r P(r|s)
   = argmax_r P(s|r)P(r) / P(s)    (by Bayes' rule)
   = argmax_r P(s|r)P(r)           (since P(s) does not depend on r)
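As a concrete illustration of this argmax, the toy example below scores a handful of candidate Russian sentences with invented translation-model and language-model probabilities; in practice these scores come from the trained SaRS components.

# Toy noisy-channel selection: r* = argmax_r P(s|r) * P(r).
# All probability values are invented for illustration.
candidates = {
    "v vashih rukah nadezhnaya sistema": (0.40, 0.020),   # (P(s|r), P(r))
    "nadezhnaya sistema v vashih rukah": (0.40, 0.012),
    "v rukah sistema nadezhnaya":        (0.25, 0.005),
}

best_r = max(candidates, key=lambda r: candidates[r][0] * candidates[r][1])
print(best_r)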

Decoder
The beam-search decoding algorithm starts with an empty hypothesis. New hypotheses are then generated by applying all applicable translation options, and these hypotheses are used to generate further hypotheses in the same manner, until hypotheses are created that cover the full input sentence. The highest-scoring complete hypothesis indicates the best translation according to the model. In phrase-based models it is easy to identify the entries in the phrase table that may be used for a specific input sentence. Many decoding algorithms have been developed during the last decade; their basic mechanism is the same: use word translation probabilities to construct a target-language sentence maximizing the overall translation probability. The translation model allows us to compute the translation probability of a TL sentence given an SL sentence. However, when translating, only the SL sentence is available, and examining all possible valid TL sentences is not an option because there is a huge number of them. Therefore, at run time the SMT system needs to search the space of target-language sentences efficiently to come up with a good translation candidate. This search operation is referred to as “decoding”.
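A heavily simplified, monotone variant of such a beam-search decoder is sketched below. The toy phrase table, the strict left-to-right coverage and the additive log-scores are illustrative assumptions; a real decoder such as MOSES also handles reordering, fertility and language-model context.

# Minimal monotone beam-search sketch over a toy phrase table.
# Spans are 0-based and end-exclusive; scores are invented log-probabilities.
sl_sentence = ["bhavatah", "karayoh", "sancalanatantram"]
phrase_table = {
    (0, 1): [("in your", -1.0)],
    (0, 2): [("in your hands", -1.2)],
    (1, 2): [("hands", -0.9)],
    (2, 3): [("operating system", -0.7), ("system", -1.1)],
}

def beam_search(sl_len, beam_size=3):
    # Each hypothesis covers the prefix [0, pos) of the SL sentence:
    # (pos, translated words so far, accumulated score).
    beam = [(0, [], 0.0)]
    completed = []
    while beam:
        new_beam = []
        for pos, out, score in beam:
            if pos == sl_len:
                completed.append((out, score))
                continue
            for (s, t), options in phrase_table.items():
                if s == pos:  # extend the covered prefix with a matching phrase
                    for tl_phrase, phrase_score in options:
                        new_beam.append((t, out + [tl_phrase], score + phrase_score))
        # Prune: keep only the highest-scoring partial hypotheses.
        beam = sorted(new_beam, key=lambda h: h[2], reverse=True)[:beam_size]
    best_out, best_score = max(completed, key=lambda h: h[1])
    return " ".join(best_out), best_score

print(beam_search(len(sl_sentence)))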

The simplest heuristic function takes into account only the probability P(s|r):
h(n) = max_r P(s_n | r)
A node is open if more words can be added to the partial sentence; otherwise the node is closed. For each open node a new word is added to the hypothesis and a new node is created.

Conclusion
The paper presented a word/phrase- and sentence-based approach to SMT that utilizes a grammatical-component (POS) based training model integrated with parsing, syntax-level POS disambiguation, chunking and tri-gram modules. These components reduce the amount of corpus required for bi-lingual training and produce higher-quality output. The presented research needs a good parser and POS tagger for both languages; the SaRS works given that the POS tags are disambiguated and the text is parsed.

References:

• Alexander M. Rush and Michael Collins, 2011, Exact decoding of syntactic translation models through Lagrangian relaxation. In Proceedings of ACL.
• Adam Lopez, 2008, Statistical Machine Translation. ACM Computing Surveys 40(3), Article 8, pp. 1-49.
• Daisuke Okanohara and Jun’ichi Tsujii, 2007, A discriminative language model with pseudo-negative samples. In Proceedings of ACL, pp. 73-80.
• David Chiang, 2005, A Hierarchical Phrase-Based Model for Statistical Machine Translation. In Proceedings of the 43rd Annual Meeting of the ACL, pp. 263-270.
• E. Brill, 1992, A simple rule-based part of speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, pp. 152-155.
• http://www.statmt.org/
• http://en.wikipedia.org/wiki/Bayes'_theorem
• Philipp Koehn, 2005, Europarl: A parallel corpus for statistical machine translation. In Proceedings of MT Summit X, Phuket, Thailand.
• M. Popovic and H. Ney, 2006, POS-based word reorderings for statistical machine translation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC).
• Maja Popovic and Hermann Ney, 2006, Statistical Machine Translation with a Small Amount of Bilingual Training Data. In 5th LREC SALTMIL Workshop on Minority Languages, pp. 25-29.
• Och, F. J., Ueffing, N., and Ney, H., 2001, An efficient A* search algorithm for statistical machine translation. In Proceedings of the Workshop on Data-driven Methods in Machine Translation, pp. 1-8, Morristown, NJ, USA. Association for Computational Linguistics.
• Philipp Koehn, Franz Josef Och, and Daniel Marcu, 2003, Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the ACL on Human Language Technology, NAACL '03, pp. 48-54.
• Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst, 2007, Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pp. 177-180.
• William John Hutchins, 1986, Machine Translation: Past, Present, Future. Chichester.