Interactive Multi-system Machine Translation With Neural Language Models

Matīss RIKTERS 1
University of Latvia

Abstract. The tool described in this article has been designed to help machine translation (MT) researchers combine and evaluate various MT engine outputs through a web-based graphical user interface, using syntactic analysis and language modelling. The tool supports user-provided translations as well as translations from popular online MT system application programming interfaces (APIs). The selection of the best translation hypothesis is done by calculating the perplexity of each hypothesis. The evaluation panel provides sentence tree graphs and chunk statistics. The result is a syntax-based multi-system translation tool that shows an improvement in BLEU scores compared to the best individual baseline MT. We also present a demo server with data for combining English-Latvian translations.

Keywords. Machine translation, hybrid machine translation, syntactic parsing, chunking, natural language processing, language modelling, neural networks

1. Introduction

Multi-system machine translation (MSMT) is a subset of hybrid MT (HMT) in which multiple MT systems are combined into a single system in order to boost the accuracy and fluency of the translations. It is also referred to as multi-engine MT, MT coupling or simply MT system combination. Recent open-source MSMT approaches tend to use statistical language models (LMs) for scoring and comparing candidate translations. Since neural networks (NNs) have shown potential in modelling long-distance dependencies in data when compared to state-of-the-art statistical models, the aim of this research is to utilise this potential for combining translations.

This paper presents an enrichment of an existing MSMT tool, K-Translate [1], with neural language models. The experiments described use multiple combinations of outputs from two, three or four MT systems and are performed for English-Latvian. Translating from and to other languages is also supported; however, the underlying framework developed within this work has some limitations. The code of the developed system is freely available on GitHub2. A demo server3 with data for combining English-Latvian translations is also available.

The structure of this paper is as follows: Section 2 compares the current tool with previous work. Section 3 describes the back-end and the evaluation mechanism. Section 4 outlines the main functionality of the graphical interface and Section 5 provides

1 Corresponding Author: Matīss Rikters, 19 Raina Blvd., Riga, Latvia; E-mail: [email protected]
2 K-Translate on GitHub - https://github.com/M4t1ss/K-Translate
3 K-Translate demo - http://k-translate.lielakeda.lv/

information about how the system performs in the experiments. Finally, Section 6 summarises the work and outlines aims for further improvements.

2. Related Work

Ahsan and Kolachina [2] describe a way of combining SMT and RBMT systems in multiple setups, where each setup had input from the SMT system added in a different phase of the RBMT system. Barrault [3] describes an MT system combination method that combines multiple confusion networks of 1-best hypotheses from MT systems into one lattice and uses a language model to decode the lattice and generate the best hypothesis. Mellebeek et al. [4] introduced a hybrid MT system that utilised online MT engines for MSMT. Their system first attempts to split sentences into smaller parts for easier translation by means of syntactic analysis, then translates each part with each individual MT system while also providing some context, and finally recomposes the output from the best-scored translations of each part (they use three heuristics for selecting the best translation). Freitag et al. [5] use a combination of a confusion network and a neural network model: a feedforward neural network is trained to improve upon the traditional binary voting model of the confusion network, giving the confusion network the option to prefer different systems at different positions, even within the same sentence.

3. Back-end System Workflow

The main workflow consists of three constituents: 1) pre-processing of the source sentences, 2) the acquisition of translations via online APIs (except when translations are provided by the user) and 3) post-processing, i.e. the selection of the best chunk translations and the creation of the MT output. A visualisation of the workflow is presented in Figure 1.

For translation, four translation APIs are used. However, the system's architecture is flexible, allowing additional translation APIs to be integrated easily. The system is set up to translate from English, German or French into Latvian, German, English or French. Nevertheless, the source and target languages can also be changed to other language pairs that are supported by the APIs, the Berkeley Parser [6] grammars and the LM toolkits. Each new source language requires a grammar that is compliant with the Berkeley Parser; the parser is able to learn new grammars from treebanks. Each new target language requires a language model that is compliant with either KenLM [7] or one of the NN LM tools. New LMs can also be trained using monolingual plain-text files as input.

[Figure 1 shows the workflow: sentence tokenization, syntactic parsing and sentence chunking, followed by translation with the APIs (Google Translate, Bing Translator, Hugo, Yandex Translate), selection of the best translated chunk, sentence recomposition, and output.]

Figure 1. General workflow of the translation process. [1]
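In outline, the combination loop of Figure 1 can be sketched as follows. This is an illustrative Python sketch, not K-Translate's actual code (the tool itself is PHP-based); every helper function is a hypothetical stand-in for the real component named in its comment.

```python
# Illustrative sketch of the K-Translate workflow; all helpers below are
# hypothetical stand-ins, not the tool's real interfaces.

def tokenize(sentence):
    # stands in for the NlpTools whitespace/punctuation tokenizer
    return sentence.split()

def extract_chunks(tokens):
    # stands in for Berkeley Parser + chunker; here: fixed 3-token pieces
    return [" ".join(tokens[i:i + 3]) for i in range(0, len(tokens), 3)]

def combine(sentence, apis, perplexity):
    """Translate each chunk with every API, keep the lowest-perplexity one."""
    best = []
    for chunk in extract_chunks(tokenize(sentence)):
        hypotheses = [api(chunk) for api in apis]
        best.append(min(hypotheses, key=perplexity))
    return " ".join(best)  # recomposition in the original chunk order
```

The real system replaces the stubs with the parser-based chunker of Section 3.1 and LM perplexity queries from Section 3.3; the overall control flow is the same.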

3.1. Pre-processing

The first step is to tokenize the input. The tokenizer uses the whitespace and punctuation tokenizer from NlpTools4. Tokenization is essential for the proper functioning of all subsequent steps: without it, the syntactic parser can misclassify a word or a phrase and the translation APIs can produce an incorrect translation. For example, the parser will not correctly interpret a word that has a dot, comma or colon attached as its final symbol.

After tokenization it is necessary to divide sentences into linguistically motivated chunks that will then be translated. For this task the Berkeley Parser is used in conjunction with a chunk extractor (chunker). The parse tree of each sentence is processed by the chunker to obtain the parts of the sentence that will be individually translated and passed to the translation step.

3.1.1. Sentence Chunking

The need to split long sentences into smaller chunks is motivated by the hypothesis that modern MT systems produce better results when given source sentences of reasonable length. To verify that linguistically motivated chunks are superior to simply cutting sentences into pieces of a specific length, we performed experiments splitting sentences into 5-grams in one experiment, random 1-grams to 4-grams in the second, random 1-grams to 6-grams in the third, and random 6-grams to n-grams in the last. For best-translation selection the same baseline LM was used, and the results of these experiments confirmed the advantage of linguistically motivated chunks.

The chunker reads the output of the Berkeley Parser and places it in a tree data structure. During this process each node of the tree is initialised with its phrase label (NP, VP, ADVP, etc.), its word (if it has one) and a chunk consisting of the chunks of its child nodes. To obtain the final chunks for translation, the resulting tree is traversed bottom-up in post-order and only the top-level subtrees are used as the resulting chunks.
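The punctuation-splitting behaviour of the tokenization step can be approximated in a few lines of Python (a sketch of the idea, not the actual NlpTools PHP implementation):

```python
import re

# Sketch of a whitespace-and-punctuation tokenizer in the spirit of the
# NlpTools tokenizer used by K-Translate (not its actual implementation).
def tokenize(text):
    # \w+ keeps word characters together; any other non-space character
    # (dot, comma, colon, ...) becomes a token of its own, so the parser
    # never sees a word with punctuation glued to its end
    return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)
```

For example, the string "soup." is split into the two tokens "soup" and ".".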
The chunking consists of the steps shown in Figure 2. As an example, consider the English sentence “Characteristic specialities of Latvian cuisine are bacon pies and a refreshing, cold sour cream soup.” Figure 3 shows the visualised parse tree of this sentence with two chunks highlighted in darker and lighter outlines.

4 Natural language processing tools - http://php-nlp-tools.com/documentation/tokenizers.html

[Figure 2 shows the chunking flowchart: the parse tree is read into a tree data structure and an inner chunk is set for each node; the tree is then traversed in post-order, and whenever the parent of the current node has no parent of its own (i.e. the node is a top-level subtree), the node's chunk is added to the list of chunks and marked with a random colour in the tree. The outputs are the list of chunks and the tree data structure with chunks marked in different colours.]

Figure 2. Chunking process flowchart. [1]

Figure 3. Visualised tree with marked chunks. [1]
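The top-level-subtree extraction described in Section 3.1.1 can be sketched as follows; `Node` and `extract_chunks` are hypothetical reimplementations for illustration, not K-Translate's actual chunker.

```python
# Sketch of the chunk extraction: each node carries its phrase label and a
# chunk built from its children; only the top-level subtrees (the children
# of the root) yield final chunks. Hypothetical code for illustration.

class Node:
    def __init__(self, phrase, word=None, children=None):
        self.phrase = phrase            # e.g. NP, VP, ADVP
        self.word = word                # set only on leaf nodes
        self.children = children or []

    def chunk(self):
        # a leaf's chunk is its word; otherwise its children's chunks joined
        if self.word is not None:
            return self.word
        return " ".join(child.chunk() for child in self.children)

def extract_chunks(root):
    # only the top-level subtrees become translation chunks
    return [child.chunk() for child in root.children]
```

For a tree ROOT(NP(DT the, NN cat), VP(VBZ sleeps)) this yields the two chunks "the cat" and "sleeps", mirroring the highlighted subtrees of Figure 3.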

3.2. Translation with Online APIs

Currently four online translation APIs are included in the project – Google Translate5, Bing Translator6, Yandex Translate7 and Hugo8. These specific APIs are used because of their public availability, their documentation and the range of languages that they support. Each translation API is wrapped in a function that takes the source and target language identifiers and the source chunk as input parameters and returns the target chunk as its only output. This makes adding new APIs straightforward.

5 Google Translate API - https://cloud.google.com/translate/
6 Bing Translator Control - http://www.bing.com/dev/en-us/translator
7 Yandex Translate API - https://tech.yandex.com/translate/
8 Latvian public administration machine translation service API - http://hugo.lv/TranslationAPI

3.3. Selection of the Best Translated Chunk

The selection of the best-translated chunk is done by calculating the perplexity of each hypothesis translation with a language model. The baseline system used the statistical KenLM for this task; the updated variant additionally allows choosing one of three NN language models – RWTHLM, MemN2N or Char-RNN. For reliable results, a large target-language corpus is necessary. For each machine-translated chunk, the perplexity score reflects the probability of that specific sequence of words appearing in the training corpus used to create the language model (LM). Sentence perplexity has been shown to correlate with human judgements and BLEU scores, and it is a good evaluation method for MT without reference translations [8]. It has also been used in previous MSMT attempts to score output from different MT engines, as mentioned by [9] and [10].

3.3.1. Perplexity

Perplexity is used to measure the likelihood of a sentence belonging to a certain language, and its calculation is one of the core functions provided by all LM tools. First, word probabilities based on the observed entry with the longest matching history $w_f^{n-1}$ are calculated using the formula:

$$p(w_n \mid w_1^{n-1}) = p(w_n \mid w_f^{n-1}) \prod_{i=1}^{f-1} b(w_i^{n-1})$$

where the probability $p(w_n \mid w_f^{n-1})$ and the backoff penalties $b(w_i^{n-1})$ are given by an already-estimated language model. Perplexity is then calculated from these probabilities:

$$PP = b^{-\frac{1}{N} \sum_{i=1}^{N} \log_b q(x_i)}$$

where, given an unknown probability distribution $p$ and a proposed probability model $q$, the model is evaluated by determining how well it predicts a separate test sample $x_1, x_2, \ldots, x_N$ drawn from $p$.

3.3.2. KenLM

KenLM is an open-source tool for fast and scalable estimation, filtering, and querying of language models. It is one of the most popular LM tools and is integrated into many phrase-based MT systems such as Moses [11], cdec [12], and Joshua [13]. It does the job quite efficiently and was included as the only LM option in the initial release of K-Translate. For training, a fairly large order of 12 was chosen for maximum quality.

3.3.3. Neural Network Language Models

In the future-work section of the first publication on K-Translate, adding neural network components was mentioned as one of the key directions for upcoming improvements. Since there is a vast variety of NN frameworks, three variants with different characteristics were chosen for testing. The first is highly optimised for use on CPUs, the second can run either on a CPU or on a CUDA9-capable GPU, and the last additionally supports OpenCL10.

3.3.3.1. RWTHLM

RWTHLM is a toolkit for training many different types of neural network language models [14]. It supports feedforward, recurrent and long short-term memory NNs. While training different NN configurations, the best results were achieved with a model consisting of one feedforward input layer with a 3-word history, followed by one linear layer of 200 neurons with a sigmoid activation function.

3.3.3.2. MemN2N

MemN2N11 trains an end-to-end memory network [15] model for language modelling. It is a neural network with a recurrent attention model over a possibly large external memory, with the architecture of a memory network. Because it is trained end-to-end, the approach requires significantly less supervision during training. MemN2N requires the Torch12 scientific computing framework to be installed. For training, the default configuration was used, with an internal state dimension of 150, a linear part of the state of 75, and the number of hops set to 6.

3.3.3.3. Char-RNN

Char-RNN13 is a multi-layer recurrent neural network for training character-level language models. It supports plain recurrent NNs, long short-term memory (LSTM) and gated recurrent units. To run Char-RNN on a CPU, a minimal installation of Torch is required; running on a GPU requires some additional Torch packages.
The best-scoring model was trained using 2 LSTM layers with 1024 neurons each and the dropout parameter set to 0.5.

3.3.3.4. Training on GPUs

For both faster language model training and quicker perplexity querying, the latter two NN LM frameworks allow the use of graphics cards. Using GPUs makes training several times faster compared to training on CPUs. Both MemN2N and Char-RNN support NVIDIA's CUDA parallel computing platform. In addition, Char-RNN also supports OpenCL, allowing the use of AMD graphics cards.
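Putting Section 3.3 together: the hypothesis with the lowest perplexity wins. The sketch below uses a toy add-one-smoothed unigram LM purely for illustration; the real system queries KenLM or one of the NN LMs, but the selection step itself is the same argmin.

```python
import math
from collections import Counter

# Toy unigram LM, only to illustrate the perplexity-based selection of
# Section 3.3; the real system queries KenLM or a neural LM instead.
def make_unigram_lm(corpus_tokens):
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # add-one smoothing: unseen words get p > 0
    def perplexity(hypothesis):
        tokens = hypothesis.split()
        log_prob = sum(math.log((counts[t] + 1) / (total + vocab))
                       for t in tokens)
        return math.exp(-log_prob / len(tokens))
    return perplexity

def select_best(hypotheses, perplexity):
    # lower perplexity = the LM considers the chunk more fluent
    return min(hypotheses, key=perplexity)
```

With an LM trained on "the cat sat on the mat", the fluent hypothesis "the cat" receives a lower perplexity than the gibberish "zq xw" and is therefore selected.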

9 Parallel Programming and Computing Platform - http://www.nvidia.com/object/cuda_home_new.html
10 Open standard for parallel programming of heterogeneous systems - https://www.khronos.org/opencl/
11 Memory Networks - https://github.com/facebook/MemNN/
12 A scientific computing framework for LuaJIT - http://torch.ch/
13 Multi-layer RNNs for character-level language models in Torch - https://github.com/karpathy/char-rnn

3.4. Sentence Recomposition

When the best translation for each chunk has been selected, the translation of the full sentence is generated by concatenating the chunks in the same order in which they were split.

4. Translation Combination Panel

This section briefly describes the graphical front-end of K-Translate. Figure 4 shows a schematic overview of the available options. Each of the two ways of combining translations consists of all or most of the steps covered in the previous section; when users input their own translations, the translation with online APIs is skipped.

[Figure 4 shows the panel structure: from the start page the user can translate with online systems (input a source sentence), input translations to combine (input a source sentence, then the translated chunks), or open the settings; both translation paths lead to the translation results.]

Figure 4. Architectural visualization of the translation combination panel. [1]

4.1. Translating with Online Systems

The start-up screen of the translation combination panel allows the user to automatically obtain translations from several online MT systems with available APIs, combine them and output the best-fitting hybrid translation. The results look the same as when combining user-provided translations (Figure 5), except that the name of the online system used is shown as the source instead of MT1, MT2, etc.

4.2. Combining Multiple User-provided Translations

The second option of the translation combination panel is intended for more experienced MT professionals who already have several (two or more) translations of the input sentence from different MT systems and just want to obtain the combined result. First, the user must select the source and target languages and input a sentence in the source language. K-Translate then performs syntactic analysis on the input sentence and splits it into smaller fragments, or chunks14. The syntax tree with highlighted color-coded chunks is also shown, so that the user can better understand where and why the chunks have their boundaries (Figure 3). The chunks are presented in a text box, one per line, for the user to translate with the chosen MT systems. The obtained translations must then be pasted into the MT 1, MT 2, etc. text boxes. In the last step (Figure 5), K-Translate provides the best-fitting combined translation and highlights which chunks were taken from which input. It also shows the source used for each chunk and the confidence level of each selection; the confidence is calculated by comparing the chunk perplexities to each other.

Figure 5. Translation combination results page. [1]

4.3. Settings

The settings require at least a Berkeley Parser compatible grammar for each source language and, for each target language, a language model compatible with at least one of the LM tools. For each of the online APIs, the corresponding API settings are mandatory. The necessity of these requirements is explained in Section 3. A new addition to the settings is the option to designate which of the four LMs to use system-wide, along with the necessary model files for each framework. As opposed to the other three, RWTHLM requires one additional file to be specified: the vocabulary file that was used for training the model.
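A hypothetical configuration illustrating these requirements; every path, key name and value below is invented for illustration (the actual settings are entered through the GUI, not a file like this):

```python
# Hypothetical K-Translate-style configuration; all names and paths are
# invented placeholders, not the tool's real settings format.
settings = {
    "source_grammar": {          # Berkeley Parser grammar per source language
        "en": "grammars/eng_sm6.gr",
    },
    "language_model": {          # one LM designated for system-wide use
        "tool": "kenlm",         # or "rwthlm", "memn2n", "char-rnn"
        "model": "models/lv.binary",
        # RWTHLM additionally needs the vocabulary used at training time:
        # "vocabulary": "models/lv.vocab",
    },
    "apis": {                    # credentials for each online MT API
        "google": {"key": "..."},
        "bing": {"key": "..."},
        "yandex": {"key": "..."},
        "hugo": {"key": "..."},
    },
}
```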

14 The process behind chunking is clarified in Section 3.1.

5. Experiments

This section describes the experiments performed to test the workflow of K-Translate. First, details on the input data and the experiment methodology are provided. Then the results of both the language modelling and the translation experiments are summarised and interpreted. As a baseline, the similar hybrid MT system Multi-System Hybrid Translator [16] was chosen.

5.1. Experiment Setup

Experiments for LM training and perplexity evaluation were run on three desktop workstations with different configurations. The KenLM and RWTHLM models were trained on an 8-core CPU with 16 GB of RAM; MemN2N was trained on a GeForce Titan X GPU (12 GB memory, 3072 CUDA cores) with a 12-core CPU and 64 GB RAM; and the Char-RNN model was trained on a Radeon HD 7950 GPU (3 GB memory, 1792 cores) with an 8-core CPU and 16 GB RAM.

To train the LMs, the Latvian monolingual part of the DGT-TM [17] was used. It consists of 3.1 million sentences. For Char-RNN only the first half of this corpus (1.5 million sentences) was used, both to speed up training and because the character-level model requires much less training data than the others. When training all NN LMs, evaluation and validation datasets were automatically derived from the training data, with 97% used for training, 1.5% for validation and 1.5% for testing. The final evaluation data consisted of 1134 sentences randomly selected from a different corpus, the JRC-Acquis corpus version 3.0 [18].

The translation experiments were conducted on the English-Latvian part of the JRC-Acquis corpus, from which both the test data and the data for training the language model were retrieved. The test data contained 1581 randomly selected sentences. A 12-gram language model was trained using KenLM. The method was applied to all possible combinations of two APIs and then also to the combination of all of them. As a result, seven different translations for each source sentence were obtained. The Google Translate, Hugo, Yandex and Bing Translator APIs were used with their default configurations. The output of each system was evaluated with two scoring methods: BLEU [19] and NIST [20]. The resulting translations were then inspected more closely to determine which system from the hybrid setups was selected for each chunk and to analyse differences in the resulting translations.

5.2. Language Model Experiments

To justify using different language modelling approaches, several language models were trained with the same (or, in one case, half of the) training data. Table 1 shows differences in perplexity that outline the superiority of the NN LMs. It also shows that the statistical model is much faster to train on a CPU and that NN LMs train more efficiently on GPUs.

Table 1. Language model perplexity experiment results

System     Perplexity   Training corpus size   Trained on   Training time
KenLM      34.67        3.1M                   CPU          1 hour
RWTHLM     136.47       3.1M                   CPU          7 days
MemN2N     25.77        3.1M                   GPU          4 days
Char-RNN   24.46        1.5M                   GPU          2 days

5.3. Translation Experiments

The results of the automatic evaluation are summarised in Table 2. Surprisingly, all hybrid systems that include the Hugo API score lower than the baseline Hugo system. However, the combination of Google Translate and Bing Translator shows improvements in BLEU and NIST scores over each of its baseline systems. The results also clearly show an improvement over the baseline hybrid system, which does not have a syntactic pre-processing step. Also, contrary to the baseline, the new system tends to use more chunks from Hugo, which, according to the BLEU and NIST scores, is the better selection.

Table 2. Baseline experiment results. B – Bing, G – Google, H – Hugo, L – LetsMT, Y – Yandex.

                                      Hybrid selection
System        BLEU    NIST    G      B      H/L    Y
Google        16.19   8.37    100%   -      -      -
Bing          16.99   8.09    -      100%   -      -
Hugo          20.27   9.45    -      -      100%   -
LetsMT        20.55   9.48    -      -      100%   -
Yandex        19.75   9.30    -      -      -      100%
K-Translate
  BG          17.34   8.54    74%    26%    -      -
  GH          18.63   9.09    25%    -      74%    -
  BH          18.98   8.97    -      24%    76%    -
  BGHY        18.33   8.67    17%    18%    35%    30%

The table also shows the percentage of translations taken from each API in the hybrid systems. Although, according to the scores, the Hugo system was a little better than the other systems, it seems that the language model was eager to favour its translations. Table 3 shows the differences in BLEU scores when NN LMs were used. A correlation between LM perplexity and the resulting BLEU score is clearly visible, as is a slight improvement in the overall result.

Table 3. Impact of different LMs on BLEU scores

System     LM Perplexity   BLEU
KenLM      34.67           19.23
RWTHLM     136.47          18.78
MemN2N     25.77           18.81
Char-RNN   24.46           19.53

6. Conclusions and Future Work

This paper described ways to improve the interactive MT system combination approach with neural network language models. The main goals were to give MT researchers more options in an easy-to-use translation combination tool and to further improve translation quality over the baseline. Test cases showed an improvement of 0.35 BLEU points when combining only Google and Bing. In all hybrid systems that included the Hugo API, a decrease in overall translation quality was observed. This can be explained by the scale of the engines: the Bing and Google systems are more general and designed for many language pairs, whereas the MT system behind Hugo was specifically optimised for English-Latvian translation.

Further improvements include the visual side of the system, which may benefit from additional graphs and charts showing how and why the system chose each individual chunk for the final translation. The back-end could benefit from improvements to the chunking step and to the selection of the best translated chunk. For now, the chunker splits sentences with no regard for sub-chunks or for cases where a chunk is only one word or symbol. Larger chunks should be split into smaller sub-chunks, and single-word chunks should be merged with a neighbouring longer chunk. It may also be more appropriate to divide only certain phrase types, e.g. noun phrases and verb phrases but not prepositional phrases, infinitive phrases, etc.

Adding alternative resources to select from in each step of the translation process could benefit the more advanced user base. For instance, adding more online translation APIs, such as Baidu Translate [21], would expand the variety of translation choices, and configurable usage of different syntactic parsers, such as SyntaxNet [22], is likely to improve the translation process.

References

[1] Rikters, M.: K-Translate - Interactive Multi-system Machine Translation. International Baltic Conference on Databases and Information Systems. Springer International Publishing (2016)
[2] Ahsan, A., Kolachina, P.: Coupling Statistical Machine Translation with Rule-based Transfer and Generation. AMTA - The Ninth Conference of the Association for Machine Translation in the Americas, Denver, Colorado (2010)
[3] Barrault, L.: MANY: Open source machine translation system combination. The Prague Bulletin of Mathematical Linguistics 93: 147-155 (2010)
[4] Mellebeek, B., Owczarzak, K., Van Genabith, J., Way, A.: Multi-engine machine translation by recursive sentence decomposition. Proceedings of the 7th Conference of the Association for Machine Translation in the Americas, 110-118 (2006)
[5] Freitag, M., Peter, J., Peitz, S., Feng, M., Ney, H.: Local System Voting Feature for Machine Translation System Combination. EMNLP 2015 Tenth Workshop on Statistical Machine Translation (WMT 2015), 467-476, Lisbon, Portugal (2015)
[6] Petrov, S., Barrett, L., Thibaux, R., Klein, D.: Learning accurate, compact, and interpretable tree annotation. Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics (2006)
[7] Heafield, K.: KenLM: Faster and smaller language model queries. Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics (2011)
[8] Gamon, M., Aue, A., Smets, M.: Sentence-level MT evaluation without reference translations: Beyond language modeling. Proceedings of EAMT (2005)
[9] Callison-Burch, C., Flournoy, R. S.: A program for automatically selecting the best output from multiple machine translation engines. Proceedings of the Machine Translation Summit VIII (2001)
[10] Akiba, Y., Watanabe, T., Sumita, E.: Using language and translation models to select the best among outputs from multiple MT systems. Proceedings of the 19th International Conference on Computational Linguistics - Volume 1. Association for Computational Linguistics (2002)
[11] Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C.: Moses: Open source toolkit for statistical machine translation. Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics (2007)
[12] Dyer, C., Weese, J., Setiawan, H., Lopez, A., Ture, F., Eidelman, V., Ganitkevitch, J., Blunsom, P., Resnik, P.: cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. Proceedings of the ACL 2010 System Demonstrations. Association for Computational Linguistics (2010)
[13] Li, Z., Callison-Burch, C., Dyer, C., Ganitkevitch, J., Khudanpur, S., Schwartz, L., Thornton, W. N., Weese, J., Zaidan, O. F.: Joshua: An open source toolkit for parsing-based machine translation. Proceedings of the Fourth Workshop on Statistical Machine Translation. Association for Computational Linguistics (2009)
[14] Sundermeyer, M., Schlüter, R., Ney, H.: RWTHLM - the RWTH Aachen University neural network language modeling toolkit. INTERSPEECH (2014)
[15] Sukhbaatar, S., Weston, J., Fergus, R.: End-to-end memory networks. Advances in Neural Information Processing Systems (2015)
[16] Rikters, M.: Multi-system machine translation using online APIs for English-Latvian. ACL-IJCNLP 2015: 6 (2015)
[17] Steinberger, R., Eisele, A., Klocek, S., Pilos, S., Schlüter, P.: DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226 (2013)
[18] Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., Varga, D.: The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. arXiv preprint cs/0609058 (2006)
[19] Papineni, K., Roukos, S., Ward, T., Zhu, W. J.: BLEU: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics (2002)
[20] Doddington, G.: Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of the Second International Conference on Human Language Technology Research. Morgan Kaufmann Publishers Inc. (2002)
[21] He, Z.: Baidu Translate: Research and Products. ACL-IJCNLP 2015: 61 (2015)
[22] Andor, D., Alberti, C., Weiss, D., Severyn, A., Presta, A., Ganchev, K., Collins, M.: Globally normalized transition-based neural networks. arXiv preprint arXiv:1603.06042 (2016)