Survey on neural machine translation into Polish

Krzysztof Wolk, Krzysztof Marasek

Polish-Japanese Academy of Information Technology, Warsaw, Poland

Abstract. In this article we survey the most modern approaches to machine translation. More precisely, we apply state-of-the-art statistical machine translation and neural machine translation, using recurrent and convolutional neural networks, to a Polish data set. We survey current toolkits that can be used for this purpose, namely TensorFlow, ModernMT, OpenNMT, MarianNMT and FairSeq, by running experiments on Polish-to-English and English-to-Polish translation tasks. We perform a hyperparameter search for the Polish language and use sub-word units such as syllables and stems in our experiments. We also augment our data with POS tags and Polish grammatical groups. The results are compared to SMT as well as to the Google Translate engine. In both cases we succeed in reaching a higher BLEU score.

Keywords: NMT, CNN in translation, RNN in translation, machine translation into Polish.

1 Introduction

Machine translation (MT) started ca. 50 years ago with rule-based systems. In the 1990s statistical MT (SMT) systems were invented. They create statistical models by analyzing aligned source-target language data (the training set) and use them to generate translations. This approach has been extended to phrases [1] and additional linguistic features (such as part-of-speech tags), and it has dominated the field over the last decades. During the training phase, SMT creates a translation model and a language model. The former stores the different translations of the phrases, while the latter stores the probability of a sequence of phrases on the target side. During the translation phase, the decoder chooses the best translation based on the combination of these two models. This requires huge training sets (for proper estimation of the statistical models), which limits translation quality, especially for grammatically rich languages.

A recent achievement, neural machine translation (NMT), uses vector-space word representations and deep learning techniques to learn the best weights for a neural network that transforms segments from the source into the target language. This is achieved using different network architectures: recurrent networks [2], networks with an attention mechanism [3] or convolutional networks [4]. After the initial enthusiasm raised by better NMT results on shared tasks [5], it has been observed that NMT does not guarantee better MT performance. Koehn & Knowles [6] found that for the English-Spanish and German-English pairs NMT systems (compared to SMT) have worse out-of-domain performance, worse performance in low-resource conditions, worse translation of long sequences, sometimes odd word alignments produced by the attention mechanism and some problems with large-beam decoding, but better translation of infrequent words (perhaps because of the use of sub-word units).

This study surveys SMT and NMT toolkits for Polish-English translation. The general quality of MT systems depends heavily on the language pair, the amount and quality of training data and the domain match. Translation to or from a low-resourced language with different syntax and morphology is particularly challenging. Polish, as a Slavic language, has a fairly free word order and is highly inflected. The inflectional morphology is very rich for all word classes: seven distinct cases affect not only common nouns but also proper nouns, pronouns, adjectives and numerals. The orthography is complex as well. In the case of Polish-English translation this creates tasks that are hard for statistical systems to solve: unbalanced dictionaries (the Polish one is usually 4-5 times bigger than the English one), estimation of segment sequence probabilities (free word order), frequent use of foreign words with Polish inflection, and limited sizes of parallel corpora.
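
The way an SMT decoder combines the translation model and the language model, as described earlier in this section, is usually written in the noisy-channel form. This is the textbook formulation rather than anything specific to the toolkits used in this study; the symbols $f$ (source sentence) and $e$ (candidate target sentence) are introduced here only for illustration.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \underbrace{P(f \mid e)}_{\text{translation model}} \;
                           \underbrace{P(e)}_{\text{language model}}
```

In phrase-based systems such as Moses these two factors are typically combined with further feature functions in a log-linear model.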

2 Toolkits used in the research

The baseline system was tested using the Moses open-source SMT toolkit [7] with its Experiment Management System (EMS) [8]. The SRI Language Modeling Toolkit (SRILM) [9] with an interpolated version of Kneser-Ney discounting (-interpolate -unk -kndiscount) was used for 5-gram language model training. We used the MGIZA++ [10] tool for word and phrase alignment. KenLM [11] was used to binarize the language model, and lexical reordering was set to use the msd-bidirectional-fe model.

As a second SMT toolkit we used the state-of-the-art ModernMT (MMT) system [12]. It was created in cooperation between Translated, FBK, UEDIN and TAUS. ModernMT also has a secondary neural translation engine. From the code on GitHub we know that it is based on PyTorch [13] and OpenNMT [14], and that it uses BPE for sub-word unit generation [15] by default. More detailed information is not stated on the project manual pages; the project probably includes many more default optimizations.

The third toolkit we used is based on Google's TensorFlow, an open-source software library for high-performance numerical computation. One of the modules within TensorFlow is seq2seq [2, 16]. Sequence-to-sequence (seq2seq) models have enjoyed great success in a variety of tasks such as machine translation, speech recognition and text summarization. This work uses seq2seq for neural machine translation (NMT), which was the very first testbed for seq2seq models and a highly successful one.

The next system we experimented with was MarianNMT [17] (formerly known as AmuNMT), an efficient neural machine translation framework written in pure C++ with minimal dependencies. It has mainly been developed at the Adam Mickiewicz University in Poznań (AMU) and at the University of Edinburgh. Its advantages are: up to 15x faster translation than Nematus and similar toolkits on a single GPU; up to 2x faster training than toolkits based on Theano, TensorFlow or Torch on a single GPU; multi-GPU training and translation; support for different types of models, including deep RNNs, transformers and language models; binary/model compatibility with Nematus [18] models for certain model types; and adjustments for the Polish language.

Although RNNs have historically outperformed CNNs [19] at language translation tasks, their design has an inherent limitation, which can be understood by looking at how they process information. Computers translate text by reading a sentence in one language and predicting a sequence of words in another language with the same meaning. RNNs [16] operate in a strict left-to-right or right-to-left order, one word at a time. This is a less natural fit for the highly parallel GPU hardware that powers modern machine learning. In comparison, CNNs can compute all elements simultaneously, taking full advantage of GPU parallelism, and are therefore computationally more efficient. Another advantage of CNNs is that information is processed hierarchically, which makes it easier to capture complex relationships in the data.
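
The difference in data dependencies between the two architectures can be illustrated with a toy example. The sketch below is not taken from any of the toolkits; it uses a scalar recurrence and a one-dimensional convolution with made-up weights (`w`, `u`, `kernel`) purely to show that each RNN state needs the previous state, whereas each convolution output depends only on a fixed window of the input and could therefore be computed in parallel.

```python
import math


def rnn_states(xs, w=0.5, u=0.9):
    """Toy scalar RNN: state t cannot be computed before state t-1."""
    h = 0.0
    states = []
    for x in xs:  # strictly sequential loop
        h = math.tanh(w * x + u * h)
        states.append(h)
    return states


def conv_at(xs, i, kernel=(0.25, 0.5, 0.25)):
    """Toy 1-D convolution output at position i (zero padding at the edges)."""
    half = len(kernel) // 2
    total = 0.0
    for j, k in enumerate(kernel):
        pos = i + j - half
        if 0 <= pos < len(xs):
            total += k * xs[pos]
    return total


def conv_outputs(xs, kernel=(0.25, 0.5, 0.25)):
    """Every position is independent of the others, so this map is parallelizable."""
    return [conv_at(xs, i, kernel) for i in range(len(xs))]


if __name__ == "__main__":
    sequence = [1.0, 2.0, 3.0, 4.0]
    print(rnn_states(sequence))
    print(conv_outputs(sequence))
```

Stacking several such convolution layers widens the window that each output can see, which is one way to read the hierarchical processing mentioned above.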

3 Data preparation

The experiments described in this article were conducted using the official test and training sets from the IWSLT'13 [20, 21] conference as well as the WMT'17 conference. From IWSLT we took the TED Lectures corpora and from WMT we used the Europarl v7 corpus [22]. Whereas the TED Lectures were ready to use, Europarl had to be pre-processed. More precisely, it was necessary to deduplicate the data set and to ensure that none of the development and test data was present in the training data. The specification of both corpora is presented in Table 1.

Table 1. Corpora specification

                       TED      Europarl
Number of sentences    134,678  619,858
Unique Polish tokens   92,135   164,140
Unique English tokens  58,393   50,474
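
The Europarl pre-processing described above (deduplication and removal of development/test sentences from the training data) can be sketched as follows. This is a minimal illustration and not the script used in the study: the function name and the normalization (lower-casing and collapsing whitespace before comparison) are assumptions.

```python
def clean_parallel_corpus(train_pairs, heldout_pairs):
    """Drop duplicate sentence pairs and pairs that also occur in dev/test data.

    Both arguments are lists of (source, target) string tuples.
    """
    def key(pair):
        # Assumed normalization: ignore case and repeated whitespace.
        src, tgt = pair
        return (" ".join(src.split()).lower(), " ".join(tgt.split()).lower())

    blocked = {key(p) for p in heldout_pairs}
    seen = set()
    cleaned = []
    for pair in train_pairs:
        k = key(pair)
        if k in blocked or k in seen:
            continue  # leaked into dev/test, or already kept once
        seen.add(k)
        cleaned.append(pair)
    return cleaned


if __name__ == "__main__":
    train = [
        ("Dzień dobry", "Good morning"),
        ("Dzień  dobry", "good morning"),  # duplicate after normalization
        ("Tak", "Yes"),
        ("Nie", "No"),                     # also present in the held-out set
    ]
    heldout = [("Nie", "No")]
    print(clean_parallel_corpus(train, heldout))
```

Keeping the first occurrence of each pair preserves the original corpus order, which matters if document boundaries are used later.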

For sub-word division and data augmentation we used our own tool. It was implemented as part of a study of the Polish language and is able to segment Polish texts into prefixes, cores and suffixes, as well as into syllables. Its additional advantage is the ability to divide the text into grammatical groups and to tag it with POS tags. This type of tool is of considerable significance not only for scientists but also for business. In the rapidly evolving field of machine translation based on neural networks, morphological segmentation is necessary to reduce the size of dictionaries consisting of full word forms (the so-called open dictionary), which are used, among others, in the training of the translation system and in language modelling. Sample stemmed Polish sentence:

dział++ --@@pos_noun++ --@@b_38++ --@@sb_1++ --ania pod++ --@@pos_past_participle++ --@@b_46++ --@@sb_2++ --jęte w++ --@@pos_other_x++ --@@b_0++ --@@sb_0 wy++ --ni++ --@@pos_noun++ --@@infl_M3++ --@@b_0++ --@@sb_4++ --ku
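
To make the segmented format easier to read, the sketch below rebuilds plain words from such a line. It is not part of the authors' tool. It rests on assumptions inferred from the sample only: a trailing `++` means that the next token continues the same word, a leading `--` marks a continuation piece, and pieces starting with `@@` (or `%%` on the English side) are annotation codes rather than text.

```python
def destem(line, markers=("@@", "%%")):
    """Rebuild surface words from a segmented line, dropping annotation codes."""
    words = []
    current = ""
    for token in line.split():
        continues = token.endswith("++")
        piece = token[:-2] if continues else token
        if piece.startswith("--"):
            piece = piece[2:]           # strip the continuation prefix
        if not piece.startswith(markers):
            current += piece            # real text, not a tag such as @@pos_noun
        if not continues:
            words.append(current)       # word is complete
            current = ""
    if current:
        words.append(current)
    return " ".join(words)


if __name__ == "__main__":
    sample = (
        "dział++ --@@pos_noun++ --@@b_38++ --@@sb_1++ --ania "
        "pod++ --@@pos_past_participle++ --@@b_46++ --@@sb_2++ --jęte "
        "w++ --@@pos_other_x++ --@@b_0++ --@@sb_0 "
        "wy++ --ni++ --@@pos_noun++ --@@infl_M3++ --@@b_0++ --@@sb_4++ --ku"
    )
    print(destem(sample))
```

A reverse step of this kind is needed whenever stemmed system output has to be scored against unsegmented reference translations.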


For tagging the English side we used the spaCy POS tagger, for which we wrote a Python script that unifies the output format of both tools. The tagging speed was about 70 sentences per second. A sample result:

we --%%pos_pronoun++ can --%%pos_verb++ not --%%pos_adverb++ run --%%pos_verb++ the --%%pos_determiner++ risk --%%pos_noun++ of --%%pos_adposition++ creating --%%pos_verb++ a --%%pos_determiner++ regulatory

Whenever "stemmed" is used in the experiments section, it means that the data was processed into this format. In addition, we applied byte pair encoding (BPE) [15], a compression algorithm, to the task of word segmentation. BPE allows for the representation of an open vocabulary through a fixed-size vocabulary of variable-length character sequences, making it a very suitable word segmentation strategy for neural network models. We tried this method on its own as well as in conjunction with our "stemmer". After tokenizing and applying BPE to a data set, the sentences may look like the following. Note that the name "Nikitin" is a rare word that has been split into sub-word units delimited by @@.

Madam President , I should like to draw your attention to a case in which this Parliament has consistently shown an interest . It is the case of Alexander Ni@@ ki@@ tin .
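
How BPE arrives at such sub-word units can be shown with a small sketch in the spirit of the algorithm in [15]. It is an illustration only, not the implementation used in the experiments: the function names, the `</w>` end-of-word symbol, the tie-breaking rule and the toy word frequencies are assumptions.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn merge operations from a {word: frequency} dictionary."""
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            key = tuple(merge_pair(list(symbols), best))
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Apply learned merges to one word and join the pieces with '@@ '."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    if symbols[-1] == "</w>":
        symbols.pop()
    else:
        symbols[-1] = symbols[-1][: -len("</w>")]
    return "@@ ".join(symbols)


if __name__ == "__main__":
    freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    merges = learn_bpe(freqs, 3)
    print(merges)
    print(segment("lowest", merges))
```

With only three merges most words remain split into characters; in practice tens of thousands of merges are learned, so that frequent words stay whole and only rare ones such as "Nikitin" are broken up.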

4 Experiments

All of the experiments were conducted on the same machine. We had at our disposal a 32-core CPU, 256 GB of RAM and 4 x nVidia Tesla K80 GPUs. We used only CUDA-enabled toolkits that were able to use the cuDNN library for faster neural computing. This allowed us to measure not only the quality but also the time cost of similar operations on different toolkits and settings [23]. We started the neural machine translation experiments with the most common option, recurrent neural networks (RNN). First, we focused on the TensorFlow toolkit provided by Google. Using its default settings and the official IWSLT 2013 PL-EN test sets, we compared its quality with that of SMT (Moses). The results of this comparison are shown in Table 2.

Table 2. TensorFlow and Moses baseline results on TED corpus.

Corpus  Direction  NMT (BLEU)  SMT (BLEU)  NMT (training time)  SMT (training time)
TED     PL->EN     4.46        16.02       4 days               1.5 hours
TED     EN->PL     5.87        8.49        4 days               1.5 hours
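
The BLEU scores reported in this and the following tables measure n-gram overlap between system output and reference translations [25]. The sketch below shows the core computation for a single reference per sentence. It assumes whitespace tokenization and applies no smoothing, so its numbers will not exactly match those produced by the evaluation scripts of the toolkits; the function names are illustrative.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ng, r_ng = ngram_counts(h, n), ngram_counts(r, n)
            # Clipped counts: an n-gram is credited at most as often as it
            # appears in the reference.
            matches[n - 1] += sum(min(c, r_ng[g]) for g, c in h_ng.items())
            totals[n - 1] += sum(h_ng.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output that is shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    hyps = ["the cat sat on the mat"]
    refs = ["the cat sat on a mat"]
    print(round(corpus_bleu(hyps, refs), 2))
```

Because the score is a geometric mean of the n-gram precisions, a hypothesis with no matching 4-gram gets zero in this unsmoothed variant, which is why BLEU is normally computed over a whole test set rather than per sentence.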


NMT was not only of much lower quality but also much slower to train. We concluded that this might be because TED covers a very wide domain, has a diverse dictionary and shows a dictionary size disproportion between PL and EN. On the other hand, most successful research on NMT was done on narrow domains and on texts that had similar vocabulary on both sides. We therefore decided to switch to the European Parliament Proceedings (EUP) parallel corpus. The results are shown in Table 3. As we can see, even though Adagrad optimization had a positive impact on the training outcome, the time cost and the lower quality were still not satisfactory [24].

Table 3. TensorFlow and Moses baseline results on EuroParl corpus.

Corpus         Direction  Iteration  NMT (BLEU)  SMT (BLEU)  NMT (training time)  SMT (training time)
EUP            PL->EN     5000       6.13        37.91       1 day                2 hours
EUP            PL->EN     10000      9.38        37.91       2 days               2 hours
EUP            PL->EN     20000      13.02       37.91       4 days               2 hours
EUP            PL->EN     50000      15.69       37.91       4.5 days             2 hours
EUP            PL->EN     70000      15.44       37.91       4.5 days             2 hours
EUP            PL->EN     100000     14.52       37.91       5 days               2 hours
EUP            PL->EN     150000     13.67       37.91       7 days               2 hours
EUP            EN->PL     30000      8.17        27.11       6 days               2 hours
EUP            EN->PL     60000      9.78        27.11       7 days               2 hours
EUP            EN->PL     100000     10.10       27.11       8 days               2 hours
EUP            EN->PL     130000     10.21       27.11       9.5 days             2 hours
EUP            EN->PL     180000     9.85        27.11       12 days              2 hours
EUP (Adagrad)  PL->EN     20000      9.78        37.91       2 days               2 hours
EUP (Adagrad)  PL->EN     40000      11.34       37.91       4 days               2 hours
EUP (Adagrad)  PL->EN     55000      12.43       37.91       5 days               2 hours
EUP (Adagrad)  PL->EN     90000      19.43       37.91       7 days               2 hours
EUP (Adagrad)  PL->EN     120000     19.63       37.91       9 days               2 hours

In conclusion, we assumed that the baseline RNN topology and training parameters in TensorFlow were not properly optimized for the Polish language. That is why we turned our attention to MarianNMT, an NMT system developed in Poland. Table 4 shows our experiments in the PL-EN direction.

Table 4. MarianNMT Polish to English results with sub-word units. BLEU scores are listed per option, in the order of the training checkpoints.

Iter (k):                       2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34
non-stemmed, no-bpe, baseline:  5.29 21.44 30.89 34.79 36.74 37.99 38.57 39.15 39.54 39.51 39.83 39.98 40.08 40.18 40.45 40.36 40.37
stemmed, no-bpe:                16.02 - 25.7 27.11 27.82 28.08 28.54 28.42 28.68 28.79 28.65 28.69 28.53 -
nonstemmed, bpe:                - 35.54 37.98 38.76 -
stemmed, no-bpe:                4.83 17.03 22.1 24.09 25.18 25.64 25.95 26 25.98 25.91 25.62 25.58 25.02 25.1 24.63 24.57 24.26

In Table 5 we present similar experiments in the opposite direction (EN->PL).

Table 5. MarianNMT English to Polish results with sub-word units. BLEU scores are listed per option, in the order of the training checkpoints.

Iter (k):                                           2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 204
stemmed, no-bpe:                                    1.36 6.58 10.1 12.31 13.56 14.54 14.83 15.15 15.41 15.67 15.59 15.77 15.95 15.91 15.82 15.86 15.62 15.43 15.46 15.27 15.34
stemmed, no-bpe:                                    1.17 7.98 13.35 15.9 17.52 18.6 19.08 19.63 20.01 20.29 20.68 20.83 20.7 21.13 21.04 21.07 20.87 21.09 21.23
nonstemmed, no-bpe, baseline:                       0.81 6.8 15 19.56 22.32 23.96 25.17 26.25 27.11 27.76 28.3 28.29 28.43 28.81 29.06 29.14 29.35 29.68 29.75 29.85 29.97 30.14 30.15 30.25 30.34 30.48 30.41 30.63 30.66 30.82 30.72 30.28
stemmedkorrida, no-bpe:                             1.26 5.96 10.56 12.98 14.31 15.06 15.6 15.95 16.1 16.4 16.21 16.75 16.58 16.63 17.04 16.92 16.87 16.74 16.57 16.51 16.51 16.59 16.38 16.47
dim-rnn2048, stemmedkorrida, no-bpe:                8.32 12.86 15.15 16.31 17.09 17.23 17.31 17.77 17.74 17.73
maxlength200, dimrnn-2048, stemmedkorrida, no-bpe:  1.25 4.75 9.69 14.41 17.51 19.93 21.33 22.35 23.04 23.61 24.46 24.89 25.19 25.55 26.07 26.17 26.37 26.78 26.9 27.04 27.31 27.61 27.63 27.8 27.86 28.06 28.18 28.37 28.29 28.43 28.28 28.59 28.52 31.35

Even though we finally obtained satisfactory results, they were still too similar to those of the SMT method. Another problem was training performance: we needed one week to run a proper experiment. We also found that using sub-word units (stemming) visibly improves translation into Polish, whereas in the opposite direction the effect was negative. All this motivated us to use a CNN instead of an RNN, which should provide similar results in much less time. We decided to use the FairSeq toolkit developed in the Facebook laboratories. We show the results of translation into Polish in Table 6. We were able to obtain results very similar to MarianNMT but in much less time: a full training run required only 1.5 days on average, whereas in MarianNMT it took about 6 days. We also decided to translate only into Polish, because from the MarianNMT experiments we concluded that sub-word units are useful only when translating into Polish. In the opposite direction they generated too many wrong hypotheses. The same conclusions could be drawn from the FairSeq experiments.

Table 6. Initial experiments on CNN Translation into Polish using FairSeq.

#   Experiment                          Data type                             BLEU   Length limit  Dict en  Dict pl  Settings
5   4_eup-nmax-175enpl                  Stemmed with grammatical group codes  26.77  175           28,364   37,328   With group and suffix numbers
7   5_eup-nmax-59enpl-notstemmed        Notstemmed                            29.59  59            28,210   80,437   Baseline
8   6_eup-nmax-175enplstemmedno-codes   Stemmed                               28.85  175           28,364   37,008   -bptt 0
9   6_eup-nmax-175enplstemmedno-codes   Stemmed                               28.48  175           28,364   37,008   -bptt 25 test, worse bleu, slower training
10  6_eup-nmax-175enplstemmedno-codes   Stemmed                               31.29  175           28,364   37,008   5 attention modules -noutembed 768
11  4_eup-nmax-175enpl                  Stemmed with grammatical group codes  29.74  175           28,364   37,328   Do 5 attention modules work better than 2? -noutembed 768
12  7_eup-nstemmer-3enpl                Stemmed                               31.33  175           28,364   31,655   Stemmer stemmed-3 from destemmed stemmed-5; -nenclayer 15 -nlayer 10 -nembed 256 -noutembed 256 -nhid 512
13  7_eup-nstemmer-3enpl                Stemmed                               31.02  175           28,364   31,655   Stemmer-r -nenclayer 15 -nlayer 5 -nembed 512 -noutembed 768 -nhid 512

Our next step was a search over training hyperparameters. For this purpose we used only 25% of the data set in order to shorten training time. Our baseline translation system scored 24.95, whereas in the end we obtained a BLEU score [25] of 26.18. Most importantly, we showed that our "stemmer" works well and that it really improves translation quality. The highest score was obtained using stemmed data augmented with grammatical group and base suffix codes. A minor improvement was observed when using only stemmed word forms without extra codes. Interestingly, the most advanced tagging and stemming, with group codes and POS tags, did not work as anticipated. The most likely reason is that we simply added too much information, which artificially extended the number of tokens in the sentences. This could reduce the training accuracy, especially since our word embedding vector was set to a size of only 256 (--dim-embeddings). Next, we chose the best settings from the hyperparameter search and applied them to 100% of our data set. Because of the POS issue we also decided to increase the maximum number of accepted tokens per sentence and the size of the embedding vector to 512 and 768 respectively. This improved the BLEU score even further, to 31.53 without POS tags (but with stemming and BPE) and to 31.28 with POS tags and BPE. By doing so we managed to obtain better results with the CNN than with MarianNMT (RNN) while reducing training time by a factor of 4.

Table 7. CNN Experiments with sub-word units (English to Polish).

#   Experiment                          Data type                                               BLEU   Parameters
23  23_st5_15at_100pr                   Stemmed with grammatical group codes                    29.91  -model fconv -nenclayer 15 -nlayer 15 -dropout 0.25 -nembed 256 -noutembed 256 -nhid 512
24  24_st5_15at_100pr_512emb            Stemmed with grammatical group codes                    30.67  -model fconv -nenclayer 15 -nlayer 15 -dropout 0.25 -nembed 512 -noutembed 512 -nhid 512
25  25_st5_15at_100pr_768emb            Stemmed with grammatical group codes                    30.71  -model fconv -nenclayer 15 -nlayer 15 -dropout 0.25 -nembed 768 -noutembed 768 -nhid 512
29  29_stem26                           Stemmed with grammatical group codes and POS tags       25.97  --dropout 0.25 --optim nag --lr 0.25 --clip-norm 0.1 --momentum 0.99 --max-tokens 5000; arch='fconv', decoder_embed_dim=512, decoder_out_embed_dim=256, encoder_embed_dim=512, max_source_positions=1024, max_target_positions=1024
30  30_stemmed_15_bpe                   Stemmed with grammatical group codes and BPE            31.53  Like 29, --max-tokens 6000
31  31_clean_bpe                        BPE only                                                30.23  Same as 30
33  33_stemmed_with_pos_191_en_pos_bpe  Stemmed with grammatical group codes, POS tags and BPE  31.28  Same as 30

The problem of a slightly lower score when using POS tags, which we did not anticipate, remained. That is why we evaluated the system manually by analysing the translation results. This is how we discovered a looping issue: for some sentences the generated hypothesis was partially correct but was repeated many times. This was the most likely reason for the BLEU disproportion.
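
Looping output of this kind can also be flagged automatically before manual inspection. The sketch below is a simple heuristic and not the procedure used in the study: it measures the share of repeated n-grams in a hypothesis, and the function names, the n-gram order and the threshold are assumptions that would need tuning on real output. The example sentences are made up.

```python
def repeated_ngram_ratio(hypothesis, n=3):
    """Share of n-grams in the hypothesis that duplicate another n-gram."""
    tokens = hypothesis.split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


def flag_looping(hypotheses, n=3, threshold=0.3):
    """Return the indices of hypotheses whose repetition ratio reaches the threshold."""
    return [
        i for i, hyp in enumerate(hypotheses)
        if repeated_ngram_ratio(hyp, n) >= threshold
    ]


if __name__ == "__main__":
    outputs = [
        "we can not run the risk",
        "the risk of the risk of the risk of the risk of",
    ]
    print(flag_looping(outputs))
```

Sentences flagged this way can then be compared between the POS-tagged and untagged systems to check how much of the BLEU gap they explain.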


We judged that the attention window on the English side was too small, which made the system go into a loop. Unfortunately, we have not yet succeeded in eliminating this issue. The final step of our research was a comparison of our systems with the commercial Google Translate engine and the context-aware ModernMT system (Table 8). Both systems are state-of-the-art tools, but their closed architecture (especially in the case of MMT) makes it impossible to compare the results directly. Nonetheless, they shed some light on the outcomes of this research.

Table 8. Translation of EUP using the Google Translate engine and ModernMT.

Engine          Translation direction  BLEU
Google          PL->EN                 31.74
Google          EN->PL                 27.61
ModernMT (SMT)  PL->EN                 39.17
ModernMT (SMT)  EN->PL                 28.93
ModernMT (NMT)  PL->EN                 41.47
ModernMT (NMT)  EN->PL                 33.01

5 Conclusions

To conclude our work, we put everything in one table for easier comparison. We list only the best score for every toolkit used in the research and omit most of the information on whether the data was augmented with tags, BPE or stemming. We compare our work to the Google Translate API [26] and ModernMT.

Table 9. Summary of translation experiments.

System      Type           Direction  BLEU   Training time
Moses       SMT            PL->EN     37.91  2h
TensorFlow  RNN            PL->EN     19.63  9 days
MarianNMT   RNN            PL->EN     40.45  2 days
MarianNMT   RNN (Stemmed)  PL->EN     28.74  3 days
FairSeq     CNN            PL->EN     -      -
Google      Hybrid         PL->EN     31.74  -
ModernMT    SMT            PL->EN     39.17  1h
ModernMT    RNN            PL->EN     41.47  15h
Moses       SMT            EN->PL     27.11  2h
TensorFlow  RNN            EN->PL     10.21  9.5 days
MarianNMT   RNN            EN->PL     30.28  6 days
MarianNMT   RNN (Stemmed)  EN->PL     31.35  6 days
FairSeq     CNN            EN->PL     29.59  1.5 days
FairSeq     CNN (Stemmed)  EN->PL     31.53  1.5 days
Google      Hybrid         EN->PL     27.61  -
ModernMT    SMT            EN->PL     28.93  1h
ModernMT    RNN            EN->PL     33.01  15h


Summing up, we surveyed the main translation systems currently available in the context of the Polish language. We trained a baseline statistical system and improved on its quality using neural networks. Moreover, we achieved a better score than the Google Translate engine within the test domain. We also showed that using sub-word units in translation into Polish has a positive impact on translation quality. More precisely, our sub-division tool performed better than the widely used BPE method, especially when the data was also annotated with grammatical groups or POS tags. Nonetheless, it must be noted that engines much better than the ones we trained can most likely be built. First of all, we used a small amount of training data. Secondly, it would be a good idea to incorporate into NMT a language model trained on very large amounts of text. Another idea for future experiments is adding lemmatization to our data. We also plan to use transfer learning methods in order to further improve the quality of NMT and adapt it to other text domains. We believe that CNN is currently the best translation path to follow. In our opinion, by optimizing the training parameters and the CNN topology it should be possible to outscore even the ModernMT system in BLEU.

References

1. Koehn, P., Och, F. J., & Marcu, D.: Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1 (pp. 48-54). Association for Computational Linguistics. (2003, May).
2. Sutskever, I., Vinyals, O., & Le, Q. V.: Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104-3112). (2014).
3. Bahdanau, D., Cho, K., & Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. (2014).
4. Gehring, J., Auli, M., Grangier, D., Yarats, D., & Dauphin, Y. N.: Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122. (2017).
5. Luong, M. T., & Manning, C. D.: Stanford neural machine translation systems for spoken language domains. In Proceedings of the International Workshop on Spoken Language Translation (pp. 76-79). (2015).
6. Koehn, P., & Knowles, R.: Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872. (2017).
7. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., ... & Dyer, C.: Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions (pp. 177-180). Association for Computational Linguistics. (2007, June).
8. Vasiļjevs, A., Skadiņš, R., & Tiedemann, J.: LetsMT!: a cloud-based platform for do-it-yourself machine translation. In Proceedings of the ACL 2012 System Demonstrations (pp. 43-48). Association for Computational Linguistics. (2012, July).
9. Stolcke, A.: SRILM-an extensible language modeling toolkit. In Seventh international conference on spoken language processing. (2002).
10. Junczys-Dowmunt, M., & Szał, A.: Symgiza++: symmetrized word alignment models for statistical machine translation. In Security and Intelligent Information Systems (pp. 379-390). Springer, Berlin, Heidelberg. (2012).
11. Heafield, K.: KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation (pp. 187-197). Association for Computational Linguistics. (2011, July).
12. Jelinek, R.: Modern MT systems and the myth of human translation: Real World Status Quo. In Proceedings of the International Conference Translating and the Computer. (2004, November).
13. Team, PyTorch Core.: Pytorch: Tensors and dynamic neural networks in python with strong gpu acceleration. (2017).
14. Klein, G., Kim, Y., Deng, Y., Senellart, J., & Rush, A. M.: Opennmt: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810. (2017).
15. Sennrich, R., Haddow, B., & Birch, A.: Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909. (2015).
16. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. (2014).
17. Junczys-Dowmunt, M., Grundkiewicz, R., Grundkiewicz, T., Hoang, H., Heafield, K., Neckermann, T., ... & Martins, A.: Marian: Fast Neural Machine Translation in C++. arXiv preprint arXiv:1804.00344. (2018).
18. Sennrich, R., Firat, O., Cho, K., Birch, A., Haddow, B., Hitschler, J., ... & Nădejde, M.: Nematus: a toolkit for neural machine translation. arXiv preprint arXiv:1703.04357. (2017).
19. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324. (1998).
20. Wołk, A., Wołk, K., & Marasek, K.: Analysis of complexity between spoken and written language for statistical machine translation in West-Slavic group. In Multimedia and Network Information Systems (pp. 251-260). Springer, Cham. (2017).
21. Wołk, K., & Marasek, K.: Polish-English speech statistical machine translation systems for the IWSLT 2013. arXiv preprint arXiv:1509.09097. (2013).
22. Koehn, P.: Europarl: A parallel corpus for statistical machine translation. In MT summit (Vol. 5, pp. 79-86). (2005, September).
23. Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., ... & Klingner, J.: Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. (2016).
24. Kingma, D. P., & Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. (2014).
25. Papineni, K., Roukos, S., Ward, T., & Zhu, W. J.: BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics (pp. 311-318). Association for Computational Linguistics. (2002, July).
26. Groves, M., & Mundt, K.: Friend or foe? Google Translate in language for academic purposes. English for Specific Purposes, 37, 112-121. (2015).