Factored Neural Machine Translation Mercedes Garc´ıa-Mart´ınez, Lo¨ıc Barrault and Fethi Bougares LIUM, University of Le Mans [email protected]

arXiv:1609.04621v1 [cs.CL] 15 Sep 2016

Abstract We present a new approach for neural machine translation (NMT) using the morphological and grammatical decomposition of the words (factors) in the output side of the neural network. This architecture addresses two main problems occurring in MT, namely dealing with a large target language vocabulary and the out of vocabulary (OOV) words. By the means of factors, we are able to handle larger vocabulary and reduce the training time (for systems with equivalent target language vocabulary size). In addition, we can produce new words that are not in the vocabulary. We use a morphological analyser to get a factored representation of each word (lemmas, Part of Speech tag, tense, person, gender and number). We have extended the NMT approach with attention mechanism (Bahdanau et al., 2014) in order to have two different outputs , one for the lemmas and the other for the rest of the factors. The final translation is built using some a priori linguistic information. We compare our extension with a word-based NMT system. The experiments, performed on the IWSLT’15 dataset translating from English to French, show that while the performance do not always increase, the system can manage a much larger vocabulary and consistently reduce the OOV rate. We observe up to 2% BLEU point improvement in a simulated out of domain translation setup.

1

Introduction

Neural Machine Translation (NMT) has been further developed in the last years (Bahdanau et al., 2014). In contrast to the traditional phrased-based statistical machine translation (Koehn et al., 2007) that automatically translates subparts of the sentences, NMT uses the sequence to sequence of words approach (Cho et al., 2014). Recently, NMT has improved the results of the phrased-based systems (Bahdanau et al., 2014). Besides these improvements in NMT, some problems still remain. One problem is the high computational cost of the target word probability due to the softmax that requires to normalize all the output values, see Equation 1: N X pi = eoi / eor for i ∈ {1, . . . , N } (1) r=1

where oi are the outputs, pi their softmax normalization and N the total number of outputs. In order to solve this issue, a standard technique is to define a short-list containing the most frequent words only. This has the disadvantage of increasing the out of vocabulary (OOV) rate. OOV words correspond to those unseen in the training dataset or which are not included in the vocabulary. They are all considered as unknown words and mapped to the special UNK token. Jean et al. (2014), proposed to carefully organise the batches so that only a subset of the target vocabulary is possibly generated at training time. This allows the system to perform the softmax only on this subset during training (the complexity remains the same at test time). Another possibility is to define a structured output layer (SOUL) to handle the words not appearing in the shortlist. This allows the system to always apply the softmax normalization on a layer with reduced size (Le et al., 2011). Recently, some works have used subword units to translate instead of words. In Sennrich et al. (2015), the rare and some unknown words are encoded as subword units with the Byte Pair Encoding (BPE)

method. The authors show that this can also generate words which are unseen at training time. As an extreme case, the character-level neural machine translation has been presented in several works (Chung et al., 2016; Ling et al., 2015; Costa-Juss`a and Fonollosa, 2016) and showed very promising results. In this work we propose an approach using factors as unit level in the output side of the neural network. The factors are referring to the linguistic annotation at word level like the Part of Speech (POS) tags. Moses toolkit (Haddow and Koehn, 2012) for statistical machine translation is able to manage factors information in addition to the words to be able to improve the translation. Some works have used factors as additional information for language modeling (Bilmes and Kirchhoff, 2003; Alexandrescu, 2006). Recently, factors have been used as linguistic input features to improve NMT (Sennrich and Haddow, 2016) as well. Our approach differs from previous works in the sense that we use only the linguistic decomposition of the words in the output side. Each word is represented by its lemma along its linguistic factors (POS tag, tense, gender, number and person). By these means, the target vocabulary size is reduced because we do not have to keep all the derived forms of the verbs, nouns, adjectives, etc. Furthermore, we are able to produce new words that are not in the vocabulary using all the derived forms of the lemmas. We use two different outputs for the translation at word level, one output is the lemma of the word and the other output is the rest of the factors mentioned earlier. Multiple output neural networks have been used before (Firat et al., 2016) with the difference that in our approach the system produces both outputs at the same time instead of scheduling them. With both outputs (lemma and factors) we are able to generate the final word using linguistic resources.

2

Neural Machine Translation

The encoder-decoder architecture used for NMT consists of two recurrent neural networks (RNN), one for the encoder and the other for the decoder. The encoder maps a source sequence into a continuous space representation and the decoder maps the representation back to a target sequence. Our trained neural translation models are based on a bidirectional encoder-decoder deep neural network equipped with an attention mechanism (Bahdanau et al., 2014), as described in Figure 2. Il

uj

4

pj

3

zj

a bien longtemps dans une galaxie lointaine , très lointaine

DECODER

5

y

Attention mechanism

va

lin

Softmax

i

↵1,6 ↵2,6

1

eij ↵ij = X eij

+

eij

↵11,6

↵6,6 ↵7,6

hi ( A long time ago in

a galaxy far

,

ENCODER

Ua

tanh

2

Wa

far away )

Figure 1: Architecture of the NMT system equipped with an attention mechanism. This architecture consists of a bidirectional RNN as an encoder (as seen in stage 1 of Figure 2). An input sentence is encoded in a sequence of annotations (one for each input word), corresponding to the concatenation of the outputs of a forward and a backward RNN. Each annotation represents the full sentence with a strong focus on the current word. The decoder is composed of a conditional RNN as provided for the DL4MT winter school1 (see stage 3 of Figure 2), equipped with an attention mechanism This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http:// creativecommons.org/licenses/by/4.0/ 1 https://github.com/nyu-dl/dl4mt-tutorial

(stage 2). The attention mechanism aims at providing weights for each annotation in order to generate a context vector (by performing a weighted sum over the annotations). The attention mechanism uses the hidden state at timestep j of the decoder RNN along with the annotation hi to generate a coefficient eij . A softmax operation is performed over those coefficients to generate the annotation weights αij . As described in (Bahdanau et al., 2014), the annotation weights can be used to align the input words to the output words. The RNN takes as input the context vector, the embedding of the previous output word (stage 4), and of course its hidden state. Finally, on stage 5 of the Figure 2, the output probabilities of the target vocabulary are computed. The word with the highest probability is selected to be the translation at each timestep. The encoder and the decoder are trained jointly to maximize the conditional probability of the correct translation.

3

Factors in Neural Machine Translation To perform factored neural machine translation, we need to extend the standard NMT architecture of the Figure 2 to allow generating several output symbols at the same time. For the sake of simplicity, we decided to generate only two symbols: the lemma and the concatenation of the different factors that we considered. For example, from the word devient, we obtain the lemma devenir and the factors vP3#s meaning that it is a verb, in Present, 3rd person, irrelevant gender (#) and singular. The morphological and grammatical analysis is performed with the MACAON toolkit (Nasr et al., 2011). Figure 2 shows this modification.

lemma

factors

uj pj

zj

+

Figure 2: Output detail of the Factored NMT architecture.

As we can see, lemmas and factors are generated separately, which in some cases, lead to sequences with different length. To solve this problem, we give priority to the length of the lemmas. Consequently, we constraint the length of the factors sequence to be equal to the length of the lemma sequence. This is motivated by the fact that the lemmas are closer to the final objective (a sequence of words) and that they are the symbols carrying most of the meaning. Another issue is the feedback that the RNN receives. In the word based model, this feedback is the embedding of the previous generated word. Since we have two outputs, we have to decide what will be given to the decoder RNN. Several options are possible and will be explored in this paper (details in section 4.3.2). For the first set of experiments, only the previous lemma embedding was used as feedback (no information from the factors output). 3.1

Handling beam search with factors

The beam search procedure has also been extended with respect to the original approach, since we are actually facing two beams (one for lemmas and one for factors). We need to deal with the multiple outputs because we do not want to rely solely on the lemma sequence to decide which are the best sequences. Then, we merge the two beams. Once the best lemma and factors hypotheses are generated for each partial hypothesis, the cross product of those output spaces is performed. By this mean, each lemma hypothesis is associated with each factors hypothesis. Afterwards, we keep the k-best combinations for each sample, with k being the beam size. Finally, the number of best hypotheses is reduced again to the beam size for further processing. 3.2

From factors to word

Once we obtain the factorized outputs from the neural network, we need to fall back to the word representation. This operation is also performed with the MACAON tool, which, given a lemma and some factors, provides the word candidate.

4

Experiments

We performed several sets of experiments trying different architectures and vocabulary sizes for Factored NMT (FNMT) and comparing them with the NMT system.

4.1

Data processing and selection

We evaluate our approach on the English to French Spoken Language Translation task from IWSLT 2015 evaluation campaign2 . A selection method (Rousseau, 2013) has been applied using the available parallel corpora (news-commentary, united-nations, europarl, wikipedia, and two crawled corpora) and Technology Entertainment Design (TED3 ) corpus as in-domain corpus. We also do a preprocessing to convert html entities and filter out the sentences with more than 50 words for both source and target languages. We finally end with a selected corpus of 2M sentences, 147K unique words for English side and 266K unique words for French side. 4.2

Training

We chose the following hyperparameters to train the systems. The embedding and recurrent layers have a dimensionality of 620 and 1000 respectively. We use a minibatch size of 80 sentences trained with Adadelta algorithm. The norm of the gradient is clipped to be no more than 1 (Pascanu et al., 2012) and the weights are initialized with Xavier (Glorot and Bengio, 2010). The validations start at the second epoch and are performed every 5000 updates. Early stopping is based on BLEU with a patience set to 10 (early stopping occurs after 10 evaluations without improvement in BLEU). The vocabulary size of the source languages is set to 30K. We varied the output layer size from 5K to 30K in order to simulate different levels of out of domain data. Once the model is trained, we set the beam size to 12 (as this is the standard value for NMT, (Bahdanau et al., 2014)) when translating the development corpus. 4.3

Factors models and results

The Factored NMT system aims at integrating linguistic knowledge into the decoder in order to obtain better performance when facing out of domain data and/or a low resource setup. To assess the feasibility and estimate the potential gain of our approach, we performed a set of experiments reducing the output vocabulary size, simulating such an environment. The results are presented in Table 1.

Model NMT FNMT NMT FNMT NMT FNMT NMT FNMT

Output size vocab. 30K 30K 30K+142 172K 20K 20K 20K+142 139K 10K 10K 10K+142 85K 5K 5K 5K+142 48K

Coverage (%) 97.96 99.23 (+1.27) 97.03 98.88 (+1.85) 94.51 97.72 (+3.21) 91.02 95.61 (+4.59)

#OOV 1775 784 2171 1014 3996 1897 6545 3424

#Par. 89.7M 89.8M 77.3M 77.4M 64.9M 64.9M 58.7M 58.7M

word 34.88 34.80 34.21 34.46 32.61 34.13 30.54 32.55

%BLEU lem. 37.78 37.52 37.07 35.22

fact. 42.72 42.65 42.75 42.98

Oracle word 36.33 36.14 35.72 33.86

Table 1: Comparison of the performance of the NMT and FNMT systems in terms of %BLEU score evaluating at word level, and separately, each output lemma and factors. The size of the output layer and the size of the corresponding vocabulary are presented in columns 2 and 3. Columns 4 and 5 show coverage in test dataset and number of OOVs, respectively. Last column corresponds to the oracle output. The FNMT system obtains a similar performance compared to the NMT system (first two rows) in terms of word level BLEU score, despite the increased complexity of the architecture of our model (and in particular the two outputs). In order to estimate the capacity of such a model, we computed the oracle which corresponds to ignore the errors caused by the factors, i.e. if we produce the correct lemma, then the correct word is generated (see last column of Table 1). We can see that a potential gain of more than 1.5% BLEU points can be achieved with a perfect modeling of the factors, which is encouraging. 2 3

IWSLT’15: https://sites.google.com/site/iwsltevaluation2015 TED: https://www.ted.com

The first comment is that the Factored NMT approach is able to model a bigger word vocabulary while preserving manageable output layers size. This is due to the fact that the factors-to-word tool is able to generate words which are unseen in the training corpus, augmenting the expressiveness of our model. For the sake of comparison, we provide the target vocabulary size for the standard NMT and the FNMT systems. For example, with an output layer size of 30K, the NMT system can model 30K words against 172K words for the FNMT system. This is an almost 6 times larger word vocabulary. One consequence is that the word coverage is higher for the FNMT than for the NMT system, as shown in column 4. However, for the first two systems (first two rows), we see that the difference between the coverages is small. When decreasing the output layer size, we can observe that the coverage decreases slowly for FNMT systems compared to the word based system. The FNMT approach surpasses standard NMT when the coverage difference becomes higher. This proves that the approach is sound and well performing, when dealing with out-of-domain data. This is of course dependent on the linguistic knowledge available in the factors-to-word tool. This is exactly the sought behavior: by integrating a priori linguistic knowledge, we reduce the impact of the training conditions (domain, data availability, etc.) on the performance of the system. The reduction of the out of vocabulary (OOV) rate of about 47% is a promising result which is not always well reflected by the BLEU score. These results would be better highlighted if performing a human evaluation (this point will not be addressed further in this paper). To make things clear, the OOV rate corresponds to the number of UNK tokens generated by our system. In those experiments, we did not use any specific method to replace them (e.g. put source words aligned to them, use a dictionary, etc.) Moreover, the number of parameters to train also decreases according to the size of the output layer, as shown in column 6, allowing a simpler training because we have to learn less weights in the model. For example, using a lemma output layer size of 10K instead of 30K (3 times smaller) for factored model, we obtain a small drop of 0.67 points in BLEU. By contrast, in NMT base model we observe a drop of 2.27 points in BLEU comparing the same output sizes 30K and 10K. Another interesting remark is that the scores evaluating in lemmas and factors are higher than the BLEU in words for both systems, this is due to the difficulty of the final step to generate the words. Nevertheless, the BLEU for factors are pretty low considering that the output layer size for this is only 142. This can be due to two different causes. First, the neural network is not able to correctly model this small output. Second, the task of translating from English words to French factors is complex. 4.3.1 Evaluating each output We evaluated BLEU at different levels (word, lemma or factors) using the base NMT system with only one output (see Table 2). We compare the values with the Factored NMT system results which models lemmas and factors at the same time. We observe that the difference between the results in BLEU for lemmas using the FNMT are similar to the NMT system. However, the differences evaluating factors are big between the two systems (2.44 difference of %BLEU). This experiment confirms that the task to predict factors managing very different output sizes respect to the source words is not easy. In future we will implement factors also in the input side of the neural network to verify this hypothesis. Also, we have to take into account that we are giving more priority to the length of the lemmas sequence than the factors one during beam search. This also suggests that we adapt our architecture so that factors are better predicted to obtain a final better BLEU at word evaluation.

Model NMT FNMT

word 34.88 34.80

%BLEU lemma factors 37.72 45.16 37.78 42.72

Table 2: Comparison of the performances between standard NMT system and the Factored NMT system in terms of %BLEU computed at word, lemma and factors level. The first line corresponds to 3 standard NMT systems built to generate at the output words, lemmas and factors, respectively.

4.3.2

Feedback

As explained in section 2, the decoder RNN is a conditional-GRU which is fed by the input context vector, its hidden state and the feedback (i.e. the previous generated symbol). Since we now have two outputs, we need to define what kind of feedback is more suitable for the Factored NMT system. Several solutions are possible. The first assumption we made is highly dependent on the design of the considered factors, i.e. the lemmas are the most informative factors among all. Then, we tried using only the output lemma embedding as feedback (see equation 2). Lemma feedback : EL [yj ]

(2)

where EL is the target language lemma lookup table and EL [yj ] is the embedding of the lemma used to generate the output word yi . Another straightforward operation is to sum the embeddings of the previous lemma with the embedding of the previous factors, as described in equation 3. Sum feedback : EL [yj ] + EF [yj ]

(3)

where yi is the target output word, and EL [yi ] and EF [yi ] are its corresponding lemma and factors embeddings. While this could seem unnatural, by doing this, we hope to obtain a joint vector representation of both the lemma and the factors. Finally, we investigated whether the neural network can learn a better combination of the lemmas and factors embeddings using a linear (eq. 4) or non-linear (eq. 5) operation instead of a simple sum. Linear feedback : Tanh feedback :

EL [yj ] · WL + EF [yj ] · WF

(4)

tanh (EL [yj ] · WL + EF [yj ] · WF )

(5)

where WL and WF are the parameters to be learned.

Model NMT FNMT FNMT FNMT FNMT

Feedback Lemma Sum Linear Tanh

word 34.88 34.80 34.48 34.42 34.58

%BLEU lemma 37.78 37.14 37.27 37.28

factors 42.72 44.46 44.03 43.96

#OOV 1775 784 815 868 757

Table 3: Performance in terms of %BLEU computed on word, lemma and factors when using different output embedding combinations as feedback. Table 3 presents the results obtained with systems integrating the different output embedding combinations as feedback. We can see that all systems perform similarly regarding BLEU score on words with a better result for the lemma feedback. As expected, when using only lemma as feedback, the system better estimates the lemmas probabilities, as a consequence, there is a significant reduction of the performance on factors. The comparison between the lemma %BLEU (fourth column of Table 3) and the number of OOVs (sixth column) shows a correlation between those two values, except when using non-linear combination which has the lowest value of OOVs. This tends to prove that modeling the lemmas better is important to reduce the OOV rate (confirming our assumption that lemmas are more informative) but not sufficient. In the future we would like to explore the combination of the two embeddings using its concatenation to see if we can get better results.

4.3.3

Dependency model lemma

One observation that can be made is that while generating factors could seem easier due to the small number of the possible outputs (only 142), the BLEU score is not as high as what we could expect. However, one could argue that generating a sequence of factors in French from a sequence of English words is not an easy task. In order to help the factors prediction, we contextualized the corresponding output with the lemma being generated. This creates a dependency between the lemma output and the factors output. The dependency has been implemented by including a transformer (see Figure 3) which projects the lemma embeddings into the hidden layer used to generate factors. The results by applying those two techniques are presented in Table 4.

Model NMT FNMT with dependency FNMT with dependency FNMT with dependency FNMT with dependency

Feedback Lemma Sum Linear Tanh

word 34.88 34.45 34.65 34.25 34.38

%BLEU lemma 37.45 37.34 37.02 37.09

factors

uj

T

pj +

zj

Figure 3: Dependency model

factors 42.15 44.35 43.57 43.82

#OOV 1775 770 800 822 915

Table 4: Results for dependency model In Table 4, we can observe that the dependency model does not improve the results in terms of %BLEU score on words from Table 3 using lemma, linear and tanh feedback. However, it improves using the sum feedback. For the sum feedback dependency model, we see that lemma BLEU output improves with respect to the same model without dependency. By contrast, the factors output obtains lower BLEU. This can occur because factors output receives more information from lemma and when the factors cost is back-propagated, the lemma output can improve the learning. We can also observe that if we improve lemma output it is more correlated to the word evaluation than if we improve factors output. Moreover, the number of the OOV are reduced for all the feedback combination excepting tanh feedback, which is not reflected by the automatic score. 4.3.4

Qualitative analysis

We have observed some of the translation outputs to better understand in what cases our FNMT system performs better or worse than the NMT system. Translation examples with better BLEU performance In the first two examples of Table 5, the FNMT system obtains better BLEU score than the NMT system. First example shows when our factored system can generate words when the NMT base system predicts unknown words. Firstly, the word lineage in source sentence is translated as the reference (ligne´e) by the FNMT system and mapped to UNK by the NMT base system. Secondly, the word adaptive is translated as adaptatifs by the FNMT system, the reference translation is adapt´es, but we can consider the FNMT choice a better translation. NMT system also mapped the word adaptive to UNK. In the second example, FNMT translation performs as the reference. We are able to generate the new word actualis´ee (actualiser+past participle+feminine+singular) that it is not in the shortlist of the NMT system vocabulary. This is due, on one hand, because the word actualis´ee appears 40 times in the word vocabulary of the NMT system so it is excluded from the shortlist. On the other hand, the lemma actualiser appears 172 times in the lemmas shortlist so it is included and we are able to generate

1

2

3

4

Src Ref NMT FNMT Src Ref NMT FNMT Src Ref NMT FNMT Src Ref NMT FNMT

set of adaptive choices that our lineage made de choix adapt´es e´ tablis par notre lign´ee de choix UNK que notre UNK a fait de choix adaptatifs que notre lign´ee a fait here ’s the updated version of this entry voici la version actualis´ee de cette entr´ee . voici la version mise a` jour de cette entr´ee . voici la version actualis´ee de cette entr´ee . i could draw i could paint je pouvais dessiner . je pouvais peindre . je pouvais dessiner . je pouvais peindre . je pourrais dessiner . je pouvais peindre . and it ’s a very easy question c’ est une question tr`es simple . c’ est une question tr`es simple . et c’ est une question tr`es facile .

Table 5: Examples of translations with NMT and Factored NMT.

actualis´ee from the lemma and factors outputs. These examples can show the potential of our FNMT system generating new words and reducing unknown words. Translations with lower BLEU performance We also have extracted some translations where we have seen a lower BLEU from the FNMT system with respect to the NMT base system (see Table 5). Example 3 shows a problem with the factors output, from the correct lemma pouvoir, the FNMT system has generated the word pourrais instead of pouvais. We can consider both translations as correct but BLEU score penalizes the FNMT translation. Finally, in the last example, we saw that the translation of the FNMT system is more correct than the NMT system because it translated the word and to et but in the reference is not included. In addition, FNMT system translated easy to a synonym (facile) of simple. Consequently, BLEU score penalizes this example in FNMT system being a correct translation.

5

Conclusion

In this paper, we have proposed an NMT architecture which produces a factored representation of the target language words. Those factors are based on linguistics a priori knowledge. We showed that we are able to train Factored NMT systems with similar performance to word based systems but with the advantage of modeling an almost 6 times bigger word vocabulary with only a slight increase of the computational cost. A consequence of that is the OOV rate reduction observed with the FNMT system. Also, the use of additional linguistic resources allows us to generate new word forms that would not be included in the standard NMT system shortlist. By reducing the target language vocabulary, we simulated an out-of-domain setup, and we showed that our factored NMT method performs better than the basic NMT system in this case. As future work, we would like to include linguistic features at the input. It is known that this can be helpful for NMT (Sennrich and Haddow, 2016). Extending the approach with input factors could make the target language factors generation simpler. This will be investigated in the future. The proposed Factored NMT method could even show better performance if applied on highly inflected languages like German, Arabic, Czech, Russian or Hindi on the target side.

References [Alexandrescu2006] Andrei Alexandrescu. 2006. Factored neural language models. In In HLT-NAACL. [Bahdanau et al.2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473. [Bilmes and Kirchhoff2003] Jeff A. Bilmes and Katrin Kirchhoff. 2003. Factored language models and generalized parallel backoff. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Companion Volume of the Proceedings of HLT-NAACL 2003–short Papers - Volume 2, NAACL-Short ’03, pages 4–6, Stroudsburg, PA, USA. Association for Computational Linguistics. [Cho et al.2014] Kyunghyun Cho, Bart van Merrienboer, C¸aglar G¨ulc¸ehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078. [Chung et al.2016] Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. CoRR, abs/1603.06147. [Costa-Juss`a and Fonollosa2016] Marta R. Costa-Juss`a and Jos´e A. R. Fonollosa. 2016. Character-based neural machine translation. CoRR, abs/1603.00810. [Firat et al.2016] Orhan Firat, KyungHyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. CoRR, abs/1601.01073. [Glorot and Bengio2010] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS’10). Society for Artificial Intelligence and Statistics. [Haddow and Koehn2012] Barry Haddow and Philipp Koehn, 2012. Interpolated backoff for factored translation models. Association for Machine Translation in the Americas, AMTA. [Jean et al.2014] S´ebastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. On using very large target vocabulary for neural machine translation. CoRR, abs/1412.2007. [Koehn et al.2007] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondˇrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL ’07, pages 177–180, Stroudsburg, PA, USA. Association for Computational Linguistics. [Le et al.2011] Hai-Son. Le, Ilya Oparin, Abdel. Messaoudi, Alexandre Allauzen, Jean-Luc Gauvain, and Franc¸ois Yvon. 2011. Large vocabulary SOUL neural network language models. In INTERSPEECH. [Ling et al.2015] Wang Ling, Isabel Trancoso, Chris Dyer, and Alan W. Black. 2015. Character-based neural machine translation. CoRR, abs/1511.04586. [Nasr et al.2011] Alexis Nasr, Fred´eric B´echet, Jean-Franc¸ois Rey, Benoˆıt Favre, and Joseph Le Roux. 2011. Macaon, an nlp tool suite for processing word lattices. In Proceedings of the ACL-HLT 2011 System Demonstrations, pages 86–91. [Pascanu et al.2012] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. Understanding the exploding gradient problem. CoRR, abs/1211.5063. [Rousseau2013] Anthony Rousseau. 2013. XenC: An open-source tool for data selection in natural language processing. The Prague Bulletin of Mathematical Linguistics, 100:73–82. [Sennrich and Haddow2016] Rico Sennrich and Barry Haddow. 2016. Linguistic input features improve neural machine translation. CoRR, abs/1606.02892. [Sennrich et al.2015] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. CoRR, abs/1508.07909.

arXiv:1609.04621v1 [cs.CL] 15 Sep 2016

Abstract We present a new approach for neural machine translation (NMT) using the morphological and grammatical decomposition of the words (factors) in the output side of the neural network. This architecture addresses two main problems occurring in MT, namely dealing with a large target language vocabulary and the out of vocabulary (OOV) words. By the means of factors, we are able to handle larger vocabulary and reduce the training time (for systems with equivalent target language vocabulary size). In addition, we can produce new words that are not in the vocabulary. We use a morphological analyser to get a factored representation of each word (lemmas, Part of Speech tag, tense, person, gender and number). We have extended the NMT approach with attention mechanism (Bahdanau et al., 2014) in order to have two different outputs , one for the lemmas and the other for the rest of the factors. The final translation is built using some a priori linguistic information. We compare our extension with a word-based NMT system. The experiments, performed on the IWSLT’15 dataset translating from English to French, show that while the performance do not always increase, the system can manage a much larger vocabulary and consistently reduce the OOV rate. We observe up to 2% BLEU point improvement in a simulated out of domain translation setup.

1

Introduction

Neural Machine Translation (NMT) has been further developed in the last years (Bahdanau et al., 2014). In contrast to the traditional phrased-based statistical machine translation (Koehn et al., 2007) that automatically translates subparts of the sentences, NMT uses the sequence to sequence of words approach (Cho et al., 2014). Recently, NMT has improved the results of the phrased-based systems (Bahdanau et al., 2014). Besides these improvements in NMT, some problems still remain. One problem is the high computational cost of the target word probability due to the softmax that requires to normalize all the output values, see Equation 1: N X pi = eoi / eor for i ∈ {1, . . . , N } (1) r=1

where oi are the outputs, pi their softmax normalization and N the total number of outputs. In order to solve this issue, a standard technique is to define a short-list containing the most frequent words only. This has the disadvantage of increasing the out of vocabulary (OOV) rate. OOV words correspond to those unseen in the training dataset or which are not included in the vocabulary. They are all considered as unknown words and mapped to the special UNK token. Jean et al. (2014), proposed to carefully organise the batches so that only a subset of the target vocabulary is possibly generated at training time. This allows the system to perform the softmax only on this subset during training (the complexity remains the same at test time). Another possibility is to define a structured output layer (SOUL) to handle the words not appearing in the shortlist. This allows the system to always apply the softmax normalization on a layer with reduced size (Le et al., 2011). Recently, some works have used subword units to translate instead of words. In Sennrich et al. (2015), the rare and some unknown words are encoded as subword units with the Byte Pair Encoding (BPE)

method. The authors show that this can also generate words which are unseen at training time. As an extreme case, the character-level neural machine translation has been presented in several works (Chung et al., 2016; Ling et al., 2015; Costa-Juss`a and Fonollosa, 2016) and showed very promising results. In this work we propose an approach using factors as unit level in the output side of the neural network. The factors are referring to the linguistic annotation at word level like the Part of Speech (POS) tags. Moses toolkit (Haddow and Koehn, 2012) for statistical machine translation is able to manage factors information in addition to the words to be able to improve the translation. Some works have used factors as additional information for language modeling (Bilmes and Kirchhoff, 2003; Alexandrescu, 2006). Recently, factors have been used as linguistic input features to improve NMT (Sennrich and Haddow, 2016) as well. Our approach differs from previous works in the sense that we use only the linguistic decomposition of the words in the output side. Each word is represented by its lemma along its linguistic factors (POS tag, tense, gender, number and person). By these means, the target vocabulary size is reduced because we do not have to keep all the derived forms of the verbs, nouns, adjectives, etc. Furthermore, we are able to produce new words that are not in the vocabulary using all the derived forms of the lemmas. We use two different outputs for the translation at word level, one output is the lemma of the word and the other output is the rest of the factors mentioned earlier. Multiple output neural networks have been used before (Firat et al., 2016) with the difference that in our approach the system produces both outputs at the same time instead of scheduling them. With both outputs (lemma and factors) we are able to generate the final word using linguistic resources.

2

Neural Machine Translation

The encoder-decoder architecture used for NMT consists of two recurrent neural networks (RNN), one for the encoder and the other for the decoder. The encoder maps a source sequence into a continuous space representation and the decoder maps the representation back to a target sequence. Our trained neural translation models are based on a bidirectional encoder-decoder deep neural network equipped with an attention mechanism (Bahdanau et al., 2014), as described in Figure 2. Il

uj

4

pj

3

zj

a bien longtemps dans une galaxie lointaine , très lointaine

DECODER

5

y

Attention mechanism

va

lin

Softmax

i

↵1,6 ↵2,6

1

eij ↵ij = X eij

+

eij

↵11,6

↵6,6 ↵7,6

hi ( A long time ago in

a galaxy far

,

ENCODER

Ua

tanh

2

Wa

far away )

Figure 1: Architecture of the NMT system equipped with an attention mechanism. This architecture consists of a bidirectional RNN as an encoder (as seen in stage 1 of Figure 2). An input sentence is encoded in a sequence of annotations (one for each input word), corresponding to the concatenation of the outputs of a forward and a backward RNN. Each annotation represents the full sentence with a strong focus on the current word. The decoder is composed of a conditional RNN as provided for the DL4MT winter school1 (see stage 3 of Figure 2), equipped with an attention mechanism This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http:// creativecommons.org/licenses/by/4.0/ 1 https://github.com/nyu-dl/dl4mt-tutorial

(stage 2). The attention mechanism aims at providing weights for each annotation in order to generate a context vector (by performing a weighted sum over the annotations). The attention mechanism uses the hidden state at timestep j of the decoder RNN along with the annotation hi to generate a coefficient eij . A softmax operation is performed over those coefficients to generate the annotation weights αij . As described in (Bahdanau et al., 2014), the annotation weights can be used to align the input words to the output words. The RNN takes as input the context vector, the embedding of the previous output word (stage 4), and of course its hidden state. Finally, on stage 5 of the Figure 2, the output probabilities of the target vocabulary are computed. The word with the highest probability is selected to be the translation at each timestep. The encoder and the decoder are trained jointly to maximize the conditional probability of the correct translation.

3

Factors in Neural Machine Translation To perform factored neural machine translation, we need to extend the standard NMT architecture of the Figure 2 to allow generating several output symbols at the same time. For the sake of simplicity, we decided to generate only two symbols: the lemma and the concatenation of the different factors that we considered. For example, from the word devient, we obtain the lemma devenir and the factors vP3#s meaning that it is a verb, in Present, 3rd person, irrelevant gender (#) and singular. The morphological and grammatical analysis is performed with the MACAON toolkit (Nasr et al., 2011). Figure 2 shows this modification.

lemma

factors

uj pj

zj

+

Figure 2: Output detail of the Factored NMT architecture.

As we can see, lemmas and factors are generated separately, which in some cases, lead to sequences with different length. To solve this problem, we give priority to the length of the lemmas. Consequently, we constraint the length of the factors sequence to be equal to the length of the lemma sequence. This is motivated by the fact that the lemmas are closer to the final objective (a sequence of words) and that they are the symbols carrying most of the meaning. Another issue is the feedback that the RNN receives. In the word based model, this feedback is the embedding of the previous generated word. Since we have two outputs, we have to decide what will be given to the decoder RNN. Several options are possible and will be explored in this paper (details in section 4.3.2). For the first set of experiments, only the previous lemma embedding was used as feedback (no information from the factors output). 3.1

Handling beam search with factors

The beam search procedure has also been extended with respect to the original approach, since we are actually facing two beams (one for lemmas and one for factors). We need to deal with the multiple outputs because we do not want to rely solely on the lemma sequence to decide which are the best sequences. Then, we merge the two beams. Once the best lemma and factors hypotheses are generated for each partial hypothesis, the cross product of those output spaces is performed. By this mean, each lemma hypothesis is associated with each factors hypothesis. Afterwards, we keep the k-best combinations for each sample, with k being the beam size. Finally, the number of best hypotheses is reduced again to the beam size for further processing. 3.2

From factors to word

Once we obtain the factorized outputs from the neural network, we need to fall back to the word representation. This operation is also performed with the MACAON tool, which, given a lemma and some factors, provides the word candidate.

4

Experiments

We performed several sets of experiments trying different architectures and vocabulary sizes for Factored NMT (FNMT) and comparing them with the NMT system.

4.1

Data processing and selection

We evaluate our approach on the English to French Spoken Language Translation task from IWSLT 2015 evaluation campaign2 . A selection method (Rousseau, 2013) has been applied using the available parallel corpora (news-commentary, united-nations, europarl, wikipedia, and two crawled corpora) and Technology Entertainment Design (TED3 ) corpus as in-domain corpus. We also do a preprocessing to convert html entities and filter out the sentences with more than 50 words for both source and target languages. We finally end with a selected corpus of 2M sentences, 147K unique words for English side and 266K unique words for French side. 4.2

Training

We chose the following hyperparameters to train the systems. The embedding and recurrent layers have a dimensionality of 620 and 1000 respectively. We use a minibatch size of 80 sentences trained with Adadelta algorithm. The norm of the gradient is clipped to be no more than 1 (Pascanu et al., 2012) and the weights are initialized with Xavier (Glorot and Bengio, 2010). The validations start at the second epoch and are performed every 5000 updates. Early stopping is based on BLEU with a patience set to 10 (early stopping occurs after 10 evaluations without improvement in BLEU). The vocabulary size of the source languages is set to 30K. We varied the output layer size from 5K to 30K in order to simulate different levels of out of domain data. Once the model is trained, we set the beam size to 12 (as this is the standard value for NMT, (Bahdanau et al., 2014)) when translating the development corpus. 4.3

Factors models and results

The Factored NMT system aims at integrating linguistic knowledge into the decoder in order to obtain better performance when facing out of domain data and/or a low resource setup. To assess the feasibility and estimate the potential gain of our approach, we performed a set of experiments reducing the output vocabulary size, simulating such an environment. The results are presented in Table 1.

Model NMT FNMT NMT FNMT NMT FNMT NMT FNMT

Output size vocab. 30K 30K 30K+142 172K 20K 20K 20K+142 139K 10K 10K 10K+142 85K 5K 5K 5K+142 48K

Coverage (%) 97.96 99.23 (+1.27) 97.03 98.88 (+1.85) 94.51 97.72 (+3.21) 91.02 95.61 (+4.59)

#OOV 1775 784 2171 1014 3996 1897 6545 3424

#Par. 89.7M 89.8M 77.3M 77.4M 64.9M 64.9M 58.7M 58.7M

word 34.88 34.80 34.21 34.46 32.61 34.13 30.54 32.55

%BLEU lem. 37.78 37.52 37.07 35.22

fact. 42.72 42.65 42.75 42.98

Oracle word 36.33 36.14 35.72 33.86

Table 1: Comparison of the performance of the NMT and FNMT systems in terms of %BLEU score evaluating at word level, and separately, each output lemma and factors. The size of the output layer and the size of the corresponding vocabulary are presented in columns 2 and 3. Columns 4 and 5 show coverage in test dataset and number of OOVs, respectively. Last column corresponds to the oracle output. The FNMT system obtains a similar performance compared to the NMT system (first two rows) in terms of word level BLEU score, despite the increased complexity of the architecture of our model (and in particular the two outputs). In order to estimate the capacity of such a model, we computed the oracle which corresponds to ignore the errors caused by the factors, i.e. if we produce the correct lemma, then the correct word is generated (see last column of Table 1). We can see that a potential gain of more than 1.5% BLEU points can be achieved with a perfect modeling of the factors, which is encouraging. 2 3

IWSLT’15: https://sites.google.com/site/iwsltevaluation2015 TED: https://www.ted.com

The first comment is that the Factored NMT approach is able to model a bigger word vocabulary while preserving manageable output layers size. This is due to the fact that the factors-to-word tool is able to generate words which are unseen in the training corpus, augmenting the expressiveness of our model. For the sake of comparison, we provide the target vocabulary size for the standard NMT and the FNMT systems. For example, with an output layer size of 30K, the NMT system can model 30K words against 172K words for the FNMT system. This is an almost 6 times larger word vocabulary. One consequence is that the word coverage is higher for the FNMT than for the NMT system, as shown in column 4. However, for the first two systems (first two rows), we see that the difference between the coverages is small. When decreasing the output layer size, we can observe that the coverage decreases slowly for FNMT systems compared to the word based system. The FNMT approach surpasses standard NMT when the coverage difference becomes higher. This proves that the approach is sound and well performing, when dealing with out-of-domain data. This is of course dependent on the linguistic knowledge available in the factors-to-word tool. This is exactly the sought behavior: by integrating a priori linguistic knowledge, we reduce the impact of the training conditions (domain, data availability, etc.) on the performance of the system. The reduction of the out of vocabulary (OOV) rate of about 47% is a promising result which is not always well reflected by the BLEU score. These results would be better highlighted if performing a human evaluation (this point will not be addressed further in this paper). To make things clear, the OOV rate corresponds to the number of UNK tokens generated by our system. In those experiments, we did not use any specific method to replace them (e.g. put source words aligned to them, use a dictionary, etc.) Moreover, the number of parameters to train also decreases according to the size of the output layer, as shown in column 6, allowing a simpler training because we have to learn less weights in the model. For example, using a lemma output layer size of 10K instead of 30K (3 times smaller) for factored model, we obtain a small drop of 0.67 points in BLEU. By contrast, in NMT base model we observe a drop of 2.27 points in BLEU comparing the same output sizes 30K and 10K. Another interesting remark is that the scores evaluating in lemmas and factors are higher than the BLEU in words for both systems, this is due to the difficulty of the final step to generate the words. Nevertheless, the BLEU for factors are pretty low considering that the output layer size for this is only 142. This can be due to two different causes. First, the neural network is not able to correctly model this small output. Second, the task of translating from English words to French factors is complex. 4.3.1 Evaluating each output We evaluated BLEU at different levels (word, lemma or factors) using the base NMT system with only one output (see Table 2). We compare the values with the Factored NMT system results which models lemmas and factors at the same time. We observe that the difference between the results in BLEU for lemmas using the FNMT are similar to the NMT system. However, the differences evaluating factors are big between the two systems (2.44 difference of %BLEU). This experiment confirms that the task to predict factors managing very different output sizes respect to the source words is not easy. In future we will implement factors also in the input side of the neural network to verify this hypothesis. Also, we have to take into account that we are giving more priority to the length of the lemmas sequence than the factors one during beam search. This also suggests that we adapt our architecture so that factors are better predicted to obtain a final better BLEU at word evaluation.

Model NMT FNMT

word 34.88 34.80

%BLEU lemma factors 37.72 45.16 37.78 42.72

Table 2: Comparison of the performances between standard NMT system and the Factored NMT system in terms of %BLEU computed at word, lemma and factors level. The first line corresponds to 3 standard NMT systems built to generate at the output words, lemmas and factors, respectively.

4.3.2

Feedback

As explained in section 2, the decoder RNN is a conditional-GRU which is fed by the input context vector, its hidden state and the feedback (i.e. the previous generated symbol). Since we now have two outputs, we need to define what kind of feedback is more suitable for the Factored NMT system. Several solutions are possible. The first assumption we made is highly dependent on the design of the considered factors, i.e. the lemmas are the most informative factors among all. Then, we tried using only the output lemma embedding as feedback (see equation 2). Lemma feedback : EL [yj ]

(2)

where EL is the target language lemma lookup table and EL [yj ] is the embedding of the lemma used to generate the output word yi . Another straightforward operation is to sum the embeddings of the previous lemma with the embedding of the previous factors, as described in equation 3. Sum feedback : EL [yj ] + EF [yj ]

(3)

where yi is the target output word, and EL [yi ] and EF [yi ] are its corresponding lemma and factors embeddings. While this could seem unnatural, by doing this, we hope to obtain a joint vector representation of both the lemma and the factors. Finally, we investigated whether the neural network can learn a better combination of the lemmas and factors embeddings using a linear (eq. 4) or non-linear (eq. 5) operation instead of a simple sum. Linear feedback : Tanh feedback :

EL [yj ] · WL + EF [yj ] · WF

(4)

tanh (EL [yj ] · WL + EF [yj ] · WF )

(5)

where WL and WF are the parameters to be learned.

Model NMT FNMT FNMT FNMT FNMT

Feedback Lemma Sum Linear Tanh

word 34.88 34.80 34.48 34.42 34.58

%BLEU lemma 37.78 37.14 37.27 37.28

factors 42.72 44.46 44.03 43.96

#OOV 1775 784 815 868 757

Table 3: Performance in terms of %BLEU computed on word, lemma and factors when using different output embedding combinations as feedback. Table 3 presents the results obtained with systems integrating the different output embedding combinations as feedback. We can see that all systems perform similarly regarding BLEU score on words with a better result for the lemma feedback. As expected, when using only lemma as feedback, the system better estimates the lemmas probabilities, as a consequence, there is a significant reduction of the performance on factors. The comparison between the lemma %BLEU (fourth column of Table 3) and the number of OOVs (sixth column) shows a correlation between those two values, except when using non-linear combination which has the lowest value of OOVs. This tends to prove that modeling the lemmas better is important to reduce the OOV rate (confirming our assumption that lemmas are more informative) but not sufficient. In the future we would like to explore the combination of the two embeddings using its concatenation to see if we can get better results.

4.3.3

Dependency model lemma

One observation that can be made is that while generating factors could seem easier due to the small number of the possible outputs (only 142), the BLEU score is not as high as what we could expect. However, one could argue that generating a sequence of factors in French from a sequence of English words is not an easy task. In order to help the factors prediction, we contextualized the corresponding output with the lemma being generated. This creates a dependency between the lemma output and the factors output. The dependency has been implemented by including a transformer (see Figure 3) which projects the lemma embeddings into the hidden layer used to generate factors. The results by applying those two techniques are presented in Table 4.

Model NMT FNMT with dependency FNMT with dependency FNMT with dependency FNMT with dependency

Feedback Lemma Sum Linear Tanh

word 34.88 34.45 34.65 34.25 34.38

%BLEU lemma 37.45 37.34 37.02 37.09

factors

uj

T

pj +

zj

Figure 3: Dependency model

factors 42.15 44.35 43.57 43.82

#OOV 1775 770 800 822 915

Table 4: Results for dependency model In Table 4, we can observe that the dependency model does not improve the results in terms of %BLEU score on words from Table 3 using lemma, linear and tanh feedback. However, it improves using the sum feedback. For the sum feedback dependency model, we see that lemma BLEU output improves with respect to the same model without dependency. By contrast, the factors output obtains lower BLEU. This can occur because factors output receives more information from lemma and when the factors cost is back-propagated, the lemma output can improve the learning. We can also observe that if we improve lemma output it is more correlated to the word evaluation than if we improve factors output. Moreover, the number of the OOV are reduced for all the feedback combination excepting tanh feedback, which is not reflected by the automatic score. 4.3.4

Qualitative analysis

We have observed some of the translation outputs to better understand in what cases our FNMT system performs better or worse than the NMT system. Translation examples with better BLEU performance In the first two examples of Table 5, the FNMT system obtains better BLEU score than the NMT system. First example shows when our factored system can generate words when the NMT base system predicts unknown words. Firstly, the word lineage in source sentence is translated as the reference (ligne´e) by the FNMT system and mapped to UNK by the NMT base system. Secondly, the word adaptive is translated as adaptatifs by the FNMT system, the reference translation is adapt´es, but we can consider the FNMT choice a better translation. NMT system also mapped the word adaptive to UNK. In the second example, FNMT translation performs as the reference. We are able to generate the new word actualis´ee (actualiser+past participle+feminine+singular) that it is not in the shortlist of the NMT system vocabulary. This is due, on one hand, because the word actualis´ee appears 40 times in the word vocabulary of the NMT system so it is excluded from the shortlist. On the other hand, the lemma actualiser appears 172 times in the lemmas shortlist so it is included and we are able to generate

1

2

3

4

Src Ref NMT FNMT Src Ref NMT FNMT Src Ref NMT FNMT Src Ref NMT FNMT

set of adaptive choices that our lineage made de choix adapt´es e´ tablis par notre lign´ee de choix UNK que notre UNK a fait de choix adaptatifs que notre lign´ee a fait here ’s the updated version of this entry voici la version actualis´ee de cette entr´ee . voici la version mise a` jour de cette entr´ee . voici la version actualis´ee de cette entr´ee . i could draw i could paint je pouvais dessiner . je pouvais peindre . je pouvais dessiner . je pouvais peindre . je pourrais dessiner . je pouvais peindre . and it ’s a very easy question c’ est une question tr`es simple . c’ est une question tr`es simple . et c’ est une question tr`es facile .

Table 5: Examples of translations with NMT and Factored NMT.

actualis´ee from the lemma and factors outputs. These examples can show the potential of our FNMT system generating new words and reducing unknown words. Translations with lower BLEU performance We also have extracted some translations where we have seen a lower BLEU from the FNMT system with respect to the NMT base system (see Table 5). Example 3 shows a problem with the factors output, from the correct lemma pouvoir, the FNMT system has generated the word pourrais instead of pouvais. We can consider both translations as correct but BLEU score penalizes the FNMT translation. Finally, in the last example, we saw that the translation of the FNMT system is more correct than the NMT system because it translated the word and to et but in the reference is not included. In addition, FNMT system translated easy to a synonym (facile) of simple. Consequently, BLEU score penalizes this example in FNMT system being a correct translation.

5

Conclusion

In this paper, we have proposed an NMT architecture which produces a factored representation of the target language words. Those factors are based on linguistics a priori knowledge. We showed that we are able to train Factored NMT systems with similar performance to word based systems but with the advantage of modeling an almost 6 times bigger word vocabulary with only a slight increase of the computational cost. A consequence of that is the OOV rate reduction observed with the FNMT system. Also, the use of additional linguistic resources allows us to generate new word forms that would not be included in the standard NMT system shortlist. By reducing the target language vocabulary, we simulated an out-of-domain setup, and we showed that our factored NMT method performs better than the basic NMT system in this case. As future work, we would like to include linguistic features at the input. It is known that this can be helpful for NMT (Sennrich and Haddow, 2016). Extending the approach with input factors could make the target language factors generation simpler. This will be investigated in the future. The proposed Factored NMT method could even show better performance if applied on highly inflected languages like German, Arabic, Czech, Russian or Hindi on the target side.

References [Alexandrescu2006] Andrei Alexandrescu. 2006. Factored neural language models. In In HLT-NAACL. [Bahdanau et al.2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473. [Bilmes and Kirchhoff2003] Jeff A. Bilmes and Katrin Kirchhoff. 2003. Factored language models and generalized parallel backoff. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Companion Volume of the Proceedings of HLT-NAACL 2003–short Papers - Volume 2, NAACL-Short ’03, pages 4–6, Stroudsburg, PA, USA. Association for Computational Linguistics. [Cho et al.2014] Kyunghyun Cho, Bart van Merrienboer, C¸aglar G¨ulc¸ehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078. [Chung et al.2016] Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. CoRR, abs/1603.06147. [Costa-Juss`a and Fonollosa2016] Marta R. Costa-Juss`a and Jos´e A. R. Fonollosa. 2016. Character-based neural machine translation. CoRR, abs/1603.00810. [Firat et al.2016] Orhan Firat, KyungHyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. CoRR, abs/1601.01073. [Glorot and Bengio2010] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS’10). Society for Artificial Intelligence and Statistics. [Haddow and Koehn2012] Barry Haddow and Philipp Koehn, 2012. Interpolated backoff for factored translation models. Association for Machine Translation in the Americas, AMTA. [Jean et al.2014] S´ebastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. On using very large target vocabulary for neural machine translation. CoRR, abs/1412.2007. [Koehn et al.2007] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondˇrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL ’07, pages 177–180, Stroudsburg, PA, USA. Association for Computational Linguistics. [Le et al.2011] Hai-Son. Le, Ilya Oparin, Abdel. Messaoudi, Alexandre Allauzen, Jean-Luc Gauvain, and Franc¸ois Yvon. 2011. Large vocabulary SOUL neural network language models. In INTERSPEECH. [Ling et al.2015] Wang Ling, Isabel Trancoso, Chris Dyer, and Alan W. Black. 2015. Character-based neural machine translation. CoRR, abs/1511.04586. [Nasr et al.2011] Alexis Nasr, Fred´eric B´echet, Jean-Franc¸ois Rey, Benoˆıt Favre, and Joseph Le Roux. 2011. Macaon, an nlp tool suite for processing word lattices. In Proceedings of the ACL-HLT 2011 System Demonstrations, pages 86–91. [Pascanu et al.2012] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. Understanding the exploding gradient problem. CoRR, abs/1211.5063. [Rousseau2013] Anthony Rousseau. 2013. XenC: An open-source tool for data selection in natural language processing. The Prague Bulletin of Mathematical Linguistics, 100:73–82. [Sennrich and Haddow2016] Rico Sennrich and Barry Haddow. 2016. Linguistic input features improve neural machine translation. CoRR, abs/1606.02892. [Sennrich et al.2015] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. CoRR, abs/1508.07909.