Machine Translation for Languages of Limited Diffusion


Alina Karakanta
May 24, 2018
Universität des Saarlandes

Table of contents

1. Introduction
2. Improving MT methods for LoLD
3. Extending resources for MT
4. The role of translators


Introduction

Low-resource vs. Limited diffusion

“Low-resource” describes the data situation (little parallel data, few language-specific tools); “limited diffusion” describes the language itself (few speakers, limited geographic or institutional spread). The two often coincide, but not always.


The challenge: Data sparsity

The problem: LoLD cannot fully benefit from translation technologies because of a lack of data (usually parallel) and of language-specific tools.

The reason: data sparsity leads to out-of-vocabulary (OOV) words, which the system simply drops. Example translation output with OOVs missing: „I am to you for .“


Improving MT methods for LoLD

Triangulation/pivot

Translate through a resource-rich pivot language, or compose a source–pivot and a pivot–target phrase table into a new source–target table.

Razmara and Sarkar, 2013; Cohn and Lapata, 2007; Singla et al., 2014; Tiedemann, 2012
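The phrase-table variant of triangulation can be sketched in a few lines. This is a toy illustration, not code from any of the cited papers: two tables of translation probabilities (dicts of dicts) are composed by summing over shared pivot phrases; all phrases and probabilities below are invented.

```python
def triangulate(src_pivot, pivot_tgt):
    """Compose source->pivot and pivot->target phrase tables
    into a source->target table, summing over pivot phrases."""
    src_tgt = {}
    for s, pivots in src_pivot.items():
        for p, p_sp in pivots.items():
            for t, p_pt in pivot_tgt.get(p, {}).items():
                src_tgt.setdefault(s, {})
                src_tgt[s][t] = src_tgt[s].get(t, 0.0) + p_sp * p_pt
    return src_tgt

# Hypothetical Catalan->English table built via Spanish as pivot.
src_pivot = {"gat": {"gato": 0.9, "felino": 0.1}}
pivot_tgt = {"gato": {"cat": 0.8, "tomcat": 0.2}, "felino": {"cat": 1.0}}
table = triangulate(src_pivot, pivot_tgt)
print(round(table["gat"]["cat"], 2))  # 0.9*0.8 + 0.1*1.0 = 0.82
```

Summing over pivots is what makes this an ensemble of translation paths rather than a single chained translation.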


Transfer learning

Train a parent model on a high-resource language pair, then fine-tune (transfer) its parameters on the low-resource pair.

Zoph et al., 2016
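A minimal sketch of the parent-to-child initialization used in transfer learning for NMT, in the spirit of Zoph et al. (2016): embedding vectors for vocabulary shared with the high-resource parent are copied, and the rest are freshly initialized before fine-tuning. The vectors and vocabularies are toy data; no real model is trained here.

```python
import random

def init_child_embeddings(parent_emb, child_vocab, dim, seed=0):
    """Copy parent embeddings for shared words; randomly
    initialize embeddings for words unseen by the parent."""
    rng = random.Random(seed)
    child = {}
    for word in child_vocab:
        if word in parent_emb:
            child[word] = list(parent_emb[word])  # transferred
        else:
            child[word] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return child

parent = {"house": [0.1, 0.2], "water": [0.3, 0.4]}
child = init_child_embeddings(parent, ["house", "wasser"], dim=2)
```

The transferred parameters give the child model a strong starting point; fine-tuning on the small low-resource corpus then adapts them.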


Zero-shot

A single multilingual model trained on several language pairs can translate between pairs it never saw together during training.

Johnson et al., 2016; Lakew et al., 2017
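The input convention that enables zero-shot translation in Johnson et al. (2016) is an artificial token prepended to the source sentence, telling the single multilingual model which target language to produce. A sketch of that convention:

```python
def mark_target(src_sentence: str, tgt_lang: str) -> str:
    """Prepend the artificial target-language token, e.g. '<2es>'.
    Zero-shot translation is requested by choosing a language pair
    the model never saw together during training."""
    return f"<2{tgt_lang}> {src_sentence}"

print(mark_target("Hello, how are you?", "es"))  # <2es> Hello, how are you?
```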


Extending resources for MT

Parallel sentence extraction from comparable corpora
• bilingual dictionary induction
• parallel sentence/phrase extraction based on similarity measures (distributional, temporal, topic, and string) to improve
  • accuracy
  • coverage
  • MT quality in general

Irvine and Callison-Burch, 2013; Munteanu et al., 2004; Fišer and Ljubešić, 2011
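One simple similarity measure for sentence extraction is lexical overlap through a seed bilingual dictionary: candidate pairs are scored by how many dictionary translations of source words appear in the target sentence, and pairs above a threshold are kept. This is a toy sketch; the dictionary, corpora, and threshold are invented, and real systems combine several such signals.

```python
def overlap_score(src_tokens, tgt_tokens, bi_dict):
    """Fraction of dictionary-translatable source words whose
    translation appears in the target sentence."""
    translated = {bi_dict[w] for w in src_tokens if w in bi_dict}
    if not translated:
        return 0.0
    return len(translated & set(tgt_tokens)) / len(translated)

def extract_parallel(src_sents, tgt_sents, bi_dict, threshold=0.6):
    """Keep every sentence pair whose overlap score passes the threshold."""
    return [(s, t) for s in src_sents for t in tgt_sents
            if overlap_score(s.split(), t.split(), bi_dict) >= threshold]

# Invented German->English seed dictionary and comparable sentences.
bi_dict = {"hund": "dog", "katze": "cat", "beißt": "bites"}
src = ["der hund beißt", "die katze schläft"]
tgt = ["the dog bites the man", "birds sing in spring"]
pairs = extract_parallel(src, tgt, bi_dict)
```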


Closely-related languages

Taking advantage of similarities between languages:
• transform data from related (high-resource) languages (e.g. transliteration, word-for-word translation)
• use the transformed data as additional training data

Karakanta et al., 2017; Currey et al., 2016; Nakov and Ng, 2012
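The word-for-word transformation of related-language data can be sketched as a lexicon lookup with identity fallback, since cognates between closely related languages often transfer unchanged. The lexicon and sentences below are invented examples, not data from the cited papers.

```python
def adapt(sentence: str, lexicon: dict) -> str:
    """Map each word through a word-for-word lexicon; words without
    an entry are kept as-is (cognates often match anyway)."""
    return " ".join(lexicon.get(w, w) for w in sentence.split())

# Hypothetical Spanish->Catalan word lexicon.
lexicon = {"perro": "gos", "come": "menja"}
print(adapt("el perro come pan", lexicon))  # el gos menja pan
```

The adapted sentences are then added to the low-resource side's training data.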


Monolingual/Synthetic data

Monolingual data
• for phrase-based MT: language model
• for neural MT:
  • separately trained language models
  • copying target data to the source side
  • back-translation

Gülçehre et al., 2015; Sennrich et al., 2015; Currey et al., 2017
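The two synthetic-data routes above can be contrasted in one sketch: back-translation (Sennrich et al., 2015) produces the source side with a reverse MT system, while the copy method (Currey et al., 2017) simply duplicates the target sentence on the source side. Here `toy_reverse_mt` is a stand-in for a real target-to-source model.

```python
def synthetic_pairs(mono_tgt, reverse_mt=None):
    """Turn target-side monolingual sentences into (source, target)
    pairs: back-translate if a reverse model is given, else copy."""
    return [((reverse_mt(t) if reverse_mt else t), t) for t in mono_tgt]

def toy_reverse_mt(sentence):
    # Stand-in for a real target->source MT system.
    return f"[bt] {sentence}"

mono = ["a target sentence"]
copied = synthetic_pairs(mono)                  # copy method
backtr = synthetic_pairs(mono, toy_reverse_mt)  # back-translation
```

Either way, the synthetic pairs are mixed with the genuine parallel data during training.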


The role of translators

Implications for translators

Working together
• easily accessible, publicly available resources for LoLD
• stronger collaboration between translators and MT researchers
• a shared task?

Learning from each other
• as technologies for LoLD improve, they should be incorporated into the translation curriculum (training on MT and translation memories)
• computer literacy
• an open mind


Thank you!
[email protected]
www.researchgate.net/profile/Alina_Karakanta


References

Cohn, Trevor and Mirella Lapata (2007). “Machine Translation by Triangulation: Making Effective Use of Multi-Parallel Corpora”. In: Proceedings of the ACL.
Currey, Anna, Alina Karakanta, and Jon Dehdari (2016). “Using Related Languages to Enhance Statistical Language Models”. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop. San Diego, CA, USA: Association for Computational Linguistics, pp. 24–31. url: https://www.aclweb.org/anthology/N/N16/#2000.

Currey, Anna, Antonio Valerio Miceli Barone, and Kenneth Heafield (2017). “Copied Monolingual Data Improves Low-Resource Neural Machine Translation”. In: Proceedings of the Second Conference on Machine Translation. Copenhagen, Denmark: Association for Computational Linguistics, pp. 148–156. url: http://aclweb.org/anthology/W17-4715.
Fišer, Darja and Nikola Ljubešić (2011). “Bilingual lexicon extraction from comparable corpora for closely related languages”. In: Proceedings of Recent Advances in Natural Language Processing. Hissar, Bulgaria, pp. 125–131.
Gülçehre, Çaglar et al. (2015). “On Using Monolingual Corpora in Neural Machine Translation”. In: CoRR abs/1503.03535. url: http://arxiv.org/abs/1503.03535.
Irvine, Ann and Chris Callison-Burch (2013). Combining Bilingual and Comparable Corpora for Low Resource Machine Translation.

Johnson, Melvin et al. (2016). “Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation”. In: ArXiv preprint. url: https://arxiv.org/abs/1611.04558.
Karakanta, Alina, Jon Dehdari, and Josef van Genabith (2017). “Neural machine translation for low-resource languages without parallel corpora”. In: Machine Translation. issn: 1573-0573. doi: 10.1007/s10590-017-9203-5. url: https://doi.org/10.1007/s10590-017-9203-5.
Lakew, Surafel M. et al. (2017). “Improving Zero-Shot Translation of Low-Resource Languages”. In: Proceedings of the 14th International Workshop on Spoken Language Translation. Tokyo, Japan.

Munteanu, Dragos Stefan, Alexander M. Fraser, and Daniel Marcu (2004). “Improved Machine Translation Performance via Parallel Sentence Extraction from Comparable Corpora”. In: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2004, Boston, Massachusetts, USA, May 2–7, 2004, pp. 265–272. url: http://aclweb.org/anthology/N/N04/N04-1034.pdf.
Nakov, Preslav and Hwee Tou Ng (2012). “Improving statistical machine translation for a resource-poor language using related resource-rich languages”. In: Journal of Artificial Intelligence Research, pp. 179–222.
Razmara, Majid and Anoop Sarkar (2013). “Ensemble triangulation for statistical machine translation”. In: Proceedings of the Sixth International Joint Conference on Natural Language Processing, pp. 252–260.

Sennrich, Rico, Barry Haddow, and Alexandra Birch (2015). “Improving Neural Machine Translation Models with Monolingual Data”. In: CoRR abs/1511.06709. url: http://arxiv.org/abs/1511.06709.
Singla, Karan et al. (2014). “Reducing the Impact of Data Sparsity in Statistical Machine Translation”. In: Proceedings of SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 25 October 2014, pp. 51–56. url: http://aclweb.org/anthology/W/W14/W14-4006.pdf.
Tiedemann, Jörg (2012). “Character-based Pivot Translation for Under-resourced Languages and Domains”. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. EACL ’12. Avignon, France: Association for Computational Linguistics, pp. 141–151. isbn: 978-1-937284-19-0. url: http://dl.acm.org/citation.cfm?id=2380816.2380836.

Zoph, Barret et al. (2016). “Transfer Learning for Low-Resource Neural Machine Translation”. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: Association for Computational Linguistics, pp. 1568–1575. url: https://aclweb.org/anthology/D16-1163.