Machine Translation for Languages of Limited Diffusion


Alina Karakanta
May 24, 2018
Universität des Saarlandes

Table of contents

1. Introduction
2. Improving MT methods for LoLD
3. Extending resources for MT
4. The role of translators


Introduction

Low-resource vs. Limited diffusion

“Low-resource” describes the data situation (little parallel data, few language-specific tools); “limited diffusion” describes the language itself (few speakers, limited geographic or institutional spread). The two often coincide, but not always.


The challenge: Data sparsity

The problem: LoLD cannot fully benefit from translation technologies because of a lack of data (usually parallel) and of language-specific tools.

The reason: data sparsity leads to out-of-vocabulary (OOV) words, which the system simply drops. Example translation output with OOVs missing: „I am to you for .“


Improving MT methods for LoLD

Triangulation/pivot

Translate through a resource-rich pivot language, or compose a source–pivot and a pivot–target phrase table into a new source–target table.

Razmara and Sarkar, 2013; Cohn and Lapata, 2007; Singla et al., 2014; Tiedemann, 2012
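The phrase-table variant of triangulation can be sketched in a few lines. This is a toy illustration, not code from any of the cited papers: two tables of translation probabilities (dicts of dicts) are composed by summing over shared pivot phrases; all phrases and probabilities below are invented.

```python
def triangulate(src_pivot, pivot_tgt):
    """Compose source->pivot and pivot->target phrase tables
    into a source->target table, summing over pivot phrases."""
    src_tgt = {}
    for s, pivots in src_pivot.items():
        for p, p_sp in pivots.items():
            for t, p_pt in pivot_tgt.get(p, {}).items():
                src_tgt.setdefault(s, {})
                src_tgt[s][t] = src_tgt[s].get(t, 0.0) + p_sp * p_pt
    return src_tgt

# Hypothetical Catalan->English table built via Spanish as pivot.
src_pivot = {"gat": {"gato": 0.9, "felino": 0.1}}
pivot_tgt = {"gato": {"cat": 0.8, "tomcat": 0.2}, "felino": {"cat": 1.0}}
table = triangulate(src_pivot, pivot_tgt)
print(round(table["gat"]["cat"], 2))  # 0.9*0.8 + 0.1*1.0 = 0.82
```

Summing over pivots is what makes this an ensemble of translation paths rather than a single chained translation.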


Transfer learning

Train a parent model on a high-resource language pair, then fine-tune (transfer) its parameters on the low-resource pair.

Zoph et al., 2016
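A minimal sketch of the parent-to-child initialization used in transfer learning for NMT, in the spirit of Zoph et al. (2016): embedding vectors for vocabulary shared with the high-resource parent are copied, and the rest are freshly initialized before fine-tuning. The vectors and vocabularies are toy data; no real model is trained here.

```python
import random

def init_child_embeddings(parent_emb, child_vocab, dim, seed=0):
    """Copy parent embeddings for shared words; randomly
    initialize embeddings for words unseen by the parent."""
    rng = random.Random(seed)
    child = {}
    for word in child_vocab:
        if word in parent_emb:
            child[word] = list(parent_emb[word])  # transferred
        else:
            child[word] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return child

parent = {"house": [0.1, 0.2], "water": [0.3, 0.4]}
child = init_child_embeddings(parent, ["house", "wasser"], dim=2)
```

The transferred parameters give the child model a strong starting point; fine-tuning on the small low-resource corpus then adapts them.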


Zero-shot

A single multilingual model trained on several language pairs can translate between pairs it never saw together during training.

Johnson et al., 2016; Lakew et al., 2017
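The input convention that enables zero-shot translation in Johnson et al. (2016) is an artificial token prepended to the source sentence, telling the single multilingual model which target language to produce. A sketch of that convention:

```python
def mark_target(src_sentence: str, tgt_lang: str) -> str:
    """Prepend the artificial target-language token, e.g. '<2es>'.
    Zero-shot translation is requested by choosing a language pair
    the model never saw together during training."""
    return f"<2{tgt_lang}> {src_sentence}"

print(mark_target("Hello, how are you?", "es"))  # <2es> Hello, how are you?
```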


Extending resources for MT

Parallel sentence extraction from comparable corpora
• bilingual dictionary induction
• parallel sentence/phrase extraction based on similarity measures (distributional, temporal, topic, and string) to improve
  • accuracy
  • coverage
  • MT quality in general

Irvine and Callison-Burch, 2013; Munteanu et al., 2004; Fišer and Ljubešić, 2011
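One simple similarity measure for sentence extraction is lexical overlap through a seed bilingual dictionary: candidate pairs are scored by how many dictionary translations of source words appear in the target sentence, and pairs above a threshold are kept. This is a toy sketch; the dictionary, corpora, and threshold are invented, and real systems combine several such signals.

```python
def overlap_score(src_tokens, tgt_tokens, bi_dict):
    """Fraction of dictionary-translatable source words whose
    translation appears in the target sentence."""
    translated = {bi_dict[w] for w in src_tokens if w in bi_dict}
    if not translated:
        return 0.0
    return len(translated & set(tgt_tokens)) / len(translated)

def extract_parallel(src_sents, tgt_sents, bi_dict, threshold=0.6):
    """Keep every sentence pair whose overlap score passes the threshold."""
    return [(s, t) for s in src_sents for t in tgt_sents
            if overlap_score(s.split(), t.split(), bi_dict) >= threshold]

# Invented German->English seed dictionary and comparable sentences.
bi_dict = {"hund": "dog", "katze": "cat", "beißt": "bites"}
src = ["der hund beißt", "die katze schläft"]
tgt = ["the dog bites the man", "birds sing in spring"]
pairs = extract_parallel(src, tgt, bi_dict)
```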


Closely-related languages

Taking advantage of similarities between languages:
• transform data from related (high-resource) languages (e.g. transliteration, word-for-word translation)
• use the transformed data as additional training data

Karakanta et al., 2017; Currey et al., 2016; Nakov and Ng, 2012
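The word-for-word transformation of related-language data can be sketched as a lexicon lookup with identity fallback, since cognates between closely related languages often transfer unchanged. The lexicon and sentences below are invented examples, not data from the cited papers.

```python
def adapt(sentence: str, lexicon: dict) -> str:
    """Map each word through a word-for-word lexicon; words without
    an entry are kept as-is (cognates often match anyway)."""
    return " ".join(lexicon.get(w, w) for w in sentence.split())

# Hypothetical Spanish->Catalan word lexicon.
lexicon = {"perro": "gos", "come": "menja"}
print(adapt("el perro come pan", lexicon))  # el gos menja pan
```

The adapted sentences are then added to the low-resource side's training data.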


Monolingual/Synthetic data

Monolingual data
• for phrase-based MT: language model
• for neural MT:
  • separately trained language models
  • copying target data to the source side
  • back-translation

Gülçehre et al., 2015; Sennrich et al., 2015; Currey et al., 2017
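The two synthetic-data routes above can be contrasted in one sketch: back-translation (Sennrich et al., 2015) produces the source side with a reverse MT system, while the copy method (Currey et al., 2017) simply duplicates the target sentence on the source side. Here `toy_reverse_mt` is a stand-in for a real target-to-source model.

```python
def synthetic_pairs(mono_tgt, reverse_mt=None):
    """Turn target-side monolingual sentences into (source, target)
    pairs: back-translate if a reverse model is given, else copy."""
    return [((reverse_mt(t) if reverse_mt else t), t) for t in mono_tgt]

def toy_reverse_mt(sentence):
    # Stand-in for a real target->source MT system.
    return f"[bt] {sentence}"

mono = ["a target sentence"]
copied = synthetic_pairs(mono)                  # copy method
backtr = synthetic_pairs(mono, toy_reverse_mt)  # back-translation
```

Either way, the synthetic pairs are mixed with the genuine parallel data during training.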


The role of translators

Implications for translators

Working together
• easily accessible, publicly available resources for LoLD
• stronger collaboration between translators and MT researchers
• a shared task?

Learning from each other
• as technologies for LoLD improve, they should be incorporated into the translation curriculum (training on MT and translation memories)
• computer literacy
• an open mind


Thank you!
[email protected]
www.researchgate.net/profile/Alina_Karakanta


References

Cohn, Trevor and Mirella Lapata (2007). “Machine Translation by Triangulation: Making Effective Use of Multi-Parallel Corpora”. In: Proceedings of the ACL.
Currey, Anna, Alina Karakanta, and Jon Dehdari (2016). “Using Related Languages to Enhance Statistical Language Models”. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop. San Diego, CA, USA: Association for Computational Linguistics, pp. 24–31. url: https://www.aclweb.org/anthology/N/N16/#2000.

Currey, Anna, Antonio Valerio Miceli Barone, and Kenneth Heafield (2017). “Copied Monolingual Data Improves Low-Resource Neural Machine Translation”. In: Proceedings of the Second Conference on Machine Translation. Copenhagen, Denmark: Association for Computational Linguistics, pp. 148–156. url: http://aclweb.org/anthology/W17-4715.
Fišer, Darja and Nikola Ljubešić (2011). “Bilingual lexicon extraction from comparable corpora for closely related languages”. In: Proceedings of Recent Advances in Natural Language Processing. Hissar, Bulgaria, pp. 125–131.
Gülçehre, Çaglar et al. (2015). “On Using Monolingual Corpora in Neural Machine Translation”. In: CoRR abs/1503.03535. url: http://arxiv.org/abs/1503.03535.
Irvine, Ann and Chris Callison-Burch (2013). Combining Bilingual and Comparable Corpora for Low Resource Machine Translation.

Johnson, Melvin et al. (2016). “Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation”. In: ArXiv preprint. url: https://arxiv.org/abs/1611.04558.
Karakanta, Alina, Jon Dehdari, and Josef van Genabith (2017). “Neural machine translation for low-resource languages without parallel corpora”. In: Machine Translation. issn: 1573-0573. doi: 10.1007/s10590-017-9203-5. url: https://doi.org/10.1007/s10590-017-9203-5.
Lakew, Surafel M. et al. (2017). “Improving Zero-Shot Translation of Low-Resource Languages”. In: Proceedings of the 14th International Workshop on Spoken Language Translation. Tokyo, Japan.

Munteanu, Dragos Stefan, Alexander M. Fraser, and Daniel Marcu (2004). “Improved Machine Translation Performance via Parallel Sentence Extraction from Comparable Corpora”. In: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2004, Boston, Massachusetts, USA, May 2–7, 2004, pp. 265–272. url: http://aclweb.org/anthology/N/N04/N04-1034.pdf.
Nakov, Preslav and Hwee Tou Ng (2012). “Improving statistical machine translation for a resource-poor language using related resource-rich languages”. In: Journal of Artificial Intelligence Research, pp. 179–222.
Razmara, Majid and Anoop Sarkar (2013). “Ensemble triangulation for statistical machine translation”. In: Proceedings of the Sixth International Joint Conference on Natural Language Processing, pp. 252–260.

Sennrich, Rico, Barry Haddow, and Alexandra Birch (2015). “Improving Neural Machine Translation Models with Monolingual Data”. In: CoRR abs/1511.06709. url: http://arxiv.org/abs/1511.06709.
Singla, Karan et al. (2014). “Reducing the Impact of Data Sparsity in Statistical Machine Translation”. In: Proceedings of SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 25 October 2014, pp. 51–56. url: http://aclweb.org/anthology/W/W14/W14-4006.pdf.
Tiedemann, Jörg (2012). “Character-based Pivot Translation for Under-resourced Languages and Domains”. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. EACL ’12. Avignon, France: Association for Computational Linguistics, pp. 141–151. isbn: 978-1-937284-19-0. url: http://dl.acm.org/citation.cfm?id=2380816.2380836.

Zoph, Barret et al. (2016). “Transfer Learning for Low-Resource Neural Machine Translation”. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: Association for Computational Linguistics, pp. 1568–1575. url: https://aclweb.org/anthology/D16-1163.