Translation Memory for a Machine Translation System using the Hadoop framework

Anuradha Tomar, Jahnavi Bodhankar, Pavan Kurariya, Priyanka Jain, Anuradha Lele, Hemant Darbari
Centre for Development of Advanced Computing, Pune, India
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Virendrakumar C. Bhavsar
Faculty of Computer Science, University of New Brunswick, Fredericton, NB, Canada
[email protected]

Abstract - A Machine Translation System (MTS) that uses the Tree Adjoining Grammar (TAG) is considered. To improve the response time of our online MTS, we propose the use of a translation memory (TM). The integrated architecture of MTS with TM is outlined. Several examples of language-dependent TM tools and the translation process are given. To further speed up the translation process, we port the MTS onto a computing cluster that uses the Hadoop framework and carry out distributed execution. The computational experiments demonstrate that substantial speedups can be obtained by using the Hadoop framework.

Keywords - Translation Memory (TM), Machine Translation (MT), Tree Adjoining Grammar (TAG), high performance computing, Hadoop framework, distributed architecture.

I. INTRODUCTION

The global demand for translation is increasing sharply without a corresponding increase in the number of translation professionals. A Translation Memory (TM) is a segmented, aligned, parsed and classified corpus of already-translated examples, used to ease some of the growing pressures on the industry. It works as a reference model that assists existing professionals with new translations. It gives high-quality, consistent translation when the similarity between texts is high, as in updated product manuals. Beyond being a solution for human translation professionals, TM can be an answer for a complex online translation system seeking to meet industry demands. Nowadays, applications are adopting hybrid approaches in which TM is integrated with an MTS for performance optimization in terms of time and accuracy. Most of the approaches published to date use TM to improve the performance of Statistical Machine Translation (SMT) and Example-Based Machine Translation (EBMT) systems [2][6][7][9]; these systems use a fuzzy matching algorithm to search for the best match, with an accuracy below 70% [13]. Moreover, this approach depends on lexicon sequencing, so pre-translated texts are not fully utilized. To overcome the language dependency of TM and to efficiently utilize pre-translated text, we have proposed a Machine Translation System (MTS) built on top of a language-independent TM technology based on the Tree Adjoining Grammar (TAG) [1]. The TAG target structure is independent of the input sentence's lexicon, as it works on sentence structure. To build an offline repository, or TM, of unique TAG target structures for a set of input sentences, we have designed and implemented a distributed batch processing infrastructure for large data processing using the Hadoop framework [3]. An offline MTS is ported onto the Hadoop framework to determine the applicability of TM technology and to illustrate the benefits of TM in reducing translation time, expense and energy.

This paper is organized as follows. We first briefly introduce related research in Section II. Section III presents the working of the MTS with and without the use of TM. The computational experiments performed with the Hadoop implementation are described in Section IV and their results are discussed in Section V. Finally, concluding remarks are given in Section VI.

II. RELATED WORK

The basic ideas behind TM technology arose in the late 1960s and early 1970s, with the aim of providing translators with a number of different resources that they could use as reference tools to carry out their work more accurately and efficiently. With the growth in both the storage capability and the processing power of personal computers in the 1980s, a number of commercial computer-aided translation tools evolved considerably [5][7]. These tools work on a segmentation and pattern recognition algorithm: the source-language text is parsed segment by segment, substituting previous translations automatically until a segment is reached that has not been previously translated. Apart from complete matches with the source sentence to be translated, they implement fuzzy matches, where the source sentence has substantial commonality with a previously translated sentence; T1 and LOGOS are such tools [5]. However, these tools have mostly concentrated on the complementary features offered by integrated translation environments, rather than on significant increases in the level of reuse of previous translations. The availability of huge multilingual corpora, the wide range of domains covered and the need for quick response times have pushed the technology toward the integration of TM with MTS. Much work has been reported to date in which Statistical Models [2] and Example-Based Machine Translation (EBMT) systems use TM for efficient translation. Biçici and Dymetman [2] and Simard and Isabelle [9] designed MTSs that use TM output to extract new phrase pairs that supplement the SMT phrase table. However, this approach does not guarantee that the SMT system will select the phrases stored in the TM, even if a heavy weight is applied to them. Another work on SMT is published by

Zhechev and Genabith [6], who use phrase-based sub-tree structural alignment to discover parts of the input sentence that correspond to parts of the suggested translation extracted from the TM database. This approach implements an alignment algorithm for phrases and a Fuzzy-Match Score (FMS) for best-match selection. A major drawback of the above approaches is that the SMT output is not very reliable, as it may give different output for the same set of sentences, and even the FMS algorithm implemented for finding the best TM match has an accuracy below 70%, as reported by the authors [13]. The fuzzy matching algorithm fails most of the time because it depends on the sequence of lexical items in a sentence or passage. A fuzzy match is, after all, significantly more help to a translator than no match at all. The main obstacle on the road to fully exploiting the capabilities offered by Translation Memory is the fact that the translation industry is still focused on applying TM technology at the segment level, foregoing the advantages offered by treating translation databases as large parallel corpora.

III. MACHINE TRANSLATION SYSTEM (MTS) BASED ON TRANSLATION MEMORY

Our project is a domain-specific, web-based, multi-user MT application. It includes various inter-dependent modules such as pre-processing, pre-parser, TAG-based parser and generator, morphological analyzer and synthesizer, as shown in Figure 1.

Figure 1. Architecture of MTS

Figure 2. Architecture of MTS with Translation Memory

The pre-processing module divides the text into paragraphs, paragraphs into sentences and sentences into words. Basic chunking of lexicons is also performed in this module on the basis of Named-Entity Recognition (NER) rules [10] and linguistic heuristics of phrase marking. In the pre-parser module, words are classified by lexical category; it is thus called a Part-of-Speech (POS) tagger. The TAG parser syntactically and semantically compiles the POS output of the source sentence and converts it into derivation tree format. The TAG generator takes the compiled output of the parser and interprets it into the target language. TAG parsing and generation involve substantial computational complexity in terms of tree selection, substitution, and adjunction of one tree on another. The only auxiliary requirement is the set of initial and auxiliary trees for the languages to be translated. TAG generates tree-adjoining languages, a strict superset of the context-free languages, and the complexity of its parsing algorithms is O(n^6) in time and O(n^4) in space with respect to the length n of the input string [1]. Parsing therefore needs very high processing time and memory. Subsequently, a language synthesizer module carries out further smoothing of the output for fluency.

The MTS is a sequential process: every sentence must pass through all of the modules, which is time consuming, whereas an online translation engine is expected to generate output very fast. To achieve this, we have already carried out data and task parallelization [11][14]. Data parallelism (also known as single program, multiple data (SPMD)) is the simultaneous execution on multiple cores of the same function across the elements of a dataset; we use a load balancing algorithm to distribute multiple sentences [12]. We have also implemented multithreading between various modules at the architectural level, i.e. task parallelism. Since lexicon fetching from the database requires information produced by the pre-parser and is independent of the engine module, it can be executed in parallel with the translation engine, generating the target lexicon and the target structure respectively. After parallelizing the system, we achieved a remarkable improvement in execution time [11]. Since MT is a complex and time-consuming job [4], we keep looking for ways to achieve further speedup. In this paper, we create a fully automated TM in which a category sequence is aligned with the source derivation structure and the target derived tree [8] produced by the MT system. This TM can be reused by our MTS whenever the same derivation structure is encountered.
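The sentence-level data parallelism described in this section can be sketched in a few lines. All names here (`translate_sentence`, `translate_batch`) are illustrative stand-ins, not the actual MTS API, and a thread pool stands in for the multi-core setup the system actually uses; the pool's internal work queue plays the role of a simple dynamic load balancer.

```python
# Sketch of SPMD-style data parallelism: the same translation function
# is applied across the elements of a dataset, distributed to workers.
from concurrent.futures import ThreadPoolExecutor

def translate_sentence(sentence: str) -> str:
    # Stand-in for the real pipeline:
    # pre-processor -> pre-parser -> TAG parser/generator -> synthesizer
    return f"<translated:{sentence}>"

def translate_batch(sentences, workers=4):
    # One function, many data items; map preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(translate_sentence, sentences))
```

In a deployment, `translate_sentence` would invoke the full module chain, and workers would be processes or cluster nodes rather than threads.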

To explain the working of the MTS with and without TM, two sentences with the same structure are taken as a case study. Figures 3 and 6 show the translation steps, including intermediate outputs, for each sentence with and without TM. In Figure 3, 'Kanak Vrindavan is a famous garden in Jaipur City' is the input sentence given for translation to the pre-processing step A. The input sentence is converted into 5 chunks during pre-processing, as shown in step B. The pre-parser module then performs POS tagging on each token of the pre-processed sentence: 0=Kanak-Vrindavan=NN, 1=is=VBZ, 2=a-famous-garden=NN, 3=in=PRP, 4=Jaipur-City=NN. The initials of the POS category of each token are taken to form a string called the category sequence, i.e. NVNPN for this particular input sentence. This category sequence is the input for the TAG-based parser, and a derivation tree [8] is created as shown in Figure 4. The derived tree of a given category sequence will always have the same structure, and many input sentences can fall under one category sequence irrespective of the lexicons involved. The derivation structure [8] obtained by the parser is given to the TAG-based generator to generate a target structure, as shown in Figure 5. Finally, the derived target structure is given to the morph synthesizer, where the structure is lexicalized with target-language words to give the translated output.

Figure 3. An example of translation steps in MTS when an English sentence is given for translation.

Figure 4. Derived tree for input sentence: 'Kanak Vrindavan is a famous garden in Jaipur City'

Figure 5. Derived tree of target output: 'कनक वृंदावन जयपुर शहर में एक प्रसिद्ध उद्यान है'

The workflow of the MTS changes when TM is integrated at the backend, as shown in Figure 6. 'The Amber Fort is a classic example of Mughal and Hindu architecture' is given as input to our online MTS with TM integrated in the backend. The pre-processing module forms 5 chunks for the sentence. POS tagging is then done on each token: 0=The-Amber-Fort=NN, 1=is=VBZ, 2=a-classic-example=NN, 3=of=PRP, 4=Mughal-and-Hindu-architecture=NN. The category sequence generated for the second sentence is NVNPN, the same as that of the first sentence in Figure 3. So both sentences have the same TAG source derived tree, as shown in Figures 4 and 7, and clearly they also have the same TAG target derived tree structure, as in Figures 5 and 8. Thus, the same parsed and generated output can be reused again and again, irrespective of the lexical items involved, by performing a TM lookup, thereby avoiding re-parsing and re-generation.

Figure 6. An example of translation steps when a sentence with the same structure as one stored in TM is input for translation.

Figure 7. Derived tree for input sentence: 'The Amber Fort is a classic example of Mughal and Hindu architecture'

Figure 8. Derived tree of target output: 'अम्बर किला मुगल और हिन्दू स्थापत्य कला का एक उत्कृष्ट उदाहरण है'

TM saves computational power and processing time for translation, increasing the efficiency of the system; therefore, the larger the TM, the greater the system speedup. However, as the size of the data in the TM increases, lookup and searching take more time. To reduce search time we use a MySQL database; MySQL indexes are B-trees, which means a prefix of the indexed column can be found very quickly. MySQL also supports full-text searching, which creates an index of all the words in a column so that they can be found quickly. Thus, searching is no longer an overhead. Our aim now is to create a large offline database, or TM, in which category sequences are aligned with the corresponding derived structures of source and target sentences.

IV. COMPUTATIONAL EXPERIMENTS

The Hadoop framework [3] is a widely used open-source framework for large-scale data processing with a distributed batch processing infrastructure. Translation of a large input file (in our case 10,000 sentences) of unique structures is a tedious task for the offline MTS; it takes nearly 10 days on a desktop machine to complete the translation. A translation system is one of the best applications to utilize the resources of Hadoop. So, to reduce the translation time in online translation we use TM, and to reduce the translation time for the offline process we use Hadoop. Experiments are performed at two levels: first, a parallel MTS is designed and implemented [11]; second, the parallel MTS integrated with TM is ported onto Hadoop. The atomic actions of the proposed TM engine can be described as: (a) translate a single source segment; and (b) add a new translation unit to the TM, i.e. store in a database a pair of source and target structures obtained from the TAG-based parser and generator.
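The TM behaviour described in this section — keying stored translations by category sequence, reusing the target structure on a hit, and adding a new translation unit on a miss — can be sketched roughly as follows. All names are hypothetical, the tag map is a toy, and plain strings stand in for the TAG derivation and derived structures the real system stores.

```python
# Minimal sketch of the TM engine's two atomic actions:
# (a) translate a source segment, reusing a stored target structure
#     when its category sequence is already in the TM, and
# (b) add a new translation unit after a full parse/generate run.
POS_INITIAL = {"NN": "N", "VBZ": "V", "PRP": "P"}  # toy tag-to-initial map

def category_sequence(tagged):
    # e.g. [("Kanak-Vrindavan","NN"), ("is","VBZ"), ...] -> "NVNPN"
    return "".join(POS_INITIAL[tag] for _, tag in tagged)

class TranslationMemory:
    def __init__(self):
        self.store = {}  # category sequence -> target derived structure

    def translate(self, tagged, parse_and_generate):
        key = category_sequence(tagged)
        if key in self.store:                 # TM hit: skip re-parsing
            return self.store[key], True
        target = parse_and_generate(tagged)   # full TAG parse + generate
        self.store[key] = target              # atomic action (b)
        return target, False
```

With this sketch, the two case-study sentences both map to "NVNPN": the first triggers a full parse and is stored, the second is answered directly from the store, with only the final lexicalization step left to the synthesizer.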

Figure 9. Multithreading in MTS using Hadoop framework

Figure 9 shows the architecture of our multi-threaded MTS implementing data parallelism on the Hadoop framework. A cluster of homogeneous machines is formed with the Hadoop master-slave configuration. Sentences are divided into chunks by the master node and distributed among the Datanodes (slaves). The entire system works in parallel, with independently configured databases on each Datanode, so that each TaskTracker can access its own local database, avoiding the congestion caused by a central database. The local databases of stored structures are aggregated on the master node to form a TM for our online MTS. Hence, the larger the TM, the faster translation can be done, improving MTS response time, efficiency and throughput.
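The per-Datanode databases and their aggregation on the master node might be sketched as below. SQLite stands in for the MySQL databases the system actually uses, and the table and column names are illustrative; the index on the category-sequence column mirrors the B-tree indexing discussed in Section III.

```python
# Sketch: each Datanode keeps a local TM table; the master pulls every
# node's rows into one central TM, keeping one entry per category sequence.
import sqlite3

DDL = """CREATE TABLE IF NOT EXISTS tm (
    category_sequence TEXT,
    source_structure  TEXT,
    target_structure  TEXT
);
CREATE INDEX IF NOT EXISTS idx_catseq ON tm (category_sequence);"""

def make_node_db(rows):
    db = sqlite3.connect(":memory:")
    db.executescript(DDL)
    db.executemany("INSERT INTO tm VALUES (?,?,?)", rows)
    return db

def aggregate(master, node_dbs):
    # Deduplicate: a category sequence already known to the master
    # is not stored again, since its derived structures are identical.
    for node in node_dbs:
        for row in node.execute("SELECT * FROM tm"):
            hit = master.execute(
                "SELECT 1 FROM tm WHERE category_sequence=?", (row[0],)
            ).fetchone()
            if hit is None:
                master.execute("INSERT INTO tm VALUES (?,?,?)", row)
    return master
```

The deduplication step reflects the fact that every sentence with the same category sequence shares one source and one target derived tree, so the aggregated TM only needs one row per sequence.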

V. RESULTS AND ANALYSIS

An isolated system with a hardware configuration of four Intel Core i5-2400 3.10 GHz processors and 3240 MB of RAM takes 253 hours to complete the translation of 10,000 sentences. To analyze the performance of the new MTS on the Hadoop framework in terms of time and speedup, experiments are carried out with a single-threaded MTS and a multi-threaded MTS, varying the number of nodes in the Hadoop cluster and the size of the data file for translation. Each machine with the above-mentioned configuration runs Ubuntu 12.10. Hadoop version 1.0.4 is used to set up a cluster of 4 machines, forming an environment of 16 cores. MySQL 6.0 is used as the database. We carried out three experiments: (1) the single-threaded MTS is ported on a single node and on a multinode cluster; (2) a multithreaded MTS is designed and ported on the Hadoop setup, so that more than one sentence can be translated when given simultaneously; and (3) the size of the input file is varied on the multithreaded translation system ported on Hadoop.

Experiment 1 - The first experiment is carried out in two phases by porting the single-threaded MTS on Hadoop with and without an integrated TM at the backend. The experiment is performed on a set of 10,000 sentences (file size 1.33 MB); the same set is used for multinode processing, where each node of the multinode cluster receives a unique sentence for translation. In the first phase the TM is kept empty, and each sentence is translated by processing through the various MT modules. In the second phase, translation is carried out with the same set of data, but this time with the TM at the backend, avoiding the re-translation of sentences with the same TAG structure.

Experiment 2 - In this experiment a multithreaded MTS is designed for the Hadoop setup. In this case, multiple (3 at a time) distinct sentences can be translated concurrently on a single node of the Hadoop cluster. Sets of 3 sentences from the set of 10,000 are given to multiple nodes for translation. The experiment is first carried out without TM, and then with TM to avoid re-translation.

Performance comparison of experiments 1 and 2 - Table I contains the performance results of both experiments: the first column gives the number of Hadoop nodes, the second pair of columns gives the times when a single sentence is given as input to the MTS with and without TM integration, and the third pair gives the times when multiple sentences are given to the multithreaded MTS with and without TM integration. From Table I it can be clearly seen that time is reduced by 18 hours 6 minutes when a single sentence is given for translation to a single node with TM at the backend. When a unique sentence is given to each node of the multinode cluster for translation, the execution times are reduced by 9 hours 10 minutes with 2 nodes, 7 hours 28 minutes with 3 nodes and 4 hours 25 minutes with 4 nodes. The time is further reduced when multiple sentences (3 at a time) are given to a single node in experiment 2: with no TM it takes 28 hours 18 minutes, whereas with TM 3 hours 13 minutes are required for translating the same set of sentences. With the use of TM and 4 nodes, the execution time reduces to 1 hour 7 minutes.

TABLE I. Single-sentence and multiple-sentence execution times on multi-node clusters.

No. of cluster   Single sentence input to each node,   Multiple sentences input to each node,
nodes            Time (mins)                           Time (mins)
                 MTS without TM   MTS with TM          MTS without TM   MTS with TM
1                1698             1478                 392              193
2                923              750                  200              139
3                709              555                  167              68
4                627              407                  142              67

Figure 10. Performance comparison of single sentence with and without TM on varying number of nodes.

Figure 11. Performance comparison of multiple sentences with and without TM on varying number of nodes.

Figure 12. Speedup comparison of single sentence with and without TM on varying number of nodes.

Figure 13. Speedup comparison of multiple sentence translation with and without TM on varying number of nodes.

The data in Table I for single-sentence and multiple-sentence input is plotted as bar graphs in Figures 10 and 11, where the reduction in translation time with TM can be easily observed. Multiple sentences given as input to the translation system with integrated TM show a remarkable reduction in execution time with the use of 4 nodes when compared with single-sentence translation with and without TM. The speedup graphs for both experiments are plotted in Figures 12 and 13. For a single sentence without TM, the maximum speedup is about 3.63 on a 4-node cluster, while with integrated TM the maximum speedup is 2.88 on the same number of nodes. Compared with the 253 hours taken for the translation of 10,000 sentences on a single desktop machine, the 4-node Hadoop cluster takes 142 minutes. A further speedup of about 2.74 is achieved by porting the multithreaded MTS onto the Hadoop cluster of 4 nodes with TM in the background. Comparing the time taken for translation on a single desktop machine with the 4-node Hadoop cluster running the multithreaded MTS integrated with TM, we see a huge reduction in the translation time for 10,000 sentences, from 253 hours to 67 minutes.

Experiment 3 - In this experiment the size of the input file for translation is increased on the single-node cluster as well as the multinode cluster, and the results are summarized in Table II. For a single node with single-threaded execution, the execution time increases almost linearly. In contrast, on a multinode cluster with 4 nodes and single-threaded execution on each node, the execution time decreases as the input file size increases. Thus, higher speedup is achieved on the Hadoop cluster by increasing the number of nodes as well as the size of the data.
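As a quick sanity check on the end-to-end figures quoted above (253 hours on one desktop machine versus 67 minutes on the 4-node cluster with the multithreaded, TM-backed MTS), the overall gain works out to roughly 227x:

```python
# Arithmetic check of the quoted end-to-end reduction for 10,000 sentences.
baseline_min = 253 * 60   # 253 hours on a single desktop = 15,180 minutes
cluster_min = 67          # 4-node Hadoop cluster, multithreaded MTS with TM
speedup = baseline_min / cluster_min
print(f"overall speedup ~ {speedup:.0f}x")  # ~227x
```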

TABLE II. MT on single-node and multinode machines without TM.

Input sentences   Data size   Single-threaded MT    Single-threaded MT
                              on a single node      on a multinode cluster
10000             1.33 MB     1478 mins             407 mins
15000             2.24 MB     2096 mins             356 mins
20000             3.82 MB     2847 mins             307 mins
40000             8.17 MB     5892 mins             252 mins

VI. CONCLUSION

A traditional TM is used as a translator's aid, storing a human translator's text in a database for future use. In the last two decades, various translation tools built on top of TM technology have been designed to speed up manual translation to meet industrial demands. These tools require manual intervention for post-editing the matched text from the TM and thus slow down the translation process. To ease the growing pressures on the industry, TM is integrated with MTS to translate text more accurately and efficiently. In this paper, we have discussed closely related published work in which TM is used to improve the output of SMT systems. A major drawback of the approaches used in such systems, as discussed in this paper, is that the output is not very reliable, as they may give different output for the same set of sentences. Further, such systems are language dependent and usually work only on the available training data, thus producing low-quality output.

To overcome the language dependency in the reusability of a TM, TM is integrated with MTS. As part of this research, we have integrated a fully automated TM into a Rule-Based Machine Translation (RBMT) system which uses the Tree Adjoining Grammar (TAG). This TM stores the category sequence aligned with the pre-parsed source structure and the generated target structure obtained from the parser/generator module of the TAG-based engine in a database for further reference. The entire TM is based on the structure of the source sentence and is thus independent of sentence lexical tokens. It can be reused by the same RBMT system when the same set of sentences or structures is encountered. Comparing a new sentence for translation with already completed translations improves consistency and saves time: the previous translations can be used or modified to create new, more contextually appropriate translations. The Hadoop framework is used to support and facilitate the translation process by speeding up the offline TM creation process. We first carried out a performance analysis of MT on a standalone Windows machine to provide a baseline. Then our MT system was ported to the Hadoop framework using MapReduce. We have demonstrated in the results that substantial speedups are achievable by using the Hadoop framework.

REFERENCES

[1] Joshi A. K., Levy L. S. and Takahashi M., 'Tree adjunct grammars', Journal of Computer and System Sciences, 10(1), pp. 136-163, February 1975.
[2] Biçici E. and Dymetman M., 'Dynamic TM: Using Statistical Machine Translation to Improve TM'. In: Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing), volume 4919 of Lecture Notes in Computer Science, pp. 454-465, 2008.
[3] Hadoop Framework. Available at http://Hadoop.apache.org/
[4] Zadrozny W., 'Natural Language Processing: Structure and Complexity', Proc. SEKE'96, 8th Int. Conf. on Software Engineering and Knowledge Engineering, Lake Tahoe, pp. 595-602, 1996.
[5] The LOGOS Intelligent Translation System. Available at http://www.mtarchive.info/AMTA-1994-bennett.pdf
[6] Zhechev V. and van Genabith J., 'Seeding Statistical Machine Translation with TM Output through Tree-Based Structural Alignment'. In: SSST-4, 4th Workshop on Syntax and Structure in Statistical Translation, pp. 43-51, 28 August 2010, Beijing, China.
[7] Wang K., Zong C. and Su K., 'Integrating Translation Memory into Phrase-Based Machine Translation during Decoding'. In: The 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, August 4-9, 2013.
[8] Rogers J., 'A Unified Notion of Derived and Derivation Structures in TAG', University of Central Florida, Gainesville, 1997.
[9] Simard M. and Isabelle P., 'Phrase-based Machine Translation in a Computer-assisted Translation Environment'. In: The Twelfth Machine Translation Summit (MT Summit XII), pp. 120-127, 2009.
[10] Nadeau D. and Sekine S., 'A survey of named entity recognition and classification', Lingvisticae Investigationes, 30(1), pp. 3-26, 2007.
[11] Tomar A., Bodhankar J., Kurariya P., Anarase P., Jain P., Lele A., Darbari H. and Bhavsar V.C., 'Parallel Implementation of Machine Translation using MPJ Express', Proc. National Conference on Parallel Computing Technologies (PARCOMPTECH), Feb. 21-23, 2013, Bangalore, India, Centre for Development of Advanced Computing (C-DAC), pp. 223-233, 2013.
[12] Barmon C., Faruqui M.N. and Battacharjee G.P., 'Dynamic load balancing algorithm in a distributed system', North-Holland Microprocessing and Microprogramming, 29(5), pp. 273-285, 1990/91.
[13] Fuzzy Match Algorithm. Available at http://en.wikipedia.org/wiki/Fuzzy_string_searching
[14] Tomar A., Bodhankar J., Kurariya P., Anarase P., Jain P., Lele A., Darbari H. and Bhavsar V.C., 'High Performance Natural Language Processing Services in the GARUDA Grid', Proc. National Conference on Parallel Computing Technologies (PARCOMPTECH), Feb. 21-23, 2013, Bangalore, India, Centre for Development of Advanced Computing (C-DAC), pp. 249-261, 2013.