International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
Web Site: www.ijettcs.org  Email: [email protected]
Volume 4, Issue 2, March-April 2015

ISSN 2278-6856

Hindi-English Translation Memory System

Nandita Srivastava*1, Priya Singh*2, Sukanya Chauhan*3, Shashi Pal Singh#4, Ajai Kumar#5, Dr. Hemant Darbari#6
*Banasthali Vidyapith, Banasthali, India
#AAI, Centre for Development of Advanced Computing (C-DAC), Pune, India

Abstract
This paper outlines the Hindi-English Translation Memory System, a tool that helps a human translator render Hindi text into English by reusing the Hindi-English text pairs already stored in a TM database. Translation memory (TM) systems became commercially available in the late 1990s, although research in the field has been ongoing since the 1970s. A TM is a database of existing translations that can later be reused as suggestions to the translator when translating a new query. TM has evolved into an important area of the translation industry and is widely used among translators, but far less work has been done for Indian languages than for other languages. This motivates the need for a tool that translates Hindi documentation into English and thereby helps localized work in Hindi gain global recognition. Our system uses the following methodologies: a pre-processing module that includes filtering and segmentation; a retrieval module that employs an N-gram statistical model with the 3-operation edit distance (Levenshtein distance) as its string-matching metric; and a termbase module, named entity recognition, and transliteration, which enhance the quality and efficiency of the translation. We further discuss the workflow among the processes involved in building our system and conclude with the experimental results obtained.

Keywords: Translation Theory, Computer Assisted Translation, Translation Memory, Hindi to English, Natural Language Processing

1. INTRODUCTION
Translation has existed for as long as human beings have needed to communicate with people who do not speak the same language. Efficient translation is in ever greater demand today, yet it is an expensive process that requires a good understanding of the domain and is by no means an easy task. The advent of computers and software technologies has therefore been put to use to assist human translators and to provide efficient, cheap, and quick translations. Translation technology in the software world is categorized into Computer Assisted Translation (CAT) and Machine Translation (MT). An MT system performs fully automatic translation by computer, without any human intervention or involvement, while in a CAT system a human translator performs the translation with the help of a computer application. Fully automatic translation systems generally do not produce high-quality translations, so there is a need for an


interactive human-and-computer application that provides more efficient and consistent translation; such applications constitute CAT systems. Translation Memory (TM) [3] technology belongs to the family of CAT systems. It serves as a translator's aid for producing high-precision translations. Basically, it is a database application that keeps a record of previously translated units and reuses an existing translation whenever a unit recurs in future translations, instead of translating it from scratch. It was once believed that translation memory was best suited to repetitive texts, but the additional features incorporated into modern systems make it useful for non-repetitive texts as well. TM systems such as Trados Translation Workbench, SDL [14], SDLX, Atril Déjà Vu [15], Star Transit, IBM Translation Manager, OmegaT [16], and Wordfast [17] are commercially available, but they need to be adapted to language-specific requirements. For a multilingual country like India, translation supports development, commercial and foreign exchange, and advancement and familiarity in academics and research. A considerable amount of work exists only in regional languages or Hindi, and it needs proper translation to make the information easily understandable, usable, and accessible to whoever needs it. Our tool, the Hindi-English Translation Memory System, meets this requirement by translating Hindi texts into English efficiently. It matches the Hindi input to be translated against the Hindi-English language pairs stored in the TM and retrieves the matching translation candidates in English; the human translator can then accept a candidate, replace it with a fresh translation, or modify it to fit the new context, and record the result in the TM. A termbase, named entity recognition, and transliteration are provided to assist the translator by identifying and handling the necessary details.
The remainder of the paper is organized as follows: Section 2 gives a brief overview of the various TM standards available for data exchange among TMs. Section 3 describes our Hindi-English TM system and its architectural model. Section 4 discusses the methodologies used in our system. Section 5 presents the evaluation and various experimental studies. Finally, Section 6 concludes the paper and discusses future enhancements.


2. DATA EXCHANGE STANDARDS
Exchange of data among translation memory systems is facilitated by agreeing on a common data representation format, so that data can easily be imported into and exported from TM systems. Several widely accepted standards exist for translation memories and are supported by most systems [13]. Translation Memory Exchange (TMX) supports the exchange of translation memory data between TM systems; it is an XML specification defined by Open Standards for Container/Content Allowing Reuse (OSCAR), a group of the Localization Industry Standards Association (LISA). The Segmentation Rules Exchange (SRX) standard, also from LISA, states how segmentation should be performed. The Termbase Exchange format (TBX) was originally released by LISA in 2002 and was later adopted as the ISO 30042 standard; it specifies how terminology data is represented and exchanged. The Open Lexicon Interchange Format (OLIF), developed by the OLIF2 Consortium, is an open standard specifying how lexical data is exchanged. The XML Localization Interchange File Format (XLIFF), published by OASIS in 2002, is an XML-based format that standardizes the way localizable data is passed between tools during a localization process.
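As a concrete illustration of the TMX format described above, the sketch below builds a minimal TMX 1.4 document holding one Hindi-English translation unit and parses it back with Python's standard-library ElementTree. The header attribute values (tool name, version) are illustrative placeholders, not details taken from the paper.

```python
import xml.etree.ElementTree as ET

# A minimal TMX 1.4 document with one Hindi-English translation unit (TU).
# Header attribute values are illustrative placeholders.
TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example-tm" creationtoolversion="0.1"
          segtype="sentence" o-tmf="plain" adminlang="en"
          srclang="hi" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="hi"><seg>राम ने मोहन को किताब दी</seg></tuv>
      <tuv xml:lang="en"><seg>Ram gave a book to Mohan</seg></tuv>
    </tu>
  </body>
</tmx>"""

def load_tmx(xml_text):
    """Parse a TMX string into a list of (Hindi, English) segment pairs."""
    root = ET.fromstring(xml_text)
    pairs = []
    for tu in root.iter("tu"):
        # xml:lang is expanded by ElementTree to its namespace-qualified form.
        segs = {tuv.get("{http://www.w3.org/XML/1998/namespace}lang"):
                tuv.findtext("seg")
                for tuv in tu.iter("tuv")}
        pairs.append((segs.get("hi"), segs.get("en")))
    return pairs

print(load_tmx(TMX))
```

A TM system that supports TMX import would read such a file and insert each `(source, target)` pair as a translation unit in its database.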


3. HINDI-ENGLISH TM SYSTEM

3.1 Brief Overview
Translation memory is essentially a data repository that maintains segments of previously translated text in source-target pairs (each language pair stored in the database is known as a translation unit, TU), so that the human translator can later reuse them instead of translating from scratch. The translation memory in our system is a database of Hindi segments stored together with their corresponding English translations. The Hindi text to be translated is imported into the TM system by the translator and split into segments. These newly created Hindi segments are then matched against the aligned Hindi segments already stored in the translation memory. A similarity score is computed between segments, and the corresponding English translations of those whose scores fall within the defined threshold are presented to the translator as translation candidates; it is up to the translator to accept a candidate, modify it, or replace it entirely with a new translation. Retrieval is based on the degree of matching desired by the translator: an exact match, where the new segment exactly matches a stored source segment, or a fuzzy match, where there is some degree of similarity between the segments, i.e. a partial solution measured between 0% and 100%. The termbase provides translations of otherwise untranslated Hindi words. Named entity recognition and transliteration make the translation more efficient by identifying and handling proper nouns, gender cases, and other placeables. The new or modified translations suggested by the translator are then added to the TM database in aligned form.

3.2 Architecture
The architecture of the Hindi-English translation memory system is illustrated in Figure 1, which shows the various components of the system and how work flows among them.

Figure 1: System Architecture

4. METHODOLOGY USED

4.1 TM Development
A Hindi-English bilingual aligned corpus of 10,000 sentences was built. The sentences were taken from documents released by the University Grants Commission (UGC) between 2006 and 2010 that were available in both Hindi and English. At the initial stage, this corpus serves as the translation memory used to support future translations. A hash function maps each Hindi sentence in the corpus to a unique value, known as a hash code or hash value, which is used for database lookup. Later sections discuss how translations are retrieved from the TM effectively and how new entries are added when required.
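The hashed-lookup idea can be sketched as follows. The choice of SHA-1 and the whitespace normalization step are assumptions made for illustration; the paper does not specify which hash function it uses.

```python
import hashlib

def sentence_key(sentence):
    """Normalize a Hindi sentence and hash it for O(1) exact lookup."""
    normalized = " ".join(sentence.split())          # collapse whitespace
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

# Toy TM: hash of the Hindi source -> (Hindi source, English target)
tm = {}

def tm_add(hi, en):
    tm[sentence_key(hi)] = (hi, en)

def tm_lookup(hi):
    """Return the stored English translation on an exact match, else None."""
    entry = tm.get(sentence_key(hi))
    return entry[1] if entry else None

tm_add("राम ने मोहन को किताब दी", "Ram gave a book to Mohan")
print(tm_lookup("राम ने  मोहन को किताब दी"))  # exact match despite extra space
```

Hashing makes exact-match retrieval independent of corpus size; fuzzy matching (Section 4.3) still requires comparing against candidate segments.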


4.2 Pre-processing
The Hindi text to be translated is first imported into the system. Textual parsing of the source statements is then performed, which includes the following processes:
4.2.1 Filtering: Filtering removes low-value entries, i.e. very short or very long sentences, obvious mismatches, unnecessary spaces, symbols, and other material of little significance to the translator, on the basis of defined rules.
4.2.2 Tagging: Part-of-speech tags and lemmas of the Hindi words are maintained.
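The filtering and segmentation steps can be sketched as below. The delimiter set follows the marks named in Section 4.2.3 (the danda "।", question mark, and semicolon), but the length thresholds are illustrative assumptions rather than values given in the paper.

```python
import re

# Segment boundaries: danda (पूर्ण विराम), question mark, semicolon.
SEGMENT_RE = re.compile(r"[।?;]")

def segment(text):
    """Split Hindi text into candidate segments at logical end marks."""
    return [s.strip() for s in SEGMENT_RE.split(text) if s.strip()]

def filter_segments(segments, min_tokens=2, max_tokens=60):
    """Drop low-value entries: empty, very short, or very long segments.
    The token thresholds here are illustrative assumptions."""
    kept = []
    for seg in segments:
        seg = " ".join(seg.split())      # strip unnecessary spaces
        n = len(seg.split())
        if min_tokens <= n <= max_tokens:
            kept.append(seg)
    return kept

text = "राम ने मोहन को किताब दी। श्याम ने राधा को फूल दिया।"
print(filter_segments(segment(text)))
```

Headings and paragraph breaks, which the paper also treats as segment boundaries, would be handled by additional rules on line structure.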


4.2.3 Segmentation: The TM database stores text in units called "segments", which are treated as the most suitable and useful translation units (TUs). A segment can be anything defined as such: a sentence, a title, a heading, or a paragraph. The imported text is segmented using regular expressions that split it at orthographic boundaries or at punctuation marking the logical end of a text unit; entities such as titles and headings can also be treated as segments. In our system, the Hindi source text is segmented by splitting on the basis of defined rules at the पूर्ण विराम (full stop, "।"), प्रश्न चिह्न (question mark, "?"), and अर्ध विराम (semicolon, ";"), as well as at अनुच्छेद (paragraph) and शीर्षक (heading) boundaries in the text.

After importing and segmenting the text, we examine each segment, because certain terms, codes, proper nouns, and placeables need to be handled separately in the system. For handling proper nouns we use a Named Entity Recognition [7] approach based on defined rules to identify them and then transliterate [7][4] them. Numbers, dates, and other placeables that need not be translated are recognized and placed appropriately. Gender agreement in pronoun cases is handled using defined rules that categorize on the basis of the verbs and auxiliary verbs used in the Hindi sentences.

Consider a sentence S with its translation T stored in the TM database:
S: "राम ने मोहन को किताब दी"
T: "Ram gave a book to Mohan"
Suppose the translator inputs a sentence I with a similar structure:
I: "श्याम ने राधा को फूल दिया"
Tagged structure:
S: "राम ने मोहन को किताब दी" — NN Prep NN Prep Noun Verb
I: "श्याम ने राधा को फूल दिया" — NN Prep NN Prep Noun Verb
The names in the sentences are recognized using NER and classified as variables: N1 = "श्याम" and N2 = "राधा" replace "राम" and "मोहन" in the stored translation, and "फूल", meaning "flower", is fetched from the termbase to replace "book". Thus the translation presented to the translator is:
IT: "Shyam gave a flower to Radha"
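The variable-substitution step in the example above can be sketched as follows. The three small dictionaries are illustrative assumptions standing in for the NER, transliteration, and termbase modules described in the paper; a real system would populate them automatically.

```python
# Toy adaptation of a stored fuzzy match: mismatched tokens are swapped in
# the stored English translation. All three dictionaries are illustrative.
EN_OF = {"राम": "Ram", "मोहन": "Mohan", "किताब": "book"}   # stored-TU alignment
TRANSLIT = {"श्याम": "Shyam", "राधा": "Radha"}             # name transliteration
TERMBASE = {"फूल": "flower"}                               # Hindi -> English terms

def adapt(stored_hi, stored_en, input_hi):
    """Adapt a stored translation to a structurally similar input sentence
    by replacing names (via transliteration) and termbase words."""
    out = stored_en
    for s_tok, i_tok in zip(stored_hi.split(), input_hi.split()):
        if s_tok == i_tok:
            continue
        old = EN_OF.get(s_tok)                     # English word to replace
        new = TRANSLIT.get(i_tok) or TERMBASE.get(i_tok)
        if old and new:
            out = out.replace(old, new)
    return out

print(adapt("राम ने मोहन को किताब दी",
            "Ram gave a book to Mohan",
            "श्याम ने राधा को फूल दिया"))   # Shyam gave a flower to Radha
```

Note that the mismatch दी/दिया (the gender-inflected verb) is left alone here; the paper handles such cases with separate gender-agreement rules.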


4.3 Retrieval
Translation retrieval in our system employs statistical language modelling (the N-gram approach) [10] to compute N-grams and then applies the 3-operation edit distance (Levenshtein distance) dynamic programming algorithm.
4.3.1 N-gram modelling: Substrings of size N are referred to as N-grams [5]. A window of size N is slid over a sentence to partition it into substrings of length N. The N-gram model is used to capture the context, or word order, of the sentences. In our approach we use bi-grams [8] (substrings of size 2) for sentences shorter than 7 tokens, to maintain proper context, and tri-grams and four-grams for longer sentences. Sentences in the TM are partitioned into N-grams and matched against the N-grams of the query using a string similarity metric. N-gram modelling is illustrated by the following example:
S1: "भारत के प्रधानमंत्री मोदी जी हैं"

Table 1: N-Gram Models
1-gram: भारत, के, प्रधानमंत्री, मोदी, जी, हैं
2-gram: "भारत के", "के प्रधानमंत्री", "प्रधानमंत्री मोदी", "मोदी जी", "जी हैं"
3-gram: "भारत के प्रधानमंत्री", "के प्रधानमंत्री मोदी", "प्रधानमंत्री मोदी जी", "मोदी जी हैं"
4-gram: "भारत के प्रधानमंत्री मोदी", "के प्रधानमंत्री मोदी जी", "प्रधानमंत्री मोदी जी हैं"
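A word-level N-gram generator reproducing the rows of Table 1 can be sketched as follows; this is a minimal illustration, as the paper does not show its own implementation.

```python
def ngrams(sentence, n):
    """Slide a window of size n over the tokens of a sentence."""
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

s1 = "भारत के प्रधानमंत्री मोदी जी हैं"
print(ngrams(s1, 2))   # the 2-gram row of Table 1
```

The same function yields the tri-gram and four-gram rows with `n=3` and `n=4`, matching the policy of switching N-gram order by sentence length.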

4.3.2 String similarity metric: For efficient retrieval from the TM, the 3-operation edit distance (Levenshtein distance) [1] is used as the string-matching function in our system. Levenshtein distance is a character-based matching approach that computes the minimum number of edit operations ((i) insertion, (ii) deletion, and (iii) replacement) [1] required to transform one N-gram into another. It acts as a measure of how closely the current input matches a source in the TM, and the translation candidates in the target language are retrieved accordingly; the human translator can then accept a candidate, replace it with a fresh translation, or modify it to fit the new context, and update the TM. A retrieved translation is classified as either an exact match or a fuzzy match. In exact retrieval, the input sentence is identical to a stored source sentence, i.e. the same sentence has been translated before; this is also known as a 100% match. A fuzzy match retrieves a partial solution, one with some degree of similarity; segments with a score strictly between 0 and 1 fall into the fuzzy-match category. The available exact or fuzzy matched sentences are presented to the translator as translation candidates with the differences marked. The threshold defined by the translator determines the extent of fuzziness that is acceptable: only sentences scoring above the threshold are retrieved. Algorithm 1 [1] shows the pseudocode of the Levenshtein distance algorithm.

Algorithm 1: Levenshtein Distance [1]
Input: two strings S1 and S2
Output: edit distance between S1 and S2
1. int m[i, j] = 0
2. for i ← 1 to |S1|
3.     do m[i, 0] = i
4. for j ← 1 to |S2|
5.     do m[0, j] = j
6. for i ← 1 to |S1|
7.     do for j ← 1 to |S2|
8.         do m[i, j] = min{ m[i-1, j-1] + (0 if S1[i] = S2[j] else 1), m[i-1, j] + 1, m[i, j-1] + 1 }
9. return m[|S1|, |S2|]

Levenshtein distance is computed in O(|S1| × |S2|) time, and the space required can be reduced to O(min(|S1|, |S2|)).

The internal matching and retrieval workflow within our system follows this procedure:
- Each segment of the query is first searched for as a whole among the source sides of the TUs; if an exact match is found, the target corresponding to the matched source is retrieved.
- Otherwise, if the number of tokens in the segment is 7
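Algorithm 1 translates directly into runnable code. The sketch below also derives a 0-to-1 similarity score from the distance; this particular normalization (dividing by the longer string's length) is an assumption for illustration, since the paper states the fuzzy score range but not the formula.

```python
def levenshtein(s1, s2):
    """3-operation edit distance (insert, delete, replace) via dynamic
    programming, following Algorithm 1."""
    m = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(1, len(s1) + 1):
        m[i][0] = i
    for j in range(1, len(s2) + 1):
        m[0][j] = j
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            m[i][j] = min(m[i - 1][j - 1] + cost,   # replace (or match)
                          m[i - 1][j] + 1,          # delete
                          m[i][j - 1] + 1)          # insert
    return m[len(s1)][len(s2)]

def similarity(s1, s2):
    """Normalize the distance into a 0-1 score (1.0 = exact match).
    The max-length normalization is an illustrative assumption."""
    if not s1 and not s2:
        return 1.0
    return 1.0 - levenshtein(s1, s2) / max(len(s1), len(s2))

print(levenshtein("kitten", "sitting"))   # 3
print(similarity("राम", "राम"))           # 1.0
```

A segment scoring 1.0 would be reported as an exact (100%) match; scores between the translator's threshold and 1.0 would be offered as fuzzy matches with the differences marked.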