GRAPHON Tamil to English Transliteration for Tamil Biomedicine J. Betina Antony and G.S. Mahalakshmi

Abstract Cross-Language Information Retrieval is a fast-growing field that attracts considerable research. In a field with such wide application, basic understanding and accessibility of words is crucial. Transliteration is one such vital task, and it paves the way for a wide range of improvements. In our work, we focus on deploying transliteration to retrieve essential information from English words concealed in a spool of unstructured Tamil text. These English words written in Tamil are identified, and their correct form is retrieved by performing a statistical search over a collection of built-in databases. This GRAPHON (Grapheme + Phoneme)-based Tamil to English transliteration, the first of its kind, achieved an accuracy of 68%.





Keywords Transliteration · Tamil biomedicine · Cross-lingual Information Extraction · Phonetic algorithm · Soundex code





1 Introduction Tamil, one of the world's oldest languages, is famous not only for its rich morphology and vocabulary but also for the opulent knowledge its literature bears. Of the many kinds of information passed on from generation to generation, the knowledge about indigenous medicines and their uses has been preserved and practised for centuries. This knowledge, also known as the Siddha System of Medicine (SSM), is put to practice even today. Siddha is predominantly a collection of alchemy texts that is believed to have been created by 18 Siddhars. It uses different minerals, metals and chemical products of nature to heal human ailments, both physical and mental.

J.B. Antony · G.S. Mahalakshmi, Department of Computer Science and Engineering, College of Engineering Guindy, Anna University, Chennai 600025, Tamil Nadu, India. © Springer Nature Singapore Pte Ltd. 2018. S.S. Dash et al. (eds.), International Conference on Intelligent Computing and Applications, Advances in Intelligent Systems and Computing 632, https://doi.org/10.1007/978-981-10-5520-1_41


The main achievement of this ancient system is that it is still applied in present-day medicine. The original Siddha texts are in the form of poems. Hence, a number of works have been carried out to convert these poems to prose for better understanding. These translations have been done for decades. As a result of cultural advancement, a number of contemporary words and languages have also mingled with the native language in the process of translation. Our work focuses on identifying these non-Tamil words, which are otherwise lost as noisy data in our original information extraction system. Therefore, we use transliteration to locate these non-Tamil yet significant named entities.

Transliteration is the process of converting characters from one script to another without changing the phonetic structure. It is widely used in cross-lingual applications such as Cross-Language Information Retrieval, Web search and Machine Translation. Transliteration is often confused with translation, which also involves converting from one language to another. Translation, however, strictly keeps the meaning of the word or phrase intact, whereas transliteration stresses phonetic correctness and not meaning matching.

In our work, we suggest a method that can identify and convert English terms that are concealed inside normal text. A statistical search approach is carried out to determine the corresponding English words for the identified Tamil words and to eliminate unrelated words. For words with colliding search results, a phoneme matching-based tie breaker is applied.

2 Background This work is a part of Information Retrieval for Tamil Biomedicine [1]. The objective of the system is to retrieve medicinal information from unstructured text. That is, for a query (the name of a disease or an ingredient), the system returns a collection of data that contains the medicinal ingredients used, the diseases or disorders they can cure (both of these fall under the field's named entities), and the preparatory procedure for the medicine, along with information about the ingredients and the directions to be followed when consuming the medicine. Many challenges were encountered on the way [2], and our research has been carried out to remove as many limitations of the system as possible. Here, the retrieval process involves the identification of named entities [3]. Named entities in the context of Tamil Biomedicine are terms that denote an ingredient or its related items, its by-products, the name of a disease or disorder, the symptoms, parts of the body, etc. Our data set is a collection of Siddha-related information obtained from published books and magazines and, to some extent, from blogs and Web portals. The contemporary information gathered from the latter contains many words from other languages, predominantly English. These words nevertheless carried a large share of valuable information, as they served as queries in the retrieval process. They were also words that could be understood by the current generation. For example,


most people may not recognize (paRRuyiri), which is the proper Tamil word for (paaktiriyaa) (bacteria). These words were lost in processing for Information Extraction. Hence, we proceeded with this work to locate non-Tamil words in the text and identify their English equivalents accurately.

3 Related Works Work related to Tamil transliteration started in the late 1990s or early 2000s. Viswanadha [4] suggested a universally accepted ICU (International Components for Unicode) scheme that involves Romanization of Tamil alphabets. This laid the foundation for transliteration in Indic languages. Since then, it has been put to the test in many multilingual querying and information processing systems [5–7].

For transliteration involving Tamil, two types of analysis have been carried out. The first is the conversion of English terms to Tamil, which includes statistical as well as machine learning approaches. One particular group of researchers has worked on different methods of transliterating from English to Tamil. They started by considering it as a sequence labelling problem, with multiclass classification based on memory-based learning [8], and reported 84.16% accuracy. They further modelled a C4.5 decision tree classifier in the WEKA environment [9], with 84.82% accuracy on English names. Finally, a One Class Support Vector Machine algorithm was developed in 2010 [10] that outperformed both of the previous methods. Finch et al. [11] suggested a bidirectional neural model that transliterated from English to Tamil with 62.9% accuracy.

The other set of works covers transliteration between Indian languages. These were comparatively trivial, as most of the Indian languages are morphologically rich and have similar phoneme structures. Keerthana et al. [12] proposed a sequence labelling-based method to transliterate from Tamil to Hindi. Their system produced an accuracy of about 84% in spite of not addressing the variation in the pronunciation and sound of different letters in the two languages. In [13], transliteration of Hindi to 7 different Indic languages was carried out based on word alignment and Soundex matching. The system, however, performed comparatively badly for Tamil, with an accuracy of 68%.

A number of these transliteration works considered Soundex code matching, and some of them altered the codes to favour Indian languages. After a thorough study of all these works, it was evident that none of them addressed Tamil to English transliteration, which paved the way for our research.

4 System Description The steps involved in our transliteration work and their operations are listed in the following sections (Fig. 1).

Fig. 1 Biotransliteration procedure

Table 1 Tamil romanization scheme (columns: Tamil and English forms for vowels, consonants and special letters; table body not recoverable)

4.1 Preprocessing

The input to the system is given in the form of unstructured Tamil sentences containing Tamil biomedical instructions. Since the transliteration is done at word level, the initial step is splitting the sentences into word tokens. After tokenization, the common stopwords are removed. Finally, the individual words are stemmed to remove inflectional variation.

4.2 Unicode to Tab Conversion

The first step towards transliteration is converting the text from the script of the source language to that of the target language. This is also necessary in our case because the external DB used in the later stages holds words represented in the target language. Hence, we convert our Unicode Tamil text to its corresponding English (Roman) representation, following a universally accepted standard scheme (Table 1).
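The conversion step can be illustrated with a minimal sketch. The mapping below is only a small, hypothetical subset chosen to cover a few words quoted in this paper (doubled letters for long vowels, capital T and R for the retroflex and trill consonants); it is not the full scheme of Table 1, and the function name `romanize` is an assumption. A consonant followed by a vowel sign takes that vowel, a consonant followed by the virama takes no vowel, and any other consonant receives the inherent "a".

```python
# Illustrative subset of a Tamil-to-Roman table (hypothetical, not the paper's Table 1).
CONSONANTS = {
    "\u0b95": "k",  # TAMIL LETTER KA
    "\u0b9f": "T",  # TAMIL LETTER TTA
    "\u0ba9": "n",  # TAMIL LETTER NNNA
    "\u0baa": "p",  # TAMIL LETTER PA
    "\u0bae": "m",  # TAMIL LETTER MA
    "\u0baf": "y",  # TAMIL LETTER YA
    "\u0bb0": "r",  # TAMIL LETTER RA
    "\u0bb1": "R",  # TAMIL LETTER RRA
    "\u0bb5": "v",  # TAMIL LETTER VA
}
VOWEL_SIGNS = {"\u0bbe": "aa", "\u0bbf": "i", "\u0bc1": "u"}
VOWELS = {"\u0b85": "a", "\u0b86": "aa", "\u0b87": "i", "\u0b89": "u"}
VIRAMA = "\u0bcd"  # suppresses the inherent vowel of the preceding consonant


def romanize(word: str) -> str:
    """Romanize a Tamil word grapheme by grapheme using the subset above."""
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt in VOWEL_SIGNS:      # consonant + dependent vowel sign
                out.append(VOWEL_SIGNS[nxt])
                i += 1
            elif nxt == VIRAMA:         # pure consonant, no vowel
                i += 1
            else:                       # inherent vowel
                out.append("a")
        elif ch in VOWELS:              # independent vowel
            out.append(VOWELS[ch])
        else:                           # unmapped characters pass through unchanged
            out.append(ch)
        i += 1
    return "".join(out)


print(romanize("\u0baa\u0b9f\u0bbf"))  # the word written (paTi) in this paper
```

Each loop iteration consumes one grapheme (a consonant together with its vowel sign or virama), which is the unit that the later indexing step treats as a phoneme.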

4.3 Non-Tamil Word Identification

Identifying non-Tamil words in the given lot is the most strenuous part of the entire process. We have used a Tamil WordNet, assuming that it contains most of the words in a Tamil dictionary. Words that are not present in the DB are considered non-Tamil.
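A minimal sketch of this dictionary check follows. It assumes the two-table lookup described later in the Discussion (an inflected form is mapped to its root in morphtable, and the root is then looked up in twn); the in-memory structures, the function names and the toy entries are hypothetical stand-ins for the actual database tables.

```python
def is_tamil(word, morph_to_root, twn_labels):
    """Return True if the word maps to a root that is listed in the WordNet labels."""
    root = morph_to_root.get(word)
    return root is not None and root in twn_labels


def non_tamil_words(tokens, morph_to_root, twn_labels):
    """Keep only tokens that the dictionary cannot confirm as Tamil."""
    return [w for w in tokens if not is_tamil(w, morph_to_root, twn_labels)]


# Toy data in romanized form (hypothetical entries, for illustration only).
morph_to_root = {"paTi": "paTi", "paTittaan": "paTi"}
twn_labels = {"paTi"}

print(non_tamil_words(["paTittaan", "viTTamin"], morph_to_root, twn_labels))
# ['viTTamin']
```

Because the decision rests purely on dictionary membership, a borrowed word that is listed in the dictionary is missed, and a Tamil word whose inflected form is absent is wrongly flagged.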

4.4 GRAPHON-Based Indexing

GRAPHON is the phonetic algorithm that we suggest; it combines grapheme-based and phoneme-based indexing. In most languages, the basic written unit is a single alphabetic letter. However, when Tamil letters are represented in English, they involve more than one letter, as in most cases they are the agglutinated form of a consonant and a vowel. For example, (ka) is an amalgamation of the consonant (k) + the vowel (a). Phoneme indexing, in contrast, involves grouping words based on their phonetic representation. A phonetic algorithm generally assigns a common code to each phoneme based on its sound and matches words with similar codes for indexing, comparison or other purposes. The most common phonetic algorithm used in various applications is the Soundex algorithm [14]. Soundex is a hashing system for English that represents a word by a code based on its sound. The code is assigned in such a way that all similar-sounding words starting with the same letter receive the same value, thus identifying words with similar phonetics. In our system, we first group the letters of each word into graphemes based on their byte codes and then assign Soundex codes to the resulting phonemes.
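The coding step can be sketched as follows. This is a simplified Soundex variant written under our own assumptions, not the authors' implementation: letters without a code (vowels, h, w, y) and non-letters are dropped first, and runs of identical digits are then collapsed. This variant was chosen because it reproduces the codes quoted in this paper (V353 for both Vedanta and vitamin_D, C526 for cancer), whereas the classic rule, which keeps repeated codes separated by a vowel, would code the m and n of vitamin separately. The `length` parameter is also an assumption; the paper quotes a five-character code (C5264), which suggests that codes were not always cut to four characters.

```python
# Standard Soundex letter groups.
CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(word, length=4):
    """Simplified Soundex: first letter plus collapsed consonant codes.

    length=None returns the full, unpadded code.
    """
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    digits = [CODES[c] for c in letters if c in CODES]
    collapsed = []
    for d in digits:                      # collapse runs of the same digit
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    if letters[0] in CODES:               # the first letter is kept as a letter,
        collapsed = collapsed[1:]         # so its own digit is dropped
    code = letters[0].upper() + "".join(collapsed)
    if length is None:
        return code
    return code[:length].ljust(length, "0")


print(soundex("vitamin_D"), soundex("Vedanta"))   # V353 V353
print(soundex("caansaraal", length=None))         # C5264
```

The second call uses the romanized word with c substituted for the initial k, one of the initial-letter replacements described in the Discussion.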

4.5 Span-Based Word Matching

Now, the Soundex code for a given Tamil word is mapped onto the corresponding English words based on their sounding similarity. For this mapping, we have created a table with Soundex codes for all the words from the English WordNet. Initially, when all the words with the same code were considered, the system produced highly noisy results. Hence, the first level of filtration was done by confining the search to words of acceptable length. After a few trials, the length was fixed to Length + 5. All the words are then saved as an ArrayList.
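A small sketch of the lookup and length filter is given below. It assumes that the Soundex codes of the English words have been computed beforehand and that "Length + 5" acts as an upper bound on the candidate's length relative to the romanized Tamil word; both the bound's interpretation and the names `build_index`, `candidates` and `slack` are our assumptions.

```python
from collections import defaultdict


def build_index(word_codes):
    """Group English words by their precomputed Soundex code."""
    index = defaultdict(list)
    for word, code in word_codes.items():
        index[code].append(word)
    return index


def candidates(tamil_roman, code, index, slack=5):
    """Return unique words sharing the code whose length is at most len + slack."""
    limit = len(tamil_roman) + slack
    seen, out = set(), []
    for word in index.get(code, []):
        if len(word) <= limit and word not in seen:
            seen.add(word)
            out.append(word)
    return out


# Codes as quoted in this paper for the example words.
index = build_index({"Vedanta": "V353", "vitamin_D": "V353", "cancer": "C526"})
print(candidates("viTTaminTi", "V353", index))  # ['Vedanta', 'vitamin_D']
```

Keeping the candidates in an ordered list without duplicates mirrors the ArrayList mentioned above while preserving retrieval order for the later scoring step.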

4.6 Statistical Transliteration Based on Domain Knowledge

The final step in the transliteration process is to filter out the unrelated terms from the retrieved list. The first level of scrutiny was the word length, but that only filters noisy data. Identifying the correct word from the lot is the most challenging step. For our work, since almost 90% of the words are related to English medicines and biomedicine, we employ a biomedicine wordlist (see footnote 1) that has more than 98,100 words. Words that are present in this dictionary are given a score to indicate their candidacy. Note that the values are numeric and not Boolean, as more than one term in the list can be present in the dictionary, which leads to a clash in assigning the terms. To overcome the problem of multiple word assignment, each word is given a score based on its occurrence and weightage. Weightage here means the phoneme match score or longest common subsequence (LCS) score. Every term is assigned a score, which is the length of the longest subsequence it shares with the original transliterated word. The word with the maximum score is selected. Let t be the term in question and B be the list of terms {b1, b2, …, bn} that are present in the biomedical dictionary. Then, the final term t′ is assigned the term bi that has the maximum normalized LCS value with t (Eq. 1):

\[
t' = b_i \in B \;\mid\; \mathrm{LCS}_{norm}(t, b_i) \ge \mathrm{LCS}_{norm}(t, b_j) \quad \forall\, b_j \in B \qquad (1)
\]
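Eq. 1 can be sketched in a few lines. The normalization used here, LCS length divided by the length of the candidate English word, is an inference on our part: it is consistent with the scores reported in the Discussion (9/11 ≈ 0.81, 9/12 = 0.75, 11/12 ≈ 0.916), but the paper does not state the formula explicitly. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_norm(t, b):
    """LCS length normalized by the candidate's length (assumed normalization)."""
    return lcs_length(t, b) / len(b)


def best_match(t, candidates):
    """Pick the candidate with the highest normalized LCS score, as in Eq. 1."""
    return max(candidates, key=lambda b: lcs_norm(t, b))


B = ["mesothelium", "mesothelioma"]
print(best_match("miisootheeliyam", B))     # mesothelium
print(best_match("miisootheeliyoomaa", B))  # mesothelioma
```

Dividing by the candidate's length keeps a long dictionary word from winning merely because it offers more letters to match.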

5 Results and Discussion

5.1 Data Set

The experiment was initially started with 668 files containing Siddha medicinal information collected from various sources. The sources range from genuine publications dating from the 1980s to details obtained from recent Web pages using a Web crawler. As suspected, the files from earlier decades did not contain any English text, though they had certain non-Tamil (mostly Sanskrit) words. These diluted the accuracy of the system. Hence, the data set was narrowed down to 85 files containing a total of 20,344 words. Two different WordNets (Tamil and English) and one biomedical name list were also involved in processing.

5.2 Discussion

Since the identification of the concealed English terms is the predominant task, our discussion revolves around the steps involved in refining the search for these terms. The identification task involves trying to locate a term in the Tamil WordNet.

1 https://github.com/Glutanimate/wordlist-medicalterms-en.

Fig. 2 Twn table entries for word ‘paTi’

Fig. 3 Morphtable entries for word ‘paTi’

The Tamil WordNet includes 4 different tables, of which only 2 are taken into consideration:

(i) twn: this table contains 50,497 labels or words along with their sense details and indexes to identify their synonyms, antonyms, hypernyms, hyponyms and troponyms. A sample tuple for the entry (paTi) (to study) is shown in Fig. 2.

(ii) morphtable: this table has a collection of 434,849 words and their corresponding root words. For the example (paTi), there are about 574 different variations of the word. Some of them are shown in Fig. 3.

Note that the number of words in morphtable is greater than the number of words in twn. This is because the morphtable has most (but not all) of the variations of a given word. In our system, the word was first matched against the inflected words in the morphtable. If present, its corresponding root word was located in the twn table, which confirms that the word is Tamil. Note that the identification is based purely on words present in the dictionary.

This procedure had 2 major difficulties. Firstly, certain commonly used English words were already present in the dictionary. For example, the word (aakSijan) (oxygen) was present in both tables and was therefore ignored in processing, adding to the false negatives (Fig. 4). Secondly, the morphtable may not contain the particular agglutinated form of a word. This led to the inclusion of actual Tamil words among the retrieved non-Tamil terms, thereby increasing the false positives and greatly affecting the precision of the system.

After tokenization and stemming, the first run of the operation produced 5783 words as non-Tamil terms. This was a noisy result, as the words had hidden junk bytes in them. After elimination of these bytes, the number was reduced substantially, to 2982.

Fig. 4 Table entries for (oxygen)

The non-Tamil words are changed to their corresponding English representation by Romanization. In this case, the letters are initially split based on their graphemes, denoted by their byte codes. The units are then treated as phonemes for assigning Soundex codes. For ease of operation, the Soundex codes for all the words (203,147 words) in the English WordNet were computed and stored beforehand. Now, the Soundex codes of the English and Tamil words are matched and all the matching words are retrieved, e.g. (viTTaminTi): [Vedanta, vitamin_D].

Here, the Soundex code for all the words is V353. The related words are saved in the form of an ArrayList to avoid duplication. A few challenges were encountered when mapping the Soundex codes.

(i) Some letters in Tamil sound different in different positions. For example, the letter (pi) is used for (pired) (bread) as well as (pirashar) (pressure), and the two sounds cannot be distinguished in writing. One solution we came up with was to replace a few letters at the beginning of the word and find the Soundex codes for the variants as well. Some of the replacements are b for p, c for k, o for aa, s for c, etc.

(ii) The agglutination of non-Tamil words is difficult to uncouple, as the root words are not present in the original dictionary. For example, (kaansaraal) (due to cancer): [conger_eel, common_sorrel, censorial] did not identify the correct word, as the Soundex code for (kaansar) (cancer) is C526 and that for (kaansaraal) (due to cancer) is C5264.

(iii) Since Tamil does not support acronyms, English acronyms are identified wrongly. One such example is Urinary Tract Infection (UTI): (yutiai) (UTI): [youth, yodh, youth, Yeddo, Yedo, yeti, youth, yautia, youth, youth, youth, yet, yet, yet, yet, yet, yet].

The final step in the process is to determine the final English word for the term in question. The words in the list are checked against the medical wordlist; if a word is present, a weighted score is given based on the total number of related words retrieved. This gives a normalized value for all the words irrespective of the number of candidate terms identified. Consider the following: (aastiyooporoosiS): [osteoporosis 1].

Table 2 Statistics of the transliteration process

Measure                                                       Value
Total number of words (including duplicates)                  20,344
Number of non-Tamil words retrieved (including duplicates)    2,755
Number of unique non-Tamil words                              1,832
Number of unique English words                                185
Number of correctly transliterated words                      126
Precision (TP/(TP + FP))                                      0.681

In this case, the retrieved term is an exact match. Hence, the term is assigned. Consider the following case:

(miisootheeliyam): [musteline 0, mesothelium 0.5, Mazatlan 0, Magdalena 0, magdalen 0, mesothelioma 0.5].
(miisootheeliyoomaa): [musteline 0, mesothelium 0.5, Mazatlan 0, Magdalena 0, magdalen 0, mesothelioma 0.5].

Here, both mesothelioma and mesothelium match the term in question when considering the Soundex code. The tie is broken by matching phonemes and assigning ranks. After matching, the results are:

(miisootheeliyam): [mesothelium: 0.81, mesothelioma: 0.75].
(miisootheeliyoomaa): [mesothelium: 0.81, mesothelioma: 0.916].

Note that the values are normalized to balance the varying lengths of the words. The statistical results are listed in Table 2. Of the 2695 words that were retrieved, only 185 words were found to be English words. The major dip in the accuracy of retrieval is due to the lack of agglutinated forms in the morphtable. This can be rectified by using a different stemmer, but the cost of false negatives should also be taken into consideration. The precision value was calculated to determine how correctly English words were assigned to Tamil-written entities. The recall, and hence the F-measure, of the system is almost impossible to determine, as the total number of words that would have to be checked manually is comparatively high.

6 Conclusion The Tamil to English transliteration system, the first of its kind, was successfully built and performed effectively on a small yet productive data set. Certain difficulties, such as agglutinated Tamil words, can be resolved by applying different stemming algorithms to the tagged non-Tamil words using non-dictionary-based methods. Also, to overcome the ambiguity in pronunciation and sounding pattern, the Soundex algorithm can be modified to suit the language under consideration. Even without the domain wordlist, the system will perform efficiently for data from any corpus, as word length matching is also considered. The system obtained an agreeable result for the statistical method that was used; whether it would perform better with a machine learning strategy is still debatable. Experiments to extend this work to machine translation for enhancing Information Extraction are in progress.

Acknowledgements This work is part of our research supported by the Department of Science and Technology's INSPIRE fellowship Programme, Ministry of Science and Technology, India.

References

1. Betina Antony, J., & Mahalakshmi, G. S.: Content-based Information Retrieval by Named Entity Recognition and Verb Semantic Role Labelling. J. UCS, 21, pp. 1830–1848, (2015).
2. Antony, J. B., & Mahalakshmi, G. S.: Challenges in Morphological Analysis of Tamil Biomedical Texts. Indian Journal of Science and Technology, 8(23), pp. 1–4, (2015).
3. Antony, J. B., & Mahalakshmi, G. S.: Named entity recognition for Tamil biomedical documents. In Circuit, Power and Computing Technologies (ICCPCT), 2014 International Conference, pp. 1571–1577, (2014, March).
4. Viswanadha, R.: Transliteration of Tamil and Other Indic Scripts. INFITT, Tamil Internet 2002, pp. 277–285, (2002).
5. Ganesan, K., & Siva, G.: Multilingual Querying and Information Processing. Information Technology Journal, 6(5), pp. 751–755, (2007).
6. Kumaran, A.: MIRA: Multilingual information processing on relational architecture. In International Conference on Extending Database Technology, pp. 12–23, (2004, March).
7. Saravanan, K., Udupa, R., & Kumaran, A.: Crosslingual information retrieval system enhanced with transliteration generation and mining. Forum for Information Retrieval Evaluation (FIRE-2010) Workshop, (2010).
8. Vijaya, M. S., Shivapratap, G., Dhanalakshmi, V., Ajith, V. P., & Soman, K. P.: Sequence labeling approach for English to Tamil Transliteration using Memory based Learning. In Proceedings of Sixth International Conference on Natural Language Processing, (2008).
9. Vijaya, M. S., Ajith, V. P., Shivapratap, G., & Soman, K. P.: English to Tamil transliteration using WEKA. International Journal of Recent Trends in Engineering, 1(1), (2009).
10. Vijaya, M. S., Shivapratap, G., & Soman, K. P.: English to Tamil Transliteration using One Class Support Vector Machine. International Journal of Applied Engineering Research, 5(4), pp. 641–652, (2010).
11. Finch, A., Liu, L., Wang, X., & Sumita, E.: Target-Bidirectional Neural Models for Machine Transliteration. ACL 2016, pp. 78–82, (2016).
12. Keerthana, S., Dhanalakshmi, V., Kumar, M. A., Ajith, V. P., & Soman, K. P.: Tamil to Hindi Machine Transliteration Using Support Vector Machines. In International Joint Conference on Advances in Signal Processing and Information Technology, pp. 262–264, (2011, December).
13. Srivastava, R., & Bhat, R. A.: Transliteration Systems across Indian Languages Using Parallel Corpora. Sponsors: National Science Council, Executive Yuan, ROC Institute of Linguistics, Academia Sinica NCCU Office of Research and Development, pp. 390–398, (2013).
14. Jacobs, J. R.: Finding Words That Sound Alike-The Soundex Algorithm. Byte, 7(3), pp. 473–474, (1982).