Jawi-Malay Transliteration

2009 International Conference on Electrical Engineering and Informatics 5-7 August 2009, Selangor, Malaysia

Roslan Abdul Ghani#1, Mohamad Shanudin Zakaria#2, Khairuddin Omar#3

# Center of Artificial Intelligence Technology, Fakulti Teknologi dan Sains Maklumat, Universiti Kebangsaan Malaysia, 43600 UKM Bangi, Selangor, MALAYSIA.

Abstract— Jawi-Malay transliteration was originally introduced to retrieve information from old manuscripts, which contain important knowledge in many fields. Difficulty in reading and understanding Jawi sources has grown as the younger generation makes less use of Jawi in daily activities such as disseminating and retrieving information. Using earlier literature, approaches, frameworks and evaluation results as guidelines, we propose several main methods: character categorisation, a stemming process, Unicode mapping, Jawi-Malay rules and a voice synthesizer. Integrating these methods is intended to speed up the transliteration process and improve its output. The design still has weaknesses, because not every word can be transliterated accurately. For future research, we suggest improving the rules and developing a synthesizer with a Malay accent.

M.Z. Ayob and A.F. Ismail [3] reported that a rule-based system, as a practical realisation of an expert system, was used to reduce some vowelisation problems. They also applied optimisation via a Kohonen Self-Organising Map (KSOM). The KSOM learning algorithm can be summarised in four steps:
1. Initialisation
2. Activation and similarity matching
3. Learning
4. Iteration

Yonghendri [4] presented a Rumi-Jawi transliteration engine that uses a segmentation method for Roman syllables: words are segmented into prefix, root word and suffix. Prefixes and suffixes are mapped directly to Jawi Unicode, while the root word is processed with Jawi rules. The engine contains 130 Jawi rules. Afterwards the prefix, root word and suffix are merged again.
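As an illustration of the four KSOM steps mentioned in the review of [3] above, the following is a minimal sketch of a generic one-dimensional self-organising map. It is a textbook-style example, not the implementation of Ayob and Ismail; the function names, the constant learning rate and the fixed neighbourhood radius are all assumptions made for brevity.

```python
import math
import random


def best_matching_unit(weights, x):
    """Step 2 (activation and similarity matching): index of the unit nearest to x."""
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], x))


def train_som(samples, n_units=4, epochs=20, lr=0.5, radius=1, seed=0):
    """Train a 1-D map; lr and radius are kept constant (an assumed simplification)."""
    rng = random.Random(seed)
    dim = len(samples[0])
    # Step 1 (initialisation): random weights in [0, 1).
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    # Step 4 (iteration): repeat matching and learning for a fixed number of epochs.
    for _ in range(epochs):
        for x in samples:
            bmu = best_matching_unit(weights, x)
            # Step 3 (learning): move the winner and its map neighbours toward x.
            for j in range(n_units):
                if abs(j - bmu) <= radius:
                    weights[j] = [w + lr * (xi - w) for w, xi in zip(weights[j], x)]
    return weights
```

A practical map would normally shrink both the learning rate and the neighbourhood over time; that refinement is omitted here.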

Keywords— Transliteration, stemming process, Jawi-Malay Rules, Voice synthesizer.

I. INTRODUCTION

Jawi writing is a system adopted from the Arabic script. The Jawi alphabet has 31 characters taken from the original Arabic alphabet, plus six additional characters, ۏ, ڽ, ڤ, ڬ, ڠ and چ, that do not exist in Arabic. Jawi was the main writing system for the Malay language from the 13th century until the middle of the 20th century. The use of Jawi is also associated with the arrival of Islam in the Malay Archipelago [1]. Jawi-Malay transliteration is understood as the conversion of Jawi characters to Malay (Roman) characters using Unicode mapping, while the pronunciation stays the same. For example, the word for "fish", "ايكن" (Jawi), becomes "ikan" (Malay).

II. TRANSLITERATION

In general, transliteration is understood as changing the characters of one writing system into those of another in a way that preserves pronunciation [2]. In this paper, we focus on transliterating words from Jawi characters to the Malay (Roman) alphabet.

Al-Onaizan and Knight [5] also produced a transliteration system for Arabic and English, with a more extensive description and evaluation. Their work includes an evaluation of how well the transliteration matches a source spelling. AbdulJaleel and Larkey [6] presented a statistical technique for training an English-to-Arabic transliteration model from pairs of names. The model is called a selected n-gram model because it uses a two-stage training procedure: the first stage learns which n-gram segments should be added to the unigram inventory for the source language, and the second stage learns the transliteration model over this inventory. The technique requires no heuristic or linguistic knowledge of either language. Oh et al. [8] described transliteration as generally used to translate proper names and technical terms phonetically, especially from languages that use the Roman alphabet, such as from English to Korean, Japanese or Chinese. According to Knight and Graehl [7], transliteration refers to phonetic translation across two languages with different writing systems. Arbabi et al. [9] describe an algorithm for the forward transliteration of Arabic names into English. The transliteration process starts by vowelizing the given Arabic name, inserting the appropriate short vowels, which are not written in the original but are necessary for correct pronunciation. The vowelized Arabic name is then converted into its phonetic Roman representation using a parser and table lookup. The phonetic representation is then used in a table lookup to produce the spelling in the desired language.

978-1-4244-4913-2/09/$25.00 ©2009 IEEE

Fig. 2 Jawi vowel grouping characters

3) Jawi diphthong grouping characters: these are divided into three cases, at the beginning, middle or end of a word.

III. METHODOLOGY

This section explains the Jawi-Malay character categorisation, the proposed architecture and the main components integrated in the system.

A. Character Categorisation

1) Character mapping from the Jawi to the Malay alphabet: this part is divided into five groups, as shown in Fig. 1.

Fig. 3 Jawi diphthong grouping characters, adapted from the model proposed by Stalls and Knight [10]

B. System Architecture

In Jawi-Malay transliteration, we introduce several components: stemming, Unicode mapping, rule application and a word synthesizer. Words are segmented into prefix, root word and suffix. The root word, prefix and suffix are mapped directly between Jawi and Roman Unicode and then processed with the Jawi rules. Afterwards the prefix, root word and suffix are merged again. In the final stage the Malay word is passed to the synthesizer for pronunciation.
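The flow of this architecture can be sketched as a small pipeline that accepts each stage as a function. This is only an illustrative outline under stated assumptions: `transliterate`, the `toy_*` stand-ins, the tiny character table and the lookup-style rules are hypothetical and are not the system's actual components.

```python
def transliterate(word, stem, map_chars, apply_rules, speak=None):
    """Stem, map each part, apply rules, merge, then optionally pronounce."""
    prefix, root, suffix = stem(word)
    parts = [apply_rules(map_chars(p)) for p in (prefix, root, suffix) if p]
    malay = "".join(parts)
    if speak is not None:
        speak(malay)  # stand-in for the voice synthesizer stage
    return malay


# Toy stand-ins (assumptions) so that the sketch runs end to end.
DAL = "\u062F"  # Jawi letter dal, assumed here to spell the prefix "di-"
TOY_TABLE = {"\u062F": "d", "\u0645": "m", "\u0627": "a", "\u0643": "k", "\u0646": "n"}
TOY_RULES = {"d": "di", "makn": "makan"}  # lookup in place of real production rules


def toy_stem(word):
    if word.startswith(DAL) and len(word) > 1:
        return DAL, word[1:], ""
    return "", word, ""


def toy_map(part):
    return "".join(TOY_TABLE.get(ch, "?") for ch in part)


def toy_rules(text):
    return TOY_RULES.get(text, text)


spoken = []
result = transliterate("\u062F\u0645\u0627\u0643\u0646", toy_stem, toy_map,
                       toy_rules, speak=spoken.append)
```

Passing the stages in as functions keeps the merge logic independent of how stemming, mapping and rules are actually implemented.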

Fig. 1 Jawi-Malay character mapping

2) Jawi vowel grouping characters: this part is divided into two groups, single characters and combinations of two characters.

Fig. 4 Jawi-Malay Architecture


(The rules supply the missing letter "a".) The rules were developed as production rules (forward chaining).

C. Main Component

1) Stemming Process

A word may contain a prefix, suffix or infix, or may itself be a root word [11]. Words are segmented into prefix, root word and suffix so that transliteration operates on as few characters as possible.
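A minimal sketch of such a segmentation step is given below. It strips an affix only when the remainder is found in a root lexicon. The affix lists, their Jawi spellings and the one-word lexicon are assumptions for illustration and are not taken from Fig. 5.

```python
# Hypothetical affix lists and root lexicon, written with Unicode escapes.
PREFIXES = ["\u0628\u0631", "\u062A\u0631", "\u062F"]  # assumed spellings of ber-, ter-, di-
SUFFIXES = ["\u0643\u0646", "\u0646"]                   # assumed spellings of -kan, -an
MAKAN = "\u0645\u0627\u0643\u0646"                      # mim, alif, kaf, nun ("makan")
ROOTS = {MAKAN}


def segment(word, prefixes=PREFIXES, suffixes=SUFFIXES, roots=ROOTS):
    """Return (prefix, root, suffix); affixes are stripped only if the root is known."""
    if word in roots:
        return "", word, ""
    for p in [""] + prefixes:
        for s in [""] + suffixes:
            if word.startswith(p) and word.endswith(s):
                root = word[len(p):len(word) - len(s)]
                if root in roots:
                    return p, root, s
    return "", word, ""  # unknown word: treat the whole word as the root


prefix, root, suffix = segment("\u062F" + MAKAN)  # "di" + "makan"
```

Checking candidates against a lexicon avoids stripping a final nun from a root such as "makan" itself, at the cost of needing a root list.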

Fig. 7 Jawi-Malay rules applied, adapted from Paul Compton, Production Rules (forward chaining) [13]

Fig. 5 Jawi Stemming Process

2) Unicode Mapping

The Jawi root word, prefix and suffix are converted to Malay by direct mapping between Jawi and Roman Unicode [12]. Examples of the mapping are shown in Fig. 6.
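Direct mapping of this kind can be sketched as a longest-match lookup over a character table. The table below is a small hypothetical excerpt, not the table of Fig. 1 or Fig. 6; in particular, treating alif followed by ya as the vowel group "i" is an assumption chosen to reproduce the "ikn" example discussed in this paper.

```python
# Hypothetical excerpt of a Jawi-to-Roman table, keyed by Unicode code points.
JAWI_TO_ROMAN = {
    "\u0627\u064A": "i",  # alif + ya, assumed combined vowel group
    "\u0628": "b",        # ba
    "\u062A": "t",        # ta
    "\u0643": "k",        # kaf
    "\u0645": "m",        # mim
    "\u0646": "n",        # nun
    "\u0686": "c",        # ca
    "\u06A0": "ng",       # nga
    "\u06A4": "p",        # pa
    "\u06AC": "g",        # ga
    "\u06BD": "ny",       # nya
    "\u06CF": "v",        # va
}


def map_direct(jawi_word, table=JAWI_TO_ROMAN):
    """Map Jawi text to Roman letters, preferring the longest matching key."""
    longest = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(jawi_word):
        for size in range(longest, 0, -1):
            chunk = jawi_word[i:i + size]
            if len(chunk) == size and chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            out.append("?")  # character not covered by this excerpt
            i += 1
    return "".join(out)


ikan_jawi = "\u0627\u064A\u0643\u0646"  # alif, ya, kaf, nun
mapped = map_direct(ikan_jawi)           # "ikn": the vowel "a" is not written in Jawi
```

The result still lacks the unwritten vowel, which is why a separate rule stage is needed after mapping.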

4) Voice Synthesizer

After the Jawi-Roman transliteration process is complete, the Malay words are passed to a voice synthesizer (the one already available in the Microsoft Windows XP operating system), which pronounces each word to make the transliteration more reliable.

[Figure labels: Malay word (Perkataan B. Melayu), Synthesizer, voice (suara)]

Fig. 8 Malay Voice Pronunciation

Fig. 6 Jawi-Malay Unicode Mapping Examples

IV. RESEARCH DIRECTION

3) Jawi-Malay Rules Applied

In Jawi-Malay transliteration, rules are needed to obtain a correct result. For example, direct mapping converts the word for "fish", "ايكن", only to "ikn", whereas the correct transliteration is "ikan".
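A forward-chaining treatment of the "ikn" example can be sketched as follows: each production rule has a condition (a pattern) and an action (a rewrite), and rules keep firing until none applies. The single rule shown, which inserts "a" into a word-final consonant pair other than "ng" or "ny", is a hypothetical illustration and is not one of the rules in Fig. 7.

```python
import re

CONS = "[b-df-hj-np-tv-z]"  # Roman consonants, that is, letters other than a, e, i, o, u

# Each production rule: (name, condition as a compiled pattern, action as a replacement).
RULES = [
    (
        "insert-a-in-final-cluster",  # hypothetical rule
        re.compile(r"(?!ng$|ny$)(" + CONS + ")(" + CONS + ")$"),
        r"\1a\2",
    ),
]


def apply_rules(word, rules=RULES, max_cycles=10):
    """Forward chaining: fire the first matching rule, then re-scan until stable."""
    fired = []
    for _ in range(max_cycles):
        for name, pattern, replacement in rules:
            new_word = pattern.sub(replacement, word, count=1)
            if new_word != word:
                word = new_word
                fired.append(name)
                break  # working memory changed: restart the rule scan
        else:
            break  # no rule fired in this cycle: stop
    return word, fired


word, fired = apply_rules("ikn")
```

The `max_cycles` bound is a safety measure for rule sets that might otherwise keep rewriting each other's output.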

Jawi-Malay transliteration was originally introduced to make it easy to retrieve information from old manuscripts. This research focuses on root words, prefixes and suffixes; we hope future research will cover complete words, including infixes. The system also provides a voice synthesizer to make the output more reliable for users, but it uses only the voice synthesizer already available in the Microsoft Windows XP operating system. Future researchers should develop a voice synthesizer with a Malay accent.

V. CONCLUSIONS

This paper introduced several methods for making Jawi-to-Malay transliteration easier and faster. For example, the Jawi stemming process was developed to make each word as short as possible, although it covers only root words and some prefixes and suffixes. Diphthong filtering and vowel filtering were also introduced to simplify words for the Unicode mapping process. Jawi-Malay rules were applied to make the output more accurate. In addition to the methods above, a dictionary database is provided for checking words that cannot be resolved during processing. This alternative is provided because Jawi spelling is not standardised.

REFERENCES

[1] Amat Juhari Moain, 1996. Perancangan Bahasa: Sejarah Aksara Jawi. Dewan Bahasa dan Pustaka. Kuala Lumpur.
[2] Kamus Dewan, 1995. Dewan Bahasa dan Pustaka. Kuala Lumpur.
[3] M.Z. Ayob and A.F. Ismail, “Design of Prototype expert system for transliterating Arabic to Roma-to-Roman words,” 4th Student Conference on Research and Development (SCOReD 2006), Shah Alam, Selangor, Malaysia, 27-28 June 2006.

[4] Yonghendri, 2008, “Enjin Transliterasi Rumi Jawi,” Tesis Sarjana, Fakulti Teknologi dan Sains Maklumat, Universiti Kebangsaan Malaysia.
[5] Al-Onaizan, Y. and Knight, K. Machine transliteration of names in Arabic text. Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, 2002.
[6] Larkey, Leah and Nasreen AbdulJaleel. 2003. “Statistical Transliteration for English-Arabic Cross Language Information Retrieval”. CIKM’03, November 3-8, 2003, New Orleans, Louisiana, USA.
[7] Knight & Graehl, K. Machine transliteration of names in Arabic text. Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, 2002.
[8] Jong-Hoon Oh, Key-Sun Choi and Hitoshi Isahara. 2006. A Machine Transliteration Model Based on Correspondence between Graphemes and Phonemes. ACM Transactions on Asian Language Information Processing, Vol. 5, No. 3, September 2006.
[9] M. Arbabi, S. M. Fischthal, V.C. Cheng and E. Bart, “Algorithms for Arabic name transliteration,” IBM J. Res. Develop., 38(2), pp. 183-193, 1994.
[10] Stalls, B. G. and Knight, K. Translating names and technical terms in Arabic text. Proceedings of the COLING/ACL Workshop on Computational Approaches to Semitic Languages, 1998.
[11] Asim Osman, 1993, “Pengakar Perkataan Melayu Untuk Sistem Capaian Dokumen,” Tesis Sarjana, Fakulti Sains Matematik dan Komputer, Universiti Kebangsaan Malaysia.
[12] The Unicode Standard 4.1, Unicode, Inc., copyright 1991-2005.
[13] Paul Compton, Production Rules (forward chaining), http://www.cse.unsw.au./~billw/cs9414/notes/kr/rules/rules.html, accessed 05/06/2009.
