2nd International Conference on Machine Learning and Computer Science (IMLCS'2013) May 6-7, 2013 Kuala Lumpur (Malaysia)

Machine Transliteration Design for Old Malay Manuscript

Che Wan Shamsul Bahri Che Wan Ahmad, Khairuddin Omar, Mohammad Faidzul Nasrudin, Mohd Zamri Murah, Sanusi Mohd Azmi

Abstract—Jawi script is a script with Arabic influence. In the past, it was widely used by the Malay community as well as by foreigners who had diplomatic, business, missionary and similar relations with the region. At that time, the Malay language was the lingua franca of the region. As a result, there is a large body of Malay heritage in the Jawi script, such as manuscripts, religious books, letters, agreements and other documents. There is a significant need to transliterate the Jawi text of these materials into Roman (Rumi) Malay, and research on machine transliteration will help this effort. Much research on machine transliteration has been done for widely studied languages, including English and other European languages and Asian languages such as Chinese, Japanese, Korean and Arabic. However, research in the context of the Malay language is still lacking, especially on the transliteration of Jawi into Roman script. Jawi writing is quite different from Urdu and Arabic, although they share the same characters. Modern Jawi uses more vowels than the old version. This paper discusses previous studies related to machine transliteration of the Malay language and approaches that can be used to develop it.

Keywords—Jawi, machine transliteration, Malay, rule base

I. INTRODUCTION

JAWI writing today is a Malay script with Arabic influences that has been in use for nearly 700 years. This is evidenced by the discovery of the Terengganu Inscribed Stone, dated 1303 AD [1]. However, old Jawi differs from the Jawi script of today in its use of vowels, its writing techniques and its use of new letters. These changes are in line with the development of the Malay language itself, which began to include foreign words and technical terms, especially with today's advances in science and technology. The new Jawi spelling system was introduced in 1986 under the name Pedoman Ejaan Jawi Yang Disempurnakan (Guidelines of the Enhanced Jawi Spelling) and is intended to meet present and future needs. This system is the result of a formulation drawn up at the Jawi National Convention held in 1984 at Kuala Terengganu, Malaysia. It takes the Jawi spelling system compiled by [2] as its basis. According to [3], the system involves five processes: maintaining confirmed words, perfecting the imperfect, creating the non-existent, clarifying the vague, and tidying up the loose. In this paper, we define old Jawi as the Jawi spelling system before the era of Za'ba (before 1949). According to [4], there is a 30% difference between the old and new Jawi spelling. Examples of the differences between the two systems are shown in Table I.

TABLE I
SAMPLE OLD AND NEW JAWI

Old Jawi | New Jawi | Roman
ﺑﮏ | ﺑﺎڬﻲ | bagi (for)
ﺳڬﻞ | ﺳڬﺎﻻ | segala (all)
ﺍﻭﻟﻬﻢ | ﺍﻭﻟﻬﻴﻤﻮ | olehmu (by you)
ﺗﺎﮐﺔ | ﺗﺎﮐﻮﺓ | takut (fear)
ﺳﻮﺩﺭﺍ | ﺳﺎﻭﺩﺍﺭﺍ | saudara (brother)

II. PROBLEM STATEMENT

In the context of the Malay language, transliteration is used to convert Jawi spelling to Rumi or vice versa. The main problem in transliteration arises when there are no matching characters in the target text [5]. Jawi and Rumi (Roman) are two different spelling systems. Jawi is read and written from right to left, while Rumi runs in the opposite direction [6]. In addition, old Jawi spelling used fewer letters than modern Jawi, following a principle of economy in the use of Jawi characters [7]. According to [8], old Jawi spelling is not consistent (it varies according to the author) and contains many more ambiguous words than the new Jawi spelling system. These problems are the main challenges in developing machine transliteration from old Jawi to Rumi.

Che Wan Shamsul Bahri C.W.Ahmad is with the International Islamic University College Selangor (KUIS), Bangi, 43000 Kajang, Selangor, Malaysia (phone: +60389254251; fax: +6038925447; e-mail: [email protected]).
Prof. Dr. Khairuddin Omar, Dr. Mohammad Faidzul Nasrudin and Mohd Zamri Murah are with the Center for Artificial Intelligence Technology (CAIT), Universiti Kebangsaan Malaysia, Bangi, 43600 Kajang, Selangor, Malaysia (e-mail: [email protected], [email protected], [email protected]).
Mohd Sanusi Azmi is with the Universiti Teknikal Malaysia (UTeM), Malacca, Malaysia (e-mail: [email protected]).

III. RELATED WORK

Machine transliteration is an important component of natural language processing (NLP) applications, especially for rendering an entity name from one language in another [9].


Transliteration mechanisms have developed most rapidly for English and other European languages and for Asian languages such as Chinese, Japanese, Korean, Arabic, Urdu, Hindi, Punjabi, Thai and others. Examples include [10], [11], [12], [13], [14], [15] and [5]. In Malaysia, research on machine transliteration has a long history, but progress has been quite slow, particularly for the Malay language compared with foreign languages. Studies on Malay machine transliteration include [16], [17], [18], [19], [20], [21] and [22]. All of them deal with modern Jawi, except [19] and [20], which deal with old Jawi.

Several researchers, such as [16], [17] and [18], use a character mapping technique that directly matches ﺍ to a, ﺏ to b, and so on. As a result, many words cannot be converted correctly, because some letters have more than one match in the target text.

In contrast, [21] uses a rule-based technique in a study of Rumi to Jawi transliteration. Each Rumi word is processed following the Jawi spelling patterns. First, the word is divided into syllables, and then the syllables are matched against rules for conversion into Jawi spelling. According to [21], some words cannot be converted into Jawi precisely because of several problems: loan words from Arabic and English, the difficulty of distinguishing the e-taling and e-pepet vowels, and the difficulty of determining whether a glottal stop is written ﻕ or ک. In addition, several words in the Jawi vocabulary fall outside the spelling rules or are exempt from them. These are among the difficulties faced by [21], because the Jawi spelling system is so distinctive.

The studies [19] and [20] differ in that they work on old Malay manuscripts. [19] used a Kitab Nazam (an old Malay manuscript) that had been digitized beforehand, and applied a stemming and filtering model that finds word roots and separates old from new Jawi words based on a purpose-built Jawi corpus. [20] applied grapheme-based transliteration to an old Malay manuscript epic, Merong Mahawangsa. With the grapheme method, [20] successfully converted the letter ﻑ to ڤ in old Jawi words, so that it corresponds to the Rumi grapheme p rather than f. This is needed because most old Jawi script does not distinguish ڤ from ﻑ. The work of [20] can therefore reduce the gap between old and new Jawi. Many other studies focus on Rumi to new Jawi transliteration. Transliteration from old Jawi to Rumi is different because it has its own rules and methods, and it has received far less attention than new Jawi.

A. Homographs and Fewer Vowels
Homographs are two or more words that have the same spelling but different meanings [23]. A homograph can also coincide with a homonym (same sound); for example, the word mereka (to create) and mereka (they) are similar in raw sound [24]. In the Jawi script, most homographs arise because the Jawi writing system itself uses only four vowel letters (ﺍ, ﻭ, ﻱ, ﻯ), compared with six vowels (a, e (e-pepet), e (e-taling), i, o, u) in Rumi writing. Homograph problems are more frequent in old Jawi spelling because of its principle of economy, which uses even fewer vowel letters [7]. The following old Jawi sentences contain homographs.

ﺍﻳﻪ ﭬﺎﻛﻲ ﺗﻮﭬﻲ، ﺗﻴﻤﺒﻖ ﺗﻮﭬﻲ ﺑﺮﺳﺎﻡ ﺍﻧﭽﺊ ﺟﻮﻫﺮ ﺩ ﺟﻮﻫﺮ
Ayah pakai topi, tembak tupai bersama En. Johar di Johor. (Dad wears a hat and shoots squirrels with Mr Johar in Johor.)

ﺍﻳﻪ ﻣﺎﻛﻦ ڬﻮﻟﻲ، ﺳﺪڠﻜﻦ ﺍﻧﻖ ﺑﺮﻣﺎءﻳﻦ ڬﻮﻟﻲ
Ayah makan gulai, sedangkan anak bermain guli. (Dad eats curry, while the child plays marbles.)

In the first sentence the same spelling ﺗﻮﭬﻲ is read as both topi and tupai, and ﺟﻮﻫﺮ as both Johar and Johor. In the second sentence ڬﻮﻟﻲ is read as both gulai and guli.
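The ambiguity behind these homographs, and behind the direct character mapping used in earlier work, can be made concrete with a minimal Python sketch. It is an illustration only, not the method of any cited system: the `READINGS` table, the function name `rumi_candidates` and the choice of readings for each vowel letter are assumptions, and the table covers just the letters needed for two of the words above.

```python
from itertools import product

# Hypothetical, deliberately tiny mapping table (an assumption for illustration).
# Each Jawi letter is mapped to every Rumi reading it may take.
READINGS = {
    "\u062A": ["t"],        # ta
    "\u06A4": ["p"],        # pa
    "\u06AC": ["g"],        # ga
    "\u0644": ["l"],        # lam
    "\u0648": ["o", "u"],   # wau: read as o or u
    "\u064A": ["i", "ai"],  # ya: read as i or as the diphthong ai
}


def rumi_candidates(jawi_word):
    """Return every Rumi string that the mapping table allows for one word."""
    options = [READINGS[ch] for ch in jawi_word]
    return ["".join(parts) for parts in product(*options)]


# The spelling ta-wau-pa-ya covers both "topi" (hat) and "tupai" (squirrel).
TOPI_SPELLING = "\u062A\u0648\u06A4\u064A"
# The spelling ga-wau-lam-ya covers both "guli" (marbles) and "gulai" (curry).
GULI_SPELLING = "\u06AC\u0648\u0644\u064A"

if __name__ == "__main__":
    print(rumi_candidates(TOPI_SPELLING))
    print(rumi_candidates(GULI_SPELLING))
```

Each spelling yields four candidates, two of which are real Malay words, so a one-to-one letter table cannot decide between them without a word list or sentence context.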

B. Inconsistent Spelling
In old Malay manuscripts, the spelling of some words varies even within the same book, and it also varies between authors. Writers of old Jawi fall into several classes, including people educated in religion, anonymous palace scribes, scholars and the general public [8]. Some of these writers were not trained in the methods of Jawi writing, so they did not follow the rules, and their writing is untidy and their sentences unclear. Examples identified by [8] include pamphlets on ilmu wafaq (wafaq knowledge), ilmu hikmat (wisdom knowledge), mantera (mantras), traditional medicine and others.

C. Sentence Structure
Most old Jawi texts and old manuscripts do not use punctuation [25]. A new sentence usually begins with a word such as bahawasanya (whereas), maka (therefore) or lagi (again), which acts as a sentence delimiter, although the words maka (then) and lagi (again) sometimes act as conjunctions instead. The word dan (and) is also used at the beginning of a sentence. In addition, a hadith or Quranic verse can serve as a sentence delimiter.
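A delimiter-based segmentation of unpunctuated text can be sketched as follows. This is a minimal sketch under stated assumptions: the tokens are taken to be already romanized, the `DELIMITERS` set holds only the three opening words named above (dan is left out because it is far more often a conjunction), and the sample sentence is invented for illustration. Every occurrence of a delimiter is treated as a sentence start, so the sketch will over-split wherever maka or lagi is acting as a conjunction.

```python
# Hypothetical delimiter list: the sentence-opening words named in the text.
DELIMITERS = {"bahawasanya", "maka", "lagi"}


def split_sentences(tokens):
    """Start a new sentence each time a delimiter word is encountered."""
    sentences = []
    current = []
    for token in tokens:
        # A delimiter closes the running sentence, unless nothing was collected yet.
        if token.lower() in DELIMITERS and current:
            sentences.append(current)
            current = []
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    # Invented sample text, not taken from any manuscript.
    sample = "maka raja pun berangkat maka sekalian menteri pun mengiring".split()
    for sentence in split_sentences(sample):
        print(" ".join(sentence))
```

A fuller treatment would need context to tell a delimiter from a conjunction, which this sketch does not attempt.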

IV. CHALLENGES

Transliterating old Malay manuscripts poses several challenges that stem from their use of old Jawi spelling. Several of these challenges are discussed here.

D. Spelling System
Most old spelling does not put dots on the letter ﻱ (ya) when it occurs at the end of a word. This is because many of the old books were printed in Arabia [25], and in Arabic the letter ﻯ (ya) takes no dots at the end of a word. This is unlike other Malay letters, for example چ (ca), ڠ (nga) and ڽ (nya), which carry three dots, and ڬ (ga), which carries one dot. Sometimes the dot of ڬ (ga) is placed under the letter.

E. Arabic Loanwords
Arabic loanwords are common, for example where a kitab (book) or manuscript was written in Mecca. Arabic is closely associated with Islam itself. Sometimes the author of a kitab, for example Kitab Hidayah al-Salikin, finds it difficult to find a corresponding meaning in the Malay vocabulary because the original book, Kitab Bidayatul Hidayat by al-Imam al-Ghazali, is written in Arabic. Even where Malay words exist, they may not convey the original intent of the word. Therefore, most scholars who

TABLE II
ARABIC LOANWORD

Jawi | Rumi (Malay) | English
ﻧﺒﻰ | nabi | prophet
ﺷﻔﺎﻋﺔ | syafaat | mediation
ﻧﺒﻮﺓ | nubuwwah | prophethood
ﻋﺎﻟﻢ | alim | pious
ﻓﺼﻞ | fasal | clause
ﻓﻀﻠﺔ | fadhilat | virtues
ﻗﺮﺍﻥ | quran | Quran
ﺣﺪﻳﺚ | hadith | hadith
ﻋﻠﻤﺎء | ulamak | scholars
ﺍﻓﻀﻞ | afdal | nice
ﻣﻨﻔﻌﺔ | manfaat | benefit
ﻋﻠﻢ | ilmu | knowledge

Old Jawi spelling from before Za'ba (1949) mostly does not use vowel strokes at the end of open and closed syllables.

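[Figure: rule-based transliteration example. The input ﻛﺎﺕ is checked against the patterns (rules) as a two-syllable word; the pattern cv+c becomes cv+ca, so ka + t becomes ka + ta and the output is kata.]

Fig. 2 Transliteration process based on rules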
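The example in Fig. 2 can be sketched in Python as a two-step process: map each letter to Rumi, then restore the unwritten final vowel when the consonant-vowel pattern matches a rule. This is a sketch under assumptions, not the authors' implementation: the `LETTERS` table, the function names and the single rule (cv+c becomes cv+ca) are illustrative, and the two-syllable check shown in the figure is not modelled.

```python
VOWELS = set("aeiou")

# Hypothetical letter table covering only the letters of the Fig. 2 example.
LETTERS = {
    "\u0643": "k",  # kaf
    "\u0627": "a",  # alif
    "\u062A": "t",  # ta
}


def cv_pattern(skeleton):
    """Describe a romanized skeleton as a string of c (consonant) and v (vowel)."""
    return "".join("v" if ch in VOWELS else "c" for ch in skeleton)


def insert_vowel(skeleton):
    """Apply the single illustrative rule cv+c -> cv+ca; leave other words as is."""
    if cv_pattern(skeleton) == "cvc":
        return skeleton + "a"
    return skeleton


def transliterate(jawi_word):
    """Map letters one by one, then restore the vowel that old Jawi left out."""
    skeleton = "".join(LETTERS[ch] for ch in jawi_word)
    return insert_vowel(skeleton)


if __name__ == "__main__":
    # kaf-alif-ta maps to the skeleton "kat", which the rule completes to "kata".
    print(transliterate("\u0643\u0627\u062A"))
```

Applied blindly, such a rule would also alter words that genuinely end in a consonant, which is one reason a word list lookup is a sensible first step before any rule is applied.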

V. ARCHITECTURE OF OLD MALAY TRANSLITERATION

Fig. 1 shows the proposed architecture for old Malay transliteration. The architecture is a hybrid model, chosen because of the unique character of old Jawi spelling. Each word in an old Malay manuscript is first searched for in the word list; character mapping is applied only when there is no matching word, that is, when the word is Out of Vocabulary (OOV).

Words that are still not found in the list after this step are then tested using a search engine. Consider the sentence “ﺍﻳﻪ ﭬﺎﻛﻲ ﺗﻮﭬﻲ”. The phrase “ﭬﺎﻛﻲ ﺗﻮﭬﻲ” is not available in the list, so an on-line resource is consulted through a search engine API. The candidate with the highest number of results is selected as the transliteration output, as shown in Table III.

Some rules of old Jawi writing, such as the Shift Law, the Insert Law and the vowel-letter materialization methods, have to be taken into consideration in the design of old Jawi to Rumi transliteration. The vowel-letter materialization methods involve the Kāf-Ga Law and the out-of-Deranglu Law (ﺩ ﺭ ڠ ﻝ ﻭ). This is important because several laws used in old Jawi spelling are no longer used in new Jawi spelling. For example, the Shift and Insert Laws are no longer used in new Jawi spelling, while the Kāf-Ga Law and the out-of-Deranglu Law (ﺩ ﺭ ڠ ﻝ ﻭ) remain in use today. The Deranglu Law uses full vowels for each of its syllables.
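The two decisions in this architecture, looking a word up before falling back to rules and choosing between homograph candidates by search engine result counts, can be sketched as below. This is a minimal sketch under assumptions: the function names are hypothetical, no real search API is called, and the hit counts are simply the two figures reported in Table III, held in a dictionary that stands in for the on-line resource.

```python
from typing import Callable, Dict, List


def transliterate_word(jawi_word: str,
                       wordlist: Dict[str, str],
                       rule_based: Callable[[str], str]) -> str:
    """Look the word up first; use the rule-based step only for OOV words."""
    if jawi_word in wordlist:
        return wordlist[jawi_word]
    return rule_based(jawi_word)


def choose_by_hits(candidates: List[str],
                   hit_count: Callable[[str], int]) -> str:
    """Return the candidate phrase with the highest number of search results."""
    return max(candidates, key=hit_count)


# Offline stand-in for the search engine: the counts reported in Table III.
TABLE_III_HITS = {
    "Ayah pakai topi": 1_040_000,
    "Ayah pakai tupai": 430_000,
}


def table_iii_hit_count(phrase: str) -> int:
    """Hypothetical lookup that replaces a live search engine query."""
    return TABLE_III_HITS.get(phrase, 0)


if __name__ == "__main__":
    best = choose_by_hits(list(TABLE_III_HITS), table_iii_hit_count)
    print(best)
```

Passing the hit-count function as a parameter keeps the selection logic independent of any particular search service, so a corpus frequency table could be substituted without changing `choose_by_hits`.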


wrote such books are more comfortable using Arabic loanwords, such as those in Table II, because they convey the meaning more accurately and have blended into Malay culture.


TABLE III
SEARCH ENGINE RESULTS FOR HOMOGRAPH CANDIDATES

Jawi | Rumi (Malay) | English | Result (based on Google search engine)
ﺍﻳﻪ ﭬﺎﻛﻲ ﺗﻮﭬﻲ | Ayah pakai topi | Dad wears a hat | 1,040,000 (√)
ﺍﻳﻪ ﭬﺎﻛﻲ ﺗﻮﭬﻲ | Ayah pakai tupai | Dad wears a squirrel | 430,000 (X)

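[Figure: block diagram with the components Input = Old Jawi, Old Jawi Word, Database Lookup, Malay Wordlist, Transliteration Rule (Insertion Vowel), Mapping Tables, Homograph Word, Search Engine, Rumi Word and Output = Rumi.]

Fig. 1 Proposed architecture for old Malay transliteration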

Database lookup is used for old Jawi words (the input), which are checked against the Jawi-Rumi word list provided. If the word is found in the list, its Rumi counterpart is returned. If the word is not available, the rule-based approach is applied to the old Jawi word. Whereas [14, 26] use the collapsed-vowel (CV) model, this study uses a different approach, called Insertion-Vowel (IV). See Fig. 2 for the transliteration process based on rules. This is the reverse of the process in [21].

VI. CONCLUSION

From the above discussion, we can see some of the differences between the old and new Jawi spelling systems. Developing machine transliteration from old Jawi to Rumi is not as easy as from new Jawi to Rumi. The work is more complex because the characters used are different. The accuracy of the results obtained from the proposed design depends on how closely the laws and rules discussed above can be followed.


ACKNOWLEDGMENT

Special thanks to the International Islamic University College Selangor (KUIS) for providing the funds to continue studies at PhD level under the Academic Staff Training Scheme KUIS (SLAK). Thanks also to Universiti Kebangsaan Malaysia (UKM) for sponsoring the fees to attend this seminar.

REFERENCES

[1] Amat Juhari Moain, "Sejarah Tulisan Jawi," in Jurnal Dewan Bahasa, 1991.
[2] Zainal Abidin Ahmad (Za'ba), Daftar Ejaan Melayu (Rumi-Jawi). Singapore: Department of Education Federation of Malaya Printers Ltd., 1949.
[3] Ismail Dahaman, "Pedoman ejaan jawi yang disempurnakan (1986)," presented at the Konvensyen Tulisan Jawi, Kuala Lumpur, 1991.
[4] Hamdan Abdul Rahman, "Sistem Baharu Ejaan Jawi Bahasa Melayu," presented at the Konvensyen Tulisan Jawi, Terengganu, 1984.
[5] G. S. Josan and J. Kaur, "Punjabi to Hindi Statistical Machine Transliteration," International Journal of Information Technology, vol. 4, pp. 459-463, 2011.
[6] M. F. Nasrudin, et al., "Handwritten Cursive Jawi Character Recognition: A Survey," in Fifth International Conference on Computer Graphics, Imaging and Visualisation (CGIV '08), 2008, pp. 247-256.
[7] Mahmud Haji Ashari, et al., "Antara Jawi lama dan baru serta masalah pelaksanaannya," presented at the Konvensyen Tulisan Jawi, Kuala Lumpur, 1991.
[8] Wan Mohd Shaghir Abdullah, "Tulisan Melayu/Jawi dalam manuskrip dan kitab bercetak: Suatu analisis perbandingan," in Tradisi Penulisan Manuskrip Melayu. Kuala Lumpur: Perpustakaan Negara Malaysia, 1997, pp. 87-105.
[9] Antony P J and Soman K P, "Machine Transliteration for Indian Languages: A Literature Survey," International Journal of Scientific & Engineering Research, vol. 2, pp. 1-8, 2011.
[10] M. Arbabi, et al., "Algorithms for Arabic name transliteration," IBM Journal of Research and Development, vol. 38, pp. 183-194, 1994.
[11] K. Knight and J. Graehl, "Machine transliteration," Computational Linguistics, vol. 24, pp. 128-135, 1997.
[12] B. G. Stalls and K. Knight, "Translating names and technical terms in Arabic text," in COLING/ACL Workshop on Computational Approaches to Semitic Languages, 1998, pp. 34-41.
[13] Y. Al-Onaizan and K. Knight, "Machine transliteration of names in Arabic text," in ACL-02 Workshop on Computational Approaches to Semitic Languages, 2002, pp. 1-13.
[14] A. T. Sarvnaz Karimi, and Falk Scholer, "English to Persian Transliteration," P. F. Fabio Crestani and Mark Sanderson, Eds. Glasgow, UK, October 11-13, 2006: Springer, 2006, pp. 255-266.
[15] A. Malik, et al., "A hybrid model for Urdu Hindi transliteration," presented at the 2009 Named Entities Workshop: Shared Task on Transliteration, Suntec, Singapore, 2009.
[16] K. A. Rahman, "Perisian Pemprosesan Perkataan untuk Sistem Tulisan Rumi dan Jawi (Editor Rumi-Jawi)," Bachelor's thesis (Information Technology), Universiti Kebangsaan Malaysia, 1998.
[17] Ab. Nasir Abdul Aziz, "Sistem Pertukaran Tulisan Rumi Ke Jawi," Bachelor's thesis, Universiti Teknologi Malaysia, Skudai, Johor Bahru, 1998.
[18] Suhailan Safei, "Sistem Penterjemahan Jawi Ke Rumi," Bachelor's thesis, Universiti Teknologi Malaysia, Skudai, Johor Bahru, 2000.
[19] C. W. S. B. C. W. Ahmad, "Penterjemah Jawi lama kepada Jawi baru," Master's thesis, Fakulti Teknologi Dan Sains Maklumat, Universiti Kebangsaan Malaysia, 2007.
[20] Juhaida Abu Bakar, "Transliterasi Jawi lama kepada Jawi baru berasaskan grafem (kajian kes pada Hikayat Merong Mahawangsa)," Master's thesis, Fakulti Teknologi Dan Sains Maklumat, Universiti Kebangsaan Malaysia, 2008.
[21] Yonhendri, "Transliterasi Rumi ke Jawi berasaskan petua," Master's thesis, Fakulti Teknologi Dan Sains Maklumat, Universiti Kebangsaan Malaysia, 2009.
[22] Roslan Abdul Ghani, et al., "Jawi-Malay transliteration," in International Conference on Electrical Engineering and Informatics (ICEEI '09), 2009, pp. 154-157.
[23] "homograph," in Collins English Dictionary – Complete and Unabridged, 1991, 1994, 1998, 2000, 2003.
[24] Adi Yasran Abdul Aziz and Hashim Musa, "Isu homograf dan cabarannya dalam usaha pelestarian tulisan Jawi," Jurnal ASWARA, vol. 3, pp. 109-126, 2008.
[25] Muhammad @ Mokhtar Talib (MATLOB), Pandai Jawi, 3 ed. Shah Alam: Cerdik Publications Sdn. Bhd., 2007.
[26] S. Karimi, "Machine transliteration of proper names between English and Persian," PhD thesis, School of Computer Science and Information Technology, RMIT University, Melbourne, Victoria, Australia, 2008.

Che Wan Shamsul Bahri is a lecturer at the International Islamic University College Selangor (KUIS), Bangi, Selangor, Malaysia. He is a PhD candidate at Universiti Kebangsaan Malaysia (UKM), Bangi, Selangor, Malaysia.
Prof Khairuddin Omar, Dr Mohammad Faidzul Nasrudin and Mohd Zamri Murah are lecturers at the Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia (UKM), Bangi, Selangor. They are also members of the Center for Artificial Intelligence Technology (CAIT), UKM.
Mohd Sanusi Azmi is a lecturer at Universiti Teknikal Malaysia (UTeM), Malacca, Malaysia. He is also a member of the Center for Artificial Intelligence Technology (CAIT), UKM.
