Simple Stemming Rules for Arabic Language

Hussein Soori, Jan Platos, Vaclav Snasel

Abstract

Processing of the Arabic language is very important and topical these days. Arabic is the sixth most used language in the world. Stemming is an important problem in information retrieval, knowledge mining and language processing. Arabic has a very complex morphology, and stemming rules must deal with many specific properties of the language. This paper describes five very simple rules for stemming Arabic words: two rules are universal, i.e. applicable to any word type, and there is one rule for adverbs and adjectives, one rule for nouns and one rule for verbs.

1 Introduction

Research on Arabic language processing is motivated by two main reasons: the rapidly growing number of computer and internet users in the Arab world, and the fact that Arabic is the sixth most used language in the world today. It is also one of the six languages used in the United Nations (www.un.org). Another important factor is that, after the Latin alphabet, the Arabic alphabet is the second most widely used alphabet in the world. Arabic script has been used and adapted to such diverse languages as the Slavic languages, Spanish, Persian, Urdu, Turkish, Hebrew, Amazigh (Berber), Malay (Jawi in Malaysia and Indonesia), Hausa and Mandinka (in West Africa), Swahili (in East Africa), Sudanese, and some other languages [1].

Department of Computer Science, FEECS, VSB-Technical University of Ostrava, 17. listopadu 15, 708 00 Ostrava-Poruba, Czech Republic, {sen.soori, jan.platos, vaclav.snasel}@vsb.cz


1.1 Challenges to be Considered When Working with Arabic Texts

One of the challenges faced by researchers in this area has to do with the special nature of Arabic script. Arabic is written horizontally from right to left. The shape of each letter depends on its position in a word: initial, medial, or final. There is a fourth form of the letter when it is written alone. The letters alif, waw, and ya (standing for glottal stop, w, and y, respectively) are used to represent the long vowels a, u, and i. This is very different from the Roman alphabet, whose letters are naturally not linked. Other orthographic challenges are the persistent and widespread variation in the spelling of letters such as hamza (ء) and ta' marbuTa (ة), as well as the increasing lack of differentiation between word-final ya (ي) and alif maqSura (ى). Another problem that may appear is that typists often neglect to insert a space after words that end with a non-connector letter such as ر, ز, و [3]. In addition, Arabic has eight short vowels and diacritics (see Table 1). Typists normally omit them, but in texts where they exist, they are pre-normalized (in value) to avoid any mismatch with the dictionary or corpus in light stemming. As a result, the letters in the resulting text appear without these diacritics.

Table 1 Short vowels and diacritical marks

( ◌َ , ◌ِ , ◌ُ , ◌ْ , ◌ً , ◌ٍ , ◌ٌ , ◌ّ )
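In Unicode, the eight marks of Table 1 occupy the contiguous range U+064B to U+0652, so removing them from a text can be sketched in a few lines of Python. The name strip_diacritics is an illustrative assumption and not part of the original work.

```python
# Minimal sketch: remove the eight short vowels/diacritics of Table 1.
# Assumption: the text is Unicode and the marks are U+064B (fathatan)
# through U+0652 (sukun).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0652 + 1)}


def strip_diacritics(text: str) -> str:
    """Return the text without the Table 1 marks (hypothetical helper)."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


# "rijl" (foot) written with kasra and sukun, then printed without them
print(strip_diacritics("\u0631\u0650\u062C\u0652\u0644"))
```

Base letters are left untouched, so a word written without diacritics passes through unchanged.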

Diacritization has always been a problem for researchers. According to Habash [4], since diacritics occur so infrequently in Arabic text, they are removed from the text by most researchers. Other text recognition studies in Arabic include Andrew Gillies et al. [5], John Trenkle et al. [6] and Maamouri et al. [7]. The problem of syllabification was studied in [2].

This paper is organized as follows: Section 2 describes Arabic morphology, Section 3 describes the rules for stemming, Section 4 describes the experimental results, and the last section concludes the paper.

2 Arabic Morphology


Arabic is considered one of the highly inflectional languages with complex morphology. Unlike most other languages, it is written from right to left. Its alphabet consists of 28 main letters. As mentioned earlier, the shape of each letter depends on its position in a word (initial, medial, or final), and there is a fourth form of the letter when it is written alone (separate). One example of this is given for the letter ayn (ع) in Table 2. The full Arabic alphabet is shown in Table 3.

Table 2 Shape of a character depending on its position in a word

Initial: ﻋـ    Medial: ـﻌـ    Final: ـﻊ    Separate: ﻉ

Table 3 Arabic alphabet

alif ا    baa ب    taa ت    thaa ث
jiim ج    haa ح    kha خ    daal د
thaal ذ    raa ر    zaay ز    siin س
shiin ش    saad ص    daad ض    taa ط
thaa ظ    ayn ع    ghayn غ    faa ف
qaaf ق    kaaf ك    laam ل    miim م
nuun ن    ha ه    waaw و    yaa ي

In addition to the 28 letters, in some cases we may encounter dual letters such as (لا), which is a combination of two letters (ل + ا). These dual letters are considered as one letter. Apart from letters, another factor determines the word identity and can in many instances change the meaning and part of speech: the eight short vowels and diacritics (see Table 1). An example for (رجل) is given in Table 4, where we can see the total change in word category and meaning as a result of adding the diacritics, which produce three different words in meaning and three different parts of speech from the same three letters (رجل).

Table 4 Application of diacritics changes word meaning

Word    Meaning    Part of Speech
رَجُلُ    man    noun (subject)
رَجُلَ    man    noun (object)
رِجْل    foot    noun
رَجِلَ    to go on foot (rather than, e.g., ride a bike)    verb


Nevertheless, it is generally advised that these vowels and diacritics be normalized before processing in most light stemming or morphological approaches [4]. The main reasons given for leaving them out of word processing are the claim that they occur so infrequently, and that in Modern Standard Arabic (MSA) people tend not to use them; as a result, the meaning is left to the native speaker's intuition or, in some cases, can be determined from the context. This problem is still waiting for a challenging attempt in which the processor can handle words with or without diacritics, without needing to normalize them.

Another morphological feature of Arabic is that, unlike Roman letters, which are naturally separated, Arabic has an agglutinated nature (as mentioned above): letters are linked to each other in some cases and unlinked in others, depending on the position of the letter at the root, stem and word level. For example, in English the pronoun (he) in (he plays) is separated from the following verb (plays), while in Arabic the pronoun is represented by the letter (ي), which is linked to the root verb لعب to form يلعب (he plays). The same is true of the different kinds of affixes.

Arabic has four types of affixes. Prefixes: these are letters (normally one) that change the tense of the verb from past to present, such as the letter (ي) in the case of the verbs لعب and يلعب above. Suffixes: these represent the inflectional terminations (endings) of verbs, as well as the feminine and dual/plural markers of nouns. Postfixes: these are the pronouns attached at the end of the word. Antefixes: these are prepositions agglutinated to the beginning of words. One problem that always occurs during word processing is when another word, of the same or a different category as (يلعب) above, starts with the letter (ي), as in يمن (Yemen) or يئس (felt desperate), where the ya (ي) is not an affix. Most processors would remove the ya (ي), resulting in undesirable output. This calls for deeper stemming that involves morphology as well as syntax, but this is beyond the scope of this study.
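The ambiguity of a word-initial ya can be made concrete with a deliberately naive sketch. The helper naive_strip_ya below is hypothetical and is not one of the rules proposed in this paper; it only shows why removing the letter blindly is unsafe.

```python
YA = "\u064A"  # the letter ya


def naive_strip_ya(word: str) -> str:
    """Drop a leading ya without checking whether it is really a prefix."""
    return word[1:] if word.startswith(YA) else word


examples = [
    "\u064A\u0644\u0639\u0628",  # "he plays": ya is a present-tense prefix
    "\u064A\u0645\u0646",        # "Yemen": ya belongs to the word itself
]
for word in examples:
    print(word, "->", naive_strip_ya(word))
```

For the verb the result is the root, while for the proper noun the same operation removes part of the word. Telling the two cases apart needs morphological or syntactic information that a purely character-based rule does not have.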

3 Simple Stemming Rules

Our rules are divided into five categories. The first two categories apply to any word type. The first necessary step is to normalize the words using the following process: convert the different types of alif, (أ) and (إ), to bare alif (ا), and change alif maqSura (ى) to ya (ي).
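The normalization step can be sketched as a simple character translation. The sketch assumes Unicode input and covers only the hamza-carrying alif forms (U+0623, U+0625) and alif maqSura (U+0649); the function name normalize is illustrative.

```python
# Minimal sketch of the normalization step (names are illustrative).
# Assumption: only alif with hamza above/below and alif maqSura are mapped.
NORMALIZATION_TABLE = str.maketrans({
    "\u0623": "\u0627",  # alif with hamza above -> bare alif
    "\u0625": "\u0627",  # alif with hamza below -> bare alif
    "\u0649": "\u064A",  # alif maqSura -> ya
})


def normalize(word: str) -> str:
    """Replace alif variants and alif maqSura as described in the text."""
    return word.translate(NORMALIZATION_TABLE)


# Word for "teeth" written with a hamza-carrying alif
print(normalize("\u0623\u0633\u0646\u0627\u0646"))
```

Using a translation table keeps the step independent of word length and position, which matches the description of normalization as a universal first pass.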

After normalization, remove all possible prefixes (prepositions, definite articles and conjunctions) using the following process:

• Remove baa (ب) from the beginning of every word;
• Remove waaw (و) from the beginning of every word, if the remainder has 3 or more characters.

Then remove the following linked characters from the beginning of a word, if the number of remaining characters is higher than three:

• Remove: alif (ا) and laam (ل);
• Remove: alif (ا), waaw (و) and laam (ل);
• Remove: alif (ا), baa (ب) and laam (ل);
• Remove: alif (ا), kaaf (ك) and laam (ل);
• Remove: alif (ا), faa (ف) and laam (ل).
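A minimal Python sketch of this prefix step follows. Two points are assumptions rather than statements of the paper: the listed letter groups are read as the written prefixes ال, وال, بال, كال and فال, and the length condition for them is taken as "at least three characters remain", because Table 6 shows الصين reduced to صين. All names are illustrative.

```python
BAA, WAAW = "\u0628", "\u0648"
AL = "\u0627\u0644"  # alif + laam
# Assumed written forms of the listed letter groups, longest first.
LINKED_PREFIXES = [WAAW + AL, BAA + AL, "\u0643" + AL, "\u0641" + AL, AL]


def strip_prefixes(word: str, min_remaining: int = 3) -> str:
    """Remove leading baa, waaw and one linked prefix (hypothetical helper)."""
    # Baa is removed from the beginning of every word (no length check given).
    if word.startswith(BAA):
        word = word[1:]
    # Waaw is removed only if 3 or more characters remain.
    if word.startswith(WAAW) and len(word) - 1 >= 3:
        word = word[1:]
    # At most one linked prefix is removed, subject to the length condition.
    for prefix in LINKED_PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_remaining:
            return word[len(prefix):]
    return word


print(strip_prefixes("\u0627\u0644\u0635\u064A\u0646"))  # "China" with the article
print(strip_prefixes("\u0628\u0644\u062F"))              # "country", root starts with baa
```

Because baa is removed unconditionally, the second example loses a root letter. This is the same kind of undesirable output that Table 6 reports for بلد.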

The next rule is designed for suffixes. The rule consists of removing linked characters if the remainder has two or more characters:

• Remove: ha (ه) and alif (ا);
• Remove: alif (ا) and noon (ن);
• Remove: alif (ا) and taa (ت);
• Remove: waaw (و) and noon (ن);
• Remove: ya (ي) and noon (ن);
• Remove: ya (ي) and ha (ه);
• Remove: ya (ي) and ta' marbuTa (ة).
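The suffix rule can be sketched in the same style. The sketch assumes that each letter pair is written in the order in which it is named (for example alif followed by noon) and that at most one suffix is removed per word; the name strip_suffix and the min_remaining parameter are illustrative.

```python
# The seven two-letter suffixes of the rule, in the order listed in the text.
SUFFIXES = [
    "\u0647\u0627",  # ha + alif
    "\u0627\u0646",  # alif + noon
    "\u0627\u062A",  # alif + taa
    "\u0648\u0646",  # waaw + noon
    "\u064A\u0646",  # ya + noon
    "\u064A\u0647",  # ya + ha
    "\u064A\u0629",  # ya + ta' marbuTa
]


def strip_suffix(word: str, min_remaining: int = 2) -> str:
    """Remove one listed suffix if enough characters remain (hypothetical helper)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_remaining:
            return word[: -len(suffix)]
    return word


print(strip_suffix("\u0643\u0628\u064A\u0631\u0627\u0646"))  # dual form of "big"
```

With the threshold stated in the text, لغات would be reduced to لغ, whereas Table 6 shows it unchanged at this step, so the original implementation may have used a stricter length condition. The parameter is exposed so that both readings can be tried.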

The third rule is designed for adjectives and adverbs, more precisely for changing adjectives and adverbs from the feminine into the masculine form:

• Remove ta' marbuTa (ة) if found as the last character in a word;
• Remove ha (ه) if found as the last character in a word.

The fourth rule is designed for nouns; its focus is to change nouns from the plural into the singular form. In this rule, the following linked characters are removed if the remainder is longer than 2 characters. If the word has only three characters, remove the following characters if the additional conditions are fulfilled:

• Remove alif (ا) from the beginning of every word;
• Remove alif (ا) if it is found as the character before the last one;
• Remove waaw (و) if it is found as the character before the last one;
• Remove alif (ا) from the beginning of every word, only if the character before the last is also alif (ا).

If the word has four characters, remove the following characters if the additional conditions are fulfilled:

• Remove alif (ا) from the character before the last one;
• Remove alif (ا) if it is found as the third character.
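Of the two rules above, the third (feminine to masculine for adjectives and adverbs) reduces to dropping one final character and can be sketched directly; the fourth rule is not sketched here because its conditions admit more than one reading. The function name to_masculine is an illustrative assumption.

```python
TA_MARBUTA, HA = "\u0629", "\u0647"


def to_masculine(word: str) -> str:
    """Drop a final ta' marbuTa or ha, as in the third rule (hypothetical helper)."""
    return word[:-1] if word.endswith((TA_MARBUTA, HA)) else word


print(to_masculine("\u0643\u0628\u064A\u0631\u0629"))  # feminine form of "big"
```

The text gives no minimum length for this rule, so the sketch applies it to words of any length.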


The fifth rule is designed for verbs. It converts inflectional forms of the verb into the stem. The characters are removed from the end of the word, if the remainder has at least 2 characters.

• Remove: taa (ت)
• Remove: yaa (ي)
• Remove: alif (ا)
• Remove: alif (ا) if found as the second character in a word, only if alif (ا) is found as the last character in a word.
• Remove: waaw (و)
• Remove: miim (م)
• Remove: noon (ن) and alif (ا)
• Remove: taa (ت) and alif (ا)
• Remove: taa (ت) and miim (م)
• Remove: taa (ت) and ya (ي)
• Remove: miim (م) and alif (ا)
• Remove: miim (م), taa (ت) and alif (ا)
• Remove: noon (ن), taa (ت) and alif (ا)
• Remove alif (ا) if it is the second in a word. After that, remove taa (ت) at the end of the word
• Remove: taa (ت) and alif (ا)

4 Experiments

The suggested rules must be tested against real data. For this purpose, we used news articles from the BBC Arabic and Al Jazeera Arabic news portals, from which we selected some of the most frequent nouns, verbs, adverbs and adjectives. These words were divided into group files according to their part of speech. Each group consists of several words, selected carefully to cover all types of words. The lists of words used in the experiments are shown in Table 5. All source data were encoded using code page 1256 (Arabic). In our experiments, we use the rules described in the previous section. There are three common rules used for each word type and one specific rule per group. We therefore process each word with four rules and obtain five versions of the word during its processing. The first version is the original word. The second version has alif and alif maqSura replaced. The third version has prepositions, definite articles and conjunctions processed. The fourth version has suffixes processed. The fifth version is the result of the rule specific to the word type. Errors are highlighted in bold font.
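Since the source data are stored in code page 1256, they have to be decoded before any character-based rule is applied. A minimal sketch using Python's built-in "cp1256" codec is shown below; the function name decode_word_list and the whitespace-separated layout of the word files are assumptions made for illustration.

```python
def decode_word_list(data: bytes) -> list:
    """Decode CP1256 bytes and split them into words (hypothetical helper)."""
    return data.decode("cp1256").split()


# Two sample verbs, encoded the way they would be stored in a CP1256 file.
raw = "\u0643\u062A\u0628 \u0632\u0631\u0639".encode("cp1256")
print(decode_word_list(raw))
```

In CP1256 each Arabic letter occupies a single byte, so the sample above is seven bytes long including the space.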


Table 5 List of words used in our experiments for each word type

a) Nouns: مكتبة, أسنان, قلب, بلد, الصين, قلم, المحامي, إصبعان, إصابع, لغة, لغات, إصبع, خدود, خد, حوت, أرنب, سمك, حصان, تمساح, مهندس, زوج, زوجة, قميص, لحم, دجاج, دجاجة, مكتب, عاصفة, سكرتيرة, ثلج, صحراء, ضباب, خليج, مكتبة, سكرتير, الإجهاض, الخريف, الكرة

b) Verbs: تغلب على, تزوج, رسم, تعامل, اعترف, ركز, اتصل, كسر, عض, إعترفت, سأل, سمح, توسل, استمتع, رن, كتب, زرع

c) Adverbs: غاضب, مستيقظ, سيء, نشيط, قادر, مكسور, يغلي, أفضل, أحسن

d) Adjectives: صغيرة, صغير, كبير, كبيرة, زرقاوان, أزرقان, أزرق, كبيران, كبير, بكماء, أبكم, زرقاء, أزرق, جميل, جميلة

Our results for the noun list are shown in Table 6. The noun category produced some undesirable output, especially in the case of plurals. Other letters were deleted by the processor because they were taken to be affixed prepositions. The results for the verb list are shown in Table 7. The verb category produced good results for two-, three-, four- and five-letter stems. However, the processor removed some letters from the middle and the end of three-letter verbs. Modification of the rules is to be considered in the future. Our results for adverbs are shown in Table 8. Processing the adverb category produced very good results in both masculine and feminine cases. The results for adjectives are shown in Table 9. Three undesirable results were produced in the adjective category, two in the case of the singular feminine and one in the case of the dual feminine. The morphological rules have to be enhanced in the future to overcome this problem. A summary of the experiments is given in Table 10. As may be seen, the best results were achieved for adverbs. Three errors each were produced for verbs and adjectives, and eight errors were produced for nouns.

Table 6 Results for nouns

Original    Step 1    Step 2    Step 3    Final
تمساح    تمساح    تمساح    تمساح    تمساح
حوت    حوت    حوت    حوت    حت
خد    خد    خد    خد    خد
خدود    خدود    خدود    خدود    خد
إصبع    اصبع    اصبع    اصبع    اصبع
إصبعان    اصبعان    اصبعان    اصبع    اصبع
اصابع    اصابع    اصابع    اصابع    اصابع
أسنان    اسنان    اسنان    اسن    سن
دجاجة    دجاجة    دجاجة    دجاجة    دجاجة
دجاج    دجاج    دجاج    دجاج    دجج
قميص    قميص    قميص    قميص    قميص
زوجة    زوجة    زوجة    زوجة    زوجة
لغات    لغات    لغات    لغات    لغت
الصين    الصين    صين    صين    صين
بلد    بلد    لد    لد    لد
صحراء    صحراء    صحراء    صحراء    صحراء
ضباب    ضباب    ضباب    ضباب    ضبب
سكرتيرة    سكرتيرة    سكرتيرة    سكرتيرة    سكرتيرة
سكرتير    سكرتير    سكرتير    سكرتير    سكرتير
الإجهاض    الاجهاض    اجهاض    اجهاض    اجهاض
الكرة    الكرة    كرة    كرة    كرة

Table 7 Results for verbs

Original    Step 1    Step 2    Step 3    Final
سمح    سمح    سمح    سمح    سمح
إعترفت    اعترفت    اعترفت    اعترفت    اعترف
سأل    سال    سال    سال    سل
عضت    عض    عض    عض    عض
اتصل    اتصل    اتصل    اتصل    اتصل
إعترفا    اعترف    اعترف    اعترف    اعترف
تعامل    تعامل    تعامل    تعامل    تعامل
تزوج    تزوج    تزوج    تزوج    زوج
على    علي    علي    علي    عل
زرعت    زرع    زرع    زرع    زرع
رن    رن    رن    رن    رن
استمتع    استمتع    استمتع    استمتع    استمتع
توسل    توسل    توسل    توسل    توسل


Table 8 Results for adverbs

Original    Step 1    Step 2    Step 3    Final
قادر    قادر    قادر    قادر    قادر
نشيط    نشيط    نشيط    نشيط    نشيط
مستيقظ    مستيقظ    مستيقظ    مستيقظ    مستيقظ
مستيقظة    مستيقظ    مستيقظ    مستيقظ    مستيقظ
سيء    سيء    سيء    سيء    سيء
أحسن    احسن    احسن    احسن    احسن
أفضل    افضل    افضل    افضل    افضل
يغلي    يغلي    يغلي    يغلي    غلي
مكسور    مكسور    مكسور    مكسور    مكسور

Table 9 Results for adjectives

Original    Step 1    Step 2    Step 3    Final
كبيرة    كبيرة    كبيرة    كبيرة    كبير
كبيران    كبيران    كبيران    كبير    كبير
كبير    كبير    كبير    كبير    كبير
صغيرة    صغيرة    صغيرة    صغيرة    صغير
جميلة    جميلة    جميلة    جميلة    جميل
جميل    جميل    جميل    جميل    جميل
أبكم    ابكم    ابكم    ابكم    ابكم
بكماء    بكماء    كماء    كماء    كماء
أزرق    ازرق    ازرق    ازرق    ازرق
زرقاء    زرقاء    زرقاء    زرقاء    زرقاء
أزرقان    ازرقان    ازرقان    ازرق    ازرق
زرقاوان    زرقاوان    زرقاوان    زرقاو    زرقاو

Table 10 Summary of the experiments

Word type    Errors    Word count
Nouns    8    26
Verbs    3    14
Adverbs    0    10
Adjectives    3    14
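The counts in Table 10 can be turned into per-category accuracy with a short calculation. The sketch below uses only the numbers from the table; the accuracy measure (one minus errors divided by word count) and all names are illustrative choices, not a metric defined in the paper.

```python
# (errors, word count) per word type, copied from Table 10
RESULTS = {
    "Nouns": (8, 26),
    "Verbs": (3, 14),
    "Adverbs": (0, 10),
    "Adjectives": (3, 14),
}


def accuracy(errors: int, count: int) -> float:
    """Fraction of words stemmed without error."""
    return 1 - errors / count


for word_type, (errors, count) in RESULTS.items():
    print(f"{word_type}: {accuracy(errors, count):.1%}")

total_errors = sum(errors for errors, _ in RESULTS.values())
total_words = sum(count for _, count in RESULTS.values())
print(f"Overall: {accuracy(total_errors, total_words):.1%}")
```

Summed over all categories, the table gives 14 errors in 64 words.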

5 Conclusion

In this paper we defined very simple rules for stemming Arabic words. We defined two universal rules, one rule for adjectives and adverbs, one rule for nouns and one rule for verbs. As may be seen from the experimental results, the rules were most successful in the case of adverbs. For nouns, verbs and adjectives, errors were produced. Most errors occurred in suffix processing. In the future, these rules must be improved and enhanced to cover more inflectional cases, and they must be tested against a wider vocabulary and a larger number of words.

Acknowledgement

This work was supported by the Grant Agency of the Czech Republic under grant no. P202/11/P142.

References

1. Encyclopaedia Britannica Online. Alphabet. Online (2011). URL http://www.britannica.com/EBchecked/topic/17212/alphabet
2. H. Soori, J. Platos, V. Snasel, H. Abdulla, in Digital Information Processing and Communications, Communications in Computer and Information Science, vol. 188, ed. by V. Snasel, J. Platos, E. El-Qawasmeh (Springer Berlin Heidelberg, 2011), pp. 97–105. DOI 10.1007/978-3-642-22389-1_9. URL http://dx.doi.org/10.1007/978-3-642-22389-1_9
3. T. Buckwalter, in Arabic Computational Morphology, Text, Speech and Language Technology, vol. 38, ed. by N. Ide, J. Veronis, A. Soudi, A.v.d. Bosch, G. Neumann (Springer Netherlands, 2007), pp. 23–41. DOI 10.1007/978-1-4020-6046-5_3. URL http://dx.doi.org/10.1007/978-1-4020-6046-5_3
4. N.Y. Habash, Synthesis Lectures on Human Language Technologies 3(1), 1 (2010). DOI 10.2200/S00277ED1V01Y201008HLT010. URL http://www.morganclaypool.com/doi/abs/10.2200/S00277ED1V01Y201008HLT010
5. A. Gillies, E. Erl, J. Trenkle, S. Schlosser, in Proceedings of the Symposium on Document Image Understanding Technology (1999)
6. J. Trenkle, A. Gilles, E. Eriandson, S. Schlosser, S. Cavin, in Symposium on Document Image Understanding Technology (2001), pp. 159–168
7. M. Maamouri, A. Bies, S. Kulick, in Proceedings of the British Computer Society Arabic NLP/MT Conference (2006)