Arabic Noun Classifier System - Semantic Scholar

4 downloads 0 Views 223KB Size Report
needed for the Arabic language. He stated that the lexicon should contain the part of speech (verb, noun, adverb, adjective, particle), and morphological.
International Journal of Computer Processing of Oriental Languages Vol. 17, No. 2 (2004) 97– 120 © Chinese Language Computer Society & World Scientific Publishing Company

New Technique to Support Arabic Noun Morphology: Arabic Noun Classifier System (ANCS) SALEEM ABULEIL* AND KHALID ALSAMARA† Information Systems Department, Chicago State University, 9501 S. King Drive, Chicago, IL 60628, USA * [email protected][email protected]

Many successful attempts have been discussed and implemented in Arabic verb morphology but very few have discussed Arabic noun morphology, particularly for nouns that are not derived from verbs. Most verbs in the Arabic language follow fixed rules that define their morphology and generate their paradigms. Since nouns are not like verbs in the Arabic language, there is no fixed rule to define the morphological information and generate the morphology paradigms for them, instead each group of nouns follows its own pattern. In this paper we describe a new technique to resolve this complexity by classifying the nouns into different groups according to their patterns for singular, plural, masculine and feminine. A method is generated for each group to be used to find the morphological information and to form its paradigm and building a classifier system using a rule base that uses suffix analysis as well as pattern analysis to help the user to classify the nouns and identify the group that it belongs. Keywords: Arabic Language; Noun Morphology; Classification, Pattern.

1. Introduction A morphology system is the backbone of a natural language processing system. No application in this field can survive without a good morphology system to support it. The Arabic language has its own features that are not found in other languages. That is why many researchers have worked in this area. Al-Fedaghi and Al-Anzi [4] present an algorithm to extract the root and the pattern of a given Arabic word. The main concept behind the algorithm is to locate the position of the root’s letters in the pattern and examine the letters in the same position in a given word to see whether the tri-graph forms a valid Arabic root or not. Al-Shalabi [5] developed a system that removes the longest possible prefix from the word where the three letters of the root must lie somewhere in the first four or 97

98

S. Abuleil and K. Alsamara

five characters of the remainder. Then he generates some combinations and checks each one of them against all the roots in the file. Al-Shalabi reduced the processing, but he discussed this from the point of view of verbs, not nouns. de Roeck and Waleed Al-Fares [13] developed a clustering algorithm for Arabic words sharing the same verbal root. They used root-based clusters to substitute for dictionaries in indexing for information retrieval. Beesley and Karttunen [7] described a new technique for constructing finite-state transducers that involves reapplying a regular-expression compiler to its own output. They implemented the system in an algorithm called compile-replace. This technique has proved useful for handling non-concatenative phenomena, and they demonstrate it on Malay full-stem reduplication and Arabic stem inter-digitations. Byrd et al. [8] claim that all natural language processing systems need a lexicon full of explicit information. Foxley and Feddag [10] adopted a strategy of combined affixes to alleviate the speed of operation overhead that the affix manipulation routines can take. Hammouri [11] outlined the kind of lexicon needed for the Arabic language. He stated that the lexicon should contain the part of speech (verb, noun, adverb, adjective, particle), and morphological information about each entry. Al-Samara [3] built an Arabic lexical database. This lexicon contains information about 10,000 words. Al-Sughaiyer and AlKharashi [6] provide definitions of standard linguistic terms as they are seen in Arabic analysis and identify efficiency, compactness, bi-directionality, success rate, and retrieval performance as the measures of the effectiveness of morphological analysis algorithms. After a review of the Arabic morphological analysis literature, they suggest the approaches may be classified as table lookup (large construction demands and space requirements), linguistic (require a large number of lists and removing affixes by trial and error), combinatorial (large space and time requirements), or rule based (the authors’ choice). Most verbs in the Arabic language follow fixed rules that define their morphology and generate their paradigms. Those nouns that are not derived from roots do not seem to follow a similar set of well-defined rules. Instead there are groups showing family resemblances, rather like strong verbs in English. We believe that nouns in Arabic that are not derived from verbal roots are governed not only by phonological rules but by lexical patterns that must be identified and stored for each noun. Like irregular verbs in English their forms are determined by history and etymology, not just phonology. Among many other examples, Pinker [12] points to the survival of past tense forms like became for become and overcame for overcome, modeled on came for come, while succumb, with the same sound pattern, has a regular past form succumbed. The same kinds of phenomena are especially apparent for proper adjective, some proper nouns, and some genus nouns in Arabic language like: ݎ’Ÿ “mountains”, “®ó°Ÿ “island” Ùîè‘ “banks” ñ®¼ã “Egyptian” andᩍ “Adam”. Pinker uses examples like this, as

New Technique to Support Arabic Noun Morphology

99

well as emerging research in neurophysiology, to argue for the coexistence of phonological rules and lexical storage of English verb patterns. We believe that further work in Arabic computational linguistics requires the development of a pattern bank for nouns. This paper describes the tool that we have built for this purpose. While the set of patterns for common nouns in Arabic may soon be established, newspapers and other dynamic sources of language will always contain new proper names, so we expect our tool to be a permanent part of our system, even though we may need it less often as time goes on (Abuleil and Evens [2]).

2. Nouns in the Arabic Language A noun in Arabic is a word that indicates a meaning by itself without being connected with the notion of time. There are two main kinds of nouns: variable and invariable. Variable nouns have different forms for the singular, dual, plural, masculine, feminine, diminutive, and relative. The diminutive type is formed by inserting the letter (ñ) after the second letter of a declined noun, in order to indicate paucity, contempt or affection, (e.g., àơ̂). Variable nouns are again divided into two kinds: inert and derived. The inert noun is not derived from another word, i.e., it does not refer to a verbal root. Inert nouns are divided into two kinds: concrete nouns (e.g., lion), and abstract nouns (e.g., love). Derived nouns are taken from another word (usually a verb) (e.g. office); they have a root to refer to. A derived noun is usually close to its root in meaning. It indicates, besides the meaning, the concrete thing that caused its formation (in the case of the agent-noun), or underwent its action (in the case of the patient-noun), or any other notions of time, place, or instrument. The following are the noun types (ElDahdah [9]; Write [14]): A genus noun indicates what is common to every element of the genus without being specific to any one of them. It is the word naming a person, an animal, a thing or an idea. Examples are:

ޟ­

man

Ž˜Û

book

ª³

lion

“­î»

picture

An agent noun is a derived noun indicating the actor of the verb or its behavior. It has several patterns according to its root. Example is:

±­©

person who studies

A patient noun is a derived noun indicating the person or thing that undergoes the action of the verb. Patient nouns have several patterns depending in the verbal root. Example is:

±í­ªã

thing that has been studied

100 S. Abuleil and K. Alsamara

An instrument noun is a noun indicating the tool of an action. Some instruments are derived; some are inert. Examples are:

¡Ž˜Ôã

ӯ툋

key

spoon

An adjective is considered to be a type of noun in traditional Arabic grammar. It describes the state of the modified noun. Examples are:

É­°ã farmer

ªô³ Mr.

®¿Ž¤ã professor

®’’Û big

An adverb is a noun that is not derived and that indicates the place or the time of the action. Examples:

®ì· month

”èóªã city

ÝŽä· north

Was it home?

?–ô’ߎ‘ –çŽÛ

A relative noun is a noun that is derived from another noun (singular or plural) by adding the letter (ñ) at the end. Examples:

òˎ䘟 socialize

òßí© universal

òÔ¤» journalist

A proper noun is the name of a specific person, place, organization, thing, idea, event, date, time, or other entity. Some of them are solid (inert) nouns and some of them are derived. (Abuleil and Evens [1]). Examples are:

âôà³

Saleem

ª’Ë

Abdullah

æ󪑎Ìߍ æó¯

Zain Alabdeen

A proper adjective is a noun that is derived from proper noun. Examples are:

òÜó®ã American

ñ®¼ã Egyptian 3. Noun Classification

In this paper we focus on the following nouns: genus nouns, agent nouns, instrument nouns, adjectives, relative nouns, proper adjectives (adjectives derived from proper nouns), and adverbs. Some of these nouns are not derived from verbs and some are. All of these nouns use the same pattern when it comes to the dual form either for masculine or feminine, but there are many patterns to form the plural noun. Some of the nouns have both masculine and feminine forms, some of them have just feminine forms and some have just masculine forms. A few nouns use the same letters for both the plural and the dual (e.g., æô³­ªã teachers used for both dual and plural). For most nouns, when they end with the letter (“), this indicates the feminine form of the noun, sometimes it does not, but it changes the meaning of the noun completely (e.g., ˜Üã office, ”’˜Üã library). Sometimes the same consonant string with different vowels has different meanings (e.g., ”³­ªã school, ”³­ªã female teacher). Nouns are not like verbs in the Arabic language, there is no fixed rule to define the morphological

New Technique to Support Arabic Noun Morphology 101 Table 1. Pattern classification, Rules/methods G#

S-M

S-F

P-M

P-F

1 2 3 4

F9L F9LwL X Fa9L

X X F9L@ Fa9L@

aF9aL F9aLeL F9L Fa9Lat

5

Fa9L

Fa9L@

6 7 8 9 10 11 12 13 14 15

X X F9l X F9wL X X Fa9L X X

F9L@ F9LLa’ X F9wL@ X F9L@ F9eL@ X Fa9L@ Fw9L@

X X X F9L@ Fa9Len/ Fa9Lwn F9aL Fa9Len/ Fa9Lwn X X X X X X X X X X

16 17 18

F9eL mF9L mFa9L

X X MFa9L@

19 20 21

MF9wL TF9eL AFt9aL

X X X

S: Singular F: Feminine P: Plural M: Masculine X: not available

X X mFa9Lwn / mFa9Len X X X

Fa9Lat

F9al F9aLL F9aL F9wLat F9aL aF9aL / F9L F9a’L Fwa9L Fa9Lat Fwa9L / MFw9Lat aF9L@ mFa9L mFa9Lat mFa9eL tFa9eL aFt9aLat

Ex. âà× ­îìäŸ “­î» ޗŽ×

ʋŽŸ

”Ìà×

ïŽÔ¤à³ ÞäŸ ”ãîÜ£ Ñí®§ “® · ”’ôØ£ Ê㎟ ”Ì㎟ “®ëîŸ

Òô»­ âÌÄã ޗŽØã

Éí®¸ã æó®ä— ɍ®˜§

102 S. Abuleil and K. Alsamara

information and generate the morphology paradigms for them. Instead each group of nouns follows its own pattern. We have classified the nouns into 152 groups according to their patterns for singular, plural, masculine and feminine. We generated a method for each group to be used to find the morphological information and to form its paradigm. Very few of these groups have a unique pattern for plural and singular; and most of them share the same pattern with other groups. Table 1 shows some examples of these groups and their patterns. The digit 9 stands for the letter “ayn [É]”, ‘ stands for “hamzeh [ï]”, w stands for “waw [í]” and @ stands for “ta [“]” since there are no corresponding letters in English for these letters. All proper adjective nouns and most relative nouns follow one of two rules: If the noun ends with ( or “) then drop it and add the suffix. If the noun does not end with ( or “) then just add the suffix. See Table 2. Table 2. Proper adjective nouns and relative nouns. PN / Noun Sing–Masc. Sing–Fem. Dual–Masc. Dual–Fem. Plural–Masc. Plural–Fem.

ŽÜó®ã

®¼ã

åîçŽ×

”¿Žó­

America

Egypt

Law

Sport

òÜó®ã

ñ®¼ã

òçîçŽ×

ò¿Žó­

”ôÜó®ã

”ó®¼ã

”ôçîçŽ×

”ô¿Žó­ åŽô¿Žó­

åŽôÜó®ã

åŽó®¼ã

æôôçîçŽ×

æôôÜó®ã

æôó®¼ã

åŽôçîçŽ×

æôô¿Žó­

厘ôÜó®ã

厘ó®¼ã

厘ôçîçŽ×

厘ô¿Žó­

æô˜ôÜó®ã

æô˜ó®¼ã

æô˜ôçîçŽ×

æô˜ô¿Žó­

æôôÜó®ã

åîó®¼ã

åîôçîçŽ×

åîô¿Žó­

åîôÛ®ôã

æôó®¼ã

æôôçîçŽ×

æôô¿Žó­

•ŽôÜó®ã

•Žó®¼ã

•ŽôçîçŽ×

•Žô¿Žó­

There are very few relative nouns that fail to follow these rules like the noun “®Ë” “Arab”. Most of the paradigm follows the same rules but the masculine plural does not. The masculine plural for ò‘®Ë is ®Ë and not æôô‘®Ë or åîô‘®Ë.

4. The System: ANCS The system reads the next noun in the text, isolates and analyzes the suffixes of the noun, generates its pattern, and uses either the DB Checker Module and Classified Noun Table, the Suffix/Pattern Analysis or the Noun Classifier Module to find the group to which the noun belongs, to identify the rules that apply to this group, to generate all morphological paradigms with respect to the number and gender, and to update the database. The system consists of several modules as shown in Figure 1.

New Technique to Support Arabic Noun Morphology 103

Pattern Generator

Interface

Noun Morphology Analyzer

Suffix Analyzer

Noun Classifier

Database

DB Checker

Figure 1. The ANCS system.

4.1. Database The database consists of three tables: Classified Noun Table, Patterns Table and Groups Table. The first table is dynamic while the other two tables are static, the first one contains 85 entries and the second one contains 152 entries. The Classified Noun Table contains each root noun, source noun (singular): masculine or feminine format of the noun and the number of the group to which the noun belongs. Each time the system identifies a new noun it adds its root to the Classified Noun Table. Each entry in The Pattern Table has a certain pattern and its code. Each entry in the Group Table contains information about one group: the group number, the code of the pattern for each one of the following: masculine singular, feminine singular, masculine plural and feminine plural represented in the group. 4.2. Noun morphology analyzer module This is the core of the system, it calls different modules and performs different tasks to identify the noun and find its paradigm. First, it passes the noun to the suffix analyzer module to drop the suffix. Second, it passes to the pattern generator module to find the pattern. Third, it passes the noun and its pattern to the DB Checker Module to analyze the patterns and the paradigms to see if it is a classified noun or not. If yes, the module returns the name of the group to which it belongs. If the DB Checker Module cannot identify the group then the Morphology Analyzer Module passes the noun to the Noun Classifier Module to classify it and identify the group it belongs to. Finally, depending on the group

104 S. Abuleil and K. Alsamara

Noun

Extract Suffixes / Generate Pattern

Find all Groups Contains the Pattern For the Given Noun

Find Candidate Nouns (Singular Nouns)

Check Nouns Table

User

NO

Select Correct Group

Classify New Noun / Update Database / Generate Results

Noun Found ?

YES

Identify Group ID

Generate Results

Figure 2. Noun morphology analyzer module.

that the noun belongs to, it generates the morphological paradigms for number and gender and updates the database. See Figure 2. 4.3. Suffix analyzer module This module identifies the suffix, analyzes it and produces some lexical information about the noun like the number and the gender. First, it checks if any pronoun is concatenated with the noun. Second, it checks for a suffix indicating the number. Third, it checks for a suffix indicating the gender. When the letter (ñ) comes at the end of the noun there are two cases: it could be a part of the noun so we should not drop it, or it could be an extra letter as in relative nouns or when the pronoun is connected to the noun and it should be dropped in this case. When the noun ends with the letters (æó), most of the time this signifies a dual noun but sometimes it signifies both the plural and dual noun forms as in the following patterns: mfa9l, fa9l, mf9ull. Sometimes we have to check the pattern also to help in analyzing the suffix.

New Technique to Support Arabic Noun Morphology 105

Most of the time when the noun ends with (•) this indicates it is a feminine plural. To change it to singular we drop the letters (•) and we replace them with (“) but in a few cases we just drop the letters (•) as in (•ŽŸ­ªã). Most of the time when the noun ends with the letters (å), this indicates that the noun is dual (厒Ìàã two playgrounds) but sometimes it indicates plural form as in (åû°Ï deer). We will handle these problems as special cases. 4.4. Pattern generator module We have used El-Dahdah’s [9] book to collect 85 different patterns including both masculine and feminine, singular and plural forms, after the suffix has been dropped (see Appendix A). We used these patterns to generate a set of rules to build a finite-state machine that can be used to find the pattern for any noun, see Table 3. The input to this module is a noun after its suffix has been dropped in the previous step, the output is one or more patterns. If more than one pattern is found we validate the string by checking the root table. The letter (á), the letter Table 3. Rules for forming the patterns. # of Letters

Slots

Nouns of (3) letters Nouns of (4) letters

á

Nouns of (5) letters

æã

1st

2nd

3rd

4th

5th

6th

7th

sp

Ñ

sp

É

sp

Ý

sp

| sp

É

|ñ||å| | í | VS

É

| í |  | ñ | í | VS

É



|í| VS

É



|  | • | VS

| –ã | á | • | VS

Ñ

Ñ

Ñ

| å | • | –³ | | –³ |  | VS

Ñ

å

Nouns of (7) letters

± á

sp: “space” no letter to insert *: Letter can be repeated

í

•

•

| ñ |  | sp

Ý*

| í | |• |í | ñ |VS

Ý*

| ñ | í | í | VS

Ý*

| ñ | í | VS

Ý*

í

ï

•

|  | • | á | –³ | –³ | æã | –ã | VS

Nouns of (6) letters



“

ñ

| sp

| í |  |ï | “ | å | | ï | ñí | VS

”ó



| ñ | ï | “ | ï | ”ó | ”ç | ”‹ | VS “

| ”‹ | ”ó | VS

106 S. Abuleil and K. Alsamara

(•) and the letter () at the beginning of the noun may sometimes be the first characters of the noun root (noun: Ùîàã “kings” – pattern: F9wL – root: Úàã), the letter (á) changed to (Ñ) in the pattern. Sometimes they are just extra letters (noun: ¡Ž˜Ôã “key” – pattern: mF9al – root: ¢˜Ó), the letter (á) does not change to (Ñ) in the pattern. We collected tens of nouns that begin with the letters (á), (•) and () and saved them in a file to help us to distinguish between these two cases. 4.5. Database checker module This module identifies any already classified noun or any noun derived from it. It gets the noun and its pattern from the Noun Morphology Analyzer Module, finds all groups that contain the pattern, finds the singular noun (masculine or feminine) in each group and uses it to check the Classified Noun Table. If the noun appears in our tables, it fetches the number of the group to which it belongs and passes it to the Noun Morphology Analyzer to generate the results. For example the noun (âˎÄã restaurants) has the pattern (mfa9l). This pattern appears in three different groups. See Table 4. The nouns formed from these patterns have the paradigms shown in Table 5.

Table 4. The groups of the noun “âˎÄã”. Group#

Sing./Masc.

Sing./Fem.

Plural/Masc.

Plural/Fem.

1 2 3

X mF9L mFa9L

MF9L@ X mFa9L@

X X mFa9Lwn/ mFa9Len

mFa9L mFa9L mFa9Lat

Table 5. The noun of each pattern in each group. Group#

Sing./Masc.

Sing./Fem.

1 2 3

X

ӊ퀋

Plural/Masc.

âÌÄã

X

X X

âˎÄã

”äˎÄã

åîäˎÄã æôäˎÄã

Plural / Fem. âˎÄã âˎÄ㠕ŽäˎÄã

New Technique to Support Arabic Noun Morphology 107

If the noun itself “âÌÄã” or any noun derived from it “âˎÄã åŽäÌÄã æôäÌÄã” or any noun derived from it connected to a pronoun “âìäˎÄã æìäÌÄ㠎èôäÌÄã” has been previously checked and classified by the system we will find its noun root (singular noun) in the Classified Noun Table. The module will find the root (singular masculine) “âÌÄã” in the table and will get its group number “2” and pass it to Noun Morphology Analyzer to find the noun paradigms. In a few cases when we check the noun to find out to which group it belongs we may end up with an ambiguous situation. One singular noun may have two or more plural form and each has a different meaning. This is because the noun is given without vowels. For example: Noun: ”³­ªã Singular noun with no vowels/Plural form ”³­ªã School / ±­ªã ”³­ªã Teacher (feminine) / •Ž³­ªã To solve this problem we use the algorithm shown in Figure 3.



If the noun is singular then

Look up the noun: Find all matches in the Nouns Table “Generate all scenarios”. Identify the group number for each case.



If the noun is plural then

Look up the group number: find the group number for valid singular noun. Figure 3. Algorithm to resolve ambiguity.

Example 1: Input: ”³­ªã (singular noun) has two meanings, either a school or a teacher (feminine). In the Nouns table we will find two matches and we will consider them both. See Table 6. Table 6. Ambiguity — Example 1. Group# 65 83

Sing./Masc.

Sing./Fem.

Plural/Masc.

Plural/Fem.

”³­ªã

X

X

”³­ªã

X X

•Ž³­ªã

±­ªã

108 S. Abuleil and K. Alsamara

Example 2: Input: ±­ªã schools (plural noun). We find the group number for this noun using Database Checker Module as follows: Step#1: Find all groups contains the pattern “mfa9l”. mfa9l is the pattern for the noun ±­ªã. See Table 7. Step#2: Use the Noun Table to validate the nouns “singular nouns”. See Table 8. Step#3: Find the group number for valid singular noun. See Table 9. Table 7. Ambiguity — Example 2 – Step #1. Group# 65 71

Sing./Masc. mf9l@ mfa9l

Sing./Fem. X mFa9L@

Plural/Masc. X mF9Lwn

Plural/Fem. mfa9l mF9Lat

mF9Len

Table 8. Ambiguity — Example 2 – Step #2. Group#

Sing./Masc.

Sing./Fem.

Plural/Masc.

Plural/Fem.

65

”³­ªã

X

±­ªã

X

71

±­ªã

”³­ªã

åî³­ªã æô³­ªã

•Ž³­ªã

Table 9. Ambiguity — Example 2 – Step #3. Group#

Sing./Masc.

Sing./Fem.

Plural/Masc.

Plural/Fem.

65

”³­ªã

X

±­ªã

X

4.6. Noun classifier module This module gets the noun and its pattern from the noun morphology analyzer module. It finds all alternatives (groups) that contain the pattern and provides the user with these groups to choose one of them to reduce the number of alternatives to one. Example: Input: The noun (Ê󭎸ã projects)

/

The pattern: ÞôˎÔã mFa9eL

Process: Step #1: Identify the groups: each group in Table 10 contains the pattern “ÞôˎÔã”.

New Technique to Support Arabic Noun Morphology 109

Step #2: Find the noun of each pattern in each group. See Table 11. Step #3: Ask the user to pick the right group. Based on the user’s selection classify the new noun. Step #4: Update the database. Add a new entry to the Nouns Table. The singular masculine noun Éí®¸ã “project” is stored in the noun field and the group’s number “3” is stored in the group field. Table 10. Noun classifier module — Step #1. Group# 1 2 3

Sing./Masc. mF9eL mF9aL mF9wL

Sing./Fem. X X X

Plural/Masc. X X X

Plural/Fem. mFa9eL mFa9eL mFa9eL

Table 11. Noun classifier module — Step #2. Group# 1 2 3

Sing./Masc. Êó®¸ã ɍ®¸ã Éí®¸ã

Sing./Fem. X X X

Plural/Masc. X X X

Plural/Fem.

Ê󭎸ã Ê󭎸ã Ê󭎸ã

4.7. Interface module This graphical user interface allows the user to interact with the system and handles the input/output. This module displays a main menu with two main options: • Find the morphological information for the paradigm. • Form a noun. For the first option, if the noun is classified then the morphological information for this paradigm is provided, otherwise the system provides the user with the options to pick a paradigm to classify the noun. Then it provides the user with the information. Example:

 Input: Žè³­ªã  Output: Noun:

(masculine — singular) ±­ªã / Number: Singular/Gender: masculine/Person: 1st — singular — masculine/feminine

Table 12 shows the paradigms of the noun “±­ªã”. Table 13 shows the paradigms of the noun “±­ªã” connected to the pronouns.

110 S. Abuleil and K. Alsamara Table 12. The paradigms of the noun “±­ªã”. Sing./Masc.

±­ªã

Sing./Fem.

DUAL/MASC.

”³­ªã

æô³­ªã 厳­ªã

DUAL/FEM.

Plural /Masc.

æô˜³­ªã 厘³­ªã

æô³­ªã åî³­ªã

Plural /Fem.

•Ž³­ªã

Table 13. The paradigms of the noun “±­ªã” connected to the pronouns. 1st/S M-F

1st/P M-F

2nd/S M

2nd/S F

2nd/D

2nd/P

2nd/P F

3rd/S

2nd/P

2nd/P F

3rd/S

M-F

M

M

M

S-M

ò³­ªã

Žè³­ªã

Ú³­ªã

Ú³­ªã

Žäܳ­ªã

âܳ­ªã

æܳ­ªã

ê³­ªã

âܳ­ªã

æܳ­ªã

ê³­ªã

M

S-F

ò˜³­ªã

Žè˜³­ªã

ژ³­ªã

ژ³­ªã

Žäܘ³­ªã

âܘ³­ªã

æܘ³­ªã

Žì³­ªã

âܘ³­ªã

æܘ³­ªã

Žì³­ªã

òô³­ªã

Žèô³­ªã

Úô³­ªã

Úô³­ªã

ŽäÜô³­ªã

âÜô³­ªã

æÜô³­ªã

êô³­ªã

âÜô³­ªã

æÜô³­ªã

êô³­ªã

ñŽ³­ªã

ŽçŽ³­ªã

َ³­ªã

َ³­ªã

òô˜³­ªã

Žèô˜³­ªã

Úô˜³­ªã

Úô˜³­ªã

ñŽ˜³­ªã

ŽçŽ˜³­ªã

َ˜³­ªã

َ˜³­ªã

òèô³­ªã

Žèô³­ªã

Úô³­ªã

Úèô³­ªã

ŽäÜô³­ªã

âÜô³­ªã

æÜô³­ªã

òçî³­ªã

Žçî³­ªã

Ùî³­ªã

Úçî³­ªã

ŽäÛî³­ªã

âÛî³­ªã

ò—Ž³­ªã

Žè—Ž³­ªã

ڗŽ³­ªã

ڗŽ³­ªã

ŽäܗŽ³­ªã

âܗŽ³­ªã

ñŽ˜³­ªã

D-M D-F P-M P-F

鎳­ªã

鎳­ªã

ŽäÜô˜³­ªã

âÜô˜³­ªã

æÜô˜³­ªã

鎘³­ªã

âÜô˜³­ªã

æÜô˜³­ªã

êôô³­ªã

âÜô³­ªã

æÜô³­ªã

鎘³­ªã

êôô³­ªã

æÛî³­ªã

êçî³­ªã

âÛî³­ªã

æÛî³­ªã

êçî³­ªã

æܗŽ³­ªã

ꗎ³­ªã

âܗŽ³­ªã

æܗŽ³­ªã

ꗎ³­ªã

êô˜³­ªã

êô˜³­ªã

For the second option in the menu, the system gets some information from the user about the noun: noun root, gender, number and person. Then the system forms the new noun paradigm accordingly. Examples: • Input: Noun: âà× (pencil)/Gender: masculine/number: plural/Person: 3rd — singular — feminine • Output: Žìãü׍ her pencils • Input: Noun: ±­ªã (teacher)/Gender: Feminine/Number: dual/Person: 2nd — dual — masculine/feminine • Output: Žä뎘³­ªã Žäìô˜³­ªã their teachers • Input: Noun: “®‹ŽÃ (airplane)/Gender: masculine/number: plural/Person: 1st — plural — masculine/feminine • Output: Žè—®‹ŽÃ our airplanes • Input: Noun: Ê㎟ (Mosque)/Gender: masculine/number: plural • Output: ÊãîŸ Mosques

New Technique to Support Arabic Noun Morphology 111

5. Examples The following example shows how the system works. Assume that the input is the noun (â옑­ªã their female trainer), first the system calls the suffix analyzer module to drop the suffix letter (pronoun: their) at the end (âë + –‘­ªã), replaces the letter (•) with the letter (“). Then, it generates the noun (”‘­ªã trainer) and provides some lexical information about the noun. Second, it passes the noun (”‘­ªã trainer) to the pattern generator module to generate the pattern (mf9l@). Third, it checks the group table looking for this pattern (mf9l@). Fourth, if more than one group is found it uses the Database Checker Module to check the Classified Noun Table. Fifth, if the noun does not exist in the table, it calls the Noun Classifier Module to analyze the groups (all alternatives) and asks the user to assist in identifying the group. See Tables 14 and 15. Table 16 shows more examples of how the system works.

6. Other Problems the System Addresses and Solves Since our system classifies nouns according to which group they belong to, it can help other systems to obtain more accurate results.

Table 14. The groups of the noun “”‘­ªã”. Group# 1 2

Sing./Masc. X mf9L

Sing./Fem. mF9L@ mF9L@

3

X

MF9L@

Plural/Masc. X mF9Lwn mF9Len X

Plural/Fem. mFa9L mF9Lat mF9Lat

Table 15. The paradigms of the noun “”‘­ªã”. Group#

Sing./Masc.

Sing./Fem.

Plural/Masc.

Plural/Fem.

1 2

X ­ªã

”‘­ªã ”‘­ªã

X åªã æô‘­ªã

­ªã •Ž‘­ªã

3

X

”‘­ªã

X

•Ž‘­ªã

112 S. Abuleil and K. Alsamara Table 16. More examples. ¢ô—ŽÔã

“®‹ŽÃ

Žè—î»

æôäó®Û

Noun

Keys

Plane

Our sound

Generous

Suffix





Žç

æó

Pattern

mfa9el

ÞôˎÔã

”àˎÓ

fa9l@

ÞÌÓ

ÞôÌÓ

f9l

f9el

3

37

Group #

52

23

Result

Plural masc.

Singular feminine

Singular/Masc.

¡Ž˜Ôã

X

•î»

âó®Û

Singular/Fem.

X

“®‹ŽÃ

X

”äó®Û

Singular feminine Dual/plural masc.

ïŽã®Û

Plural/Masc.

X

X

X

Plural/Fem.

¢ô—ŽÔã

•®‹ŽÃ

•î»

Dual/Masc. Dual/Fem.

厣Ž˜Ôã

æô£Ž˜Ôã

X

X 厗®‹ŽÃ æô—®‹ŽÃ

厗î»

æô—î»

X

/ æôäó®Û

•Žäó®Û åŽäó®Û

æôäó®Û

厘äó®Û

æô˜äó®Û

Pattern Generator System: When the noun starts with , á or • it is often difficult to determine whether to start the pattern of the noun with one of these letters or with the letter Ñ. Our system helps the pattern system to find the answer by validating the noun. Example: ®ô㍠prince To form the pattern for this noun we have two choices: to start the pattern with either () “ÞÌӍ” or (Ñ) “ÞôÌÓ”. By checking and validating the root of the noun for each pattern “®ã” or “®ôã” using the Nouns Table, the system chooses the pattern “ÞôÌÓ”. Tokenizer System: When the noun starts with ñ, Ñ, Ý, etc., we do not know whether we are looking at an article or preposition concatenated at the front or a valid part of the noun. The system resolves the problem by validating the noun using both the Noun Table and the Group Table. Example: ±®Ó horse The letter (Ñ) at the beginning of the noun may indicate one of two things: it may be part of the noun or it may be an article or a preposition concatenated at the front of the word. Our system validates the noun by checking the Classified Nouns Table and the Groups Table. If the noun table is found in the Nouns Table the system finds that the letter (Ñ) at the beginning of the noun (±®Ó) is not an add-on and determines that we should not drop it.

New Technique to Support Arabic Noun Morphology 113

Stemmer System: The letters ñ, Ù, å, etc., at the end of the word indicates one of two cases: these letters may be part of the noun or they may be articles or possessive pronouns concatenated with the word. Our system helps the stemmer to determine if the suffix is part of the noun or an add-on by validating the noun using Classified Nouns Table. Example: åû°Ï deer The letter (å) at the end of the noun may indicate two different things. It may be either part of the noun or an extra letter like “協 in the noun “厒Ìàã” (two playgrounds). Our system validates the noun “åû°Ï” by checking the Classified Noun Table and the Groups Table, finds that the letters (å) are part of the noun, and keeps them.

7. Results To test our system we used nouns obtained from the newspaper Al-Raya, published in Qatar. We have implemented the system by using Visual Basic and MS-Access. We have tested each module in our system: the Suffix Analyzer Modules, the Pattern Generator Module, the Database Checker Module, the Classifier Module and the Noun Morphology Module. Table 17 shows the result of testing the system on 7000 nouns. We classified 592 different nouns as either singular-masculine or singular-feminine. As shown in Table 17 there were 329 failures because of incorrect suffix analysis due to the different problems maintained in Sec. 4.3, and 252 failures due to either missing patterns or the wrong form of patterns for foreign nouns as described in Sec. 4.4. These missing patterns have now been added. The suffix analysis problem is hard to correct because it arises from underlying ambiguities. The whole system as one unit including all the modules processed 93.7% of the nouns correctly and missed 6.3%. If the noun has been classified previously the system does not have any problem in identifying it and identifying any noun derived from it. Our system solves a lot of these problems as we explained in Sec. 6. Table 17. Suffix/pattern/noun morphology analyzer. Module

Total

# correct

# incorrect

% correct

% incorrect

Suffix Analyzer Pattern Analyzer DB Checker Whole System

7000 7000 7000 7000

6671 6748 6405 6559

329 252 595 441

95.3% 96.4% 91.5 % 93.7%

2.9% 3.6% 8.5% 6.3%

114 S. Abuleil and K. Alsamara 500

documents

400 300 200 100 0 1

2

3

4

5

groups

6

7

8

9

10

Nouns Found by Noun Classifier Module Nouns Found by DB Checker Module Failed Nouns

Figure 4. Noun classifier module versus DB checker module

The Noun Classifier Module found most of the nouns that the Database Checker Module failed to identify. We have grouped the nouns found in the documents, 700 in each group. Figure 4 shows the number of nouns identified by the Database Checker Module, the number of nouns identified by the Noun Classifier Modules and the number of nouns that the system failed to identify. Figure 2 demonstrates that as the system gains more knowledge and adds more nouns to the Classified Noun Table, the percentage of the nouns found by the Database Checker Module increases and the percentage of the nouns identified by Noun Classifier Modules decreases. To test the efficiency of the system in terms of speed, we ran it on a Pentium machine with the following specifications (speed: 750 MHz, memory: 128 MB, operating system: Win 2000). The average time for each module independently and the average time for the whole system including all the modules (affix, pattern, database checker) is shown in Table 18. Table 18. Efficiency. Module

Affix extraction

Pattern generator

Database checker

Whole system

Time

27.5 nouns/sec.

23.3 nouns/sec.

39.8 nouns/sec.

9.6 nouns/sec.

New Technique to Support Arabic Noun Morphology 115

8. Conclusion We have built a learning system that utilizes user feedback to identify the nouns in the Arabic language, obtain their features and generate their paradigms with respect to number and gender. We tested the system on 7000 nouns from newspaper text. The system identified (6559) 93.7% of them, 91.5% by using the Database Checker Module and the Classified Noun Table and 8.5% by using Noun Classifier Module. The system failed on 6.3% of the tested nouns. The system classified and added 592 new different nouns to the database.

9. Future Research The vowels in the Arabic language create many big problems, the absence of those vowels may change the meaning of the words completely and make it difficult to distinguish between them. Some nouns in Arabic language like (”³­ªã) have the same sequence of letters but with a different set of vowels. We have two fundamentally different words, homonyms which represent two different concepts and which belong to two different groups with different paradigms. ”³­ªã “m(u)d(a)r(i)s@”

female teacher

”³­ªã “m(a)dr(a)s@”

school

Some nouns use the same letters but use different vowels to form both the dual form and the plural form of the noun. Example: Teacher (dual) æô³­ªã Teacher (plural) æô³­ªã

No vowel “shada” above the letter (ñ). The vowel “shada” above the letter (ñ).

When the letter (ñ) comes at the end of the noun as an extra letter it is hard to tell if it represents a pronoun òˎ䘟 “my meeting” or it is added to form a relative noun òˎ䘟 “socialized”. For these problems our system gives all the possibilities but we need a good grammar to help us to identify the meaning of the word and to clear the ambiguity. Another problem involves the special letters (, ï, í, ñ) in the Arabic language. When we form plural nouns or singular nouns that include one of these letters most of the time we need to map it into another form from the same list. The question is how to change it in each case. Examples:

Æ ñ: Æ ñ: Æ í: Æï Æí

 ï ï ñ ñ

Æ •ŽôÓí Æ ”óí© Æ •íŽä³ Æ ‹ŽØ£ Æ åîó—

“ŽÓí ïí© ïŽä³ ”’ôØ£ ”ô‘®—

ʋŽŸ

Æ ÉŽôŸ

116 S. Abuleil and K. Alsamara

References [1] S. Abuleil and M. Evens, “Discovering Lexical Information by Tagging Arabic Newspaper Text”, Proceedings of the Workshop on Semitic Language Processing. COLING-ACL’98, Aug. 16, 1998, pp. 1–7. [2] S. Abuleil and M. Evens, “Extracting an Arabic Lexicon from Arabic Newspaper Text”, in Computers and the Humanities, 36(2) (2002) 191– 221. [3] K. Alsamara, “An Arabic Lexicon to Support Information Retrieval, Parsing, and Text Generation”, PhD Dissertation, Illinois Institute of Technology, Chicago, IL, 1996. [4] Al-Fedaghi, Sabah and Al-Anzi, Fawaz, “A New Algorithm to Generate Arabic Root-Pattern Forms”, Proceedings of the 11th National Computer Conference, 1989, pp. 4–7. [5] R. Al-Shalabi and M. Evens, “A Computational Morphology System for Arabic”, Workshop on Semitic Language Processing. COLING-ACL’98, 1998, 66–72. [6] I. Al-Sughaiyer and I. Al-Kharashi, “Arabic Morphological Analysis Techniques: A Comprehensive Survey”, Journal of the American Society for Information Science and Technology, 55(3) 2003. [7] K. Beesley and L. Karttunen, “Finite-State Non-Concatenative Morphotactics”, Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, 2000, pp.191–198. [8] R. J. Byrd, N. Calzolari, M. Chodorow, J. Klavans, M. Neff and O. A. Rizk, “Tools and Methods for Computational Lexicography”, Computational Linguistics, 13(3–4), 1987, 219–240. [9] A. El-Dahdah, A Dictionary of Universal Arabic Grammar, 1st Edition, Librairie du Libanaty, Beirut, Lebanon, 1992. [10] E. Foxley and A. Feddag, “A Syntactic and Morphological Analyzer of Arabic Words”, Proceedings of the Second International Conference on Computing in Arabic-English, Cambridge University, UK, 1990. [11] A. Hammouri, “An Arabic Lexical Database to Support Natural Language Processing”, PhD Dissertation, Illinois Institute of Technology, Chicago, IL, 1994. [12] S. Pinker, Words and Rules: The Ingredients of Language, Basic Books, New York, 1999.

New Technique to Support Arabic Noun Morphology 117

[13] A. Roeck de and W. Al-Fares, “A Morphologically Sensitive Clustering Algorithm for Identifying Arabic Roots”, Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, 2000, pp.199– 206. [14] W. Wright, A Grammar of the Arabic Language, 3rd Edition. Cambridge University Press, Cambridge, UK, 1991.

118 S. Abuleil and K. Alsamara

Appendix A. Patterns Pattern

Used for

Example

aF9aL aF9eL aF9L aF9L aF9L@ aF9La’ aF9wL aFa9eL aFa9L aFt9aL aFt9aL@ anF9aL astF9aL F9a’L F9aL F9aL F9aL F9aL F9aL@ F9aLe F9aLeL F9aLL F9aweL F9eL F9eL@ F9L F9L F9L F9L F9L F9L@ F9L@ F9La F9La’ F9La’ F9La’

plural–fem. sing.–masc. plural–fem. sing.–masc. plural–fem. plural–fem./masc. sing.–masc. plural–fem. plural–masc. sing.–masc. sing.–fem. sing.–masc. sing.–masc. plural–fem. sing.–masc. plural–fem. plural–masc. sing.–masc. sing.–fem. plural–fem. plural–fem./masc. plural–fem. plural–fem. sing–masc. sing.–fem. sing.–masc. plural–masc. plural–fem./masc. plural–fem. sing.–masc. sing.–fem. Plural–masc. plural.–masc. plural–masc. sing.–fem. Sing.–fem.

­Ž · Ö󮑍 ޟ­ ©î³ ”óí© ïŽôèύ

É Ö󭎑 ­Ž× ɍ®˜§ ”£®˜³ ­Ž Ôç ­Žäœ˜³ ‹ŽØ£ ­ŽÄã ÁŽ¸ç ݎ’Ÿ ­®× “­Ž´§ ñ­Ž¤» ®ôëŽäŸ â덭© ²ôçîÓ âó®Û “®ó°Ÿ ÞäŸ ­°Ÿ ®Ë ­î» ïî¿

“­î» ”à˜× 𣮟 ïŽäàË ï®ä£ ï®¤»

New Technique to Support Arabic Noun Morphology 119

Appendix A. (Continued) Pattern

Used for

Example

F9LaL F9Lan F9Lan F9Lawat F9Le@ F9LL

sing.–masc. plural–fem. sing.–masc. plural–fem. sing.–fem. sing.–masc.

åîèË

F9LL@ F9LwL F9naLL F9waL F9wL F9wL@ F9wL@ Fa9L Fa9L@ Fa9wL Fe9aL Fw9L@ Fwa9eL Fwa9L mF9aL mF9eL mF9L mF9L@ mF9LL mF9wL mF9wL@ mFa9eL mFa9L mFa9L mFa9L@ mFt9L mstF9a mstF9L mstF9L@ mtFa9L

sing.–fem. sing.–masc. sing.–masc. sing.–masc. sing.–masc. sing.–fem. plural–fem. sing.–masc. sing.–fem. sing.–masc. sing.–masc. sing.–fem. plural–fem. plural–fem. sing.–masc. sing.–masc. sing.–masc. sing.–fem. sing.–masc. sing.–masc. sing.–fem. plural–fem. plural–fem. sing.–masc. sing.–fem. sing.–masc. sing.–fem. sing.–masc. sing.–fem. sing.–masc.

åû°Ï 厒ÀÏ •í®ä£ ”ôÌäŸ âë­© ”à’è× ­îìäŸ žãŽç®‘ ݍîäË ©îäË ”‘ ”ãîÜ£ âßŽË “®§Ž‘ ¥í­Ž» 印ô㠓®ëîŸ æó¯îã ÊãîŸ ¡Ž˜Ôã Þóªè㠐˜Ü㠔’˜Üã Þ´à´ã Éí®¸ã ”Ëî³îã Ê󭎸㠮ŸŽèã ޗŽØ㠓®ëŽÈã ÞؘÌã ðÔ¸˜´ã ®ÔИ´ã “®ä̘´ã ®ëŽÈ˜ã

120 S. Abuleil and K. Alsamara

Appendix A. (Continued) Pattern

Used for

Example

tF9eL tF9eL@ tF9L tF9L@ tFa9eL tFa9L aF9aL@ aF9eL@ F9aLe@ F9LwL@ mnF9L mtF9LL tF9aL

sing.–masc. sing.–fem. sing.–masc. plural–fem. plural–fem. sing–masc. sing–fem. sing–fem. sing–fem. sing–fem. sing–masc. sing–masc. sing–masc.

æó®ä— ”äô算 ­îė ”Ôàܗ æó­Žä— ݅Ž´— “©Ž ç ”àôÏ­ ”ôë®Û ”ãîäó© ®´Üèã Ý°ß°˜ã ­®Ü—