atranf: machine translation system prototype (application on arabic to ...

1 downloads 0 Views 116KB Size Report
Our goal is to develop a machine translation system from Arabic to French using a purely .... the dictionary in four tables: verbs, nouns, adjectives and particles.
Special Issue on The International Conference on Information and Communication Systems (ICICS 2011)

ATRANF: MACHINE TRANSLATION SYSTEM PROTOTYPE (APPLICATION ON ARABIC TO FRENCH TRANSLATION) Fahima Bouzit ,Mohamed Tayeb Laskri Badji Mokhtar University, Algeria [email protected], [email protected]

ABSTRACT This work falls within the framework of natural language processing. Our goal is to develop a machine translation system from Arabic to French using a purely linguistic method as a key to get insights into the standard layer-based structure of linguistic phenomena (morphology, syntax and semantics) as well as in recognizing the interaction between them, which we think is the most appropriate to a such rich in morphology and syntax language as Arabic. To ensure that, we used a set of linguistic theories such as: Fillmore theory, conceptual dependency, semantic traits of Chafe and frame representation of Minsky. We will show in this paper, the usefulness of these methods and how we combined them to realize a multilingual translation system. Keywords: machine translation, Arabic, semantic cases, Fillmore theory.

1

INTRODUCTION

Machine translation has for decades attracted the interest of researchers in artificial intelligence, which gave rise to two different currents. The current based on linguistic theories, where the principle is to identify all the rules and features of the language, in order to use them to create dictionaries and rules, and use a set of formalisms to move from the external representation of the sentence, to an internal universal one. The second current consists to use large amounts of aligned bilingual text to estimate the probabilities of models; this is the statistical or probabilistic current. In this paper, we will describe our idea which aims to develop Arabic to french machine translation system using a purely linguistic approach, which combines several methods as Fillmore Theory, the semantic features of Chafe and the frame based representation of Minsky. In the linguistic approach, one of the most important phases is the semantic analysis, which involves extracting the meaning of surface structures using a variety of tools and methods. To understand the meaning of a sentence, it is essential to know the meaning of its various components and the role of each one of them [1].In order to ensure this, we used Fillmore theory. The idea is to consider the verb as the kernel of the sentence and to study the role of its other constituents (nouns) with this kernel. 2

FILLMORE THEORY Verbs differ according to their typological

UbiCC Journal

characteristics, for example, there are verbs that require the semantic cases: 'Agent' and 'Subject', although some other verbs require other cases, such as 'Source' and 'Destination'. The cases ideally form a single, limited, small in number, universal and valid list in all languages [2]. For the Arabic language, semantic cases are identified by casual marks (short vowels), for example, the case agent is marked by the nominative grammatical case marked by the diacritic ' ُ◌' or the suffixes '‫ 'ان‬or '‫'ون‬. Where the case instrument is recognized by the dative case ' ِ◌' or the ْ and is preceded by the suffixes '◌َ 0:!' or ' ِ◌ 0:◌!' preposition ' ,' ‫' ا‬, etc. '‫ ' ـ‬or the words '‫ل‬ The advantage of this method is that it allows to give to the sentence, a representation that does not stop at the tips of the results of syntax parsing [3], in other words, even if two sentences have different representations, they may transport the same meaning. For example, the sentences: – ‫ذ‬ ‫و إ ا‬ ‫ا‬ ‫ا‬ ‫( أر ا‬The child sent an e-mail to the teacher). ‫و إ ا‬ ‫ا‬ ‫ا‬ ‫( أر‬an e– ‫ ف ا‬0: ‫ذ‬ mail was sent to the teacher by the child), The subject is different, although, the action (verb) is the same, and the words ‫و‬ ‫ا‬ ‫( ا‬email) and ‫( ا‬child) play the same syntactic role: subject, while the agent is in both cases ‫( ا‬child) and the object is always: ‫و‬ ‫ا‬ ‫( ا‬e-mail). We extracted and specified the Arabic semantic cases, based on its characteristics [1], here are some examples: . The case AGENT: Syntactic Case = Subject. (grammatical case=Nominative)

Page 26

www.ubicc.org

Special Issue on The International Conference on Information and Communication Systems (ICICS 2011)

consequently decides that, the agent can only be ‫ل‬ ‫ا‬.

The case OBJECT: Syntactic Case = Object Comp Or Syntactic Case = Subject Verb mode = Passive

4

The case INSTRUMENT: Grammatical Case = Dative Preposition = ‫ ا‬, ‫ ـ‬,‫ل‬ The case SOURCE: Grammatical Case = Dative ْ ِ◌ Preposition = ◌0: Or A place noun playing the role of a direct object complement of some known verbs, such us: ‫ در‬.c, ‫ ك‬:i like in ‫ر‬.l ‫ا‬ ‫ در ا‬.c The case DESTINATION: Grammatical Case = Dative َ ◌ْ , Preposition = ‫◌ِ ـ‬, ‫إ‬, ◌َ , ◌‫ب‬ Or A place noun playing the role of a direct object complement of some known verbs, such us: .l....S like in ◌َ .S ‫ا‬ ‫ ا‬.l....S. 3

SEMANTIC TRAITS OF CHAFE

This method consists to endow every noun in the definitions dictionary, with many semantic traits, showing the relations it may have with the other words used with it in the sentence [4]. For noun representation, Chafe proposed a classification model. He defined a list of semantic traits (markers) that represent noun proprieties. According to Chafe, the noun is characterized with the traits: Animated, Human, Feminine, Unique, Concrete, Countable and Potent. [4] and the traits: Consumable and Dimension could be added [5]:

FRAME REPRESENTATION

Once the semantic cases drawn, we must find a way to represent them and the relations existing between them. There is a wild choice of methods and formalisms for this (Context Free Grammar, Recursive Transition Networks, logic grammar, knowledge based processing,…), but given the characteristics of the Arabic language: we can swap the components of the sentence without changing its meaning. For example: the six sentences are correct, and transport the same meaning: ‫ا‬ ‫ا‬ ‫ا‬ ‫ا‬ ‫ا‬ ‫ا‬ ‫ا‬ ‫ا‬ ‫ا‬ ‫ا‬ ‫ا‬ ‫ا‬ So we can’t use any of those formalisms, because we will be face to huge and difficult to build representations and grammars. This leads us to choose the method proposed by Minsky: frames [1]. Frames have a whole set of slots, reserved for the various concepts contained in the sentence to represent, what drives us to provide a slot for each component that may be encountered Fig. 1. Action Patient

Agent

‫ا‬ Source

• ‫م‬.l ‫([ =ا‬+) Animated, (+) Human, (-) Feminine, (-) Unique, (+) Concrete, (+) Countable, (+) Potent, (-) Consumable, (-) Dimension] • ..: ..: ‫([ =ا‬-) Animated, (-) Human, (+) Feminine, (-) Unique, (+) Concrete, (+) Countable, (-) Potent, (-) Consumable, (-) Dimension]. Although this method has been developed and used only for names, we proposed to apply it on verbs to solve the problem of the lack of information that occurs if the user wants to translate a text without short vowels. So if the user wants to � !, the translate the sentence: ‫ل‬ ‫ب ا‬ ‫ ا‬.S‫ا‬ system can identify the agent which is ”‫ل‬ ‫”ا‬ through the semantic features of the verb � ! [+ human], which means that this action (� !) can be made only by a human, and therefore, the system checks the features of every noun in the .S‫[ ا‬-human], ‫ل‬ ‫[ا‬+human], and sentence:

UbiCC Journal

Furnisher

Object

◌ّ

Instrument

‫ا‬

Destination

Time

‫ا‬ Beneficiary

Place

�• .A ‫ا‬ State

Manner

Purpose

Figure 1: Representation, in the basic frame, of the ‫ا‬ ‫ ا‬e; ” sentence “ One can easily notice that there are several slots that are empty, and this is in fact, the drawback of this type of representations: the waste of storage space, because in general, each verb has its own characteristics and therefore requires a reduced number of slots [1].

Page 27

www.ubicc.org

Special Issue on The International Conference on Information and Communication Systems (ICICS 2011)

We know that there are verbs that require the same slots as others, and that most verbs are used to express ideas that may well be expressed by other basic verbs. This leads us to use a method of classification of verbs. For that raison, we chose the theory of the conceptual dependency. So, the sentence ‫ا‬ ‫ ا‬e; (the child easily printed the text with the printer), for instance, will be represented in a frame where the number of slot is smaller (a specialized frame), Fig. 2.

Table 1: the eleven basic primitives proposed by Shank.

�‫�ـ‬.b

‫ا‬ Source

Object

Apply a force to something Moving a body part Catch an object Ingest, for a moving object Physically expel, for a moving object Move physical object Modify an abstract relationship, such as possession Produce a sound, support of an action such us “communicate” Apply his attention to a perception or stimulus Information transfer Creating a new though

◌ّ

ATTEND

Instrument

‫ا‬

‫ا‬

MTRANS MBUILD

Destination

�• .A ‫ا‬ Manner

Figure 2: Representation, in a specialized frame, of ‫ا‬ ‫ ا‬e; ” the sentence “ CONCEPTUAL DEPENDENCY

This theory is characterized by the following axioms: 1. Two sentences have the same meaning in one language or two languages (although they have very different syntactic structures) should have the same internal representation. 2. Any information implied in a sentence must be made explicit in the representation. 3. Any action is expressed in terms of primitives. Each primitive has an associated diagram which must be instantiated and filled (at least partially) in the understanding process. For example, the primitive of the verb to drink (INGEST primitive) is the same for the verbs to eat or swallow [4]. This theory therefore allows the reduction of each set of verbs in a primitive, which shall be the representative, and will now undergo a common treatment for all these verbs, instead of duplicating it for each one. In our work, we considered the eleven basic actions proposed by Shank [4], Table 1.

UbiCC Journal

PROPEL MOVE GRASP INGEST EXPEL

SPEAK

Time

5

Description

PTRANS ATRANS

Action

Agent

Primitive

So for two verbs referring two similar actions, we use the same primitive, for example, for any verb that denotes an action of transfer of something abstract (eg possession), such as the verbs: ,,. ,.l ‫أ‬ ‫أر‬, we use the primitive ATRANS. Thus, we have the same frame that represents these verbs. The difference lies in the contents of the Action field. We can implement the frame as a list, table, or using objects. We defined a class for each primitive, and during the frame construction phase, we instantiate an object of the class to which the verb belongs and fill its fields. 6

DICTIONARIES

To have all the information about the different words needed during the analysis, it is necessary to have in the dictionary (fields or tables) that contains any information that could be useful. To insure this, we initially ranked the words in the dictionary in four tables: verbs, nouns, adjectives and particles. For example, the table Verbs contains the stem of the verb (verb in the past with the masculine singular person: ‫ )ه‬and its primitive, but also its various semantic features, and its translation, Table 2. Table 2: portion of the dictionaries. Verb

primitive

animated

..

translation

Noun

adjective

animated

..

translation

But during the implementation, we noted that there are special cases that must be treated separately. Example: let the words: ..: (panel) = panneau

Page 28

www.ubicc.org

Special Issue on The International Conference on Information and Communication Systems (ICICS 2011)

,. :i (configuration) = configuration = clés e :i (keys) The verbatim translation (to French) gives: ,. ‫ ا‬..: = panneau de configuration; e :i ‫ ا‬..: = panneau de clés; While e :i ‫ ا‬..: in French is: clavier. The solution we proposed was to put these strings of words in a table called Sequences, and see during processing (step of construction of the frame) if the text contains one of these suites, in which case we put directly its translation in the target frame. And therefore the number of tables used by the analysis module and the module for word translation is five: verbs, nouns, adjectives, particles and sequences. Another case we can underline, is the distinction between the manner and the instrument, because both cases have the grammatical case Dative and are preceded by the preposition ‘ ‫’ ـ‬, for example: ◌ِ ‫ا‬ ‫ا‬ and ◌ِ ‫ا‬ ‫ا‬ . Both words ‫ ا‬and are preceded by the preposition ‘ ‫ ’ ـ‬but ‫ ا‬is an instrument and is a manner, Looking in the characteristics of Arabic language, we found that when the the preposition ‘ ‫ ’ ـ‬precede a noun which we can derive to an adjective, then this noun describes a manner, else it is an instrument: (we can derive from it the adjectives: and ) => it describes a manner; ‫( ا‬we can’t derive an adjective from this noun) => it is an instrument This is why we added to the Noun table an entree which we called Adjective to mention if it can be derived to an adjective or not, Table 2. 7

SYSTEM ARCHITECTURE

As shown in the diagram Fig. 3, when we introduce a sentence to be translated, the system begins the analysis.

The Sentence in Arabic

The Sentence in French

Analysis

Frame in Arabic

Constructio

Frame in the French

In fact, the analysis goes through three phases: a morpho-lexical analysis that aims to recognize each word in the sentence, a syntactic analysis to pull the various syntactic cases (subject, object…). The results of this phase are the inputs of the next one: the semantic analysis. The system recognizes the action by comparing the words of the sentence with the verbs table, then consults the primary field (class of the verb) to extract the type of verb (ATrans, PTrans, ...), it makes an instantiation of this class and starts filling the fields. Then, the definition dictionary is consulted to extract the semantic features of each word of the sentence in order to analyze them: • For example, the subject is recognized by the ', and according to the casual mark: Damma: ' ◌ُ semantic features of the verb and nouns of the sentence. • The instrument is recognized by the particles or words ( ‫ ا‬,‫) ـ‬. • The source and destination with the particles: (0: ) and (‫ب‬ , , ‫)إ‬. • Concerning the adjectives, and particles, they can be recognized easily (a table was devoted to the adjectives and another one, to the particles), etc. At this stage, we get a representation of the sentence, independently of any language; this is what we call universal or internal representation. And it is in fact, the strength point of this approach, because it permits the generation of a translation to any target language; we just need to add a module to support it. So, after the frame of the Arabic sentence (Arabic frame) is created, comes the role of the module: word for word translation, and we get the destination (target) frame. Then, the system dials the sentence from the target frame in a sequence that has been previously defined following the syntax rules of the target language (French), for instance: Affirmative_Sentence = Agent Action [ ‘de’ + Source][ ‘a’ + Destination][ ‘avec’ + Instrument] ... The system and before providing the result, organizes the sentence following the rules of syntax and grammar of the target language (French): _ the time of Verbs: Present / past / future, _ the gender and number of names; All these treatments (sentence generation frame compliances of destination language) are provided by the module of management of the target language. We note that the translation produced by the three phases: analysis, word for word translation and generation. 8

Translation Figure 3: Simplified architecture of the system

UbiCC Journal

CONCLUSION & PERSPECTIVES

in this article is therefore, in the semantic processing of texts using purely linguistic and finds fulfillment with the DCF method as a basis. This method has

Page 29

www.ubicc.org

Special Issue on The International Conference on Information and Communication Systems (ICICS 2011)

proved highly adaptable to the Arabic language and its peculiarities as to syntax and semantics [1], [4], [5], [6]. We can underline as prospects for this work, to integrate to the system a good morphological analyzer (such as the tool: Aramorph), and enrich the dictionaries used to cover other application areas and improve the results, because when the dictionaries are richer and the rules are well defined, the resulting translation will be more accurate. 9 REFERENCES [1] R. Mahdjoubi: Un Système pour le Traitement Automatique de la Langue Arabe Basé sur ces Propres Caractéristiques, Magister (1994) [2] K. Meftouh and Al.: Extraction Automatique du

UbiCC Journal

[3] [4]

[5]

[6]

Page 30

Sens d’Une Phrase en Langue Française par une Approche Neuronale, JADT (2002) S. Russel and Al.: Artificial Intelligence with 400 Exercises, Pearson Edition (2005) M. T. Laskri: Une Sémantique du Langage Naturel à Travers un Système Support de Thésaurus, Docorat d’Etat (1994) K. Meftouh: Un Réseau Simplement Récurrent pour la Génération d’une Représentation du Sens d’une Phrase Ecrite en Langue Arabe, Magister (2002) K. Rezeg: Une Approche Connexionniste pour la Traduction Automatique des Textes Arabe en Français, Courrier du Savoir, N° 08, pp. 5967 (2007)

www.ubicc.org