An Affinity Based Greedy Approach towards Chunking for Indian Languages Dipanjan Das, Monojit Choudhury, Sudeshna Sarkar and Anupam Basu Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur India - 721302 email: [email protected], {monojit, sudeshna, anupam}@cse.iitkgp.ernet.in

Abstract

A robust chunker can drastically reduce the complexity of parsing natural language text. Chunking for Indian languages requires a novel approach because of the relatively unrestricted order of words within a word group. A computational framework for chunking based on valency theory and feature structures is described here. The paper also draws an analogy between chunk formation in free word order languages and the bonding of atoms, radicals or molecules into complex chemical structures. The unavailability of large annotated corpora forces one to adopt a non-statistical approach to the task of word grouping for Indian language text. A chunker has been implemented for Bengali using this approach with considerably high accuracy.

1

Introduction

Chunking or local word grouping is an intermediate step towards parsing and follows part of speech tagging of raw text. Other than simplifying the process of full parsing, chunking has a number of applications in problems like speech processing and information retrieval. Chunking has been traditionally defined as the process of forming groups of words based on local information (Bharati et al, 1995).

Abney (1991) viewed chunks as connected subgraphs of the parse tree of a sentence. For example, in the following sentence (1)

The young girl was sitting on a bench.

the chunks are: (2)

[The young girl]/NP [was sitting]/VP [on a bench]/PP.

where NP, VP and PP are noun, verb and prepositional phrases respectively. Several methods for chunking have been explored for English and other European languages since the 1990s. (Abney, 1991) introduced a context-free grammar based approach for chunking of English sentences. Recent work in the area has mostly focused on statistical techniques. Machine learning based approaches eliminate the need for manual rule creation and are language independent. (Ramshaw and Marcus, 1995) proposed a transformation based learning paradigm for text chunking, which achieved approximately 92% precision and recall rates. This work used text corpora in which the chunks were marked manually as training examples. Other works report more than 90% accuracy in identifying English noun, verb, prepositional, adjective and adverbial phrases (Buchholz et al, 1999; Zhang et al, 2002). The work of (Vilain and Day, 2000) on identifying chunks is worth mentioning here because their system used transformation based rules which could either be hand-crafted or trained. The trained system performed with precision and recall rates of 89% and 83% respectively, whereas

they attained much lower precision and recall rates with the hand-engineered system; their observations thus justified the use of the learning based approach. (Johansson, 2000) used context free and context sensitive rules over part of speech tag sequences to chunk sentences. The rules in that work were also derived from training data.

The problem of chunking for Indian languages takes on a new dimension for two reasons. First, for most free word order languages, the internal structure of the chunks often has an unrestricted order, which makes chunk identification a more challenging task than for English text. However, full parsing of free word order languages, which is a hard task, becomes much easier once the intermediate step of identifying chunks is achieved. Second, there is a dearth of annotated corpora of Indian language text. Typically, the number of sentences used for training an NLP system for a language such as English runs into millions. For Indian languages, such a colossal amount of training data is hardly available. Therefore, a non-statistical approach seems to be the natural choice.

An approach towards chunking based on valency theory and feature structure unification is described here. An analogy is drawn between the unification of feature structures (FSs) of syntactic units and the formation of complex structures from atoms and radicals. In the formalism reported, the parse tree of a sentence is further imagined to be similar to the molecular structure of a substance. A chunker for Bengali has been implemented using this formalism with a precision of around 96% and a recall of approximately 98%. In the next section we describe the nature of chunk formation in Bengali.

2

Chunking in Free Word Order Languages

In this paper, chunking, especially with a view towards free word order languages, is defined as the process of identifying phrases in a sentence that have an internal arrangement independent of the overall structure of the sentence. Chunks according to this description may have a nested structure as well. The following sentence illustrates some chunks in Bengali1: (3)

[[kAThera TebilaTira]/NN [upara]/PPI]/NN [[khuba sundara]/JJ [phuladAni]/NN]/NN [Chila]/VF.
wood[of] table[on] very beautiful flowervase was
(There was a very beautiful flower-vase on the wooden table.)

In sentence 3, kAThera TebilaTira upara forms a noun chunk (NN) which has two chunks within it: a noun chunk and a postposition (PPI). In the noun chunk khuba sundara phuladAni, an adjectival chunk (JJ) khuba sundara and a noun phuladAni are nested inside. It should be noted that, due to the lack of strong word ordering constraints in Bengali, the chunks in the above sentence can be freely permuted to yield grammatically correct sentences having the same meaning. However, words may not be permuted across chunks. For example, the sentence (4)

sundara TebilaTira upara khuba kAThera phuladAni Chila.
beautiful table[of] very wood[of] flowervase was
(There was a very wooden flower-vase on the beautiful table.)

has a different sense: the sentence is syntactically correct but has a different semantics. Thus, informally, a chunk or a word group is a string of words inside which no other word of the sentence can enter without changing the meaning of the sentence. The present paper assumes this definition in identifying the chunks of a particular input sentence.

The problem of dividing a Hindi sentence into word groups has been explored by (Bharati et al, 1995). The output of their grouper served as input to a computational Paninian parser. However, in their work, they made a distinction between local word groups and phrases. They assert, from a computational point of view, that the recognition of noun phrases and verb phrases is neither simple nor efficient. Therefore, the scope of their grouper was limited and much of the disambiguation task was left to the parser. The problem of phrase identification in Hindi has been explored by (Ray et al, 2003) using a rewrite rule based approach. However, their technique gave only one possible grouping of a particular sentence, did not consider sentence level constraints (which at times generated ill-formed groups), and completely ignored the case of noun-noun groups.

1 In this paper, Bengali script has been written in italicized Roman fonts following the ITRANS notation.

Figure 1: Overview of the Framework

3

Formalism

The present paper focuses on a formalism developed using an analogy with the formation of bonds in chemical compounds. The criterion that governs the series of bond formations is the minimization of energy, through which a stable state is reached. Two atomic units can form a bond if and only if they have a mutual attraction towards each other, which is governed by the property called valency. During bond formation between two structures, an amount of energy is released, bringing the system to a more stable state. Ultimately, when the system cannot release any more energy, the final structure of the substance is attained.

We can extend this concept to chunking (or parsing in general) as follows. The individual words of a sentence are the atoms, which have valencies representative of their syntactic and/or semantic properties. The chunks can be viewed as semi-stable molecular complexes formed out of a set of atoms, which in turn have their own valencies. A quantity called the bond energy reflects the relative affinity between two syntactic units: the higher the bond energy between two units, the higher the probability that they will form a chunk. The complete parse tree of the sentence is analogous to the final chemical compound that is formed. The concepts of valencies and feature structures have been used to realize a chunker along the lines of this analogy. The following subsections examine the development of the formalism in detail.

3.1

Overview of the Framework

Figure 1 shows the framework proposed in this work. The main components of the system are the POS tagger, the chunking system and a search space pruner. The POS tagger takes an input sentence and tags each word with its POS category and the other morphological attributes valid for that category. The grouper tries to formulate valid groups based on the POS tags and the morphological attributes provided by the tagger. The first phase of grouping is the feature structure instantiation of each word of the sentence, followed by the grouping algorithm. At each step of grouping, a search space pruner reduces the number of redundant groups that may have been formed by the chunking system. The current work focuses only on the basic chunker.
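As a sketch, the control flow of this framework can be written as a simple composition of the three stages; all function and argument names below are hypothetical, since the paper does not specify an implementation interface:

```python
def chunk_sentence(sentence, tagger, instantiate_fs, group, prune):
    """Hypothetical pipeline: POS-tag a raw sentence, instantiate one
    feature structure per tagged word, greedily group the FSs, and
    finally prune redundant candidate groupings."""
    tagged = tagger(sentence)                  # e.g. [(word, pos, attrs), ...]
    fss = [instantiate_fs(t) for t in tagged]  # one FS per word
    groupings = group(fss)                     # candidate chunkings
    return prune(groupings)                    # drop redundant groups
```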

3.2

Valency and Feature Structures

Valency theory takes an approach to sentence analysis that focuses on the role of certain key words in a sentence. For example, certain verbs allow only certain kinds of predicates, some of which are compulsory while others are optional. The arguments that a verb can take are defined in terms of its valency. Valency was introduced by the French structuralist Lucien Tesnière in his dependency grammar formalism. Numerous theoretical publications (Helbig, 1992; Welke, 1995) have established it as arguably the most widely used model of complementation in languages like German, French, Romanian and Latin. Valency theoretic descriptions of English have also been explored (Allerton, 1982; Herbst, 1999).

Traditionally, valency theory has focused on the syntactic and semantic valencies of verbs and occasionally of nouns. A generalization of valency theory is proposed in this work, where instead of only nouns or verbs, each syntactic unit of a sentence at every level of analysis has a certain number of valencies associated with it. For the purpose of chunking, where only adjacent words or groups can be attached to form larger groups, the units (words or word groups) have a maximum of two valencies associated with them - the right and the left. The right valency defines the capability of the unit to form groups with the units that lie to its right in a sentence. The left valency is defined likewise. However, the affinity of a unit to form groups also depends on the syntactic attributes of the adjacent units. For example, a noun can readily form a group with an adjective to its left, but never with a finite verb. Therefore, the syntactic attributes of the adjacent units are also important.

We use feature structures (FSs) to represent the syntactic affinities of a syntactic unit in terms of valencies and bond energies. Successful unification between adjacent FSs implies the formation of a valid group with its own FS, i.e. the valency constraints are satisfied, and the process releases an energy equal to the bond energy associated with that particular valency. The basic contents of an FS are shown in Figure 2. Apart from the POS category and the morphological attributes, the FS also contains two hands - the right and the left. The right (left) hand specifies the constraints on the units to the right (left) of an object that can be grouped with it. The constraints are specified as a list of 5-tuples <BondEnergy, Category, Attributes, NewCategory, NewAttributes>. Each element of the 5-tuple can be described as follows:

• BondEnergy - The formalism assumes a variable affinity of a particular syntactic unit towards the units to its left and right, which is represented through the corresponding bond energy. Section 3.3 provides further discussion of the theoretical and practical issues concerning bond energies.

• Category - The POS category of the syntactic unit that the FS allows in the concerned hand.

• Attributes - The allowable attributes of the syntactic category in the respective hand.

• NewCategory - At every step of the chunking process two feature structures unite to form a new FS. NewCategory is the POS value of the newly formed FS.

• NewAttributes - Similarly, NewAttributes are the attributes of the FS obtained through unification.

The FSs corresponding to all the POS categories and their different attributes are designed manually through linguistic analysis. However, in the presence of an annotated corpus, these templates could be learnt automatically.

3.3

Bond Energy

Figure 2: Contents of a Feature Structure

Consider the nested chunk [[kAThera TebilaTira]/NN [upara]/PPI]/NN in example 3. A noun with a possessive inflection era, as in kAThera, has an affinity or valency for a noun or noun group (NN) to its right. Similarly, a noun in the oblique form like TebilaTira can form an NN chunk with a postposition like upara to its right. Therefore, the sequence of words kAThera TebilaTira upara has two possible groupings: [kAThera [TebilaTira [upara]/PPI]/NN]/NN and the aforementioned one, of which only the latter is syntactically valid. The concept of bond energy helps us solve this particular problem. Suppose the bond energy between NN and PPI is smaller than that between NN-possessive and NN. Then, in our approach, the bond between NN-possessive and NN will form before that between NN and PPI, thus leading to the correct grouping.

Similarly, an adjective can connect to a noun to its right and a qualifier to its left. However, in a sentence where an adjective is followed by a noun and preceded by a qualifier, the correct grouping requires the adjective to be grouped with the qualifier first; the new adjectival group is then joined with the noun. This is illustrated by the chunk [[khuba sundara]/JJ [phuladAni]/NN]/NN in example 3. Therefore, we can conclude that the bond energy between a qualifier and an adjective is greater than the bond energy between an adjective and a noun. Thus, bond energy helps us decide which way a bond will form (in other words, which units will be grouped) whenever there are multiple possibilities.

Let there be N POS categories C1 to CN. Let Ci·leftBE(Cj) be the bond energy between Ci and Cj when Cj is to the left of Ci; Ci·rightBE(Cj) is defined likewise. We make the following observations.

• Ci·leftBE(Cj) is 0 if Ci cannot be grouped with a preceding Cj.

• Ci·leftBE(Cj) is equal to Cj·rightBE(Ci).

• We need to estimate only N×N bond energies, which can be stored in a matrix BE such that BEij = Ci·leftBE(Cj) = Cj·rightBE(Ci).

• It is not necessary to assign exact values to the elements of BE. We are interested in a partial ordering between the bond energies. Note that we need to compare BEij only with BEki, for all k other than j, because an ambiguity occurs only when a particular category can form bonds with units both to its right and to its left. We call such pairs of elements comparable pairs.

Thorough linguistic analysis has led us to 47 unique values for the bond energies, based on which all possible ambiguous pairs can be compared. Note that it is not important to know what these 47 values are; only the ordering is important. In other words, the 47 distinct values have been chosen arbitrarily from the set of natural numbers.

It should be mentioned here that a given sentence can be inherently ambiguous and lead to multiple valid groupings. Since we do not preclude the use of the same bond energy value for a comparable pair, multiple groupings emerge as a natural possibility. We illustrate this with the following examples. (5)

[rAmera kathA]/NN [balA]/VN [bAraNa]/ADV
Ram[of] story talk forbidden
(It is forbidden to talk about Ram.)

(6)

[rAmera]/NN [kathA balA]/VN [bAraNa]/ADV
Ram[of] story talk forbidden
(Ram is forbidden from talking.)

Thus, in order to retain the possibility of both groupings, we must assign the same bond energies to the pairs NN-possessive and NN, and NN and VN (verbal noun).
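To make the notions of hands and bond energies concrete, the following sketch encodes the qualifier-adjective-noun case discussed above. The category names and numeric energy values are illustrative assumptions; the paper fixes only their relative order:

```python
from dataclasses import dataclass, field

@dataclass
class FS:
    """A feature structure with a POS category, attributes, and two hands.
    Each hand is a list of 5-tuples:
    (bond_energy, category, attributes, new_category, new_attributes)."""
    category: str
    attributes: str = ""
    left: list = field(default_factory=list)
    right: list = field(default_factory=list)

def bond_energy(left_fs, right_fs):
    """Energy released if right_fs bonds with left_fs on its left;
    0 means no bond is possible."""
    for be, cat, _attr, _new_cat, _new_attr in right_fs.left:
        if cat == left_fs.category:
            return be
    return 0

# Illustrative categories: QF qualifier, JJ adjective, NN noun.
qualifier = FS("QF")
adjective = FS("JJ",
               left=[(2, "QF", "", "JJ", "")],   # qualifier-adjective bond
               right=[(1, "NN", "", "NN", "")])  # adjective-noun bond
noun = FS("NN", left=[(1, "JJ", "", "NN", "")])

# The qualifier-adjective bond has the higher energy, so in
# "khuba sundara phuladAni" the chunk [khuba sundara]/JJ forms first.
assert bond_energy(qualifier, adjective) > bond_energy(adjective, noun)
```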

4

The Grouping Algorithm

The grouping algorithm of the formalism proceeds by first instantiating the feature structures for every word and punctuation mark in a sentence. We stress the fact that the formalism follows a divide and conquer approach that eases the complexity of full parsing. The following subsections present the details of the algorithm.

4.1

Instantiation of Feature Structures

After an input sentence is fed to the POS tagger, it comes as input to the chunking system with POS tags and morphological attributes attached to each of the words and sentinels. The tagged sentence is fed into the initial instantiator. The steps of instantiation are:

1. Each word of the sentence is extracted one after another.

2. A rule file, which assists in the process of instantiation, consists of the templates of FSs for each POS category and its attributes. The present system uses inflection as the only syntactic attribute for chunking purposes.

3. For each word, a new instance of the FS is created from the template, which is selected from the rule file based on the POS tag and the attribute information associated with the word.

4.2

The Grouping Process

After the instantiation of the feature structures of each syntactic unit of the sentence, the grouping process starts as a greedy process and proceeds in the following manner:

1. The list of FSs is stored in an array F.

2. Initially, for every pair of adjacent FSs in the sentence, the bond energy between them is calculated and stored in a max-heap B.

3. The arrays F and B are stored in a global stack S.

4. The stack is popped to get a partially grouped list of FSs - denoted by B and F for this run.

5. The maximum bond energy bmax is extracted from B and the two corresponding FSs x and y are extracted from F as well.

6. If only one such pair is present, the bond between x and y is formed by reading the NewCategory and NewAttributes from one of the hands of either x or y. Let the FS after unification be z. z is inserted into the array F. z contains the old FSs x and y.

7. The bond energies between z and the FSs to its right and left, denoted by bzleft and bzright, are calculated and inserted into B.

8. At Step 5, if there are multiple values of bmax, then for each of the values, Steps 6 and 7 are repeated and the corresponding B and F are stored in the global stack S. The control flow starts from Step 5 again.

9. If there is only one value of bmax, then the partially grouped structure is stored in S and control flows from Step 4.

10. If at Step 5 the value of bmax is zero, the configuration is popped from S.

11. Grouping stops when S is empty.

4.3

Handling Conjunctions

A thumb-rule for handling conjunctions and other conditionals is applied before the configuration B and F is stored in the stack S. The details of the thumb-rule are as follows:

1. For all the FSs in F, it is checked whether three FSs u, v and w exist such that their corresponding tuples exist in the list of tuples stored in a file for handling conjunctions in the system.

2. For all such sets of three FSs, the three FSs are united to form a single FS x with new category POSx and attributes Attributesx, both of which are present in the thumb-rule file.

3. After unification, the new FS x is inserted into the list F, and the bond energies of x with the FSs to its right and left are inserted into B, maintaining the decreasing order of the array.

4.4

An Example

Below we illustrate the process of chunking with an example sentence.

Input sentence: hugalI nadIra dui tIre kolakAtA ga.De uTheChe.
Hoogly river's two bank Kolkata built up
(Kolkata has grown up on either bank of the river Hoogly.)

After MA and POS-tagging and FS instantiation:2

Words      CAT.attr  left.cat, be  right.cat, be
hugalI     NP        -             -
nadIra     NN.poss   -             NN, 1
dui        JQ        -             NN, 2
tIre       NN.e      JQ, 2         -
kolakAtA   NP        -             -
ga.De      V.nf      -             V.f, 1
uTheChe    V.f       V.nf, 1      -

The process of bonding: There are two possible bonds that can be formed at the first step - that between dui and tIre and that between ga.De and uTheChe. The former has the higher bond energy 2 and is therefore chosen. The resulting category is an NN (noun group). Thus we have the following partially chunked sentence:

hugalI nadIra [dui tIre]/NN kolakAtA ga.De uTheChe.

At this point again two bonds can form, both of which have the same bond energy. Since the bonds are non-overlapping, it does not matter which one is chosen first. Finally, we get the following output from the chunker:

hugalI/NN [nadIra [dui tIre]/NN]/NN kolakAtA/NN [ga.De uTheChe.]/VF

However, note that the correct grouping is

[[hugalI nadIra]/NN.poss [dui tIre]/NN]/NN [kolakAtA]/NN [ga.De uTheChe.]/VF

2 The meanings of the symbols are - NP: proper noun, poss: possessive, JQ: quantifying adjective, NN: noun, V: verb, nf: non-finite, f: finite, CAT: category, attr: attribute, be: bond energy.
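The greedy core of the grouping process of Section 4.2, stripped of the backtracking stack S and of tie and conjunction handling, can be sketched as follows. The bond-energy table encodes only the pairs needed for the example above, with illustrative values; nested chunks are shown by composing the bracketed strings:

```python
# (left category, right category) -> (bond energy, new category); illustrative
BE = {
    ("JQ", "NN"): (2, "NN"),        # quantifier + noun
    ("NN.poss", "NN"): (1, "NN"),   # possessive noun + noun group
    ("V.nf", "V.f"): (1, "V.f"),    # non-finite verb + finite verb
}

def group(units):
    """units: list of (text, category). Greedily merge the adjacent pair
    with the highest positive bond energy until none remains."""
    units = list(units)
    while True:
        best = max(
            range(len(units) - 1),
            key=lambda i: BE.get((units[i][1], units[i + 1][1]), (0,))[0],
            default=None,
        )
        if best is None:
            break
        be, new_cat = BE.get((units[best][1], units[best + 1][1]), (0, None))
        if be == 0:
            break  # no positive bond energy left anywhere
        merged = ("[%s %s]/%s" % (units[best][0], units[best + 1][0], new_cat),
                  new_cat)
        units[best:best + 2] = [merged]
    return [u[0] for u in units]

sent = [("hugalI", "NP"), ("nadIra", "NN.poss"), ("dui", "JQ"),
        ("tIre", "NN"), ("kolakAtA", "NP"), ("ga.De", "V.nf"),
        ("uTheChe", "V.f")]
print(group(sent))
# ['hugalI', '[nadIra [dui tIre]/NN]/NN', 'kolakAtA', '[ga.De uTheChe]/V.f']
```

This reproduces the bonding sequence of the example (dui-tIre first, then the two equal-energy bonds), including the mis-grouping of hugalI nadIra noted above.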

5

Implementation and Results

A chunker for Bengali has been implemented based on the aforementioned framework. It depends on a morphological analyzer (MA) and a POS tagger, both of which have been developed in-house and have accuracies of 95% and 91% respectively. The salient features of the chunker for Bengali are mentioned below:

• The chunker uses 39 tags and 47 distinct values for specifying the bond energies.

• The chunks identified by the system are nested. Their categories are noun chunks (NN), finite verb chunks (VF), non-finite verb chunks (VN), adverbial chunks (ADV) and adjectival chunks (JJ). Punctuation marks other than "," form independent chunks as well.

• The templates of the feature structures are specified manually in a rule file. There are 75 such templates for the different POS categories and their respective attributes.

• Multiple word groupings for the same sentence are also recognized by the chunker.

• The chunker cannot identify multiword expressions, named entities or clausal conjunctions of conditional constructs. It cannot handle cases where semantic issues are involved.

Since the accuracy of the chunker heavily depends on the accuracies of the pre-processing steps, such as MA and POS tagging, evaluation of the system as a whole will not reflect the accuracy of the chunking algorithm. Therefore, in order to estimate the accuracy of the chunking algorithm rather than of the system as a whole, the present chunker has been tested on a POS-annotated corpus of about 30,000 words. The corpus has been developed in-house by

manually annotating a subset of the Bengali monolingual corpus provided by CIIL. A few conventions have been kept in mind while evaluating the accuracy of the algorithm. Since the algorithm produces word groups which may be nested, only the outermost chunks have been considered during evaluation. Therefore, for a particular sentence, the chunker produces a set of non-overlapping chunks, each of which may have a correct or an incorrect structure. The average precision and recall rates are 96.12% and 98.03% respectively. Hundred percent accuracy in chunking is not observed because the system cannot identify multiword expressions and named entities; most of the incorrect chunks are due to occurrences of these two phenomena in natural language text.
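The evaluation convention described above (comparing only the outermost, non-overlapping chunks against a reference grouping) can be sketched as follows; the span representation and the toy numbers are illustrative, not the paper's actual data:

```python
def precision_recall(predicted, gold):
    """Chunks are (start, end) word spans; a predicted chunk is correct
    only if exactly the same span appears in the gold grouping."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return precision, recall

# e.g. gold outermost chunks of a 7-word sentence vs. a system output
gold = [(0, 3), (4, 4), (5, 6)]
pred = [(0, 1), (2, 3), (4, 4), (5, 6)]
p, r = precision_recall(pred, gold)
print(round(p, 2), round(r, 2))  # prints: 0.5 0.67
```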

6

Conclusion

In this paper, a framework has been described for chunking of free word order languages based on a generalization of valency theory implemented in the form of feature structures. We have also described a strong analogy between the formation of word groups and the formation of larger chemical structures from smaller ones. The concept of a variable bond energy between syntactic units provides considerable representational ease in capturing the problem of chunking.

The formalism can be extended to complete parsing of free word order languages. This can be accomplished through the inclusion of semantic valencies and non-adjacent syntactic valencies, which will be associated with a non-adjacent bond energy value. The idea is similar to that of karaka, or the traditional valencies of verbs. The system can also be extended to chunk words without the intervention of the POS tagger: an untagged sentence would serve as the input to the chunker, and the system would tag and chunk in one go, with a morphological analyzer serving as the front end. If semantic resources become available for Bengali and other free word order languages, then the present system can be extended to a full parser for such languages.

Acknowledgement This work has been partially supported by a Media Lab Asia research grant.

References

S. Abney. 1991. Parsing by Chunks. In Principle-Based Parsing. Kluwer Academic Publishers.

D. Allerton. 1982. Valency and the English Verb. London/New York: Academic Press.

A. Bharati, V. Chaitanya, and R. Sangal. 1995. Natural Language Processing: A Paninian Perspective. Prentice Hall India.

S. Buchholz, J. Veenstra, and W. Daelemans. 1999. Cascaded Grammatical Relation Assignment. In Proceedings of EMNLP/VLC-99, University of Maryland, USA.

G. Helbig. 1992. Probleme der Valenz- und Kasustheorie. Tubingen: Niemeyer.

T. Herbst. 1999. English Valency Structures - A First Sketch. Technical report EESE 2/99.

C. Johansson. 2000. A Context Sensitive Maximum Likelihood Approach to Chunking. In Proceedings of CoNLL-2000 and LLL-2000, Lisbon, Portugal.

L. A. Ramshaw and M. P. Marcus. 1995. Text Chunking using Transformation-Based Learning. In ACL Third Workshop on Very Large Corpora.

P. R. Ray, V. Harish, S. Sarkar, and A. Basu. 2003. Part of Speech Tagging and Local Word Grouping Techniques for Natural Language Parsing in Hindi. In Proceedings of the International Conference on Natural Language Processing (ICON 2003), Mysore, India.

M. Vilain and D. Day. 2000. Phrase Parsing with Rule Sequence Processors: an Application to the Shared CoNLL Task. In Proceedings of CoNLL-2000 and LLL-2000, Lisbon, Portugal.

K. Welke. 1995. Dependenz, Valenz und Konstituenz. In Ludwig M. Eichinger and Hans-Werner Eroms (eds.), Dependenz und Valenz, pp. 163-175. Hamburg: Buske.

T. Zhang, F. Damerau, and D. Johnson. 2002. Text Chunking based on a Generalization of Winnow. Journal of Machine Learning Research, 2 (March).