Probabilistic Parsing of Korean Sentences using Collocational Information

Kong Joo Lee

Dept. of Computer Science, Korea Advanced Institute of Science and Technology

Jae-Hoon Kim

Dept. of Computer Engineering, Korea Maritime University

Gil Chang Kim

Dept. of Computer Science, Korea Advanced Institute of Science and Technology

Abstract

Lexical information is one of the most important sources for improving the accuracy of syntactic disambiguation. This paper describes a Korean probabilistic parser that is based on the probabilities of phrase structure rules as well as the probabilities of collocational information between lexical items to resolve syntactic ambiguity. An extensive experiment shows that the proposed parser gains improvements in parsing accuracy, reaching about 84%.

1. Introduction

Parsing a natural language sentence can be described as a procedure that searches through various ways of combining grammatical rules to find a combination that generates a tree that could be the structure of the input sentence.* In this procedure, as is well known, several trees are possible for one input sentence, so a parser has to be able to decide which of the many candidate trees is the best.

Recently, probabilistic approaches to parsing have become widely used. The simplest example of such an approach uses a probabilistic context-free grammar (PCFG) to assign a probability to each parse tree. With a PCFG, a parser resolves syntactic ambiguity in a very simple way: the probability of a parse tree is calculated as the product of the probabilities of the PCFG rules used in the parse, and the tree assigned the highest probability is chosen as the best tree for the input sentence. However, this approach is only able to distinguish between very common and very uncommon constructions in the language. Lexical information has been shown to be crucial for many parsing decisions, such as prepositional-phrase attachment [5]. Consider the following example sentence.

(1) a. "Moscow sent 100,000 (NP soldiers (PP into Afghanistan))..."
    b. "Moscow (VP sent (NP 100,000 soldiers) (PP into Afghanistan))..."

* This research was supported in part by KOSEF.

In example (1), an early approach to probabilistic parsing that conditions probabilities on nonterminal nodes and part-of-speech tags alone can choose between the two possibilities (into attached to the verb sent, or to the noun soldiers) only on the basis of the preference between the rules `VP → verb NP PP' and `NP → noun PP'. However, lexical information such as `send into' or `soldier into' may be more important for syntactic disambiguation than the preference between simple context-free rules. Most recently, some researchers [2, 3, 7] have attempted techniques that apply lexical information to syntactic disambiguation as a whole, not just to a sub-problem such as prepositional-phrase attachment.

In this paper, we present a probabilistic parser for Korean sentences using collocational information, which is similar to a verb case frame or subcategorization. Collocational information is a triple that consists of the heads of two phrases and their syntactic relationship; the head of a phrase is a lexical item which represents the phrase. Such information can improve the accuracy of disambiguation in syntactic analysis. Also, a restricted form of phrase structure grammar for Korean is introduced in this paper.

Table 1: Classification of Functional Words in Korean.

Class                 | Category of Functional Word
Intra-Constituent     | case particle: predicative(jp), vocative(jcv), final auxiliary(jxf)
Relation Indicator    | verbal-ending: nominal(etn), prefinal(ep), final(ef)
                      | suffix: noun-derivational(xsn), verb-derivational(xsv), adjective-derivational(xsm), adverb-derivational(xsa)
Inter-Constituent     | case particle: subjective(jcs), objective(jco), complemental(jcc), adverbial(jca), adnominal(jcm), comitative(jct), quotative(jcr), conjunctive(jcj), auxiliary(jxc)
Relation Indicator    | verbal-ending: coordinate conjunctive(ecc), subordinate conjunctive(ecs), adnominal(etm)

We show that the head of a phrase and the collocational information that co-occurs within a phrase can be automatically specified thanks to this restricted form of grammar.
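The product-of-rule-probabilities computation above is easy to state in code. The following is a minimal sketch in Python, using a toy grammar with made-up probabilities (none of which come from this paper), of how a PCFG scores a parse tree; the names PCFG and tree_prob are illustrative only.

    from math import prod

    # Hypothetical PCFG rule probabilities P(rule) for example (1);
    # the grammar and the numbers are invented for illustration.
    PCFG = {
        ("VP", ("verb", "NP", "PP")): 0.3,   # PP attached to the verb
        ("NP", ("noun", "PP")):       0.2,   # PP attached to the noun
        ("NP", ("noun",)):            0.8,
    }

    def tree_prob(tree):
        """Probability of a parse tree: the product of the probabilities
        of the PCFG rules used in it. A tree is (label, children); a
        leaf is a plain part-of-speech string, which applies no rule."""
        if isinstance(tree, str):
            return 1.0
        label, children = tree
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        return PCFG[(label, rhs)] * prod(tree_prob(c) for c in children)

    # Reading (1.b): "(VP sent (NP 100,000 soldiers) (PP into Afghanistan))"
    reading_b = ("VP", ["verb", ("NP", ["noun"]), "PP"])
    print(tree_prob(reading_b))              # 0.3 * 0.8 = 0.24

Note that such a model prefers the same attachment for every verb and noun; the collocational model developed below conditions these decisions on lexical heads instead.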

2. Phrase Structure Grammar for Korean Sentences

Korean is an agglutinative language, in which a word-phrase is in general a composition of a content word and functional word(s). A functional word, a particle or a verbal-ending, indicates the grammatical role of the word-phrase in a sentence, while a content word carries the meaning of the word-phrase. Roughly, there are two classes of functional words in Korean according to the position of the functional word within a word-phrase. Table 1 summarizes the part-of-speech tags of functional words with respect to their classes. In this paper, the form of the phrase structure grammar for Korean sentences is restricted to the following three types based on the class of functional words (see [6] for a detailed description).

• Intra-Constituent Relation Indicator:

Functional words in this class change the attribute and the grammatical role of the constituent they attach to. The rule pattern that includes functional words of the Intra-Constituent Relation Indicator class is as follows:

    Type I : A → B + α

where α is one of the Intra-Constituent Relation Indicators and can be omitted. This type of rule specifies that the constituent B is transformed into the constituent A owing to the functional word α. In example (2.b), a verb phrase (VP) is transformed into a noun phrase (NP) due to the functional word etn, which is a nominal verbal-ending.

(2) a. VP → NP+jp
    b. NP → VP+etn

• Inter-Constituent Relation Indicator:

Functional words which belong to this class indicate the grammatical relationship between two constituents. The rule patterns that include Inter-Constituent Relation Indicators are restricted to the following two forms.

    Type II : A → B + β C

where β is one of the Inter-Constituent Relation Indicators except the conjunctive case particle (jcj) and the coordinate conjunctive verbal-ending (ecc). This type of rule means that the constituent B modifies the constituent C in the grammatical role which β indicates. In example (3.a), the subjective case particle (jcs) makes the noun phrase (NP) the subject of the following verb phrase (VP).

(3) a. VP → NP+jcs VP
    b. NP → VP+etm NP

    Type III : A → A1 + β′ A2 + β′ ... An

where β′ can be a conjunctive case particle (jcj), a coordinate conjunctive verbal-ending (ecc), or a comma (sp), each of which can connect parallel structures. This type of rule covers parallel structures. In example (4.b), the parallel noun phrases (NPs) are connected by the conjunctive case particle (jcj).

(4) a. VP → VP+ecc VP+ecc VP
    b. NP → NP+jcj NP+jcj NP
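As a concrete (and purely illustrative) rendering of the three templates, the sketch below classifies a rule from the functional words on its right-hand side, with the tag sets taken from Table 1; the encoding of rules as (constituent, functional-word) pairs is our assumption, not the paper's.

    # Functional-word tag sets from Table 1 (alpha, beta, beta').
    ALPHA = {"jp", "jcv", "jxf",                       # Intra-Constituent
             "etn", "ep", "ef",                        # Relation Indicators
             "xsn", "xsv", "xsm", "xsa"}
    BETA = {"jcs", "jco", "jcc", "jca", "jcm", "jct",  # Inter-Constituent,
            "jcr", "jxc", "ecs", "etm"}                # minus jcj and ecc
    BETA_PRIME = {"jcj", "ecc", "sp"}                  # parallel connectors

    def rule_type(rhs):
        """rhs is a list of (constituent, functional_word) pairs,
        where the functional word may be None."""
        if len(rhs) == 1 and (rhs[0][1] is None or rhs[0][1] in ALPHA):
            return "I"                                 # A -> B + alpha
        if len(rhs) == 2 and rhs[0][1] in BETA:
            return "II"                                # A -> B+beta C
        if len(rhs) >= 2 and all(fw in BETA_PRIME for _, fw in rhs[:-1]):
            return "III"                               # A -> A1+beta' ... An
        return None                                    # outside the grammar

    assert rule_type([("VP", "etn")]) == "I"                    # rule (2.b)
    assert rule_type([("NP", "jcs"), ("VP", None)]) == "II"     # rule (3.a)
    assert rule_type([("NP", "jcj"), ("NP", "jcj"), ("NP", None)]) == "III"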

Figure 1 shows an example sentence and its parse tree.¹ The rules used in parsing the sentence and their rule types are shown as well.

¹ ncn is a tag for common noun, nq for proper noun, pvg for general verb, and paa for attributive adjective. NOM stands for the nominative case, ACC for the accusative case, and TOP for the topic marker.

[Figure 1: An example sentence "It is difficult for John, Susan and Jude to pass an examination." and its parse tree; the words within the brackets are the heads of the phrases, and the rules used with their types are listed beside the tree. Type I: NP → ncn, NP → VP + etn, S → ADJP + ef + sf. Type II: VP → NP + jco pvg, VP → NP + jcs VP, ADJP → NP + jxt paa. Type III: NP → nq + jcj nq + jcj nq. Word-phrase glosses: John-and, Susan-and, Jude-NOM, exam-ACC, pass-ing-NOM-TOP, difficult.]


3. Probabilistic Parser Using Collocational Information

In this section, we present collocational information, which is similar to a verb case frame or subcategorization. Then a parsing model for Korean sentences using collocational information is introduced.

3.1 Collocational Information


The head of a phrase is the central element which is equivalent to the phrase as a whole [4]. In other words, the head of a phrase (nonterminal node) is the most important word of the phrase: the head of an NP (noun phrase), for example, is a noun, and the head of a VP (verb phrase) is its main verb. Figure 2 gives an example in which two alternative parse trees are possible for one sentence. In this figure, the lexical item within brackets beside a nonterminal symbol represents the head of the phrase.² To gloss a few of these, the head of the phrase NP1,2 is `John' and that of the phrase VP1,3 is `pass'. In tree (A), the head of the phrase VP0,3, which is constructed by the rule `VP0,3 → NP0,1+jco VP1,3', is the same as that of the phrase VP1,3, because the head is propagated from the RHS to the LHS of the rule.

Let us look at how a lexical item can aid disambiguation in syntactic analysis. In tree (A) of Figure 2, the phrases NP0,1 and VP1,3 are combined into the bigger phrase VP0,3. With this combination of phrases, we can observe that the predicate of the object `exam' is the verb `pass' in tree (A), while that of `exam' is the verb `graduate' in tree (B), due to the combination of the phrases NP0,1 and VP1,4 in tree (B). Generally speaking, the word `exam-ACC' is better matched with the verb `pass' than with the verb `graduate'. The better matched a word pair is, the more preferable the syntactic parse tree in which the word pair co-occurs. The parser can thus decide that tree (A) is more proper than tree (B) for this example sentence, from the fact that the word pair (exam-ACC, pass) is better matched than (exam-ACC, graduate). The most common example of such a matched word pair is a verb case frame or subcategorization. We call a word pair such as `exam-ACC' and `pass' collocational information in this paper. More formally, we define collocational information as a triple that consists of a modifier head word, a modifiee head word, and their syntactic relationship: (modifier-head, relationship, modifiee-head). In the above example, the collocational information (exam, ACC, pass) is possible.


² All lexical items are translated into English for the convenience of explanation throughout this paper.


[Figure 2: An example sentence "If John passes an examination, then (he) can graduate (from...)." and its two possible parse trees (A) and (B); the subscript numbers beside the nonterminal symbols are for the convenience of explanation. In tree (A), NP0,1[exam] combines with VP1,3[pass] to form VP0,3[pass]; in tree (B), NP0,1[exam] combines with VP1,4[graduate] to form VP0,4[graduate]. Word-phrase glosses: examination-ACC, John-NOM, pass-COMP, graduate.]

Table 2: A head phrase and collocational information according to the type of phrase structure grammar.

Type of Rule | Rule                        | Head Phrase | Collocational Information
Type I       | A → B + α                   | B_h         | (none)
Type II      | A → B + β C                 | C_h         | (B_h, β, C_h)
Type III     | A → A1 + β′ A2 + β′ ... An  | An_h        | (A1_h, β′, An_h), (A2_h, β′, An_h), ..., (An-1_h, β′, An_h)


Because the form of the phrase structure grammar described in this paper is restricted to the three forms given in Section 2, the head phrase and the collocational information of each phrase structure rule can be automatically specified according to the type of the rule, without additional effort. Table 2 shows the head phrase of each rule and the collocational information between heads, with their syntactic relationship, according to the type of the rule. In Table 2, we denote the head of a phrase A by A_h. The head of a Type I rule is that of the phrase B itself, and collocational information cannot be defined for this type, because such a rule does not relate one phrase to another but only transforms the syntactic attribute of a single phrase. In Type II rules, the head is that of the phrase C, since Korean is a head-final language. In addition, the triple (B_h, β, C_h) is possible, which means that the head B_h relates to the head C_h through the syntactic relationship β. Type III rules are similar to Type II rules except that there are more than two phrases on their RHSs. So, in the case of Type III rules, more than one collocational information triple is possible, as shown in Table 2. The head of a Type III rule is the head of the last phrase An, and n−1 collocational information triples between the last phrase An and the preceding phrases are possible. Table 3 shows the rules used in the parse tree in Figure 1 and their possible collocational information.
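Since the head and the triples depend only on the rule's type, extracting them is mechanical. The sketch below mirrors Table 2 under the assumption that each RHS constituent is given by its head word and its attached functional word (None if absent); the function name is our own, not the paper's.

    def head_and_collocations(rtype, heads, fws):
        """heads[i]: head word of the i-th RHS constituent;
        fws[i]: its attached functional word, or None."""
        if rtype == "I":                   # A -> B + alpha
            return heads[0], []            # head of B; no triple defined
        if rtype == "II":                  # A -> B + beta C
            b_h, c_h = heads
            return c_h, [(b_h, fws[0], c_h)]   # head-final: C's head wins
        if rtype == "III":                 # A -> A1 + beta' ... An
            an_h = heads[-1]
            return an_h, [(a_h, fw, an_h)
                          for a_h, fw in zip(heads[:-1], fws[:-1])]

    # The rule VP -> NP+jco pvg from Figure 1:
    head, triples = head_and_collocations("II", ["exam", "pass"], ["jco", None])
    assert head == "pass" and triples == [("exam", "jco", "pass")]
    # The Type III rule NP -> nq+jcj nq+jcj nq:
    head, triples = head_and_collocations(
        "III", ["John", "Susan", "Jude"], ["jcj", "jcj", None])
    assert triples == [("John", "jcj", "Jude"), ("Susan", "jcj", "Jude")]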

3.2 Parsing Model Using Collocational Information

This section describes a parser which uses not only the preference of each rule but also the preference of each word pair, to improve the accuracy of syntactic disambiguation. Let P(W_1^n, T) be the joint probability of a sentence W_1^n and a particular parse tree T; then the probability of the sentence can be computed as

    P(W_1^n) = Σ_{T ∈ P(W_1^n)} P(W_1^n, T),

where P(W_1^n) in the summation denotes the set of all parses for W_1^n. A probabilistic model for parsing can resolve syntactic ambiguity by finding the best parse tree, namely the one which maximizes P(W_1^n, T). The most basic model takes the probability of a parse tree to be the probability of having all of the phrase structure rules of the parse, that is,

    P(W_1^n, T) = ∏_{rule ∈ T} P(rule).    (1)

First, the basic model is conditioned by contextual information so that the resultant parsing model is as follows.

Table 3: Head and collocational information extracted from the example parse tree in Figure 1.

Rule Type | Rule                       | Head      | Collocational Information
Type I    | NP → ncn                   | exam      |
          | NP → VP + etn              | pass      |
          | S → ADJP + ef + sf         | difficult |
Type II   | VP → NP + jco pvg          | pass      | (exam, jco, pass)
          | VP → NP + jcs VP           | pass      | (Jude, jcs, pass)
          | ADJP → NP + jxt paa        | difficult | (pass, jxt, difficult)
Type III  | NP → nq + jcj nq + jcj nq  | Jude      | (John, jcj, Jude), (Susan, jcj, Jude)

    P(W_1^n, T) = ∏_{rule ∈ T} P(rule | t_l, t_r)

                = ∏_{rule ∈ T} { P(A → B + α | t_l, t_r)                     if rule ∈ Type I
                                 P(A → B + β C | t_l, t_r)                   if rule ∈ Type II     (2)
                                 P(A → A1 + β′ A2 + β′ ... An | t_l, t_r)    if rule ∈ Type III }

where t_l and t_r are the left and the right context of the constituent A, respectively; they are the parts-of-speech immediately outside both ends of the span which the rule covers (see [6] for a detailed description).

Now, the basic model with contextual information can be extended to include the probability of collocational information. A syntactic tree for an input sentence is assumed to be represented as the joining of the set of rules and the set of collocational information annexed to the rules. Hence, the probability of a parse tree can be computed as the product of not only the probabilities of all rules used in the tree but also the probabilistic significance of all collocational information, which is specified according to the type of each phrase structure rule (see Table 2). The probabilistic significance of the collocational information for a rule of the form `A → B + β C' is measured by the conditional probability P(B_h | β, C_h), estimated by

    P(B_h | β, C_h) = F(B_h, β, C_h) / F(β, C_h),

where F(·) denotes the frequency of its argument in the training corpus. We now formally define our probabilistic parsing model for the probability of a parse T for the sentence W_1^n as follows:

    P(W_1^n, T) = ∏_{rule ∈ T} { P(A → B + α | t_l, t_r)                                              if rule ∈ Type I
                                 P(A → B + β C | t_l, t_r) · P(B_h | β, C_h)                          if rule ∈ Type II     (3)
                                 P(A → A1 + β′ ... An | t_l, t_r) · ∏_{i=1}^{n−1} P(Ai_h | β′, An_h)  if rule ∈ Type III }

To see how Equation (3) calculates the probability of a parse tree, consider the following equation,³ which is the probability of the parse tree in Figure 1:

    P(W_1^n, T) = P(S → ADJP + ef + sf | bos, eos)
                  · P(ADJP → NP + jxt paa | bos, ef) · P(pass | jxt, difficult)
                  · P(NP → VP + etn | bos, jxt)
                  · P(VP → NP + jcs VP | bos, etn) · P(Jude | jcs, pass)
                  · P(NP → nq + jcj nq + jcj nq | bos, jcs) · P(John | jcj, Jude) · P(Susan | jcj, Jude)
                  · P(VP → NP + jco pvg | jcs, etn) · P(exam | jco, pass)
                  · P(NP → ncn | jcs, jco)

³ bos stands for the beginning of the sentence and eos for the end of the sentence.
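To illustrate how Equation (3) is evaluated in practice, the sketch below scores a parse from its list of rule applications; parse_prob and the two probability tables are hypothetical stand-ins for the estimated distributions, not the authors' implementation.

    from math import prod

    def parse_prob(applications, p_rule, p_coll):
        """applications: one (rule, t_l, t_r, triples) tuple per node,
        where triples is the list of (B_h, relation, C_h) given by
        Table 2 (empty for Type I rules)."""
        return prod(
            p_rule.get((rule, tl, tr), 0.0)
            * prod(p_coll.get(t, 0.0) for t in triples)
            for rule, tl, tr, triples in applications)

    # Two of the nodes from Figure 1, with made-up probabilities:
    p_rule = {("VP -> NP+jco pvg", "jcs", "etn"): 0.4,
              ("NP -> ncn", "jcs", "jco"): 0.7}
    p_coll = {("exam", "jco", "pass"): 0.05}
    apps = [("VP -> NP+jco pvg", "jcs", "etn", [("exam", "jco", "pass")]),
            ("NP -> ncn", "jcs", "jco", [])]
    print(parse_prob(apps, p_rule, p_coll))   # 0.4 * 0.05 * 0.7 = 0.014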

4. Preliminary Experiment

All experiments were made on a Sun Ultra 1 (UltraSPARC 167MHz). The training corpus contains 30,000 sentences (796,449 words), whose lengths range from 2 to 66 words, with a mean length of 25.6 words, while the test corpus contains 1,000 held-out sentences whose mean length is 25.3 words.


Table 4: The result of the preliminary experiment; LP is Labeled Precision and LR is Labeled Recall.

                        Basic Model   Model using Contextual   Model using Contextual &
Corpus   Measure        (Eq. 1)       Information (Eq. 2)      Collocational Inf. (Eq. 3)
Train    LP             78.42         87.92                    96.29
Corpus   LR             76.09         87.23                    95.53
Test     LP             78.12         83.25                    84.42
Corpus   LR             75.97         82.86                    83.94

The number of distinct collocational information triples extracted from the training corpus is 224,883. Also, the number of rules extracted from the training corpus is 2,614, while the number of rules with contextual information is quite large, about 36,600. The database file containing collocational information is about 6.7MB in our system.

To estimate the probabilities of the phrase structure rules and the collocational information, the maximum likelihood estimation (MLE) method is used. Collocational information is closely tied to lexical items rather than to abstracted items, so there is a huge number of parameters to be estimated. Consequently, we face a serious data sparseness problem: events may occur at test time that have never been seen in the training data. To alleviate the problem, the well-known back-off smoothing technique [3] is used in this experiment.

The parsing accuracy is measured by the PARSEVAL measures [1]. Table 4 presents the result of the experiment. It is obvious that the accuracy on the training corpus is considerably higher than that on the test corpus. Comparing the first column to the second in Table 4, one can see that the model using contextual information outperforms the basic model on both the training corpus and the test corpus. The accuracy of the model using contextual and collocational information is about 96% when the training corpus itself is tested. Although the probabilistic model using contextual and collocational information remarkably improves the accuracy of syntactic disambiguation on the training corpus, its result on the test corpus is only slightly superior to the model with contextual information alone, which is somewhat disappointing. This is because the collocational information suffers from data sparseness due to a lack of training data. We can expect the improvement in accuracy to be more notable when the training corpus is larger and a more elaborate smoothing method is used.
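The experiment uses the back-off smoothing of [3]; as a rough illustration of the idea only, the sketch below interpolates the MLE for P(B_h | β, C_h) with a lower-order estimate that drops the modifiee head. The weights and the exact back-off chain here are our assumptions, not the paper's scheme.

    from collections import Counter

    F3 = Counter()   # F(B_h, beta, C_h), counted over the training corpus
    F2 = Counter()   # F(beta, C_h)
    Fb = Counter()   # F(B_h, beta)
    Fr = Counter()   # F(beta)

    def count(b_h, beta, c_h):
        F3[(b_h, beta, c_h)] += 1
        F2[(beta, c_h)] += 1
        Fb[(b_h, beta)] += 1
        Fr[beta] += 1

    def p_coll(b_h, beta, c_h, lam=0.8):
        backoff = Fb[(b_h, beta)] / Fr[beta] if Fr[beta] else 0.0
        if F2[(beta, c_h)] == 0:
            return backoff                   # unseen (beta, C_h) context
        mle = F3[(b_h, beta, c_h)] / F2[(beta, c_h)]
        return lam * mle + (1 - lam) * backoff

    count("exam", "jco", "pass")
    count("book", "jco", "read")
    print(p_coll("exam", "jco", "pass"))      # 0.8*1.0 + 0.2*0.5 = 0.9
    print(p_coll("exam", "jco", "graduate"))  # unseen context -> 0.5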

5. Conclusion

In this paper, we have presented a new parser for Korean using collocational information. As is well known, lexical items are among the most important sources of information for improving disambiguation accuracy in syntactic analysis. Owing to the restricted form of the phrase structure grammar, collocational information, which is a triple of (head, syntactic-relationship, head), can be automatically defined according to the type of each rule. Experiments showed that the disambiguation accuracy of the parser is improved with the help of collocational information. We also observe that the smoothing method is a critical factor in a parser that uses sparse data such as lexical information.

References

[1] E. Black et al. A procedure for quantitatively comparing the syntactic coverage of English grammars. In Proceedings of the Fourth DARPA Speech and Natural Language Workshop, pages 306-311, 1991.
[2] Eugene Charniak. Parsing with context-free grammar and word statistics. Technical Report CS-95-28, Dept. of Computer Science, Brown University, 1995.
[3] Michael John Collins. A new statistical parser based on bigram lexical dependencies. In Proceedings of ACL-96, pages 184-191, 1996.
[4] David Crystal. A Dictionary of Linguistics and Phonetics. Basil Blackwell, 1985.
[5] D. Hindle and M. Rooth. Structural ambiguity and lexical relations. Computational Linguistics, 19(1):103-120, 1993.
[6] Kong Joo Lee, Jae-Hoon Kim, and Gil Chang Kim. Probabilistic language model for analyzing Korean sentences. In Proceedings of the 17th International Conference on Computer Processing of Oriental Languages (ICCPOL'97), pages 392-395, 1997. http://csone.kaist.ac.kr/~kjlee/paper.html
[7] David M. Magerman. Statistical decision-tree models for parsing. In Proceedings of ACL-95, pages 276-283, 1995.