Grammar Induction Using Co-Training

Bjorn Nelson

Master of Science
School of Informatics
University of Edinburgh
2004

Abstract

In this thesis we implement a version of the Inside-Outside algorithm that is capable of being trained on partially bracketed corpora. Furthermore, we combine multiple instances of these learners, in a novel way, by means of co-training. Our hypothesis is that automatically labeled training data will boost the performance of the Inside-Outside algorithm. However, after performing experiments with the ATIS corpus, we find that performance actually decreases as a result of co-training. One potential cause of this, which we confirm experimentally, is that additional labeled data does not benefit the training algorithm. An additional result of the research performed in this thesis shows that the efficacy of a partially bracketed corpus remains the same while using only a fraction of the annotation. Hence, the burden of creating labeled training data for use with the Inside-Outside algorithm can be reduced without causing a decrease in performance.


Acknowledgements

I would like to thank my family, friends and my girlfriend for being supportive while I study overseas. Also, many thanks to my supervisor Miles Osborne for his proofreading and feedback. Finally, thanks to my fellow MSc students for demonstrating dedication and insight.


Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Bjorn Nelson)


Table of Contents

1 Introduction
  1.1 Grammar Induction
  1.2 Grammar Constraints
  1.3 Partially Bracketed Corpora
  1.4 Co-Training
  1.5 Co-Training applied to Grammar Induction
    1.5.1 Experimental Results

2 Background and Previous Work
  2.1 Grammar Induction
  2.2 Probabilistic Context-Free Grammars
    2.2.1 Parsing Algorithms
    2.2.2 Retrieving the best parse
  2.3 Inside probabilities
  2.4 Outside Probabilities
  2.5 The Inside-Outside Algorithm
  2.6 Supervised Grammar Induction
    2.6.1 Bracketed Training Data
    2.6.2 Modifications to the Inside-Outside Algorithm
    2.6.3 Experimental Results
  2.7 Co-training
    2.7.1 Initial co-training experiments
    2.7.2 Co-training applied to natural language processing
    2.7.3 Co-training versus EM
  2.8 Summary

3 Implementation and Experimental Environment
  3.1 Basic Inside-Outside algorithm implementation
  3.2 Bracketing support
  3.3 Viterbi-style parse retrieval
  3.4 Backoff
  3.5 Evaluation
  3.6 Co-training implementation
    3.6.1 Co-training evaluation
  3.7 Corpora
  3.8 Initial Grammars
    3.8.1 X-Bar Grammar
  3.9 Summary

4 Experimental Results
  4.1 Single-view Experiments
    4.1.1 Artificial Grammar
    4.1.2 Unconstrained English Grammar
    4.1.3 English X-Bar Grammar
  4.2 Co-training Experiments
  4.3 Ancillary Experiments
    4.3.1 Learner Agreement
    4.3.2 Effects of corpus size
    4.3.3 Different domains
    4.3.4 Effects of partial bracketing variation

5 Conclusion and Future Work

Bibliography

Chapter 1

Introduction

The term "grammar induction" (also known as "grammar learning" or "grammar training") refers to the process of learning grammars from empirical data. In other words, we are looking for the grammar with the highest probability given some amount of training data. While this process can be applied to natural or artificial grammars alike, it is most useful in domains where manually-created grammars perform poorly. For example, it is useful in natural language processing because manually-created grammars are ill-equipped to handle the complexity and dynamism of real-world text.

There are two high-level prerequisites of the grammar induction process: we must first decide what sort of grammar to induce, and secondly, we must locate or prepare data from which to learn. With regard to the first prerequisite, and for the purposes of this experiment, we will be learning Probabilistic Context-Free Grammars (PCFGs) (Booth, 1969). With regard to the latter point, the data can take many forms. It can simply be raw text, or it can be annotated in some linguistically useful manner. The level of information carried in the annotation depends on the time and effort expended by human annotators.

Like many other tasks in the field of natural language processing, grammar induction performs better given richly-labeled training data (i.e. supervised learning). However, such data is costly to create and may not be available in large quantity. And, while it is possible to operate in an unsupervised mode, the results are less than ideal, for reasons that will be explored shortly. To utilize both labeled and unlabeled data in a weakly supervised manner, this report applies co-training (Blum and Mitchell, 1998) to the process of grammar induction.

In the following introduction we first cover the general problem of grammar induction, followed by an overview of the co-training process. Next, we discuss how to combine grammar induction with co-training, and finally, we present the experimental results of that combination.

1.1 Grammar Induction

The general problem of grammar induction can be summarized as the search for a grammar that has the highest probability, given some training data. One way of inducing a grammar is to start with no preconceptions and attempt to construct rules based on the training data. Alternatively, the sum total of all potential rules can be generated, and the training data can be examined in an effort to refine the probabilities of the "correct" rules. The most straightforward way to perform the latter method is to examine the training data and explicitly count the frequency with which grammar rules have been applied. Mathematically, we can then represent the probability of a particular grammar rule in the following way:

$$P(\alpha \rightarrow \beta \mid \alpha) = \frac{\mathrm{Count}(\alpha \rightarrow \beta)}{\sum_{\gamma} \mathrm{Count}(\alpha \rightarrow \gamma)} = \frac{\mathrm{Count}(\alpha \rightarrow \beta)}{\mathrm{Count}(\alpha)}$$

Where α is a non-terminal within a grammar and β and γ represent strings of terminals and non-terminals. However, in order for this to work we must have labeled training data at our disposal (in other words, each constituent in the training set has to be annotated with a non-terminal). Unfortunately, as mentioned before, such data takes human effort to construct, and hence may not be available in large quantity.

So, how do we train a grammar in the absence of such information? The answer involves a method known as the Inside-Outside algorithm, which is a specific instance of a more general class of methods known as Expectation Maximisation (EM) algorithms. The Inside-Outside algorithm can be used to train grammars in the following way (Charniak, 1993):

- Generate all permutations of PCFG rules, given some set of terminals and non-terminals.

- Initialize the probabilities associated with the rules (the values can be random or explicitly defined).

- Perform the Inside-Outside algorithm on unlabeled training data (raw strings).

- Prune off the rules whose probability falls below some predefined threshold. Thus, the "correct" rules will remain.

Unfortunately, this procedure has a crucial drawback: the application of the Inside-Outside algorithm to raw text is plagued with local maxima. In other words, the solution space contains a number of points upon which the algorithm may converge. This makes it more difficult to find the correct solution (i.e. a pruned grammar that has the highest bracketed precision and recall, given some test set). Charniak (1993) found that after 300 random trials of the aforementioned procedure, only one converged on a globally maximal solution.

1.2 Grammar Constraints

One way to counteract the problem of local maxima is to constrain the initial grammar in some manner. For example, we can stipulate that the rules must adhere to the so-called X-bar syntax. Essentially, this states that the left-hand side of a rule dictates what can appear on the right-hand side. More specifically, rules take the form of:

XP → (specifier), X-bar   (in any order)
X-bar → X, (complement)   (in any order)

For instance, a noun phrase must have a noun somewhere in its right-hand side. Hence, the utilization of such constraints interjects domain-specific knowledge into the grammar induction process, moving it away from a general algorithm. However, as long as this does not pose significant overhead, it is a reasonable tactic for improving the training of natural-language grammars.


1.3 Partially Bracketed Corpora

Another method of dealing with the problem of local maxima is the utilization of training data that provides more structural information than raw strings. To this end, Pereira and Schabes (1992) extended the Inside-Outside algorithm to handle partially bracketed corpora. This type of training data consists of sentences annotated with delimiters (parentheses) that define the range of sentence constituents. For example:

(book (the flight) to Atlanta)

As seen here, a noun phrase has been identified with brackets. The rest of the sentence, however, is un-annotated. So, the level of bracketing can vary, and depends on the effort expended by annotators. This variable level of structure leads to the term "partially bracketed".

When used in conjunction with the Inside-Outside algorithm, this data constrains the application of certain grammar rules. For instance, we know that a constituent cannot span the first two words of the sentence ("book the"), because such a constituent would overlap the bracketing of "the flight". Therefore, the grammar rule probabilities will be guided toward optimal values. As a result, grammar performance (measured in bracketing precision and recall) improves.

However, there is a cost associated with this performance gain. Partially bracketed data is less expensive to produce than fully bracketed and labeled data, but it still isn't free. Ideally, we'd like to operate in an unsupervised fashion, using unlabeled data whenever possible.

1.4 Co-Training

In order to make use of both labeled and unlabeled data during the training process, we can implement a weakly supervised process known as co-training (Blum and Mitchell, 1998). This process allows multiple learners to be used in conjunction with each other, each viewing the training data from a different perspective. Specifically, one learner is trained on a small set of labeled data, while examining certain features of that data. Next, a second learner is trained on the same data, by means of different features. Once the initial training is completed, the two learners each label an amount of raw data. Next, the examples in which they are most confident are added to the other learner's labeled set. Finally, this process is repeated, as long as there is unlabeled data from which to create labeled examples (or as long as performance increases).

Web page classification was the initial problem domain to which co-training was applied (Blum and Mitchell, 1998). Specifically, the system sought to use a Naive Bayes classifier to classify web pages with one of several labels, such as "faculty member page". In order to implement this, two learners were used, each concentrating on a different feature of the web page. Namely, those features were:

- Text within the web page itself. For instance, if the phrase "research interests" is seen within a web page, it might be indicative of a faculty homepage.

- Anchor text within links from other web pages to the web page in question. For instance, the link text "My Advisor" could be indicative of a faculty homepage.

As a result of co-training, the performance of both learners increased.

1.5 Co-Training applied to Grammar Induction

In order for co-training to be effective, the machine learning problem at hand should have multiple views from which to be examined. With regard to grammar induction, this type of variation amongst learners need not originate from the training data itself. Variation between the initial grammars can provide a basis for co-training.

In this project, I implement the bracketing-aware Inside-Outside algorithm introduced by Pereira and Schabes (1992) as a baseline for co-training experiments (using the Inside-Outside implementation from Johnson (2000) as a foundation). Next, multiple grammar learners are combined in a novel way, by means of co-training. Each learner is trained according to a different initial grammar, and provides labeled sentences to the other learner for retraining. The hypothesis is that the use of automatically labeled data will improve performance in a manner similar to the use of manually labeled training data.


The experiments performed for this report are organized into the following groups:

- Develop a baseline grammar induction system using raw strings of part-of-speech (POS) tags as training data, as opposed to strings of words.

- Augment the baseline system so that it can deal with partially bracketed strings of POS tags.

- Combine multiple instances of the bracketed learners by means of co-training.

- Perform ancillary experiments related to learner agreement and the effects of variations in partial bracketing on grammar performance.

1.5.1 Experimental Results

While the preliminary experiments behave as predicted, the results of the main co-training experiments actually decrease the performance of the Inside-Outside algorithm, as shown in the figure below.

[Figure: F(b=1) score versus training set size (105 to 135 sentences) during co-training; performance decreases as automatically labeled sentences are added.]

The potential causes of this behaviour are as follows:

- The learners involved are not making sufficiently uncorrelated errors.

- The pool of unlabeled data is too small.

- The process of selecting confident examples is not optimal. This could be due to a confidence measure that results in the selection of uninformative examples.

- The Inside-Outside algorithm is still suffering from overfitting, and is not benefitting from extra labeled data.

After performing additional experiments, we found that the learners are making relatively uncorrelated errors, but that the extra labeled training data is not beneficial. The following graph indicates that even 20% of the already small training corpus is sufficient to reach the level of performance that results from using 100% of the training set.

[Figure: F(b=1) score versus corpus size (number of training sentences, 0 to 500); performance levels off well before the full corpus is used.]

In terms of future work, our co-training results could be reinforced by applying the Inside-Outside algorithm and co-training to another domain and another corpus. However, the results of this thesis alone indicate that advancements in fundamental grammar induction techniques would be more fruitful than providing the Inside-Outside algorithm with automatically-labeled training data.

The remainder of this thesis is organized as follows:

- Chapter 2 provides background information about grammar induction and co-training, and an overview of the current state of related research.

- Chapter 3 outlines the implementation of the Inside-Outside algorithm used for this project, and the extensions related to bracketing support and co-training. Furthermore, we describe the experimental environment related to this project, such as the choice of corpora.

- Chapter 4 presents the results of the various experiments which were conducted for this project.

- Chapter 5 provides conclusions about the efficacy of co-training and grammar induction, and discusses future directions on which to focus research efforts.

Chapter 2

Background and Previous Work

2.1 Grammar Induction

The problem of grammar induction can be summarized as: find a grammar (G) with the highest conditional probability, given some training data (D). Using Bayes' theorem, we can define this quantity as:

$$P(G \mid D) = \frac{P(D \mid G)\,P(G)}{P(D)}$$

The terms P(D) and P(G) represent our prior expectations about the data and the grammar. P(D) will stay constant, whereas P(G) is more difficult to compute, because it encapsulates the relative linguistic merit of a particular grammar. For simplicity, we can assume a uniform prior across all models and rewrite P(G | D) as:

$$P(G \mid D) \propto P(D \mid G)$$

Furthermore, we can define our training data (D) as a list of sentences (D = s_1, ..., s_n) and rewrite the term P(D | G) as follows (assuming that the probability of each sentence does not depend on any other sentence in the training set):

$$P(D \mid G) = P(s_1, \ldots, s_n \mid G) = \prod_{i=1}^{n} P(s_i \mid G)$$

At this point, the important task is to calculate the probability of each sentence in the training set with respect to the grammar. So, how do we do this? First, each sentence can be defined as a sequence of terminal symbols (in our case, POS tags), like so:

$$s_i = w_{1,n}$$

Where w_i is a terminal symbol and n is the length of the sentence. Also, since each sentence can have multiple parses (t_{1,n}), we define the overall sentence probability as the sum of the probabilities associated with each parse:

$$P(s_i \mid G) = P(w_{1,n}) = \sum_{t_{1,n}} P(w_{1,n}, t_{1,n}) = \sum_{t_{1,n}} P(t_{1,n})$$

In order to better understand how to calculate this quantity given raw training data, we first need to examine the fundamentals of Probabilistic Context-Free Grammars.

2.2 Probabilistic Context-Free Grammars

A Probabilistic Context-Free Grammar (PCFG) is defined as containing the following quantities (Manning and Schütze, 1999):

- A set of terminals, w^k, k = 1, ..., V

- A set of non-terminals, N^i, i = 1, ..., n

- A start symbol, N^1

- A set of rules, N^i → ζ^j, where ζ^j is a sequence of terminals and non-terminals

- A corresponding set of probabilities on rules, such that for all i: Σ_j P(N^i → ζ^j) = 1
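As a concrete illustration, here is a minimal sketch in Python (not the ANSI C implementation this project actually builds on, which is described in Chapter 3) of a PCFG stored as a mapping from rules to probabilities. The rules are those of the example grammar that appears in Figure 2.1 below; the final loop checks the well-formedness condition that the probabilities for each left-hand side sum to one.

# A minimal sketch of a PCFG as a rule-to-probability mapping, with a check
# that the probabilities attached to each left-hand side sum to one.
from collections import defaultdict

# Rules are (lhs, rhs) pairs; rhs is a tuple of terminals and non-terminals.
pcfg = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.6, ("NP", ("Det", "N", "PP")): 0.4,
    ("VP", ("V", "NP")): 0.7, ("VP", ("V", "NP", "PP")): 0.3,
    ("PP", ("Prep", "NP")): 1.0,
    ("Det", ("the",)): 0.5, ("Det", ("a",)): 0.5,
    ("N", ("man",)): 0.4, ("N", ("frog",)): 0.4, ("N", ("telescope",)): 0.2,
    ("V", ("saw",)): 1.0, ("Prep", ("with",)): 1.0,
}

totals = defaultdict(float)
for (lhs, rhs), p in pcfg.items():
    totals[lhs] += p

for lhs, total in totals.items():
    assert abs(total - 1.0) < 1e-9, lhs   # sum_j P(N^i -> zeta^j) = 1
print("all rule distributions sum to one")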

Also, PCFGs have a number of advantages over normal Context-Free Grammars (CFGs). Of paramount importance is the fact that CFGs have no mechanism for dealing with parse ambiguity (multiple analyses of a sentence). To illustrate this problem, examine the following sentence:

The man saw the frog with a telescope.

Due to the prepositional phrase "with a telescope", this sentence has two possible meanings. It could refer to the process of viewing a frog by means of a telescope (in which case the prepositional phrase is associated with the verb phrase), or it could refer to the process of viewing a frog who is in possession of a telescope (wherein the prepositional phrase would be associated with the noun phrase). Our intuition tells us that the former reading is more plausible. However, how do we represent this formally? First, let's examine a normal CFG capable of parsing the aforementioned sentence:

S → NP VP            Det → the
NP → Det N           N → man
NP → Det N PP        N → frog
PP → Prep NP         Det → a
VP → V NP            N → telescope
VP → V NP PP         Prep → with
                     V → saw

Given the preceding grammar, we arrive at the following parse trees for our example sentence. Here is the example sentence with the PP attached to the verb phrase:

(S (NP (Det the) (N man))
   (VP (V saw)
       (NP (Det the) (N frog))
       (PP (Prep with) (NP (Det a) (N telescope)))))

And here is the example sentence with the PP attached to the noun phrase:

(S (NP (Det the) (N man))
   (VP (V saw)
       (NP (Det the) (N frog)
           (PP (Prep with) (NP (Det a) (N telescope))))))

At this point, there is no way for the system to favor one interpretation over the other. Both are considered equally plausible. What we would like to do is assign some sort of weight to a parse that is linguistically or semantically valid. One way of accomplishing this is to integrate our grammar within a probabilistic framework.

In order to convert a CFG into a PCFG, probabilities are assigned to each rule. In a real-world setting, these probabilities are computed by means of grammar induction (as outlined later on). However, for demonstration purposes, we can assign any probabilities we like, as long as they are properly distributed. Figure 2.1 shows the probabilistic equivalent of our example grammar (with artificially assigned probabilities).

1.0  S → NP VP            0.5  Det → the
0.6  NP → Det N           0.4  N → man
0.4  NP → Det N PP        0.4  N → frog
1.0  PP → Prep NP         0.5  Det → a
0.7  VP → V NP            0.2  N → telescope
0.3  VP → V NP PP         1.0  Prep → with
                          1.0  V → saw

Figure 2.1: Probabilistic Context-Free Grammar (PCFG)

Now that we've seen what a PCFG looks like, in the next section we'll see how to apply it to free text.


2.2.1 Parsing Algorithms

Training data can be parsed according to a PCFG in a bottom-up or top-down fashion (or by means of augmented versions of those methods). Bottom-up refers to the process of first describing lexical items with non-terminals, and then incrementally annotating the sentence until a final symbol is reached. Alternatively, top-down parsing seeks to generate the observed sequence of words by invoking various grammar rules, beginning with the sentence-level rules. However, both methods have difficulty representing ambiguity. Furthermore, they are inefficient because they redundantly construct subtrees during the backtracking process (Jurafsky and Martin, 2000).

To alleviate these problems, we can make use of a technique known as dynamic programming. Dynamic programming uses a data structure called a chart in order to keep track of intermediate solutions. Hence, this solves the issue of ambiguity, since all parses can be retrieved from the chart. Furthermore, it solves the problem of redundant subtree parsing, since the system need only make reference to a chart cell in order to retrieve the solution to a subproblem (as opposed to re-computing the value). This algorithm can be applied to non-probabilistic CFGs and PCFGs alike. When applied to the latter, the chart entries consist of inside probabilities for various sentence ranges. Finally, there are two variants of this dynamic programming strategy that implement top-down and bottom-up parsing respectively:

- Earley algorithm (implements top-down parsing)

- Cocke-Younger-Kasami (CYK) algorithm (implements bottom-up parsing)

For the purposes of this report, we'll examine the probabilistic CYK algorithm as applied to grammars in Chomsky Normal Form (CNF). CNF requires that the grammar consist only of binary rules (the length of the right-hand side is 2) and unary rules (the length of the right-hand side is 1). Furthermore, binary rules must expand to two non-terminals and unary rules must expand to a terminal symbol (no null expansions are allowed). This constraint is not unreasonable, since there is a weakly equivalent CNF grammar for any CFG that does not produce the empty string. Figure 2.2 shows pseudo-code for the probabilistic CYK algorithm.


# Insert lexical items
for i = 1 to num_words
    for A = 1 to num_nonterminals
        if A -> w_i is in the grammar then
            C[i,i,A] = P(A -> w_i)

# Recursive case: increasingly longer spans of words i..j
for len = 2 to num_words
    for i = 1 to num_words - len + 1
        j = i + len - 1
        for k = i to j - 1                      # split point between the children
            for A = 1 to num_nonterminals
                for B = 1 to num_nonterminals
                    for C = 1 to num_nonterminals
                        C[i,j,A] += C[i,k,B] * C[k+1,j,C] * P(A -> B C)

Figure 2.2: Pseudo-code for calculating the inside probability of a string.


In plain English, the algorithm performs the following steps:

- For every word in the sentence (w_i), find unary rules (A → w_i) that account for the word. Then, place the probability of those rules (P(A → w_i)) in the chart cell (i,i). When this process is completed, the unary rule probabilities fill a diagonal of the chart beginning in the upper-left corner and ending in the lower-right, as seen below.

          the        man       saw       the        frog
  the   Det: 0.5
  man              N: 0.4
  saw                        V: 1.0
  the                                  Det: 0.5
  frog                                            N: 0.4

- Examine increasingly longer ranges within the sequence and attempt to annotate them with binary rules. In other words, process chart diagonals until the upper-right chart cell is filled. Ambiguity is represented by multiple entries within the same chart cell.

          the        man       saw       the        frog
  the   Det: 0.5   NP: 0.12                       S: 0.01008
  man              N: 0.4
  saw                        V: 1.0               VP: 0.084
  the                                  Det: 0.5   NP: 0.12
  frog                                            N: 0.4

As seen above, each chart cell corresponds to a range within the sentence. The terminals are represented on the outermost diagonal and the sentence-level symbol is stored in the upper-right corner of the chart. The computation of each cell follows the pseudo-code implementation. For instance, to calculate the probability of the first noun phrase, the following formula was used:

I(1,2,NP) = I(1,1,Det) × I(2,2,N) × P(NP → Det N) = 0.5 × 0.4 × 0.6 = 0.12
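To make this concrete, the following is a small illustrative sketch in Python (not the ANSI C implementation used in this project) of the probabilistic CYK inside computation. It covers only the unary and binary rules of Figure 2.1, since the ternary rules would first need to be binarized into CNF; the chart is indexed by 1-based (start, end) word positions as in the worked example above.

# A minimal sketch of the inside computation of Figure 2.2, run on the
# binary portion of the toy grammar in Figure 2.1.
from collections import defaultdict

# Unary (lexical) rules: P(A -> w)
lexical = {
    ("Det", "the"): 0.5, ("Det", "a"): 0.5,
    ("N", "man"): 0.4, ("N", "frog"): 0.4, ("N", "telescope"): 0.2,
    ("V", "saw"): 1.0, ("Prep", "with"): 1.0,
}
# Binary rules: P(A -> B C).  The ternary rules of Figure 2.1 are omitted.
binary = {
    ("S", "NP", "VP"): 1.0,
    ("NP", "Det", "N"): 0.6,
    ("VP", "V", "NP"): 0.7,
    ("PP", "Prep", "NP"): 1.0,
}

def inside_chart(words):
    """Fill a chart of inside probabilities I(i, j, A) for 1-based spans."""
    n = len(words)
    chart = defaultdict(float)                  # (i, j, A) -> probability
    for i, w in enumerate(words, start=1):
        for (A, word), p in lexical.items():
            if word == w:
                chart[(i, i, A)] += p
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            for k in range(i, j):               # split: (i..k) and (k+1..j)
                for (A, B, C), p in binary.items():
                    left, right = chart[(i, k, B)], chart[(k + 1, j, C)]
                    if left and right:
                        chart[(i, j, A)] += left * right * p
    return chart

chart = inside_chart("the man saw the frog".split())
print(chart[(1, 2, "NP")])   # 0.12, matching the worked example
print(chart[(1, 5, "S")])    # 0.01008, the sentence probability in the chart

Running this reproduces the chart values above: I(1,2,NP) = 0.12 and I(1,5,S) = 0.01008.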

Given the preceding method of calculating inside probabilities, we can arrive at one or more overall sentence probabilities, depending on parse ambiguity. However, which parse is associated with the highest probability? We will address the issue of parse retrieval in the next section.

2.2.2 Retrieving the best parse

Recovering the actual parse from the chart of inside probabilities requires a modification of the original Inside algorithm. Namely, rather than storing the sum of all probabilities associated with a particular non-terminal, we instead record only the probability associated with the most likely rule. Figure 2.3 shows the modifications to the CYK pseudo-code.

# Insert lexical items
for i = 1 to num_words
    for A = 1 to num_nonterminals
        if A -> w_i is in the grammar then
            C[i,i,A] = P(A -> w_i)

# Recursive case: keep only the most probable analysis of each span
for len = 2 to num_words
    for i = 1 to num_words - len + 1
        j = i + len - 1
        for k = i to j - 1
            for A = 1 to num_nonterminals
                for B = 1 to num_nonterminals
                    for C = 1 to num_nonterminals
                        prob = C[i,k,B] * C[k+1,j,C] * P(A -> B C)
                        if prob > C[i,j,A]
                            C[i,j,A] = prob
                            B[i,j,A] = (k, B, C)

Figure 2.3: Pseudo-code for storing the Viterbi-style, most likely parse, where C is a chart of rule probabilities and B is a chart of back pointers.

For those familiar with Hidden Markov Models (HMMs), this process resembles the Viterbi algorithm for finding the most likely state sequence generated by the model. In the case of PCFGs, we augment the dynamic programming method as follows:

- In each chart cell, store the rule with the maximum probability.

- Store pointers from the rule's parent cell to the cells associated with the rule's right-hand side. In the case of a binary rule, two pointers would be stored.

Once this algorithm has been performed, the most likely parse can be reconstructed by following pointers from the start symbol to the leaf nodes.
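As an illustration, here is a small Python sketch (again using the restricted toy grammar, not this project's actual implementation) of the Viterbi-style variant in Figure 2.3: each cell keeps only its most probable analysis together with a back pointer, and the best parse is rebuilt by following the pointers down from the start symbol.

# A minimal sketch of Viterbi-style parse retrieval with back pointers.
lexical = {("Det", "the"): 0.5, ("N", "man"): 0.4, ("N", "frog"): 0.4,
           ("V", "saw"): 1.0}
binary = {("S", "NP", "VP"): 1.0, ("NP", "Det", "N"): 0.6,
          ("VP", "V", "NP"): 0.7}

def viterbi_parse(words, start="S"):
    n = len(words)
    best, back = {}, {}                     # (i, j, A) -> prob / (k, B, C)
    for i, w in enumerate(words, start=1):
        for (A, word), p in lexical.items():
            if word == w and p > best.get((i, i, A), 0.0):
                best[(i, i, A)] = p
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            for k in range(i, j):
                for (A, B, C), p in binary.items():
                    prob = best.get((i, k, B), 0.0) * best.get((k + 1, j, C), 0.0) * p
                    if prob > best.get((i, j, A), 0.0):
                        best[(i, j, A)] = prob
                        back[(i, j, A)] = (k, B, C)
    def build(i, j, A):
        if (i, j, A) not in back:           # leaf cell: unary rule over one word
            return (A, words[i - 1])
        k, B, C = back[(i, j, A)]
        return (A, build(i, k, B), build(k + 1, j, C))
    return best.get((1, n, start), 0.0), build(1, n, start)

prob, tree = viterbi_parse("the man saw the frog".split())
print(prob)   # 0.01008 -- this sentence has a single parse under the toy rules
print(tree)   # ('S', ('NP', ('Det', 'the'), ('N', 'man')), ('VP', ...))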

2.3 Inside probabilities

In order to facilitate both parse selection and grammar induction, we need to compute the overall probability of a sentence. Previously, we defined this quantity as the sum of the probabilities associated with every possible parse of a sentence:

$$P(s_i \mid G) = \sum_{t_{1,n}} P(t_{1,n})$$

Now, we will show how to estimate this value by means of the Inside algorithm. Another way of thinking about sentence probability is that we are seeking the probability that a particular non-terminal (such as the start symbol S) generates an observed sequence of words (w_s ... w_t). This is known as the inside probability, I(s,t,A), and it is recursively defined as follows (Lari and Young, 1990).

The base case is where we deal with lexical items. In other words, we are looking for a unary rule to account for a single word. Hence, the starting position (s) of the sentence range will be equal to the ending position (t), and the value of I(s,s,A) is defined as:

$$I(s,s,A) = \begin{cases} p(A \rightarrow w_s) & \text{if } A \rightarrow w_s \in G \\ 0 & \text{otherwise} \end{cases}$$

In the recursive case, s < t. Hence, we are seeking binary rules (A → B C) that account for the sentence range (s,t):

$$I(s,t,A) = \sum_{(A \rightarrow B\,C) \in G} \; \sum_{r=s}^{t-1} p(A \rightarrow B\,C) \cdot I(s,r,B) \cdot I(r+1,t,C)$$

Where s is the beginning index within a given sequence of words, t is the ending index, and A, B and C are non-terminals within the grammar. Since, as mentioned before, "S" is commonly used to denote the sentence-level non-terminal, the quantity I(0,n,S) will provide us with the overall sentence probability (where n is the length of the sentence minus one). As an example, let's envision the previously mentioned sentence as a sequence of words, ordered by a numeric index like so:

words:  the  man  saw  the  frog  with  a  telescope
index:   0    1    2    3    4     5    6      7

Here, I(2,4,VP) is the probability that "saw the frog" can be annotated with a verb phrase, and I(0,7,S) is the probability that the entire sentence can be described by "S". Figure 2.4 illustrates the sentence range dominated by the non-terminal A; the range for which we are attempting to find an associated probability.

[Figure 2.4: The range of the inside probability. The non-terminal A dominates the words w_s ... w_t beneath the start symbol S.]
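The recursive definition can be transcribed almost directly into code. The following short Python sketch (using the restricted binary toy grammar from the earlier examples, not this project's implementation, and 0-based indices as in this section) memoizes I(s,t,A); it reproduces, for example, I(2,4,VP) = 0.084 for "saw the frog". The full eight-word sentence would additionally require binarized PP-attachment rules.

# A short sketch that mirrors the recursive definition of I(s, t, A).
from functools import lru_cache

lexical = {("Det", "the"): 0.5, ("Det", "a"): 0.5, ("N", "man"): 0.4,
           ("N", "frog"): 0.4, ("N", "telescope"): 0.2,
           ("V", "saw"): 1.0, ("Prep", "with"): 1.0}
binary = {("S", "NP", "VP"): 1.0, ("NP", "Det", "N"): 0.6,
          ("VP", "V", "NP"): 0.7, ("PP", "Prep", "NP"): 1.0}

words = "the man saw the frog with a telescope".split()

@lru_cache(maxsize=None)
def I(s, t, A):
    if s == t:                                  # base case: a single word
        return lexical.get((A, words[s]), 0.0)
    total = 0.0                                 # recursive case: s < t
    for (lhs, B, C), p in binary.items():
        if lhs != A:
            continue
        for r in range(s, t):                   # split between r and r + 1
            total += p * I(s, r, B) * I(r + 1, t, C)
    return total

print(I(2, 4, "VP"))    # 0.084: "saw the frog" as a verb phrase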


2.4 Outside Probabilities

As we will see in the next section, two sets of probabilities are needed in order to train a PCFG. We've covered the first, which is the inside probability of a string. The second type is the outside probability (Figure 2.5), which gives an alternative way of calculating the probability of a string, given a grammar. Formally, we can rewrite the quantity P(w_{1...n} | G) as:

$$P(w_{1 \ldots n} \mid G) = \sum_j O(k,k,N^j) \cdot P(N^j \rightarrow w_k)$$

for any word position k, where O(s,t,A) is the outside probability of the sentence range (s,t). This quantity can be defined as:

$$O(0,n,A) = \begin{cases} 1 & \text{if } A = S \\ 0 & \text{otherwise} \end{cases}$$

$$O(s,t,A) = \sum_{(C \rightarrow B\,A) \in G} \; \sum_{r < s} O(r,t,C) \cdot p(C \rightarrow B\,A) \cdot I(r,s-1,B) \;+\; \sum_{(C \rightarrow A\,B) \in G} \; \sum_{r > t} O(s,r,C) \cdot p(C \rightarrow A\,B) \cdot I(t+1,r,B)$$

[Figure 2.5: The range of the outside probability. The outside probability of the constituent A spanning w_s ... w_t covers the words w_1 ... w_{s-1} and w_{t+1} ... w_n outside that constituent, beneath the start symbol S.]
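To ground the recursion, here is a self-contained Python sketch (restricted toy grammar, not this project's implementation; 1-based (start, end) spans as in the chart examples of Section 2.2.1 rather than the 0-based indices above) that fills the outside chart from the inside chart and then checks the identity above: at every word position, summing O(k,k,A) · P(A → w_k) recovers the sentence probability 0.01008.

# A minimal sketch of the outside computation, built on the inside chart.
from collections import defaultdict

lexical = {("Det", "the"): 0.5, ("N", "man"): 0.4, ("N", "frog"): 0.4,
           ("V", "saw"): 1.0}
binary = {("S", "NP", "VP"): 1.0, ("NP", "Det", "N"): 0.6,
          ("VP", "V", "NP"): 0.7}

def inside_chart(words):
    n = len(words)
    I = defaultdict(float)
    for i, w in enumerate(words, start=1):
        for (A, word), p in lexical.items():
            if word == w:
                I[(i, i, A)] += p
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            for k in range(i, j):
                for (A, B, C), p in binary.items():
                    I[(i, j, A)] += I[(i, k, B)] * I[(k + 1, j, C)] * p
    return I

def outside_chart(words, I, start="S"):
    n = len(words)
    O = defaultdict(float)
    O[(1, n, start)] = 1.0                   # the whole span under the start symbol
    # Work from the longest spans down to single words.
    for length in range(n, 0, -1):
        for s in range(1, n - length + 2):
            t = s + length - 1
            for (C, B, A), p in binary.items():
                # A is the right child: left sibling B spans (r, s-1).
                for r in range(1, s):
                    O[(s, t, A)] += O[(r, t, C)] * p * I[(r, s - 1, B)]
            for (C, A, B), p in binary.items():
                # A is the left child: right sibling B spans (t+1, r).
                for r in range(t + 1, n + 1):
                    O[(s, t, A)] += O[(s, r, C)] * p * I[(t + 1, r, B)]
    return O

words = "the man saw the frog".split()
I = inside_chart(words)
O = outside_chart(words, I)
# Sanity check: at any word position k, summing O(k,k,A) * P(A -> w_k)
# recovers the overall sentence probability.
for k, w in enumerate(words, start=1):
    total = sum(O[(k, k, A)] * p for (A, word), p in lexical.items() if word == w)
    print(k, round(total, 6))                # 0.01008 at every position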

2.5 The Inside-Outside Algorithm

Training a PCFG (inducing the grammar) involves examining data (sentences) and counting the number of times various grammar rules are utilized. A rule that is used more often will have a higher probability. Specifically, the probability of a given grammar rule is defined as follows:

$$P(\alpha \rightarrow \beta \mid \alpha) = \frac{\mathrm{Count}(\alpha \rightarrow \beta)}{\sum_{\gamma} \mathrm{Count}(\alpha \rightarrow \gamma)} = \frac{\mathrm{Count}(\alpha \rightarrow \beta)}{\mathrm{Count}(\alpha)}$$
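As an illustration of this direct method, the following Python sketch (with a two-sentence toy treebank invented purely for the example) counts rule applications in parsed data and prints the resulting relative-frequency estimates.

# A minimal sketch of count-based rule probability estimation from trees.
from collections import Counter

# Each tree is written as a nested tuple: (label, child, child, ...),
# with plain strings as terminals.
treebank = [
    ("S", ("NP", ("Det", "the"), ("N", "man")),
          ("VP", ("V", "saw"), ("NP", ("Det", "the"), ("N", "frog")))),
    ("S", ("NP", ("Det", "a"), ("N", "frog")),
          ("VP", ("V", "saw"), ("NP", ("Det", "the"), ("N", "telescope")))),
]

rule_counts, lhs_counts = Counter(), Counter()

def count_rules(node):
    if isinstance(node, str):                  # terminal symbol
        return
    label, children = node[0], node[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_counts[(label, rhs)] += 1
    lhs_counts[label] += 1
    for child in children:
        count_rules(child)

for tree in treebank:
    count_rules(tree)

for (lhs, rhs), c in sorted(rule_counts.items()):
    print(f"P({lhs} -> {' '.join(rhs)} | {lhs}) = {c}/{lhs_counts[lhs]}")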

However, it's not always possible to use such a direct method of training grammars. Sometimes parsed corpora are not available and we must learn from raw strings (or partially bracketed data). In this case we can utilize a method known as the Inside-Outside algorithm, a member of a class of methods known as Expectation Maximisation (EM) algorithms. An EM algorithm is essentially a two-step process:

- Using the current parameters of a given model, compute the expected results based on the training data (E step).

- Update the parameters of the model (M step).

These two steps are repeated until some set number of iterations has been performed, or until an evaluation measure has not substantively changed. The variant of EM for Hidden Markov Models is known as the Baum-Welch algorithm, which attempts to maximize the likelihood that a particular sequence of states generates the training data (using forward and backward probabilities). The PCFG variant of EM operates in a similar fashion, using inside and outside probabilities instead of forward and backward probabilities. To accomplish this, we can make use of the previous definitions of the inside and outside probabilities in order to create expectations of the rule counts. Specifically, we define the following functions for rule and category count estimation (Lari and Young, 1990):

$$C_w(A) := \frac{1}{P} \sum_{s=1}^{n} \sum_{t=s}^{n} I(s,t,A) \cdot O(s,t,A)$$

This defines how many times the non-terminal A has been utilized.



$$C_w(A \rightarrow \alpha) := \frac{1}{P} \sum_{\substack{t=1 \\ w_t = \alpha}}^{n} I(t,t,A) \cdot O(t,t,A)$$

This defines how many times a unary rule (A → α) has been used.

$$C_w(A \rightarrow B\,C) := \frac{1}{P} \sum_{s=1}^{n-1} \sum_{t=s+1}^{n} \sum_{r=s}^{t-1} p(A \rightarrow B\,C) \cdot I(s,r,B) \cdot I(r+1,t,C) \cdot O(s,t,A)$$

This defines how many times a binary rule (A → B C) has been used.

Using these definitions, we can construct pseudo-code for the Inside-Outside algorithm, as shown in Figure 2.6.

# Perform Inside-Outside algorithm
while the cross entropy of the model hasn't changed much
    for i = 1 to num_training_sentences
        Compute the inside probabilities of sentence i
        Compute the outside probabilities of sentence i
        Accumulate the expected rule counts for sentence i
    Update the probabilities of the grammar rules

Figure 2.6: Pseudo-code for the Inside-Outside algorithm

This algorithm effectively computes a solution to the original grammar induction problem we defined, that is:

$$\arg\max_G P(D \mid G)$$

While this solves the problem of learning a grammar from unannotated text, we are still plagued with a large solution space. In the next section we'll explore how supervised training can guide the process toward a more globally optimal solution.

2.6 Supervised Grammar Induction

In the last section we saw how to train grammars using the Inside-Outside algorithm. However, implementing this process using raw strings often results in poorly performing grammars (Lari and Young, 1990). To address this issue, Pereira and Schabes (1992) augmented the Inside-Outside algorithm so that it could make use of bracketed training data.


2.6.1 Bracketed Training Data

Constituent bracketing is one type of annotation from which the Inside-Outside algorithm can learn. Data that is prepared in such a manner consists of sentences annotated with delimiters (parentheses) that define the range of sentence constituents. For example:

(book (the flight) to Atlanta)

As seen here, a noun phrase has been identified with brackets. However, we don't have specific knowledge that this sentence range is associated with the named constituent "NP". We simply know that it is dominated by a constituent. This is known as unlabeled bracketing (not to be confused with labeled vs. unlabeled training data: labeled and unlabeled brackets are both types of labeled training data, whereas unlabeled training data consists of the raw strings mentioned previously). The alternative is labeled bracketing, which is more informative, and hence more costly to create. Such bracketing takes the form:

((VB book) ((DT the) (NN flight)) (TO to) (NNP Atlanta))

Another aspect of this type of training data is that the bracketing can vary anywhere from nothing (raw strings) to fully descriptive (full parse trees). The more brackets there are, the easier it is for the algorithm to learn the corresponding grammar. Pereira and Schabes (1992) define a partially bracketed corpus as follows:

    Informally, a partially bracketed corpus is a set of sentences annotated with parentheses marking constituent boundaries that any analysis of the corpus should respect. More precisely, we start from a corpus C consisting of bracketed strings, which are pairs c = (w, β) where w is a string and β is a bracketing of w. For convenience, we will define the length of the bracketed string c by |c| = |w|.
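For illustration, here is a small Python sketch (a hypothetical helper, not part of this project's toolchain) that converts a partially bracketed sentence into the pair c = (w, β) of the definition above: the word string and the list of bracket spans, using 0-based word positions.

# A minimal sketch of reading a partially bracketed string into (w, beta).
def read_bracketed(text):
    words, spans, stack = [], [], []
    for token in text.replace("(", " ( ").replace(")", " ) ").split():
        if token == "(":
            stack.append(len(words))          # the span starts at the next word
        elif token == ")":
            spans.append((stack.pop(), len(words) - 1))
        else:
            words.append(token)
    return words, spans

w, beta = read_bracketed("(book (the flight) to Atlanta)")
print(w)      # ['book', 'the', 'flight', 'to', 'Atlanta']
print(beta)   # [(1, 2), (0, 4)]: "the flight" and the whole sentence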

2.6.2 Modifications to the Inside-Outside Algorithm

The first step in modifying the Inside-Outside algorithm is the following: define a function that indicates whether or not a sentence range infracts the bracketing of a given training example. This function is defined like so:

$$b(k,l) = \begin{cases} 1 & \text{if } N_{k,l} \text{ is consistent with the sentence bracketing} \\ 0 & \text{otherwise} \end{cases}$$
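A direct way to realise this function in code is to test whether a candidate span crosses any annotated bracket, i.e. overlaps a bracket without either containing it or being contained by it. The following Python sketch (a hypothetical helper, not this project's implementation) does exactly that, using 0-based word positions.

# A minimal sketch of the consistency function b(k, l).
def b(k, l, brackets):
    """Return 1 if the span (k, l) respects every bracket, else 0."""
    for (i, j) in brackets:
        overlaps = k <= j and i <= l
        nested = (i <= k and l <= j) or (k <= i and j <= l)
        if overlaps and not nested:
            return 0
    return 1

# "(book (the flight) to Atlanta)": the bracket (1, 2) covers "the flight".
brackets = [(0, 4), (1, 2)]
print(b(1, 2, brackets))   # 1: exactly the annotated noun phrase
print(b(0, 1, brackets))   # 0: "book the" crosses "(the flight)", as in Section 1.3
print(b(3, 4, brackets))   # 1: "to Atlanta" is compatible with the bracketing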

The second step is the modification of the binary rule count formula. We need only be concerned with binary rules, since a unary rule cannot, by definition, infract the bracketing of a sentence. Here is the original binary rule count formula:

$$C_w(A \rightarrow B\,C) := \frac{1}{P} \sum_{s=1}^{n-1} \sum_{t=s+1}^{n} \sum_{r=s}^{t-1} p(A \rightarrow B\,C) \cdot I(s,r,B) \cdot I(r+1,t,C) \cdot O(s,t,A)$$

And here is the modified version:

$$C_w(A \rightarrow B\,C) := \frac{1}{P} \sum_{s=1}^{n-1} \sum_{t=s+1}^{n} \sum_{r=s}^{t-1} p(A \rightarrow B\,C) \cdot b(s,r) \cdot I(s,r,B) \cdot b(r+1,t) \cdot I(r+1,t,C) \cdot b(s,t) \cdot O(s,t,A)$$

These modifications serve to nullify the effects of rule applications that would violate the constituent bracketing. We can see the effects of this new formula by looking at the chart of inside probabilities. Figure 2.7 illustrates the unconstrained nature of the raw grammar induction system, and Figure 2.8 illustrates the level of constraint that can be applied based on the bracketing of two noun phrases.

[Figure 2.7: The chart for the raw sentence "the man saw the frog": every chart cell is available for learning.]

[Figure 2.8: The chart for the bracketed sentence "(the man) saw (the frog)": cells whose spans would cross a bracket are marked X and are unavailable for learning.]

2.6.3 Experimental Results

Pereira and Schabes (1992) tested their system with the Air Travel Information System (ATIS) corpus, which consists of 770 partially bracketed sentences. Of this number, 700 were used as a training set and 70 were used as a test set. The initial grammar consisted of all CNF permutations over 15 non-terminals and 48 part-of-speech tags (as terminals). The number of binary rules was 15 × 15 × 15 = 3375, and the number of unary rules was 15 × 48 = 720. So, the total number of initial grammar rules was 4095. The results of this experiment were as follows:

- When the system was trained on the baseline, unbracketed corpus, the bracketing accuracy with respect to the test set was 37% (after 75 iterations).

- When the system was trained on the partially bracketed corpus, accuracy increased to 90% (after 75 iterations).

- From a practical standpoint, the bracketed learner has better time complexity than its unsupervised variant (in the best case, it is linear in the combined lengths of the training and testing sets).

2.7 Co-training

A supervised training method is one in which a system learns from data that has been manually annotated. In the case of a grammar learning system, that annotation might consist of partial constituent bracketing. For a named entity recognizer, the annotation might consist of strings of entity labels. Whatever the case, data annotation takes time and effort to perform. In order to counteract this disadvantage, we would ideally like to be able to utilize unlabeled training data (of which there is usually a large supply) whenever possible. There are a number of different methods for dealing with the problem of sparse training data. For instance:

- Self-training: This technique involves training a single learner on a small set of labeled training data. Then, that same learner is used to label raw examples and to add the confidently labeled ones to its own training pool. This process is illustrated in Figure 2.9.

- Co-training: Co-training is similar to self-training, except that multiple learners are used. In order for this to be effective, however, each learner must view the training data in a unique way (in other words, they must make uncorrelated errors). After each learner has been trained on an initial labeled data set, examples are pulled from a pool of unlabeled material and labeled by each learner (while keeping track of a subset of confidently labeled ones). The confidently labeled examples are added to the shared, labeled training pool. Finally, the learners are retrained and this entire process repeats. The co-training procedure is illustrated in Figure 2.10 and pseudo-code is listed in Figure 2.11.

[Figure 2.9: Self-training. A single learner labels examples from a pool of unlabeled data and adds its confidently labeled examples back to its own labeled training data.]

[Figure 2.10: Co-training. Two learners label examples from a shared pool of unlabeled data and add their confidently labeled examples to a shared pool of labeled training data.]

For the purposes of this report, we will focus on the method of co-training, because it has proved to be a useful compromise between supervised and unsupervised learning techniques.

2.7.1 Initial co-training experiments

Web page classification was the initial problem domain to which co-training was applied (Blum and Mitchell, 1998). In other words, the system sought to classify web pages with one of several labels, such as "faculty member page". In order to implement this, two learners were used, each concentrating on a different "feature" of the web page. Namely, those features were:

- Text within the web page itself. For instance, if the phrase "research interests" is seen within a web page, it might be indicative of a faculty homepage.

- Anchor text within links from other web pages to the web page in question. For instance, the link text "My Advisor" could be indicative of a faculty homepage.


Given:
- a set L of labeled training examples
- a set U of unlabeled examples

Create a pool U' of examples by choosing u examples at random from U
Loop for k iterations:
    Use L to train a classifier h1 that considers only the x1 portion of x
    Use L to train a classifier h2 that considers only the x2 portion of x
    Allow h1 to label p positive and n negative examples from U'
    Allow h2 to label p positive and n negative examples from U'
    Add these self-labeled examples to L
    Randomly choose 2p + 2n examples from U to replenish U'

Figure 2.11: Pseudo-code for the web page classification co-training algorithm.

Given these features, Naive Bayes classifiers were trained on bags of words gleaned from the web content used for training. As a baseline, a supervised classifier using the combined features operated with an error rate of 11.1%. The co-training system, using two distinct classifiers, improved on this figure by operating with an error rate of 5.0%. So, at least for this problem domain, co-training had proved a useful tool. However, the question of whether other domains could benefit was still an open one.

2.7.2 Co-training applied to natural language processing

One of the main areas to which co-training has been applied is the problem of statistical parsing. Namely, this involves assigning parse trees to training data and selecting the most likely parse. This problem has a good deal in common with our problem of grammar induction, which is evident from our earlier section regarding parsing algorithms. In particular, the process of producing labels is more complex than in the web page classification domain explored by Blum and Mitchell (1998). Furthermore, when co-training is applied to both statistical parsing and grammar induction, overall sentence probability can be used as a measure of confidence in the automatically labeled data. The following section outlines recent research in the field of co-training for statistical parsing.

2.7.2.1 Sarkar (2001)

This project implements co-training for statistical parsing with Lexicalized Tree Adjoining Grammars (LTAG). The first learner views the training data by means of a tri-gram tagging model, and the second learner views the data by means of a parsing model. The co-training process begins with a seed set of 9,695 sentences and achieves a bracketed precision of 80.02% and a bracketed recall of 79.64%. This shows significant improvement over the scores of the baseline, supervised learner (72.23% precision and 69.12% recall, respectively).

2.7.2.2 Steedman et al. (2003)

This project involves combining a PCFG parser and an LTAG parser, using sentence probability as a confidence measure. First, the system was trained with an initial set of 500 sentences, and each round of co-training produced the top 20 parses out of 30 unlabeled examples. The results of this experiment show that co-training improves on self-training. Next, the initial seed data was raised to 1,000 sentences. This produced comparable, though slightly higher, performance than the previous experiment. However, considering the effect of initial training set size on overall performance, it is clear that co-training is most beneficial when the initial amount of labeled training data is small. Furthermore, this project illustrates that co-training can be beneficial even when provided with unlabeled data from a domain other than that of the initial labeled set.

2.7.3 Co-training versus EM

Expectation Maximisation (EM) and co-training are two methods with which to utilize unlabeled data. EM iteratively refines the parameters of a model based on unlabeled training data. Similarly, co-training produces labeled examples from unlabeled data, in an effort to refine the parameters of a model.

In order to compare EM to co-training, Nigam and Ghani (2000) performed a variety of experiments. After testing Naive Bayes, EM and co-training on the WebKB-Course dataset (consisting of computer science web pages), the results show that EM outperformed co-training in this particular instance (Figure 2.12). However, when the same methods were tested with a different dataset (the News 2x2 dataset, consisting of Usenet posts), the results come out in favor of co-training over EM (Figure 2.13).

Algorithm       # Labeled    # Unlabeled    Error Rate
Naive Bayes     788          0              3.3%
EM              12           776            5.4%
Co-training     12           776            4.3%
Naive Bayes     12           0              13.0%

Figure 2.12: Results for the WebKB-Course dataset

Algorithm       # Labeled    # Unlabeled    Error Rate
Naive Bayes     1006         0              3.9%
Co-training     6            1000           3.7%
EM              6            1000           8.9%
Naive Bayes     6            0              34.0%

Figure 2.13: Results for the News 2x2 dataset

Potential reasons for the varying efficacy of co-training are as follows:

- The machine learning problem is too simplistic, and suffers from a performance ceiling. As a result, it may be difficult to compare algorithms in a useful way.

- The feature split is not independent enough. In this case, the learners will make the same errors, thus eliminating any informativity on the part of the automatically labeled data.

So, this illustrates that the automatically labeled data produced by co-training may not always benefit every machine learning problem.

2.8 Summary

In this chapter we first examined the general problem of grammar induction. Next, we saw how to train PCFGs with the Inside-Outside algorithm, using raw training data. Since the unsupervised version of the Inside-Outside algorithm performs poorly, we saw how to augment it with support for partially bracketed corpora. Finally, we examined the benefits of co-training, and compared that process to traditional EM methods. In the next chapter, we'll show how to implement the algorithms we've just presented.

Chapter 3

Implementation and Experimental Environment

This chapter outlines the steps taken to implement the algorithms discussed in Chapter 2. Namely, our previously defined goals for this project are as follows:

- Develop a baseline grammar induction system using raw strings of part-of-speech (POS) tags as training data.

- Augment the baseline system so that it can deal with partially bracketed strings of POS tags.

- Combine multiple instances of the bracketed learners by means of co-training.

- Perform ancillary experiments related to learner agreement and the effects of variations in partial bracketing on grammar performance.

In this chapter we will discuss the specific steps taken in order to implement these goals.

3.1 Basic Inside-Outside algorithm implementation

The software foundation of this project is the ANSI C implementation of the Inside-Outside algorithm created by Johnson (2000). The program accepts an initial context-free grammar and a training set of raw strings as input. The initial grammar can consist of arbitrarily long rules, and will automatically be binarized by the system. However, our particular project does not make use of this feature, as the grammars we are dealing with have already been represented in Chomsky Normal Form. The only requirement on the initial grammar is that every terminal symbol is associated only with unary rules.

By default, the system reiterates the Inside-Outside algorithm on the training data until the cross entropy of the data and the model varies by less than some predefined limit. Mathematically, this cross entropy is defined as: H(W_{1,n}, P_M)