Information Extraction using Sequencing with Large Label Sets

Wouter van Atteveldt

Master of Science
School of Informatics
University of Edinburgh
2003

Abstract

Information Extraction can be conducted by assigning labels drawn from a large set to words or phrases. However, predicting the labels of unseen examples using such a large set of labels poses a number of challenges. Using a Maximum Entropy Markov Model with features that generalise over labels, it was shown that a significant improvement over the baseline model can be made. Two tested models achieved F-Scores of 0.30 and 0.39 over baseline results of 0.07 and 0.26. These models were tested on the novel domain of a corpus of Dutch newspaper articles annotated by researchers in the social sciences.


Acknowledgements

Many thanks go to Claire Grover and Malvina Nissim for their useful suggestions at the beginning of my project, and to my supervisor Miles Osborne for his help and patience. I would also like to thank Jan Kleinnijenhuis and his colleagues for the use of their annotated material and for their helpful comments.


Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Wouter van Atteveldt)


To my parents


Table of Contents

1 Introduction

2 Maximum Entropy Markov Models
   2.1 Related Work
   2.2 Maximum Entropy
   2.3 MEMM
   2.4 Parameter Estimation
   2.5 Chapter Summary

3 Labelling with Large Label Sets
   3.1 Assembled Models
   3.2 Unseen Labels
      3.2.1 Partial models
      3.2.2 Generalising features
   3.3 Possibilities
   3.4 Chapter Summary

4 The NET method and corpus
   4.1 Example
   4.2 The method
   4.3 The corpus
   4.4 Encoding the Corpus
   4.5 Chapter Summary

5 Experimental Work
   5.1 Problem Definition and Experimental Setup
   5.2 Features
      5.2.1 Contextual Predicates
      5.2.2 Generalising Features
   5.3 Results
   5.4 Configuration
   5.5 Error analysis
      5.5.1 Confusion Matrices
      5.5.2 Example sentences
      5.5.3 Label bias problem
   5.6 Chapter Summary

6 Conclusion and Future Work
   6.1 Future work
   6.2 Conclusion

Bibliography

Chapter 1 Introduction

On May 6th, 2002, the Dutch politician Pim Fortuyn, called a "maverick" and "controversial anti-immigrant politician" by BBC News, was murdered in Hilversum, the Netherlands, by an animal rights activist. People claimed the 'bullet came from the left', that the murder of Fortuyn was caused by the demonising of him by left-wing politicians and the media, who were said to have depicted him as a racist, fascist or even neo-Nazi.

The question of whether Fortuyn could rightly be called a racist or fascist is presumably not a valid question in most scientific disciplines. Whether this was the image of him depicted by the media, and how the media reported the clash between him and the standing political order, however, are questions that could possibly be answered in an objective and repeatable manner, and thus might be a valid subject for scientific investigation. Researchers in the social sciences have developed a method claimed to extract this information in an objective manner from (subjective) news reports and editorials. One of their conclusions, published in the Dutch newspaper NRC Handelsblad of 15 May 2002, was that "the media certainly did not demonise Fortuyn", but that most of the news about him consisted of negative or critical statements from other politicians.


The affair around Pim Fortuyn in the Netherlands, like the affair around Dr Kelly in the United Kingdom, has brought to the surface the complexity of the relationship between politics and the media. Although there is an abundance of data published by the media, this data often consists of opinions rather than factual statements, or at least of a subjective view of the facts, making it difficult to interpret. However, information extracted from this data can be very valuable to researchers in the social and political sciences, political campaign teams, and possibly even the justice system.

Researchers in the social sciences, mainly based at the Free University in Amsterdam, have developed a method for extracting this information by identifying and annotating references to interesting concepts or entities and determining the relationship between them. Although this method, called the NET method, manages to extract useful data from the written and spoken media, it does so at the high cost of manual annotation of the investigated articles. Full, accurate automation of this annotation process would of course be ideal, but any reduction in the amount of human labour required to extract the relevant information would be of great benefit to the researchers in that field.

This report will focus on the use of statistical sequencing models to automate the annotation process. Specifically, it will extract the required information by labelling instantiations of the entities drawn from a fixed list. The possible pairs of entities in this list constitute the labelset, yielding approximately one million labels from a thousand different entities. The theoretical part of this report will propose and investigate a number of possible methods for dealing with the large number of possible labels. The empirical part tests one of these methods by training and testing a sequencing model on the corpus that was created for the investigation of the campaigns leading up to the 2002 elections in the Netherlands. The results, F-scores of 30% and 39% on a subproblem against baseline results of 7% and 26%, show on the one hand that significant improvements to a baseline system can be made, and on the other hand that a number of both practical and theoretical problems are yet to be resolved.

The rest of this report is organized as follows. Chapter 2 will explain the theoretical points related to the sequencing model used. Chapter 3 will discuss some of the theoretical and practical possibilities and difficulties encountered when working with large labelsets. Chapter 4 will give an overview of the NET method and the corpus used in this report. Chapter 5 will explain the experimental setup and present the results as well as an analysis of the errors. In conclusion, chapter 6 will attempt to explain these findings and suggest possible further work for improvement.

Chapter 2 Maximum Entropy Markov Models

This chapter will provide a theoretical overview of Maximum Entropy Markov Models, the modelling technique used in the experimental part of this report. Maximum Entropy Markov Models are sequencing models that use a Maximum Entropy model to determine the state transition probabilities given the observations. First, a small survey of previous uses of Maximum Entropy and Maximum Entropy Markov Modelling will be given, followed by theoretical overviews of both Maximum Entropy models and Maximum Entropy Markov Models. The chapter will end with the more practical aspects of parameter estimation for Maximum Entropy models.

2.1 Related Work

Maximum Entropy models have been used for a large variety of Natural Language Processing tasks, including Language Modelling (Berger et al. 1996; Rosenfeld 1994), Named Entity Recognition (Borthwick et al. 1998), Word Sense Disambiguation (Ratnaparkhi 1998), Part Of Speech tagging (Ratnaparkhi 1996) and Sentence Boundary Detection (Reynar and Ratnaparkhi 1997).

Maximum Entropy Markov Models (MEMM) (McCallum et al. 2000) use a Maximum Entropy model within a Markovian model to create a sequencing model that can efficiently combine different and possibly overlapping sources of information. Maximum Entropy Markov Models have been used for tasks such as Information Extraction (McCallum et al. 2000) and Named Entity Recognition (Malouf 2002b).

2.2 Maximum Entropy

Using Maximum Entropy modelling, it is possible to integrate heterogeneous and possibly overlapping information from different sources. This information is represented in the model as binary or real valued feature functions defined over the combined space of data points and labels. Given a training set of labelled examples, the set of possible models is constrained to those models in which the expected value of each feature equals the average value of that feature in the training data. Out of this set of possible models, the principle of Maximum Entropy dictates that the model with the highest entropy is chosen. The intuition behind this is that such a model makes no assumptions beyond those dictated by the empirical data, since it is the most uniform model possible given the constraints derived from that data.

More formally, assume we are trying to model a distribution p over a set of events X consisting of labelled observations drawn from the event space E = O \times L, where O is the observation space and L the label space. Moreover, suppose we are given a d-dimensional feature function f : O \times L \to R^d that characterises these events. Taking \tilde{p}(\cdot) to denote the empirical distribution of the operand in the training data, and p(\cdot) to denote the probability of the operand assigned by the model, the above constraint can be reformulated as follows:

    \sum_{i,j} \tilde{p}(o_i, l_j) f(o_i, l_j) = \sum_{i,j} \tilde{p}(o_i) p(l_j | o_i) f(o_i, l_j)    (2.1)

From the subset of models that satisfy (2.1), Maximum Entropy requires us to choose the model that is most uniform, using conditional entropy as a definition of uniformity:

    H_p(L | O) = - \sum_{i,j} \tilde{p}(o_i) p(l_j | o_i) \log p(l_j | o_i)    (2.2)

Thus, the problem of finding the Maximum Entropy model given the empirical distribution \tilde{p}(L, O) is a constrained optimization problem. Defining C to be the set of models satisfying (2.1), this problem can be formulated as:

    \hat{p} = \argmax_{p \in C} H_p(L | O)    (2.3)

Introducing Lagrangian multipliers for each of the constraints imposed by the features, Berger et al. (1996) show that the model \hat{p} will be of the parametric form p_\lambda below:

    p_\lambda(l | o) = \frac{1}{Z(o)} \exp\left( \sum_i \lambda_i f_i(o, l) \right)    (2.4)

The problem of finding the parameters \lambda_1, \ldots, \lambda_d that maximize the conditional entropy is discussed in section 2.4.
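As a concrete illustration of equation (2.4), the following sketch computes p_lambda(l | o) for a given feature function and weight vector. It is a minimal, hypothetical example (the feature function, weights and labels are invented for illustration), not the model used in this report.

```python
import math

def maxent_prob(obs, label, labels, features, weights):
    """Compute p_lambda(label | obs) as in equation (2.4).

    features(obs, label) returns a dict {feature_index: value};
    weights maps feature_index -> lambda_i.
    """
    def score(l):
        # The exponent sum_i lambda_i * f_i(obs, l)
        return sum(weights.get(i, 0.0) * v for i, v in features(obs, l).items())

    # Z(obs) normalises over all candidate labels
    z = sum(math.exp(score(l)) for l in labels)
    return math.exp(score(label)) / z

# Toy usage with two labels and a single indicator feature (hypothetical values)
labels = ["A", "B"]
feats = lambda o, l: {0: 1.0} if (o == "x" and l == "A") else {}
print(maxent_prob("x", "A", labels, feats, {0: 1.2}))
```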

2.3 MEMM

Traditional Hidden Markov Models (HMM) are usually described by two probability distributions: the transition probability gives the probability of going to a new state given the current state, and the emission probability gives the probability of an observation based on the current state. In Hidden Markov Models, the observation is conditionally independent of all other variables given the state it belongs to. In contrast, Maximum Entropy Markov Models (MEMM) are governed by a single probability distribution, namely that of going to a state based on the previous state and the current observation. This difference is visualized in figures 2.1 and 2.2, where L_1, ..., L_n are the Markovian states or labels and O_1, ..., O_n are the observations. In these diagrams, an arrow from A to B indicates that B is conditioned on A. As can be seen, the only graphical difference between the two models is the reversal of the arrows between the labels and the observations. Maximum Entropy Markov Models still obey the Markov property that the only dependence on history is on the directly preceding state, i.e. the current state is independent of the previous observation and the earlier states given the previous state.

Another key difference is that while Hidden Markov Models are generative, defining a probability over the joint distribution of states and observations, Maximum Entropy Markov Models are discriminative models, assigning a probability to a label sequence given the observation sequence.

Figure 2.1: HMM structure

Figure 2.2: MEMM structure
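To make the sequencing aspect concrete, the sketch below performs Viterbi decoding over MEMM transition probabilities p(l_t | l_{t-1}, o_t). It assumes some transition model is already available (for instance a Maximum Entropy model as above); the toy transition function is purely illustrative.

```python
import math

def viterbi(observations, labels, trans_prob, start="<s>"):
    """Find the most probable label sequence under an MEMM.

    trans_prob(prev_label, obs, label) -> p(label | prev_label, obs).
    """
    # best[label] = (log-probability, label sequence) of the best path ending in label
    best = {start: (0.0, [])}
    for obs in observations:
        new_best = {}
        for label in labels:
            # Choose the previous label that maximises the path score
            new_best[label] = max(
                (lp + math.log(trans_prob(prev, obs, label)), path + [label])
                for prev, (lp, path) in best.items()
            )
        best = new_best
    return max(best.values())[1]

# Toy usage: two labels and a hand-specified transition distribution (hypothetical)
def toy_trans(prev, obs, label):
    return 0.8 if label == obs else 0.2

print(viterbi(["A", "B", "B"], ["A", "B"], toy_trans))
```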

2.4 Parameter Estimation

The parameter estimation problem for Maximum Entropy models is to find the parameters \lambda_1, \ldots, \lambda_d that satisfy the constrained optimization problem posed in equation (2.3) in section 2.2. As the entropy is concave in the parameter space there is a single global optimum, and the different estimation methods proposed are all iterative (hillclimbing) methods that differ only in convergence speed and ultimate proximity to the global maximum, not in the quality of the solution found. However, as the estimation problem can be computationally daunting due to the high number of free parameters, the rate of convergence can be of great practical value.

Traditionally, parameter estimation is done using an implementation of Iterative Scaling, either Generalized Iterative Scaling (Darroch and Ratcliff 1972) or Improved Iterative Scaling (Della Pietra et al. 1997). Both are iterative methods that scale the probability distribution by a factor proportional to the ratio between the expected values of the features under the empirical distribution and under the current estimated parameters.

Although the above methods are widely used for Maximum Entropy parameter estimation, Malouf (2002a) describes a number of other methods and an experiment aimed at comparing them under different circumstances. Somewhat surprisingly, he finds that the Iterative Scaling methods are among the worst methods performance-wise; the method that achieves the highest performance throughout his experiments is a limited memory variable metric method, which uses a computationally tractable approximation to a second order hillclimbing method and which he finds to converge approximately 10-20 times faster than the Iterative Scaling methods. His implementation of this method is used throughout the experiments described in this report.

However, even with this improved method training time can still be a substantial problem for large label sets, as the time required to calculate the parameter updates for each iteration is generally proportional to the number of labels used. A number of earlier papers try to reduce the training time by some form of clustering. Lafferty and Suhm (1995) and Wu and Khudanpur (2000) describe a framework that exploits the hierarchical nature of n-gram predicates in language by using a cluster expansion method adapted from statistical physics. Although they report a significant decrease in training time for language models, their methods are dependent on the specific hierarchic structure of the problem. Goodman and Gao (2000) build upon this by clustering the labels into classes, and then using two distinct models to predict the class based on the observation and the label based on the class and the observation. This overcomes the need for specifically structured features, but it does require that a sensible clustering can be found. Moreover, as the training data is divided over the different classes for the second model, data sparseness might become worse.
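The sketch below illustrates gradient-based parameter estimation for a small conditional Maximum Entropy model, using SciPy's L-BFGS-B optimiser as a stand-in for the limited memory variable metric method discussed above. The toy events and feature vectors are hypothetical; this is not the estimation code used in the experiments.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: each event is (feature matrix over labels, index of the correct label).
# f[k] is the feature vector for assigning label k to this observation.
events = [
    (np.array([[1.0, 0.0], [0.0, 1.0]]), 0),
    (np.array([[1.0, 0.0], [0.0, 1.0]]), 0),
    (np.array([[0.0, 1.0], [1.0, 0.0]]), 1),
]

def neg_log_likelihood(lam):
    """Negative conditional log-likelihood and its gradient for a MaxEnt model."""
    nll, grad = 0.0, np.zeros_like(lam)
    for f, gold in events:
        scores = f @ lam                       # sum_i lambda_i f_i(o, l) per label
        scores -= scores.max()                 # numerical stability
        probs = np.exp(scores) / np.exp(scores).sum()
        nll -= np.log(probs[gold])
        # Gradient: model expectation of the features minus the observed features
        grad += probs @ f - f[gold]
    return nll, grad

# Limited-memory quasi-Newton optimisation (L-BFGS), standing in for the
# limited memory variable metric method described by Malouf (2002a).
result = minimize(neg_log_likelihood, np.zeros(2), jac=True, method="L-BFGS-B")
print(result.x)
```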

2.5 Chapter Summary

This chapter gave a theoretical overview of Maximum Entropy Markov Models and the underlying Maximum Entropy models. Although the main focus was on the theoretical foundation of these modelling techniques, some practical aspects of the parameter estimation process were also discussed due to their relevance to the issue of labelling with large labelsets.

Chapter 3 Labelling with Large Label Sets

One problem encountered when extracting information by assigning semantic labels representing the entities of interest is the large number of labels that can be assigned. This problem is not unique to information extraction; the same problem occurs if one uses similar machine learning techniques for machine translation or language modelling.

The first difficulty caused by a large label set is that of data sparseness: the more labels there are, the fewer training examples there are per label. An even worse problem occurs when there are more labels than training examples, or when not all labels can be expected to be seen in the training set. Although this might sound prohibitive initially, this problem occurs regularly in statistical machine translation, where one needs to deal with unseen words, for example by using orthographical and morphological features and smoothing techniques.

This chapter provides an overview of different methods to deal with these two distinct but related problems. First, a number of methods to combine binary or small n-ary models will be presented. Then, the fundamental problem of dealing with unseen labels in the test data will be investigated. In the third section, some possibilities for using these methods for the task addressed in this report will be investigated. Finally, the practical problems that make the application of these methods difficult will be discussed.


3.1 Assembled Models

In principle, Maximum Entropy and many other models are able to label examples using a theoretically unbounded number of labels. However, certain problems occur if the number of labels becomes high. One problem, especially with Maximum Entropy based models, is the training time, which generally increases with the number of target labels. Another problem, called the Imbalanced Data problem, occurs when the distribution of training examples over the different labels is strongly non-uniform, which will usually bias the model towards the better-represented labels (Kubat and Matwin 1997; Japkowicz 2000). Finally, the decision boundaries for multi-way labelling problems can be very complex, making the problem difficult or impossible to learn depending on the exact task and technique used (Berger 1999).

As a result of this, and the fact that certain models are limited to binary classification, a number of techniques have been proposed to combine multiple (usually binary) models into a combined model that ranges over more labels than any of its constituents. These techniques differ in the number of models that need to be trained, the complexity of labelling unseen examples, and the brittleness in the face of mislabelling by the individual models, and there is generally a tradeoff between these factors. This section will provide an overview of these proposed methods, considering both training and labelling complexity and expected robustness, as summarized by Salomon (2000).

One versus One

The idea behind this assembly method is to train one model for each pair of labels, and to label an unseen example with the label that is predicted by the most models (Friedman 1996). Although this method can be robust in the face of imperfect constituent models, it requires the training of O(N^2) models and an equal number of atomic labellings per test example, which might be prohibitive if the number of labels N is high.

One versus Rest

In this method, N binary models are constructed, each distinguishing between a certain label and all other labels. At labelling time, the label that is assigned the highest probability by its model is chosen as the final label. Although it requires fewer models than the One versus One method, all constituent models are expected to deal with all labels, which might increase the training complexity of these models; more importantly, all models will probably have many more negative examples than positive ones, making it difficult to train accurate models. Lastly, this method can be very brittle if one constituent model mistakenly assigns a very high probability.

Decision Directed Acyclic Graphs

Decision Directed Acyclic Graphs are aimed at reducing the number of atomic labellings needed to label an unseen example as compared to One versus One models. This method still trains one model for each label pair, but at labelling time uses each model to eliminate one label until only one label is left, leading to O(N^2) models but only O(N) labellings per unseen example (Platt et al. 2000). An obvious disadvantage of this method is that one atomic mislabelling can cause the ultimate mislabelling of an unseen example, making this method more brittle than the One versus One method while not reducing training time and model complexity.

Decision Trees

Although traditionally seen as a learning algorithm rather than an assembly method, Decision Trees also split a multiway labelling problem into a number of binary or n-ary constituent problems. Although decision trees usually use a simple test per decision node, in theory any type of model could be plugged into the nodes. Assuming a symmetric tree and binary constituent models, this requires O(N) trained models and O(log_2 N) atomic decisions per labelling. A disadvantage is that any atomic mislabelling necessarily leads to a mislabelling of an unseen example, making the method less robust than the other, more expensive, methods. Note that in the case where the tree is composed of a small number of multiway models, this is equivalent to the class-based method proposed by Goodman (2001) to decrease training time for Maximum Entropy models.


Error-Correcting Output Coding

The Error-Correcting Output Coding (ECOC) method, which can be seen as a generalisation of the One versus Rest method, works by assigning to each label a bitvector code of length l, where l >= log_2 N (Berger 1999). Then, for each bit in the vector a binary model is trained to distinguish between the labels for which that particular bit is one and the labels for which it is zero. At labelling time a new vector is created from the results of all constituent models, and the label is chosen whose code vector is most similar to the created vector under some suitable distance metric. If the matrix composed of all bitvectors is an N by N identity matrix, this method is equivalent to the One versus Rest method. In general, it will require l constituent models and atomic decisions per labelling, and the robustness of the model will generally be higher for greater l.

Although in principle any method can be used for assigning the bitvector codes, Berger (1999) gives theoretical and empirical arguments for choosing random codes, assuming the vectors are long enough. Although only a small number of problems is investigated, his empirical results suggest that this method outperforms One versus Rest labelling for vector lengths of approximately the order of magnitude of the number of labels. Thus, this leads to better performance for a similar number of constituent models, while having better balanced data sets for the individual models and more flexibility in the tradeoff between complexity and robustness.
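A minimal sketch of the ECOC scheme follows, assuming random codes and that the outputs of the constituent binary models are already given; decoding picks the label whose code is closest in Hamming distance. The code length and label names are illustrative.

```python
import random

def make_codes(labels, length, seed=0):
    """Assign a random bit-vector code of the given length to each label."""
    rng = random.Random(seed)
    return {lab: [rng.randint(0, 1) for _ in range(length)] for lab in labels}

def decode(predicted_bits, codes):
    """Choose the label whose code has the smallest Hamming distance to the
    bit vector produced by the constituent binary models."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(codes, key=lambda lab: hamming(codes[lab], predicted_bits))

# Toy usage: 8 labels, codes of length 16, and a prediction with one bit flipped
codes = make_codes([f"label{i}" for i in range(8)], length=16)
noisy = [b ^ (i == 2) for i, b in enumerate(codes["label3"])]
print(decode(noisy, codes))  # typically still decodes to "label3"
```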

3.2 Unseen Labels

At first glance, attempting to construct a statistical model that can assign a probability to labels that were not encountered in the training data seems a contradiction in terms. However, consider a labelling problem with a two-dimensional binary observation vector and the training and test examples in table 3.1. If we see the label as atomic, there is no sensible way to predict the unseen label 'BB'. However, by exploiting the structure in the label, in this simple case seeing the label as composed of two separate characters, it can be possible for a model to assign a meaningful probability to the label 'BB' without having observed it in the training data.


Training examples:              Test example:

o1   o2   label                 o1   o2   label
0    0    AA                    1    1    BB
0    1    AB
1    0    BA

Table 3.1: Toy labelling problem with unseen test label

In order to be able to consider a label that is not seen in the training set, some list of assignable labels must be available other than the set of labels encountered in the training data. This list can be either explicit or implicit: a list of labels can be given in addition to the labelled training set, or there can be a procedure to generate labels based on the labels in the training set, especially if there is an internal structure to the labels. For example, in the label set of the task described in this report, the labels consist of Subject/Predicate/Object triples. It is possible to construct a previously unseen triple from parts that did occur in the training set.

In addition to being able to iterate over the unseen labels, a probability must somehow be assigned to them in the context in which they are encountered in the test set. The remainder of this section will propose two methods for using the training data to estimate the probability of unseen labels based on aspects of these labels that have been observed in the training data.

3.2.1 Partial models

One possibility is to train multiple models that each determine only part of the label. In the above example, the obvious implementation would be to train one model to predict the first character of the label, another to predict the second character, and to combine the two models by concatenating their outputs. Although this is a trivial solution for the above problem, it can be used in any problem where there is some structure in the individual labels, with constituent models each predicting part of the resulting label.

It is also possible to have overlap between the different models, in which case some form of voting must take place between the models to resolve disputes. In the case of binary models, this corresponds to a special case of the Error-Correcting Output Coding described above, and it can easily be extended to multiway constituent models. Table 3.2 illustrates this by splitting an 8-way labelling problem into three partial models that each predict one bit of the label, and into two partial models, one predicting the first two bits and the other the last bit. For both cases, the bitvector or 'intvector' encoding of the labels is given that reduces the problem to a (minimal) instance of Error-Correcting Output Coding.

Binary

Model 1

Model 2

Model 3

Label

Binary

Model 1

Model 2

0

000

0

0

0

0

00 0

0

0

1

001

0

0

1

1

00 1

0

1

2

010

0

1

0

2

01 0

1

0

3

011

0

1

1

3

01 1

1

1

4

100

1

0

0

4

10 0

2

0

5

101

1

0

1

5

10 1

2

1

6

110

1

1

0

6

11 0

3

0

7

111

1

1

1

7

11 1

3

1

(a)

(b)

Table 3.2: Partial Models as Error-Correction Output Coding: (a) Binary constituent models, (b) Multi-way constituent models

Note that although this solution for the toy problem is a special case of Error-Correcting Output Coding, as illustrated in table 3.3(a) below, it is not true that applying randomly created bitvector codes generally solves the problem of unseen labels in the test set. To illustrate this, consider the fully expanded 2^N long bitvectors in table 3.3(b), consisting of eight pairs of models that differ only in the last bit of the corresponding columns in the table. Since the three training examples are identical for the models in one pair, their decision boundary will also be identical. Thus, on average half the bits in the output vector for the unseen example will differ from the bitvector code for label BB, which is equivalent to assigning a random vector to the example. Since we cannot expect a random subset of these columns, with or without repetitions, to outperform the full set, this means that assigning random bitvector codes to the labels does not in general solve the problem of unseen test labels in the way the encoding in table 3.3(a) does. Thus, it is imperative that there is some useful structure to the labels that can be used in constructing the output coding.

(a) Hand crafted partial models

label   bitvector
AA      0 0
AB      0 1
BA      1 0
BB      1 1

(b) Fully expanded bitvectors

label   bitvector
AA      0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
AB      0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
BA      0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
BB      0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

Table 3.3: Error-Correcting Output Coding: (a) Hand crafted partial models, (b) Fully expanded bitvectors.

Partial models were used to deal with the large label set in the experimental setup as described in section 5.1.
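As a sketch of the partial model idea, the combination step can be as simple as letting each partial model predict one component of the label and concatenating the results; the hand-written rules below stand in for trained models.

```python
def combine_partial_models(obs, partial_models):
    """Predict a structured label by letting each partial model predict one
    component and concatenating the components into a full label.

    partial_models is a list of functions obs -> component string.
    A combination unseen as a whole in training can still be produced,
    as long as each component was seen by its own model.
    """
    return tuple(model(obs) for model in partial_models)

# Toy usage for the problem in table 3.1 (hand-written rules stand in for
# trained models; a real system would plug in e.g. Maximum Entropy models).
first_char = lambda o: "A" if o[0] == 0 else "B"
second_char = lambda o: "A" if o[1] == 0 else "B"

print(combine_partial_models((1, 1), [first_char, second_char]))  # ('B', 'B')
```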

3.2.2 Generalising features

In the case of Maximum Entropy based models, which assign a probability based on a feature vector defined over the joint space of observations and labels, it is also possible to construct a 'monolithic' model that is able to assign a probability to unseen labels such as those in table 3.1. This works by creating features that generalise over multiple labels rather than fire for only a single label. Traditionally, the features used in Maximum Entropy modelling fire for a combination of a contextual predicate and one label. This results in feature templates of the general form F_1 and actual features such as f_1:

    F_1(o, label) = 1 if o_i = \omega and label = \gamma; 0 otherwise
    f_1(o, label) = 1 if o_1 = 0 and label = 'AA'; 0 otherwise

However, the features for which \gamma equals 'BB' will not be seen in the training data at all, and thus receive a weight of zero during training. As a result, the label 'BB' will never be assigned during the labelling process. This is not a limitation of the Maximum Entropy framework, however, as it also allows us to define templates and features such as the following:

    F_2(o, label) = 1 if o_i = \omega and label \in {\gamma_a, \gamma_b, ...}; 0 otherwise
    f_2(o, label) = 1 if o_1 = 0 and label \in {'AA', 'AB'}; 0 otherwise

Note that it is necessary that the full list of labels is available at training time, as the Maximum Entropy parameter estimation has to consider each possible label for each observation and assign it a (possibly very small) probability. If not all labels are known at training time, the probability p(\cdot) used in equations (2.1) and (2.2) cannot be accurately assigned, the final model, which will use the unseen labels, will violate the constraint in equation (2.1), and the distribution assigned by the model will not be a Maximum Entropy distribution.

In a similar, but not identical, way to the partial models above, this allows the toy problem in table 3.1 to be solved using features that fire for a certain value of the first character of the label. Most importantly, this solution will also work without knowing in advance which particular generalising features are useful, since the absolute weight attached to features that are statistically less relevant will be lower. Thus, given that the relevant features exist and that sufficient training data is available to determine their relevancy, the Maximum Entropy parameter estimation process can pick these features from the total set of possible generalisations.

Although they were not used to deal with unknown labels, generalising features were used in the experimental work and were found to increase performance by three to five percentage points.
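The contrast between a conventional per-label feature and a generalising feature can be sketched as follows; the particular groupings are illustrative rather than those used in the experiments.

```python
def per_label_feature(word, label, target_word, target_label):
    """Conventional template: fires for one specific (word, label) pair."""
    return 1.0 if word == target_word and label == target_label else 0.0

def generalising_feature(word, label, target_word, label_group):
    """Generalising template: fires for any label in a group of labels,
    so evidence is shared across the group and can reach unseen labels."""
    return 1.0 if word == target_word and label in label_group else 0.0

# Toy usage for the problem in table 3.1: the group {'AA', 'AB'} generalises
# over the first character of the label being 'A' (hypothetical grouping).
print(per_label_feature("x", "AA", "x", "AA"))               # 1.0
print(generalising_feature("x", "AB", "x", {"AA", "AB"}))    # 1.0
print(generalising_feature("x", "BB", "x", {"AA", "AB"}))    # 0.0
```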


3.3 Possibilities

In the previous sections, a number of different methods have been proposed to deal with large labelsets, both to alleviate data sparseness and imbalanced data problems and to deal with unseen labels in the test set. This section will discuss some possible applications of these techniques to the problem of extracting semantic information from text by labelling the instantiations of entities in the text.

In the type of semantic labelling addressed here, it is possible that some information is contained within the label itself. For example, the unseen label might be a normal word for which a definition or WordNet synset might be given, or for which unlabeled material might be used to obtain co-occurrence counts with words from the context in which it occurs in the test set. Additionally, the labels might be ordered in groups or a hierarchical structure such that generalisations can be made about unseen labels from their group membership; alternatively, sets of synonyms from WordNet or thesauri can be used to create such groups.

Another possibility is to use features that do not fire for certain labels at all, but rather for a certain relation between observation and label. For example, a feature is possible that fires if the current word is a substring of the label, or a real-valued feature that measures the co-occurrence of the current word with the label in some unlabeled document collection. The value of these features is less directly dependent on the label being investigated, and with such features it is possible to assign a probability to unknown labels using the information known about these labels, and using the training data to establish the effect of that information.

In general, using features that are not restricted to a single label might be a way of integrating different (statistical and symbolic) systems that can contribute information about the text, by taking the output probabilities for the different labels as input features for the final Maximum Entropy model. That way, information from the training data can be used to establish possible interactions between these sources. In the case of non-statistical subsystems, this might also provide a way of using the training data to determine the weight to be attached to evidence regarding unseen labels from the systems that do not rely on having seen all labels in the training set.


3.4 Chapter Summary

This chapter gave a theoretical overview of the difficulties encountered when attempting to label with large labelsets, including data sparseness, imbalanced data, long training times, and unseen labels in the test set. It also listed a number of methods previously used to deal with this problem by using multiple small models instead of one monolithic model. Two different methods of dealing with unseen labels were investigated: using partial models that each predict only part of the label, and using Maximum Entropy features that generalise over multiple labels instead of firing for a single label only. The chapter ended with some practical possibilities and drawbacks of using these techniques for the annotation task described in this report.

Chapter 4 The NET method and corpus

The corpus used for this report is the result of a project conducted at the Free University, Amsterdam, and is described in (Kleinnijenhuis et al. 2003). The project monitored five national Dutch newspapers and the national television news broadcasts before and after the elections. The NET method is further described in (Cuilenburg et al. 1988; De Ridder 1994; Kleinnijenhuis et al. 1997; De Ridder and Kleinnijenhuis 2001).

4.1 Example

As an example of the process of NET encoding and the resulting information, consider the editorial given in figure 4.1 below. Given a suitable entity list, the information shown in table 4.1 below will be extracted from this editorial. As can be seen in this table, from each sentence zero or more 'nuclear sentences' or triples are extracted, where the relation (or predicate) is said to give the association between the Subject and the Object of the triple. The Quality is a quantification of the relation, with a positive quality corresponding to a positive relation and vice versa. This method will be elaborated in section 4.2 below.

(1) The Mitchell report, issued earlier this week, opens another page in the story of the Middle East conflict. (2) With its provisions for a cease-fire, a cooling-off period, confidence-building measures and an eventual return to the negotiating table, the report represents a positive contribution. [...] (3) It is at the negotiating table that the fate of the region will be decided. (4) Only if negotiations succeed it is possible to avert the full-scale war that would have severe negative implications for the region and the whole world.

Figure 4.1: Example newspaper editorial from the New York Times

Sentence   Subject        Relation                Quality   Object
(1)        Reality        issues                  0.5       Report
(2)        Report         provides for            0.5       Cease-fire
(2)        Report         provides for            0.5       Confidence
(2)        Report         provides for            0.5       Negotiations
(2)        Report         positive contribution   0.5       Ideal
(4)        Negotiations   prevents                -1        War
(4)        War            negative implications   -1        Ideal

Table 4.1: Encoding for example in figure 4.1

Thus, this editorial associates the Mitchell Report with the ideal directly and through the negotiations and the prevention of war. Since reality is positively associated with the report, the editorial is seen as stating that something positive happened, since there is an indirect positive relation between reality and the ideal. Moreover, the editorial associates a number of entities with each other, such as the negative association between negotiations and war, which can yield interesting results when combined with other articles about the same subject.

4.2 The method

As illustrated above, the basic modelling element in NET is the 'nuclear sentence', basically a (subject, relation, object) triple. The semantic interpretation of this triple is that of association: a positive quality means that the subject is associated with the object; a negative quality means that the subject is dissociated from the object.

The subjects and objects are drawn from a more or less fixed list determined by the researcher, which could be called the domain of the project. This list determines which aspect of the text the researcher is interested in, and will change from project to project. In the project that yielded the corpus used in this report, the list consisted mainly of political parties and their members and issues important to the 2002 elections in the Netherlands. The subject can also be the special subject 'Reality' in case there is no explicit subject, and the object of a relation can be 'Ideal', in which case the subject is directly associated with what the source of the sentence thinks is good or bad.

The relation is in principle quantified as a (Quality, Importance, Ambivalence, Type) tuple. Of these measures, quality is the most important: it is a direct quantification of the strength of the association or dissociation. Although it can be a real valued number between -1 and +1, in practice multiples of 0.5 are almost exclusively used.
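The indirect associations mentioned in section 4.1 can be made explicit by multiplying the qualities along a chain of nuclear sentences; the sketch below is an interpretation of that description using the triples of table 4.1, not code from the NET project.

```python
def chain_quality(triples, path):
    """Quality of an indirect association along a path of entities,
    computed as the product of the qualities of the connecting triples.

    triples maps (subject, object) -> quality.
    """
    quality = 1.0
    for subj, obj in zip(path, path[1:]):
        quality *= triples[(subj, obj)]
    return quality

# Triples from table 4.1 (only the ones needed for the example)
triples = {
    ("Reality", "Report"): 0.5,
    ("Report", "Negotiations"): 0.5,
    ("Negotiations", "War"): -1.0,
    ("War", "Ideal"): -1.0,
}

# Reality -> Report -> Negotiations -> War -> Ideal yields a positive value,
# i.e. an indirect positive relation between reality and the ideal.
print(chain_quality(triples, ["Reality", "Report", "Negotiations", "War", "Ideal"]))
```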

4.3 The corpus

The corpus consists of 5,664 newspaper articles, in total 373,429 words, of which the full text was available. Of these articles, 1,875 did not contain any nuclear sentences, presumably due to false positives in the keyword based selection mechanism. The remaining articles, comprising 241,484 words, contain 11,370 nuclear sentences in 15,653 natural sentences.

The domain for this project consisted of 964 entities, including 15 political parties, 381 members of political parties, and 71 country names. The other approximately 500 entities were primarily issues and concepts, and some names of organisations and companies. 359 entities were not used at all; these were mainly (unimportant) members of the political parties. 100 entities were used once, 251 were used more than 10 times, and 45 were used more than 100 times. The most popular entities were Ideal and Reality, used 2,227 and 1,551 times, followed by Fortuyn, PvdA (Dutch Labour) and VVD (Liberals/Conservatives), used 1,396, 873, and 612 times, respectively. The twenty most used entities contained 7 persons (apart from Fortuyn, all ministers of the cabinet before the 2002 elections) and 6 political parties. In total, political parties, persons, and country names were responsible for 9,820 of the 22,740 uses of entities (43%).

4.4 Encoding the Corpus

As an example of the process of encoding this corpus, consider the last sentence of the example in figure 4.1, partially copied below. In this sentence, two entities are referenced explicitly, War and Negotiations, and there is an implicit reference to the Ideal. Restricting ourselves to the first triple, (Negotiations, prevents, War), an intuitive annotation for the sentence would be:

    Only if [negotiations](role: subject, ref: Negotiations) succeed it is possible to [avert](role: relation, quality: -1) the [full-scale war](role: object, ref: War) that ...    (4.1)

However, the exact words corresponding to an entity or a relation are not known. Moreover, these words cannot even always be determined unambiguously; for example, it is not clear whether the relation 'avert' corresponds to only that word or to 'is possible to avert'. Thus, unless manual or automatic preprocessing is done to determine the words to which the labels should be assigned, the encoding scheme in (4.1) cannot be used. Two alternative annotation schemes were devised, which will be described below. Please note that both these schemes use the same number of labels. Moreover, as the features for Model 2 were generated by adding up all the features generated per word, the two models have approximately the same number of features. The difference between the models lies in the number of tokens and the distribution of the features over the tokens.

Model 1: Per word encoding

One option, if the exact words corresponding to the labels are not known, is to assign the labels belonging to a sentence to all the words in that sentence. Thus, the contextual predicates are directly related to the words that generated them, but the labels, since they are based on the sentences, do not directly correspond to the tokens. This would yield the following encoding:

    Only            if              negotiations    succeed         it              is              ...
    s:Negotiations  s:Negotiations  s:Negotiations  s:Negotiations  s:Negotiations  s:Negotiations
    r:avert         r:avert         r:avert         r:avert         r:avert         r:avert
    o:War           o:War           o:War           o:War           o:War           o:War
                                                                                              (4.2)

Model 2: Per sentence encoding

As an alternative encoding, it is possible to consider the sentences to be the tokens rather than the words. Although this might make it more difficult for the model to use positional features, it creates a one to one correspondence between labels and tokens:

    Only if negotiations succeed it is possible to avert the full-scale war that ...
    s:Negotiations  r:avert  o:War
                                                                                              (4.3)
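To make the two encodings concrete, the sketch below turns one annotated sentence into training tokens under both schemes; the tuple representation of a token is a simplification of what the actual system uses.

```python
def encode_per_word(sentence, triple):
    """Model 1: every word in the sentence becomes a token
    carrying the full (subject, relation, object) label."""
    return [(word, triple) for word in sentence.split()]

def encode_per_sentence(sentence, triple):
    """Model 2: the whole sentence is a single token with one label."""
    return [(sentence, triple)]

sentence = "Only if negotiations succeed it is possible to avert the full-scale war"
triple = ("s:Negotiations", "r:avert", "o:War")

print(len(encode_per_word(sentence, triple)))      # one token per word
print(len(encode_per_sentence(sentence, triple)))  # 1
```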

4.5 Chapter Summary

This chapter gave an overview of the NET corpus and the method by which it was annotated, as well as some statistics of the corpus. Using a small example, two different annotation schemes were described: Model 1 assigns the labels belonging to a sentence to each word in that sentence, while Model 2 treats the whole sentence as one token. Both models have approximately the same number of labels and features, and differ in the number of tokens and the distribution of features over the tokens.

Chapter 5 Experimental Work

This chapter describes the experimental part of the thesis. After giving a definition of the problem and explaining the experimental setup, it will describe the features used by the MEMM. Then, the achieved results will be given as well as an analysis of the remaining mislabellings by the models.

5.1 Problem Definition and Experimental Setup

The complete coding process for the NET method, and therefore potentially the task definition for this report, is to find all subject/relation/object triples in a document collection. This task is seen as a sequencing task: for each token, the task is to assign such a triple based on the observed features of the current token and the previous triple. By seeing this as a sequencing task rather than as a set of independent classification tasks, it is possible to use features that range over more than one token and to use the globally optimal sequence of labels rather than the set of locally optimal labels.

Even if we limit ourselves to predicting subject-object pairs, or if we draw the relations from a small set of possible relations, the size of our total label space, which contains every possible combination of one thousand possible subjects and the same number of objects, is approximately a million. Thus, we are in the situation described in chapter 3 of having more possible labels than training tokens, which number approximately 250,000 for word level tokens (Model 1), or 15,000 for sentence level tokens (Model 2).

From the methods discussed in that chapter, it was decided to use the partial model approach explained in section 3.2.1, training separate models for each of the constituents of the triple. Moreover, a number of features were used that generalise over multiple labels, as explained in section 3.2.2. Although it would be interesting to conduct an experiment comparing the different methods in a systematic manner, such a comparison is beyond the scope of this report and is deferred to future work.

Both subject and object are drawn from the total entity list for the project, yielding in this case approximately one thousand labels. Although this number is certainly less prohibitive than the label-space sizes mentioned above, for practical reasons it was decided to divide these labels into categories and use two models, one to predict the class of the label and one to predict the actual label, along the lines of the method proposed by Goodman (2001). This is also expected to ameliorate the imbalanced data problem by grouping the data into more balanced categories. The full experimental setup is visualized in figure 5.1. Since the entities were given in a hierarchical structure, that structure was used to divide the entities into 47 classes in total.

Figure 5.1: Experimental Setup. From the observations, one model predicts the class of the subject and a second predicts the subject given that class and the observations; a third model predicts the predicate; a fourth predicts the class of the object and a fifth predicts the object given that class and the observations. Together, the outputs form the triple.

It is assumed that all these models face more or less the same difficulties. Thus, it was decided to concentrate on one of these models and perform an in-depth evaluation of the strengths and shortcomings of the chosen method. The chosen model, the prediction of the class to which the subject belongs, is indicated by the thick border in the top left corner of figure 5.1.

A further simplification is made in that it is assumed that each natural sentence has exactly one triple. Thus, sentences that did not have a triple were removed from the corpus, and for sentences containing multiple triples all but the first were likewise removed.

All in all, this simplified task can be defined as follows: given a list of categories and a collection of documents, determine the category of the subject of each sentence in each document. To perform this task, a Maximum Entropy Markov Model was created and trained on the training corpus. The rest of this chapter describes the features used for this model, the results obtained, and an analysis of the errors made.
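The two-stage prediction of figure 5.1 amounts to choosing a class first and then an entity within that class, roughly factorising p(entity | obs) into p(class | obs) and p(entity | class, obs). The sketch below shows this combination step with placeholder scoring functions standing in for the trained Maximum Entropy models; all names are hypothetical.

```python
def predict_entity(obs, classes, entities_by_class, class_model, entity_model):
    """Two-stage prediction: choose the class first, then the entity
    within that class, in the spirit of Goodman (2001)."""
    best_class = max(classes, key=lambda c: class_model(obs, c))
    candidates = entities_by_class[best_class]
    return max(candidates, key=lambda e: entity_model(obs, best_class, e))

# Toy usage with hand-written scoring functions standing in for trained models.
entities_by_class = {"party": ["pvda", "vvd"], "person": ["fortuyn"]}
class_model = lambda obs, c: 1.0 if c in obs else 0.1
entity_model = lambda obs, c, e: 1.0 if e in obs else 0.1

print(predict_entity("the party pvda said", entities_by_class.keys(),
                     entities_by_class, class_model, entity_model))
```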

5.2 Features

In Maximum Entropy modelling, feature templates are normally used to automatically generate the actual features. These templates often have placeholders instead of words and labels, and the actual features are generated by inserting the actual words and labels into these placeholders. This section describes the different feature templates used to generate the features for the models. These templates can be divided into two groups: the first group consists of the traditional contextual predicates that combine with each label to form a feature, and the second group consists of features that fire for specific relations between the word and the label.

5.2.1 Contextual Predicates

Orthographic and Positional Predicates:

- Position in sentence, both the absolute position in the sentence and a number from one to three indicating the part of the sentence in which the word occurred.
- A predicate indicating whether the current word is capitalized or not and whether it is punctuation.
- A predicate that fires only if the current word starts a new sentence.
- A predicate that fires only if the current word is one of the definite articles 'de' and 'het'.
- A predicate that fires only if the current word is a personal pronoun.

POS-tag and parse output predicates:

- Postag: a predicate template for the current POS tag.
- Isname: fires if and only if the current word is tagged as a name.
- Isnoun: fires if and only if the current word is tagged as a noun.
- Grammatical function: a predicate template for the grammatical function of the word as determined by the parser. Can be subject, object, predicate or none.
- Strict grammatical function: restricts the subject and object categories to nouns, names, and pronouns, and the predicate category to verbs.

Thesaurus features:

Researchers in Coreference and Nominal Anaphora Resolution, as well as Text Summarization, often use WordNet or thesaurus information for synonym and hypernym information to determine possible (co)reference (Harabagiu and Maiorano 1999; Amit 1998; Barzilay and Elhadad 1997). As the current task can also be defined as determining coreference between the entities and their instantiations in the text, the following thesaurus feature was introduced.

- Predicate templates for the three abstraction categories in the Brouwers Dutch thesaurus.

Anaphora resolution:

In written text, after an entity is introduced it is usually referred to using a shortened form such as a pronoun, definite noun phrase, or abbreviated name. Since this shortened form might contain too little information to determine the entity to which it refers, it can be necessary to resolve its antecedent in order to find out what entity it is an instantiation of. In other words, it might be beneficial to incorporate some form of anaphora resolution in our system to provide additional information about these references. Although complicated and knowledge heavy anaphora resolution systems have been proposed, research suggests that a reasonable accuracy can be obtained using simple positional features (Lappin and Leass 1994; Kennedy and Boguraev 1996; Kameyama 1997). The features below are aimed at providing a simple anaphora resolution system at low cost.

- A predicate template for the last seen noun or name if the current word is a pronoun.
- As above, but finds the nearest name or noun belonging to the same grammatical function.
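A few of the orthographic and positional predicates above could be generated along the following lines; the predicate names are illustrative, and the real system additionally uses POS tags, parser output, thesaurus categories and the anaphora predicates.

```python
def contextual_predicates(words, i):
    """Generate simple contextual predicates for the word at position i."""
    word = words[i]
    preds = [
        f"position={i}",                          # absolute position in sentence
        f"part={min(2, 3 * i // len(words))}",    # which third of the sentence
        f"capitalised={word[:1].isupper()}",
        f"punctuation={not word[0].isalnum()}",
    ]
    if i == 0:
        preds.append("sentence_start")
    if word.lower() in ("de", "het"):             # Dutch definite articles
        preds.append("definite_article")
    return preds

words = "Alleen als de onderhandelingen slagen".split()
print(contextual_predicates(words, 2))
```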

5.2.2 Generalising Features

The features that fire for a certain relation between word and label can be subdivided into two categories. The first category contains features that only use the orthographic aspects of the word and the labels, while the other category uses external information in the hope of establishing a more semantic relationship between label and word. For the category-predicting models, the value of each of these features for a word and a category is the best score for any of the labels in that category. These features are expected to help with the labels that were sparse or underrepresented in the training corpus.

Surface features:

Research on the MUC-6 and MUC-7 coreference task suggests that using (sub)string match as the basis for coreference resolution can yield good results in combination with positional and semantic features (MUC-6 1995; MUC-7 1998; McCarthy and Lehnert 1995; Soon et al. 1999).

- String match: fires if the word matches the phrase describing the label.
- Substring match: fires if the word is a substring of the phrase describing the label. To prevent the matching of very short words to almost all labels, a minimum word length of 3 characters was used.
- String distance: the number of characters that differ between the word and the phrase describing the label, divided by the length of the shortest word. For the category-predicting models, this is the best score over all labels in the category.

Semantic features:

Whereas the string distance features try to establish superficial similarities between an observed word and the phrase describing a possible label, these features try to establish semantic similarity. In the literature on coreference resolution and nominal anaphora resolution, this is often achieved using thesaurus information or collocation or co-occurrence statistics derived from an unlabeled corpus or even the Internet (Soon et al. 1999; Kameyama 1997; Markert et al. 2003). Unfortunately, for practical reasons it was not possible to use the Dutch WordNet, and using the Google API, which is limited to 500 queries per day, it would take 7 years to compute the collocation statistics between all words and labels in the vocabulary, while other search engines forbid automated querying of their index. Thus, the only feature used here is based on the publicly available manifestos of the political parties; if this feature template contributes significantly to the final result it might be worthwhile to acquire the resources required for other semantic features, such as a Dutch WordNet subscription or permission to use the index of an Internet search engine.

- Manifesto: fires only for the labels indicating political parties, and fires only if the word occurs more often in the manifesto of that political party than in those of the other parties, corrected for the length of the manifesto and with a minimum difference of 10 percent.
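The surface features can be sketched as follows, under the assumption that each label is described by a short phrase; the normalisation of the string distance and the category-level 'best score' follow the descriptions above but are one possible reading of them.

```python
def substring_match(word, label_phrase, min_len=3):
    """Fires if the word (of at least min_len characters) is a substring
    of the phrase describing the label."""
    return 1.0 if len(word) >= min_len and word.lower() in label_phrase.lower() else 0.0

def string_distance(word, label_phrase):
    """Character positions that differ between word and label phrase,
    divided by the length of the shorter of the two."""
    diffs = sum(a != b for a, b in zip(word.lower(), label_phrase.lower()))
    diffs += abs(len(word) - len(label_phrase))   # count the unmatched tail
    return diffs / min(len(word), len(label_phrase))

def category_score(word, labels_in_category, feature):
    """For the category-predicting models: best score over the category."""
    return max(feature(word, lab) for lab in labels_in_category)

print(substring_match("pvda", "pvda (dutch labour)"))                    # 1.0
print(category_score("fortuyn", ["Pim Fortuyn", "LPF"], substring_match))
```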


5.3 Results

For labelling and sequencing tasks, the most common performance metrics are precision and recall, both measured per possible label. Precision records how often the model was correct when it predicted the label, and recall measures how many of the occurrences of the label the model predicted correctly. The F-score for a label is the harmonic average of precision and recall, defined as F = 2pr / (p + r). As a measure of total performance, the average F-score over all labels, weighted by the frequency of the label in the training set, can be used.

Both the per word (Model 1) and per sentence (Model 2) models illustrated in section 4.4 were evaluated on a per sentence basis. Since no training tokens exist where the current label is different from the previous label and the current token is not the first in the sentence, there were almost no sentences where the per word model assigned more than one distinct label to the words. In the few cases where it did happen, the first label was chosen for the evaluation.

The first baseline result is the result achieved by assigning the most frequent label to each token. The second baseline is the result acquired by retraining the publicly available MXPOST tagger described by Ratnaparkhi (1996) on the training set (with the labels assigned per word) and using it to evaluate the test set. The MXPOST tagger is an MEMM-based POS tagger and does not contain any features specific to the current task or even specific to Dutch. This result is presented to give an indication of the inherent difficulty of the task compared to other labelling problems.

Table 5.1 lists the precision, recall, and F-score for the individual models and the baseline models. Table 5.2 gives the performance of the two investigated models on the 14 most frequent labels. The configuration used for both models will be discussed in section 5.4 below.
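The evaluation measure can be written out directly: per-label precision, recall and F-score, combined into a frequency-weighted average. This is a straightforward rendering of the definitions above rather than the exact evaluation script used.

```python
from collections import Counter

def weighted_f_score(gold, predicted):
    """Per-label F-scores, averaged with weights proportional to the label
    frequency (the report uses frequencies from the training set)."""
    freq = Counter(gold)
    total_f = 0.0
    for label, n in freq.items():
        tp = sum(g == p == label for g, p in zip(gold, predicted))
        prec = tp / max(1, sum(p == label for p in predicted))
        rec = tp / n
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total_f += n * f
    return total_f / len(gold)

gold = ["pvda", "pvda", "vvd", "reality"]
predicted = ["pvda", "vvd", "vvd", "pvda"]
print(round(weighted_f_score(gold, predicted), 3))
```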

5.4 Configuration

There are a number of configuration options for Maximum Entropy Markov Models, the most obvious being which features or which feature templates to use.


                        Model 1   Model 2   Most frequent (baseline)   MXPOST (baseline)
    Average Precision     0.300     0.400                      0.040               0.304
    Average Recall        0.304     0.406                      0.199               0.303
    Average F-Score       0.302     0.385                      0.066               0.258

Table 5.1: Performance versus baseline

Apart from the feature templates, two important parameter settings can affect the performance of MEMMs. A maximum number of iterations can be set to prevent overfitting the training data, and cutoff smoothing can be used to eliminate all features that have not been seen at least n times in the training data. Even restricting ourselves to the six broad groups of feature templates described in section 5.2, 7 different cutoff settings, and 7 different settings for the number of training iterations, an exhaustive search of the configuration space would require 2^6 x 7 x 7 = 3136 experiments. Since each model takes a number of hours to train, this number of experiments is not practical. Thus, it was decided to optimize one option at a time by setting the other parameters to some intuitively sensible level.
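The size of the full grid and the one-option-at-a-time alternative can be illustrated as follows; train_and_score is a hypothetical stand-in for training a model with a given configuration and scoring it on the held-out set, and the default and candidate iteration values are assumptions:

    FEATURE_GROUPS = 6                             # each group can be switched on or off
    CUTOFFS = [0, 2, 3, 4, 5, 10, 20]              # 0 = no cutoff (values as in table 5.3)
    ITERATIONS = [25, 50, 75, 100, 150, 200, 250]  # hypothetical grid of 7 settings

    full_grid = 2 ** FEATURE_GROUPS * len(CUTOFFS) * len(ITERATIONS)
    print(full_grid)                               # 2^6 x 7 x 7 = 3136 configurations

    def tune_one_at_a_time(train_and_score):
        """Start from intuitively sensible defaults and optimise one option at a time."""
        config = {"cutoff": 0, "iterations": 100}
        for option, values in [("iterations", ITERATIONS), ("cutoff", CUTOFFS)]:
            config[option] = max(values, key=lambda v: train_and_score({**config, option: v}))
        return config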

Maximum Number of Iterations

Using the full set of feature templates described in section 5.2, and no cutoff smoothing, it was found that for Model 1 performance on the held-out set increased until the convergence point at around 90 iterations, while for Model 2 the optimal number of iterations was found to be approximately 50, as shown in figure 5.2.

Cutoff Level

Still using the full set of feature templates, and allowing Model 1 to converge while restricting Model 2 to 50 training iterations, the optimal cutoff level was determined. It was found that using a cutoff only decreased performance for both models, as shown in figure 5.3, while training time decreased only slightly for both models, as can be seen in table 5.3. Training time was measured on a single machine with 1024MB of memory and a ?? processor.
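A count cutoff of n amounts to discarding rare features before parameter estimation. A minimal sketch, assuming each training event is a pair of contextual predicates and a label, and each feature is a (predicate, label) pair:

    from collections import Counter

    def apply_cutoff(events, n):
        """events: list of (predicates, label) training events.
        Keeps only features (predicate, label) seen at least n times."""
        counts = Counter((pred, label)
                         for predicates, label in events
                         for pred in predicates)
        keep = {feat for feat, count in counts.items() if count >= n}
        return [([p for p in predicates if (p, label) in keep], label)
                for predicates, label in events]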


                           Model 1                          Model 2
    label          n    Precision  Recall  F-score    Precision  Recall  F-score
    reality       164      0.27     0.55     0.36        0.30     0.47     0.37
    pvda          148      0.34     0.56     0.42        0.53     0.60     0.56
    vvd           101      0.49     0.39     0.43        0.58     0.55     0.57
    de politiek    86      0.30     0.21     0.25        0.33     0.45     0.38
    lpf            60      0.40     0.27     0.32        0.48     0.36     0.41
    binnenland     33       -       0.00     0.00        0.25     0.09     0.13
    groep          32       -       0.00     0.00        0.20     0.03     0.05
    cda            24      0.58     0.29     0.39        0.50     0.21     0.29
    groenlinks     55      0.33     0.41     0.38        0.64     0.50     0.56
    d66            16      0.57     0.27     0.36        0.52     0.75     0.62
    buitenland     14      0.45     0.36     0.40        0.39     0.50     0.44
    paars2         14      0.33     0.07     0.12        0.67     0.14     0.24
    rechts         14      0.33     0.14     0.20        0.30     0.21     0.25
    ln             14      0.33     0.07     0.12        0.44     0.57     0.50

Table 5.2: Performance per label

Figure 5.2: Performance (F-score) of Model 1 and Model 2 for different maximum numbers of iterations (50-250).

Active Feature Templates

Finally, using the above results for cutoff level and number of iterations, the effect of using different sets of features was investigated. Since even 2^6 = 64 is a relatively large number, it was decided to test the effect on the total performance of leaving out each feature group defined in section 5.2 apart from the basic 'positional and orthographic' features.


                             Model 1                        Model 2
    Cutoff threshold   # of features  Training time   # of features  Training time
    no cutoff              1,406,638       04:46:37       1,406,748       02:52:04
    2                        791,737       04:44:29         791,753       03:01:08
    3                        567,970       04:19:24         567,986       02:55:49
    4                        437,921       04:08:18         437,937       02:40:40
    5                        360,559       04:14:02         360,575       02:44:22
    10                       190,748       04:04:45         190,763       02:37:23
    20                       100,038       04:18:18         100,053       02:36:09

Table 5.3: Number of features and training time for different cutoff thresholds

Figure 5.3: Performance (F-score) of Model 1 and Model 2 for different cutoff thresholds (0-10).

Apart from giving an indication of the best feature configuration to use, this also gives an idea of the contribution of each feature group to the total performance. A more elaborate investigation of the optimal configuration and of the effect of the individual templates would probably yield interesting insights, but is beyond the scope of this report. Table 5.4 below lists the performance obtained by leaving out each of the feature template groups and the contribution of each template group, which is defined


as the performance of the full set minus the performance of the set without that feature group. For reference, the full set performance is given in the top row.

                                     Model 1                       Model 2
    Feature group            Performance  Contribution    Performance  Contribution
    Full set                       0.290        -               0.385        -
    Stringmatch features           0.258       0.032            0.367       0.018
    Semantic features              0.270       0.020            0.371       0.014
    Anaphora Resolution            0.294      -0.004            0.344       0.041
    Thesaurus Features             0.283       0.007            0.381       0.004
    Syntactic Features             0.302      -0.012            0.374       0.011

Table 5.4: Contribution of the different feature groups

Thus, it can be seen that the two generalising feature groups, the Stringmatch and Semantic features, increased performance in both cases. The Anaphora Resolution component strongly increased performance for Model 2 while it actually decreased performance for Model 1, which might be due to the less elaborate input into Model 2. The Thesaurus features helped slightly in both cases, and the Syntactic features improved Model 2 but worsened Model 1. It is conceivable that Model 2 has more need of the syntactic preprocessing since it ignores word order.
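The ablation procedure behind table 5.4 can be sketched as follows; train_and_score is again a hypothetical stand-in for training with a given set of feature template groups and evaluating the weighted average F-score:

    FEATURE_GROUPS = ["Stringmatch", "Semantic", "Anaphora Resolution",
                      "Thesaurus", "Syntactic"]

    def ablation_study(train_and_score, basic=("positional and orthographic",)):
        """Returns the full-set score and, per group, the score without that
        group and its contribution (full minus ablated)."""
        full = train_and_score(list(basic) + FEATURE_GROUPS)
        results = {}
        for group in FEATURE_GROUPS:
            remaining = [g for g in FEATURE_GROUPS if g != group]
            without = train_and_score(list(basic) + remaining)
            results[group] = (without, full - without)
        return full, results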

5.5 Error analysis

5.5.1 Confusion Matrices

Tables 5.5 and 5.6 give the confusion matrices for the 10 most common labels for the two models using the full feature set. From these matrices, a number of interesting conclusions can be drawn.

Performance on the political parties (PvdA, VVD, LPF, CDA, GroenLinks, D66, LN; see table 5.2) seems to be better than on the other labels. In fact, the average F-scores on only these labels are 0.354 and 0.517, respectively, compared to 0.264 and 0.322 for the whole set displayed above. This might be due to


                   reality  pvda  vvd  de pol  lpf  binnenl  groep  cda  groenl  d66
    reality             66    23   14      17   13        2      7    4       1    3
    pvda                32    71    9       9    2        0      4    6       0    7
    vvd                 15    22   36       5    0        0      2    0       1    6
    de politiek         28    17    8      21    1        0      1    0       0    1
    lpf                 20     8    1       4   14        0      4    0       0    1
    binnenland          10     4    3       1    7        3      2    1       0    1
    groep                9     7    3       2    3        0      4    1       0    0
    cda                  5     5    6       1    0        0      0    5       0    0
    groenlinks           5     3    1       3    0        0      1    0       5    0
    d66                  4     1    3       0    1        0      0    0       0    4

Table 5.5: Confusion matrix Model 1

                   reality  pvda  vvd  de pol  lpf  binnenl  groep  cda  groenl  d66
    reality             79    21   15      22    7        1      2    1       1    2
    pvda                33    89    6       3    7        1      1    1       0    3
    vvd                 23     8   56       0    3        0      0    0       0    3
    lpf                 16     3    1      27    4        1      0    0       0    0
    de politiek         26    11    8       3   31        1      0    1       2    0
    groep               12     4    0       8    1        3      0    0       2    0
    binnenland          14     6    2       4    0        0      1    0       0    0
    cda                 11     4    1       0    1        0      0    5       0    0
    groenlinks           2     0    0       2    0        4      0    0       9    0
    d66                  3     0    0       0    0        0      0    1       0   12

Table 5.6: Confusion matrix Model 2

the fact that political parties or their members are often referred to using some form of their name, which is easy to learn and recognize. The slightly above average score for 'buitenland' (foreign affairs) might also be a result of this, as country names will often be used in connection with foreign affairs, although the total number of references is too low to make any strong claim about that. Moreover, performance generally decreases slightly as the number of examples for a label decreases. This might be due to the imbalanced data problem. A way to


circumvent this problem might be to create more well-balanced clusters, so that all labels are seen with approximately the same frequency. This might be difficult to achieve without making the classes less obviously related to the existing entity hierarchy, which might decrease overall performance unless another sensible clustering can be found. Whether the increase in performance on the smaller classes compensates for this will have to be determined empirically.

The largest source of confusion seems to be the 'reality' label. As this label is defined in negative terms, namely that something happened without another entity being the cause, it is not surprising that the model has difficulty finding indicators for this label.

Confusion between the political parties seems relatively low, except for PvdA/VVD. This might be explained by the fact that they were coalition partners in the pre-2002 cabinet and were often connected with the same issues.
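The confusion matrices in tables 5.5 and 5.6 can be built directly from the per-sentence gold and predicted labels, as in this short sketch:

    from collections import defaultdict

    def confusion_matrix(gold, pred):
        """Returns matrix[gold_label][predicted_label] = count."""
        matrix = defaultdict(lambda: defaultdict(int))
        for g, p in zip(gold, pred):
            matrix[g][p] += 1
        return matrix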

5.5.2 Example sentences

Below are some example sentences that illustrate some of the strengths and weaknesses of the models. Looking at such examples can yield valuable insights into the sort of information we use to solve the problem, and thus into what additional features might be used to allow the model to solve it as well.

Named entities

Presumably, the easiest entities to recognize are those that occur in the form of a directly identifiable named entity. Usually, this will be either political parties, members of such parties, or country names. Consider:

1. GroenLinks schept de meeste banen.
   GreenLeft creates the most jobs.
   Actual: GreenLeft; Model 1: GreenLeft; Model 2: GreenLeft

2. Kader PvdA tegen de JSF.
   Party leadership PvdA against the JSF.
   Actual: PvdA; Model 1: Reality; Model 2: PvdA


3. En wederom wilde Pim Fortuyn weglopen.
   And again Pim Fortuyn [founder/leader of LPF] wanted to walk away.
   Actual: LPF; Model 1: PvdA; Model 2: LPF

4. Kalsbeek wil cursus voor ouders criminelen.
   Kalsbeek [secretary of state for PvdA] wants courses for parents of criminals.
   Actual: PvdA; Model 1: PvdA; Model 2: PvdA

5. Macedonie dringt aan op Nederlandse troepen.
   Macedonia insists on Dutch troops.
   Actual: Foreign; Model 1: Foreign; Model 2: Foreign

One would expect these entities to be easily recognized, and as stated above performance is better on those categories that are often instantiated as named entities. However, it is striking that Model 1 fails to recognize some of the above examples, which were selected for being short and fairly obvious. Although it is difficult to make definite statements about the workings of complicated statistical models such as the MEMMs that produced these results, an intuition is that the Label Bias Problem (see also section 5.5.3) might make it difficult for Model 1 to react properly to evidence presented after the beginning of the sentence, as in the mislabelled examples above. Experiments to determine whether this is indeed the case, apart from using a model that is not susceptible to this problem, might include systematically comparing performance on sentences where the named entity occurs in the first token to those where it does not, and determining performance on identifying the object of the sentence, which will generally occur later in the sentence due to the standard subject-object word order in Dutch.

Multiple triples

Consider the following sentences:

1. Dijkstal en Melkert leiden ons om de tuin.
   Dijkstal [minister for VVD] and Melkert [minister for PvdA] fool us.
   Actual: VVD; Per word: VVD; Per sentence: VVD


2. VVD en GroenLinks weten de komende regeerperiode de meeste banen te creeren.
   VVD and GreenLeft create the most jobs in the next term.
   Actual: GroenLinks; Per word: VVD; Per sentence: GroenLinks

These sentences illustrate a potential problem created by the simplification that only the first triple was kept for sentences corresponding to multiple triples. Both sentences have a similar grammatical subject, namely "NamedEntity and NamedEntity", where the named entities are either political parties or one of their members. Both sentences actually had triples with both entities as subjects, but for both sentences only one of these was kept. Although Model 2 labelled both instances correctly, we cannot expect average performance on such sentences to be higher than 50%.

5.5.3 Label bias problem

In MEMMs there is a probability distribution over all possible states conditioned on the previous state and the current observation: p(S_t | O_t, S_{t-1}). Since this is a probability distribution, and therefore must sum to one, all probability mass arriving in a state must be passed on to its successor states. This gives rise to the Label Bias Problem as described by Lafferty et al. (2001).

The Label Bias Problem has a number of consequences. Generally, it means a disproportional preference for states with few probable successors, or low-entropy states, over states with more successors. Moreover, whole branches cannot be downgraded based on observations, as exemplified in the rib/rob example in Lafferty et al. (2001), since all probability mass passed onto a branch is preserved in the branch, as the individual state transition distributions each sum to one. Thus, distinguishing observations can be ignored if they occur after a crucial branch point.

Since for Model 1 it was decided to assign the label belonging to the sentence to each of the words in the sentence, all words in a sentence will by necessity have the same label. Thus, the probability that the current label is not equal to the previous label will be zero for all non-sentence-initial words. This leads to a Markovian network similar to the one displayed in figure 5.4 below, where the possible labels are A to Z and the first sentence is four tokens long.


Figure 5.4: Graphical structure: from the START state, each possible label A to Z forms its own row of per-token states (A1, A2, ..., A5, ...; B1, B2, ..., B5, ...; ...; Z1, Z2, ..., Z5, ...).

One of the characteristics of locally optimized conditional models is that the total probability mass assigned to a branch cannot be reduced within that branch. Thus, all probability mass arriving in A1 must be passed on to A4, and the observations corresponding to the non-initial words in that branch will be ignored, as, independent of the observation, the probability of retaining the current label will be one. In other words, the label assigned to a sentence will be based solely on the observations relating to the first word in that sentence.

Since the second method avoids this problem by collapsing all features corresponding to the words in a sentence into one token, this would also explain the relatively high performance of this method compared to the first, even though it does not take the position of a word in the sentence into account except through the positional features. However, the increased performance might also be accounted for by sparse statistics on the side of the first method, as there is always a tradeoff between accurate models and accurate assessment. Even though it is very difficult to give accurate explanations for the performance of statistical models, given the above it is not unthinkable that the Label Bias Problem plays a significant role in explaining the results presented here. To see whether this is indeed the case, it would be interesting to repeat this experiment using a method that does not suffer from the Label Bias Problem, either a generative model such as a Hidden Markov Model, or a globally optimized Conditional Random Field.
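The effect can be made concrete with a small numeric example (all probabilities hypothetical): with locally normalised transitions and a forced within-sentence self-transition, the score of a label for the whole sentence is fixed by the first token, whatever the later observations are.

    # Two candidate labels for a four-token sentence.
    # p(label | START, first_word) is informed by the first observation only;
    # inside the sentence p(same_label | previous_label, word) = 1.0, because the
    # training data never shows a label change within a sentence.
    p_first = {"A": 0.6, "B": 0.4}   # hypothetical values derived from the first token
    p_inside = 1.0                   # forced: probability mass cannot leave the branch

    def sequence_score(label, n_tokens=4):
        score = p_first[label]
        for _ in range(n_tokens - 1):
            score *= p_inside        # later observations cannot change the ranking
        return score

    print(sequence_score("A"), sequence_score("B"))  # 0.6 vs 0.4, regardless of words 2-4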


5.6 Chapter Summary

This chapter described the empirical part of the thesis. It gave a working definition of the annotation task and explained the setup and features used for the model. The achieved results were presented and a breakdown of the strong and weak points of the model was given. Although there are many possible explanations for the results, the Label Bias Problem was identified as a possible limitation on the accuracy of the method used for the current task.

Chapter 6

Conclusion and Future Work

As the conclusion of this report, this chapter gives a number of ideas for future work to improve the results described, and sums up the conclusions that can be drawn from the theoretical and empirical findings of the report.

6.1 Future work

Given the possible negative effect of the Label Bias Problem on the results presented in this report, a promising avenue of further investigation might well be to use a Conditional Random Field sequencing model instead of the Maximum Entropy Markov Model used here. Although this might aggravate the computational difficulty of the parameter estimation, it should give a clear indication of the severity of the Label Bias Problem for the current task and method, and of whether future improvements should be pursued in the context of Maximum Entropy Markov Models or Conditional Random Fields.

Another possibility for increasing the performance of the model is the use of more features. For instance, WordNet glosses and synsets have been used successfully in metonymy and nominal anaphora resolution, and it might be worthwhile to investigate whether the Dutch WordNet could be a useful source of semantic information, for example by using the distance between word and label as a feature, or by testing for word overlap in the WordNet definitions of the word and the label. This is especially promising given the positive contribution of the Semantic feature group to the final result.


Additionally, if the Label Bias Problem is not the main cause of the performance difference between Model 1 and Model 2, it might be interesting to experiment further with Model 2 and use contextual predicates that are less strictly word-based, but rather aimed at extracting certain information from the whole sentence. This might open the way for more knowledge-heavy features that make use of syntactic or even some form of semantic preprocessing.

In order to exploit possible interactions between the sequencing tasks that are treated as distinct in this report, it could be interesting to replace the different models by one model and use the current features as features generalising over multiple labels. That way, it would be possible to use multiple overlapping categorisations instead of the current disjoint categorisation, and to use features that include information about both the subject and the object label. For this it will be necessary to find or create an efficient estimation program that allows this type of feature while still keeping training time and input format manageable.

Even if accuracy can be brought to acceptable levels through the above means, this will presumably only be achieved with large amounts of training data. The relation between the amount of training data and final performance should be investigated in order to estimate how much data is needed to make the results from the trained model useful, and whether accuracy is converging for the amount of training data available for the investigated project. As the list of entities will almost certainly change between projects, it is also an interesting question how and to what extent the training data from one project can be used for labelling articles in a new project. The utility of a method that requires all labelled data from one project to achieve reasonable accuracy is limited unless this data can somehow be carried over to future projects.

Another possibility for reducing the amount of training data required to achieve acceptable performance is Active Learning (Cohn et al. 1995). By allowing the computer to propose articles for human annotation, it might be possible to achieve the same performance at a lower annotation cost. If the data from previous projects can be carried over to new projects even in a limited way, this might be a good way to kickstart the active learning process.


6.2 Conclusion

This report investigated the possibility of using sequencing methods to extract semantic information from natural text by labelling the instantiations of a list of interesting entities in that text. One of the problems encountered in doing so is the large number of possible labels, and a number of methods for dealing with this were investigated. One idea was to use Maximum Entropy features that generalise over sets of labels to reduce problems of data sparseness and imbalanced data. Another possibility investigated was to split the model into an assembly of smaller models that each deal with part of the problem. Using the latter method, the problem was split into a number of subproblems and a Maximum Entropy Markov Model was trained on one of these subproblems. Maximum Entropy features that generalised over the labels were also used and were found to increase performance by three to five percentage points. Even though significant improvements over a baseline model were achieved, a number of theoretical and practical issues remain to be investigated, giving good hope for future improvements to the model.

Bibliography

Amit, B. (1998). Evaluation of Coreferences and Coreference Resolution Systems. In Proceedings of the First Language Resource and Evaluation Conference.

Barzilay, R. and M. Elhadad (1997). Using lexical chains for text summarization. In Proceedings of the Intelligent Scalable Text Summarization Workshop (ISTS'97), ACL, Madrid, Spain.

Berger, A. (1999). Error-correcting output coding for text classification. In IJCAI'99: Workshop on machine learning for information filtering.

Berger, A. L., S. A. Della Pietra, and V. J. Della Pietra (1996). A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics 22(1), 39-71.

Borthwick, A., J. Sterling, E. Agichtein, and R. Grishman (1998). Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In Proceedings of the Sixth Workshop on Very Large Corpora, New Brunswick, New Jersey. Association for Computational Linguistics.

Cohn, D. A., Z. Ghahramani, and M. I. Jordan (1995). Active Learning with Statistical Models. In G. Tesauro, D. Touretzky, and T. Leen (Eds.), Advances in Neural Information Processing Systems, Volume 7, pp. 705-712. The MIT Press.

Cuilenburg, J. v., J. Kleinnijenhuis, and J. de Ridder (1988). Tekst en Betoog. Muiderberg: Coutinho.

Darroch, J. and D. Ratcliff (1972). Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics 43(5), 1470-1480.

Della Pietra, S., V. J. Della Pietra, and J. D. Lafferty (1997). Inducing Features of Random Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(4), 380-393.

Friedman, J. H. (1996). Another Approach to Polychotomous Classification. Technical report, Stanford University.

Goodman, J. (2001). Classes for fast Maximum Entropy Training. In ICASSP-2001.

Goodman, J. and J. Gao (2000). Language model size reduction by pruning and clustering. In ICSLP'00, Beijing, China.


Harabagiu, S. and S. Maiorano (1999). Knowledge-Lean Coreference Resolution and Its Relation to Textual Cohesion and Coreference. In D. Cristea, N. Ide, and D. Marcu (Eds.), The Relation of Discourse/Dialogue Structure and Reference, pp. 29-38. New Brunswick, New Jersey: Association for Computational Linguistics.

Japkowicz, N. (2000). Learning from imbalanced data sets: a comparison of various strategies. In Papers from the AAAI Workshop on Learning from Imbalanced Data Sets. Tech. rep. WS-00-05, Menlo Park, CA: AAAI Press.

Kameyama, M. (1997). Recognizing referential links: An information extraction perspective. Technical report, AI Center, SRI International.

Kennedy, C. and B. Boguraev (1996). Anaphora for everyone: Pronominal anaphora resolution without a parser. In Proceedings of the 16th International Conference on Computational Linguistics.

Kleinnijenhuis, J., D. Oegema, J. A. de Ridder, A. M. van Hoof, and R. Vliegenthart (2003). De puinhopen in het nieuws. Alphen a/d Rijn: Kluwer.

Kleinnijenhuis, J., J. de Ridder, and E. M. Rietberg (1997). Reasoning in Economic Discourse: An Application of the Network Approach to the Dutch Press. In C. W. Roberts (Ed.), Text Analysis for the Social Sciences: Methods for Drawing Statistical Inferences from Texts and Transcripts, pp. 191-209. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

Kubat, M. and S. Matwin (1997). Addressing the curse of imbalanced training sets: one-sided selection. In Proceedings of the 14th International Conference on Machine Learning, pp. 179-186. Morgan Kaufmann.

Lafferty, J., A. McCallum, and F. Pereira (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the 18th International Conference on Machine Learning, pp. 282-289. Morgan Kaufmann, San Francisco, CA.

Lafferty, J. and B. Suhm (1995). Cluster expansions and iterative scaling of maximum entropy language models. In Proceedings of the Fifteenth International Workshop on Maximum Entropy and Bayesian Methods. Kluwer Academic Publishers.

Lappin, S. and H. J. Leass (1994). An Algorithm for Pronominal Anaphora Resolution. Computational Linguistics 20(4), 535-561.

Malouf, R. (2002a). A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of CoNLL-2002, pp. 49-55.

Malouf, R. (2002b). Markov Models for language-independent named entity recognition. In Proceedings of CoNLL-2002, pp. 187-190. Taipei, Taiwan.


Markert, K., M. Nissim, and N. Modjeska (2003). Using the Web for Nominal Anaphora Resolution. In R. Dale, K. van Deemter, and R. Mitkov (Eds.), Proceedings of the EACL Workshop on the Computational Treatment of Anaphora, Budapest, Hungary, pp. 39-46.

McCallum, A., D. Freitag, and F. Pereira (2000). Maximum Entropy Markov Models for Information Extraction and Segmentation. In Machine Learning: Proceedings of the Seventeenth International Conference (ICML 2000), Stanford, California, pp. 591-598.

McCarthy, J. F. and W. G. Lehnert (1995). Using Decision Trees for Coreference Resolution. In IJCAI, pp. 1050-1055.

MUC-6 (1995). Proceedings of the Sixth Message Understanding Conference (MUC-6). Morgan Kaufmann, San Mateo, CA.

MUC-7 (1998). Proceedings of the Seventh Message Understanding Conference (MUC-7). Published on the web site http://www.muc.saic.com/.

Platt, J., N. Cristianini, and J. Shawe-Taylor (2000). Large Margin DAGs for Multiclass Classification. In S. A. Solla, T. K. Leen, and K.-R. Muller (Eds.), Advances in Neural Information Processing Systems, Volume 12. MIT Press.

Ratnaparkhi, A. (1996). A Maximum Entropy Part-Of-Speech Tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference.

Ratnaparkhi, A. (1998). Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. thesis, University of Pennsylvania.

Reynar, J. and A. Ratnaparkhi (1997). A Maximum Entropy Approach to Identifying Sentence Boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington D.C., pp. 16-19.

Ridder, J. de (1994). Van tekst naar informatie. Ontwikkeling en toetsing van een inhoudsanalyse-instrument. Ph.D. thesis, University of Amsterdam.

Ridder, J. de and J. Kleinnijenhuis (2001). Media Monitoring using CETA: The Stock-Exchange Launches of KPN and WOL. Progress in Communication Sciences 17: Applications of Computer Content Analysis, 165-185.

Rosenfeld, R. (1994). Adaptive Statistical Language Modeling: A Maximum Entropy Approach. Ph.D. thesis, Carnegie Mellon University, Pittsburgh, PA.

Salomon, J. (2000). Support Vector Machines for Phoneme Classification. MSc thesis, University of Edinburgh.

Soon, W. M., H. T. Ng, and C. Y. Lim (1999). Corpus-Based Learning for Noun Phrase Coreference Resolution. In Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-99), College Park, Maryland, USA, pp. 285-291.


Wu, J. and S. Khudanpur (2000). Efficient Training Methods for Maximum Entropy Language Modeling. In Proceedings of ICSLP2000, Beijing, China, pp. 114-117.