Neural Networks, Part-of-Speech Tagging and Lexicons*

Nuno C. Marques** ([email protected])
Gabriel Pereira Lopes ([email protected])

Technical Report DI-FCT/UNL n. 6/98
Universidade Nova de Lisboa - Faculdade de Ciências e Tecnologia
Departamento de Informática, Grupo de Língua Natural
2825 Monte de Caparica, Portugal
http://www-ia.di.fct.unl.pt/~nmm/artigos.html

Abstract

Neural networks are one of the most efficient techniques for learning from scarce data. This property is very useful when trying to build a part-of-speech tagger. Available part-of-speech taggers need huge amounts of hand-tagged text, but for Portuguese, as for many other languages, no such hand-tagged corpora are available. In this paper we propose the cooperation of a lexical system and a neural network in such a way that the huge-training-corpus problem is overcome. The network topology we used was applied to the problem of learning the parameters of a part-of-speech tagger from a very small Portuguese training corpus and from a subset of the Susanne Corpus. The experiments carried out are discussed. The results obtained point to a correct tagging rate above 97% when we start from a hand-tagged training corpus of approximately 15,000 words. The application of our system to real texts is also described.

1. Introduction

The application potential of textual corpora increases when the corpora are annotated. The first logical level of annotation is usually part-of-speech tagging. At an upper level the text is no longer seen as a mere sequence of strings but as a sequence of linguistic entities with some natural meaning. The annotated text can then be used to introduce further types of annotation (usually by means of syntactic parsing [Marcus et al., 93] or [Marcken, 90]), or may directly (or indirectly) be used to collect statistics for different kinds of applications. Working at the word-tagging level has enabled applications such as speech synthesis [Church et al., 93], clustering [Pereira et al., 93], computational lexicography [Manning, 93], and even improved spell checking. The success of this kind of technique is certainly due to its intrinsic capability for assigning a sequence of part-of-speech tags to any sequence of words with high levels of precision using quite modest computer resources. Despite this, part-of-speech taggers are not yet as widely available as they should be, especially when we are working with languages other than English. The main problem with currently available part-of-speech taggers is the lack of tagged corpora: almost every tagger needs huge amounts of hand-tagged text.

*Work partially supported by the projects Corpus (funded by JNICT under contract number PLUS/C/LIN/805/93) and project ESA: tagging and segmentation of medieval Portuguese Corpora (funded by JNICT under contract number FCSH/C/LIN/931/95). ** Work supported by PhD scholarship JNICT-PRAXIS XXI/BD/2909/94.


Our goal when we started to work in this area was to augment the information that can be extracted from Portuguese untagged corpora without requiring any expensive and work-intensive approach. In this paper we elaborate on the advantages of joining a neural network tagger with a lexical system capable of morphological parsing and containing a lexical database with more than 100,000 base word forms. We show how close this combined system comes to our primary goal, show how it is able to overcome the huge-training-corpus problem, and present some preliminary results. We start by describing the general topology of the neural network used, from a computational perspective. In section 3 the proposed tagger is evaluated by comparing it to the Xerox tagger and by providing the results the tagger has achieved over the Susanne Corpus1 [Sampson, 95]. Finally, we describe how the neural net tagger is being successfully applied to two real-life situations.

2. System Description

2.1. Neural Networks

The main processing principle of neural nets is their capability for distributing activation patterns (learned from a training set) across their links via a learning algorithm. This is done in a way similar to the basic mechanism of the human brain. The similarity, however, ends there. The human brain is a living organ, capable of dynamically changing the strength of its connections, while an artificial neural net (at least with the models used in this work) is a parallel algorithm that, once training is complete, cannot change anymore and is only capable of classifying its input vectors in a fixed and deterministic fashion. Schmid [Schmid, 94] has already applied a neural network approach to the problem of part-of-speech tagging. Though a 96.22% performance is reported, the neural network model presented there is more complex than the one presented here, and a corpus of 100,000 words was used for training the neural net. Schmid argues that neural net based tagging achieves better results when the size of the training corpus is small. In this paper we argue in the same direction. [Nakamura et al., 90] have also done related work, but their goal was the prediction of the next word to appear in the input text; it was not intended for corpus tagging (a precision of 86.9% was obtained). The neural network topology presented here was tested and trained using a general purpose neural network simulator: the Stuttgart Neural Network Simulator (SNNS)2 [SNNS, 94]. This simulator supports the automatic definition of neural net topologies (specification of how the constituent units connect to each other). Neural units can be of three types: input units, hidden units or output units. Connection links are unidirectional, although recurrent links are supported3. Several algorithms for learning from a training set are available. In this work only standard backpropagation and momentum backpropagation were used.
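The difference between the two learning rules can be made concrete with a small sketch. The fragment below is our own illustration in Python, not part of the SNNS code; the learning-rate and momentum values are arbitrary. It shows the weight update performed by backpropagation with momentum, which reduces to standard backpropagation when the momentum term is zero.

```python
import numpy as np

def momentum_step(weights, gradient, velocity, lr=0.2, momentum=0.9):
    """One weight update of backpropagation with momentum (illustrative).

    velocity keeps an exponentially decaying sum of past gradients, which
    smooths the descent direction; with momentum = 0 this is plain
    (standard) backpropagation.
    """
    velocity = momentum * velocity - lr * gradient
    return weights + velocity, velocity

# Example: one 2x3 weight matrix updated with a dummy gradient.
w = np.zeros((2, 3))
v = np.zeros_like(w)
g = np.full((2, 3), 0.1)
w, v = momentum_step(w, g, v)
```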

1 The SUSANNE Corpus is a freely available, annotated English subset of the Brown corpus (ftp://ota.ox.ac.uk/pub/ota/public/susanne). This corpus is supplied by the University of Sussex. 2 The Stuttgart Neural Network Simulator package is freely available at the University of Stuttgart and can be obtained by anonymous ftp from host ftp.informatik.uni-stuttgart.de (129.69.211.2) in the subdirectory /pub/SNNS. 3 Neural networks can be feed-forward, where an input vector is passed from an input layer (set of units) to the next layer of units, until it finally reaches the set of output units (there are no loops: each unit is used only once), or they can be recurrent. In recurrent nets the input is passed to a hidden layer and can then be passed backwards, until it reaches the same unit (there are loops).


The values of the input units are received directly from a pattern file4. The values of the output units can be passed to a result file (when we are using the network to tag text), or can receive their values from a pattern file (when we are training the network). A pattern file can contain several vectors of input values and associated output values (an input vector is a series of values that is presented to all input units at a time). The SNNS also supports a dynamic mode: the input vectors (and the input units as well) are ordered into sets, and the output of the net is computed for a sequence of sets. That sequence can be shifted by a particular step (a parameter specifying the number of sets to jump in order to perform the next iteration). This feature is used to implement an n-gram model.

The hidden units can both receive their values from the previous units and send their values to the next units. Each unit in an SNNS neural net has two characteristic functions: the activation function and the output function. The activation function specifies the way the input links of a given unit are combined into a single value. The output function specifies further changes to that value before passing it to the other units in the network. All neural networks presented in this paper use the logistic activation function [SNNS, 94].
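To make the forward computation concrete, the sketch below shows, under simplifying assumptions of our own (a single hidden layer, weights held in plain NumPy matrices, and the n-gram window already encoded as a concatenation of tag-probability vectors), how an input window is propagated through a feed-forward net with the logistic activation function. It illustrates the computation only; it is not the SNNS implementation.

```python
import numpy as np

def logistic(x):
    # Logistic (sigmoid) activation, as used by all networks in this paper.
    return 1.0 / (1.0 + np.exp(-x))

def forward(window, w_hidden, w_output):
    """Propagate one n-gram window through a small feed-forward tagger net.

    window   -- concatenated lexical probability vectors of the words in
                the current window (one input unit per tag per position)
    w_hidden -- input-to-hidden weight matrix
    w_output -- hidden-to-output weight matrix (one output unit per tag)
    """
    hidden = logistic(window @ w_hidden)
    return logistic(hidden @ w_output)   # activation of each output (tag) unit

# Example: a 3-word window, 10 tags and 5 hidden units, with random weights.
n_tags, n_hidden, n_words = 10, 5, 3
window = np.random.rand(n_words * n_tags)
output = forward(window,
                 np.random.rand(n_words * n_tags, n_hidden),
                 np.random.rand(n_hidden, n_tags))
predicted_tag = int(np.argmax(output))
```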

2.2. The Neural Network Tagger Topology

In order to assign a part-of-speech tag to each word in the corpus, the corpus was first tokenized, then a lexicon was used to assign to each word its vector representation, and only then could we use the SNNS simulator to tag the text. These procedures were implemented using a Prolog program, the standard Unix text processing commands and the awk programming language [Church, 94]. Three main modules were implemented:

•The tokenization module is responsible for dividing the original corpus into tokens. Special care must be taken in order to preserve corpus structure and annotational information during the tagging process5. This is possible if we require the corpus to be annotated using the SGML standard.

•In the lexical module, once the corpus has been divided into tokens, each token is assigned a lexical probability vector. This is done using an internal lexicon learned from the tagged corpus. In previous work [MarquesLopes, 96a] and [MarquesLopes, 96b] this was the only source of lexical information. In this paper we report on the use of lexical information taken from the POLARIS system [LopesMarquesRocio, 94], a lexical database management system working on morphological rules and irregular word forms and a lexicon with more than 100,000 base word forms, giving rise to a lexicon with more than 10^6 inflected word forms.

•The neural network module translates a training or testing set into an SNNS pattern file and calls the SNNS simulator, either to train a neural network on a corpus or to use a trained network for tagging the corpus.

In Figure 1 we schematically illustrate the tagging process.
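A minimal sketch of how the three modules fit together is given below. The toy tag set, the whitespace tokenizer and the hard-coded lexicon are stand-ins invented for illustration; they do not reproduce the actual Prolog/awk implementation, the POLARIS lexicon or the SNNS pattern file format.

```python
TAGS = ["NOUN", "VERB", "ADJ", "DET"]            # toy tag set (illustrative)

# Toy lexical module: each known token maps to its lexical probability vector.
LEXICON = {
    "a":    [0.05, 0.00, 0.00, 0.95],
    "casa": [0.80, 0.20, 0.00, 0.00],
}
UNKNOWN = [1.0 / len(TAGS)] * len(TAGS)          # uniform guess for unknown words

def tokenize(text):
    # Tokenization module (simplified): lowercase and split on whitespace.
    return text.lower().split()

def to_patterns(text):
    # Neural network module, input side: one probability vector per token,
    # ready to be written out as rows of a pattern file for the simulator.
    return [LEXICON.get(token, UNKNOWN) for token in tokenize(text)]

if __name__ == "__main__":
    for row in to_patterns("A casa"):
        print(" ".join(f"{p:.2f}" for p in row))
```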

2.3. The Lexical Module

4 The file containing the set of activation patterns of all input and output units of a neural network. 5 We used our tagger with Medieval Portuguese texts (13th and 14th century), which require special care. These texts are normally heavily decorated (see the Text Encoding Initiative).


2.3.1. Lexical Probabilities

The coding of the input and output vectors is one of the problems we faced when a general purpose learning paradigm, such as neural networks, is used. Distinct outputs should be generated by distinct representations. According to Haykin and several other authors ([Haykin, 94], pp. 163-164), the classification function learned by a neural network approach such as the one we used approximates the a posteriori class probabilities, if we train our network with:

•Binary target values.
•Independently and identically distributed (i.i.d.) training examples6.

This way, we followed the coding proposed by Schmid [Schmid, 94], where there is a neuron for each tag. In the input units, each neuron receives a value as input. This value should be a function of the word to tag and of the frequency of the tag appearing together with the word. Normally this is done using a model derived from speech recognition: the n-gram (or HMM) model [Merialdo, 94]. This model uses the lexical probability p(w|t) (the probability of a word given a tag), which can be estimated by the formula:

p(w|t) = freq(w,t) / freq(t)

Where freq(w,t) is the joint frequency of a word and its tag and freq(t) is the tag frequency in the corpus. In his work Schmid [Schmid, 94] used the probability p(t|w) - the probability of the tag given the word. This probability can be estimated by:

p(t|w) = freq(w,t) / freq(w)

where freq(w) is the frequency of word w in the corpus. If we take those two measures into account and compare them, we will notice that usually the word frequency is much lower than the tag frequency (freq(w) << freq(t)).
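Both estimates are obtained directly from the joint frequency table. The short sketch below, using invented counts for two Portuguese word forms, shows how each probability is computed and makes the different normalisations (freq(t) versus freq(w)) explicit:

```python
from collections import Counter

# Invented (word, tag) counts, for illustration only.
pairs = [("casa", "NOUN")] * 8 + [("casa", "VERB")] * 2 + [("porta", "NOUN")] * 5

freq_wt = Counter(pairs)                      # freq(w, t)
freq_w  = Counter(w for w, _ in pairs)        # freq(w)
freq_t  = Counter(t for _, t in pairs)        # freq(t)

def p_w_given_t(w, t):
    return freq_wt[(w, t)] / freq_t[t]        # p(w|t) = freq(w,t) / freq(t)

def p_t_given_w(t, w):
    return freq_wt[(w, t)] / freq_w[w]        # p(t|w) = freq(w,t) / freq(w)

print(p_w_given_t("casa", "NOUN"))            # 8 / 13 ~ 0.615
print(p_t_given_w("NOUN", "casa"))            # 8 / 10 = 0.8
```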