International Journal of Pattern Recognition and Artificial Intelligence
Vol. 16, No. 3 (2002) 275–289
© World Scientific Publishing Company

FINITE STATE LANGUAGE MODELS SMOOTHED USING n-GRAMS∗

DAVID LLORENS† and JUAN MIGUEL VILAR‡
Departament de Llenguatges i Sistemes Informàtics, Universitat Jaume I de Castelló, Spain
†[email protected]
‡[email protected]

FRANCISCO CASACUBERTA
Dpt. de Sistemes Informàtics i Computació, Institut Tecnològic d'Informàtica,
Universitat Politècnica de València, Spain
[email protected]

∗Work partially funded by Fundación Bancaja's Neurotrad project (P1A99-10).

We address the problem of smoothing the probability distribution defined by a finite state automaton. Our approach extends the ideas employed for smoothing n-gram models. This extension is obtained by interpreting n-gram models as finite state models. The experiments show that our smoothing improves perplexity over smoothed n-grams and Error Correcting Parsing techniques.

Keywords: Language modeling; smoothing; stochastic finite state automata.

1. Introduction

In tasks like speech recognition, OCR, or translation, it is necessary to estimate the probability of a given sentence. For example, in speech recognition we are given an utterance u (a sequence of vectors representing the sounds emitted by the speaker) and we need to compute the most probable sentence ŵ given u. That is,
$$\hat w = \arg\max_w P(w \mid u).$$

Using Bayes' rule, and since u is given, we can transform this into
$$\hat w = \arg\max_w P(w)\, P(u \mid w).$$

The term P(w) is known as the language model and intuitively represents the a priori probability of uttering w. Ideally, these models should represent the language


under consideration and should not allow ungrammatical sentences. In practice, it is adequate that they define a probability distribution over the free monoid of all possible sentences. This distribution is expected to be non-null at each point. The process of modifying a given distribution to be non-null is traditionally called smoothing.

A widely used language model is the so-called n-gram. These models estimate the probability of a word considering only the previous n − 1 words. A large range of smoothing techniques exists for n-gram models [2]. However, when the distribution is defined by a probabilistic finite state automaton, the situation is very different: the only non-trivial smoothing in the literature is Error Correcting Parsing [4].

We present an approach to smoothing finite state models that is based on the techniques used for smoothing n-gram models. This is done by first interpreting the n-gram models as finite state automata. This interpretation is then extended to the smoothing techniques, and the smoothing is modified in order to adapt it to general finite state models.

The basic idea underlying the smoothing is to parse the input sentence using the original automaton as long as possible, resorting to an n-gram (possibly smoothed) when arriving at a state with no adequate arc for the current input word. After visiting the n-gram, control returns to the automaton. This is similar to backoff smoothing, where the probability of a word is given by its n − 1 predecessors when possible; if that is not possible, only n − 2 predecessors are considered, and so on.

An example may help to clarify this. Suppose that the corpus contains only these two sentences:

    I saw two new cars
    you drove a new car

Assume also that the learning algorithm yields the model at the top of Fig. 1. The bigram model for the corpus is shown in the middle of that figure, and the unigram at the bottom (the arc labeled Σ stands for the whole vocabulary). We now have to estimate the probability of the new sentence "I drove a new car". The parsing starts in the initial state of our model. We follow the arc to state B. Now, there is no outgoing arc labeled drove, so we have to resort to the bigram. We "go down" to the state corresponding to I (the history so far). As there is no such arc there either, we go down again to the unigram. Now, using the probability of the unigram, we can go up to the state in the bigram corresponding to drove. When the word a is seen, we can return to the original automaton. We do this by observing that the history so far is drove a and that state I has that history. The words new and car can then be handled by the automaton.
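As a concrete (if simplified) illustration of this fallback behavior, the following sketch scores the example sentence with a plain bigram-to-unigram backoff estimated from the two-sentence corpus. It is hypothetical code, not the paper's method: the automaton level is omitted and the back-off weight alpha is an arbitrary placeholder rather than a properly computed normalization constant.

from collections import Counter

# Toy corpus of the running example.
corpus = [["I", "saw", "two", "new", "cars"],
          ["you", "drove", "a", "new", "car"]]

unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter((u, v) for s in corpus for u, v in zip(s, s[1:]))
total = sum(unigrams.values())

def p_backoff(word, prev, alpha=0.4):
    # Bigram relative frequency if the pair was seen, otherwise a crude
    # back-off to the unigram (alpha is an ad hoc weight, not the paper's C_q).
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    return alpha * unigrams[word] / total

sentence = ["I", "drove", "a", "new", "car"]
p = unigrams[sentence[0]] / total
for prev, word in zip(sentence, sentence[1:]):
    p *= p_backoff(word, prev)
print(p)  # P(drove | I) is backed off to the unigram, as in the walk above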

[Figure 1: three models drawn as state diagrams, labeled Automaton (top), Bigram (middle) and Unigram (bottom).]

Fig. 1. Parsing of "I drove a new car" by an automaton smoothed using a bigram, which in turn is smoothed by a unigram. Dotted lines represent arcs of the models not used in the parsing.

2. Basic Concepts and Notation

In this section we introduce some basic concepts in order to fix the notation. An alphabet is a finite set of symbols (words); we represent alphabets by calligraphic letters like X. Strings (or sentences) are concatenations of symbols and are written with a bar, as in x̄. The individual words are designated by the name of the sentence and a subindex indicating their position, so x̄ = x_1 x_2 … x_n. The length of a sentence is denoted by |x̄|. Segments of a sentence are denoted by x̄_i^j = x_i … x_j. For suffixes of the form x̄_i^{|x̄|} we use the notation x̄_i^·. The set of all strings over X is represented by X*, and the empty string by λ.

A probabilistic finite state automaton η is a six-tuple (X, Q, q_0, δ, π, ψ) where:

• X is an alphabet.
• Q is a finite set of states.
• q_0 ∈ Q is the initial state.


• δ is the set of transitions.
• π is the function that assigns probabilities to transitions.
• ψ is the function that assigns to each state the probability of being final.

The set δ is a subset of Q × (X ∪ {λ}) × Q and represents the possible movements of the automaton. The elements of δ are assigned probabilities by π, a complete function from δ into the real numbers (usually between 0 and 1, but we will relax that somewhat). Abusing notation, we write δ(q, x) for the set {q' ∈ Q | (q, x, q') ∈ δ}; if |δ(q, x)| = 1, we write π(q, x) for π(q, x, δ(q, x)).

An automaton can be used to define a probability distribution on X*. First define a path from q to q' with input x̄ to be a sequence (q_1, x_1, q_2)(q_2, x_2, q_3) … (q_n, x_n, q_{n+1}) such that q_1 = q, q_{n+1} = q' and x̄ = x_1 … x_n (note that n may be larger than |x̄|, since some of the x_i may be λ). Define the probability of the path to be ∏_{i=1}^{n} π(q_i, x_i, q_{i+1}). Now, for a string x̄, define its probability P_η(x̄) to be the sum, over all paths from q_0 to some state q with input x̄, of the probability of the path times the final probability of q. Symbolically,
$$P_\eta(\bar x) = \sum_{c \in C(q_0, \bar x)} \left( \prod_{i=1}^{|c|} \pi(q_i, x_i, q_{i+1}) \right) \psi(q_{|c|+1}),$$
where C(q_0, x̄) is the set of paths starting in q_0 and having x̄ as input. The well-known forward algorithm can be used to compute this quantity. If there are no λ-arcs in the automaton, we can use the version of Fig. 2(c). The absence of λ-arcs is not an important restriction, since in this paper we smooth automata without such arcs; furthermore, an equivalent automaton without λ-arcs can always be obtained.

During the rest of the paper, we assume that we are working with a training corpus for building the models. We use the function N to denote the number of times a given string appears in that corpus.
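As a concrete reference for the λ-free case, the following is a minimal Python sketch of the forward computation of P_η(x̄) along the lines of Fig. 2(c). The dictionary-based representation of the automaton is our own assumption, not a data structure prescribed by the paper.

from collections import defaultdict

def forward_probability(sentence, q0, arcs, final):
    # arcs: (state, symbol) -> list of (next_state, arc probability)
    # final: state -> final-state probability psi(state)
    probs = {q0: 1.0}                      # P1 in Fig. 2(c)
    for symbol in sentence:
        new_probs = defaultdict(float)     # P2 in Fig. 2(c)
        for state, p in probs.items():
            for next_state, arc_prob in arcs.get((state, symbol), []):
                new_probs[next_state] += p * arc_prob
        probs = new_probs
    return sum(p * final.get(state, 0.0) for state, p in probs.items())

# Toy usage with a small (not necessarily proper) automaton.
arcs = {("q0", "a"): [("q1", 0.9)], ("q1", "b"): [("q1", 0.5)]}
final = {"q1": 0.5}
print(forward_probability(["a", "b"], "q0", arcs, final))   # 0.9 * 0.5 * 0.5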


3. The n-Gram Model as a Finite State Model

A language model computes the probability of a sentence x̄. A usual decomposition of this probability is P(x̄) = P(x_1) P(x_2 | x_1) ⋯ P(x_n | x̄_1^{n−1}). We can say, then, that the language model computes the probability that a word x follows a string of words q̄, the "history" of that word. Since it is impossible to accurately estimate P(x | q̄) for all possible q̄, some restriction has to be placed. An n-gram model considers only the last n − 1 words of q̄. But, since even that may be too long, when the probability of x given those n − 1 words cannot be estimated directly from the training corpus, the backoff-smoothed (BS) n-gram model uses the last n − 2 words, and so on, down to a unigram model [5]. More formally,
$$P_{BO}(x \mid \bar q) = \begin{cases} P'(x \mid \bar q) & \text{if } N(\bar q x) > 0, \\ C_{\bar q} \cdot P_{BO}(x \mid \bar q_2^{\cdot}) & \text{if } N(\bar q x) = 0, \end{cases}$$
where P' is the discounted probability of P obtained with the chosen discounting strategy and C_q̄ is a normalization constant chosen so that Σ_{x ∈ X} P_BO(x | q̄) = 1.
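The backoff recursion above can be read almost directly as code. The sketch below is a hypothetical, simplified rendering: the count container, the fixed absolute discount and the way C_q̄ is approximated are our own choices (the paper relies on Witten–Bell discounting, see Sec. 4.2), and the full renormalization of the backed-off mass is omitted.

from collections import Counter

def ngram_counts(corpus, n):
    # Count all k-grams, 1 <= k <= n, in a corpus given as lists of tokens.
    counts = Counter()
    for sent in corpus:
        for k in range(1, n + 1):
            counts.update(tuple(sent[i:i + k]) for i in range(len(sent) - k + 1))
    return counts

def p_bo(x, hist, counts, d=0.5):
    # P_BO(x | hist): discounted estimate if hist + x was seen, otherwise
    # back off to the history shortened by one word (unigram as base case).
    if not hist:
        total = sum(c for g, c in counts.items() if len(g) == 1)
        return counts[(x,)] / total if total else 0.0
    if counts[hist + (x,)] > 0:
        return (counts[hist + (x,)] - d) / counts[hist]        # P'(x | hist)
    # Rough stand-in for C_hist: the mass removed by discounting at this order.
    followers = sum(1 for g in counts if len(g) == len(hist) + 1 and g[:-1] == hist)
    c_hist = d * followers / counts[hist] if counts[hist] else 1.0
    return c_hist * p_bo(x, hist[1:], counts, d)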

The CMU-Cambridge Toolkit [3] is a well-known tool for building BS n-gram models.

3.1. The n-gram model as a deterministic automaton

We can represent a backoff model by a deterministic automaton η = (X, Q_η, q̄_0, δ_η, π_η, ψ_η) as follows. The set of states is Q_η = {q̄ | q̄ ∈ (X ∪ {#})^{n−1}} ∪ {λ}; that is, we have a state for each possible history plus an initial state (#^{n−1}). The special symbol # is not part of the vocabulary and is used only to simplify the notation. The states represent the history seen so far: a string w_1 ⋯ w_{n−1} represents the fact that the last n − 1 words were w_1, …, w_{n−1}, while a string # ⋯ #w represents the moment in which the first word has been read and the next one has to be predicted. Finally, the end of the sentence corresponds to the automaton predicting # for a given history. Note that in this manner all the histories have the same length. The transitions are δ_η(q̄, x) = (q̄x)_2^·. The function π_η reflects the probabilities of the backoff model, π_η(q̄, x) = P_BO(x | q̄). Finally, ψ_η(q̄) = P_BO(# | q̄) assigns the discounted final state probability. It is easy to see that this deterministic automaton is proper, so it is consistent using conventional deterministic parsing [Fig. 2(a)]. Moreover, the probability distribution it induces is the same as that of the original model.

3.2. The n-gram model as a nondeterministic automaton

The representation of the previous section can be too large. An equivalent, but much smaller, model can be obtained by using a nondeterministic automaton. For this, we define the automaton η = (X, Q_η, q̄_0, δ_η, π_η, ψ_η) whose set of states is Q_η = {q̄ | q̄ ∈ (X ∪ {#})^{<n} ∧ N(q̄) > 0} ∪ {λ}, so that there is a state for each k-gram seen in the training data (1 ≤ k ≤ n − 1). In general, however, the two representations do not define exactly the same model.

4. Automata Smoothing Using n-Grams (SUN)

Suppose that we have both a stochastic (possibly nondeterministic) automaton τ and a smoothed n-gram model η. It is tempting to use η for smoothing τ, and we present here a way to do this. The idea is to create arcs between τ and η. These new arcs come in two groups: the down arcs and the up arcs. The down arcs are λ-arcs from τ into η; they are used when the current word has no arc from the current state, and their probability is discounted from the original arcs of τ. The up arcs are used to return to τ and they distribute the original probabilities of η.

In order to present the construction, we need the concept of the set of histories of length k for a state q. This is
$$H_{\tau,k}(q) = \{\bar h \in \mathcal{X}^k \mid \exists p \in Q_\tau : p \xrightarrow{\bar h} q\} \cup \{\#^l \bar h \in \#^l \mathcal{X}^{k-l} \mid q_0 \xrightarrow{\bar h} q\}.$$
Intuitively, this represents those strings of length k that are suffixes of a path leading to q; in case there is a path from the initial state shorter than k, the corresponding string is padded at the beginning with symbols #. It is also useful to define the pseudo-inverse function H_{τ,k}^{−1}(v̄), which is the set of states in Q_τ that have the length-k suffix of the string v̄ among their histories. These functions are easily extended to sets:
$$H_{\tau,k}(Q) = \bigcup_{q \in Q} H_{\tau,k}(q), \quad Q \in \mathcal{P}(Q_\tau), \qquad H_{\tau,k}^{-1}(H) = \bigcup_{\bar v \in H} H_{\tau,k}^{-1}(\bar v), \quad H \in \mathcal{P}(\mathcal{X}^k).$$
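To make the definition concrete, the following Python sketch enumerates H_{τ,k}(q) for every state of a λ-free automaton by walking arcs backwards. The adjacency-list layout and the recursive strategy are our own choices, not taken from the paper.

from collections import defaultdict

def histories(arcs, q0, states, k):
    # arcs: iterable of (source, symbol, target) triples.  Returns a dict
    # mapping each state q to its set of length-k histories (tuples of
    # symbols), padded with '#' when a path from q0 to q is shorter than k.
    incoming = defaultdict(list)
    for src, sym, dst in arcs:
        incoming[dst].append((src, sym))

    def extend(q, depth):
        # All length-`depth` suffixes of paths ending in q (with # padding).
        if depth == 0:
            return {()}
        result = set()
        if q == q0:                        # a shorter path from the initial state
            result.add(("#",) * depth)
        for src, sym in incoming[q]:
            for h in extend(src, depth - 1):
                result.add(h + (sym,))
        return result

    return {q: extend(q, k) for q in states}

# Toy usage on a three-state chain q0 --a--> q1 --b--> q2.
arcs = [("q0", "a", "q1"), ("q1", "b", "q2")]
print(histories(arcs, "q0", ["q0", "q1", "q2"], 2))
# e.g. H_{tau,2}(q2) = {("a", "b")} and H_{tau,2}(q1) = {("#", "a")}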


Algorithm Deterministic parsing
  Input: string x̄, automaton η
  Output: probability of x̄
  q := q0; p := 1;
  for i := 1 to |x̄| do
    p := p · π(q, x_i);
    q := δ(q, x_i);
  end for
  p := p · ψ(q);
  return p;
End

(a) Deterministic parsing

Algorithm Deterministic-backoff parsing
  Input: string x̄, automaton η
  Output: probability of x̄
  q := q0; p := 1;
  for i := 1 to |x̄| do
    while δ(q, x_i) = ∅ do
      p := p · π(q, λ);
      q := δ(q, λ);
    end while
    p := p · π(q, x_i);
    q := δ(q, x_i);
  end for
  p := p · ψ(q);
  return p;
End

(b) Deterministic-backoff parsing

Algorithm Forward parsing
  Input: string x̄, automaton η
  Output: probability of x̄
  P1[q0] := 1; V1 := {q0};
  for i := 1 to |x̄| do
    V2 := ∅; P2 := [0];
    for q ∈ V1 do
      p := P1[q];
      for q' ∈ δ(q, x_i) do
        V2 := V2 ∪ {q'};
        P2[q'] := P2[q'] + p · π(q, x_i, q');
      end for
    end for
    V1 := V2; P1 := P2;
  end for
  return Σ_{q ∈ V2} P2[q] · ψ(q);
End

(c) Forward parsing

Algorithm Forward-backoff parsing
  Input: string x̄, automaton η
  Output: probability of x̄
  P1[q0] := 1; V1 := {q0};
  for i := 1 to |x̄| do
    V2 := ∅; P2 := [0];
    for q ∈ V1 do
      p := P1[q];
      while δ(q, x_i) = ∅ do
        p := p · π(q, λ);
        q := δ(q, λ);
      end while
      for q' ∈ δ(q, x_i) do
        V2 := V2 ∪ {q'};
        P2[q'] := P2[q'] + p · π(q, x_i, q');
      end for
    end for
    V1 := V2; P1 := P2;
  end for
  return Σ_{q ∈ V2} P2[q] · ψ(q);
End

(d) Forward-backoff parsing

Fig. 2. The parsing algorithms. The while loops over λ-arcs are the only differences between the conventional and the backoff versions; the rest of each algorithm is unchanged.
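For readers who prefer running code, here is a small Python rendering of the forward-backoff parsing of Fig. 2(d), under the same dictionary-based automaton layout assumed in the earlier forward-parsing sketch (our own representation, not the paper's).

from collections import defaultdict

def forward_backoff_probability(sentence, q0, arcs, lam, final):
    # arcs: (state, symbol) -> list of (next_state, prob)
    # lam:  state -> (next_state, prob) for its lambda (back-off) arc
    # final: state -> final-state probability
    probs = {q0: 1.0}
    for symbol in sentence:
        new_probs = defaultdict(float)
        for state, p in probs.items():
            # Follow lambda-arcs only while no arc labeled with the symbol exists.
            while (state, symbol) not in arcs and state in lam:
                next_state, lam_prob = lam[state]
                p *= lam_prob
                state = next_state
            for next_state, arc_prob in arcs.get((state, symbol), []):
                new_probs[next_state] += p * arc_prob
        probs = new_probs
    return sum(p * final.get(state, 0.0) for state, p in probs.items())

The guard on `state in lam` merely stops the back-off loop at states without a λ-arc; it captures the rule that λ-arcs are never used from an active state that already has an arc labeled with the next input symbol.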

The rest of this section is divided into three parts. First, we present the formal models; after that, a particular set of probabilities for the smoothing is explained; finally, we introduce the construction of automata with an interesting property: the set of histories of each state is a singleton. This makes them somewhat analogous to the n-grams and facilitates the smoothing.


4.1. Definition of the models

The set of states of the smoothed model is simply Q_τ ∪ Q_η, the union of the states of both models (without loss of generality, we assume that they are disjoint). There are four types of arcs:

• Down arcs (δ_d).
• Up arcs (δ_u).
• Stay arcs (δ'_η).
• The original arcs of τ (δ_τ).

As commented above, down arcs go from τ into η. Suppose that the analysis is in state q and the following word has no arc departing from q. In this case, it is sensible to resort to η. We can only tell that the path leading to q has followed one of the histories of length n − 1, so we will have a λ-arc corresponding to each of these histories. Each such arc goes to the only state in η having that history (since η is an n-gram, there can be at most one state for each history of length n − 1 and, assuming the same training data for τ and η, it is safe to expect that such a state actually exists). Formally,
$$\delta_d = \{(q, \lambda, \bar h) \mid q \in Q_\tau \wedge \bar h \in H_{\tau,n-1}(q)\}.$$
Notice that we are identifying the states in Q_η (the states of the smoothed n-gram model η) with the strings leading to them; this can be done as in the construction presented in Sec. 3.2.

Once the analysis has visited η, it should go back to τ. This is done by means of the up arcs. The idea is better explained from a particular state in Q_η and a symbol x. Since η is an n-gram, there is only one history h̄ arriving at that state. When we join it with the symbol under consideration, we get a longer history. An up arc is drawn from the state into those states of τ having h̄x in their set of histories. This seeks to ensure that the analysis returns to a sensible point in τ. In a more formal way,
$$\delta_u = \{(\bar h, x, q) \in Q_\eta \times \mathcal{X} \times Q_\tau \mid q \in H_{\tau,n}^{-1}(\bar h x)\}.$$
Note that there may be symbols for which the extended history does not belong to any state of τ. This gives rise to the stay arcs, which are the arcs of η that do not generate up arcs:
$$\delta'_\eta = \{(\bar h, x, \bar h') \in \delta_\eta \mid \delta_u(\bar h, x) = \emptyset\}.$$
Finally, all the original arcs of τ are also part of the model (δ_τ).

In parallel with the definition of the arcs, we find the definition of four sets of probabilities. The down arcs get their probabilities after discounting some probability mass from the arcs of τ. We use a function d : δ_τ → [0, 1] for this discounting. The remaining probability of each state is distributed among the corresponding down arcs. This is done by a function b : δ_d → [0, 1], which must fulfill
$$\sum_{\bar h \in \delta_d(q, \lambda)} b(q, \lambda, \bar h) = 1. \tag{2}$$


With these two functions, we can define π'_τ, the probability of the arcs of τ in the new model,
$$\pi'_\tau(q, x, q') = d(q, x, q')\, \pi(q, x, q'), \quad \text{for } (q, x, q') \in \delta_\tau,$$
and π_d, the probability of the down arcs,
$$\pi_d(q, \lambda, \bar h) = C_q\, b(q, \lambda, \bar h), \quad \text{for } (q, \lambda, \bar h) \in \delta_d.$$
The normalization constant C_q is computed in a manner analogous to n-gram backoff smoothing.

The up arcs get their probabilities by distributing the probability of the arc that originated them. For this, we use a function s : δ_u → [0, 1] that must fulfill
$$\sum_{q \in \delta_u(\bar h, x)} s(\bar h, x, q) = \pi_\eta(\bar h, x). \tag{3}$$

With this function, the probabilities of the up arcs are trivial; simply define
$$\pi_u(\bar h, x, q) = s(\bar h, x, q), \quad \text{for } (\bar h, x, q) \in \delta_u.$$
Finally, the stay arcs keep their original probabilities, so
$$\pi'_\eta(\bar h, x, \bar h') = \pi_\eta(\bar h, x, \bar h'), \quad \text{for } (\bar h, x, \bar h') \in \delta'_\eta.$$

We define SUN(τ, η, d, b, s) as the automaton (X, Q, q̄_0, δ, π, ψ) where:

(a) The states, Q, are the union of the states of τ and η: Q = Q_τ ∪ Q_η.
(b) The transitions are the union of the four sets of arcs: δ = δ_τ ∪ δ_d ∪ δ_u ∪ δ'_η.
(c) The probabilities of the arcs are the union of the four sets of probabilities: π = π'_τ ∪ π_d ∪ π_u ∪ π'_η.
(d) The function ψ, which assigns the final probability to every state, is
$$\psi(q) = \begin{cases} \psi'_\tau(q) & \text{if } q \in Q_\tau, \\ \psi_\eta(q) & \text{if } q \in Q_\eta, \end{cases}$$
where ψ'_τ(q) is the discounted final probability.

This automaton, which we call SUN BS, is neither deterministic nor proper. To make the defined language consistent, a special parsing is required. We present Forward-backoff parsing, a modified forward parsing that does not use λ-arcs in an active state if there exists an arc labeled with the next input symbol [Fig. 2(d)]. As in n-gram backoff smoothing, we can obtain a SUN GIS automaton by choosing C_q so that the automaton is proper; the language it models is then consistent using ordinary forward parsing. The methods SUN BS and SUN GIS constitute a generalization of the n-gram smoothing techniques BS and GIS, respectively: if an n-gram is smoothed with a smoothed (n−1)-gram using SUN BS (respectively, SUN GIS), the result is the same as a BS (respectively, GIS) model.
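The arc construction of the definition above can be paraphrased in a few lines of Python. The sketch below only builds the three new arc sets from precomputed history sets; the state and history representations are our own assumptions, and the probability functions d, b and s are left out (they are discussed in Sec. 4.2).

from collections import defaultdict

def build_sun_arcs(tau_states, hist_n1, hist_n, eta_arcs):
    # tau_states: states of tau; hist_n1[q] / hist_n[q]: the sets
    # H_{tau,n-1}(q) and H_{tau,n}(q) as tuples of words; eta_arcs: n-gram
    # arcs as (history, symbol, next_history).  Layouts are assumptions.

    # Down arcs: a lambda-arc (marked None) from q to the n-gram state of
    # each of its length-(n-1) histories.
    down = [(q, None, h) for q in tau_states for h in hist_n1[q]]

    # Pseudo-inverse H^{-1}_{tau,n}: which states of tau have a given
    # length-n history.
    inv = defaultdict(set)
    for q in tau_states:
        for h in hist_n[q]:
            inv[h].add(q)

    # Up arcs: from n-gram state h reading x, return to every state of tau
    # whose history set contains h + (x,).
    up = [(h, x, q) for (h, x, _next) in eta_arcs for q in inv[h + (x,)]]

    # Stay arcs: n-gram arcs whose extended history matches no state of tau.
    covered = {(h, x) for (h, x, _q) in up}
    stay = [(h, x, nxt) for (h, x, nxt) in eta_arcs if (h, x) not in covered]

    return down, up, stay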


4.2. Choosing the distributions

The formalization of SUN allows different parameterizations of the functions d, b and s. The n-gram smoothing literature offers several discounting techniques that can easily be adapted to play the role of d. We have chosen Witten–Bell discounting [6], which is powerful and simple to implement. Given an automaton τ,
$$\pi'_\tau(q, x, q') = d(q, x, q') \cdot \pi_\tau(q, x, q') = \frac{N(q)}{N(q) + D(q)} \cdot \frac{N(q, x, q')}{N(q)} = \frac{N(q, x, q')}{N(q) + D(q)},$$
where N(q, x, q') and N(q) are the number of times this arc/state was used when analyzing the training corpus, and D(q) is the number of differently labeled arcs leaving state q (plus one if it is final). The smoothed probability of a final state is ψ'_τ(q) = N_f(q) / (N(q) + D(q)), where N_f(q) is the number of times q was final when analyzing the training corpus.

We feel that the functions b and s should take into account the frequency of the different histories in each state. This information can easily be obtained using a history-counting function over a training corpus. As a way of distributing the probabilities in accordance with these counts, we define b(q, λ, h̄) and s(h̄, x, q) as
$$b(q, \lambda, \bar h) = \frac{E_{\tau,q}(\bar h)}{\sum_{\bar h' \in H_{\tau,n-1}(q)} E_{\tau,q}(\bar h')}, \qquad s(\bar h, x, q) = \pi_\eta(\bar h, x) \cdot \frac{E_{\tau,q}(\bar h x)}{\sum_{q' \in H_{\tau,n}^{-1}(\bar h x)} E_{\tau,q'}(\bar h x)},$$
where E_{τ,q}(h̄) is the number of times we arrive at state q of τ with history h̄. It can easily be proven that both b and s satisfy their respective restrictions, (2) and (3).


4.3. Single history automata

A problem for SUN is the existence of more than one history of length n − 1 in the states of the automaton. Remember that the n-gram predicts the next word considering only the previous n − 1 words. Now, consider a state q and suppose it has three histories h_1, h_2 and h_3 of length n − 1. This implies the construction of three down arcs. But, when actually using the automaton, at most one of the histories is the one actually seen, so the other two are used for paths that lead to states of the n-gram with histories not belonging to the path followed. It would be desirable to have only one history per state, so that it is always possible to go down to the state with the correct history.

We can in fact achieve this for general automata. First, define τ to be an n-SH automaton if for every state q the set H_{τ,n}(q) is a singleton. The interesting result is that every automaton can be converted into an equivalent n-SH automaton by a simple construction. Given τ, the corresponding n-SH automaton is τ' with:

• Q_{τ'} = {(q, h̄) | h̄ ∈ H_{τ,n}(q)}.
• q_{0,τ'} = (q_0, #^n).
• δ_{τ'} = {((q, h̄), x, (q', (h̄x)_2^·)) | (q, x, q') ∈ δ_τ ∧ h̄ ∈ H_{τ,n}(q)}.

With respect to probabilities, this conversion can be done in two ways. The first is to translate the probabilities so that the new automaton represents the same distribution. The other, which we have followed, takes into account that the original distribution was inferred from a training corpus and is therefore only an approximation; we feel that it is better to re-estimate the probabilities on the new automaton. As seen in the experiments below, the best approach for smoothing an automaton is to first convert it into an n-SH automaton, then re-estimate the probabilities, and after that apply SUN.
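The n-SH conversion is a straightforward pairing of states with histories. The sketch below builds the new state set and transitions, reusing the history enumeration from the earlier sketch; the data layout is again our own assumption, and the probability re-estimation step is not included.

def to_n_sh(arcs, q0, states, n, hist_n):
    # arcs: (source, symbol, target) triples; hist_n[q]: the set H_{tau,n}(q)
    # as tuples of symbols (e.g. produced by the histories() sketch above).
    new_states = [(q, h) for q in states for h in hist_n[q]]
    new_q0 = (q0, ("#",) * n)
    new_arcs = [
        ((q, h), x, (q2, (h + (x,))[1:]))     # shift the history window by one
        for (q, x, q2) in arcs
        for h in hist_n[q]
    ]
    return new_states, new_q0, new_arcs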


5. Experiments

To test SUN we use ATIS-2, a subcorpus of ATIS, which is a set of sentences recorded from speakers requesting information from an air travel database. ATIS-2 contains a training set of 13,044 utterances (130,773 tokens) and two evaluation sets of 974 utterances (10,636 tokens) and 1,001 utterances (11,703 tokens), respectively. The vocabulary contains 1,294 words. We use the second evaluation set as test set. The first evaluation set was not used; however, the Error Correcting Parsing (ECP) approach we compare with uses this set to estimate the error model.

Three aspects were considered in the experiments: SUN versus other automata smoothing techniques; SUN GIS versus SUN BS; and automata smoothing versus n-gram smoothing. The experiments consisted in building the smoothed unigram, bigram and trigram for the task (using both GIS and BS smoothing), plus an automaton obtained with the ALERGIA algorithm [1]. This automaton was smoothed using direct smoothing with a unigram, Error Correcting Parsing, and the SUN approach. The results of the first two methods constitute the reference and were presented by Dupont and Amengual [4]. The SUN approach was tested with the different n-grams, using both the original automaton and the corresponding n-SH automata.

Table 1 shows the results of the experiments. The reference results are those presented by Dupont and Amengual [4]. Their experiments were done with the same corpus and the same ALERGIA automaton, so our results are comparable with theirs. They smoothed the automaton using a unigram and different ECP models (ECP smoothing does not use n-grams). They obtained a perplexity of 71 with the unigram smoothing and of 37 with the best ECP model, nearly a 50% improvement but still far from the perplexity of bigrams or trigrams.

Table 1. Test set perplexity results. Column SUNSH (SUN using single history automata) shows two results: using all paths and using only the best one.

(a) GIS
n    n-gram    SUN    SUNSH
1    147       33     33/36
2    20        30     16/19
3    14        26     14/18

(b) BS
n    n-gram    SUN    SUNSH
1    147       39     39/40
2    20        38     18/18
3    14        34     14/15

(c) Reference
Direct    ECP
71        37

Table 2. Number of arcs of the automaton and of the n-gram traversed during the parsing of the test using only the best path. The last column shows the number of down arcs employed. The number of symbols in the test is 12,704.

(a) GIS
n    automaton    n-gram    down arcs
1    11510        1194      1148
2    10425        2279      1442
3    8575         4129      1734

(b) BS
n    automaton    n-gram    down arcs
1    11954        750       695
2    10606        2098      1297
3    8640         4064      1687

Our results improve on theirs. The use of SUN GIS with unigrams slightly improves perplexity. When using bigrams or higher, the results are far better (nearly a 30% improvement over ECP for trigrams), but still worse than n-grams. When the automaton is transformed into its equivalent n-SH, the results are even better than for bigrams, and for trigrams they are equivalent. When using BS, the conclusions are very similar, except for the case of the automaton not transformed into an n-SH. In the last column of Table 1, we can see the effect of using only the best path instead of summing over all possible paths. This effect is negligible for SUN BS but important for SUN GIS; the difference is due to the large number of paths accepting a sentence in SUN GIS.

Another important question is whether the parsing uses more arcs from the automaton or from the n-gram. It would be desirable for most of the arcs to come from the automaton. In Table 2, we can see the numbers for the best paths. Both for SUN BS and SUN GIS, most of the symbols are parsed in the automaton. The ratio decreases as the order of the n-gram increases, which shows that the better the smoothing model, the more it is used. Using SUN BS, the percentage of symbols parsed by the automaton is 91% for unigrams, 82% for bigrams and 67% for trigrams. On the other hand, the percentage of words using down arcs is 5% for unigrams, 10% for bigrams and 13% for trigrams. The behavior of SUN GIS is similar.

For comparison purposes, we repeated the tests of the BS n-gram models using the CMU-Cambridge Toolkit [3] with the same discounting function. As expected, the results are the same as those in the second column of Table 1(b).


6. Conclusions

Smoothed n-grams can be formalized as finite state models. This formalization has been used in two directions: to define a new smoothing for n-grams, GIS; and to extend backoff smoothing and GIS to general automata.

The extension of n-gram smoothing to general automata has been made using several new concepts also introduced here. In the first place, the definition of the set of histories of a state is used to define the relationship between the states of the automaton and the n-gram used for smoothing. Related to that, we present a transformation of automata (n-SH) that makes them conceptually closer to the structure of n-grams, thereby easing the smoothing. Finally, appropriate modifications of the parsing algorithms are presented in order to cope with the new automata.

The experimental results indicate that the methods obtain a suitable smoothing. This translates into better perplexities than the models used for smoothing them, while keeping the structure of the automaton, which allows modeling more complex relationships than is possible with n-grams.

References

1. R. C. Carrasco and J. Oncina, "Learning deterministic regular grammars from stochastic samples in polynomial time," Theoret. Inform. Appl. 33, 1 (1999) 1–9.
2. S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," Comput. Speech Lang. 13, 4 (1999) 359–393.
3. P. Clarkson and R. Rosenfeld, "Statistical language modeling using the CMU-Cambridge Toolkit," EUROSPEECH, 1997, pp. 2702–2710.
4. P. Dupont and J.-C. Amengual, "Smoothing probabilistic automata: an error-correcting approach," Grammatical Inference: Algorithms and Applications, ed. A. de Oliveira, Lecture Notes in Artificial Intelligence, Vol. 1891, Springer-Verlag, 2000, pp. 51–64.
5. S. M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Trans. ASSP 35, 3 (1987) 400–401.
6. P. Placeway, R. Schwartz, P. Fung and L. Nguyen, "The estimation of powerful language models from small and large corpora," ICASSP'93, Vol. II, 1993, pp. 33–36.


David Llorens received the Master and Ph.D. degrees in computer science from the Polytechnic University of València, Spain, in 1995 and 2000, respectively. From 1995 to 1997 he worked with the Department of Information Systems and Computation, Polytechnic University of València, first as a predoctoral fellow and from 1996 as Assistant Professor. Since 1997, he has been with the Department of Software and Computing Systems of the University Jaume I of Castelló, Spain, first as Assistant Professor and from 2001 as Associate Professor. He has been an active member of the Pattern Recognition and Artificial Intelligence Group in the Department of Information Systems and Computation, and now works with the Group for Computational Learning, Automatic Recognition and Translation of Speech at the Department of Software and Computing Systems of the University Jaume I of Castelló, Spain. His current research lies in the areas of finite state language modeling, language model smoothing, speech recognition, machine translation and syntactic pattern recognition.

Juan Miguel Vilar received the B.Sc. in computer studies in 1990 from the Liverpool Polytechnic (now Liverpool John Moores University) in England, and the M.Sc. and Ph.D. degrees in computer science in 1993 and 1998 from the Polytechnic University of València in Spain. He is currently working as an Associate Professor with the Department of Software and Computing Systems of the University Jaume I of Castelló, Spain. He has been an active member of the Pattern Recognition and Artificial Intelligence Group in the Department of Information Systems and Computation and now works with the Group for Computational Learning, Automatic Recognition and Translation of Speech at the Department of Software and Computing Systems of the University Jaume I of Castelló, Spain. His current research is in finite state and statistical models for machine translation, computational learning, language modeling, and word clustering.


Francisco Casacuberta received the Master and Ph.D. degrees in physics from the University of València, Spain, in 1976 and 1981, respectively. From 1976 to 1979, he worked with the Department of Electricity and Electronics at the University of València as an FPI fellow. From 1980 to 1986, he was with the Computing Center of the University of València. Since 1980, he has been with the Department of Information Systems and Computation of the Polytechnic University of València, first as an Associate Professor and from 1990 as a Full Professor. Since 1981, he has been an active member of a research group in the fields of automatic speech recognition and machine translation. Dr. Casacuberta is a member of the Spanish Society for Pattern Recognition and Image Analysis (AERFAI), which is an affiliate society of IAPR, the IEEE Computer Society and the Spanish Association for Artificial Intelligence (AEPIA). His current research interests lie in the areas of speech recognition, machine translation, syntactic pattern recognition, statistical pattern recognition and machine learning.
