Learning Stochastic Finite Automata from Experts

Colin DE LA HIGUERA
EURISE, Université de Saint-Etienne, France
www.univ-st-etienne.fr/eurise/cdlh.html

Abstract

We present in this paper a new learning problem called learning distributions from experts. In the case we study, the experts are stochastic deterministic finite automata (sdfa). We deal with the situation arising when one wants to learn sdfa from unrepeated examples. This is intended to model the situation where the data is not generated automatically, but presented in an order that depends on its probability, as would be the case with data presented by a human expert. It is then impossible to use frequency measures directly in order to construct the underlying automaton or to adjust its probabilities. In this paper we prove that although polynomial identification with probability one is not always possible, a wide class of automata can be successfully identified under this criterion. As the framework is new, the problem leads to a variety of open problems.

Keywords: identification with probability one, grammatical inference, polynomial learning, stochastic deterministic finite automata.

1 Introduction and Related work

Inference of deterministic finite automata (dfa) or of regular grammars is a favourite subject of grammatical inference. It is well known [Gol67, Gol78] that the class cannot be identified in the limit from text, i.e. from positive examples only. As the problem of not having negative evidence arises in practice, different options for dealing with the issue have been proposed. Restricted classes of dfa can be identified [Ang82, GV90], heuristics have been proposed [RV87] and used for practical problems in speech recognition or pattern recognition [L&al94], and stochastic inference has been proposed as a means of dealing with the problem [CO94, SO94]. Stochastic grammars and automata have been used for some time in the context of speech recognition [RJ93, Ney95]. Algorithms that (heuristically) learn a context-free grammar have been proposed (for a recent survey see [Sak97]), and other algorithms (namely the forward-backward algorithm for Hidden Markov Models, which are close to stochastic finite automata, or the inside-outside algorithm for stochastic context-free grammars) that
compute probabilities for the rules have been developed [RJ93, LY90]. But in the general framework of grammatical inference it is important to search for algorithms that not only perform well in practice, but that provably converge to the optimal solution, using only a polynomial amount of time. For the case of stochastic finite automata the problem has been dealt with by different authors: Stolcke and Omohundro [SO94] learn stochastic deterministic finite automata through Bayes minimisation, Carrasco and Oncina [CO94] through the state merging techniques common to classical algorithms for the dfa inference problem. Along the same line Ron et al. [RST95] learn acyclic dfa, proving furthermore that under certain restrictions the inferred automaton is Probably Approximately Correct.

In these papers (but also in those from the speech recognition community) an elementary assumption is that the presentation of the examples is unordered (or at least the algorithms do not use this information), and that strings with high probability can appear many times in the learning multisample. If for instance some string has probability 1/3, it is expected that one third of the sample is occupied by this string. This can be justified for practical reasons: when learning from a set where sampling has been done automatically, strings will be repeated according to their probability.

But let us deal with the case where it is not so, and suppose that the protocol does not allow for multiple presentation of data. This is intended to model the situation where the learning data is given to us by a human expert who will certainly not give us the same string repeatedly. Nevertheless the same expert will give us the strings according to the distribution they follow. We can thus consider the expert as a black box containing the distribution to be learned. When requested, the expert computes (according to the distribution) a new example, adds it with a label to the learning set, and increases the label. The expert acts as follows (a code sketch of this protocol is given at the end of the section):

rank ← 0
Do n times (or fewer if fewer than n strings have non-null probability):
1. Generate an unseen example u
2. Add (u, rank) to the learning set
3. rank ← rank + 1

The problem of learning from examples delivered by such a black box will be called learning from an expert (as opposed to multisample learning) in the sequel. In this paper the expert will be a sdfa. We will prove that for restricted classes of sdfa the structure of the automaton can be learned, and the probabilities estimated. Furthermore the algorithms to do so are polynomial in the overall length of the data and convergence can be obtained with probability one.
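As an illustration, the expert protocol above can be simulated as follows. This is only a minimal Python sketch under stated assumptions: the function name expert_learning_set and the sampler draw_string are hypothetical, and we assume at least n strings have non-null probability (otherwise the loop would have to stop early, as the protocol allows).

```python
import random

def expert_learning_set(draw_string, n):
    """Simulate the expert protocol: strings are drawn according to the
    target distribution, but each distinct string is kept only once,
    labelled with the rank at which it first appeared."""
    learning_set = []          # list of (string, rank) pairs
    seen = set()
    rank = 0
    while rank < n:
        u = draw_string()      # draw a string according to the distribution
        if u in seen:
            continue           # the expert never repeats a string
        seen.add(u)
        learning_set.append((u, rank))
        rank += 1
    return learning_set

# Hypothetical usage with a toy distribution over {"", "a", "ab"}:
# draw = lambda: random.choices(["", "a", "ab"], weights=[0.5, 0.3, 0.2])[0]
# expert_learning_set(draw, 3)
```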


In section 2 we provide the elementary definitions and techniques. The problem of learning sdfa involves two different matters: learning the structure or topology of the underlying automaton and estimating the probabilities. In the usual framework of multisample learning both problems are dealt with simultaneously. Two passes are needed in learning from an expert. In section 3 we study the easier of the two problems: estimating the probabilities given a structure. In section 4 we give our results concerning the inference of the structure. We conclude with a first list of open problems concerning this new model.

2 Preliminaries

An alphabet is a finite non-empty set of distinct symbols. For a given alphabet X, the set of all finite strings of symbols from X is denoted by X*. The empty string is denoted by λ. A language L over X is a subset of X*. Given L, L̄ is the complement of L in X*.

A stochastic deterministic finite automaton (sdfa) A = <X, Q, q0, P, δ> consists of an alphabet X (the set of terminal symbols), a set Q of states with q0 the initial state, a partial transition function δ: Q×X → Q, and a probability function P: Q×(X∪{λ}) → [0, 1] such that:

∀q ∈ Q, ∑_{x ∈ X∪{λ}} P(q, x) = 1

We define recursively: δ(qi, x.w) = δ(δ(qi, x), w)

And the probability for a string to be generated by A is defined recursively by: P(qi, x.w) = P(qi, x) · P(δ(qi, x), w)

The language generated by the automaton is defined as L = {w ∈ X* : P(q0, w) ≠ 0}. In case the sdfa contains no useless nodes it generates a distribution over X*. We can then also define recursively the prefix probabilities:

P(qi, x.wX*) = P(qi, x) · P(δ(qi, x), wX*)
P(qi, X*) = 1

The class of stochastic regular languages consists of all languages generated by stochastic non-deterministic finite automata. Although not all stochastic regular languages can be generated by sdfa, we will concentrate on the deterministic case: indeed determinism plays a central part in grammatical inference, and by doing so we are following the same line as related work [CO94, SO94, RST95].

Two sdfa are equivalent if they define the same probability distribution over X*, i.e. if every string over X has equal probability under both distributions.

A sample is a finite presentation of examples, i.e. a subset S of X*, together with a rank function ρ: S → ℕ giving the rank of each element of S.
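To make the definitions concrete, here is a minimal Python sketch of one possible encoding of an sdfa together with the two recursive probability computations above. The dictionary-based representation, the class name SDFA, and the use of the empty string "" to stand for λ (the stopping probability) are assumptions of this sketch, not the paper's notation.

```python
class SDFA:
    """Stochastic deterministic finite automaton.

    delta[q][x] is the target state of the transition from q on symbol x;
    prob[q][x] is P(q, x) for x in the alphabet, and prob[q][""] plays the
    role of P(q, lambda), the probability of stopping in q.  For every
    state the values in prob[q] must sum to 1.
    """
    def __init__(self, delta, prob, q0=0):
        self.delta = delta
        self.prob = prob
        self.q0 = q0

    def string_prob(self, w, q=None):
        """P(q, w): probability that the automaton generates exactly w from q."""
        if q is None:
            q = self.q0
        if w == "":
            return self.prob[q].get("", 0.0)
        x, rest = w[0], w[1:]
        if x not in self.delta[q]:
            return 0.0
        return self.prob[q].get(x, 0.0) * self.string_prob(rest, self.delta[q][x])

    def prefix_prob(self, u, q=None):
        """P(q, uX*): probability that the generated string starts with u."""
        if q is None:
            q = self.q0
        if u == "":
            return 1.0                      # base case P(q, X*) = 1
        x, rest = u[0], u[1:]
        if x not in self.delta[q]:
            return 0.0
        return self.prob[q].get(x, 0.0) * self.prefix_prob(rest, self.delta[q][x])

# Hypothetical example: from state 0 emit 'a' with probability 0.5 and move to
# state 1, or stop with probability 0.5; state 1 always stops.
# A = SDFA(delta={0: {"a": 1}, 1: {}}, prob={0: {"a": 0.5, "": 0.5}, 1: {"": 1.0}})
# A.string_prob("a") == 0.5 and A.prefix_prob("a") == 0.5
```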


The main tool to compare distributions given a sample is to use the fact that some strings appear before others in the sample. This will be done through the following lemma:

Lemma 1. Let A = <X, Q, q0, P, δ> be a sdfa, S a sample and w, w' two strings appearing in S such that w = uv, w' = u'v and δ(q0, u) = δ(q0, u'). Then

Pr(ρ(w) < ρ(w')) = P(q0, uX*) / (P(q0, uX*) + P(q0, u'X*))

Proof.

Pr(ρ(w) < ρ(w')) = P(q0, w) / (P(q0, w) + P(q0, w'))
  = P(q0, uX*)·P(δ(q0, u), v) / (P(q0, uX*)·P(δ(q0, u), v) + P(q0, u'X*)·P(δ(q0, u'), v))
  = P(q0, uX*) / (P(q0, uX*) + P(q0, u'X*))

since δ(q0, u) = δ(q0, u') implies that the common factor P(δ(q0, u), v) cancels. ◊
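Lemma 1 is a statement about the randomness of the expert: over independent runs of the protocol, the fraction of runs in which w is ranked before w' converges to P(q0, uX*) / (P(q0, uX*) + P(q0, u'X*)). The following Python fragment is only an illustrative Monte Carlo check of the lemma, not the learning algorithm of the paper; run_expert is a hypothetical function returning one (string, rank) learning set per call.

```python
def precedence_frequency(run_expert, w, w_prime, trials=10000):
    """Estimate Pr(rho(w) < rho(w')) by running the expert protocol many
    times and counting how often w receives a smaller rank than w'.
    By Lemma 1 this frequency should approach
    P(q0, uX*) / (P(q0, uX*) + P(q0, u'X*)) when w = uv, w' = u'v and
    delta(q0, u) = delta(q0, u')."""
    earlier = 0
    both_present = 0
    for _ in range(trials):
        ranks = dict(run_expert())          # map string -> rank for this run
        if w in ranks and w_prime in ranks:
            both_present += 1
            if ranks[w] < ranks[w_prime]:
                earlier += 1
    return earlier / both_present if both_present else None
```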

3 Learning the probabilities

In this section we suppose that the actual structure (the automaton) is given, and the problem is to infer from the data the probabilities of the automaton. Thus we are given an automaton A = <X, Q, q0, P, δ> and a sample S. In the case of repeated data the results for this problem are twofold. On the one hand the forward-backward algorithm performs well in practice [LY90], or alternatively Alergia [CO94] can compute the probabilities. On the other hand negative results [AW92, K&al94] show that if the size of the alphabet is allowed to increase, no polynomial algorithm can achieve identification.

We start this section with a counter-example showing that in general the probabilities are hard to learn:

[figure 1: a small sdfa with initial state 0 and states 1 and 2, with transitions labelled a and b]

When learning from an expert from the automaton of figure 1, a set of learning examples can contain at most 3 examples, and the 6 possible orderings or rank functions
for these 3 examples will only allow us to learn 6 different distributions. Therefore a necessary condition for learning to be possible is that an infinite number of examples is available. This condition has to hold for all the probabilities we wish to compute, hence it is sufficient (but also necessary) that the automaton accepts infinitely many strings that re-enter the initial state at least once. Technically:

Definition 1. A sdfa is left infinite if |{u ∈ X*: δ(q0, u) = q0 ∧ ∃v ∈ X* P(uv) > 0}| = ∞

In order to learn the probability distributions we need the following notations: given q ∈ Q, a language L over X and a string u ∈ X*,

µ(u, L) = min {ρ(uv): v ∈ L}
c(q, L) = {w ∈ X*: δ(q0, w) = q ∧ µ(w, L)