INTRODUCTION TO INDUCTIVE LOGIC PROGRAMMING
Nada Lavrac
J. Stefan Institute, Ljubljana, Slovenia

Talk outline
• Introduction to ILP
• ILP as search
• ILP techniques and implementations
  – Specialization techniques
    • Top-down search of refinement graphs
  – Generalization techniques
    • Relative least general generalization

Predictive vs. Descriptive Induction
Predictive induction: inducing classifiers, aimed at solving classification/prediction tasks
– classification rule learning, decision tree learning, ...
– Bayesian classifiers, ANNs, SVMs, ...
– data analysis through hypothesis generation and testing

Descriptive induction: discovering regularities, uncovering patterns, aimed at solving KDD tasks
– symbolic clustering, association rule learning, subgroup discovery, ...
– exploratory data analysis

[Figure: a hypothesis H separating the + from the − examples (predictive induction), and a hypothesis H describing a group of examples x (descriptive induction).]

Predictive ILP
• Given:
  – a set of observations
    • positive examples E+
    • negative examples E−
  – background knowledge B
  – hypothesis language LH
  – covers relation
• Find:
  A hypothesis H ∈ LH, such that (given B) H covers all positive and no negative examples
• In logic, find H such that
  – ∀e ∈ E+ : B ∧ H |= e (H is complete)
  – ∀e ∈ E− : B ∧ H ⊭ e (H is consistent)
• In ILP, E are ground facts, B and H are (sets of) definite clauses
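The completeness and consistency tests translate directly into code once a covers relation is available. A minimal Python sketch, with covers(B, H, e) as an abstract, user-supplied predicate (the function names are ours, not from the slides):

    def complete_and_consistent(covers, H, B, pos, neg):
        """H is complete iff it covers every e in E+,
        and consistent iff it covers no e in E-."""
        complete = all(covers(B, H, e) for e in pos)
        consistent = not any(covers(B, H, e) for e in neg)
        return complete and consistent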

[Figure: a hypothesis H covering all + and no − examples.]

Predictive ILP
• Given:
  – a set of observations
    • positive examples E+
    • negative examples E−
  – background knowledge B
  – hypothesis language LH
  – covers relation
  – quality criterion
• Find:
  A hypothesis H ∈ LH, such that (given B) H is optimal w.r.t. the quality criterion, e.g., maximal predictive accuracy A(H)
  (instead of finding a hypothesis H ∈ LH, such that (given B) H covers all positive and no negative examples)
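One common instantiation of the quality criterion is predictive accuracy: the fraction of examples classified correctly, i.e., covered positives plus rejected negatives. A sketch using the same abstract covers relation as above:

    def predictive_accuracy(covers, H, B, pos, neg):
        """A(H) = (covered positives + uncovered negatives) / all examples."""
        tp = sum(1 for e in pos if covers(B, H, e))
        tn = sum(1 for e in neg if not covers(B, H, e))
        return (tp + tn) / (len(pos) + len(neg))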

[Figure: a hypothesis H covering most + and a few − examples, optimal w.r.t. the quality criterion rather than strictly complete and consistent.]

Descriptive ILP
• Given:
  – a set of observations (positive examples E+)
  – background knowledge B
  – hypothesis language LH
  – covers relation
• Find:
  A maximally specific hypothesis H ∈ LH, such that (given B) H covers all positive examples
• In logic, find H such that ∀c ∈ H, c is true in some preferred model of B ∪ E (e.g., the least Herbrand model M(B ∪ E))
• In ILP, E are ground facts, B are (sets of) general clauses

[Figure: a maximally specific hypothesis H covering all + examples.]

Sample problem: Logic programming
E+ = {sort([2,1,3],[1,2,3])}
E− = {sort([2,1],[1]), sort([3,1,2],[2,1,3])}
B: definitions of permutation/2 and sorted/1
• Predictive ILP
  sort(X,Y) ← permutation(X,Y), sorted(Y).
• Descriptive ILP
  sorted(Y) ← sort(X,Y).
  permutation(X,Y) ← sort(X,Y).
  sorted(X) ← sort(X,X).
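The induced predictive clause can be checked against the examples by turning its body literals into executable tests. A minimal Python sketch (the helper names are ours):

    def is_permutation(x, y):
        return sorted(x) == sorted(y)

    def is_sorted(y):
        return all(a <= b for a, b in zip(y, y[1:]))

    def sort_hyp(x, y):
        # sort(X,Y) <- permutation(X,Y), sorted(Y).
        return is_permutation(x, y) and is_sorted(y)

    # complete: covers the positive example
    assert sort_hyp([2, 1, 3], [1, 2, 3])
    # consistent: covers neither negative example
    assert not sort_hyp([2, 1], [1])
    assert not sort_hyp([3, 1, 2], [2, 1, 3])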

Sample problem: Knowledge discovery
E+ = {daughter(mary,ann), daughter(eve,tom)}
E− = {daughter(tom,ann), daughter(eve,ann)}
B = {mother(ann,mary), mother(ann,tom), father(tom,eve), father(tom,ian),
     female(ann), female(mary), female(eve), male(pat), male(tom),
     parent(X,Y) ← mother(X,Y), parent(X,Y) ← father(X,Y)}
[Family tree: ann is the mother of mary and tom; tom is the father of eve and ian.]

Sample problem: Knowledge discovery
• E+ = {daughter(mary,ann), daughter(eve,tom)}
  E− = {daughter(tom,ann), daughter(eve,ann)}
• B = {mother(ann,mary), mother(ann,tom), father(tom,eve), father(tom,ian),
       female(ann), female(mary), female(eve), male(pat), male(tom),
       parent(X,Y) ← mother(X,Y), parent(X,Y) ← father(X,Y)}
• Predictive ILP – induce a definite clause
  daughter(X,Y) ← female(X), parent(Y,X).
  or a set of definite clauses
  daughter(X,Y) ← female(X), mother(Y,X).
  daughter(X,Y) ← female(X), father(Y,X).
• Descriptive ILP – induce a set of (general) clauses
  ← daughter(X,Y), mother(X,Y).
  female(X) ← daughter(X,Y).
  mother(X,Y); father(X,Y) ← parent(X,Y).
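Since B becomes a set of ground facts once the two parent rules are applied, coverage of the induced clause reduces to set lookups. A sketch (the tuple representation is ours):

    # ground background knowledge as a set of tuples
    facts = {('mother', 'ann', 'mary'), ('mother', 'ann', 'tom'),
             ('father', 'tom', 'eve'), ('father', 'tom', 'ian'),
             ('female', 'ann'), ('female', 'mary'), ('female', 'eve'),
             ('male', 'pat'), ('male', 'tom')}
    # parent(X,Y) <- mother(X,Y) and parent(X,Y) <- father(X,Y)
    facts |= {('parent', f[1], f[2]) for f in facts
              if f[0] in ('mother', 'father')}

    def covers_daughter(x, y):
        # daughter(X,Y) <- female(X), parent(Y,X).
        return ('female', x) in facts and ('parent', y, x) in facts

    # complete on E+ and consistent on E-
    print([covers_daughter(x, y) for (x, y) in
           [('mary', 'ann'), ('eve', 'tom'), ('tom', 'ann'), ('eve', 'ann')]])
    # -> [True, True, False, False]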

Dimensions of ILP
• Empirical ILP: learn from a fixed, predefined set of examples (batch).
  + noise handling, + efficiency; − examples must be determined in advance
• Incremental ILP: process examples one by one.
  + example selection is easy; − efficiency, − noise
• Interactive ILP: generate questions to an oracle (user).
  + user-friendly; − efficiency, − noise
• Multiple predicate learning: learn multiple predicates from their examples.
  + more powerful; − harder, − less efficient
• Predicate invention: extend the given vocabulary by new predicates.
  + larger class of learnable programs; − harder

History of ILP and RDM
• Prehistory (1970)
  – Plotkin – lgg, rlgg – 1970, 1971
• Early enthusiasm (1975-83)
  – Vere 1975, 1977; Summers 1977; Kodratoff and Jouannaud 1979
  – MIS – Shapiro 1981, 1983
• Dark ages (1983-86)
• Renaissance (1987-96), ILP (1991-...)
  – MARVIN – Sammut & Banerji 1986; DUCE – Muggleton 1987
  – Helft 1987
  – QuMAS – Mozetic 1987; LINUS – Lavrac et al. 1991
  – CIGOL – Muggleton & Buntine 1988
  – ML-SMART – Bergadano, Giordana & Saitta 1988
  – CLINT – De Raedt & Bruynooghe 1988
  – BLIP, MOBAL – Morik, Wrobel et al. 1988, 1989
  – GOLEM – Muggleton & Feng 1990
  – FOIL – Quinlan 1990; mFOIL – Dzeroski 1991; FOCL – Brunk & Pazzani 1991; FFOIL – Quinlan 1996
  – CLAUDIEN – De Raedt & Bruynooghe 1993
  – PROGOL – Muggleton et al. 1995
  – FORS – Karalic & Bratko 1996
• RDM (1997-...)
  – TILDE – Blockeel & De Raedt 1997
  – WARMR – Dehaspe 1997
  – MIDOS – Wrobel 1997
  – ...

Literature
• ILP Workshop Proceedings:
  – Viana do Castelo 1991, Tokyo 1992, Bled 1993, Bonn 1994, Leuven 1995, Stockholm 1996, Prague 1997, Madison 1998, Bled 1999, London 2000, Strasbourg 2001, Sydney 2002
• Special issues:
  – SIGART, 1994
  – New Generation Computing, 1995, 1999
  – Journal of Logic Programming, 1999
  – Data Mining and Knowledge Discovery, 1999
  – Machine Learning, 1997, 2x2001
• Selected books:
  – E.Y. Shapiro (1983) Algorithmic Program Debugging. The MIT Press.
  – L. De Raedt (1992) Interactive Theory Revision: An Inductive Logic Programming Approach. Academic Press.
  – S. Muggleton (ed.) (1992) Inductive Logic Programming. Academic Press.
  – N. Lavrac and S. Dzeroski (1994) Inductive Logic Programming: Techniques and Applications. Ellis Horwood.
  – F. Bergadano and D. Gunetti (1995) Inductive Logic Programming: From Machine Learning to Knowledge Engineering. The MIT Press.
  – L. De Raedt (ed.) (1996) Advances in Inductive Logic Programming. IOS Press.
  – S.H. Nienhuys-Cheng and R. de Wolf (1997) Foundations of Inductive Logic Programming. Springer LNAI.
  – S. Dzeroski and N. Lavrac (eds.) (2001) Relational Data Mining. Springer.

ILP as search of program clauses
• Searching the hypothesis space
  – states: hypotheses from LH
  – goal: find a hypothesis that satisfies the quality criterion, e.g., completeness and consistency
• Naïve generate-and-test algorithm:
  for all H ∈ LH do
    if H is complete and consistent then output(H)
  → computationally expensive (a sketch follows this list)
• Restricting the search
  – language bias: reducing the hypothesis space
  – search bias: reducing the search of the hypothesis space
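A minimal sketch of the naive algorithm, where the hypothesis language is any iterable of candidate hypotheses and covers is the abstract covers relation (both names are ours):

    def generate_and_test(language, covers, B, pos, neg):
        """Enumerate LH and output every complete and consistent hypothesis."""
        for H in language:
            if (all(covers(B, H, e) for e in pos)
                    and not any(covers(B, H, e) for e in neg)):
                yield H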

ILP as search of program clauses
• An ILP learner can be described by
  – the structure of the space of clauses
  – its search strategy
    • uninformed search (depth-first, breadth-first, iterative deepening)
    • heuristic search (best-first, hill-climbing, beam search)
  – its heuristics
    • for directing the search
    • for stopping the search (quality criterion)

ILP as search of program clauses
• Structure of the hypothesis space is based on the generality relation
  – G is more general than S if and only if covers(S) ⊂ covers(G)
• Pruning the search space:
  – if H does not cover an example e, then no specialization of H will cover e
    → used with positive examples for pruning
  – if H covers an example e, then all generalizations of H will also cover e
    → used with negative examples for pruning
• Version space: the above properties determine the space of possible solutions

Hypothesis space
The Version Space model

[Figure: hypotheses ordered by generality, with − examples pruning the too-general hypotheses at the top and + examples pruning the too-specific hypotheses at the bottom; the version space of solutions lies in between.]

How to structure the hypothesis space? How to move from one hypothesis to another?

[Figure: the generality lattice over clauses. Moving up means generalization (induction), moving down means specialization; the top of the lattice is labelled "= false" (most general) and the bottom "= true" (most specific). In between, from general to specific:
  flies(X) ←
  flies(X) ← bird(X)
  flies(X) ← bird(X), normal(X)
A hypothesis covers the examples below it and not those above it; e.g., flies(X) ← bird(X) covers the example flies(oliver) ← bird(oliver) via the substitution θ = {X/oliver}.]

Generality
Let C and D be two clauses. C is more general than D iff C |= D, i.e., iff cov(D) ⊆ cov(C).
Note: logical entailment (even between Horn clauses) is undecidable.

Subsumption
Substitution: θ = {X1/t1, ..., Xn/tn} is an assignment of terms ti to variables Xi.
Subsumption: let C and D be two clauses. C subsumes D iff there is a θ such that Cθ ⊆ D.
Example: p(X,Y) ← r(Y,X) subsumes p(a,b) ← r(b,a) (with θ = {X/a, Y/b}), and it also subsumes p(X,Y) ← r(Y,X), q(X).
Note: subsumption is decidable (but NP-complete).
Note: if C subsumes D, then C |= D. The reverse is not always true!
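Deciding subsumption amounts to a backtracking search for a substitution θ with Cθ ⊆ D. A sketch under assumed conventions: clauses are lists of literals, variables are capitalized strings, compound terms are tuples ('f', t1, ..., tn), negative literals are ('not', atom), and the two clauses share no variable names:

    def is_var(t):
        return isinstance(t, str) and t[:1].isupper()

    def match_term(s, t, theta):
        """Extend theta so that s under theta equals t;
        variables of D are treated as constants."""
        if is_var(s):
            if s in theta:
                return theta if theta[s] == t else None
            return {**theta, s: t}
        if (isinstance(s, tuple) and isinstance(t, tuple)
                and s[0] == t[0] and len(s) == len(t)):
            for a, b in zip(s[1:], t[1:]):
                theta = match_term(a, b, theta)
                if theta is None:
                    return None
            return theta
        return theta if s == t else None

    def match_lit(l1, l2, theta):
        """Match two literals of the same sign and predicate."""
        neg1, neg2 = l1[0] == 'not', l2[0] == 'not'
        if neg1 != neg2:
            return None
        a1, a2 = (l1[1], l2[1]) if neg1 else (l1, l2)
        if a1[0] != a2[0] or len(a1) != len(a2):
            return None
        for s, t in zip(a1[1:], a2[1:]):
            theta = match_term(s, t, theta)
            if theta is None:
                return None
        return theta

    def subsumes(c, d, theta=None):
        """True iff some theta maps every literal of c onto a literal of d."""
        theta = {} if theta is None else theta
        if not c:
            return True
        for lit in d:
            t2 = match_lit(c[0], lit, theta)
            if t2 is not None and subsumes(c[1:], d, t2):
                return True
        return False

    # p(X,Y) <- r(Y,X) subsumes p(a,b) <- r(b,a) with theta = {X/a, Y/b}
    c = [('p', 'X', 'Y'), ('not', ('r', 'Y', 'X'))]
    d = [('p', 'a', 'b'), ('not', ('r', 'b', 'a'))]
    print(subsumes(c, d))   # -> True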

ILP as search of program clauses
• Semantic generality
  Hypothesis H1 is semantically more general than H2 w.r.t. background theory B if and only if B ∪ H1 |= H2
• Syntactic generality or θ-subsumption (most popular in ILP)
  – clause c1 θ-subsumes c2 (c1 ≥θ c2) if and only if ∃θ : c1θ ⊆ c2
  – hypothesis H1 ≥θ H2 if and only if for every c2 ∈ H2 there exists c1 ∈ H1 such that c1 ≥θ c2
• Example
  c1 = daughter(X,Y) ← parent(Y,X)
  c2 = daughter(mary,ann) ← female(mary), parent(ann,mary), parent(ann,tom)
  c1 θ-subsumes c2 under θ = {X/mary, Y/ann}

ILP as search of program clauses
• Properties of θ-subsumption
  – soundness: if c1 ≥θ c2 then c1 |= c2
  – incompleteness: the opposite does not hold, e.g., for the self-recursive clauses c1 = p(f(X)) ← p(X) and c2 = p(f(f(Y))) ← p(Y), c1 |= c2 but c1 does not θ-subsume c2
  – decidable (but NP-complete)
  – transitive and reflexive, but not anti-symmetric
• Lattice on reduced clauses
  – equivalence classes [c], e.g.:
    c1 = parent(X,Y) ← mother(X,Y)
    c2 = parent(X,Y) ← mother(X,Y), mother(X,Z)
  – c1 is a reduced clause of c2 iff c1 is a minimal subset of literals of c2 such that c1 ≡θ c2
  – the equivalence classes induce a lattice (Plotkin):
    • any two clauses have a unique lub (the lgg)
    • any two clauses have a glb
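Clause reduction can be sketched on top of the subsumes test from the sketch above: repeatedly drop a literal whenever the clause still θ-subsumes the shortened clause (the shortened clause subsumes the original for free, being a subset, so only one direction needs checking):

    def reduce_clause(c):
        """Remove literals while staying theta-equivalent to the original clause."""
        c = list(c)
        changed = True
        while changed:
            changed = False
            for lit in c:
                shorter = [l for l in c if l != lit]
                if subsumes(c, shorter):
                    c, changed = shorter, True
                    break
        return c

    # parent(X,Y) <- mother(X,Y), mother(X,Z) reduces to parent(X,Y) <- mother(X,Y)
    c2 = [('parent', 'X', 'Y'), ('not', ('mother', 'X', 'Y')),
          ('not', ('mother', 'X', 'Z'))]
    print(reduce_clause(c2))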

ILP as search of program clauses
• Two strategies for learning
  – top-down search of refinement graphs
  – bottom-up search
    • building least general generalizations
    • inverting resolution (CIGOL)
    • inverting entailment (PROGOL)

Refinement operator C → ref(C)
Minimal (most general) specializations: D ∈ ref(C) is subsumed by C, but there is no more general D' which is also subsumed by C.
• apply a substitution θ to the clause
• add a literal to the clause
The refinement operator defines the specialization graph. The hypothesis language has to be specified: predicates, functors, constants, (types of arguments). The size of the search space increases with the number of existential variables in the body.
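A partial sketch of the "add a literal" refinement step for function-free clauses: each new body literal takes its arguments from variables already in the clause or from one fresh variable (the substitution step and argument typing are omitted; the vocabulary format is ours):

    from itertools import product

    def clause_variables(clause):
        vars_ = []
        for lit in clause:
            atom = lit[1] if lit[0] == 'not' else lit
            for arg in atom[1:]:
                if arg[:1].isupper() and arg not in vars_:
                    vars_.append(arg)
        return vars_

    def add_literal_refinements(clause, vocabulary):
        """vocabulary: (predicate, arity) pairs, e.g. [('parent', 2), ('female', 1)]."""
        vars_ = clause_variables(clause)
        fresh = 'Z' + str(len(vars_))          # one new existential variable
        for pred, arity in vocabulary:
            for args in product(vars_ + [fresh], repeat=arity):
                yield clause + [('not', (pred,) + args)]

    # refinements of daughter(X,Y) <- include daughter(X,Y) <- parent(Y,X)
    start = [('daughter', 'X', 'Y')]
    for refinement in add_literal_refinements(start, [('parent', 2), ('female', 1)]):
        print(refinement)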

Generality ordering of clauses

Training examples              Background knowledge
daughter(mary,ann).  ⊕         parent(ann,mary).   female(ann).
daughter(eve,tom).   ⊕         parent(ann,tom).    female(mary).
daughter(tom,ann).   ⊖         parent(tom,eve).    female(eve).
daughter(eve,ann).   ⊖         parent(tom,ian).

daughter(X,Y) ←
  ...
  daughter(X,Y) ← X=Y
  daughter(X,Y) ← parent(Y,X)
  daughter(X,Y) ← parent(X,Z)
  daughter(X,Y) ← female(X)
  ...
    daughter(X,Y) ← female(X), female(Y)
    daughter(X,Y) ← female(X), parent(Y,X)
    ...

Part of the refinement graph for the family relations problem.

Greedy search of the best clause

Training examples              Background knowledge
daughter(mary,ann).  ⊕         parent(ann,mary).   female(ann).
daughter(eve,tom).   ⊕         parent(ann,tom).    female(mary).
daughter(tom,ann).   ⊖         parent(tom,eve).    female(eve).
daughter(eve,ann).   ⊖         parent(tom,ian).

Coverage is shown as (positive examples covered)/(all examples covered):

daughter(X,Y) ←                              2/4
  daughter(X,Y) ← X=Y                        0/0
  daughter(X,Y) ← parent(Y,X)                2/3
  daughter(X,Y) ← parent(X,Z)
  daughter(X,Y) ← female(X)                  2/3
  ...
    daughter(X,Y) ← female(X), female(Y)     1/2
    daughter(X,Y) ← female(X), parent(Y,X)   2/2
    ...

FOIL
• Language: function-free normal programs
  – recursion, negation, new variables in the body; no functors, no constants (in the original version)
• Algorithm: covering
• Search heuristic: weighted information gain
• Search strategy: hill-climbing
• Stopping criterion: encoding length restriction
• Search space reduction: types, in/out modes, determinate literals
• Ground background knowledge, extensional coverage
• Implemented in C
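FOIL's weighted information gain for adding a literal can be sketched from coverage counts like those on the previous slide. Here the weight t is approximated by p1, the positives still covered (Quinlan's original t counts positive bindings, which requires the full tuple semantics):

    import math

    def foil_gain(p0, n0, p1, n1):
        """Weighted info gain: t * (I(before) - I(after)),
        where I(p, n) = -log2(p / (p + n)) and t is approximated by p1."""
        if p1 == 0:
            return 0.0
        return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

    # daughter(X,Y) <- covers 2 pos / 2 neg; adding female(X) leaves 2 pos / 1 neg,
    # and then adding parent(Y,X) leaves 2 pos / 0 neg
    print(foil_gain(2, 2, 2, 1))   # ~0.83
    print(foil_gain(2, 1, 2, 0))   # ~1.17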

Generalization operator C → gen(C)
Least general generalization (LGG): D ∈ gen(C) subsumes C, but there is no more specific D' which also subsumes C.
• apply an inverse substitution θ* to the clause
• remove a literal from the clause
Techniques:
• RLGG
• inverse resolution
Inverse substitution: θ* = {t1/X1, ..., tn/Xn} is an assignment of variables Xi to terms ti, such that Fθθ* = F.

GOLEM
• GOLEM [Muggleton & Feng 1990]
  – definite programs
  – empirical, batch
  – uses RLGG
  – bottom-up algorithm
  – basic noise handling

Least general generalization
lgg of terms lgg(t1,t2)
1. lgg(t,t) = t
2. lgg(f(s1,...,sn), f(t1,...,tn)) = f(lgg(s1,t1), ..., lgg(sn,tn))
3. lgg(f(s1,...,sm), g(t1,...,tn)) = V, where f ≠ g and V is a variable which represents lgg(f(s1,...,sm), g(t1,...,tn))
4. lgg(s,t) = V, where s ≠ t and at least one of s and t is a variable; in this case, V is a variable which represents lgg(s,t)
Examples:
– lgg([a,b,c], [a,c,d]) = [a,X,Y]
– lgg(f(a,a), f(b,b)) = f(lgg(a,b), lgg(a,b)) = f(V,V), where V stands for lgg(a,b)
When computing lggs, one must be careful to use the same variable for multiple occurrences of the lggs of subterms (lgg(a,b) in this example). This holds for lggs of terms, atoms and clauses alike.
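The four rules translate directly into a recursive function; a memo table enforces the caveat above, reusing one variable for every occurrence of the same pair of subterms. A sketch with compound terms represented as tuples ('f', t1, ..., tn):

    import itertools

    def lgg_term(s, t, memo, fresh):
        if s == t:                                        # rule 1
            return s
        if (isinstance(s, tuple) and isinstance(t, tuple)
                and s[0] == t[0] and len(s) == len(t)):   # rule 2: same functor
            return (s[0],) + tuple(lgg_term(a, b, memo, fresh)
                                   for a, b in zip(s[1:], t[1:]))
        if (s, t) not in memo:                            # rules 3 and 4
            memo[(s, t)] = 'V' + str(next(fresh))         # one variable per pair
        return memo[(s, t)]

    # lgg(f(a,a), f(b,b)) = f(V,V): both occurrences share the same variable
    print(lgg_term(('f', 'a', 'a'), ('f', 'b', 'b'), {}, itertools.count()))
    # -> ('f', 'V0', 'V0')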

Least general generalization
lgg of atoms lgg(A1,A2)
1. lgg(p(s1,...,sn), p(t1,...,tn)) = p(lgg(s1,t1), ..., lgg(sn,tn)), if the atoms have the same predicate symbol p
2. lgg(p(s1,...,sm), q(t1,...,tn)) is undefined if p ≠ q
lgg of literals lgg(L1,L2)
1. if L1 and L2 are atoms, then lgg(L1,L2) is computed as defined above
2. if both L1 and L2 are negative literals, L1 = ¬A1 and L2 = ¬A2, then lgg(L1,L2) = lgg(¬A1,¬A2) = ¬lgg(A1,A2)
3. if L1 is a positive and L2 a negative literal, or vice versa, lgg(L1,L2) is undefined
Examples:
– lgg(parent(ann,mary), parent(ann,tom)) = parent(ann,X)
– lgg(parent(ann,mary), ¬parent(ann,tom)) is undefined
– lgg(parent(ann,X), daughter(mary,ann)) is undefined

Least general generalization
lgg of clauses lgg(c1,c2)
Let c1 = {L1,...,Ln} and c2 = {K1,...,Km}. Then
lgg(c1,c2) = {lgg(Li,Kj) | Li ∈ c1, Kj ∈ c2 and lgg(Li,Kj) is defined}
Example:
c1 = daughter(mary,ann) ← female(mary), parent(ann,mary)
c2 = daughter(eve,tom) ← female(eve), parent(tom,eve)
lgg(c1,c2) = daughter(X,Y) ← female(X), parent(Y,X),
where X stands for lgg(mary,eve) and Y stands for lgg(ann,tom).
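Extending lgg_term from the sketch above to literals and clauses; one memo table is shared across all literal pairs, so that, e.g., lgg(mary,eve) becomes the same variable everywhere it occurs:

    import itertools

    def lgg_lit(l1, l2, memo, fresh):
        neg1, neg2 = l1[0] == 'not', l2[0] == 'not'
        if neg1 != neg2:
            return None                 # mixed signs: undefined
        a1, a2 = (l1[1], l2[1]) if neg1 else (l1, l2)
        if a1[0] != a2[0] or len(a1) != len(a2):
            return None                 # different predicates: undefined
        atom = (a1[0],) + tuple(lgg_term(s, t, memo, fresh)
                                for s, t in zip(a1[1:], a2[1:]))
        return ('not', atom) if neg1 else atom

    def lgg_clause(c1, c2):
        """Clauses as sets of literals; body literals appear negated."""
        memo, fresh = {}, itertools.count()
        pairs = (lgg_lit(l1, l2, memo, fresh) for l1 in c1 for l2 in c2)
        return {l for l in pairs if l is not None}

    c1 = {('daughter', 'mary', 'ann'), ('not', ('female', 'mary')),
          ('not', ('parent', 'ann', 'mary'))}
    c2 = {('daughter', 'eve', 'tom'), ('not', ('female', 'eve')),
          ('not', ('parent', 'tom', 'eve'))}
    print(lgg_clause(c1, c2))
    # structurally: daughter(X,Y) <- female(X), parent(Y,X)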

GOLEM using Relative LGG (Plotkin 1971)
• lgg relative to a background theory B (with model M)
• rlgg(e1,e2) = lgg(e1 ← M, e2 ← M)
• e1, e2 are true ground facts (positive examples); M is a ground model (possibly h-easy) of the background theory
• in general inefficient, even uncomputable (for indefinite models)
• rlgg is the basis of the system GOLEM (Muggleton and Feng 1990)
  – the first ILP system to be applied in practice and to make real scientific discoveries

GOLEM using Relative LGG
Example:
• Background knowledge B: {female(mary), parent(ann,mary), female(eve), parent(tom,eve)}
• Positive examples: e1 = daughter(mary,ann), e2 = daughter(eve,tom)
rlgg(e1,e2) = lgg(e1 ← B, e2 ← B), i.e., the lgg of
  d(m,a) ← f(m), p(a,m), f(e), p(t,e)
  d(e,t) ← f(m), p(a,m), f(e), p(t,e)
which is
  d(Vme,Vat) ← f(m), p(a,m), f(e), p(t,e), f(Vme), p(Vat,Vme), f(Vem), p(Vta,Vem)
and, after reducing away the redundant literals,
  rlgg = daughter(X,Y) ← parent(Y,X), female(X)
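The hand-computed rlgg above can be reproduced with lgg_clause from the earlier sketch; reducing the resulting clause to the final two-literal form is a separate step (see the reduce_clause sketch):

    # rlgg(e1,e2) = lgg(e1 <- M, e2 <- M), with M the ground model of B
    M = [('female', 'mary'), ('parent', 'ann', 'mary'),
         ('female', 'eve'), ('parent', 'tom', 'eve')]

    def rlgg(e1, e2, model):
        body = {('not', a) for a in model}
        return lgg_clause({e1} | body, {e2} | body)

    print(rlgg(('daughter', 'mary', 'ann'), ('daughter', 'eve', 'tom'), M))
    # the body contains female(X), parent(Y,X) plus the redundant ground literals;
    # reduction yields daughter(X,Y) <- parent(Y,X), female(X)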

Recent RDM techniques
• Predictive RDM
  – learning of classification rules
  – induction of logical decision trees
  – first-order regression
  – inductive constraint logic programming
  – ...
• Descriptive RDM
  – learning of causal theories and association rules
  – first-order clustering
  – database restructuring
  – subgroup discovery
  – learning models of dynamic systems
  – ...