Finding Peculiar Compositions of Frequent

ECML PKDD 2009

Sep. 9, 2009@Bled Slovenia

Finding Peculiar Compositions of Frequent Substrings from Sparse Text Data Using Background Texts

Daisuke Ikeda Einoshin Suzuki Department of Informatics, Kyushu University, JAPAN {daisuke, suzuki}@inf.kyushu-u.ac.jp

Wednesday, September 9, 2009



Motivating Example

Papers written by non-natives: Target Texts
  ......we employ a set........and........
  ....in order to discuss about a vector space on a space of points......
  ...such as the field of real or of complex numbers...
  We are developing ..... algorithms ......about....
  We discuss about a structure on the quotient set...
  ......namely we discuss a vector of real numbers.....
  ......is more frequent in.......and much more.....
  Unlike existing methods....... use simple estimation methods...
  Simply discuss about a field...
discuss about: peculiar

Papers by native English speakers: Background Texts
  ......we discuss various issues ....
  Given ..... is to put about.....,
  where I discuss......and .......
  ...... about 10 o'clock .......and....precisely....about ...
discuss, about: frequent

Basic Notations

$x, y \in \Sigma^{*}$ : strings ($\Sigma$ : alphabet)

composition $(x, y)$ of $x$ and $y$ $\overset{\mathrm{def}}{\iff}$ $xy$ (concatenation)
  e.g., if x = "discuss" and y = " about", then (x, y) = "discuss about"

the frequency $f(x|y)$ of $x$ in $y$ $\overset{\mathrm{def}}{\iff}$ the number of occurrences of $x$ in $y$
  e.g., $f(\mathrm{i} \mid \mathrm{mississippi}) = 4$, $f(\mathrm{issi} \mid \mathrm{mississippi}) = 2$

$D$ : a set of strings

$$f(x|D) = \sum_{y \in D} f(x|y)$$

(empirical) probability:

$$P(x|D) = \frac{f(x|D)}{\#D} \qquad (\#D : \text{total number of substrings in } D)$$
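The notation above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names are hypothetical, and `total_substrings` assumes that #D counts every substring occurrence (every start/end pair) in D, which is one possible reading of "total number of substrings".

```python
def occurrences(x, y):
    """f(x|y): number of occurrences of x in y, overlaps included."""
    if not x:
        return 0
    return sum(1 for i in range(len(y) - len(x) + 1) if y.startswith(x, i))


def frequency(x, D):
    """f(x|D): sum of f(x|y) over all strings y in D."""
    return sum(occurrences(x, y) for y in D)


def total_substrings(D):
    """#D, ASSUMED to be the number of substring occurrences in D."""
    return sum(len(y) * (len(y) + 1) // 2 for y in D)


def probability(x, D):
    """Empirical probability P(x|D) = f(x|D) / #D."""
    return frequency(x, D) / total_substrings(D)


print(occurrences("i", "mississippi"))     # 4
print(occurrences("issi", "mississippi"))  # 2 (the two occurrences overlap)
```

Counting overlapping occurrences is what makes f(issi | mississippi) equal to 2, as on the slide.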

Peculiar Composition Discovery Problem

$T, B$ : two sets of strings (target and background texts, resp.)
$x, y$ : strings
$\theta_T, \theta_B > 1$ : thresholds

$(x, y)$ is peculiar (in $T$ against $B$) $\overset{\mathrm{def}}{\iff}$

$$P(x|B) > \theta_B P(x|T), \qquad P(y|B) > \theta_B P(y|T), \qquad \theta_T P(xy|B) < P(xy|T)$$

Peculiar Composition Discovery Problem:
  Input: $T, B, \theta_T, \theta_B, \eta_T$ (min-sup)
  Output: all peculiar compositions $(x_1, y_1), (x_2, y_2), \ldots$ which are maximal and appear at least $\eta_T$ times in $T$
  maximality: (discuss, about), (discuss, abou), (discuss, abo), ...
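As a minimal sketch of the definition, the three inequalities can be checked directly once the six empirical probabilities are known. The function name, argument names and the numbers in the usage lines are hypothetical and chosen only for illustration.

```python
def is_peculiar(p_x_T, p_x_B, p_y_T, p_y_B, p_xy_T, p_xy_B, theta_T, theta_B):
    """Check the three conditions of the definition for one composition (x, y).

    p_s_T and p_s_B stand for P(s|T) and P(s|B).
    """
    return (
        p_x_B > theta_B * p_x_T          # P(x|B) > theta_B * P(x|T)
        and p_y_B > theta_B * p_y_T      # P(y|B) > theta_B * P(y|T)
        and theta_T * p_xy_B < p_xy_T    # theta_T * P(xy|B) < P(xy|T)
    )


# Toy probabilities (made up): all three conditions hold.
print(is_peculiar(0.01, 0.03, 0.02, 0.05, 0.004, 0.0005, theta_T=5, theta_B=1.5))
# Same values, but xy is now too common in the background texts.
print(is_peculiar(0.01, 0.03, 0.02, 0.05, 0.004, 0.001, theta_T=5, theta_B=1.5))
```

The sketch covers only the peculiarity test; the minimum support and maximality requirements of the problem statement are not handled here.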

Related Works

[Ji et al. ICDM’05]: contrast patterns from supervised texts
  popular in target texts but rare in background ones
[Suzuki KDD’97]: exceptional & general rule mining
  our basic idea borrows from this work
  table data, not for text data
[Marschall & Rahmann ISMB’09]: motif discovery
  exceptional patterns are likely to be motifs
  p-value for 10-gram patterns under Bernoulli (no-overlap) and Markov (overlap) models
[Leung et al. JCB’96] and [Schbath et al. JCB’97]: exceptional n-gram discovery using z-score
  n is fixed (2- to 5-gram) under Bernoulli and Markov models

z-score

$w$ : substring of length $n$ ($n$-gram)

z-score is defined by

$$z(w) = \frac{f(w) - E(w)}{N(w)}$$

  $f(w)$ : observed frequency
  $E(w)$ : its expectation under a probabilistic model
  $N(w)$ : normalization factor

Example of $E(w)$: if symbols in $\{A, C, G, T\}$ occur independently and their probabilities are $P(A) = P(C) = P(G) = P(T) = 1/4$, then

$$E(w) = \left(\frac{1}{4}\right)^{|w|}$$

let $w = \mathrm{ACTACCAG}$ ($|w| = 8$), then

$$E(w) = \frac{1}{4^{8}}$$
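The example of E(w) can be reproduced with a short sketch. The helper names are hypothetical, and since the slide does not specify the normalization factor N(w), the `z_score` function simply takes it as an argument; the observed value and the expectation must be on the same scale.

```python
from math import prod


def expected_independent(w, symbol_prob):
    """E(w) when symbols occur independently: product of symbol probabilities."""
    return prod(symbol_prob[c] for c in w)


def z_score(observed, expected, normalization):
    """z(w) = (f(w) - E(w)) / N(w); N(w) is supplied by the caller."""
    return (observed - expected) / normalization


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
e = expected_independent("ACTACCAG", uniform)
print(e == 1 / 4**8)  # True: (1/4)^8 = 1/65536
```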

Problems of Scores Based on Probabilistic Models

n of n-gram is fixed:
1. large n: inaccurate estimation of E(w) for sparse text
   sparseness: many n-grams do not appear in given data if n is large
   [Figure: probability versus rank for 3-grams, 6-grams and 9-grams]
2. small n: again inaccurate estimation of E(w) for long patterns from probabilities of shorter n-grams
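The sparseness in item 1 can be made concrete by measuring what fraction of all possible n-grams actually occurs in a text. The function below is a hypothetical helper written for this illustration, and the input string is a toy example rather than data from the talk.

```python
def ngram_coverage(text, n, alphabet_size):
    """Fraction of the alphabet_size**n possible n-grams that occur in text."""
    observed = {text[i:i + n] for i in range(len(text) - n + 1)}
    return len(observed) / alphabet_size ** n


toy = "ACGTACGT"
for n in (1, 2, 3):
    # Coverage shrinks quickly as n grows: 1.0, 0.25, 0.0625 for this toy text.
    print(n, ngram_coverage(toy, n, alphabet_size=4))
```

The number of possible n-grams grows exponentially in n while the number of positions in the text does not, so most long n-grams have an observed frequency of zero.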


Our Strategy to The Problems

background texts instead of a probabilistic model
  being peculiar is defined by two ratios of frequencies between T and B
composition of frequent substrings with arbitrary length
  long substrings and much longer compositions

New Challenges
  Data size is increased (T and B)
  Simultaneous discovery of frequent substrings and peculiar compositions

Algorithm and data structure (suffix tree)

Data Structure: Suffix Tree [McCreight’76]

compact trie of all the suffixes
Example: the suffix tree for ‘mississippi$’
  ‘$’ is the special symbol not in Σ

[Figure: the suffix tree for ‘mississippi$’, with nodes u, v and w marked]

A node v corresponds to a substring, denoted by BS(v)
  e.g., BS(v) = i, BS(u) = issi
frequency of BS(v) = # of leaves below v
there exist only O(N) nodes (N : input length)
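The relation "frequency of BS(v) = number of leaves below v" can be checked without building a suffix tree: every leaf is a suffix, and the leaves below the node for x are exactly the suffixes that start with x. The sketch below uses a sorted list of suffixes as a simple stand-in; it is not McCreight's linear-time construction, it uses quadratic space, and the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def sorted_suffixes(text):
    """All suffixes of text in lexicographic order (naive, O(N^2) space)."""
    return sorted(text[i:] for i in range(len(text)))


def count_by_suffixes(x, suffixes):
    """Number of suffixes that start with x, i.e. the frequency of x.

    Suffixes sharing the prefix x are contiguous in sorted order. The upper
    bound assumes the text never contains the code point U+10FFFF.
    """
    lo = bisect_left(suffixes, x)
    hi = bisect_right(suffixes, x + "\U0010ffff")
    return hi - lo


suffixes = sorted_suffixes("mississippi$")
print(count_by_suffixes("i", suffixes))     # 4
print(count_by_suffixes("issi", suffixes))  # 2
```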

Outline of Our Algorithm: FPCS (Finding Peculiar Compositions)

construct the suffix tree of input texts
traverse each node v of the tree
  BS(v) is a candidate for peculiar compositions
  check if BS(v) appears at least $\eta_T$ times
  if so, for each composition (x, y) of BS(v) = xy
    check if $P(x|B) > \theta_B P(x|T)$, $P(y|B) > \theta_B P(y|T)$ and $\theta_T P(xy|B) < P(xy|T)$
    output (x, y)

Theorem: FPCS finds all maximal peculiar compositions
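The control flow of the outline can be sketched with a brute-force substitute for the suffix tree. This is not the authors' FPCS implementation: it enumerates every substring into a hash table (so it is far slower and only suitable for toy inputs), it takes #D to be the total number of substring occurrences, and it omits the maximality filter. All names are hypothetical.

```python
from collections import Counter


def substring_counts(texts):
    """Occurrence count of every substring of every text (brute force)."""
    counts = Counter()
    for t in texts:
        for i in range(len(t)):
            for j in range(i + 1, len(t) + 1):
                counts[t[i:j]] += 1
    return counts


def peculiar_compositions(T, B, theta_T, theta_B, eta_T):
    """Return all compositions (x, y) passing the support and peculiarity checks."""
    f_T, f_B = substring_counts(T), substring_counts(B)
    n_T, n_B = sum(f_T.values()), sum(f_B.values())  # ASSUMED meaning of #T, #B

    def p_T(s):
        return f_T[s] / n_T

    def p_B(s):
        return f_B[s] / n_B  # Counter yields 0 for substrings absent from B

    found = []
    for w, freq in f_T.items():          # each candidate string w = xy
        if freq < eta_T:                 # minimum support in T
            continue
        for k in range(1, len(w)):       # each split of w into (x, y)
            x, y = w[:k], w[k:]
            if (p_B(x) > theta_B * p_T(x)
                    and p_B(y) > theta_B * p_T(y)
                    and theta_T * p_B(w) < p_T(w)):
                found.append((x, y))
    return found


print(peculiar_compositions(["abab"], ["aa", "bb"], theta_T=5, theta_B=1.5, eta_T=2))
# [('a', 'b')]
```

In the toy run, "a" and "b" are each relatively more common in the background texts, while "ab" occurs only in the target text, so (a, b) is reported.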

Time Complexity of FPCS

construct the suffix tree of input texts ← $O(N)$ time
traverse each node v of the tree ← there exist $O(N)$ nodes
  BS(v) is a candidate for peculiar compositions
  check if BS(v) appears at least $\eta_T$ times
  if so, for each composition (x, y) of BS(v) = xy ← there exist $O(N)$ compositions
    check if $P(x|B) > \theta_B P(x|T)$, $P(y|B) > \theta_B P(y|T)$ and $\theta_T P(xy|B) < P(xy|T)$ ← constant time
    output (x, y)

[Figure: fragment of the suffix tree for ‘mississippi$’ in which node v is annotated with the pair (f(BS(v)|T), f(BS(v)|B))]

Theorem: The time complexity of FPCS is $O(N^2)$


Complexity Discussion

a naive algorithm: $O(N^4)$
  generate-check approach: enumerate all pairs of substrings x, y and check if w = xy is peculiar or not
FPCS (finding peculiar compositions): $O(N^2)$
  anti-monotonicity does not hold: even if (x, y) is not peculiar, we do not know whether (x', y') is peculiar or not (xy = x'y')
z-score [Apostolico JCB’00]: $O(N)$
  pattern: just a substring

Experiments:
1. linear scalability for practical parameters
2. peculiar compositions which can’t be found by z-score

Experiments: Data and Computing Environment

Real Data
  whole DNA sequences of Escherichia coli K-12 (RefSeq NC_000913) and Bacillus subtilis (RefSeq NC_000964)
  their sizes: 9.3M and 7.4M with their complementary strands
Inflated Data
  extract randomly m substrings with the same length n from the above data
Computing Environment
  gcc 4.0.1 with -O3, -arch ppc64 and -fast flags
  PowerMac G5: Mac OS X 10.5, 4×2.5GHz PowerPC G5, 8GB RAM


Performance of FPCS on Different Parameters

[Figure: execution time (seconds) versus ηT, one curve for each (θT, θB) = (5, 1.5), (10, 1.5), (5, 2.0), (10, 2.0), (5, 2.5), (10, 2.5), (5, 3.0), (10, 3.0), grouped by θB = 1.5, 2.0, 2.5, 3.0; T = B. subtilis, B = E. coli, N = 17.7MB]

Execution times decrease drastically as ηT increases.
θB is more effective for pruning than θT.

Details in case of θB = 1.5

[Figure: execution time (seconds) and the number of outputs versus ηT, for θT = 5 and θT = 10; T = B. subtilis, B = E. coli, N = 17.7MB]

2.235σ = 2.235 → 2.5%

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad E(X) = \mu, \qquad V(X) = \sigma^2$$

we have more than $N^2 = 10^{12}$ substrings ($N > 10^6$)
→ 2.5% of $10^{12}$ substrings have a z-score ≥ 2.235: a huge number
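The arithmetic on this slide can be checked numerically. The sketch assumes that the 2.5% refers to the two-sided tail of a standard normal distribution (|z| ≥ 2.235); that reading, and the function name, are assumptions rather than something stated on the slide.

```python
from math import erfc, sqrt


def two_sided_tail(z):
    """P(|Z| >= z) for a standard normal Z, via the complementary error function."""
    return erfc(z / sqrt(2))


p = two_sided_tail(2.235)
print(round(p, 3))   # 0.025, i.e. about 2.5%
print(p * 10**12)    # expected number of substrings above the threshold
```

Under this reading, roughly 2.5 × 10^10 of the 10^12 substrings would exceed the threshold purely by chance, which is the "huge number" the slide refers to.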

MO Methods

Letter-based estimation (independent letters)
  P(ABCDEF) = P(A) * P(B) * P(C) * ... * P(F)
KVI [Krishnan, Vitter, Iyer, 1996]
  Using independent substrings
  P(ABCDEF) = P(AB) * P(CDE) * P(F)
MO (Maximal overlap) [Jagadish, 1999]
  Using overlapping substrings
  P(ABCDEF) = P(AB|ε) * P(CDE|B) * P(F|DE)
            = P(AB) * P(BCDE) / P(B) * P(DEF) / P(DE)
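The three estimates differ only in which stored probabilities are multiplied together, which a small sketch makes explicit. The decomposition of ABCDEF is taken from the slide; the function names and every probability value in the table `p` are made up for illustration.

```python
from math import prod


def letter_estimate(p, s):
    """Letter-based: product of single-letter probabilities."""
    return prod(p[c] for c in s)


def kvi_estimate(p, pieces):
    """KVI: product over disjoint pieces, treated as independent."""
    return prod(p[piece] for piece in pieces)


def mo_estimate(p, pieces):
    """MO: each (piece, overlap) pair contributes P(piece) / P(overlap).

    An empty overlap stands for conditioning on the empty string.
    """
    result = 1.0
    for piece, overlap in pieces:
        result *= p[piece] / (p[overlap] if overlap else 1.0)
    return result


# Hypothetical probability table; none of these numbers come from the talk.
p = {"A": 0.5, "B": 0.25, "C": 0.5, "D": 0.5, "E": 0.5, "F": 0.5,
     "AB": 0.2, "CDE": 0.1, "BCDE": 0.05, "DE": 0.2, "DEF": 0.1}

print(letter_estimate(p, "ABCDEF"))
print(kvi_estimate(p, ["AB", "CDE", "F"]))
print(mo_estimate(p, [("AB", ""), ("BCDE", "B"), ("DEF", "DE")]))
```

The last call evaluates P(AB) * P(BCDE) / P(B) * P(DEF) / P(DE), the final line of the MO formula on the slide.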