ECML PKDD 2009

Sep. 9, 2009 @ Bled, Slovenia

Finding Peculiar Compositions of Frequent Substrings from Sparse Text Data Using Background Texts

Daisuke Ikeda, Einoshin Suzuki
Department of Informatics, Kyushu University, JAPAN
{daisuke, suzuki}@inf.kyushu-u.ac.jp

Wednesday, September 9, 2009


Motivating Example

Papers written by non-natives: Target Texts
......we employ a set........and........
....in order to discuss about a vector space on a space of points......
...such as the field of real or of complex numbers...
We are developing ..... algorithms ......about....
We discuss about a structure on the quotient set...
......namely we discuss a vector of real numbers.....
......is more frequent in.......and much more.....
Unlike existing methods....... use simple estimation methods...
Simply discuss about a field...

Papers by native English speakers: Background Texts
......we discuss various issues ....
Given ..... is to put about....., where I discuss......and .......
...... about 10 o'clock .......and....precisely....about ...


discuss, about: frequent
discuss about: peculiar

Basic Notations

$x, y \in \Sigma^{*}$: strings ($\Sigma$: alphabet)

composition $(x, y)$ of $x$ and $y$ $\overset{\mathrm{def}}{\iff}$ $xy$ (concatenation)
e.g., if $x$ = "discuss" and $y$ = " about", then $(x, y)$ = "discuss about"

the frequency $f(x|y)$ of $x$ in $y$ $\overset{\mathrm{def}}{\iff}$ the number of occurrences of $x$ in $y$
e.g., $f(\text{i}|\text{mississippi}) = 4$, $f(\text{issi}|\text{mississippi}) = 2$

$D$: a set of strings, $f(x|D) = \sum_{y \in D} f(x|y)$

(empirical) probability: $P(x|D) = \frac{f(x|D)}{\#D}$ ($\#D$: total number of substrings in $D$)
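The notations above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the sketch assumes that #D counts every non-empty substring occurrence, i.e. n(n+1)/2 for a string of length n, which the slide does not spell out.

```python
def freq(x: str, y: str) -> int:
    """f(x|y): number of (possibly overlapping) occurrences of x in y."""
    if not x:
        return 0
    return sum(1 for i in range(len(y) - len(x) + 1) if y.startswith(x, i))


def freq_in_set(x: str, texts: list[str]) -> int:
    """f(x|D): sum of f(x|y) over all strings y in D."""
    return sum(freq(x, y) for y in texts)


def total_substrings(texts: list[str]) -> int:
    """#D, ASSUMED here to be the number of non-empty substring occurrences."""
    return sum(len(y) * (len(y) + 1) // 2 for y in texts)


def prob(x: str, texts: list[str]) -> float:
    """Empirical probability P(x|D) = f(x|D) / #D."""
    return freq_in_set(x, texts) / total_substrings(texts)
```

For example, `freq("i", "mississippi")` returns 4 and `freq("issi", "mississippi")` returns 2, matching the examples on the slide; the two occurrences of "issi" overlap.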


Peculiar Composition Discovery Problem

$T, B$: two sets of strings (target & background texts, resp.)
$x, y$: strings
$\theta_T, \theta_B > 1$: thresholds

$(x, y)$ is peculiar (in $T$ against $B$) $\overset{\mathrm{def}}{\iff}$
  $P(x|B) > \theta_B P(x|T)$
  $P(y|B) > \theta_B P(y|T)$
  $\theta_T P(xy|B) < P(xy|T)$

Peculiar Composition Discovery Problem:
Input: $T, B, \theta_T, \theta_B, \eta_T$ (min-sup)
Output: all peculiar compositions $(x_1, y_1), (x_2, y_2), \ldots$ which are maximal and appear at least $\eta_T$ times in $T$
maximality: (discuss, about), (discuss, abou), (discuss, abo), ...
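The three inequalities of the definition translate directly into a predicate. The sketch below assumes the six probabilities have already been computed; the function name, argument names, and the numbers in the usage note are hypothetical.

```python
def is_peculiar(p_x_T: float, p_x_B: float,
                p_y_T: float, p_y_B: float,
                p_xy_T: float, p_xy_B: float,
                theta_T: float, theta_B: float) -> bool:
    """Check whether (x, y) is peculiar in T against B.

    p_x_T stands for P(x|T), p_x_B for P(x|B), and so on.
    """
    return (p_x_B > theta_B * p_x_T          # P(x|B)  > theta_B * P(x|T)
            and p_y_B > theta_B * p_y_T      # P(y|B)  > theta_B * P(y|T)
            and theta_T * p_xy_B < p_xy_T)   # theta_T * P(xy|B) < P(xy|T)
```

With made-up values such as `is_peculiar(0.001, 0.003, 0.002, 0.005, 0.0005, 0.00001, theta_T=5, theta_B=1.5)` all three conditions hold, so the result is `True`; raising P(xy|B) to 0.0002 breaks the third condition.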

Related Works

[Ji et al. ICDM’05]: contrast patterns from supervised texts
  popular in target texts but rare in background ones
[Suzuki KDD’97]: exceptional & general rule mining
  our basic idea borrows from this work
  for table data, not for text data
[Marschall & Rahmann ISMB’09]: motif discovery
  exceptional patterns are likely to be motifs
  p-value for 10-gram patterns under Bernoulli (no-overlap) and Markov (overlap) models
[Leung et al. JCB’96] and [Schbath et al. JCB’97]: exceptional n-gram discovery using z-score
  n is fixed (2- to 5-gram)
  under Bernoulli and Markov models

z-score

$w$: substring of length $n$ ($n$-gram)
z-score is defined by $z(w) = \frac{f(w) - E(w)}{N(w)}$
  $f(w)$: observed frequency
  $E(w)$: its expectation under a probabilistic model
  $N(w)$: normalization factor

Example of $E(w)$:
if symbols in $\{A, C, G, T\}$ occur independently and their probabilities are $P(A) = P(C) = P(G) = P(T) = 1/4$, then $E(w) = \left(\frac{1}{4}\right)^{|w|}$
let $w = ACTACCAG$ ($|w| = 8$), then $E(w) = \frac{1}{4^{8}}$
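The example of E(w) can be checked with exact arithmetic. The sketch below follows the slide's expression for E(w) under independent symbols; the normalization factor N(w) is not specified on the slide, so it is simply passed in as an argument. All function names are hypothetical.

```python
from fractions import Fraction


def expectation_iid(w: str, symbol_prob: dict[str, Fraction]) -> Fraction:
    """E(w) as on the slide: product of the per-symbol probabilities of w."""
    e = Fraction(1)
    for c in w:
        e *= symbol_prob[c]
    return e


def z_score(f_w, e_w, n_w):
    """z(w) = (f(w) - E(w)) / N(w); N(w) is supplied by the caller."""
    return (f_w - e_w) / n_w


# Uniform, independent symbols over {A, C, G, T}.
uniform = {c: Fraction(1, 4) for c in "ACGT"}
e_w = expectation_iid("ACTACCAG", uniform)  # Fraction(1, 65536), i.e. 1/4^8
```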

Problems of Scores Based on Probabilistic Models

n of n-gram is fixed:
1. large n: inaccurate estimation of E(w) for sparse text
   sparseness: many n-grams do not appear in given data if n is large
   [Figure: probability vs. rank for 3-grams, 6-grams, and 9-grams]
2. small n: again inaccurate estimation of E(w) for long patterns from probabilities of shorter n-grams


Our Strategy to The Problems

background texts instead of a probabilistic model
  being peculiar is defined by two ratios of frequencies between T & B
composition of frequent substrings with arbitrary length
  long substrings and much longer compositions

New Challenges
  Data size is increased (T & B)
  Simultaneous discovery of frequent substrings and peculiar compositions
  → Algorithm and data structure (suffix tree)


Data Structure: Suffix Tree [McCreight’76]

compact trie of all the suffixes
Example: the suffix tree for ‘mississippi$’ (‘$’ is the special symbol not in Σ)

[Figure: the suffix tree for ‘mississippi$’, with nodes u, v, and w marked]

A node v corresponds to a substring, denoted by BS(v)
  e.g., BS(v) = i, BS(u) = issi
frequency of BS(v) = # of leaves below v
there exist only O(N) nodes (N: input length)
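The property "frequency of BS(v) = # of leaves below v" can be illustrated without building a real suffix tree. The sketch below uses a sorted list of suffixes as a stand-in: each suffix plays the role of a leaf, and the suffixes that start with a given substring are exactly the leaves below its node. It is not McCreight's linear-time construction (it takes quadratic space), the function names are hypothetical, and it assumes the text contains neither '$' nor the largest Unicode code point.

```python
from bisect import bisect_left, bisect_right


def build_sorted_suffixes(text: str) -> list[str]:
    """All suffixes of text + '$' in sorted order; one entry per leaf."""
    s = text + "$"
    return sorted(s[i:] for i in range(len(s)))


def count_occurrences(suffixes: list[str], pattern: str) -> int:
    """Number of suffixes that start with pattern (= leaves below its node)."""
    lo = bisect_left(suffixes, pattern)
    # Every string starting with pattern sorts before pattern + max code point.
    hi = bisect_right(suffixes, pattern + "\U0010ffff")
    return hi - lo


suffixes = build_sorted_suffixes("mississippi")
```

Here `count_occurrences(suffixes, "i")` gives 4 and `count_occurrences(suffixes, "issi")` gives 2, the same frequencies as in the basic notations.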

Outline of Our Algorithm: FPCS (Finding Peculiar Compositions)

construct the suffix tree of input texts
traverse each node $v$ of the tree
  $BS(v)$ is a candidate for peculiar compositions
  check if $BS(v)$ appears $\eta_T$ times
  if so, for each composition $(x, y)$ of $BS(v) = xy$
    check if
      $P(x|B) > \theta_B P(x|T)$
      $P(y|B) > \theta_B P(y|T)$
      $\theta_T P(xy|B) < P(xy|T)$
    output $(x, y)$

Theorem: FPCS finds all maximal peculiar compositions
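The outline can be mimicked on toy inputs with plain substring counting. The sketch below is not the authors' implementation: it replaces the suffix tree with a `Counter` over all substrings up to a hypothetical length cap `max_len`, assumes #D is the number of non-empty substring occurrences, and does not filter the output down to maximal compositions.

```python
from collections import Counter


def substring_counts(texts: list[str], max_len: int) -> Counter:
    """Frequency of every substring of length <= max_len (naive enumeration)."""
    counts = Counter()
    for y in texts:
        for i in range(len(y)):
            for j in range(i + 1, min(len(y), i + max_len) + 1):
                counts[y[i:j]] += 1
    return counts


def total_substrings(texts: list[str]) -> int:
    """#D, ASSUMED to be the number of non-empty substring occurrences."""
    return sum(len(y) * (len(y) + 1) // 2 for y in texts)


def find_peculiar(T, B, theta_T, theta_B, eta_T, max_len=12):
    """Return peculiar compositions (x, y); maximality is NOT checked."""
    f_T, f_B = substring_counts(T, max_len), substring_counts(B, max_len)
    n_T, n_B = total_substrings(T), total_substrings(B)
    found = []
    for w, count in f_T.items():
        # Candidate: appears at least eta_T times in T and can be split.
        if count < eta_T or len(w) < 2:
            continue
        for k in range(1, len(w)):
            x, y = w[:k], w[k:]
            if (f_B[x] / n_B > theta_B * f_T[x] / n_T
                    and f_B[y] / n_B > theta_B * f_T[y] / n_T
                    and theta_T * f_B[w] / n_B < f_T[w] / n_T):
                found.append((x, y))
    return found
```

On the toy input `find_peculiar(["cabcabcab"], ["a", "b"], theta_T=5, theta_B=1.5, eta_T=3)` the only composition reported is `("a", "b")`: both letters are relatively more probable in the background, while "ab" never occurs there.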


Time Complexity of FPCS

construct the suffix tree of input texts ← $O(N)$ time
traverse each node $v$ of the tree ← ∃ $O(N)$ nodes
  $BS(v)$ is a candidate for peculiar compositions
  check if $BS(v)$ appears $\eta_T$ times
  if so, for each composition $(x, y)$ of $BS(v) = xy$ ← ∃ $O(N)$ compositions
    check if ← constant time
      $P(x|B) > \theta_B P(x|T)$
      $P(y|B) > \theta_B P(y|T)$
      $\theta_T P(xy|B) < P(xy|T)$
    output $(x, y)$

[Figure: the suffix tree for ‘mississippi$’, in which a node $v$ is annotated with $(f(BS(v)|T), f(BS(v)|B))$]

Theorem: The time complexity of FPCS is $O(N^2)$


Complexity Discussion

$O(N^4)$: a naive algorithm (generate-check approach)
  enumerate all pairs of substrings $x, y$ and check if $w = xy$ is peculiar or not
$O(N^2)$: FPCS (finding peculiar compositions)
  anti-monotonicity does not hold
  Even if $(x, y)$ is not peculiar, we do not know whether $(x', y')$ is peculiar or not ($xy = x'y'$).
$O(N)$: z-score [Apostolico JCB’00]
  pattern: just a substring

Experiments:
1. linear scalability for practical parameters
2. peculiar compositions which can’t be found by z-score

Experiments: Data and Computing Environment

Real Data
  whole DNA sequences of Escherichia coli K-12 (RefSeq NC_000913) and Bacillus subtilis (RefSeq NC_000964)
  their sizes: 9.3M and 7.4M with their complementary strands
Inflated Data
  randomly extract m substrings of the same length n from the above data
Computing Environment
  gcc 4.0.1 with -O3, -arch ppc64 and -fast flags
  PowerMac G5: Mac OS X 10.5, 4×2.5GHz PowerPC G5, 8GB RAM

Performance of FPCS on Different Parameters

[Figure: execution time (seconds) vs. ηT for (θT, θB) = (5, 1.5), (10, 1.5), (5, 2.0), (10, 2.0), (5, 2.5), (10, 2.5), (5, 3.0), (10, 3.0), with curves grouped by θB = 1.5, 2.0, 2.5, 3.0; T = B. subtilis, B = E. coli, N = 17.7MB]

Execution times decrease drastically as ηT increases.
θB is more effective for pruning than θT.

[Figure: execution time (seconds) vs. ηT for θT = 5 and θT = 10; T = B. subtilis, B = E. coli, N = 17.7MB]

Details in case of θB = 1.5

[Figure: the number of outputs for θT = 5 and θT = 10]

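$2.235\sigma = 2.235$ → 2.5%

$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, $E(X) = \mu$, $V(X) = \sigma^2$

we have more than $N^2 = 10^{12}$ substrings ($N > 10^6$)
→ 2.5% of $10^{12}$ substrings, i.e. about $2.5 \times 10^{10}$, have a z-score of at least 2.235: a huge number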


MO Methods

Letter-based estimation (letters assumed independent)
  P(ABCDEF) = P(A) * P(B) * P(C) * ... * P(F)
KVI [Krishnan, Vitter, Iyer, 1996]
  Using independent substrings
  P(ABCDEF) = P(AB) * P(CDE) * P(F)
MO (Maximal overlap) [Jagadish, 1999]
  Using overlapping substrings
  P(ABCDEF) = P(AB|ε) * P(CDE|B) * P(F|DE)
            = P(AB) * P(BCDE) / P(B) * P(DEF) / P(DE)
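The difference between the KVI and MO estimates for ABCDEF can be made concrete with a small calculation. In the sketch below the probability table `p` holds made-up values chosen only for illustration, and the function names are hypothetical; only the two formulas come from the slide.

```python
from fractions import Fraction

# Hypothetical substring probabilities, not taken from any real data.
p = {
    "AB": Fraction(1, 10), "CDE": Fraction(1, 25), "F": Fraction(1, 4),
    "B": Fraction(1, 5), "BCDE": Fraction(1, 50),
    "DE": Fraction(1, 10), "DEF": Fraction(1, 20),
}


def kvi_estimate(p, pieces):
    """KVI: product of the probabilities of independent pieces."""
    est = Fraction(1)
    for piece in pieces:
        est *= p[piece]
    return est


def mo_estimate(p, steps):
    """MO: product of P(piece | context) = P(context + piece) / P(context)."""
    est = Fraction(1)
    for piece, context in steps:
        if context:
            est *= p[context + piece] / p[context]
        else:
            est *= p[piece]  # empty context: P(piece | epsilon) = P(piece)
    return est


kvi = kvi_estimate(p, ["AB", "CDE", "F"])                          # 1/1000
mo = mo_estimate(p, [("AB", ""), ("CDE", "B"), ("F", "DE")])       # 1/200
```

With these made-up numbers the two estimates differ by a factor of five, because MO conditions each piece on the overlap with the preceding one instead of treating the pieces as independent.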

