ECML PKDD 2009
Sep. 9, 2009@Bled Slovenia
Finding Peculiar Compositions of Frequent Substrings from Sparse Text Data Using Background Texts
Daisuke Ikeda Einoshin Suzuki Department of Informatics, Kyushu University, JAPAN {daisuke, suzuki}@inf.kyushu-u.ac.jp
Wednesday, September 9, 2009
Motivating Example
Papers written by non-native speakers (Target Texts):
  ... we employ a set ... and ...
  ... in order to discuss about a vector space on a space of points ...
  ... such as the field of real or of complex numbers ...
  We are developing ... algorithms ... about ...
  We discuss about a structure on the quotient set ...
  ... namely we discuss a vector of real numbers ...
  ... is more frequent in ... and much more ...
  Unlike existing methods ... use simple estimation methods ...
  Simply discuss about a field ...
  discuss about: peculiar
Papers by native English speakers (Background Texts):
  ... we discuss various issues ... Given ... is to put about ..., where I discuss ... and ...
  ... about 10 o'clock ... and ... precisely ... about ...
discuss, about: frequent
Basic Notations
$x, y \in \Sigma^*$: strings ($\Sigma$: alphabet)
composition $(x, y)$ of $x$ and $y$ $\overset{\text{def}}{\iff}$ $xy$ (concatenation)
  e.g., if x = discuss and y = about, then (x, y) = discuss about
the frequency $f(x|y)$ of $x$ in $y$ $\overset{\text{def}}{\iff}$ the number of occurrences of $x$ in $y$
  e.g., $f(\mathrm{i}\,|\,\mathrm{mississippi}) = 4$, $f(\mathrm{issi}\,|\,\mathrm{mississippi}) = 2$
$D$: a set of strings, $f(x|D) = \sum_{y \in D} f(x|y)$
(empirical) probability: $P(x|D) = \frac{f(x|D)}{\#D}$ ($\#D$: total number of substrings in $D$)
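The notation above can be sketched in Python. This is a minimal sketch, not the authors' implementation: all function names are hypothetical, and #D is read here as the number of non-empty substring occurrences (n(n+1)/2 for a string of length n), which is an assumption about what the slide means by "total number of substrings".

```python
from fractions import Fraction


def f_in_string(x, y):
    """f(x|y): number of (possibly overlapping) occurrences of x in y."""
    return sum(1 for i in range(len(y) - len(x) + 1) if y[i:i + len(x)] == x)


def f_in_set(x, D):
    """f(x|D): sum of f(x|y) over all strings y in D."""
    return sum(f_in_string(x, y) for y in D)


def total_substrings(D):
    """#D. ASSUMPTION: counts every occurrence of a non-empty substring."""
    return sum(len(y) * (len(y) + 1) // 2 for y in D)


def prob(x, D):
    """Empirical probability P(x|D) = f(x|D) / #D, kept exact as a Fraction."""
    return Fraction(f_in_set(x, D), total_substrings(D))
```

Counting overlapping occurrences matters here: `issi` occurs twice in `mississippi` only because the two occurrences share a letter.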
Peculiar Composition Discovery Problem
$T, B$: two sets of strings (target and background texts, respectively)
$x, y$: strings
$\theta_T, \theta_B > 1$: thresholds
$(x, y)$ is peculiar (in $T$ against $B$) $\overset{\text{def}}{\iff}$
  $P(x|B) > \theta_B P(x|T)$,
  $P(y|B) > \theta_B P(y|T)$,
  $\theta_T P(xy|B) < P(xy|T)$
Peculiar Composition Discovery Problem:
  Input: $T, B, \theta_T, \theta_B, \eta_T$ (min-sup)
  Output: all peculiar compositions $(x_1, y_1), (x_2, y_2), \ldots$ which are maximal and appear at least $\eta_T$ times in $T$
  maximality: of (discuss, about), (discuss, abou), (discuss, abo), ..., only the maximal one is output
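The three inequalities of the definition can be checked directly once the six probabilities are known. The sketch below is illustrative only: the function name and the probability values are hypothetical, not taken from the talk.

```python
from fractions import Fraction


def is_peculiar(p_x, p_y, p_xy, theta_T, theta_B):
    """Each p_* is a pair (P(.|T), P(.|B)); returns True if (x, y) is peculiar."""
    (x_T, x_B), (y_T, y_B), (xy_T, xy_B) = p_x, p_y, p_xy
    return (
        x_B > theta_B * x_T          # x is relatively more frequent in B
        and y_B > theta_B * y_T      # y is relatively more frequent in B
        and theta_T * xy_B < xy_T    # xy is relatively more frequent in T
    )


# HYPOTHETICAL probabilities for x = "discuss ", y = "about".
p_discuss = (Fraction(1, 1000), Fraction(2, 1000))
p_about = (Fraction(1, 1000), Fraction(3, 1000))
p_discuss_about = (Fraction(1, 2000), Fraction(1, 100000))

result = is_peculiar(p_discuss, p_about, p_discuss_about,
                     theta_T=5, theta_B=Fraction(3, 2))
```

With these made-up numbers each part is at least 1.5 times more probable in B, while the composition is more than 5 times more probable in T, so all three conditions hold.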
Related Work
[Ji et al. ICDM’05]: contrast patterns from supervised texts
  patterns popular in target texts but rare in background ones
[Suzuki KDD’97]: exceptional & general rule mining
  our basic idea borrows from this work
  designed for table data, not for text data
[Marschall & Rahmann ISMB’09]: motif discovery
  exceptional patterns are likely to be motifs
  p-value for 10-gram patterns under Bernoulli (no-overlap) and Markov (overlap) models
[Leung et al. JCB’96] and [Schbath et al. JCB’97]: exceptional n-gram discovery using z-score
  n is fixed (2- to 5-gram)
  under Bernoulli and Markov models
z-score
$w$: substring of length $n$ (n-gram)
z-score is defined by $z(w) = \frac{f(w) - E(w)}{N(w)}$
  $f(w)$: observed frequency
  $E(w)$: its expectation under a probabilistic model
  $N(w)$: normalization factor
Example of $E(w)$: if symbols in $\{A, C, G, T\}$ occur independently and their probabilities are $P(A) = P(C) = P(G) = P(T) = 1/4$, then $E(w) = \left(\frac{1}{4}\right)^{|w|}$
  let $w = ACTACCAG$ ($|w| = 8$), then $E(w) = \frac{1}{4^8}$
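A minimal sketch of this score under the independent-symbol model follows. The slide gives E(w) as a per-position probability and leaves N(w) unspecified, so two choices in the code are assumptions made only for illustration: the expected count is taken as (number of positions) × probability, and N(w) as the binomial standard deviation. The function names are hypothetical.

```python
import math


def bernoulli_prob(w, symbol_probs):
    """Probability of w when symbols occur independently (the slide's E(w))."""
    p = 1.0
    for c in w:
        p *= symbol_probs[c]
    return p


def z_score(observed, n_positions, p):
    """z(w) = (f(w) - E(w)) / N(w).

    ASSUMPTION: E(w) = n_positions * p and N(w) = sqrt(n_positions * p * (1 - p)),
    i.e. a binomial approximation that ignores overlaps between occurrences.
    """
    expected = n_positions * p
    norm = math.sqrt(n_positions * p * (1.0 - p))
    return (observed - expected) / norm


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
p_w = bernoulli_prob("ACTACCAG", uniform)   # equals (1/4)**8
```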
Problems of Scores Based on Probabilistic Models
n of n-gram is fixed:
1. large n: inaccurate estimation of E(w) for sparse text
   sparseness: many n-grams do not appear in given data if n is large
   [Figure: probability vs. rank of n-grams, for 3-gram, 6-gram and 9-gram]
2. small n: again inaccurate estimation of E(w) for long patterns from probabilities of shorter n-grams
Our Strategy for These Problems
background texts instead of a probabilistic model
  being peculiar is defined by two ratios of frequencies between T and B
composition of frequent substrings with arbitrary length
  long substrings and much longer compositions
New Challenges
  Data size is increased (T and B)
  Simultaneous discovery of frequent substrings and peculiar compositions
  Algorithm and data structure (suffix tree)
Data Structure: Suffix Tree [McCreight’76]
compact trie of all the suffixes
Example: the suffix tree for ‘mississippi$’ (‘$’ is a special symbol not in $\Sigma$)
[Figure: suffix tree for ‘mississippi$’, with internal nodes u, v and w marked]
A node v corresponds to a substring, denoted by BS(v)
  e.g., BS(v) = i, BS(u) = issi
frequency of BS(v) = number of leaves below v
there exist only O(N) nodes (N: input length)
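The relation "frequency = number of leaves below the node" can be illustrated with an uncompressed suffix trie. This is a sketch under stated assumptions, not McCreight's algorithm: it uses quadratic time and space and one node per character rather than the compact O(N)-node tree, and the class and function names are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}   # next character -> child node
        self.leaves = 0      # number of suffixes (leaves) below this node


def build_suffix_trie(text):
    """Insert every suffix of text; text should end with a unique symbol like '$'."""
    root = Node()
    for start in range(len(text)):
        node = root
        node.leaves += 1
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.leaves += 1     # this suffix ends in a leaf below `node`
    return root


def frequency(root, pattern):
    """Frequency of pattern = number of leaves below the node it spells out."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.leaves


trie = build_suffix_trie("mississippi$")
```

Because `$` occurs only at the end, every suffix ends in its own leaf, so the counter on a node equals the number of occurrences of the string spelled on the path to it.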
Outline of Our Algorithm: FPCS (Finding Peculiar Compositions)
construct the suffix tree of input texts
traverse each node v of the tree (BS(v) is a candidate for peculiar compositions)
  check if BS(v) appears at least $\eta_T$ times in $T$
  if so, for each composition $(x, y)$ of BS(v) = $xy$
    check if $P(x|B) > \theta_B P(x|T)$, $P(y|B) > \theta_B P(y|T)$ and $\theta_T P(xy|B) < P(xy|T)$
    output $(x, y)$
Theorem: FPCS finds all maximal peculiar compositions
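The control flow of the outline can be sketched without a suffix tree. The code below is a naive stand-in, not FPCS itself: it enumerates the distinct substrings of T directly instead of traversing suffix tree nodes, recounts frequencies by scanning, and does not filter for maximality. All names are hypothetical, and #D is assumed to be the number of non-empty substring occurrences.

```python
from fractions import Fraction


def count(x, texts):
    """Total number of (possibly overlapping) occurrences of x in texts."""
    return sum(1 for t in texts
               for i in range(len(t) - len(x) + 1) if t[i:i + len(x)] == x)


def total(texts):
    """ASSUMPTION: #D = number of non-empty substring occurrences."""
    return sum(len(t) * (len(t) + 1) // 2 for t in texts)


def find_peculiar_naive(T, B, theta_T, theta_B, eta_T):
    n_T, n_B = total(T), total(B)

    def p_T(x):
        return Fraction(count(x, T), n_T)

    def p_B(x):
        return Fraction(count(x, B), n_B)

    # Candidates: distinct substrings of T of length >= 2 (stand-in for tree nodes).
    candidates = {t[i:j] for t in T
                  for i in range(len(t)) for j in range(i + 2, len(t) + 1)}
    found = set()
    for w in candidates:
        if count(w, T) < eta_T:                 # minimum support in T
            continue
        if not theta_T * p_B(w) < p_T(w):       # composition condition on xy
            continue
        for k in range(1, len(w)):              # every split w = xy
            x, y = w[:k], w[k:]
            if p_B(x) > theta_B * p_T(x) and p_B(y) > theta_B * p_T(y):
                found.add((x, y))
    return sorted(found)


pairs = find_peculiar_naive(["cccabccc"], ["aacbb"],
                            theta_T=5, theta_B=Fraction(3, 2), eta_T=1)
```

In this toy input `a` and `b` are each relatively more frequent in the background string, while `ab` occurs only in the target string, so `("a", "b")` is the single composition reported.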
Time Complexity of FPCS
construct the suffix tree of input texts ← $O(N)$ time
traverse each node v of the tree ← ∃ $O(N)$ nodes
  BS(v) is a candidate for peculiar compositions
  check if BS(v) appears at least $\eta_T$ times in $T$
  if so, for each composition $(x, y)$ of BS(v) = $xy$ ← ∃ $O(N)$ compositions
    check if $P(x|B) > \theta_B P(x|T)$, $P(y|B) > \theta_B P(y|T)$ and $\theta_T P(xy|B) < P(xy|T)$ ← constant time
    output $(x, y)$
[Figure: part of the suffix tree for ‘mississippi$’; node v is annotated with the pair $(f(BS(v)|T), f(BS(v)|B))$]
Theorem: The time complexity of FPCS is $O(N^2)$
Complexity Discussion
a naive algorithm: $O(N^4)$
  generate-check approach
  enumerate all pairs of substrings x, y and check if w = xy is peculiar or not
FPCS (finding peculiar compositions): $O(N^2)$
  anti-monotonicity does not hold
  Even if $(x, y)$ is not peculiar, we do not know whether $(x', y')$ is peculiar or not ($xy = x'y'$).
z-score [Apostolico JCB’00]: $O(N)$
  pattern: just a substring
Experiments:
1. linear scalability for practical parameters
2. peculiar compositions which can’t be found by z-score
Experiments: Data and Computing Environment
Real Data
  whole DNA sequences of Escherichia coli K-12 (RefSeq NC_000913) and Bacillus subtilis (RefSeq NC_000964)
  their sizes: 9.3M and 7.4M with their complementary strands
Inflated Data
  randomly extract m substrings of the same length n from the above data
Computing Environment
  gcc 4.0.1 with -O3, -arch ppc64 and -fast flags
  PowerMac G5: Mac OS X 10.5, 4×2.5GHz PowerPC G5, 8GB RAM
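The inflated data could be generated along the following lines. The slide does not say how start positions are chosen, so uniform sampling with replacement is an assumption, and the function name, the seed parameter and the toy sequence are hypothetical.

```python
import random


def sample_substrings(sequence, m, n, seed=0):
    """Extract m random substrings of length n from sequence.

    ASSUMPTION: start positions are uniform and drawn with replacement.
    """
    last_start = len(sequence) - n
    if last_start < 0:
        raise ValueError("n is longer than the sequence")
    rng = random.Random(seed)   # fixed seed so that runs are repeatable
    starts = [rng.randint(0, last_start) for _ in range(m)]
    return [sequence[s:s + n] for s in starts]


genome = "ACGTACGTTAGCCGATAGCTAGGCTA"   # toy stand-in for a real DNA sequence
samples = sample_substrings(genome, m=5, n=8)
```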
Performance of FPCS on Different Parameters
[Figure: execution time (seconds) vs. $\eta_T$ for $(\theta_T, \theta_B)$ = (5, 1.5), (10, 1.5), (5, 2.0), (10, 2.0), (5, 2.5), (10, 2.5), (5, 3.0), (10, 3.0); T = B. subtilis, B = E. coli, N = 17.7MB]
Execution times decrease drastically as $\eta_T$ increases.
$\theta_B$ is more effective for pruning than $\theta_T$.
Details in case of $\theta_B = 1.5$
[Figure: execution time (seconds) and the number of outputs vs. $\eta_T$ for $\theta_T = 5$ and $\theta_T = 10$; T = B. subtilis, B = E. coli, N = 17.7MB]
$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, $E(X) = \mu$, $V(X) = \sigma^2$
beyond $2.235\sigma$ (z-score 2.235) → 2.5%
we have more than $N^2 = 10^{12}$ substrings ($N > 10^6$)
→ 2.5% of $10^{12}$ substrings have z-score ≥ 2.235: a huge number
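The tail figure can be checked numerically with the standard library. The slide does not say whether the 2.5% is one-sided or two-sided, so the sketch computes both; reading it as the two-sided tail is an assumption. The function names are hypothetical.

```python
import math


def one_sided_tail(z):
    """P(Z >= z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def two_sided_tail(z):
    """P(|Z| >= z) for a standard normal variable Z."""
    return math.erfc(z / math.sqrt(2))


upper = one_sided_tail(2.235)
both = two_sided_tail(2.235)

# Number of substrings above the threshold, using the slide's figures.
n_substrings = 10 ** 12
expected_hits = 0.025 * n_substrings
```

The two-sided tail at 2.235 is close to the 2.5% quoted on the slide, and 2.5% of 10^12 substrings is 2.5 × 10^10 patterns, far too many to inspect by hand.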
MO Methods
Letter-based estimation (independent letters)
  $P(ABCDEF) = P(A) \cdot P(B) \cdot P(C) \cdots P(F)$
KVI [Krishnan, Vitter, Iyer, 1996]: using independent substrings
  $P(ABCDEF) = P(AB) \cdot P(CDE) \cdot P(F)$
MO (Maximal overlap) [Jagadish, 1999]: using overlapping substrings
  $P(ABCDEF) = P(AB|\varepsilon) \cdot P(CDE|B) \cdot P(F|DE) = P(AB) \cdot \frac{P(BCDE)}{P(B)} \cdot \frac{P(DEF)}{P(DE)}$
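The three estimates differ only in which stored probabilities they multiply. The sketch below takes the decomposition of ABCDEF from the slide as given rather than deriving it, and the probability table is made up for illustration; the function names are hypothetical.

```python
from fractions import Fraction
from math import prod


def letter_based(w, P):
    """P(w) as a product of single-letter probabilities."""
    return prod(P[c] for c in w)


def kvi(parts, P):
    """P(w) as a product over disjoint substrings treated as independent."""
    return prod(P[s] for s in parts)


def maximal_overlap(pieces, P):
    """P(w) as a product of P(new | overlap) = P(overlap + new) / P(overlap)."""
    result = Fraction(1)
    for overlap, new in pieces:
        joint = P[overlap + new]
        result *= joint / P[overlap] if overlap else joint
    return result


# HYPOTHETICAL probability table.
P = {c: Fraction(1, 10) for c in "ABCDEF"}
P.update({"AB": Fraction(1, 50), "CDE": Fraction(1, 500),
          "BCDE": Fraction(1, 1000), "DE": Fraction(1, 40),
          "DEF": Fraction(1, 200)})

est_letters = letter_based("ABCDEF", P)
est_kvi = kvi(["AB", "CDE", "F"], P)
est_mo = maximal_overlap([("", "AB"), ("B", "CDE"), ("DE", "F")], P)
```

Each `(overlap, new)` pair mirrors one factor of the MO formula: `("B", "CDE")` is P(CDE|B) = P(BCDE) / P(B), and `("DE", "F")` is P(F|DE) = P(DEF) / P(DE).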