Sequential PAttern Mining using A Bitmap Representation

Jay Ayres, Jason Flannick, Johannes Gehrke, and Tomi Yiu
Dept. of Computer Science, Cornell University

ABSTRACT

We introduce a new algorithm for mining sequential patterns. Our algorithm is especially efficient when the sequential patterns in the database are very long. We introduce a novel depth-first search strategy that integrates a depth-first traversal of the search space with effective pruning mechanisms. Our implementation of the search strategy combines a vertical bitmap representation of the database with efficient support counting. A salient feature of our algorithm is that it incrementally outputs new frequent itemsets in an online fashion. In a thorough experimental evaluation of our algorithm on standard benchmark data from the literature, our algorithm outperforms previous work up to an order of magnitude.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGKDD '02 Edmonton, Alberta, Canada. Copyright 2002 ACM 1-58113-567-X/02/0007 ...$5.00.

1. INTRODUCTION

Finding sequential patterns in large transaction databases is an important data mining problem. The problem of mining sequential patterns and the support-confidence framework were originally proposed by Agrawal and Srikant [2, 10].

Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of items. We call a subset $X \subseteq I$ an itemset and we call $|X|$ the size of $X$. A sequence $s = (s_1, s_2, \ldots, s_m)$ is an ordered list of itemsets, where $s_i \subseteq I$, $i \in \{1, \ldots, m\}$. The size, $m$, of a sequence is the number of itemsets in the sequence, i.e. $|s|$. The length $l$ of a sequence $s = (s_1, s_2, \ldots, s_m)$ is defined as

$$l \stackrel{\mathrm{def}}{=} \sum_{i=1}^{m} |s_i|.$$

A sequence with length $l$ is called an $l$-sequence. A sequence $s_a = (a_1, a_2, \ldots, a_n)$ is contained in another sequence $s_b = (b_1, b_2, \ldots, b_m)$ if there exist integers $1 \leq i_1 < i_2 < \ldots < i_n \leq m$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_n \subseteq b_{i_n}$. If sequence $s_a$ is contained in sequence $s_b$, then we call $s_a$ a subsequence of $s_b$ and $s_b$ a supersequence of $s_a$.

A database $D$ is a set of tuples $(cid, tid, X)$, where $cid$ is a customer-id, $tid$ is a transaction-id based on the transaction time, and $X$ is an itemset such that $X \subseteq I$. Each tuple in $D$ is referred to as a transaction. For a given customer id, there are no transactions with the same transaction ID. All the transactions with the same $cid$ can be viewed as a sequence of itemsets ordered by increasing $tid$. An analogous representation for the database is thus a set of sequences of transactions, one sequence per customer, and we refer to this dual representation of $D$ as its sequence representation.

The absolute support of a sequence $s_a$ in the sequence representation of a database $D$ is defined as the number of sequences $s \in D$ that contain $s_a$, and the relative support is defined as the percentage of sequences $s \in D$ that contain $s_a$. We will use absolute and relative support interchangeably in the rest of the paper. The support of $s_a$ in $D$ is denoted by $sup_D(s_a)$. Given a support threshold $minSup$, a sequence $s_a$ is called a frequent sequential pattern on $D$ if $sup_D(s_a) \geq minSup$. The problem of mining sequential patterns is to find all frequent sequential patterns for a database $D$, given a support threshold $minSup$.

Table 1 shows the dataset consisting of tuples of (customer id, transaction id, itemset) for the transaction. It is sorted by customer id and then transaction id. Table 2 shows the database in its sequence representation.

Customer ID (CID)   TID   Itemset
1                   1     {a, b, d}
1                   3     {b, c, d}
1                   6     {b, c, d}
2                   2     {b}
2                   4     {a, b, c}
3                   5     {a, b}
3                   7     {b, c, d}

Table 1: Dataset sorted by CID and TID

CID   Sequence
1     ({a, b, d}, {b, c, d}, {b, c, d})
2     ({b}, {a, b, c})
3     ({a, b}, {b, c, d})

Table 2: Sequence for each customer

Consider the sequence of customer 2; the size of this sequence is 2, and the length of this sequence is 4. Suppose we want to find the support of the sequence $s_a = (\{a\}, \{b, c\})$. From Table 2, we know that $s_a$ is a subsequence of the sequences for customer 1 and customer 3 but is not a subsequence of the sequence for customer 2. Hence, the support of $s_a$ is 2 (out of a possible 3), or 0.67. If the user-defined minimum support value is less than 0.67, then $s_a$ is deemed frequent.
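The definitions of size, length, containment, and support can be checked mechanically on the example database. The following is a minimal Python sketch, not part of the original paper: the function names (`size`, `length`, `is_contained`, `absolute_support`) and the dictionary layout are illustrative assumptions, and containment is tested with a simple greedy left-to-right scan rather than with the bitmap technique the paper develops.

```python
# Sequence representation of the example database (Table 2):
# one ordered list of itemsets per customer id.
database = {
    1: [{"a", "b", "d"}, {"b", "c", "d"}, {"b", "c", "d"}],
    2: [{"b"}, {"a", "b", "c"}],
    3: [{"a", "b"}, {"b", "c", "d"}],
}

def size(seq):
    """Number of itemsets in the sequence."""
    return len(seq)

def length(seq):
    """Sum of the itemset sizes, i.e. the total number of item occurrences."""
    return sum(len(itemset) for itemset in seq)

def is_contained(sa, sb):
    """True if sa is a subsequence of sb: each itemset of sa must be a
    subset of a distinct itemset of sb, in strictly increasing positions.
    Matching each itemset at the earliest possible position is sufficient."""
    pos = 0
    for a in sa:
        while pos < len(sb) and not a <= sb[pos]:
            pos += 1
        if pos == len(sb):
            return False
        pos += 1  # the next itemset of sa must match strictly later
    return True

def absolute_support(sa, db):
    """Number of customer sequences in db that contain sa."""
    return sum(1 for seq in db.values() if is_contained(sa, seq))

s_a = [{"a"}, {"b", "c"}]
print(size(database[2]), length(database[2]))                 # 2 4
print(absolute_support(s_a, database))                        # 2
print(round(absolute_support(s_a, database) / len(database), 2))  # 0.67
```

The greedy scan is chosen only for brevity; any method that finds increasing positions satisfying the subset conditions would give the same support values.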

1.1 Contributions of This Paper

In this paper, we take a systems approach to the problem of mining sequential patterns. We propose an efficient algorithm called SPAM (Sequential PAttern Mining) that integrates a variety of old and new algorithmic contributions into a practical algorithm.

SPAM assumes that the entire database (and all data structures used for the algorithm) fits completely into main memory. With the size of current main memories reaching gigabytes and growing, many moderate-sized to large databases will soon become completely memory-resident. Considering the computational complexity that is involved in finding long sequential patterns even in small databases with wide records, this assumption is not very limiting in practice. Since all algorithms for finding sequential patterns, including algorithms that work with disk-resident databases, are CPU-bound, we believe that our study sheds light on the most important performance bottleneck.

SPAM is, to the best of our knowledge, the first depth-first search strategy for mining sequential patterns. An additional salient feature of SPAM is that it outputs sequential patterns of different lengths online; compare this to a breadth-first search strategy that first outputs all patterns of length one, then all patterns of length two, and so on. Our implementation of SPAM uses a vertical bitmap data layout that allows for simple, efficient counting.
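To give a rough intuition for why a vertical bitmap layout makes counting simple, the sketch below builds one bitmap per item over the seven transactions of Table 1 and counts support with bitwise operations. This is an illustrative assumption rather than the paper's actual data structure, which is not specified in this section: here each bitmap is a plain Python integer with one bit per transaction, the helper names are hypothetical, and only itemsets within a single transaction are handled (extending a pattern across transactions is not covered).

```python
# Transactions from Table 1 as (cid, tid, itemset), sorted by cid and then tid.
transactions = [
    (1, 1, {"a", "b", "d"}), (1, 3, {"b", "c", "d"}), (1, 6, {"b", "c", "d"}),
    (2, 2, {"b"}), (2, 4, {"a", "b", "c"}),
    (3, 5, {"a", "b"}), (3, 7, {"b", "c", "d"}),
]

def build_vertical_bitmaps(rows):
    """Return {item: int}; bit k is set if the k-th transaction contains the item."""
    bitmaps = {}
    for k, (_cid, _tid, itemset) in enumerate(rows):
        for item in itemset:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << k)
    return bitmaps

def customer_masks(rows):
    """Return {cid: int}; the mask covers the bit positions of that customer's transactions."""
    masks = {}
    for k, (cid, _tid, _itemset) in enumerate(rows):
        masks[cid] = masks.get(cid, 0) | (1 << k)
    return masks

def support(bitmap, masks):
    """Count customers that have at least one set bit in their section of the bitmap."""
    return sum(1 for mask in masks.values() if bitmap & mask)

bitmaps = build_vertical_bitmaps(transactions)
masks = customer_masks(transactions)

# Transactions containing both b and c: a single bitwise AND of the item bitmaps.
bc = bitmaps["b"] & bitmaps["c"]

print(support(bitmaps["a"], masks))  # 3
print(support(bitmaps["d"], masks))  # 2
print(support(bc, masks))            # 3
```

In this layout the support of an itemset comes down to one AND per additional item followed by one mask test per customer, which is the kind of simple counting a vertical representation is meant to enable.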

2. THE SPAM ALGORITHM

In this section, we describe the lexicographic tree of sequences upon which our algorithm is based. We also discuss the way we traverse the tree and the pruning methods that we use to reduce the search space.

2.1 Lexicographic Tree for Sequences

This part of the paper describes the conceptual framework of the sequence lattice upon which our approach is based. A similar approach has been used for the problem of mining frequent itemsets in MaxMiner [3] and MAFIA [5]. We use this framework to describe our algorithm and some pertinent related works. Assume that there is a lexicographical ordering < of the items I in the database. If item i occurs before item j in the ordering, then we denote this by i _
