
Sequential PAttern Mining using A Bitmap Representation

Jay Ayres, Jason Flannick, Johannes Gehrke, and Tomi Yiu
Dept. of Computer Science, Cornell University

ABSTRACT

We introduce a new algorithm for mining sequential patterns. Our algorithm is especially efficient when the sequential patterns in the database are very long. We introduce a novel depth-first search strategy that integrates a depth-first traversal of the search space with effective pruning mechanisms. Our implementation of the search strategy combines a vertical bitmap representation of the database with efficient support counting. A salient feature of our algorithm is that it incrementally outputs new frequent itemsets in an online fashion. In a thorough experimental evaluation of our algorithm on standard benchmark data from the literature, our algorithm outperforms previous work up to an order of magnitude.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGKDD '02 Edmonton, Alberta, Canada. Copyright 2002 ACM 1-58113-567-X/02/0007 ...$5.00.

1. INTRODUCTION

Finding sequential patterns in large transaction databases is an important data mining problem. The problem of mining sequential patterns and the support-confidence framework were originally proposed by Agrawal and Srikant [2, 10].

Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of items. We call a subset $X \subseteq I$ an itemset and we call $|X|$ the size of $X$. A sequence $s = (s_1, s_2, \ldots, s_m)$ is an ordered list of itemsets, where $s_i \subseteq I$, $i \in \{1, \ldots, m\}$. The size, $m$, of a sequence is the number of itemsets in the sequence, i.e. $|s|$. The length $l$ of a sequence $s = (s_1, s_2, \ldots, s_m)$ is defined as

$$l \stackrel{\mathrm{def}}{=} \sum_{i=1}^{m} |s_i|.$$

A sequence with length $l$ is called an $l$-sequence. A sequence $s_a = (a_1, a_2, \ldots, a_n)$ is contained in another sequence $s_b = (b_1, b_2, \ldots, b_m)$ if there exist integers $1 \le i_1 < i_2 < \ldots < i_n \le m$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_n \subseteq b_{i_n}$. If sequence $s_a$ is contained in sequence $s_b$, then we call $s_a$ a subsequence of $s_b$ and $s_b$ a supersequence of $s_a$.

A database $D$ is a set of tuples $(cid, tid, X)$, where $cid$ is a customer-id, $tid$ is a transaction-id based on the transaction time, and $X$ is an itemset such that $X \subseteq I$. Each tuple in $D$ is referred to as a transaction. For a given customer id, there are no transactions with the same transaction ID. All the transactions with the same $cid$ can be viewed as a sequence of itemsets ordered by increasing $tid$. An analogous representation for the database is thus a set of sequences of transactions, one sequence per customer, and we refer to this dual representation of $D$ as its sequence representation.

The absolute support of a sequence $s_a$ in the sequence representation of a database $D$ is defined as the number of sequences $s \in D$ that contain $s_a$, and the relative support is defined as the percentage of sequences $s \in D$ that contain $s_a$. We will use absolute and relative support interchangeably in the rest of the paper. The support of $s_a$ in $D$ is denoted by $sup_D(s_a)$. Given a support threshold $minSup$, a sequence $s_a$ is called a frequent sequential pattern on $D$ if $sup_D(s_a) \ge minSup$. The problem of mining sequential patterns is to find all frequent sequential patterns for a database $D$, given a support threshold $minSup$.

Table 1 shows the dataset consisting of tuples of (customer id, transaction id, itemset) for the transaction. It is sorted by customer id and then transaction id. Table 2 shows the database in its sequence representation.

| Customer ID (CID) | TID | Itemset |
|---|---|---|
| 1 | 1 | {a, b, d} |
| 1 | 3 | {b, c, d} |
| 1 | 6 | {b, c, d} |
| 2 | 2 | {b} |
| 2 | 4 | {a, b, c} |
| 3 | 5 | {a, b} |
| 3 | 7 | {b, c, d} |

Table 1: Dataset sorted by CID and TID

| CID | Sequence |
|---|---|
| 1 | ({a, b, d}, {b, c, d}, {b, c, d}) |
| 2 | ({b}, {a, b, c}) |
| 3 | ({a, b}, {b, c, d}) |

Table 2: Sequence for each customer

Consider the sequence of customer 2; the size of this sequence is 2, and the length of this sequence is 4.

Suppose we want to find the support of the sequence $s_a = (\{a\}, \{b, c\})$. From Table 2, we know that $s_a$ is a subsequence of the sequences for customer 1 and customer 3 but is not a subsequence of the sequence for customer 2.


Hence, the support of $s_a$ is 2 (out of a possible 3), or 0.67. If the user-defined minimum support value is less than 0.67, then $s_a$ is deemed frequent.
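The definitions of size, length, containment, and support can be checked mechanically on the example database. The following is a minimal Python sketch, not the authors' implementation: the names `seq`, `DATABASE`, `is_contained`, `absolute_support`, and `relative_support` are hypothetical helpers introduced for illustration, and containment is tested with a simple greedy left-to-right scan.

```python
from typing import Dict, FrozenSet, List, Sequence

Itemset = FrozenSet[str]


def seq(*itemsets: str) -> List[Itemset]:
    """Hypothetical helper: seq("abd", "bcd") builds ({a,b,d}, {b,c,d})."""
    return [frozenset(items) for items in itemsets]


# Sequence representation of the example database (Table 2), keyed by CID.
DATABASE: Dict[int, List[Itemset]] = {
    1: seq("abd", "bcd", "bcd"),
    2: seq("b", "abc"),
    3: seq("ab", "bcd"),
}


def size(s: Sequence[Itemset]) -> int:
    """Number of itemsets in the sequence."""
    return len(s)


def length(s: Sequence[Itemset]) -> int:
    """Sum of the itemset sizes in the sequence."""
    return sum(len(itemset) for itemset in s)


def is_contained(sa: Sequence[Itemset], sb: Sequence[Itemset]) -> bool:
    """True if sa is a subsequence of sb.

    Greedy scan: match each itemset of sa against the earliest
    remaining itemset of sb that is a superset of it.
    """
    matched = 0
    for itemset in sb:
        if matched < len(sa) and sa[matched] <= itemset:
            matched += 1
    return matched == len(sa)


def absolute_support(sa: Sequence[Itemset],
                     database: Dict[int, List[Itemset]]) -> int:
    """Number of customer sequences that contain sa."""
    return sum(1 for s in database.values() if is_contained(sa, s))


def relative_support(sa: Sequence[Itemset],
                     database: Dict[int, List[Itemset]]) -> float:
    """Fraction of customer sequences that contain sa."""
    return absolute_support(sa, database) / len(database)


if __name__ == "__main__":
    s_a = seq("a", "bc")
    print(size(DATABASE[2]), length(DATABASE[2]))
    print(absolute_support(s_a, DATABASE), round(relative_support(s_a, DATABASE), 2))
```

On the data of Table 2, the sketch gives size 2 and length 4 for the sequence of customer 2, and an absolute support of 2 (relative support of about 0.67) for ({a}, {b, c}), matching the worked example in the text.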

1.1 Contributions of This Paper

In this paper, we take a systems approach to the problem of mining sequential patterns. We propose an efficient algorithm called SPAM (Sequential PAttern Mining) that integrates a variety of old and new algorithmic contributions into a practical algorithm. SPAM assumes that the entire database (and all data structures used by the algorithm) fits completely into main memory. With the size of current main memories reaching gigabytes and growing, many moderate-sized to large databases will soon become completely memory-resident. Considering the computational complexity involved in finding long sequential patterns even in small databases with wide records, this assumption is not very limiting in practice. Since all algorithms for finding sequential patterns, including algorithms that work with disk-resident databases, are CPU-bound, we believe that our study sheds light on the most important performance bottleneck.

SPAM is, to the best of our knowledge, the first depth-first search strategy for mining sequential patterns. An additional salient feature of SPAM is that it outputs sequential patterns of different lengths online; compare this to a breadth-first search strategy that first outputs all patterns of length one, then all patterns of length two, and so on. Our implementation of SPAM uses a vertical bitmap data layout that allows for simple, efficient counting.
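To make the idea of a vertical layout concrete, the sketch below builds one bitmap per item and per customer from the tuples of Table 1 and counts support by testing whether a customer's bitmap is non-zero. It is only an illustration under stated assumptions, not the layout SPAM necessarily uses: bitmaps are plain Python integers, bit k stands for a customer's k-th transaction in TID order, and the names `TRANSACTIONS`, `build_vertical_bitmaps`, and `itemset_support` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Table 1 as (cid, tid, itemset) tuples, sorted by CID and then TID.
TRANSACTIONS: List[Tuple[int, int, str]] = [
    (1, 1, "abd"), (1, 3, "bcd"), (1, 6, "bcd"),
    (2, 2, "b"), (2, 4, "abc"),
    (3, 5, "ab"), (3, 7, "bcd"),
]


def build_vertical_bitmaps(
    transactions: Iterable[Tuple[int, int, str]],
) -> Dict[str, Dict[int, int]]:
    """Return {item: {cid: bitmap}}.

    Assumption: input is sorted by (cid, tid). Bit k of a bitmap is set
    if the item occurs in the customer's k-th transaction.
    """
    bitmaps: Dict[str, Dict[int, int]] = defaultdict(lambda: defaultdict(int))
    next_position: Dict[int, int] = defaultdict(int)
    for cid, _tid, itemset in transactions:
        k = next_position[cid]
        for item in itemset:
            bitmaps[item][cid] |= 1 << k
        next_position[cid] += 1
    # Convert to plain dicts so later lookups cannot create entries.
    return {item: dict(per_cid) for item, per_cid in bitmaps.items()}


def itemset_support(bitmaps: Dict[str, Dict[int, int]], items: str) -> int:
    """Number of customers with a transaction containing every item.

    `items` must be non-empty. The per-customer bitmaps of the items are
    ANDed; a non-zero result means some transaction holds all of them.
    """
    first, *rest = items
    count = 0
    for cid, bits in bitmaps.get(first, {}).items():
        for item in rest:
            bits &= bitmaps.get(item, {}).get(cid, 0)
        if bits:
            count += 1
    return count


if __name__ == "__main__":
    bitmaps = build_vertical_bitmaps(TRANSACTIONS)
    print(bin(bitmaps["c"][1]))            # transactions of customer 1 holding c
    print(itemset_support(bitmaps, "d"))   # customers with some transaction holding d
    print(itemset_support(bitmaps, "bc"))  # customers with some transaction holding b and c
```

The point of the layout is that counting reduces to bitwise operations and a zero test per customer, rather than a scan of the transactions themselves.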

2. THE SPAM ALGORITHM

In this section, we describe the lexicographic tree of sequences upon which our algorithm is based. We also discuss the way we traverse the tree and the pruning methods that we use to reduce the search space.

2.1 Lexicographic Tree for Sequences

This part of the paper describes the conceptual framework of the sequence lattice upon which our approach is based. A similar approach has been used for the problem of mining frequent itemsets in MaxMiner [3] and MAFIA [5]. We use this framework to describe our algorithm and some pertinent related works. Assume that there is a lexicographical ordering < of the items I in the database. If item i occurs before item j in the ordering, then we denote this by i _