A Characterization of Data Mining Algorithms on a Modern Processor

Amol Ghoting, Gregory Buehrer, and Srinivasan Parthasarathy
Data Mining Research Laboratory, Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA

Daehyun Kim, Anthony Nguyen, Yen-Kuang Chen, and Pradeep Dubey
Architecture Research Laboratory, Intel Corporation, Santa Clara, CA 95052, USA

Contact emails: {ghoting,srini}@cse.ohio-state.edu. This work is supported in part by NSF grants CAREER-IIS-0347662, RI-CNS-0403342, and NGS-CNS-0406386.

Proceedings of the First International Workshop on Data Management on New Hardware (DaMoN 2005); June 12, 2005, Baltimore, Maryland, USA.

ABSTRACT

In this paper, we characterize the performance and memory access behavior of several data mining algorithms. Specifically, we consider algorithms for frequent itemset mining, sequence mining, graph mining, clustering, outlier detection, and decision tree induction. Our study reveals that data mining algorithms are compute and memory intensive. Furthermore, some algorithms have poor spatial locality, while most algorithms have poor temporal locality. Hardware prefetching helps the algorithms with good spatial locality, but most algorithms are unable to leverage simultaneous multithreading because of their memory intensive nature. Consequently, all these algorithms grossly under-utilize a modern day processor. Using the knowledge gleaned in this investigation, we briefly show how we improve the performance of a frequent itemset mining algorithm, FPGrowth, on a modern processor. Our study suggests that a specialized memory system with several thread contexts per processor is needed to allow these algorithms to scale on future microprocessors.

1. INTRODUCTION

Over the last decade, our ability to gather, collect, and distribute data of all kinds has resulted in large and dynamically growing datasets at many organizations. Many companies already have data warehouses in the terabyte range (e.g., FedEx, UPS, and Walmart). Similarly, scientific data has reached gigantic proportions (e.g., genomic data banks). While database technology has provided us with the basic tools for accessing and manipulating such data stores, the issue of how to make end-users understand this data has become a pressing problem. The field of data mining, spurred by advances in data collection and storage technologies, concerns itself with the discovery of knowledge hidden in these large datasets. Today, data mining applications constitute a rapidly growing segment of the commercial and scientific computing domains.

Data mining is an interactive process, and thus it is imperative that we minimize the response time to a user's query. With this objective in mind, the past few years have seen data mining researchers make significant progress in reducing the computational complexity of data mining algorithms. Unfortunately, as our study shows, very little has been done to ensure that data mining algorithms fully utilize a modern day processor.

In this paper, we delve into performance evaluations of several popular data mining algorithms on the Intel Pentium 4 processor. Specifically, implementations for frequent itemset mining [11], sequence mining [21], graph mining [13], clustering [15, 22], outlier detection [3], and decision tree induction [17] are considered. Through a detailed performance study, we characterize the processor, cache, and memory system performance of state-of-the-art data mining implementations. Furthermore, we evaluate the benefits of modern processor technologies such as hardware prefetching and simultaneous multithreading (SMT) on algorithm performance. The Intel VTune Performance Analyzer (http://www.intel.com/software/products/vtune) is used to collect real-time performance data.

Our study reveals that data mining algorithms are significantly different from the well-studied Online Transaction Processing (OLTP) and Decision Support System (DSS) workloads [2]. Data mining algorithms are both memory intensive and compute intensive. CPU utilization varies from a low of 4.9% to a high of 44.6% over the considered algorithms. This suggests that many of these implementations have serious performance bottlenecks, limiting CPU utilization on a modern processor. Our study further reveals that while these algorithms do not exert much pressure on the instruction cache, they put significant pressure on the data cache due to their memory intensive nature. Some of these algorithms have good spatial locality and access memory in a streaming fashion. However, due to a large memory footprint, they exhibit very poor temporal locality. In addition, some algorithms must manipulate complex pointer-based structures that exhibit poor locality.

Given the memory intensive nature and streaming memory access behavior of some of these algorithms, hardware prefetching does provide a significant improvement in performance. However, because of the gap between processor and memory speeds, prefetching is still unable to hide most of the memory latency. This limits instruction-level parallelism, because instructions stall on cache misses for a large fraction of the time. Consequently, these algorithms are not able to effectively use simultaneous multithreading on the Intel Pentium 4 processor.

We conclude that these performance bottlenecks stem primarily from the memory intensive nature of the algorithms, coupled with the increasing gap between processor and main memory performance. Even with latency-hiding processor technologies such as hardware prefetching, these algorithms are unable to fully utilize a modern processor. In some cases, data mining researchers need to pay special attention to data layout and computation structure. To this effect, we briefly present our efforts in improving the performance of the fastest known frequent pattern mining algorithm, FPGrowth. We demonstrate how effective cache-conscious data layout and computation restructuring can significantly increase processor utilization for FPGrowth (details of our optimization techniques are the focus of a forthcoming publication [9]). Furthermore, we leverage SMT by co-scheduling threads to reuse cached data. We hope this study will spark the development of cache-conscious algorithms in the data mining community.

Specifically, we make the following contributions:

• We characterize the performance and memory access behavior of eight data mining algorithms.
• We study the impact of processor technologies such as hardware prefetching and simultaneous multithreading on algorithm performance.
• We use our characterization to improve the performance of the frequent itemset mining algorithm, FPGrowth.

2. RELATED WORK

Several studies are related to this work. A number of recent efforts have revisited core database algorithms to improve cache performance [4, 20]. Ailamaki et al. [1] examined DBMS performance on modern architectures, noting that poor cache utilization is the primary cause of extended query execution time. They conclude that database programmers must pay greater attention to data layout to improve cache performance. Rao and Ross [18, 19] propose two new data structures, Cache-Sensitive Search Trees and Cache-Sensitive B+ Trees; this work builds on the premise that the optimal tree node size is the natural data transfer size. Chen et al. [8] further improved the index and range search performance of B+ trees using prefetching. More recently, Chen et al. [7] also used prefetching to improve the performance of hash join operations. Lo et al. [14] analyzed the performance of database workloads on simultaneously multithreaded processors. They showed that while database memory footprints tend to be large, working sets often can fit in cache when properly organized, and they determine that improved cache performance is required to leverage the abilities of multiple threads in an SMT environment.

Relatively little work has been done to characterize data mining algorithms. The memory characteristics of the self-organizing map (SOM), a neural network model, are presented in [12]; the authors characterize data locality, memory block utilization, and TLB performance through simulation. In [5], C4.5 is analyzed for its memory characteristics. This study is also simulation-based, and the authors conclude that the performance of the algorithm is limited by memory latency and bandwidth. We characterized the memory access behavior of Apriori, an association rule mining algorithm, in [16]; that article points out that many data structures used by association rule mining algorithms exhibit poor locality and are prone to false sharing under parallel execution. To the best of our knowledge, there has been no work in the area of cache-conscious data mining. Our analysis focuses on data mining algorithms, and uses real executions on the Intel Pentium 4 processor.

3. WORKLOADS UNDER STUDY

This section describes the data mining algorithms and implementations that we have chosen for the workload characterization. Due to space constraints, we give only brief descriptions of the algorithms at hand. We would like readers to note that these implementations are optimized for performance, and are by no means naive straw-man implementations. Many of the implementations use custom memory managers to minimize the overheads associated with heavy memory allocation.

Frequent Itemset Mining - FPGrowth and MAFIA: Frequent itemset mining is the task of finding groups of items or values that co-occur frequently in a transactional dataset. For instance, the output of this task for market basket data could be, "item A and item B are purchased together 30% of the time". We consider two algorithms for this task, namely FPGrowth [11] and MAFIA [6]. Through several independent evaluations, FPGrowth is now accepted to be the fastest known algorithm for frequent itemset mining. To summarize, this algorithm recursively traverses the itemset search space in depth-first order. At each stage in this recursion, it dynamically creates a projected dataset, represented as an FP-Tree. An FP-Tree is an annotated prefix tree which offers a potentially compact dataset representation at the cost of a pointer-based structure. MAFIA also recursively traverses the itemset search space in depth-first order. However, it does not use a prefix tree-based dataset representation; it employs vertical tid-lists represented as bit vectors for support counting. We use the fastest known implementations of these algorithms from the FIMI repository [10] for our workload characterization study. A simplified sketch of the two representations follows.
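To make the contrast concrete, the following is a minimal sketch of the two dataset representations just described: a pointer-based FP-Tree node in the style of FPGrowth, and a MAFIA-style bit-vector tid-list join. This is our own illustration, not the FIMI code; all names are ours, and the structures are simplified.

    #include <cstdint>
    #include <vector>

    // Simplified FP-Tree node (FPGrowth style): an annotated prefix tree.
    // The layout is pointer-heavy; traversals chase parent/child/node-link
    // pointers that are rarely adjacent in memory (poor spatial locality).
    struct FPNode {
        int item;                      // item id at this node
        int count;                     // support count of the prefix
        FPNode* parent;                // for bottom-up path walks
        FPNode* nodeLink;              // next node carrying the same item
        std::vector<FPNode*> children; // one child per extending item
    };

    // Simplified MAFIA-style support counting: tid-lists stored as bit
    // vectors; a candidate's support is the popcount of the AND of its
    // items' vectors. Accesses are sequential (good spatial locality).
    int supportOfJoin(const std::vector<uint64_t>& a,
                      const std::vector<uint64_t>& b) {
        int support = 0;
        for (size_t i = 0; i < a.size(); ++i)
            support += __builtin_popcountll(a[i] & b[i]);  // GCC/Clang builtin
        return support;
    }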

Sequence Mining - SPADE: Informally, the sequence mining task is to discover a set of attributes, shared across time, among a large number of objects in a database. For example, a discovered sequence in the context of market basket data could be "70% of the customers who buy item A also buy item B within a period of 1 month". We consider SPADE [21] as a representative sequence mining algorithm. This algorithm recursively traverses the itemset search space in depth-first order. Like MAFIA, it uses the vertical tid-list representation of the dataset, but in its native form (a list of the transaction ids containing the item), not as bit vectors. The fastest known implementation of SPADE was obtained from the author's website (http://www.cs.rpi.edu/~zaki/software).

Graph Mining - FSG: The task of graph mining typically involves finding subgraphs that occur frequently in a graph database. The discovery process can find subgraphs that are frequent within a single graph database as well as those that co-occur frequently across different graph databases. Graph mining has applications in several domains ranging from security to bioinformatics. We consider FSG [13] as a representative graph mining algorithm. This algorithm traverses the subgraph search space in breadth-first order. It also uses the vertical tid-list representation of the dataset in its native form. The FSG implementation was obtained from the author's website as a part of the PAFI toolkit (http://www.cs.umn.edu/~karypis/pafi). A sketch of the native tid-list join appears below.
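As a rough illustration of the native tid-list join that SPADE and FSG rely on (as opposed to MAFIA's bit vectors), the sketch below intersects two sorted transaction-id lists with a merge scan; both inputs are read front to back, which is the streaming access pattern analyzed in Section 4. This is a plain itemset-style intersection under our own naming; SPADE's actual sequence join additionally tracks event timestamps.

    #include <vector>

    // Merge-based intersection of two sorted tid-lists. The support of
    // the joined pattern is the size of the result.
    std::vector<int> joinTidLists(const std::vector<int>& a,
                                  const std::vector<int>& b) {
        std::vector<int> out;
        size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i] < b[j])      ++i;          // advance the smaller side
            else if (b[j] < a[i]) ++j;
            else { out.push_back(a[i]); ++i; ++j; }  // tid in both lists
        }
        return out;
    }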

Clustering - kMeans and vCluster: Data clustering is the process of grouping data points into clusters such that the distance between data points assigned to the same cluster is minimized and the distance between data points assigned to different clusters is maximized. We consider two algorithms for this task. The first is the kMeans clustering algorithm [15], which takes as input a parameter k, the number of desired clusters, and iteratively groups the n input data points into k clusters. The second algorithm, vCluster, is a hierarchical clustering algorithm that achieves the same goal using a partitional scheme: it repeatedly breaks the data points into a growing number of clusters until the desired total number of clusters is reached. The kMeans implementation was obtained from Prof. David Mount's website (http://www.cs.umd.edu/~mount/Projects/KMeans), and the vCluster implementation was obtained from the author's website as a part of the CLUTO toolkit (http://www.cs.umn.edu/~karypis/cluto). The distance kernel at the heart of these algorithms is sketched below.
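The FPU-intensive kernel common to the clustering codes (and, per Section 4, to ORCA as well) is Euclidean distance. A minimal sketch of the kMeans assignment step, with our own naming and no claim to match the measured implementation:

    #include <cstddef>
    #include <limits>
    #include <vector>

    // Assign each point to its nearest centroid by squared Euclidean
    // distance (the square root is unnecessary for comparisons). The
    // inner loop streams over a point's dimensions: FPU intensive,
    // good spatial locality. 'assignment' must hold points.size() slots.
    void assignClusters(const std::vector<std::vector<double>>& points,
                        const std::vector<std::vector<double>>& centroids,
                        std::vector<int>& assignment) {
        for (size_t p = 0; p < points.size(); ++p) {
            double best = std::numeric_limits<double>::max();
            for (size_t c = 0; c < centroids.size(); ++c) {
                double d = 0.0;
                for (size_t k = 0; k < points[p].size(); ++k) {
                    double diff = points[p][k] - centroids[c][k];
                    d += diff * diff;
                }
                if (d < best) { best = d; assignment[p] = (int)c; }
            }
        }
    }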

Decision Tree Induction - C4.5: Decision trees are a common knowledge representation used for classification, in which the tree predicts the class membership of an instance based on that instance's attribute values. Each internal node in the tree consists of a test based on one or more attributes of the instance to be classified, and the leaf nodes provide the class labels. Decision tree induction is the process of learning a decision tree from a labeled (training) dataset. We consider C4.5 [17] as a representative decision tree construction algorithm. The implementation was obtained from the author's website (http://www.rulequest.com/Personal/). A schematic of its split-point search appears at the end of this section.

Outlier Detection - ORCA: The task of outlier detection is to find the top k data points in a dataset that are most different from the remaining points. Most outlier detection algorithms for high-dimensional datasets perform an all-to-all distance computation to find the top k outliers, with some smart pruning strategies. We consider ORCA [3] for outlier detection. It uses randomization and pruning to converge on the set of top k outliers, typically in near-linear time; in the worst case, however, execution scales quadratically with the number of data points. The implementation of ORCA was obtained from the author's website (http://www.isle.org/~sbay).
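For flavor, the split-point search that dominates C4.5's runtime (per the analysis in Section 4) amounts to sorting a numeric attribute and scanning it while updating class counts. The sketch below is schematic, with hypothetical names, and simplifies C4.5's actual gain-ratio criterion to plain information gain; the sort and scan over large arrays are the cache-unfriendly operations identified later.

    #include <algorithm>
    #include <cmath>
    #include <utility>
    #include <vector>

    // Entropy of a class-count histogram over 'total' rows.
    double entropy(const std::vector<int>& counts, int total) {
        double h = 0.0;
        for (int c : counts)
            if (c > 0) { double p = (double)c / total; h -= p * std::log2(p); }
        return h;
    }

    // Best information gain over all thresholds of one numeric attribute:
    // sort (value, label) pairs, then sweep candidate split points while
    // moving rows from the right-hand to the left-hand class histogram.
    double bestSplitGain(std::vector<std::pair<double,int>> rows, int numClasses) {
        std::sort(rows.begin(), rows.end());        // sort by attribute value
        std::vector<int> left(numClasses, 0), right(numClasses, 0);
        for (auto& r : rows) right[r.second]++;     // all rows start on the right
        int n = (int)rows.size();
        double base = entropy(right, n), bestGain = 0.0;
        for (int i = 0; i + 1 < n; ++i) {
            left[rows[i].second]++;                 // move row i to the left side
            right[rows[i].second]--;
            if (rows[i].first == rows[i + 1].first) continue;  // no threshold here
            double rem = ((i + 1) * entropy(left, i + 1)
                       + (n - i - 1) * entropy(right, n - i - 1)) / n;
            bestGain = std::max(bestGain, base - rem);
        }
        return bestGain;
    }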

4. PERFORMANCE CHARACTERIZATION

To analyze the performance of the selected data mining algorithms, we use a system with an Intel Pentium 4 processor with HT technology and 1.5GB of physical memory. The processor runs at 2.8GHz and has a 4-way set-associative 8KB L1 data cache and an 8-way set-associative 512KB unified L2 cache. The line size is 64 bytes for both the L1 and L2 caches. The front side bus runs at 800MHz. We use the Intel VTune Performance Analyzer to collect performance characteristics of execution such as clock ticks, cache misses, and branch mispredictions. We choose application parameters and datasets such that they represent realistic and time-consuming executions. Table 1 shows the chosen parameters and dataset sizes. Note that large real datasets are hard to come by; therefore, in many cases, widely cited synthetic datasets have been used.

    Application  Parameters                                                     Dataset size
    FPGrowth     Synthetic dataset with 300K transactions and support = 4000    84MB
    MAFIA        Synthetic dataset with 300K transactions and support = 4000    84MB
    SPADE        Synthetic dataset with 400K transactions and support = 80      81MB
    FSG          Synthetic dataset with 30K transactions and support = 600      30MB
    kMeans       Synthetic dataset with 100K transactions and k = 100 clusters  41MB
    vCluster     Synthetic dataset with 83K transactions and k = 100 clusters   72MB
    C4.5         KDDCup 1999 intrusion detection dataset                        74MB
    ORCA         KDDCup 1999 intrusion detection dataset with top 20 outliers   74MB

    Table 1: Input parameters and datasets

Operation Mix: The operation mix for our workloads is presented in Table 2. First, we consider FPGrowth, MAFIA, SPADE, and FSG, our frequent pattern mining algorithms. For these algorithms, most of the time is spent on counting the support (number of occurrences) of an itemset, either in a prefix tree or in tid-lists. Counting is carried out using integer ALU operations (increment or AND). Thus, these algorithms have a large number of ALU operations and memory operations per instruction. Next, we consider kMeans and vCluster, the clustering algorithms, and ORCA, the outlier detection algorithm. These algorithms are memory, integer ALU, and FPU intensive (Table 2); most of their time is spent performing Euclidean distance computations, which are FPU intensive. Finally, C4.5, our decision tree construction algorithm, is also memory, integer ALU, and FPU intensive; its FPU computations are associated with sorting operations and gain computations on a floating point array. All the algorithms have relatively few I/O operations and branch mispredictions per instruction. As these do not constitute a significant part of the execution time, we do not consider them in the rest of this paper. In summary, all the algorithms under consideration are memory intensive as well as compute intensive (integer ALU intensive and, in some cases, FPU intensive).

    Operations/instruction  FPGrowth  MAFIA  SPADE  FSG    kMeans  vCluster  C4.5   ORCA
    Integer ALU             0.560     0.822  0.636  0.625  0.668   0.620     0.769  0.607
    FP                      0.001     0.001  0.004  0.015  0.252   0.207     0.087  0.273
    Memory                  0.650     0.433  0.593  0.692  0.267   0.353     0.517  0.392
    Branch                  0.136     0.074  0.177  0.166  0.057   0.154     0.163  0.156

    Table 2: Operation mix

Memory Access Behavior and CPU Utilization: Table 3 presents the performance numbers related to cache behavior and CPU utilization. For FPGrowth, the core access pattern is a prefix tree traversal. The prefix tree is a pointer-based structure with poor spatial locality; the large number of DTLB misses per instruction and the poor L1 hit rate bear this out. Tree accesses also exhibit poor temporal locality, as the tree is accessed several times and does not fit in cache. Consequently, only 11.9% of the CPU is utilized, and the processor remains stalled for the remainder of the time, waiting on load operations to complete. For MAFIA, we see significantly better CPU utilization of 44.6%. The core computation is a tid-list join, using an AND operation on bit vectors. Bit vector accesses are streaming in nature, with good spatial locality but poor temporal locality. For SPADE and FSG, we see CPU utilization of 14.6% and 15.2%, respectively. The core computation is again a join operation, but without bit vectors: a sort- and merge-based join is used on tid-lists in their native form. This operation also exhibits streaming memory accesses with good spatial locality. For these algorithms, poor CPU utilization can be attributed to data dependencies in the instruction stream that lead to excessive pipeline stalls.

kMeans, vCluster, and ORCA all utilize between 24.4% and 32.2% of the CPU. They too exhibit streaming memory access behavior with good spatial locality, although vCluster exhibits poor temporal locality. Their limited CPU utilization can be attributed to their FPU intensive nature. For C4.5, CPU utilization is only 4.9%. This is due to the poor cache performance of the quicksort routine and the split point calculation routine on large arrays that do not fit in cache. These routines have non-sequential memory accesses and consequently poor spatial locality; in addition, the manipulation of large arrays restricts temporal locality. This is also evident from the large number of DTLB misses per instruction and the poor cache hit rates.

None of these algorithms fully utilizes the modern processor. The primary performance bottleneck is that these algorithms are memory intensive and suffer from poor data locality and from data dependencies in the instruction stream (they have small instruction footprints, however, and exert very little pressure on the ITLB). In some instances, these algorithms have good spatial locality with streaming memory accesses. All the algorithms issue significantly more loads than stores, which suggests a larger percentage of read-only accesses than write accesses.

                               FPGrowth  MAFIA  SPADE  FSG    kMeans  vCluster  C4.5   ORCA
    L1 LD hit rate             0.891     0.953  0.954  0.963  0.979   0.882     0.600  0.970
    L2 LD hit rate             0.430     0.997  0.992  0.985  0.989   0.987     0.969  0.993
    L2 LD misses/instruction   0.030     0.001  0.001  0.002  0.000   0.000     0.031  0.000
    LD operations/instruction  0.515     0.391  0.538  0.532  0.254   0.279     0.385  0.335
    ST operations/instruction  0.135     0.042  0.116  0.160  0.013   0.083     0.131  0.057
    DTLB misses/instruction    0.024     0.000  0.012  0.007  0.001   0.001     0.005  0.003
    ITLB misses/instruction    0.000     0.000  0.000  0.000  0.001   0.000     0.000  0.000
    CPU utilization            0.119     0.446  0.146  0.152  0.244   0.322     0.049  0.316

    Table 3: Cache and CPU performance

Impact of Hardware Prefetching: The Intel Pentium 4 hardware prefetcher tracks memory accesses from programs and learns sequential and strided memory access patterns. Once it captures a sequence of misses with a fixed stride, it fetches cache lines into the L2 cache along the predicted access pattern. By default, the Intel Pentium 4 prefetcher is always on; to measure the speedup due to prefetching, we measure execution time with the prefetcher turned on and off (disabling it is only possible on an experimental Intel Pentium 4 system). Table 4 presents the speedup when the hardware prefetcher is turned on, which ranges from a marginal 1.01 to 1.65. When a code is memory intensive with good spatial locality, prefetching can produce large improvements in execution time. Hardware prefetching has a large impact on algorithms such as MAFIA and vCluster that have streaming memory accesses with good spatial locality; these algorithms have slightly lower L1 hit rates than the other streaming algorithms, so prefetching provides a greater speedup for them. For SPADE and kMeans, which also exhibit streaming data accesses, more time is spent on computation per word fetched; the impact of prefetching is therefore not significant, as memory latency is not the primary performance bottleneck. (A micro-benchmark sketch at the end of this section illustrates the contrast between the streaming and pointer-chasing access patterns discussed here.)

                                         FPGrowth  MAFIA  SPADE  FSG   kMeans  vCluster  C4.5  ORCA
    Speedup due to hardware prefetching  1.26      1.65   1.02   1.15  1.02    1.19      1.06  1.01
    Speedup due to SMT                   1.02      1.06   1.05   1.26  1.30    1.26      1.18  1.03

    Table 4: Impact of hardware prefetching and SMT

Impact of SMT: SMT is a processor technology in which multiple thread contexts are maintained on chip. SMT provides a potentially larger pool of schedulable instructions, so as to improve instruction-level parallelism (ILP). To evaluate the benefits of SMT, we compare the execution time of one instance of a program with that of two instances running simultaneously. The impact of SMT can be seen in Table 4. SMT does not have a significant impact on memory intensive algorithms with poor cache locality whose compute operations are primarily issued to the integer ALU: integer ALU latency is minimal compared to memory latency, so even with improved ILP, large memory stall times leave the CPU waiting on misses in both thread contexts for the majority of the time. On the other hand, SMT provides a significant improvement for kMeans and vCluster. In these FPU intensive algorithms, floating point unit latency contributes to poor CPU utilization, because FPU operations require a significant number of cycles to complete. With a larger pool of instructions to choose from, the processor can overlap FPU operations between the two instruction streams.

Summary: We summarize what we have learned in this section below:

• Data mining algorithms are compute intensive, being integer ALU intensive and sometimes FPU intensive.
• Data mining algorithms are memory intensive. This limits full CPU utilization, and specialized memory systems will be needed to improve it.
• Some data mining algorithms use complex data structures such as prefix trees. Pointer-chasing problems, coupled with poor locality, add to the above-mentioned bottlenecks. Algorithm designers should rethink data layout in such scenarios.
• When simpler data structures (such as bit vectors, tid-lists, and arrays) are used, we see streaming memory accesses with good spatial locality, but in some cases poor temporal locality. Here, algorithm designers should rethink computation structure and try to improve temporal locality by restructuring code.
• SMT helps improve performance for some of these algorithms; more thread contexts on chip are therefore likely to be beneficial in these instances. For algorithms that do not benefit from SMT, we need to restructure computation to facilitate cache reuse, i.e., the two threads of execution should be structured so that they reuse cached data. This is one approach to improving ILP for such codes. Furthermore, with an increasing number of contexts per chip, ILP for these memory intensive algorithms is likely to improve.

In the next section, we present our efforts toward improving the performance of the frequent pattern mining algorithm FPGrowth. We employ a cache-conscious data structure layout to improve spatial locality, computation restructuring to improve temporal locality, and thread co-scheduling to improve ILP.
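Before moving to those optimizations, the following micro-benchmark sketch makes the locality argument of this section tangible: a streaming scan (prefetcher-friendly, like MAFIA's bit vectors) versus a dependent pointer chase (prefetcher-hostile, like FPGrowth's tree walks). This is our own illustration, not part of the measured workloads.

    #include <cstddef>
    #include <vector>

    // Streaming scan: sequential, fixed-stride addresses that a hardware
    // prefetcher can predict and fetch well ahead of use.
    long sumStreaming(const std::vector<long>& data) {
        long s = 0;
        for (size_t i = 0; i < data.size(); ++i) s += data[i];
        return s;
    }

    // Pointer chase: each load's address depends on the previous load's
    // value, so misses serialize and the prefetcher sees no fixed stride.
    long sumChase(const std::vector<size_t>& next, const std::vector<long>& val) {
        long s = 0;
        for (size_t i = 0, steps = 0; steps < next.size(); ++steps) {
            s += val[i];
            i = next[i];   // dependent load: the next address is known only now
        }
        return s;
    }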

5. IMPROVEMENTS TO FPGROWTH

We use our characterization results from Section 4 to guide us in improving the execution time of FPGrowth. Details of our optimization techniques are the focus of a forthcoming publication [9]; here we provide only an overview for the sake of completeness. The results of our improvement efforts can be seen in Figure 1, which presents cumulative speedups for the five datasets in Table 5. The first four datasets were generated with the IBM Quest Dataset Generator, and the fifth is a real dataset of webpage accesses [10].

    Name  Dataset        Number of transactions
    DS1   T40I15D300K    300,000
    DS2   T60I15D300K    300,000
    DS3   T70I15D300K    300,000
    DS4   T100I15D300K   300,000
    DS5   Webdocs.dat    500,000

    Table 5: Data sets

[Figure 1: Speedups for FPGrowth.]

The data in Table 3 illustrates that the algorithm has poor cache utilization. Our first technique is designed to improve spatial locality, which reduces this bottleneck. In the prefix tree proposed by Han et al. [11], each node has a list of child pointers, a parent pointer, a node-link pointer, a count, and an item. Except for the item and the parent pointer, none of these fields is required by the two main mining routines, yet the additional fields loaded into cache significantly degrade cache line utilization. Second, it is unlikely that consecutive nodes in a bottom-up traversal (the core access pattern) will be adjacent in memory; the result is often a cache miss. We therefore reallocate the tree in depth-first order, by allocating one contiguous block of memory and adding nodes sequentially as we perform a depth-first traversal. This simple reallocation strategy provides significant cache improvements, because the algorithm accesses the prefix tree several times in a bottom-up fashion, which is largely aligned with a depth-first order of the tree. Also, our node size is much smaller than the original node size because we do not include child pointers, next pointers, or counts; these data members are only used when constructing the tree, not during traversals. More than twice as many nodes fit on a cache line as before. A simplified sketch of this layout follows.
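A minimal sketch of the re-layout just described, under our own naming (the actual design appears in [9]): traversal-time nodes keep only the item and the parent link, and the tree is copied depth-first into one contiguous array so that nodes on bottom-up paths tend to be adjacent in memory.

    #include <vector>

    // Full node used while building the FP-Tree (pointers, counts, links).
    struct BuildNode {
        int item, count;
        BuildNode* parent;
        BuildNode* nodeLink;
        std::vector<BuildNode*> children;
    };

    // Slim node used during mining: only what bottom-up walks need.
    // Several of these fit in one 64-byte cache line.
    struct SlimNode {
        int item;
        int parent;   // index into the contiguous array; -1 at the root
    };

    // Copy the tree into 'out' in depth-first order, so nodes visited
    // consecutively tend to share cache lines. Parents always receive
    // lower indices than their children.
    void layoutDepthFirst(const BuildNode* n, int parentIdx,
                          std::vector<SlimNode>& out) {
        int myIdx = (int)out.size();
        out.push_back({n->item, parentIdx});
        for (const BuildNode* c : n->children)
            layoutDepthFirst(c, myIdx, out);
    }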

In addition, we restructure iterations within the algorithm so as to improve temporal locality. The goal is to maximize reuse of prefix tree nodes once they are placed in the L1 cache. We use an approach we call path tiling. First, we decompose the tree into blocks of memory (tiles) along paths of the tree from the leaf nodes to the root; this strategy is only possible because of the depth-first order of our cache-conscious prefix tree. Next, we iteratively fetch each tile into cache. Once a tile is fetched, we traverse each path segment that falls in the address space of that tile. Thus, once a tile is brought into the cache, all accesses that hit the tile are served from the cache. A sketch of this traversal appears below.
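A rough sketch of path tiling over the depth-first node array, assuming the SlimNode layout from the previous sketch. The tile bookkeeping and names are ours; the real algorithm is described in [9]. Because parents have lower indices than children, each leaf-to-root walk moves through tiles from high addresses to low, so we process tiles in that order and keep a per-path cursor.

    #include <vector>

    // frontier[p] holds the next node to visit on path p (initially its
    // leaf index). For each cache-sized tile [lo, hi) of the node array,
    // every path is advanced only while it stays inside the tile, so the
    // tile is fetched once and reused by all paths that cross it.
    void pathTiledWalk(const std::vector<SlimNode>& tree,
                       std::vector<int> frontier, int tileSize) {
        for (int hi = (int)tree.size(); hi > 0; hi -= tileSize) {
            int lo = hi > tileSize ? hi - tileSize : 0;   // current tile
            for (int& v : frontier) {
                while (v >= lo && v < hi) {               // stay inside tile
                    /* visit tree[v], e.g. accumulate counts */
                    v = tree[v].parent;                   // -1 stops at root
                }
            }
        }
    }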

Finally, with improved cache utilization, we return our focus to simultaneous multithreading. A natural candidate for a two-thread decomposition of a frequent itemset mining algorithm is an extant strategy like that proposed in [16], which would decompose execution into two independent threads of computation. However, when we evaluate such strategies for prefix tree implementations, there is no significant benefit, as can be seen in Table 4. Instead, with the aid of the cache-conscious prefix tree, we devise a novel parallelization strategy in which the two threads follow each other through the recursive calls. These threads are not independent; rather, they operate on the same tree simultaneously, through fine-grained parallel execution of the tiled loops. By co-scheduling the two threads, a portion of a tile fetched into the cache by one thread is reused by both. This not only reduces the number of cache misses, it also improves ILP.
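One way to picture the co-scheduling, as a hedged sketch building on the two previous ones (names and synchronization scheme are ours; details are in [9]): the two SMT threads split the paths within each tile and meet at tile boundaries, so a cache line fetched by either thread serves both.

    #include <atomic>
    #include <thread>
    #include <vector>

    // Both threads work on the SAME tile at the same time, claiming paths
    // via an atomic counter; joining at each tile boundary keeps them on
    // the same working set. Spawning threads per tile is for clarity only.
    void coScheduledWalk(const std::vector<SlimNode>& tree,
                         std::vector<int> frontier, int tileSize) {
        std::atomic<int> nextPath{0};
        auto worker = [&](int lo, int hi) {
            for (int p = nextPath.fetch_add(1); p < (int)frontier.size();
                 p = nextPath.fetch_add(1)) {
                int v = frontier[p];
                while (v >= lo && v < hi) {       // advance within this tile
                    /* visit tree[v] */
                    v = tree[v].parent;
                }
                frontier[p] = v;                  // resume point for next tile
            }
        };
        for (int hi = (int)tree.size(); hi > 0; hi -= tileSize) {
            int lo = hi > tileSize ? hi - tileSize : 0;
            nextPath = 0;
            std::thread t1(worker, lo, hi), t2(worker, lo, hi);
            t1.join(); t2.join();                 // both threads leave together
        }
    }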

6. CONCLUSIONS

We have presented a characterization of eight data mining algorithms. We have shown that these algorithms are compute and memory intensive. Some of them have good spatial locality, while all of them have poor temporal locality. The memory intensive nature of these algorithms limits full utilization of modern hardware. We believe that the developers of these applications must consider the data layout and cache performance of their implementations in order to leverage technologies such as hardware prefetching and simultaneous multithreading. Finally, we gave an overview of techniques that accomplish these goals for FPGrowth, including cache-conscious prefix trees, path tiling, and fine-grained thread co-scheduling. Our characterization suggests that specialized memory systems with several thread contexts per processor will help these algorithms scale on future microprocessors.

7. REFERENCES

[1] A. Ailamaki, D. J. DeWitt, M. Hill, and D. Wood. DBMSs on a modern processor: Where does time go? In Proceedings of the International Conference on Very Large Data Bases (VLDB), 1999.
[2] L. Barroso, K. Gharachorloo, and F. Bugnion. Memory system characterization of commercial workloads. In Proceedings of the International Symposium on Computer Architecture (ISCA), 1998.
[3] S. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), 2003.
[4] M. Bender, E. Demaine, and M. Farach-Colton. Cache-oblivious B-trees. In Proceedings of the International Symposium on Foundations of Computer Science (FOCS), 2000.
[5] J. Bradford and J. Fortes. Performance and memory-access characterization of data mining applications. In Proceedings of the Workshop on Workload Characterization (WWC), 1998.
[6] D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset mining algorithm for transactional databases. In Proceedings of the International Conference on Data Engineering (ICDE), 2001.
[7] S. Chen, A. Ailamaki, P. Gibbons, and T. Mowry. Improving hash join performance through prefetching. In Proceedings of the International Conference on Data Engineering (ICDE), 2004.
[8] S. Chen, P. Gibbons, and T. Mowry. Improving index performance through prefetching. In Proceedings of the International Conference on Management of Data (SIGMOD), 2001.
[9] A. Ghoting, G. Buehrer, S. Parthasarathy, D. Kim, Y. Chen, A. Nguyen, and P. Dubey. Cache-conscious frequent pattern mining on a modern processor. In Proceedings of the International Conference on Very Large Data Bases (VLDB) (to appear), 2005.
[10] B. Goethals and M. Zaki. Advances in frequent itemset mining implementations. In Proceedings of the ICDM Workshop on Frequent Itemset Mining Implementations, 2003.
[11] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the International Conference on Management of Data (SIGMOD), 2000.
[12] J. Kim, X. Qin, and Y. Hsu. Memory characterization of a parallel data mining workload. In Proceedings of the Workshop on Workload Characterization (WWC), 1999.
[13] M. Kuramochi and G. Karypis. Frequent subgraph discovery. In Proceedings of the International Conference on Data Mining (ICDM), 2001.
[14] J. Lo, L. Barroso, S. Eggers, K. Gharachorloo, H. Levy, and S. Parekh. An analysis of database workload performance on simultaneous multithreaded processors. In Proceedings of the International Symposium on Computer Architecture (ISCA), 1998.
[15] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[16] S. Parthasarathy, M. Zaki, M. Ogihara, and W. Li. Parallel data mining for association rules on shared-memory systems. Knowledge and Information Systems Journal, 2001.
[17] J. Quinlan. C4.5: Programs for Machine Learning. 1993.
[18] J. Rao and K. Ross. Cache conscious indexing for decision support in main memory. In Proceedings of the International Conference on Very Large Data Bases (VLDB), 1999.
[19] J. Rao and K. Ross. Making B+ trees cache conscious in main memory. In Proceedings of the International Conference on Management of Data (SIGMOD), 2000.
[20] A. Shatdal, C. Kant, and J. Naughton. Cache-conscious algorithms for relational query processing. In Proceedings of the International Conference on Very Large Data Bases (VLDB), 1994.
[21] M. Zaki. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning Journal, special issue on Unsupervised Learning, 2001.
[22] Y. Zhao and G. Karypis. Evaluation of hierarchical clustering algorithms for document datasets. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), 2002.