
Accelerating Rabin Karp on a Graphics Processing Unit (GPU) using Compute Unified Device Architecture (CUDA)

Nayomi Dayarathne and Roshan Ragel
University of Peradeniya

Abstract - String matching or pattern matching algorithms are used in various applications. They are used to find the occurrences of a pattern in a given text or a pool of strings. They are widely used in text editors, database queries, bio-informatics, chem-informatics, search engines and many more applications. String matching algorithms fall into two categories: single pattern matching and multiple pattern matching. Rabin Karp is a string searching algorithm that can be used in both ways. Since these algorithms work on large pools of data, achieving higher throughput has always been a major concern. Parallel implementation of such algorithms can achieve this objective, and the rapid development of GPU architectures has made such parallelism easier to achieve. In this research, the Rabin Karp algorithm is implemented in CUDA C. We have compared the CUDA implementation of this algorithm against both its serial CPU implementation and its parallel Pthread implementation on multi-core CPUs. From the empirical results, we conclude that the CUDA C implementation of Rabin Karp on the GPU can achieve much higher throughput for a large pool of data in string matching.

Index Terms – CUDA, CUDA C, GPU, Pthread, Rabin Karp

I. INTRODUCTION

String or pattern matching algorithms are used to find the occurrences of a pattern in a given text or a large pool of strings. They are used in various applications such as text editors, bio-informatics, spell checkers, network intrusion detection systems, and search engines. Popular and widely used string searching algorithms include Aho-Corasick [1], Boyer Moore [3], Knuth Morris Pratt [9], Naïve/Brute Force [6] and Rabin Karp [8]. These algorithms can be further categorized as single pattern matching and multiple pattern matching algorithms. With the growth of data stored and manipulated over the years, these algorithms have to work with large pools of data. Therefore, their efficiency in terms of throughput matters a great deal when applying them to scientific problems.

There are many ways to increase the throughput of an algorithm. One such method is to parallelize it. Many researchers try to find the best ways of parallelizing these string searching algorithms. Running a parallelized algorithm on a CPU can improve its throughput only to a limited extent. With the rapid technological development of the Graphics Processing Unit (GPU), we are now able to run complex computations in parallel to get better throughput. GPU technology has evolved to process a large number of threads in parallel. With the availability of a large number of cores (usually many hundreds), the GPU has become more powerful than the CPU in performing computationally intensive parallel tasks.

In this paper, we design, implement and compare the average execution times of a serial implementation, a Compute Unified Device Architecture (CUDA) C parallel implementation and a Pthread parallel implementation of the Rabin Karp algorithm on random DNA sequence data. The main contribution of this paper is the CUDA C implementation of the Rabin Karp algorithm. This is the first time the Rabin Karp algorithm has been implemented in CUDA C targeting a GPU. The rest of the paper is organized as follows. Related work is discussed in Section II. Background, algorithms and implementation, results and discussion, scalability of approaches, conclusion, and further scope are discussed in Sections III, IV, V, VI, VII and VIII respectively.

978-1-4799-4598-6/14/$31.00 ©2014 IEEE

II. RELATED WORK

Accelerating string searching algorithms for use on large datasets has been one of the major concerns in bioinformatics research. As most of these algorithms are inherently parallel, and given the continuing development of high-performance parallel architectures, parallel hardware is a key means of increasing their performance. Researchers have employed both hardware and software parallelism in their work.

Chillar and Kochar (2008) [5] developed a new algorithm based on Rabin Karp, called RB-Matcher, to improve the numeric pattern matching process. It eliminates the extra computation spent on spurious hits: the algorithm computes the quotients of the pattern and the text along with their remainders, and reports a match only when both the quotient and the remainder agree between pattern and text. This removes spurious hits, leaving only successful and unsuccessful hits. Their experiments used a fixed text length with varying pattern lengths, and vice versa, over the range of numbers used to compute the remainders and quotients. Their conclusion was that any numeric pattern can be found in a given text 'T' effectively and efficiently by using RB-Matcher. This is the same serial implementation with a single added step and no major changes. A serial implementation is limited in performance compared with a parallel one. Therefore, parallel implementation, rather than further improvement of the serial version, is the way to speed up string matching with Rabin Karp. Our research is an effort to obtain the maximum speedup achievable by Rabin Karp by implementing it as a parallel CUDA C version.

Bradley Kuszmaul (2009) [4] implemented Rabin Karp hashing using the Cilk++ multithreaded programming environment and ran it on Linux. The searching component of the Cilk++ implementation runs with nearly linear speedup on up to 16 cores. The first chromosome of Canis familiaris (123MB) was used as the text string. The algorithm was used for searching multiple query strings, and a good speedup of string matching was achieved with an increasing number of cores. However, the parallelized parts did not add much advantage toward the expected efficiency of the algorithm. The number of cores was limited to 16, and the parallel implementation gained a 13x speedup over the serial implementation at the maximum number of cores. With a higher number of cores, a better speedup might have been achieved. There may also be other parallel implementations that perform better than the Cilk implementation, but that work makes no such comparison. Therefore, our approach, the CUDA C parallel implementation of Rabin Karp on a GPU with 448 cores, is worth examining for obtaining the maximum speedup of the algorithm.

Blandon and Lombardo (2012) [2] reviewed the state of the art of different string matching algorithms, including Rabin Karp, used in network intrusion detection systems. They reviewed the performance of several algorithms and stated that the serial implementation of Rabin Karp is too slow compared with the other algorithms they considered, with a worst case execution time of O(nm) and a best case of O(n+m), where n is the text length and m is the pattern length. Further, they noted that Rabin Karp would be an excellent algorithm for multiple pattern matching. However, [2] contains no experimental results to support these conclusions and suggestions. A parallel implementation of Rabin Karp is an obvious way to overcome its inefficiency. The open questions are the design architecture to use for such an implementation and the amount of speedup it can obtain. Our research not only provides such a design architecture and its implementation, but also supports this hypothesis with experimental results on real data.

III. BACKGROUND

This section briefly describes the Rabin Karp algorithm and the CUDA libraries used to implement our parallel CUDA-based Rabin Karp algorithm.

A. Rabin Karp algorithm

Rabin Karp is a string searching algorithm that is mainly used in plagiarism detection applications. The algorithm uses hashing to find occurrences of given pattern strings in a text. Assume a text T of length n and a pattern P of length m, where the elements of P and T are characters in a finite alphabet Σ. We compute the hash value of the pattern P and the hash value of the first substring of size m in the text T. If the hash values, taken modulo a fixed number, are equal, each character of the pattern is compared against the substring. Likewise, we take each substring of size m in the text T and match it against the pattern P. The hash value of the next substring in the text is calculated from the hash value of the current position. Owing to its use of a hash function, this algorithm is more efficient in serial implementation than other string searching algorithms.

B. CUDA libraries

CUDA libraries are part of the CUDA architecture, a parallel programming architecture that provides APIs for performing computations on a GPU. They enable parallel programming in the CUDA C language.

IV. ALGORITHMS AND IMPLEMENTATION
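As a reference point for the parallel versions discussed in this section, the serial rolling-hash search outlined in Section III-A can be sketched in C as follows. This is a minimal illustration, not the implementation evaluated in this paper: the function name rk_search, the radix RK_BASE and the prime modulus RK_MOD are assumptions chosen for the example.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative parameters (assumptions, not taken from the paper). */
#define RK_BASE 256u      /* radix: number of possible byte values */
#define RK_MOD  1000003u  /* a prime modulus for the hash */

/* Stores up to max_out match positions in out and returns the total
   number of occurrences of pattern in text. */
static int rk_search(const char *text, const char *pattern,
                     int *out, int max_out)
{
    size_t n = strlen(text), m = strlen(pattern);
    if (m == 0 || m > n) return 0;

    /* high = RK_BASE^(m-1) mod RK_MOD, the weight of the leading character */
    unsigned long long high = 1;
    for (size_t i = 1; i < m; i++) high = (high * RK_BASE) % RK_MOD;

    /* Hash of the pattern and of the first text window */
    unsigned long long hp = 0, ht = 0;
    for (size_t i = 0; i < m; i++) {
        hp = (hp * RK_BASE + (unsigned char)pattern[i]) % RK_MOD;
        ht = (ht * RK_BASE + (unsigned char)text[i]) % RK_MOD;
    }

    int count = 0;
    for (size_t s = 0; ; s++) {
        /* Equal hashes may be a spurious hit, so verify character by character */
        if (hp == ht && memcmp(text + s, pattern, m) == 0) {
            if (count < max_out) out[count] = (int)s;
            count++;
        }
        if (s + m >= n) break;  /* no further window */
        /* Roll the hash: drop text[s], then append text[s + m] */
        ht = (ht + RK_MOD - ((unsigned char)text[s] * high) % RK_MOD) % RK_MOD;
        ht = (ht * RK_BASE + (unsigned char)text[s + m]) % RK_MOD;
    }
    return count;
}
```

For example, searching for ATTA in GATTACAGATTA with this sketch reports two matches, at positions 1 and 8 (counting from 0). Each window costs a constant-time hash update, and the character comparison runs only when the hashes agree.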

This section describes the approach taken to implement the parallel CUDA C Rabin Karp algorithm targeting a multi-core GPU system. We implemented a parallel version of the Rabin Karp algorithm in the CUDA C language. We also implemented it using Pthreads and compared it with the CUDA C implementation. Algorithms 1 and 2 are used in the implementations of the two parallel versions.

Algorithm 1: Rabin Karp CUDA C pseudo code

Start
  Allocate space for textarr, result and pattern arrays
  x = number of m-character substrings of the text (rows of textarr)
  For each text position t = 0 to x-1
    For each pattern position p = 0 to m-1
      textarr[t*m+p] = text[t+p]
    End for
  End for
  Copy textarr, result and pattern arrays from CPU to GPU
  Allocate grid and block dimensions
  Start function matchstring()
    Define thread_id
    if thread_id < x
      result[thread_id] = 1
      for each pattern position i = 0 to m-1
        if textarr[thread_id*m+i] != pattern[i]
          result[thread_id] = 0
  End function
  Copy result array from GPU to CPU
  Print result array
End
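The per-thread work in Algorithm 1 can be illustrated with a plain C sketch in which an ordinary loop stands in for the GPU threads. This is a sequential emulation under stated assumptions, not the CUDA C code used in the experiments: the names match_window and match_all are hypothetical, x is taken to be n - m + 1 (one row of textarr per text window), and the comparison stops at the first mismatch.

```c
#include <stdlib.h>
#include <string.h>

/* Work of one "thread": compare row thread_id of textarr with the pattern
   and record 1 (match) or 0 (no match) in result[thread_id]. */
static void match_window(const char *textarr, const char *pattern,
                         int *result, size_t m, size_t x, size_t thread_id)
{
    if (thread_id >= x) return;  /* guard against surplus threads */
    int match = 1;
    for (size_t i = 0; i < m; i++) {
        if (textarr[thread_id * m + i] != pattern[i]) {
            match = 0;
            break;
        }
    }
    result[thread_id] = match;
}

/* Host-side driver: builds textarr, then calls match_window once per
   window. On a GPU each call would run as an independent thread.
   result must have room for n - m + 1 entries.
   Returns the number of windows x, or 0 on invalid input or failure. */
static size_t match_all(const char *text, const char *pattern, int *result)
{
    size_t n = strlen(text), m = strlen(pattern);
    if (m == 0 || m > n) return 0;

    size_t x = n - m + 1;  /* assumed number of windows */
    char *textarr = malloc(x * m);
    if (textarr == NULL) return 0;

    /* Row t of textarr holds the m characters starting at text[t] */
    for (size_t t = 0; t < x; t++)
        memcpy(textarr + t * m, text + t, m);

    /* Sequential stand-in for the parallel kernel launch */
    for (size_t id = 0; id < x; id++)
        match_window(textarr, pattern, result, m, x, id);

    free(textarr);
    return x;
}
```

In a CUDA C version, match_window would correspond to the body of the matchstring() kernel, with thread_id derived from the block and thread indices, and the loop over id replaced by the kernel launch. Because every window writes only its own entry of result, the calls are independent of one another, which is what makes this layout straightforward to parallelize.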

Algorithm 2: Rabin Karp Pthread pseudo-code

Start
  Allocate memory for result and textarr dynamic arrays
  x = number of m-character substrings of the text (rows of textarr)
  For each text position t = 0 to x-1
    For each pattern position p = 0 to m-1
      textarr[t*m+p] = Text[t+p]
    End for
  End for
  Allocate memory for structure arg_data
  Create pthread_t thr[THREADS]
  For each t = 0 to THREADS
    Start Function Pthread_create()
      If thread_id