Comparative Study of Massively Parallel Cryptanalysis and Cryptography on CPU-GPU Cluster
Ewa Niewiadomska-Szynkiewicz, Michal Marks, Jarosław Jantura, Mikołaj Podbielski, Przemysław Strzelczyk
Institute of Control and Computation Engineering, Warsaw University of Technology, and Research and Academic Computer Network (NASK), Warsaw, Poland
Abstract: The paper addresses issues associated with the application of mixed CPU and GPU processing to cryptography and cryptanalysis. The performance of new, efficient OpenCL-based parallel implementations of selected commonly used cryptanalysis and cryptographic algorithms executed on GPU devices is compared with implementations running on the CPU. Moreover, the paper describes the hardware architecture of a novel hybrid cluster system (HGCC) integrating two types of devices, Intel processors with NVIDIA graphics processing units and AMD processors with AMD graphics processing units, and the specialized software framework that hides the heterogeneity of the cluster and provides a single system image. The results show that the GPU can perform as an efficient accelerator, and that the HGCC cluster system is a powerful, effective, scalable, flexible and easy-to-use platform for cryptography and cryptanalysis.
Keywords: cryptography; cryptanalysis; parallel computing; HPC; clusters; GPU computing
I. INTRODUCTION
The demand for reliable and efficient cryptanalysis and cryptographic solutions has been growing continuously over the last decade. It is a consequence of using the Internet in critical areas such as government, business and healthcare. Unfortunately, most cryptographic and cryptanalysis applications are computationally intensive and can push CPUs to their performance limits. However, many data encryption, decryption and password recovery algorithms can be easily decomposed: the calculations can be partitioned into independent parts and carried out on different cores, processors and computers. In general, these algorithms are natural candidates for massively parallel computation. Hence, many hardware-software accelerators with an affordable cost and great ease of integration have been studied and proposed, both in research and in industry. Graphics Processing Units (GPUs) have been the subject of extensive research and have been successfully applied to general-purpose computations outside the graphical domain. The GPU has evolved into a massively parallel stream processor with a flexible programming model. It is
designed to perform hundreds of billions of floating point operations per second. Since GPU platforms are now supported by frameworks that allow the implementation of general-purpose software, they have become a prime choice for the implementation of computationally demanding algorithms. GPUs have already been used in many areas for applications in science, engineering and commerce. The MapReduce programming model for processing large data sets, as utilized by GPU applications, can be successfully used to improve efficiency and speed up calculations, which explains why GPUs are often enlisted as cryptanalysis and cryptographic coprocessors.
We built a hybrid cluster system (HGCC) that integrates two types of multicore CPUs and GPUs from different vendors. The cluster is dedicated to massively parallel computations in cryptography and cryptanalysis. We developed CPU-based and GPU-based implementations of widely used cryptanalysis and cryptography techniques. Because our cluster is composed of devices from different vendors, we used OpenCL to implement the kernels. In this paper we present and discuss the performance evaluation of our implementations.
The rest of this paper is structured as follows: Section II discusses background and related works; Section III describes the hardware and software of the HGCC cluster; Section IV presents a brief description of the implementations of cryptanalysis and cryptography techniques on HGCC; Section V presents and discusses the performance evaluation of selected cryptanalysis and cryptography algorithms. Finally, Section VI concludes the paper.
II. BACKGROUND AND RELATED WORKS
A. GPU Programming
Programming on graphics processing units usually refers to non-graphics programming operations performed on the GPU. This programming paradigm opens up many possibilities to increase performance by exploiting the specific processing nature of the GPU. A new model of parallel calculation, based on using GPUs and CPUs together to perform scientific and engineering computing, has already been used to solve many complex practical problems. There are currently two common frameworks available to program GPUs, namely Compute Unified Device Architecture, CUDA (https://developer.nvidia.com/cuda-toolkit), from NVIDIA and Open Computing Language, OpenCL (http://www.khronos.org/opencl/), an open standard maintained by the Khronos Group. The CUDA framework currently supports only its native GPU architecture (NVIDIA), while OpenCL is designed to be portable across vendors and can be executed on a wide range of hardware configurations. Using the OpenCL or CUDA frameworks, many real-life problems can be implemented and run significantly faster than on multiprocessor or multicore machines [5,6,8,14].
The TOP500 project (www.top500.org) aims to provide a reliable basis for tracking and detecting trends in high-performance computing. Recently, the significance of GPU accelerators has been rising. The latest ranking, from November 2012, shows that the share of systems utilizing accelerators or co-processors grew to 13% from 7% one year earlier. There are many new installations utilizing not only the popular NVIDIA Fermi family chips, but also the newest Kepler family chips and Intel Xeon Phi solutions. The aggressive competition for the GPU market is driving these architectures towards increasing levels of hardware parallelism while containing costs.
B. Cryptographic and Cryptanalysis Algorithms on GPU
A number of popular GPU-enhanced cryptanalysis and cryptographic algorithms are in use today. The massively parallel computations that can be performed on a GPU can be naturally exploited in cryptography and cryptanalysis. Most of the operations performed by data decryption, encryption and password recovery algorithms are natively supported by GPU computing units, which allows these algorithms to take full advantage of the GPU's computing power.
The most common application of GPUs in cryptanalysis is password recovery from hashes. The commonly used hash functions are MD4, MD5, SHA-1 and SHA-2. The hashes are easy to compute but extremely hard to invert. Therefore, the only general technique for recovering a password from a hash is to scan all potential passwords, compute their hashes and test for a match. Cryptographic hash functions are based on integer and binary operations such as addition modulo a power of two, bit shift and rotation, bitwise xor, bitwise or, bit negation and word permutation. Those operations are natively supported by GPU computing units. In general, we can distinguish three common techniques for password strength validation:
• brute-force,
• rainbow table,
• dictionary test.
Brute-force and rainbow table attacks are very suitable for porting to the GPU. Numerous GPU implementations of the brute-force attack are available, among them:
• Elcomsoft: http://www.elcomsoft.com.
The fastest ones are capable of checking up to several billion passwords per second using a single GPU [7,8,13]. All of them can use multiple GPUs installed in one host system, but only one, the solution offered by Elcomsoft, can distribute the workload over many hosts. It is worth mentioning that a brute-force attack over a search space that contains the password is guaranteed to succeed.
In the rainbow table technique [11,12] two phases of calculation can be distinguished. First, the hashes for all potential passwords are pre-computed. Next, these hashes are recorded in a special compressed form called a rainbow table. Hence, the password search is reduced to several million hash calculations and rainbow table look-ups. GPU-enhanced rainbow table implementations can be found in the literature; the most popular is RainbowCrack (http://project-rainbowcrack.com), which provides an implementation with NVIDIA CUDA support. The rainbow table attack does not guarantee success, but it speeds up the search significantly.
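The exhaustive search over candidate passwords described above decomposes naturally into independent index ranges. The following Python sketch is an illustration only, not the OpenCL code of the HGCC library: the lowercase alphabet, the fixed password length, the worker count and all function names (`index_to_candidate`, `search_range`, `split_keyspace`, `brute_force`) are assumptions made for this example. It maps each integer index to exactly one candidate, so that disjoint ranges can be searched without any communication between workers.

```python
import hashlib
from typing import List, Optional, Tuple

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed character set for this sketch


def index_to_candidate(index: int, length: int, alphabet: str = ALPHABET) -> str:
    """Map a number in [0, len(alphabet)**length) to one fixed-length candidate."""
    base = len(alphabet)
    chars = []
    for _ in range(length):
        index, digit = divmod(index, base)
        chars.append(alphabet[digit])
    return "".join(reversed(chars))


def search_range(target_hex: str, length: int, start: int, stop: int) -> Optional[str]:
    """Hash every candidate with index in [start, stop) and compare with the target."""
    for index in range(start, stop):
        candidate = index_to_candidate(index, length)
        if hashlib.md5(candidate.encode("ascii")).hexdigest() == target_hex:
            return candidate
    return None


def split_keyspace(total: int, workers: int) -> List[Tuple[int, int]]:
    """Statically cut [0, total) into `workers` contiguous, near-equal index ranges."""
    step, extra = divmod(total, workers)
    ranges, start = [], 0
    for w in range(workers):
        stop = start + step + (1 if w < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def brute_force(target_hex: str, length: int, workers: int = 4) -> Optional[str]:
    """Search the whole keyspace range by range (hypothetical driver)."""
    total = len(ALPHABET) ** length
    # The ranges are independent; they are visited one after another here,
    # whereas a cluster would hand each range to a separate device.
    for start, stop in split_keyspace(total, workers):
        found = search_range(target_hex, length, start, stop)
        if found is not None:
            return found
    return None
```

For example, `brute_force(hashlib.md5(b"gpu").hexdigest(), 3)` walks the 26^3 = 17 576 three-letter candidates and returns `"gpu"`. On a GPU the same index-to-candidate mapping would typically be evaluated once per work-item, with the work-item identifier playing the role of `index`.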
Another method that can benefit from GPU capabilities is the dictionary attack, a technique for defeating a cryptographic system by searching for its decryption key or password/passphrase in a list of words or combinations of these words. The main factor in the success of this type of attack is the choice of a suitable list of candidate words. However, the efficiency and reliability of the implementation of the attack may become a critical factor as well. Efficient implementations of dictionary attacks are described in the literature. In [2] the authors present a distributed CPU-based computing platform for carrying out large-scale dictionary attacks against cryptosystems compliant with the OpenPGP standard (a widely used standard for encryption and authentication of email messages). An improved version of this attack, implemented in CUDA and executed on a single GPU or on multiple GPUs, is described in [10]. The authors show a significant improvement for the GPU-enabled application: the time required to check large sets of passphrases is drastically reduced compared with the CPU-based implementation.
Researchers have also made significant attempts to implement symmetric ciphers on the GPU. It was observed that GPU-enhanced algorithms speed up calculations, especially when used to encrypt or decrypt huge data sets or to attack secured communication protocols. GPU-enabled versions of widely used symmetric ciphers, i.e., DES, 3DES, AES, RC4 and Blowfish, are described in the literature [3,4,9]. Most of them are implemented in the CUDA framework.
In many applications asymmetric ciphers (RSA, ECC, NTRU and GGH) were also found to gain a significant speed-up when ported to the GPU. For this type of algorithm the research focuses mainly on the design of parallel multi-precision arithmetic routines and Montgomery reduction algorithms, which are the basic building blocks. Several efforts indicate that GPU-enabled applications can outperform the best available CPU-based implementations.
III. A HYBRID CPU-GPU CLUSTER
A cluster is a group of cooperating computers that serves as one virtual machine. One of the biggest advantages of clusters over standalone computers is the ability to share the workload between machines. The efficiency of a given cluster depends on the speed of the processors of the individual workstations and on the efficiency of the particular network technology. In advanced computing clusters simple local networks are replaced by complex network topologies or very fast communication channels. A typical cluster is formed by homogeneous central processing units (CPUs) operating under UNIX or Linux systems. The new trend is to create GPU clusters in which each node is equipped with a graphics processing unit. However, currently only a few software tools support computing on GPU clusters. Virtual OpenCL (VCL) (www.mosix.org/txt_vcl.html) is a software framework for GPU clusters. It allows OpenCL applications to be executed on Linux clusters and provides a single system image for a cluster built from a group of GPU units. The components of VCL, its performance and applications are described in [1].
We built a cluster system, HGCC (Heterogeneous GPU-CPU Cluster), composed of multicore CPUs working together with multicore GPUs, both of different architectures. The current version of HGCC consists of 24 nodes and integrates two types of CPUs: 12 servers with two Intel Xeon processors each and 12 servers with two AMD Opteron processors each. All servers are equipped with advanced GPUs, NVIDIA Tesla and AMD FirePro units respectively. The system architecture is depicted in Fig. 1. The components of the HGCC cluster are specified below.
CPU:
• Intel Xeon X5650, 2.66GHz/3.06GHz turbo, 6 cores / 12 threads, 6x256KB L2, 12MB L3 cache.
• AMD Opteron 6172, 2.1GHz, 12 cores / 12 threads, 12x512KB L2, 12MB L3 cache.
GPU:
• NVIDIA Tesla M2050, 448 CUDA cores, 384-bit memory bus.
• AMD FirePro V7800, 1440 stream processors (equivalent of 288 CUDA cores), 256-bit memory bus.
Moreover, we developed a software framework that allows unmodified OpenCL applications to run concurrently on multiple CPU and GPU components of a cluster. The aim of this framework is to hide the heterogeneity of the cluster. From the user's perspective, the cluster serves as one superserver. The other goal is to minimize the effort required of users during the design, implementation and execution of their applications. We implemented a single system image. In this model all resources of a cluster, i.e., processors (CPU, GPU) and memory, are seen by the user as one machine. The detailed description of our software framework is given in [8].
To develop efficient and reliable cryptographic and cryptanalysis algorithms we need only simple functionality from the computing platform, i.e., calculation speed-up, robustness and ease of use. Hence, we assumed a static decomposition of the problem at calculation start-up, since dynamic load balancing is superfluous for this application. Our software framework is quite similar to the VCL platform [1]; however, in our solution it is possible to use both CPUs and GPUs to solve the considered problem.
IV. HGCC LIBRARY OF CRYPTOGRAPHIC AND CRYPTANALYSIS ALGORITHMS
A. HGCC Library: Cryptography Algorithms
Popular algorithms for encryption and decryption of large data sets were selected for implementation on our hybrid cluster system HGCC. Symmetric ciphers were considered. C and OpenCL-based versions of the following techniques are currently provided in our library:
• Block ciphers (deterministic algorithms operating on fixed-length groups of bits):
  • DES (CPU and GPU versions),
  • 3DES (CPU and GPU versions),
  • AES (CPU AES-NI and GPU versions),
  • Blowfish (CPU version only),
  • Twofish (CPU and GPU versions).
• Stream ciphers (plaintext digits are combined with a pseudorandom cipher digit stream, the keystream).
All the block ciphers mentioned are implemented using the electronic codebook (ECB) mode specified by NIST (http://csrc.nist.gov/index.html). In ECB the data are divided into blocks and each block is encrypted separately. The CTR (counter) mode was used for the stream cipher implementation. The CTR mode turns a block cipher into a synchronous stream cipher and also provides random access during decryption. It should be pointed out that both ECB and CTR modes are well suited to parallel implementation.
Figure 1. The components of the HGCC cluster.
B. HGCC Library: Cryptanalysis Algorithms
Widely used algorithms for password recovery were selected for implementation on our cluster system HGCC. Three attacks on passwords were considered: brute-force, rainbow table and dictionary test. C and OpenCL-based versions of the following techniques are currently provided in our library of cryptographic hash functions:
MD5:
• brute-force: CPU and GPU,
• dictionary test: CPU,
• rainbow table: CPU and GPU.
MD4:
• brute-force: CPU and GPU.
NT-hash:
• brute-force: GPU.
SHA-1:
• brute-force: CPU and GPU,
• dictionary test: CPU,
• rainbow table: CPU and GPU.
SHA-2 (SHA-224, SHA-256, SHA-384, SHA-512):
• brute-force: CPU and GPU.
RIPEMD (RIPEMD-128, 160, 256, 320):
• brute-force (RIPEMD-128, 160): CPU and GPU,
• brute-force (RIPEMD-256, 320): CPU.
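The remark that the CTR mode provides random access and suits parallel implementation can be made concrete with a small sketch. It is an illustration under stated assumptions, not the library's code: Python's standard library ships no block cipher, so `keystream_block` uses SHA-256 over key, nonce and counter as a stand-in for the block cipher call (a real implementation would encrypt the counter block with AES or a similar cipher). All names and the 32-byte block size are hypothetical.

```python
import hashlib

BLOCK_SIZE = 32  # bytes per keystream block in this sketch (SHA-256 digest size)


def keystream_block(key: bytes, nonce: bytes, counter: int) -> bytes:
    """Stand-in for E_key(nonce || counter); NOT a real block cipher."""
    return hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()


def ctr_process_block(key: bytes, nonce: bytes, counter: int, block: bytes) -> bytes:
    """Encrypt or decrypt one block by XOR-ing it with its own keystream block."""
    stream = keystream_block(key, nonce, counter)
    # zip() stops at the shorter input, so a short final block is handled too.
    return bytes(b ^ s for b, s in zip(block, stream))


def ctr_process(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Process all blocks; each block depends only on its own counter value."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    return b"".join(
        ctr_process_block(key, nonce, counter, block)
        for counter, block in enumerate(blocks)
    )
```

Because block number `n` needs nothing but the key, the nonce and `n`, any block can be decrypted on its own with `ctr_process_block`, and the blocks can be assigned to threads or work-items in any order. Applying `ctr_process` twice with the same key and nonce returns the original data, since encryption and decryption are the same XOR.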
V. TESTS AND PERFORMANCE EVALUATION
The results of extensive tests of both cryptographic and cryptanalysis techniques are presented and discussed. The goal of the experiments was to compare the efficiency of data encryption and decryption algorithms and password recovery techniques implemented for various hardware platforms. Several series of experiments were performed. First, the CPU-based and GPU-based multi-threaded implementations were executed on devices from different vendors. The objective was to compare the efficiency and scalability of algorithms and processors. Next, the distributed implementations executed on the cluster, in which each node was equipped with CPU and GPU units, were evaluated and compared.
A. Cryptography Algorithms
Three series of experiments were performed for symmetric cryptography. The aim of the first series was to test the efficiency of the CPU-based parallel implementations of the symmetric block ciphers (DES, 3DES, AES, Twofish) and of various modes of operation of these algorithms. The efficiency of our implementations was compared with results available on the Internet for TrueCrypt 7.0a (www.truecrypt.org), a free open-source disk encryption tool, and GnuPG 1.4.11 (www.gnupg.org/download), the GNU Privacy Guard (Fig. 2). The results presented in the figures show the amount of data encrypted or decrypted per second, in MiB/s (1 MiB/s = 1 048 576 bytes/s), for all algorithms. We can observe that the efficiency of our (NASK) implementations is similar to the results provided by the other projects. It is worth mentioning that both GnuPG and TrueCrypt are widely used products with a good reputation.
The aim of the second series of tests was to compare the efficiency of the block cipher implementations on GPUs from different vendors. We compared three types of GPU devices: AMD FirePro V7800, NVIDIA Tesla M2050 and AMD Radeon 6970. The results of the experiments are presented in Fig. 3. It can be observed that the GPU-enhanced implementation significantly speeds up calculations. In our tests the best results were obtained for the Radeon 6970 device.
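The throughput unit used in these comparisons, MiB/s with 1 MiB = 1 048 576 bytes, can be illustrated with a small measurement helper. This is a sketch only and does not reproduce the benchmarking procedure used for the figures: the function names, the best-of-several-runs timing policy and the use of a SHA-256 call as the workload in the usage note are assumptions.

```python
import hashlib
import time

MIB = 1_048_576  # bytes in one mebibyte, as defined in the text


def throughput_mib_per_s(num_bytes: int, seconds: float) -> float:
    """Convert a processed byte count and an elapsed time into MiB/s."""
    return num_bytes / MIB / seconds


def measure(process, data: bytes, repeats: int = 5) -> float:
    """Time process(data) several times and report the best MiB/s observed."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        process(data)
        best = min(best, time.perf_counter() - start)
    return throughput_mib_per_s(len(data), best)
```

For instance, `throughput_mib_per_s(2 * 1_048_576, 0.5)` evaluates to `4.0`, and `measure(lambda d: hashlib.sha256(d).digest(), bytes(8 * MIB))` reports how many MiB per second the local machine hashes in a single thread.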
Figure 2. Block ciphers performance on CPU (comparison of various single thread implementations).
Figure 3. Block ciphers performance on GPUs of different vendors.
Figure 4. Comparison of AES performance (CPU and GPU)
Fig. 4 presents the comparison of AES implementations on CPUs and GPUs of different vendors. Among the CPUs, better results were obtained for the Intel Xeon X5650 than for the AMD Opteron 6172. Next, we tested the implementation of AES utilizing the Advanced Encryption Standard New Instructions (AES-NI) extension. This is an extension to the x86 instruction set architecture for microprocessors from Intel and AMD (http://ark.intel.com/) that supports elementary AES operations. In the HGCC cluster only the Intel Xeon X5650 (Westmere) processor provides the AES-NI extension. We assessed the impact of AES-NI on AES performance. The conclusion is that AES-NI massively accelerates the AES algorithm (14848.80 MiB/s for encryption, 14841.60 MiB/s for decryption). This acceleration is much greater than that obtained with GPU accelerators, whose efficiency is similar to that of the CPU processors.
Finally, the scalability of the HGCC-based implementations of the symmetric algorithms was tested. The performance evaluation of our implementation of the AES algorithm in two subclusters, the first composed of two and four AMD nodes and the second composed of two and four Intel nodes, is presented in Table I. We can see that AES scales very well: the speed-up is close to linear.
TABLE I. SCALABILITY OF THE AES ALGORITHM. Columns: Processor, Speed up (1 node). Rows: Intel Xeon X5650, AMD Opteron 6172.
B. Cryptanalysis Algorithms
Multiple tests were performed for password recovery algorithms. The performance of CPU-based and GPU-based versions of the MD4, MD5, RIPEMD, SHA-1, SHA-2 and NT-hash hash generation techniques was evaluated.
The aim of the first series was to compare the efficiency and scalability of two types of CPU devices from different vendors. Multi-threaded implementations of MD5 and SHA-1 were executed on two nodes, each composed of two processors, from Intel and AMD respectively. Figures 5 and 6 show the number of hashes (in millions) generated per second using MD5 (Fig. 5) and SHA-1 (Fig. 6).
Figure 5. Scalability of hashes generation using MD5 (AMD & Intel).
Figure 6. Scalability of hashes generation using SHA-1 (AMD & Intel).
It can be seen that both algorithms scale very well. The speed-up for the AMD server with 24 threads is equal to 23.2 (MD5) and 23.5 (SHA-1). Similarly, the Intel Xeon scales very well up to 12 threads, the number of physical cores. For a larger number of threads scalability is limited, but in general the Intel Xeon processor with hyper-threading technology is more powerful than the AMD one.
The objective of the second series of experiments was to compare CPU-based and GPU-based implementations of hash generation techniques. Table II collects the number of hashes generated per second using the MD4, MD5, RIPEMD, SHA-1 and SHA-2 functions running on CPU and GPU devices. In general, it can be observed that GPU implementations are much more efficient than CPU ones. The speed-up of calculations was 25% to 550%, depending on the algorithm.
TABLE II. NUMBER OF GENERATED HASHES – CPU/GPU PROCESSORS. Columns: 2×Intel Xeon X5650, 2×AMD Opteron 6172, NVIDIA Tesla M2050.
The aim of the last series of tests was to compare the efficiency of four subclusters: CPU-1, composed of 24 Intel Xeon X5650 processors; CPU-2, composed of 24 AMD Opteron 6172 processors; GPU-1, composed of 12 NVIDIA Tesla M2050 devices; and GPU-2, composed of 12 AMD FirePro V7800 devices. Table III collects the number of hashes generated using the MD4, MD5, RIPEMD, SHA-1, SHA-2 and NT-hash algorithms executed on all listed clusters. Figures 7 and 8 present the results for CPU clusters and GPU clusters, respectively.
TABLE III. NUMBER OF GENERATED HASHES – CPU/GPU CLUSTERS. Columns: 24×AMD (CPU cluster), 24×Intel Xeon X5650, 12×NVIDIA Tesla M2050.
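The speed-up values reported for the AMD server on 24 threads (23.2 for MD5 and 23.5 for SHA-1) can be turned into parallel efficiency with one line of arithmetic. The sketch below also evaluates the Karp-Flatt metric, an experimentally determined serial fraction; that metric is an additional diagnostic introduced here for illustration and is not used in the paper's own evaluation. The function names are hypothetical.

```python
def parallel_efficiency(speed_up: float, workers: int) -> float:
    """Fraction of the ideal (linear) speed-up that was actually achieved."""
    return speed_up / workers


def karp_flatt(speed_up: float, workers: int) -> float:
    """Experimentally determined serial fraction: e = (1/S - 1/p) / (1 - 1/p)."""
    return (1.0 / speed_up - 1.0 / workers) / (1.0 - 1.0 / workers)


# Speed-up values quoted in the text for 24 threads on the AMD server.
for name, s in (("MD5", 23.2), ("SHA-1", 23.5)):
    print(name, round(parallel_efficiency(s, 24), 3), round(karp_flatt(s, 24), 4))
```

With the quoted values the efficiency is roughly 0.97 for MD5 and 0.98 for SHA-1, and the estimated serial fraction is about 0.0015 and 0.0009 respectively, which is consistent with the statement that both algorithms scale very well.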
As in the previous experiment, GPU implementations were much more efficient than CPU ones. The best results were obtained for the OpenCL versions of the MD4, MD5 and NT-hash functions executed on the cluster composed of AMD FirePro V7800 devices. For hash generation the AMD FirePro V7800 graphics processor proved to be more efficient than the NVIDIA Tesla M2050 in all considered cases.
VI. SUMMARY AND CONCLUSIONS
The main goal of the paper was to present the wide applicability of GPU technology to cryptography and cryptanalysis. We compared the efficiency of CPU-based clusters with clusters formed by GPU devices, both from different vendors. The numerical results confirmed that the modern unified GPU architecture can perform as an efficient cryptographic acceleration board. Moreover, we showed that hybrid computer clusters offer a new opportunity to increase the performance of parallel implementations by combining traditional CPUs and efficient GPU devices. We demonstrated the effectiveness and scalability of our HGCC platform and its applicability to cryptology.
REFERENCES
[1] A. Barak, A. Shiloh, “The MOSIX Virtual OpenCL (VCL) Cluster Platform”, Proc. Intel European Research and Innovation Conf., 2011.
[2] M. Bernaschi, M. Bisson, E. Gabrielli, S. Tacconi, “An Architecture for Distributed Dictionary Attacks to Cryptosystems”, Journal of Computer, vol. 4(3), pp. 378-386, 2009.
[3] A. Di Biagio, A. Barenghi, G. Agosta, G. Pelosi, “Design of a Parallel AES for Graphics Hardware Using the CUDA Framework”, Inter. Symp. on Parallel & Distributed Processing (IPDPS 2009), pp. 1-8, 2009.
[4] J.W. Bos, D.A. Osvik, D. Stefan, “Fast Implementations of AES on Various Platforms”, Cryptology ePrint Archive, Report 2009/501, 2009, http://eprint.iacr.org.
[5] A. Karbowski, E. Niewiadomska-Szynkiewicz, “Parallel and Distributed Computing” (in Polish), WUT Publishing House, 2009.
[6] D. M. Kunzman, L. V. Kale, “Programming heterogeneous clusters with accelerators using object-based programming”, Sci. Program., vol. 90, pp. 47-62, 2011.
[7] C. Li, H. Wu, S. Chen, X. Li, D. Guo, “Efficient Implementation for MD5-RC4 Encryption Using GPU with CUDA”, Anti-counterfeiting, Security, and Identification in Communication, 2009, pp. 167-170, 2009.
[8] M. Marks, J. Jantura, E. Niewiadomska-Szynkiewicz, P. Strzelczyk, K. Gozdz, “Heterogenous GPUGPU Cluster for High Performance Computing in Cryptography”, Computer Science, vol. 14, No 2, pp. 63-79, 2012.
[9] C. Mei, H. Jiang, J. Jenness, “CUDA-based AES Parallelization with Fine-Tuned GPU Memory Utilization”, Workshops and Phd Forum Parallel & Distributed Processing (IPDPSW), pp. 1-7, 2010.
[10] F. Milob, M. Bernaschi, M. Bisson, “A fast, GPU based, dictionary attack to OpenPGP secret keyrings”, The Journal of Systems and Software, vol. 84, pp. 2088-2096, 2011.
[11] P. Oechslin, “Making a Faster Cryptanalytic Time-Memory Trade-off”, Advances in Cryptology (CRYPTO 2003), pp. 617-630, 2003.
[12] P. Oechslin, “Password Cracking: Rainbow Tables Explained”, Constituent Contributions, vol. 14, 2005.
[13] Z. Wang, J. Graham, N. Ajam, H. Jiang, “Design and Optimization of Hybrid MD5-Blowfish Encryption on GPUs”, Proc. of International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), pp. 18-21, 2011.
[14] W. H. Wen-Mei W. (ed.), “GPU Computing Gems Emerald Edition”, Morgan Kaufmann, 2011.
Figure 7. Number of generated hashes – CPU cluster.
Figure 8. Number of generated hashes – GPU cluster.