Comparative Study of Massively Parallel Cryptanalysis and Cryptography on CPU-GPU Cluster

Ewa Niewiadomska-Szynkiewicz, Michal Marks
Institute of Control and Computation Engineering, Warsaw University of Technology
Research and Academic Computer Network (NASK), Warsaw, Poland
[email protected], [email protected]

Jarosław Jantura, Mikołaj Podbielski, Przemysław Strzelczyk
Research and Academic Computer Network (NASK), Warsaw, Poland
{Jaroslaw.Jantura, Mikolaj.Podbielski, Przemek.Strzelczyk}@nask.pl

Abstract: The paper addresses issues associated with applying mixed CPU and GPU processing to cryptography and cryptanalysis. The performance of new, efficient OpenCL-based parallel implementations of selected commonly used cryptanalysis and cryptographic algorithms executed on GPU devices is compared with implementations running on the CPU. Moreover, the paper describes the hardware architecture of a novel hybrid cluster system (HGCC) integrating two types of devices: Intel processors with NVIDIA graphics processing units and AMD processors with AMD graphics processing units, and the specialized software framework that hides the heterogeneity of the cluster and provides a single system image. The results of the presented effort show that the GPU can perform as an efficient accelerator, and that the HGCC cluster is a powerful, effective, scalable, flexible and easy-to-use platform for cryptography and cryptanalysis.

Keywords: cryptography, cryptanalysis, parallel computing, HPC, clusters, GPU computing

I. Introduction
The demand for reliable and efficient cryptographic and cryptanalysis solutions has been growing continuously over the last decade, a consequence of the use of the Internet in critical areas such as government, business and healthcare. Unfortunately, most cryptographic and cryptanalysis applications are computationally intensive and can push CPUs to their performance limits. However, many data encryption, decryption and password recovery algorithms can be easily decomposed: the calculations can be partitioned into independent parts and carried out on different cores, processors and computers. In general, these algorithms are natural candidates for massively parallel computation. Hence, many hardware-software accelerators with affordable cost and great ease of integration have been studied and proposed both in research and in industry.


Nowadays, Graphics Processing Units (GPUs) are the subject of extensive research and have been successfully applied to general-purpose computations outside the graphics domain. The GPU has evolved into a massively parallel stream processor with a flexible programming model, designed to perform hundreds of billions of floating-point operations per second. Since GPU platforms are now supported by frameworks that allow the implementation of general-purpose software, they have become a prime choice for implementing computationally demanding algorithms. GPUs have already been used in many areas of science, engineering and commerce. The MapReduce programming model for processing large data sets, as utilized by GPU applications, can successfully improve efficiency and speed up calculations, which explains why GPUs are often enlisted as cryptanalysis and cryptographic coprocessors. We built a hybrid cluster system (HGCC) that integrates two types of multicore CPUs and GPUs of different vendors. This cluster is dedicated to massively parallel computations in cryptography and cryptanalysis. We developed CPU- and GPU-based implementations of widely used cryptanalysis and cryptography techniques. Because our cluster is composed of devices of different vendors, we used OpenCL to implement the kernels. In this paper we present and discuss the performance evaluation of our implementations. The rest of this paper is structured as follows: Section II discusses background and related works; Section III describes the hardware and software of the HGCC cluster; Section IV briefly describes the implementations of cryptanalysis and cryptography techniques on HGCC; Section V presents and discusses the performance evaluation of selected cryptanalysis and cryptography algorithms. Finally, Section VI concludes the paper.

II. Background and Related Works
A. GPU Programming
Programming on graphics processing units usually refers to non-graphics-related operations performed on the GPU. This programming paradigm opens up many possibilities to increase performance by exploiting the specific processing nature of the GPU. A model of parallel calculation that uses GPUs and CPUs together for scientific and engineering computing has already been applied to many practical complex problems. There are currently two common frameworks available to program GPUs: Compute Unified Device Architecture – CUDA (https://developer.nvidia.com/cuda-toolkit) from NVIDIA, and Open Computing Language – OpenCL (http://www.khronos.org/opencl/) from the Khronos Group. The CUDA framework supports only its native GPU architecture (NVIDIA), while OpenCL is portable and can be executed on a wide range of different hardware configurations. Using the OpenCL


or CUDA frameworks, many real-life problems can be implemented easily and run significantly faster than on multiprocessor or multicore machines [5,6,8,14]. The TOP500 project (www.top500.org) aims to provide a reliable basis for tracking and detecting trends in high-performance computing. Recently, a rising significance of GPU accelerators can be observed. The latest ranking, from November 2012, shows the share of systems utilizing accelerators/co-processors growing to 13% from 7% one year earlier. There are many new installations utilizing not only popular NVIDIA chips such as the Fermi family, but also the newest Kepler family chips and Intel Xeon Phi solutions. The aggressive competition for the GPU market is driving these architectures towards increasing levels of hardware parallelism, while containing costs.

B. Cryptographic and Cryptanalysis Algorithms on GPU
A number of GPU-enhanced versions of popular cryptanalysis and cryptographic algorithms are in use in computing today. The massively parallel computations that can be performed on a GPU can be naturally exploited in cryptography and cryptanalysis. Most operations performed by data decryption, encryption and password recovery algorithms are natively supported by GPU computing units, which allows these algorithms to take full advantage of the GPU's computing power. The most common application of the GPU in cryptanalysis is the recovery of passwords from hashes. The commonly used techniques are based on MD5, SHA-1 and SHA-2. The hashes are easy to compute but extremely hard to invert. Therefore the only general technique for recovering a password from a hash is to scan all potential passwords, compute their hashes and test for coincidence. Cryptographic hash functions are based on integer and binary operations such as addition modulo a power of two, bit shifts and rotations, bitwise xor, bitwise or, bit negation and word permutation; these operations are natively supported by GPU computing units. In general, we can distinguish three common techniques for password strength validation:
• brute-force,
• rainbow table,
• dictionary attack.
Brute-force and rainbow-table attacks are very suitable for porting to the GPU. Numerous GPU implementations of a brute-force attack are available in the literature:
• ighashgpu (http://www.golubev.com/hashgpu.htm),
• whitepixel (http://whitepixel.zorinaq.com),
• BarsWF (http://3.14.by/en/md5),
• oclHashcat-plus (http://hashcat.net/oclhashcat-plus),
• Elcomsoft (http://www.elcomsoft.com).
The fastest ones are capable of checking up to several billion passwords per second (using a single GPU unit) [7,8,13]. All of them are capable of using multiple


GPU processors installed on one host system, but only one – the solution offered by Elcomsoft – allows the workload to be distributed over many hosts. It is worth mentioning that the brute-force attack guarantees success. In the rainbow-table technique [11,12], two phases of calculation can be distinguished. First, the hashes of all potential passwords are precomputed. Next, these hashes are recorded in a special compressed form called a rainbow table. The password search is thereby reduced to several million hash calculations and rainbow-table look-ups. GPU-enhanced rainbow-table implementations can be found in the literature; the most popular is RainbowCrack (http://project-rainbowcrack.com), for which an implementation with NVIDIA CUDA support is provided. The rainbow-table attack does not guarantee success, but it speeds up the search significantly. Another method which can benefit from GPU capabilities is the dictionary attack – a technique for defeating a cryptographic system by searching for its decryption key or password/passphrase in a list of words or combinations of such words. Obviously, the main factor in the success of this type of attack is the choice of a suitable word list. However, the efficiency and reliability of the implementation of the attack may become a critical factor as well. Efficient implementations of dictionary attacks are described in the literature. In [2] the authors present a distributed CPU-based computing platform that carries out large-scale dictionary attacks against cryptosystems compliant with the OpenPGP standard (a widely used standard for encryption and authentication of email messages). An improved version of this attack, implemented in CUDA and executed on a single GPU or multiple GPUs, is described in [10].
The authors show a significant improvement for the GPU-enabled application – the time required to check large sets of passphrases is drastically reduced compared to the CPU-based implementation. Significant attempts to implement symmetric ciphers on the GPU have also been made. It was observed that GPU-enhanced algorithms speed up calculations especially when used to encrypt or decrypt huge data sets or to attack secured communication protocols. GPU-enabled versions of widely used symmetric ciphers, i.e., DES, 3DES, AES, RC4 and Blowfish, are described in the literature [3,4,9]. Most of them are implemented in the CUDA framework. In many applications, asymmetric ciphers (RSA, ECC, NTRU and GGH) were also found to gain a significant speed-up when ported to the GPU. For this type of algorithm, the main research focuses on the design of parallel multiprecision arithmetic routines and Montgomery reduction algorithms, which are the basic building blocks. Several efforts indicate that GPU-enabled applications can outperform the best available CPU-based implementations.

III. A Hybrid CPU-GPU Cluster
A cluster is a group of cooperating computers that serves as one virtual machine [5]. One of the biggest advantages of clusters over standalone computers


is the ability to share the workload between machines. The efficiency of a given cluster depends on the speed of the processors of the separate workstations and on the efficiency of the particular network technology. In advanced computing clusters, simple local networks are replaced by complex network topologies or very fast communication channels. A typical cluster is usually formed by homogeneous central processing units (CPUs) operating under UNIX or Linux systems. The new trend is to create GPU clusters in which each node is equipped with a graphics processing unit. However, currently only a few software tools are provided to support computing on GPU clusters. Virtual OpenCL (VCL) (www.mosix.org/txt_vcl.html) is a software framework for GPU clusters. It allows OpenCL applications to be executed on Linux clusters and provides a single system image for a cluster built from a group of GPU units. The components of VCL, its performance and its applications are described in [1]. We built the cluster system HGCC (Heterogeneous GPU-CPU Cluster), composed of multicore CPUs working together with multicore GPUs, both of different architectures. The current version of HGCC consists of 24 nodes and integrates two types of CPUs: 12 servers with two Intel Xeon processors each and 12 servers with two AMD Opteron processors each. All servers are equipped with advanced GPUs, respectively NVIDIA Tesla and AMD FirePro units. It is worth noting that the FirePro V7800 has a single-precision peak performance of 2020 GFLOPS, almost two times greater than the NVIDIA Tesla M2050 performance (1030 GFLOPS), and a double-precision peak performance of 400 GFLOPS, which is less than the Tesla's 510 GFLOPS. Moreover, the AMD GPU is approximately four times cheaper than the NVIDIA GPU. However, effective programming of a GPU based on the Very Long Instruction Word (VLIW5) architecture – the AMD FirePro V7800 – is not a simple task, and not every application is capable of achieving performance close to the peak.
The system architecture is depicted in Fig. 1. The specification of the HGCC cluster components is presented below:
CPU:
• Intel Xeon X5650: 2.66 GHz (3.06 GHz turbo), 6 cores / 12 threads, 6 × 256 KB L2 cache, 12 MB L3 cache.
• AMD Opteron 6172: 2.1 GHz, 12 cores / 12 threads, 12 × 512 KB L2 cache, 12 MB L3 cache.
GPU:
• NVIDIA Tesla M2050: 448 CUDA cores, 384-bit memory bus.
• AMD FirePro V7800: 1440 stream processors (equivalent of 288 CUDA cores), 256-bit memory bus.
Each node is equipped with 16 GB RAM and three different interconnects: InfiniBand 4x QDR (40 Gb/s), Ethernet 10 Gb/s and 1 Gb/s, and runs the CentOS 6 operating system.


Figure 1. The components of the HGCC cluster

The most important part of building the HGCC cluster was to develop a software framework that allows unmodified OpenCL applications to run concurrently on multiple CPU and GPU components of a cluster. The cluster framework consists of several components. The most important are:
• MasterApp (the master node application) – the component responsible for user-system communication and calculation management.
• SlaveApp (the computational node application) – the component responsible for the calculations performed by the assigned server. Each computational node contains a number of resources (CPUs and GPUs).
The cluster framework can handle any computational task implemented by the user. A committed task is sent to the MasterApp component, where all parameters defining the task (the task descriptor) are parsed. Next, the task is divided into smaller subtasks, a list of which is stored by MasterApp. These subtasks are then allocated to slave nodes with free resources. Subtasks are served by appropriate plugins that utilize the available resources in the most efficient manner – there are different plugins for CPUs and for GPUs of different vendors.


The aim of this framework is to hide the heterogeneity of the cluster, so all these steps are hidden from the user. From the user's perspective, the cluster serves as one superserver. A detailed description of our software framework is given in [8]. To develop efficient and reliable cryptographic and cryptanalysis algorithms we need simple functionality from the computing platform, i.e., calculation speed-up, robustness and ease of use. Hence, we assumed a static decomposition of the problem at calculation startup, since dynamic load balancing is superfluous for this application. Our software framework is quite similar to the VCL platform [1]; however, in our solution it is possible to use both CPUs and GPUs to solve the considered problem.

IV. HGCC Library of Cryptographic and Cryptanalysis Algorithms
A. HGCC Library: Cryptography Algorithms
Popular algorithms for encryption and decryption of large data sets were selected for implementation on our hybrid cluster system HGCC. Symmetric ciphers were considered. The C and OpenCL-based versions of the following techniques are currently provided in our library:
• Block ciphers (deterministic algorithms operating on fixed-length groups of bits):
✓ DES (CPU and GPU versions),
✓ 3DES (CPU and GPU versions),
✓ AES (CPU AES-NI and GPU versions),
✓ Blowfish (CPU version only),
✓ Twofish (CPU and GPU versions).
• Stream ciphers (plaintext digits are combined with a pseudorandom cipher digit stream – the keystream).
All mentioned block ciphers are implemented in the electronic codebook (ECB) or counter (CTR) modes specified at http://csrc.nist.gov/index.html. In ECB mode the data are divided into blocks and each block is encrypted separately. In CTR mode, a block cipher is used as a keystream generator, and the keystream is bitwise exclusive-ored into the plaintext to produce the ciphertext. It should be pointed out that both the ECB and CTR modes are well suited to parallel implementation.

B. HGCC Library: Cryptanalysis Algorithms The widely used algorithms for password recovery were selected for implementation on our cluster system HGCC. Three attacks on passwords were considered: brute-force, rainbow table and dictionary attack. The C and OpenCL-based versions


of the following techniques are currently provided in our library of cryptographic hash functions:
MD5:
• brute-force: CPU and GPU,
• dictionary attack: CPU,
• rainbow table: CPU and GPU.
MD4:
• brute-force: CPU and GPU.
NT-hash:
• brute-force: GPU.
SHA-1:
• brute-force: CPU and GPU,
• dictionary attack: CPU,
• rainbow table: CPU and GPU.
SHA-2 (SHA-224, SHA-256, SHA-384, SHA-512):
• brute-force: CPU and GPU.
RIPEMD (RIPEMD-128, 160, 256, 320):
• brute-force (RIPEMD-128, 160): CPU and GPU,
• brute-force (RIPEMD-256, 320): CPU.

V. Tests and Performance Evaluation The results of extensive tests both for cryptographic and cryptanalysis techniques are presented and discussed. The goal of the experiments was to compare the efficiency of data encryption and decryption algorithms and password recovery techniques implemented for various hardware platforms. Several series of experiments were performed. First, the CPU-based and GPU-based multi-threaded implementations were executed on devices of different vendors. The objective was to compare the efficiency and scalability of algorithms and processors. Next, the distributed implementations executed on the cluster in which each node was equipped with CPU and GPU units were evaluated and compared.

A. Cryptography Algorithms
Three series of experiments were performed for symmetric cryptography. The aim of the first series was to test the efficiency of the CPU-based implementations of the symmetric block ciphers (DES, 3DES, AES, Twofish) and of the various modes of operation of these algorithms. The efficiency of our implementations was compared with results from the Internet: TrueCrypt 7.0a (www.truecrypt.org) – a free open-source disk encryption software, and GnuPG 1.4.11 (www.gnupg.org/download) – GNU Privacy Guard (Fig. 2). The results presented in the figures show the amounts of data in MiB/s (1 MiB/s = 1 048 576 bytes/s) that were encrypted/


decrypted per second for all algorithms. We can observe that the efficiency of our (NASK) implementations is similar to the results achieved by the other projects. It is worth mentioning that both GnuPG and TrueCrypt are widely used products with a good reputation.

Figure 2. Block ciphers performance on CPU (comparison of various single thread implementations)

The aim of the second series of tests was to compare the efficiency of the block cipher implementations on GPUs of different vendors. We compared three types of GPU devices: AMD FirePro V7800, NVIDIA Tesla M2050 and AMD Radeon 6970. The results of the experiments are presented in Fig. 3. It can be easily observed that the GPU-enhanced implementations significantly speed up calculations. In our tests the best results were obtained for the Radeon 6970 device.

Figure 3. Block ciphers performance on GPUs of different vendors

Fig. 4 presents the comparison of AES implementations on CPUs and GPUs of different vendors. For CPUs, better results were obtained for the Intel Xeon X5650 than


the AMD Opteron 6172. Next, we tested an implementation of AES utilizing the Advanced Encryption Standard New Instructions (AES-NI) extension. This instruction set is an extension of the x86 instruction set architecture for microprocessors from Intel and AMD (http://ark.intel.com/) that supports elementary AES operations. In the HGCC cluster, only the Intel Xeon X5650 (Westmere) processor provides the AES-NI extension. We assessed the impact of AES-NI on AES performance. The conclusion is that AES-NI massively accelerates the AES algorithm (14848.80 MiB/s for encryption, 14841.60 MiB/s for decryption). This acceleration is much greater than that obtained with the GPU accelerators, whose efficiency is similar to that of the plain CPU implementations.

Figure 4. Comparison of AES performance (CPU and GPU)

Finally, the scalability of the HGCC-based implementations of the symmetric algorithms was tested. Table 1 presents the performance evaluation of our implementation of the AES algorithm on two subclusters: the first composed of two and four AMD nodes, the second composed of two and four Intel nodes. We can see that AES scales very well; the speed-up is close to linear.

Table 1. Scalability of the AES algorithm (speed-up)

Processor          | 1 node | 2 nodes | 4 nodes
Intel Xeon X5650   |   1    |  1.95   |  3.71
AMD Opteron 6172   |   1    |  1.93   |  3.73

B. Cryptanalysis Algorithms
Multiple tests were performed for the password recovery algorithms. The performance of the CPU-based and GPU-based versions of the MD4, MD5, RIPEMD, SHA-1, SHA-2 and NT-hash techniques for hash generation was evaluated.


The aim of the first series was to compare the efficiency and scalability of two types of CPU devices from different vendors. The multi-threaded implementations of MD5 and SHA-1 were executed on two nodes, each composed of two processors, from Intel and AMD respectively. Figures 5 and 6 show the number of hashes (in millions) generated per second using MD5 (Fig. 5) and SHA-1 (Fig. 6). It can be seen that both algorithms scale very well. The speed-up for the AMD server with 24 threads equals 23.2 (MD5) and 23.5 (SHA-1). Likewise, the Intel Xeon scales very well up to 12 threads – the number of physical

Figure 5. Scalability of hashes generation using MD5 (AMD & Intel)

Figure 6. Scalability of hashes generation using SHA-1 (AMD & Intel)


cores. For larger numbers of threads the scalability is limited, but in general the Intel Xeon processor with hyper-threading technology is more powerful than the AMD one. The objective of the second series of experiments was to compare the CPU-based and GPU-based implementations of the hash generation techniques. Table 2 collects the numbers of hashes generated per second (in millions) using the MD4, MD5, RIPEMD, SHA-1 and SHA-2 functions running on CPU and GPU devices. In general, it can be observed that the GPU implementations are much more efficient than the CPU ones. The speed-up of the calculations was 25% to 550%, depending on the algorithm.

Table 2. Number of generated hashes per second (millions) – CPU/GPU

Algorithm    | 2 × Intel Xeon X5650 | NVIDIA Tesla M2050 | 2 × AMD Opteron 6172 | FirePro V7800
MD4          | 619.64 | 1156.69 | 672.79 | 1977.01
MD5          | 407.54 |  830.99 | 465.44 | 1540.09
RIPEMD-128   | 186.43 |  510.58 | 115.83 |  849.93
RIPEMD-160   | 126.9  |  304.11 |  80.41 |  477.43
SHA-1        | 147.75 |  406.27 | 117.48 |  949.77
SHA-2 (224)  |  66.82 |  178.18 |  84.97 |  380.99
SHA-2 (256)  |  71.28 |  177.47 |  82.51 |  380.25
SHA-2 (384)  |  25.72 |   45.86 |  35.67 |   75.69
SHA-2 (512)  |  25.71 |   44.64 |  35.65 |   75.53

As the results obtained in the second series of experiments were very promising, we decided to compare them with other cryptanalysis tools available on the market, namely cudaHashcat (version 0.14). We cannot provide a similar comparison with oclHashcat, as oclHashcat (versions 0.13 and 0.14) requires drivers for AMD cards that are unavailable for GPUs from the FirePro family. The results of the comparison can be found in Table 3.

Table 3. Number of generated hashes per second (millions) – single GPU

Algorithm    | NASK project | cudaHashcat ver. 0.14
MD4          | 1156.69 | 1620.50
MD5          |  830.99 | 1129.90
SHA-1        |  406.27 |  469.40
SHA-2 (256)  |  177.47 |  211.90
NT-hash      | 1203.10 | 1465.40

The obtained results show that the cudaHashcat algorithms are more efficient than our solution, but the differences are not big. The HGCC software achieves an efficiency


between 75 and 90% of that of cudaHashcat. However, cudaHashcat cannot be applied in parallel on a computer cluster – it is restricted to a single node (with multiple GPUs, but a single node). This means that two nodes of our cluster already provide better efficiency than the competitive solution, as the speed-up from utilizing many nodes is almost linear in the HGCC cluster. The aim of the last series of tests was to compare the efficiency of four subclusters: CPU-1 – composed of 24 Intel Xeon X5650 processors, CPU-2 – composed of 24 AMD Opteron 6172 processors, GPU-1 – composed of 12 NVIDIA Tesla M2050 devices, and GPU-2 – composed of 12 FirePro V7800 devices. Table 4 collects the numbers of hashes generated per second using the MD4, MD5, RIPEMD, SHA-1, SHA-2 and NT-hash algorithms executed on all the listed clusters. Figures 7 and 8 present the results for the CPU clusters and the GPU clusters, respectively.

Table 4. Number of generated hashes per second (millions) – CPU/GPU clusters

Algorithm    | 24 × Intel Xeon X5650 | 24 × AMD Opteron 6172 | 12 × NVIDIA Tesla M2050 | 12 × FirePro V7800
MD4          | 7253 | 7788 | 14646 | 21655
MD5          | 4792 | 5518 | 10024 | 16599
RIPEMD-128   | 2131 | 1293 |  6320 | 17409
RIPEMD-160   | 1437 |  819 |  3597 |  4688
RIPEMD-256   | 2116 | 1248 |   –   |   –
RIPEMD-320   | 1408 |  811 |   –   |   –
SHA-1        | 1750 | 1340 |  4952 | 11206
SHA-2 (224)  |  791 |  945 |  2094 |  4571
SHA-2 (256)  |  793 |  940 |  2094 |  4548
SHA-2 (384)  |  297 |  417 |   532 |   859
SHA-2 (512)  |  298 |  410 |   529 |   870
NT-hash      |   –  |   –  | 14626 | 18568

As in the previous experiment, the GPU implementations were much more efficient than the CPU ones. The best results were obtained for the OpenCL versions of the MD4, MD5 and NT-hash functions executed on the cluster composed of AMD FirePro V7800 devices. For hash generation, the AMD FirePro V7800 graphics processor proved more efficient than the NVIDIA Tesla M2050 in all considered cases.


Figure 7. Number of generated hashes – CPU cluster

Figure 8. Number of generated hashes – GPU cluster

VI. Summary and Conclusions
The main goal of the paper was to present the wide applicability of GPU technology to cryptography and cryptanalysis. We compared the efficiency of CPU-based clusters with clusters formed by GPU devices, both from different vendors. The numerical results confirmed that the modern unified GPU architecture can perform as an efficient cryptographic acceleration board. Moreover, we showed that hybrid computer clusters offer a new opportunity to increase the performance of parallel implementations by combining traditional CPUs with efficient GPU devices. The obtained results show that HGCC is an efficient solution for many cryptography and cryptanalysis problems, comparable with competitive single-node solutions, and at the same time it offers the possibility of parallel execution on a computer cluster with almost no impact of communication and task splitting on the overall efficiency: the speed-up is almost linear.


REFERENCES

[1] A. Barak, A. Shiloh, "The MOSIX Virtual OpenCL (VCL) Cluster Platform", Proc. Intel European Research and Innovation Conf., 2011.
[2] M. Bernaschi, M. Bisson, E. Gabrielli, S. Tacconi, "An Architecture for Distributed Dictionary Attacks to Cryptosystems", Journal of Computers, vol. 4(3), pp. 378–386, 2009.
[3] A. Di Biagio, A. Barenghi, G. Agosta, G. Pelosi, "Design of a Parallel AES for Graphics Hardware Using the CUDA Framework", Inter. Symp. on Parallel & Distributed Processing (IPDPS 2009), pp. 1–8, 2009.
[4] J.W. Bos, D.A. Osvik, D. Stefan, "Fast Implementations of AES on Various Platforms", Cryptology ePrint Archive, Report 2009/501, 2009, http://eprint.iacr.org.
[5] A. Karbowski, E. Niewiadomska-Szynkiewicz, "Parallel and Distributed Computing" (in Polish), WUT Publishing House, 2009.
[6] D.M. Kunzman, L.V. Kale, "Programming heterogeneous clusters with accelerators using object-based programming", Sci. Program., vol. 19, pp. 47–62, 2011.
[7] C. Li, H. Wu, S. Chen, X. Li, D. Guo, "Efficient Implementation for MD5-RC4 Encryption Using GPU with CUDA", Anti-counterfeiting, Security, and Identification in Communication, pp. 167–170, 2009.
[8] M. Marks, J. Jantura, E. Niewiadomska-Szynkiewicz, P. Strzelczyk, K. Gozdz, "Heterogeneous GPU-CPU Cluster for High Performance Computing in Cryptography", Computer Science, vol. 14, no. 2, pp. 63–79, 2012.
[9] C. Mei, H. Jiang, J. Jenness, "CUDA-based AES Parallelization with Fine-Tuned GPU Memory Utilization", Workshops and PhD Forum, Parallel & Distributed Processing (IPDPSW), pp. 1–7, 2010.
[10] F. Milob, M. Bernaschi, M. Bisson, "A fast, GPU based, dictionary attack to OpenPGP secret keyrings", The Journal of Systems and Software, vol. 84, pp. 2088–2096, 2011.
[11] P. Oechslin, "Making a Faster Cryptanalytic Time-Memory Trade-off", Advances in Cryptology (CRYPTO 2003), pp. 617–630, 2003.
[12] P. Oechslin, "Password Cracking: Rainbow Tables Explained", Constituent Contributions, vol. 14, 2005.
[13] Z. Wang, J. Graham, N. Ajam, H. Jiang, "Design and Optimization of Hybrid MD5-Blowfish Encryption on GPUs", Proc. of International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), pp. 18–21, 2011.
[14] W.-M. W. Hwu (ed.), "GPU Computing Gems Emerald Edition", Morgan Kaufmann, 2011.
