2010 14th Panhellenic Conference on Informatics

Parallel Collection of Live Data Using Hadoop

Kyriacos Talattinis, Aikaterini Sidiropoulou, Konstantinos Chalkias, and George Stephanides
Department of Applied Informatics, University of Macedonia, Thessaloniki, Greece
[email protected], [email protected], [email protected], [email protected]

Abstract—Hadoop is a fault-tolerant Java framework that supports data distribution and process parallelization using commodity hardware. Based on the scalability it provides and the independence of task execution, we combined Hadoop with crawling techniques to implement various applications that deal with large amounts of data. Our experiments show that Hadoop is a very useful and trustworthy tool for creating distributed programs that perform better in terms of computational efficiency.

Keywords—parallelization, distributed systems, Hadoop, commodity hardware, live data collection

INTRODUCTION

For a long time, researchers have been trying to make high-level applications easy to build, freeing developers from complex low-level implementations. The advent of cheap storage capacity and the availability of low-cost resources for processing large amounts of data have renewed the need for large-scale data processing. A promising direction that takes advantage of these new technologies is the emerging map/reduce architecture and its open-source implementation, called 'Hadoop'.

Hadoop [1] is a software framework written in Java. It was originally created by Doug Cutting to support distribution for the Nutch search engine project, and was later developed at Yahoo!. It was designed for running applications over very large data sets (>>1 TB) on large clusters built from commodity hardware. Hadoop is a distributed system that uses sort/merge processing techniques. The scalability it provides enables reliable storage and processing of petabytes of data. Both data and computation are spread over clusters consisting of up to thousands of commodity nodes, so data distribution and process parallelization result in faster execution times. Reliability is achieved by automatically keeping multiple copies of the data and automatically reassigning work to new nodes in case of failure.

Hadoop is already being used by a surprising array of companies, including Yahoo!, Google, Facebook and Twitter, helping them to distribute their computing jobs effectively among servers in their own data centers or on public computing services operated by other companies such as Amazon. Hadoop makes it significantly easier to access and analyze large volumes of data, thanks to its simple programming model. It hides the 'messy' details of parallelization, allowing even inexperienced programmers to easily utilize the resources of a large distributed system. Although Hadoop is written in Java, Hadoop Streaming allows jobs to be written in any programming language. Because task execution is independent, Hadoop is extremely fault tolerant [6] and appropriate for combination with crawling techniques, in contrast to traditional data-distribution methods.

The aim of our research is to use Hadoop effectively for collecting live data. We explain how combining Hadoop with crawling techniques can maximize the efficiency of data processing. We chose Hadoop because it is freely available, it is designed for cheap hardware, it enables online, user-friendly addition and deletion of computing resources, and because of the power and convenience it provides to the programmer. To support our claims we present results from three different projects, all based on the collection of live data. First we present a Hadoop implementation for the Domain Appraisal Tool (DAT) [7]. Then we experiment with the OpenBet project [11], and finally we show how Hadoop can be used for brute-force password recovery and cryptographic purposes.

I. BACKGROUND

In this section some of the key points of our research and of the Hadoop technology are presented.

A. MapReduce

As mentioned above, Hadoop is an open-source implementation of the MapReduce [8] programming model. A MapReduce job is usually completed in three steps: map, copy and reduce. The MapReduce [4] library in the user's program splits the input data into M pieces of 64 MB (by default) and then starts copies of the user program on a cluster of machines. One of these copies is called the master and the remaining ones workers. The master is responsible for assigning the M map tasks and the R reduce tasks to the workers; we denote them as map and reduce workers, respectively. A map worker reads the input split assigned to it and produces intermediate key/value pairs, which are buffered in memory. Periodically, the buffered pairs are written to the local disk, where a partitioning function divides them into R segments. The locations of these segments are passed to the master and from there to the R reduce workers. Every reduce worker iterates over the sorted list of intermediate key/value pairs and, for each intermediate key, passes the key and the associated intermediate values to the reduce function. The output of the reduce function is appended to an output file. When the above process is completed, the master notifies the user program. From the description above, we note that the master must make O(M + R) scheduling decisions and keep O(M * R) pieces of state in memory.
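For reference, the typing of the two user-supplied functions, as defined in [8], can be summarized as follows (k1/v1 are the input key/value types, k2/v2 the intermediate and output ones):

map:    (k1, v1)       -> list(k2, v2)
reduce: (k2, list(v2)) -> list(v2)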

In a system where hundreds or thousands of machines run in parallel, failures are bound to occur. The MapReduce library [5],[9] provides mechanisms that handle these failures. Finally, the MapReduce model [3] is user-friendly, since it hides the details of parallelization. The only thing left for the user is to configure the parameters of the map and reduce functions appropriately and let the MapReduce library accomplish the rest of the job. Figure 1 provides an overview of the MapReduce execution.

Fig. 1. Overview of the MapReduce Execution

Hadoop implements MapReduce using the Hadoop Distributed File System (HDFS) [10]. It creates multiple replicas of data blocks for reliability and stores them on computing nodes around the cluster.

B. HDFS

HDFS [2] is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but also a number of differences. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that handle large data sets. HDFS has a master-slave architecture: each cluster consists of a NameNode and a set of DataNodes. The NameNode is the master server. It manages the file system namespace, which includes performing operations on files and folders, and assigns data blocks to DataNodes. Additionally, the NameNode administers the cluster configuration and uses mechanisms to create copies of blocks. The NameNode plays the role of coordinator and repository for all HDFS metadata. DataNodes manage the storage of the nodes on which they run. They also serve clients' requests for reading and writing, and perform operations on blocks at the NameNode's request. It is worth highlighting that both NameNode and DataNode operations run successfully on commodity hardware. The HDFS mechanism is represented schematically in Figure 2.

Fig. 2. Overview of the HDFS Architecture
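To make the interaction with HDFS concrete, the following minimal Java sketch (not part of our implementation; the file paths are illustrative) stages a local input file into HDFS through the org.apache.hadoop.fs.FileSystem API before a job is submitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StageInput {
    public static void main(String[] args) throws Exception {
        // Picks up fs.default.name and the rest of the cluster settings
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path local = new Path("match-ids.txt");                       // one id per line
        Path remote = new Path("/user/openbet/input/match-ids.txt");  // illustrative HDFS path
        fs.mkdirs(remote.getParent());          // create the input directory if it does not exist
        fs.copyFromLocalFile(local, remote);    // NameNode assigns blocks, DataNodes store replicas
        System.out.println("Replication factor: " + fs.getFileStatus(remote).getReplication());
    }
}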

C. Example Applications

This subsection provides information about three selected applications in which Hadoop can be applied, leading to increased performance.

C.1 Domain Appraisal Tool (DAT)

DAT is a fully extensible and scalable tool that estimates the price of domain names [7]. The term domain name refers to the name of a website together with its extension; for simplicity, we often refer to a domain name simply as a 'domain'. The rapid development of e-business all over the world has transformed the domain name market into a competitive area, and in some cases it has become even more profitable than real-estate markets. The aim of DAT is to conduct a more objective evaluation of domains ending in .com, .org and .net. It combines statistical, econometric and artificial-intelligence methods, resulting in a powerful automated price-prediction model. Also, as far as we know, DAT is the first tool to estimate the domestic extensions .gr and .cy. DAT is available at costmydomain.com and provides an open-source platform for assessing the value of domain names.



C.2 OpenBet

The OpenBet project [11] aspires to become an integrated digital environment for recording, analyzing and presenting sport-related data. Although the ultimate goal of OpenBet is to predict the outcome of sports games, it also aims at gathering as much data as possible in order to develop open-source software. Among other things, OpenBet expands in many complex directions, including statistical analysis of past sports data, visualization in appropriate formats, use of artificial intelligence for prediction, information retrieval through web agents, and risk analysis. In this article, we refer to techniques for collecting live sports data in a scalable, highly optimized way, using Hadoop.

C.3 Brute Force Cryptanalysis

In the light of new technological advances, and in order to preserve high-level security, the strength of specific ciphers, passwords and one-way hash functions must be periodically re-evaluated. There are a number of articles in the literature discussing password cracking and brute-force cryptanalysis. Recovery of hashed passwords and cryptanalysis of specific cryptosystems are problems ideally suited to parallel computing, due to the 'parallel' nature of a brute-force attack. It is highly possible that Hadoop's user-friendliness and low-cost requirements may transform it into a powerful tool for password recovery and/or cracking.



II. IMPLEMENTATION

A. Description

This section describes the actual use of Hadoop for collecting live data, with the help of crawling programs, in each of the aforementioned applications.

A.1 DAT

DAT sits between the Internet, an internal database and the user in order to estimate a domain name's selling price. The main goal of this application is to update an existing database that contains characteristic values of certain domain-name transactions. The database holds information on selling prices collected from publicly available tenders, as well as values from domain-name databases coming from closed auctions. Currently, this database contains 55,000 transaction prices recorded between 1999 and 2010. To update the database, we must be able to locate domains that were sold in specific time intervals. For each of these domains, we need the name, the sales price, the broker of the sale, the date of the sale, the number of Google results, the global Alexa rank, and finally the importance of the domain name according to Google's PageRank. It is worth noting that collecting and updating these data is not an easy task, since selling prices, although published, are not available in a structured digital form. Thus, we had to create and program web crawlers to collect transaction prices from different websites (e.g. namebio.com). To accomplish this, we used shell scripts around a particular web crawler that automatically retrieves information from a website. Then, we created a number of parsers to process the data and store them in a common database. Every request should be executed as fast as possible, which makes the use of Hadoop for handling these large volumes of data very promising.
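As a rough illustration of the kind of fetch-and-parse step such a parser performs, the sketch below downloads a page and extracts a selling price with a regular expression. It is not the actual DAT code: the URL and the markup pattern are hypothetical placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SalePriceParser {
    // Hypothetical markup such as: <td class="price">$12,500</td>
    private static final Pattern PRICE = Pattern.compile("class=\"price\">\\$([0-9,]+)<");

    public static String fetchPrice(String domain) throws Exception {
        URL page = new URL("http://sales.example.com/history/" + domain);   // placeholder URL
        BufferedReader in = new BufferedReader(new InputStreamReader(page.openStream()));
        StringBuilder html = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            html.append(line);
        }
        in.close();
        Matcher m = PRICE.matcher(html);
        return m.find() ? m.group(1) : null;   // the selling price, or null if not listed
    }
}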

A.2 OpenBet

OpenBet deals with collecting betting-related data from the Internet. The website betfair.com [13] is the main source of data for this project, because it publishes both the history of betting transactions and live betting data from across the globe. The amount of money matched in these transactions sometimes reaches millions of dollars per minute. OpenBet also deals with other forms of data, such as the weather during the games, the course of the league, and the performance of each team in a game. This kind of information is not provided by betfair.com, so it is obtained from other Internet sources. It becomes obvious that one of the basic needs of OpenBet is the fast and direct collection of live data. The acquired data is used as input to various prediction methods, such as neural networks, decision trees, genetic algorithms and econometric models. The difficulty we had to deal with was the large amount of money in betting exchange games evolving simultaneously. Parallel crawling of the data using Hadoop proved to be essential; after adopting Hadoop, both the accuracy and the performance of OpenBet's main prediction model improved noticeably.

A.3 Password Recovery and Brute Force Cryptanalysis

Password recovery has proved to be an interesting and very attractive task for companies and individuals alike, and various programs have been built to find or crack encrypted passwords. Although a large number of real-life passwords fall into a small password space and are therefore predictable through dictionary attacks, more complex passwords are difficult to recover, assuming that the hash algorithm used is cryptographically strong. In cases where a dictionary attack cannot be applied, going through all possible x-digit plaintexts with a single CPU can require thousands of days. However, since a brute-force attack is easily parallelized, we used Hadoop to speed up the recovery process; the execution time then decreases roughly in proportion to the number of CPUs involved (a minimal sketch of such a map task is given below). Similarly, Hadoop can be applied for cryptographic purposes. Several crypto contests are available on the Internet; for instance, Certicom [14] has published a challenge for breaking particular Elliptic Curve Cryptography keys. Using a distributed system, the required amount of decryption time can be decreased dramatically. Another typical example is the search for the next largest Mersenne prime number. We are currently building such systems to achieve faster computations.
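The following is a minimal sketch of such a map task, under the assumption that each input line is a candidate password and that the target digest is an MD5 hash passed through the job configuration under the hypothetical key target.md5; it is illustrative, not our actual implementation.

import java.io.IOException;
import java.security.MessageDigest;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CrackMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    private String targetDigest;   // hex-encoded MD5 digest we are trying to invert

    public void configure(JobConf job) {
        targetDigest = job.get("target.md5");   // hypothetical configuration key
    }

    public void map(LongWritable key, Text candidate,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(candidate.toString().getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            // Emit only the candidates whose hash matches the target
            if (hex.toString().equalsIgnoreCase(targetDigest)) {
                output.collect(new Text("FOUND"), candidate);
            }
        } catch (Exception e) {
            throw new IOException(e.toString());
        }
    }
}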

B. Computing Environment

For testing purposes, we created a cluster of four commodity dual-core PCs. Our testing environment is as follows:
1. The cluster provides 8 processing cores in total (4 dual-core CPUs).
2. Each machine runs Ubuntu 9.10 Linux as its operating system and Hadoop 0.20.1.
3. The machines are typical dual-core x86 systems at 1.8-2 GHz, with 1-2 GB of memory each.
4. Each machine has a 2 GB swap file and a 1 Gb Ethernet network connection.

The data sets used for the empirical study were taken from [12] and [13] for DAT and OpenBet, respectively.

C. Code Sample

We present code samples and explain the key points of both the map and the reduce function.

C.1 Map Function

Below we provide the sample code of the map function as we configured it for the OpenBet project. The map function tries to locate match games and collect the data associated with them. The id of each match is stored as input in HDFS, and the data we are looking for is written to an HDFS output path.


// Requires the classic org.apache.hadoop.mapred API (MapReduceBase, Mapper,
// OutputCollector, Reporter), org.apache.hadoop.io.{LongWritable, Text},
// java.io.IOException and java.util.StringTokenizer.
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private Text matchgame = new Text();   // key: identifies the match/bet
    private Text gamedata = new Text();    // value: the collected live data

    public void map(LongWritable key, Text matches,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
        // Each input line may contain several match ids
        StringTokenizer itr = new StringTokenizer(matches.toString());
        while (itr.hasMoreTokens()) {
            String gameid = itr.nextToken();
            // Match is OpenBet's own helper class that crawls the live data for a game
            Match football = new Match();
            matchgame.set(football.getResults());
            gamedata.set(gameid);
            output.collect(matchgame, gamedata);
        }
    }
}

As we can discern, the Map class includes two fields, matchgame and gamedata. The matchgame field is the key that holds the id of a bet, while the gamedata field is the value in which the live data is stored. The input of the map function can be described by the type (gameid, matches). The mappers, i.e. the workers that have been assigned a map task, read the matches from the HDFS input path and undertake the task of locating the live data associated with each match. They then produce the intermediate key/value pairs as the output of the map task: list(matchgame, gamedata), where matchgame is the key and gamedata is the collected live data.

C.2 Reduce Function

The reduce function accepts the intermediate key (matchgame) and a set of values (gamedata) associated with this key, and combines these values into a possibly smaller set of values. We present the sample code of the reduce function below:

// Same imports as the Map class, plus java.util.Iterator and org.apache.hadoop.mapred.Reducer.
public static class Reduce extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output,
                       Reporter reporter) throws IOException {
        // Forward every collected value under its key; no further aggregation is needed here
        while (values.hasNext()) {
            output.collect(key, values.next());
        }
    }
}

According to the above code, the input of the reduce function is (matchgame, list(gamedata)), and its output can be described by the type list(gamedata). The iterator passes the intermediate values to the reduce function, so we can easily handle lists of values that are too large to fit in memory.

Furthermore, we observe that, while the intermediate keys and values are from the same domain as the output keys and values, the inputs are drawn from a different domain than the outputs. This is a good example of how Hadoop's map/reduce model hides the 'messy' details of parallelization from the user: the sole requirement is to define a few parameters of the map and reduce functions and let Hadoop manage the internal details.
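For completeness, these parameters are set in a small driver program. The sketch below is consistent with the Map and Reduce classes presented earlier, but it is not reproduced from our source code: it assumes Map and Reduce are declared as static nested classes of the driver (as in the standard WordCount example), and the job name, command-line paths and task counts are illustrative.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class OpenBetJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(OpenBetJob.class);
        conf.setJobName("openbet-live-data");            // illustrative job name

        conf.setOutputKeyClass(Text.class);              // matchgame
        conf.setOutputValueClass(Text.class);            // gamedata

        conf.setMapperClass(Map.class);                  // the Map class shown above
        conf.setReducerClass(Reduce.class);              // the Reduce class shown above

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // HDFS path with the match ids
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // HDFS path for the collected data

        conf.setNumMapTasks(16);      // a hint only; actual splits depend on the input
        conf.setNumReduceTasks(1);

        JobClient.runJob(conf);
    }
}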

III. EXPERIMENTS AND RESULTS

To start our Hadoop cluster, we had to choose among three supported modes: Local (Standalone) Mode, Pseudo-Distributed Mode, and Fully-Distributed Mode. By default, Hadoop is configured to run in non-distributed mode, as a single Java process, which is useful for debugging. Hadoop can also run on a single node in pseudo-distributed mode, where each Hadoop daemon runs as a separate Java process. Finally, we can set up fully distributed, non-trivial clusters. To implement our applications, we chose the Fully-Distributed Mode, in order to observe the time improvements brought by parallel processing on a cluster of computers. It is worth noting that the following tables and diagrams present average values obtained after running each application three times, to get results that are as objective as possible.

A. DAT

The first diagrams and tables refer to our experimental use of Hadoop for collecting live data for the DAT project. Diagram 1 illustrates the problem of collecting live data for 100 domains using DAT.

Diag 1. Running DAT using different numbers of mappers

From Diagram 1 it is clear that the closer the number of mappers gets to the actual number of processor cores, the smaller the additional time improvement we can achieve.

Mappers    Total Time (sec)    SpeedUp
1          90.83               -
2          76.77               1.183
4          73.60               1.234
8          67.65               1.343
16         61.50               1.477
32         69.86               1.300

Table 1. DAT times and speedup


As we can observe in Table 1, while the number of collected domains remains constant, the average total time initially decreases quickly, then more slowly, and after 16 mappers there is no further improvement. The efficiency of parallel processing falls sharply as the number of mappers increases: since the fraction of the program that can be parallelized is slightly under 100%, efficiency drops as more workers are added. Similarly, Diagram 2 shows that beyond a certain point the speed-up itself declines.

Diag 2. Speed Up of DAT

When we increase the number of mappers, the speed-up increases, but after reaching 16 mappers it begins to decline. Speed-up is calculated as time in serial / time in parallel. From the associated diagrams and tables we conclude that, by increasing the number of mappers, we achieve better times in collecting live data; Hadoop contributed to better times until a certain point (16 mappers) was reached. Beyond that, the additional communication time caused by the extra mappers outweighs the gains.
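As a concrete check against Table 1: with 16 mappers the speed-up is 90.83 / 61.5 ≈ 1.48, whereas with 32 mappers it drops back to 90.83 / 69.86 ≈ 1.30.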

B. OpenBet

The next experiment concerns the OpenBet project, in which we collect live data associated with betting transactions from the Internet. The nature of this experiment differs from the previous one: here we keep a constant number of 16 mappers and increase the number of transactions (bets) that we are looking for. As shown in Diagram 3, significant improvements are obtained. Table 2 shows the average times when running the bet functions. Running OpenBet for 20 bets results in an average time of 75.79 seconds. If the relationship between time and the number of running functions were linear, then for 100 bets the completion time would be about 379 seconds. Nevertheless, after running our project with Hadoop for 100 bets, the achieved average time was 292.4 seconds, which is a significant improvement. This is a good example of the parallelization efficiency provided by Hadoop.

Diag 3. Running OpenBet using 16 mappers

Bets    Total Time (sec)
20      75.79
40      121.85
60      178.26
80      268.34
100     292.40

Table 2. OpenBet's computation times

IV. CONCLUSIONS AND FUTURE WORK

In this paper we studied the applicability of Hadoop and proposed three applications (the Domain Appraisal Tool, OpenBet and brute-force cryptanalysis). We also presented experiments on the proposed implementations. As we observed, Hadoop is really efficient when running in fully distributed mode; however, in order to achieve optimal results and take advantage of Hadoop's scalability, it is necessary to use a large cluster of computers. We have several promising avenues for future work. We plan to run experiments on larger numbers of nodes, which will allow us to examine how scalability is affected by mixed workloads. We also want to concentrate on adding and deleting nodes on the fly, without rebooting the system. More specifically, we are currently developing an application that will enlist many other computers in experiments of this type, so that we can study the behaviour of Hadoop with more nodes. Finally, we are working on a project that uses Hadoop for cryptographic purposes, mainly for computing and retrieving cryptographic keys.


REFERENCES

[1] A. Pathak and H. Pucha, "Towards Optimizing Hadoop Provisioning in the Cloud", First Workshop on Hot Topics in Cloud Computing, June 2009.
[2] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System", 19th ACM Symposium on Operating Systems Principles, October 2003.
[3] S. Papadimitriou and J. Sun, "Disco: Distributed co-clustering with MapReduce", ICDM, 2008.
[4] C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun, "Map-Reduce for machine learning on multicore", NIPS 19, 2006.
[5] H.-C. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker, "Map-Reduce-Merge: Simplified relational data processing on large clusters", SIGMOD, 2007.
[6] L. A. Barroso, J. Dean, and U. Holzle, "Web search for a planet: The Google cluster architecture", IEEE Micro, 23(2):22-28, April 2003.
[7] K. Talattinis, K. Chalkias, and G. Stephanides, "Combined Methods for Domain Appraisals", 3rd Panhellenic Student Conference on Informatics (Eureka), September 2009.
[8] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters", OSDI '04, 2004.
[9] Amazon Web Services, Amazon Elastic MapReduce [Online]. Available: http://aws.amazon.com/elasticmapreduce [Accessed: April 13, 2010].
[10] D. Borthakur, "HDFS Architecture" [Online]. Available: http://hadoop.apache.org/common/docs/r0.20.2/hdfs_design.pdf [Accessed: April 13, 2010].
[11] P. Basdaras, K. Chalkias, A. Chatzigeorgiou, I. Deligiannis, P. Tsakiri, and N. Tsantalis, "Lessons Learned from an Open-Source University Project", Transactions on Advances in Engineering Education, Issue 5, Volume 3, pp. 317-320, 2006.
[12] NameBio database of domain sales [Online]. Available: http://www.namebio.com [Accessed: April 13, 2010].
[13] Betfair Internet betting site [Online]. Available: http://www.betfair.com [Accessed: April 14, 2010].
[14] Certicom ECC Challenge, Certicom Corp. [Online]. Available: http://download.certicom.com/pdfs/cert_ecc_challenge.pdf [Accessed: April 15, 2010].
