Hadoop and Big Data

Keijo Heljanko
Department of Computer Science and Engineering and
Helsinki Institute for Information Technology HIIT
School of Science, Aalto University
[email protected]

26.3.2014

Keijo Heljanko - Hadoop and Big Data. DIGILE Data to Intelligence (D2I) - 26.3.2014

Big Data

- IDC forecasts the amount of the digital universe to double every two years between now and 2020
  - 40% of the data will be touched by cloud computing
  - 13% of the data will be stored in the cloud
- Data comes from: video, digital images, sensor data, biological data, Internet sites, social media, ...
- The problem of such large data masses, termed Big Data, calls for new approaches to the storage and processing of data


No Single Threaded Performance Increases

- Herb Sutter: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. Dr. Dobb's Journal, 30(3), March 2005 (updated graph in August 2009).

Implications of the End of the Free Lunch

- The clock speeds of microprocessors are not going to improve much in the foreseeable future
  - The efficiency gains in single-threaded performance are going to be only moderate
- The number of transistors in a microprocessor is still growing at a high rate
  - One of the main uses of these transistors has been to increase the number of computing cores in a processor
  - The number of cores in a low-end workstation (such as those employed in large-scale datacenters) is going to keep steadily growing
- Programming models need to change to efficiently exploit all the available concurrency - scalability to a high number of cores/processors will need to be a major focus
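The shift the bullets above describe can be sketched in a few lines of plain Python (my own illustration, not from the slides): the same embarrassingly parallel, CPU-bound job written once sequentially and once split into independent chunks over a pool of worker processes, one per core.

```python
from concurrent.futures import ProcessPoolExecutor

def count_primes(bounds):
    """Count primes in [lo, hi) by trial division (deliberately CPU-bound)."""
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

def sequential(limit):
    # Single-threaded: speed is capped by one core's clock rate.
    return count_primes((0, limit))

def parallel(limit, workers=4):
    # Split the range into independent chunks; each worker gets one chunk.
    step = limit // workers
    chunks = [(i * step, (i + 1) * step) for i in range(workers)]
    chunks[-1] = (chunks[-1][0], limit)  # cover any remainder
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_primes, chunks))

if __name__ == "__main__":
    assert sequential(10_000) == parallel(10_000)  # same answer, more cores
```

The point is the programming-model change: the work had to be reformulated as independent chunks plus a final merge before extra cores could help at all - the same decomposition that MapReduce applies at datacenter scale.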


Tape is Dead, Disk is Tape, RAM locality is King

- Trends of RAM, SSD, and HDD prices. From: H. Plattner and A. Zeier: In-Memory Data Management: An Inflection Point for Enterprise Applications

Tape is Dead, Disk is Tape, RAM locality is King (cnt.)

- RAM (and SSDs) are radically faster than HDDs: one should use RAM/SSDs whenever possible
- RAM is roughly the same price as HDDs were a decade earlier
  - Workloads that were viable on hard disks a decade ago are now viable in RAM
  - One should only use hard-disk-based storage for datasets that are not yet economically viable in RAM (or SSD)
  - Big Data applications (HDD-based massive storage) should thus consist of applications that were not economically feasible even on HDDs a decade ago


Distributed Warehouse-scale Computing (WSC)

- Google is one of the companies that have had to deal with vast datasets too big to be processed by a single computer - the scale of data processed is truly Big Data
- The smallest unit of computation at Google scale is a warehouse full of computers. [WSC]: Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle: The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second edition. Morgan & Claypool Publishers, 2013. http://dx.doi.org/10.2200/S00516ED2V01Y201306CAC024
- The WSC book says: ". . . we must treat the datacenter itself as one massive warehouse-scale computer (WSC)."


Google Summa Warehouse Scale Computer


Google Servers from 2012

Figure: Google server racks from 2012, Figure 1.2 of [WSC]


Warehouse-Scale Computers: Automation

- In order to maintain a warehouse-scale computer with minimal staff, fault tolerance has to be fully automated:
  - All data is replicated to several hard disks in several computers
  - If one of the hard disks or computers breaks down, the software stack automatically reconfigures itself around the failure
  - For many applications this can be done fully transparently to the user of the system
- Thus warehouse-scale computers can be maintained with minimal staff
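The automated reconfiguration above can be sketched as a toy model (my own illustration; the class, names, and replication factor of 3 are assumptions in the spirit of HDFS defaults, not code from any real system): blocks live on several nodes, and when a node fails, under-replicated blocks are copied to surviving nodes without operator involvement.

```python
import random

REPLICATION = 3  # assumed replication factor, mirroring HDFS's default

class Cluster:
    def __init__(self, nodes):
        self.replicas = {}    # block id -> set of node names holding a copy
        self.alive = set(nodes)

    def put(self, block):
        # Place the block on REPLICATION distinct live nodes.
        self.replicas[block] = set(random.sample(sorted(self.alive), REPLICATION))

    def fail(self, node):
        # A node dies: drop its replicas, then re-replicate every block that
        # fell below the target factor. This is the automatic "reconfigure
        # around the failure" step; no human is in the loop.
        self.alive.discard(node)
        for nodes in self.replicas.values():
            nodes.discard(node)
            while len(nodes) < REPLICATION and len(self.alive) > len(nodes):
                nodes.add(random.choice(sorted(self.alive - nodes)))

cluster = Cluster([f"node{i}" for i in range(10)])
for b in range(100):
    cluster.put(b)
cluster.fail("node3")
# Every block is fully replicated again, on live nodes only.
assert all(len(n) == REPLICATION and "node3" not in n
           for n in cluster.replicas.values())
```

A real system additionally spreads replicas across racks and throttles re-replication traffic, but the invariant it restores is the same one this sketch checks.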


Hadoop - the Linux of Big Data

- Hadoop = an open source distributed operating system distribution for Big Data
  - Based on "cloning" the Google architecture design
  - Fault-tolerant distributed filesystem: HDFS
  - Batch processing systems: Hadoop MapReduce and Apache Pig (HDD), Apache Spark (RAM)
  - Distributed SQL for analytics: Apache Hive, Cloudera Impala, Apache Shark, Facebook Presto
  - Fault-tolerant real-time distributed database: HBase
  - Distributed machine learning libraries, text indexing & search, etc.
  - Data import and export with relational databases: Sqoop


Apache Hadoop Background

- An open source implementation of the MapReduce framework, originally developed by Doug Cutting and heavily used by, e.g., Yahoo! and Facebook
- "Moving Computation is Cheaper than Moving Data" - ship code to data, not data to code
- Project Web page: http://hadoop.apache.org/
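The MapReduce model itself fits in a short sketch (plain Python with no Hadoop dependency; real Hadoop jobs are written in Java against the org.apache.hadoop APIs, so this only mirrors the data flow): a map phase emits key/value pairs, a shuffle groups values by key, and a reduce phase aggregates each group - the grouping is what the framework performs across the network, which is where "ship code to data" pays off.

```python
from collections import defaultdict

def map_phase(line):
    # The classic word-count mapper: emit (word, 1) for every word.
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group values by key; in Hadoop the framework does this between
    # mapper and reducer nodes.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # The word-count reducer: sum the 1s emitted for each word.
    return key, sum(values)

def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data", "big clusters"])
assert counts == {"big": 2, "data": 1, "clusters": 1}
```

Because mappers and reducers are pure per-record and per-key functions, the framework is free to run thousands of them in parallel and to rerun any that fail - the fault tolerance comes from the model, not from the user's code.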


Apache Hadoop Background (cnt.)

- Builds reliable systems out of unreliable commodity hardware by replicating most components
- Each node is usually a Linux compute node with a small number of hard disks (4-12 drives, often 1 HDD/core)
- Tuned for large files (gigabytes of data)
- Designed for very large (1 PB+) data sets
- Designed for streaming data access in batch processing: built for high bandwidth instead of low latency
- For scalability, HDFS is NOT a POSIX filesystem
- Written in Java, runs as a set of userspace daemons
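A back-of-the-envelope calculation shows why HDFS is tuned for large files (the 128 MB block size is in the spirit of HDFS defaults; the scenario is my own illustration, not from the slides): the namenode keeps a metadata record per block in RAM, so the same bytes stored as many small files cost far more metadata than a few large files.

```python
BLOCK = 128 * 1024**2          # assumed HDFS block size: 128 MB

def num_blocks(file_size):
    # A file occupies ceil(size / BLOCK) blocks; the last may be partial,
    # and even a 1-byte file still costs one block record.
    return -(-file_size // BLOCK)

TOTAL = 1024**4                # 1 TB of data, stored two different ways

as_large_files = 8 * num_blocks(TOTAL // 8)                # eight 128 GB files
as_small_files = (TOTAL // 1024**2) * num_blocks(1024**2)  # 1 MB files

assert as_large_files == 8192        # 1 TB / 128 MB
assert as_small_files == 1048576     # one mostly-empty block per file
# 128x more namenode block records for exactly the same amount of data.
```

The same reasoning explains the streaming-access design: with gigabyte files read front to back, a disk spends its time on sequential transfers at full bandwidth rather than on seeks.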


Two Large Hadoop Installations

- Yahoo! (2009): 4000 nodes, 16 PB raw disk, 64 TB RAM, 32K cores
- Facebook (2010): 2000 nodes, 21 PB storage, 64 TB RAM, 22.4K cores
  - 12 TB (compressed) data added per day, 800 TB (compressed) data scanned per day
  - A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, H. Liu: Data warehousing and analytics infrastructure at Facebook. SIGMOD Conference 2010: 1013-1020. http://doi.acm.org/10.1145/1807167.1807278


Hadoop at Aalto

- We have been working with Hadoop since 2010 inside the Cloud Software program, collaborating with CSC
  - Hadoop-BAM: a library to process genomics data formats with Hadoop
  - 2000+ downloads of the library from SourceForge
  - Niemenmaa, M., Kallio, A., Schumacher, A., Klemelä, P., Korpelainen, E., and Heljanko, K.: Hadoop-BAM: Directly Manipulating Next Generation Sequencing Data in the Cloud. Bioinformatics 28(6):876-877, 2012. http://dx.doi.org/10.1093/bioinformatics/bts054
- Teaching Hadoop and Big Data technologies since 2011
- Collaborating in D2I on Hadoop-related topics with CSC, Tieto, and PacketVideo


Conclusions

- Hadoop is becoming the "Linux distribution for Big Data"
- It consists of a number of interoperable open source components
- Commercial support is available from vendors such as Cloudera, Hortonworks, MapR, IBM, and Intel, as well as their partners
- Amazon provides "Elastic MapReduce (EMR)", which is basically a Hadoop-as-a-service offering
- Microsoft HDInsight is a similar service for Hadoop on Microsoft Azure
- Aalto course with more Hadoop information: https://noppa.aalto.fi/noppa/kurssi/t-79.5308/etusivu
