Big Data in Memory: A Survey

Luis Manuel Román García
Department of Computer Science
Instituto Tecnológico Autónomo de México
México, Distrito Federal
Email: [email protected]

Abstract—This paper aims to unify, in a coherent yet simple way, the different aspects of memory management that the implementation of in-memory Big data platforms requires. In order to achieve this, the paper is semantically divided into three parts: one that focuses on current implementations for working with large-scale data; another that focuses on the characterization of the applications commonly found in these environments; and a last part that explores the different levels of memory management and details the modifications that the classical methods had to undergo in order to meet the requirements of Big data. These three parts are not necessarily displayed explicitly and, due to the large interplay that they exert on each other, they are sometimes treated simultaneously.

Keywords—Big Data, Distributed Computing, Memory Management, In-memory Systems.

I. INTRODUCTION

In 2005, Roger Magoulas from O'Reilly used the term Big data to refer to a large amount of data that could not reasonably be handled by traditional techniques. In this sense, Big data was defined by its size; however, some more refined definitions [15], [21], [10] make specific reference to the architectural and design factors that the speed, volume and variety of Big data require from any computational system. In particular, it is clear from any perspective that the only effective way to face the challenges of this paradigm is through some form of cluster computing. Under these circumstances, several distributed data storage systems have emerged, such as the Google File System [16], LogBase [47] and ES2 [7], that allow enterprises to handle and operate on huge amounts of data in a reliable, efficient and robust manner. Unfortunately, these systems are often too complex to operate, and the need to deal with parallelization, communication, synchronization and execution details often obscures the simple computations that the user may want to execute. In order to alleviate this, several abstractions have been developed in the form of libraries that allow users to execute large distributed operations in a simplified manner. Examples of these abstractions are MapReduce [9], Data Flow [26] and All-pairs [34]. Even though these frameworks are widely used in large-scale data analytics, they all have in common that they are disk-oriented [52]; in this sense, they are not efficient for many data management tasks, particularly those requiring high processing speed, commonly found in on-line environments or

OLTP1 systems [48]. Nevertheless, the tendency in Big data applications is precisely this: large-scale applications capable of processing huge amounts of data under real-time conditions. To address this, special attention has been paid to the main performance bottlenecks of this kind of application, and several benchmarks have been established along the way [48], [14]. Interestingly enough, these studies have led researchers to the conclusion that most Big data applications could largely improve their performance by keeping their working set in memory. Under these circumstances, some authors [37], [51] have proposed a shift of paradigm from disk-based systems to in-memory systems. Since keeping data in memory is the main focus of this work, the rest of the paper is organized as follows: Section II details some of the main features of the different frameworks developed for working with Big data; it focuses primarily on disk-oriented platforms, discusses their pros and cons, and explores the different applications and workloads that these systems support. Based on these notions, Section III discusses several in-memory systems that have been developed in order to deal with the nuisances of the previous systems. Since in-memory systems face the typical replacement-policy problems of every memory management architecture, Section IV gives a quick introduction to some concepts and terminology in this field and overviews a set of algorithms that take advantage of the memory usage of Big data applications to introduce efficient policies. Finally, Section V points out research paths for future work and Section VI concludes.

II. BIG DATA FRAMEWORKS

The explosion in the number of applications that use and generate large amounts of data, such as web crawlers, search engines and social networks, has led companies to develop frameworks that allow them to handle these workloads efficiently. Due to the complexity and diversity of these applications, there is no single cure-all solution [25]. This is worsened by the fact that these platforms must handle such applications concurrently, hence creating an environment with rapidly changing workloads and resource requirements2. In the following, we review some of the most popular platforms and examine their principal properties and limitations.

1 On-Line Transaction Processing [23].
2 This is commonly known as workload churn [4].

A. MapReduce

This is perhaps the most famous abstraction for working with massive amounts of data in a distributed environment. It was first developed by Google [9] as an in-house interface for running largely distributed applications such as document clustering, access-log analysis and the creation of inverted indexes. It was then popularized by an open-source implementation called Hadoop3 [22]. The vast success of this framework is due to the fact that it allows the user to execute distributed computations using only two interfaces, namely Map and Reduce, without having to worry about parallelization, work distribution and fault-tolerance details. In this sense, the success of MapReduce relies on its flexibility (the programming model is simple yet expressive: it can emulate SQL queries and perform data mining and machine learning tasks), its scalability (it achieves elastic scalability through block-level scheduling), its efficiency and its fault tolerance [11], [29]. The logic of MapReduce is quite simple: it allows only two types of nodes, a master node and a worker node. The master is the single piece in charge of the execution of a job; each job is subdivided into tasks and each task is assigned to a worker via a scheduler. Each worker is responsible for a map or a reduce task, and the number of workers that execute each of these may differ4. The importance of this is that many Big data tasks can be structured as MapReduce tasks. Nonetheless, there are some challenges that MapReduce has failed to meet: it lacks a high-level language such as SQL5, it exhibits poor performance when executing iterative algorithms, and it lacks support for interactive data exploration and stream processing [19]. A sketch of the programming model is given below.

3 This is by far the most widely used implementation of MapReduce, regardless of its known disadvantages, namely software architectural bottlenecks, portability limitations and portability assumptions [42].
4 There are some refinements, such as partitioning functions, input and output types and combiner modules, that allow the execution to be more efficient [9].
5 Simple SQL commands often require large and complex MapReduce routines; moreover, this code is generally not reusable [11].
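To make the two interfaces concrete, the following is a minimal, framework-independent sketch of one of the workloads mentioned above, the creation of an inverted index, written in plain Python. The function names (map_fn, reduce_fn, run_job) and the tiny in-process driver that groups intermediate pairs are our own illustrative stand-ins for the user code and the shuffle phase that a real MapReduce deployment performs across a cluster.

```python
from collections import defaultdict

# User-supplied logic: the only two interfaces a MapReduce program needs.
def map_fn(doc_id, text):
    """Emit (word, doc_id) pairs for every word in the document."""
    for word in text.split():
        yield word.lower(), doc_id

def reduce_fn(word, doc_ids):
    """Collapse all postings for a word into a sorted, de-duplicated list."""
    return word, sorted(set(doc_ids))

def run_job(documents):
    """Stand-in for the framework: the shuffle groups intermediate pairs by key."""
    intermediate = defaultdict(list)
    for doc_id, text in documents.items():                 # map phase
        for key, value in map_fn(doc_id, text):
            intermediate[key].append(value)
    return dict(reduce_fn(k, v) for k, v in intermediate.items())  # reduce phase

if __name__ == "__main__":
    docs = {"d1": "big data in memory", "d2": "memory management for big data"}
    print(run_job(docs))   # {'big': ['d1', 'd2'], 'data': ['d1', 'd2'], ...}
```

In a real deployment the map and reduce functions stay exactly this small; the framework takes over the distribution, grouping and fault tolerance that run_job only imitates here.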

B. Data Flow

Data Flow is a programming model implemented on a platform called Dryad [26]. In the same spirit as MapReduce, it focuses on providing a simple programming abstraction that helps developers write efficient parallel and distributed applications without having to worry about the deeper concurrency mechanisms, resource-allocation and scheduling policies, or failure-recovery details. Though arguably more complicated than MapReduce, this sacrifice of simplicity enables Dryad to express a wider range of computations. Even though Dryad is more general than MapReduce, it is well known that it is not an appropriate platform for performing iterative jobs, nested parallelism or irregular parallelism [49]. The execution of every job on a Dryad implementation follows a directed acyclic graph (DAG) topology: each vertex of the graph is a program and every edge is a data channel. This logical graph is automatically mapped onto physical resources. Each job is coordinated by a job manager, which contains the code necessary to construct the communication graph and the libraries necessary to schedule the work. In order to distribute the work, the job manager consults the name server to obtain the number of available computers. One of the main drawbacks of Dryad is that if the job manager's computer fails, the entire job is terminated.

C. All-pairs

All-pairs is an abstraction aimed at simplifying the parallelization and distribution of workloads commonly found in several scientific fields. The general problem of All-pairs is stated as follows:

Definition II.1. The operation All-Pairs(A, B, F) compares all of the elements of a set A with all of the elements of a set B via a function F and returns a matrix M such that M[i, j] = F(A[i], B[j]).

In this way, All-pairs carries out four steps for every computation: first, it must model the system; second, distribute the data; third, dispatch the jobs; and fourth, clean up the system [49]. The complexity of All-pairs lies in distributing the data among several nodes while avoiding unacceptable latency [34]. The All-pairs scheme can be extrapolated to different scientific environments where the goal is to understand the effect of a given function over two distinct sets, or to assess some quality measure between both sets. In particular, All-pairs has direct application in biometrics and data mining; a sequential sketch of the operation is given at the end of this section.

Notwithstanding the fact that all of the above platforms are widely used in industry, the emergence of new kinds of applications that make iterative reuse of data, such as machine learning algorithms, iterative data mining and graph algorithms, requires more flexibility in the use of memory in order to execute them efficiently. This fact, coupled with the increasing speed requirement, has led researchers to look for a new kind of paradigm, namely, the memory-oriented abstraction.
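The sketch referenced above spells out the semantics of Definition II.1 as a plain double loop in Python; a real All-pairs engine distributes exactly this computation, so the engineering difficulty is the data placement discussed in [34] rather than the loop itself. The function and variable names are our own.

```python
def all_pairs(a, b, f):
    """Return the matrix M with M[i][j] = f(a[i], b[j]) (Definition II.1)."""
    return [[f(x, y) for y in b] for x in a]

if __name__ == "__main__":
    # Toy similarity standing in for, e.g., a biometric comparison function F.
    similarity = lambda x, y: abs(x - y)
    print(all_pairs([1, 2, 3], [10, 20], similarity))   # [[9, 19], [8, 18], [7, 17]]
```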

III. MEMORY ORIENTED SYSTEMS

In this section we describe three of the several approaches that are being developed in order to deal with the in-memory problem. As we are going to see, since main memory is volatile, the main challenge of these systems is to execute tasks efficiently while ensuring fault tolerance. The structure of this section somewhat mirrors that of Section II: first we describe the architectural and functional details of Main Memory Map Reduce (M3R) [43], an in-memory MapReduce-based system; then we focus on an in-memory data flow model named Piccolo [38]; finally, we review the different characteristics of an abstraction called Resilient Distributed Datasets (RDDs) that reasonably outperforms the other two abstractions in matters of efficiency and robustness.

A. M3R

M3R is fundamentally based on the Hadoop Map Reduce (HMR) engine. It was first conceived to solve two performance bottlenecks of the HMR system. The first is the inability of HMR to perform multiple jobs without writing its state to disk and therefore incurring unacceptable performance penalties. The second, as we mentioned in this section's introduction, is

the inability of HMR to optimize memory usage for iterative algorithms. Accordingly, the three design principles behind M3R's implementation are in-memory execution, performance and no resilience; as we are going to see in a later section, this last point poses some serious disadvantages when executing huge workloads. The main features of M3R are the following:

•	Every instance of M3R allows the sharing of heap state between jobs, since all of them run on a fixed set of Java virtual machines.

•	Subsequent reads of input splits are fulfilled by reading them from the heap instead of going all the way to the file.

•	Key-value pair shuffling is performed in memory.

•	The same partition number is always mapped to the same place. This property enables memory reuse, hence solving the performance problem with iterative algorithms.

The flow of execution of M3R is similar to that of HMR, with the fundamental difference that M3R allows memory reuse between jobs in a single virtual machine. Some other differences are that M3R introduces an input/output cache and that it implements co-location6, partition stability and de-duplication.

6 The avoidance of network or disk overhead produced by de-serialization of jobs when the mapper and the reducer are running on the same machine.

B. Piccolo

As well as M3R, Piccolo is developed as an abstraction for in-memory large-scale computations. Its ideal is to simplify the access and sharing of intermediate state stored in memory. Computations in Piccolo are organized as kernel functions, and each kernel is run as multiple instances on many compute nodes. The kernel instances share their state through mutable tables. The main advantages of Piccolo over other systems are:

•	Piccolo is fast (between 4X and 11X faster than Hadoop).

•	It scales well.

•	It enables parallelism.

Tasks are carried out by two types of processes, a master process and many worker processes. The master is in charge of scheduling kernel instances and assigning table partitions among workers. Each worker stores its table partitions in memory and handles the operations associated with those tables. Some other considerations that make the system more efficient are load balancing among workers and table migration; in order to grant resilience, checkpointing is implemented.

C. Resilient Distributed Datasets

This is perhaps the most efficient platform of the in-memory paradigm treated in this paper. Although it has some limitations, its logic is quite simple and it naturally provides fault tolerance; yet it is expressive enough to allow RDDs to perform cluster programming similar to platforms such as MapReduce and Dryad, with the advantage of providing operations that those platforms do not support, such as interactive data mining. This last property distinguishes RDDs from other in-memory abstractions such as Piccolo. An RDD is a read-only, partitioned collection of records. RDDs are created via transformations7. The key point of RDDs is that they need not be materialized at all times; instead, an RDD can be recomputed by examining its lineage8. In order to allow programmers to keep data in memory, Spark (the language-integrated API for working with RDDs) offers them the persist operation (a short sketch is given below). Finally, RDDs are best suited for iterative batch applications, since the applied transformations are remembered as steps in the lineage; they are less suitable for a web application's storage system or an incremental web crawler.

7 They can be created by operations on two kinds of sources: (1) stable storage and (2) other RDDs.
8 Information about how the dataset was derived from other datasets.
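As an illustration of transformations, persist and lineage-based reuse, here is a minimal PySpark-style sketch. It assumes a working Spark installation and a hypothetical input file data.txt; the calls shown (textFile, filter, map, persist, count, take) belong to the standard Spark RDD API, but the overall program is only an illustrative sketch, not code from [51].

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-persist-sketch")

# Lineage: textFile -> filter -> map. Nothing is materialized yet.
errors = (sc.textFile("data.txt")              # hypothetical input path
            .filter(lambda line: "ERROR" in line)
            .map(lambda line: line.split("\t")))

# Ask Spark to keep the computed partitions in memory for reuse.
errors.persist()

# Both actions below reuse the in-memory partitions instead of re-reading disk;
# if a partition is lost, only that partition is recomputed from the lineage.
print(errors.count())
print(errors.take(5))

sc.stop()
```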

Table I summarizes the main differences between RDDs and regular distributed shared memory (DSM) systems.

TABLE I. COMPARISON OF RDDS WITH DISTRIBUTED SHARED MEMORY SYSTEMS.

Aspect | RDDs | DSM
Reads | Coarse- or fine-grained | Fine-grained
Writes | Coarse-grained | Fine-grained
Consistency | Trivial | Up to app / runtime
Fault recovery | Fine-grained and low overhead | Requires checkpoints
Straggler mitigation | Possible using backup | Difficult
Work placement | Automatic, based on data locality | Up to app
Behaviour if not enough RAM | Similar to existing data flow systems | Poor performance

NOTE: Adapted from M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), San Jose, CA, 2012, pp. 15-28, https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia, USENIX.

IV. LARGE-SCALE MEMORY


As can be appreciated in the previous sections, the implementation of in-memory systems poses diverse architectural and design challenges: these systems must be reliable and resilient. The focus of this section is threefold: first, we revisit the basic concepts of memory architecture, hierarchy and management; then we explore the different kinds of applications and workloads that are regularly found in Big data environments; finally, we go through several memory optimization techniques used to satisfy those applications' requirements.




A. Memory Fundamentals





To understand the logic of the memory management optimizations of the next sections, it is worthwhile to give a quick overview of the main concepts regarding memory utilization. As [24] points out, every programmer would like to have an unlimited amount of fast memory. Since this is impossible, a more reasonable, and more affordable, approach to provide the illusion of vast, fast memory is to implement a memory hierarchy. Under this structure, each level provides a smaller, faster and hence more expensive memory than the one below it. This scheme takes advantage of the principle that most users or applications do not access all of their code or data uniformly9. Hence, just a small amount of critical data, commonly known as the working set of an application, has to be kept in fast-access memory. The container that holds a copy of a computation or data that can be accessed more quickly than the original is called a cache [2]. Figure 1 displays a common memory hierarchy as well as standard sizes and access latencies [2].

9 This principle is known as the locality principle [24] and it can be both spatial and temporal.

Fig. 1. Memory hierarchy: size and access time.

The different components of the diagram accomplish the following functions [44], [2], [52]:

•	I-UNIT. This unit is in charge of instruction fetch and decode.

•	E-UNIT. This unit is in charge of execution.

•	S-UNIT. This unit is in charge of in-CPU storage; it is the interface between the I-UNIT and the E-UNIT.
◦	Register. The register is the data storage that the processor can manipulate directly; it is usually very small (commonly 64 bits wide).
◦	First-level cache. This is a virtually addressed cache and it is the closest to the processor.
◦	TLB. The TLB is an associative memory that is in charge of mapping virtual to real addresses.
◦	Second-level cache. Most processors hold a second-level cache; this one is physically addressed and is meant to handle the misses of the first-level cache.

•	Third-level cache. This is an even larger cache and, under several architectures, it is shared among several cores10.

•	Main memory. This is where the program and data space lies. It typically takes 2 or 3 bus cycles to begin accessing data.

•	Local disk. It can serve

•	Remote data center disk. Remotely allocated disks.

10 Since two or more cores share the same bus to access the third-level cache, these accesses must be done in a sequential manner, inducing an extra overhead.

The latency discrepancy between disk and memory access explains why most of the recent Big data platforms focus on keeping recurrent computations in memory. A fact of life, however, is that the space available in memory is quite limited (compared with the space available on disk, for example). So the question arises: how can the usage of fast memory be made efficient? This enquiry can be subdivided into four different questions [44]:

•	How to maximize the probability of finding a memory reference in the cache?

•	How to optimize the speed at which a memory reference is found in the cache?

•	How to minimize the delay incurred by a cache miss?

•	How to keep consistency efficiently among the different levels of memory?

In this paper we are going to focus on the first point and, more specifically, on replacement policies. The second point concerns more architectural issues (for example, should a cache be associative or hard-wired?). For the third point, there are multiple papers covering several aspects of cache misses, for example [17], [45] and [31]; [39] is particularly interesting since it differentiates the latencies induced by parallel misses from those that happen in isolation, which is especially useful for distributed environments and hence for Big data. On the other hand, the seminal work of Goodman [18] covers the basics of the fourth point.

B. Regular Replacement Policies

In this section we give a quick overview of the different replacement policies that are commonly found in regular computer systems. The objective is to give the reader a panoramic understanding of the basic functionality of these algorithms, what they aim for and why they fail when facing complex distributed systems11.

11 In this paper we focus exclusively on online algorithms. The interested reader can go to [35] for a thorough analysis of several offline and online algorithms.







•	LRU. This policy always replaces the least-recently used page. Its wide adoption is due to its simple implementation and its optimality guarantee when requests follow a stack-depth distribution [36].




•	LFU. This policy replaces the least-frequently used page. It exhibits optimality under the independent reference request model [1]. Its main drawback is that its running time per request is logarithmic.




•	FBR, LIRS & LFRU. These are modifications, adaptations and combinations of the first two policies, so we are not going to discuss them further.



•	Shepherd Cache [41]. This policy is a bit more complicated: it attempts to emulate Belady's optimal policy in an online fashion. To achieve this, the cache is divided into two sub-caches, the Main Cache and the Shepherd Cache. When the main cache misses on a page, the Shepherd Cache stores the reference to that page and gathers information about the page's usage behaviour in order to find an optimal replacement candidate. (A sketch of the offline optimal policy being emulated is given after this list.)
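For reference, the sketch below implements the offline optimal (Belady) policy that the Shepherd Cache tries to approximate: on a miss with a full cache it evicts the resident page whose next use lies furthest in the future. It needs the entire future request sequence, which is precisely the information the Shepherd Cache has to estimate online; the code is a plain Python illustration of the policy, not taken from [41].

```python
def belady_misses(requests, capacity):
    """Count the misses of the offline optimal policy on a request sequence."""
    cache, misses = set(), 0
    for i, page in enumerate(requests):
        if page in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            # Evict the cached page re-used furthest in the future (or never).
            def next_use(p):
                future = requests[i + 1:]
                return future.index(p) if p in future else float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(page)
    return misses

if __name__ == "__main__":
    trace = list("ABCDABCDABCD")
    print(belady_misses(trace, capacity=3))   # 6 misses on this cyclic trace
```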




Of course, this is not an exhaustive list; some other policies for shared caches are DIP [40], TADIP [27], RRIP [28] and so on. Even though each of these policies has its own advantages and disadvantages, no policy uniformly outperforms all of the others under every scenario. Take the following example: suppose a user executes an application whose working set is divided into four blocks A, B, C and D, and that the blocks are used cyclically, A → B → C → D → A . . . Moreover, suppose that only three blocks fit in memory, so one block must be kept out at any given time. It is clear that the LRU policy will cause a cache miss at every step after the first iteration, hence providing worse performance than random replacement, as the sketch below reproduces. This observation is exacerbated in Big data platforms, where multiple applications with different behaviours execute and strive for resources concurrently. Under these circumstances, regular replacement policies perform poorly. In the next section we review the different applications that tend to execute in these environments and try to understand their needs and behaviour.
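The pathological cyclic pattern just described can be reproduced in a few lines. The minimal LRU cache below, built on Python's OrderedDict, misses on every single access of the A → B → C → D trace with a capacity of three blocks; on the same 12-request trace, the offline optimal sketch shown earlier incurs only 6 misses. The class and method names are our own.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU: evict the least-recently used block when full."""
    def __init__(self, capacity):
        self.capacity, self.blocks = capacity, OrderedDict()
        self.hits = self.misses = 0

    def access(self, block):
        if block in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block)       # mark as most recently used
        else:
            self.misses += 1
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used
            self.blocks[block] = True

if __name__ == "__main__":
    cache = LRUCache(capacity=3)
    for block in "ABCD" * 3:                     # cyclic A -> B -> C -> D trace
        cache.access(block)
    print(cache.hits, cache.misses)              # 0 hits, 12 misses
```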

C. Large-Scale Applications

As mentioned in [46], one of the main characteristics of Big data, besides its volume, is variety. The data no longer follows the regular table format: it comes in many shapes and formats, and unstructured text, images, video and voice stream in from millions of users and multiple applications every second. In order to extract value from all that information, Big data platforms must be able to perform multiple tasks concurrently and efficiently. Characterizing the requirements of those tasks in order to correctly assess the quality of a system is therefore critical, and several efforts have been made in this direction [14], [52], [48]. In this paper we follow the approach proposed by [48] and consider three different aspects of Big data workloads, namely:

•	Data type. This refers to the shape of the data: is it structured, unstructured or semi-structured?

•	Data structure. This is the nature of the data: it could be text, video, voice, images, etc.

•	Application type. The application requirements: online, offline or both.

Table II summarizes the possible combinations of these three elements along with the applications that present them.

TABLE II. SUMMARY OF BIG DATA APPLICATIONS AND WORKLOADS.

Application Scenarios | Application Type | Workloads | Data Types
Micro Benchmarks | Offline Analytics | Sort, Grep, WordCount, BFS | Un-structured
Basic Datastore Operations | Online Services | Read, Write, Scan | Semi-structured
Relational Query | Realtime Analytics | Select Query, Aggregate Query, Join Query | Structured
Search Engine | Online Services, Offline Analytics | Nutch Server, Index, Page Rank | Un-structured
Social Network | Online Services, Offline Analytics | Olio Server, Kmeans, Connected Components | Un-structured
E-commerce | Online Services, Offline Analytics | Rubis Server, Collaborative Filtering, Naive Bayes | Structured, Semi-structured

NOTE: Adapted from "BigDataBench: A big data benchmark suite from internet services," by Wang, Lei, et al., 2014, High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on, IEEE.

Understanding the behaviour of these tasks is important; for example, the fact that MapReduce was designed under the assumption of batch computations limits its applicability to most real-time/online analytics. Some of the most relevant observations about these applications are [8]:

1) Data access frequencies are skewed (the 80-0 and 80-1 rules), which implies strong temporal locality.
2) The workloads tend to be bursty and unpredictable.
3) Most jobs are small.
4) Most of the querying frameworks are used for interactive data mining and streaming analysis.
5) There is no typical behaviour; most systems exhibit widely heterogeneous workloads.

Due to this variety, it is not a surprise that static caching policies are inapplicable in Big data environments. In the next section we review some alternatives.

D. Big Data Memory Management

As can be appreciated from the last section's conclusion, the variety of applications and the different resource pressures that they exert on a system are often too heterogeneous to be gracefully handled by a regular replacement algorithm. Therefore, several efforts have been devoted to finding alternatives; among the most recent ones, adaptive caching algorithms [12], [13], [20] seem to be the most promising. In this section we review different approaches to this model and some minor variations. In Section IV-A we reviewed the different considerations that an efficient memory management policy must observe. One of them is the probability that a given page is going to be found in the cache; in this sense, one criterion to assess the effectiveness of a given policy is to measure its cache hit ratio. Of course, the overhead generated by the policy must not exceed the latency of a cache miss; if it did, what would be the point of implementing the policy in the first place? This is the tricky part: most systems settle for classic replacement policies precisely because they are lightweight. One of the first attempts to provide low-overhead adaptive caching was ARC (adaptive replacement cache) [35]. What makes this cache suitable for Big data systems is that, in contrast with LRU, it does not need workload-specific pre-tuning. The whole idea of ARC is to maintain two LRU page lists: one that gathers the pages that have been seen once, let us call it L1, and one that

gathers the pages that have been seen twice in the past, L2. In this way, L1 captures recency while L2 captures frequency. With the help of an adaptive parameter p, ARC can identify when frequency is becoming more important than recency, or vice versa, and enlarge the respective list accordingly. Even though ARC outperforms LRU on the P6, SPC1-like and Merge workloads [35], it does not take into consideration the nature of a cache miss; in other words, it treats parallel and isolated misses indistinctly.
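To make the two-list idea tangible, here is a heavily simplified Python sketch: a recency list T1, a frequency list T2 and ghost lists of recently evicted keys that nudge the adaptive target p. It is not the full ARC algorithm of [35], which re-admits ghost hits directly into the frequency list and adapts p with a step proportional to the relative ghost-list sizes; the class and method names are our own.

```python
from collections import OrderedDict

class AdaptiveTwoListCache:
    """Simplified sketch of the ARC idea: T1 holds pages seen once (recency),
    T2 pages seen again (frequency). Ghost lists B1/B2 remember recently
    evicted keys and steer p, the target share of the cache given to T1."""

    def __init__(self, capacity):
        self.c, self.p = capacity, capacity // 2
        self.t1, self.t2 = OrderedDict(), OrderedDict()   # resident pages
        self.b1, self.b2 = OrderedDict(), OrderedDict()   # evicted keys only

    def _evict(self):
        # Shrink T1 while it exceeds its target (or T2 has nothing to give).
        if len(self.t1) > self.p or not self.t2:
            key, _ = self.t1.popitem(last=False)
            ghost = self.b1
        else:
            key, _ = self.t2.popitem(last=False)
            ghost = self.b2
        ghost[key] = True
        if len(ghost) > self.c:                  # keep ghost metadata bounded
            ghost.popitem(last=False)

    def access(self, key):
        if key in self.t1:                       # second touch: promote
            del self.t1[key]; self.t2[key] = True; return True
        if key in self.t2:
            self.t2.move_to_end(key); return True
        # Miss. A ghost hit tells us which list was evicted from too eagerly.
        if key in self.b1:
            self.p = min(self.c, self.p + 1); del self.b1[key]
        elif key in self.b2:
            self.p = max(0, self.p - 1); del self.b2[key]
        if len(self.t1) + len(self.t2) >= self.c:
            self._evict()
        self.t1[key] = True                      # simplification: misses enter T1
        return False
```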

As noted by [32], treating every miss alike can be a serious performance limitation. In order to alleviate this, [39] proposes the MLP-aware (Memory Level Parallelism aware) replacement policy. The authors argue that not every miss is equally costly, so they draw a clear distinction between parallel and isolated misses: isolated misses prove to be more costly, since the processor is idle while a single miss is being satisfied, whereas with parallel misses the idle time is divided among the concurrent misses. Their main contributions are an online algorithm to compute the MLP-aware cost of in-flight misses, the observation that the MLP-based cost serves as a predictor of the next MLP-based cost, and the LIN policy, which takes into consideration both recency and MLP-based cost. However, although better than regular LRU and SC, MLP-aware replacement only outperforms the former when isolated misses are frequent; when parallel misses are the rule, it behaves just like LRU [32], failing to fulfill the requirements of Big data platforms.

Another approach that makes adaptive replacement choices, and differs fundamentally from the previous ones, is the one proposed by [12]. Under this scheme, the caching problem is viewed as an instance of the knapsack problem: the cache is regarded as a knapsack of capacity C, and the objects are denoted by i = 1, 2, . . . , n, each having size s_i, probability of future reference P_i(t) and cost c_i. The generality of this approach gives the user the flexibility to incorporate notions from different replacement approaches; for example, one can introduce the frequency and recency notions of ARC into P_i(t) and the cost of a parallel or isolated miss into c_i12. In this sense, the caching problem can be mapped into the optimization problem

    max Σ_{i∈M} c_i P_i(t)    s.t.    Σ_{i∈M} s_i ≤ C,        (1)

where M denotes the set of objects kept in the cache. Since this is a known NP-hard problem, the authors propose the following heuristic: set R_i(t) = W_i(t)/s_i, order the objects so that R_{i_1}(t) ≥ R_{i_2}(t) ≥ . . . , and pick the largest index J such that Σ_{j=1}^{J} s_{i_j} ≤ C. Following this logic, in case of a cache miss the new object k is introduced into the cache only if there is a set of objects {k_1, k_2, . . . , k_n} whose eviction does not cause the sum of the R_i values in the cache to decrease.

12 In [12], however, P_i is estimated from past history and c_i is taken proportional to the object's size.
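The density heuristic translates directly into code. In the sketch below we take W_i(t) to be c_i · P_i(t), which is our own reading of the weight (the formulation above leaves W_i generic), rank objects by R_i = W_i / s_i, and keep the highest-density prefix that fits in the capacity C; names and example values are illustrative only.

```python
from typing import NamedTuple

class CacheObject(NamedTuple):
    key: str
    size: int        # s_i
    prob: float      # P_i(t), estimated probability of future reference
    cost: float      # c_i, penalty paid if the object is missed

def select_contents(objects, capacity):
    """Greedy density heuristic: rank by R_i = W_i / s_i (here W_i = c_i * P_i)
    and keep the largest prefix of the ranking whose total size fits in C."""
    ranked = sorted(objects, key=lambda o: (o.cost * o.prob) / o.size, reverse=True)
    chosen, used = [], 0
    for obj in ranked:
        if used + obj.size > capacity:
            break                      # largest index J with cumulative size <= C
        chosen.append(obj)
        used += obj.size
    return chosen

if __name__ == "__main__":
    objs = [CacheObject("a", 4, 0.9, 1.0), CacheObject("b", 2, 0.5, 3.0),
            CacheObject("c", 3, 0.2, 1.0), CacheObject("d", 1, 0.8, 2.0)]
    print([o.key for o in select_contents(objs, capacity=6)])   # ['d', 'b']
```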

V. FUTURE WORK

There are at least two areas, regarding caches and Big data applications, worth studying in parallel with the replacement policy, not only because of their large interplay, but also because a correct understanding of their interactions could help scientists develop more integrated solutions and better optimizations overall. These two areas are cache-aware algorithms and hardware and software optimizations. Several papers have been written covering different aspects of cache-aware algorithms [5], [33]; nonetheless, to the knowledge of the author there is no machine learning library developed for working at large scale that makes use of this kind of algorithm. This is particularly troubling, since most optimization algorithms rely on some technique such as LU decomposition for solving linear systems, and it is precisely this kind of decomposition that can easily be adapted to be cache aware [5]. On the other hand, there is a whole series of hardware and software optimizations that can be implemented simultaneously with different replacement techniques; for example, [50] explicitly mentions two strategies for improving cache usage, namely the slice-and-merge strategy for sorting processes and direct memory access for key storage. To the knowledge of the author, there is no general framework that treats these different themes in an integrated fashion.

VI. CONCLUSION

In this paper we have explored some of the most important aspects of memory management for in-memory Big data systems. Along the way, we reviewed some current implementations as well as some architectural considerations and some more application-specific themes. The importance of an adequate strategy for managing fast memory efficiently was recurrently emphasized; in fact, we observed that it is a paramount component for the proper functioning of the whole system, and it becomes ever more critical as speed becomes a central requirement, which, as [3] points out, is precisely the current trend. The takeaway of this work is that Big data environments are extremely complex, and this complexity becomes a serious memory problem once speed is added to the equation. No conventional strategy works under these circumstances, and the use of adaptive algorithms seems to be the most promising option. However, no cure-all method has been found yet; the only thing left is to keep on searching.

REFERENCES

[1] A. V. Aho, P. J. Denning, and J. D. Ullman, "Principles of optimal page replacement," J. ACM, vol. 18, no. 1, 1971, pp. 80-93.
[2] T. Anderson and M. Dahlin, Operating Systems: Principles and Practice, 2nd ed., Recursive Books, 2014, p. 545.
[3] M. D. Assunção, R. N. Calheiros, S. Bianchi, M. A. Netto, and R. Buyya, "Big Data computing and clouds: Trends and future directions," Journal of Parallel and Distributed Computing, vol. 79, 2015, pp. 3-15.
[4] L. A. Barroso, J. Clidaras, and U. Hölzle, "The datacenter as a computer: An introduction to the design of warehouse-scale machines," Synthesis Lectures on Computer Architecture, vol. 8, no. 3, 2013, pp. 1-154.
[5] M. A. Bender et al., "Cache-adaptive algorithms," in Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SIAM, 2014.
[6] E. A. Brewer, "Towards robust distributed systems," invited talk.
[7] Y. Cao, C. Chen, F. Guo et al., "ES2: A cloud data storage system for supporting both OLTP and OLAP," in ICDE 2011.
[8] Y. Chen, S. Alspaugh, and R. Katz, "Interactive analytical processing in big data systems: A cross-industry study of MapReduce workloads," Proceedings of the VLDB Endowment, vol. 5, no. 12, 2012, pp. 1802-1813.
[9] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, January 2008, pp. 107-113. DOI: http://dx.doi.org/10.1145/1327452.1327492.
[10] E. Dumbill, "What is Big Data?: An introduction to the big data landscape," https://www.oreilly.com/ideas/what-is-big-data, January 11, 2012.
[11] F. Li, B. C. Ooi, M. T. Özsu, and S. Wu, "Distributed data management using MapReduce," ACM Computing Surveys, 2014.
[12] A. Floratou et al., "Adaptive caching algorithms for Big Data systems," 2015.
[13] F. A. Funke, Adaptive Physical Optimization in Hybrid OLTP & OLAP Main-Memory Database Systems.
[14] A. Ghazal et al., "BigBench: Towards an industry standard benchmark for big data analytics," in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, ACM, 2013.
[15] E. G. Ularu et al., "Perspectives on Big Data and Big Data Analytics," Database Systems Journal, vol. III, no. 4, 2012.
[16] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," in SOSP '03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, ACM, New York, NY, USA, 2003, pp. 29-43.
[17] S. Ghosh, M. Martonosi, and S. Malik, "Cache miss equations: An analytical representation of cache misses," in Proceedings of the 11th International Conference on Supercomputing, ACM, 1997.
[18] J. R. Goodman, Cache Consistency and Sequential Consistency, University of Wisconsin-Madison, Computer Sciences Department, 1991.
[19] K. Grolinger et al., "Challenges for MapReduce in Big Data," in Services (SERVICES), 2014 IEEE World Congress on, IEEE, 2014.
[20] C. Guo and M. Karsten, "Towards adaptive resource allocation for database workloads."
[21] J. Guterman, "Big Data," Release II, Issue 11, O'Reilly, June 2009.
[22] Apache, "Apache Hadoop," 2005. Online: http://hadoop.apache.org.
[23] S. Harizopoulos et al., "OLTP through the looking glass, and what we found there," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, ACM, 2008.
[24] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Elsevier, 2011.
[25] B. Hindman et al., "A common substrate for cluster computing," in Workshop on Hot Topics in Cloud Computing (HotCloud), 2009.
[26] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: Distributed data-parallel programs from sequential building blocks," in EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, 2007, pp. 59-72. DOI: 10.1145/1272996.1273005.
[27] A. Jaleel et al., "Adaptive insertion policies for managing shared caches," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, ACM, 2008.
[28] A. Jaleel et al., "High performance cache replacement using re-reference interval prediction (RRIP)," ACM SIGARCH Computer Architecture News, vol. 38, no. 3, ACM, 2010.
[29] D. Jiang et al., "The performance of MapReduce: An in-depth study," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, 2010, pp. 472-483.
[30] D. Jiang, G. Chen, B. C. Ooi et al., "epiC: An extensible and scalable system for processing Big Data," in PVLDB 2014.
[31] T. Karkhanis and J. E. Smith, "A day in the life of a data cache miss," in Workshop on Memory Performance Issues, vol. 99, 2002.
[32] G. Keramidas, P. Petoumenos, and S. Kaxiras, "Where replacement algorithms fail: A thorough analysis," in CF '10: Proceedings of the 7th ACM International Conference on Computing Frontiers, ACM, New York, NY, USA, 2010, pp. 141-150.
[33] M. Kowarschik and C. Weiß, "An overview of cache optimization techniques and cache-aware numerical algorithms," in Algorithms for Memory Hierarchies, Springer Berlin Heidelberg, 2003, pp. 213-232.
[34] C. Moretti et al., "All-Pairs: An abstraction for data-intensive cloud computing," in Parallel and Distributed Processing, 2008 (IPDPS 2008), IEEE International Symposium on, IEEE, 2008.
[35] N. Megiddo and D. S. Modha, "Outperforming LRU with an adaptive replacement cache algorithm," IEEE Computer, IBM Almaden Research Center, 2004.
[36] E. J. O'Neil, P. E. O'Neil, and G. Weikum, "An optimality proof of the LRU-K page replacement algorithm," Journal of the ACM, vol. 46, no. 1, 1999, pp. 92-112.
[37] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazières, S. Mitra, A. Narayanan, G. Parulkar, M. Rosenblum, S. M. Rumble, E. Stratmann, and R. Stutsman, "The case for RAMClouds: Scalable high-performance storage entirely in DRAM," SIGOPS Operating Systems Review, vol. 43, no. 4, January 2010, pp. 92-105. DOI: http://dx.doi.org/10.1145/1713254.1713276.
[38] R. Power and J. Li, "Piccolo: Building fast, distributed programs with partitioned tables," in OSDI, vol. 10, 2010.
[39] M. K. Qureshi et al., "A case for MLP-aware cache replacement," ACM SIGARCH Computer Architecture News, vol. 34, no. 2, 2006, pp. 167-178.
[40] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely Jr., and J. Emer, "Adaptive insertion policies for high-performance caching," in ISCA-34, 2007.
[41] K. Rajan and R. Govindarajan, "Emulating optimal replacement with a shepherd cache," in Proceedings of the International Symposium on Microarchitecture, 2007.
[42] J. Shafer, S. Rixner, and A. L. Cox, "The Hadoop distributed filesystem: Balancing portability and performance," in Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on, IEEE, 2010.
[43] A. Shinnar et al., "M3R: Increased performance for in-memory Hadoop jobs," Proceedings of the VLDB Endowment, vol. 5, no. 12, 2012, pp. 1736-1747.
[44] A. J. Smith, "Cache memories," ACM Computing Surveys, vol. 14, no. 3, 1982, pp. 473-530.
[45] A. J. Smith, "Disk cache—miss ratio analysis and design considerations," ACM Transactions on Computer Systems, vol. 3, no. 3, 1985, pp. 161-203.
[46] E. G. Ularu et al., "Perspectives on Big Data and Big Data Analytics," Database Systems Journal, vol. 3, no. 4, 2012, pp. 3-14.
[47] H. T. Vo, S. Wang, D. Agrawal et al., "LogBase: A scalable log-structured database system in the cloud," in PVLDB 2012.
[48] L. Wang et al., "BigDataBench: A big data benchmark suite from internet services," in High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on, IEEE, 2014.
[49] P. Wang et al., "Transformer: A new paradigm for building data-parallel programming models," IEEE Micro, no. 4, 2010, pp. 55-64.
[50] D. Yan, X. S. Yin, C. Lian, X. Zhong, X. Zhou, and G. S. Wu, "Using memory in the right way to accelerate Big Data processing," Journal of Computer Science and Technology, vol. 30, no. 1, 2015, pp. 30-41.
[51] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), USENIX, San Jose, CA, 2012, pp. 15-28. https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia.
[52] H. Zhang et al., "In-memory big data management and processing: A survey," 2015.