Big Data leverages Cloud Computing opportunities

Logica Bănică, Faculty of Economics, University of Pitesti, Targu din Vale 1, Pitesti, Romania, [email protected]
Viorel Păun, Faculty of Mathematics-Informatics, University of Pitesti, Targu din Vale 1, Pitesti, Romania, [email protected]
Cristian Ștefan, Faculty of Mathematics-Informatics, University of Pitesti, Targu din Vale 1, Pitesti, Romania, [email protected]

ABSTRACT
Nowadays information is available on a large scale, spatial and temporal barriers have been surpassed, and the limits of data storage volume are constantly being pushed further. Search engines face the biggest confrontation with huge amounts of data. One of the most important data sources today originates in social media sites and mobile communications, which, together with data from business environments and institutions, have led to the definition of a new concept, known today as Big Data. The paper discusses aspects regarding the evolution of two technologies (Big Data and Cloud Computing) and their fusion. We have also concentrated our efforts on the principles for acquiring, organizing and accessing huge datasets in a Cloud environment, offering a 3-step system architecture based on currently available software solutions.

Indexing terms/Keywords
Big data; Cloud computing; NoSQL databases.

Academic Discipline And Sub-Disciplines
Information technologies; Databases.

SUBJECT CLASSIFICATION
Computer Science

TYPE (METHOD/APPROACH)
Model Designing.

Council for Innovative Research
Peer Review Research Publishing System
Journal: INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY, Vol. 13, No. 12
ISSN 2277-3061
www.ijctonline.com, [email protected]
October 28, 2014

1. INTRODUCTION

The Internet and its most important service – the World Wide Web – have led to the re-evaluation and redefinition of certain terms, like "database". Relational databases no longer have the power to accommodate the wealth of information arising from the Internet, from large organizations and multinational companies, or from telecom operators that expand their operations quickly; this is how the Big Data domain was established, aiming to help entities organize enormous quantities of data. The issues related to this new area come from hardware (storage devices, processing power), from the software layer (volume management for just-in-time knowledge delivery, instant communication between large communities of users, very fast information retrieval and filtering), and from the operations point of view (high availability, security and confidentiality). Starting from 2009, IDC's Digital Universe Study has predicted that, by 2020, the volume of digital data will grow 44-fold, to 35 ZB per year. Given this evolution rate and the challenges for Internet users, IDC asks a legitimate question: "How big is Big Data today, and how big will it be in the next five years?" [1].

Even the first Cloud-related initiatives, in the years around 2000, were related to a great extent to the requirement to store and access important data volumes, and thus the convergence of these two areas was natural and foreseeable. Major software companies showed their interest in processing large amounts of data in a shared environment many years ago, by introducing collaborative software. Business analysts are also faced with challenges regarding data-flow filtering and the implementation of Business Intelligence or Forecasting strategies for their activity.

Starting from the current studies concerning the two technologies and their integration, we have concentrated our efforts on the principles for building, organizing and accessing huge datasets in the Cloud environment. The paper aims at analysing these aspects, offering a 3-step model architecture to other researchers involved in this domain. Our work is divided into two main sections and a Conclusions part. Section 2 presents the two concepts involved (Big Data and Cloud Computing), summarizing the state of the art, and investigates several methods to store, filter and process a large volume of data with the help of commercially-available software solutions. Section 3 includes our proposed solution for a system able to accommodate Big Data in a Hybrid Cloud environment, by presenting the software platforms for each of the three levels of the designed architectural model. Conclusions close the paper and suggest ways of improving our future research in this domain.

2. LITERATURE REVIEW

2.1 The Cloud Computing characteristics

Migrating applications and data to the Cloud is a frequent topic in our research and, from this point of view, we consider this paper a continuation of our work related to next-generation Internet architectures. Cloud computing implementations have been quickly adopted by individual and academic users, and they are growing fast in the business environment because of the well-balanced cost-to-performance ratio compared to privately-owned hardware and software platforms [2]. The definition of Cloud Computing, according to the National Institute of Standards and Technology (NIST), is the following: "a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction" [3]. This definition includes the basic elements of the concept [2][3]:

- features of cloud computing: resource pooling, broad network access, on-demand self-service;
- cloud service models (software, platform and infrastructure);
- deployment models (private, community, public and hybrid) that provide direction to deliver cloud services.

The most important features of cloud computing solutions are the following:

- usage of Internet technologies that are able to allocate resources on demand, based on current client requirements, and ubiquitous remote access;
- high availability of the Cloud environment, which is able to detect failed nodes and isolate them without affecting the normal operation of the system, due to its capacity to integrate mass storage and high-performance computing power [4];
- a high level of security, the cloud provider having the mission to ensure data security and backup, configure load balancing, deploy the software etc. [4];
- on-demand expansion of storage capacity, and the capacity to grow as far as required;
- highly-redundant and minimal-downtime features that allow the client to reduce business infrastructure costs.

In addition to the three models for Cloud computing services defined by NIST (Infrastructure as a Service - IaaS, Platform as a Service - PaaS, and Software as a Service - SaaS), other models have emerged on the IT market: Business Process as a Service - BPaaS, Desktop as a Service - DaaS, Storage as a Service - StaaS and Security as a Service - SECaaS. Briefly, the services provided by each layer, as described on the NIST website and in the specialty literature, are [3][5][6]:

- Infrastructure as a Service (IaaS) – the computing resources, whether virtualized or physical, are offered by the provider, and remote deployment is done by the clients;
- Platform as a Service (PaaS) – a model based on hardware and basic software (OS) from the provider, with application software installed and managed by the client;
- Software as a Service (SaaS) – usually the preferred option, as it involves all software layers being implemented by the hosting company and leased to the client as needed;
- Business Process as a Service (BPaaS) – an emerging model offering additional business functions, such as payment processing and human resources management [7];
- Desktop as a Service (DaaS) – takes PaaS one step further and offers the client a coherent desktop experience, delivered over the network and involving a complex application stack [8];
- Storage as a Service (StaaS) – comes as an addition to other cloud services and spares the client the costs of data archival, as the provider already has high-availability, redundant SAN facilities that can be leased [9];
- Security as a Service (SECaaS) – adds an extra layer of protection over the client's resources, by offering high-speed traffic analysis with dedicated equipment and trained personnel [10].

NIST also defines the key implementation models for Clouds [3]:

- Public Cloud – the provider owns the infrastructure and offers public access for free or through a pay-per-use model;
- Private Cloud – resources are used internally by an organization or company, and access is based on security policies defined within the private entity;
- Community Cloud – involves a shared environment, usually with common access policies, dedicated to a specific task that all involved partners agree upon;
- Hybrid Cloud – a mix of the other models, involving a dedicated provider and complex security models based on the requirements of all participating clients.

More and more organizations and companies are making use of the various types of services offered by Public Cloud implementations, as the providers have expanded their service portfolios. Important corporations like Google, IBM, Sun, Amazon, Cisco, Intel and Oracle have invested in cloud computing and can now offer the public a wide range of cloud-based solutions. For example, at the beginning of Cloud computing, in 2002, Amazon Web Services provided a suite of cloud-based services including storage, computation and even human intelligence through the Amazon Mechanical Turk. Then, in 2006, Amazon launched Elastic Compute Cloud (EC2) as a commercial web service that allows small companies and individuals to rent computing resources on which to run their own applications. In 2009, as Web 2.0 gained acceptance, Google and others started to offer browser-based enterprise applications, through services such as Google Apps. In 2010, Rackspace Hosting and NASA launched an open-source cloud-software initiative known as OpenStack, a project intended to help organizations offer cloud-computing services running on standard hardware. In 2011, IBM announced the IBM SmartCloud framework to support Smarter Planet [11], and in 2012 Oracle announced the Oracle Cloud. While aspects of the Oracle Cloud are still in development, this cloud offering is poised to be the first to provide users with access to an integrated set of IT solutions, including the Applications (SaaS), Platform (PaaS) and Infrastructure (IaaS) layers [12].

Besides these well-known companies, there are many other providers of cloud storage services on the Web, and their number is constantly increasing. The competition between these providers involves many aspects: the storage capacity, the security of hosted data, the bandwidth offered to the customers, but also the subscription cost. The most popular companies that offer some form of cloud storage are the following:

- Google Docs allows users to upload documents, spreadsheets and presentations to Google's data servers and to edit these files using a web-based application. Users can also publish documents so that other people can read them or even make edits (sharing and real-time collaboration).
- Web e-mail providers like Gmail, Hotmail and Yahoo! Mail store e-mail messages on their own servers, making them accessible to users from any computer or Internet-connected device.
- Sites like Flickr and Picasa host millions of digital photographs, organized as online photo albums, by letting users upload pictures directly to the services' servers and share access to them.
- YouTube hosts millions of user-uploaded video files, which are freely accessible on the web.
- Social networking sites like Facebook, Twitter and MySpace allow members to post messages and pictures and to interact with other users, creating communities.

Like any software technology that is constantly in development, Cloud Computing has drawbacks, some that can be solved quickly and some that are still being investigated by researchers and specialists in order to find the best solutions. Some of the drawbacks of Cloud environment solutions are:

- data security risks – secure access policies are required in order to keep unauthorized users away from the business data;
- the data loss challenge – all databases are required to implement automatic backup and transaction-based queries, thus mitigating the risk of affecting service quality;
- system unavailability – network outages and OS crashes can negatively affect the performance of the solution, so redundant architectures must be implemented by providers.

As an example, the IBM Cloud security policy includes new ways to ensure protection from unauthorized access to the Cloud [2]:

1. a layered security approach (different sets of user privileges, grouped into access roles);
2. firewall policies, with different rule sets for Intranet and Internet access;
3. the use of cryptography for data of great importance;
4. smart solutions for traffic filtering and access monitoring, with automated alerting.

2.2 Big Data concept

The increase in the global data volume, estimated at 40% per year and amounting to a 44-fold growth between 2009 and 2020 (according to the McKinsey Global Institute), has started to raise data collection and time-efficient processing problems for big companies. The term "big data" was first used in 1997, by two NASA researchers, Michael Cox and David Ellsworth, to describe "massive amounts of information that cannot be processed and visualized" [13], obviously a far outdated definition today. Over the past few years, big data has been defined in many different ways, and so there is some confusion surrounding the concept. What is the appropriate definition of Big Data as seen by the authors of this paper? Big data is a massive collection of shareable data originating from any kind of private or public digital sources, which represents on its own a source for ongoing discovery, analysis, and Business Intelligence and Forecasting. Not only websites, weblogs, webmail, mobile and social networks provide data; financial and governmental institutions and the business environment also offer a wealth of important data derived from statistical, financial or production information. It is true that the volume of data originating from social media sites, smart phones and other consumer devices is growing at an exponential rate, but other categories of data sources have a much more substantial amount of useful content that can be used by researchers to empower the future development of human society.

Gartner researcher Doug Laney described Big Data by three key characteristics (the three Vs): Volume, Velocity and Variety [14]. Oracle defined Big Data in terms of four Vs: Volume, Velocity, Variety and Value [15]. Another approach refers to Big Data by five Vs and a C, adding two additional dimensions: Variability and Complexity [16]. Although the authors of this paper favour the Oracle description (the four Vs), we will further describe the content and importance of every key characteristic in the most comprehensive approach (five Vs and a C) [17]:

- Volume: the size of the data. It is a very relative aspect, as lower and upper limits cannot be defined for Big Data and these boundaries may grow daily, weekly or monthly;
- Velocity: flows of data are increasing, and the high-bandwidth networks in use today are able to carry more and more information that needs to be processed as fast as possible;
- Variety: big data is a combination of all types of formats, unstructured and multi-structured;
- Value: refers to the potential commercial or scientific value of Big Data. For enterprises it is important to use data originating from social media in combination with internal data in order to develop their business, which can lead to enhanced productivity, a stronger competitive position and greater innovation. The efforts to acquire, analyse, filter and process this amount of data must have a target: to obtain information or find solutions for solving problems and developing projects in the business environment or research domains. The most profitable business model in this area is represented by social media sites (such as Facebook, Twitter and LinkedIn), built on the Big Data concept;
- Variability: the streams of data can vary dramatically in richness over time, as online events bring together large numbers of users and generate high server loads during peak hours [16];
- Complexity: as there are so many sources of data involved, it is a challenge to process, filter and organize this great pool of information in reasonable time and to detect correlations [16].

In the IT world, when a new concept arises, it is always compared with existing technologies. Thus, in the specialty literature, a comparison between Big Data and Data Warehousing (DW) is frequently discussed. The main difference between the two is related to the type of processed data: DW uses structured data, built upon a relational database, while Big Data involves a mix of unstructured and multi-structured data that comprises the volume of information. According to [18], the two terms come from different zones, as Big Data is considered a technology, able to store and manage huge amounts of data, whilst Data Warehousing is an architecture, meaning a method of organizing data. Oracle specialists see things differently, in that they consider Big Data an evolution of Data Warehousing [19]. Enterprise data warehousing analyses an organization's data, having also sourced data from other databases. An enterprise's data warehouse contains data from its enterprise financial systems, its customer marketing systems, its billing systems, its point-of-sale systems, and so on [19]. However, an important data segment remains that is not captured in the DW: clickstream logs, sensor data, location data from mobile devices, customer emails, chat transcripts, and surveillance videos. This is the point where Big Data systems prove their importance, allowing enterprises to analyse and extract business value from this unstructured information. The Big Data architecture is an extension of the Data Warehouse architecture. The relational database which was at the core of the DW architecture will still be used for storing a company's core transactional data, and it will be augmented by a big-data system fed by the new sources of information: machine-generated log files, social-media data, and videos and images [19]. This data can either be processed directly from the NoSQL database, or be converted to the relational database model and made accessible from SQL-based environments.

2.3 Meeting Big Data with Cloud Computing

Cloud computing has brought the most advanced technological resources available on the market within reach of any domain, from individuals to small and medium enterprises, through the open gate to hardware and software resources available at low cost. Meanwhile, this opportunity has led to a significant increase in the volume of produced data and to the coining of the Big Data concept. Not only the business environment is interested in collecting information from unconventional data sources; government agencies, national institutions and other organizations also analyse and extract meaningful insight from this maze of data, be it security related or simply behavioural patterns of consumers [20].

It could be said that there is a strong correlation and symbiosis between these two technologies, as any Cloud Computing implementation includes a high-capacity storage solution and any Big Data platform uses distributed information collection and processing, as in Cloud architectures. As mentioned, there is no limit to the storage space allocated for this kind of data, and thus Cloud platforms face problems related to efficient scaling of storage capacity. Another challenge comes from the evolution from structured data in relational databases to fast processing of large, unstructured data sets [20]. Hybrid Clouds are often the preferred option for companies, which may use Private Clouds to manage internal structured data, while Public Clouds allow the extension of their resources and the addition of new service models.

IBM researchers have come up with a new approach to cloud services for business, called Analytics as a Service (AaaS) [21]. Starting from their experience of working with large companies, they observed that enterprises often keep their most sensitive data in-house, while the volumes of external data or archives (Big Data) may be located in a Public Cloud environment. To extract value from these collections, it is necessary to implement a new cloud service, Analytics as a Service (AaaS), having the following key capabilities [22]:

- capturing and extracting structured and unstructured data from different sources;
- managing and controlling data in accordance with company policy and specific requirements;
- performing data integration, analysis and transformation in order to deliver the required information.

From the studies that were published and briefly mentioned in this chapter, it is clear that both technologies are evolving constantly and their interconnection will lead to a new, hybrid concept that will reunite the best of what each has to offer, with the aim of cloud-based data acquisition and processing for very large volumes of information, available anywhere on demand.

3. METHODOLOGY

In order to build a Big Data infrastructure in the cloud, specific software applications are required, born from applying the newest trends in technology combined with the management capabilities needed to organize and distribute information. In this chapter we briefly present a new type of database used for storing Big Data, NoSQL, and a series of successful implementations of it already on the market today.

NoSQL databases (also called "Not Only SQL") are a new type of database that manages unstructured data using several models: key-value stores, graph and document data. NoSQL databases may be implemented in a distributed architecture, based on nodes able to process and store data; this distributed architecture has been the solution adopted by the biggest software companies, such as Google and Amazon. While relational databases are based on the ACID model (Atomic, Consistent, Isolated and Durable), where data stays consistent because only successfully committed transactions modify it, NoSQL databases are defined by the more flexible BASE model (Basically Available, Soft state, Eventually consistent), where high availability is the primary concern, not consistency [23]. Thus, developers building NoSQL data structures are required to completely shift paradigm, as the main difference between relational databases and this new kind of database is the absence of relations between records.

There are four categories of NoSQL databases [24]:

- Document – for managing data from documents in different format standards, such as XML or JSON. It is a complex category of storage that enables data querying. Document databases are indicated when working with large numbers of documents that can be stored in structured files such as text documents, emails or XML documents.
- Key-value store (KVS) – for designing databases where each record has a unique key attached in order to allow access to the record's information, represented as the value (a minimal sketch of this model is given after the list). Key-value datasets are more appropriate for the management of stocks, products and real-time data analysis, providing high data-retrieval speed when most of the data is mapped into memory.
- Column – refers to a database structure similar to standard relational databases, data being stored as sets of columns and rows. Columns that store related data that is often retrieved together may be grouped. Column databases are recommended when the number of write operations exceeds reads, for example in logging.
- Graph – for designing structures where data may be represented as a graph with interlinked elements. Social networking and maps are the main applications in this category. Graph databases are more appropriate for working with connected data, for example to analyse social connections among a set of individuals.
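To make the key-value model concrete, the following minimal sketch (in Java, using only the standard library) treats each record as an opaque JSON value addressed by a unique key. It is a conceptual illustration of the access pattern, not a real NoSQL engine; the class and key names are our own assumptions.

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Conceptual sketch of the key-value model: every record is addressed by a
// unique key and the value is opaque to the store (here a JSON string).
// Real engines add persistence, replication and eventual consistency on top.
public class KeyValueSketch {
    private final Map<String, String> store = new ConcurrentHashMap<>();

    // Write a record; the store does not interpret the value.
    public void put(String key, String jsonValue) {
        store.put(key, jsonValue);
    }

    // Read a record by its key; absence is a normal outcome.
    public Optional<String> get(String key) {
        return Optional.ofNullable(store.get(key));
    }

    public static void main(String[] args) {
        KeyValueSketch kv = new KeyValueSketch();
        kv.put("product:1001", "{\"name\":\"disk\",\"stock\":42}");
        kv.get("product:1001").ifPresent(System.out::println);
    }
}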

At the moment, there are more than 150 NoSQL databases with different features and optimizations, which focus on keeping data consistent and retrieval very fast. The evaluation of NoSQL implementations takes into account their storage capacity, but also their execution time for different operations [24]. To compare the performance of different types of NoSQL databases, Yahoo Inc. launched the Yahoo Cloud Serving Benchmark (YCSB) software. It provides a framework and a set of workloads for evaluating the performance of different types of NoSQL databases and cloud serving stores, in a distributed or non-distributed environment.

Hadoop is an open-source software platform that enables the processing of large data sets in a distributed computing environment. This processing tool distributes data across a cluster of balanced machines working in parallel. The philosophy of this software is to do the processing in proximity to the location where data is stored and not to bring the data to the computation units, preventing unnecessary network transfers [25]. The machines in a Hadoop cluster both store and process data, so they need to be configured to accomplish both requirements. The platform is capable of managing huge quantities of information, acquiring "web-scale" data at an astonishing speed and offering a plethora of design patterns that reduce some of the complexity associated with obtaining semi-structured data.

Generally, a Hadoop distribution has two core components: the Hadoop Distributed File System (HDFS) and the MapReduce engine [26]. HDFS is the component designed to split large data files into subsets of records which are managed by different nodes in the cluster and are also replicated across several machines, as a protection measure [25]. As a working principle, Hadoop is based on the MapReduce framework, a programming paradigm built around two tasks (a minimal word-count job illustrating both is sketched after this list):

- Map – transforms the received streaming data into individual elements;
- Reduce – takes the mapped data and processes it on smaller sets of tuples in order to simplify the analysis.
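The sketch below shows the two tasks in the canonical word-count form, using the standard Hadoop MapReduce Java API (Mapper, Reducer, Job): the Map task emits a (word, 1) pair for every token and the Reduce task sums the counts per word. It is a minimal illustration only, not the processing pipeline proposed later in the paper.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: transform each input line into (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}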


Typically, HDFS is the storage system for both the input and output of MapReduce jobs. MapReduce is the key paradigm that Hadoop uses to distribute work around a cluster, so that operations can run in parallel on different nodes of the cluster and data is processed locally. Hadoop has the advantage that it can be used in a datacentre as well as in the Cloud environment. Today, Hadoop is the preferred solution for large-scale computing architectures and, practically, Big Data has become synonymous with technologies like Hadoop and the NoSQL class of databases [26]. Many successful NoSQL processing solutions are open-source products, such as Apache Hadoop, launched in 2005 and used by Yahoo and Facebook. Big software corporations, such as IBM and Oracle, have their own distributions of Hadoop. It is also an enabler of certain types of NoSQL distributed databases (such as HBase) [27]. Figure 1 depicts the functional structure of Hadoop: the stages that data passes through, from acquisition to storage in the NoSQL database, on which queries can be run in order to analyse, extract insights and perform business intelligence and forecasting tasks.

[Figure 1: input data is distributed across the cluster nodes (Node 1 to Node N) for the mapping processes, the intermediate data from the mappers is stored temporarily, and the reducing processes deliver the output data.]

Fig 1: Big Data Processing with Hadoop

In order to achieve very fast data handling, Massively Parallel Processing (MPP) technology and analytic databases are used. They perform well at computing and aggregating results, but lack data-acquisition speed, and thus are recommended as a backend for reporting and scientific environments, not as a transactional database for front-end systems [28]. MPP shares several architectural characteristics with Hadoop's MapReduce approach of parallel computation over large data sets.

Complex Event Processing (CEP) is the type of technology that processes continuous data flows in real time and identifies meaningful events or patterns in data sources coming from the business environment and from financial and governmental institutions. These events may be related to an organization and involve its business data, or other types of information, such as text messages, social media, traffic reports etc. The vast amount of information available about events is sometimes referred to as the event cloud [29]. Software companies such as Oracle, Twitter, Sybase/SAP and Microsoft have launched different versions of CEP engines, and they have successfully applied a fusion between CEP and Hadoop/MapReduce technologies in order to integrate real-time processing and batch processing when dealing with Big Data. For example, Twitter has used the product Storm (now available as open source) since 2011; Storm processes streaming data in parallel, while MapReduce processes data in batches, so the processing is distributed between Storm and MapReduce. Another example of integrating CEP and MapReduce is provided by Microsoft, which launched the StreamInsight CEP engine, designed to work in collaboration with the Hortonworks Hadoop on Windows distribution.
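To illustrate the CEP principle of reacting to patterns in a continuous stream (rather than in batches), the sketch below implements a simple sliding-window rule in plain Java: an alert fires as soon as a threshold number of matching events arrive within a time window. It is a conceptual example only; engines such as Storm or StreamInsight provide this capability at scale through their own APIs, which are not used here.

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

// Conceptual sketch of a CEP-style rule: detect "at least `threshold`
// matching events inside a sliding time window" as events arrive.
public class SlidingWindowDetector {
    private final Deque<Instant> matches = new ArrayDeque<>();
    private final Duration window;
    private final int threshold;

    public SlidingWindowDetector(Duration window, int threshold) {
        this.window = window;
        this.threshold = threshold;
    }

    // Feed one event; returns true when the pattern fires.
    public boolean onEvent(Instant timestamp, boolean matchesPattern) {
        if (!matchesPattern) {
            return false;
        }
        matches.addLast(timestamp);
        // Drop events that fell out of the sliding window.
        Instant cutoff = timestamp.minus(window);
        while (!matches.isEmpty() && matches.peekFirst().isBefore(cutoff)) {
            matches.removeFirst();
        }
        return matches.size() >= threshold;
    }

    public static void main(String[] args) {
        SlidingWindowDetector detector =
                new SlidingWindowDetector(Duration.ofSeconds(10), 3);
        Instant t0 = Instant.now();
        System.out.println(detector.onEvent(t0, true));                 // false
        System.out.println(detector.onEvent(t0.plusSeconds(2), true));  // false
        System.out.println(detector.onEvent(t0.plusSeconds(4), true));  // true
    }
}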


4. A SCENARIO FOR BIG DATA ON CLOUD ARCHITECTURE

Our research on current Big Data and Cloud computing approaches has led us to the model we propose in this chapter, a model focused on data acquisition, stages of processing, archiving and fast retrieval for large data sets, both structured and unstructured. It relies on Hybrid Cloud architectures, hosted at the SaaS level. The chain of operations in our model is as follows:

a) All structured and unstructured business data (messages, images, videos) are collected from different sources using open-source Apache Hadoop and stored in its batch-optimized file system.

- A distributed file system spreads multiple copies of the data across different machines and offers multiple locations to run the mapping.
- A job scheduler (in Hadoop, the JobTracker) keeps track of which jobs are executing, schedules individual Maps, Reduces or intermediate merging operations to specific machines, monitors the success and failure of these individual tasks, and works to complete the entire batch job.
- The file system and job scheduler can be accessed through the web and a dedicated API, in order to read and write data and to submit and monitor MapReduce jobs (a small sketch of this programmatic access follows the list).
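As a hedged illustration of the dedicated API mentioned above, the sketch below writes and reads a file through Hadoop's FileSystem Java API; the namenode address and the file path are hypothetical placeholders, and in practice the address comes from the cluster's core-site.xml (fs.defaultFS).

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsAccessSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.org:8020"); // hypothetical host
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/bigdata/incoming/sample.txt"); // illustrative path

        // Write a small file into the distributed file system.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("collected event record".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back; blocks are fetched from whichever datanodes hold replicas.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}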

The data flow is processed based on the Map and Reduce functions:

- Split large input data into subsets, which get assigned to a Map function;
- Map function – assigns file data to smaller, intermediate pairs;
- Partition function – finds the correct reducer: given the key and the number of reducers, it returns the desired Reduce node (a sketch of a custom partitioner is given after this list);
- Compare function – the input for Reduce is pulled from the Map intermediate output and sorted according to this compare function;
- Reduce function – takes intermediate values and reduces them to a smaller solution handed back to the framework;
- Store output data.
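A possible realization of the Partition step, under the assumption that keys are prefixed with a domain label (an illustrative convention, not part of the original model), is a custom Hadoop Partitioner that routes all keys of one domain to the same reducer:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of the Partition function: given a key and the number of reducers,
// return the index of the Reduce task that will receive it. Hadoop's default
// is a hash partitioner; the domain-prefix rule below is purely illustrative.
public class DomainPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String k = key.toString();
        int separator = k.indexOf(':');
        String domain = separator > 0 ? k.substring(0, separator) : k;
        // Keys of the same domain land on the same reducer.
        return (domain.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
// Registered on a job with: job.setPartitionerClass(DomainPartitioner.class);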

For both SQL and NoSQL databases, the proposed solution involves a software classifier that can distinguish between domains and subdomains, a categorization that is standard for Data Warehouses and that is based on specific criteria for unconventional data representations. As an example, one criterion could be counting word occurrences and comparing the counts with Domain Dictionaries. A metadata level is added in order to direct data to specific storage locations (SQL database or unstructured datasets).

b) Real-time analysis uses a Complex Event Processing (CEP) engine, which identifies a certain trend or a specific piece of information through the use of parallel queries. The identification of trends, patterns or clustering models is based on keywords, which trigger parallel processing of the data stored for the required domain/subdomain, using both NoSQL and SQL systems, preferably organized as Data Warehouses.

By applying the principle of dividing a job into many sub-tasks, and by taking into account the benefits and drawbacks of each category of tools used for Big Data, we have separated the batch and parallel operations into a three-level model, as depicted in Figure 2:

1. Gathering and processing all kinds of data (structured and unstructured) of interest for Big Data with Apache Hadoop;
2. Attaching Domain/Subdomain metadata to the data (with different treatment for NoSQL and SQL databases);
3. Parallel processing of the NoSQL and SQL databases and finding a pattern, a trend or a forecast, as the response to the search requirements.

1) The first level is designated for processing any type of data and is based on a Private Cloud hosting a Hadoop cluster. We must take into account that the largest volumes of data are the ones originating from social media, and they contain a relatively low amount of useful information, compared to the smaller volume of data from business activities or government institutions, which contains a much bigger percentage of useful information. This level can be based on Private Clouds or Grid computing environments, leading to the development of Data Warehouses for organizations and companies which must protect their information, or on a Public Cloud if data is coming from social media sites. Apache Hadoop is used for data collection at this level.

2) At the second level, a new software layer will be implemented in order to execute the first classification of the data stored in the Hadoop cluster nodes, by separating the data into domains and subdomains, using dictionaries and a rule set that inserts domain-specific and type-specific (text, picture, video) metadata. For unconventional data, the classification can be achieved by using a simple algorithm, such as counting the number of occurrences of words found in the dictionaries or related to the given topic (a minimal sketch of such a classifier is given below). This is a level where information resides temporarily; it can be based on Public Clouds and does not require significant computing resources. For structured data, there are several tools for metadata control in the Data Warehouse, such as MetaStage, which provides metadata management across a spectrum of warehousing tools, including business and modeling tools, online analytical processing (OLAP) servers, and BI reporting tools [30].
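A minimal sketch of such a word-occurrence classifier is given below; the two domain dictionaries are tiny illustrative placeholders, not actual domain vocabularies, and the class name is our own.

import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Sketch of the classification rule described above: count how many words of
// a document occur in each Domain Dictionary and label the document with the
// best-scoring domain.
public class DomainClassifier {
    private final Map<String, Set<String>> dictionaries = new HashMap<>();

    public DomainClassifier() {
        dictionaries.put("finance", Set.of("invoice", "payment", "credit", "market"));
        dictionaries.put("healthcare", Set.of("patient", "diagnosis", "clinic", "treatment"));
    }

    // Returns the domain whose dictionary matches the most words, or "unclassified".
    public String classify(String text) {
        String bestDomain = "unclassified";
        int bestScore = 0;
        String[] words = text.toLowerCase(Locale.ROOT).split("\\W+");
        for (Map.Entry<String, Set<String>> entry : dictionaries.entrySet()) {
            int score = 0;
            for (String word : words) {
                if (entry.getValue().contains(word)) {
                    score++;
                }
            }
            if (score > bestScore) {
                bestScore = score;
                bestDomain = entry.getKey();
            }
        }
        return bestDomain;
    }

    public static void main(String[] args) {
        DomainClassifier classifier = new DomainClassifier();
        System.out.println(classifier.classify("The clinic sent the patient a new treatment plan"));
    }
}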

Fig 2: A scenario for Big Data on Cloud architecture

3) The third level is focused on the storage of the processed information, now carrying metadata, into NoSQL and SQL databases. It takes place in the Public Cloud and uses significant storage resources, as well as parallel processing. Because transforming information originating from media sources into a form suitable for a traditional database would be difficult and usually useless, we consider that an efficient way to store this kind of data is to build Index Catalogues from the accompanying metadata and to place the original information into separate storage spaces according to its type. The third level also provides access to the structured and semi-structured version of Big Data. Our model additionally involves a search engine for the Index Catalogues, based on keywords and data types (a small sketch of such a catalogue is given below). Parallel search and processing are conducted on distributed databases, so that the user can take advantage of the results directly or with the help of the decision support module (business intelligence or forecasting software), in order to emphasize a trend, a pattern or perform a forecast.
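The sketch below illustrates the Index Catalogue idea: metadata entries (domain, data type, keywords) point to the storage location of the original object, and the keyword/type search runs over the catalogue only. All names and the storage URI are illustrative assumptions, not an existing product API.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of an Index Catalogue: the payload stays in its own storage space,
// while searches use only the metadata entries kept here.
public class IndexCatalogue {

    public record Entry(String domain, String dataType, List<String> keywords, String storageUri) {}

    private final Map<String, List<Entry>> byKeyword = new HashMap<>();

    public void index(Entry entry) {
        for (String keyword : entry.keywords()) {
            byKeyword.computeIfAbsent(keyword.toLowerCase(), k -> new ArrayList<>()).add(entry);
        }
    }

    // Keyword + data-type search over metadata only.
    public List<Entry> search(String keyword, String dataType) {
        List<Entry> hits = new ArrayList<>();
        for (Entry e : byKeyword.getOrDefault(keyword.toLowerCase(), List.of())) {
            if (e.dataType().equalsIgnoreCase(dataType)) {
                hits.add(e);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        IndexCatalogue catalogue = new IndexCatalogue();
        catalogue.index(new Entry("retail", "video",
                List.of("store", "queue"), "s3://archive/videos/cam01-2014-10-28.mp4"));
        System.out.println(catalogue.search("queue", "video"));
    }
}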

5. CONCLUSIONS AND FUTURE WORK

Big Data and Cloud Computing are two evolving technologies, promoted by the biggest software companies. Big Data in the Cloud solutions from corporations like Oracle, IBM and Fujitsu are increasingly capable but tend to be very expensive. Thus, the evolution of open-source products in this domain will allow individuals and small and medium enterprises to benefit from this new trend that empowers today's business.

In our future research we intend to analyse the performance of some of the NoSQL engines (Level 1 of the proposed model), testing them against read, update and mixed read/update workloads. Using the Yahoo Cloud Serving Benchmark (YCSB) software, we could evaluate the performance of several open-source NoSQL implementations, in order to interpret the results and recommend them based on the characteristics requested. We also plan to evaluate the behaviour of NoSQL engines by running them first in non-distributed environments, and then in distributed and parallel environments.

REFERENCES

[1] Villars, R., Olofson, C., Eastwood, M., 2011. Big Data: What It Is and Why You Should Care, IDC White Paper, available at http://sites.amd.com/sa/Documents/IDC_AMD_Big_Data_Whitepaper.pdf

[2] Banica, L., Burtescu, E., Stefan, C., 2014. Advanced Security Models for Cloud Infrastructures, Journal of Emerging Trends in Computing and Information Sciences, Vol. 5, No. 6, pp. 484-491.

[3] NIST Special Publication 800-145, 2011. Final Version of NIST Cloud Computing Definition, available at http://www.nist.gov/itl/csd/cloud-102511.cfm

[4] Banica, L., Stefan, C., Rosca, D. & Enescu, F., 2013. Moving from Learning Management Systems to the e-Learning Cloud, AWERProcedia Information Technology & Computer Science, pp. 865-874, available at www.awer-center.org/pitcs

[5] Gray, M., 2010. Cloud Computing: Demystifying IaaS, PaaS and SaaS, available at http://www.zdnet.com/news/cloud-computingdemystifying-iaas-paas-and-saas/477238

[6] Banica, L., Stefan, C., 2013. From Grid Computing to Cloud Infrastructures, International Journal of Computers & Technology, Vol. 12, No. 1, pp. 3187-3194.

[7] Sreekanth, I., 2010. Cloud Deployment and Delivery Models, available at https://www.ibm.com/developerworks/community/blogs/sreek/entry/cloud_4?lang=en

[8] VMware & Symantec Technical White Paper, 2011. Desktop as a Service with VMware and Symantec, available at https://www.vmware.com/files/pdf/bdesktop_as_a_service_WP_en-us_08-11.pdf

[9] Kulkarni, G., Sutar, R., Gambhir, J., 2012. Cloud Computing - Storage as Service, International Journal of Engineering Research and Applications, Vol. 2, Issue 1, pp. 945-950.

[10] Rashmi, R., Sahoo, G., Mehfuz, S., 2013. Securing Software as a Service Model of Cloud Computing: Issues and Solutions, International Journal on Cloud Computing: Services and Architecture, Vol. 3, No. 4, DOI: 10.5121/ijccsa.2013.3401

[11] Malhotra, N. S., 2014. Cloud Computing: Plugging into the Cloud, International Journal of Advanced Research in Computer Science & Technology (IJARCST), Vol. 2, Issue 2, pp. 234-239.

[12] Gouda, K. C., Patro, A., Dwivedi, D. & Bhat, N., 2014. Virtualization Approaches in Cloud Computing, International Journal of Computer Trends and Technology (IJCTT), Vol. 12, Issue 4, pp. 161-166.

[13] Cox, M., Ellsworth, D., 1997. Application-Controlled Demand Paging for Out-of-Core Visualization, Proceedings of the 8th IEEE Visualization '97 Conference, IEEE, available at http://www.evl.uic.edu/cavern/rg/20040525_renambot/Viz/parallel_volviz/paging_outofcore_viz97.pdf

[14] Laney, D., 2001. Application Delivery Strategies, Meta Group, available at http://blogs.gartner.com/douglaney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf

[15] Dijcks, J.P., 2013. Big Data for Enterprise, Oracle White Paper, available at http://www.oracle.com/us/products/database/bigdata-for-enterprise-519135.pdf

[16] Oguntimilehin, A., Ademola, E.O., 2014. A Review of Big Data Management, Benefits and Challenges, Journal of Emerging Trends in Computing and Information Sciences, Vol. 5, No. 6, pp. 433-438.

[17] Cackett, D., Bond, A. & Gouk, J., 2013. Information Management and Big Data, A Reference Architecture, available at http://www.oracle.com/technetwork/topics/entarch/articles/info-mgmt-big-data-ref-arch-1902853.pdf

[18] Inmon, B., 2013. Big Data Implementation vs. Data Warehousing, available at http://www.b-eye-network.com/view/17017

[19] Oracle White Paper, 2014. Oracle Database 12c for Data Warehousing and Big Data, available at http://www.oracle.com/technetwork/database/bi-datawarehousing/data-warehousing-wp-12c-1896097.pdf

[20] Ferkoun, M., 2014. Cloud computing and big data: An ideal combination, available at http://thoughtsoncloud.com/2014/02/cloud-computing-and-big-data-an-ideal-combination/

[21] Fattah, A., 2014. Cloud Analytics: Selecting Patterns of Integration, IBM Data magazine, available at http://ibmdatamag.com/2014/09/cloud-analytics-selecting-patterns-of-integration/

[22] Intel IT Center, 2013. Big Data in the Cloud: Converging Technologies, available at http://online.ipexpo.co.uk/index.php/layout/set/print/content/download/81982/1679544/file/big-data-cloudtechnologies-brief.pdf

[23] Mohamed, A. M., Altrafi, O. G., Ismail, M. O., 2014. Relational vs. NoSQL Databases: A Survey, International Journal of Computer and Information Technology, Vol. 3, Issue 3, pp. 598-601.

[24] Abramova, V., Bernardino, J. and Furtado, P., 2014. Experimental Evaluation of NoSQL Databases, International Journal of Database Management Systems, Vol. 6, No. 3, pp. 1-16.

[25] Yahoo Developer Network, 2007. Hadoop Tutorial, available at https://developer.yahoo.com/hadoop/tutorial/

[26] Lo, F., 2014. Big Data Technology, available at https://datajobs.com/what-is-hadoop-and-nosql

[27] Feinleib, D., 2012. Big Data and NoSQL: Five Key Insights, available at http://www.forbes.com/sites/davefeinleib/2012/10/08/big-data-and-nosql-five-key-insights/

[28] Deptula, C., 2013. With all of the Big Data Tools, what is the right one for me, available at www.openbi.com/blogs/chrisdeptula

[29] Brust, A., 2012. CEP and MapReduce: Connected in complex ways, available at http://www.complexevents.com/2012/03/10/cepand-mapreduce-connected-in-complex-ways/

[30] Eckerson, W., 2001. Meta Data Management in the Data Warehouse Environment - A Logical Approach to Meaningful Data Analysis, available at http://www.olap.it/Articoli/Ascential - meta data management in the dwh environment.pdf