2010 IEEE International Conference on Services Computing

Resource Allocation and SLA Determination for Large Data Processing Services Over Cloud

K Hima Prasad (1), Tanveer A Faruquie (2), L Venkata Subramaniam (3), Mukesh Mohania (4), Girish Venkatachaliah (5)

(1-4) IBM Research India, New Delhi, India: {hkaranam, ftanveer, lvsubram, mkmukesh}@in.ibm.com
(5) IBM India Software Lab, Bangalore, India: [email protected]

Abstract—Data processing on the cloud is increasingly used for offering cost-effective services. In this paper, we present a method for resource allocation for data processing services over the cloud that takes into account not just the processing power and memory requirements but also the network speed, reliability, and data throughput. We also present algorithms for partitioning data and for performing parallel block data transfer to achieve better throughput on the allocated cloud resources. We further present methods for optimal pricing and for determining service level agreements for a given data processing job. The usefulness of our approach is shown through experiments performed under different resource allocation conditions.

978-0-7695-4126-6/10 $26.00 © 2010 IEEE. DOI 10.1109/SCC.2010.92

I. INTRODUCTION

Cloud computing is gaining importance because of its cost-effective, easy-to-scale, and easy-to-use style of services [17]. Several companies are focused on offering hardware and software services over the cloud [1], [10], [11]. This paradigm of computing is attractive to customers because they do not need to invest in infrastructure and can use computing services online on a pay-per-usage basis. Businesses can also deal easily with scaling as the business grows or shrinks by acquiring or releasing cloud resources. Several companies offer computing and storage as services over the cloud. For example, in the case of Amazon's EC2 compute offering, a customer can request the amount of processing power needed, along with the memory requirements, to acquire a machine instance. If the requested resources are available at the cloud site, a machine instance is allocated, which can be rented for a given amount of time and released after usage. Similarly, storage [1] can be purchased for specific time periods. Salesforce.com [19] offers customer relationship management solutions over the cloud, where a customer uploads all the data prior to using the service. In most of these cases, resource provisioning and data transmission across the sites are decided and performed upfront. Often there is also a requirement for the data to reside at the customer site. In such cases the data resides at the customer site and gets processed at the cloud site as and when required. In such a situation it is necessary to take into account network reliability, data throughput requirements, and data processing complexity in order to provision optimal resources on the cloud. Not much attention has been given to resource provisioning for such real-time data processing requirements in the cloud computing literature.

Various applications such as data cleansing [5], data transformation, data analysis [14], and data manipulation are offered as services over the cloud [8]. All these applications are data intensive and involve transferring data in real time. Offering them over the cloud requires not just provisioning cloud resources but also controlling data throughput to complete the given task and meet business requirements. With more and more services being offered over the cloud, there is a need to look at various aspects of resource provisioning. Drawing up a service level agreement (SLA) when services are composed from various sites, connected over the internet in an uncontrolled environment, can be challenging. Under these circumstances, data flow and network throughput play a major role in resource provisioning and in defining SLAs. The cloud infrastructure provider has to allocate resources optimally to each request based not only on the computing requirement but also on the amount of data that is expected for each service request. Little work has been done on optimizing resource provisioning for a composite service put together using cloud computing infrastructures where different services are offered by various cloud sites, each with certain service guarantees [20]. Traditional resource allocation and pricing may not suit data processing services over the cloud, where new requests can arrive at any time and each data processing job needs to be delivered within a stipulated time while meeting given SLAs. An example of such an application is data cleansing, where the customer triggers a cleansing job overnight in batch mode for the data collected during the day. In this case the cleansing output has to be delivered within strict time bounds. Since the data is sent over the network and processed at the remote cloud site, the cloud resources need to be provisioned optimally, taking into account the actual data volumes that the underlying network will deliver.

Today's static algorithms work well when the number of data records to be processed is small or when the data is already available at the cloud site. In a dynamic scenario it is necessary to adapt to the complexities that data transfer across the network brings. This becomes especially important for large data processing services over the cloud, where huge amounts of data need to be transferred. To offer services in real time within stipulated SLAs, it becomes necessary to provision the resources optimally, taking into account the achievable data throughput and the network parameters that could affect the throughput. We also look at ways to improve data throughput using multi-session processing over partitioned data.

In this paper we focus on bulk data processing over the cloud within time bounds for completing the data processing task. We consider the hybrid case where resources from both the cloud and the customer site are utilized and the two sites are connected by a network of limited bandwidth. We study how data throughput affects resource provisioning and data partitioning strategies. We also study the effect of session lifetime on data partitioning for multi-session processing. We present algorithms, from the service provider perspective, for optimally allocating resources and estimating SLAs so that services can be offered to many customers without SLA violations. The main contributions of this paper are:
• A method for resource provisioning over the cloud for bulk data processing.
• An algorithm to determine SLA parameters for a given workload and available resources.
• Various data partitioning strategies to improve data throughput.

The outline of the paper is as follows. Section 2 gives the motivation with a few example applications. Section 3 gives details on data processing services over the cloud. Section 4 covers resource allocation over the cloud, discussing the various parameters affecting data processing over the cloud, and presents the algorithms for pricing and SLA determination. In Section 5 we present experimental results. We present related work in Section 6, followed by conclusions in Section 7.

Fig. 1. Data Processing Over Cloud

II. MOTIVATION

In recent times the amount of information that is relevant to the enterprise has seen a tremendous increase. Apart from the information that exists in external sources such as blogs, message boards, and webpages, the information collected within the enterprise in the form of transactional records, profile information, and customer interactions like emails, audio recordings, and call center notes is reaching petabyte scales. This demands an ever increasing investment from the enterprise in computing and human resources to manage, maintain, and utilize this huge amount of data. The first step in utilizing this information is to clean, transform, and extract the relevant pieces of information before storage. Often this step is a one-time process, because the subsequent information needs of the enterprise are fulfilled using the transformed and stored data. Therefore, it does not make economic sense for an enterprise to perpetually own the computing and human resources for this transient step. For example, in the implementation of a Master Data Management (MDM) solution, data is transformed to create the master data. Similarly, for an enterprise data warehouse implementation, data needs to be transformed before populating the data warehouse. In both these implementations the data transformation stage is an essential one-time step and is not used on a recurring basis in the entire workflow. These services could also be triggered periodically to process any incremental data gathered over a period of time. An example of such a case is when a data cleansing process is triggered every night before pushing the data into a warehouse.

In all such cases, where the customer requirement for data processing is only for a short time, the investment in hardware and software resources is not justified. Moreover, many of these one-time processes have significant commonality across enterprises, as more often than not they deal with similar data. To this effect, our "solution as a service" model can significantly decrease the amount of investment needed by customers to implement data management solutions. The reusability of this solution component across customers enables sharing the hardware and software resources across enterprises, thus amortizing the cost across multiple customers, with economies of scale allowing the service provider to offer the services at affordable prices.

The service model for this approach is different from traditional cloud models and requires a different set of service level agreements (SLAs) and performance guarantees. The hybrid approach, where the complete data processing workflow is executed partly in-house and partly on the cloud, requires that the execution meet customer-specific business constraints. Hence it is imperative that any cloud based data processing offering be accompanied by SLA guarantees on throughput and time to completion, along with an exact estimate of the resources needed to meet specific business needs. The SLAs should allow scalability to accommodate the different needs of customers. This is also beneficial for the customers, as it reduces cost and enables them to focus on the business aspects of the solution they are building.

III. DATA PROCESSING AS A SERVICE

In this section we give an overview of large scale data processing services over the cloud, where a customer can use the service to process his data without worrying about the hardware and software resources. We describe two methods in which such a service can be offered in an elastic manner and the challenges each brings in terms of data throughput and resource provisioning at the cloud site. Figure 1 shows a typical data processing task over the cloud, where a set of records is processed on the cloud through a set


of parallel sessions between the cloud and the customer site connected over the internet. Here the data resides on customer premises. There are two ways data processing can be offered as a service over the cloud, as described below.
• Offline: In this method all the data required for processing is uploaded to the cloud before processing can start. One can use SFTP or any other secure mechanism to transfer the data beforehand. One example of this service is the CRM service offered by Salesforce.com, where all the required data is uploaded to the cloud [19]. In this method one can easily determine the resources required to process the data, since all the data resides on the cloud, and any existing resource allocation and SLA determination method can be applied to offer the service.
• Inline: In this method the data resides on the customer premises and gets processed on the cloud in real time without being stored on the cloud. An example of this method is offering data cleansing as a service, where the data to be cleansed is transferred in real time to the cloud, processed, and the results are transferred back to the customer [8]. In this case one needs to estimate the resources to be allocated keeping in mind the amount of bandwidth available between the cloud and customer sites and the data throughput variations. Traditional resource allocation techniques may fail if data throughput variability is not taken into account.
Figure 2 shows these two modes of data processing over the cloud. In the offline mode the entire data is exported in some predefined format and transferred to the cloud site, where it gets processed; the results are then transferred back to the customer site and can be imported into the database. In the inline (realtime) mode, the data is processed over the cloud using a session between the data (customer) location and the compute (cloud) location.
In this kind of workflow each step has a time limit within which it should be completed so that the next step can be triggered. To offer such services with time bounds and SLA guarantees, one needs to precompute the resources needed and determine the feasibility of offering the service within the time bound. In the rest of the paper we consider a data processing task with M records to be processed over the cloud, with bandwidth B between the customer and cloud locations. We present methods to improve throughput and allocate appropriate resources.
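As a minimal illustration of the feasibility question raised above, the following sketch bounds the completion time of an inline job by the slower of data transfer and processing. The per-record size and the linear transfer model are illustrative assumptions, not values from the paper.

```python
def min_completion_time(M, record_bytes, bandwidth_bps, q):
    """Lower bound on completion time for an inline job.

    Transfer and processing overlap, so the slower of the two dominates.
    M: number of records; record_bytes: assumed size of one record;
    bandwidth_bps: link bandwidth in bits/sec; q: records processed/sec.
    """
    transfer = M * record_bytes * 8 / bandwidth_bps  # seconds to move the data
    processing = M / q                               # seconds to process it
    return max(transfer, processing)

def feasible(M, record_bytes, bandwidth_bps, q, deadline):
    """Can the job finish within the customer's time bound?"""
    return min_completion_time(M, record_bytes, bandwidth_bps, q) <= deadline
```

For example, one million 64-byte records over a 512 Kbps line take at least 1000 s to transfer; if the cloud processes 500 records/sec, processing (2000 s) is the bottleneck.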

Fig. 2. Offline and realtime mode of data processing over cloud

IV. RESOURCE ALLOCATION FOR DATA PROCESSING

This section gives details about the factors that need to be considered for resource allocation for large data processing over the cloud, along with the algorithms for resource allocation. Since the customer premises is where the data resides and the cloud is where the actual data processing happens, we first consider the important factors that affect the throughput and the resource allocation requirements.

A. Parameters Affecting Resource Allocation

Parameters affecting data processing over the cloud in the hybrid setting can be categorized into three categories, described below.

1) Network Parameters:
• Bandwidth (B): The bandwidth available between the customer site and the cloud site. This is the minimum of the bandwidths available at the customer site and the cloud site, and it determines the data processing throughput that can be achieved.
• Average Session Throughput (s): The average throughput achieved by a session (database, SFTP, etc.) opened between the customer site and the cloud site. It depends on the network topology and the bottlenecks present between the source and destination sites. It is determined by increasing the number of sessions (n) until the point where the maximum aggregate throughput from all sessions is reached. The average session throughput is then Σ(si)/n, where si is the throughput obtained with the i-th session.
• Session Overhead (tco): The time taken to establish a session between the customer location and the cloud location. It includes, for example, the database connection overhead when establishing a database connection from the cloud to the customer database.
• Record Fetch Time (k): The average time needed to fetch a record from the customer location to the cloud site.
• Session Lifetime (ρ(t)): The probability that a session is alive for a given time t. Session lifetime follows an exponential distribution [13], [18].

2) Resource Parameters:
• Average Inflow (F): The average number of records per second available at the cloud site. This is used to decide the amount of processing power and memory that needs to be provisioned on the cloud.
• Processing Speed (q): The number of records that can be processed per second on the cloud. This depends on the resources provisioned on the cloud for a particular job.

3) Data Processing Parameters:
• Job Size (M): The total number of records to be processed for a given customer.
• Number of Sessions (p): The total number of parallel sessions that can be established. This is chosen to maximize the throughput.
• Block Size (b): The number of records processed by a single session. A session failure leads to reprocessing the whole block, so determining the correct block size is important to avoid reprocessing overheads.
• Total Time (T): The time within which the given processing job must finish with the available processing power and other infrastructure constraints. This includes connection overheads, data transfer time, and processing time on the cloud.

For a given data processing job, the network parameters and the resource parameters are known a priori because they are input parameters. Given these parameters, one needs to find the optimal block size b and the number of parallel sessions p so that the job is completed within time T while minimizing the associated cost.
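The session-throughput estimate s = Σ(si)/n and the resulting session count can be sketched as follows. This is a minimal sketch assuming throughput probes are supplied externally; the probing mechanism itself (opening n trial sessions and measuring each) is not shown.

```python
def average_session_throughput(session_throughputs):
    """Average session throughput s = sum(s_i) / n over n probe sessions."""
    return sum(session_throughputs) / len(session_throughputs)

def parallel_sessions(bandwidth, avg_session_throughput):
    """Number of parallel sessions p = B / s needed to saturate the link
    (at least one session)."""
    return max(1, int(bandwidth // avg_session_throughput))
```

For instance, with an 8 Mbps link and sessions averaging 2 Mbps each, four parallel sessions are needed to use the full bandwidth.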

B. Service Configurations

Here we describe two situations using different parameter settings and demonstrate the effect of these configurations on the overall processing time of the job on the cloud.

1) Finite Bandwidth and Finite Processing Power: In this case both the bandwidth and the processing power are limited. The total number of parallel sessions needed to maximize the throughput is p. This throughput results in an average flow of F records per second on the cloud. Since processing power is also limited, because of cost or resource implications, only q records are processed per second. The total time is then determined either by the throughput or by the processing power constraint. If the average flow F is greater than q, then the time taken is given by

T = M/q    (1)

If the average flow F is less than q, then the time taken is

T = tco + (M/p)k    (2)

Here the connection overhead tco is incurred only once, since sessions do not fail in this configuration. Because data transfer and data processing occur concurrently, the effective time taken is the maximum of the above two quantities:

T = max(M/q, tco + (M/p)k)    (3)

2) Finite Bandwidth and Finite Processing Power with Session Failure: This case incorporates the effect of finite session lifetime, along with finite bandwidth and processing power, to estimate the total time taken. Based on [13], [18] we assume session lifetime follows an exponential distribution with scale parameter β. The probability that a session is alive for time t is then given by ρ(t). The scale parameter β is the average session lifetime, i.e., E[t] = β. This has a direct effect on determining the block size, as one would like to transfer the entire block while the session is alive. For a large block size, a session failure results in retransmitting the complete block, which results in duplicate processing. Smaller block sizes result in increased connection overheads. The time taken to transfer a block of size b over the network without any session failure is t' = tco + kb. Now, given the session lifetime probability ρ(t), the effective time taken to transfer the same block b is

τ = t'/ρ(t')    (4)

where ρ(t') determines whether a session stays alive long enough for a block of size b to be transferred. From the above equation, the effective time taken to transfer all M records in blocks of size b over p sessions is

TN = M(tco + kb)/(p b ρ(t'))    (5)

Since the processing power at the cloud site is limited by q, the effective time taken is given by the following equation, which takes the maximum of the time taken at the cloud site and the network transfer time:

T = max(M/q, M(tco + kb)/(p b ρ(t')))    (6)

If the service configuration is throughput constrained, then over-allocation of resources results in increased costs. It also results in deteriorated processing time, which should be accounted for in the SLA. Similarly, if the available resources on the cloud are limited, the SLA will be affected adversely, resulting in increased time for servicing the request.

C. Algorithms for Resource Allocation

In this section we present an algorithm for determining the optimal block size depending on the number of parallel sessions that can be established and the session lifetime probability. We also give an algorithm for optimal resource allocation to finish the data processing within the given time constraints. Algorithm 1 gives a method for computing the block size that takes into account the number of parallel sessions p. Using multiple sessions partitions the M records and gives the partition size that each parallel session has to handle. The session lifetime probability ρ is used to compute the block size that can be processed before a session failure occurs. The optimal block size is chosen as the minimum of the two block sizes given by the number of possible sessions and the session lifetime probability.

Algorithm 1 Algorithm to Compute Optimal Block Size
1: Read the total number of records M
2: Compute the block size b1 = M/p
3: Read the session lifetime probability ρ
4: Compute the set (bi, Ti) for various block sizes with the given ρ
5: Assign b2 as the bi with minimum Ti
6: Block size b = min(b1, b2)
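As a sketch of equation (6) and the block-size search of Algorithm 1, assuming the exponential lifetime model ρ(t) = exp(-t/β) stated above (all numeric parameter values in the example are hypothetical):

```python
import math

def total_time(M, q, p, b, tco, k, beta):
    """Eq. (6): T = max(M/q, M*(tco + k*b) / (p * b * rho(t')))
    with rho(t) = exp(-t/beta) for an exponential session lifetime."""
    t_block = tco + k * b             # t': time to transfer one block
    rho = math.exp(-t_block / beta)   # probability a session survives t'
    t_network = M * t_block / (p * b * rho)
    return max(M / q, t_network)

def optimal_block_size(M, q, p, tco, k, beta, candidates):
    """Algorithm 1: b = min(M/p, block size minimizing T over candidates)."""
    b1 = M // p                                    # partition per session
    b2 = min(candidates, key=lambda b: total_time(M, q, p, b, tco, k, beta))
    return min(b1, b2)
```

Small blocks pay the connection overhead tco too often; large blocks are likely to outlive the session and be retransmitted, so an interior block size minimizes T.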

Once we have the optimal block size and the number of parallel connections that can be established between the sites, we can compute the rate at which records arrive on the cloud. This is then used to allocate the computing resources on the cloud. Algorithm 2 gives the method for computing the required resources and provisioning them on the cloud. Here, step 2 computes the optimal inflow that can be achieved at the cloud site for the given network conditions.

Algorithm 2 Algorithm to Compute Resource Allocation
1: Read the number of input records M
2: Compute the incoming data rate F = pbρ/(tco + kb)
3: if F < q then
4: Allocate the resources
5: else
6: Reject the service request
7: end if

Algorithm 3 gives the method to estimate the time taken to process the given M records, which is used later for pricing and SLA determination. It uses parameters such as the number of sessions p, the processing rate q, the session lifetime probability ρ, the average record fetch time k, and the connection overhead tco. The algorithm estimates the time that will be taken to service a particular request at the cloud site based on the provisioned resources and the permissible throughput given the session lifetime limitations. It assigns the maximum of the two times as the effective time taken for completing the request. This is because data transfer and processing are overlapped, so the effective time is determined by the bottleneck, which can be either the network or the processing power.

Algorithm 3 Algorithm to Estimate the Time Taken to Complete a Service Request
1: Read the number of records M
2: Compute the number of parallel connections p = B/s
3: Compute the number of blocks n = M/b for block size b
4: Compute the data processing rate q on the cloud
5: Time taken on the cloud: T1 = M/q
6: Time taken on the network: T2 = M(tco + kb)/(pbρ)
7: Effective time taken = max(T1, T2)

D. SLA Determination and Pricing

For a given service request at the cloud site, the estimated time to process the request optimally can be used to define SLAs. Once the SLAs are defined and the required resources are allocated, pricing can be defined as a function of the time taken to finish the job and the number of resources allocated for it. Determining attainable SLAs in the face of session failures is an important aspect for the service provider. Using the algorithms above to obtain the optimal block size, the resources, and the total time taken, we estimate the SLAs that can be achieved in various real-life situations involving session failures, limited bandwidth, limited resource availability, etc.

Whenever a new request arrives at the cloud site, it is important for the service provider to estimate the time taken to complete the job under the various constraints discussed earlier. Also, depending on the time and money that a customer is willing to pay for the service, one has to draw up the possible set of SLA guarantees that can be offered. For each incoming request, a table of the optimal times possible with different resource configurations, say ranging from one machine instance to the optimal number of instances, can be tabulated using Algorithm 4 and suggested to the customer. Steps 4 to 6 of the algorithm perform this task of tabulating the time taken with each number of instances and the associated cost Ci, which is a function of the number of instances i used and the time Ti for which they are used. The algorithm can also be used when the optimal number of instances required to satisfy the timing request is not available on the cloud; in that case the service provider can offer alternative options along with the total time each would take. Steps 8 and 9 perform this task of suggesting achievable SLAs to the customer. Here n is the optimal number of instances required to handle the incoming flow to the cloud site. Note that the instances mentioned here can be replaced with CPU and memory requirements.

Algorithm 4 Algorithm for SLA Determination
1: Compute the optimal block size b = ComputeBlocksize()
2: Compute the average inflow at the cloud site
3: Compute the optimal number of machine instances n
4: for i from 1 to n do
5: Compute the time to complete and the cost (Ti, Ci)
6: end for
7: Read the available machine instances r at the cloud site
8: if r < n then
9: Present the available options (T1, C1) to (Tr, Cr)
10: else
11: Present the available options (T1, C1) to (Tn, Cn)
12: end if

In this model of offering data processing services one can consider various pricing strategies to attract customers. A few of them are described below.
• Instance based: Pricing is decided by the number of data processing machine instances, or the CPU and memory, allocated to the customer for the period they are rented.
• Data size based: The customer pays for the amount of data processed on the cloud and the time within which the given amount of data has to be processed.
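The admission check of Algorithm 2 and the SLA table of Algorithm 4 can be sketched as follows. The linear cost model and the per-instance rate of 500 records/sec (taken from the experimental section) are illustrative assumptions, not part of the algorithms themselves.

```python
import math

def admit_request(p, b, rho, tco, k, q):
    """Algorithm 2 sketch: admit if the incoming rate
    F = p*b*rho/(tco + k*b) can be absorbed by processing rate q."""
    F = p * b * rho / (tco + k * b)
    return F < q

def sla_options(M, F, per_instance_rate, available, cost_per_instance_hour):
    """Algorithm 4 sketch: tabulate (instances, time, cost) for 1..n
    instances, capped by what is actually available at the cloud site.
    The cost function is a hypothetical linear instance-hours model."""
    n = math.ceil(F / per_instance_rate)       # optimal instance count
    options = []
    for i in range(1, min(n, available) + 1):
        T = M / (i * per_instance_rate)        # seconds with i instances
        C = cost_per_instance_hour * i * (T / 3600.0)
        options.append((i, T, C))
    return options
```

When fewer than n instances are free, the table is simply truncated at the available count, which is exactly the fallback steps 8 and 9 describe.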

V. EXPERIMENTAL RESULTS

In this section we present experimental results for large scale data processing as a service. We present results on a real cloud implementation in which data is processed after using storage services on the cloud (which we call offline processing) and in which the data is transferred, processed, and the results returned in real time between the cloud platform and the customer over the internet (which we call inline processing). To evaluate the performance of our method for different configurations and data processing requests, we also simulate the cloud platform and give simulation results for situations differing in data throughput, session lifetimes, job sizes, and cloud resources.

Fig. 3. Time taken for data processing using different methods

A. Data Processing Results

For evaluating both offline and inline data processing we used an in-house cloud implementation similar to EC2. One can request resources as virtual machines with the desired configurations running over a Xen virtualization platform. The data processing task was cleansing [12] and standardization [5] of one million address records on the cloud platform for a Master Data Management (MDM) implementation. The chosen task is only indicative and can be replaced with any other task. We requested a standard virtual machine configuration consisting of 4 GB RAM and 4 CPUs with a pre-installed and pre-configured data transformation task. The bandwidth between the cloud platform and the customer was a 512 Kbps line over the internet. We experimented with different modes of data transfer. Figure 3 shows the data processing time for the different experiments. In offline processing, a file is transferred to the cloud site, processed there, and the results are returned. It is seen that inline processing, where a direct database session is established, gives the best throughput because data processing and data transfer happen concurrently. For the web service invocation we had to limit the number of records processed per request because of connection timeouts. This experiment showed that web services are not suitable for large scale data processing.

Another experiment we conducted using the inline mode was to partition the data at the customer site and use parallel sessions from the customer site to the resources on the cloud, with each session handling the data processing of one partition. Figure 4 shows the comparison of the time taken for a single session, two sessions, and four sessions. One can observe that using four sessions took almost 2.5 times less time to finish the job compared to a single session. This is because a single database connection was not pulling enough data over the session to utilize the full bandwidth.

Fig. 4. Performance gains using parallel sessions

B. Simulation Results

We simulate a cloud computing platform to evaluate the performance of inline processing when multiple data processing requests are received from different customers with different job sizes, bandwidths, and service expectations. Our first two experiments focus on the time taken to transfer M records with and without considering the session lifetime parameter. In both cases we assume that the resources on the cloud are infinite and show the impact of the network parameters on the total time taken. In experiment 3 we work with finite resources on the cloud and show the effect of network bandwidth and session lifetime on resource allocation. Experiment 4 shows the performance of servicing a series of random data processing requests arriving at the cloud, with and without considering the session lifetime parameter, and evaluates resource allocation in both cases.

1) Data Processing without Session Failures: In this experiment we simulate the time taken to process the data without associating any session lifetime and assume that enough processing power is available on the cloud. We show the time taken for processing 10 million records on the cloud for different block sizes. The total time taken in this case is dominated by the network transfer time.

TABLE I
TIME TAKEN WITHOUT SESSION FAILURES

Block Size | Time Taken
250        | 176020
300        | 153338
400        | 120020
500        | 100020
2000000    | 20020

Table I shows the time taken to process 10 million records with different block sizes. In this case the number of parallel sessions needed is computed, and since a session never dies, the maximum possible block size, obtained by dividing the total number of records by the number of sessions, gives the best overall throughput.

2) Data Processing with Session Failure: In this experiment we simulated session lifetime as an exponential distribution whose scale parameter, β, gives the expected time for which a session stays alive. For this network we compute the time taken to process 10 million records for a given network bandwidth and different values of β. Figure 5 shows the time taken for processing 10 million records with different block sizes, parameterized on the expected session lifetime β. From the graph we see that there exists an optimal block size for a given β.
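The simulation-versus-expectation comparison described here can be sketched with a small Monte Carlo experiment. Following the paper's model, a failed session costs a full block-transfer time and forces the whole block to be retransmitted; all numeric parameters below are hypothetical.

```python
import math
import random

def simulate_transfer(M, p, b, tco, k, beta, rng):
    """Monte Carlo sketch: transfer M records in blocks of b over p parallel
    sessions; each session lifetime is Exp(beta), and a failure forces the
    whole current block to be retransmitted (the model's assumption)."""
    blocks = math.ceil(M / (p * b))   # blocks each session must deliver
    t_block = tco + k * b
    total = 0.0
    for _ in range(blocks):
        while True:                   # retry until a session outlives the block
            total += t_block
            if rng.expovariate(1.0 / beta) >= t_block:
                break
    return total

def expected_transfer(M, p, b, tco, k, beta):
    """Analytic estimate, eq. (5): T_N = M*(tco + k*b) / (p*b*rho(t'))."""
    t_block = tco + k * b
    rho = math.exp(-t_block / beta)
    return M * t_block / (p * b * rho)
```

Averaged over many runs, the simulated per-session time converges to the analytic value, which is the agreement the "simulation" and "expected" curves in the figures illustrate.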

Fig. 5. Time taken for various block sizes with different session lifetime rates (plot omitted)

Fig. 7. Time taken at the cloud site with varying number of instances (plot omitted)

Figure 7 shows the processing time for different block sizes at the cloud site as the number of machine instances provisioned on the cloud varies. We present the time taken for processing the records with the number of instances ranging from one up to the optimal number required to handle the given input flow for a given session lifetime parameter. Once we compute the time taken for various numbers of instances on the cloud site we can suggest SLAs to the customer: depending on the amount of time within which the client wants the processing done, he can choose the number of instances and bear the cost accordingly. Also, when the optimal number of instances is not available on the cloud, the service provider can suggest the estimated SLA that can be met with the currently available resources.

4) Resource Allocation on the Cloud: In order to demonstrate the effect of limited resources when multiple requests arrive with different parameters, we simulated a series of data processing requests arriving on the cloud. Each series has a set of 10 random input requests coming in with different job configurations. We first allocated resources for these requests in the normal case, without taking into account session lifetime and the corresponding optimal block sizes. We then used our method to compute the optimal block size and allocated the resources suggested by our method. Table II shows the input request details and the amount of resources allocated on the cloud. In this experiment we assumed that each machine instance on the cloud can handle 500 records per second and that the average session throughput is fixed at 2 Mbps. Using these configurations the optimal inflow to the cloud is computed with and without considering session lifetime. The number of machine instances allocated in both cases is also reported in the table.
From Table II it is clear that if one does not optimize data partitioning in the presence of unreliable transmission and varying job sizes, the result is over-provisioning of resources, most of which may lie idle. If instead the resource allocation is based on the expected data inflow, job size, and time to process, the optimal number of instances can be provisioned, reducing cost and increasing overall resource utilization.
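The allocation step described above can be sketched as follows. The per-instance rate of 500 records/s matches this experiment; the inflow value and the helper names are assumptions for illustration, not the paper's implementation.

```python
import math

# Sketch of the allocation logic: given the achievable inflow to the cloud
# and a per-instance processing rate, derive the number of instances and
# the SLA (completion time) that can be promised.

def instances_needed(inflow_records_per_s, per_instance_rate):
    """Smallest instance count that keeps up with the input flow."""
    return max(1, math.ceil(inflow_records_per_s / per_instance_rate))

def achievable_sla(job_records, effective_rate_per_s):
    """Completion time when processing keeps pace with the inflow."""
    return job_records / effective_rate_per_s

per_instance_rate = 500.0   # records/s per instance, as in this experiment
inflow = 1400.0             # optimal inflow computed for a job (hypothetical)

n = instances_needed(inflow, per_instance_rate)
print(n)                                     # 3
print(achievable_sla(40_000_000, inflow))    # seconds to finish the job
```

If fewer than `n` instances are free, the same arithmetic run with the available count yields the estimated SLA the provider can still commit to.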

Fig. 6. Comparison of time taken for different block sizes between simulated and expected (plot omitted)

Note that the minimum block size does not result in optimal time: the number of sessions is fixed, and the connection overhead incurred for transferring each block increases the total time taken. In our experiments the connection overhead is 20 seconds. Figure 6 compares the time taken, in the presence of session lifetime, in simulation runs and as computed by our algorithm. The difference between the processing time computed using our method and that obtained from simulations is minimal at the optimal block size.

3) Data Processing with Session Lifetime and Limited Processing: In this section we analyze the processing time on the cloud for different session lifetime parameters. We first compute the optimal input flow that can be obtained at the cloud; this is computed from the results obtained in experiment 2. Once we obtain the optimal input flow at the cloud we can determine the number of machine instances needed to process that flow. We take the number of machine instances as the metric for computing power on the cloud; this can easily be extended to include CPU power and memory requirements. In our simulation each machine instance can process 100 records per second.
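The existence of an optimal block size can be illustrated with a simplified retry model (my own simplification, not the paper's algorithm): a block survives an exponentially distributed session lifetime with probability exp(-t/β), and a failed block is retried from scratch at full cost. Small blocks pay too much connection overhead; large blocks fail and retry too often. All parameter values are hypothetical.

```python
import math

# Expected time per block when session lifetime is exponential with mean
# beta: a block of size B takes t = overhead + B/rate to send, survives
# with probability exp(-t/beta), and the number of attempts is geometric.

def expected_block_cost(block, rate, beta, overhead=20.0):
    t = overhead + block / rate
    p = math.exp(-t / beta)   # probability the session survives the block
    return t / p              # expected attempts (1/p), each costing t

def expected_total_time(records, block, rate, beta, overhead=20.0):
    return (records / block) * expected_block_cost(block, rate, beta, overhead)

R, RATE, BETA = 10_000_000, 100.0, 400.0   # all values hypothetical
costs = {b: expected_total_time(R, b, RATE, BETA)
         for b in (250, 500, 1000, 5000, 20000)}
best = min(costs, key=costs.get)
print(best, costs[best])
```

Under these assumptions the minimum falls at an interior block size, reproducing the U-shaped behaviour seen in Figure 5 for a fixed β.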


TABLE II
RESOURCE ALLOCATION AT THE CLOUD SITE

Job Size (records)   Bandwidth (Mbps)   Session Lifetime   Baseline (instances)   Our method (instances)
40000000             16                 500                13                     3
50000000             8                  700                4                      1
50000000             10                 800                5                      2
80000000             16                 500                13                     3
100000000            8                  700                4                      1
80000000             14                 250                10                     1
90000000             20                 800                20                     6
70000000             20                 500                20                     4
90000000             12                 500                8                      2
90000000             6                  800                2                      1

We presented a method for resource allocation for large data processing over the cloud which takes into account various network parameters. We also presented a method to determine SLAs for a given customer request depending on the resources available at the cloud site and the data transfer achievable with the given network configuration. Our experiments showed that optimal resource allocation is possible only when the various network parameters are taken into account.

REFERENCES

[1] Amazon. S3 storage, EC2 computing power and SQS for network services, 2009. http://www.amazon.com.
[2] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia. Above the clouds: A Berkeley view of cloud computing. Technical report, University of California at Berkeley, 2009.
[3] B. F. Cooper, E. Baldeschwieler, R. Fonseca, J. J. Kistler, P. P. S. Narayan, C. Neerdaels, T. Negrin, R. Ramakrishnan, A. Silberstein, U. Srivastava, and R. Stata. Building a cloud for Yahoo! IEEE Data Eng. Bull., 32(1):36-43, 2009.
[4] B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!'s hosted data serving platform. PVLDB, 1(2):1277-1288, 2008.
[5] M. Dani, T. A. Faruquie, R. Garg, G. Kothari, M. Mohania, K. H. Prasad, L. V. Subramaniam, and V. Swamy. A knowledge acquisition method for improving data quality in services engagements. In Proc. of IEEE SCC, 2010.
[6] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In Proceedings of OSDI. USENIX Association, 2004.
[7] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good. The cost of doing science on the cloud: the Montage example. In Proceedings of the ACM/IEEE Conference on Supercomputing, 2008.
[8] T. A. Faruquie, K. H. Prasad, L. V. Subramaniam, M. K. Mohania, G. Venkatachaliah, S. Kulkarni, and P. Basu. Data cleansing as a transient service. In Proceedings of IEEE ICDE, 2010.
[9] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. SIGOPS Oper. Syst. Rev., 37(5):29-43, 2003.
[10] Google. Google App Engine, 2009. http://code.google.com/appengine.
[11] HP. Flexible computing services FCS, 2009. http://www.hp.com/services/flexiblecomputing.
[12] G. Kothari, T. A. Faruquie, L. V. Subramaniam, K. H. Prasad, and M. Mohania. Transfer of supervision for improved address standardization. In Proc. of ICPR, 2010.
[13] D. Kundu and R. D. Gupta. Bivariate generalized exponential distribution. Journal of Multivariate Analysis, 100(4):581-593, 2009.
[14] K. Murthy, T. A. Faruquie, L. V. Subramaniam, K. H. Prasad, and M. Mohania. Automatically generating term-frequency-induced taxonomies. In Proc. of ACL, 2010.
[15] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a not-so-foreign language for data processing. In Proceedings of ACM SIGMOD, 2008.
[16] Salesforce PaaS. Internet application development platform, 2009. http://www.salesforce.com/paas.
[17] D. C. Plummer, T. J. Bittman, T. Austin, D. W. Cearley, and D. M. Smith. Cloud computing: Defining and describing an emerging phenomenon. Technical report, Gartner, 2008.
[18] S. M. Ross. Introduction to Probability Models, Ninth Edition. Academic Press, Inc., Orlando, FL, USA, 2006.
[19] Salesforce.com. CRM application as a service, 2009. http://www.salesforce.com.
[20] M. M. Tsangaris, G. Kakaletris, H. Kllapi, G. Papanikos, F. Pentaris, P. Polydoras, E. Sitaridi, V. Stoumpos, and Y. E. Ioannidis. Dataflow processing and optimization on grid and cloud infrastructures. IEEE Data Eng. Bull., 32(1):67-74, 2009.
[21] H. C. Yang, A. Dasdan, R. L. Hsiao, and D. S. Parker. Map-reduce-merge: simplified relational data processing on large clusters. In Proceedings of ACM SIGMOD, 2007.
[22] J. Zhang, M. Yousif, R. Carpenter, and R. J. Figueiredo. Application resource demand phase analysis and prediction in support of dynamic resource provisioning. In Proceedings of ICAC, 2007.

VI. RELATED WORK

Cloud computing is gaining importance with its cost-effective and easily scalable ways of offering hardware and software services [17], and has the potential to change the way hardware and software are made and purchased [2]. Companies like Amazon, Google, and Salesforce already offer various computing, storage, and platform services over the cloud [1], [11], [19], [16], [10]. Yahoo is actively building a cloud of its own to support the various services it offers [3]; it also has a data serving platform, PNUTS [4], a massively parallel and geographically distributed database system. There are efforts to show the feasibility of doing scientific experiments on the cloud using the computing and storage services provided by various players, and it has been shown that cloud computing can be used to build cost-effective data intensive applications [7]. There are efforts to support large scale data processing over clusters using the MapReduce framework [6], [21]. In this model the data is stored on the systems beforehand and waits for the computation; whenever a computation request comes in, the data is processed and the result returned. The Google File System [9] and the open source Hadoop file system support this model of computation for large scale data processing. Pig Latin is a language specifically designed to support easy programming with MapReduce, fitting between the declarative style of SQL and the procedural style of MapReduce [15]. There are efforts to build applications using various services offered over the cloud and to optimize the data flow control in such environments [20], and to offer data cleansing as a transient service over the cloud platform [8]. Resource provisioning has been an important aspect of cloud computing; there are efforts to support dynamic resource provisioning by profiling the execution phases of applications [22].

VII. CONCLUSION

To utilize the resources on the cloud effectively it is necessary to optimize based on the available computing power, the available bandwidth, and the session lifetime. Not taking these into account can result in wasteful provisioning of computing power on the cloud that may never be used.