Cloud Computing for Geosciences: Deployment of GEOSS Clearinghouse on Amazon’s EC2

Qunying Huang, Center for Intelligent Spatial Computing, George Mason University

Chaowei Yang, Center for Intelligent Spatial Computing, George Mason University

Doug Nebert, U.S. Federal Geographic Data Committee Secretariat

Kai Liu, Center for Intelligent Spatial Computing, George Mason University

Huayi Wu, Center for Intelligent Spatial Computing, George Mason University

ABSTRACT

To test the utilization of cloud computing for Geosciences applications, the GEOSS Clearinghouse was deployed, maintained and tested on the Amazon Elastic Compute Cloud (EC2) platform. The GEOSS Clearinghouse is a web-based geographic metadata catalog system that manages millions of metadata records describing spatially referenced resources for the Group on Earth Observations (GEO). Our experiment reveals that the EC2 cloud computing platform facilitates geospatial applications in terms of a) scalability, b) reliability, and c) reduced duplication of effort among Geosciences communities. Our test of massive data inquiry by concurrent user requests shows that each application should be assessed and optimized individually when deployed onto the EC2 platform to achieve a better balance of cost and performance.

Categories and Subject Descriptors

C.4 [Performance of Systems]; J.2 [Physical Sciences and Engineering]: Earth and atmospheric sciences

General Terms

Performance, Experimentation

Keywords

Cloud Computing, Virtualization, Amazon EC2, GEOSS Clearinghouse

1. INTRODUCTION

Technological advancements, such as multi-core processors and networked computing environments, drive advancement in computing platforms and computing paradigms. Computing paradigms have evolved into cloud computing over the past decades [1]. The National Institute of Standards and Technology (NIST) describes cloud computing as "...a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction" [2].

The U.S. Federal Geographic Data Committee (FGDC) convened a governmental team in early 2010 to develop the GeoCloud Sandbox Initiative, which will deploy up to ten geospatial application projects in the cloud environment. Over a one-year period, the initiative aims to define common operating system and software suites based on requirements, explore and document deployment and management strategies, monitor usage and costing of cloud services in an operational environment, and pursue shared system security profiles (certification and accreditation) for such solutions. The result of the project will be documented best practices for governmental agencies to use when considering cloud service adoption for geospatial capabilities.

This paper presents initial documentation of deploying an open source geospatial software suite, GeoNetwork, together with a companion spatially-enabled database, operating system, and application framework, in the Amazon Web Services (AWS) Elastic Compute Cloud (EC2) environment as a Platform as a Service (PaaS) candidate for use by U.S. government agencies. The Global Earth Observation System of Systems (GEOSS) Clearinghouse is the geospatial data and service catalog for which this deployment is made.

GeoNetwork, used by the GEOSS Clearinghouse, is a web-based geographic metadata catalog system for managing spatially referenced resources. It is a data-intensive application: substantial memory and computing power are required to search large metadata collections for multiple simultaneous users, and each user may harvest thousands of records in a single request. This paper introduces a method to deploy geospatial applications on Amazon EC2, using the GEOSS Clearinghouse as an example, as a prototype for utilizing cloud computing to support Geoscience applications for Earth scientists.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ACM HPDGIS’10, November 2, 2010, San Jose, CA, USA. Copyright 2010 ACM ISBN 978-1-4503-0432-0/10/11…$10.00.

A key issue for high performance and parallel computing research with traditional computing paradigms is how to configure an optimized computing platform for geospatial analysis and processing [3]. Servers with high computing capability, large memory and fast network connections are always preferred for such tasks. However, in cloud computing, especially for commercial cloud services billed on a consumption-based model, consumers pay for the resources they use. Therefore, cost is an important factor to consider when arranging a cloud computing platform to support geospatial applications. Amazon EC2 offers a number of instance types at different costs to meet computing needs, with each instance type providing a different amount of dedicated CPU power, memory, disk, etc. Based on experimental results from deploying the GEOSS Clearinghouse on all available Amazon EC2 instance types, this paper offers insights into cloud computing solutions for geospatial applications.

Section 2 introduces the available cloud service types and describes Amazon EC2, a popular platform for hosting applications. Section 3 discusses the GEOSS Clearinghouse and the workflow for deploying it on Amazon EC2, and reports an experiment that tests whether Amazon EC2 can satisfy the computational requirements of the GEOSS Clearinghouse. Section 4 concludes and discusses future research directions and issues for cloud computing in Geosciences.

2. RELATED WORK

2.1 Cloud Services

The forms of service that cloud computing provides today may be broken down into Infrastructure as a Service (IaaS), PaaS and Software as a Service (SaaS). IaaS delivers computer infrastructure, e.g., grid or cluster virtual servers, network, storage and system software, as standardized services over the network. PaaS delivers a computing platform as a service. It encompasses a layer of software and provides it as a service that can be used to build higher-level services [4]. Users can run existing applications or develop new applications on such a platform without having to maintain the operating system, server hardware, load balancing or computing capacity. SaaS, the best known and most widely used type of cloud computing, provides the capabilities found in sophisticated traditional applications installed locally. The difference between these local applications and SaaS is that SaaS delivers these capabilities to end users over the Internet.

2.2 Amazon EC2

As a central part of Amazon’s cloud services, Amazon EC2 allows users to deploy scalable resources on demand. Amazon EC2 is a typical IaaS cloud service. Based on Xen [5], EC2 enables users to boot an Amazon Machine Image (AMI) to create a virtual machine, which Amazon calls an "instance." An AMI is a bootable virtual machine (VM) root image containing an OS and any software needed to create a VM. At present, EC2 offers a number of instance types to meet computing needs, with each instance type providing a predictable amount of dedicated compute capacity (CPU power, memory, disk, etc.). Amazon classifies its Linux instances into three categories, Standard, High-Memory and High-CPU, and each category is suitable for different types of applications.

2.3 Deploying Applications Onto The Cloud

Through the AWS (Amazon Web Services) Management Console [6] or the Amazon EC2 AMI Tools [6], users can request the launch of an instance based on a specified AMI. If the request is authorized, a VM is deployed. Amazon has two types of storage services, Elastic Block Store (EBS) and Simple Storage Service (S3). Both storage types can store an AMI volume and be used as the virtual storage device for an Amazon EC2 instance. To deploy an application on Amazon EC2, an AMI must first be prepared. In Linux, there are two common ways to prepare an AMI: 1) The easiest method is to start from an existing public AMI and modify it according to your requirements. This is applicable to both Amazon EBS-backed and Amazon S3-backed AMIs; 2) Another approach is to build a fresh installation either on a stand-alone machine or on an empty file system mounted by loopback. This is only applicable to AMIs backed by Amazon S3 and entails building an operating system installation from scratch.

3. DEPLOYING GEOSS CLEARINGHOUSE ONTO AMAZON EC2

3.1 GEOSS Clearinghouse

The GEOSS Clearinghouse is a common search facility for the intergovernmental Group on Earth Observations (GEO). Through the harvest or distributed search of registered metadata catalogues, EO data, services, and related resources can be discovered and accessed. It is a collaborative effort between the CISC (http://cisc.gmu.edu) and FGDC. It is now hosted on a CISC server and is being deployed to the Amazon EC2 cloud platform.

3.2 Deployment

Figure 1 demonstrates how to deploy the GEOSS Clearinghouse onto the Amazon EC2 platform. In our case, the first method is used: the GEOSS Clearinghouse AMI is prepared by customizing an existing public AMI. A public AMI with CentOS 5.4 as the OS is selected to launch an Amazon EC2 instance. After the instance is loaded with the selected AMI and begins booting, users are able to interact with it. The next step is to log in to the instance with full root access through SSH (Secure Shell), after authorizing network access by opening port 22. This can be set up through Amazon’s command line tools or the AWS Management Console [6]. After logging in, the user can configure the system as needed, e.g., set up FTP and deploy a web application. In our case, two packages must be installed: the PostgreSQL database with PostGIS for spatial data support, which stores the GEOSS Clearinghouse data, and the Tomcat servlet container, which hosts the GEOSS Clearinghouse.
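As an illustration only, launching an instance from an AMI and opening port 22 for SSH could be scripted rather than performed through the console. The sketch below assumes the present-day boto3 SDK for Python, which is not the tooling used in this work. The function name, its parameters, and the choice to pass the client in as an argument are hypothetical, made so that the logic can be exercised without AWS credentials.

```python
def launch_from_ami(ec2, ami_id, instance_type, key_name, group_id, ssh_cidr):
    """Open port 22 on a security group, then launch one instance from an AMI.

    `ec2` is assumed to expose the boto3 EC2 client methods, for example
    the object returned by boto3.client("ec2"). All identifiers passed in
    are supplied by the caller; none are taken from the paper.
    """
    # Authorize inbound SSH (TCP port 22) from the given address range.
    ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpProtocol="tcp",
        FromPort=22,
        ToPort=22,
        CidrIp=ssh_cidr,
    )
    # Request exactly one instance booted from the chosen AMI.
    response = ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        KeyName=key_name,
        SecurityGroupIds=[group_id],
        MinCount=1,
        MaxCount=1,
    )
    return response["Instances"][0]["InstanceId"]
```

With a real client, the returned instance ID would be used to look up the public address for the SSH login; installing the database and servlet container would then proceed inside the instance.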

The PostgreSQL data files are kept on a separate EBS volume to provide persistent storage in the event of instance failure. Both the command line tools and the AWS Management Console can be used to create the EBS volume and attach it to the GEOSS Clearinghouse instance. The EBS volume is mounted at the PostgreSQL data and log directories. The EBS volume can be cloned, and the clone can be used to restore the database on a second EC2 instance. After transferring the GEOSS Clearinghouse application to the instance and restoring the data into its database, we start the servlet container, and the GEOSS Clearinghouse can then be accessed through a web browser.
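The creation and attachment of the EBS data volume can be sketched in the same spirit. This is a minimal illustration that again assumes the boto3 SDK rather than the command line tools or console used here; the function name, the default device name `/dev/sdf`, and the volume size argument are hypothetical.

```python
def attach_data_volume(ec2, instance_id, zone, size_gib, device="/dev/sdf"):
    """Create an empty EBS volume and attach it to a running instance.

    `ec2` is assumed to expose the boto3 EC2 client methods. The volume
    must be created in the same availability zone as the instance.
    """
    volume = ec2.create_volume(AvailabilityZone=zone, Size=size_gib)
    volume_id = volume["VolumeId"]
    # Wait until the new volume is available before attaching it.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return volume_id
```

Creating a file system on the attached device and mounting it at the database data and log directories are done from within the instance and are not shown.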

3.3 Amazon EC2 Platform Test

GetCapabilities requests issued by varying numbers of concurrent users are used to test how well the different categories of Amazon EC2 instances support the GEOSS Clearinghouse. GetCapabilities is an important CSW (Catalog Service for the Web) request for the GEOSS Clearinghouse. When the GEOSS Clearinghouse server receives a GetCapabilities request, it first reads an XML template file and then fills in the template with data obtained by reading the local configuration files and searching the database via Lucene [7]. The filled XML template is then sent back to the client.
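A concurrent-request measurement of this kind can be sketched with the Python standard library. This is not the test harness used in the experiment; the helper names and the endpoint passed to them are hypothetical, and the request uses only the `service` and `request` key-value parameters.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import urlopen


def get_capabilities_url(endpoint):
    # Key-value-pair form of a CSW GetCapabilities request.
    return endpoint + "?" + urlencode({"service": "CSW", "request": "GetCapabilities"})


def fetch(url):
    # Read the whole body so the timing covers the complete response.
    with urlopen(url) as response:
        return response.read()


def average_response_time(url, concurrent_requests, fetch=fetch):
    """Issue `concurrent_requests` requests at once; return the mean time in seconds."""

    def timed_call(_):
        start = time.perf_counter()
        fetch(url)
        return time.perf_counter() - start

    # One worker thread per request so that all requests are in flight together.
    with ThreadPoolExecutor(max_workers=concurrent_requests) as pool:
        durations = list(pool.map(timed_call, range(concurrent_requests)))
    return sum(durations) / len(durations)
```

Calling `average_response_time` once per concurrency level (for example 1, 20, 40, and so on) against each instance type would produce one curve per instance type. The `fetch` argument can be replaced with a stub to try the harness without a server.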

After launching the Amazon EC2 instance, users can begin to exploit the elasticity of EC2. Using EC2 as an elastic computing cloud usually involves setting up one instance as a load balancer and giving it access to an array of active EC2 instances among which it can distribute work. Amazon Simple Queue Service (SQS) [6] offers a reliable, highly scalable, hosted queue for storing messages as they travel between computers. Used with Amazon EC2 and Amazon S3, Amazon SQS can make applications more flexible and scalable, and it provides a good way for the GEOSS Clearinghouse to scale the number of instances up and down automatically. Because geospatial processing and analysis are often complex and both computation- and data-intensive, a geospatial application may need only a single CPU during some phases of execution but be capable of leveraging hundreds of CPUs at others. Scalable EC2 services would therefore facilitate geospatial applications by elastically satisfying their computing requirements.
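One simple way to turn queue depth into a scaling decision is sketched below. The policy, the function name, and all thresholds are hypothetical; no particular scaling rule is specified for the GEOSS Clearinghouse, and the queue depth is assumed to be read from the message queue by other code.

```python
import math


def desired_instance_count(queued_messages, messages_per_instance,
                           min_instances=1, max_instances=20):
    """Return how many instances to run for the current backlog.

    `messages_per_instance` is the assumed number of queued requests one
    instance can absorb; the bounds keep the fleet within a fixed range.
    """
    if messages_per_instance <= 0:
        raise ValueError("messages_per_instance must be positive")
    # Instances needed to absorb the backlog, rounded up.
    needed = math.ceil(queued_messages / messages_per_instance)
    # Never drop below the minimum or exceed the maximum fleet size.
    return max(min_instances, min(max_instances, needed))
```

A controller would compare this target with the number of running instances and launch or terminate the difference.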

[Line chart: GetCapabilities average response time (s), from 0 to 1000, versus concurrent request number, from 1 to 120, with one line each for m1.small, m1.large, m1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, c1.medium and c1.xlarge.]

Figure 2. GetCapabilities performance comparison by Amazon’s 8 types of instances

Figure 1. The process of deploying GEOSS Clearinghouse onto Amazon EC2

Amazon EC2 offers a highly reliable environment for the GEOSS Clearinghouse because the service runs within Amazon’s proven network infrastructure and datacenters. The Amazon EC2 Service Level Agreement (SLA) guarantees 99.95% availability for all Amazon EC2 regions, including US Standard, EU (Ireland), US West (Northern California) and Asia Pacific (Singapore). We have also set up an additional EBS volume on the GEOSS Clearinghouse instance for the PostgreSQL database for added reliability. Since the EBS data volume can be attached to another instance, the GEOSS Clearinghouse data are protected from instance termination or failure. Amazon EC2 therefore improves the reliability of geospatial applications.
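To make the availability figure concrete, the downtime it permits can be computed directly. The sketch below assumes the 99.95% figure is read as an annual percentage over a 365-day year; the function name is hypothetical.

```python
def max_downtime_hours(availability_percent, period_hours=365 * 24):
    """Hours of downtime permitted over a period at a given availability."""
    return period_hours * (1 - availability_percent / 100)


# 99.95% availability over a 365-day year allows about 4.38 hours of downtime.
print(round(max_downtime_hours(99.95), 2))
```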

Figure 3. Average CPU utilization of Amazon’s 8 types of instances

Figure 2 shows the performance of the eight Amazon EC2 instance types under different numbers of concurrent user requests. Of the eight lines, six lie close together, indicating the superior performance of the large instance (m1.large), extra large instance (m1.xlarge), high-memory extra large instance (m2.xlarge), high-memory double extra large instance (m2.2xlarge), high-memory quadruple extra large instance (m2.4xlarge), and high-CPU extra large instance (c1.xlarge). These six instance types have faster response times than the other two, the small instance (m1.small) and the high-CPU medium instance (c1.medium), yet they differ very little from one another even though they have different numbers of CPU cores and amounts of memory. During the experiment, we found that no matter how many concurrent users issue GetCapabilities requests, only one core of each of these six instances is used and little memory is consumed, even though each instance has at least two virtual cores (Figure 3). For example, the average CPU utilization of m1.large, with two cores, is around 50%, while that of m1.xlarge, with four cores, is around 25%.
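The reported utilization figures follow from simple arithmetic: if a workload keeps exactly one core busy, the utilization averaged over all cores is 100% divided by the core count. A minimal sketch, with a hypothetical function name:

```python
def average_utilization_percent(busy_cores, total_cores):
    """Average CPU utilization when `busy_cores` of `total_cores` are fully busy."""
    if total_cores < 1 or not 0 <= busy_cores <= total_cores:
        raise ValueError("need 0 <= busy_cores <= total_cores and total_cores >= 1")
    return 100.0 * busy_cores / total_cores


# One busy core on a two-core and on a four-core instance.
print(average_utilization_percent(1, 2))  # 50.0
print(average_utilization_percent(1, 4))  # 25.0
```

This is why adding cores does not shorten the response time for a workload that uses only one of them.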

A cloud computing model can greatly facilitate the Geosciences by promoting reuse and sharing of application frameworks across Geosciences communities. By creating a new AMI from our running instance, we enable other Geoscience communities to launch an Amazon EC2 instance from this AMI and host their own applications by customizing the available software packages. Other organizations can host cloud services based on our work without the effort of procuring physical infrastructure or installing and configuring the spatially-enabled database (PostgreSQL/PostGIS), GeoNetwork and the OS. A cloud computing approach thus greatly reduces the duplication of effort among organizations.


Lucene (used for indexing and searching) might be the reason for the under-utilization of the virtual CPUs. The GetCapabilities response time appears to depend on the number of metadata records in the catalog. When there are no metadata records, the response time is 0.38 s. With 26,130 records on a standard Amazon EC2 server, however, the response time for Lucene increases to more than 3 s. The GEOSS Clearinghouse does not use multithreading to improve the indexing and search performance of Lucene [7]. Using MapReduce for indexing might be one way to improve the performance of the GEOSS Clearinghouse [8].

Although the six instances differ little in performance among themselves, they all outperform the other two instances (m1.small, c1.medium). Taking both performance and cost into consideration, the Standard Linux large instance (m1.large) should be used to balance cost and performance, as it costs only approximately $0.34 per hour and has relatively high performance.

4. CONCLUSION

Through the deployment and maintenance of the GEOSS Clearinghouse on the EC2 platform, we demonstrated how to utilize cloud computing to support Geosciences applications. In addition, the EC2 cloud computing platform can facilitate geospatial applications in terms of a) scalability, b) reliability and c) reduced duplication of effort among Geosciences communities. Experiments in which all the available EC2 instance types served different numbers of concurrent requests to the GEOSS Clearinghouse show that the Standard Linux large instance should be used to balance cost and performance, as it costs only $0.34 per hour yet still has relatively high performance.

For Geosciences, spatial cloud computing is where we are heading. To construct such a spatial computing environment, a key issue is how to incorporate the underlying resources through geospatial middleware. Implementing geospatial middleware for cloud computing is more difficult than for other computing paradigms (e.g., grid computing). Such middleware should hide all kinds of implementation complexity, including organizing computing resources, parallelizing, scheduling, and applying spatial principles and constraints to leverage cloud computing performance for geoscience problems, while also providing kernel geospatial functions [9][10]. In addition, achieving service interoperability among cloud environments could be an issue. There are still many obstacles for users who want to obtain cross-cloud services or different services on one platform. However, standards that normalize the description, discovery, access, and evaluation of geospatial data and models could greatly aid data, model and platform interoperability. Other issues related to cloud computing include data and personal privacy, unpredictable performance, data transfer bottlenecks, bugs in large-scale distributed systems, software licensing, and quick scaling [11].

5. ACKNOWLEDGMENTS

Research reported is supported by the FGDC GeoCloud initiative and GEOSS Clearinghouse projects (G09AC00103), and the NASA Cloud Computing project at GSFC (NNX07AD99G). Betsy and Steve helped with language editing.

6. REFERENCES

[1] Armbrust, M., Fox, A., Griffith, R., et al. 2009. Above the Clouds: A Berkeley View of Cloud Computing. Tech. Rep. UCB/EECS-2009-28, EECS Department, University of California, Berkeley, CA. http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.html. (Accessed March, 2010).

[2] NASA Cloud Computing Platform. http://www.nasa.gov/offices/ocio/ittalk/062010_cloud_computing.html. (Accessed March, 2010).

[3] Huang, Q. and Yang, C. 2010. Optimizing Grid Computing Configuration and Scheduling for Geospatial Analysis: An Example with Interpolating DEM. Computers & Geosciences, (in press).

[4] Sun White Paper. 2009. Introduction to Cloud Computing Architecture, 1st edition, June, 2009.

[5] Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., and Warfield, A. 2003. Xen and the art of virtualization. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (Bolton Landing, NY, USA, October 19 - 22, 2003). SOSP '03. ACM, New York, NY, 164-177. DOI= http://doi.acm.org/10.1145/945445.945462

[6] Amazon Web Services. Available at: http://aws.amazon.com. (Accessed June, 2010).

[7] Lucene. Available at: http://lucene.apache.org/java/docs/index.html. (Accessed Sep, 2010).

[8] Pavlo, A., Paulson, E., Rasin, A., Abadi, D. J., DeWitt, D. J., Madden, S., and Stonebraker, M. 2009. A comparison of approaches to large-scale data analysis. In Proceedings of the 35th SIGMOD International Conference on Management of Data (Providence, Rhode Island, USA, June 29 - July 02, 2009). C. Binnig and B. Dageville, Eds. SIGMOD '09. ACM, New York, NY, 165-178. DOI= http://doi.acm.org/10.1145/1559845.1559865

[9] Yang, C., Raskin, R., Goodchild, M. F., and Gahegan, M. 2010a. Geospatial Cyberinfrastructure: Past, Present and Future. Computers, Environment, and Urban Systems, 34(4): 264-277.

[10] Yang, C., Wu, H., Huang, Q., Li, Z., and Li, J. 2010b. Spatial Computing for Supporting Physical Sciences. PNAS. (in review).

[11] Armbrust, M. et al. 2010. A view of cloud computing. Communications of the ACM, 53(4): 50-58.
