RUSSIAN-FRENCH WORKSHOP ON BIG DATA APPLICATIONS
National Research University Higher School of Economics, Moscow, 02.12.2016

Publication and Reuse of Big Data Applications as Services

Oleg Sukhoroslov
Institute for Information Transmission Problems of the Russian Academy of Sciences (Kharkevich Institute)
Motivation
● Modern science and engineering require advanced computational tools and high-performance resources for simulation, data analysis, etc.
● Specialized information technologies are crucial for supporting research and automating routine activities in such complex environments
● However, small and medium laboratories lack the human and financial resources needed to acquire and operate such technologies
Motivation: Computational Science
Proposed Approach
● Use cloud computing models (SaaS, PaaS) to provide researchers with access to required solutions via remotely accessible services
  – Minimal requirements for technical expertise or local infrastructure
  – Support discovery within small and medium labs
  – Accelerate work by automating routine activities
Scientific Application as a Service
● Software as a Service (SaaS)
● No need to install software and deal with computing resources
● Centralized maintenance and accelerated feature delivery
● Application composition and integration with third-party tools
● Collaboration
● Publication and reproducibility
A Similar Vision from the Globus Team

"We are convinced that the Discovery Cloud represents the future of scientific computing. Once realized, it will allow any researcher, in any laboratory, to access, via intuitive interfaces, a rich set of services that collectively automate and accelerate common research activities. Researchers working within SMLs will be able to discover any computational, software, or data resource relevant to their research; track and organize data consumed and produced by their research; access and run powerful modeling and simulation software; and collaborate with colleagues regardless of location - all without installing software, acquiring storage systems or computational infrastructure, or employing IT staff to operate and maintain hardware and software."

Foster, I., Chard, K., & Tuecke, S. (2016, April). The Discovery Cloud: Accelerating and Democratizing Research on a Global Scale. In 2016 IEEE International Conference on Cloud Engineering (IC2E) (pp. 68-77). IEEE.
Computational Service
Everest
● Web-based platform supporting
  – Publication of computational applications as services
  – Execution of applications on external computing resources
  – Sharing applications and resources with other users
  – Composition of applications (workflows)
● Platform as a Service (PaaS)
  – Remote access via web browser and REST API
  – Single platform instance can be accessed by many users
  – No installation/hosting is required
● Public instance with open registration
  – http://everest.distcomp.org/
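The REST API mentioned above lets users drive published applications programmatically rather than through the web browser. The sketch below is purely illustrative: the payload fields and the endpoint path in the comment are assumptions for demonstration, not the documented Everest API.

```python
import json

def make_job_request(app_id, inputs):
    """Build a JSON job request for a published application service.

    Field names ("app", "inputs") are illustrative assumptions,
    not the real Everest API schema.
    """
    return json.dumps({
        "app": app_id,
        "inputs": inputs,  # e.g. parameter values or file references
    })

req = make_job_request("hypothetical-app", {"n": 100})
# A client would POST this payload to the platform, e.g. (illustrative path):
#   POST https://everest.distcomp.org/api/jobs
print(json.loads(req)["app"])
```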
Everest Architecture
Supported Application Types
● Command
  – Generic template for applications with a command-line interface
  – Single compute task
● Parameter Sweep
  – Large number of independent compute tasks with parametrized inputs
  – Generic service for running parameter sweep experiments
● Many-task Application
  – Multiple compute tasks, can be created dynamically
  – Task dependencies are managed by the application
● Workflow
  – Composition of multiple applications
  – Multiple jobs with dependencies
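The Parameter Sweep type above maps naturally onto a Cartesian product of parameter values: each combination becomes one independent compute task. A minimal sketch (the parameter names and the dict-per-task representation are illustrative, not the platform's actual task format):

```python
import itertools

def sweep_tasks(params):
    """Expand a dict of parameter-value lists into one task per combination."""
    names = sorted(params)
    for values in itertools.product(*(params[n] for n in names)):
        yield dict(zip(names, values))

# Two parameters with 3 x 2 values -> 6 independent compute tasks
tasks = list(sweep_tasks({"alpha": [0.1, 0.5, 1.0], "seed": [1, 2]}))
print(len(tasks))  # 6
```

Because the tasks share no dependencies, a service can submit all of them to the available resources at once, unlike the Many-task and Workflow types where ordering matters.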
Big Data Application as a Service
Examples
● Academic projects
  – Globus Genomics
    ● Analysis of next generation sequencing data
    ● Amazon cloud
  – PDACS Portal
    ● Analysis of cosmological data
    ● NERSC computing center, Magellan academic cloud
● Commercial cloud solutions
  – Amazon ML, Microsoft Azure ML, Databricks Cloud
  – General-purpose platforms
    ● Set of universal services
    ● Own infrastructure for data storage and processing
Problems
● Implementation of services on top of existing Big Data infrastructure
  – Hadoop cluster
● Integration of services with existing data storage resources and repositories, or with other services
  – A problem may require the use of multiple services
● Platforms for implementation and deployment of data-intensive services (DIS)
  – Ready-to-use solutions to typical problems in DIS development
Data-Intensive Service: Requirements
● Efficient transfer of large volumes of data to the cluster
● Integration with external data sources
  – Cloud services (Dropbox, Google Drive)
  – Scientific data repositories (Dataverse, FigShare, Zenodo)
  – Specialized databases (1000 Genomes Project)
  – File servers and grid services (HTTP, FTP, GridFTP, rsync)
● Caching and reuse of downloaded data
● Protection of user's data
● Transfer of results to the user's computer, external storage, or another service
● Remote data access and visualization
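The caching requirement above can be met with a simple content-addressed store: data fetched once is keyed by a hash of its source URL and reused on later requests instead of being downloaded again. A minimal local sketch, where the fetcher is a stand-in for a real HTTP/FTP/Dropbox transfer:

```python
import hashlib

class DownloadCache:
    """Cache fetched data by source URL so repeated requests reuse one download."""
    def __init__(self, fetch):
        self._fetch = fetch   # callable: url -> bytes
        self._store = {}      # cache key -> data
        self.downloads = 0    # how many real transfers happened

    def get(self, url):
        key = hashlib.sha256(url.encode()).hexdigest()
        if key not in self._store:
            self._store[key] = self._fetch(url)
            self.downloads += 1
        return self._store[key]

def fake_fetch(url):
    # Stand-in for an actual network transfer
    return b"data for " + url.encode()

cache = DownloadCache(fake_fetch)
cache.get("ftp://example.org/genome.vcf")
cache.get("ftp://example.org/genome.vcf")  # second call is served from cache
print(cache.downloads)  # 1
```

A production service would also hash the file contents and evict stale entries, but the reuse principle is the same.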
Decoupling Service from Infrastructure
● Multiple clusters with load balancing
● Resource is provided by the user (portable application)
● Direct data transfer to the resource, bypassing the service
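The first point, multiple clusters with load balancing, can be as simple as routing each new job to the least-loaded cluster. A toy sketch (the cluster names and the running-jobs load metric are illustrative):

```python
def pick_cluster(loads):
    """Return the cluster with the fewest running jobs (least-loaded routing)."""
    return min(loads, key=loads.get)

# Current running-job counts per cluster (illustrative values)
loads = {"cluster-a": 12, "cluster-b": 3, "cluster-c": 7}
target = pick_cluster(loads)
print(target)  # cluster-b
```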
Using Everest for Building DIS
● Tools for rapid deployment of computational services and integration with computing resources
● Support for passing input files by reference to avoid moving data through the platform (HTTP)
● Direct data transfer to the resource from other types of external data sources
  – Agent improvements: stage-in of data from FTP servers, Dropbox and Dataverse
● Staging out results from the resource to external storage
  – Agent improvements: stage-out of data to FTP servers and Dropbox
● Integration of the Everest agent with the Hadoop platform
  – Agent improvements: adapter for running applications via Hadoop YARN
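Stage-in from heterogeneous sources (FTP servers, Dropbox, Dataverse) naturally dispatches on the scheme of the input reference. The sketch below shows how an agent might select a transfer handler; the handler registry and scheme names are assumptions for illustration, not the actual Everest agent internals:

```python
from urllib.parse import urlparse

# Illustrative handler registry; a real agent would map schemes to
# transfer functions rather than description strings.
HANDLERS = {
    "http": "download via HTTP",
    "ftp": "download via FTP",
    "dropbox": "fetch via Dropbox API",
    "dataverse": "fetch via Dataverse API",
}

def stage_in(url):
    """Pick a transfer handler for an input reference based on its URL scheme."""
    scheme = urlparse(url).scheme
    if scheme not in HANDLERS:
        raise ValueError(f"no stage-in handler for scheme: {scheme}")
    return HANDLERS[scheme]

print(stage_in("ftp://data.example.org/input.csv"))  # download via FTP
```

Stage-out is symmetric: the destination reference's scheme selects where results are pushed after the job completes.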
Implementation of DIS with Everest
Pilot Service
Conclusion
● Services providing access to computational applications and infrastructure increase research productivity and accelerate innovation
● Big data applications have specific requirements that should be taken into account when implementing data-intensive services
● The Everest platform provides easy-to-use tools for rapid development of both computational and data-intensive services on top of existing computing and data storage resources

http://everest.distcomp.org/