Publication and Reuse of Big Data Applications as ...

RUSSIAN-FRENCH WORKSHOP ON BIG DATA APPLICATIONS National Research University Higher School of Economics, Moscow, 02.12.2016

Publication and Reuse of Big Data Applications as Services

Oleg Sukhoroslov
Institute for Information Transmission Problems of the Russian Academy of Sciences (Kharkevich Institute)

Motivation

● Modern science and engineering require the use of advanced computational tools and high-performance resources for simulation, data analysis, etc.
● Specialized information technologies are crucial for supporting research and automating routine activities in such complex environments
● However, small and medium laboratories lack the human and financial resources needed to acquire and operate such technologies

02.12.2016

Publication and Reuse of Big Data Applications as Services

2 / 20

Motivation: Computational Science


Proposed Approach

● Use cloud computing models (SaaS, PaaS) to provide researchers with access to required solutions via remotely accessible services
  – Minimal requirements for technical expertise or local infrastructure
  – Support discovery within small and medium labs
  – Accelerate work by automating routine activities


Scientific Application as a Service

● Software as a Service (SaaS)
  – No need to install software and deal with computing resources
  – Centralized maintenance and accelerated feature delivery
  – Application composition and integration with third-party tools
  – Collaboration
  – Publication and reproducibility


A Similar Vision from Globus Team

"We are convinced that the Discovery Cloud represents the future of scientific computing. Once realized, it will allow any researcher, in any laboratory, to access, via intuitive interfaces, a rich set of services that collectively automate and accelerate common research activities. Researchers working within SMLs will be able to discover any computational, software, or data resource relevant to their research; track and organize data consumed and produced by their research; access and run powerful modeling and simulation software; and collaborate with colleagues regardless of location - all without installing software, acquiring storage systems or computational infrastructure, or employing IT staff to operate and maintain hardware and software."

Foster, I., Chard, K., & Tuecke, S. (2016, April). The Discovery Cloud: Accelerating and Democratizing Research on a Global Scale. In 2016 IEEE International Conference on Cloud Engineering (IC2E) (pp. 68-77). IEEE.


Computational Service


Everest

● Web-based platform supporting
  – Publication of computational applications as services
  – Execution of applications on external computing resources
  – Sharing applications and resources with other users
  – Composition of applications (workflows)
● Platform as a Service (PaaS)
  – Remote access via web browser and REST API
  – Single platform instance can be accessed by many users
  – No installation/hosting is required
● Public instance with open registration
  – http://everest.distcomp.org/
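
Remote access via a REST API means a job can be submitted with an ordinary HTTP request. The sketch below only builds such a request and does not send it. The endpoint path, the JSON body layout and the bearer token header are hypothetical placeholders chosen for illustration, not the documented Everest API; only the base URL comes from the slide.

```python
import json
from urllib.request import Request

# Base URL of the public instance named on the slide.
BASE_URL = "http://everest.distcomp.org"


def build_job_request(app_id: str, inputs: dict, token: str) -> Request:
    """Build (but do not send) an HTTP request that submits a job.

    Assumptions: the path "/api/apps/<id>", the {"inputs": ...} body and
    the bearer token header are illustrative, not the real Everest API.
    """
    body = json.dumps({"inputs": inputs}).encode("utf-8")
    return Request(
        url=f"{BASE_URL}/api/apps/{app_id}",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


if __name__ == "__main__":
    # "example-app" and the input values are made up for the demo.
    req = build_job_request("example-app", {"n": 10}, token="secret")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Keeping request construction separate from sending makes the client easy to inspect and test without network access; passing the result to `urllib.request.urlopen` would perform the actual call.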

Everest Architecture


Supported Application Types

● Command
  – Generic template for applications with command-line interface
  – Single compute task
● Parameter Sweep
  – Large number of independent compute tasks with parametrized inputs
  – Generic service for running parameter sweep experiments
● Many-task Application
  – Multiple compute tasks, can be created dynamically
  – Task dependencies are managed by the application
● Workflow
  – Composition of multiple applications
  – Multiple jobs with dependencies
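
To make the parameter sweep type concrete, the sketch below expands a command template into one independent task per combination of parameter values. It illustrates the general idea only and is not how Everest describes sweeps; the template syntax, the function name `expand_sweep` and the solver command are assumptions.

```python
from itertools import product


def expand_sweep(command_template: str, parameters: dict) -> list:
    """Return one command line per combination of parameter values.

    `parameters` maps a placeholder name to the list of values to try.
    """
    names = list(parameters)
    tasks = []
    # Cartesian product: every value of each parameter with every other.
    for values in product(*(parameters[name] for name in names)):
        binding = dict(zip(names, values))
        tasks.append(command_template.format(**binding))
    return tasks


if __name__ == "__main__":
    # Hypothetical solver command and parameter ranges.
    for task in expand_sweep(
        "./solver --alpha {alpha} --steps {steps}",
        {"alpha": [0.1, 0.5], "steps": [100, 1000]},
    ):
        print(task)
```

Because the generated tasks share no state, they can be distributed over any number of resources in any order.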

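
A workflow, in contrast, consists of jobs with dependencies, so a valid execution order has to be derived first. A minimal sketch using the standard library's topological sorter is given here; the job names are invented and the function says nothing about how Everest schedules workflows internally.

```python
from graphlib import TopologicalSorter  # available since Python 3.9


def execution_order(dependencies: dict) -> list:
    """Return job names ordered so that each job follows its prerequisites.

    `dependencies` maps a job name to the set of jobs it depends on.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # Hypothetical workflow: two analysis jobs feed a final report.
    jobs = {
        "align": {"fetch"},
        "stats": {"fetch"},
        "report": {"align", "stats"},
    }
    print(execution_order(jobs))
```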

Big Data Application as a Service


Examples

● Academic projects
  – Globus Genomics
    ● Analysis of next generation sequencing data
    ● Amazon cloud
  – PDACS Portal
    ● Analysis of cosmological data
    ● NERSC computing center, Magellan academic cloud
● Commercial cloud solutions
  – Amazon ML, Microsoft Azure ML, Databricks Cloud
  – General-purpose platforms
  – Set of universal services
  – Own infrastructure for data storage and processing

Problems

● Implementation of services on top of existing Big Data infrastructure
  – Hadoop cluster
● Integration of services with existing data storage resources and repositories, or with other services
  – A problem may require the use of multiple services
● Platforms for implementation and deployment of data-intensive services (DIS)
  – Ready-to-use solutions for typical problems related to DIS development

Data-Intensive Service: Requirements

● Efficient transfer of large volumes of data to the cluster
● Integration with external data sources
  – Cloud services (Dropbox, Google Drive)
  – Scientific data repositories (Dataverse, FigShare, Zenodo)
  – Specialized databases (1000 Genomes Project)
  – File servers and grid services (HTTP, FTP, GridFTP, rsync)
● Caching and reuse of downloaded data
  – Protection of user's data
● Transfer of results to the user's computer, external storage or another service
  – Remote data access and visualization
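
The caching requirement can be illustrated with a small download cache keyed by the source URL. This is a sketch under simplifying assumptions: the function name `cached_fetch` and the injected `download` callable are hypothetical, the cache never expires entries, and it does not separate users, which a real service would need in order to protect user data.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, cache_dir: Path,
                 download: Callable[[str], bytes]) -> Path:
    """Return a local path for `url`, downloading only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so that any source maps to a safe, fixed-length file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    target = cache_dir / key
    if not target.exists():
        target.write_bytes(download(url))
    return target


if __name__ == "__main__":
    def fake_download(url: str) -> bytes:
        # Stand-in for a real transfer so the demo needs no network.
        print("downloading", url)
        return b"example payload"

    with tempfile.TemporaryDirectory() as tmp:
        cached_fetch("http://example.org/data.csv", Path(tmp), fake_download)
        # Second call is served from the cache; nothing is downloaded.
        cached_fetch("http://example.org/data.csv", Path(tmp), fake_download)
```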

Decoupling Service from Infrastructure

● Multiple clusters with load balancing
● Resource is provided by user (portable application)
● Direct data transfer to the resource, bypassing the service
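
Once a service is no longer bound to one cluster, each incoming task needs a target. One simple policy, sketched below, sends the task to the cluster with the lowest relative load. The `Cluster` class, its slot-based capacity measure and the policy itself are assumptions for illustration; the slides do not state which balancing strategy is used.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    slots: int        # total task slots (hypothetical capacity measure)
    running: int = 0  # tasks currently assigned

    @property
    def load(self) -> float:
        """Fraction of slots in use."""
        return self.running / self.slots


def assign(clusters: list) -> Cluster:
    """Pick the least-loaded cluster with a free slot and book one slot."""
    free = [c for c in clusters if c.running < c.slots]
    if not free:
        raise RuntimeError("no free slots on any cluster")
    chosen = min(free, key=lambda c: c.load)
    chosen.running += 1
    return chosen


if __name__ == "__main__":
    pool = [Cluster("small", slots=2), Cluster("large", slots=8)]
    for _ in range(5):
        print(assign(pool).name)
```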


Using Everest for Building DIS

● Tools for rapid deployment of computational services and integration with computing resources
● Support for passing input files by reference to avoid moving data through the platform (HTTP)
● Direct data transfer to the resource from other types of external data sources
  – Agent improvements: stage-in of data from FTP servers, Dropbox and Dataverse
● Staging out results from the resource to external storage
  – Agent improvements: stage-out of data to FTP servers and Dropbox
● Integration of Everest agent with Hadoop platform
  – Agent improvements: adapter for running applications via Hadoop YARN
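
Passing inputs by reference implies that the agent on the resource decides how to fetch each input from its URL. The sketch below shows one possible dispatch by URL scheme. The handler names and the `dropbox://` and `dataverse://` schemes are invented for illustration; the actual Everest agent may identify data sources differently.

```python
from urllib.parse import urlparse

# Hypothetical mapping from URL scheme to the name of a transfer handler.
HANDLERS = {
    "http": "http_download",
    "https": "http_download",
    "ftp": "ftp_download",
    "dropbox": "dropbox_download",      # made-up scheme
    "dataverse": "dataverse_download",  # made-up scheme
}


def plan_stage_in(input_refs: dict) -> list:
    """Turn {input name: URL} into (input name, handler, URL) steps.

    Only references are handled here; the data itself would be fetched
    on the resource, so it never passes through the platform.
    """
    plan = []
    for name, url in input_refs.items():
        scheme = urlparse(url).scheme
        if scheme not in HANDLERS:
            raise ValueError(f"unsupported data source for {name!r}: {url}")
        plan.append((name, HANDLERS[scheme], url))
    return plan


if __name__ == "__main__":
    # Example references; hosts and file names are placeholders.
    for step in plan_stage_in({
        "reads": "ftp://example.org/reads.fq",
        "reference": "https://example.org/ref.fa",
    }):
        print(step)
```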


Implementation of DIS with Everest


Pilot Service


Conclusion

● Services providing access to computational applications and infrastructure increase research productivity and accelerate innovation
● Big data applications have specific requirements that should be taken into account when implementing data-intensive services
● The Everest platform provides easy-to-use tools for rapid development of both computational and data-intensive services on top of existing computing and data storage resources

http://everest.distcomp.org/