Publication and Reuse of Big Data Applications as ...

RUSSIAN-FRENCH WORKSHOP ON BIG DATA APPLICATIONS National Research University Higher School of Economics, Moscow, 02.12.2016

Publication and Reuse of Big Data Applications as Services
Oleg Sukhoroslov
Institute for Information Transmission Problems of the Russian Academy of Sciences (Kharkevich Institute)

Motivation

● Modern science and engineering require advanced computational tools and high-performance resources for simulation, data analysis, etc.
● Specialized information technologies are crucial for supporting research and automating routine activities in such complex environments
● However, small and medium laboratories lack the human and financial resources needed to acquire and operate such technologies

02.12.2016

Publication and Reuse of Big Data Applications as Services

2 / 20

Motivation: Computational Science


Proposed Approach

● Use cloud computing models (SaaS, PaaS) to provide researchers with access to required solutions via remotely accessible services
– Minimal requirements for technical expertise or local infrastructure
– Support discovery within small and medium labs
– Accelerate work by automating routine activities


Scientific Application as a Service

● Software as a Service (SaaS)
● No need to install software and deal with computing resources
● Centralized maintenance and accelerated feature delivery
● Application composition and integration with third-party tools
● Collaboration
● Publication and reproducibility


A Similar Vision from Globus Team

"We are convinced that the Discovery Cloud represents the future of scientific computing. Once realized, it will allow any researcher, in any laboratory, to access, via intuitive interfaces, a rich set of services that collectively automate and accelerate common research activities. Researchers working within SMLs will be able to discover any computational, software, or data resource relevant to their research; track and organize data consumed and produced by their research; access and run powerful modeling and simulation software; and collaborate with colleagues regardless of location - all without installing software, acquiring storage systems or computational infrastructure, or employing IT staff to operate and maintain hardware and software."

Foster, I., Chard, K., & Tuecke, S. (2016, April). The Discovery Cloud: Accelerating and Democratizing Research on a Global Scale. In 2016 IEEE International Conference on Cloud Engineering (IC2E) (pp. 68-77). IEEE.


Computational Service


Everest

● Web-based platform supporting
– Publication of computational applications as services
– Execution of applications on external computing resources
– Sharing applications and resources with other users
– Composition of applications (workflows)
● Platform as a Service (PaaS)
– Remote access via web browser and REST API
– Single platform instance can be accessed by many users
– No installation/hosting is required
● Public instance with open registration
– http://everest.distcomp.org/

Everest Architecture


Supported Application Types

● Command
– Generic template for applications with command-line interface
– Single compute task
● Parameter Sweep
– Large number of independent compute tasks with parametrized inputs
– Generic service for running parameter sweep experiments
● Many-task Application
– Multiple compute tasks, can be created dynamically
– Task dependencies are managed by the application
● Workflow
– Composition of multiple applications
– Multiple jobs with dependencies
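
The parameter sweep type listed above amounts to expanding parametrized inputs into many independent tasks. The following is a minimal Python sketch of that expansion, not Everest's own input format: the command template, the parameter grid and the function name `expand_sweep` are all hypothetical.

```python
from itertools import product


def expand_sweep(template, grid):
    """Return one command line per combination of parameter values.

    `template` is a str.format-style command template and `grid` maps each
    parameter name to the list of values to sweep over. Both are
    hypothetical illustrations, not an Everest input format.
    """
    names = sorted(grid)  # fixed order so the task list is reproducible
    tasks = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        tasks.append(template.format(**params))
    return tasks


if __name__ == "__main__":
    commands = expand_sweep(
        "simulate --alpha {alpha} --seed {seed}",
        {"alpha": [0.1, 0.5], "seed": [1, 2, 3]},
    )
    for command in commands:
        print(command)
```

Because the generated tasks share no state, they can be scheduled on any available resource in any order, which is what makes a generic sweep service feasible.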


Big Data Application as a Service



Examples

● Academic projects
– Globus Genomics
  ● Analysis of next generation sequencing data
  ● Amazon cloud
– PDACS Portal
  ● Analysis of cosmological data
  ● NERSC computing center, Magellan academic cloud
● Commercial cloud solutions
– Amazon ML, Microsoft Azure ML, Databricks Cloud
– General-purpose platforms
  ● Set of universal services
  ● Own infrastructure for data storage and processing

Problems

● Implementation of services on top of existing Big Data infrastructure
– Hadoop cluster
● Integration of services with existing data storage resources and repositories, or with other services
– A problem may require the use of multiple services
● Platforms for implementation and deployment of data-intensive services (DIS)
– Ready-to-use solutions to typical problems of DIS development

Data-Intensive Service: Requirements

● Efficient transfer of large volumes of data to the cluster
● Integration with external data sources
– Cloud services (Dropbox, Google Drive)
– Scientific data repositories (Dataverse, FigShare, Zenodo)
– Specialized databases (1000 Genomes Project)
– File servers and grid services (HTTP, FTP, GridFTP, rsync)
● Caching and reuse of downloaded data
– Protection of user's data
● Transfer of results to the user's computer, external storage or another service
– Remote data access and visualization
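
The caching requirement above can be illustrated with a small sketch: a download is stored under a key derived from its source URL, so repeated requests for the same input reuse the local copy. The class name `DownloadCache` and the injected `fetch` callable are assumptions made for this example and do not describe an existing implementation.

```python
import hashlib
from pathlib import Path


class DownloadCache:
    """Cache downloaded inputs on local disk, keyed by source URL."""

    def __init__(self, cache_dir, fetch):
        # `fetch` is any callable mapping a URL to bytes; it is injected so
        # the transport (HTTP, FTP, ...) stays outside this sketch.
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch

    def path_for(self, url):
        """Map a URL to a stable file name inside the cache directory."""
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / digest

    def get(self, url):
        """Return the local path of `url`, downloading only on a miss."""
        path = self.path_for(url)
        if not path.exists():
            tmp = path.with_name(path.name + ".part")
            tmp.write_bytes(self.fetch(url))
            tmp.replace(path)  # rename so readers never see partial files
        return path
```

Keying on the URL alone ignores who requested the data. A real service would also need to scope entries per user or check access rights before reuse, which is the data protection concern noted in the list; this sketch omits it.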


Decoupling Service from Infrastructure

● Multiple clusters with load balancing
● Resource is provided by user (portable application)
● Direct data transfer to the resource, bypassing the service
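
As an illustration of the first point, a service that is decoupled from its infrastructure can choose a cluster for each incoming task. The sketch below uses a simple least-relative-load policy. This policy and the `Cluster` and `assign` names are assumptions for the example; the slides do not say which balancing strategy is used.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    slots: int        # total task slots on the cluster
    running: int = 0  # tasks currently assigned

    @property
    def load(self):
        """Fraction of slots in use, between 0.0 and 1.0."""
        return self.running / self.slots


def assign(clusters):
    """Pick the cluster with the lowest relative load and book one slot."""
    candidates = [c for c in clusters if c.running < c.slots]
    if not candidates:
        raise RuntimeError("all clusters are full")
    target = min(candidates, key=lambda c: c.load)
    target.running += 1
    return target


if __name__ == "__main__":
    pool = [Cluster("small", slots=2), Cluster("large", slots=4)]
    for _ in range(6):
        print(assign(pool).name)
```

Comparing relative rather than absolute load lets clusters of different sizes fill up at the same rate.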


Using Everest for Building DIS

● Tools for rapid deployment of computational services and integration with computing resources
● Support for passing input files by reference to avoid moving data through the platform (HTTP)
● Direct data transfer to the resource from other types of external data sources
– Agent improvements: stage-in of data from FTP servers, Dropbox and Dataverse
● Staging out results from the resource to external storage
– Agent improvements: stage-out of data to FTP servers and Dropbox
● Integration of the Everest agent with the Hadoop platform
– Agent improvements: adapter for running applications via Hadoop YARN
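
Passing inputs by reference means the platform forwards only URIs, and the agent on the resource decides how to fetch each one. The sketch below shows one way such a dispatch could look. It is not the Everest agent's code: the function `plan_stage_in`, the method names and the `dropbox` and `dataverse` URI schemes are all invented for illustration.

```python
from urllib.parse import urlparse

# Hypothetical mapping from URI scheme to a transfer method name. The
# "dropbox" and "dataverse" schemes are invented for this sketch.
STAGE_IN_METHODS = {
    "http": "http-download",
    "https": "http-download",
    "ftp": "ftp-download",
    "dropbox": "dropbox-api",
    "dataverse": "dataverse-api",
}


def plan_stage_in(inputs, workdir):
    """Turn {input name: URI} into transfer steps run on the resource."""
    steps = []
    for name, uri in inputs.items():
        scheme = urlparse(uri).scheme
        if scheme not in STAGE_IN_METHODS:
            raise ValueError(f"unsupported source for {name!r}: {uri}")
        steps.append({
            "method": STAGE_IN_METHODS[scheme],
            "source": uri,
            "target": f"{workdir}/{name}",
        })
    return steps


if __name__ == "__main__":
    plan = plan_stage_in(
        {"reads": "ftp://example.org/reads.fastq"},
        "/tmp/job-1",
    )
    print(plan)
```

The plan contains only references, so the data itself travels directly from the source to the resource and never passes through the platform.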


Implementation of DIS with Everest


Pilot Service


Conclusion

● Services providing access to computational applications and infrastructure increase research productivity and accelerate innovation
● Big data applications have specific requirements that should be taken into account when implementing data-intensive services
● The Everest platform provides easy-to-use tools for rapid development of both computational and data-intensive services on top of existing computing and data storage resources

http://everest.distcomp.org/