HPC File Systems: From Cluster to Grid, Oct 4, 2007, IRISA, Rennes

Gfarm Grid File System
Osamu Tatebe, University of Tsukuba

Petascale Data Intensive Computing
High Energy Physics
• CERN LHC, KEK-B Belle
• ~MB/collision, 100 collisions/sec
• ~PB/year
• 2,000 physicists, 35 countries

[Photos: detector for the LHCb experiment; detector for the ALICE experiment]

Astronomical Data Analysis
• Analysis of the whole data set
• TB~PB/year/telescope
• Subaru telescope: 10 GB/night, 3 TB/year

Petascale Data Intensive Computing: Requirements
• Storage capacity: Peta/Exabyte-scale files, millions of millions of files
• Computing power: > 1 TFLOPS, hopefully > 10 TFLOPS
• I/O bandwidth: > 100 GB/s, hopefully > 1 TB/s, within a system and between systems
• Global sharing: group-oriented authentication and access control

Goal and Features of the Grid Datafarm
• Project began in summer 2000, in collaboration with KEK, Titech, and AIST
• Goal: dependable data sharing among multiple organizations; high-speed data access and high-performance data computing
• Grid Datafarm → Gfarm Grid File System: a global, dependable virtual file system that federates the local disks of PCs

• Scalable parallel & distributed data computing
  - Decentralized disk access, giving priority to the local disk
  - File-affinity scheduling

Gfarm Grid File System [CCGrid 2002]
Commodity-based distributed file system that federates the storage of each site
• It can be mounted from all cluster nodes and clients
• It provides scalable I/O performance with respect to the number of parallel processes and users
• It supports fault tolerance and avoids access concentration by automatic replica selection

[Figure: global namespace under /gfarm (ggf, jp, aist, gtrc; file1–file4) mapped onto Gfarm file system nodes, with file replica creation]

Gfarm Grid File System (2)
• Files can be shared among all nodes and clients
• Physically, a file may be replicated and stored on any file system node
• Applications can access a file regardless of its location
• File system nodes can be distributed, and are also reachable through GridFTP, samba, and NFS servers

[Figure: a Gfarm metadata server holds the /gfarm metadata; compute & file system nodes in Japan and the US store replicas of File A, File B, and File C; client and note PCs see a single Gfarm file system]
Scalable I/O Performance
Decentralized disk access, giving priority to the local disk
• When a new file is created:
  - the local disk is selected when there is enough space
  - otherwise, a nearby, least busy node is selected

• When a file is accessed:
  - the local disk is selected if it holds one of the file replicas
  - otherwise, a nearby, least busy node holding one of the replicas is selected

File affinity scheduling
• Schedule a process on a node having the specified file
• Improves the opportunity to access the local disk (see the selection and scheduling sketch below)
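A minimal sketch of the selection and scheduling rules above, assuming hypothetical Node records with free space, load, and a distance metric; this is illustrative pseudocode, not the actual Gfarm scheduler.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_space: int   # bytes available on the node's local disk
    load: float       # current load average
    distance: int     # network distance from the requesting client (0 = local)

def select_node_for_create(local, nodes, size):
    """New file: use the local disk if it has room, else a near, least busy node."""
    if local.free_space >= size:
        return local
    return min(nodes, key=lambda n: (n.distance, n.load))

def select_node_for_access(local, replica_holders):
    """Existing file: use a local replica if present, else a near, least busy holder."""
    if local in replica_holders:
        return local
    return min(replica_holders, key=lambda n: (n.distance, n.load))

def schedule_with_affinity(input_file, nodes, replica_map):
    """File-affinity scheduling: run the process on a node that holds the file."""
    holders = [n for n in nodes if n.name in replica_map.get(input_file, set())]
    pool = holders or nodes
    return min(pool, key=lambda n: n.load)
```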

Scalable I/O performance in a distributed environment
[Figure: user's view (jobs access a shared network file system over the cluster/Grid network) vs. the physical execution view in Gfarm with file-affinity scheduling (file system nodes = compute nodes of the Gfarm file system)]
• User A submits Job A that accesses File A → Job A is executed on a node that has File A
• User B submits Job B that accesses File B → Job B is executed on a node that has File B

• Do not separate storage and CPU (no SAN necessary)
• Move and execute the program instead of moving large-scale data; exploiting local I/O is the key to scalable I/O performance

Particle Physics Data Analysis

• Constructed a 26 TB Gfarm file system using 1,112 nodes
• Stored all 24.6 TB of Belle experiment data
• 52.0 GB/s in parallel read → 3,024 times speedup
• 24.0 GB/s in the skimming process for b → s γ decays using 704 nodes → 3 weeks reduced to 30 minutes

O. Tatebe et al., “High Performance Data Analysis for Particle Physics using the Gfarm File System”, SC06 HPC Storage Challenge, Winner – Large Systems, 2006
[Graph: I/O bandwidth [GB/sec] vs. number of client nodes, reaching 52.0 GB/sec at 1,112 nodes]

PRAGMA Grid
C. Zheng, O. Tatebe et al., “Lessons Learned Through Driving Science Applications in the PRAGMA Grid”, Int. J. Web and Grid Services, Inderscience Enterprises Ltd., 2007
• Worldwide Grid testbed consisting of 14 countries, 29 institutes
• The Gfarm file system is used as the file-sharing infrastructure
  → executables and input/output data can be shared across the Grid
  → no explicit staging to a local cluster is needed

More Features of the Gfarm Grid File System
• Commodity-PC-based scalable architecture
  - Add commodity PCs to increase storage capacity during operation, to well beyond petabyte scale
  - Even PCs at distant locations over the Internet

• Adaptive replica creation and consistent management
  - Create multiple file replicas to increase performance and reliability
  - Create file replicas at distant locations for disaster recovery (see the placement sketch below)
• Open Source Software
  - Linux binary packages, ports for *BSD, . . .
  - Included in Naregi, Knoppix HTC edition, and the Rocks distribution
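The adaptive replica creation above amounts to placing copies at distant sites. Below is a minimal, hypothetical sketch of such a placement policy; StorageNode, its site labels, and the selection heuristic are assumptions, not Gfarm's actual replication code.

```python
from dataclasses import dataclass

@dataclass
class StorageNode:
    name: str
    site: str        # site label, e.g. "tsukuba" or "san-diego" (hypothetical)
    free_space: int  # bytes

def pick_replica_destinations(holders, candidates, copies_wanted, file_size):
    """Add replicas on nodes whose sites do not yet hold a copy of the file."""
    covered_sites = {n.site for n in holders}
    chosen = []
    for node in sorted(candidates, key=lambda n: -n.free_space):
        if len(holders) + len(chosen) >= copies_wanted:
            break
        if node.site not in covered_sites and node.free_space >= file_size:
            chosen.append(node)
            covered_sites.add(node.site)
    return chosen
```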

• Existing applications can access it without any modification (see the sketch below)
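To make the "no modification" point concrete, here is a minimal sketch assuming the Gfarm file system is mounted at the hypothetical path /gfarm (for example via the FUSE client or an NFS/samba gateway); an unmodified program simply performs ordinary POSIX I/O.

```python
import os

# Hypothetical mount point and path, for illustration only.
mount_point = "/gfarm"
path = os.path.join(mount_point, "jp", "example", "result.txt")
os.makedirs(os.path.dirname(path), exist_ok=True)

# Ordinary POSIX I/O; the mounted file system (FUSE, NFS, samba) translates
# these calls into Gfarm operations underneath.
with open(path, "w") as f:
    f.write("written through the mounted Gfarm file system\n")

with open(path) as f:
    print(f.read())
```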

SC03 Bandwidth

SC05 StorCloud

SC06 HPC Storage Challenge Winner

Design and implementation of Gfarm v2

Problems in Gfarm v1
• Security issues
  - No access control of metadata
  - Not enough physical file protection
• Inconsistency between physical files and metadata
  - Renaming, removing, or changing metadata not supported while a file is open
• Metadata access overhead
• No user and group management
• No privileged user

Design Policy of Gfarm v2
• Inherit the architectural benefit of scalable I/O performance using commodity platforms
• Design as a POSIX-compliant file system
• Solve the previous problems
• Improve performance for small files
  - Reduce metadata access overhead
• Grid file system → distributed file system
  - Still benefit from improvements of the local file system
  - Compete with NFS, AFS, and Lustre

Software Components (http://datafarm.apgrid.org/)
• gfarm2fs and libgfarm – mount the Gfarm file system and provide the Gfarm API
• gfmd – metadata server: namespace, replica catalog, host information, process information
• gfsd – I/O server: file access

[Figure: an application goes through gfarm2fs and libgfarm to the metadata server (gfmd) and to gfsd running on each compute and file system node]

Implementation of Gfarm v2 (1): Metadata server – gfmd
• Follows directory-based access control faithfully
• Caches all metadata in memory
  - No disk wait when responding to a request
  - Backend database access in the background

• Manages file descriptors for each process, and the processes that open each file
  - Consistency management among different processes

• Monitors the status of all I/O servers
  - Heartbeat, CPU load, capacity, …

• Introduces compound RPC procedures (sketched below)
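The following toy sketch illustrates two of the gfmd design points above, in-memory metadata with background backend-database writes and compound RPCs, without claiming to match gfmd's real protocol or code; the class, its operations, and the queue-based write-behind are assumptions for illustration.

```python
import queue
import threading

class MetadataServerSketch:
    """Toy gfmd-like server: all metadata in memory, backend database updates
    done in the background, and compound RPCs handled in one round trip."""

    def __init__(self):
        self.metadata = {}                  # path -> attribute dict, all in memory
        self.backend_queue = queue.Queue()  # pending writes to the backend database
        threading.Thread(target=self._backend_writer, daemon=True).start()

    def _backend_writer(self):
        # Persist updates asynchronously so requests never wait on disk.
        while True:
            path, attrs = self.backend_queue.get()
            self._write_to_backend_db(path, attrs)

    def _write_to_backend_db(self, path, attrs):
        pass  # stand-in for the real backend database update (omitted)

    def handle(self, op, path, attrs=None):
        # Served entirely from memory; the disk write happens in the background.
        if op == "set":
            self.metadata[path] = attrs
            self.backend_queue.put((path, attrs))
            return "ok"
        if op == "get":
            return self.metadata.get(path)
        raise ValueError(op)

    def handle_compound(self, ops):
        """Compound RPC: many operations in one request, one network round trip."""
        return [self.handle(*op) for op in ops]

# One round trip instead of three:
server = MetadataServerSketch()
print(server.handle_compound([
    ("set", "/gfarm/jp/file2", {"size": 0}),
    ("set", "/gfarm/jp/file2", {"size": 4096}),
    ("get", "/gfarm/jp/file2"),
]))
```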

Implementation of Gfarm v2 (2): I/O server – gfsd
• Accesses files based on the file descriptor, not the path name
  - Obtains the inode number from the metadata server

• Updates metadata at close time
• Does not keep unnecessary metadata information
  - No need to access gfsd for rename, chmod, . . . (see the sketch below)
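A minimal sketch of the gfsd behaviour described above: physical files addressed by inode number rather than path, with size and modification time reported to the metadata server only at close time. The class, the spool-directory layout, and the callback standing in for the RPC to gfmd are assumptions, not Gfarm's actual implementation.

```python
import os
import time

class IOServerSketch:
    """Toy gfsd-like I/O server: physical files are addressed by the inode
    number handed out by the metadata server, not by path name, and size and
    mtime are reported back to the metadata server only at close time."""

    def __init__(self, spool_dir, report_metadata):
        self.spool = spool_dir                  # local directory with physical data
        self.report_metadata = report_metadata  # callback standing in for the RPC to gfmd

    def open_by_inode(self, inode):
        # Files are stored under their inode number, so rename/chmod of the
        # logical path are pure metadata operations and never reach this server.
        return open(os.path.join(self.spool, str(inode)), "a+b")

    def close(self, inode, fh):
        fh.flush()
        size = os.fstat(fh.fileno()).st_size
        fh.close()
        # Report the final size and mtime to the metadata server at close time.
        self.report_metadata(inode, {"size": size, "mtime": time.time()})

# Usage: the spool path is hypothetical; the lambda prints what would be sent to gfmd.
server = IOServerSketch("/var/tmp/gfarm-spool", lambda ino, attrs: print(ino, attrs))
```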

Opening files in read-write mode (1)
Semantics (the same as AFS):
• Without advisory file locking: updated content becomes available only when the file is opened after a writing process has closed it
• With advisory file locking: among the processes that lock the file, up-to-date content is available in the locked region; this is not ensured when a process writes the same file without file locking
The close-to-open rule is illustrated by the sketch below.
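The close-to-open rule (without locking) can be made concrete with a small executable model; this is only an illustration of the visibility rule stated above, not Gfarm code.

```python
class FileModelSketch:
    """Toy model of close-to-open semantics: a reader sees the writer's update
    only if it opened the file after the writer closed it."""

    def __init__(self):
        self.committed = b""       # content visible to new opens
        self.open_snapshots = {}   # handle -> content seen by that open

    def open(self, handle):
        self.open_snapshots[handle] = self.committed

    def read(self, handle):
        return self.open_snapshots[handle]

    def write_and_close(self, handle, data):
        self.committed = data      # becomes visible to subsequent opens
        del self.open_snapshots[handle]

f = FileModelSketch()
f.open("reader-1")                 # opened before the writer closes
f.open("writer")
f.write_and_close("writer", b"new content")
print(f.read("reader-1"))          # b"" : old content, opened too early
f.open("reader-2")                 # opened after the writer closed
print(f.read("reader-2"))          # b"new content"
```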

Opening files in read-write mode (2)
[Figure: Process 1 opens /grid/jp/file2 in read-write mode while Process 2 opens it read-only; the metadata server maps file2 to replicas on FSN1 and FSN2. Before the writer closes, any file copy can be accessed; after fclose(), the now-invalid copy on FSN2 is deleted from the metadata, but ongoing file access to it continues.]

Consistent update of metadata (1)
Gfarm v1 – the Gfarm library updates the metadata
[Figure: the application, through the Gfarm library, opens a file on file system node FSN1 and updates the metadata server at close time]
• Metadata is not updated if the application crashes unexpectedly

Consistent update of metadata (2)
Gfarm v2 – the file system node updates the metadata
[Figure: the application opens a file on file system node FSN1 through the Gfarm library; at close time, or when the connection breaks (broken pipe), FSN1 updates the metadata server]
• Metadata is updated by the file system node even if the application crashes unexpectedly
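A small sketch of the Gfarm v2 behaviour above, assuming a socket between the client and the file system node: the same code path commits the metadata update whether the client closes the file normally or crashes (EOF or broken pipe on the connection). This is illustrative only, not gfsd's actual code.

```python
import socket
import threading

def serve_client_sketch(conn, inode, update_metadata):
    """The file system node, not the client library, commits the metadata update."""
    bytes_written = 0
    try:
        while True:
            data = conn.recv(4096)
            if not data:                # clean close *or* client crash: EOF
                break
            bytes_written += len(data)  # stand-in for writing to the local spool
    except ConnectionError:
        pass                            # broken pipe / reset: treat like EOF
    finally:
        # Runs in every case, so metadata is updated even after a crash.
        update_metadata(inode, {"size": bytes_written})
        conn.close()

# Usage with a socket pair standing in for the client connection:
client_end, server_end = socket.socketpair()
t = threading.Thread(target=serve_client_sketch,
                     args=(server_end, 42, lambda ino, m: print("metadata", ino, m)))
t.start()
client_end.sendall(b"x" * 10000)
client_end.close()   # an abrupt client exit would look the same to the server
t.join()
```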

Software Stack
[Figure: Gfarm applications and commands, UNIX applications and commands, and Windows applications (Word, PPT, Excel, etc.) reach the Gfarm Grid File System either directly or through NFS, Windows (samba), FTP, SFTP/SCP, and Web clients served by nfsd, samba, ftpd, sshd, and httpd; all of these sit on GfarmFS-FUSE (which mounts the file system in user space) and libgfarm, the Gfarm I/O library]

Open Source Development
Sourceforge.net: http://sourceforge.net/projects/gfarm
Source code:
  svn co https://gfarm.svn.sourceforge.net/svnroot/gfarm/gfarm/branches/gfarm_v2
  svn co https://gfarm.svn.sourceforge.net/svnroot/gfarm/gfarm2fs/trunk gfarm2fs

Summary
Gfarm file system
• Scalable commodity-based architecture
• File replicas for fault tolerance and hot-spot avoidance
• Capacity can be increased or decreased during operation
Gfarm v1
• Used for several production systems
• Scales to 1000+ clients and file system nodes
Gfarm v2
• Plugs the security hole in Gfarm v1 and improves metadata access performance
• Performance comparable with NFS for small files in a LAN (0.2 ~ 0.5 milliseconds)

• Scalable file I/O performance even in a distributed environment
  - 1433 MB/sec parallel read I/O performance from 22 clients in Japan and the US

Open Source Development: http://sourceforge.net/projects/gfarm
Collaboration with the JuxMem team
• Gfarm provides persistence for JuxMem