Adapting the VLT data flow system for handling high data rates

Stefano Zampieri 1,a, Michele Peron a, Olivier Chuzel a, Neil Ferguson a,b, Jens Knudstrup a, Nick Kornweibel a

a Data Management and Operations Division, European Southern Observatory, Karl-Schwarzschild-Str. 2, D-85748 Garching bei München, Germany
b Rutherford Appleton Laboratory, CCLRC, Chilton, Didcot, OX11 0QX, United Kingdom

1 Email: [email protected]

ABSTRACT

The Data Flow System (DFS) for the ESO VLT provides a global approach to handle the flow of science related data in the VLT environment. It is a distributed system composed of a collection of components for preparation and scheduling of observations, archiving of data, pipeline data reduction and quality control. Although the first version of the system became operational in 1999 together with the first UT, additional developments were necessary to address new operational requirements originating from new and complex instruments which generate large amounts of data. This paper presents the hardware and software changes made to meet those challenges within the back-end infrastructure, including on-line and off-line archive facilities, parallel/distributed pipeline processing and improved association technologies.

Keywords: Data Flow System, high data rate, parallel processing

1. INTRODUCTION

The DFS is a distributed system, consisting of a set of instrument generic applications (called the infrastructure) and additional instrument specific tools. Instrument pipelines belong to the second category, while observation preparation tools are part of the DFS infrastructure. As described in detail in [1], the DFS has been continuously evolving since the beginning of VLT operations. This evolution has been driven by requirements originating from new instruments and by the optimization of the VLT operational concepts. One of the key aspects of the DFS infrastructure is its instrument independence: no structural changes should be required to support a new instrument. Instrument specific issues are typically encapsulated in configuration files or database tables. There are cases, however, where improvements to the infrastructure are required to satisfy the requirements coming from new instruments. For example the OmegaCAM instrument at the VLT Survey Telescope (VST) is expected to generate about 100 GB per observing night, with peaks of up to 300 GB per night. If we consider that the total amount of data generated in one night by all operational VLT instruments averages 30 GB, it is easy to see why OmegaCAM represents a challenge for our data handling systems. Within the DFS, the back-end subsystem, which is responsible for handling data after the observation, has been the most affected by the increased data rate. In this article we describe the evolution of the DFS back-end infrastructure towards an optimized system able to efficiently process large volumes of data.

2. DFS BACK-END

The DFS back-end infrastructure provides functionalities supporting the Paranal Science Operations (PSO), Data Flow Operations (DFO) and Operations Technical Support (OTS) operational processes. It is used in the on-line (i.e. at the ESO observatories) and off-line (i.e. at the ESO headquarters) environments and includes:
• Archival of raw data generated by VLT instruments on medium-term storage
• Organization of raw data for on-line pipeline and quality control
• Organization of raw data for off-line pipeline and quality control
• Preparation and delivery of data packages to the users
• Support of the quality control process in Garching
• Retrieval of data through the archive.

In addition a number of tools are made available to the user community, including data management tools. As mentioned, the DFS back-end is a collection of components that have been developed over the past seven years. Some of the applications have been operational since VLT first light; others were developed over the following years to satisfy new operational requirements. In the following paragraphs the "original" components are briefly described. In section four we give more details about the new developments.

On Line Archive System
The On Line Archive System (OLAS) is a central part of the DFS and is in operation at Paranal and La Silla. The main responsibilities of OLAS are:
• Temporary storage of data
• Verification of data consistency and integrity
• Data distribution (archive, pipeline, user)
• Ingestion of header information into the database
OLAS has been operational for more than five years and can be considered a mature application, able to cope with many new operational scenarios (for example the quick delivery of pre-imaging data to Garching). Some of the key features of OLAS are its reliability and modular architecture; another important aspect is the extensive test suite that was developed over the years. One of the critical aspects of OLAS is data distribution; frames shall be delivered to the subscribers at approximately the same rate at which they are received by OLAS. In the current system data distribution is performed sequentially, i.e. a frame is delivered to one subscriber at a time, for all the subscribers. This solution is simple and satisfies the throughput requirements of the first generation VLT instruments. In order to sustain the data rates expected from instruments like OmegaCAM, we need to perform data distribution in a more efficient way. In section four we explain how we have managed to increase the OLAS throughput by multi-threading the data distribution.

Archive Storage System
The Archive Storage System (ASTO) is responsible for long-term archiving of data on various types of media, mainly CD and DVD. ASTO is still used at Paranal as the media production system for the VLT archive and for the user data, but it is gradually being replaced by the Next Generation Archive System (NGAS), based on hard disks.

Data Organization and Recipe Execution
In the on-line environment the pipeline infrastructure is responsible for organizing the data and scheduling the execution of pipeline recipes. Frames are processed sequentially, classified and associated with the relevant calibrations by applying instrument specific rules. Data reduction jobs are also executed sequentially. This system satisfies the data processing needs of the VLT first generation instruments, but it is unable to meet all the requirements of future instruments. The limiting factors of the current pipeline infrastructure are:
• Data organization is performed exclusively on frames stored on disk and not on database entries
• The scheduling model does not allow mapping to a multi-processor or cluster environment
• The off-line environment (quality control process) is not supported

Data Packer
The Data Packer is used at Paranal to prepare the visiting astronomers' data packages. A package typically contains the user's raw science data and the appropriate calibrations. Similar tools are used in Garching to prepare data packages for the service mode users.

Gasgano
Gasgano is a graphical application that can be used to manage and organize in a systematic way the astronomical data observed and produced by all VLT compliant telescopes. Gasgano can also be used as a front-end interface to the VLT instrument pipelines. It provides functionalities for grouping, classifying and inspecting data.

3. NEW REQUIREMENTS

Although the framework of the Data Flow System can be considered complete after five years of operation, the appearance of new instruments generating large amounts of data requires new thinking and new software development. The following diagram shows the volume of data generated by all ESO instruments (in GB per year) as a function of time. The steady increase from 1999 on is due to the commissioning of new Unit Telescopes (UT) and instruments at the VLT, and the major steps expected for 2005 and 2007 are related to the commissioning of VST/OmegaCAM and VISTA, respectively. OmegaCAM alone will generate approximately as much data as all currently operational ESO instruments together. This means that a single DFS (the one installed at the VST) will have to handle the same amount of data that is currently handled by six DFS installations (UT1, UT2, UT3, UT4, NTT and VLTI).

[Figure: Amount of data generated by ESO instruments (GB/year) versus year (1998-2007), with the contributions of the VLT, the VST and VISTA indicated; the vertical scale extends to 70000 GB/year.]

The load on the DFS back-end, both in the on-line and in the off-line environment, is driven by the combination of data volume and data complexity, where data volume can be characterized as the number of pixels per unit of time and data complexity as the number of floating point operations per pixel. The on-line DFS must be capable of archiving all data acquired during a night within the following day and of processing the data resulting from the execution of an Observation Block (OB) while the next OB is being executed. The off-line DFS system shall be able to process one night of data in 8-10 hours. In addition the archive system must be:
• operationally cost-effective
• reliable and efficient
• scalable, i.e. the number of entries and the volume of data shall not affect the access time to single data sets
The currently operational DFS is capable of handling a data volume of about 40 GB per night and is therefore able to support the existing VLT instruments. OmegaCAM, designed around a mosaic of 2k x 4k CCDs, is characterized by a high data volume; it is expected to generate about 100 GB per night, with peaks of up to 300 GB per night. It is not a complex instrument, i.e. the pipeline performs relatively few floating point operations per pixel, but the very large data files (approx. 500 MB) contain many pixels. It will be the first instrument, followed by VISTA, to require a next generation DFS back-end which addresses the following issues:
• The requirements on the on-line DFS system can only be met if the system minimizes data movement, file copies and retention time on disk
• The Data Flow System shall fully support FITS files with extensions
• The design of the new system shall be based on the hardware which is currently supported in the on-line environment (HP-UX, except for data reduction pipelines which run on Linux boxes)
• Data processing shall take advantage of the nature of OmegaCAM and VISTA, which allow their individual chips to be processed independently
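As a back-of-the-envelope illustration of the sustained transfer rates implied by these requirements, the following small Python sketch uses the archiving window (one night of data archived within the following day, taken here as 24 hours) and the off-line window (8-10 hours, taken here as 10 hours) stated above; everything else is plain arithmetic:

    def sustained_rate_mb_s(volume_gb, window_hours):
        # Average transfer rate (MB/s) needed to move volume_gb within window_hours.
        return volume_gb * 1024.0 / (window_hours * 3600.0)

    print("typical night, archived in 24 h: %.1f MB/s" % sustained_rate_mb_s(100, 24))  # ~1.2
    print("peak night, archived in 24 h:    %.1f MB/s" % sustained_rate_mb_s(300, 24))  # ~3.6
    print("off-line reprocessing in 10 h:   %.1f MB/s" % sustained_rate_mb_s(100, 10))  # ~2.8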

4. SOLUTIONS

In order to meet the data archiving and processing requirements described in the previous section, we have identified a number of structural changes to be applied to the DFS back-end. Some of those changes have already been implemented and are described in detail in this section. The main areas where new developments have been made are:
• On-line archive
• Pipeline infrastructure
• Association technologies
One common way to increase the throughput of a system is to execute tasks in parallel where possible. Parallel systems are typically more complex than sequential ones, so complexity and maintainability should also be taken into account when designing such a system. The DFS is a distributed system where several tasks are executed at the same time by various processes; therefore some degree of parallelism is already built into the system. On the other hand there are tasks that are executed sequentially and could be parallelized, like data distribution or data reduction. In these cases it is very important to carefully analyze the pros and cons of restructuring the code to make it more parallel, because this exercise can compromise the reliability of the system without necessarily improving the performance. Another important aspect is the testability of the system; parallel systems are usually much harder to test than sequential ones.
Of course, increasing the parallelism of the system is not the only way to improve the throughput. In addition, or as an alternative, to parallelizing the software, one can also buy more powerful hardware. This is especially true for the pipeline machines, which are dedicated to CPU intensive data reduction tasks. Since 2003 we have been replacing the existing HP J5000 (or J5600) pipeline workstations with dual processor Linux PCs; as a result, the data processing rate has increased by at least a factor of two, and some recipes run up to ten times faster on the Linux boxes.
Another way to improve the throughput of a data handling system is through automation, reducing as much as possible the amount of manual intervention required. As an example we can mention the ongoing transition from DVDs to Hard Disks (HD) as storage media for archiving data. The capacity of a DVD is about 4.7 GB, whereas the HDs currently on the market can store up to 300 GB. One of the main advantages of NGAS over the current DVD based system is that it allows much more efficient archiving of data, because the manual intervention required for handling HDs is much less than that required for DVDs.
The following diagram shows the most probable layout of the on-line DFS for future high data rate instruments like OmegaCAM.

[Figure: Probable layout of the on-line DFS for high data rate instruments, showing the Instrument WS, the OLAS WS (HP J5600), the NGAS Archive (Linux PC) feeding the Data Archive (NGAS cluster), the Database Server, the Data Pipeline (Linux cluster) and the Pipeline/User WS (HP J5600) used to display "quick look" data. OLAS ingests header keywords into the database and distributes the raw data; NGAS ingests information about the archived data; the pipeline classifies and associates the data and queries the archive.]
The key components described by this diagram are:
• OLAS: responsible for file integrity checks, database ingestion and file distribution
• NGAS: responsible for archiving raw data and master calibrations; also has file distribution functions
• Pipeline: responsible for data reduction
• Database: responsible for storing header information, pipeline command queues, logs, etc.
The new developments affecting these components are described in the following paragraphs.

OLAS
In section two we saw how sequential data distribution limits the throughput of the OLAS system. The following solutions were adopted and will be integrated into the DFS installation at the VST:
1. Restructuring the Data Flow layout in order to share the data distribution task between OLAS and NGAS. In particular, OLAS delivers data to NGAS and to the Pipeline, and NGAS forwards the raw data to the User WS.
2. Multi-threading the data distribution to better utilize the network bandwidth. Each new frame is delivered to all data subscribers in parallel (one thread per subscriber), resulting in more efficient usage of the available resources (network bandwidth, disk I/O). This solution would also allow exploiting the additional bandwidth provided by multiple LAN cards on the OLAS WS (a minimal sketch of this scheme is given after this list).
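The following Python sketch illustrates the "one thread per subscriber" distribution scheme; it is for illustration only, and deliver_frame as well as the subscriber names are placeholders rather than the actual OLAS interface:

    import threading

    def deliver_frame(frame, subscriber):
        # Placeholder for the actual transfer (e.g. copying the FITS file to the
        # subscriber's input area); illustrative only.
        print("delivering %s to %s" % (frame, subscriber))

    def distribute(frame, subscribers):
        # Deliver one frame to all subscribers in parallel, one thread each.
        threads = [threading.Thread(target=deliver_frame, args=(frame, s))
                   for s in subscribers]
        for t in threads:
            t.start()
        # Wait for all deliveries before reporting the frame as distributed.
        for t in threads:
            t.join()

    distribute("OMEGACAM.2005-01-01T00:00:00.fits", ["NGAS", "Pipeline"])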

The tests performed so far show that with these enhancements OLAS will be able to meet the throughput requirements of OmegaCAM. Another important task performed by the OLAS software is the ingestion of header information into the on-line database. This aspect was also enhanced; in particular the following capabilities were added:
• Ingestion of complete FITS headers (compressed). For each frame, the complete FITS header is ingested into the database and copied to Garching through database replication. Headers are used mostly by QC scientists in Garching for quality control purposes and for data organization in the off-line environment, but can also be viewed/downloaded by the user community through the web.
• Ingestion of instrument specific information into the database. In the previous system only a small selection of header keywords was stored in the DB. This was enough for the purpose of tracking some general information about each observation and to support basic applications like the packing of visiting astronomers' data. Now it is possible to ingest into the DB an arbitrary selection of keywords. Applications (e.g. the Data Organizer) will be migrated to read header information from the database instead of from the FITS files on the file system.
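As an illustration of this keyword-based access pattern, the sketch below reads a few header keywords for one frame from a hypothetical keyword table instead of opening the FITS file; the table and column names are invented for the example and do not reflect the actual ESO database schema:

    import sqlite3  # stand-in for the production database connection

    def get_keywords(conn, file_id, names):
        # Return {keyword: value} for one frame from a header-keyword table.
        placeholders = ",".join("?" * len(names))
        rows = conn.execute(
            "SELECT keyword, value FROM header_keywords "
            "WHERE file_id = ? AND keyword IN (%s)" % placeholders,
            [file_id] + list(names))
        return dict(rows)

    # e.g. a data organizer could classify a frame from these values alone:
    # get_keywords(conn, "OMEGACAM.2005-01-01T00:00:00",
    #              ["DPR.CATG", "DPR.TYPE", "DPR.TECH"])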

Next Generation Archive System
The Next Generation Archive System (NGAS) has been developed over the past years to face the challenge of the rapidly increasing data rates of the new instruments. NGAS is based on hard disks and Linux PCs. At the observatory an Archiving Unit (a Linux PC running the NGAS software) is responsible for archiving the data onto removable hard disks (two copies of the data are made). The main disk is shipped to Garching and the replication disk is kept at the observatory site as a backup. In Garching the disks are connected to the NGAS cluster, which is part of the Science Archive Facility. The key features of NGAS are:
• Data is on-line from the instant it has been archived
• Each Archiving Unit can handle 300-400 GB/night
• Web interfaces are provided for the operators to query information about the system
• Data can be archived from remote sites directly into the central Garching NGAS cluster (this feature is used for the quick delivery of pre-imaging data to Garching)
• Transparent access to data archived in the NGAS system is provided independently of the geographical location of the node hosting the data
• NGAS continually monitors the condition of data and disks and raises an alarm if problems are encountered
In the near future NGAS will be capable of archiving onto a cluster of Archiving Units. As a result, the NGAS system will become highly scalable in terms of archiving throughput and will be able to comfortably meet the data archiving requirements of future high data rate instruments.

Recipe Execution Infrastructure
The Recipe Execution Infrastructure (REI) is intended to improve the scheduling and execution of data reduction recipes within the current pipeline infrastructure. The current system is based on an architecture which does not allow mapping to a parallel system. REI is highly scalable, service based and instrument neutral. It can be deployed on a single node as well as over a multiple-processor system. This allows us to take advantage of the parallel nature of data reduction for multi-chip instruments like OmegaCAM, where each chip can be processed independently of the others.
The REI implements the "blackboard paradigm" [7], typically applied to AI or expert systems. The blackboard paradigm is composed of three main components:
• The Blackboard is a globally accessible memory containing objects of the solution space. These objects are hierarchically organized and can be linked to each other. The blackboard represents the only means of communication for the Knowledge Sources.
• The Knowledge Sources. The blackboard paradigm defines a method of solving heterogeneous problems as a set of independent Knowledge Sources (KSs). These KSs generally specialize in a given field of the global application. They read and write to the blackboard, where the solution is incrementally built up. When a KS produces a change in the blackboard, it generates an event that may trigger processing in other KSs, or in the Control.
• The Control is responsible for selecting the appropriate sequence of Knowledge Sources to solve the current problem and for making them cooperate.
The blackboard paradigm is appropriate for the REI for the following reasons:
• The paradigm has a very simple scheduling model. Tasks are not assigned by a central control; instead, KSs read from the Blackboard and execute the tasks they are able to process. A detailed knowledge of the system/KS topography is not required in order to schedule tasks on the system.
• FITS data reduction is often termed "embarrassingly parallel" in nature. The advantage of this is that no inter-node communication is required during processing. The Blackboard provides a simple means to meet all inter-process communication requirements.
• The paradigm is suited to heterogeneous problem solving, allowing KSs to specialize in tasks. For a pipeline this permits a range of DRS and a range of recipes to exist across the system. In addition, node specialization permits certain nodes to perform single tasks such as archive access, header correction or file splitting/joining.
• The blackboard represents a central point for monitoring the entire system. This permits easy global system monitoring, as well as helping to provide the user with a single interface to a potentially complex computing infrastructure.
The REI maps the blackboard components to a (possibly) distributed software system as follows:
• The Blackboard is implemented as a SYBASE database. An RDBMS provides the REI with reliable, standard inter-process communication at a rate acceptable to the problem (astronomical data reduction). In addition it permits easy persistence of results and an open interface for system control and monitoring. The REI logs all results from stdout and stderr to the database.
• The Knowledge Sources are software processes called pipeline workers (pworkers). Pworkers are configurable software processes, running on one or many servers, that execute tasks on the Blackboard. There are various types of tasks performed by pworkers, including file operations, node control, data reduction and administration.
• The Control is a specialized pipeline worker called the "master". The master receives and processes all pipeline requests into an execution plan. The execution plan represents a hierarchical map of commands in the Blackboard.
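A highly simplified Python sketch of the worker side of this scheme is given below; the commands table layout (id, type, status, worker, command) and the polling interface are assumptions made for illustration, not the actual REI schema or code:

    import time
    import sqlite3  # stand-in for the SYBASE blackboard used by the REI

    def claim_next_task(conn, worker_id, capabilities):
        # Pick one pending task of a type this worker can handle and mark it RUNNING.
        # (Not race-safe across workers; a real blackboard would claim tasks atomically.)
        placeholders = ",".join("?" * len(capabilities))
        row = conn.execute(
            "SELECT id, command FROM commands "
            "WHERE status = 'PENDING' AND type IN (%s) LIMIT 1" % placeholders,
            list(capabilities)).fetchone()
        if row is not None:
            conn.execute("UPDATE commands SET status = 'RUNNING', worker = ? WHERE id = ?",
                         (worker_id, row[0]))
            conn.commit()
        return row

    def pworker_loop(conn, worker_id, capabilities):
        # Poll the blackboard and execute matching tasks until the process is stopped.
        while True:
            task = claim_next_task(conn, worker_id, capabilities)
            if task is None:
                time.sleep(5)  # nothing to do; poll again later
                continue
            task_id, command = task
            # ... here the worker would run the recipe or file operation,
            #     logging stdout/stderr back to the database ...
            conn.execute("UPDATE commands SET status = 'DONE' WHERE id = ?", (task_id,))
            conn.commit()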

The diagram below shows an example deployment of the REI over three nodes:

[Figure: Example REI deployment over three nodes (a Linux PC, an HP-UX WS and a Solaris WS), each running one or more pworker processes, one of which acts as the master; all pworkers communicate through the SYBASE blackboard, to which pipeline commands (addcmd, cmdstat, listworkers, stopworkers, etc.) are submitted.]
The REI was deployed and tested on two different server systems, demonstrating its fit to both the "big iron" model and the "cluster" model.
• A single Linux PC (1.8 GHz): in this model a single pworker process is run, acting in both the master and standard roles. This setup represents the simplest deployment of REI, where a single process handles all pipeline requests.
• Beowulf system: the Beowulf system consists of 6 slave node PCs (650 MHz) connected over a 100 Mb LAN to a master node PC (650 MHz). The master node mounts a RAID system that is available, through the master, to all slaves. In this model, the slave nodes each run a single pworker process configured solely for data reduction. The master node runs the master pworker, and a second pworker configured for file operations only (archive access, file splitting and joining).
The Beowulf REI system running science image reductions of OmegaCAM test data showed a data rate of around 0.958 MPixel/s. OmegaCAM, producing a 256 MPixel raw image every 240 seconds, will require an image processing throughput of at least 1 MPixel/s. If the Beowulf REI were upgraded (for example from 650 MHz processors to around 1.7 GHz, with possible upgrades to dual LAN cards and/or a faster LAN), the data rate for OmegaCAM science reduction could exceed 2 MPixel/s on a 6 node Beowulf, and more with additional nodes. Data rate aside, the REI has shown that the system can be deployed over a range of computing topographies and tuned to meet specific or general processing requirements.
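The arithmetic behind these figures can be checked with a few lines of Python; the linear scaling with CPU clock speed is a rough assumption of ours, used only to illustrate why the upgrade is expected to be sufficient:

    required = 256 / 240.0              # MPixel/s needed to keep up with OmegaCAM
    measured = 0.958                    # MPixel/s measured on the 650 MHz Beowulf
    scaled = measured * (1.7 / 0.65)    # naive clock-speed scaling to 1.7 GHz CPUs

    print("required: %.2f MPixel/s" % required)   # ~1.07, i.e. "at least 1 MPixel/s"
    print("measured: %.2f MPixel/s" % measured)   # just short of the requirement
    print("scaled:   %.2f MPixel/s" % scaled)     # ~2.5, consistent with "> 2 MPixel/s"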

Association Technologies
Data Organization, Classification and Association (OCA) is the backbone of a number of operationally critical DFS applications, including the Data Organizer, the Data Packer and Gasgano. Data OCA is based on instrument specific rules that are typically provided by the instrument scientists or by the quality control scientists. These rules are embedded in configuration files or in database tables. For various historical reasons, including the evolution of operational schemes and limited resources, a unified data OCA concept, applied to all relevant points in ESO operations, was not developed. As a result, applications do not always behave consistently, the maintenance of the rules is costly and the exchange of information between groups and applications is difficult.
In the past months we have started a new project aimed at defining a unified syntax for the OCA rules and at developing a common framework for use by all relevant applications to perform OCA. The first achievement was the definition of an expression language (syntax) to define the rules. The following concepts are supported by the common syntax:
• Classification. To identify the type of data produced by the ESO instruments
• Grouping. To create homogeneous groups of frames that can be, for example, processed together
• Association. To assign a set of relevant data to one or more frames
• Action. To specify the action to be performed on a set of frames (for example applying a pipeline recipe)
• Cascade. This concept is needed in the off-line environment, where complete data sets are processed in batches. The calibration cascade defines the order in which the various types of data must be processed
Each concept translates into instrument specific rules that can be defined by the QC or instrument scientists using the common syntax. As an example of the common syntax, some VIMOS classification rules are given hereafter:

    if DPR.CATG=="CALIB" and DPR.TYPE=="BIAS" and DPR.TECH=="IMAGE" then class="BIAS";
    if DPR.CATG=="CALIB" and DPR.TYPE=="DARK" and DPR.TECH=="IMAGE" then class="DARK";
    if DPR.CATG=="CALIB" and (DPR.TYPE like "%FLAT%" and DPR.TYPE like "%SKY%") and DPR.TECH=="IMAGE" then class="IMG_SKY_FLAT";
    if DPR.CATG=="CALIB" and (DPR.TYPE like "%FLAT%" and DPR.TYPE like "%LAMP%") and DPR.TECH=="IMAGE" and TPL.ID=="VIMOS_img_cal_ScreenFlat" then class="IMG_SCREEN_FLAT";

A detailed description of these rules is outside the scope of this article. What is important to mention here is that it will be possible to use the same syntax (or even the same rules) in all relevant points of operations and on data stored in different formats (FITS files, database entries, header files).
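To make the classification concept concrete, the toy Python fragment below evaluates the first three rules against a dictionary of header keywords; it only illustrates the idea and is not the OCA rule engine, whose actual syntax and implementation are described elsewhere:

    import re

    def like(value, pattern):
        # SQL-style "like" comparison with % as a wildcard, as used in the rules above.
        return re.fullmatch(pattern.replace("%", ".*"), value or "") is not None

    def classify(hdr):
        # Toy re-implementation of the first three VIMOS rules quoted above.
        if hdr.get("DPR.CATG") == "CALIB" and hdr.get("DPR.TECH") == "IMAGE":
            if hdr.get("DPR.TYPE") == "BIAS":
                return "BIAS"
            if hdr.get("DPR.TYPE") == "DARK":
                return "DARK"
            if like(hdr.get("DPR.TYPE"), "%FLAT%") and like(hdr.get("DPR.TYPE"), "%SKY%"):
                return "IMG_SKY_FLAT"
        return None

    print(classify({"DPR.CATG": "CALIB", "DPR.TYPE": "BIAS", "DPR.TECH": "IMAGE"}))  # BIAS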

5. CONCLUSIONS AND FUTURE DEVELOPMENTS

The currently operational DFS infrastructure is able to comfortably handle the data rates generated by the first generation VLT instruments (< 50 GB/night). One of the main challenges of the next few years is represented by the large data volumes that will be generated by some of the new instruments like OmegaCAM at the VST (~ 100 GB/night, with peaks of up to 300 GB/night). In order to meet the archiving and processing requirements of future instruments, some new thinking and new developments were (and will be) required, especially in the area of the DFS back-end infrastructure. In particular, the following functionalities are in the process of being improved:
• On-line archive
• Pipeline infrastructure
• Association technologies
One of the key features of a data handling system that must support a variety of operational scenarios is scalability. Most of the new developments described in this paper aim at making the DFS infrastructure a more scalable system. Scalability is often associated with parallel processing and cluster technology; these concepts are used, for example, by NGAS and the REI. Another critical aspect of a data handling system is information management. The database has a central role in the DFS as persistent storage for many objects (e.g. header information) but also as a communication medium (blackboard). In five years of VLT operations the amount and complexity of information stored in databases has been steadily increasing and will probably require re-designing some of our databases. For example we are investigating the possibility of storing all header information in a data warehouse system, which is capable of efficiently managing hundreds of millions of entries.

ABBREVIATIONS
The following abbreviations are used in this article:
AI - Artificial Intelligence
CPU - Central Processing Unit
DB - Database
DFS - Data Flow System
DFO - Data Flow Operations
DHS - Data Handling System
DO - Data Organizer
ESO - European Southern Observatory
GB - Gigabyte
HW - Hardware
LAN - Local Area Network
KS - Knowledge Source
MB - Megabyte
NGAS - Next Generation Archive System
OB - Observation Block
OCA - Organization, Classification and Association
OLAS - On Line Archive System
OTS - Operations Technical Support
PC - Personal (Intel) Computer
PSO - Paranal Science Operations
REI - Recipe Execution Infrastructure
TB - Terabyte
UT - Unit Telescope
VIMOS - VIsible MultiObject Spectrograph
VCS - VLT Control System
VLT - Very Large Telescope
VLTI - Very Large Telescope Interferometer
VST - VLT Survey Telescope
WFI - Wide Field Imager at La Silla
WS - Workstation

REFERENCES
1. Knudstrup J., et al., Evolution and Adaptation of the VLT Data Flow System, in Observatory Operations to Optimize Scientific Return III, SPIE Proc. 4844, 2002.
2. Ballester P., et al., The VLT Interferometer Data Flow System: from observation preparation to data processing, in Observatory Operations to Optimize Scientific Return III, SPIE Proc. 4844, 2002.
3. ESO DMD, Data Flow System Overview, http://www.eso.org/org/dmd/, http://www.eso.org/projects/dfs/
4. ESO/ST-ECF Science Archive Facility, http://archive.eso.org/
5. Deul E. R., et al., OmegaCam: The 16k x 16k Survey Camera for the VLT Survey Telescope, in Survey and Other Telescope Technologies and Discoveries, SPIE, 2002.
6. Albrecht M. A., et al., The VLT Science Archive System, in Astronomical Data Analysis Software and Systems VII, ASP Conference Series, Vol. 145, 1998.
7. Buschmann F., et al., Pattern-Oriented Software Architecture, Vol. 1, 2001.