November 2007
Software Enabling Technologies for Petascale Science
Arie Shoshani, Lawrence Berkeley National Laboratory
Ilkay Altintas, San Diego Supercomputer Center
Alok Choudhary, Northwestern University
Terence Critchlow, Pacific Northwest National Laboratory
Chandrika Kamath, Lawrence Livermore National Laboratory
Bertram Ludäscher, University of California, Davis
Jarek Nieplocha, Pacific Northwest National Laboratory
Steve Parker, University of Utah
Rob Ross, Argonne National Laboratory
Nagiza Samatova, Oak Ridge National Laboratory
Mladen Vouk, North Carolina State University

Descriptions of technologies developed and used in the SDM Center

In this section we describe the SDM Center technologies, and include some examples of their application in various scientific projects. We proceed with technologies from the top layer to the bottom layer.

The Kepler Scientific Workflow System

A practical bottleneck for more effective use of available computational and data resources is often the design of the processes for accessing and using those resources, and of the corresponding execution environments, i.e., the scientific workflow environment of end-user scientists. The goal of the Kepler system² is to provide solutions and products for effective and efficient modeling, design and execution of scientific workflows. Kepler is a multi-site open source effort, co-founded by the SDM Center, to extend the Ptolemy system (from UC Berkeley) and create an integrated scientific workflow infrastructure. We have also started to incorporate data, process, system and workflow provenance, as well as run-time tracking and monitoring. We have worked closely with application scientists to design, implement, and deploy workflows that address their real-world needs. In particular, we have active users on the SciDAC Terascale Supernova Initiative (TSI) team and an LLNL Biotechnology project, as well as at the Center for Plasma Edge Simulation (CPES) fusion project. While the Scientific Process Automation (SPA) layer uses Kepler to achieve workflow automation, it is the specific task components (called “actors” in Kepler) developed by the SDM Center that make our work unique in its usefulness to scientific applications.
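To make the idea of task components and run-time tracking concrete, the following is a minimal sketch in Python. It is not Kepler code and does not use the Kepler or Ptolemy APIs (Kepler itself is Java-based); the names `Actor`, `ProvenanceRecord`, and `run_pipeline` are hypothetical and introduced only for illustration. The sketch assumes the simplest possible case, a linear chain in which each component consumes one input and produces one output, and it records a provenance entry each time a component runs.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Actor:
    """Hypothetical stand-in for a workflow task component (not a Kepler class)."""
    name: str
    fire: Callable[[Any], Any]  # consumes one input token, returns one output token


@dataclass
class ProvenanceRecord:
    """What ran, on what input, with what result, and how long it took."""
    actor: str
    input_repr: str
    output_repr: str
    seconds: float


def run_pipeline(actors: List[Actor], token: Any) -> Tuple[Any, List[ProvenanceRecord]]:
    """Fire each actor in order, passing the output of one as input to the next."""
    log: List[ProvenanceRecord] = []
    for actor in actors:
        start = time.perf_counter()
        result = actor.fire(token)
        elapsed = time.perf_counter() - start
        log.append(ProvenanceRecord(actor.name, repr(token), repr(result), elapsed))
        token = result
    return token, log


# Toy usage: two trivial actors standing in for real simulation or analysis steps.
pipeline = [Actor("square", lambda x: x * x), Actor("increment", lambda x: x + 1)]
result, log = run_pipeline(pipeline, 3)
for record in log:
    print(f"{record.actor}: {record.input_repr} -> {record.output_repr}")
```

Keeping the provenance log separate from the components means that the components stay unaware of tracking, which is one plausible way to add monitoring to existing task components without modifying them.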

Figure 2. An abstract representation of a scientific workflow.

Underlying challenges related to simulations, data analysis and data manipulation include scalable parallel numerical algorithms for the solution of large, often sparse linear systems, flow equations, and large eigenvalue problems; running simulations on supercomputers; movement of large amounts of data over large distances; collaborative visualization and computational steering; and collection of appropriate process- and simulation-related status and provenance information. Addressing these challenges requires interdisciplinary teams of application scientists and computer scientists working together to define the workflows and put them into the Kepler workflow framework. The general underlying “templates” are often similar across disciplines: large-scale parallel computations and steering (hundreds of processors, gigabytes of memory, hours to weeks of CPU time), data movement and reduction (terabytes of data), and visualization and analytics (interactive, retrospective, and auditable). An abstraction of this and its Kepler translation are illustrated in Figures 2 and 3 for a particular astrophysics project, called the Terascale Supernova Initiative (TSI).³ Figure 3 shows the capability of the Kepler system to represent hierarchically structured workflows. In the center of the figure there are four simple high-level tasks; each is expanded into lower-level tasks that manage the detailed processes.

Figure 3. Instantiation of the abstract workflow in Kepler.
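The hierarchical structure described above, in which a few high-level tasks each expand into lower-level tasks, can be sketched with a simple composite pattern. This is an illustrative Python sketch only, not Kepler code and not a transcription of Figure 3. The classes `Task` and `CompositeTask`, the helper `set_key`, and all stage names (`simulate`, `move_data`, `analyze`, `visualize` and their sub-tasks) are assumptions loosely modeled on the general templates mentioned in the text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union

State = Dict[str, str]


@dataclass
class Task:
    """A leaf task: performs one action on the shared workflow state."""
    name: str
    action: Callable[[State], State]

    def run(self, state: State, trace: List[str], depth: int = 0) -> State:
        trace.append("  " * depth + self.name)
        return self.action(state)


@dataclass
class CompositeTask:
    """A high-level task that expands into an ordered list of sub-tasks."""
    name: str
    children: List[Union[Task, "CompositeTask"]] = field(default_factory=list)

    def run(self, state: State, trace: List[str], depth: int = 0) -> State:
        trace.append("  " * depth + self.name)
        for child in self.children:
            state = child.run(state, trace, depth + 1)
        return state


def set_key(key: str, value: str) -> Callable[[State], State]:
    """Build a placeholder action that records a status value in the state."""
    def action(state: State) -> State:
        return {**state, key: value}
    return action


# Hypothetical four-stage workflow; each stage hides its own detailed steps.
workflow = CompositeTask("workflow", [
    CompositeTask("simulate", [Task("submit_job", set_key("job", "submitted")),
                               Task("monitor_job", set_key("job", "finished"))]),
    CompositeTask("move_data", [Task("transfer_files", set_key("data", "transferred"))]),
    CompositeTask("analyze", [Task("reduce_data", set_key("data", "reduced"))]),
    CompositeTask("visualize", [Task("render_plots", set_key("plots", "rendered"))]),
])

trace: List[str] = []
final_state = workflow.run({}, trace)
print("\n".join(trace))  # indented outline of the hierarchy, in execution order
```

Because leaf and composite tasks share the same `run` interface, a high-level view can treat each stage as a single box while the details remain available one level down, which is the property the hierarchical representation is meant to provide.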

Reference this article
Shoshani, A., Altintas, I., Choudhary, A., Critchlow, T., Kamath, C., Ludäscher, B., Nieplocha, J., Parker, S., Ross, R., Samatova, N., Vouk, M. "Scientific Data Management: Essential Technology for Accelerating Scientific Discoveries," CTWatch Quarterly, Volume 3, Number 4, November 2007. http://www.ctwatch.org/quarterly/articles/2007/11/scientific-data-management-essential-technology-for-accelerating-scientific-discoveries/
