November 2007
Software Enabling Technologies for Petascale Science
Garth Gibson, Carnegie Mellon University
Bianca Schroeder, Carnegie Mellon University
Joan Digney, Carnegie Mellon University


Three of the most difficult and growing problems in future high-performance computing (HPC) installations will be avoiding, coping with, and recovering from failures. The coming PetaFLOPS clusters will require the simultaneous use and control of hundreds of thousands or even millions of processing, storage, and networking elements. With this many elements involved, element failure will be frequent, making it increasingly difficult for applications to make forward progress. The success of petascale computing will depend on the ability to provide reliability and availability at scale.

While researchers and practitioners have spent decades investigating approaches for avoiding, coping with, and recovering from failures, typically working from abstract models of computer failure, progress in this area has been hindered by the lack of publicly available, detailed failure data from real large-scale systems.

We have collected and analyzed a number of large data sets on failures in high-performance computing (HPC) systems. Using these data sets, together with scaling trends and assumptions commonly applied in the design of future computing systems, we project for the machines of the next decade the expected failure rates, mean time to application interruption, and resulting application utilization of the full machine, assuming checkpoint/restart fault tolerance and the balanced-system design method of matching storage bandwidth and memory size to aggregate computing power.1
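The shape of such a projection can be sketched with a standard first-order model of checkpoint/restart overhead (Young's approximation for the near-optimal checkpoint interval). This is an illustrative sketch only, not the authors' actual model; the function name and the example values for mean time to interrupt (MTTI) and checkpoint write time are assumptions chosen for illustration.

```python
import math

def checkpoint_utilization(mtti_hours, ckpt_minutes):
    """Estimate the fraction of machine time spent on useful work
    under periodic checkpoint/restart.

    Uses Young's approximation: checkpoint interval tau = sqrt(2 * delta * M),
    where delta is the checkpoint write time and M the mean time to interrupt.
    Illustrative sketch only; parameter values are hypothetical.
    """
    delta = ckpt_minutes / 60.0       # checkpoint write time, in hours
    m = mtti_hours                    # mean time to application interrupt
    tau = math.sqrt(2.0 * delta * m)  # near-optimal interval between checkpoints
    # Overhead has two parts: time spent writing checkpoints, plus the
    # expected lost work (about half an interval) after each interrupt.
    overhead = delta / (tau + delta) + tau / (2.0 * m)
    return max(0.0, 1.0 - overhead)

# As MTTI shrinks (more elements, same per-element reliability) while
# checkpoint time stays fixed by balanced-system design, utilization falls:
for mtti in (24.0, 10.0, 4.0, 1.5):
    print(mtti, round(checkpoint_utilization(mtti, ckpt_minutes=30), 2))
```

With a fixed 30-minute checkpoint, the sketch shows utilization dropping steeply as MTTI falls from a day toward an hour or two, which is the qualitative trend behind the projections discussed here.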

Not surprisingly, if the growth in aggregate computing power continues to outstrip the growth in per-chip computing power, more and more of the computer’s resources may be spent on conventional fault recovery methods. For example, we envision applications being denied as much as half of the system’s resources within five years.2 Alternatives that might compensate for this unacceptable trend include application-level checkpoint compression, dedicated checkpoint devices, and system-level process-pairs fault tolerance for supercomputing applications.

Our interest in large-scale cluster failure stems from our role in a larger effort, the DOE SciDAC-II Petascale Data Storage Institute (PDSI), chartered to anticipate and explore the challenges of storage systems for petascale computing.3 Inasmuch as checkpoint/restart is a driving application for petascale data storage systems, understanding node failure and application failure tolerance is an important function of the PDSI. To increase the benefit of our data collection efforts, and to inspire others to do the same, we are working with the USENIX Association to make these and other data sets publicly available in a Computer Failure Data Repository (CFDR).4 Systems researchers and developers need ready access to raw data describing how computer failures have occurred on existing large-scale machines.


Reference this article
Gibson, G., Schroeder, B., Digney, J. "Failure Tolerance in Petascale Computers," CTWatch Quarterly, Volume 3, Number 4, November 2007. http://www.ctwatch.org/quarterly/articles/2007/11/failure-tolerance-in-petascale-computers/
