November 2007
Software Enabling Technologies for Petascale Science
Garth Gibson, Carnegie Mellon University
Bianca Schroeder, Carnegie Mellon University
Joan Digney, Carnegie Mellon University

Understanding Outages in LANL Computers

The first question most people ask is “what causes a node outage?” Figure 2 breaks down the root causes of failures in the LANL data into human, environment, network, software, hardware, and unknown, with the relative frequency of the high-level root cause categories shown on the left. Hardware is the single largest source of malfunction, with more than 50% of all failures assigned to this category. Software is the second largest contributor, at around 20% of all failures. The trends are similar in Figure 2(b), which shows the fraction of total repair time attributed to each root cause category.


Figure 2. (a) The breakdown of failures by root cause. (b) The breakdown of total repair time spent on a system due to each root cause. Each bar shows the breakdown for the systems of one particular hardware platform, labeled D, E, F, G, and H, and the right-most bar shows aggregate statistics across all LANL systems.

It is important to note that the number of failures with undetermined root cause is significant. Since the fraction of hardware failures is larger than the fraction of undetermined failures, and the fraction of software failures is close to that of undetermined failures, we can still conclude that hardware and software are among the largest contributors to failures. However, we cannot conclude that any of the other failure sources (Human, Environment, Network) is actually insignificant.

A second question is “How frequently do node outages occur?” or, equivalently, “How long can an application expect to run before it is interrupted by a node failure?” Figure 3(a) shows the average number of node failures observed per year for each of the LANL systems, ordered by the year each system was introduced into use. The figure indicates that failure rates vary widely across systems, from fewer than 20 failures per year to more than 1100. Note that a failure rate of 1100 per year means that an application running on all the nodes of the system will be interrupted and forced into recovery roughly three times per day (1100/365 ≈ 3). Since many of the applications running on these systems require a large number of nodes and weeks of computation to complete, failure and recovery are frequent events during an application’s execution.
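The conversion from an annual failure rate to an interrupt frequency is simple arithmetic, sketched below. The function names and the assumption that failures are spread evenly through the year are ours for illustration; only the 1100 failures/year figure comes from the discussion above.

```python
# Back-of-the-envelope conversion of a system's annual failure rate
# into the interrupt frequency seen by an application spanning all
# nodes. Assumes failures are spread evenly over the year (an
# illustrative simplification, not a claim from the LANL study).

HOURS_PER_YEAR = 365 * 24  # 8760

def interrupts_per_day(failures_per_year: float) -> float:
    """Expected full-system interrupts per day."""
    return failures_per_year / 365

def mean_time_to_interrupt(failures_per_year: float) -> float:
    """Mean time between interrupts, in hours."""
    return HOURS_PER_YEAR / failures_per_year

print(interrupts_per_day(1100))      # about 3 interrupts per day
print(mean_time_to_interrupt(1100))  # about 8 hours between interrupts
```

At 1100 failures per year, an application using every node sees a mean time to interrupt of only about eight hours, which is why recovery cost dominates long-running jobs.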

One might wonder what causes the large differences in failure rates across systems. The main reason is that the systems vary widely in size. Figure 3(b) shows the average number of failures per year for each system normalized by the number of processors in the system. The normalized failure rates show significantly less variability across system types, which suggests two interesting conclusions. First, the failure rate of a system grows roughly in proportion to the number of processor chips it contains. Second, there is little indication that systems and their hardware have become more reliable over time as technology has changed.


Figure 3. (a) Average number of failures for each LANL system per year. (b) Average number of failures for each system per year normalized by number of processors in the system. Systems with the same hardware type have the same color.


Reference this article
Gibson, G., Schroeder, B., Digney, J. "Failure Tolerance in Petascale Computers," CTWatch Quarterly, Volume 3, Number 4, November 2007. http://www.ctwatch.org/quarterly/articles/2007/11/failure-tolerance-in-petascale-computers/
