Lilliputians of Supercomputing Have Arrived!
Jose Castanos, George Chiu, Paul Coteus, Alan Gara, Manish Gupta, Jose Moreira, IBM T.J. Watson Research Center
CTWatch Quarterly
August 2005

Introduction

In Gulliver’s Travels (1726) by Jonathan Swift, Lemuel Gulliver traveled to various nations. One of them, Lilliput, was a country of tiny, weak people; another, Brobdingnag, was a land of mighty giants. When we build a supercomputer from thousands to hundreds of thousands of chips, is it better to choose a few mighty and powerful Brobdingnagian processors, or to start from many Lilliputian processors to achieve the same computational capability? To answer this question, let us trace the evolution of computers.

The first general-purpose electronic computer, ENIAC (Electronic Numerical Integrator And Calculator), was publicly disclosed in 1946. It took 200 microseconds to perform a single addition and was built with 19,000 vacuum tubes. The machine was enormous: 30 m long, 2.4 m high, and 0.9 m wide. Vacuum tubes had a limited lifetime and had to be replaced often. The system consumed 200 kW, and ENIAC cost the US Ordnance Department $486,804.22.

In December 1947, John Bardeen, Walter Brattain, and William Shockley at Bell Laboratories invented a new switching device, the transistor. It consumed less power, occupied less space, and was more reliable than vacuum tubes. Impressed by these attributes, IBM built its first transistor-based computer, the Model 604, in 1953. By the early 1960s, transistor technology had become ubiquitous. The further drive toward lower power, less space, higher reliability, and lower cost resulted in the invention of the integrated circuit in 1959 by Jack Kilby of Texas Instruments. Kilby made his first integrated circuit in germanium. In early 1959, Robert Noyce at Fairchild used a planar process to interconnect components within a silicon integrated circuit, which became the foundation of all subsequent generations of computers. In 1966, IBM shipped the System/360 all-purpose mainframe computer built with integrated circuits.

Within the transistor circuit families, the most powerful technology was the bipolar junction transistor (BJT) rather than the CMOS (Complementary Metal Oxide Semiconductor) transistor. However, compared to CMOS transistors, bipolar ones, using the fastest ECL (emitter-coupled logic) circuits, cost more to build, had a lower level of integration, and consumed more power. As a result, the semiconductor industry moved en masse to CMOS in the early 1990s. From then on, CMOS became the entrenched technology, and supercomputers were built with the fastest CMOS circuits. This picture lasted until about 2002, when CMOS power and power density rose dramatically, to the point that they exceeded the corresponding bipolar numbers of the 1990s. Unfortunately, this time there was no lower power technology lying in wait to defuse the crisis. Thus, we find ourselves again at a crossroads in building the next generation of supercomputers. According to the “traditional” view, the way to build the fastest and largest supercomputer is to use the fastest microprocessor chips as the building block. The fastest microprocessor is in turn built upon the fastest CMOS switching technology available to the architect at the time the chip is designed. This line of thought is sound provided there are no other constraints on building supercomputers. In the real world, however, there are many constraints (heat, component size, etc.) that make this reasoning unsound.

In the meantime, portable devices such as PDAs, cellphones, and laptop computers, developed since the 1990s, all require low power CMOS technology to maximize the battery recharge interval. In 1999, IBM foresaw the looming power crisis and asked whether we could architect supercomputers using low power, low frequency, and inexpensive (Lilliputian) embedded processors to achieve better collective performance than with high power, high frequency (Brobdingnagian) processors. While this approach had been successfully employed for special purpose machines such as the QCDOC supercomputer, the counter-intuitive proposal was a significant departure from the traditional approach to supercomputer design. The drive toward lower power and lower cost, however, remained a constant theme throughout.

We chose an embedded processor optimized for low power and low frequency rather than for peak performance. Such a processor has a performance/power advantage over a high performance, high power processor. A simple relation is

performance/rack = performance/watt x watt/rack.

The last term in this expression, watt/rack, is determined by the thermal cooling capability of a given rack volume. It therefore imposes the same limit (on the order of 25 kilowatts) whether one uses high-frequency, high-power chips or low-frequency, low-power chips. To maximize performance/rack, it is the performance/watt term that must be compared among different CMOS technologies. This clearly illustrates one of the areas in which electrical power is critical to achieving rack density. We have found that, in terms of performance/watt, the low frequency, low power embedded IBM PowerPC 440 core consistently outperforms high frequency, high power microprocessors by a factor of about ten, regardless of the manufacturer of the system. This is one of the main reasons we chose the low power design point for our Blue Gene/L supercomputer. Figure 1 illustrates the power efficiency of some recent supercomputers. The data is based on total peak Gflops (giga floating-point operations per second) divided by total system power in watts, when that data is available. If it is not available, we approximate it using Gflops per chip power (an overestimate of the true system Gflops/power number).
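
As a rough illustration of this relation, the following C sketch compares two hypothetical design points under the same 25 kW rack power envelope. The specific performance/watt numbers are illustrative assumptions chosen to reflect the factor-of-ten gap discussed above, not measured values for any particular machine.

/* Illustrative arithmetic for performance/rack = performance/watt x watt/rack.
 * All numbers are hypothetical and chosen only to show the comparison. */
#include <stdio.h>

int main(void)
{
    const double watts_per_rack = 25000.0;          /* ~25 kW thermal limit per rack    */
    const double gflops_per_watt_high_freq = 0.02;  /* hypothetical high-frequency chip */
    const double gflops_per_watt_low_freq  = 0.20;  /* hypothetical low-frequency chip  */

    printf("High-frequency design: %.0f Gflops per rack\n",
           gflops_per_watt_high_freq * watts_per_rack);
    printf("Low-frequency design:  %.0f Gflops per rack\n",
           gflops_per_watt_low_freq * watts_per_rack);
    return 0;
}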

Figure 1. Power efficiencies of recent supercomputers (blue = IBM machines, black = other U.S. machines, red = Japanese machine).

This chart presents empirical evidence that, under a common power envelope, the collective peak performance per unit volume is superior with low-power CMOS technology. We now explain the theoretical basis of the superior collective performance of low power systems. Any performance metric, such as flops, MIPS (millions of instructions per second), or SPEC benchmark scores, is linearly proportional to the chip clock frequency. On the other hand, the power consumption of the i-th transistor is given by the expression:

P_i = switching power of transistor i + leakage power of transistor i

    = ½ C_L,i V² f_i + leakage power of transistor i,

where C_L,i is the load capacitance of the i-th transistor, V = V_DD is the supply voltage, and f_i is the switching frequency of the i-th transistor. Note that not every transistor switches on every cycle of the clock frequency f. Although leakage power is increasingly important for the 90 nm, 65 nm, and 45 nm technologies, we ignore the leakage power of the Blue Gene/L chips, which, built in 130 nm technology, contributes less than 2% of the system power. The switching power consumed in a chip is the sum of the power of all switching nodes. It can be expressed as:

P_chip = Σ_i switching power of transistor i = ½ C_sw V² f,

where the average switching capacitance of the chip is given by

C_sw = (Σ_i C_L,i f_i) / f.

It is difficult to predict C_sw accurately, because we seldom know the switching frequency f_i of every transistor in every cycle; furthermore, the f_i differ from one application to another. To simplify the discussion, we use an averaged value of C_sw obtained either from direct measurement or from power modeling tools. For high power, high frequency CMOS chips, the clock frequency f is roughly proportional to the supply voltage V, so the power consumed per chip, P_chip, is proportional to V²f, or f³. Therefore, in this cubed-frequency regime, the power grows by a factor of eight if the frequency is doubled. Conversely, if we use eight moderate frequency chips, each at half the frequency of the original high frequency chip, we burn the same amount of power yet obtain a fourfold increase in flops/watt. This is the basis of our Blue Gene/L design philosophy. One might ask whether we can do this indefinitely: if 100,000 processors at some frequency are good, are not 800,000 processors at half the frequency even better? The answer is complex, because we must also consider the mechanical component sizes, the power needed to communicate between processors, the failure rate of those processors, the cost of packaging them, and so on. Blue Gene/L is a careful balance of these factors and many more. Moreover, as we lower the frequency, the power consumed per chip drops from a cubic frequency dependence to a quadratic one and finally to a linear one. In the linear regime, both power and performance are proportional to frequency, and there is no advantage to reducing the frequency further.
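
The frequency-scaling argument can be made concrete with a small numerical sketch. Assuming the idealized cubed-frequency regime (power proportional to f³, performance proportional to f), the code below compares one chip at frequency f with eight chips at f/2; the units are arbitrary, and only the ratios matter.

/* Idealized scaling sketch: in the cubed-frequency regime, power ~ f^3 and
 * performance ~ f. Compare one chip at frequency f with eight chips at f/2.
 * Units are arbitrary; only the ratios matter. */
#include <stdio.h>

int main(void)
{
    const double f = 1.0;                 /* baseline frequency (arbitrary units) */
    const double power_one = f * f * f;   /* one chip at f */
    const double perf_one  = f;

    const double fh = f / 2.0;
    const double power_eight = 8.0 * fh * fh * fh;   /* eight chips at f/2 */
    const double perf_eight  = 8.0 * fh;

    printf("1 chip  at f  : power %.2f, performance %.2f, perf/watt %.2f\n",
           power_one, perf_one, perf_one / power_one);
    printf("8 chips at f/2: power %.2f, performance %.2f, perf/watt %.2f\n",
           power_eight, perf_eight, perf_eight / power_eight);
    return 0;
}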

Blue Gene/L Architecture

The Blue Gene/L supercomputer project aims to push the envelope of high performance computing (HPC) to unprecedented levels of scale and performance. Blue Gene/L is the first supercomputer in the Blue Gene family. It consists of 65,536 compute nodes (131,072 processors), each an embedded dual-processor 32-bit PowerPC, and has 33 terabytes of main memory in total. In addition, it has 1,024 I/O nodes built from the same chip used for the compute nodes. A three-dimensional torus network and a sparse combining network interconnect all nodes. The Blue Gene/L networks were designed with extreme scaling in mind; we therefore chose networks that scale efficiently in terms of both performance and packaging. The networks support very small messages (as small as 32 bytes) and include hardware support for collective operations (broadcast, reduction, scan, etc.), which dominate some applications at the scaling limit. The compute nodes are designed to achieve a peak performance of 183.5 Teraflop/s in co-processor mode and 367 Teraflop/s in virtual node mode.1
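
To give a flavor of how an application sees the three-dimensional torus, the sketch below uses standard MPI Cartesian-topology calls to arrange processes in a 3-D grid with periodic (wrap-around) links and to find each rank’s neighbors. This is plain, portable MPI rather than anything specific to the Blue Gene/L network hardware, and the 8x8x8 grid is just an example size.

/* Sketch: arranging MPI processes in a 3-D torus with wrap-around links.
 * Portable MPI only; nothing here is specific to the Blue Gene/L hardware.
 * Run with exactly 512 processes for this 8x8x8 example grid. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int dims[3]    = {8, 8, 8};   /* example 8x8x8 process grid           */
    int periods[3] = {1, 1, 1};   /* periodic in every dimension => torus */
    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1 /* allow reorder */, &torus);

    int rank, coords[3], xminus, xplus;
    MPI_Comm_rank(torus, &rank);
    MPI_Cart_coords(torus, rank, 3, coords);
    MPI_Cart_shift(torus, 0, 1, &xminus, &xplus);   /* neighbors along the X axis */

    printf("rank %d at (%d,%d,%d): -X neighbor %d, +X neighbor %d\n",
           rank, coords[0], coords[1], coords[2], xminus, xplus);

    MPI_Comm_free(&torus);
    MPI_Finalize();
    return 0;
}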

The system-on-a-chip approach used in the Blue Gene/L project integrates two processors, cache (Level 2 and Level 3), the internode networks (torus, tree, and global barrier), JTAG, and Gigabit Ethernet links on the same die. By using embedded DRAM, we enlarged the on-chip Level 3 cache to 4 MB, four to eight times larger than comparable caches made of SRAM, greatly enhancing the realized performance of the processor. By integrating the inter-node networks on the chip, we can take advantage of the same generation of technology, i.e., these networks scale with chip frequency. Furthermore, the off-chip drivers and receivers can be optimized to consume less power than those of industry-standard networks. Figure 2 is a photograph of several rows of the Blue Gene/L system. The first two rows have their black covers on, whereas the remaining rows are uncovered.

Figure 2. The Blue Gene/L system installed at the Lawrence Livermore National Laboratory.

One of the key objectives in the Blue Gene/L design was to achieve cost/performance on a par with the COTS (Commodity Off The Shelf) approach, while at the same time incorporating a processor and network design so powerful that it can revolutionize supercomputer systems.

Using many low power, power-efficient chips in place of fewer, more powerful ones succeeds only if application users can realize more performance by scaling up to a larger number of processors. This is indeed one of the most challenging aspects of the Blue Gene/L system design, and it must be addressed through scalable networks along with software that efficiently leverages those networks.

System Software

The system software for Blue Gene/L was designed with two key goals: familiarity and scalability. First, we wanted to make sure that high performance computing users could migrate their parallel application codes to the Blue Gene/L platform with relative ease. Second, we wanted the operating environment to allow parallel applications to scale to the unprecedented level of 64K nodes (128K processors). It is important to note that this requires scaling not only in performance but also in reliability. A simple mean-time-between-failure calculation shows that if the software on each compute node failed about once a month, and failures across nodes were independent, a node failure would be expected somewhere in the system roughly once every 40 seconds (a 30-day month is about 2.6 million seconds, divided across 65,536 nodes). Clearly, the compute node software must be highly reliable.
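
The back-of-the-envelope reliability estimate above is simple enough to spell out in a few lines. Under the assumption of independent failures, the expected time between failures anywhere in the system is roughly the per-node figure divided by the node count; the once-a-month node failure rate is the illustrative assumption from the text.

/* Back-of-the-envelope MTBF estimate: with independent node failures,
 * the system-wide MTBF is roughly the per-node MTBF divided by the
 * number of nodes. */
#include <stdio.h>

int main(void)
{
    const double nodes = 65536.0;
    const double node_mtbf_seconds = 30.0 * 24.0 * 3600.0;  /* ~1 failure per node per month */
    const double system_mtbf_seconds = node_mtbf_seconds / nodes;

    printf("Expected time between node failures: %.1f seconds\n",
           system_mtbf_seconds);   /* prints about 40 seconds */
    return 0;
}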

We have developed a programming environment based on familiar programming languages (Fortran, C, and C++) and the single program multiple data (SPMD) programming model, with message passing supported via the Message Passing Interface (MPI) library. This has allowed several large scientific applications to be ported to Blue Gene/L with modest effort (often within a day).
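
As a concrete, if toy, illustration of the SPMD/MPI model described above, the fragment below runs the same program on every process, computes a local value, and combines the results with a collective reduction, the kind of operation the Blue Gene/L networks accelerate in hardware. It is generic MPI code, not taken from any Blue Gene/L application.

/* Minimal SPMD example: every process runs this same program, computes a
 * local partial result, and combines it with a collective reduction.
 * Generic MPI; not specific to Blue Gene/L. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = (double)(rank + 1);   /* stand-in for a local computation */
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum over %d processes = %.0f\n", size, global);

    MPI_Finalize();
    return 0;
}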

We have relied on simplicity and a hierarchical organization to achieve scalability of software in terms of both performance and reliability. Two major design simplifications that we have imposed are a lightweight, special-purpose kernel on the compute nodes and the offloading of richer operating system services to the I/O nodes, as described below.

The software for Blue Gene/L is organized in the form of a three-tier hierarchy. A lightweight kernel, together with the runtime library supporting user applications, constitutes the programming environment on the compute node. Each I/O node, which can be viewed as the parent of a set of compute nodes (referred to as a processing set, or pset), runs Linux and provides a more complete range of operating system services, including file I/O and sockets, to applications by offloading those functions from the compute nodes. The Linux kernel on the I/O nodes also provides support for job launch. Finally, the control system services run on a service node, which is connected to the Blue Gene/L computational core via a control network.

Results

In October 2004, an 8-rack Blue Gene/L system, which occupied less than 200 square feet of floor space and consumed about 200 kW of power, surpassed the Earth Simulator (which occupies an area of about 70,000 square feet and consumes about 7 MW of power) in LINPACK performance. In the June 2005 TOP500 list,2 a 32-rack Blue Gene/L system, delivered to Lawrence Livermore National Laboratory, occupies the #1 spot with a LINPACK performance of 136.8 Teraflop/s. Blue Gene/L systems account for five of the top ten entries in that list.

More importantly, several scientific applications have been successfully ported to and scaled on the Blue Gene/L system. The applications reported in our studies3,4 have achieved their highest performance ever on Blue Gene/L. Those results also represent the first proof point that MPI applications can effectively scale to over ten thousand processors.

Conclusions

In this paper, we described the main thrust of the Blue Gene/L supercomputer, built from Lilliputian low power, low frequency processors. By exploiting their superior performance/watt, we can package about ten times more processors in a rack; as a result, Blue Gene/L has been the number-one-rated supercomputer since November 2004. In June 2005, five of the top ten supercomputers in the 25th TOP500 list were based on the Blue Gene/L architecture. Blue Gene/L is currently producing unprecedented simulations in classical and quantum molecular dynamics, climate modeling, and quantum chromodynamics, and the list is growing. The future is likely to be even more power constrained due to the slowing of the power-performance scaling of the underlying transistor technologies. This will drive designers to search aggressively for opportunities to build even more power efficient systems, likely leading to even more Blue Gene/L-like parallelism. In the future, the Lilliputians are likely to be active in nearly every area of computing.

Acknowledgement
The Blue Gene/L project has been supported and partially funded by the Lawrence Livermore National Laboratory on behalf of the United States Department of Energy under Lawrence Livermore National Laboratory Subcontract No. B517522.
1 IBM Journal of Research and Development, special double issue on Blue Gene, Vol. 49, No. 2/3, March/May 2005.
2 TOP500 Supercomputer Sites, http://www.top500.org .
3 G. Almasi et al. "Scaling physics and material science applications on a massively parallel Blue Gene/L system," Proceedings of International Conference on Supercomputing, Cambridge, MA, June 2005.
4 G. Almasi et al. "Early Experience with Scientific Applications on the BlueGene/L Supercomputer," Proceedings of Euro-Par 2005, Lisboa, Portugal, August-September 2005.
