May 2006
Designing and Supporting Science-Driven Infrastructure
Ralph Roskies, Pittsburgh Supercomputing Center
Thomas Zacharia, Oak Ridge National Laboratory


In this article, we outline the types of activities required (and an estimate of their cost) in designing and supporting high-end computational facilities. The major categories are facility costs, system software, and the human effort involved in designing the systems and keeping them running. This discussion does not include any costs associated with direct user support, application software, application support, or the development of new technology. Nor does it include the networking issues related to connections outside the machine room. Those are covered elsewhere in this volume.

Facility Issues

The principal points to be included in planning an HPC facility are sufficient space, power, and cooling. Equally important, but often more easily amenable to improvement, are physical security, water and fire protection, pathways to the space, and automatic monitoring systems.

In the provision of space there is more to consider than the required number of square feet. This is especially true for today’s air-cooled clusters, which were not designed to be used together in the large quantities found in leading HPC centers. Today’s dense, air-cooled systems require large volumes of air for cooling. The size of the plenum under the floor, i.e., the space between the solid subfloor and the bottom of the raised floor tiles, is an important measure of the ability to deliver adequate air. Distribution is also an issue. Masses of under-floor cable tend to cause air dams, which impede the delivery of air to where it is needed. Conversely, moving large volumes of air through a barely adequate plenum will tend to cause streamlining, particularly when vents are located close to air handling units. The optimal location of the air handling units within the space often seems counter-intuitive. For example, one might think that placing air handlers close to the machine is better and more efficient. But that is likely to cause problems with streamlining and result in low pressure areas. Establishing the correct flow of air is an iterative process no matter what your computational fluid dynamics (CFD) study says. These issues get a lot simpler with liquid-cooled systems.
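To give a sense of scale for the air volumes discussed above, the following is a minimal sketch of a steady-state heat balance (heat load = air density × specific heat × volumetric flow × temperature rise). The air properties are textbook approximations for dry air near sea level, and the function names, the 20 kW rack, and the 12 K temperature rise are illustrative assumptions, not figures from any particular facility. It estimates bulk airflow only and says nothing about distribution, air dams, or streamlining.

```python
# Rough estimate of the cooling air needed to carry away a given heat load.
# Steady-state balance: heat_load = density * specific_heat * flow * temperature_rise

AIR_DENSITY_KG_M3 = 1.2            # assumed: dry air near sea level, about 20 C
AIR_SPECIFIC_HEAT_J_KG_K = 1005.0  # assumed: specific heat of air at constant pressure
M3_S_TO_CFM = 2118.88              # cubic metres per second -> cubic feet per minute


def required_airflow_m3_s(heat_load_w: float, delta_t_k: float) -> float:
    """Return the volumetric airflow (m^3/s) needed to remove heat_load_w watts
    when the air warms by delta_t_k kelvin while passing through the equipment."""
    if heat_load_w < 0:
        raise ValueError("heat load must be non-negative")
    if delta_t_k <= 0:
        raise ValueError("temperature rise must be positive")
    return heat_load_w / (AIR_DENSITY_KG_M3 * AIR_SPECIFIC_HEAT_J_KG_K * delta_t_k)


def required_airflow_cfm(heat_load_w: float, delta_t_k: float) -> float:
    """Same estimate, expressed in cubic feet per minute."""
    return required_airflow_m3_s(heat_load_w, delta_t_k) * M3_S_TO_CFM


if __name__ == "__main__":
    # Hypothetical example: one 20 kW rack with a 12 K inlet-to-outlet rise.
    flow = required_airflow_m3_s(20_000, 12)
    print(f"{flow:.2f} m^3/s, {required_airflow_cfm(20_000, 12):.0f} CFM")
```

Under these assumptions a single 20 kW rack needs roughly 1.4 cubic metres of air per second, or a little under 3,000 CFM. Multiplied across a room of racks, that is why plenum depth and under-floor obstructions matter so much.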

There are also many mundane problems to attend to. Subfloors should be sealed to prevent cement dust from proliferating. Floor drains are needed for disposal of condensing moisture from air handler coils. Floor tiles should be carefully selected to avoid the problem of “zinc whiskers,” the dispersion of tiny metallic slivers from the undersides of older tiles, which causes seemingly random hardware reliability problems.

Since computer equipment, air handlers, PDUs, etc. are both large and heavy, it is of great benefit to have a level pathway between the computer room and the loading dock where the equipment will be delivered. Be sure to take into account any hallway corners, and ensure that aisles are wide enough for the equipment to turn them. Also, make note of sprinkler heads that hang below ceiling height along the path, as well as door locking mechanisms and door jambs on the floor that will reduce the effective clearance. Some equipment is sufficiently heavy that metal plates are necessary to avoid floor damage or collapse during delivery to the computer room.

With systems requiring much cooling, very large pipes carry very large volumes of water. These pipes may be under the floor or overhead. Smoke detectors and moisture detectors must be correctly installed. Most modern detection systems interface to a site management/security system. It is important to make sure the detection system is integrated so that the proper people are notified in a timely manner.

Power considerations begin with the ability of the utility company to deliver adequate power to the site from its substations. Be prepared for a shocked reaction from your utility company the first time you call and make your request, especially if you have never done this before. During installation, it is wise to label and record every path that the electrical supply will follow, to enable quick traceback in the event of problems or electrical capacity questions.
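One simple way to keep such a record is a table that maps each labeled component to whatever feeds it, so that any outlet can be traced back to the utility feed. The sketch below is only an illustration of that idea. Every label in it (feed, switchgear, panel, PDU, and rack names) is hypothetical, and a real site would likely keep this in its facility management system rather than in a script.

```python
from typing import Dict, List, Optional

# Hypothetical labels: each entry maps a labeled component to what feeds it.
# None marks the utility feed at the top of the chain.
UPSTREAM: Dict[str, Optional[str]] = {
    "utility-feed-A": None,
    "switchgear-1": "utility-feed-A",
    "panel-1B": "switchgear-1",
    "pdu-07": "panel-1B",
    "rack-42-strip-L": "pdu-07",
}


def trace_back(label: str, upstream: Dict[str, Optional[str]]) -> List[str]:
    """Return the recorded supply path from `label` up to the utility feed."""
    path: List[str] = []
    seen = set()
    current: Optional[str] = label
    while current is not None:
        if current not in upstream:
            raise KeyError(f"no record for label: {current}")
        if current in seen:
            raise ValueError(f"recorded path loops back to: {current}")
        seen.add(current)
        path.append(current)
        current = upstream[current]
    return path


def fed_from(source: str, upstream: Dict[str, Optional[str]]) -> List[str]:
    """Return every recorded component whose supply path passes through `source`."""
    return sorted(
        label for label in upstream
        if label != source and source in trace_back(label, upstream)
    )


if __name__ == "__main__":
    print(" <- ".join(trace_back("rack-42-strip-L", UPSTREAM)))
    print(fed_from("panel-1B", UPSTREAM))
```

The first function answers the traceback question (what feeds this outlet), and the second answers the capacity question (what depends on this panel). Both are only as good as the labels recorded during installation.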

Reference this article
Roskies, R., Zacharia, T. "Designing and Supporting High-end Computational Facilities," CTWatch Quarterly, Volume 2, Number 2, May 2006. http://www.ctwatch.org/quarterly/articles/2006/05/designing-and-supporting-high-end-computational-facilities/
