November 2006 A
High Productivity Computing Systems and the Path Towards Usable Petascale Computing
Nicole Wolter, San Diego Supercomputing Center
Michael O. McCracken, San Diego Supercomputing Center
Allen Snavely, San Diego Supercomputing Center
Lorin Hochstein, University of Nebraska, Lincoln
Taiga Nakamura, University of Maryland, College Park
Victor Basili, University of Maryland, College Park


our initial statement, Table 1 displays some trends that are not surprising. On average, small jobs have shorter wait times: the average wait time for jobs requesting only one node was nine hours, while the average wait time for 128 nodes (1024 processors) was approximately 24 hours. The inverse trend is evident in maximum wait times; as the number of nodes increased, the maximum wait time decreased. The maximum wait time for a 128-node job was 17 days, while the maximum wait time for a one-node job was 71 days. The medians tell a story similar to the averages. The trends in run time and wait time are also visible in the histograms shown in Figure 1 and Figure 2, grouped by job size. Table 2 shows the corresponding statistics for job run time.
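The per-size summary statistics described above can be computed directly from a job trace. The sketch below uses hypothetical (node count, wait time) pairs for illustration only; the values are not the actual trace data from this study.

```python
from statistics import mean, median

# Hypothetical job log as (nodes_requested, wait_time_hours) pairs.
# Illustrative values only -- not the SDSC trace analyzed in the article.
jobs = [
    (1, 2.0), (1, 9.0), (1, 30.0),
    (128, 20.0), (128, 24.0), (128, 28.0),
]

def wait_stats(jobs):
    """Group jobs by requested node count and summarize wait times."""
    by_size = {}
    for nodes, wait in jobs:
        by_size.setdefault(nodes, []).append(wait)
    return {
        nodes: {"mean": mean(w), "median": median(w), "max": max(w)}
        for nodes, w in sorted(by_size.items())
    }

for nodes, s in wait_stats(jobs).items():
    print(f"{nodes:4d} nodes: mean={s['mean']:.1f}h "
          f"median={s['median']:.1f}h max={s['max']:.1f}h")
```

The same grouping applied to run times would produce the Table 2 statistics.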

The expansion factor ((wait time + run time) / run time) data in Table 3 reinforces the trends established by the wait times and also shows how the queue favors large allocations. The average expansion factor increases with the number of nodes up to 256 processors, where there is a dip in the expansion factor curve. This appears to show a preference for large jobs in the scheduler beginning at 256 processors. On closer inspection, however, that trend was influenced by a single high-priority account with a large allocation running many jobs on 256 processors; the actual priority increase for “large” jobs begins at 512 processors or more. Also note the high expansion factor for 1024-processor runs. While it would seem to show that users running on large portions of the system are not very productive, this effect is due to a large number of such jobs with very short run times: 632 of the 839 jobs that ran on 128 nodes (1024 processors) ran for less than two hours, and most of those jobs were part of a benchmarking effort. Excluding the benchmarking jobs, the average expansion factor was 9.58 and the median was 10.05, which more clearly shows the scheduler favoring large jobs as intended.
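The expansion factor is simple to compute, and a minimal sketch (using made-up wait and run times, not the study's data) also makes clear why very short runs inflate the metric, as happened with the benchmarking jobs:

```python
def expansion_factor(wait_hours, run_hours):
    """Expansion factor = (wait + run) / run; 1.0 means no queue delay."""
    return (wait_hours + run_hours) / run_hours

# A typical job: 9 hours in the queue, then 3 hours of computation.
print(expansion_factor(9.0, 3.0))   # -> 4.0

# A short benchmarking run with the same wait inflates the metric,
# which is why the raw 1024-processor average looked so poor.
print(expansion_factor(9.0, 0.25))  # -> 37.0
```

Because run time appears in the denominator, averaging this metric over many short jobs dominates the result, which is why filtering out the benchmarking runs gives a truer picture of scheduler behavior.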

Although the analysis above shows that marquee users receive priority when using large portions of the system, and are therefore not disproportionately hurt by queue wait time, they do face a serious productivity bottleneck: the inability to get a large enough allocation on one system, or on the right system for their task. Due to site policies, single-site allocations often have a limit. In some cases, users were forced to port codes to several systems because of allocation limits, resource constraints, or system and support software compatibility issues. One developer told us of having to port one program to four different systems during the course of one project. Another told us of having to port to at least two systems for an individual run in order to capitalize on the different features supplied by different systems.

Porting itself can be time-consuming and difficult. Reported porting times ranged from less than an hour to many days, and in some cases a successful port was never achieved. Once a code is ported, these multi-platform scenarios tend to require tedious manual effort from the user to move data files around, convert formats, and contend with the problems of working at many sites with different policies, support, and capabilities.


Reference this article
Wolter, N., McCracken, M. O., Snavely, A., Hochstein, L., Nakamura, T., Basili, V. "What's Working in HPC: Investigating HPC User Behavior and Productivity," CTWatch Quarterly, Volume 2, Number 4A, November 2006 A. http://www.ctwatch.org/quarterly/articles/2006/11/whats-working-in-hpc-investigating-hpc-user-behavior-and-productivity/
