November 2006 A
High Productivity Computing Systems and the Path Towards Usable Petascale Computing
Nicole Wolter, San Diego Supercomputing Center
Michael O. McCracken, San Diego Supercomputing Center
Allen Snavely, San Diego Supercomputing Center
Lorin Hochstein, University of Nebraska, Lincoln
Taiga Nakamura, University of Maryland, College Park
Victor Basili, University of Maryland, College Park

Conjecture 3: Time to solution is the limiting factor for productivity on HPC systems.

While the initial migration of a project to HPC systems is certainly due to expanding resource requirements, we found across all four studies that the HPC users represented in our samples treat performance as a constraint rather than a goal to be maximized. Code performance is important to them only until it is “good enough” to sustain productivity with the allocation they have, and then it is no longer a priority. In the economics literature, this is called satisficing 8, and while it is not surprising in retrospect, it is important to keep this distinction in mind when thinking about what motivates HPC users. The in-depth system log evaluation showed that most users do not take advantage of the parallel capacity available to them. Of the 2,681 jobs run in May 2004, 2,261 executed on a single eight-processor node, accounting for 41% of all the CPU hours utilized on the system. Furthermore, the job logs show that between 2004 and 2006, 1,706 of the 59,030 jobs on DataStar were removed from the batch nodes for exceeding the maximum job time limit of 18 hours, and 50% of these jobs were running on fewer than eight processors. Given access to a resource with thousands of processors, the majority of users choose to reduce the priority of performance tuning as soon as possible, indicating that they have likely found a point at which they feel they can be productive.

The support ticket evaluation tells a similar story. Many users requested longer run times, and fewer than ten users requested anything related to performance. Furthermore, some of the users requesting longer run times did not have checkpoint restart capabilities in their code, and many were running serial jobs.
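Adding even a minimal checkpoint/restart capability would let such long-running jobs survive a queue time limit by resuming where they left off rather than requesting ever-longer run times. A sketch of the basic pattern in Python (the state layout, file name, and work loop are all hypothetical stand-ins, not the codes studied here):

```python
import json
import os

STATE_FILE = "checkpoint.json"  # hypothetical checkpoint location

def load_state():
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"step": 0, "accum": 0.0}

def save_state(state):
    # Write to a temp file, then rename atomically, so a job killed
    # mid-write cannot leave a corrupt checkpoint behind.
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, STATE_FILE)

def run(total_steps, checkpoint_every=100):
    state = load_state()
    while state["step"] < total_steps:
        state["accum"] += state["step"] * 0.5  # stand-in for real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_state(state)
    save_state(state)
    return state
```

If the batch scheduler kills the job at the 18-hour limit, resubmitting it simply picks up from the last saved step instead of discarding all completed work.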

While performance is certainly the original reason for moving to an HPC platform, the attitudes and statements of our interview subjects reinforce the assertion that it is not a primary goal of theirs. When asked about productivity problems, performance was generally taken for granted, while other issues such as reliability, file system capacity and storage policies, and queue policies and congestion were discussed in depth. Performance problems are just one of many barriers to productivity, and focusing on performance at the expense of other improvements should be avoided.

Conjecture 4: Lack of publicity is the main roadblock to adoption of performance and parallel debugging tools.

While it is true that many users are unfamiliar with the breadth of performance tools available, a significant portion of users simply prefer not to use them. Reasons given by interviewees included problems scaling tools to large numbers of processors, unfamiliar and inefficient GUI interfaces, steep learning curves, and the overwhelming detail of the displayed results.

Unsurprisingly, the performance optimization consultants were the most likely to use tools. However, even they often opted for the seemingly more tedious path of inserting print statements with timing calls and recompiling, rather than learning to use performance tools.
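The ad hoc approach the consultants described, wrapping a region of interest in timing calls and printing the result, can be as simple as the following Python sketch (the wrapper, the `solve` kernel, and the label are illustrative assumptions, not taken from the study):

```python
import sys
import time

def timed(label, func, *args, **kwargs):
    # Wrap a call with wall-clock timing and print the elapsed time,
    # mimicking the manual print-and-recompile instrumentation style.
    t0 = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    print(f"[timing] {label}: {elapsed:.6f} s", file=sys.stderr)
    return result

def solve(n):
    # Stand-in for an expensive computational kernel.
    return sum(i * i for i in range(n))

answer = timed("solve", solve, 100_000)
```

The appeal is obvious: no new tool to learn, no GUI, and output that goes wherever the job's other output goes. The cost, equally obvious, is recompiling or editing the code for every new question asked of it.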

The situation for parallel debuggers was no more encouraging. Only one interviewee used the readily available parallel debugging tools, although other developers indicated that they would have used debugging tools that were consistent and easy to use at scales of up to hundreds of processors. In their current state, parallel debugging tools are considered difficult or impossible to use by the interviewees who had tried them.

It is clear that there is a long way to go before common HPC programming practice embraces the powerful tools that the research community has built. Certainly there is a lack of motivation on the part of many HPC programmers to learn new tools that help with performance optimization, a task which is not always their main priority. Aside from the obvious issue of acceptable performance at large scale, it seems that continuing to strive for tools that are easier to learn and use is important for improved adoption. As discussed by Pancake 4, tools must be designed from the user's perspective, with early user involvement in the design process, to be effective.


Reference this article
Wolter, N., McCracken, M. O., Snavely, A., Hochstein, L., Nakamura, T., Basili, V. "What's Working in HPC: Investigating HPC User Behavior and Productivity," CTWatch Quarterly, Volume 2, Number 4A, November 2006 A. http://www.ctwatch.org/quarterly/articles/2006/11/whats-working-in-hpc-investigating-hpc-user-behavior-and-productivity/
