November 2006 B
High Productivity Computing Systems and the Path Towards Usable Petascale Computing
D. E. Post, DoD High Performance Computing Modernization Program
R. P. Kendall, Carnegie Mellon University Software Engineering Institute


Another key aspect of CSE project workflows is the project life cycle (Figure 2). Large-scale CSE projects can have a life cycle of 30 to 40 years or more, far longer than most Information Technology projects. The NASTRAN engineering analysis code, for example, was originally developed in the 1960s and is still heavily used today [3]. In contrast, the time between generations of computers is much shorter, often no more than two to four years. A typical major CSE project has an initial design and development phase (including verification and initial validation) that often lasts five or more years. It is followed by a second phase in which the initial release is further validated, improved, and extended based on the experience of users running real problems. A production phase follows, during which the code is used to solve real problems. If the project is successful, the production phase is often the most active development phase. Once the code enters heavy use, many deficiencies and defects become apparent and need to be fixed, and the users generate new requirements for expanded capability. The new requirements may stem from new demands by the sponsor or user community, from the desire to incorporate algorithmic improvements, or from the need to port to different computer platforms. Even if no major changes are made during the production phase, substantial code maintenance is usually required to port the code to different platforms, respond to changes in the computational infrastructure, and fix problems caused by non-optimal initial design choices. The rule of thumb among many major CSE projects is that about one FTE of software maintenance support is needed for every four FTEs of users.


Figure 2. Typical Large-scale Computational Science and Engineering Project Life Cycle.
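The life-cycle figures quoted above can be turned into rough numbers. The sketch below is only illustrative arithmetic: the function names and the 40-FTE user community are hypothetical, and the only inputs taken from the text are the 30-to-40-year life cycle, the two-to-four-year gap between computer generations, and the one-maintainer-per-four-users rule of thumb.

```python
def platform_generations(life_cycle_years, years_per_generation):
    """Rough number of computer generations a code lives through."""
    return life_cycle_years / years_per_generation


def maintenance_ftes(user_ftes, users_per_maintainer=4):
    """Rule of thumb from the text: about one maintenance FTE per four user FTEs."""
    return user_ftes / users_per_maintainer


# Fewest generations: shortest life cycle, slowest hardware turnover.
low = platform_generations(30, 4)
# Most generations: longest life cycle, fastest hardware turnover.
high = platform_generations(40, 2)
print(f"Computer generations over the life cycle: {low} to {high}")

# Hypothetical user community of 40 FTEs (not a figure from the article).
print(f"Maintenance FTEs for 40 user FTEs: {maintenance_ftes(40)}")
```

Under these assumptions, a code outlives somewhere between about 7 and 20 generations of computers, which is why porting recurs throughout the production phase.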

Historically, many, if not most, CSE codes have included only a limited number of effects and were developed by teams of roughly one to five professionals. The few CSE codes that were multi-effect were generally built by developing one module for a new effect and adding it to the existing code (Figure 3). Once the new module had been successfully integrated into the major application, the developers started work on the next module. This approach had many advantages. It allowed the developers and users to use and test the basic capability of the code extensively while there was still time to change the choices of solution algorithms, data structures, mesh and grid topologies and structures, user interfaces, etc. The users were able to verify and validate the basic capability of the code, and then to test each new capability as it was added. The developers got rapid feedback on every new feature and capability. The developers of new modules had a good understanding of the existing code because many of them had written it. It was therefore possible to make optimum trade-offs in the development of good interfaces between the existing code and new modules. On the other hand, serial development takes a long time. If a code has four major modules that each take five years to develop, the full code won’t be ready for 20 years. Unfortunately, by then the whole code may be obsolete. Certainly the code will have been ported to new platforms many times.


Figure 3. Historic CSE Code Development Workflow for serial development.

To overcome these limitations, multi-effect codes are now generally developed in parallel (Figure 4). If a code is designed to include four effects, and the module for each effect takes five years to develop, then the development team will consist of 20 members plus those needed to support the code infrastructure. If all goes well, the complete code with treatments of all four effects will be ready five or six years after the start of the project instead of after 20 years.

Because the development teams are much larger, and the individual team members often don’t have working experience with the modules and codes being developed by the other module sub-teams, the software engineering challenges are much greater. Parallel development also increases the relative risks. If the development of a module fails, a new effort has to be started. In the four-module example, if one of the four initial module development efforts fails, the total development time under parallel development doubles (from about five years to about ten), whereas under serial development it increases by only 25 percent (from 20 years to 25).
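The schedule arithmetic behind this comparison can be sketched in a few lines. This is a simplified model rather than anything from the original study: the function names are hypothetical, and it assumes that a failed module is only recognized as a failure at the end of its development period, that the restarted effort takes the same time and succeeds, and that integration overhead is ignored.

```python
def serial_schedule(num_modules, years_per_module, failed_modules=0):
    """Modules are built one after another; each failed module is redone once."""
    return (num_modules + failed_modules) * years_per_module


def parallel_schedule(num_modules, years_per_module, failed_modules=0):
    """All modules are built at once; any failure forces a second pass."""
    if num_modules == 0:
        return 0
    passes = 2 if failed_modules > 0 else 1
    return passes * years_per_module


def percent_increase(baseline, actual):
    """Schedule slip relative to the no-failure baseline, in percent."""
    return 100.0 * (actual - baseline) / baseline


# The example from the text: four modules, five years each, one failure.
for name, schedule in (("serial", serial_schedule),
                       ("parallel", parallel_schedule)):
    baseline = schedule(4, 5)
    with_failure = schedule(4, 5, failed_modules=1)
    slip = percent_increase(baseline, with_failure)
    print(f"{name}: {baseline} -> {with_failure} years (+{slip:.0f}%)")
```

With these inputs the model reproduces the figures in the text: serial development slips from 20 to 25 years (25 percent), while parallel development slips from 5 to 10 years (100 percent). The parallel schedule is still far shorter in absolute terms; it is the relative impact of a single failure that is larger.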


Figure 4. Parallel project development workflow.


Reference this article
"Large-Scale Computational Scientific and Engineering Project Development and Production Workflows," CTWatch Quarterly, Volume 2, Number 4B, November 2006 B. http://www.ctwatch.org/quarterly/articles/2006/11/large-scale-computational-scientific-and-engineering-project-development-and-production-workflows/
