CTWatch
November 2007
Software Enabling Technologies for Petascale Science
John Mellor-Crummey, Rice University
Peter Beckman, Argonne National Laboratory
Keith Cooper, Rice University
Jack Dongarra, University of Tennessee, Knoxville
William Gropp, Argonne National Laboratory
Ewing Lusk, Argonne National Laboratory
Barton Miller, University of Wisconsin, Madison
Katherine Yelick, University of California, Berkeley

3. Recent and Ongoing Work

To date, work in CScADS has included both research and development of a range of technologies necessary to support leadership computing and direct engagement with SciDAC application teams. We briefly summarize a few of these efforts.

3.1 Research and Development of Software for Leadership Computing

Rice and Wisconsin have begun collaborative development of a series of performance-tool components that can serve as community infrastructure for performance tools on leadership computing platforms. Initial efforts have focused on two pieces: multi-platform components for stack unwinding within and across process address spaces, which are useful for both debugging and performance analysis, and a library that provides a foundation for sampling-based performance measurement of both statically linked and dynamically linked executables.
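
As a rough illustration of the measurement style this infrastructure supports, the stand-alone C sketch below (hypothetical code, not the Rice/Wisconsin components themselves) uses a SIGPROF interval timer to interrupt a running program and record call-stack samples; glibc's backtrace() stands in for the custom unwinder a production tool would need in order to handle optimized and statically linked code.

```c
/* Minimal sketch of sampling-based call-stack measurement; a hypothetical
 * stand-alone example, not the CScADS library itself.  A SIGPROF interval
 * timer interrupts the program periodically and the handler records the
 * current call stack.  glibc's backtrace() stands in for a real unwinder
 * (and is not strictly async-signal-safe); a production tool supplies its
 * own unwinder so it also works for optimized and statically linked code. */
#include <execinfo.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define MAX_FRAMES 64

static volatile long samples_taken = 0;

static void sample_handler(int sig)
{
    void *frames[MAX_FRAMES];
    int depth = backtrace(frames, MAX_FRAMES);  /* unwind the current stack */
    /* A real profiler would attribute frames[0..depth) to a calling-context
     * tree here; this sketch just counts samples. */
    (void)depth; (void)sig;
    samples_taken++;
}

static void start_sampling(long usec)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sample_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    struct itimerval timer;                     /* fire every 'usec' of CPU time */
    timer.it_interval.tv_sec = 0;  timer.it_interval.tv_usec = usec;
    timer.it_value.tv_sec    = 0;  timer.it_value.tv_usec    = usec;
    setitimer(ITIMER_PROF, &timer, NULL);
}

int main(void)
{
    start_sampling(1000);                       /* ~1000 samples per CPU second */
    double x = 0.0;
    for (long i = 1; i < 50000000; i++)         /* the "application" being profiled */
        x += 1.0 / (double)i;
    printf("result %.6f, samples %ld\n", x, samples_taken);
    return 0;
}
```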

Berkeley and Tennessee have been collaborating on re-engineering numerical libraries for parallel systems. Initially, this work has been exploring parallel matrix factorization using multi-threading in combination with intelligent scheduling. The new execution model relies on dynamic, dataflow-driven execution and avoids both global synchronization and implicit point-to-point synchronization due to send/receive-style message passing. Experimental results indicate that this strategy can significantly outperform traditional codes by hiding both algorithmic and communication latencies. Future plans call for exploring this programming paradigm for both two-sided linear algebra algorithms and sparse matrix algorithms.
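
The toy C/pthreads sketch below illustrates the dataflow-driven idea in isolation (it is not the library's actual scheduler): each task carries a dependency counter, and a task becomes eligible to run as soon as its last predecessor finishes, so worker threads never wait at a global barrier.

```c
/* Toy sketch of dataflow-driven task execution with dependency counters,
 * in the spirit of the approach described above (illustrative only, not
 * the library's actual scheduler).  Each task records how many predecessors
 * are unfinished; when the count reaches zero it joins a shared ready queue
 * and any idle worker may run it, so no global barrier is needed. */
#include <pthread.h>
#include <stdio.h>

#define MAX_SUCC 4
#define NWORKERS 3

typedef struct task {
    const char *name;
    int deps_left;                    /* predecessors not yet completed */
    struct task *succ[MAX_SUCC];      /* tasks that depend on this one  */
    int nsucc;
    struct task *next;                /* ready-queue link               */
} task_t;

static task_t *ready = NULL;          /* LIFO ready queue               */
static int tasks_left;                /* tasks not yet executed         */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void push_ready(task_t *t)     /* call with 'lock' held, or before
                                         any worker threads are running  */
{
    t->next = ready;
    ready = t;
    pthread_cond_signal(&cond);
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (ready == NULL && tasks_left > 0)
            pthread_cond_wait(&cond, &lock);
        if (tasks_left == 0) { pthread_mutex_unlock(&lock); return NULL; }
        task_t *t = ready;            /* claim a ready task             */
        ready = t->next;
        pthread_mutex_unlock(&lock);

        printf("running task %s\n", t->name);   /* stands in for a tile kernel */

        pthread_mutex_lock(&lock);
        for (int i = 0; i < t->nsucc; i++)      /* release dependent tasks */
            if (--t->succ[i]->deps_left == 0)
                push_ready(t->succ[i]);
        if (--tasks_left == 0)
            pthread_cond_broadcast(&cond);      /* wake idle workers to exit */
        pthread_mutex_unlock(&lock);
    }
}

int main(void)
{
    /* Diamond DAG: A -> {B, C} -> D, e.g. a factor task and its updates. */
    task_t A = { "A", 0 }, B = { "B", 1 }, C = { "C", 1 }, D = { "D", 2 };
    A.succ[A.nsucc++] = &B;  A.succ[A.nsucc++] = &C;
    B.succ[B.nsucc++] = &D;  C.succ[C.nsucc++] = &D;

    tasks_left = 4;
    push_ready(&A);                   /* A has no predecessors          */

    pthread_t th[NWORKERS];
    for (int i = 0; i < NWORKERS; i++) pthread_create(&th[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++) pthread_join(th[i], NULL);
    return 0;
}
```

In a tiled factorization, the printf would be replaced by a kernel such as a panel factorization or a trailing-matrix update on one tile.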

Argonne has been exploring the implementation and performance evaluation of MPI support for multi-threading and remote memory access (one-sided communication). Experiments with Argonne’s own MPI implementation (MPICH) and various vendor implementations have demonstrated the potential contribution these still little-used parts of MPI can make to parallel program performance, and have revealed widely varying attention to the efficiency of their implementations.
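
Both features can be exercised with a few lines of standard MPI-2. The generic example below (not Argonne's benchmark code) requests full thread support with MPI_Init_thread and performs a one-sided MPI_Put into a window exposed by another process, the kinds of operations whose efficiency was found to vary widely across implementations.

```c
/* Generic illustration (not Argonne's benchmark code) of the two MPI-2
 * features studied: threaded MPI and one-sided remote memory access. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nprocs;

    /* Ask for full multi-threading; the implementation reports what it
     * actually supports in 'provided' (it may be less than requested). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each process exposes one double in a window for one-sided access. */
    double local = (double)rank;
    MPI_Win win;
    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Fence-synchronized epoch: rank 0 writes into rank 1's window
     * without any matching receive on rank 1. */
    MPI_Win_fence(0, win);
    if (rank == 0 && nprocs > 1) {
        double val = 42.0;
        MPI_Put(&val, 1, MPI_DOUBLE, /*target=*/1, /*disp=*/0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("rank 1 sees %.1f (thread level provided: %d)\n", local, provided);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```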

3.2 Application Engagement

As part of the Center’s application engagement efforts, Rice has been working closely with the SciDAC S3D and GTC application teams to diagnose application performance bottlenecks on leadership-class platforms using a combination of measurement, analysis, and modeling. S3D is a massively parallel solver for turbulent reacting flows [8]. GTC (Gyrokinetic Toroidal Code) is a three-dimensional particle-in-cell (PIC) code used to study the impact of fine-scale plasma turbulence on energy and particle confinement in the core of tokamak fusion reactors [9]. Our early experiences with both S3D and GTC demonstrate the value of the CScADS approach of tightly coupling computer science research with application development and tuning. Work with these applications has influenced the development of software tools for performance measurement and performance modeling, and has motivated a study of run-time libraries for adaptive data reordering.

Work with S3D uncovered opportunities for using source-to-source tools to tailor code to improve memory hierarchy utilization. This led to refinement of Rice’s LoopTool program transformation tool. Applying LoopTool to S3D improved the performance of S3D’s most memory-intensive loop by nearly a factor of three [10]. Additionally, analysis of experiments with S3D on the hybrid Cray XT3/XT4 system showed that the lower memory bandwidth of the XT3 nodes hurts the weak scaling of S3D on the hybrid system. Performance on the hybrid system could be improved by adjusting the partitioning of computation in proportion to the higher efficiency of the XT4 nodes.
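
The fragment below gives a generic flavor of this kind of source-level restructuring (S3D itself is written in Fortran, and this is not LoopTool's actual output): fusing two loops that sweep the same array lets each element be reused while it is still in cache, rather than being streamed through the memory hierarchy twice.

```c
/* Generic illustration of loop fusion for memory hierarchy utilization
 * (illustrative only; the array and constant names are hypothetical).
 * Before: two sweeps over 'q', so each element is brought through the
 * memory hierarchy twice.  After fusion: each element of 'q' is loaded
 * once and used for both computations while still in cache or registers. */
void before(int n, const double *q, double *rate, double *heat,
            double k1, double k2)
{
    for (int i = 0; i < n; i++)
        rate[i] = k1 * q[i] * q[i];
    for (int i = 0; i < n; i++)
        heat[i] = k2 * q[i] + rate[i];
}

void after(int n, const double *q, double *rate, double *heat,
           double k1, double k2)
{
    for (int i = 0; i < n; i++) {     /* fused loop: one pass over q */
        rate[i] = k1 * q[i] * q[i];
        heat[i] = k2 * q[i] + rate[i];
    }
}
```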

Work with GTC has focused on exploring opportunities for improving memory hierarchy utilization. One component of this effort has been studying the impact of data structure layout and code organization on the spatial and temporal locality of data access patterns. A detailed study of GTC using a performance modeling toolkit developed at Rice [11] identified several opportunities for improving application performance: reorganizing the particle data structures to improve spatial reuse in the charge deposition and particle pushing phases, using loop fusion to increase temporal reuse of particle data, and transforming the code to increase instruction-level parallelism and reduce translation look-aside buffer misses. On an Itanium2 system, these transformations improved performance by 33%. The code modifications have been provided back to the application team. Ongoing work is exploring on-line adaptive reordering of particle data to improve temporal locality for the cell data structures during the charge deposition and particle pushing phases; preliminary experiments indicate that this approach offers the potential for substantial performance improvement.
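
The following fragment illustrates, in generic form, how particle data layout affects spatial locality in a charge-deposition-style loop (the field names and layouts are illustrative, not GTC's actual data structures): when a phase reads only a particle's position and weight, storing just those fields contiguously keeps cache lines free of data the phase never touches.

```c
/* Generic illustration of particle data layout and spatial locality in a
 * charge-deposition-style loop (hypothetical fields, not GTC's structures).
 * Positions are assumed to lie in [0, NCELL).  Which layout wins depends
 * on which fields each phase of the code actually reads. */
#define NPART 1000000
#define NCELL 4096

/* Array-of-structures: all fields of a particle are adjacent, so a loop
 * that reads only 'x' and 'w' also drags 'vx', 'vy', 'vz' through cache. */
struct particle { double x, vx, vy, vz, w; };

double deposit_aos(const struct particle *p, double *charge)
{
    double total = 0.0;
    for (int i = 0; i < NPART; i++) {
        int cell = (int)p[i].x % NCELL;   /* cell index from position */
        charge[cell] += p[i].w;           /* only x and w are needed  */
        total += p[i].w;
    }
    return total;
}

/* Structure-of-arrays: the fields used by this phase are stored
 * contiguously, so every byte fetched into cache is actually used. */
double deposit_soa(const double *x, const double *w, double *charge)
{
    double total = 0.0;
    for (int i = 0; i < NPART; i++) {
        int cell = (int)x[i] % NCELL;
        charge[cell] += w[i];
        total += w[i];
    }
    return total;
}
```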

An outcome of the CScADS summer workshop on Libraries and Algorithms for Petascale Applications was a substantial improvement in the I/O scaling and performance of the Omega3P simulation tool under development at the Stanford Linear Accelerator Center. Discussions at the workshop led to the use of collective communication patterns to avoid scaling bottlenecks associated with reading input data. Additionally, adjusting the application to use parallel netCDF and MPI-IO reduced the time for writing output data by a factor of 100 when Omega3P was run on thousands of processors on the Cray XT system at Oak Ridge. Together, these improvements dramatically enhanced the scalability of Omega3P.
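
Both changes can be sketched generically in MPI (this is not the Omega3P code, and the file names are hypothetical): one rank reads shared input and broadcasts it rather than having every process read the same data, and each rank writes its block of output with a single collective MPI-IO call rather than many independent writes. Plain MPI-IO stands in here for the parallel netCDF layer actually used.

```c
/* Generic illustration of the two I/O changes described above (not the
 * actual Omega3P code; "input.dat" and "output.dat" are hypothetical).
 * Input: one rank reads shared data and broadcasts it, so thousands of
 * processes do not issue the same read.  Output: each rank writes its
 * contiguous block with a single collective MPI-IO call. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NIN  1024            /* shared input values          */
#define NOUT 4096            /* output values owned per rank */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* --- Input: read once on rank 0, broadcast to everyone. --- */
    double *input = calloc(NIN, sizeof(double));
    if (rank == 0) {
        FILE *f = fopen("input.dat", "rb");
        if (f) { fread(input, sizeof(double), NIN, f); fclose(f); }
    }
    MPI_Bcast(input, NIN, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* --- Output: one collective write of this rank's contiguous block. --- */
    double *out = malloc(NOUT * sizeof(double));
    for (int i = 0; i < NOUT; i++) out[i] = rank + 0.001 * i;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)rank * NOUT * sizeof(double);
    MPI_File_write_at_all(fh, offset, out, NOUT, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(input); free(out);
    MPI_Finalize();
    return 0;
}
```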

Acknowledgement
The Center for Scalable Application Development Software is supported by cooperative agreement number DE-FC02-07ER25800 from the Department of Energy’s Office of Science.
References
1. CScADS - cscads.rice.edu/
2. HPCToolkit - www.hipersoft.rice.edu/hpctoolkit/
3. Paradyn - www.paradyn.org/
4. Whaley, C., Petitet, A., Dongarra, J. "Automated empirical optimizations of software and the ATLAS project," Parallel Computing, Vol. 27, No. 1 (2001), pp. 3-25.
5. Frigo, M. "A fast Fourier transform compiler," Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, Atlanta, Georgia, May 1999.
6. Qasem, A., Kennedy, K. "A cache-conscious profitability model for empirical tuning of loop fusion," Proceedings of the 2005 International Workshop on Languages and Compilers for Parallel Computing, Hawthorne, NY, October 20-22, 2005.
7. Zhao, Y., Kennedy, K. "Dependence-based code generation for a CELL processor," Proceedings of the 19th International Workshop on Languages and Compilers for Parallel Computing (LCPC), New Orleans, Louisiana, November 2-4, 2006.
8. Monroe, D. "Energy science with digital combustors," SciDAC Review, Fall 2006. www.scidacreview.org/0602/html/combustion.html
9. Krieger, K. "Simulating star power on earth," SciDAC Review, Spring 2006. www.scidacreview.org/0601/html/fusion.html
10. Mellor-Crummey, J. "Harnessing the power of emerging petascale platforms," SciDAC 2007, Journal of Physics: Conference Series 78 (2007), 012048.
11. Marin, G., Mellor-Crummey, J. "Understanding unfulfilled memory reuse potential in scientific applications," Technical Report TR07-6, Department of Computer Science, Rice University, October 2007.


