3. A Case Study

We now present preliminary experimental results of the performance expectation and sensitivity analysis for a synthetic code, UMT2K, modeled after a real engineering application: a 3D, deterministic, multigroup photon transport code for unstructured meshes.6 We first describe the methodology followed in these experiments and then present and discuss our findings.

3.1 Methodology

We have built the basic analysis described in section 2 using the Open64 compilation infrastructure.10 Our implementation takes an input source program and focuses on its computationally intensive basic blocks. We used the analysis to extract the data-flow graph (DFG) of each such block and to generate performance expectation metrics for various combinations of architectural elements. We also manually unrolled the significant loops in the kernel code so that the expected performance of the various code variants could be compared given the potential increase in instruction-level parallelism.
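To make the flavor of this analysis concrete, the sketch below shows one way a performance expectation can be derived from a data-flow graph: a longest-path (critical-path) traversal of an acyclic DFG extracted from a basic block, assuming unlimited functional units. The node layout, example graph, and latency values are illustrative assumptions only; this is not the Open64-based implementation described above.

    /* Illustrative sketch: longest-path (critical-path) latency of a basic
     * block's data-flow graph, one ingredient of a performance expectation.
     * The DFG is assumed to be acyclic (one basic block); the node layout,
     * latencies, and example graph are hypothetical, not the Open64-based
     * implementation described in the text. */
    #include <stdio.h>

    #define MAX_NODES 512
    #define MAX_PREDS 8

    typedef struct {
        int latency;              /* cycles taken by this operation        */
        int npreds;               /* number of data-flow predecessors      */
        int pred[MAX_PREDS];      /* indices of predecessor nodes          */
    } DfgNode;

    /* Earliest finish time of node i, assuming unlimited functional units. */
    static int finish_time(const DfgNode *g, int *memo, int i)
    {
        if (memo[i] >= 0)
            return memo[i];
        int start = 0;
        for (int p = 0; p < g[i].npreds; p++) {
            int t = finish_time(g, memo, g[i].pred[p]);
            if (t > start)
                start = t;        /* all operands must be ready first      */
        }
        return memo[i] = start + g[i].latency;
    }

    int main(void)
    {
        /* Tiny example chain: FP multiply -> FP divide -> integer add,
         * using the notional latencies of Table 2 (50, 60, 1 cycles).    */
        DfgNode g[3] = {
            { 50, 0, {0} },       /* node 0: FP multiply                   */
            { 60, 1, {0} },       /* node 1: FP divide, uses node 0        */
            {  1, 1, {1} },       /* node 2: integer add, uses node 1      */
        };
        int memo[MAX_NODES];
        for (int i = 0; i < MAX_NODES; i++)
            memo[i] = -1;

        int span = 0;
        for (int i = 0; i < 3; i++) {
            int t = finish_time(g, memo, i);
            if (t > span)
                span = t;
        }
        printf("critical-path length: %d cycles\n", span);   /* prints 111 */
        return 0;
    }

A resource-constrained variant would instead schedule the same graph against a fixed number of pipelined or non-pipelined units, which is the kind of architectural combination explored in the experiments below.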

3.2 The Kernel Code

The computationally intensive section of the UMT2K code is the so-called angular loop in the snswp3D subroutine. This loop contains a long basic block spanning about 300 lines of C code. This basic block, executed at each iteration of the loop, is the "core" of the computation and contains many high-latency floating-point operations, including 4 divides and 41 multiplies, as shown in Table 1.

Table 1. Instruction breakdown for the core of UMT2K.

    Operation class          Count
    Total operations         272
    FP operations            93
    Integer operations       95
    Load/store operations    84
    FP multiplies            41
    FP divides               4
    Integer multiplies       22

In the next section we review experimental results for this basic block both in its unmodified form and in manually unrolled versions, in order to explore how data dependences and the number of arithmetic and load/store units affect the projected performance. The unrolled versions expose many opportunities to study pipelined and non-pipelined execution of the various functional units and individual operations.
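As a simple illustration of the kind of manual unrolling referred to above (using a generic reduction loop, not the actual angular loop from snswp3D), unrolling by four with independent partial sums breaks the serial dependence chain and exposes multiplies that pipelined units can overlap:

    /* Hypothetical inner loop (not the UMT2K angular loop): a reduction
     * whose single accumulator serializes every multiply-add.            */
    double dot(const double *a, const double *b, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i] * b[i];     /* each iteration waits on the previous s */
        return s;
    }

    /* Unrolled by 4 with independent partial sums: the four multiply-adds
     * in each iteration have no data dependences on one another, so a
     * pipelined FP unit (or several units) can overlap them. Assumes
     * n % 4 == 0 for brevity; a real variant needs a cleanup loop.       */
    double dot_unrolled4(const double *a, const double *b, int n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (int i = 0; i < n; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }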

Table 2 below lists the latencies of the individual operations used in our approach. These latencies are notional rather than representative of a real system.

Table 2. Selected operation latencies (in cycles).

    Operation                                 Latency
    Load (cache miss)                         20
    Load address                              2
    32-bit integer add                        1
    32-bit integer multiply                   12
    32-bit FP multiply                        50
    32-bit FP divide                          60
    Array address calculation (non-affine)    33
    Array address calculation (affine)        13
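To show how these numbers can be combined, the sketch below derives two rough per-iteration bounds from the counts of Table 1 and the latencies of Table 2: a fully serialized estimate in which every operation pays its full latency, and a throughput bound that assumes fully pipelined units and ignores data dependences. The latencies assumed for operations not itemized in the tables (FP adds, plain integer operations, cache-hit memory accesses) and the issue width are illustrative assumptions, not values taken from the article.

    /* Back-of-the-envelope bounds for one execution of the UMT2K core
     * basic block, combining the Table 1 counts with the Table 2 latencies.
     * The latencies assumed for operations not itemized in the tables and
     * the issue width are made-up illustrative values.                    */
    #include <stdio.h>

    int main(void)
    {
        /* Table 1: operation counts for the core basic block.             */
        const int fp_mul = 41, fp_div = 4, int_mul = 22;
        const int fp_other  = 93 - fp_mul - fp_div;  /* other FP ops       */
        const int int_other = 95 - int_mul;          /* other integer ops  */
        const int mem_ops   = 84;                    /* loads and stores   */

        /* Table 2 latencies (cycles), plus assumed ones for the rest.     */
        const int lat_fp_mul = 50, lat_fp_div = 60, lat_int_mul = 12;
        const int lat_fp_other  = 5;   /* assumed FP add/subtract latency  */
        const int lat_int_other = 1;   /* 32-bit integer add               */
        const int lat_mem       = 2;   /* assumed cache-hit memory latency */

        /* Fully serialized estimate: every operation pays its full latency
         * and nothing overlaps.                                           */
        long serial = (long)fp_mul * lat_fp_mul + fp_div * lat_fp_div
                    + int_mul * lat_int_mul + fp_other * lat_fp_other
                    + int_other * lat_int_other + mem_ops * lat_mem;

        /* Throughput bound: fully pipelined units issuing 'width' operations
         * per cycle, with data dependences ignored.                       */
        const int width = 4;           /* assumed issue width              */
        const int total_ops = 272;     /* Table 1                          */
        long pipelined = (total_ops + width - 1) / width;

        printf("serialized estimate: %ld cycles\n", serial);
        printf("throughput bound   : %ld cycles\n", pipelined);
        return 0;
    }

The gap between the two estimates is the kind of sensitivity information the analysis aims to expose: it indicates how much pipelining and additional functional units could help before data dependences become the limiting factor.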

