Exploitation of Latency Hiding on the KSR1 - Case Study: The Barnes Hut Algorithm
Tumuluri, Chaitanya; Choudhary, Alok N.
This study examines the performance of dynamic, irregular, and loosely synchronous applications on the KSR1 distributed shared memory COMA system. The Barnes-Hut tree-based algorithm for simulating galactic evolution was chosen as a representative of this class of applications. The performance measures include the overall time-stepping loop execution time, the efficacy of the previously proposed EES and RCTS scaling rules, and the computational load balance achieved by the CostZone data partitioning scheme under these scaling rules. We define the notions of geographical locality, transfer locality flux, and partition locality flux to explain the sources of remote memory accesses in the application. The contributions of our study include two runtime latency hiding techniques, PST and PREFH, proposed for the effective and automatic use of the poststore and prefetch instructions to hide the latencies of remote memory accesses. The architectural support assumed, the compiler analysis required, and the code instrumentation schemes for implementing the PST and PREFH techniques are presented in this paper. We also examine the scalability of our schemes under the aforementioned scaling rules. These schemes were tuned for a 32k-particle simulation on a 112-processor configuration, producing a reduction of 30% in the overall loop execution time of the simulation. Further, a combination of the schemes, PREPST, produced an overall reduction of 50% in the loop execution time of the simulation. These improvements were traced to a reduction in the locality flux problems that arose as the application was scaled under the EES and RCTS rules. Interestingly, the locality flux problems manifested themselves as load imbalance conditions in the application. We found that our schemes did not scale well under EES scaling, but produced appreciable reductions in execution times under RCTS scaling.
It should be emphasized that our work studied a whole application on a 128-processor KSR1 machine, whereas most other work reported to date examines the performance of computational kernels on 32- or 64-processor configurations only.