Think Globally, Search Locally
MetadataShow full item record
Yotov, Kamen; Pingali, Keshav; Stodghill, Paul
A key step in program optimization is the determination of optimal values for code optimization parameters such as cache tile sizes and loop unrolling factors. One approach, which is implemented in most compilers, is to use analytical models to determine these values. The other approach, used in library generators like ATLAS, is to perform a global search over the space of parameter values by generating different versions of the code and executing them on the actual machine to find the parameter values that give the best performance. Neither approach is suitable for use in general-purpose compilers that must generate high quality code for large programs running on complex architectures. Model-driven optimization may incur a performance penalty of 10-20\% even for a relatively simple code like matrix multiplication, as was shown recently by Yotov et al. On the other hand, global search is not tractable for optimizing large programs for complex architectures because the optimization space is too large. To address this problem, some researchers are exploring more sophisticated search algorithms such as the simplex method, but it remains to be seen if these methods are successful in reducing search time without compromising on the quality of the solution. In this paper, we advocate a different methodology for generating high-performance code without increasing search time dramatically. Our methodology has three components: (i) modeling, (ii) local search, and (iii) model refinement. We use analytical models to estimate optimal values for transformation parameters. Since it is impossible to build tractable analytical models that capture all the features of complex architectures, we advocate improving these estimates by using a local search in the neighborhood of the model-predicted values. Finally, if the performance gap between handwritten code and generated code is substantial on some architecture, we advocate model refinement. To demonstrate this methodology, we built a modified ATLAS system that used a simple analytical model and local search, and showed that on most architectures, the performance of the code produced by this system was comparable to that of code produced by the original ATLAS system using global search. However, on x86 architectures, the gap in performance was substantial, and could not be bridged by local search alone. We argue that the problem is that the model assumed aggressive operation scheduling to mask instruction latencies, but such scheduling can actually be harmful on x86 architectures, a somewhat surprising fact that does not appear to be known widely. To address this problem, we use model refinement to generate a more sophisticated model that, when combined with local search, enables the production of high-quality code on both RISC and CISC architectures.
computer science; technical report
Previously Published As