Learning-Assisted Techniques for Agile Arithmetic Design on FPGAs
With recent trends in technology scaling, specialized hardware accelerators such as field-programmable gate arrays (FPGAs) are increasingly employed for high performance and energy efficiency. However, productivity in FPGA design is hindered by the inability of existing CAD tools to guarantee fast design closure. The goal of this dissertation is to achieve rapid end-to-end FPGA design closure and thereby contribute to the rise of specialized accelerators for applications including cryptography and digital signal processing. To realize this goal, the dissertation focuses on developing learning-assisted methodologies that learn from the inherent structure of today's designs and technologies and leverage this knowledge to build efficient design automation tools.

To meet the timing requirements of designs from a broad range of application domains, FPGA CAD tools provide users with an extensive set of configuration options for selecting among different heuristics or controlling a heuristic's behavior. The greatest obstacle to efficient design auto-configuration is the hours-long FPGA compilation cycle. I introduce LAMDA, a learning-assisted auto-configuration framework that accelerates timing closure by balancing the tension between computing effort and prediction accuracy through multi-stage and online learning techniques. LAMDA achieves significant speedup in the timing closure of large arithmetic designs as well as other DSP and linear algebra kernels.

Modern heterogeneous FPGA architectures incorporate hardened blocks, such as DSP and carry blocks, to boost the performance of arithmetic designs. These blocks are highly configurable, and a variety of dataflow patterns can be mapped to each of them. Existing HLS tools often fail to capture some of the complex operation mapping patterns, which limits their resource and timing prediction accuracy.
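The multi-stage idea behind LAMDA can be illustrated with a minimal sketch: cheap early compilation stages prune unpromising configurations so that expensive, accurate stages are spent only on survivors. Everything below is illustrative, not LAMDA's actual implementation; `mock_compile` is a hypothetical stand-in for a real FPGA compile.

```python
# Hypothetical sketch of multi-stage auto-configuration in the spirit of
# LAMDA; mock_compile and the knob encoding are illustrative assumptions.
import random

random.seed(0)

# Each configuration is a tuple of CAD option settings (here: 3 binary knobs).
CONFIGS = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]

def mock_compile(cfg, stage):
    """Stand-in for an FPGA compile at a given effort stage.
    Returns an estimated worst slack violation (ns); lower is better.
    Early stages are cheap but noisy; later stages are accurate."""
    true_quality = 2.0 - 0.8 * cfg[0] - 0.5 * cfg[1] + 0.3 * cfg[2]
    noise = random.gauss(0, 0.5 / (stage + 1))  # noise shrinks per stage
    return true_quality + noise

def auto_configure(configs, stages=3, keep_frac=0.5):
    """Successive-halving-style loop: evaluate survivors at each stage,
    keep the best fraction, so most effort goes to promising configs."""
    survivors = list(configs)
    for stage in range(stages):
        scored = sorted(survivors, key=lambda c: mock_compile(c, stage))
        keep = max(1, int(len(scored) * keep_frac))
        survivors = scored[:keep]
    return survivors[0]

best = auto_configure(CONFIGS)
```

In this toy setting the model of compile noise shrinking at later stages captures why early, inaccurate predictions are still useful: they only need to be good enough to discard clearly bad configurations.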
To provide an accurate timing model and unlock further acceleration opportunities in LAMDA, I introduce D-SAGE, a framework that exploits graph neural networks to learn the mapping patterns of arithmetic operations and performs mapping-aware delay characterization. D-SAGE significantly improves the accuracy of delay prediction in HLS.

The ever-increasing complexity of cryptographic algorithms necessitates mapping very large arithmetic operations, especially multiplication, onto limited FPGA resources. Existing techniques for optimizing large integer multiplication entail nontrivial trade-offs between different resource types and performance. I introduce IMpress, a framework that leverages equality saturation to automatically produce a wide range of equivalent integer multiplication expressions corresponding to different hardware implementations and costs. IMpress relies on a learning-assisted cost model to address the difficulty of predicting resource utilization at the dataflow level, and it uses constrained and multi-objective extraction techniques to choose the optimal multiplier implementation for an application's resource-utilization requirements. IMpress offers significant control over resource utilization and balance, and it increases the maximum number of instances of cryptographic applications that fit on an FPGA.
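The core observation behind IMpress, that equivalent multiplication expressions map to different hardware costs, can be sketched with two classic decompositions of a wide multiply. The decompositions and the toy cost model below are illustrative assumptions, not IMpress's actual rewrite rules or learned cost model.

```python
# Illustrative sketch (not IMpress itself): two equivalent decompositions
# of a wide integer multiply, and a toy cost model choosing between them.
K = 16                 # split point: treat operands as two K-bit limbs
MASK = (1 << K) - 1

def schoolbook(a, b):
    """Four partial products: maps naturally onto four DSP multipliers."""
    a_hi, a_lo = a >> K, a & MASK
    b_hi, b_lo = b >> K, b & MASK
    return ((a_hi * b_hi) << (2 * K)) \
         + ((a_hi * b_lo + a_lo * b_hi) << K) \
         + a_lo * b_lo

def karatsuba(a, b):
    """Three multiplies plus extra adds: trades DSPs for LUT/carry logic."""
    a_hi, a_lo = a >> K, a & MASK
    b_hi, b_lo = b >> K, b & MASK
    hh = a_hi * b_hi
    ll = a_lo * b_lo
    mid = (a_hi + a_lo) * (b_hi + b_lo) - hh - ll
    return (hh << (2 * K)) + (mid << K) + ll

# Toy cost table: (multiplies, adds) per decomposition; a real flow would
# predict DSP/LUT usage from the dataflow graph instead.
COSTS = {"schoolbook": (4, 3), "karatsuba": (3, 6)}

def pick(dsp_weight, lut_weight):
    """Choose the cheapest equivalent expression under given resource weights."""
    return min(COSTS, key=lambda n: COSTS[n][0] * dsp_weight
                                  + COSTS[n][1] * lut_weight)
```

When DSPs are the scarce resource the extraction favors the Karatsuba form; when soft logic is scarce it favors the schoolbook form. Equality saturation generalizes this to a large space of equivalent expressions explored simultaneously in an e-graph.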