Designing Machine Learning Accelerators via High-Level Synthesis Through Calyx
As applications grow in complexity and performance demands, general-purpose processors are no longer optimal for many domains due to their limited efficiency and high energy overhead on computation-intensive tasks. This is particularly evident in machine learning (ML), where model sizes and usage have grown exponentially, making CPUs ill-suited for inference and training workloads. Although custom hardware accelerators offer significant performance and energy-efficiency gains, designing them at the register-transfer level (RTL) is time-consuming and error-prone. Furthermore, the vast majority of ML programs are written in Python, a high-level language far removed from RTL design. To bridge this semantic gap, we present a complete open-source compiler toolchain that translates ML models written in Python into synthesizable SystemVerilog, targeting FPGAs as the hardware backend. Our toolchain leverages Calyx, a structured intermediate representation designed for hardware accelerators, and is integrated into the CIRCT project for extensibility and analysis. In addition to building the end-to-end flow, we design and implement a set of compiler passes for memory partitioning, enabling effective parallelism in memory-intensive ML workloads. Experimental results demonstrate that our compiler can effectively generate hardware from high-level ML models and achieve performance gains through static memory banking optimizations.
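To give a flavor of the memory banking idea the abstract refers to, the following is a minimal sketch in Python of cyclic partitioning, one common static banking scheme: a logical array is distributed across several physical memories so that consecutive indices land in distinct banks and can be accessed in the same cycle. The function names and the specific scheme here are illustrative assumptions, not taken from the Calyx or CIRCT codebases.

```python
# Illustrative sketch (not the thesis implementation): cyclic memory banking.
# A logical memory is split across `banks` physical memories so that `banks`
# consecutive elements can be read in parallel without a port conflict.

def bank_and_offset(index: int, banks: int) -> tuple[int, int]:
    """Map a logical index to (bank id, offset within that bank)."""
    return index % banks, index // banks

def partition(data: list[int], banks: int) -> list[list[int]]:
    """Cyclically distribute a logical array across `banks` memories."""
    mems: list[list[int]] = [[] for _ in range(banks)]
    for i, value in enumerate(data):
        b, _ = bank_and_offset(i, banks)
        mems[b].append(value)
    return mems

# With 4 banks, a loop body unrolled by 4 touches data[i], data[i+1],
# data[i+2], data[i+3], which fall in four distinct banks:
mems = partition(list(range(8)), 4)
# mems == [[0, 4], [1, 5], [2, 6], [3, 7]]
```

A banking compiler pass performs this index arithmetic statically, rewriting each memory access into a (bank, offset) pair at compile time, which is what makes the parallel accesses free of runtime arbitration.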