Toward Fast Deployment and Resource Efficient AI on FPGAs
As machine learning models flourish, many hardware accelerators have been proposed, but designing an application-specific accelerator for each new model or algorithm is not straightforward and usually requires multiple iterations between hardware and software. Field-programmable gate arrays (FPGAs) have therefore become popular as platforms for prototyping and evaluating new architectures before fabrication. However, it is still unclear how to deploy machine learning models on these devices quickly while using logic, memory, and interconnect resources efficiently; doing so often demands detailed knowledge of FPGA architecture and tool flows. This work addresses this gap through four representative case studies that span inference, training, emulation, and compilation, moving toward fast deployment and resource-efficient AI on FPGAs. First, we design a compact object detection model and accelerator for nearby drone detection that fits on a small device through model simplification and quantization. Second, we propose a low-precision normalization algorithm and corresponding hardware design that keep training accuracy close to that of full precision. Third, we introduce a chain-based time-division multiplexing (CTDM) method that emulates very large neural processing units on one or more FPGA devices while significantly reducing resource utilization. Finally, we present Torch2FPGA, a compilation framework that maps vision transformer models from PyTorch to a modular accelerator architecture, reducing manual design effort and shortening the path from new vision transformer models to running hardware prototypes.
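To make the quantization step in the first case study concrete, the sketch below shows symmetric per-tensor int8 quantization in NumPy. This is an illustrative assumption, not the thesis's actual scheme: the function names (`quantize_int8`, `dequantize`) and the choice of per-tensor symmetric scaling are ours, chosen as the simplest variant of the technique.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q.

    The scale maps the largest-magnitude weight to 127, so every
    quantized value fits in a signed 8-bit integer.
    """
    scale = float(np.max(np.abs(w))) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from int8 codes and the scale."""
    return q.astype(np.float32) * scale

# Toy example: quantize a random weight matrix and measure the error.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int8(w)
max_err = float(np.max(np.abs(w - dequantize(q, s))))
```

Because rounding moves each value by at most half a quantization step, `max_err` is bounded by `s / 2`; on a real FPGA deployment this trade-off buys 4x smaller weight storage and cheap integer multipliers.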