Decoupling Algorithm from Hardware Customizations for Software-Defined Reconfigurable Computing
With the pursuit of improving compute performance under strict power constraints, there is an increasing need for deploying applications to heterogeneous hardware architectures with spatial accelerators such as FPGAs. However, although these heterogeneous computing platforms are becoming widely available, they are very difficult to program especially with FPGAs. As a result, the use of such platforms has been limited to a small subset of programmers with specialized hardware knowledge. In this dissertation, we first provide a taxonomy of the essential techniques for building a high-performance FPGA accelerator, which requires customizations of the compute engines, memory hierarchy, and data representations. We also summarize a rich spectrum of work on programming abstractions and optimizing compilers that provide different trade-offs between performance and productivity. Next we present SuSy, a programming framework composed of a domain-specific language (DSL) and a compilation flow that enables programmers to productively build high-performance systolic arrays on FPGAs. With SuSy, programmers express the design functionality in the form of uniform recurrence equations (UREs). The URE description in SuSy is followed by a set of decoupled spatial mapping primitives that specify how to map the equations to a spatial architecture. More concretely, programmers can apply space-time transformations and several other memory and I/O optimizations to build a highly efficient systolic architecture productively. After that, we present HeteroCL, an open-source programming infrastructure composed of a Python-based domain-specific language and an FPGA-targeted compilation flow. Similar to SuSy, HeteroCL cleanly decouples algorithm specifications from three important types of hardware customization in compute, data types, and memory architectures. In addition, HeteroCL produces highly efficient hardware implementations for a variety of popular workloads by targeting spatial architecture templates such as systolic arrays and stencil with dataflow architectures. Finally, we introduce DrTrace, a trace-based online profiling technique that enables automated validation and recommendation for application-specific data reuse. Unlike existing work that leverages static analysis, our proposed technique can infer stencil operations from programs with data-dependent memory accesses. Moreover, by integrating the proposed profiling technique with HeteroCL, the stencil operations can be mapped to efficient hardware such as dataflow pipelines with line buffers.