Analytical Modeling and Efficient Implementation for Specialized Hardware Acceleration of Sparse Applications
Sparse linear algebra (SLA) underpins a wide range of critical applications, including data analytics, graph processing, deep learning, and transformers. In light of the growing footprint of sparse processing and the diminishing benefits of general-purpose scaling, recent years have seen an active body of research on accelerating sparse kernels with specialized hardware. However, the diverse characteristics of inputs across sparse applications, such as variations in dimensions, density, and non-zero value distributions, necessitate tailored architectural configurations and components, resulting in an extensive design space to explore. To develop SLA accelerators efficiently, designers must rapidly navigate this complex design and configuration space, making traditional, time-consuming simulation-based approaches unsuitable. This dissertation presents our recent research on analytical modeling for rapid design space exploration (DSE) and on the efficient implementation of sparse accelerators.

First, we introduce a novel and comprehensive analytical model that estimates the performance of various compute patterns for SLA kernels, factoring in both the characteristics of the input matrices and the specific architecture configuration.

Next, leveraging the rapid DSE enabled by this analytical model, we develop Vesper, a versatile sparse accelerator equipped with High Bandwidth Memory (HBM) that efficiently handles datasets of varying dimensions and densities. Vesper accelerates four key SLA kernels: sparse-matrix dense-vector multiplication (SpMV), sparse-matrix dense-matrix multiplication (SpMM), sparse-matrix sparse-vector multiplication (SpMSpV), and sparse-matrix sparse-matrix multiplication (SpGEMM). Guided by the analytical model, Vesper is configurable and resource-efficient, supporting diverse compute patterns and kernels while achieving throughput comparable to or exceeding that of state-of-the-art sparse accelerators.

Finally, we present GraphAxe, the first format-architecture co-designed hardware accelerator tailored for processing ultrascale (out-of-core) graphs using solid-state drives (SSDs). Specifically, we introduce a quantization-based doubly compressed sparse column (QDCSC) format to reduce SSD traffic. To further maximize bandwidth utilization, we implement SSD-aware mechanisms, including dynamic payload size adjustment and scoreboard-based coordination. Compared to the state-of-the-art CPU-based out-of-core graph computing system, GraphAxe delivers a 2.20x geometric-mean runtime improvement and a 154.16x gain in energy efficiency.
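To make the kernel definitions concrete, the following is a minimal software sketch of the simplest of the four kernels, SpMV, over a standard compressed sparse row (CSR) matrix. CSR is a common baseline layout, not necessarily the on-chip format Vesper uses; all names and the example matrix are illustrative.

```python
# Minimal SpMV sketch: y = A @ x for a CSR-format sparse matrix A.
# CSR stores one slice of column indices and values per row; this is
# a standard software reference, not an accelerator's internal layout.

def spmv_csr(row_ptr, col_idx, vals, x):
    """y[i] = sum over stored nonzeros (i, j) of vals * x[j]."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# Example: 3x3 matrix [[2, 0, 0], [0, 0, 3], [0, 4, 0]].
row_ptr = [0, 1, 2, 3]
col_idx = [0, 2, 1]
vals = [2.0, 3.0, 4.0]
print(spmv_csr(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # [2.0, 3.0, 4.0]
```

SpMM, SpMSpV, and SpGEMM generalize this pattern to dense-matrix, sparse-vector, and sparse-matrix operands, respectively, which changes the dominant access pattern and hence the preferred hardware configuration.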
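To illustrate the kind of reasoning an analytical performance model enables, here is a first-order, roofline-style sketch of an SpMV latency estimate. The dissertation's model is far more detailed, covering multiple compute patterns and architecture components; every parameter name and number below is an assumption for illustration only.

```python
# First-order latency estimate for one SpMV pass, bounded by whichever
# resource saturates first: compute throughput or memory bandwidth.
# All hardware parameters below are hypothetical defaults.

def estimate_spmv_cycles(nnz, n_rows, n_cols,
                         macs_per_cycle=32,    # parallel MAC units (assumed)
                         bytes_per_cycle=64,   # HBM bytes per cycle (assumed)
                         bytes_per_nnz=8):     # index + value footprint (assumed)
    compute_cycles = nnz / macs_per_cycle
    # Traffic: nonzeros streamed once, plus the dense input and output vectors.
    traffic_bytes = nnz * bytes_per_nnz + (n_cols + n_rows) * 4
    memory_cycles = traffic_bytes / bytes_per_cycle
    return max(compute_cycles, memory_cycles)

# Sweeping density at fixed dimensions shows where a design crosses
# from bandwidth-bound to compute-bound, without running a simulator.
for density in (1e-4, 1e-3, 1e-2):
    nnz = int(10_000 * 10_000 * density)
    print(f"density={density:g}: ~{estimate_spmv_cycles(nnz, 10_000, 10_000):,.0f} cycles")
```

Because such a model evaluates in microseconds rather than the hours a cycle-level simulation can take, it can be swept across thousands of candidate configurations, which is what makes the rapid DSE behind Vesper practical.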
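The abstract does not specify QDCSC's encoding, but the two ingredients it names are well established: doubly compressed sparse column (DCSC) storage, which skips empty columns entirely, and value quantization, which shrinks each stored value. A minimal sketch of that combination follows, assuming a uniform 8-bit quantizer; the function name, field layout, and quantization scheme are all assumptions, not the dissertation's actual format.

```python
# Sketch of a DCSC-like layout with uniformly quantized values, to show
# why such a format reduces SSD traffic: empty columns cost no storage,
# and each value is stored as a single int8 plus one shared scale.
# This is NOT the QDCSC encoding itself, only the underlying idea.

import numpy as np

def to_qdcsc_like(dense):
    nonempty_cols = np.flatnonzero((dense != 0).any(axis=0))  # skip empty columns
    scale = float(np.abs(dense).max()) / 127.0 or 1.0          # uniform 8-bit scale (assumed)
    col_ids, col_ptr, row_idx, qvals = [], [0], [], []
    for j in nonempty_cols:
        rows = np.flatnonzero(dense[:, j])
        col_ids.append(int(j))
        row_idx.extend(rows.tolist())
        qvals.extend(np.round(dense[rows, j] / scale).astype(np.int8).tolist())
        col_ptr.append(len(row_idx))
    return col_ids, col_ptr, row_idx, qvals, scale

A = np.array([[0.0, 0.0, 1.5],
              [0.0, 0.0, 0.0],
              [2.0, 0.0, -0.5]])
print(to_qdcsc_like(A))
```

For the power-law degree distributions typical of large graphs, most columns of the adjacency matrix are empty or nearly so, which is why a doubly compressed, quantized layout can substantially cut the bytes read from the SSD per traversal step.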