Design and Generation of Efficient Hardware Accelerators for Sparse and Dense Tensor Computations

Tensor algebra lies at the heart of big data applications. Whereas classical machine learning techniques such as embedding generation in recommender systems, dimensionality reduction, and latent Dirichlet allocation make use of multi-dimensional tensor factorizations, deep learning techniques such as convolutional neural networks, recurrent neural networks, and graph learning use tensor computations primarily in the form of matrix-matrix and matrix-vector multiplications. The tensor computations in many of these fields operate on sparse data, where most of the elements are zeros. Traditionally, tensor computations have been performed on CPUs and GPUs, both of which have low energy efficiency because they allocate excessive hardware resources to flexibly support various workloads. However, with the end of Moore's law and Dennard scaling, one can no longer expect more and faster transistors for the same dollar and power budget. This has led to an ever-growing need for energy-efficient, high-performance hardware and to a recent surge of interest in application-specific, domain-specific, and behavior-specific accelerators, which sacrifice generality for higher performance and energy efficiency. In this dissertation, I explore hardware specialization for tensor computations by building programmable accelerators. A central theme of the dissertation is identifying common spatial optimizations and common computation and memory access patterns, and building efficient storage formats and hardware for tensor computations. First, I present T2S-Tensor, a language and compilation framework for productively generating high-performance systolic arrays for dense tensor computations. Then I present a versatile accelerator, Tensaurus, that can accelerate both dense and mixed sparse-dense tensor computations. Here, I also introduce a new sparse storage format that allows sparse data to be accessed in a vectorized and streaming fashion and thus achieves high memory bandwidth utilization for sparse tensor kernels. Finally, I present a novel sparse-sparse matrix multiplication accelerator, MatRaptor, designed using a row-wise product approach. I also show how these different hardware specialization techniques outperform CPUs, GPUs, and state-of-the-art accelerators in both energy efficiency and performance.
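As a rough software illustration of the row-wise product formulation named in the abstract, the sketch below multiplies two sparse matrices by scaling and merging rows of B for each nonzero in a row of A. It is only a reference model under stated assumptions: the dict-of-dicts representation and the function name `spgemm_row_wise` are hypothetical choices made for this example, and nothing here reflects MatRaptor's actual hardware design or storage format.

```python
from collections import defaultdict


def spgemm_row_wise(a_rows, b_rows):
    """Sparse-sparse matrix product C = A @ B using row-wise products.

    Matrices are given as {row_index: {col_index: value}} dictionaries that
    hold only nonzero entries (an illustrative format, not the dissertation's).
    """
    c_rows = {}
    for i, a_row in a_rows.items():
        # Accumulator for output row i: C[i, :] = sum_k A[i, k] * B[k, :]
        acc = defaultdict(float)
        for k, a_ik in a_row.items():
            # Scale row k of B by A[i, k] and merge it into the accumulator.
            for j, b_kj in b_rows.get(k, {}).items():
                acc[j] += a_ik * b_kj
        if acc:
            c_rows[i] = dict(acc)
    return c_rows
```

Each output row depends only on one row of A and the rows of B it selects, which is why this formulation reads both inputs row by row rather than requiring column access to either matrix.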

140 pages


Matrix Multiplication; Sparse Tensor Computations; Tensor Accelerator; Tensor Decomposition


Committee Chair

Albonesi, David H.
Zhang, Zhiru

Committee Member

Batten, Christopher
Manohar, Rajit

Degree Discipline

Electrical and Computer Engineering

Degree Name

Ph. D., Electrical and Computer Engineering

Degree Level

Doctor of Philosophy

Attribution 4.0 International


dissertation or thesis
