Co-Design of Binarized Deep Learning
The 2010s witnessed the explosive growth of deep learning and its profound impact far beyond the field itself. This technology has fundamentally revolutionized pattern recognition and decision making in a wide range of applications and is widely regarded as a step toward machine intelligence. One common trend researchers have observed over the years is the persistent scaling of models, data, and hardware. In fact, today's large-scale deep learning models have hundreds of billions of parameters and require septillions of floating-point operations, trained on thousands of chips using trillions of tokens. Consequently, deploying such models at scale keeps stretching the limits of today's supercomputers.

The core problem studied in this dissertation is quantizing deep neural networks to one bit. Quantization reduces the number of bits in a model's numerical representations. Quantizing matrix multiplications, the most compute-intensive part of deep neural networks, can substantially reduce their energy footprint and improve their performance. One-bit quantization, referred to as binarization, pushes this benefit to its extreme (a minimal sketch appears after this abstract). Although a theoretical explanation is lacking, binarization empirically degrades model quality. This dissertation explores the possibility of co-designing binarized neural networks (BNNs) that are both high-quality and high-performance.

In part one, it introduces the precision gating (PG) technique, which leverages additional sparse binary matrix multiplications to improve the accuracy of existing BNN architectures. The resulting BNN with PG, named FracBNN, is the first to match the accuracy of MobileNetV2, a well-known compact floating-point network, and is capable of real-time inference on an embedded FPGA. In part two, it presents the design methodology of PokeBNN, a novel vision BNN architecture that establishes a new state of the art on the accuracy-efficiency Pareto frontier. In part three, it further explores binarized Transformer-based language models. The one-bit weight-binarized Transformer shows no quality loss on the WMT De-En translation dataset and a scaling trend similar to its floating-point counterpart when evaluated on Google Translate's production-scale dataset. Finally, the dissertation presents a comprehensive and automated deployment framework for quantized deep neural networks. With a one-line code change, the framework deploys mixed-precision models on CPUs and achieves practical acceleration.
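To make one-bit quantization concrete, the following is a minimal NumPy sketch, not the dissertation's implementation: it maps weights and activations to {-1, +1} with a per-tensor scale (the absolute-mean scaling heuristic common in the BNN literature) and uses them to approximate a floating-point matrix multiply. The names `binarize` and `binarized_matmul` are illustrative; a real BNN kernel would pack the signs into bits and replace multiply-accumulate with XNOR and popcount.

```python
# Minimal sketch of one-bit quantization (illustrative, not the
# dissertation's method): operands are mapped to {-1, +1} with a
# per-tensor scale, so the matrix multiply reduces to sign arithmetic.
import numpy as np

def binarize(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize a tensor to {-1, +1} with a scale equal to the mean
    absolute value (a common scaling heuristic in the BNN literature)."""
    scale = float(np.mean(np.abs(x)))       # per-tensor scaling factor
    signs = np.sign(np.where(x == 0, 1.0, x))  # map 0 -> +1 to stay binary
    return signs, scale

def binarized_matmul(a: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Approximate a @ w with one-bit operands. In hardware, the inner
    product over {-1, +1} values becomes XNOR + popcount instead of
    floating-point multiply-accumulate."""
    a_bin, a_scale = binarize(a)
    w_bin, w_scale = binarize(w)
    return (a_bin @ w_bin) * (a_scale * w_scale)

# Usage: compare the binarized product against the full-precision one.
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8))
w = rng.standard_normal((8, 3))
print(np.abs(a @ w - binarized_matmul(a, w)).mean())  # nonzero error
```

The printed error is the quality gap that binarization introduces and that the co-design techniques summarized above aim to close.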