Trainable Fixed-Point Quantization for Deep Learning Acceleration on FPGAs
This thesis introduces QFX, a novel fixed-point quantization-aware training tool designed to bridge the gap between trained and deployed deep learning models on resource-constrained devices such as embedded FPGAs. Conventional quantization techniques have primarily focused on quantizing only the matrix multiplications in deep learning models during training, so quantizing the remaining layers to fixed-point precision for FPGA deployment requires extensive fine-tuning. To address this problem, QFX enables the training of “hardware-ready” deep learning models by faithfully emulating the fixed-point casting function and basic arithmetic operations, and by learning the binary-point position dynamically during training. At deployment time, the fixed-point operations in the model can be seamlessly replaced by their synthesizable counterparts supported by HLS, eliminating numerical mismatch between the trained and deployed models.

QFX is evaluated for accuracy on image classification tasks over practical datasets. The thesis further introduces K-hot, a multiplier-free quantization strategy within QFX designed to minimize DSP usage; its effectiveness is demonstrated by integrating it with a state-of-the-art binarized neural network accelerator, yielding improved hardware performance on an embedded FPGA.

In summary, this thesis presents a tool that simplifies the quantization process, enabling the training of high-quality fixed-point quantized models with reduced quantization overhead. The tool is intended for open-source release and can broaden the audience for efficient machine learning by lowering the quantization expertise required. The approach of applying multiplier-free quantization to binarized neural networks also points to future possibilities for deploying extremely compressed deep learning models on resource-constrained devices.
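To illustrate the kind of emulation described above, the sketch below shows one plausible way to model a signed fixed-point cast with a learnable binary-point position in PyTorch. The class name, the straight-through rounding, and the parameterization are assumptions made for illustration only and do not reflect QFX's actual implementation.

import torch
import torch.nn as nn


class FixedPointCast(nn.Module):
    """Illustrative sketch: emulate a signed fixed-point cast whose binary-point
    position (number of fractional bits) is a trainable parameter, rounded with
    a straight-through estimator so gradients flow during training."""

    def __init__(self, total_bits: int = 16, init_frac_bits: float = 8.0):
        super().__init__()
        self.total_bits = total_bits
        # Continuous relaxation of the binary-point position.
        self.frac_bits = nn.Parameter(torch.tensor(init_frac_bits))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Round the learnable fractional bit count, keeping gradients via STE.
        frac = self.frac_bits + (torch.round(self.frac_bits) - self.frac_bits).detach()
        scale = 2.0 ** frac

        # Representable signed integer range for the chosen word length.
        qmin = -(2 ** (self.total_bits - 1))
        qmax = 2 ** (self.total_bits - 1) - 1

        # Quantize: scale, round (STE), clamp to range, then rescale back.
        x_scaled = x * scale
        x_rounded = x_scaled + (torch.round(x_scaled) - x_scaled).detach()
        x_clamped = torch.clamp(x_rounded, qmin, qmax)
        return x_clamped / scale


# Usage: wrap activations or weights so training sees fixed-point behavior.
cast = FixedPointCast(total_bits=8, init_frac_bits=4.0)
y = cast(torch.randn(4, 4))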
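The K-hot idea of removing multipliers can likewise be sketched as approximating each weight by a sum of at most K signed powers of two, so that every product reduces to K shift-and-add operations in hardware. The greedy residual procedure, exponent bounds, and function name below are hypothetical and only intended to convey the concept, not to reproduce the algorithm used in the thesis.

import torch


def k_hot_quantize(w: torch.Tensor, k: int = 2, max_exp: int = 0, min_exp: int = -7) -> torch.Tensor:
    """Hypothetical greedy K-hot quantization: approximate each weight by a sum
    of at most k signed powers of two within [2**min_exp, 2**max_exp]."""
    residual = w.clone()
    approx = torch.zeros_like(w)
    for _ in range(k):
        sign = torch.sign(residual)
        mag = residual.abs().clamp(min=2.0 ** (min_exp - 1))
        # Power of two closest to the remaining residual (in the log domain).
        exp = torch.clamp(torch.round(torch.log2(mag)), min_exp, max_exp)
        term = sign * (2.0 ** exp)
        # Skip terms where the residual is already negligible.
        term = torch.where(residual.abs() < 2.0 ** (min_exp - 1), torch.zeros_like(term), term)
        approx = approx + term
        residual = residual - term
    return approx


# Usage: replace full-precision weights with their K-hot approximation, so an
# FPGA kernel can implement each product with at most k shift-and-add steps.
w = torch.randn(3, 3) * 0.5
w_khot = k_hot_quantize(w, k=2)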