LOW PRECISION AND SPARSE AI ALGORITHMS FOR EDGE COMPUTING
Over the last decade, generative AI has revolutionized computing by extending the scope of conventional computer vision and natural language processing models across various domains. As performance-driven AI architectures grow in size, they demand increasingly substantial memory and computational resources, and the research community has invested heavily in storage and compute infrastructure to sustain their performance. These excessive resource requirements have confined deep neural architectures primarily to cloud environments and data centers. Recent advances in on-device AI have motivated exploration of resource-optimal pre-training, fine-tuning, and inference of AI models for domain-specific tasks. These efforts address the fundamental challenge of deploying sophisticated models on resource-constrained devices, where high memory usage, computational inefficiency, and energy consumption are limiting factors. Optimizing AI architectures is particularly crucial for advancing computer vision and generative AI applications in autonomous driving, drone navigation, robotics, and augmented/virtual reality, where real-time processing is essential but hardware resources are limited.

This thesis addresses these challenges by developing systematic, hardware-aware optimization techniques for both conventional and biologically inspired neural networks. The study first explores Spiking Neural Networks (SNNs), which offer high memory and energy efficiency. Despite their biological plausibility, SNNs suffer from suboptimal training and temporal compute overhead. To overcome these limitations, this thesis introduces: (1) LT-SNN, a self-adaptive SNN with learnable potential thresholds; (2) QE-SNN, a fully quantized SNN with integer-only membrane potentials; and (3) SQE-SNN, a low-precision SNN with spatial-channel pruning. These approaches collectively achieve state-of-the-art performance with up to 13× memory reduction and >4× FLOPs reduction while maintaining accuracy on both classification and object detection tasks in static and dynamic scenes.

Beyond 2D vision, this study identifies and addresses resource bottlenecks in state-of-the-art 3D scene understanding and generation models. For Neural Radiance Fields (NeRF), which suffer from high latency due to ray marching, the thesis proposes Quant-NeRF, an end-to-end low-precision algorithm-hardware co-design that achieves a 3.8× speedup and 7.2× memory reduction while maintaining visual quality (<0.5 dB PSNR loss). Finally, advancing resource-optimal 3D rendering, this dissertation proposes techniques that address the memory inefficiency (>900 MB) and low-frequency bias of both visual and audio-visual 3D Gaussian Splatting (3DGS). The proposed schemes, which extend 3DGS applications beyond GPU environments, reduce the memory footprint by 3.9× while improving visual and audio-visual quality.

Through these contributions, carefully designed hardware-aware and algorithm-level optimizations enable the deployment of sophisticated vision and generative AI models on resource-constrained devices while maintaining high performance. This work establishes a foundation for more accessible, efficient, and sustainable AI applications across diverse real-world scenarios, bridging the gap between cutting-edge AI research and on-device applications.
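To make the learnable-threshold idea behind LT-SNN concrete, the sketch below implements a leaky integrate-and-fire (LIF) layer in PyTorch whose per-channel firing threshold is a trainable parameter, updated by backpropagation through a surrogate gradient. The class name, per-channel parameterization, decay constant, and rectangular surrogate are illustrative assumptions for exposition, not the thesis's actual LT-SNN design.

import torch
import torch.nn as nn

class SpikeFn(torch.autograd.Function):
    """Heaviside spike in the forward pass, rectangular surrogate gradient
    in the backward pass (a common trick for training SNNs)."""
    @staticmethod
    def forward(ctx, v_minus_thresh):
        ctx.save_for_backward(v_minus_thresh)
        return (v_minus_thresh > 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Pass gradient only near the threshold crossing.
        return grad_out * (x.abs() < 0.5).float()

class LearnableThresholdLIF(nn.Module):
    """LIF layer with one learnable firing threshold per channel
    (hypothetical parameterization for illustration)."""
    def __init__(self, features, decay=0.9, init_thresh=1.0):
        super().__init__()
        self.decay = decay
        self.thresh = nn.Parameter(torch.full((features,), init_thresh))

    def forward(self, inputs):  # inputs: (time, batch, features)
        v = torch.zeros_like(inputs[0])
        spikes = []
        for x_t in inputs:
            v = self.decay * v + x_t           # leaky integration
            s = SpikeFn.apply(v - self.thresh)  # fire where v exceeds thresh
            v = v - s * self.thresh             # soft reset by subtraction
            spikes.append(s)
        return torch.stack(spikes)

# Usage: 4 time steps, batch of 2, 8 features.
layer = LearnableThresholdLIF(8)
out = layer(torch.randn(4, 2, 8))
print(out.shape)  # torch.Size([4, 2, 8])

Because the threshold appears in both the spike condition and the reset, gradients reach it through the surrogate, letting each channel adapt its own firing rate during training.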
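The quantization-based contributions (QE-SNN, Quant-NeRF, and compressed 3DGS) all build on mapping floating-point tensors to low-bit integers. The snippet below shows generic symmetric uniform quantization to int8 as a minimal sketch; the bit-width, per-tensor scaling, and rounding choices are illustrative assumptions, not the specific schemes proposed in the thesis.

import torch

def quantize_symmetric(x: torch.Tensor, bits: int = 8):
    """Map a float tensor to signed integers in [-(2^(b-1)-1), 2^(b-1)-1]."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(256, 256)
q, s = quantize_symmetric(w)  # int8 storage: 4x smaller than fp32
err = (dequantize(q, s) - w).abs().mean()
print(q.dtype, f"mean abs error {err.item():.4f}")

Storing parameters as int8 plus a single scale factor yields roughly the 4x memory savings per tensor that motivates the larger end-to-end reductions reported above.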