Mapping Neural Network Inference Onto Heterogeneous Hardware Platforms
Datacenters are evolving toward heterogeneity, incorporating specialized hardware for tasks such as networking, video processing, and, in particular, deep learning. To effectively harness the compute capabilities of modern heterogeneous datacenters, this thesis proposes an approach for compiler-level partitioning of deep neural networks (DNNs) across interconnected hardware devices. We present a comprehensive framework for heterogeneous DNN compilation that performs automatic partitioning and device mapping. Our scheduler integrates an exact solver, based on a mixed integer linear programming (MILP) formulation, with a modularity-based heuristic for scalability. Additionally, we introduce a theoretical lower bound formula to assess the quality of heuristic solutions, enabling an estimate of how close they come to optimal. We evaluate the proposed scheduler by optimizing both traditional DNNs and randomly wired neural networks under latency and throughput constraints. Our experiments are conducted on a heterogeneous system consisting of a CPU and two distinct GPUs. Compared to simply running DNNs on the fastest GPU, our framework reduces latency by more than 3$\times$ and improves throughput by up to 2.9$\times$ by automatically leveraging both data and model parallelism. Furthermore, our modularity-based "splitting" heuristic reduces scheduling runtime by up to 395$\times$ without compromising solution quality compared to the exact MILP approach, and it outperforms alternative heuristic baselines by 30-60% in solution quality. Lastly, we present two case studies that demonstrate the capabilities of our scheduler: the first investigates performance in memory-constrained environments, while the second extends the framework to schedule large language models across multiple heterogeneous servers by exploiting symmetry in the hardware setup.
Overall, this research contributes to the efficient deployment of DNNs in heterogeneous datacenters through compiler-level partitioning, demonstrating improvements in latency, throughput, and scheduling scalability.
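To make the device-mapping problem concrete, the sketch below brute-forces device assignments for a toy three-layer sequential pipeline. All numbers here are invented for illustration (the device names, per-layer compute costs, and transfer penalty are assumptions, not measurements from the thesis); the MILP formulation optimizes the same kind of objective exactly, and the splitting heuristic scales it to large graphs, without this exhaustive enumeration.

```python
from itertools import product

# Hypothetical per-layer compute times (ms) on each device; a real
# scheduler would obtain such costs by profiling the actual hardware.
layer_cost = [
    {"cpu": 8.0, "gpu0": 2.0, "gpu1": 4.0},  # layer 0
    {"cpu": 7.0, "gpu0": 4.0, "gpu1": 1.0},  # layer 1
    {"cpu": 9.0, "gpu0": 3.0, "gpu1": 1.0},  # layer 2
]
TRANSFER_MS = 0.5  # assumed cost of moving activations between devices
DEVICES = ["cpu", "gpu0", "gpu1"]

def latency(assign):
    """End-to-end latency of the pipeline under a device assignment."""
    compute = sum(layer_cost[i][dev] for i, dev in enumerate(assign))
    comms = TRANSFER_MS * sum(a != b for a, b in zip(assign, assign[1:]))
    return compute + comms

# Exhaustive search over all |DEVICES|^|layers| assignments.
best = min(product(DEVICES, repeat=len(layer_cost)), key=latency)
print(best, latency(best))
```

With these illustrative costs, the best schedule splits the model across both GPUs (layer 0 on `gpu0`, layers 1-2 on `gpu1`) and beats every single-device schedule despite paying one transfer penalty; this is the kind of model parallelism the scheduler discovers automatically.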