OS-INSPIRED COMPLETE KERNEL FUSION
Distributed Machine Learning (DML) is increasingly recognized as a communication-bound workload, and most existing work aims to alleviate this bottleneck by overlapping communication with computation. However, current approaches, which rely largely on CPU-managed operator scheduling and synchronous communication collectives, leave significant performance on the table. This thesis identifies, analyzes, and addresses the inefficiencies that arise from the interplay between CPU-driven collective communication and highly parallel GPU computation. We demonstrate that the standard bulk-synchronous communication model underutilizes GPU interconnect bandwidth and is especially vulnerable to straggler-induced performance degradation. Focusing on dynamic workloads such as Mixture-of-Experts (MoE), we highlight two critical bottlenecks. First, CPU-driven execution limits the exploitation of task locality and introduces artificial synchronization barriers across distributed GPU tasks, amplifying the slowdown caused by stragglers. Second, existing collective primitives are payload-inefficient at the application layer, forcing unnecessary padding of GPU communication buffers. To overcome these limitations, we propose a model of complete GPU residency, in which inter-GPU communication is integrated directly into GPU kernels. We realize this vision in FlashDMoE: a persistent, in-kernel, actor-style operating system with packet switching that fuses all Distributed MoE (DMoE) operators into a single kernel, the first system of its kind. FlashDMoE features a modular, message-driven architecture that supports lockless execution across tens of thousands of GPU threads and across distributed GPUs. We show how FlashDMoE addresses the all-to-all communication bottleneck in expert parallelism and enables high-throughput, GPU-initiated communication. Evaluated against state-of-the-art distributed MoE frameworks, FlashDMoE achieves up to 9x higher GPU utilization, 6x lower latency, 5.7x higher throughput, and 4x better overlap efficiency, despite computing in FP32 while the baselines use FP16.
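To make the persistent, actor-style execution model concrete, the following minimal CUDA sketch illustrates the idea under stated assumptions: Task, persistent_actor_kernel, and g_next_task are hypothetical names invented for this illustration, not FlashDMoE's actual interfaces, and the expert computation and GPU-initiated sends are reduced to placeholders. Each thread block acts as an actor that claims task descriptors from a device-resident queue via lock-free atomics, so no CPU-side operator launches or global barriers are involved.

    // Minimal sketch of a persistent, actor-style GPU kernel. All names here
    // are hypothetical illustrations, not FlashDMoE's actual API.
    #include <cstdio>
    #include <cuda_runtime.h>

    struct Task {
        int expert_id;   // which expert these tokens were routed to
        int num_tokens;  // exact payload size: only real tokens, no padded slots
    };

    __device__ int g_next_task = 0;  // device-resident queue head

    __global__ void persistent_actor_kernel(const Task* tasks, int num_tasks,
                                            float* out, int max_tokens) {
        __shared__ int claimed;
        while (true) {
            if (threadIdx.x == 0) {
                // Lock-free claim: one atomic per task, with no CPU involvement
                // and no global barrier across blocks or GPUs.
                claimed = atomicAdd(&g_next_task, 1);
            }
            __syncthreads();
            if (claimed >= num_tasks) return;  // queue drained: the actor retires
            Task t = tasks[claimed];
            float* dst = out + (size_t)claimed * max_tokens;  // per-task output slab
            // Placeholder compute: a real DMoE kernel would run the expert FFN on
            // exactly t.num_tokens tokens here, then issue a GPU-initiated send of
            // the results to the owning device instead of a host-driven collective.
            for (int i = threadIdx.x; i < t.num_tokens; i += blockDim.x) {
                dst[i] += (float)t.expert_id + 1.0f;
            }
            __syncthreads();  // finish this task before thread 0 claims the next
        }
    }

    int main() {
        const int num_tasks = 64, max_tokens = 1024, num_experts = 8;
        Task h_tasks[num_tasks];
        for (int i = 0; i < num_tasks; ++i)
            h_tasks[i] = Task{i % num_experts, 1 + (i * 37) % max_tokens};

        Task* d_tasks;
        float* d_out;
        cudaMalloc(&d_tasks, sizeof(h_tasks));
        cudaMalloc(&d_out, sizeof(float) * num_tasks * max_tokens);
        cudaMemcpy(d_tasks, h_tasks, sizeof(h_tasks), cudaMemcpyHostToDevice);
        cudaMemset(d_out, 0, sizeof(float) * num_tasks * max_tokens);

        // One launch for the whole task queue: the grid persists until all work
        // is done, rather than the CPU launching one kernel per operator.
        persistent_actor_kernel<<<8, 256>>>(d_tasks, num_tasks, d_out, max_tokens);
        cudaDeviceSynchronize();
        printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
        cudaFree(d_tasks);
        cudaFree(d_out);
        return 0;
    }

The single launch in main stands in for the one-kernel design: the grid persists until the queue drains, and each task carries its exact token count, which is where the padding-free payloads and straggler tolerance of the fused model come from.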