Cornell University Library

eCommons


OS INSPIRED COMPLETE KERNEL FUSION

File(s)
Aimuyo_cornell_0058O_11955.pdf (2.59 MB)
Permanent Link(s)
https://doi.org/10.7298/ybdg-yr88
https://hdl.handle.net/1813/120701
Collections
Cornell Theses and Dissertations
Author
Aimuyo, Osayamen
Abstract

Distributed Machine Learning (DML) is increasingly recognized as a communication-bound workload, with most existing work aiming to alleviate this bottleneck through communication-computation overlap. However, current approaches—largely reliant on CPU-managed operator scheduling and synchronous communication collectives—leave significant performance on the table. This thesis identifies, analyzes, and addresses the inefficiencies arising from the interplay between CPU-driven collective communication and highly parallel GPU computation. We demonstrate that the standard bulk-synchronous communication model underutilizes GPU interconnect bandwidth and is especially vulnerable to straggler-induced performance degradation. Focusing on dynamic workloads such as Mixture-of-Experts (MoE), we highlight two critical bottlenecks. First, we demonstrate how CPU-driven execution limits the exploitation of task locality and introduces artificial synchronization barriers across distributed GPU tasks, degrading performance through straggler effects. Second, we observe payload inefficiency at the application layer, where existing collective primitives force unnecessary padding of GPU communication buffers. To overcome these limitations, we propose a model of complete GPU residency, where inter-GPU communication is integrated directly into GPU kernels. We realize this vision in FlashDMoE: a persistent, in-kernel, actor-style operating system with packet switching that enables complete operator fusion for Distributed MoE (DMoE) into a single kernel, the first of its kind. FlashDMoE features a modular, message-driven architecture that supports lockless execution across tens of thousands of GPU threads and across distributed GPUs. We demonstrate how FlashDMoE addresses the all-to-all communication bottleneck in expert parallelism and enables high-throughput, GPU-initiated communication.
Evaluated against state-of-the-art distributed MoE frameworks, FlashDMoE achieves up to 9x higher GPU utilization, 6x lower latency, 5.7x higher throughput, and 4x better overlap efficiency—despite FlashDMoE using FP32 while the baselines use FP16.
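
The persistent, message-driven execution model the abstract describes can be sketched in CUDA. This is an illustrative sketch only, not the FlashDMoE implementation: the `Mailbox`, `Task`, and `actor_kernel` names are hypothetical, and the placeholder compute stands in for a fused expert GEMM and dispatch/combine step. The key idea shown is that each thread block is a long-lived actor that polls a mailbox in global memory and processes work without ever returning control to the CPU.

```cuda
#include <cuda_runtime.h>

// Hypothetical task descriptor: an index naming a tile of work.
struct Task { int tile_id; };

// Per-block mailbox resident in global memory, so a producer (the host,
// or a peer GPU via GPU-initiated communication) can enqueue tasks
// without relaunching the kernel. Single producer, single consumer block.
struct Mailbox {
    Task slots[64];
    unsigned int head;   // written by the producer
    unsigned int tail;   // written by the consumer block
    int shutdown;        // set by the producer to terminate the actor
};

// Persistent actor kernel: launched once, loops until shutdown.
// In a fully GPU-resident design, inter-GPU transfers would also be
// issued from inside this loop rather than by the CPU.
__global__ void actor_kernel(Mailbox* boxes, float* data) {
    __shared__ int s_tile;   // task id broadcast to all threads in the block
    __shared__ int s_run;    // loop-control flag broadcast likewise
    Mailbox* box = &boxes[blockIdx.x];
    for (;;) {
        if (threadIdx.x == 0) {
            s_tile = -1;
            s_run = !*(volatile int*)&box->shutdown;
            unsigned int head = *(volatile unsigned int*)&box->head;
            if (box->tail != head) {
                s_tile = box->slots[box->tail % 64].tile_id;
                __threadfence();          // order slot read before tail bump
                box->tail = box->tail + 1;
            }
        }
        __syncthreads();                  // publish s_tile / s_run to the block
        if (!s_run) break;
        if (s_tile >= 0) {
            // Placeholder compute for the claimed tile.
            data[s_tile * blockDim.x + threadIdx.x] *= 2.0f;
        }
        __syncthreads();                  // keep s_tile stable until all threads are done
    }
}
```

Because the kernel never exits between tasks, kernel-launch overhead and CPU-driven synchronization barriers disappear from the critical path, which is the property the thesis exploits for complete operator fusion.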

Description
63 pages
Date Issued
2025-08
Keywords
Accelerator • Distributed Machine Learning • GPU • Machine Learning • Mixture-of-Experts • Operator Fusion
Committee Chair
Singh, Rachee
Committee Member
De Sa, Christopher
Guidi, Giulia
Degree Discipline
Computer Science
Degree Name
M.S., Computer Science
Degree Level
Master of Science
Rights
Attribution-NonCommercial-NoDerivatives 4.0 International
Rights URI
https://creativecommons.org/licenses/by-nc-nd/4.0/
Type
dissertation or thesis
