Provably Efficient Model Training over Centralized and Decentralized Datasets

Other Titles



Deep learning models are prevalent in many real-world applications, and can easily scale to millions of billions of learnable parameters. Training a deep learning model usually requires iterating or sampling over a massive-scale dataset, which can either be centralized or decentralized. Centralized datasets are gathered and stored in a central authority where the training processes have access to, e.g., a Networked File System (NFS) with read permission. Decentralized datasets, on the other hand, are located at different parties and cannot be shared or shuffled. In this dissertation, we investigate provably efficient model training algorithms for both cases. When training a model over centralized datasets, the example order has long been known to affect convergence rate. Recent results show that accelerated rates are possible in a variety of cases for permutation-based sample orders, in which each example from the training set is used once before any example is reused. We develop a broad condition on the sequence of examples used by SGD that is sufficient to prove tight convergence rates in both strongly convex and non-convex settings. Leveraging our insights, we propose algorithms that use the gradient balancing technique to find provably better data permutations than the state-of-the-art random reshuffling. For decentralized dataset, multiple computing nodes possess local datasets and they collaboratively train a model without sharing their own data. We formalize such setting as decentralized training, and provide tight lower bounds on the iteration complexity in a stochastic non-convex setting. Our lower bound reveals a theoretical gap in known convergence rates of many existing decentralized training algorithms, such as D-PSGD. We prove by construction our lower bounds are tight and achievable. Motivated by our insights, we further propose DeTAG, a practical gossip-style decentralized algorithm that achieves the lower bound with only a logarithm gap. In addition, to make decentralized training robust to low-bandwidth high-latency networks, we propose Moniqua, a technique that allows decentralized SGD to use quantized communication.

Journal / Series

Volume & Issue


306 pages


Date Issued





Effective Date

Expiration Date




Union Local


Number of Workers

Committee Chair

De Sa, Christopher

Committee Co-Chair

Committee Member

Zhang, Zhiru
Damle, Anil

Degree Discipline

Computer Science

Degree Name

Ph. D., Computer Science

Degree Level

Doctor of Philosophy

Related Version

Related DOI

Related To

Related Part

Based on Related Item

Has Other Format(s)

Part of Related Item

Related To

Related Publication(s)

Link(s) to Related Publication(s)


Link(s) to Reference(s)

Previously Published As

Government Document




Other Identifiers


Attribution 4.0 International


dissertation or thesis

Accessibility Feature

Accessibility Hazard

Accessibility Summary

Link(s) to Catalog Record