eCommons

Provably Efficient Model Training over Centralized and Decentralized Datasets

dc.contributor.author: Lu, Yucheng
dc.contributor.chair: De Sa, Christopher
dc.contributor.committeeMember: Zhang, Zhiru
dc.contributor.committeeMember: Damle, Anil
dc.date.accessioned: 2024-04-05T18:47:12Z
dc.date.available: 2024-04-05T18:47:12Z
dc.date.issued: 2023-08
dc.description: 306 pages
dc.description.abstract: Deep learning models are prevalent in many real-world applications and can easily scale to millions or billions of learnable parameters. Training a deep learning model usually requires iterating or sampling over a massive-scale dataset, which can be either centralized or decentralized. Centralized datasets are gathered and stored by a central authority that the training processes can access, e.g., a Networked File System (NFS) with read permission. Decentralized datasets, on the other hand, are located at different parties and cannot be shared or shuffled. In this dissertation, we investigate provably efficient model training algorithms for both cases. When training a model over centralized datasets, the example order has long been known to affect the convergence rate. Recent results show that accelerated rates are possible in a variety of cases for permutation-based sample orders, in which each example from the training set is used once before any example is reused. We develop a broad condition on the sequence of examples used by SGD that is sufficient to prove tight convergence rates in both strongly convex and non-convex settings. Leveraging our insights, we propose algorithms that use the gradient balancing technique to find provably better data permutations than the state-of-the-art random reshuffling. For decentralized datasets, multiple computing nodes each possess a local dataset and collaboratively train a model without sharing their own data. We formalize this setting as decentralized training and provide tight lower bounds on the iteration complexity in a stochastic non-convex setting. Our lower bound reveals a theoretical gap in the known convergence rates of many existing decentralized training algorithms, such as D-PSGD. We prove by construction that our lower bounds are tight and achievable. Motivated by these insights, we further propose DeTAG, a practical gossip-style decentralized algorithm that achieves the lower bound up to a logarithmic gap. In addition, to make decentralized training robust to low-bandwidth, high-latency networks, we propose Moniqua, a technique that allows decentralized SGD to use quantized communication.
dc.identifier.doi: https://doi.org/10.7298/g59s-ed07
dc.identifier.other: Lu_cornellgrad_0058F_13762
dc.identifier.other: http://dissertations.umi.com/cornellgrad:13762
dc.identifier.uri: https://hdl.handle.net/1813/114692
dc.language.iso: en
dc.rights: Attribution 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.title: Provably Efficient Model Training over Centralized and Decentralized Datasets
dc.type: dissertation or thesis
dcterms.license: https://hdl.handle.net/1813/59810.2
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Cornell University
thesis.degree.level: Doctor of Philosophy
thesis.degree.name: Ph.D., Computer Science
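
As context for the permutation-based sample orders described in the abstract above, the following is a minimal sketch, not taken from the dissertation: the toy least-squares objective, step size, and all names are illustrative assumptions. It contrasts with-replacement SGD sampling against random reshuffling, the baseline ordering the abstract refers to, in which each example is used exactly once per epoch.

# Minimal sketch (illustrative only): with-replacement SGD vs. random reshuffling
# on a toy least-squares problem. Not the dissertation's algorithms.
import numpy as np

rng = np.random.default_rng(0)
n, d = 128, 16
A = rng.normal(size=(n, d))          # toy data matrix
b = rng.normal(size=n)               # toy targets

def grad(x, i):
    """Gradient of the i-th example's loss 0.5 * (a_i^T x - b_i)^2."""
    return (A[i] @ x - b[i]) * A[i]

def sgd_with_replacement(x, lr=0.01, epochs=20):
    for _ in range(epochs):
        for _ in range(n):
            i = rng.integers(n)      # sample an index uniformly, with replacement
            x = x - lr * grad(x, i)
    return x

def sgd_random_reshuffling(x, lr=0.01, epochs=20):
    for _ in range(epochs):
        perm = rng.permutation(n)    # each example used exactly once per epoch
        for i in perm:
            x = x - lr * grad(x, i)
    return x

x0 = np.zeros(d)
for name, run in [("with replacement", sgd_with_replacement),
                  ("random reshuffling", sgd_random_reshuffling)]:
    x = run(x0.copy())
    print(f"{name:>20s}: residual {np.linalg.norm(A @ x - b):.4f}")

Random reshuffling typically converges faster per epoch than with-replacement sampling; the dissertation studies when such permutation-based orders provably accelerate convergence and proposes gradient-balancing methods to find permutations that are provably better still.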

Files

Original bundle
Name: Lu_cornellgrad_0058F_13762.pdf
Size: 5.29 MB
Format: Adobe Portable Document Format