Provably Efficient Model Training over Centralized and Decentralized Datasets
dc.contributor.author | Lu, Yucheng | |
dc.contributor.chair | De Sa, Christopher | en_US |
dc.contributor.committeeMember | Zhang, Zhiru | en_US |
dc.contributor.committeeMember | Damle, Anil | en_US |
dc.date.accessioned | 2024-04-05T18:47:12Z | |
dc.date.available | 2024-04-05T18:47:12Z | |
dc.date.issued | 2023-08 | |
dc.description | 306 pages | en_US |
dc.description.abstract | Deep learning models are prevalent in many real-world applications and can easily scale to millions or billions of learnable parameters. Training a deep learning model usually requires iterating or sampling over a massive-scale dataset, which can be either centralized or decentralized. Centralized datasets are gathered and stored by a central authority that the training processes can access, e.g., a Networked File System (NFS) with read permission. Decentralized datasets, on the other hand, are located at different parties and cannot be shared or shuffled. In this dissertation, we investigate provably efficient model training algorithms for both cases. When training a model over centralized datasets, the example order has long been known to affect the convergence rate. Recent results show that accelerated rates are possible in a variety of cases for permutation-based sample orders, in which each example from the training set is used once before any example is reused. We develop a broad condition on the sequence of examples used by SGD that is sufficient to prove tight convergence rates in both strongly convex and non-convex settings. Leveraging our insights, we propose algorithms that use the gradient balancing technique to find provably better data permutations than the state-of-the-art random reshuffling. For decentralized datasets, multiple computing nodes possess local datasets and collaboratively train a model without sharing their own data. We formalize this setting as decentralized training and provide tight lower bounds on the iteration complexity in a stochastic non-convex setting. Our lower bound reveals a theoretical gap in the known convergence rates of many existing decentralized training algorithms, such as D-PSGD. We prove by construction that our lower bounds are tight and achievable. Motivated by these insights, we further propose DeTAG, a practical gossip-style decentralized algorithm that achieves the lower bound up to a logarithmic gap. In addition, to make decentralized training robust to low-bandwidth, high-latency networks, we propose Moniqua, a technique that allows decentralized SGD to use quantized communication. | en_US |
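The abstract builds on two baseline ideas: permutation-based sample orders for SGD ("random reshuffling", where each example is used once per epoch before any is reused) and gossip-style decentralized SGD in the spirit of D-PSGD. The sketch below illustrates only these baselines, not the dissertation's own algorithms (gradient balancing, DeTAG, Moniqua); the function names, mixing matrix, and toy least-squares problem are illustrative assumptions.

```python
import numpy as np

def sgd_random_reshuffling(grad_fn, x0, data, lr=0.05, epochs=20, seed=0):
    """SGD with a permutation-based sample order ("random reshuffling"):
    each epoch visits every training example exactly once, in a freshly
    shuffled order, before any example is reused."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(epochs):
        for i in rng.permutation(len(data)):   # fresh permutation per epoch
            x -= lr * grad_fn(x, data[i])      # one SGD step per example
    return x

def decentralized_sgd_step(params, grads, mixing, lr=0.05):
    """One synchronous round of gossip-style decentralized SGD (in the
    spirit of D-PSGD): each node averages neighbors' parameters via a
    doubly stochastic mixing matrix, then applies its local gradient.
    Only parameters are exchanged, never raw data."""
    params, grads = np.asarray(params), np.asarray(grads)   # (nodes, dim)
    return mixing @ params - lr * grads

# Toy usage: fit b ≈ w * a by least squares with reshuffled SGD.
rng = np.random.default_rng(1)
data = [(a, 2.0 * a + 0.01 * rng.standard_normal()) for a in np.linspace(-1, 1, 64)]
grad = lambda w, ab: 2.0 * ab[0] * (ab[0] * w - ab[1])      # d/dw (a*w - b)^2
print(sgd_random_reshuffling(grad, x0=0.0, data=data))       # ≈ 2.0

# Toy gossip round on a 3-node graph with uniform neighbor weights.
W = np.array([[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]])
theta = np.array([[1.0], [2.0], [3.0]])
print(decentralized_sgd_step(theta, grads=np.zeros((3, 1)), mixing=W))
```

The dissertation's contributions replace the uniform shuffle and the plain gossip step above with provably better data permutations (via gradient balancing) and an accelerated, quantization-tolerant gossip scheme (DeTAG, Moniqua); those details are not reproduced in this sketch.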
dc.identifier.doi | https://doi.org/10.7298/g59s-ed07 | |
dc.identifier.other | Lu_cornellgrad_0058F_13762 | |
dc.identifier.other | http://dissertations.umi.com/cornellgrad:13762 | |
dc.identifier.uri | https://hdl.handle.net/1813/114692 | |
dc.language.iso | en | |
dc.rights | Attribution 4.0 International | * |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | * |
dc.title | Provably Efficient Model Training over Centralized and Decentralized Datasets | en_US |
dc.type | dissertation or thesis | en_US |
dcterms.license | https://hdl.handle.net/1813/59810.2 | |
thesis.degree.discipline | Computer Science | |
thesis.degree.grantor | Cornell University | |
thesis.degree.level | Doctor of Philosophy | |
thesis.degree.name | Ph. D., Computer Science |
Files
Original bundle
- Name: Lu_cornellgrad_0058F_13762.pdf
- Size: 5.29 MB
- Format: Adobe Portable Document Format