ML and System Approaches for Data Processing Optimization and Long Sequence Modeling
Modern data systems are integral to many industries, yet optimizing their performance remains complex. Traditional methods that rely on cost models often fail because of unreliable assumptions about data distributions and the difficulty of estimating the sizes of intermediate results. To overcome these limitations, we propose a novel paradigm that abandons classical cost models in favor of a sequential decision-making framework inspired by reinforcement learning. Specifically, we use Monte Carlo Tree Search (MCTS) and runtime feedback from sampled or partial query executions to learn and converge to the optimal configuration. We introduce and apply this paradigm to several data system tuning problems: query optimization (SkinnerDB, Chapter 2), general parameter optimization (UDO, Chapter 3), and graph query optimization (ADPOT, Chapter 4), demonstrating that this approach effectively handles complex scenarios (e.g., skewed data and large queries) with which traditional models struggle. Along the way, we develop new theoretical extensions of MCTS under noisy, delayed, and multi-fidelity feedback in Chapter 5, and a new gym environment in Chapter 6 that benchmarks various RL algorithms (e.g., SAC, DQN, TD3, PPO) and motivates the development of safety-aware RL approaches.

At the same time, data-intensive applications increasingly leverage large language models (LLMs) to integrate vector-based entity search and to manage multiple modalities within large-scale data lakes. However, LLMs are typically built on the transformer architecture, which faces a fundamental challenge: its quadratic dependency on sequence length. This limitation restricts their ability to handle and manage long data (e.g., large tables, videos, and audio). To enable LLMs to process long sequence inputs, we investigate whether linear recurrent neural networks (RNNs) can provide a viable alternative while maintaining transformer-level performance.
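To give a flavor of the learning-based tuning paradigm, the sketch below shows its simplest instance: a one-level UCB1 search over candidate configurations, where each trial returns a noisy reward (e.g., the negated runtime of a sampled or partial execution). This is a deliberately minimal illustration with a hypothetical cost model, not the full MCTS machinery of SkinnerDB, UDO, or ADPOT; the function names and the cost curve are invented for this example.

```python
import math
import random

def ucb1_tune(configs, measure, budget=400, c=1.4):
    """One-level UCB1 search over candidate configurations.

    `measure(cfg)` returns a noisy reward (here: negated runtime of a
    sampled/partial execution). Returns the config with the best mean.
    """
    counts = {cfg: 0 for cfg in configs}
    sums = {cfg: 0.0 for cfg in configs}
    for t in range(1, budget + 1):
        # Try each configuration once, then pick by the UCB score,
        # which balances exploitation (mean) and exploration (bonus).
        untried = [cfg for cfg in configs if counts[cfg] == 0]
        if untried:
            cfg = untried[0]
        else:
            cfg = max(configs, key=lambda a: sums[a] / counts[a]
                      + c * math.sqrt(math.log(t) / counts[a]))
        sums[cfg] += measure(cfg)
        counts[cfg] += 1
    return max(configs, key=lambda a: sums[a] / counts[a])

# Hypothetical noisy "runtime" model: configuration 8 is truly best.
random.seed(0)
def measure(cfg):
    true_cost = abs(cfg - 8) + 1                 # invented cost curve
    return -(true_cost + random.gauss(0, 0.5))   # noisy negated runtime

best = ucb1_tune(list(range(16)), measure)
```

The key property the thesis exploits is visible even here: no cost model is consulted; the optimizer converges to a good configuration purely from (noisy) runtime feedback.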
In Chapter 9, we introduce the first bidirectional linear-complexity language model that matches BERT performance without using attention. In Chapter 10, we further demonstrate that linear RNNs outperform transformers in byte-level language modeling, which may enable a universal representation of different modalities and formats. Finally, since training LLMs from scratch is costly and resource-intensive, in Chapter 11 we distill large transformers (e.g., Llama) into linear RNNs using very limited GPU resources without sacrificing performance. Additionally, we introduce a hardware-aware speculative decoding algorithm that accelerates inference, enabling more efficient deployment of these models. Our research opens new opportunities for LLMs to handle long contexts and to model different long data modalities more efficiently.
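To make the speculative decoding idea concrete, the toy sketch below shows the greedy variant of the general technique: a cheap draft model proposes several tokens, the expensive target model verifies them, and the longest agreeing prefix is committed along with one target token. This is an illustration of the generic scheme under invented toy "models", not the hardware-aware algorithm of Chapter 11.

```python
def speculative_decode(target_next, draft_next, prompt, k=4, max_new=16):
    """Greedy speculative decoding sketch.

    `draft_next(seq)` / `target_next(seq)` each return the model's next
    token for `seq`. The cheap draft proposes `k` tokens; the target
    verifies them and commits the longest agreeing prefix plus one of
    its own tokens, so several tokens can land per verification round.
    (May slightly overshoot `max_new`; fine for a sketch.)
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1) Draft proposes k tokens autoregressively.
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target verifies each position (in practice, one batched pass).
        accepted, ctx = [], list(seq)
        for t in proposal:
            if target_next(ctx) != t:
                break
            accepted.append(t)
            ctx.append(t)
        # 3) Commit accepted tokens plus the target's own next token.
        seq += accepted
        seq.append(target_next(seq))
    return seq[len(prompt):]

# Toy "models": the next token is (last + 1) mod 10.
tgt = lambda s: (s[-1] + 1) % 10
out = speculative_decode(tgt, tgt, prompt=[0], k=2, max_new=5)
```

Because verification always falls back to the target's own token, the output matches plain target decoding regardless of draft quality; a better draft only increases how many tokens are committed per round.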