dc.contributor.author Gan, Yu
dc.description 247 pages
dc.description.abstract Cloud computing has greatly increased in prevalence and impact, and datacenter applications today focus strongly on cloud-native architectures. A cloud-native architecture combines many technologies, including microservices, containerization, service meshes and orchestration, cloud telemetry, and serverless computing, to fully exploit the flexibility, scalability, and robustness of public, private, or hybrid clouds. As a fundamental programming model for designing modern cloud applications, microservices have drawn much attention from both academia and industry. Many cloud service providers, including Google, Facebook, Netflix, Twitter, Amazon, Uber, and Alibaba, have adopted or supported microservices in their systems over the past decade. Microservices have several advantages, including accelerating development and deployment, enabling software heterogeneity, and promoting elasticity and decoupled design. Despite these benefits, the microservice architecture raises several challenges and opportunities in system design. First, microservices have different requirements from traditional monolithic applications: they are more sensitive to performance unpredictability from both hardware and software sources, spend large fractions of their end-to-end latency processing network requests, and introduce backpressure effects due to the dependencies between microservices. To study the implications of this new programming model across the system stack, we need representative end-to-end applications built with microservices. In this thesis, we first design a representative large-scale microservice benchmark suite and use it to study the system implications of microservices. We then use the benchmark suite to highlight the benefits of machine learning-based techniques in addressing the performance and resource efficiency issues of microservices in a practical and scalable way.
We first present DeathStarBench, an open-source microservice benchmark suite of five end-to-end applications, each with tens of microservices, that implement interactive cloud services such as social networks and e-commerce sites. We use DeathStarBench to analyze the implications of microservices across the system stack and highlight the need for automated approaches to address their complexity. The benchmark suite has been widely adopted in both industry and academia since its release. We then present Ditto, an application cloning system that generates synthetic benchmarks with performance characteristics similar to those of the original cloud services, including both monolithic applications and microservices. We show that the synthetic applications generated by Ditto reproduce the on-CPU and off-CPU performance of the target applications in both kernel space and user space. Second, we show that the complex dependency graph between microservices makes it impractical to use traditional empirical or heuristic approaches to analyze millions of cloud telemetry samples and discover the root causes of poor performance. We propose to address this issue in a data-driven, machine learning-based way. We first introduce Seer, a data-driven performance debugging system that uses a hybrid neural network to identify patterns that signal upcoming Quality-of-Service (QoS) violations in the end-to-end service. By determining which microservice initiates the QoS violation, Seer can take proactive action and avoid the degraded performance altogether. We show that Seer achieves 93% accuracy in detecting upcoming QoS violations and 89% accuracy in identifying the root cause of these violations across a diverse set of services.
To improve the practicality and scalability of performance debugging, we then design Sage, a causal inference system that relies entirely on unsupervised learning to identify the culprits of unpredictable performance in complex graphs of microservices, making it practical for large-scale production systems. Sage matches the accuracy of the earlier supervised learning system but does not require heavy instrumentation or high-frequency tracing, which makes it robust to sparse or missing monitoring data and adaptive to frequent service updates. Leveraging ML to handle the increasing complexity that microservices introduce means that this new cloud programming model and its benefits do not come at the expense of performance predictability or resource efficiency.
dc.rights Attribution 4.0 International
dc.subject Cloud Computing
dc.subject Distributed System
dc.subject ML for System
dc.subject Performance Debugging
dc.title Designing and Managing Large-scale Interactive Microservices in Datacenters
dc.type dissertation or thesis; Ph.D., Electrical and Computer Engineering
dc.contributor.chair Delimitrou, Christina
dc.contributor.committeeMember Martínez, José F.
dc.contributor.committeeMember Weatherspoon, Hakim

Except where otherwise noted, this item's license is described as Attribution 4.0 International