HOST CONGESTION CONTROL A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy by Saksham Agarwal August 2024 © 2024 Saksham Agarwal ALL RIGHTS RESERVED HOST CONGESTION CONTROL Saksham Agarwal, Ph.D. Cornell University 2024 The conventional wisdom in systems and networking communities is that con- gestion in datacenter networks happens primarily within the network fabric (i.e., at network links and/or switches). This dissertation has three core contributions: (1) it presents a new problem of “host congestion”—congestion within the datapath between peripheral de- vices and compute/memory—and presents evidence of host congestion both in production datacenters and in experimental lab setups; (2) it builds an in-depth understanding of the root causes of host congestion, and of the impact of host congestion on application-level performance; and (3) it explores the implica- tions of host congestion to the design of network protocols, network stacks and operating systems. We define host congestion in the context of networked applications as fol- lows: the receiver-side host network interface card (NIC) receives data from the network at a rate faster than it can transfer it to compute/memory. This reduces the available NIC-to-memory bandwidth, resulting in queueing and eventual packet drops at hosts. We demonstrate that, even with state-of-the-art network protocols and network stacks, host congestion leads to significant degradation in throughput and orders-of-magnitude inflation in tail latency and a surpris- ingly large fraction of packet drops at the host, even when the access link band- width is far from fully utilized. We present evidence and characterization of the host congestion phenomenon for both large-scale production clusters (run- ning Swift congestion control protocol with a userspace network stack), and in experimental testbeds (running DCTCP congestion congestion protocol with Linux network stack). Several recent studies have built upon our work to show that hardware-offloaded network stacks also suffer from similar or worse host congestion phenomenon. Host congestion, and resulting queueing and packet drops at the host, are new to the community. To that end, this thesis also builds an in-depth under- standing of the root causes of the host congestion phenomenon. We demon- strate that host congestion is caused due to nanosecond-scale latency inflation within the NIC-to-CPU/memory datapath. Such latency inflation is rooted in the poor interplay between processor, memory and peripheral interconnects within the host; such an interplay, in turn, leads to contention at host resources (e.g., memory bandwidth, IO memory management units, etc.) and manifests it- self in the form of underutilization of peripheral interconnect (PCIe) bandwidth, queueing and packet drops at the NIC, and subsequent drop in application- level performance. We also discuss that unfavorable technology trends for host resources are going to make this problem worse over time: access link band- widths and PCIe bandwidths are expected to increase by 8-16x over the next few years; however, trends for essentially all other host resources (e.g., memory bandwidth per core, IO memory management unit caches, NIC buffer sizes, etc.) are relatively stagnant. Thus, we expect higher degrees of resource imbalance and contention within the host. Host congestion alters the many assumptions entrenched within the design of modern networked and operating systems. For instance, existing datacenter congestion control (CC) protocols were not designed to efficiently detect and respond to host congestion—they operate at a network round-trip-time (RTT) granularity (typically tens-to-hundreds of microseconds) while host congestion can change dynamically at sub-microsecond granularity. We present hostCC, a new congestion control architecture that handles both host congestion and network fabric congestion. HostCC introduces new “host-local” congestion sig- nals and a sub-RTT granularity “host-local” congestion response to handle host congestion. We demonstrate that hostCC can be integrated with existing CC protocols to efficiently perform host and network resource allocation among competing entities. As another example of host congestion altering the many assumptions en- trenched within operating systems, we explore IO memory protection mecha- nisms. Such mechanisms are used in real-world production datacenters to pre- vent malicious and/or faulty NICs from executing errant transfers into host memory. Modern hosts achieve this using an IOMMU—NICs operate on vir- tual addresses, and IOMMU translates virtual addresses to physical addresses (potentially speeding up translations using a cache called IOTLB) before execut- ing memory transfers. We demonstrate that existing state-of-the-art IO memory protection mechanisms are able to provide one of the two desirable properties: (1) strong safety property, that results in unavoidable IOTLB misses and sub- sequent host congestion; or (2) high performance. We present “Fast & Safe” (F&S), a simple modification to existing memory protection mechanisms that enables them to provide the strongest safety property, and yet, near-completely eliminates their overheads. The key insight in F&S design is that, rather than fo- cusing on achieving high IOTLB hit rates, we should focus on reducing the cost of each translation upon an IOTLB miss. F&S reduces the cost of each trans- lation by exploiting modern IOMMU hardware, along with novel mechanisms for contiguous virtual address allocation and batched unmapping and invali- dations of used virtual addresses. We demonstrate that F&S design requires no modifications in host hardware, minimal modifications within the operating system, and yet, near-completely alleviates all overheads of the strongest form of memory protection mechanism. Finally, using the example of storage stacks, we demonstrate that host con- gestion has implications that are more far-reaching than networked applica- tions. Specifically, modern storage stacks enable exploiting high throughput offered by storage devices like SSDs by maintaining multiple in-flight IO re- quests. Typically, the number of in-flight requests is set to be the “knee point” of the device’s latency-load curve. This minimizes the latency while maximiz- ing the throughput, under the assumption that the bottleneck is at the CPU and/or at the storage device. However, under host congestion, the latency- load curve (and the knee-point) can change dynamically at sub-microsecond granularity. We demonstrate that, under such host congestion, existing storage stacks keep a sub-optimal number of in-flight requests resulting in significant application-level performance degradation. We present storageCC, a new stor- age stack architecture to handle host congestion. Our key insight is that the end- to-end storage datapath (i.e., SSD-to-CPU/memory datapath) is conceptually a (lossless) computer network. StorageCC builds upon this insight to detect host congestion by collecting congestion signals within the host, and adapting the number of in-flight requests based on host congestion. Evaluation of StorageCC over a wide variety of scenarios demonstrates that StorageCC converges to the optimal parallelism for each individual scenario, maintaining near-hardware la- tency and throughput performance. Our thesis suggests that host congestion may have wide-reaching implica- tions to design of network protocols, network stacks, operating systems, and even host hardware. To that end, we close by summarizing the many new re- search questions opened up by our thesis in computer networking, operating systems and computer architecture. BIOGRAPHICAL SKETCH Saksham Agarwal is a Ph.D. candidate in the department of Computer Science at Cornell University, advised by Prof. Rachit Agarwal. He holds a Master of Science in Computer Science from Cornell University, and Bachelor of Technol- ogy in Electrical Engineering from Indian Institute of Technology Kanpur. He is a recipient of Cornell University Fellowship, Google Ph.D. Fellowship, and a Cornell Computer Science Outstanding Teaching Assistant Award. iii ”It always seems impossible until it’s done.” - Nelson Mandela iv ACKNOWLEDGEMENTS I will forever be indebted to my Ph.D. advisor, Rachit Agarwal, for enabling me to reach where I am today. I have learned so much from Rachit about re- search and life in general that it is nearly impossible to express my gratitude adequately in just a few lines. Rachit dedicated a tremendous amount of his time and energy to help me hone my skills in every facet necessary to become a successful researcher—from thinking about problems correctly, to formulat- ing concrete research agendas, to effectively communicating and presenting my work, and to establishing meaningful connections with other researchers. I will always cherish the countless hours we spent brainstorming together, during which he provided invaluable feedback on even the tiniest details of my re- search proposals, paper drafts, and presentation slides. Even more importantly, I learned from him the importance of having an unwavering perseverance in the face of temporary setbacks, and having strong conviction in one’s beliefs and values in life. Thank you Rachit, for everything! I am also grateful to the other members of my Ph.D. committee—David Shmoys, Nate Foster, and Amin Vahdat—for their insightful feedback on my research over the years. David and Amin have been my collaborators since I started at Cornell, and I am thankful for their constant guidance and support throughout my Ph.D. journey. I would like to extend my gratitude to all of my other collaborators and mentors—Sylvia Ratnasamy, Arvind Krishnamurthy, Hitesh Ballani, Vishal Shrivastav, and Kevin Tang. All of them played a signifi- cant role in my growth as a researcher, and I appreciate the time they invested in helping me succeed. During the summers of 2020 and 2021, I had the opportunity to intern with Google’s Congestion Control team. I would like to thank Nandita Dukkipati, v Behnam Montazeri, Gautam Kumar, Masoud Moshref, Luigi Rizzo, and every- one else in the team, for providing a great learning experience. Google went above and beyond to make my internship an enjoyable experience, even during the challenging times of the pandemic. I am also grateful to them for supporting my Ph.D. through a Google PhD Fellowship. I am immensely thankful to my labmates, Qizhe Cai and Midhul Vuppalap- ati. Our daily research discussions in the lab were instrumental to my learning experience at Cornell. Without these conversations, I might still be grappling with numerous bugs buried deep within the kernel or obscure register counter values hidden in processor manuals. We shared not only our successes but also the many (and often stressful) moments of paper rejections and all-nighters be- fore deadlines. I would also like to express my gratitude to an amazing group of junior Ph.D., master’s, and undergraduate students: Shreyas Kharbanda, Shouxu Lin, Katie Gioioso, Benny Rubin, Pai Li, and Claire Wang. They not only helped me grow as a mentor but also taught me a great deal along the way. I would like to express my gratitude to the incredible professors at Cornell, especially Robbert Van Renesse, Lorenzo Alvisi, Fred Schneider, Robert Klein- berg, Ken Birman, and Chris De Sa for taking time from their busy schedules to provide feedback on multiple practice talks for conference presentations and job talks. The staff at Cornell’s Computer Science department was also amazingly helpful. Becky always magically resolved all our administrative problems, and Corey was a huge help in organizing our group meetings, lunches, and confer- ence registrations. I also thank my first-year office mates, who will always hold a special place in my heart: Drishti, Eric, Cheng, Claire, Dietrich, Deepak, and Andrew. The folks made my first year at Cornell, far away from home, a really enjoyable vi one. My gratitude also goes to my Networking Lab mates—Vishal, Praveen, and Hardik—for countless discussions on datacenters, programmable switches, and graduate student life. Many thanks to my Systems Lab mates—Yunhao, Ali, Florian, Sagar, and Alicia—for random technical discussions, watercooler chats, and practice talks. Special thanks to Shijin, my first student collaborator at Cornell, who taught me a lot not only CS theory, and also introduced me to the life at Cornell and in Ithaca. The Ph.D. journey can be tough at times, and it would have been impossible to navigate it without the support of my friends—my family away from home. I would like to thank Ashudeep Singh, Utkarsh Mall, Jashan Singhal, Prateek Sehgal, Drishti Wali, Arzoo Katiyar, Arpita Sharma, and Meenakshi Khosla for the Friday FIFA nights, squash, tennis, and badminton games, road trips, count- less dinners, and the weekly check-in sessions over Zoom during the pandemic. My gratitude extends to my IITK family—Ashwin Verma, Sourya Basu, Ayush Sinha, Kapil Sethi, Rewant Teotia, Shubham Agarwal, and Shubham Jain—for staying in touch and offering me places to stay while I roamed across the globe. Most importantly, I thank my family—my parents, Shweta and Manish Agarwal; my sister, Dr. Samiksha Agarwal; and my grandparents, Sita and Ram Raghubir Agarwal—for their upbringing, countless sacrifices, and unwavering support, no matter where I am. I also extend my thanks to my aunt and uncle, Shivani and Chetan Jain, for always welcoming me into their home in Colorado, especially during the pandemic. I would also like to thank Cornell University and Cornell Health for bril- liantly managing the spread of COVID-19 on campus, helping my family feel less stressed about my safety while I was thousands of miles away from home for several years. vii TABLE OF CONTENTS Biographical Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x 1 Introduction 1 1.1 The Host Network . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Emergence of Host Congestion . . . . . . . . . . . . . . . . . . . . 3 1.3 The Root Cause of Host Congestion . . . . . . . . . . . . . . . . . 6 1.4 Impact of Host Congestion . . . . . . . . . . . . . . . . . . . . . . . 7 1.5 Implications to Design of Network Protocols & OS . . . . . . . . . 8 1.6 Thesis and Key Contributions . . . . . . . . . . . . . . . . . . . . . 11 1.7 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2 Understanding the Impact of Host Congestion 14 2.1 Host Network Datapath . . . . . . . . . . . . . . . . . . . . . . . . 14 2.1.1 Network (NIC-to-CPU/memory) Datapath . . . . . . . . . 14 2.1.2 Storage (SSD-to-CPU/memory) Datapath . . . . . . . . . . 17 2.1.3 Datapath with Memory Protection Enabled . . . . . . . . . 19 2.2 Host Congestion Impact on Networking Applications . . . . . . . 22 2.2.1 Impact of Memory Interconnect Contention . . . . . . . . . 23 2.2.2 Impact of IOMMU Contention . . . . . . . . . . . . . . . . 28 2.3 Host Congestion Impact on Storage Applications . . . . . . . . . . 35 2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3 Rearchitecting Congestion Control 41 3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.2 hostCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.2.1 Host congestion signals . . . . . . . . . . . . . . . . . . . . 44 3.2.2 Host-local congestion response at sub-RTT granularity . . 46 3.2.3 Network resource allocation at RTT granularity . . . . . . 50 3.3 hostCC Implementation . . . . . . . . . . . . . . . . . . . . . . . . 51 3.3.1 Host congestion signals . . . . . . . . . . . . . . . . . . . . 52 3.3.2 Host-local congestion response . . . . . . . . . . . . . . . . 54 3.3.3 Network resource allocation . . . . . . . . . . . . . . . . . 56 3.4 hostCC Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.4.1 hostCC benefits . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.4.2 hostCC results with DDIO enabled . . . . . . . . . . . . . . 63 3.4.3 hostCC sensitivity analysis . . . . . . . . . . . . . . . . . . 65 3.4.4 Deep dive into hostCC performance . . . . . . . . . . . . . 65 3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 viii 4 Rethinking IO Memory Protection 71 4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.2 F&S . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.3 F&S Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.4 F&S Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.4.1 F&S Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.4.2 Necessity of each design idea in F&S . . . . . . . . . . . . . 87 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5 Rearchitecting Storage Stacks 90 5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.2 storageCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5.2.1 storageCC Overview . . . . . . . . . . . . . . . . . . . . . . 94 5.2.2 Detecting host network bottlenecks . . . . . . . . . . . . . 96 5.2.3 Converging to the right amount of parallelism . . . . . . . 99 5.3 storageCC Implementation . . . . . . . . . . . . . . . . . . . . . . 103 5.4 storageCC Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 108 5.4.1 storageCC benefits . . . . . . . . . . . . . . . . . . . . . . . 110 5.4.2 RocksDB with storageCC . . . . . . . . . . . . . . . . . . . 112 5.4.3 storageCC Overheads . . . . . . . . . . . . . . . . . . . . . 114 5.4.4 storageCC Sensitivity Analysis . . . . . . . . . . . . . . . . 114 5.4.5 Understanding storageCC microscopic behavior . . . . . . 116 5.4.6 Fair resource allocation using storageCC . . . . . . . . . . 118 5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 6 Related Work 120 6.1 Managing Host Resource Contention . . . . . . . . . . . . . . . . . 120 6.2 Efficient IO Memory Protection . . . . . . . . . . . . . . . . . . . . 121 6.3 Mitigating Bottlenecks in Storage Datapath . . . . . . . . . . . . . 123 7 Conclusion and Outlook 125 7.1 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 Bibliography 130 ix LIST OF FIGURES 1.1 End-to-end datapath of distributed applications involves not only transfers across the network fabric (left) comprising links and switches, but also across the host network (right), compris- ing processor interconnect, peripheral interconnect and memory interconnect. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Memory bandwidth is becoming increasingly oversubscribed due to rapidly approaching PCIe bandwidth. . . . . . . . . . . . 4 1.3 NIC-to-memory transfer rate matches the arrival rate at the NIC until the latency budget is met for these transfers. Host conges- tion, that is, the reduction in achievable NIC-to-memory transfer rate, is a result of overshooting of this latency budget. . . . . . . 6 1.4 Host congestion in a large-scale Google production cluster. The figure shows a scatter plot of link utilization (normalized by ac- cess link bandwidth) and packet drop rates at hosts. The clus- ter runs both the Linux kernel and SNAP [95] network stacks, with TCP and Swift [77] congestion control protocols, respec- tively. Precise values for packet drop rates are hidden due to these being sensitive. . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1 Illustration of host network datapath for the Intel architecture (AMD architecture is conceptually similar) between NIC and CPU/memory for the DDIO disabled case. . . . . . . . . . . . . . 15 2.2 Illustration of storage datapath between SSD and CPU/memory. 18 2.3 High-level DMA datapath with IOMMU enabled. . . . . . . . . . 20 2.4 (a) Host congestion leads to significant performance degradation in terms of packet drop rates and throughput (even with no net- work congestion). DDIO helps a little but observes similar per- formance degradation. (b) MApp is able to acquire a large fraction of memory bandwidth, leaving little room for NetApp-T. . . . . . 24 2.5 The impact of host congestion worsens with larger MTU size and number of flows, with 3× degree of host congestion. DDIO- enabled case, in particular, suffers more than DDIO-disabled case. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.6 Host congestion leads to orders of magnitude tail latency in- flation. The figure shows {P50, P90, P99, P99.9, P99.99} laten- cies (represented by whiskers) for NetApp-L, with and without host congestion, when NetApp-T, NetApp-L, and MApp are run to- gether with DDIO-disabled (we observed near-identical latency results with DDIO-enabled). . . . . . . . . . . . . . . . . . . . . . 27 x 2.7 Enabling memory protection results in >35% degradation in throughput. Increasing ring buffer sizes with IOMMU enabled leads to (a) decreasing throughput and (b) increasing packet drop rates. (c) IOTLB misses do not significantly change as the ring buffer size increases. (d) However, IOMMU L3 cache miss rates increase significantly, leading to a larger cost of IOTLB misses (e) Larger ring buffer sizes result in worse locality as the IOVA working set size grows. . . . . . . . . . . . . . . . . . . . . 29 2.8 With IOMMU enabled and an increasing number of flows, we observe (a) higher throughput degradation, (b) larger packet drop rates, (c) a larger number of IOTLB misses, (d) higher IOMMU cache misses, and (e) poorer locality in L3 page caches. 32 2.9 IOMMU contention leads to orders of magnitude tail latency in- flation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.10 Application performance with both IOMMU and memory band- width contention. Memory bandwidth contention causes further performance degradation due to the increased latency for each memory read. This worsens the cost of each page table walk de- spite similar miss rates for both the IOTLB and page table caches. 34 2.11 Existing storage stacks suffer from tail latency inflation due to memory interconnect bottlenecks. (a) The setup comprises L-apps, T-apps, and M-apps all co-located within a single NUMA node; (b) the T-apps set parallelism as the benchmarked knee point (circled in the green curve); in the presence of memory in- terconnect bottleneck, the T-apps continue to maintain the same amount of parallelism (cross marked on the red curve), incurring 7.23× higher tail latency than the ideal operating point for this scenario (circled in the red curve); (c-f) other IO schedulers sup- ported by Linux blk-mq, and blk-switch (a state-of-the-art stor- age stack) incur similar latency inflation and/or high through- put overheads. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.12 Higher inflation in tail latencies with memory interconnect bot- tlenecks in remote NUMA node. (a) Setup uses applications run- ning on CPU cores in NUMA node remote from SSDs; (b) L-apps incur 17× tail latency inflation. . . . . . . . . . . . . . . . . . . . . 37 2.13 Throughput starvation problem when T-apps use multiple NUMA nodes. (a) a setup, where the culprit application suf- fers from memory, interconnect bottleneck, but not the victim; (b) the culprit must necessarily suffer from throughput degrada- tion; however, it also results in unnecessary degradation in the victim application throughput. . . . . . . . . . . . . . . . . . . . 39 3.1 hostCC architecture overview. Discussion in §3.2. . . . . . . . . . 44 xi 3.2 hostCC generates host congestion signals that are not on the NIC-to-memory datapath; it can thus measure both IS (left) and BS (right) at sub-µs timescales independent of host congestion. . 51 3.3 Variation of IIO occupancy IS and PCIe bandwidth utiliza- tion BS with time for 1ms period for the default experimental setup without host congestion (left) and with 3× host congestion (right). In absence of host congestion, BS ≈ 103Gbps (line-rate bandwidth including PCIe overheads with 4K MTUs) and IS≈65, which corresponds to hardware-specific IIO-DRAM bandwidth- delay product (§3.2.1). During host congestion, IS increases to a maximum value of ∼93 (shown in red), resulting in a reduction of BS , queueing and packet drops at the NIC, and network CC reducing the rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.4 Evaluating MBA efficacy: NetApp-T is able to acquire more host resources with higher host-local response levels (that is, more backpressure on MApp). As a result, NetApp-T gets higher throughput (left) and a larger fraction of memory bandwidth (right). Discussion in §3.3.2 and §3.4. . . . . . . . . . . . . . . . . 55 3.5 (a) hostCC allows network traffic to achieve its target network bandwidth of BT = 80Gbps, while simultaneously reducing packet drop rates by orders of magnitude, even with a high de- gree of host congestion. (b) with hostCC, MApps no longer ac- quire a large fraction of memory bandwidth, even with high de- gree of host congestion. . . . . . . . . . . . . . . . . . . . . . . . . 58 3.6 Even with high degree of host congestion (3× in this experiment), hostCC consistently maintains its benefits across (a) MTU sizes and (b) number of flows. . . . . . . . . . . . . . . . . . . . . . . . 60 3.7 Even with a high degree of host congestion (3× in this experi- ment), hostCC incurs minimal latency inflation compared to no host congestion, significantly improving tail latency across all RPC sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.8 (a) In the presence of network congestion and absence of host congestion, hostCC achieves performance similar to network CC indicating minimal overheads; (b) in the presence of both host and network congestion, hostCC consistently provides benefits similar to the case with only host congestion. . . . . . . . . . . . . 61 3.9 hostCC, with DDIO enabled, provides benefits similar to the DDIO disabled case. . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.10 hostCC, with DDIO enabled, provides benefits in terms of la- tency improvements similar to the DDIO disabled case. . . . . . 64 3.11 hostCC consistently maintains its benefits across varying net- work target bandwidth BT . Discussion in §3.4.3. . . . . . . . . . . 64 xii 3.12 (a) Increasing IT values lead to increasing drop rates due less ag- gressive hostCC reaction to congestion. (b) MApp acquires larger memory bandwidth with larger IT due to less aggressive back- pressure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 3.13 Each of the three key technical ideas in the hostCC architecture— generating host congestion signals at sub-µs granularity, sub- RTT host-local congestion response, and network resource al- location based on both host and network congestion signals— contribute to hostCC’s performance. Discussion in §3.4.4. . . . . 67 3.14 Each of the three key technical ideas in the hostCC architecture— generating host congestion signals at sub-µs granularity, sub- RTT host-local congestion response, and network resource al- location based on both host and network congestion signals— contribute to hostCC’s performance. Discussion in §3.4.4. . . . . 68 3.15 In steady-state, hostCC keeps PCIe bandwidth close to the target network bandwidth, while ensuring that IIO occupancy remains smaller than the congestion threshold. Discussion in §3.4.4. . . . 69 4.1 F&S achieves contiguous IOVA mappings without requiring changes to IOMMU hardware, IOVA allocator, and IOMMU driver. (a) This process repeats for each page in the Rx descrip- tor, requiring calls to both the IOVA allocator and IO page table map. (b) A single 256KB IOVA is allocated and then mapped in page-size chunks to the physical pages in the Rx descriptor. F&S requires only one call to the IOVA allocator and the same number of calls to the IO page table map. . . . . . . . . . . . . . . 80 4.2 F&S eliminates memory protection overheads by reducing cost of each IOTLB miss. (a) F&S enables IOMMU on scenario to match the performance of IOMMU off, even as the number of flows increases. (b) By addressing the IOMMU contention, F&S significantly lowers packet drop rates caused by queueing at the NIC buffers. (c) F&S reduces the number of IOTLB misses per descriptor up to 2x by minimizing packet drops, hence de- creasing the number of transmitted ACKs. (d) F&S eliminates L1/L2 misses and substantially reduces L3 misses by preserving IO page table caches during invalidation and allocating contigu- ous IOVAs. (e) F&S enhances the locality and thus reduces the number of L3 cache misses. See §4.4.1 for more discussion. . . . . 82 xiii 4.3 F&S maintains good locality, resulting in low L3 cache misses even as the IO working set sizes increase. With IOMMU enabled and increasing ring buffer sizes, we observe that F&S achieves several key performance metrics: (a) it matches the performance of systems with IOMMU disabled; (b) experiences no packet drops; (c) maintains a consistent number of IOTLB misses; (d) incurs low IOMMU cache misses; and (e) demonstrates good lo- cality in L3 page caches. See §4.4.1 for more discussion. . . . . . . 83 4.4 F&S enables IOMMU on to match the performance of IOMMU off with varied number of cores and MTU sizes. See §4.4.1 for more discussion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.5 F&S resolves IOMMU contention with DDIO on for varying number of flows. Moreover, by improving CPU utilization, F&S is able to saturate the link bandwidth. See §4.4.1 for more dis- cussion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.6 The CPU breakdown results indicate that data copy overheads increase with the number of ring buffer sizes due to ineffective hardware prefetching. Results shown for (a) hardware prefetch- ing enabled, and (b) hardware prefetching disabled respectively. 85 4.7 F&S brings DCTCP latency closer to IOMMU off case. See §4.4.1 for more discussion. . . . . . . . . . . . . . . . . . . . . . . . . . . 86 4.8 hostCC and F&S work together to resolve both memory band- width and IOMMU contention. See §4.4.1 for more discussion. . 86 4.9 Contribution of each of the F&S key design ideas. As ring buffer size increases we observe: (a) the throughput drops for the case of “default + preserve” (b) as ring buffer size increases, despite preserving the L3 cache, we still see misses due to the larger IO working set size. See §4.4.2 for more discussion. . . . . . . . . . . 88 5.1 Datapath of an IO request within the Linux storage stack. Blue boxes indicate points within the datapath where storageCC logic is implemented. (§5.3) . . . . . . . . . . . . . . . . . . . . . . . . . 103 5.2 Effectively detecting storage datapath contention using stor- ageCC: when there is no contention, IIO occupancy is 70% of the maximum value, and reaches close to maximum in the presence of memory bandwidth bottlenecks. Memory controller queue occupancy (figure shows fraction of time WPQ occupancy is full) also increases only when memory bandwidth is contented. . . . 106 5.3 IIO buffer and memory controller queue occupancies for the re- mote NUMA setup. (Left column) Even without memory band- width contention, the IIO buffer is filled up. This is due to higher access latency from IIO to memory in remote NUMA node. (Right column) IIO buffer and WPQ occupancies are as expected under memory bandwidth contention. . . . . . . . . . . 109 xiv 5.4 storageCC allows storage stack to converge to amount of paral- lelism which is close to the knee point, independent of degree of contention and application NUMA-affinity. storageCC pro- vides close to optimal latencies for L-App (upto ∼17× tail latency improvement, even assuming the default IO depth = 64 – the minimum needed to sustain optimal throughput under no con- gestion) and maximum achievable throughput for T-app. Dis- cussion in §5.4.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 5.5 storageCC helps alleviate the throughput starvation prob- lem: improving victim application throughput, while incurring smaller cost in culprit application throughput, leading to ∼55% gains in victim throughput and ∼22% in overall throughput. Dis- cussion in §5.4.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 5.6 Evaluation using RocksDB: storageCC provides similar im- provement in tail latency as earlier results. . . . . . . . . . . . . . 113 5.7 storageCC observe negligible overheads (<5%) without the pres- ence of any storage datapath congestion. Discussion in §5.4.3. . . 113 5.8 storageCC converges to right amount of parallelism indepen- dent of IO size of the applications. Discussion in §5.4.4. . . . . . . 115 5.9 storageCC converges to the right amount of parallelism, even with increasing load from competing L-apps. Discussion in §5.4.4. 115 5.10 DDIO enabled results: (a) scenario where DDIO can be benefi- cial, storageCC provides even better tail latencies with close to maximum achievable throughputs, (b) DDIO is not applicable for remote NUMA scenario, storageCC provides near identical results as baseline results. . . . . . . . . . . . . . . . . . . . . . . . 116 5.11 storageCC microscopic behavior . . . . . . . . . . . . . . . . . . . 117 5.12 Fair resource allocation by storageCC under host congestion and varying number of application threads. Discussion in §5.4.6. . . . 118 xv CHAPTER 1 INTRODUCTION We begin the chapter by providing a brief background on the host network and its characteristics (§1.1), followed by a discussion on the technology trends leading to the emergence of host congestion—i.e., congestion within the host network (§1.2). We then discuss the root cause behind the host congestion phe- nomenon (§1.3), and provide evidence of host congestion in large-scale produc- tion clusters (§1.4). We next discuss the implications of host congestion to design of existing network protocols and OS (§1.5). Finally, we outline the key contri- butions of this dissertation, with regards to rearchitecting the network protocols and OS to help resolve the host congestion problem (§1.6). 1.1 The Host Network Most distributed applications in datacenters—such as data analytics en- gines [26, 117, 125], machine learning training frameworks [106, 58, 18], key- value stores [43, 33, 19, 127], and graph processing databases [98, 42, 66], etc.— perform high-speed data transfers between multiple hosts, across the datacen- ter network (Figure 1.1 shows an example network topology used in datacen- ters). The end-to-end datapath of these applications also involves data transfers within the hosts, particularly between the CPU cores, memory, and the periph- eral devices (like network interface cards or NICs [97], storage devices [84], and accelerators [39, 16]). Such “intra-host transfers” are enabled by a network that every host has—the host network—comprising processor, memory, and periph- eral interconnects. 1 Peripheral Device (eg., Network Interface Card) Root Complex Processor Interconnect DRAM Peripheral Interconnect (PCIe Lanes) Datacenter Network Topology Switches Links Hosts Host Network Architecture Memory Interconnect (DDR Channels) Figure 1.1: End-to-end datapath of distributed applications involves not only transfers across the network fabric (left) comprising links and switches, but also across the host network (right), comprising processor interconnect, peripheral interconnect and memory interconnect. Figure 1.1 provides an overview of the host network architecture in mod- ern servers. The processor interconnect is used by CPU cores to exchange data amongst themselves (between L1d, L2, and Last Level Caches) and to send read/write requests to the DRAM. The peripheral interconnect, such as PCIe [12], is used by peripheral devices to send read/write requests to CPU cores and DRAM. The memory interconnect intercepts read/write requests from CPU cores and peripheral devices to execute them on DRAM. Each of these interconnects uses independent, specialized, hardware-based protocols for data transfers [104, 41, 35]; these different protocols interpolate with each other to enable the intra-host transfers across the host network. Traditionally, the host network has provided many desirable properties: minuscule probability of failures and packet corruption, losslessness guaran- tees [28] (that is, no data will be dropped within the host network), and nanosecond-scale latencies [104, 115]. Further, the host network also provides ample bandwidth for intra-host transfers. For instance, modern servers have 2 higher peripheral interconnect bandwidth than the peripherals devices’ band- width (e.g., 128Gbps and 256Gbps PCIe bandwidths for 100Gbps and 200Gbps NICs, respectively). The memory interconnect bandwidth and processor inter- connect bandwidth are often even larger than the peripheral interconnect band- width [20, 107]. Distributed applications typically incur latencies of the order of network round-trip-times (RTTs), which usually span a few tens-to-hundreds of mi- croseconds in modern datacenter networks [77, 88, 139]; and the NIC band- widths limit their throughput. The host network, on the other hand, provides nanosecond-scale latencies and higher bandwidth than NICs. Unsurprisingly, the conventional wisdom in the systems and networking communities has been that the host network is an “ideal network” and has minimal-to-no impact on the end-to-end performance or correctness properties of distributed appli- cations. Thus, performance degradation for these applications is largely at- tributed to contention for resources within the datacenter network fabric (that is, contention for available link bandwidths and switch buffer capacities), a phe- nomenon referred to as network congestion [77, 88, 139]. 1.2 Emergence of Host Congestion This dissertation establishes that we need to revisit the above conventional wis- dom, in particular, due to the emerging technology trends: rapid growth in network link and peripheral device bandwidths, combined with relatively stag- nant trends for other host resources (e.g., memory bandwidth, memory access latency, cache sizes per core, processor interconnect bandwidth, etc.) [129, 35, 82] 3 Cascadelake (2019) Icelake (2021) Sapphire Rapids (2023) Emerald Rapids (2024) Granite Rapids (2025) Diamond Rapids (2026) Processor Architcture (Year) 0 1 2 3 4 5 6 7 8 M em or y B an dw id th O ve rs ub sc ri pt io n CPU Traffic IO Traffic CPU+IO Traffic Figure 1.2: Memory bandwidth is becoming increasingly oversubscribed due to rapidly approaching PCIe bandwidth. have led to contention for resources within the host network, a phenomenon we refer to as “host congestion”. Host congestion causes inefficiencies in peripheral device-to-memory transfers within the host network. This means that the rate at which data (that is destined to host memory) is generated at peripheral devices can exceed the rate at which the data can actually be transferred to memory over the host network. For example, when NICs serve as peripheral devices, host congestion leads to scenarios where the arrival rate of data at the NICs from the network fabric can be higher than the rate at which NICs transfer the data to memory. Consequently, this results in queueing at the NIC buffers, leading to eventual data drops and significant degradation in application performance and isolation guarantees (more details in §1.4). Host congestion can result from contention for multiple possible resources within the host network. For instance, contention can occur at the memory in- terconnect, which is shared by data traffic originating from the peripheral de- vices (IO traffic) and the CPU cores (compute traffic), destined to read/write from/into the memory. Even though the memory bandwidth is typically larger 4 than peripheral device bandwidths, when IO and compute traffic are co-located, the memory interconnect can observe contention. The memory interconnect has historically been oversubscribed by the compute traffic, that is, the aggregate memory bandwidth demand by all CPU cores exceeds the theoretical maximum bandwidth for the memory interconnect. For instance, the degree of oversub- scription (the ratio of total memory bandwidth demand by cores and the theo- retically maximum memory bandwidth) has remained stagnant over the years; however, with PCIe bandwidth increasing rapidly over time, when account- ing for load offered by both CPU cores and peripheral devices, memory inter- connect is increasingly more oversubscribed. As shown in Figure 1.2, the total memory bandwidth oversubscription in recent Intel architectures increased by 9.2% from 2019 (Cascadelake) to 2021 (Icelake), and an additional 3.4% in 2023 (Sapphire Rapids); the overall oversubscription is also projected to incur even larger increase over time [130]. Contention within the host network can happen even in the absence of compute traffic—for example, due to the desire for memory protection against buggy/malicious peripheral devices in modern cloud deployments. These memory protection mechanisms ensure that peripheral devices execute mem- ory transfers using virtual memory addresses. An IOMMU—a special-purpose hardware within the host network—translates these virtual addresses to phys- ical addresses before executing memory reads/writes [22]. These translations are done using an address translation table residing in memory and are sped up using a special cache within the IOMMU (referred to as the IOTLB) [17]. To ex- ploit peripheral interconnect bandwidth effectively, peripheral devices can ini- tiate multiple simultaneous device-to-memory transfers, leading to contention at IOTLB. Like other host resources, IOTLB sizes have also remained largely 5 Figure 1.3: NIC-to-memory transfer rate matches the arrival rate at the NIC until the latency budget is met for these transfers. Host congestion, that is, the reduction in achievable NIC-to-memory transfer rate, is a result of overshooting of this latency budget. stagnant compared to the peripheral interconnect bandwidth [104, 108, 91], sig- nificantly increasing the contention over time. 1.3 The Root Cause of Host Congestion We now provide some more insight into how contention for resources like mem- ory interconnect or IOMMU caches within the host network can lead to ineffi- ciencies in peripheral device-to-memory transfers. To simplify the discussion, the rest of this section assumes NIC as an example peripheral device, but similar arguments also hold for any other peripheral device. For any NIC receiving data from the network fabric, to ensure that the NIC- to-memory transfer rate matches the data arrival rate at the NIC, there is an allowed “latency budget” for individual NIC-to-memory transfers. This latency budget (ℓ) depends upon the NIC bandwidth (B) and the inherent parallelism (P) supported by the host network hardware. Using Little’s law [89], we get 6 ℓ ∝ P/B: larger bandwidth implies data arrives faster at the NIC, resulting in a smaller latency budget to transfer data to memory. Similarly, larger hardware parallelism means the ability to transfer a larger amount of data simultaneously from NIC to memory, increasing the latency budget. When the host network can meet the budget, it allows the NIC-to-memory transfer rate to match the arrival rate at the NIC. However, contention at the memory interconnect results in inflation of memory access latency, and con- tention at IOTLB results in increased address translation latency. This, in turn, could lead to overshooting the latency budget; when this happens, NIC-to- memory transfers cannot keep up with the arrival rate, resulting in NIC buffer build-up and drops. For instance, for our default setup in Chapter 3, we observe that without host congestion the latency between the PCIe root complex (shown in Figure 1.1) is ∼330 ns when the NIC-to-memory transfer rate matches the NIC bandwidth of 100 Gbps. To sustain this transfer rate, the hardware employs a parallelism of ∼65 cachelines between the root complex and the memory con- troller. However, the maximum available hardware parallelism on our setup is ∼93 cachelines; hence an inflation in latency ∼150 ns is enough for the NIC-to- memory transfer rate to not be able to keep up with the arrival rate. Therefore, at its core, host congestion is caused by latency inflation (even at a scale of a few hundred nanoseconds) within the host network. 1.4 Impact of Host Congestion We now provide evidence of host congestion in production clusters: even with state-of-the-art protocols, software, and hardware infrastructure, large-scale 7 production clusters observe significant application-level violations of isolation and service-level agreement (throughput and tail latency) guarantees due to host congestion. Figure 1.4 shows a scatter plot of network link bandwidth uti- lization on the x-axis and the measured data drop rates at the hosts on the y-axis for a production cluster at Google [6]. The colors denote the density of sample points, with purple colors denoting a larger sample fraction. This figure shows that, even with Google’s state-of-the-art congestion control protocol (Swift [77]) and userspace network stack (SNAP [95]), hosts in Google production clusters observe high data drop rates—even when network link bandwidth is far from saturation. These drops resulted in throughput degradation, multiple orders of magnitude tail latency inflation, and violation guarantees for many applications running on this production cluster. We also reproduce the host congestion phenomenon observed in Google’s production cluster, using open-sourced Linux network stack and congestion control protocol (DCTCP [14]), and characterize the impact on application per- formance. We observe that host congestion can lead to as much as 1% packet drops at the host, 35–55% throughput degradation, and 120–5000× tail latency inflation [9]. We present these results in Chapter 2. 1.5 Implications to Design of Network Protocols & OS Existing network protocols and various OS components were designed with several implicit assumptions that start to break under host congestion. For instance, consider the case of network protocols such as congestion con- trol protocols (e.g., TCP, DCTCP, Swift, etc.) that manage data transfers in dis- 8 0.2 0.4 0.6 0.8 Host Access Link Bandwidth Utilization H o s t D ro p R a te N o rm a li z e d F ra c ti o n (A m o n g a ll d a ta p o in ts o n p lo t) 0.0 1.0 0 1 Figure 1.4: Host congestion in a large-scale Google production cluster. The figure shows a scatter plot of link utilization (normalized by access link band- width) and packet drop rates at hosts. The cluster runs both the Linux kernel and SNAP [95] network stacks, with TCP and Swift [77] congestion control pro- tocols, respectively. Precise values for packet drop rates are hidden due to these being sensitive. tributed applications so as to minimize the occurrences of resource contention within the network fabric. Classical congestion control protocols assume an ideal host network and operate at network RTT granularity—that is, they de- tect and respond to any contention within the end-to-end network datapath in at least one network RTT, which is typically of the order of tens-to-hundreds of microseconds. Such long congestion detection and response times cannot cor- rectly handle host congestion: this is because, in the host congestion regime, traffic is generated much closer to the point of contention (e.g., due to applica- tions generating CPU-to-memory traffic, with CPU cores situated few hundreds of nanoseconds away from the memory controller) and thus the degree of host congestion can change dramatically at sub-RTT granularity. Such a mismatch in congestion granularity and congestion control response demonstrably leads to performance far from optimal. Thus, host congestion allows us to revisit fun- damental questions related to the design of congestion control architecture and protocols. 9 Another implicit assumption ingrained in existing OS, particularly within the IO memory protection mechanisms, is that the memory access latency is a tiny fraction of peripheral interconnect latency. As a result, operations such as mapping/unmapping virtual-to-physical memory addresses for DMA buffers (used when enabling memory protection against untrusted peripheral devices) have been designed fundamentally inefficiently—in that, they can lead to larger than necessary number of memory accesses for these operations, especially when the memory protection mechanisms intend to provide stronger safety guarantees. These additional memory accesses can lead to inflated latencies within the host network. The latency inflation leads to the host congestion phe- nomenon and similar application-level performance degradation, as discussed previously. We need to revisit the design of memory protection mechanisms such that they incur minimal overheads for the required safety guarantees. We also demonstrate that the observations made about host congestion resulting in performance degradation for network applications (throughput degradation and/or latency inflation) generalize to the context of storage appli- cations. For example, we show in Chapter 2 that under host congestion, storage applications can observe up to 2× throughput degradation and 17× tail latency inflation compared to an ideal operating point. We find that this is rooted in storage stacks assuming the host network always behaves in an ideal manner— to achieve high throughput, storage stacks typically keep a fixed number of re- quests in-flight, which implicitly assumes the bottlenecks are either at the CPUs or at the storage device. This assumption is no longer true under the host con- gestion regime, with bottlenecks shifting dynamically within the host network. Therefore, we also need to revisit the architecture of storage stacks in the regime of host congestion. 10 Put together, the emerging phenomenon of host congestion, led by current technology trends, points to the inevitability of a fundamental re-architecture of the existing host infrastructure. 1.6 Thesis and Key Contributions The central thesis in this dissertation is that we need across-the-stack solutions—including re-architecture of network protocols and several compo- nents of the OS—to address the impact of host congestion on applications. We next provide an overview of the core contributions of this dissertation towards this thesis: • Understanding the impact of host congestion. We perform a deep dive into the host network datapath in order to understand the root causes and impact of host congestion. We show how nanosecond-scale latency inflation due to contention at various points within the host network— specifically, the memory interconnect and the IOMMU—results in host congestion. We evaluate the impact of host congestion on application- level performance, reproducing the observations made in Google produc- tion clusters. We observe similar impact on application performance even when using an open-sourced Linux network stack and DCTCP congestion control protocol. We also demonstrate that the impact generalizes to the context of storage applications, using state-of-the-art storage stacks and IO schedulers. These results are presented in Chapter 2. • Rethinking congestion control protocols. We present hostCC [9], a new 11 architecture for congestion control protocols that handle host conges- tion along with network fabric congestion. The key idea in hostCC is a new “host-local” congestion response performed at a sub-RTT granular- ity. hostCC uses host-local congestion response to minimize queueing and data drops at the host: it modulates host resources allocated to the net- work traffic at sub-RTT granularity to ensure that NIC queues are drained at the same rate at which network traffic arrives at the NIC. hostCC in- tegrates the idea of host-local congestion response with new congestion signals collected at sub-microsecond timescales within the host hardware, allowing it to capture and react to host congestion at sub-microsecond timescales. We discuss hostCC in Chapter 3. • Rethinking IO memory protection. We present Fast & Safe (F&S), a sim- ple modification to existing memory protection mechanisms that enables achieving the strongest safety guarantees and near-optimal performance. The key insight in F&S design is that rather than focusing on achieving high IOTLB hit rates, we should focus on reducing the cost of each trans- lation upon an IOTLB miss. This insight was an outcome of building a deep understanding of the IOMMU hardware, that led to us discover- ing existence of IO page table caches (in addition to the IOTLB) within the IOMMU hardware; effectively utilizing these caches can help reduc- ing the address translation cost upon an IOTLB miss. F&S design re- quires no modifications in host hardware, minimal modifications within the operating system, and yet, near-completely alleviates all overheads of the strongest form of memory protection mechanism. We discuss F&S in Chapter 4. • Rethinking storage stacks. We present storageCC [8], a new architec- 12 ture for storage stacks that enables them to operate at an ideal amount of parallelism in the presence of host congestion. In particular, storageCC introduces the conceptual idea of congestion control (typically used in the computer networking literature) in storage stacks. Our key insight is that the end-to-end storage datapath (i.e., SSD-to-CPU/memory datap- ath) is conceptually a (lossless) computer network. storageCC detects host congestion by collecting host congestion signals within the hardware at sub-microsecond granularity, and adapts the number of in-flight requests based on host congestion. Similar to transport designs in computer net- works, storageCC ensures that the number of in-flight requests is the min- imum of a flow control window size (that accounts for bottlenecks at end- points of storage datapath, i.e., CPUs and/or SSDs) and the congestion control window size (that accounts for bottlenecks within the storage dat- apath, and is decided using storageCC’s storage congestion control mech- anism). We discuss storageCC in Chapter 5. 1.7 Acknowledgements This dissertation is an outcome of multiple collaborative efforts [6, 8, 9, 116, 128]. While the core of this dissertation is about host congestion control, I was also fortunate enough to work on several other collaborative projects during my Ph.D. that are not a part of this dissertation [7, 10, 120, 126]. 13 CHAPTER 2 UNDERSTANDING THE IMPACT OF HOST CONGESTION This chapter provides an overview of the host network datapath (§2.1). We then discuss the impact of host congestion caused by contention for host network resources such as the memory interconnect and IOMMU caches. We charac- terize application-level performance degradation in both networking (§2.2) and storage (§2.3) contexts, providing a deep dive into understanding the behav- ior of network transport protocols and storage stacks under the host congestion regime. 2.1 Host Network Datapath This section serves as a brief primer on the host network datapath. We begin by discussing the datapath in the context of networking applications (§2.1.1), where the NIC serves as the employed peripheral device. Subsequently, we offer relevant additional details about the datapath when utilizing SSDs as the peripheral device (in the context of storage applications, in §2.1.2), as well as when employing memory protection against untrusted peripheral devices (i.e., with IOMMU enabled in the host, in §2.1.3). 2.1.1 Network (NIC-to-CPU/memory) Datapath Figure 2.1 illustrates the host network datapath for the context of networking applications. We discuss below the life of a packet from when it arrives at the 14 Rx Descriptors NIC Buffer PCIe Credits PCIe Channel IIO Memory Controller NIC CPUs DRAM IIO Buffer Rd Q Wr Q 1 2 3 4 1 2 Figure 2.1: Illustration of host network datapath for the Intel architecture (AMD architecture is conceptually similar) between NIC and CPU/memory for the DDIO disabled case. NIC until it is transferred to CPU/memory. We primarily focus on the receiver side since host congestion is more prominent at the receiver [77, 6, 86]. The host network datapath is best described in two components—one between the NIC (one end of the PCIe interconnect) to the Integrated IO Controller (IIO, the other end of the PCIe interconnect), and the other between the IIO to memory. We discuss below the steps involved in each component. NIC-to-IIO datapath. NIC and IIO sit at the two ends of the PCIe interconnect. 1 Upon a packet arrival, the NIC enqueues the packet into its input buffer (typically in a small SRAM [70, 6]). 2 Next, the NIC fetches a descriptor that provides a host memory address for the NIC to Direct Memory Access (DMA) the packet: the NIC driver sets up a per-core ring buffer of Rx descriptors, each of which contains memory addresses to be used to DMA data (with memory protection enabled, each 15 descriptor is equipped with a virtual memory address instead of the physi- cal address; more details in §2.1.3). Modern NICs support each descriptor to have addresses worth one or more pages (e.g., default 64-page descriptors in Mellanox NICs [97]), allowing multiple packets to be DMA’d using the same descriptor. The NIC driver periodically replenishes these descriptors. 3 Importantly, PCIe is a lossless interconnect that uses a credit-based flow control mechanism implemented via a fixed (hardware-specific) number of credits [12]. When PCIe credits are available, the NIC instantiates a DMA request over the PCIe (executed using PCIe transactions); DMA’ing a single packet may require multiple PCIe credits [12, 6]. PCIe being a lossless in- terconnect, the packet can be safely removed from the NIC buffer as soon as DMA is initiated. If PCIe runs out of credits (we will discuss potential reasons below), DMAs cannot be initiated until credits are replenished. 4 The IIO intercepts each PCIe transaction and initiates writes to memory (dis- cussed below); importantly, a PCIe credit is replenished only when the IIO has successfully issued a write to the memory1. IIO to memory datapath. Modern hosts have support for direct cache access (e.g., using DDIO [59]) that allows NICs to DMA packets directly into the last- level cache (LLC). The precise IIO to memory datapath depends on whether DDIO is enabled or disabled. We first describe the datapath with DDIO dis- abled; we then discuss the case when DDIO is enabled. 1PCIe transactions are executed at the granularity typically 256 − 512 byte-sized PCIe Trans- action Layer Packets (TLPs); however, IIO to memory transactions are executed at the granular- ity of 64 byte-sized cache lines. Thus, each PCIe transaction requires multiple IIO-to-memory writes before completion. We will largely ignore this detail since it is not necessary to under- stand the phenomenon of host congestion. 16 1 Upon receiving a PCIe transaction, IIO enqueues the request in an IIO buffer. 2 IIO issues the requests from its buffer to the memory controller buffer, and the controller executes the final write to DRAM. Importantly, IIO to memory controller datapath also traverses a lossless interconnect that uses a credit- based flow control mechanism; the IIO can issue a write request to the mem- ory controller only if the write queue at the memory controller is not full. If this write queue is full (e.g., due to other memory requests from CPUs), the request remains enqueued in the IIO buffer. Once the write queue becomes free, IIO transfers the request to the memory controller write queue; this incurs a cache line worth of memory write bandwidth. If DDIO is enabled, IIO transfers the cache line to the LLC. This may require evicting an already existing cache line to the memory controller [31, 59]. There- fore, if DDIO is enabled and does not lead to evictions, it reduces the latency of an IIO write request since the speed-of-light delay from IIO to LLC is smaller than from IIO to DRAM. However, if it does lead to evictions, we are back to the DDIO disabled case: each eviction not only incurs a cache line worth of mem- ory write bandwidth but also higher latency since IIO to LLC write can only be executed after the eviction has been completed. 2.1.2 Storage (SSD-to-CPU/memory) Datapath Figure 2.2 illustrates the end-to-end storage datapath, that is, the life of a storage request from when it is issued by the application until the desired data is written to the memory. The datapath is best described in three components: between 17 Submission eues SSD PCIe Credits PCIe Channel IIO Memory Controller SSD Hardware & Driver CPUs DRAM IIO Buffer Rd Q Wr Q Figure 2.2: Illustration of storage datapath between SSD and CPU/memory. the application and SSD, between the SSD and the IIO, and between the IIO and host memory. We discuss the steps involved in the first two components; IIO to host memory datapath is identical to the networking context discussed previously. Application to SSD datapath. The storage application submits the IO read re- quest 2 to the Linux kernel block layer. The block layer forwards the request to the NVMe driver, which creates a descriptor for the SSD (the descriptor carries the destination memory address for DMA and uses a virtual memory address if memory protection is enabled using an IOMMU), enqueues the descriptor into one of the SSD’s submission queues and rings a doorbell. The SSD dequeues the request from the submission queue and processes it, which entails two steps: (1) fetching the data from its internal storage fabric and (2) transferring it to the host by issuing DMA transactions over the PCIe interconnect. 2The datapath for write requests essentially follows a reverse path. 18 SSD to IIO datapath. SSD and IIO sit at the two ends of the PCIe interconnect. Similar to the networking context, The IIO intercepts each PCIe transaction and is responsible for writing the data to host memory, and the PCIe credits are re- plenished only when the IIO has successfully issued the required write requests to the memory. 2.1.3 Datapath with Memory Protection Enabled We now discuss the additional steps taken within the end-to-end datapath when IOMMU is enabled to provide memory protection against buggy/untrusted pe- ripherals. We consider the networking context for brevity, with Mellanox NICs and Intel CPUs. The high-level datapath of other NICs [110], other kinds of pe- ripherals like SSDs, and other CPU architectures are similar to what is shown in Figure 2.3. The datapath involves three main steps: 1 IOVA allocation and mapping to physical page address. To provide mem- ory protection, NIC uses a virtual address (IO virtual address or IOVA) in the Rx ring buffer descriptor, associated with the corresponding application core, to DMA the packet. To prepare a single descriptor for each page, the NIC driver al- locates a physical page and passes it to the IOMMU driver. The IOMMU driver then allocates a page-sized IOVA from the IOVA allocator. An IOVA itself is simply a range of addresses, and the IOVA allocator keeps track of free address ranges. Allocated IOVA ranges are managed with a globally locked red-black tree, where each node in the tree contains a high and a low IOVA address [108]. In the worst case, operations on the red-black tree can incur linear time searches, leading to high CPU overheads [108]. To avoid such overheads, modern IOVA 19 Rx Descriptors NIC Buffer PCIe Channel IIO IOVA Allocator IOMMU IO Page Table L1 L2 L3 L4 IOMMU Invalidation eue IO Page Table Caches L1 L2 L3 IOTLB NIC CPUs DRAM 1 2 23 Figure 2.3: High-level DMA datapath with IOMMU enabled. allocators keep all operations constant time using a stack-based cache of IOVAs that in addition to the red-black tree. To avoid synchronization overheads of multiple cores accessing the IOVA allocator, the OS maintains a per-core IOVA cache, allowing IOVAs to recycle on a per-core basis. The IOMMU driver maps the allocated IOVA to the physical page in an IO page table and passes the IOVA to the NIC driver. The NIC driver inserts these IOVAs into the descriptor. 2 Translating IOVA to physical address. After DMA initiates and PCIe trans- actions subsequently reach the IIO, each transaction’s IOVA must be translated to a physical address. These translations are sped up using special-purpose hardware caches: first, the IOVA is looked up in the IOTLB, which stores map- pings directly from IOVAs to physical addresses. Upon a hit, the address is im- mediately translated. Otherwise, the IOMMU must perform an IO page table walk. The IO page table has four levels (L1, L2, L3, and L4), and each page has 512-page table entries; L1 entries map from the 9 MS bits of the IOVA to an L2 page, L2 entries map from the next 9 bits of the IOVA to an L3 page, etc. An L4 page contains direct mappings to physical addresses. While existing literature primarily assumes that IO page table walks are always needed to be performed, 20 we found that modern IOMMU hardware has caches for the IO page table that can speed up translation [60]. There is a cache for the first three levels of the page table (L1, L2, and L3); each entry in the cache points from an IOVA to the address of the corresponding L2/L3/L4 page in the IO page table. At worst, a page table walk incurs four memory reads (one for each level of the IO page table) in the case of all misses, and at best, one unavoidable memory read from an L4 page, which contains the requisite physical address. It may take multiple address translations to complete the DMA of a single packet. 3 IOVA unmapping and deallocation. After DMA completes, to ensure that the NIC can no longer access the DMAed packets’ buffers, for each page in the descriptor, the driver unmaps the IOVA-to-page mapping in the IO page table, including invalidating the corresponding entries in the IOMMU and frees the IOVA back to the IOVA allocator. Modern operating systems provide two secu- rity models for invalidations: strict and deferred. In strict mode, invalidation happens at per-descriptor granularity; for each IOVA in the descriptor, corre- sponding entries in the IOTLB and L1, L2, and L3 page table cache are invali- dated. 3 In deferred mode, invalidation happens at multi-descriptor granularity - every 4 descriptors by default in Linux - leaving stale entries in the IOTLB that can be used by the device. 3This is the reason for an unavoidable IOTLB miss per (huge)page worth of data: since IO- VAs are unmapped and corresponding IOTLB entries are immediately invalidated after use, every new page that is DMA’d will result in an IOTLB miss; thus we expect to see at least one IOTLB miss per page. In reality, misses are slightly higher due to PCIe transactions for accessing descriptors, etc. [6]. 21 2.2 Host Congestion Impact on Networking Applications As discussed in Chapter 1, host congestion may occur due to one or more bot- tlenecks along the datapath, e.g., memory interconnect [6, 86] or the IOMMU caches [12, 22, 108, 31, 6]. These bottlenecks lead to inflation in latency for IIO- to-memory requests due to inflation in memory access latency (under memory interconnect contention) or higher address translation latency at the IIO (under IOMMU cache contention). We observe a domino effect due to latency inflation for IIO-to-memory requests: as IIO-to-memory latency increases, more requests get queued at the IIO buffer, PCIe credit replenishing incurs larger delays, and PCIe may run out of credits. As a result, PCIe bandwidth remains underuti- lized, resulting in packet queueing and eventual drops at the NIC buffer for networking applications. Existing network protocols and stacks are not designed to handle host congestion—they primarily target either network fabric congestion (e.g., [99, 14, 88, 30]), or compute bottlenecks due to inefficient software (e.g., [31, 47, 77, 32, 95]). Indeed, the Google study [6] demonstrates that, even with state-of-the-art congestion control protocol Swift [77] and userspace network stack SNAP [95], their production clusters suffer from host congestion resulting in significant queueing and packet drops at the host, throughput degradation, and tail la- tency inflation. In this section, we reproduce the host congestion phenomenon from [6] using Linux DCTCP, and provide insights on root causes for queueing, packet drops, and performance degradation in the host congestion, caused by bottlenecks at the memory interconnect (2.2.1), and the IOMMU ( 2.2.2). 22 2.2.1 Impact of Memory Interconnect Contention Setup. To build an understanding of performance degradation in the host con- gestion regime, we use a setup with two servers connected via a single switch. We use 4-socket (NUMA node) servers with 8 Intel Cascade lake CPU cores per socket, 100Gbps Mellanox CX5 NIC connected to one of sockets over 128Gbps PCIe 3.0, and DDR4 DIMMs connected to 2 memory channels (with a total max- imum theoretical capacity of 375Gbps = 46.9GBps). In terms of resource bal- ancing, our servers are state-of-the-art: the latest commercially available Intel servers have the same set of resources but scaled by a factor (that is, a server with PCIe 4.0 with 256 Gbps capacity would also have 4× more cores, 2× more PCIe bandwidth, 4× more memory bandwidth, etc.) [50, 6]. Our servers run Linux kernel 5.4.0 and, by default, use 4K MTUs (similar to [6]), and all avail- able optimizations (segmentation offloads like TSO and GRO, accelerated re- quest flow steering, etc.) that have been shown to achieve best-possible Linux performance [31]. We use three applications—(a) a NetApp-T that generates 4 long flows, each flow from one sender-side CPU core to one receiver-side CPU core on the NIC- local NUMA node (DCTCP needs a minimum of 4 cores to saturate 100Gbps NIC in an uncongested scenario); (b) a NetApp-L that generates latency-sensitive RPCs (of sizes varying from 128B to 32KB) between a sender-side and receiver- side CPU core on the NIC-local NUMA node; and (c) an MApp that generates CPU-to-memory traffic with 1 : 1 read-write ratio and sequential memory ac- cess pattern; we increase the offered load to the memory interconnect by the MApp from 1× to 3×, by increasing the number of MApp CPU cores, and therefore increasing the number of in-flight memory requests (in the absence of any other 23 0x 1x 2x 3x0 20 40 60 80 100 N et w or k T hr ou gh pu t (G bp s) 0x 1x 2x 3x Degree of Host Congestion 10−5 10−4 10−3 10−2 10−1 100 Pa ck et D ro p R at e (% ) DDIO Off DDIO On (a) Network throughput and packet drop rates 0x 1x 2x 3x Degree of Host Congestion 0.0 0.2 0.4 0.6 0.8 1.0 M em or y B an dw id th U til iz at io n DDIO Off, NetApp-T DDIO Off, MApp DDIO On, NetApp-T DDIO On, MApp (b) Memory bandwidth utilization Figure 2.4: (a) Host congestion leads to significant performance degradation in terms of packet drop rates and throughput (even with no network congestion). DDIO helps a little but observes similar performance degradation. (b) MApp is able to acquire a large fraction of memory bandwidth, leaving little room for NetApp-T. source of memory traffic, using 1× to 3× MApp cores results in a total observed memory bandwidth of 16.0GBps, 28.7GBps, and 34.8GBps, respectively). We use standard benchmark tools for these applications—iperf [63] for NetApp-T, netperf [68] for NetApp-L, and MLC [62] for MApp. [Figure 2.4] Host congestion leads to degraded network throughput (>35%), even when DDIO is enabled. Figure 2.4 shows throughput, packet drops, and memory bandwidth utilization with increasing degrees of host congestion, with and without DDIO. The case of 0× is easy: there is no host congestion, and thus, network traffic is able to saturate the access link bandwidth. Enabling DDIO reduces memory bandwidth utilization since the CPU is able to consume the data before it is evicted; nevertheless, memory bandwidth utilization is non- zero because evictions do happen due to cache pollution—since LLC is shared across all cores, one cannot guarantee a perfect cache hit rate [31]. The case of 1× is more interesting; let us start with the DDIO disabled case. Here, as shown 24 in the right figure, memory bandwidth utilization is now close to saturation4; thus, memory access latency (and thus, the number of CPU cycles per mem- ory access) starts to increase. As a result, network throughput is now compute bottlenecked (more precisely, by receiver-side buffers) and cannot saturate the access link bandwidth. DDIO shines in such a scenario: lower memory band- width utilization results in lower memory access latency and fewer CPU cycles per memory access, allowing network traffic with DDIO to continue to saturate the access link bandwidth. Memory bandwidth is now saturated for 2× and 3×, resulting in inflated memory access latency. We now see the domino effect discussed earlier: PCIe becomes underutilized, and packets get queued (and eventually dropped) at the NIC; even in this simple setup, we observe 0.3% drops. Interestingly, while DDIO can be slightly helpful in improving network throughput, it has negligi- ble impact on packet drop rate; this is because of the reasons discussed earlier— as shown in Figure 2.4(b), memory bandwidth utilization is similar for both DDIO enabled and disabled (that is, majority of cache lines are evicted from LLC before the CPU can consume them). At 2× and 3×, DCTCP is operating in the AIMD regime: senders keep on increasing sending rate, resulting in queue build-up and drops at the NIC when total sending rate exceeds the instanta- neous NIC-to-memory transfer rate; packet drops leads to rate reduction, fol- lowed by subsequent sawtooth behavior. Figure 2.4b shows that, as we increase the number of MApp cores, network traffic is allocated a decreasing fraction of memory bandwidth. More work is needed to understand the precise reasons; our evaluation suggests that mem- 4Memory bandwidth utilization at saturation depends upon DRAM hardware as well as application workload (read-write ratio, access patterns, etc.) [64], and is typically lower than the maximum theoretical bandwidth. 25 1500B 4000B 9000B0 20 40 60 80 100 N et w or k T hr ou gh pu t (G bp s) 1500B 4000B 9000B MTU Size 0.00 0.25 0.50 0.75 1.00 1.25 Pa ck et D ro p R at e (% ) DDIO Off DDIO On (a) Varying MTU Size 4 8 160 20 40 60 80 100 N et w or k T hr ou gh pu t (G bp s) 4 8 16 Number of Active Flows 0.0 0.5 1.0 1.5 2.0 Pa ck et D ro p R at e (% ) DDIO Off DDIO On (b) Varying Number of Flows Figure 2.5: The impact of host congestion worsens with larger MTU size and number of flows, with 3× degree of host congestion. DDIO-enabled case, in particular, suffers more than DDIO-disabled case. ory bandwidth allocation is essentially proportional to the load generated by individual entities (IIO or CPU); this implies that, as MApp cores increase, MApp generates increasingly larger load5 but the maximum number of requests issued by IIO remains the same (dependent on the PCIe credit limit). As a result, CPUs are able to quickly acquire a larger fraction of memory bandwidth, creating even more host contention for network traffic. [Figure 2.5] Impact of host congestion worsens with increasing MTU sizes and a larger number of flows. Larger MTU sizes and a large number of con- nections have been shown to improve network throughput when compute at the host is bottlenecked due to inefficient software [31, 6]. Figure 2.5 shows that, in the presence of host congestion, these optimizations can, in fact, hurt performance—we see a significant increase in packet drop rates for both DDIO- enabled and disabled cases; and, while we observe a slight improvement in throughput for the DDIO-disabled case (due to reduction in compute cycles re- quired to process each packet), we also observe significant throughput reduc- 5This is because each CPU core can maintain a hardware-specific maximum number of in- flight memory requests, equal to so-called Line Fill Buffer (LFB) size. On our servers, this limit is typically 10 − 12. 26 104 105 106 128B 512B 2KB 8KB 32KB RPC Size 50 100 150 200 L at en cy (µ s) DCTCP (No host congstion) DCTCP (With host congestion) Figure 2.6: Host congestion leads to orders of magnitude tail latency infla- tion. The figure shows {P50, P90, P99, P99.9, P99.99} latencies (represented by whiskers) for NetApp-L, with and without host congestion, when NetApp-T, NetApp-L, and MApp are run together with DDIO-disabled (we observed near- identical latency results with DDIO-enabled). tion for the DDIO enabled case. The trends for the DDIO disabled case are easy to understand: for AIMD- style congestion control protocols, it is known that packet drop rates increase with MTU size [96] and the number of flows [103, 11]. The trends for the DDIO- enabled case are related to the inefficacy of DDIO observed in [31]—with an increase in MTU size and/or number of flows, DDIO observes an increase in cache eviction rates (and thus, an increase in memory bandwidth utilization); as discussed in §2.1.1, each such eviction not only incurs a cache line worth of memory bandwidth as in DDIO-disabled case, but also higher latency than DDIO-disabled case since IIO-to-LLC write can only be executed after the evic- tion has completed. As a result, the DDIO-enabled case at 3× host congestion observes even higher packet drop rates than the DDIO-disabled case for large MTU sizes and large numbers of flows. [Figure 2.6] Host congestion can cause orders of magnitude tail latency infla- tion for latency-sensitive applications. We now run the three applications— NetApp-T, NetApp-L, and MApp—together, with 3× host congestion. Since NetApp-L introduces a tiny amount of additional traffic, NetApp-T and MApp 27 performance is similar to previous experiments; we thus focus on NetApp-L re- sults. We observe significant latency inflation for NetApp-L in the host conges- tion regime. This latency inflation is caused due to three reasons (a) queueing delay at the NIC buffer; (b) retransmission and timeout delays due to drops at the NIC buffer; and (c) larger CPU processing delays due to inflated in CPU cycles in memory accesses caused by host congestion. Building upon previous observations (close to 0.3% drops), P99 latency is dominated by (a) and (c), and P99.9 is dominated by (b)—for both DDIO-enabled and -disabled (that observe similar drop rates), P99 latency inflation is roughly 60 − 100µs which is close to the worst-case queueing delay at the NIC buffer; at P99.9, latency inflation is close to 200ms, which is the default Linux minimum retransmission timeout (RTO) value. Smaller RPCs suffer from higher tail latency inflation because any packet drop necessitates a timeout; for larger RPCs, the Linux Tail Loss Probe (TLP) [112] mechanism is effective (with a smaller timeout) when there is more than one in-flight packet. Isolating NIC buffers does not solve this problem: a smaller NIC buffer size would incur a larger number of drops, increasing (b); a larger NIC buffer size, on the other hand, would increase (a). 2.2.2 Impact of IOMMU Contention Setup. We use servers with the same hardware setup (identical NIC, PCIe, and memory bandwidths) as described in §2.2.1, with the addition of enabling the IOMMU to characterize the application-level impact of memory protection overheads. We employ Linux kernel 6.0.3, and by default, 4K MTUs (similar to the setup in [6]), along with all available TCP optimizations (TSO, GRO, and aRFS). The same set of networking applications—NetApp-T and NetApp-L—are 28 256 512 1024 2048 Ring Buffer Size 0 20 40 60 80 100 Ne tw ork T hro ug hp ut (G bp s) IOMMU Off IOMMU On (a) Network throughput 256 512 1024 2048 Ring Buffer 0.0 0.5 1.0 Pa ck et Dr op R ate (% ) IOMMU Off IOMMU On (b) Packet drop rates 256 512 1024 2048 Ring Buffer Size 0 1 2 IO TL B M iss es pe r P ag e (c) IOTLB miss rates 256 512 1024 2048 Ring Buffer Size 0.0 0.5 1.0 IO M M U Ca ch e M iss es pe r P ag e L1 L2 L3 ACKs 0.00 0.05 0.10 0.15 Nu mb er of AC Ks pe r P ag es- Siz ed Pa ck et (d) L1/L2/L3 cache miss rates 0 100 200 300 400 500 Sample Sequence of IOVA Accesses 0 100 200 300 400 500 Di st in ct L 3s A cc es se d Be fo re R ep ea tin g 2048rb 256rb (e) IOVA access pattern Figure 2.7: Enabling memory protection results in >35% degradation in throughput. Increasing ring buffer sizes with IOMMU enabled leads to (a) de- creasing throughput and (b) increasing packet drop rates. (c) IOTLB misses do not significantly change as the ring buffer size increases. (d) However, IOMMU L3 cache miss rates increase significantly, leading to a larger cost of IOTLB misses (e) Larger ring buffer sizes result in worse locality as the IOVA work- ing set size grows. used as in §2.2.1. By default, NetApp-T generates 5 long flows, as DCTCP re- quires a minimum of 5 cores in this setup to saturate a 100 Gbps NIC in an un- congested scenario. We also use the same memory-intensive application—MApp, as discussed earlier—to characterize the impact of contention observed at both the memory interconnect and the IOMMU. [Figure 2.7] Enabling memory protection leads to degraded network through- 29 put (>35%) even without memory interconnect contention. Figure 2.7 shows application level throughput for increasing ring buffer size, with all other pa- rameters fixed to default, but with IOMMU disabled and enabled. With the IOMMU disabled, the application saturates the 100Gbps link bandwidth. How- ever, with IOMMU enabled, throughput drops up to 35% for ring buffer size = 2048. This throughput degradation is attributed to increasing IOTLB/L1/L2/L3 misses. Note that there is increasing throughput degradation as the ring buffer size increases despite no significant increase in the number of IOTLB misses. This is because the cost of each IOTLB miss gets worse due to the increasing rate of L3 misses. The L3 miss rate increase is caused by poor IOVA locality with increasing ring buffer sizes. As the ring buffer size increases, the IOVA working set per core exceeds the per-core cache size, requiring access to the global cache or red-black tree for IOVA allocation. Access to the global cache or red-black tree can cause subsequent IOVA allocations to belong to different L3 entries, as the global cache or red-black tree IOVAs are unlikely to be on the same L3 page as the per-core cache. Figure 2.7(e) shows that the locality of L3 entries degrades with increasing ring buffer size. The X axis represents each subsequent IOVA allocation and the Y axis represents the number of unique L3 page table entries accessed before that same L3 entry is accessed again. The red line thresholds represent potential L3 cache sizes of 64 and 128. A value above the threshold means it is more likely that an IOVA will experience an L3 miss when translated by the IOMMU. The figure shows how spatially and temporally located the IOVA accesses are within the IOMMU for varying ring buffer sizes. 30 Note that even for the case of 256 ring buffer size, with working set size po- tentially smaller than the per-core IOVA allocator cache size (as evident from the L3 access pattern in Figure 2.7(e)), it still has significant L3 misses, prohibit- ing the application from saturating the link bandwidth. We believe these misses result from the interference of IOVA allocation/deallocation for ACK packets (Tx datapath packets) with the payload packets (Rx datapath packets). Imagine a core allocating IOVAs for an Rx ring buffer from a perfectly contiguous cache. We would expect to see the same perfect locality within the Rx ring buffer. How- ever, if a free call from the Tx data path is interleaved with the alloc calls for the Rx ring buffer, then the Rx IOVAs will also be interleaved and no longer con- tiguous. Prior work has shown a similar phenomenon [92]. [Figure 2.8] Larger IOMMU cache miss rates lead to larger cost of IOTLB misses, and hence larger performance degradation. Figure 2.8 shows increas- ing L1, L2, L3, and IOTLB cache miss rates with increasing number of flows, subsequently leading to throughput degradation of up to 65% for the case of 40 flows. The significant drop in throughput shown is due to compounding factors: a larger number of IOTLB misses cause a larger number of page table walks, and the worse page table cache miss rates lead to higher latency per page table walk. In the evaluated setup, the L1 and L2 locality should ideally be perfect: us- ing 32-bit DMA, there is only a single entry (an L1 and L2 entry cover 239 and 230 bytes of address space, respectively). Thus, we would expect never to see L1/L2 misses as the working set easily fits inside a single entry of these caches. Yet we observe a rather surprising result that, in some cases (as shown in Fig- ure 2.8(d), for a larger number of flows), there are L1/L2 misses for almost every 31 5 10 20 40 Number of Active Flows 0 20 40 60 80 100 Ne tw ork T hro ug hp ut (G bp s) IOMMU Off IOMMU On (a) Network throughput 5 10 20 40 Number of Active Flows 0 2 4 Pa ck et Dr op R ate (% ) IOMMU Off IOMMU On (b) Packet drop rates 5 10 20 40 Number of Active Flows 0 1 2 IO TL B M iss es pe r P ag e (c) IOTLB misses rate 5 10 20 40 Number of Active Flows 0.0 0.5 1.0 IO M M U Ca ch e M iss es pe r P ag e L1 L2 L3 ACKs 0.00 0.05 0.10 0.15 Nu mb er of AC Ks pe r P ag es- Siz ed Pa ck et (d) L1/L2/L3 cache miss rate 0 100 200 300 400 500 Sample Sequence of IOVA Accesses 0 50 100 150 200 250 300 Di st in ct L 3s A cc es se d Be fo re R ep ea tin g 40 flows 5 flows (e) IOVA access pattern Figure 2.8: With IOMMU enabled and an increasing number of flows, we ob- serve (a) higher throughput degradation, (b) larger packet drop rates, (c) a larger number of IOTLB misses, (d) higher IOMMU cache misses, and (e) poorer lo- cality in L3 page caches. other DMAed packet! The reason behind these surprisingly large L1/L2 misses is the invalidations of previously DMAed IOVAs. While the lack of measure- ment infrastructure for visibility into invalidations limits our ability to identify root causes of invalidations, our findings suggest that interference of alloca- tion/deallocation for ACK IOVAs results in larger invalidations. Recall that once an ACK is sent, its IOVA is unmapped and invalidated immediately. This will cause concurrent Rx DMAs to suffer an additional miss for each L1/L2/L3 entry that the ACK and Rx IOVAs share. Figure 2.8 corroborates this hypothe- sis: as the number of flows and corresponding transmitted ACKs increase, the 32 104 105 106 128B 512B 2KB 8KB 32KB RPC Size 50 100 150 200 250 300 La ten cy ( s) IOMMU Off IOMMU On Figure 2.9: IOMMU contention leads to orders of magnitude tail latency infla- tion L1/L2 misses rise. The reason for the increase in ACKs is subtle: as shown earlier in this section, increasing the number of flows under the congestion scenario leads to increas- ing packet drop rates. It is a commonly known fact that an inflated drop rate can lead to out-of-order packets and an increase in ACKs [31]. An increasing number of ACKs also leads to increasing L3 misses (due to worse IOVA local- ity, as explained earlier) and IOTLB misses (due to a larger number of PCIe transactions per payload packet, including read transactions needed for send- ing ACKs). [Figure 2.9] IOMMU cache contention also leads to tail latency inflation. Fig- ure 2.9 shows the tail latency inflation by several orders of magnitude for L-apps with IOMMU enabled. The reasons are similar as discussed previously for la- tency inflation in Figure 2.6—due to queueing delays at the NIC buffer and retransmission delays due to subsequent packet drops at the NIC buffer. [Figure 2.10] Memory bandwidth contention, along with IOMMU cache con- tention, further degrades performance due to increased cost of IOTLB misses. Figure 2.10 demonstrates the effects of compounding memory bandwidth con- 33 0x 1x 2x 3x Degree of Memory Contention 0 20 40 60 80 100 Ne tw ork T hro ug hp ut (G bp s) IOMMU Off IOMMU On (a) Network throughput 0x 1x 2x 3x Degree of Memory Contention 0.0 0.5 1.0 Pa ck et Dr op R ate (% ) IOMMU Off IOMMU On (b) Packet drop rate 0x 1x 2x 3x Degree of Memory Contention 0 1 2 IO TL B M iss es pe r P ag e (c) IOTLB miss rate 0x 1x 2x 3x Degree of Memory Contention 0.0 0.5 1.0 IO M M U Ca ch e M iss es pe r P ag e L1 L2 L3 ACKs 0.00 0.05 0.10 0.15 AC Ks pe r P ag e-s ize d P ac ke t (d) L1/L2/L3 cache miss rate 0 100 200 300 400 500 IOVA Accesses 0 100 200 300 400 500 Di st in ct L 3s A cc es se d Be fo re R ep ea t 0x Memory Contention 3x Memory Contention (e) IOVA access pattern Figure 2.10: Application performance with both IOMMU and memory band- width contention. Memory bandwidth contention causes further performance degradation due to the increased latency for each memory read. This worsens the cost of each page table walk despite similar miss rates for both the IOTLB and page table caches. tention and IOMMU contention. Despite similar IOTLB and IOMMU cache misses, we still observe throughput degradation as the memory contention in- creases. This is because PCIe latency increases as each page table walk (and its corresponding memory reads) takes longer to complete. Comparing 3x memory contention to the 2048 ring buffer result, there is a 25% reduction in throughput despite equivalent IOTLB and L1/L2/L3 cache misses. As each memory read incurs more latency, there is a larger amount of throughput degradation. 34 2.3 Host Congestion Impact on Storage Applications We now show existing storage stacks were not designed to handle the host con- gestion regime either, with the results shown in §2.2 also generalizing to storage applications: up to 50% of unnecessary throughput degradation and an order of magnitude tail latency inflation. Existing storage stacks using various IO sched- ulers [114, 57] handle contention at CPUs and/or storage devices (that leads to queueing at the block layer), but they have no mechanisms to dynamically iden- tify contention and adjust to the appropriate amount of parallelism for storage applications when the memory interconnect is bottlenecked. This limitation can cause individual IO requests to experience significant latency inflation when latency-sensitive L-apps compete for host resources with throughput-bound T-apps in the presence of memory interconnect bottlenecks. This is because T-apps are able to quickly acquire most of the resources (such as credits used for data transfers over peripheral and memory interconnects) within the stor- age datapath, resulting in L-app requests observing backpressure from within the host network hardware (since the host network is inherently lossless). The problem worsens when we extend our setup to incorporate remote NUMA nodes, where the storage applications run on CPU cores in NUMA nodes remote to the storage devices in the presence of memory interconnect bottlenecks. Furthermore, we observe throughput degradation due to breakage of isolation across multiple T-apps running on CPU cores in different NUMA nodes and accessing storage devices within the same NUMA node. When con- tention arises at the memory interconnect for only one of the T-apps (referred to as the “culprit” application), the other T-app (the “victim” application) observes backpressure, resulting in unnecessary throughput degradation. This backpres- 35 CPUs DRAMDRAM SSD SSD SSD M-app Traffic L-app & T-app Traffic SSD (a) Memory interconnect congestion within NUMA node local to SSDs 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Throughput (in Mi ion IOPS) 102 103 104 L- Ap p Ta i La te nc ) (u s) 7.23( Latenc) Inf ation (1215.5µs) (1410.5µs) (195.0µs) No Contention With Contention (b) Latency-load curve for increasing IO depth; none scheduler 0.000 0.002 0.004 0.006 0.008 0.010 0.012 Throughput (in Million IOPS) 102 103 104 L- A Ta il La te nc y (u s) No Contention With Contention (c) mq-deadline scheduler 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Throughput (in Mi ion IOPS) 102 103 104 L- Ap p Ta i La te nc ) (u s) 1.55( Latenc) Inf ation (104.0µs) (292.5µs) (188.5µs) No Contention With Contention (d) kyber (4KB IO) 0.00 0.03 0.05 0.08 0.10 0.12 0.15 0.18 0.20 0.23 Throughput (in Mi ion IOPS) 103 104 105 L- Ap p Ta i La te nc ) (u s) 3.10( Latenc) Inf ation (1239.1µs) (1828.1µs) (589.0µs) No Contention With Contention (e) kyber (64KB IO) 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Throughput (in M ll on IOPS) 102 103 104 L- Ap p Ta l La te nc y (( s) 7.16) Latency Inflat on (1201.1µs) (1396.1µs) (195.0µs) No Content on W th Content on (f) blk-switch Figure 2.11: Existing storage stacks suffer from tail latency inflation due to memory interconnect bottlenecks. (a) The setup comprises L-apps, T-apps, and M-apps all co-located within a single NUMA node; (b) the T-apps set parallelism as the benchmarked knee point (circled in the green curve); in the presence of memory interconnect bottleneck, the T-apps continue to maintain the same amount of parallelism (cross marked on the red curve), incurring 7.23× higher tail latency than the ideal operating point for this scenario (circled in the red curve); (c-f) other IO schedulers supported by Linux blk-mq, and blk-switch (a state-of-the-art storage stack) incur similar latency inflation and/or high throughput overheads. sure originates from shared hardware buffers within the host network (such as shared IIO buffers, as discussed in §2.1). Setup. We demonstrate the problems described above using the Linux storage 36 Local NUMA node Remote NUMA node UPI Link CPUs M-app Traffic L-app & T-app Traffic DRAM DRAMDRAM CPUs M-app Traffic SSD SSD SSD SSD (a) Memory interconnect congestion in remote NUMA. 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Throughput (in M ll on IOPS) 102 103 104 105 L- Ap p Ta l La te nc y (( s) 17.44) Latency Inflat on (2756.5µs) (2924.1µs) (167.6µs) No Content on W th Content on (b) Latency-load curve for increasing IO depth Figure 2.12: Higher inflation in tail latencies with memory interconnect bottle- necks in remote NUMA node. (a) Setup uses applications running on CPU cores in NUMA node remote from SSDs; (b) L-apps incur 17× tail latency inflation. stack with the default block multi-queue architecture (blk-mq); while we use the Linux storage stack, the problems apply for userspace stacks as well since none of the existing userspace stacks account for memory interconnect bottle- necks. We use the same server configuration as discussed in §2.2. We attach 4 Samsung PM173X NVMe SSDs to one of the NUMA nodes through 4 dedi- cated PCIe Gen3 lanes per SSD (i.e., a total of 16 PCIe lanes implying a maxi- mum aggregate PCIe bandwidth of 128Gbps or 16GB/s). We use two kinds of storage applications — latency-sensitive applications (L-app) and throughput- bound applications (T-app). Similar to prior work [57], we use the standard benchmark tool FIO [1] with an IO depth of 1 for the L-app and varying IO depth from 1-256 for the T-apps. We perform 4KB random reads IOs from the SSDs. We also use the same memory-intensive application (M-app) as in §2.2.1. [Figure 2.11] Modern storage stacks assume bottlenecks are at CPUs and/or SSDs. Existing storage applications, therefore, typically maintain a fixed max- imum number of requests in-flight (e.g., by specifying a per-thread IO depth) to ensure that CPUs and/or SSDs are not overwhelmed. The precise number of in- flight requests is decided based on the latency-load curve during a benchmark- 37 ing phase. The green lines in Figure 2.11 demonstrate this process: individual points on the latency-load curve in Figure 2.11(b) are obtained by increasing the IO depth for T-app (using 1 T-app thread per-core). In this scenario, the T-app throughput is bottlenecked by available SSD bandwidth at ∼3.0M IOPS. Benchmarking this curve informs us that an IO depth of 64 (shown in a green circle) is the ideal choice. In the presence of additional traffic from M-apps, the bottleneck shifts from SSD bandwidth to memory interconnect bandwidth. The ideal operating point now is IO depth = 16 (depicted in circle on the red curve) in Figure 2.11(b). However, the T-apps continue operating at the benchmarked parallelism (i.e., IO depth = 64). Using the default none scheduler (which is also the recommended scheduler for SSDs [57, 111]), this results in L-app requests observing ∼7.23× inflation in tail latency compared to the ideal operating point. We also evaluate two other state-of-the-art-practice Linux blk-mq sched- ulers [114] (mq-deadline and kyber), along with a state-of-the-art storage stack (blk-switch [57]), using the same experimental setup with increasing IO depths. Figures 2.11(c-f) show that these schedulers observe either simi- lar tail latency inflation (compared to noop) for the L-app under memory in- terconnect contention and/or significant throughput overheads for T-app even without memory interconnect contention. Both mq-deadline and kyber lack mechanisms to provide isolation across L-apps and T-apps when both appli- cations issue the same kind of requests to SSDs (reads in our setup). Isolating L-apps using higher prioclass in mq-deadline results in starvation for T-apps; kyber using default 4KB IO size observes similar CPU overheads and tail laten- cies as reported earlier [114]; hence the knee-point in Figure 2.11(d) experiences relatively minor shift under memory interconnect contention. However, if we use a larger IO size such that we’re no longer CPU bottlenecked (shown in Fig- 38 Local NUMA node Remote NUMA node UPI Link CPUs Victim T-app Traffic DRAM DRAMDRAM CPUs M-app Traffic Culprit T-app Traffic SSD SSD SSD SSD (a) Memory interconnect congestion for culprit T-app Culprit Victim0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 Th ro ug hp ut (i n M illi on IO PS ) Necessary Unnecessary No Contention With Contention in Culprit Datapath (b) Throughput starvation for victim T-app Figure 2.13: Throughput starvation problem when T-apps use multiple NUMA nodes. (a) a setup, where the culprit application suffers from memory, intercon- nect bottleneck, but not the victim; (b) the culprit must necessarily suffer from throughput degradation; however, it also results in unnecessary degradation in the victim application throughput. ure 2.11(e)), kyber starts observing larger latency inflation. Finally, we also eval- uate blk-switch, which by design tries to isolate L-apps from T-apps by prior- itizing L-app requests enqueued in block layer software and hardware queues. Under memory interconnect bottlenecks, these queues are filled up only after significant build-up within the host and SSD-internal hardware queues, result- ing in sub-optimal performance. [Figure 2.12] Higher inflation in tail latency when using remote NUMA nodes. When running L-apps and T-apps on CPU cores located on NUMA nodes re- mote from the SSDs, we observe a ∼17.44× inflation in tail latency (∼2.4× higher than local NUMA case in Figure 2.11(b)). This additional performance degra- dation is due to remote memory access latencies from the IIO being higher than local memory accesses—we measured up to ∼ 65% higher latencies even when there are no memory interconnect bottlenecks (more details in §5.2.2). Higher IIO-to-memory latencies lead to IIO buffer filling up more quickly and larger depletion of PCIe credits for L-apps to perform DMAs, compared to the local NUMA scenario. 39 [Figure 2.13] Throughput starvation problem when using multiple NUMA nodes. We encounter a throughput starvation issue in scenarios where two T-apps are colocated on a server with isolated datapaths from SSDs to memory, resulting in dedicated SSDs, PCIe lanes, and memory addresses across differ- ent NUMA nodes. When a ”culprit” T-app contends for memory interconnect resources within its datapath, it experiences throughput degradation. How- ever, this contention also leads to unnecessary throughput degradation for the ”victim” T-app, which does not face memory interconnect contention in its dat- apath. This occurs because contention within the culprit T-app results in IIO buffers filling up, which both the culprit and victim T-apps share. Since the IIO lacks a mechanism to distinguish between the two T-apps, both applications suffer from degraded throughput. 2.4 Summary In this chapter we provide an overview of the end-to-end datapath for network and storage applications, with an emphasis on potential bottlenecks within the host network—e.g., the memory interconnect and the IOMMU caches. We demonstrated that existing host infrastructure is ill-equipped to effectively han- dle host congestion: we showcase significant throughput degradation and or- ders of magnitude of tail latency inflation for networking and storage applica- tions in the regime of host congestion. 40 CHAPTER 3 REARCHITECTING CONGESTION CONTROL In this chapter, we discuss the assumptions in the design of existing congestion control protocols that start the break under host congestion. We then present design, implementation and evaluation of hostCC, a new congestion control architecture to handle host congestion along with network fabric congestion. 3.1 Overview Classical literature in datacenter congestion control takes a narrow view of “end-to-end”, often interpreting the end as the point of presence of Ethernet (the network interface card, or NIC). This view precludes the host network, with the conventional wisdom in systems and networking communities being that congestion happens primarily within the network fabric (that is, at network switches). Several recent studies from large-scale production clusters [6, 86, 87] (along with the results using open-sourced software and hardware infrastruc- ture, shown in Chapter 2) demonstrate that the above conventional wisdom is merely an appeal to tradition fallacy with the emergence of host congestion. The regime of host congestion forces us to revisit the many fundamental assumptions entrenched within decades of research and practice of congestion control. For instance, classical congestion control literature assumes that packet drops happen at the congestion point; in contrast, host congestion results in queueing and drops away from the actual congestion point (since the host net- work is lossless). Thus, we must rethink congestion signals to capture the pre- cise time, location, and reason for host congestion. As another example, an 41 unspoken assumption in classical congestion control literature is that all com- peting traffic adheres to the congestion control protocol; such is not the case in the host congestion regime where traffic from “outside the network” (e.g., appli- cations generating CPU-to-memory traffic) does not employ congestion control mechanisms, is much closer to the congestion point, and can thus change dra- matically at sub-RTT granularity. This has powerful implications in terms of rethinking congestion response: existing congestion control protocols that op- erate at RTT granularity may achieve performance far from optimal in the host congestion regime. Thus, host congestion provides us an opportunity to revisit intellectually intriguing, decades-old, fundamental questions related to conges- tion control architecture and protocols. We present hostCC, a congestion control architecture that takes the ultimate end-to-end view: it handles both host congestion and network fabric congestion by allocating both host and network resources among competing traffic. The ethos of host and network resource allocation is the core that drives the three key technical ideas embodied within hostCC. First, in addition to congestion signals from within the network fabric, hostCC generates host-local congestion signals at processor, memory, and peripheral interconnects at sub-microsecond timescales. These host congestion signals enable hostCC to precisely capture the time, location, and reason for host congestion. The second key technical idea in hostCC is a sub-RTT granularity host-local congestion response: at both the sender and the receiver, hostCC uses host-local congestion signals to allo- cate host resources between network traffic and host-local traffic. At the sender, hostCC uses host-local congestion response to ensure that network traffic is not starved, even at sub-RTT granularity; at the receiver, hostCC uses host-local congestion response to minimize queueing and packet drops at the host: it mod- 42 ulates host resources allocated to the network traffic at sub-RTT granularity to ensure that NIC queues are drained at the same rate at which network traffic arrives at the NIC. Finally, the third key technical idea in hostCC is to use both host and network congestion signals to perform efficient network resource allo- cation at RTT timescales. hostCC admits efficient realization within existing host network stacks, without any modifications in applications, host hardware, and/or network hardware; moreover, hostCC can be integrated with existing congestion con- trol protocols to efficiently handle both host and network fabric congestion. To demonstrate this, we perform an end-to-end implementation of hostCC in the Linux kernel using ∼800LOC, and evaluate it along with unmodified Linux DCTCP. Our evaluation demonstrates that, in the presence of host conges- tion, hostCC reduces queueing and packet drops at the host to a bare min- imum, resulting in near-optimal network utilization and tail latency for net- worked applications. The end-to-end implementation of hostCC, along with all the documentation needed to reproduce our results, is available at https: //github.com/Terabit-Ethernet/hostCC. 3.2 hostCC Figure 3.1 illustrates the end-to-end hostCC architecture. In this section, we pro- vide details on the three key technical ideas embodied within the architecture— host congestion signals (§3.2.1), host-local congestion response (§3.2.2), and net- work resource allocation (§3.2.3). 43 https://github.com/Terabit-Ethernet/hostCC https://github.com/Terabit-Ethernet/hostCC Classical Network FC Host-local Congestion Response NIC PCIe NIC PCIe Host Congestion Signal Network Congestion Signal Receiver Host Classical Network CC Input Sender-host congestion signal + Receiver-host congestion signal + Network congestion signal Output Network resource allocation (via rate control/window mgmt) Host-local Congestion Response Input Host congestion signal Output Host-local response level Sender Host Host Congestion Signal Output Host-local response level Input Host congestion signal Input Host congestion signal + Network congestion signal Output Flow control + Echo congestion signals to sender Figure 3.1: hostCC architecture overview. Discussion in §3.2. 3.2.1 Host congestion signals hostCC, in addition to classical congestion signals from within the network fab- ric, generates host-local congestion signals. More precisely, hostCC uses IIO buffer occupancy as a congestion signal. To understand why, recall the host datapath discussed in Chapter 2 (§2.1.1). Let R be the rate at which NIC receives data, P be the maximum in-flight bytes that PCIe can maintain (a fixed hardware-dependent constant that depends on the maximum number of credits, and on TLP size), ℓp be the latency between the NIC to the IIO (a fixed hardware-dependent constant), and ℓm be the la- tency between the IIO and the memory controller. As discussed in Chapter 2, ℓm depends upon multiple factors, including the memory controller write queue size, load on the memory controller, whether DDIO is enabled (and whether an eviction is triggered), and the speed-of-light-latency between the IIO and mem- ory [59, 21, 80]. We will refer to ℓmin m and ℓmax m as the minimum and the maximum 44 value of ℓm. Given the above, PCIe throughput is given by P/max{ℓp, ℓm}, that is, PCIe utilization is dominated by the maximum latency among all links along the path from the NIC to memory1. The IIO buffer occupancy is equal to R × ℓm, that is, the maximum number of bytes that can be received by the IIO while it is waiting for credits to be replenished (if it had credits, IIO would issue the requests). In the regime of no host congestion, ℓm ≈ ℓmin m ≪ ℓp by hardware design, making ℓp the dominant factor in PCIe utilization. Thus, by hardware design, IIO replenishes PCIe credits at a rate no lower than PCIe consumes credits, and PCIe bandwidth utilization matches the rate at which NIC receives data; that is, P/ℓp ≥ R. However, in the host congestion regime, the IIO buffer occupancy can be anywhere between R × ℓmin m and min{R × ℓmax m , maximum number of PCIe credits}, where the second expression in the upper bound is achieved when IIO is unable to replenish PCIe credits due to large ℓm. Our measurements in Figure 3.3 provide an empirical confirmation. We are now ready to describe the benefits of using IIO occupancy as the host congestion signal. First, IIO occupancy provides accurate information about time, location, and reason for host congestion: IIO occupancy increases imme- diately upon the memory controller becoming congested (accuracy in time and location) and it increases only if memory controller is congested (accuracy in rea- son)2. Second, IIO occupancy can be combined with another statistic—IIO in- 1Here is one way to visualize this intuitively: if ℓp → ∞ or ℓm → ∞, PCIe utilization tends to 0; furthermore, if ℓm ≫ ℓp, PCIe utilization is bottlenecked by ℓm (host congestion) and if ℓp ≫ ℓm, PCIe utilization is bottlenecked by ℓp. This follows almost immediately using the analysis in [65]. 2When IOMMU is enabled, IIO occupancy can also increase due to contention at IOMMU caches. It is straightforward to distinguish IOMMU cache contention as a root cause by also measuring the cache miss rates as shown in §2.2. In this chapter, we focus on IOMMU-disabled scenario, and discuss handling host congestion due to IOMMU cache contention in Chapter 4. 45 sertion rate, defined as the rate at which PCIe inserts data into the IIO buffer—to measure various other useful metrics; for instance, instantaneous PCIe through- put (capturing the rate at which NIC buffers are drained) is equal to instanta- neous IIO insertion rate times the cacheline size, host delay (ℓp+ ℓm) can be com- puted using Little’s Law [89]), etc. Third, IIO occupancy and IIO insertion rates can be measured using two registers typically available on commodity hard- ware, allowing hostCC to work without any hardware modifications/support. Finally, IIO measurements are done at the processor interconnect, outside the NIC-to-memory datapath; thus, IIO occupancy measurements are not impacted by host congestion. We provide more details in §3.3.1, including details on how hostCC measures IIO occupancy and IIO insertion rates at sub-µs granularity using existing hardware. 3.2.2 Host-local congestion response at sub-RTT granularity A conceptual interpretation of classical congestion control protocols is that, to handle congestion within the network, these protocols (along with network switches) allocate network resources across entities competing at the congestion point. The second key technical idea in hostCC architecture is motivated by this conceptual view: to handle both host and network congestion, hostCC allocates both host and network resources among entities competing at the congestion point. To achieve this, hostCC introduces a host-local congestion response—at both the sender and the receiver host—that uses host congestion signals dis- cussed in the previous subsection to allocate host resources across network traf- fic and host-local traffic. 46 Resource allocation depends on the underlying policy; hostCC architecture does not dictate the precise resource allocation policy—just like different net- work resource allocation mechanisms use different network allocation policies (max-min fairness, weighted max-min fairness, prioritization, etc.), we envi- sion hostCC to embody various host resource allocation policies and respective implementation. For the following discussion, we assume that the policy pe- riodically computes a target network bandwidth (BT ), and feeds it as input to hostCC. In addition, the host-local congestion response takes as input IIO oc- cupancy IS as the host congestion signal (using a threshold IT , where IS > IT indicates host congestion) and PCIe bandwidth utilization BS (computed us- ing IIO insertion rates, as discussed in the previous subsection). The conges- tion response mechanism operates on a per-packet basis, and makes a decision whether to increase or decrease the resource allocation to both network traffic and host-local traffic. Given the above, hostCC’s host-local congestion response mechanism is best described in terms of four possible regimes of operation. We describe individual regimes and corresponding host-local congestion response below. 1 [No host congestion, network traffic has met the target network band- width]. In this regime, host is not congested (IS < IT ) and the network traffic is using more resources than what is needed to meet the target bandwidth (that is, BS > BT ). Thus, the host-local congestion response mechanism increases the resources allocated to the host-local traffic. This is the right action to take since, in absence of host congestion, more host resources can be allocated to either network traffic or host-local traffic; moreover, since the network traffic has al- ready met the target network bandwidth, we want to ensure that host-local traf- 47 fic is not backpressured unnecessarily. Thus, the host-local congestion response mechanism increases resources allocated to the host-local traffic. It is possible that host-local traffic does not need additional resources; hostCC handles this case by relying on the AIMD-style mechanisms used in network congestion con- trol protocols—since host is not congested, network traffic does not get marked with congestion signals at the host allowing network traffic to increase its rate and acquire unused host resources (if network fabric is not congested). 2 [Host congestion, network traffic has met the target network bandwidth]. In this regime, host is congested and the network traffic is using more resources than what is needed to meet the target bandwidth. Thus, the right action in this regime is to reduce resources allocated to the network traffic and to not reduce resources allocated to the host-local traffic. To achieve this, hostCC again relies on AIMD-style mechanisms used in network congestion control protocols: it echoes the host congestion signal to the network congestion control protocol resulting in reduction in network traffic rate. 3 [Host congestion, network traffic has not met the target network band- width]. In this regime, host is congested but the network traffic is allocated fewer resources than what is needed to meet the target bandwidth. Since there is host congestion, we must reduce the allocated resources; since network traffic has not met the target bandwidth, hostCC first reduces the resources allocated to the host-local traffic (this happens at sub-RTT granularity). However, this is not sufficient to avoid NIC buffer build up and packet drops. To see why, recall that the PCIe bandwidth utilization BS and target network bandwidth BT may be much lower than the rate R at which NIC is currently receiving traffic. By reducing resources allocated to host-local traffic in order to accommodate BT 48 network traffic bandwidth, the host-local congestion response merely ensures that NIC buffers build up at a rate no faster than R − BT . Without any explicit congestion signal, the network traffic will have no reason to reduce R, resulting in increasingly more queueing at the NIC and eventual packet drops. To avoid this, the host-local congestion response also echoes the host congestion signal to the network congestion control protocol resulting in reduction in network traf- fic rate. We note that, if R < BT , the host-local congestion response may lead to inefficient resource allocation due to reducing resources allocated to both net- work and host-local traffic; nevertheless, hostCC takes the above conservation decision temporarily to minimize NIC buffer buildup and packet drops. 4 [No host congestion, network traffic has not met the target network band- width]. In this regime, host is not congested and the network traffic has fewer resources than what is needed to meet the target bandwidth. Thus, the host- local congestion response allocates more resources to network traffic; this allo- cation is again implicit, in that, it again relies on the AIMD-style mechanisms— since host is not congested, network traffic does not get marked with congestion signals at the host allowing network traffic to increase its rate and acquire un- used host resources (if network fabric is not congested). Increasing resources allocated to the network traffic implicitly may take multiple RTTs, or may not even be feasible (e.g., due to network congestion); nevertheless, the host-local congestion response makes the conservation decision to not increase resources allocated to the host-local traffic in this regime to avoid host congestion before reaching the target network bandwidth. The host-local congestion response in hostCC uses host congestion signals (that are generated at sub-microsecond granularity) and is purely local to the host; 49 thus, it can be done at sub-RTT granularity. As a result, even if host-local traffic changes at sub-RTT granularity, the host-local congestion response can ensure high host resource utilization while maintaining target network bandwidth ac- cording to any given policy. 3.2.3 Network resource allocation at RTT granularity Consider the regime of host congestion (IS > IT ) and network traffic transmit- ting at rate R > BT (which will lead to BS > BT ). In this scenario, the right action is for the network traffic to reduce its rate. To achieve this, hostCC’s host- local congestion response mechanism does not take any action; instead, hostCC simply echos the host congestion signal back to the network congestion control protocol (in addition to any network congestion signal). This has two benefits. The first benefit is conceptual: it enables a clean separation of concerns, where the host-local congestion response handles host congestion at sub-RTT granu- larity, and network congestion control continues to handle network congestion at RTT granularity as they do today. Second, hostCC can be integrated with any network congestion control protocol: the only difference is that the protocol will now use both host and network congestion signals for host resource allocation. End-to-end hostCC behavior. We provide an intuitive description of hostCC’s end-to-end behavior. Suppose the network traffic is operating at rate R > BT . Suppose severe host congestion is introduced abruptly; then, as evaluated in §2.2, BS will reduce to a small value below BT , IS will grow beyond IT , and the host-local congestion response will kick in quickly to increase the host resources allocated to network traffic to accommodate ∼BT bandwidth (potentially by re- 50 0.4 0.6 0.8 1.0 1.2 Measurement Latency (µs) 0.0 0.2 0.4 0.6 0.8 1.0 C D F O ve rA ll M ea su re m en ts No Host Congestion With Host Congestion (a) CDF for IS read latency 0.4 0.6 0.8 1.0 1.2 Measurement Latency (µs) 0.0 0.2 0.4 0.6 0.8 1.0 C D F O ve rA ll M ea su re m en ts No Host Congestion With Host Congestion (b) CDF for BS read latency Figure 3.2: hostCC generates host congestion signals that are not on the NIC- to-memory datapath; it can thus measure both IS (left) and BS (right) at sub-µs timescales independent of host congestion. ducing resources allocated to host-local traffic causing host congestion). How- ever, for a few RTTs, the arrival rate of network traffic R at receiver NIC will still be higher than BT , resulting in hostCC echoing host congestion signals back to the sender. The sender will eventually reduce R until it converges to the target network bandwidth BT . 3.3 hostCC Implementation We now provide details on how we incorporate the three key technical ideas from the previous section in an end-to-end hostCC implementation— generating host congestion signals (§3.3.1), using host congestion signals for host-local congestion response at sub-RTT granularity (§3.3.2) and using both host and network congestion signals for network resource allocation at RTT granularity (§3.3.3). We implement hostCC as a loadable Linux kernel module using ∼800 LOC; hostCC works out-of-the-box with various existing conges- tion control protocols, without requiring any modifications to applications, host hardware, and/or network hardware. 51 3.3.1 Host congestion signals hostCC collects host congestion signals at sub-microsecond granularity. To maintain brevity, we describe an implementation atop Intel architectures (AMD architecture is conceptually very similar). Most hardware counters are exposed using model specific registers (MSRs) [61]. The MSR for the IIO occupancy value at time t (denoted by ROCC(t)) maintains the cumulative value of the occupancy, incremented at IIO clock fre- quency (denoted by FIIO); for example, FIIO = 500MHz for our servers. The average IIO occupancy IS between any two time instants t1 and t2 is computed using: IS = (ROCC(t2) − ROCC(t1))/((t2 − t1) ∗ FIIO). The time difference (t2 − t1) is measured by the standard method of reading the TSC register, which provides nanosecond level time accuracy. To minimize read latency, we used inline as- sembler code to read the TSC register. The read latency is bottlenecked by the read call to the IIO occupancy MSR register. On our servers, we measured that each TSC read took < 2ns, and each MSR read call took <∼600ns. Thus, we are able to collect host congestion signal (IIO occupancy IS ) at sub-µs timescales. Similarly, to measure PCIe bandwidth utilization BS , we read another MSR counter (denoted by RINS ) that stores cumulative IIO insertions. Thus, similar to IIO occupancy, we compute the average rate of IIO insertions I between time in- stants t1 and t2 as I = (RINS (t2)−RINS (t1))/(t2− t1). The PCIe bandwidth utilization between t1 and t2 is thus I times the cacheline size. Both congestion signals—IS and BS —require reading CPU registers, which does not overlap with the NIC- to-memory datapath; thus, hostCC is able to measure these signals at a sub-µs timescale, independent of host congestion (as shown in Figure 3.2). We note a low-level detail. Similar to existing network congestion control 52 0 200 400 600 800 1000 0 20 40 60 80 100 120 PC Ie B W (G bp s) 0 200 400 600 800 1000 Time (µs) 50 60 70 80 90 100 II O O cc up an cy (a) No Host Congestion 0 200 400 600 800 1000 0 20 40 60 80 100 120 PC Ie B W (G bp s) 0 200 400 600 800 1000 Time (µs) 50 60 70 80 90 100 II O O cc up an cy (b) With Host Congestion Figure 3.3: Variation of IIO occupancy IS and PCIe bandwidth utilization BS with time for 1ms period for the default experimental setup without host con- gestion (left) and with 3× host congestion (right). In absence of host congestion, BS ≈ 103Gbps (line-rate bandwidth including PCIe overheads with 4K MTUs) and IS≈65, which corresponds to hardware-specific IIO-DRAM bandwidth- delay product (§3.2.1). During host congestion, IS increases to a maximum value of ∼93 (shown in red), resulting in a reduction of BS , queueing and packet drops at the NIC, and network CC reducing the rate. protocols [77, 99], hostCC uses an exponentially weighted moving averaging (EWMA) for its congestion signals, IS and BS , rather than their instantaneous values. The EWMA weight used in hostCC has a standard tradeoff in terms of aggressiveness in hostCC’s response to host congestion and delayed reaction— using a large weight will quickly trigger host-local congestion response as well as network congestion control response (the latter because congestion signals will be generated more quickly, and echoing them to network congestion con- trol protocol will trigger its response) which could lead to overreaction in pres- ence of temporary burst of host congestion; a small weight, on the other hand, will delay congestion response. hostCC uses a default weight value of 1/8 for IS and 1/256 for BS (that is, last 8 IIO occupancy values and last 256 PCIe band- width utilization values are dominant). An important note here is that we use the same weight for IS for detecting host congestion and echoing congestion to the network congestion control protocol; this works because network conges- 53 tion control protocols typically maintains EWMA of their parameters (e.g., α in DCTCP) used to trigger network congestion response. We show in Figure 3.3 the IIO occupancy and PCIe write bandwidth with time for the baseline setup in §2.2, with and without host congestion. 3.3.2 Host-local congestion response The host-local congestion response in hostCC implementation takes as input host congestion signals (IS and BS ) and corresponding thresholds (IT and BT ) and triggers the response discussed in §3.2.2. Any resource allocation mech- anism must operate using the interface provided by the system that offers re- sources; we describe our implementation using the Intel Memory Bandwidth Allocation (MBA) tool. This interface uses a simple multi-level backpressure mechanism to the host-local traffic (recall, host interconnect is lossless; thus, backpressure is a natural mechanism to reduce CPU to memory traffic). Higher levels mean more backpressure; that is, higher levels result in fewer resources allocated to host-local traffic. Internally, MBA alters the rate at which any CPU core can generate memory traffic by introducing additional latency to every read/write request that observes an L2 cache miss on that core. Therefore, aver- age traffic generated by a core to memory is inversely proportional to the intro- duced additional latency: (LFB size × cacheline size)/per-access latency, where LFB size is as discussed in §2.2. The current MBA interface allows 10 levels, with higher levels introducing higher latency resulting in lower CPU to mem- ory traffic [121]. AMD’s Memory Bandwidth QoS control tool uses a similar interface [15]. 54 0 1 2 3 4 Host-local Response Level 0 20 40 60 80 100 N et A pp -T T pu t (G bp s) DDIO Off DDIO On 0 1 2 3 4 Host-local Response Level 0 50 100 150 200 M A pp T pu t (G bp s) DDIO Off DDIO On (a) Application throughput 0 1 2 3 4 Host-local Response Level 0.0 0.2 0.4 0.6 0.8 1.0 M em or y B an dw id th U til iz at io n DDIO Off, NetApp-T DDIO Off, MApp DDIO On, NetApp-T DDIO On, MApp (b) Memory bandwidth utilization Figure 3.4: Evaluating MBA efficacy: NetApp-T is able to acquire more host resources with higher host-local response levels (that is, more backpressure on MApp). As a result, NetApp-T gets higher throughput (left) and a larger fraction of memory bandwidth (right). Discussion in §3.3.2 and §3.4. The desired rate level is realized by performing a single MSR write to MBA- specific registers. For each socket (NUMA node), MBA maintains 8 MSR reg- isters, one for each “class-of-service” (COS), which can include any number of CPU cores within a particular socket. The assignment of CPU cores to COS can also be changed dynamically using an MSR write to another control regis- ter. Therefore, a single MSR write can simultaneously alter the CPU to memory traffic for any number of CPU cores within a socket. Our current implementa- tion uses 5 local response levels ℓ = {0, 1, 2, 3, 4}, where each successive level ℓ introduces increasingly larger latency—level 0 corresponds to no backpressure, and level 4 corresponds to maximum backpressure. We separate out network traffic cores from host-local traffic cores using different COS and introduce back- pressure only to the latter. Figure 3.4(a) shows network throughput when we hard code each individ- ual host-local response level ℓ. We observe that, with each level ℓ, throughput increases as expected: with more aggressive backpressure on host-local traffic, network throughput increases from 43Gbps at level 0 to ∼100 Gbps at level 43. 3The host-local response level 3 corresponds to the maximum possible latency that can be in- 55 To understand Figure 3.4(b), that shows corresponding memory bandwidth uti- lization, we must understand the difference between application-level through- put and corresponding load on memory bandwidth. In particular, NetApp-T and MApp use ∼2.1× and ∼1.33× memory bandwidth per unit of application- level throughput (due to data copy and processor interconnect overheads, re- spectively). When DDIO is enabled, the network application achieves higher throughput at lower response levels (for eg., ∼100Gbps at level 3 instead of 4) because NetApp-T utilizes smaller amount of memory bandwidth per unit throughput (since DDIO cache eviction rate is typically less than 100%). Con- sequently, it requires smaller amount of backpressure to the MApp to achieve the same network throughput. Our measurements also suggest that it takes ∼22µs to perform a write to today’s MBA MSR registers due to MBA limitations; to verify that this was not an hostCC implementation artifact, we performed MSR writes using inline assembler core such that we only execute a single assembly- level instruction for the write and still incurred 22µs latency (2× smaller than our network RTT). We further discuss this limitation of MBA in Chapter 6. 3.3.3 Network resource allocation hostCC can be integrated with many existing network congestion control pro- tocols. We focus here on hostCC’s implementation with ECN-based protocols; we discuss extensions to integrate hostCC with other protocols in §6.1. hostCC troduced using the current MBA implementation to all MApp cores. However, this added latency does not provide sufficient backpressure to MApp to allow the network application to reach line rate throughput (as shown in Figure 3.4, NetApp-T only achieves ∼77Gbps throughput at level 3). In order to emulate an MBA response with larger added latency, we introduced a level 4 in current hostCC implementation—when emulating level 4, we pause the execution of the MApp process using the SIGSTOP signal, and when switching back from level 4 to lower levels, we resume the MApp process using the SIGCONT signal. 56 requires no modification to existing network congestion control protocol imple- mentations: hostCC simply generates ECN markings on the ACKs sent back to the sender if IS > IT (if the packet was already marked by the switch, no modifi- cations are made). The current hostCC implementation performs ECN marking at the IP layer using 2 out of 6 DSCP bits (as in RFC 3168 [113]) by exploiting a hook to the ip recv function provided by the widely used NetFilter kernel mod- ule [131] in Linux. hostCC’s host-local congestion response does exactly what today’s switches do—mark both these bits as 1 to indicate congestion—before delivering the IP datagram to the transport layer. The transport layer then pro- cesses the ECN marked datagram in exactly the same manner as it would for any ECN marked packet at today’s switches. 3.4 hostCC Evaluation We now evaluate hostCC performance. Our goals are three-fold: • Understanding benefits of hostCC’s core ideas—host congestion signals, host-local congestion response, and performing network resource allocation using both host and network congestion signals—to application-level perfor- mance; • Deep dive into hostCC microscopic behavior (capturing host congestion sig- nals, host-local congestion response, and network resource allocation) in the host congestion regime; • Develop lessons for the host congestion regime that would be useful to design future host hardware and network stacks that can better enable the reaction to host congestion. 57 0x 1x 2x 3x0 20 40 60 80 100 N et w or k T hr ou gh pu t (G bp s) 0x 1x 2x 3x Degree of Host Congestion 10−5 10−4 10−3 10−2 10−1 100 Pa ck et D ro p R at e (% ) DCTCP DCTCP+hostCC (a) Network throughput and packet drop rate 0x 1x 2x 3x Degree of Host Congestion 0.0 0.2 0.4 0.6 0.8 1.0 M em or y B an dw id th U til iz at io n DCTCP, NetApp-T DCTCP, MApp DCTCP+hostCC, NetApp-T DCTCP+hostCC, MApp (b) Memory bandwidth utilization Figure 3.5: (a) hostCC allows network traffic to achieve its target network band- width of BT = 80Gbps, while simultaneously reducing packet drop rates by or- ders of magnitude, even with a high degree of host congestion. (b) with hostCC, MApps no longer acquire a large fraction of memory bandwidth, even with high degree of host congestion. Throughout this section, we use Linux DCTCP as our network congestion con- trol protocol (network CC) since Linux DCTCP is a stable open-sourced im- plementation that works with commodity hardware. We primarily focus on the DDIO disabled case since results are easier to explain (as discussed ear- lier, performance for DDIO enabled case depends on cache eviction policies); hostCC evaluation for the DDIO enabled case is presented in §3.4.2. Unless mentioned otherwise, we use IIO occupancy threshold IT = 70 and network target bandwidth BT = 80Gbps for hostCC (we present sensitivity analysis of hostCC performance with IT and BT in §3.4.3); for DCTCP, we use default pa- rameters from [14]. We do not perform any experiment-specific parameter opti- mization. 58 3.4.1 hostCC benefits We first evaluate hostCC on the same setup as in §2.2—this setup does not have any network congestion, allowing us to gain insights about hostCC per- formance in the host congestion regime. We then extend the setup to the one used in [6]; this setup includes network congestion and allows us to gain in- sights on hostCC performance in the presence of network congestion (with and without host congestion). hostCC avoids throughput degradation for network traffic while simultane- ously reducing packet drops at the host by orders of magnitude. Figure 3.5(a) shows that, when the degree of host congestion is so low that NetApp-T can reach its target network bandwidth without creating bottlenecks within the host network (0× and 1× cases), hostCC has negligible impact on NetApp-T through- put (which is bandwidth bottlenecked for the 0× case and CPU bottlenecked for the 1× case). More interestingly, in the presence of host congestion, hostCC al- lows NetApp-T to achieve throughput close to the desired target network band- width (80Gbps in this experiment), even with a high degree of host congestion. Essentially, using the host congestion signals, the sub-RTT host-local congestion response promptly reduces the resources allocated to the MApp traffic whenever the network throughput falls below BT in presence of host congestion (hostCC steady-state behavior illustrated later in §3.4.4) Moreover, Figure 3.5(b) shows that these benefits to NetApp-T do not come at the cost of starving MApp traffic: hostCC’s host-local congestion response also increases resources allocated to the MApp traffic whenever NetApp-T is able to sustain the target network band- width. Figure 3.5(a) also shows that hostCC reduces packet drop rates to a bare minimum, since the host-local congestion response and network CC (using host 59 1500B 4000B 9000B0 20 40 60 80 100 N et w or k T hr ou gh pu t (G bp s) 1500B 4000B 9000B MTU Size 10−5 10−4 10−3 10−2 10−1 100 101 Pa ck et D ro p R at e (% ) DCTCP DCTCP+hostCC (a) Varying MTU size 4 8 160 20 40 60 80 100 N et w or k T hr ou gh pu t (G bp s) 4 8 16 Number of Active Flows 10−5 10−4 10−3 10−2 10−1 100 101 Pa ck et D ro p R at e (% ) DCTCP DCTCP+hostCC (b) Varying number of flows Figure 3.6: Even with high degree of host congestion (3× in this experiment), hostCC consistently maintains its benefits across (a) MTU sizes and (b) number of flows. 104 105 106 128B 512B 2KB 8KB 32KB RPC Size 50 100 150 200 250 300 L at en cy (µ s) DCTCP (No host congestion) DCTCP (With host congestion) DCTCP+hostCC (With host congestion) Figure 3.7: Even with a high degree of host congestion (3× in this experiment), hostCC incurs minimal latency inflation compared to no host congestion, sig- nificantly improving tail latency across all RPC sizes. congestion signals) work in tandem to keep the NIC buffer occupancy low for a larger fraction of time. We observe that, with a high degree of host congestion, the total memory bandwidth utilization is slightly reduced. We believe this behavior is not due to hostCC’s architecture but rather due to the coarse granularity of host resource allocation using existing tools (Intel MBA, in this case). Due to such coarse gran- ularity of the allocation, the MApp sometimes gets backpressured much more than than it needs to, in order to accommodate additional NetApp-T traffic. Fig- ure 3.4 shows an example of this behavior: when we switch from host-local re- sponse level 3 to 4, NetApp-T gains 5.2GBps of memory bandwidth, while MApp 60 1x 1.5x 2x 2.5x0 20 40 60 80 100 N et w or k T hr ou gh pu t (G bp s) 1x 1.5x 2x 2.5x Increasing Degree of Incast 10−5 10−4 10−3 10−2 10−1 100 Pa ck et D ro p R at e (% ) DCTCP DCTCP+hostCC (a) Network congestion 1x 1.5x 2x 2.5x0 20 40 60 80 100 N et w or k T hr ou gh pu t (G bp s) 1x 1.5x 2x 2.5x Increasing Degree of Incast 10−5 10−4 10−3 10−2 10−1 100 Pa ck et D ro p R at e (% ) DCTCP DCTCP+hostCC (b) Host + Network congestion Figure 3.8: (a) In the presence of network congestion and absence of host con- gestion, hostCC achieves performance similar to network CC indicating mini- mal overheads; (b) in the presence of both host and network congestion, hostCC consistently provides benefits similar to the case with only host congestion. loses 13.8GBps of memory bandwidth. We discuss potential avenues for future hardware support for improved host-local congestion response in §6.1. Figure 3.6 shows that hostCC consistently achieves benefits in terms of main- taining target network bandwidth and reduced packet drop rates across all eval- uated MTU sizes and number for flows. hostCC observes minimal tail latency inflation for latency-sensitive traffic, even with high degree of host congestion. Figure 3.7 shows the observed la- tency under the same multi-tenant evaluation setup as in Figure 2.11 (where we use all apps NetApp-T, NetApp-L and MApp together). hostCC observes mini- mal latency inflation in the regime of host congestion due to two reasons: (1) hostCC’s host-local congestion response ensures minimal queueing delay at the host; and (2) hostCC significantly reduces packet drop rates, avoiding retrans- mission and timeout delays. Results for small-sized RPCs provide evidence for the first reason: recall, from §2.2, that P99 latency in this experiment is domi- nated by NIC queueing delays; the figure shows that, despite the high degree of host congestion, hostCC incurs a minuscule latency inflation of 13µs for 128B 61 RPCs (when compared to no host congestion scenario). All results provide evi- dence for the second reason: we observe no timeouts even at P99.9 percentile. hostCC maintains its benefits even in the presence of both host and network congestion. Figure 3.8 evaluates hostCC performance in the presence of net- work congestion, with and without host congestion. For this experiment, we use an incast workload with two senders and a single receiver, directly con- nected to a switch. We vary the degree of network congestion by varying the degree of incast (the total number of active concurrent flows at the receiver) from 4 to 10 (1× to 2.5× degree of incast). We observe that, in the absence of host congestion, network CC without hostCC observes increased packet drop rates with an increase in the degree of network congestion (as one would ex- pect); since there is no host congestion, hostCC performance is near-identical to the network CC performance indicating that hostCC has minimal overheads in the absence of host congestion. In the presence of both host and network con- gestion, however, network CC performance without hostCC suffers from high packet drops rates and reduced throughput; in this scenario, hostCC provides significant benefits using all three of its core ideas: it collects host congestion sig- nals at sub-µs timescales, it is able to modulate host resources allocated to the network traffic so as to maintain target network bandwidth, and incurs mini- mal packet drops rates by ensuring that network CC converges to a rate that matches available network and host resources. This experiment demonstrates that hostCC interpolates well with network CC even in the presence of both host and network congestion. 62 0x 1x 2x 3x0 20 40 60 80 100 N et w or k T hr ou gh pu t (G bp s) 0x 1x 2x 3x Degree of Host Congestion 10−5 10−4 10−3 10−2 10−1 100 Pa ck et D ro p R at e (% ) DCTCP DCTCP+hostCC (a) Network throughput and packet drop rate 0x 1x 2x 3x Degree of Host Congestion 0.0 0.2 0.4 0.6 0.8 1.0 M em or y B an dw id th U til iz at io n DCTCP, NetApp-T DCTCP, MApp DCTCP+hostCC, NetApp-T DCTCP+hostCC, MApp (b) Memory bandwidth utilization Figure 3.9: hostCC, with DDIO enabled, provides benefits similar to the DDIO disabled case. 3.4.2 hostCC results with DDIO enabled Figure 3.9 shows hostCC results using the same setup as in Figure 3.5, but with DDIO enabled. We use IT = 50 here because the observed IIO occupancy value when there is no host congestion is smaller when DDIO is enabled (∼45, com- pared to ∼65 when disabled). This is due to smaller average IIO-to-memory latency with DDIO enabled as discussed in §2.1.1. We observe similar trends as in Figure 3.5—hostCC ensures that network traffic is able to achieve the target network bandwidth, while reducing packet drop rates to a bare minimum. With large degree of host congestion, the absolute packet drop rate with hostCC is slightly higher than other evaluated cases (hostCC still helps reduce packet drop rates by ∼37× compared to the case without hostCC); identifying the precise rea- sons for this observation requires more visibility into DDIO-related hardware operations like cache eviction policies (as also noted in [31, 46]). We also observe that, with DDIO enabled, MApp is able to acquire larger frac- tion of memory bandwidth (compared to DDIO disabled case in Figure 3.5) for any given degree of host congestion. Figure 3.4 helps explain this behavior— 63 104 105 106 128B 512B 2KB 8KB 32KB RPC Size 50 100 150 200 250 300 L at en cy (µ s) DCTCP (No host congestion) DCTCP (With host congestion) DCTCP+hostCC (With host congestion) Figure 3.10: hostCC, with DDIO enabled, provides benefits in terms of latency improvements similar to the DDIO disabled case. 10 20 30 40 50 60 70 80 90 1000 20 40 60 80 100 N et w or k T hr ou gh pu t (G bp s) 10 20 30 40 50 60 70 80 90 100 Target Network Bandwidth (BT ) 10−5 10−4 10−3 10−2 10−1 100 Pa ck et D ro p R at e (% ) (a) Network throughput and packet drop rate 10 20 30 40 50 60 70 80 90 100 Target Network Bandwidth (BT ) 0.0 0.2 0.4 0.6 0.8 1.0 M em or y B an dw id th U til iz at io n NetApp-T MApp (b) Memory banwidth utilization Figure 3.11: hostCC consistently maintains its benefits across varying network target bandwidth BT . Discussion in §3.4.3. when DDIO is enabled, NetApp-T is able to sustain higher average throughput at a lower host-local response level; hence, MApp experiences smaller amount of backpressure to achieve the target network bandwidth. Figure 3.10 shows benefits of hostCC in terms of latency for NetApp-L using the same setup as in Figure 2.11, but with DDIO enabled. We observe that la- tency inflation without and with hostCC is identical to that of Figure 3.7; this is because both DDIO enabled and disabled cases observe similar level of packet drop rates for 3× degree of host congestion (as shown in Figure 2.4), leading to similar tail latency inflation for NetApp-L (as discussed in §2.2). 64 3.4.3 hostCC sensitivity analysis hostCC has only two parameters BT and IT . Figure 3.11 shows that hostCC con- sistently achieves benefits in terms of maintaining target network bandwidth and minimal packet drop rates for all values of BT (while only applying as much backpressure on MApp as needed to maintain target network bandwidth). The drop rates are particularly low for small values of BT ; this is because the arrival rate of packets at the NIC is smaller than the PCIe bandwidth utilization (that is, the rate at which packets are drained from the NIC buffer). To see this, recall from Figure 2.4 that, even without hostCC, network traffic achieves ∼43Gbps throughput at 3× degree of host congestion; thus, average PCIe utilization must be at least 43Gbps. Here, hostCC maintains average network throughput less than 40Gbps, resulting in NIC buffers rarely filling up and no packet drops. We also observe low drop rates with large values of BT . As discussed in §3.2.2, NIC buffer build up depends on R − BT , larger values of BT gives network traffic more time to converge to the right rate; since hostCC ensures the average PCIe bandwidth utilization remains close to BT , the time it takes for NIC buffer to fill up reduces with increasing BT . Figure 3.12 shows hostCC performance with varying IT values: increasing IT leads to an increasingly delayed reaction to the onset of host congestion, leading to larger packet drops and higher MApp throughput. 3.4.4 Deep dive into hostCC performance We now provide more insights into hostCC’s performance. 65 70 75 80 85 900 20 40 60 80 100 N et w or k T hr ou gh pu t (G bp s) 70 75 80 85 90 Target IIO Threshold (IT ) 10−5 10−4 10−3 10−2 10−1 100 Pa ck et D ro p R at e (% ) (a) Network throughput and packet drop rate 70 75 80 85 90 Target IIO Threshold (IT ) 0.0 0.2 0.4 0.6 0.8 1.0 M em or y B an dw id th U til iz at io n NetApp-T MApp (b) Memory bandwidth utilization Figure 3.12: (a) Increasing IT values lead to increasing drop rates due less ag- gressive hostCC reaction to congestion. (b) MApp acquires larger memory band- width with larger IT due to less aggressive backpressure. Necessity of the three hostCC ideas. Figure 3.14 demonstrates that each of the three key technical ideas in the hostCC architecture—generating host con- gestion signals at sub-µs granularity, sub-RTT host-local congestion response, and network resource allocation based on both host and network congestion signals—contribute to hostCC’s performance. In particular, without host-local congestion response, it is possible to mini- mize packet drop rates but only at the cost of degraded throughput: network traffic achieves merely ∼28Gbps of throughput. To explain the root cause for this observation, we plot in Figure 3.14(b) measured IIO occupancy and PCIe bandwidth utilization (including the PCIe-level overheads that turn out to be ∼5% with 4K MTU and hardware default TLP size) with time for a 1000µs hori- zon for the case of 3×memory contention. We observe that IIO occupancy often increases beyond IT = 70, indicating a possible onset of host congestion (when- ever the network traffic increases using AIMD; network CC reacts to this by reducing rate, thus bringing down the host congestion and drop rate, but also suffering from low throughput. On the other hand, without performing net- work resource allocation based on both host and network congestion signals, it 66 Echo congestion signals only Host-local response only Echo congestion signals + Host-local response 0 20 40 60 80 100 N et w or k T hr ou gh pu t (G bp s) Echo congestion signals only Host-local response only Echo congestion signals + Host-local response 10−5 10−4 10−3 10−2 10−1 100 Pa ck et D ro p R at e (% ) (a) Necessity of using both kinds of hostCC responses in tandem 0 200 400 600 800 1000 0 20 40 60 80 100 120 PC Ie B W (G bp s) 0 200 400 600 800 1000 Time (µs) 50 60 70 80 90 100 II O O cc up an cy (b) hostCC behavior when it only echoes congestion signals 0 200 400 600 800 1000 0 20 40 60 80 100 120 PC Ie B W (G bp s) 0 200 400 600 800 1000 Time (µs) 50 60 70 80 90 100 II O O cc up an cy (c) hostCC behavior with only host-local congestion response 0 200 400 600 800 1000 0 20 40 60 80 100 120 PC Ie B W (G bp s) 0 200 400 600 800 1000 Time (µs) 50 60 70 80 90 100 II O O cc up an cy (d) Default hostCC behavior with both responses in tandem Figure 3.13: Each of the three key technical ideas in the hostCC architecture— generating host congestion signals at sub-µs granularity, sub-RTT host-local congestion response, and network resource allocation based on both host and network congestion signals—contribute to hostCC’s performance. Discussion in §3.4.4. is possible to achieve high throughput but at the cost of large packet drop rates. Figure 3.14(c) demonstrates the reason for this observation: IIO occupancy IS frequently saturates to the maximum value of ∼93, indicating NIC buffer build- up and subsequent packet drops at the host. Figure 3.14(d) shows that, by carefully allocating both host resources (us- ing host-local congestion response) and network resources (using both host and network congestion signals), hostCC is able to simultaneously achieve high 67 Echo congestion signals only Host-local response only Echo congestion signals + Host-local response 0 20 40 60 80 100 N et w or k T hr ou gh pu t (G bp s) Echo congestion signals only Host-local response only Echo congestion signals + Host-local response 10−5 10−4 10−3 10−2 10−1 100 Pa ck et D ro p R at e (% ) (a) Necessity of using both kinds of hostCC responses in tandem 0 200 400 600 800 1000 0 20 40 60 80 100 120 PC Ie B W (G bp s) 0 200 400 600 800 1000 Time (µs) 50 60 70 80 90 100 II O O cc up an cy (b) hostCC behavior when it only echoes congestion signals 0 200 400 600 800 1000 0 20 40 60 80 100 120 PC Ie B W (G bp s) 0 200 400 600 800 1000 Time (µs) 50 60 70 80 90 100 II O O cc up an cy (c) hostCC behavior with only host-local congestion response 0 200 400 600 800 1000 0 20 40 60 80 100 120 PC Ie B W (G bp s) 0 200 400 600 800 1000 Time (µs) 50 60 70 80 90 100 II O O cc up an cy (d) Default hostCC behavior with both responses in tandem Figure 3.14: Each of the three key technical ideas in the hostCC architecture— generating host congestion signals at sub-µs granularity, sub-RTT host-local congestion response, and network resource allocation based on both host and network congestion signals—contribute to hostCC’s performance. Discussion in §3.4.4. throughput and low packet drops rates. Using the host-local congestion re- sponse, hostCC is able to modulate host resources allocated to network traffic in a manner that NIC queue buffer buildup can be avoided (as suggested by smaller IIO occupancy immediately upon crossing the IT threshold) until net- work traffic converges to the right throughput using both host and network congestion signals. Understanding example hostCC steady-state behavior. Figure 3.15 shows a 68 0 50 100 150 200 250 Time (us) 60 65 70 75 80 85 90 95 100 PC Ie B an dw id th (G bp s) (a) PCIe Write Bandwidth 0 50 100 150 200 250 Time (us) 0 1 2 3 4 R es po ns e L ev el (b) Host-local Response Level 0 50 100 150 200 250 Time (us) 50 60 70 80 90 100 II O O cc up an cy (c) IIO Occupancy Figure 3.15: In steady-state, hostCC keeps PCIe bandwidth close to the target network bandwidth, while ensuring that IIO occupancy remains smaller than the congestion threshold. Discussion in §3.4.4. snapshot of hostCC’s behavior over a 250µs time horizon. Figure 3.15(a) shows the measured PCIe bandwidth utilization with time for BT = 80Gbps (includ- ing PCIe-level overheads, this amounts to 84Gbps, denoted by the green line). We note from Figure 3.4 that PCIe bandwidth utilization lies between the host- local response levels 3 and 4 (which provide ∼77Gbps and 100Gbps throughput, respectively). Therefore, as expected, hostCC host-local congestion response os- cillates between levels 3 and 4, ensuring that the measured PCIe bandwidth uti- lization remains close to BT in Figure 3.15(a). The switches across levels happen in accordance with the host-local congestion response logic in §3.2.2: hostCC switches from level 3 to 4 when the IIO occupancy goes higher than IT (denoted by red line in Figure 3.15(c)) and the PCIe bandwidth is still lower than BT ; and switches back to level 3 when PCIe bandwidth has increased beyond BT and IIO occupancy is again lower than IT . 69 3.5 Summary hostCC is a congestion control architecture that handles host congestion, along with network fabric congestion. hostCC achieves this using three key ideas— generation of host congestion signals, a sub-RTT host-local congestion response that uses host congestion signals to allocate host resources between network and host-local traffic, and using host and network congestion signals to perform network resource allocation at RTT granularity. We have realized hostCC within the Linux network stack without any modifications in applications, host hard- ware, and/or network hardware. hostCC evaluation on our testbed demon- strates that it provides significant performance benefits under host congestion, and interpolates well with existing congestion control protocols under network fabric congestion. 70 CHAPTER 4 RETHINKING IO MEMORY PROTECTION In this chapter we discuss the inefficiencies within the host network introduced by enabling memory protection against untrusted NICs. We show that the exist- ing memory protection mechanisms (falsely) assume that they need to give up on their security guarantees in order to alleviate these inefficiencies and achieve high application performance. We present design, implementation, and evalu- ation of F&S, a simple modification to existing memory protection mechanisms that enables strongest possible safety guarantees while near-completely elimi- nating performance overheads due to memory protection. 4.1 Overview Memory protection—ensuring that physical memory allocated to an applica- tion cannot be accessed and/or modified by another application—is a classical problem in operating systems and computer architecture. Memory manage- ment units (MMUs), since their inception in the 1970s, have enabled applica- tions running on processors to achieve memory protection. However, a buggy or malicious network interface card (NIC) could still execute errant transfers into the host memory. This led to the introduction of Input-Output Memory Management Unit (IOMMU) in the mid-2000s. An IOMMU operates very sim- ilar to MMU; we describe the IOMMU-enabled datapath from the NIC to the application in Chapter 2, but briefly: with IOMMU enabled, the operating sys- tem assigns NICs so-called IO virtual addresses (IOVAs), that are used by the NIC to initiate direct memory access (DMA) to the host memory; for every DMA 71 initiated by the NIC, the IOMMU uses a host memory resident IO page table to translate NIC-visible IOVA to host physical address that is used for the final data transfer. These address translations are sped up using a special IOTLB hardware (that caches frequently used IO page table entries). Upon an IOTLB hit, the address translation incurs near-zero cost; however, upon an IOTLB miss, an IO page table walk is performed to perform the address translation. As already illustrated in Chapter 2, enabling memory protection against un- trusted peripherals using IOMMU can have significant impact on application- layer performance. Consider a server with 100Gbps NICs and 128Gbps PCIe. To provide the strongest safety properties (referred to as strict mode in modern op- erating systems), each IOVA must be unmapped, and the corresponding IOTLB entry invalidated immediately after the DMA; thus, one must incur at least one IOTLB miss every (huge)page worth of data transfer and perform 4 sequential memory accesses per miss for IO page table access in the worst case. Even with an optimistic estimate of 100ns memory access latency, this incurs ∼4× 100ns of IOMMU overhead. Thus, with modern PCIe devices (that allow buffering up to ∼100 cachelines at the root complex, the processor-side end of the PCIe de- vice [105, 9]), enabling IOMMU in the strongest safety mode essentially pushes PCIe to its limits—using Little’s Law [38], the maximum sustainable rate would be 100 × 64bytes/400ns = 128Gbps. In practice, the situation is much worse: memory access latencies are usually higher [35, 81, 36], PCIe has its own over- heads [105], IOTLB can incur multiple misses during the huge(page) worth of data transmission [17, 6], etc. Indeed, recent studies from large-scale production clusters [6] have shown that inefficiencies within state-of-the-art memory pro- tection mechanisms can result in significant throughput degradation, orders-of- magnitude tail latency inflation, and violation of isolation guarantees. 72 Sadly, the above analysis is so fundamental that it has left the community with a bleak outlook about memory protection—without rearchitecting mem- ory management in operating systems and/or without significant improve- ments in server hardware (e.g., larger buffers at root complex, lower memory access latencies, etc.), it appears impossible to provide strong safety proper- ties while maintaining high performance. As eloquently argued in [6], tech- nology trends suggest that server hardware improvements along these direc- tions are unlikely; thus, most recent work on high-performance memory pro- tection targets weaker safety properties [108, 93, 45, 29, 23, 94], rearchitecting memory management within the operating system [93, 94], and/or redesigning IOMMU hardware [91, 79, 55, 17]. Unfortunately, such weaker safety proper- ties and clean-slate modifications themselves introduce new security vulnera- bilities [92, 93, 73, 13, 102]. This state-of-affair leaves organizations with two equally expensive and painful options: (1) high-performance but unsafe sys- tems; or, (2) safe systems but with suboptimal performance. We explored the inefficiencies in modern memory protection mechanisms in Chapter 2 (§2.2). While reproducing the overheads of memory protection mechanisms were straightforward, our preliminary experiments during this ex- ploration led to some baffling results: we found that, in many cases, the total number of memory accesses for IO page table walks were less than 4× the num- ber of IOTLB misses! Diving deeper into this inexplicable and seemingly impos- sible phenomenon, we discovered an aspect of IOMMU hardware that has been ignored in all previous studies: IO page table caches [60] (in the hindsight, this should have been obvious, given that processor MMUs have had such caches for a long time now [24]). Specifically, IOMMU has hardware caches that store most frequently accessed IO page table entries from level1, level2, and level3 of the 73 IO page table (importantly, for reasons discussed in §2.1.3, level4 entries are not cached in IO page table cache). Upon an IOTLB miss, IOMMU first checks the cache for the corresponding IO page table level1 page—if a hit, the correspond- ing level2 page can be identified without an IO page table walk (thus, avoiding one memory access), and IOMMU now checks the cache for the correspond- ing IO page table level2 page, and so on. Indeed, in the worst-case scenario, IOMMU still needs to perform 4 memory accesses per IOTLB miss; however, in the best-case scenario, the IO page table cache enables IOMMU to perform just 1 memory access per IOTLB miss (for the corresponding level4 entry)! We present F&S, a simple modification to existing memory protection mech- anisms that breaks the aforementioned impasse: it provides strongest safety properties, and yet, near-completely eliminates their overheads. The key insight in F&S design is that, while IOTLB misses are unavoidable with the strongest safety properties, it is still possible to reduce the cost of address translation upon an IOTLB miss using IO page table caches within the IOMMU hardware. Indeed, we demonstrate that, by carefully allocating IOVAs (to maximize IO page table cache hit rate) and by carefully managing IO page table cache invalidations (to ensure safety), F&S is able to provide strongest safety properties with minimal overheads of memory protection. For instance, across all evaluated workloads, we find that F&S techniques result in 0 level1 misses, 0 level2 misses, and at most 0.054 level3 misses per (huge)page worth of data; thus, F&S requires at most 1+ 2× 0.054 = 1.108 memory accesses per (huge)page worth of data across all evaluated workloads. Using the analysis earlier, this allows F&S to already achieve a sustainable rate of 100×64bytes/1.108×100ns = 462Gbps throughput, far higher than the current and the next generation PCIe. Finally, F&S provides such powerful results without any modifications in host hardware and without 74 any modifications in operating systems memory management; it needs merely ∼300 lines of code changes within the IOMMU and network drivers (no changes necessary in memory management). 4.2 F&S In this section, we present the design of F&S, our Fast & Safe modification to the existing memory protection mechanism within Linux. F&S rethinks the ap- proach to perform IOMMU address translation while guaranteeing high appli- cation performance, by focusing on reducing the cost of each IOTLB miss, rather than (solely on) reducing the number of IOTLB misses. Our key insight is that while we cannot reduce the IOTLB miss rate beyond a single miss per page (nec- essary to enforce the strict IOMMU protection model [101, 17]), we can reduce the cost of each miss by improving the IOMMU page table cache performance. As demonstrated in §2.2, existing IOVA allocation mechanisms in Linux typ- ically result in poor IOVA locality, and hence incurring a large number of IO page table cache misses. This leads to a larger number of memory accesses for performing address translation upon each IOTLB miss, resulting in a higher la- tency cost per IOTLB miss and hence, degraded application throughput and latencies. Even for few evaluated scenarios with improved IOVA locality, we observe higher-than-expected latency costs for each IOTLB miss and degraded application performance. This is due to IOVA invalidations: as discussed in §2.2, invalidating an IOVA entry leads to deletion of corresponding entries not only from the IOTLB but also from the IO page table (to safely handle potential page table page reclamations, discussed in more details later in this section), 75 thus preventing the IO page table caches from re-using their entries for yet-to- be-invalidated IOVAs. Moreover, in addition to a higher latency-cost for each IOTLB miss, invalidations also lead to a higher CPU-cost per IOTLB miss. Upon each invalidation, the CPU core waits until successful invalidation of IOTLB and/or IO page table cache entries by the IOMMU hardware [108]. Therefore, larger invalidation requests also result in a degradation of per-core application throughput. F&S resolves the aforementioned challenges using three key design ideas in order to minimize the latency and CPU cost for each IOTLB miss. 1 Contiguous IOVA allocation for improved locality. Rather than perform- ing allocations at the default per-page granularity, F&S allocates a single and contiguous IOVA for all the pages within a descriptor (e.g., descriptors used by recent Mellanox drivers consist of 64 pages per descriptor). We utilize the context of IOVA allocation by networking applications in order to lever- age locality benefits using contiguous address allocation. Unlike the context of conventional Linux memory management (where the applications them- selves dictate the access pattern for virtual addresses), the IOVA access pat- tern for networking applications is completely predictable: IOVAs within a single descriptor are accessed in a sequential manner. Thus contiguous IOVA allocation by F&S ensures improved IOVA locality. By changing the IOVA allocation granularity, F&S no longer relies on the IOVA allocator’s per-core caches as the only means to achieve good lo- cality. We already demonstrated in §2.2 that per-core caches typically fail to provide good IOVA locality due to (i) multiple CPU cores contending for globally-shared red-black tree when IOVA working set size exceeds the 76 per-core cache size and (ii) interference of IOVA allocation/deallocation for ACK packets with payload packets for each CPU core (even when the IOVA working set size is smaller than the per-core cache size). 2 Selectively invalidating IOTLBs to retain benefits of IOVA locality. Upon each IOVA unmap operation, F&S only invalidates the corresponding IOTLB entry while preserving the IOMMU L1/L2/L3 cache entries. This simple change allows the IOMMU page table caches to take advantage of IOVA locality by allowing them to re-use the cached entries after IOVA in- validations. To ensure safety, IO page table caches require invalidations if any corre- sponding IO page table pages were reclaimed after unmap operation(s). To- day, the Linux IOMMU driver takes a safer (but potentially suboptimal) ap- proach of always invalidating the IO page table caches whenever an IOVA is unmapped. However, our key insight is that L1/L2/L3 page reclamation will never happen in the context of networking applications. Linux today only reclaims a page table page if there is a single operation unmapping the entire address range that it covers. To make this concrete, in order to reclaim an L4 page (thus invalidating the L3 entry that points to it), the entire 2MB range of IOVAs on that page must be unmapped in a single operation. To invalidate an L2 entry and reclaim an L3 page, the entire 1GB range of IO- VAs pointed to by that L2 entry must be unmapped in a single operation. In the networking context, the maximum size of an unmap operation in F&S is 64 IOVAs: this could never cause a reclamantion of an IO page table page1. Thus, an IOMMU page table cache entry will never become stale2. 1Page table page reclamation is likely to happen when the IOMMU is used for virtualization. In this context, large parts of the address space are mapped and unmapped at once. 2If there is a context where the OS must change the IO page table structure (such as a driver 77 3 Batched invalidations to minimize CPU cost for IOTLB misses. The final design idea of F&S is to leverage the contiguous IOVA allocation to invali- date the entire IOVA range mapped to a descriptor with only a single inval- idation request to the IOMMU. Such a batched invalidation amortizes the CPU cost of a single invalidation request across the number of pages within a single descriptor (e.g., 64 pages per descriptor for Mellanox NICs). 4.3 F&S Implementation We implement F&S within the Linux kernel v6.0.3. We currently support recent Mellanox NICs (CX-5 [97] onwards), which operate with 64-page descriptors by default. F&S implementation uses ∼300 LOC, requires no hardware mod- ifications, no interface changes to the IOVA allocator and the IOMMU driver, and minimal changes to the network driver. We next discuss some of the most interesting implementation details for each of the three key F&S design ideas. Contiguous IOVA allocation. Figure 4.1a shows the default IOVA allocation mechanism in Linux. In order to DMA map a descriptor, for each page, the OS allocates a 4KB IOVA and then makes a call to the IO page table map function, specifying the starting address of the IOVA, the starting address of the page, and a 4KB length. In the same order that physical pages are accessed in the de- scriptor, the OS maps each page to a contiguous page-sized chunk of the IOVA. F&S, on the other hand, first allocates a contiguous descriptor-sized chunk of the IOVA, as shown in Figure 4.1b. However, F&S does not change the mapping that unmaps at >512 IOVA granularity), there exists a simple solution. At the time of the IOTLB invalidation for that IOVA, check if any page table pages will be reclaimed, and if so, then invalidate the page table caches. This simple additional conditional statement is unlikely to add any significant overhead. 78 granularity—each physical page in the descriptor is still mapped to a contigu- ous page-sized segment of the IOVA range. This idea of allocating a large IOVA but mapping individual physical pages has two benefits: (1) it does not require changes to the underlying hardware, and (2) it ensures good locality within a descriptor: in the worst case, a single descriptor may experience two L3 page table misses.3 This corresponds to .031 L3 misses per page-sized packet DMAed in the worst case. Selectively invalidating IOTLBs. As discussed in 2.1.3, IOMMU provides an invalidation queue as the interface to the IOMMU driver to request IOVA in- validations. The invalidation queue interface specifies an IOVA address, the size of the IOVA, and importantly, an option to only invalidate the IOTLB entry while preserving the IO page table cache entries. This option is not enabled in Linux by default, as it could potentially lead to stale page table cache entries if the page table structure changes (i.e., an L4 page table page is reclaimed, and so the page’s corresponding L3 entry is updated). To preserve IOMMU caches, F&S uses the existing invalidation queue interface and sets the specific bit to prevent the invalidation of associated page table caches when flushing IOTLB entries. Batched invalidations. F&S issues a single request to the invalidation queue upon a DMA unmap call, specifying the contiguous descriptor-sized IOVA chunk for invalidation. Note that, however, without the use of multi-page de- scriptors, F&S cannot benefit from batching invalidations since each descriptor must be invalidated immediately after DMA completes. 3One for the beginning of the IOVA range, mapped to the first page in the descriptor (I in Figure 4.1b), and another one if the IOVA range spans multiple L4 pages; for example, if ‘I‘ is the last mapping in an L4 page, then ‘I‘+4096 is on another page, which has a different L3 entry. 79 OS IOVA allocator alloc (4KB) 4KB Rx Descriptor (P1,P2,...,Pn) I Map() length = 12 bits (4KB) P1 IO Page Table [ I P] 4KB (a) Linux IOVA allocation and IO page table mapping. OS IOVA allocator alloc (256KB) 256KB I Map()P1 I + 4096 I + 2* 4096 Map()P2 length = 12 bits (4KB) length = 12 bits (4KB) Rx Descriptor (P1,P2,...,Pn) Map()P3 length = 12 bits (4KB) IO Page Table [ I P]4KB [ I ]P2 + 4096 4KB [ ]I P3+ 2* 4096 4KB (b) F&S IOVA allocation and IO page table mapping. Figure 4.1: F&S achieves contiguous IOVA mappings without requiring changes to IOMMU hardware, IOVA allocator, and IOMMU driver. (a) This process re- peats for each page in the Rx descriptor, requiring calls to both the IOVA allo- cator and IO page table map. (b) A single 256KB IOVA is allocated and then mapped in page-size chunks to the physical pages in the Rx descriptor. F&S requires only one call to the IOVA allocator and the same number of calls to the IO page table map. 4.4 F&S Evaluation In this section, we demonstrate that F&S near-completely eliminates memory protection overheads demonstrated in §2.2. We use the same testbed setup as described in §2.2. Our evaluation demonstrates that: 80 • With IOMMU enabled, F&S allows Linux to consistently match the appli- cation throughput and latency performance as with the IOMMU disabled scenario. F&S achieves its benefits even with the strictest memory protec- tion model, while incurring minimal CPU overheads, across a wide vareity of network and host parameters (such as number of flows, ring buffer sizes, MTU sizes, core counts, and with DDIO enabled or disabled). • F&S alleviates memory protection overheads by reducing the latency and CPU cost of IOTLB misses. By reducing the memory protection overheads, F&S also improves packet drop rates, thereby generating fewer ACKs, which in turn also improves IOTLB hit rates and help in further reducing the over- heads. • In the presence of memory bandwidth contention, by integrating with hostCC (discussed in Chapter 3), F&S improves application performance by resolving both memory bandwidth and IOMMU contention. 4.4.1 F&S Benefits F&S eliminates memory protection overheads by reducing the cost of each IOTLB miss (while providing strictest safety properties). Figure 4.2 shows the F&S results for increasing number of flows (same scenario as in Figure 2.8). F&S matches IOMMU-disabled performance by resolving the inflated L1/L2/L3 cache miss rates, resulting in >10x reduction in L3 misses per de- scriptor in all cases and bringing L1/L2 cache misses to 0. F&S also alleviates the IOTLB contention, resulting in almost 2x fewer IOTLB misses in the 40 flow case. Similar performance improvements are also observed across varied ring 81 5 10 20 40 Number of Active Flows 0 20 40 60 80 100 Ne tw ork T hro ug hp ut (G bp s) IOMMU Off IOMMU On IOMMU On + F&S (a) Network throughput 5 10 20 40 Number of Flows 0 2 4 Pa ck et Dr op R ate (% ) IOMMU Off IOMMU On IOMMU On + F&S (b) Packet drop rate 5 10 20 40 Number of Active Flows 0x 1x 2x 3x Re du cti on in IO TL B M iss es (c) Reduction in IOTLB misses 5 10 20 40 Number of Active Flows 0x 5x 10x 15x 20x 25x Re du cti on in IO M M U Ca ch e M iss es L1/L2 L3 ACKs 0x 5x Re du cti on in N um be r o f A CK s (d) Reduction in L1/L2/L3 misses 0 100 200 300 400 500 Sample Sequence of IOVA Accesses 0 2 4 6 8 10 12 14 Di st in ct L 3s A cc es se d Be fo re R ep ea tin g 40 flows 5 flows (e) IOVA access pattern Figure 4.2: F&S eliminates memory protection overheads by reducing cost of each IOTLB miss. (a) F&S enables IOMMU on scenario to match the perfor- mance of IOMMU off, even as the number of flows increases. (b) By addressing the IOMMU contention, F&S significantly lowers packet drop rates caused by queueing at the NIC buffers. (c) F&S reduces the number of IOTLB misses per descriptor up to 2x by minimizing packet drops, hence decreasing the number of transmitted ACKs. (d) F&S eliminates L1/L2 misses and substantially re- duces L3 misses by preserving IO page table caches during invalidation and allocating contiguous IOVAs. (e) F&S enhances the locality and thus reduces the number of L3 cache misses. See §4.4.1 for more discussion. buffer size (Figure 4.3), MTU size and CPU core count (Figure 4.4), and DDIO enabled (Figure 4.5) cases. Understanding F&S benefits. Let us focus on understanding the performance improvements with increasing number of flows (Figure 4.2), but similar insights also apply to other cases (such as with varying ring buffer sizes, in Figure 4.3). 82 256 512 1024 2048 Ring Buffer Size 0 20 40 60 80 100 Ne tw ork T hro ug hp ut (G bp s) IOMMU Off IOMMU On IOMMU On + F&S (a) Network throughput 256 512 1024 2048 Ring Buffer Size 0.0 0.5 1.0 Pa ck et Dr op R ate (% ) IOMMU Off IOMMU On IOMMU On + F&S (b) Packet drop rate 256 512 1024 2048 Ring Buffer Size 0x 1x 2x 3x Re du cti on in IO TL B M iss es (c) Reduction in IOTLB misses 256 512 1024 2048 Ring Buffer Size 0x 5x 10x 15x 20x 25x Re du cti on in IO M M U Ca ch e M iss es L1/L2 L3 ACKs 0x 5x Re du cti on in N um be r o f A CK s (d) Reduction in L1/L2/L3 misses 0 100 200 300 400 500 Sample Sequence of IOVA Accesses 0 2 4 6 8 10 12 14 Di st in ct L 3s A cc es se d Be fo re R ep ea tin g 2048rb 256rb (e) IOVA access pattern Figure 4.3: F&S maintains good locality, resulting in low L3 cache misses even as the IO working set sizes increase. With IOMMU enabled and increasing ring buffer sizes, we observe that F&S achieves several key performance metrics: (a) it matches the performance of systems with IOMMU disabled; (b) experiences no packet drops; (c) maintains a consistent number of IOTLB misses; (d) in- curs low IOMMU cache misses; and (e) demonstrates good locality in L3 page caches. See §4.4.1 for more discussion. As discussed in §2.2, without F&S, the IOTLB misses increase with the number of flows due to a higher drop rate and thus more ACKs. Figure 4.2 shows that by reducing memory protection overheads, F&S improves the drop rate, leading to a reduction in the number of ACKs. Consequently, this decrease in ACK interference results in fewer IOTLB and IO page table cache misses. To minimize page table cache misses, F&S preserves page table caches dur- ing IOTLB invalidations, enabling the IOMMU to fully benefit from enhanced 83 5 6 7 8 Cores 0 20 40 60 80 100 Ne tw ork T hro ug hp ut (G bp s) IOMMU Off IOMMU On IOMMU On + F&S (a) Varying number of cores. 1500B 4000B 9000B MTU Size 0 20 40 60 80 100 Ne tw ork T hro ug hp ut (G bp s) IOMMU Off IOMMU On IOMMU On + F&S (b) Varying MTU sizes Figure 4.4: F&S enables IOMMU on to match the performance of IOMMU off with varied number of cores and MTU sizes. See §4.4.1 for more discussion. 5 10 20 40 Flows 0 20 40 60 80 100 Th rou gh pu t ( Gb ps ) IOMMU Off IOMMU On IOMMU On + F&S Figure 4.5: F&S resolves IOMMU contention with DDIO on for varying number of flows. Moreover, by improving CPU utilization, F&S is able to saturate the link bandwidth. See §4.4.1 for more discussion. locality, as illustrated in Figure 4.2(b). This approach improves the L3 cache miss rate, and eliminates the L1/L2 miss rates. With preservation of page ta- ble caches, even for very large working sets, L1 and L2 caches will observe no misses; typically, the working set would need to be 8 million pages to see L2 misses and over 4 billion pages to see L1 misses. Today, even with configura- tions such as a 9k MTU, a 4k ring buffer size, and 128 cores, we expect to see ∼2.25 million active pages for Rx datapath4. By using contiguous IOVA allocations at a descriptor-sized granularity F&S eliminates the poor locality resulting from successive IOVA allocator calls at page-sized granularity. This approach allows F&S to maintain good locality 4Furthermore, at network speeds of 1 Tbps and assuming a round trip time of 100us—which is 10x higher than recent predictions[7]—the Bandwidth-Delay Product (BDP) for all flows cor- responds to only about 3k active pages. Therefore, even with scaling network speeds, we do not expect to see L1/L2 cache misses with F&S. 84 0 0.1 0.2 0.3 0.4 0.5 0.6 app sys call dat a co py tra nsp ort /IP p roc net dev ice sub sys tem skb mg mt me mo ry a lloc /de allo c loc k/u nlo ck sch edu ling iom mu etc . C PU U til (% ) 256 ring buffer sizes 2048 ring buffer sizes (a) Hardware Prefetching Enabled (Default) 0 0.1 0.2 0.3 0.4 0.5 0.6 app sys call dat a co py tra nsp ort /IP p roc net dev ice sub sys tem skb mg mt me mo ry a lloc /de allo c loc k/u nlo ck sch edu ling iom mu etc . C PU U til (% ) 256 ring buffer sizes 2048 ring buffer sizes (b) Hardware Prefetching Disabled Figure 4.6: The CPU breakdown results indicate that data copy overheads in- crease with the number of ring buffer sizes due to ineffective hardware prefetch- ing. Results shown for (a) hardware prefetching enabled, and (b) hardware prefetching disabled respectively. even as IO working set sizes increase, such as with larger ring buffer sizes. As discussed peviously in §4.3, this ensures that every descriptor will result in al- location of at most two unique L3 entries in the worst-case. For majority cases however, all IOVAs within a descriptor will share the same L3 entry. This is ev- ident from the Figure 4.2(e): the spikes in the figure occur at descriptor bound- aries. Since IOVAs are allocated at descriptor granularity, there is no guarantee the subsequent descriptor’s IOVAs will have the same L3 entry. However, the spikes are small enough to likely not cause contention in the L3 cache. 85 104 105 106 IOMMU Off IOMMU On IOMMU On and F&S 128B 512B 2KB 8KB 32KB RPC Size 50 100 150 200 250 300 La ten cy ( s) Figure 4.7: F&S brings DCTCP latency closer to IOMMU off case. See §4.4.1 for more discussion. 0x 1x 2x 3x Degree of Memory Contention 0 20 40 60 80 100 Ne tw ork T hro ug hp ut (G bp s) Linux Linux + hostCC (a) IOMMU off. 0x 1x 2x 3x Degree of Memory Contention 0 20 40 60 80 100 Ne tw ork T hro ug hp ut (G bp s) Linux Linux + hostCC Linux + hostCC + F&S (b) IOMMU on. Figure 4.8: hostCC and F&S work together to resolve both memory bandwidth and IOMMU contention. See §4.4.1 for more discussion. A quick note about throughput degradation (with memory protection) in Fig- ure 4.3. Despite F&S resulting in similar IOMMU cache performance for both 256 and 2048 ring buffers, we observe a decline in throughput with larger ring buffer sizes due to an increased CPU cache miss rate (increased by 1.47x), in- dependent of memory protection overheads. CPU profiling reveals that an in- crease in cache miss rates leads to greater overhead from data copying (Fig- ure 4.6), which degrades throughput as the CPU becomes the bottleneck. The rise in cache miss rates is associated with a decrease in the efficiency of CPU cache hardware prefetching for larger ring buffer sizes. Notably, disabling hard- ware prefetching eliminates the CPU cache miss rate difference and narrows the throughput gap between small and large ring buffer sizes, despite the observed increase in cache miss rates and decrease in throughput for both scenarios. We 86 leave the investigation of the underlying reasons for large ring buffers leading to inefficient hardware prefetching for future work. F&S improves tail latency for latency-sensitive applications. Using the same setup as Figure 2.9, F&S improves the tail latency of the L-appwhen IOMMU is enabled and leads to minimal latency inflation compared to the scenarios where the IOMMU is disabled. This improvement is primarily due to reduced queue- ing delay at the NIC buffer, and reduced retransmission delay due to lower rate of packet drops. F&S performance in presence of memory contention. Figure 4.8 shows how F&S and hostCC integrate with each other to resolve both IOMMU and mem- ory bandwidth contention. We introduce memory bandwidth contention in an identical manner as in §2.2, by increasing the number of M-app cores. Figure 4.8a shows that with memory contention only, hostCC is able to resolve mem- ory bandwidth contention and reach the target application-level throughput of 80Gbps, independent of the degree of memory bandwidth contention. How- ever, Figure 4.8b shows that, with IOMMU on, without F&S, hostCC fails to achieve the target rate because of increased PCIe latency induced by address translation overheads. By reducing these overheads, F&S with hostCC achieves throughput close to the target. 4.4.2 Necessity of each design idea in F&S Figure 4.9 shows the importance of each design idea in F&S. In particular, we plot the throughput for 3 configurations varying ring buffer sizes: default 87 256 512 1024 2048 Ring Buffer Size 0 50 100 Ne tw ork T hro ug hp ut (G bp s) Default Default + Preserve F&S 256 512 1024 2048 Ring Buffer Size 0.0 0.5 1.0 1.5 2.0 M iss es pe r P ag e IOTLB L1 L2 L3 Figure 4.9: Contribution of each of the F&S key design ideas. As ring buffer size increases we observe: (a) the throughput drops for the case of “default + preserve” (b) as ring buffer size increases, despite preserving the L3 cache, we still see misses due to the larger IO working set size. See §4.4.2 for more discussion. Linux, Linux but with IOMMU page table caches preserved, and Linux but the IOMMU page tables are preserved and IOVAs allocated contiguously, namely F&S. As illustrated in Figure 4.9 (a), preserving page table caches is sufficient for good performance with small ring buffer sizes. However, as the ring buffer sizes increase, leading to a larger IO working set size, performance degrades, highlighting the necessity of allocating IOVAs contiguously. Figure 4.9 (b) shows the cache performance for the case where the IOMMU page table caches are preserved. For small ring buffers with good locality, pre- serving the page table caches removes all L1/L2/L3 page table cache misses originating from invalidations (as described in §4.4.1), leading to minimal trans- lation overheads. However, as the locality gets worse, with increasing ring buffer sizes, the number of L3 page table cache misses increases, leading to a drop in throughput. As evident from Figure 4.3, upon improving the IOVA lo- cality with F&S, the L3 page table cache misses reduce, removing the source of address translation overhead. 88 4.5 Summary In this chapter, we presented F&S, a mechanism to provide memory protection against buggy and/or untrusted NICs while achieving near-optimal application performance, under strictest possible memory protection guarantees. Classical literature on reducing memory protection overheads have thus far focused pri- marily on reducing the number of IOTLB misses. We however discuss, and provide evidence that providing memory protection in modern Linux under strict protection model necessitates incurring at least a single IOTLB miss (per page-worth data). We then show that although it is hard to reduce the number of IOTLB misses, we can reduce the cost of each miss by improving the IO page table cache performance. F&S leverages this insight and reduces the latency and CPU cost for IOTLB misses using its three key ideas: allocating IOVA contigu- ously, selectively invalidating IOTLBs, and batching invalidations. Our evalu- ations demonstrate that Linux achieves performance close to IOMMU-disabled scenario using F&S, under the IOMMU strict protection model. 89 CHAPTER 5 REARCHITECTING STORAGE STACKS In this chapter, we discuss the assumptions in the design of existing storage stacks that break under host congestion. We then present the design, imple- mentation, and evaluation of storageCC, a new architecture for storage stacks that enable near-optimal application performance independent of the presence and amount of host congestion within the storage datapath. 5.1 Overview Computer systems, especially those that employ queues, have an ideal operat- ing point in terms of parallelism—a “knee” point; operating the system below the knee point results in sub-optimal throughput, and operating the system be- yond the knee point results in inflated latency but with marginal or no gain in throughput. Finding the right amount of parallelism for a system therefore, of- ten requires intricate understanding of the underlying software and hardware architecture, the workload under which the system is expected to operate, and how various tunable parameters define systems performance. Identifying the right parallelism is particularly challenging for storage ap- plications running atop modern non-volatile memory express (NVMe) attached solid state drives (SSDs). These devices can support low latency (tens of mi- croseconds) and high throughput (tens of gigabytes per second); however, iden- tifying the knee point turns out to be extremely challenging: it depends on device characteristics (type of flash, number of channels, bus frequency, de- 90 vice controller processing capabilities, etc.), software characteristics (data lay- out, driver software architecture, garbage collection processing, IO scheduler, etc.), and workloads (read/write ratios, random/sequential request patterns, write amplification, etc.). Unsurprisingly, there is a long and active body of re- search in rethinking storage hardware [85, 133, 83, 136, 34, 37, 69, 56, 71]; inter- faces [74, 75, 138, 27, 123]; and IO schedulers providing fairness (from classical disk-based storage systems [53, 51, 52, 2, 3] to high-speed storage [40, 90]), QoS guarantees [5, 54, 119], and isolation [57, 124, 76, 4]. The overarching goal of most of these systems has been to enable storage applications to operate at or close to the knee point. The challenge, however, is that all these works assume that performance bottlenecks are either at the storage device or the CPU. The above assumption are no longer true in the presence of host congestion, leading to bottlenecks within the storage datapath that are not the CPUs or the storage devices. We present storageCC, a new storage architecture that enables storage stacks to operate at the knee point in the presence of host congestion. The key insight in storageCC is that the end-to-end storage datapath is a special low-latency high- throughput “lossless” network that comprises of end-points (CPU and storage devices), switches (root complex, memory controller, caching and home agents, etc.), and network links (peripheral interconnect, memory bus, processor in- terconnect). Building upon this conceptual insight, storageCC extends existing storage stacks to incorporate a storage congestion control mechanism that enables the storage stack to dynamically adapt the number of in-flight requests to the underlying storage device for each individual application, based on available capacity along the storage datapath. Similar to transport designs in computer networks, storageCC ensures that the number of in-flight requests is the min- 91 imum of a flow control window size (that accounts for bottlenecks at the end- points, and is decided using any of the techniques from prior work) and the congestion control window size (that accounts for bottlenecks within the storage datapath, and is decided using storageCC’s storage congestion control mecha- nism). storageCC architecture thus enables applications to operate at the knee point corresponding to bottlenecks anywhere within the end-to-end storage dat- apath, even when the knee point shifts over time due to dynamically changing host network bottlenecks. The storage congestion control mechanism used in storageCC exploits the many properties of the storage datapath. On the one hand, storageCC ex- ploits the lossless nature of the storage datapath to design an extremely sim- ple and CPU-efficient congestion control protocol: unlike computer networks, storageCC does not need to deal with data drops, data corruption, data re- transmission, hardware failures, or even out-of-order data delivery (e.g., due to asynchronous IO interface). On the other hand, storageCC resolves unique challenges in terms of identifying and detecting host network bottlenecks, and converging to the right parallelism for storage datapath. For instance, for hop- by-hop credit-based flow control mechanisms used by host hardware protocols (§2.1.2), non-zero buffer occupancy is only a necessary but not a sufficient con- dition for detecting host network bottlenecks; furthermore, the maximum avail- able buffer sizes at hardware devices along the storage datapath are extremely small. Put together, these two constraints leave very small room in buffer oc- cupancy that corresponds to a host network contention scenario. storageCC re- solves this challenge using a key insight derived from the unique characteristics of the storage datapath: unlike computer networks, where congestion signals (such as switch buffer occupancies) can be delayed by network-layer round trip 92 time (RTT) or even more, storageCC can access buffer occupancies within the host network hardware with a delay that is a tiny fraction of end-to-end stor- age datapath round trip time (by accessing hardware queue occupancy on an on-demand basis). We will show that this enables a storage congestion control mechanism that is able to quickly detect and react to host network bottlenecks. We demonstrate that storageCC architecture, even with existing commod- ity hardware and even with unmodified application interfaces, can be realized with minimal overheads. To that end, we realize storageCC architecture within the Linux kernel block layer and evaluate it across a wide variety of scenar- ios involving memory interconnect bottlenecks. Our evaluation suggests that, without any experiment-specific fine-tuning, storageCC enables storage appli- cations to operate at or close to the experiment-specific knee point. These ben- efits come at the cost of minor CPU overheads (less than 4.2% across all evalu- ated scenarios) in case when bottlenecks are at CPUs/storage devices; however, these overheads are purely due to limitations of existing hardware in terms of providing efficient mechanisms to collect host hardware queue occupancies at fine-grained timescales. We envision that, as the computer systems and com- puter architecture communities build a deeper understanding of emerging host network bottlenecks, future hardware architecture and operating systems in- terfaces will evolve to enable even more efficient realization of the storageCC architecture. 93 5.2 storageCC This section introduces storageCC’s design. We begin with an overview of storageCC (§5.2.1), followed by an explanation of how storageCC detects the need to adjust parallelism in the presence of memory interconnect bottlenecks (§5.2.2). Finally, we discuss how storageCC converges to the optimal level of parallelism in such scenarios (§5.2.3). 5.2.1 storageCC Overview storageCC extends the block layer in storage stacks to address emerging host network bottlenecks within the storage datapath, in addition to traditional CPU/SSD bottlenecks. At a high-level, it intercepts application IO requests, delays the delivery of already generated requests to the underlying storage de- vices, and additionally provides backpressure to the applications in a manner that the in-flight requests never exceed the desired amount of parallelism. Im- portantly, storageCC makes no changes to the application interface and ensures compatibility for both synchronous and asynchronous IO. It enables storage stacks to quickly detect and converge to the right amount of parallelism for each application, depending upon the degree of contention within the storage datapath. This capability allows applications to achieve high throughput, low latencies, and alleviates the throughput starvation problem discussed in §2.3. storageCC draws inspiration from congestion control (CC) mechanisms (usually studied in networking literature [139, 88]) in order to detect and con- verge to the desired parallelism. Conceptually, CC protocols are designed to 94 solve a similar problem: they maintain in-flight data to maximize the network throughput while minimizing network latency. CC protocols achieve this ob- jective by detecting in-network congestion using congestion signals (e.g., switch buffer occupancies), and reacting to congestion via a congestion response (e.g., by maintaining a congestion window, and ensuring in-flight data never exceeds this window). Similar to computer networks, buffer occupancies within the storage datapath—e.g., IIO buffer and memory controller read/write pending queues (RPQs/WPQs)—act as congestion signals and help detect whether there is a need to alter the amount of parallelism for a storage application (§5.2.2). Simi- larly, ensuring that in-flight IO requests remain within a specified limit to avoid performance degradation serves as a congestion response, enabling storage stacks to converge to an ideal amount of parallelism (§5.2.3). However, unlike computer networks, buffers within the storage datapath are very small (few tens of cachelines as opposed to MBs), which presents a unique challenge in achieving both high throughput and low latency for stor- age applications. High throughput requires high utilization of IIO buffer, while low latency requires prompt detection as the IIO buffer nears capacity to avoid backpressure to the applications. storageCC addresses this challenge by lever- aging its key insight derived from the unique properties of the storage datapath: the queue build-up across the storage datapath buffers can be detected at much finer-grained, sub-µs timescales, unlike computer networks (§5.2.2). Addition- ally, storageCC leverages other advantageous properties of the storage datap- ath, such as lossless data transfer guarantees, to adapt CC protocols for storage stacks in a CPU-efficient manner (§5.2.3). 95 In addition to adjusting the parallelism of storage applications in the pres- ence of memory interconnect bottlenecks, one other possibility is to throttle the cores of M-apps using CPU schedulers or memory bandwidth allocation/QoS tools [9, 15, 121]. storageCC is orthogonal to such a solution, in that, our goal is to show that even in the worst case when we do not throttle the M-apps, we can handle the memory interconnect bottlenecks within the storage datapath. Our evaluation suggests that this really is the worst case scenario for us (§5.4). And when used in conjunction with the solutions which can throttle M-apps, we will still be able to converge to the right amount of parallelism. 5.2.2 Detecting host network bottlenecks Given the conceptual similarity between the storage datapath and (lossless) computer networks, storageCC leverages ideas from existing network CC mechanisms, which employ buffer occupancy as congestion signals (like Ex- plicit Congestion Notifications, or ECNs), to detect host network bottlenecks for storage applications. In particular, storageCC generates ECNs based on buffer occupancy at points of contention within the storage datapath (e.g., the IIO and the memory controller). If buffer occupancy at these points is greater than a fixed contention detection threshold (referred to as K), then an ECN is gener- ated. Upon observing generated ECNs, storageCC senses the need to reduce the parallelism in response to the contention within the datapath, and converges to the right rate (using mechanism discussed in §5.2.3). There is a well-known trade-off concerning K [14, 139, 88]. A value of K that is too close to the buffer capacity would result in larger buffer occupancy due to lesser available time to detect contention and converge to the right amount 96 of parallelism, increasing backpressure frequency and its associated problems discussed in §2.3; on the other hand, a value of K too far from the buffer capacity would result in unnecessary reduction in the parallelism, even in the absence of contention, leading to underutilization of CPUs and/or SSDs. Using ECNs in the context of storage datapath, therefore, presents a unique challenge: the maximum available buffer sizes at intermediate hardware de- vices within the storage datapath is very small—typically of the order of few tens of cachelines (< 50 cachelines for memory controllers, and < 100 cachelines for the IIO) (details in §5.3). Such small buffer sizes enforce a relatively large marking threshold K: e.g., we observe on our testbed that the IIO occupancy is close to 70% of the IIO buffer capacity when there is no congestion in the storage datapath, and the SSDs are fully utilizing the PCIe bandwidth. This is because host hardware protocols use a hop-by-hop credit-based flow control mechanism (as discussed in §2.1); for such credit-based flow control mechanisms, non-zero buffer occupancy is only a necessary but not a sufficient condition for conges- tion. Specifically, due to credit-based flow control mechanism, the buffer oc- cupancy is equal to the number of in-flight credits. Therefore, K must be set to a value larger than at least 70% of maximum buffer capacity to ensure that the buffer build-up is attributed to storage datapath contention. This leaves less time for detecting and converging to the right parallelism level without encoun- tering application backpressure. storageCC leverages the unique advantages of the storage datapath, where getting access to all buffer occupancies is much faster—storageCC can simply query hardware counters for individual queues on an on-demand basis. For instance, in §5.3, we show that existing commodity host hardware enables de- 97 tection of storage datapath congestion at a sub-µs scale (which is much smaller than a typical SSD unloaded IO completion time of ∼100µs) with minimal CPU overhead. Since most of the storage datapath latency is due to SSD access la- tency, storageCC is able to generate ECNs within a tiny fraction of the end-to- end storage datapath latency. Thus, storageCC is able to quickly detect and converge to the right amount of parallelism, allowing it to use larger values of K without unnecessarily delayed reactions. In addition, storageCC accesses oc- cupancies for all hardware queues within the storage datapath locally. Thus, ECNs can be generated entirely within the software, requiring no additional hardware support. We now detail the end-to-end mechanism to detect host network bottlenecks in the storage datapath using storageCC. For each queue along the storage dat- apath, storageCC specifies a queue-specific threshold that depends on its buffer capacity and buffer occupancy under no memory interconnect contention. If the buffer occupancy grows beyond this threshold, storageCC identifies that queue to be observing contention. However, storageCC does not generate an ECN if the request crosses only one queue observing contention: it generates an ECN if and only if the following two conditions are satisfied: (1) one of the IIOs along the storage datapath of the application observes contention; and (2) there is at least one more queue upstream that observes contention. We describe the rea- soning next. The first condition is for efficiency purposes—since the storage datapath is lossless, buffer occupancy build-up at one of the queues will result in backpressure all the way to the corresponding IIO; since storageCC has ex- tremely fast congestion detection and reaction mechanism for detecting queue build-up, we only declare storage datapath contention (and hence, the need to shift parallelism) if the IIO for the application observes contention. The second 98 condition is more interesting. Since storageCC has local access to all hardware queues within the storage datapath, the second condition enables storageCC to identify the culprit applications, and help alleviate the throughput starva- tion problem. The observation is that an application can be the source of con- tention only if it encounter queue build-up at more than one point along the path. As discussed earlier, due to hop-by-hop credit-based flow control mecha- nism of the host network, build-up at the IIO buffer results from contention and queue build-up at memory controller RPQs/WPQs. As contention arises within the culprit datapath, these memory controller queues become exclusively filled within the culprit application’s datapath. Pre-computing the storage datapath for each application is also straightforward—determined by the application’s target SSD location (PCIe slot on the host where the SSD is connected) and the NUMA node where the application is running. 5.2.3 Converging to the right amount of parallelism Upon detecting contention within the storage datapath, storageCC dynamically converges to the right amount of parallelism for storage applications based on the ECNs observed over time. To achieve this, storageCC adapts a classical con- gestion control algorithm from datacenter networking—DCTCP1 [14]—for stor- age stacks. Like DCTCP, storageCC uses an additive-increase multiplicative- decrease mechanism (AIMD) to adapt the number of in-flight requests based on amount of contention along the storage datapath, using an in-flight requests window; this mechanism reduces the in-flight requests window upon detecting 1storageCC architecture is independent of the underlying CC protocol. We chose DCTCP because it is open-sourced, supports ECN-based congestion signals, and has been widely de- ployed in datacenter networks 99 contention using a multiplicative factor, while increasing in-flight requests win- dow using an additive factor in absence of contention. At any given time, the number of requests in-flight is limited to the minimum of the in-flight request window and the IO depth set by the application (typically determined based on benchmarking of CPU/SSD bottlenecks, as discussed in §2.3). storageCC makes several observations in efficiently adapting DCTCP to stor- age stacks. First, storageCC observes that the lossless nature of the underlying storage datapath enables a simplified form of DCTCP protocol: storageCC does not need to deal with data drops, data corruption, data retransmission, or even hardware failures. Moreover, the inherent losslessness of the underlying net- work and the precise semantics provided by the hardware DMA engine, in- cluding the ordered delivery of data chunks within individual requests, obviate the need for storageCC to handle out-of-order data delivery and/or acknowl- edgement mechanisms2. Finally, since there are no packet drop and no time- out events, the in-flight request window does not require to be reset to a mini- mum value, eliminating the need for slow-start phase, and the need to maintain the slow-start threshold. Overall, this makes the end-to-end protocol both ex- tremely simple and CPU-efficient to implement. The main algorithm in storageCC is to adapt to the right amount of par- allelism, which is outlined in Algorithm 1. This algorithm operates on a per- request basis, and takes three inputs: current time, whether the latest received completion observed contention (i.e., an ECN was generated for the request), and an estimate of the request completion time. The first parameter is straight- forward; an ECN is generated for a request if the two conditions on queue oc- 2Note that individual request completions can still arrive out of order from the block layer, as is already the case today with applications using asynchronous IO interface. These out-of-order requests continue to be handled by the applications. 100 Algorithm 1 storageCC algorithm to adjust parallelism Parameters: K: contention detection threshold gα: EWMA weight for α gcomp time: EWMA weight for s comp time Input: cur time: current time comp time: latest completion time sample contended: whether contention detected during latest request completion? (True/False) Upon Receiving Request Completion if (cur time - last update) > s comp time then F ← cont req last comp/total req last comp α← (1 − gα) × α + gα × F ▷ Update α last update← cur time cont req last comp← 0 total req last comp← 0 end if total req last comp← total req last comp + 1 s comp time← (1 − gcomp time) × s comp time + gcomp time × comp time ▷ Update smoothed completion time if contended then if cur time - last reduced > s comp time then inf req wnd← inf req wnd × (1 − α/2) ▷MD last reduced← cur time end if cont req last comp← cont req last comp + 1 else inf req wnd← inf req wnd + (1/inf req wnd) ▷ AI end if cupancy discussed previously in §5.2.2 are met, whenever a request completion arrives. The estimate for the request completion times is used to ensure that number of requests in-flight are adjusted on a per round-trip-time basis within the storage datapath; this ensures that the protocol gets enough time to stabilize since, upon adjusting the number of in-flight requests, it can take up to a round- trip-time worth of time for the effects to kick-in. Ensuring that we adjust the number of in-flight requests on a per round-trip-time basis with an updated es- 101 timate of completion times is particularly crucial for storage applications due to highly variable latencies of NVMe SSD devices (we observe latency variability higher than even 50% in §5.4). storageCC computes an estimate of per-request completion time by computing the time delta between the request completion and its submission (both of which are readily available within the Linux kernel); moreover, storageCC uses smoothed completion time which is an exponential weighted moving average over the per-request completion time samples using a parameter grtt. Using the above three inputs, storageCC maintains the fraction of requests F with observed contention, within the last completion time, and a weighted moving average of F (using the variable α and a parameter gα). storageCC uses α to compute the in-flight request window on a per-round-trip-time basis using the same AIMD algorithm as in DCTCP: if the request corresponding to the received completion is not contended, then storageCC increases the in-flight request window using additive-increase such that the window increases by 1 every smoothed completion time; otherwise, storageCC reduces the in-flight request window using multiplicative-decrease proportionally to the size of α. AIMD is a popular mechanism ensures that the throughput of each application thread sharing the same underlying bottleneck independently converges to its max-min fair-share [14, 139]. We demonstrate that storageCC’s convergence to fair-share for each application in §5.3. 102 Application io_getevents()io_submit() Filesystem submit_bio() IO Scheduler blk_mq_submit_bio() bio_endio() update_inf_req_wnd() blk_mq_get_tag() enforce_inf_req_wnd() NVMe Driver ctx hctx submission queue completion queue Block Layer Figure 5.1: Datapath of an IO request within the Linux storage stack. Blue boxes indicate points within the datapath where storageCC logic is implemented. (§5.3) 5.3 storageCC Implementation We have implemented storageCC within the block layer of Linux kernel 6.3. Our implementation leverages as much of the existing storage stack infrastructure as possible, making minimal modifications to the datapath. Our implementa- tion currently uses ∼850 LOC. storageCC makes no interface changes for both asynchronous (e.g., aio, io uring, etc) and synchronous (eg, read and write syscalls) IO. In this section we first provide an overview of IO request datapath within the kernel storage stack. We then outline and provide the rationale be- hind the optimal modification points within Linux block layer for storageCC im- plementation, followed by a discussion of some of most interesting storageCC implementation details. 103 Life of an IO request in Linux storage stack. To maintain brevity, we focus on the aio interface [25] for rest of this section. As shown in Figure 5.1, appli- cations utilize the io submit() syscall to submit IO requests to the block layer and use io getevents() to poll for the completion of these requests. Before sub- mitting any request, an application must first establish an IO context (kioctx) with io setup(). For each kioctx, the Virtual File System (VFS) creates a cor- responding IO control block which in turn triggers the relevant VFS-specific function to create a bio structure and initiate submit bio(), transitioning into the block layer. Within the block layer, each bio is encapsulated into a request and mapped to a hardware context (device-driver) queue. Without an IO sched- uler (i.e., using none scheduler), the block layer attempts to directly queue the request in the hardware queue before falling back to the software queue. Prior to scheduling a request on the associated hardware queue, the block layer must acquire a tag using blk mq get tag(), which uniquely identifies the request. These tags are obtained from a bitmap reflecting the parallelism supported by the hardware context queue (i.e., the hardware queue depth). If no tags are available due to the hardware queue being full, the storage stack creates back- pressure by making the request wait within the block layer. Once the device processes the IO request and sends a completion response to the kernel. The kernel uses the corresponding tag to identify and retrieve the relevant request, release the tag, and invoke bio endio() to trigger the VFS-specific completion handler. This handler enqueues the completion event for the relevant context and sends a notification back to the application. storageCC implementation within the block layer. storageCC’s architecture incorporates two key hooks (shown in blue boxes in Figure 5.1): one in the IO re- quest submission datapath to enforce the in-flight request window (inf req wnd 104 computed via Algorithm 1), and another in the request completion datapath to update the in-flight request window. In the submission datapath, submit bio() serves as the ideal hook, since it is the entry point for any IO request into the block layer. We implement storageCC’s enforce inf wnd() logic within the blk mq get tag() function (part of the submit bio() callstack) to leverage the kernel storage stack’s built-in backpressure mechanism to throttle the applica- tion’s in-flight requests (the application thread issuing the requests goes to sleep for a fixed time interval, and retries getting a tag). In the completion datapath, bio endio() serves as the ideal hook, acting as the exit point for an IO request from the block layer to the application layer. Consequently, we implement stor- ageCC’s update inf wnd() logic within this function. storageCC’s enforce inf wnd implementation. storageCC manages an in- flight request window (inf req wnd), and tracks the amount of in-flight requests (infl req), on a per-application-thread basis. A request is considered in-flight from the moment it successfully receives a tag, and until the storage stack re- ceives a completion from the device driver. For each request submission, stor- ageCC checks if the amount of in-flight requests is less than the in-flight request window. If this condition is met, the request proceeds with unmodified stor- age stack for remainder of the request submission datapath. Otherwise, the request is temporarily queued in a pending req buffer; this is an optimiza- tion that allows storageCC to more efficiently handle any transient contention within the storage datapath. storageCC processes and dequeues requests from the pending req buffer in a FIFO manner, when allowed by the in-flight request window. During periods of persistent contention (when the buffer is full and no more requests can be queued), storageCC leverages the built-in back pressure mechanism of the block layer by denying the tag to the request and throttles 105 0 250 500 Time (µs) 0.0 0.2 0.4 0.6 0.8 1.0 No rm al ize d IIO O cc up an cy (a) IIO Occupancy (No Contention) 0 250 500 Time (µs) 0.0 0.2 0.4 0.6 0.8 1.0 No rm al ize d IIO O cc up an cy (b) IIO Occupancy (With Contention) 0 250 500 Time (µs) 0.0 0.2 0.4 0.6 0.8 1.0 No rm al ize d W PQ O cc up an cy (c) WPQ Occupancy (No Contention) 0 250 500 Time (µs) 0.0 0.2 0.4 0.6 0.8 1.0 No rm al ize d W PQ O cc up an cy (d) WPQ Occupancy (With Contention) Figure 5.2: Effectively detecting storage datapath contention using storageCC: when there is no contention, IIO occupancy is 70% of the maximum value, and reaches close to maximum in the presence of memory bandwidth bottlenecks. Memory controller queue occupancy (figure shows fraction of time WPQ occu- pancy is full) also increases only when memory bandwidth is contented. the application’s in-flight requests. storageCC’s update inf req wnd implementation. storageCC updates the inf req wnd upon each request completion using Algorithm 1. storageCC incorporates a minor modification to struct bio by adding two additional fields comp time and contended, to track completion times and whether con- tention was detected, on a per-request basis. storageCC also creates a hash- map to store per-application-thread variables and parameters: the key be- ing the pid of the thread using storageCC, and the value being a struct comprising the required per-application-thread state to execute Algorithm 1 (inf req wnd, alpha, cont req last comp, total req last comp, last update and s comp time). Additionally, upon updating the inf req wnd, storageCC evaluates whether a request from the pending req buffer is now eligible to sub- mitted to block device (by receiving a tag). If yes, storageCC dequeues the re- 106 quest, which retries getting a tag using blk mq get tag(). This is done asyn- chronously by constructing the worker thread to incur minimal overhead for the completion datapath. Detecting contention in storage datapath. storageCC marks the presence of contention within the storage datapath by setting the contended field (used in Algorithm 1). This field is set true if the two conditions discussed in §5.2.3 are met; evaluating these conditions requires measuring the occupancies of the IIO buffer and the memory controller RPQ/WPQs, and checking whether mea- sured values are larger than the fixed respective thresholds for each queue. stor- ageCC uses similar methodology as [9] for measuring the IIO buffer occupancy: it samples a hardware counter exposed using model specific registers (MSRs) that maintains the cumulative value of IIO occupancy, incremented every IIO clock frequency (500MHz for our Intel Cascadelake machines). Queue occu- pancies for RPQ/WPQs is measured similar to IIO buffer, the only difference being that the hardware counter exposing memory controller occupancies is lo- cated in PCI config space [61]. 3 Similar to [9] we use a dedicated CPU core polling these counters to measure them at sub-µs granularity; and also use an exponentially weighted moving averaging (EWMA) for measured occupancies. Figure 5.2 shows the measured queue occupancy values over 500µs horizon, with and without the presence of storage datapath contention. We use the same baseline setup as in §2.3. The figure shows that measured values are indeed ef- fective in both detecting the presence of contention at both the IIO and memory 3To capture the time instants when the contention at WPQ leads to backpressure, we measure the fraction of times WPQ is full (instead of simply measuring the WPQ occupancy) – this is because of the well known difference between read and write datapath. When IIO performs memory writes, it only observes latency inflation in writes when the WPQ is full. However, when performing memory reads, IIO can experience latency inflation, leading to backpressure, even when RPQ is high and not full. 107 controller queues. Deciding contention detection thresholds. Unless mentioned otherwise, stor- ageCC uses default contention threshold to be 91 cachelines for the IIO buffer. Additionally, we use 0.7 for WPQ occupancy threshold.4 storageCC however chooses a more aggressive threshold of 75 cachelines for storage applications running on CPU cores in remote NUMA nodes, in a victim-culprit-like sce- nario. The reasoning is as follows: when using remote NUMA setup, even without memory bandwidth contention, the IIO occupancies are significantly larger than the local NUMA case (as shown in Figure 5.3, this is due to higher remote NUMA memory access latencies; we measured the latency to be 65% higher, by using Little’s Law and observed average IIO occupancy and ap- plication throughput.) Therefore, even with the same amount of application throughput, remote NUMA applications can occupy larger space at the shared IIO buffer. storageCC tries to alleviate such unfair IIO buffer allocation by us- ing a smaller IIO threshold for remote NUMA applications, triggering a more aggressive detection and response for contention. 5.4 storageCC Evaluation This section presents the evaluation results for the Linux storage stack, with and without storageCC. We use the same setup as in §2.3. The goals of our evaluation are as follows: • Understanding benefits of using storageCC in the presence of host conges- 4As discussed in §5.3, we measure fraction of time WPQ is full, therefore measured values range between 0-1. 108 0 250 500 Time (µs) 0.0 0.2 0.4 0.6 0.8 1.0 No rm al ize d IIO O cc up an cy (a) IIO Occupancy (No Contention) 0 250 500 Time (µs) 0.0 0.2 0.4 0.6 0.8 1.0 No rm al ize d IIO O cc up an cy (b) IIO Occupancy (With Contention) 0 250 500 Time (µs) 0.0 0.2 0.4 0.6 0.8 1.0 No rm al ize d W PQ O cc up an cy (c) WPQ Occupancy (No Contention) 0 250 500 Time (µs) 0.0 0.2 0.4 0.6 0.8 1.0 No rm al ize d W PQ O cc up an cy (d) WPQ Occupancy (With Contention) Figure 5.3: IIO buffer and memory controller queue occupancies for the remote NUMA setup. (Left column) Even without memory bandwidth contention, the IIO buffer is filled up. This is due to higher access latency from IIO to memory in remote NUMA node. (Right column) IIO buffer and WPQ occupancies are as expected under memory bandwidth contention. tion due to memory bandwidth contention (§5.4.1) • Demonstrating similar benefits even when using unmodified real-world storage systems under host congestion (§5.4.2) • Understanding storageCC overheads, especially for scenarios where there isn’t any presence of host congestion (§5.4.3). In this section, we focus mainly on DDIO disabled scenario for better explica- bility of results; but storageCC also works with DDIO enabled scenarios; we discuss them in §5.4.4. 109 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Throughput (in Million IOPS) 102 103 104 105 L- A Ta il La te nc y (u s) 0x1x2x3x 7.34x Im rovementIn Latency Without StorageCC With StorageCC (a) Local NUMA setup 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Throughput (in Million IOPS) 102 103 104 105 L- A Ta il La te nc y (u s) 0x1x 2x3x 17.00x Im rovement In Latency Without StorageCC With StorageCC (b) Remote NUMA setup Figure 5.4: storageCC allows storage stack to converge to amount of parallelism which is close to the knee point, independent of degree of contention and appli- cation NUMA-affinity. storageCC provides close to optimal latencies for L-App (upto ∼17× tail latency improvement, even assuming the default IO depth = 64 – the minimum needed to sustain optimal throughput under no congestion) and maximum achievable throughput for T-app. Discussion in §5.4.1. 5.4.1 storageCC benefits [Figure 5.4] storageCC maintains near-optimal amount of parallelism for stor- age applications independent of the degree of host congestion and NUMA- affinity of the applications. Figure 5.4 shows the latency-load curve with and without using storageCC using a similar setup as in §2.3 (Figures 2.11 and 2.12), under varying degrees of storage datapath contention. We vary this degree by proportionally changing the number of M-app cores in our setup. Starting with the scenario where memory bandwidth is contended in the NUMA node local to SSDs (Figure 5.4a) we observe that for any introduced degree of contention, stor- ageCC converges to an amount of parallelism that is close to the the knee-point on the latency-load curve. When T-app uses lower IO depth values (from 1 to 8), storageCC achieves near-identical throughput and latency values compared to the case without storageCC. This is expected because there is no memory band- width contention for these IO depths (the offered load by T-app is smaller than the available memory bandwidth); hence, storageCC never reduces the inflight 110 request window and keeps IO depth worth requests in-flight at all times. However, when the T-App IO depth increases beyond a threshold which is large enough to sustain the maximum achievable throughput—32 for 1×, and 16 for 2× and 3× degree of contention respectively—storageCC’s mechanisms to detect contention and shift the amount of parallelism kicks-in. For example, in a scenario with 2× contention and an IO depth of 64, which is the optimal par- allelism for no contention, storageCC adjusts the operating point to near that of an IO depth of 16—the ideal operating point for 2× contention scenario. This adjustment occurs as storageCC detects contention and converges the in-flight request window value close to 16. In the steady-state, storageCC exhibits AIMD behavior, oscillating around this value (we later discuss steady-state behavior of storageCC at a microscopic-level in Figure 5.11). storageCC converges close to the ideal operating point even in the subsequent scenario, where the memory bandwidth is contended in the remote NUMA node (Figure 5.4b); further, re- gardless of the application’s IO depth setting, storageCC converges to an oper- ating point close to the knee-point of each respective latency-load curve. These observations showcase the efficacy and robustness of storageCC’s contention detection and mitigation mechanisms. We discuss storageCC’s results under no contention scenario later in §5.4.3, demonstrating storageCC introduces mini- mal overheads. [Figure 5.5] storageCC is more resilient to the throughput starvation prob- lem. Using the same setup as in in Figure 2.11, we observe that storageCC provides >55% gain in victim application throughput, while incurring a smaller cost in terms of further degradation of culprit application throughput and lead- ing to a gain of >22% in total application throughput (culprit + victim). The 111 Culprit Victim0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 Th ro ug hp ut (i n M illi on IO PS ) Necessary Unnecessary No Contention With Contention, Without StorageCC With Contention, StorageCC (a) 1x Congestion Culprit Victim0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 Th ro ug hp ut (i n M illi on IO PS ) Necessary Unnecessary No Contention With Contention, Without StorageCC With Contention, StorageCC (b) 2x Congestion Culprit Victim0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 Th ro ug hp ut (i n M illi on IO PS ) Necessary Unnecessary No Contention With Contention, Without StorageCC With Contention, StorageCC (c) 3x Congestion 1x 2x 3x Degree of Storage Datapath Congestion 0 20 40 60 80 100 % G ai n Gain in Total Throughput Gain in Victim Throughput (d) Net Throughput Gain (%) Figure 5.5: storageCC helps alleviate the throughput starvation problem: im- proving victim application throughput, while incurring smaller cost in culprit application throughput, leading to ∼55% gains in victim throughput and ∼22% in overall throughput. Discussion in §5.4.1. gain in victim application throughput stems from enabling of a better isolation across T-apps by storageCC: it detects that the contention is caused only in cul- prit application datapath (since the WPQ increases only for the remote NUMA memory controller), leading to reduction in in-flight request window just for the culprit appplication. Reducing this in-flight window reduces the contention of shared IIO buffer with the victim traffic, leading to improved victim apppli- cation throughput. 5.4.2 RocksDB with storageCC We now evaluate storageCC with RocksDB [43], a commonly used storage sys- tem, as an L-app. In our setup, we mount local SSDs using the XFS file system 112 0.0 0.5 1.0 1.5 2.0 2.5 Throughput (in Million IOPS) 102 103 104 L- A Ta il La te nc y (u s) 0x 3x 6.40x Im rovement In Latency Without StorageCC With StorageCC Figure 5.6: Evaluation using RocksDB: storageCC provides similar improve- ment in tail latency as earlier results. 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Normalized T-App Throughput 102 103 104 L- Ap p Ta il La te nc y (u s) <5% Overhead Without StorageCC With StorageCC Figure 5.7: storageCC observe negligible overheads (<5%) without the presence of any storage datapath congestion. Discussion in §5.4.3. and configure RocksDB to utilize it as a backend with direct IO. We generate a ReadRandom workload of 4KB request sizes, with an IO depth of 1 for the appli- cation thread using the db bench benchmarking tool [44]. We run this bench- mark for the same setup we used in §2.3. Figure 5.6 shows the latency-load curve using storageCC with 3x degree of storage datapath contention. As ex- pected, compared to the no contention scenario, the L-app experiences high tail latency inflation (6.52×) under host congestion. We observe that storageCC ef- fectively converges near the optimal operating point under contention, demon- strating its effectiveness in handling host congestion for unmodified real-world storage applications. 113 5.4.3 storageCC Overheads We now discuss any implementation overheads of using storageCC when there is no contention within the storage datapath. Figure 5.7 shows these overheads for the worst-possible scenario: when all the CPU cores in the NUMA node lo- cal to the SSDs (and therefore, also NUMA node local to the IIO) are running the storage applications. As discussed in §5.3, in order to sample IIO occupancy values, MSR reads must be performed to read the required hardware counter values. However, those MSR counters keep track of statistics of the IIO in their local NUMA node [61]. If all the cores in the IIO-local NUMA are running stor- age applications, there is some additional CPU overhead while performing re- quired MSR reads, which can lead to slight performance degradation. On our testbed, we found <5% degradation, as shown in Figure 5.7. 5.4.4 storageCC Sensitivity Analysis We now evaluate storageCC’s robustness under a wider set of workload and host hardware settings. [Figure 5.8] Varying IO sizes. We observe that storageCC is robust to changing IO sizes and converges to the optimal level of parallelism in presence of memory bandwidth contention. [Figure 5.9] Varying L-app load. We observe that even for the cases L-app no longer occupies a tiny fraction of SSD bandwidth, storageCC continues to pro- vide similar benefits in performance. This is because storageCC contention de- tection and parallelism adjustment mechanism do not make any assumptions 114 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 Throughput (in Milli n IOPS) 102 103 104 105 L- Ap p Ta il La te nc ( (u s) 0x3x 4.14x Impr vement In Latenc( With ut St rageCC With St rageCC (a) IO size = 8KB 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Throughput (i Millio IOPS) 102 103 104 105 L- Ap p Ta il La te c ) (u s) 0(3( 3.80( Improveme t I Late c) Without StorageCC With StorageCC (b) IO size = 16KB 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 Throughput (in Million IOPS) 102 103 104 105 L- A Ta il La te nc y (u s) 0x 3x 5.75x Im rovement In Latency Without StorageCC With StorageCC (c) IO size = 32KB 0.000 0.025 0.050 0.075 0.100 0.125 0.150 0.175 0.200 0.225 Throughput (in Milli n IOPS) 102 103 104 105 L- Ap p Ta il La te nc ( (u s) 0x 3x 3.76x Impr vement In Latenc( With ut St rageCC With St rageCC (d) IO size = 64KB Figure 5.8: storageCC converges to right amount of parallelism independent of IO size of the applications. Discussion in §5.4.4. 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Throughput (in Milli n IOPS) 102 103 104 105 L- Ap p Ta il La te nc ( (u s) 0x 3x 7.38x Impr vement In Latenc( With ut St rageCC With St rageCC (a) L-app IO Depth = 2 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Throughput (in Million IOPS) 102 103 104 105 L- A Ta il La te nc y (u s) 0x 3x 8.03x Im rovement In Latency Without StorageCC With StorageCC (b) L-app IO Depth = 4 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Throughput (in Milli n IOPS) 102 103 104 105 L- Ap p Ta il La te nc ( (u s) 0x 3x 8.26x Impr vement In Latenc( With ut St rageCC With St rageCC (c) L-app IO Depth = 8 Figure 5.9: storageCC converges to the right amount of parallelism, even with increasing load from competing L-apps. Discussion in §5.4.4. 115 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Throughput (in Million IOPS) 102 103 104 105 L- Ap p Ta il La te nc y (u ) 0x 3x Without StorageCC With StorageCC 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Throughput (in Million IOPS) 102 103 104 105 L- Ap p Ta il La te nc y (u ) Without StorageCC With StorageCC Figure 5.10: DDIO enabled results: (a) scenario where DDIO can be beneficial, storageCC provides even better tail latencies with close to maximum achievable throughputs, (b) DDIO is not applicable for remote NUMA scenario, storageCC provides near identical results as baseline results. about the loads offered by individual kinds of applications. [Figure 5.10] Enabling DDIO. We evaluate storageCC performance under memory bandwidth contention when DDIO is enabled. Although sometimes beneficial, DDIO is known to not fundamentally resolve the contention at the memory interconnect [31, 6, 9]; for instance, we observe that efficacy of DDIO to worsen with increasing IO sizes, or using remote NUMA nodes (since DDIO is not applicable for that context [31]). storageCC is able to converge close to points providing close-to-max achievable throughput and low latencis, leverag- ing DDIO benefits when it is effective, and improving performance otherwise. 5.4.5 Understanding storageCC microscopic behavior Figure 5.11 shows the steady-state microscopic behavior for storageCC for the local NUMA setup with 2x degree of contention (we also observe similar trends for other host congestion scenarios). Figure 5.11a shows storageCC’s in-flight request window behaviour for a 2000 microsecond horizon for one of the T-Apps – we verified that the behavior is near-identical for all T-app instances. The fig- 116 0 250 500 750 1000 1250 1500 1750 2000 Time (us) 0 2 4 6 8 10 12 14 16 18 20 22 24 in f_ re q_ wn d (a) storageCC inf req wnd 0 250 500 750 1000 1250 1500 1750 2000 Time (us) 0.0 0.1 0.2 0.3 0.4 0.5 al ph a (b) storageCC α value 0 250 500 750 1000 1250 1500 1750 2000 Time (us) 60 65 70 75 80 85 90 95 100 IIO O cc up an cy (c) IIO Occupancy 0 250 500 750 1000 1250 1500 1750 2000 Time (us) 60 80 100 120 140 Sm oo th ed C om pl et io n Ti m e (u s) (d) Smoothed Completion Times Figure 5.11: storageCC microscopic behavior ure demonstrates the expected AIMD behavior in the steady-state in the pres- ence of host congestion. The average inf req wnd value determines the aver- age in-flight requests and hence the T-app throughput. Figure 5.11c shows the IIO occupancy behavior for the same time horizon. We observe that storageCC keeps the IIO occupancy close to the threshold (shown via red line), sustaining high throughput. Finally, Figure 5.11d shows the smoothed completion time behavior – we observe that completion time fluctuates much more than the de- gree of contention (indicated by IIO occupancy value) – sometimes even varying by more than 50%. The dynamic completion time estimation in storageCC how- ever ensures apt response to contention in face of such variable storage datapath completion times. 117 0 100 200 300 400 500 Time (s) 0 200 400 600 800 1000 1200 1400 1600 Ba nd wi dt h (M iB /s ) Thread 1 Thread 2 Thread 3 Thread 4 Thread 5 Thread 6 Thread 7 Thread 8 Thread 9 Thread 10 Thread 11 Thread 12 Thread 13 Thread 14 Thread 15 Thread 16 Figure 5.12: Fair resource allocation by storageCC under host congestion and varying number of application threads. Discussion in §5.4.6. 5.4.6 Fair resource allocation using storageCC Figure 5.12 shows storageCC’s convergence to fair-share of throughput, under varying number of T-app instances. We use a similar setup as the 3× memory contention, and start with a single T-app thread on 4 CPU cores. We then in- troduce additional threads at t = 100s and t = 200s respectively to increase the total number of active T-app threads to be 8 and 16 respectively. We measured throughput versus time for each thread, and observed storageCC converge to the max-min fair-share as soon as the number of T-app instances were changed. This is realized due to quick contention detection mechanism in storageCC, and use of AIMD-based mechanism to adjust the parallelism in response to con- tention. 5.5 Summary storageCC is a new storage stack architecture to address the host network bot- tlenecks within the storage datapath. While existing storage stacks handle CPU and/or SSD bottlenecks, they lack the capability to detect and react to bottle- 118 necks within the storage datapath, resulting in significant tail latency inflation and isolation violation. Leveraging insights from classical networking liter- ature, storageCC quickly detects and efficiently reacts to these host network bottlenecks. Our evaluation demonstrates that storageCC enables existing un- modified storage applications to converge close to optimal operating points— achieving low tail latencies and isolation guarantees—while incurring minimal implementation overheads. 119 CHAPTER 6 RELATED WORK This chapter provides a brief overview of the related work in the space of under- standing and alleviating contention for host resources (§6.1), efficient memory protection in network stacks (§6.2), and managing bottlenecks within the stor- age datapath (§6.3). 6.1 Managing Host Resource Contention Performance degradation due to contention for host resources. The prob- lem of network applications experiencing underutilization due to lack of host resources is not entirely new—decades ago we experienced the receiver live- lock [100] problem, where the host spent all its time processing interrupts for arriving packets, with no cycles left to deliver these packets to the application. More recently, it has been shown [31] that network utilization can also be bottle- necked by the CPU cycles spent on copying the delivered data from the kernel to the application. The host congestion problem discussed in the dissertation is caused by host network resources, like the memory bus, becoming the bot- tleneck, and the network is underutilized even with CPU cycles to spare. Even more crucially, the packet queueing due to these bottlenecks happens at differ- ent places–at the kernel buffers inside the memory for CPU bottlenecks, but at the NIC buffers (which can lead to packet drops) for host network bottlenecks. Using emerging hardware technologies to alleviate host congestion. Two im- portant emerging technologies are RDMA [139] and CXL [118]. While zero- 120 copy datapath enabled by RDMA does alleviate host congestion to some de- gree (by reducing memory bandwidth utilization due to data copy overheads), it does not resolve the host congestion problem [86]. Data DMAed from the RDMA NICs to the application memory still requires transfer over the memory interconnect, leading to host congestion and similar performance degradation as shown in §2.2 [86]. The benefits of CXL to resolve host congestion are also unclear. For instance, consider the two use cases of CXL. First, reducing PCIe to IIO latency; our analysis in §3.2.1 suggests that reducing ℓp does not allevi- ate memory interconnect congestion (ℓm is the core problem). Second, CXL may enable memory expansion where CPUs can directly read from CXL-attached memory; this requires massive changes in the host infrastructure, and whether it will provide benefits to host congestion remains an interesting avenue of fu- ture research. 6.2 Efficient IO Memory Protection Improved performance using a “non-strict” security model. Some of the ex- isting works, like DAMN [94], change the memory protection security model to improve its performance. The key insight in DAMN is that IOVA invalida- tions lead to both IOMMU and CPU overheads, so they introduce DMA buffers that are permanently mapped in the IO page table, thus avoiding the need to perform IOMMU invalidation. Packet data is copied to an intermediate buffer, that the NIC cannot access, whenever it is read by the CPU (i.e. header data) or otherwise to application buffers. The authors argue this is still secure, since it avoids ”time of check to time of use” attacks where data is changed by the 121 NIC after it has been initially read by the CPU. While DAMN reduces over- heads related to (un)map, IOVA allocation, and invalidations, it does not at- tempt to reduce the cost of each IOTLB miss. As shown in our evaluation (and in Figure 4.9), removing invalidation overhead is insufficient to achieve good performance as poor locality still increases the address translation latency. In- deed, in their evaluation the authors show throughput degrades due to IOTLB miss overheads. Hence, DAMN is complementary to F&S, as it could be used for buffer allocation and F&S for IOVA allocation and mapping. Several other works have designed their solutions around deferring or avoiding invalidations altogether [108, 93, 45], which improve performance by sacrificing on security Improved performance using hugepages. [45] implements a hugepage allo- cator for IOVA mappings that can achieve good IOTLB performance at high throughput. However, their approach relies on keeping IOVAs permanently mapped to pages in their page cache, not invalidating them after each DMA. If they were to use strict mode, invalidations would lead to L1/L2/L3 misses, degrading performance. Moreover, the authors note that hugepages are not a panacea: faster link speeds will require larger hugepages to achieve optimal performance. We note that F&S integrates well with hugepages and is a com- plementary approach. Improved performance using redesigned hardware. There have been many other lines of work looking at designing improved hardware for address trans- lation. [55] explores new address translation hardware for accelerators. [79] de- signed hardware for address translation for a hypertenant environment. There has also been work to rethink IOTLB hardware to reduce the number of page table walks and remove invalidation overhead [17, 91]. Another line of re- 122 search [24, 48, 109] focuses on the design and performance of MMUs and TLBs. In MMU address translation, applications use virtual addresses, and the OS manages physical frames, making it challenging to ensure uniform performance across various workloads. Conversely, in IOMMU address translation, the OS allocates virtual addresses, offering a unique design opportunity to optimize performance as demonstrated in F&S. 6.3 Mitigating Bottlenecks in Storage Datapath Mitigating hardware bottlenecks in storage datapath. There is considerable re- search aimed at improving SSD performance by minimizing garbage-collection overheads [133, 83, 67, 137, 71], improving hardware interfaces [74, 75, 138, 27, 123], ensuring isolation [56], and redesigning the NVMe controller [136]. However, these solutions fall short when the bottleneck shifts from the SSD to the host interconnect. While incorporating Quality of Service (QoS) mech- anisms [49, 72, 78] could potentially address the HoL blocking problem un- der host congestion, they require invasive modifications to the DMA engine to prioritize L-app transfers at the PCIe transaction level. It is important to note that these research efforts are orthogonal to storageCC’s focus on adjusting application-level parallelism in response to host network bottlenecks, as lever- aging more capable hardware can contribute to overall system performance im- provements. Mitigating software bottlenecks in storage datapath. Extensive research ef- forts have been dedicated to designing storage stacks to alleviate CPU bottle- necks and improve system performance. Over time, these stacks have evolved 123 to tackle various challenges, including ensuring fairness [2, 3, 90], prioritiza- tion [4], QoS guarantees [5, 54, 119], and isolation [57, 124, 76]. However, as highlighted in §2.3, they prove ineffective in mitigating issues arising in the host congestion regime, where queuing happens beyond the Linux block layer and deeper within the storage datapath. Additionally, some of these stacks intro- duce significant CPU overheads (§2.3). Although userspace storage stacks [135] offer lower CPU overheads, they still do not address the challenges encountered in the host congestion regime. 124 CHAPTER 7 CONCLUSION AND OUTLOOK This dissertation revisits the conventional wisdom in systems and networking communities that congestion happens primarily within the network fabric (i.e., at network links and/or switches). We demonstrate recent technological trends, such as rapid growth in network link and peripheral bandwidths combined with relatively stagnant trends for other host resources (like memory bandwidth and cache sizes per core), have led to the emergence of host congestion. We dis- cuss the root cause of host congestion to be nanosecond-scale latency inflation within the datapath between NIC-to-CPU/memory datapath. Such latency in- flation results in underutilization of PCIe bandwidth, subsequently leading to network bandwidth underutilization and queueing/drops at the NIC buffers. We characterize impact of host congestion context of networking applica- tions, and also demonstrate that the problem generalizes to context of storage applications (Chapter 2). We show that existing network protocols and OS were designed with several implicit assumptions that start to break under the regime of host congestion. The dissertation revisits the achitecture of network proto- cols like congestion control, and several components of the OS like IO memory protection mechanisms and storage stacks, to handle the host congestion. We presented hostCC (Chapter 3), a new congestion control architecture that handles congestion within the host network, along with the network fabric. The key idea in hostCC is a new “host-local” congestion response performed at a sub-microsecond-scale. HostCC integrates the idea of host-local congestion response with new congestion signals collected at sub-microsecond timescales within the host hardware, allowing it to capture and react to congestion within 125 the host network at sub-microsecond timescales. We next presented F&S (Chapter 4), which revisits the design of memory protection mechanisms to minimize performance overheads while ensuring the required safety guarantees. We show that, although reducing the number of IOTLB misses under strictest possible memory protection guarantees is hard, we can reduce the cost of each miss by improving the IO page table cache per- formance. F&S leverages this insight and reduces the latency-cost and CPU-cost for IOTLB misses through its key design ideas of contiguous IOVA allocation, along with a selective and batched invalidation of IOTLBs. We finally presented storageCC (Chapter 5), a new architecture for storage stacks that enables them to operate at an ideal amount of parallelism in the pres- ence of host congestion. StorageCC introduces the conceptual idea of conges- tion control—which is typically used in the computer networking literature—in storage stacks. By leveraging insights from classical networking literature and exploiting specific properties of the host network, storageCC designs its conges- tion control mechanism to quickly detect and efficiently react to host congestion, dynamically adjusting in-flight IO requests to achieve near-optimal parallelism. 7.1 Future Directions As established in this dissertation, effectively handling host congestion requires exploring solutions across different layers of the infrastructure—-both within the software and hardware. The mechanisms discussed in Chapters 3, 4, and 5 take the initial steps toward re-architecting the existing host infrastructure. We next outline additional avenues for future research, aimed at rethinking the de- 126 sign of applications, operating systems, host hardware, and network fabric. Rethinking design of applications. Existing host-local applications perform multiple data copies for various kinds of operations [132, 33, 43, 134, 122] (like security, compression, (de)serialization, etc.), which contributes to inflated host network resources (e.g., memory bandwidth), resulting in higher inefficiencies. Host congestion calls for exploring techniques that enable reducing data copies at the application layer, with and without hardware support. Additionally, it would be interesting to explore algorithmic solutions that enable trading off computation for reduced memory traffic; the key insight here is that, when memory bandwidth is bottlenecked, CPU utilization is reduced due to larger stalls and that explicitly trading off computation for reduced memory traffic could lead to improved solutions. Rethinking design of operating systems. Exploration of different components of the operating system, beyond the ones already discussed in this dissertation, is needed to fundamentally tackle the impact of host congestion. For instance, existing CPU schedulers do not consider memory bandwidth contention into ac- count; it would be interesting to explore scheduling mechanisms that take into account memory bandwidth contention. At the memory management level, an interesting direction would be to rethink memory tiering solutions—existing solutions try to place all frequently accessed pages in the host-local memory (that is, the tier closer to the CPUs); however, when memory bandwidth is con- tended, this is not necessarily the optimal solution. We could potentially mi- grate pages across the memory tiers such as to strike a proper balance between desired application performance and memory bandwidth contention. Similarly, the existing network and storage stacks try placing physical pages (for DMAing 127 data to/from the peripherals) local to the NUMA node containing the peripher- als. It would be interesting to rethink the design of these stacks to dynamically migrate the physical pages across different NUMA nodes, depending upon the host network contention within the end-to-end datapath. Rethinking design of host hardware. This dissertation shows that at its root, the inefficiencies within the host network arise because today’s host hardware employs flow control mechanisms to handle data transfer between different host-local devices. Networking literature suggests that flow control is better suited for networks where resource contention happens at the end points of the network (i.e., the host-local devices like CPU cores and peripheral devices). Host network observes inefficiencies due to contention for resources within the end points of the network (i.e., the interconnects within the host network). Therefore, congestion control within the host network could help mitigate its inefficiencies. Such a re-architecture would especially be needed in the con- text of emerging hardware technologies like CXL—the amount of contention for CXL memory channels is likely to become much higher than the host-local memory (since processors from multiple hosts can access CXL memory chan- nels). Implementing congestion control mechanisms within the hardware could be beneficial to resolve such contention. As the host network inefficiencies get more pronounced with current hardware trends, we might even need to gener- alize such hardware-level congestion control mechanisms to transfers across the entire host network—i.e., between peripheral devices, CPU cores and memory. Rethinking design of (datacenter) network fabric. While this dissertation fo- cusses primarily on ensuring the host network behaves close to an ideal manner, providing performance guarantees similar to an ideal host network (e.g., near- 128 optimal latency and/or throughput) even for datacenter networks, could be beneficial for alleviating many kinds of application inefficiencies. Existing data- center networks are asynchronous—it is hard, or even impossible, to bound the time taken by a message to be transmitted from the sender to the receiver host. Such network asynchrony reduces the set of assumptions that system designers can rely upon from the underlying network, thus introducing inefficiency and complexity in end-host systems and applications. Datacenter networks offer- ing stronger abstractions—one where the network delivers the message within a bounded amount of time—can have powerful implications for future systems and networking infrastructure. A recent study [7] takes the first step towards ex- ploring such an abstraction, building upon programmable hardware and multi- tier topologies prevalent in existing datacenter networks. It would be interest- ing to build distributed applications, network protocols, and OS stacks that can harness the benefits of such an abstraction. 129 BIBLIOGRAPHY [1] axboe/fio: Flexible I/O Tester. https://github.com/axboe/fio. [2] BFQ (budget fair queueing). https://docs.kernel.org/ blockbfq-iosched.html. [3] CFQ (complete fairness queueing). https://www.kernel.org/doc/ Documentation/block/cfq-iosched.txt. [4] Kyber multiqueue I/O scheduler. https://lwn.net/Articles/720071/. [5] MQ-Deadline I/O scheduler. https://lwn.net/Articles/708465/. [6] Saksham Agarwal, Rachit Agarwal, Behnam Montazeri, Masoud Moshref, Khaled Elmeleegy, Luigi Rizzo, Marc Asher de Kruijf, Gautam Kumar, Sylvia Ratnasamy, David Culler, et al. Understanding host inter- connect congestion. In ACM HotNets, 2022. [7] Saksham Agarwal, Qizhe Cai, Rachit Agarwal, David Shmoys, and Amin Vahdat. Harmony: A congestion-free datacenter architecture. In USENIX NSDI, 2024. [8] Saksham Agarwal, Shreyas Kharbanda, and Rachit Agar- wal. Storage stacks need congestion control. In Draft: https://saksham.web.illinois.edu/assets/pdf/storagecc.pdf, 2024. [9] Saksham Agarwal, Arvind Krishnamurthy, and Rachit Agarwal. Host congestion control. In ACM SIGCOMM, 2023. [10] Saksham Agarwal, Shijin Rajakrishnan, Akshay Narayan, Rachit Agar- wal, David Shmoys, and Amin Vahdat. Sincronia: Near-optimal network design for coflows. In ACM SIGCOMM, 2018. [11] Amit Aggarwal, Stefan Savage, and Thomas Anderson. Understanding the performance of tcp pacing. In IEEE INFOCOM, 2000. [12] Jasmin Ajanovic. Pci express 3.0 accelerator features. Intel Corporation, 2008. 130 https://github.com/axboe/fio https://docs.kernel.org/blockbfq-iosched.html https://docs.kernel.org/blockbfq-iosched.html https://www.kernel.org/doc/Documentation/block/cfq-iosched.txt https://www.kernel.org/doc/Documentation/block/cfq-iosched.txt https://lwn.net/Articles/720071/ https://lwn.net/Articles/708465/ https://saksham.web.illinois.edu/assets/pdf/storagecc.pdf [13] Markuze Alex, Shay Vargaftik, Gil Kupfer, Boris Pismeny, Nadav Amit, Adam Morrison, and Dan Tsafrir. Characterizing, exploiting, and detect- ing dma code injection vulnerabilities in the presence of an iommu. In ACM EUROSYS, 2021. [14] Mohammad Alizadeh, Albert Greenberg, David A Maltz, Jitendra Pad- hye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Srid- haran. Data center tcp (dctcp). In ACM SIGCOMM, 2010. [15] AMD. Amd64 technology platform quality of service extensions. https: //developer.amd.com/wp-content/resources/56375.pdf, 2020. [16] AMD. Alveo U280 Data Center Accelerator Card. https://www.xilinx. com/products/boards-and-kits/alveo/u280.html, 2024. [17] Nadav Amit, Muli Ben-Yehuda, and Ben-Ami Yassour. Iommu: Strategies for mitigating the iotlb bottleneck. In ISCA. Springer, 2010. [18] Robin Anil, Gokhan Capan, Isabel Drost-Fromm, Ted Dunning, Ellen Friedman, Trevor Grant, Shannon Quinn, Paritosh Ranjan, Sebastian Schelter, and Özgür Yılmazel. Apache mahout: Machine learning on dis- tributed dataflow systems. Journal of Machine Learning Research, 21(127):1– 6, 2020. [19] Apache Software Foundation. Apache cassandra. https://cassandra. apache.org, 2024. [20] Mohamed Arafa, Bahaa Fahim, Sailesh Kottapalli, Akhilesh Kumar, Lily P Looi, Sreenivas Mandava, Andy Rudoff, Ian M Steiner, Bob Valentine, Geetha Vedaraman, et al. Cascade lake: Next generation intel xeon scal- able processor. IEEE MICRO, 2019. [21] Rachata Ausavarungnirun, Kevin Kai-Wei Chang, Lavanya Subramanian, Gabriel H Loh, and Onur Mutlu. Staged memory scheduling: Achiev- ing high performance and scalability in heterogeneous systems. In ACM SIGARCH Computer Architecture News, 2012. [22] Muli Ben-Yehuda, Jon Mason, Jimi Xenidis, Orran Krieger, Leendert Van Doorn, Jun Nakajima, Asit Mallick, and Elsie Wahlig. Utilizing iom- mus for virtualization in linux and xen. In OLS’06: The 2006 Ottawa Linux Symposium, pages 71–86. Citeseer, 2006. 131 https://developer.amd.com/wp-content/resources/56375.pdf https://developer.amd.com/wp-content/resources/56375.pdf https://www.xilinx.com/products/boards-and-kits/alveo/u280.html https://www.xilinx.com/products/boards-and-kits/alveo/u280.html https://cassandra.apache.org https://cassandra.apache.org [23] Muli Ben-Yehuda, Jimi Xenidis, Michal Ostrowski, Karl Rister, Alexis Bruemmer, Leendert van Doorn, and Doorn Amd. The price of safety: Evaluating iommu performance. Ottawa Linux Symposium (OLS), 01 2007. [24] Abhishek Bhattacharjee. Large-reach memory management unit caches. In IEEE/ACM MICRO, 2013. [25] Suparna Bhattacharya, Steven Pratt, Badari Pulavarty, and Janet Morgan. Asynchronous i/o support in linux 2.5. In Proceedings of the Linux Sympo- sium, pages 371–386. Citeseer, 2003. [26] Harshawardhan S Bhosale and Devendra P Gadekar. A review paper on big data and hadoop. International Journal of Scientific and Research Publica- tions, 4(10):1–7, 2014. [27] Matias Bjørling, Abutalib Aghayev, Hans Holmberg, Aravind Ramesh, Damien Le Moal, Gregory R Ganger, and George Amvrosiadis. Zns: Avoiding the block interface tax for flash-based ssds. In USENIX ATC, 2021. [28] Evgeny Bolotin, Israel Cidon, Ran Ginosar, and Avinoam Kolodny. Qnoc: Qos architecture and design process for network on chip. Journal of systems architecture, 50(2-3):105–128, 2004. [29] Jeff Bonwick and Jonathan Adams. Magazines and vmem: Extending the slab allocator to many CPUs and arbitrary resources. In USENIX ATC, 2001. [30] Qizhe Cai, Mina Tahmasbi Arashloo, and Rachit Agarwal. dcpim: Near- optimal proactive datacenter transport. In ACM SIGCOMM, 2022. [31] Qizhe Cai, Shubham Chaudhary, Midhul Vuppalapati, Jaehyun Hwang, and Rachit Agarwal. Understanding host network stack overheads. In ACM SIGCOMM, 2021. [32] Qizhe Cai, Midhul Vuppalapati, Jaehyun Hwang, Christos Kozyrakis, and Rachit Agarwal. Towards µ s tail latency and terabit ethernet: Dis- aggregating the host network stack. In ACM SIGCOMM, 2022. [33] Josiah Carlson. Redis in action. Simon and Schuster, 2013. 132 [34] Adrian M Caulfield, Laura M Grupp, and Steven Swanson. Gordon: us- ing flash memory to build fast, power-efficient clusters for data-intensive applications. ACM Sigplan Notices, 44(3):217–228, 2009. [35] Kevin K Chang. Understanding and improving the latency of DRAM-based memory systems. PhD thesis, Carnegie Mellon University, 2017. [36] Kevin K. Chang, Abhijith Kashyap, Hasan Hassan, Saugata Ghose, Kevin Hsieh, Donghyuk Lee, Tianshi Li, Gennady Pekhimenko, Samira Khan, and Onur Mutlu. Understanding latency variation in modern dram chips: Experimental characterization, analysis, and optimization. SIGMETRICS Perform. Eval. Rev., 44(1):323–336, jun 2016. [37] Feng Chen, Rubao Lee, and Xiaodong Zhang. Essential roles of exploit- ing internal parallelism of flash memory based solid state drives in high- speed data processing. In IEEE HPCA, 2011. [38] Dilip Chhajed and Timothy Lowe. Building Intuition: Insights From Basic Operations Management Models and Principles, volume 115. 01 2008. [39] Jack Choquette, Wishwesh Gandhi, Olivier Giroux, Nick Stam, and Ronny Krashinsky. Nvidia a100 tensor core gpu: Performance and innovation. IEEE MICRO, 2021. [40] John Colgrove, John D Davis, John Hayes, Ethan L Miller, Cary Sandvig, Russell Sears, Ari Tamches, Neil Vachharajani, and Feng Wang. Purity: Building fast, highly-available enterprise flash storage from commodity components. In ACM SIGMOD, 2015. [41] Giovanni De Micheli. Networks on chips. In Design, Automation, and Test in Europe: The Most Influential Papers of 10 Years Date, pages 105–110. Springer, 2008. [42] Alin Deutsch, Yu Xu, Mingxi Wu, and Victor Lee. Tigergraph: A native mpp graph database. arXiv preprint arXiv:1901.08248, 2019. [43] Siying Dong, Andrew Kryczka, Yanqin Jin, and Michael Stumm. Rocksdb: Evolution of development priorities in a key-value store serving large- scale applications. ACM TOS, 2021. [44] Facebook, Inc. Rocksdb benchmarking tools. https://github.com/ facebook/rocksdb/wiki/Benchmarking-tools. 133 https://github.com/facebook/rocksdb/wiki/Benchmarking-tools https://github.com/facebook/rocksdb/wiki/Benchmarking-tools [45] Alireza Farshin, Luigi Rizzo, Khaled Elmeleegy, and Dejan Kostić. Over- coming the IOTLB wall for multi-100-gbps linux-based networking. PeerJ Comput. Sci., 9:e1385, May 2023. [46] Alireza Farshin, Amir Roozbeh, Gerald Q Maguire Jr, and Dejan Kostić. Reexamining direct cache access to optimize i/o intensive applications for multi-hundred-gigabit networks. In USENIX ATC, 2020. [47] Mario Gerla and Leonard Kleinrock. Flow control: A comparative survey. In IEEE Transactions on Communications, 1980. [48] Krishnan Gosakan, Jaehyun Han, William Kuszmaul, Ibrahim N. Mubarek, Nirjhar Mukherjee, Karthik Sriram, Guido Tagliavini, Evan West, Michael A. Bender, Abhishek Bhattacharjee, Alex Conway, Mar- tin Farach-Colton, Jayneel Gandhi, Rob Johnson, Sudarsun Kannan, and Donald E. Porter. Mosaic pages: Big tlb reach with small pages. In ACM ASPLOS, 2023. [49] Donghyun Gouk, Miryeong Kwon, Jie Zhang, Sungjoon Koh, Wonil Choi, Nam Sung Kim, Mahmut Kandemir, and Myoungsoo Jung. Amber: En- abling precise full-system simulation with detailed modeling of all ssd resources. In IEEE MICRO, 2018. [50] Giulia Guidi, Marquita Ellis, Aydin Buluç, Katherine Yelick, and David Culler. 10 years later: Cloud computing is closing the performance gap. In ACM/SPEC ICPE, 2021. [51] Ajay Gulati, Arif Merchant, Mustafa Uysal, Pradeep Padala, and Peter Varman. Efficient and adaptive proportional share i/o scheduling. ACM SIGMETRICS Performance Evaluation Review, 37(2):79–80, 2009. [52] Ajay Gulati, Arif Merchant, Mustafa Uysal, Pradeep Padala, and Peter Varman. Workload dependent io scheduling for fairness and efficiency in shared storage systems. In IEEE HPCA, 2012. [53] Ajay Gulati and Peter Varman. Lexicographic qos scheduling for parallel i/o. In ACM SPAA, 2005. [54] Mingzhe Hao, Huaicheng Li, Michael Hao Tong, Chrisma Pakha, Riza O Suminto, Cesar A Stuardo, Andrew A Chien, and Haryadi S Gunawi. Mit- tos: Supporting millisecond tail tolerance with fast rejecting slo-aware os interface. In ACM SOSP, 2017. 134 [55] Yuchen Hao, Zhenman Fang, Glenn Reinman, and Jason Cong. Sup- porting address translation for accelerator-centric architectures. In IEEE HPCA, 2017. [56] Jian Huang, Anirudh Badam, Laura Caulfield, Suman Nath, Sudipta Sen- gupta, Bikash Sharma, and Moinuddin K Qureshi. Flashblox: Achieving both performance isolation and uniform lifetime for virtualized ssds. In USENIX FAST, 2017. [57] Jaehyun Hwang, Midhul Vuppalapati, Simon Peter, and Rachit Agarwal. Rearchitecting linux storage stack for µs latency and high throughput. In USENIX OSDI, 2021. [58] Sagar Imambi, Kolla Bhanu Prakash, and GR Kanagachidambaresan. Py- torch. Programming with TensorFlow: Solution for Edge Computing Applica- tions, pages 87–104, 2021. [59] Intel. Intel® data direct i/o technology (intel® ddio): A primer. https://www.intel.com/content/dam/www/public/us/en/documents/ technology-briefs/data-direct-i-o-technology-brief.pdf, 2012. [60] Intel. Intel Virtualization Technology for Directed I/O. https: //www.intel.com/content/www/us/en/content-details/774206/ intel-virtualization-technology-for-directed-i-o-architecture-specification. html, 2023. [61] Intel. Intel® 64 and ia-32 architectures software developer manuals. https://www.intel.com/content/www/us/en/developer/articles/ technical/intel-sdm.html, 2023. [62] Intel. Intel® memory latency checker. https://www. intel.com/content/www/us/en/developer/articles/tool/ intelr-memory-latency-checker.html, 2023. [63] iperf. iperf - the ultimate speed test tool for tcp, udp and sctp. https: //iperf.fr/, 2023. [64] Bruce Jacob, David Wang, and Spencer Ng. Memory Systems: Cache, DRAM, Disk. 2010. [65] Raj Jain. Congestion control and traffic management in atm networks: 135 https://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/data-direct-i-o-technology-brief.pdf https://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/data-direct-i-o-technology-brief.pdf https://www.intel.com/content/www/us/en/content-details/774206/intel-virtualization-technology-for-directed-i-o-architecture-specification.html https://www.intel.com/content/www/us/en/content-details/774206/intel-virtualization-technology-for-directed-i-o-architecture-specification.html https://www.intel.com/content/www/us/en/content-details/774206/intel-virtualization-technology-for-directed-i-o-architecture-specification.html https://www.intel.com/content/www/us/en/content-details/774206/intel-virtualization-technology-for-directed-i-o-architecture-specification.html https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html https://iperf.fr/ https://iperf.fr/ Recent advances and a survey. In Computer Networks and ISDN Systems, 1996. [66] JanusGraph Contributors. Janusgraph. https://janusgraph.org, 2017. [67] Tianyang Jiang, Guangyan Zhang, Zican Huang, Xiaosong Ma, Junyu Wei, Zhiyue Li, and Weimin Zheng. Fusionraid: Achieving consistent low latency for commodity ssd arrays. In USENIX FAST, 2021. [68] Rick Jones. Netperf benchmark. http://www.netperf.org/, 2012. [69] Myoungsoo Jung, Wonil Choi, Miryeong Kwon, Shekhar Srikantaiah, Joonhyuk Yoo, and Mahmut Taylan Kandemir. Design of a host inter- face logic for gc-free ssds. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 39(8):1674–1687, 2019. [70] Anuj Kalia, Michael Kaminsky, and David Andersen. Datacenter rpcs can be general and fast. In USENIX NSDI, 2019. [71] Jaeho Kim, Donghee Lee, and Sam H Noh. Towards slo complying ssds through ops isolation. In USENIX FAST, volume 15, pages 183–189, 2015. [72] Seonbong Kim and Joon-Sung Yang. Optimized i/o determinism for emerging nvm-based nvme ssd in an enterprise system. In Proceedings of the 55th Annual Design Automation Conference, pages 1–6, 2018. [73] Taehun Kim, Hyeongjin Park, Seokmin Lee, Seunghee Shin, Junbeom Hur, and Youngjoo Shin. Devious: Device-driven side-channel attacks on the iommu. In IEEE SP, 2023. [74] Taejin Kim, Duwon Hong, Sangwook Shane Hahn, Myoungjun Chun, Sungjin Lee, Joo Young Hwang, Jongyoul Lee, and Jihong Kim. Fully au- tomatic stream management for multi-streamed ssds using program con- texts. In USENIX FAST, 2019. [75] Youngjae Kim, Junghee Lee, Sarp Oral, David A Dillow, Feiyi Wang, and Galen M Shipman. Coordinating garbage collectionfor arrays of solid- state drives. IEEE Transactions on Computers, 63(4):888–901, 2012. [76] Ana Klimovic, Heiner Litz, and Christos Kozyrakis. Reflex: Remote flash = local flash. ACM SIGARCH Computer Architecture News, 45(1):345–359, 2017. 136 https://janusgraph.org [77] Gautam Kumar, Nandita Dukkipati, Keon Jang, Hassan MG Wassel, Xian Wu, Behnam Montazeri, Yaogong Wang, Kevin Springborn, Christopher Alfeld, Michael Ryan, et al. Swift: Delay is simple and effective for con- gestion control in the datacenter. In ACM SIGCOMM, 2020. [78] Jaewook Kwak, Sangjin Lee, Kibin Park, Jinwoo Jeong, and Yong Ho Song. Cosmos+ openssd: Rapid prototype for flash storage systems. ACM Transactions on Storage (TOS), 16(3):1–35, 2020. [79] Alexey Lavrov and David Wentzlaff. Hypertrio: hyper-tenant translation of i/o addresses. In ACM/IEEE ISCA, 2020. [80] Chang Joo Lee, Veynu Narasiman, Eiman Ebrahimi, Onur Mutlu, and Yale N Patt. Dram-aware last-level cache writeback: Reducing write- caused interference in memory systems, 2010. [81] Donghyuk Lee, Yoongu Kim, Gennady Pekhimenko, Samira Khan, Vivek Seshadri, Kevin Chang, and Onur Mutlu. Adaptive-latency dram: Opti- mizing dram timing for the common-case. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pages 489– 501, 2015. [82] Donghyuk Lee, Lavanya Subramanian, Rachata Ausavarungnirun, Jong- moo Choi, and Onur Mutlu. Decoupled direct memory access: Isolating cpu and io traffic by leveraging a dual-data-port dram. In IEEE PACT, 2015. [83] Junghee Lee, Youngjae Kim, Galen M Shipman, Sarp Oral, Feiyi Wang, and Jongman Kim. A semi-preemptive garbage collector for solid state drives. In IEEE ISPASS, 2011. [84] Sang-Won Lee, Bongki Moon, and Chanik Park. Advances in flash mem- ory ssd technology for enterprise database applications. In ACM SIG- MOD, 2009. [85] Huaicheng Li, Martin L Putra, Ronald Shi, Xing Lin, Gregory R Ganger, and Haryadi S Gunawi. loda: A host/device co-design for strong pre- dictability contract on modern flash storage. In ACM SOSP, 2021. [86] Qiang Li, Qiao Xiang, Derui Liu, Yuxin Wang, Haonan Qiu, Gexiao Tian, Xiaoliang Wang, Lulu Chen, Ridi Wen, Jianbo Dong, et al. From rdma to rdca: Toward high-speed last mile of data center networks using remote direct cache access. https://arxiv.org/abs/2211.05975, 2022. 137 https://arxiv.org/abs/2211.05975 [87] Qiang Li, Qiao Xiang, Yuxin Wang, Haohao Song, Ridi Wen, Wenhui Yao, Yuanyuan Dong, Shuqi Zhao, Shuo Huang, Zhaosheng Zhu, et al. More than capacity: Performance-oriented evolution of pangu in alibaba. In USENIX FAST, 2023. [88] Yuliang Li, Rui Miao, Hongqiang Harry Liu, Yan Zhuang, Fei Feng, Lingbo Tang, Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Al- izadeh, et al. Hpcc: High precision congestion control. In ACM SIG- COMM. 2019. [89] John DC Little and Stephen C Graves. Little’s law. Building intuition: insights from basic operations management models and principles, pages 81– 100, 2008. [90] Hui Lu, Brendan Saltaformaggio, Ramana Kompella, and Dongyan Xu. vfair: latency-aware fair storage scheduling via per-io cost-based differ- entiation. In ACM SoCC, 2015. [91] Moshe Malka, Nadav Amit, Muli Ben-Yehuda, and Dan Tsafrir. riommu: Efficient iommu for i/o devices that employ ring buffers. In ACM ASP- LOS, 2015. [92] Moshe Malka, Nadav Amit, and Dan Tsafrir. Efficient intra-operating sys- tem protection against harmful dmas. In USENIX FAST, 2015. [93] Alex Markuze, Adam Morrison, and Dan Tsafrir. True iommu protection from dma attacks: When copy is faster than zero copy. In ACM ASPLOS, 2016. [94] Alex Markuze, Igor Smolyar, Adam Morrison, and Dan Tsafrir. Damn: Overhead-free iommu protection for networking. In ACM ASPLOS, 2018. [95] Michael Marty, Marc de Kruijf, Jacob Adriaens, Christopher Alfeld, Sean Bauer, Carlo Contavalli, Michael Dalton, Nandita Dukkipati, William C Evans, Steve Gribble, et al. Snap: A microkernel approach to host net- working. In ACM SOSP, 2019. [96] Matthew Mathis, Jeffrey Semke, Jamshid Mahdavi, and Teunis Ott. The macroscopic behavior of the tcp congestion avoidance algorithm. In ACM SIGCOMM CCR, 1997. 138 [97] Mellanox Technologies. Mellanox CX5 Network Interface Card. Product Information, 2024. [98] Justin J Miller. Graph database applications and concepts with neo4j. In Proceedings of the southern association for information systems conference, At- lanta, GA, USA, volume 2324, pages 141–147, 2013. [99] Radhika Mittal, Vinh The Lam, Nandita Dukkipati, Emily Blem, Hassan Wassel, Monia Ghobadi, Amin Vahdat, Yaogong Wang, David Wetherall, and David Zats. Timely: Rtt-based congestion control for the datacenter. In ACM SIGCOMM, 2015. [100] Jeffrey C Mogul and Kadangode K Ramakrishnan. Eliminating receive livelock in an interrupt-driven kernel. In ACM TOCS, 1997. [101] Benoı̂t Morgan, Éric Alata, Vincent Nicomette, and Mohamed Kaâniche. IOMMU protection against I/O attacks: a vulnerability and a proof of concept. J. Braz. Comput. Soc., 24(1), December 2018. [102] Benoı̂t Morgan, Éric Alata, Vincent Nicomette, and Mohamed Kaâniche. Bypassing iommu protection against i/o attacks. In LADC, 2016. [103] Robert Morris. Tcp behavior with many flows. In IEEE ICNP, 1997. [104] Rolf Neugebauer, Gianni Antichi, José Fernando Zazo, Yury Audzevich, Sergio López-Buedo, and Andrew W Moore. Understanding pcie perfor- mance for end host networking. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pages 327–341, 2018. [105] Rolf Neugebauer, Gianni Antichi, José Fernando Zazo, Yury Audzevich, Sergio López-Buedo, and Andrew W. Moore. Understanding pcie perfor- mance for end host networking. In ACM SIGCOMM, 2018. [106] Bo Pang, Erik Nijkamp, and Ying Nian Wu. Deep learning with tensor- flow: A review. Journal of Educational and Behavioral Statistics, 45(2):227– 248, 2020. [107] Irma Esmer Papazian. New 3rd gen intel® xeon® scalable processor (co- dename: Ice lake-sp). In Hot Chips Symposium, pages 1–22, 2020. [108] Omer Peleg, Adam Morrison, Benjamin Serebrin, and Dan Tsafrir. Utiliz- ing the iommu scalably. In USENIX ATC, 2015. 139 [109] Binh Pham, Viswanathan Vaidyanathan, Aamer Jaleel, and Abhishek Bhattacharjee. Colt: Coalesced large-reach tlbs. In IEEE/ACM MICRO, 2012. [110] Solal Pirelli and George Candea. A simpler and faster NIC driver model for network functions. In USENIX OSDI, 2020. [111] Pure Storage. Linux recommended settings. https://support. purestorage.com/Solutions/Linux/Linux_Reference/Linux_ Recommended_Settings. [112] Mohammad Rajiullah, Per Hurtig, Anna Brunstrom, Andreas Petlund, and Michael Welzl. An evaluation of tail loss recovery mechanisms for tcp. In ACM SIGCOMM CCR, 2015. [113] K Ramakrishnan, Sally Floyd, and D Black. Rfc3168: The addition of explicit congestion notification (ecn) to ip. https://datatracker.ietf. org/doc/html/rfc3168, 2001. [114] Zebin Ren, Krijn Doekemeijer, Nick Tehrany, and Animesh Trivedi. Bfq, multiqueue-deadline, or kyber? performance characterization of linux storage schedulers in the nvme era. 2024. [115] Scott Rixner, William J Dally, Ujval J Kapasi, Peter Mattson, and John D Owens. Memory access scheduling. ACM SIGARCH Computer Architec- ture News, 28(2):128–138, 2000. [116] Benny Rubin, Saksham Agarwal, and Rachit Agarwal. Fast and safe memory protection for networked systems. In Draft: https://saksham.web.illinois.edu/assets/pdf/fns.pdf, 2024. [117] Salman Salloum, Ruslan Dautov, Xiaojun Chen, Patrick Xiaogang Peng, and Joshua Zhexue Huang. Big data analytics on apache spark. Interna- tional Journal of Data Science and Analytics, 1:145–164, 2016. [118] Debendra Das Sharma. Compute express link (cxl): Enabling heteroge- neous data-centric computing with heterogeneous memory hierarchy. In IEEE MICRO, 2022. [119] David Shue and Michael J Freedman. From application requests to virtual iops: Provisioned key-value storage with libra. In Proceedings of the Ninth European Conference on Computer Systems, pages 1–14, 2014. 140 https://support.purestorage.com/Solutions/Linux/Linux_Reference/Linux_Recommended_Settings https://support.purestorage.com/Solutions/Linux/Linux_Reference/Linux_Recommended_Settings https://support.purestorage.com/Solutions/Linux/Linux_Reference/Linux_Recommended_Settings https://datatracker.ietf.org/doc/html/rfc3168 https://datatracker.ietf.org/doc/html/rfc3168 https://saksham.web.illinois.edu/assets/pdf/fns.pdf [120] Athinagoras Skiadopoulos, Zhiqiang Xie, Mark Zhao, Qizhe Cai, Sak- sham Agarwal, Jacob Adelmann, David Ahern, Carlo Contavalli, Michael Goldflam, Vitaly Mayatskikh, et al. High-throughput and flexible host networking for accelerated computing. In USENIX OSDI, 2024. [121] Parul Sohal, Michael Bechtel, Renato Mancuso, Heechul Yun, and Orran Krieger. A closer look at intel resource director technology (rdt). In RTNS, 2022. [122] Timothy Stamler, Deukyeon Hwang, Amanda Raybuck, Wei Zhang, and Simon Peter. {zIO}: Accelerating {IO-Intensive} applications with trans- parent {Zero-Copy}{IO}. In USENIX OSDI, 2022. [123] Amy Tai, Igor Smolyar, Michael Wei, and Dan Tsafrir. Optimizing storage performance with calibrated interrupts. ACM TOS, 2022. [124] Eno Thereska, Hitesh Ballani, Greg O’Shea, Thomas Karagiannis, Antony Rowstron, Tom Talpey, Richard Black, and Timothy Zhu. Ioflow: A software-defined storage architecture. In ACM SOSP, 2013. [125] Jordan Tigani and Siddartha Naidu. Google bigquery analytics. John Wiley & Sons, 2014. [126] Shih-Hao Tseng, Saksham Agarwal, Rachit Agarwal, Hitesh Ballani, and Ao Tang. Codedbulk: Inter-datacenter bulk transfers using network cod- ing. In USENIX NSDI, 2021. [127] Mehul Nalin Vora. Hadoop-hbase for large-scale data. In Proceedings of 2011 International Conference on Computer Science and Network Technology, volume 1, pages 601–605. IEEE, 2011. [128] Midhul Vuppalapati, Saksham Agarwal, Henry N. Schuh, Baris Kasikci, Arvind Krishnamurthy, and Rachit Agarwal. Understanding the host net- work. In ACM SIGCOMM, 2024. [129] Hao Wang, Chang-Jae Park, Gyung-su Byun, Jung Ho Ahn, and Nam Sung Kim. Alloy: Parallel-serial memory channel architecture for single-chip heterogeneous processor systems. In IEEE HPCA, 2015. [130] WCCFtech. Intel emerald rapids: 5th gen xeon cpu family specs leaked – up to 64 cores, ddr5-5600, 80 gen 5 lanes, 350w. 141 https://wccftech.com/intel-emerald-rapids-5th-gen-xeon-cpu-family- specs-leaked-up-to-64-cores-ddr5-5600-80-gen-5-lanes-350w/. [131] Harald Welte. The netfilter framework in linux 2.4. In Linux-Kongress, 2000. [132] Chenggang Wu, Vikram Sreekanti, and Joseph M Hellerstein. Autoscaling tiered cloud storage in anna. VLDB, 2019. [133] Shiqin Yan, Huaicheng Li, Mingzhe Hao, Michael Hao Tong, Swami- nathan Sundararaman, Andrew A Chien, and Haryadi S Gunawi. Tiny- tail flash: Near-perfect elimination of garbage collection tail latencies in nand ssds. ACM TOS, 2017. [134] Juncheng Yang, Yao Yue, and KV Rashmi. A large-scale analysis of hun- dreds of in-memory key-value cache clusters at twitter. ACM TOS, 2021. [135] Ziye Yang, James R. Harris, Benjamin Walker, Daniel Verkamp, Chang- peng Liu, Cunyin Chang, Gang Cao, Jonathan Stern, Vishal Verma, and Luse E. Paul. Spdk: A development kit to build high performance storage applications. IEEE CloudCom, 2017. [136] Jie Zhang, Miryeong Kwon, Donghyun Gouk, Sungjoon Koh, Changlim Lee, Mohammad Alian, Myoungjun Chun, Mahmut Taylan Kandemir, Nam Sung Kim, Jihong Kim, et al. Flashshare: Punching through server storage stack from kernel to firmware for ultra-low latency ssds. In USENIX OSDI, 2018. [137] Xiangqun Zhang, Shuyi Pei, Jongmoo Choi, and Bryan S. Kim. Excessive ssd-internal parallelism considered harmful. In ACM HotStorage, 2023. [138] Yiying Zhang, Leo Prasath Arulraj, Andrea C Arpaci-Dusseau, and Remzi H Arpaci-Dusseau. De-indirection for flash-based ssds with name- less writes. In USENIX FAST, 2012. [139] Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mo- hamad Haj Yahia, and Ming Zhang. Congestion control for large-scale rdma deployments. ACM SIGCOMM, 2015. 142 Biographical Sketch Dedication Acknowledgements Table of Contents List of Figures Introduction The Host Network Emergence of Host Congestion The Root Cause of Host Congestion Impact of Host Congestion Implications to Design of Network Protocols & OS Thesis and Key Contributions Acknowledgements Understanding the Impact of Host Congestion Host Network Datapath Network (NIC-to-CPU/memory) Datapath Storage (SSD-to-CPU/memory) Datapath Datapath with Memory Protection Enabled Host Congestion Impact on Networking Applications Impact of Memory Interconnect Contention Impact of IOMMU Contention Host Congestion Impact on Storage Applications Summary Rearchitecting Congestion Control Overview hostCC Host congestion signals Host-local congestion response at sub-RTT granularity Network resource allocation at RTT granularity hostCC Implementation Host congestion signals Host-local congestion response Network resource allocation hostCC Evaluation hostCC benefits hostCC results with DDIO enabled hostCC sensitivity analysis Deep dive into hostCC performance Summary Rethinking IO Memory Protection Overview F&S F&S Implementation F&S Evaluation F&S Benefits Necessity of each design idea in F&S Summary Rearchitecting Storage Stacks Overview storageCC storageCC Overview Detecting host network bottlenecks Converging to the right amount of parallelism storageCC Implementation storageCC Evaluation storageCC benefits RocksDB with storageCC storageCC Overheads storageCC Sensitivity Analysis Understanding storageCC microscopic behavior Fair resource allocation using storageCC Summary Related Work Managing Host Resource Contention Efficient IO Memory Protection Mitigating Bottlenecks in Storage Datapath Conclusion and Outlook Future Directions Bibliography