HOST CONGESTION CONTROL

A Dissertation

Presented to the Faculty of the Graduate School

of Cornell University

in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

by

Saksham Agarwal

August 2024


© 2024 Saksham Agarwal

ALL RIGHTS RESERVED


HOST CONGESTION CONTROL

Saksham Agarwal, Ph.D.

Cornell University 2024

The conventional wisdom in systems and networking communities is that con-

gestion in datacenter networks happens primarily within the network fabric

(i.e., at network links and/or switches).

This dissertation has three core contributions: (1) it presents a new problem

of “host congestion”—congestion within the datapath between peripheral de-

vices and compute/memory—and presents evidence of host congestion both in

production datacenters and in experimental lab setups; (2) it builds an in-depth

understanding of the root causes of host congestion, and of the impact of host

congestion on application-level performance; and (3) it explores the implica-

tions of host congestion to the design of network protocols, network stacks and

operating systems.

We define host congestion in the context of networked applications as fol-

lows: the receiver-side host network interface card (NIC) receives data from the

network at a rate faster than it can transfer it to compute/memory. This reduces

the available NIC-to-memory bandwidth, resulting in queueing and eventual

packet drops at hosts. We demonstrate that, even with state-of-the-art network

protocols and network stacks, host congestion leads to significant degradation

in throughput and orders-of-magnitude inflation in tail latency and a surpris-

ingly large fraction of packet drops at the host, even when the access link band-

width is far from fully utilized. We present evidence and characterization of

the host congestion phenomenon for both large-scale production clusters (run-


ning Swift congestion control protocol with a userspace network stack), and

in experimental testbeds (running DCTCP congestion congestion protocol with

Linux network stack). Several recent studies have built upon our work to show

that hardware-offloaded network stacks also suffer from similar or worse host

congestion phenomenon.

Host congestion, and resulting queueing and packet drops at the host, are

new to the community. To that end, this thesis also builds an in-depth under-

standing of the root causes of the host congestion phenomenon. We demon-

strate that host congestion is caused due to nanosecond-scale latency inflation

within the NIC-to-CPU/memory datapath. Such latency inflation is rooted in

the poor interplay between processor, memory and peripheral interconnects

within the host; such an interplay, in turn, leads to contention at host resources

(e.g., memory bandwidth, IO memory management units, etc.) and manifests it-

self in the form of underutilization of peripheral interconnect (PCIe) bandwidth,

queueing and packet drops at the NIC, and subsequent drop in application-

level performance. We also discuss that unfavorable technology trends for host

resources are going to make this problem worse over time: access link band-

widths and PCIe bandwidths are expected to increase by 8-16x over the next

few years; however, trends for essentially all other host resources (e.g., memory

bandwidth per core, IO memory management unit caches, NIC buffer sizes, etc.)

are relatively stagnant. Thus, we expect higher degrees of resource imbalance

and contention within the host.

Host congestion alters the many assumptions entrenched within the design

of modern networked and operating systems. For instance, existing datacenter

congestion control (CC) protocols were not designed to efficiently detect and

respond to host congestion—they operate at a network round-trip-time (RTT)


granularity (typically tens-to-hundreds of microseconds) while host congestion

can change dynamically at sub-microsecond granularity. We present hostCC,

a new congestion control architecture that handles both host congestion and

network fabric congestion. HostCC introduces new “host-local” congestion sig-

nals and a sub-RTT granularity “host-local” congestion response to handle host

congestion. We demonstrate that hostCC can be integrated with existing CC

protocols to efficiently perform host and network resource allocation among

competing entities.

As another example of host congestion altering the many assumptions en-

trenched within operating systems, we explore IO memory protection mecha-

nisms. Such mechanisms are used in real-world production datacenters to pre-

vent malicious and/or faulty NICs from executing errant transfers into host

memory. Modern hosts achieve this using an IOMMU—NICs operate on vir-

tual addresses, and IOMMU translates virtual addresses to physical addresses

(potentially speeding up translations using a cache called IOTLB) before execut-

ing memory transfers. We demonstrate that existing state-of-the-art IO memory

protection mechanisms are able to provide one of the two desirable properties:

(1) strong safety property, that results in unavoidable IOTLB misses and sub-

sequent host congestion; or (2) high performance. We present “Fast & Safe”

(F&S), a simple modification to existing memory protection mechanisms that

enables them to provide the strongest safety property, and yet, near-completely

eliminates their overheads. The key insight in F&S design is that, rather than fo-

cusing on achieving high IOTLB hit rates, we should focus on reducing the cost

of each translation upon an IOTLB miss. F&S reduces the cost of each trans-

lation by exploiting modern IOMMU hardware, along with novel mechanisms

for contiguous virtual address allocation and batched unmapping and invali-


dations of used virtual addresses. We demonstrate that F&S design requires

no modifications in host hardware, minimal modifications within the operating

system, and yet, near-completely alleviates all overheads of the strongest form

of memory protection mechanism.

Finally, using the example of storage stacks, we demonstrate that host con-

gestion has implications that are more far-reaching than networked applica-

tions. Specifically, modern storage stacks enable exploiting high throughput

offered by storage devices like SSDs by maintaining multiple in-flight IO re-

quests. Typically, the number of in-flight requests is set to be the “knee point”

of the device’s latency-load curve. This minimizes the latency while maximiz-

ing the throughput, under the assumption that the bottleneck is at the CPU

and/or at the storage device. However, under host congestion, the latency-

load curve (and the knee-point) can change dynamically at sub-microsecond

granularity. We demonstrate that, under such host congestion, existing storage

stacks keep a sub-optimal number of in-flight requests resulting in significant

application-level performance degradation. We present storageCC, a new stor-

age stack architecture to handle host congestion. Our key insight is that the end-

to-end storage datapath (i.e., SSD-to-CPU/memory datapath) is conceptually a

(lossless) computer network. StorageCC builds upon this insight to detect host

congestion by collecting congestion signals within the host, and adapting the

number of in-flight requests based on host congestion. Evaluation of StorageCC

over a wide variety of scenarios demonstrates that StorageCC converges to the

optimal parallelism for each individual scenario, maintaining near-hardware la-

tency and throughput performance.

Our thesis suggests that host congestion may have wide-reaching implica-

tions to design of network protocols, network stacks, operating systems, and


even host hardware. To that end, we close by summarizing the many new re-

search questions opened up by our thesis in computer networking, operating

systems and computer architecture.


BIOGRAPHICAL SKETCH

Saksham Agarwal is a Ph.D. candidate in the department of Computer Science

at Cornell University, advised by Prof. Rachit Agarwal. He holds a Master of

Science in Computer Science from Cornell University, and Bachelor of Technol-

ogy in Electrical Engineering from Indian Institute of Technology Kanpur. He

is a recipient of Cornell University Fellowship, Google Ph.D. Fellowship, and a

Cornell Computer Science Outstanding Teaching Assistant Award.

iii


”It always seems impossible until it’s done.” - Nelson Mandela

iv


ACKNOWLEDGEMENTS

I will forever be indebted to my Ph.D. advisor, Rachit Agarwal, for enabling

me to reach where I am today. I have learned so much from Rachit about re-

search and life in general that it is nearly impossible to express my gratitude

adequately in just a few lines. Rachit dedicated a tremendous amount of his

time and energy to help me hone my skills in every facet necessary to become

a successful researcher—from thinking about problems correctly, to formulat-

ing concrete research agendas, to effectively communicating and presenting my

work, and to establishing meaningful connections with other researchers. I will

always cherish the countless hours we spent brainstorming together, during

which he provided invaluable feedback on even the tiniest details of my re-

search proposals, paper drafts, and presentation slides. Even more importantly,

I learned from him the importance of having an unwavering perseverance in

the face of temporary setbacks, and having strong conviction in one’s beliefs

and values in life. Thank you Rachit, for everything!

I am also grateful to the other members of my Ph.D. committee—David

Shmoys, Nate Foster, and Amin Vahdat—for their insightful feedback on my

research over the years. David and Amin have been my collaborators since I

started at Cornell, and I am thankful for their constant guidance and support

throughout my Ph.D. journey. I would like to extend my gratitude to all of my

other collaborators and mentors—Sylvia Ratnasamy, Arvind Krishnamurthy,

Hitesh Ballani, Vishal Shrivastav, and Kevin Tang. All of them played a signifi-

cant role in my growth as a researcher, and I appreciate the time they invested

in helping me succeed.

During the summers of 2020 and 2021, I had the opportunity to intern with

Google’s Congestion Control team. I would like to thank Nandita Dukkipati,

v


Behnam Montazeri, Gautam Kumar, Masoud Moshref, Luigi Rizzo, and every-

one else in the team, for providing a great learning experience. Google went

above and beyond to make my internship an enjoyable experience, even during

the challenging times of the pandemic. I am also grateful to them for supporting

my Ph.D. through a Google PhD Fellowship.

I am immensely thankful to my labmates, Qizhe Cai and Midhul Vuppalap-

ati. Our daily research discussions in the lab were instrumental to my learning

experience at Cornell. Without these conversations, I might still be grappling

with numerous bugs buried deep within the kernel or obscure register counter

values hidden in processor manuals. We shared not only our successes but also

the many (and often stressful) moments of paper rejections and all-nighters be-

fore deadlines. I would also like to express my gratitude to an amazing group

of junior Ph.D., master’s, and undergraduate students: Shreyas Kharbanda,

Shouxu Lin, Katie Gioioso, Benny Rubin, Pai Li, and Claire Wang. They not

only helped me grow as a mentor but also taught me a great deal along the way.

I would like to express my gratitude to the incredible professors at Cornell,

especially Robbert Van Renesse, Lorenzo Alvisi, Fred Schneider, Robert Klein-

berg, Ken Birman, and Chris De Sa for taking time from their busy schedules to

provide feedback on multiple practice talks for conference presentations and job

talks. The staff at Cornell’s Computer Science department was also amazingly

helpful. Becky always magically resolved all our administrative problems, and

Corey was a huge help in organizing our group meetings, lunches, and confer-

ence registrations.

I also thank my first-year office mates, who will always hold a special place

in my heart: Drishti, Eric, Cheng, Claire, Dietrich, Deepak, and Andrew. The

folks made my first year at Cornell, far away from home, a really enjoyable

vi


one. My gratitude also goes to my Networking Lab mates—Vishal, Praveen,

and Hardik—for countless discussions on datacenters, programmable switches,

and graduate student life. Many thanks to my Systems Lab mates—Yunhao,

Ali, Florian, Sagar, and Alicia—for random technical discussions, watercooler

chats, and practice talks. Special thanks to Shijin, my first student collaborator

at Cornell, who taught me a lot not only CS theory, and also introduced me to

the life at Cornell and in Ithaca.

The Ph.D. journey can be tough at times, and it would have been impossible

to navigate it without the support of my friends—my family away from home.

I would like to thank Ashudeep Singh, Utkarsh Mall, Jashan Singhal, Prateek

Sehgal, Drishti Wali, Arzoo Katiyar, Arpita Sharma, and Meenakshi Khosla for

the Friday FIFA nights, squash, tennis, and badminton games, road trips, count-

less dinners, and the weekly check-in sessions over Zoom during the pandemic.

My gratitude extends to my IITK family—Ashwin Verma, Sourya Basu, Ayush

Sinha, Kapil Sethi, Rewant Teotia, Shubham Agarwal, and Shubham Jain—for

staying in touch and offering me places to stay while I roamed across the globe.

Most importantly, I thank my family—my parents, Shweta and Manish

Agarwal; my sister, Dr. Samiksha Agarwal; and my grandparents, Sita and Ram

Raghubir Agarwal—for their upbringing, countless sacrifices, and unwavering

support, no matter where I am. I also extend my thanks to my aunt and uncle,

Shivani and Chetan Jain, for always welcoming me into their home in Colorado,

especially during the pandemic.

I would also like to thank Cornell University and Cornell Health for bril-

liantly managing the spread of COVID-19 on campus, helping my family feel

less stressed about my safety while I was thousands of miles away from home

for several years.

vii


TABLE OF CONTENTS

Biographical Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x

1 Introduction 1
1.1 The Host Network . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Emergence of Host Congestion . . . . . . . . . . . . . . . . . . . . 3
1.3 The Root Cause of Host Congestion . . . . . . . . . . . . . . . . . 6
1.4 Impact of Host Congestion . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Implications to Design of Network Protocols & OS . . . . . . . . . 8
1.6 Thesis and Key Contributions . . . . . . . . . . . . . . . . . . . . . 11
1.7 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2 Understanding the Impact of Host Congestion 14
2.1 Host Network Datapath . . . . . . . . . . . . . . . . . . . . . . . . 14

2.1.1 Network (NIC-to-CPU/memory) Datapath . . . . . . . . . 14
2.1.2 Storage (SSD-to-CPU/memory) Datapath . . . . . . . . . . 17
2.1.3 Datapath with Memory Protection Enabled . . . . . . . . . 19

2.2 Host Congestion Impact on Networking Applications . . . . . . . 22
2.2.1 Impact of Memory Interconnect Contention . . . . . . . . . 23
2.2.2 Impact of IOMMU Contention . . . . . . . . . . . . . . . . 28

2.3 Host Congestion Impact on Storage Applications . . . . . . . . . . 35
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3 Rearchitecting Congestion Control 41
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 hostCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.2.1 Host congestion signals . . . . . . . . . . . . . . . . . . . . 44
3.2.2 Host-local congestion response at sub-RTT granularity . . 46
3.2.3 Network resource allocation at RTT granularity . . . . . . 50

3.3 hostCC Implementation . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.1 Host congestion signals . . . . . . . . . . . . . . . . . . . . 52
3.3.2 Host-local congestion response . . . . . . . . . . . . . . . . 54
3.3.3 Network resource allocation . . . . . . . . . . . . . . . . . 56

3.4 hostCC Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.4.1 hostCC benefits . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.4.2 hostCC results with DDIO enabled . . . . . . . . . . . . . . 63
3.4.3 hostCC sensitivity analysis . . . . . . . . . . . . . . . . . . 65
3.4.4 Deep dive into hostCC performance . . . . . . . . . . . . . 65

3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

viii


4 Rethinking IO Memory Protection 71
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 F&S . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3 F&S Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.4 F&S Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

4.4.1 F&S Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.4.2 Necessity of each design idea in F&S . . . . . . . . . . . . . 87

4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

5 Rearchitecting Storage Stacks 90
5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.2 storageCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

5.2.1 storageCC Overview . . . . . . . . . . . . . . . . . . . . . . 94
5.2.2 Detecting host network bottlenecks . . . . . . . . . . . . . 96
5.2.3 Converging to the right amount of parallelism . . . . . . . 99

5.3 storageCC Implementation . . . . . . . . . . . . . . . . . . . . . . 103
5.4 storageCC Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 108

5.4.1 storageCC benefits . . . . . . . . . . . . . . . . . . . . . . . 110
5.4.2 RocksDB with storageCC . . . . . . . . . . . . . . . . . . . 112
5.4.3 storageCC Overheads . . . . . . . . . . . . . . . . . . . . . 114
5.4.4 storageCC Sensitivity Analysis . . . . . . . . . . . . . . . . 114
5.4.5 Understanding storageCC microscopic behavior . . . . . . 116
5.4.6 Fair resource allocation using storageCC . . . . . . . . . . 118

5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

6 Related Work 120
6.1 Managing Host Resource Contention . . . . . . . . . . . . . . . . . 120
6.2 Efficient IO Memory Protection . . . . . . . . . . . . . . . . . . . . 121
6.3 Mitigating Bottlenecks in Storage Datapath . . . . . . . . . . . . . 123

7 Conclusion and Outlook 125
7.1 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

Bibliography 130

ix


LIST OF FIGURES

1.1 End-to-end datapath of distributed applications involves not
only transfers across the network fabric (left) comprising links
and switches, but also across the host network (right), compris-
ing processor interconnect, peripheral interconnect and memory
interconnect. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Memory bandwidth is becoming increasingly oversubscribed
due to rapidly approaching PCIe bandwidth. . . . . . . . . . . . 4

1.3 NIC-to-memory transfer rate matches the arrival rate at the NIC
until the latency budget is met for these transfers. Host conges-
tion, that is, the reduction in achievable NIC-to-memory transfer
rate, is a result of overshooting of this latency budget. . . . . . . 6

1.4 Host congestion in a large-scale Google production cluster. The
figure shows a scatter plot of link utilization (normalized by ac-
cess link bandwidth) and packet drop rates at hosts. The clus-
ter runs both the Linux kernel and SNAP [95] network stacks,
with TCP and Swift [77] congestion control protocols, respec-
tively. Precise values for packet drop rates are hidden due to
these being sensitive. . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1 Illustration of host network datapath for the Intel architecture
(AMD architecture is conceptually similar) between NIC and
CPU/memory for the DDIO disabled case. . . . . . . . . . . . . . 15

2.2 Illustration of storage datapath between SSD and CPU/memory. 18
2.3 High-level DMA datapath with IOMMU enabled. . . . . . . . . . 20
2.4 (a) Host congestion leads to significant performance degradation

in terms of packet drop rates and throughput (even with no net-
work congestion). DDIO helps a little but observes similar per-
formance degradation. (b) MApp is able to acquire a large fraction
of memory bandwidth, leaving little room for NetApp-T. . . . . . 24

2.5 The impact of host congestion worsens with larger MTU size and
number of flows, with 3× degree of host congestion. DDIO-
enabled case, in particular, suffers more than DDIO-disabled
case. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.6 Host congestion leads to orders of magnitude tail latency in-
flation. The figure shows {P50, P90, P99, P99.9, P99.99} laten-
cies (represented by whiskers) for NetApp-L, with and without
host congestion, when NetApp-T, NetApp-L, and MApp are run to-
gether with DDIO-disabled (we observed near-identical latency
results with DDIO-enabled). . . . . . . . . . . . . . . . . . . . . . 27

x


2.7 Enabling memory protection results in >35% degradation in
throughput. Increasing ring buffer sizes with IOMMU enabled
leads to (a) decreasing throughput and (b) increasing packet
drop rates. (c) IOTLB misses do not significantly change as the
ring buffer size increases. (d) However, IOMMU L3 cache miss
rates increase significantly, leading to a larger cost of IOTLB
misses (e) Larger ring buffer sizes result in worse locality as the
IOVA working set size grows. . . . . . . . . . . . . . . . . . . . . 29

2.8 With IOMMU enabled and an increasing number of flows, we
observe (a) higher throughput degradation, (b) larger packet
drop rates, (c) a larger number of IOTLB misses, (d) higher
IOMMU cache misses, and (e) poorer locality in L3 page caches. 32

2.9 IOMMU contention leads to orders of magnitude tail latency in-
flation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.10 Application performance with both IOMMU and memory band-
width contention. Memory bandwidth contention causes further
performance degradation due to the increased latency for each
memory read. This worsens the cost of each page table walk de-
spite similar miss rates for both the IOTLB and page table caches. 34

2.11 Existing storage stacks suffer from tail latency inflation due
to memory interconnect bottlenecks. (a) The setup comprises
L-apps, T-apps, and M-apps all co-located within a single NUMA
node; (b) the T-apps set parallelism as the benchmarked knee
point (circled in the green curve); in the presence of memory in-
terconnect bottleneck, the T-apps continue to maintain the same
amount of parallelism (cross marked on the red curve), incurring
7.23× higher tail latency than the ideal operating point for this
scenario (circled in the red curve); (c-f) other IO schedulers sup-
ported by Linux blk-mq, and blk-switch (a state-of-the-art stor-
age stack) incur similar latency inflation and/or high through-
put overheads. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

2.12 Higher inflation in tail latencies with memory interconnect bot-
tlenecks in remote NUMA node. (a) Setup uses applications run-
ning on CPU cores in NUMA node remote from SSDs; (b) L-apps
incur 17× tail latency inflation. . . . . . . . . . . . . . . . . . . . . 37

2.13 Throughput starvation problem when T-apps use multiple
NUMA nodes. (a) a setup, where the culprit application suf-
fers from memory, interconnect bottleneck, but not the victim;
(b) the culprit must necessarily suffer from throughput degrada-
tion; however, it also results in unnecessary degradation in the
victim application throughput. . . . . . . . . . . . . . . . . . . . 39

3.1 hostCC architecture overview. Discussion in §3.2. . . . . . . . . . 44

xi


3.2 hostCC generates host congestion signals that are not on the
NIC-to-memory datapath; it can thus measure both IS (left) and
BS (right) at sub-µs timescales independent of host congestion. . 51

3.3 Variation of IIO occupancy IS and PCIe bandwidth utiliza-
tion BS with time for 1ms period for the default experimental
setup without host congestion (left) and with 3× host congestion
(right). In absence of host congestion, BS ≈ 103Gbps (line-rate
bandwidth including PCIe overheads with 4K MTUs) and IS≈65,
which corresponds to hardware-specific IIO-DRAM bandwidth-
delay product (§3.2.1). During host congestion, IS increases to a
maximum value of ∼93 (shown in red), resulting in a reduction
of BS , queueing and packet drops at the NIC, and network CC
reducing the rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3.4 Evaluating MBA efficacy: NetApp-T is able to acquire more
host resources with higher host-local response levels (that is,
more backpressure on MApp). As a result, NetApp-T gets higher
throughput (left) and a larger fraction of memory bandwidth
(right). Discussion in §3.3.2 and §3.4. . . . . . . . . . . . . . . . . 55

3.5 (a) hostCC allows network traffic to achieve its target network
bandwidth of BT = 80Gbps, while simultaneously reducing
packet drop rates by orders of magnitude, even with a high de-
gree of host congestion. (b) with hostCC, MApps no longer ac-
quire a large fraction of memory bandwidth, even with high de-
gree of host congestion. . . . . . . . . . . . . . . . . . . . . . . . . 58

3.6 Even with high degree of host congestion (3× in this experiment),
hostCC consistently maintains its benefits across (a) MTU sizes
and (b) number of flows. . . . . . . . . . . . . . . . . . . . . . . . 60

3.7 Even with a high degree of host congestion (3× in this experi-
ment), hostCC incurs minimal latency inflation compared to no
host congestion, significantly improving tail latency across all
RPC sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

3.8 (a) In the presence of network congestion and absence of host
congestion, hostCC achieves performance similar to network CC
indicating minimal overheads; (b) in the presence of both host
and network congestion, hostCC consistently provides benefits
similar to the case with only host congestion. . . . . . . . . . . . . 61

3.9 hostCC, with DDIO enabled, provides benefits similar to the
DDIO disabled case. . . . . . . . . . . . . . . . . . . . . . . . . . . 63

3.10 hostCC, with DDIO enabled, provides benefits in terms of la-
tency improvements similar to the DDIO disabled case. . . . . . 64

3.11 hostCC consistently maintains its benefits across varying net-
work target bandwidth BT . Discussion in §3.4.3. . . . . . . . . . . 64

xii


3.12 (a) Increasing IT values lead to increasing drop rates due less ag-
gressive hostCC reaction to congestion. (b) MApp acquires larger
memory bandwidth with larger IT due to less aggressive back-
pressure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

3.13 Each of the three key technical ideas in the hostCC architecture—
generating host congestion signals at sub-µs granularity, sub-
RTT host-local congestion response, and network resource al-
location based on both host and network congestion signals—
contribute to hostCC’s performance. Discussion in §3.4.4. . . . . 67

3.14 Each of the three key technical ideas in the hostCC architecture—
generating host congestion signals at sub-µs granularity, sub-
RTT host-local congestion response, and network resource al-
location based on both host and network congestion signals—
contribute to hostCC’s performance. Discussion in §3.4.4. . . . . 68

3.15 In steady-state, hostCC keeps PCIe bandwidth close to the target
network bandwidth, while ensuring that IIO occupancy remains
smaller than the congestion threshold. Discussion in §3.4.4. . . . 69

4.1 F&S achieves contiguous IOVA mappings without requiring
changes to IOMMU hardware, IOVA allocator, and IOMMU
driver. (a) This process repeats for each page in the Rx descrip-
tor, requiring calls to both the IOVA allocator and IO page table
map. (b) A single 256KB IOVA is allocated and then mapped
in page-size chunks to the physical pages in the Rx descriptor.
F&S requires only one call to the IOVA allocator and the same
number of calls to the IO page table map. . . . . . . . . . . . . . . 80

4.2 F&S eliminates memory protection overheads by reducing cost
of each IOTLB miss. (a) F&S enables IOMMU on scenario to
match the performance of IOMMU off, even as the number of
flows increases. (b) By addressing the IOMMU contention, F&S
significantly lowers packet drop rates caused by queueing at
the NIC buffers. (c) F&S reduces the number of IOTLB misses
per descriptor up to 2x by minimizing packet drops, hence de-
creasing the number of transmitted ACKs. (d) F&S eliminates
L1/L2 misses and substantially reduces L3 misses by preserving
IO page table caches during invalidation and allocating contigu-
ous IOVAs. (e) F&S enhances the locality and thus reduces the
number of L3 cache misses. See §4.4.1 for more discussion. . . . . 82

xiii


4.3 F&S maintains good locality, resulting in low L3 cache misses
even as the IO working set sizes increase. With IOMMU enabled
and increasing ring buffer sizes, we observe that F&S achieves
several key performance metrics: (a) it matches the performance
of systems with IOMMU disabled; (b) experiences no packet
drops; (c) maintains a consistent number of IOTLB misses; (d)
incurs low IOMMU cache misses; and (e) demonstrates good lo-
cality in L3 page caches. See §4.4.1 for more discussion. . . . . . . 83

4.4 F&S enables IOMMU on to match the performance of IOMMU
off with varied number of cores and MTU sizes. See §4.4.1 for
more discussion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

4.5 F&S resolves IOMMU contention with DDIO on for varying
number of flows. Moreover, by improving CPU utilization, F&S
is able to saturate the link bandwidth. See §4.4.1 for more dis-
cussion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

4.6 The CPU breakdown results indicate that data copy overheads
increase with the number of ring buffer sizes due to ineffective
hardware prefetching. Results shown for (a) hardware prefetch-
ing enabled, and (b) hardware prefetching disabled respectively. 85

4.7 F&S brings DCTCP latency closer to IOMMU off case. See §4.4.1
for more discussion. . . . . . . . . . . . . . . . . . . . . . . . . . . 86

4.8 hostCC and F&S work together to resolve both memory band-
width and IOMMU contention. See §4.4.1 for more discussion. . 86

4.9 Contribution of each of the F&S key design ideas. As ring buffer
size increases we observe: (a) the throughput drops for the case
of “default + preserve” (b) as ring buffer size increases, despite
preserving the L3 cache, we still see misses due to the larger IO
working set size. See §4.4.2 for more discussion. . . . . . . . . . . 88

5.1 Datapath of an IO request within the Linux storage stack. Blue
boxes indicate points within the datapath where storageCC logic
is implemented. (§5.3) . . . . . . . . . . . . . . . . . . . . . . . . . 103

5.2 Effectively detecting storage datapath contention using stor-
ageCC: when there is no contention, IIO occupancy is 70% of the
maximum value, and reaches close to maximum in the presence
of memory bandwidth bottlenecks. Memory controller queue
occupancy (figure shows fraction of time WPQ occupancy is full)
also increases only when memory bandwidth is contented. . . . 106

5.3 IIO buffer and memory controller queue occupancies for the re-
mote NUMA setup. (Left column) Even without memory band-
width contention, the IIO buffer is filled up. This is due to
higher access latency from IIO to memory in remote NUMA
node. (Right column) IIO buffer and WPQ occupancies are as
expected under memory bandwidth contention. . . . . . . . . . . 109

xiv


5.4 storageCC allows storage stack to converge to amount of paral-
lelism which is close to the knee point, independent of degree
of contention and application NUMA-affinity. storageCC pro-
vides close to optimal latencies for L-App (upto ∼17× tail latency
improvement, even assuming the default IO depth = 64 – the
minimum needed to sustain optimal throughput under no con-
gestion) and maximum achievable throughput for T-app. Dis-
cussion in §5.4.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

5.5 storageCC helps alleviate the throughput starvation prob-
lem: improving victim application throughput, while incurring
smaller cost in culprit application throughput, leading to ∼55%
gains in victim throughput and ∼22% in overall throughput. Dis-
cussion in §5.4.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

5.6 Evaluation using RocksDB: storageCC provides similar im-
provement in tail latency as earlier results. . . . . . . . . . . . . . 113

5.7 storageCC observe negligible overheads (<5%) without the pres-
ence of any storage datapath congestion. Discussion in §5.4.3. . . 113

5.8 storageCC converges to right amount of parallelism indepen-
dent of IO size of the applications. Discussion in §5.4.4. . . . . . . 115

5.9 storageCC converges to the right amount of parallelism, even
with increasing load from competing L-apps. Discussion in §5.4.4. 115

5.10 DDIO enabled results: (a) scenario where DDIO can be benefi-
cial, storageCC provides even better tail latencies with close to
maximum achievable throughputs, (b) DDIO is not applicable
for remote NUMA scenario, storageCC provides near identical
results as baseline results. . . . . . . . . . . . . . . . . . . . . . . . 116

5.11 storageCC microscopic behavior . . . . . . . . . . . . . . . . . . . 117
5.12 Fair resource allocation by storageCC under host congestion and

varying number of application threads. Discussion in §5.4.6. . . . 118

xv


CHAPTER 1

INTRODUCTION

We begin the chapter by providing a brief background on the host network

and its characteristics (§1.1), followed by a discussion on the technology trends

leading to the emergence of host congestion—i.e., congestion within the host

network (§1.2). We then discuss the root cause behind the host congestion phe-

nomenon (§1.3), and provide evidence of host congestion in large-scale produc-

tion clusters (§1.4). We next discuss the implications of host congestion to design

of existing network protocols and OS (§1.5). Finally, we outline the key contri-

butions of this dissertation, with regards to rearchitecting the network protocols

and OS to help resolve the host congestion problem (§1.6).

1.1 The Host Network

Most distributed applications in datacenters—such as data analytics en-

gines [26, 117, 125], machine learning training frameworks [106, 58, 18], key-

value stores [43, 33, 19, 127], and graph processing databases [98, 42, 66], etc.—

perform high-speed data transfers between multiple hosts, across the datacen-

ter network (Figure 1.1 shows an example network topology used in datacen-

ters). The end-to-end datapath of these applications also involves data transfers

within the hosts, particularly between the CPU cores, memory, and the periph-

eral devices (like network interface cards or NICs [97], storage devices [84], and

accelerators [39, 16]). Such “intra-host transfers” are enabled by a network that

every host has—the host network—comprising processor, memory, and periph-

eral interconnects.

1


Peripheral Device 
(eg., Network Interface Card)

Root Complex

Processor Interconnect

DRAM

Peripheral Interconnect
(PCIe Lanes)

Datacenter Network Topology

Switches

Links

Hosts

Host Network Architecture

Memory Interconnect
(DDR Channels)

Figure 1.1: End-to-end datapath of distributed applications involves not only
transfers across the network fabric (left) comprising links and switches, but also
across the host network (right), comprising processor interconnect, peripheral
interconnect and memory interconnect.

Figure 1.1 provides an overview of the host network architecture in mod-

ern servers. The processor interconnect is used by CPU cores to exchange

data amongst themselves (between L1d, L2, and Last Level Caches) and to

send read/write requests to the DRAM. The peripheral interconnect, such as

PCIe [12], is used by peripheral devices to send read/write requests to CPU

cores and DRAM. The memory interconnect intercepts read/write requests

from CPU cores and peripheral devices to execute them on DRAM. Each of these

interconnects uses independent, specialized, hardware-based protocols for data

transfers [104, 41, 35]; these different protocols interpolate with each other to

enable the intra-host transfers across the host network.

Traditionally, the host network has provided many desirable properties:

minuscule probability of failures and packet corruption, losslessness guaran-

tees [28] (that is, no data will be dropped within the host network), and

nanosecond-scale latencies [104, 115]. Further, the host network also provides

ample bandwidth for intra-host transfers. For instance, modern servers have

2


higher peripheral interconnect bandwidth than the peripherals devices’ band-

width (e.g., 128Gbps and 256Gbps PCIe bandwidths for 100Gbps and 200Gbps

NICs, respectively). The memory interconnect bandwidth and processor inter-

connect bandwidth are often even larger than the peripheral interconnect band-

width [20, 107].

Distributed applications typically incur latencies of the order of network

round-trip-times (RTTs), which usually span a few tens-to-hundreds of mi-

croseconds in modern datacenter networks [77, 88, 139]; and the NIC band-

widths limit their throughput. The host network, on the other hand, provides

nanosecond-scale latencies and higher bandwidth than NICs. Unsurprisingly,

the conventional wisdom in the systems and networking communities has been

that the host network is an “ideal network” and has minimal-to-no impact

on the end-to-end performance or correctness properties of distributed appli-

cations. Thus, performance degradation for these applications is largely at-

tributed to contention for resources within the datacenter network fabric (that

is, contention for available link bandwidths and switch buffer capacities), a phe-

nomenon referred to as network congestion [77, 88, 139].

1.2 Emergence of Host Congestion

This dissertation establishes that we need to revisit the above conventional wis-

dom, in particular, due to the emerging technology trends: rapid growth in

network link and peripheral device bandwidths, combined with relatively stag-

nant trends for other host resources (e.g., memory bandwidth, memory access

latency, cache sizes per core, processor interconnect bandwidth, etc.) [129, 35, 82]

3


Cascadelake
(2019)

Icelake
(2021)

Sapphire
Rapids
(2023)

Emerald
Rapids
(2024)

Granite
Rapids
(2025)

Diamond
Rapids
(2026)

Processor Architcture (Year)

0

1

2

3

4

5

6

7

8

M
em

or
y

B
an

dw
id

th
O

ve
rs

ub
sc

ri
pt

io
n CPU Traffic

IO Traffic
CPU+IO Traffic

Figure 1.2: Memory bandwidth is becoming increasingly oversubscribed due to
rapidly approaching PCIe bandwidth.

have led to contention for resources within the host network, a phenomenon we

refer to as “host congestion”. Host congestion causes inefficiencies in peripheral

device-to-memory transfers within the host network. This means that the rate at

which data (that is destined to host memory) is generated at peripheral devices

can exceed the rate at which the data can actually be transferred to memory over

the host network. For example, when NICs serve as peripheral devices, host

congestion leads to scenarios where the arrival rate of data at the NICs from

the network fabric can be higher than the rate at which NICs transfer the data

to memory. Consequently, this results in queueing at the NIC buffers, leading

to eventual data drops and significant degradation in application performance

and isolation guarantees (more details in §1.4).

Host congestion can result from contention for multiple possible resources

within the host network. For instance, contention can occur at the memory in-

terconnect, which is shared by data traffic originating from the peripheral de-

vices (IO traffic) and the CPU cores (compute traffic), destined to read/write

from/into the memory. Even though the memory bandwidth is typically larger

4


than peripheral device bandwidths, when IO and compute traffic are co-located,

the memory interconnect can observe contention. The memory interconnect has

historically been oversubscribed by the compute traffic, that is, the aggregate

memory bandwidth demand by all CPU cores exceeds the theoretical maximum

bandwidth for the memory interconnect. For instance, the degree of oversub-

scription (the ratio of total memory bandwidth demand by cores and the theo-

retically maximum memory bandwidth) has remained stagnant over the years;

however, with PCIe bandwidth increasing rapidly over time, when account-

ing for load offered by both CPU cores and peripheral devices, memory inter-

connect is increasingly more oversubscribed. As shown in Figure 1.2, the total

memory bandwidth oversubscription in recent Intel architectures increased by

9.2% from 2019 (Cascadelake) to 2021 (Icelake), and an additional 3.4% in 2023

(Sapphire Rapids); the overall oversubscription is also projected to incur even

larger increase over time [130].

Contention within the host network can happen even in the absence of

compute traffic—for example, due to the desire for memory protection against

buggy/malicious peripheral devices in modern cloud deployments. These

memory protection mechanisms ensure that peripheral devices execute mem-

ory transfers using virtual memory addresses. An IOMMU—a special-purpose

hardware within the host network—translates these virtual addresses to phys-

ical addresses before executing memory reads/writes [22]. These translations

are done using an address translation table residing in memory and are sped up

using a special cache within the IOMMU (referred to as the IOTLB) [17]. To ex-

ploit peripheral interconnect bandwidth effectively, peripheral devices can ini-

tiate multiple simultaneous device-to-memory transfers, leading to contention

at IOTLB. Like other host resources, IOTLB sizes have also remained largely

5


Figure 1.3: NIC-to-memory transfer rate matches the arrival rate at the NIC
until the latency budget is met for these transfers. Host congestion, that is, the
reduction in achievable NIC-to-memory transfer rate, is a result of overshooting
of this latency budget.

stagnant compared to the peripheral interconnect bandwidth [104, 108, 91], sig-

nificantly increasing the contention over time.

1.3 The Root Cause of Host Congestion

We now provide some more insight into how contention for resources like mem-

ory interconnect or IOMMU caches within the host network can lead to ineffi-

ciencies in peripheral device-to-memory transfers. To simplify the discussion,

the rest of this section assumes NIC as an example peripheral device, but similar

arguments also hold for any other peripheral device.

For any NIC receiving data from the network fabric, to ensure that the NIC-

to-memory transfer rate matches the data arrival rate at the NIC, there is an

allowed “latency budget” for individual NIC-to-memory transfers. This latency

budget (ℓ) depends upon the NIC bandwidth (B) and the inherent parallelism

(P) supported by the host network hardware. Using Little’s law [89], we get

6


ℓ ∝ P/B: larger bandwidth implies data arrives faster at the NIC, resulting in a

smaller latency budget to transfer data to memory. Similarly, larger hardware

parallelism means the ability to transfer a larger amount of data simultaneously

from NIC to memory, increasing the latency budget.

When the host network can meet the budget, it allows the NIC-to-memory

transfer rate to match the arrival rate at the NIC. However, contention at the

memory interconnect results in inflation of memory access latency, and con-

tention at IOTLB results in increased address translation latency. This, in turn,

could lead to overshooting the latency budget; when this happens, NIC-to-

memory transfers cannot keep up with the arrival rate, resulting in NIC buffer

build-up and drops. For instance, for our default setup in Chapter 3, we observe

that without host congestion the latency between the PCIe root complex (shown

in Figure 1.1) is ∼330 ns when the NIC-to-memory transfer rate matches the NIC

bandwidth of 100 Gbps. To sustain this transfer rate, the hardware employs a

parallelism of ∼65 cachelines between the root complex and the memory con-

troller. However, the maximum available hardware parallelism on our setup is

∼93 cachelines; hence an inflation in latency ∼150 ns is enough for the NIC-to-

memory transfer rate to not be able to keep up with the arrival rate.

Therefore, at its core, host congestion is caused by latency inflation (even at

a scale of a few hundred nanoseconds) within the host network.

1.4 Impact of Host Congestion

We now provide evidence of host congestion in production clusters: even with

state-of-the-art protocols, software, and hardware infrastructure, large-scale

7


production clusters observe significant application-level violations of isolation

and service-level agreement (throughput and tail latency) guarantees due to

host congestion. Figure 1.4 shows a scatter plot of network link bandwidth uti-

lization on the x-axis and the measured data drop rates at the hosts on the y-axis

for a production cluster at Google [6]. The colors denote the density of sample

points, with purple colors denoting a larger sample fraction. This figure shows

that, even with Google’s state-of-the-art congestion control protocol (Swift [77])

and userspace network stack (SNAP [95]), hosts in Google production clusters

observe high data drop rates—even when network link bandwidth is far from

saturation. These drops resulted in throughput degradation, multiple orders of

magnitude tail latency inflation, and violation guarantees for many applications

running on this production cluster.

We also reproduce the host congestion phenomenon observed in Google’s

production cluster, using open-sourced Linux network stack and congestion

control protocol (DCTCP [14]), and characterize the impact on application per-

formance. We observe that host congestion can lead to as much as 1% packet

drops at the host, 35–55% throughput degradation, and 120–5000× tail latency

inflation [9]. We present these results in Chapter 2.

1.5 Implications to Design of Network Protocols & OS

Existing network protocols and various OS components were designed with

several implicit assumptions that start to break under host congestion.

For instance, consider the case of network protocols such as congestion con-

trol protocols (e.g., TCP, DCTCP, Swift, etc.) that manage data transfers in dis-

8


0.2 0.4 0.6 0.8

Host Access Link Bandwidth Utilization

H
o
s
t 

D
ro

p
 R

a
te

N
o
rm

a
li
z
e
d

F
ra

c
ti
o
n
 

(A
m

o
n
g
 a

ll
 d

a
ta

 p
o
in

ts
 o

n
 p

lo
t)

 
0.0 1.0
0

1

Figure 1.4: Host congestion in a large-scale Google production cluster. The
figure shows a scatter plot of link utilization (normalized by access link band-
width) and packet drop rates at hosts. The cluster runs both the Linux kernel
and SNAP [95] network stacks, with TCP and Swift [77] congestion control pro-
tocols, respectively. Precise values for packet drop rates are hidden due to these
being sensitive.

tributed applications so as to minimize the occurrences of resource contention

within the network fabric. Classical congestion control protocols assume an

ideal host network and operate at network RTT granularity—that is, they de-

tect and respond to any contention within the end-to-end network datapath in

at least one network RTT, which is typically of the order of tens-to-hundreds of

microseconds. Such long congestion detection and response times cannot cor-

rectly handle host congestion: this is because, in the host congestion regime,

traffic is generated much closer to the point of contention (e.g., due to applica-

tions generating CPU-to-memory traffic, with CPU cores situated few hundreds

of nanoseconds away from the memory controller) and thus the degree of host

congestion can change dramatically at sub-RTT granularity. Such a mismatch

in congestion granularity and congestion control response demonstrably leads

to performance far from optimal. Thus, host congestion allows us to revisit fun-

damental questions related to the design of congestion control architecture and

protocols.

9


Another implicit assumption ingrained in existing OS, particularly within

the IO memory protection mechanisms, is that the memory access latency is a

tiny fraction of peripheral interconnect latency. As a result, operations such as

mapping/unmapping virtual-to-physical memory addresses for DMA buffers

(used when enabling memory protection against untrusted peripheral devices)

have been designed fundamentally inefficiently—in that, they can lead to larger

than necessary number of memory accesses for these operations, especially

when the memory protection mechanisms intend to provide stronger safety

guarantees. These additional memory accesses can lead to inflated latencies

within the host network. The latency inflation leads to the host congestion phe-

nomenon and similar application-level performance degradation, as discussed

previously. We need to revisit the design of memory protection mechanisms

such that they incur minimal overheads for the required safety guarantees.

We also demonstrate that the observations made about host congestion

resulting in performance degradation for network applications (throughput

degradation and/or latency inflation) generalize to the context of storage appli-

cations. For example, we show in Chapter 2 that under host congestion, storage

applications can observe up to 2× throughput degradation and 17× tail latency

inflation compared to an ideal operating point. We find that this is rooted in

storage stacks assuming the host network always behaves in an ideal manner—

to achieve high throughput, storage stacks typically keep a fixed number of re-

quests in-flight, which implicitly assumes the bottlenecks are either at the CPUs

or at the storage device. This assumption is no longer true under the host con-

gestion regime, with bottlenecks shifting dynamically within the host network.

Therefore, we also need to revisit the architecture of storage stacks in the regime

of host congestion.

10


Put together, the emerging phenomenon of host congestion, led by current

technology trends, points to the inevitability of a fundamental re-architecture of

the existing host infrastructure.

1.6 Thesis and Key Contributions

The central thesis in this dissertation is that we need across-the-stack

solutions—including re-architecture of network protocols and several compo-

nents of the OS—to address the impact of host congestion on applications. We

next provide an overview of the core contributions of this dissertation towards

this thesis:

• Understanding the impact of host congestion. We perform a deep dive

into the host network datapath in order to understand the root causes

and impact of host congestion. We show how nanosecond-scale latency

inflation due to contention at various points within the host network—

specifically, the memory interconnect and the IOMMU—results in host

congestion. We evaluate the impact of host congestion on application-

level performance, reproducing the observations made in Google produc-

tion clusters. We observe similar impact on application performance even

when using an open-sourced Linux network stack and DCTCP congestion

control protocol. We also demonstrate that the impact generalizes to the

context of storage applications, using state-of-the-art storage stacks and

IO schedulers. These results are presented in Chapter 2.

• Rethinking congestion control protocols. We present hostCC [9], a new

11


architecture for congestion control protocols that handle host conges-

tion along with network fabric congestion. The key idea in hostCC is a

new “host-local” congestion response performed at a sub-RTT granular-

ity. hostCC uses host-local congestion response to minimize queueing and

data drops at the host: it modulates host resources allocated to the net-

work traffic at sub-RTT granularity to ensure that NIC queues are drained

at the same rate at which network traffic arrives at the NIC. hostCC in-

tegrates the idea of host-local congestion response with new congestion

signals collected at sub-microsecond timescales within the host hardware,

allowing it to capture and react to host congestion at sub-microsecond

timescales. We discuss hostCC in Chapter 3.

• Rethinking IO memory protection. We present Fast & Safe (F&S), a sim-

ple modification to existing memory protection mechanisms that enables

achieving the strongest safety guarantees and near-optimal performance.

The key insight in F&S design is that rather than focusing on achieving

high IOTLB hit rates, we should focus on reducing the cost of each trans-

lation upon an IOTLB miss. This insight was an outcome of building a

deep understanding of the IOMMU hardware, that led to us discover-

ing existence of IO page table caches (in addition to the IOTLB) within

the IOMMU hardware; effectively utilizing these caches can help reduc-

ing the address translation cost upon an IOTLB miss. F&S design re-

quires no modifications in host hardware, minimal modifications within

the operating system, and yet, near-completely alleviates all overheads of

the strongest form of memory protection mechanism. We discuss F&S in

Chapter 4.

• Rethinking storage stacks. We present storageCC [8], a new architec-

12


ture for storage stacks that enables them to operate at an ideal amount of

parallelism in the presence of host congestion. In particular, storageCC

introduces the conceptual idea of congestion control (typically used in

the computer networking literature) in storage stacks. Our key insight

is that the end-to-end storage datapath (i.e., SSD-to-CPU/memory datap-

ath) is conceptually a (lossless) computer network. storageCC detects host

congestion by collecting host congestion signals within the hardware at

sub-microsecond granularity, and adapts the number of in-flight requests

based on host congestion. Similar to transport designs in computer net-

works, storageCC ensures that the number of in-flight requests is the min-

imum of a flow control window size (that accounts for bottlenecks at end-

points of storage datapath, i.e., CPUs and/or SSDs) and the congestion

control window size (that accounts for bottlenecks within the storage dat-

apath, and is decided using storageCC’s storage congestion control mech-

anism). We discuss storageCC in Chapter 5.

1.7 Acknowledgements

This dissertation is an outcome of multiple collaborative efforts [6, 8, 9, 116,

128]. While the core of this dissertation is about host congestion control, I was

also fortunate enough to work on several other collaborative projects during my

Ph.D. that are not a part of this dissertation [7, 10, 120, 126].

13


CHAPTER 2

UNDERSTANDING THE IMPACT OF HOST CONGESTION

This chapter provides an overview of the host network datapath (§2.1). We then

discuss the impact of host congestion caused by contention for host network

resources such as the memory interconnect and IOMMU caches. We charac-

terize application-level performance degradation in both networking (§2.2) and

storage (§2.3) contexts, providing a deep dive into understanding the behav-

ior of network transport protocols and storage stacks under the host congestion

regime.

2.1 Host Network Datapath

This section serves as a brief primer on the host network datapath. We begin

by discussing the datapath in the context of networking applications (§2.1.1),

where the NIC serves as the employed peripheral device. Subsequently, we

offer relevant additional details about the datapath when utilizing SSDs as the

peripheral device (in the context of storage applications, in §2.1.2), as well as

when employing memory protection against untrusted peripheral devices (i.e.,

with IOMMU enabled in the host, in §2.1.3).

2.1.1 Network (NIC-to-CPU/memory) Datapath

Figure 2.1 illustrates the host network datapath for the context of networking

applications. We discuss below the life of a packet from when it arrives at the

14


Rx Descriptors NIC Buffer

PCIe Credits

PCIe 
Channel

IIO

Memory
Controller

NIC

CPUs

DRAM

IIO Buffer

Rd Q

Wr Q

1
2

3

4

1

2

Figure 2.1: Illustration of host network datapath for the Intel architecture (AMD
architecture is conceptually similar) between NIC and CPU/memory for the
DDIO disabled case.

NIC until it is transferred to CPU/memory. We primarily focus on the receiver

side since host congestion is more prominent at the receiver [77, 6, 86]. The host

network datapath is best described in two components—one between the NIC

(one end of the PCIe interconnect) to the Integrated IO Controller (IIO, the other

end of the PCIe interconnect), and the other between the IIO to memory. We

discuss below the steps involved in each component.

NIC-to-IIO datapath. NIC and IIO sit at the two ends of the PCIe interconnect.

1 Upon a packet arrival, the NIC enqueues the packet into its input buffer

(typically in a small SRAM [70, 6]).

2 Next, the NIC fetches a descriptor that provides a host memory address for

the NIC to Direct Memory Access (DMA) the packet: the NIC driver sets

up a per-core ring buffer of Rx descriptors, each of which contains memory

addresses to be used to DMA data (with memory protection enabled, each

15


descriptor is equipped with a virtual memory address instead of the physi-

cal address; more details in §2.1.3). Modern NICs support each descriptor to

have addresses worth one or more pages (e.g., default 64-page descriptors

in Mellanox NICs [97]), allowing multiple packets to be DMA’d using the

same descriptor. The NIC driver periodically replenishes these descriptors.

3 Importantly, PCIe is a lossless interconnect that uses a credit-based flow

control mechanism implemented via a fixed (hardware-specific) number of

credits [12]. When PCIe credits are available, the NIC instantiates a DMA

request over the PCIe (executed using PCIe transactions); DMA’ing a single

packet may require multiple PCIe credits [12, 6]. PCIe being a lossless in-

terconnect, the packet can be safely removed from the NIC buffer as soon

as DMA is initiated. If PCIe runs out of credits (we will discuss potential

reasons below), DMAs cannot be initiated until credits are replenished.

4 The IIO intercepts each PCIe transaction and initiates writes to memory (dis-

cussed below); importantly, a PCIe credit is replenished only when the IIO

has successfully issued a write to the memory1.

IIO to memory datapath. Modern hosts have support for direct cache access

(e.g., using DDIO [59]) that allows NICs to DMA packets directly into the last-

level cache (LLC). The precise IIO to memory datapath depends on whether

DDIO is enabled or disabled. We first describe the datapath with DDIO dis-

abled; we then discuss the case when DDIO is enabled.
1PCIe transactions are executed at the granularity typically 256 − 512 byte-sized PCIe Trans-

action Layer Packets (TLPs); however, IIO to memory transactions are executed at the granular-
ity of 64 byte-sized cache lines. Thus, each PCIe transaction requires multiple IIO-to-memory
writes before completion. We will largely ignore this detail since it is not necessary to under-
stand the phenomenon of host congestion.

16


1 Upon receiving a PCIe transaction, IIO enqueues the request in an IIO

buffer.

2 IIO issues the requests from its buffer to the memory controller buffer, and

the controller executes the final write to DRAM. Importantly, IIO to memory

controller datapath also traverses a lossless interconnect that uses a credit-

based flow control mechanism; the IIO can issue a write request to the mem-

ory controller only if the write queue at the memory controller is not full. If

this write queue is full (e.g., due to other memory requests from CPUs), the

request remains enqueued in the IIO buffer. Once the write queue becomes

free, IIO transfers the request to the memory controller write queue; this

incurs a cache line worth of memory write bandwidth.

If DDIO is enabled, IIO transfers the cache line to the LLC. This may require

evicting an already existing cache line to the memory controller [31, 59]. There-

fore, if DDIO is enabled and does not lead to evictions, it reduces the latency of

an IIO write request since the speed-of-light delay from IIO to LLC is smaller

than from IIO to DRAM. However, if it does lead to evictions, we are back to the

DDIO disabled case: each eviction not only incurs a cache line worth of mem-

ory write bandwidth but also higher latency since IIO to LLC write can only be

executed after the eviction has been completed.

2.1.2 Storage (SSD-to-CPU/memory) Datapath

Figure 2.2 illustrates the end-to-end storage datapath, that is, the life of a storage

request from when it is issued by the application until the desired data is written

to the memory. The datapath is best described in three components: between

17


Submission 
eues

SSD

PCIe Credits

PCIe 
Channel

IIO

Memory
Controller

SSD Hardware
& Driver

CPUs

DRAM

IIO Buffer

Rd Q

Wr Q

Figure 2.2: Illustration of storage datapath between SSD and CPU/memory.

the application and SSD, between the SSD and the IIO, and between the IIO

and host memory. We discuss the steps involved in the first two components;

IIO to host memory datapath is identical to the networking context discussed

previously.

Application to SSD datapath. The storage application submits the IO read re-

quest 2 to the Linux kernel block layer. The block layer forwards the request to

the NVMe driver, which creates a descriptor for the SSD (the descriptor carries

the destination memory address for DMA and uses a virtual memory address if

memory protection is enabled using an IOMMU), enqueues the descriptor into

one of the SSD’s submission queues and rings a doorbell. The SSD dequeues the

request from the submission queue and processes it, which entails two steps: (1)

fetching the data from its internal storage fabric and (2) transferring it to the host

by issuing DMA transactions over the PCIe interconnect.

2The datapath for write requests essentially follows a reverse path.

18


SSD to IIO datapath. SSD and IIO sit at the two ends of the PCIe interconnect.

Similar to the networking context, The IIO intercepts each PCIe transaction and

is responsible for writing the data to host memory, and the PCIe credits are re-

plenished only when the IIO has successfully issued the required write requests

to the memory.

2.1.3 Datapath with Memory Protection Enabled

We now discuss the additional steps taken within the end-to-end datapath when

IOMMU is enabled to provide memory protection against buggy/untrusted pe-

ripherals. We consider the networking context for brevity, with Mellanox NICs

and Intel CPUs. The high-level datapath of other NICs [110], other kinds of pe-

ripherals like SSDs, and other CPU architectures are similar to what is shown in

Figure 2.3. The datapath involves three main steps:

1 IOVA allocation and mapping to physical page address. To provide mem-

ory protection, NIC uses a virtual address (IO virtual address or IOVA) in the

Rx ring buffer descriptor, associated with the corresponding application core, to

DMA the packet. To prepare a single descriptor for each page, the NIC driver al-

locates a physical page and passes it to the IOMMU driver. The IOMMU driver

then allocates a page-sized IOVA from the IOVA allocator. An IOVA itself is

simply a range of addresses, and the IOVA allocator keeps track of free address

ranges. Allocated IOVA ranges are managed with a globally locked red-black

tree, where each node in the tree contains a high and a low IOVA address [108].

In the worst case, operations on the red-black tree can incur linear time searches,

leading to high CPU overheads [108]. To avoid such overheads, modern IOVA

19


Rx Descriptors NIC Buffer

PCIe 
Channel

IIO

IOVA
Allocator

IOMMU
IO Page Table

L1

L2

L3

L4

IOMMU

Invalidation eue

IO Page Table Caches

L1 L2 L3

IOTLB

NIC

CPUs

DRAM

1

2

23

Figure 2.3: High-level DMA datapath with IOMMU enabled.

allocators keep all operations constant time using a stack-based cache of IOVAs

that in addition to the red-black tree. To avoid synchronization overheads of

multiple cores accessing the IOVA allocator, the OS maintains a per-core IOVA

cache, allowing IOVAs to recycle on a per-core basis. The IOMMU driver maps

the allocated IOVA to the physical page in an IO page table and passes the IOVA

to the NIC driver. The NIC driver inserts these IOVAs into the descriptor.

2 Translating IOVA to physical address. After DMA initiates and PCIe trans-

actions subsequently reach the IIO, each transaction’s IOVA must be translated

to a physical address. These translations are sped up using special-purpose

hardware caches: first, the IOVA is looked up in the IOTLB, which stores map-

pings directly from IOVAs to physical addresses. Upon a hit, the address is im-

mediately translated. Otherwise, the IOMMU must perform an IO page table

walk. The IO page table has four levels (L1, L2, L3, and L4), and each page has

512-page table entries; L1 entries map from the 9 MS bits of the IOVA to an L2

page, L2 entries map from the next 9 bits of the IOVA to an L3 page, etc. An L4

page contains direct mappings to physical addresses. While existing literature

primarily assumes that IO page table walks are always needed to be performed,

20


we found that modern IOMMU hardware has caches for the IO page table that

can speed up translation [60]. There is a cache for the first three levels of the

page table (L1, L2, and L3); each entry in the cache points from an IOVA to the

address of the corresponding L2/L3/L4 page in the IO page table. At worst,

a page table walk incurs four memory reads (one for each level of the IO page

table) in the case of all misses, and at best, one unavoidable memory read from

an L4 page, which contains the requisite physical address. It may take multiple

address translations to complete the DMA of a single packet.

3 IOVA unmapping and deallocation. After DMA completes, to ensure that

the NIC can no longer access the DMAed packets’ buffers, for each page in the

descriptor, the driver unmaps the IOVA-to-page mapping in the IO page table,

including invalidating the corresponding entries in the IOMMU and frees the

IOVA back to the IOVA allocator. Modern operating systems provide two secu-

rity models for invalidations: strict and deferred. In strict mode, invalidation

happens at per-descriptor granularity; for each IOVA in the descriptor, corre-

sponding entries in the IOTLB and L1, L2, and L3 page table cache are invali-

dated. 3 In deferred mode, invalidation happens at multi-descriptor granularity

- every 4 descriptors by default in Linux - leaving stale entries in the IOTLB that

can be used by the device.

3This is the reason for an unavoidable IOTLB miss per (huge)page worth of data: since IO-
VAs are unmapped and corresponding IOTLB entries are immediately invalidated after use,
every new page that is DMA’d will result in an IOTLB miss; thus we expect to see at least one
IOTLB miss per page. In reality, misses are slightly higher due to PCIe transactions for accessing
descriptors, etc. [6].

21


2.2 Host Congestion Impact on Networking Applications

As discussed in Chapter 1, host congestion may occur due to one or more bot-

tlenecks along the datapath, e.g., memory interconnect [6, 86] or the IOMMU

caches [12, 22, 108, 31, 6]. These bottlenecks lead to inflation in latency for IIO-

to-memory requests due to inflation in memory access latency (under memory

interconnect contention) or higher address translation latency at the IIO (under

IOMMU cache contention). We observe a domino effect due to latency inflation

for IIO-to-memory requests: as IIO-to-memory latency increases, more requests

get queued at the IIO buffer, PCIe credit replenishing incurs larger delays, and

PCIe may run out of credits. As a result, PCIe bandwidth remains underuti-

lized, resulting in packet queueing and eventual drops at the NIC buffer for

networking applications.

Existing network protocols and stacks are not designed to handle host

congestion—they primarily target either network fabric congestion (e.g., [99, 14,

88, 30]), or compute bottlenecks due to inefficient software (e.g., [31, 47, 77, 32,

95]). Indeed, the Google study [6] demonstrates that, even with state-of-the-art

congestion control protocol Swift [77] and userspace network stack SNAP [95],

their production clusters suffer from host congestion resulting in significant

queueing and packet drops at the host, throughput degradation, and tail la-

tency inflation. In this section, we reproduce the host congestion phenomenon

from [6] using Linux DCTCP, and provide insights on root causes for queueing,

packet drops, and performance degradation in the host congestion, caused by

bottlenecks at the memory interconnect (2.2.1), and the IOMMU ( 2.2.2).

22


2.2.1 Impact of Memory Interconnect Contention

Setup. To build an understanding of performance degradation in the host con-

gestion regime, we use a setup with two servers connected via a single switch.

We use 4-socket (NUMA node) servers with 8 Intel Cascade lake CPU cores per

socket, 100Gbps Mellanox CX5 NIC connected to one of sockets over 128Gbps

PCIe 3.0, and DDR4 DIMMs connected to 2 memory channels (with a total max-

imum theoretical capacity of 375Gbps = 46.9GBps). In terms of resource bal-

ancing, our servers are state-of-the-art: the latest commercially available Intel

servers have the same set of resources but scaled by a factor (that is, a server

with PCIe 4.0 with 256 Gbps capacity would also have 4× more cores, 2× more

PCIe bandwidth, 4× more memory bandwidth, etc.) [50, 6]. Our servers run

Linux kernel 5.4.0 and, by default, use 4K MTUs (similar to [6]), and all avail-

able optimizations (segmentation offloads like TSO and GRO, accelerated re-

quest flow steering, etc.) that have been shown to achieve best-possible Linux

performance [31].

We use three applications—(a) a NetApp-T that generates 4 long flows, each

flow from one sender-side CPU core to one receiver-side CPU core on the NIC-

local NUMA node (DCTCP needs a minimum of 4 cores to saturate 100Gbps

NIC in an uncongested scenario); (b) a NetApp-L that generates latency-sensitive

RPCs (of sizes varying from 128B to 32KB) between a sender-side and receiver-

side CPU core on the NIC-local NUMA node; and (c) an MApp that generates

CPU-to-memory traffic with 1 : 1 read-write ratio and sequential memory ac-

cess pattern; we increase the offered load to the memory interconnect by the

MApp from 1× to 3×, by increasing the number of MApp CPU cores, and therefore

increasing the number of in-flight memory requests (in the absence of any other

23


0x 1x 2x 3x0
20
40
60
80

100

N
et

w
or

k
T

hr
ou

gh
pu

t
(G

bp
s)

0x 1x 2x 3x
Degree of Host Congestion

10−5
10−4
10−3
10−2
10−1

100

Pa
ck

et
D

ro
p

R
at

e
(%

)

DDIO Off DDIO On

(a) Network throughput and
packet drop rates

0x 1x 2x 3x
Degree of Host Congestion

0.0

0.2

0.4

0.6

0.8

1.0

M
em

or
y

B
an

dw
id

th
U

til
iz

at
io

n

DDIO Off, NetApp-T
DDIO Off, MApp

DDIO On, NetApp-T
DDIO On, MApp

(b) Memory bandwidth
utilization

Figure 2.4: (a) Host congestion leads to significant performance degradation in
terms of packet drop rates and throughput (even with no network congestion).
DDIO helps a little but observes similar performance degradation. (b) MApp is
able to acquire a large fraction of memory bandwidth, leaving little room for
NetApp-T.

source of memory traffic, using 1× to 3× MApp cores results in a total observed

memory bandwidth of 16.0GBps, 28.7GBps, and 34.8GBps, respectively). We

use standard benchmark tools for these applications—iperf [63] for NetApp-T,

netperf [68] for NetApp-L, and MLC [62] for MApp.

[Figure 2.4] Host congestion leads to degraded network throughput (>35%),

even when DDIO is enabled. Figure 2.4 shows throughput, packet drops, and

memory bandwidth utilization with increasing degrees of host congestion, with

and without DDIO. The case of 0× is easy: there is no host congestion, and thus,

network traffic is able to saturate the access link bandwidth. Enabling DDIO

reduces memory bandwidth utilization since the CPU is able to consume the

data before it is evicted; nevertheless, memory bandwidth utilization is non-

zero because evictions do happen due to cache pollution—since LLC is shared

across all cores, one cannot guarantee a perfect cache hit rate [31]. The case of

1× is more interesting; let us start with the DDIO disabled case. Here, as shown

24


in the right figure, memory bandwidth utilization is now close to saturation4;

thus, memory access latency (and thus, the number of CPU cycles per mem-

ory access) starts to increase. As a result, network throughput is now compute

bottlenecked (more precisely, by receiver-side buffers) and cannot saturate the

access link bandwidth. DDIO shines in such a scenario: lower memory band-

width utilization results in lower memory access latency and fewer CPU cycles

per memory access, allowing network traffic with DDIO to continue to saturate

the access link bandwidth.

Memory bandwidth is now saturated for 2× and 3×, resulting in inflated

memory access latency. We now see the domino effect discussed earlier: PCIe

becomes underutilized, and packets get queued (and eventually dropped) at

the NIC; even in this simple setup, we observe 0.3% drops. Interestingly, while

DDIO can be slightly helpful in improving network throughput, it has negligi-

ble impact on packet drop rate; this is because of the reasons discussed earlier—

as shown in Figure 2.4(b), memory bandwidth utilization is similar for both

DDIO enabled and disabled (that is, majority of cache lines are evicted from

LLC before the CPU can consume them). At 2× and 3×, DCTCP is operating in

the AIMD regime: senders keep on increasing sending rate, resulting in queue

build-up and drops at the NIC when total sending rate exceeds the instanta-

neous NIC-to-memory transfer rate; packet drops leads to rate reduction, fol-

lowed by subsequent sawtooth behavior.

Figure 2.4b shows that, as we increase the number of MApp cores, network

traffic is allocated a decreasing fraction of memory bandwidth. More work is

needed to understand the precise reasons; our evaluation suggests that mem-

4Memory bandwidth utilization at saturation depends upon DRAM hardware as well as
application workload (read-write ratio, access patterns, etc.) [64], and is typically lower than
the maximum theoretical bandwidth.

25


1500B 4000B 9000B0
20
40
60
80

100

N
et

w
or

k
T

hr
ou

gh
pu

t
(G

bp
s)

1500B 4000B 9000B
MTU Size

0.00
0.25
0.50
0.75
1.00
1.25

Pa
ck

et
D

ro
p

R
at

e
(%

)

DDIO Off DDIO On

(a) Varying MTU Size

4 8 160
20
40
60
80

100

N
et

w
or

k
T

hr
ou

gh
pu

t
(G

bp
s)

4 8 16
Number of Active Flows

0.0
0.5
1.0
1.5
2.0

Pa
ck

et
D

ro
p

R
at

e
(%

)

DDIO Off DDIO On

(b) Varying Number of Flows

Figure 2.5: The impact of host congestion worsens with larger MTU size and
number of flows, with 3× degree of host congestion. DDIO-enabled case, in
particular, suffers more than DDIO-disabled case.

ory bandwidth allocation is essentially proportional to the load generated by

individual entities (IIO or CPU); this implies that, as MApp cores increase, MApp

generates increasingly larger load5 but the maximum number of requests issued

by IIO remains the same (dependent on the PCIe credit limit). As a result, CPUs

are able to quickly acquire a larger fraction of memory bandwidth, creating even

more host contention for network traffic.

[Figure 2.5] Impact of host congestion worsens with increasing MTU sizes

and a larger number of flows. Larger MTU sizes and a large number of con-

nections have been shown to improve network throughput when compute at

the host is bottlenecked due to inefficient software [31, 6]. Figure 2.5 shows

that, in the presence of host congestion, these optimizations can, in fact, hurt

performance—we see a significant increase in packet drop rates for both DDIO-

enabled and disabled cases; and, while we observe a slight improvement in

throughput for the DDIO-disabled case (due to reduction in compute cycles re-

quired to process each packet), we also observe significant throughput reduc-

5This is because each CPU core can maintain a hardware-specific maximum number of in-
flight memory requests, equal to so-called Line Fill Buffer (LFB) size. On our servers, this limit
is typically 10 − 12.

26


104
105
106

128B 512B 2KB 8KB 32KB
RPC Size

50

100

150

200

L
at

en
cy

(µ
s)

DCTCP
(No host congstion)

DCTCP
(With host congestion)

Figure 2.6: Host congestion leads to orders of magnitude tail latency infla-
tion. The figure shows {P50, P90, P99, P99.9, P99.99} latencies (represented
by whiskers) for NetApp-L, with and without host congestion, when NetApp-T,
NetApp-L, and MApp are run together with DDIO-disabled (we observed near-
identical latency results with DDIO-enabled).

tion for the DDIO enabled case.

The trends for the DDIO disabled case are easy to understand: for AIMD-

style congestion control protocols, it is known that packet drop rates increase

with MTU size [96] and the number of flows [103, 11]. The trends for the DDIO-

enabled case are related to the inefficacy of DDIO observed in [31]—with an

increase in MTU size and/or number of flows, DDIO observes an increase in

cache eviction rates (and thus, an increase in memory bandwidth utilization);

as discussed in §2.1.1, each such eviction not only incurs a cache line worth

of memory bandwidth as in DDIO-disabled case, but also higher latency than

DDIO-disabled case since IIO-to-LLC write can only be executed after the evic-

tion has completed. As a result, the DDIO-enabled case at 3× host congestion

observes even higher packet drop rates than the DDIO-disabled case for large

MTU sizes and large numbers of flows.

[Figure 2.6] Host congestion can cause orders of magnitude tail latency infla-

tion for latency-sensitive applications. We now run the three applications—

NetApp-T, NetApp-L, and MApp—together, with 3× host congestion. Since

NetApp-L introduces a tiny amount of additional traffic, NetApp-T and MApp

27


performance is similar to previous experiments; we thus focus on NetApp-L re-

sults. We observe significant latency inflation for NetApp-L in the host conges-

tion regime. This latency inflation is caused due to three reasons (a) queueing

delay at the NIC buffer; (b) retransmission and timeout delays due to drops at

the NIC buffer; and (c) larger CPU processing delays due to inflated in CPU

cycles in memory accesses caused by host congestion. Building upon previous

observations (close to 0.3% drops), P99 latency is dominated by (a) and (c), and

P99.9 is dominated by (b)—for both DDIO-enabled and -disabled (that observe

similar drop rates), P99 latency inflation is roughly 60 − 100µs which is close

to the worst-case queueing delay at the NIC buffer; at P99.9, latency inflation

is close to 200ms, which is the default Linux minimum retransmission timeout

(RTO) value. Smaller RPCs suffer from higher tail latency inflation because any

packet drop necessitates a timeout; for larger RPCs, the Linux Tail Loss Probe

(TLP) [112] mechanism is effective (with a smaller timeout) when there is more

than one in-flight packet. Isolating NIC buffers does not solve this problem: a

smaller NIC buffer size would incur a larger number of drops, increasing (b); a

larger NIC buffer size, on the other hand, would increase (a).

2.2.2 Impact of IOMMU Contention

Setup. We use servers with the same hardware setup (identical NIC, PCIe,

and memory bandwidths) as described in §2.2.1, with the addition of enabling

the IOMMU to characterize the application-level impact of memory protection

overheads. We employ Linux kernel 6.0.3, and by default, 4K MTUs (similar

to the setup in [6]), along with all available TCP optimizations (TSO, GRO, and

aRFS). The same set of networking applications—NetApp-T and NetApp-L—are

28


256 512 1024 2048
Ring Buffer Size

0
20
40
60
80

100

Ne
tw

ork
 T

hro
ug

hp
ut 

 (G
bp

s)

IOMMU Off IOMMU On

(a) Network throughput

256 512 1024 2048
Ring Buffer

0.0

0.5

1.0

Pa
ck

et 
Dr

op
 

 R
ate

 (%
)

IOMMU Off IOMMU On

(b) Packet drop rates

256 512 1024 2048
Ring Buffer Size

0

1

2

IO
TL

B 
M

iss
es 

pe
r P

ag
e

(c) IOTLB miss rates

256 512 1024 2048
Ring Buffer Size

0.0

0.5

1.0

IO
M

M
U 

Ca
ch

e M
iss

es 
 pe

r P
ag

e

L1 L2 L3 ACKs

0.00

0.05

0.10

0.15

Nu
mb

er 
of 

AC
Ks

 
 pe

r P
ag

es-
Siz

ed
 Pa

ck
et

(d) L1/L2/L3 cache miss rates

0 100 200 300 400 500
Sample Sequence of 

IOVA Accesses

0
100
200
300
400
500

Di
st

in
ct

 L
3s

 A
cc

es
se

d 
Be

fo
re

 R
ep

ea
tin

g

2048rb 256rb

(e) IOVA access pattern

Figure 2.7: Enabling memory protection results in >35% degradation in
throughput. Increasing ring buffer sizes with IOMMU enabled leads to (a) de-
creasing throughput and (b) increasing packet drop rates. (c) IOTLB misses do
not significantly change as the ring buffer size increases. (d) However, IOMMU
L3 cache miss rates increase significantly, leading to a larger cost of IOTLB
misses (e) Larger ring buffer sizes result in worse locality as the IOVA work-
ing set size grows.

used as in §2.2.1. By default, NetApp-T generates 5 long flows, as DCTCP re-

quires a minimum of 5 cores in this setup to saturate a 100 Gbps NIC in an un-

congested scenario. We also use the same memory-intensive application—MApp,

as discussed earlier—to characterize the impact of contention observed at both

the memory interconnect and the IOMMU.

[Figure 2.7] Enabling memory protection leads to degraded network through-

29


put (>35%) even without memory interconnect contention. Figure 2.7 shows

application level throughput for increasing ring buffer size, with all other pa-

rameters fixed to default, but with IOMMU disabled and enabled. With the

IOMMU disabled, the application saturates the 100Gbps link bandwidth. How-

ever, with IOMMU enabled, throughput drops up to 35% for ring buffer size =

2048. This throughput degradation is attributed to increasing IOTLB/L1/L2/L3

misses.

Note that there is increasing throughput degradation as the ring buffer size

increases despite no significant increase in the number of IOTLB misses. This

is because the cost of each IOTLB miss gets worse due to the increasing rate

of L3 misses. The L3 miss rate increase is caused by poor IOVA locality with

increasing ring buffer sizes. As the ring buffer size increases, the IOVA working

set per core exceeds the per-core cache size, requiring access to the global cache

or red-black tree for IOVA allocation. Access to the global cache or red-black

tree can cause subsequent IOVA allocations to belong to different L3 entries, as

the global cache or red-black tree IOVAs are unlikely to be on the same L3 page

as the per-core cache.

Figure 2.7(e) shows that the locality of L3 entries degrades with increasing

ring buffer size. The X axis represents each subsequent IOVA allocation and the

Y axis represents the number of unique L3 page table entries accessed before

that same L3 entry is accessed again. The red line thresholds represent potential

L3 cache sizes of 64 and 128. A value above the threshold means it is more likely

that an IOVA will experience an L3 miss when translated by the IOMMU. The

figure shows how spatially and temporally located the IOVA accesses are within

the IOMMU for varying ring buffer sizes.

30


Note that even for the case of 256 ring buffer size, with working set size po-

tentially smaller than the per-core IOVA allocator cache size (as evident from

the L3 access pattern in Figure 2.7(e)), it still has significant L3 misses, prohibit-

ing the application from saturating the link bandwidth. We believe these misses

result from the interference of IOVA allocation/deallocation for ACK packets

(Tx datapath packets) with the payload packets (Rx datapath packets). Imagine

a core allocating IOVAs for an Rx ring buffer from a perfectly contiguous cache.

We would expect to see the same perfect locality within the Rx ring buffer. How-

ever, if a free call from the Tx data path is interleaved with the alloc calls for the

Rx ring buffer, then the Rx IOVAs will also be interleaved and no longer con-

tiguous. Prior work has shown a similar phenomenon [92].

[Figure 2.8] Larger IOMMU cache miss rates lead to larger cost of IOTLB

misses, and hence larger performance degradation. Figure 2.8 shows increas-

ing L1, L2, L3, and IOTLB cache miss rates with increasing number of flows,

subsequently leading to throughput degradation of up to 65% for the case of

40 flows. The significant drop in throughput shown is due to compounding

factors: a larger number of IOTLB misses cause a larger number of page table

walks, and the worse page table cache miss rates lead to higher latency per page

table walk.

In the evaluated setup, the L1 and L2 locality should ideally be perfect: us-

ing 32-bit DMA, there is only a single entry (an L1 and L2 entry cover 239 and

230 bytes of address space, respectively). Thus, we would expect never to see

L1/L2 misses as the working set easily fits inside a single entry of these caches.

Yet we observe a rather surprising result that, in some cases (as shown in Fig-

ure 2.8(d), for a larger number of flows), there are L1/L2 misses for almost every

31


5 10 20 40
Number of Active Flows

0
20
40
60
80

100

Ne
tw

ork
 T

hro
ug

hp
ut 

 (G
bp

s)

IOMMU Off IOMMU On

(a) Network throughput

5 10 20 40
Number of Active Flows

0

2

4

Pa
ck

et 
Dr

op
 

 R
ate

 (%
)

IOMMU Off IOMMU On

(b) Packet drop rates

5 10 20 40
Number of Active Flows

0

1

2

IO
TL

B 
M

iss
es 

pe
r P

ag
e

(c) IOTLB misses rate

5 10 20 40
Number of Active Flows

0.0

0.5

1.0

IO
M

M
U 

Ca
ch

e M
iss

es 
 pe

r P
ag

e

L1 L2 L3 ACKs

0.00

0.05

0.10

0.15

Nu
mb

er 
of 

AC
Ks

 
 pe

r P
ag

es-
Siz

ed
 Pa

ck
et

(d) L1/L2/L3 cache miss rate

0 100 200 300 400 500
Sample Sequence of 

IOVA Accesses

0
50

100
150
200
250
300

Di
st

in
ct

 L
3s

 A
cc

es
se

d 
Be

fo
re

 R
ep

ea
tin

g

40 flows 5 flows

(e) IOVA access pattern

Figure 2.8: With IOMMU enabled and an increasing number of flows, we ob-
serve (a) higher throughput degradation, (b) larger packet drop rates, (c) a larger
number of IOTLB misses, (d) higher IOMMU cache misses, and (e) poorer lo-
cality in L3 page caches.

other DMAed packet! The reason behind these surprisingly large L1/L2 misses

is the invalidations of previously DMAed IOVAs. While the lack of measure-

ment infrastructure for visibility into invalidations limits our ability to identify

root causes of invalidations, our findings suggest that interference of alloca-

tion/deallocation for ACK IOVAs results in larger invalidations. Recall that

once an ACK is sent, its IOVA is unmapped and invalidated immediately. This

will cause concurrent Rx DMAs to suffer an additional miss for each L1/L2/L3

entry that the ACK and Rx IOVAs share. Figure 2.8 corroborates this hypothe-

sis: as the number of flows and corresponding transmitted ACKs increase, the

32


104
105
106

128B 512B 2KB 8KB 32KB
RPC Size

50
100
150
200
250
300

La
ten

cy
 (

s)

IOMMU Off IOMMU On

Figure 2.9: IOMMU contention leads to orders of magnitude tail latency infla-
tion

L1/L2 misses rise.

The reason for the increase in ACKs is subtle: as shown earlier in this section,

increasing the number of flows under the congestion scenario leads to increas-

ing packet drop rates. It is a commonly known fact that an inflated drop rate

can lead to out-of-order packets and an increase in ACKs [31]. An increasing

number of ACKs also leads to increasing L3 misses (due to worse IOVA local-

ity, as explained earlier) and IOTLB misses (due to a larger number of PCIe

transactions per payload packet, including read transactions needed for send-

ing ACKs).

[Figure 2.9] IOMMU cache contention also leads to tail latency inflation. Fig-

ure 2.9 shows the tail latency inflation by several orders of magnitude for L-apps

with IOMMU enabled. The reasons are similar as discussed previously for la-

tency inflation in Figure 2.6—due to queueing delays at the NIC buffer and

retransmission delays due to subsequent packet drops at the NIC buffer.

[Figure 2.10] Memory bandwidth contention, along with IOMMU cache con-

tention, further degrades performance due to increased cost of IOTLB misses.

Figure 2.10 demonstrates the effects of compounding memory bandwidth con-

33


0x 1x 2x 3x
Degree of Memory Contention

0
20
40
60
80

100

Ne
tw

ork
 T

hro
ug

hp
ut 

 (G
bp

s)

IOMMU Off IOMMU On

(a) Network throughput

0x 1x 2x 3x
Degree of Memory Contention

0.0

0.5

1.0

Pa
ck

et 
Dr

op
 

 R
ate

 (%
)

IOMMU Off IOMMU On

(b) Packet drop rate

0x 1x 2x 3x
Degree of Memory Contention

0

1

2

IO
TL

B 
M

iss
es 

 pe
r P

ag
e

(c) IOTLB miss rate

0x 1x 2x 3x
Degree of Memory Contention

0.0

0.5

1.0

IO
M

M
U 

Ca
ch

e M
iss

es 
 pe

r P
ag

e

L1 L2 L3 ACKs

0.00

0.05

0.10

0.15

AC
Ks

 
 pe

r P
ag

e-s
ize

d P
ac

ke
t

(d) L1/L2/L3 cache miss rate

0 100 200 300 400 500

IOVA Accesses
0

100

200

300

400

500

Di
st

in
ct

 L
3s

 A
cc

es
se

d 
Be

fo
re

 R
ep

ea
t

0x Memory Contention
3x Memory Contention

(e) IOVA access pattern

Figure 2.10: Application performance with both IOMMU and memory band-
width contention. Memory bandwidth contention causes further performance
degradation due to the increased latency for each memory read. This worsens
the cost of each page table walk despite similar miss rates for both the IOTLB
and page table caches.

tention and IOMMU contention. Despite similar IOTLB and IOMMU cache

misses, we still observe throughput degradation as the memory contention in-

creases. This is because PCIe latency increases as each page table walk (and its

corresponding memory reads) takes longer to complete. Comparing 3x memory

contention to the 2048 ring buffer result, there is a 25% reduction in throughput

despite equivalent IOTLB and L1/L2/L3 cache misses. As each memory read

incurs more latency, there is a larger amount of throughput degradation.

34


2.3 Host Congestion Impact on Storage Applications

We now show existing storage stacks were not designed to handle the host con-

gestion regime either, with the results shown in §2.2 also generalizing to storage

applications: up to 50% of unnecessary throughput degradation and an order of

magnitude tail latency inflation. Existing storage stacks using various IO sched-

ulers [114, 57] handle contention at CPUs and/or storage devices (that leads to

queueing at the block layer), but they have no mechanisms to dynamically iden-

tify contention and adjust to the appropriate amount of parallelism for storage

applications when the memory interconnect is bottlenecked. This limitation can

cause individual IO requests to experience significant latency inflation when

latency-sensitive L-apps compete for host resources with throughput-bound

T-apps in the presence of memory interconnect bottlenecks. This is because

T-apps are able to quickly acquire most of the resources (such as credits used

for data transfers over peripheral and memory interconnects) within the stor-

age datapath, resulting in L-app requests observing backpressure from within

the host network hardware (since the host network is inherently lossless).

The problem worsens when we extend our setup to incorporate remote

NUMA nodes, where the storage applications run on CPU cores in NUMA

nodes remote to the storage devices in the presence of memory interconnect

bottlenecks. Furthermore, we observe throughput degradation due to breakage

of isolation across multiple T-apps running on CPU cores in different NUMA

nodes and accessing storage devices within the same NUMA node. When con-

tention arises at the memory interconnect for only one of the T-apps (referred to

as the “culprit” application), the other T-app (the “victim” application) observes

backpressure, resulting in unnecessary throughput degradation. This backpres-

35


CPUs
DRAMDRAM

SSD SSD SSD

M-app
Traffic

L-app & T-app 
Traffic

SSD

(a) Memory interconnect congestion
within NUMA node local to SSDs

0.0 0.5 1.0 1.5 2.0 2.5 3.0
Throughput (in Mi  ion IOPS)

102

103

104

L-
Ap

p 
Ta

i  
La

te
nc

) 
(u

s)

7.23( Latenc)
Inf ation

(1215.5µs)

(1410.5µs)

(195.0µs)

No Contention
With Contention

(b) Latency-load curve for increasing
IO depth; none scheduler

0.000 0.002 0.004 0.006 0.008 0.010 0.012
Throughput (in Million IOPS)

102

103

104

L-
A 

  
Ta
il 
La
te
nc
y 
(u
s)

No Contention
With Contention

(c) mq-deadline scheduler

0.0 0.5 1.0 1.5 2.0 2.5 3.0
Throughput (in Mi  ion IOPS)

102

103

104

L-
Ap

p 
Ta

i  
La

te
nc

) 
(u

s)

1.55( Latenc)
Inf ation

(104.0µs)
(292.5µs)

(188.5µs)

No Contention
With Contention

(d) kyber (4KB IO)

0.00 0.03 0.05 0.08 0.10 0.12 0.15 0.18 0.20 0.23
Throughput (in Mi  ion IOPS)

103

104

105

L-
Ap

p 
Ta

i  
La

te
nc

) 
(u

s)

3.10( Latenc)
Inf ation

(1239.1µs) (1828.1µs)

(589.0µs)

No Contention
With Contention

(e) kyber (64KB IO)

0.0 0.5 1.0 1.5 2.0 2.5 3.0
Throughput (in M ll on IOPS)

102

103

104

L-
Ap

p 
Ta

 l 
La

te
nc

y 
((

s)

7.16) Latency
Inflat on

(1201.1µs)

(1396.1µs)

(195.0µs)

No Content on
W th Content on

(f) blk-switch
Figure 2.11: Existing storage stacks suffer from tail latency inflation due to
memory interconnect bottlenecks. (a) The setup comprises L-apps, T-apps, and
M-apps all co-located within a single NUMA node; (b) the T-apps set parallelism
as the benchmarked knee point (circled in the green curve); in the presence
of memory interconnect bottleneck, the T-apps continue to maintain the same
amount of parallelism (cross marked on the red curve), incurring 7.23× higher
tail latency than the ideal operating point for this scenario (circled in the red
curve); (c-f) other IO schedulers supported by Linux blk-mq, and blk-switch
(a state-of-the-art storage stack) incur similar latency inflation and/or high
throughput overheads.

sure originates from shared hardware buffers within the host network (such as

shared IIO buffers, as discussed in §2.1).

Setup. We demonstrate the problems described above using the Linux storage

36


Local NUMA node Remote NUMA node

UPI Link

CPUs
M-app
Traffic

L-app & T-app 
Traffic

DRAM DRAMDRAM

CPUs
M-app
Traffic

SSD SSD SSD SSD

(a) Memory interconnect congestion in
remote NUMA.

0.0 0.5 1.0 1.5 2.0 2.5 3.0
Throughput (in M ll on IOPS)

102

103

104

105

L-
Ap

p 
Ta

 l 
La

te
nc

y 
((

s)

17.44) Latency
Inflat on

(2756.5µs)

(2924.1µs)

(167.6µs)

No Content on
W th Content on

(b) Latency-load curve for increasing
IO depth

Figure 2.12: Higher inflation in tail latencies with memory interconnect bottle-
necks in remote NUMA node. (a) Setup uses applications running on CPU cores
in NUMA node remote from SSDs; (b) L-apps incur 17× tail latency inflation.

stack with the default block multi-queue architecture (blk-mq); while we use

the Linux storage stack, the problems apply for userspace stacks as well since

none of the existing userspace stacks account for memory interconnect bottle-

necks. We use the same server configuration as discussed in §2.2. We attach

4 Samsung PM173X NVMe SSDs to one of the NUMA nodes through 4 dedi-

cated PCIe Gen3 lanes per SSD (i.e., a total of 16 PCIe lanes implying a maxi-

mum aggregate PCIe bandwidth of 128Gbps or 16GB/s). We use two kinds of

storage applications — latency-sensitive applications (L-app) and throughput-

bound applications (T-app). Similar to prior work [57], we use the standard

benchmark tool FIO [1] with an IO depth of 1 for the L-app and varying IO

depth from 1-256 for the T-apps. We perform 4KB random reads IOs from the

SSDs. We also use the same memory-intensive application (M-app) as in §2.2.1.

[Figure 2.11] Modern storage stacks assume bottlenecks are at CPUs and/or

SSDs. Existing storage applications, therefore, typically maintain a fixed max-

imum number of requests in-flight (e.g., by specifying a per-thread IO depth) to

ensure that CPUs and/or SSDs are not overwhelmed. The precise number of in-

flight requests is decided based on the latency-load curve during a benchmark-

37


ing phase. The green lines in Figure 2.11 demonstrate this process: individual

points on the latency-load curve in Figure 2.11(b) are obtained by increasing

the IO depth for T-app (using 1 T-app thread per-core). In this scenario, the

T-app throughput is bottlenecked by available SSD bandwidth at ∼3.0M IOPS.

Benchmarking this curve informs us that an IO depth of 64 (shown in a green

circle) is the ideal choice. In the presence of additional traffic from M-apps, the

bottleneck shifts from SSD bandwidth to memory interconnect bandwidth. The

ideal operating point now is IO depth = 16 (depicted in circle on the red curve)

in Figure 2.11(b). However, the T-apps continue operating at the benchmarked

parallelism (i.e., IO depth = 64). Using the default none scheduler (which is also

the recommended scheduler for SSDs [57, 111]), this results in L-app requests

observing ∼7.23× inflation in tail latency compared to the ideal operating point.

We also evaluate two other state-of-the-art-practice Linux blk-mq sched-

ulers [114] (mq-deadline and kyber), along with a state-of-the-art storage

stack (blk-switch [57]), using the same experimental setup with increasing

IO depths. Figures 2.11(c-f) show that these schedulers observe either simi-

lar tail latency inflation (compared to noop) for the L-app under memory in-

terconnect contention and/or significant throughput overheads for T-app even

without memory interconnect contention. Both mq-deadline and kyber lack

mechanisms to provide isolation across L-apps and T-apps when both appli-

cations issue the same kind of requests to SSDs (reads in our setup). Isolating

L-apps using higher prioclass in mq-deadline results in starvation for T-apps;

kyber using default 4KB IO size observes similar CPU overheads and tail laten-

cies as reported earlier [114]; hence the knee-point in Figure 2.11(d) experiences

relatively minor shift under memory interconnect contention. However, if we

use a larger IO size such that we’re no longer CPU bottlenecked (shown in Fig-

38


Local NUMA node Remote NUMA node

UPI Link

CPUs
Victim T-app 

Traffic
DRAM DRAMDRAM

CPUs
M-app
Traffic

Culprit T-app 
Traffic

SSD SSD SSD SSD

(a) Memory interconnect congestion
for culprit T-app

Culprit Victim0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6

Th
ro
ug
hp
ut
 (i
n 
M
illi
on
 IO

PS
)

Necessary Unnecessary

No Contention
With Contention in Culprit Datapath

(b) Throughput starvation for
victim T-app

Figure 2.13: Throughput starvation problem when T-apps use multiple NUMA
nodes. (a) a setup, where the culprit application suffers from memory, intercon-
nect bottleneck, but not the victim; (b) the culprit must necessarily suffer from
throughput degradation; however, it also results in unnecessary degradation in
the victim application throughput.

ure 2.11(e)), kyber starts observing larger latency inflation. Finally, we also eval-

uate blk-switch, which by design tries to isolate L-apps from T-apps by prior-

itizing L-app requests enqueued in block layer software and hardware queues.

Under memory interconnect bottlenecks, these queues are filled up only after

significant build-up within the host and SSD-internal hardware queues, result-

ing in sub-optimal performance.

[Figure 2.12] Higher inflation in tail latency when using remote NUMA nodes.

When running L-apps and T-apps on CPU cores located on NUMA nodes re-

mote from the SSDs, we observe a ∼17.44× inflation in tail latency (∼2.4× higher

than local NUMA case in Figure 2.11(b)). This additional performance degra-

dation is due to remote memory access latencies from the IIO being higher than

local memory accesses—we measured up to ∼ 65% higher latencies even when

there are no memory interconnect bottlenecks (more details in §5.2.2). Higher

IIO-to-memory latencies lead to IIO buffer filling up more quickly and larger

depletion of PCIe credits for L-apps to perform DMAs, compared to the local

NUMA scenario.

39


[Figure 2.13] Throughput starvation problem when using multiple NUMA

nodes. We encounter a throughput starvation issue in scenarios where two

T-apps are colocated on a server with isolated datapaths from SSDs to memory,

resulting in dedicated SSDs, PCIe lanes, and memory addresses across differ-

ent NUMA nodes. When a ”culprit” T-app contends for memory interconnect

resources within its datapath, it experiences throughput degradation. How-

ever, this contention also leads to unnecessary throughput degradation for the

”victim” T-app, which does not face memory interconnect contention in its dat-

apath. This occurs because contention within the culprit T-app results in IIO

buffers filling up, which both the culprit and victim T-apps share. Since the IIO

lacks a mechanism to distinguish between the two T-apps, both applications

suffer from degraded throughput.

2.4 Summary

In this chapter we provide an overview of the end-to-end datapath for network

and storage applications, with an emphasis on potential bottlenecks within

the host network—e.g., the memory interconnect and the IOMMU caches. We

demonstrated that existing host infrastructure is ill-equipped to effectively han-

dle host congestion: we showcase significant throughput degradation and or-

ders of magnitude of tail latency inflation for networking and storage applica-

tions in the regime of host congestion.

40


CHAPTER 3

REARCHITECTING CONGESTION CONTROL

In this chapter, we discuss the assumptions in the design of existing congestion

control protocols that start the break under host congestion. We then present

design, implementation and evaluation of hostCC, a new congestion control

architecture to handle host congestion along with network fabric congestion.

3.1 Overview

Classical literature in datacenter congestion control takes a narrow view of

“end-to-end”, often interpreting the end as the point of presence of Ethernet

(the network interface card, or NIC). This view precludes the host network,

with the conventional wisdom in systems and networking communities being

that congestion happens primarily within the network fabric (that is, at network

switches). Several recent studies from large-scale production clusters [6, 86, 87]

(along with the results using open-sourced software and hardware infrastruc-

ture, shown in Chapter 2) demonstrate that the above conventional wisdom is

merely an appeal to tradition fallacy with the emergence of host congestion.

The regime of host congestion forces us to revisit the many fundamental

assumptions entrenched within decades of research and practice of congestion

control. For instance, classical congestion control literature assumes that packet

drops happen at the congestion point; in contrast, host congestion results in

queueing and drops away from the actual congestion point (since the host net-

work is lossless). Thus, we must rethink congestion signals to capture the pre-

cise time, location, and reason for host congestion. As another example, an

41


unspoken assumption in classical congestion control literature is that all com-

peting traffic adheres to the congestion control protocol; such is not the case in

the host congestion regime where traffic from “outside the network” (e.g., appli-

cations generating CPU-to-memory traffic) does not employ congestion control

mechanisms, is much closer to the congestion point, and can thus change dra-

matically at sub-RTT granularity. This has powerful implications in terms of

rethinking congestion response: existing congestion control protocols that op-

erate at RTT granularity may achieve performance far from optimal in the host

congestion regime. Thus, host congestion provides us an opportunity to revisit

intellectually intriguing, decades-old, fundamental questions related to conges-

tion control architecture and protocols.

We present hostCC, a congestion control architecture that takes the ultimate

end-to-end view: it handles both host congestion and network fabric congestion

by allocating both host and network resources among competing traffic. The

ethos of host and network resource allocation is the core that drives the three

key technical ideas embodied within hostCC. First, in addition to congestion

signals from within the network fabric, hostCC generates host-local congestion

signals at processor, memory, and peripheral interconnects at sub-microsecond

timescales. These host congestion signals enable hostCC to precisely capture

the time, location, and reason for host congestion. The second key technical

idea in hostCC is a sub-RTT granularity host-local congestion response: at both

the sender and the receiver, hostCC uses host-local congestion signals to allo-

cate host resources between network traffic and host-local traffic. At the sender,

hostCC uses host-local congestion response to ensure that network traffic is not

starved, even at sub-RTT granularity; at the receiver, hostCC uses host-local

congestion response to minimize queueing and packet drops at the host: it mod-

42


ulates host resources allocated to the network traffic at sub-RTT granularity to

ensure that NIC queues are drained at the same rate at which network traffic

arrives at the NIC. Finally, the third key technical idea in hostCC is to use both

host and network congestion signals to perform efficient network resource allo-

cation at RTT timescales.

hostCC admits efficient realization within existing host network stacks,

without any modifications in applications, host hardware, and/or network

hardware; moreover, hostCC can be integrated with existing congestion con-

trol protocols to efficiently handle both host and network fabric congestion.

To demonstrate this, we perform an end-to-end implementation of hostCC in

the Linux kernel using ∼800LOC, and evaluate it along with unmodified Linux

DCTCP. Our evaluation demonstrates that, in the presence of host conges-

tion, hostCC reduces queueing and packet drops at the host to a bare min-

imum, resulting in near-optimal network utilization and tail latency for net-

worked applications. The end-to-end implementation of hostCC, along with

all the documentation needed to reproduce our results, is available at https:

//github.com/Terabit-Ethernet/hostCC.

3.2 hostCC

Figure 3.1 illustrates the end-to-end hostCC architecture. In this section, we pro-

vide details on the three key technical ideas embodied within the architecture—

host congestion signals (§3.2.1), host-local congestion response (§3.2.2), and net-

work resource allocation (§3.2.3).

43

https://github.com/Terabit-Ethernet/hostCC
https://github.com/Terabit-Ethernet/hostCC


Classical Network FC

Host-local Congestion Response

NIC

PCIe

NIC

PCIe

Host 
Congestion 

Signal

Network 
Congestion 

Signal

Receiver Host

Classical Network CC

Input 
Sender-host congestion signal +

Receiver-host congestion signal +
Network congestion signal

Output
Network resource allocation

(via rate control/window mgmt)

Host-local Congestion Response

Input 
Host congestion signal

Output
Host-local response level

Sender Host
Host 

Congestion 
Signal

Output
Host-local response level

Input 
Host congestion signal

Input 
Host congestion signal + 

Network congestion signal

Output
Flow control + 

Echo congestion signals to sender

Figure 3.1: hostCC architecture overview. Discussion in §3.2.

3.2.1 Host congestion signals

hostCC, in addition to classical congestion signals from within the network fab-

ric, generates host-local congestion signals. More precisely, hostCC uses IIO

buffer occupancy as a congestion signal.

To understand why, recall the host datapath discussed in Chapter 2 (§2.1.1).

Let R be the rate at which NIC receives data, P be the maximum in-flight bytes

that PCIe can maintain (a fixed hardware-dependent constant that depends on

the maximum number of credits, and on TLP size), ℓp be the latency between

the NIC to the IIO (a fixed hardware-dependent constant), and ℓm be the la-

tency between the IIO and the memory controller. As discussed in Chapter 2,

ℓm depends upon multiple factors, including the memory controller write queue

size, load on the memory controller, whether DDIO is enabled (and whether an

eviction is triggered), and the speed-of-light-latency between the IIO and mem-

ory [59, 21, 80]. We will refer to ℓmin
m and ℓmax

m as the minimum and the maximum

44


value of ℓm.

Given the above, PCIe throughput is given by P/max{ℓp, ℓm}, that is, PCIe

utilization is dominated by the maximum latency among all links along the path

from the NIC to memory1. The IIO buffer occupancy is equal to R × ℓm, that is,

the maximum number of bytes that can be received by the IIO while it is waiting

for credits to be replenished (if it had credits, IIO would issue the requests).

In the regime of no host congestion, ℓm ≈ ℓmin
m ≪ ℓp by hardware design,

making ℓp the dominant factor in PCIe utilization. Thus, by hardware design,

IIO replenishes PCIe credits at a rate no lower than PCIe consumes credits, and

PCIe bandwidth utilization matches the rate at which NIC receives data; that

is, P/ℓp ≥ R. However, in the host congestion regime, the IIO buffer occupancy

can be anywhere between R × ℓmin
m and min{R × ℓmax

m , maximum number of PCIe

credits}, where the second expression in the upper bound is achieved when

IIO is unable to replenish PCIe credits due to large ℓm. Our measurements in

Figure 3.3 provide an empirical confirmation.

We are now ready to describe the benefits of using IIO occupancy as the host

congestion signal. First, IIO occupancy provides accurate information about

time, location, and reason for host congestion: IIO occupancy increases imme-

diately upon the memory controller becoming congested (accuracy in time and

location) and it increases only if memory controller is congested (accuracy in rea-

son)2. Second, IIO occupancy can be combined with another statistic—IIO in-

1Here is one way to visualize this intuitively: if ℓp → ∞ or ℓm → ∞, PCIe utilization tends
to 0; furthermore, if ℓm ≫ ℓp, PCIe utilization is bottlenecked by ℓm (host congestion) and if
ℓp ≫ ℓm, PCIe utilization is bottlenecked by ℓp. This follows almost immediately using the
analysis in [65].

2When IOMMU is enabled, IIO occupancy can also increase due to contention at IOMMU
caches. It is straightforward to distinguish IOMMU cache contention as a root cause by also
measuring the cache miss rates as shown in §2.2. In this chapter, we focus on IOMMU-disabled
scenario, and discuss handling host congestion due to IOMMU cache contention in Chapter 4.

45


sertion rate, defined as the rate at which PCIe inserts data into the IIO buffer—to

measure various other useful metrics; for instance, instantaneous PCIe through-

put (capturing the rate at which NIC buffers are drained) is equal to instanta-

neous IIO insertion rate times the cacheline size, host delay (ℓp+ ℓm) can be com-

puted using Little’s Law [89]), etc. Third, IIO occupancy and IIO insertion rates

can be measured using two registers typically available on commodity hard-

ware, allowing hostCC to work without any hardware modifications/support.

Finally, IIO measurements are done at the processor interconnect, outside the

NIC-to-memory datapath; thus, IIO occupancy measurements are not impacted

by host congestion. We provide more details in §3.3.1, including details on how

hostCC measures IIO occupancy and IIO insertion rates at sub-µs granularity

using existing hardware.

3.2.2 Host-local congestion response at sub-RTT granularity

A conceptual interpretation of classical congestion control protocols is that, to

handle congestion within the network, these protocols (along with network

switches) allocate network resources across entities competing at the congestion

point. The second key technical idea in hostCC architecture is motivated by this

conceptual view: to handle both host and network congestion, hostCC allocates

both host and network resources among entities competing at the congestion

point. To achieve this, hostCC introduces a host-local congestion response—at

both the sender and the receiver host—that uses host congestion signals dis-

cussed in the previous subsection to allocate host resources across network traf-

fic and host-local traffic.

46


Resource allocation depends on the underlying policy; hostCC architecture

does not dictate the precise resource allocation policy—just like different net-

work resource allocation mechanisms use different network allocation policies

(max-min fairness, weighted max-min fairness, prioritization, etc.), we envi-

sion hostCC to embody various host resource allocation policies and respective

implementation. For the following discussion, we assume that the policy pe-

riodically computes a target network bandwidth (BT ), and feeds it as input to

hostCC. In addition, the host-local congestion response takes as input IIO oc-

cupancy IS as the host congestion signal (using a threshold IT , where IS > IT

indicates host congestion) and PCIe bandwidth utilization BS (computed us-

ing IIO insertion rates, as discussed in the previous subsection). The conges-

tion response mechanism operates on a per-packet basis, and makes a decision

whether to increase or decrease the resource allocation to both network traffic

and host-local traffic.

Given the above, hostCC’s host-local congestion response mechanism is best

described in terms of four possible regimes of operation. We describe individual

regimes and corresponding host-local congestion response below.

1 [No host congestion, network traffic has met the target network band-

width]. In this regime, host is not congested (IS < IT ) and the network traffic

is using more resources than what is needed to meet the target bandwidth (that

is, BS > BT ). Thus, the host-local congestion response mechanism increases the

resources allocated to the host-local traffic. This is the right action to take since,

in absence of host congestion, more host resources can be allocated to either

network traffic or host-local traffic; moreover, since the network traffic has al-

ready met the target network bandwidth, we want to ensure that host-local traf-

47


fic is not backpressured unnecessarily. Thus, the host-local congestion response

mechanism increases resources allocated to the host-local traffic. It is possible

that host-local traffic does not need additional resources; hostCC handles this

case by relying on the AIMD-style mechanisms used in network congestion con-

trol protocols—since host is not congested, network traffic does not get marked

with congestion signals at the host allowing network traffic to increase its rate

and acquire unused host resources (if network fabric is not congested).

2 [Host congestion, network traffic has met the target network bandwidth].

In this regime, host is congested and the network traffic is using more resources

than what is needed to meet the target bandwidth. Thus, the right action in this

regime is to reduce resources allocated to the network traffic and to not reduce

resources allocated to the host-local traffic. To achieve this, hostCC again relies

on AIMD-style mechanisms used in network congestion control protocols: it

echoes the host congestion signal to the network congestion control protocol

resulting in reduction in network traffic rate.

3 [Host congestion, network traffic has not met the target network band-

width]. In this regime, host is congested but the network traffic is allocated

fewer resources than what is needed to meet the target bandwidth. Since there

is host congestion, we must reduce the allocated resources; since network traffic

has not met the target bandwidth, hostCC first reduces the resources allocated

to the host-local traffic (this happens at sub-RTT granularity). However, this is

not sufficient to avoid NIC buffer build up and packet drops. To see why, recall

that the PCIe bandwidth utilization BS and target network bandwidth BT may

be much lower than the rate R at which NIC is currently receiving traffic. By

reducing resources allocated to host-local traffic in order to accommodate BT

48


network traffic bandwidth, the host-local congestion response merely ensures

that NIC buffers build up at a rate no faster than R − BT . Without any explicit

congestion signal, the network traffic will have no reason to reduce R, resulting

in increasingly more queueing at the NIC and eventual packet drops. To avoid

this, the host-local congestion response also echoes the host congestion signal to

the network congestion control protocol resulting in reduction in network traf-

fic rate. We note that, if R < BT , the host-local congestion response may lead to

inefficient resource allocation due to reducing resources allocated to both net-

work and host-local traffic; nevertheless, hostCC takes the above conservation

decision temporarily to minimize NIC buffer buildup and packet drops.

4 [No host congestion, network traffic has not met the target network band-

width]. In this regime, host is not congested and the network traffic has fewer

resources than what is needed to meet the target bandwidth. Thus, the host-

local congestion response allocates more resources to network traffic; this allo-

cation is again implicit, in that, it again relies on the AIMD-style mechanisms—

since host is not congested, network traffic does not get marked with congestion

signals at the host allowing network traffic to increase its rate and acquire un-

used host resources (if network fabric is not congested). Increasing resources

allocated to the network traffic implicitly may take multiple RTTs, or may not

even be feasible (e.g., due to network congestion); nevertheless, the host-local

congestion response makes the conservation decision to not increase resources

allocated to the host-local traffic in this regime to avoid host congestion before

reaching the target network bandwidth.

The host-local congestion response in hostCC uses host congestion signals (that

are generated at sub-microsecond granularity) and is purely local to the host;

49


thus, it can be done at sub-RTT granularity. As a result, even if host-local traffic

changes at sub-RTT granularity, the host-local congestion response can ensure

high host resource utilization while maintaining target network bandwidth ac-

cording to any given policy.

3.2.3 Network resource allocation at RTT granularity

Consider the regime of host congestion (IS > IT ) and network traffic transmit-

ting at rate R > BT (which will lead to BS > BT ). In this scenario, the right

action is for the network traffic to reduce its rate. To achieve this, hostCC’s host-

local congestion response mechanism does not take any action; instead, hostCC

simply echos the host congestion signal back to the network congestion control

protocol (in addition to any network congestion signal). This has two benefits.

The first benefit is conceptual: it enables a clean separation of concerns, where

the host-local congestion response handles host congestion at sub-RTT granu-

larity, and network congestion control continues to handle network congestion

at RTT granularity as they do today. Second, hostCC can be integrated with any

network congestion control protocol: the only difference is that the protocol will

now use both host and network congestion signals for host resource allocation.

End-to-end hostCC behavior. We provide an intuitive description of hostCC’s

end-to-end behavior. Suppose the network traffic is operating at rate R > BT .

Suppose severe host congestion is introduced abruptly; then, as evaluated in

§2.2, BS will reduce to a small value below BT , IS will grow beyond IT , and the

host-local congestion response will kick in quickly to increase the host resources

allocated to network traffic to accommodate ∼BT bandwidth (potentially by re-

50


0.4 0.6 0.8 1.0 1.2
Measurement Latency (µs)

0.0

0.2

0.4

0.6

0.8

1.0

C
D

F
O

ve
rA

ll
M

ea
su

re
m

en
ts

No Host Congestion
With Host Congestion

(a) CDF for IS read latency

0.4 0.6 0.8 1.0 1.2
Measurement Latency (µs)

0.0

0.2

0.4

0.6

0.8

1.0

C
D

F
O

ve
rA

ll
M

ea
su

re
m

en
ts

No Host Congestion
With Host Congestion

(b) CDF for BS read latency

Figure 3.2: hostCC generates host congestion signals that are not on the NIC-
to-memory datapath; it can thus measure both IS (left) and BS (right) at sub-µs
timescales independent of host congestion.

ducing resources allocated to host-local traffic causing host congestion). How-

ever, for a few RTTs, the arrival rate of network traffic R at receiver NIC will still

be higher than BT , resulting in hostCC echoing host congestion signals back to

the sender. The sender will eventually reduce R until it converges to the target

network bandwidth BT .

3.3 hostCC Implementation

We now provide details on how we incorporate the three key technical

ideas from the previous section in an end-to-end hostCC implementation—

generating host congestion signals (§3.3.1), using host congestion signals for

host-local congestion response at sub-RTT granularity (§3.3.2) and using both

host and network congestion signals for network resource allocation at RTT

granularity (§3.3.3). We implement hostCC as a loadable Linux kernel module

using ∼800 LOC; hostCC works out-of-the-box with various existing conges-

tion control protocols, without requiring any modifications to applications, host

hardware, and/or network hardware.

51


3.3.1 Host congestion signals

hostCC collects host congestion signals at sub-microsecond granularity. To

maintain brevity, we describe an implementation atop Intel architectures (AMD

architecture is conceptually very similar).

Most hardware counters are exposed using model specific registers

(MSRs) [61]. The MSR for the IIO occupancy value at time t (denoted by ROCC(t))

maintains the cumulative value of the occupancy, incremented at IIO clock fre-

quency (denoted by FIIO); for example, FIIO = 500MHz for our servers. The

average IIO occupancy IS between any two time instants t1 and t2 is computed

using: IS = (ROCC(t2) − ROCC(t1))/((t2 − t1) ∗ FIIO). The time difference (t2 − t1) is

measured by the standard method of reading the TSC register, which provides

nanosecond level time accuracy. To minimize read latency, we used inline as-

sembler code to read the TSC register. The read latency is bottlenecked by the

read call to the IIO occupancy MSR register. On our servers, we measured that

each TSC read took < 2ns, and each MSR read call took <∼600ns. Thus, we are

able to collect host congestion signal (IIO occupancy IS ) at sub-µs timescales.

Similarly, to measure PCIe bandwidth utilization BS , we read another MSR

counter (denoted by RINS ) that stores cumulative IIO insertions. Thus, similar to

IIO occupancy, we compute the average rate of IIO insertions I between time in-

stants t1 and t2 as I = (RINS (t2)−RINS (t1))/(t2− t1). The PCIe bandwidth utilization

between t1 and t2 is thus I times the cacheline size. Both congestion signals—IS

and BS —require reading CPU registers, which does not overlap with the NIC-

to-memory datapath; thus, hostCC is able to measure these signals at a sub-µs

timescale, independent of host congestion (as shown in Figure 3.2).

We note a low-level detail. Similar to existing network congestion control

52


0 200 400 600 800 1000
0

20
40
60
80

100
120

PC
Ie

B
W

(G
bp

s)

0 200 400 600 800 1000
Time (µs)

50

60

70

80

90

100

II
O

O
cc

up
an

cy

(a) No Host Congestion

0 200 400 600 800 1000
0

20
40
60
80

100
120

PC
Ie

B
W

(G
bp

s)

0 200 400 600 800 1000
Time (µs)

50

60

70

80

90

100

II
O

O
cc

up
an

cy

(b) With Host Congestion
Figure 3.3: Variation of IIO occupancy IS and PCIe bandwidth utilization BS

with time for 1ms period for the default experimental setup without host con-
gestion (left) and with 3× host congestion (right). In absence of host congestion,
BS ≈ 103Gbps (line-rate bandwidth including PCIe overheads with 4K MTUs)
and IS≈65, which corresponds to hardware-specific IIO-DRAM bandwidth-
delay product (§3.2.1). During host congestion, IS increases to a maximum value
of ∼93 (shown in red), resulting in a reduction of BS , queueing and packet drops
at the NIC, and network CC reducing the rate.

protocols [77, 99], hostCC uses an exponentially weighted moving averaging

(EWMA) for its congestion signals, IS and BS , rather than their instantaneous

values. The EWMA weight used in hostCC has a standard tradeoff in terms of

aggressiveness in hostCC’s response to host congestion and delayed reaction—

using a large weight will quickly trigger host-local congestion response as well

as network congestion control response (the latter because congestion signals

will be generated more quickly, and echoing them to network congestion con-

trol protocol will trigger its response) which could lead to overreaction in pres-

ence of temporary burst of host congestion; a small weight, on the other hand,

will delay congestion response. hostCC uses a default weight value of 1/8 for

IS and 1/256 for BS (that is, last 8 IIO occupancy values and last 256 PCIe band-

width utilization values are dominant). An important note here is that we use

the same weight for IS for detecting host congestion and echoing congestion to

the network congestion control protocol; this works because network conges-

53


tion control protocols typically maintains EWMA of their parameters (e.g., α in

DCTCP) used to trigger network congestion response. We show in Figure 3.3

the IIO occupancy and PCIe write bandwidth with time for the baseline setup

in §2.2, with and without host congestion.

3.3.2 Host-local congestion response

The host-local congestion response in hostCC implementation takes as input

host congestion signals (IS and BS ) and corresponding thresholds (IT and BT )

and triggers the response discussed in §3.2.2. Any resource allocation mech-

anism must operate using the interface provided by the system that offers re-

sources; we describe our implementation using the Intel Memory Bandwidth

Allocation (MBA) tool. This interface uses a simple multi-level backpressure

mechanism to the host-local traffic (recall, host interconnect is lossless; thus,

backpressure is a natural mechanism to reduce CPU to memory traffic). Higher

levels mean more backpressure; that is, higher levels result in fewer resources

allocated to host-local traffic. Internally, MBA alters the rate at which any CPU

core can generate memory traffic by introducing additional latency to every

read/write request that observes an L2 cache miss on that core. Therefore, aver-

age traffic generated by a core to memory is inversely proportional to the intro-

duced additional latency: (LFB size × cacheline size)/per-access latency, where

LFB size is as discussed in §2.2. The current MBA interface allows 10 levels,

with higher levels introducing higher latency resulting in lower CPU to mem-

ory traffic [121]. AMD’s Memory Bandwidth QoS control tool uses a similar

interface [15].

54


0 1 2 3 4
Host-local Response Level

0
20
40
60
80

100

N
et

A
pp

-T
T

pu
t

(G
bp

s)

DDIO Off DDIO On

0 1 2 3 4
Host-local Response Level

0
50

100
150
200

M
A

pp
T

pu
t

(G
bp

s)

DDIO Off DDIO On

(a) Application throughput

0 1 2 3 4
Host-local Response Level

0.0

0.2

0.4

0.6

0.8

1.0

M
em

or
y

B
an

dw
id

th
U

til
iz

at
io

n

DDIO Off, NetApp-T
DDIO Off, MApp

DDIO On, NetApp-T
DDIO On, MApp

(b) Memory bandwidth utilization

Figure 3.4: Evaluating MBA efficacy: NetApp-T is able to acquire more host
resources with higher host-local response levels (that is, more backpressure on
MApp). As a result, NetApp-T gets higher throughput (left) and a larger fraction
of memory bandwidth (right). Discussion in §3.3.2 and §3.4.

The desired rate level is realized by performing a single MSR write to MBA-

specific registers. For each socket (NUMA node), MBA maintains 8 MSR reg-

isters, one for each “class-of-service” (COS), which can include any number of

CPU cores within a particular socket. The assignment of CPU cores to COS

can also be changed dynamically using an MSR write to another control regis-

ter. Therefore, a single MSR write can simultaneously alter the CPU to memory

traffic for any number of CPU cores within a socket. Our current implementa-

tion uses 5 local response levels ℓ = {0, 1, 2, 3, 4}, where each successive level ℓ

introduces increasingly larger latency—level 0 corresponds to no backpressure,

and level 4 corresponds to maximum backpressure. We separate out network

traffic cores from host-local traffic cores using different COS and introduce back-

pressure only to the latter.

Figure 3.4(a) shows network throughput when we hard code each individ-

ual host-local response level ℓ. We observe that, with each level ℓ, throughput

increases as expected: with more aggressive backpressure on host-local traffic,

network throughput increases from 43Gbps at level 0 to ∼100 Gbps at level 43.

3The host-local response level 3 corresponds to the maximum possible latency that can be in-

55


To understand Figure 3.4(b), that shows corresponding memory bandwidth uti-

lization, we must understand the difference between application-level through-

put and corresponding load on memory bandwidth. In particular, NetApp-T

and MApp use ∼2.1× and ∼1.33× memory bandwidth per unit of application-

level throughput (due to data copy and processor interconnect overheads, re-

spectively). When DDIO is enabled, the network application achieves higher

throughput at lower response levels (for eg., ∼100Gbps at level 3 instead of

4) because NetApp-T utilizes smaller amount of memory bandwidth per unit

throughput (since DDIO cache eviction rate is typically less than 100%). Con-

sequently, it requires smaller amount of backpressure to the MApp to achieve the

same network throughput. Our measurements also suggest that it takes ∼22µs

to perform a write to today’s MBA MSR registers due to MBA limitations; to

verify that this was not an hostCC implementation artifact, we performed MSR

writes using inline assembler core such that we only execute a single assembly-

level instruction for the write and still incurred 22µs latency (2× smaller than

our network RTT). We further discuss this limitation of MBA in Chapter 6.

3.3.3 Network resource allocation

hostCC can be integrated with many existing network congestion control pro-

tocols. We focus here on hostCC’s implementation with ECN-based protocols;

we discuss extensions to integrate hostCC with other protocols in §6.1. hostCC

troduced using the current MBA implementation to all MApp cores. However, this added latency
does not provide sufficient backpressure to MApp to allow the network application to reach line
rate throughput (as shown in Figure 3.4, NetApp-T only achieves ∼77Gbps throughput at level
3). In order to emulate an MBA response with larger added latency, we introduced a level 4 in
current hostCC implementation—when emulating level 4, we pause the execution of the MApp
process using the SIGSTOP signal, and when switching back from level 4 to lower levels, we
resume the MApp process using the SIGCONT signal.

56


requires no modification to existing network congestion control protocol imple-

mentations: hostCC simply generates ECN markings on the ACKs sent back to

the sender if IS > IT (if the packet was already marked by the switch, no modifi-

cations are made). The current hostCC implementation performs ECN marking

at the IP layer using 2 out of 6 DSCP bits (as in RFC 3168 [113]) by exploiting a

hook to the ip recv function provided by the widely used NetFilter kernel mod-

ule [131] in Linux. hostCC’s host-local congestion response does exactly what

today’s switches do—mark both these bits as 1 to indicate congestion—before

delivering the IP datagram to the transport layer. The transport layer then pro-

cesses the ECN marked datagram in exactly the same manner as it would for

any ECN marked packet at today’s switches.

3.4 hostCC Evaluation

We now evaluate hostCC performance. Our goals are three-fold:

• Understanding benefits of hostCC’s core ideas—host congestion signals,

host-local congestion response, and performing network resource allocation

using both host and network congestion signals—to application-level perfor-

mance;

• Deep dive into hostCC microscopic behavior (capturing host congestion sig-

nals, host-local congestion response, and network resource allocation) in the

host congestion regime;

• Develop lessons for the host congestion regime that would be useful to design

future host hardware and network stacks that can better enable the reaction

to host congestion.

57


0x 1x 2x 3x0
20
40
60
80

100

N
et

w
or

k
T

hr
ou

gh
pu

t
(G

bp
s)

0x 1x 2x 3x
Degree of Host Congestion

10−5
10−4
10−3
10−2
10−1

100

Pa
ck

et
D

ro
p

R
at

e
(%

)

DCTCP DCTCP+hostCC

(a) Network throughput and
packet drop rate

0x 1x 2x 3x
Degree of Host Congestion

0.0

0.2

0.4

0.6

0.8

1.0

M
em

or
y

B
an

dw
id

th
U

til
iz

at
io

n

DCTCP, NetApp-T
DCTCP, MApp

DCTCP+hostCC, NetApp-T
DCTCP+hostCC, MApp

(b) Memory bandwidth
utilization

Figure 3.5: (a) hostCC allows network traffic to achieve its target network band-
width of BT = 80Gbps, while simultaneously reducing packet drop rates by or-
ders of magnitude, even with a high degree of host congestion. (b) with hostCC,
MApps no longer acquire a large fraction of memory bandwidth, even with high
degree of host congestion.

Throughout this section, we use Linux DCTCP as our network congestion con-

trol protocol (network CC) since Linux DCTCP is a stable open-sourced im-

plementation that works with commodity hardware. We primarily focus on

the DDIO disabled case since results are easier to explain (as discussed ear-

lier, performance for DDIO enabled case depends on cache eviction policies);

hostCC evaluation for the DDIO enabled case is presented in §3.4.2. Unless

mentioned otherwise, we use IIO occupancy threshold IT = 70 and network

target bandwidth BT = 80Gbps for hostCC (we present sensitivity analysis of

hostCC performance with IT and BT in §3.4.3); for DCTCP, we use default pa-

rameters from [14]. We do not perform any experiment-specific parameter opti-

mization.

58


3.4.1 hostCC benefits

We first evaluate hostCC on the same setup as in §2.2—this setup does not

have any network congestion, allowing us to gain insights about hostCC per-

formance in the host congestion regime. We then extend the setup to the one

used in [6]; this setup includes network congestion and allows us to gain in-

sights on hostCC performance in the presence of network congestion (with and

without host congestion).

hostCC avoids throughput degradation for network traffic while simultane-

ously reducing packet drops at the host by orders of magnitude. Figure 3.5(a)

shows that, when the degree of host congestion is so low that NetApp-T can

reach its target network bandwidth without creating bottlenecks within the host

network (0× and 1× cases), hostCC has negligible impact on NetApp-T through-

put (which is bandwidth bottlenecked for the 0× case and CPU bottlenecked for

the 1× case). More interestingly, in the presence of host congestion, hostCC al-

lows NetApp-T to achieve throughput close to the desired target network band-

width (80Gbps in this experiment), even with a high degree of host congestion.

Essentially, using the host congestion signals, the sub-RTT host-local congestion

response promptly reduces the resources allocated to the MApp traffic whenever

the network throughput falls below BT in presence of host congestion (hostCC

steady-state behavior illustrated later in §3.4.4) Moreover, Figure 3.5(b) shows

that these benefits to NetApp-T do not come at the cost of starving MApp traffic:

hostCC’s host-local congestion response also increases resources allocated to

the MApp traffic whenever NetApp-T is able to sustain the target network band-

width. Figure 3.5(a) also shows that hostCC reduces packet drop rates to a bare

minimum, since the host-local congestion response and network CC (using host

59


1500B 4000B 9000B0
20
40
60
80

100

N
et

w
or

k
T

hr
ou

gh
pu

t
(G

bp
s)

1500B 4000B 9000B
MTU Size

10−5
10−4
10−3
10−2
10−1

100
101

Pa
ck

et
D

ro
p

R
at

e
(%

)

DCTCP DCTCP+hostCC

(a) Varying MTU size

4 8 160
20
40
60
80

100

N
et

w
or

k
T

hr
ou

gh
pu

t
(G

bp
s)

4 8 16
Number of Active Flows

10−5
10−4
10−3
10−2
10−1

100
101

Pa
ck

et
D

ro
p

R
at

e
(%

)

DCTCP DCTCP+hostCC

(b) Varying number of flows

Figure 3.6: Even with high degree of host congestion (3× in this experiment),
hostCC consistently maintains its benefits across (a) MTU sizes and (b) number
of flows.

104
105
106

128B 512B 2KB 8KB 32KB
RPC Size

50
100
150
200
250
300

L
at

en
cy

(µ
s)

DCTCP
(No host congestion)

DCTCP
(With host congestion)

DCTCP+hostCC
(With host congestion)

Figure 3.7: Even with a high degree of host congestion (3× in this experiment),
hostCC incurs minimal latency inflation compared to no host congestion, sig-
nificantly improving tail latency across all RPC sizes.

congestion signals) work in tandem to keep the NIC buffer occupancy low for a

larger fraction of time.

We observe that, with a high degree of host congestion, the total memory

bandwidth utilization is slightly reduced. We believe this behavior is not due to

hostCC’s architecture but rather due to the coarse granularity of host resource

allocation using existing tools (Intel MBA, in this case). Due to such coarse gran-

ularity of the allocation, the MApp sometimes gets backpressured much more

than than it needs to, in order to accommodate additional NetApp-T traffic. Fig-

ure 3.4 shows an example of this behavior: when we switch from host-local re-

sponse level 3 to 4, NetApp-T gains 5.2GBps of memory bandwidth, while MApp

60


1x 1.5x 2x 2.5x0
20
40
60
80

100

N
et

w
or

k
T

hr
ou

gh
pu

t
(G

bp
s)

1x 1.5x 2x 2.5x
Increasing Degree of Incast

10−5
10−4
10−3
10−2
10−1

100

Pa
ck

et
D

ro
p

R
at

e
(%

)

DCTCP DCTCP+hostCC

(a) Network congestion

1x 1.5x 2x 2.5x0
20
40
60
80

100

N
et

w
or

k
T

hr
ou

gh
pu

t
(G

bp
s)

1x 1.5x 2x 2.5x
Increasing Degree of Incast

10−5
10−4
10−3
10−2
10−1

100

Pa
ck

et
D

ro
p

R
at

e
(%

)

DCTCP DCTCP+hostCC

(b) Host + Network congestion

Figure 3.8: (a) In the presence of network congestion and absence of host con-
gestion, hostCC achieves performance similar to network CC indicating mini-
mal overheads; (b) in the presence of both host and network congestion, hostCC
consistently provides benefits similar to the case with only host congestion.

loses 13.8GBps of memory bandwidth. We discuss potential avenues for future

hardware support for improved host-local congestion response in §6.1.

Figure 3.6 shows that hostCC consistently achieves benefits in terms of main-

taining target network bandwidth and reduced packet drop rates across all eval-

uated MTU sizes and number for flows.

hostCC observes minimal tail latency inflation for latency-sensitive traffic,

even with high degree of host congestion. Figure 3.7 shows the observed la-

tency under the same multi-tenant evaluation setup as in Figure 2.11 (where we

use all apps NetApp-T, NetApp-L and MApp together). hostCC observes mini-

mal latency inflation in the regime of host congestion due to two reasons: (1)

hostCC’s host-local congestion response ensures minimal queueing delay at the

host; and (2) hostCC significantly reduces packet drop rates, avoiding retrans-

mission and timeout delays. Results for small-sized RPCs provide evidence for

the first reason: recall, from §2.2, that P99 latency in this experiment is domi-

nated by NIC queueing delays; the figure shows that, despite the high degree

of host congestion, hostCC incurs a minuscule latency inflation of 13µs for 128B

61


RPCs (when compared to no host congestion scenario). All results provide evi-

dence for the second reason: we observe no timeouts even at P99.9 percentile.

hostCC maintains its benefits even in the presence of both host and network

congestion. Figure 3.8 evaluates hostCC performance in the presence of net-

work congestion, with and without host congestion. For this experiment, we

use an incast workload with two senders and a single receiver, directly con-

nected to a switch. We vary the degree of network congestion by varying the

degree of incast (the total number of active concurrent flows at the receiver)

from 4 to 10 (1× to 2.5× degree of incast). We observe that, in the absence of

host congestion, network CC without hostCC observes increased packet drop

rates with an increase in the degree of network congestion (as one would ex-

pect); since there is no host congestion, hostCC performance is near-identical to

the network CC performance indicating that hostCC has minimal overheads in

the absence of host congestion. In the presence of both host and network con-

gestion, however, network CC performance without hostCC suffers from high

packet drops rates and reduced throughput; in this scenario, hostCC provides

significant benefits using all three of its core ideas: it collects host congestion sig-

nals at sub-µs timescales, it is able to modulate host resources allocated to the

network traffic so as to maintain target network bandwidth, and incurs mini-

mal packet drops rates by ensuring that network CC converges to a rate that

matches available network and host resources. This experiment demonstrates

that hostCC interpolates well with network CC even in the presence of both

host and network congestion.

62


0x 1x 2x 3x0
20
40
60
80

100

N
et

w
or

k
T

hr
ou

gh
pu

t
(G

bp
s)

0x 1x 2x 3x
Degree of Host Congestion

10−5
10−4
10−3
10−2
10−1

100

Pa
ck

et
D

ro
p

R
at

e
(%

)

DCTCP DCTCP+hostCC

(a) Network throughput and
packet drop rate

0x 1x 2x 3x
Degree of Host Congestion

0.0

0.2

0.4

0.6

0.8

1.0

M
em

or
y

B
an

dw
id

th
U

til
iz

at
io

n

DCTCP, NetApp-T
DCTCP, MApp

DCTCP+hostCC, NetApp-T
DCTCP+hostCC, MApp

(b) Memory bandwidth
utilization

Figure 3.9: hostCC, with DDIO enabled, provides benefits similar to the DDIO
disabled case.

3.4.2 hostCC results with DDIO enabled

Figure 3.9 shows hostCC results using the same setup as in Figure 3.5, but with

DDIO enabled. We use IT = 50 here because the observed IIO occupancy value

when there is no host congestion is smaller when DDIO is enabled (∼45, com-

pared to ∼65 when disabled). This is due to smaller average IIO-to-memory

latency with DDIO enabled as discussed in §2.1.1. We observe similar trends as

in Figure 3.5—hostCC ensures that network traffic is able to achieve the target

network bandwidth, while reducing packet drop rates to a bare minimum. With

large degree of host congestion, the absolute packet drop rate with hostCC is

slightly higher than other evaluated cases (hostCC still helps reduce packet drop

rates by ∼37× compared to the case without hostCC); identifying the precise rea-

sons for this observation requires more visibility into DDIO-related hardware

operations like cache eviction policies (as also noted in [31, 46]).

We also observe that, with DDIO enabled, MApp is able to acquire larger frac-

tion of memory bandwidth (compared to DDIO disabled case in Figure 3.5) for

any given degree of host congestion. Figure 3.4 helps explain this behavior—

63


104
105
106

128B 512B 2KB 8KB 32KB
RPC Size

50
100
150
200
250
300

L
at

en
cy

(µ
s)

DCTCP
(No host congestion)

DCTCP
(With host congestion)

DCTCP+hostCC
(With host congestion)

Figure 3.10: hostCC, with DDIO enabled, provides benefits in terms of latency
improvements similar to the DDIO disabled case.

10 20 30 40 50 60 70 80 90 1000
20
40
60
80

100

N
et

w
or

k
T

hr
ou

gh
pu

t
(G

bp
s)

10 20 30 40 50 60 70 80 90 100
Target Network Bandwidth (BT )

10−5
10−4
10−3
10−2
10−1

100

Pa
ck

et
D

ro
p

R
at

e
(%

)

(a) Network throughput and
packet drop rate

10 20 30 40 50 60 70 80 90 100
Target Network Bandwidth (BT )

0.0

0.2

0.4

0.6

0.8

1.0

M
em

or
y

B
an

dw
id

th
U

til
iz

at
io

n

NetApp-T MApp

(b) Memory banwidth
utilization

Figure 3.11: hostCC consistently maintains its benefits across varying network
target bandwidth BT . Discussion in §3.4.3.

when DDIO is enabled, NetApp-T is able to sustain higher average throughput

at a lower host-local response level; hence, MApp experiences smaller amount of

backpressure to achieve the target network bandwidth.

Figure 3.10 shows benefits of hostCC in terms of latency for NetApp-L using

the same setup as in Figure 2.11, but with DDIO enabled. We observe that la-

tency inflation without and with hostCC is identical to that of Figure 3.7; this is

because both DDIO enabled and disabled cases observe similar level of packet

drop rates for 3× degree of host congestion (as shown in Figure 2.4), leading to

similar tail latency inflation for NetApp-L (as discussed in §2.2).

64


3.4.3 hostCC sensitivity analysis

hostCC has only two parameters BT and IT . Figure 3.11 shows that hostCC con-

sistently achieves benefits in terms of maintaining target network bandwidth

and minimal packet drop rates for all values of BT (while only applying as much

backpressure on MApp as needed to maintain target network bandwidth). The

drop rates are particularly low for small values of BT ; this is because the arrival

rate of packets at the NIC is smaller than the PCIe bandwidth utilization (that

is, the rate at which packets are drained from the NIC buffer). To see this, recall

from Figure 2.4 that, even without hostCC, network traffic achieves ∼43Gbps

throughput at 3× degree of host congestion; thus, average PCIe utilization must

be at least 43Gbps. Here, hostCC maintains average network throughput less

than 40Gbps, resulting in NIC buffers rarely filling up and no packet drops. We

also observe low drop rates with large values of BT . As discussed in §3.2.2, NIC

buffer build up depends on R − BT , larger values of BT gives network traffic

more time to converge to the right rate; since hostCC ensures the average PCIe

bandwidth utilization remains close to BT , the time it takes for NIC buffer to fill

up reduces with increasing BT .

Figure 3.12 shows hostCC performance with varying IT values: increasing IT

leads to an increasingly delayed reaction to the onset of host congestion, leading

to larger packet drops and higher MApp throughput.

3.4.4 Deep dive into hostCC performance

We now provide more insights into hostCC’s performance.

65


70 75 80 85 900
20
40
60
80

100

N
et

w
or

k
T

hr
ou

gh
pu

t
(G

bp
s)

70 75 80 85 90
Target IIO Threshold (IT )

10−5
10−4
10−3
10−2
10−1

100

Pa
ck

et
D

ro
p

R
at

e
(%

)

(a) Network throughput and
packet drop rate

70 75 80 85 90
Target IIO Threshold (IT )

0.0

0.2

0.4

0.6

0.8

1.0

M
em

or
y

B
an

dw
id

th
U

til
iz

at
io

n

NetApp-T MApp

(b) Memory bandwidth
utilization

Figure 3.12: (a) Increasing IT values lead to increasing drop rates due less ag-
gressive hostCC reaction to congestion. (b) MApp acquires larger memory band-
width with larger IT due to less aggressive backpressure.

Necessity of the three hostCC ideas. Figure 3.14 demonstrates that each of

the three key technical ideas in the hostCC architecture—generating host con-

gestion signals at sub-µs granularity, sub-RTT host-local congestion response,

and network resource allocation based on both host and network congestion

signals—contribute to hostCC’s performance.

In particular, without host-local congestion response, it is possible to mini-

mize packet drop rates but only at the cost of degraded throughput: network

traffic achieves merely ∼28Gbps of throughput. To explain the root cause for

this observation, we plot in Figure 3.14(b) measured IIO occupancy and PCIe

bandwidth utilization (including the PCIe-level overheads that turn out to be

∼5% with 4K MTU and hardware default TLP size) with time for a 1000µs hori-

zon for the case of 3×memory contention. We observe that IIO occupancy often

increases beyond IT = 70, indicating a possible onset of host congestion (when-

ever the network traffic increases using AIMD; network CC reacts to this by

reducing rate, thus bringing down the host congestion and drop rate, but also

suffering from low throughput. On the other hand, without performing net-

work resource allocation based on both host and network congestion signals, it

66


Echo congestion
signals only

Host-local
response only

Echo congestion signals
+ Host-local response

0
20
40
60
80

100

N
et

w
or

k
T

hr
ou

gh
pu

t
(G

bp
s)

Echo congestion
signals only

Host-local
response only

Echo congestion signals
+ Host-local response

10−5
10−4
10−3
10−2
10−1

100

Pa
ck

et
D

ro
p

R
at

e
(%

)
(a) Necessity of using both kinds of hostCC

responses in tandem

0 200 400 600 800 1000
0

20
40
60
80

100
120

PC
Ie

B
W

(G
bp

s)

0 200 400 600 800 1000
Time (µs)

50

60

70

80

90

100

II
O

O
cc

up
an

cy

(b) hostCC behavior
when it only echoes
congestion signals

0 200 400 600 800 1000
0

20
40
60
80

100
120

PC
Ie

B
W

(G
bp

s)

0 200 400 600 800 1000
Time (µs)

50

60

70

80

90

100

II
O

O
cc

up
an

cy

(c) hostCC behavior with
only host-local

congestion response

0 200 400 600 800 1000
0

20
40
60
80

100
120

PC
Ie

B
W

(G
bp

s)

0 200 400 600 800 1000
Time (µs)

50

60

70

80

90

100

II
O

O
cc

up
an

cy

(d) Default hostCC
behavior with both

responses in tandem

Figure 3.13: Each of the three key technical ideas in the hostCC architecture—
generating host congestion signals at sub-µs granularity, sub-RTT host-local
congestion response, and network resource allocation based on both host and
network congestion signals—contribute to hostCC’s performance. Discussion
in §3.4.4.

is possible to achieve high throughput but at the cost of large packet drop rates.

Figure 3.14(c) demonstrates the reason for this observation: IIO occupancy IS

frequently saturates to the maximum value of ∼93, indicating NIC buffer build-

up and subsequent packet drops at the host.

Figure 3.14(d) shows that, by carefully allocating both host resources (us-

ing host-local congestion response) and network resources (using both host and

network congestion signals), hostCC is able to simultaneously achieve high

67


Echo congestion
signals only

Host-local
response only

Echo congestion signals
+ Host-local response

0
20
40
60
80

100

N
et

w
or

k
T

hr
ou

gh
pu

t
(G

bp
s)

Echo congestion
signals only

Host-local
response only

Echo congestion signals
+ Host-local response

10−5
10−4
10−3
10−2
10−1

100

Pa
ck

et
D

ro
p

R
at

e
(%

)
(a) Necessity of using both kinds of hostCC

responses in tandem

0 200 400 600 800 1000
0

20
40
60
80

100
120

PC
Ie

B
W

(G
bp

s)

0 200 400 600 800 1000
Time (µs)

50

60

70

80

90

100

II
O

O
cc

up
an

cy

(b) hostCC behavior
when it only echoes
congestion signals

0 200 400 600 800 1000
0

20
40
60
80

100
120

PC
Ie

B
W

(G
bp

s)

0 200 400 600 800 1000
Time (µs)

50

60

70

80

90

100

II
O

O
cc

up
an

cy

(c) hostCC behavior with
only host-local

congestion response

0 200 400 600 800 1000
0

20
40
60
80

100
120

PC
Ie

B
W

(G
bp

s)

0 200 400 600 800 1000
Time (µs)

50

60

70

80

90

100

II
O

O
cc

up
an

cy

(d) Default hostCC
behavior with both

responses in tandem

Figure 3.14: Each of the three key technical ideas in the hostCC architecture—
generating host congestion signals at sub-µs granularity, sub-RTT host-local
congestion response, and network resource allocation based on both host and
network congestion signals—contribute to hostCC’s performance. Discussion
in §3.4.4.

throughput and low packet drops rates. Using the host-local congestion re-

sponse, hostCC is able to modulate host resources allocated to network traffic

in a manner that NIC queue buffer buildup can be avoided (as suggested by

smaller IIO occupancy immediately upon crossing the IT threshold) until net-

work traffic converges to the right throughput using both host and network

congestion signals.

Understanding example hostCC steady-state behavior. Figure 3.15 shows a

68


0 50 100 150 200 250
Time (us)

60
65
70
75
80
85
90
95

100

PC
Ie

B
an

dw
id

th
(G

bp
s)

(a) PCIe Write Bandwidth

0 50 100 150 200 250
Time (us)

0

1

2

3

4

R
es

po
ns

e
L

ev
el

(b) Host-local Response Level

0 50 100 150 200 250
Time (us)

50

60

70

80

90

100

II
O

O
cc

up
an

cy

(c) IIO Occupancy

Figure 3.15: In steady-state, hostCC keeps PCIe bandwidth close to the target
network bandwidth, while ensuring that IIO occupancy remains smaller than
the congestion threshold. Discussion in §3.4.4.

snapshot of hostCC’s behavior over a 250µs time horizon. Figure 3.15(a) shows

the measured PCIe bandwidth utilization with time for BT = 80Gbps (includ-

ing PCIe-level overheads, this amounts to 84Gbps, denoted by the green line).

We note from Figure 3.4 that PCIe bandwidth utilization lies between the host-

local response levels 3 and 4 (which provide ∼77Gbps and 100Gbps throughput,

respectively). Therefore, as expected, hostCC host-local congestion response os-

cillates between levels 3 and 4, ensuring that the measured PCIe bandwidth uti-

lization remains close to BT in Figure 3.15(a). The switches across levels happen

in accordance with the host-local congestion response logic in §3.2.2: hostCC

switches from level 3 to 4 when the IIO occupancy goes higher than IT (denoted

by red line in Figure 3.15(c)) and the PCIe bandwidth is still lower than BT ; and

switches back to level 3 when PCIe bandwidth has increased beyond BT and IIO

occupancy is again lower than IT .

69


3.5 Summary

hostCC is a congestion control architecture that handles host congestion, along

with network fabric congestion. hostCC achieves this using three key ideas—

generation of host congestion signals, a sub-RTT host-local congestion response

that uses host congestion signals to allocate host resources between network

and host-local traffic, and using host and network congestion signals to perform

network resource allocation at RTT granularity. We have realized hostCC within

the Linux network stack without any modifications in applications, host hard-

ware, and/or network hardware. hostCC evaluation on our testbed demon-

strates that it provides significant performance benefits under host congestion,

and interpolates well with existing congestion control protocols under network

fabric congestion.

70


CHAPTER 4

RETHINKING IO MEMORY PROTECTION

In this chapter we discuss the inefficiencies within the host network introduced

by enabling memory protection against untrusted NICs. We show that the exist-

ing memory protection mechanisms (falsely) assume that they need to give up

on their security guarantees in order to alleviate these inefficiencies and achieve

high application performance. We present design, implementation, and evalu-

ation of F&S, a simple modification to existing memory protection mechanisms

that enables strongest possible safety guarantees while near-completely elimi-

nating performance overheads due to memory protection.

4.1 Overview

Memory protection—ensuring that physical memory allocated to an applica-

tion cannot be accessed and/or modified by another application—is a classical

problem in operating systems and computer architecture. Memory manage-

ment units (MMUs), since their inception in the 1970s, have enabled applica-

tions running on processors to achieve memory protection. However, a buggy

or malicious network interface card (NIC) could still execute errant transfers

into the host memory. This led to the introduction of Input-Output Memory

Management Unit (IOMMU) in the mid-2000s. An IOMMU operates very sim-

ilar to MMU; we describe the IOMMU-enabled datapath from the NIC to the

application in Chapter 2, but briefly: with IOMMU enabled, the operating sys-

tem assigns NICs so-called IO virtual addresses (IOVAs), that are used by the

NIC to initiate direct memory access (DMA) to the host memory; for every DMA

71


initiated by the NIC, the IOMMU uses a host memory resident IO page table to

translate NIC-visible IOVA to host physical address that is used for the final

data transfer. These address translations are sped up using a special IOTLB

hardware (that caches frequently used IO page table entries). Upon an IOTLB

hit, the address translation incurs near-zero cost; however, upon an IOTLB miss,

an IO page table walk is performed to perform the address translation.

As already illustrated in Chapter 2, enabling memory protection against un-

trusted peripherals using IOMMU can have significant impact on application-

layer performance. Consider a server with 100Gbps NICs and 128Gbps PCIe. To

provide the strongest safety properties (referred to as strict mode in modern op-

erating systems), each IOVA must be unmapped, and the corresponding IOTLB

entry invalidated immediately after the DMA; thus, one must incur at least one

IOTLB miss every (huge)page worth of data transfer and perform 4 sequential

memory accesses per miss for IO page table access in the worst case. Even with

an optimistic estimate of 100ns memory access latency, this incurs ∼4× 100ns of

IOMMU overhead. Thus, with modern PCIe devices (that allow buffering up

to ∼100 cachelines at the root complex, the processor-side end of the PCIe de-

vice [105, 9]), enabling IOMMU in the strongest safety mode essentially pushes

PCIe to its limits—using Little’s Law [38], the maximum sustainable rate would

be 100 × 64bytes/400ns = 128Gbps. In practice, the situation is much worse:

memory access latencies are usually higher [35, 81, 36], PCIe has its own over-

heads [105], IOTLB can incur multiple misses during the huge(page) worth of

data transmission [17, 6], etc. Indeed, recent studies from large-scale production

clusters [6] have shown that inefficiencies within state-of-the-art memory pro-

tection mechanisms can result in significant throughput degradation, orders-of-

magnitude tail latency inflation, and violation of isolation guarantees.

72


Sadly, the above analysis is so fundamental that it has left the community

with a bleak outlook about memory protection—without rearchitecting mem-

ory management in operating systems and/or without significant improve-

ments in server hardware (e.g., larger buffers at root complex, lower memory

access latencies, etc.), it appears impossible to provide strong safety proper-

ties while maintaining high performance. As eloquently argued in [6], tech-

nology trends suggest that server hardware improvements along these direc-

tions are unlikely; thus, most recent work on high-performance memory pro-

tection targets weaker safety properties [108, 93, 45, 29, 23, 94], rearchitecting

memory management within the operating system [93, 94], and/or redesigning

IOMMU hardware [91, 79, 55, 17]. Unfortunately, such weaker safety proper-

ties and clean-slate modifications themselves introduce new security vulnera-

bilities [92, 93, 73, 13, 102]. This state-of-affair leaves organizations with two

equally expensive and painful options: (1) high-performance but unsafe sys-

tems; or, (2) safe systems but with suboptimal performance.

We explored the inefficiencies in modern memory protection mechanisms

in Chapter 2 (§2.2). While reproducing the overheads of memory protection

mechanisms were straightforward, our preliminary experiments during this ex-

ploration led to some baffling results: we found that, in many cases, the total

number of memory accesses for IO page table walks were less than 4× the num-

ber of IOTLB misses! Diving deeper into this inexplicable and seemingly impos-

sible phenomenon, we discovered an aspect of IOMMU hardware that has been

ignored in all previous studies: IO page table caches [60] (in the hindsight, this

should have been obvious, given that processor MMUs have had such caches for

a long time now [24]). Specifically, IOMMU has hardware caches that store most

frequently accessed IO page table entries from level1, level2, and level3 of the

73


IO page table (importantly, for reasons discussed in §2.1.3, level4 entries are not

cached in IO page table cache). Upon an IOTLB miss, IOMMU first checks the

cache for the corresponding IO page table level1 page—if a hit, the correspond-

ing level2 page can be identified without an IO page table walk (thus, avoiding

one memory access), and IOMMU now checks the cache for the correspond-

ing IO page table level2 page, and so on. Indeed, in the worst-case scenario,

IOMMU still needs to perform 4 memory accesses per IOTLB miss; however, in

the best-case scenario, the IO page table cache enables IOMMU to perform just

1 memory access per IOTLB miss (for the corresponding level4 entry)!

We present F&S, a simple modification to existing memory protection mech-

anisms that breaks the aforementioned impasse: it provides strongest safety

properties, and yet, near-completely eliminates their overheads. The key insight

in F&S design is that, while IOTLB misses are unavoidable with the strongest

safety properties, it is still possible to reduce the cost of address translation upon an

IOTLB miss using IO page table caches within the IOMMU hardware. Indeed,

we demonstrate that, by carefully allocating IOVAs (to maximize IO page table

cache hit rate) and by carefully managing IO page table cache invalidations (to

ensure safety), F&S is able to provide strongest safety properties with minimal

overheads of memory protection. For instance, across all evaluated workloads,

we find that F&S techniques result in 0 level1 misses, 0 level2 misses, and at

most 0.054 level3 misses per (huge)page worth of data; thus, F&S requires at

most 1+ 2× 0.054 = 1.108 memory accesses per (huge)page worth of data across

all evaluated workloads. Using the analysis earlier, this allows F&S to already

achieve a sustainable rate of 100×64bytes/1.108×100ns = 462Gbps throughput,

far higher than the current and the next generation PCIe. Finally, F&S provides

such powerful results without any modifications in host hardware and without

74


any modifications in operating systems memory management; it needs merely

∼300 lines of code changes within the IOMMU and network drivers (no changes

necessary in memory management).

4.2 F&S

In this section, we present the design of F&S, our Fast & Safe modification to

the existing memory protection mechanism within Linux. F&S rethinks the ap-

proach to perform IOMMU address translation while guaranteeing high appli-

cation performance, by focusing on reducing the cost of each IOTLB miss, rather

than (solely on) reducing the number of IOTLB misses. Our key insight is that

while we cannot reduce the IOTLB miss rate beyond a single miss per page (nec-

essary to enforce the strict IOMMU protection model [101, 17]), we can reduce

the cost of each miss by improving the IOMMU page table cache performance.

As demonstrated in §2.2, existing IOVA allocation mechanisms in Linux typ-

ically result in poor IOVA locality, and hence incurring a large number of IO

page table cache misses. This leads to a larger number of memory accesses for

performing address translation upon each IOTLB miss, resulting in a higher la-

tency cost per IOTLB miss and hence, degraded application throughput and

latencies. Even for few evaluated scenarios with improved IOVA locality, we

observe higher-than-expected latency costs for each IOTLB miss and degraded

application performance. This is due to IOVA invalidations: as discussed in

§2.2, invalidating an IOVA entry leads to deletion of corresponding entries not

only from the IOTLB but also from the IO page table (to safely handle potential

page table page reclamations, discussed in more details later in this section),

75


thus preventing the IO page table caches from re-using their entries for yet-to-

be-invalidated IOVAs. Moreover, in addition to a higher latency-cost for each

IOTLB miss, invalidations also lead to a higher CPU-cost per IOTLB miss. Upon

each invalidation, the CPU core waits until successful invalidation of IOTLB

and/or IO page table cache entries by the IOMMU hardware [108]. Therefore,

larger invalidation requests also result in a degradation of per-core application

throughput.

F&S resolves the aforementioned challenges using three key design ideas in

order to minimize the latency and CPU cost for each IOTLB miss.

1 Contiguous IOVA allocation for improved locality. Rather than perform-

ing allocations at the default per-page granularity, F&S allocates a single and

contiguous IOVA for all the pages within a descriptor (e.g., descriptors used

by recent Mellanox drivers consist of 64 pages per descriptor). We utilize

the context of IOVA allocation by networking applications in order to lever-

age locality benefits using contiguous address allocation. Unlike the context

of conventional Linux memory management (where the applications them-

selves dictate the access pattern for virtual addresses), the IOVA access pat-

tern for networking applications is completely predictable: IOVAs within

a single descriptor are accessed in a sequential manner. Thus contiguous

IOVA allocation by F&S ensures improved IOVA locality.

By changing the IOVA allocation granularity, F&S no longer relies on the

IOVA allocator’s per-core caches as the only means to achieve good lo-

cality. We already demonstrated in §2.2 that per-core caches typically fail

to provide good IOVA locality due to (i) multiple CPU cores contending

for globally-shared red-black tree when IOVA working set size exceeds the

76


per-core cache size and (ii) interference of IOVA allocation/deallocation for

ACK packets with payload packets for each CPU core (even when the IOVA

working set size is smaller than the per-core cache size).

2 Selectively invalidating IOTLBs to retain benefits of IOVA locality. Upon

each IOVA unmap operation, F&S only invalidates the corresponding

IOTLB entry while preserving the IOMMU L1/L2/L3 cache entries. This

simple change allows the IOMMU page table caches to take advantage of

IOVA locality by allowing them to re-use the cached entries after IOVA in-

validations.

To ensure safety, IO page table caches require invalidations if any corre-

sponding IO page table pages were reclaimed after unmap operation(s). To-

day, the Linux IOMMU driver takes a safer (but potentially suboptimal) ap-

proach of always invalidating the IO page table caches whenever an IOVA

is unmapped. However, our key insight is that L1/L2/L3 page reclamation

will never happen in the context of networking applications. Linux today

only reclaims a page table page if there is a single operation unmapping the

entire address range that it covers. To make this concrete, in order to reclaim

an L4 page (thus invalidating the L3 entry that points to it), the entire 2MB

range of IOVAs on that page must be unmapped in a single operation. To

invalidate an L2 entry and reclaim an L3 page, the entire 1GB range of IO-

VAs pointed to by that L2 entry must be unmapped in a single operation. In

the networking context, the maximum size of an unmap operation in F&S is

64 IOVAs: this could never cause a reclamantion of an IO page table page1.

Thus, an IOMMU page table cache entry will never become stale2.

1Page table page reclamation is likely to happen when the IOMMU is used for virtualization.
In this context, large parts of the address space are mapped and unmapped at once.

2If there is a context where the OS must change the IO page table structure (such as a driver

77


3 Batched invalidations to minimize CPU cost for IOTLB misses. The final

design idea of F&S is to leverage the contiguous IOVA allocation to invali-

date the entire IOVA range mapped to a descriptor with only a single inval-

idation request to the IOMMU. Such a batched invalidation amortizes the

CPU cost of a single invalidation request across the number of pages within

a single descriptor (e.g., 64 pages per descriptor for Mellanox NICs).

4.3 F&S Implementation

We implement F&S within the Linux kernel v6.0.3. We currently support recent

Mellanox NICs (CX-5 [97] onwards), which operate with 64-page descriptors

by default. F&S implementation uses ∼300 LOC, requires no hardware mod-

ifications, no interface changes to the IOVA allocator and the IOMMU driver,

and minimal changes to the network driver. We next discuss some of the most

interesting implementation details for each of the three key F&S design ideas.

Contiguous IOVA allocation. Figure 4.1a shows the default IOVA allocation

mechanism in Linux. In order to DMA map a descriptor, for each page, the OS

allocates a 4KB IOVA and then makes a call to the IO page table map function,

specifying the starting address of the IOVA, the starting address of the page,

and a 4KB length. In the same order that physical pages are accessed in the de-

scriptor, the OS maps each page to a contiguous page-sized chunk of the IOVA.

F&S, on the other hand, first allocates a contiguous descriptor-sized chunk of

the IOVA, as shown in Figure 4.1b. However, F&S does not change the mapping

that unmaps at >512 IOVA granularity), there exists a simple solution. At the time of the IOTLB
invalidation for that IOVA, check if any page table pages will be reclaimed, and if so, then
invalidate the page table caches. This simple additional conditional statement is unlikely to add
any significant overhead.

78


granularity—each physical page in the descriptor is still mapped to a contigu-

ous page-sized segment of the IOVA range. This idea of allocating a large IOVA

but mapping individual physical pages has two benefits: (1) it does not require

changes to the underlying hardware, and (2) it ensures good locality within a

descriptor: in the worst case, a single descriptor may experience two L3 page

table misses.3 This corresponds to .031 L3 misses per page-sized packet DMAed

in the worst case.

Selectively invalidating IOTLBs. As discussed in 2.1.3, IOMMU provides an

invalidation queue as the interface to the IOMMU driver to request IOVA in-

validations. The invalidation queue interface specifies an IOVA address, the

size of the IOVA, and importantly, an option to only invalidate the IOTLB entry

while preserving the IO page table cache entries. This option is not enabled in Linux

by default, as it could potentially lead to stale page table cache entries if the

page table structure changes (i.e., an L4 page table page is reclaimed, and so the

page’s corresponding L3 entry is updated). To preserve IOMMU caches, F&S

uses the existing invalidation queue interface and sets the specific bit to prevent

the invalidation of associated page table caches when flushing IOTLB entries.

Batched invalidations. F&S issues a single request to the invalidation queue

upon a DMA unmap call, specifying the contiguous descriptor-sized IOVA

chunk for invalidation. Note that, however, without the use of multi-page de-

scriptors, F&S cannot benefit from batching invalidations since each descriptor

must be invalidated immediately after DMA completes.

3One for the beginning of the IOVA range, mapped to the first page in the descriptor (I in
Figure 4.1b), and another one if the IOVA range spans multiple L4 pages; for example, if ‘I‘ is
the last mapping in an L4 page, then ‘I‘+4096 is on another page, which has a different L3 entry.

79


OS

IOVA
allocator

alloc (4KB)
4KB

Rx 
Descriptor

(P1,P2,...,Pn)

I Map()

length = 12 bits 
(4KB)

P1

IO Page Table

[ I    P]
4KB

(a) Linux IOVA allocation and IO page table mapping.

OS

IOVA
allocator

alloc (256KB)

256KB

I

Map()P1

I + 4096 I + 2* 4096

Map()P2
length = 12 bits 

(4KB)

length = 12 bits 
(4KB)

Rx 
Descriptor

(P1,P2,...,Pn)

Map()P3

length = 12 bits 
(4KB)

IO Page Table

[ I    P]4KB

[ I        ]P2
+ 4096

4KB

[          ]I        P3+ 2* 4096 4KB

(b) F&S IOVA allocation and IO page table mapping.

Figure 4.1: F&S achieves contiguous IOVA mappings without requiring changes
to IOMMU hardware, IOVA allocator, and IOMMU driver. (a) This process re-
peats for each page in the Rx descriptor, requiring calls to both the IOVA allo-
cator and IO page table map. (b) A single 256KB IOVA is allocated and then
mapped in page-size chunks to the physical pages in the Rx descriptor. F&S
requires only one call to the IOVA allocator and the same number of calls to the
IO page table map.

4.4 F&S Evaluation

In this section, we demonstrate that F&S near-completely eliminates memory

protection overheads demonstrated in §2.2. We use the same testbed setup as

described in §2.2. Our evaluation demonstrates that:

80


• With IOMMU enabled, F&S allows Linux to consistently match the appli-

cation throughput and latency performance as with the IOMMU disabled

scenario. F&S achieves its benefits even with the strictest memory protec-

tion model, while incurring minimal CPU overheads, across a wide vareity

of network and host parameters (such as number of flows, ring buffer sizes,

MTU sizes, core counts, and with DDIO enabled or disabled).

• F&S alleviates memory protection overheads by reducing the latency and

CPU cost of IOTLB misses. By reducing the memory protection overheads,

F&S also improves packet drop rates, thereby generating fewer ACKs, which

in turn also improves IOTLB hit rates and help in further reducing the over-

heads.

• In the presence of memory bandwidth contention, by integrating with

hostCC (discussed in Chapter 3), F&S improves application performance by

resolving both memory bandwidth and IOMMU contention.

4.4.1 F&S Benefits

F&S eliminates memory protection overheads by reducing the cost of each

IOTLB miss (while providing strictest safety properties). Figure 4.2 shows

the F&S results for increasing number of flows (same scenario as in Figure

2.8). F&S matches IOMMU-disabled performance by resolving the inflated

L1/L2/L3 cache miss rates, resulting in >10x reduction in L3 misses per de-

scriptor in all cases and bringing L1/L2 cache misses to 0. F&S also alleviates

the IOTLB contention, resulting in almost 2x fewer IOTLB misses in the 40 flow

case. Similar performance improvements are also observed across varied ring

81


5 10 20 40
Number of Active Flows

0
20
40
60
80

100

Ne
tw

ork
 T

hro
ug

hp
ut 

 (G
bp

s)

IOMMU Off
IOMMU On

IOMMU On + F&S

(a) Network throughput

5 10 20 40
Number of Flows

0

2

4

Pa
ck

et 
Dr

op
 

 R
ate

 (%
)

IOMMU Off
IOMMU On

IOMMU On + F&S

(b) Packet drop rate

5 10 20 40
Number of Active Flows

0x

1x

2x

3x

Re
du

cti
on

 in
 

 IO
TL

B 
M

iss
es

(c) Reduction in IOTLB misses

5 10 20 40
Number of Active Flows

0x
5x

10x
15x
20x
25x

Re
du

cti
on

 in
 

 IO
M

M
U 

Ca
ch

e M
iss

es L1/L2 L3 ACKs

0x

5x

Re
du

cti
on

 in
 

 N
um

be
r o

f A
CK

s

(d) Reduction in L1/L2/L3 misses

0 100 200 300 400 500
Sample Sequence of 

IOVA Accesses

0
2
4
6
8

10
12
14

Di
st

in
ct

 L
3s

 A
cc

es
se

d 
Be

fo
re

 R
ep

ea
tin

g

40 flows 5 flows

(e) IOVA access pattern

Figure 4.2: F&S eliminates memory protection overheads by reducing cost of
each IOTLB miss. (a) F&S enables IOMMU on scenario to match the perfor-
mance of IOMMU off, even as the number of flows increases. (b) By addressing
the IOMMU contention, F&S significantly lowers packet drop rates caused by
queueing at the NIC buffers. (c) F&S reduces the number of IOTLB misses per
descriptor up to 2x by minimizing packet drops, hence decreasing the number
of transmitted ACKs. (d) F&S eliminates L1/L2 misses and substantially re-
duces L3 misses by preserving IO page table caches during invalidation and
allocating contiguous IOVAs. (e) F&S enhances the locality and thus reduces
the number of L3 cache misses. See §4.4.1 for more discussion.

buffer size (Figure 4.3), MTU size and CPU core count (Figure 4.4), and DDIO

enabled (Figure 4.5) cases.

Understanding F&S benefits. Let us focus on understanding the performance

improvements with increasing number of flows (Figure 4.2), but similar insights

also apply to other cases (such as with varying ring buffer sizes, in Figure 4.3).

82


256 512 1024 2048
Ring Buffer Size

0
20
40
60
80

100

Ne
tw

ork
 T

hro
ug

hp
ut 

 (G
bp

s)

IOMMU Off
IOMMU On

IOMMU On + F&S

(a) Network throughput

256 512 1024 2048
Ring Buffer Size

0.0

0.5

1.0

Pa
ck

et 
Dr

op
 

 R
ate

 (%
)

IOMMU Off
IOMMU On

IOMMU On + F&S

(b) Packet drop rate

256 512 1024 2048
Ring Buffer Size

0x

1x

2x

3x

Re
du

cti
on

 in
 

 IO
TL

B 
M

iss
es

(c) Reduction in IOTLB misses

256 512 1024 2048
Ring Buffer Size

0x
5x

10x
15x
20x
25x

Re
du

cti
on

 in
 

 IO
M

M
U 

Ca
ch

e M
iss

es L1/L2 L3 ACKs

0x

5x

Re
du

cti
on

 in
 

 N
um

be
r o

f A
CK

s

(d) Reduction in L1/L2/L3 misses

0 100 200 300 400 500
Sample Sequence of 

IOVA Accesses

0
2
4
6
8

10
12
14

Di
st

in
ct

 L
3s

 A
cc

es
se

d 
Be

fo
re

 R
ep

ea
tin

g

2048rb 256rb

(e) IOVA access pattern

Figure 4.3: F&S maintains good locality, resulting in low L3 cache misses even
as the IO working set sizes increase. With IOMMU enabled and increasing ring
buffer sizes, we observe that F&S achieves several key performance metrics: (a)
it matches the performance of systems with IOMMU disabled; (b) experiences
no packet drops; (c) maintains a consistent number of IOTLB misses; (d) in-
curs low IOMMU cache misses; and (e) demonstrates good locality in L3 page
caches. See §4.4.1 for more discussion.

As discussed in §2.2, without F&S, the IOTLB misses increase with the number

of flows due to a higher drop rate and thus more ACKs. Figure 4.2 shows that by

reducing memory protection overheads, F&S improves the drop rate, leading

to a reduction in the number of ACKs. Consequently, this decrease in ACK

interference results in fewer IOTLB and IO page table cache misses.

To minimize page table cache misses, F&S preserves page table caches dur-

ing IOTLB invalidations, enabling the IOMMU to fully benefit from enhanced

83


5 6 7 8
Cores

0
20
40
60
80

100

Ne
tw

ork
 T

hro
ug

hp
ut 

 (G
bp

s)

IOMMU Off
IOMMU On

IOMMU On + F&S

(a) Varying number of cores.

1500B 4000B 9000B
MTU Size

0
20
40
60
80

100

Ne
tw

ork
 T

hro
ug

hp
ut 

 (G
bp

s)

IOMMU Off
IOMMU On

IOMMU On + F&S

(b) Varying MTU sizes

Figure 4.4: F&S enables IOMMU on to match the performance of IOMMU off
with varied number of cores and MTU sizes. See §4.4.1 for more discussion.

5 10 20 40
Flows

0
20
40
60
80

100

Th
rou

gh
pu

t (
Gb

ps
) IOMMU Off IOMMU On IOMMU On + F&S

Figure 4.5: F&S resolves IOMMU contention with DDIO on for varying number
of flows. Moreover, by improving CPU utilization, F&S is able to saturate the
link bandwidth. See §4.4.1 for more discussion.

locality, as illustrated in Figure 4.2(b). This approach improves the L3 cache

miss rate, and eliminates the L1/L2 miss rates. With preservation of page ta-

ble caches, even for very large working sets, L1 and L2 caches will observe no

misses; typically, the working set would need to be 8 million pages to see L2

misses and over 4 billion pages to see L1 misses. Today, even with configura-

tions such as a 9k MTU, a 4k ring buffer size, and 128 cores, we expect to see

∼2.25 million active pages for Rx datapath4.

By using contiguous IOVA allocations at a descriptor-sized granularity F&S

eliminates the poor locality resulting from successive IOVA allocator calls at

page-sized granularity. This approach allows F&S to maintain good locality

4Furthermore, at network speeds of 1 Tbps and assuming a round trip time of 100us—which
is 10x higher than recent predictions[7]—the Bandwidth-Delay Product (BDP) for all flows cor-
responds to only about 3k active pages. Therefore, even with scaling network speeds, we do
not expect to see L1/L2 cache misses with F&S.

84


 0

 0.1

 0.2

 0.3

 0.4

 0.5

 0.6

app
 
sys
call

 
dat
a co
py

tra
nsp
ort
/IP p
roc

net
dev
ice 
sub
sys
tem

skb
 mg
mt

me
mo
ry a
lloc
/de
allo
c

loc
k/u
nlo
ck

sch
edu
ling
iom
mu etc

.

C
PU

 U
til
(%
)

256 ring buffer sizes
2048 ring buffer sizes

(a) Hardware Prefetching Enabled (Default)

 0

 0.1

 0.2

 0.3

 0.4

 0.5

 0.6

app
 
sys
call

 
dat
a co
py

tra
nsp
ort
/IP p
roc

net
dev
ice 
sub
sys
tem

skb
 mg
mt

me
mo
ry a
lloc
/de
allo
c

loc
k/u
nlo
ck

sch
edu
ling
iom
mu etc

.

C
PU

 U
til
(%
)

256 ring buffer sizes
2048 ring buffer sizes

(b) Hardware Prefetching Disabled

Figure 4.6: The CPU breakdown results indicate that data copy overheads in-
crease with the number of ring buffer sizes due to ineffective hardware prefetch-
ing. Results shown for (a) hardware prefetching enabled, and (b) hardware
prefetching disabled respectively.

even as IO working set sizes increase, such as with larger ring buffer sizes. As

discussed peviously in §4.3, this ensures that every descriptor will result in al-

location of at most two unique L3 entries in the worst-case. For majority cases

however, all IOVAs within a descriptor will share the same L3 entry. This is ev-

ident from the Figure 4.2(e): the spikes in the figure occur at descriptor bound-

aries. Since IOVAs are allocated at descriptor granularity, there is no guarantee

the subsequent descriptor’s IOVAs will have the same L3 entry. However, the

spikes are small enough to likely not cause contention in the L3 cache.

85


104
105
106

IOMMU Off IOMMU On IOMMU On and F&S

128B 512B 2KB 8KB 32KB
RPC Size

50
100
150
200
250
300

La
ten

cy
 (

s)

Figure 4.7: F&S brings DCTCP latency closer to IOMMU off case. See §4.4.1 for
more discussion.

0x 1x 2x 3x
Degree of Memory Contention

0
20
40
60
80

100

Ne
tw

ork
 T

hro
ug

hp
ut 

 (G
bp

s)

Linux Linux + hostCC

(a) IOMMU off.

0x 1x 2x 3x
Degree of Memory Contention

0
20
40
60
80

100

Ne
tw

ork
 T

hro
ug

hp
ut 

 (G
bp

s)

Linux
Linux + hostCC

Linux + hostCC + F&S

(b) IOMMU on.

Figure 4.8: hostCC and F&S work together to resolve both memory bandwidth
and IOMMU contention. See §4.4.1 for more discussion.

A quick note about throughput degradation (with memory protection) in Fig-

ure 4.3. Despite F&S resulting in similar IOMMU cache performance for both

256 and 2048 ring buffers, we observe a decline in throughput with larger ring

buffer sizes due to an increased CPU cache miss rate (increased by 1.47x), in-

dependent of memory protection overheads. CPU profiling reveals that an in-

crease in cache miss rates leads to greater overhead from data copying (Fig-

ure 4.6), which degrades throughput as the CPU becomes the bottleneck. The

rise in cache miss rates is associated with a decrease in the efficiency of CPU

cache hardware prefetching for larger ring buffer sizes. Notably, disabling hard-

ware prefetching eliminates the CPU cache miss rate difference and narrows the

throughput gap between small and large ring buffer sizes, despite the observed

increase in cache miss rates and decrease in throughput for both scenarios. We

86


leave the investigation of the underlying reasons for large ring buffers leading

to inefficient hardware prefetching for future work.

F&S improves tail latency for latency-sensitive applications. Using the same

setup as Figure 2.9, F&S improves the tail latency of the L-appwhen IOMMU is

enabled and leads to minimal latency inflation compared to the scenarios where

the IOMMU is disabled. This improvement is primarily due to reduced queue-

ing delay at the NIC buffer, and reduced retransmission delay due to lower rate

of packet drops.

F&S performance in presence of memory contention. Figure 4.8 shows how

F&S and hostCC integrate with each other to resolve both IOMMU and mem-

ory bandwidth contention. We introduce memory bandwidth contention in an

identical manner as in §2.2, by increasing the number of M-app cores. Figure

4.8a shows that with memory contention only, hostCC is able to resolve mem-

ory bandwidth contention and reach the target application-level throughput of

80Gbps, independent of the degree of memory bandwidth contention. How-

ever, Figure 4.8b shows that, with IOMMU on, without F&S, hostCC fails to

achieve the target rate because of increased PCIe latency induced by address

translation overheads. By reducing these overheads, F&S with hostCC achieves

throughput close to the target.

4.4.2 Necessity of each design idea in F&S

Figure 4.9 shows the importance of each design idea in F&S. In particular,

we plot the throughput for 3 configurations varying ring buffer sizes: default

87


256 512 1024 2048
Ring Buffer Size

0

50

100

Ne
tw

ork
 T

hro
ug

hp
ut 

 (G
bp

s)

Default
Default + Preserve

F&S

256 512 1024 2048
Ring Buffer Size

0.0

0.5

1.0

1.5

2.0

M
iss

es 
pe

r P
ag

e

IOTLB
L1

L2
L3

Figure 4.9: Contribution of each of the F&S key design ideas. As ring buffer
size increases we observe: (a) the throughput drops for the case of “default
+ preserve” (b) as ring buffer size increases, despite preserving the L3 cache,
we still see misses due to the larger IO working set size. See §4.4.2 for more
discussion.

Linux, Linux but with IOMMU page table caches preserved, and Linux but the

IOMMU page tables are preserved and IOVAs allocated contiguously, namely

F&S. As illustrated in Figure 4.9 (a), preserving page table caches is sufficient

for good performance with small ring buffer sizes. However, as the ring buffer

sizes increase, leading to a larger IO working set size, performance degrades,

highlighting the necessity of allocating IOVAs contiguously.

Figure 4.9 (b) shows the cache performance for the case where the IOMMU

page table caches are preserved. For small ring buffers with good locality, pre-

serving the page table caches removes all L1/L2/L3 page table cache misses

originating from invalidations (as described in §4.4.1), leading to minimal trans-

lation overheads. However, as the locality gets worse, with increasing ring

buffer sizes, the number of L3 page table cache misses increases, leading to a

drop in throughput. As evident from Figure 4.3, upon improving the IOVA lo-

cality with F&S, the L3 page table cache misses reduce, removing the source of

address translation overhead.

88


4.5 Summary

In this chapter, we presented F&S, a mechanism to provide memory protection

against buggy and/or untrusted NICs while achieving near-optimal application

performance, under strictest possible memory protection guarantees. Classical

literature on reducing memory protection overheads have thus far focused pri-

marily on reducing the number of IOTLB misses. We however discuss, and

provide evidence that providing memory protection in modern Linux under

strict protection model necessitates incurring at least a single IOTLB miss (per

page-worth data). We then show that although it is hard to reduce the number

of IOTLB misses, we can reduce the cost of each miss by improving the IO page

table cache performance. F&S leverages this insight and reduces the latency and

CPU cost for IOTLB misses using its three key ideas: allocating IOVA contigu-

ously, selectively invalidating IOTLBs, and batching invalidations. Our evalu-

ations demonstrate that Linux achieves performance close to IOMMU-disabled

scenario using F&S, under the IOMMU strict protection model.

89


CHAPTER 5

REARCHITECTING STORAGE STACKS

In this chapter, we discuss the assumptions in the design of existing storage

stacks that break under host congestion. We then present the design, imple-

mentation, and evaluation of storageCC, a new architecture for storage stacks

that enable near-optimal application performance independent of the presence

and amount of host congestion within the storage datapath.

5.1 Overview

Computer systems, especially those that employ queues, have an ideal operat-

ing point in terms of parallelism—a “knee” point; operating the system below

the knee point results in sub-optimal throughput, and operating the system be-

yond the knee point results in inflated latency but with marginal or no gain in

throughput. Finding the right amount of parallelism for a system therefore, of-

ten requires intricate understanding of the underlying software and hardware

architecture, the workload under which the system is expected to operate, and

how various tunable parameters define systems performance.

Identifying the right parallelism is particularly challenging for storage ap-

plications running atop modern non-volatile memory express (NVMe) attached

solid state drives (SSDs). These devices can support low latency (tens of mi-

croseconds) and high throughput (tens of gigabytes per second); however, iden-

tifying the knee point turns out to be extremely challenging: it depends on

device characteristics (type of flash, number of channels, bus frequency, de-

90


vice controller processing capabilities, etc.), software characteristics (data lay-

out, driver software architecture, garbage collection processing, IO scheduler,

etc.), and workloads (read/write ratios, random/sequential request patterns,

write amplification, etc.). Unsurprisingly, there is a long and active body of re-

search in rethinking storage hardware [85, 133, 83, 136, 34, 37, 69, 56, 71]; inter-

faces [74, 75, 138, 27, 123]; and IO schedulers providing fairness (from classical

disk-based storage systems [53, 51, 52, 2, 3] to high-speed storage [40, 90]), QoS

guarantees [5, 54, 119], and isolation [57, 124, 76, 4]. The overarching goal of

most of these systems has been to enable storage applications to operate at or

close to the knee point. The challenge, however, is that all these works assume

that performance bottlenecks are either at the storage device or the CPU. The

above assumption are no longer true in the presence of host congestion, leading

to bottlenecks within the storage datapath that are not the CPUs or the storage

devices.

We present storageCC, a new storage architecture that enables storage stacks

to operate at the knee point in the presence of host congestion. The key insight in

storageCC is that the end-to-end storage datapath is a special low-latency high-

throughput “lossless” network that comprises of end-points (CPU and storage

devices), switches (root complex, memory controller, caching and home agents,

etc.), and network links (peripheral interconnect, memory bus, processor in-

terconnect). Building upon this conceptual insight, storageCC extends existing

storage stacks to incorporate a storage congestion control mechanism that enables

the storage stack to dynamically adapt the number of in-flight requests to the

underlying storage device for each individual application, based on available

capacity along the storage datapath. Similar to transport designs in computer

networks, storageCC ensures that the number of in-flight requests is the min-

91


imum of a flow control window size (that accounts for bottlenecks at the end-

points, and is decided using any of the techniques from prior work) and the

congestion control window size (that accounts for bottlenecks within the storage

datapath, and is decided using storageCC’s storage congestion control mecha-

nism). storageCC architecture thus enables applications to operate at the knee

point corresponding to bottlenecks anywhere within the end-to-end storage dat-

apath, even when the knee point shifts over time due to dynamically changing

host network bottlenecks.

The storage congestion control mechanism used in storageCC exploits the

many properties of the storage datapath. On the one hand, storageCC ex-

ploits the lossless nature of the storage datapath to design an extremely sim-

ple and CPU-efficient congestion control protocol: unlike computer networks,

storageCC does not need to deal with data drops, data corruption, data re-

transmission, hardware failures, or even out-of-order data delivery (e.g., due

to asynchronous IO interface). On the other hand, storageCC resolves unique

challenges in terms of identifying and detecting host network bottlenecks, and

converging to the right parallelism for storage datapath. For instance, for hop-

by-hop credit-based flow control mechanisms used by host hardware protocols

(§2.1.2), non-zero buffer occupancy is only a necessary but not a sufficient con-

dition for detecting host network bottlenecks; furthermore, the maximum avail-

able buffer sizes at hardware devices along the storage datapath are extremely

small. Put together, these two constraints leave very small room in buffer oc-

cupancy that corresponds to a host network contention scenario. storageCC re-

solves this challenge using a key insight derived from the unique characteristics

of the storage datapath: unlike computer networks, where congestion signals

(such as switch buffer occupancies) can be delayed by network-layer round trip

92


time (RTT) or even more, storageCC can access buffer occupancies within the

host network hardware with a delay that is a tiny fraction of end-to-end stor-

age datapath round trip time (by accessing hardware queue occupancy on an

on-demand basis). We will show that this enables a storage congestion control

mechanism that is able to quickly detect and react to host network bottlenecks.

We demonstrate that storageCC architecture, even with existing commod-

ity hardware and even with unmodified application interfaces, can be realized

with minimal overheads. To that end, we realize storageCC architecture within

the Linux kernel block layer and evaluate it across a wide variety of scenar-

ios involving memory interconnect bottlenecks. Our evaluation suggests that,

without any experiment-specific fine-tuning, storageCC enables storage appli-

cations to operate at or close to the experiment-specific knee point. These ben-

efits come at the cost of minor CPU overheads (less than 4.2% across all evalu-

ated scenarios) in case when bottlenecks are at CPUs/storage devices; however,

these overheads are purely due to limitations of existing hardware in terms of

providing efficient mechanisms to collect host hardware queue occupancies at

fine-grained timescales. We envision that, as the computer systems and com-

puter architecture communities build a deeper understanding of emerging host

network bottlenecks, future hardware architecture and operating systems in-

terfaces will evolve to enable even more efficient realization of the storageCC

architecture.

93


5.2 storageCC

This section introduces storageCC’s design. We begin with an overview of

storageCC (§5.2.1), followed by an explanation of how storageCC detects the

need to adjust parallelism in the presence of memory interconnect bottlenecks

(§5.2.2). Finally, we discuss how storageCC converges to the optimal level of

parallelism in such scenarios (§5.2.3).

5.2.1 storageCC Overview

storageCC extends the block layer in storage stacks to address emerging host

network bottlenecks within the storage datapath, in addition to traditional

CPU/SSD bottlenecks. At a high-level, it intercepts application IO requests,

delays the delivery of already generated requests to the underlying storage de-

vices, and additionally provides backpressure to the applications in a manner

that the in-flight requests never exceed the desired amount of parallelism. Im-

portantly, storageCC makes no changes to the application interface and ensures

compatibility for both synchronous and asynchronous IO. It enables storage

stacks to quickly detect and converge to the right amount of parallelism for

each application, depending upon the degree of contention within the storage

datapath. This capability allows applications to achieve high throughput, low

latencies, and alleviates the throughput starvation problem discussed in §2.3.

storageCC draws inspiration from congestion control (CC) mechanisms

(usually studied in networking literature [139, 88]) in order to detect and con-

verge to the desired parallelism. Conceptually, CC protocols are designed to

94


solve a similar problem: they maintain in-flight data to maximize the network

throughput while minimizing network latency. CC protocols achieve this ob-

jective by detecting in-network congestion using congestion signals (e.g., switch

buffer occupancies), and reacting to congestion via a congestion response (e.g.,

by maintaining a congestion window, and ensuring in-flight data never exceeds

this window).

Similar to computer networks, buffer occupancies within the storage

datapath—e.g., IIO buffer and memory controller read/write pending queues

(RPQs/WPQs)—act as congestion signals and help detect whether there is a

need to alter the amount of parallelism for a storage application (§5.2.2). Simi-

larly, ensuring that in-flight IO requests remain within a specified limit to avoid

performance degradation serves as a congestion response, enabling storage

stacks to converge to an ideal amount of parallelism (§5.2.3).

However, unlike computer networks, buffers within the storage datapath

are very small (few tens of cachelines as opposed to MBs), which presents a

unique challenge in achieving both high throughput and low latency for stor-

age applications. High throughput requires high utilization of IIO buffer, while

low latency requires prompt detection as the IIO buffer nears capacity to avoid

backpressure to the applications. storageCC addresses this challenge by lever-

aging its key insight derived from the unique properties of the storage datapath:

the queue build-up across the storage datapath buffers can be detected at much

finer-grained, sub-µs timescales, unlike computer networks (§5.2.2). Addition-

ally, storageCC leverages other advantageous properties of the storage datap-

ath, such as lossless data transfer guarantees, to adapt CC protocols for storage

stacks in a CPU-efficient manner (§5.2.3).

95


In addition to adjusting the parallelism of storage applications in the pres-

ence of memory interconnect bottlenecks, one other possibility is to throttle the

cores of M-apps using CPU schedulers or memory bandwidth allocation/QoS

tools [9, 15, 121]. storageCC is orthogonal to such a solution, in that, our goal is

to show that even in the worst case when we do not throttle the M-apps, we can

handle the memory interconnect bottlenecks within the storage datapath. Our

evaluation suggests that this really is the worst case scenario for us (§5.4). And

when used in conjunction with the solutions which can throttle M-apps, we will

still be able to converge to the right amount of parallelism.

5.2.2 Detecting host network bottlenecks

Given the conceptual similarity between the storage datapath and (lossless)

computer networks, storageCC leverages ideas from existing network CC

mechanisms, which employ buffer occupancy as congestion signals (like Ex-

plicit Congestion Notifications, or ECNs), to detect host network bottlenecks for

storage applications. In particular, storageCC generates ECNs based on buffer

occupancy at points of contention within the storage datapath (e.g., the IIO and

the memory controller). If buffer occupancy at these points is greater than a

fixed contention detection threshold (referred to as K), then an ECN is gener-

ated. Upon observing generated ECNs, storageCC senses the need to reduce the

parallelism in response to the contention within the datapath, and converges to

the right rate (using mechanism discussed in §5.2.3).

There is a well-known trade-off concerning K [14, 139, 88]. A value of K that

is too close to the buffer capacity would result in larger buffer occupancy due

to lesser available time to detect contention and converge to the right amount

96


of parallelism, increasing backpressure frequency and its associated problems

discussed in §2.3; on the other hand, a value of K too far from the buffer capacity

would result in unnecessary reduction in the parallelism, even in the absence of

contention, leading to underutilization of CPUs and/or SSDs.

Using ECNs in the context of storage datapath, therefore, presents a unique

challenge: the maximum available buffer sizes at intermediate hardware de-

vices within the storage datapath is very small—typically of the order of few

tens of cachelines (< 50 cachelines for memory controllers, and < 100 cachelines

for the IIO) (details in §5.3). Such small buffer sizes enforce a relatively large

marking threshold K: e.g., we observe on our testbed that the IIO occupancy is

close to 70% of the IIO buffer capacity when there is no congestion in the storage

datapath, and the SSDs are fully utilizing the PCIe bandwidth. This is because

host hardware protocols use a hop-by-hop credit-based flow control mechanism

(as discussed in §2.1); for such credit-based flow control mechanisms, non-zero

buffer occupancy is only a necessary but not a sufficient condition for conges-

tion. Specifically, due to credit-based flow control mechanism, the buffer oc-

cupancy is equal to the number of in-flight credits. Therefore, K must be set

to a value larger than at least 70% of maximum buffer capacity to ensure that

the buffer build-up is attributed to storage datapath contention. This leaves less

time for detecting and converging to the right parallelism level without encoun-

tering application backpressure.

storageCC leverages the unique advantages of the storage datapath, where

getting access to all buffer occupancies is much faster—storageCC can simply

query hardware counters for individual queues on an on-demand basis. For

instance, in §5.3, we show that existing commodity host hardware enables de-

97


tection of storage datapath congestion at a sub-µs scale (which is much smaller

than a typical SSD unloaded IO completion time of ∼100µs) with minimal CPU

overhead. Since most of the storage datapath latency is due to SSD access la-

tency, storageCC is able to generate ECNs within a tiny fraction of the end-to-

end storage datapath latency. Thus, storageCC is able to quickly detect and

converge to the right amount of parallelism, allowing it to use larger values of

K without unnecessarily delayed reactions. In addition, storageCC accesses oc-

cupancies for all hardware queues within the storage datapath locally. Thus,

ECNs can be generated entirely within the software, requiring no additional

hardware support.

We now detail the end-to-end mechanism to detect host network bottlenecks

in the storage datapath using storageCC. For each queue along the storage dat-

apath, storageCC specifies a queue-specific threshold that depends on its buffer

capacity and buffer occupancy under no memory interconnect contention. If the

buffer occupancy grows beyond this threshold, storageCC identifies that queue

to be observing contention. However, storageCC does not generate an ECN if

the request crosses only one queue observing contention: it generates an ECN if

and only if the following two conditions are satisfied: (1) one of the IIOs along

the storage datapath of the application observes contention; and (2) there is at

least one more queue upstream that observes contention. We describe the rea-

soning next. The first condition is for efficiency purposes—since the storage

datapath is lossless, buffer occupancy build-up at one of the queues will result

in backpressure all the way to the corresponding IIO; since storageCC has ex-

tremely fast congestion detection and reaction mechanism for detecting queue

build-up, we only declare storage datapath contention (and hence, the need to

shift parallelism) if the IIO for the application observes contention. The second

98


condition is more interesting. Since storageCC has local access to all hardware

queues within the storage datapath, the second condition enables storageCC

to identify the culprit applications, and help alleviate the throughput starva-

tion problem. The observation is that an application can be the source of con-

tention only if it encounter queue build-up at more than one point along the

path. As discussed earlier, due to hop-by-hop credit-based flow control mecha-

nism of the host network, build-up at the IIO buffer results from contention and

queue build-up at memory controller RPQs/WPQs. As contention arises within

the culprit datapath, these memory controller queues become exclusively filled

within the culprit application’s datapath. Pre-computing the storage datapath

for each application is also straightforward—determined by the application’s

target SSD location (PCIe slot on the host where the SSD is connected) and the

NUMA node where the application is running.

5.2.3 Converging to the right amount of parallelism

Upon detecting contention within the storage datapath, storageCC dynamically

converges to the right amount of parallelism for storage applications based on

the ECNs observed over time. To achieve this, storageCC adapts a classical con-

gestion control algorithm from datacenter networking—DCTCP1 [14]—for stor-

age stacks. Like DCTCP, storageCC uses an additive-increase multiplicative-

decrease mechanism (AIMD) to adapt the number of in-flight requests based

on amount of contention along the storage datapath, using an in-flight requests

window; this mechanism reduces the in-flight requests window upon detecting
1storageCC architecture is independent of the underlying CC protocol. We chose DCTCP

because it is open-sourced, supports ECN-based congestion signals, and has been widely de-
ployed in datacenter networks

99


contention using a multiplicative factor, while increasing in-flight requests win-

dow using an additive factor in absence of contention. At any given time, the

number of requests in-flight is limited to the minimum of the in-flight request

window and the IO depth set by the application (typically determined based on

benchmarking of CPU/SSD bottlenecks, as discussed in §2.3).

storageCC makes several observations in efficiently adapting DCTCP to stor-

age stacks. First, storageCC observes that the lossless nature of the underlying

storage datapath enables a simplified form of DCTCP protocol: storageCC does

not need to deal with data drops, data corruption, data retransmission, or even

hardware failures. Moreover, the inherent losslessness of the underlying net-

work and the precise semantics provided by the hardware DMA engine, in-

cluding the ordered delivery of data chunks within individual requests, obviate

the need for storageCC to handle out-of-order data delivery and/or acknowl-

edgement mechanisms2. Finally, since there are no packet drop and no time-

out events, the in-flight request window does not require to be reset to a mini-

mum value, eliminating the need for slow-start phase, and the need to maintain

the slow-start threshold. Overall, this makes the end-to-end protocol both ex-

tremely simple and CPU-efficient to implement.

The main algorithm in storageCC is to adapt to the right amount of par-

allelism, which is outlined in Algorithm 1. This algorithm operates on a per-

request basis, and takes three inputs: current time, whether the latest received

completion observed contention (i.e., an ECN was generated for the request),

and an estimate of the request completion time. The first parameter is straight-

forward; an ECN is generated for a request if the two conditions on queue oc-

2Note that individual request completions can still arrive out of order from the block layer, as
is already the case today with applications using asynchronous IO interface. These out-of-order
requests continue to be handled by the applications.

100


Algorithm 1 storageCC algorithm to adjust parallelism

Parameters:
K: contention detection threshold
gα: EWMA weight for α
gcomp time: EWMA weight for s comp time

Input:
cur time: current time
comp time: latest completion time sample
contended: whether contention detected during latest request completion?
(True/False)

Upon Receiving Request Completion
if (cur time - last update) > s comp time then

F ← cont req last comp/total req last comp
α← (1 − gα) × α + gα × F ▷ Update α
last update← cur time

cont req last comp← 0
total req last comp← 0

end if
total req last comp← total req last comp + 1
s comp time← (1 − gcomp time) × s comp time + gcomp time × comp time ▷ Update
smoothed completion time
if contended then

if cur time - last reduced > s comp time then
inf req wnd← inf req wnd × (1 − α/2) ▷MD
last reduced← cur time

end if
cont req last comp← cont req last comp + 1

else
inf req wnd← inf req wnd + (1/inf req wnd) ▷ AI

end if

cupancy discussed previously in §5.2.2 are met, whenever a request completion

arrives. The estimate for the request completion times is used to ensure that

number of requests in-flight are adjusted on a per round-trip-time basis within

the storage datapath; this ensures that the protocol gets enough time to stabilize

since, upon adjusting the number of in-flight requests, it can take up to a round-

trip-time worth of time for the effects to kick-in. Ensuring that we adjust the

number of in-flight requests on a per round-trip-time basis with an updated es-

101


timate of completion times is particularly crucial for storage applications due to

highly variable latencies of NVMe SSD devices (we observe latency variability

higher than even 50% in §5.4). storageCC computes an estimate of per-request

completion time by computing the time delta between the request completion

and its submission (both of which are readily available within the Linux kernel);

moreover, storageCC uses smoothed completion time which is an exponential

weighted moving average over the per-request completion time samples using

a parameter grtt.

Using the above three inputs, storageCC maintains the fraction of requests

F with observed contention, within the last completion time, and a weighted

moving average of F (using the variable α and a parameter gα). storageCC uses

α to compute the in-flight request window on a per-round-trip-time basis using

the same AIMD algorithm as in DCTCP: if the request corresponding to the

received completion is not contended, then storageCC increases the in-flight

request window using additive-increase such that the window increases by 1

every smoothed completion time; otherwise, storageCC reduces the in-flight

request window using multiplicative-decrease proportionally to the size of α.

AIMD is a popular mechanism ensures that the throughput of each application

thread sharing the same underlying bottleneck independently converges to its

max-min fair-share [14, 139]. We demonstrate that storageCC’s convergence to

fair-share for each application in §5.3.

102


Application

io_getevents()io_submit()

Filesystem

submit_bio()

IO Scheduler

blk_mq_submit_bio() bio_endio()
update_inf_req_wnd()

blk_mq_get_tag()
enforce_inf_req_wnd()

NVMe Driver

ctx

hctx

submission 
queue

completion 
queue

Block Layer

Figure 5.1: Datapath of an IO request within the Linux storage stack. Blue boxes
indicate points within the datapath where storageCC logic is implemented.
(§5.3)

5.3 storageCC Implementation

We have implemented storageCC within the block layer of Linux kernel 6.3. Our

implementation leverages as much of the existing storage stack infrastructure

as possible, making minimal modifications to the datapath. Our implementa-

tion currently uses ∼850 LOC. storageCC makes no interface changes for both

asynchronous (e.g., aio, io uring, etc) and synchronous (eg, read and write

syscalls) IO. In this section we first provide an overview of IO request datapath

within the kernel storage stack. We then outline and provide the rationale be-

hind the optimal modification points within Linux block layer for storageCC im-

plementation, followed by a discussion of some of most interesting storageCC

implementation details.

103


Life of an IO request in Linux storage stack. To maintain brevity, we focus

on the aio interface [25] for rest of this section. As shown in Figure 5.1, appli-

cations utilize the io submit() syscall to submit IO requests to the block layer

and use io getevents() to poll for the completion of these requests. Before sub-

mitting any request, an application must first establish an IO context (kioctx)

with io setup(). For each kioctx, the Virtual File System (VFS) creates a cor-

responding IO control block which in turn triggers the relevant VFS-specific

function to create a bio structure and initiate submit bio(), transitioning into

the block layer. Within the block layer, each bio is encapsulated into a request

and mapped to a hardware context (device-driver) queue. Without an IO sched-

uler (i.e., using none scheduler), the block layer attempts to directly queue the

request in the hardware queue before falling back to the software queue. Prior

to scheduling a request on the associated hardware queue, the block layer must

acquire a tag using blk mq get tag(), which uniquely identifies the request.

These tags are obtained from a bitmap reflecting the parallelism supported by

the hardware context queue (i.e., the hardware queue depth). If no tags are

available due to the hardware queue being full, the storage stack creates back-

pressure by making the request wait within the block layer. Once the device

processes the IO request and sends a completion response to the kernel. The

kernel uses the corresponding tag to identify and retrieve the relevant request,

release the tag, and invoke bio endio() to trigger the VFS-specific completion

handler. This handler enqueues the completion event for the relevant context

and sends a notification back to the application.

storageCC implementation within the block layer. storageCC’s architecture

incorporates two key hooks (shown in blue boxes in Figure 5.1): one in the IO re-

quest submission datapath to enforce the in-flight request window (inf req wnd

104


computed via Algorithm 1), and another in the request completion datapath to

update the in-flight request window. In the submission datapath, submit bio()

serves as the ideal hook, since it is the entry point for any IO request into the

block layer. We implement storageCC’s enforce inf wnd() logic within the

blk mq get tag() function (part of the submit bio() callstack) to leverage the

kernel storage stack’s built-in backpressure mechanism to throttle the applica-

tion’s in-flight requests (the application thread issuing the requests goes to sleep

for a fixed time interval, and retries getting a tag). In the completion datapath,

bio endio() serves as the ideal hook, acting as the exit point for an IO request

from the block layer to the application layer. Consequently, we implement stor-

ageCC’s update inf wnd() logic within this function.

storageCC’s enforce inf wnd implementation. storageCC manages an in-

flight request window (inf req wnd), and tracks the amount of in-flight requests

(infl req), on a per-application-thread basis. A request is considered in-flight

from the moment it successfully receives a tag, and until the storage stack re-

ceives a completion from the device driver. For each request submission, stor-

ageCC checks if the amount of in-flight requests is less than the in-flight request

window. If this condition is met, the request proceeds with unmodified stor-

age stack for remainder of the request submission datapath. Otherwise, the

request is temporarily queued in a pending req buffer; this is an optimiza-

tion that allows storageCC to more efficiently handle any transient contention

within the storage datapath. storageCC processes and dequeues requests from

the pending req buffer in a FIFO manner, when allowed by the in-flight request

window. During periods of persistent contention (when the buffer is full and no

more requests can be queued), storageCC leverages the built-in back pressure

mechanism of the block layer by denying the tag to the request and throttles

105


0 250 500
Time (µs)

0.0

0.2

0.4

0.6

0.8

1.0

No
rm

al
ize

d 
IIO

 O
cc

up
an

cy
(a) IIO Occupancy

(No Contention)

0 250 500
Time (µs)

0.0

0.2

0.4

0.6

0.8

1.0

No
rm

al
ize

d 
IIO

 O
cc

up
an

cy

(b) IIO Occupancy
(With Contention)

0 250 500
Time (µs)

0.0

0.2

0.4

0.6

0.8

1.0

No
rm

al
ize

d 
W

PQ
 O

cc
up

an
cy

(c) WPQ Occupancy
(No Contention)

0 250 500
Time (µs)

0.0

0.2

0.4

0.6

0.8

1.0

No
rm

al
ize

d 
W

PQ
 O

cc
up

an
cy

(d) WPQ Occupancy
(With Contention)

Figure 5.2: Effectively detecting storage datapath contention using storageCC:
when there is no contention, IIO occupancy is 70% of the maximum value, and
reaches close to maximum in the presence of memory bandwidth bottlenecks.
Memory controller queue occupancy (figure shows fraction of time WPQ occu-
pancy is full) also increases only when memory bandwidth is contented.

the application’s in-flight requests.

storageCC’s update inf req wnd implementation. storageCC updates the

inf req wnd upon each request completion using Algorithm 1. storageCC

incorporates a minor modification to struct bio by adding two additional

fields comp time and contended, to track completion times and whether con-

tention was detected, on a per-request basis. storageCC also creates a hash-

map to store per-application-thread variables and parameters: the key be-

ing the pid of the thread using storageCC, and the value being a struct

comprising the required per-application-thread state to execute Algorithm 1

(inf req wnd, alpha, cont req last comp, total req last comp, last update

and s comp time). Additionally, upon updating the inf req wnd, storageCC

evaluates whether a request from the pending req buffer is now eligible to sub-

mitted to block device (by receiving a tag). If yes, storageCC dequeues the re-

106


quest, which retries getting a tag using blk mq get tag(). This is done asyn-

chronously by constructing the worker thread to incur minimal overhead for

the completion datapath.

Detecting contention in storage datapath. storageCC marks the presence of

contention within the storage datapath by setting the contended field (used in

Algorithm 1). This field is set true if the two conditions discussed in §5.2.3

are met; evaluating these conditions requires measuring the occupancies of the

IIO buffer and the memory controller RPQ/WPQs, and checking whether mea-

sured values are larger than the fixed respective thresholds for each queue. stor-

ageCC uses similar methodology as [9] for measuring the IIO buffer occupancy:

it samples a hardware counter exposed using model specific registers (MSRs)

that maintains the cumulative value of IIO occupancy, incremented every IIO

clock frequency (500MHz for our Intel Cascadelake machines). Queue occu-

pancies for RPQ/WPQs is measured similar to IIO buffer, the only difference

being that the hardware counter exposing memory controller occupancies is lo-

cated in PCI config space [61]. 3 Similar to [9] we use a dedicated CPU core

polling these counters to measure them at sub-µs granularity; and also use an

exponentially weighted moving averaging (EWMA) for measured occupancies.

Figure 5.2 shows the measured queue occupancy values over 500µs horizon,

with and without the presence of storage datapath contention. We use the same

baseline setup as in §2.3. The figure shows that measured values are indeed ef-

fective in both detecting the presence of contention at both the IIO and memory

3To capture the time instants when the contention at WPQ leads to backpressure, we measure
the fraction of times WPQ is full (instead of simply measuring the WPQ occupancy) – this is
because of the well known difference between read and write datapath. When IIO performs
memory writes, it only observes latency inflation in writes when the WPQ is full. However,
when performing memory reads, IIO can experience latency inflation, leading to backpressure,
even when RPQ is high and not full.

107


controller queues.

Deciding contention detection thresholds. Unless mentioned otherwise, stor-

ageCC uses default contention threshold to be 91 cachelines for the IIO buffer.

Additionally, we use 0.7 for WPQ occupancy threshold.4 storageCC however

chooses a more aggressive threshold of 75 cachelines for storage applications

running on CPU cores in remote NUMA nodes, in a victim-culprit-like sce-

nario. The reasoning is as follows: when using remote NUMA setup, even

without memory bandwidth contention, the IIO occupancies are significantly

larger than the local NUMA case (as shown in Figure 5.3, this is due to higher

remote NUMA memory access latencies; we measured the latency to be 65%

higher, by using Little’s Law and observed average IIO occupancy and ap-

plication throughput.) Therefore, even with the same amount of application

throughput, remote NUMA applications can occupy larger space at the shared

IIO buffer. storageCC tries to alleviate such unfair IIO buffer allocation by us-

ing a smaller IIO threshold for remote NUMA applications, triggering a more

aggressive detection and response for contention.

5.4 storageCC Evaluation

This section presents the evaluation results for the Linux storage stack, with

and without storageCC. We use the same setup as in §2.3. The goals of our

evaluation are as follows:

• Understanding benefits of using storageCC in the presence of host conges-

4As discussed in §5.3, we measure fraction of time WPQ is full, therefore measured values
range between 0-1.

108


0 250 500
Time (µs)

0.0

0.2

0.4

0.6

0.8

1.0

No
rm

al
ize

d 
IIO

 O
cc

up
an

cy

(a) IIO Occupancy
(No Contention)

0 250 500
Time (µs)

0.0

0.2

0.4

0.6

0.8

1.0

No
rm

al
ize

d 
IIO

 O
cc

up
an

cy

(b) IIO Occupancy
(With Contention)

0 250 500
Time (µs)

0.0

0.2

0.4

0.6

0.8

1.0

No
rm

al
ize

d 
W

PQ
 O

cc
up

an
cy

(c) WPQ Occupancy
(No Contention)

0 250 500
Time (µs)

0.0

0.2

0.4

0.6

0.8

1.0

No
rm

al
ize

d 
W

PQ
 O

cc
up

an
cy

(d) WPQ Occupancy
(With Contention)

Figure 5.3: IIO buffer and memory controller queue occupancies for the remote
NUMA setup. (Left column) Even without memory bandwidth contention, the
IIO buffer is filled up. This is due to higher access latency from IIO to memory
in remote NUMA node. (Right column) IIO buffer and WPQ occupancies are as
expected under memory bandwidth contention.

tion due to memory bandwidth contention (§5.4.1)

• Demonstrating similar benefits even when using unmodified real-world

storage systems under host congestion (§5.4.2)

• Understanding storageCC overheads, especially for scenarios where there

isn’t any presence of host congestion (§5.4.3).

In this section, we focus mainly on DDIO disabled scenario for better explica-

bility of results; but storageCC also works with DDIO enabled scenarios; we

discuss them in §5.4.4.

109


0.0 0.5 1.0 1.5 2.0 2.5 3.0
Throughput (in Million IOPS)

102

103

104

105

L-
A 

  
Ta
il 
La
te
nc
y 
(u
s)

0x1x2x3x

7.34x Im rovementIn Latency

Without StorageCC With StorageCC

(a) Local NUMA setup

0.0 0.5 1.0 1.5 2.0 2.5 3.0
Throughput (in Million IOPS)

102

103

104

105

L-
A 

  
Ta
il 
La
te
nc
y 
(u
s)

0x1x
2x3x

17.00x Im rovement
In Latency

Without StorageCC With StorageCC

(b) Remote NUMA setup

Figure 5.4: storageCC allows storage stack to converge to amount of parallelism
which is close to the knee point, independent of degree of contention and appli-
cation NUMA-affinity. storageCC provides close to optimal latencies for L-App
(upto ∼17× tail latency improvement, even assuming the default IO depth = 64 –
the minimum needed to sustain optimal throughput under no congestion) and
maximum achievable throughput for T-app. Discussion in §5.4.1.

5.4.1 storageCC benefits

[Figure 5.4] storageCC maintains near-optimal amount of parallelism for stor-

age applications independent of the degree of host congestion and NUMA-

affinity of the applications. Figure 5.4 shows the latency-load curve with and

without using storageCC using a similar setup as in §2.3 (Figures 2.11 and 2.12),

under varying degrees of storage datapath contention. We vary this degree by

proportionally changing the number of M-app cores in our setup. Starting with

the scenario where memory bandwidth is contended in the NUMA node local to

SSDs (Figure 5.4a) we observe that for any introduced degree of contention, stor-

ageCC converges to an amount of parallelism that is close to the the knee-point

on the latency-load curve. When T-app uses lower IO depth values (from 1 to 8),

storageCC achieves near-identical throughput and latency values compared to

the case without storageCC. This is expected because there is no memory band-

width contention for these IO depths (the offered load by T-app is smaller than

the available memory bandwidth); hence, storageCC never reduces the inflight

110


request window and keeps IO depth worth requests in-flight at all times.

However, when the T-App IO depth increases beyond a threshold which is

large enough to sustain the maximum achievable throughput—32 for 1×, and

16 for 2× and 3× degree of contention respectively—storageCC’s mechanisms

to detect contention and shift the amount of parallelism kicks-in. For example,

in a scenario with 2× contention and an IO depth of 64, which is the optimal par-

allelism for no contention, storageCC adjusts the operating point to near that of

an IO depth of 16—the ideal operating point for 2× contention scenario. This

adjustment occurs as storageCC detects contention and converges the in-flight

request window value close to 16. In the steady-state, storageCC exhibits AIMD

behavior, oscillating around this value (we later discuss steady-state behavior of

storageCC at a microscopic-level in Figure 5.11). storageCC converges close to

the ideal operating point even in the subsequent scenario, where the memory

bandwidth is contended in the remote NUMA node (Figure 5.4b); further, re-

gardless of the application’s IO depth setting, storageCC converges to an oper-

ating point close to the knee-point of each respective latency-load curve. These

observations showcase the efficacy and robustness of storageCC’s contention

detection and mitigation mechanisms. We discuss storageCC’s results under no

contention scenario later in §5.4.3, demonstrating storageCC introduces mini-

mal overheads.

[Figure 5.5] storageCC is more resilient to the throughput starvation prob-

lem. Using the same setup as in in Figure 2.11, we observe that storageCC

provides >55% gain in victim application throughput, while incurring a smaller

cost in terms of further degradation of culprit application throughput and lead-

ing to a gain of >22% in total application throughput (culprit + victim). The

111


Culprit Victim0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6

Th
ro
ug
hp
ut
 (i
n 
M
illi
on

 IO
PS

)

Necessary Unnecessary

No Contention
With Contention, Without StorageCC
With Contention, StorageCC

(a) 1x Congestion

Culprit Victim0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6

Th
ro
ug
hp
ut
 (i
n 
M
illi
on

 IO
PS

)

Necessary Unnecessary

No Contention
With Contention, Without StorageCC
With Contention, StorageCC

(b) 2x Congestion

Culprit Victim0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6

Th
ro
ug
hp
ut
 (i
n 
M
illi
on

 IO
PS

)

Necessary Unnecessary

No Contention
With Contention, Without StorageCC
With Contention, StorageCC

(c) 3x Congestion

1x 2x 3x
Degree of Storage Datapath Congestion

0

20

40

60

80

100

%
 G
ai
n

Gain in Total Throughput
Gain in Victim Throughput

(d) Net Throughput Gain (%)

Figure 5.5: storageCC helps alleviate the throughput starvation problem: im-
proving victim application throughput, while incurring smaller cost in culprit
application throughput, leading to ∼55% gains in victim throughput and ∼22%
in overall throughput. Discussion in §5.4.1.

gain in victim application throughput stems from enabling of a better isolation

across T-apps by storageCC: it detects that the contention is caused only in cul-

prit application datapath (since the WPQ increases only for the remote NUMA

memory controller), leading to reduction in in-flight request window just for

the culprit appplication. Reducing this in-flight window reduces the contention

of shared IIO buffer with the victim traffic, leading to improved victim apppli-

cation throughput.

5.4.2 RocksDB with storageCC

We now evaluate storageCC with RocksDB [43], a commonly used storage sys-

tem, as an L-app. In our setup, we mount local SSDs using the XFS file system

112


0.0 0.5 1.0 1.5 2.0 2.5
Throughput (in Million IOPS)

102

103

104

L-
A 

  
Ta
il 
La
te
nc
y 
(u
s)

0x

3x

6.40x Im rovement
In Latency

Without StorageCC With StorageCC

Figure 5.6: Evaluation using RocksDB: storageCC provides similar improve-
ment in tail latency as earlier results.

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Normalized T-App Throughput

102

103

104

L-
Ap

p 
Ta

il 
La

te
nc

y 
(u

s)      <5% 
 Overhead

Without StorageCC With StorageCC

Figure 5.7: storageCC observe negligible overheads (<5%) without the presence
of any storage datapath congestion. Discussion in §5.4.3.

and configure RocksDB to utilize it as a backend with direct IO. We generate a

ReadRandom workload of 4KB request sizes, with an IO depth of 1 for the appli-

cation thread using the db bench benchmarking tool [44]. We run this bench-

mark for the same setup we used in §2.3. Figure 5.6 shows the latency-load

curve using storageCC with 3x degree of storage datapath contention. As ex-

pected, compared to the no contention scenario, the L-app experiences high tail

latency inflation (6.52×) under host congestion. We observe that storageCC ef-

fectively converges near the optimal operating point under contention, demon-

strating its effectiveness in handling host congestion for unmodified real-world

storage applications.

113


5.4.3 storageCC Overheads

We now discuss any implementation overheads of using storageCC when there

is no contention within the storage datapath. Figure 5.7 shows these overheads

for the worst-possible scenario: when all the CPU cores in the NUMA node lo-

cal to the SSDs (and therefore, also NUMA node local to the IIO) are running

the storage applications. As discussed in §5.3, in order to sample IIO occupancy

values, MSR reads must be performed to read the required hardware counter

values. However, those MSR counters keep track of statistics of the IIO in their

local NUMA node [61]. If all the cores in the IIO-local NUMA are running stor-

age applications, there is some additional CPU overhead while performing re-

quired MSR reads, which can lead to slight performance degradation. On our

testbed, we found <5% degradation, as shown in Figure 5.7.

5.4.4 storageCC Sensitivity Analysis

We now evaluate storageCC’s robustness under a wider set of workload and

host hardware settings.

[Figure 5.8] Varying IO sizes. We observe that storageCC is robust to changing

IO sizes and converges to the optimal level of parallelism in presence of memory

bandwidth contention.

[Figure 5.9] Varying L-app load. We observe that even for the cases L-app no

longer occupies a tiny fraction of SSD bandwidth, storageCC continues to pro-

vide similar benefits in performance. This is because storageCC contention de-

tection and parallelism adjustment mechanism do not make any assumptions

114


0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6
Throughput (in Milli n IOPS)

102

103

104

105

L-
Ap

p 
Ta
il 
La
te
nc
( 
(u
s)

0x3x

4.14x Impr vement
In Latenc(

With ut St rageCC With St rageCC

(a) IO size = 8KB

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
Throughput (i  Millio  IOPS)

102

103

104

105

L-
Ap

p 
Ta
il 
La
te
 c
) 
(u
s)

0(3(

3.80( Improveme t
I  Late c)

Without StorageCC With StorageCC

(b) IO size = 16KB

0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45
Throughput (in Million IOPS)

102

103

104

105

L-
A 

  
Ta
il 
La
te
nc
y 
(u
s)

0x
3x

5.75x Im rovement
In Latency

Without StorageCC With StorageCC

(c) IO size = 32KB

0.000 0.025 0.050 0.075 0.100 0.125 0.150 0.175 0.200 0.225
Throughput (in Milli n IOPS)

102

103

104

105

L-
Ap

p 
Ta
il 
La
te
nc
( 
(u
s) 0x

3x

3.76x Impr vement
In Latenc(

With ut St rageCC With St rageCC

(d) IO size = 64KB

Figure 5.8: storageCC converges to right amount of parallelism independent of
IO size of the applications. Discussion in §5.4.4.

0.0 0.5 1.0 1.5 2.0 2.5 3.0
Throughput (in Milli n IOPS)

102

103

104

105

L-
Ap

p 
Ta
il 
La
te
nc
( 
(u
s)

0x

3x

7.38x Impr vement
In Latenc(

With ut St rageCC With St rageCC

(a) L-app IO Depth = 2

0.0 0.5 1.0 1.5 2.0 2.5 3.0
Throughput (in Million IOPS)

102

103

104

105

L-
A 

  
Ta
il 
La
te
nc
y 
(u
s)

0x

3x

8.03x Im rovement
In Latency

Without StorageCC With StorageCC

(b) L-app IO Depth = 4

0.0 0.5 1.0 1.5 2.0 2.5 3.0
Throughput (in Milli n IOPS)

102

103

104

105

L-
Ap

p 
Ta
il 
La
te
nc
( 
(u
s)

0x

3x

8.26x Impr vement
In Latenc(

With ut St rageCC With St rageCC

(c) L-app IO Depth = 8

Figure 5.9: storageCC converges to the right amount of parallelism, even with
increasing load from competing L-apps. Discussion in §5.4.4.

115


0.0 0.5 1.0 1.5 2.0 2.5 3.0
Throughput (in Million IOPS)

102

103

104

105

L-
Ap

p 
Ta

il 
La

te
nc

y 
(u
 )

0x
3x

Without StorageCC With StorageCC

0.0 0.5 1.0 1.5 2.0 2.5 3.0
Throughput (in Million IOPS)

102

103

104

105

L-
Ap

p 
Ta

il 
La

te
nc

y 
(u
 )

Without StorageCC With StorageCC

Figure 5.10: DDIO enabled results: (a) scenario where DDIO can be beneficial,
storageCC provides even better tail latencies with close to maximum achievable
throughputs, (b) DDIO is not applicable for remote NUMA scenario, storageCC
provides near identical results as baseline results.

about the loads offered by individual kinds of applications.

[Figure 5.10] Enabling DDIO. We evaluate storageCC performance under

memory bandwidth contention when DDIO is enabled. Although sometimes

beneficial, DDIO is known to not fundamentally resolve the contention at the

memory interconnect [31, 6, 9]; for instance, we observe that efficacy of DDIO

to worsen with increasing IO sizes, or using remote NUMA nodes (since DDIO

is not applicable for that context [31]). storageCC is able to converge close to

points providing close-to-max achievable throughput and low latencis, leverag-

ing DDIO benefits when it is effective, and improving performance otherwise.

5.4.5 Understanding storageCC microscopic behavior

Figure 5.11 shows the steady-state microscopic behavior for storageCC for the

local NUMA setup with 2x degree of contention (we also observe similar trends

for other host congestion scenarios). Figure 5.11a shows storageCC’s in-flight

request window behaviour for a 2000 microsecond horizon for one of the T-Apps

– we verified that the behavior is near-identical for all T-app instances. The fig-

116


0 250 500 750 1000 1250 1500 1750 2000
Time (us)

0
2
4
6
8

10
12
14
16
18
20
22
24

in
f_

re
q_

wn
d

(a) storageCC inf req wnd

0 250 500 750 1000 1250 1500 1750 2000
Time (us)

0.0

0.1

0.2

0.3

0.4

0.5

al
ph

a

(b) storageCC α value

0 250 500 750 1000 1250 1500 1750 2000
Time (us)

60
65
70
75
80
85
90
95

100

IIO
 O

cc
up

an
cy

(c) IIO Occupancy

0 250 500 750 1000 1250 1500 1750 2000
Time (us)

60

80

100

120

140

Sm
oo

th
ed

 C
om

pl
et

io
n 

Ti
m

e
(u

s)

(d) Smoothed Completion Times

Figure 5.11: storageCC microscopic behavior

ure demonstrates the expected AIMD behavior in the steady-state in the pres-

ence of host congestion. The average inf req wnd value determines the aver-

age in-flight requests and hence the T-app throughput. Figure 5.11c shows the

IIO occupancy behavior for the same time horizon. We observe that storageCC

keeps the IIO occupancy close to the threshold (shown via red line), sustaining

high throughput. Finally, Figure 5.11d shows the smoothed completion time

behavior – we observe that completion time fluctuates much more than the de-

gree of contention (indicated by IIO occupancy value) – sometimes even varying

by more than 50%. The dynamic completion time estimation in storageCC how-

ever ensures apt response to contention in face of such variable storage datapath

completion times.

117


0 100 200 300 400 500
Time (s)

0

200

400

600

800

1000

1200

1400

1600

Ba
nd

wi
dt
h 
(M

iB
/s
)

Thread 1
Thread 2
Thread 3
Thread 4

Thread 5
Thread 6
Thread 7
Thread 8

Thread 9
Thread 10
Thread 11
Thread 12

Thread 13
Thread 14
Thread 15
Thread 16

Figure 5.12: Fair resource allocation by storageCC under host congestion and
varying number of application threads. Discussion in §5.4.6.

5.4.6 Fair resource allocation using storageCC

Figure 5.12 shows storageCC’s convergence to fair-share of throughput, under

varying number of T-app instances. We use a similar setup as the 3× memory

contention, and start with a single T-app thread on 4 CPU cores. We then in-

troduce additional threads at t = 100s and t = 200s respectively to increase the

total number of active T-app threads to be 8 and 16 respectively. We measured

throughput versus time for each thread, and observed storageCC converge to

the max-min fair-share as soon as the number of T-app instances were changed.

This is realized due to quick contention detection mechanism in storageCC, and

use of AIMD-based mechanism to adjust the parallelism in response to con-

tention.

5.5 Summary

storageCC is a new storage stack architecture to address the host network bot-

tlenecks within the storage datapath. While existing storage stacks handle CPU

and/or SSD bottlenecks, they lack the capability to detect and react to bottle-

118


necks within the storage datapath, resulting in significant tail latency inflation

and isolation violation. Leveraging insights from classical networking liter-

ature, storageCC quickly detects and efficiently reacts to these host network

bottlenecks. Our evaluation demonstrates that storageCC enables existing un-

modified storage applications to converge close to optimal operating points—

achieving low tail latencies and isolation guarantees—while incurring minimal

implementation overheads.

119


CHAPTER 6

RELATED WORK

This chapter provides a brief overview of the related work in the space of under-

standing and alleviating contention for host resources (§6.1), efficient memory

protection in network stacks (§6.2), and managing bottlenecks within the stor-

age datapath (§6.3).

6.1 Managing Host Resource Contention

Performance degradation due to contention for host resources. The prob-

lem of network applications experiencing underutilization due to lack of host

resources is not entirely new—decades ago we experienced the receiver live-

lock [100] problem, where the host spent all its time processing interrupts for

arriving packets, with no cycles left to deliver these packets to the application.

More recently, it has been shown [31] that network utilization can also be bottle-

necked by the CPU cycles spent on copying the delivered data from the kernel

to the application. The host congestion problem discussed in the dissertation

is caused by host network resources, like the memory bus, becoming the bot-

tleneck, and the network is underutilized even with CPU cycles to spare. Even

more crucially, the packet queueing due to these bottlenecks happens at differ-

ent places–at the kernel buffers inside the memory for CPU bottlenecks, but at

the NIC buffers (which can lead to packet drops) for host network bottlenecks.

Using emerging hardware technologies to alleviate host congestion. Two im-

portant emerging technologies are RDMA [139] and CXL [118]. While zero-

120


copy datapath enabled by RDMA does alleviate host congestion to some de-

gree (by reducing memory bandwidth utilization due to data copy overheads),

it does not resolve the host congestion problem [86]. Data DMAed from the

RDMA NICs to the application memory still requires transfer over the memory

interconnect, leading to host congestion and similar performance degradation

as shown in §2.2 [86]. The benefits of CXL to resolve host congestion are also

unclear. For instance, consider the two use cases of CXL. First, reducing PCIe

to IIO latency; our analysis in §3.2.1 suggests that reducing ℓp does not allevi-

ate memory interconnect congestion (ℓm is the core problem). Second, CXL may

enable memory expansion where CPUs can directly read from CXL-attached

memory; this requires massive changes in the host infrastructure, and whether

it will provide benefits to host congestion remains an interesting avenue of fu-

ture research.

6.2 Efficient IO Memory Protection

Improved performance using a “non-strict” security model. Some of the ex-

isting works, like DAMN [94], change the memory protection security model

to improve its performance. The key insight in DAMN is that IOVA invalida-

tions lead to both IOMMU and CPU overheads, so they introduce DMA buffers

that are permanently mapped in the IO page table, thus avoiding the need to

perform IOMMU invalidation. Packet data is copied to an intermediate buffer,

that the NIC cannot access, whenever it is read by the CPU (i.e. header data)

or otherwise to application buffers. The authors argue this is still secure, since

it avoids ”time of check to time of use” attacks where data is changed by the

121


NIC after it has been initially read by the CPU. While DAMN reduces over-

heads related to (un)map, IOVA allocation, and invalidations, it does not at-

tempt to reduce the cost of each IOTLB miss. As shown in our evaluation (and

in Figure 4.9), removing invalidation overhead is insufficient to achieve good

performance as poor locality still increases the address translation latency. In-

deed, in their evaluation the authors show throughput degrades due to IOTLB

miss overheads. Hence, DAMN is complementary to F&S, as it could be used

for buffer allocation and F&S for IOVA allocation and mapping. Several other

works have designed their solutions around deferring or avoiding invalidations

altogether [108, 93, 45], which improve performance by sacrificing on security

Improved performance using hugepages. [45] implements a hugepage allo-

cator for IOVA mappings that can achieve good IOTLB performance at high

throughput. However, their approach relies on keeping IOVAs permanently

mapped to pages in their page cache, not invalidating them after each DMA.

If they were to use strict mode, invalidations would lead to L1/L2/L3 misses,

degrading performance. Moreover, the authors note that hugepages are not a

panacea: faster link speeds will require larger hugepages to achieve optimal

performance. We note that F&S integrates well with hugepages and is a com-

plementary approach.

Improved performance using redesigned hardware. There have been many

other lines of work looking at designing improved hardware for address trans-

lation. [55] explores new address translation hardware for accelerators. [79] de-

signed hardware for address translation for a hypertenant environment. There

has also been work to rethink IOTLB hardware to reduce the number of page

table walks and remove invalidation overhead [17, 91]. Another line of re-

122


search [24, 48, 109] focuses on the design and performance of MMUs and TLBs.

In MMU address translation, applications use virtual addresses, and the OS

manages physical frames, making it challenging to ensure uniform performance

across various workloads. Conversely, in IOMMU address translation, the OS

allocates virtual addresses, offering a unique design opportunity to optimize

performance as demonstrated in F&S.

6.3 Mitigating Bottlenecks in Storage Datapath

Mitigating hardware bottlenecks in storage datapath. There is considerable re-

search aimed at improving SSD performance by minimizing garbage-collection

overheads [133, 83, 67, 137, 71], improving hardware interfaces [74, 75, 138,

27, 123], ensuring isolation [56], and redesigning the NVMe controller [136].

However, these solutions fall short when the bottleneck shifts from the SSD

to the host interconnect. While incorporating Quality of Service (QoS) mech-

anisms [49, 72, 78] could potentially address the HoL blocking problem un-

der host congestion, they require invasive modifications to the DMA engine

to prioritize L-app transfers at the PCIe transaction level. It is important to

note that these research efforts are orthogonal to storageCC’s focus on adjusting

application-level parallelism in response to host network bottlenecks, as lever-

aging more capable hardware can contribute to overall system performance im-

provements.

Mitigating software bottlenecks in storage datapath. Extensive research ef-

forts have been dedicated to designing storage stacks to alleviate CPU bottle-

necks and improve system performance. Over time, these stacks have evolved

123


to tackle various challenges, including ensuring fairness [2, 3, 90], prioritiza-

tion [4], QoS guarantees [5, 54, 119], and isolation [57, 124, 76]. However, as

highlighted in §2.3, they prove ineffective in mitigating issues arising in the host

congestion regime, where queuing happens beyond the Linux block layer and

deeper within the storage datapath. Additionally, some of these stacks intro-

duce significant CPU overheads (§2.3). Although userspace storage stacks [135]

offer lower CPU overheads, they still do not address the challenges encountered

in the host congestion regime.

124


CHAPTER 7

CONCLUSION AND OUTLOOK

This dissertation revisits the conventional wisdom in systems and networking

communities that congestion happens primarily within the network fabric (i.e.,

at network links and/or switches). We demonstrate recent technological trends,

such as rapid growth in network link and peripheral bandwidths combined

with relatively stagnant trends for other host resources (like memory bandwidth

and cache sizes per core), have led to the emergence of host congestion. We dis-

cuss the root cause of host congestion to be nanosecond-scale latency inflation

within the datapath between NIC-to-CPU/memory datapath. Such latency in-

flation results in underutilization of PCIe bandwidth, subsequently leading to

network bandwidth underutilization and queueing/drops at the NIC buffers.

We characterize impact of host congestion context of networking applica-

tions, and also demonstrate that the problem generalizes to context of storage

applications (Chapter 2). We show that existing network protocols and OS were

designed with several implicit assumptions that start to break under the regime

of host congestion. The dissertation revisits the achitecture of network proto-

cols like congestion control, and several components of the OS like IO memory

protection mechanisms and storage stacks, to handle the host congestion.

We presented hostCC (Chapter 3), a new congestion control architecture that

handles congestion within the host network, along with the network fabric.

The key idea in hostCC is a new “host-local” congestion response performed

at a sub-microsecond-scale. HostCC integrates the idea of host-local congestion

response with new congestion signals collected at sub-microsecond timescales

within the host hardware, allowing it to capture and react to congestion within

125


the host network at sub-microsecond timescales.

We next presented F&S (Chapter 4), which revisits the design of memory

protection mechanisms to minimize performance overheads while ensuring the

required safety guarantees. We show that, although reducing the number of

IOTLB misses under strictest possible memory protection guarantees is hard,

we can reduce the cost of each miss by improving the IO page table cache per-

formance. F&S leverages this insight and reduces the latency-cost and CPU-cost

for IOTLB misses through its key design ideas of contiguous IOVA allocation,

along with a selective and batched invalidation of IOTLBs.

We finally presented storageCC (Chapter 5), a new architecture for storage

stacks that enables them to operate at an ideal amount of parallelism in the pres-

ence of host congestion. StorageCC introduces the conceptual idea of conges-

tion control—which is typically used in the computer networking literature—in

storage stacks. By leveraging insights from classical networking literature and

exploiting specific properties of the host network, storageCC designs its conges-

tion control mechanism to quickly detect and efficiently react to host congestion,

dynamically adjusting in-flight IO requests to achieve near-optimal parallelism.

7.1 Future Directions

As established in this dissertation, effectively handling host congestion requires

exploring solutions across different layers of the infrastructure—-both within

the software and hardware. The mechanisms discussed in Chapters 3, 4, and 5

take the initial steps toward re-architecting the existing host infrastructure. We

next outline additional avenues for future research, aimed at rethinking the de-

126


sign of applications, operating systems, host hardware, and network fabric.

Rethinking design of applications. Existing host-local applications perform

multiple data copies for various kinds of operations [132, 33, 43, 134, 122] (like

security, compression, (de)serialization, etc.), which contributes to inflated host

network resources (e.g., memory bandwidth), resulting in higher inefficiencies.

Host congestion calls for exploring techniques that enable reducing data copies

at the application layer, with and without hardware support. Additionally, it

would be interesting to explore algorithmic solutions that enable trading off

computation for reduced memory traffic; the key insight here is that, when

memory bandwidth is bottlenecked, CPU utilization is reduced due to larger

stalls and that explicitly trading off computation for reduced memory traffic

could lead to improved solutions.

Rethinking design of operating systems. Exploration of different components

of the operating system, beyond the ones already discussed in this dissertation,

is needed to fundamentally tackle the impact of host congestion. For instance,

existing CPU schedulers do not consider memory bandwidth contention into ac-

count; it would be interesting to explore scheduling mechanisms that take into

account memory bandwidth contention. At the memory management level, an

interesting direction would be to rethink memory tiering solutions—existing

solutions try to place all frequently accessed pages in the host-local memory

(that is, the tier closer to the CPUs); however, when memory bandwidth is con-

tended, this is not necessarily the optimal solution. We could potentially mi-

grate pages across the memory tiers such as to strike a proper balance between

desired application performance and memory bandwidth contention. Similarly,

the existing network and storage stacks try placing physical pages (for DMAing

127


data to/from the peripherals) local to the NUMA node containing the peripher-

als. It would be interesting to rethink the design of these stacks to dynamically

migrate the physical pages across different NUMA nodes, depending upon the

host network contention within the end-to-end datapath.

Rethinking design of host hardware. This dissertation shows that at its root,

the inefficiencies within the host network arise because today’s host hardware

employs flow control mechanisms to handle data transfer between different

host-local devices. Networking literature suggests that flow control is better

suited for networks where resource contention happens at the end points of

the network (i.e., the host-local devices like CPU cores and peripheral devices).

Host network observes inefficiencies due to contention for resources within the

end points of the network (i.e., the interconnects within the host network).

Therefore, congestion control within the host network could help mitigate its

inefficiencies. Such a re-architecture would especially be needed in the con-

text of emerging hardware technologies like CXL—the amount of contention

for CXL memory channels is likely to become much higher than the host-local

memory (since processors from multiple hosts can access CXL memory chan-

nels). Implementing congestion control mechanisms within the hardware could

be beneficial to resolve such contention. As the host network inefficiencies get

more pronounced with current hardware trends, we might even need to gener-

alize such hardware-level congestion control mechanisms to transfers across the

entire host network—i.e., between peripheral devices, CPU cores and memory.

Rethinking design of (datacenter) network fabric. While this dissertation fo-

cusses primarily on ensuring the host network behaves close to an ideal manner,

providing performance guarantees similar to an ideal host network (e.g., near-

128


optimal latency and/or throughput) even for datacenter networks, could be

beneficial for alleviating many kinds of application inefficiencies. Existing data-

center networks are asynchronous—it is hard, or even impossible, to bound the

time taken by a message to be transmitted from the sender to the receiver host.

Such network asynchrony reduces the set of assumptions that system designers

can rely upon from the underlying network, thus introducing inefficiency and

complexity in end-host systems and applications. Datacenter networks offer-

ing stronger abstractions—one where the network delivers the message within

a bounded amount of time—can have powerful implications for future systems

and networking infrastructure. A recent study [7] takes the first step towards ex-

ploring such an abstraction, building upon programmable hardware and multi-

tier topologies prevalent in existing datacenter networks. It would be interest-

ing to build distributed applications, network protocols, and OS stacks that can

harness the benefits of such an abstraction.

129


BIBLIOGRAPHY

[1] axboe/fio: Flexible I/O Tester. https://github.com/axboe/fio.

[2] BFQ (budget fair queueing). https://docs.kernel.org/

blockbfq-iosched.html.

[3] CFQ (complete fairness queueing). https://www.kernel.org/doc/

Documentation/block/cfq-iosched.txt.

[4] Kyber multiqueue I/O scheduler. https://lwn.net/Articles/720071/.

[5] MQ-Deadline I/O scheduler. https://lwn.net/Articles/708465/.

[6] Saksham Agarwal, Rachit Agarwal, Behnam Montazeri, Masoud
Moshref, Khaled Elmeleegy, Luigi Rizzo, Marc Asher de Kruijf, Gautam
Kumar, Sylvia Ratnasamy, David Culler, et al. Understanding host inter-
connect congestion. In ACM HotNets, 2022.

[7] Saksham Agarwal, Qizhe Cai, Rachit Agarwal, David Shmoys, and Amin
Vahdat. Harmony: A congestion-free datacenter architecture. In USENIX
NSDI, 2024.

[8] Saksham Agarwal, Shreyas Kharbanda, and Rachit Agar-
wal. Storage stacks need congestion control. In Draft:
https://saksham.web.illinois.edu/assets/pdf/storagecc.pdf,
2024.

[9] Saksham Agarwal, Arvind Krishnamurthy, and Rachit Agarwal. Host
congestion control. In ACM SIGCOMM, 2023.

[10] Saksham Agarwal, Shijin Rajakrishnan, Akshay Narayan, Rachit Agar-
wal, David Shmoys, and Amin Vahdat. Sincronia: Near-optimal network
design for coflows. In ACM SIGCOMM, 2018.

[11] Amit Aggarwal, Stefan Savage, and Thomas Anderson. Understanding
the performance of tcp pacing. In IEEE INFOCOM, 2000.

[12] Jasmin Ajanovic. Pci express 3.0 accelerator features. Intel Corporation,
2008.

130

https://github.com/axboe/fio
https://docs.kernel.org/blockbfq-iosched.html
https://docs.kernel.org/blockbfq-iosched.html
https://www.kernel.org/doc/Documentation/block/cfq-iosched.txt
https://www.kernel.org/doc/Documentation/block/cfq-iosched.txt
https://lwn.net/Articles/720071/
https://lwn.net/Articles/708465/
https://saksham.web.illinois.edu/assets/pdf/storagecc.pdf


[13] Markuze Alex, Shay Vargaftik, Gil Kupfer, Boris Pismeny, Nadav Amit,
Adam Morrison, and Dan Tsafrir. Characterizing, exploiting, and detect-
ing dma code injection vulnerabilities in the presence of an iommu. In
ACM EUROSYS, 2021.

[14] Mohammad Alizadeh, Albert Greenberg, David A Maltz, Jitendra Pad-
hye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Srid-
haran. Data center tcp (dctcp). In ACM SIGCOMM, 2010.

[15] AMD. Amd64 technology platform quality of service extensions. https:
//developer.amd.com/wp-content/resources/56375.pdf, 2020.

[16] AMD. Alveo U280 Data Center Accelerator Card. https://www.xilinx.
com/products/boards-and-kits/alveo/u280.html, 2024.

[17] Nadav Amit, Muli Ben-Yehuda, and Ben-Ami Yassour. Iommu: Strategies
for mitigating the iotlb bottleneck. In ISCA. Springer, 2010.

[18] Robin Anil, Gokhan Capan, Isabel Drost-Fromm, Ted Dunning, Ellen
Friedman, Trevor Grant, Shannon Quinn, Paritosh Ranjan, Sebastian
Schelter, and Özgür Yılmazel. Apache mahout: Machine learning on dis-
tributed dataflow systems. Journal of Machine Learning Research, 21(127):1–
6, 2020.

[19] Apache Software Foundation. Apache cassandra. https://cassandra.
apache.org, 2024.

[20] Mohamed Arafa, Bahaa Fahim, Sailesh Kottapalli, Akhilesh Kumar, Lily P
Looi, Sreenivas Mandava, Andy Rudoff, Ian M Steiner, Bob Valentine,
Geetha Vedaraman, et al. Cascade lake: Next generation intel xeon scal-
able processor. IEEE MICRO, 2019.

[21] Rachata Ausavarungnirun, Kevin Kai-Wei Chang, Lavanya Subramanian,
Gabriel H Loh, and Onur Mutlu. Staged memory scheduling: Achiev-
ing high performance and scalability in heterogeneous systems. In ACM
SIGARCH Computer Architecture News, 2012.

[22] Muli Ben-Yehuda, Jon Mason, Jimi Xenidis, Orran Krieger, Leendert
Van Doorn, Jun Nakajima, Asit Mallick, and Elsie Wahlig. Utilizing iom-
mus for virtualization in linux and xen. In OLS’06: The 2006 Ottawa Linux
Symposium, pages 71–86. Citeseer, 2006.

131

https://developer.amd.com/wp-content/resources/56375.pdf
https://developer.amd.com/wp-content/resources/56375.pdf
https://www.xilinx.com/products/boards-and-kits/alveo/u280.html
https://www.xilinx.com/products/boards-and-kits/alveo/u280.html
https://cassandra.apache.org
https://cassandra.apache.org


[23] Muli Ben-Yehuda, Jimi Xenidis, Michal Ostrowski, Karl Rister, Alexis
Bruemmer, Leendert van Doorn, and Doorn Amd. The price of safety:
Evaluating iommu performance. Ottawa Linux Symposium (OLS), 01 2007.

[24] Abhishek Bhattacharjee. Large-reach memory management unit caches.
In IEEE/ACM MICRO, 2013.

[25] Suparna Bhattacharya, Steven Pratt, Badari Pulavarty, and Janet Morgan.
Asynchronous i/o support in linux 2.5. In Proceedings of the Linux Sympo-
sium, pages 371–386. Citeseer, 2003.

[26] Harshawardhan S Bhosale and Devendra P Gadekar. A review paper on
big data and hadoop. International Journal of Scientific and Research Publica-
tions, 4(10):1–7, 2014.

[27] Matias Bjørling, Abutalib Aghayev, Hans Holmberg, Aravind Ramesh,
Damien Le Moal, Gregory R Ganger, and George Amvrosiadis. Zns:
Avoiding the block interface tax for flash-based ssds. In USENIX ATC,
2021.

[28] Evgeny Bolotin, Israel Cidon, Ran Ginosar, and Avinoam Kolodny. Qnoc:
Qos architecture and design process for network on chip. Journal of systems
architecture, 50(2-3):105–128, 2004.

[29] Jeff Bonwick and Jonathan Adams. Magazines and vmem: Extending the
slab allocator to many CPUs and arbitrary resources. In USENIX ATC,
2001.

[30] Qizhe Cai, Mina Tahmasbi Arashloo, and Rachit Agarwal. dcpim: Near-
optimal proactive datacenter transport. In ACM SIGCOMM, 2022.

[31] Qizhe Cai, Shubham Chaudhary, Midhul Vuppalapati, Jaehyun Hwang,
and Rachit Agarwal. Understanding host network stack overheads. In
ACM SIGCOMM, 2021.

[32] Qizhe Cai, Midhul Vuppalapati, Jaehyun Hwang, Christos Kozyrakis,
and Rachit Agarwal. Towards µ s tail latency and terabit ethernet: Dis-
aggregating the host network stack. In ACM SIGCOMM, 2022.

[33] Josiah Carlson. Redis in action. Simon and Schuster, 2013.

132


[34] Adrian M Caulfield, Laura M Grupp, and Steven Swanson. Gordon: us-
ing flash memory to build fast, power-efficient clusters for data-intensive
applications. ACM Sigplan Notices, 44(3):217–228, 2009.

[35] Kevin K Chang. Understanding and improving the latency of DRAM-based
memory systems. PhD thesis, Carnegie Mellon University, 2017.

[36] Kevin K. Chang, Abhijith Kashyap, Hasan Hassan, Saugata Ghose, Kevin
Hsieh, Donghyuk Lee, Tianshi Li, Gennady Pekhimenko, Samira Khan,
and Onur Mutlu. Understanding latency variation in modern dram chips:
Experimental characterization, analysis, and optimization. SIGMETRICS
Perform. Eval. Rev., 44(1):323–336, jun 2016.

[37] Feng Chen, Rubao Lee, and Xiaodong Zhang. Essential roles of exploit-
ing internal parallelism of flash memory based solid state drives in high-
speed data processing. In IEEE HPCA, 2011.

[38] Dilip Chhajed and Timothy Lowe. Building Intuition: Insights From Basic
Operations Management Models and Principles, volume 115. 01 2008.

[39] Jack Choquette, Wishwesh Gandhi, Olivier Giroux, Nick Stam, and Ronny
Krashinsky. Nvidia a100 tensor core gpu: Performance and innovation.
IEEE MICRO, 2021.

[40] John Colgrove, John D Davis, John Hayes, Ethan L Miller, Cary Sandvig,
Russell Sears, Ari Tamches, Neil Vachharajani, and Feng Wang. Purity:
Building fast, highly-available enterprise flash storage from commodity
components. In ACM SIGMOD, 2015.

[41] Giovanni De Micheli. Networks on chips. In Design, Automation, and
Test in Europe: The Most Influential Papers of 10 Years Date, pages 105–110.
Springer, 2008.

[42] Alin Deutsch, Yu Xu, Mingxi Wu, and Victor Lee. Tigergraph: A native
mpp graph database. arXiv preprint arXiv:1901.08248, 2019.

[43] Siying Dong, Andrew Kryczka, Yanqin Jin, and Michael Stumm. Rocksdb:
Evolution of development priorities in a key-value store serving large-
scale applications. ACM TOS, 2021.

[44] Facebook, Inc. Rocksdb benchmarking tools. https://github.com/

facebook/rocksdb/wiki/Benchmarking-tools.

133

https://github.com/facebook/rocksdb/wiki/Benchmarking-tools
https://github.com/facebook/rocksdb/wiki/Benchmarking-tools


[45] Alireza Farshin, Luigi Rizzo, Khaled Elmeleegy, and Dejan Kostić. Over-
coming the IOTLB wall for multi-100-gbps linux-based networking. PeerJ
Comput. Sci., 9:e1385, May 2023.

[46] Alireza Farshin, Amir Roozbeh, Gerald Q Maguire Jr, and Dejan Kostić.
Reexamining direct cache access to optimize i/o intensive applications for
multi-hundred-gigabit networks. In USENIX ATC, 2020.

[47] Mario Gerla and Leonard Kleinrock. Flow control: A comparative survey.
In IEEE Transactions on Communications, 1980.

[48] Krishnan Gosakan, Jaehyun Han, William Kuszmaul, Ibrahim N.
Mubarek, Nirjhar Mukherjee, Karthik Sriram, Guido Tagliavini, Evan
West, Michael A. Bender, Abhishek Bhattacharjee, Alex Conway, Mar-
tin Farach-Colton, Jayneel Gandhi, Rob Johnson, Sudarsun Kannan, and
Donald E. Porter. Mosaic pages: Big tlb reach with small pages. In ACM
ASPLOS, 2023.

[49] Donghyun Gouk, Miryeong Kwon, Jie Zhang, Sungjoon Koh, Wonil Choi,
Nam Sung Kim, Mahmut Kandemir, and Myoungsoo Jung. Amber: En-
abling precise full-system simulation with detailed modeling of all ssd
resources. In IEEE MICRO, 2018.

[50] Giulia Guidi, Marquita Ellis, Aydin Buluç, Katherine Yelick, and David
Culler. 10 years later: Cloud computing is closing the performance gap.
In ACM/SPEC ICPE, 2021.

[51] Ajay Gulati, Arif Merchant, Mustafa Uysal, Pradeep Padala, and Peter
Varman. Efficient and adaptive proportional share i/o scheduling. ACM
SIGMETRICS Performance Evaluation Review, 37(2):79–80, 2009.

[52] Ajay Gulati, Arif Merchant, Mustafa Uysal, Pradeep Padala, and Peter
Varman. Workload dependent io scheduling for fairness and efficiency in
shared storage systems. In IEEE HPCA, 2012.

[53] Ajay Gulati and Peter Varman. Lexicographic qos scheduling for parallel
i/o. In ACM SPAA, 2005.

[54] Mingzhe Hao, Huaicheng Li, Michael Hao Tong, Chrisma Pakha, Riza O
Suminto, Cesar A Stuardo, Andrew A Chien, and Haryadi S Gunawi. Mit-
tos: Supporting millisecond tail tolerance with fast rejecting slo-aware os
interface. In ACM SOSP, 2017.

134


[55] Yuchen Hao, Zhenman Fang, Glenn Reinman, and Jason Cong. Sup-
porting address translation for accelerator-centric architectures. In IEEE
HPCA, 2017.

[56] Jian Huang, Anirudh Badam, Laura Caulfield, Suman Nath, Sudipta Sen-
gupta, Bikash Sharma, and Moinuddin K Qureshi. Flashblox: Achieving
both performance isolation and uniform lifetime for virtualized ssds. In
USENIX FAST, 2017.

[57] Jaehyun Hwang, Midhul Vuppalapati, Simon Peter, and Rachit Agarwal.
Rearchitecting linux storage stack for µs latency and high throughput. In
USENIX OSDI, 2021.

[58] Sagar Imambi, Kolla Bhanu Prakash, and GR Kanagachidambaresan. Py-
torch. Programming with TensorFlow: Solution for Edge Computing Applica-
tions, pages 87–104, 2021.

[59] Intel. Intel® data direct i/o technology (intel® ddio): A primer.
https://www.intel.com/content/dam/www/public/us/en/documents/

technology-briefs/data-direct-i-o-technology-brief.pdf, 2012.

[60] Intel. Intel Virtualization Technology for Directed I/O. https:

//www.intel.com/content/www/us/en/content-details/774206/

intel-virtualization-technology-for-directed-i-o-architecture-specification.

html, 2023.

[61] Intel. Intel® 64 and ia-32 architectures software developer manuals.
https://www.intel.com/content/www/us/en/developer/articles/

technical/intel-sdm.html, 2023.

[62] Intel. Intel® memory latency checker. https://www.

intel.com/content/www/us/en/developer/articles/tool/

intelr-memory-latency-checker.html, 2023.

[63] iperf. iperf - the ultimate speed test tool for tcp, udp and sctp. https:
//iperf.fr/, 2023.

[64] Bruce Jacob, David Wang, and Spencer Ng. Memory Systems: Cache,
DRAM, Disk. 2010.

[65] Raj Jain. Congestion control and traffic management in atm networks:

135

https://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/data-direct-i-o-technology-brief.pdf
https://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/data-direct-i-o-technology-brief.pdf
https://www.intel.com/content/www/us/en/content-details/774206/intel-virtualization-technology-for-directed-i-o-architecture-specification.html
https://www.intel.com/content/www/us/en/content-details/774206/intel-virtualization-technology-for-directed-i-o-architecture-specification.html
https://www.intel.com/content/www/us/en/content-details/774206/intel-virtualization-technology-for-directed-i-o-architecture-specification.html
https://www.intel.com/content/www/us/en/content-details/774206/intel-virtualization-technology-for-directed-i-o-architecture-specification.html
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html
https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html
https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html
https://iperf.fr/
https://iperf.fr/


Recent advances and a survey. In Computer Networks and ISDN Systems,
1996.

[66] JanusGraph Contributors. Janusgraph. https://janusgraph.org, 2017.

[67] Tianyang Jiang, Guangyan Zhang, Zican Huang, Xiaosong Ma, Junyu
Wei, Zhiyue Li, and Weimin Zheng. Fusionraid: Achieving consistent
low latency for commodity ssd arrays. In USENIX FAST, 2021.

[68] Rick Jones. Netperf benchmark. http://www.netperf.org/, 2012.

[69] Myoungsoo Jung, Wonil Choi, Miryeong Kwon, Shekhar Srikantaiah,
Joonhyuk Yoo, and Mahmut Taylan Kandemir. Design of a host inter-
face logic for gc-free ssds. IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, 39(8):1674–1687, 2019.

[70] Anuj Kalia, Michael Kaminsky, and David Andersen. Datacenter rpcs can
be general and fast. In USENIX NSDI, 2019.

[71] Jaeho Kim, Donghee Lee, and Sam H Noh. Towards slo complying ssds
through ops isolation. In USENIX FAST, volume 15, pages 183–189, 2015.

[72] Seonbong Kim and Joon-Sung Yang. Optimized i/o determinism for
emerging nvm-based nvme ssd in an enterprise system. In Proceedings
of the 55th Annual Design Automation Conference, pages 1–6, 2018.

[73] Taehun Kim, Hyeongjin Park, Seokmin Lee, Seunghee Shin, Junbeom
Hur, and Youngjoo Shin. Devious: Device-driven side-channel attacks
on the iommu. In IEEE SP, 2023.

[74] Taejin Kim, Duwon Hong, Sangwook Shane Hahn, Myoungjun Chun,
Sungjin Lee, Joo Young Hwang, Jongyoul Lee, and Jihong Kim. Fully au-
tomatic stream management for multi-streamed ssds using program con-
texts. In USENIX FAST, 2019.

[75] Youngjae Kim, Junghee Lee, Sarp Oral, David A Dillow, Feiyi Wang, and
Galen M Shipman. Coordinating garbage collectionfor arrays of solid-
state drives. IEEE Transactions on Computers, 63(4):888–901, 2012.

[76] Ana Klimovic, Heiner Litz, and Christos Kozyrakis. Reflex: Remote flash
= local flash. ACM SIGARCH Computer Architecture News, 45(1):345–359,
2017.

136

https://janusgraph.org


[77] Gautam Kumar, Nandita Dukkipati, Keon Jang, Hassan MG Wassel, Xian
Wu, Behnam Montazeri, Yaogong Wang, Kevin Springborn, Christopher
Alfeld, Michael Ryan, et al. Swift: Delay is simple and effective for con-
gestion control in the datacenter. In ACM SIGCOMM, 2020.

[78] Jaewook Kwak, Sangjin Lee, Kibin Park, Jinwoo Jeong, and Yong Ho
Song. Cosmos+ openssd: Rapid prototype for flash storage systems. ACM
Transactions on Storage (TOS), 16(3):1–35, 2020.

[79] Alexey Lavrov and David Wentzlaff. Hypertrio: hyper-tenant translation
of i/o addresses. In ACM/IEEE ISCA, 2020.

[80] Chang Joo Lee, Veynu Narasiman, Eiman Ebrahimi, Onur Mutlu, and
Yale N Patt. Dram-aware last-level cache writeback: Reducing write-
caused interference in memory systems, 2010.

[81] Donghyuk Lee, Yoongu Kim, Gennady Pekhimenko, Samira Khan, Vivek
Seshadri, Kevin Chang, and Onur Mutlu. Adaptive-latency dram: Opti-
mizing dram timing for the common-case. In 2015 IEEE 21st International
Symposium on High Performance Computer Architecture (HPCA), pages 489–
501, 2015.

[82] Donghyuk Lee, Lavanya Subramanian, Rachata Ausavarungnirun, Jong-
moo Choi, and Onur Mutlu. Decoupled direct memory access: Isolating
cpu and io traffic by leveraging a dual-data-port dram. In IEEE PACT,
2015.

[83] Junghee Lee, Youngjae Kim, Galen M Shipman, Sarp Oral, Feiyi Wang,
and Jongman Kim. A semi-preemptive garbage collector for solid state
drives. In IEEE ISPASS, 2011.

[84] Sang-Won Lee, Bongki Moon, and Chanik Park. Advances in flash mem-
ory ssd technology for enterprise database applications. In ACM SIG-
MOD, 2009.

[85] Huaicheng Li, Martin L Putra, Ronald Shi, Xing Lin, Gregory R Ganger,
and Haryadi S Gunawi. loda: A host/device co-design for strong pre-
dictability contract on modern flash storage. In ACM SOSP, 2021.

[86] Qiang Li, Qiao Xiang, Derui Liu, Yuxin Wang, Haonan Qiu, Gexiao Tian,
Xiaoliang Wang, Lulu Chen, Ridi Wen, Jianbo Dong, et al. From rdma to
rdca: Toward high-speed last mile of data center networks using remote
direct cache access. https://arxiv.org/abs/2211.05975, 2022.

137

https://arxiv.org/abs/2211.05975


[87] Qiang Li, Qiao Xiang, Yuxin Wang, Haohao Song, Ridi Wen, Wenhui Yao,
Yuanyuan Dong, Shuqi Zhao, Shuo Huang, Zhaosheng Zhu, et al. More
than capacity: Performance-oriented evolution of pangu in alibaba. In
USENIX FAST, 2023.

[88] Yuliang Li, Rui Miao, Hongqiang Harry Liu, Yan Zhuang, Fei Feng,
Lingbo Tang, Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Al-
izadeh, et al. Hpcc: High precision congestion control. In ACM SIG-
COMM. 2019.

[89] John DC Little and Stephen C Graves. Little’s law. Building intuition:
insights from basic operations management models and principles, pages 81–
100, 2008.

[90] Hui Lu, Brendan Saltaformaggio, Ramana Kompella, and Dongyan Xu.
vfair: latency-aware fair storage scheduling via per-io cost-based differ-
entiation. In ACM SoCC, 2015.

[91] Moshe Malka, Nadav Amit, Muli Ben-Yehuda, and Dan Tsafrir. riommu:
Efficient iommu for i/o devices that employ ring buffers. In ACM ASP-
LOS, 2015.

[92] Moshe Malka, Nadav Amit, and Dan Tsafrir. Efficient intra-operating sys-
tem protection against harmful dmas. In USENIX FAST, 2015.

[93] Alex Markuze, Adam Morrison, and Dan Tsafrir. True iommu protection
from dma attacks: When copy is faster than zero copy. In ACM ASPLOS,
2016.

[94] Alex Markuze, Igor Smolyar, Adam Morrison, and Dan Tsafrir. Damn:
Overhead-free iommu protection for networking. In ACM ASPLOS, 2018.

[95] Michael Marty, Marc de Kruijf, Jacob Adriaens, Christopher Alfeld, Sean
Bauer, Carlo Contavalli, Michael Dalton, Nandita Dukkipati, William C
Evans, Steve Gribble, et al. Snap: A microkernel approach to host net-
working. In ACM SOSP, 2019.

[96] Matthew Mathis, Jeffrey Semke, Jamshid Mahdavi, and Teunis Ott. The
macroscopic behavior of the tcp congestion avoidance algorithm. In ACM
SIGCOMM CCR, 1997.

138


[97] Mellanox Technologies. Mellanox CX5 Network Interface Card. Product
Information, 2024.

[98] Justin J Miller. Graph database applications and concepts with neo4j. In
Proceedings of the southern association for information systems conference, At-
lanta, GA, USA, volume 2324, pages 141–147, 2013.

[99] Radhika Mittal, Vinh The Lam, Nandita Dukkipati, Emily Blem, Hassan
Wassel, Monia Ghobadi, Amin Vahdat, Yaogong Wang, David Wetherall,
and David Zats. Timely: Rtt-based congestion control for the datacenter.
In ACM SIGCOMM, 2015.

[100] Jeffrey C Mogul and Kadangode K Ramakrishnan. Eliminating receive
livelock in an interrupt-driven kernel. In ACM TOCS, 1997.

[101] Benoı̂t Morgan, Éric Alata, Vincent Nicomette, and Mohamed Kaâniche.
IOMMU protection against I/O attacks: a vulnerability and a proof of
concept. J. Braz. Comput. Soc., 24(1), December 2018.

[102] Benoı̂t Morgan, Éric Alata, Vincent Nicomette, and Mohamed Kaâniche.
Bypassing iommu protection against i/o attacks. In LADC, 2016.

[103] Robert Morris. Tcp behavior with many flows. In IEEE ICNP, 1997.

[104] Rolf Neugebauer, Gianni Antichi, José Fernando Zazo, Yury Audzevich,
Sergio López-Buedo, and Andrew W Moore. Understanding pcie perfor-
mance for end host networking. In Proceedings of the 2018 Conference of the
ACM Special Interest Group on Data Communication, pages 327–341, 2018.

[105] Rolf Neugebauer, Gianni Antichi, José Fernando Zazo, Yury Audzevich,
Sergio López-Buedo, and Andrew W. Moore. Understanding pcie perfor-
mance for end host networking. In ACM SIGCOMM, 2018.

[106] Bo Pang, Erik Nijkamp, and Ying Nian Wu. Deep learning with tensor-
flow: A review. Journal of Educational and Behavioral Statistics, 45(2):227–
248, 2020.

[107] Irma Esmer Papazian. New 3rd gen intel® xeon® scalable processor (co-
dename: Ice lake-sp). In Hot Chips Symposium, pages 1–22, 2020.

[108] Omer Peleg, Adam Morrison, Benjamin Serebrin, and Dan Tsafrir. Utiliz-
ing the iommu scalably. In USENIX ATC, 2015.

139


[109] Binh Pham, Viswanathan Vaidyanathan, Aamer Jaleel, and Abhishek
Bhattacharjee. Colt: Coalesced large-reach tlbs. In IEEE/ACM MICRO,
2012.

[110] Solal Pirelli and George Candea. A simpler and faster NIC driver model
for network functions. In USENIX OSDI, 2020.

[111] Pure Storage. Linux recommended settings. https://support.

purestorage.com/Solutions/Linux/Linux_Reference/Linux_

Recommended_Settings.

[112] Mohammad Rajiullah, Per Hurtig, Anna Brunstrom, Andreas Petlund,
and Michael Welzl. An evaluation of tail loss recovery mechanisms for
tcp. In ACM SIGCOMM CCR, 2015.

[113] K Ramakrishnan, Sally Floyd, and D Black. Rfc3168: The addition of
explicit congestion notification (ecn) to ip. https://datatracker.ietf.
org/doc/html/rfc3168, 2001.

[114] Zebin Ren, Krijn Doekemeijer, Nick Tehrany, and Animesh Trivedi. Bfq,
multiqueue-deadline, or kyber? performance characterization of linux
storage schedulers in the nvme era. 2024.

[115] Scott Rixner, William J Dally, Ujval J Kapasi, Peter Mattson, and John D
Owens. Memory access scheduling. ACM SIGARCH Computer Architec-
ture News, 28(2):128–138, 2000.

[116] Benny Rubin, Saksham Agarwal, and Rachit Agarwal. Fast
and safe memory protection for networked systems. In Draft:
https://saksham.web.illinois.edu/assets/pdf/fns.pdf, 2024.

[117] Salman Salloum, Ruslan Dautov, Xiaojun Chen, Patrick Xiaogang Peng,
and Joshua Zhexue Huang. Big data analytics on apache spark. Interna-
tional Journal of Data Science and Analytics, 1:145–164, 2016.

[118] Debendra Das Sharma. Compute express link (cxl): Enabling heteroge-
neous data-centric computing with heterogeneous memory hierarchy. In
IEEE MICRO, 2022.

[119] David Shue and Michael J Freedman. From application requests to virtual
iops: Provisioned key-value storage with libra. In Proceedings of the Ninth
European Conference on Computer Systems, pages 1–14, 2014.

140

https://support.purestorage.com/Solutions/Linux/Linux_Reference/Linux_Recommended_Settings
https://support.purestorage.com/Solutions/Linux/Linux_Reference/Linux_Recommended_Settings
https://support.purestorage.com/Solutions/Linux/Linux_Reference/Linux_Recommended_Settings
https://datatracker.ietf.org/doc/html/rfc3168
https://datatracker.ietf.org/doc/html/rfc3168
https://saksham.web.illinois.edu/assets/pdf/fns.pdf


[120] Athinagoras Skiadopoulos, Zhiqiang Xie, Mark Zhao, Qizhe Cai, Sak-
sham Agarwal, Jacob Adelmann, David Ahern, Carlo Contavalli, Michael
Goldflam, Vitaly Mayatskikh, et al. High-throughput and flexible host
networking for accelerated computing. In USENIX OSDI, 2024.

[121] Parul Sohal, Michael Bechtel, Renato Mancuso, Heechul Yun, and Orran
Krieger. A closer look at intel resource director technology (rdt). In RTNS,
2022.

[122] Timothy Stamler, Deukyeon Hwang, Amanda Raybuck, Wei Zhang, and
Simon Peter. {zIO}: Accelerating {IO-Intensive} applications with trans-
parent {Zero-Copy}{IO}. In USENIX OSDI, 2022.

[123] Amy Tai, Igor Smolyar, Michael Wei, and Dan Tsafrir. Optimizing storage
performance with calibrated interrupts. ACM TOS, 2022.

[124] Eno Thereska, Hitesh Ballani, Greg O’Shea, Thomas Karagiannis, Antony
Rowstron, Tom Talpey, Richard Black, and Timothy Zhu. Ioflow: A
software-defined storage architecture. In ACM SOSP, 2013.

[125] Jordan Tigani and Siddartha Naidu. Google bigquery analytics. John Wiley
& Sons, 2014.

[126] Shih-Hao Tseng, Saksham Agarwal, Rachit Agarwal, Hitesh Ballani, and
Ao Tang. Codedbulk: Inter-datacenter bulk transfers using network cod-
ing. In USENIX NSDI, 2021.

[127] Mehul Nalin Vora. Hadoop-hbase for large-scale data. In Proceedings of
2011 International Conference on Computer Science and Network Technology,
volume 1, pages 601–605. IEEE, 2011.

[128] Midhul Vuppalapati, Saksham Agarwal, Henry N. Schuh, Baris Kasikci,
Arvind Krishnamurthy, and Rachit Agarwal. Understanding the host net-
work. In ACM SIGCOMM, 2024.

[129] Hao Wang, Chang-Jae Park, Gyung-su Byun, Jung Ho Ahn, and
Nam Sung Kim. Alloy: Parallel-serial memory channel architecture for
single-chip heterogeneous processor systems. In IEEE HPCA, 2015.

[130] WCCFtech. Intel emerald rapids: 5th gen xeon cpu family
specs leaked – up to 64 cores, ddr5-5600, 80 gen 5 lanes, 350w.

141


https://wccftech.com/intel-emerald-rapids-5th-gen-xeon-cpu-family-
specs-leaked-up-to-64-cores-ddr5-5600-80-gen-5-lanes-350w/.

[131] Harald Welte. The netfilter framework in linux 2.4. In Linux-Kongress,
2000.

[132] Chenggang Wu, Vikram Sreekanti, and Joseph M Hellerstein. Autoscaling
tiered cloud storage in anna. VLDB, 2019.

[133] Shiqin Yan, Huaicheng Li, Mingzhe Hao, Michael Hao Tong, Swami-
nathan Sundararaman, Andrew A Chien, and Haryadi S Gunawi. Tiny-
tail flash: Near-perfect elimination of garbage collection tail latencies in
nand ssds. ACM TOS, 2017.

[134] Juncheng Yang, Yao Yue, and KV Rashmi. A large-scale analysis of hun-
dreds of in-memory key-value cache clusters at twitter. ACM TOS, 2021.

[135] Ziye Yang, James R. Harris, Benjamin Walker, Daniel Verkamp, Chang-
peng Liu, Cunyin Chang, Gang Cao, Jonathan Stern, Vishal Verma, and
Luse E. Paul. Spdk: A development kit to build high performance storage
applications. IEEE CloudCom, 2017.

[136] Jie Zhang, Miryeong Kwon, Donghyun Gouk, Sungjoon Koh, Changlim
Lee, Mohammad Alian, Myoungjun Chun, Mahmut Taylan Kandemir,
Nam Sung Kim, Jihong Kim, et al. Flashshare: Punching through server
storage stack from kernel to firmware for ultra-low latency ssds. In
USENIX OSDI, 2018.

[137] Xiangqun Zhang, Shuyi Pei, Jongmoo Choi, and Bryan S. Kim. Excessive
ssd-internal parallelism considered harmful. In ACM HotStorage, 2023.

[138] Yiying Zhang, Leo Prasath Arulraj, Andrea C Arpaci-Dusseau, and
Remzi H Arpaci-Dusseau. De-indirection for flash-based ssds with name-
less writes. In USENIX FAST, 2012.

[139] Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina
Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mo-
hamad Haj Yahia, and Ming Zhang. Congestion control for large-scale
rdma deployments. ACM SIGCOMM, 2015.

142


	Biographical Sketch
	Dedication
	Acknowledgements
	Table of Contents
	List of Figures
	Introduction
	The Host Network
	Emergence of Host Congestion
	The Root Cause of Host Congestion
	Impact of Host Congestion
	Implications to Design of Network Protocols & OS
	Thesis and Key Contributions
	Acknowledgements

	Understanding the Impact of Host Congestion
	Host Network Datapath
	Network (NIC-to-CPU/memory) Datapath
	Storage (SSD-to-CPU/memory) Datapath
	Datapath with Memory Protection Enabled

	Host Congestion Impact on Networking Applications
	Impact of Memory Interconnect Contention
	Impact of IOMMU Contention

	Host Congestion Impact on Storage Applications
	Summary

	Rearchitecting Congestion Control
	Overview
	hostCC
	Host congestion signals
	Host-local congestion response at sub-RTT granularity
	Network resource allocation at RTT granularity

	hostCC Implementation
	Host congestion signals
	Host-local congestion response
	Network resource allocation

	hostCC Evaluation
	hostCC benefits
	hostCC results with DDIO enabled
	hostCC sensitivity analysis
	Deep dive into hostCC performance

	Summary

	Rethinking IO Memory Protection
	Overview
	F&S
	F&S Implementation
	F&S Evaluation
	F&S Benefits
	Necessity of each design idea in F&S

	Summary

	Rearchitecting Storage Stacks
	Overview
	storageCC
	storageCC Overview
	Detecting host network bottlenecks
	Converging to the right amount of parallelism

	storageCC Implementation
	storageCC Evaluation
	storageCC benefits
	RocksDB with storageCC
	storageCC Overheads
	storageCC Sensitivity Analysis
	Understanding storageCC microscopic behavior
	Fair resource allocation using storageCC

	Summary

	Related Work
	Managing Host Resource Contention
	Efficient IO Memory Protection
	Mitigating Bottlenecks in Storage Datapath

	Conclusion and Outlook
	Future Directions

	Bibliography