Reliable Communication for Datacenters
Datacenter platforms have dominated the systems landscape over the last decade, offering applications the promise of scalability, availability and responsiveness at very low cost. Delivering on this promise is a significant research challenge: datacenters consist of thousands of inexpensive fault-prone components, running commodity operating systems and protocols ill-fitted for high-performance applications. Further, datacenter applications have unconventional scaling requirements and bursty workloads that frequently push systems into delays and downtime. This thesis seeks to provide systems with low-latency primitives for reliable communication that are fundamentally scalable and robust to faults and attacks. Our focus is on the design and implementation of two protocols: Maelstrom and Ricochet. Maelstrom is a transparent network appliance for reliable and rapid communication over high-speed optical networks between datacenters. Ricochet is a low-latency messaging layer for clustered applications running within datacenters. An important aspect of these two protocols is the use of proactive fault-handling techniques such as Forward Error Correction (FEC) and gossip to achieve low delays and stable performance. Reactive protocols do too much too late, imposing extra delays and overheads that often send systems into spirals of degrading performance. In contrast, proactive protocols recover from faults almost instantly and impose stable, predictable overheads that prevent transient overloads and failures from translating into application unavailability. Both Maelstrom and Ricochet use fast and simple XOR operations in novel ways that allow datacenter applications to scale in new and vital dimensions.
In particular, they create XORs at strategic points in the network (respectively, within an appliance and at multicast receivers) and from different data channels to obtain excellent recovery and latency properties. Together, these protocols enable the development of highly available applications that coordinate within and across datacenters while maintaining scalable and robust responsiveness.
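As a rough illustration of why XOR-based FEC allows recovery without retransmission, the following minimal sketch builds one repair packet as the XOR of a group of equal-length data packets and uses it to rebuild a single lost packet. It is a generic single-parity example, not the Maelstrom or Ricochet encoding; the function names (`make_repair`, `recover`) and the four-packet group are assumptions made for illustration.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("packets must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def make_repair(packets):
    """Hypothetical helper: XOR all packets in a group into one repair packet."""
    return reduce(xor_bytes, packets)


def recover(received, repair):
    """Rebuild the single missing packet (marked None) from the others plus the repair packet."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) != 1:
        raise ValueError("one XOR repair packet can recover exactly one loss")
    present = [p for p in received if p is not None]
    # XORing the repair packet with every surviving packet cancels them out,
    # leaving only the lost packet.
    return missing[0], reduce(xor_bytes, present, repair)


# Illustrative group of four equal-length packets (made-up payloads).
group = [b"pkt0", b"pkt1", b"pkt2", b"pkt3"]
repair = make_repair(group)

# Simulate losing the second packet in transit.
received = [group[0], None, group[2], group[3]]
index, data = recover(received, repair)
```

The repair packet is sent proactively alongside the data, so the receiver can repair a loss locally as soon as the rest of the group arrives. A single XOR covers only one loss per group, which is why combining XORs from different points or channels, as the abstract describes, matters for recovery from bursts.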
AFOSR, AFRL, DARPA, NSF, Intel Corporation
networks; protocols; datacenter; reliability
dissertation or thesis