Building a Virtually Fault-Tolerant System
Bressoud, Thomas C.
All schemes for implementing fault-tolerance involve some form of replication. Replicas are assumed to fail independently, and each replica performs the same computation. Replicas may execute in parallel or, in the case of primary-backup protocols, in response to failures. Replication only works, however, if replicas are coordinated. Each replica must receive the same inputs in the same order, and each must be deterministic in its response to these inputs. The key engineering issue that the designer of a fault-tolerant computing system must address is deciding where in the system to implement replica coordination. Alternatives include implementing replica coordination in the processor or network hardware, in the operating system, or in the application software. A new solution is to implement replica coordination by augmenting the hypervisor of a virtual-machine manager and coordinating a primary virtual machine with its backup. This hypervisor-based fault-tolerance is transparent to the operating system and the application programs executing above the hypervisor. In addition, this choice allows a single hypervisor design to be used for all processors in an architectural family. In this dissertation, we describe the protocols to implement hypervisor-based fault-tolerance. To assess the practicality of the approach, we constructed a prototype system for HP's PA-RISC architecture. The prototype hypervisor supports a single HP-UX virtual machine and implements the replica-coordination protocols. The prototype hypervisor has been instrumented to measure the overhead of all hypervisor-based activity. We have measured the performance of CPU-intensive and disk-I/O-intensive workloads on this architecture and have built models that allow us to predict the performance of some alternative architectures.
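The coordination requirement stated in the abstract (same inputs, same order, deterministic response) can be illustrated with a minimal sketch. This is not the dissertation's hypervisor protocol; the `Replica` class, the `add`/`mul` operations, and `run_primary_backup` are hypothetical names chosen only to show why ordering and determinism together keep a backup in step with its primary.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical deterministic state machine: state depends only on the ordered inputs."""
    state: int = 0
    log: list = field(default_factory=list)

    def apply(self, seq, op, value):
        # Enforce in-order delivery: input number `seq` must be the next one expected.
        if seq != len(self.log):
            raise ValueError(f"out-of-order input {seq}, expected {len(self.log)}")
        # Deterministic response: no clocks, randomness, or other hidden inputs.
        if op == "add":
            self.state += value
        elif op == "mul":
            self.state *= value
        else:
            raise ValueError(f"unknown operation {op!r}")
        self.log.append((seq, op, value))


def run_primary_backup(inputs):
    """Deliver every input to both replicas with the same sequence number."""
    primary, backup = Replica(), Replica()
    for seq, (op, value) in enumerate(inputs):
        primary.apply(seq, op, value)
        backup.apply(seq, op, value)
    return primary, backup


def run_single(inputs):
    """Run one replica over a list of inputs, for comparison."""
    replica = Replica()
    for seq, (op, value) in enumerate(inputs):
        replica.apply(seq, op, value)
    return replica


if __name__ == "__main__":
    inputs = [("add", 5), ("mul", 2), ("add", 1)]
    primary, backup = run_primary_backup(inputs)
    print(primary.state == backup.state)

    # The same inputs in a different order lead to a different state,
    # which is why replicas must agree on order as well as content.
    reordered = [inputs[1], inputs[0], inputs[2]]
    print(run_single(reordered).state == primary.state)
```

In this sketch the two replicas end in the same state because they see identical inputs in identical order, while a replica fed the reordered inputs diverges. A real system also has to handle nondeterministic events such as interrupts, which this toy example deliberately leaves out.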
computer science; technical report