Techniques for Simplifying the Programming of Distributed Systems
It is difficult to design and verify distributed programs that execute correctly despite transient processor failures, or despite variable and unpredictable processor speeds and message transmission times. In this thesis, we describe a checkpointing/rollback mechanism that allows programmers to write distributed programs with the simplifying assumption that processors do not fail, and then run these programs correctly on systems with transient processor failures. We also describe a translation mechanism that can be used to write programs with the simplifying assumptions that processors execute in synchronized steps and messages take exactly one step to arrive, and then run these programs correctly on systems that violate these assumptions. Both mechanisms are transparent to the programmer, and they can be applied to solve a large class of problems.
computer science; technical report