Achieving Critical Reliability With Unreliable Components and Unreliable Glue
Hayden, Mark; Birman, Kenneth P.
Even the most aggressive quality assurance procedures yield at best probabilistic confidence in the reliability of complex systems. Distributed systems, because of their large numbers of components, are enormously complex engineering artifacts, and hence may appear to be inherently unreliable -- despite the best efforts of researchers and developers. A cellular distributed systems architecture offers the hope of drastically improving the reliability of current technologies in settings where reliability is critical. The approach combines a stateful style of distributed computing within cells with a loosely coupled probabilistic inter-cell computing model based on a probabilistic broadcast primitive. We give an implementation of this primitive, called pbcast, and demonstrate how to use it to implement this methodology. Our approach is compatible with the use of popular distributed computing and reliability technologies, while offering considerable isolation against the spread of failures among cells.
computer science; technical report