Tools and Techniques for Adding Fault Tolerance to Distributed and Parallel Programs
The scale of parallel computing systems is rapidly approaching dimensions where fault tolerance can no longer be ignored. No matter how reliable the individual components may be, the complexity of these systems results in a significant probability of failure during lengthy computations. In the case of distributed-memory multiprocessors, fault tolerance techniques developed for distributed operating systems and applications can also be applied to parallel computations. In this paper we survey some of the principal paradigms for fault-tolerant distributed computing and discuss their relevance to parallel processing. One particular technique, passive replication, is explored in detail, as it forms the basis for fault tolerance in the Paralex parallel programming environment.
Keywords: Parallel processing, reliability, transactions, checkpointing, recovery, replication, reliable broadcast, causal ordering, Paralex.