dc.contributor.author: Babaoglu, Ozalp [en_US]
dc.description.abstract: The scale of parallel computing systems is rapidly approaching dimensions where fault tolerance can no longer be ignored. No matter how reliable the individual components may be, the complexity of these systems results in a significant probability of failure during lengthy computations. In the case of distributed memory multiprocessors, fault tolerance techniques developed for distributed operating systems and applications can be applied also to parallel computations. In this paper we survey some of the principal paradigms for fault-tolerant distributed computing and discuss their relevance to parallel processing. One particular technique - passive replication - is explored in detail as it forms the basis for fault tolerance in the Paralex parallel programming environment. Keywords: Parallel processing, reliability, transactions, checkpointing, recovery, replication, reliable broadcast, causal ordering, Paralex. [en_US]
dc.format.extent: 1915454 bytes
dc.format.extent: 383946 bytes
dc.publisher: Cornell University [en_US]
dc.subject: computer science [en_US]
dc.subject: technical report [en_US]
dc.title: Tools and Techniques for Adding Fault Tolerance to Distributed and Parallel Programs [en_US]
dc.type: technical report [en_US]
