Cornell University
Library
Cornell UniversityLibrary

eCommons

Help
Log In(current)
  1. Home
  2. Cornell Computing and Information Science
  3. Computer Science
  4. Computer Science Technical Reports
  5. Tools and Techniques for Adding Fault Tolerance to Distributed and Parallel Programs

Tools and Techniques for Adding Fault Tolerance to Distributed and Parallel Programs

File(s)
91-1249.ps (374.95 KB)
91-1249.pdf (1.83 MB)
Permanent Link(s)
https://hdl.handle.net/1813/7089
Collections
Computer Science Technical Reports
Author
Babaoglu, Ozalp
Abstract

The scale of parallel computing systems is rapidly approaching dimensions where fault tolerance can no longer be ignored. No matter how reliable the individual components may be, the complexity of these systems results in a significant probability of failure during lengthy computations. In the case of distributed memory multiprocessors, fault tolerance techniques developed for distributed operating systems and applications can be applied also to parallel computations. In this paper we survey some of the principal paradigms for fault-tolerant distributed computing and discuss their relevance to parallel processing. One particular technique - passive replication - is explored in detail as it forms the basis for fault tolerance in the Paralex parallel programming environment. Keywords: Parallel processing, reliability, transactions, checkpointing, recovery, replication, reliable broadcast, causal ordering, Paralex.

Date Issued
1991-12
Publisher
Cornell University
Keywords
computer science
•
technical report
Previously Published as
http://techreports.library.cornell.edu:8081/Dienst/UI/1.0/Display/cul.cs/TR91-1249
Type
technical report

Site Statistics | Help

About eCommons | Policies | Terms of use | Contact Us

copyright © 2002-2026 Cornell University Library | Privacy | Web Accessibility Assistance