Determining the Last Process to Fail
A total failure occurs whenever all processes cooperatively executing a distributed task fail before the task's completion. A frequent prerequisite for recovery from a total failure is the identification of the last group (LAST) of processes concurrently failing. Herein, we derive necessary and sufficient conditions for computing LAST from the local failure data of recovered processes. These conditions are easily translated into decision procedures for LAST membership using either complete or incomplete failure data. The choice of failure data itself is dictated by two requirements: (1) it can be cheaply maintained, and (2) maximum fault-tolerance is afforded in the sense that the expected number of recoveries required for identifying LAST is minimized.
computer science; technical report