Portable Checkpointing for Parallel Applications
Abstract
High Performance Computing (HPC) systems represent the peak of modern computational capability. As
ever-increasing needs for computational power have fuelled the demand for ever-larger computing systems,
modern HPC systems have grown to incorporate hundreds, thousands, or as many as 130,000 processors. At these
scales, the huge number of individual components in a single system makes the probability that a single
component will fail quite high, with today's large HPC systems featuring mean times between failures on the
order of hours or a few days. As many modern computational tasks require days or months to complete, fault
tolerance becomes critical to HPC system design.
The past three decades have seen a significant amount of research on fault tolerance for parallel systems. However,
because most of it has either been theoretical or has focused on low-level solutions that are embedded into a
particular operating system or type of hardware, this work has had little impact on real HPC systems. This
thesis attempts to address this lack of impact by describing a high-level approach for implementing
checkpoint/restart functionality that decouples the fault tolerance solution from the details of the
operating system, system libraries and the hardware and instead connects it to the APIs implemented by the
above components. The resulting solution enables applications that use these APIs to become
self-checkpointing and self-restarting regardless of the software/hardware platform that may implement
the APIs.
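
As a rough illustration of what "self-checkpointing and self-restarting" can mean at the application level, the following minimal Python sketch saves a program's own state through ordinary file APIs and resumes from it after a failure. It is not the thesis's system or protocol: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the pickle file format, the toy summation workload, and the `fail_at` failure injection are all assumptions made for this example, and it covers only a single sequential process rather than an MPI or OpenMP application.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic when source and destination are on one filesystem,
    # so a crash never leaves a half-written checkpoint at `path`.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Toy workload (hypothetical): sum the integers 0..total_steps-1.

    The state is restored from `path` if a checkpoint exists and saved every
    `interval` steps. `fail_at` injects a simulated failure at that step.
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Because the sketch relies only on portable file and serialization calls, nothing in it depends on a particular operating system or hardware platform. Calling `run` again with the same path after a failure resumes from the last saved step instead of starting over; work done after that checkpoint is simply recomputed.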
The particular focus of this thesis is on the problem of checkpoint/restart of parallel applications. It
presents two theoretical checkpointing protocols, one for the message passing communication model and one for
the shared memory model. The former is the first protocol to be compatible with application-level
checkpointing of individual processes, while the latter is the first protocol that is compatible with
arbitrary shared memory models, APIs, implementations and consistency protocols. These checkpointing
protocols are used to implement checkpointing systems for applications that use the MPI and OpenMP parallel
APIs, respectively, and are the first to provide checkpoint/restart for arbitrary implementations of these
popular APIs. Both checkpointing systems are extensively evaluated on multiple software/hardware platforms
and are shown to feature low overheads.
Date Issued
2006-08-30T13:44:20Z
Keywords
Computer Science; Fault Tolerance
Types
dissertation or thesis