Portable Checkpointing for Parallel Applications
High Performance Computing (HPC) systems represent the peak of modern computational capability. As ever-increasing demands for computational power have fuelled the demand for ever-larger computing systems, modern HPC systems have grown to incorporate hundreds, thousands or as many as 130,000 processors. At these scales, the huge number of individual components in a single system makes the probability that a single component will fail quite high, with today's large HPC systems featuring mean times between failures on the order of hours or a few days. As many modern computational tasks require days or months to complete, fault tolerance becomes critical to HPC system design. The past three decades have seen significant amounts of research on parallel system fault tolerance. However, as most of it has been either theoretical or has focused on low-level solutions that are embedded into a particular operating system or type of hardware, this work has had little impact on real HPC systems. This thesis attempts to address this lack of impact by describing a high-level approach for implementing checkpoint/restart functionality that decouples the fault tolerance solution from the details of the operating system, system libraries and the hardware and instead connects it to the APIs implemented by the above components. The resulting solution enables applications that use these APIs to become self-checkpointing and self-restarting regardless of the the software/hardware platform that may implement the APIs. The particular focus of this thesis is on the problem of checkpoint/restart of parallel applications. It presents two theoretical checkpointing protocols, one for the message passing communication model and one for the shared memory model. The former is the first protocol to be compatible with application-level checkpointing of individual processes, while the latter is the first protocol that is compatible with arbitrary shared memory models, APIs, implementations and consistency protocols. These checkpointing protocols are used to implement checkpointing systems for applications that use the MPI and OpenMP parallel APIs, respectively, and are first in providing checkpoint/restart to arbitrary implementations of these popular APIs. Both checkpointing systems are extensively evaluated on multiple software/hardware platforms and are shown to feature low overheads.
Computer Science; Fault Tolerance
dissertation or thesis