BAD-Check: Bulk Asynchronous Distributed Checkpointing

John Bent, Brad Settlemyer, Haiyun Bao, Sorin Faibish, Jeremy Sauer, Jingwang Zhang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Leadership-scale scientific simulations running as tens of thousands of tightly-coupled MPI processes are vulnerable to interruption due to a single process or node failure. Due to the dependence of each state calculation on the successful completion of each of the prior state calculations, checkpoint restart is the most widely-used technique to achieve fault tolerance. To write a consistent view of distributed state as a checkpoint, applications typically synchronize and pause while writing data to persistent media. In this paper we present a transactional protocol that enables asynchronous distributed creation of checkpoint data sets, and describe the conditions under which it is beneficial. With simulations, we demonstrate that scientific applications exhibiting computational variance without frequent synchronization can use our protocol to either reduce run time by up to 27% or reduce required storage system capability by up to 40%.
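The trade-off described in the abstract can be illustrated with a toy model. The sketch below is not the authors' protocol or their simulator; it is a minimal illustration under stated assumptions: per-step compute times are drawn uniformly at random, storage bandwidth is unlimited, ranks never synchronize except at a synchronous checkpoint, and an asynchronous checkpoint counts as committed once every rank has written its piece. The function name `simulate` and all of its parameters are hypothetical.

```python
import random


def simulate(num_ranks, num_steps, ckpt_every, write_time, seed=0):
    """Toy comparison of synchronous and asynchronous checkpointing.

    Assumptions (illustrative only, not taken from the paper):
    compute times are uniform in [0.5, 1.5], storage is never a
    bottleneck, and ranks do not synchronize between checkpoints.
    Returns (sync_runtime, async_runtime, async_commit_times).
    """
    rng = random.Random(seed)
    # compute[r][s] is the time rank r spends on step s.
    compute = [[rng.uniform(0.5, 1.5) for _ in range(num_steps)]
               for _ in range(num_ranks)]

    # Synchronous: every rank waits at a barrier for the slowest rank,
    # then all ranks write their checkpoint data together.
    sync_clock = [0.0] * num_ranks
    for s in range(num_steps):
        for r in range(num_ranks):
            sync_clock[r] += compute[r][s]
        if (s + 1) % ckpt_every == 0:
            barrier = max(sync_clock)
            sync_clock = [barrier + write_time] * num_ranks
    sync_runtime = max(sync_clock)

    # Asynchronous: each rank writes as soon as it reaches the
    # checkpoint step and continues computing without a barrier.
    async_clock = [0.0] * num_ranks
    write_done = {}  # checkpoint id -> per-rank write completion times
    for s in range(num_steps):
        for r in range(num_ranks):
            async_clock[r] += compute[r][s]
            if (s + 1) % ckpt_every == 0:
                async_clock[r] += write_time
                write_done.setdefault(s + 1, []).append(async_clock[r])
    async_runtime = max(async_clock)

    # A checkpoint is treated as committed only after the last rank
    # has finished writing its piece.
    commit_times = [max(times) for _, times in sorted(write_done.items())]
    return sync_runtime, async_runtime, commit_times


if __name__ == "__main__":
    sync_t, async_t, commits = simulate(8, 20, 5, 0.2, seed=1)
    print(f"synchronous run time:  {sync_t:.2f}")
    print(f"asynchronous run time: {async_t:.2f}")
    print("asynchronous commit times:", [round(t, 2) for t in commits])
```

In this model the synchronous run pays for the slowest rank at every checkpoint interval, while the asynchronous run pays only for the slowest rank overall, so the gap grows with computational variance. If the application itself synchronized frequently, the barrier cost would be paid regardless, which matches the condition stated in the abstract.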

Original language: English
Title of host publication: Proceedings of PDSW 2015
Subtitle of host publication: 10th Parallel Data Storage Workshop - Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Association for Computing Machinery, Inc
Pages: 19-24
Number of pages: 6
ISBN (Electronic): 9781450340083
State: Published - Nov 15 2015
Event: 10th Parallel Data Storage Workshop, PDSW 2015 - Austin, United States
Duration: Nov 16 2015 → …

Publication series

Name: Proceedings of PDSW 2015: 10th Parallel Data Storage Workshop - Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 10th Parallel Data Storage Workshop, PDSW 2015
Country/Territory: United States
City: Austin
Period: 11/16/15 → …
