BAD-Check§: Bulk Asynchronous Distributed Checkpointing

  • John Bent
  • Brad Settlemyer
  • Haiyun Bao
  • Sorin Faibish
  • Jeremy Sauer
  • Jingwang Zhang
    • Los Alamos National Laboratory

    Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

    1 Scopus citation

    Abstract

    Leadership-scale scientific simulations running as tens of thousands of tightly-coupled MPI processes are vulnerable to interruption due to a single process or node failure. Due to the dependence of each state calculation on the successful completion of each of the prior state calculations, checkpoint restart is the most widely-used technique to achieve fault tolerance. To write a consistent view of distributed state as a checkpoint, applications typically synchronize and pause while writing data to persistent media. In this paper we present a transactional protocol that enables asynchronous distributed creation of checkpoint data sets, and describe the conditions under which it is beneficial. With simulations, we demonstrate that scientific applications exhibiting computational variance without frequent synchronization can use our protocol to either reduce run time by up to 27% or reduce required storage system capability by up to 40%.
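To make the abstract's central claim concrete, the following toy model compares a synchronized checkpoint schedule against an unsynchronized one when per-process compute time varies. It is only an illustrative sketch and not the paper's protocol or simulator: the function name `simulate`, its parameters, the Gaussian jitter model, and the assumption that each process writes its checkpoint in a fixed time with no storage contention are all hypothetical choices made for this example.

```python
import random


def simulate(num_procs=64, num_intervals=20, mean_compute=10.0,
             jitter=2.0, write_time=1.0, seed=0):
    """Return (sync_time, async_time) for a toy checkpointing model.

    Assumptions (not taken from the paper): compute time per interval is
    Gaussian, every checkpoint write costs a fixed write_time, and the
    storage system is never a bottleneck.
    """
    rng = random.Random(seed)
    # compute[p][i] = time process p spends computing in interval i
    compute = [
        [max(0.0, rng.gauss(mean_compute, jitter)) for _ in range(num_intervals)]
        for _ in range(num_procs)
    ]

    # Synchronous: all processes wait at a barrier before each checkpoint,
    # so every interval costs as much as its slowest process.
    sync_time = sum(
        max(compute[p][i] for p in range(num_procs)) + write_time
        for i in range(num_intervals)
    )

    # Asynchronous: each process checkpoints as soon as it reaches the
    # checkpoint step; the run ends when the slowest process overall ends.
    async_time = max(
        sum(compute[p]) + num_intervals * write_time
        for p in range(num_procs)
    )
    return sync_time, async_time


if __name__ == "__main__":
    sync_time, async_time = simulate()
    saving = 100.0 * (sync_time - async_time) / sync_time
    print(f"synchronous:  {sync_time:.1f}")
    print(f"asynchronous: {async_time:.1f}")
    print(f"run time saved in this toy model: {saving:.1f}%")
```

In this model the asynchronous run can never be slower, because the maximum over processes of a sum of interval times is at most the sum of the per-interval maxima. With `jitter=0.0` the two schedules take the same time, which matches the abstract's condition that the benefit depends on computational variance. The percentages printed here are properties of the toy model only and are not comparable to the 27% and 40% figures reported in the paper.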

    Original language: English
    Title of host publication: Proceedings of PDSW 2015
    Subtitle of host publication: 10th Parallel Data Storage Workshop - Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis
    Publisher: Association for Computing Machinery
    Pages: 19-24
    Number of pages: 6
    ISBN (Electronic): 9781450340083
    DOIs
    State: Published - Nov 15 2015
    Event: 10th Parallel Data Storage Workshop, PDSW 2015 - Held as part of the 27th ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015 - Austin, United States
    Duration: Nov 16 2015 to Nov 20 2015

    Publication series

    Name: Proceedings of PDSW 2015: 10th Parallel Data Storage Workshop - Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis

    Conference

    Conference: 10th Parallel Data Storage Workshop, PDSW 2015 - Held as part of the 27th ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015
    Country/Territory: United States
    City: Austin
    Period: 11/16/15 to 11/20/15
