Multi-criteria Checkpointing Strategies: Response-Time versus Resource Utilization

Title: Multi-criteria Checkpointing Strategies: Response-Time versus Resource Utilization
Publication Type: Conference Paper
Year of Publication: 2013
Authors: Bouteiller, A., F. Cappello, J. Dongarra, A. Guermouche, T. Herault, and Y. Robert
Conference Name: Euro-Par 2013
Date Published: 2013-08
Publisher: Springer
Conference Location: Aachen, Germany
Abstract

Failures increasingly threaten the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its limits at such scales. One reason for this unnerving situation is the focus that has been given to per-application completion time rather than to platform efficiency. In this paper, we discuss the case of uncoordinated rollback recovery where the idle time spent waiting for recovering processors is used to make progress on a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that can discriminate between idle time and wasted computation. We instantiate this model in a simulator to demonstrate that, with this strategy, the per-application completion time of uncoordinated checkpointing is unchanged, while it delivers near-perfect platform efficiency.
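
The distinction the abstract draws between idle time and wasted computation can be illustrated with a toy accounting sketch. This is not the paper's model or simulator: the single-failure scenario, the class `FailureScenario`, the function `platform_efficiency`, and all parameter values are assumptions made purely for illustration. The sketch assumes that after one failure the failed processor spends `recovery` seconds restoring and re-executing lost work (wasted computation), while every other processor of the application waits for the same duration (idle time) unless that time is given to another job.

```python
from dataclasses import dataclass


@dataclass
class FailureScenario:
    """Hypothetical single-failure scenario (illustrative, not from the paper)."""
    processors: int   # processors allocated to the application
    window: float     # observed wall-clock time, in seconds
    recovery: float   # seconds the failed processor spends recovering


def platform_efficiency(s: FailureScenario, backfill: bool) -> float:
    """Fraction of processor-seconds spent on useful work.

    Wasted computation: the failed processor redoes lost work.
    Idle time: the other processors wait for it to catch up.
    With backfill, idle processor time is assumed to be fully used by an
    independent application, so only wasted computation counts as a loss.
    """
    total = s.processors * s.window
    wasted = s.recovery
    idle = (s.processors - 1) * s.recovery
    lost = wasted if backfill else wasted + idle
    return (total - lost) / total


scenario = FailureScenario(processors=100, window=1000.0, recovery=10.0)
without_backfill = platform_efficiency(scenario, backfill=False)
with_backfill = platform_efficiency(scenario, backfill=True)
```

In this toy setting the application finishes at the same time in both cases, since the failed processor needs the same recovery time either way; only the accounting of what the waiting processors do changes. Context-switch overheads and multiple failures are ignored here.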