Post-failure recovery of MPI communication capability: Design and rationale

Title: Post-failure recovery of MPI communication capability: Design and rationale
Publication Type: Journal Article
Year of Publication: 2013
Authors: Bland, W., A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra
Journal: International Journal of High Performance Computing Applications
Volume: 27
Issue: 3
Pagination: 244-254
Date Published: 2013-01
ISSN: 1094-3420
Abstract

As supercomputers are entering an era of massive parallelism where the frequency of faults is increasing, the MPI Standard remains distressingly vague on the consequences of failures for MPI communications. Advanced fault-tolerance techniques have the potential to prevent full-scale application restart and therefore lower the cost incurred for each failure, but they demand from MPI the capability to detect failures and resume communications afterward. In this paper, we present a set of extensions to MPI that allow communication capabilities to be restored, while maintaining the extreme level of performance to which MPI users have become accustomed. The motivations behind the design choices are weighed against alternatives, a task that requires simultaneously considering MPI from the viewpoint of both the user and the implementor. The usability of the interfaces for expressing advanced recovery techniques is then discussed, including the difficult issue of enabling separate software layers to coordinate their recovery.

URL: http://hpc.sagepub.com/cgi/doi/10.1177/1094342013488238
DOI: 10.1177/1094342013488238
Short Title: International Journal of High Performance Computing Applications