Fine-grained Bit-Flip Protection for Relaxation Methods

Title: Fine-grained Bit-Flip Protection for Relaxation Methods
Publication Type: Journal Article
Year of Publication: 2016
Authors: Anzt, H., J. Dongarra, and E. S. Quintana-Orti
Journal: Journal of Computational Science
Date Published: 2016-11
Keywords: Bit flips, Fault tolerance, High Performance Computing, iterative solvers, Jacobi method, sparse linear systems
Abstract: Resilience is considered a challenging under-addressed issue that the high performance computing community (HPC) will have to face in order to produce reliable Exascale systems by the beginning of the next decade. As part of a push toward a resilient HPC ecosystem, in this paper we propose an error-resilient iterative solver for sparse linear systems based on stationary component-wise relaxation methods. Starting from a plain implementation of the Jacobi iteration, our approach introduces a low-cost component-wise technique that detects bit-flips, rejecting some component updates, and turning the initial synchronized solver into an asynchronous iteration. Our experimental study with sparse incomplete factorizations from a collection of real-world applications, and a practical GPU implementation, exposes the convergence delay incurred by the fault-tolerant implementation and its practical performance.
DOI: 10.1016/j.jocs.2016.11.013