Fault Tolerance Techniques for High-performance Computing

Title: Fault Tolerance Techniques for High-performance Computing
Publication Type: Tech Report
Year of Publication: 2015
Authors: Dongarra, J., T. Herault, and Y. Robert
Technical Report Series Title: University of Tennessee Computer Science Technical Report (also LAWN 289)
Number: UT-EECS-15-734
Date Published: 2015-05
Institution: University of Tennessee
Abstract: This report provides an introduction to resilience methods. The emphasis is on checkpointing, the de facto standard technique for resilience in High Performance Computing. We present the two main protocols, namely coordinated checkpointing and hierarchical checkpointing. Then we introduce performance models and use them to assess the performance of these protocols. We cover the Young/Daly formula for the optimal period and much more! Next we explain how the efficiency of checkpointing can be improved via fault prediction or replication. Then we move to application-specific methods, such as ABFT. We conclude the report by discussing techniques to cope with silent errors (or silent data corruption).
URL: http://www.netlib.org/lapack/lawnspdf/lawn289.pdf