16. 05. 2014

A feasibility study of checkpoint/restart as a fault tolerance technique

Faisal Shahzad

Universität Erlangen-Nürnberg

The sustained performance growth in High Performance Computing (HPC) is driven by a high level of parallelism at the component level. Although the mean time to failure (MTTF) of individual components has increased over time, it has not kept pace with the growth in the number of components. As a result, the MTTF of overall systems is dropping. Today, the MTTF of large systems is in the range of days. At the Exascale level, it is predicted to be on the order of minutes to hours. This has raised significant concern in the HPC community.
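The scaling argument above can be illustrated with a small sketch. It assumes that components fail independently with exponentially distributed failure times, so that component failure rates add up and the system MTTF is the component MTTF divided by the number of components. The 10-year component MTTF and the component counts are illustrative assumptions, not figures from the talk.

```python
def system_mttf_hours(component_mttf_hours: float, n_components: int) -> float:
    """MTTF of a system that fails as soon as any one component fails.

    Assumes independent components with exponentially distributed
    failure times, so the system failure rate is the sum of the
    component failure rates.
    """
    if n_components < 1:
        raise ValueError("n_components must be at least 1")
    return component_mttf_hours / n_components


# Hypothetical component MTTF of 10 years, expressed in hours.
COMPONENT_MTTF_HOURS = 10 * 365 * 24

for n in (1_000, 100_000, 1_000_000):
    hours = system_mttf_hours(COMPONENT_MTTF_HOURS, n)
    print(f"{n:>9} components: system MTTF = {hours:.2f} h")
```

Under these assumed values, a system of 1,000 components has an MTTF of several days, while a system of 1,000,000 components has an MTTF of about five minutes.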

Checkpoint/restart has classically been a popular technique for fault tolerance due to its ease of implementation. However, it is often criticized as inefficient because of its large overhead, and it is mostly not considered feasible at the Exascale level. This talk presents an overview of various existing fault tolerance techniques that are mainly based on the checkpoint/restart approach. Within the scope of the ESSEX project, several ideas are investigated to reduce the checkpointing overhead to an acceptable level while keeping the implementation user-friendly. These include asynchronous, node-level, and neighbour-level checkpointing. Benchmarks are carried out using different algorithms to compare these optimization strategies with the naïve approaches. The results suggest that, if suitable optimization techniques are employed, checkpoint/restart can remain a feasible technique that incurs minimal overhead on the application at the Exascale level.
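As a rough illustration of the asynchronous idea mentioned above, the following sketch copies the application state in memory and writes it to disk in a background thread while the computation continues. It is not the ESSEX implementation: the class `AsyncCheckpointer`, the function `run`, and the state layout are hypothetical names chosen for this example, and a real HPC code would typically combine this with MPI and node-local storage.

```python
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical helper: writes checkpoints in a background thread."""

    def __init__(self, path):
        self.path = path
        self._writer = None

    def save(self, state):
        self.wait()  # allow at most one checkpoint write in flight
        snapshot = pickle.dumps(state)  # synchronous in-memory copy
        self._writer = threading.Thread(target=self._write, args=(snapshot,))
        self._writer.start()  # disk I/O now overlaps with computation

    def _write(self, snapshot):
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as f:
            f.write(snapshot)
            f.flush()
            os.fsync(f.fileno())
        # Rename last, so a crash never leaves a half-written checkpoint.
        os.replace(tmp, self.path)

    def wait(self):
        if self._writer is not None:
            self._writer.join()
            self._writer = None

    def load(self):
        with open(self.path, "rb") as f:
            return pickle.load(f)


def run(checkpointer, n_steps, interval, state=None):
    """Toy iteration loop that checkpoints every `interval` steps."""
    if state is None:
        state = {"step": 0, "value": 0.0}
    while state["step"] < n_steps:
        state["value"] += 1.0  # stand-in for the real computation
        state["step"] += 1
        if state["step"] % interval == 0:
            checkpointer.save(state)
    checkpointer.wait()  # make sure the last checkpoint is on disk
    return state
```

After a failure, a restart would call `load()` and pass the result to `run` as `state`, so that the computation resumes from the last completed checkpoint rather than from step 0. The only synchronous cost per checkpoint in this sketch is the in-memory copy, plus any wait for a previous write that has not finished.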
