16.05.2014
A feasibility study of checkpoint/restart as a fault tolerance technique
Faisal Shahzad
Universität Erlangen-Nürnberg
The sustained performance growth in High Performance Computing (HPC) is
driven by an ever-higher level of parallelism at the component level.
Although the mean time to failure (MTTF) of individual components has
increased over time, it has not kept pace with the growth in component
count. As a result, the MTTF of the overall system drops. Today, the MTTF
of large systems is in the range of days. At the Exascale level, it is
predicted to be on the order of minutes to hours. This has raised
significant concern in the HPC community.
Checkpoint/restart has classically been a popular technique for fault
tolerance due to its ease of implementation. However, it is often
criticized as inefficient because of its large overhead and is mostly
not considered a feasible approach at the Exascale level. This
talk presents an overview of various existing fault tolerance techniques
that are mainly based on the checkpoint/restart approach. Within the
scope of the ESSEX project, several ideas are investigated to reduce the
checkpointing overhead to an acceptable level while keeping the
implementation user-friendly. These include asynchronous, node-level and
neighbour-level checkpointing. Benchmarks are carried out using different
algorithms to compare these optimization strategies with the naïve
approaches. The results suggest that, if suitable optimization techniques
are employed, checkpoint/restart can still be a feasible technique that
incurs minimal overhead on the application at the Exascale level.
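
To illustrate the asynchronous idea mentioned above, here is a minimal single-process sketch. It is not the ESSEX implementation: the class AsyncCheckpointer, the function run, and the dictionary-based state are hypothetical names chosen for illustration, and a background thread stands in for whatever mechanism a real HPC code would use. The application blocks only for an in-memory copy of its state, and the slow write to disk overlaps with further computation.

```python
import copy
import os
import pickle
import threading


class AsyncCheckpointer:
    """Hypothetical helper: writes checkpoints in a background thread."""

    def __init__(self, path):
        self.path = path
        self._writer = None

    def save(self, state):
        # Wait for the previous write so at most one write is in flight.
        self.wait()
        # Blocking part: take an in-memory snapshot of the state.
        snapshot = copy.deepcopy(state)
        # Non-blocking part: write the snapshot to disk in the background.
        self._writer = threading.Thread(target=self._write, args=(snapshot,))
        self._writer.start()

    def _write(self, snapshot):
        # Write to a temporary file, then rename it atomically, so a failure
        # during the write never corrupts the last good checkpoint.
        tmp = self.path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp, self.path)

    def wait(self):
        # Block until the pending background write (if any) has finished.
        if self._writer is not None:
            self._writer.join()
            self._writer = None

    def load(self):
        with open(self.path, "rb") as f:
            return pickle.load(f)


def run(path, steps=10, interval=3):
    ckpt = AsyncCheckpointer(path)
    # Restart: resume from the last checkpoint if one exists.
    state = ckpt.load() if os.path.exists(path) else {"step": 0, "x": 0.0}
    while state["step"] < steps:
        state["x"] += 1.0  # stand-in for one iteration of real computation
        state["step"] += 1
        if state["step"] % interval == 0:
            ckpt.save(state)
    ckpt.wait()
    return state
```

Calling run with the same path a second time resumes from the last completed checkpoint instead of starting from step 0. The sketch only covers the asynchronous strategy. Node-level and neighbour-level checkpointing differ in where the snapshot is stored and are not shown here.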