09. 03. 2015

Performance Analysis of Distributing Erasure-Coded Checkpointing Data


Andreas Wiese

TU Dresden

Special time: 11:00, APB/3105

Over the past decades, off-the-shelf computers such as high-end server machines have seen a vast increase in computing power along with a massive drop in acquisition costs. As a result, building high-performance computers by connecting a large number of commodity systems with a high-capacity interconnect is becoming increasingly affordable, and distributed computing is becoming increasingly important.

The downside of such systems, however, is that as the number of nodes grows, so does the probability that one or even several nodes fail.
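To make the scaling effect concrete, here is a minimal sketch. It assumes that nodes fail independently and with the same probability during some time window; both are simplifying assumptions made for illustration only and are not taken from the talk. The function name and parameters are hypothetical.

```python
def prob_any_failure(n_nodes: int, p_node: float) -> float:
    """Probability that at least one of n_nodes fails.

    Assumption (illustrative only): every node fails independently
    with the same probability p_node in the considered time window.
    """
    # All nodes survive with probability (1 - p)^n; take the complement.
    return 1.0 - (1.0 - p_node) ** n_nodes


if __name__ == "__main__":
    # Compare a small cluster with increasingly large ones.
    for n in (1, 10, 100, 1000, 10000):
        print(n, prob_any_failure(n, 0.001))
```

Under these assumptions, the chance of seeing at least one failure rises steadily with the node count, even when each individual node is very reliable.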

Checkpointing mitigates this problem by allowing an application to be restarted from a saved state after a failure. Distributing the checkpoint data across nodes additionally increases availability in the face of more severe failures.

Erasure-coding the checkpoint data could be a good way to store it in a space- and bandwidth-efficient manner while maintaining or even increasing resilience against multiple node failures.
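As an illustration of the idea, the following sketch uses the simplest possible erasure code: k data blocks plus a single XOR parity block. This is not the code evaluated in this work. It tolerates the loss of only one block, whereas surviving multiple node failures requires a stronger code such as Reed-Solomon. All function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def encode(checkpoint: bytes, k: int):
    """Split the checkpoint into k equal blocks (zero-padded)
    and append one XOR parity block, giving k + 1 blocks."""
    size = -(-len(checkpoint) // k)  # ceiling division
    padded = checkpoint.ljust(size * k, b"\0")
    blocks = [padded[i * size:(i + 1) * size] for i in range(k)]
    return blocks + [xor_blocks(blocks)]


def recover(blocks, length: int) -> bytes:
    """Rebuild the checkpoint when at most one block is missing (None).

    `length` is the original checkpoint size, needed to strip padding.
    """
    missing = [i for i, b in enumerate(blocks) if b is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost block")
    if missing:
        survivors = [b for b in blocks if b is not None]
        blocks = list(blocks)
        # XOR of all surviving blocks equals the lost block.
        blocks[missing[0]] = xor_blocks(survivors)
    return b"".join(blocks[:-1])[:length]


if __name__ == "__main__":
    data = b"checkpoint state of rank 0"
    stored = encode(data, k=4)   # one block per (hypothetical) node
    stored[2] = None             # simulate the failure of one node
    assert recover(stored, len(data)) == data
```

With k = 4, this scheme stores 5/4 of the checkpoint size in total and survives one lost block. Keeping a full second copy offers the same single-failure tolerance at twice the size. That gap is the space and bandwidth saving that motivates erasure coding.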

This work investigates whether distributing erasure-coded checkpoint data is a practical approach on real-life systems.
28. Oct 2020