System-Level vs. Application-Level Checkpointing
Автор: IEEEComputerSociety
Загружено: 2020-09-09
Просмотров: 334
Описание:
Fault tolerance is becoming increasingly important since the probability of permanent hardware failures increases with machine size. A typical resilience approach to fail/stop failures today is checkpointing, which can be performed on system- or application-level. Both levels come in many variants, but they fundamentally differ. On system-level, no code changes are required, full program states are saved, and after a failure the program must be restarted from the last checkpoint. In contrast, on application-level, only user-defined data are checkpointed, which requires some programming effort. Thereby, the running time overhead may be reduced significantly, and programs may continue execution after failures.
Typical representatives include DMTCP (Distributed MultiThreaded Checkpointing) for system-level, and FTGLB (Fault Tolerant Global Load Balancing) for application-level. DMTCP is a user-space library, which checkpoints parallel programs transparently and restarts them from a checkpoint. DMTCP supports many programming languages and HPC environments.
FTGLB bases on a distributed task-pool pattern, and writes uncoordinated in-memory checkpoints. Checkpoints only include task descriptors and interim results, and are written at regular time intervals and at certain events, e.g. work stealing.
In this work, we experimentally compare DMTCP and FTGLB with up to 320 processes. Moreover, we derive formulas for predicting running times, including failure handling. With these formulas, we compare DMTCP and FTGLB in failure-prone and larger settings. Overall, the results clearly show that the application-level optimizations of FTGLB are worthwhile since the running time overhead is significantly lower than that of DMTCP.
Повторяем попытку...
Доступные форматы для скачивания:
Скачать видео
-
Информация по загрузке: