Managing Checkpoints for Parallel Programs


Author(s) : Miron Livny, Jim Pruyne
Publisher : N/A
Publication Date : 1996
ISSN : N/A
Abstract : Checkpointing is a valuable tool for any scheduling system. With the ability to checkpoint, schedulers are not locked into a single allocation of resources to jobs, but can instead stop running jobs and re-allocate resources without sacrificing any completed computation. Checkpointing techniques are not new, but they have not been widely available on parallel platforms. We have implemented CoCheck, a system for checkpointing message-passing parallel programs. Parallel programs tend to be large in terms of their aggregate memory utilization, so their checkpoints are also large. Because of this, checkpoints must be handled carefully to avoid overloading the system when they take place. Today's distributed file systems do not handle this situation well. We therefore propose the use of checkpoint servers, which are specifically designed to move checkpoints from the checkpointing process, across the interconnection network, and onto stable storage. A scheduling system can utilize numerous checkpoint servers in any configuration in order to provide good checkpointing performance.