Thursday, November 22, 2007
Checkpointing is a technique for adding fault tolerance to computing systems. It consists of storing a snapshot of the current application state and using that snapshot to restart execution after a failure.
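To make the idea concrete, here is a minimal sketch of application-level checkpointing in Python. It is only an illustration under simple assumptions: the state is a small dictionary that can be pickled, and the names `save_checkpoint`, `load_checkpoint` and `run` are hypothetical helpers made up for this example, not part of any checkpointing package.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write the snapshot atomically so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    # os.replace swaps the new snapshot into place in a single step.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved snapshot, or `default` if none exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, fail_at=None):
    """Sum 0..total_steps-1, saving a snapshot after every step.

    `fail_at` is a hypothetical knob that simulates a crash at a given step.
    """
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        save_checkpoint(state, path)
    return state["total"]
```

If `run` is interrupted part way through, calling it again with the same checkpoint path resumes from the last saved step instead of starting over. Real checkpointing packages snapshot the whole process image rather than a hand-picked dictionary, but the save and restart cycle is the same.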
Checkpointing in distributed shared memory systems
A number of practical checkpointing packages have been developed for the Linux/UNIX family of operating systems. These packages may be divided into two classes: those which operate in user space and those which operate in the kernel. Examples of the first class include the checkpointing package used by Condor and the portable checkpointing library developed by the University of Tennessee. User space checkpointing packages are highly portable and can typically be compiled and run on any modern UNIX (e.g. Linux, FreeBSD, OpenBSD, Darwin). In contrast, kernel based checkpointing packages such as Chpox and Cryopid, and the checkpointing algorithms developed for the MOSIX cluster computing environment, tend to be highly operating system dependent. Most kernel based checkpointing packages developed to date run under either the 2.4 or 2.6 subfamilies of the Linux kernel on i686 architectures.
Modern checkpointing packages such as Cryopid are capable of checkpointing a process pod, that is, a parent process and all its associated children, and of dealing with file system abstractions such as sockets and pipes (FIFOs) in addition to regular files. In the case of Cryopid, there is also provision to roll all dynamic libraries, open files, sockets and FIFOs associated with the process into the checkpoint. This is very useful when the checkpointed process is to be restarted in a heterogeneous environment (e.g. the machine on which the checkpoint is restarted has libraries and a file system which differ from those of the host on which the process was checkpointed).
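The benefit of rolling open files into the checkpoint can be shown with a small Python sketch. This is not how Cryopid does it (Cryopid works on the process image itself); it is a simplified illustration that assumes a regular file opened in binary mode, and `capture_file` and `restore_file` are hypothetical names invented for the example.

```python
import os


def capture_file(f):
    """Record the name, contents and read position of an open binary file."""
    offset = f.tell()
    f.seek(0)
    data = f.read()
    f.seek(offset)  # leave the original handle where it was
    return {"name": os.path.basename(f.name), "offset": offset, "data": data}


def restore_file(record, directory):
    """Recreate the file under `directory` and reopen it at the saved position."""
    path = os.path.join(directory, record["name"])
    with open(path, "wb") as out:
        out.write(record["data"])
    f = open(path, "r+b")
    f.seek(record["offset"])
    return f
```

Because the record carries the file contents as well as the offset, the restored handle works even if the original path does not exist on the machine where the process is restarted.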
Posted by iamyrfans at 10:13 AM