
2 Checkpoint and Migration of a UNIX Process

One of the major components of Condor is its facility for transparently checkpointing and restarting a process, possibly on a different machine. By ``transparent'' we mean that the user code is not specially written to accommodate checkpoint/restart or migration, and generally has no knowledge that such an event has taken place. This mechanism is implemented entirely at user level, with no modifications to the UNIX kernel. While implementing checkpoint/restart and migration at user level is good for portability, we should point out from the beginning that our method does have some limitations. These are discussed in detail in section 4.2.

Process checkpoint/restart can enable process migration. Checkpointing and restarting generally implies that the state of a process is saved in stable storage using the file system. In an environment where processes can access files from any of a group of machines (possibly via AFS or NFS cross-mounting of file systems), transporting a checkpoint file to another machine and restarting the process there accomplishes process migration. In this article we therefore use the term ``checkpoint'' almost synonymously with ``migrate''. However, not all systems that implement process migration also provide the safety of checkpointing, because they may not allow ``migrating'' a process into a file. Condor provides access to files across machines, even in environments where files are not generally available via NFS or AFS, through a mechanism called ``remote system calls''. These are discussed in section 4.1.

Our general method for checkpointing is to write all of the process's state information to a file (or socket) at checkpoint time, then use the information read back from that file (or socket) to restore the process's state, as far as possible, at restart time. From the point of view of the operating system, the process is not restarted or migrated at all: we simply create a new process and manipulate its state so as to emulate the state of the old process as accurately as possible. A checkpoint is triggered by a signal, and at restart time the state is arranged so that it appears to the user code that the process has just returned from that signal handler. The code contained in the signal handler, the code required to install the handler, and the code used to record information about the process's state as it evolves are all contained in the Condor checkpointing library.
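The signal-driven save and restore cycle described above can be illustrated with a small sketch. This is not the Condor checkpointing library and does not capture full process state (stack, data segment, registers); it saves only a small dictionary of application state, which is a deliberate simplification. The names `CKPT_PATH`, `checkpoint_handler`, `install_handler`, `restore`, `run` and `demo` are hypothetical, and the choice of `SIGUSR1` is an assumption made for the sketch. It assumes a UNIX system.

```python
import os
import pickle
import signal
import tempfile
import time

# Hypothetical checkpoint file location, chosen for this sketch only.
CKPT_PATH = os.path.join(tempfile.gettempdir(), f"ckpt_sketch_{os.getpid()}.pkl")

# Stand-in for "the process's state": a loop counter and a running total.
state = {"i": 0, "total": 0}
checkpoint_taken = False


def checkpoint_handler(signum, frame):
    """Signal handler: write the current state to the checkpoint file."""
    global checkpoint_taken
    tmp = CKPT_PATH + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT_PATH)  # rename so a partial file is never visible
    checkpoint_taken = True


def install_handler():
    """Install the checkpoint handler (SIGUSR1 is an assumption of the sketch)."""
    signal.signal(signal.SIGUSR1, checkpoint_handler)


def restore(path):
    """Rebuild the saved state from a checkpoint file, as a restart would."""
    with open(path, "rb") as f:
        return pickle.load(f)


def run(st, limit):
    """The 'user code': sums 0..limit-1 and knows nothing about checkpoints."""
    while st["i"] < limit:
        st["total"] += st["i"]
        st["i"] += 1
    return st["total"]


def demo():
    """Run halfway, checkpoint on a signal, then resume from the file."""
    global checkpoint_taken
    state.update(i=0, total=0)
    checkpoint_taken = False
    install_handler()

    run(state, 5)                          # first half of the work
    os.kill(os.getpid(), signal.SIGUSR1)   # request a checkpoint

    # Wait (bounded) until the handler has written the file.
    deadline = time.monotonic() + 5.0
    while not checkpoint_taken and time.monotonic() < deadline:
        time.sleep(0.01)

    # Emulate the restart: a fresh state object is built from the file,
    # and the user code continues from where the old state left off.
    resumed = restore(CKPT_PATH)
    os.remove(CKPT_PATH)
    return run(resumed, 10)


if __name__ == "__main__":
    print(demo())
```

Calling `demo()` returns 45, the same value as an uninterrupted `sum(range(10))`. In a real migration the restore step would run in a new process, possibly on another machine that can read the checkpoint file; here both halves run in one process to keep the sketch self-contained.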


condor-admin@cs.wisc.edu