Berkeley Linux Checkpoint/Restart (BLCR) User's Guide

About Berkeley Linux Checkpoint/Restart

Checkpoint/Restart allows you to save a process to a file and later restart the process from that file. There are three main uses for this:

  1. Scheduling: checkpointing allows a program to be safely stopped at any point in its execution, so that some other program can run in its place. The original program can then be resumed later.

  2. Process Migration: if a compute node appears likely to crash, or there is some other reason for shutting it down (routine maintenance, hardware upgrade, etc.), checkpoint/restart allows any processes running on it to be moved to a different node (or saved until the original node is available again).

  3. Failure recovery: a long-running program can be checkpointed periodically, so that if it crashes due to hardware, system software, or some other non-deterministic cause, it can be restarted from a point midway in its execution, rather than run again from the beginning.

Berkeley Linux Checkpoint/Restart (BLCR) provides checkpoint/restart on Linux systems. BLCR can be used either with a single process on a single computer, or with parallel jobs (such as MPI applications) which may be running across multiple machines on a cluster of Linux nodes.

Note: checkpointing parallel jobs requires a library which has integrated BLCR support. At present, the only MPI implementation which supports checkpoint/restart with BLCR is the LAM/MPI library.

Checkpoint/restarting within a BLCR-aware batch control system

One way to use BLCR is with a batch scheduler system (a.k.a. "job controller", "queue manager", etc.) that knows how to use the BLCR tools to checkpoint and restart the jobs under its control. You can simply tell such a system to "suspend" or "checkpoint" a job, and then later to "resume" or "restart" it.

Unfortunately BLCR has not yet been integrated with many batch systems. Currently the only system that supports BLCR with MPI jobs is the SciDAC Scalable Systems Software (SSS) Suite. If you are running on a system that uses the SSS Suite (this is the case with some versions of the OSCAR clustering toolkit), then refer to these instructions for using checkpoint/restart.

Support for serial jobs is available through SGE. See this report for more information.

The rest of this document assumes that your batch scheduler does not have built-in support for BLCR. In this case you will manually run the BLCR commands needed to checkpoint/restart your jobs.

Note: this does not mean that you cannot checkpoint/restart your applications if you use a batch system without built-in support for BLCR. It simply means that you have to do your checkpoints/restarts manually. To the batch system, a job that is checkpointed and terminated manually simply looks like a job that has "completed". A restart of an application looks like a "new" job.

Checkpointing Jobs with the BLCR command-line tools

Make sure BLCR is installed and loaded

This guide assumes that BLCR has already been successfully built, installed, and configured on your system (presumably by you or your system administrator). One easy way to test this is to use the 'lsmod' command to see if the BLCR kernel module is loaded on the node(s) that your program will run on:

    % /sbin/lsmod
    Module                  Size  Used by    Not tainted
    blcr                   47508   0 
    blcr_vmadump           24744   1 blcr
    blcr_imports            7808   2 blcr,blcr_vmadump
    iptable_filter          2412   0 (autoclean) (unused)
    ip_tables              15864   1 [iptable_filter]
If you don't see the three modules that begin with 'blcr' in the output of 'lsmod', then BLCR is not yet available on your system. Consult the BLCR Administrator's Guide for instructions on building and installing BLCR.
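
The same check can be scripted. What follows is a minimal sketch, assuming the three module names shown in the listing above; 'missing_blcr_modules' is a hypothetical helper name, not a BLCR command. It reads 'lsmod'-style output on standard input and prints each expected module that is absent.

```shell
# missing_blcr_modules: hypothetical helper (not part of BLCR).
# Reads lsmod-style output on stdin and prints each expected BLCR
# module whose name does not appear in the first column.
missing_blcr_modules() {
    awk '
        { seen[$1] = 1 }
        END {
            n = split("blcr blcr_vmadump blcr_imports", want, " ")
            for (i = 1; i <= n; i++)
                if (!(want[i] in seen)) print want[i]
        }'
}

# Typical use on a compute node (not run here):
#   /sbin/lsmod | missing_blcr_modules
# No output means all three modules are loaded.
```

Reading from standard input rather than calling 'lsmod' directly makes it easy to check several nodes by piping in the output collected from each of them.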

Make sure your environment is set up correctly

You must ensure that the BLCR commands, libraries and manual pages can be found in your shell.

Try running

    % cr_checkpoint --help
If 'cr_checkpoint' cannot be found, you need to modify your 'PATH' to include the directory where 'cr_checkpoint' lives. You will probably also want to modify your 'LD_LIBRARY_PATH' variable to contain the directory where 'libcr.so' lives, and add the BLCR man directory to your 'MANPATH'.
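
To check all of the commands used in this guide at once, something like the following sketch may help. 'missing_blcr_tools' is a hypothetical helper name; the list of commands is simply the three that appear in this guide ('cr_run', 'cr_checkpoint' and 'cr_restart').

```shell
# missing_blcr_tools: hypothetical helper (not part of BLCR).
# Prints the name of each BLCR command used in this guide that the
# shell cannot find in PATH, and returns non-zero if any is missing.
missing_blcr_tools() {
    status=0
    for tool in cr_run cr_checkpoint cr_restart; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

if missing_blcr_tools >/dev/null; then
    echo "BLCR commands found in PATH"
else
    echo "Some BLCR commands were not found in PATH"
fi
```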

Setting up your environment with 'modules'

If your system uses the Environment Modules system to manage software packages, you may be able to get all of your needed environment settings simply by entering something like

    % module add blcr
However, there is no requirement that 'blcr' is the name of the module you'll need--your administrator may have given it a different name ('checkpoint', etc.). Or s/he may have neglected to add BLCR to the set of packages managed by modules, in which case you'll need to use the 'manual' technique below.

Manually setting up your environment

To manually set up your environment for BLCR, the first thing you need to know is where it has been installed. By default, BLCR installs into the '/usr/local' directory tree, but your system administrator may have put it elsewhere by passing '--prefix=PREFIX' when BLCR was built (where PREFIX can be any arbitrary directory). Consult your system documentation, or try commands such as 'locate cr_checkpoint' or 'find'.

Once you have determined where BLCR is installed, enter the following commands (depending on which type of shell you are using), replacing PREFIX with the value specified for the --prefix option used when configuring BLCR.

To configure a bourne-type shell (such as 'bash' or 'ksh'):

    $ PATH=$PATH:PREFIX/bin
    $ MANPATH=$MANPATH:PREFIX/man
    $ LD_LIBRARY_PATH=$LD_LIBRARY_PATH:PREFIX/lib
    $ export PATH MANPATH LD_LIBRARY_PATH

To configure a csh-type shell (such as 'csh' or 'tcsh'):

    % setenv PATH ${PATH}:PREFIX/bin
    % setenv MANPATH ${MANPATH}:PREFIX/man
    % setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:PREFIX/lib

The above examples set the PATH, MANPATH and LD_LIBRARY_PATH variables in your current session or window only. It is strongly recommended that you make these settings permanent, so that they also affect future sessions and windows. To do this, add the example commands to your shell's startup files. For a single user of BLCR, add the appropriate set of commands to the shell startup files in your home directory (.bashrc for bash, .profile for other bourne-type shells, or .cshrc for csh-type shells). For a system-wide installation, add the bourne shell commands to /etc/bashrc and /etc/profile and the csh commands to /etc/cshrc.
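
Startup files are often read more than once (for instance by nested shells), so it can be useful to add each directory only if it is not already listed. The following is a sketch for bourne-type shells, not the only way to do this. 'BLCR_PREFIX' and 'append_dir' are illustrative names that do not come from BLCR. MANPATH is handled in the same form as the example above, on the assumption that your 'man' implementation treats a leading colon as "also search the default locations" (man-db does).

```shell
# Sketch for a bourne-type startup file such as ~/.profile.
# BLCR_PREFIX is an illustrative variable name; set it to the PREFIX
# your BLCR installation was configured with.
BLCR_PREFIX=${BLCR_PREFIX:-/usr/local}

# append_dir VAR DIR: append DIR to the colon-separated list stored in
# the variable named VAR, unless DIR is already listed.
# (Hypothetical helper, not part of BLCR.)
append_dir() {
    eval "_cur=\${$1:-}"
    case ":$_cur:" in
        *":$2:"*) ;;                            # already present
        *) eval "$1=\${_cur:+\$_cur:}\$2" ;;     # append, no stray colon
    esac
}

append_dir PATH "$BLCR_PREFIX/bin"
append_dir LD_LIBRARY_PATH "$BLCR_PREFIX/lib"

# MANPATH: an empty MANPATH deliberately becomes ":PREFIX/man", so the
# default manual page locations are still searched (assumed behavior).
case ":${MANPATH:-}:" in
    *":$BLCR_PREFIX/man:"*) ;;
    *) MANPATH="${MANPATH:-}:$BLCR_PREFIX/man" ;;
esac

export PATH MANPATH LD_LIBRARY_PATH
```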

Checkpointing/restarting a single process application

Types of applications supported

Checkpoint/restart supports: However, certain applications are not supported:

Making an application checkpointable

To be checkpointed successfully with BLCR, an application must contain some library code that BLCR provides. There are several ways of ensuring this:
  1. Start your executable with the 'cr_run' command:
            % cr_run your_executable [arguments ]
    
    'cr_run' loads the BLCR library into your application at startup time. You do not need to modify an application to have it work with 'cr_run'.

  2. Link your application with BLCR's 'libcr'. For instance, you could make a simple 'hello world' C program checkpointable via
            % gcc -o hello hello.c -LPREFIX/lib -lcr
    
    where PREFIX is the root of your BLCR install. Your application will now look for the BLCR library whenever it starts up, but note that this does not mean it will automatically be found: you will need to set your 'LD_LIBRARY_PATH' environment variable to 'PREFIX/lib' if libcr is not installed into a standard system library directory.

  3. Link your application with a library which uses BLCR. For instance, if your MPI library has been made BLCR-aware, it will cause libcr to be loaded, and so simply linking with the MPI library is enough to make your application checkpointable.

  4. Force the 'libcr.so' dynamic library to be loaded at startup by adding its full pathname to the LD_PRELOAD environment variable. In most cases, the pthread library will also be required. We do not recommend setting this in your environment in general--certain programs may interact poorly with the BLCR library logic. Instead, use a command like
            % env LD_PRELOAD=PREFIX/lib/libcr.so.0:libpthread.so.0 your_executable [arguments ]
    
    This is essentially how 'cr_run' works.

If your program has not loaded the BLCR library by one of these methods, it will simply die with an error if you try to checkpoint it. More specifically, it will receive a real-time signal (the exact one depends on your kernel and C library versions), which kills the program by default unless it handles the signal explicitly.
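
Before checkpointing a process that is already running, you may want to confirm that it has in fact loaded the BLCR library. The following is a heuristic sketch only: it assumes a Linux '/proc' filesystem and assumes that the library appears in the process's memory map under a file name such as 'libcr.so' (or 'libcr_' followed by a suffix). 'has_blcr_library' is a hypothetical helper name.

```shell
# has_blcr_library PID [MAPS_FILE]: hypothetical helper (not part of BLCR).
# Succeeds if the process's memory map lists a shared object named
# libcr.so* or libcr_<suffix>.so*. The pattern is written so that
# unrelated libraries such as libcrypto do not match.
# MAPS_FILE defaults to /proc/PID/maps; the override exists only to
# make the function easy to try out on a saved file.
has_blcr_library() {
    maps_file=${2:-/proc/$1/maps}
    grep -Eq '/libcr(_[a-z]+)?\.so' "$maps_file" 2>/dev/null
}

# Example (not run here): warn before checkpointing PID 15005.
#   has_blcr_library 15005 || echo "PID 15005 has not loaded libcr" >&2
```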

Checkpointing the process

To checkpoint a process, simply run
    % cr_checkpoint PID
where PID is the application's process ID.

By default, 'cr_checkpoint' saves a checkpoint, and then lets your application continue running. This is useful for backing up a process in case it fails later, for instance.

If you wish to stop the process after it has been checkpointed, pass the '--term' flag:

    % cr_checkpoint --term PID
This causes a SIGTERM signal to be received by the process at the end of the checkpoint. If you have a reason to send a different signal to your process at the end of the checkpoint, you can pass any arbitrary signal number instead via the '--signal' flag.

Files that contain checkpoints are called context files. By default, they are named 'context.PID', where PID is the process ID that was checkpointed, and are stored in the current working directory that 'cr_checkpoint' was run in. You may specify the name and location of the context file via the '-f' option.

There are a number of other options that 'cr_checkpoint' provides. See the man page (or 'cr_checkpoint --help') for details.
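
For failure recovery it is common to take checkpoints repeatedly and keep each one under a distinct name. The sketch below uses only the '-f' option described above. 'checkpoint_to', 'checkpoint_every' and the 'CR_CHECKPOINT' override variable are hypothetical names introduced for illustration.

```shell
# checkpoint_to DIR PID: hypothetical helper (not part of BLCR).
# Saves a checkpoint of PID in DIR under a timestamped name and prints
# the name of the context file. CR_CHECKPOINT is an illustrative
# override for the command to run; it defaults to cr_checkpoint.
checkpoint_to() {
    ctx="$1/context.$2.$(date +%Y%m%d-%H%M%S)"
    "${CR_CHECKPOINT:-cr_checkpoint}" -f "$ctx" "$2" && echo "$ctx"
}

# checkpoint_every SECONDS DIR PID: checkpoint PID repeatedly until it
# exits. A checkpoint attempted after the process has exited fails,
# which also ends the loop.
checkpoint_every() {
    while kill -0 "$3" 2>/dev/null; do
        sleep "$1"
        checkpoint_to "$2" "$3" || break
    done
}

# Example (not run here): checkpoint PID 15005 once an hour.
#   checkpoint_every 3600 "$HOME/checkpoints" 15005 &
```

Old context files are not removed by this sketch, so you will want to delete the ones you no longer need.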

Restarting the process

To restart from a context file, certain conditions must be met: You may restart a program on a different machine than the one it was checkpointed on if all of these conditions are met (they often are on cluster systems, especially if you are using a shared network filesystem).

You can restart a process by using 'cr_restart' on its context file:

    % cr_restart context.15005
The original process will be restored, and resume running in the exact state it was in at checkpoint time. Note that this includes restoring its process ID, so you cannot restart a program unless the original copy of it has exited (otherwise 'cr_restart' will fail with a message that the PID is already in use).

You may restart a process from a particular context file as many times as you wish. The context file is not automatically removed at any point--delete it if/when it is no longer useful to you.
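
Because a restart fails when the original PID is taken, a small wrapper can check for that before calling 'cr_restart'. This is a sketch under two assumptions: that '/proc' is available, and that the context file uses the default 'context.PID' naming so the PID can be read from the file name. 'pid_in_use' and 'restart_context' are hypothetical helper names.

```shell
# pid_in_use PID: succeeds if a process with this PID currently exists.
# Checks /proc rather than using 'kill -0', which also fails for
# processes owned by other users.
pid_in_use() {
    [ -d "/proc/$1" ]
}

# restart_context FILE: hypothetical wrapper (not part of BLCR).
# Assumes FILE uses the default 'context.PID' naming.
restart_context() {
    pid=${1##*context.}
    case "$pid" in
        ''|*[!0-9]*)
            echo "cannot determine PID from file name: $1" >&2
            return 2 ;;
    esac
    if pid_in_use "$pid"; then
        echo "PID $pid is in use; cr_restart would fail" >&2
        return 1
    fi
    cr_restart "$1"
}

# Example (not run here):
#   restart_context context.15005
```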

Checkpointing/restarting an MPI application

Currently there is only one MPI library that has been modified to work with BLCR: the LAM/MPI library. This means that if you wish to checkpoint/restart MPI programs on a BLCR-enabled system, you must use LAM/MPI. You must also have configured LAM correctly to use BLCR (i.e. use the crtcp or gm RPI), and you should NOT configure LAM in debug mode, i.e. do not pass --with-debug to LAM's configure script. See the LAM/MPI documentation for details.

To start a checkpointable LAM/MPI application, simply run it with the regular LAM 'mpirun' launcher:

    % mpirun C hello_mpi

Note: you may need to start up the LAM environment first by running 'lamboot' before starting your application.

To checkpoint the entire MPI application (across all nodes and processes), simply run

    % cr_checkpoint 12305
where '12305' is the process ID of the 'mpirun' command. Do not pass the pid of your MPI executable: when 'mpirun' is checkpointed, it automatically takes care of transitively checkpointing all of the processes involved in the MPI job.

To restart your MPI job, simply run 'cr_restart' on the 'mpirun' process's context file:

    % cr_restart context.12305
All processes in the MPI job will be restarted as they were at checkpoint time.
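
Since the PID that matters is that of 'mpirun', it can help to record it when the job is launched. The following sketch does this with a generic helper; 'start_job' and the file name 'hello_mpi.pid' are hypothetical, and the LAM/MPI commands appear only in comments because they require a working LAM environment.

```shell
# start_job PIDFILE COMMAND [ARGS...]: hypothetical helper.
# Runs COMMAND in the background and records the PID of the launched
# command (for an MPI job, the PID of mpirun itself) in PIDFILE.
start_job() {
    pidfile=$1
    shift
    "$@" &
    echo "$!" > "$pidfile"
}

# Intended use with LAM/MPI (not run here):
#   start_job hello_mpi.pid mpirun C hello_mpi
#   cr_checkpoint --term "$(cat hello_mpi.pid)"
#   cr_restart "context.$(cat hello_mpi.pid)"
```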

Troubleshooting FAQ

My application dies with "Real-time signal 31" (or 32, etc.) when I try to checkpoint it

Your application has not loaded the required BLCR library it needs to be checkpointable, and so it dies when a checkpoint signal arrives (BLCR may use a different real-time signal than 31, depending on your kernel and/or C library).

See the section on Making an application checkpointable for the various ways to fix this.

I get the error: ioctl(/proc/checkpoint/ctrl, CR_OP_RSTRT_REQ): Device or resource busy

This is because a resource needed in order to restart the process is already in use. The most common problem is that another process already exists with the same pid (process ID)--the operating system will not allow two processes to have the same pid. Very frequently this is because a user is trying to 'restart' a process from a checkpoint while the original process they took the checkpoint of is still running!

If you are unlucky enough that some other, unrelated process has grabbed the PID of your application, you must figure out some way to get rid of that process. If you own the process, you can of course simply kill it (or checkpoint it!). Otherwise, consider becoming root, or consulting your system administrator. BLCR will not kill another process for you (this 'feature' would raise certain security issues).

For more information

For more information on Checkpoint/Restart for Linux, visit the project home page: http://ftg.lbl.gov/checkpoint

For more information on LAM/MPI, see the LAM/MPI Documentation.