magnum opus
Date: Thu Jan 02 2003 - 12:38:16 PST

Here is the paper on our checkpoint/restart system that I wrote for my
class last semester.  It's mainly just a high-level overview of the
architecture, but there are a couple things worth noting:

1) I gave us an acronym:  BLCR (Berkeley Lab Linux Checkpoint/Restart).
   If we hate it, we can change it.

2) I refer in the paper to 'signal-based' and 'thread-based' checkpoint
   'callbacks' rather than our current, utterly confusing nomenclature
   of 'synchronous and asynchronous handlers.'  Calling a function that
   gets run in signal handler context 'synchronous' is misleading,
   and calling user functions 'handlers' makes them sound too much like
   signal handlers.  I propose we change our docs and APIs, too.

3) I have some performance numbers, which are more interesting than
   expected (we suck badly at checkpointing jobs whose VM size is half
   or more of the physical memory size, and we'll need to fix it).

4) At least one of the ideas I discuss in the optimizations section is
   new (or at least I don't recall ever discussing it with anyone).

There are some typos and unfilled references in the paper, which I'm
going to fix at some point.  But I want to get this out to you all now,
before the real work of the new year diverts your attention.


Jason Duell             Future Technologies Group
<jcduell_at_lbl_dot_gov>       High Performance Computing Research Dept.
Tel: +1-510-495-2354    Lawrence Berkeley National Laboratory