Re: SCORE's checkpoint/restart

From: Jason Duell (jcduell_at_lbl.gov)
Date: Thu May 02 2002 - 10:46:29 PDT


On Wed, May 01, 2002 at 03:10:40PM -0700, Eric Roman wrote:
> 
> Have any of you had a look at SCORE's checkpoint/restart?  (Take a look
> at http://www.pccluster.org)
> 
> It looks like they have a checkpointable MPI.  I haven't looked too hard at
> this system in a while.  It seems to be pretty good.

It does seem pretty impressive, though also a bit eccentric.  They have
a network-independent API called "PM", on top of which they've layered
an MPICH-derived MPI (among other things: they also have a distributed
shared memory system and a C++ template network application framework
layered on top of PM).  Any parallel app that is built over PM can be
checkpointed transparently, so they must have done the network
quiescence stuff already (you can't checkpoint arbitrary sockets/pipes).
They also have a version of the PBS batch system set up (on top of a
"user-level global operating system"!), and they have parallel gang
scheduling.
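
I haven't seen how they actually do the quiescence, but the shape of the
problem is roughly this: before anyone dumps state, every message already
injected into the network has to be drained, or the checkpoint captures
half-delivered messages.  A sketch of that, in MPI-flavored C (this is
just my illustration, not SCORE/PM code; names like save_local_state()
are placeholders):

    /* Sketch only, not SCORE/PM code.  The counters here live in the
     * application; a layer like PM would keep them inside the transport
     * so the app never sees any of this. */
    #include <mpi.h>
    #include <stdio.h>

    static long sent = 0, received = 0;    /* per-rank message counters */

    static void save_local_state(int rank) /* placeholder for a real dump */
    {
        printf("rank %d: writing local checkpoint\n", rank);
    }

    static void quiesce_and_checkpoint(int rank)
    {
        long mine[2] = { sent, received }, totals[2];

        /* 1. Agree on how many messages are still in flight. */
        MPI_Allreduce(mine, totals, 2, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

        /* 2. Drain until everything sent has been received.  (Nothing is
           sent in this toy program, so the loop never runs; a real layer
           would also buffer the drained payloads alongside the dump.) */
        while (totals[0] != totals[1]) {
            int flag;
            MPI_Status st;
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                       &flag, &st);
            if (flag) {
                long payload;
                MPI_Recv(&payload, 1, MPI_LONG, st.MPI_SOURCE, st.MPI_TAG,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                received++;
            }
            mine[1] = received;
            MPI_Allreduce(mine, totals, 2, MPI_LONG, MPI_SUM,
                          MPI_COMM_WORLD);
        }

        /* 3. With the network quiet, a barrier plus per-rank dumps give a
           globally consistent checkpoint. */
        MPI_Barrier(MPI_COMM_WORLD);
        save_local_state(rank);
    }

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        quiesce_and_checkpoint(rank);
        MPI_Finalize();
        return 0;
    }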

They also have pretty good documentation.

But:

Their gang scheduling seems to use regular process suspension, rather
than checkpointing (I'm not sure, but the docs say that a gang context
switch 'takes a few milliseconds').  

Their checkpointing scheme seems to be hard-wired into their batch/run
system in an undesirable way that emphasizes only fault tolerance.  It
looks like you need to run your app with a "--checkpoint=interval" flag
to make it checkpointable (the docs are ambiguous as to whether a job
without that flag can later be checkpointed by other means), and in
order to restart your application if it crashes, you must NOT let the
'front end process' that represents your job on the front end server
die--otherwise you lose your checkpoint forever!  Also, you can only go
back one checkpoint: once a new checkpoint is taken, you cannot roll
back to a previous one.  And your checkpointed jobs must restart on the
same nodes they started on (they seem to be working on a migration
facility, but it's not there yet).  So the way they've set things up,
checkpointing winds up protecting primarily against temporary node
failures; it's not useful for much else.  I can't imagine that their
checkpointing logic really requires all these restrictions--it seems
more like they've over-integrated things in an attempt to make them
user-friendly.
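
To be concrete about the rollback restriction: keeping a couple of
checkpoint generations around is the standard way to avoid it, and it is
not hard.  The fragment below is just my illustration of that pattern
(file names and the toy "state" are made up; it has nothing to do with
SCORE's code):

    /* Generic N-generation checkpoint rotation, not SCORE code. */
    #include <stdio.h>

    #define KEEP 2   /* how many recent generations to retain */

    static int write_checkpoint(int generation, long state)
    {
        char tmp[64], final_name[64];
        snprintf(tmp, sizeof tmp, "ckpt.%d.tmp", generation);
        snprintf(final_name, sizeof final_name, "ckpt.%d", generation);

        FILE *f = fopen(tmp, "w");
        if (!f)
            return -1;
        fprintf(f, "%ld\n", state);  /* stand-in for the real state dump */
        fclose(f);

        /* Only a complete checkpoint gets promoted; a crash mid-write
           still leaves the previous generations restartable. */
        if (rename(tmp, final_name) != 0)
            return -1;

        /* Retire anything older than the KEEP most recent generations. */
        if (generation >= KEEP) {
            char old[64];
            snprintf(old, sizeof old, "ckpt.%d", generation - KEEP);
            remove(old);
        }
        return 0;
    }

    int main(void)
    {
        long state = 0;
        for (int gen = 0; gen < 5; gen++) {
            state += 42;             /* stand-in for real computation */
            if (write_checkpoint(gen, state) != 0)
                perror("checkpoint");
        }
        return 0;       /* leaves ckpt.3 and ckpt.4 on disk */
    }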

They currently can't handle checkpointing apps that use shared libs, and
they don't guarantee the same pid when you restart.

Their cross-transport network layer looks interesting, although it
appears to be of the 'ask for a new send/receive buffer before each
send/receive' flavor, which seems inefficient.  They also have a
DMA-like 'zero-copy' set of functions in PM, but for now you can't
checkpoint jobs that use them.
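
To show what I mean by the 'buffer per message' flavor, the calling
pattern is roughly the one below.  The net_* names are mine, not PM's
actual entry points (I haven't checked those), and the stubs just copy
through a static buffer so the file stands on its own:

    /* Illustration of a request-a-buffer-per-message API; not PM code. */
    #include <stdio.h>
    #include <string.h>

    #define MSG_MAX 256
    static char channel[MSG_MAX];  /* stand-in for the layer's send queue */

    /* Step 1: the caller asks the layer for a transmit buffer... */
    static void *net_get_send_buf(int dest, size_t len)
    {
        (void)dest;
        return len <= MSG_MAX ? channel : NULL;
    }

    /* Step 2: ...fills it, then hands it back to go on the wire. */
    static int net_send(int dest, void *buf, size_t len)
    {
        (void)dest; (void)buf; (void)len;
        return 0;   /* stub: the data is already sitting in 'channel' */
    }

    /* The receive side mirrors it: the layer lends you its buffer and
       you release it when done, so the cost is a staging copy each way. */
    static void *net_get_recv_buf(size_t *len)
    {
        *len = MSG_MAX;
        return channel;
    }
    static void net_release_recv_buf(void *buf) { (void)buf; }

    int main(void)
    {
        const char *msg = "hello";

        void *sbuf = net_get_send_buf(1, strlen(msg) + 1);
        if (sbuf) {
            memcpy(sbuf, msg, strlen(msg) + 1);  /* the extra copy */
            net_send(1, sbuf, strlen(msg) + 1);
        }

        size_t len;
        void *rbuf = net_get_recv_buf(&len);
        printf("received: %s\n", (char *)rbuf);
        net_release_recv_buf(rbuf);

        /* A zero-copy/RDMA-style call would instead register the user's
           own buffer and let the NIC read it directly, skipping the
           staging copy, which is presumably what PM's DMA-like
           functions do. */
        return 0;
    }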


-- 
Jason Duell                               jcduell_at_lbl_dot_gov
NERSC Future Technologies Group           Tel: +1-510-495-2354
Lawrence Berkeley National Laboratory     Fax: +1-510-495-2998