From: Jason Duell (jcduell_at_lbl.gov)
Date: Thu May 02 2002 - 10:46:29 PDT
On Wed, May 01, 2002 at 03:10:40PM -0700, Eric Roman wrote:
>
> Have any of you had a look at SCORE's checkpoint/restart? (Take a look
> on http://www.pccluster.org )
>
> It looks like they have a checkpointable MPI. I haven't looked too hard at
> this system in a while. It seems to be pretty good.

It does seem pretty impressive, though also a bit eccentric.

They have a network-independent API called "PM", on top of which they've layered an MPICH-derived MPI (among other things: they also have a distributed shared memory system and a C++ template network application framework layered on top of PM). Any parallel app that is built over PM can be checkpointed transparently, so they must have done the network quiescence stuff already (you can't checkpoint arbitrary sockets/pipes). They also have a version of the PBS batch system set up (on top of a "user-level global operating system"!), and they have parallel gang scheduling. They also have pretty good documentation.

But:

Their gang scheduling seems to use regular process suspension, rather than checkpointing (I'm not sure, but the docs say that a gang context switch "takes a few milliseconds").

Their checkpointing scheme seems to be hard-wired into their batch run system in an undesirable way that emphasizes only fault tolerance. It looks like you need to run your app with a "--checkpoint=interval" flag to make it checkpointable (the docs are ambiguous as to whether a job without that flag can later be checkpointed by other means), and in order to restart your application if it crashes, you must NOT allow the 'front end process' that represents your job on the front end server to die--otherwise you lose your checkpoint forever!

Also, you can only go back one checkpoint: after a new checkpoint is taken, you cannot roll back to a previous one. And your checkpointed jobs must restart on the same nodes that they started on (they seem to be working on a migration facility, but it's not there yet).
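For reference, since you can't checkpoint arbitrary sockets/pipes, the standard trick for quiescing a message-passing layer is a coordinated marker drain: everyone stops posting sends, flushes a marker down each outgoing channel, and receives until the peer's marker arrives, banking any in-flight messages into the local checkpoint. A toy sketch of that idea (my own names, not SCORE's--I haven't seen their code):

```python
from collections import deque

MARKER = object()  # sentinel flushed down each channel before saving state

def send_markers(out_channels):
    """Phase 1: stop posting new application sends, then flush a marker
    down every outgoing channel as a promise that nothing more follows."""
    for ch in out_channels:
        ch.append(MARKER)

def drain(in_channels):
    """Phase 2: receive from each incoming channel until the peer's
    marker arrives.  Anything drained here is in-flight data that gets
    saved as part of the local checkpoint, so the channels themselves
    need not be checkpointed at all."""
    drained = []
    for ch in in_channels:
        while True:
            msg = ch.popleft()  # a blocking receive in a real transport
            if msg is MARKER:
                break
            drained.append(msg)
    return drained

# Two processes, one channel each way, with messages still in flight:
a_to_b, b_to_a = deque(["x1", "x2"]), deque(["y1"])
send_markers([a_to_b])     # process A flushes its marker
send_markers([b_to_a])     # process B flushes its marker
saved_a = drain([b_to_a])  # A banks ["y1"] into its checkpoint
saved_b = drain([a_to_b])  # B banks ["x1", "x2"] into its checkpoint
# Both channels are now empty, so the processes can be saved safely.
```

After the drain, the network carries no state, which is presumably why apps built over PM can be checkpointed while raw sockets can't.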
So the way they've set up checkpointing winds up protecting primarily against temporary node failures; it's not useful for much else. I can't imagine that their checkpointing logic really requires all these restrictions--it seems partly like they've over-integrated things in an attempt to make them user-friendly, or something. They also currently can't handle checkpointing apps that use shared libs, and they don't guarantee the same pid when you restart.

Their cross-transport network layer looks interesting, although it seems to be of the 'ask for a new send/receive buffer before each send/receive' flavor, which seems inefficient. They also have a DMA-like 'zero-copy' set of functions in PM, but for now you can't checkpoint jobs that use it.

--
Jason Duell                             jcduell_at_lbl_dot_gov
NERSC Future Technologies Group         Tel: +1-510-495-2354
Lawrence Berkeley National Laboratory   Fax: +1-510-495-2998
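P.S. By the 'ask for a new send/receive buffer before each send/receive' flavor, I mean APIs where the transport owns the message buffers (typically so it can hand out pinned, DMA-able memory), so every send costs two library crossings: one to get a buffer, one to commit it. A toy sketch of the shape I mean (names are mine, not PM's):

```python
class BufferedTransport:
    """Toy model of a 'request a buffer, then send' network API.
    The library owns the message buffers, so each send is two calls:
    get_send_buffer() and then send() -- the extra per-message library
    crossing is what looks inefficient about this flavor of API."""

    def __init__(self):
        self.wire = []  # stand-in for messages in transit

    def get_send_buffer(self, size):
        # A real transport would hand back pinned, DMA-able memory here.
        return bytearray(size)

    def send(self, buf):
        self.wire.append(bytes(buf))  # commit the filled buffer

    def receive(self):
        return self.wire.pop(0)

t = BufferedTransport()
buf = t.get_send_buffer(5)  # crossing 1: ask the library for a buffer
buf[:] = b"hello"           # fill it in place
t.send(buf)                 # crossing 2: commit it for transmission
msg = t.receive()           # -> b"hello"
```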