Re: Document for Friday's mtg

From: Paul H. Hargrove
Date: Mon Jun 03 2002 - 10:26:24 PDT


Jeff, I think you ARE missing something - sorry for confusing you.  When 
I refer to CHECKPOINT, CONTINUE and RESTART I am referring to blocks of 
code in the following handler template:

void handler(void* arg)
{
     int rc;

     /* do CHECKPOINT work here */

     rc = cr_checkpoint();   /* returns twice, see below */
     if (CR_IS_FAILURE(rc)) {
         /* deal with FAILURE here */
     } else if (CR_IS_RESTART(rc)) {
         /* do RESTART work here */
     } else {
         /* do CONTINUE work here */
     }
}

The cr_checkpoint() call is a return-twice call in the spirit of fork() 
or setjmp().  The first (chronologically) return is just continuing 
after the checkpoint has been taken.  The second return is when 
restarting from a checkpoint.
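
To make the return-twice idea concrete, here is a minimal sketch that uses setjmp()/longjmp() as a stand-in for cr_checkpoint(). It only illustrates the control flow. The function name demo_return_twice and its counters are made up for this example, and a real restart happens in a new process, possibly much later, rather than via longjmp() in the same process.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdio.h>

/*
 * Hypothetical demo: setjmp() plays the role of cr_checkpoint().
 * It returns 0 when the state is first saved ("continue" path) and
 * nonzero when that saved state is resumed ("restart" path).
 * Returns the total number of times the save point returned.
 */
int demo_return_twice(int *first_returns, int *second_returns)
{
    jmp_buf env;
    volatile int total = 0;  /* volatile: changed between setjmp and longjmp */

    assert(first_returns != NULL && second_returns != NULL);
    *first_returns = 0;
    *second_returns = 0;

    if (setjmp(env) == 0) {
        /* First (chronological) return: state saved, keep running. */
        total++;
        (*first_returns)++;
        printf("first return: state saved, continuing\n");
        longjmp(env, 1);     /* simulate resuming from the saved state */
    } else {
        /* Second return: execution resumes at the same call site. */
        total++;
        (*second_returns)++;
        printf("second return: resumed from the saved state\n");
    }
    return total;
}
```

Calling demo_return_twice() once takes each branch exactly once, which is the same shape as the CONTINUE and RESTART branches in the handler template.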

As for the stdin/out/err question, I am referring to the fd passing you 
mention.  The setup (mpirun passes fd to local lamd) must be repeated at 
restart time because we have new fds to deal with.
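
For reference, fd passing over a unix domain socket is normally done with sendmsg()/recvmsg() and an SCM_RIGHTS control message. The sketch below is a generic illustration of that mechanism, not the actual mpirun or lamd code. The helper names send_fd and recv_fd are hypothetical, and it assumes an already connected AF_UNIX socket.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one open file descriptor over a connected unix domain socket.
 * Returns 0 on success, -1 on failure. (Hypothetical helper.) */
int send_fd(int sock, int fd)
{
    char byte = 'F';                      /* at least one data byte must be sent */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {                               /* union keeps the buffer aligned */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(sock >= 0 && fd >= 0);
    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;         /* payload is an array of fds */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor sent by send_fd().
 * Returns the new descriptor, or -1 on failure. (Hypothetical helper.) */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

At restart time the sending side would repeat this for each of fds 0, 1 and 2 over the newly built socket, since the descriptors held by the receiver before the checkpoint no longer exist.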


Jeff Squyres wrote:

> On Mon, 3 Jun 2002, Paul H. Hargrove wrote:
>>The main distinction between the CONTINUE and RESTART code for the
>>mpirun process has to do with file handles.  When we CONTINUE the mpirun
>>process is still connected to the local lamd by a unix domain socket and
>>that lamd has the proper stdin/out/err.  When we RESTART we must build a
>>new unix domain socket and must pass the stdin/out/err to the local
>>lamd.
>>In the libmpi the situation is similar: all sockets in place (unless
>>using the shutdown trick) in the CONTINUE case - no sockets in place in
>>the RESTART case.
> (should we be using the checkpoint_at_lbl_dot_gov address for this thread?)
> Not sure what you mean here...  Two things:
> 1. What's the value of CONTINUE?
> 2. What do you mean by "the proper stdin/out/err"?
> Longer explanations:
> 1. The way I understand it, if you CONTINUE, you still get a bunch of
> image files as output, right?  Is the intent that these image files can be
> used later to restart the process?  e.g., for the scenario:
>   Time   Description
>   ------ --------------------------------------------------------------
>   T=0    mpirun C foo
>   ...
>   T=N    foo does a checkpoint/CONTINUE
>   T=N+1  foo continues as if nothing had happened
>   ...
>   T=M    foo aborts/dies ungracefully
>   ...
>   T=P    user manually re-starts foo with the image files from the
>          checkpoint/CONTINUE at T=N
>   ------ --------------------------------------------------------------
> Is that the intent?
> If so, then both CONTINUE and RESTART are supposed to turn out image
> files that are suitable for re-starting the process, right?  If that's
> right, then I think that libmpi and mpi need to do exactly the same thing
> in CONTINUE and RESTART.  Particularly in terms of the MPI data
> connections (in the RPI), but also the connection to the lamd's unix
> socket -- they need to be flushed and closed before the checkpoint occurs
> and then re-opened after the checkpoint resumes (for both the CONTINUE and
> RESTART cases).
> If these connections are not flushed/closed, then the image files won't be
> able to be reliably used to restart the foo process.
> 2. What does lamd have to do with stdout/err/in?  The local lamd's
> stdout/err will always be tied to where lamboot was run, and its stdin
> is closed.  The stdout/err/in of all remote lamds are closed.
> Did you mean the stdout/err/in of the user application being tied to
> mpirun?  e.g., "mpirun C foo", how the stdout/err/in is tied to the
> originating mpirun?  If so, the input/output from foo is passed *through*
> the lamd, but in a very transparent way -- the lamd only handles the
> setup, and the rest is done transparently by the OS (using file descriptor
> passing from mpirun to the lamd).
> So I'm not quite clear on what you mean...
> -----
> One clarification from my previous mail: Brian informs me that I was
> incorrect -- nsend/nrecv do *not* invoke malloc/free anywhere in their
> call stacks.  So we should be ok there.
> {+} Jeff Squyres

Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
NERSC Future Technologies Group           Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-495-2998