From: Paul H. Hargrove (PHHargrove_at_lbl.gov)
Date: Mon Jun 03 2002 - 10:26:24 PDT
NOW MOVING THE DISCUSSION TO THE LIST.

Jeff, I think you ARE missing something - sorry for confusing you.

When I refer to CHECKPOINT, CONTINUE and RESTART I am referring to blocks
of code in the following handler template:

    void handler(void* arg)
    {
        int rc;

        /* do CHECKPOINT work here */
        rc = cr_checkpoint();
        if (CR_IS_FAILURE(rc)) {
            /* deal with FAILURE here */
        } else if (CR_IS_RESTART(rc)) {
            /* do RESTART work here */
        } else {
            /* do CONTINUE work here */
        }
    }

The cr_checkpoint() call is a return-twice call in the spirit of fork()
or setjmp(). The first (chronologically) return is just continuing after
the checkpoint has been taken. The second return is when restarting from
a checkpoint.

As for the stdin/out/err question, I am referring to the fd passing you
mention. The setup (mpirun passing the fds to the local lamd) must be
repeated at restart time because we have new fds to deal with.

-Paul

Jeff Squyres wrote:
> On Mon, 3 Jun 2002, Paul H. Hargrove wrote:
>
>>The main distinction between the CONTINUE and RESTART code for the
>>mpirun process has to do with file handles. When we CONTINUE, the mpirun
>>process is still connected to the local lamd by a unix domain socket and
>>that lamd has the proper stdin/out/err. When we RESTART we must build a
>>new unix domain socket and must pass the stdin/out/err to the local
>>lamd.
>>
>>In the libmpi the situation is similar: all sockets in place (unless
>>using the shutdown trick) in the CONTINUE case - no sockets in place in
>>the RESTART case.
>
> (should we be using the checkpoint_at_lbl_dot_gov address for this thread?)
>
> Not sure what you mean here... Two things:
>
> 1. What's the value of CONTINUE?
> 2. What do you mean by "the proper stdin/out/err"?
>
> Longer explanations:
>
> 1. The way I understand it, if you CONTINUE, you still get a bunch of
> image files as output, right? Is the intent that these image files can be
> used later to restart the process?
> e.g., for the scenario:
>
> Time    Description
> ------  --------------------------------------------------------------
> T=0     mpirun C foo
> ...
> T=N     foo does a checkpoint/CONTINUE
> T=N+1   foo continues as if nothing had happened
> ...
> T=M     foo aborts/dies ungracefully
> ...
> T=P     user manually re-starts foo with the image files from the
>         checkpoint/CONTINUE at T=N
> ------  --------------------------------------------------------------
>
> Is that the intent?
>
> If so, then both CONTINUE and RESTART are supposed to turn out image
> files that are suitable for re-starting the process, right? If that's
> right, then I think that libmpi and mpirun need to do exactly the same
> thing in CONTINUE and RESTART. Particularly in terms of the MPI data
> connections (in the RPI), but also the connection to the lamd's unix
> socket -- they need to be flushed and closed before the checkpoint occurs
> and then re-opened after the checkpoint resumes (for both the CONTINUE
> and RESTART cases).
>
> If these connections are not flushed/closed, then the image files cannot
> reliably be used to restart the foo process.
>
> 2. What does lamd have to do with stdout/err/in? The local lamd's
> stdout/err will always be tied to where lamboot was run, and its stdin is
> closed. All remote lamds' stdout/err/in are closed.
>
> Did you mean the stdout/err/in of the user application being tied to
> mpirun? e.g., "mpirun C foo", how the stdout/err/in is tied to the
> originating mpirun? If so, the input/output from foo is passed *through*
> the lamd, but in a very transparent way -- the lamd only handles the
> setup, and the rest is done transparently by the OS (using file descriptor
> passing from mpirun to the lamd).
>
> So I'm not quite clear on what you mean...
>
> -----
>
> One clarification from my previous mail: Brian informs me that I was
> incorrect -- nsend/nrecv do *not* invoke malloc/free anywhere in their
> call stacks. So we should be ok there.
>
> {+} Jeff Squyres
> {+} http://www.lam-mpi.org/

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
NERSC Future Technologies Group           Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-495-2998
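For readers new to the return-twice pattern Paul describes, here is a minimal sketch that uses standard C setjmp()/longjmp() as a stand-in for cr_checkpoint(), since setjmp() is one of the calls he compares it to. It is an illustration under stated assumptions, not the checkpoint library's API: run_handler_demo, enum outcome, OUTCOME_CONTINUE and OUTCOME_RESTART are hypothetical names, the FAILURE branch is omitted, and the "restart" here is a longjmp() inside one live process, whereas a real restart resumes a new process from the image files.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical labels for the two branches of the handler template.
 * These are NOT return codes of any real checkpoint library. */
enum outcome { OUTCOME_CONTINUE, OUTCOME_RESTART };

static jmp_buf checkpoint_env;

/* Walks the handler template through both returns of one setjmp() call
 * site, recording which branch ran into trace[]. setjmp() returns 0 the
 * first time (the analogue of CONTINUE); the longjmp() at the end makes
 * it return a second time with a nonzero value (the analogue of RESTART).
 * Returns the number of trace entries written (always 2). */
static int run_handler_demo(enum outcome trace[], int max)
{
    volatile int n = 0;  /* volatile: its value must survive the longjmp() */

    assert(max >= 2);
    /* do CHECKPOINT work here */
    if (setjmp(checkpoint_env) != 0) {
        /* second return: do RESTART work here (e.g. rebuild sockets) */
        trace[n++] = OUTCOME_RESTART;
        return n;
    }
    /* first return: do CONTINUE work here (connections still in place) */
    trace[n++] = OUTCOME_CONTINUE;
    longjmp(checkpoint_env, 1);  /* simulate the later "restart" return */
}
```

Calling run_handler_demo(t, 2) fills t with OUTCOME_CONTINUE followed by OUTCOME_RESTART and returns 2, mirroring the chronological order of the two returns: the CONTINUE path runs first, right after the checkpoint is taken, and the RESTART path runs later.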
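On POSIX systems, the file descriptor passing both messages refer to is commonly done by sending an SCM_RIGHTS control message over a Unix domain socket. The sketch below shows only that general mechanism. It assumes, without confirmation from the thread, that mpirun and lamd use this technique, and send_fd()/recv_fd() are hypothetical helper names, not LAM functions. Because a passed descriptor exists only in the receiving process, a restarted mpirun has a new socket and new fds, so this exchange would have to be performed again, which matches Paul's point about repeating the setup at restart time.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a connected Unix domain socket using an
 * SCM_RIGHTS control message. Returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd)
{
    char byte = 0;  /* at least one byte of ordinary data must be sent */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {         /* union guarantees correct alignment for cmsghdr */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof msg);
    memset(&u, 0, sizeof u);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof u.buf;

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor sent by send_fd(). Returns a new descriptor
 * in this process referring to the same open file, or -1 on error. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof u.buf;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

For example, after socketpair(AF_UNIX, SOCK_STREAM, 0, sv), calling send_fd(sv[0], fd) in one process and recv_fd(sv[1]) in the other gives the receiver its own descriptor for the same open file, such as the terminal behind mpirun's stdout.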