Summary of latest mpirun plan

From: Paul H. Hargrove (
Date: Tue Jul 23 2002 - 11:37:29 PDT

As promised, here is a write-up of what I proposed on the phone yesterday.


Goal #1:
At checkpoint time some entity associated with mpirun needs to
communicate with the lamd to propagate the checkpoint request to the
application processes.

1) mpirun is blocked on a read from the lamd
2) liblam is not reentrant.  Therefore mpirun cannot communicate with
the lamd from signal context.  Furthermore the lamd could not deal with
the apparent transition of mpirun out of the blocked state.
3) liblam is not sufficiently thread safe/aware to allow a checkpoint
handler thread to talk to the lamd.  (I am not sure about the exact
nature of the thread problem.)
4) fork() from signal context will still leave us in signal context

Solution #1:
At initialization mpirun will register an async handler (one which is
run in a separate thread, outside of signal context).  This thread will
fork() a new process which can communicate freely with the lamd to
propagate the checkpoint request.  If necessary the handler thread and
its child process can communicate using fds created w/ pipe() or
socketpair().  The handler thread will waitpid() for its child to exit,
which indicates that the application processes have been checkpointed.
Then the handler thread will call cr_checkpoint(), which allows any
synchronous (signal context) handlers to run (see below).  There is no
work to be done at continue/restart.
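
The fork()/pipe()/waitpid() part of Solution #1 can be sketched roughly
as follows.  This is only an illustration of the pattern, not LAM code:
propagate_checkpoint_request() is a hypothetical placeholder for
whatever the child actually says to the lamd, and the one-byte status
sent over the pipe is an assumption of this sketch.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical placeholder for the child's conversation with the lamd. */
static int propagate_checkpoint_request(void)
{
    return 0;   /* pretend the application processes were checkpointed */
}

/*
 * Fork a helper process, let it propagate the request, and wait for it
 * to exit.  Returns 0 if the helper reported success, -1 otherwise.
 */
int run_checkpoint_helper(void)
{
    int fds[2];
    int status;
    pid_t pid;
    char result = 1;
    ssize_t n;

    if (pipe(fds) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: outside the handler thread, free to talk to the lamd. */
        char ok = (propagate_checkpoint_request() == 0) ? 0 : 1;
        close(fds[0]);
        if (write(fds[1], &ok, 1) != 1)
            _exit(1);
        _exit(ok);
    }

    /* Parent (the handler thread): read the status byte, then reap. */
    close(fds[1]);
    do {
        n = read(fds[0], &result, 1);
    } while (n < 0 && errno == EINTR);
    close(fds[0]);
    if (n != 1)
        result = 1;

    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return (result == 0) ? 0 : -1;
}
```

In the real handler, cr_checkpoint() would be called only after this
function returns, so that the application processes are already
checkpointed when mpirun itself is.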

Goal #2:
At restart time we want to spawn a new application schema which will
perform the cr_restart on each node.

1) mpirun is blocked in a read() from the lamd and it is very hard to
unwind its stack - this suggests that exec() is the easiest solution
2) we really want to keep the pid of the original mpirun
3) exec() from the async handler will not give us the proper pid and
will still leave the first mpirun instance in some goofy state.

Solution #2a:
At initialization mpirun will register a synchronous handler (one which 
runs in signal context).  At checkpoint time this handler will run in 
signal context, interrupting the blocked read().  There is no work to be 
done at checkpoint time.  Therefore, this handler can go directly into
a call to cr_checkpoint().  At continue/restart time execution will 
resume with a return from this cr_checkpoint() call.  In the case of a 
continue, there is probably no work to be done.  At restart time this 
handler just exec()s a new mpirun with the new application schema.
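
The reason exec() satisfies the "keep the pid" requirement is that it
replaces the process image without creating a new process.  A small
check of that property, assuming /bin/sh exists and using a forked child
as a stand-in for mpirun (exec_keeps_pid() is a name made up for this
sketch):

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Returns 1 if a process keeps its pid across execle(), 0 if it does
 * not, and -1 on error.
 */
int exec_keeps_pid(void)
{
    int fds[2];
    int status;
    pid_t pid;
    char buf[32];
    size_t total = 0;
    ssize_t n;

    if (pipe(fds) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: point stdout at the pipe, then replace ourselves. */
        close(fds[0]);
        if (fds[1] != STDOUT_FILENO) {
            if (dup2(fds[1], STDOUT_FILENO) < 0)
                _exit(127);
            close(fds[1]);
        }
        /* The new image (a shell) prints its own pid via $$. */
        execle("/bin/sh", "sh", "-c", "echo $$", (char *)NULL, environ);
        _exit(127);   /* only reached if execle() failed */
    }

    /* Parent: collect whatever the exec'ed image printed. */
    close(fds[1]);
    while (total < sizeof(buf) - 1 &&
           (n = read(fds[0], buf + total, sizeof(buf) - 1 - total)) > 0)
        total += (size_t)n;
    close(fds[0]);
    buf[total] = '\0';

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    /* The pid reported after exec should equal the pid fork() gave us. */
    return (atol(buf) == (long)pid) ? 1 : 0;
}
```

This is also why the exec() has to come from the original mpirun
process rather than from a child it forks: a forked child gets a new
pid, and exec() in that child would keep the new one.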

Solution #2b: (an afterthought)
The exec() solution requires that we be able to find the mpirun 
executable, or else use /proc/self/exe.  An alternative to the exec() is 
to do a sigsetjmp() early, perhaps even in main(), and use the 
corresponding siglongjmp() in place of the exec().  Otherwise things 
work just as in #2a.

Pseudocode for #2a (including the async handler from #1):

   // It is safer to hold the schema in memory than create yet
   // another file on disk.  Especially since the LBNL code doesn't
   // deal w/ files for us yet.
   static struct foo *restart_schema;

   // Runs in a separate thread and works w/ the lamds to propagate the
   // checkpoint request to the application processes.
   // We also build the application schema needed at restart time because
   // we can do malloc() here, but not in the sync handler.
   void async_handler(void *my_arg) {
     pid_t pid;
     int rc;

     pid = fork();
     if (pid < 0) {
       // fork() failed!!!
     } else if (!pid) {
       // This is the child: talk to the lamd to propagate the
       // checkpoint request, then _exit() without returning.
       _exit(0);
     } else {
       // This is the parent: the child's exit indicates that the
       // application processes have been checkpointed.
       waitpid(pid, NULL, 0);
     }

     // Everything beyond this point is the parent

     restart_schema = construct_a_new_app_schema();

     rc = cr_checkpoint();
     if (rc < 0) {
       // cr_checkpoint() failed!!!
     } else if (!rc) {
       // CONTINUE: we don't need the app schema we built
     } else {
       // RESTART: DO NOTHING (sync handler does the work)
     }
   }

   void sync_handler(void *my_arg) {
     int rc;

     // CHECKPOINT: no work to do
     rc = cr_checkpoint();
     if (rc < 0) {
       // cr_checkpoint() failed!!!
     } else if (!rc) {
       // CONTINUE: probably no work to do
     } else {
       // RESTART:

       // Need to exec() ourself w/ new schema.
       // Not knowing how that is done I make a guess...

       // NOTE: mkstemp is not reentrant so we should do
       // the equivalent work ourselves.
       char schema_file[] = "/tmp/schema.XXXXXX";
       int fd = mkstemp(schema_file);

       // restart_schema was built by the async handler.
       write_schema_to_fd(restart_schema, fd);

       // IMPORTANT: POSIX.1 only guarantees execle and execve
       // to be reentrant, not the other exec-family members.
       execle("/proc/self/exe", argv[0], "--app_schema", schema_file,
              (char *)NULL, environ);
     }
   }

   void mpirun_main() {
     id1 = cr_register_sync(&sync_handler, my_pointer_arg, 0);
     id2 = cr_register_async(&async_handler, my_pointer_arg, 0);
   }
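
As for doing "the equivalent work" of mkstemp() from signal context, one
possibility is to build the name by hand and rely on open() with
O_CREAT|O_EXCL, using only calls POSIX lists as async-signal-safe
(getpid(), open()).  This is a sketch under assumptions: the
/tmp/schema.<pid>.<attempt> naming, the retry limit, and the name
open_schema_file() are all invented here.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Write the decimal digits of v at out; returns the new end pointer. */
static char *put_uint(char *out, unsigned long v)
{
    char tmp[24];
    int n = 0;

    do {
        tmp[n++] = (char)('0' + (v % 10));
        v /= 10;
    } while (v != 0);
    while (n > 0)
        *out++ = tmp[--n];
    return out;
}

/*
 * Create a new, exclusively owned file named /tmp/schema.<pid>.<attempt>.
 * On success the name is left in path and an open fd is returned;
 * on failure -1 is returned.  path must hold at least 64 bytes.
 */
int open_schema_file(char *path, size_t len)
{
    static const char prefix[] = "/tmp/schema.";
    unsigned long attempt;
    size_t i;
    char *p;
    int fd;

    if (len < 64)
        return -1;

    for (attempt = 0; attempt < 100; attempt++) {
        p = path;
        for (i = 0; i < sizeof(prefix) - 1; i++)
            *p++ = prefix[i];
        p = put_uint(p, (unsigned long)getpid());
        *p++ = '.';
        p = put_uint(p, attempt);
        *p = '\0';

        /* O_EXCL makes the create fail if the name already exists. */
        fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd >= 0)
            return fd;
        if (errno != EEXIST)
            return -1;
    }
    return -1;
}
```

The resulting path can then be passed to execle() in place of the
mkstemp() template.  The names are predictable, unlike mkstemp()'s, but
O_EXCL still prevents opening a file somebody else created first.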

Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
NERSC Future Technologies Group           Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-495-2998