From: Paul H. Hargrove (PHHargrove_at_lbl.gov)
Date: Tue Jul 23 2002 - 11:37:29 PDT
As promised, here is a write-up of what I proposed on the phone yesterday.
-Paul

Goal #1:
At checkpoint time some entity associated with mpirun needs to communicate with the lamd to propagate the checkpoint request to the application processes.

Constraints:
1) mpirun is blocked on a read from the lamd
2) liblam is not reentrant. Therefore mpirun cannot communicate with the lamd from signal context. Furthermore the lamd could not deal with the apparent transition of mpirun out of the blocked state.
3) liblam is not sufficiently thread safe/aware to allow a checkpoint handler thread to talk to the lamd. (not sure about the exact nature of the thread problem)
4) fork() from signal context will still leave us in signal context

Solution #1:
At initialization mpirun will register an async handler (one which is run in a separate thread, outside of signal context). This thread will fork() a new process which can communicate freely with the lamd to propagate the checkpoint request. If necessary the handler thread and its child process can communicate using fds created w/ pipe() or socketpair(). The handler thread will waitpid() for its child to exit, which indicates that the application processes have been checkpointed. Then the handler thread will call cr_checkpoint(), which allows any synchronous (signal context) handlers to run (see below). There is no work to be done at continue/restart.

Goal #2:
At restart time we want to spawn a new application schema which will perform the cr_restart on each node.

Constraints:
1) mpirun is blocked in a read() from the lamd and it is very hard to unwind its stack - this suggests that exec() is the easiest solution
2) we really want to keep the pid of the original mpirun
3) exec() from the async handler will not give us the proper pid and will still leave the first mpirun instance in some goofy state.

Solution #2a:
At initialization mpirun will register a synchronous handler (one which runs in signal context).
At checkpoint time this handler will run in signal context, interrupting the blocked read(). There is no work to be done at checkpoint time. Therefore, this handler can go directly into a call to cr_checkpoint(). At continue/restart time execution will resume with a return from this cr_checkpoint() call. In the case of a continue, there is probably no work to be done. At restart time this handler just exec()s a new mpirun with the new application schema.

Solution #2b: (an afterthought)
The exec() solution requires that we be able to find the mpirun executable, or else use /proc/self/exe. An alternative to the exec() is to do a sigsetjmp() early, perhaps even in main(), and use the corresponding siglongjmp() in place of the exec(). Otherwise things work just as in #2a.

Pseudocode for #2a:

    // It is safer to hold the schema in memory than create yet
    // another file on disk.  Especially since the LBNL code doesn't
    // deal w/ files for us yet.
    static struct foo *restart_schema;

    // Runs in a separate thread and works w/ the lamds to propagate the
    // checkpoint request to the application processes.
    // We also build the application schema needed at restart time because
    // we can do malloc() here, but not in the sync handler.
    void async_handler(void *my_arg)
    {
        pid_t pid;
        int rc;

        pid = fork();
        if (pid < 0) {
            // fork() failed!!!
            abort();
        } else if (!pid) {
            // This is the child.
            connect_to_lamd();
            spawn_cr_save_on_nodes();
            wait_for_cr_saves_to_complete();
            exit(0);
        }

        // Everything beyond this point is the parent
        restart_schema = construct_a_new_app_schema();
        waitpid(pid, NULL, 0);

        rc = cr_checkpoint();
        if (rc < 0) {
            // cr_checkpoint() failed!!!
            abort();
        } else if (!rc) {
            // CONTINUE: we don't need the app schema we built
            free_app_schema(restart_schema);
        } else {
            // RESTART: DO NOTHING (sync handler does the work)
        }
    }

    void sync_handler(void *my_arg)
    {
        int rc;

        // CHECKPOINT: no work to do
        rc = cr_checkpoint();
        if (rc < 0) {
            // cr_checkpoint() failed!!!
            abort();
        } else if (!rc) {
            // CONTINUE: DO NOTHING
        } else {
            // RESTART:
            // Need to exec() ourself w/ new schema.
            // Not knowing how that is done I make a guess...

            // NOTE: mkstemp is not reentrant so we should do
            // the equivalent work ourselves.
            // (Must be an array, not a pointer to a literal:
            // mkstemp() rewrites the XXXXXX in place.)
            char schema_file[] = "/tmp/schema.XXXXXX";
            int fd = mkstemp(schema_file);
            write_schema_to_fd(restart_schema, fd);
            close_all_fds_except_0_1_2();

            // IMPORTANT: POSIX.1 only guarantees execle and execve
            // to be reentrant, not the other exec-family members.
            execle("/proc/self/exe", argv[0], "--app_schema",
                   schema_file, (char *)NULL, environ);

            // Only reached if execle() failed.
            abort();
        }
    }

    void mpirun_main()
    {
        ...
        id1 = cr_register_sync(&sync_handler, my_pointer_arg, 0);
        id2 = cr_register_async(&async_handler, my_pointer_arg, 0);
        ...
    }

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
NERSC Future Technologies Group           Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-495-2998
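
Solution #2b is described only in words, so here is a minimal, self-contained sketch of the sigsetjmp()/siglongjmp() idea in isolation. Everything in it is an assumption made for illustration: the names restart_point, fake_sync_handler and run_demo are hypothetical, SIGUSR1 stands in for whatever delivers the restart notification, and no LAM or checkpoint-library calls are involved. It shows only that a handler running in signal context can abandon the interrupted stack and resume at a point saved early on, in the same process and therefore with the same pid.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

/* Hypothetical names; nothing here comes from LAM or the checkpoint library. */
static sigjmp_buf restart_point;          /* saved "early in main()" */
static volatile sig_atomic_t pass = 0;    /* how many times we resumed */

/* Stand-in for the RESTART branch of the sync handler: instead of
 * exec()ing a new mpirun, jump back to the saved restart point. */
static void fake_sync_handler(int signo)
{
    (void)signo;
    siglongjmp(restart_point, 1);
}

/* Stand-in for mpirun's main flow. Returns the number of times control
 * came back through the restart point, or -1 on error. */
int run_demo(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = fake_sync_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    pass = 0;

    /* savemask = 1: the signal mask is saved here and restored by
     * siglongjmp(), so SIGUSR1 is not left blocked after the jump. */
    if (sigsetjmp(restart_point, 1) != 0) {
        /* We arrived here from the handler: this is where the real
         * code would start over with the new application schema. */
        pass++;
    }

    if (pass == 0) {
        /* First pass: pretend the restart notification arrives while
         * we are deep in the normal code path. */
        raise(SIGUSR1);
        return -1;   /* not reached: the handler jumps away */
    }

    return pass;
}
```

Calling run_demo() returns 1: the first pass raises the signal, the handler jumps back to the sigsetjmp() point, and the second pass falls through to the return. As with the exec() approach, the jump is only safe if the interrupted code was not holding state that the resumed path depends on, which for a real mpirun blocked inside liblam would need to be checked.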