Re: Meeting Notes

From: Paul H. Hargrove (PHHargrove_at_lbl.gov)
Date: Thu Mar 21 2002 - 11:35:19 PST


Brian W. Barrett wrote:

[snip]
> Outstanding questions:
> ----------------------
> 
> * Can we get enough communication in MPIRUN in the signal handler
>   context, or are we completely hosed?
> 
>   - what can we run in a signal handler context?
>   - If we can't, what is our next option?


If you must do things that are not legal in signal handler context, 
then we can offer an alternative: polling (ugh!) for checkpoint 
requests.  In the header file I provided, see the comments about 
the CR_REG_NOASYNC flag and cr_progess().  Yes, this is already 
implemented as documented :-).
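
For illustration only, here is a minimal sketch of the general polling 
pattern in plain POSIX C.  It does not use the checkpoint library: the 
names install_request_handler() and poll_request() are hypothetical, 
and the real interface is whatever the header file documents.  The 
handler only sets a flag, which is always legal in signal handler 
context, and the application checks the flag at points where it is 
safe to do real work.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set by the handler, cleared by the poller.  volatile sig_atomic_t is
 * the type the C standard guarantees can be written from a handler. */
static volatile sig_atomic_t request_pending = 0;

/* Handler: does nothing except record that a request arrived. */
static void on_request(int signo)
{
    (void)signo;
    request_pending = 1;
}

/* Hypothetical helper: route signal `signo` to the flag-setting handler.
 * Returns 0 on success, -1 on failure (as sigaction() does). */
int install_request_handler(int signo)
{
    struct sigaction sa;

    assert(signo > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_request;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart interrupted system calls */
    return sigaction(signo, &sa, NULL);
}

/* Hypothetical helper: called from normal (non-handler) context.
 * Returns 1 if a request arrived since the last call, else 0.
 * Requests that arrive close together are coalesced into one. */
int poll_request(void)
{
    if (!request_pending)
        return 0;
    request_pending = 0;
    return 1;
}
```

The application would call poll_request() from its main loop and do 
the communication there, outside the handler.  The cost is latency: a 
request is not noticed until the next poll.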


> * What is the interface for restarting an application


There are three options from C code:
+ exec*("context_file"), where exec* is any flavor of exec call.
+ cr_exec("context_file")
+ system("restart_utility context_file")

The exec*() option is the most appealing and will probably be the 
eventual preferred form.  However, it will be the last one implemented.

cr_exec() will be implemented first.  It will have the exec() 
semantics of replacing the running process, so one will usually 
fork() first.
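
The fork-first pattern can be sketched with the standard calls, since 
the caller's side looks the same for any exec-style function.  This is 
an assumption-laden illustration: run_in_child() is a hypothetical 
helper, and execv() stands in for the restart call, whose final 
signature is not settled here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: exec argv[0] in a child process so that the
 * caller survives the exec.  Returns the child's exit status, or -1 if
 * fork()/waitpid() failed or the child was killed by a signal. */
int run_in_child(char *const argv[])
{
    pid_t pid;
    int status;

    assert(argv != NULL && argv[0] != NULL);

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: on success this call never returns, because the
         * process image is replaced. */
        execv(argv[0], argv);
        _exit(127);             /* reached only if the exec failed */
    }

    /* Parent: wait for the child, retrying if interrupted by a signal. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Exiting with 127 when the exec fails follows the shell convention for 
"command not found", which lets the parent tell a failed exec apart 
from most ordinary exit codes.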

The last option will exist to allow shell scripts to restart things 
before the exec() version works.  It will be a tiny wrapper around 
cr_exec().

NOTE THAT NONE OF THESE ARE IMPLEMENTED YET.

-Paul


-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
NERSC Future Technologies Group           Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-495-2998