Re: Meeting Notes

From: Jason Duell (jcduell_at_lbl.gov)
Date: Thu Mar 21 2002 - 11:16:34 PST


On Thu, Mar 21, 2002 at 01:38:14PM -0500, Brian W. Barrett wrote:
>   The notification mechanism will work like a Unix signal (the first
>   version might use Unix signals, but this will change in future
>   versions).  The user will register a handler (function prototype
>   available in header file) that will be called whenever the process
>   is to be checkpointed.  Upon leaving the function, the process will
>   be checkpointed.  While in the handler, the function must conform to
>   all the requirements placed on Unix signal handlers.
> 
> * Can we get enough communication in MPIRUN in the signal handler
>   context, or are we completely hosed?
> 
>   - what can we run in a signal handler context?
>   - If we can't, what is our next option?
 
We're looking into ways to avoid having to run all of your checkpoint
logic in signal context.  The basic idea is an option under which the
CHKPT signal handler simply decrements an atomic counter and never calls
the checkpoint handlers directly.  Instead, some sort of polling
mechanism in the regular, non-signal code would detect whether the
signal had occurred (i.e., the counter had been decremented to 0), and
if so, the handler would then be called, all in regular rather than
signal context.

We've also considered adding 'checkpoint critical section' operations,
so that you could guarantee that a checkpoint would not occur during
certain code paths.  The poll to see if it's time for a checkpoint could
be built into these operations, so that a checkpoint would automatically
happen right before/after a critical section if the signal had arrived.
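
To make the idea concrete, here is a rough sketch in C of how the
deferred scheme could look.  Nothing here is the real interface: every
name (chkpt_register, chkpt_poll, chkpt_critical_enter,
chkpt_critical_exit) is made up for illustration, the code assumes a
single-threaded process, and it assumes the counter starts at 1 so that
one signal brings it to 0.  The actual checkpoint step is left as a
comment.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical names throughout; not the real checkpoint library API. */
typedef void (*chkpt_handler_t)(void);

static atomic_int chkpt_counter = 1;       /* 1 = idle, <= 0 = requested  */
static int chkpt_critical_depth = 0;       /* used in regular context only */
static chkpt_handler_t chkpt_user_handler = NULL;

/* Signal context: one lock-free atomic decrement and nothing else. */
static void chkpt_signal_handler(int signo)
{
    (void)signo;
    atomic_fetch_sub(&chkpt_counter, 1);
}

/* Regular context: run the user's handler if a request is pending and
 * we are not inside a critical section.  Returns 1 if it ran, else 0. */
int chkpt_poll(void)
{
    if (chkpt_critical_depth > 0 || atomic_load(&chkpt_counter) > 0)
        return 0;
    /* Re-arm first; a signal landing just before this store is folded
     * into the checkpoint that is about to happen. */
    atomic_store(&chkpt_counter, 1);
    if (chkpt_user_handler != NULL)
        chkpt_user_handler();
    /* A real library would take the checkpoint here. */
    return 1;
}

/* Entering a section: checkpoint right before it if one is pending. */
void chkpt_critical_enter(void)
{
    if (chkpt_critical_depth == 0)
        chkpt_poll();
    chkpt_critical_depth++;
}

/* Leaving the outermost section: checkpoint right after it if pending. */
void chkpt_critical_exit(void)
{
    assert(chkpt_critical_depth > 0);
    chkpt_critical_depth--;
    if (chkpt_critical_depth == 0)
        chkpt_poll();
}

/* Install the counting handler for signo and remember the user handler.
 * Returns 0 on success, -1 on failure (as sigaction does). */
int chkpt_register(int signo, chkpt_handler_t handler)
{
    struct sigaction sa;

    chkpt_user_handler = handler;
    sa.sa_handler = chkpt_signal_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(signo, &sa, NULL);
}
```

In this sketch the application would call chkpt_poll() at points where
it is safe to be checkpointed (say, once per iteration of a main loop),
and the user's handler is free to do anything ordinary code can do,
since it never runs in signal context.  A request that arrives inside a
critical section is simply held until the outermost chkpt_critical_exit().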

So you shouldn't worry about needing to operate in signal handler
context.


-- 
Jason Duell                               jcduell_at_lbl_dot_gov
NERSC Future Technologies Group           Tel: +1-510-495-2354
Lawrence Berkeley National Laboratory     Fax: +1-510-495-2998