From: Jason Duell (jcduell_at_lbl.gov)
Date: Thu Mar 21 2002 - 11:16:34 PST
On Thu, Mar 21, 2002 at 01:38:14PM -0500, Brian W. Barrett wrote:
> The notification mechanism will work like a Unix signal (the first
> version might use Unix signals, but this will change in future
> versions). The user will register a handler (function prototype
> available in header file) that will be called whenever the process
> is to be checkpointed. Upon leaving the function, the process will
> be checkpointed. While in the handler, the function must conform to
> all the requirements placed on Unix signal handlers.
>
> * Can we get enough communication in MPIRUN in the signal handler
>   context, or are we completely hosed?
>
>   - what can we run in a signal handler context?
>   - If we can't, what is our next option?

We're looking into ways to work around needing to run all your
checkpoint logic in signal context.

The basic idea is that we'll have an option to have the CHKPT signal
handler simply decrement an atomic counter, and never call the
checkpoint handlers directly. Instead, some sort of polling mechanism
would be used in the regular, non-signal code to detect if the signal
had occurred (i.e. counter decremented to 0), and if so, the handler
would then be called (all in regular, not signal, context).

We've also considered adding 'checkpoint critical section' operations,
so that you could guarantee that a checkpoint would not occur during
certain codepaths (the poll to see if it's time for a checkpoint could
be built into these operations, so that a checkpoint would
automatically happen right before/after a critical section if the
signal had arrived).

So you shouldn't worry about needing to operate in signal handler
context.

--
Jason Duell                             jcduell_at_lbl_dot_gov
NERSC Future Technologies Group         Tel: +1-510-495-2354
Lawrence Berkeley National Laboratory   Fax: +1-510-495-2998
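
For illustration only, the deferred scheme described in the reply above
(signal handler decrements a counter, regular code polls, critical
sections poll on entry and exit) might be sketched in C roughly as
follows. Every name here (chkpt_init, chkpt_poll, chkpt_critical_enter,
chkpt_critical_leave, CHKPT_SIGNAL) is hypothetical and is not the
checkpoint library's actual interface. SIGUSR1 is an assumed stand-in
for the CHKPT signal, and the sketch assumes a single-threaded process
on a platform where atomic_int is lock-free.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; not the real checkpoint library API. */
#define CHKPT_SIGNAL SIGUSR1          /* assumed stand-in for the CHKPT signal */

typedef void (*chkpt_callback_t)(void);

static atomic_int chkpt_counter = 1;  /* drops to 0 when a checkpoint is requested */
static int chkpt_critical_depth = 0;  /* nesting depth of critical sections */
static chkpt_callback_t chkpt_callback = NULL;

/* Signal handler: only decrements the counter, never runs user code. */
static void chkpt_signal_handler(int signo)
{
    (void)signo;
    atomic_fetch_sub(&chkpt_counter, 1);
}

/* Install the handler and remember the user's callback. Returns 0 on success. */
int chkpt_init(chkpt_callback_t cb)
{
    struct sigaction sa;

    chkpt_callback = cb;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = chkpt_signal_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(CHKPT_SIGNAL, &sa, NULL);
}

/* Poll from regular (non-signal) context.
 * Returns 1 if a pending request was serviced, 0 otherwise. */
int chkpt_poll(void)
{
    if (chkpt_critical_depth > 0 || atomic_load(&chkpt_counter) > 0)
        return 0;
    atomic_store(&chkpt_counter, 1);  /* re-arm before running user code */
    if (chkpt_callback != NULL)
        chkpt_callback();             /* runs in regular context */
    /* A real library would take the checkpoint at this point. */
    return 1;
}

/* Enter a region in which no checkpoint may occur. */
void chkpt_critical_enter(void)
{
    chkpt_poll();                     /* service a pending request first */
    chkpt_critical_depth++;
}

/* Leave the region; a request that arrived inside it is serviced now. */
void chkpt_critical_leave(void)
{
    assert(chkpt_critical_depth > 0);
    chkpt_critical_depth--;
    chkpt_poll();
}
```

Because the handler touches nothing but the counter, the user's callback
is free to do things that are unsafe in signal context (allocate memory,
perform I/O, communicate with other processes). The cost is latency: a
request is only noticed at the next chkpt_poll() call or critical-section
boundary, so the application has to reach one of those reasonably often.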