Blocked signals in BLCR callbacks

From: Jeff Squyres (jsquyres_at_open-mpi.org)
Date: Sat Jan 14 2006 - 14:25:18 PST

    Paul --
    
    Just an FYI -- we spent a little while debugging some behavior in LAM  
    that turned out to be correct BLCR behavior.  Thinking about it after  
    the fact, the behavior we saw totally makes sense, but we weren't  
    expecting it (hindsight is 20/20, right?), so I was wondering if you  
    might want to add this to the documentation somewhere.  I poked around in  
    the docs and didn't see this mentioned anywhere, but then again, I  
    couldn't find any API-level documentation -- so it's quite possible  
    that I was looking in the wrong place.
    
    The behavior in question is that a process being checkpointed has  
    many (all?) of its signals blocked.  That is, when the  
    BLCR-registered callbacks are invoked, calling sigprocmask() shows  
    that a bunch of signals are blocked (e.g., SIGINT is an easy one to  
    check for).  In the tests that I did, this was true for both the  
    signal and thread callbacks, but that could have been a timing issue  
    (i.e., they're really only blocked for the signal callback, and it's  
    a race condition whether they're blocked for the thread callback).  
    I did not try to figure out whether it applies only to the signal  
    callback.
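
    For reference, here is a minimal sketch of the kind of check
    described above.  is_signal_blocked() is a hypothetical helper name,
    not part of BLCR or LAM; it only uses the standard sigprocmask() and
    sigismember() calls, and it could be called from inside a callback
    to see whether a given signal is currently blocked.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Hypothetical helper (not part of BLCR or LAM): returns 1 if signo is
 * blocked in the caller's signal mask, 0 if it is not, -1 on error. */
int is_signal_blocked(int signo)
{
    sigset_t current;

    /* With a NULL "set" argument the mask is left unchanged and the
     * current mask is simply reported back in "current". */
    if (sigprocmask(SIG_BLOCK, NULL, &current) != 0)
        return -1;

    return sigismember(&current, signo);
}
```

    Calling is_signal_blocked(SIGINT) at the top of a callback is enough
    to see the effect.  In a multi-threaded process, pthread_sigmask()
    is the portable way to query the calling thread's mask.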
    
    This behavior makes total sense -- you don't want any other signals  
    arriving while the signal callback is being invoked.
    
    However, in LAM's case, we don't return from the signal callback and  
    instead exec() a new copy of mpirun.  The signal mask is preserved  
    across exec(), so the new mpirun inherits the blocked signals and is  
    therefore unresponsive to Ctrl-C (for example).
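
    The inheritance can be reproduced outside of BLCR with a small
    stand-alone sketch.  Everything here is an assumption made for
    illustration: sigint_blocked_after_exec() is a hypothetical name,
    and the exec'ed program is /bin/sh (standing in for mpirun), which
    sends SIGINT to itself and then exits with status 7 if it survived.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo helper.  Forks a child that blocks (or unblocks)
 * SIGINT and then exec()s a shell which sends SIGINT to itself.
 * Returns 1 if the exec'ed shell survived (SIGINT was still blocked),
 * 0 if SIGINT terminated it, and -1 on any other outcome. */
int sigint_blocked_after_exec(int block_before_exec)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        sigset_t set;
        sigemptyset(&set);
        sigaddset(&set, SIGINT);

        /* Start from the default disposition so only the mask matters. */
        signal(SIGINT, SIG_DFL);
        sigprocmask(block_before_exec ? SIG_BLOCK : SIG_UNBLOCK, &set, NULL);

        /* The signal mask set above is preserved across exec(). */
        execl("/bin/sh", "sh", "-c", "kill -INT $$; exit 7", (char *)NULL);
        _exit(127);                 /* only reached if exec() failed */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 7)
        return 1;                   /* SIGINT stayed pending: still blocked */
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGINT)
        return 0;                   /* SIGINT was delivered and killed it */
    return -1;
}
```

    With block_before_exec set, the shell never sees the SIGINT, which
    is the same reason the exec'ed mpirun ignored Ctrl-C.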
    
    Once we figured this out, it was easy enough to unblock all the  
    relevant signals at the beginning of mpirun so that the newly-exec'ed  
    mpirun becomes responsive to Ctrl-C, etc.
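
    A minimal sketch of that kind of fix follows.  This is not the
    actual LAM code: unblock_interactive_signals() is a hypothetical
    name, and the list of signals is an assumption about what
    "relevant" might mean for a program driven from a terminal.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Hypothetical helper, meant to be called first thing in main().
 * Unblocks a few signals that a user typically expects to work.
 * The list below is an assumption, not taken from LAM.
 * Returns 0 on success, -1 on error. */
int unblock_interactive_signals(void)
{
    static const int signals[] = { SIGINT, SIGQUIT, SIGTERM, SIGHUP };
    sigset_t set;
    size_t i;

    if (sigemptyset(&set) != 0)
        return -1;
    for (i = 0; i < sizeof(signals) / sizeof(signals[0]); i++) {
        if (sigaddset(&set, signals[i]) != 0)
            return -1;
    }

    /* SIG_UNBLOCK removes only these signals from the inherited mask
     * and leaves every other blocked signal untouched. */
    return sigprocmask(SIG_UNBLOCK, &set, NULL);
}
```

    Using SIG_UNBLOCK with an explicit list keeps the change narrow; an
    empty set passed with SIG_SETMASK would clear the whole inherited
    mask instead.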
    
    Granted, LAM's behavior is probably pretty atypical (calling exec()  
    in the signal callback to launch a new process).  But I thought it  
    might be worthwhile to mention in the docs somewhere on the off-chance  
    that someone else runs into a similar issue someday.
    
    Thanks!
    
    --
    {+} Jeff Squyres
    {+} The Open MPI Project
    {+} http://www.open-mpi.org/
    
