From: Jeff Squyres (jsquyres_at_open-mpi.org)
Date: Sat Jan 14 2006 - 14:25:18 PST
Paul --

Just an FYI -- we spent a little while debugging some behavior in LAM that turned out to be correct BLCR behavior. Thinking about it after the fact, the behavior we saw totally makes sense, but we weren't expecting it (hindsight is 20/20, right?), so I was wondering if you might want to add this to the documentation somewhere. I poked around in the docs and didn't see this mentioned anywhere, but then again, I couldn't find any API-level documentation -- so it's quite possible that I was looking in the wrong place.

The behavior in question is that a process being checkpointed has many (all?) of its signals blocked. That is, when the BLCR-registered callbacks are invoked, calling sigprocmask() shows that a bunch of signals are blocked (e.g., SIGINT is an easy one to check for). In the tests that I did, this was true for both the signal and thread callbacks, but this could have been a timing issue (i.e., they're really only blocked for the signal callback, and it's a race condition whether they're blocked for the thread callback). More specifically -- I did not try to figure out whether it was only for the signal callback or not.

This behavior makes total sense -- you don't want any other signals arriving while the signal callback is being invoked. However, in LAM's case, we don't return from the signal callback and instead exec() a new copy of mpirun. The signal blocking mask is inherited by the new mpirun, which therefore becomes unresponsive to Ctrl-C (for example).

Once we figured this out, it was easy enough to unblock all the relevant signals at the beginning of mpirun so that the newly-exec'ed mpirun becomes responsive to Ctrl-C, etc.

Granted, LAM's behavior is probably pretty atypical (calling exec() in the signal callback to launch a new process). But I thought it might be worthwhile to list this in the docs somewhere on the off-chance that someone else runs into a similar issue someday.

Thanks!

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/