From: Jeff Squyres (jsquyres_at_lam-mpi.org)
Date: Sat Jun 29 2002 - 09:49:36 PDT
If I'm understanding you correctly, I think we need a little more than that, right?

Take MPI, for example. We were planning to have a "handler" that runs in its own thread, independent of the user's thread(s). Is that the async or sync handler in your definition? (I'm not sure what the difference between your async and sync handlers is.)

So it could/would be something like:

Timestep  User program              MPI async handler
--------  ------------------------  ------------------------------------------
0         running                   doesn't exist yet
1         running, in MPI func      just invoked
2         running, in MPI func      blocks waiting for user MPI func to finish
3         running (out of MPI)      locks MPI
4         calls MPI_Foo             accounting/prep for checkpoint
5         blocks at MPI_Foo front   accounting/prep for checkpoint
6         blocks at MPI_Foo         finishes accounting/prep
7         blocks at MPI_Foo         indicates "Ready for chkpt"
8         ---entire app is checkpointed---
          ---time passes, and the entire app is restored---
N         blocks at MPI_Foo         re-setup after restore
N+1       blocks at MPI_Foo         unlocks MPI
N+2       enters MPI_Foo            finishes / returns
--------  ------------------------  ------------------------------------------

That's what we're talking about, right?

So the MPI handler has two distinct phases: shutdown and restore. Shutdown needs to be executed before the checkpoint, and restore needs to be executed after the checkpoint. Hence, the MPI handler must span the checkpoint, either by:

- having the checkpoint occur in the middle of the handler function
- having 2 handler routines (one for shutdown and one for restore) that libcr will invoke at the appropriate times
- having 1 handler routine that can sense which phase it should run (perhaps by the args passed to it by libcr or something... a minor implementation issue), and libcr invokes it at the appropriate times

These are all effectively equal (locking issues notwithstanding), and not really the point under discussion here... The point I'm trying to make is that the MPI handler needs to be run on both sides of the actual checkpoint.
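To make the two-phase idea concrete, here is a minimal Python sketch of the third option (one handler routine that is told which phase to run). Everything in it is hypothetical and for illustration only: `MpiLayer`, `handler`, `mpi_foo`, and `checkpoint` are stand-ins, not libcr or MPI APIs, and the actual checkpoint is reduced to a log entry. The lock plays the role of "locks MPI" / "unlocks MPI" in the timeline above.

```python
import threading


class MpiLayer:
    """Hypothetical stand-in for an MPI library that owns a checkpoint handler."""

    def __init__(self):
        self.lock = threading.Lock()  # held while the library is quiesced
        self.log = []

    def mpi_foo(self):
        # A user-facing call: it blocks at the front while the handler holds the lock.
        with self.lock:
            self.log.append("MPI_Foo runs")

    def handler(self, phase):
        # One routine; the (hypothetical) checkpointer says which phase to run.
        if phase == "shutdown":
            self.lock.acquire()  # waits for any in-flight MPI call, then locks MPI
            self.log.append("prep for checkpoint")
        elif phase == "restore":
            self.log.append("re-setup after restore")
            self.lock.release()  # unlocks MPI; blocked callers may now enter
        else:
            raise ValueError(phase)


def checkpoint(layer, during=None):
    """Assumed checkpointer behaviour: call the handler on both sides of the dump."""
    layer.handler("shutdown")
    if during is not None:
        during()  # e.g. the user program calls into MPI while it is locked
    layer.log.append("app checkpointed and restored")
    layer.handler("restore")


def run_demo():
    layer = MpiLayer()
    user = threading.Thread(target=layer.mpi_foo)
    checkpoint(layer, during=user.start)  # user thread blocks at MPI_Foo
    user.join()
    return layer.log
```

Calling `run_demo()` starts the user thread while MPI is locked, so its `MPI_Foo` entry can only be logged after the restore phase has released the lock. The two-routine variant is the same sketch with `handler` split into separate shutdown and restore functions.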
So my question is: why do your async handlers only get executed *before* the checkpoint? Don't all handlers need to be able to be executed on both sides? Or, I guess more generally, doesn't it seem safer/more flexible to allow all handlers to execute on both sides of the checkpoint?

Hope that made sense...

I will be on the Monday teleconf this week; sorry I wasn't there last week.

On Fri, 28 Jun 2002, Eric Roman wrote:

> What's the story for synchronous and asynchronous handlers at restart?
> I keep forgetting the answer.
>
> So we want to have the async. handler running concurrently w/ the
> application threads during checkpoint time. When the async handler
> thread completes its work, it calls back in, then the synchronous
> handlers callback, then the context is dumped.
>
> True? So it's
>
> 1: Async handler runs
> 2: Async handler completes and calls back into kernel
> 3: Application threads interrupted w/ signal
> 4: Synchronous handlers execute
> 5: Synchronous handlers complete checkpoint and call back into kernel
> 6: Context for this thread is written
>
> Now on restart:
>
> 1: Context for this thread is read
> 2: Synchronous handlers resume execution
> 3: Synchronous handlers complete restart and call back into kernel
> 4: All threads (app + async) allowed to continue execution
>
> I think what's described above is the correct thing to do. But, we
> might allow 2, 3, and 4 to take place concurrently, or even in the
> reverse order.
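The ordering in the quoted message can be sketched as a small Python simulation. It is only an illustration under stated assumptions: `run_checkpoint_restart`, `mpi_async`, and `mpi_sync` are hypothetical names, the kernel steps are reduced to trace entries, and a synchronous handler is modelled as a generator whose `yield` marks the point where the context is dumped (the "checkpoint occurs in the middle of the handler" option).

```python
def run_checkpoint_restart(async_handlers, sync_handlers):
    """Return the event trace for one checkpoint followed by one restart."""
    trace = []

    # Checkpoint steps 1-2: async handlers run to completion first.
    for handler in async_handlers:
        handler(trace)
    trace.append("async handlers call back into kernel")

    # Checkpoint step 3.
    trace.append("application threads interrupted")

    # Checkpoint steps 4-5: run each synchronous handler up to its yield.
    suspended = []
    for handler in sync_handlers:
        gen = handler(trace)
        next(gen)
        suspended.append(gen)

    # Checkpoint step 6, then restart step 1.
    trace.append("context written")
    trace.append("context read")

    # Restart steps 2-3: resume each synchronous handler after its yield.
    for gen in suspended:
        next(gen, None)

    # Restart step 4.
    trace.append("all threads continue")
    return trace


def mpi_async(trace):
    # Hypothetical async handler: runs only before the checkpoint here.
    trace.append("async: accounting/prep")


def mpi_sync(trace):
    # Hypothetical sync handler spanning the checkpoint.
    trace.append("sync: shutdown")
    yield  # the context is dumped and later restored at this point
    trace.append("sync: restore")
```

In this sketch only the synchronous handler spans the checkpoint. Letting async handlers run on both sides too, as asked above, would amount to writing them as generators as well and resuming them in the restart half.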