Re: Restarting asynchronous handlers

From: Jeff Squyres (jsquyres_at_lam-mpi.org)
Date: Sat Jun 29 2002 - 09:49:36 PDT


If I'm understanding you correctly, I think we need a little more than
that, right?

Take MPI, for example.

We were planning to have a "handler" that runs in its own thread,
independent of the user's thread(s).  Is that the async or sync handler in
your definition?  (I'm not sure what the difference between your async and
sync handlers is.)

So it could/would be something like:

Timestep	User program		MPI async handler
-------		---------------------	----------------------------------
0		running			doesn't exist yet
1		running, in MPI func	just invoked
2		running, in MPI func	blocks waiting for user MPI func
					to finish
3		running (out of MPI)	locks MPI
4		calls MPI_Foo		accounting/prep for checkpoint
5		blocks at MPI_Foo front	accounting/prep for checkpoint
6		blocks at MPI_Foo	finishes accounting/prep
7		blocks at MPI_Foo	indicates "Ready for chkpt"
8		---entire app is checkpointed---
		---time passes, and the entire app is restored---
N		blocks at MPI_Foo	re-setup after restore
N+1		blocks at MPI_Foo	unlocks MPI
N+2		enters MPI_Foo		finishes / returns
-------		---------------------	----------------------------------

That's what we're talking about, right?
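
To make the timeline concrete, here is a rough sketch of it as a toy
simulation.  Everything in it is an assumption for illustration only: the
"MPI lock" is an ordinary mutex, the checkpoint and restore are reduced to a
log line, and none of the names (run_timeline, async_handler, user_program)
come from libcr or any MPI implementation.

```python
import threading


def run_timeline():
    """Replay the timesteps above with two threads and return the event log."""
    events = []
    log_lock = threading.Lock()
    mpi_lock = threading.Lock()             # hypothetical "MPI lock"
    handler_holds_mpi = threading.Event()   # timestep 3 has happened
    ready_for_chkpt = threading.Event()     # timestep 7 has happened
    restored = threading.Event()            # the app has been restored

    def log(msg):
        with log_lock:
            events.append(msg)

    def async_handler():
        with mpi_lock:                      # timestep 3: lock MPI
            log("handler: locks MPI")
            handler_holds_mpi.set()
            log("handler: accounting/prep for checkpoint")
            log("handler: ready for chkpt")
            ready_for_chkpt.set()
            restored.wait()                 # the checkpoint happens here
            log("handler: re-setup after restore")
            log("handler: unlocks MPI")     # lock is released on block exit

    def user_program():
        handler_holds_mpi.wait()            # user code is outside MPI
        log("user: calls MPI_Foo")
        with mpi_lock:                      # blocks at the front of MPI_Foo
            log("user: enters MPI_Foo")

    threads = [threading.Thread(target=async_handler),
               threading.Thread(target=user_program)]
    for t in threads:
        t.start()

    # Stand-in for libcr: wait until the handler is ready, "checkpoint",
    # then let everything continue as if restored.
    ready_for_chkpt.wait()
    log("--- entire app is checkpointed, later restored ---")
    restored.set()

    for t in threads:
        t.join()
    return events


if __name__ == "__main__":
    print("\n".join(run_timeline()))
```

The point of the sketch is only that the handler holds the lock across the
checkpoint, so the user thread cannot get into MPI_Foo until the restore
work is done.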

So the MPI handler has two distinct phases: shutdown and restore.
Shutdown needs to be executed before the checkpoint, and restore needs to
be executed after the checkpoint.

Hence, the MPI handler must span the checkpoint, either by:

- having the checkpoint occur in the middle of the handler function
- having 2 handler routines (one for shutdown and one for restore) that
  libcr will invoke at the appropriate times
- having 1 handler routine that can sense which phase it should run
  (perhaps by the args passed to it by libcr or something... a minor
  implementation issue), and libcr invokes it at the appropriate times

These are all effectively equivalent (locking issues notwithstanding), and
not really the point under discussion here...  The point I'm trying to make
is that the MPI handler needs to be run on both sides of the actual
checkpoint.
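
For what it's worth, the three options listed above might look roughly like
this.  It is a sketch under made-up names: fake_checkpoint stands in for
whatever libcr really does, and the handler signatures and the Phase argument
are hypothetical, not a proposed libcr interface.

```python
from enum import Enum


class Phase(Enum):
    SHUTDOWN = "shutdown"
    RESTORE = "restore"


def fake_checkpoint(trace):
    # Stand-in for the real checkpoint; here it only records that it ran.
    trace.append("checkpoint")


# Option 1: the checkpoint occurs in the middle of the handler function.
def handler_spanning(trace, do_checkpoint):
    trace.append("shutdown")
    do_checkpoint(trace)        # returns once the app has been restored
    trace.append("restore")


# Option 2: two handler routines, invoked at the appropriate times.
def on_shutdown(trace):
    trace.append("shutdown")


def on_restore(trace):
    trace.append("restore")


def drive_two_routines(trace):
    on_shutdown(trace)
    fake_checkpoint(trace)
    on_restore(trace)


# Option 3: one routine that is told which phase it is running in.
def handler_phased(trace, phase):
    trace.append(phase.value)


def drive_phased(trace):
    handler_phased(trace, Phase.SHUTDOWN)
    fake_checkpoint(trace)
    handler_phased(trace, Phase.RESTORE)


def run_all():
    t1, t2, t3 = [], [], []
    handler_spanning(t1, fake_checkpoint)
    drive_two_routines(t2)
    drive_phased(t3)
    return t1, t2, t3


if __name__ == "__main__":
    for trace in run_all():
        print(trace)
```

In all three cases the handler's work brackets the checkpoint in the same
order, which is the only property I care about here.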

So my question is: why do your async handlers only get executed *before*
the checkpoint?  Don't all handlers need to be able to be executed on both
sides?  Or, I guess more generally, doesn't it seem safer/more flexible to
allow all handlers to execute on both sides of the checkpoint?

Hope that made sense...

I will be on the Monday teleconf this week; sorry I wasn't there last
week.


On Fri, 28 Jun 2002, Eric Roman wrote:

> What's the story for synchronous and asynchronous handlers at restart?
> I keep forgetting the answer.
>
> So we want to have the async handler running concurrently w/ the
> application threads during checkpoint time.  When the async handler
> thread completes its work, it calls back in, then the synchronous
> handlers run and call back in, then the context is dumped.
>
> True?  So it's
>
> 1: Async handler runs
> 2: Async handler completes and calls back into kernel
> 3: Application threads interrupted w/ signal
> 4: Synchronous handlers execute
> 5: Synchronous handlers complete checkpoint and call back into kernel
> 6: Context for this thread is written
>
> Now on restart:
>
> 1: Context for this thread is read
> 2: Synchronous handlers resume execution
> 3: Synchronous handlers complete restart and call back into kernel
> 4: All threads (app + async) allowed to continue execution
>
> I think what's described above is the correct thing to do.  But, we
> might allow 2, 3, and 4 to take place concurrently, or even in the
> reverse order.