Re: Question about Checkpoint on LamMPI

From: Jeff Squyres (jsquyres_at_lam-mpi.org)
Date: Mon Jul 07 2003 - 15:30:18 PDT


On Mon, 7 Jul 2003, tingyu wrote:

>          One thing about your project puzzles me a little: since the
> project supports MPI program checkpointing, how do you define the
> "parallel MPI program checkpoint"?

It's an involuntary, coordinated checkpoint across all MPI processes that
were initially started via a single mpirun.  Checkpointing MPI-2 dynamic
processes is not yet supported.

>          As far as I understand, your implementation has a mechanism
> for cleaning up the messages in transit on the network, after which
> all of the processes involved in the communication are suspended,

Specifically, LAM/MPI's checkpoint support will coordinate between all of
its components to ensure that they are "ready for checkpoint".  For the
LAM components that currently support checkpointing, this means that the
network is drained (i.e., all outstanding "in flight" data is received).
But it is conceivable that there may someday be components where "ready
for checkpoint" does not necessarily mean that the network is drained
(e.g., shared memory).
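
As a rough illustration of what "draining" means, here is a small Python
simulation. It is only a sketch and not LAM's actual code: the Channel
class, the drain() function, and the idea of comparing per-channel sent
and received counts are all assumptions made for the example.

```python
from collections import deque


class Channel:
    """Hypothetical one-way link between two simulated processes."""

    def __init__(self):
        self.in_flight = deque()  # data sent but not yet received
        self.sent = 0             # messages pushed by the sender
        self.received = 0         # messages pulled by the receiver

    def send(self, msg):
        self.in_flight.append(msg)
        self.sent += 1

    def recv(self):
        msg = self.in_flight.popleft()
        self.received += 1
        return msg


def drain(channels, sent_counts):
    """Receive until each channel has delivered everything its sender
    reports having sent. Returns the drained messages per channel."""
    buffered = {}
    for name, ch in channels.items():
        buffered[name] = []
        while ch.received < sent_counts[name]:
            buffered[name].append(ch.recv())
    return buffered


# Two simulated processes with one channel in each direction.
channels = {"0->1": Channel(), "1->0": Channel()}
channels["0->1"].send("a")
channels["0->1"].send("b")
channels["1->0"].send("c")
channels["0->1"].recv()  # "a" is delivered normally before the request

# On a checkpoint request, the senders exchange their counts (assumed
# here to happen out of band), and the receivers drain the rest.
sent_counts = {name: ch.sent for name, ch in channels.items()}
buffered = drain(channels, sent_counts)
quiescent = all(len(ch.in_flight) == 0 for ch in channels.values())
```

After drain() returns, nothing is left in flight, so the state of every
process can be saved without losing data that was "on the wire". The
drained messages stay buffered in the receiver and are part of its saved
state.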

> and later all of the processes (I am not sure about this) will be
> migrated to another node and restarted. Is that correct?

Migration is not part of the checkpoint and restart process.
Specifically, where the processes are restarted is irrelevant (LAM will
re-coordinate the locations, even if they have not changed).

Additionally, once the processes have checkpointed, whether or not they
are ever restarted is a human decision.  That is, LAM has no control
until the process is actually restarted.

When/if the processes are restarted, all LAM components are coordinated to
"restart", which usually entails re-establishing network connections (but,
as per above, does not have to mean that).  The MPI processes can then
continue as if nothing had happened.
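
To make the checkpoint/restart coordination a bit more concrete, here is
a minimal Python sketch. It is not LAM's real interface: the class names,
the prepare_checkpoint() and restart() hooks, and the locations table are
all hypothetical, chosen only to show that each component decides for
itself what "ready for checkpoint" and "restart" mean.

```python
class TcpLikeComponent:
    """Hypothetical component whose state includes network connections."""

    def __init__(self, peers):
        self.peers = list(peers)
        self.connections = {p: "original-node" for p in self.peers}

    def prepare_checkpoint(self):
        # Assumption: the network was already drained, so the
        # connections can be dropped and no socket state is saved.
        self.connections = {}

    def restart(self, locations):
        # Re-establish a connection to wherever each peer lives now,
        # whether or not that location changed.
        self.connections = {p: locations[p] for p in self.peers}


class NoNetworkComponent:
    """Hypothetical component with nothing to drain or reconnect."""

    def __init__(self):
        self.restarts = 0

    def prepare_checkpoint(self):
        pass  # already "ready for checkpoint"

    def restart(self, locations):
        self.restarts += 1  # just note that a restart happened


def checkpoint_all(components):
    for c in components:
        c.prepare_checkpoint()


def restart_all(components, locations):
    for c in components:
        c.restart(locations)


tcp = TcpLikeComponent(peers=[1, 2])
other = NoNetworkComponent()
components = [tcp, other]

checkpoint_all(components)
connections_at_checkpoint = dict(tcp.connections)

# Some time later, a human decides to restart; peers may be anywhere.
restart_all(components, locations={1: "nodeA", 2: "nodeB"})
```

The point of the sketch is that the coordinator only calls the hooks.
Whether a hook drains a network, reopens connections, or does nothing is
up to the component.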

>          I checked the paper but still didn't get a comprehensive
>          picture.

Which paper are you referring to, exactly?

-- 
{+} Jeff Squyres
{+} [email protected]
{+} http://www.lam-mpi.org/