From: Jeff Squyres (jsquyres_at_lam-mpi.org)
Date: Mon Jul 07 2003 - 15:30:18 PDT
On Mon, 7 Jul 2003, tingyu wrote:

> There is a question making me little puzzle on ur problem:
> since the project supports MPI program checkpoint, how do u define the
> "parallel MPI program checkpoint" ?

It's an involuntary, coordinated checkpoint across all MPI processes that
were initially started via a single mpirun. Checkpointing MPI-2 dynamic
processes is not yet supported.

> As far as i understand, it seems in ur implementation, there is
> mechanism for cleaning up the messages transmitted in the network, then
> all of the processes invloved in this communication will be suspended,

Specifically, LAM/MPI's checkpoint support coordinates all of its
components to ensure that they are "ready for checkpoint". For the LAM
components that currently support checkpointing, this means that the
network is drained (i.e., all outstanding "in flight" data is received).
But it is conceivable that there may someday be components where "ready
for checkpoint" does not necessarily mean that the network is drained
(e.g., shared memory).

> and later all of the processes (? not for sure) will be migrated to
> other node ans restarted. Is it correct?

Migration is not part of the checkpoint and restart process.
Specifically: where the processes are restarted is irrelevant (LAM will
re-coordinate the locations, even if they have not changed).

Additionally, once the processes have checkpointed, whether or not they
are ever restarted is a human decision. That is, LAM has no control until
the processes are actually restarted.

When/if the processes are restarted, all LAM components are coordinated
to "restart", which usually entails re-establishing network connections
(but, as per above, does not have to mean that). The MPI processes can
then continue as if nothing had happened.

> I checked the paper but still didn't get a comprehensive
> image..

Which paper are you referring to, exactly?

-- 
{+} Jeff Squyres
{+} jsquyres_at_lam-mpi.org
{+} http://www.lam-mpi.org/
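
For illustration only, the sequence described above (drain the network,
checkpoint every process, optionally restart later) can be sketched as a
toy model. This is not LAM/MPI code: the Proc class and the send,
coordinated_checkpoint, and restart functions are hypothetical names, and
the "network" is assumed to be nothing more than an in-memory queue per
process.

```python
from collections import deque


class Proc:
    """Toy stand-in for one MPI process (hypothetical; not LAM/MPI code)."""

    def __init__(self, rank):
        self.rank = rank
        self.inbox = deque()   # data still "in flight" toward this process
        self.received = []     # data already delivered to the process

    def drain(self):
        # "Ready for checkpoint" in this model: receive all in-flight data.
        while self.inbox:
            self.received.append(self.inbox.popleft())

    def snapshot(self):
        # The saved image holds local state only; no network state, no location.
        return {"rank": self.rank, "received": list(self.received)}


def send(procs, dst, payload):
    # Put a message in flight toward process `dst`.
    procs[dst].inbox.append(payload)


def coordinated_checkpoint(procs):
    # Phase 1: every process becomes "ready for checkpoint".
    for p in procs:
        p.drain()
    # Nothing may remain in flight before any image is taken.
    assert all(not p.inbox for p in procs)
    # Phase 2: every process is checkpointed.
    return [p.snapshot() for p in procs]


def restart(images):
    # Rebuild each process from its image with fresh (empty) connections.
    # Where the processes run is decided at restart, not stored in the image.
    restored = []
    for img in images:
        p = Proc(img["rank"])
        p.received = list(img["received"])
        restored.append(p)
    return restored


procs = [Proc(rank) for rank in range(3)]
send(procs, 1, "a")
send(procs, 2, "b")
images = coordinated_checkpoint(procs)
restored = restart(images)  # optional; may happen much later, or never
```

Because the images contain no in-flight data and no node names, the
restarted processes can be placed anywhere and simply continue from the
state they had when the checkpoint was taken.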