[greg_at_bronevetsky_dot_com: Re: MPI support for BLCR]

jcduell_at_lbl_dot_gov
Date: Tue Feb 28 2006 - 17:45:26 PST

Greg,

I'm going to ponder this for a little while before answering.  I'm also
forwarding to our mailing list, so the other BLCR developers can think
it over, too.

I understand that your software layer intercepts all calls to MPI, and
then runs some arbitrary MPI layer underneath it.  Could you tell me
what happens to this underlying MPI layer when you checkpoint?  Do you
kill it off (MPI_Finalize) and then recreate it all at restart
(MPI_Init), transparently to the user?  If not, it's unclear to me how
you're checkpointing the application without preserving any of the MPI
libraries' "state" (could you be clear about what "state" you're talking
about--sockets?  List of hostname/ports where the jobs are running?  All
stack/heap state?).


-- 
Jason Duell             Future Technologies Group
<jcduell_at_lbl_dot_gov>       Computational Research Division
Tel: +1-510-495-2354    Lawrence Berkeley National Laboratory


----- Forwarded message from Greg Bronevetsky <greg_at_bronevetsky_dot_com> -----

From: Greg Bronevetsky <greg_at_bronevetsky_dot_com>
Subject: Re: MPI support for BLCR
Date: Tue, 28 Feb 2006 20:21:15 -0500
To: JCDuell_at_lbl_dot_gov

What you're describing mostly makes sense but I still don't understand 
how LAM state was separated from application state. Does LAM not have 
any MPI state in the application's address space and instead keeps it in 
a separate process?

Our checkpoint coordination approach is more complex than LAM's because 
we don't require the network to be empty but instead keep track of 
outstanding MPI messages and record them as necessary in our checkpoint 
(we chose this because it is more scalable). Furthermore, we are not an 
MPI implementation but rather a layer that runs between the application 
and MPI, intercepting all MPI calls. As such, we can work with any 
implementation of MPI. This of course poses some problems: if the 
application is statically linked, then the application, our layer, and the 
MPI implementation are all parts of the same process image and will all 
get saved by a system like BLCR. This would be erroneous, since much of 
the saved MPI state would be invalid on restart. Instead we 
need a way to save just the application state and leave our layer and 
the MPI implementation alone so that we can take care of it ourselves.
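
A minimal sketch of the interception idea described above, assuming a 
single in-memory FIFO as a stand-in for the underlying MPI library. All 
names here (under_send, under_recv, layer_send, layer_recv, 
layer_outstanding, layer_checkpoint) are hypothetical and are not taken 
from the actual layer or from any MPI implementation. The wrappers count 
traffic so that, at checkpoint time, the layer knows how many messages 
are still outstanding and can record them instead of requiring an empty 
network.

```c
#include <assert.h>
#include <string.h>

#define MAX_MSGS 16
#define MSG_LEN  32

/* Stand-in for the underlying MPI library: one in-memory FIFO channel. */
static char channel[MAX_MSGS][MSG_LEN];
static int chan_head = 0, chan_tail = 0;

static int under_send(const char *buf) {
    if (chan_tail == MAX_MSGS) return -1;          /* channel full */
    strncpy(channel[chan_tail], buf, MSG_LEN - 1);
    channel[chan_tail][MSG_LEN - 1] = '\0';
    chan_tail++;
    return 0;
}

static int under_recv(char *buf) {
    if (chan_head == chan_tail) return -1;         /* nothing pending */
    strcpy(buf, channel[chan_head++]);
    return 0;
}

/* State owned by the interception layer, not by the application. */
static int sent_count = 0;
static int recv_count = 0;

/* The application calls these wrappers instead of the library directly. */
int layer_send(const char *buf) {
    int rc = under_send(buf);
    if (rc == 0) sent_count++;
    return rc;
}

int layer_recv(char *buf) {
    int rc = under_recv(buf);
    if (rc == 0) recv_count++;
    return rc;
}

/* Number of messages sent but not yet received. */
int layer_outstanding(void) {
    assert(sent_count >= recv_count);
    return sent_count - recv_count;
}

/* At checkpoint time, pull outstanding messages out of the channel and
 * record them in the caller's buffer. Returns how many were recorded. */
int layer_checkpoint(char saved[][MSG_LEN], int max) {
    int n = 0;
    while (n < max && layer_outstanding() > 0) {
        if (layer_recv(saved[n]) != 0) break;
        n++;
    }
    return n;
}
```

In a real MPI program, wrappers like these are commonly built on the 
standard MPI profiling interface (the PMPI_ entry points); whether the 
layer discussed in this thread does so is not stated here.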

We would be willing to modify aspects of our layer to make it more 
compatible with BLCR but we cannot modify the underlying MPI 
implementation since the whole point is for our system to work on any 
MPI implementation. Would this type of checkpointing be possible with BLCR?

-- 
                            Greg Bronevetsky

>The LAM team used our callback notifications to shut down all TCP (or
>other network) connections, so that when our checkpoint code ran, there
>was no network state that needed to be saved.  They also arranged to save
>the info they need to reconnect all the processes at startup.  Finally,
>they also arranged it so that using our checkpoint program on their
>'mpirun' (i.e., the user's initial program to start the parallel MPI job)
>caused mpirun to arrange for all other processes in the MPI job to be
>checkpointed before mpirun itself returned from the callback and was
>checkpointed.  In sum, our code just 'sees' that a single 'mpirun'
>process is to be checkpointed.  Mpirun's callback contains all the logic
>that ensures each process in the parallel job is checkpointed before it
>itself is checkpointed.  Restart works the same way--mpirun's restart
>callback handles restarting the entire parallel job.
>
>Needless to say, this wasn't transparent to the MPI library--they did a
>lot of work to handle the parallel aspects.
>
>It sounds like your MPI library could be made to work with BLCR if you
>can write a callback that shuts down any TCP/IP connections (and does
>whatever other work you normally do for a checkpoint) right before
>checkpoint time, and then restores them at restart.  This is
>theoretically just a matter of writing two functions--a checkpoint-time
>callback, and a restart-time callback.  How easy that is depends on
>whether it's easy for you to close/reopen the network state.
>
>Does that make sense?

----- End forwarded message -----
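
The two-callback pattern suggested in the quoted message (close network 
connections right before the checkpoint, restore them at restart) might 
look roughly like the sketch below. It assumes a hypothetical peer table 
in which an "open" flag stands in for a real socket, and the names 
on_checkpoint and on_restart are illustrative only. BLCR's actual 
callback registration calls are deliberately not reproduced here.

```c
#include <assert.h>

#define NPEERS 3

/* Hypothetical connection table; "open" stands in for a live socket. */
typedef struct {
    const char *host;
    int port;
    int open;
} peer_t;

static peer_t peers[NPEERS] = {
    {"node0", 5000, 1},
    {"node1", 5001, 1},
    {"node2", 5002, 1},
};

/* Reconnect info saved at checkpoint time: indices of peers to reopen. */
static int saved_idx[NPEERS];
static int nsaved = 0;

int open_count(void) {
    int n = 0;
    for (int i = 0; i < NPEERS; i++)
        if (peers[i].open) n++;
    return n;
}

/* Checkpoint-time callback: remember which peers were connected, then
 * close every connection so no network state remains to be saved.
 * Returns the number of connections that were closed. */
int on_checkpoint(void) {
    nsaved = 0;
    for (int i = 0; i < NPEERS; i++) {
        if (peers[i].open) {
            saved_idx[nsaved++] = i;
            peers[i].open = 0;     /* stand-in for closing the socket */
        }
    }
    assert(open_count() == 0);
    return nsaved;
}

/* Restart-time callback: reopen every connection recorded above.
 * Returns the number of connections that were restored. */
int on_restart(void) {
    int reopened = 0;
    for (int i = 0; i < nsaved; i++) {
        peers[saved_idx[i]].open = 1;  /* stand-in for reconnecting */
        reopened++;
    }
    return reopened;
}
```

A caller would invoke on_checkpoint() just before the process image is 
saved and on_restart() once the process resumes. How hard the real 
versions are depends, as the quoted message says, on how easily the 
network state can be closed and reopened.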
