jcduell_at_lbl_dot_gov
Date: Tue Feb 28 2006 - 17:45:26 PST
Greg,

I'm going to ponder this for a little while before answering. I'm also forwarding to our mailing list, so the other BLCR developers can think it over, too.

I understand that your software layer intercepts all calls to MPI, and then runs some arbitrary MPI layer underneath it. Could you tell me what happens to this underlying MPI layer when you checkpoint? Do you kill it off (MPI_Finalize) and then recreate it all at restart (MPI_Init), transparently to the user? If not, it's unclear to me how you're checkpointing the application without preserving any of the MPI libraries' "state" (could you be clear about what "state" you're talking about--sockets? List of hostname/ports where the jobs are running? All stack/heap state?).

--
Jason Duell                 Future Technologies Group
<jcduell_at_lbl_dot_gov>    Computational Research Division
Tel: +1-510-495-2354        Lawrence Berkeley National Laboratory

----- Forwarded message from Greg Bronevetsky <greg_at_bronevetsky_dot_com> -----

From: Greg Bronevetsky <greg_at_bronevetsky_dot_com>
Subject: Re: MPI support for BLCR
Date: Tue, 28 Feb 2006 20:21:15 -0500
To: JCDuell_at_lbl_dot_gov

What you're describing mostly makes sense but I still don't understand how LAM state was separated from application state. Does LAM not have any MPI state in the application's address space and instead keeps it in a separate process?

Our checkpoint coordination approach is more complex than LAM's because we don't require the network to be empty but instead keep track of outstanding MPI messages and record them as necessary in our checkpoint. (This was chosen because it is more scalable.)

Furthermore, we are not an MPI implementation but rather a layer that runs between the application and MPI, intercepting all MPI calls. As such, we can work with any implementation of MPI.
This of course poses some problems since if the application is statically linked then the application, our layer and the MPI implementation are all parts of the same process image and will all get saved by a system like BLCR. This would be erroneous since there would be a lot of MPI state that would be invalid on restart. Instead we need a way to save just the application state and leave our layer and the MPI implementation alone so that we can take care of it ourselves.

We would be willing to modify aspects of our layer to make it more compatible with BLCR but we cannot modify the underlying MPI implementation since the whole point is for our system to work on any MPI implementation. Would this type of checkpointing be possible with BLCR?

--
Greg Bronevetsky

>The LAM team used our callback notifications to shut down all TCP (or
>other network) connections, so that when our checkpoint code ran, there
>was no network state that needed to be saved. They also arrange to save
>the info they need to reconnect all the processes at startup. Finally,
>they also arranged it so that using our checkpoint program on their
>'mpirun' (i.e the user's initial program to start the parallel MPI job)
>caused mpirun to arrange for all other processes in the MPI job to be
>checkpointed before mpirun itself returned from the callback and was
>checkpointed. In sum, our code just 'sees' that a single 'mpirun'
>process is to be checkpointed. Mpirun's callback contains all the logic
>that ensures each job in the parallel job is checkpointed before it
>itself is checkpointed. Restart works the same way--mpirun's restart
>callback handles restarting the entire parallel job.
>
>Needless to say, this wasn't transparent to the MPI library--they did a
>lot of work to handle the parallel aspects.
>
>It sounds like your MPI library could be made to work with BLCR if you
>can write a callback that shuts down any TCP/IP connections (and does
>whatever other work you normally do for a checkpoint) right before
>checkpoint time, and then restores them at restart. This is
>theoretically just a matter of writing two functions--a checkpoint-time
>callback, and a restart-time callback. How easy that is depends on
>whether it's easy for you to close/reopen the network state.
>
>Does that make sense?
>

----- End forwarded message -----
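
The two-callback pattern described in the quoted text (close all connections right before the checkpoint, reopen them at restart) could look roughly like the sketch below. It is only an illustration of the idea in Python, not BLCR's or LAM's actual API: the class NetworkLayer, its method names, and the injectable dial function are all hypothetical, and how such callbacks would be registered with BLCR is not shown.

```python
import socket


class NetworkLayer:
    """Hypothetical communication layer (not BLCR's or LAM's actual API)."""

    def __init__(self, dial=socket.create_connection):
        self.dial = dial        # function that opens a connection to (host, port)
        self.conns = {}         # peer rank -> open connection
        self.saved_peers = {}   # peer rank -> (host, port), kept across a checkpoint

    def connect(self, rank, addr):
        """Open a connection to a peer and remember it by rank."""
        self.conns[rank] = self.dial(addr)

    def checkpoint_callback(self):
        """Run right before the checkpoint: leave no live network state behind."""
        for rank, conn in self.conns.items():
            # Record what is needed to reconnect, then drop the connection.
            self.saved_peers[rank] = conn.getpeername()
            conn.close()
        self.conns.clear()

    def restart_callback(self):
        """Run after restart: reconnect to every peer recorded at checkpoint time."""
        for rank, addr in self.saved_peers.items():
            self.connect(rank, addr)
        self.saved_peers.clear()
```

The only state that survives the checkpoint here is the table of peer addresses, which matches the quoted description of saving "the info they need to reconnect all the processes". A layer that tolerates in-flight messages, as Greg describes, would also have to record those messages and would need more than this sketch shows.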