jcduell_at_lbl_dot_gov
Date: Tue Feb 28 2006 - 14:05:02 PST
On Tue, Feb 28, 2006 at 03:09:11PM -0500, Greg Bronevetsky wrote:
> I am a grad student at Cornell, working on checkpointing of MPI
> applications. Our checkpointer works with any implementation of MPI and
> (in principle) with any single-process checkpointer. However, in
> practice integration with single process checkpointers is made more
> complex because by default such a checkpointer will save the state of
> the entire process, including MPI state. This is generally incorrect as
> MPI state contains hardware information that will not be valid on restart.
>
> I know that you've integrated BLCR with LAM, presumably in a way that
> doesn't save LAM's state but instead lets LAM save its own state. How
> did you do this? Was it via a special API (the callbacks referred to in
> your FAQ) or did you use a more general technique?

The LAM team used our callback notifications to shut down all TCP (or
other network) connections, so that when our checkpoint code ran, there
was no network state that needed to be saved. They also arranged to save
the info they need to reconnect all the processes at startup. Finally,
they arranged it so that using our checkpoint program on their 'mpirun'
(i.e., the user's initial program to start the parallel MPI job) caused
mpirun to arrange for all other processes in the MPI job to be
checkpointed before mpirun itself returned from the callback and was
checkpointed.

In sum, our code just 'sees' that a single 'mpirun' process is to be
checkpointed. Mpirun's callback contains all the logic that ensures each
process in the parallel job is checkpointed before mpirun itself is
checkpointed. Restart works the same way--mpirun's restart callback
handles restarting the entire parallel job. Needless to say, this wasn't
transparent to the MPI library--they did a lot of work to handle the
parallel aspects.

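
To make the ordering concrete, here is a rough sketch of the pattern in Python. It is only a simulation: `Process`, `make_job`, and `demo` are hypothetical names made up for illustration, not BLCR or LAM APIs, and "saving" a process is reduced to appending a line to a log. The point is just that the launcher's callback checkpoints every rank before the launcher itself is saved, and that restart mirrors this.

```python
# Hypothetical simulation of the coordination pattern; none of these
# names come from BLCR or LAM.

class Process:
    """Stand-in for one OS process handled by a single-process checkpointer."""

    def __init__(self, name):
        self.name = name
        self.checkpoint_callbacks = []
        self.restart_callbacks = []

    def checkpoint(self, log):
        # Callbacks run first; only after they return is the process saved.
        for callback in self.checkpoint_callbacks:
            callback(log)
        log.append(f"saved {self.name}")

    def restart(self, log):
        # The process image is restored first, then its restart callbacks run.
        log.append(f"restored {self.name}")
        for callback in self.restart_callbacks:
            callback(log)


def make_job(n_ranks):
    """Build a launcher whose callbacks drive checkpoint/restart of all ranks."""
    ranks = [Process(f"rank{i}") for i in range(n_ranks)]
    launcher = Process("mpirun")

    def checkpoint_all_ranks(log):
        for rank in ranks:
            rank.checkpoint(log)

    def restart_all_ranks(log):
        for rank in ranks:
            rank.restart(log)

    launcher.checkpoint_callbacks.append(checkpoint_all_ranks)
    launcher.restart_callbacks.append(restart_all_ranks)
    return launcher, ranks


def demo():
    """Checkpoint and then restart a two-rank job, returning the event log."""
    launcher, _ = make_job(2)
    log = []
    launcher.checkpoint(log)  # the checkpointer only ever targets the launcher
    launcher.restart(log)
    return log
```

Calling `demo()` returns the event order: both ranks are saved before `mpirun`, and on restart `mpirun` is restored first and then brings the ranks back.
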
It sounds like your MPI library could be made to work with BLCR if you
can write a callback that shuts down any TCP/IP connections (and does
whatever other work you normally do for a checkpoint) right before
checkpoint time, and then restores them at restart. This is
theoretically just a matter of writing two functions--a checkpoint-time
callback, and a restart-time callback. How easy that is depends on
whether it's easy for you to close/reopen the network state.

Does that make sense?

-- 
Jason Duell               Future Technologies Group
<jcduell_at_lbl_dot_gov>  Computational Research Division
Tel: +1-510-495-2354      Lawrence Berkeley National Laboratory
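
As an illustration of the two-callback idea described above, here is a minimal Python sketch. It is a simulation under stated assumptions, not BLCR's interface: `NetworkLayer`, `take_checkpoint`, and `restart_from` are hypothetical names, connections are plain placeholder objects rather than real sockets, and the "process image" is just a dictionary. It shows the checkpoint-time callback recording reconnect info and closing everything, and the restart-time callback reopening connections from that saved info.

```python
# Hypothetical sketch of a checkpoint-time / restart-time callback pair.
# Nothing here is a BLCR or MPI API; connections are simulated.

class NetworkLayer:
    """Stand-in for a library's table of open network connections."""

    def __init__(self):
        self.open_peers = {}   # peer name -> simulated connection handle
        self.saved_peers = []  # reconnect info, kept in ordinary memory

    def connect(self, peer):
        self.open_peers[peer] = object()

    def checkpoint_callback(self):
        # Remember who we were connected to, then drop every connection so
        # no live network state exists when the process image is saved.
        self.saved_peers = sorted(self.open_peers)
        self.open_peers.clear()

    def restart_callback(self):
        # Re-establish connections using only the saved reconnect info.
        for peer in self.saved_peers:
            self.connect(peer)


def take_checkpoint(net):
    """Simulated checkpointer: run the callback, then save plain memory."""
    net.checkpoint_callback()
    return {"saved_peers": list(net.saved_peers)}


def restart_from(image):
    """Simulated restart: restore plain memory, then run the callback."""
    net = NetworkLayer()
    net.saved_peers = list(image["saved_peers"])
    net.restart_callback()
    return net
```

In this sketch the saved image never contains a connection handle, only the peer names needed to reconnect, which is the property that lets a single-process checkpointer save the process without touching network state.
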