From: Greg Bronevetsky (greg_at_bronevetsky_dot_com)
Date: Tue Feb 28 2006 - 12:09:11 PST
I am a grad student at Cornell, working on checkpointing of MPI applications. Our checkpointer works with any implementation of MPI and (in principle) with any single-process checkpointer. In practice, however, integration with single-process checkpointers is more complex, because by default such a checkpointer saves the state of the entire process, including MPI state. This is generally incorrect, as MPI state contains hardware information that will not be valid on restart.

I know that you've integrated BLCR with LAM, presumably in a way that doesn't save LAM's state but instead lets LAM save its own state. How did you do this? Was it via a special API (the callbacks referred to in your FAQ) or did you use a more general technique?

-- Greg Bronevetsky