From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Mar 12 2010 - 11:47:38 PST
fengguang,

I am not an expert on the Open MPI parameters, but I believe that the
following page has the documentation you need:
    http://osl.iu.edu/research/ft/ompi-cr/api.php
I think passing "--mca snapc_base_global_snapshot_dir /some/directory" to
mpirun is what you want. If that is not correct, then you should probably
ask on one of the Open MPI mailing lists.

I am not aware of anything that monitors an MPI application and
automatically restarts it from a checkpoint if it crashes. Again, asking on
the Open MPI mailing lists may give a better answer.

As for documentation on the functions in libcr: at the moment the only
thing we have to offer is the comments in libcr.h and a few examples (the
comments in libcr.h refer to some of the examples).

-Paul

fengguang tian wrote:
> Hi, Paul
>
> I am using Open MPI now, and, yes, it works now, thank you. Can I set
> a directory to store the checkpoint files (context.XXXXX)? I saw that
> these files are all in the program directory by default. Also, how do I
> restart from the checkpoint file context.XXXXX automatically from within
> the program? Is it possible that when a running process crashes, the
> program restarts automatically from the checkpoint file?
>
> BTW, are there any documents that introduce the usage of all the
> functions in the BLCR library? I cannot find any documents that talk
> about that.
>
> Cheers!
> fengguang
>
> On Thu, Mar 11, 2010 at 11:42 PM, Paul H. Hargrove
> <PHHargrove_at_lbl_dot_gov> wrote:
>
> > fengguang tian wrote:
> >
> > > Hi
> > >
> > > my question is similar to this question:
> > > http://www.nersc.gov/hypermail/checkpoint/0283.html
> > >
> > > What header file should I include in my C program? When I write
> > > a program following the advice in
> > > http://www.nersc.gov/hypermail/checkpoint/0732.html
> > > it doesn't work.
> > >
> > > *I want to implement checkpointing in an MPI C++ program, and
> > > checkpoint the process periodically and automatically.*
> >
> > If you want to write code like entry 0732 in the mail archive
> > you'll want to #include "libcr.h" and link with "-lcr".
> >
> > BLCR does not directly handle checkpointing of communications,
> > such as those used in MPI. Instead, BLCR provides mechanisms for an
> > MPI implementation to participate in the checkpoint, in order to
> > capture the state of communications. Therefore, in order to use
> > BLCR with an MPI application, you will need to be using one of the
> > MPI implementations that have integrated with BLCR. Of the
> > commonly used MPIs, both Open MPI and MVAPICH2 include BLCR
> > integration. You should consult the documentation for whichever
> > MPI you use to determine how to configure it for use with BLCR.
> > Then you will also find in the MPI implementation-specific
> > documentation some information on how the application can trigger
> > a checkpoint.
> >
> > -Paul

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
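
On the question of restarting automatically after a crash: the message above does not name any existing tool for this, so what follows is only a minimal sketch of what such a monitor could look like. It assumes the job and the restart step can each be expressed as an ordinary command line. The names run_and_wait and supervise are hypothetical helpers invented for this illustration; they are not part of BLCR or Open MPI. The sketch uses only POSIX fork/execvp/waitpid.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] with its arguments and wait for it.
 * Returns the child's exit status, 128 + signal number if the child was
 * killed by a signal, 127 if the program could not be executed, or -1 if
 * the child could not be created or waited for. */
int run_and_wait(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        /* Child: replace this process with the requested command. */
        execvp(argv[0], argv);
        perror("execvp");
        _exit(127);
    }

    /* Parent: wait for the child, retrying if a signal interrupts us. */
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) {
            perror("waitpid");
            return -1;
        }
    }
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}

/* Hypothetical helper: run the job once; while the most recent run ended
 * with a nonzero status, run the restart command instead, at most
 * max_restarts times. Returns the status of the last command that ran. */
int supervise(char *const run_argv[], char *const restart_argv[],
              int max_restarts)
{
    int rc = run_and_wait(run_argv);
    for (int attempt = 1; rc != 0 && attempt <= max_restarts; attempt++) {
        fprintf(stderr, "job ended with status %d; restart %d of %d\n",
                rc, attempt, max_restarts);
        rc = run_and_wait(restart_argv);
    }
    return rc;
}
```

In use, run_argv would be the normal mpirun command line and restart_argv would be whatever restart command your MPI implementation documents for its checkpoints. Both are assumptions to check against that documentation. Note that this only reacts to the launcher's exit status, so it cannot detect a job that hangs without exiting.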