From: fengguang tian (fernyabc_at_gmail_dot_com)
Date: Fri Mar 12 2010 - 11:55:22 PST
Hi, Paul

Thank you very much. I will read libcr.h to learn how to use the functions.

Cheers,
fengguang

On Fri, Mar 12, 2010 at 2:47 PM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:

> fengguang,
>
> I am not an expert on the Open MPI parameters, but I believe that the
> following page should have the documentation you need:
> http://osl.iu.edu/research/ft/ompi-cr/api.php
> I think "--mca snapc_base_global_snapshot_dir /some/directory" passed to
> mpirun is what you want. If that is not correct, then you should probably
> ask on one of the Open MPI mailing lists.
>
> I am not aware of anything that monitors an MPI application and
> automatically restarts it from a checkpoint if it crashes. Again, asking on
> the Open MPI mailing lists may give a better answer.
>
> As for documentation on the functions in libcr - at the moment the only
> thing we have to offer is the comments in libcr.h and a few examples (the
> comments in libcr.h refer to some of the examples).
>
> -Paul
>
> fengguang tian wrote:
>
>> Hi, Paul
>>
>> I am using Open MPI now, and, yes, it works now, thank you. Can I set a
>> directory to store the checkpoint files (context.XXXXX)? I saw that these
>> files are all in the program directory by default. Also, how do I restart
>> from the file context.XXXXX in the program automatically? Is it possible
>> that when a running process crashes, the program restarts automatically
>> from the checkpoint file?
>>
>> BTW, are there any documents that introduce the usage of all these
>> functions in the BLCR library? I cannot find any documents that talk
>> about that.
>>
>> Cheers!
>> fengguang
>>
>> On Thu, Mar 11, 2010 at 11:42 PM, Paul H. Hargrove
>> <PHHargrove_at_lbl_dot_gov> wrote:
>>
>>     fengguang tian wrote:
>>
>>         Hi
>>
>>         My question is similar to this question:
>>         http://www.nersc.gov/hypermail/checkpoint/0283.html
>>
>>         What header file should I include in my C program? When I
>>         write a program following the advice at
>>         http://www.nersc.gov/hypermail/checkpoint/0732.html
>>         it doesn't work.
>>
>>         *I want to implement checkpointing in an MPI C++ program, and
>>         checkpoint the process periodically and automatically.*
>>
>>     If you want to write code like entry 0732 in the mail archive
>>     you'll want to #include "libcr.h" and link with "-lcr".
>>
>>     BLCR does not directly handle checkpointing of communications,
>>     such as used in MPI. Instead, BLCR provides mechanisms for an MPI
>>     implementation to participate in the checkpoint, in order to
>>     capture the state of communications. Therefore, in order to use
>>     BLCR with an MPI application, you will need to be using one of the
>>     MPI implementations that have integrated with BLCR. Of the
>>     commonly used MPIs, both Open MPI and MVAPICH2 include BLCR
>>     integration. You should consult the documentation for whichever
>>     MPI you use to determine how to configure it for use with BLCR.
>>     Then you will also find in the MPI implementation-specific
>>     documentation some information on how the application can trigger
>>     a checkpoint.
>>
>>     -Paul
>>
>>     --
>>     Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
>>     Future Technologies Group                 Tel: +1-510-495-2352
>>     HPC Research Department                   Fax: +1-510-486-6900
>>     Lawrence Berkeley National Laboratory
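
For readers of the archive, the mpirun option suggested in this thread for choosing the checkpoint directory can be sketched as a small C helper that assembles the command line. This is only an illustration under stated assumptions, not something taken from the Open MPI or BLCR documentation: the helper name build_mpirun_cmd, the process count, and the program name ./my_app are hypothetical, and the "-am ft-enable-cr" switch is assumed to be how the Open MPI release in use enables checkpoint/restart. Only "--mca snapc_base_global_snapshot_dir" comes from the message itself, so check the rest against the Open MPI documentation for your version.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format an mpirun command line that asks Open MPI to store global
 * snapshots under snapshot_dir.
 *
 * Assumptions (not confirmed by the thread): "-am ft-enable-cr" enables
 * checkpoint/restart support, and the arguments contain no characters
 * that need shell quoting (no quoting is applied here).
 *
 * Returns 0 on success, -1 if an argument is empty or invalid, or if
 * the buffer is too small to hold the whole command.
 */
int build_mpirun_cmd(char *buf, size_t buflen, const char *snapshot_dir,
                     int nprocs, const char *program)
{
    /* NULL pointers are programming errors, not runtime conditions. */
    assert(buf != NULL && snapshot_dir != NULL && program != NULL);

    if (strlen(snapshot_dir) == 0 || strlen(program) == 0 || nprocs < 1)
        return -1;

    int n = snprintf(buf, buflen,
                     "mpirun -np %d -am ft-enable-cr "
                     "--mca snapc_base_global_snapshot_dir %s %s",
                     nprocs, snapshot_dir, program);

    /* snprintf reports the length it wanted; reject truncated output. */
    if (n < 0 || (size_t)n >= buflen)
        return -1;

    return 0;
}
```

Calling build_mpirun_cmd(cmd, sizeof cmd, "/some/directory", 4, "./my_app") with a 256-byte buffer leaves the string "mpirun -np 4 -am ft-enable-cr --mca snapc_base_global_snapshot_dir /some/directory ./my_app" in cmd. The helper only builds the string; whether and how to run it is left to the caller.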