Re: question about implement checkpoint into MPI program

From: fengguang tian (fernyabc_at_gmail_dot_com)
Date: Thu Mar 11 2010 - 21:12:44 PST

  • Next message: Alexandre Strube: "Re: question about implement checkpoint into MPI program"
    I am using Open MPI now and, yes, it works now, thank you. Can I set a
    directory to store the checkpoint files (context.XXXXX)? I saw that these
    files are all written to the program directory by default. Also, how can I
    restart from the file context.XXXXX automatically from within the program?
    Is it possible for the program to restart automatically from the
    checkpoint file when a running process crashes?
    BTW, is there any documentation that introduces the usage of all these
    functions in the BLCR library? I cannot find any documents that talk about that.
    On Thu, Mar 11, 2010 at 11:42 PM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:
    > fengguang tian wrote:
    >> Hi
    >> my question is similar to this question:
    >> what header file should I include in my C program? When I write a program
    >> following the advice:
    >> it doesn't work.
    >> *I want to implement checkpointing in an MPI C++ program, and checkpoint the
    >> process periodically and automatically.*
    > If you want to write code like entry 0732 in the mail archive you'll want
    > to #include "libcr.h" and link with "-lcr".
    > BLCR does not directly handle checkpointing of communications, such as used
    > in MPI.  Instead, BLCR provides mechanisms for an MPI implementation to
    > participate in the checkpoint, in order to capture the state of
    > communications.  Therefore, in order to use BLCR with an MPI application,
    > you will need to be using one of the MPI implementations that have
    > integrated with BLCR.  Of the commonly used MPIs, both Open MPI and MVAPICH2
    > include BLCR integration.  You should consult the documentation for
    > whichever MPI you use to determine how to configure it for use with BLCR.
    > Then you will also find in the MPI implementation-specific documentation
    > some information on how the application can trigger a checkpoint.
    > -Paul
    > --
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group                 Tel: +1-510-495-2352
    > HPC Research Department                   Fax: +1-510-486-6900
    > Lawrence Berkeley National Laboratory
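
On the automatic-restart question, one common approach is to run the job under a small supervisor process that relaunches it from a checkpoint when it exits abnormally. The following is only a minimal sketch of that idea in POSIX C, not something BLCR or Open MPI provides: the names run_and_wait and supervise are hypothetical, and the sketch assumes a checkpoint has already been taken and that the restart command (for example BLCR's cr_restart, or the restart tool of your MPI implementation) is passed in by the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run the command in argv and wait for it.
 * Returns 0 if the child exited normally with status 0, and 1 otherwise
 * (nonzero exit, death by signal, or failure to start). */
static int run_and_wait(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        execvp(argv[0], argv);
        perror("execvp");      /* reached only if exec failed */
        _exit(127);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return 1;
    }
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
}

/* Hypothetical supervisor: start the job once; while it keeps failing,
 * run the restart command, at most max_restarts times.
 * Returns the number of restarts used on success, or -1 if the job
 * still failed after max_restarts attempts. */
static int supervise(char *const start_argv[], char *const restart_argv[],
                     int max_restarts)
{
    if (run_and_wait(start_argv) == 0)
        return 0;

    for (int n = 1; n <= max_restarts; n++) {
        fprintf(stderr, "job failed, restart attempt %d\n", n);
        if (run_and_wait(restart_argv) == 0)
            return n;
    }
    return -1;
}
```

Here start_argv would hold the normal launch command (for example your mpirun line) and restart_argv the restart command with the checkpoint file or snapshot as its argument. The exact restart command and its arguments depend on the MPI implementation, so check its checkpoint/restart documentation before relying on this.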
