Re: question about implement checkpoint into MPI program

From: Alexandre Strube (surak_at_surak.eti.br)
Date: Fri Mar 12 2010 - 01:56:32 PST

  • Next message: Paul H. Hargrove: "Re: question about implement checkpoint into MPI program"
    Hello Fengguang,
    
    You should take a look at the RADIC implementation on top of Open MPI,
    developed by Leonardo Fialho and other Open MPI developers. It lets you
    checkpoint MPI programs and get transparent fault tolerance.
    
    On Fri, Mar 12, 2010 at 6:12 AM, fengguang tian <fernyabc_at_gmail_dot_com> wrote:
    
    > Hi, Paul
    >
    > I am using Open MPI now and, yes, it works now, thank you. Can I set a
    > directory to store the checkpoint files (context.XXXXX)? I saw that these
    > files all go in the program directory by default. Also, how do I restart
    > from a context.XXXXX file automatically from within the program? Is it
    > possible that, when a running process crashes, the program restarts
    > automatically from the checkpoint file?
    >
    > BTW, are there any documents that describe the usage of all these functions
    > in the BLCR library? I cannot find any documents that talk about that.
    >
    > Cheers!
    > fengguang
    >
    >
    > On Thu, Mar 11, 2010 at 11:42 PM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:
    >
    >> fengguang tian wrote:
    >>
    >>> Hi
    >>>
    >>> My question is similar to this one:
    >>> http://www.nersc.gov/hypermail/checkpoint/0283.html
    >>>
    >>> What header file should I include in my C program? When I write a
    >>> program following the advice at
    >>> http://www.nersc.gov/hypermail/checkpoint/0732.html
    >>> it doesn't work.
    >>>
    >>> *I want to implement checkpointing in an MPI C++ program, and checkpoint
    >>> the process periodically and automatically.*
    >>>
    >>
    >> If you want to write code like entry 0732 in the mail archive you'll want
    >> to #include "libcr.h" and link with "-lcr".
    >>
    >> BLCR does not directly handle checkpointing of communications, such as
    >> used in MPI.  Instead, BLCR provides mechanisms for an MPI implementation to
    >> participate in the checkpoint, in order to capture the state of
    >> communications.  Therefore, in order to use BLCR with an MPI application,
    >> you will need to be using one of the MPI implementations that have
    >> integrated with BLCR.  Of the commonly used MPIs, both Open MPI and MVAPICH2
    >> include BLCR integration.  You should consult the documentation for
    >> whichever MPI you use to determine how to configure it for use with BLCR.
    >> Then you will also find in the MPI implementation-specific documentation
    >> some information on how the application can trigger a checkpoint.
    >>
    >> -Paul
    >>
    >> --
    >> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >> Future Technologies Group                 Tel: +1-510-495-2352
    >> HPC Research Department                   Fax: +1-510-486-6900
    >> Lawrence Berkeley National Laboratory
    >>
    >
    >
    
    
    -- 
    []
    Alexandre Strube
    surak_at_ubuntu_dot_com
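
On the "periodically and automatically" part of the quoted question: below is a minimal sketch in C of the timing logic only. It is not BLCR or Open MPI code. The names periodic_ckpt, periodic_ckpt_init, periodic_ckpt_poll, and counting_hook are hypothetical, and the hook is a stand-in for whatever trigger your MPI implementation's documentation describes.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical hook: performs the actual checkpoint request.
 * Returns 0 on success, nonzero on failure. */
typedef int (*checkpoint_hook)(void *ctx);

/* Hypothetical bookkeeping for interval-based checkpointing. */
typedef struct {
    double interval_s;     /* minimum seconds between checkpoints */
    time_t last;           /* time of the last checkpoint (or start time) */
    checkpoint_hook hook;  /* called when a checkpoint is due */
    void *ctx;             /* opaque argument passed to the hook */
    unsigned long taken;   /* number of successful checkpoints so far */
} periodic_ckpt;

void periodic_ckpt_init(periodic_ckpt *p, double interval_s, time_t now,
                        checkpoint_hook hook, void *ctx)
{
    assert(p != NULL && hook != NULL && interval_s > 0.0);
    p->interval_s = interval_s;
    p->last = now;
    p->hook = hook;
    p->ctx = ctx;
    p->taken = 0;
}

/* Call from the main loop. Returns 1 if a checkpoint was triggered,
 * 0 if none is due yet, and -1 if the hook reported a failure
 * (in which case the next poll will try again). */
int periodic_ckpt_poll(periodic_ckpt *p, time_t now)
{
    if (difftime(now, p->last) < p->interval_s)
        return 0;
    if (p->hook(p->ctx) != 0)
        return -1;
    p->last = now;
    p->taken++;
    return 1;
}

/* Placeholder hook for illustration: only counts how often it was called.
 * A real hook would use the MPI implementation's documented trigger. */
int counting_hook(void *ctx)
{
    ++*(int *)ctx;
    return 0;
}
```

In a real program you would call periodic_ckpt_poll(&p, time(NULL)) at a convenient point in the main loop. The time is passed in explicitly so the logic can be exercised without waiting.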
    
