Re: question about implement checkpoint into MPI program

From: fengguang tian (fernyabc_at_gmail_dot_com)
Date: Fri Mar 12 2010 - 11:55:22 PST

    Hi, Paul
    
    Thank you very much. I will read libcr.h to learn how to use the
    functions.
    
    Cheers,
    fengguang
    
    On Fri, Mar 12, 2010 at 2:47 PM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:
    
    > fengguang,
    >
    > I am not an expert on the Open MPI parameters, but I believe that the
    > following page should have the documentation you need:
    >       http://osl.iu.edu/research/ft/ompi-cr/api.php
    > I think "--mca snapc_base_global_snapshot_dir /some/directory" passed to
    > mpirun is what you want.  If that is not correct, then you should probably
    > ask on one of the Open MPI mailing lists.
    >
    > I am not aware of anything that monitors an MPI application and
    > automatically restarts it from a checkpoint if it crashes.  Again, asking on
    > the Open MPI mailing lists may give a better answer.
    >
    > As for documentation on the functions in libcr - at the moment the only
    > thing we have to offer is the comments in libcr.h and a few examples (the
    > comments in libcr.h refer to some of the examples).
    >
    > -Paul
    >
    > fengguang tian wrote:
    >
    >> Hi, Paul
    >>
    >> I am using Open MPI now, and yes, it works. Thank you. Can I set a
    >> directory to store the checkpoint files (context.XXXXX)? I saw that these
    >> files are all written to the program directory by default. Also, how can
    >> the program restart itself automatically from a context.XXXXX file? Is it
    >> possible for the program to restart automatically from the checkpoint
    >> file when a running process crashes?
    >>
    >> BTW, is there any documentation that describes how to use the functions
    >> in the BLCR library? I cannot find any documents that cover that.
    >>
    >> Cheers!
    >> fengguang
    >>
    >> On Thu, Mar 11, 2010 at 11:42 PM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:
    >>
    >>    fengguang tian wrote:
    >>
    >>        Hi
    >>
    >>        My question is similar to this one:
    >>        http://www.nersc.gov/hypermail/checkpoint/0283.html
    >>
    >>        What header file should I include in my C program? When I write
    >>        a program following the advice at
    >>        http://www.nersc.gov/hypermail/checkpoint/0732.html
    >>        it doesn't work.
    >>
    >>        *I want to add checkpointing to an MPI C++ program, and
    >>        checkpoint the process periodically and automatically.*
    >>
    >>
    >>    If you want to write code like entry 0732 in the mail archive
    >>    you'll want to #include "libcr.h" and link with "-lcr".
    >>
    >>    BLCR does not directly handle checkpointing of communications,
    >>    such as used in MPI.  Instead, BLCR provides mechanisms for an MPI
    >>    implementation to participate in the checkpoint, in order to
    >>    capture the state of communications.  Therefore, in order to use
    >>    BLCR with an MPI application, you will need to be using one of the
    >>    MPI implementations that have integrated with BLCR.  Of the
    >>    commonly used MPIs, both Open MPI and MVAPICH2 include BLCR
    >>    integration.  You should consult the documentation for whichever
    >>    MPI you use to determine how to configure it for use with BLCR.
    >>    Then you will also find in the MPI implementation-specific
    >>    documentation some information on how the application can trigger
    >>    a checkpoint.
    >>
    >>    -Paul
    >>
    >>    --
    >>    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >>    Future Technologies Group                 Tel: +1-510-495-2352
    >>    HPC Research Department                   Fax: +1-510-486-6900
    >>    Lawrence Berkeley National Laboratory
    >>
    >>
    >
    > --
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group                 Tel: +1-510-495-2352
    > HPC Research Department                   Fax: +1-510-486-6900
    > Lawrence Berkeley National Laboratory
    >
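
On the automatic-restart question in this thread: since no ready-made
monitor is known here, one possible approach is a small supervisor process
that launches the job, waits for it, and runs a restart command if the job
exits abnormally. The following is a minimal C sketch under that
assumption, using only POSIX fork/execvp/waitpid. The names run_and_wait
and run_with_restart are hypothetical helpers made up for this
illustration; they are not part of BLCR or Open MPI, and the actual launch
and restart command lines have to come from the documentation of the MPI
implementation in use.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] with arguments argv and wait for it.
 * Returns 0 only if the child exited normally with status 0, else -1. */
static int run_and_wait(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        execvp(argv[0], argv);
        perror("execvp");          /* reached only if exec failed */
        _exit(127);
    }

    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) {      /* retry only when interrupted by a signal */
            perror("waitpid");
            return -1;
        }
    }
    /* A crash (killed by a signal) or a non-zero exit both count as failure. */
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}

/* Hypothetical helper: start the job with first_cmd; after each failure
 * run restart_cmd, at most max_restarts times.
 * Returns the number of restarts used, or -1 if the job never succeeded. */
int run_with_restart(char *const first_cmd[], char *const restart_cmd[],
                     int max_restarts)
{
    assert(first_cmd && first_cmd[0] && restart_cmd && restart_cmd[0]);

    if (run_and_wait(first_cmd) == 0)
        return 0;

    for (int n = 1; n <= max_restarts; n++) {
        fprintf(stderr, "job failed, restart attempt %d\n", n);
        if (run_and_wait(restart_cmd) == 0)
            return n;
    }
    return -1;
}
```

A caller would pass the normal mpirun command line as first_cmd and the
MPI implementation's restart command as restart_cmd, each as a
NULL-terminated argument array. This sketch reruns the same restart_cmd on
every attempt; choosing the most recent checkpoint is left to the caller.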
    
