Re: question about implement checkpoint into MPI program

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Mar 12 2010 - 11:47:38 PST

    fengguang,
    
    I am not an expert on the Open MPI parameters, but I believe that the 
    following page should have the documentation you need:
            http://osl.iu.edu/research/ft/ompi-cr/api.php
    I think "--mca snapc_base_global_snapshot_dir /some/directory" passed to 
    mpirun is what you want.  If that is not correct, then you should 
    probably ask on one of the Open MPI mailing lists.
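
    For what it's worth, here is a minimal sketch in C of assembling that
    mpirun command line from a small launcher program.  Only the
    snapc_base_global_snapshot_dir parameter comes from the page above.  The
    "-am ft-enable-cr" option is my recollection of how Open MPI enables its
    checkpoint/restart support and should be checked against its
    documentation.  The function names, the "-np" value and the application
    path are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* Fill argv with an mpirun command line that sets the global snapshot
 * directory.  Returns the number of arguments written (not counting the
 * terminating NULL), or -1 if argv has room for fewer than that plus NULL. */
int build_mpirun_argv(const char *snapshot_dir, const char *nprocs,
                      const char *app, const char *argv[], size_t cap)
{
    const char *parts[] = {
        "mpirun",
        "-am", "ft-enable-cr",  /* assumption: verify in the Open MPI docs */
        "--mca", "snapc_base_global_snapshot_dir", snapshot_dir,
        "-np", nprocs,          /* hypothetical process count */
        app                     /* hypothetical application path */
    };
    size_t n = sizeof parts / sizeof parts[0];
    size_t i;

    if (cap < n + 1)
        return -1;
    for (i = 0; i < n; i++)
        argv[i] = parts[i];
    argv[n] = NULL;
    return (int)n;
}

/* Replace the current process with mpirun.  Returns only on failure. */
int exec_mpirun(const char *argv[])
{
    execvp(argv[0], (char *const *)argv);
    perror("execvp");
    return -1;
}
```

    A caller would pass an array with room for at least ten entries, for
    example build_mpirun_argv("/some/directory", "4", "./my_app", args, 16),
    and then hand args to exec_mpirun().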
    
    I am not aware of anything that monitors an MPI application and 
    automatically restarts it from a checkpoint if it crashes.  Again, 
    asking on the Open MPI mailing lists may give a better answer.
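
    If nothing ready-made turns up, a small supervisor process is one way to
    approximate that behavior.  The following is a minimal sketch in C using
    only POSIX fork/exec/waitpid.  It is not part of BLCR or Open MPI, and
    the function names are hypothetical.  The restart command is left to the
    caller: with Open MPI I would expect it to be ompi-restart with the
    snapshot reference, and with plain BLCR cr_restart with the context
    file, but please check the respective documentation.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run a command and wait for it to finish.  Returns 0 if it exited with
 * status 0, 1 if it exited non-zero or was killed by a signal, and -1 if
 * fork() or waitpid() failed. */
int run_and_wait(char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        execvp(argv[0], argv);
        perror("execvp");
        _exit(127);             /* exec failed in the child */
    }
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
}

/* Start the job once.  While the last run failed, run the restart command,
 * at most max_restarts times.  Returns the number of restarts performed if
 * a run finally succeeded, or -1 if the job was still failing. */
int supervise(char *const run_argv[], char *const restart_argv[],
              int max_restarts)
{
    int restarts = 0;
    int rc = run_and_wait(run_argv);

    while (rc != 0 && restarts < max_restarts) {
        restarts++;
        rc = run_and_wait(restart_argv);
    }
    return rc == 0 ? restarts : -1;
}
```

    Note that this treats any non-zero exit status as a crash, which may be
    too coarse for a real job, and it always restarts with the same command
    rather than picking the most recent checkpoint.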
    
    As for documentation on the functions in libcr: at the moment the only 
    thing we have to offer is the comments in libcr.h and a few examples 
    (the comments in libcr.h refer to some of the examples).
    
    -Paul
    
    fengguang tian wrote:
    > Hi, Paul
    >
    > I am using Open MPI now, and, yes, it works now, thank you. Can I set 
    > a directory
    > to store the checkpoint files (context.XXXXX)? I saw that these files are 
    > all in the program directory by default. Also, how can I restart from 
    > the checkpoint file context.XXXXX automatically from within the program? 
    > Is it possible that, when a running process crashes, the program 
    > restarts automatically from the checkpoint file?
    >
    > BTW, are there any documents that introduce the usage of all these 
    > functions in the BLCR library? I cannot find any documents that talk 
    > about that.
    >
    > Cheers!
    > fengguang
    >
    > On Thu, Mar 11, 2010 at 11:42 PM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov 
    > <mailto:PHHargrove_at_lbl_dot_gov>> wrote:
    >
    >     fengguang tian wrote:
    >
    >         Hi
    >
    >         my question is similar to this question:
    >         http://www.nersc.gov/hypermail/checkpoint/0283.html
    >
    >         Which header file should I include in my C program? When I
    >         write a program following the advice at
    >         http://www.nersc.gov/hypermail/checkpoint/0732.html
    >
    >         it doesn't work.
    >
    >         *I want to implement checkpointing in an MPI C++ program, and
    >         checkpoint the process periodically and automatically.*
    >
    >
    >     If you want to write code like entry 0732 in the mail archive
    >     you'll want to #include "libcr.h" and link with "-lcr".
    >
    >     BLCR does not directly handle checkpointing of communications,
    >     such as used in MPI.  Instead, BLCR provides mechanisms for an MPI
    >     implementation to participate in the checkpoint, in order to
    >     capture the state of communications.  Therefore, in order to use
    >     BLCR with an MPI application, you will need to be using one of the
    >     MPI implementations that have integrated with BLCR.  Of the
    >     commonly used MPIs, both Open MPI and MVAPICH2 include BLCR
    >     integration.  You should consult the documentation for whichever
    >     MPI you use to determine how to configure it for use with BLCR.
    >     Then you will also find in the MPI implementation-specific
    >     documentation some information on how the application can trigger
    >     a checkpoint.
    >
    >     -Paul
    >
    >     -- 
    >     Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >     <mailto:PHHargrove_at_lbl_dot_gov>
    >     Future Technologies Group                 Tel: +1-510-495-2352
    >     HPC Research Department                   Fax: +1-510-486-6900
    >     Lawrence Berkeley National Laboratory    
    >
    >
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
    
