Re: Please advise me about restarting with BLCR

From: Hideyuki Jitsumoto (jitsumo0_at_is.titech.ac.jp)
Date: Fri Oct 26 2007 - 00:33:34 PDT

  • Next message: Eric Roman: "Re: Please advise me about restarting with BLCR"
    Eric,
    
    Thank you for your reply, and I'm sorry that my explanation was sketchy.
    I am interested in fault-tolerant MPI and its recovery methods.
    My prototype is implemented on top of mpich, so I can't easily switch to another MPI.
    
    >Try running that, and see if any of the tests pass.
    I ran "make check". The results were:
    - VMware environment: 1 test was skipped, and all other tests passed.
    Unable to determine huge pagesize, if any (test skipped).
    SKIP: hugetlbfs.ct
    
    - Opteron environment: I got the following errors:
    ./bug2003: line 14: 32105 Segmentation fault      (core dumped)
    ${cr_run} ${dir}/bug2003_aux
    ./bug2003: line 16: 32124 Segmentation fault      ${cr_restart} context.$pid
    FAIL: bug2003
    
    cr_poll_checkpoint: Input/output error
    /home/jitumoto/blcr-0.6.1/builddir/tests/.libs/lt-mmaps[1539]: file
    "../../tests/crut.c", line 628, in crut_main: Error during checkpoint.
     crut_checkpoint_status = 0, saved error = -5
    checkpoint/nonzeroexit (255) at ./mmaps.ct line 128.
    FAIL: mmaps.ct
    
    Unable to determine huge pagesize, if any (test skipped).
    SKIP: hugetlbfs.ct
    
    I'm stuck because I can't tell whether this error is caused by my
    prototype or by BLCR. I would appreciate any hint, however small.
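For reference, the "saved error = -5" line and the "Input/output error" message in the mmaps.ct output above appear to describe the same failure, assuming the saved value follows the Linux kernel convention of reporting errors as negated errno codes. A minimal sketch of decoding such a value follows; decode_saved_error and print_saved_error are hypothetical helpers, not part of BLCR.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of BLCR: turn a kernel-style negative
 * error value (for example -5) into the positive errno it encodes. */
int decode_saved_error(int saved)
{
    assert(saved <= 0);   /* assumption: failures are reported as -errno */
    return -saved;
}

/* Print the raw value next to its standard message from strerror(). */
void print_saved_error(int saved)
{
    int err = decode_saved_error(saved);
    printf("saved error = %d -> errno %d (%s)\n", saved, err, strerror(err));
}
```

On Linux, EIO is 5 and EINVAL is 22, so print_saved_error(-5) corresponds to the "Input/output error" reported by cr_poll_checkpoint, and a value of -22 would correspond to "Invalid argument".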
    
    I describe my checkpointing method below.
    ---------------------------------------------------------------------
    I use BLCR only on the MPI application processes (I don't use BLCR on
    any of the daemons that make up mpich).
    The method is:
    Checkpoint
    1. The MPI application process gets a checkpoint-start message from MPD (a
    daemon that is part of mpich).
    2. The application process drains its in-flight messages, but it doesn't
    close its sockets.
    3. The application process checkpoints itself with BLCR by calling cr_request_file().
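Step 2 above can be sketched as follows. This is only an illustration under stated assumptions: drain_socket is a hypothetical helper, it relies on the MSG_DONTWAIT flag available on Linux, and it only empties what has already arrived in the local receive queue. Making sure that peers have stopped sending is a protocol-level matter that the sketch does not cover.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: read everything currently queued on fd into buf
 * without blocking and without closing the socket.
 * Returns the number of bytes drained, or -1 on a real error. */
ssize_t drain_socket(int fd, char *buf, size_t cap)
{
    size_t total = 0;

    assert(buf != NULL);
    while (total < cap) {
        ssize_t n = recv(fd, buf + total, cap - total, MSG_DONTWAIT);
        if (n > 0) {
            total += (size_t)n;        /* got data, keep reading */
            continue;
        }
        if (n == 0)
            break;                     /* peer closed its end */
        if (errno == EINTR)
            continue;                  /* interrupted by a signal, retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                     /* receive queue is empty */
        return -1;                     /* genuine socket error */
    }
    return (ssize_t)total;
}
```

The descriptor stays open after the call, so it can still be used afterwards. If the return value equals cap, more data may remain queued and the call should be repeated with a fresh buffer.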
    
    Restart
    1. MPD restarts the application processes from their checkpoints with cr_restart.
    2. On restart, BLCR runs the following callback function, which is
    registered with CR_SIGNAL_CONTEXT.
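    int post_recover(void *arg) {
    	int rc;
    
    	/* cr_checkpoint() returns <0 on failure, 0 when execution continues
    	 * after the checkpoint, and >0 when resuming from a restart. */
    	rc = cr_checkpoint(0);
    	
    	if (rc < 0) {
    		/* The checkpoint itself failed. */
    		exit(EXIT_FAILURE);
    	} else if (rc) {
    		/* Restarted: close and reconnect the MPI application's sockets. */
    		ck_ch_post_recover();
    	}
    	return 0;
    }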
    The "Invalid argument" error occurs on rank 0. On the other ranks, I get a
    correct checkpoint.
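The branch structure of the callback above can be exercised without BLCR installed by separating the decision from the library call. The sketch below is an illustration only: it takes the return code as a plain int instead of calling cr_checkpoint(), and every name in it (ckpt_outcome, classify_rc, handle_rc, reconnect_calls) is hypothetical and does not come from libcr.h. The counter stands in for ck_ch_post_recover().

```c
#include <assert.h>

/* Hypothetical names for illustration; none of these come from libcr.h. */
enum ckpt_outcome { CKPT_FAILED, CKPT_CONTINUE, CKPT_RESTARTED };

/* Map a cr_checkpoint()-style return code to what happened:
 *   rc < 0  -> the checkpoint failed
 *   rc == 0 -> checkpoint taken, the original process continues
 *   rc > 0  -> the process has just been restarted from the checkpoint */
enum ckpt_outcome classify_rc(int rc)
{
    if (rc < 0)
        return CKPT_FAILED;
    if (rc == 0)
        return CKPT_CONTINUE;
    return CKPT_RESTARTED;
}

/* Counts how often the sockets would be rebuilt. */
int reconnect_calls = 0;

/* Returns 0 when the callback may return normally, -1 on failure. */
int handle_rc(int rc)
{
    switch (classify_rc(rc)) {
    case CKPT_FAILED:
        return -1;                 /* caller decides whether to exit */
    case CKPT_RESTARTED:
        reconnect_calls++;         /* reconnect only after a restart */
        return 0;
    case CKPT_CONTINUE:
        return 0;                  /* sockets are still valid */
    default:
        assert(0 && "classify_rc returned an unknown outcome");
        return -1;
    }
}
```

Keeping the decision in a plain function like this makes it easy to check that the socket reconnection runs only on the restart path and never on the continue path.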
    
    Sincerely yours,
    Hideyuki Jitsumoto
    
    On 10/26/07, Eric Roman <ESRoman_at_berkeley_dot_edu> wrote:
    >
    > I'm not sure where the invalid argument happened.  There are a lot of
    > places where we can return EINVAL.
    >
    > Let's get the easy stuff out of the way first.  Did make check work on the
    > Opteron?  Try running that, and see if any of the tests pass.  If that's ok,
    > my guess is that it's the MPI issue.  I'm really surprised that you were
    > able to restart an MPICH code at all.
    >
    > We don't support BLCR with MPICH right now.  That really shouldn't work at all.
    > If you want to checkpoint an MPI job, you can use LAM MPI or a recent release
    > of MVAPICH (for Infiniband).  OpenMPI support is coming -- it's in their
    > subversion tree, but not yet in a released version.  OpenMPI checkpointing
    > will be released in a few weeks.
    >
    > For now, you'll need to build the LAM libraries with BLCR support, and
    > relink your application with those libraries.  There are instructions
    > for how to do this on the LAM MPI web page.  Once that's done, let me
    > know if you still see the error.  We should work fine on the Opteron
    > environment you're using.
    >
    > Eric
    >
    > On Thu, Oct 25, 2007 at 08:24:36PM +0900, Hideyuki Jitsumoto wrote:
    > > Dear BLCR-ML-Members,
    > >
    > > I tried to use BLCR for checkpointing mpich on 2 execution environments.
    > > I used exactly the same code for the MPI application, mpich, and BLCR.
    > > But on one environment, I got the error message "Restart failed: Invalid argument".
    > >
    > > Environment
    > > 1. VMware 6.1(Intel Core 2 Duo), Linux-2.6.8-2-686, gcc 3.3.5
    > > 2. AMD Opteron 242*2, Linux-2.6.12.2, gcc 3.3.5
    > >
    > > On Environment 1 the restart worked correctly, but on Environment 2 it did not.
    > > So I compared the kernel logs with CR_KTRACE_ALL enabled.
    > > I noticed that Environment 2 hits an error in cr_rstrt_child.
    > >
    > > Please advise me on what is happening in BLCR, if you have any idea.
    > > Thank you.
    > >
    > > - Contents of /var/log/messages
    > > Environment 1 had:
    > > ....
    > > Oct 22 12:14:10 Concertino1 kernel: cr_restore_all_files
    > > <cr_rstrt_req.c:1880>, pid 32752: : recovering fs_struct...
    > > Oct 22 12:14:10 Concertino1 kernel: cr_load_file_info
    > > <cr_rstrt_req.c:1339>, pid 32752: : entering
    > > Oct 22 12:14:10 Concertino1 kernel: cr_restore_all_files
    > > <cr_rstrt_req.c:1911>, pid 32752: :    fd=0 dnr=1
    > > Oct 22 12:14:10 Concertino1 kernel: cr_restore_open_fifo
    > > <cr_pipes.c:488>, pid 32752: : entering
    > > Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
    > > <cr_pipes.c:498>, pid 32752: :    Open fifo: id == cef61800.
    > > Oct 22 12:14:11 Concertino1 kernel: cr_make_new_pipe <cr_pipes.c:437>,
    > > pid 32752: : pipe:[57509]:  Phase 1: Making new pipe.
    > > Oct 22 12:14:11 Concertino1 kernel: cr_restore_file_locks
    > > <cr_rstrt_req.c:1819>, pid 32752: : entering
    > > Oct 22 12:14:11 Concertino1 kernel: cr_load_file_info
    > > <cr_rstrt_req.c:1339>, pid 32752: : entering
    > > Oct 22 12:14:11 Concertino1 kernel: cr_restore_all_files
    > > <cr_rstrt_req.c:1911>, pid 32752: :    fd=1 dnr=1
    > > Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
    > > <cr_pipes.c:488>, pid 32752: : entering
    > > Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
    > > <cr_pipes.c:498>, pid 32752: :    Open fifo: id == cef616c0.
    > > Oct 22 12:14:11 Concertino1 kernel: cr_make_new_pipe <cr_pipes.c:437>,
    > > pid 32752: : pipe:[57510]:  Phase 1: Making new pipe.
    > > Oct 22 12:14:11 Concertino1 kernel: cr_restore_file_locks
    > > <cr_rstrt_req.c:1819>, pid 32752: : entering
    > > Oct 22 12:14:11 Concertino1 kernel: cr_load_file_info
    > > <cr_rstrt_req.c:1339>, pid 32752: : entering
    > > Oct 22 12:14:11 Concertino1 kernel: cr_restore_all_files
    > > <cr_rstrt_req.c:1911>, pid 32752: :    fd=2 dnr=1
    > > Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
    > > <cr_pipes.c:488>, pid 32752: : entering
    > > Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
    > > <cr_pipes.c:498>, pid 32752: :    Open fifo: id == cf2c16c0.
    > > Oct 22 12:14:12 Concertino1 kernel: cr_make_new_pipe <cr_pipes.c:437>,
    > > pid 32752: : pipe:[57511]:  Phase 1: Making new pipe.
    > > Oct 22 12:14:12 Concertino1 kernel: cr_restore_file_locks
    > > <cr_rstrt_req.c:1819>, pid 32752: : entering
    > > ....
    > >
    > > Environment 2 had,
    > > ....
    > > Oct 25 18:43:29 pad047 kernel: cr_restore_all_files
    > > <cr_rstrt_req.c:1880>, pid 18556: : recovering fs_struct...
    > > Oct 25 18:43:29 pad047 kernel: cr_load_file_info
    > > <cr_rstrt_req.c:1339>, pid 18556: : entering
    > > Oct 25 18:43:29 pad047 kernel: cr_restore_all_files
    > > <cr_rstrt_req.c:1911>, pid 18556: :    fd=0 dnr=1
    > > Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:488>,
    > > pid 18556: : entering
    > > Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:498>,
    > > pid 18556: :    Open fifo: id == f75172c0.
    > > Oct 25 18:43:29 pad047 kernel: cr_make_new_pipe <cr_pipes.c:437>, pid
    > > 18556: : pipe:[595796]:  Phase 1: Making new pipe.
    > > Oct 25 18:43:29 pad047 kernel: cr_restore_file_locks
    > > <cr_rstrt_req.c:1819>, pid 18556: : entering
    > > Oct 25 18:43:29 pad047 kernel: cr_load_file_info
    > > <cr_rstrt_req.c:1339>, pid 18556: : entering
    > > Oct 25 18:43:29 pad047 kernel: cr_restore_all_files
    > > <cr_rstrt_req.c:1911>, pid 18556: :    fd=1 dnr=1
    > > Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:488>,
    > > pid 18556: : entering
    > > Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:498>,
    > > pid 18556: :    Open fifo: id == f7517698.
    > > Oct 25 18:43:29 pad047 kernel: cr_make_new_pipe <cr_pipes.c:437>, pid
    > > 18556: : pipe:[595797]:  Phase 1: Making new pipe.
    > > Oct 25 18:43:29 pad047 kernel: cr_restore_file_locks
    > > <cr_rstrt_req.c:1819>, pid 18556: : entering
    > > Oct 25 18:43:29 pad047 kernel: cr_load_file_info
    > > <cr_rstrt_req.c:1339>, pid 18556: : entering
    > > Oct 25 18:43:29 pad047 kernel: cr_rstrt_child <cr_rstrt_req.c:2424>,
    > > pid 18556: : 18556: closing request descriptor
    > > Oct 25 18:43:29 pad047 kernel: cr_rstrt_child <cr_rstrt_req.c:2435>,
    > > pid 18556: : 18556: closing context file descriptor
    > > Oct 25 18:43:29 pad047 kernel: release_rstrt_req <cr_rstrt_req.c:94>,
    > > pid 18556: : ref count is approximately 2
    > > Oct 25 18:43:29 pad047 kernel: __cr_task_put <cr_task.c:114>, pid
    > > 18556: : Free cr_task_t ebf6f480
    > > ....
    > >
    > > --
    > > Sincerely Yours,
    > > Hideyuki Jitsumoto (jitsumo0_at_is.titech.ac.jp)
    > > Tokyo Institute of Technology Grad. School of Info. and Eng.
    > > Dept. MCS (Matsuoka Lab.)
    >
    > --
    > Eric Roman                       Department of Physics
    > 510-642-7302                     UC Berkeley
    >
    
    
    -- 
    Sincerely Yours,
    Hideyuki Jitsumoto (jitsumo0_at_is.titech.ac.jp)
    Tokyo Institute of Technology Grad. School of Info. and Eng.
    Dept. MCS (Matsuoka Lab.)
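
The comparison of the two kernel traces quoted earlier in this message can also be done mechanically. The following is a minimal sketch, assuming each log record has first been rejoined onto a single line and that the function name is the first token after "kernel: ". Both trace_function and first_divergence are hypothetical helpers, not part of BLCR.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the function name that follows "kernel: " in one syslog line into out.
 * Returns 1 on success, 0 if the line has no such field or out is too small. */
int trace_function(const char *line, char *out, size_t cap)
{
    static const char tag[] = "kernel: ";
    const char *p = strstr(line, tag);
    size_t n;

    if (p == NULL)
        return 0;
    p += sizeof tag - 1;
    n = strcspn(p, " \t\r\n");         /* the name ends at the first whitespace */
    if (n == 0 || n >= cap)
        return 0;
    memcpy(out, p, n);
    out[n] = '\0';
    return 1;
}

/* Index of the first record whose function name differs between two traces,
 * or -1 if they agree over the length of the shorter one.
 * A record that cannot be parsed is treated as a difference. */
long first_divergence(const char *const a[], size_t na,
                      const char *const b[], size_t nb)
{
    size_t i, n = na < nb ? na : nb;
    char fa[64], fb[64];

    assert(a != NULL && b != NULL);
    for (i = 0; i < n; i++) {
        int oka = trace_function(a[i], fa, sizeof fa);
        int okb = trace_function(b[i], fb, sizeof fb);
        if (!oka || !okb || strcmp(fa, fb) != 0)
            return (long)i;
    }
    return -1;
}
```

In the excerpts quoted in this thread, the two traces agree up to the third cr_load_file_info record. After that, Environment 1 continues with cr_restore_all_files for fd=2, while Environment 2 moves to cr_rstrt_child and closes its descriptors.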
    
