From: Eric Roman (ESRoman_at_berkeley_dot_edu)
Date: Fri Oct 26 2007 - 16:22:05 PDT
On Fri, Oct 26, 2007 at 04:33:34PM +0900, Hideyuki Jitsumoto wrote:
> Eric,
>
> Thank you for your reply and I'm sorry about my sketchy explanation.
> I have interest in fault tolerant MPI and its recovering method.
> Now I implemented my prototype with mpich, so, I can't use other MPI easily.

Ok. If you're aware of the issues, then go for it! Are the MPICH guys aware
of your work? I think they put a callback for checkpointing in the ADI for
MPICH 2, but never implemented it. Also, I think the MVAPICH implementation
for Infiniband is built around MPICH 2. You might find some more information
in that direction.

> >Try running that, and see if any of the tests pass.
> I tried "make check" then,
> -VMware-environment skipped 1 test. And other tests was passed.
> Unable to determine huge pagesize, if any (test skipped).
> SKIP: hugetlbfs.ct

That's fine. Most systems don't have hugetlb turned on, so that test is
usually skipped.

> -Opteron-environment got following error,
> ./bug2003: line 14: 32105 Segmentation fault (core dumped)
> ${cr_run} ${dir}/bug2003_aux
> ./bug2003: line 16: 32124 Segmentation fault ${cr_restart} context.$pid
> FAIL: bug2003

That's actually kind of weird. This test checks for misbehaving signal
handlers. I don't think this is related to your bug, but it might be.

> cr_poll_checkpoint: Input/output error
> /home/jitumoto/blcr-0.6.1/builddir/tests/.libs/lt-mmaps[1539]: file
> "../../tests/crut.c", line 628, in crut_main: Error during checkpoint.
> crut_checkpoint_status = 0, saved error = -5
> checkpoint/nonzeroexit (255) at ./mmaps.ct line 128.
> FAIL: mmaps.ct

This test makes sure that mapped files shared between processes are restored
correctly.

Can you tell me whether these fail repeatedly? Or do they succeed sometimes?

> Unable to determine huge pagesize, if any (test skipped).
> SKIP: hugetlbfs.ct
>
> I'm in trouble because I can't specify whether the cause of this error
> is my prototype or BLCR. I want you to teach any trivial hint.

Neither of those tests should fail at all.

I'm not clear on what the issue is here. We've built against 2.6.12 for a
while, and I've done a lot of (successful) testing with that kernel.

Two obvious courses of action are to 1/ rebuild BLCR from one of our newer
releases, and 2/ upgrade the kernel on the Opteron machine to something a
little bit more recent.

That's what I'd try first. If you still have problems, then we can take it
from there. Make absolutely sure that the kernel configuration of the BLCR
module is correct. The System.map and the kernel sources should match.
Then make insmod, and then make check a few times to make sure that everything
is working. You should see everything PASS, and hugetlbfs will probably be
skipped.

Also, can you give me the kernel and distribution details? Is this a vanilla
kernel that you've built yourself? Or did one of the distributions ship with
this?

> I describe my checkpointing method.
> ---------------------------------------------------------------------
> I use BLCR with only MPI application process (I don't use BLCR with
> any other daemon constructing mpich).
> The method is:
> Checkpoint
> 1. MPI application process get checkpoint-starting-message from MPD (a
> daemon constructing mpich).
> 2. Application process drain its in-flight message, but it doesn't
> close its sockets.
> 3. Application process checkpointing with BLCR by cr_request_file().

That's good.

>
> Restart
> 1. MPD restart checkpoints of application process with cr_restart.
> 2. On restarting, BLCR use the following callback function with
> CR_SIGNAL_CONTEXT.
> int post_recover(void *arg) {
>     int rc;
>
>     rc = cr_checkpoint(0);
>
>     if (rc < 0) {
>         exit(EXIT_FAILURE);
>     } else if (rc) {
>         //close and reconnect MPI application's sockets.
>         ck_ch_post_recover();
>     }
>     return 0;
> }
> Invalid argument is happened on rank 0. On the other rank, I can get
> correct checkpoint.

Ok, well that tells us that there's something different about rank 0.
We just don't know what. (Shared memory? Is that compiled into your MPICH?
Or are you TCP only? We don't support checkpoints with SYSV shared memory in
use.)

> Sincerely yours,
> Hideyuki Jitsumoto
>
> On 10/26/07, Eric Roman <ESRoman_at_berkeley_dot_edu> wrote:
> >
> > I'm not sure where the invalid argument happened. There are a lot of
> > places where we can return EINVAL.
> >
> > Let's get the easy stuff out of the way first. Did make check work on the
> > Opteron? Try running that, and see if any of the tests pass. If that's ok,
> > my guess is that it's the MPI issue. I'm really surprised that you were
> > able to restart an MPICH code at all.
> >
> > We don't support BLCR with MPICH right now. That really shouldn't work at all.
> > If you want to checkpoint an MPI job, you can use LAM MPI or a recent release
> > of MVAPICH (for Infiniband). OpenMPI support is coming -- it's in their
> > subversion tree, but not yet in a released version. OpenMPI checkpointing
> > will be released in a few weeks.
> >
> > For now, you'll need to build the LAM libraries with BLCR support, and
> > relink your application with those libraries. There are instructions
> > for how to do this on the LAM MPI web page. Once that's done, let me
> > know if you still see the error. We should work fine on the Opteron
> > environment you're using.
> >
> > Eric
> >
> > On Thu, Oct 25, 2007 at 08:24:36PM +0900, Hideyuki Jitsumoto wrote:
> > > Dear BLCR-ML-Members,
> > >
> > > I tried to use BLCR for checkpointing mpich on 2 execution environments.
> > > I used completely same codes on MPI application, mpich, and BLCR.
> > > But on one environment, I got error message, "Restart failed: Invalid argument".
> > >
> > > Environment
> > > 1. VMware 6.1(Intel Core 2 Duo), Linux-2.6.8-2-686, gcc 3.3.5
> > > 2. AMD Opteron 242*2, Linux-2.6.12.2, gcc 3.3.5
> > >
> > > On Environment 1, I got correct restarting, but on Environment 2, I couldn't.
> > > So, I compared kernel log with CR_KTRACE_ALL.
> > > Then I noticed Environment2 has error on cr_rstrt_child.
> > >
> > > Please advise me about what's happened on BLCR , if you have an idea.
> > > Thank you.
> > >
> > > -the contents of /var/log/message
> > > Environment 1 had,
> > > ....
> > > Oct 22 12:14:10 Concertino1 kernel: cr_restore_all_files
> > > <cr_rstrt_req.c:1880>, pid 32752: : recovering fs_struct...
> > > Oct 22 12:14:10 Concertino1 kernel: cr_load_file_info
> > > <cr_rstrt_req.c:1339>, pid 32752: : entering
> > > Oct 22 12:14:10 Concertino1 kernel: cr_restore_all_files
> > > <cr_rstrt_req.c:1911>, pid 32752: : fd=0 dnr=1
> > > Oct 22 12:14:10 Concertino1 kernel: cr_restore_open_fifo
> > > <cr_pipes.c:488>, pid 32752: : entering
> > > Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
> > > <cr_pipes.c:498>, pid 32752: : Open fifo: id == cef61800.
> > > Oct 22 12:14:11 Concertino1 kernel: cr_make_new_pipe <cr_pipes.c:437>,
> > > pid 32752: : pipe:[57509]: Phase 1: Making new pipe.
> > > Oct 22 12:14:11 Concertino1 kernel: cr_restore_file_locks
> > > <cr_rstrt_req.c:1819>, pid 32752: : entering
> > > Oct 22 12:14:11 Concertino1 kernel: cr_load_file_info
> > > <cr_rstrt_req.c:1339>, pid 32752: : entering
> > > Oct 22 12:14:11 Concertino1 kernel: cr_restore_all_files
> > > <cr_rstrt_req.c:1911>, pid 32752: : fd=1 dnr=1
> > > Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
> > > <cr_pipes.c:488>, pid 32752: : entering
> > > Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
> > > <cr_pipes.c:498>, pid 32752: : Open fifo: id == cef616c0.
> > > Oct 22 12:14:11 Concertino1 kernel: cr_make_new_pipe <cr_pipes.c:437>,
> > > pid 32752: : pipe:[57510]: Phase 1: Making new pipe.
> > > Oct 22 12:14:11 Concertino1 kernel: cr_restore_file_locks
> > > <cr_rstrt_req.c:1819>, pid 32752: : entering
> > > Oct 22 12:14:11 Concertino1 kernel: cr_load_file_info
> > > <cr_rstrt_req.c:1339>, pid 32752: : entering
> > > Oct 22 12:14:11 Concertino1 kernel: cr_restore_all_files
> > > <cr_rstrt_req.c:1911>, pid 32752: : fd=2 dnr=1
> > > Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
> > > <cr_pipes.c:488>, pid 32752: : entering
> > > Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
> > > <cr_pipes.c:498>, pid 32752: : Open fifo: id == cf2c16c0.
> > > Oct 22 12:14:12 Concertino1 kernel: cr_make_new_pipe <cr_pipes.c:437>,
> > > pid 32752: : pipe:[57511]: Phase 1: Making new pipe.
> > > Oct 22 12:14:12 Concertino1 kernel: cr_restore_file_locks
> > > <cr_rstrt_req.c:1819>, pid 32752: : entering
> > > ....
> > >
> > > Environment 2 had,
> > > ....
> > > Oct 25 18:43:29 pad047 kernel: cr_restore_all_files
> > > <cr_rstrt_req.c:1880>, pid 18556: : recovering fs_struct...
> > > Oct 25 18:43:29 pad047 kernel: cr_load_file_info
> > > <cr_rstrt_req.c:1339>, pid 18556: : entering
> > > Oct 25 18:43:29 pad047 kernel: cr_restore_all_files
> > > <cr_rstrt_req.c:1911>, pid 18556: : fd=0 dnr=1
> > > Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:488>,
> > > pid 18556: : entering
> > > Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:498>,
> > > pid 18556: : Open fifo: id == f75172c0.
> > > Oct 25 18:43:29 pad047 kernel: cr_make_new_pipe <cr_pipes.c:437>, pid
> > > 18556: : pipe:[595796]: Phase 1: Making new pipe.
> > > Oct 25 18:43:29 pad047 kernel: cr_restore_file_locks
> > > <cr_rstrt_req.c:1819>, pid 18556: : entering
> > > Oct 25 18:43:29 pad047 kernel: cr_load_file_info
> > > <cr_rstrt_req.c:1339>, pid 18556: : entering
> > > Oct 25 18:43:29 pad047 kernel: cr_restore_all_files
> > > <cr_rstrt_req.c:1911>, pid 18556: : fd=1 dnr=1
> > > Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:488>,
> > > pid 18556: : entering
> > > Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:498>,
> > > pid 18556: : Open fifo: id == f7517698.
> > > Oct 25 18:43:29 pad047 kernel: cr_make_new_pipe <cr_pipes.c:437>, pid
> > > 18556: : pipe:[595797]: Phase 1: Making new pipe.
> > > Oct 25 18:43:29 pad047 kernel: cr_restore_file_locks
> > > <cr_rstrt_req.c:1819>, pid 18556: : entering
> > > Oct 25 18:43:29 pad047 kernel: cr_load_file_info
> > > <cr_rstrt_req.c:1339>, pid 18556: : entering
> > > Oct 25 18:43:29 pad047 kernel: cr_rstrt_child <cr_rstrt_req.c:2424>,
> > > pid 18556: : 18556: closing request descriptor
> > > Oct 25 18:43:29 pad047 kernel: cr_rstrt_child <cr_rstrt_req.c:2435>,
> > > pid 18556: : 18556: closing context file descriptor
> > > Oct 25 18:43:29 pad047 kernel: release_rstrt_req <cr_rstrt_req.c:94>,
> > > pid 18556: : ref count is approximately 2
> > > Oct 25 18:43:29 pad047 kernel: __cr_task_put <cr_task.c:114>, pid
> > > 18556: : Free cr_task_t ebf6f480
> > > ....
> > >
> > > --
> > > Sincerely Yours,
> > > Hideyuki Jitsumoto ([email protected])
> > > Tokyo Institute of Technology Grad. School of Info. and Eng.
> > > Dept. MCS (Matsuoka Lab.)
> >
> > --
> > Eric Roman Department of Physics
> > 510-642-7302 UC Berkeley
> >
>
>
> --
> Sincerely Yours,
> Hideyuki Jitsumoto ([email protected])
> Tokyo Institute of Technology Grad. School of Info. and Eng.
> Dept. MCS (Matsuoka Lab.)

--
Eric Roman Department of Physics
510-642-7302 UC Berkeley
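
A follow-up note on the SYSV shared memory question: one quick way to see
whether a rank has a System V segment attached is to look at its
/proc/<pid>/maps, where Linux lists such segments with a pathname beginning
with "/SYSV". The helper below is only a rough sketch under that assumption.
The names count_sysv_mappings and count_own_sysv_mappings are made up for
illustration and are not part of BLCR or MPICH, and a non-zero count would
only be a hint, not proof that shared memory is what makes rank 0 fail.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of BLCR: count the entries in a
 * /proc/<pid>/maps-style stream whose pathname starts with "/SYSV".
 * Linux labels attached System V shared memory segments that way,
 * e.g. "/SYSV00000000 (deleted)". */
int count_sysv_mappings(FILE *maps)
{
    char line[4096];
    int count = 0;

    while (fgets(line, sizeof line, maps) != NULL) {
        /* The pathname is the last field and is preceded by a space. */
        if (strstr(line, " /SYSV") != NULL)
            count++;
    }
    return count;
}

/* Returns the number of SYSV mappings in the calling process,
 * or -1 if /proc/self/maps cannot be opened. */
int count_own_sysv_mappings(void)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    int count;

    if (maps == NULL)
        return -1;
    count = count_sysv_mappings(maps);
    fclose(maps);
    return count;
}
```

Calling count_own_sysv_mappings() on every rank just before the
cr_request_file() step and logging the result next to the rank number would
show whether rank 0 differs from the others in this respect.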