From: Hideyuki Jitsumoto (jitsumo0_at_is.titech.ac.jp)
Date: Fri Oct 26 2007 - 00:33:34 PDT
Eric,

Thank you for your reply, and I'm sorry for my sketchy explanation.
I am interested in fault-tolerant MPI and its recovery methods.
I have already implemented my prototype on mpich, so I can't easily switch to another MPI.

>Try running that, and see if any of the tests pass.

I ran "make check" on both environments.

-VMware-environment
One test was skipped, and all the other tests passed.

Unable to determine huge pagesize, if any (test skipped).
SKIP: hugetlbfs.ct

-Opteron-environment
I got the following errors:

./bug2003: line 14: 32105 Segmentation fault (core dumped) ${cr_run} ${dir}/bug2003_aux
./bug2003: line 16: 32124 Segmentation fault ${cr_restart} context.$pid
FAIL: bug2003
cr_poll_checkpoint: Input/output error
/home/jitumoto/blcr-0.6.1/builddir/tests/.libs/lt-mmaps[1539]: file "../../tests/crut.c", line 628, in crut_main: Error during checkpoint. crut_checkpoint_status = 0, saved error = -5
checkpoint/nonzeroexit (255) at ./mmaps.ct line 128.
FAIL: mmaps.ct
Unable to determine huge pagesize, if any (test skipped).
SKIP: hugetlbfs.ct

I am stuck because I can't tell whether the cause of this error is my prototype or BLCR.
Any hint, however small, would be appreciated.

I describe my checkpointing method below.
---------------------------------------------------------------------
I use BLCR only with the MPI application processes (I don't use BLCR with any of the daemons that make up mpich).
The method is:

Checkpoint
1. The MPI application process gets a checkpoint-start message from MPD (a daemon that makes up mpich).
2. The application process drains its in-flight messages, but it doesn't close its sockets.
3. The application process checkpoints itself with BLCR via cr_request_file().

Restart
1. MPD restarts the application processes from their checkpoints with cr_restart.
2. On restart, BLCR runs the following callback function in CR_SIGNAL_CONTEXT.

    int post_recover(void *arg)
    {
        int rc;

        rc = cr_checkpoint(0);
        if (rc < 0) {
            exit(EXIT_FAILURE);
        } else if (rc) {
            //close and reconnect MPI application's sockets.
            ck_ch_post_recover();
        }
        return 0;
    }

The "Invalid argument" error occurs on rank 0. On the other rank, I get a correct checkpoint.

Sincerely yours,
Hideyuki Jitsumoto

On 10/26/07, Eric Roman <ESRoman_at_berkeley_dot_edu> wrote:
>
> I'm not sure where the invalid argument happened. There are a lot of
> places where we can return EINVAL.
>
> Let's get the easy stuff out of the way first. Did make check work on the
> Opteron? Try running that, and see if any of the tests pass. If that's ok,
> my guess is that it's the MPI issue. I'm really surprised that you were
> able to restart an MPICH code at all.
>
> We don't support BLCR with MPICH right now. That really shouldn't work at all.
> If you want to checkpoint an MPI job, you can use LAM MPI or a recent release
> of MVAPICH (for Infiniband). OpenMPI support is coming -- it's in their
> subversion tree, but not yet in a released version. OpenMPI checkpointing
> will be released in a few weeks.
>
> For now, you'll need to build the LAM libraries with BLCR support, and
> relink your application with those libraries. There are instructions
> for how to do this on the LAM MPI web page. Once that's done, let me
> know if you still see the error. We should work fine on the Opteron
> environment you're using.
>
> Eric
>
> On Thu, Oct 25, 2007 at 08:24:36PM +0900, Hideyuki Jitsumoto wrote:
> > Dear BLCR-ML-Members,
> >
> > I tried to use BLCR for checkpointing mpich on 2 execution environments.
> > I used completely same codes on MPI application, mpich, and BLCR.
> > But on one environment, I got error message, "Restart failed: Invalid argument".
> >
> > Environment
> > 1. VMware 6.1(Intel Core 2 Duo), Linux-2.6.8-2-686, gcc 3.3.5
> > 2. AMD Opteron 242*2, Linux-2.6.12.2, gcc 3.3.5
> >
> > On Environment 1, I got correct restarting, but on Environment 2, I couldn't.
> > So, I compared kernel log with CR_KTRACE_ALL.
> > Then I noticed Environment2 has error on cr_rstrt_child.
> >
> > Please advise me about what's happened on BLCR , if you have an idea.
> > Thank you.
> >
> > -the contents of /var/log/message
> > Environment 1 had,
> > ....
> > Oct 22 12:14:10 Concertino1 kernel: cr_restore_all_files
> > <cr_rstrt_req.c:1880>, pid 32752: : recovering fs_struct...
> > Oct 22 12:14:10 Concertino1 kernel: cr_load_file_info
> > <cr_rstrt_req.c:1339>, pid 32752: : entering
> > Oct 22 12:14:10 Concertino1 kernel: cr_restore_all_files
> > <cr_rstrt_req.c:1911>, pid 32752: : fd=0 dnr=1
> > Oct 22 12:14:10 Concertino1 kernel: cr_restore_open_fifo
> > <cr_pipes.c:488>, pid 32752: : entering
> > Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
> > <cr_pipes.c:498>, pid 32752: : Open fifo: id == cef61800.
> > Oct 22 12:14:11 Concertino1 kernel: cr_make_new_pipe <cr_pipes.c:437>,
> > pid 32752: : pipe:[57509]: Phase 1: Making new pipe.
> > Oct 22 12:14:11 Concertino1 kernel: cr_restore_file_locks
> > <cr_rstrt_req.c:1819>, pid 32752: : entering
> > Oct 22 12:14:11 Concertino1 kernel: cr_load_file_info
> > <cr_rstrt_req.c:1339>, pid 32752: : entering
> > Oct 22 12:14:11 Concertino1 kernel: cr_restore_all_files
> > <cr_rstrt_req.c:1911>, pid 32752: : fd=1 dnr=1
> > Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
> > <cr_pipes.c:488>, pid 32752: : entering
> > Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
> > <cr_pipes.c:498>, pid 32752: : Open fifo: id == cef616c0.
> > Oct 22 12:14:11 Concertino1 kernel: cr_make_new_pipe <cr_pipes.c:437>,
> > pid 32752: : pipe:[57510]: Phase 1: Making new pipe.
> > Oct 22 12:14:11 Concertino1 kernel: cr_restore_file_locks
> > <cr_rstrt_req.c:1819>, pid 32752: : entering
> > Oct 22 12:14:11 Concertino1 kernel: cr_load_file_info
> > <cr_rstrt_req.c:1339>, pid 32752: : entering
> > Oct 22 12:14:11 Concertino1 kernel: cr_restore_all_files
> > <cr_rstrt_req.c:1911>, pid 32752: : fd=2 dnr=1
> > Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
> > <cr_pipes.c:488>, pid 32752: : entering
> > Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
> > <cr_pipes.c:498>, pid 32752: : Open fifo: id == cf2c16c0.
> > Oct 22 12:14:12 Concertino1 kernel: cr_make_new_pipe <cr_pipes.c:437>,
> > pid 32752: : pipe:[57511]: Phase 1: Making new pipe.
> > Oct 22 12:14:12 Concertino1 kernel: cr_restore_file_locks
> > <cr_rstrt_req.c:1819>, pid 32752: : entering
> > ....
> >
> > Environment 2 had,
> > ....
> > Oct 25 18:43:29 pad047 kernel: cr_restore_all_files
> > <cr_rstrt_req.c:1880>, pid 18556: : recovering fs_struct...
> > Oct 25 18:43:29 pad047 kernel: cr_load_file_info
> > <cr_rstrt_req.c:1339>, pid 18556: : entering
> > Oct 25 18:43:29 pad047 kernel: cr_restore_all_files
> > <cr_rstrt_req.c:1911>, pid 18556: : fd=0 dnr=1
> > Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:488>,
> > pid 18556: : entering
> > Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:498>,
> > pid 18556: : Open fifo: id == f75172c0.
> > Oct 25 18:43:29 pad047 kernel: cr_make_new_pipe <cr_pipes.c:437>, pid
> > 18556: : pipe:[595796]: Phase 1: Making new pipe.
> > Oct 25 18:43:29 pad047 kernel: cr_restore_file_locks
> > <cr_rstrt_req.c:1819>, pid 18556: : entering
> > Oct 25 18:43:29 pad047 kernel: cr_load_file_info
> > <cr_rstrt_req.c:1339>, pid 18556: : entering
> > Oct 25 18:43:29 pad047 kernel: cr_restore_all_files
> > <cr_rstrt_req.c:1911>, pid 18556: : fd=1 dnr=1
> > Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:488>,
> > pid 18556: : entering
> > Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:498>,
> > pid 18556: : Open fifo: id == f7517698.
> > Oct 25 18:43:29 pad047 kernel: cr_make_new_pipe <cr_pipes.c:437>, pid
> > 18556: : pipe:[595797]: Phase 1: Making new pipe.
> > Oct 25 18:43:29 pad047 kernel: cr_restore_file_locks
> > <cr_rstrt_req.c:1819>, pid 18556: : entering
> > Oct 25 18:43:29 pad047 kernel: cr_load_file_info
> > <cr_rstrt_req.c:1339>, pid 18556: : entering
> > Oct 25 18:43:29 pad047 kernel: cr_rstrt_child <cr_rstrt_req.c:2424>,
> > pid 18556: : 18556: closing request descriptor
> > Oct 25 18:43:29 pad047 kernel: cr_rstrt_child <cr_rstrt_req.c:2435>,
> > pid 18556: : 18556: closing context file descriptor
> > Oct 25 18:43:29 pad047 kernel: release_rstrt_req <cr_rstrt_req.c:94>,
> > pid 18556: : ref count is approximately 2
> > Oct 25 18:43:29 pad047 kernel: __cr_task_put <cr_task.c:114>, pid
> > 18556: : Free cr_task_t ebf6f480
> > ....
> >
> > --
> > Sincerely Yours,
> > Hideyuki Jitsumoto ([email protected])
> > Tokyo Institute of Technology Grad. School of Info. and Eng.
> > Dept. MCS (Matsuoka Lab.)
>
> --
> Eric Roman Department of Physics
> 510-642-7302 UC Berkeley
>

--
Sincerely Yours,
Hideyuki Jitsumoto ([email protected])
Tokyo Institute of Technology Grad. School of Info. and Eng.
Dept. MCS (Matsuoka Lab.)