From: Eric Roman (ESRoman_at_berkeley_dot_edu)
Date: Thu Oct 25 2007 - 10:47:12 PDT
I'm not sure where the invalid argument happened. There are a lot of places where we can return EINVAL.

Let's get the easy stuff out of the way first. Did make check work on the Opteron? Try running that, and see if any of the tests pass. If that's OK, my guess is that it's the MPI issue.

I'm really surprised that you were able to restart an MPICH code at all. We don't support BLCR with MPICH right now, so that really shouldn't work.

If you want to checkpoint an MPI job, you can use LAM MPI or a recent release of MVAPICH (for InfiniBand). OpenMPI support is coming -- it's in their subversion tree, but not yet in a released version. OpenMPI checkpointing will be released in a few weeks.

For now, you'll need to build the LAM libraries with BLCR support, and relink your application with those libraries. There are instructions for how to do this on the LAM MPI web page.

Once that's done, let me know if you still see the error. We should work fine on the Opteron environment you're using.

Eric

On Thu, Oct 25, 2007 at 08:24:36PM +0900, Hideyuki Jitsumoto wrote:
> Dear BLCR-ML-Members,
>
> I tried to use BLCR for checkpointing mpich on 2 execution environments.
> I used exactly the same code for the MPI application, mpich, and BLCR.
> But on one environment, I got the error message "Restart failed: Invalid argument".
>
> Environment
> 1. VMware 6.1(Intel Core 2 Duo), Linux-2.6.8-2-686, gcc 3.3.5
> 2. AMD Opteron 242*2, Linux-2.6.12.2, gcc 3.3.5
>
> On Environment 1, the restart worked correctly, but on Environment 2, it didn't.
> So, I compared the kernel logs with CR_KTRACE_ALL.
> Then I noticed Environment 2 has an error in cr_rstrt_child.
>
> Please advise me about what happened in BLCR, if you have an idea.
> Thank you.
>
> -the contents of /var/log/message
> Environment 1 had,
> ....
> Oct 22 12:14:10 Concertino1 kernel: cr_restore_all_files <cr_rstrt_req.c:1880>, pid 32752: : recovering fs_struct...
> Oct 22 12:14:10 Concertino1 kernel: cr_load_file_info <cr_rstrt_req.c:1339>, pid 32752: : entering
> Oct 22 12:14:10 Concertino1 kernel: cr_restore_all_files <cr_rstrt_req.c:1911>, pid 32752: : fd=0 dnr=1
> Oct 22 12:14:10 Concertino1 kernel: cr_restore_open_fifo <cr_pipes.c:488>, pid 32752: : entering
> Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo <cr_pipes.c:498>, pid 32752: : Open fifo: id == cef61800.
> Oct 22 12:14:11 Concertino1 kernel: cr_make_new_pipe <cr_pipes.c:437>, pid 32752: : pipe:[57509]: Phase 1: Making new pipe.
> Oct 22 12:14:11 Concertino1 kernel: cr_restore_file_locks <cr_rstrt_req.c:1819>, pid 32752: : entering
> Oct 22 12:14:11 Concertino1 kernel: cr_load_file_info <cr_rstrt_req.c:1339>, pid 32752: : entering
> Oct 22 12:14:11 Concertino1 kernel: cr_restore_all_files <cr_rstrt_req.c:1911>, pid 32752: : fd=1 dnr=1
> Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo <cr_pipes.c:488>, pid 32752: : entering
> Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo <cr_pipes.c:498>, pid 32752: : Open fifo: id == cef616c0.
> Oct 22 12:14:11 Concertino1 kernel: cr_make_new_pipe <cr_pipes.c:437>, pid 32752: : pipe:[57510]: Phase 1: Making new pipe.
> Oct 22 12:14:11 Concertino1 kernel: cr_restore_file_locks <cr_rstrt_req.c:1819>, pid 32752: : entering
> Oct 22 12:14:11 Concertino1 kernel: cr_load_file_info <cr_rstrt_req.c:1339>, pid 32752: : entering
> Oct 22 12:14:11 Concertino1 kernel: cr_restore_all_files <cr_rstrt_req.c:1911>, pid 32752: : fd=2 dnr=1
> Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo <cr_pipes.c:488>, pid 32752: : entering
> Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo <cr_pipes.c:498>, pid 32752: : Open fifo: id == cf2c16c0.
> Oct 22 12:14:12 Concertino1 kernel: cr_make_new_pipe <cr_pipes.c:437>, pid 32752: : pipe:[57511]: Phase 1: Making new pipe.
> Oct 22 12:14:12 Concertino1 kernel: cr_restore_file_locks <cr_rstrt_req.c:1819>, pid 32752: : entering
> ....
>
> Environment 2 had,
> ....
> Oct 25 18:43:29 pad047 kernel: cr_restore_all_files <cr_rstrt_req.c:1880>, pid 18556: : recovering fs_struct...
> Oct 25 18:43:29 pad047 kernel: cr_load_file_info <cr_rstrt_req.c:1339>, pid 18556: : entering
> Oct 25 18:43:29 pad047 kernel: cr_restore_all_files <cr_rstrt_req.c:1911>, pid 18556: : fd=0 dnr=1
> Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:488>, pid 18556: : entering
> Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:498>, pid 18556: : Open fifo: id == f75172c0.
> Oct 25 18:43:29 pad047 kernel: cr_make_new_pipe <cr_pipes.c:437>, pid 18556: : pipe:[595796]: Phase 1: Making new pipe.
> Oct 25 18:43:29 pad047 kernel: cr_restore_file_locks <cr_rstrt_req.c:1819>, pid 18556: : entering
> Oct 25 18:43:29 pad047 kernel: cr_load_file_info <cr_rstrt_req.c:1339>, pid 18556: : entering
> Oct 25 18:43:29 pad047 kernel: cr_restore_all_files <cr_rstrt_req.c:1911>, pid 18556: : fd=1 dnr=1
> Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:488>, pid 18556: : entering
> Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:498>, pid 18556: : Open fifo: id == f7517698.
> Oct 25 18:43:29 pad047 kernel: cr_make_new_pipe <cr_pipes.c:437>, pid 18556: : pipe:[595797]: Phase 1: Making new pipe.
> Oct 25 18:43:29 pad047 kernel: cr_restore_file_locks <cr_rstrt_req.c:1819>, pid 18556: : entering
> Oct 25 18:43:29 pad047 kernel: cr_load_file_info <cr_rstrt_req.c:1339>, pid 18556: : entering
> Oct 25 18:43:29 pad047 kernel: cr_rstrt_child <cr_rstrt_req.c:2424>, pid 18556: : 18556: closing request descriptor
> Oct 25 18:43:29 pad047 kernel: cr_rstrt_child <cr_rstrt_req.c:2435>, pid 18556: : 18556: closing context file descriptor
> Oct 25 18:43:29 pad047 kernel: release_rstrt_req <cr_rstrt_req.c:94>, pid 18556: : ref count is approximately 2
> Oct 25 18:43:29 pad047 kernel: __cr_task_put <cr_task.c:114>, pid 18556: : Free cr_task_t ebf6f480
> ....
>
> --
> Sincerely Yours,
> Hideyuki Jitsumoto ([email protected])
> Tokyo Institute of Technology Grad. School of Info. and Eng.
> Dept. MCS (Matsuoka Lab.)

--
Eric Roman
Department of Physics
510-642-7302
UC Berkeley
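
For anyone comparing two CR_KTRACE_ALL traces the way the quoted message does, here is a minimal Python sketch that finds the first step where a working trace and a failing trace part ways. It is not part of BLCR. The helper names (trace_steps, first_divergence) and the regular expression are assumptions made for illustration, and the pattern assumes each log entry has been unwrapped onto a single line in the shape shown in the quoted log. The sample lines are copied from the two traces quoted in this thread.

```python
import re

# Assumed line shape (taken from the log quoted in this thread):
#   "... kernel: <function> <file.c:line>, pid <n>: : <message>"
TRACE_RE = re.compile(r"kernel: (\w+) <([\w.]+):(\d+)>, pid \d+:")


def trace_steps(lines):
    """Return (function, source file, source line) for each trace line that parses."""
    steps = []
    for text in lines:
        match = TRACE_RE.search(text)
        if match:
            steps.append((match.group(1), match.group(2), int(match.group(3))))
    return steps


def first_divergence(good, bad):
    """Return (index, good_step, bad_step) at the first differing step, else None."""
    for index, (good_step, bad_step) in enumerate(zip(good, bad)):
        if good_step != bad_step:
            return index, good_step, bad_step
    return None


# Excerpt from Environment 1 (restart succeeded).
good_log = [
    "Oct 22 12:14:11 Concertino1 kernel: cr_restore_file_locks <cr_rstrt_req.c:1819>, pid 32752: : entering",
    "Oct 22 12:14:11 Concertino1 kernel: cr_load_file_info <cr_rstrt_req.c:1339>, pid 32752: : entering",
    "Oct 22 12:14:11 Concertino1 kernel: cr_restore_all_files <cr_rstrt_req.c:1911>, pid 32752: : fd=2 dnr=1",
]

# Excerpt from Environment 2 (restart failed).
bad_log = [
    "Oct 25 18:43:29 pad047 kernel: cr_restore_file_locks <cr_rstrt_req.c:1819>, pid 18556: : entering",
    "Oct 25 18:43:29 pad047 kernel: cr_load_file_info <cr_rstrt_req.c:1339>, pid 18556: : entering",
    "Oct 25 18:43:29 pad047 kernel: cr_rstrt_child <cr_rstrt_req.c:2424>, pid 18556: : 18556: closing request descriptor",
]

result = first_divergence(trace_steps(good_log), trace_steps(bad_log))
print(result)
```

On these excerpts the two traces agree up to the third cr_load_file_info call. Environment 1 then continues with cr_restore_all_files for fd=2, while Environment 2 goes straight to cr_rstrt_child closing its descriptors. That only narrows down where the traces split; it does not by itself show which call returned EINVAL.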