From: Hideyuki Jitsumoto (jitsumo0_at_is.titech.ac.jp)
Date: Thu Oct 25 2007 - 04:24:36 PDT
Dear BLCR-ML-Members,

I tried to use BLCR to checkpoint mpich in two execution environments. I used exactly the same code for the MPI application, mpich, and BLCR in both. In one environment, however, restart fails with the error message "Restart failed: Invalid argument".

Environments:
1. VMware 6.1 (Intel Core 2 Duo), Linux-2.6.8-2-686, gcc 3.3.5
2. AMD Opteron 242*2, Linux-2.6.12.2, gcc 3.3.5

On Environment 1 the restart works correctly, but on Environment 2 it does not. So I compared the kernel logs with CR_KTRACE_ALL enabled. I noticed that on Environment 2 the trace goes into the cr_rstrt_child cleanup (closing the request and context file descriptors) right after the third cr_load_file_info, whereas Environment 1 continues on to restore fd=2.

Please advise me about what is happening in BLCR, if you have an idea.

Thank you.

- Contents of /var/log/messages

Environment 1:
....
Oct 22 12:14:10 Concertino1 kernel: cr_restore_all_files <cr_rstrt_req.c:1880>, pid 32752: : recovering fs_struct...
Oct 22 12:14:10 Concertino1 kernel: cr_load_file_info <cr_rstrt_req.c:1339>, pid 32752: : entering
Oct 22 12:14:10 Concertino1 kernel: cr_restore_all_files <cr_rstrt_req.c:1911>, pid 32752: : fd=0 dnr=1
Oct 22 12:14:10 Concertino1 kernel: cr_restore_open_fifo <cr_pipes.c:488>, pid 32752: : entering
Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo <cr_pipes.c:498>, pid 32752: : Open fifo: id == cef61800.
Oct 22 12:14:11 Concertino1 kernel: cr_make_new_pipe <cr_pipes.c:437>, pid 32752: : pipe:[57509]: Phase 1: Making new pipe.
Oct 22 12:14:11 Concertino1 kernel: cr_restore_file_locks <cr_rstrt_req.c:1819>, pid 32752: : entering
Oct 22 12:14:11 Concertino1 kernel: cr_load_file_info <cr_rstrt_req.c:1339>, pid 32752: : entering
Oct 22 12:14:11 Concertino1 kernel: cr_restore_all_files <cr_rstrt_req.c:1911>, pid 32752: : fd=1 dnr=1
Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo <cr_pipes.c:488>, pid 32752: : entering
Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo <cr_pipes.c:498>, pid 32752: : Open fifo: id == cef616c0.
Oct 22 12:14:11 Concertino1 kernel: cr_make_new_pipe <cr_pipes.c:437>, pid 32752: : pipe:[57510]: Phase 1: Making new pipe.
Oct 22 12:14:11 Concertino1 kernel: cr_restore_file_locks <cr_rstrt_req.c:1819>, pid 32752: : entering
Oct 22 12:14:11 Concertino1 kernel: cr_load_file_info <cr_rstrt_req.c:1339>, pid 32752: : entering
Oct 22 12:14:11 Concertino1 kernel: cr_restore_all_files <cr_rstrt_req.c:1911>, pid 32752: : fd=2 dnr=1
Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo <cr_pipes.c:488>, pid 32752: : entering
Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo <cr_pipes.c:498>, pid 32752: : Open fifo: id == cf2c16c0.
Oct 22 12:14:12 Concertino1 kernel: cr_make_new_pipe <cr_pipes.c:437>, pid 32752: : pipe:[57511]: Phase 1: Making new pipe.
Oct 22 12:14:12 Concertino1 kernel: cr_restore_file_locks <cr_rstrt_req.c:1819>, pid 32752: : entering
....

Environment 2:
....
Oct 25 18:43:29 pad047 kernel: cr_restore_all_files <cr_rstrt_req.c:1880>, pid 18556: : recovering fs_struct...
Oct 25 18:43:29 pad047 kernel: cr_load_file_info <cr_rstrt_req.c:1339>, pid 18556: : entering
Oct 25 18:43:29 pad047 kernel: cr_restore_all_files <cr_rstrt_req.c:1911>, pid 18556: : fd=0 dnr=1
Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:488>, pid 18556: : entering
Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:498>, pid 18556: : Open fifo: id == f75172c0.
Oct 25 18:43:29 pad047 kernel: cr_make_new_pipe <cr_pipes.c:437>, pid 18556: : pipe:[595796]: Phase 1: Making new pipe.
Oct 25 18:43:29 pad047 kernel: cr_restore_file_locks <cr_rstrt_req.c:1819>, pid 18556: : entering
Oct 25 18:43:29 pad047 kernel: cr_load_file_info <cr_rstrt_req.c:1339>, pid 18556: : entering
Oct 25 18:43:29 pad047 kernel: cr_restore_all_files <cr_rstrt_req.c:1911>, pid 18556: : fd=1 dnr=1
Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:488>, pid 18556: : entering
Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:498>, pid 18556: : Open fifo: id == f7517698.
Oct 25 18:43:29 pad047 kernel: cr_make_new_pipe <cr_pipes.c:437>, pid 18556: : pipe:[595797]: Phase 1: Making new pipe.
Oct 25 18:43:29 pad047 kernel: cr_restore_file_locks <cr_rstrt_req.c:1819>, pid 18556: : entering
Oct 25 18:43:29 pad047 kernel: cr_load_file_info <cr_rstrt_req.c:1339>, pid 18556: : entering
Oct 25 18:43:29 pad047 kernel: cr_rstrt_child <cr_rstrt_req.c:2424>, pid 18556: : 18556: closing request descriptor
Oct 25 18:43:29 pad047 kernel: cr_rstrt_child <cr_rstrt_req.c:2435>, pid 18556: : 18556: closing context file descriptor
Oct 25 18:43:29 pad047 kernel: release_rstrt_req <cr_rstrt_req.c:94>, pid 18556: : ref count is approximately 2
Oct 25 18:43:29 pad047 kernel: __cr_task_put <cr_task.c:114>, pid 18556: : Free cr_task_t ebf6f480
....

--
Sincerely Yours,
Hideyuki Jitsumoto ([email protected])
Tokyo Institute of Technology
Grad. School of Info. and Eng.
Dept. MCS (Matsuoka Lab.)
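
The log comparison described in this message can also be done mechanically. The following is a minimal Python sketch, not part of BLCR: it assumes the trace entries have the "kernel: FUNCTION <FILE:LINE>, pid N:" shape seen in the excerpts above, and that both excerpts start at the same trace point. The names trace_steps and first_divergence are hypothetical helpers introduced only for illustration. Timestamps, host names, pids and the message text are ignored, so only the sequence of trace points is compared.

```python
import re
from itertools import zip_longest

# Matches trace entries shaped like the excerpts in this message, e.g.
#   "kernel: cr_load_file_info <cr_rstrt_req.c:1339>, pid 32752: : entering"
# (assumption: every entry of interest follows this shape).
ENTRY = re.compile(r"kernel: (\w+) <([\w.]+):(\d+)>, pid \d+:")


def trace_steps(log_text):
    """Return the ordered list of (function, source file, line) trace points."""
    return [(m.group(1), m.group(2), int(m.group(3)))
            for m in ENTRY.finditer(log_text)]


def first_divergence(good_log, bad_log):
    """Return (index, good_step, bad_step) at the first mismatch, else None.

    A step is None when one trace ends before the other.
    """
    good, bad = trace_steps(good_log), trace_steps(bad_log)
    for i, (g, b) in enumerate(zip_longest(good, bad)):
        if g != b:
            return i, g, b
    return None
```

To use it, read the two saved log excerpts into strings and call first_divergence(env1_text, env2_text); the returned pair of steps shows which trace point each environment reached at the position where the two sequences stop matching.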