Please advise me about restarting with BLCR

From: Hideyuki Jitsumoto (jitsumo0_at_is.titech.ac.jp)
Date: Thu Oct 25 2007 - 04:24:36 PDT

  • Next message: Eric Roman: "Re: Please advise me about restarting with BLCR"
    Dear BLCR-ML-Members,
    
    I tried to use BLCR to checkpoint mpich in 2 execution environments.
    I used exactly the same MPI application code, mpich, and BLCR in both.
    But in one environment, I got the error message "Restart failed: Invalid argument".
    
    Environment
    1. VMware 6.1(Intel Core 2 Duo), Linux-2.6.8-2-686, gcc 3.3.5
    2. AMD Opteron 242*2, Linux-2.6.12.2, gcc 3.3.5
    
    In Environment 1 the restart succeeded, but in Environment 2 it did not.
    So I compared the kernel logs produced with CR_KTRACE_ALL.
    Then I noticed that Environment 2 has an error in cr_rstrt_child.
    
    Please advise me about what is happening in BLCR, if you have any idea.
    Thank you.
    
    - Contents of /var/log/messages
    Environment 1 had:
    ....
    Oct 22 12:14:10 Concertino1 kernel: cr_restore_all_files
    <cr_rstrt_req.c:1880>, pid 32752: : recovering fs_struct...
    Oct 22 12:14:10 Concertino1 kernel: cr_load_file_info
    <cr_rstrt_req.c:1339>, pid 32752: : entering
    Oct 22 12:14:10 Concertino1 kernel: cr_restore_all_files
    <cr_rstrt_req.c:1911>, pid 32752: :    fd=0 dnr=1
    Oct 22 12:14:10 Concertino1 kernel: cr_restore_open_fifo
    <cr_pipes.c:488>, pid 32752: : entering
    Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
    <cr_pipes.c:498>, pid 32752: :    Open fifo: id == cef61800.
    Oct 22 12:14:11 Concertino1 kernel: cr_make_new_pipe <cr_pipes.c:437>,
    pid 32752: : pipe:[57509]:  Phase 1: Making new pipe.
    Oct 22 12:14:11 Concertino1 kernel: cr_restore_file_locks
    <cr_rstrt_req.c:1819>, pid 32752: : entering
    Oct 22 12:14:11 Concertino1 kernel: cr_load_file_info
    <cr_rstrt_req.c:1339>, pid 32752: : entering
    Oct 22 12:14:11 Concertino1 kernel: cr_restore_all_files
    <cr_rstrt_req.c:1911>, pid 32752: :    fd=1 dnr=1
    Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
    <cr_pipes.c:488>, pid 32752: : entering
    Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
    <cr_pipes.c:498>, pid 32752: :    Open fifo: id == cef616c0.
    Oct 22 12:14:11 Concertino1 kernel: cr_make_new_pipe <cr_pipes.c:437>,
    pid 32752: : pipe:[57510]:  Phase 1: Making new pipe.
    Oct 22 12:14:11 Concertino1 kernel: cr_restore_file_locks
    <cr_rstrt_req.c:1819>, pid 32752: : entering
    Oct 22 12:14:11 Concertino1 kernel: cr_load_file_info
    <cr_rstrt_req.c:1339>, pid 32752: : entering
    Oct 22 12:14:11 Concertino1 kernel: cr_restore_all_files
    <cr_rstrt_req.c:1911>, pid 32752: :    fd=2 dnr=1
    Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
    <cr_pipes.c:488>, pid 32752: : entering
    Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
    <cr_pipes.c:498>, pid 32752: :    Open fifo: id == cf2c16c0.
    Oct 22 12:14:12 Concertino1 kernel: cr_make_new_pipe <cr_pipes.c:437>,
    pid 32752: : pipe:[57511]:  Phase 1: Making new pipe.
    Oct 22 12:14:12 Concertino1 kernel: cr_restore_file_locks
    <cr_rstrt_req.c:1819>, pid 32752: : entering
    ....
    
    Environment 2 had:
    ....
    Oct 25 18:43:29 pad047 kernel: cr_restore_all_files
    <cr_rstrt_req.c:1880>, pid 18556: : recovering fs_struct...
    Oct 25 18:43:29 pad047 kernel: cr_load_file_info
    <cr_rstrt_req.c:1339>, pid 18556: : entering
    Oct 25 18:43:29 pad047 kernel: cr_restore_all_files
    <cr_rstrt_req.c:1911>, pid 18556: :    fd=0 dnr=1
    Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:488>,
    pid 18556: : entering
    Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:498>,
    pid 18556: :    Open fifo: id == f75172c0.
    Oct 25 18:43:29 pad047 kernel: cr_make_new_pipe <cr_pipes.c:437>, pid
    18556: : pipe:[595796]:  Phase 1: Making new pipe.
    Oct 25 18:43:29 pad047 kernel: cr_restore_file_locks
    <cr_rstrt_req.c:1819>, pid 18556: : entering
    Oct 25 18:43:29 pad047 kernel: cr_load_file_info
    <cr_rstrt_req.c:1339>, pid 18556: : entering
    Oct 25 18:43:29 pad047 kernel: cr_restore_all_files
    <cr_rstrt_req.c:1911>, pid 18556: :    fd=1 dnr=1
    Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:488>,
    pid 18556: : entering
    Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:498>,
    pid 18556: :    Open fifo: id == f7517698.
    Oct 25 18:43:29 pad047 kernel: cr_make_new_pipe <cr_pipes.c:437>, pid
    18556: : pipe:[595797]:  Phase 1: Making new pipe.
    Oct 25 18:43:29 pad047 kernel: cr_restore_file_locks
    <cr_rstrt_req.c:1819>, pid 18556: : entering
    Oct 25 18:43:29 pad047 kernel: cr_load_file_info
    <cr_rstrt_req.c:1339>, pid 18556: : entering
    Oct 25 18:43:29 pad047 kernel: cr_rstrt_child <cr_rstrt_req.c:2424>,
    pid 18556: : 18556: closing request descriptor
    Oct 25 18:43:29 pad047 kernel: cr_rstrt_child <cr_rstrt_req.c:2435>,
    pid 18556: : 18556: closing context file descriptor
    Oct 25 18:43:29 pad047 kernel: release_rstrt_req <cr_rstrt_req.c:94>,
    pid 18556: : ref count is approximately 2
    Oct 25 18:43:29 pad047 kernel: __cr_task_put <cr_task.c:114>, pid
    18556: : Free cr_task_t ebf6f480
    ....
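
The comparison of the two traces can also be done mechanically. The following is a minimal sketch, not part of BLCR: it assumes each trace message starts with "kernel: <function name>" as in the excerpts above, and the helper names trace_functions and first_divergence are made up for illustration. It lists the traced functions in order and reports the first position where the two logs differ.

```python
import re

# A trace message starts with "kernel: <function>". Continuation lines of a
# wrapped message contain no "kernel:" and are therefore ignored.
TRACE_RE = re.compile(r"kernel:\s+(\w+)")


def trace_functions(log_text):
    """Return the traced function names in the order they appear."""
    return [m.group(1) for m in TRACE_RE.finditer(log_text)]


def first_divergence(good, bad):
    """Return (index, good_name, bad_name) for the first differing entry.

    Returns None if the two sequences agree over their common length.
    """
    for i, (g, b) in enumerate(zip(good, bad)):
        if g != b:
            return i, g, b
    return None


# Last two messages of each excerpt above, copied from the logs.
env1 = """Oct 22 12:14:11 Concertino1 kernel: cr_load_file_info
<cr_rstrt_req.c:1339>, pid 32752: : entering
Oct 22 12:14:11 Concertino1 kernel: cr_restore_all_files
<cr_rstrt_req.c:1911>, pid 32752: :    fd=2 dnr=1
"""
env2 = """Oct 25 18:43:29 pad047 kernel: cr_load_file_info
<cr_rstrt_req.c:1339>, pid 18556: : entering
Oct 25 18:43:29 pad047 kernel: cr_rstrt_child <cr_rstrt_req.c:2424>,
pid 18556: : 18556: closing request descriptor
"""

result = first_divergence(trace_functions(env1), trace_functions(env2))
print(result)
```

On the excerpts above, the two traces agree up to the third cr_load_file_info entry. After it, Environment 1 continues with cr_restore_all_files (fd=2), while Environment 2 goes straight to cr_rstrt_child and closes the request descriptor.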
    
    -- 
    Sincerely Yours,
    Hideyuki Jitsumoto (jitsumo0@is.titech.ac.jp)
    Tokyo Institute of Technology Grad. School of Info. and Eng.
    Dept. MCS (Matsuoka Lab.)
    
