Re: Problems with BLCR?

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Jul 26 2005 - 14:43:36 PDT

    Sorry to have replied before reading other replies which said the same 
    thing.
    
    I just reread the relevant parts of the BLCR sources and see just a few 
    places where EBUSY might be generated:
    
    + PID conflict.
    + Restore of a FIFO (aka named pipe) in which there is data buffered in 
    the pipe.  A "solution" here would be to delete and recreate the FIFO. 
    We need better behavior in BLCR, but can't yet do anything more 
    intelligent.
    + Some "should never happen" file restore cases.
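
    The first two cases can be checked by hand before retrying cr_restart. 
    Below is a minimal Python sketch, not part of BLCR: the helper names 
    pid_in_use and recreate_fifo are hypothetical, and it assumes a Linux 
    host where you already know the original PIDs and the FIFO path.

```python
import os
import stat


def pid_in_use(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    try:
        # Signal 0 delivers nothing; it only checks existence and permission.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def recreate_fifo(path: str) -> None:
    """Delete a FIFO and create a fresh, empty one with the same permissions.

    Any data still buffered in the old pipe is discarded.
    """
    st = os.stat(path)
    if not stat.S_ISFIFO(st.st_mode):
        raise ValueError(f"{path} is not a FIFO")
    mode = stat.S_IMODE(st.st_mode)
    os.unlink(path)
    os.mkfifo(path, mode)
    # mkfifo applies the umask, so restore the exact permission bits.
    os.chmod(path, mode)
```

    If pid_in_use() returns True for any PID of the checkpointed job, the 
    original processes are probably still running and need to exit before a 
    restart can reclaim those PIDs.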
    
    In all three cases, there should be a warning/error message in the 
    system log file.  Please let me know what you find in /var/log/messages 
    (or equivalent).
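
    One way to pull candidate lines out of the log is sketched below in 
    Python. The keywords are guesses at what a BLCR message might contain, 
    not a documented message format, and the helper names are hypothetical; 
    adjust both for your system.

```python
from typing import Iterable, List


def matching_lines(lines: Iterable[str], keywords: Iterable[str]) -> List[str]:
    """Return the lines that contain any keyword, compared case-insensitively."""
    lowered = [k.lower() for k in keywords]
    return [
        line.rstrip("\n")
        for line in lines
        if any(k in line.lower() for k in lowered)
    ]


def scan_log(path: str, keywords: Iterable[str]) -> List[str]:
    """Scan a log file; reading /var/log/messages usually needs root."""
    with open(path, errors="replace") as log:
        return matching_lines(log, list(keywords))
```

    For example, scan_log("/var/log/messages", ["blcr", "cr_"]) returns the 
    lines worth including in a reply.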
    
    -Paul
    
    Paul H. Hargrove wrote:
    > Typically this is an indication that the original pids are (still) in 
    > use.  My guess is that the original MPI processes are still running.
    > 
    > -Paul
    > 
    > Jeff Squyres wrote:
    > 
    >> A user was having problems with LAM + BLCR, so I got a guest account 
    >> on his cluster and gave it a whirl.  With my own build of LAM/MPI, I'm 
    >> able to checkpoint just fine (i.e., I get N+1 checkpoint files).  But 
    >> when I try to restart, I get the following error:
    >>
    >> [jeff@linf1 ~]$ cr_restart context.4037
    >> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
    >> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
    >> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
    >> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
    >>
    >> What does this mean?
    >>
    >> I had checkpointed a simple "hello world" MPI application (4 MPI 
    >> processes) on a single node.
    >>
    >> The user has already been in contact with Paul -- from his initial 
    >> post on the LAM list 
    >> (http://www.lam-mpi.org/MailArchives/lam/2005/07/11015.php):
    >>
    >> "P.S. I am using a patched version of blcr to make it work on FC4. The
    >> patch was given to me by Paul Hargrove."
    >>
    >> The specific version of BLCR in use is:
    >>
    >> [jeff@linf1 ~]$ cr_restart --version
    >> cr_restart version 0.4.pre1_snapshot_2005_06_27
    >>
    >> Sidenote: I notice that cr_checkpoint has a "--version" switch, but it 
    >> is not listed in "cr_checkpoint --help" (which was somewhat 
    >> confusing).  Ditto for cr_run.
    >>
    > 
    > 
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
