Re: Problems with BLCR?

From: Pradeep Padala (ppadala_at_eecs_dot_umich_dot_edu)
Date: Tue Jul 26 2005 - 15:15:02 PDT

  • Next message: Paul H. Hargrove: "Re: Checkpointing"
    Hi Paul,
        The latest mail from Jeff mentioned this (he is busy with a 
    conference and may be slow to respond):
    
    -------- Original Message --------
    Subject: Re: cr
    Date: Tue, 26 Jul 2005 06:43:22 -0600
    From: Jeff Squyres <[email protected]>
    To: Pradeep Padala <ppadala_at_eecs_dot_umich_dot_edu>
    
    Yes, without libaio, parallel processes checkpointed / restarted just
    fine.
    
    I see the problem -- only libaio.so.1 exists (not libaio.so).  This is
    why the linker doesn't find it.  Did you remove an RPM yesterday or
    something?  IIRC, the libaio.so file is in the libaio-devel RPM...?
    ----------------------------------
         I fixed the libaio RPM and am waiting for him to re-test the MPI 
    programs. Is linking with libaio a problem for BLCR?
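
For anyone checking the same thing on their own machine: a link with -laio looks for the unversioned libaio.so (or a static libaio.a), while the versioned libaio.so.1 is only what the runtime loader uses. Below is a minimal sketch that reports which of the two is present. The function name and the directory list are my own assumptions for illustration; adjust the directories to your distribution.

```python
import os

def classify_lib(name, search_dirs):
    """Split lib<name> files into link-time names and versioned runtime files.

    Returns (dev_links, runtime_files). dev_links holds paths named exactly
    lib<name>.so; runtime_files holds versioned names such as lib<name>.so.1.
    """
    dev_name = f"lib{name}.so"
    dev_links, runtime_files = [], []
    for d in search_dirs:
        if not os.path.isdir(d):
            continue  # skip directories that do not exist on this system
        for entry in sorted(os.listdir(d)):
            path = os.path.join(d, entry)
            if entry == dev_name:
                dev_links.append(path)
            elif entry.startswith(dev_name + "."):
                runtime_files.append(path)
    return dev_links, runtime_files


if __name__ == "__main__":
    # Hypothetical search path; these directories are an assumption.
    dev, runtime = classify_lib("aio", ["/usr/lib", "/usr/lib64", "/lib", "/lib64"])
    print("link-time names:", dev)
    print("runtime files:  ", runtime)
```

If the first list is empty and the second is not, the situation matches what Jeff describes: the runtime library is installed but the development symlink is missing.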
    
    -- 
    Pradeep Padala
    http://ppadala.blogspot.com
    
    Paul H. Hargrove wrote:
    > Sorry to have replied before reading other replies which said the same 
    > thing.
    > 
    > I just reread the relevant parts of the BLCR sources and see just a few 
    > places where EBUSY might be generated:
    > 
    > + PID conflict.
    > + Restore of a FIFO (aka named pipe) in which there is data buffered in 
    > the pipe.  A "solution" here would be to delete and recreate the FIFO. 
    > We need a better behavior in BLCR, but can't yet do anything more 
    > intelligent.
    > + Some "should never happen" file restore cases.
    > 
    > In all three cases, there should be a warning/error message in the 
    > system log file.  Please let me know what you find in /var/log/messages 
    > (or equivalent).
    > 
    > -Paul
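
The "delete and recreate the FIFO" workaround Paul mentions could be sketched roughly as follows. This is only an illustration: the function name is hypothetical, it is not part of BLCR, and it assumes you are happy to throw away whatever is buffered in the pipe. File ownership is not preserved, and any process that still has the old FIFO open keeps talking to the old one.

```python
import os
import stat

def recreate_fifo(path):
    """Delete and recreate the FIFO at `path`.

    Returns True if the FIFO was recreated, False if `path` is not a FIFO
    (in which case nothing is touched).
    """
    st = os.lstat(path)
    if not stat.S_ISFIFO(st.st_mode):
        return False              # refuse to touch regular files, sockets, etc.
    perms = stat.S_IMODE(st.st_mode)
    os.unlink(path)
    os.mkfifo(path, perms)        # the mode passed here is filtered by the umask
    os.chmod(path, perms)         # so restore the original permission bits
    return True
```

Running this on the FIFO before calling cr_restart is one way to try Paul's suggestion; whether it clears the EBUSY in a given case still has to be confirmed from the system log.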
    > 
    > Paul H. Hargrove wrote:
    > 
    >> Typically this is an indication that the original pids are (still) in 
    >> use.  My guess is that the original MPI processes are still running.
    >>
    >> -Paul
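
A quick way to test the "pids still in use" theory before restarting is to probe the PID with signal 0, which checks for existence without delivering anything. The sketch below assumes the numeric suffix of a context file name (as in context.4037) is the PID of the checkpointed process; that is an assumption about the file naming, and both helper names are made up for illustration. It only checks that one PID, not those of the other processes saved in the checkpoint.

```python
import os
import re

def pid_from_context_name(filename):
    """Extract the PID from a name like 'context.4037' (assumed naming)."""
    m = re.fullmatch(r"context\.(\d+)", os.path.basename(filename))
    return int(m.group(1)) if m else None

def pid_in_use(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)           # signal 0: existence/permission check only
    except ProcessLookupError:
        return False              # no such process
    except PermissionError:
        return True               # exists, but belongs to another user
    return True
```

If pid_in_use(pid_from_context_name("context.4037")) returns True, the original process (or an unrelated one that has since taken the same PID) is still around, which would fit the EBUSY from cr_restart.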
    >>
    >> Jeff Squyres wrote:
    >>
    >>> A user was having problems with LAM + BLCR, so I got a guest account 
    >>> on his cluster and gave it a whirl.  With my own build of LAM/MPI, 
    >>> I'm able to checkpoint just fine (i.e., I get N+1 checkpoint files).  
    >>> But when I try to restart, I get the following error:
    >>>
    >>> [jeff@linf1 ~]$ cr_restart context.4037
    >>> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
    >>> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
    >>> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
    >>> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
    >>>
    >>> What does this mean?
    >>>
    >>> I had checkpointed a simple "hello world" MPI application (4 MPI 
    >>> processes) on a single node.
    >>>
    >>> The user has already been in contact with Paul -- from his initial 
    >>> post on the LAM list 
    >>> (http://www.lam-mpi.org/MailArchives/lam/2005/07/11015.php):
    >>>
    >>> "P.S. I am using a patched version of blcr to make it work on FC4. The
    >>> patch was given to me by Paul Hargrove."
    >>>
    >>> The specific version of BLCR in use is:
    >>>
    >>> [jeff@linf1 ~]$ cr_restart --version
    >>> cr_restart version 0.4.pre1_snapshot_2005_06_27
    >>>
    >>> Sidenote: I notice that cr_checkpoint has a "--version" switch, but 
    >>> it is not listed in "cr_checkpoint --help" (which was somewhat 
    >>> confusing).  Ditto for cr_run.
    
