Re: Problems with BLCR?

From: Pradeep Padala (ppadala_at_eecs_dot_umich_dot_edu)
Date: Tue Jul 26 2005 - 15:15:02 PDT

Next message: Paul H. Hargrove: "Re: Checkpointing"

Previous message: Paul H. Hargrove: "Re: Problems with BLCR?"
In reply to: Paul H. Hargrove: "Re: Problems with BLCR?"
Next in thread: Paul H. Hargrove: "Re: Problems with BLCR?"
Reply: Paul H. Hargrove: "Re: Problems with BLCR?"

Hi Paul,
    Jeff's latest mail mentioned this (he is busy with a conference and
may be late in responding):

-------- Original Message --------
Subject: Re: cr
Date: Tue, 26 Jul 2005 06:43:22 -0600
From: Jeff Squyres <[email protected]>
To: Pradeep Padala <ppadala_at_eecs_dot_umich_dot_edu>

Yes, without libaio parallel processes checkpointed / restarted just
fine.

I see the problem -- only libaio.so.1 exists (not libaio.so).  This is
why the linker doesn't find it.  Did you remove an RPM yesterday or
something?  IIRC, the libaio.so file is in the libaio-devel RPM...?
----------------------------------
     I fixed the aio RPM and am waiting for him to re-test the MPI
programs. Is linking with aio a problem for BLCR?
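
Jeff's diagnosis above can be checked mechanically. A link step given
-laio looks for the unversioned name libaio.so (or libaio.a), while
libaio.so.1 is only the name the runtime loader uses. The following is a
minimal Python sketch, not part of BLCR or LAM/MPI. The helper name
find_lib_names and the directory list are hypothetical; adjust them for
your distribution.

```python
import os


def find_lib_names(libdirs, name="aio"):
    """Return (unversioned, versioned) paths found for lib<name>.so.

    unversioned: exact matches for lib<name>.so (what `-l<name>` needs)
    versioned:   lib<name>.so.<something> (what the runtime loader uses)
    """
    base = f"lib{name}.so"
    unversioned, versioned = [], []
    for d in libdirs:
        if not os.path.isdir(d):
            continue  # skip directories that do not exist on this system
        for entry in sorted(os.listdir(d)):
            path = os.path.join(d, entry)
            if entry == base:
                unversioned.append(path)
            elif entry.startswith(base + "."):
                versioned.append(path)
    return unversioned, versioned


if __name__ == "__main__":
    # Hypothetical search path; real systems may use other directories.
    dirs = ["/lib", "/usr/lib", "/lib64", "/usr/lib64"]
    dev, runtime = find_lib_names(dirs)
    print("linker names :", dev or "none found")
    print("runtime names:", runtime or "none found")
```

If only runtime names are reported, the situation matches what Jeff
described: programs already linked against libaio still run, but new
links with -laio fail until the unversioned name is installed.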

-- 
Pradeep Padala
http://ppadala.blogspot.com

Paul H. Hargrove wrote:
> Sorry to have replied before reading other replies which said the same 
> thing.
> 
> I just reread the relevant parts of the BLCR sources and see just a few 
> places where EBUSY might be generated:
> 
> + PID conflict.
> + Restore of a FIFO (aka named pipe) in which there is data buffered in 
> the pipe.  A "solution" here would be to delete and recreate the FIFO. 
> We need a better behavior in BLCR, but can't yet do anything more 
> intelligent.
> + Some "should never happen" file restore cases.
> 
> In all three cases, there should be a warning/error message in the 
> system log file.  Please let me know what you find in /var/log/messages 
> (or equivalent).
> 
> -Paul
> 
> Paul H. Hargrove wrote:
> 
>> Typically this is an indication that the original pids are (still) in 
>> use.  My guess is that the original MPI processes are still running.
>>
>> -Paul
>>
>> Jeff Squyres wrote:
>>
>>> A user was having problems with LAM + BLCR, so I got a guest account 
>>> on his cluster and gave it a whirl.  With my own build of LAM/MPI, 
>>> I'm able to checkpoint just fine (i.e., I get N+1 checkpoint files).  
>>> But when I try to restart, I get the following error:
>>>
>>> [jeff@linf1 ~]$ cr_restart context.4037
>>> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
>>> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
>>> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
>>> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
>>>
>>> What does this mean?
>>>
>>> I had checkpointed a simple "hello world" MPI application (4 MPI 
>>> processes) on a single node.
>>>
>>> The user has already been in contact with Paul -- from his initial 
>>> post on the LAM list 
>>> (http://www.lam-mpi.org/MailArchives/lam/2005/07/11015.php):
>>>
>>> "P.S. I am using a patched version of blcr to make it work on FC4. The
>>> patch was given to me by Paul Hargrove."
>>>
>>> The specific version of BLCR in use is:
>>>
>>> [jeff@linf1 ~]$ cr_restart --version
>>> cr_restart version 0.4.pre1_snapshot_2005_06_27
>>>
>>> Sidenote: I notice that cr_checkpoint has a "--version" switch, but 
>>> it is not listed in "cr_checkpoint --help" (which was somewhat 
>>> confusing).  Ditto for cr_run.
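
Paul's first hypothesis in the quoted thread (the original PIDs are
still in use) can be tested before running cr_restart. The following is
a minimal Python sketch for POSIX systems, not part of BLCR. The helper
name pid_in_use is hypothetical, and the PID 4037 is only a placeholder
borrowed from the file name context.4037; that the file name matches a
process ID is an assumption.

```python
import os


def pid_in_use(pid):
    """Return True if a process with this PID currently exists (POSIX)."""
    try:
        # Signal 0 performs the existence and permission checks
        # without delivering any signal to the target process.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # ESRCH: no such process
    except PermissionError:
        return True   # EPERM: process exists but belongs to another user
    return True


if __name__ == "__main__":
    # Placeholder list; substitute the PIDs of the checkpointed processes.
    for pid in (4037,):
        print(pid, "in use" if pid_in_use(pid) else "free")
```

Any PID reported as in use would be consistent with the PID-conflict
case. The other two cases Paul lists have to be ruled out from the
system log instead.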
