From: Pradeep Padala (ppadala_at_eecs_dot_umich_dot_edu)
Date: Tue Jul 26 2005 - 15:15:02 PDT
Hi Paul,

Latest mail from Jeff mentioned this (he is busy with some conference and
may be late in responding):

-------- Original Message --------
Subject: Re: cr
Date: Tue, 26 Jul 2005 06:43:22 -0600
From: Jeff Squyres <[email protected]>
To: Pradeep Padala <ppadala_at_eecs_dot_umich_dot_edu>

Yes, without libaio parallel processes checkpointed / restarted just fine.

I see the problem -- only libaio.so.1 exists (not libaio.so). This is why
the linker doesn't find it. Did you remove an RPM yesterday or something?
IIRC, the libaio.so file is in the libaio-devel RPM...?

----------------------------------

I fixed the aio rpm and I am waiting for him to re-test the mpi programs.
Is linking with aio a problem for blcr?

--
Pradeep Padala
http://ppadala.blogspot.com

Paul H. Hargrove wrote:
> Sorry to have replied before reading other replies which said the same
> thing.
>
> I just reread the relevant parts of the BLCR sources and see just a few
> places where EBUSY might be generated:
>
> + PID conflict.
> + Restore of a FIFO (aka named pipe) in which there is data buffered in
>   the pipe. A "solution" here would be to delete and recreate the FIFO.
>   We need a better behavior in BLCR, but can't yet do anything more
>   intelligent.
> + Some "should never happen" file restore cases.
>
> In all three cases, there should be a warning/error message in the
> system log file. Please let me know what you find in /var/log/messages
> (or equivalent).
>
> -Paul
>
> Paul H. Hargrove wrote:
>
>> Typically this is an indication that the original pids are (still) in
>> use. My guess is that the original mpi processes are still running.
>>
>> -Paul
>>
>> Jeff Squyres wrote:
>>
>>> A user was having problems with LAM + BLCR, so I got a guest account
>>> on his cluster and gave it a whirl. With my own build of LAM/MPI,
>>> I'm able to checkpoint just fine (i.e., I get N+1 checkpoint files).
>>> But when I try to restart, I get the following error:
>>>
>>> [jeff@linf1 ~]$ cr_restart context.4037
>>> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
>>> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
>>> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
>>> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
>>>
>>> What does this mean?
>>>
>>> I had checkpointed a simple "hello world" MPI application (4 MPI
>>> processes) on a single node.
>>>
>>> The user has already been in contact with Paul -- from his initial
>>> post on the LAM list
>>> (http://www.lam-mpi.org/MailArchives/lam/2005/07/11015.php):
>>>
>>> "P.S. I am using a patched version of blcr to make it work on FC4. The
>>> patch was given to me by Paul Hargrove."
>>>
>>> The specific version of BLCR in use is:
>>>
>>> [jeff@linf1 ~]$ cr_restart --version
>>> cr_restart version 0.4.pre1_snapshot_2005_06_27
>>>
>>> Sidenote: I notice that cr_checkpoint has a "--version" switch, but
>>> it is not listed in "cr_checkpoint --help" (which was somewhat
>>> confusing). Ditto for cr_run.
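
For the PID-conflict case Paul describes in the quoted text, here is a
rough sketch of a check that could be run before cr_restart. It assumes
that the number in a context file name such as context.4037 is the PID of
the checkpointed process (an assumption, not something confirmed in this
thread), and the helper names pid_in_use and pid_from_context are made up
for illustration. It relies only on the standard POSIX kill(pid, 0) probe
and does not call into BLCR at all.

```python
import os
import re


def pid_in_use(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check, nothing is delivered
    except ProcessLookupError:
        return False     # ESRCH: no such process
    except PermissionError:
        return True      # EPERM: the process exists but belongs to another user
    return True


def pid_from_context(filename: str) -> int:
    """Extract the trailing number from a name like 'context.4037'.

    Treating that number as the original PID is an assumption of this sketch.
    """
    match = re.fullmatch(r".*\.(\d+)", os.path.basename(filename))
    if match is None:
        raise ValueError(f"no trailing number in {filename!r}")
    return int(match.group(1))


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python check_pids.py context.4037
    for name in sys.argv[1:]:
        pid = pid_from_context(name)
        state = "still in use" if pid_in_use(pid) else "free"
        print(f"{name}: pid {pid} is {state}")
```

If a PID is reported as still in use, the original processes are probably
still running and would need to exit before the restart is retried. The
check only looks at the one PID taken from the file name, so it cannot
rule out conflicts with other PIDs stored inside the context file.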