From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Jul 26 2005 - 14:43:36 PDT
Sorry to have replied before reading the other replies, which said the same thing.

I just reread the relevant parts of the BLCR sources and see just a few places where EBUSY might be generated:
+ PID conflict.
+ Restore of a FIFO (aka named pipe) in which there is data buffered in the pipe. A "solution" here would be to delete and recreate the FIFO. We need a better behavior in BLCR, but can't yet do anything more intelligent.
+ Some "should never happen" file restore cases.

In all three cases, there should be a warning/error message in the system log file. Please let me know what you find in /var/log/messages (or equivalent).

-Paul

Paul H. Hargrove wrote:
> Typically this is an indication that the original pids are (still) in
> use. My guess is that the original mpi processes are still running.
>
> -Paul
>
> Jeff Squyres wrote:
>
>> A user was having problems with LAM + BLCR, so I got a guest account
>> on his cluster and gave it a whirl. With my own build of LAM/MPI, I'm
>> able to checkpoint just fine (i.e., I get N+1 checkpoint files). But
>> when I try to restart, I get the following error:
>>
>> [jeff@linf1 ~]$ cr_restart context.4037
>> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
>> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
>> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
>> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
>>
>> What does this mean?
>>
>> I had checkpointed a simple "hello world" MPI application (4 MPI
>> processes) on a single node.
>>
>> The user has already been in contact with Paul -- from his initial
>> post on the LAM list
>> (http://www.lam-mpi.org/MailArchives/lam/2005/07/11015.php):
>>
>> "P.S. I am using a patched version of blcr to make it work on FC4. The
>> patch was given to me by Paul Hargrove."
>>
>> The specific version of BLCR in use is:
>>
>> [jeff@linf1 ~]$ cr_restart --version
>> cr_restart version 0.4.pre1_snapshot_2005_06_27
>>
>> Sidenote: I notice that cr_checkpoint has a "--version" switch, but it
>> is not listed in "cr_checkpoint --help" (which was somewhat
>> confusing). Ditto for cr_run.
>>
>
>

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900