From: Jeff Squyres (jsquyres_at_open-mpi.org)
Date: Mon Jul 25 2005 - 06:24:05 PDT
A user was having problems with LAM + BLCR, so I got a guest account on his cluster and gave it a whirl. With my own build of LAM/MPI, I'm able to checkpoint just fine (i.e., I get N+1 checkpoint files). But when I try to restart, I get the following error:

    [jeff@linf1 ~]$ cr_restart context.4037
    cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
    cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
    cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
    cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy

What does this mean? I had checkpointed a simple "hello world" MPI application (4 MPI processes) on a single node.

The user has already been in contact with Paul -- from his initial post on the LAM list (http://www.lam-mpi.org/MailArchives/lam/2005/07/11015.php):

    "P.S. I am using a patched version of blcr to make it work on FC4.
    The patch was given to me by Paul Hargrove."

The specific version of BLCR in use is:

    [jeff@linf1 ~]$ cr_restart --version
    cr_restart version 0.4.pre1_snapshot_2005_06_27

Sidenote: I notice that cr_checkpoint has a "--version" switch, but it is not listed in "cr_checkpoint --help" (which was somewhat confusing). Ditto for cr_run.

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/