From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Jul 26 2005 - 10:49:05 PDT
Typically this is an indication that the original pids are (still) in use. My guess is that the original MPI processes are still running.

-Paul

Jeff Squyres wrote:
> A user was having problems with LAM + BLCR, so I got a guest account on
> his cluster and gave it a whirl. With my own build of LAM/MPI, I'm able
> to checkpoint just fine (i.e., I get N+1 checkpoint files). But when I
> try to restart, I get the following error:
>
> [jeff@linf1 ~]$ cr_restart context.4037
> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
> cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
>
> What does this mean?
>
> I had checkpointed a simple "hello world" MPI application (4 MPI
> processes) on a single node.
>
> The user has already been in contact with Paul -- from his initial post
> on the LAM list
> (http://www.lam-mpi.org/MailArchives/lam/2005/07/11015.php):
>
> "P.S. I am using a patched version of blcr to make it work on FC4. The
> patch was given to me by Paul Hargrove."
>
> The specific version of BLCR in use is:
>
> [jeff@linf1 ~]$ cr_restart --version
> cr_restart version 0.4.pre1_snapshot_2005_06_27
>
> Sidenote: I notice that cr_checkpoint has a "--version" switch, but it
> is not listed in "cr_checkpoint --help" (which was somewhat confusing).
> Ditto for cr_run.
>

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
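
A quick way to test the "pids still in use" guess before retrying cr_restart is to probe the pid with signal 0. The sketch below is only an illustration and is not part of BLCR: the helper names are hypothetical, and it assumes that the numeric suffix of a file such as context.4037 is the pid of the checkpointed process, which may not hold if the file was renamed. It also checks only that one pid, not the pids of the other MPI processes recorded in their own context files.

```python
import os
import re


def pid_from_context(filename):
    """Return the numeric suffix of a name like 'context.4037', or None.

    Assumption: the suffix is the pid of the checkpointed process.
    """
    match = re.fullmatch(r"context\.(\d+)", os.path.basename(filename))
    return int(match.group(1)) if match else None


def pid_in_use(pid):
    """Return True if a process with this pid currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 delivers nothing; it only checks existence
    except ProcessLookupError:
        return False     # ESRCH: no process has this pid
    except PermissionError:
        return True      # EPERM: the pid exists but belongs to another user
    return True
```

If pid_in_use(pid_from_context("context.4037")) returns True, the original process is probably still running and would need to exit before a restart can reclaim that pid.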