From: Kevin (tz9_at_msstate.edu)
Date: Wed May 19 2004 - 08:07:21 PDT
Dear Sir,

I am using LAM 7.0.4 combined with blcr-0.2.0 to checkpoint MPI programs. Previously it worked fine with a single program and with an MPI program running on one node.

Today I tried to checkpoint an MPI program (the "hello" program from the examples directory of the LAM package) running on one node of our cluster. The program could be checkpointed and the context file was saved, but when I try to restart it, it prints "Error in exec" to the screen. I can't figure out where the problem is. Could you please give me some suggestions?

Below is some information on my operations and configuration:

  [kevin@Sparrow-01-02 ~/src] mpirun C ./hello
      // works fine; the output is displayed on console 1

  [kevin@Sparrow-01-02 ~/src] getpid mpirun
      // on console 2, I get the PID of mpirun with a script "getpid"; assume it is 344

  [kevin@Sparrow-01-02 ~/src] cr_checkpoint 344
      // checkpoints ./hello from console 2; works fine, context.344 is saved to disk

  [kevin@Sparrow-01-02 ~/src] cr_restart context.344
  Error in exec

--- below is the configuration ---

  [kevin@Sparrow-01-02 ~/src] lamnodes
  n0 Sparrow-01-02.ERC.MsState.Edu:1:origin,this_node

  [kevin@Sparrow-01-02 ~/src] laminfo
             LAM/MPI: 7.0.4
              Prefix: /home/kevin/LAM
        Architecture: i686-pc-linux-gnu
       Configured by: kevin
       Configured on: Mon May 3 15:45:08 CDT 2004
      Configure host: Sparrow-01-01.ERC.MsState.Edu
          C bindings: yes
        C++ bindings: yes
    Fortran bindings: yes
         C profiling: yes
       C++ profiling: yes
   Fortran profiling: yes
       ROMIO support: yes
        IMPI support: no
       Debug support: no
        Purify clean: no
            SSI boot: globus (Module v0.5)
            SSI boot: rsh (Module v1.0)
            SSI coll: lam_basic (Module v7.0)
            SSI coll: smp (Module v1.0)
             SSI rpi: crtcp (Module v1.0.1)
             SSI rpi: lamd (Module v7.0)
             SSI rpi: sysv (Module v7.0)
             SSI rpi: tcp (Module v7.0)
             SSI rpi: usysv (Module v7.0)
              SSI cr: blcr (Module v1.0.1)
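
The "getpid" script used in the session is not shown in this message. As a rough sketch only, a helper with similar behavior could be written as below. It assumes that pgrep (from procps) is installed, and the function body is a hypothetical stand-in, not the actual script.

```shell
#!/bin/sh
# Hypothetical stand-in for the "getpid" helper mentioned in the message
# (the original script is not shown, so this is an assumption).
# Prints the PID of the newest process owned by the current user whose
# name matches $1 exactly; returns non-zero if there is no such process.
getpid() {
    pgrep -n -x -u "$(id -u)" "$1"
}

# Intended use in the checkpoint step (left as a comment, not executed here):
#   pid=$(getpid mpirun) && cr_checkpoint "$pid"
```

Looking the PID up by name avoids copying it by hand between consoles, but it will pick the wrong process if the same user has more than one mpirun running.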