From: Kevin (tz9_at_msstate.edu)
Date: Fri May 21 2004 - 09:47:11 PDT
I tested BLCR with LAM further. It seems the problem is caused by the checkpoint file in which mpirun is saved. For example, if I run mpirun -np 1 ./hello and the pid of mpirun is 344, two context files are created: context.344, which holds the mpirun process information, and context.344-n0-345, which holds the single "hello" process information.

I can use cr_restart with context.344-n0-345 to restart that process with partial success (in fact, the restarted process does not terminate on its own; it just gets stuck after execution). But cr_restart context.344 is where "Error in exec" happens.

Is it true that we can't restart all the processes that belong to an MPI program at the same time? I would expect context.344 to contain enough information to restart an MPI program with multiple processes together, rather than restarting the individual processes one by one as I did.

----- Original Message -----
From: "Kevin" <[email protected]>
To: <eroman_at_lbl_dot_gov>
Cc: <checkpoint_at_lbl_dot_gov>
Sent: Thursday, May 20, 2004 10:12 AM
Subject: Re: Error in exec

> Eric,
>
> Thanks for your suggestion. I checked my PATH setting; it does include the path to mpirun, which is in the LAM/bin directory. If the problem is from crtcp, is there some way we can solve it?
>
> Kevin
>
> ----- Original Message -----
> From: "Eric Roman" <ERoman_at_lbl_dot_gov>
> To: "Kevin" <[email protected]>
> Cc: <checkpoint_at_lbl_dot_gov>
> Sent: Wednesday, May 19, 2004 11:48 AM
> Subject: Re: Error in exec
>
> > Kevin
> >
> > Best I can tell, this is an error coming from LAM. It looks like the "Error
> > in exec" message is produced by crtcp when it fails to exec a new mpirun.
> >
> > Most likely reason for exec() to fail is that the executable wasn't found.
> > I'd check the path that the MPI app is using. Make sure it includes mpirun.
> >
> > - E
> >
> > On Wed, May 19, 2004 at 10:07:21AM -0500, Kevin wrote:
> > > Dear Sir,
> > >
> > > I used lam7.0.4 combined with blcr-0.2.0 to checkpoint MPI programs. It worked fine before with a single program and with an MPI program running on one node. Today, when I tried to checkpoint an MPI program (the "hello" program under the example directory of the LAM package) running on one node of our cluster, the program was checkpointed and the context file was saved. But when I try to restart it, it returns "Error in exec" to the screen. I can't figure out where the problem is. Could you please give me some suggestions?
> > >
> > > Below is some information on my operation and configuration:
> > >
> > > [kevin@Sparrow-01-02 ~/src] mpirun C ./hello
> > > //it works fine and information is displayed at console 1
> > >
> > > [kevin@Sparrow-01-02 ~/src] getpid mpirun
> > > //I got the pid of mpirun with a script "getpid" from console 2; assume it is 344
> > >
> > > [kevin@Sparrow-01-02 ~/src] cr_checkpoint 344
> > > //checkpoint ./hello from console 2; it works fine, and context.344 is saved to disk
> > >
> > > [kevin@Sparrow-01-02 ~/src] cr_restart context.344
> > > Error in exec
> > >
> > > ---below are configurations----------------------------------
> > > [kevin@Sparrow-01-02 ~/src] lamnodes
> > > n0 Sparrow-01-02.ERC.MsState.Edu:1:origin,this_node
> > >
> > > [kevin@Sparrow-01-02 ~/src] laminfo
> > > LAM/MPI: 7.0.4
> > > Prefix: /home/kevin/LAM
> > > Architecture: i686-pc-linux-gnu
> > > Configured by: kevin
> > > Configured on: Mon May 3 15:45:08 CDT 2004
> > > Configure host: Sparrow-01-01.ERC.MsState.Edu
> > > C bindings: yes
> > > C++ bindings: yes
> > > Fortran bindings: yes
> > > C profiling: yes
> > > C++ profiling: yes
> > > Fortran profiling: yes
> > > ROMIO support: yes
> > > IMPI support: no
> > > Debug support: no
> > > Purify clean: no
> > > SSI boot: globus (Module v0.5)
> > > SSI boot: rsh (Module v1.0)
> > > SSI coll: lam_basic (Module v7.0)
> > > SSI coll: smp (Module v1.0)
> > > SSI rpi: crtcp (Module v1.0.1)
> > > SSI rpi: lamd (Module v7.0)
> > > SSI rpi: sysv (Module v7.0)
> > > SSI rpi: tcp (Module v7.0)
> > > SSI rpi: usysv (Module v7.0)
> > > SSI cr: blcr (Module v1.0.1)
> >
> > --
> > Eric Roman    Computational Research Division
> > 510-486-6420  Berkeley Lab
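
Eric's suggestion in the quoted thread, checking that mpirun can be found through PATH in the shell that runs cr_restart, can be sketched roughly as follows. This is only a minimal sketch: check_in_path is a hypothetical helper name, not part of LAM or BLCR, and it assumes the failing exec resolves mpirun through PATH, which the thread suggests but does not confirm.

```shell
#!/bin/sh
# Sketch: report whether a command name resolves through the current PATH.
# check_in_path is a hypothetical helper, not a LAM or BLCR tool.
# Note: "command -v" also reports shell builtins, functions and aliases,
# so it only approximates the lookup an exec-style call would do.
check_in_path() {
    if resolved=$(command -v "$1" 2>/dev/null); then
        echo "found: $1 -> $resolved"
        return 0
    else
        echo "missing: $1 is not on PATH"
        return 1
    fi
}

# Tools named in this thread; run from the same shell that calls cr_restart.
for tool in mpirun cr_restart; do
    check_in_path "$tool" || echo "  -> check PATH before restarting"
done
```

If mpirun is reported as found and cr_restart context.344 still prints "Error in exec", the PATH explanation becomes less likely and the problem is more probably elsewhere.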