Re: More testing result about "Error in exec". Re: Error in exec

jcduell_at_lbl_dot_gov
Date: Fri May 21 2004 - 10:33:03 PDT

  • Next message: Eric Roman: "Re: More testing result about "Error in exec". Re: Error in exec"
    Kevin:
    
    In the example below, you should be able to restart the whole MPI job
    just by running
    
            cr_restart context.344
    
    Just to be sure, could you tell me if
    
    1) All processes in the original (checkpointed) MPI job are gone by the
       time you try to restart.  In other words, you checkpointed via
    
            cr_checkpoint --term 344
    
   Or, if you didn't pass the '--term' flag, you killed the job manually.
    
    2) When you tried
    
            cr_restart context.344
    
       There wasn't already a copy of 'hello' running from your having run
    
            cr_restart context.344-n0-345
    
    If that's all true, and you got an error from 'cr_restart context.344',
    could you cut and paste the error that displayed into an email and send
    it to us?
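
    For reference, the sequence I'd expect to work (using the PID 344 from
    your example, and assuming '--term' was used so that no copies of mpirun
    or 'hello' are still running) is roughly:

            cr_checkpoint --term 344    # checkpoint and terminate the whole job
            ps x | grep hello           # verify nothing is left running
            cr_restart context.344      # restart mpirun and, through it, 'hello'

    The per-process file context.344-n0-345 shouldn't need to be restarted
    by hand; restarting the mpirun context should bring back the 'hello'
    process as well.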
    
    Thanks,
    
    -- 
    Jason Duell             Future Technologies Group
    <jcduell_at_lbl_dot_gov>       Computational Research Division
    Tel: +1-510-495-2354    Lawrence Berkeley National Laboratory
    
    
    On Fri, May 21, 2004 at 11:47:11AM -0500, Kevin wrote:
    > I tested BLCR with LAM further. It seems that right now the problem is
    > caused by the checkpoint file in which mpirun itself is saved. For
    > example, if I use
    > 
    >         mpirun -np 1 ./hello
    > 
    > and assume the pid of mpirun is 344, then two context files are created:
    > context.344, which holds the mpirun process, and context.344-n0-345,
    > which holds the single "hello" process. I can restart a process from
    > context.344-n0-345 partially successfully (in fact, the restarted
    > process doesn't stop automatically; it just gets stuck after execution).
    > But if I use
    > 
    >         cr_restart context.344
    > 
    > then that's where "Error in exec" happens.  Is it true that we can't
    > restart the set of processes belonging to an MPI program at the same
    > time? I would guess that context.344 should contain enough information
    > to restart an MPI program with multiple processes together, rather than
    > restarting the individual processes one by one as I did.
    > 
    > 
    > 
    > ----- Original Message ----- 
    > From: "Kevin" <[email protected]>
    > To: <eroman_at_lbl_dot_gov>
    > Cc: <checkpoint_at_lbl_dot_gov>
    > Sent: Thursday, May 20, 2004 10:12 AM
    > Subject: Re: Error in exec
    > 
    > 
    > > Eric,
    > > 
    > > Thanks for your suggestion. I checked my PATH setting; it does
    > > include the path to mpirun, which is in the LAM/bin directory. If
    > > the problem comes from crtcp, is there some way to solve it?
    > > 
    > > Kevin
    > > 
    > > 
    > >  
    > > ----- Original Message ----- 
    > > From: "Eric Roman" <ERoman_at_lbl_dot_gov>
    > > To: "Kevin" <[email protected]>
    > > Cc: <checkpoint_at_lbl_dot_gov>
    > > Sent: Wednesday, May 19, 2004 11:48 AM
    > > Subject: Re: Error in exec
    > > 
    > > 
    > > > 
    > > > Kevin
    > > > 
    > > > Best I can tell, this is an error coming from LAM.  It looks like
    > > > the "Error in exec" message is produced by crtcp when it fails to
    > > > exec a new mpirun.
    > > > 
    > > > The most likely reason for exec() to fail is that the executable
    > > > wasn't found.  I'd check the PATH that the MPI app is using and
    > > > make sure it includes mpirun.
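    > > > 
    > > > For example (just as a sanity check, assuming LAM is installed under
    > > > the prefix shown by laminfo, /home/kevin/LAM), you could run
    > > > 
    > > >         which mpirun
    > > >         echo $PATH
    > > > 
    > > > and confirm that 'which mpirun' prints /home/kevin/LAM/bin/mpirun and
    > > > that this directory appears in $PATH.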
    > > > 
    > > >  - E
    > > > 
    > > > On Wed, May 19, 2004 at 10:07:21AM -0500, Kevin wrote:
    > > > > Dear Sir, 
    > > > > 
    > > > > I am using LAM 7.0.4 combined with BLCR 0.2.0 to checkpoint MPI
    > > > > programs. It worked fine before with a single program and with an
    > > > > MPI program running on one node. Today, when I tried to checkpoint
    > > > > an MPI program (the "hello" program under the examples directory
    > > > > of the LAM package) running on one node of our cluster, the
    > > > > program could be checkpointed and the context file was saved. But
    > > > > when I try to restart it, it prints "Error in exec" to the screen.
    > > > > I can't figure out where the problem is. Could you please give me
    > > > > some suggestions?
    > > > > 
    > > > > Below is some information on the commands I ran and my configuration:
    > > > > 
    > > > > [kevin@Sparrow-01-02 ~/src]mpirun C ./hello 
    > > > > // it works fine; output is displayed at console 1
    > > > > 
    > > > > [kevin@Sparrow-01-02 ~/src] getpid mpirun 
    > > > > // I got the pid of mpirun with a "getpid" script from console 2; assume it is 344
    > > > > 
    > > > > [kevin@Sparrow-01-02 ~/src]cr_checkpoint 344
    > > > > // checkpoint ./hello from console 2; it works fine and context.344 is saved to disk
    > > > > 
    > > > > [kevin@Sparrow-01-02 ~/src]cr_restart context.344
    > > > > Error in exec
    > > > > 
    > > > > ---below are configurations----------------------------------
    > > > > [kevin@Sparrow-01-02 ~/src]lamnodes
    > > > > n0      Sparrow-01-02.ERC.MsState.Edu:1:origin,this_node
    > > > > 
    > > > > [kevin@Sparrow-01-02 ~/src]laminfo
    > > > >            LAM/MPI: 7.0.4
    > > > >             Prefix: /home/kevin/LAM
    > > > >       Architecture: i686-pc-linux-gnu
    > > > >      Configured by: kevin
    > > > >      Configured on: Mon May  3 15:45:08 CDT 2004
    > > > >     Configure host: Sparrow-01-01.ERC.MsState.Edu
    > > > >         C bindings: yes
    > > > >       C++ bindings: yes
    > > > >   Fortran bindings: yes
    > > > >        C profiling: yes
    > > > >      C++ profiling: yes
    > > > >  Fortran profiling: yes
    > > > >      ROMIO support: yes
    > > > >       IMPI support: no
    > > > >      Debug support: no
    > > > >       Purify clean: no
    > > > >           SSI boot: globus (Module v0.5)
    > > > >           SSI boot: rsh (Module v1.0)
    > > > >           SSI coll: lam_basic (Module v7.0)
    > > > >           SSI coll: smp (Module v1.0)
    > > > >            SSI rpi: crtcp (Module v1.0.1)
    > > > >            SSI rpi: lamd (Module v7.0)
    > > > >            SSI rpi: sysv (Module v7.0)
    > > > >            SSI rpi: tcp (Module v7.0)
    > > > >            SSI rpi: usysv (Module v7.0)
    > > > >             SSI cr: blcr (Module v1.0.1)
    > > > > 
    
