Re: lam/mpi blcr problem

From: Jeff Squyres (jsquyres_at_lam-mpi.org)
Date: Wed Mar 23 2005 - 06:27:56 PST

  • Next message: 任明明: "Re: lam/mpi blcr problem"
    I'm sorry -- I neglected to mention in my previous e-mail that we had 
    some problems with the logic for checkpoint/restart initialization in 
    LAM/MPI v7.1.1.  Can you try the soon-to-be-released 7.1.2 beta?
    
    	http://www.lam-mpi.org/beta/
    
    That should solve your problems.
    
    
    On Mar 23, 2005, at 9:27 AM, 任明明 wrote:
    
    >
    > Thank you for your help!
    > I can use blcr to checkpoint non-MPI programs, such as the examples
    > included in the blcr software, and all the nodes can checkpoint a
    > non-MPI program.
    > But when I use cr_checkpoint to checkpoint an MPI program, it doesn't
    > generate a context file for each process; it only generates a context
    > file for the mpirun command.
    >
    > All I did is the following:
    >
    > In one window:
    > ****************************************************
    > [rmingming@node01 lam]$ mpicc cpi.c -o cpi
    > [rmingming@node01 lam]$ lamboot -v nodes
    >
    > LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
    >
    > n-1<8238> ssi:boot:base:linear: booting n0 (node01)
    > n-1<8238> ssi:boot:base:linear: booting n1 (node02)
    > n-1<8238> ssi:boot:base:linear: booting n2 (node03)
    > n-1<8238> ssi:boot:base:linear: booting n3 (node04)
    > n-1<8238> ssi:boot:base:linear: finished
    > [rmingming@node01 lam]$ mpirun C -ssi rpi crtcp -ssi cr blcr ./cpi
    > Process 0 on node01
    > Process 1 on node02
    > Process 3 on node04
    > Process 2 on node03
    > Enter the number of intervals: (0 quits) 0 (--- while this prompt was
    > waiting, I ran cr_checkpoint)
    > [rmingming@node01 lam]$
    >
    > ******************************************************
    >
    > in another window:
    >
    > ******************************************************
    >
    > [rmingming@node01 lam]$ cr_checkpoint 8248
    > [rmingming@node01 lam]$ ls
    > context.8248  cpi  cpi.c  hello.c  nodes  ring
    > (I can't find the context files for each process; I also checked the
    > home dir)
    > [rmingming@node01 lam]$ cr_restart context.8248
    > mpirun (rpwait): Bad file descriptor
    > [rmingming@node01 lam]$
    >
    > ******************************************************
    >
    > Hope to hear from you all :)
    >
    > In your message you wrote:
    >> From: Jeff Squyres <jsquyres@lam-mpi.org>
    >> Reply-To:
    >> To: checkpoint_at_lbl_dot_gov
    >> Subject: Re: lam/mpi blcr problem
    >> Date: Tue, 22 Mar 2005 15:23:46 -0500
    >>
    >> On Mar 22, 2005, at 12:05 PM, Paul H. Hargrove wrote:
    >>
    >>> I am sorry to hear that you are having problems.  Let's see if we can
    >>> help.
    >>>
    >>> As far as I can tell your LAM configuration is OK, but I am cc:ing
    >>> this to one of the LAM developers who may be able to spot something I
    >>> could not.
    >>
    >> No need -- I'm actually on the checkpoint_at_lbl_dot_gov list.  :-)
    >>
    >>> Have you tried 'make check' in the blcr build directory or
    >>> checkpointing/restarting some of the non-mpi examples in blcr's
    >>> examples directory?  It would be good to know that the blcr build was
    >>> OK before bringing LAM into the mix.
    >>>
    >>> When LAM ran the mpi application, was blcr installed (and the kernel
    >>> modules loaded) on all the compute nodes running the mpi job?
    >>
    >> Additionally, were you using the crtcp RPI?  I.e., what was the
    >> specific command that you used to mpirun your application?  And how
    >> did you try to checkpoint it?
    >>
    >> -- 
    >> {+} Jeff Squyres
    >> {+} jsquyres@lam-mpi.org
    >> {+} http://www.lam-mpi.org/
    >>
    >>
    >
    >
    
    -- 
    {+} Jeff Squyres
    {+} jsquyres@lam-mpi.org
    {+} http://www.lam-mpi.org/
    
