From: Jeff Squyres (jsquyres_at_lam-mpi.org)
Date: Wed Mar 23 2005 - 06:27:56 PST
I'm sorry -- I neglected to mention in my previous e-mail that we had
some problems with the logic for checkpoint/restart initialization in
LAM/MPI v7.1.1. Can you try the soon-to-be-released 7.1.2 beta?

    http://www.lam-mpi.org/beta/

That should solve your problems.

On Mar 23, 2005, at 9:27 AM, 任明明 wrote:

>
> thank you for your help!
> I can use blcr to checkpoint the non-MPI program,such as the examples
> included in the blcr software.And all the nodes are ok to checkpoint a
> non-MPI program.
> but when i use cr_checkpoint to checkpoint a MPI program, it doesn't
> generate
> context file for each process, only generate a context file for mpirun
> command.
>
> all i do is the the following:
>
> In one window:
> ****************************************************
> [rmingming@node01 lam]$ mpicc cpi.c -o cpi
> [rmingming@node01 lam]$ lamboot -v nodes
>
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
>
> n-1<8238> ssi:boot:base:linear: booting n0 (node01)
> n-1<8238> ssi:boot:base:linear: booting n1 (node02)
> n-1<8238> ssi:boot:base:linear: booting n2 (node03)
> n-1<8238> ssi:boot:base:linear: booting n3 (node04)
> n-1<8238> ssi:boot:base:linear: finished
> [rmingming@node01 lam]$ mpirun C -ssi rpi crtcp -ssi cr blcr ./cpi
> Process 0 on node01
> Process 1 on node02
> Process 3 on node04
> Process 2 on node03
> Enter the number of intervals: (0 quits) 0 (---during this i use
> cr_checkpoint)
> [rmingming@node01 lam]$
>
> ******************************************************
>
> in another window:
>
> ******************************************************
>
> [rmingming@node01 lam]$ cr_checkpoint 8248
> [rmingming@node01 lam]$ ls
> context.8248 cpi cpi.c hello.c nodes ring
> (i can't find the context files for each process, i also checked the
> home dir)
> [rmingming@node01 lam]$ cr_restart context.8248
> mpirun (rpwait): Bad file descriptor
> [rmingming@node01 lam]$
>
> ******************************************************
>
> hope to receive from you all :)
>
> In your letter you mentioned:
>> From: Jeff Squyres <[email protected]>
>> Reply-To:
>> To: checkpoint_at_lbl_dot_gov
>> Subject: Re: lam/mpi blcr problem
>> Date:Tue, 22 Mar 2005 15:23:46 -0500
>>
>> On Mar 22, 2005, at 12:05 PM, Paul H. Hargrove wrote:
>>
>>> I am sorry to hear that you are having problems. Lets see if we can
>>> help.
>>>
>>> As far as I can tell your LAM configuration is OK, but I am cc:ing
>>> this to one of the LAM developers who may be able to spot something I
>>> could not.
>>
>> No need -- I'm actually on the checkpoint_at_lbl_dot_gov list. :-)
>>
>>> Have you tried 'make check' in the blcr build directory or
>>> checkpointing/restarting some of the non-mpi examples in blcr's
>>> examples directory? It would be good to know that the blcr build was
>>> OK before bring LAM into the mix.
>>>
>>> When LAM ran the mpi application, was blcr installed (and the kernel
>>> modules loaded) on all the compute nodes running the mpi job?
>>
>> Additionally, were you using the crtcp RPI? I.e., what was the
>> specific command that you used to mpirun your application? And how
>> did
>> you try to checkpoint it?
>>
>> --
>> {+} Jeff Squyres
>> {+} [email protected]
>> {+} http://www.lam-mpi.org/
>>
>>
>
>

--
{+} Jeff Squyres
{+} [email protected]
{+} http://www.lam-mpi.org/