From: 任明明 (0110018_at_mail.nankai.edu.cn)
Date: Wed Mar 23 2005 - 06:27:18 PST
Thank you for your help!

I can use blcr to checkpoint non-MPI programs, such as the examples included
with the blcr software, and all of the nodes can checkpoint a non-MPI program.
But when I use cr_checkpoint to checkpoint an MPI program, it does not generate
a context file for each process; it only generates a context file for the
mpirun command.

Here is everything I did.

In one window:
****************************************************
[rmingming@node01 lam]$ mpicc cpi.c -o cpi
[rmingming@node01 lam]$ lamboot -v nodes

LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University

n-1<8238> ssi:boot:base:linear: booting n0 (node01)
n-1<8238> ssi:boot:base:linear: booting n1 (node02)
n-1<8238> ssi:boot:base:linear: booting n2 (node03)
n-1<8238> ssi:boot:base:linear: booting n3 (node04)
n-1<8238> ssi:boot:base:linear: finished
[rmingming@node01 lam]$ mpirun C -ssi rpi crtcp -ssi cr blcr ./cpi
Process 0 on node01
Process 1 on node02
Process 3 on node04
Process 2 on node03
Enter the number of intervals: (0 quits) 0
(--- I ran cr_checkpoint while the program was waiting at this prompt)
[rmingming@node01 lam]$
******************************************************

In another window:
******************************************************
[rmingming@node01 lam]$ cr_checkpoint 8248
[rmingming@node01 lam]$ ls
context.8248 cpi cpi.c hello.c nodes ring
(I can't find the context files for each process; I also checked the home dir)
[rmingming@node01 lam]$ cr_restart context.8248
mpirun (rpwait): Bad file descriptor
[rmingming@node01 lam]$
******************************************************

I hope to hear from you all :)

In your message, you wrote:
> From: Jeff Squyres <[email protected]>
> Reply-To:
> To: checkpoint_at_lbl_dot_gov
> Subject: Re: lam/mpi blcr problem
> Date: Tue, 22 Mar 2005 15:23:46 -0500
>
> On Mar 22, 2005, at 12:05 PM, Paul H. Hargrove wrote:
>
> > I am sorry to hear that you are having problems. Let's see if we can
> > help.
> >
> > As far as I can tell your LAM configuration is OK, but I am cc:ing
> > this to one of the LAM developers who may be able to spot something I
> > could not.
>
> No need -- I'm actually on the checkpoint_at_lbl_dot_gov list. :-)
>
> > Have you tried 'make check' in the blcr build directory or
> > checkpointing/restarting some of the non-mpi examples in blcr's
> > examples directory? It would be good to know that the blcr build was
> > OK before bringing LAM into the mix.
> >
> > When LAM ran the mpi application, was blcr installed (and the kernel
> > modules loaded) on all the compute nodes running the mpi job?
>
> Additionally, were you using the crtcp RPI? I.e., what was the
> specific command that you used to mpirun your application? And how did
> you try to checkpoint it?
>
> --
> {+} Jeff Squyres
> {+} [email protected]
> {+} http://www.lam-mpi.org/