From: Paul H. Hargrove (PHHargrove_at_lbl.gov)
Date: Mon Jan 05 2004 - 08:37:07 PST
Jeff Squyres wrote:
>
> On Fri, 2 Jan 2004, tingyu wrote:
>
> > I just installed blcr with lam-mpi 7.1a1, and blcr works great with
> > non-mpi programs. But when I tried to use it to checkpoint an MPI
> > program, the cr_checkpoint command halted and I got an error from the
> > checkpointed MPI program: "Real-time signal 31".
> >
> > [snipped]
> >
> > Then I start it on one console with "mpirun -np 2 ./hello", and "su"
> > to root on another console, use "cr_checkpoint pid-of-mpirun", then
> > the error happened.
>
> I'm not sure what "run-time signal 31" is, but I can tell you from the MPI
> side that you should not need to su over to root to checkpoint the
> parallel application. You should give "cr_checkpoint pid_of_mpirun" and
> the whole process should be checkpointed.

The "Real-time signal 31" message is an indication that the BLCR library
did not register at startup time. The only likely reason for this would
be that BLCR was not compiled in, as Jeff suggests.

> Be sure that you compiled your LAM/MPI with support for BLCR and are using
> the crtcp rpi module (or gm, at the CVS HEAD). The laminfo command can
> tell you if you have support for blcr included -- running "laminfo" should
> show a line similar to:
>
> SSI cr: blcr (Module v1.0.1)
>
> > Another question is, is it possible that we use cr_checkpoint to
> > checkpoint some processes in a mpi program, not all the mpi program?
>
> Not at this time.

If you try to checkpoint a single process in an MPI application, you
will see it wait for the other processes to participate in the
checkpoint. The likely result is that the application will get stuck
when it next tries to use MPI, and the checkpoint will never complete.

> --
> {+} Jeff Squyres
> {+} [email protected]
> {+} http://www.lam-mpi.org/

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-495-2998
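
P.S. The laminfo check described above can be scripted. The following is only a rough sketch: the helper name has_blcr_module is made up for illustration, and the pattern assumes laminfo prints a line shaped like the "SSI cr: blcr (Module v1.0.1)" example quoted in this message. Adjust it if your laminfo output differs.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the laminfo-style text on stdin
# contains a line naming the blcr checkpoint/restart (cr) module.
# Assumes the line looks like: "SSI cr: blcr (Module v1.0.1)".
has_blcr_module() {
    grep -Eq 'SSI cr:[[:space:]]*blcr'
}

# Only query LAM when laminfo is actually on the PATH.
if command -v laminfo >/dev/null 2>&1; then
    if laminfo | has_blcr_module; then
        echo "blcr module listed: LAM/MPI was built with BLCR support"
    else
        echo "no blcr module listed: rebuild LAM/MPI with BLCR support"
    fi
fi
```

If the module is listed, checkpoint the whole job as the same user that ran mpirun, with "cr_checkpoint pid_of_mpirun", as described above.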