Re: blcr error with "Real-time signal 31"

From: Jeff Squyres (jsquyres_at_lam-mpi.org)
Date: Fri Jan 02 2004 - 11:13:51 PST


On Fri, 2 Jan 2004, tingyu wrote:

> I just installed blcr with lam-mpi 7.1a1, and blcr works great with
> non-mpi programs. But when I tried to use it to checkpoint a MPI
> program. the cr_checkpoint command halt and I got error from where the
> checkpointed MPI program "Real-time signal 31".
>
> [snipped]
>
> Then I start it on one console with "mpirun -np 2 ./hello", and "su"
> with root on another console, use "cr_checkpoint pid-of-mpirun", then
> error happened.

I'm not sure what "Real-time signal 31" means in this context, but I can
tell you from the MPI side that you should not need to su to root to
checkpoint the parallel application.  Running "cr_checkpoint pid_of_mpirun"
as the same user who launched mpirun should checkpoint the whole job.

Be sure that you compiled LAM/MPI with support for BLCR and that you are
using the crtcp rpi module (or gm, if you are on the CVS HEAD).  The laminfo
command can tell you whether support for blcr is included -- running
"laminfo" should show a line similar to:

            SSI cr: blcr (Module v1.0.1)
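
If you want to script that check, something like the following should work.
It is only a sketch: has_blcr is a made-up helper name, and the pattern
assumes the laminfo line looks like the one above.

```shell
# has_blcr (hypothetical helper): reads laminfo-style output on stdin and
# succeeds only if it contains a line of the form "SSI cr: blcr ...".
has_blcr() {
    grep -Eq 'SSI[[:space:]]+cr:[[:space:]]+blcr'
}

# Usage on a machine with LAM/MPI installed:
#   laminfo | has_blcr && echo "blcr support is compiled in"
```

If the check fails, LAM/MPI needs to be rebuilt with BLCR support.  If it
succeeds, the modules can normally be selected at run time with something
like "mpirun -ssi rpi crtcp -ssi cr blcr -np 2 ./hello"; check the User's
Guide for the exact form in your version.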

> Another question is, is it possible that we use cr_checkpoint to
> checkpoint some processes in a mpi program, not all the mpi program?

Not at this time.

-- 
{+} Jeff Squyres
{+} jsquyres@lam-mpi.org
{+} http://www.lam-mpi.org/