Re: blcr error with "Real-time signal 31"

From: Paul H. Hargrove (PHHargrove_at_lbl.gov)
Date: Mon Jan 05 2004 - 08:37:07 PST

    Jeff Squyres wrote:
    > 
    > On Fri, 2 Jan 2004, tingyu wrote:
    > 
    > > I just installed blcr with lam-mpi 7.1a1, and blcr works great with
    > > non-mpi programs. But when I tried to use it to checkpoint an MPI
    > > program, the cr_checkpoint command halted and I got the error
    > > "Real-time signal 31" from the checkpointed MPI program.
    > >
    > > [snipped]
    > >
    > > Then I start it on one console with "mpirun -np 2 ./hello", "su"
    > > to root on another console, and run "cr_checkpoint pid-of-mpirun";
    > > then the error happens.
    > 
    > I'm not sure what "Real-time signal 31" is, but I can tell you from the MPI
    > side that you should not need to su over to root to checkpoint the
    > parallel application.  You should give "cr_checkpoint pid_of_mpirun" and
    > the whole process should be checkpointed.
    
    The "Real-time signal 31" message is an indication that the BLCR library
    did not register at startup time.  The only likely reason for this is
    that BLCR support was not compiled into LAM/MPI, as Jeff suggests.
    
    > Be sure that you compiled your LAM/MPI with support for BLCR and are using
    > the crtcp rpi module (or gm, at the CVS HEAD).  The laminfo command can
    > tell you if you have support for blcr included -- running "laminfo" should
    > show a line similar to:
    > 
    >             SSI cr: blcr (Module v1.0.1)
    > 
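
The check described above can be scripted. The following is a minimal sketch that scans laminfo output for a line of the form shown in the message. The function name `has_blcr_module` is hypothetical, and the pattern assumes the `SSI cr: blcr` line looks like the sample quoted above; other laminfo versions may format it differently.

```python
import re

# Matches a line such as "            SSI cr: blcr (Module v1.0.1)".
# The exact layout is an assumption based on the sample line in this thread.
BLCR_LINE = re.compile(r"^\s*SSI cr:\s*blcr\b", re.MULTILINE)


def has_blcr_module(laminfo_output):
    """Return True if the given laminfo output lists the blcr cr module."""
    return BLCR_LINE.search(laminfo_output) is not None
```

One possible use, assuming laminfo is on the PATH and is run with no arguments as in the message, is to capture its output with `subprocess.run(["laminfo"], capture_output=True, text=True).stdout` and pass that string to `has_blcr_module`.
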
    > > Another question: is it possible to use cr_checkpoint to checkpoint
    > > only some of the processes in an MPI program, rather than all of them?
    > 
    > Not at this time.
    
    If you try to checkpoint a single process in an MPI application, you
    will see it wait for the other processes to participate in the
    checkpoint.  The likely result is that the application gets stuck the
    next time it tries to use MPI, and the checkpoint never completes.
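
The waiting behaviour can be pictured with a barrier that every process must reach. The sketch below is a toy model using Python threads, not the actual LAM/MPI or BLCR protocol. The function name and the timeout are assumptions added for illustration; in the situation described above there is no timeout, so the lone process would simply keep waiting.

```python
import threading


def coordinated_checkpoint(total_procs, requested_procs, timeout=0.5):
    """Model a checkpoint that needs all total_procs processes to take part.

    Only requested_procs of them (at least one) are asked to checkpoint.
    Returns True if every participant got past the barrier, False if the
    participants gave up waiting for the ones that never arrived.
    """
    barrier = threading.Barrier(total_procs)
    results = []
    lock = threading.Lock()

    def participant():
        try:
            # Block until all total_procs parties have arrived.
            barrier.wait(timeout=timeout)
            ok = True
        except threading.BrokenBarrierError:
            # Not everyone showed up before the (illustrative) timeout.
            ok = False
        with lock:
            results.append(ok)

    threads = [threading.Thread(target=participant)
               for _ in range(requested_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(results)
```

Calling `coordinated_checkpoint(2, 2)` models checkpointing via mpirun, where both processes take part, while `coordinated_checkpoint(2, 1)` models checkpointing one process on its own.
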
    
    > --
    > {+} Jeff Squyres
    > {+} [email protected]
    > {+} http://www.lam-mpi.org/
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-495-2998
    
