Re: checkpoints on alvarez

From: Jeff Squyres (jsquyres_at_lam-mpi.org)
Date: Sat Mar 20 2004 - 06:15:32 PST

  • Next message: jcduell_at_lbl_dot_gov: "Re: 128way esp/pchksum checkpoint/restart on alvarez using BLCR/LAM"
    On Wed, 17 Mar 2004 jcduell_at_lbl_dot_gov wrote:
    
    > OK, I've sent an email to the LAM guys about the 'gm' rpi not
    > checkpointing successfully.
    
    To follow up directly to the user -- the gm checkpoint/restart stuff that
    was demo'ed at SC will only be available in LAM/MPI 7.1.  It is not
    available in 7.0.x.  We expect to release 7.1 towards the end of the
    semester.
    
    > I'm not seeing any real problems checkpointing LAM on alvarez when the
    > 'crtcp' rpi is used.  I get a scary message from the application when
    > it gets terminated, but it restarts fine--I don't need to
    > lamhalt/lamboot before restarting.
    
    Correct.  This is also likely to be "just the way it is" for the time
    being; causing it to not print the scary message will likely take a lot of
    work on our part (i.e., how is LAM supposed to know when a SIGTERM is
    acceptable and when it is not?), and is not likely to be fixed in the near
    future.  Sorry!  :-(
    
    -- 
    {+} Jeff Squyres
    {+} jsquyres@lam-mpi.org
    {+} http://www.lam-mpi.org/
    
