checkpoints on alvarez

jcduell_at_lbl_dot_gov
Date: Wed Mar 17 2004 - 17:46:23 PST

  • Next message: jcduell_at_lbl_dot_gov: "DOE Operating Systems work"
    Tom:
    
    OK, I've sent an email to the LAM guys about the 'gm' rpi not
    checkpointing successfully.
    
    I'm not seeing any real problems checkpointing LAM on alvarez when the
    'crtcp' rpi is used.  I get a scary message from the application when
    it gets terminated, but it restarts fine--I don't need to
    lamhalt/lamboot before restarting.
    
    Also note point #3 below--you shouldn't use 'cr_checkpoint --kill':  use
    'cr_checkpoint --term' instead.
    
    Let me know if you have other problems.
    
    Cheers,
    
    -- 
    Jason Duell             Future Technologies Group
    <jcduell_at_lbl_dot_gov>       Computational Research Division
    Tel: +1-510-495-2354    Lawrence Berkeley National Laboratory
    
    
    ------------------------------------------------------------------------
    I'm trying to help out a user here at the Lab who's trying out our
    stuff, and I'm seeing some errors.
    
    The system has both the 'crtcp' and 'gm' rpi's, which should both be
    checkpointable (right?).
    
    1) When I run as
    
            mpirun -ssi rpi crtcp N ./pingpong
    
       and cr_checkpoint the parent mpirun with 
    
            cr_checkpoint --term <pid>
    
       I get a valid checkpoint which restarts just fine, but at checkpoint
       time the terminal I ran mpirun on gets the following scary warning:
    
           -----------------------------------------------------------------------------
           One of the processes started by mpirun has exited with a nonzero
           exit code.  This typically indicates that the process finished in
           error.  If your process did not finish in error, be sure to
           include a "return 0" or "exit(0)" in your C code before exiting
           the application.
    
           PID 16986 failed on node n0 (172.17.1.85) due to signal 15.
           -----------------------------------------------------------------------------
    
       Any idea why this happens?
    
    2) When I use the gm rpi, I get the same scary message, but now I also
       can't restart the context file that's generated.  When I try
       cr_restart, I get the same error message that I get at checkpoint
       time:
    
            -----------------------------------------------------------------------------
            It seems that [at least] one of the processes that was started with
            mpirun did not invoke MPI_INIT before quitting (it is possible that
            more than one process did not invoke MPI_INIT -- mpirun was only
            notified of the first one, which was on node n0).
    
            mpirun can *only* be used with MPI programs (i.e., programs that
            invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
            to run non-MPI programs over the lambooted nodes.
    
       Am I supposed to be able to checkpoint with the gm rpi?
    
    I'm using LAM 7.0.4.  Here's the laminfo output:
               LAM/MPI: 7.0.4
                Prefix: /usr/local/lam-7.0.4/
          Architecture: i686-pc-linux-gnu
         Configured by: root
         Configured on: Wed Mar 10 16:26:59 PST 2004
        Configure host: alvin01
            C bindings: yes
          C++ bindings: yes
      Fortran bindings: yes
           C profiling: yes
         C++ profiling: yes
     Fortran profiling: yes
         ROMIO support: yes
          IMPI support: no
         Debug support: no
          Purify clean: no
              SSI boot: globus (Module v0.5)
              SSI boot: rsh (Module v1.0)
              SSI boot: tm (Module v1.0)
              SSI coll: lam_basic (Module v7.0)
              SSI coll: smp (Module v1.0)
               SSI rpi: crtcp (Module v1.0.1)
               SSI rpi: gm (Module v1.0.1)
               SSI rpi: lamd (Module v7.0)
               SSI rpi: sysv (Module v7.0)
               SSI rpi: tcp (Module v7.0)
               SSI rpi: usysv (Module v7.0)
                SSI cr: blcr (Module v1.0.1)
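
    For anyone wanting to check which rpi modules an install offers, here
    is a rough sketch that filters laminfo-style output down to the rpi
    module names.  The function name 'list_rpi_modules' is made up for
    illustration, and it assumes the "SSI rpi: <name> (Module ...)" line
    layout shown above.  It only lists what is available; it does not say
    which rpi is picked by default.

```shell
#!/bin/sh
# Hypothetical helper (not part of LAM): read laminfo-style output on
# stdin and print the name of each rpi module, one per line.
# Assumes lines of the form "SSI rpi: <name> (Module vX.Y)".
list_rpi_modules() {
    awk '$1 == "SSI" && $2 == "rpi:" { print $3 }'
}

# Intended use (assumes laminfo is on PATH):
#   laminfo | list_rpi_modules
```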
    
    3) It might be useful in your docs to note that calling
    
            cr_checkpoint --kill <mpirun_pid>
    
       is a bad idea, since the SIGKILL will wipe out the mpirun without it
       getting a chance to propagate the signal to the application
       processes.  The '--term' flag should be used instead.
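
    To make the point concrete, here is a minimal sketch of a wrapper that
    always checkpoints with '--term' and never '--kill'.  The function
    name 'checkpoint_mpirun' and the DRY_RUN variable are hypothetical,
    added only for illustration; the only real command it runs is the
    'cr_checkpoint --term <pid>' invocation from point 1.

```shell
#!/bin/sh
# Hypothetical wrapper: checkpoint an mpirun by PID using --term, so
# mpirun receives SIGTERM and can propagate it to the application
# processes (SIGKILL from --kill would not give it that chance).
# With DRY_RUN=1 the command is printed instead of executed.
checkpoint_mpirun() {
    pid="$1"
    # Reject an empty or non-numeric PID before touching anything.
    case "$pid" in
        ''|*[!0-9]*)
            echo "usage: checkpoint_mpirun <mpirun_pid>" >&2
            return 2
            ;;
    esac
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "cr_checkpoint --term $pid"
    else
        cr_checkpoint --term "$pid"
    fi
}
```

    Running it as 'DRY_RUN=1 checkpoint_mpirun <mpirun_pid>' only prints
    the command, which is handy for checking the PID first.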
    
    Thanks,
    
    Jason
    
    P.S.  How does one tell which rpi gets used by default, i.e. if no 
    '-ssi rpi XXX' option is passed?
    
