checkpoints on alvarez

jcduell_at_lbl_dot_gov
Date: Wed Mar 17 2004 - 17:46:23 PST

Next message: jcduell_at_lbl_dot_gov: "DOE Operating Systems work"

Previous message: jcduell_at_lbl_dot_gov: "Re: blcr"
Next in thread: Jeff Squyres: "Re: checkpoints on alvarez"
Reply: Jeff Squyres: "Re: checkpoints on alvarez"

Tom:

OK, I've sent an email to the LAM guys about the 'gm' rpi not
checkpointing successfully.

I'm not seeing any real problems checkpointing LAM on alvarez when the
'crtcp' rpi is used.  I get a scary message from the application when
it gets terminated, but it restarts fine--I don't need to
lamhalt/lamboot before restarting.

Also note point #3 below--you shouldn't use 'cr_checkpoint --kill':  use
'cr_checkpoint --term' instead.

Let me know if you have other problems.

Cheers,

-- 
Jason Duell             Future Technologies Group
<jcduell_at_lbl_dot_gov>       Computational Research Division
Tel: +1-510-495-2354    Lawrence Berkeley National Laboratory


------------------------------------------------------------------------
I'm trying to help out a user here at the Lab who's trying out our
stuff, and I'm seeing some errors.

The system has both the 'crtcp' and 'gm' rpi's, which should both be
checkpointable (right?).

1) When I run as

        mpirun -ssi rpi crtcp N ./pingpong

   and cr_checkpoint the parent mpirun with 

        cr_checkpoint --term <pid>

   I get a valid checkpoint which restarts just fine, but at checkpoint
   time the terminal I ran mpirun on gets the following scary warning:

       -----------------------------------------------------------------------------
       One of the processes started by mpirun has exited with a nonzero
       exit code.  This typically indicates that the process finished in
       error.  If your process did not finish in error, be sure to
       include a "return 0" or "exit(0)" in your C code before exiting
       the application.

       PID 16986 failed on node n0 (172.17.1.85) due to signal 15.
       -----------------------------------------------------------------------------

   Any idea why this happens?
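
For reference, signal 15 is SIGTERM on Linux, which is presumably what
'--term' delivers once the checkpoint has been taken. Here is a minimal
Python sketch (not part of LAM or BLCR; the sleeping child is only a
stand-in for an application process) of how a parent sees a child that
was ended by SIGTERM rather than by a normal exit:

```python
import signal
import subprocess
import sys


def run_and_terminate():
    """Start a stand-in child, end it with SIGTERM, return its exit status."""
    # Hypothetical stand-in for an MPI application process.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(signal.SIGTERM)
    child.wait()
    # subprocess reports "killed by signal N" as a return code of -N.
    return child.returncode


if __name__ == "__main__":
    rc = run_and_terminate()
    if rc < 0:
        print(f"child ended by signal {-rc} ({signal.Signals(-rc).name})")
    else:
        print(f"child exited normally with status {rc}")
```

A launcher that treats any death by signal as a failure would print a
warning in this situation even though nothing actually went wrong.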

2) When I use the gm rpi, I get the same scary message, but now I also
   can't restart the context file that's generated.  When I try
   cr_restart, I get the same error message that I get at checkpoint
   time:

        -----------------------------------------------------------------------------
        It seems that [at least] one of the processes that was started with
        mpirun did not invoke MPI_INIT before quitting (it is possible that
        more than one process did not invoke MPI_INIT -- mpirun was only
        notified of the first one, which was on node n0).

        mpirun can *only* be used with MPI programs (i.e., programs that
        invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
        to run non-MPI programs over the lambooted nodes.

   Am I supposed to be able to checkpoint with the gm rpi?

I'm using LAM 7.0.4.  Here's the laminfo output:
           LAM/MPI: 7.0.4
            Prefix: /usr/local/lam-7.0.4/
      Architecture: i686-pc-linux-gnu
     Configured by: root
     Configured on: Wed Mar 10 16:26:59 PST 2004
    Configure host: alvin01
        C bindings: yes
      C++ bindings: yes
  Fortran bindings: yes
       C profiling: yes
     C++ profiling: yes
 Fortran profiling: yes
     ROMIO support: yes
      IMPI support: no
     Debug support: no
      Purify clean: no
          SSI boot: globus (Module v0.5)
          SSI boot: rsh (Module v1.0)
          SSI boot: tm (Module v1.0)
          SSI coll: lam_basic (Module v7.0)
          SSI coll: smp (Module v1.0)
           SSI rpi: crtcp (Module v1.0.1)
           SSI rpi: gm (Module v1.0.1)
           SSI rpi: lamd (Module v7.0)
           SSI rpi: sysv (Module v7.0)
           SSI rpi: tcp (Module v7.0)
           SSI rpi: usysv (Module v7.0)
            SSI cr: blcr (Module v1.0.1)

3) It might be useful in your docs to note that calling

        cr_checkpoint --kill <mpirun_pid>

   is a bad idea, since the SIGKILL will wipe out the mpirun without it
   getting a chance to propagate the signal to the application
   processes.  The '--term' flag should be used instead.
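
The difference comes down to which signals a process is allowed to
handle. As a rough illustration in Python (a sketch of the general POSIX
behavior, not of mpirun's actual code; the helper name is made up for
this example), a handler can be installed for SIGTERM but never for
SIGKILL:

```python
import signal


def can_install_handler(signum):
    """Return True if this process may install a handler for signum.

    Must be called from the main thread, since Python only allows
    signal handlers to be set there.
    """
    try:
        previous = signal.signal(signum, lambda sig, frame: None)
    except (OSError, ValueError):
        # The kernel refuses handlers for SIGKILL (and SIGSTOP).
        return False
    # Put the earlier disposition back so the probe has no side effects.
    if previous is None:
        previous = signal.SIG_DFL
    signal.signal(signum, previous)
    return True


if __name__ == "__main__":
    for sig in (signal.SIGTERM, signal.SIGKILL):
        print(sig.name, "catchable:", can_install_handler(sig))
```

A parent that can catch SIGTERM has the chance to pass it on to its
children before exiting; with SIGKILL it is gone before any of its own
code runs.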

Thanks,

Jason

P.S.  How does one tell which rpi gets used by default, i.e. if no 
'-ssi rpi XXX' option is passed?
