jcduell_at_lbl_dot_gov
Date: Wed Mar 17 2004 - 17:46:23 PST
Tom:

OK, I've sent an email to the LAM guys about the 'gm' rpi not checkpointing successfully.

I'm not seeing any real problems checkpointing LAM on alvarez when the 'crtcp' rpi is used. I get a scary message from the application when it gets terminated, but it restarts fine--I don't need to lamhalt/lamboot before restarting.

Also note point #3 below--you shouldn't use 'cr_checkpoint --kill': use 'cr_checkpoint --term' instead.

Let me know if you have other problems.

Cheers,

--
Jason Duell                 Future Technologies Group
<jcduell_at_lbl_dot_gov>    Computational Research Division
Tel: +1-510-495-2354        Lawrence Berkeley National Laboratory

------------------------------------------------------------------------

I'm trying to help out a user here at the Lab who's trying out our stuff, and I'm seeing some errors. The system has both the 'crtcp' and 'gm' rpi's, which should both be checkpointable (right?).

1) When I run as

    mpirun -ssi rpi crtcp N ./pingpong

and cr_checkpoint the parent mpirun with

    cr_checkpoint --term <pid>

I get a valid checkpoint which restarts just fine, but at checkpoint time the terminal I ran mpirun on gets the following scary warning:

-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 16986 failed on node n0 (172.17.1.85) due to signal 15.
-----------------------------------------------------------------------------

Any idea why this happens?

2) When I use the gm ssi, I get the same scary message, but now I also can't restart the context file that's generated. When I try cr_restart, I get the same error message that I get at checkpoint time:

-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).

mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------

Am I supposed to be able to checkpoint with the gm ssi?

I'm using LAM 7.0.4. Here's the laminfo output:

    LAM/MPI: 7.0.4
    Prefix: /usr/local/lam-7.0.4/
    Architecture: i686-pc-linux-gnu
    Configured by: root
    Configured on: Wed Mar 10 16:26:59 PST 2004
    Configure host: alvin01
    C bindings: yes
    C++ bindings: yes
    Fortran bindings: yes
    C profiling: yes
    C++ profiling: yes
    Fortran profiling: yes
    ROMIO support: yes
    IMPI support: no
    Debug support: no
    Purify clean: no
    SSI boot: globus (Module v0.5)
    SSI boot: rsh (Module v1.0)
    SSI boot: tm (Module v1.0)
    SSI coll: lam_basic (Module v7.0)
    SSI coll: smp (Module v1.0)
    SSI rpi: crtcp (Module v1.0.1)
    SSI rpi: gm (Module v1.0.1)
    SSI rpi: lamd (Module v7.0)
    SSI rpi: sysv (Module v7.0)
    SSI rpi: tcp (Module v7.0)
    SSI rpi: usysv (Module v7.0)
    SSI cr: blcr (Module v1.0.1)

3) It might be useful in your docs to note that calling

    cr_checkpoint --kill <mpirun_pid>

is a bad idea, since the SIGKILL will wipe out the mpirun without it getting a chance to propagate the signal to the application processes. The '--term' flag should be used instead.

Thanks,

Jason

P.S. How does one tell which rpi gets used by default, i.e. if no '-ssi rpi XXX' option is passed?
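
For anyone scripting the checkpoint step, here is a rough sketch of a guard around point 3. The function name safe_checkpoint and the DRY_RUN variable are made up for illustration and are not part of BLCR or LAM; the only real command it invokes is 'cr_checkpoint --term <pid>' as used above. It refuses '--kill' and passes '--term' through.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of BLCR or LAM): refuse --kill on mpirun.
# With DRY_RUN=1 the command is printed instead of executed.

safe_checkpoint() {
    flag="$1"
    pid="$2"

    case "$flag" in
        --kill)
            # SIGKILL ends mpirun before it can signal the application processes.
            echo "refusing --kill on mpirun; use --term instead" >&2
            return 1
            ;;
        --term)
            ;;
        *)
            echo "usage: safe_checkpoint --term <mpirun_pid>" >&2
            return 2
            ;;
    esac

    if [ -z "$pid" ]; then
        echo "usage: safe_checkpoint --term <mpirun_pid>" >&2
        return 2
    fi

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Show what would be run without touching any process.
        echo "cr_checkpoint --term $pid"
    else
        cr_checkpoint --term "$pid"
    fi
}
```

With the function sourced into a shell, 'safe_checkpoint --term <mpirun_pid>' takes the checkpoint, and the resulting context file can then be handed to cr_restart as described in point 1.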