From: Jason Duell <jcduell_at_lbl_dot_gov>
Date: Mon Mar 22 2004 - 21:39:16 PST
On Mon, Mar 22, 2004 at 04:50:49PM -0800, Thomas Davis wrote:
>
> After figuring out the BLCR/LAM wanted to drop the context files into my
> home directory (which didn't have sufficient space), and not the current
> working directory.. it works.

I'm glad you got everything to work. For future reference, you can tell
LAM to put the checkpoints somewhere else either by setting the
'LAM_MPI_SSI_cr_base_dir' environment variable to the directory path, or
passing '-ssi cr_base_dir /path' to mpirun.

We'd be very glad to get any other feedback/suggestions you've got.
Thanks for the interest.

cheers,

--
Jason Duell               Future Technologies Group
<jcduell_at_lbl_dot_gov>  Computational Research Division
Tel: +1-510-495-2354      Lawrence Berkeley National Laboratory

> [tdavis@alvcn078 esp]$ mpirun -ssi rpi crtcp C ./pchksum -v -t 512
> Number of tasks: 128
> Initial digest
> 68ffffffa5ffffffbdffffffa0ffffffb969ffffffad6877ffffffec16ffffffd22dffffff9f4815033bffffffbcffffff82
> -----------------------------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 31294 failed on node n0 (172.17.1.78) due to signal 15.
> -----------------------------------------------------------------------------
> [tdavis@alvcn078 esp]$ mpirun -ssi rpi crtcp C ./pchksum -v -t 512
> Number of tasks: 128
> Initial digest
> 68ffffffa5ffffffbdffffffa0ffffffb969ffffffad6877ffffffec16ffffffd22dffffff9f4815033bffffffbcffffff82
> -----------------------------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 31316 failed on node n0 (172.17.1.78) due to signal 15.
> -----------------------------------------------------------------------------
> MPI_Wait: process in local group is dead (rank 111, MPI_COMM_WORLD)
> Rank (111, MPI_COMM_WORLD): Call stack within LAM:
> Rank (111, MPI_COMM_WORLD): - MPI_Wait()
> Rank (111, MPI_COMM_WORLD): - main()
> [tdavis@alvcn078 esp]$ /usr/local/bin/cr_restart ../tdavis/context.31315
> Forward merge complete
> Elapsed time: 476268.98 msecs
> Outbound data volume: 55.75 MB
> Iteration count: 157
> Reverse merge complete
> Elapsed time: 17461.67 msecs
> Final digest
> 68ffffffa5ffffffbdffffffa0ffffffb969ffffffad6877ffffffec16ffffffd22dffffff9f4815033bffffffbcffffff82
> Max stagger delta: 115015566 usecs
> Total elapsed time: 570.78 secs
> Status: OK
> [tdavis@alvcn078 esp]$
>
> --------------------------------- seperate terminal -------------------------------------------------
> [tdavis@alvcn078 tdavis]$ ps aux | grep mpi
> tdavis 31315 1.1 0.0 3640 692 ttyp0 S 16:33 0:00 mpirun -ssi rpi c
> tdavis 31318 0.0 0.0 3640 692 ttyp0 S 16:33 0:00 mpirun -ssi rpi c
> tdavis 31319 0.0 0.0 3640 692 ttyp0 S 16:33 0:00 mpirun -ssi rpi c
> tdavis 31326 0.0 0.0 1416 440 pts/0 S 16:33 0:00 grep mpi
> [tdavis@alvcn078 tdavis]$ time cr_checkpoint --term 31315
>
> real 0m16.987s
> user 0m0.000s
> sys 0m0.000s
> [tdavis@alvcn078 tdavis]$
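
As a rough illustration of the two ways of redirecting checkpoints mentioned
in the reply above, something like the following should work. This is only a
sketch: the CKPT_DIR variable and its default location are made up for the
example and should point at whatever filesystem has enough free space, and the
mpirun line is echoed rather than run, reusing the pchksum command from the
quoted session.

```shell
#!/bin/sh
# CKPT_DIR is a hypothetical name and location chosen for this example;
# point it at a filesystem with enough room for the context files.
CKPT_DIR="${CKPT_DIR:-${TMPDIR:-/tmp}/lam-checkpoints}"
mkdir -p "$CKPT_DIR"

# Option 1: set the environment variable before calling mpirun.
LAM_MPI_SSI_cr_base_dir="$CKPT_DIR"
export LAM_MPI_SSI_cr_base_dir

# Option 2: pass the same setting on the mpirun command line instead.
# Printed here rather than executed, so the sketch runs without LAM installed.
echo "mpirun -ssi rpi crtcp -ssi cr_base_dir $CKPT_DIR C ./pchksum -v -t 512"
```

Either option on its own should be enough; the environment variable is handy
in a shell startup file or batch script, while the mpirun argument keeps the
setting visible next to the job it applies to.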