Re: 128way esp/pchksum checkpoint/restart on alvarez using BLCR/LAM

jcduell_at_lbl_dot_gov
Date: Mon Mar 22 2004 - 21:39:16 PST

    On Mon, Mar 22, 2004 at 04:50:49PM -0800, Thomas Davis wrote:
    > 
    > After figuring out that BLCR/LAM wanted to drop the context files into my 
    > home directory (which didn't have sufficient space), and not the current 
    > working directory, it works.
    
    I'm glad you got everything to work.  For future reference, you can tell
    LAM to put the checkpoints somewhere else either by setting the 
    'LAM_MPI_SSI_cr_base_dir' environment variable to the directory path, or
    passing '-ssi cr_base_dir /path' to mpirun. 
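
As a rough sketch of the two options above: the checkpoint directory
below is a made-up example path, and the pchksum arguments are simply
copied from your run. The script only prints the commands rather than
running mpirun, so treat it as an illustration and not a tested recipe.

```shell
#!/bin/sh
# Hypothetical checkpoint directory; substitute any path with enough free space.
CKPT_DIR="${CKPT_DIR:-${TMPDIR:-/tmp}/lam-ckpt}"
mkdir -p "$CKPT_DIR"

# Option 1: set the SSI parameter through the environment before calling mpirun.
export LAM_MPI_SSI_cr_base_dir="$CKPT_DIR"
CMD_ENV="mpirun -ssi rpi crtcp C ./pchksum -v -t 512"

# Option 2: pass the same parameter on the mpirun command line.
CMD_FLAG="mpirun -ssi rpi crtcp -ssi cr_base_dir $CKPT_DIR C ./pchksum -v -t 512"

# Print the commands instead of executing them, so this is safe to run anywhere.
echo "env:  LAM_MPI_SSI_cr_base_dir=$LAM_MPI_SSI_cr_base_dir $CMD_ENV"
echo "flag: $CMD_FLAG"
```

Either form should be enough on its own; there is no need to use both
at once.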
    
    We'd be very glad to get any other feedback/suggestions you've got.
    Thanks for the interest.
    
    cheers,
     
    -- 
    Jason Duell             Future Technologies Group
    <jcduell_at_lbl_dot_gov>       Computational Research Division
    Tel: +1-510-495-2354    Lawrence Berkeley National Laboratory
    
    
    > [tdavis@alvcn078 esp]$ mpirun -ssi rpi crtcp C ./pchksum -v -t 512
    > Number of tasks: 128
    > Initial digest
    > 68ffffffa5ffffffbdffffffa0ffffffb969ffffffad6877ffffffec16ffffffd22dffffff9f4815033bffffffbcffffff82
    > -----------------------------------------------------------------------------
    > One of the processes started by mpirun has exited with a nonzero exit
    > code.  This typically indicates that the process finished in error.
    > If your process did not finish in error, be sure to include a "return
    > 0" or "exit(0)" in your C code before exiting the application.
    > 
    > PID 31294 failed on node n0 (172.17.1.78) due to signal 15.
    > -----------------------------------------------------------------------------
    > [tdavis@alvcn078 esp]$ mpirun -ssi rpi crtcp C ./pchksum -v -t 512
    > Number of tasks: 128
    > Initial digest
    > 68ffffffa5ffffffbdffffffa0ffffffb969ffffffad6877ffffffec16ffffffd22dffffff9f4815033bffffffbcffffff82
    > -----------------------------------------------------------------------------
    > One of the processes started by mpirun has exited with a nonzero exit
    > code.  This typically indicates that the process finished in error.
    > If your process did not finish in error, be sure to include a "return
    > 0" or "exit(0)" in your C code before exiting the application.
    > 
    > PID 31316 failed on node n0 (172.17.1.78) due to signal 15.
    > -----------------------------------------------------------------------------
    > MPI_Wait: process in local group is dead (rank 111, MPI_COMM_WORLD)
    > Rank (111, MPI_COMM_WORLD): Call stack within LAM:
    > Rank (111, MPI_COMM_WORLD):  - MPI_Wait()
    > Rank (111, MPI_COMM_WORLD):  - main()
    > [tdavis@alvcn078 esp]$ /usr/local/bin/cr_restart ../tdavis/context.31315
    > Forward merge complete
    > Elapsed time: 476268.98 msecs
    > Outbound data volume: 55.75 MB
    > Iteration count: 157
    > Reverse merge complete
    > Elapsed time: 17461.67 msecs
    > Final digest
    > 68ffffffa5ffffffbdffffffa0ffffffb969ffffffad6877ffffffec16ffffffd22dffffff9f4815033bffffffbcffffff82
    > Max stagger delta: 115015566 usecs
    > Total elapsed time: 570.78 secs
    > Status: OK
    > [tdavis@alvcn078 esp]$
    > 
    > --------------------------------- separate terminal -------------------------------------------------
    > [tdavis@alvcn078 tdavis]$ ps aux | grep mpi
    > tdavis   31315  1.1  0.0  3640  692 ttyp0    S    16:33   0:00 mpirun -ssi rpi c
    > tdavis   31318  0.0  0.0  3640  692 ttyp0    S    16:33   0:00 mpirun -ssi rpi c
    > tdavis   31319  0.0  0.0  3640  692 ttyp0    S    16:33   0:00 mpirun -ssi rpi c
    > tdavis   31326  0.0  0.0  1416  440 pts/0    S    16:33   0:00 grep mpi
    > [tdavis@alvcn078 tdavis]$ time cr_checkpoint --term 31315
    > 
    > real    0m16.987s
    > user    0m0.000s
    > sys     0m0.000s
    > [tdavis@alvcn078 tdavis]$ 
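
For reference, the checkpoint and restart steps in the transcript above
pair up roughly as follows. This is only a sketch: it assumes the
"context.<pid>" file naming seen in the transcript, the helper function
name is made up for illustration, and the commands are printed rather
than executed.

```shell
#!/bin/sh
# Build the path of the context file written for a given PID.
# Assumes the "context.<pid>" naming seen in the transcript.
context_path() {
    # $1 = checkpoint directory, $2 = PID of the checkpointed mpirun
    printf '%s/context.%s\n' "$1" "$2"
}

MPIRUN_PID=31315        # PID passed to cr_checkpoint in the transcript
CKPT_DIR="../tdavis"    # directory where the context file landed
CTX=$(context_path "$CKPT_DIR" "$MPIRUN_PID")

# Print the two steps instead of running them.
echo "cr_checkpoint --term $MPIRUN_PID"
echo "cr_restart $CTX"
```

The PID given to cr_checkpoint is that of mpirun itself, and the same
number then identifies the context file handed to cr_restart.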
    
