Re: checkpoint hangs when using in clusters

From: Josh Hursey (jjhursey_at_open-mpi.org)
Date: Mon Mar 29 2010 - 09:10:32 PDT

  • Next message: fengguang tian: "restart error in cluster"
    Just to follow up here for those interested in this thread: this
    discussion has moved to the Open MPI users list, since it is an Open
    MPI issue, not a BLCR-specific issue.
    
    -- Josh
    
    On Mar 23, 2010, at 11:30 AM, fengguang tian wrote:
    
    > Hi
    >
    > I am using Open MPI and BLCR in a cluster of 3 machines. Checkpoint and
    > restart work fine on a single machine, but when checkpointing in the
    > cluster environment, ompi-checkpoint hangs.
    >
    > For example, my cluster is composed of 3 machines using NFS with a
    > shared directory. On the master node I run:
    >   mpirun -np 50 -am ft-enable-cr --hostfile (hostfile) hello
    > and the program runs fine across the cluster. But when I use
    >   ompi-checkpoint --term $(pidof mpirun)
    > to checkpoint the program, the mpirun process is not killed; it keeps
    > running. Although ompi-checkpoint has created a checkpoint file, the
    > mpirun process hangs and is not terminated by ompi-checkpoint.
    > When I check the processes, mpirun is still there:
    > mpiu     31187  0.0  0.0  21636  4512 pts/3    S<s  10:45   0:00 -bash
    > mpiu     31688  0.0  0.0  65472  3888 pts/3    S<+  10:54   0:00  \_ mpirun -np
    > mpiu     29635  0.0  0.0  21636  4504 pts/1    S<s  09:08   0:00 -bash
    > mpiu     32188  0.0  0.0  15168  1064 pts/1    R<+  11:18   0:00  \_ ps auf
    >
    > and when I use ompi-restart to restart the program, it shows:
    > [nimbus:14545] Error: Unable to access the path [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_29.ckpt]!
    > --------------------------------------------------------------------------
    > Error: The filename (opal_snapshot_29.ckpt) is invalid because either you have not provided a filename
    >        or provided an invalid filename.
    >        Please see --help for usage.
    >
    > --------------------------------------------------------------------------
    > [nimbus:14609] Error: Unable to access the path [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_34.ckpt]!
    > --------------------------------------------------------------------------
    > Error: The filename (opal_snapshot_34.ckpt) is invalid because either you have not provided a filename
    >        or provided an invalid filename.
    >        Please see --help for usage.
    >
    > --------------------------------------------------------------------------
    > [nimbus:14685] Error: Unable to access the path [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_39.ckpt]!
    > --------------------------------------------------------------------------
    > Error: The filename (opal_snapshot_39.ckpt) is invalid because either you have not provided a filename
    >        or provided an invalid filename.
    >        Please see --help for usage.
    >
    > --------------------------------------------------------------------------
    > [nimbus:14737] Error: Unable to access the path [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_44.ckpt]!
    > --------------------------------------------------------------------------
    > Error: The filename (opal_snapshot_44.ckpt) is invalid because either you have not provided a filename
    >        or provided an invalid filename.
    >        Please see --help for usage.
    >
    > --------------------------------------------------------------------------
    > [nimbus:14798] Error: Unable to access the path [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_49.ckpt]!
    > --------------------------------------------------------------------------
    > Error: The filename (opal_snapshot_49.ckpt) is invalid because either you have not provided a filename
    >        or provided an invalid filename.
    >        Please see --help for usage.
    >
    > --------------------------------------------------------------------------
    > [nimbus:14317] Error: Unable to access the path [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_4.ckpt]!
    > --------------------------------------------------------------------------
    > Error: The filename (opal_snapshot_4.ckpt) is invalid because either you have not provided a filename
    >        or provided an invalid filename.
    >        Please see --help for usage.
    >
    > --------------------------------------------------------------------------
    > [nimbus:14331] Error: Unable to access the path [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_9.ckpt]!
    > --------------------------------------------------------------------------
    > Error: The filename (opal_snapshot_9.ckpt) is invalid because either you have not provided a filename
    >        or provided an invalid filename.
    >        Please see --help for usage.
    >
    > --------------------------------------------------------------------------
    > [nimbus:14381] Error: Unable to access the path [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_14.ckpt]!
    > --------------------------------------------------------------------------
    > Error: The filename (opal_snapshot_14.ckpt) is invalid because either you have not provided a filename
    >        or provided an invalid filename.
    >        Please see --help for usage.
    >
    > --------------------------------------------------------------------------
    > [nimbus:14408] Error: Unable to access the path [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_19.ckpt]!
    > --------------------------------------------------------------------------
    > Error: The filename (opal_snapshot_19.ckpt) is invalid because either you have not provided a filename
    >        or provided an invalid filename.
    >        Please see --help for usage.
    >
    > --------------------------------------------------------------------------
    > [nimbus:14483] Error: Unable to access the path [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_24.ckpt]!
    > --------------------------------------------------------------------------
    > Error: The filename (opal_snapshot_24.ckpt) is invalid because either you have not provided a filename
    >        or provided an invalid filename.
    >        Please see --help for usage.
    >
    > --------------------------------------------------------------------------
    > NO 26
    > Hello, world, I am 2 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 12 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 10 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 1 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 8 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 3 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 0 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 5 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 11 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 6 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 17 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 15 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 18 of 50 on nimbus
    >
    > NO 27
    > Hello, world, I am 2 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 13 of 50 on nimbus
    >
    > NO 27
    > Hello, world, I am 12 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 7 of 50 on nimbus
    >
    > NO 27
    > Hello, world, I am 10 of 50 on nimbus
    >
    > NO 27
    > Hello, world, I am 1 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 21 of 50 on nimbus
    >
    > NO 27
    > Hello, world, I am 8 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 22 of 50 on nimbus
    >
    > NO 27
    > Hello, world, I am 3 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 20 of 50 on nimbus
    >
    > NO 27
    > Hello, world, I am 0 of 50 on nimbus
    >
    > NO 27
    > Hello, world, I am 5 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 16 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 26 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 23 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 27 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 28 of 50 on nimbus
    >
    > NO 27
    > Hello, world, I am 11 of 50 on nimbus
    >
    > NO 27
    > Hello, world, I am 6 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 25 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 31 of 50 on nimbus
    >
    > NO 27
    > Hello, world, I am 17 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 30 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 43 of 50 on nimbus
    >
    > NO 27
    > Hello, world, I am 15 of 50 on nimbus
    >
    > NO 27
    > Hello, world, I am 18 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 33 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 32 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 47 of 50 on nimbus
    >
    > NO 28
    > Hello, world, I am 2 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 36 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 35 of 50 on nimbus
    >
    > NO 27
    > Hello, world, I am 13 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 40 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 38 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 37 of 50 on nimbus
    >
    > NO 28
    > Hello, world, I am 12 of 50 on nimbus
    >
    > NO 27
    > Hello, world, I am 7 of 50 on nimbus
    >
    > NO 28
    > Hello, world, I am 10 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 48 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 41 of 50 on nimbus
    >
    > NO 28
    > Hello, world, I am 1 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 45 of 50 on nimbus
    >
    > NO 27
    > Hello, world, I am 21 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 42 of 50 on nimbus
    >
    > NO 26
    > Hello, world, I am 46 of 50 on nimbus
    >
    > [nimbus:14312] [[63351,0],0]-[[63351,1],46] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
    > --------------------------------------------------------------------------
    > mpirun has exited due to process rank 4 with PID 14317 on
    > node nimbus exiting improperly. There are two reasons this could occur:
    >
    > 1. this process did not call "init" before exiting, but others in
    > the job did. This can cause a job to hang indefinitely while it waits
    > for all processes to call "init". By rule, if one process calls "init",
    > then ALL processes must call "init" prior to termination.
    >
    > 2. this process called "init", but exited without calling "finalize".
    > By rule, all processes that call "init" MUST call "finalize" prior to
    > exiting or it will be considered an "abnormal termination"
    >
    > This may have caused other processes in the application to be
    > terminated by signals sent by mpirun (as reported here).
    > --------------------------------------------------------------------------
    >
    > cheers
    > fengguang
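
For anyone triaging similar ompi-restart output: every "Unable to access the path" error quoted above names a per-rank opal_snapshot_N.ckpt directory under the same global snapshot directory. Below is a minimal Python sketch that pulls the affected ranks and paths out of such a log. It is not part of Open MPI or BLCR; the function name missing_snapshots and the regular expression are illustrative assumptions inferred only from the message format quoted in this thread, and it assumes the raw, unwrapped output (one error message per line, without the "> " quoting added by mail clients).

```python
import re

# Pattern inferred from the ompi-restart output quoted in this thread,
# not taken from the Open MPI sources (assumption):
#   [host:pid] Error: Unable to access the path [<dir>/opal_snapshot_<rank>.ckpt]!
_MISSING = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\] Error: Unable to access the path "
    r"\[(?P<path>[^\]]*opal_snapshot_(?P<rank>\d+)\.ckpt)\]"
)


def missing_snapshots(log_text):
    """Return sorted (rank, path) pairs for snapshots reported as unreachable.

    Hypothetical helper: expects unwrapped log text, one message per line.
    """
    found = {}
    for match in _MISSING.finditer(log_text):
        # Key by rank so a message repeated in the log is reported once.
        found[int(match.group("rank"))] = match.group("path")
    return sorted(found.items())


if __name__ == "__main__":
    sample = (
        "[nimbus:14545] Error: Unable to access the path "
        "[/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_29.ckpt]!\n"
        "[nimbus:14317] Error: Unable to access the path "
        "[/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_4.ckpt]!\n"
        "Hello, world, I am 2 of 50 on nimbus\n"
    )
    for rank, path in missing_snapshots(sample):
        print(rank, path)
```

Applied to the output quoted above, the ranks named in the errors are 4, 9, 14, 19, 24, 29, 34, 39, 44 and 49. Listing them this way makes it easier to check by hand on each node whether those directories are actually visible there.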
    
