From: Josh Hursey (jjhursey_at_open-mpi.org)
Date: Mon Mar 29 2010 - 09:10:32 PDT
Just to follow up here for those interested in this thread: this discussion has moved to the Open MPI users list, since it is an Open MPI issue, not a BLCR-specific issue.

-- Josh

On Mar 23, 2010, at 11:30 AM, fengguang tian wrote:

> Hi
>
> I am using open-mpi and blcr in a cluster of 3 machines. Checkpoint and
> restart work fine on a single machine, but when checkpointing in the
> cluster environment, ompi-checkpoint hangs.
>
> For example, my cluster is composed of 3 machines and uses NFS for a
> shared directory. On the master node, I run:
> mpirun -np 50 -am ft-enable-cr --hostfile (hostfile) hello
> and the program runs in the cluster; it works fine. But when I use
> ompi-checkpoint --term $(pidof mpirun) to checkpoint the program,
> the mpirun process is not killed; it is still running. Although
> ompi-checkpoint has created a checkpoint file, the mpirun process hangs
> and is not terminated by ompi-checkpoint.
> When I check the processes, mpirun is still there:
> mpiu 31187 0.0 0.0 21636 4512 pts/3 S<s 10:45 0:00 -bash
> mpiu 31688 0.0 0.0 65472 3888 pts/3 S<+ 10:54 0:00 \_ mpirun -np
> mpiu 29635 0.0 0.0 21636 4504 pts/1 S<s 09:08 0:00 -bash
> mpiu 32188 0.0 0.0 15168 1064 pts/1 R<+ 11:18 0:00 \_ ps auf
>
> And when I use ompi-restart to restart the program, it shows:
> [nimbus:14545] Error: Unable to access the path [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_29.ckpt]!
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_29.ckpt) is invalid because
> either you have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
>
> --------------------------------------------------------------------------
> [nimbus:14609] Error: Unable to access the path [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_34.ckpt]!
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_34.ckpt) is invalid because
> either you have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
>
> --------------------------------------------------------------------------
> [nimbus:14685] Error: Unable to access the path [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_39.ckpt]!
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_39.ckpt) is invalid because
> either you have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
>
> --------------------------------------------------------------------------
> [nimbus:14737] Error: Unable to access the path [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_44.ckpt]!
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_44.ckpt) is invalid because
> either you have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
>
> --------------------------------------------------------------------------
> [nimbus:14798] Error: Unable to access the path [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_49.ckpt]!
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_49.ckpt) is invalid because
> either you have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
>
> --------------------------------------------------------------------------
> [nimbus:14317] Error: Unable to access the path [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_4.ckpt]!
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_4.ckpt) is invalid because either
> you have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
>
> --------------------------------------------------------------------------
> [nimbus:14331] Error: Unable to access the path [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_9.ckpt]!
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_9.ckpt) is invalid because either
> you have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
>
> --------------------------------------------------------------------------
> [nimbus:14381] Error: Unable to access the path [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_14.ckpt]!
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_14.ckpt) is invalid because
> either you have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
>
> --------------------------------------------------------------------------
> [nimbus:14408] Error: Unable to access the path [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_19.ckpt]!
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_19.ckpt) is invalid because
> either you have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
>
> --------------------------------------------------------------------------
> [nimbus:14483] Error: Unable to access the path [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_24.ckpt]!
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_24.ckpt) is invalid because
> either you have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
>
> --------------------------------------------------------------------------
> NO 26
> Hello, world, I am 2 of 50 on nimbus
>
> NO 26
> Hello, world, I am 12 of 50 on nimbus
>
> NO 26
> Hello, world, I am 10 of 50 on nimbus
>
> NO 26
> Hello, world, I am 1 of 50 on nimbus
>
> NO 26
> Hello, world, I am 8 of 50 on nimbus
>
> NO 26
> Hello, world, I am 3 of 50 on nimbus
>
> NO 26
> Hello, world, I am 0 of 50 on nimbus
>
> NO 26
> Hello, world, I am 5 of 50 on nimbus
>
> NO 26
> Hello, world, I am 11 of 50 on nimbus
>
> NO 26
> Hello, world, I am 6 of 50 on nimbus
>
> NO 26
> Hello, world, I am 17 of 50 on nimbus
>
> NO 26
> Hello, world, I am 15 of 50 on nimbus
>
> NO 26
> Hello, world, I am 18 of 50 on nimbus
>
> NO 27
> Hello, world, I am 2 of 50 on nimbus
>
> NO 26
> Hello, world, I am 13 of 50 on nimbus
>
> NO 27
> Hello, world, I am 12 of 50 on nimbus
>
> NO 26
> Hello, world, I am 7 of 50 on nimbus
>
> NO 27
> Hello, world, I am 10 of 50 on nimbus
>
> NO 27
> Hello, world, I am 1 of 50 on nimbus
>
> NO 26
> Hello, world, I am 21 of 50 on nimbus
>
> NO 27
> Hello, world, I am 8 of 50 on nimbus
>
> NO 26
> Hello, world, I am 22 of 50 on nimbus
>
> NO 27
> Hello, world, I am 3 of 50 on nimbus
>
> NO 26
> Hello, world, I am 20 of 50 on nimbus
>
> NO 27
> Hello, world, I am 0 of 50 on nimbus
>
> NO 27
> Hello, world, I am 5 of 50 on nimbus
>
> NO 26
> Hello, world, I am 16 of 50 on nimbus
>
> NO 26
> Hello, world, I am 26 of 50 on nimbus
>
> NO 26
> Hello, world, I am 23 of 50 on nimbus
>
> NO 26
> Hello, world, I am 27 of 50 on nimbus
>
> NO 26
> Hello, world, I am 28 of 50 on nimbus
>
> NO 27
> Hello, world, I am 11 of 50 on nimbus
>
> NO 27
> Hello, world, I am 6 of 50 on nimbus
>
> NO 26
> Hello, world, I am 25 of 50 on nimbus
>
> NO 26
> Hello, world, I am 31 of 50 on nimbus
>
> NO 27
> Hello, world, I am 17 of 50 on nimbus
>
> NO 26
> Hello, world, I am 30 of 50 on nimbus
>
> NO 26
> Hello, world, I am 43 of 50 on nimbus
>
> NO 27
> Hello, world, I am 15 of 50 on nimbus
>
> NO 27
> Hello, world, I am 18 of 50 on nimbus
>
> NO 26
> Hello, world, I am 33 of 50 on nimbus
>
> NO 26
> Hello, world, I am 32 of 50 on nimbus
>
> NO 26
> Hello, world, I am 47 of 50 on nimbus
>
> NO 28
> Hello, world, I am 2 of 50 on nimbus
>
> NO 26
> Hello, world, I am 36 of 50 on nimbus
>
> NO 26
> Hello, world, I am 35 of 50 on nimbus
>
> NO 27
> Hello, world, I am 13 of 50 on nimbus
>
> NO 26
> Hello, world, I am 40 of 50 on nimbus
>
> NO 26
> Hello, world, I am 38 of 50 on nimbus
>
> NO 26
> Hello, world, I am 37 of 50 on nimbus
>
> NO 28
> Hello, world, I am 12 of 50 on nimbus
>
> NO 27
> Hello, world, I am 7 of 50 on nimbus
>
> NO 28
> Hello, world, I am 10 of 50 on nimbus
>
> NO 26
> Hello, world, I am 48 of 50 on nimbus
>
> NO 26
> Hello, world, I am 41 of 50 on nimbus
>
> NO 28
> Hello, world, I am 1 of 50 on nimbus
>
> NO 26
> Hello, world, I am 45 of 50 on nimbus
>
> NO 27
> Hello, world, I am 21 of 50 on nimbus
>
> NO 26
> Hello, world, I am 42 of 50 on nimbus
>
> NO 26
> Hello, world, I am 46 of 50 on nimbus
>
> [nimbus:14312] [[63351,0],0]-[[63351,1],46] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 4 with PID 14317 on
> node nimbus exiting improperly. There are two reasons this could
> occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls
> "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
> cheers
> fengguang
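
The restart errors quoted above all report a missing opal_snapshot_<rank>.ckpt entry under the global snapshot directory. As a rough diagnostic sketch (not part of Open MPI or BLCR), the Python below lists which ranks have no snapshot entry visible from the node where it is run. The function name missing_snapshots is hypothetical, and the layout <global dir>/<sequence number>/opal_snapshot_<rank>.ckpt is assumed purely from the paths in the error output.

```python
from pathlib import Path


def missing_snapshots(global_dir, nprocs, seq=0):
    """Return the ranks with no opal_snapshot_<rank>.ckpt entry.

    Assumption: the layout is <global_dir>/<seq>/opal_snapshot_<rank>.ckpt,
    as inferred from the paths in the ompi-restart error messages.
    """
    seq_dir = Path(global_dir) / str(seq)
    return [
        rank
        for rank in range(nprocs)
        if not (seq_dir / f"opal_snapshot_{rank}.ckpt").exists()
    ]
```

For the job in this thread, calling missing_snapshots("/home/mpiu/ompi_global_snapshot_14030.ckpt", 50) on each node would show whether the entries named in the errors are absent everywhere or only on the node running ompi-restart.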