checkpoint hangs when using in clusters

From: fengguang tian (fernyabc_at_gmail_dot_com)
Date: Tue Mar 23 2010 - 08:30:50 PDT

    Hi
    
    I am using Open MPI and BLCR in a cluster of 3 machines. Checkpoint and
    restart work fine on a single machine, but when I checkpoint in the
    cluster environment, ompi-checkpoint hangs.
    
    For example:
    My cluster is composed of 3 machines that share a directory over NFS.
    On the master node I run:
        mpirun -np 50 -am ft-enable-cr --hostfile (hostfile) hello
    The program runs across the cluster and works fine. But when I use
        ompi-checkpoint --term $(pidof mpirun)
    to checkpoint the program, the mpirun process is not killed; it keeps
    running. Although ompi-checkpoint has created a checkpoint file, the
    mpirun process hangs and is not terminated by ompi-checkpoint.
    When I check the processes, mpirun is still there:
    mpiu     31187  0.0  0.0  21636  4512 pts/3    S<s  10:45   0:00 -bash
    *mpiu     31688  0.0  0.0  65472  3888 pts/3    S<+  10:54   0:00  \_ mpirun -np*
    mpiu     29635  0.0  0.0  21636  4504 pts/1    S<s  09:08   0:00 -bash
    mpiu     32188  0.0  0.0  15168  1064 pts/1    R<+  11:18   0:00  \_ ps auf
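
The manual `ps` check above can also be scripted. What follows is only a minimal sketch: the helper names `pid_alive` and `wait_for_exit` and the timeout values are made up for illustration and are not part of Open MPI or BLCR. It polls a PID (for example the one returned by `pidof mpirun`) and reports whether the process exited within a time limit after `ompi-checkpoint --term` was issued.

```python
import os
import time


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    try:
        # Signal 0 delivers nothing; the call only checks that the PID exists.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def wait_for_exit(pid: int, timeout_s: float = 60.0, poll_s: float = 1.0) -> bool:
    """Poll until the process exits; return False if it is still alive at the timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return True
        time.sleep(poll_s)
    return not pid_alive(pid)
```

With the PID from the listing above, `wait_for_exit(31688)` would return `False` if mpirun is still hanging after the default 60 seconds.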
    
    When I use ompi-restart to restart the program, it shows:
    [nimbus:14545] Error: Unable to access the path
    [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_29.ckpt]!
    --------------------------------------------------------------------------
    Error: The filename (opal_snapshot_29.ckpt) is invalid because either you
    have not provided a filename
           or provided an invalid filename.
           Please see --help for usage.
    
    --------------------------------------------------------------------------
    [nimbus:14609] Error: Unable to access the path
    [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_34.ckpt]!
    --------------------------------------------------------------------------
    Error: The filename (opal_snapshot_34.ckpt) is invalid because either you
    have not provided a filename
           or provided an invalid filename.
           Please see --help for usage.
    
    --------------------------------------------------------------------------
    [nimbus:14685] Error: Unable to access the path
    [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_39.ckpt]!
    --------------------------------------------------------------------------
    Error: The filename (opal_snapshot_39.ckpt) is invalid because either you
    have not provided a filename
           or provided an invalid filename.
           Please see --help for usage.
    
    --------------------------------------------------------------------------
    [nimbus:14737] Error: Unable to access the path
    [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_44.ckpt]!
    --------------------------------------------------------------------------
    Error: The filename (opal_snapshot_44.ckpt) is invalid because either you
    have not provided a filename
           or provided an invalid filename.
           Please see --help for usage.
    
    --------------------------------------------------------------------------
    [nimbus:14798] Error: Unable to access the path
    [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_49.ckpt]!
    --------------------------------------------------------------------------
    Error: The filename (opal_snapshot_49.ckpt) is invalid because either you
    have not provided a filename
           or provided an invalid filename.
           Please see --help for usage.
    
    --------------------------------------------------------------------------
    [nimbus:14317] Error: Unable to access the path
    [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_4.ckpt]!
    --------------------------------------------------------------------------
    Error: The filename (opal_snapshot_4.ckpt) is invalid because either you
    have not provided a filename
           or provided an invalid filename.
           Please see --help for usage.
    
    --------------------------------------------------------------------------
    [nimbus:14331] Error: Unable to access the path
    [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_9.ckpt]!
    --------------------------------------------------------------------------
    Error: The filename (opal_snapshot_9.ckpt) is invalid because either you
    have not provided a filename
           or provided an invalid filename.
           Please see --help for usage.
    
    --------------------------------------------------------------------------
    [nimbus:14381] Error: Unable to access the path
    [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_14.ckpt]!
    --------------------------------------------------------------------------
    Error: The filename (opal_snapshot_14.ckpt) is invalid because either you
    have not provided a filename
           or provided an invalid filename.
           Please see --help for usage.
    
    --------------------------------------------------------------------------
    [nimbus:14408] Error: Unable to access the path
    [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_19.ckpt]!
    --------------------------------------------------------------------------
    Error: The filename (opal_snapshot_19.ckpt) is invalid because either you
    have not provided a filename
           or provided an invalid filename.
           Please see --help for usage.
    
    --------------------------------------------------------------------------
    [nimbus:14483] Error: Unable to access the path
    [/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_24.ckpt]!
    --------------------------------------------------------------------------
    Error: The filename (opal_snapshot_24.ckpt) is invalid because either you
    have not provided a filename
           or provided an invalid filename.
           Please see --help for usage.
    
    --------------------------------------------------------------------------
    NO 26
    Hello, world, I am 2 of 50 on nimbus
    
    NO 26
    Hello, world, I am 12 of 50 on nimbus
    
    NO 26
    Hello, world, I am 10 of 50 on nimbus
    
    NO 26
    Hello, world, I am 1 of 50 on nimbus
    
    NO 26
    Hello, world, I am 8 of 50 on nimbus
    
    NO 26
    Hello, world, I am 3 of 50 on nimbus
    
    NO 26
    Hello, world, I am 0 of 50 on nimbus
    
    NO 26
    Hello, world, I am 5 of 50 on nimbus
    
    NO 26
    Hello, world, I am 11 of 50 on nimbus
    
    NO 26
    Hello, world, I am 6 of 50 on nimbus
    
    NO 26
    Hello, world, I am 17 of 50 on nimbus
    
    NO 26
    Hello, world, I am 15 of 50 on nimbus
    
    NO 26
    Hello, world, I am 18 of 50 on nimbus
    
    NO 27
    Hello, world, I am 2 of 50 on nimbus
    
    NO 26
    Hello, world, I am 13 of 50 on nimbus
    
    NO 27
    Hello, world, I am 12 of 50 on nimbus
    
    NO 26
    Hello, world, I am 7 of 50 on nimbus
    
    NO 27
    Hello, world, I am 10 of 50 on nimbus
    
    NO 27
    Hello, world, I am 1 of 50 on nimbus
    
    NO 26
    Hello, world, I am 21 of 50 on nimbus
    
    NO 27
    Hello, world, I am 8 of 50 on nimbus
    
    NO 26
    Hello, world, I am 22 of 50 on nimbus
    
    NO 27
    Hello, world, I am 3 of 50 on nimbus
    
    NO 26
    Hello, world, I am 20 of 50 on nimbus
    
    NO 27
    Hello, world, I am 0 of 50 on nimbus
    
    NO 27
    Hello, world, I am 5 of 50 on nimbus
    
    NO 26
    Hello, world, I am 16 of 50 on nimbus
    
    NO 26
    Hello, world, I am 26 of 50 on nimbus
    
    NO 26
    Hello, world, I am 23 of 50 on nimbus
    
    NO 26
    Hello, world, I am 27 of 50 on nimbus
    
    NO 26
    Hello, world, I am 28 of 50 on nimbus
    
    NO 27
    Hello, world, I am 11 of 50 on nimbus
    
    NO 27
    Hello, world, I am 6 of 50 on nimbus
    
    NO 26
    Hello, world, I am 25 of 50 on nimbus
    
    NO 26
    Hello, world, I am 31 of 50 on nimbus
    
    NO 27
    Hello, world, I am 17 of 50 on nimbus
    
    NO 26
    Hello, world, I am 30 of 50 on nimbus
    
    NO 26
    Hello, world, I am 43 of 50 on nimbus
    
    NO 27
    Hello, world, I am 15 of 50 on nimbus
    
    NO 27
    Hello, world, I am 18 of 50 on nimbus
    
    NO 26
    Hello, world, I am 33 of 50 on nimbus
    
    NO 26
    Hello, world, I am 32 of 50 on nimbus
    
    NO 26
    Hello, world, I am 47 of 50 on nimbus
    
    NO 28
    Hello, world, I am 2 of 50 on nimbus
    
    NO 26
    Hello, world, I am 36 of 50 on nimbus
    
    NO 26
    Hello, world, I am 35 of 50 on nimbus
    
    NO 27
    Hello, world, I am 13 of 50 on nimbus
    
    NO 26
    Hello, world, I am 40 of 50 on nimbus
    
    NO 26
    Hello, world, I am 38 of 50 on nimbus
    
    NO 26
    Hello, world, I am 37 of 50 on nimbus
    
    NO 28
    Hello, world, I am 12 of 50 on nimbus
    
    NO 27
    Hello, world, I am 7 of 50 on nimbus
    
    NO 28
    Hello, world, I am 10 of 50 on nimbus
    
    NO 26
    Hello, world, I am 48 of 50 on nimbus
    
    NO 26
    Hello, world, I am 41 of 50 on nimbus
    
    NO 28
    Hello, world, I am 1 of 50 on nimbus
    
    NO 26
    Hello, world, I am 45 of 50 on nimbus
    
    NO 27
    Hello, world, I am 21 of 50 on nimbus
    
    NO 26
    Hello, world, I am 42 of 50 on nimbus
    
    NO 26
    Hello, world, I am 46 of 50 on nimbus
    
    [nimbus:14312] [[63351,0],0]-[[63351,1],46] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
    --------------------------------------------------------------------------
    mpirun has exited due to process rank 4 with PID 14317 on
    node nimbus exiting improperly. There are two reasons this could occur:
    
    1. this process did not call "init" before exiting, but others in
    the job did. This can cause a job to hang indefinitely while it waits
    for all processes to call "init". By rule, if one process calls "init",
    then ALL processes must call "init" prior to termination.
    
    2. this process called "init", but exited without calling "finalize".
    By rule, all processes that call "init" MUST call "finalize" prior to
    exiting or it will be considered an "abnormal termination"
    
    This may have caused other processes in the application to be
    terminated by signals sent by mpirun (as reported here).
    --------------------------------------------------------------------------
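
The "Unable to access the path" errors above all name entries of the form `<global snapshot>/0/opal_snapshot_<rank>.ckpt`, and the ranks reported are 4, 9, 14, ..., 49. A quick way to see which per-rank snapshots are visible from the node running ompi-restart is to walk that layout directly. The sketch below is only an illustration: the function name `missing_local_snapshots` is hypothetical, and treating the `0` path component as a checkpoint sequence number is an assumption based on the paths in the errors.

```python
import os


def missing_local_snapshots(global_snapshot_dir, num_ranks, seq=0):
    """Return the ranks whose opal_snapshot_<rank>.ckpt entry is not accessible.

    The layout <global_snapshot_dir>/<seq>/opal_snapshot_<rank>.ckpt is taken
    from the error messages; `seq` is assumed to be the checkpoint sequence
    number.
    """
    seq_dir = os.path.join(global_snapshot_dir, str(seq))
    missing = []
    for rank in range(num_ranks):
        path = os.path.join(seq_dir, f"opal_snapshot_{rank}.ckpt")
        if not os.path.exists(path):
            missing.append(rank)
    return missing


if __name__ == "__main__":
    # Path and rank count taken from the run reported above.
    snapshot = "/home/mpiu/ompi_global_snapshot_14030.ckpt"
    print("ranks without an accessible snapshot:",
          missing_local_snapshots(snapshot, 50))
```

Running the same check on each of the 3 machines would show whether the missing entries are absent everywhere or only from the node where the restart is attempted.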
    
    cheers
    fengguang
    
