Re: OpenMPI + BLCR: Second time checkpoint hangs for MPI application.

From: Alan Woodland (awoodland_at_debian_dot_org)
Date: Thu Nov 26 2009 - 04:13:43 PST

    2009/11/26  <rajasekaran.subramanian_at_wipro_dot_com>:
    > Hi,
    >
    > The checkpoint and restart of MPI application is working successfully.
    > However, the next ompi-checkpoint command hangs, for MPI application that is
    > started using ompi-restart. I have installed BLCR 0.8.2 and OpenMPI 1.3.3
    > version.
    >
    >
    >
    > The steps followed:
    >
    > 1. mpicc hello_c.c -o hello (Simple hellompi.c program attached)
    >
    > 2. mpirun -np 2 -am ft-enable-cr ./hello
    >
    > 3. ompi-checkpoint --term pid_of_mpirun
    >
    > 4. ompi-restart -am ft-enable-cr ompi_global_snapshot_8767.ckpt
    > (checkpoint file created by above Step #3)
    >
    > 5. ompi-checkpoint pid_of_new_mpirun (This step hangs)
    >
    > a. The terminal where the process was restarted in above Step #4
    > throws the following error:
    >
    > "mpirun noticed that process rank 1 with PID 9810 on node hpc02 exited on
    > signal 13 (Broken pipe)."
    >
    
    I thought that for Open MPI checkpointing to occur, the process needed
    to hit an MPI call. Your while loop doesn't make any MPI calls, so the
    checkpoint request blocks until the process reaches one (which it
    never does). I don't quite understand why it seems to be failing at
    step 4/5 and not at step 3, though...
    
    There's a simple test case I wrote for the BLCR enabled OpenMPI builds
    in Debian attached to this bug report:
    http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=545919
    
    Alan
    
