RE: OpenMPI + BLCR: Second time checkpoint hangs for MPI application.

Date: Thu Nov 26 2009 - 05:03:45 PST

  • Next message: Paul H. Hargrove: "Re: query"
    Hi Alan,
    I tested mpi_test_blcr.c, available at the link you mentioned.
    The same problem occurs, i.e., the second ompi-checkpoint hangs and the application fails with the same error as quoted below.
    -----Original Message-----
    From: alan_dot_woodland_at_gmail_dot_com [mailto:alan_dot_woodland_at_gmail_dot_com] On Behalf Of Alan Woodland
    Sent: Thursday, November 26, 2009 5:44 PM
    To: Rajasekaran Subramanian (WT01 - ENERGY & UTILITIES)
    Cc: checkpoint_at_lbl_dot_gov; Vivek Wandile (WT01 - PES-HPC Practice); Ananda Babu Mudar (WT01 - ENERGY & UTILITIES); Balwant Singh (WT01 - ENERGY & UTILITIES)
    Subject: Re: OpenMPI + BLCR: Second time checkpoint hangs for MPI application.
    2009/11/26  <rajasekaran.subramanian_at_wipro_dot_com>:
    > Hi,
    > Checkpoint and restart of an MPI application work successfully.
    > However, the next ompi-checkpoint command hangs for an MPI application that
    > was started using ompi-restart. I have installed BLCR 0.8.2 and OpenMPI
    > 1.3.3.
    > The steps followed:
    > 1. mpicc hello_c.c -o hello (simple hellompi.c program attached)
    > 2. mpirun -np 2 -am ft-enable-cr ./hello
    > 3. ompi-checkpoint -term pid_of_mpirun
    > 4. ompi-restart -am ft-enable-cr ompi_global_snapshot_8767.ckpt
    >    (checkpoint file created by step 3 above)
    > 5. ompi-checkpoint pid_of_new_mpirun (this step hangs)
    >    a. The terminal where the process was restarted in step 4 above
    >       throws the following error:
    >       "mpirun noticed that process rank 1 with PID 9810 on node hpc02 exited on
    >       signal 13 (Broken pipe)."
    I thought that for an OpenMPI checkpoint to occur, the process needed to
    hit an MPI call. Your while loop doesn't make any MPI calls, so the
    checkpoint blocks until the process hits one (which it never does). I
    don't quite understand why it seems to fail at step 5/4 and not at
    step 3, though...
    There's a simple test case I wrote for the BLCR-enabled OpenMPI builds
    in Debian, attached to this bug report:
