RE: OpenMPI + BLCR: Second time checkpoint hangs for MPI application.

rajasekaran.subramanian_at_wipro_dot_com
Date: Thu Dec 03 2009 - 22:55:40 PST

    Thanks Eric.
    
    It works fine with OpenMPI version 1.3.4.
    
    Thanks,
    Raj
    
    -----Original Message-----
    From: Eric Roman [mailto:ERoman_at_lbl_dot_gov] 
    Sent: Friday, December 04, 2009 5:58 AM
    To: Rajasekaran Subramanian (WT01 - ENERGY & UTILITIES)
    Cc: awoodland_at_debian_dot_org; checkpoint_at_lbl_dot_gov; Vivek Wandile (WT01 - PES-HPC Practice); Ananda Babu Mudar (WT01 - ENERGY & UTILITIES); Balwant Singh (WT01 - ENERGY & UTILITIES)
    Subject: Re: OpenMPI + BLCR: Second time checkpoint hangs for MPI application.
    
    
    Raj,
    
    I tried the hello_c.c program and didn't reproduce this problem.  I ran the
    same commands you did.  Checkpointing a restarted process worked reliably.
    I went through ompi-checkpoint, kill, ompi-restart several times.
    
    I ran this test on BLCR 0.8.2 and OpenMPI 1.3.4.  The OpenMPI was configured
    as follows:
    
    ../configure \
        --enable-ft-thread \
        --with-ft=cr \
        --enable-mpi-threads \
        --with-blcr=/usr/local/pkg/blcr-0.8.2 \
        --prefix=/home/eroman/pkg/openmpi-1.3.4-blcr
    
    Hope that helps.
    
    Eric
    
    On Thu, Nov 26, 2009 at 06:33:45PM +0530, rajasekaran.subramanian_at_wipro_dot_com wrote:
    > Hi Alan,
    > 
    > I tested the mpi_test_blcr.c, available in the link mentioned by you. 
    > 
    > The same problem is seen, i.e., the second ompi-checkpoint hangs and the application fails with the same error as mentioned below.
    > 
    > Thanks,
    > Raj
    > 
    > -----Original Message-----
    > From: alan_dot_woodland_at_gmail_dot_com [mailto:alan_dot_woodland_at_gmail_dot_com] On Behalf Of Alan Woodland
    > Sent: Thursday, November 26, 2009 5:44 PM
    > To: Rajasekaran Subramanian (WT01 - ENERGY & UTILITIES)
    > Cc: checkpoint_at_lbl_dot_gov; Vivek Wandile (WT01 - PES-HPC Practice); Ananda Babu Mudar (WT01 - ENERGY & UTILITIES); Balwant Singh (WT01 - ENERGY & UTILITIES)
    > Subject: Re: OpenMPI + BLCR: Second time checkpoint hangs for MPI application.
    > 
    > 2009/11/26  <rajasekaran.subramanian_at_wipro_dot_com>:
    > > Hi,
    > >
    > > Checkpoint and restart of an MPI application works successfully.
    > > However, the next ompi-checkpoint command hangs when it is run against an MPI
    > > application that was started using ompi-restart. I have installed BLCR 0.8.2
    > > and OpenMPI 1.3.3.
    > >
    > >
    > >
    > > The steps followed:
    > >
    > > 1. mpicc hello_c.c -o hello (Simple hellompi.c program attached)
    > >
    > > 2. mpirun -np 2 -am ft-enable-cr ./hello
    > >
    > > 3. ompi-checkpoint -term pid_of_mpirun
    > >
    > > 4. ompi-restart -am ft-enable-cr ompi_global_snapshot_8767.ckpt
    > > (checkpoint File created by above Step #3)
    > >
    > > 5. ompi-checkpoint pid_of_new_mpirun (This step hangs)
    > >
    > > a. The terminal where the process was restarted in above Step # 4,
    > > throws the following error
    > >
    > > "mpirun noticed that process rank 1 with PID 9810 on node hpc02 exited on
    > > signal 13 (Broken pipe)."
    > >
    > 
    > I thought for openmpi checkpointing to occur the process needed to hit
    > an MPI call. Your while loop doesn't make any MPI calls, so MPI
    > checkpoint is blocking until it hits one of these (which it never
    > does). I don't quite understand why it seems to be failing at step 5/4
    > and not step 3 though...
    > 
    > There's a simple test case I wrote for the BLCR enabled OpenMPI builds
    > in Debian attached to this bug report:
    > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=545919
    > 
    > Alan
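
To illustrate Alan's point about the loop needing to reach an MPI call, here is a minimal sketch. It assumes (as Alan suggests, though the thread does not confirm it) that a checkpoint request is only serviced once the application enters the MPI library. It is not the attached hello_c.c, nor the test case from the Debian bug report; the file name hello_loop.c, the iteration count, and the choice of MPI_Barrier as the periodic call are illustrative assumptions.

```c
/* hello_loop.c (hypothetical name): a long-running loop that enters the
 * MPI library once per iteration instead of spinning without MPI calls. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* 600 iterations of roughly one second each: an arbitrary run length
     * chosen only to leave time to issue checkpoint commands by hand. */
    for (i = 0; i < 600; i++) {
        printf("rank %d of %d: iteration %d\n", rank, size, i);
        fflush(stdout);
        sleep(1);

        /* Periodic MPI call. Under the assumption stated above, this is
         * where a pending checkpoint request could be noticed. */
        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

It could be built and run the same way as in the steps quoted above, for example with "mpicc hello_loop.c -o hello_loop" followed by "mpirun -np 2 -am ft-enable-cr ./hello_loop".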
    > 
    > Please do not print this email unless it is absolutely necessary. 
    > 
    > The information contained in this electronic message and any attachments to this message are intended for the exclusive use of the addressee(s) and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately and destroy all copies of this message and any attachments. 
    > 
    > WARNING: Computer viruses can be transmitted via email. The recipient should check this email and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email. 
    > 
    > www.wipro.com
    
    
