rajasekaran.subramanian_at_wipro_dot_com
Date: Thu Nov 26 2009 - 03:57:29 PST
Hi,

Checkpoint and restart of an MPI application work successfully. However, the next ompi-checkpoint command hangs when the MPI application was started with ompi-restart.

I have installed BLCR 0.8.2 and OpenMPI 1.3.3.

The steps followed:

1. mpicc hello_c.c -o hello (simple hellompi.c program attached)
2. mpirun -np 2 -am ft-enable-cr ./hello
3. ompi-checkpoint -term pid_of_mpirun
4. ompi-restart -am ft-enable-cr ompi_global_snapshot_8767.ckpt (checkpoint file created by step 3)
5. ompi-checkpoint pid_of_new_mpirun (this step hangs)
   a. The terminal where the process was restarted in step 4 reports the following error:
      "mpirun noticed that process rank 1 with PID 9810 on node hpc02 exited on signal 13 (Broken pipe)."

The same behavior is seen even when the MPI application is executed across multiple nodes (using a hostfile with mpirun).

Could you please let me know the reason for this failure?

Thanks,
Rajasekaran S,
Technical Architect, High Performance Computing Group, Wipro Technologies.
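
For anyone trying to reproduce this, the steps above can be collected into a small dry-run script. This is only a sketch: the commands and flags are copied from the steps as reported, while DRY_RUN, MPIRUN_PID, NEW_MPIRUN_PID, SNAPSHOT and the run/repro_steps helpers are hypothetical names introduced here. By default it only prints the sequence; on a real system mpirun and ompi-restart stay in the foreground, so each step would be run from its own terminal with the actual PIDs filled in.

```shell
#!/bin/sh
# Dry-run sketch of the checkpoint/restart sequence described in the post.
# Assumptions: DRY_RUN, MPIRUN_PID, NEW_MPIRUN_PID and SNAPSHOT are
# hypothetical variables for this sketch, not part of Open MPI or BLCR.

DRY_RUN="${DRY_RUN:-1}"
SNAPSHOT="${SNAPSHOT:-ompi_global_snapshot_8767.ckpt}"  # name taken from the post

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

repro_steps() {
    # Step 1: build the test program.
    run mpicc hello_c.c -o hello
    # Step 2: start the job with checkpoint/restart support enabled.
    run mpirun -np 2 -am ft-enable-cr ./hello
    # Step 3: checkpoint and terminate the running job.
    run ompi-checkpoint -term "${MPIRUN_PID:-PID_OF_MPIRUN}"
    # Step 4: restart from the global snapshot.
    run ompi-restart -am ft-enable-cr "$SNAPSHOT"
    # Step 5: checkpoint the restarted job (the step reported to hang).
    run ompi-checkpoint "${NEW_MPIRUN_PID:-PID_OF_NEW_MPIRUN}"
}

repro_steps
```

Running it with MPIRUN_PID and NEW_MPIRUN_PID set prints the exact commands to paste into each terminal, which makes it easier to confirm that the hang appears at step 5 and not earlier.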