From: fengguang tian (fernyabc_at_gmail_dot_com)
Date: Mon Mar 22 2010 - 15:02:18 PDT
I set up a cluster of 18 nodes using Open MPI and the BLCR library, and MPI programs run well on the cluster. But how do I checkpoint an MPI program on this cluster? Here is what I have done: I run the program using mpirun in the shared directory on the master node, then run the ompi-checkpoint command in another terminal on the master node. This creates a checkpoint file, but the MPI program is not terminated the way it was on a single machine. In addition, ompi-restart does not work on the cluster. What should I do?
Cheers!
fengguang