questions about checkpoint/restart on multiple clusters of MPI

From: fengguang tian (fernyabc_at_gmail_dot_com)
Date: Mon Mar 22 2010 - 15:02:18 PDT

  • Next message: Paul H. Hargrove: "Re: questions about checkpoint/restart on multiple clusters of MPI"
    I set up an 18-node cluster using Open MPI and the BLCR library, and MPI
    programs run well on the cluster.
    But how do I checkpoint an MPI program on this cluster?
    
    Here is what I have done: I run the program with mpirun from the shared
    directory on the master node, then run the ompi-checkpoint command in
    another terminal on the master node. It creates a checkpoint file, but
    the MPI program is not terminated, as it is on a single machine. Also,
    ompi-restart does not work on the cluster.
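
For concreteness, the sequence described above looks roughly like the sketch below. The program name, process count, and PID are placeholders, and the `-am ft-enable-cr` option and the snapshot directory name are assumptions based on the Open MPI checkpoint/restart documentation, not an exact record of what was typed. The `run` helper is only there so the sketch prints the commands by default instead of executing them.

```shell
#!/bin/sh
# Sketch only: prints each command unless RUN=1 is set in the environment.
run() { if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$@"; fi; }

NP=18                 # number of processes (placeholder)
APP=./my_mpi_app      # hypothetical program in the shared directory
MPIRUN_PID=12345      # placeholder: PID of the running mpirun process

# Terminal 1 on the master node: launch with checkpoint/restart enabled
# (assumed launch option; requires Open MPI built with BLCR support).
run mpirun -am ft-enable-cr -np "$NP" "$APP"

# Terminal 2 on the master node: checkpoint the job through mpirun's PID.
run ompi-checkpoint "$MPIRUN_PID"

# Later: restart from the global snapshot (assumed default directory name).
run ompi-restart "ompi_global_snapshot_${MPIRUN_PID}.ckpt"
```

Running the script with `RUN=1` would execute the commands for real, which only makes sense on a node where Open MPI and BLCR are installed and the job is actually running under the given PID.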
    
    What should I do?
    
    Cheers!
    fengguang
    
