Re: restart program in cluster

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Oct 19 2007 - 08:33:15 PDT

  • Next message: Hideyuki Jitsumoto: "Please advise me about restarting with BLCR"
    Yuan Wan wrote:
    > Hi all,
    > I noticed the BLCR User Guide says:
    > "You may restart a program on a different machine than the one it was 
    > checkpointed on if all of these conditions are met (they often are on 
    > cluster systems, especially if you are using a shared network 
    > filesystem), and the kernels are the same."
    > I'm trying to implement such a function on our Linux cluster:
    > - Node: IBM x3550 - 2 x Intel 5160 Xeon dual core
    > - O/S: Scientific Linux 4 (similar to RHEL4)
    > - File System: GPFS
    > - Compiler: GNU 3.4.5
    > - BLCR version: 0.5.0 and 0.6.1
    > I can restart a checkpointed file on the same node, but the restart fails 
    > on another one. All worker nodes use the same image and a shared file system.
    > The error message is: "Segmentation fault"
    > Does anyone know why my restart fails and how to implement cross-node 
    > restart on the cluster? Thanks
    > --Yuan
    > Yuan Wan
    The most likely problem is prelinking of shared libraries.
    Have a look at
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
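
One way to test the prelinking hypothesis is to check whether the shared libraries a program loads are byte-identical on both nodes. Prelink rewrites libraries in place, so if it runs independently on each node, the same library can end up with different contents (and load addresses) on different machines. The following is only a minimal sketch, not part of BLCR: the helper names (parse_ldd_output, file_digest, library_digests) are hypothetical, and it assumes that `ldd` and Python 3.7 or later are available on the nodes.

```python
import hashlib
import re
import subprocess
import sys


def parse_ldd_output(text):
    """Return the resolved library paths found in `ldd` output."""
    paths = []
    for line in text.splitlines():
        # Typical lines look like:
        #   libc.so.6 => /lib/libc.so.6 (0x00b2d000)
        #   /lib/ld-linux.so.2 (0x00b0f000)
        # Entries without a file path (virtual objects, "not found") are skipped.
        match = re.search(r"(/\S+)\s+\(0x[0-9a-fA-F]+\)", line)
        if match:
            paths.append(match.group(1))
    return paths


def file_digest(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def library_digests(binary):
    """Map each shared library that `binary` depends on to its digest."""
    result = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=True
    )
    return {path: file_digest(path) for path in parse_ldd_output(result.stdout)}


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python libcheck.py /path/to/binary")
    for lib_path, lib_digest in sorted(library_digests(sys.argv[1]).items()):
        print(lib_digest, lib_path)
```

Run the script against the same binary on the checkpoint node and on the restart node, then compare the two listings. Differing digests for the same library path would be consistent with per-node prelinking, though they do not prove it, since differing package versions would produce the same symptom.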
