restart program in cluster

From: Yuan Wan (
Date: Fri Oct 19 2007 - 02:40:17 PDT

  • Next message: Paul H. Hargrove: "Re: restart program in cluster"
    Hi all,
    I noticed that the BLCR User Guide says:
    "You may restart a program on a different machine than the one it was 
    checkpointed on if all of these conditions are met (they often are on 
    cluster systems, especially if you are using a shared network 
    filesystem), and the kernels are the same."
    I'm trying to set up this kind of cross-node restart on our Linux cluster:
    - Node: IBM x3550 - 2 x Intel 5160 Xeon dual core
    - O/S: Scientific Linux 4 (similar to RHEL4)
    - File System: GPFS
    - Compiler: GNU 3.4.5
    - BLCR version: 0.5.0 and 0.6.1
    I can restart from a checkpoint file on the same node, but the restart 
    fails on another node. All worker nodes use the same image and a shared 
    file system.
    The error message is: "Segmentation fault"
    Does anyone know why my restart fails, and how to implement cross-node 
    restart on the cluster? Thanks
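
As a first sanity check on the "kernels are the same" condition quoted above, one could compare the kernel release string (the output of `uname -r`) reported by each node. The following is only a minimal sketch under stated assumptions: the function name `kernel_mismatches` and the node names are hypothetical, collecting the strings from remote nodes (for example over ssh) is left out, and a matching kernel release is necessary but may not be sufficient for a successful restart.

```python
import platform
from collections import Counter


def kernel_mismatches(releases):
    """Return the nodes whose kernel release differs from the most common one.

    `releases` maps a node name to its `uname -r` string. Both the node names
    and how the strings are gathered are assumptions of this sketch.
    """
    if not releases:
        return {}
    # Treat the most frequently reported release as the reference kernel.
    reference, _count = Counter(releases.values()).most_common(1)[0]
    return {node: rel for node, rel in releases.items() if rel != reference}


if __name__ == "__main__":
    # Only the local node is queried here; a real check would merge in the
    # release strings collected from every other node in the cluster.
    local = {platform.node(): platform.release()}
    print(kernel_mismatches(local))
```

An empty result means every listed node reports the same kernel release; any node that appears in the result is a candidate for closer inspection before attempting a restart there.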
    Yuan Wan
    Unix Section
    Information Services Infrastructure Division
    University of Edinburgh
    tel: 0131 650 4985
    email: [email protected]
    2032 Computing Services, JCMB
    The King's Buildings,
    Edinburgh, EH9 3JZ
