restart program in cluster

From: Yuan Wan (ywan_at_ed.ac.uk)
Date: Fri Oct 19 2007 - 02:40:17 PDT

  • Next message: Paul H. Hargrove: "Re: restart program in cluster"
    Hi all,
    
    I noticed the BLCR User Guide says:
    
    "You may restart a program on a different machine than the one it was 
    checkpointed on if all of these conditions are met (they often are on 
    cluster systems, especially if you are using a shared network 
    filesystem), and the kernels are the same."
    
    I'm trying to set up this kind of cross-node restart on our Linux cluster:
    
    - Node: IBM x3550 - 2 x Intel 5160 Xeon dual core
    - O/S: Scientific Linux 4 (similar to RHEL4)
    - File System: GPFS
    - Compiler: GNU 3.4.5
    - BLCR version: 0.5.0 and 0.6.1
    
    I can restart a checkpointed program on the same node, but the restart 
    fails on another node. All work nodes use the same image and share the 
    file system.
    
    The error message is: "Segmentation fault"
    
    Does anyone know why the restart fails, and how to get cross-node restart 
    working on the cluster? Thanks.
    
    --Yuan
    
    
    Yuan Wan
    -- 
    Unix Section
    Information Services Infrastructure Division
    University of Edinburgh
    
    tel: 0131 650 4985
    email: ywan@ed.ac.uk
    
    2032 Computing Services, JCMB
    The King's Buildings,
    Edinburgh, EH9 3JZ
    
