From: Yuan Wan (ywan_at_ed.ac.uk)
Date: Fri Oct 19 2007 - 02:40:17 PDT
Hi all,

I noticed the BLCR User Guide says:

"You may restart a program on a different machine than the one it was checkpointed on if all of these conditions are met (they often are on cluster systems, especially if you are using a shared network filesystem), and the kernels are the same."

I'm trying to set this up on our Linux cluster:

- Node: IBM x3550
- 2 x Intel 5160 Xeon dual core
- O/S: Scientific Linux 4 (similar to RHEL4)
- File System: GPFS
- Compiler: GNU 3.4.5
- BLCR version: 0.5.0 and 0.6.1

I can restart from a checkpoint file on the same node, but the restart fails on a different node. All worker nodes use the same image and a shared file system. The error message is:

"Segmentation fault"

Does anyone know why my restart fails, and how to get cross-node restart working on the cluster?

Thanks,
--Yuan

Yuan Wan
--
Unix Section
Information Services Infrastructure Division
University of Edinburgh
tel: 0131 650 4985
email: [email protected]
2032 Computing Services, JCMB
The King's Buildings, Edinburgh, EH9 3JZ