From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Oct 19 2007 - 08:33:15 PDT
Yuan Wan wrote:
>
> Hi all,
>
> I noticed the BLCR User Guide says:
>
> "You may restart a program on a different machine than the one it was
> checkpointed on if all of these conditions are met (they often are on
> cluster systems, especially if you are using a shared network
> filesystem), and the kernels are the same."
>
> I'm trying to implement this on our Linux cluster:
>
> - Node: IBM x3550 - 2 x Intel 5160 Xeon dual core
> - O/S: Scientific Linux 4 (similar to RHEL4)
> - File System: GPFS
> - Compiler: GNU 3.4.5
> - BLCR version: 0.5.0 and 0.6.1
>
> I can restart a checkpointed file on the same node, but the restart
> fails on another one. All worker nodes use the same image and a shared
> file system.
>
> The error message is: "Segmentation fault"
>
> Does anyone know why my restart fails and how to implement cross-node
> restart on the cluster? Thanks
>
> --Yuan
>
>
> Yuan Wan

The most likely problem is prelinking of shared libraries. Have a look at
http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink

-Paul

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900