restart program in cluster

From: Yuan Wan (ywan_at_ed.ac.uk)
Date: Fri Oct 19 2007 - 02:40:17 PDT

Hi all,

I noticed the BLCR User Guide says:

"You may restart a program on a different machine than the one it was 
checkpointed on if all of these conditions are met (they often are on 
cluster systems, especially if you are using a shared network 
filesystem), and the kernels are the same."

I'm trying to get this working on our Linux cluster:

- Node: IBM x3550 - 2 x Intel 5160 Xeon dual core
- O/S: Scientific Linux 4 (similar to RHEL4)
- File System: GPFS
- Compiler: GNU 3.4.5
- BLCR version: 0.5.0 and 0.6.1

I can restart from a checkpoint file on the same node, but the restart 
fails on a different node. All work nodes use the same image and a shared 
file system.

The error message is: "Segmentation fault"

Does anyone know why my restart fails, and how to get cross-node restart 
working on the cluster? Thanks.

--Yuan


Yuan Wan
-- 
Unix Section
Information Services Infrastructure Division
University of Edinburgh

tel: 0131 650 4985
email: [email protected]

2032 Computing Services, JCMB
The King's Buildings,
Edinburgh, EH9 3JZ
