From: Lip Kian (lkng_at_eblackprint_dot_com)
Date: Mon Dec 20 2004 - 06:07:36 PST
Hi Paul,

Thanks for the reply. Actually, the document on the integration of BLCR with Grid Engine has already been completed and has been available on the open-source Grid Engine website for several months (http://gridengine.sunsource.net/project/gridengine/howto/APSTC-TB-2004-005.pdf), but I never expected people to actually bother with it until I received an email from a Grid Engine user asking for support with the installation. Let me know your opinions on the document and I'll gladly modify it and re-submit it to Grid Engine. (I know BLCR limitation #4 is incorrect, but I was told so by my colleague who did the initial tests on LAM-MPI.)

> 2) You are correct that process group and/or session support is what you
> will require for checkpointing of scripts. The small amount of funding
> we have for this project makes it hard for me to set very accurate dates
> for milestones. However, our current work is on a Linux 2.6 port of
> BLCR and an Opteron port, to be followed by work on process groups and
> sessions. So, I would expect work on process groups and sessions to
> start in Feb or Mar and would hope to see it available for testing in
> Jun or Jul. If you are interested in helping to test process group and
> session support before a public release is available, please let us know.

I would like to see this feature implemented, as it will enable a tighter integration of BLCR with Grid Engine. I wouldn't mind helping to test, but I'm not a great tester, so some guidance would be needed. However, I could modify the integration scripts during this period.

Regards,
Lip Kian

> -----Original Message-----
> From: Paul H. Hargrove [mailto:PHHargrove_at_lbl_dot_gov]
> Sent: Saturday, December 18, 2004 2:02 AM
> To: Lip Kian
> Cc: checkpoint_at_lbl_dot_gov
> Subject: Re: checkpoint of scripts and error codes
>
> Lip Kian,
>
> I am pleased to hear that somebody is working to integrate BLCR w/ Grid
> Engine. We'd be interested in placing a link to your document on the
> BLCR web pages when it is ready.
>
> As for your questions:
>
> 1) Nearly all of the exit values from cr_checkpoint are the errno value
> from some failing library function or system call, plus a few that are
> BLCR-specific failures. In /usr/include/asm/errno.h the value 100
> appears to be ENETDOWN, which I cannot account for as an exit code from
> cr_checkpoint. If you need help determining the reason for a
> checkpointing failure, please let us know.
>
> 2) You are correct that process group and/or session support is what you
> will require for checkpointing of scripts. The small amount of funding
> we have for this project makes it hard for me to set very accurate dates
> for milestones. However, our current work is on a Linux 2.6 port of
> BLCR and an Opteron port, to be followed by work on process groups and
> sessions. So, I would expect work on process groups and sessions to
> start in Feb or Mar and would hope to see it available for testing in
> Jun or Jul. If you are interested in helping to test process group and
> session support before a public release is available, please let us know.
>
> -Paul
>
> Lip Kian wrote:
>
> >Hi!
> >
> >I've been experimenting with BLCR since version 0.2.1 and have written a
> >document on integrating BLCR with N1 Grid Engine (formerly Sun Grid Engine)
> >which can be found at gridengine.sunsource.net.
> >
> >I have 2 questions regarding BLCR.
> >
> >1. Where can I find a listing of possible exit values and description for
> >cr_restart? Specifically, what does an exit value of 100 mean?
> >
> >2. Seems that if my application is embedded within a script, checkpointing
> >the pid of the script will not checkpoint the embedded application. I
> >believe this is because checkpointing of process groups has not been
> >implemented? Any timeline as to when this will be done?
> >
> >Thanks.
> >
> >Regards,
> >
> >Lip Kian
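
On point 1 of Paul's reply (exit values that are mostly errno values): a minimal sketch, in Python, of how one might look up the errno name for a given exit code. The helper name describe_exit_code is hypothetical and is not part of BLCR. It only consults the local platform's errno table, so it cannot recognize the BLCR-specific failure codes, and errno numbering can differ between platforms.

```python
import errno
import os


def describe_exit_code(code: int) -> str:
    """Interpret an exit code as an errno value, if this platform defines one.

    Hypothetical helper for illustration; not part of BLCR.
    """
    if code == 0:
        return "0: success"
    # errno.errorcode maps numeric errno values to their symbolic names.
    name = errno.errorcode.get(code)
    if name is None:
        # Not an errno here; it may be one of the BLCR-specific codes.
        return f"{code}: not a known errno value on this platform"
    return f"{code}: {name} ({os.strerror(code)})"


if __name__ == "__main__":
    # 100 is the exit value asked about in the thread.
    print(describe_exit_code(100))
```

A lookup like this only names the error; as Paul notes, a name such as ENETDOWN may still not explain why the checkpoint failed.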