Re: checkpoint of scripts and error codes

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Dec 17 2004 - 10:01:50 PST

  • Next message: jcduell_at_lbl_dot_gov: "Re: blcr for mpi/myrinet jobs?"
    Lip Kian,
      I am pleased to hear that somebody is working to integrate BLCR w/ Grid 
    Engine.  We'd be interested in placing a link to your document on the 
    BLCR web pages when it is ready.
      As for your questions:
    1)  Nearly all of the exit values from cr_checkpoint are the errno value 
    from some failing library function or system call, plus a few that are 
    BLCR-specific failures.  In /usr/include/asm/errno.h the value 100 
    appears to be ENETDOWN, which I cannot account for as an exit code from 
    cr_checkpoint.  If you need help determining the reason for a 
    checkpointing failure, please let us know.
    2) You are correct that process group and/or session support is what you 
    will require for checkpointing of scripts.  The small amount of funding 
    we have for this project makes it hard for me to set very accurate dates 
    for milestones.  However, our current work is on a Linux 2.6 port of 
    BLCR and an Opteron port, to be followed by work on process groups and 
    sessions.  So, I would expect work on process groups and sessions to 
    start in Feb or Mar and would hope to see it available for testing in 
    Jun or Jul.  If you are interested in helping to test process group and 
    session support before a public release is available, please let us know.
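
    As a rough illustration of point 1, an exit value can be looked up as
    an errno number.  This is only a sketch: the helper name
    describe_exit_code is made up for this example and is not part of BLCR,
    errno numbering is platform dependent, and the few BLCR-specific
    failure values will not be decoded this way.

```python
import errno
import os


def describe_exit_code(code):
    """Interpret an exit value as an errno number, if this platform defines one.

    Hypothetical helper for illustration only; BLCR-specific exit values
    are not covered by the errno table.
    """
    name = errno.errorcode.get(code)
    if name is None:
        return f"{code}: not a known errno on this platform"
    # os.strerror gives the platform's human-readable message for the number.
    return f"{code}: {name} ({os.strerror(code)})"


if __name__ == "__main__":
    # 100 is the value asked about; its meaning depends on the platform.
    print(describe_exit_code(100))
```

    On a Linux system this should report the same name that errno.h gives
    for the value.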
    Lip Kian wrote:
    >I've been experimenting with BLCR since version 0.2.1 and have written a
    >document on integrating BLCR with N1 Grid Engine (formerly Sun Grid Engine)
    >which can be found at
    >I have 2 questions regarding BLCR.
    >1. Where can I find a listing of possible exit values and description for
    >cr_restart? Specifically, what does an exit value of 100 mean?
    >2. Seems that if my application is embedded within a script, checkpointing
    >the pid of the script will not checkpoint the embedded application. I
    >believe this is because checkpointing of process groups has not been
    >implemented? Any timeline as to when this will be done?
    >Lip Kian
