Re: Question on closing checkpoint file

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Jan 30 2008 - 11:30:58 PST

  • Next message: Neal Becker: "Re: Question on closing checkpoint file"
    I am not a Python programmer, but I think I see the reasons for both
    your problems.
    1)  The "unknown error":  You are printing os.strerror(err) when you
    probably wanted os.strerror(errno).  I believe you will find that making
    that one change will print "Invalid argument" (or similar) because you
    are getting the expected post-restart condition err=-1 and errno=EINVAL.
    The huge value you see now is -1 (err) printed as an unsigned 64-bit
    value (ulong).
    2)  The close():  BLCR will restart with the checkpoint file descriptor
    already closed.  An additional close() in your code will return a
    negative value and errno=EBADF=9 (the C code in cr_request_file() calls
    close w/o checking the return value for this reason).  It appears that
    Python is throwing an exception when that happens.  The solution is to
    either not close() on restart, or to close() w/o regard to restarting
    but to catch/ignore this exception.
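
A rough sketch of both points in Python 3 syntax (the quoted code uses the older print statement), offered as an illustration rather than a tested fix for the wrapper. The helper names poll_message and close_ignoring_ebadf are made up for this example and are not part of BLCR or of the ctypes wrapper. It assumes the errno value has been captured right after the failing call; with current ctypes that could be done by loading the library with use_errno=True and calling ctypes.get_errno(), but how the wrapper exposes errno is not shown in the thread.

```python
import ctypes
import errno
import os

# Point 1: a return value of -1 reinterpreted as an unsigned 64-bit integer
# gives the number seen in "Unknown error 18446744073709551615".
minus_one_as_ulong = ctypes.c_uint64(-1).value


def poll_message(saved_errno):
    """Describe a failed call from its errno value, not from its return value.

    saved_errno is assumed to have been read immediately after the call
    failed (for example via ctypes.get_errno()).
    """
    return os.strerror(saved_errno)


# Point 2: close a descriptor that may already be closed after a restart.
def close_ignoring_ebadf(fd):
    """Close fd; return False if it was already closed, True otherwise.

    Only EBADF is swallowed; any other error is re-raised.
    """
    try:
        os.close(fd)
    except OSError as exc:
        if exc.errno != errno.EBADF:
            raise
        return False
    return True


if __name__ == "__main__":
    print(minus_one_as_ulong)
    print(poll_message(errno.EINVAL))
```

Note that a "with open(...)" block closes the file on exit by itself, so the second approach would mean managing the descriptor by hand (or wrapping the with block in a try/except that ignores errno 9) instead of relying on the context manager.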
    Neal Becker wrote:
    > In the ctypes wrapper I just sent, there are a couple of small issues.
    > Here is the loop I used (python version of cr_checkpoint.c)
    > with open ("checkpoint", 'w') as cp_file:
    >     cr_args.cr_fd = cp_file.fileno()
    >     err,cr_handle = request_checkpoint (cr_args, cr_handle)
    >     err = -1
    >     while (err < 0):
    >         err = libcr.cr_poll_checkpoint (byref(cr_handle), POINTER(timeval_t)())
    >         print "err:", os.strerror(err)
    >         if (err < 0):
    >             if (errno == EINVAL):
    >                 break                   # restarted
    >             elif (errno == EINTR):
    >                 continue
    >             else:
    >                 die ("cr_poll_checkpoint")
    >         elif (err == 0):
    >             die ("cr_poll_checkpoint returned unexpected 0")
    > On restart, I get:
    > cr_restart checkpoint 
    > err: Unknown error 18446744073709551615
    > ---------------------------------------------------------------------------
    > IOError                                   Traceback (most recent call last)
    > /home/nbecker/idma-cdma/test/<ipython console> in <module>()
    > /usr/tmp/ in <module>()
    > IOError: [Errno 9] Bad file descriptor
    > Looks like 2 errors:
    > 1) On restart, the "print err" statement is executed and seems to print a 
    > garbage value
    > 2) Looks like on restart closing the fd is a bad idea
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
