From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Jan 30 2008 - 11:30:58 PST
Neal,

I am not a Python programmer, but I think I see the reasons for both of your problems.

1) The "unknown error": You are printing os.strerror(err) when you probably wanted os.strerror(errno). I believe you will find that making that one change will print "Invalid Argument" (or similar), because you are getting the expected post-restart condition err=-1 and errno=EINVAL. The huge value you see now is -1 (err) printed as an unsigned 64-bit value (ulong).

2) The close(): BLCR will restart with the checkpoint file descriptor already closed. An additional close() in your code will return a negative value and errno=EBADF=9 (the C code in cr_request_file() calls close() without checking the return value for this reason). It appears that Python is throwing an exception when that happens. The solution is either not to close() on restart, or to close() regardless of whether you are restarting and catch/ignore this exception.

-Paul

Neal Becker wrote:
> In the ctypes wrapper I just sent, there are a couple of small issues.
> Here is the loop I used (python version of cr_checkpoint.c)
>
> with open ("checkpoint", 'w') as cp_file:
>     cr_args.cr_fd = cp_file.fileno()
>
>     err,cr_handle = request_checkpoint (cr_args, cr_handle)
>
>     err = -1
>     while (err < 0):
>         err = libcr.cr_poll_checkpoint (byref(cr_handle), POINTER(timeval_t)())
>         print "err:", os.strerror(err)
>         if (err < 0):
>             if (errno == EINVAL):
>                 break # restarted
>             elif (errno == EINTR):
>                 continue
>             else:
>                 die ("cr_poll_checkpoint")
>         elif (err == 0):
>             die ("cr_poll_checkpoint returned unexpected 0")
>
> On restart, I get:
> cr_restart checkpoint
> err: Unknown error 18446744073709551615
> ---------------------------------------------------------------------------
> IOError                                   Traceback (most recent call last)
>
> /home/nbecker/idma-cdma/test/<ipython console> in <module>()
>
> /usr/tmp/python-m9AicK.py in <module>()
>
> IOError: [Errno 9] Bad file descriptor
>
> Looks like 2 errors:
> 1) On restart, the "print err" statement is executed and seems to print a
> garbage value
> 2) Looks like on restart closing the fd is a bad idea

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
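
The two fixes suggested in the reply can be sketched in Python as below. This is only an illustration under stated assumptions, not the actual wrapper: it uses Python 3 syntax (the quoted code is Python 2), it does not call libcr at all, and the helper names describe_failure and close_ignoring_ebadf are hypothetical. With ctypes, one way to obtain errno is probably to load the library with use_errno=True and read ctypes.get_errno() after the call; that part is left out here because it needs the real library.

```python
import errno
import os


def as_unsigned_64(value):
    """Show how a negative return code looks when read as an unsigned 64-bit integer."""
    return value & 0xFFFFFFFFFFFFFFFF


def describe_failure(ret, err_no):
    """Describe a C-style result: ret is the return value, err_no is errno.

    The message must come from errno, not from the return value, because
    the return value is just -1 on any failure.
    """
    if ret >= 0:
        return "success"
    return os.strerror(err_no)


def close_ignoring_ebadf(fd):
    """Close fd, tolerating the case where it is already closed.

    Returns True if this call closed the descriptor, False if it was
    already closed (EBADF). Any other error is re-raised.
    """
    try:
        os.close(fd)
        return True
    except OSError as exc:
        if exc.errno != errno.EBADF:
            raise
        return False


if __name__ == "__main__":
    # -1 read as unsigned 64-bit is the "garbage" number from the report.
    print(as_unsigned_64(-1))

    # Post-restart condition from the reply: ret = -1, errno = EINVAL.
    print("err:", describe_failure(-1, errno.EINVAL))

    # Simulate a descriptor that something else already closed.
    read_fd, write_fd = os.pipe()
    os.close(read_fd)
    print("closed now:", close_ignoring_ebadf(read_fd))
    os.close(write_fd)
```

In the quoted loop the descriptor is closed implicitly when the "with" block exits, so the equivalent there would be to catch the exception around the block (or to manage the descriptor by hand) rather than to call a helper like this directly.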