From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Nov 28 2007 - 11:34:44 PST
Dr M. Calleja wrote:
> On Nov 27 2007, Paul H. Hargrove wrote:
>
>> Mark Calleja wrote:
>>> Hi,
>>>
>>> I'm trying to get checkpointing using the BLCR kernel modules to
>>> work with Condor (http://www.cs.wisc.edu/condor/), but I've run into
>>> a hitch. I can get a sample, dynamically linked, x86_64 application
>>> to run and checkpoint successfully using v0.6.1 of the BLCR modules
>>> when run directly from the command line. However, when submitted to
>>> the same machine using Condor via a Parrot shell
>>> (http://www.cse.nd.edu/~ccl/software/parrot/), then although the job
>>> starts running successfully with cr_run, attempts to checkpoint the
>>> job with a separate process using cr_checkpoint fail with the error
>>> message:
>>>
>>> "Requested kernel interface version is not supported"
>>>
>>> Is there any reason why this error should occur, especially when
>>> command-line operation on the same box succeeds? BTW, Parrot is used
>>> to provide a user-space file system which talks to a chirp server
>>> (http://www.cse.nd.edu/~ccl/software/chirp/) in order to save the
>>> checkpointed state off the execute host. The tests were carried out
>>> on a Debian "etch" box, kernel 2.6.18-5-amd64, and the application
>>> was built and linked with g++ v 4.1.2.
>>>
>>> Regards,
>>> Mark
>>>
>>
>> Mark,
>>
>> Sorry for the slow response. I am still catching up on e-mail from
>> the U.S. holiday.
>>
>> The message you see indicates a version mismatch between the BLCR
>> library and the BLCR kernel module(s). Since you can checkpoint at
>> the command line but not with Parrot, I suspect that you may have 2
>> versions of BLCR installed and that the cr_run and/or cr_checkpoint
>> command in the PATH differs between the two methods. To confirm
>> this, you could try "cr_run --version" both at the command line and
>> via Parrot. I suspect they will report different version numbers.
>> If that is the case you will need to fix your PATH and/or remove the
>> older of the two installations. If the same version is reported both
>> times, let me know and we can try something else to isolate the cause
>> of your problems.
>>
>> -Paul
>
> Hi Paul,
>
> The problem appears to be at the Parrot/BLCR interface, and unrelated
> to Condor. Running "cr_run --version" from an ordinary shell and one
> that's running under Parrot gives the same result:
>
> banani$ cr_run --version
> /usr/local/bin/cr_run: version 0.6.1
>
> This is not surprising since this is my desktop and has only one
> version of BLCR installed, namely the one I installed. However, the
> problem raises its head when I run cr_checkpoint from the Parrot
> shell: it works just dandy from an ordinary shell but from within
> Parrot I get:
>
> banani$ cr_checkpoint 30697
> Failed cr_init(): Requested kernel interface version is not supported
>
> The developer of Parrot (Doug Thain, at Notre Dame) is a very amenable
> chap and I'm confident he'd be willing to help troubleshoot this.
>
> Thanks for your help and let me know if it would aid your debugging
> process if I was to give you an account on my test machine.
>
> Regards,
> Mark

Mark,

OK, I suspect that Parrot's I/O interception is the source of the
problem. Here is some info to help you and Doug:

The "Failed cr_init(): Requested kernel interface version is not
supported" message comes from the cr_checkpoint executable. At startup
cr_checkpoint calls cr_init() in libcr.so, which performs a version
number exchange with the BLCR kernel module to verify that the module
supports the interface expected by the library. This interface version
number advances independently of the BLCR release version, and
functions much like the major/minor version numbering of shared libs.

The interface between the BLCR shared lib and kernel modules is not
through addition of a BLCR-specific system call, but rather through
ioctl() on a pseudo file, /proc/checkpoint/ctrl.
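To make the shared-lib analogy concrete, here is a minimal sketch of a major/minor compatibility check. The function names and the exact rule (same major, module minor at least the requested minor) are assumptions borrowed from common shared-library practice, not BLCR's actual code; only the packing of the major number into the high 16 bits follows the call quoted later in this message.

```c
#include <stdbool.h>

/* Pack a major/minor pair into one integer, major in the high 16 bits. */
static unsigned long pack_version(unsigned major, unsigned minor)
{
    return ((unsigned long)major << 16) | (unsigned long)(minor & 0xffffu);
}

/* Extract the two halves of a packed version. */
static unsigned version_major(unsigned long v) { return (unsigned)((v >> 16) & 0xffffu); }
static unsigned version_minor(unsigned long v) { return (unsigned)(v & 0xffffu); }

/* Hypothetical rule: the provider (kernel module) must match the
 * requested major exactly and offer at least the requested minor. */
static bool interface_compatible(unsigned long provided, unsigned long requested)
{
    return version_major(provided) == version_major(requested)
        && version_minor(provided) >= version_minor(requested);
}
```

Under this assumed rule a module at 5.3 would satisfy a library asking for 5.1, while a module at 5.0 or 6.1 would not. A garbled request value would almost certainly fail the check as well.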
The cr_init() function makes a call that does approximately the
following:

    int cri_connect(void) {
        /* The control channel is a pseudo file, not a system call. */
        int fd = open("/proc/checkpoint/ctrl", O_WRONLY);
        if (fd < 0) {
            errno = ENOSYS; /* BLCR not present */
        } else if (ioctl(fd, CR_OP_VERSION,
                         (CR_MODULE_MAJOR << 16) | CR_MODULE_MINOR) < 0) {
            /* Version handshake rejected by the kernel module. */
            close(fd);
            fd = -1;
        }
        return fd;
    }

The message you get indicates an errno value of CR_EVERSION. The only
place that value is used is when the ioctl(fd, CR_OP_VERSION, ...) call
above finds a version mismatch.

Looking at pfs_dispatch.c:decode_syscall() I see that when faced with
an ioctl() call, Parrot is forced to guess whether the third arg is a
pointer or an integer. If the value is addressable, it is treated as a
pointer. I suspect that the value
((CR_MODULE_MAJOR << 16) | CR_MODULE_MINOR) is mistakenly treated as a
pointer, and therefore replaced by the address of a proxy buffer in the
actual ioctl() call made by Parrot. That would result in the
CR_EVERSION that you observe. I have no clue how we could deal with
this on the BLCR end, since we have no way to tell Parrot what is
happening.

Finally, I wanted to note that BLCR has not been tested against
programs that are being ptrace()ed, and we have reason to think things
might "get a little weird" at checkpoint and/or restart time. Since
Parrot uses ptrace, you may run into this weirdness. If/when you do,
please let us know and we will help as much as we can.

-Paul

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
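
The pointer-versus-integer guess described in the message can be illustrated with a small, self-contained sketch. This is not Parrot's code: the mapped_range structure and guess_is_pointer() are hypothetical names, and the traced process's address space is simulated by a table of ranges. It only shows how a packed integer argument can be mistaken for an address when "addressable" is the sole test.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one mapped region in a traced process. */
struct mapped_range {
    uintptr_t start; /* first valid address */
    uintptr_t end;   /* one past the last valid address */
};

/* Guess whether an ioctl() third argument is a pointer by checking
 * whether it falls inside any mapped region. This mirrors the
 * heuristic as described in the message, not Parrot's actual logic. */
static bool guess_is_pointer(uintptr_t arg,
                             const struct mapped_range *maps, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (arg >= maps[i].start && arg < maps[i].end)
            return true;
    }
    return false;
}
```

For example, with made-up version numbers 5 and 1 the packed argument is (5 << 16) | 1 = 0x50001. If the traced process happens to have a mapping that covers that address, guess_is_pointer() returns true and the integer would be handled as a pointer, while a small value such as 42 would pass through untouched. Whether the real CR_MODULE_MAJOR and CR_MODULE_MINOR values land in a mapped range depends on the process layout.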