Re: "Requested kernel interface version is not supported"

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Nov 28 2007 - 11:34:44 PST

    Dr M. Calleja wrote:
    > On Nov 27 2007, Paul H. Hargrove wrote:
    >
    >> Mark Calleja wrote:
    >>> Hi,
    >>>
    >>> I'm trying to get checkpointing using the BLCR kernel modules to 
    >>> work with Condor (http://www.cs.wisc.edu/condor/), but I've run into 
    >>> a hitch. I can get a sample, dynamically linked, x86_64 application 
    >>> to run and checkpoint successfully using v0.6.1 of the BLCR modules 
    >>> when run directly from the command line. However, when submitted to 
    >>> the same machine using Condor via a Parrot shell 
    >>> (http://www.cse.nd.edu/~ccl/software/parrot/), then although the job 
    >>> starts running successfully with cr_run, attempts to checkpoint the 
    >>> job with a separate process using cr_checkpoint fail with the error 
    >>> message:
    >>>
    >>> "Requested kernel interface version is not supported"
    >>>
    >>> Is there any reason why this error should occur, especially when 
    >>> command-line operation on the same box succeeds? BTW, Parrot is used 
    >>> to provide a user-space file system which talks to a chirp server 
    >>> (http://www.cse.nd.edu/~ccl/software/chirp/) in order to save the 
    >>> checkpointed state off the execute host. The tests were carried out 
    >>> on a Debian "etch" box, kernel  2.6.18-5-amd64, and the application 
    >>> was built and linked with g++ v 4.1.2.
    >>>
    >>> Regards,
    >>> Mark
    >>>
    >>
    >> Mark,
    >>
    >>  Sorry for the slow response.  I am still catching up on e-mail from 
    >> the U.S. holiday.
    >>
    >>  The message you see indicates a version mismatch between the BLCR 
    >> library and the BLCR kernel module(s).  Since you can checkpoint at 
    >> the command line but not with Parrot, I suspect that you may have 2 
    >> versions of BLCR installed and that the cr_run and/or cr_checkpoint 
    >> command in the PATH differs between the two methods.  To confirm 
    >> this, you could try "cr_run --version" both at the command line and 
    >> via Parrot.  I suspect they will report different version numbers.  
    >> If that is the case you will need to fix your PATH and/or remove the 
    >> older of the two installations.  If the same version is reported both 
    >> times, let me know and we can try something else to isolate the cause 
    >> of your problems.
    >>
    >> -Paul
    >
    > Hi Paul,
    >
    > The problem appears to be at the Parrot/BLCR interface, and unrelated 
    > to Condor. Running "cr_run --version" from an ordinary shell and one 
    > that's running under Parrot gives the same result:
    >
    > banani$  cr_run --version
    > /usr/local/bin/cr_run: version 0.6.1
    >
    > This is not surprising since this is my desktop and has only one 
    > version of BLCR installed, namely the one I installed. However, the 
    > problem raises its head when I run cr_checkpoint from the Parrot 
    > shell: it works just dandy from an ordinary shell but from within 
    > Parrot I get:
    >
    > banani$ cr_checkpoint 30697
    > Failed cr_init(): Requested kernel interface version is not supported
    >
    > The developer of Parrot (Doug Thain, at Notre Dame) is a very amenable 
    > chap and I'm confident he'd be willing to help troubleshoot this.
    >
    > Thanks for your help and let me know if it would aid your debugging 
    > process if I was to give you an account on my test machine.
    >
    > Regards,
    > Mark
    
    Mark,
      OK, I suspect that Parrot's I/O interception is the source of the 
    problem.  Here is some info to help you and Doug:
    
      The "Failed cr_init(): Requested kernel interface version is not 
    supported" message comes from the cr_checkpoint executable.  At startup 
    cr_checkpoint calls cr_init() in libcr.so, which performs a version 
    number exchange with the BLCR kernel module to verify that it supports 
    the interface expected by the library.  This is a version number that 
    advances independently of the BLCR release version, and functions much 
    like the major/minor version numbering of shared libs.
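
    A minimal sketch of how such a packed major/minor number could be 
    compared, assuming a shared-library-style rule (same major, and the 
    module's minor at least as new as the requested minor).  The helper 
    names pack_version and version_supported are hypothetical, and this 
    is not BLCR's actual check:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack a major/minor pair into one 32-bit value,
 * major in the high 16 bits, minor in the low 16 bits. */
static uint32_t pack_version(uint32_t major, uint32_t minor) {
    assert(major <= 0xFFFFu && minor <= 0xFFFFu);
    return (major << 16) | minor;
}

/* Assumed compatibility rule (shared-library style, not taken from BLCR):
 * the majors must match exactly, and the provider's minor must be at
 * least the requested minor. Returns 1 if supported, 0 otherwise. */
static int version_supported(uint32_t requested, uint32_t provided) {
    uint32_t req_major = requested >> 16, req_minor = requested & 0xFFFFu;
    uint32_t prov_major = provided >> 16, prov_minor = provided & 0xFFFFu;
    return req_major == prov_major && req_minor <= prov_minor;
}
```

    Under this assumed rule, a library asking for 5.2 would be accepted 
    by a module providing 5.3, while a request for 5.4 or 4.0 would be 
    refused.
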
      The interface between the BLCR shared lib and kernel modules is not 
    through addition of a BLCR-specific system call, but rather through 
    ioctl() on a pseudo file /proc/checkpoint/ctrl.  The cr_init() function 
    makes a call that does approximately the following:
    
    int cri_connect(void) {
        int fd = open("/proc/checkpoint/ctrl", O_WRONLY);
        if (fd < 0) {
            errno = ENOSYS; /* BLCR not present */
        } else if (ioctl(fd, CR_OP_VERSION,
                         (CR_MODULE_MAJOR << 16) | CR_MODULE_MINOR) < 0) {
            close(fd);
            fd = -1;
        }
        return fd;
    }
    
    The message you get indicates an errno value of CR_EVERSION.  The only 
    place that value is used is when the ioctl(fd, CR_OP_VERSION,...) call 
    above finds a version mismatch.
    
    Looking at pfs_dispatch.c:decode_syscall() I see that when faced with an 
    ioctl() call, Parrot is forced to guess if the third arg is a pointer or 
    integer.  If the value is addressable, it is treated as a pointer.  I 
    suspect that the value ((CR_MODULE_MAJOR << 16) | CR_MODULE_MINOR) is 
    mistakenly treated as a pointer, and therefore replaced by the address 
    of a proxy buffer in the actual ioctl() call made by Parrot.  The kernel 
    module would then see that address instead of the version number, which 
    would result in the CR_EVERSION that you observe.  I have no idea how we 
    could deal with this on the BLCR end, since we have no way to tell 
    Parrot what is happening.
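
    To illustrate the guess described above, here is a simplified model.  
    It is not Parrot's actual code: the names region, is_addressable and 
    forward_ioctl_arg, and the idea of checking against a table of mapped 
    ranges, are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one mapped address range in the traced process. */
struct region {
    uintptr_t start; /* inclusive */
    uintptr_t end;   /* exclusive */
};

/* Returns 1 if value falls inside any mapped range, 0 otherwise. */
static int is_addressable(const struct region *map, size_t n, uintptr_t value) {
    assert(map != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (value >= map[i].start && value < map[i].end)
            return 1;
    }
    return 0;
}

/* Models the guess: an addressable third argument is assumed to be a
 * pointer and is swapped for the tracer's proxy buffer; anything else is
 * forwarded unchanged as a plain integer. */
static uintptr_t forward_ioctl_arg(const struct region *map, size_t n,
                                   uintptr_t arg, uintptr_t proxy_buffer) {
    return is_addressable(map, n, arg) ? proxy_buffer : arg;
}
```

    In this model, if a mapping happens to cover 0x00050002 (a made-up 
    packed version of major 5, minor 2), that value is replaced by the 
    proxy buffer address, while a small integer such as 0x10 passes 
    through unchanged.
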
    
    Finally, I wanted to note that BLCR has not been tested against programs 
    that are being ptrace()ed, and we have reason to think things might "get 
    a little weird" at checkpoint and/or restart time.  Since Parrot uses 
    ptrace, you may run into this weirdness.  If/when you do, please let us 
    know and we will help as much as we can.
    
    -Paul
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
