Re: Another User Manual clarification please

From: Mark Calleja (
Date: Thu Oct 18 2007 - 11:26:19 PDT

  • Next message: Yuan Wan: "restart program in cluster"
    Hi again Paul,
    Paul H. Hargrove wrote:
    > Mark Calleja wrote:
    >> Hi,
    >> I read in the BLCR User manual that one of the necessary criteria is:
    >> "You may restart a program on a different machine than the one it was 
    >> checkpointed on if all of these conditions are met (they often are on 
    >> cluster systems, especially if you are using a shared network 
    >> filesystem), and the kernels are the same."
    >> To what levels do kernels have to be "the same"? So for instance, my 
    >> Debian etch box has a kernel of 2.6.18-5-686; can I restart a 
    >> checkpointed job on any 32 bit i686 2.6.18 kerneled machine, or does 
    >> it have to be *exactly* the same?
    >> Thanks for any help,
    >> Mark
    > Unfortunately there is no exact answer to this question.  We don't 
    > intentionally limit the kernel version at restart time based on any 
    > information contained in the context file(s) except for architecture.  
    > However, there are subtle differences between kernel versions that 
    > affect the contents of the context files.  So, a restart may fail in 
    > strange ways if the kernel was configured with a different feature set.
    > We currently do our own testing in homogeneous environments and thus 
    > have not done any significant amount of testing of the kind you are 
    > asking about.  However, I would say that among kernels with the same 
    > version number and architecture (2.6.18/i686 in your example) you have 
    > a high probability of success.
    OK, that's what I suspected. I shall test on our grid and see what 
    happens. I should then be able to come up with a set of classads to 
    steer Condor jobs to appropriate platforms.
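
The rule of thumb quoted above (same version number and same architecture) can be written down as a small check. What follows is only a sketch under stated assumptions: the helper names `base_version` and `likely_compatible` are hypothetical, nothing here is part of BLCR or Condor, and a match only suggests a good chance of success, as the reply says.

```python
import platform
import re


def base_version(release):
    """Return the leading dotted version of a kernel release string.

    Example: '2.6.18-5-686' -> '2.6.18'. Distro suffixes are dropped.
    """
    match = re.match(r"\d+(?:\.\d+)*", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return match.group(0)


def likely_compatible(ckpt_release, ckpt_arch, host_release, host_arch):
    """Heuristic only: same architecture and same base kernel version.

    This mirrors the rule of thumb in the thread; it does not guarantee
    that a restart will succeed.
    """
    return (ckpt_arch == host_arch
            and base_version(ckpt_release) == base_version(host_release))


def local_kernel():
    """Return (release, architecture) for the machine running this code."""
    return platform.release(), platform.machine()
```

The two values returned by `local_kernel()` on the checkpointing host could be recorded next to the context file and compared on each candidate restart host before a job is steered there.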
    > More exact answers may be possible in a future release.  Specifically 
    > we could generate a "feature checksum" for a given kernel that could 
    > be queried and compared to determine compatibility.  However, no work 
    > has been done in this area.  If you are interested in contributing to 
    > BLCR, this is an area where your help would be appreciated.
    I have access to a range of Linux architectures (32 & 64 bit Intel and 
    AMD) and distros (Debian, SUSE and SL) in our grid which I can run tests 
    on, so time allowing I'd be more than happy to run various tests.
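
One way to picture the "feature checksum" idea mentioned above is to hash the set of options a kernel was configured with. This is a hypothetical sketch, not something BLCR provides: the function names are invented for illustration, the input is assumed to be the text of a kernel config file (its location varies by distro), and hashing every option is almost certainly stricter than necessary, since the thread does not say which options affect the context files.

```python
import hashlib


def enabled_options(config_text):
    """Return the sorted 'CONFIG_X=value' lines from kernel config text.

    Blank lines and comments (including '# CONFIG_X is not set') are skipped.
    """
    options = []
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.startswith("CONFIG_"):
            options.append(f"{name}={value}")
    return sorted(options)


def feature_checksum(config_text):
    """Hash the enabled options so two kernels can be compared cheaply.

    Assumption: every option counts. A real implementation would need to
    restrict this to the options that matter for checkpoint/restart.
    """
    joined = "\n".join(enabled_options(config_text))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

Two machines whose checksums agree were built with the same option set; differing checksums say only that something differs, not whether a restart would fail.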
    Dr Mark Calleja
    Cambridge eScience Centre, University of Cambridge
    Centre for Mathematical Sciences, Wilberforce Road, Cambridge CB3 0WA
    Tel. (+44/0) 1223 765317, Fax (+44/0) 1223 765900
