Re: Another User Manual clarification please

From: Mark Calleja (M.Calleja_at_damtp.cam.ac.uk)
Date: Thu Oct 18 2007 - 11:26:19 PDT

    Hi again Paul,
    
    Paul H. Hargrove wrote:
    > Mark Calleja wrote:
    >> Hi,
    >>
    >> I read in the BLCR User manual that one of the necessary criteria is:
    >>
    >> "You may restart a program on a different machine than the one it was 
    >> checkpointed on if all of these conditions are met (they often are on 
    >> cluster systems, especially if you are using a shared network 
    >> filesystem), and the kernels are the same."
    >>
    >> To what levels do kernels have to be "the same"? So for instance, my 
    >> Debian etch box has a kernel of 2.6.18-5-686; can I restart a 
    >> checkpointed job on any 32 bit i686 2.6.18 kerneled machine, or does 
    >> it have to be *exactly* the same?
    >>
    >> Thanks for any help,
    >> Mark
    >
    > Unfortunately there is no exact answer to this question.  We don't 
    > intentionally limit the kernel version at restart time based on any 
    > information contained in the context file(s) except for the architecture.  
    > However, there are subtle differences between kernel versions that 
    > affect the contents of the context files.  So, a restart may fail in 
    > strange ways if the kernel was configured with a different feature set.
    >
    > We currently do our own testing in homogeneous environments and thus 
    > have not done any significant amount of testing of the kind you are 
    > asking about.  However, I would say that among kernels with the same 
    > version number and architecture (2.6.18/i686 in your example) you have 
    > a high probability of success.
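
A minimal sketch of the rule of thumb above (same version number and same architecture), written in Python for illustration. It is not part of BLCR: the function names `kernel_base` and `likely_compatible` are hypothetical, and the assumption that the first three numeric components of `uname -r` identify the "version number" is mine. A match only suggests a good chance of success; as noted above, a differently configured kernel may still fail at restart.

```python
import platform
import re


def kernel_base(release):
    """Return the first three numeric components of a `uname -r` string.

    For example, '2.6.18-5-686' gives '2.6.18'.
    """
    match = re.match(r"(\d+\.\d+\.\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return match.group(1)


def likely_compatible(ckpt_release, ckpt_arch, host_release, host_arch):
    """Heuristic only: same architecture and same base kernel version."""
    return (ckpt_arch == host_arch
            and kernel_base(ckpt_release) == kernel_base(host_release))


def local_platform():
    """Release and machine strings of the current host (as in `uname -r`, `uname -m`)."""
    return platform.release(), platform.machine()
```

For instance, `likely_compatible("2.6.18-5-686", "i686", *local_platform())` could be evaluated on a candidate restart host before attempting the restart.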
    
    OK, that's what I suspected. I shall test on our grid and see what 
    happens. I should then be able to come up with a set of ClassAds to 
    steer Condor jobs to appropriate platforms.
    
    >
    > More exact answers may be possible in a future release.  Specifically 
    > we could generate a "feature checksum" for a given kernel that could 
    > be queried and compared to determine compatibility.  However, no work 
    > has been done in this area.  If you are interested in contributing to 
    > BLCR, this is an area where your help would be appreciated.
    >
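
One way the "feature checksum" idea might look, sketched in Python. This is purely illustrative: the message above says no such work exists in BLCR, and the choice of input is an assumption on my part, namely that the enabled `CONFIG_` options in a kernel config listing (such as the config file many distros install under /boot) are a usable stand-in for the kernel's feature set. The function name `feature_checksum` is hypothetical.

```python
import hashlib


def feature_checksum(config_text):
    """Hash the enabled CONFIG_ options found in a kernel config listing."""
    options = []
    for line in config_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments, including '# CONFIG_FOO is not set'.
        if not line or line.startswith("#"):
            continue
        if line.startswith("CONFIG_") and "=" in line:
            options.append(line)
    # Sort so that the order of options in the file does not affect the result.
    canonical = "\n".join(sorted(options)).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()
```

Two hosts whose config listings yield the same hex digest would have identical enabled options under this assumption; differing digests say only that something differs, not whether the difference matters for restart.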
    
    I have access to a range of Linux architectures (32- and 64-bit Intel 
    and AMD) and distros (Debian, SUSE and SL) in our grid that I can test 
    on, so time permitting I'd be more than happy to run various tests.
    
    Cheers,
    Mark
    
    -- 
    Dr Mark Calleja
    Cambridge eScience Centre, University of Cambridge
    Centre for Mathematical Sciences, Wilberforce Road, Cambridge CB3 0WA
    Tel. (+44/0) 1223 765317, Fax (+44/0) 1223 765900
    http://www.escience.cam.ac.uk/~mcal00
    
