Re: Another User Manual clarification please

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Oct 18 2007 - 10:10:37 PDT

    Mark Calleja wrote:
    > Hi,
    >
    > I read in the BLCR User manual that one of the necessary criteria is:
    >
    > "You may restart a program on a different machine than the one it was 
    > checkpointed on if all of these conditions are met (they often are on 
    > cluster systems, especially if you are using a shared network 
    > filesystem), and the kernels are the same."
    >
    > To what levels do kernels have to be "the same"? So for instance, my 
    > Debian etch box has a kernel of 2.6.18-5-686; can I restart a 
    > checkpointed job on any 32 bit i686 2.6.18 kerneled machine, or does 
    > it have to be *exactly* the same?
    >
    > Thanks for any help,
    > Mark
    
    Unfortunately there is no exact answer to this question.  We don't 
    intentionally limit the kernel version at restart time based on any 
    information contained in the context file(s), except for the architecture.  
    However, there are subtle differences between kernel versions that 
    affect the contents of the context files.  So, a restart may fail in 
    strange ways if the kernel was configured with a different feature set.
    
    We currently do our own testing in homogeneous environments and thus 
    have not done any significant amount of testing of the kind you are 
    asking about.  However, I would say that among kernels with the same 
    version number and architecture (2.6.18/i686 in your example) you have a 
    high probability of success.
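
As a rough illustration of that rule of thumb, here is a minimal Python sketch that compares the base kernel version and architecture of two machines. It is not part of BLCR, and BLCR does not perform this check itself. The function names (kernel_base_version, likely_compatible) are hypothetical, and the sketch assumes you recorded the output of "uname -r" and "uname -m" yourself at checkpoint time. A match only suggests a good chance of success; it guarantees nothing.

```python
import platform
import re


def kernel_base_version(release):
    """Return the leading numeric part of a kernel release string.

    For example, "2.6.18-5-686" yields "2.6.18".
    """
    match = re.match(r"\d+(?:\.\d+)*", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return match.group(0)


def likely_compatible(ckpt_release, ckpt_machine, release=None, machine=None):
    """Heuristic only: same base kernel version and same architecture.

    ckpt_release and ckpt_machine are values the user saved on the
    checkpointing machine (an assumption of this sketch). When release or
    machine is omitted, the values for the local machine are used.
    """
    if release is None:
        release = platform.release()
    if machine is None:
        machine = platform.machine()
    same_version = kernel_base_version(ckpt_release) == kernel_base_version(release)
    return same_version and ckpt_machine == machine
```

On the restart machine you would call likely_compatible("2.6.18-5-686", "i686") with the values saved from the checkpoint machine. Kernels that pass this test can still differ in configuration, so a restart may still fail.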
    
    More exact answers may be possible in a future release.  Specifically we 
    could generate a "feature checksum" for a given kernel that could be 
    queried and compared to determine compatibility.  However, no work has 
    been done in this area.  If you are interested in contributing to BLCR, 
    this is an area where your help would be appreciated.
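
To make the "feature checksum" idea a little more concrete, here is one hypothetical way it could look, sketched in Python. Nothing like this exists in BLCR. The function name feature_checksum is invented for illustration, and hashing the enabled CONFIG_ options from a kernel configuration file is only an assumption about what such a checksum might cover. A real implementation would need to decide which options actually affect the context file format.

```python
import hashlib


def feature_checksum(config_text):
    """Hash the CONFIG_ option lines of a kernel configuration.

    Comments, blank lines and "is not set" lines are ignored, and the
    options are sorted, so the result does not depend on line order.
    """
    options = []
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            options.append(line)
    joined = "\n".join(sorted(options))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

Two machines would then compare the resulting hex strings, and a mismatch would flag a possibly incompatible kernel before a restart is attempted. Hashing the whole configuration is almost certainly too strict, since many options have no bearing on checkpoint contents.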
    
    -Paul
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
