From: Mark Calleja (M.Calleja_at_damtp.cam.ac.uk)
Date: Thu Oct 18 2007 - 11:26:19 PDT
Hi again Paul,

Paul H. Hargrove wrote:
> Mark Calleja wrote:
>> Hi,
>>
>> I read in the BLCR User manual that one of the necessary criteria is:
>>
>> "You may restart a program on a different machine than the one it was
>> checkpointed on if all of these conditions are met (they often are on
>> cluster systems, especially if you are using a shared network
>> filesystem), and the kernels are the same."
>>
>> To what levels do kernels have to be "the same"? So for instance, my
>> Debian etch box has a kernel of 2.6.18-5-686; can I restart a
>> checkpointed job on any 32 bit i686 2.6.18 kerneled machine, or does
>> it have to be *exactly* the same?
>>
>> Thanks for any help,
>> Mark
>
> Unfortunately there is no exact answer to this question. We don't
> intentionally limit the kernel version at restart time based on any
> information contained in the context file(s) except to architecture.
> However, there are subtle differences between kernel versions that
> affect the contents of the context files. So, a restart may fail in
> strange ways if the kernel was configured with a different feature set.
>
> We currently do our own testing in homogeneous environments and thus
> have not done any significant amount of testing of the kind you are
> asking about. However, I would say that among kernels with the same
> version number and architecture (2.6.18/i686 in your example) you have
> a high probability of success.

OK, that's what I suspected. I shall test on our grid and see what
happens. I should then be able to come up with a set of classads to
steer Condor jobs to appropriate platforms.

>
> More exact answers may be possible in a future release. Specifically
> we could generate a "feature checksum" for a given kernel that could
> be queried and compared to determine compatibility. However, no work
> has been done in this area. If you are interested in contributing to
> BLCR, this is an area where your help would be appreciated.
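
For what it's worth, the rule of thumb quoted above (same kernel version
number and same architecture gives a high probability of success) could
be sketched roughly as below. This is only a heuristic, not anything
BLCR itself does: the function names are made up for illustration, the
release strings are assumed to look like the output of `uname -r`, and a
match is no guarantee that a restart will actually work.

```python
import re


def base_version(release):
    """Return the leading numeric part of a kernel release string.

    Example: '2.6.18-5-686' gives (2, 6, 18). Hypothetical helper,
    assuming a 'major.minor.patch' prefix as printed by `uname -r`.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % (release,))
    return tuple(int(part) for part in match.groups())


def likely_compatible(ckpt_release, ckpt_machine, node_release, node_machine):
    """Heuristic only: same architecture string and same base version.

    Differences in kernel configuration are NOT detected here, so a
    True result means "worth trying", not "will restart cleanly".
    """
    return (
        ckpt_machine == node_machine
        and base_version(ckpt_release) == base_version(node_release)
    )


if __name__ == "__main__":
    import platform

    # Compare this node against the kernel from the example in the thread.
    print(likely_compatible("2.6.18-5-686", "i686",
                            platform.release(), platform.machine()))
```

The same two values (architecture and base kernel version) are the sort
of thing one might then advertise per machine and match on when steering
jobs, though how that is expressed in classads is left open here.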

I have access to a range of Linux architectures (32 & 64 bit Intel and
AMD) and distros (Debian, Suse and SL) in our grid which I can run tests
on, so time allowing I'd be more than happy to run various tests.

Cheers,
Mark

--
Dr Mark Calleja
Cambridge eScience Centre, University of Cambridge
Centre for Mathematical Sciences, Wilberforce Road, Cambridge CB3 0WA
Tel. (+44/0) 1223 765317, Fax (+44/0) 1223 765900
http://www.escience.cam.ac.uk/~mcal00