From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Oct 18 2007 - 10:10:37 PDT
Mark Calleja wrote:
> Hi,
>
> I read in the BLCR User manual that one of the necessary criteria is:
>
> "You may restart a program on a different machine than the one it was
> checkpointed on if all of these conditions are met (they often are on
> cluster systems, especially if you are using a shared network
> filesystem), and the kernels are the same."
>
> To what levels do kernels have to be "the same"? So for instance, my
> Debian etch box has a kernel of 2.6.18-5-686; can I restart a
> checkpointed job on any 32 bit i686 2.6.18 kerneled machine, or does
> it have to be *exactly* the same?
>
> Thanks for any help,
> Mark

Unfortunately there is no exact answer to this question. We don't intentionally limit the kernel version at restart time based on any information contained in the context file(s), except for the architecture. However, there are subtle differences between kernel versions that affect the contents of the context files. So, a restart may fail in strange ways if the kernel was configured with a different feature set.

We currently do our own testing in homogeneous environments and thus have not done any significant amount of testing of the kind you are asking about. However, I would say that among kernels with the same version number and architecture (2.6.18/i686 in your example) you have a high probability of success.

More exact answers may be possible in a future release. Specifically, we could generate a "feature checksum" for a given kernel that could be queried and compared to determine compatibility. However, no work has been done in this area. If you are interested in contributing to BLCR, this is an area where your help would be appreciated.

-Paul

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
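
To make the "feature checksum" idea from the reply concrete, here is a minimal Python sketch. It is not part of BLCR, and BLCR offers no such query. The function name `feature_checksum`, the use of SHA-256, and the choice to hash the kernel version, the architecture, and the enabled `CONFIG_*` options are all assumptions made for illustration.

```python
import hashlib


def feature_checksum(version, machine, config_lines):
    """Hypothetical fingerprint of a kernel's version, architecture and options.

    This is an illustrative helper, not a BLCR interface.
    """
    # Keep only option assignments such as "CONFIG_SMP=y". Blank lines and
    # comments (including "# CONFIG_FOO is not set") are skipped, and the
    # options are sorted so that their order in the file does not matter.
    options = sorted(
        {line.strip() for line in config_lines
         if line.strip().startswith("CONFIG_")}
    )
    payload = "\n".join([version, machine] + options)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Sample data only; the option names are ordinary kernel config symbols,
# but these lists do not describe any real machine.
checkpoint_host = feature_checksum(
    "2.6.18", "i686", ["CONFIG_SMP=y", "CONFIG_HIGHMEM4G=y"]
)
restart_host = feature_checksum(
    "2.6.18", "i686", ["# a comment", "CONFIG_HIGHMEM4G=y", "CONFIG_SMP=y"]
)
print(checkpoint_host == restart_host)  # True: same options, different order
```

On a real system the option lines might come from the kernel's configuration file, where the distribution installs one. Equal checksums would only suggest that two kernels were built with the same feature set; they would not guarantee that a restart succeeds.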