From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Sun Jan 24 2010 - 12:37:26 PST
Stijn,

There is no specific guarantee about which kernel versions A and B one can safely migrate BLCR context files between. In an ideal world there would be, but we are not at a point in our development that we can devote effort to ensuring this. As Ladislav says, there has seldom been any problem with moving from 2.6.X.y to 2.6.X.z.

However, the fact that kernel B panics when trying to restart is certainly a bug in BLCR, because even if the cross-kernel migration fails, it should fail with an error message about an invalid context file, not a kernel crash or hang. If you are ever able to obtain a log of the panic, we'd appreciate knowing more about it. Even if you cannot obtain this log, you should file a bug report at https://upc-bugs.lbl.gov/bugzilla because it is much easier to track than email.

Ladislav,

I'd appreciate more info on your system hangs as well. If you can, as you suggest, try a few 2.6.27.x kernel versions to find the first problematic version, that would be very helpful to us. As I tell Stijn above, I'd appreciate having this info in our Bugzilla instead of by email to improve my chances of keeping track of the issue.

-Paul

Ladislav Subr wrote:
> Hello,
>
> I have recent experience with BLCR 0.8.2 failing (the system hangs during
> checkpoint) on vanilla 2.6.27.44, while I was successfully using 2.6.27.39
> for a couple of months. May it be due to the same (security?) patch that was
> applied to the RedHat kernel as well? I can eventually try various 2.6.27.x
> kernels if it helps to locate the problem.
>
> BTW, long time ago, on 2.4 kernels, it was quite safe to move jobs between
> kernels of different version. On 2.6 my experience was (till yesterday) that
> difference of the least significant version number is safe.
>
> L.
>
>
>> hi all,
>>
>> (i'm not on the list so please put me in CC when replying.)
>>
>> we are using blcr 0.8.2 on sl5.4 x86_64 systems and we are seeing
>> strange things with restarting checkpoints taken on kernel version A and
>> then restarting it on kernel version B.
>> B is supposed to be a 'security/bug fix only' update of A (from
>> 2.6.18-164.6.1.el5 to 2.6.18-164.11.1.el5, but who knows what patches
>> are in there ;)
>>
>> checkpoint/restart works fine on both versions (also the testsuite
>> RUN_ME passes all tests), but when restarting a checkpointed job from A
>> on B, the machine gives a kernel panic (and i can't find the complete
>> panic message :(
>>
>> is there any guideline on the behaviour of restarting on different
>> kernels (same BLCR version though, but only the blcr module rpm is
>> upgraded, i assume that the other rpms are independent of the kernel).
>> is this suposed to work at all times?
>>
>> many thanks,
>>
>> stijn
>>
>
>

-- 
Paul H. Hargrove PHHargrove_at_lbl_dot_gov
Future Technologies Group Tel: +1-510-495-2352
HPC Research Department Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory