Re: blcr and kernel updates

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Sun Jan 24 2010 - 12:37:26 PST

  • Next message: Paul H. Hargrove: "Re: /proc/checkpoint/ctrl limit?"
    Stijn,
    
      There is no specific guarantee about which kernel versions A and B one 
    can safely migrate BLCR context files between.  In an ideal world there 
    would be, but we are not at a point in our development that we can 
    devote effort to ensuring this.  As Ladislav says, there has seldom been 
    any problem with moving from 2.6.X.y to 2.6.X.z.
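
As a rough illustration of the 2.6.X.y to 2.6.X.z rule of thumb above (a sketch only, not something BLCR does or guarantees), one could compare the kernel release strings recorded at checkpoint and restart time. The function name `same_stable_series` is hypothetical, and the choice to treat vendor releases such as 2.6.18-164.6.1.el5 as "cannot tell" is an assumption.

```python
import re


def same_stable_series(release_a, release_b):
    """Return True when both releases look like 2.6.X.y and 2.6.X.z,
    i.e. they differ at most in the least significant version number.

    Assumption: a vendor release such as "2.6.18-164.6.1.el5" carries its
    patch level after the dash, so the rule of thumb cannot be assessed
    and the function returns False for it.
    """
    def upstream_parts(release):
        # Keep only the leading dotted numeric version ("2.6.27.44").
        match = re.match(r"\d+(?:\.\d+)*", release)
        if match is None:
            raise ValueError("unrecognized kernel release: %r" % release)
        return match.group(0).split(".")

    a = upstream_parts(release_a)
    b = upstream_parts(release_b)
    # Require a four-part version and identical leading components.
    return len(a) == len(b) and len(a) >= 4 and a[:-1] == b[:-1]
```

The release strings would come from `uname -r` (or `platform.release()` in Python) on the checkpointing and restarting hosts. A True result is only a hint, since even a least-significant-number update can carry a patch that changes behavior.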
    
      However, the fact that kernel B panics when trying to restart is 
    certainly a bug in BLCR, because even if the cross-kernel migration 
    fails it should be with an error message about an invalid context file, 
    not a kernel crash or hang.  If you are ever able to obtain a log of the 
    panic, we'd appreciate knowing more about it.  Even if you cannot 
    obtain this log, you should file a bug report at 
    https://upc-bugs.lbl.gov/bugzilla because it is much easier to track 
    than email.
    
    Ladislav,
    
      I'd appreciate more info on your system hangs as well.  If you can, as 
    you suggest, try a few 2.6.27.x kernel versions to find the first 
    problematic version, that would be very helpful to us.  As I told Stijn 
    above, I'd appreciate having this info in our Bugzilla instead of by 
    email to improve my chances of keeping track of the issue.
    
    -Paul
    
    Ladislav Subr wrote:
    > Hello,
    >
    > I have recent experience with BLCR 0.8.2 failing (the system hangs during 
    > checkpoint) on vanilla 2.6.27.44, while I was successfully using 2.6.27.39 
    > for a couple of months. May it be due to the same (security?) patch that was 
    > applied to the RedHat kernel as well? I can eventually try various 2.6.27.x 
    > kernels if it helps to locate the problem.
    >
    > BTW, long time ago, on 2.4 kernels, it was quite safe to move jobs between 
    > kernels of different version. On 2.6 my experience was (till yesterday) that 
    > difference of the least significant version number is safe.
    >
    > 	L.
    >
    >   
    >> hi all,
    >>
    >> (i'm not on the list so please put me in CC when replying.)
    >>
    >> we are using blcr 0.8.2 on sl5.4 x86_64 systems and we are seeing
    >> strange things with restarting checkpoints taken on kernel version A and
    >> then restarting it on kernel version B.
    >> B is supposed to be a 'security/bug fix only' update of A (from
    >> 2.6.18-164.6.1.el5 to 2.6.18-164.11.1.el5, but who knows what patches
    >> are in there ;)
    >>
    >> checkpoint/restart works fine on both versions (also the testsuite
    >> RUN_ME passes all tests), but when restarting a checkpointed job from A
    >> on B, the machine gives a kernel panic (and i can't find the complete
    >> panic message  :(
    >>
    >> is there any guideline on the behaviour of restarting on different
    >> kernels (same BLCR version though, but only the blcr module rpm is
    >> upgraded, i assume that the other rpms are independent of the kernel).
    >> is this supposed to work at all times?
    >>
    >> many thanks,
    >>
    >> stijn
    >>     
    >
    >
    >   
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
    
