Re: Using blcr for debugging

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon May 26 2008 - 12:35:34 PDT

    Parviz,
    
      BLCR is not able to save/restore the association between the debugger 
    and the executable, making what you are trying slightly difficult (but 
    hopefully not impossible).  For that reason, in the 0.7.0 release (due 
    out soon) the default behavior will be to refuse to checkpoint while a 
    debugger is attached (an additional option will need to be specified to 
    allow the checkpoint in such a case).  In neither the 0.6.x nor the 0.7.0 
    release will checkpointing gdb and the debugged process together (as a 
    process group, process tree, etc.) work.  If it did, your task would have 
    been much easier (just "cr_checkpoint <pid-of-gdb>").
    
      The Trace/BPT trap you see is the restarted executable executing a 
    breakpoint (bpt) trap instruction that the debugger inserted.  Since at 
    restart time no debugger is attached, the trap is a fatal error.  The 
    problem is that any breakpoint trap instruction written by the first gdb 
    is still present in the checkpointed process, having replaced 
    instruction(s) in the process.  When gdb wrote that instruction into 
    process memory, it would have saved the original instruction byte in its 
    own memory (to restore when executing past the breakpoint, or when 
    removing it).  However, that information was lost when the first gdb 
    exited.  This doesn't appear to have a good solution other than deleting 
    all breakpoints before you take the checkpoint.  If you consult a gdb 
    expert (I am not one) you may be able to get gdb to print all the 
    breakpoint data in a form that can be fed back into the new gdb (or 
    perhaps you only have one at this stage).  So, I recommend the following 
    steps:
    1) Run under control of gdb until it stops at your "safe" breakpoint
    2) Delete all breakpoints/watchpoints
    3) Checkpoint the process (may require you to "c" in response to the 
    BLCR-generated signal)
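    
    As a rough sketch of the three steps above, assuming the checkpoint is 
    requested from a second terminal while gdb sits at the "safe" 
    breakpoint: the run helper, the DRY_RUN variable, the function name and 
    the PID 1234 are illustrative and not part of BLCR or gdb, and 
    "save breakpoints" is only available in some gdb versions.

```shell
#!/bin/sh
# Sketch only.  PID below means the debugged process, not gdb itself.
# Set DRY_RUN=1 to print each command instead of executing it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Steps 1 and 2 happen inside the gdb session, at the "safe" breakpoint:
#   (gdb) info breakpoints        # note what is set, for the next session
#   (gdb) save breakpoints FILE   # assumption: only if your gdb supports it
#   (gdb) delete                  # remove all breakpoints and watchpoints
#
# Step 3, from a second terminal (hypothetical helper name):
checkpoint_debugged() {
    pid=$1
    run cr_checkpoint "$pid"
}

# Back in gdb, "c" (continue) may be needed so the process can act on the
# BLCR-generated signal and the checkpoint can complete.
```

    For example, "DRY_RUN=1 checkpoint_debugged 1234" only prints the 
    cr_checkpoint command line, which is a cheap way to check the PID 
    before doing it for real.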
    
    At restart time there is the question of attaching gdb "soon enough" to 
    regain control before the buggy code runs.  Since we had to remove all 
    the breakpoints, there seems to be nothing preventing the code from 
    executing normally, bugs and all.  If you are restarting from a point 
    early enough (say 1 minute or more) before your suspected bug then you 
    can probably just restart and then attach gdb "fast enough".  If you are 
    too slow it costs you little to try again.  However, it might not be 
    possible to do that in general.  To deal with that one can try passing 
    "--stop" to the cr_restart command, which will freeze the executable 
    (with a SIGSTOP) immediately on restart (before returning control to the 
    point where BLCR interrupted execution).  That should allow you to 
    attach a debugger, which then may need to send SIGCONT to the process to 
    resume execution.  However, I am not sure that gdb will correctly attach 
    to a stopped process.  In my experiments there were some cases where "gdb 
    <executable> <pid>" appeared to hang when the process was stopped in this 
    manner.  If so, try sending a SIGCONT from another window/terminal 
    ("kill -CONT <pid>"); hopefully that will resolve it, but it didn't 
    always do so for me.  I think this depends on the gdb and/or kernel 
    release.  In short, my recommendation if "attach gdb fast enough" isn't 
    possible is:
    1) Restart with the "--stop" command line option to freeze the process
    2) Attach gdb to the restarted-but-stopped process
    3) Send SIGCONT, either from gdb (if it attached OK) or from a command 
    line (if gdb looks "stuck").
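    
    The restart steps above might look roughly like the following.  This is 
    a sketch under assumptions: the run helper, DRY_RUN, the function names, 
    the context file name and the PID are placeholders, and each function is 
    meant to be called from its own terminal since cr_restart typically 
    stays in the foreground.

```shell
#!/bin/sh
# Sketch only.  Set DRY_RUN=1 to print each command instead of executing it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1 (terminal A): restart, frozen by SIGSTOP before it resumes.
# The context file name is whatever your checkpoint produced.
restart_stopped() {
    context_file=$1
    run cr_restart --stop "$context_file"
}

# Step 2 (terminal B): attach gdb to the restarted-but-stopped process.
# Confirm the PID with ps first rather than assuming it.
attach_debugger() {
    exe=$1
    pid=$2
    run gdb "$exe" "$pid"
}

# Step 3, if gdb attached cleanly: re-create breakpoints, then resume
# from inside gdb, for example
#   (gdb) signal SIGCONT
# Step 3, if gdb looks "stuck" (terminal C):
resume_from_shell() {
    pid=$1
    run kill -CONT "$pid"
}
```

    With DRY_RUN=1 set, "restart_stopped context.1234" prints the 
    cr_restart command line without touching any process.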
    
    Hope this helps.  Let us know if the instructions above do or do not 
    work for you.  Perhaps you'd be interested in helping to write up a 
    "mini howto" based on your experiences?
    
    -Paul
    
    Parviz Fariborz wrote:
    >
    > Hi,
    >
    > I am trying to use blcr to shorten the debug time for a large 
    > executable. I have described the approach that I have taken and the 
    > issues that I ran into below. Perhaps someone in this mailing list has 
    > done the same and can give me some guidance.
    >
    > When debugging a long running executable in gdb (multiple hours), I 
    > want to use blcr to checkpoint the running executable at a breakpoint 
    > close to the problem area where I can safely assume things are in good 
    > state. In the next round of debugging, instead of running the 
    > executable in gdb, I want to re-start the checkpoint and attach the 
    > gdb to running process. This gets me to the point of interest a lot 
    > faster.
    >
    > My questions are : Is it possible to stop a running process in gdb at 
    > a breakpoint and create a checkpoint? I tried it and was able to 
    > create the checkpoint file, But the re-start always failed with the 
    > following message :
    >
    > .Trace/BPT trap
    >
    > Also, is there a better approach? If so, please describe it.
    >
    > Thanks in advance for your help
    >
    > -Parviz
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
