Re: Using blcr for debugging

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon May 26 2008 - 12:35:34 PDT

Next message: Parviz Fariborz: "Re: Using blcr for debugging"

Previous message: Parviz Fariborz: "Using blcr for debugging"
In reply to: Parviz Fariborz: "Using blcr for debugging"
Next in thread: Parviz Fariborz: "Re: Using blcr for debugging"
Reply: Parviz Fariborz: "Re: Using blcr for debugging"

Parviz,

  BLCR is not able to save/restore the association between the debugger 
and the executable, making what you are trying slightly difficult (but 
hopefully not impossible).  For that reason, in the 0.7.0 release (due 
out soon) the default behavior will be to refuse to checkpoint while a 
debugger is attached (an additional option will need to be specified to 
allow the checkpoint in such a case).  In neither the 0.6.x nor the 0.7.0 
release will checkpointing gdb and the debugged process together (as a 
process group, process tree, etc.) work.  If it did, your task would have 
been much easier (just "cr_checkpoint <pid-of-gdb>").

  The Trace/BPT trap you see is the restarted executable executing a 
breakpoint (bpt) trap instruction that the debugger inserted.  Since at 
restart time no debugger is attached, the trap is a fatal error.  The 
problem is that any breakpoint trap instruction written by the first gdb 
is still present in the checkpointed process, having replaced 
instruction(s) in the process.  When gdb wrote that instruction into 
process memory, it would have saved the original instruction byte in its 
own memory (to restore when executing past the breakpoint, or when 
removing it).  However that information was lost when the first gdb 
exited.  This doesn't appear to have a good solution other than deleting 
all breakpoints before you take the checkpoint.  If you consult a gdb 
expert (I am not one) you may be able to get gdb to print all the 
breakpoint data in a form that can be fed back into the new gdb (or 
perhaps you only have one at this stage).  So, I recommend the following 
steps:
1) Run under the control of gdb until it stops at your "safe" breakpoint
2) Delete all breakpoints/watchpoints
3) Checkpoint the process (may require you to "c" in response to the 
BLCR-generated signal)
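
To make the failure mode concrete, here is a toy model in Python (not gdb 
or BLCR code).  It treats a bytearray as process memory and assumes the 
x86 one-byte breakpoint opcode 0xCC; ToyDebugger, has_stale_trap and the 
sample bytes are made up for illustration.  The only point is that the 
original byte lives in the debugger, so a checkpoint taken with the 
breakpoint still in place keeps the trap after the saved byte is gone.

```python
INT3 = 0xCC  # assumption: x86, where a breakpoint is the single byte 0xCC


class ToyDebugger:
    """Hypothetical stand-in for the debugger's breakpoint bookkeeping."""

    def __init__(self):
        self.saved = {}  # address -> original byte, held only by the debugger

    def set_breakpoint(self, mem, addr):
        self.saved[addr] = mem[addr]  # remember the original instruction byte
        mem[addr] = INT3              # overwrite it with the trap

    def delete_breakpoint(self, mem, addr):
        mem[addr] = self.saved.pop(addr)  # put the original byte back


def has_stale_trap(image, original):
    """True if image holds a trap byte where the original code had none."""
    return any(i == INT3 and o != INT3 for i, o in zip(image, original))


# Illustrative bytes only: push rbp; mov rbp,rsp; nop; ret
code = bytes([0x55, 0x48, 0x89, 0xE5, 0x90, 0xC3])
mem = bytearray(code)
dbg = ToyDebugger()

dbg.set_breakpoint(mem, 4)
bad_checkpoint = bytes(mem)    # "checkpoint" taken with the breakpoint set

dbg.delete_breakpoint(mem, 4)
good_checkpoint = bytes(mem)   # "checkpoint" taken after step 2 above

print(has_stale_trap(bad_checkpoint, code))
print(has_stale_trap(good_checkpoint, code))
```

Newer gdb releases have a "save breakpoints <file>" command whose output 
can be read back with "source <file>", which may cover the 
print-and-feed-back idea mentioned above; how well that works together 
with BLCR is untested here.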

At restart time there is the question of attaching gdb "soon enough" to 
regain control before the buggy code runs.  Since we had to remove all 
the breakpoints, there seems to be nothing preventing the code from 
executing normally, bugs and all.  If you are restarting from a point 
early enough (say 1 minute or more) before your suspected bug then you 
can probably just restart and then attach gdb "fast enough".  If you are 
too slow it costs you little to try again.  However, it might not be 
possible to do that in general.  To deal with that one can try passing 
"--stop" to the cr_restart command, which will freeze the executable 
(with a SIGSTOP) immediately on restart (before returning control to the 
point where BLCR interrupted execution).  That should allow you to 
attach a debugger, which then may need to send SIGCONT to the process to 
resume execution.  However, I am not sure that gdb will correctly attach 
to a stopped process.  In my experiments there were some cases where "gdb 
<executable> <pid>" appeared to hang when the process was stopped in this 
manner.  If so, try sending a SIGCONT from another window/terminal 
("kill -CONT <pid>"); hopefully that will resolve it, but it didn't 
always do so for me.  I think this depends on the gdb and/or kernel 
release.  In short, my recommendation if "attach gdb fast enough" isn't 
possible is:
1) Restart with the "--stop" command line option to freeze the process
2) Attach gdb to the restarted-but-stopped process
3) Send SIGCONT, either from gdb (if it attached OK) or from a command 
line (if gdb looks "stuck").
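
The stop/continue handshake in steps 1 and 3 can be tried without BLCR.  
The Python sketch below (Linux/Unix only) uses an ordinary child process 
as a stand-in for a process restarted with "--stop": it freezes the child 
with SIGSTOP, confirms the stop, then resumes it with SIGCONT, which is 
what "kill -CONT <pid>" does.  The function name stop_then_continue and 
the sleeping child are assumptions for illustration; nothing here calls 
cr_restart or gdb.

```python
import os
import signal
import subprocess
import sys


def stop_then_continue(argv):
    """Freeze a child with SIGSTOP, then resume it with SIGCONT.

    Returns (stopped, continued) as observed through waitpid().
    """
    child = subprocess.Popen(argv)
    try:
        # Stand-in for the freeze that "--stop" applies at restart time.
        os.kill(child.pid, signal.SIGSTOP)
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        stopped = os.WIFSTOPPED(status)

        # This is the window in which a debugger would be attached.

        # Equivalent of "kill -CONT <pid>" from another terminal.
        os.kill(child.pid, signal.SIGCONT)
        _, status = os.waitpid(child.pid, os.WCONTINUED)
        continued = os.WIFCONTINUED(status)
        return stopped, continued
    finally:
        # Clean up the stand-in process whatever happened above.
        child.kill()
        child.wait()


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(stop_then_continue(sleeper))
```

With a real restart, the pid would be that of the restarted process and 
the attach would happen in the marked window; whether gdb attaches 
cleanly there is exactly the uncertainty described above.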

Hope this helps.  Let us know if the instructions above do or do not 
work for you.  Perhaps you'd be interested in helping to write up a 
"mini howto" based on your experiences?

-Paul

Parviz Fariborz wrote:
>
> Hi,
>
> I am trying to use blcr to shorten the debug time for a large 
> executable. I have described the approach that I have taken and the 
> issues that I ran into below. Perhaps someone in this mailing list has 
> done the same and can give me some guidance.
>
> When debugging a long running executable in gdb (multiple hours), I 
> want to use blcr to checkpoint the running executable at a breakpoint 
> close to the problem area where I can safely assume things are in good 
> state. In the next round of debugging, instead of running the 
> executable in gdb, I want to re-start the checkpoint and attach the 
> gdb to running process. This gets me to the point of interest a lot 
> faster.
>
> My questions are : Is it possible to stop a running process in gdb at 
> a breakpoint and create a checkpoint? I tried it and was able to 
> create the checkpoint file, But the re-start always failed with the 
> following message :
>
> .Trace/BPT trap
>
> Also, is there a better approach? If so, please describe it.
>
> Thanks in advance for your help
>
> -Parviz

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
