From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon May 26 2008 - 12:35:34 PDT
Parviz,

BLCR is not able to save/restore the association between the debugger and the executable, making what you are trying slightly difficult (but hopefully not impossible). For that reason, in the 0.7.0 release (due out soon) the default behavior will be to refuse to checkpoint while a debugger is attached (an additional option will need to be specified to allow the checkpoint in such a case).

In neither the 0.6.x nor the 0.7.0 release will checkpointing gdb and the debugged process together (as process group, process tree, etc.) work. If it did, your task would have been much easier (just "cr_checkpoint <pid-of-gdb>").

The Trace/BPT trap you see is the restarted executable executing a breakpoint (bpt) trap instruction that the debugger inserted. Since at restart time no debugger is attached, the trap is a fatal error. The problem is that any breakpoint trap instruction written by the first gdb is still present in the checkpointed process, having replaced instruction(s) in the process. When gdb wrote that instruction into process memory, it would have saved the original instruction byte in its own memory (to restore when executing past the breakpoint, or when removing it). However, that information was lost when the first gdb exited.

This doesn't appear to have a good solution other than deleting all breakpoints before you take the checkpoint. If you consult a gdb expert (I am not one) you may be able to get gdb to print all the breakpoint data in a form that can be fed back into the new gdb (or perhaps you only have one at this stage).

So, I recommend the following steps:
1) Run under control of gdb until it stops at your "safe" breakpoint
2) Delete all breakpoints/watchpoints
3) Checkpoint the process (may require you to "c" in response to the BLCR-generated signal)

At restart time there is the question of attaching gdb "soon enough" to regain control before the buggy code runs.
Since we had to remove all the breakpoints, there seems to be nothing preventing the code from executing normally, bugs and all. If you are restarting from a point early enough (say 1 minute or more) before your suspected bug then you can probably just restart and then attach gdb "fast enough". If you are too slow it costs you little to try again.

However, it might not be possible to do that in general. To deal with that one can try passing "--stop" to the cr_restart command, which will freeze the executable (with a SIGSTOP) immediately on restart (before returning control to the point where BLCR interrupted execution). That should allow you to attach a debugger, which then may need to send SIGCONT to the process to resume execution.

However, I am not sure that gdb will correctly attach to a STOPed process. In my experiments there were some cases where "gdb <executable> <pid>" appeared to hang when the process was STOPed in this manner. If so, try sending a SIGCONT from another window/terminal ("kill -CONT <pid>"); hopefully that will resolve it, but it didn't always do so for me. I think this depends on the gdb and/or kernel release.

In short, my recommendation if "attach gdb fast enough" isn't possible is:
1) Restart with the "--stop" command line option to freeze the process
2) Attach gdb to the restarted-but-stopped process
3) Send SIGCONT, either from gdb (if it attached OK) or from a command line (if gdb looks "stuck").

Hope this helps. Let us know if the instructions above do or do not work for you. Perhaps you'd be interested in helping to write up a "mini howto" based on your experiences?

-Paul

Parviz Fariborz wrote:
>
> Hi,
>
> I am trying to use blcr to shorten the debug time for a large
> executable. I have described the approach that I have taken and the
> issues that I ran into below. Perhaps someone in this mailing list has
> done the same and can give me some guidance.
>
> When debugging a long running executable in gdb (multiple hours), I
> want to use blcr to checkpoint the running executable at a breakpoint
> close to the problem area where I can safely assume things are in good
> state. In the next round of debugging, instead of running the
> executable in gdb, I want to re-start the checkpoint and attach the
> gdb to running process. This gets me to the point of interest a lot
> faster.
>
> My questions are : Is it possible to stop a running process in gdb at
> a breakpoint and create a checkpoint? I tried it and was able to
> create the checkpoint file, But the re-start always failed with the
> following message :
>
> .Trace/BPT trap
>
> Also, is there a better approach? If so, please describe it.
>
> Thanks in advance for your help
>
> -Parviz

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
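
The two step lists in the reply can be sketched as a small shell helper that only prints the commands for each phase, since gdb and cr_restart each need their own terminal. This is an untested outline under stated assumptions: the PID 12345, the context file name context.12345, and the executable ./myprog are placeholders, and the helper names checkpoint_plan and restart_plan are hypothetical. It also assumes the restarted process keeps its original PID. The extra option that 0.7.0 will require to checkpoint with a debugger attached is not named in the message, so it is not shown.

```shell
#!/bin/sh
# Sketch only: prints the commands for each phase instead of running them.
# All PIDs, file names and executable names below are placeholders.

# Phase 1: first gdb session, stopped at the "safe" breakpoint.
checkpoint_plan() {
    pid=$1
    echo "(gdb) delete"              # remove all breakpoints/watchpoints
    echo "\$ cr_checkpoint $pid"     # run from another terminal
    echo "(gdb) c"                   # only if gdb stops on the BLCR-generated signal
}

# Phase 2: restart frozen, attach gdb, then resume.
restart_plan() {
    context=$1
    exe=$2
    pid=$3
    echo "\$ cr_restart --stop $context"   # process is frozen with SIGSTOP
    echo "\$ gdb $exe $pid"                # attach to the stopped process
    echo "\$ kill -CONT $pid"              # from another terminal, if gdb looks stuck
}

checkpoint_plan 12345
restart_plan context.12345 ./myprog 12345
```

Lines prefixed with "(gdb)" are typed at the gdb prompt; lines prefixed with "$" are typed in a separate shell. As the reply notes, whether gdb attaches cleanly to a stopped process seems to depend on the gdb and/or kernel release, so the final kill -CONT may or may not be needed.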