From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Sep 02 2008 - 14:44:08 PDT
Adolfo,

I don't know what the problem may be, but I have some suggestions on how to track it down (in the order I would try them myself):

1) Be certain that you have exactly one cr_checkpoint installed. If SGE's script is still running an old 0.5 install of BLCR, I can see where things would go wrong. Running "cr_checkpoint -V" both from the command line and from within the SGE checkpoint script should report 0.7.3 in each case.

2) Can you get a backtrace from the generated core file? The one-liner would be something like
    $ echo 'thread apply all backtrace' | gdb `which cr_checkpoint` core.XYZ
My guess is that you'll get lots of "(no debugging symbols)" messages, but there might be enough info to get a rough idea where the core dump originates. Please send ALL of the gdb output.

3) Edit SGE's checkpointing script to add '-v' to the cr_checkpoint command line. That should produce some output from cr_checkpoint showing its progress at each step, assuming the stderr from the cr_checkpoint command is being collected somewhere you can see it.

-Paul

Adolfo J. Banchio wrote:
> Hi,
>
> I have upgraded the cluster to Rocks 5.0 (Centos 5.0) and
> blcr 0.7.3 (from blcr 0.5) and now I have the following
> problem.
>
> When I checkpoint running programs directly from the
> command line it works fine.
> But the same checkpoint command when it is given
> by the SGE (batch queueing system) checkpointing
> script ends up in a core dump file.
> What I can see is that blcr started to create the
> checkpoint file ( .context...) and it then writes
> a core.PID file (I presume the PID there is the one
> from the cr_checkpoint process).
>
> I can not figure out where the difference might
> lie, since the script is run as the same user I use
> when it does work.
>
> Any help will be welcome.
>
> thanks in advance,
>
> adolfo
>

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
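
Suggestion 1 above (exactly one cr_checkpoint, reporting 0.7.3 in both environments) can be sketched as two small shell helpers. The function names are hypothetical and not part of BLCR or SGE. Because the exact output format of "cr_checkpoint -V" is not shown in this thread, the version check is only a loose substring match.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of BLCR or SGE.

# Print every executable named cr_checkpoint found in a colon-separated
# directory list ($1, defaulting to the current $PATH). More than one
# line of output means more than one install is reachable.
list_cr_checkpoint() {
    path_list="${1:-$PATH}"
    old_ifs="$IFS"
    IFS=:
    for dir in $path_list; do
        # An empty PATH element means the current directory.
        if [ -x "${dir:-.}/cr_checkpoint" ]; then
            echo "${dir:-.}/cr_checkpoint"
        fi
    done
    IFS="$old_ifs"
}

# Succeed if the version text ($2) contains the expected version ($1).
# Assumption: the version number appears verbatim in the -V output.
version_matches() {
    case "$2" in
        *"$1"*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Intended use, from the command line and again inside the SGE
# checkpoint script (not run here):
#   list_cr_checkpoint
#   version_matches 0.7.3 "$(cr_checkpoint -V 2>&1)" || echo "unexpected version" >&2
```

Running the same two calls in both places makes any difference in PATH or installed version visible, which is the comparison that matters here.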