From: Adolfo J. Banchio (banchio_at_famaf_dot_unc_dot_edu.ar)
Date: Tue Sep 02 2008 - 16:26:54 PDT
Paul,

thanks for your prompt reply. Addressing your questions:

1) There is no possibility of having something old around, since these are
new installations (the nodes were fully installed from scratch, and so was
the frontend).

2) I have backtraced the core file, but only with idb (Intel gdb), since
it is the only debugger I had installed at the moment. The output was:

-------------- begin ----------------------------
Intel(R) Debugger for applications running on Intel(R) 64, Version 10.1-35 , Build 20080310
------------------
object file name: /usr/bin/cr_checkpoint
core file name: core.27901
Reading symbols from /usr/bin/cr_checkpoint...(no debugging symbols found)...done.
Core file produced from executable cr_checkpoint
Initial part of arglist: /usr/bin/cr_checkpoint -f context_49.2 --kill 27170
Thread terminated at PC 0x00002aaaab31d055 by signal SIGABRT
line: 1 Unable to parse input as legal command or C expression.
----------------- end -----------------------

And finally, I have added a -v flag to the script and I get the following
output:

cr_async.c:198 thread_init: pthread_create() returned 12
targetfile='./.context_49.2.tmp', parent dir='.', rename=context_49.2
child killed by signal 6 (Aborted)

I hope this helps you find a clue.

best regards,
adolfo

On Tue, 2008-09-02 at 14:44 -0700, Paul H. Hargrove wrote:
> Adolfo,
>
> I don't know what the problem may be, but have some suggestions on how
> to work on tracking down the problem (in the order I would try them
> myself):
>
> 1) Be certain that you have exactly one cr_checkpoint installed. If
> SGE's script is still running an old 0.5 install of BLCR, I can see
> where things would go wrong. Running "cr_checkpoint -V" from both the
> command line and in the SGE checkpoint script should both report 0.7.3.
>
> 2) Can you get a backtrace from the generated core file? The one-liner
> would be something like
> $ echo 'thread apply all backtrace' | gdb `which cr_checkpoint` core.XYZ
> My guess is that you'll get lots of "(no debugging symbols)" messages,
> but there might be enough info to get a rough idea where the core dump
> originates. Please send ALL of the gdb output.
>
> 3) Edit SGE's checkpointing script to add '-v' to the cr_checkpoint
> command line. That should produce some output from cr_checkpoint
> showing its progress at each step, assuming the stderr from the
> cr_checkpoint command is being collected somewhere you can see it.
>
> -Paul
>
>
> Adolfo J. Banchio wrote:
> > Hi,
> >
> > I have upgraded the cluster to Rocks 5.0 (Centos 5.0) and
> > blcr 0.7.3 (from blcr 0.5) and now I have the following
> > problem.
> >
> > When I checkpoint running programs directly from the
> > command line it works fine.
> > But the same checkpoint command, when it is given
> > by the SGE (batch queueing system) checkpointing
> > script, ends up in a core dump file.
> > What I can see is that blcr started to create the
> > checkpoint file ( .context...) and it then writes
> > a core.PID file (I presume the PID there is the one
> > from the cr_checkpoint process).
> >
> > I cannot figure out where the difference might
> > lie, since the script is run as the same user I use
> > when it does work.
> >
> > Any help will be welcome.
> >
> > thanks in advance,
> >
> > adolfo

-- 
Adolfo J. Banchio <banchio_at_famaf_dot_unc_dot_edu.ar>
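
A note on the -v output above: it reports that pthread_create() returned 12, and on Linux error number 12 is ENOMEM. Since the same command works from the command line but fails under SGE, one thing that could be compared is the set of resource limits each environment hands to the process. The sketch below decodes the error number and prints a few limits that can affect thread creation; it could be run once from the interactive shell and once from inside the SGE checkpoint script. The helper names describe_errno and thread_related_limits are hypothetical, and the idea that differing limits explain this failure is an assumption, not something established in this thread.

```python
import errno
import os
import resource


def describe_errno(code):
    """Return (symbolic name, message) for an error number on this platform."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def thread_related_limits():
    """Return (soft, hard) values of limits that can affect thread creation."""
    names = ["RLIMIT_STACK", "RLIMIT_AS", "RLIMIT_NPROC"]
    limits = {}
    for name in names:
        # Not every platform defines all of these constants.
        res = getattr(resource, name, None)
        if res is not None:
            limits[name] = resource.getrlimit(res)
    return limits


if __name__ == "__main__":
    name, message = describe_errno(12)
    print(f"errno 12 = {name} ({message})")
    # A value of -1 means the limit is unlimited (resource.RLIM_INFINITY).
    for key, (soft, hard) in thread_related_limits().items():
        print(f"{key}: soft={soft} hard={hard}")
```

If the two runs print different values, that difference would be a candidate explanation worth reporting alongside the -v output.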