From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Sep 02 2008 - 20:05:00 PDT
Adolfo,

Thanks for the info. Based on the -v output it appears that pthread_create() is failing with error code 12, which on x86-64 is ENOMEM. I cannot guess why that would be the case only under SGE unless there is something in the resource limits that is preventing starting an additional thread (most likely failure to allocate the stack). However, I can't see exactly how that would happen.

Is there any way you can see about increasing the resource limits in place when the checkpoint script is run? That is where I'd start looking, but I don't really know what I'd be looking for other than trying to increase various limits until the failure goes away.

Sorry I can't suggest anything more concrete.

-Paul

Adolfo J. Banchio wrote:
> Paul,
>
> thanks for your prompt reply.
>
> Addressing your questions.
>
> 1) there is no possibility of having something old around, since
> there are new installations (the nodes are fully installed from
> scratch, and so was the frontend)
>
> 2) I have backtraced the core file, but only with idb (Intel
> gdb), since it is the only one I had installed at the moment. The output
> was
>
> -------------- begin ----------------------------
>
> Intel(R) Debugger for applications running on Intel(R) 64, Version
> 10.1-35 , Build 20080310
> ------------------
> object file name: /usr/bin/cr_checkpoint
> core file name: core.27901
> Reading symbols from /usr/bin/cr_checkpoint...(no debugging symbols
> found)...done.
> Core file produced from executable cr_checkpoint
> Initial part of arglist: /usr/bin/cr_checkpoint -f context_49.2 --kill
> 27170
> Thread terminated at PC 0x00002aaaab31d055 by signal SIGABRT
> line: 1 Unable to parse input as legal command or C expression.
>
> ----------------- end -----------------------
>
> And finally, I have added a -v flag to the script and I get
> the following output:
>
> cr_async.c:198 thread_init: pthread_create() returned 12
> targetfile='./.context_49.2.tmp', parent dir='.', rename=context_49.2
> child killed by signal 6 (Aborted)
>
> I hope this helps you find a clue.
>
> best regards,
>
> adolfo
>
>
> On Tue, 2008-09-02 at 14:44 -0700, Paul H. Hargrove wrote:
>
>> Adolfo,
>>
>> I don't know what the problem may be, but have some suggestions on how
>> to work on tracking down the problem (in the order I would try them
>> myself):
>>
>> 1) Be certain that you have exactly one cr_checkpoint installed. If
>> SGE's script is still running an old 0.5 install of BLCR, I can see
>> where things would go wrong. Running "cr_checkpoint -V" from both the
>> command line and in the SGE checkpoint script should both report 0.7.3.
>>
>> 2) Can you get a backtrace from the generated core file? The one-liner
>> would be something like
>> $ echo 'thread apply all backtrace' | gdb `which cr_checkpoint` core.XYZ
>> My guess is that you'll get lots of "(no debugging symbols)" messages,
>> but there might be enough info to get a rough idea where the core dump
>> originates. Please send ALL of the gdb output.
>>
>> 3) Edit SGE's checkpointing script to add '-v' to the cr_checkpoint
>> command line. That should produce some output from cr_checkpoint
>> showing its progress at each step, assuming the stderr from the
>> cr_checkpoint command is being collected somewhere you can see it.
>>
>> -Paul
>>
>>
>> Adolfo J. Banchio wrote:
>>
>>> Hi,
>>>
>>> I have upgraded the cluster to Rocks 5.0 (Centos 5.0) and
>>> blcr 0.7.3 (from blcr 0.5) and now I have the following
>>> problem.
>>>
>>> When I checkpoint running programs directly from the
>>> command line it works fine.
>>> But the same checkpoint command when it is given
>>> by the SGE (batch queueing system) checkpointing
>>> script ends up in a core dump file.
>>> What I can see is that blcr started to create the
>>> checkpoint file ( .context...) and it then writes
>>> a core.PID file (I presume the PID there is the one
>>> from the cr_checkpoint process).
>>>
>>> I can not figure out where the difference might
>>> lie, since the script is run as the same user I use
>>> when it does work.
>>>
>>> Any help will be welcome.
>>>
>>> thanks in advance,
>>>
>>> adolfo
>>>
>>

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
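
One way to act on the suggestion above (compare the resource limits in effect on the command line with those inside the SGE checkpoint script) is a small diagnostic like the sketch below. It is not part of BLCR or SGE. The function names and the particular set of limits are assumptions chosen for illustration, since the thread does not establish which limit, if any, is responsible. It decodes error code 12, prints the soft and hard values of a few limits that could plausibly affect thread creation, and tries to start one extra thread, loosely analogous to the pthread_create() call that failed.

```python
import errno
import os
import resource
import threading

# Limits picked for illustration only (an assumption): the thread does not
# say which limit, if any, makes pthread_create() fail under SGE.
LIMITS = {
    "stack (RLIMIT_STACK)": resource.RLIMIT_STACK,
    "address space (RLIMIT_AS)": resource.RLIMIT_AS,
    "data segment (RLIMIT_DATA)": resource.RLIMIT_DATA,
    "processes/threads (RLIMIT_NPROC)": resource.RLIMIT_NPROC,
}


def describe_errno(code):
    """Return 'NAME: message' for a numeric error code on this platform."""
    name = errno.errorcode.get(code, "unknown")
    return f"{name}: {os.strerror(code)}"


def format_limit(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def report_limits():
    """Return (label, soft, hard) tuples of strings for the limits above."""
    rows = []
    for label, which in LIMITS.items():
        soft, hard = resource.getrlimit(which)
        rows.append((label, format_limit(soft), format_limit(hard)))
    return rows


def can_start_thread():
    """Try to start one extra thread; return False if the OS refuses."""
    worker = threading.Thread(target=lambda: None)
    try:
        worker.start()
    except RuntimeError:
        # Python raises RuntimeError when the underlying thread creation fails.
        return False
    worker.join()
    return True


if __name__ == "__main__":
    print("error 12 ->", describe_errno(12))
    for label, soft, hard in report_limits():
        print(f"{label}: soft={soft} hard={hard}")
    print("can start a thread:", can_start_thread())
```

Running the script once from an interactive shell and once from inside the SGE checkpoint script, then comparing the two outputs, would show whether the batch environment imposes tighter limits. Any limit that differs is a candidate for the "increase until the failure goes away" approach described above.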