From: Adolfo J. Banchio (banchio_at_famaf_dot_unc_dot_edu.ar)
Date: Wed Sep 03 2008 - 05:05:42 PDT
Paul,

thanks again for your help.

It is actually the stack size. I ran "ulimit -a" from the script and
I see a stack size of 2621440, compared to 10240 in the shell where it
works. So it seems that within the SGE shell (which is owned by the
user) the stack is too big, and multiplied by the number of threads it
might reach some other limit (not the virtual memory limit, since I
tried with unlimited).

I have changed the SGE scripts (submit, migrate and checkpoint), adding
a line like "ulimit -s 10240" before the cr_checkpoint and cr_restart
commands, respectively, and now everything WORKS fine. Note that the
problem also arises when restarting with cr_restart.

I do not know if this stack size will bring some other troubles later
on. If you have any suggestions for the value, please advise me.

Thank you very much for your help.

sincerely,

adolfo

On Tue, 2008-09-02 at 20:05 -0700, Paul H. Hargrove wrote:
> Adolfo,
>
> Thanks for the info. Based on the -v output it appears that
> pthread_create() is failing with error code 12, which on x86-64 is
> ENOMEM. I cannot guess why that would be the case only under SGE unless
> there is something in the resource limits that is preventing starting an
> additional thread (most likely failure to allocate the stack). However,
> I can't see exactly how that would happen.
> Is there any way you can see about increasing the resource limits in
> place when the checkpoint script is run? That is where I'd start
> looking, but I don't really know what I'd be looking for other than
> trying to increase various limits until the failure goes away.
> Sorry I can't suggest anything more concrete.
>
> -Paul
>
> Adolfo J. Banchio wrote:
> > Paul,
> >
> > thanks for your prompt reply.
> >
> > Addressing your questions.
> >
> > 1) there is no possibility of having something old around, since
> > these are new installations (the nodes are fully installed from
> > scratch, and so was the frontend)
> >
> > 2) I have backtraced the core file, but only with idb (the Intel
> > debugger), since it is the only one I had installed at the moment.
> > The output was
> >
> > -------------- begin ----------------------------
> >
> > Intel(R) Debugger for applications running on Intel(R) 64, Version
> > 10.1-35 , Build 20080310
> > ------------------
> > object file name: /usr/bin/cr_checkpoint
> > core file name: core.27901
> > Reading symbols from /usr/bin/cr_checkpoint...(no debugging symbols
> > found)...done.
> > Core file produced from executable cr_checkpoint
> > Initial part of arglist: /usr/bin/cr_checkpoint -f context_49.2 --kill
> > 27170
> > Thread terminated at PC 0x00002aaaab31d055 by signal SIGABRT
> > line: 1 Unable to parse input as legal command or C expression.
> >
> > ----------------- end -----------------------
> >
> > And finally, I have added a -v flag to the script and I get
> > the following output:
> >
> > cr_async.c:198 thread_init: pthread_create() returned 12
> > targetfile='./.context_49.2.tmp', parent dir='.', rename=context_49.2
> > child killed by signal 6 (Aborted)
> >
> > I hope this helps you find a clue.
> >
> > best regards,
> >
> > adolfo
> >
> > On Tue, 2008-09-02 at 14:44 -0700, Paul H. Hargrove wrote:
> >
> >> Adolfo,
> >>
> >> I don't know what the problem may be, but have some suggestions on how
> >> to work on tracking down the problem (in the order I would try them
> >> myself):
> >>
> >> 1) Be certain that you have exactly one cr_checkpoint installed. If
> >> SGE's script is still running an old 0.5 install of BLCR, I can see
> >> where things would go wrong. Running "cr_checkpoint -V" from both the
> >> command line and in the SGE checkpoint script should both report 0.7.3.
> >>
> >> 2) Can you get a backtrace from the generated core file? The one-liner
> >> would be something like
> >> $ echo 'thread apply all backtrace' | gdb `which cr_checkpoint` core.XYZ
> >> My guess is that you'll get lots of "(no debugging symbols)" messages,
> >> but there might be enough info to get a rough idea where the code dump
> >> originates. Please send ALL of the gdb output.
> >>
> >> 3) Edit SGE's checkpointing script to add '-v' to the cr_checkpoint
> >> command line. That should produce some output from cr_checkpoint
> >> showing its progress at each step, assuming the stderr from the
> >> cr_checkpoint command is being collected somewhere you can see it.
> >>
> >> -Paul
> >>
> >> Adolfo J. Banchio wrote:
> >>
> >>> Hi,
> >>>
> >>> I have upgraded the cluster to Rocks 5.0 (Centos 5.0) and
> >>> blcr 0.7.3 (from blcr 0.5) and now I have the following
> >>> problem.
> >>>
> >>> When I checkpoint running programs directly from the
> >>> command line it works fine.
> >>> But the same checkpoint command, when it is given
> >>> by the SGE (batch queueing system) checkpointing
> >>> script, ends up in a core dump file.
> >>> What I can see is that blcr started to create the
> >>> checkpoint file ( .context...) and it then writes
> >>> a core.PID file (I presume the PID there is the one
> >>> from the cr_checkpoint process).
> >>>
> >>> I can not figure out where the difference might
> >>> lie, since the script is run as the same user I use
> >>> when it does work.
> >>>
> >>> Any help will be welcome.
> >>>
> >>> thanks in advance,
> >>>
> >>> adolfo
> >>>
> >>
> >

--
Adolfo J. Banchio <banchio_at_famaf_dot_unc_dot_edu.ar>
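
For anyone applying the same workaround, here is a minimal sketch of how the stack limit could be lowered for just the checkpoint or restart command. The thread's fix was a plain "ulimit -s 10240" line in the SGE scripts; the helper name run_with_stack_limit and the placeholder arguments in the comments are hypothetical and not part of SGE or BLCR. This variant changes only the soft limit (-S) and does so in a subshell, so the rest of the script keeps its original limits. As background: on Linux with glibc the default stack size of a new thread is normally taken from the soft stack limit in effect when the program starts, so a limit of 2621440 (about 2.5 GB if the unit is kB, as ulimit -s reports in bash) would be requested per thread, which would be consistent with pthread_create() returning ENOMEM.

```shell
#!/bin/sh
# Sketch only: run_with_stack_limit is a hypothetical helper name.
# It runs a command with a lowered soft stack limit and leaves the
# limits of the calling script untouched.

run_with_stack_limit() (
    # The function body is a subshell, so the ulimit change is local to it.
    limit_kb=$1
    shift
    # -S changes only the soft limit; the hard limit is not modified.
    ulimit -S -s "$limit_kb" || exit 1
    "$@"
)

# Demonstration: print the stack limit a child process actually sees.
run_with_stack_limit 10240 sh -c 'echo "stack limit seen by child: $(ulimit -s)"' \
    || echo "could not apply the requested stack limit" >&2

# In an SGE checkpoint or restart script the calls might look like this
# (file names and PID variables are placeholders):
#   run_with_stack_limit 10240 cr_checkpoint -f "$context_file" --kill "$pid"
#   run_with_stack_limit 10240 cr_restart "$context_file"
```

Whether 10240 is the right value is an open question in the thread; it is simply the limit that was in effect in the interactive shell where checkpointing worked.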