From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Sep 03 2008 - 23:11:54 PDT
Adolfo,

Glad things worked out. I will think about adding logic in BLCR to
request thread stacks only as large as we need. However, I don't have
any clue what size to pick, other than the 10240 that you suggest has
worked for you.

-Paul

Adolfo J. Banchio wrote:
> Paul,
>
> thanks again for your help. It is actually the
> stack size. I printed from the script "ulimit -a" and
> I see a stack size of 2621440, compared to
> 10240 in the shell where it works.
>
> So it seems that within the SGE shell (which is owned by the
> user) the stack is too big, and multiplied by the number of
> threads might reach some other limit (not the virtual memory limit,
> since I tried with unlimited).
>
> I have changed the SGE scripts (submit, migrate and checkpoint), adding
> a line like "ulimit -s 10240" before the cr_checkpoint and
> cr_restart commands, respectively, and now everything WORKS fine.
>
> Note that the problem also arises when restarting with
> cr_restart.
>
> I do not know if this stack size will bring some other troubles
> later on. If you have any suggestions for the value please
> advise me.
>
> Thank you very much for your help.
>
> sincerely,
>
> adolfo
>
> On Tue, 2008-09-02 at 20:05 -0700, Paul H. Hargrove wrote:
>
>> Adolfo,
>>
>> Thanks for the info. Based on the -v output it appears that
>> pthread_create() is failing with error code 12, which on x86-64 is
>> ENOMEM. I cannot guess why that would be the case only under SGE unless
>> there is something in the resource limits that is preventing starting an
>> additional thread (most likely failure to allocate the stack). However,
>> I can't see exactly how that would happen.
>> Is there any way you can see about increasing the resource limits in
>> place when the checkpoint script is run? That is where I'd start
>> looking, but I don't really know what I'd be looking for other than
>> trying to increase various limits until the failure goes away.
>> Sorry I can't suggest anything more concrete.
>>
>> -Paul
>>
>> Adolfo J. Banchio wrote:
>>
>>> Paul,
>>>
>>> thanks for your prompt reply.
>>>
>>> Addressing your questions.
>>>
>>> 1) there is no possibility of having something old around, since
>>> these are new installations (the nodes are fully installed from
>>> scratch, and so was the frontend)
>>>
>>> 2) I have backtraced the core file, but only with idb (Intel's
>>> gdb) since it is the only one I had installed at the moment. The output
>>> was
>>>
>>> -------------- begin ----------------------------
>>>
>>> Intel(R) Debugger for applications running on Intel(R) 64, Version
>>> 10.1-35 , Build 20080310
>>> ------------------
>>> object file name: /usr/bin/cr_checkpoint
>>> core file name: core.27901
>>> Reading symbols from /usr/bin/cr_checkpoint...(no debugging symbols
>>> found)...done.
>>> Core file produced from executable cr_checkpoint
>>> Initial part of arglist: /usr/bin/cr_checkpoint -f context_49.2 --kill
>>> 27170
>>> Thread terminated at PC 0x00002aaaab31d055 by signal SIGABRT
>>> line: 1 Unable to parse input as legal command or C expression.
>>>
>>> ----------------- end -----------------------
>>>
>>> And finally, I have added a -v flag to the script and I get
>>> the following output:
>>>
>>> cr_async.c:198 thread_init: pthread_create() returned 12
>>> targetfile='./.context_49.2.tmp', parent dir='.', rename=context_49.2
>>> child killed by signal 6 (Aborted)
>>>
>>> I hope this helps you find a clue.
>>>
>>> best regards,
>>>
>>> adolfo
>>>
>>> On Tue, 2008-09-02 at 14:44 -0700, Paul H. Hargrove wrote:
>>>
>>>> Adolfo,
>>>>
>>>> I don't know what the problem may be, but have some suggestions on how
>>>> to work on tracking down the problem (in the order I would try them
>>>> myself):
>>>>
>>>> 1) Be certain that you have exactly one cr_checkpoint installed. If
>>>> SGE's script is still running an old 0.5 install of BLCR, I can see
>>>> where things would go wrong.
>>>> Running "cr_checkpoint -V" from both the
>>>> command line and in the SGE checkpoint script should both report 0.7.3.
>>>>
>>>> 2) Can you get a backtrace from the generated core file? The one-liner
>>>> would be something like
>>>> $ echo 'thread apply all backtrace' | gdb `which cr_checkpoint` core.XYZ
>>>> My guess is that you'll get lots of "(no debugging symbols)" messages,
>>>> but there might be enough info to get a rough idea where the core dump
>>>> originates. Please send ALL of the gdb output.
>>>>
>>>> 3) Edit SGE's checkpointing script to add '-v' to the cr_checkpoint
>>>> command line. That should produce some output from cr_checkpoint
>>>> showing its progress at each step, assuming the stderr from the
>>>> cr_checkpoint command is being collected somewhere you can see it.
>>>>
>>>> -Paul
>>>>
>>>> Adolfo J. Banchio wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have upgraded the cluster to Rocks 5.0 (Centos 5.0) and
>>>>> blcr 0.7.3 (from blcr 0.5) and now I have the following
>>>>> problem.
>>>>>
>>>>> When I checkpoint running programs directly from the
>>>>> command line it works fine.
>>>>> But the same checkpoint command, when it is given
>>>>> by the SGE (batch queueing system) checkpointing
>>>>> script, ends up in a core dump file.
>>>>> What I can see is that blcr started to create the
>>>>> checkpoint file ( .context...) and it then writes
>>>>> a core.PID file (I presume the PID there is the one
>>>>> from the cr_checkpoint process).
>>>>>
>>>>> I can not figure out where the difference might
>>>>> lie, since the script is run as the same user I use
>>>>> when it does work.
>>>>>
>>>>> Any help will be welcome.
>>>>>
>>>>> thanks in advance,
>>>>>
>>>>> adolfo
>>>>>

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
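
The workaround from this thread (lowering the stack limit just before cr_checkpoint or cr_restart runs) could be sketched as a small shell helper. This is only an illustration, assuming a shell whose ulimit builtin accepts -S and -s (bash and dash do). The name run_with_stack_kb is hypothetical and is not part of BLCR or SGE, and 10240 is simply the value reported to work above, not a recommended setting.

```shell
# Hypothetical helper (not part of BLCR or SGE): run one command with a
# given soft stack limit, in KB, without touching the caller's own limit.
# The function body is a subshell, so the ulimit change and the variable
# assignment stay local to this single invocation.
run_with_stack_kb() (
    kb=$1
    shift
    # -S changes only the soft limit, so the hard limit is left alone.
    ulimit -S -s "$kb" || exit 1
    exec "$@"
)

# Usage inside a checkpoint or restart script would look like:
#   run_with_stack_kb 10240 <the existing cr_checkpoint or cr_restart command line>
```

Wrapping the call this way keeps the reduced limit scoped to the one command, so the rest of the SGE script (and the job itself) still runs under whatever limits the queue configured.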