Re: blcr 0.7.3: core dump file

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Sep 03 2008 - 23:11:54 PDT
Next message: Vincentius Robby: "Re: sparc implementation"
Previous message: Paul H. Hargrove: "Re: sparc implementation"
In reply to: Adolfo J. Banchio: "Re: blcr 0.7.3: core dump file"
Adolfo,

  Glad things worked out.

  I will think about adding logic in BLCR to request thread stacks only 
as large as we need.  However, I don't have any clue what size to pick, 
other than the 10240 (KB) that you report has worked for you.
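
  Until then, the workaround can be kept local to the BLCR commands 
rather than applied to the whole script.  A minimal sketch, assuming a 
POSIX shell: with_stack_limit is a hypothetical helper (not part of BLCR 
or SGE), and 10240 KB is only the value reported to work in this thread, 
not a recommended setting.

```shell
#!/bin/sh
# Hypothetical helper (not part of BLCR or SGE): run one command with a
# reduced soft stack limit, given in KB as accepted by "ulimit -s".
with_stack_limit() {
    limit_kb=$1
    shift
    # The parentheses start a subshell, so the lowered limit applies only
    # to the command run here and not to the rest of the calling script.
    (
        ulimit -S -s "$limit_kb" || exit 1
        exec "$@"
    )
}

# Example use in a checkpoint or restart script; the variables are
# placeholders for whatever the SGE script already passes:
#   with_stack_limit 10240 cr_checkpoint -f "$context_file" --kill "$pid"
#   with_stack_limit 10240 cr_restart "$context_file"
```

  Setting only the soft limit (-S) leaves the hard limit alone, so a 
later command in the same job can still raise the stack size if it needs 
to.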

-Paul

Adolfo J. Banchio wrote:
> Paul,
>
> thanks again for your help. The problem is actually the
> stack size. I ran "ulimit -a" from the script and
> I see a stack size of 2621440, compared to
> 10240 in the shell where it works.
>
> So it seems that within the SGE shell (which is owned by the
> user) the stack is too big and, multiplied by the number of
> threads, might reach some other limit (not the virtual memory limit,
> since I tried with unlimited).
>
> I have changed the SGE scripts (submit, migrate and checkpoint), adding
> a line like "ulimit -s 10240" before the cr_checkpoint and
> cr_restart commands, respectively, and now everything WORKS fine.
>
> Note that the problem also arises when restarting with
> cr_restart.
>
> I do not know if this stack size will cause other trouble
> later on. If you have any suggestions for the value, please
> advise me.
>
> Thank you very much for your help.
>
> sincerely,
>
> adolfo
>
>
>
> On Tue, 2008-09-02 at 20:05 -0700, Paul H. Hargrove wrote:
>   
>> Adolfo,
>>
>>   Thanks for the info.  Based on the -v output it appears that 
>> pthread_create() is failing with error code 12, which on x86-64 is 
>> ENOMEM.  I cannot guess why that would be the case only under SGE unless 
>> there is something in the resource limits that is preventing starting an 
>> additional thread (most likely failure to allocate the stack).  However, 
>> I can't see exactly how that would happen.
>>   Is there any way you can see about increasing the resource limits in 
>> place when the checkpoint script is run?  That is where I'd start 
>> looking, but I don't really know what I'd be looking for other than 
>> trying to increase various limits until the failure goes away.
>>   Sorry I can't suggest anything more concrete.
>>
>> -Paul
>>
>> Adolfo J. Banchio wrote:
>>     
>>> Paul,
>>>
>>> thanks for your prompt reply.
>>>
>>> Addressing your questions. 
>>>
>>> 1) there is no possibility of having something old around, since
>>> these are new installations (the nodes were fully installed from
>>> scratch, and so was the frontend)
>>>
>>> 2) I have backtraced the core file, but only with idb (the Intel
>>> debugger), since it is the only one I had installed at the moment.
>>> The output was
>>>
>>> --------------  begin ----------------------------
>>>
>>> Intel(R) Debugger for applications running on Intel(R) 64, Version
>>> 10.1-35 , Build 20080310
>>> ------------------
>>> object file name: /usr/bin/cr_checkpoint
>>> core file name: core.27901
>>> Reading symbols from /usr/bin/cr_checkpoint...(no debugging symbols
>>> found)...done.
>>> Core file produced from executable cr_checkpoint
>>> Initial part of arglist: /usr/bin/cr_checkpoint -f context_49.2 --kill
>>> 27170
>>> Thread terminated at PC 0x00002aaaab31d055 by signal SIGABRT
>>> line: 1 Unable to parse input as legal command or C expression.
>>>
>>> ----------------- end  -----------------------
>>>
>>> And finally, I have added a -v flag to the script and I get
>>> the following output:
>>>
>>>
>>> cr_async.c:198 thread_init: pthread_create() returned 12
>>> targetfile='./.context_49.2.tmp', parent dir='.', rename=context_49.2
>>> child killed by signal 6 (Aborted)
>>>
>>>
>>>
>>> I hope this helps you find a clue.
>>>
>>> best regards,
>>>
>>> adolfo
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 2008-09-02 at 14:44 -0700, Paul H. Hargrove wrote:
>>>   
>>>       
>>>> Adolfo,
>>>>
>>>> I don't know what the problem may be, but I have some suggestions on 
>>>> how to track it down (in the order I would try them 
>>>> myself):
>>>>
>>>> 1) Be certain that you have exactly one cr_checkpoint installed.  If 
>>>> SGE's script is still running an old 0.5 install of BLCR, I can see 
>>>> where things would go wrong.  Running "cr_checkpoint -V" from both the 
>>>> command line and in the SGE checkpoint script should both report 0.7.3.
>>>>
>>>> 2) Can you get a backtrace from the generated core file?  The one-liner 
>>>> would be something like
>>>>     $ echo 'thread apply all backtrace' | gdb `which cr_checkpoint` core.XYZ
>>>> My guess is that you'll get lots of "(no debugging symbols)" messages, 
>>>> but there might be enough info to get a rough idea where the core dump 
>>>> originates.  Please send ALL of the gdb output.
>>>>
>>>> 3) Edit SGE's checkpointing script to add '-v' to the cr_checkpoint 
>>>> command line.  That should produce some output from cr_checkpoint 
>>>> showing its progress at each step, assuming the stderr from the 
>>>> cr_checkpoint command is being collected somewhere you can see it.
>>>>
>>>> -Paul
>>>>
>>>>
>>>> Adolfo J. Banchio wrote:
>>>>     
>>>>         
>>>>> Hi,
>>>>>
>>>>> I have upgraded the cluster to Rocks 5.0 (CentOS 5.0) and
>>>>> blcr 0.7.3 (from blcr 0.5) and now I have the following
>>>>> problem.
>>>>>
>>>>> When I checkpoint running programs directly from the
>>>>> command line it works fine.
>>>>> But the same checkpoint command when it is given 
>>>>> by the SGE (batch queueing system) checkpointing
>>>>> script ends up in a core dump file.
>>>>> What I can see is that blcr started to create the
>>>>> checkpoint file ( .context...) and it then writes
>>>>> a core.PID file (I presume the PID there is the one
>>>>> from the cr_checkpoint process). 
>>>>>
>>>>> I cannot figure out where the difference might 
>>>>> lie, since the script is run as the same user I use 
>>>>> when it does work.
>>>>>
>>>>> Any help will be welcome.
>>>>>
>>>>> thanks in advance,
>>>>>
>>>>> adolfo
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>   
>>>>>       
>>>>>           
>>>>     
>>>>         
>>     


-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900