Re: blcr 0.7.3: core dump file

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Sep 03 2008 - 23:11:54 PDT

  • Next message: Vincentius Robby: "Re: sparc implementation"
    Adolfo,
    
      Glad things worked out.
    
      I will think about adding logic in BLCR to request thread stacks only 
    as large as we need.  However, I don't have any clue what size to pick, 
    other than the 10240 that you report has worked for you.
    
    -Paul
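
For scale, the two stack limits reported in the quoted message below can be compared directly. This is a minimal sketch, assuming the numbers are in KiB (the unit bash's "ulimit -s" uses) and assuming, as the Linux pthread_create(3) man page describes for NPTL, that the default thread stack size is taken from the soft stack limit in effect when the program started. The function names and the thread count of 8 are purely illustrative.

```shell
# Convert a stack limit in KiB (the unit bash's "ulimit -s" uses) to MiB.
stack_kib_to_mib() {
    echo $(( $1 / 1024 ))
}

# Address space reserved for thread stacks, in MiB, if each of N threads
# gets a default-sized stack. $1 = limit in KiB, $2 = thread count.
total_stack_mib() {
    echo $(( $1 / 1024 * $2 ))
}

stack_kib_to_mib 2621440    # limit seen inside the SGE script
stack_kib_to_mib 10240      # limit in the interactive shell
total_stack_mib 2621440 8   # hypothetical thread count of 8
```

Under these assumptions each thread stack would reserve 2560 MiB (2.5 GiB) inside the SGE script, against 10 MiB in the interactive shell, which would be consistent with a stack allocation failing for an additional thread.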
    
    Adolfo J. Banchio wrote:
    > Paul,
    >
    > thanks again for your help. It actually is the
    > stack size. I ran "ulimit -a" from the script and
    > I see a stack size of 2621440, compared to
    > 10240 in the shell where it works.
    >
    > So it seems that within the SGE shell (which is owned by the
    > user) the stack is too big and, multiplied by the number of
    > threads, might reach some other limit (not the virtual memory limit,
    > since I tried with unlimited).
    >
    > I have changed the SGE scripts (submit, migrate and checkpoint), adding
    > a line like "ulimit -s 10240" before the cr_checkpoint and
    > cr_restart commands, respectively, and now everything WORKS fine.
    >
    > Note that the problem also arises when restarting with
    > cr_restart.
    >
    > I do not know whether this stack size will cause other trouble
    > later on. If you have any suggestions for the value, please
    > advise me.
    >
    > Thank you very much for your help.
    >
    > sincerely,
    >
    > adolfo
    >
    >
    >
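
The workaround described in the message above could be wrapped in a small helper. This is a minimal sketch, assuming a shell whose ulimit builtin accepts -S and -s (bash does). The name with_stack_limit is hypothetical, and the cr_checkpoint arguments in the comment are copied from the log quoted later in this thread.

```shell
# Run a command with the soft stack limit lowered to $1 KiB.
# The subshell keeps the change from leaking into the calling script.
with_stack_limit() {
    (
        ulimit -S -s "$1" || exit 1
        shift
        "$@"
    )
}

# Show the limit a child process sees, then the caller's unchanged limit.
with_stack_limit 1024 sh -c 'ulimit -s'
ulimit -s

# Hypothetical use inside the SGE checkpoint script:
# with_stack_limit 10240 cr_checkpoint -f context_49.2 --kill 27170
```

Unlike a bare "ulimit -s 10240" line, this sketch changes only the soft limit and only for the one command, so the rest of the script keeps whatever limits SGE gave it.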
    > On Tue, 2008-09-02 at 20:05 -0700, Paul H. Hargrove wrote:
    >   
    >> Adolfo,
    >>
    >>   Thanks for the info.  Based on the -v output it appears that 
    >> pthread_create() is failing with error code 12, which on x86-64 is 
    >> ENOMEM.  I cannot guess why that would be the case only under SGE unless 
    >> there is something in the resource limits that is preventing starting an 
    >> additional thread (most likely failure to allocate the stack).  However, 
    >> I can't see exactly how that would happen.
    >>   Is there any way you can see about increasing the resource limits in 
    >> place when the checkpoint script is run?  That is where I'd start 
    >> looking, but I don't really know what I'd be looking for other than 
    >> trying to increase various limits until the failure goes away.
    >>   Sorry I can't suggest anything more concrete.
    >>
    >> -Paul
    >>
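
One way to act on the suggestion above is to record the limits from inside the checkpoint script and compare them with those of an interactive shell. This is a minimal sketch, assuming a shell whose ulimit builtin accepts -S, -H and -a (bash does); the function name dump_limits is hypothetical.

```shell
# Print the soft and hard resource limits of the current shell.
dump_limits() {
    echo "== soft limits =="
    ulimit -S -a
    echo "== hard limits =="
    ulimit -H -a
}

# Write to stderr so the output lands wherever the script's errors are collected.
dump_limits >&2
```

Saving the output from both environments and running diff on the two files would make any mismatch, such as a different stack size, easy to spot.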
    >> Adolfo J. Banchio wrote:
    >>     
    >>> Paul,
    >>>
    >>> thanks for your prompt reply.
    >>>
    >>> Addressing your questions. 
    >>>
    >>> 1) there is no possibility of having something old around, since
    >>> these are new installations (the nodes were fully installed from
    >>> scratch, and so was the frontend)
    >>>
    >>> 2) I have backtraced the core file, but only with idb (the Intel
    >>> debugger), since it is the only one I had installed at the moment.
    >>> The output was
    >>>
    >>> --------------  begin ----------------------------
    >>>
    >>> Intel(R) Debugger for applications running on Intel(R) 64, Version
    >>> 10.1-35 , Build 20080310
    >>> ------------------
    >>> object file name: /usr/bin/cr_checkpoint
    >>> core file name: core.27901
    >>> Reading symbols from /usr/bin/cr_checkpoint...(no debugging symbols
    >>> found)...done.
    >>> Core file produced from executable cr_checkpoint
    >>> Initial part of arglist: /usr/bin/cr_checkpoint -f context_49.2 --kill
    >>> 27170
    >>> Thread terminated at PC 0x00002aaaab31d055 by signal SIGABRT
    >>> line: 1 Unable to parse input as legal command or C expression.
    >>>
    >>> ----------------- end  -----------------------
    >>>
    >>> And finally, I have added a -v flag to the script and I get
    >>> the following output:
    >>>
    >>>
    >>> cr_async.c:198 thread_init: pthread_create() returned 12
    >>> targetfile='./.context_49.2.tmp', parent dir='.', rename=context_49.2
    >>> child killed by signal 6 (Aborted)
    >>>
    >>>
    >>>
    >>> I hope this helps you find a clue.
    >>>
    >>> best regards,
    >>>
    >>> adolfo
    >>>
    >>>
    >>>
    >>>
    >>>
    >>>
    >>> On Tue, 2008-09-02 at 14:44 -0700, Paul H. Hargrove wrote:
    >>>   
    >>>       
    >>>> Adolfo,
    >>>>
    >>>> I don't know what the problem may be, but I have some suggestions on how 
    >>>> to work on tracking down the problem (in the order I would try them 
    >>>> myself):
    >>>>
    >>>> 1) Be certain that you have exactly one cr_checkpoint installed.  If 
    >>>> SGE's script is still running an old 0.5 install of BLCR, I can see 
    >>>> where things would go wrong.  Running "cr_checkpoint -V" from the 
    >>>> command line and from the SGE checkpoint script should both report 0.7.3.
    >>>>
    >>>> 2) Can you get a backtrace from the generated core file?  The one-liner 
    >>>> would be something like
    >>>>     $ echo 'thread apply all backtrace' | gdb `which cr_checkpoint` core.XYZ
    >>>> My guess is that you'll get lots of "(no debugging symbols)" messages, 
    >>>> but there might be enough info to get a rough idea where the core dump 
    >>>> originates.  Please send ALL of the gdb output.
    >>>>
    >>>> 3) Edit SGE's checkpointing script to add '-v' to the cr_checkpoint 
    >>>> command line.  That should produce some output from cr_checkpoint 
    >>>> showing its progress at each step, assuming the stderr from the 
    >>>> cr_checkpoint command is being collected somewhere you can see it.
    >>>>
    >>>> -Paul
    >>>>
    >>>>
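
For the first suggestion above, it can help to list every copy of a program that the search path could pick up. This is a minimal sketch in plain POSIX shell; the function name find_all_on_path is hypothetical, and it only inspects $PATH, so a script that calls cr_checkpoint by absolute path would need checking separately.

```shell
# Print every executable called $1 found in the directories of $PATH,
# one per line, in search order.
find_all_on_path() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        # An empty PATH entry means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
    IFS=$old_ifs
}

# More than one line of output here would point to a stale install.
find_all_on_path cr_checkpoint
```

Running the same function from the interactive shell and from the SGE script would also show whether the two environments search different directories.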
    >>>> Adolfo J. Banchio wrote:
    >>>>     
    >>>>         
    >>>>> Hi,
    >>>>>
    >>>>> I have upgraded the cluster to Rocks 5.0 (CentOS 5.0) and
    >>>>> blcr 0.7.3 (from blcr 0.5) and now I have the following
    >>>>> problem.
    >>>>>
    >>>>> When I checkpoint running programs directly from the
    >>>>> command line, it works fine.
    >>>>> But the same checkpoint command, when issued
    >>>>> by the SGE (batch queueing system) checkpointing
    >>>>> script, ends in a core dump.
    >>>>> What I can see is that blcr starts to create the
    >>>>> checkpoint file ( .context...) and then writes
    >>>>> a core.PID file (I presume the PID there is the one
    >>>>> from the cr_checkpoint process).
    >>>>>
    >>>>> I cannot figure out where the difference might
    >>>>> lie, since the script is run by the same user I use
    >>>>> when it does work.
    >>>>>
    >>>>> Any help will be welcome.
    >>>>>
    >>>>> thanks in advance,
    >>>>>
    >>>>> adolfo
    >>>>>
    >>>>>
    >>>>>
    >>>>>
    >>>>>
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
