Re: blcr 0.7.3: core dump file

From: Adolfo J. Banchio (banchio_at_famaf_dot_unc_dot_edu.ar)
Date: Tue Sep 02 2008 - 16:26:54 PDT

    Paul,
    
    thanks for your prompt reply.
    
    Addressing your questions. 
    
    1) There is no possibility of having something old around, since
    these are new installations (the nodes were fully installed from
    scratch, and so was the frontend).
    
    2) I have backtraced the core file, but only with idb (the Intel
    debugger), since it is the only debugger I had installed at the
    moment. The output was
    
    --------------  begin ----------------------------
    
    Intel(R) Debugger for applications running on Intel(R) 64, Version
    10.1-35 , Build 20080310
    ------------------
    object file name: /usr/bin/cr_checkpoint
    core file name: core.27901
    Reading symbols from /usr/bin/cr_checkpoint...(no debugging symbols
    found)...done.
    Core file produced from executable cr_checkpoint
    Initial part of arglist: /usr/bin/cr_checkpoint -f context_49.2 --kill
    27170
    Thread terminated at PC 0x00002aaaab31d055 by signal SIGABRT
    line: 1 Unable to parse input as legal command or C expression.
    
    ----------------- end  -----------------------
    
    And finally, I have added the -v flag in the script, and I get
    the following output:
    
    
    cr_async.c:198 thread_init: pthread_create() returned 12
    targetfile='./.context_49.2.tmp', parent dir='.', rename=context_49.2
    child killed by signal 6 (Aborted)
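
The 12 in the first line above is a numeric errno value; on Linux it corresponds to ENOMEM. One possible reading, offered only as an assumption, is that thread creation fails because of a resource limit (stack size or virtual memory) that differs between the interactive shell and the SGE environment. The sketch below decodes the value and prints the limits that seem most relevant. The helper names `describe_errno` and `limits_affecting_threads` are hypothetical and are not part of BLCR or SGE.

```python
import errno
import os
import resource


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


def limits_affecting_threads():
    """Collect (soft, hard) limits that may influence thread creation.

    The choice of limits is an assumption about likely causes,
    not something reported by cr_checkpoint itself.
    """
    names = ["RLIMIT_STACK", "RLIMIT_AS", "RLIMIT_NPROC"]
    return {
        n: resource.getrlimit(getattr(resource, n))
        for n in names
        if hasattr(resource, n)
    }


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    name, msg = describe_errno(12)
    print(f"pthread_create() returned 12 -> {name}: {msg}")
    for limit, (soft, hard) in limits_affecting_threads().items():
        print(f"{limit}: soft={fmt(soft)} hard={fmt(hard)}")
```

Running it once from an interactive shell and once from inside the SGE checkpoint script would show whether the two environments impose different limits.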
    
    
    
    I hope this helps you find a clue.
    
    best regards,
    
    adolfo
    
    
    
    
    
    
    On Tue, 2008-09-02 at 14:44 -0700, Paul H. Hargrove wrote:
    > Adolfo,
    > 
    > I don't know what the problem may be, but have some suggestions on how 
    > to work on tracking down the problem (in the order I would try them 
    > myself):
    > 
    > 1) Be certain that you have exactly one cr_checkpoint installed.  If 
    > SGE's script is still running an old 0.5 install of BLCR, I can see 
    > where things would go wrong.  Running "cr_checkpoint -V" from both the 
    > command line and in the SGE checkpoint script should both report 0.7.3.
    > 
    > 2) Can you get a backtrace from the generated core file?  The one-liner 
    > would be something like
    >     $ echo 'thread apply all backtrace' | gdb `which cr_checkpoint` core.XYZ
    > My guess is that you'll get lots of "(no debugging symbols)" messages, 
    > but there might be enough info to get a rough idea where the core dump 
    > originates.  Please send ALL of the gdb output.
    > 
    > 3) Edit SGE's checkpointing script to add '-v' to the cr_checkpoint 
    > command line.  That should produce some output from cr_checkpoint 
    > showing its progress at each step, assuming the stderr from the 
    > cr_checkpoint command is being collected somewhere you can see it.
    > 
    > -Paul
    > 
    > 
    > Adolfo J. Banchio wrote:
    > > Hi,
    > >
    > > I have upgraded the cluster to Rocks 5.0 (Centos 5.0) and
    > > blcr 0.7.3 (from blcr 0.5) and now I have the following
    > > problem.
    > >
    > > When I checkpoint running programs directly from the
    > > command line it works fine.
    > > But the same checkpoint command when it is given 
    > > by the SGE (batch queueing system) checkpointing
    > > script ends up in a core dump file.
    > > What I can see is that blcr started to create the
    > > checkpoint file ( .context...) and it then writes
    > > a core.PID file (I presume the PID there is the one
    > > from the cr_checkpoint process). 
    > >
    > > I cannot figure out where the difference might 
    > > lie, since the script is run as the same user I use 
    > > when it does work.
    > >
    > > Any help will be welcome.
    > >
    > > thanks in advance,
    > >
    > > adolfo
    > >
    > >
    > >
    > >
    > >
    > >   
    > 
    > 
    -- 
    Adolfo J. Banchio <banchio_at_famaf_dot_unc_dot_edu.ar>
    
