Re: blcr 0.7.3: core dump file

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Sep 02 2008 - 14:44:08 PDT

    Adolfo,
    
    I don't know what the problem may be, but I have some suggestions for 
    tracking it down (in the order I would try them myself):
    
    1) Be certain that you have exactly one cr_checkpoint installed.  If 
    SGE's script is still running an old 0.5 install of BLCR, I can see 
    where things would go wrong.  Running "cr_checkpoint -V" from the 
    command line and from the SGE checkpoint script should report 0.7.3 
    in both cases.
    
    2) Can you get a backtrace from the generated core file?  The one-liner 
    would be something like
        $ echo 'thread apply all backtrace' | gdb `which cr_checkpoint` core.XYZ
    My guess is that you'll get lots of "(no debugging symbols)" messages, 
    but there might be enough info to get a rough idea where the core dump 
    originates.  Please send ALL of the gdb output.
    
    3) Edit SGE's checkpointing script to add '-v' to the cr_checkpoint 
    command line.  That should produce some output from cr_checkpoint 
    showing its progress at each step, assuming the stderr from the 
    cr_checkpoint command is being collected somewhere you can see it.
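
    As a rough sketch of check 1: the snippet below walks PATH and prints 
    every executable copy of a command together with the first line of its 
    "-V" output.  The function names find_all_in_path and report_versions 
    are made up for illustration and are not part of BLCR; the only 
    cr_checkpoint option assumed is the "-V" mentioned above.

```shell
# List every executable named $1 found along PATH, in search order.
find_all_in_path() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        # An empty PATH entry means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
    IFS=$old_ifs
}

# Print "path: version" for each copy, using the first line of "-V" output.
# stdin is redirected so the command cannot swallow the list being read.
report_versions() {
    find_all_in_path "$1" | while IFS= read -r exe; do
        ver=$("$exe" -V 2>&1 </dev/null | head -n 1)
        printf '%s: %s\n' "$exe" "$ver"
    done
}

report_versions cr_checkpoint
```

    More than one line of output means more than one install is reachable.  
    Running the same lines from inside the SGE checkpoint script, with the 
    output sent to a file you can read, would show whether the script sees 
    a different PATH than your login shell does.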
    
    -Paul
    
    
    Adolfo J. Banchio wrote:
    > Hi,
    >
    > I have upgraded the cluster to Rocks 5.0 (Centos 5.0) and
    > blcr 0.7.3 (from blcr 0.5) and now I have the following
    > problem.
    >
    > When I checkpoint running programs directly from the
    > command line it works fine.
    > But the same checkpoint command when it is given 
    > by the SGE (batch queueing system) checkpointing
    > script ends up in a core dump file.
    > What I can see is that blcr started to create the
    > checkpoint file ( .context...) and it then writes
    > a core.PID file (I presume the PID there is the one
    > from the cr_checkpoint process). 
    >
    > I can not figure out where the difference might 
    > lie, since the script is run as the same user I use 
    > when it does work.
    >
    > Any help will be welcome.
    >
    > thanks in advance,
    >
    > adolfo
    >
    >
    >
    >
    >
    >   
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
