Re: program segfault after restart

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Feb 23 2009 - 12:09:46 PST

    When a user reports that they can restart on the machine where a 
    checkpoint was taken, but are unable to do so on another identical node, 
    the problem has almost always been "prelinking", which is described in 
    the BLCR FAQ.  
    I suggest you FIRST try disabling prelinking as described there.  Once 
    that is done you will need to take a new checkpoint (don't try 
    restarting one that may have included prelinked libs).  HOWEVER, I am 
    not certain that prelinking is your problem.
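
    One way to test the prelinking hypothesis before taking a new
    checkpoint is to check whether the shared libraries are byte-identical
    on the two nodes, since prelinking rewrites libraries on disk
    separately on each node.  The following is only a minimal sketch and
    is not part of BLCR: the function names are hypothetical, and the list
    of library paths is assumed to come from running "ldd" on the
    executable.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large libraries are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def library_manifest(paths):
    """Map each path to its digest; paths that are not regular files map to None."""
    manifest = {}
    for p in paths:
        p = Path(p)
        manifest[str(p)] = file_digest(p) if p.is_file() else None
    return manifest


def differing_libraries(manifest_a, manifest_b):
    """Return the sorted paths whose digests differ between two manifests."""
    keys = set(manifest_a) | set(manifest_b)
    return sorted(k for k in keys if manifest_a.get(k) != manifest_b.get(k))
```

    Run library_manifest() with the same path list on each node, save the
    results (for example as JSON on the shared NFS directory), and compare
    them with differing_libraries().  A non-empty result is consistent
    with prelinking but does not prove it; differing package versions
    would show up the same way.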
    Since your application has produced output after restart, your case 
    doesn't quite follow the pattern of the prelinking problem, in which the 
    application dies in the very first library call it makes.  It is 
    possible that in your case not all libraries are prelinked and that libc 
    is able to produce output while some other library is SEGFAULTing at 
    exit time.  However, I think that is a bit of a stretch.  So, I think we 
    need to consider other possibilities, too.
    You say "NPB", which is an acronym for NAS Parallel Benchmarks, but you 
    say "serial benchmarks" so I am assuming you are looking at the non-MPI 
    versions of these codes (otherwise I'd suggest looking at the MPI 
    implementation).  If you ARE using MPI, OpenMP or UPC versions, please 
    let me know and we can look there for possible problems.
    If disabling prelinking does not resolve the SEGFAULT then I don't have 
    a guess as to where the fault may be.  So, if prelinking is not the 
    problem you will need to get a core file from the faulting program and 
    use gdb (or other debugger of your choosing) to determine where the 
    fault is happening.  If you are not familiar with postmortem debugging 
    then you should be able to find many fine debugger tutorials online (the 
    subject is too large to cover by email).
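
    For the core-file step, a minimal sketch (an illustration, not
    something BLCR provides) is a small wrapper that raises the core-file
    size limit and reports which signal, if any, terminated the command it
    ran.  The function name is hypothetical, and whether a core file
    actually appears still depends on the hard limit and on the system's
    core-dump settings.

```python
import resource
import signal
import subprocess


def run_and_report(argv):
    """Run argv with core dumps allowed; return the fatal signal name or None."""
    # Raise the soft core-size limit to the hard limit. Child processes
    # inherit this limit. If the hard limit is 0, core files stay disabled.
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

    proc = subprocess.run(argv)
    # subprocess reports death by signal N as a return code of -N.
    if proc.returncode < 0:
        return signal.Signals(-proc.returncode).name
    return None
```

    Called with the restart command as argv, a return value of "SIGSEGV"
    would confirm the fault, assuming the restart command itself is
    terminated by that signal, which should be verified.  The resulting
    core file can then be opened with "gdb <executable> <corefile>" and
    inspected with gdb's "bt" command to see where the fault occurred.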
    Of course, I suppose you have the option to just ignore the fault if it 
    doesn't prevent you from getting results from your intended 
    application(s).  However, if you are willing, I'd appreciate your help 
    in identifying the cause of your failure in case there is something that 
    BLCR should be doing differently.
    Hongjia Cao wrote:
    > I encountered a problem about BLCR 0.8.0.
    > I run the NPB serial benchmarks on several compute nodes of our cluster
    > and make checkpoints of them. The checkpoint process is OK and the
    > programs can be restarted from the context files from the same node
    > where it is checkpointed. But if I try to restart the program from
    > another node, which has the same architecture(x86_64), kernel(Linux
    > 2.6.28-8.1.8-el5), and executable(shared NFS directory), the program
    > will report a segmentation fault after running successfully to the end:
    > ...
    >  SP Benchmark Completed.
    >  Class           =                        B
    >  Size            =            102x 102x 102
    >  Iterations      =                      400
    >  Time in seconds =                   804.56
    >  Mop/s total     =                   441.25
    >  Operation type  =           floating point
    >  Verification    =               SUCCESSFUL
    >  Version         =                      3.3
    >  Compile date    =              19 Feb 2009
    >  Compile options:
    >     F77          = ifort
    >     FLINK        = $(F77)
    >     F_LIB        = (none)
    >     F_INC        = (none)
    >     FFLAGS       = -O
    >     FLINKFLAGS   = -O
    >     RAND         = (none)
    >  Please send all errors/feedbacks to:
    >  NPB Development Team
    >  npb_at_nas_dot_nasa_dot_gov
    > Segmentation fault
    > I wonder if anybody else has run into this problem before.
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
