Re: program segfault after restart


From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Feb 23 2009 - 12:09:46 PST

Next message: Paul H. Hargrove: "Re: using blcr on program with fork"

Previous message: Paul H. Hargrove: "Re: What does this error message mean?"
Maybe in reply to: Hongjia Cao: "program segfault after restart"
Next in thread: Hongjia Cao: "Re: program segfault after restart"

When a user reports that they can restart on the machine where a
checkpoint was taken, but cannot do so on another identical node,
the problem has almost always been "prelinking", which is described in
the BLCR FAQ: http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink .
I suggest you FIRST try disabling prelinking as described there.  Once
that is done you will need to take a new checkpoint (don't try
restarting an old one, which may have included prelinked libs).
HOWEVER, I am not certain that prelinking is your problem.
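
One rough way to see whether prelinking could be involved is to compare
the shared libraries the benchmark loads on each node.  The Python
sketch below is only an illustration, not part of BLCR: it assumes
`ldd` is available, and the function names (parse_ldd, sha256_of,
library_fingerprints) are made up for this example.  It prints a
checksum for each library so that the output from two nodes can be
diffed.  Differing checksums for the same library path would be
consistent with per-node prelinking, though they could also just mean
different package versions.

```python
import hashlib
import re
import subprocess
import sys


def parse_ldd(ldd_output):
    """Return the sorted absolute library paths found in `ldd` output text."""
    paths = set()
    for line in ldd_output.splitlines():
        # Matches "libc.so.6 => /lib64/libc.so.6 (0x...)" as well as
        # "/lib64/ld-linux-x86-64.so.2 (0x...)"; entries with no path
        # (such as the vdso) are skipped.
        match = re.search(r"(/\S+)\s+\(0x[0-9a-fA-F]+\)", line)
        if match:
            paths.add(match.group(1))
    return sorted(paths)


def sha256_of(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def library_fingerprints(executable):
    """Map each shared library that `ldd` reports to its checksum."""
    output = subprocess.run(
        ["ldd", executable], capture_output=True, text=True, check=True
    ).stdout
    return {path: sha256_of(path) for path in parse_ldd(output)}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Run on each node with the benchmark binary as the only argument,
    # then diff the two outputs.
    for lib_path, lib_digest in library_fingerprints(sys.argv[1]).items():
        print(lib_digest, lib_path)
```

Run it on both nodes against the same executable and compare the
results line by line.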

Since your application has produced output after restart, your case 
doesn't quite follow the pattern of the prelinking problem, in which the 
application dies in the very first library call it makes.  It is 
possible that in your case not all libraries are prelinked and that libc 
is able to produce output while some other library is SEGFAULTing at 
exit time.  However, I think that is a bit of a stretch.  So, I think we 
need to consider other possibilities, too.

You say "NPB", which is an acronym for NAS Parallel Benchmarks, but you
also say "serial benchmarks", so I am assuming you are looking at the
non-MPI versions of these codes (otherwise I'd suggest looking at the
MPI implementation).  If you ARE using MPI, OpenMP or UPC versions,
please let me know and we can look there for possible problems.

If disabling prelinking does not resolve the SEGFAULT then I don't have 
a guess as to where the fault may be.  So, if prelinking is not the 
problem you will need to get a core file from the faulting program and 
use gdb (or other debugger of your choosing) to determine where the 
fault is happening.  If you are not familiar with postmortem debugging 
then you should be able to find many fine debugger tutorials online (the 
subject is too large to cover by email).
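
As a starting point for collecting a core file, here is a minimal
Python sketch that raises the core-size limit for a child process and
reports which signal, if any, terminated it.  The helper name
run_with_core_dumps is hypothetical.  Where the core file is written
depends on the system's configuration, and whether cr_restart's own
exit status reflects a signal that killed the restarted program is an
assumption you would need to check.

```python
import resource
import signal
import subprocess


def run_with_core_dumps(command):
    """Run `command` with the core-size soft limit raised to the hard limit.

    Returns (returncode, signal_name), where signal_name is None unless
    the child was terminated by a signal.
    """

    def raise_core_limit():
        # Executed in the child just before exec: allow the largest
        # core file that the hard limit permits.
        _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
        resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

    completed = subprocess.run(command, preexec_fn=raise_core_limit)
    if completed.returncode < 0:
        # subprocess reports death by signal N as a return code of -N.
        return completed.returncode, signal.Signals(-completed.returncode).name
    return completed.returncode, None


# Example (the context file name below is hypothetical):
#   run_with_core_dumps(["cr_restart", "context.12345"])
```

If a core file is produced, loading it with "gdb <executable> <core>"
and typing "bt" at the gdb prompt prints the backtrace of the faulting
thread.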

Of course, I suppose you have the option to just ignore the fault if it 
doesn't prevent you from getting results from your intended 
application(s).  However, if you are willing, I'd appreciate your help 
in identifying the cause of your failure in case there is something that 
BLCR should be doing differently.

-Paul

Hongjia Cao wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> I encountered a problem about BLCR 0.8.0.
>
> I run the NPB serial benchmarks on several compute nodes of our cluster
> and make checkpoints of them. The checkpoint process is OK and the
> programs can be restarted from the context files from the same node
> where it is checkpointed. But if I try to restart the program from
> another node, which has the same architecture(x86_64), kernel(Linux
> 2.6.28-8.1.8-el5), and executable(shared NFS directory), the program
> will report a segmentation fault after running successfully to the end:
>
> ...
>  SP Benchmark Completed.
>  Class           =                        B
>  Size            =            102x 102x 102
>  Iterations      =                      400
>  Time in seconds =                   804.56
>  Mop/s total     =                   441.25
>  Operation type  =           floating point
>  Verification    =               SUCCESSFUL
>  Version         =                      3.3
>  Compile date    =              19 Feb 2009
>
>  Compile options:
>     F77          = ifort
>     FLINK        = $(F77)
>     F_LIB        = (none)
>     F_INC        = (none)
>     FFLAGS       = -O
>     FLINKFLAGS   = -O
>     RAND         = (none)
>
>
>  Please send all errors/feedbacks to:
>
>  NPB Development Team
>  npb_at_nas_dot_nasa_dot_gov
>
>
> Segmentation fault
>
>
> I wonder if anybody else has run into this problem before.
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.6 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFJonNMVgdrmpB/quURAgxfAJ943N1rhRxRdx4idw2M/M7hrcDP1gCfZ4Jo
> JleMdwgccjETsAY0+A79LMY=
> =QbNq
> -----END PGP SIGNATURE-----
>
>   

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
