From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Feb 23 2009 - 12:09:46 PST
When a user reports that they can restart on the machine where a checkpoint was taken, but are unable to do so on another identical node, the problem has almost always been "prelinking", which is described in the BLCR FAQ:
    http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink
I suggest you FIRST try disabling prelinking as described there. Once that is done you will need to take a new checkpoint (don't try restarting one that may have included prelinked libs).

HOWEVER, I am not certain that prelinking is your problem. Since your application has produced output after restart, your case doesn't quite follow the pattern of the prelinking problem, in which the application dies in the very first library call it makes. It is possible that in your case not all libraries are prelinked, and that libc is able to produce output while some other library is SEGFAULTing at exit time. However, I think that is a bit of a stretch. So, I think we need to consider other possibilities, too.

You say "NPB", which is an acronym for NAS Parallel Benchmarks, but you say "serial benchmarks", so I am assuming you are looking at the non-MPI versions of these codes (otherwise I'd suggest looking at the MPI implementation). If you ARE using the MPI, OpenMP or UPC versions, please let me know and we can look there for possible problems.

If disabling prelinking does not resolve the SEGFAULT then I don't have a guess as to where the fault may be. So, if prelinking is not the problem, you will need to get a core file from the faulting program and use gdb (or another debugger of your choosing) to determine where the fault is happening. If you are not familiar with postmortem debugging then you should be able to find many fine debugger tutorials online (the subject is too large to cover by email).

Of course, I suppose you have the option to just ignore the fault if it doesn't prevent you from getting results from your intended application(s).
However, if you are willing, I'd appreciate your help in identifying the cause of your failure, in case there is something that BLCR should be doing differently.

-Paul

Hongjia Cao wrote:
> I encountered a problem with BLCR 0.8.0.
>
> I run the NPB serial benchmarks on several compute nodes of our cluster
> and make checkpoints of them. The checkpoint process is OK and the
> programs can be restarted from the context files from the same node
> where it is checkpointed. But if I try to restart the program from
> another node, which has the same architecture (x86_64), kernel (Linux
> 2.6.28-8.1.8-el5), and executable (shared NFS directory), the program
> will report a segmentation fault after running successfully to the end:
>
> ...
>  SP Benchmark Completed.
>  Class           = B
>  Size            = 102x 102x 102
>  Iterations      = 400
>  Time in seconds = 804.56
>  Mop/s total     = 441.25
>  Operation type  = floating point
>  Verification    = SUCCESSFUL
>  Version         = 3.3
>  Compile date    = 19 Feb 2009
>
>  Compile options:
>     F77          = ifort
>     FLINK        = $(F77)
>     F_LIB        = (none)
>     F_INC        = (none)
>     FFLAGS       = -O
>     FLINKFLAGS   = -O
>     RAND         = (none)
>
>
>  Please send all errors/feedbacks to:
>
>  NPB Development Team
>  npb_at_nas_dot_nasa_dot_gov
>
>
> Segmentation fault
>
>
> I wonder if anybody else has run into this problem before.

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
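
Illustrative note on the prelinking check discussed above: one rough way to see whether two "identical" nodes really carry identical shared libraries is to checksum every library the benchmark binary resolves to, on each node, and compare the results. The sketch below is not part of BLCR and is not taken from the BLCR FAQ. It assumes that the `ldd` utility is available, that Python 3.7 or later is installed, and that the system libraries are node-local files (only the executable is stated to be on shared NFS). The helper names `parse_ldd`, `sha256_of` and `library_digests` are made up for this example. Differing digests for the same path would be consistent with per-node prelinking, but could equally mean different package versions, so treat a mismatch as a hint rather than a diagnosis.

```python
import hashlib
import re
import subprocess
import sys

# Matches ldd lines of the form "libc.so.6 => /lib64/libc.so.6 (0x...)".
# Lines without a resolved absolute path after "=>" (the vdso, "not found"
# entries, the bare dynamic loader line) are deliberately skipped.
_LDD_LINE = re.compile(r"=>\s+(/\S+)")


def parse_ldd(output):
    """Return the resolved library paths found in ldd output, in order."""
    paths = []
    for line in output.splitlines():
        match = _LDD_LINE.search(line)
        if match:
            paths.append(match.group(1))
    return paths


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def library_digests(binary):
    """Map each library path reported by ldd for `binary` to its digest."""
    result = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=True
    )
    return {path: sha256_of(path) for path in parse_ldd(result.stdout)}


if __name__ == "__main__":
    # Usage (script name is hypothetical): python3 libsums.py /path/to/binary
    for path, digest in sorted(library_digests(sys.argv[1]).items()):
        print(digest, path)
```

Running the script against the same executable on the checkpoint node and on the restart node, then comparing the two outputs with diff, shows which libraries (if any) differ between the nodes.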