From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Jun 09 2008 - 00:49:33 PDT
Parviz,

The problem you describe does not sound like any known bug or limitation in BLCR. It is likely that you have uncovered a new BLCR bug. The "Bad Address" (from the -14) is EFAULT, which suggests that some aspect of the restarted memory mapping is incorrect. If there is any failure to allocate/map memory, then the vmadump portion of BLCR should be detecting the failure prior to causing an EFAULT by accessing the memory.

It would help if you could reconfigure and build BLCR with "--enable-debug" passed to BLCR's configure script to enable detailed tracing. If you load the modules by running "make insmod cr_ktrace_mask=0xffffffff" in your BLCR build directory, then the next time you try to restart from your 36G context file, dmesg should provide some detail as to what was happening prior to the EFAULT. Sending us the last 100 lines or so from dmesg should probably be sufficient for us to narrow the possible causes, and perhaps suggest a solution.

-Paul

Parviz Fariborz wrote:
>
> Hi,
>
> I get the following error when I re-start a context file produced by
> blcr-0.7.0 :
>
> => cr_restart context.21849
> Restart failed: Bad address
>
> The dmesg command produces the following error :
>
> blcr: Retry request on -CR_ENOSUPPORT
> blcr: thaw_threads returned error, aborting. -14
>
> Any idea what I may be doing wrong? Is this a bug?
>
> Several more pieces of info :
>
> I run blcr on a 64 bit machine running linux red-hat :
>
> =>uname -a
> Linux ivel6 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64
> x86_64 x86_64 GNU/Linux
>
> The size of the context file that produces the error is very large,
> around 36G. When I checkpoint the same executable with a smaller data
> set, which produces a smaller context file (around 3G), re-start works
> with no problem.
>
> Thanks in advance for your help.
> -Parviz
>

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
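
Note on reading the "-14" in the dmesg output: kernel code conventionally reports failures as negated errno values, so -14 corresponds to errno 14. A minimal sketch of that lookup, using only Python's standard errno and os modules, might look like the following. The helper name describe_kernel_error is hypothetical and is not part of BLCR, and the message text comes from the local C library, so it can differ between platforms.

```python
import errno
import os


def describe_kernel_error(code: int) -> str:
    """Translate a kernel-style return code such as -14 into 'NAME (message)'.

    Hypothetical helper for illustration; not part of BLCR.
    """
    err = abs(code)  # kernel messages print the negated errno value
    name = errno.errorcode.get(err, "UNKNOWN")  # symbolic name, e.g. EFAULT
    return f"{name} ({os.strerror(err)})"  # message text is platform-dependent


if __name__ == "__main__":
    # The value reported after "thaw_threads returned error, aborting."
    print(describe_kernel_error(-14))
```

On a Linux system this should report EFAULT for -14, matching the "Bad address" text printed by cr_restart.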