From: Hideyuki Jitsumoto (jitsumo0_at_is.titech.ac.jp)
Date: Wed Nov 07 2007 - 01:36:03 PST
Eric,

Thank you so much for all your advice. I truly appreciate it.
This mail covers three topics.

1. Usage of cr_request_file.
Currently I call cr_request_file from a signal handler.
I know that cr_request_file should not be called from signal context
because it is not reentrant. However, my application calls
cr_request_file only from signal context, so it is never re-entered.
Is this usage correct or not?

2. Progress since my previous e-mail.
I was able to set up a correct environment for BLCR (it passed all of
BLCR's test suites). I sometimes get a checkpoint that restarts
correctly, but this is very rare, so I suspect a timing bug.
I then compared a core dump taken just before a good checkpoint with
one taken just before a bad checkpoint.
There was no difference in the backtraces, and almost no difference in
the registers (only the esi register differed).
The function interrupted by the signal handler was write(), called from
printf.

3. Reply to your previous mail.
>And load the module with cr_ktrace_mask=0xffffffff.
>Start looking at the error messages there.
>If you're seeing a restart failure, we should be able to pin down exactly
>what we're having trouble reinitializing at restart time.
>(Include the checkpoint output, too.)

I got the following messages when restarting from a bad checkpoint:
Nov 7 17:34:32 pad204 kernel: cr_load_file_info: Garbage in context file! (type=926232864)
Nov 7 17:34:32 pad204 kernel: Error loading file_info.
Nov 7 17:34:32 pad204 kernel: cr_rstrt_child [12432]: Unable to restore files! (err=-22)

I don't understand why the context file is corrupted.
I use a different context file name for every checkpoint, so the file
is probably not being overwritten.
I'll try to find the cause of the corruption.

The context files are uploaded at
http://matsu-www.is.titech.ac.jp/~jitumoto/ckpt_backup.tar.gz
-a.out_0_1 is a restartable context file.
-a.out_0_3 is a non-restartable one.
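
A note on the type=926232864 value in the kernel log: it can be inspected byte by byte. The sketch below assumes the field is a 32-bit integer read on a little-endian machine (the esi register mentioned above suggests x86); type_to_ascii is a hypothetical helper written for this note, not part of BLCR. Under those assumptions the value is 0x37353120, and its four bytes in memory order are the ASCII characters " 157". That may mean ordinary text output ended up where a record header was expected, but it is only a hint, not a diagnosis.

```c
#include <stdint.h>

/* Hypothetical helper (not a BLCR function).
 * Writes the four bytes of `value`, least-significant byte first (the
 * in-memory order on little-endian x86), into out[0..3]. Bytes outside
 * the printable ASCII range are shown as '.'. out[4] is set to NUL, so
 * `out` must have room for 5 chars. */
void type_to_ascii(uint32_t value, char out[5])
{
    for (int i = 0; i < 4; i++) {
        unsigned char b = (unsigned char)((value >> (8 * i)) & 0xff);
        out[i] = (b >= 0x20 && b < 0x7f) ? (char)b : '.';
    }
    out[4] = '\0';
}
```

Calling type_to_ascii(926232864u, buf) and printing buf shows the four characters; a genuine record type would more likely decode to a small integer with mostly zero bytes.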

The binaries are uploaded at
http://matsu-www.is.titech.ac.jp/~jitumoto/component.tar.gz
-bin/a.out : the checkpointed program
-lib/** : components for fault tolerance

Sincerely,
Hideyuki Jitsumoto

On Nov 1, 2007 5:09 AM, Eric Roman <ESRoman_at_berkeley_dot_edu> wrote:
>
> > > Can you tell me whether these fail repeatedly? Or do they succeed sometimes?
> >
> > This test failed each time test suites were executing.
>
> Ok, that's good in a sense. It means that we're not dealing with
> any sort of spurious race. It's probably something simple.
>
> > > Also, can you give me the kernel and distribution details? Is this a
> > > vanilla kernel that you've built yourself? Or did one of the
> > > distributions ship with this?
> >
> > Our linux kernel was made from vanilla kernel source.
> > But, our cluster node administrator deleted this kernel source code.
> > So, I use vanilla kernel source with config file that picked up from
> > /boot for making BLCR modules.
> > Possibly,in this step, we make un-coodination between kernel image
> > now we use and kernel source code I prepared.
> > Then I'll try to make and install new kernel image.
> >
> > I attached our kernel's config file just in case.
>
> It's important to have the System.map file that is built with the running
> kernel. BLCR's configure step needs to look symbols up in that file.
>
> > >Ok, well that tells us that there's something different about rank 0.
> > >We just don't know what.
> >
> > 1. There are errors in BLCR test suites, then , a cause of this bug is
> > in BLCR...
>
> The test suites should be ok. We use those all over the place. There
> are a few things that bother me here.
>
> First is that your error is only in rank 0, and the error message seems
> to have nothing to do with anything in particular.
>
> Second is that the tests that fail are pretty much unrelated, and
> almost certainly are not causing your failure.
> bug2003 tests for a very
> obscure error in signal handler behavior, and the mmaps.ct test checks
> for problems with mmap()/fork()'d files. I don't think that you're
> doing either of these things.
>
> So my guess on this is that it's something in your environment. I'm not
> sure what. C libraries mismatch, a slight kernel version mismatch,
> symbol mismatch. I can't tell really. Nothing I've got really explains
> those failures.
>
> > 2. Errors made in only rank0, then, a cause of this bug is in my
> > mpich-modification...
>
> Does your rank 0 code work on the nodes where all of the BLCR test
> suites pass? If you're getting a restart failure on those nodes, then
> try building with all the kernel tracing and library tracing turned on. And
> load the module with cr_ktrace_mask=0xffffffff. Start looking at the
> error messages there. If you're seeing a restart failure, we should be
> able to pin down exactly what we're having trouble reinitializing at
> restart time. (Include the checkpoint output, too.)
>
> --
>
> Eric Roman                       Department of Physics
> 510-642-7302                     UC Berkeley
>

--
Sincerely Yours,
Hideyuki Jitsumoto
Tokyo Institute of Technology
Grad. School of Info. and Eng.
Dept. MCS (Matsuoka Lab.)
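
On point 1 of this message (calling cr_request_file from a signal handler): a common way to sidestep the reentrancy question is to let the handler only record that a checkpoint was asked for, and to make the actual call later from normal context. What follows is a minimal sketch in C, assuming the application has a main loop that can poll a flag. The function names are hypothetical and chosen for this note; the call to cr_request_file itself is left out and would go where the usage note indicates.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the signal handler, consumed from normal (non-signal) context. */
static volatile sig_atomic_t checkpoint_requested = 0;

/* The handler only records the request. Writing a volatile sig_atomic_t
 * is the one kind of store that is safe inside a signal handler. */
static void on_checkpoint_signal(int signo)
{
    (void)signo;
    checkpoint_requested = 1;
}

/* Install the handler for `signo` (hypothetical helper).
 * Returns 0 on success, -1 on error, as sigaction() does. */
int install_checkpoint_handler(int signo)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_checkpoint_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart interrupted system calls */
    return sigaction(signo, &sa, NULL);
}

/* Call from the main loop. Returns 1 if a request was pending (and clears
 * it), 0 otherwise. Signals arriving between the check and the clear are
 * coalesced into the request being returned. */
int take_checkpoint_request(void)
{
    if (!checkpoint_requested)
        return 0;
    checkpoint_requested = 0;
    return 1;
}
```

In the main loop, a check such as `if (take_checkpoint_request()) { ... }` is then the place to call cr_request_file. With this shape the request is never issued while the process is interrupted inside printf, which is not async-signal-safe. Whether that removes the corruption seen here is untested.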