From: Eric Roman (ESRoman_at_berkeley_dot_edu)
Date: Wed Oct 31 2007 - 12:09:17 PST
> > Can you tell me whether these fail repeatedly? Or do they succeed sometimes?
>
> This test failed each time test suites were executing.

Ok, that's good in a sense. It means that we're not dealing with any sort of
spurious race. It's probably something simple.

> > Also, can you give me the kernel and distribution details? Is this a
> > vanilla kernel that you've built yourself? Or did one of the
> > distributions ship with this?
>
> Our linux kernel was made from vanilla kernel source.
> But, our cluster node administrator deleted this kernel source code.
> So, I use vanilla kernel source with config file that picked up from
> /boot for making BLCR modules.
> Possibly,in this step, we make un-coodination between kernel image
> now we use and kernel source code I prepared.
> Then I'll try to make and install new kernel image.
>
> I attached our kernel's config file just in case.

It's important to have the System.map file that is built with the running
kernel. BLCR's configure step needs to look symbols up in that file.

> >Ok, well that tells us that there's something different about rank 0.
> >We just don't know what.
>
> 1. There are errors in BLCR test suites, then , a cause of this bug is
> in BLCR...

The test suites should be ok. We use those all over the place.

There are a few things that bother me here. First is that your error is only
in rank 0, and the error message seems to have nothing to do with anything in
particular.

Second is that the tests that fail are pretty much unrelated, and almost
certainly are not causing your failure. bug2003 tests for a very obscure error
in signal handler behavior, and the mmaps.ct test checks for problems with
mmap()/fork()'d files. I don't think that you're doing either of these things.

So my guess on this is that it's something in your environment. I'm not sure
what. C libraries mismatch, a slight kernel version mismatch, symbol mismatch.
I can't tell really. Nothing I've got really explains those failures.

> 2. Errors made in only rank0, then, a cause of this bug is in my
> mpich-modification...

Does your rank 0 code work on the nodes where all of the BLCR test suites pass?

If you're getting a restart failure on those nodes, then try building with all
the kernel tracing and library tracing turned on. And load the module with
cr_ktrace_mask=0xffffffff. Start looking at the error messages there.

If you're seeing a restart failure, we should be able to pin down exactly what
we're having trouble reinitializing at restart time. (Include the checkpoint
output, too.)

--
Eric Roman                Department of Physics
510-642-7302              UC Berkeley
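
To illustrate the System.map point above: a quick way to sanity-check a map file is to parse it and see whether a few symbols are present. The sketch below is only an illustration under stated assumptions. It assumes the usual "address type name" line format, and it assumes the common (but not universal) convention that the file lives at /boot/System.map-<kernel release>. The helper names and the sample symbol names are hypothetical, and they are not the list that BLCR's configure actually looks up.

```python
import os
import platform


def parse_system_map(text):
    """Parse System.map-style text into {symbol: (address, type)}.

    Each useful line is assumed to look like: "c01234a0 T do_fork".
    Lines that do not fit that shape are skipped.
    """
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        addr, sym_type, name = parts[0], parts[1], parts[2]
        try:
            symbols[name] = (int(addr, 16), sym_type)
        except ValueError:
            continue  # address field was not hexadecimal
    return symbols


def system_map_path(release=None, boot_dir="/boot"):
    """Guess the map file path for a kernel release (assumed naming convention)."""
    release = release or platform.release()
    return os.path.join(boot_dir, "System.map-" + release)


def missing_symbols(symbols, wanted):
    """Return the names in `wanted` that the parsed map does not contain."""
    return [name for name in wanted if name not in symbols]


if __name__ == "__main__":
    path = system_map_path()
    if not os.path.exists(path):
        print("No map file at", path, "for the running kernel")
    else:
        with open(path) as fh:
            table = parse_system_map(fh.read())
        # Placeholder symbol names, chosen only for illustration.
        print("Missing:", missing_symbols(table, ["do_fork", "sys_call_table"]))
```

If the file for the running kernel is missing, or symbols you expect are absent, that would be consistent with the kernel image and the prepared source tree being out of step. It does not prove it.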