Re: Please advise me about restarting with BLCR

From: Eric Roman (ESRoman_at_berkeley_dot_edu)
Date: Wed Oct 31 2007 - 12:09:17 PST

    > > Can you tell me whether these fail repeatedly?  Or do they succeed sometimes?
    > This test failed every time the test suites were run.
    Ok, that's good in a sense.  It means that we're not dealing with
    any sort of spurious race.  It's probably something simple.
    > > Also, can you give me the kernel and distribution details?  Is this a
    > > vanilla kernel that you've built yourself?  Or did one of the
    > > distributions ship with this?
    > Our linux kernel was made from vanilla kernel source.
    > But, our cluster node administrator deleted this kernel source code.
    > So, I use vanilla kernel source with config file that picked up from
    > /boot for making BLCR modules.
    > Possibly, in this step, we introduced a mismatch between the kernel image
    > we use now and the kernel source code I prepared.
    > Then I'll try to make and install new kernel image.
    > I attached our kernel's config file just in case.
    It's important to have the file that is built with the running
    kernel.  BLCR's configure step needs to look symbols up in that file.
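    One rough way to check for this kind of mismatch is sketched below.  It
    assumes that the file in question is a System.map-style symbol table
    and that /proc/kallsyms on the running kernel shows real addresses (no
    address randomization, and read as root).  The function names are made
    up for illustration and are not part of BLCR.

```python
import sys


def parse_symbols(text):
    """Map symbol name -> set of addresses for 'ADDR TYPE NAME [module]' lines."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        try:
            addr = int(fields[0], 16)
        except ValueError:
            continue  # skip anything that is not a symbol line
        table.setdefault(fields[2], set()).add(addr)
    return table


def mismatched_symbols(system_map_text, kallsyms_text):
    """Return sorted names found in both tables whose addresses differ.

    Names that appear more than once in either table (for example static
    functions sharing a name) are skipped, since they cannot be paired up.
    """
    wanted = parse_symbols(system_map_text)
    running = parse_symbols(kallsyms_text)
    bad = []
    for name in wanted.keys() & running.keys():
        if len(wanted[name]) == 1 and len(running[name]) == 1:
            if wanted[name] != running[name]:
                bad.append(name)
    return sorted(bad)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python check_map.py /boot/System.map
    with open(sys.argv[1]) as f:
        system_map = f.read()
    with open("/proc/kallsyms") as f:
        kallsyms = f.read()
    for name in mismatched_symbols(system_map, kallsyms):
        print(name)
```

    If this prints a long list of symbols, the symbol file most likely was
    not produced by the build of the kernel that is currently running.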
    > >Ok, well that tells us that there's something different about rank 0.
    > >We just don't know what.
    > 1. There are errors in the BLCR test suites, so a cause of this bug is
    > in BLCR...
    The test suites should be ok.  We use those all over the place.  There
    are a few things that bother me here.
    First is that your error is only in rank 0, and the error message seems
    to have nothing to do with anything in particular.
    Second is that the tests that fail are pretty much unrelated, and
    almost certainly are not causing your failure.  bug2003 tests for a very
    obscure error in signal handler behavior, and the mmaps.ct test checks
    for problems with mmap()/fork()'d files.  I don't think that you're
    doing either of these things.
    So my guess on this is that it's something in your environment.  I'm not
    sure what.  It could be a C library mismatch, a slight kernel version
    mismatch, or a symbol mismatch.  I can't really tell.  Nothing I've got
    really explains those failures.
    > 2. Errors made in only rank0, then, a cause of this bug is in my
    > mpich-modification...
    Does your rank 0 code work on the nodes where all of the BLCR test
    suites pass?  If you're getting a restart failure on those nodes, then
    try building with all the kernel tracing and library tracing turned on, and
    load the module with cr_ktrace_mask=0xffffffff.  Start looking at the
    error messages there.  If you're seeing a restart failure, we should be
    able to pin down exactly what we're having trouble reinitializing at
    restart time.  (Include the checkpoint output, too.)
