Re: Please advise me about restarting with BLCR

From: Hideyuki Jitsumoto (jitsumo0_at_is.titech.ac.jp)
Date: Wed Nov 07 2007 - 01:36:03 PST

  • Next message: Hideyuki Jitsumoto: "Re: Please advise me about restarting with BLCR"
    Eric,
    
    Thank you so much for all your advice. I truly appreciate it.
    
    This mail covers three topics.
    
    1. Usage of cr_request_file.
    Currently, I call cr_request_file from a signal handler.
    I know that cr_request_file should not be called from signal context
    because it is not reentrant.
    However, my application calls cr_request_file only from signal context,
    so cr_request_file is never re-entered.
    Is this usage correct or not?
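
For comparison, a common way to keep non-reentrant calls out of signal context is to let the handler only set a flag and to issue the request later from the main loop. The following is a minimal sketch of that pattern, not a statement about how BLCR requires it to be done. The names do_checkpoint_request, install_ckpt_handler and poll_checkpoint are hypothetical, the choice of SIGUSR1 is an assumption, and do_checkpoint_request only stands in for the place where the real request (for example cr_request_file) would go.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler, cleared by the main loop. */
static volatile sig_atomic_t ckpt_pending = 0;

/* Counts how many deferred requests were serviced (illustration only). */
static int ckpt_serviced = 0;

/* Hypothetical stand-in for the real checkpoint request;
 * this function is NOT part of BLCR. */
static int do_checkpoint_request(const char *path)
{
    (void)path;
    ckpt_serviced++;
    return 0;
}

/* The handler only records that a checkpoint was asked for. */
static void on_ckpt_signal(int signo)
{
    (void)signo;
    ckpt_pending = 1;
}

/* Install the handler for SIGUSR1 (assumed signal); returns 0 on success. */
static int install_ckpt_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_ckpt_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART; /* restart interrupted system calls such as write() */
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Call from the main loop at a safe point.
 * Returns 1 if a deferred request was issued, 0 otherwise. */
static int poll_checkpoint(const char *path)
{
    if (!ckpt_pending)
        return 0;
    ckpt_pending = 0;
    do_checkpoint_request(path);
    return 1;
}
```

With this layout the request always runs in normal context, so it cannot interrupt a half-finished printf or write() in the same thread.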
    
    2. Progress since the previous e-mail.
    I was able to set up a correct environment for BLCR (it passes all of
    BLCR's test suites).
    I sometimes get a checkpoint that restarts correctly, but this is very rare.
    So I suspected a timing bug, and compared a core dump taken just before
    a good checkpoint with one taken just before a bad checkpoint.
    There was no difference in the backtraces, and almost no difference in
    the registers (only the esi register differed). The function interrupted
    by the signal handler was write(), called from printf.
    
    3. Reply to the previous mail.
    >And load the module with cr_ktrace_mask=0xffffffff.
    >Start looking at the error messages there.
    >If you're seeing a restart failure, we should be able to pin down exactly
    >what we're having trouble reinitializing at restart time.
    >(Include the checkpoint output, too.)
    
    I got the following messages when restarting from the bad checkpoint:
    Nov  7 17:34:32 pad204 kernel: cr_load_file_info: Garbage in context file! (type=926232864)
    Nov  7 17:34:32 pad204 kernel: Error loading file_info.
    Nov  7 17:34:32 pad204 kernel: cr_rstrt_child [12432]:  Unable to restore files!  (err=-22)
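
The type value in the first message can be checked for a pattern by looking at its individual bytes. The following is a minimal sketch, assuming the context file was read on a little-endian x86 host (the esi register mentioned above suggests x86). decode_le32_ascii is a hypothetical helper and not part of BLCR.

```c
#include <stddef.h>
#include <stdint.h>

/* Write the four bytes of `value`, least-significant byte first (the order
 * they occupy in memory on a little-endian host), into out[0..3].
 * Non-printable bytes are replaced with '.', and out[4] is set to NUL.
 * `out` must have room for 5 chars. */
static void decode_le32_ascii(uint32_t value, char out[5])
{
    for (size_t i = 0; i < 4; i++) {
        unsigned char b = (unsigned char)((value >> (8 * i)) & 0xFFu);
        out[i] = (b >= 0x20 && b < 0x7F) ? (char)b : '.';
    }
    out[4] = '\0';
}
```

926232864 is 0x37353120, so under that assumption the bytes in file order are 0x20 0x31 0x35 0x37, which is the text " 157". If this reading is right, it may mean that printed text ended up where a record header was expected, but that is only a guess.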
    
    I don't understand why the context file was corrupted...
    I change the context file name for every checkpoint, so it is probably
    not an overwrite bug.
    I'll try to find the cause of the corruption.
    
    The context files are uploaded at
    http://matsu-www.is.titech.ac.jp/~jitumoto/ckpt_backup.tar.gz
    - a.out_0_1 is a restartable context file.
    - a.out_0_3 is a non-restartable one.
    
    The binaries are uploaded at
    http://matsu-www.is.titech.ac.jp/~jitumoto/component.tar.gz
    - bin/a.out : the checkpointed program
    - lib/** : components for fault tolerance
    
    Sincerely,
    Hideyuki Jitsumoto
    
    
    On Nov 1, 2007 5:09 AM, Eric Roman <ESRoman_at_berkeley_dot_edu> wrote:
    >
    > > > Can you tell me whether these fail repeatedly?  Or do they succeed sometimes?
    > >
    > > These tests failed every time the test suite was run.
    >
    > Ok, that's good in a sense.  It means that we're not dealing with
    > any sort of spurious race.  It's probably something simple.
    >
    > > > Also, can you give me the kernel and distribution details?  Is this a
    > > > vanilla kernel that you've built yourself?  Or did one of the
    > > > distributions ship with this?
    > >
    > > Our Linux kernel was built from vanilla kernel source.
    > > However, our cluster node administrator deleted that kernel source tree.
    > > So, to build the BLCR modules, I used vanilla kernel source with the
    > > config file taken from /boot.
    > > Possibly this step introduced a mismatch between the kernel image
    > > we are running and the kernel source I prepared.
    > > I'll try to build and install a new kernel image.
    > >
    > > I attached our kernel's config file just in case.
    >
    > It's important to have the System.map file that is built with the running
    > kernel.  BLCR's configure step needs to look symbols up in that file.
    >
    > > >Ok, well that tells us that there's something different about rank 0.
    > > >We just don't know what.
    > >
    > > 1. If there are errors in the BLCR test suites, then the cause of this
    > > bug is in BLCR...
    >
    > The test suites should be ok.  We use those all over the place.  There
    > are a few things that bother me here.
    >
    > First is that your error is only in rank 0, and the error message seems
    > to have nothing to do with anything in particular.
    >
    > Second is that the tests that fail are pretty much unrelated, and
    > almost certainly are not causing your failure.  bug2003 tests for a very
    > obscure error in signal handler behavior, and the mmaps.ct test checks
    > for problems with mmap()/fork()'d files.  I don't think that you're
    > doing either of these things.
    >
    > So my guess on this is that it's something in your environment.  I'm not
    > sure what.  C libraries mismatch, a slight kernel version mismatch,
    > symbol mismatch.  I can't tell really.  Nothing I've got really explains
    > those failures.
    >
    > > 2. If the errors occur only in rank 0, then the cause of this bug is in
    > > my mpich modification...
    >
    > Does your rank 0 code work on the nodes where all of the BLCR test
    > suites pass?  If you're getting a restart failure on those nodes, then
    > try building with all the kernel tracing and library tracing turned on.  And
    > load the module with cr_ktrace_mask=0xffffffff.  Start looking at the
    > error messages there.  If you're seeing a restart failure, we should be
    > able to pin down exactly what we're having trouble reinitializing at
    > restart time.  (Include the checkpoint output, too.)
    >
    > --
    >
    > Eric Roman                       Department of Physics
    > 510-642-7302                     UC Berkeley
    >
    
    
    
    -- 
    Sincerely Yours,
    Hideyuki Jitsumoto (jitsumo0@is.titech.ac.jp)
    Tokyo Institute of Technology Grad. School of Info. and Eng.
    Dept. MCS (Matsuoka Lab.)
    
