Re: Please advise me about restarting with BLCR

From: Hideyuki Jitsumoto (jitsumo0_at_is.titech.ac.jp)
Date: Wed Nov 07 2007 - 02:13:32 PST

    Eric,
    
    I may have found something related to the "Invalid Argument" error.
    I ran an MPI application that does nothing but while(1){printf();}
    and got "Invalid Argument".
    Then I changed it so that only rank 1 calls printf(),
    and got "Invalid Argument" from rank 1 only.
    
    In MPICH P4MPD, an MPI application process's standard I/O is initialized as follows:
    1. Close fd 0, 1, and 2.
    2. Open 3 connections to MPD.
    3. dup the 3 connection descriptors onto 0, 1, and 2.
    4. Close the 3 original connection descriptors.
    Then, when the MPI application calls printf, the message is sent to MPD over
    the connection on descriptor 1 (stdout), and MPD forwards the message to mpirun.
    Finally, mpirun prints the message.
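
The descriptor setup described above can be sketched in C roughly as follows. This is only an illustration, not the MPICH code: it uses dup2(), which closes the target descriptor and duplicates onto it in one call, and the names rebind_fd and redirect_stdio are hypothetical. It also assumes the three connection descriptors are all numbered 3 or higher.

```c
#include <assert.h>
#include <unistd.h>

/* Make target_fd refer to the same open file as conn_fd, then drop conn_fd.
 * dup2() closes whatever target_fd referred to before, so it covers both
 * the "close" and the "dup" steps for one descriptor. */
int rebind_fd(int conn_fd, int target_fd)
{
    if (conn_fd == target_fd)
        return 0;                  /* already in place, nothing to do */
    if (dup2(conn_fd, target_fd) < 0)
        return -1;
    return close(conn_fd);         /* step 4: release the original number */
}

/* Hypothetical helper: conn[0..2] stand in for the three MPD connections.
 * Assumption: each conn[i] is >= 3, so rebinding one descriptor never
 * clobbers a connection that has not been moved yet. */
int redirect_stdio(const int conn[3])
{
    for (int fd = 0; fd < 3; fd++) {
        assert(conn[fd] >= 3);
        if (rebind_fd(conn[fd], fd) < 0)
            return -1;
    }
    return 0;
}
```

After redirect_stdio() succeeds, anything the process writes with printf goes to whatever conn[1] was connected to, so stdout is a socket to MPD rather than a terminal or a regular file.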
    
    With this scheme, is there any way that stdout messages could get mixed
    into the checkpoint context file?
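
One quick check on this question is to decode the value from the kernel log quoted below (type=926232864) into its individual bytes. The sketch below is only an illustration. It assumes the type field is a 32-bit integer stored little-endian (the nodes appear to be x86, given the esi register mentioned below), and the helper names le32_bytes and looks_like_text are made up for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/* Write the four bytes of v, in little-endian memory order, into out[0..3]
 * and NUL-terminate. Shifts are used, so the result does not depend on the
 * byte order of the machine running this sketch. */
void le32_bytes(uint32_t v, char out[5])
{
    assert(out != 0);
    for (int i = 0; i < 4; i++)
        out[i] = (char)((v >> (8 * i)) & 0xffu);
    out[4] = '\0';
}

/* Return 1 if all four bytes are printable ASCII, else 0. */
int looks_like_text(const char bytes[5])
{
    for (int i = 0; i < 4; i++)
        if (!isprint((unsigned char)bytes[i]))
            return 0;
    return 1;
}
```

For 926232864 (0x37353120) the bytes come out as 0x20 0x31 0x35 0x37, which is the text " 157". That would be consistent with printed output ending up where BLCR expected a file_info record, though it does not show how the text got there.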
    
    sincerely,
    Hideyuki Jitsumoto.
    
    On Nov 7, 2007 6:36 PM, Hideyuki Jitsumoto <[email protected]> wrote:
    > Eric,
    >
    > Thank you so much for all your advice. I truly appreciate it.
    >
    > This mail covers 3 topics.
    >
    > 1. Usage of cr_request_file.
    > Currently I call cr_request_file from a signal handler.
    > I know that cr_request_file should not be called from signal context
    > because it is not reentrant.
    > But my application calls cr_request_file only from signal context, so
    > cr_request_file is never re-entered.
    > Is this usage correct or not?
    >
    > 2. Progress since the previous e-mail.
    > I was able to set up a correct environment for BLCR (all of BLCR's test suites pass).
    > I sometimes get a checkpoint that restarts correctly, but this is very rare.
    > So I suspected a timing bug. I then compared core dumps taken
    > just before a good checkpoint with ones taken just before a bad
    > checkpoint.
    > As a result, there was no difference in the backtraces, and there was
    > almost no difference in the registers (only the esi register differed). The
    > function interrupted by the signal handler was write(), called from printf.
    >
    > 3. Reply to the previous mail
    > >And load the module with cr_ktrace_mask=0xffffffff.
    > >Start looking at the error messages there.
    > >If you're seeing a restart failure, we should be able to pin down exactly
    > >what we're having trouble reinitializing at restart time.
    > >(Include the checkpoint output, too.)
    >
    > I got the following messages when restarting from a bad checkpoint:
    > Nov  7 17:34:32 pad204 kernel: cr_load_file_info: Garbage in context
    > file! (type=926232864)
    > Nov  7 17:34:32 pad204 kernel: Error loading file_info.
    > Nov  7 17:34:32 pad204 kernel: cr_rstrt_child [12432]:  Unable to
    > restore files!  (err=-22)
    >
    > I don't understand why the context file is corrupted...
    > I use a new context file name for every checkpoint, so it is probably
    > not an overwrite bug.
    > I'll keep looking for the cause of the corruption.
    >
    > The context files are uploaded at
    > http://matsu-www.is.titech.ac.jp/~jitumoto/ckpt_backup.tar.gz
    > -a.out_0_1 is a restartable context file.
    > -a.out_0_3 is a non-restartable one.
    >
    > The binaries are uploaded at
    > http://matsu-www.is.titech.ac.jp/~jitumoto/component.tar.gz
    > -bin/a.out : the checkpointed program
    > -lib/** : components for fault tolerance
    >
    > sincerely,
    > Hideyuki Jitsumoto
    >
    >
    >
    > On Nov 1, 2007 5:09 AM, Eric Roman <ESRoman_at_berkeley_dot_edu> wrote:
    > >
    > > > > Can you tell me whether these fail repeatedly?  Or do they succeed sometimes?
    > > >
    > > > This test failed every time the test suites were run.
    > >
    > > Ok, that's good in a sense.  It means that we're not dealing with
    > > any sort of spurious race.  It's probably something simple.
    > >
    > > > > Also, can you give me the kernel and distribution details?  Is this a
    > > > > vanilla kernel that you've built yourself?  Or did one of the
    > > > > distributions ship with this?
    > > >
    > > > Our Linux kernel was built from vanilla kernel source.
    > > > But our cluster node administrator deleted that kernel source tree.
    > > > So, to build the BLCR modules, I used vanilla kernel source with the
    > > > config file picked up from /boot.
    > > > Possibly this step introduced a mismatch between the kernel image
    > > > we are running and the kernel source I prepared.
    > > > I'll try building and installing a new kernel image.
    > > >
    > > > I attached our kernel's config file just in case.
    > >
    > > It's important to have the System.map file that is built with the running
    > > kernel.  BLCR's configure step needs to look symbols up in that file.
    > >
    > > > >Ok, well that tells us that there's something different about rank 0.
    > > > >We just don't know what.
    > > >
    > > > 1. There are errors in the BLCR test suites, so the cause of this bug is
    > > > in BLCR...
    > >
    > > The test suites should be ok.  We use those all over the place.  There
    > > are a few things that bother me here.
    > >
    > > First is that your error is only in rank 0, and the error message seems
    > > to have nothing to do with anything in particular.
    > >
    > > Second is that the tests that fail are pretty much unrelated, and
    > > almost certainly are not causing your failure.  bug2003 tests for a very
    > > obscure error in signal handler behavior, and the mmaps.ct test checks
    > > for problems with mmap()/fork()'d files.  I don't think that you're
    > > doing either of these things.
    > >
    > > So my guess on this is that it's something in your environment.  I'm not
    > > sure what.  C libraries mismatch, a slight kernel version mismatch,
    > > symbol mismatch.  I can't tell really.  Nothing I've got really explains
    > > those failures.
    > >
    > > > 2. The errors occur only in rank 0, so the cause of this bug is in my
    > > > mpich modification...
    > >
    > > Does your rank 0 code work on the nodes where all of the BLCR test
    > > suites pass?  If you're getting a restart failure on those nodes, then
    > > try building with all the kernel tracing and library tracing turned on.  And
    > > load the module with cr_ktrace_mask=0xffffffff.  Start looking at the
    > > error messages there.  If you're seeing a restart failure, we should be
    > > able to pin down exactly what we're having trouble reinitializing at
    > > restart time.  (Include the checkpoint output, too.)
    > >
    > > --
    > >
    > > Eric Roman                       Department of Physics
    > > 510-642-7302                     UC Berkeley
    > >
    >
    >
    >
    > --
    >
    > Sincerely Yours,
    > Hideyuki Jitsumoto ([email protected])
    > Tokyo Institute of Technology Grad. School of Info. and Eng.
    > Dept. MCS (Matsuoka Lab.)
    >
    
    
    
    -- 
    Sincerely Yours,
    Hideyuki Jitsumoto ([email protected])
    Tokyo Institute of Technology Grad. School of Info. and Eng.
    Dept. MCS (Matsuoka Lab.)
    
