Re: Please advise me about restarting with BLCR


From: Hideyuki Jitsumoto (jitsumo0_at_is.titech.ac.jp)
Date: Wed Nov 07 2007 - 01:36:03 PST

Next message: Hideyuki Jitsumoto: "Re: Please advise me about restarting with BLCR"

Previous message: Eric Roman: "Re: Please advise me about restarting with BLCR"
In reply to: Eric Roman: "Re: Please advise me about restarting with BLCR"
Next in thread: Hideyuki Jitsumoto: "Re: Please advise me about restarting with BLCR"
Reply: Hideyuki Jitsumoto: "Re: Please advise me about restarting with BLCR"

Eric,

Thank you so much for all your advice. I truly appreciate it.

This mail covers three topics.

1. Usage of cr_request_file.
At the moment I call cr_request_file from a signal handler.
I know that cr_request_file is not supposed to be called from signal
context because it is not reentrant.
However, my application calls cr_request_file only from signal context,
so cr_request_file itself is never re-entered.
Is this usage correct or not?
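
For comparison, a common way to avoid calling a non-reentrant function from signal context at all is to let the handler only set a flag and to issue the request later from normal context. The following is a minimal sketch of that pattern, not BLCR's documented usage. The names on_checkpoint_signal, install_checkpoint_handler, poll_checkpoint and request_checkpoint are hypothetical, and request_checkpoint is only a stand-in for the place where the real checkpoint request (for example cr_request_file) would be made.

```c
#include <signal.h>
#include <stdio.h>

/* Set by the handler, cleared from normal context. volatile sig_atomic_t
 * is the type the C standard designates for flags written by a handler. */
static volatile sig_atomic_t checkpoint_pending = 0;

/* The handler does nothing except record that a checkpoint was asked for. */
static void on_checkpoint_signal(int signo)
{
    (void)signo;
    checkpoint_pending = 1;
}

/* Hypothetical stand-in: a real program would issue its checkpoint
 * request here, outside of signal context. */
static int request_checkpoint(const char *path)
{
    printf("checkpoint requested: %s\n", path);
    return 0;
}

/* Install the flag-setting handler. Returns 0 on success, -1 on failure. */
static int install_checkpoint_handler(int signo)
{
    return signal(signo, on_checkpoint_signal) == SIG_ERR ? -1 : 0;
}

/* Call from the main loop. Returns 1 if a pending request was served,
 * 0 if nothing was pending, -1 if the request itself failed. */
static int poll_checkpoint(const char *path)
{
    if (!checkpoint_pending)
        return 0;
    checkpoint_pending = 0;
    return request_checkpoint(path) == 0 ? 1 : -1;
}
```

The application would call install_checkpoint_handler once at startup and poll_checkpoint at points in its main loop where no library call such as printf is in progress. The cost of this design is that the checkpoint is delayed until the next poll.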

2. Progress since my previous e-mail.
I was able to set up a correct environment for BLCR (all of BLCR's test
suites now pass).
I sometimes get a checkpoint that restarts correctly, but this is very rare.
So I suspect a timing bug. I compared core dumps taken just before a
good checkpoint with ones taken just before a bad checkpoint.
There was no difference in the backtraces, and almost no difference in
the registers (only the esi register differed). The function interrupted
by the signal handler was write(), called from printf.

3. Reply to your previous mail.
>And load the module with cr_ktrace_mask=0xffffffff.
>Start looking at the error messages there.
>If you're seeing a restart failure, we should be able to pin down exactly
>what we're having trouble reinitializing at restart time.
>(Include the checkpoint output, too.)

I got the following messages when restarting from a bad checkpoint:
Nov  7 17:34:32 pad204 kernel: cr_load_file_info: Garbage in context file! (type=926232864)
Nov  7 17:34:32 pad204 kernel: Error loading file_info.
Nov  7 17:34:32 pad204 kernel: cr_rstrt_child [12432]:  Unable to restore files!  (err=-22)
I don't understand why the context file is corrupted.
I use a different context file name for every checkpoint, so it is
probably not an overwrite bug.
I'll try to find the cause of the corruption.
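
One quick check on a garbage type value like the one in the log is to look at its individual bytes and see whether they are printable text. The sketch below does that, assuming a little-endian x86 machine (so the lowest byte of the value is the first byte in the file). The function name type_to_ascii is hypothetical and is not part of BLCR.

```c
#include <ctype.h>
#include <stdint.h>

/* Write the four bytes of `type` into `out` in little-endian (x86 memory)
 * order, replacing non-printable bytes with '.', and NUL-terminate it. */
static void type_to_ascii(uint32_t type, char out[5])
{
    for (int i = 0; i < 4; i++) {
        unsigned char b = (unsigned char)((type >> (8 * i)) & 0xffu);
        out[i] = isprint(b) ? (char)b : '.';
    }
    out[4] = '\0';
}
```

For type=926232864 (0x37353120) the four bytes are the printable characters " 157". That would be consistent with program text output ending up where a file_info record was expected, but it does not by itself establish the cause.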

The context files are uploaded at
http://matsu-www.is.titech.ac.jp/~jitumoto/ckpt_backup.tar.gz
-a.out_0_1 is a context file that restarts correctly.
-a.out_0_3 is one that does not restart.

The binaries are uploaded at
http://matsu-www.is.titech.ac.jp/~jitumoto/component.tar.gz
-bin/a.out : the checkpointed program
-lib/** : the fault-tolerance components

Sincerely,
Hideyuki Jitsumoto


On Nov 1, 2007 5:09 AM, Eric Roman <ESRoman_at_berkeley_dot_edu> wrote:
>
> > > Can you tell me whether these fail repeatedly?  Or do they succeed sometimes?
> >
> > This test failed each time test suites were executing.
>
> Ok, that's good in a sense.  It means that we're not dealing with
> any sort of spurious race.  It's probably something simple.
>
> > > Also, can you give me the kernel and distribution details?  Is this a
> > > vanilla kernel that you've built yourself?  Or did one of the
> > > distributions ship with this?
> >
> > Our linux kernel was made from vanilla kernel source.
> > But, our cluster node administrator deleted this kernel source code.
> > So, I use vanilla kernel source with config file that picked up from
> > /boot for making BLCR modules.
> > Possibly,in this step,  we make un-coodination between kernel image
> > now we use and kernel source code I prepared.
> > Then I'll try to make and install new kernel image.
> >
> > I attached our kernel's config file just in case.
>
> It's important to have the System.map file that is built with the running
> kernel.  BLCR's configure step needs to look symbols up in that file.
>
> > >Ok, well that tells us that there's something different about rank 0.
> > >We just don't know what.
> >
> > 1. There are errors in BLCR test suites, then , a cause of this bug is
> > in BLCR...
>
> The test suites should be ok.  We use those all over the place.  There
> are a few things that bother me here.
>
> First is that your error is only in rank 0, and the error message seems
> to have nothing to do with anything in particular.
>
> Second is that the tests that fail are pretty much unrelated, and
> almost certainly are not causing your failure.  bug2003 tests for a very
> obscure error in signal handler behavior, and the mmaps.ct test checks
> for problems with mmap()/fork()'d files.  I don't think that you're
> doing either of these things.
>
> So my guess on this is that it's something in your environment.  I'm not
> sure what.  C libraries mismatch, a slight kernel version mismatch,
> symbol mismatch.  I can't tell really.  Nothing I've got really explains
> those failures.
>
> > 2. Errors made in only rank0, then, a cause of this bug is in my
> > mpich-modification...
>
> Does your rank 0 code work on the nodes where all of the BLCR test
> suites pass?  If you're getting a restart failure on those nodes, then
> try building with all the kernel tracing and library tracing turned on.  And
> load the module with cr_ktrace_mask=0xffffffff.  Start looking at the
> error messages there.  If you're seeing a restart failure, we should be
> able to pin down exactly what we're having trouble reinitializing at
> restart time.  (Include the checkpoint output, too.)
>
> --
>
> Eric Roman                       Department of Physics
> 510-642-7302                     UC Berkeley
>



-- 
Sincerely Yours,
Hideyuki Jitsumoto ([email protected])
Tokyo Institute of Technology Grad. School of Info. and Eng.
Dept. MCS (Matsuoka Lab.)
