Re: Please advise me about restarting with BLCR

From: Hideyuki Jitsumoto (
Date: Mon Oct 29 2007 - 02:57:17 PST

    Sorry, I forgot to attach the config file.
    Thank you,
    Hideyuki Jitsumoto
    On 10/29/07, Hideyuki Jitsumoto <> wrote:
    > Eric,
    > > Ok.  If you're aware of the issues, then go for it!  Are the MPICH guys
    > > aware of your work?  I think they put a callback for checkpointing in
    > > the ADI for MPICH 2, but never implemented it.
    > Last year I spent 3 weeks at MCS at ANL, and they advised me on how
    > to modify MPICH2 to use FT techniques. So I think they may remember
    > me...
    > > > cr_poll_checkpoint: Input/output error
    > > > /home/jitumoto/blcr-0.6.1/builddir/tests/.libs/lt-mmaps[1539]: file
    > > > "../../tests/crut.c", line 628, in crut_main: Error during checkpoint.
    > > >  crut_checkpoint_status = 0, saved error = -5
    > > > checkpoint/nonzeroexit (255) at ./mmaps.ct line 128.
    > > > FAIL: mmaps.ct
    > > Can you tell me whether these fail repeatedly?  Or do they succeed sometimes?
    > This test failed every time the test suite was run.
    > > Two obvious courses of action are to 1/ rebuild BLCR from one of
    > > our newer releases, and 2/ upgrade the kernel on the Opteron machine
    > > to something a little bit more recent.
    > I am already using BLCR 0.6.1, so I'll try a newer kernel for the tests.
    > > Also, can you give me the kernel and distribution details?  Is this a
    > > vanilla kernel that you've built yourself?  Or did one of the
    > > distributions ship with this?
    > Our Linux kernel was built from vanilla kernel source.
    > However, our cluster node administrator deleted that source tree.
    > So, to build the BLCR modules, I used a vanilla kernel source tree
    > together with the config file I picked up from /boot.
    > This step may have introduced a mismatch between the kernel image
    > we are running and the kernel source I prepared.
    > So I'll try building and installing a new kernel image.
    > I have attached our kernel's config file just in case.
    > > Ok, well that tells us that there's something different about rank 0.
    > > We just don't know what.
    > I don't use shared memory, and I don't think there is any difference
    > between rank 0 and the other ranks that could cause the checkpointing
    > error. (The differences are things like the number of sockets created
    > and how messages pass through the sockets. There is no device that
    > only one rank uses...)
    > What confuses me is exactly these results:
    > 1. There are errors in the BLCR test suite, which suggests the cause
    > of this bug is in BLCR...
    > 2. The errors occur only in rank 0, which suggests the cause of this
    > bug is in my MPICH modification...
    Sincerely Yours,
    Hideyuki Jitsumoto (
    Tokyo Institute of Technology Grad. School of Info. and Eng.
    Dept. MCS (Matsuoka Lab.)
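
    The suspected mismatch between the running kernel image and the source
    tree used to build the BLCR modules can be partly checked by comparing
    release strings. The following is a minimal sketch, not part of BLCR:
    the function names and the source-tree path are hypothetical, and it
    assumes the tree's top-level Makefile defines VERSION, PATCHLEVEL,
    SUBLEVEL and EXTRAVERSION in the usual way.

```python
import os
import platform
import re


def kernel_release_from_makefile(makefile_text):
    """Build the base release string (e.g. '2.6.18') from the text of a
    kernel top-level Makefile. Missing fields are treated as empty."""
    fields = {}
    for name in ("VERSION", "PATCHLEVEL", "SUBLEVEL", "EXTRAVERSION"):
        m = re.search(rf"^{name}\s*=[ \t]*(.*)$", makefile_text, re.MULTILINE)
        fields[name] = m.group(1).strip() if m else ""
    return "{VERSION}.{PATCHLEVEL}.{SUBLEVEL}{EXTRAVERSION}".format(**fields)


def source_matches_running(makefile_text, running_release):
    """Return True if the running release starts with the source tree's base
    release. A local suffix such as '-custom' is allowed; a further digit is
    not, so '2.6.18' does not match '2.6.180'."""
    base = kernel_release_from_makefile(makefile_text)
    if not running_release.startswith(base):
        return False
    rest = running_release[len(base):]
    return not rest or not rest[0].isdigit()


def check_tree(source_dir):
    """Compare a source tree (hypothetical path) against the running kernel,
    whose release string is what `uname -r` reports."""
    with open(os.path.join(source_dir, "Makefile")) as f:
        return source_matches_running(f.read(), platform.release())
```

    For example, check_tree("/usr/src/linux") would compare that tree with
    the running kernel. Matching release strings are only a necessary
    condition: the config used for the module build must also match the one
    the running image was built with.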

