Re: Please advise me about restarting with BLCR

From: Hideyuki Jitsumoto (jitsumo0_at_is.titech.ac.jp)
Date: Mon Oct 29 2007 - 02:32:10 PST

  • Next message: Hideyuki Jitsumoto: "Re: Please advise me about restarting with BLCR"
    Eric,
    
    > Ok.  If you're aware of the issues, then go for it!  Are the MPICH guys
    > aware of your work?  I think they put a callback for checkpointing in
    > the ADI for MPICH 2, but never implemented it.
    
    Last year I spent three weeks at MCS at ANL, and they advised me on
    how to modify MPICH2 to use FT techniques. So I think they may
    remember me...
    
    > > cr_poll_checkpoint: Input/output error
    > > /home/jitumoto/blcr-0.6.1/builddir/tests/.libs/lt-mmaps[1539]: file
    > > "../../tests/crut.c", line 628, in crut_main: Error during checkpoint.
    > >  crut_checkpoint_status = 0, saved error = -5
    > > checkpoint/nonzeroexit (255) at ./mmaps.ct line 128.
    > > FAIL: mmaps.ct
    > Can you tell me whether these fail repeatedly?  Or do they succeed sometimes?
    
    This test failed every time the test suite was run.
    
    > Two obvious courses of action are to 1/ rebuild BLCR from one of
    > our newer releases, and 2/ upgrade the kernel on the Opteron machine
    > to something a little bit more recent.
    
    I am already using BLCR 0.6.1, so I'll try a newer kernel for the tests.
    
    > Also, can you give me the kernel and distribution details?  Is this a
    > vanilla kernel that you've built yourself?  Or did one of the
    > distributions ship with this?
    
    Our Linux kernel was built from vanilla kernel source, but our
    cluster node administrator deleted that source tree.
    So, to build the BLCR modules, I used vanilla kernel source together
    with the config file I picked up from /boot.
    Possibly this step introduced a mismatch between the kernel image we
    are running now and the kernel source I prepared.
    So I'll try building and installing a new kernel image.
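
    A rough way to check for that kind of mismatch is to compare the
    release string of the running kernel with the one the source tree
    would produce. The sketch below is only an illustration: the function
    names and the command-line paths are hypothetical, it looks only at
    the top-level Makefile and CONFIG_LOCALVERSION, and it ignores other
    things that can extend the release string (localversion files,
    CONFIG_LOCALVERSION_AUTO). Matching strings also do not prove that
    the image was built from the same config.

```python
import os
import re
import sys


def kernel_release_from_makefile(makefile_text):
    """Build 'VERSION.PATCHLEVEL.SUBLEVEL<EXTRAVERSION>' from a kernel Makefile."""
    fields = {}
    for name in ("VERSION", "PATCHLEVEL", "SUBLEVEL", "EXTRAVERSION"):
        # Anchor at line start so 'VERSION' does not match 'EXTRAVERSION'.
        m = re.search(r"^%s[ \t]*=[ \t]*(.*)$" % name, makefile_text, re.MULTILINE)
        fields[name] = m.group(1).strip() if m else ""
    return "%s.%s.%s%s" % (
        fields["VERSION"],
        fields["PATCHLEVEL"],
        fields["SUBLEVEL"],
        fields["EXTRAVERSION"],
    )


def local_version_from_config(config_text):
    """Return the CONFIG_LOCALVERSION suffix from a kernel config, or ''."""
    m = re.search(r'^CONFIG_LOCALVERSION="(.*)"$', config_text, re.MULTILINE)
    return m.group(1) if m else ""


def expected_release(makefile_text, config_text):
    """Release string this source tree plus config would be expected to report."""
    return kernel_release_from_makefile(makefile_text) + local_version_from_config(config_text)


def source_matches_running(makefile_text, config_text, running_release=None):
    """Compare the expected release with the running kernel's release."""
    if running_release is None:
        running_release = os.uname().release  # same string as `uname -r`
    return expected_release(makefile_text, config_text) == running_release


if __name__ == "__main__":
    # Hypothetical usage: check_kernel.py <source-tree>/Makefile <config-file>
    with open(sys.argv[1]) as f:
        makefile_text = f.read()
    with open(sys.argv[2]) as f:
        config_text = f.read()
    print("source tree :", expected_release(makefile_text, config_text))
    print("running     :", os.uname().release)
    print("match       :", source_matches_running(makefile_text, config_text))
```

    If the two strings differ, modules built against that tree were
    compiled for a different kernel than the one that is running.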
    
    I attached our kernel's config file just in case.
    
    >Ok, well that tells us that there's something different about rank 0.
    >We just don't know what.
    
    I don't use shared memory, and I don't think there is any difference
    between rank 0 and the other ranks that could cause a checkpointing
    error. (The differences are the number of sockets created, how
    messages pass through the sockets, and so on. No device is used by
    only one rank...)
    What confuses me is exactly these two results:
    1. There are errors in the BLCR test suite, which suggests the cause
    of this bug is in BLCR...
    2. The errors occur only on rank 0, which suggests the cause of this
    bug is in my MPICH modification...
    
