From: Hideyuki Jitsumoto (jitsumo0_at_is.titech.ac.jp)
Date: Mon Oct 29 2007 - 02:57:17 PST
Eric,

Sorry, I forgot to attach the config file.

Thank you,
Hideyuki Jitsumoto

On 10/29/07, Hideyuki Jitsumoto <[email protected]> wrote:
> Eric,
>
> > Ok. If you're aware of the issues, then go for it! Are the MPICH guys
> > aware of your work? I think they put a callback for checkpointing in
> > the ADI for MPICH 2, but never implemented it.
>
> Last year I spent 3 weeks at MCS at ANL, and they advised me on how to
> modify MPICH2 to use FT techniques. So I think they may remember
> me...
>
> > > cr_poll_checkpoint: Input/output error
> > > /home/jitumoto/blcr-0.6.1/builddir/tests/.libs/lt-mmaps[1539]: file
> > > "../../tests/crut.c", line 628, in crut_main: Error during checkpoint.
> > > crut_checkpoint_status = 0, saved error = -5
> > > checkpoint/nonzeroexit (255) at ./mmaps.ct line 128.
> > > FAIL: mmaps.ct
> > Can you tell me whether these fail repeatedly? Or do they succeed sometimes?
>
> This test failed every time the test suite was run.
>
> > Two obvious courses of action are to 1/ rebuild BLCR from one of
> > our newer releases, and 2/ upgrade the kernel on the Opteron machine
> > to something a little bit more recent.
>
> I am already using BLCR 0.6.1, so I'll try a new kernel for the tests.
>
> > Also, can you give me the kernel and distribution details? Is this a
> > vanilla kernel that you've built yourself? Or did one of the
> > distributions ship with this?
>
> Our Linux kernel was built from vanilla kernel source.
> However, our cluster node administrator deleted that kernel source tree.
> So I built the BLCR modules against vanilla kernel source, using the
> config file picked up from /boot.
> Possibly this step introduced a mismatch between the kernel image we
> are running and the kernel source I prepared.
> So I'll try building and installing a new kernel image.
>
> I attached our kernel's config file just in case.
>
> >Ok, well that tells us that there's something different about rank 0.
> >We just don't know what.
>
> I don't use shared memory. And I think there is no difference between
> rank 0 and the others that would cause the checkpointing error. (The
> differences are the number of sockets created, how messages pass
> through the sockets, and so on. There is no device used by only one
> rank...)
> What confuses me is exactly these results:
> 1. There are errors in the BLCR test suite, which suggests the cause
> of this bug is in BLCR...
> 2. The errors occur only in rank 0, which suggests the cause of this
> bug is in my MPICH modification...
>

--
Sincerely Yours,
Hideyuki Jitsumoto ([email protected])
Tokyo Institute of Technology
Grad. School of Info. and Eng.
Dept. MCS (Matsuoka Lab.)