From: Karthik Gopalakrishnan (gopalakk_at_cse.ohio-state.edu)
Date: Thu Mar 26 2009 - 08:26:39 PDT
Hi Paul.

We'll make the announcement on the MVAPICH users list.

Thanks & Regards,
Karthik

On Wed, Mar 25, 2009 at 5:38 PM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:
> Alex,
> Thanks for trying out the patches, and for the good news.
>
> Karthik,
> I'd appreciate it if you'd make an announcement to the MVAPICH2 users list
> that BLCR 0.8.0 has this known problem with MVAPICH2 and
> checkpoint-after-restart. I am preparing to release 0.8.1 later today and
> you should direct users to get that version if possible.
>
> -Paul
>
> Alex Ninaber wrote:
>>
>> Dear Paul,
>>
>> That fixed it, many thanks,
>>
>> Regards,
>>
>> Alex
>>
>>
>> Paul H. Hargrove wrote:
>>>
>>> Alex,
>>>
>>> The behavior you describe sounds as if there is some thread or
>>> sub-process in the restarted MPI job that is not completing its checkpoint.
>>> There is nothing in BLCR that should prevent checkpointing after a restart,
>>> and several of our test cases do exactly that (though not with MPI).
>>> If your problem does originate from some bug in BLCR, I don't have
>>> enough information yet to determine what might be the cause. However, a
>>> similar behavior seen with the SLURM batch system has been identified and
>>> will be fixed in the 0.8.1 release, expected later this month. You may try
>>> applying the following two patches to BLCR 0.8.0 to get that fix:
>>> http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=336
>>> http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=348
>>> If you have time to try that, please let me know if it resolves your
>>> problem or not.
>>>
>>> It is also possible that the problem comes from the MVAPICH2 side of the
>>> integration with BLCR. So, I'd suggest that you ask on the MVAPICH2 users
>>> list to see if others have the same problem. If so, it would be valuable to
>>> know which versions of BLCR display this problem.
>>>
>>> -Paul
>>>
>>> Alex Ninaber wrote:
>>>>
>>>> Dear blcr,
>>>>
>>>> Manually checkpointing and restarting our MPI application with mvapich2
>>>> appears to work fine once - however after starting it again with cr_restart,
>>>> checkpointing it a second time fails: cr_checkpoint waits forever and never
>>>> finishes. The application itself has the usual wait-time during the
>>>> checkpoint (for writing out the files), and continues after that. What's
>>>> left are the new checkpoint files and .context.* (of which the latter is too
>>>> small, the checkpoint files appear to have the right size). The behavior is
>>>> consistent, and also occurs when the automatic checkpoint in mvapich2 is
>>>> used.
>>>>
>>>> I assume checkpointing an application that started from a restart should
>>>> work, correct?
>>>>
>>>> blcr version 0.8.0
>>>> mvapich2 mvapich2-1.2p1
>>>> kernel 2.6.18-128.1.1.el5
>>>> export MV2_CKPT_INTERVAL=-1
>>>> export MV2_CKPT_FILE=/local/CPK
>>>> export MV2_CKPT_MAX_SAVE_CKPTS=3
>>>>
>>>> Any tips/hints/help would be appreciated,
>>>>
>>>> Regards,
>>>>
>>>> Alex
>>>>
>>>
>>
>
> --
> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
> Future Technologies Group
> HPC Research Department                   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900