From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Mar 25 2009 - 14:38:00 PDT
Alex,

Thanks for trying out the patches, and for the good news.

Karthik,

I'd appreciate it if you'd make an announcement to the MVAPICH2 users
list that BLCR 0.8.0 has this known problem with MVAPICH2 and
checkpoint-after-restart. I am preparing to release 0.8.1 later today
and you should direct users to get that version if possible.

-Paul

Alex Ninaber wrote:
> Dear Paul,
>
> That fixed it, many thanks,
>
> Regards,
>
> Alex
>
> Paul H. Hargrove wrote:
>> Alex,
>>
>> The behavior you describe sounds as if there is some thread or
>> sub-process in the restarted MPI job that is not completing its
>> checkpoint. There is nothing in BLCR that should prevent
>> checkpointing after a restart, and several of our test cases do
>> exactly that (though not with MPI).
>> If your problem does originate from some bug in BLCR, I don't have
>> enough information yet to determine what might be the cause.
>> However, a similar behavior seen with the SLURM batch system has been
>> identified and will be fixed in the 0.8.1 release, expected later
>> this month. You may try applying the following two patches to BLCR
>> 0.8.0 to get that fix:
>> http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=336
>> http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=348
>> If you have time to try that, please let me know if it resolves your
>> problem or not.
>>
>> It is also possible that the problem comes from the MVAPICH2 side of
>> the integration with BLCR. So, I'd suggest that you ask on the
>> MVAPICH2 users list to see if others have the same problem. If so,
>> it would be valuable to know which versions of BLCR display this
>> problem.
>>
>> -Paul
>>
>> Alex Ninaber wrote:
>>>
>>> Dear blcr,
>>>
>>> Manually checkpointing and restarting our MPI application with
>>> mvapich2 appears to work fine once - however after starting it again
>>> with cr_restart, checkpointing it a second time fails: cr_checkpoint
>>> waits forever and never finishes. The application itself has the
>>> usual wait-time during the checkpoint (for writing out the files),
>>> and continues after that. What's left are the new checkpoint files
>>> and .context.* (of which the latter is too small, the checkpoint
>>> files appear to have the right size). The behavior is consistent,
>>> and also occurs when the automatic checkpoint in mvapich2 is used.
>>>
>>> I assume checkpointing an application that started from a restart
>>> should work, correct?
>>>
>>> blcr version 0.8.0
>>> mvapich2 mvapich2-1.2p1
>>> kernel 2.6.18-128.1.1.el5
>>> export MV2_CKPT_INTERVAL=-1
>>> export MV2_CKPT_FILE=/local/CPK
>>> export MV2_CKPT_MAX_SAVE_CKPTS=3
>>>
>>> Any tips/hints/help would be appreciated,
>>>
>>> Regards,
>>>
>>> Alex
>>>
>>
>

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900