From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Mar 24 2009 - 10:26:13 PDT
Alex,

The behavior you describe sounds as if there is some thread or sub-process in the restarted MPI job that is not completing its checkpoint. There is nothing in BLCR that should prevent checkpointing after a restart, and several of our test cases do exactly that (though not with MPI).

If your problem does originate from some bug in BLCR, I don't have enough information yet to determine what might be the cause. However, a similar behavior seen with the SLURM batch system has been identified and will be fixed in the 0.8.1 release, expected later this month. You may try applying the following two patches to BLCR 0.8.0 to get that fix:

http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=336
http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=348

If you have time to try that, please let me know if it resolves your problem or not.

It is also possible that the problem comes from the MVAPICH2 side of the integration with BLCR. So, I'd suggest that you ask on the MVAPICH2 users list to see if others have the same problem. If so, it would be valuable to know which versions of BLCR display this problem.

-Paul

Alex Ninaber wrote:
>
> Dear blcr,
>
> Manually checkpointing and restarting our MPI application with
> mvapich2 appears to work fine once - however after starting it again
> with cr_restart, checkpointing it a second time fails: cr_checkpoint
> waits forever and never finishes. The application itself has the usual
> wait-time during the checkpoint (for writing out the files), and
> continues after that. What's left are the new checkpoint files and
> .context.* (of which the latter is too small, the checkpoint files
> appear to have the right size). The behavior is consistent, and also
> occurs when the automatic checkpoint in mvapich2 is used.
>
> I assume checkpointing an application that started from a restart
> should work, correct?
>
> blcr version 0.8.0
> mvapich2 mvapich2-1.2p1
> kernel 2.6.18-128.1.1.el5
> export MV2_CKPT_INTERVAL=-1
> export MV2_CKPT_FILE=/local/CPK
> export MV2_CKPT_MAX_SAVE_CKPTS=3
>
> Any tips/hints/help would be appreciated,
>
> Regards,
>
> Alex
>

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory