From: Alex Ninaber (Alex.Ninaber_at_ClusterVision_dot_com)
Date: Wed Mar 25 2009 - 02:16:10 PDT
Dear Paul,

That fixed it, many thanks,

Regards,

Alex

Paul H. Hargrove wrote:
> Alex,
>
> The behavior you describe sounds as if there is some thread or
> sub-process in the restarted MPI job that is not completing its
> checkpoint. There is nothing in BLCR that should prevent
> checkpointing after a restart, and several of our test cases do
> exactly that (though not with MPI).
> If your problem does originate from some bug in BLCR, I don't have
> enough information yet to determine what might be the cause. However,
> a similar behavior seen with the SLURM batch system has been
> identified and will be fixed in the 0.8.1 release, expected later this
> month. You may try applying the following two patches to BLCR 0.8.0
> to get that fix:
> http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=336
> http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=348
> If you have time to try that, please let me know if it resolves your
> problem or not.
>
> It is also possible that the problem comes from the MVAPICH2 side of
> the integration with BLCR. So, I'd suggest that you ask on the
> MVAPICH2 users list to see if others have the same problem. If so, it
> would be valuable to know which versions of BLCR display this problem.
>
> -Paul
>
> Alex Ninaber wrote:
>>
>> Dear blcr,
>>
>> Manually checkpointing and restarting our MPI application with
>> mvapich2 appears to work fine once - however after starting it again
>> with cr_restart, checkpointing it a second time fails: cr_checkpoint
>> waits forever and never finishes. The application itself has the
>> usual wait-time during the checkpoint (for writing out the files),
>> and continues after that. What's left are the new checkpoint files
>> and .context.* (of which the latter is too small, the checkpoint
>> files appear to have the right size). The behavior is consistent, and
>> also occurs when the automatic checkpoint in mvapich2 is used.
>>
>> I assume checkpointing an application that started from a restart
>> should work, correct?
>>
>> blcr version 0.8.0
>> mvapich2 mvapich2-1.2p1
>> kernel 2.6.18-128.1.1.el5
>> export MV2_CKPT_INTERVAL=-1
>> export MV2_CKPT_FILE=/local/CPK
>> export MV2_CKPT_MAX_SAVE_CKPTS=3
>>
>> Any tips/hints/help would be appreciated,
>>
>> Regards,
>>
>> Alex
>>
>

-- 
----------------------------------------------------------------
Dr Alex Ninaber
ClusterVision Technical Manager
http://www.ClusterVision.com
tel NL:  +31 20 407 7557
tel UK:  +44 870 080 1980
tel Mob: +31 61 650 4127
email:   Alex.Ninaber_at_ClusterVision_dot_com
support: support_at_ClusterVision_dot_com
skype:   AlexNinaber