From: Alex Ninaber (Alex.Ninaber_at_ClusterVision_dot_com)
Date: Tue Mar 24 2009 - 10:02:18 PDT
Dear blcr,

Manually checkpointing and restarting our MPI application with mvapich2 appears to work fine once - however after starting it again with cr_restart, checkpointing it a second time fails: cr_checkpoint waits forever and never finishes.

The application itself has the usual wait-time during the checkpoint (for writing out the files), and continues after that. What's left are the new checkpoint files and .context.* (of which the latter is too small, the checkpoint files appear to have the right size).

The behavior is consistent, and also occurs when the automatic checkpoint in mvapich2 is used.

I assume checkpointing an application that started from a restart should work, correct?

blcr version 0.8.0
mvapich2 mvapich2-1.2p1
kernel 2.6.18-128.1.1.el5

export MV2_CKPT_INTERVAL=-1
export MV2_CKPT_FILE=/local/CPK
export MV2_CKPT_MAX_SAVE_CKPTS=3

Any tips/hints/help would be appreciated,

Regards,
Alex