Mvapich2 checkpointing after a cr_restart fails

From: Alex Ninaber (Alex.Ninaber_at_ClusterVision_dot_com)
Date: Tue Mar 24 2009 - 10:02:18 PDT

  • Next message: Paul H. Hargrove: "Re: Mvapich2 checkpointing after a cr_restart fails"
    Dear blcr,
    
    Manually checkpointing and restarting our MPI application with mvapich2 
    appears to work fine the first time. However, after the application has 
    been started again with cr_restart, checkpointing it a second time fails: 
    cr_checkpoint waits forever and never finishes. The application itself 
    pauses as usual during the checkpoint (while the files are written) and 
    then continues running. What is left behind are the new checkpoint files 
    and the .context.* files; the checkpoint files appear to have the right 
    size, but the .context.* files are too small. The behavior is consistent, 
    and it also occurs when the automatic checkpointing in mvapich2 is used.
    
    I assume checkpointing an application that was itself started from a 
    restart should work, correct?
    
    blcr version 0.8.0
    mvapich2 mvapich2-1.2p1
    kernel 2.6.18-128.1.1.el5
    export MV2_CKPT_INTERVAL=-1
    export MV2_CKPT_FILE=/local/CPK
    export MV2_CKPT_MAX_SAVE_CKPTS=3
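
    To make the hang described above easier to observe, the manual checkpoint 
    can be wrapped in a timeout. This is only a rough sketch: it assumes GNU 
    timeout is installed and that the job is checkpointed through the PID of 
    the MPI launcher. The function name and CKPT_TIMEOUT are made up for 
    illustration and are not part of blcr or mvapich2.

```shell
#!/bin/sh
# Hypothetical helper: seconds to wait before declaring the checkpoint hung.
CKPT_TIMEOUT=${CKPT_TIMEOUT:-300}

# Run cr_checkpoint on the given PID, but give up after CKPT_TIMEOUT seconds
# so that a hang is reported instead of blocking the shell forever.
checkpoint_with_timeout() {
    pid=$1
    status=0
    # GNU timeout exits with status 124 when the time limit is reached.
    timeout "$CKPT_TIMEOUT" cr_checkpoint -p "$pid" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "cr_checkpoint on PID $pid did not finish within ${CKPT_TIMEOUT}s"
    elif [ "$status" -ne 0 ]; then
        echo "cr_checkpoint on PID $pid failed with status $status"
    else
        echo "cr_checkpoint on PID $pid completed"
    fi
    return "$status"
}
```

    The sequence would then be: call checkpoint_with_timeout with the 
    launcher PID, restart with cr_restart on the resulting context file, and 
    call checkpoint_with_timeout again with the PID of the restarted 
    launcher. In the case reported here, the second call is the one expected 
    to time out.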
    
    Any tips/hints/help would be appreciated,
    
    Regards,
    
    Alex
    
