Re: Mvapich2 checkpointing after a cr_restart fails

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Mar 24 2009 - 10:26:13 PDT

    Alex,
    
      The behavior you describe sounds as if there is some thread or 
    sub-process in the restarted MPI job that is not completing its 
    checkpoint.  There is nothing in BLCR that should prevent checkpointing 
    after a restart, and several of our test cases do exactly that (though 
    not with MPI).
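
One way to look for such a straggler is to list every thread of the restarted job while cr_checkpoint is hanging and compare their states. The following is only a sketch: the helper name list_threads is made up for illustration, it assumes a Linux /proc filesystem, and it says nothing about how BLCR itself tracks threads.

```shell
#!/bin/sh
# Hypothetical helper (not part of BLCR or MVAPICH2): print one line per
# thread of the given process as "PID TID STATE", read from Linux /proc.
list_threads() {
    pid=$1
    for task in /proc/"$pid"/task/*; do
        # Skip entries that vanished or are not readable by this user
        [ -r "$task/status" ] || continue
        tid=${task##*/}
        # The "State:" line looks like "State:  S (sleeping)"; keep the letter
        state=$(awk '/^State:/ {print $2}' "$task/status")
        printf '%s %s %s\n' "$pid" "$tid" "$state"
    done
}
```

Calling list_threads with the PID of each process in the job (and of any child processes) while the second checkpoint is stuck may show whether one thread is in a different state from the rest, which would be a reasonable place to start looking.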
      If your problem does originate from a bug in BLCR, I don't yet have 
    enough information to determine the cause.  However, similar behavior 
    seen with the SLURM batch system has been identified and will be fixed 
    in the 0.8.1 release, expected later this month.  You may try applying 
    the following two patches to BLCR 0.8.0 to get that fix:
        http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=336
        http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=348
    If you have time to try that, please let me know whether it resolves 
    your problem.
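
If it helps, applying the two attachments might look like the sketch below. The helper apply_checked is made up for illustration, GNU patch is assumed, and the -p1 strip level is a guess (check the paths in each patch header). It does a dry run first so that a patch that does not apply cleanly leaves the source tree untouched.

```shell
#!/bin/sh
# Hypothetical helper: apply a patch only if a dry run succeeds.
# $1 = source directory, $2 = patch file, $3 = strip level (assumed 1)
apply_checked() {
    dir=$1
    patchfile=$2
    strip=${3:-1}
    # Make the patch path absolute so it stays valid after the cd below
    case $patchfile in
        /*) ;;
        *) patchfile=$PWD/$patchfile ;;
    esac
    # -f: ask no questions and do not assume the patch is reversed
    if (cd "$dir" && patch -f -p"$strip" --dry-run <"$patchfile" >/dev/null 2>&1); then
        (cd "$dir" && patch -f -p"$strip" <"$patchfile")
    else
        echo "patch does not apply cleanly: $patchfile" >&2
        return 1
    fi
}
```

With the two attachments saved locally, for example as blcr-336.patch and blcr-348.patch (names chosen here), and the BLCR 0.8.0 sources unpacked in a directory assumed to be called blcr-0.8.0, usage would be along the lines of:

    apply_checked blcr-0.8.0 blcr-336.patch
    apply_checked blcr-0.8.0 blcr-348.patch

followed by rebuilding and reinstalling BLCR as usual.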
    
      It is also possible that the problem comes from the MVAPICH2 side of 
    the integration with BLCR.  So, I'd suggest that you ask on the MVAPICH2 
    users list to see if others have the same problem.  If so, it would be 
    valuable to know which versions of BLCR display this problem.
    
    -Paul
    
    Alex Ninaber wrote:
    >
    > Dear blcr,
    >
    > Manually checkpointing and restarting our MPI application with 
    > mvapich2 works fine once. However, after restarting it with 
    > cr_restart, checkpointing it a second time fails: cr_checkpoint 
    > waits forever and never finishes. The application itself pauses as 
    > usual during the checkpoint (while the files are written) and then 
    > continues. What's left are the new checkpoint files and the 
    > .context.* files; the latter are too small, while the checkpoint 
    > files appear to have the right size. The behavior is consistent, and 
    > also occurs when the automatic checkpoint in mvapich2 is used.
    >
    > I assume checkpointing an application that started from a restart 
    > should work, correct?
    >
    > blcr version 0.8.0
    > mvapich2 mvapich2-1.2p1
    > kernel 2.6.18-128.1.1.el5
    > export MV2_CKPT_INTERVAL=-1
    > export MV2_CKPT_FILE=/local/CPK
    > export MV2_CKPT_MAX_SAVE_CKPTS=3
    >
    > Any tips/hints/help would be appreciated,
    >
    > Regards,
    >
    > Alex
    >
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
    
