Re: Mvapich2 checkpointing after a cr_restart fails

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Mar 25 2009 - 14:38:00 PDT

    Alex,
      Thanks for trying out the patches, and for the good news.
    
    Karthik,
      I'd appreciate it if you'd make an announcement to the MVAPICH2 users 
    list that BLCR 0.8.0 has this known problem with MVAPICH2 and 
    checkpoint-after-restart.  I am preparing to release 0.8.1 later today 
    and you should direct users to get that version if possible.
    
    -Paul
    
    Alex Ninaber wrote:
    > Dear Paul,
    >
    > That fixed it, many thanks,
    >
    >
    > Regards,
    >
    > Alex
    >
    > Paul H. Hargrove wrote:
    >> Alex,
    >>
    >>  The behavior you describe sounds as if there is some thread or 
    >> sub-process in the restarted MPI job that is not completing its 
    >> checkpoint.  There is nothing in BLCR that should prevent 
    >> checkpointing after a restart, and several of our test cases do 
    >> exactly that (though not with MPI).
    >>  If your problem does originate from some bug in BLCR, I don't have 
    >> enough information yet to determine what might be the cause.  
    >> However, a similar behavior seen with the SLURM batch system has been 
    >> identified and will be fixed in the 0.8.1 release, expected later 
    >> this month.  You may try applying the following two patches to BLCR 
    >> 0.8.0 to get that fix:
    >>    http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=336
    >>    http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=348
    >> If you have time to try that, please let me know if it resolves your 
    >> problem or not.
    >>
    >>  It is also possible that the problem comes from the MVAPICH2 side of 
    >> the integration with BLCR.  So, I'd suggest that you ask on the 
    >> MVAPICH2 users list to see if others have the same problem.  If so, 
    >> it would be valuable to know which versions of BLCR display this 
    >> problem.
    >>
    >> -Paul
    >>
    >> Alex Ninaber wrote:
    >>>
    >>> Dear blcr,
    >>>
    >>> Manually checkpointing and restarting our MPI application with 
    >>> mvapich2 appears to work fine once. However, after starting it again 
    >>> with cr_restart, checkpointing it a second time fails: cr_checkpoint 
    >>> waits forever and never finishes. The application itself shows the 
    >>> usual wait time during the checkpoint (for writing out the files) 
    >>> and continues after that. What is left behind are the new checkpoint 
    >>> files and the .context.* files; the checkpoint files appear to have 
    >>> the right size, but the .context.* files are too small. The behavior 
    >>> is consistent, and it also occurs when the automatic checkpoint in 
    >>> mvapich2 is used.
    >>>
    >>> I assume checkpointing an application that started from a restart 
    >>> should work, correct?
    >>>
    >>> blcr version 0.8.0
    >>> mvapich2 mvapich2-1.2p1
    >>> kernel 2.6.18-128.1.1.el5
    >>> export MV2_CKPT_INTERVAL=-1
    >>> export MV2_CKPT_FILE=/local/CPK
    >>> export MV2_CKPT_MAX_SAVE_CKPTS=3
    >>>
    >>> Any tips/hints/help would be appreciated,
    >>>
    >>> Regards,
    >>>
    >>> Alex
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
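
    For anyone trying to confirm whether they are hitting the same hang
    (cr_checkpoint never returning on the second checkpoint, taken after
    a cr_restart), a small timeout wrapper can help tell "slow" from
    "stuck". This is only a sketch: the helper name finishes_within and
    the timeout value are made up for illustration, and the commented
    usage assumes cr_checkpoint is called with -p and the PID of the job
    launcher, which may not match your setup.

```python
import subprocess
import sys


def finishes_within(cmd, timeout_s):
    """Run cmd and report whether it exited before the timeout.

    Returns True if the command exited on its own (with any exit code),
    and False if it was still running after timeout_s seconds, in which
    case subprocess.run kills it before raising TimeoutExpired.
    """
    try:
        subprocess.run(cmd, timeout=timeout_s, check=False)
        return True
    except subprocess.TimeoutExpired:
        return False


# Hypothetical usage (the PID and the 600 second limit are placeholders):
#
#   launcher_pid = 12345
#   ok = finishes_within(["cr_checkpoint", "-p", str(launcher_pid)], 600)
#   if not ok:
#       print("cr_checkpoint did not finish; possibly the hang above",
#             file=sys.stderr)
```

    Note that the wrapper kills the checkpoint command on timeout, so any
    files it leaves behind should be treated as incomplete.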
    
