Re: Mvapich2 checkpointing after a cr_restart fails

From: Karthik Gopalakrishnan (gopalakk_at_cse.ohio-state.edu)
Date: Thu Mar 26 2009 - 08:26:39 PDT

  • Next message: Karthik Gopalakrishnan: "Installing BLCR on Lustre Kernel"
    Hi Paul.
    
    We'll make the announcement on the MVAPICH users list.
    
    Thanks & Regards,
    Karthik
    
    On Wed, Mar 25, 2009 at 5:38 PM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:
    > Alex,
    >  Thanks for trying out the patches, and for the good news.
    >
    > Karthik,
    >  I'd appreciate it if you'd make an announcement to the MVAPICH2 users list
    > that BLCR 0.8.0 has this known problem with MVAPICH2 and
    > checkpoint-after-restart.  I am preparing to release 0.8.1 later today and
    > you should direct users to get that version if possible.
    >
    > -Paul
    >
    > Alex Ninaber wrote:
    >>
    >> Dear Paul,
    >>
    >> That fixed it, many thanks,
    >>
    >>
    >> Regards,
    >>
    >> Alex
    >>
    >>
    >>
    >>
    >> Paul H. Hargrove wrote:
    >>>
    >>> Alex,
    >>>
    >>>  The behavior you describe sounds as if there is some thread or
    >>> sub-process in the restarted MPI job that is not completing its checkpoint.
    >>>  There is nothing in BLCR that should prevent checkpointing after a restart,
    >>> and several of our test cases do exactly that (though not with MPI).
    >>>  If your problem does originate from some bug in BLCR, I don't have
    >>> enough information yet to determine what might be the cause.  However, a
    >>> similar behavior seen with the SLURM batch system has been identified and
    >>> will be fixed in the 0.8.1 release, expected later this month.  You may try
    >>> applying the following two patches to BLCR 0.8.0 to get that fix:
    >>>   http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=336
    >>>   http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=348
    >>> If you have time to try that, please let me know if it resolves your
    >>> problem or not.
    >>>
    >>>  It is also possible that the problem comes from the MVAPICH2 side of the
    >>> integration with BLCR.  So, I'd suggest that you ask on the MVAPICH2 users
    >>> list to see if others have the same problem.  If so, it would be valuable to
    >>> know which versions of BLCR display this problem.
    >>>
    >>> -Paul
    >>>
    >>> Alex Ninaber wrote:
    >>>>
    >>>> Dear blcr,
    >>>>
    >>>> Manually checkpointing and restarting our MPI application with mvapich2
    >>>> appears to work fine once. However, after starting it again with cr_restart,
    >>>> checkpointing it a second time fails: cr_checkpoint waits forever and never
    >>>> finishes. The application itself has the usual wait time during the
    >>>> checkpoint (for writing out the files), and continues after that. What's
    >>>> left are the new checkpoint files and .context.*; the latter is too
    >>>> small, while the checkpoint files appear to have the right size. The
    >>>> behavior is consistent, and also occurs when the automatic checkpoint in
    >>>> mvapich2 is used.
    >>>>
    >>>> I assume checkpointing an application that started from a restart should
    >>>> work, correct?
    >>>>
    >>>> blcr version 0.8.0
    >>>> mvapich2 mvapich2-1.2p1
    >>>> kernel 2.6.18-128.1.1.el5
    >>>> export MV2_CKPT_INTERVAL=-1
    >>>> export MV2_CKPT_FILE=/local/CPK
    >>>> export MV2_CKPT_MAX_SAVE_CKPTS=3
    >>>>
    >>>> Any tips/hints/help would be appreciated,
    >>>>
    >>>> Regards,
    >>>>
    >>>> Alex
    >>>>
    >>>>
    >>>>
    >>>>
    >>>>
    >>>
    >>>
    >>
    >>
    >
    >
    > --
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group
    > HPC Research Department                   Tel: +1-510-495-2352
    > Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    >
    >
    
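For anyone trying to confirm whether their own setup shows the symptom Alex describes (a second cr_checkpoint that never returns after a cr_restart), a rough sketch of a hang detector follows. It is only an illustration, not something taken from BLCR or MVAPICH2: the helper names `finishes_within` and `checkpoint_command` are made up here, and it assumes nothing more than that `cr_checkpoint <pid>` is on the PATH and that the PID passed in is the one you would normally checkpoint by hand.

```python
import subprocess


def finishes_within(cmd, seconds):
    """Run cmd and report whether it exited before the deadline.

    Returns the exit code if cmd finished in time, or None if it was still
    running after `seconds` (subprocess.run kills the child in that case).
    """
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        return None


def checkpoint_command(pid):
    """Build a plain `cr_checkpoint <pid>` invocation.

    Hypothetical helper: no extra cr_checkpoint options are assumed.
    """
    return ["cr_checkpoint", str(pid)]
```

Calling `finishes_within(checkpoint_command(pid), 600)` once on the original job and once more on the job started by cr_restart should return an exit code both times on a fixed installation; a `None` on the second call would match the hang reported above. The 600-second limit is an arbitrary placeholder and should be set well above the normal checkpoint time for the application.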
