Re: Mvapich2 checkpointing after a cr_restart fails

From: Alex Ninaber (Alex.Ninaber_at_ClusterVision_dot_com)
Date: Wed Mar 25 2009 - 02:16:10 PDT

  • Next message: Paul H. Hargrove: "Re: Mvapich2 checkpointing after a cr_restart fails"
    Dear Paul,
    
    That fixed it, many thanks,
    
    
    Regards,
    
    Alex
    
    
    
    
    Paul H. Hargrove wrote:
    > Alex,
    >
    >  The behavior you describe sounds as if there is some thread or 
    > sub-process in the restarted MPI job that is not completing its 
    > checkpoint.  There is nothing in BLCR that should prevent 
    > checkpointing after a restart, and several of our test cases do 
    > exactly that (though not with MPI).
    >  If your problem does originate from some bug in BLCR, I don't have 
    > enough information yet to determine what might be the cause.  However, 
    > a similar behavior seen with the SLURM batch system has been 
    > identified and will be fixed in the 0.8.1 release, expected later this 
    > month.  You may try applying the following two patches to BLCR 0.8.0 
    > to get that fix:
    >    http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=336
    >    http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=348
    > If you have time to try that, please let me know if it resolves your 
    > problem or not.
    >
    >  It is also possible that the problem comes from the MVAPICH2 side of 
    > the integration with BLCR.  So, I'd suggest that you ask on the 
    > MVAPICH2 users list to see if others have the same problem.  If so, it 
    > would be valuable to know which versions of BLCR display this problem.
    >
    > -Paul
    >
    > Alex Ninaber wrote:
    >>
    >> Dear blcr,
    >>
    >> Manually checkpointing and restarting our MPI application with 
    >> mvapich2 appears to work fine once. However, after starting it again 
    >> with cr_restart, checkpointing it a second time fails: cr_checkpoint 
    >> waits forever and never finishes. The application itself has the 
    >> usual wait time during the checkpoint (for writing out the files) 
    >> and continues after that. What's left are the new checkpoint files 
    >> and .context.* (the latter is too small, while the checkpoint 
    >> files appear to have the right size). The behavior is consistent and 
    >> also occurs when the automatic checkpoint in mvapich2 is used.
    >>
    >> I assume checkpointing an application that started from a restart 
    >> should work, correct?
    >>
    >> blcr version 0.8.0
    >> mvapich2 version 1.2p1
    >> kernel 2.6.18-128.1.1.el5
    >> export MV2_CKPT_INTERVAL=-1
    >> export MV2_CKPT_FILE=/local/CPK
    >> export MV2_CKPT_MAX_SAVE_CKPTS=3
    >>
    >> Any tips/hints/help would be appreciated,
    >>
    >> Regards,
    >>
    >> Alex
    >>
    >>
    >>
    >>
    >>
    >
    >
    
    
    -- 
    
    ----------------------------------------------------------------
    Dr Alex Ninaber				ClusterVision
    Technical Manager			tel NL: +31 20 407 7557
    http://www.ClusterVision.com		tel UK: +44 870 080 1980
    email:Alex.Ninaber_at_ClusterVision_dot_com 	tel Mob: +31 61 650 4127
    support: support_at_ClusterVision_dot_com	skype: AlexNinaber
    
