Re: LAM: Checkpoint is correct, BUT cannot restart with LAM+BLCR

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Dec 03 2008 - 15:44:49 PST

    Based on Jerry's logs, I realized that the execve() call in LAM's mpirun
    at restart time was probably interacting poorly with changes made
    beginning in BLCR 0.7.0.  I have been able to construct a compact test
    case that is similar enough to LAM's mpirun behavior to reproduce the
    symptom: a restarted mpirun-like process is unkillable and does not
    finish the execve() call.
    
    Testing shows that BLCR 0.6.0 works as expected, while 0.7.0 and 0.7.3
    both hang as described above.
    
    The good news is that the exec-from-callback behavior is very similar
    (from BLCR's point of view) to the SEGV-from-callback reported as bug
    2318 ( http://upc-bugs.lbl.gov/bugzilla/show_bug.cgi?id=2318 ).  My
    testing shows that applying the "Proposed fix" attached to that bug
    report to 0.7.3 resolves the problem for my small test case.
    Additionally, since this patch is already part of the 0.8.0 betas, the
    problem Jerry reports is probably NOT present in 0.8.0 (my test case is
    fine with 0.8.0_b2).
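
The test results above can be summarized as a small lookup. This is only a sketch: the function name blcr_exec_restart_result is hypothetical, and it covers only the versions named in this message (unpatched builds). Any other version is reported as "untested" rather than guessed.

```shell
# Report the exec-from-callback restart result for an unpatched BLCR
# version, using only the results stated in this message.
# (0.7.3 with the "Proposed fix" from bug 2318 applied passed the small
# test case; that patched build is not represented here.)
blcr_exec_restart_result() {
    case "$1" in
        0.6.0)        echo "ok" ;;        # works as expected
        0.7.0|0.7.3)  echo "hangs" ;;     # restarted process is unkillable
        0.8.0_b2)     echo "ok" ;;        # fix is already included
        *)            echo "untested" ;;  # not covered by this message
    esac
}
```

For example, "blcr_exec_restart_result 0.7.3" prints "hangs".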
    
    Jerry,
      I don't have a complete LAM/MPI build to test against, so I could
    really use your help to confirm that the same patch that fixes my small
    test case also works for your full mpirun+application.  Could you please
    rebuild the BLCR kernel modules for 0.7.3 with the "proposed fix" for
    bug 2318 (available at
    http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=298 ), rmmod+insmod
    blcr.ko, and retry your restart?  The patch does not change the context
    file format in any way, so it should be safe to restart from your
    existing checkpoint (assuming it was generated with BLCR 0.7.3).
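
The steps requested above might be scripted roughly as follows. This is a sketch under stated assumptions, not a tested procedure: the source directory, the patch strip level (-p1), the location of the rebuilt blcr.ko, and the context file name are all placeholders to adjust for the local setup. Only the patch URL, the rmmod+insmod of blcr.ko, and cr_restart come from this thread. The script prints the commands by default; set DRY_RUN=0 (as root) to actually run them.

```shell
# All paths below are hypothetical placeholders; override via environment.
DRY_RUN="${DRY_RUN:-1}"
BLCR_SRC="${BLCR_SRC:-$HOME/blcr-0.7.3}"            # unpacked 0.7.3 source tree
BLCR_KO="${BLCR_KO:-$BLCR_SRC/blcr.ko}"             # rebuilt module; real path may differ
CONTEXT_FILE="${CONTEXT_FILE:-context.12345}"       # existing 0.7.3 checkpoint
PATCH_FILE="${PATCH_FILE:-bug2318-fix.patch}"
PATCH_URL="http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=298"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Patch, rebuild, reload blcr.ko, and retry the restart.
# The && chain stops at the first step that fails.
retry_restart_with_fix() {
    run cd "$BLCR_SRC" &&
    run wget -O "$PATCH_FILE" "$PATCH_URL" &&
    run patch -p1 -i "$PATCH_FILE" &&   # strip level is an assumption
    run make &&
    run rmmod blcr &&
    run insmod "$BLCR_KO" &&
    run cr_restart "$CONTEXT_FILE"
}
```

Calling retry_restart_with_fix with the default DRY_RUN=1 only lists the seven commands, which makes it easy to check the paths before changing anything on the machine.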
    
    There is still a small BLCR "glitch" with the execve() call: the restart
    doesn't appear complete until the restarted mpirun exits, whereas the
    0.6.0 behavior was to complete "immediately".  I have a plan to resolve
    this for 0.8.0.
    
    -Paul
    
    Jerry Mersel wrote:
    > Hi Paul:
    >
    >  I'm running on one machine that is running mpirun and the program hello.
    >
    >  I restart it with cr_restart and mpirun restarts but not the processes.
    >
    >  Thank you for your effort and patience.
    >
    >                        Regards,
    >                          Jerry
    >
    > P.S. See attachment for log
    >
    >
    >
    >
    >   
    >> Jerry,
    >>
    >>    Of the three BLCR kernel modules, only <filename> == blcr.ko needs the
    >> cr_ktrace_mask=0xffffffff argument.  That should be equivalent to the make
    >> command I suggested.
    >>
    >> -Paul
    >>
    >> Jerry Mersel wrote:
    >>     
    >>> Hi Paul:
    >>>
    >>>
    >>>  Would insmod <filename> cr_ktrace_mask=0xffffffff have the same effect?
    >>>
    >>>
    >>>
    >>>
    >>>       
    >>>> Jerry,
    >>>>  Please try loading the BLCR modules with "make insmod
    >>>> cr_ktrace_mask=0xffffffff" to enable the highest level of debugging
    >>>> output.  I suspect there will be additional output after the "parent
    >>>> linkage" message.
    >>>> -Paul
    >>>>
    >>>> Jerry Mersel wrote:
    >>>>         
    >>>>> Hi:
    >>>>>
    >>>>>    I also see the same errors as  zhangkan.
    >>>>>
    >>>>>    Also stopping on Parent linkage.
    >>>>>
    >>>>>    I just manage to start mpirun but not the children,
    >>>>>    and I need to reboot the machine to get rid of mpirun.
    >>>>>    I can't kill it. It goes into permanent sleep mode.
    >>>>>
    >>>>>
    >>>>>                             Regards,
    >>>>>                                Jerry
    >>>>>
    >>>>>
    >>>>>           
    >>>> --
    >>>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >>>> Future Technologies Group                 Tel: +1-510-495-2352
    >>>> HPC Research Department                   Fax: +1-510-486-6900
    >>>> Lawrence Berkeley National Laboratory
    >>>>
    >>>>
    >>>>         
    >>>       
    >> --
    >> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >> Future Technologies Group                 Tel: +1-510-495-2352
    >> HPC Research Department                   Fax: +1-510-486-6900
    >> Lawrence Berkeley National Laboratory
    >>     
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory
    
