Re: LAM: Checkpoint is correct, BUT cannot restart with LAM+BLCR

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Dec 04 2008 - 12:19:02 PST

  • Next message: Paul H. Hargrove: "BLCR 0.8.0 beta3 is now available"
    Jerry,
    
       Because of the nature of the problem (in the restart code in mpirun), 
    even a hello_world program should be sufficient to determine if my fix 
    is correct.
      I simply don't have the time/patience to configure, build and install 
    LAM just now.  So, I appreciate your help.
    
    -Paul
    
    Jerry Mersel wrote:
    > Hi Paul:
    >
    >   I don't mind checking my application with 0.8.0 and/or with
    >   the patch for 0.7.3 but I was just using a small test case
    >   as well where "Hello World 0 of 2" was printed out.
    >
    >   It wasn't a full blown application.
    >
    >   Do you still want me to try it with my test case?
    >   I did downgrade to version 0.6.4 and it did work
    >   with lam and gridengine, but it didn't work if the
    >   checkpointed files were anywhere but in the home directory.
    >
    >   Interesting.
    >
    >
    >                         Best regards,
    >                            Jerry
    >
    >   
    >> Based on Jerry's logs, I realized that the execve() call in LAM's mpirun
    >> at restart time was probably interacting poorly with changes made
    >> beginning in BLCR 0.7.0.  I have been able to construct a compact test
    >> case that is similar enough to LAM's mpirun behavior to reproduce the
    >> symptom: a restarted mpirun-like process is unkillable and does not
    >> finish the execve() call.
    >>
    >> Testing shows that BLCR 0.6.0 works as expected, while 0.7.0 and 0.7.3
    >> both hang as described above.
    >>
    >> The good news is that the exec-from-callback behavior is very similar
    >> (from BLCR's point of view) to the SEGV-from-callback reported as bug
    >> 2318 ( http://upc-bugs.lbl.gov/bugzilla/show_bug.cgi?id=2318 ).  My
    >> testing shows that applying the "Proposed fix" attached to that bug
    >> report to 0.7.3 resolves the problem for my small test case.
    >> Additionally, since this patch is already part of the 0.8.0 betas, the
    >> problem Jerry reports is probably NOT present in 0.8.0 (my test case is
    >> fine with 0.8.0_b2).
    >>
    >> Jerry,
    >>   I don't have a complete LAM/MPI build to test against.  So, I could
    >> really use your help to confirm that the same patch that fixes my small
    >> test case works for your full mpirun+application.  If you could please:
    >> rebuild the BLCR kernel modules for 0.7.3 with the "proposed fix" for
    >> bug 2318 (available at
    >> http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=298 ), rmmod+insmod
    >> blcr.ko and retry your restart.  The patch does not change the context
    >> file format in any way, so it should be safe to restart from your
    >> existing checkpoint (assuming it was generated with BLCR 0.7.3).
    >>
    >> There is still a small BLCR "glitch" with the execve() call: the restart
    >> doesn't appear complete until the restarted mpirun exits, whereas the
    >> 0.6.0 behavior was to complete "immediately".  I have a plan to resolve
    >> this for 0.8.0.
    >>
    >> -Paul
    >>
    >> Jerry Mersel wrote:
    >>     
    >>> Hi Paul:
    >>>
    >>>  I'm running on a single machine, which runs both mpirun and the
    >>> program hello.
    >>>
    >>>  I restart it with cr_restart; mpirun restarts, but the processes do not.
    >>>
    >>>  Thank you for your effort and patience.
    >>>
    >>>                        Regards,
    >>>                          Jerry
    >>>
    >>> P.S. See attachment for log
    >>>
    >>>       
    >>>> Jerry,
    >>>>
    >>>>    Of the three BLCR kernel modules, only <filename> == blcr.ko needs the
    >>>> cr_ktrace_mask=0xffffffff argument.  That should be equivalent to the make
    >>>> command I suggested.
    >>>>
    >>>> -Paul
    >>>>
    >>>> Jerry Mersel wrote:
    >>>>
    >>>>         
    >>>>> Hi Paul:
    >>>>>
    >>>>>
    >>>>>  Would insmod <filename> cr_ktrace_mask=0xffffffff have the same effect?
    >>>>>
    >>>>>
    >>>>>
    >>>>>
    >>>>>
    >>>>>           
    >>>>>> Jerry,
    >>>>>>  Please try loading the BLCR modules with "make insmod
    >>>>>> cr_ktrace_mask=0xffffffff" to enable the highest level of debugging
    >>>>>> output.  I suspect there will be additional output after the "parent
    >>>>>> linkage" message.
    >>>>>> -Paul
    >>>>>>
    >>>>>> Jerry Mersel wrote:
    >>>>>>
    >>>>>>             
    >>>>>>> Hi:
    >>>>>>>
    >>>>>>>    I also see the same errors as  zhangkan.
    >>>>>>>
    >>>>>>>    Also stopping on Parent linkage.
    >>>>>>>
    >>>>>>>    I only manage to restart mpirun, not the children,
    >>>>>>>    and I need to reboot the machine to get rid of mpirun.
    >>>>>>>    I can't kill it. It goes into a permanent sleep state.
    >>>>>>>
    >>>>>>>
    >>>>>>>                             Regards,
    >>>>>>>                                Jerry
    >>>>>>>
    >>>>>>>
    >>>>>>>
    >>>>>>>               
    >>>>>> --
    >>>>>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >>>>>> Future Technologies Group                 Tel: +1-510-495-2352
    >>>>>> HPC Research Department                   Fax: +1-510-486-6900
    >>>>>> Lawrence Berkeley National Laboratory
    >>>>>>
    >>>>>>
    >>>>>>
    >>>>>>             
    >>>> --
    >>>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >>>> Future Technologies Group
    >>>> HPC Research Department                   Tel: +1-510-495-2352
    >>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    >>>>
    >>>>
    >>>>         
    >> --
    >> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >> Future Technologies Group                 Tel: +1-510-495-2352
    >> HPC Research Department                   Fax: +1-510-486-6900
    >> Lawrence Berkeley National Laboratory
    >>
    >>
    >>
    >>     
    >
    >
    >   
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
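
For readers following the thread: the quoted message describes an "exec-from-callback" pattern, where an mpirun-like process calls execve() from inside a callback at restart time. The sketch below shows only the general shape of that pattern on plain POSIX, and it is not the test case mentioned above. It does not use BLCR at all: a SIGUSR1 handler stands in for the restart callback, and the function name exec_from_handler and the /bin/sh command are assumptions made for illustration. On a healthy system the exec completes and the child exits with the requested status; the symptom reported in this thread was that the execve() never finished.

```python
import os
import signal
import time


def exec_from_handler(exit_code=42):
    """Fork a child that calls exec from a signal handler; return its exit status.

    Hypothetical illustration only: SIGUSR1 stands in for a restart callback.
    """
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: plays the role of the mpirun-like process.
        try:
            os.close(ready_r)

            def on_signal(signum, frame):
                # Stand-in for the callback: replace the process image.
                os.execv("/bin/sh", ["sh", "-c", "exit %d" % exit_code])

            signal.signal(signal.SIGUSR1, on_signal)
            os.write(ready_w, b"x")  # tell the parent the handler is installed
            while True:
                time.sleep(0.05)  # wait for the signal to arrive
        finally:
            os._exit(1)  # reached only if the exec failed

    # Parent: wait until the handler is in place, then trigger the "callback".
    os.close(ready_w)
    os.read(ready_r, 1)
    os.close(ready_r)
    os.kill(pid, signal.SIGUSR1)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else None
```

If the exec completes, exec_from_handler(42) returns 42. A process stuck in the way described in this thread would instead leave the waitpid() call blocked.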
    
