From: Jerry Mersel (jerry.mersel_at_weizmann.ac.il)
Date: Sat Dec 06 2008 - 23:04:41 PST
I'll do it today.

Regards,
Jerry

> Jerry,
>
> Because of the nature of the problem (in the restart code in mpirun),
> even a hello_world program should be sufficient to determine if my fix
> is correct.
> I simply don't have the time/patience to configure, build and install
> LAM just now. So, I appreciate your help.
>
> -Paul
>
> Jerry Mersel wrote:
>> Hi Paul:
>>
>> I don't mind checking my application with 0.8.0 and/or with
>> the patch for 0.7.3 but I was just using a small test case
>> as well where "Hello World 0 of 2" was printed out.
>>
>> It wasn't a full blown application.
>>
>> Do you still want me to try it with my test case?
>> I did downgrade to version 0.6.4 and it did work
>> with lam and gridengine, but it didn't work if the
>> checkpointed files were anywhere but in the home directory.
>>
>> Interesting.
>>
>> Best regards,
>> Jerry
>>
>>> Based on Jerry's logs, I realized that the execve() call in LAM's mpirun
>>> at restart time was probably interacting poorly with changes made
>>> beginning in BLCR 0.7.0. I have been able to construct a compact test
>>> case that is similar enough to LAM's mpirun behavior to reproduce the
>>> symptom: a restarted mpirun-like process is unkillable and does not
>>> finish the execve() call.
>>>
>>> Testing shows that BLCR 0.6.0 works as expected, while 0.7.0 and 0.7.3
>>> both hang as described above.
>>>
>>> The good news is that the exec-from-callback behavior is very similar
>>> (from BLCR's point of view) to the SEGV-from-callback reported as bug
>>> 2318 ( http://upc-bugs.lbl.gov/bugzilla/show_bug.cgi?id=2318 ). My
>>> testing shows that applying the "Proposed fix" attached to that bug
>>> report to 0.7.3 resolves the problem for my small test case.
>>> Additionally, since this patch is already part of the 0.8.0 betas, the
>>> problem Jerry reports is probably NOT present in 0.8.0 (my test case is
>>> fine with 0.8.0_b2).
>>>
>>> Jerry,
>>> I don't have a complete LAM/MPI build to test against. So, I could
>>> really use your help to confirm that the same patch that fixes my small
>>> test case works for your full mpirun+application. If you could please:
>>> rebuild the BLCR kernel modules for 0.7.3 with the "proposed fix" for
>>> bug 2318 (available at
>>> http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=298 ), rmmod+insmod
>>> blcr.ko and retry your restart. The patch does not change the context
>>> file format in any way, so it should be safe to restart from your
>>> existing checkpoint (assuming it was generated with BLCR 0.7.3).
>>>
>>> There is still a small BLCR "glitch" with the execve() call: the restart
>>> doesn't appear complete until the restarted mpirun exits where the 0.6.0
>>> behavior was to complete "immediately". I have a plan to resolve this
>>> for 0.8.0.
>>>
>>> -Paul
>>>
>>> Jerry Mersel wrote:
>>>
>>>> Hi Paul:
>>>>
>>>> I'm running on one machine that is running mpirun and the program
>>>> hello.
>>>>
>>>> I restart it with cr_restart and mpirun restarts but not the
>>>> processes.
>>>>
>>>> Thank you for your effort and patience.
>>>>
>>>> Regards,
>>>> Jerry
>>>>
>>>> P.S. See attachment for log
>>>>
>>>>> Jerry,
>>>>>
>>>>> Of the three BLCR kernel modules, only <filename> == blcr.ko needs the
>>>>> cr_ktrace_mask=0xffffffff argument. That should be equivalent to the
>>>>> make command I suggested.
>>>>>
>>>>> -Paul
>>>>>
>>>>> Jerry Mersel wrote:
>>>>>
>>>>>> Hi Paul:
>>>>>>
>>>>>> Would insmod <filename> cr_ktrace_mask=0xffffffff have the same
>>>>>> effect?
>>>>>>
>>>>>>> Jerry,
>>>>>>> Please try loading the BLCR modules with "make insmod
>>>>>>> cr_ktrace_mask=0xffffffff" to enable the highest level of debugging
>>>>>>> output. I suspect there will be additional output after the
>>>>>>> "parent linkage" message.
>>>>>>> -Paul
>>>>>>>
>>>>>>> Jerry Mersel wrote:
>>>>>>>
>>>>>>>> Hi:
>>>>>>>>
>>>>>>>> I also see the same errors as zhangkan.
>>>>>>>>
>>>>>>>> Also stopping on Parent linkage.
>>>>>>>>
>>>>>>>> I just manage to start mpirun but not the children,
>>>>>>>> and I need to reboot the machine to get rid of mpirun.
>>>>>>>> I can't kill it. It goes into permanent sleep mode.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Jerry
>>>>>>>>
>>>>>>> --
>>>>>>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
>>>>>>> Future Technologies Group                 Tel: +1-510-495-2352
>>>>>>> HPC Research Department                   Fax: +1-510-486-6900
>>>>>>> Lawrence Berkeley National Laboratory
>>>>>>>
>>>>> --
>>>>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
>>>>> Future Technologies Group
>>>>> HPC Research Department                   Tel: +1-510-495-2352
>>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>>
>>> --
>>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
>>> Future Technologies Group                 Tel: +1-510-495-2352
>>> HPC Research Department                   Fax: +1-510-486-6900
>>> Lawrence Berkeley National Laboratory
>>>
> --
> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
> Future Technologies Group
> HPC Research Department                   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900