From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Dec 04 2008 - 12:19:02 PST
Jerry,

Because of the nature of the problem (in the restart code in mpirun),
even a hello_world program should be sufficient to determine if my fix
is correct. I simply don't have the time/patience to configure, build
and install LAM just now. So, I appreciate your help.

-Paul

Jerry Mersel wrote:
> Hi Paul:
>
> I don't mind checking my application with 0.8.0 and/or with
> the patch for 0.7.3 but I was just using a small test case
> as well where "Hello World 0 of 2" was printed out.
>
> It wasn't a full blown application.
>
> Do you still want me to try it with my test case?
> I did downgrade to version 0.6.4 and it did work
> with lam and gridengine, but it didn't work if the
> checkpointed files were anywhere but in the home directory.
>
> Interesting.
>
> Best regards,
> Jerry
>
>> Based on Jerry's logs, I realized that the execve() call in LAM's mpirun
>> at restart time was probably interacting poorly with changes made
>> beginning in BLCR 0.7.0. I have been able to construct a compact test
>> case that is similar enough to LAM's mpirun behavior to reproduce the
>> symptom: a restarted mpirun-like process is unkillable and does not
>> finish the execve() call.
>>
>> Testing shows that BLCR 0.6.0 works as expected, while 0.7.0 and 0.7.3
>> both hang as described above.
>>
>> The good news is that the exec-from-callback behavior is very similar
>> (from BLCR's point of view) to the SEGV-from-callback reported as bug
>> 2318 ( http://upc-bugs.lbl.gov/bugzilla/show_bug.cgi?id=2318 ). My
>> testing shows that applying the "Proposed fix" attached to that bug
>> report to 0.7.3 resolves the problem for my small test case.
>> Additionally, since this patch is already part of the 0.8.0 betas, the
>> problem Jerry reports is probably NOT present in 0.8.0 (my test case is
>> fine with 0.8.0_b2).
>>
>> Jerry,
>> I don't have a complete LAM/MPI build to test against.
>> So, I could really use your help to confirm that the same patch that
>> fixes my small test case works for your full mpirun+application. If you
>> could please: rebuild the BLCR kernel modules for 0.7.3 with the
>> "proposed fix" for bug 2318 (available at
>> http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=298 ), rmmod+insmod
>> blcr.ko and retry your restart. The patch does not change the context
>> file format in any way, so it should be safe to restart from your
>> existing checkpoint (assuming it was generated with BLCR 0.7.3).
>>
>> There is still a small BLCR "glitch" with the execve() call: the restart
>> doesn't appear complete until the restarted mpirun exits where the 0.6.0
>> behavior was to complete "immediately". I have a plan to resolve this
>> for 0.8.0.
>>
>> -Paul
>>
>> Jerry Mersel wrote:
>>
>>> Hi Paul:
>>>
>>> I'm running on one machine that is running mpirun and the program
>>> hello.
>>>
>>> I restart it with cr_restart and mpirun restarts but not the processes.
>>>
>>> Thank you for your effort and patience.
>>>
>>> Regards,
>>> Jerry
>>>
>>> P.S. See attachment for log
>>>
>>>> Jerry,
>>>>
>>>> Of the three BLCR kernel modules, only <filename> == blcr.ko needs the
>>>> cr_ktrace_mask=0xffffffff argument. That should be equivalent to the
>>>> make command I suggested.
>>>>
>>>> -Paul
>>>>
>>>> Jerry Mersel wrote:
>>>>
>>>>> Hi Paul:
>>>>>
>>>>> Would insmod <filename> cr_ktrace_mask=0xffffffff have the same
>>>>> effect?
>>>>>
>>>>>> Jerry,
>>>>>> Please try loading the BLCR modules with "make insmod
>>>>>> cr_ktrace_mask=0xffffffff" to enable the highest level of debugging
>>>>>> output. I suspect there will be additional output after the "parent
>>>>>> linkage" message.
>>>>>> -Paul
>>>>>>
>>>>>> Jerry Mersel wrote:
>>>>>>
>>>>>>> Hi:
>>>>>>>
>>>>>>> I also see the same errors as zhangkan.
>>>>>>>
>>>>>>> Also stopping on Parent linkage.
>>>>>>>
>>>>>>> I just manage to start mpirun but not the children,
>>>>>>> and I need to reboot the machine to get rid of mpirun.
>>>>>>> I can't kill it. It goes into permanent sleep mode.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Jerry
>>>>>>>
>>>>>> --
>>>>>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
>>>>>> Future Technologies Group                 Tel: +1-510-495-2352
>>>>>> HPC Research Department                   Fax: +1-510-486-6900
>>>>>> Lawrence Berkeley National Laboratory
>>>>>>
>>>> --
>>>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
>>>> Future Technologies Group
>>>> HPC Research Department                   Tel: +1-510-495-2352
>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>
>> --
>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
>> Future Technologies Group                 Tel: +1-510-495-2352
>> HPC Research Department                   Fax: +1-510-486-6900
>> Lawrence Berkeley National Laboratory
>>
>

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
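
For anyone following the thread, the "rmmod+insmod blcr.ko and retry your restart" step could be sketched roughly as the dry-run script below. It only prints the commands rather than running them. The module path and the context file name are hypothetical placeholders, not values from the thread, and passing cr_ktrace_mask when loading the patched module is optional (it comes from the earlier debugging request, not from the patch instructions).

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.
# BLCR_KO and CONTEXT are hypothetical placeholders (assumptions);
# substitute the path to your rebuilt blcr.ko and your own checkpoint file.
BLCR_KO="${BLCR_KO:-/path/to/patched/blcr.ko}"
CONTEXT="${CONTEXT:-context.12345}"

reload_and_restart_plan() {
    # $1 = path to the rebuilt (patched) blcr.ko
    # $2 = existing checkpoint context file to restart from
    echo "rmmod blcr"
    # Tracing is optional; only blcr.ko takes the cr_ktrace_mask argument.
    echo "insmod $1 cr_ktrace_mask=0xffffffff"
    echo "cr_restart $2"
}

reload_and_restart_plan "$BLCR_KO" "$CONTEXT"
```

Once the printed commands look right for your system, they would be run as root in the order shown.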