From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Dec 03 2008 - 15:44:49 PST
Based on Jerry's logs, I realized that the execve() call in LAM's mpirun at restart time was probably interacting poorly with changes made beginning in BLCR 0.7.0. I have been able to construct a compact test case that is similar enough to LAM's mpirun behavior to reproduce the symptom: a restarted mpirun-like process is unkillable and never finishes the execve() call. Testing shows that BLCR 0.6.0 works as expected, while 0.7.0 and 0.7.3 both hang as described above.

The good news is that the exec-from-callback behavior is very similar (from BLCR's point of view) to the SEGV-from-callback reported as bug 2318 ( http://upc-bugs.lbl.gov/bugzilla/show_bug.cgi?id=2318 ). My testing shows that applying the "Proposed fix" attached to that bug report to 0.7.3 resolves the problem for my small test case. Additionally, since this patch is already part of the 0.8.0 betas, the problem Jerry reports is probably NOT present in 0.8.0 (my test case is fine with 0.8.0_b2).

Jerry, I don't have a complete LAM/MPI build to test against, so I could really use your help to confirm that the same patch that fixes my small test case also works for your full mpirun+application. If you could please:

1. Rebuild the BLCR kernel modules for 0.7.3 with the "proposed fix" for bug 2318 (available at http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=298 ).
2. rmmod+insmod blcr.ko.
3. Retry your restart.

The patch does not change the context file format in any way, so it should be safe to restart from your existing checkpoint (assuming it was generated with BLCR 0.7.3).

There is still a small BLCR "glitch" with the execve() call: the restart doesn't appear complete until the restarted mpirun exits, whereas the 0.6.0 behavior was to complete "immediately". I have a plan to resolve this for 0.8.0.

-Paul

Jerry Mersel wrote:
> Hi Paul:
>
> I'm running on one machine that is running mpirun and the program hello.
>
> I restart it with cr_restart and mpirun restarts but not the processes.
>
> Thank you for your effort and patience.
>
> Regards,
> Jerry
>
> P.S. See attachment for log
>
>> Jerry,
>>
>> Of the three BLCR kernel modules, only <filename> == blcr.ko needs the
>> cr_ktrace_mask=0xffffffff argument. That should be equivalent to the make
>> command I suggested.
>>
>> -Paul
>>
>> Jerry Mersel wrote:
>>
>>> Hi Paul:
>>>
>>> Would insmod <filename> cr_ktrace_mask=0xffffffff have the same effect?
>>>
>>>> Jerry,
>>>> Please try loading the BLCR modules with "make insmod
>>>> cr_ktrace_mask=0xffffffff" to enable the highest level of debugging
>>>> output. I suspect there will be additional output after the "parent
>>>> linkage" message.
>>>> -Paul
>>>>
>>>> Jerry Mersel wrote:
>>>>
>>>>> Hi:
>>>>>
>>>>> I also see the same errors as zhangkan.
>>>>>
>>>>> Also stopping on Parent linkage.
>>>>>
>>>>> I just manage to start mpirun but not the children,
>>>>> and I need to reboot the machine to get rid of mpirun.
>>>>> I can't kill it. It goes into permanent sleep mode.
>>>>>
>>>>> Regards,
>>>>> Jerry
>>>>>
>>>> --
>>>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
>>>> Future Technologies Group                 Tel: +1-510-495-2352
>>>> HPC Research Department                   Fax: +1-510-486-6900
>>>> Lawrence Berkeley National Laboratory
>>>>
>>>
>> --
>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
>> Future Technologies Group
>> HPC Research Department                   Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>
>

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory