From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Dec 19 2007 - 20:06:26 PST
I am able to pass all the test cases I have available to me now with the signals delivered as they are. However, I suspect that you have uncovered an actual problem in BLCR that was not evident previously because cr_restart was always waiting for completion of both the restart AND the execution of the app.

In the 0.7.x series I *am* already planning on moving the signal to cr_rstrt_task_complete() as you suggest (see BLCR bug #2216). However, it is not safe to make that change in 0.6.x, because there is no infrastructure yet (see BLCR bug #2215) to deal with the possibility of a newly restarted task dying or exiting in a callback (which would result in unkillable processes stuck at the barriers that are needed to coordinate the signal delivery). That said, if you either don't register callbacks or don't checkpoint multithreaded apps, then the change should be safe.

If the modification you describe is working for you, I encourage you to continue using it. I expect that things will "just work" for you when 0.7.0 is released (expected Spring '08), but if not we can revisit your problems then. I am sorry that I can't fix this for you in 0.6.x.

There is one small change (patch attached) that I *am* making to the final 0.6.2 release. Since it touches on some of the code you and I are discussing, I'd like to hear from you if applying the attached patch makes any difference for you (though I doubt that it does).

-Paul

Yuan Tang wrote:
> I have to state that I changed how BLCR works a little. I want a background daemon process to monitor the state of a foreground child process. When the background daemon detects a SIGCHLD, it automatically restarts the child process without the administrator's intervention. So I modified the cr_restart utility to match my needs. Specifically, I read the source code of cr_restart and changed it into a cr_restart() function call that the daemon can invoke to restart the child process.
> I noticed how the cr_restart utility restarts the child process. It forks child(ren) to contain the restarted tasks, and then its own process calls mimic_exit(child_status) to exit. I just replaced the mimic_exit with a return. I believe this modification works because if I am not checkpointing/restarting a SIGSTOPed process, everything works fine as expected.
>
> BTW: In order to checkpoint/restart a SIGSTOPed process, I modified the BLCR internals as I previously described to you. That is, I moved the send_sig_info(SIGSTOP, NULL, task) stuff from cr_dump_self()/cr_rstrt_child() to cr_chkpt_complete()/cr_rstrt_complet(), respectively. Also, I noticed that you moved the send_sig_info() stuff for the checkpoint portion, right? Would you try the restart portion too, and see whether my proposal works or not?
>
> Best wishes!
>
> Yuan
>
> ----- Original Message ----
> From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
> To: Yuan Tang <supertangcc_at_yahoo_dot_com>
> Cc: checkpoint_at_lbl_dot_gov
> Sent: Wednesday, December 19, 2007 12:26:01 PM
> Subject: Re: BLCR 0.6.2 beta1 now available
>
> Perhaps I am misunderstanding your problem, or you are misunderstanding how BLCR works. I am not clear on which.
>
> The cr_restart utility restarts a process as its own child. It does not return as soon as the process is restarted, but instead it waits until the child terminates and then terminates itself with the same exit code. Also, it is not the cr_restart utility you need to send SIGCONT to, but the child(ren) that it spawns, and you need to wait until the child is fully restored. Together those are why using the "--cont" argument to cr_restart is my recommendation.
>
> The function cr_rstrt_child() in the kernel module is like an exec() in that it *replaces* the calling thread with a thread of the restarted process.
> So, there is no return from that back to the calling context; when it does return it is to the context of the checkpoint handler in the restarted process.
>
> If you could be clearer on which "wait_event_interruptible() function in the loop" you mean, perhaps I can understand what you are trying to do. I suspect you mean the loop in cr_rstrt_procs(). That loop should terminate as soon as enough threads (the same number as in the checkpointed process) have reached the wake_up() call in cr_rstrt_child().
>
> From what I have read so far, I am worried that you are trying to reimplement the internals of the cr_restart utility. If that is the case, I would strongly encourage you *not* to take that approach since those internals change in each release. You should instead invoke the cr_restart utility to do the "dirty work". In a future release (once the interface stabilizes) we will add a cr_request_restart() entry point to libcr.
>
> -Paul
>
> Yuan Tang wrote:
> > Sure! I mean, the RESTART procedure never completes; that is, it never returns. It blocks in the function cr_rstrt_child. My program lets a background daemon process checkpoint a SIGSTOPed process and later restart it. After restarting it, the daemon process will send SIGCONT to the restarted SIGSTOPed process to let it go. However, the restart procedure never completes, so the background daemon never has the chance to send the SIGCONT. If I send the SIGCONT from the console, it just interrupts the wait_event_interruptible() function in the loop.
> >
> > ----- Original Message ----
> > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
> > To: Yuan Tang <supertangcc_at_yahoo_dot_com>
> > Cc: checkpoint_at_lbl_dot_gov
> > Sent: Wednesday, December 19, 2007 5:19:58 AM
> > Subject: Re: BLCR 0.6.2 beta1 now available
> >
> > Yuan Tang wrote:
> > > Hi Paul,
> > >
> > > Thank you for the work. I downloaded the beta version, installed it and tested it. The SIGSTOPed process could pass the whole checkpoint procedure now. Congratulations! However, when I tried restarting the previously checkpointed SIGSTOPed process from its disk image, the RESTART procedure never completed. It blocks in cr_rstrt_req.c:cr_rstrt_child(). I guess, if you move the send_sig_info(SIGSTOP, NULL, task) stuff to cr_rstrt_task_complete(), the whole procedure will complete normally. Hope it helps.
> >
> > I believe the current behavior is correct (or at least is what I've intended). The process that was SIGSTOPed when the checkpoint was requested is again SIGSTOPed when restarted. To get it running again you should be able to either send it a SIGCONT (which is tricky because you might not know how soon to send it), or you can simply pass "--cont" to cr_restart to have it done automatically.
> >
> > If you find that adding "--cont" to the cr_restart arguments still doesn't allow the restart to complete, let us know and we'll see if we can figure out what is going on.
> >
> > -Paul
> >
> > >
> > > Best wishes!
> > >
> > > Yuan Tang
> > >
> > > ----- Original Message ----
> > > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
> > > To: checkpoint_at_lbl_dot_gov
> > > Sent: Tuesday, December 18, 2007 4:23:30 AM
> > > Subject: BLCR 0.6.2 beta1 now available
> > >
> > > The first beta of BLCR 0.6.2 is now available at
> > > http://mantis.lbl.gov/blcr-dist/
> > > Both source tarball and SRPM are available. The filenames and MD5 checksums are:
> > > 93249f20abd4eeec7a07db2f2a6cd2b2 blcr-0.6.2_b1.tar.gz
> > > e8ecba22c98de143ced20f83db76d8a1 blcr-0.6.2_b1-1.src.rpm
> > >
> > > This is a beta of a 0.6.2 patch release. The intent of 0.6.2 is to fix a small number of significant bugs found in 0.6.0 and 0.6.1 and to add support for 2.6.23 kernels and some vendor-patched 2.6.22 kernels. A NEWS entry summarizing these changes appears below.
> > >
> > > You are receiving this e-mail either because you are subscribed to the checkpoint_at_lbl_dot_gov mailing list or because you have reported one of the bugs or previously unsupported kernel versions addressed by this release. I apologize if you receive multiple copies.
> > >
> > > I would greatly appreciate any feedback (positive or negative) indicating if this beta fixes any problems you have reported with BLCR 0.6.0 and/or 0.6.1. Only after I have sufficient positive feedback will I make 0.6.2 available for download from the main BLCR web pages.
> > >
> > > -Paul
> > >
> > >
> > > 0.6.2_b1
> > > --------
> > > December 17, 2007
> > > Bug-fix and expanded-support release.
> > > - This release adds support for 2.6.23 kernels.
> > > - This release adds support for SuSE's 2.6.22.x kernels.
> > > - This release fixes a file descriptor leak that occurred on restart from a checkpoint-of-self requested via cr_request_checkpoint().
> > > - This release fixes a deadlock (and unkillable process(es)) when a multi-threaded process aborts (or omits itself from) a checkpoint under certain conditions.
> > > - This release fixes a restart-time failure when a checkpoint includes a pipe with one end outside the checkpoint scope, and data is buffered in the pipe.
> > > - This release fixes a bug with the cr_request{,_file}() calls in which a failed checkpoint would cause failure of the next one if it had the same destination file name.
> > > - This release fixes a race condition with the cr_enter_cs() and checkpoints in multi-threaded processes.
> > > - This release fixes post-checkpoint signal delivery (--stop and friends) to occur after the checkpoint is fully completed. See bug 2201 for a full description of the problems addressed by these changes.
> > > - This release documents (and fully implements) signal-delivery options to cr_restart (see bug 2200).
> > > - Adds test cases for most of the bugs fixed in this release.

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900

--- cr_module/cr_rstrt_req.c    11 Dec 2007 20:17:11 -0000    1.223.8.4
+++ cr_module/cr_rstrt_req.c    19 Dec 2007 03:33:57 -0000    1.223.8.5
@@ -2514,9 +2514,14 @@
             signal = SIGSTOP;
         }
         if (signal) {
-            if (!test_and_set_bit(0, &proc_req->done_sig)) {
+#if HAVE_2_6_SIGNAL_STRUCT
+            if ((atomic_read(&current->signal->count) == 1) ||
+                !test_and_set_bit(0, &proc_req->done_sig)) {
                 kill_proc(current->tgid, signal, 0);
             }
+#else
+            send_sig_info(signal, NULL, current);
+#endif
             cr_barrier_enter(&proc_req->postsig_barrier);
         }
     }
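
A minimal sketch of the approach recommended in the thread above: a monitoring daemon runs the cr_restart utility with "--cont" as a child process instead of reimplementing its internals. The helper names spawn(), restart_from_checkpoint() and wait_for_exit() are hypothetical, cr_restart is assumed to be on PATH and to take the context file as its last argument, and the 128+signal return value is a shell-style convention chosen here, not something BLCR defines.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork and exec argv[0] (searched on PATH). Returns the child's pid,
 * or -1 if fork() failed. The caller keeps running immediately. */
pid_t spawn(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);
    pid_t pid = fork();
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                 /* only reached if exec failed */
    }
    return pid;
}

/* Hypothetical helper: start "cr_restart --cont <context_file>".
 * "--cont" asks cr_restart to deliver SIGCONT once the restart is done,
 * so the daemon does not have to guess when to send it. */
pid_t restart_from_checkpoint(const char *context_file)
{
    assert(context_file != NULL);
    char *const argv[] = { "cr_restart", "--cont",
                           (char *)context_file, NULL };
    return spawn(argv);
}

/* Block until pid terminates. Returns its exit code, 128+signo if it
 * was killed by a signal (assumed convention), or -1 on error. */
int wait_for_exit(pid_t pid)
{
    int status;
    if (pid <= 0)
        return -1;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;              /* retry only when interrupted */
    }
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}
```

Because cr_restart waits for the restarted process and then exits with the same exit code, the daemon can treat the cr_restart child as a stand-in for the application: restart_from_checkpoint() returns right away, and a later wait_for_exit() (or a SIGCHLD handler) reports how the application ended.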