From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Dec 18 2007 - 20:26:01 PST
Perhaps I am misunderstanding your problem, or you are misunderstanding how BLCR works. I am not clear on which.

The cr_restart utility restarts a process as its own child. It does not return as soon as the process is restarted; instead it waits until the child terminates and then terminates itself with the same exit code. Also, it is not the cr_restart utility that you need to send SIGCONT to, but the child(ren) that it spawns, and you need to wait until the child is fully restored before sending it. Together those are why using the "--cont" argument to cr_restart is my recommendation.

The function cr_rstrt_child() in the kernel module is like an exec() in that it *replaces* the calling thread with a thread of the restarted process. So, there is no return from that back to the calling context - when it does return, it is to the context of the checkpoint handler in the restarted process.

If you could be clearer on which "wait_event_interruptible() function in the loop" you mean, perhaps I can understand what you are trying to do. I suspect you mean the loop in cr_rstrt_procs(). That loop should terminate as soon as enough threads (the same number as in the checkpointed process) have reached the wake_up() call in cr_rstrt_child().

From what I have read so far, I am worried that you are trying to reimplement the internals of the cr_restart utility. If that is the case, I would strongly encourage you *not* to take that approach, since those internals change in each release. You should instead invoke the cr_restart utility to do the "dirty work". In a future release (once the interface stabilizes) we will add a cr_request_restart() entry point to libcr.

-Paul

Yuan Tang wrote:
> Sure! I mean, the RESTART procedure never complete, it means it never
> returns. It blocks in the function cr_rstrt_child. My program is let a
> background daemon process to checkpoint a SIGSTOPed process, and later
> restart it.
> After restart it, the daemon process will surely send the
> SIGCONT to the restarted SIGSTOPed process to let it go. However, the
> restart procedure never complete. So, the background daemon will never
> have the chance to send the SIGCONT. If I send the SIGCONT from the
> console, it just interrupt the wait_event_interruptible() function in
> the loop.
>
> ----- Original Message ----
> From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
> To: Yuan Tang <supertangcc_at_yahoo_dot_com>
> Cc: checkpoint_at_lbl_dot_gov
> Sent: Wednesday, December 19, 2007 5:19:58 AM
> Subject: Re: BLCR 0.6.2 beta1 now available
>
> Yuan Tang wrote:
> > Hi Paul,
> >
> > Thank you for the work. I downloaded the beta version, installed it
> > and tested it. The SIGSTOPed process could pass the whole checkpoint
> > procedure now. Congratulation! However, when I tried restarting the
> > previously checkpointed SIGSTOPed process from its disk image, the
> > RESTART procedure never completed. It blocks in
> > cr_rstrt_req.c:cr_rstrt_child(). I guess, if you move the
> > send_sig_info(SIGSTOP, NULL, task) stuff to cr_rstrt_task_complete(),
> > the whole procedure will complete normally. Hope it helps.
>
> I believe the current behavior is correct (or at least is what I've
> intended). The process that was SIGSTOPed when the checkpoint was
> requested is again SIGSTOPed when restarted. To get it running again
> you should be able to either send it a SIGCONT (which is tricky because
> you might not know how soon to send it), or you can simply pass "--cont"
> to cr_restart to have it done automatically.
>
> If you find that adding "--cont" to the cr_restart arguments still
> doesn't allow the restart to complete, let us know and we'll see if we
> can figure out what is going on.
>
> -Paul
>
> >
> > Best wishes!
> >
> > Yuan Tang
> >
> > ----- Original Message ----
> > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
> > To: checkpoint_at_lbl_dot_gov
> > Sent: Tuesday, December 18, 2007 4:23:30 AM
> > Subject: BLCR 0.6.2 beta1 now available
> >
> > The first beta of BLCR 0.6.2 is now available at
> > http://mantis.lbl.gov/blcr-dist/
> > Both source tarball and SRPM are available. The filenames and MD5
> > checksums are:
> > 93249f20abd4eeec7a07db2f2a6cd2b2 blcr-0.6.2_b1.tar.gz
> > e8ecba22c98de143ced20f83db76d8a1 blcr-0.6.2_b1-1.src.rpm
> >
> > This is a beta of a 0.6.2 patch release. The intent of 0.6.2 is to fix
> > a small number of significant bugs found in 0.6.0 and 0.6.1 and to add
> > support for 2.6.23 kernels and some vendor-patched 2.6.22 kernels. A
> > NEWS entry summarizing these changes appears below.
> >
> > You are receiving this e-mail either because you are subscribed to the
> > checkpoint_at_lbl_dot_gov mailing list or because you have reported one
> > of the bugs or previously unsupported kernel versions addressed by this
> > release. I apologize if you receive multiple copies.
> >
> > I would greatly appreciate any feedback (positive or negative)
> > indicating if this beta fixes any problems you have reported with BLCR
> > 0.6.0 and/or 0.6.1. Only after I have sufficient positive feedback will
> > I make 0.6.2 available for download from the main BLCR web pages.
> >
> > -Paul
> >
> >
> > 0.6.2_b1
> > --------
> > December 17, 2007
> > Bug-fix and expanded-support release.
> > - This release adds support for 2.6.23 kernels.
> > - This release adds support for SuSE's 2.6.22.x kernels.
> > - This release fixes a file descriptor leak that occurred on restart from
> >   a checkpoint-of-self requested via cr_request_checkpoint().
> > - This release fixes a deadlock (and unkillable process(es)) when a
> >   multi-threaded process aborts (or omits itself from) a checkpoint
> >   under certain conditions.
> > - This release fixes a restart-time failure when a checkpoint includes a
> >   pipe with one end outside the checkpoint scope, and data is buffered
> >   in the pipe.
> > - This release fixes a bug with the cr_request{,_file}() calls in which
> >   a failed checkpoint would cause failure of the next one if it had the
> >   same destination file name.
> > - This release fixes a race condition with the cr_enter_cs() and
> >   checkpoints in multi-threaded processes.
> > - This release fixes post-checkpoint signal delivery (--stop and friends)
> >   to occur after the checkpoint is fully completed. See bug 2201 for
> >   a full description of the problems addressed by these changes.
> > - This release documents (and fully implements) signal-delivery options
> >   to cr_restart (see bug 2200).
> > - Adds test cases for most of the bugs fixed in this release.
> >
> >
> > --
> > Paul H. Hargrove PHHargrove_at_lbl_dot_gov
> > Future Technologies Group
> > HPC Research Department Tel: +1-510-495-2352
> > Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>
> --
> Paul H. Hargrove PHHargrove_at_lbl_dot_gov
> Future Technologies Group
> HPC Research Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

--
Paul H. Hargrove PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
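
The recommendation in the reply above, to run cr_restart with "--cont" rather than reimplement its internals, could be followed from a daemon roughly as in the Python sketch below. This is only an illustration and not part of BLCR: the helper names build_restart_argv() and run_and_wait() are hypothetical, and passing the context file as the final argument is an assumption about how cr_restart is invoked. Only the "--cont" option and the wait-then-exit-with-the-same-code behavior come from this thread.

```python
import subprocess


def build_restart_argv(context_file, cont=True):
    """Build a cr_restart command line (hypothetical helper).

    "--cont" is the option recommended in this thread; placing the
    context file last is an assumption about the utility's usage.
    """
    argv = ["cr_restart"]
    if cont:
        argv.append("--cont")
    argv.append(context_file)
    return argv


def run_and_wait(argv):
    """Run argv as a child process, block until it exits, return its exit code.

    Per the thread, cr_restart does not return until the restarted
    process terminates and then exits with the same code, so this call
    blocks for the whole lifetime of the restarted process.
    """
    return subprocess.run(argv).returncode
```

Because cr_restart itself blocks until the restarted process exits, a daemon that has other work to do would probably launch it without waiting inline (for example with subprocess.Popen) and collect the exit code later.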