From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Jan 29 2009 - 10:08:52 PST
Since I am not clear on *why* you are trying to spawn a new/additional child
process at restart time, I don't think I can point to an example in the BLCR
tests. If you could explain a bit more about what you are trying to do, I might
be able to help more.

-Paul

Karthik Gopalakrishnan wrote:
> Hi Paul.
>
> Thanks. That confirms what I suspected. Even a Ctrl+C does not work
> after restart. And I think I understand what you are saying wrt not
> calling the do_real_work() function from the CR Callback. I will
> restructure my program to avoid that. Could you please point me to a
> suitable example in BLCR's 'tests' directory.
>
> Thanks & Regards,
> Karthik
>
> On Thu, Jan 29, 2009 at 3:34 AM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:
>
>> I think the root of your problem is that BLCR invokes its callbacks with all
>> signals blocked. This is preventing SIGCHLD from being delivered. You
>> could unblock the signal yourself, but that is probably not the way to go
>> (though I can't say for sure not seeing the full application). I think that
>> perhaps you are not using the callback as we had intended (though I admit
>> our documentation is a little "thin"). It was not our intention that the
>> "normal" flow of your application would pickup in the callback, as your call
>> to do_real_work() appears to. Instead it would be proper for the callback
>> to raise some signal or otherwise "tell" the normal application flow (which
>> is, I believe, currently just "while(1)") to do something.
>>
>> It is probably also worth noting that the child created by fork() inherits
>> the signal mask of the parent, which in your case means the one spawned by
>> the do_real_work() call in CR_Callback() is going to run with all signals
>> blocked just as the callback does.
>>
>> Let us know if I have not been clear, or if you need more help.
>>
>> -Paul
>>
>> Karthik Gopalakrishnan wrote:
>>
>>> Hello.
>>>
>>> I apologize for the long mail in advance. :-)
>>>
>>> I have an application which roughly works as follows:
>>>
>>> main()
>>> {
>>>     do_cr_initialization();
>>>     do_real_work();
>>> }
>>>
>>> do_real_work()
>>> {
>>>     register(SIGCHLD_Handler);
>>>     fork();
>>>     if (child) {
>>>         do_stuff();
>>>         exit(0);
>>>     }
>>>     while(1);
>>> }
>>>
>>> SIGCHLD_Handler()
>>> {
>>>     wait_for_child();
>>>     exit(0);
>>> }
>>>
>>> CR_Callback()
>>> {
>>>     if (restarting)
>>>         do_real_work()
>>> }
>>>
>>> do_stuff() is intelligent enough to continue from where it left off.
>>> Now, under normal execution, after the do_stuff() completes & exit(0)
>>> is called, SIGCHLD_Handler() is invoked which terminates the
>>> application. However, when cr_restart is called after a checkpoint,
>>> the application just "hangs" after do_stuff() completes the remaining
>>> work & calls exit(0). SIGCHLD_Handler() is not invoked at restart at
>>> all. The output of 'ps' shows the following:
>>>
>>> UID        PID  PPID  C STIME TTY   CMD
>>> gopalakk 11886 12020  0 20:30 pts/0 a.out
>>> gopalakk 12020 10333  0 20:30 pts/0 cr_restart context.11886
>>> gopalakk 12026 11886  0 20:30 pts/0 [a.out] <defunct>
>>>
>>> Can someone explain what's going on here.
>>>
>>> Thanks & Regards,
>>> Karthik
>>>
>>>
>> --
>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
>> Future Technologies Group                 Tel: +1-510-495-2352
>> HPC Research Department                   Fax: +1-510-486-6900
>> Lawrence Berkeley National Laboratory
>>
>>

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory