From: Karthik Gopalakrishnan (gopalakk_at_cse.ohio-state.edu)
Date: Thu Jan 29 2009 - 01:31:53 PST
Hi Paul.

Thanks. That confirms what I suspected. Even a Ctrl+C does not work after
restart. And I think I understand what you are saying wrt not calling the
do_real_work() function from the CR Callback. I will restructure my
program to avoid that. Could you please point me to a suitable example in
BLCR's 'tests' directory?

Thanks & Regards,
Karthik

On Thu, Jan 29, 2009 at 3:34 AM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:
> I think the root of your problem is that BLCR invokes its callbacks with all
> signals blocked. This is preventing SIGCHLD from being delivered. You
> could unblock the signal yourself, but that is probably not the way to go
> (though I can't say for sure not seeing the full application). I think that
> perhaps you are not using the callback as we had intended (though I admit
> our documentation is a little "thin"). It was not our intention that the
> "normal" flow of your application would pick up in the callback, as your call
> to do_real_work() appears to. Instead it would be proper for the callback
> to raise some signal or otherwise "tell" the normal application flow (which
> is, I believe, currently just "while(1)") to do something.
>
> It is probably also worth noting that the child created by fork() inherits
> the signal mask of the parent, which in your case means the one spawned by
> the do_real_work() call in CR_Callback() is going to run with all signals
> blocked just as the callback does.
>
> Let us know if I have not been clear, or if you need more help.
>
> -Paul
>
> Karthik Gopalakrishnan wrote:
>>
>> Hello.
>>
>> I apologize for the long mail in advance. :-)
>>
>> I have an application which roughly works as follows:
>>
>> main()
>> {
>>     do_cr_initialization();
>>     do_real_work();
>> }
>>
>> do_real_work()
>> {
>>     register(SIGCHLD_Handler);
>>     fork();
>>     if (child) {
>>         do_stuff();
>>         exit(0);
>>     }
>>     while(1);
>> }
>>
>> SIGCHLD_Handler()
>> {
>>     wait_for_child();
>>     exit(0);
>> }
>>
>> CR_Callback()
>> {
>>     if (restarting)
>>         do_real_work();
>> }
>>
>> do_stuff() is intelligent enough to continue from where it left off.
>> Now, under normal execution, after the do_stuff() completes & exit(0)
>> is called, SIGCHLD_Handler() is invoked which terminates the
>> application. However, when cr_restart is called after a checkpoint,
>> the application just "hangs" after do_stuff() completes the remaining
>> work & calls exit(0). SIGCHLD_Handler() is not invoked at restart at
>> all. The output of 'ps' shows the following:
>>
>> UID        PID  PPID  C STIME TTY    CMD
>> gopalakk 11886 12020  0 20:30 pts/0  a.out
>> gopalakk 12020 10333  0 20:30 pts/0  cr_restart context.11886
>> gopalakk 12026 11886  0 20:30 pts/0  [a.out] <defunct>
>>
>> Can someone explain what's going on here?
>>
>> Thanks & Regards,
>> Karthik
>>
>
>
> --
> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
> Future Technologies Group                 Tel: +1-510-495-2352
> HPC Research Department                   Fax: +1-510-486-6900
> Lawrence Berkeley National Laboratory
>