From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Jan 29 2009 - 00:34:38 PST
I think the root of your problem is that BLCR invokes its callbacks with all signals blocked. This is preventing SIGCHLD from being delivered. You could unblock the signal yourself, but that is probably not the way to go (though I can't say for sure without seeing the full application).

I think that perhaps you are not using the callback as we had intended (though I admit our documentation is a little "thin"). It was not our intention that the "normal" flow of your application would pick up in the callback, as your call to do_real_work() appears to. Instead it would be proper for the callback to raise some signal or otherwise "tell" the normal application flow (which is, I believe, currently just "while(1)") to do something.

It is probably also worth noting that the child created by fork() inherits the signal mask of the parent, which in your case means the one spawned by the do_real_work() call in CR_Callback() is going to run with all signals blocked just as the callback does.

Let us know if I have not been clear, or if you need more help.

-Paul

Karthik Gopalakrishnan wrote:
> Hello.
>
> I apologize for the long mail in advance. :-)
>
> I have an application which roughly works as follows:
>
> main()
> {
>     do_cr_initialization();
>     do_real_work();
> }
>
> do_real_work()
> {
>     register(SIGCHLD_Handler);
>     fork();
>     if (child) {
>         do_stuff();
>         exit(0);
>     }
>     while(1);
> }
>
> SIGCHLD_Handler()
> {
>     wait_for_child();
>     exit(0);
> }
>
> CR_Callback()
> {
>     if (restarting)
>         do_real_work()
> }
>
> do_stuff() is intelligent enough to continue from where it left off.
> Now, under normal execution, after the do_stuff() completes & exit(0)
> is called, SIGCHLD_Handler() is invoked which terminates the
> application. However, when cr_restart is called after a checkpoint,
> the application just "hangs" after do_stuff() completes the remaining
> work & calls exit(0). SIGCHLD_Handler() is not invoked at restart at
> all.
> The output of 'ps' shows the following:
>
> UID        PID  PPID  C STIME TTY    CMD
> gopalakk 11886 12020  0 20:30 pts/0  a.out
> gopalakk 12020 10333  0 20:30 pts/0  cr_restart context.11886
> gopalakk 12026 11886  0 20:30 pts/0  [a.out] <defunct>
>
> Can someone explain what's going on here.
>
> Thanks & Regards,
> Karthik
>

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
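
The two points in the reply (a child inherits the signal mask in effect at fork(), and the callback should only "tell" the normal flow to act) can be sketched in plain POSIX C. This is an illustration under stated assumptions, not the original application and not BLCR's API: on_restart(), handle_pending_restart() and the other function names are hypothetical, and the "callback environment" is emulated by blocking all signals with sigprocmask().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set by the callback, consumed by the normal application flow. */
volatile sig_atomic_t restart_requested = 0;

/* Hypothetical stand-in for CR_Callback(): it only records the restart
 * and returns, instead of calling do_real_work() itself. */
void on_restart(void)
{
    restart_requested = 1;
}

/* 1 if SIGCHLD is blocked in the calling process, 0 if not, -1 on error. */
int sigchld_is_blocked(void)
{
    sigset_t cur;

    if (sigprocmask(SIG_SETMASK, NULL, &cur) != 0)
        return -1;
    return sigismember(&cur, SIGCHLD);
}

/* Fork a child that reports, through its exit status, whether SIGCHLD
 * is blocked in its inherited signal mask. Returns 1, 0, or -1 on error. */
int child_sees_sigchld_blocked(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(sigchld_is_blocked() == 1 ? 1 : 0);
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/* Emulate the callback environment (assumption: all signals blocked),
 * fork from inside it, then restore the previous mask. */
int fork_with_all_signals_blocked(void)
{
    sigset_t all, saved;
    int result;

    sigfillset(&all);
    if (sigprocmask(SIG_BLOCK, &all, &saved) != 0)
        return -1;
    /* Sanity check: the mask change took effect in this process. */
    assert(sigchld_is_blocked() == 1);
    result = child_sees_sigchld_blocked();
    sigprocmask(SIG_SETMASK, &saved, NULL);
    return result;
}

/* What the normal flow could do in place of a bare while(1): notice the
 * flag and fork from here, where the mask is the application's own.
 * Returns -1 if no restart is pending, else the child's view of SIGCHLD. */
int handle_pending_restart(void)
{
    if (!restart_requested)
        return -1;
    restart_requested = 0;
    return child_sees_sigchld_blocked();
}
```

A child forked via fork_with_all_signals_blocked() sees SIGCHLD blocked, which mirrors the child spawned from inside the callback. A child forked via handle_pending_restart() sees whatever mask the normal flow has, so a SIGCHLD handler registered there can run when the child exits. In a real program the main loop would wait with something like sigsuspend() rather than spin, and the callback could raise a signal instead of setting a flag; both are design choices left open here.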