From: Karthik Gopalakrishnan (gopalakk_at_cse.ohio-state.edu)
Date: Thu Jan 29 2009 - 13:24:21 PST
Hi Paul.

I just fixed the issue based on your feedback. I did not completely
understand the restart code path, which wrongly led me to believe that
BLCR *only* restores execution from the CR Callback. That is why I tried
to fork an additional child. I now understand BLCR's functionality a lot
better and my program does the right thing.

Thanks & Regards,
Karthik

On Thu, Jan 29, 2009 at 1:08 PM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:
> Since I am not clear on *why* you are trying to spawn a new/additional child
> process at restart time, I don't think I can point to an example in the BLCR
> tests.
> If you could explain a bit more about what you are trying to do I might be
> able to help more.
>
> -Paul
>
> Karthik Gopalakrishnan wrote:
>>
>> Hi Paul.
>>
>> Thanks. That confirms what I suspected. Even a Ctrl+C does not work
>> after restart. And I think I understand what you are saying wrt not
>> calling the do_real_work() function from the CR Callback. I will
>> restructure my program to avoid that. Could you please point me to a
>> suitable example in BLCR's 'tests' directory?
>>
>> Thanks & Regards,
>> Karthik
>>
>> On Thu, Jan 29, 2009 at 3:34 AM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
>> wrote:
>>>
>>> I think the root of your problem is that BLCR invokes its callbacks with
>>> all signals blocked. This is preventing SIGCHLD from being delivered. You
>>> could unblock the signal yourself, but that is probably not the way to go
>>> (though I can't say for sure not seeing the full application). I think
>>> that perhaps you are not using the callback as we had intended (though I
>>> admit our documentation is a little "thin"). It was not our intention that
>>> the "normal" flow of your application would pick up in the callback, as
>>> your call to do_real_work() appears to. Instead it would be proper for the
>>> callback to raise some signal or otherwise "tell" the normal application
>>> flow (which is, I believe, currently just "while(1)") to do something.
>>>
>>> It is probably also worth noting that the child created by fork() inherits
>>> the signal mask of the parent, which in your case means the one spawned by
>>> the do_real_work() call in CR_Callback() is going to run with all signals
>>> blocked just as the callback does.
>>>
>>> Let us know if I have not been clear, or if you need more help.
>>>
>>> -Paul
>>>
>>> Karthik Gopalakrishnan wrote:
>>>>
>>>> Hello.
>>>>
>>>> I apologize for the long mail in advance. :-)
>>>>
>>>> I have an application which roughly works as follows:
>>>>
>>>> main()
>>>> {
>>>>     do_cr_initialization();
>>>>     do_real_work();
>>>> }
>>>>
>>>> do_real_work()
>>>> {
>>>>     register(SIGCHLD_Handler);
>>>>     fork();
>>>>     if (child) {
>>>>         do_stuff();
>>>>         exit(0);
>>>>     }
>>>>     while(1);
>>>> }
>>>>
>>>> SIGCHLD_Handler()
>>>> {
>>>>     wait_for_child();
>>>>     exit(0);
>>>> }
>>>>
>>>> CR_Callback()
>>>> {
>>>>     if (restarting)
>>>>         do_real_work();
>>>> }
>>>>
>>>> do_stuff() is intelligent enough to continue from where it left off.
>>>> Now, under normal execution, after do_stuff() completes and exit(0)
>>>> is called, SIGCHLD_Handler() is invoked, which terminates the
>>>> application. However, when cr_restart is called after a checkpoint,
>>>> the application just "hangs" after do_stuff() completes the remaining
>>>> work and calls exit(0). SIGCHLD_Handler() is not invoked at restart at
>>>> all. The output of 'ps' shows the following:
>>>>
>>>> UID        PID  PPID  C STIME TTY    CMD
>>>> gopalakk 11886 12020  0 20:30 pts/0  a.out
>>>> gopalakk 12020 10333  0 20:30 pts/0  cr_restart context.11886
>>>> gopalakk 12026 11886  0 20:30 pts/0  [a.out] <defunct>
>>>>
>>>> Can someone explain what's going on here?
>>>>
>>>> Thanks & Regards,
>>>> Karthik
>>>>
>>>
>>> --
>>> Paul H. Hargrove                 PHHargrove_at_lbl_dot_gov
>>> Future Technologies Group        Tel: +1-510-495-2352
>>> HPC Research Department          Fax: +1-510-486-6900
>>> Lawrence Berkeley National Laboratory
>>>
>
> --
> Paul H. Hargrove                 PHHargrove_at_lbl_dot_gov
> Future Technologies Group        Tel: +1-510-495-2352
> HPC Research Department          Fax: +1-510-486-6900
> Lawrence Berkeley National Laboratory