Re: Hang in cr_restart

From: Karthik Gopalakrishnan (gopalakk_at_cse.ohio-state.edu)
Date: Thu Jan 29 2009 - 13:24:21 PST

    Hi Paul.
    
    I just fixed the issue based on your feedback. I did not completely
    understand the restart code path, which wrongly led me to believe that
    BLCR *only* restores execution from the CR Callback. That is why I
    tried to fork an additional child. I now understand BLCR's
    functionality a lot better, and my program does the right thing.
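
A minimal sketch of the restructuring suggested in the quoted replies below, under stated assumptions: the callback only records that a restart happened, and the normal application flow reacts to it and does any fork() itself. The names callback_sketch(), consume_restart() and run_worker() are hypothetical and are not part of the BLCR API. In a real program the callback would be registered with BLCR, and restart detection would come from BLCR rather than from the `restarting` argument used here.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set by the callback, consumed by the normal application flow. */
static volatile sig_atomic_t restart_pending = 0;

/* Hypothetical stand-in for the checkpoint callback (NOT the BLCR API).
 * Callbacks run with signals blocked, so this one only records that a
 * restart happened and returns immediately. */
static int callback_sketch(int restarting)
{
    if (restarting)
        restart_pending = 1;
    return 0;
}

/* Called from the normal flow, e.g. inside the main loop.
 * Returns 1 exactly once per recorded restart, 0 otherwise. */
static int consume_restart(void)
{
    if (!restart_pending)
        return 0;
    restart_pending = 0;
    return 1;
}

/* Placeholder for the real work done in the child. */
static void do_stuff(void) { }

/* Forks a worker from the normal flow (not from the callback) and
 * waits for it. Returns the child's exit status, or -1 on error. */
static int run_worker(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        do_stuff();
        _exit(0);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    assert(WIFEXITED(status));
    return WEXITSTATUS(status);
}
```

The main loop would then call consume_restart() on each iteration instead of spinning in a bare while(1), so any work triggered by a restart runs with the application's normal signal mask.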
    
    Thanks & Regards,
    Karthik
    
    On Thu, Jan 29, 2009 at 1:08 PM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:
    > Since I am not clear on *why* you are trying to spawn a new/additional child
    > process at restart time, I don't think I can point to an example in the BLCR
    > tests.
    > If you could explain a bit more about what you are trying to do I might be
    > able to help more.
    >
    > -Paul
    >
    > Karthik Gopalakrishnan wrote:
    >>
    >> Hi Paul.
    >>
    >> Thanks. That confirms what I suspected. Even a Ctrl+C does not work
    >> after restart. And I think I understand what you are saying wrt not
    >> calling the do_real_work() function from the CR Callback. I will
    >> restructure my program to avoid that. Could you please point me to a
    >> suitable example in BLCR's 'tests' directory?
    >>
    >> Thanks & Regards,
    >> Karthik
    >>
    >> On Thu, Jan 29, 2009 at 3:34 AM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:
    >>
    >>>
    >>> I think the root of your problem is that BLCR invokes its callbacks
    >>> with all signals blocked.  This is preventing SIGCHLD from being
    >>> delivered.  You could unblock the signal yourself, but that is probably
    >>> not the way to go (though I can't say for sure without seeing the full
    >>> application).  I think that perhaps you are not using the callback as
    >>> we had intended (though I admit our documentation is a little "thin").
    >>> It was not our intention that the "normal" flow of your application
    >>> would pick up in the callback, as your call to do_real_work() appears
    >>> to.  Instead it would be proper for the callback to raise some signal
    >>> or otherwise "tell" the normal application flow (which is, I believe,
    >>> currently just "while(1)") to do something.
    >>>
    >>> It is probably also worth noting that the child created by fork()
    >>> inherits the signal mask of the parent, which in your case means the
    >>> one spawned by the do_real_work() call in CR_Callback() is going to run
    >>> with all signals blocked just as the callback does.
    >>>
    >>> Let us know if I have not been clear, or if you need more help.
    >>>
    >>> -Paul
    >>>
    >>> Karthik Gopalakrishnan wrote:
    >>>
    >>>>
    >>>> Hello.
    >>>>
    >>>> I apologize in advance for the long mail. :-)
    >>>>
    >>>> I have an application which roughly works as follows:
    >>>>
    >>>> main()
    >>>> {
    >>>>   do_cr_initialization();
    >>>>   do_real_work();
    >>>> }
    >>>>
    >>>> do_real_work()
    >>>> {
    >>>>   register(SIGCHLD_Handler);
    >>>>   fork();
    >>>>   if (child) {
    >>>>       do_stuff();
    >>>>       exit(0);
    >>>>   }
    >>>>   while(1);
    >>>> }
    >>>>
    >>>> SIGCHLD_Handler()
    >>>> {
    >>>>   wait_for_child();
    >>>>   exit(0);
    >>>> }
    >>>>
    >>>> CR_Callback()
    >>>> {
    >>>>   if (restarting)
    >>>>       do_real_work();
    >>>> }
    >>>>
    >>>> do_stuff() is intelligent enough to continue from where it left off.
    >>>> Now, under normal execution, after do_stuff() completes & exit(0)
    >>>> is called, SIGCHLD_Handler() is invoked, which terminates the
    >>>> application. However, when cr_restart is called after a checkpoint,
    >>>> the application just "hangs" after do_stuff() completes the remaining
    >>>> work & calls exit(0). SIGCHLD_Handler() is not invoked at all after
    >>>> restart. The output of 'ps' shows the following:
    >>>>
    >>>> UID        PID  PPID  C STIME TTY      CMD
    >>>> gopalakk 11886 12020  0 20:30 pts/0    a.out
    >>>> gopalakk 12020 10333  0 20:30 pts/0    cr_restart context.11886
    >>>> gopalakk 12026 11886  0 20:30 pts/0    [a.out] <defunct>
    >>>>
    >>>> Can someone explain what's going on here?
    >>>>
    >>>> Thanks & Regards,
    >>>> Karthik
    >>>>
    >>>>
    >>>
    >>> --
    >>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >>> Future Technologies Group                 Tel: +1-510-495-2352
    >>> HPC Research Department                   Fax: +1-510-486-6900
    >>> Lawrence Berkeley National Laboratory
    >>>
    >>>
    >
    >
    > --
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group                 Tel: +1-510-495-2352
    > HPC Research Department                   Fax: +1-510-486-6900
    > Lawrence Berkeley National Laboratory
    >
    >
    
