Re: Hang in cr_restart

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Jan 29 2009 - 10:08:52 PST

    Since I am not clear on *why* you are trying to spawn a new/additional
    child process at restart time, I don't think I can point to an example 
    in the BLCR tests.
    If you could explain a bit more about what you are trying to do I might 
    be able to help more.
    
    -Paul
    
    Karthik Gopalakrishnan wrote:
    > Hi Paul.
    >
    > Thanks. That confirms what I suspected. Even a Ctrl+C does not work
    > after restart. And I think I understand what you are saying wrt not
    > calling the do_real_work() function from the CR Callback. I will
    > restructure my program to avoid that. Could you please point me to a
    > suitable example in BLCR's 'tests' directory?
    >
    > Thanks & Regards,
    > Karthik
    >
    > On Thu, Jan 29, 2009 at 3:34 AM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:
    >   
    >> I think the root of your problem is that BLCR invokes its callbacks with all
    >> signals blocked.  This is preventing SIGCHLD from being delivered.  You
    >> could unblock the signal yourself, but that is probably not the way to go
    >> (though I can't say for sure not seeing the full application).  I think that
    >> perhaps you are not using the callback as we had intended (though I admit
    >> our documentation is a little "thin").  It was not our intention that the
    >> "normal" flow of your application would pick up in the callback, as your call
    >> to do_real_work() appears to.  Instead it would be proper for the callback
    >> to raise some signal or otherwise "tell" the normal application flow (which
    >> is, I believe, currently just "while(1)") to do something.
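To make the suggestion above concrete, here is a minimal sketch in plain POSIX C of a callback that only "tells" the normal application flow to act. It does not use the BLCR API: on_restart() is a hypothetical stand-in for the body of CR_Callback(), and main_loop_step() is a hypothetical replacement for one pass of the while(1) loop. The point is that fork() is called from the normal flow, outside the callback and its blocked signal mask.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set by the callback, read by the normal application flow. */
static volatile sig_atomic_t restart_pending = 0;

/* Hypothetical stand-in for the application's restart callback.
 * It only records that a restart happened; it does no real work. */
void on_restart(void)
{
    restart_pending = 1;
}

/* Hypothetical stand-in for do_real_work(): fork a child, wait for it,
 * and return the child's exit status (or -1 on error). */
int do_real_work(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(0);               /* child: do_stuff() would go here */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* One pass of the main loop. Returns 0 if there was nothing to do,
 * 1 if a pending restart was handled, -1 if the work failed. */
int main_loop_step(void)
{
    if (!restart_pending)
        return 0;
    restart_pending = 0;
    return do_real_work() == 0 ? 1 : -1;
}
```

In the real program the main loop would call main_loop_step() repeatedly, ideally sleeping between passes rather than spinning.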
    >>
    >> It is probably also worth noting that the child created by fork() inherits
    >> the signal mask of the parent, which in your case means the one spawned by
    >> the do_real_work() call in CR_Callback() is going to run with all signals
    >> blocked just as the callback does.
    >>
    >> Let us know if I have not been clear, or if you need more help.
    >>
    >> -Paul
    >>
    >> Karthik Gopalakrishnan wrote:
    >>     
    >>> Hello.
    >>>
    >>> I apologize for the long mail in advance. :-)
    >>>
    >>> I have an application which roughly works as follows:
    >>>
    >>> main()
    >>> {
    >>>     do_cr_initialization();
    >>>     do_real_work();
    >>> }
    >>>
    >>> do_real_work()
    >>> {
    >>>     register(SIGCHLD_Handler);
    >>>     fork();
    >>>     if (child) {
    >>>         do_stuff();
    >>>         exit(0);
    >>>     }
    >>>     while(1);
    >>> }
    >>>
    >>> SIGCHLD_Handler()
    >>> {
    >>>     wait_for_child();
    >>>     exit(0);
    >>> }
    >>>
    >>> CR_Callback()
    >>> {
    >>>     if (restarting)
    >>>         do_real_work();
    >>> }
    >>>
    >>> do_stuff() is intelligent enough to continue from where it left off.
    >>> Now, under normal execution, after the do_stuff() completes & exit(0)
    >>> is called, SIGCHLD_Handler() is invoked which terminates the
    >>> application. However, when cr_restart is called after a checkpoint,
    >>> the application just "hangs" after do_stuff() completes the remaining
    >>> work & calls exit(0). SIGCHLD_Handler() is not invoked at restart at
    >>> all. The output of 'ps' shows the following:
    >>>
    >>> UID        PID  PPID  C STIME TTY      CMD
    >>> gopalakk 11886 12020  0 20:30 pts/0    a.out
    >>> gopalakk 12020 10333  0 20:30 pts/0    cr_restart context.11886
    >>> gopalakk 12026 11886  0 20:30 pts/0    [a.out] <defunct>
    >>>
    >>> Can someone explain what's going on here?
    >>>
    >>> Thanks & Regards,
    >>> Karthik
    >>>
    >>>       
    >> --
    >> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >> Future Technologies Group                 Tel: +1-510-495-2352
    >> HPC Research Department                   Fax: +1-510-486-6900
    >> Lawrence Berkeley National Laboratory
    >>
    >>     
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
    
