Re: Hang in cr_restart

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Jan 29 2009 - 00:34:38 PST

  • Next message: Karthik Gopalakrishnan: "Re: Hang in cr_restart"
    I think the root of your problem is that BLCR invokes its callbacks with 
    all signals blocked.  This is preventing SIGCHLD from being delivered.  
    You could unblock the signal yourself, but that is probably not the way 
    to go (though I can't say for sure without seeing the full application).  I 
    think that perhaps you are not using the callback as we had intended 
    (though I admit our documentation is a little "thin").  It was not our 
    intention that the "normal" flow of your application would pick up in the 
    callback, as your call to do_real_work() appears to do.  Instead, it would 
    be proper for the callback to raise some signal or otherwise "tell" the 
    normal application flow (which is, I believe, currently just "while(1)") 
    to do something.
    It is probably also worth noting that the child created by fork() 
    inherits the signal mask of the parent, which in your case means the child 
    spawned by the do_real_work() call in CR_Callback() is going to run with 
    all signals blocked, just as the callback does.
    Let us know if I have not been clear, or if you need more help.
    Karthik Gopalakrishnan wrote:
    > Hello.
    > I apologize for the long mail in advance. :-)
    > I have an application which roughly works as follows:
    > main()
    > {
    >     do_cr_initialization();
    >     do_real_work();
    > }
    > do_real_work()
    > {
    >     register(SIGCHLD_Handler);
    >     fork();
    >     if (child) {
    >         do_stuff();
    >         exit(0);
    >     }
    >     while(1);
    > }
    > SIGCHLD_Handler()
    > {
    >     wait_for_child();
    >     exit(0);
    > }
    > CR_Callback()
    > {
    >     if (restarting)
    >         do_real_work();
    > }
    > do_stuff() is intelligent enough to continue from where it left off.
    > Now, under normal execution, after the do_stuff() completes & exit(0)
    > is called, SIGCHLD_Handler() is invoked which terminates the
    > application. However, when cr_restart is called after a checkpoint,
    > the application just "hangs" after do_stuff() completes the remaining
    > work & calls exit(0). SIGCHLD_Handler() is not invoked at restart at
    > all. The output of 'ps' shows the following:
    > UID        PID  PPID  C STIME TTY      CMD
    > gopalakk 11886 12020  0 20:30 pts/0    a.out
    > gopalakk 12020 10333  0 20:30 pts/0    cr_restart context.11886
    > gopalakk 12026 11886  0 20:30 pts/0    [a.out] <defunct>
    > Can someone explain what's going on here?
    > Thanks & Regards,
    > Karthik
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
