Re: Restart my program failed ?

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Jan 14 2008 - 21:05:08 PST

  • Next message: Locus Jackson: "Re: Restart my program failed ?"
    Locus Jackson wrote:
    > Hello,
    > I use BLCR to checkpoint and restart my program (a single-threaded 
    > application).
    > But when I try to restart my program, it always fails.
    > The general form of my program is as follows:
    >
    >          void set_checkpoint()   //use this function to set a 
    > checkpoint at any time and place
    >        {
    >          ........
    >        }
    >
    >        void call_restart(char* filename)  //use this function to 
    > restart my program in case it failed
    >        {
    >          ......
    >          system("cr_restart filename");
    >        }
    >
    >        int global_number=2;     
    >        int main()
    >        {
    >           ......
    >           statement1;
    >           set_checkpoint();
    >           statement2;
    >           ......
    >           while(global_number>0)   // I want to restart my program 2 times
    >           {
    >             global_number--;
    >             call_restart();
    >            }
    >            statement3;
    >            ......
    >          }
    >
    > When I execute this program, it restarts far more than two 
    > times, until it tells me "Restart failed: Device or resource busy".
    > In my call_restart() function, I fork a child to restart my 
    > program (before it restarts, its parent has exited, and the pgid of the 
    > child is set to the child's pid), but on restart the forked child 
    > is always the child of the exited parent; the parent seems to 
    > be still alive. I do not know why.
    > Thank you for your help.
    >
    > Locus.  
    Locus,
    
    Your first issue is "it restarts far more than two times".  That is 
    because the value of "global_number" is restored to 2, the value it 
    had when the checkpoint was taken, each time BLCR restarts the program.  
    You will need to use a different mechanism to handle any value that is 
    supposed to change across checkpoints.  I suggest that you try 
    registering a checkpoint callback.
    
    In main add these two lines:
      cr_client_id_t id = cr_init();
      cr_register_callback(my_callback, NULL, CR_SIGNAL_CONTEXT);
    
    and somewhere before main (or with a prototype before main) add the 
    following function:
      static int my_callback(void* arg) {
        int rc = cr_checkpoint(CR_CHECKPOINT_READY);
        if (rc > 0) { /* Restarting */
          --global_number;
        }
        return 0;
      }
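
To illustrate why an in-memory counter cannot count restarts, here is a small plain-C simulation. It does not use BLCR at all: the struct process_image, the helper names, and the counter file are hypothetical stand-ins. It shows a different mechanism from the callback above, namely keeping the count in a file outside the checkpointed memory, so treat it as a sketch under those assumptions.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the memory that a checkpoint captures. */
struct process_image {
    int global_number;
};

/* Hypothetical helper: read the restart count kept outside the image.
   Returns 0 if the file is missing or unreadable. */
static int read_restart_count(const char *path) {
    assert(path != NULL);
    FILE *f = fopen(path, "r");
    int n = 0;
    if (f == NULL) return 0;
    if (fscanf(f, "%d", &n) != 1) n = 0;
    fclose(f);
    return n;
}

/* Hypothetical helper: increment the stored count.
   Returns the new count, or -1 on error. */
static int bump_restart_count(const char *path) {
    int n = read_restart_count(path) + 1;
    FILE *f = fopen(path, "w");
    if (f == NULL) return -1;
    fprintf(f, "%d\n", n);
    if (fclose(f) != 0) return -1;
    return n;
}

/* Each pass models resuming from the same checkpoint, in which
   global_number was saved as 2.  The in-memory decrement is lost on
   the next pass; only the file-backed count makes progress.
   Returns the number of restarts requested before the limit hit. */
static int simulate(const char *path, int limit, int *last_in_memory) {
    const struct process_image checkpoint = { 2 };
    struct process_image live;
    int restarts = 0;

    remove(path);                      /* start with a fresh counter */
    for (;;) {
        live = checkpoint;             /* memory returns to saved state */
        live.global_number--;          /* always 2 -> 1, never reaches 0 */
        *last_in_memory = live.global_number;
        if (read_restart_count(path) >= limit) break;
        if (bump_restart_count(path) < 0) break;
        restarts++;
    }
    return restarts;
}
```

Calling simulate("count.tmp", 2, &n) stops after two restarts because the file-backed count reaches the limit, while the in-memory value seen on every pass is the same.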
    
    As for eventually failing with "Device or resource busy", I imagine that 
    with the many restarts you may have eventually reused the original PID 
    for the cr_restart executable.  Perhaps that problem will go away when 
    you fix the multiple restarts problem.  The other possibility here is 
    that you are trying to restart the same process multiple times 
    *concurrently*, thus trying to use the original PID twice at the same time.
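
If concurrent restarts turn out to be the cause, one way to rule them out is to take an exclusive lock before launching cr_restart. The sketch below uses a POSIX lock file created with O_CREAT | O_EXCL. The function names and the lock path are hypothetical, it is not part of BLCR, and it does not handle a stale lock left behind by a crashed process.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: claim the restart lock.
   Returns 0 on success.  Returns -1 if the lock file already exists
   (errno is EEXIST) or on any other error. */
static int acquire_restart_lock(const char *lock_path) {
    assert(lock_path != NULL);
    /* O_EXCL makes creation fail if the file is already there, so only
       one caller at a time can succeed. */
    int fd = open(lock_path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0) return -1;
    close(fd);
    return 0;
}

/* Hypothetical helper: give the lock back once the restarted process
   has finished.  Returns 0 on success, -1 on error. */
static int release_restart_lock(const char *lock_path) {
    assert(lock_path != NULL);
    return unlink(lock_path);
}
```

The idea would be to call acquire_restart_lock() before invoking cr_restart, skip the restart when it returns -1, and release the lock only after the restarted process has exited.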
    
    Let me know if you need any more help.
    
    -Paul
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
