Re: Restart my program failed ?

From: Locus Jackson (locus_jackson_at_yahoo_dot_com)
Date: Tue Jan 15 2008 - 04:49:17 PST

  • Next message: Paul H. Hargrove: "Re: Restart my program failed ?"
    Hi,
    I am sorry, but I still have some questions.
    In my function set_checkpoint(), I use cr_init(), cr_initialize_checkpoint_args_t, cr_request_checkpoint() and cr_poll_checkpoint() to take a checkpoint.
    In my function call_restart(), I use pipe(), fork(), and system() to restart my program from the checkpoint I took earlier.
    You suggested registering a checkpoint callback, but I have some difficulty understanding its mechanism, even though I have read libcr.h.
    1. cr_register_callback(cr_callback_t func, void* arg, int flags): when will the callback func be invoked? Will it be invoked after my function set_checkpoint() is called? When is the callback func invoked in general?
    2. In your reply, you wrote:
      static int my_callback(void* arg) {
        int rc = cr_checkpoint(CR_CHECKPOINT_READY);
        if (rc > 0) { /* Restarting */
          --global_number;
        }
        return 0;
      }
    Does cr_checkpoint() take a checkpoint the way my function set_checkpoint() does? If not, can I call call_restart()
    inside the if (rc > 0) branch to explicitly restart my program the number of times given by global_times?
    3. If possible, would you please give me an example that explains your callback method? I want to restart my program a given
    number of times, but right now, if I call call_restart(), the program runs forever, which is really terrible.
    Thank you very much for your kind help.
    
    Regards,
    Locus.
    
    ----- Original Message ----
    From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
    To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
    Cc: checkpoint_at_lbl_dot_gov
    Sent: Tuesday, January 15, 2008 1:05:08 PM
    Subject: Re: Restart my program failed ?
    
    
    Locus Jackson wrote:
    > Hello,
    > I use BLCR to checkpoint and restart my program (a single-threaded
    > application).
    > But when I try to restart my program, it always fails.
    > The general form of my program is as follows:
    >
    >        void set_checkpoint()   // use this function to set a checkpoint at any time and place
    >        {
    >          ........
    >        }
    >
    >        void call_restart(char* filename)  // use this function to restart my program in case it failed
    >        {
    >          ......
    >          system("cr_restart filename");
    >        }
    >
    >        int global_number=2;     
    >        int main()
    >        {
    >           ......
    >           statement1;
    >           set_checkpoint();
    >           statement2;
    >           ......
    >           while(global_number>0)   // I want to restart my program 2 times
    >           {
    >             global_number--;
    >             call_restart();
    >            }
    >            statement3;
    >            ......
    >          }
    >
    > When I execute this program, it restarts far more than two
    > times, until it tells me "Restart failed: Device or resource busy".
    > In my call_restart() function, I fork a child to restart my
    > program (before it restarts, its parent has exited, and the pgid of the
    > child is set to the child's pid), but on restart, the forked child
    > is always the child of the exited parent, and the parent seems to
    > still be alive. I do not know why.
    > Thank you for your help.
    >
    > Locus.  
    >         
    Locus,
    
    Your first issue is "it  restarts  far more than two times".  That is 
    because the value of "global_number" has been restored to the value 2 
    when BLCR restarted the program.  You will need to use a different 
    mechanism to handle any value that is supposed to change across 
    checkpoints.  I suggest that you try registering a checkpoint callback.
    
    In main add these two lines:
      cr_client_id_t id = cr_init();
      cr_register_callback(my_callback, NULL, CR_SIGNAL_CONTEXT);
    
    and somewhere add the following function:
      static int my_callback(void* arg) {
        int rc = cr_checkpoint(CR_CHECKPOINT_READY);
        if (rc > 0) { /* Restarting */
          --global_number;
        }
        return 0;
      }
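
Another way to keep a value that must change across restarts, sketched here as an illustration and not taken from the reply above, is to store it outside the process image, for example in a small file. The sketch assumes the file is opened and closed between checkpoints and that its contents are not rolled back on restart. The helper names (load_restart_budget, store_restart_budget, take_restart) and the file path are hypothetical and are not part of libcr.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read the remaining restart count from `path`.
 * Falls back to `initial` if the file is missing or unparsable. */
static int load_restart_budget(const char *path, int initial) {
    assert(path != NULL);
    int value = initial;
    FILE *f = fopen(path, "r");
    if (f != NULL) {
        if (fscanf(f, "%d", &value) != 1) {
            value = initial;
        }
        fclose(f);
    }
    return value;
}

/* Hypothetical helper: write the remaining restart count to `path`.
 * Returns 0 on success, -1 on I/O failure. */
static int store_restart_budget(const char *path, int value) {
    assert(path != NULL);
    FILE *f = fopen(path, "w");
    if (f == NULL) {
        return -1;
    }
    int ok = fprintf(f, "%d\n", value) > 0;
    if (fclose(f) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}

/* Consume one restart from the budget.
 * Returns 1 if a restart is allowed, 0 if the budget is used up,
 * and -1 if the new count could not be saved. */
static int take_restart(const char *path, int initial) {
    int left = load_restart_budget(path, initial);
    if (left <= 0) {
        return 0;
    }
    if (store_restart_budget(path, left - 1) != 0) {
        return -1;
    }
    return 1;
}
```

A call_restart() wrapper could call take_restart("restart_budget", 2) and run cr_restart only when the result is 1. The second argument matters only while the file does not exist yet, so the count keeps going down even though global_number in memory is restored to its checkpointed value.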
    
    As for eventually failing with "Device or resource busy", I imagine that
    with the many restarts you may have eventually reused the original PID
    for the cr_restart executable.  Perhaps that problem will go away when
    you fix the multiple restarts problem.  The other possibility here is
    that you are trying to restart the same process multiple times
    *concurrently*, thus trying to use the original PID twice at the same
    time.
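
A separate point about the quoted call_restart(): if the real code passes the literal string "cr_restart filename" to system(), C will not substitute the filename parameter into it, so cr_restart would be asked to open a file literally named "filename". The following is a minimal sketch of building the command from the argument instead. The helper name build_restart_command and the example context file name are assumptions for illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format "cr_restart '<context_file>'" into buf.
 * Returns 0 on success, -1 if the name contains a single quote
 * (which would break the quoting) or the buffer is too small. */
static int build_restart_command(char *buf, size_t size,
                                 const char *context_file) {
    assert(buf != NULL && context_file != NULL);
    if (strchr(context_file, '\'') != NULL) {
        return -1;
    }
    int n = snprintf(buf, size, "cr_restart '%s'", context_file);
    if (n < 0 || (size_t)n >= size) {
        return -1;  /* formatting error or truncated output */
    }
    return 0;
}
```

The resulting buffer can then be passed to system(). Calling an exec-family function with the file name as a separate argument would avoid the shell and its quoting rules altogether.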
    
    Let me know if you need any more help.
    
    -Paul
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
    
    
    
    
    
    
    