Re: Restart my program failed ?

From: Locus Jackson (locus_jackson_at_yahoo_dot_com)
Date: Tue Jan 15 2008 - 18:41:11 PST

  • Next message: : "the details of setting a checkpoint"
    Hi,
    Once I use the callback mechanism, my program never stops. Maybe I still do not understand what you mean.
    The form of my program (I want to restart it global_number times):
    void set_checkpoint() // call this to take a checkpoint at any time and place
    {
        ......
        cr_init();
        cr_initialize_checkpoint_args_t(&cr_args);
        cr_args.cr_fd = open(filename, ......); // save the checkpoint in filename
        cr_register_callback(callback, ......); // register a callback function
        cr_request_checkpoint(&cr_args, ......); // request a checkpoint
        ......
        cr_poll_checkpoint(.....); // wait for the checkpoint to complete
        ......
    }

    static int callback(void* arg) { // your suggestion
        int rc = cr_checkpoint(CR_CHECKPOINT_READY);
        if (rc > 0) {
            --global_number;
        }
        return 0;
    }

    void call_restart(char* filename) // call this to restart my program explicitly
    {
        ......
        pipe();
        fork();
        ....... // the parent has exited
        system("cr_restart filename"); // restart the program from the checkpoint taken earlier
        ......
    }

    int global_number = 2; // number of restarts
    char filename[20]; // checkpoint file name
    int main()
    {
        ......
        set_checkpoint(); // place1
        statement1;
        ......
        while (global_number)
            call_restart(filename); // place2, rerun from place1 to place2 global_number times
        ......
        return 0;
    }
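
As written, the string "cr_restart filename" hands the literal word filename to the shell, not the contents of the filename variable. A minimal sketch of building the command from the variable follows. It assumes, as the thread shows, that cr_restart takes the checkpoint file as its argument. build_restart_command is a hypothetical helper, not part of BLCR, and it does no shell quoting, so it only suits simple paths without spaces or special characters.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not a BLCR function): writes "cr_restart <path>"
 * into buf. Returns 0 on success, -1 on bad input or if buf is too small.
 * No shell quoting is done, so path must not contain spaces or shell
 * metacharacters. */
static int build_restart_command(char *buf, size_t size, const char *path)
{
    int n;

    if (buf == NULL || path == NULL || size == 0)
        return -1;

    n = snprintf(buf, size, "cr_restart %s", path);
    if (n < 0 || (size_t)n >= size)
        return -1; /* output error or truncated command */

    return 0;
}
```

The resulting buffer could then be passed to system() in place of the fixed string, after checking the return value.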
    I wonder whether this structure is wrong. I also wonder whether the callback() function will be called before
    cr_request_checkpoint(). I want to restart my program global_number times.
    
    Thank you very much for your help.
    
    Regards
    Locus.
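
One way to keep a value that must change across restarts is to store it outside the process image, for example in a small file. The sketch below is not something proposed in this thread. It assumes the file on disk is not rolled back when the process is restarted from a checkpoint, and all names in it (read_counter, write_counter, take_restart_ticket) are hypothetical.

```c
#include <stdio.h>

/* Reads an integer from path. Returns fallback if the file is missing
 * or does not contain a number. */
static int read_counter(const char *path, int fallback)
{
    FILE *f = fopen(path, "r");
    int value = fallback;

    if (f == NULL)
        return fallback;
    if (fscanf(f, "%d", &value) != 1)
        value = fallback;
    fclose(f);
    return value;
}

/* Overwrites path with value. Returns 0 on success, -1 on failure. */
static int write_counter(const char *path, int value)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (f == NULL)
        return -1;
    ok = fprintf(f, "%d\n", value) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Returns 1 and decrements the stored budget if another restart is
 * allowed, 0 otherwise. The budget starts at initial when the file
 * does not exist yet. */
static int take_restart_ticket(const char *path, int initial)
{
    int remaining = read_counter(path, initial);

    if (remaining <= 0)
        return 0;
    if (write_counter(path, remaining - 1) != 0)
        return 0;
    return 1;
}
```

In a loop like the one in main() above, the program would call take_restart_ticket() before each call_restart() and stop restarting once it returns 0.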
    
    
    
    ----- Original Message ----
    From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
    To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
    Cc: checkpoint <checkpoint_at_lbl_dot_gov>
    Sent: Wednesday, January 16, 2008 2:21:41 AM
    Subject: Re: Restart my program failed ?
    
    
    Locus,
    
      A BLCR callback is sort of like a signal handler that runs when the
    checkpoint is being taken.  So, when cr_request_checkpoint() causes the
    checkpoint to be taken, BLCR will run the callback.  The callback runs
    up to the cr_checkpoint() call before the checkpoint is taken.  This
    would allow a process to save any sort of state that BLCR doesn't handle
    (such as TCP sockets).  The call to cr_checkpoint() allows the
    checkpoint to proceed (possibly invoking other callbacks if more than
    one is registered).  The return value from cr_checkpoint() will be 0
    when the process is just continuing normally after a checkpoint has been
    taken, but will be >0 when resuming from a restart.  Any code running in
    the callback after the cr_checkpoint() call can restore any state that
    the callback saved.  In the example callback I showed, the value of
    global_number will decrease by one when the process is restarted.
    
    -Paul
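
The return-value behaviour described above can be modelled without BLCR. The sketch below is a simplified simulation, not BLCR code: sim_cr_checkpoint() is a hypothetical stand-in for cr_checkpoint(), and one saved integer stands in for the checkpoint file. In this model each restart first restores global_number to its checkpointed value and then runs the rest of the callback, so restarting twice from the same checkpoint leaves the value at 1 both times.

```c
#include <stddef.h>

static int global_number = 2;      /* restart budget, as in the thread */
static int saved_global_number;    /* stand-in for the checkpoint file */
static int resuming_from_restart;  /* selects the simulated return value */

/* Hypothetical stand-in for cr_checkpoint(): records the state and
 * returns 0 when the process simply continues after the checkpoint,
 * returns 1 (>0) when execution resumes from a restart. */
static int sim_cr_checkpoint(void)
{
    if (resuming_from_restart)
        return 1;
    saved_global_number = global_number;
    return 0;
}

/* Same shape as the callback quoted in the thread. */
static int callback(void *arg)
{
    int rc;

    (void)arg;
    rc = sim_cr_checkpoint();
    if (rc > 0) {   /* restarting */
        --global_number;
    }
    return 0;
}

/* Simulates taking a checkpoint: the callback runs, state is saved. */
static void sim_take_checkpoint(void)
{
    resuming_from_restart = 0;
    callback(NULL);
}

/* Simulates a restart: memory is restored from the checkpoint, then
 * the callback continues with a positive return value. */
static void sim_restart(void)
{
    global_number = saved_global_number;
    resuming_from_restart = 1;
    callback(NULL);
    resuming_from_restart = 0;
}
```

Under these assumptions, a loop that keeps restarting from the same checkpoint file never sees the counter reach zero unless a new checkpoint is taken after the decrement or the counter is kept outside the checkpointed process.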
    
    
    Locus Jackson wrote:
    > Hi,
    > I am sorry that I still have some questions.
    > In my function set_checkpoint(), I use cr_init(),
    > cr_initialize_checkpoint_args_t(), cr_request_checkpoint() and
    > cr_poll_checkpoint() to take a checkpoint.
    > In my function call_restart(), I use pipe(), fork(), and system() to
    > restart my program from the checkpoint I took earlier.
    > You suggest registering a checkpoint callback. I have some difficulty
    > understanding its mechanism, though I have read libcr.h.
    > 1. cr_register_callback(cr_callback_t func, void* arg, int flags): when
    > will the callback func be invoked? Will it be invoked after my function
    > set_checkpoint() is called? When is the callback func invoked in general?
    > 2. In your reply, you wrote:
    >   static int my_callback(void* arg) {
    >     int rc = cr_checkpoint(CR_CHECKPOINT_READY);
    >     if (rc > 0) { /* Restarting */
    >       --global_number;
    >     }
    >     return 0;
    >   }
    > Does cr_checkpoint() take a checkpoint like my function set_checkpoint()?
    > If the answer is no, can I call call_restart() inside the if (rc > 0)
    > branch to explicitly restart my program global_number times?
    > 3. If possible, would you please give me an example that explains your
    > callback method? I want to restart my program any given number of times,
    > but now, if I call call_restart(), the program runs forever, which is
    > really terrible.
    > Thank you very much for your kind help.
    >
    > Regards,
    > Locus.
    >   
    >
    > ----- Original Message ----
    > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
    > To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
    > Cc: checkpoint_at_lbl_dot_gov
    > Sent: Tuesday, January 15, 2008 1:05:08 PM
    > Subject: Re: Restart my program failed ?
    >
    > Locus Jackson wrote:
    > > Hello,
    > > I use BLCR to checkpoint and restart my program (a single-threaded
    > > application).
    > > But when I try to restart my program, it always fails.
    > > The general form of my program is as follows:
    > >
    > >        void set_checkpoint()  // use this function to take a
    > >                               // checkpoint at any time and place
    > >        {
    > >          ........
    > >        }
    > >
    > >        void call_restart(char* filename)  // use this function to
    > >                                           // restart my program in case it fails
    > >        {
    > >          ......
    > >          system("cr_restart filename");
    > >        }
    > >
    > >        int global_number=2;
    > >        int main()
    > >        {
    > >          ......
    > >          statement1;
    > >          set_checkpoint();
    > >          statement2;
    > >          ......
    > >          while(global_number>0)  // I want to restart my program 2 times
    > >          {
    > >            global_number--;
    > >            call_restart();
    > >          }
    > >          statement3;
    > >          ......
    > >        }
    > >
    > > When I execute this program, it restarts far more than two
    > > times, until it tells me "Restart failed: Device or resource busy".
    > > In my call_restart() function, I fork a child to restart my
    > > program (before it restarts, its parent has exited, and the pgid of the
    > > child is set to the child's pid), but on restart the child that
    > > is forked is always the son of the exited parent; the parent seems to
    > > still be alive. I do not know why.
    > > Thank you for your help.
    > >
    > > Locus. 
    > >       
    > >
    > >
    > >
    > Locus,
    >
    > Your first issue is "it restarts far more than two times".  That is
    > because the value of "global_number" has been restored to the value 2
    > when BLCR restarted the program.  You will need to use a different
    > mechanism to handle any value that is supposed to change across
    > checkpoints.  I suggest that you try registering a checkpoint callback.
    >
    > In main add these two lines:
    >   cr_client_id_t id = cr_init();
    >   cr_register_callback(my_callback, NULL, CR_SIGNAL_CONTEXT);
    >
    > and somewhere add the following function:
    >   static int my_callback(void* arg) {
    >     int rc = cr_checkpoint(CR_CHECKPOINT_READY);
    >     if (rc > 0) { /* Restarting */
    >       --global_number;
    >     }
    >     return 0;
    >   }
    >
    > As for eventually failing with "Device or resource busy", I imagine that
    > with the many restarts you may have eventually reused the original PID
    > for the cr_restart executable.  Perhaps that problem will go away when
    > you fix the multiple restarts problem.  The other possibility here is
    > that you are trying to restart the same process multiple times
    > *concurrently*, thus trying to use the original PID twice at the same
    > time.
    >
    > Let me know if you need any more help.
    >
    > -Paul
    >
    > -- 
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov 
    > Future Technologies Group
    > HPC Research Department                  Tel: +1-510-495-2352
    > Lawrence Berkeley National Laboratory    Fax: +1-510-486-6900
    >
    >
    >
    >
    >
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
    
    
    
    
    
    
    
    
