Re: Restart my program failed ?

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Jan 16 2008 - 11:32:32 PST

  • Next message: Paul H. Hargrove: "Re: the details of setting a checkpoint"
    I guess I answered too quickly last time, because what I proposed will 
    *not* result in restarting twice.  If you start with global_number=1, 
    then the callback will decrement it to zero when restarted, and you will 
    see your program restart exactly once.  However, if you start with 
    global_number=2, then every restart first restores the checkpointed 
    value of 2 and the callback then decrements it from 2 to 1, never 
    going lower, so the restart loop never ends.
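
To make the behaviour described above concrete, here is a minimal sketch in plain C that only models it and does not call BLCR at all. The names `struct snapshot`, `value_after_restart` and `count_restarts` are hypothetical and exist only for this illustration. The assumption is that a restart restores the value stored in the checkpoint and the callback then decrements that restored copy.

```c
#include <assert.h>

/* Hypothetical stand-in for the state saved in the checkpoint file. */
struct snapshot {
    int global_number;
};

/* Model one restart: the value is restored from the snapshot, then the
 * callback's "rc > 0" branch decrements the restored copy. */
int value_after_restart(const struct snapshot *saved)
{
    int global_number = saved->global_number; /* restored by the restart */
    --global_number;                          /* done by the callback */
    return global_number;
}

/* Model "while (global_number) call_restart();" and count the restarts.
 * The count is capped at max_restarts because the loop may never end. */
int count_restarts(int initial, int max_restarts)
{
    assert(max_restarts >= 0);
    struct snapshot saved = { initial }; /* checkpoint taken once, at place1 */
    int live = initial;                  /* value seen by the while loop */
    int restarts = 0;

    while (live != 0 && restarts < max_restarts) {
        live = value_after_restart(&saved);
        ++restarts;
    }
    return restarts;
}
```

In this model `count_restarts(1, 10)` stops after a single restart, while `count_restarts(2, 10)` only stops because it hits the cap: the loop variable is reset to 1 on every pass.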
    
    I don't have a good suggestion as to how to restart exactly twice from 
    *inside* your program.  BLCR is most often used with some outside 
    program controlling the restarts.
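
As a rough illustration of an outside program controlling the restarts, the sketch below keeps the restart count in a separate driver process, so a restart cannot reset it. The function names `run_once` and `restart_n_times` are hypothetical. It also assumes that `cr_restart` is on the PATH, that it does not return until the restarted program has exited, and that it exits with status 0 on success. Those assumptions should be checked against the local BLCR installation.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run "prog arg" once and wait for it to finish.
 * Returns the child's exit status, or -1 on error or abnormal exit. */
int run_once(const char *prog, const char *arg)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execlp(prog, prog, arg, (char *)NULL);
        _exit(127); /* only reached if exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Run "prog context_file" up to n times, stopping at the first failure.
 * The counter lives in this driver, outside the checkpointed process.
 * Returns the number of runs that exited with status 0. */
int restart_n_times(const char *prog, const char *context_file, int n)
{
    assert(n >= 0);
    int done = 0;
    while (done < n && run_once(prog, context_file) == 0)
        ++done;
    return done;
}
```

A driver's main() could then call `restart_n_times("cr_restart", filename, 2)`, where `filename` is the context file written by the checkpoint. For this to work, the checkpointed program must not call cr_restart itself; it just runs to completion after each restart.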
    
    -Paul
    
    Locus Jackson wrote:
    > Hi,
    > Once I use the callback mechanism, my program never stops. Maybe I 
    > still do not understand your meaning.
    > My program has this form (I want to restart my program global_number times):
    >                void set_checkpoint() // use this to set a checkpoint at any time and place
    >                {
    >                  ......
    >                  cr_init();
    >                  cr_initialize_checkpoint_args_t(&cr_args);
    >                  cr_args.cr_fd=open(filename,......); // save checkpoint in filename
    >                  cr_register_callback(callback,......); // register a callback function
    >                  cr_request_checkpoint(&cr_args,......); // set a checkpoint
    >                  ......
    >                  cr_poll_checkpoint(.....); // wait for the checkpoint to be completed
    >                  ......
    >                }
    >               
    >                static int callback(void* arg) { //your suggestion
    >                 int rc = cr_checkpoint(CR_CHECKPOINT_READY);
    >                 if (rc > 0) { 
    >                  --global_number;
    >                 }
    >                 return 0;
    >                }
    >
    >               void call_restart(char* filename) // use this to explicitly restart my program
    >               {
    >                   ......
    >                   pipe();
    >                   fork();
    >                   .......  // parent is exited
    >                   system("cr_restart filename"); // restart program from the checkpoint set before
    >                   ......
    >               }
    >               
    >               int global_number=2;//restart numbers
    >               char filename[20]; //save checkpoint 
    >               int main()
    >               {
    >                ......
    >                set_checkpoint(); //place1
    >                statement1;
    >                ......
    >                while(global_number)
    >                  call_restart(filename); //place2,restart from place1 to place2 for global_number times
    >                ......
    >                return 0; 
    >               }
    > I wonder whether my form is wrong, and whether the callback() function
    > will be called before cr_request_checkpoint(). I want to restart my
    > program global_number times.
    >
    > Thank you very much for your help.
    >
    > Regards
    > Locus.
    >
    >   
    >   
    >
    > ----- Original Message ----
    > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
    > To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
    > Cc: checkpoint <checkpoint_at_lbl_dot_gov>
    > Sent: Wednesday, January 16, 2008 2:21:41 AM
    > Subject: Re: Restart my program failed ?
    >
    > Locus,
    >
    >   A BLCR callback is sort of like a signal handler that runs when the
    > checkpoint is being taken.  So, when cr_request_checkpoint() causes the
    > checkpoint to be taken, BLCR will run the callback.  The callback runs
    > up to the cr_checkpoint() call before the checkpoint is taken.  This
    > would allow a process to save any sort of state that BLCR doesn't handle
    > (such as TCP sockets).  The call to cr_checkpoint() allows the
    > checkpoint to proceed (possibly invoking other callbacks if more than
    > one is registered).  The return value from cr_checkpoint() will be 0
    > when the process is just continuing normally after a checkpoint has been
    > taken, but will be >0 when resuming from a restart.  Any code running in
    > the callback after the cr_checkpoint() call can restore any state that
    > the callback saved.  In the example callback I showed, the value of
    > global_number will decrease by one when the process is restarted.
    >
    > -Paul
    >
    >
    > Locus Jackson wrote:
    > > Hi,
    > > I am sorry that I still have some questions.
    > > In my function set_checkpoint(), I use cr_init(),
    > > cr_initialize_checkpoint_args_t, cr_request_checkpoint() and
    > > cr_poll_checkpoint() to set a checkpoint.
    > > In my function call_restart(), I use pipe(), fork(), and system() to
    > > restart my program from the checkpoint that I set before.
    > > You suggest registering a checkpoint callback. I have some
    > > difficulty understanding its mechanism, though I have read libcr.h.
    > > 1. cr_register_callback(cr_callback_t func, void* arg, int flags): I
    > > wonder when the callback func will be invoked. Will it be invoked after
    > > my function set_checkpoint() is called? When is the callback func
    > > invoked in general?
    > > 2. In your reply, you wrote:
    > >  static int my_callback(void* arg) {
    > >    int rc = cr_checkpoint(CR_CHECKPOINT_READY);
    > >    if (rc > 0) { /* Restarting */
    > >      --global_number;
    > >    }
    > >    return 0;
    > >  }
    > >  I wonder whether cr_checkpoint() sets a checkpoint like my function
    > > set_checkpoint(). If the answer is no, can I add call_restart()
    > > in the condition if(rc>0) to explicitly restart my program
    > > global_number times?
    > > 3. If possible, would you please give me an example to explain your
    > > callback method? I want to restart my program any given number of
    > > times, but now, if I call call_restart(), the program runs
    > > forever, which is really terrible.
    > > Thank you very much for your kind help.
    > >
    > > Regards,
    > > Locus.
    > > 
    > >
    > > ----- Original Message ----
    > > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
    > > To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
    > > Cc: checkpoint_at_lbl_dot_gov
    > > Sent: Tuesday, January 15, 2008 1:05:08 PM
    > > Subject: Re: Restart my program failed ?
    > >
    > > Locus Jackson wrote:
    > > > Hello,
    > > > I use BLCR to checkpoint and restart my program (a single-threaded
    > > > application).
    > > > But when I try to restart my program, it always fails.
    > > > The general form of my program is as follows:
    > > >
    > > >        void set_checkpoint()  // use this function to set a checkpoint at any time and place
    > > >        {
    > > >          ........
    > > >        }
    > > >
    > > >        void call_restart(char* filename)  // use this function to restart my program in case it failed
    > > >        {
    > > >          ......
    > > >          system("cr_restart filename");
    > > >        }
    > > >
    > > >        int global_number=2; 
    > > >        int main()
    > > >        {
    > > >          ......
    > > >          statement1;
    > > >          set_checkpoint();
    > > >          statement2;
    > > >          ......
    > > >          while(global_number>0)  // I want to restart my program 2 times
    > > >          {
    > > >            global_number--;
    > > >            call_restart();
    > > >          }
    > > >          statement3;
    > > >          ......
    > > >        }
    > > >
    > > > When I execute this program, it restarts far more than two
    > > > times, until it tells me "Restart failed: Device or resource busy".
    > > > In my call_restart() function, I fork a child to restart my
    > > > program (before it restarts, its parent has exited, and the pgid of the
    > > > child is set to the child's pid), but on restart, the child that
    > > > is forked is always the child of the exited parent; the parent seems to
    > > > be still alive. I do not know why.
    > > > Thank you for your help.
    > > >
    > > > Locus.
    > > >     
    > > >
    > > >
    > > > 
    > > Locus,
    > >
    > > Your first issue is "it restarts far more than two times".  That is
    > > because the value of "global_number" has been restored to the value 2
    > > when BLCR restarted the program.  You will need to use a different
    > > mechanism to handle any value that is supposed to change across
    > > checkpoints.  I suggest that you try registering a checkpoint callback.
    > >
    > > In main add these two lines:
    > >  cr_client_id_t id = cr_init();
    > >  cr_register_callback(my_callback, NULL, CR_SIGNAL_CONTEXT);
    > >
    > > and somewhere add the following function:
    > >  static int my_callback(void* arg) {
    > >    int rc = cr_checkpoint(CR_CHECKPOINT_READY);
    > >    if (rc > 0) { /* Restarting */
    > >      --global_number;
    > >    }
    > >    return 0;
    > >  }
    > >
    > > As for eventually failing with "Device or resource busy", I imagine that
    > > with the many restarts you may have eventually reused the original PID
    > > for the cr_restart executable.  Perhaps that problem will go away when
    > > you fix the multiple restarts problem.  The other possibility here is
    > > that you are trying to restart the same process multiple times
    > > *concurrently*, thus trying to use the original PID twice at the same
    > > time.
    > >
    > > Let me know if you need any more help.
    > >
    > > -Paul
    > >
    > > --
    > > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > > Future Technologies Group
    > > HPC Research Department                  Tel: +1-510-495-2352
    > > Lawrence Berkeley National Laboratory    Fax: +1-510-486-6900
    > >
    > >
    > >
    > >
    >
    >
    > -- 
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group
    > HPC Research Department                  Tel: +1-510-495-2352
    > Lawrence Berkeley National Laboratory    Fax: +1-510-486-6900
    >
    >
    >
    >
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
