Re: Restart my program failed ?

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Jan 18 2008 - 12:04:19 PST

  • Next message: Locus Jackson: "Re: Restart my program failed ?"
    Locus,
    
      BLCR callbacks run when the checkpoint is taken.  In your case the 
    callback runs at some indeterminate point between entering 
    cr_request_checkpoint() and leaving cr_poll_checkpoint().  The portion 
    of the callback before "cr_checkpoint()" runs before the actual 
    checkpoint is taken, and the portion after the cr_checkpoint() call 
    runs after the checkpoint is saved.  The return code from 
    cr_checkpoint() is 0 when the process simply continues after the 
    checkpoint is taken.  However, when restarting, the cr_checkpoint() 
    call returns something greater than zero (see "man setjmp" for similar 
    behavior in the POSIX APIs).  That is why the callback I provided says 
    "if (rc>0) --global_number", so that (if global_number started at 1) 
    the program will see global_number=1 the first time it reaches the line 
    you've marked as "//B", but when restarted will see zero (thus 
    restarting exactly once).
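
The setjmp comparison can be tried out without BLCR. What follows is only a sketch in plain C, not BLCR code: setjmp() stands in for cr_checkpoint() and longjmp() stands in for a restart. The names run_demo() and resume_point are made up for this illustration. One difference to keep in mind: longjmp() does not restore memory, whereas a BLCR restart brings global_number back to its checkpoint-time value before the callback decrements it.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdio.h>

static jmp_buf resume_point;   /* stands in for the saved checkpoint */
static int global_number = 1;  /* number of "restarts" still wanted */

/* Returns how many times execution resumed at the saved point. */
int run_demo(void)
{
    static int resumes;        /* static, so its value is defined after longjmp */
    resumes = 0;
    global_number = 1;

    if (setjmp(resume_point) != 0) {
        /* Nonzero return: we arrived here through longjmp ("restart"). */
        --global_number;
        ++resumes;
        assert(global_number >= 0);  /* one decrement per resume */
    }
    printf("global_number=%d\n", global_number);

    while (global_number) {          /* the line marked "//B" */
        longjmp(resume_point, 1);    /* stands in for call_restart() */
    }
    return resumes;
}
```

Calling run_demo() from main() prints global_number=1 on the first pass and global_number=0 after the single resume, then returns 1.
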
      One thing you should change is to call cr_register_callback() only 
    once (I suggest in main()), rather than each time you request a 
    checkpoint.  If you register it multiple times it will get called 
    multiple times, and at restart you might get global_number<0 (which, 
    since "while(global_number)" is true for any nonzero value, will still 
    cause your program to restart nearly forever).
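
The effect of registering the callback more than once can be modelled with a small stand-in registry. This is a hedged sketch, not the libcr API: register_hook(), simulate_restart() and value_after_restart() are hypothetical names, and the model assumes only what is described above (a restart brings global_number back to its checkpoint-time value, then every registered callback runs its "rc > 0" branch).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLBACKS 8

typedef void (*restart_hook_t)(void);

static restart_hook_t hooks[MAX_CALLBACKS];
static size_t hook_count;
static int global_number;

/* Hypothetical stand-in for cr_register_callback(): each call adds an entry. */
static int register_hook(restart_hook_t fn)
{
    if (hook_count == MAX_CALLBACKS)
        return -1;
    hooks[hook_count++] = fn;
    return 0;
}

/* The part of the callback that runs on the restart path (rc > 0). */
static void on_restart(void)
{
    --global_number;
}

/* Model one restart: memory returns to its checkpoint-time value,
 * then every registered callback resumes. */
static int simulate_restart(int value_at_checkpoint)
{
    global_number = value_at_checkpoint;
    for (size_t i = 0; i < hook_count; ++i)
        hooks[i]();
    return global_number;
}

/* Register on_restart `registrations` times, then restart once from a
 * checkpoint taken while global_number was 1. */
int value_after_restart(int registrations)
{
    assert(registrations >= 0 && registrations <= MAX_CALLBACKS);
    hook_count = 0;
    for (int i = 0; i < registrations; ++i)
        register_hook(on_restart);
    return simulate_restart(1);
}
```

With one registration the model yields 0 and the while(global_number) loop ends. With two registrations it yields -1, which is nonzero, so the loop keeps going.
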
    
    -Paul
    
    Locus Jackson wrote:
    > Hi,
    > Thank you for your suggestion.
    > I set global_number to 1 to have a try,but if global_number is equal 
    > to 1,it will not call function call_restart() any more.
    >                                  int global_number=1;
    >                                  int main()
    >                                  {
    >                                        ......
    >                                        set_checkpoint(); //A
    >                                        ......
    >                                        while(global_number)  //B
    >                                           call_restart(filename); //C
    >                                       ......
    >                                  }
    > at place B,the global_number is equal to 0,it will not call 
    > call_restart,thus I will not restart in my program for only once.
    > So is there any method that I can have a chance to call 
    > call_restart(),maybe  one time is also ok?
    > And,  I also  want to  know , in my function set_checkpoint(),will 
    > callback function callback() function be called before 
    > cr_request_checkpoint() or after it? Does a callback function 
    > automatically be invoked before calling a function to
    > set a checkpoint?
    > Thank you for your help.
    >
    > Regards
    > Locus.
    >
    > ----- Original Message ----
    > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
    > To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
    > Cc: checkpoint <checkpoint_at_lbl_dot_gov>
    > Sent: Thursday, January 17, 2008 3:32:32 AM
    > Subject: Re: Restart my program failed ?
    >
    > I guess I answered too quickly last time, because what I proposed will
    > *not* result in restarting twice.  If you start with global_number=1,
    > then the callback will decrement it to zero when restarted, and you will
    > see your program restart exactly once.  However, if you start with
    > global_number=2, then each restart decrements it from 2 to 1, never
    > going lower.
    >
    > I don't have a good suggestion as to how to restart exactly twice from
    > *inside* your program.  BLCR is most often used with some outside
    > program controlling the restarts.
    >
    > -Paul
    >
    > Locus Jackson wrote:
    > > Hi,
    > > Once using a callback mechanism,my program will never stop.Maybe I
    > > still can not understand your meaning.
    > > My program form(I want to restart my program for global_number times) :
    > >                void set_checkpoint()//use this to set a checkpoint at
    > > any times and places
    > >                {
    > >                  ......
    > >                  cr_init();
    > >                cr_initialize_checkpoint_args_t(&cr_args);
    > >                cr_args.cr_fd=open(filename,......); //save checkpoint
    > > in filename
    > >                cr_register_callback(callback,......);//register a
    > > callback function
    > >                cr_request_checkpoint(&cr_args,......);//set a checkpoint
    > >                ......
    > >                cr_poll_checkpoint(.....);//wait for setting a
    > > checkpoint to be completed
    > >                ......
    > >                }
    > >             
    > >                static int callback(void* arg) { //your suggestion
    > >                int rc = cr_checkpoint(CR_CHECKPOINT_READY);
    > >                if (rc > 0) {
    > >                  --global_number;
    > >                }
    > >                return 0;
    > >                }
    > >
    > >                void call_restart(char* filename) //use this to restart my program explicitly
    > >              {
    > >                  ......
    > >                  pipe();
    > >                  fork();
    > >                  .......  //parent is exited
    > >                  system("cr_restart filename"); //restart program from the checkpoint set before
    > >                  ......
    > >              }
    > >             
    > >              int global_number=2;//restart numbers
    > >              char filename[20]; //save checkpoint
    > >              int main()
    > >              {
    > >                ......
    > >                set_checkpoint(); //place1
    > >                statement1;
    > >                ......
    > >                while(global_number)
    > >                  call_restart(filename); //place2,restart from place1 to place2 for global_number times
    > >                ......
    > >                return 0;
    > >              }
    > > I wonder whether my form is wrong or not, and I wonder whether the
    > > callback() function will be called before cr_request_checkpoint(). I
    > > want to restart my program global_number times.
    > >
    > > Thank you very much for your help.
    > >
    > > Regards
    > > Locus.
    > >
    > > 
    > > 
    > >
    > > ----- Original Message ----
    > > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
    > > To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
    > > Cc: checkpoint <checkpoint_at_lbl_dot_gov>
    > > Sent: Wednesday, January 16, 2008 2:21:41 AM
    > > Subject: Re: Restart my program failed ?
    > >
    > > Locus,
    > >
    > >  A BLCR callback is sort of like a signal handler that runs when the
    > > checkpoint is being taken.  So, when cr_request_checkpoint() causes the
    > > checkpoint to be taken, BLCR will run the callback.  The callback runs
    > > up to the cr_checkpoint() call before the checkpoint is taken.  This
    > > would allow a process to save any sort of state that BLCR doesn't handle
    > > (such as TCP sockets).  The call to cr_checkpoint() allows the
    > > checkpoint to proceed (possibly invoking other callbacks if more than
    > > one is registered).  The return value from cr_checkpoint() will be 0
    > > when the process is just continuing normally after a checkpoint has been
    > > taken, but will be >0 when resuming from a restart.  Any code running in
    > > the callback after the cr_checkpoint() call can restore any state that
    > > the callback saved.  In the example callback I showed, the value of
    > > global_number will decrease by one when the process is restarted.
    > >
    > > -Paul
    > >
    > >
    > > Locus Jackson wrote:
    > > > Hi,
    > > > I am sorry that I still have some questions.
    > > > In my function set_checkpoint(),I use
    > > > cr_init(),cr_initialize_checkpoint_args_t,cr_request_checkpoint() and
    > > > cr_poll_checkpoint() to set a checkpoint.
    > > > In my function call_restart(),I use pipe(),fork(),and system() to
    > > > restart my program from the checkpoint where I set before.
    > > > You suggest registering a checkpoint callback,I may have some
    > > > difficult to understand its mechanism though I have read libcr.h.
    > > > 1,cr_register_callback(cr_callback_t func,void* arg,int flags),I
    > > > wonder when the callback func will be invoked?Will it be invoked after
    > > > my function set_checkpoint() called?When will the callback func  be
    > > > invoked  generally?
    > > > 2,In your reply,you wrote:
    > > >  static int my_callback(void* arg) {
    > > >    int rc = cr_checkpoint(CR_CHECKPOINT_READY);
    > > >    if (rc > 0) { /* Restarting */
    > > >      --global_number;
    > > >    }
    > > >    return 0;
    > > >  }
    > > >  I wonder, does cr_checkpoint() set a checkpoint like my function
    > > > set_checkpoint()? If the answer is no, can I add call_restart()
    > > > in the condition if(rc>0) to explicitly restart my program for
    > > > global_times?
    > > > 3,If possible, would you please give me an example to explain your
    > > > callback method? I want to restart my program any given number of
    > > > times, but now, if I call call_restart(), the program runs
    > > > forever, which is really terrible.
    > > > Thank you very much for your kind help.
    > > >
    > > > Regards,
    > > > Locus.
    > > >
    > > >
    > > > ----- Original Message ----
    > > > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
    > > > To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
    > > > Cc: checkpoint_at_lbl_dot_gov
    > > > Sent: Tuesday, January 15, 2008 1:05:08 PM
    > > > Subject: Re: Restart my program failed ?
    > > >
    > > > Locus Jackson wrote:
    > > > > Hello,
    > > > > I use blcr to checkpoint and restart my program(a single threaded
    > > > > application).
    > > > > But when I want to restart my program, it always fails.
    > > > > The general form of my program listed as follows:
    > > > >
    > > > >          void set_checkpoint()  //use this function to set a
    > > > > checkpoint at any time and place
    > > > >        {
    > > > >          ........
    > > > >        }
    > > > >
    > > > >        void call_restart(char* filename)  //use this function to
    > > > > restart my program in case it failed
    > > > >        {
    > > > >          ......
    > > > >          system("cr_restart filename");
    > > > >        }
    > > > >
    > > > >        int global_number=2;
    > > > >        int main()
    > > > >        {
    > > > >          ......
    > > > >          statement1;
    > > > >          set_checkpoint();
    > > > >          statement2;
    > > > >          ......
    > > > >          while(global_number>0)  // I want to restart my program 2 times
    > > > >          {
    > > > >            global_number--;
    > > > >            call_restart();
    > > > >            }
    > > > >            statement3;
    > > > >            ......
    > > > >          }
    > > > >
    > > > > when I execute this program ,it  restarts  far more than two
    > > > > times,until it told me " Restart failed: Device or resource busy".
    > > > > In my call_restart() function, I fork a child to restart my
    > > > > program(before it restart,its parent is exited,and the pgid of the
    > > > > child is also set to be child's pid ),but in restart,the child which
    > > > > is forked is always the son of the exited parent,the parent seems to
    > > > > be still alive,I do not know why?
    > > > > Thank you for your help.
    > > > >
    > > > > Locus.
    > > > >   
    > > > >
    > > > >
    > > > >
    > > > Locus,
    > > >
    > > > Your first issue is "it  restarts  far more than two times".  That is
    > > > because the value of "global_number" has been restored to the value 2
    > > > when BLCR restarted the program.  You will need to use a different
    > > > mechanism to handle any value that is supposed to change across
    > > > checkpoints.  I suggest that you try registering a checkpoint callback.
    > > >
    > > > In main add these two lines:
    > > >  cr_client_id_t id = cr_init();
    > > >  cr_register_callback(my_callback, NULL, CR_SIGNAL_CONTEXT);
    > > >
    > > > and somewhere add the following function:
    > > >  static int my_callback(void* arg) {
    > > >    int rc = cr_checkpoint(CR_CHECKPOINT_READY);
    > > >    if (rc > 0) { /* Restarting */
    > > >      --global_number;
    > > >    }
    > > >    return 0;
    > > >  }
    > > >
    > > > As for eventually failing with "Device or resource busy", I imagine that
    > > > with the many restarts you may have eventually reused the original PID
    > > > for the cr_restart executable.  Perhaps that problem will go away when
    > > > you fix the multiple restarts problem.  The other possibility here is
    > > > that you are trying to restart the same process multiple times
    > > > *concurrently*, thus trying to use the original PID twice at the same
    > > > time.
    > > >
    > > > Let me know if you need any more help.
    > > >
    > > > -Paul
    > > >
    > > > --
    > > > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > > > Future Technologies Group
    > > > HPC Research Department                  Tel: +1-510-495-2352
    > > > Lawrence Berkeley National Laboratory    Fax: +1-510-486-6900
    > > >
    > > >
    > > >
    > > >
    > > > 
    > >
    > >
    > > --
    > > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > > Future Technologies Group
    > > HPC Research Department                  Tel: +1-510-495-2352
    > > Lawrence Berkeley National Laboratory    Fax: +1-510-486-6900
    > >
    > >
    > >
    > >
    >
    >
    > -- 
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group
    > HPC Research Department                  Tel: +1-510-495-2352
    > Lawrence Berkeley National Laboratory    Fax: +1-510-486-6900
    >
    >
    >
    >
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
