Re: Restart my program failed ?

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Jan 22 2008 - 12:09:22 PST

  • Next message: Paul H. Hargrove: "Announcing the release of BLCR 0.6.3"
    Locus,
    
      The sequence is a,b,c,e,d,f
      The first time the callback runs, the value of rc (return from 
    cr_checkpoint()) is 0 and therefore the code "global_number--" does not 
    run, leaving global_number at 1.  However, when you restart, execution 
    begins in the callback exactly as if returning from cr_checkpoint() with 
    rc > 0, like the POSIX/C99 function setjmp().  So global_number is 
    reduced to zero then.  In summary:
      a,b,c,e(rc==0),d,f(global_number==1),RESTART,e(rc>0),d,f(global_number==0)
    
    -Paul
    
    Locus Jackson wrote:
    > Paul,
    > Thank you for your reply.
    > I still have another question. In my program, set_checkpoint() (place
    > A) is called first, so cr_request_checkpoint(), cr_checkpoint(),
    > cr_poll_checkpoint() and mycallback() have all been called before
    > place B is reached. So when execution reaches place B, is
    > global_number equal to 0? Maybe my understanding is wrong.
    > So my question is: when is the checkpoint taken in my program?
    > To explain my meaning clearly:
    >  
    >                    set_checkpoint() {                 //a
    >                      cr_register_callback();      //b
    >                      cr_request_checkpoint();     //c
    >                      cr_poll_checkpoint();         //d
    >                    }
    >                   
    >                     mycallback() {}                     //e
    >
    >                    while(global_number)            //f
    >
    > the procedure of setting a checkpoint:
    >
    > a --> b --->d--->a checkpoint is set-->global_number is set to 0-->f
    >           | -- e--|
    >                  |
    >                  call mycallback() to set global_number to 0
    > In your reply, you say that global_number is 1 the first time B is
    > reached, which I still cannot understand.
    >
    > Thank you for your help.
    >
    > Regards
    > Locus
    >
    >
    >
    > ----- Original Message ----
    > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
    > To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
    > Cc: checkpoint <checkpoint_at_lbl_dot_gov>
    > Sent: Saturday, January 19, 2008 4:04:19 AM
    > Subject: Re: Restart my program failed ?
    >
    > Locus,
    >
    >   BLCR callbacks run when the checkpoint is taken.  In your case it runs
    > at some indeterminate spot between entering cr_request_checkpoint() and
    > leaving cr_poll_checkpoint().  The portions of the function before
    > "cr_checkpoint()" run before the actual checkpoint is taken, and the
    > parts after calling cr_checkpoint() run after the checkpoint is saved.  
    > The return code from cr_checkpoint() is 0 when the checkpoint is taken.  
    > However when restarting, the cr_checkpoint() call returns something
    > greater than zero (see "man setjmp" for a similar behavior in the POSIX
    > APIs).  That is why the callback I provided says "if (rc>0)
    > --global_number", so that (if global_number started at 1) the program
    > will see global_number=1 the first time it reaches the line you've
    > marked as "//B", but when restarted will see zero (thus restarting
    > exactly once).
    >   One thing you should change is to call cr_register_callback() only
    > once (I suggest in main()), rather than each time you request a
    > checkpoint.  If you register it multiple times it will get called
    > multiple times and at restart you might get global_number<0 (which will
    > still cause your program to restart nearly forever).
    >
    > -Paul
    >
    > Locus Jackson wrote:
    > > Hi,
    > > Thank you for your suggestion.
    > > I tried setting global_number to 1, but with global_number equal
    > > to 1 the program no longer calls call_restart() at all.
    > >                                  int global_number=1;
    > >                                  int main()
    > >                                  {
    > >                                        ......
    > >                                        set_checkpoint(); //A
    > >                                        ......
    > >                                        while(global_number)  //B
    > >                                          call_restart(filename); //C
    > >                                      ......
    > >                                  }
    > > At place B, global_number is equal to 0, so call_restart() is not
    > > called, and my program does not restart even once.
    > > Is there any way to get a chance to call call_restart()? Even
    > > one time would be OK.
    > > I also want to know: in my function set_checkpoint(), will the
    > > callback function be called before cr_request_checkpoint() or
    > > after it? Is a callback function invoked automatically before
    > > the function that sets a checkpoint is called?
    > > Thank you for your help.
    > >
    > > Regards
    > > Locus.
    > >
    > > ----- Original Message ----
    > > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
    > > To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
    > > Cc: checkpoint <checkpoint_at_lbl_dot_gov>
    > > Sent: Thursday, January 17, 2008 3:32:32 AM
    > > Subject: Re: Restart my program failed ?
    > >
    > > I guess I answered too quickly last time, because what I proposed will
    > > *not* result in restarting twice.  If you start with global_number=1,
    > > then the callback will decrement it to zero when restarted, and you will
    > > see your program restart exactly once.  However, if you start with
    > > global_number=2, then each restart decrements it from 2 to 1, never
    > > going lower.
    > >
    > > I don't have a good suggestion as to how to restart exactly twice from
    > > *inside* your program.  BLCR is most often used with some outside
    > > program controlling the restarts.
    > >
    > > -Paul
    > >
    > > Locus Jackson wrote:
    > > > Hi,
    > > > Once I use the callback mechanism, my program never stops. Maybe I
    > > > still do not understand your meaning.
    > > > The form of my program (I want to restart it global_number times):
    > > >   void set_checkpoint() //use this to set a checkpoint at any times and places
    > > >   {
    > > >     ......
    > > >     cr_init();
    > > >     cr_initialize_checkpoint_args_t(&cr_args);
    > > >     cr_args.cr_fd=open(filename,......); //save checkpoint in filename
    > > >     cr_register_callback(callback,......); //register a callback function
    > > >     cr_request_checkpoint(&cr_args,......); //set a checkpoint
    > > >     ......
    > > >     cr_poll_checkpoint(.....); //wait for setting a checkpoint to be completed
    > > >     ......
    > > >   }
    > > >
    > > >   static int callback(void* arg) { //your suggestion
    > > >     int rc = cr_checkpoint(CR_CHECKPOINT_READY);
    > > >     if (rc > 0) {
    > > >       --global_number;
    > > >     }
    > > >     return 0;
    > > >   }
    > > >
    > > >   void call_restart(char* filename) //use this to restart explicitly my program
    > > >   {
    > > >     ......
    > > >     pipe();
    > > >     fork();
    > > >     .......  //parent is exited
    > > >     system("cr_restart filename"); //restart program from the checkpoint set before
    > > >     ......
    > > >   }
    > > >
    > > >   int global_number=2; //restart numbers
    > > >   char filename[20]; //save checkpoint
    > > >   int main()
    > > >   {
    > > >     ......
    > > >     set_checkpoint(); //place1
    > > >     statement1;
    > > >     ......
    > > >     while(global_number)
    > > >       call_restart(filename); //place2,restart from place1 to place2 for global_number times
    > > >     ......
    > > >     return 0;
    > > >   }
    > > > I wonder whether this form is wrong, and whether the callback()
    > > > function will be called before cr_request_checkpoint(). I want
    > > > to restart my program global_number times.
    > > >
    > > > Thank you very much for your help.
    > > >
    > > > Regards
    > > > Locus.
    > > >
    > > >
    > > >
    > > >
    > > > ----- Original Message ----
    > > > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
    > > > To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
    > > > Cc: checkpoint <checkpoint_at_lbl_dot_gov>
    > > > Sent: Wednesday, January 16, 2008 2:21:41 AM
    > > > Subject: Re: Restart my program failed ?
    > > >
    > > > Locus,
    > > >
    > > >  A BLCR callback is sort of like a signal handler that runs when the
    > > > checkpoint is being taken.  So, when cr_request_checkpoint() causes the
    > > > checkpoint to be taken, BLCR will run the callback.  The callback runs
    > > > up to the cr_checkpoint() call before the checkpoint is taken.   This
    > > > would allow a process to save any sort of state that BLCR doesn't handle
    > > > (such as TCP sockets).  The call to cr_checkpoint() allows the
    > > > checkpoint to proceed (possibly invoking other callbacks if more than
    > > > one is registered).  The return value from cr_checkpoint() will be 0
    > > > when the process is just continuing normally after a checkpoint has been
    > > > taken, but will be >0 when resuming from a restart.  Any code running in
    > > > the callback after the cr_checkpoint() call can restore any state that
    > > > the callback saved.  In the example callback I showed, the value of
    > > > global_number will decrease by one when the process is restarted.
    > > >
    > > > -Paul
    > > >
    > > >
    > > > Locus Jackson wrote:
    > > > > Hi,
    > > > > I am sorry that I still have some questions.
    > > > > In my function set_checkpoint(), I use cr_init(),
    > > > > cr_initialize_checkpoint_args_t, cr_request_checkpoint() and
    > > > > cr_poll_checkpoint() to set a checkpoint.
    > > > > In my function call_restart(), I use pipe(), fork(), and system() to
    > > > > restart my program from the checkpoint that I set before.
    > > > > You suggest registering a checkpoint callback. I have some
    > > > > difficulty understanding its mechanism, though I have read libcr.h.
    > > > > 1. cr_register_callback(cr_callback_t func,void* arg,int flags): I
    > > > > wonder when the callback func will be invoked. Will it be invoked
    > > > > after my function set_checkpoint() is called? When is the callback
    > > > > func invoked in general?
    > > > > 2. In your reply, you wrote:
    > > > >  static int my_callback(void* arg) {
    > > > >    int rc = cr_checkpoint(CR_CHECKPOINT_READY);
    > > > >    if (rc > 0) { /* Restarting */
    > > > >      --global_number;
    > > > >    }
    > > > >    return 0;
    > > > >  }
    > > > >  Does cr_checkpoint() set a checkpoint like my function
    > > > > set_checkpoint()? If the answer is no, can I add call_restart()
    > > > > inside the if(rc>0) condition to explicitly restart my program
    > > > > global_number times?
    > > > > 3. If possible, would you please give me an example to explain your
    > > > > callback method? I want to restart my program any given number of
    > > > > times, but now, if I call call_restart(), the program runs
    > > > > forever, which is really terrible.
    > > > > Thank you very much for your kind help.
    > > > >
    > > > > Regards,
    > > > > Locus.
    > > > >
    > > > >
    > > > > ----- Original Message ----
    > > > > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
    > > > > To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
    > > > > Cc: checkpoint_at_lbl_dot_gov
    > > > > Sent: Tuesday, January 15, 2008 1:05:08 PM
    > > > > Subject: Re: Restart my program failed ?
    > > > >
    > > > > Locus Jackson wrote:
    > > > > > Hello,
    > > > > > I use BLCR to checkpoint and restart my program (a single-threaded
    > > > > > application).
    > > > > > But when I try to restart my program, it always fails.
    > > > > > The general form of my program is as follows:
    > > > > >
    > > > > >        void set_checkpoint()  //use this function to set a checkpoint at any time and place
    > > > > >        {
    > > > > >          ........
    > > > > >        }
    > > > > >
    > > > > >        void call_restart(char* filename)  //use this function to restart my program in case it failed
    > > > > >        {
    > > > > >          ......
    > > > > >          system("cr_restart filename");
    > > > > >        }
    > > > > >
    > > > > >        int global_number=2;
    > > > > >        int main()
    > > > > >        {
    > > > > >          ......
    > > > > >          statement1;
    > > > > >          set_checkpoint();
    > > > > >          statement2;
    > > > > >          ......
    > > > > >          while(global_number>0)  // I want to restart my program 2 times
    > > > > >          {
    > > > > >            global_number--;
    > > > > >            call_restart();
    > > > > >          }
    > > > > >          statement3;
    > > > > >          ......
    > > > > >        }
    > > > > >
    > > > > > When I execute this program, it restarts far more than two
    > > > > > times, until it tells me "Restart failed: Device or resource busy".
    > > > > > In my call_restart() function, I fork a child to restart my
    > > > > > program (before it restarts, its parent has exited, and the pgid of
    > > > > > the child is set to the child's pid). But on restart, the forked
    > > > > > child is always the child of the exited parent, and the parent seems
    > > > > > to still be alive. I do not know why.
    > > > > > Thank you for your help.
    > > > > >
    > > > > > Locus.
    > > > > > 
    > > > > >
    > > > > >
    > > > > >
    > > > > Locus,
    > > > >
    > > > > Your first issue is "it  restarts  far more than two times".  That is
    > > > > because the value of "global_number" has been restored to the value 2
    > > > > when BLCR restarted the program.  You will need to use a different
    > > > > mechanism to handle any value that is supposed to change across
    > > > > checkpoints.  I suggest that you try registering a checkpoint callback.
    > > > >
    > > > > In main add these two lines:
    > > > >  cr_client_id_t id = cr_init();
    > > > >  cr_register_callback(my_callback, NULL, CR_SIGNAL_CONTEXT);
    > > > >
    > > > > and somewhere add the following function:
    > > > >  static int my_callback(void* arg) {
    > > > >    int rc = cr_checkpoint(CR_CHECKPOINT_READY);
    > > > >    if (rc > 0) { /* Restarting */
    > > > >      --global_number;
    > > > >    }
    > > > >    return 0;
    > > > >  }
    > > > >
    > > > > As for eventually failing with "Device or resource busy", I imagine that
    > > > > with the many restarts you may have eventually reused the original PID
    > > > > for the cr_restart executable.  Perhaps that problem will go away when
    > > > > you fix the multiple restarts problem.  The other possibility here is
    > > > > that you are trying to restart the same process multiple times
    > > > > *concurrently*, thus trying to use the original PID twice at the same
    > > > > time.
    > > > >
    > > > > Let me know if you need any more help.
    > > > >
    > > > > -Paul
    > > > >
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
