From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Jan 22 2008 - 12:09:22 PST
Locus,

The sequence is a,b,c,e,d,f.

The first time the callback runs, the value of rc (the return from
cr_checkpoint()) is 0, and therefore the code "global_number--" does not
run, leaving global_number at 1. However, when you restart, execution
begins in the callback exactly as if returning from cr_checkpoint() with
rc > 0, like the POSIX/C99 function setjmp(). So global_number is
reduced to zero then.

In summary:
a,b,c,e(rc==0),d,f(global_number==1),RESTART,e(rc>0),d,f(global_number==0)

-Paul

Locus Jackson wrote:
> Paul,
> Thank you for your reply.
> I still have another question. In my program, set_checkpoint() (place
> A) will be called first, thus cr_request_checkpoint(), cr_checkpoint(),
> cr_poll_checkpoint() and mycallback() are all called before reaching
> place B. So when reaching place B, global_number is equal to 0?
> Maybe my understanding is wrong.
> So my question is: when is the checkpoint taken in my program?
> To explain my meaning clearly:
>
> set_checkpoint() {            //a
>     cr_register_callback();   //b
>     cr_request_checkpoint();  //c
>     cr_poll_checkpoint();     //d
> }
>
> mycallback() {}               //e
>
> while(global_number)          //f
>
> The procedure of setting a checkpoint:
>
> a --> b ---> d ---> a checkpoint is set --> global_number is set to 0 --> f
>       | -- e --|
>            |
>       call mycallback() to set global_number to 0
>
> In your reply, you mean that on first reaching B, global_number is 1.
> I still cannot understand.
>
> Thank you for your help.
>
> Regards
> Locus
>
> ----- Original Message ----
> From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
> To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
> Cc: checkpoint <checkpoint_at_lbl_dot_gov>
> Sent: Saturday, January 19, 2008 4:04:19 AM
> Subject: Re: Restart my program failed ?
>
> Locus,
>
> BLCR callbacks run when the checkpoint is taken. In your case it runs
> at some indeterminate spot between entering cr_request_checkpoint() and
> leaving cr_poll_checkpoint(). The portions of the callback function
> before "cr_checkpoint()" run before the actual checkpoint is taken, and
> the parts after calling cr_checkpoint() run after the checkpoint is
> saved. The return code from cr_checkpoint() is 0 when the checkpoint is
> taken. However when restarting, the cr_checkpoint() call returns
> something greater than zero (see "man setjmp" for a similar behavior in
> the POSIX APIs). That is why the callback I provided says "if (rc>0)
> --global_number", so that (if global_number started at 1) the program
> will see global_number=1 the first time it reaches the line you've
> marked as "//B", but when restarted will see zero (thus restarting
> exactly once).
> One thing you should change is to call cr_register_callback() only
> once (I suggest in main()), rather than each time you request a
> checkpoint. If you register it multiple times it will get called
> multiple times and at restart you might get global_number<0 (which will
> still cause your program to restart nearly forever).
>
> -Paul
>
> Locus Jackson wrote:
> > Hi,
> > Thank you for your suggestion.
> > I set global_number to 1 to have a try, but if global_number is equal
> > to 1, it will not call function call_restart() any more.
> > int global_number=1;
> > int main()
> > {
> >     ......
> >     set_checkpoint();           //A
> >     ......
> >     while(global_number)        //B
> >         call_restart(filename); //C
> >     ......
> > }
> > At place B, global_number is equal to 0, so it will not call
> > call_restart(), and thus my program will not restart even once.
> > So is there any method that gives me a chance to call
> > call_restart()? Maybe one time is also ok?
> > And I also want to know: in my function set_checkpoint(), will the
> > callback function callback() be called before
> > cr_request_checkpoint() or after it? Is a callback function
> > automatically invoked before calling a function to
> > set a checkpoint?
> > Thank you for your help.
> >
> > Regards
> > Locus.
> >
> > ----- Original Message ----
> > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
> > To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
> > Cc: checkpoint <checkpoint_at_lbl_dot_gov>
> > Sent: Thursday, January 17, 2008 3:32:32 AM
> > Subject: Re: Restart my program failed ?
> >
> > I guess I answered too quickly last time, because what I proposed will
> > *not* result in restarting twice. If you start with global_number=1,
> > then the callback will decrement it to zero when restarted, and you will
> > see your program restart exactly once. However, if you start with
> > global_number=2, then each restart decrements it from 2 to 1, never
> > going lower.
> >
> > I don't have a good suggestion as to how to restart exactly twice from
> > *inside* your program. BLCR is most often used with some outside
> > program controlling the restarts.
> >
> > -Paul
> >
> > Locus Jackson wrote:
> > > Hi,
> > > Once using a callback mechanism, my program will never stop. Maybe I
> > > still cannot understand your meaning.
> > > My program form (I want to restart my program global_number times):
> > > void set_checkpoint() //use this to set a checkpoint at any time and place
> > > {
> > >     ......
> > >     cr_init();
> > >     cr_initialize_checkpoint_args_t(&cr_args);
> > >     cr_args.cr_fd=open(filename,......);    //save checkpoint in filename
> > >     cr_register_callback(callback,......);  //register a callback function
> > >     cr_request_checkpoint(&cr_args,......); //set a checkpoint
> > >     ......
> > >     cr_poll_checkpoint(.....); //wait for setting a checkpoint to be completed
> > >     ......
> > > }
> > >
> > > static int callback(void* arg) { //your suggestion
> > >     int rc = cr_checkpoint(CR_CHECKPOINT_READY);
> > >     if (rc > 0) {
> > >         --global_number;
> > >     }
> > >     return 0;
> > > }
> > >
> > > void call_restart(char* filename) //use this to restart explicitly my program
> > > {
> > >     ......
> > >     pipe();
> > >     fork();
> > >     .......                        //parent is exited
> > >     system("cr_restart filename"); //restart program from the checkpoint set before
> > >     ......
> > > }
> > >
> > > int global_number=2; //restart numbers
> > > char filename[20];   //save checkpoint
> > > int main()
> > > {
> > >     ......
> > >     set_checkpoint(); //place1
> > >     statement1;
> > >     ......
> > >     while(global_number)
> > >         call_restart(filename); //place2, restart from place1 to place2 for global_number times
> > >     ......
> > >     return 0;
> > > }
> > > I wonder whether my form is wrong or not, and I wonder whether the
> > > callback() function will be called before cr_request_checkpoint().
> > > I want to restart my program global_number times.
> > >
> > > Thank you very much for your help.
> > >
> > > Regards
> > > Locus.
> > >
> > > ----- Original Message ----
> > > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
> > > To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
> > > Cc: checkpoint <checkpoint_at_lbl_dot_gov>
> > > Sent: Wednesday, January 16, 2008 2:21:41 AM
> > > Subject: Re: Restart my program failed ?
> > >
> > > Locus,
> > >
> > > A BLCR callback is sort of like a signal handler that runs when the
> > > checkpoint is being taken.
> > > So, when cr_request_checkpoint() causes the
> > > checkpoint to be taken, BLCR will run the callback. The callback runs
> > > up to the cr_checkpoint() call before the checkpoint is taken. This
> > > would allow a process to save any sort of state that BLCR doesn't handle
> > > (such as TCP sockets). The call to cr_checkpoint() allows the
> > > checkpoint to proceed (possibly invoking other callbacks if more than
> > > one is registered). The return value from cr_checkpoint() will be 0
> > > when the process is just continuing normally after a checkpoint has been
> > > taken, but will be >0 when resuming from a restart. Any code running in
> > > the callback after the cr_checkpoint() call can restore any state that
> > > the callback saved. In the example callback I showed, the value of
> > > global_number will decrease by one when the process is restarted.
> > >
> > > -Paul
> > >
> > > Locus Jackson wrote:
> > > > Hi,
> > > > I am sorry that I still have some questions.
> > > > In my function set_checkpoint(), I use cr_init(),
> > > > cr_initialize_checkpoint_args_t, cr_request_checkpoint() and
> > > > cr_poll_checkpoint() to set a checkpoint.
> > > > In my function call_restart(), I use pipe(), fork(), and system() to
> > > > restart my program from the checkpoint that I set before.
> > > > You suggest registering a checkpoint callback. I have some
> > > > difficulty understanding its mechanism, though I have read libcr.h.
> > > > 1. cr_register_callback(cr_callback_t func, void* arg, int flags): I
> > > > wonder when the callback func will be invoked? Will it be invoked after
> > > > my function set_checkpoint() is called? When will the callback func be
> > > > invoked generally?
> > > > 2. In your reply, you wrote:
> > > > static int my_callback(void* arg) {
> > > >     int rc = cr_checkpoint(CR_CHECKPOINT_READY);
> > > >     if (rc > 0) { /* Restarting */
> > > >         --global_number;
> > > >     }
> > > >     return 0;
> > > > }
> > > > I wonder, does cr_checkpoint() set a checkpoint like my function
> > > > set_checkpoint()? If the answer is no, can I add call_restart()
> > > > in the condition if(rc>0) to explicitly restart my program
> > > > global_number times?
> > > > 3. If possible, would you please give me an example to explain your
> > > > callback method? I want to restart my program any given number of
> > > > times, but now, if I call call_restart(), the program will run
> > > > forever, which is really terrible.
> > > > Thank you very much for your kind help.
> > > >
> > > > Regards,
> > > > Locus.
> > > >
> > > > ----- Original Message ----
> > > > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
> > > > To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
> > > > Cc: checkpoint_at_lbl_dot_gov
> > > > Sent: Tuesday, January 15, 2008 1:05:08 PM
> > > > Subject: Re: Restart my program failed ?
> > > >
> > > > Locus Jackson wrote:
> > > > > Hello,
> > > > > I use blcr to checkpoint and restart my program (a single threaded
> > > > > application).
> > > > > But when I want to restart my program, it always fails.
> > > > > The general form of my program is listed as follows:
> > > > >
> > > > > void set_checkpoint() //use this function to set a checkpoint at any time and place
> > > > > {
> > > > >     ........
> > > > > }
> > > > >
> > > > > void call_restart(char* filename) //use this function to restart my program in case it failed
> > > > > {
> > > > >     ......
> > > > >     system("cr_restart filename");
> > > > > }
> > > > >
> > > > > int global_number=2;
> > > > > int main()
> > > > > {
> > > > >     ......
> > > > >     statement1;
> > > > >     set_checkpoint();
> > > > >     statement2;
> > > > >     ......
> > > > >     while(global_number>0) // I want to restart my program 2 times
> > > > >     {
> > > > >         global_number--;
> > > > >         call_restart();
> > > > >     }
> > > > >     statement3;
> > > > >     ......
> > > > > }
> > > > >
> > > > > When I execute this program, it restarts far more than two
> > > > > times, until it tells me "Restart failed: Device or resource busy".
> > > > > In my call_restart() function, I fork a child to restart my
> > > > > program (before it restarts, its parent has exited, and the pgid of the
> > > > > child is also set to be the child's pid), but on restart, the child which
> > > > > is forked is always the son of the exited parent; the parent seems to
> > > > > be still alive. I do not know why.
> > > > > Thank you for your help.
> > > > >
> > > > > Locus.
> > > >
> > > > Locus,
> > > >
> > > > Your first issue is "it restarts far more than two times". That is
> > > > because the value of "global_number" has been restored to the value 2
> > > > when BLCR restarted the program. You will need to use a different
> > > > mechanism to handle any value that is supposed to change across
> > > > checkpoints. I suggest that you try registering a checkpoint callback.
> > > >
> > > > In main add these two lines:
> > > >     cr_client_id_t id = cr_init();
> > > >     cr_register_callback(my_callback, NULL, CR_SIGNAL_CONTEXT);
> > > >
> > > > and somewhere add the following function:
> > > > static int my_callback(void* arg) {
> > > >     int rc = cr_checkpoint(CR_CHECKPOINT_READY);
> > > >     if (rc > 0) { /* Restarting */
> > > >         --global_number;
> > > >     }
> > > >     return 0;
> > > > }
> > > >
> > > > As for eventually failing with "Device or resource busy", I imagine that
> > > > with the many restarts you may have eventually reused the original PID
> > > > for the cr_restart executable. Perhaps that problem will go away when
> > > > you fix the multiple restarts problem. The other possibility here is
> > > > that you are trying to restart the same process multiple times
> > > > *concurrently*, thus trying to use the original PID twice at the same
> > > > time.
> > > >
> > > > Let me know if you need any more help.
> > > >
> > > > -Paul
> > > >
> > > > --
> > > > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
> > > > Future Technologies Group
> > > > HPC Research Department                   Tel: +1-510-495-2352
> > > > Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
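
A note for readers of the thread: the "returns 0 when the checkpoint is taken, returns >0 on restart" behaviour can be tried out without BLCR by using setjmp()/longjmp(), the analogy the thread itself cites. The sketch below is not BLCR code and makes no libcr calls. The names simulate_restarts(), snapshot, and max_restarts are hypothetical, and copying snapshot back into global_number is only an assumed stand-in for BLCR restoring the process memory from the checkpoint file.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf saved_point;  /* stands in for the saved execution point */
static int snapshot;         /* stands in for the memory image in the checkpoint */
static int global_number;
static int restarts;

/*
 * Model of the thread's program: "checkpoint" once, then "restart" while
 * global_number is non-zero. Returns the number of simulated restarts.
 * max_restarts is a safety cap (hypothetical) so that the endless-restart
 * case terminates in this model.
 */
int simulate_restarts(int start_value, int max_restarts)
{
    assert(max_restarts >= 0);
    global_number = start_value;
    restarts = 0;

    snapshot = global_number;        /* the "checkpoint" records the value */
    if (setjmp(saved_point) > 0) {   /* 0 on the first pass, 1 after longjmp() */
        global_number = snapshot;    /* a restart brings back the saved memory */
        --global_number;             /* the callback's "if (rc > 0)" branch */
        ++restarts;
    }

    /* Place B in the thread: restart for as long as global_number is non-zero. */
    while (global_number && restarts < max_restarts) {
        longjmp(saved_point, 1);     /* plays the role of cr_restart */
    }
    return restarts;
}
```

Under these assumptions, simulate_restarts(1, 10) returns 1, matching "restart exactly once", while simulate_restarts(2, 10) runs into the cap of 10, mirroring the case where each restart decrements global_number from 2 to 1 and never goes lower.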