From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Jan 18 2008 - 12:04:19 PST
Locus,

BLCR callbacks run when the checkpoint is taken. In your case the callback runs at some indeterminate point between entering cr_request_checkpoint() and leaving cr_poll_checkpoint(). The portion of the callback before the cr_checkpoint() call runs before the actual checkpoint is taken, and the portion after the call runs after the checkpoint is saved.

The return code from cr_checkpoint() is 0 when the process simply continues after the checkpoint is taken. When restarting, however, the cr_checkpoint() call returns something greater than zero (see "man setjmp" for similar behavior in the POSIX APIs). That is why the callback I provided says "if (rc>0) --global_number": if global_number starts at 1, the program will see global_number=1 the first time it reaches the line you've marked as "//B", but will see zero when restarted (thus restarting exactly once).

One thing you should change is to call cr_register_callback() only once (I suggest in main()), rather than each time you request a checkpoint. If you register it multiple times, it will get called multiple times, and at restart you might get global_number<0 (which will still cause your program to restart nearly forever).

-Paul

Locus Jackson wrote:
> Hi,
> Thank you for your suggestion.
> I set global_number to 1 to have a try,but if global_number is equal
> to 1,it will not call function call_restart() any more.
> int global_number=1;
> int main()
> {
>     ......
>     set_checkpoint(); //A
>     ......
>     while(global_number) //B
>         call_restart(filename); //C
>     ......
> }
> at place B,the global_number is equal to 0,it will not call
> call_restart,thus I will not restart in my program for only once.
> So is there any method that I can have a chance to call
> call_restart(),maybe one time is also ok?
> And, I also want to know , in my function set_checkpoint(),will
> callback function callback() function be called before
> cr_request_checkpoint() or after it?
> Does a callback function automatically be invoked before calling a function to
> set a checkpoint?
> Thank you for your help.
>
> Regards
> Locus.
>
> ----- Original Message ----
> From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
> To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
> Cc: checkpoint <checkpoint_at_lbl_dot_gov>
> Sent: Thursday, January 17, 2008 3:32:32 AM
> Subject: Re: Restart my program failed ?
>
> I guess I answered too quickly last time, because what I proposed will
> *not* result in restarting twice. If you start with global_number=1,
> then the callback will decrement it to zero when restarted, and you will
> see your program restart exactly once. However, if you start with
> global_number=2, then each restart decrements it from 2 to 1, never
> going lower.
>
> I don't have a good suggestion as to how to restart exactly twice from
> *inside* your program. BLCR is most often used with some outside
> program controlling the restarts.
>
> -Paul
>
> Locus Jackson wrote:
> > Hi,
> > Once using a callback mechanism,my program will never stop.Maybe I
> > still can not understand your meaning.
> > My program form(I want to restart my program for global_number times) :
> > void set_checkpoint() //use this to set a checkpoint at any times and places
> > {
> >     ......
> >     cr_init();
> >     cr_initialize_checkpoint_args_t(&cr_args);
> >     cr_args.cr_fd=open(filename,......); //save checkpoint in filename
> >     cr_register_callback(callback,......); //register a callback function
> >     cr_request_checkpoint(&cr_args,......); //set a checkpoint
> >     ......
> >     cr_poll_checkpoint(.....); //wait for setting a checkpoint to be completed
> >     ......
> > }
> >
> > static int callback(void* arg) { //your suggestion
> >     int rc = cr_checkpoint(CR_CHECKPOINT_READY);
> >     if (rc > 0) {
> >         --global_number;
> >     }
> >     return 0;
> > }
> >
> > void call_restart(char* filename) //use this to restart explicitly my program
> > {
> >     ......
> >     pipe();
> >     fork();
> >     ....... //parent is exited
> >     system("cr_restart filename"); //restart program from the checkpoint set before
> >     ......
> > }
> >
> > int global_number=2; //restart numbers
> > char filename[20]; //save checkpoint
> > int main()
> > {
> >     ......
> >     set_checkpoint(); //place1
> >     statement1;
> >     ......
> >     while(global_number)
> >         call_restart(filename); //place2,restart from place1 to place2 for global_number times
> >     ......
> >     return 0;
> > }
> > I wonder whether my form is wrong or not, and I wonder will
> > callback() function be called before cr_request_checkpoint(),I
> > want to restart my program for global_number times?
> >
> > Thank you very much for your help.
> >
> > Regards
> > Locus.
> >
> > ----- Original Message ----
> > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
> > To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
> > Cc: checkpoint <checkpoint_at_lbl_dot_gov>
> > Sent: Wednesday, January 16, 2008 2:21:41 AM
> > Subject: Re: Restart my program failed ?
> >
> > Locus,
> >
> > A BLCR callback is sort of like a signal handler that runs when the
> > checkpoint is being taken. So, when cr_request_checkpoint() causes the
> > checkpoint to be taken, BLCR will run the callback. The callback runs
> > up to the cr_checkpoint() call before the checkpoint is taken. This
> > would allow a process to save any sort of state that BLCR doesn't handle
> > (such as TCP sockets). The call to cr_checkpoint() allows the
> > checkpoint to proceed (possibly invoking other callbacks if more than
> > one is registered). The return value from cr_checkpoint() will be 0
> > when the process is just continuing normally after a checkpoint has been
> > taken, but will be >0 when resuming from a restart.
> > Any code running in the callback after the cr_checkpoint() call can restore
> > any state that the callback saved. In the example callback I showed, the
> > value of global_number will decrease by one when the process is restarted.
> >
> > -Paul
> >
> >
> > Locus Jackson wrote:
> > > Hi,
> > > I am sorry that I still have some questions.
> > > In my function set_checkpoint(),I use
> > > cr_init(),cr_initialize_checkpoint_args_t,cr_request_checkpoint() and
> > > cr_poll_checkpoint() to set a checkpoint.
> > > In my function call_restart(),I use pipe(),fork(),and system() to
> > > restart my program from the checkpoint where I set before.
> > > You suggest registering a checkpoint callback,I may have some
> > > difficult to understand its mechanism though I have read libcr.h.
> > > 1,cr_register_callback(cr_callback_t func,void* arg,int flags),I
> > > wonder when the callback func will be invoked?Will it be invoked after
> > > my function set_checkpoint() called?When will the callback func be
> > > invoked generally?
> > > 2,In your reply,you wrote:
> > > static int my_callback(void* arg) {
> > >     int rc = cr_checkpoint(CR_CHECKPOINT_READY);
> > >     if (rc > 0) { /* Restarting */
> > >         --global_number;
> > >     }
> > >     return 0;
> > > }
> > > I wonder does cr_checkpoint() set a checkpoint like my function
> > > set_checkpoint()?If the answer is no ,can I add call_restart()
> > > in the condition if(rc>0) to explicitly restart my program for
> > > global_times?
> > > 3,If possible,would you please give me an example to explain your
> > > callback method,I want to restart my program for any given
> > > times,but now,if I call call_restart(),the program will run
> > > forever,that is really terrible.
> > > Thank you very much for your kind help.
> > >
> > > Regards,
> > > Locus.
> > >
> > > ----- Original Message ----
> > > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
> > > To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
> > > Cc: checkpoint_at_lbl_dot_gov
> > > Sent: Tuesday, January 15, 2008 1:05:08 PM
> > > Subject: Re: Restart my program failed ?
> > >
> > > Locus Jackson wrote:
> > > > Hello,
> > > > I use blcr to checkpoint and restart my program(a single threaded
> > > > application).
> > > > But when I want to restart my program,it always failed.
> > > > The general form of my program listed as follows:
> > > >
> > > > void set_checkpoint() //use this function to set a checkpoint at any time and places
> > > > {
> > > >     ........
> > > > }
> > > >
> > > > void call_restart(char* filename) //use this function to restart my program in case it failed
> > > > {
> > > >     ......
> > > >     system("cr_restart filename");
> > > > }
> > > >
> > > > int global_number=2;
> > > > int main()
> > > > {
> > > >     ......
> > > >     statement1;
> > > >     set_checkpoint();
> > > >     statement2;
> > > >     ......
> > > >     while(global_number>0) // I want to restart my program 2 times
> > > >     {
> > > >         global_number--;
> > > >         call_restart();
> > > >     }
> > > >     statement3;
> > > >     ......
> > > > }
> > > >
> > > > when I execute this program ,it restarts far more than two
> > > > times,until it told me " Restart failed: Device or resource busy".
> > > > In my call_restart() function , I fork a child to restart my
> > > > program(before it restart,its parent is exited,and the pgid of the
> > > > child is also set to be child's pid ),but in restart,the child which
> > > > is forked is always the son of the exited parent,the parent seems to
> > > > be still alive,I do not know why?
> > > > Thank you for your help.
> > > >
> > > > Locus.
> > >
> > > Locus,
> > >
> > > Your first issue is "it restarts far more than two times". That is
> > > because the value of "global_number" has been restored to the value 2
> > > when BLCR restarted the program. You will need to use a different
> > > mechanism to handle any value that is supposed to change across
> > > checkpoints. I suggest that you try registering a checkpoint callback.
> > >
> > > In main add these two lines:
> > >     cr_client_id_t id = cr_init();
> > >     cr_register_callback(my_callback, NULL, CR_SIGNAL_CONTEXT);
> > >
> > > and somewhere add the following function:
> > > static int my_callback(void* arg) {
> > >     int rc = cr_checkpoint(CR_CHECKPOINT_READY);
> > >     if (rc > 0) { /* Restarting */
> > >         --global_number;
> > >     }
> > >     return 0;
> > > }
> > >
> > > As for eventually failing with "Device or resource busy", I imagine that
> > > with the many restarts you may have eventually reused the original PID
> > > for the cr_restart executable. Perhaps that problem will go away when
> > > you fix the multiple restarts problem. The other possibility here is
> > > that you are trying to restart the same process multiple times
> > > *concurrently*, thus trying to use the original PID twice at the same
> > > time.
> > >
> > > Let me know if you need any more help.
> > >
> > > -Paul

--
Paul H. Hargrove PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900