From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Jan 16 2008 - 11:32:32 PST
I guess I answered too quickly last time, because what I proposed will *not* result in restarting twice. If you start with global_number=1, then the callback will decrement it to zero when restarted, and you will see your program restart exactly once. However, if you start with global_number=2, then each restart decrements it from 2 to 1, never going lower.

I don't have a good suggestion as to how to restart exactly twice from *inside* your program. BLCR is most often used with some outside program controlling the restarts.

-Paul

Locus Jackson wrote:
> Hi,
> Once using a callback mechanism, my program will never stop. Maybe I
> still can not understand your meaning.
> My program form (I want to restart my program for global_number times):
> void set_checkpoint() // use this to set a checkpoint at any times and places
> {
>     ......
>     cr_init();
>     cr_initialize_checkpoint_args_t(&cr_args);
>     cr_args.cr_fd=open(filename,......); // save checkpoint in filename
>     cr_register_callback(callback,......); // register a callback function
>     cr_request_checkpoint(&cr_args,......); // set a checkpoint
>     ......
>     cr_poll_checkpoint(.....); // wait for setting a checkpoint to be completed
>     ......
> }
>
> static int callback(void* arg) { // your suggestion
>     int rc = cr_checkpoint(CR_CHECKPOINT_READY);
>     if (rc > 0) {
>         --global_number;
>     }
>     return 0;
> }
>
> void call_restart(char* filename) // use this to restart explicitly my program
> {
>     ......
>     pipe();
>     fork();
>     ....... // parent is exited
>     system("cr_restart filename"); // restart program from the checkpoint set before
>     ......
> }
>
> int global_number=2; // restart numbers
> char filename[20]; // save checkpoint
> int main()
> {
>     ......
>     set_checkpoint(); // place1
>     statement1;
>     ......
>     while(global_number)
>         call_restart(filename); // place2, restart from place1 to place2 for global_number times
>     ......
>     return 0;
> }
> I wonder whether my form is wrong or not, and I wonder will callback() function be called before cr_request_checkpoint(), I
> want to restart my program for global_number times?
>
> Thank you very much for your help.
>
> Regards
> Locus.
>
> ----- Original Message ----
> From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
> To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
> Cc: checkpoint <checkpoint_at_lbl_dot_gov>
> Sent: Wednesday, January 16, 2008 2:21:41 AM
> Subject: Re: Restart my program failed ?
>
> Locus,
>
> A BLCR callback is sort of like a signal handler that runs when the
> checkpoint is being taken. So, when cr_request_checkpoint() causes the
> checkpoint to be taken, BLCR will run the callback. The callback runs
> up to the cr_checkpoint() call before the checkpoint is taken. This
> would allow a process to save any sort of state that BLCR doesn't handle
> (such as TCP sockets). The call to cr_checkpoint() allows the
> checkpoint to proceed (possibly invoking other callbacks if more than
> one is registered). The return value from cr_checkpoint() will be 0
> when the process is just continuing normally after a checkpoint has been
> taken, but will be >0 when resuming from a restart. Any code running in
> the callback after the cr_checkpoint() call can restore any state that
> the callback saved. In the example callback I showed, the value of
> global_number will decrease by one when the process is restarted.
>
> -Paul
>
>
> Locus Jackson wrote:
> > Hi,
> > I am sorry that I still have some questions.
> > In my function set_checkpoint(), I use
> > cr_init(), cr_initialize_checkpoint_args_t, cr_request_checkpoint() and
> > cr_poll_checkpoint() to set a checkpoint.
> > In my function call_restart(), I use pipe(), fork(), and system() to
> > restart my program from the checkpoint where I set before.
> > You suggest registering a checkpoint callback, I may have some
> > difficult to understand its mechanism though I have read libcr.h.
> > 1, cr_register_callback(cr_callback_t func, void* arg, int flags), I
> > wonder when the callback func will be invoked? Will it be invoked after
> > my function set_checkpoint() called? When will the callback func be
> > invoked generally?
> > 2, In your reply, you wrote:
> > static int my_callback(void* arg) {
> >     int rc = cr_checkpoint(CR_CHECKPOINT_READY);
> >     if (rc > 0) { /* Restarting */
> >         --global_number;
> >     }
> >     return 0;
> > }
> > I wonder does cr_checkpoint() set a checkpoint like my function
> > set_checkpoint()? If the answer is no, can I add call_restart()
> > in the condition if(rc>0) to explicitly restart my program for
> > global_times?
> > 3, If possible, would you please give me an example to explain your
> > callback method, I want to restart my program for any given
> > times, but now, if I call call_restart(), the program will run
> > forever, that is really terrible.
> > Thank you very much for your kind help.
> >
> > Regards,
> > Locus.
> >
> >
> > ----- Original Message ----
> > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
> > To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
> > Cc: checkpoint_at_lbl_dot_gov
> > Sent: Tuesday, January 15, 2008 1:05:08 PM
> > Subject: Re: Restart my program failed ?
> >
> > Locus Jackson wrote:
> > > Hello,
> > > I use blcr to checkpoint and restart my program (a single threaded
> > > application).
> > > But when I want to restart my program, it always failed.
> > > The general form of my program listed as follows:
> > >
> > > void set_checkpoint() // use this function to set a checkpoint at any time and places
> > > {
> > >     ........
> > > }
> > >
> > > void call_restart(char* filename) // use this function to restart my program in case it failed
> > > {
> > >     ......
> > >     system("cr_restart filename");
> > > }
> > >
> > > int global_number=2;
> > > int main()
> > > {
> > >     ......
> > >     statement1;
> > >     set_checkpoint();
> > >     statement2;
> > >     ......
> > >     while(global_number>0) // I want to restart my program 2 times
> > >     {
> > >         global_number--;
> > >         call_restart();
> > >     }
> > >     statement3;
> > >     ......
> > > }
> > >
> > > when I execute this program, it restarts far more than two
> > > times, until it told me "Restart failed: Device or resource busy".
> > > In my call_restart() function, I fork a child to restart my
> > > program (before it restart, its parent is exited, and the pgid of the
> > > child is also set to be child's pid), but in restart, the child which
> > > is forked is always the son of the exited parent, the parent seems to
> > > be still alive, I do not know why?
> > > Thank you for your help.
> > >
> > > Locus.
> >
> > Locus,
> >
> > Your first issue is "it restarts far more than two times". That is
> > because the value of "global_number" has been restored to the value 2
> > when BLCR restarted the program. You will need to use a different
> > mechanism to handle any value that is supposed to change across
> > checkpoints. I suggest that you try registering a checkpoint callback.
> >
> > In main add these two lines:
> >     cr_client_id_t id = cr_init();
> >     cr_register_callback(my_callback, NULL, CR_SIGNAL_CONTEXT);
> >
> > and somewhere add the following function:
> >     static int my_callback(void* arg) {
> >         int rc = cr_checkpoint(CR_CHECKPOINT_READY);
> >         if (rc > 0) { /* Restarting */
> >             --global_number;
> >         }
> >         return 0;
> >     }
> >
> > As for eventually failing with "Device or resource busy", I imagine that
> > with the many restarts you may have eventually reused the original PID
> > for the cr_restart executable.
> > Perhaps that problem will go away when
> > you fix the multiple restarts problem. The other possibility here is
> > that you are trying to restart the same process multiple times
> > *concurrently*, thus trying to use the original PID twice at the same
> > time.
> >
> > Let me know if you need any more help.
> >
> > -Paul
> >
> > --
> > Paul H. Hargrove PHHargrove_at_lbl_dot_gov
> > Future Technologies Group
> > HPC Research Department Tel: +1-510-495-2352
> > Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

--
Paul H. Hargrove PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900