From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Jan 15 2008 - 10:21:41 PST
Locus,

A BLCR callback is sort of like a signal handler that runs when the checkpoint is being taken. So, when cr_request_checkpoint() causes the checkpoint to be taken, BLCR will run the callback. The callback runs up to its cr_checkpoint() call before the checkpoint is taken, which allows a process to save any state that BLCR doesn't handle (such as TCP sockets). The call to cr_checkpoint() allows the checkpoint to proceed (possibly invoking other callbacks if more than one is registered). The return value from cr_checkpoint() will be 0 when the process is just continuing normally after a checkpoint has been taken, but will be >0 when resuming from a restart. Any code running in the callback after the cr_checkpoint() call can restore any state that the callback saved. In the example callback I showed, the value of global_number will decrease by one when the process is restarted.

-Paul

Locus Jackson wrote:
> Hi,
> I am sorry that I still have some questions.
> In my function set_checkpoint(), I use
> cr_init(), cr_initialize_checkpoint_args_t, cr_request_checkpoint() and
> cr_poll_checkpoint() to set a checkpoint.
> In my function call_restart(), I use pipe(), fork(), and system() to
> restart my program from the checkpoint that I set before.
> You suggest registering a checkpoint callback. I have some
> difficulty understanding its mechanism, though I have read libcr.h.
> 1. cr_register_callback(cr_callback_t func, void* arg, int flags): I
> wonder when the callback func will be invoked? Will it be invoked after
> my function set_checkpoint() is called? When will the callback func be
> invoked generally?
> 2. In your reply, you wrote:
>     static int my_callback(void* arg) {
>         int rc = cr_checkpoint(CR_CHECKPOINT_READY);
>         if (rc > 0) { /* Restarting */
>             --global_number;
>         }
>         return 0;
>     }
> I wonder, does cr_checkpoint() set a checkpoint like my function
> set_checkpoint()? If the answer is no, can I add call_restart()
> in the condition if(rc>0) to explicitly restart my program for global_times?
> 3. If possible, would you please give me an example to explain your
> callback method? I want to restart my program any given number of
> times, but now, if I call call_restart(), the program will run forever,
> which is really terrible.
> Thank you very much for your kind help.
>
> Regards,
> Locus.
>
>
> ----- Original Message ----
> From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
> To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
> Cc: checkpoint_at_lbl_dot_gov
> Sent: Tuesday, January 15, 2008 1:05:08 PM
> Subject: Re: Restart my program failed ?
>
> Locus Jackson wrote:
> > Hello,
> > I use blcr to checkpoint and restart my program (a single threaded
> > application).
> > But when I want to restart my program, it always fails.
> > The general form of my program is as follows:
> >
> > void set_checkpoint()  // use this function to set a checkpoint at any time and place
> > {
> >     ........
> > }
> >
> > void call_restart(char* filename)  // use this function to restart my program in case it failed
> > {
> >     ......
> >     system("cr_restart filename");
> > }
> >
> > int global_number=2;
> > int main()
> > {
> >     ......
> >     statement1;
> >     set_checkpoint();
> >     statement2;
> >     ......
> >     while(global_number>0)  // I want to restart my program 2 times
> >     {
> >         global_number--;
> >         call_restart();
> >     }
> >     statement3;
> >     ......
> > }
> >
> > When I execute this program, it restarts far more than two
> > times, until it tells me "Restart failed: Device or resource busy".
> > In my call_restart() function, I fork a child to restart my
> > program (before it restarts, its parent has exited, and the pgid of the
> > child is also set to the child's pid), but in restart, the child which
> > is forked is always the son of the exited parent; the parent seems to
> > be still alive. I do not know why.
> > Thank you for your help.
> >
> > Locus.
>
> Locus,
>
> Your first issue is "it restarts far more than two times". That is
> because the value of "global_number" has been restored to the value 2
> when BLCR restarted the program. You will need to use a different
> mechanism to handle any value that is supposed to change across
> checkpoints. I suggest that you try registering a checkpoint callback.
>
> In main add these two lines:
>     cr_client_id_t id = cr_init();
>     cr_register_callback(my_callback, NULL, CR_SIGNAL_CONTEXT);
>
> and somewhere add the following function:
>     static int my_callback(void* arg) {
>         int rc = cr_checkpoint(CR_CHECKPOINT_READY);
>         if (rc > 0) { /* Restarting */
>             --global_number;
>         }
>         return 0;
>     }
>
> As for eventually failing with "Device or resource busy", I imagine that
> with the many restarts you may have eventually reused the original PID
> for the cr_restart executable. Perhaps that problem will go away when
> you fix the multiple restarts problem. The other possibility here is
> that you are trying to restart the same process multiple times
> *concurrently*, thus trying to use the original PID twice at the same
> time.
>
> Let me know if you need any more help.
>
> -Paul
>
> --
> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
> Future Technologies Group
> HPC Research Department                   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900