From: Locus Jackson (locus_jackson_at_yahoo_dot_com)
Date: Tue Jan 15 2008 - 04:49:17 PST
Hi,

I am sorry that I still have some questions.

In my function set_checkpoint(), I use cr_init(), cr_initialize_checkpoint_args_t, cr_request_checkpoint() and cr_poll_checkpoint() to take a checkpoint. In my function call_restart(), I use pipe(), fork(), and system() to restart my program from the checkpoint I took earlier.

You suggested registering a checkpoint callback. I have some difficulty understanding its mechanism, though I have read libcr.h.

1. cr_register_callback(cr_callback_t func, void* arg, int flags): when will the callback func be invoked? Will it be invoked after my function set_checkpoint() is called? When is the callback invoked in general?

2. In your reply, you wrote:

    static int my_callback(void* arg)
    {
        int rc = cr_checkpoint(CR_CHECKPOINT_READY);
        if (rc > 0) {
            /* Restarting */
            --global_number;
        }
        return 0;
    }

   Does cr_checkpoint() take a checkpoint like my function set_checkpoint() does? If the answer is no, can I call call_restart() inside the if (rc > 0) branch to explicitly restart my program for global_times?

3. If possible, would you please give me an example that explains your callback method? I want to restart my program a given number of times, but now, if I call call_restart(), the program runs forever, which is really terrible.

Thank you very much for your kind help.

Regards,
Locus.

----- Original Message ----
From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
To: Locus Jackson <locus_jackson_at_yahoo_dot_com>
Cc: checkpoint_at_lbl_dot_gov
Sent: Tuesday, January 15, 2008 1:05:08 PM
Subject: Re: Restart my program failed ?

Locus Jackson wrote:
> Hello,
> I use blcr to checkpoint and restart my program (a single-threaded
> application).
> But when I want to restart my program, it always fails.
> The general form of my program is as follows:
>
> // use this function to set a checkpoint at any time and place
> void set_checkpoint()
> {
>     ........
> }
>
> // use this function to restart my program in case it fails
> void call_restart(char* filename)
> {
>     ......
>     system("cr_restart filename");
> }
>
> int global_number=2;
> int main()
> {
>     ......
>     statement1;
>     set_checkpoint();
>     statement2;
>     ......
>     while(global_number>0)    // I want to restart my program 2 times
>     {
>         global_number--;
>         call_restart();
>     }
>     statement3;
>     ......
> }
>
> When I execute this program, it restarts far more than two
> times, until it tells me "Restart failed: Device or resource busy".
> In my call_restart() function, I fork a child to restart my
> program (before it restarts, its parent has exited, and the pgid of the
> child is also set to the child's pid), but on restart, the child that
> is forked is always the son of the exited parent; the parent seems to
> be still alive. I do not know why.
> Thank you for your help.
>
> Locus.

Locus,

Your first issue is "it restarts far more than two times". That is because the value of "global_number" has been restored to the value 2 when BLCR restarted the program. You will need to use a different mechanism to handle any value that is supposed to change across checkpoints. I suggest that you try registering a checkpoint callback.

In main add these two lines:

    cr_client_id_t id = cr_init();
    cr_register_callback(my_callback, NULL, CR_SIGNAL_CONTEXT);

and somewhere add the following function:

    static int my_callback(void* arg)
    {
        int rc = cr_checkpoint(CR_CHECKPOINT_READY);
        if (rc > 0) {
            /* Restarting */
            --global_number;
        }
        return 0;
    }

As for eventually failing with "Device or resource busy", I imagine that with the many restarts you may have eventually reused the original PID for the cr_restart executable.
Perhaps that problem will go away when you fix the multiple restarts problem. The other possibility here is that you are trying to restart the same process multiple times *concurrently*, thus trying to use the original PID twice at the same time.

Let me know if you need any more help.

-Paul

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
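
Paul's point that global_number is restored to 2 on every restart suggests keeping the restart count somewhere the restart does not roll back. The sketch below is one possible alternative to the callback approach, not part of BLCR itself: it stores the remaining restart budget in a small file and reopens that file on every access. It assumes the contents of that file are not rolled back by a restart, which should be verified on the target system. The file name and the helpers read_restart_budget(), write_restart_budget(), take_restart() and build_restart_command() are hypothetical names introduced only for illustration. The last helper also shows one way to pass the real context file name to cr_restart, since system("cr_restart filename") passes the literal word "filename".

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical file name; any path that outlives the process would do. */
#define RESTART_BUDGET_FILE "restart_budget.txt"

/* Return the remaining restart count stored in `path`, or `initial`
 * if the file is missing or unreadable (for example on the first run). */
static int read_restart_budget(const char *path, int initial)
{
    int value = initial;
    FILE *fp;

    assert(path != NULL);
    fp = fopen(path, "r");
    if (fp != NULL) {
        if (fscanf(fp, "%d", &value) != 1)
            value = initial;
        fclose(fp);
    }
    return value;
}

/* Overwrite `path` with `value`. Returns 0 on success, -1 on I/O error. */
static int write_restart_budget(const char *path, int value)
{
    int ok;
    FILE *fp;

    assert(path != NULL);
    fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    ok = fprintf(fp, "%d\n", value) > 0;
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Consume one restart from the budget.
 * Returns 1 if a restart is allowed (and has been recorded),
 * 0 if the budget is exhausted, -1 on I/O error. */
static int take_restart(const char *path, int initial)
{
    int remaining = read_restart_budget(path, initial);

    if (remaining <= 0)
        return 0;
    if (write_restart_budget(path, remaining - 1) != 0)
        return -1;
    return 1;
}

/* Build "cr_restart <context_file>" in `buf`.
 * Returns 0 on success, -1 if the command does not fit.
 * No shell quoting is done, so this assumes the path contains
 * no spaces or shell metacharacters. */
static int build_restart_command(char *buf, size_t size,
                                 const char *context_file)
{
    int n;

    assert(buf != NULL && context_file != NULL);
    n = snprintf(buf, size, "cr_restart %s", context_file);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

In the program from the original question, the loop condition while(global_number>0) could then become a check that take_restart(RESTART_BUDGET_FILE, 2) returns 1. The decrement is written to disk before call_restart() runs, so the restarted process reads the updated count instead of the restored in-memory value.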