From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Jan 14 2008 - 21:05:08 PST
Locus Jackson wrote:
> Hello,
> I use blcr to checkpoint and restart my program (a single threaded
> application).
> But when I want to restart my program, it always fails.
> The general form of my program is as follows:
>
> void set_checkpoint()  // use this function to set a checkpoint at any time and place
> {
>     ........
> }
>
> void call_restart(char* filename)  // use this function to restart my program in case it failed
> {
>     ......
>     system("cr_restart filename");
> }
>
> int global_number=2;
> int main()
> {
>     ......
>     statement1;
>     set_checkpoint();
>     statement2;
>     ......
>     while(global_number>0)  // I want to restart my program 2 times
>     {
>         global_number--;
>         call_restart();
>     }
>     statement3;
>     ......
> }
>
> When I execute this program, it restarts far more than two
> times, until it tells me "Restart failed: Device or resource busy".
> In my call_restart() function, I fork a child to restart my
> program (before it restarts, its parent has exited, and the pgid of the
> child is also set to the child's pid), but in restart, the child which
> is forked is always the son of the exited parent; the parent seems to
> be still alive. I do not know why.
> Thank you for your help.
>
> Locus.

Locus,

Your first issue is "it restarts far more than two times". That is because the value of "global_number" is restored to 2 each time BLCR restarts the program: the checkpoint captured the variable with that value, so every restart brings it back. You will need to use a different mechanism to handle any value that is supposed to change across checkpoints. I suggest that you try registering a checkpoint callback.
In main add these two lines:

    cr_client_id_t id = cr_init();
    cr_register_callback(my_callback, NULL, CR_SIGNAL_CONTEXT);

and somewhere (above main, or with a prototype above it) add the following function:

    static int my_callback(void* arg)
    {
        int rc = cr_checkpoint(CR_CHECKPOINT_READY);
        if (rc > 0) {
            /* Restarting */
            --global_number;
        }
        return 0;
    }

As for eventually failing with "Device or resource busy", I imagine that with the many restarts you may have eventually reused the original PID for the cr_restart executable. Perhaps that problem will go away when you fix the multiple restarts problem.

The other possibility here is that you are trying to restart the same process multiple times *concurrently*, thus trying to use the original PID twice at the same time.

Let me know if you need any more help.

-Paul

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900