From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Dec 17 2007 - 11:03:37 PST
Daniel wrote:
> Dear Sir,
>
> Thank you for your help.
> I want to know whether we can explicitly checkpoint and restart at any
> time in my program. To explain my meaning clearly, I offer an example
> as follows:
>
> example (maybe wrong):
>     statements;
>     statements;
>     cr_client_id_t cr;
>     .
>     .
>     .
>     cr_enter_cs(cr);    //enter a critical section
>     cr_checkpoint(0);   //place1, I want to set the first checkpoint here
>     cr_leave_cs(cr);
>     statements;         //place2
>     .
>     .
>     .
>     cr_enter_cs(cr);
>     cr_checkpoint(0);   //place3, the second checkpoint I want to set here
>     cr_leave_cs(cr);
>     statements;         //place4
>     .
>     .
>     .
> Suppose this program is attacked (or something else unexpected happens)
> during its execution. I have set the two checkpoints above, so there is
> no need to restart my program from its beginning. If I learn at place2
> (or place4) that the program was attacked, I want to restart it at
> place1 (or place3). Can we explicitly restart my program at place1
> (or place3)? If BLCR supports this, would you please offer me an example?
>
> Thanks!
>
> Daniel.

Daniel,

Thank you for restating your question, and I apologize for not
responding more quickly.

Rather than "cr_checkpoint()", you should be calling "cr_request()" to
request that a checkpoint be taken of the calling process. However, the
actual checkpoint is done asynchronously. The use of critical sections
excludes checkpoints between the enter and leave (mainly for interaction
with checkpoints requested external to the process). Since there is no
call to check for completion of the asynchronous checkpoint started by
"cr_request()", an enter/leave pair is used as a way to ensure the
checkpoint has completed before the next step (place2 or place4 in your
example). Finally, I see you have not initialized the client id "cr".

Here is a revised version of your example:

    cr_client_id_t cr;
    ...
    cr = cr_init();
    ...
    cr_request();       //place1, request the first checkpoint here
    cr_enter_cs(cr);    //enter critical section, ensures checkpoint is done
    cr_leave_cs(cr);    //leave
    statements;         //place2
    ...
    cr_request();       //place3, request the second checkpoint here
    cr_enter_cs(cr);    //enter critical section, ensures checkpoint is done
    cr_leave_cs(cr);    //leave
    statements;         //place4
    ...

This is a very basic example and there are a couple of other things one
might want to do here:

1) Both checkpoints are going to write to the file "context.1234", where
"1234" is replaced by the pid of the process. That means that the
checkpoint at place3 destroys the one taken at place1. You could add
code inside the enter/leave pair to rename the file, but more likely
you'll want to replace "cr_request()" with "cr_request_fd()" or
"cr_request_file()", which take a file descriptor or a filename,
respectively, and request that the checkpoint be written to that
location.

2) The functions "cr_request()", "cr_request_fd()" and
"cr_request_file()" don't have any error reporting mechanisms. If you
want to check for errors (like running out of disk space), you can
replace "cr_request()" with "cr_request_checkpoint()". This function
takes an argument that specifies things like what to checkpoint (a
target of zero means the current process) and a file descriptor to
checkpoint to. It also initializes a handle that one must poll or block
on (instead of using the enter/leave pair) to wait for completion. The
poll will indicate whether any errors happened. For a full example of
"cr_request_checkpoint()" with error checking, have a look at
util/cr_checkpoint/cr_checkpoint.c.

Here is a shortened example with simplified error checking:

    cr_client_id_t cr;
    cr_checkpoint_args_t cr_args;
    cr_checkpoint_handle_t cr_handle;
    int err;
    ...
    cr = cr_init();
    cr_initialize_checkpoint_args_t(&cr_args);  // start with defaults
    cr_args.cr_scope = CR_SCOPE_PROC;           // checkpoint a process
    cr_args.cr_target = 0;                      // process = self
    ...
    //place1:
    cr_args.cr_fd = open(some_filename, O_WRONLY|O_CREAT|O_LARGEFILE, 0400);
    if (cr_args.cr_fd < 0) {
        // HANDLE ERROR HERE
    }
    err = cr_request_checkpoint(&cr_args, &cr_handle);
    if (err < 0) {
        // HANDLE ERROR HERE
    }
    do { // Wait for checkpoint to complete
        err = cr_poll_checkpoint(&cr_handle, NULL);
        if (err < 0) {
            if (errno == EINVAL) {
                // expect this value when restarting -- not an error
                err = 0;
            } else if (errno == EINTR) {
                // poll was interrupted by a signal -- while loop retries
            } else {
                // HANDLE ERROR HERE
            }
        }
    } while (err < 0);
    close(cr_args.cr_fd);
    statements;         //place2
    ...
    //place3: SAME AS PLACE1
    statements;         //place4
    ...

I hope this helps. Let us know if you still have questions about how to
use BLCR.

-Paul

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
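
As a follow-up to point 1 above (both checkpoints landing in the same
"context.<pid>" file), here is a minimal sketch of one way to build a
distinct file name for each checkpoint. The helper make_context_name()
and the "context.<pid>.<seq>" naming scheme are hypothetical and are not
part of BLCR. The resulting string would simply be handed to open() or
"cr_request_file()" at place1 and place3.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper, NOT a BLCR function: writes "context.<pid>.<seq>"
 * into buf so that successive checkpoints do not overwrite each other.
 * Returns 0 on success, or -1 if the name does not fit in len bytes.
 */
int make_context_name(char *buf, size_t len, pid_t pid, unsigned seq)
{
    int n = snprintf(buf, len, "context.%ld.%u", (long)pid, seq);

    /* snprintf reports the length it wanted; n >= len means truncation */
    if (n < 0 || (size_t)n >= len) {
        return -1;
    }
    return 0;
}
```

Calling it with getpid() and seq = 1 at place1, then seq = 2 at place3,
gives two separate file names, so the first checkpoint is still on disk
after the second one has been written.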