Re: Checkpoint program at any time ?


From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Dec 17 2007 - 11:03:37 PST

In reply to: Daniel: "Checkpoint program at any time ?"

Daniel wrote:
> Dear Sir ,
>
> Thank you for your help .
> I want to know can we explicitly checkpoint and restart at any time in
> my program , to explain my meaning clearly ,
> I offer an example as follows :
>
> example( maybe wrong):
> statements ;
> statements;
> cr_client_id_t cr;
> .
> .
> .
> cr_enter_cs(cr); //enter a critical section
> cr_checkpoint(0); //place1 , I want to set the first checkpoint here
> cr_leave_cs(cr);
> statements; //place2
> .
> .
> .
> cr_enter_cs(cr);
> cr_checkpoint(0); //place3, the second checkpoint I want to set here
> cr_leave_cs(cr);
> statements; //place4
> .
> .
> .
> Suppose this program is attacked (or other unexpected ) during its
> execution , but I have set two checkpoints above , so
> there is no need to restart my program at its beginning . If I can
> know at place2(or place4) the program is attacked , so
> I want to restart it at place1(or place3) , can we explicitly restart
> my program at place1(or place3) ? If BLCR supports ,
> would you please offer me an example ?
>
> Thanks!
>
> Daniel.

Daniel,
Thank you for restating your question, and I apologize for not
responding more quickly. Rather than "cr_checkpoint()", you should be
calling "cr_request()" to request that a checkpoint be taken of the
calling process. However, the actual checkpoint is done asynchronously.
The use of critical sections excludes checkpoints between the enter and
leave (mainly for interaction with checkpoints requested external to the
process). Since there is no call to check for completion of the
asynchronous checkpoint started by "cr_request()", an enter/leave pair is
used as a way to ensure the checkpoint has completed before the next
step (place2 or place4 in your example). Finally, I see you have not
initialized the client id "cr". Here is a revised version of your example:

cr_client_id_t cr;
...
cr = cr_init();
...
cr_request(); //place1 , request the first checkpoint here
cr_enter_cs(cr); //enter critical section, ensures checkpoint is done
cr_leave_cs(cr); //leave
statements; //place2
...
cr_request(); //place3 , request the second checkpoint here
cr_enter_cs(cr); //enter critical section, ensures checkpoint is done
cr_leave_cs(cr); //leave
statements; //place4
...

This is a very basic example and there are a couple of other things one
might want to do here:

1) Both checkpoints are going to write to the file "context.1234", where
"1234" is replaced by the pid of the process. That means that the
checkpoint at place3 destroys the one taken at place1. You could add
code inside the enter/leave pair to rename the file, but more likely
you'll want to replace "cr_request()" with "cr_request_fd()" or
"cr_request_file()", which take a file descriptor or filename as
arguments and request a checkpoint be written to that location.

2) The functions "cr_request()", "cr_request_fd()" and
"cr_request_file()" don't have any error reporting mechanisms. If you
wanted to check for errors (like out of disk space) then you can replace
"cr_request()" with "cr_request_checkpoint()". This function takes an
argument that specifies things like who to checkpoint (zero means the
current process) and a file descriptor to checkpoint to. It also
initializes a handle that one must poll or block on (instead of the
enter/leave pair) to wait for completion. The poll will indicate if any
errors happened. For a full example of "cr_request_checkpoint()" with
error checking, have a look at util/cr_checkpoint/cr_checkpoint.c. Here
is a shortened example with simplified error checking:

cr_client_id_t cr;
cr_checkpoint_args_t cr_args;
cr_checkpoint_handle_t cr_handle;
int err;
...
cr = cr_init();
cr_initialize_checkpoint_args_t(&cr_args); // start with defaults
cr_args.cr_scope = CR_SCOPE_PROC; // checkpoint a process
cr_args.cr_target = 0; // process = self
...
//place1:
cr_args.cr_fd = open(some_filename, O_WRONLY|O_CREAT|O_LARGEFILE, 0400);
if (cr_args.cr_fd < 0) { /* HANDLE ERROR HERE */ }
err = cr_request_checkpoint(&cr_args, &cr_handle);
if (err < 0) { /* HANDLE ERROR HERE */ }
do { // Wait for checkpoint to complete
    err = cr_poll_checkpoint(&cr_handle, NULL);
    if (err < 0) {
        if (errno == EINVAL) {
            // expect this value when restarting -- not an error
            err = 0;
        } else if (errno == EINTR) {
            // poll was interrupted by a signal -- while loop retries
        } else {
            // HANDLE ERROR HERE, then leave the loop
            break;
        }
    }
} while (err < 0);
close(cr_args.cr_fd);
statements; //place2
...
//place3: SAME AS PLACE1
statements; //place4
...
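
Since place3 repeats place1, the two ideas above (a distinct file per
checkpoint, and the retry loop around the poll) can be factored into small
helpers. What follows is only a sketch under stated assumptions, not BLCR
code: the names checkpoint_name(), wait_for_checkpoint(), poll_fn and
fake_poll() are hypothetical, the "context.<pid>.<seq>" naming scheme is
just one possible choice, and the poll step is passed in as a callback so
that the sketch compiles without libcr. In a real program the callback
would wrap the cr_poll_checkpoint() call shown above.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: build a distinct file name for each checkpoint,
 * e.g. "context.1234.1", so a later checkpoint does not overwrite an
 * earlier one. Returns 0 on success, -1 if buf is too small. */
int checkpoint_name(char *buf, size_t len, long pid, int seq)
{
    int n = snprintf(buf, len, "context.%ld.%d", pid, seq);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical callback type standing in for the poll step: it returns
 * a negative value and sets errno on failure. */
typedef int (*poll_fn)(void *handle);

/* Same policy as the loop in the example above: retry on EINTR, treat
 * EINVAL as "we are restarting" (not an error), and report any other
 * failure. Returns 0 when the wait is over, -1 with errno set on a
 * real error. */
int wait_for_checkpoint(poll_fn poll_once, void *handle)
{
    for (;;) {
        int err = poll_once(handle);
        if (err >= 0)
            return 0;       /* poll reported completion */
        if (errno == EINVAL)
            return 0;       /* expected when restarting */
        if (errno != EINTR)
            return -1;      /* real error: caller must handle it */
        /* EINTR: interrupted by a signal, so try again */
    }
}

/* Stand-in poll used only to exercise the helper: it is "interrupted"
 * twice and succeeds on the third call. The handle is a call counter. */
int fake_poll(void *handle)
{
    int *calls = handle;
    if (++*calls < 3) {
        errno = EINTR;
        return -1;
    }
    return 1;
}
```

At place1 one could then call checkpoint_name(buf, sizeof buf,
(long)getpid(), 1), pass buf to open() (or to "cr_request_file()"), and
use seq = 2 at place3 so that both context files survive.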

I hope this helps. Let us know if you still have questions about how to
use BLCR.

-Paul

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
