Re: Checkpoint program at any time ?

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Dec 17 2007 - 11:03:37 PST

  • Next message: Paul H. Hargrove: "Re: Checkpoint program at any time ?"
    王磊 wrote:
    > Dear Sir ,
    >
    > Thank you for your help .
    > I want to know can we explicitly checkpoint and restart at any time in
    > my program , to explain my meaning clearly ,
    > I offer an example as follows :
    >
    > example( maybe wrong):
    > statements ;
    > statements;
    > cr_client_id_t cr;
    > .
    > .
    > .
    > cr_enter_cs(cr); //enter a critical section
    > cr_checkpoint(0); //place1 , I want to set the first checkpoint here
    > cr_leave_cs(cr);
    > statements; //place2
    > .
    > .
    > .
    > cr_enter_cs(cr);
    > cr_checkpoint(0); //place3, the second checkpoint I want to set here
    > cr_leave_cs(cr);
    > statements; //place4
    > .
    > .
    > .
    > Suppose this program is attacked (or other unexpected ) during its
    > execution , but I have set two checkpoints above , so
    > there is no need to restart my program at its beginning . If I can
    > know at place2(or place4) the program is attacked , so
    > I want to restart it at place1(or place3) , can we explicitly restart
    > my program at place1(or place3) ? If BLCR supports ,
    > would you please offer me an example ?
    >
    > Thanks !
    >
    > Daniel.
    
    Daniel,
    Thank you for restating your question, and I apologize for not
    responding more quickly. Rather than "cr_checkpoint()", you should be
    calling "cr_request()" to request that a checkpoint be taken of the
    calling process. However, the actual checkpoint is done asynchronously.
    The use of critical sections excludes checkpoints between the enter and
    leave (mainly for interaction with checkpoints requested external to the
    process). Since there is no call to check for completion of the
    asynchronous checkpoint started by "cr_request()", an enter/leave pair is
    used as a way to ensure the checkpoint has completed before the next
    step (place2 or place4 in your example). Finally, I see you have not
    initialized the client id "cr". Here is a revised version of your example:
    
    cr_client_id_t cr;
    ...
    cr = cr_init();
    ...
    cr_request(); //place1 , request the first checkpoint here
    cr_enter_cs(cr); //enter critical section, ensures checkpoint is done
    cr_leave_cs(cr); //leave
    statements; //place2
    ...
    cr_request(); //place3 , request the second checkpoint here
    cr_enter_cs(cr); //enter critical section, ensures checkpoint is done
    cr_leave_cs(cr); //leave
    statements; //place4
    ...
    
    This is a very basic example and there are a couple of other things one
    might want to do here:
    
    1) Both checkpoints are going to write to the file "context.1234", where
    "1234" is replaced by the pid of the process. That means that the
    checkpoint at place3 destroys the one taken at place1. You could add
    code inside the enter/leave pair to rename the file, but more likely
    you'll want to replace "cr_request()" with "cr_request_fd()" or
    "cr_request_file()", which take a file descriptor or filename as
    arguments and request a checkpoint be written to that location.
    
    2) The functions "cr_request()", "cr_request_fd()" and
    "cr_request_file()" don't have any error reporting mechanisms. If you
    wanted to check for errors (like out of disk space) then you can replace
    "cr_request()" with "cr_request_checkpoint()". This function takes an
    argument that specifies things like who to checkpoint (zero means the
    current process) and a file descriptor to checkpoint to. It also
    initializes a handle that one must poll or block on (instead of the
    enter/leave pair) to wait for completion. The poll will indicate if any
    errors happened. For a full example of "cr_request_checkpoint()" with
    error checking, have a look at util/cr_checkpoint/cr_checkpoint.c. Here
    is a shortened example with simplified error checking:
    
    cr_client_id_t cr;
    cr_checkpoint_args_t cr_args;
    cr_checkpoint_handle_t cr_handle;
    int err;
    ...
    cr = cr_init();
    cr_initialize_checkpoint_args_t(&cr_args); // start with defaults
    cr_args.cr_scope = CR_SCOPE_PROC; // checkpoint a process
    cr_args.cr_target = 0; // process = self
    ...
    //place1:
    cr_args.cr_fd = open(some_filename, O_WRONLY|O_CREAT|O_LARGEFILE, 0400);
    if (cr_args.cr_fd < 0) { /* HANDLE ERROR HERE */ }
    err = cr_request_checkpoint(&cr_args, &cr_handle);
    if (err < 0) { /* HANDLE ERROR HERE */ }
    do { // Wait for checkpoint to complete
        err = cr_poll_checkpoint(&cr_handle, NULL);
        if (err < 0) {
            if (errno == EINVAL) {
                // expect this value when restarting -- not an error
                err = 0;
            } else if (errno == EINTR) {
                // poll was interrupted by a signal -- while loop retries
            } else {
                // HANDLE ERROR HERE (then break, or the loop polls again)
            }
        }
    } while (err < 0);
    close(cr_args.cr_fd);
    statements; //place2
    ...
    //place3: SAME AS PLACE1
    statements; //place4
    ...
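
As a rough illustration of points 1 and 2 together, the sketch below shows
one way to give each checkpoint its own file name and to factor the wait
loop into a helper that stops on a real error. It is only a sketch under
stated assumptions: the names checkpoint_filename, wait_for_checkpoint,
poll_fn and fake_poll, and the "context.<pid>.<seq>" naming scheme, are
hypothetical and not part of BLCR. The BLCR poll call is left behind a
callback so the helper compiles without libcr; in real use the callback
would wrap the cr_poll_checkpoint(&cr_handle, NULL) call shown above.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: build a distinct name per checkpoint, such as
 * "context.1234.2", so a later checkpoint does not overwrite an earlier
 * one. Returns 0 on success, -1 if the name does not fit in buf. */
static int checkpoint_filename(char *buf, size_t len, pid_t pid, unsigned seq)
{
    assert(buf != NULL && len > 0);
    int n = snprintf(buf, len, "context.%ld.%u", (long)pid, seq);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical callback type standing in for the poll step. It is assumed
 * to return a negative value and set errno on failure, as the poll call
 * in the example above does. */
typedef int (*poll_fn)(void *handle);

/* Wait for a requested checkpoint to finish.
 * Returns 0 when the poll succeeds or reports EINVAL (expected when the
 * process is resuming from a restart); retries on EINTR; returns -1 with
 * errno left set for any other error, instead of polling forever. */
static int wait_for_checkpoint(poll_fn poll_once, void *handle)
{
    for (;;) {
        if (poll_once(handle) >= 0)
            return 0;            /* checkpoint completed */
        if (errno == EINVAL)
            return 0;            /* restarting -- not an error */
        if (errno == EINTR)
            continue;            /* interrupted by a signal -- retry */
        return -1;               /* real error, caller inspects errno */
    }
}

/* Stand-in poll for demonstration only: reports EINTR on the first two
 * calls, then success. The handle is a call counter. */
static int fake_poll(void *handle)
{
    int *calls = handle;
    if (++*calls < 3) {
        errno = EINTR;
        return -1;
    }
    return 1;
}
```

At place1 one might call checkpoint_filename(name, sizeof name, getpid(), 1)
and pass the result to open(), then use seq 2 at place3, so that both
context files survive and either can be chosen at restart time.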
    
    I hope this helps. Let us know if you still have questions about how to
    use BLCR.
    
    -Paul
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
