Re: BLCR checkpointfile corruption

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Jul 28 2009 - 11:42:18 PDT

  • Next message: Paul H. Hargrove: "Re: copy-on-write"
    Be advised that checkpoint_at_lbl_dot_gov is a mailing list with both 
    subscription and archives open to the general public. If you wish to 
    maintain privacy regarding your development, please send to me directly 
    at PHHargrove_at_lbl_dot_gov in the future.
    I am going to assume you are using BLCR-0.8.x. If not, let me know, but 
    also please consider upgrading.
    Your problem does not sound like any specific known problem, but it does 
    bring to mind a few potentially related things.
    1) Files open in the process requesting a restart may "bleed through" to 
    the restarted process. Not sure if that is related, but it IS a 
    potentially non-obvious behavior.
    2) I suspect the I/O error is an indication of truncation. That makes me 
    think that a checkpoint has been taken that observed the length of the 
    open context file when it was still incomplete, and therefore the 
    restart could/would have truncated back to that length. However, that 
    doesn't seem to fit your description because the latest checkpoint would 
    certainly observe the older ones with their final complete length.
    3) A close() after the checkpoint should certainly not be corrupting the 
    checkpoint file. So, that is where I would suggest you begin looking. 
    The way in which we deal with a self-requested checkpoint looks roughly 
    like this:
        cr_initialize_checkpoint_args_t(&args, ....);
        args.cr_fd = open(filename, ....);
        ... set rest of args structure ...
        cr_request_checkpoint(&args, &handle);
        do {
            rc = cr_poll_checkpoint(&handle, NULL);
        } while ((rc < 0) && (errno == EINTR));
        if ((rc == CR_POLL_CHKPT_ERR_POST) && (errno == CR_ERESTARTED)) {
            /* restarting. fd should be closed already */
        } else if (rc < 0) {
            /* deal with errors */
        } else {
            /* continuing after the checkpoint: close the context file */
            close(args.cr_fd);
        }
    If the checkpoint is not self-requested, but the context file is open in 
    one or more of the target processes, then I am not sure what will 
    happen. It is possible that there could be a BLCR bug here, but I can't 
    know without more information.
    Hans Westgaard Ry wrote:
    > We are using blcr together with our mpi (Platform Mpi).
    > We allow the programs to do checkpoint and continue thus getting 
    > several versions of
    > the checkpointfiles for the same run.
    > My problem is that if I restart from the latest of these checkpoints 
    > all the previous checkpoint-files
    > are corrupted and will give Input/Output error if used for restarting.
    > Is this a known problem ?
    > I suspect it has to do with me not closing the checkpoint-file after 
    > returning from the checkpoint
    > but I'm not able to find a good way of doing that (looks like a close 
    > just after returning also corrupts the checkpoint file)
    > Regards
    > Hans Westgaard Ry
    > Senior Software Developer
    > Platform Computing
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
