Re: BLCR checkpointfile corruption

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Jul 28 2009 - 11:42:18 PDT

Next message: Paul H. Hargrove: "Re: copy-on-write"

Previous message: Gang Chen: "copy-on-write"
In reply to: Hans Westgaard Ry: "BLCR checkpointfile corruption"

Hans,

Be advised that checkpoint_at_lbl_dot_gov is a mailing list with both 
subscription and archives open to the general public. If you wish to 
maintain privacy regarding your development, please send to me directly 
at PHHargrove_at_lbl_dot_gov in the future.

I am going to assume you are using BLCR-0.8.x. If not, let me know, but 
also please consider upgrading.

Your problem does not sound like any specific known problem, but it does 
bring to mind a few potentially related things.
1) Files open in the process requesting a restart may "bleed through" to 
the restarted process. Not sure if that is related, but it IS a 
potentially non-obvious behavior.
2) I suspect the I/O error is an indication of truncation. That makes me 
think that a checkpoint has been taken that observed the length of the 
open context file when it was still incomplete, and therefore the 
restart could/would have truncated back to that length. However, that 
doesn't seem to fit your description because the latest checkpoint would 
certainly observe the older ones with their final complete length.
3) A close() after the checkpoint should certainly not be corrupting the 
checkpoint file. So, that is where I would suggest you begin looking. 
The way in which we deal with a self-requested checkpoint looks roughly 
like:
    cr_initialize_checkpoint_args_t(&args, ....);
    args.cr_fd = open(filename, ....);
    ... set rest of args structure ...
    cr_request_checkpoint(&args, &handle);
    do {
        rc = cr_poll_checkpoint(&handle, NULL);
    } while ((rc < 0) && (errno == EINTR));
    if ((rc == CR_POLL_CHKPT_ERR_POST) && (errno == CR_ERESTARTED)) {
        /* restarting. fd should be closed already */
    } else if (rc < 0) {
        /* deal with errors */
    } else {
        close(args.cr_fd);
    }
If the checkpoint is not self-requested, but the context file is open in 
one or more of the target processes, then I am not sure what will 
happen. It is possible that there could be a BLCR bug here, but I can't 
know without more information.
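
The truncation hypothesis in point 2 can be illustrated without BLCR at all. The sketch below is only a simulation under stated assumptions: it uses plain POSIX calls, the helper name simulate_truncating_restart is hypothetical, and ftruncate() merely stands in for whatever a restart would do to a file whose length was observed while it was still being written. It is not a description of BLCR's actual restart code.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the current size of an open file, or -1 on error. */
static off_t file_length(int fd)
{
    struct stat st;
    return (fstat(fd, &st) == 0) ? st.st_size : (off_t)-1;
}

/*
 * Hypothetical helper, not part of BLCR.
 * 1. Write the first half of a stand-in "context file" and record its
 *    length (as if a checkpoint observed the file while incomplete).
 * 2. Write the second half, completing the file.
 * 3. Truncate back to the recorded length (as if a restart restored
 *    the file to the length seen in step 1).
 * The three lengths are returned through the out-parameters.
 * Returns 0 on success, -1 on any I/O error. The file is removed again.
 */
int simulate_truncating_restart(const char *path, off_t *observed,
                                off_t *complete, off_t *after_restart)
{
    static const char first_half[] = "HEAD";
    static const char second_half[] = "TAIL";
    const ssize_t n1 = (ssize_t)(sizeof first_half - 1);
    const ssize_t n2 = (ssize_t)(sizeof second_half - 1);

    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    int ok = (write(fd, first_half, (size_t)n1) == n1);
    *observed = file_length(fd);            /* length while incomplete */
    ok = ok && (*observed >= 0);

    ok = ok && (write(fd, second_half, (size_t)n2) == n2);
    *complete = file_length(fd);            /* final, complete length */

    ok = ok && (ftruncate(fd, *observed) == 0);  /* simulated restart */
    *after_restart = file_length(fd);
    ok = ok && (*complete >= 0) && (*after_restart >= 0);

    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```

Called with any writable path, this reports lengths of 4, 8 and 4 bytes: the second half of the file is gone after the simulated restart, so anything that later reads it would hit a premature end of data.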

-Paul

Hans Westgaard Ry wrote:
>
> We are using blcr together with our mpi (Platform Mpi).
>
> We allow the programs to do checkpoint and continue, thus getting
> several versions of the checkpointfiles for the same run.
>
> My problem is that if I restart from the latest of these checkpoints
> all the previous checkpoint-files are corrupted and will give
> Input/Output error if used for restarting.
>
> Is this a known problem?
>
> I suspect it has to do with me not closing the checkpoint-file after
> returning from the checkpoint, but I'm not able to find a good way of
> doing that (looks like a close just after returning also corrupts the
> checkpoint file)
>
> Regards
>
> Hans Westgaard Ry
> Senior Software Developer
> Platform Computing

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
