From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Jul 28 2009 - 11:42:18 PDT
Hans,

Be advised that checkpoint_at_lbl_dot_gov is a mailing list with both subscription and archives open to the general public. If you wish to maintain privacy regarding your development, please send to me directly at PHHargrove_at_lbl_dot_gov in the future.

I am going to assume you are using BLCR-0.8.x. If not, let me know, but also please consider upgrading.

Your problem does not sound like any specific known problem, but it does bring to mind a few potentially related things.

1) Files open in the process requesting a restart may "bleed through" to the restarted process. Not sure if that is related, but it IS a potentially non-obvious behavior.

2) I suspect the I/O error is an indication of truncation. That makes me think that a checkpoint was taken that observed the length of the open context file while it was still incomplete, and therefore the restart could/would have truncated the file back to that length. However, that doesn't seem to fit your description, because the latest checkpoint would certainly observe the older ones at their final, complete length.

3) A close() after the checkpoint should certainly not be corrupting the checkpoint file. So, that is where I would suggest you begin looking.

The way in which we deal with a self-requested checkpoint looks roughly like:

    cr_initialize_checkpoint_args_t(&args, ....);
    args.cr_fd = open(filename, ....);
    ... set rest of args structure ...
    cr_request_checkpoint(&args, &handle);
    do {
        rc = cr_poll_checkpoint(&handle, NULL);
    } while ((rc < 0) && (errno == EINTR));
    if ((rc == CR_POLL_CHKPT_ERR_POST) && (errno == CR_ERESTARTED)) {
        /* restarting. fd should be closed already */
    } else if (rc < 0) {
        /* deal with errors */
    } else {
        close(args.cr_fd);
    }

If the checkpoint is not self-requested, but the context file is open in one or more of the target processes, then I am not sure what will happen. It is possible that there could be a BLCR bug here, but I can't know without more information.
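The branch structure in the outline above can be tried out without BLCR installed. What follows is only a minimal sketch under stated assumptions: `finish_checkpoint`, `scripted_poll`, `fd_is_open`, `SKETCH_ERR_POST` and `SKETCH_ERESTARTED` are hypothetical names invented for illustration, standing in for `cr_poll_checkpoint`, `CR_POLL_CHKPT_ERR_POST` and `CR_ERESTARTED`. None of them are part of libcr, and the numeric values are placeholders. The point it shows is that the context-file descriptor is closed on exactly one path: the one where the original process continues after a successful checkpoint.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Placeholder values, NOT the real BLCR constants. */
#define SKETCH_ERR_POST   (-2)    /* stands in for CR_POLL_CHKPT_ERR_POST */
#define SKETCH_ERESTARTED 10001   /* stands in for CR_ERESTARTED */

enum outcome { OUTCOME_CONTINUED, OUTCOME_RESTARTED, OUTCOME_FAILED };

/* Hypothetical poll callback: returns < 0 with errno set on failure,
 * mirroring how the poll call is used in the outline above. */
typedef int (*poll_fn)(void *state);

/* Test double that replays a fixed list of (return code, errno) pairs. */
struct script {
    const int *rcs;
    const int *errnos;
    int pos;
};

int scripted_poll(void *state)
{
    struct script *s = state;
    errno = s->errnos[s->pos];
    return s->rcs[s->pos++];
}

/* Returns nonzero if fd refers to an open descriptor. */
int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1;
}

/* Wait for the checkpoint to finish and close fd only when the
 * original process is continuing after a successful checkpoint. */
enum outcome finish_checkpoint(poll_fn poll, void *state, int fd)
{
    int rc;

    assert(poll != NULL);

    /* Retry while the poll is interrupted by a signal. */
    do {
        rc = poll(state);
    } while (rc < 0 && errno == EINTR);

    if (rc == SKETCH_ERR_POST && errno == SKETCH_ERESTARTED) {
        /* Resumed from the context file: do not close fd here. */
        return OUTCOME_RESTARTED;
    }
    if (rc < 0) {
        /* Real error: this sketch leaves fd for the caller to handle. */
        return OUTCOME_FAILED;
    }

    close(fd);
    return OUTCOME_CONTINUED;
}
```

Feeding `scripted_poll` a sequence such as two EINTR failures followed by a success, and checking `fd_is_open()` afterwards, confirms that the descriptor is closed only in the continue case. Leaving the descriptor alone on a real error is a choice made for this sketch, not something the outline above prescribes.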
-Paul

Hans Westgaard Ry wrote:
> We are using blcr together with our mpi (Platform Mpi).
> We allow the programs to do checkpoint and continue thus getting
> several versions of the checkpointfiles for the same run.
>
> My problem is that if I restart from the latest of these checkpoints
> all the previous checkpoint-files are corrupted and will give
> Input/Output error if used for restarting.
>
> Is this a known problem ?
>
> I suspect it has to do with me not closing the checkpoint-file after
> returning from the checkpoint but I'm not able to find a good way of
> doing that (looks like a close just after returning also corrupts the
> checkpoint file)
>
> Regards
>
> Hans Westgaard Ry
> Senior Software Developer
> Platform Computing

-- 
Paul H. Hargrove                  PHHargrove_at_lbl_dot_gov
Future Technologies Group         Tel: +1-510-495-2352
HPC Research Department           Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory