From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Aug 01 2007 - 14:41:28 PDT
Neal Becker wrote:
> ls -l context*
> -rw------- 1 nbecker nbecker 42028508638 2007-07-30 09:43 context.nbecker.26066
> -rw------- 1 nbecker nbecker 16192372703 2007-07-30 10:28 context.nbecker.26066.new
>
> That's some file!
>

Neal,

That certainly looks like something is wrong. If these file sizes (40GB and 16GB) are larger than your disk could possibly hold, then I strongly recommend running fsck to detect/repair any fs corruption. I am going to assume, however, that the files really *are* this big.

I honestly have never seen anything like this from use of BLCR. However, I do have some thoughts:

1) The largest is 40GB, which is not impossible, but probably indicates something fishy. My guess here is that we've run afoul of a new feature in BLCR. Namely, BLCR now saves (in the context file) the data contained in any files that are open but deleted (unlinked) at the time of the checkpoint (doing so is needed to allow checkpoint/restart of shells and other interpreters when running "here documents"). The example code you sent earlier didn't include calls to close(newfd), but does (via use of rename()) unlink the old context file. So, it is possible that as you take multiple checkpoints you are including the past context files inside the new context file, leading to exponential growth in the file size! The first thing I'd try would be adding close(newfd) after the rename() call.

2) The cr_request_checkpoint()-based example you sent earlier was, I think, missing a call to cr_initialize_checkpoint_args_t(). Perhaps some uninitialized bit there is responsible?

3) BLCR uses just the vfs_write() kernel routine to write to the context file. So, there are no "holes". If running "du -b context*" doesn't return results matching those of ls, then there may be filesystem corruption. If so, you should fsck and retry your test(s).

4) Maybe add O_LARGEFILE to the flags when you open()/creat() the context file that you pass to cr_request_checkpoint()?

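To make item 1 concrete, here is a minimal sketch (plain POSIX, not BLCR code) of how a descriptor that is never closed ends up referring to an unlinked file once a newer file is rename()d over the same path. The helper names and the file paths are hypothetical, chosen only for illustration; the descriptor kept open here stands in for a context-file descriptor left over from an earlier checkpoint.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the link count of the file behind fd, or -1 on error.
 * A count of 0 means the file is open but unlinked. */
static long link_count(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return (long)st.st_nlink;
}

/* Hypothetical helper: create `path` and keep it open (like a context
 * file whose descriptor was never closed), then create `tmp_path` and
 * rename() it over `path`, as a later checkpoint would.
 * Returns the descriptor that is still open on the replaced file,
 * or -1 on error. The caller must close() it. */
static int replace_and_keep_open(const char *path, const char *tmp_path)
{
    int oldfd, newfd;

    oldfd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (oldfd < 0)
        return -1;

    newfd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (newfd < 0) {
        close(oldfd);
        return -1;
    }
    close(newfd); /* the new file's descriptor is closed properly */

    /* rename() atomically replaces `path`; the old file loses its
     * last name but stays alive because oldfd is still open. */
    if (rename(tmp_path, path) != 0) {
        close(oldfd);
        unlink(tmp_path);
        return -1;
    }
    return oldfd;
}
```

Calling replace_and_keep_open() and then link_count() on the returned descriptor should report 0 on a typical local filesystem, which is exactly the "open but deleted" state described in item 1. Closing the descriptor right after the rename() removes the file from that state.
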
If none of this helps, please create a bug report (http://mantis.lbl.gov/bugzilla) and if possible attach the code that created these files (note that you have to create the bug report first and then add attachments after).

-Paul

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900