Re: Somthing horribly wrong

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Aug 01 2007 - 14:41:28 PDT

Next message: postmaster_at_bccenter_dot_org: "Undeliverable: RE: Weekly News"

Previous message: Paul H. Hargrove: "Re: API for checkpoint"
In reply to: Neal Becker: "Somthing horribly wrong"

Neal Becker wrote:
> ls -l context*
> -rw------- 1 nbecker nbecker 42028508638 2007-07-30 09:43 context.nbecker.26066
> -rw------- 1 nbecker nbecker 16192372703 2007-07-30 10:28 context.nbecker.26066.new
>
> That's some file!
>   

Neal,

  That certainly looks like something is wrong.  If these file sizes 
(40GB and 16GB) are larger than your disk could possibly hold, then I 
strongly recommend running fsck to detect/repair any fs corruption.  I 
am going to assume, however, that the files really *are* this big.

I honestly have never seen anything like this from use of BLCR.  
However, I do have some thoughts:

1) The largest is 40GB which is not impossible, but probably indicates 
something fishy.  My guess here is that we've run afoul of a new feature 
in BLCR: it now saves (in the context file) the data 
contained in any files that are open but deleted (unlinked) at the time 
of the checkpoint (doing so is needed to allow checkpoint/restart of 
shells and other interpreters when running "here documents").  The 
example code you sent earlier didn't include calls to close(newfd), but 
does (via use of rename()) unlink the old context file.  So, it is 
possible that as you take multiple checkpoints you are including the 
past context files inside the new context file, leading to exponential 
growth in the file size!  The first thing I'd try would be adding 
close(newfd) after the rename() call.
2) The cr_request_checkpoint()-based example you sent earlier was, I 
think, missing a call to cr_initialize_checkpoint_args_t(). Perhaps some 
uninitialized bit there is responsible?
3) BLCR uses just the vfs_write() kernel routine to write to the context 
file.  So, there are no "holes".  If running "du -b context*" doesn't 
return results matching those of ls, then there may be filesystem 
corruption.  If so, you should fsck and retry your test(s).
4) Maybe add O_LARGEFILE to the flags when you open()/creat() the context 
file that you pass to cr_request_checkpoint()?
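
To make (1) and (4) concrete, here is a rough sketch of the open/checkpoint/rename/close sequence I have in mind.  It is not your code and not BLCR code: checkpoint_and_rotate() and write_demo_payload() are hypothetical names, and the callback only stands in for the real cr_request_checkpoint()-based request (preceded by cr_initialize_checkpoint_args_t()), which I have assumed completes before the callback returns.

```c
#define _GNU_SOURCE             /* exposes O_LARGEFILE on Linux/glibc */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#ifndef O_LARGEFILE
#define O_LARGEFILE 0           /* assumption: no-op where it is not defined */
#endif

/* Hypothetical stand-in for the real checkpoint request: it receives the
 * descriptor of the new context file and returns 0 on success. */
typedef int (*checkpoint_fn)(int fd);

/* Demo callback used only for illustration; real code would issue the
 * cr_request_checkpoint()-based request from the thread here. */
static int write_demo_payload(int fd)
{
    static const char payload[] = "checkpoint data\n";
    ssize_t n = write(fd, payload, sizeof payload - 1);
    return n == (ssize_t)(sizeof payload - 1) ? 0 : -1;
}

/* Write a checkpoint to "<path>.new", rename it over <path>, and always
 * close the descriptor so that no open-but-unlinked context file is still
 * held by the process when the next checkpoint is taken.
 * Returns 0 on success, -1 on failure. */
static int checkpoint_and_rotate(const char *path, checkpoint_fn take_checkpoint)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.new", path);
    if (n < 0 || n >= (int)sizeof tmp)
        return -1;

    int newfd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC | O_LARGEFILE, 0600);
    if (newfd < 0)
        return -1;

    int rc = take_checkpoint(newfd);

    /* rename() unlinks the previous context file, if there was one. */
    if (rc == 0 && rename(tmp, path) != 0)
        rc = -1;

    /* The step missing from the earlier example: close on every path. */
    if (close(newfd) != 0)
        rc = -1;

    if (rc != 0)
        unlink(tmp);            /* best-effort cleanup of a partial file */
    return rc;
}
```

Calling checkpoint_and_rotate("context.nbecker", write_demo_payload) in a loop should leave a single context file and no extra open descriptors behind, which is the property I would check for first in the real program.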

If none of this helps, please create a bug report 
(http://mantis.lbl.gov/bugzilla) and if possible attach the code that 
created these files (note that you have to create the bug report first 
and then add attachments after).

-Paul

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
