Re: Something horribly wrong

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Aug 01 2007 - 14:41:28 PDT

    Neal Becker wrote:
    > ls -l context*
    > -rw------- 1 nbecker nbecker 42028508638 2007-07-30 09:43 context.nbecker.26066
    > -rw------- 1 nbecker nbecker 16192372703 2007-07-30 10:28 context.nbecker.26066.new
    >
    > That's some file!
    >   
    
    Neal,
    
      That certainly looks like something is wrong.  If these file sizes 
    (40GB and 16GB) are larger than your disk could possibly hold, then I 
    strongly recommend running fsck to detect/repair any fs corruption.  I 
    am going to assume, however, that the files really *are* this big.
    
    I honestly have never seen anything like this from use of BLCR.  
    However, I do have some thoughts:
    
    1) The largest file is 40GB, which is not impossible, but probably 
    indicates something fishy.  My guess here is that we've run afoul of a 
    new feature in BLCR.  Namely, BLCR now saves (in the context file) the data 
    contained in any files that are open but deleted (unlinked) at the time 
    of the checkpoint (doing so is needed to allow checkpoint/restart of 
    shells and other interpreters when running "here documents").  The 
    example code you sent earlier didn't include calls to close(newfd), but 
    does (via use of rename()) unlink the old context file.  So, it is 
    possible that as you take multiple checkpoints you are including the 
    past context files inside the new context file, leading to exponential 
    growth in the file size!  The first thing I'd try would be adding 
    close(newfd) after the rename() call.
    2) The cr_request_checkpoint()-based example you sent earlier was, I 
    think, missing a call to cr_initialize_checkpoint_args_t(). Perhaps some 
    uninitialized bit there is responsible?
    3) BLCR uses just the vfs_write() kernel routine to write to the context 
    file.  So, there are no "holes".  If running "du -b context*" doesn't 
    return results matching those of ls, then there may be filesystem 
    corruption.  If so, you should fsck and retry your test(s).
    4) Maybe add O_LARGEFILE to the flags when you open/creat the context 
    file that you pass to cr_request_checkpoint()?
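
To illustrate point 1, here is a minimal sketch of how a descriptor left open across rename() ends up referring to an open-but-unlinked file. This is not the code Neal sent and it does not call BLCR at all: the helper names and file names are hypothetical, and a plain write() stands in for the actual checkpoint request.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for one checkpoint: create tmp_path, write a few
 * bytes (where real code would request a checkpoint into fd), then rename
 * it over final_path.  Returns the still-open fd, or -1 on error. */
static int fake_checkpoint(const char *tmp_path, const char *final_path)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, "data", 4) != 4 || rename(tmp_path, final_path) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    return fd;
}

/* Link count of the file behind fd; 0 means "open but unlinked". */
static long link_count(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 ? (long)st.st_nlink : -1L;
}

/* Take two fake checkpoints in dir.  If close_after_rename is zero, the
 * first descriptor stays open across the second rename() (the suspected
 * leaky pattern).  Returns the first file's link count after the second
 * rename(), -2 if the first descriptor was closed in time, -1 on error. */
static long probe_old_context(const char *dir, int close_after_rename)
{
    char tmp[512], final[512];
    long links = -2;
    int fd1, fd2;

    snprintf(tmp, sizeof tmp, "%s/context.%ld.new", dir, (long)getpid());
    snprintf(final, sizeof final, "%s/context.%ld", dir, (long)getpid());

    fd1 = fake_checkpoint(tmp, final);
    if (fd1 < 0)
        return -1;
    if (close_after_rename) {
        close(fd1);              /* the suggested fix */
        fd1 = -1;
    }

    fd2 = fake_checkpoint(tmp, final);   /* rename() unlinks the first file */
    if (fd2 < 0)
        links = -1;
    else if (fd1 >= 0)
        links = link_count(fd1);         /* old file: still open, no name */

    if (fd1 >= 0)
        close(fd1);
    if (fd2 >= 0)
        close(fd2);
    unlink(final);
    return links;
}
```

On a local Linux filesystem I would expect probe_old_context(dir, 0) to report a link count of 0 for the first file, which is exactly the open-but-deleted state described above, while probe_old_context(dir, 1) never gets into that state because the descriptor is closed right after rename().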
    
    
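For point 3, one way to check for holes from a program is to compare the apparent size of a file with the space actually allocated to it, using stat().  The sketch below is only an illustration: the helper names are hypothetical, and it assumes st_blocks is counted in 512-byte units, as it is on Linux.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Report the apparent size (what ls -l shows) and the allocated size
 * (space actually backing the file).  Assumes 512-byte st_blocks units,
 * as on Linux.  Returns 0 on success, -1 on error. */
static int file_sizes(const char *path, long long *apparent,
                      long long *allocated)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    *apparent = (long long)st.st_size;
    *allocated = (long long)st.st_blocks * 512;
    return 0;
}

/* Create a file of the given size that consists of one big hole, for
 * comparison with a file that was written densely. */
static int make_sparse_file(const char *path, long long size)
{
    int rc;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    rc = ftruncate(fd, (off_t)size);   /* extends the file without writing */
    close(fd);
    return rc;
}
```

A file with no holes should show an allocated size at least roughly as large as its apparent size, whereas a file produced by make_sparse_file() will typically show far less allocated space than its apparent size on filesystems that support sparse files.
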
    If none of this helps, please create a bug report 
    (http://mantis.lbl.gov/bugzilla) and if possible attach the code that 
    created these files (note that you have to create the bug report first 
    and then add attachments after).
    
    -Paul
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
