Re: Open Files

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Sep 06 2005 - 11:22:25 PDT

    To add a little to Jason's response:
    1) We don't do anything with file locks at the moment.
    2) We don't "recover" the file contents in general.  If the file was
    being written append-only (either due to flags at open, or just by usage
    pattern), then on restart we truncate the file back to the
    length it had when the checkpoint was taken, effectively
    restoring it to the state it had at that time.  However, if the
    program seeks between writes, or if another program modifies the file,
    then we don't (yet) try to roll back the writes that took place between
    the checkpoint and the restart.
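
To make the append-only case in point 2 concrete, here is a rough user-space sketch in Python. It is not BLCR code; the function names (record_length, restore_append_only, demo_truncate) are made up for illustration, and it assumes the only change since the checkpoint was data appended to the end of the file.

```python
import os
import tempfile

def record_length(path):
    """Return the file's length in bytes, as one might note it at checkpoint time."""
    return os.stat(path).st_size

def restore_append_only(path, saved_length):
    """Truncate the file back to the length recorded at checkpoint time.

    This only undoes appends. Bytes overwritten below saved_length
    (after a seek, or by another program) are left as they are.
    """
    if os.stat(path).st_size > saved_length:
        os.truncate(path, saved_length)

def demo_truncate():
    # Hypothetical scenario: append, "checkpoint", append again, "restart".
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "ab") as f:
            f.write(b"before checkpoint\n")
        saved = record_length(path)          # taken at "checkpoint"
        with open(path, "ab") as f:
            f.write(b"after checkpoint\n")   # appended after the checkpoint
        restore_append_only(path, saved)     # done at "restart"
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)
```

After the restore step only the bytes written before the checkpoint remain. A write that landed in the middle of the file would survive, which is the limitation described above.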
    JCDuell_at_lbl_dot_gov wrote:
    >On Tue, Sep 06, 2005 at 01:35:16PM +0300, Emmanuel Grumbach wrote:
    >>I have read the pages on Checkpoint. It seems very interesting but there
    >>is an info I could not get. Does BLCR support open files ? In other words,
    >>if my application has opened a file for reading/writing (writing with lock
    >>seems more fun) and I checkpoint it, supposing the file still exists
    >>(logically (path) or on the same inodes), will BLCR be able to open it
    >>again ?
    >Yes, we handle the general case of an application with open files.  If
    >the file exists with the same *logical* pathname (the inode number does
    >not need to be the same), the file will be reopened, and seeked to the
    >same position as it was at checkpoint time.  This means that if you have
    >a global filesystem, you will be able to restart a program on a
    >different node in a cluster, so long as all the files the program needs
    >to restart (including shared libraries and the executable's program
    >text, etc.) are in the same logical place in the file system.
    >Note that we do not handle certain types of files (TCP or Unix domain
    >sockets, for instance).
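
The reopen-and-seek behaviour Jason describes can be sketched the same way. Again this is only an illustration in Python with made-up names (checkpoint_file, restart_file, demo_reopen), not how BLCR does it internally: it records the pathname, open mode, and offset, then reopens by pathname, so a different inode at the same path is fine.

```python
import os
import tempfile

def checkpoint_file(f):
    """Record what is needed to reopen f later: path, mode, and current offset."""
    return {"path": f.name, "mode": f.mode, "offset": f.tell()}

def restart_file(state):
    """Reopen the file by its logical pathname and seek to the saved offset.

    Note: reopening with a truncating mode such as "wb" would destroy the
    contents; this sketch is only exercised with a read mode.
    """
    f = open(state["path"], state["mode"])
    f.seek(state["offset"])
    return f

def demo_reopen():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as f:
            f.write(b"0123456789")
        f = open(path, "rb")
        f.read(4)                       # advance to offset 4
        state = checkpoint_file(f)
        f.close()
        # Recreate the file at the same path; the inode may now differ.
        os.remove(path)
        with open(path, "wb") as g:
            g.write(b"0123456789")
        f2 = restart_file(state)
        try:
            return f2.read()            # continues from the saved offset
        finally:
            f2.close()
    finally:
        os.remove(path)
```

Because the lookup is by pathname only, the same idea carries over to restarting on another node, provided the file is visible at the same path there.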
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
