Re: Announcing the release of BLCR 0.6.0

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Mar 10 2008 - 11:20:33 PST

    Ruini,
    
      For files that are mmap()ed and are still present in the filesystem
    (not unlinked), the mapping is saved "by reference" - BLCR will simply
    save the filename and the mmap() flags and re-mmap() the same file at
    restart time.  This is pretty much the same thing we do for open()
    files, and makes the same assumption that at restart time the file is
    still present (and unchanged since checkpoint).  It is certainly
    possible that this expectation will not be met, which could break some
    applications.  In the 0.7.0 release (expected Mar or Apr), there will be
    options at checkpoint time to cause all mmap()ed files to be saved "by
    value" (described below).
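
    The "by reference" idea above can be sketched as follows.  This is only
    an illustration, not BLCR code: MappingRecord, remap_by_reference() and
    demo() are hypothetical names, and the record layout is an assumption.
    The demo shows why the "unchanged since checkpoint" assumption matters:
    the re-created mapping reflects whatever the file contains at restart.

```python
import mmap
import os
import tempfile
from dataclasses import dataclass


@dataclass
class MappingRecord:
    """Hypothetical "by reference" record; not BLCR's actual on-disk format."""
    path: str
    length: int
    flags: int   # e.g. mmap.MAP_SHARED or mmap.MAP_PRIVATE
    prot: int    # e.g. mmap.PROT_READ | mmap.PROT_WRITE
    offset: int = 0


def remap_by_reference(rec: MappingRecord) -> mmap.mmap:
    """Re-create the mapping from the record alone (no file data was saved)."""
    mode = os.O_RDWR if rec.prot & mmap.PROT_WRITE else os.O_RDONLY
    fd = os.open(rec.path, mode)  # raises FileNotFoundError if the file is gone
    try:
        return mmap.mmap(fd, rec.length, flags=rec.flags, prot=rec.prot,
                         offset=rec.offset)
    finally:
        os.close(fd)  # the mapping remains valid after the descriptor is closed


def demo() -> bytes:
    """Record a mapping, change the file afterwards, then re-map it."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"state at checkpoint")
        rec = MappingRecord(path, 19, mmap.MAP_SHARED, mmap.PROT_READ)
        os.pwrite(fd, b"STATE", 0)  # file modified after the record was taken
        restored = remap_by_reference(rec)
        try:
            return restored[:]      # reflects the file as it is now
        finally:
            restored.close()
    finally:
        os.close(fd)
        os.unlink(path)
```

    Calling demo() returns the modified contents rather than the contents
    at the time the record was taken, and remap_by_reference() raises if
    the file has been removed in the meantime.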
      For files that are unlinked (no longer in the filesystem), the save is
    "by value".  Since there is no filesystem object to name, the actual
    data in the mmap()ed file will be copied into the checkpoint context
    file.  At restart time a new file is created with the original data,
    mmap()ed, and then unlinked (to match the original circumstances).
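
    A rough sketch of the "by value" sequence described above (create a
    file from the saved data, mmap() it, then unlink it).  The function
    names are hypothetical, and placing the replacement file in the
    default temporary directory is an assumption of this sketch, not a
    statement about where BLCR creates it.

```python
import mmap
import os
import tempfile


def save_by_value(mapping: mmap.mmap) -> bytes:
    """Copy the mapped bytes themselves, since there is no filename to record."""
    return mapping[:]


def restore_by_value(data: bytes) -> tuple[mmap.mmap, str]:
    """Create a file holding `data`, map it, then unlink it.

    Assumes `data` is non-empty (an empty file cannot be mapped).
    Returns the mapping and the (already removed) path, for inspection only.
    """
    fd, path = tempfile.mkstemp()  # location is an assumption of this sketch
    try:
        remaining = memoryview(data)
        while remaining:           # os.write may write fewer bytes than asked
            remaining = remaining[os.write(fd, remaining):]
        mapping = mmap.mmap(fd, len(data), flags=mmap.MAP_SHARED,
                            prot=mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        os.close(fd)
        os.unlink(path)            # the mapping outlives the directory entry
    return mapping, path
```

    The mapping stays usable after the unlink because the kernel keeps the
    file's data alive for as long as it is mapped or open.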
      Prior to BLCR 0.6.x, the handling of unlinked mmap()ed files was such
    that each process that had a file mmap()ed would independently create
    its own replacement mapping, with the result that unlinked files
    mmap()ed with MAP_SHARED would no longer be shared after a restart (in
    addition to the space waste of having multiple copies of the same data
    in the context file, and in the multiple replacements at restart).
    Starting with BLCR 0.6.0, however, there is coordination among the
    members of a multiple-process checkpoint to ensure that exactly one copy
    of the data is saved, and that a single replacement file is created and
    mmap()ed by all of the original processes.
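
    The difference between the two behaviours can be demonstrated with a
    small experiment.  This is a simulation under stated assumptions, not
    BLCR's restart logic: all names are hypothetical, and a forked child
    stands in for a second restarted process.  With "coordinated" set,
    both processes map the single replacement file; otherwise the child
    builds its own copy, as in the pre-0.6.x behaviour.

```python
import mmap
import os
import tempfile


def make_replacement(data: bytes) -> str:
    """Write `data` to a new temporary file and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


def map_shared(path: str, length: int) -> mmap.mmap:
    """Map `path` read/write with MAP_SHARED."""
    fd = os.open(path, os.O_RDWR)
    try:
        return mmap.mmap(fd, length, flags=mmap.MAP_SHARED,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        os.close(fd)


def child_write_reaches_parent(coordinated: bool) -> bool:
    """Return True if a write made by the child is visible to the parent."""
    data = b"P" * 16
    shared_path = make_replacement(data)
    parent_map = map_shared(shared_path, len(data))
    pid = os.fork()
    if pid == 0:                      # child: the "second restarted process"
        status = 1
        try:
            path = shared_path if coordinated else make_replacement(data)
            child_map = map_shared(path, len(data))
            child_map[0:1] = b"C"
            child_map.close()
            if not coordinated:
                os.unlink(path)       # remove the child's private copy
            status = 0
        finally:
            os._exit(status)
    os.waitpid(pid, 0)
    os.unlink(shared_path)            # match the original "unlinked" state
    reached = parent_map[0:1] == b"C"
    parent_map.close()
    return reached
```

    With coordinated=True the parent observes the child's write; with
    coordinated=False it does not, which is the loss of sharing described
    above.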
    
    Let us know if you would like any additional information.
    
    -Paul
    
    Ruini Xue wrote:
    > Hello,
    > 
    > Can anyone explain how BLCR handles mmap()ed files? Or is there any document to refer to?
    > 
    > Best
    > 
    > Andrew
    > 
    > On Tue, Sep 11, 2007 at 6:16 AM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov
    > <mailto:PHHargrove_at_lbl_dot_gov>> wrote:
    > 
    >     After several weeks of betas, I've finally released BLCR 0.6.0.  It can
    >     be found at the BLCR Downloads page:
    >     http://ftg.lbl.gov/CheckpointRestart/CheckpointDownloads.shtml
    > 
    >     This version
    >      + adds support for checkpoint/restart of
    >        - memory shared via mmap(MAP_SHARED)
    >        - open unlinked files
    >        - pending signals
    >      + extends the range of supported kernels
    >      + greatly expands the test suite
    >      + fixes numerous bugs
    >      + New /experimental/ features include support for
    >        - PPC64 and ARM platforms
    >        - cross-compilation
    >     At the end of this message, I've included the full NEWS entry, relative
    >     to July's 0.5.6 release.
    > 
    >     Before reporting bugs, please read the (updated) FAQ to see if you have
    >     a known problem.
    > 
    >     Many thanks to the dedicated beta testers who identified many bugs I did
    >     not or could not reproduce on my own test platforms.  Their testing
    >     efforts have ensured a much more stable/usable 0.6.0 release than would
    >     otherwise have been possible.
    > 
    >     -Paul
    > 
    >     PS
    >     You are receiving this either because you are on the
    >     checkpoint_at_lbl_dot_gov <mailto:checkpoint_at_lbl_dot_gov>
    >     list, or because you've recently sent email to the list (or me directly)
    >     asking about BLCR status.
    > 
    > 
    >     NEWS excerpts:
    > 
    >     0.6.0
    >     --------
    >     September 10, 2007
    >     Functionality and expanded-support release.
    >      - This release adds support for 2.6.22 kernels.
    >      - This release includes experimental support for PPC64 platforms
    >         + PPC64 supports both 32- and 64-bit applications.
    >         + No support for 32-bit kernels.
    >           Contact us if you would like to help w/ a PPC32 port.
    >         + No support for 2.4.x kernels
    >         + Tested with NPTL and kernels 2.6.12 (Gentoo) and 2.6.18 (FC6)
    >         + There are known problems with BLCR with LinuxThreads on PPC64
    >      - This release includes experimental support for ARM platforms.
    >         + Tested only for 2.6.12 and newer kernels
    >         + Thanks to Anton V. Uzunov
    >           <anton.uzunov_at_dsto_dot_defence_dot_gov.au>
    >           of the Australian Government Department of Defence, Defence
    >           Science and Technology Organisation for contributing this port.
    >         + ARM-specific questions should be directed to
    >           blcr-arm_at_hpcrd_dot_lbl_dot_gov
    >      - This release includes experimental support for cross-compilation.
    >         + See config/cross_helper.c for information on cross-compilation.
    >         + This has been tested only in the context of the ARM port
    >         + Thanks to Anton V. Uzunov for motivating and testing this work
    >      - This release includes a new API for issuing a checkpoint request.
    >         + Allows a program to request a checkpoint without the need to
    >           invoke system("cr_checkpoint ...").
    >         + See comments in include/libcr.h for information on the following:
    >           cr_initialize_checkpoint_args_t()
    >           cr_request_checkpoint()
    >           cr_poll_checkpoint()
    >      - This release adds a mechanism (CR_CHECKPOINT_OMIT) for processes to
    >        exclude themselves from a checkpoint (useful for batch-system
    >        helper or shepherd processes).
    >      - This release makes cr_checkpoint and cr_restart utilities
    >        checkpointable
    >      - This release adds full support for mmap()-based shared memory
    >         + Repairs the loss of sharing that existed in 0.5.x releases
    >         + Supports hugetlbfs
    >      - This release adds full support for save/restore of pending signals.
    >      - Default scope of cr_checkpoint is now --tree, rather than --pid.
    >      - Now checkpoint/restart unlinked open files.
    >      - Revised handling of certain file-descriptors at restart:
    >        + No longer override "normal" files with correspondingly-numbered fds
    >          from cr_restart as that consistently breaks shell "here documents".
    >        + Restore pipe endpoints that lie outside the checkpoint scope by
    >          attaching them to stdin or stdout of cr_restart, rather than to its
    >          correspondingly-numbered fds.
    >        + Opens of a process's controlling tty are attached to "/dev/tty" at
    >          restart, even if they were open by their "exact" name at checkpoint
    >          time (e.g.  "/dev/pts/0").
    >      - Experimental support for relocatable kernels on x86 and x86-64
    >      - Expanded test-suite
    >      - Option to install the testsuite (--enable-testsuite)
    >      - Support "install-strip", "install-exec" and "install-data" make
    >        targets
    >      - Tested against many scripting and programming language environments:
    >        + shells:  ash, bash, (t)csh, (pd)ksh and zsh
    >        + scripting-type languages: perl, python, tcl/expect, ruby and guile
    >        + java runtime environments: Sun, IBM and GNU
    >        + misc. language runtimes: php, rep, clisp, emacslisp, gst, ocaml
    >          and sml
    >        + Run "make bonus-tests" to run these tests on your own machine,
    >          but be warned that the tests themselves are fragile (contain
    >          races) and may experience random failures.  However, please do
    >          report any tests that fail consistently.
    >      - Many minor bug fixes and code cleanups
    > 
    >     July, 2007 - DEPRECATED support for LinuxThreads and for Linux 2.4.X
    >     kernels
    >      - Starting with the 0.6.0 release, new bug reports that one cannot
    >        reproduce under NPTL + Linux 2.6.x will receive little or none of
    >        our attention.  However, we will try to distribute
    >        user-contributed fixes for such bugs.
    >        Note that the 0.6.0 release *does* pass the BLCR test-suite under
    >        LinuxThreads and/or 2.4.x kernels on the developers' x86 systems.
    >        However, we have seen test failures on PPC64 when running
    >        LinuxThreads with a 2.6.12 kernel (Gentoo distro).
    >      - Beginning with the next "full" release (0.7.0) we will begin to
    >        remove code in BLCR that exists only to support LinuxThreads
    >        and/or Linux 2.4.x.
    >      - We have not yet decided the fate of support for those 2.4.x
    >        kernels which include Red Hat's backport of NPTL support (RHL9.0,
    >        RHEL, RHAS, etc.).
    >      - If anybody cares enough about 2.4.x and/or LinuxThreads to
    >        volunteer to take over testing and maintenance of BLCR on such
    >        platforms, let us know.
    > 
    > 
    >     --
    >     Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >     <mailto:PHHargrove_at_lbl_dot_gov>
    >     Future Technologies Group
    >     HPC Research Department                   Tel: +1-510-495-2352
    >     Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    > 
    > 
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
