From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Jan 15 2009 - 10:08:56 PST
I am pleased to announce the release of BLCR 0.8.0.

The 0.8.0 release is now available from the BLCR Downloads page:
  http://ftg.lbl.gov/CheckpointRestart/CheckpointDownloads.shtml

Relative to the 0.7.x series, this release includes some new features
and some improvements in stability. This release also contains support
for newer Linux kernels.

A summary of the user-visible changes in BLCR, relative to 0.7.3,
appears below in the form of an excerpt from the NEWS file.

-Paul

PS You are receiving this either because you are on the
checkpoint_at_lbl_dot_gov list, because you've recently sent email to
the list (or me directly) asking about BLCR status, or because our
Bugzilla shows your interests in a bug fixed in this beta.

NEWS:

0.8.0
-----------
January 12, 2009
Enhanced functionality and expanded-support release.
- This release adds support for 2.6.26, .27 and .28 kernels.
- In this release support for Xen is no longer considered experimental.
  However, there is still one known xen-specific bug (2457) in which
  the FPU state may become corrupted w/ paravirtualized kernels.
- In this release the majority of checkpoint I/O is performed using
  O_DIRECT when available, significantly reducing the cost of
  checkpointing any process which uses a large fraction of the
  physical memory.
- This release includes an unfinished port to SPARC64, contributed by
  Vincentius Robby <vincentius_at_umich_dot_edu> and
  Andrea Pellegrini <apellegr_at_umich_dot_edu>.
  Anyone willing/able to help complete this port should contact
  checkpoint_at_lbl_dot_gov.
- As previously announced, this release removes support for 2.4.x
  kernels that contain backported NPTL support (e.g. RH9 and RHEL
  kernels). Support for all other 2.4.x kernels was removed in 0.7.0.
- This release merges the blcr_vmadump kernel module into the blcr
  module.
- This release adds preliminary support for the "Fault Tolerance
  Backplane" (FTB). See README.FTB for more information.
- This release adds the following features to the cr_checkpoint utility:
  + --kmsg-{none,error,warning} options to control reporting of
    kernel-level errors and warnings messages when taking a checkpoint.
- This release adds the following features to the cr_restart utility:
  + --kmsg-{none,error,warning} options to control reporting of
    kernel-level errors and warnings messages when restarting from a
    checkpoint.
  + --[no-]restore-{pid,pgid,sid} options to control restore of the
    process id, process group id, and session id. The default remains
    as in prior releases: restore only pid.
- This release makes the following libcr API additions/changes:
  + The following functions were announced in May 2008 as scheduled for
    removal in 0.8.0. They have not been removed, but have been marked
    with gcc's "deprecated" attribute to produce a compiler warning if
    used.
    * cr_request()
    * cr_request_file()
    * cr_request_fd()
  + These functions have been added for controlling checkpoint requests:
    * cr_wait_checkpoint()
    * cr_reap_checkpoint()
    * cr_log_checkpoint()
    * cr_poll_checkpoint_msg()
    The wait and reap functions expose independently the two steps
    taken in the existing cr_poll_checkpoint() function. The log
    function collects kernel-level error or warning messages if called
    between wait and reap. The poll...msg() function is a convenience
    function, documented and implemented in terms of the wait, log and
    reap functions. The cr_poll_checkpoint() function will remain in
    libcr, but is now documented and implemented in terms of
    cr_poll_checkpoint_msg().
  + A new CR_CHKPT_ASYNC_ERR flag to cr_request_checkpoint() defers the
    reporting of almost all errors in a call to cr_request_checkpoint()
    until the call to cr_reap_checkpoint() or
    cr_poll_checkpoint[_msg]().
  + The following functions have been added for making restart requests
    via library calls, rather than using the cr_restart utility. These
    are all marked "EXPERIMENTAL" as there might be significant changes
    to these calls in the future.
    * cr_initialize_restart_args_t()
    * cr_request_restart()
    * cr_wait_restart()
    * cr_reap_restart()
    * cr_log_restart()
    * cr_poll_restart_msg()
    * cr_poll_restart()
  + The struct members "old" and "new" in struct cr_rstrt_relocate_pair
    have been renamed to "oldpath" and "newpath". This change was
    required because "new" is a C++ reserved word.
  See the comments in include/libcr.h for API documentation.
- This release makes the following additions/changes to the BLCR test
  suite:
  + Add tests of many of the features new to this release
  + Add new tests, or cases to existing tests, for reproducing several
    of the bugs fixed in this release.
  + Fix command lines used in several tests to function correctly when
    "POSIXLY_CORRECT" is set in the environment
  + Recode crut_wrapper and seq_wrapper in C, rather than perl, to
    allow running the full testsuite in environments without perl
    (such as embedded ARM platforms).
- This release fixes the following user-visible bugs and "issues"
  + 2021 - Provide extended error reporting mechanism
  + 2056 - Eliminate perl wrappers
  + 2287/2437 - Xen segment selector problems
  + 2292 - --restore-ids does not work correctly for multithreaded
    processes.
  + 2317 - implement "async" request errors
  + 2318 - checkpoint hangs after SEGV
  + 2322/2446 - Failure when stack limit is too big
  + 2344 - bad cr_restart usage causes kernel oops
  + 2453 - loss of sigaltstack across restart
  + 2454 - Oops in FPU restore
  + Address bug 2448 - there may have been a race with cr_close_other()
  + Fix ENOMEM when checkpointing processes with no supplementary
    group IDs
  + i386 FPU restore code would fail to notice corrupt i387 state
  + Fix a bug in the ARM atomics
  + Fix several issues with restart of 64-bit processes with a 32-bit
    requester, as exposed by the addition of cr_request_restart() to
    libcr.

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory