From: Yuan Tang (supertangcc_at_yahoo_dot_com)
Date: Tue Dec 18 2007 - 05:19:10 PST
Hi Paul,

Thank you for the work. I downloaded the beta version, installed it, and tested it. The SIGSTOPed process can now pass the whole checkpoint procedure. Congratulations!

However, when I tried restarting the previously checkpointed SIGSTOPed process from its disk image, the restart procedure never completed. It blocks in cr_rstrt_req.c:cr_rstrt_child(). My guess is that if you move the send_sig_info(SIGSTOP, NULL, task) call to cr_rstrt_task_complete(), the whole procedure will complete normally.

Hope it helps.

Best wishes!
Yuan Tang

----- Original Message ----
From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
To: checkpoint_at_lbl_dot_gov
Sent: Tuesday, December 18, 2007 4:23:30 AM
Subject: BLCR 0.6.2 beta1 now available

The first beta of BLCR 0.6.2 is now available at
    http://mantis.lbl.gov/blcr-dist/
Both source tarball and SRPM are available.
The filenames and MD5 checksums are:
    93249f20abd4eeec7a07db2f2a6cd2b2  blcr-0.6.2_b1.tar.gz
    e8ecba22c98de143ced20f83db76d8a1  blcr-0.6.2_b1-1.src.rpm

This is a beta of a 0.6.2 patch release. The intent of 0.6.2 is to fix a small number of significant bugs found in 0.6.0 and 0.6.1 and to add support for 2.6.23 kernels and some vendor-patched 2.6.22 kernels. A NEWS entry summarizing these changes appears below.

You are receiving this e-mail either because you are subscribed to the checkpoint_at_lbl_dot_gov mailing list or because you have reported one of the bugs or previously unsupported kernel versions addressed by this release. I apologize if you receive multiple copies.

I would greatly appreciate any feedback (positive or negative) indicating if this beta fixes any problems you have reported with BLCR 0.6.0 and/or 0.6.1. Only after I have sufficient positive feedback will I make 0.6.2 available for download from the main BLCR web pages.

-Paul

0.6.2_b1
--------
December 17, 2007
Bug-fix and expanded-support release.
- This release adds support for 2.6.23 kernels.
- This release adds support for SuSE's 2.6.22.x kernels.
- This release fixes a file descriptor leak that occurred on restart from a checkpoint-of-self requested via cr_request_checkpoint().
- This release fixes a deadlock (and unkillable process(es)) when a multi-threaded process aborts (or omits itself from) a checkpoint under certain conditions.
- This release fixes a restart-time failure when a checkpoint includes a pipe with one end outside the checkpoint scope, and data is buffered in the pipe.
- This release fixes a bug with the cr_request{,_file}() calls in which a failed checkpoint would cause failure of the next one if it had the same destination file name.
- This release fixes a race condition with the cr_enter_cs() and checkpoints in multi-threaded processes.
- This release fixes post-checkpoint signal delivery (--stop and friends) to occur after the checkpoint is fully completed. See bug 2201 for a full description of the problems addressed by these changes.
- This release documents (and fully implements) signal-delivery options to cr_restart (see bug 2200).
- Adds test cases for most of the bugs fixed in this release.

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
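
For anyone else testing the beta: the downloads can be checked against the MD5 sums listed in the announcement before building. What follows is only a minimal Python sketch, assuming the files were saved under their original names in the current directory. The helper names md5_of_file and matches_published are made up for illustration and are not part of BLCR.

```python
import hashlib
import os

# Checksums copied from the 0.6.2_b1 announcement quoted above.
PUBLISHED_MD5 = {
    "blcr-0.6.2_b1.tar.gz": "93249f20abd4eeec7a07db2f2a6cd2b2",
    "blcr-0.6.2_b1-1.src.rpm": "e8ecba22c98de143ced20f83db76d8a1",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path):
    """Compare a downloaded file against the checksum listed for its name.

    Hypothetical helper: it raises KeyError for a file name that is not
    in PUBLISHED_MD5.
    """
    expected = PUBLISHED_MD5[os.path.basename(path)]
    return md5_of_file(path) == expected


if __name__ == "__main__":
    for name in PUBLISHED_MD5:
        if os.path.exists(name):
            print(name, "OK" if matches_published(name) else "MISMATCH")
        else:
            print(name, "not found, skipped")
```

A match only shows that the download was not corrupted in transit. MD5 is not collision-resistant, so it should not be treated as proof that a file is authentic.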