Re: BLCR 0.6.2 beta1 now available

From: Yuan Tang (supertangcc_at_yahoo_dot_com)
Date: Tue Dec 18 2007 - 18:43:29 PST

  • Next message: Paul H. Hargrove: "Re: BLCR 0.6.2 beta1 now available"
    Sure! I mean that the RESTART procedure never completes; that is, it never returns. It blocks in the function cr_rstrt_child. In my program, a background daemon process checkpoints a SIGSTOPed process and later restarts it. After the restart, the daemon is supposed to send SIGCONT to the restarted SIGSTOPed process to let it continue. However, the restart procedure never completes, so the background daemon never gets the chance to send the SIGCONT. If I send the SIGCONT from the console, it just interrupts the wait_event_interruptible() call in the loop.
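For readers following the signal mechanics described above, here is a minimal Python sketch of a parent process stopping a child with SIGSTOP and later resuming it with SIGCONT. It is not BLCR code and involves no checkpoint or restart; the function name `stop_and_continue` is hypothetical, and the `waitpid` approach only works for direct children of the caller.

```python
import os
import signal


def stop_and_continue():
    """Fork a child, stop it with SIGSTOP, resume it with SIGCONT.

    Returns the list of state changes the parent observed.
    """
    events = []
    pid = os.fork()
    if pid == 0:
        # Child: sleep until a signal terminates us.
        while True:
            signal.pause()

    # Parent: stop the child and wait until the kernel reports the stop.
    os.kill(pid, signal.SIGSTOP)
    _, status = os.waitpid(pid, os.WUNTRACED)
    if os.WIFSTOPPED(status):
        events.append("stopped")

    # Resume the child and wait for the "continued" notification.
    os.kill(pid, signal.SIGCONT)
    _, status = os.waitpid(pid, os.WCONTINUED)
    if os.WIFCONTINUED(status):
        events.append("continued")

    # Clean up: kill the child and reap it.
    os.kill(pid, signal.SIGKILL)
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        events.append("killed")
    return events
```

The daemon in the scenario above is in the same position as the parent here, except that it cannot reach the SIGCONT step while the restart request is still blocked.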
    ----- Original Message ----
    From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
    To: Yuan Tang <supertangcc_at_yahoo_dot_com>
    Cc: checkpoint_at_lbl_dot_gov
    Sent: Wednesday, December 19, 2007 5:19:58 AM
    Subject: Re: BLCR 0.6.2 beta1 now available
    Yuan Tang wrote:
    > Hi Paul,
    > Thank you for the work. I downloaded the beta version, installed it 
    > and tested it. The SIGSTOPed process could pass the whole checkpoint 
    > procedure now. Congratulations! However, when I tried restarting the 
    > previously checkpointed SIGSTOPed process from its disk image, the 
    > RESTART procedure never completed. It blocks in 
    > cr_rstrt_req.c:cr_rstrt_child(). I guess, if you move the 
    > send_sig_info(SIGSTOP, NULL, task) stuff to cr_rstrt_task_complete(),
    > the whole procedure will complete normally. Hope it helps.
    I believe the current behavior is correct (or at least is what I've 
    intended).  The process that was SIGSTOPed when the checkpoint was 
    requested is again SIGSTOPed when restarted.  To get it running again 
    you should be able to either send it a SIGCONT (which is tricky because
    you might not know how soon to send it), or you can simply pass "--cont"
    to cr_restart to have it done automatically.
    If you find that adding "--cont" to the cr_restart arguments still 
    doesn't allow the restart to complete, let us know and we'll see if we 
    can figure out what is going on.
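To illustrate why timing a manual SIGCONT is tricky, here is a minimal Linux-only Python sketch that polls `/proc/<pid>/stat` until a process is actually in the stopped state before sending SIGCONT. The helper names `process_state` and `continue_when_stopped` are hypothetical, and this is not how BLCR or the "--cont" option works internally; it only shows one way to avoid sending the signal too early.

```python
import os
import signal
import time


def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat ('T' means stopped)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name is wrapped in parentheses and may contain spaces,
    # so split on the last ')' before reading the state field.
    return data.rsplit(")", 1)[1].split()[0]


def continue_when_stopped(pid, timeout=5.0, interval=0.05):
    """Poll until pid is stopped, then send SIGCONT. Returns True on success."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if process_state(pid) == "T":
            os.kill(pid, signal.SIGCONT)
            return True
        time.sleep(interval)
    return False
```

A caller would invoke `continue_when_stopped(pid)` once it knows the PID of the restarted process. This does not help if the restart request itself never returns, which is the situation reported in this thread.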
    > Best wishes!
    > Yuan Tang
    > ----- Original Message ----
    > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
    > To: checkpoint_at_lbl_dot_gov
    > Sent: Tuesday, December 18, 2007 4:23:30 AM
    > Subject: BLCR 0.6.2 beta1 now available
    > The first beta of BLCR 0.6.2 is now available at
    > Both source tarball and SRPM are available.  The filenames and MD5
    > checksums are:
    >   93249f20abd4eeec7a07db2f2a6cd2b2  blcr-0.6.2_b1.tar.gz
    >   e8ecba22c98de143ced20f83db76d8a1  blcr-0.6.2_b1-1.src.rpm
    > This is a beta of a 0.6.2 patch release.  The intent of 0.6.2 is to fix
    > a small number of significant bugs found in 0.6.0 and 0.6.1 and to add
    > support for 2.6.23 kernels and some vendor-patched 2.6.22 kernels.  A
    > NEWS entry summarizing these changes appears below.
    > You are receiving this e-mail either because you are subscribed to the
    > checkpoint_at_lbl_dot_gov mailing list or because you have reported one
    > of the bugs or previously unsupported kernel versions addressed by this
    > release.  I apologize if you receive multiple copies.
    > I would greatly appreciate any feedback (positive or negative)
    > indicating if this beta fixes any problems you have reported with
    > 0.6.0 and/or 0.6.1.  Only after I have sufficient positive feedback will
    > I make 0.6.2 available for download from the main BLCR web pages.
    > -Paul
    > 0.6.2_b1
    > --------
    > December 17, 2007
    > Bug-fix and expanded-support release.
    > - This release adds support for 2.6.23 kernels.
    > - This release adds support for SuSE's 2.6.22.x kernels.
    > - This release fixes a file descriptor leak that occurred on restart
    >   from a checkpoint-of-self requested via cr_request_checkpoint().
    > - This release fixes a deadlock (and unkillable process(es)) when a
    >   multi-threaded process aborts (or omits itself from) a checkpoint
    >   under certain conditions.
    > - This release fixes a restart-time failure when a checkpoint includes a
    >   pipe with one end outside the checkpoint scope, and data is
    >   in the pipe.
    > - This release fixes a bug with the cr_request{,_file}() calls in which
    >   a failed checkpoint would cause failure of the next one if it had the
    >   same destination file name.
    > - This release fixes a race condition with the cr_enter_cs() and
    >   checkpoints in multi-threaded processes.
    > - This release fixes post-checkpoint signal delivery (--stop and
    >   to occur after the checkpoint is fully completed.  See bug 2201 for
    >   a full description of the problems addressed by these changes.
    > - This release documents (and fully implements) signal-delivery
    >   to cr_restart (see bug 2200).
    > - Adds test cases for most of the bugs fixed in this release.
    > -- 
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov 
    > Future Technologies Group
    > HPC Research Department                  Tel: +1-510-495-2352
    > Lawrence Berkeley National Laboratory    Fax: +1-510-486-6900
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
