Re: BLCR 0.6.2 beta1 now available

From: Yuan Tang (supertangcc_at_yahoo_dot_com)
Date: Wed Dec 19 2007 - 19:42:40 PST

  • Next message: Paul H. Hargrove: "Re: BLCR 0.6.2 beta1 now available"
    I should mention that I changed how BLCR works a little. I want a background daemon process to monitor the state of a foreground child process. When the daemon detects a SIGCHLD, it automatically restarts the child process without the administrator's intervention. To match my needs, I modified the cr_restart utility: I read its source code and turned it into a cr_restart() function that the daemon can call to restart the child process. I noticed how the cr_restart utility restarts the child process: it forks child(ren) to contain the restarted tasks, and then its own process calls mimic_exit(child_status) to exit. I simply replaced the mimic_exit() call with a return. I believe this modification works, because as long as I am not checkpointing/restarting a SIGSTOPed process, everything works as expected.
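
For illustration only, the monitoring scheme described above might be sketched in Python as follows. The supervise() helper and its restart callback are hypothetical names, not part of BLCR, and the sketch simply waits on the child instead of reacting to SIGCHLD in a signal handler.

```python
import subprocess
import sys


def supervise(start_cmd, restart, max_restarts=1):
    """Run start_cmd; each time the child exits non-zero, call restart().

    restart is a placeholder callback that must return a new
    subprocess.Popen, for example one running cr_restart on a saved
    checkpoint.  Returns the list of exit statuses that were observed.
    """
    statuses = []
    child = subprocess.Popen(start_cmd)
    while True:
        status = child.wait()  # blocks until the child terminates
        statuses.append(status)
        # Stop on a clean exit or once the restart budget is used up.
        if status == 0 or len(statuses) > max_restarts:
            return statuses
        print(f"child exited with {status}; restarting", file=sys.stderr)
        child = restart()
```

The daemon stays the parent of whatever restart() launches, so it can keep waiting on it in the same loop.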
    
    BTW: In order to checkpoint/restart a SIGSTOPed process, I modified the BLCR internals as I previously described to you. That is, I moved the send_sig_info(SIGSTOP, NULL, task) call from cr_dump_self()/cr_rstrt_child() to cr_chkpt_complete()/cr_rstrt_complet(), respectively. I also noticed that you moved the send_sig_info() call for the checkpoint portion, right? Would you try the same for the restart portion, and see whether my proposal works?
    
    Best wishes!
    
    Yuan
    
    ----- Original Message ----
    From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
    To: Yuan Tang <supertangcc_at_yahoo_dot_com>
    Cc: checkpoint_at_lbl_dot_gov
    Sent: Wednesday, December 19, 2007 12:26:01 PM
    Subject: Re: BLCR 0.6.2 beta1 now available
    
    
    Perhaps I am misunderstanding your problem, or you are misunderstanding
    how BLCR works.  I am not clear on which.
    
    The cr_restart utility restarts a process as its own child.  It does not
    return as soon as the process is restarted, but instead it waits until
    the child terminates and then terminates itself with the same exit
    code.  Also, it is not the cr_restart utility you need to send SIGCONT
    to, but the child(ren) that it spawns, and you need to wait until the
    child is fully restored.  Together those are why using the "--cont"
    argument to cr_restart is my recommendation.
    
    The function cr_rstrt_child() in the kernel module is like an exec() in
    that it *replaces* the calling thread with a thread of the restarted
    process.  So, there is no return from that back to the calling context -
    when it does return it is to the context of the checkpoint handler in
    the restarted process.
    
    If you could be clearer on which "wait_event_interruptible() function in
    the loop" you mean, perhaps I can understand what you are trying to do.
    I suspect you mean the loop in cr_rstrt_procs().  That loop should
    terminate as soon as enough threads (the same number as in the
    checkpointed process) have reached the wake_up() call in cr_rstrt_child().
    
    From what I have read so far, I am worried that you are trying to
    reimplement the internals of the cr_restart utility.  If that is the case,
    I would strongly encourage you *not* to take that approach since those
    internals change in each release.  You should instead invoke the
    cr_restart utility to do the "dirty work".  In a future release (once
    the interface stabilizes) we will add a cr_request_restart() entry point
    to libcr.
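
As a sketch of the recommended approach (driving the cr_restart utility from a daemon rather than reimplementing it), something like the following could work. It is written in Python for brevity; restart_command() and run_restart() are hypothetical helper names, the checkpoint file name is a placeholder, and the only cr_restart option assumed is the "--cont" argument mentioned in this thread.

```python
import subprocess
import sys


def restart_command(context_file, cont=True, cr_restart="cr_restart"):
    """Build the argument list for the cr_restart utility.

    context_file is a placeholder for the checkpoint file to restart from.
    With cont=True, "--cont" is passed so that cr_restart takes care of
    continuing a process that was SIGSTOPed when it was checkpointed.
    """
    cmd = [cr_restart]
    if cont:
        cmd.append("--cont")
    cmd.append(context_file)
    return cmd


def run_restart(cmd):
    """Run the restart command and return its exit status.

    cr_restart waits for the restarted process and exits with the same
    exit code, so this call blocks until that process terminates.
    """
    print("running: " + " ".join(cmd), file=sys.stderr)
    return subprocess.run(cmd).returncode
```

A daemon would call run_restart(restart_command("context.1234")) from whatever thread or child it is willing to block, since the call does not return until the restarted process has finished.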
    
    -Paul
    
    Yuan Tang wrote:
    > Sure! I mean that the RESTART procedure never completes, that is, it
    > never returns. It blocks in the function cr_rstrt_child(). My program
    > lets a background daemon process checkpoint a SIGSTOPed process and
    > later restart it. After restarting it, the daemon process will send
    > SIGCONT to the restarted SIGSTOPed process to let it go. However, the
    > restart procedure never completes, so the background daemon never has
    > the chance to send the SIGCONT. If I send the SIGCONT from the console,
    > it just interrupts the wait_event_interruptible() function in the loop.
    >
    > ----- Original Message ----
    > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
    > To: Yuan Tang <supertangcc_at_yahoo_dot_com>
    > Cc: checkpoint_at_lbl_dot_gov
    > Sent: Wednesday, December 19, 2007 5:19:58 AM
    > Subject: Re: BLCR 0.6.2 beta1 now available
    >
    > Yuan Tang wrote:
    > > Hi Paul,
    > >
    > > Thank you for the work. I downloaded the beta version, installed it,
    > > and tested it. A SIGSTOPed process can now pass through the whole
    > > checkpoint procedure. Congratulations! However, when I tried restarting
    > > the previously checkpointed SIGSTOPed process from its disk image, the
    > > RESTART procedure never completed. It blocks in
    > > cr_rstrt_req.c:cr_rstrt_child(). I guess that if you move the
    > > send_sig_info(SIGSTOP, NULL, task) call to cr_rstrt_task_complete(),
    > > the whole procedure will complete normally. Hope it helps.
    >
    > I believe the current behavior is correct (or at least is what I've
    > intended).  The process that was SIGSTOPed when the checkpoint was
    > requested is again SIGSTOPed when restarted.  To get it running again
    > you should be able to either send it a SIGCONT (which is tricky because
    > you might not know how soon to send it), or you can simply pass "--cont"
    > to cr_restart to have it done automatically.
    >
    > If you find that adding "--cont" to the cr_restart arguments still
    > doesn't allow the restart to complete, let us know and we'll see if we
    > can figure out what is going on.
    >
    > -Paul
    >
    > >
    > > Best wishes!
    > >
    > > Yuan Tang
    > >
    > > ----- Original Message ----
    > > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
    > > To: checkpoint_at_lbl_dot_gov
    > > Sent: Tuesday, December 18, 2007 4:23:30 AM
    > > Subject: BLCR 0.6.2 beta1 now available
    > >
    > > The first beta of BLCR 0.6.2 is now available at
    > > http://mantis.lbl.gov/blcr-dist/
    > > Both source tarball and SRPM are available.  The filenames and MD5
    > > checksums are:
    > >  93249f20abd4eeec7a07db2f2a6cd2b2  blcr-0.6.2_b1.tar.gz
    > >  e8ecba22c98de143ced20f83db76d8a1  blcr-0.6.2_b1-1.src.rpm
    > >
    > > This is a beta of a 0.6.2 patch release.  The intent of 0.6.2 is to fix
    > > a small number of significant bugs found in 0.6.0 and 0.6.1 and to add
    > > support for 2.6.23 kernels and some vendor-patched 2.6.22 kernels.  A
    > > NEWS entry summarizing these changes appears below.
    > >
    > > You are receiving this e-mail either because you are subscribed to the
    > > checkpoint_at_lbl_dot_gov mailing list or because you have reported one
    > > of the bugs or previously unsupported kernel versions addressed by this
    > > release.  I apologize if you receive multiple copies.
    > >
    > > I would greatly appreciate any feedback (positive or negative)
    > > indicating if this beta fixes any problems you have reported with BLCR
    > > 0.6.0 and/or 0.6.1.  Only after I have sufficient positive feedback will
    > > I make 0.6.2 available for download from the main BLCR web pages.
    > >
    > > -Paul
    > >
    > >
    > > 0.6.2_b1
    > > --------
    > > December 17, 2007
    > > Bug-fix and expanded-support release.
    > > - This release adds support for 2.6.23 kernels.
    > > - This release adds support for SuSE's 2.6.22.x kernels.
    > > - This release fixes a file descriptor leak that occurred on restart
    > >  from a checkpoint-of-self requested via cr_request_checkpoint().
    > > - This release fixes a deadlock (and unkillable process(es)) when a
    > >  multi-threaded process aborts (or omits itself from) a checkpoint
    > >  under certain conditions.
    > > - This release fixes a restart-time failure when a checkpoint includes a
    > >  pipe with one end outside the checkpoint scope, and data is buffered
    > >  in the pipe.
    > > - This release fixes a bug with the cr_request{,_file}() calls in which
    > >  a failed checkpoint would cause failure of the next one if it had the
    > >  same destination file name.
    > > - This release fixes a race condition with the cr_enter_cs() and
    > >  checkpoints in multi-threaded processes.
    > > - This release fixes post-checkpoint signal delivery (--stop and friends)
    > >  to occur after the checkpoint is fully completed.  See bug 2201 for
    > >  a full description of the problems addressed by these changes.
    > > - This release documents (and fully implements) signal-delivery options
    > >  to cr_restart (see bug 2200).
    > > - Adds test cases for most of the bugs fixed in this release.
    > >
    > >
    > > --
    > > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > > Future Technologies Group
    > > HPC Research Department                  Tel: +1-510-495-2352
    > > Lawrence Berkeley National Laboratory    Fax: +1-510-486-6900
    >
    >
    > -- 
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group
    > HPC Research Department                  Tel: +1-510-495-2352
    > Lawrence Berkeley National Laboratory    Fax: +1-510-486-6900
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
    
    
    
    
    
    
    