Re: the details of setting a checkpoint

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Jan 16 2008 - 11:45:09 PST

    王磊 wrote:
    > Dear sir,
    > In your website 
    >,I have 
    > read some papers,now I have a question.
    > In BLCR,the checkpoint is important,maybe in these papers are not 
    > explained clearly.
    > So if possible,I want to know the details about the checkpoint you 
    > have set,and what is  the criteria do you set to keep some
    > useful states or to remove?Does the checkpoint,like the Libckpt,has 
    > done some optimization,like memory exclusion,incremental
    > checkpoint,and so on.Even,can we get the source code about how do you 
    > set a checkpoint?
    > Thank you.
    > Sincerely,
    > Daniel
      The source code to perform the checkpoints is available in the 
    downloads section of the website.  I assume you (or somebody you work 
    with) have already downloaded it to compile and install on your system.  
    The files in the cr_module and vmadump4 directories do most of the work 
    to save and restore process state.
      You are correct that there are no papers on the details of how the 
    checkpoint is taken, mostly because there is very little of general 
    interest to be told.  In short, what happens is that the kernel 
    interrupts all the target threads/processes with a signal that causes 
    them to run a blcr-provided signal handler.  That handler runs the 
    callbacks and then calls the blcr kernel module to take the checkpoint.  
    Once in the kernel to take the checkpoint, blcr writes the memory of the 
    processes and many of the kernel data structures that contain important 
    state (such as the registers, signal handlers, file table, etc.).  At 
    restart time a process is reconstructed with the saved data.
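      As a rough illustration of the user-space half of that sequence 
    (signal arrives, handler runs the callbacks, handler then asks the 
    kernel for the checkpoint), here is a minimal C sketch.  It is not 
    blcr code and does not use the blcr API: every name in it 
    (register_callback, install_checkpoint_handler, 
    enter_kernel_checkpoint, and so on) is hypothetical, SIGUSR1 is an 
    arbitrary choice of signal, and the "kernel" step is only a counter 
    standing in for the call into the kernel module.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

#define MAX_CALLBACKS 8

/* Hypothetical callback type; this is NOT the blcr callback API. */
typedef void (*ckpt_callback_t)(void);

static ckpt_callback_t callbacks[MAX_CALLBACKS];
static size_t n_callbacks = 0;

/* Written from inside the handler, so they use volatile sig_atomic_t,
 * the type the C standard allows a signal handler to modify. */
static volatile sig_atomic_t callbacks_run = 0;
static volatile sig_atomic_t checkpoints_taken = 0;

/* Stand-in for the call into the kernel module (an assumption made for
 * this sketch). A real implementation would save process state here. */
static void enter_kernel_checkpoint(void)
{
    checkpoints_taken++;
}

/* The handler runs every registered callback first, then asks the
 * "kernel" to take the checkpoint, in the order described above. */
static void checkpoint_signal_handler(int signo)
{
    (void)signo;
    for (size_t i = 0; i < n_callbacks; i++)
        callbacks[i]();
    enter_kernel_checkpoint();
}

/* Register a callback; returns 0 on success, -1 if the table is full. */
int register_callback(ckpt_callback_t cb)
{
    if (n_callbacks == MAX_CALLBACKS)
        return -1;
    callbacks[n_callbacks++] = cb;
    return 0;
}

/* Install the handler for the given signal; returns sigaction()'s result. */
int install_checkpoint_handler(int signo)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = checkpoint_signal_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Example callback: a real one might flush buffers or close sockets. */
static void flush_state_callback(void)
{
    callbacks_run++;
}

/* Simulate one checkpoint request and return how many were "taken". */
int demo_checkpoint_request(void)
{
    int rc = install_checkpoint_handler(SIGUSR1);
    assert(rc == 0);
    rc = register_callback(flush_state_callback);
    assert(rc == 0);
    raise(SIGUSR1);  /* stands in for the kernel signalling the target */
    return (int)checkpoints_taken;
}
```

      Calling demo_checkpoint_request() once from main() drives the whole 
    path: the callback runs inside the handler before the stand-in 
    checkpoint step.  The callback here only touches a sig_atomic_t 
    counter so that the sketch stays async-signal-safe.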
      There are no criteria for keeping or removing particular state.  
    There *are* certain types of state that blcr doesn't handle (such as 
    sockets and SysV IPC), but there is no way to selectively control what 
    is or is not saved.  We plan to implement memory exclusion and 
    incremental checkpoints in the future, but neither is implemented yet.
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
