From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Jan 16 2008 - 11:45:09 PST
王磊 wrote:
> Dear sir,
> In your website
> http://ftg.lbl.gov/CheckpointRestart/CheckpointPapers.shtml,I have
> read some papers,now I have a question.
> In BLCR,the checkpoint is important,maybe in these papers are not
> explained clearly.
> So if possible,I want to know the details about the checkpoint you
> have set,and what is the criteria do you set to keep some
> useful states or to remove?Does the checkpoint,like the Libckpt,has
> done some optimization,like memory exclusion,incremental
> checkpoint,and so on.Even,can we get the source code about how do you
> set a checkpoint?
> Thank you.
> Sincerely,
> Daniel

Daniel,

The source code to perform the checkpoints is available in the
downloads section of the website. I assume you (or somebody you work
with) have already downloaded it to compile and install on your system.
The files in the cr_module and vmadump4 directories do most of the work
to save and restore process state.

You are correct that there are no papers on the details of how the
checkpoint is taken, mostly because there is very little of general
interest to be told. In short, what happens is that the kernel
interrupts all the target threads/processes with a signal that causes
them to run a BLCR-provided signal handler. That handler runs the
callbacks and then calls the BLCR kernel module to take the checkpoint.
Once in the kernel to take the checkpoint, BLCR writes the memory of
the processes and many of the kernel data structures that contain
important state (such as the registers, signal handlers, file table,
etc.). At restart time a process is reconstructed from the saved data.

There are no criteria to keep or remove certain state. There *are*
certain types of state that BLCR doesn't handle (such as sockets and
SysV IPC), but there is no way to selectively control what is or is
not saved.

In the future we will implement memory exclusion and incremental
checkpoints, but those are not yet implemented.

-Paul

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
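
For readers who want to see the shape of the control flow described in the reply (a signal interrupts the process, a handler runs the registered callbacks, then control passes to the code that takes the checkpoint), here is a minimal user-space sketch in C. It is not BLCR code and does not use the BLCR API: every name in it (register_callback, checkpoint_handler, take_checkpoint_stub, and so on) is hypothetical, SIGUSR1 is an arbitrary choice of signal, and the "checkpoint" is a stub that only counts calls where the real system would enter the kernel module.

```c
/* Sketch only: all names are hypothetical; none of this is the BLCR API. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for sigaction() under strict -std= modes */
#endif
#include <signal.h>
#include <stddef.h>

#define MAX_CALLBACKS 8

typedef void (*ckpt_callback)(void);

static ckpt_callback callbacks[MAX_CALLBACKS];
static size_t n_callbacks = 0;

/* Counters written from the handler; sig_atomic_t is safe for that. */
static volatile sig_atomic_t callbacks_run = 0;
static volatile sig_atomic_t checkpoints_taken = 0;

/* Stand-in for the call into the kernel module. A real checkpointer
 * would save memory, registers, signal handlers, and so on here. */
static void take_checkpoint_stub(void)
{
    checkpoints_taken++;
}

/* Register a function to run before each checkpoint.
 * Returns 0 on success, -1 if the table is full. */
static int register_callback(ckpt_callback cb)
{
    if (n_callbacks == MAX_CALLBACKS)
        return -1;
    callbacks[n_callbacks++] = cb;
    return 0;
}

/* The signal handler: run every callback, then take the checkpoint.
 * Callbacks run in signal context, so they must restrict themselves
 * to async-signal-safe operations. */
static void checkpoint_handler(int signo)
{
    (void)signo;
    for (size_t i = 0; i < n_callbacks; i++) {
        callbacks[i]();
        callbacks_run++;
    }
    take_checkpoint_stub();
}

/* Install the handler for the (arbitrarily chosen) checkpoint signal. */
static int install_handler(void)
{
    struct sigaction sa;
    sa.sa_handler = checkpoint_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Example callback: an application would quiesce its own state here. */
static void example_callback(void)
{
    /* intentionally empty in this sketch */
}
```

To try it, call install_handler() and register_callback(example_callback) from main(), then send the process SIGUSR1 (for example with raise(SIGUSR1)); afterwards callbacks_run and checkpoints_taken should each have advanced by one per signal delivered.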