From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Mar 25 2010 - 20:58:56 PDT
TK,

It is not my intent to be rude or condescending, but I don't have the time to describe everything that takes place in a checkpoint. The simple answer is that "the whole story" is in the source code - which you have available to examine. You have correctly determined that a checkpoint begins with an ioctl() that invokes cr_dump_self(), and you should be able to trace the rest using the source code. I have not memorized which functions call which others in what order, even though I wrote most of it. To give you the "whole story" I would have to take the time to read through the sources and trace the calls. Instead, I encourage you to read them. Doing so is likely to give you a deeper understanding than if I were to try to do it for you. If after that you have some specific questions about "how" or "why" things are done, I may be able to help.

You may want to look at tools like "cflow" to build a call graph for you, though I cannot be certain they work well with Linux kernel code.

I CAN summarize the distinction between the code in cr_module/ and vmadump4/, which appears to be a significant point of your question. The vmadump code is a heavily modified version of software from the BProc project that predates BLCR (and comes from a different organization). It was never able to deal with shared memory, files or multiple processes; nor does it have the callback mechanisms of BLCR. So the BLCR project began with the intent of keeping the changes made to files in vmadump to a minimum and building the other functionality (e.g. shared memory, files and multiple processes) separately. That is why you will find that vmadump handles "anonymous" pages and non-shared mappings, while the cr_save_mmaps code handles the shared mappings.

I hope my answer helps you some, even if I can't provide the answer you may have been looking for.

-Paul

TK wrote:
> Thanks.
> But when a checkpoint request is issued with "cr_checkpoint" command,
> a ioctl request is made to /proc/checkpoint/ctrl. I suppose it will
> be the "CR_OP_HAND_CHKPT" request. Then "cr_dump_self" will be
> called, and finally cr_save_mmaps_data will be called, and the memory
> will be saved here. Am I correct? If so, when is the whole story of
> checkpoint? When the "vmadump" module is used then ?
>
> Thank you very much.
>
> On 03/25/2010 07:20 PM, Paul H. Hargrove wrote:
>> TK,
>>
>> I am sorry I didn't get the chance to answer this one when you asked
>> me directly 2 days ago - I am up against some deadlines right now.
>>
>> To answer your question:
>> In the function you ask about we are dealing only with memory regions
>> created by mmap() of a file. Therefore all the "clean" pages already
>> exist somewhere on disk in the file that has been mmap()ed. This
>> includes the executable file and shared libraries that were mmap()ed
>> in prior to the start of main(). As with open files, BLCR makes the
>> (optimistic) assumption that the file will still exist, unmodified,
>> at the time of the restart. However, one can ensure that even the
>> "clean" pages will be stored with the checkpoint by passing --save-all.
>>
>> -Paul
>>
>> TK wrote:
>>> Hi , all. I am trying to adding my own code into BLCR for some
>>> experiments.
>>> When I was reading the code of "cr_save_mmaps_data" function in
>>> cr_module/cr_mmaps.c, I found the comment /* dump the dirty pages */
>>> . I am wondering you dump only the dirty pages only? It will not be
>>> enough info for restart. Or the other pages are dumped else where?
>>> If so, where is it?
>>> Thank you.
>>
>>
>

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
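
The point in the quoted reply about "clean" and "dirty" pages of a file mapping can be illustrated with a toy model. This is only a sketch under stated assumptions and is not BLCR code: the names checkpoint_mapping, restart_mapping and split_pages are hypothetical, the 4096-byte page size is illustrative, and a page is treated as "dirty" simply because it differs from the backing file (a kernel would normally consult page state rather than compare contents). The save_all parameter loosely mirrors the --save-all option mentioned in the thread.

```python
PAGE_SIZE = 4096  # illustrative page size, an assumption of this sketch


def split_pages(data, page_size=PAGE_SIZE):
    """Split a bytes object into consecutive page-sized chunks."""
    return [data[i:i + page_size] for i in range(0, len(data), page_size)]


def checkpoint_mapping(file_bytes, memory_bytes, save_all=False,
                       page_size=PAGE_SIZE):
    """Return {page_index: page_bytes} for the pages to store.

    Hypothetical helper: a page is "dirty" here if it differs from the
    backing file. With save_all=True every page is stored, clean or not.
    Assumes file_bytes and memory_bytes have the same length.
    """
    file_pages = split_pages(file_bytes, page_size)
    mem_pages = split_pages(memory_bytes, page_size)
    return {
        idx: mem_page
        for idx, (file_page, mem_page) in enumerate(zip(file_pages, mem_pages))
        if save_all or mem_page != file_page
    }


def restart_mapping(file_bytes, saved_pages, page_size=PAGE_SIZE):
    """Rebuild the memory image: re-read the file, then overlay saved pages."""
    pages = split_pages(file_bytes, page_size)
    for idx, page in saved_pages.items():
        pages[idx] = page
    return b"".join(pages)


if __name__ == "__main__":
    backing = bytes(3 * PAGE_SIZE)      # file contents at checkpoint time
    image = bytearray(backing)
    image[PAGE_SIZE + 10] = 0xFF        # modify one byte inside page 1
    saved = checkpoint_mapping(backing, bytes(image))
    print(sorted(saved))                # indices of the pages that were stored
    print(restart_mapping(backing, saved) == bytes(image))
```

In this model a restart from dirty pages alone reproduces the memory image only if the backing file is unchanged, which is the optimistic assumption described in the thread. Storing every page removes that dependency at the cost of a larger checkpoint.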