From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Jan 22 2008 - 13:25:15 PST
Abhinav Jha wrote:
> Sir,
>
> We want to give the user a flexibility of choosing different file
> checkpointing methods, like : saving the name of the file, contents of the
> file, etc. depending on the file size, file properties, etc. Thus, we
> would like to add on to the file checkpointing facilities that you have
> provided in the latest release of BLCR. We were thinking of making changes
> to the VMADump module code in order to do this. Is that right ? If not,
> kindly suggest which part of the code we need to make changes to. We feel
> lost :(

I am sorry that you are feeling lost. The BLCR source code is not very easy to find one's way through initially, but it should get easier as you get more familiar with it.

The VMADump code (originally from the BProc project) is used by BLCR to save/restore the memory, signal handlers and a few other architecture-independent elements of the process state, plus all the architecture-specific bits (the registers). While VMADump handles most mmap()ed files, it does not handle the files that are open(), and in the future BLCR will take over handling of many/most mmap()ed files from VMADump as support is added for inclusion of the executable and shared libs in the checkpoint context file.

The code that handles the save of open files (only "by reference" currently) is in the function cr_save_open_file() in the file cr_module/cr_dump_self.c. The corresponding restore code is in cr_restore_open_file() in cr_module/cr_rstrt_req.c. It is hoped that in the future these two functions will move from their current locations into a single file with a name like cr_module/cr_files.c. Those functions, plus possibly their callers or callees, are where you are likely to make your modifications.

When looking to save the entire contents of the file, look at how unlinked (deleted) files are currently handled in cr_{save,restore}_open_file().
There are two things to be aware of:

1) The function cr_copyfile() is used to insert the entire file into the checkpoint context file. The setting of the "unlinked" flag in the cr_file_info structure causes a corresponding cr_mkunlinked() call at restart time to reconstruct the file. While you may not wish to restore your files in an unlinked state, the logic you may require should be very similar to this case.

2) Note that cr_{save,restore}_open_file() both make use of calls to our object-map code (cr_insert_object() and cr_find_object()) to determine if the save/restore of the file should take place or not. This is to ensure that if multiple processes have the same file open, we will save its contents to the context file exactly once. At checkpoint time the use of the object map is trivial (inserting a map from the inode address to itself) and serves only as an indication that the specific file has been saved. At restart time, however, the mapping used is from the "file_id" to the "struct file" of the new instance. That ensures that while only one copy of the file is restored, all other references will reopen that same instance (which is done by cr_filp_reopen() in the case of unlinked files, since they don't have a path by which to open them normally).

We look forward to seeing your work completed and, if possible, contributed for inclusion in BLCR where the entire community can benefit from your work. When ready to contribute patches to BLCR, we will need a Developer's Certificate of Origin, a.k.a. a "Signed-off-by" line (see the "Sign Your Work" section of the file Documentation/SubmittingPatches in any recent Linux kernel sources). This allows us to satisfy our organization's requirement to redistribute your contribution to others.

I am pleased to know that somebody is pursuing this work and am happy to provide what assistance I can by e-mail. Please don't hesitate to ask more questions using the checkpoint_at_lbl_dot_gov list address.
You may also wish to subscribe to that list to ensure you see the answers and any follow-ups to them (just e-mail "subscribe checkpoint" in the message /body/ and without the quotes to majordomo_at_lbl_dot_gov).

-Paul

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
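The "exactly once" behaviour described in point 2 of the message can be illustrated with a small user-space analogue of an object map. This is only a sketch under stated assumptions: the objmap type and every function name below are hypothetical, the fixed-size linear table is chosen for brevity, and none of it reflects the actual signatures of cr_insert_object() or cr_find_object().

```c
#include <assert.h>
#include <stddef.h>

#define OBJMAP_CAPACITY 64  /* assumed size, for illustration only */

struct objmap_entry { const void *key; void *value; };
struct objmap { struct objmap_entry slots[OBJMAP_CAPACITY]; size_t used; };

/* Look up key; on a hit store the mapped value in *out (if non-NULL)
 * and return 1, otherwise return 0. */
static int objmap_find(const struct objmap *m, const void *key, void **out)
{
    assert(m != NULL);
    for (size_t i = 0; i < m->used; i++) {
        if (m->slots[i].key == key) {
            if (out != NULL)
                *out = m->slots[i].value;
            return 1;
        }
    }
    return 0;
}

/* Insert key -> value. Returns 1 if newly inserted, 0 if the key was
 * already present, -1 if the table is full. */
static int objmap_insert(struct objmap *m, const void *key, void *value)
{
    if (objmap_find(m, key, NULL))
        return 0;
    if (m->used == OBJMAP_CAPACITY)
        return -1;
    m->slots[m->used].key = key;
    m->slots[m->used].value = value;
    m->used++;
    return 1;
}

/* Checkpoint side: map the inode address to itself. Returns 1 only for
 * the first caller, which is the one that should write the contents. */
static int should_save_contents(struct objmap *m, const void *inode)
{
    return objmap_insert(m, inode, (void *)inode) == 1;
}

/* Restart side: return the instance already restored for file_id,
 * or NULL if this is the first reference to it. */
static void *lookup_restored(const struct objmap *m, const void *file_id)
{
    void *instance = NULL;
    return objmap_find(m, file_id, &instance) ? instance : NULL;
}

/* Restart side: record the freshly restored instance for file_id so
 * that later references reuse it. Returns 1 on success. */
static int remember_restored(struct objmap *m, const void *file_id,
                             void *instance)
{
    return objmap_insert(m, file_id, instance) == 1;
}
```

At restart, the first process to reference a given file_id gets NULL from lookup_restored(), rebuilds the file and calls remember_restored(); every later reference receives the same instance instead of restoring a second copy.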