From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Sep 04 2007 - 15:14:57 PDT
Abhinav Jha wrote:
> Dear Sir,
>
> Thank you for your kind reply. For the last few days, we have been going
> through the BLCR code and are trying to figure out how a process is
> checkpointed by BLCR. Is there a platform/forum where we could discuss
> BLCR? We will try our best to club together our doubts in future so that
> we don't cause you too much trouble (we hope).
>

The address you are sending to (checkpoint_at_lbl_dot_gov) is a mailing
list including the BLCR developers and a few of the users. It is the
best (and probably only) place to ask your questions.

You can also find some explanation of how a process gets checkpointed in
the following two papers (also indexed on our website):

* Duell, J., Hargrove, P., and Roman, E.
  The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart.
  Berkeley Lab Technical Report (publication LBNL-54941)
  http://ftg.lbl.gov/CheckpointRestart/blcr.pdf

* Paul H. Hargrove and Jason C. Duell
  Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters
  In Proceedings of SciDAC 2006: June 2006. (publication LBNL-60520)
  http://ftg.lbl.gov/CheckpointRestart/LBNL-60520.pdf

-Paul

> Thanks once again,
>
> Abhinav Jha & Manish Kumar,
> Indian Institute of Technology Guwahati
> Guwahati -39, INDIA
> http://www.iitg.ernet.in
>
>
>> Abhinav Jha wrote:
>>
>>> Dear Sir,
>>>
>>> We're final year students from Indian Institute of Technology, Guwahati
>>> (http://www.iitg.ernet.in), working on our B.Tech. project,
>>> "Implementation of checkpoint and restart mechanism on the linux kernel
>>> 2.6".
>>>
>> Thank you for your interest in BLCR. You will find my answers to your
>> questions below.
>>
>>
>>> We wanted to make use of the already existing facilities of BLCR in this
>>> regard. However, we're not aware of a few things:
>>>
>>> 1. Whether we can change your code without violating your copyright.
>>>
>> BLCR is distributed under 2 Open Source Software licenses, the GPL and
>> LGPL.
>> You should examine the license.txt files in each directory for
>> information on which license applies to the files in that directory.
>>
>> The GPL allows you to modify the covered portions of BLCR provided that
>> you distribute your modified version under the same GPL license.
>>
>> The LGPL allows slightly more freedom in how you may use the covered
>> portions of BLCR.
>>
>> In either case, you should not have any problems if this is only for a
>> class project. If you plan to distribute the resulting enhancements to
>> the general public, you should expect to simply apply the same licenses
>> to the modified versions. You don't need to obtain any permissions from
>> us to do so. However, if you do develop enhancements of general
>> interest, we should talk about incorporating your changes back into the
>> base BLCR code.
>>
>>
>>> 2. What is the feasibility of implementing socket checkpointing in BLCR.
>>>
>> Good question. We have not tried to pursue this task ourselves, and
>> therefore have not tried hard to determine the exact level of
>> difficulty. Assuming you are interested only in Unix-domain (aka
>> AF_LOCAL) sockets, I imagine the problems are small since the buffered
>> data is all local to one node. In the case of TCP, you can probably get
>> away with preserving only the data that is buffered locally (both
>> incoming and outgoing) and counting on retransmission to recover any
>> data "on the wire" at checkpoint time. The difficulty, however, is
>> likely to come from getting the TCP state engine back to the right
>> state. For UDP, you can probably do the same as TCP.
>>
>> If you also want to attempt migration of TCP or UDP sockets, then you
>> will need some way to "adjust" the peer as well.
>>
>>
>>> 3. Can we do an implementation of file checkpointing that is
>>> independent of the one you have planned?
>>>
>> We have code in the soon-to-be-released 0.6.0 version of BLCR that takes
>> care of checkpointing of open-but-deleted files.
>> That code can easily be
>> leveraged to checkpoint all open files, whether or not they are deleted.
>> The interesting part comes at restart time when you need to determine
>> whether to use the checkpointed copy of a file or the copy that now
>> exists on disk. Depending on how a given application uses files (and
>> how users of the application expect to use the files after the
>> application runs) there is no single correct policy. The implementation
>> work to be done here is certainly simpler than socket checkpointing.
>>
>>
>>> 4. What would be a good way to go about reading/modifying the code,
>>> since there is no manual available?
>>>
>> I am afraid we don't have a good answer for this one. We try to put
>> comments in the kernel code that are sufficient for our own use when we
>> look at code that another member of our group has written, or our code
>> long after it was written. However, it will take a good bit of time to
>> learn the code just by reading it. Alas, there is no documentation
>> other than the code itself.
>>
>>
>>> We'll be very grateful to hear from you.
>>>
>> Feel free to ask more questions if you need to.
>>
>>
>>> Thank you,
>>>
>>> Abhinav Jha & Manish Kumar,
>>> Indian Institute of Technology Guwahati
>>> Guwahati -39, INDIA
>>> http://www.iitg.ernet.in
>>>
>> -Paul
>>
>> --
>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
>> Future Technologies Group
>> HPC Research Department                   Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>
>

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900