From: Josh Hursey (jjhursey_at_open-mpi.org)
Date: Tue Feb 27 2007 - 06:44:56 PST
On Feb 27, 2007, at 8:15 AM, Rajagopal Natarajan wrote:

> Hi,
>
> I'm working on a 10 node P3 cluster, and use BLCR on it. I would
> like to know if BLCR has any existing support for asynchronous
> checkpointing.

What do you mean by "asynchronous checkpointing"?

BLCR provides the command line tools cr_checkpoint and cr_restart,
which checkpoint and restart an application that is properly linked
with BLCR. The application does not have to add any code in order to
be supported. So you could call that asynchronous checkpointing (and
some do).

If by "asynchronous checkpointing" you mean using an Uncoordinated
Checkpoint/Restart Coordination Protocol, that sits at a somewhat
higher level than BLCR, since it implicitly requires knowledge of a
multi-process environment in which processes may or may not be located
on the same machine. For this you need to look at building on top of
the existing BLCR infrastructure in something like an MPI
implementation, as you note below.

> If the answer is yes, please point me to the appropriate docs.

I'd start with the users guide: :)
http://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Users_Guide.html

> If the answer is no, I would like to implement asynchronous
> checkpointing in LAM-MPI.

LAM/MPI already incorporates an asynchronous checkpointing feature,
meaning command line tools are exposed so you can checkpoint an MPI
program with BLCR without modifying the MPI program. LAM/MPI uses a
Coordinated Checkpoint/Restart Coordination Protocol, and supports
checkpointing with TCP and GM (Myrinet).

> Please tell me if I can make use of BLCR and modify the code to do
> that, and how much of the code might need to be modified. Would it be
> feasible to implement it in 1-1.5 months, with two developers
> working part time on it (myself and my classmate, who both are
> working on our bachelor's thesis on checkpointing in LAM-MPI based
> clusters and avoidance of rollback propagation)? As we have other
> course work, we might be able to devote up to 4-5 hrs on this project.

If you intend to pursue an Uncoordinated Checkpoint/Restart
Coordination Protocol in LAM/MPI, it may take a few months or even a
few years, depending on quite a few factors. Most notable among those
factors are familiarity with the LAM/MPI code base, familiarity with
the Uncoordinated C/R literature, and the experience of the developers.

You will need to become familiar with the LAM/MPI code base,
specifically how the current Coordinated C/R Coordination Protocol
works. In addition, Uncoordinated C/R Coordination Protocols can become
quite complex in their reconstruction of the multi-process environment
upon restart (especially without using Message Logging techniques).
This will add significantly to the time spent developing code.

> If the above project is not feasible in the specified time of 1-1.5
> months with 2 developers working on it, please suggest something that
> we can contribute to BLCR which would be related to avoidance of
> rollback propagation.

Rollback propagation is a concept involving multiple processes using
(mainly) Uncoordinated Checkpoint/Restart Protocols. Since BLCR is a
single process checkpoint/restart service, you are really looking to
build something on top of it in a distributed process environment (like
the one MPI provides, for example).

I may be wrong, but I think what you are looking for is an MPI
implementation to experiment with. LAM/MPI is one option, and has some
of the groundwork already laid out, but certainly not all.

BLCR's newest feature, the ability to checkpoint/restart process groups
within a single machine, might be another, smaller area that you could
look at. That would mean looking at how to checkpoint/restart a process
group that communicates via shared memory using an Uncoordinated C/R
Coordination Protocol or something like it.

-- Josh

> Thanks.
>
> --
> N. Rajagopal,
> Visit me at http://users.kaski-net.net/~raj/
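
To make the rollback propagation mentioned above a bit more concrete, here is a toy sketch in Python. It is not BLCR or LAM/MPI code, and the function name recovery_line, the data layout, and the example timestamps are all made up for illustration. It assumes every process restarts from its latest independent checkpoint, that each process has an initial checkpoint at local time 0, and that all message events have local times greater than 0. A message is treated as an "orphan" when the restored receiver state contains the receive but the restored sender state does not contain the send, which forces the receiver back to an earlier checkpoint.

```python
from bisect import bisect_left


def recovery_line(checkpoints, messages):
    """Toy rollback-propagation calculation (illustrative only).

    checkpoints: dict mapping process name -> sorted list of local
                 checkpoint times; each list must start with 0.
    messages:    list of (sender, send_time, receiver, recv_time),
                 with times taken from each process's local clock.
    Returns a dict mapping process name -> checkpoint time to restart from.
    """
    # Start optimistically from each process's most recent checkpoint.
    line = {proc: times[-1] for proc, times in checkpoints.items()}

    changed = True
    while changed:
        changed = False
        for sender, send_t, receiver, recv_t in messages:
            send_undone = send_t > line[sender]
            recv_kept = recv_t <= line[receiver]
            if send_undone and recv_kept:
                # Orphan message: roll the receiver back to its latest
                # checkpoint taken strictly before the receive.
                times = checkpoints[receiver]
                line[receiver] = times[bisect_left(times, recv_t) - 1]
                changed = True
    return line


# Hypothetical two-process history that shows the domino effect.
checkpoints = {"A": [0, 4], "B": [0, 3, 6]}
messages = [
    ("A", 2, "B", 2),  # m1
    ("B", 4, "A", 3),  # m2
    ("A", 5, "B", 5),  # m3
]
print(recovery_line(checkpoints, messages))
```

In this made-up history each rollback orphans another message, so both processes end up back at their initial state. Avoiding that kind of cascade is what Coordinated protocols or Message Logging are meant to achieve.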