Re: Asynchronous checkpointing support in BLCR


From: Josh Hursey (jjhursey_at_open-mpi.org)
Date: Tue Feb 27 2007 - 06:44:56 PST

Next message: Yiannis Georgiou: "blcr-0.5.0_b5 cr_run execution error"

Previous message: Rajagopal Natarajan: "Asynchronous checkpointing support in BLCR"
In reply to: Rajagopal Natarajan: "Asynchronous checkpointing support in BLCR"

On Feb 27, 2007, at 8:15 AM, Rajagopal Natarajan wrote:

> Hi,
>
> I'm working on a 10 node P3 cluster, and use BLCR on it. I would  
> like to know if BLCR has any existing support for asynchronous  
> checkpointing.

What do you mean by "asynchronous checkpointing"?

BLCR provides the command line tools cr_checkpoint and cr_restart,
which checkpoint and restart an application that is properly linked
with BLCR. The application does not have to add any code in order to
be supported. So you could call that asynchronous checkpointing (and
some do).

If by "asynchronous checkpointing" you mean using an
Uncoordinated Checkpoint/Restart Coordination Protocol, that is a bit
higher level than BLCR, since it implicitly requires knowledge of a
multi-process environment in which processes may or may not be
located on the same machine. For this you need to look at building on
top of the existing BLCR infrastructure in something like an MPI
implementation, as you note below.

>
> If the answer is yes, please point me to the appropriate docs.

I'd start with the users guide: :)
http://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Users_Guide.html

>
> If the answer is no, I would like to implement asynchronous  
> checkpointing in LAM-MPI.

LAM/MPI already incorporates an asynchronous checkpointing feature,
meaning command line tools are exposed so you can checkpoint an MPI
program with BLCR without modifying the MPI program. LAM/MPI uses a
Coordinated Checkpoint/Restart Coordination Protocol, and supports
checkpointing with TCP and GM (Myrinet).

> Please tell me if I can make use of BLCR and modify the code to do
> that, and how much of the code might need to be modified. Would it be
> feasible to implement it in 1-1.5 months, with two developers
> working part time on it? (Myself and my classmate, who both are
> working on our bachelors thesis on checkpointing in LAM-MPI based
> clusters and avoidance of rollback propagation. As we have other
> course work, we might be able to devote up to 4-5 hrs on this project.)

If you intend to pursue an Uncoordinated Checkpoint/Restart
Coordination Protocol in LAM/MPI, it may take a few months or even a
few years depending on quite a few factors. Most notable among those
factors are familiarity with the LAM/MPI code base, familiarity with
the Uncoordinated C/R literature, and the experience of the
developers. You will need to become familiar with the LAM/MPI code
base, specifically how the current Coordinated C/R Coordination
Protocol works. In addition, Uncoordinated C/R Coordination Protocols
can become quite complex in their reconstruction of the multiprocess
environment upon restart (especially without using Message Logging
techniques). This will add significantly to the time spent developing
code.

>
> If the above project is not feasible in the specified time of 1-1.5
> months with 2 developers working on it, please suggest something that
> we can contribute to BLCR which would be related to avoidance of
> rollback propagation.

Rollback propagation is a concept involving multiple processes using  
(mainly) Uncoordinated Checkpoint/Restart Protocols. Since BLCR is a  
single process checkpoint/restart service you are really looking to  
do something building upon it in a distributed process environment  
(like MPI provides for example). I may be wrong, but I think what you  
are looking for is an MPI implementation to experiment with. LAM/MPI  
is one option, and has some of the groundwork already laid out but  
certainly not all.
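
To make rollback propagation a bit more concrete, here is a minimal
Python sketch of how a recovery line could be computed from
uncoordinated checkpoints. It is only an illustration, not anything
from BLCR or LAM/MPI: the function name recovery_line, the single
shared clock, and the (sender, send_time, receiver, recv_time) message
format are all assumptions made for the example. A checkpoint taken at
time t is assumed to capture every event with time <= t, and all event
times are assumed to be positive.

```python
from bisect import bisect_left


def recovery_line(checkpoints, messages):
    """Return the newest consistent checkpoint time for each process.

    checkpoints: dict mapping process name -> list of checkpoint times.
                 Time 0 (the initial state) is always added.
    messages:    list of (sender, send_time, receiver, recv_time) tuples,
                 with all times greater than 0.
    """
    cps = {p: sorted(set([0] + list(ts))) for p, ts in checkpoints.items()}
    # Start optimistically: every process restarts from its newest checkpoint.
    line = {p: ts[-1] for p, ts in cps.items()}

    changed = True
    while changed:
        changed = False
        for sender, sent, receiver, received in messages:
            # Orphan message: the receiver's checkpoint already contains the
            # receive, but the sender's checkpoint predates the send.
            if sent > line[sender] and received <= line[receiver]:
                # Roll the receiver back to its newest checkpoint taken
                # strictly before the receive, then re-check everything.
                idx = bisect_left(cps[receiver], received) - 1
                line[receiver] = cps[receiver][idx]
                changed = True
    return line
```

As an example, let P checkpoint at times 4 and 10 and Q at times 7 and
13, with messages Q->P (sent 1, received 2), P->Q (5, 6), Q->P (8, 9)
and P->Q (11, 12). Each rollback orphans an earlier message, so the
function returns {"P": 0, "Q": 0}: both processes fall all the way
back to their initial state. If both processes instead checkpoint
once at time 6, as a coordinated protocol would arrange, the same
messages give {"P": 6, "Q": 6}.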

BLCR's newest feature, the ability to checkpoint/restart process
groups within a single machine, might be another, smaller area that
you could look at: for example, how to checkpoint/restart a process
group that communicates via shared memory using an Uncoordinated C/R
Coordination Protocol or something like it.

-- Josh

>
> Thanks.
>
> -- 
> N. Rajagopal,
> Visit me at http://users.kaski-net.net/~raj/
