Re: Current status of BLCR

jcduell_at_lbl.gov
Date: Mon May 12 2003 - 09:24:13 PDT


On Mon, May 05, 2003 at 11:15:53AM -0600, David Flynn wrote:
> My name is David Flynn.  I work for Linux NetworX. I've been
> following your work in checkpoint/restart and would very much like to
> see it ready for use in production environments.  This would add great
> value for the spectrum of HPC customers which we supply.  To that end
> I'd like to lend a hand.  I have access here to a wide range of clusters
> of various configurations, and close ties to vendors of commercial MPI
> codes.  Could you please advise me as to the current status of the
> project, availability of source code, and how I might be of service.

Good to hear from you, David.  We're interested in having our stuff get
into production, too.

The current status of the project is that we are about to do a first
beta release.  The major limitation of that release will be that open
files will not be restored across checkpoint/restart (except for
stdout/stderr/stdin).  We hope to tackle restoring open file handles
this summer.

On the MPI front, we have been working with the LAM MPI team
(http://www.lam-mpi.org/) to ensure that our stuff presents a generic
interface that any MPI library ought to be able to use.  The key idea is
that our checkpoint code can notify an MPI library when a checkpoint is
about to happen (and when a restart has occurred), so that the MPI
library can do whatever it needs to in order to transparently support
checkpointing.  Usually, this will mean that the MPI library drains all
in-flight messages from the network at checkpoint time, and
reestablishes network connections at restart.
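
As a rough illustration of the notification idea described above, here
is a minimal sketch in Python.  It is not the project's actual interface
and not LAM's code: every name in it (CheckpointNotifier,
ToyMessageLayer, on_checkpoint, on_restart, demo) is hypothetical and
chosen only for the example.  It shows just the shape of the protocol: a
library registers for notifications, drains pending messages before a
checkpoint, and reconnects after a restart.

```python
# Illustrative sketch only: none of these names come from the real
# checkpoint/restart code or from LAM MPI.
from collections import deque


class CheckpointNotifier:
    """Hypothetical stand-in for the checkpoint system's notification hooks."""

    def __init__(self):
        self._listeners = []

    def register(self, listener):
        # A listener is any object with on_checkpoint() and on_restart().
        self._listeners.append(listener)

    def checkpoint(self):
        # Tell every listener that a checkpoint is about to happen.
        for listener in self._listeners:
            listener.on_checkpoint()

    def restart(self):
        # Tell every listener that a restart has occurred.
        for listener in self._listeners:
            listener.on_restart()


class ToyMessageLayer:
    """Hypothetical message-passing library that cooperates with checkpoints."""

    def __init__(self, peers):
        self.peers = list(peers)
        self.connected = set(self.peers)
        self.in_flight = deque()  # messages still "in the network"
        self.delivered = []       # messages held in process memory

    def send(self, peer, payload):
        if peer not in self.connected:
            raise RuntimeError("no connection to %r" % (peer,))
        self.in_flight.append((peer, payload))

    def on_checkpoint(self):
        # Drain the network so that no message lives outside process state,
        # then drop connections (assumed here not to survive a restart).
        while self.in_flight:
            self.delivered.append(self.in_flight.popleft())
        self.connected.clear()

    def on_restart(self):
        # Reestablish a connection to every known peer.
        self.connected = set(self.peers)


def demo():
    notifier = CheckpointNotifier()
    layer = ToyMessageLayer(["rank1", "rank2"])
    notifier.register(layer)
    layer.send("rank1", "hello")
    layer.send("rank2", "world")
    notifier.checkpoint()  # messages drained, connections dropped
    notifier.restart()     # connections reestablished
    return layer
```

Calling demo() runs one checkpoint/restart cycle on the toy layer.  The
application code that calls send() never needs to know a checkpoint
happened, which is the sense in which the support is transparent.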

Our code is already working with LAM MPI, and we're able to checkpoint
some standard MPI programs (e.g., the NAS Parallel Benchmarks) with it.

I'm guessing you'll probably want to wait until we support open files
before shipping our code in a production environment.  But if you want
to help us move along, we could definitely use access to various cluster
environments.  Also, if you can get MPI vendors to start implementing
the hooks they need to work with our stuff (or at least get it on their
radar screen), that'd be great.

The homepage for the project is 

    http://www.nersc.gov/research/FTG/checkpoint/

and we have a couple of papers describing the interface at 

    http://www.nersc.gov/research/FTG/checkpoint/reports.html

The LAM team has written a paper on how they integrated their MPI
implementation with our interface--I don't see it on their site right
now, but I'll email them and ask them to put it up.

Feel free to contact me if you've got any more questions.

Cheers,

-- 
Jason Duell             Future Technologies Group,
<jcduell_at_lbl_dot_gov>       Computational Research Division
Tel: +1-510-495-2354    Lawrence Berkeley National Laboratory