jcduell_at_lbl.gov
Date: Mon May 12 2003 - 09:24:13 PDT
On Mon, May 05, 2003 at 11:15:53AM -0600, David Flynn wrote:
> My name is David Flynn. I work for Linux NetworX. I've been
> following your work in checkpoint/restart and would very much like to
> see it ready for use in production environments. This would add great
> value for the spectrum of HPC customers which we supply. To that end
> I'd like to lend a hand. I have access here to a wide range of clusters
> of various configurations, and close ties to vendors of commercial MPI
> codes. Could you please advise me as to the current status of the
> project, availability of source code, and how I might be of service.

Good to hear from you, David. We're interested in having our stuff get into production, too.

The current status of the project is that we are about to do a first beta release. The major limitation of that release will be that open files will not be restored across checkpoint/restart (except for stdout/stderr/stdin). We hope to tackle restoring filehandles this summer.

On the MPI front, we have been working with the LAM MPI team (http://www.lam-mpi.org/) to ensure that our stuff presents a generic interface that any MPI library ought to be able to use. The key idea is that our checkpoint code can notify an MPI library when a checkpoint is about to happen (and when a restart has occurred), so that the MPI library can do whatever it needs to in order to transparently support checkpointing. Usually, this will mean that the MPI library drains all the messages in the network at checkpoint time, and reestablishes network connections at startup. Our code is already working with LAM MPI, and we're able to checkpoint some standard MPI programs (i.e., the NAS Parallel Benchmarks) with it.

I'm guessing you'll probably want to wait until we support open files before shipping our code in a production environment. But if you want to help us move along, we could definitely use access to various cluster environments.
Also, if you can get MPI vendors to start implementing the hooks they need to work with our stuff (or at least get it on their radar screen), that'd be great.

The homepage for the project is

    http://www.nersc.gov/research/FTG/checkpoint/

and we have a couple of papers describing the interface at

    http://www.nersc.gov/research/FTG/checkpoint/reports.html

The LAM team has written a paper on how they integrated their MPI implementation with our interface--I don't see it on their site right now, but I'll email them and ask them to put it up.

Feel free to contact me if you've got any more questions.

Cheers,

--
Jason Duell               Future Technologies Group,
<jcduell_at_lbl_dot_gov>  Computational Research Division
Tel: +1-510-495-2354      Lawrence Berkeley National Laboratory