From: jcduell_at_lbl_dot_gov
Date: Fri Dec 17 2004 - 13:03:35 PST

Darin:

Our implementation requires that the MPI library handle shutting down
and restoring network connections during checkpoint/restart. So your
question really boils down to: is it likely that there will be a
BLCR-enabled MPI library that runs over Myrinet?

I haven't asked the LAM/MPI developers (http://www.lam-mpi.org/) whether
they are planning to do this for Myrinet, and in what time frame if so,
but they are the most likely candidates for the support you want--LAM
already supports our stuff over TCP/IP, and LAM also works over Myrinet,
so presumably they've got at least the design in place for a
checkpointable Myrinet layer.

The MPICH team has plans to become BLCR-enabled, but they're not even in
the design phase yet.

I'm curious about your statement that RMS/Quadrics can already
checkpoint/restart. I wasn't aware of that, and I don't see such a
feature listed on their website.

-- 
Jason Duell               Future Technologies Group
<jcduell_at_lbl_dot_gov>  Computational Research Division
Tel: +1-510-495-2354      Lawrence Berkeley National Laboratory

On Fri, Dec 17, 2004 at 03:43:48PM -0500, Darin wrote:
>
> Dear Jason,
>
> The work described on
>
> http://ftg.lbl.gov/twiki/bin/view/FTG/CheckpointRestart
>
> looks very interesting. I have followed this for several years,
> but always go away frustrated because of the lack of a clear list
> of features. After an hour of reading, I can't tell if this
> software works (or will eventually work) on a cluster with
> myrinet and mpich.
>
> We have a typical 50 processor cluster that we would like to
> let week long jobs run on, but be able to swap these out and
> let short jobs run, then resume the longer jobs.
>
> We are considering a Quadrics network because their RMS software
> can do this.
>
> Can you clarify whether BLCR would work with Myrinet? This appears
> to be a monumental undertaking, and would make our cluster
> much more useable.
>
>
> Thanks very much.
>
>
> --
> Darin