Re: blcr for mpi/myrinet jobs?

Date: Fri Dec 17 2004 - 13:03:35 PST

  • Next message: Jeff Squyres: "Re: blcr for mpi/myrinet jobs?"
    Our implementation requires that the MPI library handle shutting down
    and restoring network connections during checkpoint/restart.  So your
    question really boils down to:  is it likely that there will be a
    BLCR-enabled MPI library that runs over Myrinet?  I haven't asked the
    LAM/MPI developers whether they are planning
    to do this for Myrinet, and in what time frame if so, but they are the
    most likely candidates for the support you want--LAM already supports
    our stuff over TCP/IP, and LAM also works over Myrinet, so presumably
    they've got at least the design in place for a checkpointable Myrinet

    The MPICH team has plans to become BLCR-enabled, but they're not even in
    the design phase yet.

    I'm curious about your statement that RMS/Quadrics can already
    checkpoint/restart.  I wasn't aware of that, and I don't see such a
    feature listed on their website.

    Jason Duell             Future Technologies Group
    <jcduell_at_lbl_dot_gov>       Computational Research Division
    Tel: +1-510-495-2354    Lawrence Berkeley National Laboratory

    On Fri, Dec 17, 2004 at 03:43:48PM -0500, Darin wrote:
    > Dear Jason,
    >
    > The work described on
    > looks very interesting.  I have followed this for several years,
    > but always go away frustrated because of the lack of a clear list
    > of features.  After an hour of reading, I can't tell if this
    > software works (or will eventually work) on a cluster with
    > Myrinet and MPICH.
    >
    > We have a typical 50-processor cluster that we would like to
    > let week-long jobs run on, while being able to swap these out and
    > let short jobs run, then resume the longer jobs.
    >
    > We are considering a Quadrics network because their RMS software
    > can do this.
    >
    > Can you clarify whether BLCR would work with Myrinet?  This appears
    > to be a monumental undertaking, and would make our cluster
    > much more usable.
    >
    > Thanks very much.
    > -- 
    > Darin 
