Re: Sriram's first week

From: Paul H. Hargrove (
Date: Fri May 10 2002 - 17:21:59 PDT

I am hoping that the work this summer will lead to a paper which covers
the implementation-agnostic aspects of checkpointing an MPI
implementation.  So, I agree with Rusty that we need to "think from the
beginning in a more abstract way about checkpointing requirements."  LAM
will be our "reference implementation" in some sense.  It is possible that
particular design decisions made in LAM or MPICH could make it more or
less difficult to meet the abstract requirement in one implementation
than the other.  In the near term, LAM gives us a platform to produce
something concrete.


Rusty Lusk wrote:
> I am of course interested in how this work can be made relevant to
> multiple MPI implementations.  This doesn't mean that it should not
> take advantage of features found only in LAM, but it should also
> focus on defining what checkpointing needs from the MPI implementation in
> order to function well.  We would then be interested in adding such
> functionality to MPICH.  The point is not to have it then work on two
> implementations instead of one, but to think from the beginning in a
> more abstract way about checkpointing requirements.
> Rusty
> | So, I want to know where Sriram is with respect to LAM/MPI and
> | checkpoint/restart.  Is there specific work in LAM that Sriram is
> | already doing and should continue?  Should we dive right into discussing
> | how we expect to trigger LAM in the event of a checkpoint?  Is Sriram in
> | a position to teach us Berkeley folks about how LAM applications
> | interact with the lamd and how the lamds interact with each other?
> |
> | A second issue is whether we should plan to have a conference call
> | sometime during this first week, or wait for the AG time the following
> | week?

Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
NERSC Future Technologies Group           Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-495-2998