From: Paul H. Hargrove (PHHargrove_at_lbl.gov)
Date: Fri May 10 2002 - 17:21:59 PDT
I am hoping that the work this summer will lead to a paper which covers
the implementation-agnostic aspects of checkpointing an MPI. So, I agree
with Rusty that we need to "think from the beginning in a more abstract
way about checkpointing requirements."

LAM will be our "reference implementation" in some sense. It is possible
that particular design decisions made in LAM or MPICH could make it more
or less difficult to meet the abstract requirements in one implementation
than in the other. In the near term, LAM provides us a platform to produce
something concrete.

-Paul

Rusty Lusk wrote:
>
> I am of course interested in how this work can be made relevant to
> multiple MPI implementations. This doesn't mean that it should not
> take advantage of features found only in LAM, but it should also
> focus on defining what checkpointing needs from the MPI implementation in
> order to function well. We would then be interested in adding such
> functionality to MPICH. The point is not to have it then work on two
> implementations instead of one, but to think from the beginning in a
> more abstract way about checkpointing requirements.
>
> Rusty
>
> | So, I want to know where Sriram is with respect to LAM/MPI and
> | checkpoint/restart. Is there specific work in LAM that Sriram is
> | already doing and should continue? Should we dive right into discussing
> | how we expect to trigger LAM in the event of a checkpoint? Is Sriram in
> | a position to teach us Berkeley folks about how LAM applications
> | interact with the lamd and how the lamds interact with each other?
> |
> | A second issue is whether we should plan to have a conference call
> | sometime during this first week, or wait for the AG time the following
> | week?

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
NERSC Future Technologies Group           Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-495-2998