LBNL/IU AG conf reminder

From: Paul H. Hargrove (
Date: Mon Nov 25 2002 - 09:42:15 PST

This is a reminder about a telephone conference scheduled for 3-4pm PST,
Tuesday, November 26, 2002.

LBNL folks please note the change in local phone number.

To attend:
       Long Distance users call 1-877-252-5250,
       Local users call 510-486-5008,
       LBNL on-site users call x5008,
then press 1, enter 217373# and follow the instructions.

We'll begin with reports on SC2002.  We should then tackle planning for 
future directions on our collaboration.  We can start with the list that 
Jeff e-mailed out (attached for those not on the lam-cr list).


Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
NERSC Future Technologies Group           Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-495-2998

attached mail follows:

I think everyone was drooling over CR at SC.  This is great.

So Sriram -- we need to finish several things in LAM (not necessarily in
any particular order):

1. Do the fast stuff.  I think this is probably the first order of
business, and we talked about it a bit.  See where you can get with that.
It would be good to have a totally finished version of the TCP RPI.

2. Let's think about some features that we want to give to LAM users to
make this all work.  Some obvious ones that jump to mind (all of which are
subject to discussion) are:

  - Hit ctrl-z in mpirun, and it checkpoints and kills.  "fg" (in the
    shell) would restore.
  - Abstract away cr_save and cr_restore into some kind of LAM commands
    (so that we can include other CR libraries and still have LAM users
    use a uniform interface, such as lam_save and lam_restore...?  They
    can just fork/exec the right underlying command, or use the
    appropriate CR library API call... you get the idea)
  - Perhaps a command line option to mpirun would auto-invoke a checkpoint
    if you hit ctrl-C (i.e., SIGINT) -- this might be helpful for batch
    systems, and slightly cleaner than ctrl-Z (i.e., the job is not still
    running when it dies)...?
  - A little better/more uniform control over where the checkpoint files
    go (command line interface, most likely)
  - Fix up the mpirun docs (man page and --help output) to describe all of
  - ...?
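
A minimal sketch of the signal side of the ctrl-C / ctrl-Z ideas above,
assuming a POSIX system.  Every name here (checkpoint_requested,
install_checkpoint_handlers, consume_checkpoint_request) is hypothetical
and not taken from LAM's mpirun; the handler only records the request,
and the actual checkpoint would be started later from normal code.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical flag: set by the handler, polled by the main loop. */
static volatile sig_atomic_t checkpoint_requested = 0;

/* Handler does only async-signal-safe work: it records the request. */
static void on_checkpoint_signal(int signo)
{
    (void) signo;
    checkpoint_requested = 1;
}

/* Install the handler for SIGINT (ctrl-C) and SIGTSTP (ctrl-Z).
 * Returns 0 on success, -1 if either sigaction() call fails. */
int install_checkpoint_handlers(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_checkpoint_signal;
    sigemptyset(&sa.sa_mask);

    if (sigaction(SIGINT, &sa, NULL) != 0)
        return -1;
    if (sigaction(SIGTSTP, &sa, NULL) != 0)
        return -1;
    return 0;
}

/* Returns 1 if a checkpoint was requested since the last call and
 * clears the request; returns 0 otherwise. */
int consume_checkpoint_request(void)
{
    if (!checkpoint_requested)
        return 0;
    checkpoint_requested = 0;
    return 1;
}
```

In this sketch the main loop would poll consume_checkpoint_request() and,
when it returns 1, invoke whatever save command is chosen, outside of
signal context.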

3. Clean up all the race conditions in mpirun.  Is it time to re-write
mpirun, perhaps in C++?

4. Fix the memory problems that Brian identified before SC but didn't have
time to chase down.

5. Think about abstracting away all the CR code into its own SSI.  This
would allow us to handle multiple different CR libraries/run-time systems.
Hence, the CRTCP RPI wouldn't call cr_checkpoint() directly; it would probably call
lam_ssi_cr_checkpoint(), and you'd have a module for LBL's CR library that
would call cr_checkpoint() (and a module for Condor's checkpoint, and
...).  Let's talk about this when I come to Bloomies in 2 weeks.
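
One way the dispatch in item 5 could look, as a rough sketch only.  The
name lam_ssi_cr_checkpoint() comes from the item above, but its
signature, the module struct, lam_ssi_cr_select(), and the two stub
modules are all assumptions made for illustration; the stubs merely
count calls where a real module would call into its CR library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical module interface: one entry per CR library. */
typedef struct lam_ssi_cr_module {
    const char *name;
    int (*checkpoint)(void);    /* returns 0 on success */
} lam_ssi_cr_module_t;

/* Stand-in for a module that would call LBL's cr_checkpoint(). */
int lbl_stub_calls = 0;
static int lbl_stub_checkpoint(void) { lbl_stub_calls++; return 0; }

/* Stand-in for a module that would call Condor's checkpoint. */
int condor_stub_calls = 0;
static int condor_stub_checkpoint(void) { condor_stub_calls++; return 0; }

static const lam_ssi_cr_module_t modules[] = {
    { "lbl",    lbl_stub_checkpoint },
    { "condor", condor_stub_checkpoint },
};

static const lam_ssi_cr_module_t *selected = NULL;

/* Select a module by name.  Returns 0 on success; returns -1 and
 * leaves the current selection unchanged if the name is unknown. */
int lam_ssi_cr_select(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(modules) / sizeof(modules[0]); i++) {
        if (strcmp(modules[i].name, name) == 0) {
            selected = &modules[i];
            return 0;
        }
    }
    return -1;
}

/* What the CRTCP RPI would call instead of a library-specific
 * function.  Returns -1 if no module has been selected. */
int lam_ssi_cr_checkpoint(void)
{
    if (selected == NULL)
        return -1;
    return selected->checkpoint();
}
```

The point of the indirection is that the RPI only ever sees
lam_ssi_cr_checkpoint(); adding another CR library means adding one more
table entry.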

6. Measure the performance of the TCP RPI vs. the CRTCP RPI.  There should
really be no noticeable difference (the main difference in the main code
path is maintaining the bookmark counters, and that should be trivial),
but we need to be sure.
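
To illustrate why the bookmark bookkeeping should be cheap, here is one
plausible reading of "bookmark counters": a pair of per-peer byte
counts, bumped on every send and receive.  This is a guess at the shape
of the idea, not the CRTCP RPI's actual data structure, and all names
are hypothetical.

```c
#include <stdint.h>

/* Hypothetical per-peer bookmark: running byte counts. */
typedef struct bookmark {
    uint64_t bytes_sent;
    uint64_t bytes_received;
} bookmark_t;

/* The only extra work on the main code path: one addition each. */
void bookmark_on_send(bookmark_t *b, uint64_t n) { b->bytes_sent += n; }
void bookmark_on_recv(bookmark_t *b, uint64_t n) { b->bytes_received += n; }

/* Given the byte count the peer reports having sent to us, return
 * how many bytes are still in flight and must be drained before the
 * connection is quiet.  Assumes peer_sent >= local->bytes_received. */
uint64_t bookmark_bytes_in_flight(const bookmark_t *local,
                                  uint64_t peer_sent)
{
    return peer_sent - local->bytes_received;
}
```

Under this reading the per-message overhead is a single integer
addition, which is consistent with the expectation that the two RPIs
perform about the same; the measurement is still needed to confirm it.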

7. Start writing up a paper and/or your thesis.  A lot of work has been
done, and we should get it down in writing before it falls out of your
brain.  We need to get some publications about this.

{+} Jeff Squyres
lam-cr mailing list