From: Paul H. Hargrove (PHHargrove_at_lbl.gov)
Date: Mon Nov 25 2002 - 09:42:15 PST
This is a reminder about a telephone conference scheduled for 3-4pm PST, Tuesday, November 26, 2002. LBNL folks please note the change in local phone number.

To attend:
  Long Distance users call 1-877-252-5250
  Local users call 510-486-5008
  LBNL on-site users call x5008
then press 1, enter 217373# and follow the instructions.

We'll begin with reports on SC2002. We should then tackle planning for future directions on our collaboration. We can start with the list that Jeff e-mailed out (attached for those not on the lam-cr list).

-Paul

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
NERSC Future Technologies Group           Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-495-2998
attached mail follows:
I think everyone was drooling over CR at SC. This is great.

So Sriram -- we need to finish several things in LAM (not necessarily in any particular order):

1. Do the fast stuff. I think this is probably the first order of business, and we talked about it a bit. See where you can get with that. It would be good to have a totally finished version of the TCP RPI.

2. Let's think about some features that we want to give to LAM users to make this all work. Some obvious ones that jump to mind (all of which are subject to discussion):

   - Hit ctrl-Z in mpirun, and it checkpoints and kills. "fg" (in the shell) would restore.
   - Abstract away cr_save and cr_restore into some kind of LAM commands (so that we can include other CR libraries and still have LAM users use a uniform interface, such as lam_save and lam_restore...? They can just fork/exec the right underlying command, or use the appropriate CR library API call... you get the idea)
   - Perhaps a command line option to mpirun would auto-invoke a checkpoint if you hit ctrl-C (i.e., SIGINT) -- this might be helpful for batch systems, and slightly cleaner than ctrl-Z (i.e., the job is not still running when it dies)...?
   - A little better/more uniform control over where the checkpoint files go (command line interface, most likely)
   - Fix up the mpirun docs (man page and --help output) to describe all of this.
   - ...?

3. Clean up all the race conditions in mpirun. Is it time to re-write mpirun, perhaps in C++?

4. Fix the memory problems that Brian identified before SC but didn't have time to chase down.

5. Think about abstracting away all the CR code into its own SSI. This would allow us to handle multiple different CR libraries/run-time systems. Hence, the CRTCP RPI wouldn't call cr_checkpoint(), it would probably call lam_ssi_cr_checkpoint(), and you'd have a module for LBL's CR library that would call cr_checkpoint() (and a module for Condor's checkpoint, and ...). Let's talk about this when I come to Bloomies in 2 weeks.

6. Measure the performance of the TCP RPI vs. the CRTCP RPI. There should be really no noticeable difference (the only difference in the main code path is maintaining the bookmark counters, and that should be trivial), but we need to be sure.

7. Start writing up a paper and/or your thesis. A lot of work has been done, and we should get it down in writing before it falls out of your brain. We need to get some publications about this.

-- 
{+} Jeff Squyres
{+} [email protected]
{+} http://www.lam-mpi.org/
_______________________________________________
lam-cr mailing list
[email protected]
http://www.lam-mpi.org/mailman/listinfo.cgi/lam-cr
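
For discussion, here is a rough sketch of the dispatch idea from items 2 and 5 above: callers ask a single entry point for a checkpoint, and a named module decides which underlying command to fork/exec. It is written in Python only to keep it short; the real thing would presumably be C inside LAM. Everything except the name lam_ssi_cr_checkpoint is an assumption made up for illustration (register_module, the argument list, the "demo" module), and the demo module runs a harmless stand-in command rather than any real CR tool, since the actual command lines are not specified in this thread.

```python
"""Hypothetical sketch of a pluggable checkpoint/restart (CR) dispatch layer."""
import subprocess
import sys
from typing import Callable, Dict, List

# A module maps (pid, checkpoint_dir) to the argv of its backend command.
# This signature is an assumption for illustration, not a LAM interface.
CommandBuilder = Callable[[int, str], List[str]]

# Registry of available CR modules, keyed by module name.
_modules: Dict[str, CommandBuilder] = {}


def register_module(name: str, builder: CommandBuilder) -> None:
    """Make a CR backend selectable by name (hypothetical helper)."""
    _modules[name] = builder


def lam_ssi_cr_checkpoint(module: str, pid: int, ckpt_dir: str) -> int:
    """Dispatch a checkpoint request to the chosen module.

    Runs the module's backend command as a child process and returns its
    exit status. Raises ValueError if no such module is registered.
    """
    try:
        builder = _modules[module]
    except KeyError:
        raise ValueError(f"no CR module named {module!r}") from None
    argv = builder(pid, ckpt_dir)
    return subprocess.run(argv, check=False).returncode


# Stand-in module: runs the Python interpreter as a no-op child process.
# A real module would build the command line of its own CR library here.
register_module("demo", lambda pid, ckpt_dir: [sys.executable, "-c", "pass"])
```

With this shape, something like lam_save would reduce to choosing a module name and calling the one entry point, and adding another CR system would mean registering one more builder without touching the callers.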