From: Paul H. Hargrove (PHHargrove_at_lbl.gov)
Date: Mon Jun 03 2002 - 20:42:59 PDT
Summary of AG conference on 03 JUN 2002

In attendance:
  Jason Duell                 LBNL
  Brent Gorda                 LBNL
  Paul Hargrove               LBNL
  Eric Roman                  LBNL
  Andy Lumsdaine              IU
  Sriram Sankaran             IU, on loan to LBNL for Summer
  Jeff Squires (telephone)    LAM/MPI

One of the main goals of this conference was to make sure we could get the Access Grid working for us, plus a phone-in. It worked, after a flurry of e-mail between the operators to get a virtual venue selected.

The main content of the call was devoted to talking about how we might let a checkpoint-aware library do its work outside of signal context. The current design admits multiple checkpoint-aware libraries by invoking the handlers in the reverse of their registration order, like atexit(). This means there are no issues of deadlock due to strange interactions between the libraries. However, there are a lot of problems related to non-reentrant code. All of libc is thread safe, but much of it is not reentrant, and so cannot safely be called from a signal handler. Among the things not reentrant is malloc().

Paul and Jason had a preliminary discussion before the call of how we might let a handler "acknowledge" a checkpoint request, but indicate that it would call back later to complete the actual checkpoint. This would allow the actual work to be done outside signal context. Additionally, the signal handler would return, so the application would resume execution for a while. This is important, for instance, if the signal handler had run while the application was holding the malloc mutex. We are calling this mode of operation "asynchronous" - to mean that quiescing the network can proceed (and complete) independent of when the handler (in signal context) returns.

On this call we discussed in broad strokes the idea of the asynchronous checkpoint and the fact that the major hurdle is avoiding deadlock in the case of two or more checkpoint-aware libraries.
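
For illustration, here is a minimal sketch in C of the "acknowledge now, complete later" idea, assuming a single checkpoint-aware library. Every name in it (ckpt_pending, ckpt_signal_handler, ckpt_poll, ckpt_install) is hypothetical and invented for this example; it is not the interface discussed on the call. The signal handler only sets a flag, and the work that needs malloc() runs later from normal context.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>

/* All names below are hypothetical, for illustration only. */

static volatile sig_atomic_t ckpt_pending = 0; /* set in signal context */
static int ckpt_completed = 0;                 /* checkpoints finished so far */

/* Signal handler: only "acknowledges" the request and returns.
 * Storing to a volatile sig_atomic_t is safe here; malloc() is not. */
static void ckpt_signal_handler(int signo)
{
    (void)signo;
    ckpt_pending = 1;
}

/* Called from normal (non-signal) context, for example from the
 * library's progress loop. Returns 1 if a checkpoint was performed. */
static int ckpt_poll(void)
{
    if (!ckpt_pending)
        return 0;
    ckpt_pending = 0;

    /* We are outside the signal handler, so we did not interrupt
     * malloc() and it is safe to call it here. */
    char *state = malloc(64);
    if (state == NULL)
        return 0;
    strcpy(state, "quiesce network, save state");
    /* ... the real checkpoint work would go here ... */
    free(state);

    ckpt_completed++;
    return 1;
}

/* Install the handler for the given signal. Returns 0 on success. */
static int ckpt_install(int signo)
{
    struct sigaction sa;

    assert(signo > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = ckpt_signal_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}
```

A library would call ckpt_install() once at startup and ckpt_poll() at points where it holds no locks. The sketch sidesteps the question of deadlock between multiple checkpoint-aware libraries entirely.
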
With proper documentation, we think a well-defined set of rules and liberal use of "checkpoint critical sections" will PROBABLY get us clear of this.

It was resolved to just pretend for the moment that we only need one checkpoint-aware library and postpone detailed work on the deadlock problem. Instead, Paul will try to get something done ASAP which lets Sriram develop the checkpoint handler for LAM/MPI outside of signal context.

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
NERSC Future Technologies Group           Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-495-2998