More comments on paper

From: Paul H. Hargrove
Date: Fri Feb 28 2003 - 12:58:17 PST

Sriram et al.,

Section 1
   The paragraph on process migration makes it sound as if the only
motivation is "If the imminent failure of a node is know ahead of time".
While that _is_ one possible reason to perform process migration, it
is not the best example, as it sounds to most people like an unlikely
scenario.  Other uses include network load balancing on a non-uniform
network topology, and scheduling flexibility.

Section 2
   I agree with Jason that "single-process checkpoint" may not be the
best term to use.  We want to convey that the scope is local to a single
node, without restricting ourselves too much.

Section 3
   One must also save signal handler registrations, the signal mask, and 
pending signals.
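
   As a rough illustration of the three pieces of signal state listed
above, here is a minimal Python sketch that captures them for the
current process.  It is not taken from the paper or from any
checkpoint/restart implementation; the names snapshot_signal_state()
and demo() are hypothetical, and the code assumes a POSIX system and
Python 3.8 or later.  A C implementation would presumably use
sigaction(), sigprocmask() and sigpending() for the same purpose.

```python
import signal


def snapshot_signal_state():
    """Capture handler registrations, the signal mask, and pending signals."""
    # Handler registered for every signal number valid on this platform.
    handlers = {sig: signal.getsignal(sig) for sig in signal.valid_signals()}
    # Blocking an empty set changes nothing and returns the current mask.
    mask = signal.pthread_sigmask(signal.SIG_BLOCK, [])
    # Signals that have been raised but are still blocked.
    pending = signal.sigpending()
    return {"handlers": handlers, "mask": mask, "pending": pending}


def demo():
    """Create one example of each kind of state, then snapshot it."""
    def on_usr1(signum, frame):
        pass  # placeholder handler, only here so there is something to record

    old_handler = signal.signal(signal.SIGUSR1, on_usr1)
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, [signal.SIGUSR1])
    try:
        signal.raise_signal(signal.SIGUSR1)  # stays pending while blocked
        state = snapshot_signal_state()
    finally:
        # Restoring the mask delivers the pending signal to on_usr1.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
        signal.signal(signal.SIGUSR1, old_handler)
    return state, on_usr1
```

   A checkpoint that omitted any of the three would restart with default
handlers, an empty mask, or a lost signal, which is why all of them
belong in the list.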
   I agree w/ Jason that "checkpoint the network or drain all the data" 
does not convey any clear distinction to me.

Section 4.1, 4.2, 4.3
   I agree w/ Jason's comments here.

Section 4.3.1
   In addition to Jason's comments, I'd note that you don't describe WHY
both types of callback are needed.  The thread is needed so that it can
block waiting for the application checkpoints w/o deadlocking, and so
that bad things don't happen if the mpirun was communicating w/ the
lamd at the time the checkpoint request arrived.  The signal context
handler is needed at restart time because the exec() from another thread
would (under Linux) result in a changed PID and other related issues.

Section 4.3.2
   The first two sentences of the first paragraph could be rewritten to
"Thread context callbacks allow the main application thread to continue
running, and are necessary because most functions in the C library are
not reentrant, even when thread safe.  Use of non-reentrant functions
from signal context can result in deadlock."
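
   The deadlock mentioned in the suggested wording can be sketched with
a small Python analogue.  This is only an illustration under stated
assumptions, not code from the paper: state_lock stands in for a
non-reentrant lock inside a library, and handler() and demo() are
hypothetical names.  The handler uses a non-blocking acquire so that
the sketch reports the problem instead of hanging; a blocking acquire
at that point would never return, because the interrupted code holds
the lock and cannot resume until the handler finishes.

```python
import signal
import threading

# Stand-in for a non-reentrant lock protecting library-internal state.
state_lock = threading.Lock()
observed = []


def handler(signum, frame):
    # Runs in the main thread, on top of whatever code was interrupted.
    got_it = state_lock.acquire(blocking=False)
    observed.append(got_it)
    if got_it:
        state_lock.release()


def demo():
    """Raise a signal while the lock is held and record what the handler saw."""
    old = signal.signal(signal.SIGUSR1, handler)
    try:
        with state_lock:
            signal.raise_signal(signal.SIGUSR1)
            # Give the interpreter a chance to run the handler while the
            # lock is still held (older Pythons run it at the next check).
            for _ in range(100000):
                if observed:
                    break
    finally:
        signal.signal(signal.SIGUSR1, old)
    return observed
```

   Running the same work from a separate thread avoids the problem: a
thread that blocks on the lock simply waits until the main thread
releases it.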
   Second paragraph might be written to make it clearer that the lock 
traffic required for C/R does not introduce any contention in the normal 
case, and also that the lock was _already_ in the code to support 

Section 4.3.3
   You claim that no special processing is required while quiescing, but 
don't you need to prevent long acks from being sent?
   Second paragraph might be better starting with "The procedure 
described above is sufficient if the application thread is not blocked 
on a read() at the time a checkpoint request arrives."

Section 5
   You say "Figure 5" in two places, one of which should be 3 and the 
other 4.
   Figure 3: I can't tell the lines apart.
   Figure 4: Perhaps a better plot would be "Fraction of bandwidth lost 
by adding checkpoint/restart"
   When stating the 0.5% degradation, you should also indicate the
standard deviation in your measurements.  When I tried to measure this I
found that the standard deviation was larger than the measured
degradation, making its measurement imprecise.
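
   To make the last two points concrete, the following Python sketch
computes the "fraction of bandwidth lost" suggested for Figure 4 and
compares it with the relative standard deviation of the runs.  The
function names are hypothetical and the sample values in the usage
note are invented placeholders, not measurements from the paper or
from my own tests.

```python
import statistics


def fraction_lost(baseline_bw, checkpoint_bw):
    """Fraction of baseline bandwidth lost with checkpoint/restart enabled."""
    return (baseline_bw - checkpoint_bw) / baseline_bw


def summarize(baseline_runs, checkpoint_runs):
    """Compare the mean degradation against the run-to-run spread."""
    base_mean = statistics.mean(baseline_runs)
    cr_mean = statistics.mean(checkpoint_runs)
    loss = fraction_lost(base_mean, cr_mean)
    # Relative spread: sample standard deviation divided by the mean,
    # taking the larger of the two configurations.
    noise = max(
        statistics.stdev(baseline_runs) / base_mean,
        statistics.stdev(checkpoint_runs) / cr_mean,
    )
    # The degradation is only resolvable if it exceeds the spread.
    return {"loss": loss, "noise": noise, "resolvable": loss > noise}
```

   For example, with placeholder bandwidths of [100.0, 102.0, 98.0,
101.0, 99.0] without checkpoint support and [99.5, 101.5, 97.5, 100.5,
98.5] with it, the mean loss is 0.5% while the relative spread is
roughly 1.6%, so the degradation cannot be distinguished from noise.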

Section 6
   Our current work does let us migrate an entire checkpointed job, but 
does not permit the migration of a subset of processes while the others 
remain "live".  You might wish to make the distinction.

Section 7
   The assertion that the degradation is "negligible" might be better 
supported with the numbers as well.

   In the reference list, commas are missing between authors' names in
most entries.

Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-495-2998