From: Paul H. Hargrove (PHHargrove_at_lbl.gov)
Date: Fri Feb 28 2003 - 12:58:17 PST
Sriram et al.,

Section 1
The paragraph on process migration makes it sound as if the only motivation is "If the imminent failure of a node is know ahead of time". While that _is_ one possible reason to perform process migration, it is not the best example, as it sounds to most people like an unlikely scenario. Other uses include network load balancing on a non-uniform network topology, and scheduling flexibility.

Section 2
I agree with Jason that "single-process checkpoint" may not be the best term to use. We want to convey that the scope is local to a single node, without restricting ourselves too much.

Section 3
One must also save signal handler registrations, the signal mask, and pending signals. I agree w/ Jason that "checkpoint the network or drain all the data" does not convey any clear distinction to me.

Section 4.1, 4.2, 4.3
I agree w/ Jason's comments here.

Section 4.3.1
In addition to Jason's comments, I'd note that you don't describe WHY both types of callback are needed. The thread is needed so that it can block waiting for the application checkpoints w/o deadlocking, and so that bad things don't happen if the mpirun was communicating w/ the lamd at the time the checkpoint request arrived. The signal context handler is needed at restart time because the exec() from another thread would (under Linux) result in a changed PID and other related issues.

Section 4.3.2
The first two sentences of the first paragraph could be rewritten to "Thread context callbacks allow the main application thread to continue running, and are necessary because most functions in the C library are not reentrant, even when thread safe. Use of non-reentrant functions from signal context can result in deadlock."
The second paragraph might be written to make it clearer that the lock traffic required for C/R does not introduce any contention in the normal case, and also that the lock was _already_ in the code to support MPI_THREAD_SERIALIZED even w/o C/R.

Section 4.3.3
You claim that no special processing is required while quiescing, but don't you need to prevent long acks from being sent? The second paragraph might be better starting with "The procedure described above is sufficient if the application thread is not blocked on a read() at the time a checkpoint request arrives."

Section 5
You say "Figure 5" in two places, one of which should be 3 and the other 4.
Figure 3: I can't tell the lines apart.
Figure 4: Perhaps a better plot would be "Fraction of bandwidth lost by adding checkpoint/restart".
When stating the 0.5% degradation, you should also indicate the standard deviation in your measurements. When I tried to measure this I found that the std.dev. was larger than the measured degradation, making its measurement imprecise.

Section 6
Our current work does let us migrate an entire checkpointed job, but does not permit the migration of a subset of processes while the others remain "live". You might wish to make the distinction.

Section 7
The assertion that the degradation is "negligible" might be better supported with the std.dev. numbers as well.

Refs:
Missing commas between authors' names in most entries.

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-495-2998