From: jcduell_at_lbl.gov
Date: Fri Feb 28 2003 - 11:52:52 PST
Sriram:

I just read the paper, and overall it's looking good. Here are some comments:

In the related work section, it was a bit unclear how CoCheck works--I assume there needs to be an instance of the special checkpoint process on each node (if not, how does the single process save the state of processes on other nodes?).

In the issues section, you should qualify your remark that an MPI checkpoint/restart "must" be transparent to users--tack on a phrase like "to gain wide acceptance, ...".

I'd title Section 4 "Implementation of Checkpoint/Restart for LAM/MPI", since you're not describing an implementation for any old MPI.

Section 4.1: You should add a little bit of info about job startup under LAM, i.e., that the lamd's are spawned separately and exist beyond the lifespan of any particular MPI application (and are thus not logically part of the MPI application: this will help when you explain that they aren't part of the checkpoint). It would also be good to simply mention that one starts an MPI app from a single node with 'mpirun', which causes the lamd's to spawn off the MPI processes and set up their connections, etc. It's also a little confusing in the initial text (though it becomes clearer through the diagrams and later discussion) when you mention that the LAM layer provides "a complete message passing service", then state in the next sentence that the MPI layer does communication, too. Putting in a parenthetical remark that the LAM message service is for out-of-band communication, and/or is slower than the MPI-provided communication, would make this clearer.

Section 4.2: As you know, we're planning to add to BLCR the ability to checkpoint entire process groups and sessions. So describing it as "the BLCR single-process checkpointer" muddies the waters. Why don't we fudge the issue by describing it as a system that can "checkpoint applications on a single node", which is vague enough to be true both now and when we add support for sessions.

Section 4.3: I'm not sure that saying the three components involved are mpirun, "the MPI library", and the TCP RPI is the way to go. For one thing, I was under the impression that the RPI is part of the MPI library. You might be better off organizing the section headings around the timeline of what goes on during a checkpoint (which is basically how the text flows anyway), i.e., "user checkpoints mpirun", "lamd's propagate checkpoint to MPI apps", "callback thread quiesces application", etc.

Section 4.3.1: I'd say "by invoking the BLCR utility cr_checkpoint with the process id of mpirun". This will make it clearer what cr_checkpoint is. When you say that restart results in the MPI application being restarted with the "same process topology", I assume you don't mean that the processes need to be restarted on the same nodes as before. Do they need to be restarted with the same distribution across nodes? I.e., if you checkpointed a job that was running on single-processor nodes, would you be able to restart it on SMPs and use all the processors? Assuming you do shared memory optimizations on SMP nodes now, would these work at restart, or would the MPI processes have been hardcoded at initialization to think they need to use TCP to communicate with processes that weren't on their node at startup?

You also don't mention anywhere in section 4.3 that the lamds don't themselves get checkpointed (although they participate in the checkpoint by propagating the cr_checkpoint requests, and probably some other things, too). You need to mention this.
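(In fact, it might be worth showing readers the whole command sequence explicitly somewhere in section 4.3. I'm going from memory here, so double-check the details--in particular the default context file name, and the placeholder host file, process count, and application name are just illustrative--but I'm picturing something like:

    # boot the LAM daemons and launch the job as usual
    lamboot hostfile
    mpirun -np 4 ./my_app &

    # request a checkpoint of the whole job by handing BLCR mpirun's pid;
    # if I remember right, cr_checkpoint writes a context file named
    # context.<pid> by default
    cr_checkpoint <pid-of-mpirun>

    # later, after re-booting the lamd's if that's required, restart the
    # entire job from mpirun's context file
    cr_restart context.<pid-of-mpirun>

If that's roughly right, spelling it out would also answer my questions below about what happens to the checkpoint files and what the user has to do at restart.)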
Section 4.3.3: You state earlier in the paper (in the Issues section) that there are two strategies for handling the network in a checkpoint--either to "checkpoint the network" (what does that mean?), or to drain all the data. You should make it clear right at the beginning of this section which approach LAM is using--right now you just say you are 'quiescing' the network, which doesn't make it clear.

Did you mean to say "blocking" when you write that "pending data can be read from the sockets in a non-blocking fashion because it is known that there are still outstanding bytes to be received"? You can *always* read safely (i.e., without deadlock) with a non-blocking call--I would think the guarantee that there are bits still coming would apply more to a blocking call.

You mention that if a peer operation has not yet been posted, the callback thread needs to signal the main thread to knock it out of its blocking call. I assume that at restart the MPI call is automatically restarted without the user knowing about the interruption? Mention this.

You never mention in section 4 (or anywhere else in the paper) what happens to the checkpoint files. Does LAM keep them in some well-known place, or is it up to the user/batch system to manage them? Does the user need to do anything besides call cr_restart on the mpirun context file?

Section 5: It would be good to state exactly what the causes of the checkpoint overheads are--is it just the byte counting? To address the issue of how well the system will scale to arbitrary numbers of nodes, you might want to lay out the big-O costs of all the stages of the checkpoint/restart. Right now I'm guessing you've got:

1) O(log N) propagation of the cr_checkpoint requests via the lamds to all MPI processes.

2) An arbitrary (?) delay while you wait for any pending MPI calls to finish and relinquish the lock to the checkpoint callback.

3) O(log N) for sending bookmarks between all the processes.

4) A whoppingly huge but still O(1) time to write the checkpoint files, assuming they go to a local disk (or at least aren't all going to the same NFS filesystem). This time will of course increase linearly with the memory and file size of the MPI processes.

5) Shutdown cost? Do the lamds do anything?

If you break down the costs, you can probably show that the file I/O time dominates, and you could project the O(log N) network costs out to show how many processes a LAM/MPI job would need before the network time dominates. Or maybe that's silly, since it might all depend on how much data the application has on the wire (or in "fast track" calls). But it would be nice to show that the parallel portion of checkpoint/restart is not actually the dominant cost, and that it should scale well.

You should also mention how many processes were used for the numbers in Figures 3 and 4. And do crunch the numbers for varying numbers of nodes, as you clearly plan to do.

References: I've been lazy and haven't yet updated the BLCR reference's text: the authors should be Paul, Eric, and myself, in alphabetical order.

Oh, and you should use LaTeX instead of MS Turd. It should only take a couple of minutes to convert it...

Cheers,

--
Jason Duell                        Future Technologies Group
<jcduell_at_lbl_dot_gov>           High Performance Computing Research Dept.
Tel: +1-510-495-2354               Lawrence Berkeley National Laboratory