jcduell_at_lbl_dot_gov
Date: Wed Mar 17 2004 - 15:03:29 PST
On Wed, Mar 17, 2004 at 04:14:09PM -0600, Pirabhu Raman wrote:
>
> Mr. Duell,
>
> I am Pirabhu, working with MPI Software Technology, Inc. and I am
> interested in the checkpointing efforts being made at LBNL.
> I happened to read the Design and Implementation document and
> I have a few questions.

Sure.

> 1. You had mentioned many items in the design document as yet to be
> completed (files, memory exclusion, concurrent checkpointing etc).
> What are their status?

We're planning to issue a new release very soon that will have support
for simple restoring of file handles (i.e. reopen any regular file
handles that were open at checkpoint, with an fseek to the same location
they were at in the file). The other features (at least those we get to)
will be done over the next year.

> 2. Are there plans to support incremental checkpointing?

No. We've generally found that most scientific apps change most of
their memory in each time step, so the benefit of incremental
checkpoints in our environment is small relative to the effort of
implementing them.

> 3. I went through the source files to find that some files being
> covered by GPL and some others by LGPL (w/o exact mention to which).
> Could you let me know if all the header, library files that are linked
> by MPI libraries and user applications are LGPL'ed? If so, what are
> those?

All the code that your MPI library and user apps would need to link
against in order for you to add checkpoint support is LGPL (so there's
no problem if you're using a proprietary license: this is intentional on
our part).

The LAM MPI team at Indiana have already used our API to make their MPI
library checkpointable. You may want to take a look at the paper they
wrote on how they did it. It's in the publications section of our
website:

http://ftg.lbl.gov/twiki/bin/view/Whiteboard/CheckpointPapers

> 4. Is there any reference document illustrating the API exported by
> BLCR, i.e. which would let me estimate the effort involved in adding
> support for BLCR in our MPI implementation?

The LAM paper gives the high-level overview of what your MPI library
would need to do. The 'libcr.h' file in our source code is the current
place to see the nitty-gritty docs on each function that we export.

> 5. Are there any performance numbers comparing BLCR with other available
> checkpointing packages such as libckpt etc?

Not to my knowledge--we haven't gathered them. Mostly the time is
dominated by I/O, so a lot depends on your storage system. We should
be comparable to any other non-incremental checkpointing system.

-- 
Jason Duell               Future Technologies Group
<jcduell_at_lbl_dot_gov>  Computational Research Division
Tel: +1-510-495-2354      Lawrence Berkeley National Laboratory
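
For illustration only, the "reopen and fseek" idea from the answer to question 1 could look roughly like the sketch below. This is not BLCR code and does not use the BLCR API: the struct saved_file and the functions save_file_state and restore_file_state are hypothetical names made up for this example, and it assumes the application itself knows the path and mode of each regular file it has open.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record of one open regular file (not a BLCR structure). */
struct saved_file {
    char path[256];  /* path the file was opened with */
    char mode[4];    /* fopen mode to reuse on restart */
    long offset;     /* position in the file at checkpoint time */
};

/*
 * At checkpoint time: remember where the file is and how far into it
 * we are. Returns 0 on success, -1 on failure.
 */
int save_file_state(FILE *fp, const char *path, const char *mode,
                    struct saved_file *rec)
{
    assert(fp != NULL && path != NULL && mode != NULL && rec != NULL);

    long off = ftell(fp);
    if (off < 0)
        return -1;
    if (strlen(path) >= sizeof rec->path || strlen(mode) >= sizeof rec->mode)
        return -1;

    strcpy(rec->path, path);
    strcpy(rec->mode, mode);
    rec->offset = off;
    return 0;
}

/*
 * At restart time: reopen the file and seek back to the saved offset.
 * The saved mode should be non-truncating ("r", "r+", "a"); reopening
 * with "w" would discard the file contents.
 * Returns the reopened stream, or NULL on failure.
 */
FILE *restore_file_state(const struct saved_file *rec)
{
    assert(rec != NULL);

    FILE *fp = fopen(rec->path, rec->mode);
    if (fp == NULL)
        return NULL;
    if (fseek(fp, rec->offset, SEEK_SET) != 0) {
        fclose(fp);
        return NULL;
    }
    return fp;
}
```

A caller would invoke save_file_state for each open regular file just before the checkpoint is taken and restore_file_state for each record after restart. The sketch ignores pipes, sockets, and files that were deleted or modified between checkpoint and restart.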