Re: blcr

jcduell_at_lbl_dot_gov
Date: Wed Mar 17 2004 - 15:03:29 PST

  • Next message: jcduell_at_lbl_dot_gov: "checkpoints on alvarez"
    On Wed, Mar 17, 2004 at 04:14:09PM -0600, Pirabhu Raman wrote:
    > 
    > Mr. Duell,
    > 
    > I am Pirabhu, working with MPI Software Technology, Inc. and I am
    > interested in the checkpointing efforts being made at LBNL.
    > I happened to read the Design and Implementation document and
    > I have a few questions.
    
    Sure.
    
    > 1. You mentioned many items in the design document as yet to be
    > completed (files, memory exclusion, concurrent checkpointing, etc.).
    > What is their status?
    
    We're planning to issue a new release very soon that will support
    simple restoring of file handles (i.e., it will reopen any regular
    files that were open at checkpoint time and seek each one back to the
    offset it had in the file).  The other features (at least those we
    get to) will be done over the next year.
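
    To make the reopen-and-seek idea concrete, here is a minimal sketch
    in plain C stdio.  It is not BLCR code: the struct and both function
    names are hypothetical, and it assumes the file still exists at the
    same path when the process restarts.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record of one open regular file; not a BLCR type. */
struct saved_file {
    char path[256];   /* name the file was opened with */
    char mode[4];     /* fopen mode to reopen with, e.g. "r" or "r+" */
    long offset;      /* position reported by ftell() at checkpoint time */
};

/* At checkpoint time: remember the path, mode, and current offset.
 * Returns 0 on success, -1 on failure. */
int save_file_state(FILE *fp, const char *path, const char *mode,
                    struct saved_file *out)
{
    assert(fp != NULL && path != NULL && mode != NULL && out != NULL);
    if (strlen(path) >= sizeof out->path || strlen(mode) >= sizeof out->mode)
        return -1;
    long pos = ftell(fp);
    if (pos < 0)
        return -1;
    strcpy(out->path, path);
    strcpy(out->mode, mode);
    out->offset = pos;
    return 0;
}

/* At restart: reopen the file and seek back to the saved offset.
 * Returns the new stream, or NULL on failure. */
FILE *restore_file_state(const struct saved_file *saved)
{
    assert(saved != NULL);
    FILE *fp = fopen(saved->path, saved->mode);
    if (fp == NULL)
        return NULL;
    if (fseek(fp, saved->offset, SEEK_SET) != 0) {
        fclose(fp);
        return NULL;
    }
    return fp;
}
```

    Note that a file originally opened with "w" would have to be saved
    with a non-truncating mode such as "r+", since reopening with "w"
    would discard its contents.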
    
    > 2. Are there plans to support incremental checkpointing?
    
    No.  We've generally found that most scientific apps change most of
    their memory in each time step, so the benefit of incremental
    checkpoints in our environment is small relative to the effort of
    implementing them.
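
    One rough way to check this for a given app is to compare two
    snapshots of its memory taken one time step apart and count the
    pages that differ.  The helper below is only an illustrative sketch
    (the function name is hypothetical and it is not part of BLCR); a
    result near 1.0 means an incremental checkpoint would save little.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fraction of pages that differ between two equal-sized snapshots.
 * A trailing partial page is counted as one page. */
double changed_page_fraction(const unsigned char *before,
                             const unsigned char *after,
                             size_t len, size_t page_size)
{
    assert(before != NULL && after != NULL && page_size > 0);
    if (len == 0)
        return 0.0;

    size_t pages = 0, changed = 0;
    for (size_t off = 0; off < len; off += page_size) {
        /* Compare a full page, or whatever remains at the end. */
        size_t n = (len - off < page_size) ? len - off : page_size;
        if (memcmp(before + off, after + off, n) != 0)
            changed++;
        pages++;
    }
    return (double)changed / (double)pages;
}
```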
     
    > 3. I went through the source files and found that some are covered
    > by the GPL and others by the LGPL (without an exact statement of
    > which).  Could you let me know whether all the header and library
    > files that MPI libraries and user applications link against are
    > LGPL'ed?  If so, which are those?
    
    All the code that your MPI library and user apps would need to link
    against in order for you to add checkpoint support is LGPL (so there's
    no problem if you're using a proprietary license:  this is intentional
    on our part).
    
    The LAM MPI team at Indiana have already used our API to make their MPI
    library checkpointable.  You may want to take a look at the paper they
    wrote on how they did it.  It's in the publications section of our
    website:
    
        http://ftg.lbl.gov/twiki/bin/view/Whiteboard/CheckpointPapers
    
    > 4. Is there any reference document illustrating the API exported by
    > BLCR, i.e., one which would let me estimate the effort involved in
    > adding support for BLCR in our MPI implementation?
    
    The LAM paper gives the high-level overview of what your MPI library
    would need to do.  The 'libcr.h' file in our source code is the current
    place to see the nitty-gritty docs on each function that we export.
    
    > 5. Are there any performance numbers comparing BLCR with other available
    > checkpointing packages such as libckpt, etc.?
    
    Not to my knowledge--we haven't gathered them.  Mostly the time is
    dominated by I/O, so a lot depends on your storage system.  We should be
    comparable to any other non-incremental checkpointing system.
    
    -- 
    Jason Duell             Future Technologies Group
    <jcduell_at_lbl_dot_gov>       Computational Research Division
    Tel: +1-510-495-2354    Lawrence Berkeley National Laboratory
    
