Re: Simple API usage

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Mar 22 2007 - 11:01:56 PST

    Neal Becker wrote:
    > I'm doing checkpointing periodically on a simple, single-threaded process, 
    > like this:
    >   if (rename (name.c_str(), oldname.c_str()) != 0 and errno != ENOENT)
    >     die ("rename failed");
    >   cr_request_file (name.c_str());
    >   cr_enter_cs(id);
    >   cr_leave_cs(id);
    >   if (remove (oldname.c_str()) != 0 and errno != ENOENT)
    >     die ("remove failed");
    > 
    > I guess this forces the process to wait until the checkpoint is complete.  I 
    > wonder if I can do something more efficient?  I'd rather avoid having to mess 
    > with callbacks, though.
    
    Neal,
    
      What exactly do you mean by "more efficient" here?  If you want to
    have a backup file, and you want to ensure it is kept until the new file
    is complete, I don't see what else you can do but wait for the
    completion.  The following is an alternative method:
    
       if (rename (name.c_str(), oldname.c_str()) != 0 and errno != ENOENT)
         die ("rename failed");
       { char buffer[1024];
         snprintf(buffer, sizeof(buffer), "cr_checkpoint -f %s -p %d",
                  name.c_str(), (int) getpid());
         system(buffer);  /* run the command; returns once cr_checkpoint exits */
       }
       if (remove (oldname.c_str()) != 0 and errno != ENOENT)
         die ("remove failed");
    
    
      If you are concerned about the use of "cr_enter_cs(id); cr_leave_cs(id);"
    as an idiom for "wait for the checkpoint to finish", then there is not
    much I can offer right now.  However, we are in the process of designing
    and implementing a new checkpoint-request API for inclusion in the 0.6.0
    release (summer '07?).  That API will have a more explicit "wait for the
    checkpoint" call, and that call will block in the kernel, which should
    be much "kinder" than the current enter/leave idiom, which requires
    spinning on an atomic counter in user space.
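
To make the difference between the two waiting styles concrete, here is a generic C++ illustration. None of these names are BLCR APIs, and this is not how BLCR is implemented. spin_wait polls an atomic counter in user space, while Completion::wait sleeps on a condition variable, which on typical platforms parks the thread in the kernel until it is signalled.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

// Illustration only: hypothetical names, not BLCR functions.

// Busy-wait: the waiter keeps polling an atomic counter in user space
// and stays runnable (using CPU time) until the counter reaches zero.
void spin_wait(const std::atomic<int>& pending)
{
    while (pending.load(std::memory_order_acquire) != 0)
        std::this_thread::yield();
}

// Blocking wait: the waiter sleeps until another thread signals completion.
struct Completion {
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void signal()
    {
        {
            std::lock_guard<std::mutex> lock(m);
            done = true;
        }
        cv.notify_all();
    }

    void wait()
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return done; });  // handles spurious wakeups
    }
};
```

With the blocking form the waiting thread consumes no CPU while the work is in progress, which is the sense in which a kernel-side wait is "kinder".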
    
    -Paul
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
