Re: Simple API usage


From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Mar 22 2007 - 11:01:56 PST

Next message: Yuan Wan: "problem: checkpoint lam/mpi with BLCR"

Previous message: Neal Becker: "Simple API usage"
In reply to: Neal Becker: "Simple API usage"

Neal Becker wrote:
> I'm doing checkpointing periodically on a simple, single-threaded process, 
> like this:
>   if (rename (name.c_str(), oldname.c_str()) != 0 and errno != ENOENT)
>     die ("rename failed");
>   cr_request_file (name.c_str());
>   cr_enter_cs(id);
>   cr_leave_cs(id);
>   if (remove (oldname.c_str()) != 0 and errno != ENOENT)
>     die ("remove failed");
> 
> I guess this forces the process to wait until the checkpoint is complete.  I 
> wonder if I can do something more efficient?  I'd rather avoid having to mess 
> with callbacks, though.

Neal,

  What exactly do you mean by "more efficient" here?  If you want to
have a backup file, and you want to ensure it is kept until the new file
is complete, I don't see what else you can do but wait for the
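completion.  The following is an alternative method, which runs the
cr_checkpoint utility on the process itself via system():

   if (rename (name.c_str(), oldname.c_str()) != 0 and errno != ENOENT)
     die ("rename failed");
   { char buffer[1024];
     snprintf(buffer, sizeof(buffer), "cr_checkpoint -f %s -p %d",
              name.c_str(), (int) getpid());
     /* system() returns only after cr_checkpoint has exited */
     if (system(buffer) != 0)
       die ("cr_checkpoint failed");
   }
   if (remove (oldname.c_str()) != 0 and errno != ENOENT)
     die ("remove failed");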
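
For reference, the rotate / checkpoint / remove sequence above can be wrapped in a small helper. This is only a sketch under stated assumptions: `checkpoint_command` and `rotate_and_checkpoint` are hypothetical names, not part of BLCR; the command line uses only the `-f` and `-p` options shown in the message; and the file name is assumed to contain no characters that need shell quoting. The command runner is a parameter so the logic can be exercised without BLCR installed.

```cpp
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <functional>
#include <stdexcept>
#include <string>

// Build the command line from the message: cr_checkpoint -f <file> -p <pid>.
// Assumption: `file` needs no shell quoting.
std::string checkpoint_command(const std::string& file, long pid) {
    return "cr_checkpoint -f " + file + " -p " + std::to_string(pid);
}

// Hypothetical helper: move the previous context file aside, run the
// checkpoint command, and delete the backup only if the command succeeded.
// Returns false (keeping the backup) when the command reports failure.
bool rotate_and_checkpoint(
    const std::string& name, const std::string& oldname, long pid,
    const std::function<int(const char*)>& run =
        [](const char* cmd) { return std::system(cmd); }) {
    // A missing previous file (ENOENT) is not an error: first checkpoint.
    if (std::rename(name.c_str(), oldname.c_str()) != 0 && errno != ENOENT)
        throw std::runtime_error("rename failed");

    const std::string cmd = checkpoint_command(name, pid);
    if (run(cmd.c_str()) != 0)
        return false;  // new file may be missing or partial; backup is kept

    if (std::remove(oldname.c_str()) != 0 && errno != ENOENT)
        throw std::runtime_error("remove failed");
    return true;
}
```

A caller would pass its own pid, e.g. `rotate_and_checkpoint(name, oldname, getpid())` with `<unistd.h>` included. Keeping the backup when the command fails is a design choice of this sketch, not something the original snippet does.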

  If you are concerned by the use of "cr_enter_cs(id); cr_leave_cs(id);"
as an idiom for "wait for the checkpoint to finish", then there is not
much I can offer right now.  However, we are in the process of designing
and implementing a new checkpoint-request API for inclusion in the 0.6.0
release (summer '07?).  That API will have a more explicit "wait for the
checkpoint" call, and that call will block in the kernel, which should
be much "kinder" than the current enter/leave idiom which requires
spinning on an atomic counter in user space.
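
To illustrate the difference between the two waiting styles described above, here is a generic C++ sketch. It is not BLCR code and does not reflect how BLCR is implemented: `spin_until_done` and `Completion` are hypothetical names, and a condition variable stands in for "a wait that blocks in the kernel".

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

// Spin-wait: the waiter stays runnable and keeps re-reading the counter
// in user space until another thread drops it to zero.
void spin_until_done(const std::atomic<int>& pending) {
    while (pending.load(std::memory_order_acquire) != 0)
        std::this_thread::yield();  // still consumes scheduler time
}

// Blocking wait: the waiter sleeps until it is notified, so it uses
// no CPU while the work is in progress.
struct Completion {
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void signal() {
        {
            std::lock_guard<std::mutex> lock(m);
            done = true;
        }
        cv.notify_all();
    }

    void wait() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return done; });  // predicate guards against spurious wakeups
    }
};
```

In the spinning version the cost of waiting grows with the duration of the work being waited on; in the blocking version the waiter is simply descheduled until `signal()` is called.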

-Paul

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
