From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Mar 22 2007 - 11:01:56 PST
Neal Becker wrote:
> I'm doing checkpointing periodically on a simple, single-threaded process,
> like this:
>   if (rename (name.c_str(), oldname.c_str()) != 0 and errno != ENOENT)
>     die ("rename failed");
>   cr_request_file (name.c_str());
>   cr_enter_cs(id);
>   cr_leave_cs(id);
>   if (remove (oldname.c_str()) != 0 and errno != ENOENT)
>     die ("remove failed");
>
> I guess this forces the process to wait until the checkpoint is complete.  I
> wonder if I can do something more efficient?  I'd rather avoid having to mess
> with callbacks, though.

Neal,

  What exactly do you mean by "more efficient" here?  If you want to have
a backup file, and you want to ensure it is kept until the new file is
complete, I don't see what else you can do but wait for the completion.

The following is an alternative method:

  if (rename (name.c_str(), oldname.c_str()) != 0 and errno != ENOENT)
    die ("rename failed");
  {
    char buffer[1024];
    snprintf(buffer, sizeof(buffer), "cr_checkpoint -f %s -p %d",
             name.c_str(), getpid());
    system(buffer);  /* run the command; returns once cr_checkpoint exits */
  }
  if (remove (oldname.c_str()) != 0 and errno != ENOENT)
    die ("remove failed");

  If you are concerned about the use of "cr_enter_cs(id); cr_leave_cs(id);"
as an idiom for "wait for the checkpoint to finish", then there is not
much I can offer right now.  However, we are in the process of designing
and implementing a new checkpoint-request API for inclusion in the 0.6.0
release (summer '07?).  That API will have a more explicit "wait for the
checkpoint" call, and that call will block in the kernel, which should be
much "kinder" than the current enter/leave idiom, which requires spinning
on an atomic counter in user space.

-Paul

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
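
For reference, the alternative method in the reply above (rotate the old file, run cr_checkpoint against the current pid, then remove the backup) might be wrapped up roughly as follows. This is only a sketch under assumptions: the helper names checkpoint_command, run_and_wait and rotate_and_checkpoint are hypothetical and not part of BLCR, the "-f" and "-p" options are taken as given in the message, the file name is assumed to contain no characters that need shell quoting, and whether a process can usefully checkpoint itself this way depends on the installed BLCR version.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: format the cr_checkpoint command line from the reply.
// Returns an empty string if the command does not fit in the buffer.
std::string checkpoint_command(const std::string& file, pid_t pid) {
    char buffer[1024];
    int n = std::snprintf(buffer, sizeof(buffer), "cr_checkpoint -f %s -p %d",
                          file.c_str(), static_cast<int>(pid));
    if (n < 0 || static_cast<std::size_t>(n) >= sizeof(buffer))
        return std::string();
    return std::string(buffer);
}

// Hypothetical helper: run a shell command and wait for it to finish.
// Returns true only if the command exited normally with status 0.
bool run_and_wait(const std::string& cmd) {
    if (cmd.empty())
        return false;
    int status = std::system(cmd.c_str());
    return status != -1 && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

// Hypothetical wrapper: keep the previous checkpoint as a backup until the
// new one has been written, then delete the backup.
bool rotate_and_checkpoint(const std::string& name, const std::string& oldname) {
    // A missing previous checkpoint (ENOENT) is not an error on the first run.
    if (std::rename(name.c_str(), oldname.c_str()) != 0 && errno != ENOENT)
        return false;
    // system() returns only after cr_checkpoint has exited, so the backup
    // is never removed before the new file is complete.
    if (!run_and_wait(checkpoint_command(name, getpid())))
        return false;
    if (std::remove(oldname.c_str()) != 0 && errno != ENOENT)
        return false;
    return true;
}
```

A caller would invoke rotate_and_checkpoint(name, oldname) at each checkpoint interval and treat a false return the same way the original code treats a failed rename or remove. Unlike the snippet in the message, this version also checks the exit status of the command, so a failed checkpoint leaves the backup file in place.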