From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Aug 01 2007 - 14:09:53 PDT
Neal,

Your code below is (as best as I can tell w/o running it) correct.
However, the dependence on the specific return values -1 and -2 from
cr_poll_checkpoint() is probably unsafe.  I'll see about adding CR_*
constants to replace the explicit -1 and -2 values in the next beta
release.  Once added, you'll find the descriptions in libcr.h and use
example in cr_checkpoint.c.

You are correct that the checkpoint is not necessarily flushed to disk
when the poll call succeeds.  You'll have to make your own fsync() call
if you require that guarantee.

Feel free to ask if you need any more clarifications.  Below I've listed
some of the advantages of this new API over the previous one.

-Paul

1) This interface allows for a bounded wait
   (Despite the loop around the cr_poll_checkpoint() call, the common
   case is single-trip.  This is because the 2nd arg to
   cr_poll_checkpoint() is a (struct timeval *) like the final argument
   to select().  Thus the NULL value here means to wait forever (or
   until interrupted by a signal).  A non-NULL 2nd arg would allow you
   to perform a bounded wait.)
2) This interface allows for multi-process scopes (tree, pgrp, session).
3) This interface allows for checkpointing something other than oneself.
4) This interface allows multiple checkpoints (of distinct targets) to
   be in-flight simultaneously.
5) This interface returns error codes when something goes wrong.

Neal Becker wrote:
> Based on studying the code in cr_checkpoint.c, I have come up with the
> following.  Any comments appreciated.  I'm guessing that the call to
> cr_request_checkpoint, followed by the cr_poll_checkpoint, will efficiently
> do the checkpoint and then wait for it to complete (but not necessarily be
> flushed to the disk).
>
> static void doit () {
>
>   int newfd = creat (newname.c_str(), 0600);
>   if (newfd < 0)
>     die ("creat failed");
>
>   cr_args.cr_fd = newfd;
>   cr_args.cr_scope = ...
>
>
>   cr_checkpoint_handle_t cr_handle;
>
>   int err = cr_request_checkpoint (&cr_args, &cr_handle);
>   if (err < 0)
>     die ("cr_request_checkpoint failed");
>
>   do {
>     int err = cr_poll_checkpoint (&cr_handle, NULL);
>     if (err < 0) {
>       if (errno == EINVAL) {
>         return; // restarted
>       }
>       else if (errno == EINTR) {
>         ;
>       }
>       else {
>         perror ("cr_poll_checkpoint");
>         break;
>       }
>     }
>     else if (err == 0) {
>       die ("cr_poll_checkpoint returned unexpected 0");
>     }
>   } while (err < 0);
>
>   if (err == -1) {
>     die (std::string ("cr_poll_checkpoint") + strerror (err));
>   }
>   else if (err == -2) {
>     if (err == CR_ETEMPFAIL) {
>       die("Checkpoint cancelled by application: try again later\n");
>     } else if (err == ESRCH) {
>       die("Checkpoint failed: no processes checkpointed\n");
>     } else if (err == CR_EPERMFAIL) {
>       die("Checkpoint cancelled by application: unable to checkpoint\n");
>     } else if (err == CR_ENOSUPPORT) {
>       die("Checkpoint failed: support missing from application\n");
>     } else {
>       die(std::string ("ioctl") + strerror (err));
>     }
>   }
>   else if (err < 0) {
>     die(std::string ("cr_poll_checkpoint") + strerror (err));
>   }
>
>   if (rename (newname.c_str(), name.c_str()) != 0)
>     die ("rename failed");
> }
>

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
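
On the fsync() point in the reply: the following is a minimal sketch of
flushing the context file before renaming it into place, assuming the
checkpoint has already completed on the descriptor.  flush_and_publish()
is a hypothetical helper, not part of libcr; it relies only on the POSIX
calls fsync(), close() and rename().

```cpp
#include <cerrno>
#include <cstdio>     // rename()
#include <fcntl.h>
#include <unistd.h>   // fsync(), close()

// Hypothetical helper (not part of libcr): force the data already
// written to fd onto stable storage, close it, then rename the
// temporary file to its final name.
// Returns 0 on success, -1 on failure with errno set.
int flush_and_publish(int fd, const char *tmp_path, const char *final_path)
{
    int rc;

    // fsync() may be interrupted by a signal; retry in that case.
    do {
        rc = fsync(fd);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0) {
        int saved = errno;   // keep the fsync() error across close()
        close(fd);
        errno = saved;
        return -1;
    }

    if (close(fd) < 0)
        return -1;

    // rename() replaces final_path atomically when both paths are on
    // the same filesystem.
    return rename(tmp_path, final_path);
}
```

In a function like doit() above, a call such as
flush_and_publish(newfd, newname.c_str(), name.c_str()) would stand in
for the bare rename().  The sketch does not fsync the containing
directory, so the rename itself may not survive a crash.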