From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Mar 30 2005 - 13:58:59 PST
Richard,

The BLCR kernel code can only handle a single outstanding checkpoint per
target process. Note that the request is somewhat asynchronous: the
first cr_request_file() may return while the thread-context callback is
still running, though the process will be stopped when the callback
invokes cr_checkpoint(). The second call to cr_request_file() must
therefore wait for the first checkpoint to actually complete before it
can begin to take a checkpoint of its own. At the same time it will also
reap the completion code of the previous request (similar to waitpid()
for processes).

-Paul

Richard Hu wrote:

> Thank you for your response.
>
> What exactly do you mean by retire? I'm taking it to mean that
> there's some kind of clean-up that needs to be done before restarting
> from a checkpoint. Is that a correct interpretation? Also, I wanted
> to add that this behavior doesn't seem to happen when there is quite a
> bit of activity between the two checkpoints. For example, if the two
> checkpoints in the sample program were separated by thousands of lines
> of code doing complex calculations, I don't have this problem. I
> assume this has to do with BLCR being able to retire the first checkpoint
> before starting the second, right?
>
> Thanks,
> Richard
>
> At 03:58 PM 3/30/2005, you wrote:
>
>> Richard,
>>
>> I don't have a certain answer, but I can guess. I suspect that when
>> you see the hang BLCR is trying to retire the first checkpoint before
>> starting the second. When restarted from the 1st checkpoint the user
>> space part of BLCR believes that there is a previous checkpoint to
>> retire, but the kernel disagrees.
>> I've entered a bug report at
>> http://mantis.lbl.gov/bugzilla/show_bug.cgi?id=1037
>>
>> -Paul
>>
>> Richard Hu wrote:
>>
>>> To Whom It May Concern:
>>>
>>> I appear to be having an issue with multiple checkpoints in BLCR and
>>> I was wondering if you could perhaps shed some light on the
>>> problem.
>>> I have attached a simple test program to demonstrate my
>>> problem.
>>> Essentially when I run the program, two checkpoints are generated
>>> with some activity happening between the checkpoints. When I
>>> restart from the second checkpoint (for_loop_1), everything works.
>>> When I restart from the first checkpoint (for_loop_0), the program
>>> hangs when it hits the spot in the program where it attempts to
>>> create the second checkpoint. Do you know why this happens? Is
>>> there a possible work-around?
>>>
>>> Thanks,
>>> Richard Hu
>>> rhu_at_opnet_dot_com
>>>
>>> ------------------------------------------------------------------------
>>>
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include "libcr.h"
>>> #include <math.h>
>>> #include <string.h>
>>>
>>> int callback(void *arg);
>>>
>>> int main (void) {
>>>     int counter;
>>>     char path[100] = "/usr/local/for_loop_";
>>>     char num[20];
>>>
>>>     cr_init();
>>>     cr_register_callback(callback, NULL, CR_THREAD_CONTEXT);
>>>     counter = 0;
>>>
>>>     for (counter = 0; counter < 20; counter++)
>>>         printf("I am number %i\n", counter);
>>>
>>>     cr_request_file ("/usr/local/for_loop_0");
>>>
>>>     for (counter = 40; counter < 60; counter++)
>>>         printf("I am number %i\n", counter);
>>>
>>>     cr_request_file ("/usr/local/for_loop_1");
>>>
>>>     return 0;
>>> }
>>>
>>>
>>> int callback (void* arg) {
>>>     int rc;
>>>
>>>     rc = cr_checkpoint(CR_CHECKPOINT_READY);
>>>     if (rc) {
>>>         printf("We have been restarted\n");
>>>     }
>>>     else {
>>>         printf("Dump generated. We are continuing\n");
>>>     }
>>>     return 0;
>>> }
>>>
>>
>
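
A rough model of the behaviour Paul describes in his reply at the top of this thread (one outstanding request per process, with the next request waiting for and reaping the previous one) can be sketched with fork() and waitpid(), following his own analogy. This is not BLCR code and makes no libcr calls: request(), reap_outstanding(), outstanding and last_status are hypothetical names invented for illustration, and the 100 ms sleep merely stands in for the time taken to write a checkpoint.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical model only, NOT the BLCR API. */
static pid_t outstanding = -1;  /* pid of the in-flight worker, or -1 if none */
static int last_status = -1;    /* completion code reaped from the last worker */

/* Wait for the outstanding request, if any, and reap its completion code.
 * Returns 1 if a request was reaped, 0 if none was outstanding, -1 on error. */
static int reap_outstanding(void)
{
    int status;

    if (outstanding < 0)
        return 0;
    while (waitpid(outstanding, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    last_status = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    outstanding = -1;
    return 1;
}

/* Start an asynchronous request whose worker finishes with exit code `code`.
 * Returns once the worker has been started, but first blocks until any
 * previous request has completed, since only one may be outstanding. */
static int request(int code)
{
    struct timespec work = { 0, 100L * 1000L * 1000L };  /* 100 ms */
    pid_t pid;

    assert(code >= 0 && code <= 255);
    if (reap_outstanding() < 0)
        return -1;

    fflush(NULL);               /* keep buffered output out of the child */
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        nanosleep(&work, NULL); /* stand-in for writing the checkpoint */
        _exit(code);
    }
    outstanding = pid;
    return 0;
}
```

For example, calling request(3) and then request(7) makes the second call block until the first worker has exited, leaving 3 in last_status; a final reap_outstanding() then leaves 7 there.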