Re: multiple checkpoints


From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Mar 30 2005 - 13:58:59 PST

Next message: Kris Buggenhout: "chekpoint support for amd64?"

Previous message: Richard Hu: "Re: multiple checkpoints"
In reply to: Richard Hu: "Re: multiple checkpoints"

Richard,

The BLCR kernel code can handle only a single outstanding checkpoint per 
target process.  Note that the request is somewhat asynchronous: the 
first cr_request_file() may return while the thread-context callback is 
still running, though the process will be stopped when the 
callback invokes cr_checkpoint().  The second call to cr_request_file() 
must therefore wait for the first checkpoint to complete before 
it can begin taking its own.  While waiting, it 
also reaps the completion code of the previous request (similar to 
waitpid() for processes).
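
To make the waitpid() analogy concrete, here is a minimal sketch in plain 
POSIX C.  It is only a model and does not use BLCR or reflect how BLCR is 
implemented: a forked child stands in for an asynchronous request, and the 
names outstanding, reap_outstanding() and request_start() are hypothetical, 
made up for this illustration.  Starting a new request first waits for the 
previous one and collects its completion code.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical model, NOT BLCR code: at most one request is outstanding. */
static pid_t outstanding = -1;   /* pid of the in-flight request, or -1 */

/* Wait for the outstanding request, if any, and return its exit code.
 * Returns -1 if there was nothing to reap or the wait failed. */
static int reap_outstanding(void)
{
    int status;
    pid_t pid = outstanding;

    if (pid < 0)
        return -1;
    outstanding = -1;            /* the slot is free again either way */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Start a new asynchronous request that completes with exit_code.
 * A previous request is waited for and reaped first; its completion
 * code is stored in *prev_code (-1 if there was none).
 * Returns 0 on success, -1 if fork() failed. */
static int request_start(int exit_code, int *prev_code)
{
    struct timespec delay = { 0, 100 * 1000 * 1000 };   /* 100 ms */
    pid_t pid;

    *prev_code = reap_outstanding();
    assert(outstanding < 0);     /* never two requests in flight */

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        nanosleep(&delay, NULL); /* stand-in for the actual work */
        _exit(exit_code);
    }
    outstanding = pid;           /* returns while the child still runs */
    return 0;
}
```

In this model, calling request_start(7, &prev) and then 
request_start(9, &prev) leaves prev == 7 after the second call: the second 
request blocks until the first has finished and picks up its code before 
starting its own work.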

-Paul

Richard Hu wrote:

> Thank you for your response.
>
> What exactly do you mean by retire?  I'm taking it to mean that 
> there's some kind of clean-up that needs to be done before restarting 
> from a checkpoint.  Is that a correct interpretation?  Also, I wanted 
> to add that this behavior doesn't seem to happen when there is quite a 
> bit of activity between the two checkpoints.  For example, if the two 
> checkpoints in the sample program were separated by thousands of lines 
> of code doing complex calculations, I don't have this problem.  I 
> assume this has to do with BLCR being able to retire the first checkpoint 
> before starting the second, right?
>
> Thanks,
> Richard
>
> At 03:58 PM 3/30/2005, you wrote:
>
>> Richard,
>>
>>  I don't have a certain answer, but I can guess.  I suspect that when 
>> you see the hang BLCR is trying to retire the first checkpoint before 
>> starting the second.  When restarted from the 1st checkpoint the user 
>> space part of BLCR believes that there is a previous checkpoint to 
>> retire, but the kernel disagrees.
>>  I've entered a bug report at 
>> http://mantis.lbl.gov/bugzilla/show_bug.cgi?id=1037
>>
>> -Paul
>>
>> Richard Hu wrote:
>>
>>> To Whom It May Concern:
>>>
>>> I appear to be having an issue with multiple checkpoints in BLCR and 
>>> I was wondering if you could perhaps shed some light on the 
>>> problem.  I have attached a simple test program to demonstrate my 
>>> problem.
>>> Essentially when I run the program, two checkpoints are generated 
>>> with some activity happening between the checkpoints.  When I 
>>> restart from the second checkpoint (for_loop_1), everything works.  
>>> When I restart from the first checkpoint (for_loop_0), the program 
>>> hangs when it hits the spot in the program where it attempts to 
>>> create the second checkpoint.  Do you know why this happens?  Is 
>>> there a possible work-around?
>>>
>>> Thanks,
>>> Richard Hu
>>> rhu_at_opnet_dot_com
>>>
>>> ------------------------------------------------------------------------ 
>>>
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include "libcr.h"
>>> #include <math.h>
>>> #include <string.h>
>>>
>>> int callback(void *arg);
>>>
>>> int main (void) {
>>>  int counter;
>>>  char path[100] = "/usr/local/for_loop_";
>>>  char num[20];
>>>
>>>  cr_init();
>>>  cr_register_callback(callback, NULL, CR_THREAD_CONTEXT);
>>>  counter = 0;
>>>
>>>  for (counter = 0; counter < 20; counter++)
>>>    printf("I am number %i\n", counter);
>>>
>>>  cr_request_file ("/usr/local/for_loop_0");
>>>
>>>  for (counter = 40; counter < 60; counter++)
>>>    printf("I am number %i\n", counter);
>>>
>>>  cr_request_file ("/usr/local/for_loop_1");
>>>
>>>  return 0;
>>> }
>>>
>>>
>>> int callback (void* arg) {
>>>  int rc;
>>>
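>>>  /* Take the checkpoint.  Below, a nonzero rc is treated as a restart
>>>     and zero as the original process continuing after the dump. */
>>>  rc = cr_checkpoint(CR_CHECKPOINT_READY);
>>>  if (rc) {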
>>>    printf("We have been restarted\n");
>>>  }
>>>  else {
>>>    printf("Dump generated.  We are continuing\n");
>>>  }
>>>  return 0;
>>> }
>>>
>>
>
