From: Richard Hu (rhu_at_opnet_dot_com)
Date: Wed Mar 30 2005 - 13:47:15 PST
Thank you for your response. What exactly do you mean by "retire"? I take it to mean that some kind of clean-up needs to be done before restarting from a checkpoint. Is that a correct interpretation?

Also, I wanted to add that this behavior doesn't seem to happen when there is quite a bit of activity between the two checkpoints. For example, if the two checkpoints in the sample program were separated by thousands of lines of code doing complex calculations, I don't have this problem. I assume this has to do with BLCR being able to retire the first checkpoint before starting the second, right?

Thanks,
Richard

At 03:58 PM 3/30/2005, you wrote:
>Richard,
>
> I don't have a certain answer, but I can guess. I suspect that when you
> see the hang BLCR is trying to retire the first checkpoint before
> starting the second. When restarted from the 1st checkpoint the user
> space part of BLCR believes that there is a previous checkpoint to
> retire, but the kernel disagrees.
> I've entered a bug report at
> http://mantis.lbl.gov/bugzilla/show_bug.cgi?id=1037
>
>-Paul
>
>Richard Hu wrote:
>
>>To Whom It May Concern:
>>
>>I appear to be having an issue with multiple checkpoints in BLCR and I
>>was wondering if you could perhaps shed some light on the problem. I
>>have attached a simple test program to demonstrate my problem.
>>Essentially when I run the program, two checkpoints are generated with
>>some activity happening between the checkpoints. When I restart from the
>>second checkpoint (for_loop_1), everything works. When I restart from
>>the first checkpoint (for_loop_0), the program hangs when it hits the
>>spot in the program where it attempts to create the second
>>checkpoint. Do you know why this happens? Is there a possible work-around?
>>
>>Thanks,
>>Richard Hu
>>rhu_at_opnet_dot_com
>>
>>------------------------------------------------------------------------
>>
>>#include <stdio.h>
>>#include <stdlib.h>
>>#include "libcr.h"
>>#include <math.h>
>>#include <string.h>
>>
>>int callback(void *arg);
>>
>>int main (void) {
>>    int counter;
>>    char path[100] = "/usr/local/for_loop_";
>>    char num[20];
>>
>>    cr_init();
>>    cr_register_callback(callback, NULL, CR_THREAD_CONTEXT);
>>    counter = 0;
>>
>>    for (counter = 0; counter < 20; counter++)
>>        printf("I am number %i\n", counter);
>>
>>    cr_request_file ("/usr/local/for_loop_0");
>>
>>    for (counter = 40; counter < 60; counter++)
>>        printf("I am number %i\n", counter);
>>
>>    cr_request_file ("/usr/local/for_loop_1");
>>
>>    return 0;
>>}
>>
>>
>>int callback (void* arg) {
>>    int rc;
>>
>>    rc = cr_checkpoint(CR_CHECKPOINT_READY);
>>    if (rc) {
>>        printf("We have been restarted\n");
>>    }
>>    else {
>>        printf("Dump generated. We are continuing\n");
>>    }
>>    return 0;
>>}
>>
>