Re: multiple checkpoints

From: Richard Hu (rhu_at_opnet_dot_com)
Date: Wed Mar 30 2005 - 13:47:15 PST

  • Next message: Paul H. Hargrove: "Re: multiple checkpoints"
    Thank you for your response.
    
    What exactly do you mean by retire?  I'm taking it to mean that there's 
    some kind of clean-up that needs to be done before restarting from a 
    checkpoint.  Is that a correct interpretation?  Also, I wanted to add that 
    this behavior doesn't seem to happen when there is quite a bit of activity 
    between the two checkpoints.  For example, if the two checkpoints in the 
    sample program were separated by thousands of lines of code doing complex 
    calculations, I don't have this problem.  I assume this has to do with BLCR 
    being able to retire the first checkpoint before starting the second, right?
    
    Thanks,
    Richard
    
    At 03:58 PM 3/30/2005, you wrote:
    >Richard,
    >
    >  I don't have a certain answer, but I can guess.  I suspect that when you 
    > see the hang BLCR is trying to retire the first checkpoint before 
    > starting the second.  When restarted from the 1st checkpoint the user 
    > space part of BLCR believes that there is a previous checkpoint to 
    > retire, but the kernel disagrees.
    >  I've entered a bug report at 
    > http://mantis.lbl.gov/bugzilla/show_bug.cgi?id=1037
    >
    >-Paul
    >
    >Richard Hu wrote:
    >
    >>To Whom It May Concern:
    >>
    >>I appear to be having an issue with multiple checkpoints in BLCR and I 
    >>was wondering if you could perhaps shed some light on the problem.  I 
    >>have attached a simple test program to demonstrate my problem.
    >>Essentially when I run the program, two checkpoints are generated with 
    >>some activity happening between the checkpoints.  When I restart from the 
    >>second checkpoint (for_loop_1), everything works.  When I restart from 
    >>the first checkpoint (for_loop_0), the program hangs when it hits the 
    >>spot in the program where it attempts to create the second 
    >>checkpoint.  Do you know why this happens?  Is there a possible work-around?
    >>
    >>Thanks,
    >>Richard Hu
    >>rhu_at_opnet_dot_com
    >>
    >>------------------------------------------------------------------------
    >>
    >>#include <stdio.h>
    >>#include <stdlib.h>
    >>#include "libcr.h"
    >>#include <math.h>
    >>#include <string.h>
    >>
    >>int callback(void *arg);
    >>
    >>int main (void) {
    >>  int counter;
    >>  char path[100] = "/usr/local/for_loop_";
    >>  char num[20];
    >>
    >>  cr_init();
    >>  cr_register_callback(callback, NULL, CR_THREAD_CONTEXT);
    >>  counter = 0;
    >>
    >>  for (counter = 0; counter < 20; counter++)
    >>    printf("I am number %i\n", counter);
    >>
    >>  cr_request_file ("/usr/local/for_loop_0");
    >>
    >>  for (counter = 40; counter < 60; counter++)
    >>    printf("I am number %i\n", counter);
    >>
    >>  cr_request_file ("/usr/local/for_loop_1");
    >>
    >>  return 0;
    >>}
    >>
    >>
    >>int callback (void* arg) {
    >>  int rc;
    >>
    >>  rc = cr_checkpoint(CR_CHECKPOINT_READY);
    >>  if (rc) {
    >>    printf("We have been restarted\n");
    >>  }
    >>  else {
    >>    printf("Dump generated.  We are continuing\n");
    >>  }
    >>  return 0;
    >>}
    >>
    >
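
Paul's guess quoted above (user space believes there is a previous checkpoint to retire, but the kernel disagrees) can be pictured with a toy model. This is only a sketch of that hypothesis, not BLCR's actual implementation: every name in it (`user_state`, `kernel_state`, `toy_request`, `toy_restart`, `REQUEST_WOULD_HANG`) is made up for illustration, and it assumes the "previous checkpoint" record lives in process memory and is therefore captured in the checkpoint image, while the kernel's record is not. It does not model timing, so it says nothing about why more activity between the checkpoints avoids the problem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names are BLCR APIs. */

/* Hypothetical user-space bookkeeping; part of process memory,
 * so it is assumed to be saved inside the checkpoint image. */
struct user_state { bool prev_pending; };

/* Hypothetical kernel-side bookkeeping; assumed NOT to be saved. */
struct kernel_state { bool prev_pending; };

enum request_result { REQUEST_OK, REQUEST_WOULD_HANG };

/* Request a checkpoint. If user space thinks an earlier checkpoint is
 * still outstanding, it first tries to retire it. The image is a copy
 * of user memory taken while this request is outstanding. */
static enum request_result toy_request(struct user_state *u,
                                       struct kernel_state *k,
                                       struct user_state *image)
{
    if (u->prev_pending) {
        if (!k->prev_pending) {
            /* User space waits on something the kernel never heard of. */
            return REQUEST_WOULD_HANG;
        }
        k->prev_pending = false;   /* retire the earlier checkpoint */
        u->prev_pending = false;
    }
    u->prev_pending = true;        /* record the new request on both sides */
    k->prev_pending = true;
    *image = *u;                   /* snapshot includes prev_pending == true */
    return REQUEST_OK;
}

/* Restart: user memory comes back from the image, while the kernel
 * side starts with no record of any earlier checkpoint. */
static struct kernel_state toy_restart(const struct user_state *image,
                                       struct user_state *u)
{
    assert(image != NULL && u != NULL);
    *u = *image;
    return (struct kernel_state){ false };
}
```

In this model, two back-to-back calls to `toy_request` in a fresh process both return `REQUEST_OK`. After `toy_restart` from the first image, the next `toy_request` returns `REQUEST_WOULD_HANG`, which mirrors the reported hang. A restart from the second image makes no further request, so it never reaches the mismatch.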
    
