From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Oct 11 2005 - 10:28:12 PDT
Replies appear below.

Christian Iwainsky wrote:
> Hello,
> I have a problem with blcr.
> I have written a distributed program, which is successfully checkpointed.
> But once I try to restart the second instance on one machine of the
> program, the cr_restart function aborts with:
> cri_syscall(CR_OP_RSTRT_REAP): Invalid argument
>
> in /var/log/messages:
> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
> -22
> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
> -22
> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
> -22
> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
> -22
> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
> -22
> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
> -22
> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
> -22
> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
> -22
> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
> -22
> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
> -22
> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
> -22
> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
> -22
>
> What is the problem? (The Pid is free)

The "invalid signature" means the context file you are trying to
restart from is either corrupted or possibly truncated. I suspect that
you have not successfully checkpointed, but that the checkpoint
operation has failed without letting you know. Is it possible that
multiple processes might have been writing their checkpoints to the
*same* file? That would certainly result in a corrupted file.

>
> I also experience an interesting behaviour:
> I use the following code for the checkpoint-callback:
> dsm_checkpoint_ready is initialized to 0
>
> /***********************************************************/
> int chkpt_callback(void * aptr){
>     fprintf(stderr,"chkpt_callback\n");
>     if (!dsm_checkpoint_ready){
>         // the checkpoint thread function is asleep ... don't checkpoint yet,
>         // but awaken the checkpoint thread
>         dsm_checkpoint_sleep=0;
>         // postpone the checkpoint till jackal has a consistent state
>         fprintf(stderr,"Postponing checkpoint ..\n");
>         //cr_checkpoint(CR_CHECKPOINT_READY);
>         cr_checkpoint(CR_CHECKPOINT_TEMP_FAILURE);
>         return 0;
>     }
>     fprintf(stderr,"checkpoint callback: taking checkpoint\n");
>     int chkptResult=cr_checkpoint(CR_CHECKPOINT_READY);
>     if (chkptResult>0){
>         fprintf(stderr,"Restarting ...\n");
>         dsm_checkpoint_wakeup=1;
>     } else if (chkptResult==0){
>         fprintf(stderr,"checkpointing ........\n");
>     }else {
>         fprintf(stderr,"Checkpoint Failure\n");
>         cr_checkpoint(CR_CHECKPOINT_PERM_FAILURE);
>         return -1;
>     }
>     return 0;
> }
>
> Once the callback has postponed the checkpoint, the program state is
> brought to a checkpoint state, and then cr_request_file is called to do
> the real checkpoint.
> The program crashes on the call to cr_request_file:
>

It is not clear to me from your description how cr_request_file might
be crashing.
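On the shared-file question raised earlier, here is a minimal sketch of one way to give each process its own context file name, assuming the name is built from the host name and PID. The helper make_context_path and the "context.<host>.<pid>" layout are hypothetical, not part of BLCR; the resulting string is simply what the program would pass to cr_request_file.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not part of BLCR): writes a path of the form
 * "<dir>/context.<host>.<pid>" into buf, so that two processes on the
 * same or different machines never share a context file.
 * Returns 0 on success, -1 if the path does not fit in buf. */
static int make_context_path(char *buf, size_t len, const char *dir)
{
    char host[256];

    assert(buf != NULL && dir != NULL);

    /* Fall back to a fixed label if the host name is unavailable. */
    if (gethostname(host, sizeof host) != 0)
        strcpy(host, "unknown");
    host[sizeof host - 1] = '\0';   /* gethostname may not terminate */

    int n = snprintf(buf, len, "%s/context.%s.%ld",
                     dir, host, (long)getpid());
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The program would call make_context_path(path, sizeof path, dir) in each process and hand path to cr_request_file, checking the return value of both calls so that a failed checkpoint request does not go unnoticed.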
I don't see anything wrong with your example except for your call to
"cr_checkpoint(CR_CHECKPOINT_PERM_FAILURE)" in case of an error (you
should just return -1, rather than calling cr_checkpoint a 2nd time).
However, since that code will only run if something is already "broken",
I don't think it is the immediate cause of your problem.

Is it possible for you to send a stack backtrace from a core file
generated by this failure? I could then get a better idea of what is
wrong inside cr_request_file.

>
> Regards,
> Christian

-Paul

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
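As a footnote to the point about the error path: a minimal sketch of the "ready" branch of the callback with the second cr_checkpoint call removed. So that it compiles without BLCR installed, stub_cr_checkpoint, stub_result and stub_calls are hypothetical stand-ins; in the real program the stub call would be cr_checkpoint(CR_CHECKPOINT_READY), and the postponing branch is left out here because it is unchanged.

```c
#include <stdio.h>

/* Hypothetical stand-ins, NOT the BLCR API: they let the sketch build
 * and run without libcr. Replace stub_cr_checkpoint() with
 * cr_checkpoint(CR_CHECKPOINT_READY) in the real callback. */
static int stub_calls  = 0;   /* how many times the stub was invoked   */
static int stub_result = 0;   /* <0 failure, 0 continuing, >0 restart  */
static int stub_cr_checkpoint(void) { stub_calls++; return stub_result; }

static int dsm_checkpoint_wakeup = 0;

/* "Ready" branch of the checkpoint callback from the quoted message. */
static int chkpt_callback_ready(void *aptr)
{
    (void)aptr;
    fprintf(stderr, "checkpoint callback: taking checkpoint\n");

    int chkptResult = stub_cr_checkpoint();
    if (chkptResult > 0) {
        fprintf(stderr, "Restarting ...\n");
        dsm_checkpoint_wakeup = 1;
    } else if (chkptResult == 0) {
        fprintf(stderr, "checkpointing ........\n");
    } else {
        fprintf(stderr, "Checkpoint Failure\n");
        return -1;   /* just report failure; no second cr_checkpoint call */
    }
    return 0;
}
```

The only behavioural difference from the quoted code is in the failure branch, which now returns -1 directly.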