From: Christian Iwainsky (Christian.M.Iwainsky_at_informatik.stud.uni-erlangen.de)
Date: Tue Oct 18 2005 - 04:27:52 PDT
I still get the same error message over and over again. What puzzles me is that the application instance that crashes on recovery is the twin of another one that I can restart.

Workflow:
Processes A and B use the same executable.
Process A connects to process B.
Work is done in parallel.
Process A or B decides to perform a checkpoint.
The decision is passed to the other process.
Network communication is stopped, but the sockets are not closed.
A checkpoint is taken to the file A.checkpoint by process A and to B.checkpoint by process B.
Both processes A and B are killed!

Process A restarts successfully; process B does not. This is always the case, no matter which process initiates the decision to checkpoint. (Both processes use the same executable.)

I rechecked the kernel log and discovered that before the list of

Oct 18 13:14:15 faui21l kernel: vmadump: invalid signature
Oct 18 13:14:15 faui21l kernel: thaw_threads returned error, aborting. -22

messages appears, the following message is logged:

Oct 18 13:14:15 faui21l kernel: vmadump: mmap failed: /var/run/nscd/dbxrfE9Q (deleted)
Oct 18 13:14:15 faui21l kernel: thaw_threads returned error, aborting. -2

Another question: What is the file structure of a checkpoint file, so that I can look at it and check whether it is corrupt or not?

Greetings
Christian

> Replies appear below.
>
> Christian Iwainsky wrote:
>
>> Hello,
>> I have a problem with BLCR.
>> I have written a distributed program, which is successfully checkpointed.
>> But once I try to restart the second instance of the program on one
>> machine, the cr_restart function aborts with:
>> cri_syscall(CR_OP_RSTRT_REAP): Invalid argument
>>
>> in /var/log/messages:
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>> -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>> -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>> -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>> -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>> -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>> -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>> -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>> -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>> -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>> -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>> -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>> -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>> -22
>>
>> What is the problem? (The PID is free.)
>>
>
> The "invalid signature" means the context file you are trying to restart
> from is either corrupted or possibly truncated. I suspect that you have
> not successfully checkpointed, but that the checkpoint operation has
> failed without letting you know. Is it possible that multiple processes
> might have been writing their checkpoints to the *same* file? That
> would certainly result in a corrupted file.
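
One way to rule out the "same file" scenario raised above is to derive the checkpoint file name from the process ID and to claim it with an exclusive create. The following is only a sketch under stated assumptions: make_checkpoint_path and open_checkpoint_exclusive are hypothetical helper names, the "<prefix>.<pid>.checkpoint" naming scheme is invented for illustration, and how the resulting path is handed to BLCR is deliberately not shown.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of BLCR): build a per-process path of the
 * form "<prefix>.<pid>.checkpoint". Returns 0 on success, or -1 if the
 * name does not fit into buf. */
int make_checkpoint_path(char *buf, size_t len, const char *prefix, pid_t pid)
{
    assert(buf != NULL && prefix != NULL);
    int n = snprintf(buf, len, "%s.%ld.checkpoint", prefix, (long)pid);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: create the file only if it does not exist yet.
 * O_EXCL makes open() fail with errno == EEXIST when another process
 * has already created a file under the same name. */
int open_checkpoint_exclusive(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
}
```

If open_checkpoint_exclusive returns -1 with errno set to EEXIST, two instances have picked the same name, which would match the corrupted-file symptom described above. Whether that is what happens in this particular setup is not established by the thread.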
>
>> I also experience an interesting behaviour:
>> I use the following code for the checkpoint callback:
>> dsm_checkpoint_ready is initialized to 0
>>
>> /***********************************************************/
>> int chkpt_callback(void * aptr){
>>     fprintf(stderr,"chkpt_callback\n");
>>     if (!dsm_checkpoint_ready){
>>         // the checkpoint thread function is asleep ... don't checkpoint yet
>>         // but awaken the checkpoint thread
>>         dsm_checkpoint_sleep=0;
>>         // postpone the checkpoint till jackal has a consistent state
>>         fprintf(stderr,"Postponing checkpoint ..\n");
>>         //cr_checkpoint(CR_CHECKPOINT_READY);
>>         cr_checkpoint(CR_CHECKPOINT_TEMP_FAILURE);
>>         return 0;
>>     }
>>     fprintf(stderr,"checkpoint callback: taking checkpoint\n");
>>     int chkptResult=cr_checkpoint(CR_CHECKPOINT_READY);
>>     if (chkptResult>0){
>>         fprintf(stderr,"Restarting ...\n");
>>         dsm_checkpoint_wakeup=1;
>>     } else if (chkptResult==0){
>>         fprintf(stderr,"checkpointing ........\n");
>>     } else {
>>         fprintf(stderr,"Checkpoint Failure\n");
>>         cr_checkpoint(CR_CHECKPOINT_PERM_FAILURE);
>>         return -1;
>>     }
>>     return 0;
>> }
>>
>> Once the callback has postponed the checkpoint, the program state is
>> brought to a checkpoint state, and then cr_request_file is called to do
>> the real checkpoint.
>> The program crashes on the call to cr_request_file:
>>
>>
>
> It is not clear to me from your description how cr_request_file might be
> crashing. I don't see anything wrong with your example except for your
> call to "cr_checkpoint(CR_CHECKPOINT_PERM_FAILURE)" in case of an error
> (you should just return -1, rather than calling cr_checkpoint a 2nd
> time). However, since that code will only run if something is already
> "broken", I don't think it is the immediate cause of your problem.
>
> Is it possible for you to send a stack backtrace from a core file
> generated by this failure? I could then get a better idea of what is
> wrong inside cr_request_file.
>
>> Regards,
>> Christian
>>
>
> -Paul
>
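
The return-value handling discussed in this thread can be sketched without any BLCR calls. The snippet below only mirrors the three cases from the quoted callback (positive result: running after a restart; zero: continuing after the checkpoint; negative: failure) together with the advice to simply return -1 on failure instead of calling cr_checkpoint a second time. The enum and both function names are hypothetical and are not part of BLCR.

```c
#include <stdio.h>

/* Hypothetical names for illustration only; none of this is BLCR API. */
enum chkpt_outcome { CHKPT_CONTINUE, CHKPT_RESTARTED, CHKPT_FAILED };

/* Map the integer result of the checkpoint call, as it is interpreted
 * in the quoted callback, to one of three outcomes. */
enum chkpt_outcome classify_checkpoint_result(int rc)
{
    if (rc > 0)
        return CHKPT_RESTARTED;   /* process is running after a restart */
    if (rc == 0)
        return CHKPT_CONTINUE;    /* checkpoint taken, process continues */
    return CHKPT_FAILED;          /* checkpoint failed */
}

/* Value the callback hands back to its caller: -1 on failure and 0
 * otherwise. No further checkpoint call is made on the failure path. */
int callback_return_value(enum chkpt_outcome outcome)
{
    if (outcome == CHKPT_FAILED) {
        fprintf(stderr, "Checkpoint Failure\n");
        return -1;
    }
    return 0;
}
```

In the callback from the thread, this would correspond to replacing the failure branch with a plain "return -1;", which is the change suggested in the reply above.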