Re: cr_restart: ->cri_syscall(CR_OP_RSTRT_REAP): Invalid argument

From: Christian Iwainsky (Christian.M.Iwainsky_at_informatik.stud.uni-erlangen.de)
Date: Tue Oct 18 2005 - 04:27:52 PDT

    I still get the same error message over and over again.
    What puzzles me is that the application instance that crashes on recovery
    is the twin of another one which I can restart.
    
    Work-flow:
    Processes A and B use the same executable.
    
    Process A connects to process B
    Work is done in parallel.
    Process A or B decides to perform a checkpoint.
    Decision is passed to the other process.
    Network communication is stopped, but the sockets are not closed.
    Checkpoint is taken to files A.checkpoint by process A and to
    B.checkpoint by process B.
    
    Both processes A and B are killed!
    
    Process A restarts successfully; process B does not.
    This is always the case, no matter which process initiates the
    decision to checkpoint.
    (Both processes use the same executable).
    
    I rechecked the kernel log, and discovered that before the list of
    
    Oct 18 13:14:15 faui21l kernel: vmadump: invalid signature
    Oct 18 13:14:15 faui21l kernel: thaw_threads returned error, aborting. -22
    
    messages appears, the following message is logged:
    
    Oct 18 13:14:15 faui21l kernel: vmadump: mmap failed: /var/run/nscd/dbxrfE9Q (deleted)
    Oct 18 13:14:15 faui21l kernel: thaw_threads returned error, aborting. -2
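
The trailing numbers on the thaw_threads lines look like negated Linux errno
values, which is the usual kernel convention. Under that assumption, -22 is
EINVAL (matching the "Invalid argument" reported by cr_restart) and -2 is
ENOENT (consistent with the mmap failure on a deleted file). A minimal sketch
for decoding such values follows; describe_kernel_rc is a hypothetical helper
and not part of BLCR.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of BLCR): format a negated errno value, as
 * printed in the kernel log, together with its textual description.
 * Returns 0 on success, -1 if rc is not a negative number. */
int describe_kernel_rc(int rc, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    if (rc >= 0)
        return -1;
    /* strerror() expects the positive errno value. */
    snprintf(buf, len, "%d: %s", rc, strerror(-rc));
    return 0;
}
```

Calling it with -22 and -2 on a Linux system should give the descriptions for
EINVAL and ENOENT respectively.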
    
    
    Another question: What is the file structure of a checkpoint file? I
    would like to look at it to check whether it is corrupt or not.
    Greetings
     Christian
    > Replies appear below.
    >
    > Christian Iwainsky wrote:
    >   
    >> Hello,
    >> I have a problem with BLCR.
    >> I have written a distributed program, which is successfully checkpointed.
    >> But once I try to restart the second instance of the program on one
    >> machine, the cr_restart function aborts with:
    >> cri_syscall(CR_OP_RSTRT_REAP): Invalid argument
    >>
    >> in /var/log/messages:
    >> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    >> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    >> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    >> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    >> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    >> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    >> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    >> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    >> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    >> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    >> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    >> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >>
    >> What is the problem? (The Pid is free)
    >>     
    >
    > The "invalid signature" means the context file you are trying to restart
    > from is either corrupted or possibly truncated.  I suspect that you have
    > not successfully checkpointed, but that the checkpoint operation has
    > failed without letting you know.  Is it possible that multiple processes
    > might have been writing their checkpoints to the *same* file?  That
    > would certainly result in a corrupted file.
    >   
    >> I also experience an interesting behaviour.
    >> I use the following code for the checkpoint callback
    >> (dsm_checkpoint_ready is initialized to 0):
    >>
    >> /***********************************************************/
    >> int chkpt_callback(void * aptr){
    >> fprintf(stderr,"chkpt_callback\n");
    >> if (!dsm_checkpoint_ready){
    >>   // the checkpoint thread function is asleep ... don't checkpoint yet,
    >>   // but awaken the checkpoint thread
    >>   dsm_checkpoint_sleep=0;
    >>   // postpone the checkpoint till jackal has a consistent state
    >>   fprintf(stderr,"Postponing checkpoint ..\n");
    >>   //cr_checkpoint(CR_CHECKPOINT_READY);
    >>   cr_checkpoint(CR_CHECKPOINT_TEMP_FAILURE);
    >>   return 0;
    >> }
    >> fprintf(stderr,"checkpoint callback: taking checkpoint\n");
    >> int chkptResult=cr_checkpoint(CR_CHECKPOINT_READY);
    >> if (chkptResult>0){
    >>   fprintf(stderr,"Restarting ...\n");
    >>   dsm_checkpoint_wakeup=1;
    >> } else if (chkptResult==0){
    >>   fprintf(stderr,"checkpointing ........\n");
    >> }else {
    >>   fprintf(stderr,"Checkpoint Failure\n");
    >>   cr_checkpoint(CR_CHECKPOINT_PERM_FAILURE);
    >>   return -1;
    >> }
    >> return 0;
    >> }
    >>
    >> Once the callback has postponed the checkpoint, the program state is
    >> brought to a checkpointable state, and then cr_request_file is called
    >> to do the real checkpoint.
    >> The program crashes on the call to cr_request_file:
    >>
    >>     
    >
    > It is not clear to me from your description how cr_request_file might be
    > crashing.  I don't see anything wrong with your example except for your
    > call to "cr_checkpoint(CR_CHECKPOINT_PERM_FAILURE)" in case of an error
    > (you should just return -1, rather than calling cr_checkpoint a 2nd
    > time).  However, since that code will only run if something is already
    > "broken", I don't think it is the immediate cause of your problem.
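
The advice above (return -1 on error instead of calling cr_checkpoint a
second time) can be sketched as follows. To keep the example compilable
without libcr, the checkpoint call is passed in as a function pointer; in the
original code that role is played by cr_checkpoint(), and ready_flag stands
in for CR_CHECKPOINT_READY. The names finish_checkpoint, checkpoint_fn, and
the demo stub are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in type for the real checkpoint call (hypothetical). */
typedef int (*checkpoint_fn)(int flags);

/* Sketch of the tail of the callback. Sets *restarted to 1 when resuming
 * from a restart. Returns 0 on success or restart, -1 on failure. */
int finish_checkpoint(checkpoint_fn do_checkpoint, int ready_flag,
                      int *restarted)
{
    assert(do_checkpoint != NULL && restarted != NULL);
    *restarted = 0;

    int rc = do_checkpoint(ready_flag);
    if (rc > 0) {          /* resuming after a restart */
        *restarted = 1;
        return 0;
    }
    if (rc == 0)           /* continuing after the checkpoint was taken */
        return 0;

    /* Failure: report it by returning -1 and do not invoke the
     * checkpoint call again. */
    return -1;
}

/* Demo stub that always fails and counts how often it is called. */
int demo_calls = 0;
int demo_failing_checkpoint(int flags)
{
    (void)flags;
    demo_calls++;
    return -1;
}
```

With the demo stub, finish_checkpoint() returns -1 after exactly one call to
the checkpoint function, which is the behaviour the reply recommends.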
    >
    > Is it possible for you to send a stack backtrace from a core file
    > generated by this failure?  I could then get a better idea of what is
    > wrong inside cr_request_file.
    >   
    >> Regards,
    >> Christian
    >>     
    >
    > -Paul
    >   
    
