Re: cr_restart: ->cri_syscall(CR_OP_RSTRT_REAP): Invalid argument


From: Christian Iwainsky (Christian.M.Iwainsky_at_informatik.stud.uni-erlangen.de)
Date: Tue Oct 18 2005 - 04:27:52 PDT

Next message: Adolfo J. Banchio: "BLCR - changing code results ??"

Previous message: Paul H. Hargrove: "Re: BLCR 0.4.1 Beta5 now available"
In reply to: Paul H. Hargrove: "Re: cr_restart: ->cri_syscall(CR_OP_RSTRT_REAP): Invalid argument"
Next in thread: Christian Iwainsky: "Re [2]: cr_restart: ->cri_syscall(CR_OP_RSTRT_REAP): Invalid argument"

I still get the same error message over and over again.
What puzzles me is that the application instance that crashes on recovery is
the twin of another one which I can restart.

Work-flow:
Processes A and B use the same executable.

Process A connects to process B
Work is done in parallel.
Process A or B decides to perform a checkpoint.
Decision is passed to the other process.
Network communication is stopped, but the sockets are not closed.
A checkpoint is taken to file A.checkpoint by process A and to
B.checkpoint by process B.

Both processes A and B are killed!

Process A restarts successfully; process B does not.
This is always the case, regardless of which process initiates the
decision to checkpoint.
(Both processes use the same executable.)
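
One way to rule out the possibility, raised in the quoted reply below, that both processes wrote to the same file is to compare what the two paths actually resolve to. The following is only a sketch using plain POSIX stat(); same_file and file_size are made-up helper names and are not part of BLCR.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Returns 1 if both paths name the same file (same device and inode),
 * 0 if they name different files, -1 if either path cannot be stat'ed. */
int same_file(const char *a, const char *b)
{
    struct stat sa, sb;

    assert(a != NULL && b != NULL);
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

/* Returns the size of the file in bytes, or -1 if it cannot be stat'ed.
 * A size of zero, or one much smaller than its twin, hints at truncation. */
long long file_size(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    return (long long)st.st_size;
}
```

Calling same_file("A.checkpoint", "B.checkpoint") right after the checkpoint should give 0 if the two files are really distinct, and comparing file_size() of both gives a first hint whether B.checkpoint was cut short.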

I rechecked the kernel log, and discovered that before the list of

Oct 18 13:14:15 faui21l kernel: vmadump: invalid signature
Oct 18 13:14:15 faui21l kernel: thaw_threads returned error, aborting. -22

messages appears, the following message is logged:

Oct 18 13:14:15 faui21l kernel: vmadump: mmap failed: /var/run/nscd/dbxrfE9Q (deleted)
Oct 18 13:14:15 faui21l kernel: thaw_threads returned error, aborting. -2
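
A note on the numbers at the end of the thaw_threads lines: assuming they are negated errno values, which is the usual Linux kernel convention (nothing in this thread confirms it for BLCR specifically), -22 would be EINVAL, matching the "Invalid argument" printed by cr_restart, and -2 would be ENOENT, which would fit an mmap of a file that has since been deleted. A minimal sketch to decode them; describe_kernel_code and print_log_codes are made-up helper names:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Turn a kernel-style return code (zero or negative, e.g. -22) into text.
 * Assumption: the value is a negated errno number. */
const char *describe_kernel_code(int code)
{
    assert(code <= 0);
    return strerror(-code);
}

/* Print the two codes that appear in the log excerpts above. */
void print_log_codes(void)
{
    const int codes[] = { -22, -2 };
    size_t i;

    for (i = 0; i < sizeof codes / sizeof codes[0]; i++)
        printf("%d: %s\n", codes[i], describe_kernel_code(codes[i]));
}
```

On Linux, EINVAL is 22 and ENOENT is 2, so the two codes decode to the messages for those errors.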


Another question: what is the file structure of a checkpoint file? I would
like to have a look at it, to check whether it is corrupt or not.
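
The on-disk layout is not described anywhere in this thread, so the sketch below assumes nothing about it. It only dumps the first bytes of a file as hex, so that a context file that restarts (A.checkpoint) can be compared by eye with one that does not (B.checkpoint). That the "invalid signature" message refers to a marker near the start of the file is a guess, and head_to_hex is a made-up helper name.

```c
#include <assert.h>
#include <stdio.h>

/* Write the first n bytes of the file at path into out as space-separated
 * hex ("7f 45 4c ..."). Stops early at end of file or when out is full.
 * Returns the number of bytes dumped, or -1 if the file cannot be opened. */
int head_to_hex(const char *path, size_t n, char *out, size_t outsz)
{
    FILE *f;
    int c;
    size_t used = 0, count = 0;

    assert(path != NULL && out != NULL && outsz > 0);
    out[0] = '\0';
    f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    /* Each byte needs at most 3 characters (separator + "xx") plus a NUL. */
    while (count < n && used + 4 <= outsz && (c = fgetc(f)) != EOF) {
        used += (size_t)sprintf(out + used, count ? " %02x" : "%02x",
                                (unsigned)c);
        count++;
    }
    fclose(f);
    return (int)count;
}
```

Running it with n = 16 on both A.checkpoint and B.checkpoint and comparing the two strings would show whether the files at least begin alike; a return value of 0 means the file is empty.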
Greetings
 Christian
> Replies appear below.
>
> Christian Iwainsky wrote:
>   
>> Hello,
>> I have a problem with BLCR.
>> I have written a distributed program, which is successfully checkpointed.
>> But once I try to restart the second instance of the program on one
>> machine, the cr_restart function aborts with:
>> cri_syscall(CR_OP_RSTRT_REAP): Invalid argument
>>
>> in /var/log/messages:
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
>> Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>> Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
>>
>> What is the problem? (The Pid is free)
>>     
>
> The "invalid signature" means the context file you are trying to restart
> from is either corrupted or possibly truncated.  I suspect that you have
> not successfully checkpointed, but that the checkpoint operation has
> failed without letting you know.  Is it possible that multiple processes
> might have been writing their checkpoints to the *same* file?  That
> would certainly result in a corrupted file.
>   
>> I also experience an interesting behaviour:
>> I use the following code for the checkpoint-callback:
>> dsm_checkpoint_ready is initialized to 0
>>
>> /***********************************************************/
>> int chkpt_callback(void * aptr){
>> fprintf(stderr,"chkpt_callback\n");
>> if (!dsm_checkpoint_ready){
>>   // the checkpoint thread function is asleep ... don't checkpoint yet,
>>   // but awaken the checkpoint thread
>>   dsm_checkpoint_sleep=0;
>>   // postpone the checkpoint till jackal has a consistent state
>>   fprintf(stderr,"Postponing checkpoint ..\n");
>>   //cr_checkpoint(CR_CHECKPOINT_READY);
>>   cr_checkpoint(CR_CHECKPOINT_TEMP_FAILURE);
>>   return 0;
>> }
>> fprintf(stderr,"checkpopint callback: taking checkpoint\n");
>> int chkptResult=cr_checkpoint(CR_CHECKPOINT_READY);
>> if (chkptResult>0){
>>   fprintf(stderr,"Restarting ...\n");
>>   dsm_checkpoint_wakeup=1;
>> } else if (chkptResult==0){
>>   fprintf(stderr,"checkpointing ........\n");
>> }else {
>>   fprintf(stderr,"Checkpoint Failure\n");
>>   cr_checkpoint(CR_CHECKPOINT_PERM_FAILURE);
>>   return -1;
>> }
>> return 0;
>> }
>>
>> Once the callback has postponed the checkpoint, the program is brought
>> to a checkpointable state, and then cr_request_file is called to do
>> the real checkpoint.
>> The program crashes on the call to cr_request_file:
>>
>>     
>
> It is not clear to me from your description how cr_request_file might be
> crashing.  I don't see anything wrong with your example except for your
> call to "cr_checkpoint(CR_CHECKPOINT_PERM_FAILURE)" in case of an error
> (you should just return -1, rather than calling cr_checkpoint a 2nd
> time).  However, since that code will only run if something is already
> "broken", I don't think it is the immediate cause of your problem.
>
> Is it possible for you to send a stack backtrace from a core file
> generated by this failure?  I could then get a better idea of what is
> wrong inside cr_request_file.
>   
>> Regards,
>> Christian
>>     
>
> -Paul
>
