Re: Re: cr_restart: ->cri_syscall(CR_OP_RSTRT_REAP): Invalid argument


From: sichiwai (Christian.M.Iwainsky_at_informatik.stud.uni-erlangen.de)
Date: Thu Oct 13 2005 - 15:56:12 PDT

Next message: Paul H. Hargrove: "Re: cr_restart: ->cri_syscall(CR_OP_RSTRT_REAP): Invalid argument"

Previous message: Neal Becker: "blcr python module"
In reply to: Paul H. Hargrove: "Re: cr_restart: ->cri_syscall(CR_OP_RSTRT_REAP): Invalid argument"
Next in thread: Paul H. Hargrove: "Re: cr_restart: ->cri_syscall(CR_OP_RSTRT_REAP): Invalid argument"
Reply: Paul H. Hargrove: "Re: cr_restart: ->cri_syscall(CR_OP_RSTRT_REAP): Invalid argument"

Replies appear below.
> 
> Christian Iwainsky wrote:
> 
>>Hello,
>>I have a problem with BLCR.
>>I have written a distributed program, which is successfully checkpointed.
>>But once I try to restart the second instance of the program on one
>>machine, the cr_restart function aborts with:
>>cri_syscall(CR_OP_RSTRT_REAP): Invalid argument
>>
>>in /var/log/messages:
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>
>>What is the problem? (The Pid is free)
> 
> 
> The "invalid signature" means the context file you are trying to restart
> from is either corrupted or possibly truncated.  I suspect that you have
> not successfully checkpointed, but that the checkpoint operation has
> failed without letting you know.  Is it possible that multiple processes
> might have been writing their checkpoints to the *same* file?  That
> would certainly result in a corrupted file.

I tried having each process write to its own file, with the PID appended
to the file name to make it unique. The result was the same.
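
For reference, the per-process naming described above can be sketched roughly as follows. The helper name make_checkpoint_name and the "<prefix>_<pid>.chkpt" pattern are assumptions made for illustration, not the code actually used in the program.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of BLCR): write "<prefix>_<pid>.chkpt"
 * into buf so that every process gets its own checkpoint file.
 * Returns 0 on success, -1 if the prefix is empty or the name does not fit. */
int make_checkpoint_name(char *buf, size_t len, const char *prefix, pid_t pid)
{
    assert(buf != NULL && prefix != NULL);
    if (strlen(prefix) == 0)
        return -1;
    int n = snprintf(buf, len, "%s_%ld.chkpt", prefix, (long)pid);
    /* snprintf reports the length it wanted; a value >= len means truncation. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The resulting name would be built with getpid() as the last argument and then passed to cr_request_file in place of a fixed string.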


>>I also experience an interesting behaviour:
>>I use the following code for the checkpoint-callback:
>>dsm_checkpoint_ready is initialized to 0
>>
>>/***********************************************************/
>>int chkpt_callback(void * aptr){
>>fprintf(stderr,"chkpt_callback\n");
>>if (!dsm_checkpoint_ready){
>>  // the checkpoint thread function is asleep: don't checkpoint yet,
>>  // but awaken the checkpoint thread
>>  dsm_checkpoint_sleep=0;
>>  // postpone the checkpoint till jackal has a consistent state
>>  fprintf(stderr,"Postponing checkpoint ..\n");
>>  //cr_checkpoint(CR_CHECKPOINT_READY);
>>  cr_checkpoint(CR_CHECKPOINT_TEMP_FAILURE);
>>  return 0;
>>}
>>fprintf(stderr,"checkpoint callback: taking checkpoint\n");
>>int chkptResult=cr_checkpoint(CR_CHECKPOINT_READY);
>>if (chkptResult>0){
>>  fprintf(stderr,"Restarting ...\n");
>>  dsm_checkpoint_wakeup=1;
>>} else if (chkptResult==0){
>>  fprintf(stderr,"checkpointing ........\n");
>>}else {
>>  fprintf(stderr,"Checkpoint Failure\n");
>>  cr_checkpoint(CR_CHECKPOINT_PERM_FAILURE);
>>  return -1;
>>}
>>return 0;
>>}
>>
>>Once the callback has postponed the checkpoint, the program is brought
>>to a state in which it can be checkpointed, and then cr_request_file is
>>called to do the real checkpoint.
>>The program crashes on the call to cr_request_file:
>>
> 
> 
> It is not clear to me from your description how cr_request_file might be
> crashing.  I don't see anything wrong with your example except for your
> call to "cr_checkpoint(CR_CHECKPOINT_PERM_FAILURE)" in case of an error
> (you should just return -1, rather than calling cr_checkpoint a 2nd
> time).  However, since that code will only run if something is already
> "broken", I don't think it is the immediate cause of your problem.
> 
> Is it possible for you to send a stack backtrace from a core file
> generated by this failure?  I could then get a better idea of what is
> wrong inside cr_request_file.
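
The suggestion above about the error branch can be sketched as a small pure function, assuming the three outcomes the quoted callback already distinguishes for cr_checkpoint(CR_CHECKPOINT_READY): a positive value on restart, zero when continuing after the checkpoint, and a negative value on failure. The helper name callback_result is hypothetical, and the sketch deliberately does not call into libcr.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: rc is the value that
 * cr_checkpoint(CR_CHECKPOINT_READY) returned inside the callback.
 * Sets *restarted to 1 when rc signals a restart, and returns the value
 * the callback itself should return. On failure it only returns -1;
 * it does not ask for a second cr_checkpoint call. */
int callback_result(int rc, int *restarted)
{
    assert(restarted != NULL);
    *restarted = (rc > 0);
    return (rc < 0) ? -1 : 0;
}
```

In the callback this would replace the if/else chain: set dsm_checkpoint_wakeup when *restarted is 1, then return the helper's result directly.
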
When I use "cr_checkpoint PID" to initiate the checkpoint, gdb shows the
following real-time event:

Program received signal SIG64, Real-time event 64.
[Switching to Thread 1128709040 (LWP 20343)]
0x4007004b in __cri_ioctl (arg1=9, arg2=-1073438458, arg3=0x0, errno_p=0x4346bb84)
            at cr_syscall.c:123
123     cri_syscall3(int, __cri_ioctl, __NR_ioctl, int, int, void*)


Then a consistent networking state is achieved and
cr_request_file("checkpoint0_0.chkpt"); is executed.
The output and backtrace from the program abort follow:

cr_core.c:238 cri_request: CR_OP_CHKPT_REQ returned -1 w/ errno=16
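
For reference, the two error numbers in this thread (errno=16 here and -22 in the kernel log) correspond on Linux to EBUSY and EINVAL; the latter matches the "Invalid argument" text printed by cr_restart. A minimal sketch for decoding them follows. The helper name describe_errno is hypothetical, and the text after the colon comes from strerror, so it depends on the C library and locale.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format "<NAME>: <strerror text>" into buf.
 * Accepts either sign, since the kernel log prints negated values (-22)
 * while libcr prints errno as a positive number (16). */
void describe_errno(int err, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    if (err < 0)
        err = -err;
    const char *name = (err == EBUSY)  ? "EBUSY"
                     : (err == EINVAL) ? "EINVAL"
                     : "(other)";
    snprintf(buf, len, "%s: %s", name, strerror(err));
}
```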


Program received signal SIGABRT, Aborted.
(gdb) bt
#0  0xffffe410 in ?? ()
#1  0x4326a1c8 in ?? ()
#2  0x00000006 in ?? ()
#3  0x00004f76 in ?? ()
#4  0x400e62c1 in raise () from /lib/tls/libc.so.6
#5  0x400e7b75 in abort () from /lib/tls/libc.so.6
#6  0x4006f2c1 in cri_request (fd=<value optimized out>,
     filename=0x4326a390 "checkpoint0_0.chkpt") at cr_core.c:228
#7  0x4006f46f in cr_request_file (filename=0x4326a390 "checkpoint0_0.chkpt")
     at cr_core.c:314
#8  0x088d5e50 in dsm_checkpoint_thread_function (params=0x0)
     at /home/i2cluster/studienarbeit/jackal/checkpointing_redevelop/manta/runtime/dsm/shm_dsm/checkpoint/dsm_checkpoint.c:193
#9  0x088f6cd9 in taco_pthread_boot (_thread=0x8e56e90) at taco.c:280
#10 0x40041aa7 in start_thread () from /lib/tls/libpthread.so.0
#11 0x40178c2e in clone () from /lib/tls/libc.so.6

In case you need it, I can provide the binary itself.

I hope that you can shed some light on this matter.
Regards,
  Christian
