Re: Re: cr_restart: ->cri_syscall(CR_OP_RSTRT_REAP): Invalid argument

From: sichiwai (Christian.M.Iwainsky_at_informatik.stud.uni-erlangen.de)
Date: Thu Oct 13 2005 - 15:56:12 PDT

  • Next message: Paul H. Hargrove: "Re: cr_restart: ->cri_syscall(CR_OP_RSTRT_REAP): Invalid argument"
    Replies appear below.
    > 
    > Christian Iwainsky wrote:
    > 
    >>Hello,
    >>I have a problem with BLCR.
    >>I have written a distributed program, which is successfully checkpointed.
    >>But once I try to restart the second instance of the program on one
    >>machine, the cr_restart function aborts with:
    >>cri_syscall(CR_OP_RSTRT_REAP): Invalid argument
    >>
    >>in /var/log/messages:
    >>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    >>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    >>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    >>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    >>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    >>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    >>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    >>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    >>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    >>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    >>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    >>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >>
    >>What is the problem? (The Pid is free)
    > 
    > 
    > The "invalid signature" means the context file you are trying to restart
    > from is either corrupted or possibly truncated.  I suspect that you have
    > not successfully checkpointed, but that the checkpoint operation has
    > failed without letting you know.  Is it possible that multiple processes
    > might have been writing their checkpoints to the *same* file?  That
    > would certainly result in a corrupted file.
    
    I tried having each process write to its own file, with the PID appended
    to the file name to make it unique. The result is the same.
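
For reference, a minimal sketch of the per-process naming described above. The helper make_context_filename and the "<prefix>.<pid>.chkpt" pattern are assumptions made for illustration; they are not part of BLCR and not necessarily the scheme used in the program discussed here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of BLCR): writes "<prefix>.<pid>.chkpt"
 * into buf so that two processes on one machine never share a context
 * file. Returns 0 on success, -1 if the name does not fit in buf. */
static int make_context_filename(char *buf, size_t len,
                                 const char *prefix, pid_t pid)
{
    assert(buf != NULL && prefix != NULL);
    int n = snprintf(buf, len, "%s.%ld.chkpt", prefix, (long)pid);
    /* snprintf reports the length it wanted; treat truncation as an error
     * so that a clipped name cannot collide with another process's file. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A caller would typically pass getpid() as the last argument and hand the resulting name to cr_request_file. A unique name only rules out two processes overwriting each other's file; it does not show that the file was written completely.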
    
    
    >>I also experience an interesting behaviour:
    >>I use the following code for the checkpoint-callback:
    >>dsm_checkpoint_ready is initialized to 0
    >>
    >>/***********************************************************/
    >>int chkpt_callback(void * aptr){
    >>fprintf(stderr,"chkpt_callback\n");
    >>if (!dsm_checkpoint_ready){
    >>  // the checkpoint thread function is asleep ... don't checkpoint yet,
    >>  // but awaken the checkpoint thread
    >>  dsm_checkpoint_sleep=0;
    >>  // postpone the checkpoint till jackal has a consistent state
    >>  fprintf(stderr,"Postponing checkpoint ..\n");
    >>  //cr_checkpoint(CR_CHECKPOINT_READY);
    >>  cr_checkpoint(CR_CHECKPOINT_TEMP_FAILURE);
    >>  return 0;
    >>}
    >>fprintf(stderr,"checkpopint callback: taking checkpoint\n");
    >>int chkptResult=cr_checkpoint(CR_CHECKPOINT_READY);
    >>if (chkptResult>0){
    >>  fprintf(stderr,"Restarting ...\n");
    >>  dsm_checkpoint_wakeup=1;
    >>} else if (chkptResult==0){
    >>  fprintf(stderr,"checkpointing ........\n");
    >>}else {
    >>  fprintf(stderr,"Checkpoint Failure\n");
    >>  cr_checkpoint(CR_CHECKPOINT_PERM_FAILURE);
    >>  return -1;
    >>}
    >>return 0;
    >>}
    >>
    >>Once the callback has postponed the checkpoint, the program is brought
    >>to a checkpointable state, and then cr_request_file is called to do
    >>the real checkpoint.
    >>The program crashes on the call to cr_request_file:
    >>
    > 
    > 
    > It is not clear to me from your description how cr_request_file might be
    > crashing.  I don't see anything wrong with your example except for your
    > call to "cr_checkpoint(CR_CHECKPOINT_PERM_FAILURE)" in case of an error
    > (you should just return -1, rather than calling cr_checkpoint a 2nd
    > time).  However, since that code will only run if something is already
    > "broken", I don't think it is the immediate cause of your problem.
    > 
    > Is it possible for you to send a stack backtrace from a core file
    > generated by this failure?  I could then get a better idea of what is
    > wrong inside cr_request_file.
    
    When I use "cr_checkpoint PID" to initiate the checkpoint, gdb shows
    the following real-time event:
    
    Program received signal SIG64, Real-time event 64.
    [Switching to Thread 1128709040 (LWP 20343)]
    0x4007004b in __cri_ioctl (arg1=9, arg2=-1073438458, arg3=0x0, errno_p=0x4346bb84)
         at cr_syscall.c:123
    123     cri_syscall3(int, __cri_ioctl, __NR_ioctl, int, int, void*)
    
    
    Then a consistent networking state is achieved and
    cr_request_file("checkpoint0_0.chkpt"); is executed.
    The backtrace from the program abort follows:
    
    cr_core.c:238 cri_request: CR_OP_CHKPT_REQ returned -1 w/ errno=16
    
    
    Program received signal SIGABRT, Aborted.
    (gdb) bt
    #0  0xffffe410 in ?? ()
    #1  0x4326a1c8 in ?? ()
    #2  0x00000006 in ?? ()
    #3  0x00004f76 in ?? ()
    #4  0x400e62c1 in raise () from /lib/tls/libc.so.6
    #5  0x400e7b75 in abort () from /lib/tls/libc.so.6
    #6  0x4006f2c1 in cri_request (fd=<value optimized out>,
         filename=0x4326a390 "checkpoint0_0.chkpt") at cr_core.c:228
    #7  0x4006f46f in cr_request_file (filename=0x4326a390 "checkpoint0_0.chkpt")
         at cr_core.c:314
    #8  0x088d5e50 in dsm_checkpoint_thread_function (params=0x0)
         at /home/i2cluster/studienarbeit/jackal/checkpointing_redevelop/manta/runtime/dsm/shm_dsm/checkpoint/dsm_checkpoint.c:193
    #9  0x088f6cd9 in taco_pthread_boot (_thread=0x8e56e90) at taco.c:280
    #10 0x40041aa7 in start_thread () from /lib/tls/libpthread.so.0
    #11 0x40178c2e in clone () from /lib/tls/libc.so.6
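
As a reading aid for the numeric codes above: the "-22" in the kernel log and the "errno=16" reported by cri_request are ordinary errno values. On Linux, 22 is EINVAL (matching the "Invalid argument" printed by cr_restart) and 16 is EBUSY. The sketch below decodes such codes; describe_errno is a hypothetical helper written for this note, not a BLCR function.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: accepts either a kernel-style negative return
 * code (e.g. -22) or a positive errno value (e.g. 16) and returns the
 * C library's message for it. */
static const char *describe_errno(int code)
{
    return strerror(code < 0 ? -code : code);
}

/* Prints one line per code, e.g. for the two values seen in this thread. */
static void print_errno(int code)
{
    printf("%d: %s\n", code, describe_errno(code));
}
```

Calling print_errno(-22) and print_errno(16) prints the messages for EINVAL and EBUSY; the exact wording depends on the C library (glibc uses "Invalid argument" and "Device or resource busy").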
    
    In case you need it, I can provide the binary itself.
    
    I hope that you can shed some light on this matter.
    Regards,
      Christian
    
