Re: cr_restart: ->cri_syscall(CR_OP_RSTRT_REAP): Invalid argument

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Oct 13 2005 - 17:22:32 PDT

  • Next message: Ladislav Subr: "Re: BLCR 0.4.1 Beta5 now available"
    Christian,
    
    I still have no clue on the first problem (the invalid signature), but
    might have some idea on the other issue.  See below...
    
    sichiwai wrote:
    > [snip]
    >>> I also experience an interesting behaviour:
    >>> I use the following code for the checkpoint-callback:
    >>> dsm_checkpoint_ready is initialized to 0
    >>>
    >>> /***********************************************************/
    >>> int chkpt_callback(void * aptr){
    >>> fprintf(stderr,"chkpt_callback\n");
    >>> if (!dsm_checkpoint_ready){
    >>>  // the checkpoint thread function is asleep ... don't checkpoint yet
    >>>  // but awaken the checkpoint thread
    >>>  dsm_checkpoint_sleep=0;
    >>>  // postpone the checkpoint till jackal has a consistent state
    >>>  fprintf(stderr,"Postponing checkpoint ..\n");
    >>>  //cr_checkpoint(CR_CHECKPOINT_READY);
    >>>  cr_checkpoint(CR_CHECKPOINT_TEMP_FAILURE);
    >>>  return 0;
    >>> }
    >>> fprintf(stderr,"checkpopint callback: taking checkpoint\n");
    >>> int chkptResult=cr_checkpoint(CR_CHECKPOINT_READY);
    >>> if (chkptResult>0){
    >>>  fprintf(stderr,"Restarting ...\n");
    >>>  dsm_checkpoint_wakeup=1;
    >>> } else if (chkptResult==0){
    >>>  fprintf(stderr,"checkpointing ........\n");
    >>> }else {
    >>>  fprintf(stderr,"Checkpoint Failure\n");
    >>>  cr_checkpoint(CR_CHECKPOINT_PERM_FAILURE);
    >>>  return -1;
    >>> }
    >>> return 0;
    >>> }
    >>>
    >>> Once the callback has postponed the checkpoint, the program state is
    >>> brought to a checkpoint state, and then cr_request_file is called to
    >>> do the real checkpoint.
    >>> The program crashes on the call to cr_request_file:
    >>>
    >>
    >>
    >> It is not clear to me from your description how cr_request_file might be
    >> crashing.  I don't see anything wrong with your example except for your
    >> call to "cr_checkpoint(CR_CHECKPOINT_PERM_FAILURE)" in case of an error
    >> (you should just return -1, rather than calling cr_checkpoint a 2nd
    >> time).  However, since that code will only run if something is already
    >> "broken", I don't think it is the immediate cause of your problem.
    >>
    >> Is it possible for you to send a stack backtrace from a core file
    >> generated by this failure?  I could then get a better idea of what is
    >> wrong inside cr_request_file.
    > When I use "cr_checkpoint PID" to initiate the checkpoint, gdb
    > shows the following real-time event:
    >
    > Program received signal SIG64, Real-time event 64.
    > [Switching to Thread 1128709040 (LWP 20343)]
    > 0x4007004b in __cri_ioctl (arg1=9, arg2=-1073438458, arg3=0x0, errno_p=0x4346bb84)
    >     at cr_syscall.c:123
    > 123     cri_syscall3(int, __cri_ioctl, __NR_ioctl, int, int, void*)
    >
    >
    > Then a consistent networking state is achieved and
    > cr_request_file("checkpoint0_0.chkpt"); is executed.
    > The backtrace from the program abort follows:
    >
    > cr_core.c:238 cri_request: CR_OP_CHKPT_REQ returned -1 w/ errno=16
    The code implementing cr_request_file() has aborted with an assertion
    failure because the call to request the checkpoint returned with
    errno=16, which is EBUSY.  This means that the kernel module believes
    that there is already a checkpoint request outstanding for this process.
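
    For reference, the errno value in the message can be decoded as in the
    following minimal sketch.  It assumes Linux, where EBUSY is 16, and the
    helper names describe_request_errno and demo_describe are hypothetical;
    they are not part of BLCR.

    ```c
    #include <assert.h>
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical helper (not a BLCR function): turn the errno reported by
     * a failed checkpoint request into a readable explanation. */
    const char *describe_request_errno(int err)
    {
        if (err == EBUSY)
            return "EBUSY: a checkpoint request appears to be outstanding already";
        /* Fall back to the C library's generic message for other values. */
        return strerror(err);
    }

    /* Print the explanation for the errno seen in the report. */
    void demo_describe(void)
    {
        assert(EBUSY == 16);  /* assumption: Linux errno numbering */
        fprintf(stderr, "errno=%d: %s\n", EBUSY, describe_request_errno(EBUSY));
    }
    ```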
    
    I see two ways this might happen.  One is that another thread has
    already requested a checkpoint of this process.  The other is that
    something went wrong in BLCR at the CR_CHECKPOINT_TEMP_FAILURE, leaving
    the kernel under the mistaken impression that the original checkpoint
    request is still pending.  I am going to look into this second possibility.
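
    To help rule out the first possibility on the application side, a
    process-wide guard along the following lines might be useful.  This is
    only a sketch under my own assumptions: try_begin_request and
    end_request are hypothetical names, not BLCR calls, and it assumes a
    compiler with C11 atomics.  The idea is that a thread calls
    try_begin_request() before asking for a checkpoint and skips the request
    if it returns false; end_request() is called once that checkpoint has
    completed or failed.

    ```c
    #include <stdatomic.h>
    #include <stdbool.h>

    /* Hypothetical flag: set while some thread owns an outstanding request. */
    static atomic_flag request_pending = ATOMIC_FLAG_INIT;

    /* Returns true if the caller may issue a request now, false if another
     * request is still marked as pending. */
    bool try_begin_request(void)
    {
        /* test_and_set returns the previous value: false means we got it. */
        return !atomic_flag_test_and_set(&request_pending);
    }

    /* Mark the outstanding request as finished so a new one may be issued. */
    void end_request(void)
    {
        atomic_flag_clear(&request_pending);
    }
    ```

    If the abort still occurs with such a guard in place, that would point
    toward the second possibility rather than a duplicate request from
    another thread.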
    
    > Program received signal SIGABRT, Aborted.
    > (gdb) bt
    > #0  0xffffe410 in ?? ()
    > #1  0x4326a1c8 in ?? ()
    > #2  0x00000006 in ?? ()
    > #3  0x00004f76 in ?? ()
    > #4  0x400e62c1 in raise () from /lib/tls/libc.so.6
    > #5  0x400e7b75 in abort () from /lib/tls/libc.so.6
    > #6  0x4006f2c1 in cri_request (fd=<value optimized out>,
    >     filename=0x4326a390 "checkpoint0_0.chkpt") at cr_core.c:228
    > #7  0x4006f46f in cr_request_file (filename=0x4326a390 "checkpoint0_0.chkpt")
    >     at cr_core.c:314
    > #8  0x088d5e50 in dsm_checkpoint_thread_function (params=0x0)
    >     at /home/i2cluster/studienarbeit/jackal/checkpointing_redevelop/manta/runtime/dsm/shm_dsm/checkpoint/dsm_checkpoint.c:193
    > #9  0x088f6cd9 in taco_pthread_boot (_thread=0x8e56e90) at taco.c:280
    > #10 0x40041aa7 in start_thread () from /lib/tls/libpthread.so.0
    > #11 0x40178c2e in clone () from /lib/tls/libc.so.6
    >
    > In case you need it, I can provide the binary itself.
    >
    > I hope that you can shed some light into this matter.
    > Regards,
    >  Christian
    
    -Paul
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
