From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Oct 13 2005 - 17:22:32 PDT
Christian,

I still have no clue on the first problem (the invalid signature), but
might have some idea on the other issue.  See below...

sichiwai wrote:
> [snip]
>>> I also experience an interesting behaviour:
>>> I use the following code for the checkpoint-callback:
>>> dsm_checkpoint_ready is initialized to 0
>>>
>>> /***********************************************************/
>>> int chkpt_callback(void * aptr){
>>>   fprintf(stderr,"chkpt_callback\n");
>>>   if (!dsm_checkpoint_ready){
>>>     // the checkpoint thread function is asleep ... don't checkpoint yet
>>>     // but awaken the checkpoint thread
>>>     dsm_checkpoint_sleep=0;
>>>     // postpone the checkpoint till jackal has a consistent state
>>>     fprintf(stderr,"Postponing checkpoint ..\n");
>>>     //cr_checkpoint(CR_CHECKPOINT_READY);
>>>     cr_checkpoint(CR_CHECKPOINT_TEMP_FAILURE);
>>>     return 0;
>>>   }
>>>   fprintf(stderr,"checkpopint callback: taking checkpoint\n");
>>>   int chkptResult=cr_checkpoint(CR_CHECKPOINT_READY);
>>>   if (chkptResult>0){
>>>     fprintf(stderr,"Restarting ...\n");
>>>     dsm_checkpoint_wakeup=1;
>>>   } else if (chkptResult==0){
>>>     fprintf(stderr,"checkpointing ........\n");
>>>   }else {
>>>     fprintf(stderr,"Checkpoint Failure\n");
>>>     cr_checkpoint(CR_CHECKPOINT_PERM_FAILURE);
>>>     return -1;
>>>   }
>>>   return 0;
>>> }
>>>
>>> Once the callback has postponed the checkpoint, the program state is
>>> brought to a checkpoint state, and then cr_request_file is called to
>>> do the real checkpoint.
>>> The program crashes on the call to cr_request_file:
>>>
>>
>>
>> It is not clear to me from your description how cr_request_file might be
>> crashing.  I don't see anything wrong with your example except for your
>> call to "cr_checkpoint(CR_CHECKPOINT_PERM_FAILURE)" in case of an error
>> (you should just return -1, rather than calling cr_checkpoint a 2nd
>> time).  However, since that code will only run if something is already
>> "broken", I don't think it is the immediate cause of your problem.
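
For reference, the error-path change suggested in the quoted reply (return -1 instead of calling cr_checkpoint a second time) could look roughly like the sketch below. To let it compile without libcr, the cr_checkpoint function and the two CR_CHECKPOINT_* constants are replaced by local stand-ins with arbitrary values; stub_result and stub_calls are hypothetical names used only for illustration. In a real build these would come from the BLCR headers. The postponing branch is kept as in the quoted code and is not claimed to be correct.

```c
#include <stdio.h>

/* Stand-ins so this sketch compiles without libcr. The values are
 * arbitrary placeholders, NOT the real BLCR definitions. */
enum { CR_CHECKPOINT_READY, CR_CHECKPOINT_TEMP_FAILURE };

int stub_result = 0;  /* hypothetical: value the stub returns */
int stub_calls = 0;   /* hypothetical: number of stub invocations */

/* Stub replacing the real cr_checkpoint() for illustration only. */
int cr_checkpoint(int flags) {
    (void)flags;
    stub_calls++;
    return stub_result;
}

/* Application flags, named as in the quoted code. */
int dsm_checkpoint_ready = 0;
int dsm_checkpoint_sleep = 1;
int dsm_checkpoint_wakeup = 0;

int chkpt_callback(void *aptr) {
    (void)aptr;
    if (!dsm_checkpoint_ready) {
        /* Not in a consistent state yet: wake the checkpoint thread
         * and postpone, exactly as the quoted code does. */
        dsm_checkpoint_sleep = 0;
        fprintf(stderr, "Postponing checkpoint ..\n");
        cr_checkpoint(CR_CHECKPOINT_TEMP_FAILURE);
        return 0;
    }
    int chkptResult = cr_checkpoint(CR_CHECKPOINT_READY);
    if (chkptResult > 0) {
        /* Positive result: we are running after a restart. */
        fprintf(stderr, "Restarting ...\n");
        dsm_checkpoint_wakeup = 1;
    } else if (chkptResult < 0) {
        /* Failure: report it and return -1 only; no second
         * cr_checkpoint() call here. */
        fprintf(stderr, "Checkpoint Failure\n");
        return -1;
    }
    return 0;
}
```

With the stub set to return a negative value, the callback returns -1 after exactly one cr_checkpoint call, which is the behaviour the advice above asks for.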
>>
>> Is it possible for you to send a stack backtrace from a core file
>> generated by this failure?  I could then get a better idea of what is
>> wrong inside cr_request_file.
> When I use "cr_checkpoint PID" to initiate the checkpoint, gdb
> shows the following realtime event:
>
> Program received signal SIG64, Real-time event 64.
> [Switching to Thread 1128709040 (LWP 20343)]
> 0x4007004b in __cri_ioctl (arg1=9, arg2=-1073438458, arg3=0x0,
>     errno_p=0x4346bb84)
>     at cr_syscall.c:123
> 123     cri_syscall3(int, __cri_ioctl, __NR_ioctl, int, int, void*)
>
>
> Then a consistent networking state is achieved and
> cr_request_file("checkpoint0_0.chkpt"); is executed.
> The backtrace from the program abort follows:
>
> cr_core.c:238 cri_request: CR_OP_CHKPT_REQ returned -1 w/ errno=16

The code implementing cr_request_file() has aborted with an assertion
failure because the call to request the checkpoint returned with
errno=16, which is EBUSY.  This means that the kernel module believes
that there is already a checkpoint request outstanding for this process.
I see two ways this might happen.  One is that another thread has
already requested a checkpoint of this process.  The other is that
something went wrong in BLCR at the CR_CHECKPOINT_TEMP_FAILURE call,
leaving the kernel under the mistaken impression that the original
checkpoint request is still pending.  I am going to look into this
second possibility.

> Program received signal SIGABRT, Aborted.
> (gdb) bt
> #0  0xffffe410 in ?? ()
> #1  0x4326a1c8 in ?? ()
> #2  0x00000006 in ?? ()
> #3  0x00004f76 in ?? ()
> #4  0x400e62c1 in raise () from /lib/tls/libc.so.6
> #5  0x400e7b75 in abort () from /lib/tls/libc.so.6
> #6  0x4006f2c1 in cri_request (fd=<value optimized out>,
>     filename=0x4326a390 "checkpoint0_0.chkpt") at cr_core.c:228
> #7  0x4006f46f in cr_request_file (filename=0x4326a390
>     "checkpoint0_0.chkpt")
>     at cr_core.c:314
> #8  0x088d5e50 in dsm_checkpoint_thread_function (params=0x0)
>     at
> /home/i2cluster/studienarbeit/jackal/checkpointing_redevelop/manta/runtime/dsm/shm_dsm/checkpoint/dsm_checkpoint.c:193
>
> #9  0x088f6cd9 in taco_pthread_boot (_thread=0x8e56e90) at taco.c:280
> #10 0x40041aa7 in start_thread () from /lib/tls/libpthread.so.0
> #11 0x40178c2e in clone () from /lib/tls/libc.so.6
>
> In case you need it, I can provide the binary itself.
>
> I hope that you can shed some light on this matter.
> Regards,
> Christian

-Paul

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
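
P.S. The mapping from the numeric errno in the assertion message (16) to EBUSY can be checked with a few lines of standard C. This is only an illustrative sketch: errno_name and report_request_failure are hypothetical helper names, not part of BLCR, and the set of errno values named is an arbitrary selection.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: give a symbolic name for a few errno values. */
const char *errno_name(int err) {
    switch (err) {
    case EBUSY:  return "EBUSY";
    case EINVAL: return "EINVAL";
    case ENOMEM: return "ENOMEM";
    default:     return "(other)";
    }
}

/* Hypothetical helper: print a numeric errno together with its
 * symbolic name and the C library's description of it. */
void report_request_failure(int err) {
    fprintf(stderr, "checkpoint request failed: errno=%d (%s): %s\n",
            err, errno_name(err), strerror(err));
}
```

On Linux, calling report_request_failure(16) names the value as EBUSY, matching the reading of the assertion message given above.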