From: sichiwai (Christian.M.Iwainsky_at_informatik.stud.uni-erlangen.de)
Date: Thu Oct 13 2005 - 15:56:12 PDT
Replies appear below.

>
> Christian Iwainsky wrote:
>
>>Hello,
>>I have a problem with BLCR.
>>I have written a distributed program, which is successfully checkpointed.
>>But once I try to restart the second instance of the program on one
>>machine, the cr_restart function aborts with:
>>cri_syscall(CR_OP_RSTRT_REAP): Invalid argument
>>
>>in /var/log/messages:
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
>>Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting.
>>-22
>>
>>What is the problem? (The PID is free.)
>
>
> The "invalid signature" means the context file you are trying to restart
> from is either corrupted or possibly truncated. I suspect that you have
> not successfully checkpointed, but that the checkpoint operation has
> failed without letting you know. Is it possible that multiple processes
> might have been writing their checkpoints to the *same* file? That
> would certainly result in a corrupted file.

I tried having each process write to a file name with its PID appended
to make it unique; the result is the same.

>>I also experience an interesting behaviour:
>>I use the following code for the checkpoint callback:
>>dsm_checkpoint_ready is initialized to 0
>>
>>/***********************************************************/
>>int chkpt_callback(void * aptr){
>>fprintf(stderr,"chkpt_callback\n");
>>if (!dsm_checkpoint_ready){
>>  // the checkpoint thread function is asleep ... don't checkpoint yet
>>  // but awaken the checkpoint thread
>>  dsm_checkpoint_sleep=0;
>>  // postpone the checkpoint till jackal has a consistent state
>>  fprintf(stderr,"Postponing checkpoint ..\n");
>>  //cr_checkpoint(CR_CHECKPOINT_READY);
>>  cr_checkpoint(CR_CHECKPOINT_TEMP_FAILURE);
>>  return 0;
>>}
>>fprintf(stderr,"checkpoint callback: taking checkpoint\n");
>>int chkptResult=cr_checkpoint(CR_CHECKPOINT_READY);
>>if (chkptResult>0){
>>  fprintf(stderr,"Restarting ...\n");
>>  dsm_checkpoint_wakeup=1;
>>} else if (chkptResult==0){
>>  fprintf(stderr,"checkpointing ........\n");
>>}else {
>>  fprintf(stderr,"Checkpoint Failure\n");
>>  cr_checkpoint(CR_CHECKPOINT_PERM_FAILURE);
>>  return -1;
>>}
>>return 0;
>>}
>>
>>Once the callback has postponed the checkpoint, the program state is
>>brought to a checkpoint state, and then cr_request_file is called to do
>>the real checkpoint.
>>The program crashes on the call to cr_request_file:
>>
>
>
> It is not clear to me from your description how cr_request_file might be
> crashing.
> I don't see anything wrong with your example except for your
> call to "cr_checkpoint(CR_CHECKPOINT_PERM_FAILURE)" in case of an error
> (you should just return -1, rather than calling cr_checkpoint a 2nd
> time). However, since that code will only run if something is already
> "broken", I don't think it is the immediate cause of your problem.
>
> Is it possible for you to send a stack backtrace from a core file
> generated by this failure? I could then get a better idea of what is
> wrong inside cr_request_file.

When I use "cr_checkpoint PID" to initiate the checkpoint, gdb shows the
following real-time event:

Program received signal SIG64, Real-time event 64.
[Switching to Thread 1128709040 (LWP 20343)]
0x4007004b in __cri_ioctl (arg1=9, arg2=-1073438458, arg3=0x0, errno_p=0x4346bb84) at cr_syscall.c:123
123 cri_syscall3(int, __cri_ioctl, __NR_ioctl, int, int, void*)

Then a consistent networking state is achieved and
cr_request_file("checkpoint0_0.chkpt"); is executed. The backtrace from
the program abort follows:

cr_core.c:238 cri_request: CR_OP_CHKPT_REQ returned -1 w/ errno=16

Program received signal SIGABRT, Aborted.
(gdb) bt
#0 0xffffe410 in ?? ()
#1 0x4326a1c8 in ?? ()
#2 0x00000006 in ?? ()
#3 0x00004f76 in ?? ()
#4 0x400e62c1 in raise () from /lib/tls/libc.so.6
#5 0x400e7b75 in abort () from /lib/tls/libc.so.6
#6 0x4006f2c1 in cri_request (fd=<value optimized out>, filename=0x4326a390 "checkpoint0_0.chkpt") at cr_core.c:228
#7 0x4006f46f in cr_request_file (filename=0x4326a390 "checkpoint0_0.chkpt") at cr_core.c:314
#8 0x088d5e50 in dsm_checkpoint_thread_function (params=0x0) at /home/i2cluster/studienarbeit/jackal/checkpointing_redevelop/manta/runtime/dsm/shm_dsm/checkpoint/dsm_checkpoint.c:193
#9 0x088f6cd9 in taco_pthread_boot (_thread=0x8e56e90) at taco.c:280
#10 0x40041aa7 in start_thread () from /lib/tls/libpthread.so.0
#11 0x40178c2e in clone () from /lib/tls/libc.so.6

In case you need it, I can provide the binary itself.
I hope that you can shed some light on this matter.

Regards,
Christian