From: Christian Iwainsky (sichiwai_at_informatik.stud.uni-erlangen.de)
Date: Tue Oct 11 2005 - 05:52:18 PDT
Hello,

I have a problem with BLCR. I have written a distributed program, which is successfully checkpointed. But when I try to restart the second instance of the program on one machine, cr_restart aborts with:

    cri_syscall(CR_OP_RSTRT_REAP): Invalid argument

and in /var/log/messages:

    Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22

What is the problem?
(The PID is free.)

I also see some interesting behaviour. I use the following code for the checkpoint callback (dsm_checkpoint_ready is initialized to 0):

    /***********************************************************/
    int chkpt_callback(void * aptr){
        fprintf(stderr,"chkpt_callback\n");
        if (!dsm_checkpoint_ready){
            // the checkpoint thread function is asleep ... don't checkpoint yet, but awaken the checkpoint thread
            dsm_checkpoint_sleep=0;
            // postpone the checkpoint till jackal has a consistent state
            fprintf(stderr,"Postponing checkpoint ..\n");
            //cr_checkpoint(CR_CHECKPOINT_READY);
            cr_checkpoint(CR_CHECKPOINT_TEMP_FAILURE);
            return 0;
        }
        fprintf(stderr,"checkpopint callback: taking checkpoint\n");
        int chkptResult=cr_checkpoint(CR_CHECKPOINT_READY);
        if (chkptResult>0){
            fprintf(stderr,"Restarting ...\n");
            dsm_checkpoint_wakeup=1;
        } else if (chkptResult==0){
            fprintf(stderr,"checkpointing ........\n");
        }else {
            fprintf(stderr,"Checkpoint Failure\n");
            cr_checkpoint(CR_CHECKPOINT_PERM_FAILURE);
            return -1;
        }
        return 0;
    }

Once the callback has postponed the checkpoint, the program is brought to a consistent, checkpointable state, and then cr_request_file is called to do the real checkpoint. The program crashes on the call to cr_request_file:

Regards,
Christian