From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Nov 02 2005 - 10:56:46 PST
I think I may now see at least part of the problem, here:

    Nov 2 16:30:32 faui21l kernel: vmadump: mmap failed: /var/run/nscd/db5bHKnB (deleted)

I've seen something like this before on an NFS filesystem (with Intel compilers). The thing to note here is that the text " (deleted)" is *NOT* part of the error message, but part of the saved filename "/var/run/nscd/db5bHKnB (deleted)". This tells me that for some reason the code that saves the filename thought the file had been deleted. Could you please check whether the indicated file is actually present? I suspect that it *is* present and that it is being mistakenly marked as deleted. If the file does still exist, I can look into how to work around this problem.

There is also an issue here of how BLCR is dealing with the situation. Rather than terminating the entire restore at the failed mmap(), it appears that the restore of the first thread terminated (the message ending with "aborting. -2") at that point and the restore of the 2nd through 5th threads was attempted from the wrong point in the context file (resulting in the 4 instances of "invalid signature").

-Paul

Christian Iwainsky wrote:
> Hello, I am still looking into that problem with cr_restart. It
> still terminates with "cri_syscall(CR_OP_RSTRT_REAP): Invalid argument".
> The two instances of the program go through the code in
> >code.txt<; one instance has rank = 0 and the other one has rank = 1.
> After this code fragment the program is killed.
>
> Afterwards I tried to restart instance 1 from the checkpoint
> debug_liz_tcp_turn_done_0_1.chkpt, where I get the above-stated
> message: Invalid argument.
>
> I appended the logfile from the kernel-messages log.
> The line ____________________________ corresponds to the start of the
> two instances of the program.
> The second line "here it happenes
> _______________________________________" is the place where I restart
> the checkpoint.
>
> Maybe you can tell me what happens here.
>
> Greetings
> Christian
>
>
>------------------------------------------------------------------------
>
>char fileName[1024];
>
>    DEBUG_ENTER();
>    int ret;
>
>    dev_tcp_t *dev_tcp = LIZ_DEVICE_GET_PRIVATE(tcp, self, dev_tcp_t);
>
>    assert(dev_tcp);
>
>    /*
>     * create server socket
>     */
>    assert(dev_tcp->port >= 0);
>    ret = create_server_socket(&dev_tcp->server_socket, dev_tcp->port);
>    assert(ret == LIZ_OK);
>
>    sprintf(fileName,"debug_liz_tcp_server_socket_%i.chkpt",getpid());
>    cr_request_file(fileName);
>
>    /*
>     * accept connections from other hosts
>     */
>    int port = dev_tcp->port;
>
>    if (dev_tcp->port == 0) {
>        struct sockaddr_in sockaddr;
>        socklen_t len = sizeof(struct sockaddr_in);
>        ret = getsockname(dev_tcp->server_socket, (struct sockaddr *) &sockaddr, &len);
>        assert(ret == 0);
>        port = ntohs(sockaddr.sin_port);
>    }
>
>    dev_tcp->port = port;
>
>    FD_SET(dev_tcp->server_socket, &dev_tcp->fdset);
>    dev_tcp->maxfd = MAX(dev_tcp->maxfd, dev_tcp->server_socket + 1);
>    dev_tcp->no_fds++;
>
>    /*
>     * set flags of server socket
>     */
>    assert(dev_tcp->server_socket != -1);
>    int flags;
>
>    flags = fcntl(dev_tcp->server_socket, F_GETFL, 0);
>    assert(flags != -1);
>
>    // TODO: set correct socket options
>    flags &= ~O_NONBLOCK;
>    ret = fcntl(dev_tcp->server_socket, F_SETFL, flags);
>    assert(ret == 0);
>
>
>    liz_rank_t node;
>    liz_rank_t turn;
>
>    sprintf(fileName,"debug_liz_tcp_preconnect_%i.chkpt",getpid());
>    cr_request_file(fileName);
>
>    for (turn = 0; turn < dev_tcp->no_connections; turn++) {
>        if (turn == dev_tcp->rank) {
>            /*
>             * accept connections from all other nodes
>             */
>            for (node = turn + 1; node < dev_tcp->no_connections; node++) {
>                // *connection_t *c = &dev_tcp->connections[node];
>
>                // *if (!IS_CONNECTION_OPENED(c)) {
>                // *fprintf(stderr, "accepting connection from rank " FMT_RANK()"... ", node);
>#ifdef DEBUG
>                fprintf(stderr, "accepting connection... ");
>#endif
>                int socket =
>                    accept(dev_tcp->server_socket, (struct sockaddr *) NULL, NULL);
>
>                /*
>                 * read the rank of the node that issued the request to connect
>                 */
>                liz_rank_t rank;
>                int ret = liz_read(socket, &rank, sizeof(liz_rank_t),&dev_tcp->cont);
>
>                assert(ret == sizeof(liz_rank_t));
>                assert(rank < dev_tcp->no_connections);
>                assert(rank >= 0);
>#ifdef DEBUG
>                fprintf(stderr, "connected rank " FMT_INT()" (socket " FMT_INT()")\n", rank,socket);
>#endif
>                connection_t *c = &dev_tcp->connections[rank];
>
>                if (socket >= 0) {
>                    c->state = CONNECTION_OPENED;
>                    c->socket = socket;
>#ifdef DEBUG
>                    c->rank = rank;
>#endif
>
>                    FD_SET(c->socket, &dev_tcp->fdset);
>                    dev_tcp->maxfd = MAX(dev_tcp->maxfd, c->socket + 1);
>                    dev_tcp->no_fds++;
>
>                    ret = internal_setsockopt(c);
>                    if (ret)
>                        return ret;
>                }
>                // *}
>            }
>        } else {
>            /*
>             * connect to all other nodes if not already connected
>             */
>            for (node = 0; node <= turn; node++) {
>                connection_t *c = &dev_tcp->connections[node];
>
>                if (dev_tcp->rank != node && !IS_CONNECTION_OPENED(c)) {
>#ifdef DEBUG
>                    fprintf(stderr, "opening connection to %s[" FMT_RANK()"]...", c->hostname,
>                            node);
>#endif
>                    /*
>                     * create a client socket connection to the remote host
>                     */
>                    ret = create_client_socket(c, c->hostname, c->port);
>                    assert(ret == LIZ_OK);
>
>                    /*
>                     * now connect a local socket and the remote socket
>                     */
>                    ret = internal_connect(c, node);
>                    assert(ret == 0);
>
>                    /*
>                     * send local rank to remote node
>                     */
>#if DSM_CHECKPOINT
>
>                    int ret = liz_write(c->socket, &dev_tcp->rank, sizeof(liz_rank_t),0);
>#else
>
>                    int ret = liz_write(c->socket, &dev_tcp->rank, sizeof(liz_rank_t));
>#endif
>                    assert(ret == sizeof(liz_rank_t));
>                    c->state = CONNECTION_OPENED;
>
>#ifdef DEBUG
>                    c->rank = node;
>#endif
>                    FD_SET(c->socket, &dev_tcp->fdset);
>                    dev_tcp->maxfd = MAX(dev_tcp->maxfd, c->socket + 1);
>                    dev_tcp->no_fds++;
>                }
>            }
>        }
>        /* do checkpoint here */
>        sprintf(fileName,"debug_liz_tcp_turn_done_%i_%i.chkpt",turn,dev_tcp->rank);
>        cr_request_file(fileName);
>    }
>
>    sprintf(fileName,"debug_liz_tcp_connections_up_%i.chkpt",getpid());
>    cr_request_file(fileName);
>#if CHECK_CONNECTIONS_EMPTY_AFTER_STARTUP
>    /*
>     * check the connections for debugging purposes
>     */
>    check_connections(dev_tcp);
>#endif
>
>    /*
>     * start up receiver thread
>     */
>#ifdef TACO
>    dev_tcp->thread = taco_thread_create((TTaco_Func) self->run_as_thread,self, NULL, NULL, NULL, NULL, 0, 0);
>#else
>    liz_thread_create(&dev_tcp->thread, NULL, self->run_as_thread, self);
>#endif
>    fprintf(stderr,"START_TCP_MODULE done\n");
>    DEBUG_LEAVE();
>    return LIZ_OK;
>}
>
>------------------------------------------------------------------------
>
>Nov 2 16:28:53 faui21l sichiwai: __________________________________________________________________________________________--
>Nov 2 16:29:37 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7593: : cr_chkpt_reap returning -22
>Nov 2 16:29:37 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7593: : process 7593 checkpointing its own process 7593
>Nov 2 16:29:37 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7595: : Preparing to dump 5 threads
>Nov 2 16:29:37 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7595: : Writing the fs struct...
>Nov 2 16:29:37 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7595: : Writing the open file section...
>Nov 2 16:29:37 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7595: : ...files_struct
>Nov 2 16:29:37 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7595: : ...files
>Nov 2 16:29:37 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7593: : cr_chkpt_reap returning -11
>Nov 2 16:29:37 faui21l kernel: cr_chkpt_done <cr_chkpt_req.c:893>, pid 7593: : cr_chkpt_done returning 1
>Nov 2 16:29:37 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7593: : cr_chkpt_reap returning 0
>Nov 2 16:29:37 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7593: : process 7593 checkpointing its own process 7593
>Nov 2 16:29:37 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7596: : Preparing to dump 5 threads
>Nov 2 16:29:38 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7597: : Writing the fs struct...
>Nov 2 16:29:38 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7597: : Writing the open file section...
>Nov 2 16:29:38 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7597: : ...files_struct
>Nov 2 16:29:38 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7597: : ...files
>Nov 2 16:29:38 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7593: : cr_chkpt_reap returning 0
>Nov 2 16:29:38 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7593: : process 7593 checkpointing its own process 7593
>Nov 2 16:29:38 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7593: : Preparing to dump 5 threads
>Nov 2 16:29:38 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7597: : Writing the fs struct...
>Nov 2 16:29:38 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7597: : Writing the open file section...
>Nov 2 16:29:38 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7597: : ...files_struct
>Nov 2 16:29:38 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7597: : ...files
>Nov 2 16:29:38 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7593: : cr_chkpt_reap returning 0
>Nov 2 16:29:38 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7593: : cr_chkpt_reap returning -22
>Nov 2 16:29:39 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7593: : process 7593 checkpointing its own process 7593
>Nov 2 16:29:39 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7594: : Preparing to dump 5 threads
>Nov 2 16:29:39 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7594: : Writing the fs struct...
>Nov 2 16:29:39 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7594: : Writing the open file section...
>Nov 2 16:29:39 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7594: : ...files_struct
>Nov 2 16:29:39 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7594: : ...files
>Nov 2 16:29:39 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7593: : cr_chkpt_reap returning 0
>Nov 2 16:29:39 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7593: : process 7593 checkpointing its own process 7593
>Nov 2 16:29:39 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7595: : Preparing to dump 5 threads
>Nov 2 16:29:39 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7595: : Writing the fs struct...
>Nov 2 16:29:39 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7595: : Writing the open file section...
>Nov 2 16:29:39 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7595: : ...files_struct
>Nov 2 16:29:39 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7595: : ...files
>Nov 2 16:29:39 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7593: : cr_chkpt_reap returning 0
>Nov 2 16:29:39 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7593: : process 7593 checkpointing its own process 7593
>Nov 2 16:29:39 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7595: : Preparing to dump 5 threads
>Nov 2 16:29:39 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7594: : Writing the fs struct...
>Nov 2 16:29:39 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7594: : Writing the open file section...
>Nov 2 16:29:39 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7594: : ...files_struct
>Nov 2 16:29:39 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7594: : ...files
>Nov 2 16:29:39 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7593: : cr_chkpt_reap returning 0
>Nov 2 16:29:39 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7593: : process 7593 checkpointing its own process 7593
>Nov 2 16:29:39 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7596: : Preparing to dump 5 threads
>Nov 2 16:29:39 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7596: : Writing the fs struct...
>Nov 2 16:29:39 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7596: : Writing the open file section...
>Nov 2 16:29:39 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7596: : ...files_struct
>Nov 2 16:29:39 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7596: : ...files
>Nov 2 16:29:39 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7593: : cr_chkpt_reap returning 0
>Nov 2 16:29:39 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7593: : process 7593 checkpointing its own process 7593
>Nov 2 16:29:39 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7597: : Preparing to dump 5 threads
>Nov 2 16:29:39 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7594: : Writing the fs struct...
>Nov 2 16:29:39 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7594: : Writing the open file section...
>Nov 2 16:29:39 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7594: : ...files_struct
>Nov 2 16:29:39 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7594: : ...files
>Nov 2 16:29:39 faui21l kernel: Skipping a socket.
>Nov 2 16:29:39 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7593: : cr_chkpt_reap returning 0
>Nov 2 16:29:39 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7593: : process 7593 checkpointing its own process 7593
>Nov 2 16:29:39 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7596: : Preparing to dump 5 threads
>Nov 2 16:29:39 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7596: : Writing the fs struct...
>Nov 2 16:29:39 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7596: : Writing the open file section...
>Nov 2 16:29:39 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7596: : ...files_struct
>Nov 2 16:29:39 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7596: : ...files
>Nov 2 16:29:39 faui21l kernel: Skipping a socket.
>Nov 2 16:29:39 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7593: : cr_chkpt_reap returning 0
>Nov 2 16:29:39 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7600: : cr_chkpt_reap returning -22
>Nov 2 16:29:39 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7600: : process 7600 checkpointing its own process 7600
>Nov 2 16:29:39 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7602: : Preparing to dump 5 threads
>Nov 2 16:29:39 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7601: : Writing the fs struct...
>Nov 2 16:29:39 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7601: : Writing the open file section...
>Nov 2 16:29:39 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7601: : ...files_struct
>Nov 2 16:29:39 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7601: : ...files
>Nov 2 16:29:39 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7600: : cr_chkpt_reap returning 0
>Nov 2 16:29:39 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7600: : process 7600 checkpointing its own process 7600
>Nov 2 16:29:39 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7601: : Preparing to dump 5 threads
>Nov 2 16:29:40 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7601: : Writing the fs struct...
>Nov 2 16:29:40 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7601: : Writing the open file section...
>Nov 2 16:29:40 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7601: : ...files_struct
>Nov 2 16:29:40 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7601: : ...files
>Nov 2 16:29:40 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7600: : cr_chkpt_reap returning 0
>Nov 2 16:29:40 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7600: : process 7600 checkpointing its own process 7600
>Nov 2 16:29:40 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7600: : Preparing to dump 5 threads
>Nov 2 16:29:40 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7604: : Writing the fs struct...
>Nov 2 16:29:40 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7604: : Writing the open file section...
>Nov 2 16:29:40 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7604: : ...files_struct
>Nov 2 16:29:40 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7604: : ...files
>Nov 2 16:29:40 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7600: : cr_chkpt_reap returning -11
>Nov 2 16:29:40 faui21l kernel: cr_chkpt_done <cr_chkpt_req.c:893>, pid 7600: : cr_chkpt_done returning 1
>Nov 2 16:29:40 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7600: : cr_chkpt_reap returning 0
>Nov 2 16:29:40 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7600: : process 7600 checkpointing its own process 7600
>Nov 2 16:29:40 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7602: : Preparing to dump 5 threads
>Nov 2 16:29:40 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7602: : Writing the fs struct...
>Nov 2 16:29:40 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7602: : Writing the open file section...
>Nov 2 16:29:40 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7602: : ...files_struct
>Nov 2 16:29:40 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7602: : ...files
>Nov 2 16:29:40 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7600: : cr_chkpt_reap returning -11
>Nov 2 16:29:40 faui21l kernel: cr_chkpt_done <cr_chkpt_req.c:893>, pid 7600: : cr_chkpt_done returning 1
>Nov 2 16:29:40 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7600: : cr_chkpt_reap returning 0
>Nov 2 16:29:40 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7600: : process 7600 checkpointing its own process 7600
>Nov 2 16:29:40 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7604: : Preparing to dump 5 threads
>Nov 2 16:29:40 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7601: : Writing the fs struct...
>Nov 2 16:29:40 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7601: : Writing the open file section...
>Nov 2 16:29:40 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7601: : ...files_struct
>Nov 2 16:29:40 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7601: : ...files
>Nov 2 16:29:40 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7600: : cr_chkpt_reap returning 0
>Nov 2 16:29:40 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7600: : cr_chkpt_reap returning -22
>Nov 2 16:29:41 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7600: : process 7600 checkpointing its own process 7600
>Nov 2 16:29:41 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7601: : Preparing to dump 5 threads
>Nov 2 16:29:41 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7603: : Writing the fs struct...
>Nov 2 16:29:41 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7603: : Writing the open file section...
>Nov 2 16:29:41 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7603: : ...files_struct
>Nov 2 16:29:41 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7603: : ...files
>Nov 2 16:29:41 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7600: : cr_chkpt_reap returning 0
>Nov 2 16:29:41 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7600: : process 7600 checkpointing its own process 7600
>Nov 2 16:29:41 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7603: : Preparing to dump 5 threads
>Nov 2 16:29:41 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7600: : Writing the fs struct...
>Nov 2 16:29:41 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7600: : Writing the open file section...
>Nov 2 16:29:41 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7600: : ...files_struct
>Nov 2 16:29:41 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7600: : ...files
>Nov 2 16:29:41 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7600: : cr_chkpt_reap returning 0
>Nov 2 16:29:41 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7600: : cr_chkpt_reap returning -22
>Nov 2 16:29:41 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7600: : process 7600 checkpointing its own process 7600
>Nov 2 16:29:41 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7604: : Preparing to dump 5 threads
>Nov 2 16:29:41 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7604: : Writing the fs struct...
>Nov 2 16:29:41 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7604: : Writing the open file section...
>Nov 2 16:29:41 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7604: : ...files_struct
>Nov 2 16:29:41 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7604: : ...files
>Nov 2 16:29:41 faui21l kernel: Skipping a socket.
>Nov 2 16:29:42 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7600: : cr_chkpt_reap returning 0
>Nov 2 16:29:42 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7600: : cr_chkpt_reap returning -22
>Nov 2 16:29:42 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7600: : process 7600 checkpointing its own process 7600
>Nov 2 16:29:42 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7603: : Preparing to dump 5 threads
>Nov 2 16:29:42 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7602: : Writing the fs struct...
>Nov 2 16:29:42 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7602: : Writing the open file section...
>Nov 2 16:29:42 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7602: : ...files_struct
>Nov 2 16:29:42 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7602: : ...files
>Nov 2 16:29:42 faui21l kernel: Skipping a socket.
>Nov 2 16:29:42 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7600: : cr_chkpt_reap returning 0
>Nov 2 16:29:42 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7593: : cr_chkpt_reap returning -22
>Nov 2 16:29:42 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7600: : process 7600 checkpointing its own process 7600
>Nov 2 16:29:42 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7593: : process 7593 checkpointing its own process 7593
>Nov 2 16:29:42 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7602: : Preparing to dump 5 threads
>Nov 2 16:29:42 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7594: : Preparing to dump 5 threads
>Nov 2 16:29:42 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7602: : Writing the fs struct...
>Nov 2 16:29:42 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7602: : Writing the open file section...
>Nov 2 16:29:42 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7602: : ...files_struct
>Nov 2 16:29:42 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7602: : ...files
>Nov 2 16:29:42 faui21l kernel: Skipping a socket.
>Nov 2 16:29:42 faui21l kernel: Skipping a socket.
>Nov 2 16:29:42 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7595: : Writing the fs struct...
>Nov 2 16:29:42 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7595: : Writing the open file section...
>Nov 2 16:29:42 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7595: : ...files_struct
>Nov 2 16:29:42 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7595: : ...files
>Nov 2 16:29:42 faui21l kernel: Skipping a socket.
>Nov 2 16:29:42 faui21l kernel: Skipping a socket.
>Nov 2 16:29:42 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7600: : cr_chkpt_reap returning 0
>Nov 2 16:29:42 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7600: : process 7600 checkpointing its own process 7600
>Nov 2 16:29:42 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7593: : cr_chkpt_reap returning 0
>Nov 2 16:29:42 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7602: : Preparing to dump 5 threads
>Nov 2 16:29:42 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7593: : cr_chkpt_reap returning -22
>Nov 2 16:29:42 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7593: : process 7593 checkpointing its own process 7593
>Nov 2 16:29:42 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7595: : Preparing to dump 5 threads
>Nov 2 16:29:42 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7602: : Writing the fs struct...
>Nov 2 16:29:42 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7602: : Writing the open file section...
>Nov 2 16:29:42 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7602: : ...files_struct
>Nov 2 16:29:42 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7602: : ...files
>Nov 2 16:29:42 faui21l kernel: Skipping a socket.
>Nov 2 16:29:42 faui21l kernel: Skipping a socket.
>Nov 2 16:29:42 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7597: : Writing the fs struct...
>Nov 2 16:29:42 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7597: : Writing the open file section...
>Nov 2 16:29:42 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7597: : ...files_struct
>Nov 2 16:29:42 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7597: : ...files
>Nov 2 16:29:42 faui21l kernel: Skipping a socket.
>Nov 2 16:29:42 faui21l kernel: Skipping a socket.
>Nov 2 16:29:42 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7600: : cr_chkpt_reap returning 0
>Nov 2 16:29:42 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7593: : cr_chkpt_reap returning 0
>Nov 2 16:29:42 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7593: : process 7593 checkpointing its own process 7593
>Nov 2 16:29:42 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7600: : cr_chkpt_reap returning -22
>Nov 2 16:29:42 faui21l kernel: cr_chkpt_req <cr_chkpt_req.c:634>, pid 7600: : process 7600 checkpointing its own process 7600
>Nov 2 16:29:42 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7596: : Preparing to dump 5 threads
>Nov 2 16:29:42 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1044>, pid 7602: : Preparing to dump 5 threads
>Nov 2 16:29:42 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7594: : Writing the fs struct...
>Nov 2 16:29:42 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7594: : Writing the open file section...
>Nov 2 16:29:42 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7594: : ...files_struct
>Nov 2 16:29:42 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7594: : ...files
>Nov 2 16:29:42 faui21l kernel: Skipping a socket.
>Nov 2 16:29:42 faui21l kernel: Skipping a socket.
>Nov 2 16:29:42 faui21l kernel: cr_chkpt_reap <cr_chkpt_req.c:935>, pid 7593: : cr_chkpt_reap returning 0
>Nov 2 16:29:42 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1116>, pid 7601: : Writing the fs struct...
>Nov 2 16:29:42 faui21l kernel: cr_do_vmadump <cr_dump_self.c:1125>, pid 7601: : Writing the open file section...
>Nov 2 16:29:42 faui21l kernel: cr_save_all_files <cr_dump_self.c:862>, pid 7601: : ...files_struct
>Nov 2 16:29:42 faui21l kernel: cr_save_all_files <cr_dump_self.c:869>, pid 7601: : ...files
>Nov 2 16:29:42 faui21l kernel: Skipping a socket.
>Nov 2 16:29:42 faui21l kernel: Skipping a socket.
>Nov 2 16:30:28 faui21l sichiwai: here it happenes _______________________________________
>Nov 2 16:30:32 faui21l kernel: cr_rstrt_request_restart <cr_rstrt_req.c:678>, pid 7634: : cr_magic = 67 82, cr_version = 2, checkpoint_type = 1, num_threads = 5
>Nov 2 16:30:32 faui21l kernel: cr_reserve_ids <cr_rstrt_req.c:448>, pid 7634: : Now reserving required ids...
>Nov 2 16:30:32 faui21l kernel: cr_rstrt_clones <cr_rstrt_req.c:3198>, pid 7635: : 7635: Have enough processes
>Nov 2 16:30:32 faui21l kernel: cr_rstrt_child <cr_rstrt_req.c:3262>, pid 7635: : 7635: Restoring credentials
>Nov 2 16:30:32 faui21l kernel: cr_rstrt_clones <cr_rstrt_req.c:3198>, pid 7636: : 7636: Have enough processes
>Nov 2 16:30:32 faui21l kernel: cr_rstrt_clones <cr_rstrt_req.c:3198>, pid 7637: : 7637: Have enough processes
>Nov 2 16:30:32 faui21l kernel: cr_rstrt_clones <cr_rstrt_req.c:3198>, pid 7638: : 7638: Have enough processes
>Nov 2 16:30:32 faui21l kernel: cr_rstrt_clones <cr_rstrt_req.c:3198>, pid 7639: : 7639: Have enough processes
>Nov 2 16:30:32 faui21l kernel: vmadump: mmap failed: /var/run/nscd/db5bHKnB (deleted)
>Nov 2 16:30:32 faui21l kernel: thaw_threads returned error, aborting. -2
>Nov 2 16:30:32 faui21l kernel: vmadump: invalid signature
>Nov 2 16:30:32 faui21l kernel: thaw_threads returned error, aborting. -22
>Nov 2 16:30:32 faui21l kernel: vmadump: invalid signature
>Nov 2 16:30:32 faui21l kernel: thaw_threads returned error, aborting. -22
>Nov 2 16:30:32 faui21l kernel: vmadump: invalid signature
>Nov 2 16:30:32 faui21l kernel: thaw_threads returned error, aborting. -22
>Nov 2 16:30:32 faui21l kernel: vmadump: invalid signature
>Nov 2 16:30:32 faui21l kernel: thaw_threads returned error, aborting. -22
>

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
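
For the check requested in the reply above (is the file that was saved as "/var/run/nscd/db5bHKnB (deleted)" actually still present?), a small user-space test might look like the following. This is only a sketch: the helper names strip_deleted_suffix() and saved_file_present() are hypothetical and are not part of BLCR or vmadump, and it assumes the saved name carries the " (deleted)" suffix that Linux uses when reporting the path of an unlinked file. The -2 in "aborting. -2" would be consistent with ENOENT from trying to open the name with the suffix still attached.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Suffix Linux appends when reporting the path of an unlinked file. */
#define DELETED_SUFFIX " (deleted)"

/*
 * Hypothetical helper (not a BLCR function): copy `saved` into `out`,
 * dropping a trailing " (deleted)" if present.
 * Returns 1 if the suffix was found, 0 if not, -1 if `out` is too small.
 */
static int strip_deleted_suffix(const char *saved, char *out, size_t outlen)
{
    size_t len = strlen(saved);
    size_t slen = strlen(DELETED_SUFFIX);
    int marked = 0;

    if (len >= slen && strcmp(saved + len - slen, DELETED_SUFFIX) == 0) {
        len -= slen;            /* keep only the real path */
        marked = 1;
    }
    if (len + 1 > outlen)
        return -1;
    memcpy(out, saved, len);
    out[len] = '\0';
    return marked;
}

/*
 * Hypothetical helper: report whether the file behind a saved name exists.
 * Returns 1 if stat() succeeds, 0 if it fails with ENOENT,
 * and -1 on any other error (errno describes the failure).
 */
static int saved_file_present(const char *saved)
{
    char path[4096];
    struct stat st;

    if (strip_deleted_suffix(saved, path, sizeof(path)) < 0) {
        errno = ENAMETOOLONG;
        return -1;
    }
    if (stat(path, &st) == 0)
        return 1;
    return (errno == ENOENT) ? 0 : -1;
}
```

Calling saved_file_present("/var/run/nscd/db5bHKnB (deleted)") on the affected host and getting 1 would support the suspicion that the file is being wrongly marked as deleted; a result of 0 would mean the path really is gone, although a file recreated under the same name would also show up as present.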