From: Matthias Hovestadt (matthias.hovestadt_at_tu-berlin.de)
Date: Mon Jan 05 2009 - 08:48:14 PST
Hi!

I'm using BLCR to provide fault tolerance in cluster systems, with the resource management system generating checkpoints of running jobs. For MPI parallel jobs, I'm using LAM+BLCR and OpenMPI+BLCR.

With LAM+BLCR I now have an issue restarting LAM-MPI jobs. The checkpoint is generated by the resource management system using the cr_checkpoint command:

    cr_checkpoint 5631

where 5631 is the PID of the mpirun process. This cr_checkpoint command succeeds with exit code 0 and generates the following files in the home directory of the user:

-----------------------------------------------------------
testuser@asok14-5:~$ ls -als context*
total 3308
   4 drwxr-xr-x  2 testuser testuser    4096 Jan 5 18:18 .
   4 drwxr-xr-x 12 testuser testuser    4096 Jan 5 18:35 ..
 444 -r--------  1 testuser testuser  447767 Jan 5 18:15 context.5631
1232 -r--------  1 testuser testuser 1255150 Jan 5 18:15 context.5631-n0-5632
1620 -r--------  1 testuser testuser 1651993 Jan 5 18:15 context.5631-n1-17822
testuser@asok14-5:~$
-----------------------------------------------------------

If I now try to restart from this checkpoint (regardless of whether the checkpoint command is issued manually by the user or automatically through the resource management system) using the command

    cr_restart context.5631

the restart command fails, leaving these error messages in the syslog:

Jan 5 18:18:35 asok14-5 kernel: [63441.370972] blcr: rstrt_watchdog: tgid/pid 5631/5631 exec()ed 'mpirun' during restart
Jan 5 18:18:35 asok14-5 kernel: [63441.370978] blcr: rstrt_watchdog: 'mpirun' (tgid/pid 5631/5633) exited with code 0 during restart

Surprisingly, this issue only affects jobs that have been started by the resource management system. If I start the same job by hand (using the same commands on the same cluster nodes), I can checkpoint and restart without any problem.

To get more information, I ran the cr_restart command under the strace tool.
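
When two strace logs of cr_restart are compared, pointer values differ from run to run and hide the first real difference. The following is only a minimal sketch of one way to locate that point, assuming each trace has been saved to a text file; the function names and the file names in the usage note are hypothetical and not part of BLCR or strace.

```python
import re
from itertools import zip_longest

# Hex values (pointers, mapping addresses) change between runs, so they
# are masked before lines are compared. Note that this also masks hex
# constants such as ioctl request codes.
HEX = re.compile(r"0x[0-9a-fA-F]+")


def normalize(line):
    """Replace every hex value with a placeholder and trim whitespace."""
    return HEX.sub("0xADDR", line.strip())


def first_divergence(lines_a, lines_b):
    """Return (index, line_a, line_b) for the first differing line.

    Returns None if both traces are identical after normalization.
    A missing line (one trace is shorter) is reported as None.
    """
    for i, (a, b) in enumerate(zip_longest(lines_a, lines_b)):
        na = normalize(a) if a is not None else None
        nb = normalize(b) if b is not None else None
        if na != nb:
            return i, a, b
    return None


def compare_files(path_a, path_b):
    """Convenience wrapper: compare two saved trace files."""
    with open(path_a) as fa, open(path_b) as fb:
        return first_divergence(fa.readlines(), fb.readlines())
```

For example, compare_files("working.strace", "failing.strace") would return the zero-based index of the first line where the two traces differ, together with both lines.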

I then compared the output of a working restart (checkpoint from a manually started job) with the output of a failing restart (checkpoint from a job that was started by the resource management system). Both strace outputs are identical up to the "rt_sigaction" lines; after that the output differs. The working restart has the following output:

-----------------------------------------------------------
.
.
.
rt_sigaction(SIGRT_29, {0x402660, [], SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2b578fb61200}, NULL, 8) = 0
rt_sigaction(SIGRT_30, {0x402660, [], SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2b578fb61200}, NULL, 8) = 0
rt_sigaction(SIGRT_31, {0x402660, [], SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2b578fb61200}, NULL, 8) = 0
close(4) = 0
select(6, [5], NULL, NULL, NULL..........................................................) = 1 (in [5])
ioctl(5, 0xffffffff8008a127, 0x7fff1b57f860) = 153
ioctl(5, 0xffffffff8008a127, 0x7fff1b57f860) = 153
ioctl(5, 0xa122, 0xffffffffffffffff) = 15084
close(5) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
.
.
.
-----------------------------------------------------------

In case of the failing restart:

-----------------------------------------------------------
.
.
.
rt_sigaction(SIGRT_29, {0x402660, [], SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2afb0b016200}, NULL, 8) = 0
rt_sigaction(SIGRT_30, {0x402660, [], SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2afb0b016200}, NULL, 8) = 0
rt_sigaction(SIGRT_31, {0x402660, [], SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2afb0b016200}, NULL, 8) = 0
close(4) = 0
select(6, [5], NULL, NULL, NULL) = ? ERESTARTNOHAND (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
select(6, [5], NULL, NULL, NULL) = 1 (in [5])
ioctl(5, 0xffffffff8008a127, 0x7fffa00c8390) = 149
ioctl(5, 0xffffffff8008a127, 0x7fffa00c8390) = 149
ioctl(5, 0xa122, 0xffffffffffffffff) = 5631
close(5) = 0
write(2, "- ", 2- ) = 2
write(2, "rstrt_watchdog: tgid/pid 5631/56"..., 67rstrt_watchdog: tgid/pid 5631/5631 exec()ed 'mpirun' during restart) = 67
write(2, "\n", 1 ) = 1
write(2, "- ", 2- ) = 2
write(2, "rstrt_watchdog: \'mpirun\' (tgid/p"..., 79rstrt_watchdog: 'mpirun' (tgid/pid 5631/5633) exited with code 0 during restart) = 79
write(2, "\n", 1 ) = 1
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
wait4(5631, [{WIFEXITED(s) && WEXITSTATUS(s) == 215}], __WCLONE|__WALL, NULL) = 5631
exit_group(215) = ?
Process 7482 detached
testuser@asok14-5:~$
-----------------------------------------------------------

Does anybody have an idea what might be the problem? Or is there a way to increase the debug level to get more verbose log output?

Best,
Matthias