From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Jan 05 2009 - 12:02:01 PST
Matthias,

Thanks for the report. I am afraid I am not entirely sure what the cause of your problem is, but I do have a strong suspicion.

Short version: The "error messages" you report are not errors but normal behavior. The restart of mpirun is successful, but it then exits with code 215. That exit code probably indicates that the batch mpirun ran with a job-specific LAM/MPI session directory which no longer exists when you try to restart. Long version follows...

First, I should point out that the messages about "exec()ed 'mpirun' during restart" and "exited with code 0" are not necessarily an error, because this is the normal behavior for LAM/MPI's mpirun when restarted. These are WARNINGS and end up in dmesg/syslog unconditionally, but normally cr_restart wouldn't print these messages at all unless you passed --kmsg-warning. Can you confirm that you passed this option to the failing restart?

Second, the trace of the successful restart shows that the kernel generated 153 bytes of dmesg/syslog messages as well. However, your strace output doesn't show them. I am guessing here that you did NOT pass --kmsg-warning to the successful cr_restart command. Can you confirm that? Also, could you please retry the successful cr_restart with --kmsg-warning and tell me if you then see similar "exec()ed" and "exited with code 0" messages? I suspect you will.

Finally, looking at the strace output of the failing restart, it looks as if the mpirun has restarted (or it would not have run far enough to re-exec() itself), but has very quickly called exit(215). As it happens, 215 is LAM's ENOKERNEL. This is the exit code mpirun produces when it can't connect to the lamd. So, I suspect that the restart is failing simply because LAM's mpirun is unable to connect to the lamd as required to restart the application processes.

Because you indicate that the failure occurs ONLY for an mpirun started by the resource management system (you didn't indicate which one), I am guessing that the problem is related to the following, quoted from the lamboot manpage:

> ENVIRONMENT VARIABLES
>        LAM_MPI_SESSION_PREFIX
>        LAM_MPI_SESSION_SUFFIX
>               It is possible to change the session directory used by
>               LAM/MPI, normally of the form:
>
>                      <tmpdir>/lam-<username>@<hostname>[-<suffix>]
>
>               <tmpdir> will be set to LAM_MPI_SESSION_PREFIX if set.
>               Otherwise, it will fall back to the value of TMPDIR.  If
>               neither of these are set, the default is /tmp.
>
>               <suffix> can be overridden by the LAM_MPI_SESSION_SUFFIX
>               environment variable.  If LAM_MPI_SESSION_SUFFIX is not
>               set and LAM is running under a supported batch scheduling
>               system, <suffix> will be a value unique to the currently
>               running job.

My guess relates to the last sentence above about a job-specific suffix value. My thought is that the mpirun started under the resource management system ran with a non-default LAM/MPI session directory. If there is no corresponding lam universe when you restart, the "215" exit would occur. So, the solution to your problem might be as simple as:

   env LAM_MPI_SESSION_SUFFIX=<something> lamboot

to recreate the correct session directory prior to your restart. Unfortunately, I don't know how to find the correct "<something>". If you know something about LAM's interaction with your specific resource manager, that might answer the question. If you cannot determine "<something>", then perhaps you should consider explicitly setting LAM_MPI_SESSION_SUFFIX to prevent the automatic generation of a job-specific suffix.
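In case it is useful, here is a rough sketch of the two experiments I have in mind, assuming a bash shell and the context file names from your report. The "JOBSUFFIX" value is only a placeholder of mine, not a value I can determine for your resource manager:

   # (a) Re-run the *successful* restart with kernel warnings forwarded to
   #     stderr, to confirm the same "exec()ed" / "exited with code 0"
   #     warnings appear there too:
   cr_restart --kmsg-warning context.5631
   echo "cr_restart exited with $?"   # 215 would be LAM's ENOKERNEL (no lamd)

   # (b) For the failing case, first recreate the lam universe with the
   #     session suffix the batch job used, then retry the restart:
   env LAM_MPI_SESSION_SUFFIX=JOBSUFFIX lamboot
   cr_restart context.5631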
However, pinning a fixed suffix that way creates the danger of conflicts between multiple concurrent batch jobs.

I hope this helps,
-Paul

Matthias Hovestadt wrote:
> Hi!
>
> I'm using BLCR to provide fault tolerance in cluster systems,
> having the resource management system generate checkpoints of
> running jobs. In the case of MPI parallel jobs, I'm using LAM+BLCR
> and OpenMPI+BLCR to provide this fault tolerance.
>
> With LAM+BLCR I now have an issue in restarting LAM-MPI jobs.
> The checkpoint has been generated by the resource management
> system using the cr_checkpoint command:
>
>    cr_checkpoint 5631
>
> where 5631 is the PID of the mpirun process. This cr_checkpoint
> command succeeds with exit code 0, leaving the following files
> in the home directory of the user:
>
> -----------------------------------------------------------
> testuser@asok14-5:~$ ls -als context*
> total 3308
>    4 drwxr-xr-x  2 testuser testuser    4096 Jan  5 18:18 .
>    4 drwxr-xr-x 12 testuser testuser    4096 Jan  5 18:35 ..
>  444 -r--------  1 testuser testuser  447767 Jan  5 18:15 context.5631
> 1232 -r--------  1 testuser testuser 1255150 Jan  5 18:15 context.5631-n0-5632
> 1620 -r--------  1 testuser testuser 1651993 Jan  5 18:15 context.5631-n1-17822
> testuser@asok14-5:~$
> -----------------------------------------------------------
>
> If I now try to restart from this checkpoint (regardless of whether
> the checkpoint command is issued manually by the user or
> automatically by the resource management system) using the command
>
>    cr_restart context.5631
>
> the restart command fails, with these error messages in the syslog:
>
> Jan  5 18:18:35 asok14-5 kernel: [63441.370972] blcr: rstrt_watchdog:
> tgid/pid 5631/5631 exec()ed 'mpirun' during restart
> Jan  5 18:18:35 asok14-5 kernel: [63441.370978] blcr: rstrt_watchdog:
> 'mpirun' (tgid/pid 5631/5633) exited with code 0 during restart
>
> Surprisingly, this issue only affects jobs that have been started by
> the resource management system. If I start the same job by hand (using
> the same commands on the same cluster nodes), I can checkpoint and
> restart without any problem.
>
> To get some more information I started the cr_restart command
> under the strace tool. I then compared the output of a working
> restart (checkpoint from a manually started job) with the output
> of a failing restart (checkpoint from a job that had been started
> by the resource management system).
>
> Both strace outputs are identical up to the "rt_sigaction" lines.
> Then the output differs.
>
> The working restart has the following output:
>
> -----------------------------------------------------------
> .
> .
> .
> rt_sigaction(SIGRT_29, {0x402660, [],
> SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2b578fb61200}, NULL, 8) = 0
> rt_sigaction(SIGRT_30, {0x402660, [],
> SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2b578fb61200}, NULL, 8) = 0
> rt_sigaction(SIGRT_31, {0x402660, [],
> SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2b578fb61200}, NULL, 8) = 0
> close(4)                                = 0
> select(6, [5], NULL, NULL,
> NULL..........................................................) = 1 (in [5])
> ioctl(5, 0xffffffff8008a127, 0x7fff1b57f860) = 153
> ioctl(5, 0xffffffff8008a127, 0x7fff1b57f860) = 153
> ioctl(5, 0xa122, 0xffffffffffffffff)    = 15084
> close(5)                                = 0
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> .
> .
> .
> -----------------------------------------------------------
>
> In case of the failing restart:
>
> -----------------------------------------------------------
> .
> .
> .
> rt_sigaction(SIGRT_29, {0x402660, [],
> SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2afb0b016200}, NULL, 8) = 0
> rt_sigaction(SIGRT_30, {0x402660, [],
> SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2afb0b016200}, NULL, 8) = 0
> rt_sigaction(SIGRT_31, {0x402660, [],
> SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2afb0b016200}, NULL, 8) = 0
> close(4)                                = 0
> select(6, [5], NULL, NULL, NULL)        = ? ERESTARTNOHAND (To be restarted)
> --- SIGCHLD (Child exited) @ 0 (0) ---
> select(6, [5], NULL, NULL, NULL)        = 1 (in [5])
> ioctl(5, 0xffffffff8008a127, 0x7fffa00c8390) = 149
> ioctl(5, 0xffffffff8008a127, 0x7fffa00c8390) = 149
> ioctl(5, 0xa122, 0xffffffffffffffff)    = 5631
> close(5)                                = 0
> write(2, "- ", 2- )                     = 2
> write(2, "rstrt_watchdog: tgid/pid 5631/56"..., 67rstrt_watchdog:
> tgid/pid 5631/5631 exec()ed 'mpirun' during restart) = 67
> write(2, "\n", 1
> )                                       = 1
> write(2, "- ", 2- )                     = 2
> write(2, "rstrt_watchdog: \'mpirun\' (tgid/p"..., 79rstrt_watchdog:
> 'mpirun' (tgid/pid 5631/5633) exited with code 0 during restart) = 79
> write(2, "\n", 1
> )                                       = 1
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> wait4(5631, [{WIFEXITED(s) && WEXITSTATUS(s) == 215}],
> __WCLONE|__WALL, NULL)                  = 5631
> exit_group(215)                         = ?
> Process 7482 detached
> testuser@asok14-5:~$
> -----------------------------------------------------------
>
> Does anybody have an idea what might be the problem? Or is there
> any way of increasing the debug level, getting more verbose logfile
> output?
>
> Best,
> Matthias

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory