From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Jan 05 2009 - 12:02:01 PST
Matthias,

Thanks for the report. I am afraid I am not entirely sure what the cause of your problem is, but I do have a strong suspicion.

Short version: The "error messages" you report are not errors but normal behavior. The restart of mpirun is successful, but it then exits with code 215. That exit code probably indicates that the batch mpirun ran with a job-specific LAM/MPI session directory which no longer exists when you try to restart. Long version follows...

First, I should point out that the messages about "exec()ed 'mpirun' during restart" and "exited with code 0" are not necessarily an error, because this is the normal behavior for LAM/MPI's mpirun when restarted. These are WARNINGS and end up in dmesg/syslog unconditionally, but normally cr_restart wouldn't print these messages at all unless you passed --kmsg-warning. Can you confirm that you passed this option to the failing restart?

Second, the trace of the successful restart shows that the kernel generated 153 bytes of dmesg/syslog messages as well. However, your strace output doesn't show them. I am guessing here that you did NOT pass --kmsg-warning to the successful cr_restart command. Can you confirm that? Also, could you please retry the successful cr_restart with --kmsg-warning and tell me if you then see similar "exec()ed" and "exited with code 0" messages? I suspect you will.

Finally, looking at the strace output of the failing restart, it looks as if the mpirun has restarted (or it would not have run far enough to re-exec() itself), but has very quickly called exit(215). As it happens, 215 is LAM's ENOKERNEL. This is the exit code mpirun produces when it can't connect to the lamd. So, I suspect that the restart is failing simply because LAM's mpirun is unable to connect to the lamd as required to restart the application processes.

Because you indicate that the failure occurs ONLY for an mpirun started by the resource management system (you didn't indicate which one), I am guessing that the problem is related to the following, quoted from the lamboot manpage:

> ENVIRONMENT VARIABLES
>        LAM_MPI_SESSION_PREFIX
>        LAM_MPI_SESSION_SUFFIX
>               It is possible to change the session directory used by
>               LAM/MPI, normally of the form:
>
>                      <tmpdir>/lam-<username>@<hostname>[-<suffix>]
>
>               <tmpdir> will be set to LAM_MPI_SESSION_PREFIX if set.
>               Otherwise, it will fall back to the value of TMPDIR.  If
>               neither of these are set, the default is /tmp.
>
>               <suffix> can be overridden by the LAM_MPI_SESSION_SUFFIX
>               environment variable.  If LAM_MPI_SESSION_SUFFIX is not
>               set and LAM is running under a supported batch scheduling
>               system, <suffix> will be a value unique to the currently
>               running job.

My guess relates to the last sentence above about a job-specific suffix value. My thought is that the mpirun started under the resource management system ran with a non-default LAM/MPI session directory. If there is no corresponding lam universe when you restart, the "215" exit would occur. So, the solution to your problem might be as simple as:

   env LAM_MPI_SESSION_SUFFIX=<something> lamboot

to recreate the correct session directory prior to your restart. Unfortunately, I don't know how to find the correct "<something>". If you know something about LAM's interaction with your specific resource manager, that might answer the question. If you cannot determine "<something>", then perhaps you should consider explicitly setting LAM_MPI_SESSION_SUFFIX to prevent the automatic generation of a job-specific suffix.
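In case it is useful, here is a rough sketch of the two experiments I have in mind, assuming a bash shell and the context file names from your report. The "JOBSUFFIX" value is only a placeholder of mine, not a value I can determine for your resource manager:

   # (a) Re-run the *successful* restart with kernel warnings forwarded to
   #     stderr, to confirm the same "exec()ed" / "exited with code 0"
   #     warnings appear there too:
   cr_restart --kmsg-warning context.5631
   echo "cr_restart exited with $?"   # 215 would be LAM's ENOKERNEL (no lamd)

   # (b) For the failing case, first recreate the lam universe with the
   #     session suffix the batch job used, then retry the restart:
   env LAM_MPI_SESSION_SUFFIX=JOBSUFFIX lamboot
   cr_restart context.5631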
However, pinning a fixed suffix that way creates the danger of conflicts between multiple concurrent batch jobs.

I hope this helps,
-Paul

Matthias Hovestadt wrote:
> Hi!
>
> I'm using BLCR to provide fault tolerance in cluster systems,
> having the resource management system generate checkpoints of
> running jobs. In the case of MPI parallel jobs, I'm using LAM+BLCR
> and OpenMPI+BLCR to provide this fault tolerance.
>
> With LAM+BLCR I now have an issue in restarting LAM-MPI jobs.
> The checkpoint has been generated by the resource management
> system using the cr_checkpoint command:
>
>    cr_checkpoint 5631
>
> where 5631 is the PID of the mpirun process. This cr_checkpoint
> command succeeds with exit code 0, leaving the following files
> in the home directory of the user:
>
> -----------------------------------------------------------
> testuser@asok14-5:~$ ls -als context*
> total 3308
>    4 drwxr-xr-x  2 testuser testuser    4096 Jan  5 18:18 .
>    4 drwxr-xr-x 12 testuser testuser    4096 Jan  5 18:35 ..
>  444 -r--------  1 testuser testuser  447767 Jan  5 18:15 context.5631
> 1232 -r--------  1 testuser testuser 1255150 Jan  5 18:15 context.5631-n0-5632
> 1620 -r--------  1 testuser testuser 1651993 Jan  5 18:15 context.5631-n1-17822
> testuser@asok14-5:~$
> -----------------------------------------------------------
>
> If I now try to restart from this checkpoint (regardless of whether
> the checkpoint command is issued manually by the user or
> automatically by the resource management system) using the command
>
>    cr_restart context.5631
>
> the restart command fails, with these error messages in the syslog:
>
> Jan  5 18:18:35 asok14-5 kernel: [63441.370972] blcr: rstrt_watchdog:
> tgid/pid 5631/5631 exec()ed 'mpirun' during restart
> Jan  5 18:18:35 asok14-5 kernel: [63441.370978] blcr: rstrt_watchdog:
> 'mpirun' (tgid/pid 5631/5633) exited with code 0 during restart
>
> Surprisingly, this issue only affects jobs that have been started by
> the resource management system. If I start the same job by hand (using
> the same commands on the same cluster nodes), I can checkpoint and
> restart without any problem.
>
> To get some more information I started the cr_restart command
> under the strace tool. I then compared the output of a working
> restart (checkpoint from a manually started job) with the output
> of a failing restart (checkpoint from a job that had been started
> by the resource management system).
>
> Both strace outputs are identical up to the "rt_sigaction" lines.
> Then the output differs.
>
> The working restart has the following output:
>
> -----------------------------------------------------------
> .
> .
> .
> rt_sigaction(SIGRT_29, {0x402660, [],
> SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2b578fb61200}, NULL, 8) = 0
> rt_sigaction(SIGRT_30, {0x402660, [],
> SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2b578fb61200}, NULL, 8) = 0
> rt_sigaction(SIGRT_31, {0x402660, [],
> SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2b578fb61200}, NULL, 8) = 0
> close(4)                                = 0
> select(6, [5], NULL, NULL,
> NULL..........................................................) = 1 (in [5])
> ioctl(5, 0xffffffff8008a127, 0x7fff1b57f860) = 153
> ioctl(5, 0xffffffff8008a127, 0x7fff1b57f860) = 153
> ioctl(5, 0xa122, 0xffffffffffffffff)    = 15084
> close(5)                                = 0
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> .
> .
> .
> -----------------------------------------------------------
>
> In case of the failing restart:
>
> -----------------------------------------------------------
> .
> .
> .
> rt_sigaction(SIGRT_29, {0x402660, [],
> SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2afb0b016200}, NULL, 8) = 0
> rt_sigaction(SIGRT_30, {0x402660, [],
> SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2afb0b016200}, NULL, 8) = 0
> rt_sigaction(SIGRT_31, {0x402660, [],
> SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2afb0b016200}, NULL, 8) = 0
> close(4)                                = 0
> select(6, [5], NULL, NULL, NULL)        = ? ERESTARTNOHAND (To be restarted)
> --- SIGCHLD (Child exited) @ 0 (0) ---
> select(6, [5], NULL, NULL, NULL)        = 1 (in [5])
> ioctl(5, 0xffffffff8008a127, 0x7fffa00c8390) = 149
> ioctl(5, 0xffffffff8008a127, 0x7fffa00c8390) = 149
> ioctl(5, 0xa122, 0xffffffffffffffff)    = 5631
> close(5)                                = 0
> write(2, "- ", 2- )                     = 2
> write(2, "rstrt_watchdog: tgid/pid 5631/56"..., 67rstrt_watchdog:
> tgid/pid 5631/5631 exec()ed 'mpirun' during restart) = 67
> write(2, "\n", 1
> )                                       = 1
> write(2, "- ", 2- )                     = 2
> write(2, "rstrt_watchdog: \'mpirun\' (tgid/p"..., 79rstrt_watchdog:
> 'mpirun' (tgid/pid 5631/5633) exited with code 0 during restart) = 79
> write(2, "\n", 1
> )                                       = 1
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> wait4(5631, [{WIFEXITED(s) && WEXITSTATUS(s) == 215}],
> __WCLONE|__WALL, NULL)                  = 5631
> exit_group(215)                         = ?
> Process 7482 detached
> testuser@asok14-5:~$
> -----------------------------------------------------------
>
> Does anybody have an idea what might be the problem? Or is there
> any way of increasing the debug level, getting more verbose logfile
> output?
>
> Best,
> Matthias

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory