Restart issue with BLCR 0.8.0b5


From: Matthias Hovestadt (matthias.hovestadt_at_tu-berlin.de)
Date: Mon Jan 05 2009 - 08:48:14 PST

Next message: Paul H. Hargrove: "Re: Restart issue with BLCR 0.8.0b5"

Previous message: Neal Becker: "Fwd: [Bug 19] Review request: blcr - Berkeley Lab Checkpoint/Restart for Linux"
Next in thread: Paul H. Hargrove: "Re: Restart issue with BLCR 0.8.0b5"
Reply: Paul H. Hargrove: "Re: Restart issue with BLCR 0.8.0b5"

Hi!

I'm using BLCR to provide fault tolerance in cluster systems,
with the resource management system generating checkpoints of
running jobs. For MPI parallel jobs, I'm using LAM+BLCR and
OpenMPI+BLCR.

With LAM+BLCR I now have a problem restarting LAM-MPI jobs.
The checkpoint was generated by the resource management
system using the cr_checkpoint command:

   cr_checkpoint 5631

where 5631 is the PID of the mpirun process. The cr_checkpoint
command exits with status 0 and generates the following files
in the user's home directory:


-----------------------------------------------------------
testuser@asok14-5:~$ ls -als context*
total 3308
    4 drwxr-xr-x  2 testuser testuser    4096 Jan  5 18:18 .
    4 drwxr-xr-x 12 testuser testuser    4096 Jan  5 18:35 ..
  444 -r--------  1 testuser testuser  447767 Jan  5 18:15 context.5631
 1232 -r--------  1 testuser testuser 1255150 Jan  5 18:15 context.5631-n0-5632
 1620 -r--------  1 testuser testuser 1651993 Jan  5 18:15 context.5631-n1-17822
testuser@asok14-5:~$
-----------------------------------------------------------


If I now try to restart from this checkpoint (regardless of whether
the checkpoint command was issued manually by the user or
automatically by the resource management system) using the
command

   cr_restart context.5631

the restart fails, with these error messages in the syslog:

Jan  5 18:18:35 asok14-5 kernel: [63441.370972] blcr: rstrt_watchdog: tgid/pid 5631/5631 exec()ed 'mpirun' during restart
Jan  5 18:18:35 asok14-5 kernel: [63441.370978] blcr: rstrt_watchdog: 'mpirun' (tgid/pid 5631/5633) exited with code 0 during restart
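
For anyone who wants the resource management system to pick these
messages out of the syslog automatically, here is a minimal Python
sketch. The two patterns are derived only from the two lines quoted
above, not from the BLCR sources, so other message variants are not
covered. The helper name parse_watchdog is hypothetical.

```python
import re

# Patterns modelled on the two rstrt_watchdog lines quoted above (assumption:
# other BLCR messages may be worded differently and will not match).
_EXEC = re.compile(
    r"rstrt_watchdog: tgid/pid (\d+)/(\d+) exec\(\)ed '([^']+)' during restart"
)
_EXIT = re.compile(
    r"rstrt_watchdog: '([^']+)' \(tgid/pid (\d+)/(\d+)\) "
    r"exited with code (\d+) during restart"
)


def parse_watchdog(line):
    """Return a dict describing a watchdog message, or None if no match."""
    m = _EXEC.search(line)
    if m:
        return {"event": "exec", "tgid": int(m.group(1)),
                "pid": int(m.group(2)), "comm": m.group(3)}
    m = _EXIT.search(line)
    if m:
        return {"event": "exit", "comm": m.group(1), "tgid": int(m.group(2)),
                "pid": int(m.group(3)), "code": int(m.group(4))}
    return None


if __name__ == "__main__":
    sample = ("Jan  5 18:18:35 asok14-5 kernel: [63441.370978] blcr: "
              "rstrt_watchdog: 'mpirun' (tgid/pid 5631/5633) exited with "
              "code 0 during restart")
    print(parse_watchdog(sample))
```

Each message has to be on a single line for the patterns to match.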


Surprisingly, this issue only affects jobs that have been started by
the resource management system. If I start the same job by hand (using
the same commands on the same cluster nodes), I can checkpoint and
restart without any problem.

To get more information, I ran the cr_restart command under
strace. I then compared the output of a working restart
(checkpoint from a manually started job) with the output of a
failing restart (checkpoint from a job that was started by
the resource management system).

Both strace outputs are identical until the "rt_sigaction" lines.
Then the output differs.
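
A comparison like this can be scripted. The following is a rough
Python sketch, not the procedure I actually used: it masks hex
literals (which differ between any two runs) and then produces a
unified diff. The function names are hypothetical. Note that masking
every hex literal also hides ioctl request codes, so the raw logs
should still be checked by hand.

```python
import difflib
import re

# Handler addresses and stack pointers differ between runs, so mask them.
# Caveat: this also masks ioctl request codes such as 0xa122.
_HEX = re.compile(r"0x[0-9a-fA-F]+")


def normalize(line):
    """Mask hex literals and collapse runs of whitespace."""
    return " ".join(_HEX.sub("0xADDR", line).split())


def strace_diff(good_lines, bad_lines):
    """Return unified-diff lines between two normalized strace logs."""
    good = [normalize(line) for line in good_lines]
    bad = [normalize(line) for line in bad_lines]
    return list(
        difflib.unified_diff(good, bad, "working", "failing", lineterm="")
    )


if __name__ == "__main__":
    good = [
        "close(4)                                = 0",
        "select(6, [5], NULL, NULL, NULL)        = 1 (in [5])",
    ]
    bad = [
        "close(4)                                = 0",
        "select(6, [5], NULL, NULL, NULL)        = ? ERESTARTNOHAND (To be restarted)",
        "--- SIGCHLD (Child exited) @ 0 (0) ---",
        "select(6, [5], NULL, NULL, NULL)        = 1 (in [5])",
    ]
    for line in strace_diff(good, bad):
        print(line)
```

With real logs, the two lists would come from reading the files
written by strace -o for each run.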


The working restart has the following output:

-----------------------------------------------------------
.
.
.
rt_sigaction(SIGRT_29, {0x402660, [], SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2b578fb61200}, NULL, 8) = 0
rt_sigaction(SIGRT_30, {0x402660, [], SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2b578fb61200}, NULL, 8) = 0
rt_sigaction(SIGRT_31, {0x402660, [], SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2b578fb61200}, NULL, 8) = 0
close(4)                                = 0
select(6, [5], NULL, NULL, NULL..........................................................)        = 1 (in [5])
ioctl(5, 0xffffffff8008a127, 0x7fff1b57f860) = 153
ioctl(5, 0xffffffff8008a127, 0x7fff1b57f860) = 153
ioctl(5, 0xa122, 0xffffffffffffffff)    = 15084
close(5)                                = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
.
.
.
-----------------------------------------------------------


The failing restart has the following output:

-----------------------------------------------------------
.
.
.
rt_sigaction(SIGRT_29, {0x402660, [], SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2afb0b016200}, NULL, 8) = 0
rt_sigaction(SIGRT_30, {0x402660, [], SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2afb0b016200}, NULL, 8) = 0
rt_sigaction(SIGRT_31, {0x402660, [], SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2afb0b016200}, NULL, 8) = 0
close(4)                                = 0
select(6, [5], NULL, NULL, NULL)        = ? ERESTARTNOHAND (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
select(6, [5], NULL, NULL, NULL)        = 1 (in [5])
ioctl(5, 0xffffffff8008a127, 0x7fffa00c8390) = 149
ioctl(5, 0xffffffff8008a127, 0x7fffa00c8390) = 149
ioctl(5, 0xa122, 0xffffffffffffffff)    = 5631
close(5)                                = 0
write(2, "- ", 2- )                       = 2
write(2, "rstrt_watchdog: tgid/pid 5631/56"..., 67rstrt_watchdog: tgid/pid 5631/5631 exec()ed 'mpirun' during restart) = 67
write(2, "\n", 1
)                       = 1
write(2, "- ", 2- )                       = 2
write(2, "rstrt_watchdog: \'mpirun\' (tgid/p"..., 79rstrt_watchdog: 'mpirun' (tgid/pid 5631/5633) exited with code 0 during restart) = 79
write(2, "\n", 1
)                       = 1
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
wait4(5631, [{WIFEXITED(s) && WEXITSTATUS(s) == 215}], __WCLONE|__WALL, NULL) = 5631
exit_group(215)                         = ?
Process 7482 detached
testuser@asok14-5:~$
-----------------------------------------------------------
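
The last lines of the trace show cr_restart reaping process 5631 with
exit status 215 and then exiting with 215 itself, so a wrapper can
detect the failed restart from the exit status alone. Below is a
minimal Python sketch of such a check. The helper names are
hypothetical, and the example runs a stand-in child that exits with
215 instead of cr_restart so that it works on any machine. What 215
means inside BLCR is not something I know.

```python
import subprocess
import sys


def describe_exit(returncode):
    """Describe a subprocess return code in words."""
    if returncode < 0:
        # subprocess reports death by signal as the negated signal number.
        return "killed by signal %d" % -returncode
    return "exited with status %d" % returncode


def run_and_report(argv):
    """Run argv, wait for it, and return (returncode, description)."""
    proc = subprocess.run(argv)
    return proc.returncode, describe_exit(proc.returncode)


if __name__ == "__main__":
    # Stand-in for: ["cr_restart", "context.5631"]
    child = [sys.executable, "-c", "import sys; sys.exit(215)"]
    code, text = run_and_report(child)
    print(text)
```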


Does anybody have an idea what the problem might be? Or is there
a way to increase the debug level to get more verbose log
output?


Best,
Matthias
