Process deadlock on checkpoint after restart (BLCR-0.8.0)


From: Hongjia Cao (hjcao_at_nudt_dot_edu.cn)
Date: Thu Feb 26 2009 - 06:13:59 PST

Next message: Paul H. Hargrove: "Re: Process deadlock on checkpoint after restart (BLCR-0.8.0)"

Previous message: Paul H. Hargrove: "Re: using blcr on program with fork"
Next in thread: Paul H. Hargrove: "Re: Process deadlock on checkpoint after restart (BLCR-0.8.0)"
Maybe reply: Paul H. Hargrove: "Re: Process deadlock on checkpoint after restart (BLCR-0.8.0)"

I ran into this problem when testing the checkpoint/blcr plugin for
SLURM.  The problem arises if a program (srun_cr, a wrapper around srun,
the task launching utility of SLURM) performs time-consuming operations
before calling cr_checkpoint() in the thread-context callback function.
After the program has been restarted from a checkpoint, checkpointing it
again deadlocks: the process waits uninterruptibly (D state) on
"req->preshared_barrier". I tried mpiexec_cr of MVAPICH2-1.2p1 and found
a similar problem, except that the process waits interruptibly in
cr_freeze_threads() instead of sitting in the D state.

Suppose srun_cr is checkpointed and restarted, and "task->self_exec_id"
of cr_restart is 11. "task->self_exec_id" of the two threads of srun_cr
(the main thread and the thread that executes the thread-context
callback) will be set to 12 (11 + 1) in cr_restore_linkage(). But
"cr_task->self_exec_id" of both threads will still be 11, since the
"cr_task" structures are allocated just after the threads are forked by
cr_restart, before cr_restore_linkage() changes the value. This is not a
problem for the callback execution thread, since do_trigger() will
synchronize "cr_task->self_exec_id" with "task->self_exec_id" when
triggering the PHASE1 threads. But, since the callback function takes a
long time to finish, the watchdog will detect the mismatch between the
two fields of the main thread before the PHASE1 thread triggers it. The
main thread will then be deleted from the checkpoint request.

So I think "cr_task->self_exec_id" should be updated when changing
"task->self_exec_id" in cr_restore_linkage().

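To make the sequence above concrete, here is a minimal userspace model of
it. This is not BLCR kernel code: the struct layouts and all model_*
function names are made up for illustration, and only the two
self_exec_id fields and the order of events are taken from the
description above. The sync_cr_task flag stands for the change I am
proposing.

```c
/* Simplified userspace model of the self_exec_id mismatch.
 * NOT BLCR code: all types and model_* names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task {                    /* stands in for the kernel task */
    unsigned int self_exec_id;
};

struct cr_task {                 /* stands in for BLCR's per-task record */
    struct task *task;
    unsigned int self_exec_id;
    bool in_request;             /* still part of the checkpoint request? */
};

/* Allocated right after the thread is forked: copies the current id. */
void model_alloc_cr_task(struct cr_task *ct, struct task *t)
{
    assert(ct != NULL && t != NULL);
    ct->task = t;
    ct->self_exec_id = t->self_exec_id;
    ct->in_request = true;
}

/* Models cr_restore_linkage(): the task id becomes restart id + 1.
 * With sync_cr_task set, the cr_task copy is updated too (proposed fix). */
void model_restore_linkage(struct cr_task *ct, unsigned int restart_id,
                           bool sync_cr_task)
{
    ct->task->self_exec_id = restart_id + 1;
    if (sync_cr_task)
        ct->self_exec_id = ct->task->self_exec_id;
}

/* Models the synchronization done when a thread is triggered. */
void model_trigger(struct cr_task *ct)
{
    ct->self_exec_id = ct->task->self_exec_id;
}

/* Models the watchdog: a task whose two ids differ is dropped. */
bool model_watchdog_keeps(struct cr_task *ct)
{
    if (ct->self_exec_id != ct->task->self_exec_id)
        ct->in_request = false;
    return ct->in_request;
}

/* Main thread: the watchdog runs before the slow callback triggers it. */
bool model_main_thread_survives(bool apply_fix)
{
    struct task t = { 11 };
    struct cr_task ct;
    model_alloc_cr_task(&ct, &t);
    model_restore_linkage(&ct, 11, apply_fix);
    return model_watchdog_keeps(&ct);
}

/* Callback thread: triggered (and synchronized) before the watchdog. */
bool model_callback_thread_survives(void)
{
    struct task t = { 11 };
    struct cr_task ct;
    model_alloc_cr_task(&ct, &t);
    model_restore_linkage(&ct, 11, false);
    model_trigger(&ct);
    return model_watchdog_keeps(&ct);
}
```

In this model the main thread is dropped when the fix is off
(model_main_thread_survives(false) is false) and kept when it is on,
while the callback thread is kept either way. How the real
cr_restore_linkage() should reach the "cr_task" structure is something I
have not worked out here.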

The attached files are the kernel trace logs (I added some tracing
statements to print the self_exec_id values), the process tree, and the
threads of the processes.

text/x-log attachment: trace.log

text/x-log attachment: tree.log

text/x-log attachment: task.log

application/pgp-signature attachment: OpenPGP digital signature
