Process deadlock on checkpoint after restart (BLCR-0.8.0)

From: Hongjia Cao (
Date: Thu Feb 26 2009 - 06:13:59 PST

  • Next message: Paul H. Hargrove: "Re: Process deadlock on checkpoint after restart (BLCR-0.8.0)"
    I ran into this problem while testing the checkpoint/blcr plugin for
    SLURM.  The problem arises if a program (srun_cr, a wrapper around
    srun, the task launching utility of SLURM) performs time-consuming
    operations before calling cr_checkpoint() in the thread-context callback
    function. After restarting from a checkpoint, the process deadlocks
    (it waits uninterruptibly on "req->preshared_barrier") when it is
    checkpointed again. I tried mpiexec_cr of MVAPICH2-1.2p1 and found a
    similar problem, except that the process waits interruptibly in
    cr_freeze_threads() instead of entering the D state.
    Suppose srun_cr is checkpointed and restarted, and "task->self_exec_id"
    of cr_restart is 11. "task->self_exec_id" of the two threads of srun_cr
    (the main thread and the thread that executes the thread-context
    callback) will be set to 12 (11 + 1) in cr_restore_linkage(). But
    "cr_task->self_exec_id" of both threads will be 11, since the "cr_task"
    structures are allocated just after the threads are forked by
    cr_restart. This is not a problem for the callback execution thread,
    since do_trigger() synchronizes "cr_task->self_exec_id" with
    "task->self_exec_id" when triggering the PHASE1 threads. But since the
    callback function takes a long time to finish, the watchdog detects
    the mismatch between the two fields of the main thread before the
    PHASE1 thread triggers it, and the main thread is removed from the
    checkpoint request.
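    To make the sequence above concrete, here is a minimal user-space
    sketch. It is not BLCR code: the two structs only mimic the fields
    named above, and the model_* functions are hypothetical stand-ins for
    the behaviour I observed in cr_restore_linkage() and the watchdog.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; only the field discussed here. */
struct task    { unsigned int self_exec_id; };
struct cr_task { unsigned int self_exec_id; struct task *task; };

/* Models the restart step as described: only the task's id is bumped. */
static void model_restore_linkage(struct task *t)
{
    assert(t != NULL);
    t->self_exec_id += 1;
}

/* Models the watchdog check: a mismatch gets the thread removed. */
static int model_watchdog_mismatch(const struct cr_task *ct)
{
    assert(ct != NULL && ct->task != NULL);
    return ct->self_exec_id != ct->task->self_exec_id;
}

/* Returns 1 if the main thread would be removed from the request. */
static int model_main_thread_dropped(unsigned int restart_exec_id)
{
    struct task t = { restart_exec_id };        /* inherited from cr_restart */
    struct cr_task ct = { t.self_exec_id, &t }; /* allocated right after fork */

    model_restore_linkage(&t);                  /* task id becomes id + 1 */
    return model_watchdog_mismatch(&ct);        /* cr_task id is still stale */
}
```

    With the value from the example, model_main_thread_dropped(11) reports
    a mismatch (12 vs. 11), which corresponds to the main thread being
    removed before do_trigger() gets a chance to resynchronize it.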
    So I think "cr_task->self_exec_id" should be updated when changing
    "task->self_exec_id" in cr_restore_linkage().
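    The idea can be sketched in user space as follows. This is only an
    illustration of the proposal, not a patch: the structs are toy
    stand-ins for the kernel structures, and
    model_restore_linkage_synced() is a hypothetical name, not a BLCR
    function.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; only the field discussed here. */
struct task    { unsigned int self_exec_id; };
struct cr_task { unsigned int self_exec_id; struct task *task; };

/* Proposed behaviour: bump the task's id and copy it to cr_task at once. */
static void model_restore_linkage_synced(struct cr_task *ct)
{
    assert(ct != NULL && ct->task != NULL);
    ct->task->self_exec_id += 1;
    ct->self_exec_id = ct->task->self_exec_id;
}

/* Returns 1 if both ids agree (and equal id + 1) after the restart step. */
static int model_ids_match_after_restart(unsigned int restart_exec_id)
{
    struct task t = { restart_exec_id };
    struct cr_task ct = { t.self_exec_id, &t };

    model_restore_linkage_synced(&ct);
    return ct.self_exec_id == t.self_exec_id
        && t.self_exec_id == restart_exec_id + 1;
}
```

    In this model the two fields never disagree after the restart step, so
    a watchdog-style comparison has nothing to detect regardless of how
    long the callback runs.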
    The attached files contain the kernel trace logs (I added some tracing
    statements to print the self_exec_id lines), the process tree, and the
    threads of the processes.
