From: Hongjia Cao (hjcao_at_nudt_dot_edu.cn)
Date: Thu Feb 26 2009 - 06:13:59 PST
I ran into this problem while testing the checkpoint/blcr plugin for SLURM. It arises when a program (here srun_cr, a wrapper around srun, SLURM's task-launching utility) performs time-consuming operations before calling cr_checkpoint() in its thread-context callback function. After the program has been restarted from a checkpoint, checkpointing it again causes a deadlock: the process waits uninterruptibly on "req->preshared_barrier". I tried mpiexec_cr from MVAPICH2-1.2p1 and found a similar problem, except that the process waits interruptibly in cr_freeze_threads() instead of entering the D state.

Suppose srun_cr is checkpointed and restarted, and "task->self_exec_id" of cr_restart is 11. In cr_restore_linkage(), "task->self_exec_id" of both srun_cr threads (the main thread and the thread that executes the thread-context callback) is set to 12 (11 + 1). Their "cr_task->self_exec_id", however, is set to 11, because the "cr_task" structures are allocated just after the threads are forked by cr_restart.

This is not a problem for the callback execution thread, since do_trigger() synchronizes "cr_task->self_exec_id" with "task->self_exec_id" when triggering the PHASE1 threads. But because the callback function takes a long time to finish, the watchdog detects the mismatch between the two fields of the main thread before the PHASE1 thread triggers it, and the main thread is deleted from the checkpoint request.

So I think "cr_task->self_exec_id" should be updated whenever "task->self_exec_id" is changed in cr_restore_linkage().

The attached files are the kernel trace logs (I added some tracing statements to print the self_exec_id lines), the process tree, and the threads of the processes.
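
To make the sequence concrete, here is a small userspace model of the mismatch and of the change I am suggesting. It is only a sketch, not BLCR code: the two structs keep nothing but the field in question, and every name starting with "model_" is made up for illustration. The exec id values (11 and 12) are the ones from the example above.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-ins for the kernel structures (assumption: only the
 * self_exec_id field matters for this scenario). */
struct task {
    unsigned int self_exec_id;
};

struct cr_task {
    struct task *task;
    unsigned int self_exec_id;  /* copy taken when the cr_task is allocated */
};

/* Models allocating the cr_task just after cr_restart forks the thread:
 * the copy records the value the thread has at that moment. */
void model_cr_task_init(struct cr_task *ct, struct task *t)
{
    ct->task = t;
    ct->self_exec_id = t->self_exec_id;
}

/* Models the behaviour described in this report: only the task field is
 * bumped, so the cr_task copy goes stale. */
void model_restore_linkage(struct cr_task *ct, unsigned int restarter_id)
{
    ct->task->self_exec_id = restarter_id + 1;
}

/* Models the suggested change: update the cr_task copy at the same time. */
void model_restore_linkage_synced(struct cr_task *ct, unsigned int restarter_id)
{
    ct->task->self_exec_id = restarter_id + 1;
    ct->self_exec_id = ct->task->self_exec_id;
}

/* Models the watchdog test: true means the thread would be dropped from
 * the checkpoint request because the two fields disagree. */
bool model_watchdog_drops(const struct cr_task *ct)
{
    return ct->self_exec_id != ct->task->self_exec_id;
}

/* Walks through both variants with the numbers from the report. */
void model_selfcheck(void)
{
    struct task main_thread = { 11 };   /* forked by cr_restart (id 11) */
    struct cr_task ct;

    model_cr_task_init(&ct, &main_thread);
    model_restore_linkage(&ct, 11);
    assert(main_thread.self_exec_id == 12 && ct.self_exec_id == 11);
    assert(model_watchdog_drops(&ct));      /* main thread gets deleted */

    main_thread.self_exec_id = 11;          /* reset and try the fix */
    model_cr_task_init(&ct, &main_thread);
    model_restore_linkage_synced(&ct, 11);
    assert(!model_watchdog_drops(&ct));     /* thread stays in the request */
}
```

Calling model_selfcheck() from a main() runs both cases. In the real module the timing also matters (the watchdog has to fire before do_trigger() reaches the thread), which this model does not try to reproduce.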