Re: Process deadlock on checkpoint after restart (BLCR-0.8.0)


From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Feb 26 2009 - 13:36:46 PST

Next message: Andrea Autiero S143785: "Re: using blcr on program with fork"

Previous message: Hongjia Cao: "Process deadlock on checkpoint after restart (BLCR-0.8.0)"
Maybe in reply to: Hongjia Cao: "Process deadlock on checkpoint after restart (BLCR-0.8.0)"

Hongjia Cao,

Thanks for the bug report. I am glad to hear you are making progress 
on SLURM+BLCR integration.

I think I understand the problem from your description, but I am not 
sure how soon I can work on a fix for BLCR.  Could you please create a 
Bugzilla entry for this issue so that I don't lose track of it?  Our 
Bugzilla server is located at http://mantis.lbl.gov/bugzilla .  The 
initial bug report form does not support attaching files, but there will 
be an option to attach files after the bug report has been created.

For the time being, I suggest you work around the problem by disabling 
the self_exec_id checking using the attached patch.  The only bad side 
effect of this change is that BLCR will not be able to deal properly 
with the rare case in which a task exec()s during a checkpoint or 
restart.  Note that this means that one or more of BLCR's tests (run by 
"make check") will fail when they attempt to check these rare cases.

I do plan to eventually correct the self_exec_id checking code rather 
than disabling it.  I think your suggestion of updating it in 
cr_restore_linkage() is probably correct, but I don't have the time to 
test that right now.
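
As a rough userspace illustration of the mismatch described in the quoted 
report below, here is a minimal sketch.  It is not BLCR code: the model_* 
names and the sync_cached_copy flag are hypothetical, and the structs only 
stand in for the two self_exec_id fields mentioned in the report.  The 
check mirrors the "!=" comparison that the attached patch disables in 
cr_chkpt_req.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel task; only the field of interest. */
struct model_task {
    uint32_t self_exec_id;
};

/* Hypothetical stand-in for BLCR's per-task record with its cached copy. */
struct model_cr_task {
    struct model_task *task;
    uint32_t self_exec_id;
};

/* Start tracking a task: cache its current self_exec_id. */
static void model_track(struct model_cr_task *ct, struct model_task *t)
{
    ct->task = t;
    ct->self_exec_id = t->self_exec_id;
}

/*
 * Model of the restart-time step from the report: the task's value is
 * incremented by one.  With sync_cached_copy set, the cached copy is
 * updated at the same time (the change suggested in the report).
 */
static void model_restore_linkage(struct model_cr_task *ct, bool sync_cached_copy)
{
    ct->task->self_exec_id += 1;
    if (sync_cached_copy)
        ct->self_exec_id = ct->task->self_exec_id;
}

/* Model of the checkpoint-side check: any inequality looks like an exec(). */
static bool model_chkpt_check_drops(const struct model_cr_task *ct)
{
    return ct->task->self_exec_id != ct->self_exec_id;
}
```

With the values from the report (a starting value of 11), calling 
model_restore_linkage() without the sync leaves the task at 12 and the 
cached copy at 11, so model_chkpt_check_drops() returns true.  With the 
sync, both are 12 and the check returns false.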

-Paul


Hongjia Cao wrote:
> I ran into this problem while testing the checkpoint/blcr plugin for
> SLURM.  The problem arises if a program (srun_cr, a wrapper program for
> the task launching utility of SLURM (srun)) performs time-consuming
> operations before calling cr_checkpoint() in the thread context callback
> function. After restarting from a checkpoint, it will deadlock
> (the process waits uninterruptibly on "req->preshared_barrier") when
> being checkpointed again. I tried mpiexec_cr of MVAPICH2-1.2p1 and found
> a similar problem, except that the process waits interruptibly in
> cr_freeze_threads() instead of entering the D state.
>
> Suppose srun_cr is checkpointed and restarted, and "task->self_exec_id"
> of cr_restart is 11. "task->self_exec_id" of the two threads of srun_cr
> (the main thread and the thread context callback execution thread) will
> be set to 12 (11 + 1) in cr_restore_linkage(). But
> "cr_task->self_exec_id" of both will be set to 11, since the "cr_task"
> structures are allocated just after the threads are forked by
> cr_restart. This is not a problem for the callback execution thread,
> since do_trigger() will synchronize "cr_task->self_exec_id" with
> "task->self_exec_id" when triggering the PHASE1 threads. But, since the
> callback function takes a long time to finish, the watchdog will
> detect the inequality between the two fields of the main thread before
> the PHASE1 thread triggers it. The main thread will then be deleted from
> the checkpoint request.
>
> So I think "cr_task->self_exec_id" should be updated when changing
> "task->self_exec_id" in cr_restore_linkage().
>
>
> The attached files are the kernel trace logs (I added some tracing
> statements to print the self_exec_id values), the process tree, and the
> threads of the processes.
>   


-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory     


Index: cr_module/cr_rstrt_req.c
===================================================================
RCS file: /var/local/cvs/lbnl_cr/cr_module/cr_rstrt_req.c,v
retrieving revision 1.393
diff -u -r1.393 cr_rstrt_req.c
--- cr_module/cr_rstrt_req.c	10 Dec 2008 00:46:34 -0000	1.393
+++ cr_module/cr_rstrt_req.c	26 Feb 2009 21:35:03 -0000
@@ -2669,7 +2669,7 @@
 
     list_for_each_entry_safe(cr_task, next, &req->tasks, req_list) {
 	struct task_struct *task = cr_task->task;
-	if ((task->self_exec_id - cr_task->self_exec_id) > 1) {
+	if (0 && (task->self_exec_id - cr_task->self_exec_id) > 1) {
 		CR_WARN_PROC_REQ(cr_task->rstrt_proc_req,
 			"%s: tgid/pid %d/%d exec()ed '%s' during restart",
 			__FUNCTION__, task->tgid, task->pid, task->comm);
Index: cr_module/cr_chkpt_req.c
===================================================================
RCS file: /var/local/cvs/lbnl_cr/cr_module/cr_chkpt_req.c,v
retrieving revision 1.264
diff -u -r1.264 cr_chkpt_req.c
--- cr_module/cr_chkpt_req.c	5 Dec 2008 23:15:19 -0000	1.264
+++ cr_module/cr_chkpt_req.c	26 Feb 2009 21:35:03 -0000
@@ -173,7 +173,7 @@
 
 	list_for_each_entry_safe(cr_task, next, &req->tasks, req_list) {
 		struct task_struct *task = cr_task->task;
-		if (task->self_exec_id != cr_task->self_exec_id) {
+		if (0 && task->self_exec_id != cr_task->self_exec_id) {
 			CR_WARN_PROC_REQ(cr_task->rstrt_proc_req,
 				"%s: tgid/pid %d/%d exec()ed '%s' during checkpoint",
 				__FUNCTION__, task->tgid, task->pid, task->comm);
