From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Feb 26 2009 - 13:36:46 PST
Hongjia Cao,

Thanks for the bug report, and I am glad to hear you are making some
progress on SLURM+BLCR integration.

I think I understand the problem from your description, but I am not
sure how soon I can work on a fix for BLCR. So, could you please create
a Bugzilla entry for this issue to ensure I don't lose track of it? Our
Bugzilla server is located at http://mantis.lbl.gov/bugzilla . The
initial bug report form does not support attaching files, but there will
be an option to attach files after the bug report has been created.

I would suggest that, for the time being, you work around the problem by
disabling the self_exec_id checking using the attached patch. The only
bad side-effect of this change is that BLCR will not be able to deal
properly with the rare case in which a task exec()s during a checkpoint
or restart. Note that this means that one or more of BLCR's tests (run
by "make check") will fail when they attempt to check these rare cases.

I do plan to eventually correct the self_exec_id checking code rather
than disabling it. I think your suggestion of updating it in
cr_restore_linkage() is probably correct, but I don't have the time to
test that right now.

-Paul

Hongjia Cao wrote:
> I ran into this problem when testing the checkpoint/blcr plugin for
> SLURM. The problem arises if a program (srun_cr, a wrapper program of
> the task launching utility of SLURM (srun)) performs time-consuming
> operations before calling cr_checkpoint() in the thread context callback
> function. After restarting from a checkpoint, it will deadlock
> (the process waits uninterruptibly on "req->preshared_barrier") when
> being checkpointed again. I tried mpiexec_cr of MVAPICH2-1.2p1 and found
> a similar problem, except that the process waits interruptibly in
> cr_freeze_threads() instead of entering the D state.
>
> Suppose srun_cr is checkpointed and restarted, and "task->self_exec_id"
> of cr_restart is 11. "task->self_exec_id" of the two threads of srun_cr
> (the main thread and the thread context callback execution thread) will
> be set to 12 (11 + 1) in cr_restore_linkage(). But
> "cr_task->self_exec_id" of them will be set to 11, since the "cr_task"
> structures are allocated just after the threads are forked by
> cr_restart. This is not a problem for the callback execution thread,
> since do_trigger() will synchronize "cr_task->self_exec_id" with
> "task->self_exec_id" when triggering the PHASE1 threads. But, since it
> takes the callback function a long time to finish, the watchdog will
> detect the inequality between the two fields of the main thread before
> the PHASE1 thread triggers it. The main thread will be deleted from the
> checkpoint request.
>
> So I think "cr_task->self_exec_id" should be updated when changing
> "task->self_exec_id" in cr_restore_linkage().
>
>
> The attached files are the kernel trace logs (I added some tracing
> statements to print the self_exec_id lines) and the process tree and the
> threads of the processes.
>

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory

Index: cr_module/cr_rstrt_req.c
===================================================================
RCS file: /var/local/cvs/lbnl_cr/cr_module/cr_rstrt_req.c,v
retrieving revision 1.393
diff -u -r1.393 cr_rstrt_req.c
--- cr_module/cr_rstrt_req.c	10 Dec 2008 00:46:34 -0000	1.393
+++ cr_module/cr_rstrt_req.c	26 Feb 2009 21:35:03 -0000
@@ -2669,7 +2669,7 @@
     list_for_each_entry_safe(cr_task, next, &req->tasks, req_list) {
         struct task_struct *task = cr_task->task;
 
-        if ((task->self_exec_id - cr_task->self_exec_id) > 1) {
+        if (0 && (task->self_exec_id - cr_task->self_exec_id) > 1) {
             CR_WARN_PROC_REQ(cr_task->rstrt_proc_req,
                 "%s: tgid/pid %d/%d exec()ed '%s' during restart",
                 __FUNCTION__, task->tgid, task->pid, task->comm);
Index: cr_module/cr_chkpt_req.c
===================================================================
RCS file: /var/local/cvs/lbnl_cr/cr_module/cr_chkpt_req.c,v
retrieving revision 1.264
diff -u -r1.264 cr_chkpt_req.c
--- cr_module/cr_chkpt_req.c	5 Dec 2008 23:15:19 -0000	1.264
+++ cr_module/cr_chkpt_req.c	26 Feb 2009 21:35:03 -0000
@@ -173,7 +173,7 @@
     list_for_each_entry_safe(cr_task, next, &req->tasks, req_list) {
         struct task_struct *task = cr_task->task;
 
-        if (task->self_exec_id != cr_task->self_exec_id) {
+        if (0 && task->self_exec_id != cr_task->self_exec_id) {
             CR_WARN_PROC_REQ(cr_task->rstrt_proc_req,
                 "%s: tgid/pid %d/%d exec()ed '%s' during checkpoint",
                 __FUNCTION__, task->tgid, task->pid, task->comm);
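
For readers following the thread, the self_exec_id mismatch described in the quoted report can be reproduced in a small user-space model. This is only a sketch under stated assumptions, not BLCR code: the struct and function names (model_task, model_cr_task, model_restore_linkage, and so on) are hypothetical, and only the two comparisons are taken from the patch above. The sync_cached flag models the fix suggested in the report (updating the cached value alongside the live one), which has not been tested against BLCR itself.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's task_struct: only the field
 * discussed in the thread is modeled. */
struct model_task {
    unsigned int self_exec_id;   /* live value; changes when the task exec()s */
};

/* Hypothetical stand-in for BLCR's per-task bookkeeping entry. */
struct model_cr_task {
    struct model_task *task;
    unsigned int self_exec_id;   /* value cached when the entry was allocated */
};

/* Comparison from the checkpoint-side hunk of the patch: any difference
 * is treated as "the task exec()ed". */
static bool chkpt_sees_exec(const struct model_cr_task *ct)
{
    return ct->task->self_exec_id != ct->self_exec_id;
}

/* Comparison from the restart-side hunk of the patch: a difference of
 * exactly 1 is tolerated. */
static bool rstrt_sees_exec(const struct model_cr_task *ct)
{
    return (ct->task->self_exec_id - ct->self_exec_id) > 1;
}

/* Models the step the report attributes to cr_restore_linkage(): the live
 * id becomes the restarter's id plus one. With sync_cached set, the cached
 * copy is updated too (the fix proposed in the report; an assumption here). */
static void model_restore_linkage(struct model_cr_task *ct,
                                  unsigned int restarter_id,
                                  bool sync_cached)
{
    ct->task->self_exec_id = restarter_id + 1;
    if (sync_cached)
        ct->self_exec_id = ct->task->self_exec_id;
}

/* Replays the scenario from the report for the main thread: the entry is
 * allocated with id 11, the linkage is restored, and a later checkpoint
 * runs its check before anything else resynchronizes the cached value.
 * Returns true if the checkpoint check would flag the thread. */
static bool checkpoint_flags_main_thread(bool sync_cached)
{
    struct model_task task = { 11 };
    struct model_cr_task ct = { &task, 11 };

    model_restore_linkage(&ct, 11, sync_cached);
    return chkpt_sees_exec(&ct);
}
```

In this model, checkpoint_flags_main_thread(false) reports a spurious exec() (12 versus a cached 11), while checkpoint_flags_main_thread(true) does not. The restart-side comparison tolerates the off-by-one in both cases, which matches the report's observation that the failure only shows up at the next checkpoint.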