Re: Process deadlock on checkpoint after restart (BLCR-0.8.0)

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Feb 26 2009 - 13:36:46 PST

    Hongjia Cao,
    
    Thanks for the bug report and I am glad to hear you are making some 
    progress on SLURM+BLCR integration.
    
    I think I understand the problem from your description, but I am not 
    sure how soon I can work on a fix for BLCR.  Could you please create a 
    Bugzilla entry for this issue so that I don't lose track of it?  Our 
    Bugzilla server is located at http://mantis.lbl.gov/bugzilla .  The 
    initial bug report form does not support attaching files, but there will 
    be an option to attach files after the bug report has been created.
    
    For the time being, I suggest working around the problem by disabling 
    the self_exec_id checking with the attached patch.  The only bad 
    side effect of this change is that BLCR will not be able to deal 
    properly with the rare case in which a task exec()s during a 
    checkpoint or restart.  Note that this means one or more of BLCR's 
    tests (run by "make check") will fail when they attempt to check 
    these rare cases.
    
    I do plan to eventually correct the self_exec_id checking code rather 
    than disabling it.  I think your suggestion of updating it in 
    cr_restore_linkage() is probably correct, but I don't have the time to 
    test that right now.
    
    -Paul
    
    
    Hongjia Cao wrote:
    > I ran into this problem when testing the checkpoint/blcr plugin for
    > SLURM.  The problem arises if a program (srun_cr, a wrapper around
    > SLURM's task-launching utility, srun) performs time-consuming
    > operations before calling cr_checkpoint() in the thread-context callback
    > function. After restarting from a checkpoint, the process deadlocks
    > (waiting uninterruptibly on "req->preshared_barrier") when it is
    > checkpointed again. I tried mpiexec_cr from MVAPICH2-1.2p1 and found a
    > similar problem, except that the process waits interruptibly in
    > cr_freeze_threads() instead of entering the D state.
    >
    > Suppose srun_cr is checkpointed and restarted, and "task->self_exec_id"
    > of cr_restart is 11. "task->self_exec_id" of the two threads of srun_cr
    > (the main thread and the thread-context callback execution thread) will
    > be set to 12 (11 + 1) in cr_restore_linkage(). But
    > "cr_task->self_exec_id" of both will be set to 11, since the "cr_task"
    > structures are allocated just after the threads are forked by
    > cr_restart. This is not a problem for the callback execution thread,
    > since do_trigger() will synchronize "cr_task->self_exec_id" with
    > "task->self_exec_id" when triggering the PHASE1 threads. But since the
    > callback function takes a long time to finish, the watchdog detects
    > the inequality between the two fields of the main thread before
    > the PHASE1 thread triggers it. The main thread is then deleted from the
    > checkpoint request.
    >
    > So I think "cr_task->self_exec_id" should be updated when changing
    > "task->self_exec_id" in cr_restore_linkage().
    >
    >
    > The attached files are the kernel trace logs (I added some tracing
    > statements to print the self_exec_id values), the process tree, and the
    > threads of the processes.
    >   
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
    
    
    Index: cr_module/cr_rstrt_req.c
    ===================================================================
    RCS file: /var/local/cvs/lbnl_cr/cr_module/cr_rstrt_req.c,v
    retrieving revision 1.393
    diff -u -r1.393 cr_rstrt_req.c
    --- cr_module/cr_rstrt_req.c	10 Dec 2008 00:46:34 -0000	1.393
    +++ cr_module/cr_rstrt_req.c	26 Feb 2009 21:35:03 -0000
    @@ -2669,7 +2669,7 @@
     
         list_for_each_entry_safe(cr_task, next, &req->tasks, req_list) {
     	struct task_struct *task = cr_task->task;
    -	if ((task->self_exec_id - cr_task->self_exec_id) > 1) {
    +	if (0 && (task->self_exec_id - cr_task->self_exec_id) > 1) {
     		CR_WARN_PROC_REQ(cr_task->rstrt_proc_req,
     			"%s: tgid/pid %d/%d exec()ed '%s' during restart",
     			__FUNCTION__, task->tgid, task->pid, task->comm);
    Index: cr_module/cr_chkpt_req.c
    ===================================================================
    RCS file: /var/local/cvs/lbnl_cr/cr_module/cr_chkpt_req.c,v
    retrieving revision 1.264
    diff -u -r1.264 cr_chkpt_req.c
    --- cr_module/cr_chkpt_req.c	5 Dec 2008 23:15:19 -0000	1.264
    +++ cr_module/cr_chkpt_req.c	26 Feb 2009 21:35:03 -0000
    @@ -173,7 +173,7 @@
     
     	list_for_each_entry_safe(cr_task, next, &req->tasks, req_list) {
     		struct task_struct *task = cr_task->task;
    -		if (task->self_exec_id != cr_task->self_exec_id) {
    +		if (0 && task->self_exec_id != cr_task->self_exec_id) {
     			CR_WARN_PROC_REQ(cr_task->rstrt_proc_req,
     				"%s: tgid/pid %d/%d exec()ed '%s' during checkpoint",
     				__FUNCTION__, task->tgid, task->pid, task->comm);
    
