Re: Question about "fd" token

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Jun 15 2009 - 13:37:25 PDT

  • Next message: 李宏亮: "Re: Re: Question about "fd" token"
    In all cases, we trigger and checkpoint all threads, regardless of
    PHASE. They are all on the task list and will all get triggered (sent a
    signal). However, we let the PHASE1 threads (if any) run before the
    others are triggered.
    
    The PHASE1 threads are normally blocked in the kernel, waiting for a
    checkpoint request. As you noted, they are "triggered" first if they
    exist. All they do in their signal handler code is change the state
    (again, you already noted this). It is that change of state that causes
    them to leave the blocked state and begin running callbacks that have
    been registered with CR_THREAD_CONTEXT. After the checkpoint they resume
    blocking in the kernel, ready for the next checkpoint.
    
    If there are no PHASE1 threads, then the PHASE2 and NOPHASE threads are
    signaled instead of the PHASE1. However, if there are any PHASE1 threads
    in a given process, then BLCR waits until they have finished running
    their callbacks and reached do_checkpoint(); this is the purpose of the
    "phase_barrier". Only after that is cr_trigger_phase2() called to signal
    the remaining (PHASE2 and NOPHASE) threads in the process.
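
    To make the ordering concrete, here is a minimal sequential model in C.
    It is only a sketch of the behavior described above, not the BLCR kernel
    code: the struct, the enum, and the helper names (trigger_phase1,
    trigger_phase2, trigger_all) are hypothetical, and the phase barrier is
    reduced to a comment because nothing runs concurrently in this model.

```c
#include <stddef.h>

/* Hypothetical thread kinds, mirroring the three cases in this thread. */
enum phase { PHASE1, PHASE2, NOPHASE };

struct task {
    enum phase phase;
    int signaled;   /* position in the signaling order; 0 = not signaled */
};

/* Signal only the PHASE1 tasks; return how many were signaled. */
static size_t trigger_phase1(struct task *t, size_t n, int *seq)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (t[i].phase == PHASE1) {
            t[i].signaled = ++*seq;
            count++;
        }
    }
    return count;
}

/* Signal every task that has not been signaled yet (PHASE2 and NOPHASE). */
static size_t trigger_phase2(struct task *t, size_t n, int *seq)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (t[i].phase != PHASE1 && t[i].signaled == 0) {
            t[i].signaled = ++*seq;
            count++;
        }
    }
    return count;
}

/* One checkpoint request: PHASE1 first (if any), then everyone else. */
static void trigger_all(struct task *t, size_t n)
{
    int seq = 0;
    if (trigger_phase1(t, n, &seq) > 0) {
        /* The real code waits here (the "phase_barrier") until the PHASE1
         * threads have run their callbacks and reached do_checkpoint(). */
    }
    trigger_phase2(t, n, &seq);
}
```

    With a task list of {PHASE2, PHASE1, NOPHASE}, trigger_all() signals the
    PHASE1 entry first and the other two afterwards. With no PHASE1 entry,
    the first pass signals nothing and the second pass signals every task.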
    
    Regarding the signal handler: there is one handler, cri_sig_handler(),
    because signal handler registration is per-process, not per-thread.
    However, that function calls others depending on the type of the thread:
    PHASE1:
      - The (currently zero or one) thread that libcr creates to run
        thread-context callbacks.
      - Changes state to allow the thread to wake and run callbacks
        registered as CR_THREAD_CONTEXT.
    PHASE2:
      - Any application-created thread that has called cr_init().
      - Runs any callbacks registered as CR_SIGNAL_CONTEXT by the thread.
    NOPHASE:
      - Any application thread that has NOT called cr_init() and
        therefore has no thread-specific cri_thread_info structure.
      - Just calls do_checkpoint() without running any callbacks.
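
    The one-handler, three-behaviors dispatch can be sketched as below. This
    is an illustration under assumptions, not the libcr source: struct
    thread_info stands in for cri_thread_info, and its is_callback_thread
    field, the action enum, and dispatch() are all invented names. A NULL
    pointer models a thread that never called cr_init().

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-thread cri_thread_info structure. */
struct thread_info {
    int is_callback_thread;   /* nonzero for the thread libcr created */
};

/* What the single process-wide handler does for each kind of thread. */
enum action {
    ACT_WAKE_CALLBACK_THREAD,   /* PHASE1: change state so the thread wakes */
    ACT_RUN_SIGNAL_CALLBACKS,   /* PHASE2: run CR_SIGNAL_CONTEXT callbacks  */
    ACT_CHECKPOINT_ONLY         /* NOPHASE: go straight to the checkpoint   */
};

/* Choose the behavior from the calling thread's info pointer. */
static enum action dispatch(const struct thread_info *ti)
{
    if (ti == NULL) {
        /* No per-thread structure: the thread never called cr_init(). */
        return ACT_CHECKPOINT_ONLY;
    }
    return ti->is_callback_thread ? ACT_WAKE_CALLBACK_THREAD
                                  : ACT_RUN_SIGNAL_CALLBACKS;
}
```

    The point of the sketch is only that the decision is made inside the
    handler, per thread, because the handler itself is installed once for
    the whole process.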
    
    
    You ask in your final question how the ones not triggered as PHASE1 are
    "waked up". I am not sure I understand the question, but I think you
    want to know how they are made to run the BLCR code. Right? If that is
    the question, then the answer is just that they are sent a signal which
    is handled in the normal Linux way. These threads are made to run the
    BLCR code just as any other signal handler would run. If you need to
    understand Linux's signal delivery code, then I am afraid that I am not
    qualified to describe that for you, but there are plenty of books and
    online resources about the Linux kernel design that should help with
    that. Let me know if I have missed the point of that last question.
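
    As a reminder of what "handled in the normal Linux way" looks like from
    user space, here is a small standalone example using only standard POSIX
    calls (sigaction and raise). It has nothing BLCR-specific in it: SIGUSR1
    is a stand-in for CR_SIGNUM, and on_signal() and install_and_raise() are
    names made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler; sig_atomic_t is safe to write from a handler. */
static volatile sig_atomic_t handler_ran = 0;

/* Ordinary signal handler: runs in whichever thread receives the signal. */
static void on_signal(int signo)
{
    (void)signo;
    handler_ran = 1;
}

/* Install the handler, signal the calling thread, and report whether the
 * handler ran. Returns 1 on success, -1 if a call failed. */
static int install_and_raise(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);

    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;
    if (raise(SIGUSR1) != 0)   /* handler runs before raise() returns */
        return -1;
    return handler_ran;
}
```

    The thread does nothing special to "wake up": the kernel interrupts it
    and runs the registered handler, which is all that happens to the PHASE2
    and NOPHASE threads as well.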
    
    -Paul
    
    
    李宏亮 wrote:
    > Hello, Professor:
    >
    > Thank you very much for answering my questions with great patience. But
    > I have something more to ask.
    >
    > +"When a checkpoint is requested for a process, the BLCR kernel module
    > sends each thread in that process an unblockable signal"
    >
    > Yes, I see BLCR do this in the functions "cr_trigger_phase1()" and
    > "cr_trigger_phase2()".
    >
    > When we execute "cr_trigger_phase1()":
    >
    > It depends on the tasks in the target task list (proc_req->tasks). If
    > there are phase1 tasks (even only one phase1 task) in this task list,
    > then only these phase1 tasks are sent the signal "CR_SIGNUM". Otherwise
    > all the tasks in this list are sent the signal (because all of them are
    > either "phase2" tasks or "no phase" tasks).
    >
    > When we execute "cr_trigger_phase2()":
    >
    > Only "phase2" tasks and "no phase" tasks in the task list are sent
    > the signal.
    >
    > I know the phase1 task is spawned when we register thread callback:
    > cri_register_thread()->thread_init()->thread_main()->rc =
    > cri_syscall_token(token, CR_OP_HAND_PHASE1, token);
    >
    > After "cri_register_thread()" finishes, we have created a callback
    > thread. This thread makes the "CR_OP_HAND_PHASE1" syscall to register
    > a phase1 handler, then blocks in the kernel until a checkpoint occurs.
    >
    > Here comes my first question. My guess is:
    >
    > The target task list may contain phase1, phase2, and no-phase tasks.
    > We first find the phase1 tasks, if any. These are not meant to be
    > checkpointed; their job is to execute callback functions, so the
    > phase1 handler does nothing other than execute callbacks. Then come
    > the phase2 and no-phase tasks. These are meant to be checkpointed, so
    > they invoke cr_checkpoint() or do_checkpoint() respectively.
    >
    > Am I right? If so, I want to know why the target task list can
    > contain the callback thread. In which scenario?
    >
    > My second question: if I am wrong, what are the differences among the
    > no-phase, phase1, and phase2 tasks? What does each corresponding
    > signal handler deal with?
    > I see that the phase1 handler simply changes the state of the thread,
    > while the phase2 handler invokes cr_checkpoint() to execute the
    > callbacks array first... uh... I am confused...
    >
    > My last question: for the callback threads that are added to the
    > target task list as phase1 tasks, I know how they are woken up after
    > blocking for a checkpoint request. But I don't know how the ones not
    > added to that list are woken up.
    >
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
    
