From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Jun 15 2009 - 13:37:25 PDT
In all cases, we trigger and checkpoint all threads, regardless of PHASE. They are all on the task list and will all get triggered (sent a signal). However, we let the PHASE1 threads (if any) run before the others are triggered.

The PHASE1 threads are normally blocked in the kernel, waiting for a checkpoint request. As you noted, they are "triggered" first if they exist. All they do in their signal handler code is change the state (again, you already noted this). It is that change of state that causes them to leave the blocked state and begin running the callbacks that have been registered with CR_THREAD_CONTEXT. After the checkpoint they resume blocking in the kernel, ready for the next checkpoint.

If there are no PHASE1 threads, then the PHASE2 and NOPHASE threads are signaled instead of the PHASE1 threads. However, if there are any PHASE1 threads in a given process, then BLCR waits until they have finished running their callbacks and reached do_checkpoint(); this is the purpose of the "phase_barrier". Only after that is cr_trigger_phase2() called to signal the remaining (PHASE2 and NOPHASE) threads in the process.

Regarding the signal handler: there is one handler, cri_sig_handler(), because signal handler registration is per-process, not per-thread. However, that function calls others depending on the type of the thread:

PHASE1: The (currently zero or one) thread that libcr creates to run thread-context callbacks.
    The handler changes state to allow the thread to wake and run the callbacks registered as CR_THREAD_CONTEXT.
PHASE2: Any application-created thread that has called cr_init().
    The handler runs any callbacks registered as CR_SIGNAL_CONTEXT by the thread.
NOPHASE: Any application thread that has NOT called cr_init() and therefore has no thread-specific cri_thread_info structure.
    The handler just calls do_checkpoint() without running any callbacks.

You ask in your final question how the ones not triggered as PHASE1 are "waked up".
I am not sure I understand the question, but I think you want to know how they are made to run the BLCR code. Right? If that is the question, then the answer is just that they are sent a signal, which is handled in the normal Linux way. These threads are made to run the BLCR code just as any other signal handler would run. If you need to understand Linux's signal delivery code, then I am afraid that I am not qualified to describe that for you, but there are plenty of books and online resources about the Linux kernel design that should help with that.

Let me know if I have missed the point of that last question.

-Paul

The original poster wrote:
> Hello, Professor:
>
> Thank you very much for answering my questions with such patience. But I have something more to ask.
>
> +"When a checkpoint is requested for a process, the BLCR kernel module sends each thread in that process an unblockable signal"
>
> Yes, I see BLCR do this in the functions "cr_trigger_phase1()" & "cr_trigger_phase2()".
>
> When we execute "cr_trigger_phase1()":
>
> It depends on the tasks in the target task list (proc_req->tasks). If there are phase1 tasks (even only one phase1 task) in this task list, then only these phase1 tasks are sent the signal "CR_SIGNUM". Otherwise all the tasks in this list are sent the signal (because all of them are either "phase2" tasks or "no phase" tasks).
>
> When we execute "cr_trigger_phase2()":
>
> Only "phase2" tasks and "no phase" tasks in the task list are sent the signal.
>
> I know the phase1 task is spawned when we register a thread callback:
> cri_register_thread()->thread_init()->thread_main()->rc = cri_syscall_token(token, CR_OP_HAND_PHASE1, token);
>
> After "cri_register_thread()" finishes, we have created a callback thread. This thread makes the "CR_OP_HAND_PHASE1" syscall and registers a phase1 handler, then blocks in the kernel until a checkpoint occurs.
>
> Here comes my first question. I guess:
>
> There are tasks in the target task list, which may be phase1/phase2/no-phase tasks. We first find the phase1 tasks, if any: OK, you are not planned to be checkpointed, your work is to execute callback functions. So the handler of the phase1 tasks does nothing other than execute callback functions. Then come the phase2 and no-phase tasks. These tasks are planned to be checkpointed, so they invoke cr_checkpoint() or do_checkpoint() respectively.
>
> Am I right? If this is right, I want to know why the target task list can contain the callback thread. In which scenario?
>
> My second question: if I am wrong, what are the differences among the no-phase, phase1 and phase2 tasks? What do their corresponding signal handlers deal with?
> I see the phase1 handler simply changes the state of the thread, while the phase2 handler invokes cr_checkpoint() to execute the callbacks array first... uh... I am confused...
>
> My last question is: for the callback threads which are added into the target task list as phase1 tasks, I know how they are woken up after blocking for a checkpoint request. But for the ones not added into that list, how are they woken up?

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory