From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Jun 08 2009 - 12:18:54 PDT
Allow me to answer your questions out of order, to provide the clearest explanation:

+ "how the other tasks in the req->task_list been dumped?"

When a checkpoint is requested for a process, the BLCR kernel module sends each thread in that process an unblockable signal. The handler for that signal is in the libcr code, which is either linked explicitly into the application or loaded via LD_PRELOAD. That signal handler ensures that every thread included in the request will eventually call do_checkpoint(). The call comes from cr_checkpoint() if the thread has interacted with libcr in a way that caused thread-specific info to be allocated. If no thread-specific info has been allocated for a given thread, the signal handler calls do_checkpoint() directly.

+ "What does these callbacks exactly used to do? To provide the user of blcr with interface to do something relative to the specified application program? Or just used to do the real checkpoint stuff?"

As you noticed, before calling into the kernel via do_checkpoint() to perform the real work of saving the checkpoint, the code in cr_checkpoint() runs a stack of callbacks. These callbacks are, as you guessed, "to provide the user of blcr with interface to do something relative to the specified application program." The common motivating example of a callback is a distributed application saving information about the state of its communication (since BLCR does not save socket state, or any other network info).

+ Regarding my_cb() in cr_checkpoint.c:

This callback in the cr_checkpoint utility program is *not* related to checkpointing of another process, so it may have confused or misled you. It is used in the cr_checkpoint utility to make sure that if *cr_checkpoint* itself is checkpointed, the checkpoint it requested has either completed or not started yet.
We do that by using a pthread mutex to be certain that checkpointing of cr_checkpoint and checkpointing of the requested process(es) are mutually exclusive (if that is impossible because cr_checkpoint has requested a checkpoint that includes itself, we OMIT it from the checkpoint to avoid deadlock).

This is an example of "do something relative to the specified application program". In this case the application is the cr_checkpoint utility and the "do something" is making sure that we get "all or nothing" from the checkpoint request (because the request is not restartable across the checkpoint).

-Paul

The original poster wrote:
> Hello, Professor:
>
> I have read the paper "The Design and Implementation of Berkeley Lab's
> Linux Checkpoint/Restart" for several times and intervals between these
> times I was reading the source code. However, until now I still can not
> understand the user library "callback" mechanism.
>
> What does these callbacks exactly used to do? To provide the user of
> blcr with interface to do something relative to the specified
> application program? Or just used to do the real checkpoint stuff?
>
> In checkpoint.c, I noticed that before we issue the request to build
> the task list that we want to checkpoint, the callback "my_cb" was
> registered:
>     /* Register our callback */
>     cb_id = cr_register_callback(&my_cb, NULL, CR_THREAD_CONTEXT);
>
> my_cb() invokes cr_checkpoint():
>
> cr_checkpoint() will not invoke do_checkpoint() to do the real dump
> work until all the callbacks in the callbacks array which is got from
> the thread-specific info of current thread.
>
> so what does these callback used to do?
>
> By the way, the function cr_dump_self() seems to dump only the current
> process. how the other tasks in the req->task_list been dumped?
-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
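
The callback flow described in the reply (each callback calls cr_checkpoint(), and the real dump happens only once no callbacks remain) can be illustrated with a small toy model. This is only a sketch and not libcr code: every name beginning with toy_ is hypothetical, the "kernel" step is a print statement, and running the most recently registered callback first is an assumption drawn from the word "stack" in the reply.

```c
#include <assert.h>
#include <stdio.h>

#define TOY_MAX_CALLBACKS 8

/* Hypothetical stand-ins for illustration; none of these are libcr names. */
typedef int (*toy_callback_t)(void *arg);

static toy_callback_t toy_callbacks[TOY_MAX_CALLBACKS];
static void *toy_args[TOY_MAX_CALLBACKS];
static int toy_count = 0;       /* number of registered callbacks */
static int toy_next = -1;       /* next callback to run; -1 means none left */
static int toy_dump_calls = 0;  /* how many times the "real work" ran */

/* Stand-in for registering a callback: push it onto the stack. */
static int toy_register_callback(toy_callback_t fn, void *arg)
{
    if (toy_count == TOY_MAX_CALLBACKS)
        return -1;
    toy_callbacks[toy_count] = fn;
    toy_args[toy_count] = arg;
    return toy_count++;
}

/* Stand-in for do_checkpoint(): the step that would enter the kernel. */
static int toy_do_checkpoint(void)
{
    toy_dump_calls++;
    puts("  [kernel stand-in] saving process state");
    return 0;
}

/* Stand-in for cr_checkpoint(): run the next callback on the stack,
 * or do the real work once every callback has been entered. */
static int toy_checkpoint(void)
{
    if (toy_next < 0)
        return toy_do_checkpoint();
    int i = toy_next--;
    return toy_callbacks[i](toy_args[i]);
}

/* Stand-in for the signal handler's entry into the callback stack. */
static int toy_request_checkpoint(void)
{
    toy_next = toy_count - 1;   /* assumed order: last registered runs first */
    return toy_checkpoint();
}

/* An application callback: work before and after the checkpoint itself. */
static int toy_app_callback(void *arg)
{
    const char *name = arg;
    printf("%s: before checkpoint (e.g. record communication state)\n", name);
    int rc = toy_checkpoint();  /* hands control to the next callback */
    printf("%s: after checkpoint, rc=%d\n", name, rc);
    return rc;
}

/* Register two callbacks and request one checkpoint. */
static int toy_demo(void)
{
    toy_count = 0;
    toy_dump_calls = 0;
    toy_register_callback(toy_app_callback, (void *)"first");
    toy_register_callback(toy_app_callback, (void *)"second");
    int rc = toy_request_checkpoint();
    assert(toy_dump_calls == 1);  /* the real work ran exactly once */
    return rc;
}
```

Calling toy_demo() from a main() prints the nesting: each callback's "before" line appears ahead of the single kernel stand-in line, and each "after" line follows it. In this model, that nesting is what lets a callback bracket the checkpoint with its own work.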