From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Jul 26 2005 - 15:41:10 PDT
sichiwai wrote:
> Hello,
> i have some questions regarding the checkpoint/restart project.
>
> According to FAQ (http://mantis.lbl.gov/blcr/doc/html/FAQ.html#faq7 and
> http://mantis.lbl.gov/blcr/doc/html/FAQ.html#faq8) the BLCR can not
> checkpoint TCP/IP and sockets. FAQ8 states that "You must arrange for
> your program to release such resources before it is checkpointed (see
> next FAQ)". What happens to those resources not checkpointed. Are they
> simply ignored or do the cause problems?

At one point in the past (when those FAQ entries were written), the behavior was to "cause problems" (fail at checkpoint time) if a non-restorable resource (such as a TCP socket) was detected. That is why the "you must arrange for your program to release such resources before it is checkpointed" bit was written. This is no longer the case...

> What I especially need to know, what happens to bound sockets, since
> they need some time to be gracefully released and reacquired, which is
> certainly not desirable for a performance based application.

We eventually recognized that failing on an unsupported resource was not the best policy. As you point out, there are non-trivial costs associated with releasing and reacquiring some resources. FAQ entry #7 will be updated at our next release to read as follows:

> Are there limits to the types of programs BLCR can checkpoint?
>
> Yes. BLCR does not support checkpointing certain process resources.
> Most notably, BLCR will not checkpoint and/or restore open TCP/IP or
> Unix domain sockets, or SysV IPC objects (man 5 ipc). Such
> resources are silently ignored at checkpoint time and are not
> restored. Applications can arrange to save any necessary information
> and reacquire such resources at restart time (see next FAQ).

We explicitly ignore sockets currently, and a process with open sockets should checkpoint just fine.
When the process is restarted there will be nothing at the corresponding file descriptor (as if it had been closed). This results in errno=EBADF for any calls involving the fd, and has the added bad behavior that a subsequent open() might accidentally reuse the socket's fd number, causing terrible confusion. In the future we hope to attach the fd to a dummy that would return with something like errno=ECONNRESET rather than the current behavior.

You explicitly mention "bound sockets". However, any file descriptor obtained from socket() or accept() will be ignored, regardless of calls to bind(). Thus UDP is just as fully ignored as TCP.

> Also I'd like to know what happens to a pthread which has not called
> cr_init(). I one of your examples (pthread_misc) the callback is
> initialized for every thread. In the other pthread example, no
> checkpoint function is called at all.. . What is now the correct way to
> checkpoint a application with multiple threads?

Short version:
Every pthread is checkpointed regardless of whether it (or any other thread) has called cr_init(), but a thread that calls cr_*() functions must call cr_init() first.

Long version:
The only requirement to ensure a process can be checkpointed by BLCR is that the shared library initializer code is run. This happens automatically when an application is linked to libcr.so, or when it is run by the cr_run wrapper script (which uses LD_PRELOAD to pull in the .so). In the "other pthread example" (pthread_counting), the app is linked to libcr.so and nothing else is needed.

Every thread which makes calls to the library (things like registering a callback) must call cr_init() to initialize some per-thread state. This state is used (among other things) to ensure subsequent calls to the library are atomic with respect to checkpoints, but is not otherwise necessary for a thread to be checkpointable. This is what is happening in pthread_misc, which wants a callback run for each thread.
-Paul

>
> Regards
> Christian Iwainsky
> Student

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900