Re: Checkpointing


From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Jul 26 2005 - 15:41:10 PDT

Next message: Paul H. Hargrove: "Re: Problems with BLCR?"

Previous message: Pradeep Padala: "Re: Problems with BLCR?"
In reply to: sichiwai: "Checkpointing"
Next in thread: Paul H. Hargrove: "Re: Checkpointing"

sichiwai wrote:
> Hello,
>  i have some questions regarding the checkpoint/restart project.
> 
> According to FAQ (http://mantis.lbl.gov/blcr/doc/html/FAQ.html#faq7 and 
> http://mantis.lbl.gov/blcr/doc/html/FAQ.html#faq8) BLCR cannot 
> checkpoint TCP/IP and sockets. FAQ8 states that "You must arrange for 
> your program to release such resources before it is checkpointed (see 
> next FAQ)". What happens to those resources that are not checkpointed? 
> Are they simply ignored or do they cause problems?

At one point in the past (when those FAQ entries were written), the 
behavior was to "cause problems" (fail at checkpoint time) if a 
non-restorable resource (such as a TCP socket) was detected.  That is 
why the "you must arrange for your program to release such resources 
before it is checkpointed" bit was written.  This is no longer the case...

> What I especially need to know is what happens to bound sockets, since 
> they need some time to be gracefully released and reacquired, which is 
> certainly not desirable for a performance-based application.

We eventually recognized that failing on an unsupported resource was not 
the best policy.  As you point out, there are non-trivial costs 
associated with releasing and reacquiring some resources.  FAQ entry #7 
will be updated at our next release to read as follows:

> Are there limits to the types of programs BLCR can checkpoint?
> 
> Yes. BLCR does not support checkpointing certain process resources. 
> Most notably, BLCR will not checkpoint and/or restore open TCP/IP or 
> Unix domain sockets, or SysV IPC objects (man 5 ipc).   Such
> resources are silently ignored at checkpoint time and are not
> restored.  Applications can arrange to save any necessary information
> and reacquire such resources at restart time (see next FAQ).

We currently ignore sockets explicitly, so a process with open sockets 
should checkpoint just fine.  When the process is restarted there will 
be nothing at the corresponding file descriptor (as if it had been 
closed).  Any call involving that fd then fails with errno=EBADF, and 
there is the added bad behavior that a subsequent open() might 
accidentally reuse the socket's fd number, causing terrible confusion.  
In the future we hope to attach the fd to a dummy object that would fail 
with something like errno=ECONNRESET rather than the current behavior.
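
To make the EBADF and fd-reuse behavior concrete, here is a minimal 
sketch in Python using only POSIX-level calls.  It does not involve BLCR 
at all: closing the descriptor by hand is an assumed stand-in for the 
state a restarted process would find, and the variable names are purely 
illustrative.

```python
import errno
import os
import socket

# Stand-in for the post-restart state: the socket's descriptor number is
# simply not open any more. We close it by hand to get the same effect.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
stale_fd = sock.detach()   # take the raw fd away from the Python object
os.close(stale_fd)         # "nothing at this fd", as after a restart

# Any call on the stale descriptor now fails with EBADF.
try:
    os.write(stale_fd, b"hello")
    write_errno = None
except OSError as exc:
    write_errno = exc.errno

# The kernel hands out the lowest free descriptor number, so an unrelated
# open() will usually land on the number the socket used to have.
new_fd = os.open(os.devnull, os.O_WRONLY)
reused = (new_fd == stale_fd)

# Code still holding stale_fd would now write to /dev/null with no error.
bytes_written = os.write(stale_fd, b"hello") if reused else 0
os.close(new_fd)

print("errno on stale fd:", errno.errorcode[write_errno])
print("fd number reused by open():", reused)
```

The second half is the dangerous case: once the number has been reused, 
the old socket code gets no error and its data goes to the wrong place.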

You explicitly mention "bound sockets".  However, any file descriptor 
obtained from socket() or accept() will be ignored, regardless of calls 
to bind().  Thus UDP is just as fully ignored as TCP.
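
For a bound UDP socket, reacquiring it after a restart could look 
roughly like the Python sketch below.  The helper names fd_is_socket() 
and reacquire_udp() are hypothetical (they are not part of BLCR), the 
os.close() call is only a stand-in for a restart dropping the 
descriptor, and the check cannot tell whether the fd number has since 
been reused by a different socket.

```python
import errno
import os
import socket
import stat

def fd_is_socket(fd):
    """Hypothetical helper: True only if fd is open and refers to a socket."""
    try:
        return stat.S_ISSOCK(os.fstat(fd).st_mode)
    except OSError as exc:
        if exc.errno == errno.EBADF:
            return False
        raise

def reacquire_udp(sock, addr):
    """Hypothetical helper: return a UDP socket bound to addr, rebuilding it if needed."""
    if sock is not None and fd_is_socket(sock.fileno()):
        return sock
    if sock is not None:
        # Forget the stale fd number WITHOUT closing it: that number may
        # already belong to some other open file in this process.
        sock.detach()
    fresh = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    fresh.bind(addr)
    return fresh

# Usage: bind once, lose the descriptor, then reacquire on the same port.
sock = reacquire_udp(None, ("127.0.0.1", 0))
port = sock.getsockname()[1]
os.close(sock.fileno())                      # stand-in for the restart
sock = reacquire_udp(sock, ("127.0.0.1", port))
print("rebound on port", sock.getsockname()[1])
```

Saving the address before the checkpoint is the application's job; 
nothing here is done automatically by the checkpoint library.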

> Also I'd like to know what happens to a pthread which has not called 
> cr_init(). In one of your examples (pthread_misc) the callback is 
> initialized for every thread. In the other pthread example, no 
> checkpoint function is called at all. What is the correct way to 
> checkpoint an application with multiple threads?

Short version:

Every pthread is checkpointed regardless of whether it (or any other 
thread) has called cr_init(), but a thread that calls cr_*() functions 
must call cr_init() first.

Long version:

The only requirement to ensure a process can be checkpointed by BLCR is 
that the shared library initializer code is run.  This happens 
automatically when an application is linked to libcr.so, or when run by 
the cr_run wrapper script (which uses an LD_PRELOAD to pull in the .so).

In the "other pthread example" (pthread_counting), the app is linked to 
libcr.so and nothing else is needed.

Every thread which makes calls to the library (things like registering a 
callback) must call cr_init() to initialize some per-thread state.  This 
state is used (among other things) to ensure subsequent calls to the 
library are atomic with respect to checkpoints, but is not otherwise 
necessary for a thread to be checkpointable.  This is what is happening 
in pthread_misc, which wants a callback run for each thread.

-Paul

> 
> Regards
>  Christian Iwainsky
>  Student

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
