Re: Checkpointing

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Jul 26 2005 - 15:41:10 PDT

    sichiwai wrote:
    > Hello,
    >  i have some questions regarding the checkpoint/restart project.
    > According to FAQ ( and 
    > the BLCR can not 
    > checkpoint TCP/IP and sockets. FAQ8 states that "You must arrange for 
    > your program to release such resources before it is checkpointed (see 
    > next FAQ)". What happens to those resources that are not checkpointed? Are
    > they simply ignored or do they cause problems?
    At one point in the past (when those FAQ entries were written), the 
    behavior was to "cause problems" (fail at checkpoint time) if a 
    non-restorable resource (such as a TCP socket) was detected.  That is 
    why the "you must arrange for your program to release such resources 
    before it is checkpointed" bit was written.  This is no longer the case...
    > What I especially need to know is what happens to bound sockets, since 
    > they need some time to be gracefully released and reacquired, which is 
    > certainly not desirable for a performance based application.
    We eventually recognized that failing on an unsupported resource was not 
    the best policy.  As you point out, there are non-trivial costs 
    associated with releasing and reacquiring some resources.  FAQ entry #7 
    will be updated at our next release to read as follows:
    > Are there limits to the types of programs BLCR can checkpoint?
    > Yes. BLCR does not support checkpointing certain process resources. 
    > Most notably, BLCR will not checkpoint and/or restore open TCP/IP or 
    > Unix domain sockets, or SysV IPC objects (man 5 ipc).   Such
    > resources are silently ignored at checkpoint time and are not
    > restored.  Applications can arrange to save any necessary information
    > and reacquire such resources at restart time (see next FAQ)
    We explicitly ignore sockets currently, so a process with open sockets 
    should checkpoint just fine.  When the process is restarted 
    there will be nothing at the corresponding file descriptor (as if it had 
    been closed).  This results in errno=EBADF for any calls involving the 
    fd, and has the added bad behavior that a subsequent open() might 
    accidentally reuse the socket fd causing terrible confusion.  In the 
    future we hope to attach the fd to a dummy that would return w/ 
    something like errno=ECONNRESET rather than the current behavior.
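
    To make the fd-reuse hazard concrete, here is a minimal sketch using only
    POSIX calls (no BLCR involved).  It assumes that closing a socket is a fair
    stand-in for "the socket is missing after restart"; the helper names
    fd_is_open(), fd_is_socket() and demo_fd_reuse() are invented for this
    illustration and are not part of any library.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if fd names an open descriptor, 0 if the kernel reports EBADF. */
int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}

/* Returns 1 if fd is open and refers to a socket, 0 otherwise. */
int fd_is_socket(int fd)
{
    int type;
    socklen_t len = sizeof(type);
    return getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) == 0;
}

/*
 * Simulates the post-restart state and the reuse hazard.
 * Returns 1 if a later open() was handed the old socket's descriptor
 * number, 0 if it was not, and -1 if the demonstration could not run.
 */
int demo_fd_reuse(void)
{
    /* Any socket type would do; AF_UNIX is used only for portability. */
    int sock = socket(AF_UNIX, SOCK_STREAM, 0);
    if (sock < 0)
        return -1;

    close(sock);                 /* ASSUMPTION: stands in for a restart */
    if (fd_is_open(sock))        /* calls on the old fd now fail with EBADF */
        return -1;

    /* open() returns the lowest free descriptor, i.e. the old socket's. */
    int file = open("/dev/null", O_RDONLY);
    if (file < 0)
        return -1;

    /* The old number is valid again, but it is no longer a socket. */
    int reused = (file == sock) && fd_is_open(sock) && !fd_is_socket(sock);
    close(file);
    return reused;
}
```

    A program that wants to be defensive could call something like
    fd_is_socket() on its saved descriptors right after restart, before
    opening anything else, and treat a failure as "this connection is gone".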
    You explicitly mention "bound sockets".  However, any file descriptor 
    obtained from socket() or accept() will be ignored, regardless of calls 
    to bind().  Thus UDP is just as fully ignored as TCP.
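
    Since the application has to put such sockets back itself, one possible
    shape for that is sketched below, using only POSIX calls.  Everything here
    is an assumption made for illustration: the struct saved_socket record and
    the save_socket()/reacquire_socket() helpers are hypothetical, only
    IPv4 sockets are handled, and a listening or connected TCP socket would
    need its listen() or connect() repeated as well.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical record of what is needed to rebuild one bound socket. */
struct saved_socket {
    int fd;                   /* descriptor number the program uses */
    int type;                 /* SOCK_DGRAM or SOCK_STREAM */
    struct sockaddr_in addr;  /* local address the socket was bound to */
};

/* Before the checkpoint: record the socket type and its bound address.
 * Returns 0 on success, -1 on failure. */
int save_socket(int fd, struct saved_socket *s)
{
    socklen_t len = sizeof(s->type);
    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &s->type, &len) != 0)
        return -1;

    memset(&s->addr, 0, sizeof(s->addr));
    len = sizeof(s->addr);
    if (getsockname(fd, (struct sockaddr *)&s->addr, &len) != 0)
        return -1;
    if (s->addr.sin_family != AF_INET)
        return -1;            /* this sketch only handles IPv4 */

    s->fd = fd;
    return 0;
}

/* After the restart: create a fresh socket, bind it to the saved address,
 * and move it to the original descriptor number.  This should run before
 * anything else is opened, because dup2() silently replaces whatever
 * currently occupies s->fd.  Returns s->fd on success, -1 on failure. */
int reacquire_socket(const struct saved_socket *s)
{
    int fd = socket(AF_INET, s->type, 0);
    if (fd < 0)
        return -1;

    int on = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

    if (bind(fd, (const struct sockaddr *)&s->addr, sizeof(s->addr)) != 0) {
        close(fd);
        return -1;
    }
    if (fd != s->fd) {
        if (dup2(fd, s->fd) < 0) {
            close(fd);
            return -1;
        }
        close(fd);
    }
    return s->fd;
}
```

    Restoring the socket at its original descriptor number means the rest of
    the program can keep using the fd value it already holds.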
    > Also I'd like to know what happens to a pthread which has not called 
    > cr_init(). In one of your examples (pthread_misc) the callback is 
    > initialized for every thread. In the other pthread example, no 
    > checkpoint function is called at all. What is the correct way to 
    > checkpoint an application with multiple threads?
    Short version:
    Every pthread is checkpointed regardless of whether it (or any other 
    thread) has called cr_init(), but a thread that calls cr_*() functions 
    must call cr_init() first.
    Long version:
    The only requirement to ensure a process can be checkpointed by BLCR is 
    that the shared library initializer code is run.  This happens 
    automatically when an application is linked to the library, or when it is 
    run by the cr_run wrapper script (which uses an LD_PRELOAD to pull in the 
    .so).  In the "other pthread example" (pthread_counting), the app is linked 
    to the library and nothing else is needed.
    Every thread which makes calls to the library (things like registering a 
    callback) must call cr_init() to initialize some per-thread state.  This 
    state is used (among other things) to ensure subsequent calls to the 
    library are atomic with respect to checkpoints, but is not otherwise 
    necessary for a thread to be checkpointable.  This is what is happening 
    in pthread_misc, which wants a callback run for each thread.
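
    The per-thread pattern can be sketched roughly as follows.  This is not
    the pthread_misc source.  It assumes the libcr.h client API as commonly
    documented (cr_init(), cr_register_callback(), cr_checkpoint(),
    CR_SIGNAL_CONTEXT), and the HAVE_LIBCR macro, the stand-in stubs, struct
    worker and start_workers() are all hypothetical, added only so the sketch
    builds on a machine without BLCR installed.

```c
#include <pthread.h>

#ifdef HAVE_LIBCR                /* hypothetical macro: build against BLCR */
#include <libcr.h>
#else
/* Stand-ins, NOT BLCR code.  ASSUMPTION: they mirror the shape of the
 * libcr.h declarations; they do nothing and the flag value is arbitrary. */
typedef int cr_client_id_t;
typedef int cr_callback_id_t;
typedef int (*cr_callback_t)(void *);
#define CR_SIGNAL_CONTEXT 0
static cr_client_id_t cr_init(void) { return 0; }
static cr_callback_id_t cr_register_callback(cr_callback_t f, void *arg, int flags)
{ (void)f; (void)arg; (void)flags; return 0; }
static int cr_checkpoint(int flags) { (void)flags; return 0; }
#endif

#define MAX_WORKERS 16

struct worker {
    volatile int restarts;   /* times this thread resumed from a restart */
    int ready;               /* set once the callback is registered */
};

/* Checkpoint callback.  Assumed behavior: cr_checkpoint() returns 0 when
 * continuing after a checkpoint and a positive value when restarting. */
static int on_checkpoint(void *arg)
{
    struct worker *w = arg;
    int rc = cr_checkpoint(0);
    if (rc > 0)
        w->restarts++;       /* a good place to reacquire sockets */
    return 0;
}

static void *worker_main(void *arg)
{
    struct worker *w = arg;

    /* Required in every thread that goes on to call other cr_*() functions. */
    if (cr_init() < 0)
        return NULL;
    if (cr_register_callback(on_checkpoint, w, CR_SIGNAL_CONTEXT) < 0)
        return NULL;
    w->ready = 1;

    /* ... the thread's real work would go here ... */
    return w;
}

/* Starts up to MAX_WORKERS threads, waits for them, and returns how many
 * managed to register their callback. */
int start_workers(int n)
{
    pthread_t tid[MAX_WORKERS];
    struct worker w[MAX_WORKERS];
    int started = 0, ready = 0;

    if (n > MAX_WORKERS)
        n = MAX_WORKERS;
    for (int i = 0; i < n; i++) {
        w[i].restarts = 0;
        w[i].ready = 0;
        if (pthread_create(&tid[i], NULL, worker_main, &w[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        ready += w[i].ready;
    }
    return ready;
}
```

    A thread that never touches the cr_*() functions can skip all of this and
    will still be checkpointed along with the rest of the process.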
    > Regards
    >  Christian Iwainsky
    >  Student
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
