Re: BLCR: sleep process checkpointing problem

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Mar 18 2010 - 17:26:06 PDT

  • Next message: Paul H. Hargrove: "Re: Question about a fault"
    Zsombor,
    
    I've experimented more to understand this sleep behavior and have, 
    again, found a way to reproduce the undesired behavior without involving 
    BLCR.  This suggests to me that there is nothing that BLCR could 
    realistically do differently to avoid the problem you are seeing.
    
    First, I observed from the strace output that something is registering 
    signal handlers for SIGRT_MIN and SIGRT_1.  Further investigation shows 
    that these are registered by something lib libpthread.so sometime before 
    main() executes.  The kill command doesn't recognize the SIGRT_* names 
    printed by strace, but if I signal /bin/sleep with "32" or "33" I can 
    reproduce the same sleep-restarts-from-30 behavior as seen with a 
    checkpoint:
    > $ (./sleeper.sh &); sleep 5; kill -32 `pgrep '\<sleep\>'`; sleep 31
    > Starting: 17:02:02
    > Ending: 17:02:37
    So that was 35 seconds due to restarting the 30-second nanosleep after 
    the signal is received at start+5.  Strace produces output similar to 
    the checkpoint case, confirming that nanosleep({30,0}, NULL) is called a 
    second time, rather than restart_syscall().  This means that if I send 
    signal 32 or 33 to sleep every 25 seconds, I can ensure it will never 
    complete - no checkpoints are required.
    
    So, my opinon (perhaps slightly biased as BLCR maintainer) is that there 
    exists a signal handling bug (or "misfeature" if you prefer) in the 
    implementation of /bin/sleep.  While one would rarely encounter it in 
    normal practice, the installation of a signal handler (via the library 
    LD_PRELOADed by cr_run) ensures this undesired behavior will occur every 
    time one checkpoints /bin/sleep,
    
    And here is a fun thing try: by LD_PRELOADing a tiny library that 
    registers a handler for SIGWINCH, I can prevent sleep from completing 
    simply by resizing my window at least once every 30 seconds:
    > $ cat wrap.c
    >     #include <signal.h>
    >     void dummy(int signo) { return; }
    >     static void __attribute__((constructor)) setup(void) { 
    > signal(SIGWINCH, &dummy); }
    > $ gcc -fPIC -c wrap.c && gcc -shared -o wrap.so.0 wrap.o
    > $ LD_PRELOAD=`pwd`/wrap.so.0 strace /bin/sleep 30
    > [...output removed...]
    > nanosleep({30, 0}, 0)                   = ? ERESTART_RESTARTBLOCK (To 
    > be restarted)
    > --- SIGWINCH (Window changed) @ 0 (0) ---
    > rt_sigreturn(0x1c)                      = -1 EINTR (Interrupted system 
    > call)
    > nanosleep({30, 0}, 0)                   = ? ERESTART_RESTARTBLOCK (To 
    > be restarted)
    > --- SIGWINCH (Window changed) @ 0 (0) ---
    > rt_sigreturn(0x1c)                      = -1 EINTR (Interrupted system 
    > call)
    > nanosleep({30, 0}, 0)                   = ? ERESTART_RESTARTBLOCK (To 
    > be restarted)
    > --- SIGWINCH (Window changed) @ 0 (0) ---
    > rt_sigreturn(0x1c)                      = -1 EINTR (Interrupted system 
    > call)
    > nanosleep({30, 0}, 0)                   = ? ERESTART_RESTARTBLOCK (To 
    > be restarted)
    > --- SIGWINCH (Window changed) @ 0 (0) ---
    > rt_sigreturn(0x1c)                      = -1 EINTR (Interrupted system 
    > call)
    > nanosleep({30, 0}, 0)                   = ? ERESTART_RESTARTBLOCK (To 
    > be restarted)
    > [etc...]
    
    Based on that experiment I believe I have convinced at least myself that 
    the ERESTART_RESTARTBLOCK/restart_syscall() behavior is not in effect 
    when the interrupting signal has a handler installed.  A reading of the 
    kernel code (signal.c for architecture of your choice) appears to 
    support that if one knows what to look for.
    
    -Paul
    
    
    Paul H. Hargrove wrote:
    > Zsombor,
    >
    > Thanks for the explanation.  I now understand what it is you are 
    > asking about, though I am not certain I have a satisfactory answer.  I 
    > am not aware of any specific reason why the sleep is resuming from the 
    > initial 30sec time.  However, I did try some experiments with strace 
    > of "sleep 30" (passing --ptraced-allow to cr_checkpoint and NOT 
    > including the strace in the checkpoint)
    >
    > Sending SIGTSTP, the relevant part of the strace output shows:
    >
    >> $ strace /bin/sleep 30
    >> [...]
    >> nanosleep({30, 0}, [*] 0)                   = ? ERESTART_RESTARTBLOCK 
    >> (To be restarted)
    >> --- SIGTSTP (Stopped) @ 0 (0) ---
    >> --- SIGTSTP (Stopped) @ 0 (0) ---
    >> restart_syscall(<... resuming interrupted call ...>) = 0
    >> close(1)                                = 0
    >> close(2)                                = 0
    >> exit_group(0)                           = ?
    >
    > Where the "[...]" is output I removed and "[*]" is my addition to mark 
    > the point where the output paused until the signal was sent.
    > It is interesting to note that I did NOT send the SIGCONT in this case 
    > - the sleep program apparently is able to keep running due to some 
    > multi-threaded tricks (note there are TWO "SIGTSTP" lines indicating 
    > the signal was delivered to 2 tasks).  Surprising to me is the fact 
    > that sleep is apparently able to run through receiving the unblockable 
    > SIGSTOP signal as well!
    >
    > Now, if I try to checkpoint instead of signal:
    >>
    >> $ strace cr_run /bin/sleep 30
    >> [...]
    >> nanosleep({30, 0}, [*] 0)                   = ? ERESTART_RESTARTBLOCK 
    >> (To be restarted)
    >> --- SIGRT_32 (Real-time signal 30) @ 0 (0) ---
    >> ioctl(3, 0x4008a101, 0x4000)            = 0
    >> rt_sigreturn(0x7fc9671a8530)            = -1 EINTR (Interrupted 
    >> system call)
    >> nanosleep({30, 0}, NULL)                = 0
    >> close(1)                                = 0
    >> close(2)                                = 0
    >> exit_group(0)                           = ?
    >
    > The ERESTART_RESTARTBLOCK/restart_syscall is some kernel "magic" to 
    > automatically/transparently restart certain interrupted system calls 
    > with DIFFERENT arguments than were passed originally.  In the case of 
    > nanosleep this is used to restart the call with the time adjusted to 
    > exclude the time already spent in the call.  It looks as if that is 
    > lost in the checkpoint case, and rather than the transparent restart 
    > of the nanosleep system call one sees a return of -1 with 
    > errno=EINTR.  The second argument to the nanosleep() is where the 
    > kernel would normally be able to write the remaining time when 
    > interrupted, but the caller has passed NULL which means it has no such 
    > information is available and apparently chooses restart the system 
    > call with the original arguments.
    >
    > I am not sure what the interaction is between the checkpoint and the 
    > system call restart logic, but my first and second guesses would be:
    > 1) Perhaps the fact that libcr installs a handler for our signal 
    > (SIGRT_32) tells the kernel not to transparently restart the call.
    > 2) Perhaps the existence of a handler is OK, but the presence of a 
    > system call (ioctl in this case) in the handler prevents the restart.
    >
    > I think the first guess is more likely, but hope to learn more by 
    > looking at the kernel code in more detail.
    > I either case, I suspect that one could implement a replacement for 
    > /bin/sleep that has the proper behavior in the presence of EINTR.
    > For what it is worth, sleep() in perl apparently does NOT restart on 
    > EINTR.  So, as an ugly work-around you might replace /bin/sleep with 
    > "perl -e sleep".
    >
    > It would be helpful to me if you could enter some of this information 
    > in our bug database (you could just paste this email into the bug 
    > report if you wish)
    >   https://upc-bugs.lbl.gov/bugzilla
    > That will make it less likely that I will loose track of this issue.
    >
    > -Paul
    >
    > Zsombor Nagy wrote:
    >> Hi Paul!
    >>
    >> I'm working with Iván and Gábor on this project, and I found this 
    >> sleep issue interesting, so I played a bit with it today. Here I'm 
    >> going to summarize what is our problem.
    >>
    >> Sorry for the long mail!
    >>
    >>
    >>
    >> I've made a sleeper script, which looks like this:
    >>
    >> $ cat sleeper.sh #!/bin/bash
    >> echo Starting: `date +%H:%M:%S`
    >> sleep 30
    >> echo Ending: `date +%H:%M:%S`
    >>
    >>
    >> Signalling
    >> ------------
    >>
    >> I'm executing it:
    >>
    >> $ ./sleeper.sh Starting: 13:49:13
    >> Ending: 13:49:43
    >>
    >> Then with cr_run:
    >>
    >> $ cr_run ./sleeper.sh Starting: 13:49:56
    >> Ending: 13:50:26
    >>
    >> No surprises: 30 seconds.
    >>
    >>
    >> Now I have an other script, called 'signaller':
    >>
    >> $ cat signaller.sh #!/bin/bash
    >> PID=`pgrep sleeper`
    >> echo Sending TSTP signal to PID $PID at `date +%H:%M:%S`
    >> kill -TSTP $PID
    >> echo Waiting for 5 sec.
    >> sleep 5
    >> echo Sending CONT signal to PID $PID at `date +%H:%M:%S`
    >> kill -CONT $PID
    >>
    >> This one sends a TSTP signal to our sleeper process, then waits for 5 
    >> seconds, then sends a continue signal.
    >>
    >> Now I start the sleeper script, and then after 5 seconds, I start the 
    >> signaller script in a second terminal window, but I will mix to 
    >> output here to make it easier to follow:
    >>
    >> - sleeper:
    >>
    >> Starting: 13:52:17
    >>
    >> - signaller:
    >>
    >> Sending TSTP signal to PID 6069 at 13:52:22
    >> Waiting for 5 sec.
    >> Sending CONT signal to PID 6069 at 13:52:27
    >>
    >> - back to sleeper:
    >>
    >> Ending: 13:52:47
    >>
    >>
    >> As you can see, the sleeper run for exactly 30 seconds no matter we 
    >> stopped it for 5 seconds.
    >> The reason for this is the behaviour of the sleep: it just checks the 
    >> time of the start, and if the current time minus the start time is 
    >> more then the number of seconds it needs to sleep, it is done. So it 
    >> does not matter that for 5 seconds it was not running, after the 
    >> current time reaches the start time + 30 seconds, it will quit - 
    >> unless it is stil stopped. So if I run the signaller just slightly 
    >> before the 30 seconds are up:
    >>
    >> - sleeper:
    >>
    >> Starting: 13:56:24
    >>
    >> - signaller:
    >>
    >> Sending TSTP signal to PID 7629 at 13:56:53
    >> Waiting for 5 sec.
    >> Sending CONT signal to PID 7629 at 13:56:58
    >>
    >> - back to sleeper:
    >>
    >> Ending: 13:56:58
    >>
    >>
    >> As you can see, the sleeper run for 34 seconds, because we sent the 
    >> CONT signal at 13:56:58, and then the sleeper process continued to 
    >> run, realized that the time is up, and quited immediately.
    >>
    >> So I think this behaviour is understandable and OK, there is no more 
    >> question related to this, this is just the way sleep works.
    >>
    >>
    >>
    >> Checkpointing
    >> ------------------
    >>
    >> I have a third script, called 'checkpointer':
    >>
    >> $ cat checkpointer.sh #!/bin/bash
    >> PID=`pgrep sleeper`
    >> echo Checkpointing PID $PID at `date +%H:%M:%S`
    >> cr_checkpoint $PID
    >>
    >> it just prints the time and then call cr_checkpoint.
    >>
    >>
    >> Let's run the sleeper, and then after a few seconds, run the 
    >> checkpointer:
    >>
    >> - sleeper:
    >>
    >> Starting: 14:03:23
    >>
    >> - checkpointer:
    >>
    >> Checkpointing PID 10237 at 14:03:27
    >>
    >> - back to sleeper:
    >>
    >> Ending: 14:03:57
    >>
    >> The sleeper process started at 23 and finished at 57 which is 34 
    >> seconds. But the interesting part is that we created the checkpoint 
    >> at 27, and the sleeper finished at 57, which is exactly 30 sec. It 
    >> looks like as the sleep would be restarted by the checkpointing.
    >>
    >> Another try:
    >>
    >> - sleeper:
    >>
    >> Starting: 14:05:48
    >>
    >> - checkpointer:
    >>
    >> Checkpointing PID 11118 at 14:06:04
    >>
    >> - back to sleeper:
    >>
    >> Ending: 14:06:34
    >>
    >>
    >>
    >> The sleeper started at 5:48, finished at 6:34, which is 48 seconds - 
    >> we checkpointed at 6:04, and the sleeper ended at 6:34, which is 30 
    >> seconds - again, looks like the checkpoint command restarts the sleep!
    >>
    >> That's the behaviour we try to understand, and hopefully avoid.
    >>
    >> The worst case scenario could be: having a periodical checkpointing 
    >> every 20 seconds would restart the 'sleep 30' every time, which would 
    >> mean it would never finish.
    >>
    >>
    >> I hope it's clearer now what we try to figure out.
    >>
    >> Looking forward to your answer!
    >>
    >> Thank you very much!
    >>
    >> Cheers,
    >> Zsombor Nagy
    >>
    >>
    >>
    >> On 2010.03.17., at 23:50, Paul H. Hargrove wrote:
    >>
    >>  
    >>> I am sorry, but I am not clear what it is you now need my help to 
    >>> explain.
    >>>
    >>> If have shown, and you have reproduced, a mechanism by which 
    >>> /usr/bin/sleep can be made to delay for an unexpectedly long period 
    >>> of time.  This is true regardless of whether BLCR is present or 
    >>> absent on the system.  If you are hoping that I know why this 
    >>> "extra"  sleep time occurs, then I am afraid I have no idea.
    >>>
    >>> If, however, you are asking about some behavior when using BLCR the 
    >>> "restart" a checkpointed /usr/bin/sleep process, then please re-ask 
    >>> the question clearly and I'll try to help.
    >>>
    >>> -Paul
    >>>
    >>> Ivan Marton wrote:
    >>>    
    >>>> Dear Paul!
    >>>>
    >>>> Sorry for the late answer! We have executed your suggested test and 
    >>>> experienced the same. After a deeper examination we have found that 
    >>>> this behavior can be very easily explained with the surprisingly 
    >>>> implementation of the sleep program.
    >>>>
    >>>> It simply stores the time when started and check very often 
    >>>> periodically whether the given period has passed or not. (Or at 
    >>>> least in that time when it's running.) When you have suspended it 
    >>>> this "counter" didn't stop and immediately returned when the 
    >>>> process was continued.
    >>>>
    >>>> The restart of this "counter" seems to be a completely different 
    >>>> problem. Could you help to locate the problem or explain if its a 
    >>>> feature?
    >>>>
    >>>> Thank you!
    >>>>
    >>>> Cheers,
    >>>> Ivan Marton
    >>>>
    >>>> On Mar 11, 2010, at 5:53 PM, Paul H. Hargrove wrote:
    >>>>
    >>>>      
    >>>>> Gábor,
    >>>>>
    >>>>> To help determine if this is a BLCR-specific problem, or a 
    >>>>> signal-handling issue in your /bin/sleep (or kernel), please try
    >>>>> $ kill -TSTP <PID> ; sleep 15; kill -CONT <PID>
    >>>>> instead of running
    >>>>> $ cr_checkpoint <PID>
    >>>>> and report the running time for the sleep command.
    >>>>> You could also repeat this TSTP/CONT experiment running /bin/sleep 
    >>>>> without cr_run.
    >>>>>
    >>>>> At least for me (CentOS 5.4 w/ 2.6.18-164.11.1.el5 kernel), I find 
    >>>>> that the TSTP/CONT experiment causes "time /bin/sleep" to report 
    >>>>> "extra" time without any involvement from BLCR.  Here is an 
    >>>>> example w/ "/bin/sleep 30" and a "sleep 35" between the two 
    >>>>> signals.  Strangely, I get 55s (which is not 30+35):
    >>>>>        
    >>>>>> $ time /bin/sleep 30 &
    >>>>>> [1] 3538
    >>>>>> $ ps aux|grep sleep
    >>>>>> phargrov  3539  0.0  0.1  58920   516 pts/0    S    13:07   0:00 
    >>>>>> /bin/sleep 30
    >>>>>> phargrov  3541  0.0  0.1  61180   736 pts/0    S+   13:07   0:00 
    >>>>>> grep sleep
    >>>>>> $ kill -TSTP 3539
    >>>>>> $ sleep 35; kill -CONT 3539
    >>>>>> $
    >>>>>> real    0m55.538s
    >>>>>> user    0m0.000s
    >>>>>> sys     0m0.002s
    >>>>>>           
    >>>>> -Paul
    >>>>>
    >>>>> Gábor Rőczei wrote:
    >>>>>        
    >>>>>> Dear BLCR developers,
    >>>>>>
    >>>>>> I am working for a company in Hungary, its name is NIIF 
    >>>>>> (http://www.niif.hu/en). Our one main area is the grid computing 
    >>>>>> and we have a country size grid infrastructure.  The PCs are 
    >>>>>> provided by participating Hungarian institutes, such as high 
    >>>>>> schools, universities, or public libraries.  Every contributor 
    >>>>>> uses their PCs for their own purposes during the official work 
    >>>>>> hours, such as educational or office-like purposes, and offers 
    >>>>>> the infrastructure for high-throughput computation whenever they 
    >>>>>> do not use them for any other purposes, i.e. during the nights 
    >>>>>> and the unoccupied week-ends. The combined use of "day-shift" 
    >>>>>> (i.e. individual mode) and "night-shift" (i.e. grid mode) enables 
    >>>>>> us to utilize CPU cycles (which would have been lost anyway) to 
    >>>>>> provide firm comutational infrastructure to the national research 
    >>>>>> community (more information about our grid: 
    >>>>>> http://www.clustergrid.hu/). The PCs are running Linux at "grid 
    >>>>>> mode" and they are using Windows at "daytime mode". When !
    > t!
    >>>>>>           
    >>  he PC switch from  Linux to Windows then the jobs are chechpointed 
    >> and when they change from Windows to Linux then the jobs will be 
    >> restarted. This is why we need checkpointing.
    >>  
    >>>>>> The current state we are using Condor and its checkpointing 
    >>>>>> library but there was some problems with it and we decided that 
    >>>>>> we will change them to Sun Grid Engine and BLCR soon. I read that 
    >>>>>> Sun Grid Engine can configured with BLCR:
    >>>>>>
    >>>>>> http://gridengine.sunsource.net/project/gridengine/howto/howto.html
    >>>>>>
    >>>>>> Section: Checkpointing under Linux with Berkeley Lab 
    >>>>>> Checkpoint/Restart
    >>>>>>
    >>>>>> http://gridengine.sunsource.net/project/gridengine/howto/APSTC-TB-2004-005.pdf 
    >>>>>>
    >>>>>> We found a sleep problem. Here is the description:
    >>>>>>
    >>>>>> If I am not checkpointing  the sleep process:
    >>>>>>
    >>>>>> roczei@knowarc2:~$ time cr_run /bin/sleep 10
    >>>>>>
    >>>>>> real    0m10.126s
    >>>>>> user    0m0.004s
    >>>>>> sys    0m0.012s
    >>>>>>
    >>>>>> If I am  checkpointing the sleep process:
    >>>>>>
    >>>>>> roczei@knowarc2:~$ time cr_run /bin/sleep 10
    >>>>>>
    >>>>>> real    0m20.404s
    >>>>>> user    0m0.008s
    >>>>>> sys    0m0.008s
    >>>>>> roczei@knowarc2:~$
    >>>>>>
    >>>>>> Other terminal:
    >>>>>>
    >>>>>> roczei@knowarc2:~$ ps aux | grep sleep
    >>>>>> roczei   17113  2.6  0.3   3048   544 pts/0    S+   09:39   0:00 
    >>>>>> /bin/sleep 10
    >>>>>> roczei   17115  0.0  0.4   3120   724 pts/1    S+   09:39   0:00 
    >>>>>> grep sleep
    >>>>>> roczei@knowarc2:~$ cr_checkpoint 17113
    >>>>>>
    >>>>>> So if I send a checkpoint signal to 17113 then the sleep process 
    >>>>>> running will "restart". What do you think why happen this? This 
    >>>>>> error happen with Sun Grid Engine and without SGE too.
    >>>>>>
    >>>>>> Best regards,
    >>>>>>
    >>>>>>      Gabor Roczei
    >>>>>>
    >>>>>>
    >>>>>>           
    >>>>> -- 
    >>>>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >>>>> Future Technologies Group                 Tel: +1-510-495-2352
    >>>>> HPC Research Department                   Fax: +1-510-486-6900
    >>>>> Lawrence Berkeley National Laboratory
    >>>>>         
    >>> -- 
    >>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >>> Future Technologies Group
    >>> HPC Research Department                   Tel: +1-510-495-2352
    >>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    >>>
    >>>     
    >>
    >>
    >>   
    >
    >
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
    

  • Next message: Paul H. Hargrove: "Re: Question about a fault"