Re: BLCR: sleep process checkpointing problem

From: Ivan Marton (martoni_at_niif.hu)
Date: Wed Mar 17 2010 - 14:31:17 PDT

  • Next message: Paul H. Hargrove: "Re: BLCR: sleep process checkpointing problem"
    Dear Paul!
    
    Sorry for the late reply! We ran the test you suggested and saw the
    same result. After a closer look, we found that this behavior is
    easily explained by the surprising implementation of the sleep
    program.
    
    It simply stores the time at which it started and then checks
    frequently whether the given period has passed (at least while it
    is running). When the process is suspended, this "counter" does not
    stop, so sleep returns immediately once the process is continued,
    provided the period has already passed in the meantime.
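
As a rough illustration of this explanation (a sketch, not the actual source of /bin/sleep), the arithmetic can be modeled with a hypothetical shell helper, elapsed_with_pause. It assumes the deadline is fixed when sleep starts, on a clock that keeps running while the process is stopped. The stop times passed in the two calls are assumed values chosen only for illustration; they were not measured.

```shell
#!/bin/sh
# Hypothetical model, NOT the real /bin/sleep implementation.
# Usage: elapsed_with_pause DURATION STOP_AT PAUSE_LEN
# Prints the wall-clock seconds a deadline-based sleep would take if it is
# stopped STOP_AT seconds after starting and continued PAUSE_LEN seconds later.
elapsed_with_pause() {
    duration=$1
    stop_at=$2
    pause_len=$3
    cont_at=$(( stop_at + pause_len ))

    if [ "$stop_at" -ge "$duration" ]; then
        # The deadline passed before the stop signal arrived.
        echo "$duration"
    elif [ "$cont_at" -gt "$duration" ]; then
        # The deadline passed while stopped: return as soon as continued.
        echo "$cont_at"
    else
        # Continued before the deadline: the pause adds no extra time.
        echo "$duration"
    fi
}

elapsed_with_pause 30 20 35   # prints 55 (assumed stop at 20s)
elapsed_with_pause 30 5 10    # prints 30 (assumed stop at 5s)
```

Under these assumptions the total is never simply the sleep duration plus the pause length, so a 55s result for a 30s sleep paused for 35s would fit if the stop signal arrived about 20s in. The model says nothing about the "restart" seen with cr_checkpoint.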
    
    The restart of this "counter" seems to be a completely different
    problem. Could you help us locate the problem, or explain whether
    it is a feature?
    
    Thank you!
    
    Cheers,
    Ivan Marton
    
    On Mar 11, 2010, at 5:53 PM, Paul H. Hargrove wrote:
    
    > Gábor,
    >
    > To help determine if this is a BLCR-specific problem, or a
    > signal-handling issue in your /bin/sleep (or kernel), please try
    > $ kill -TSTP <PID> ; sleep 15; kill -CONT <PID>
    > instead of running
    > $ cr_checkpoint <PID>
    > and report the running time for the sleep command.
    > You could also repeat this TSTP/CONT experiment running /bin/sleep  
    > without cr_run.
    >
    > At least for me (CentOS 5.4 w/ 2.6.18-164.11.1.el5 kernel), I find  
    > that the TSTP/CONT experiment causes "time /bin/sleep" to report  
    > "extra" time without any involvement from BLCR.  Here is an example  
    > w/ "/bin/sleep 30" and a "sleep 35" between the two signals.   
    > Strangely, I get 55s (which is not 30+35):
    >> $ time /bin/sleep 30 &
    >> [1] 3538
    >> $ ps aux|grep sleep
    >> phargrov  3539  0.0  0.1  58920   516 pts/0    S    13:07   0:00 /bin/sleep 30
    >> phargrov  3541  0.0  0.1  61180   736 pts/0    S+   13:07   0:00 grep sleep
    >> $ kill -TSTP 3539
    >> $ sleep 35; kill -CONT 3539
    >> $
    >> real    0m55.538s
    >> user    0m0.000s
    >> sys     0m0.002s
    >
    > -Paul
    >
    > Gábor Rőczei wrote:
    >> Dear BLCR developers,
    >>
    >> I work for NIIF, a company in Hungary (http://www.niif.hu/en).
    >> One of our main areas is grid computing, and we operate a
    >> country-wide grid infrastructure. The PCs are provided by
    >> participating Hungarian institutes, such as high schools,
    >> universities, and public libraries. Every contributor uses their
    >> PCs for their own purposes (educational or office work) during
    >> official working hours, and offers them for high-throughput
    >> computation whenever they are not otherwise in use, i.e. during
    >> nights and unoccupied weekends. Combining "day-shift" (i.e.
    >> individual mode) and "night-shift" (i.e. grid mode) lets us use
    >> CPU cycles that would otherwise be lost to provide a firm
    >> computational infrastructure to the national research community
    >> (more information about our grid: http://www.clustergrid.hu/).
    >> The PCs run Linux in "grid mode" and Windows in "daytime mode".
    >> When a PC switches from Linux to Windows, the jobs are
    >> checkpointed, and when it switches from Windows back to Linux,
    >> the jobs are restarted. This is why we need checkpointing.
    >>
    >> We currently use Condor and its checkpointing library, but we
    >> had some problems with it and have decided to switch to Sun Grid
    >> Engine and BLCR soon. I read that Sun Grid Engine can be
    >> configured with BLCR:
    >>
    >> http://gridengine.sunsource.net/project/gridengine/howto/howto.html
    >>
    >> Section: Checkpointing under Linux with Berkeley Lab
    >> Checkpoint/Restart
    >>
    >> http://gridengine.sunsource.net/project/gridengine/howto/APSTC-TB-2004-005.pdf
    >>
    >> We found a problem with sleep. Here is the description:
    >>
    >> If I do not checkpoint the sleep process:
    >>
    >> roczei@knowarc2:~$ time cr_run /bin/sleep 10
    >>
    >> real    0m10.126s
    >> user    0m0.004s
    >> sys    0m0.012s
    >>
    >> If I checkpoint the sleep process:
    >>
    >> roczei@knowarc2:~$ time cr_run /bin/sleep 10
    >>
    >> real    0m20.404s
    >> user    0m0.008s
    >> sys    0m0.008s
    >> roczei@knowarc2:~$
    >>
    >> Other terminal:
    >>
    >> roczei@knowarc2:~$ ps aux | grep sleep
    >> roczei   17113  2.6  0.3   3048   544 pts/0    S+   09:39   0:00 /bin/sleep 10
    >> roczei   17115  0.0  0.4   3120   724 pts/1    S+   09:39   0:00 grep sleep
    >> roczei@knowarc2:~$ cr_checkpoint 17113
    >>
    >> So if I send a checkpoint signal to 17113, the running sleep
    >> process seems to "restart". Why do you think this happens? The
    >> error occurs both with Sun Grid Engine and without it.
    >>
    >> Best regards,
    >>
    >>       Gabor Roczei
    >>
    >>
    >
    >
    > -- 
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group                 Tel: +1-510-495-2352
    > HPC Research Department                   Fax: +1-510-486-6900
    > Lawrence Berkeley National Laboratory
    
    
    

