Re: BLCR: sleep process checkpointing problem

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Mar 11 2010 - 08:53:16 PST

  • Next message: fengguang tian: "question about implement checkpoint into MPI program"
    Gábor,
    
    To help determine if this is a BLCR-specific problem, or a 
    signal-handling issue in your /bin/sleep (or kernel), please try
      $ kill -TSTP <PID> ; sleep 15; kill -CONT <PID>
    instead of running
      $ cr_checkpoint <PID>
    and report the running time for the sleep command.
    You could also repeat this TSTP/CONT experiment running /bin/sleep 
    without cr_run.
    
    At least for me (CentOS 5.4 w/ 2.6.18-164.11.1.el5 kernel), I find that 
    the TSTP/CONT experiment causes "time /bin/sleep" to report "extra" time 
    without any involvement from BLCR.  Here is an example w/ "/bin/sleep 
    30" and a "sleep 35" between the two signals.  Strangely, I get 55s 
    (which is not 30+35):
     
    > $ time /bin/sleep 30 &
    > [1] 3538
    > $ ps aux|grep sleep
    > phargrov  3539  0.0  0.1  58920   516 pts/0    S    13:07   0:00 /bin/sleep 30
    > phargrov  3541  0.0  0.1  61180   736 pts/0    S+   13:07   0:00 grep sleep
    > $ kill -TSTP 3539
    > $ sleep 35; kill -CONT 3539
    > $
    > real    0m55.538s
    > user    0m0.000s
    > sys     0m0.002s
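
The manual steps above can also be scripted so the experiment is easy to repeat with different durations. The following is only a minimal sketch: the function name tstp_cont_experiment and the SLEEP_BIN variable are made up for illustration and are not part of BLCR, and the 1-second delay before the first signal is an arbitrary choice. It reports elapsed wall-clock time in whole seconds rather than using "time".

```shell
#!/bin/sh
# Sketch of the TSTP/CONT timing experiment described above.
# tstp_cont_experiment and SLEEP_BIN are illustrative names (assumptions),
# not anything provided by BLCR.

SLEEP_BIN=${SLEEP_BIN:-/bin/sleep}

# Usage: tstp_cont_experiment SLEEP_SECS PAUSE_SECS
# Prints the whole wall-clock seconds from launching the sleeper to its exit.
tstp_cont_experiment() {
    sleep_secs=$1
    pause_secs=$2

    start=$(date +%s)
    "$SLEEP_BIN" "$sleep_secs" &   # the process under test
    pid=$!

    sleep 1                        # give the sleeper a moment to start
    kill -TSTP "$pid"              # stop it (takes the place of cr_checkpoint)
    sleep "$pause_secs"            # leave it stopped for a while
    kill -CONT "$pid"              # let it resume

    wait "$pid"                    # block until the sleeper exits
    end=$(date +%s)
    echo $((end - start))
}
```

Calling "tstp_cont_experiment 30 35" roughly corresponds to the transcript above, except that the first signal is sent after a fixed 1-second delay. Comparing the printed value with the requested sleep length shows whether the stop/continue pair alone adds "extra" time on a given system.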
    
    -Paul
    
    Gábor Rőczei wrote:
    > Dear BLCR developers,
    >
    > I work for NIIF (http://www.niif.hu/en), a company in Hungary. One of 
    > our main areas is grid computing, and we operate a country-wide grid 
    > infrastructure.  The PCs are provided by 
    > participating Hungarian institutes, such as high schools, 
    > universities, or public libraries.  Every contributor uses their PCs 
    > for their own purposes during the official work hours, such as 
    > educational or office-like purposes, and offers the infrastructure for 
    > high-throughput computation whenever they do not use them for any 
    > other purposes, i.e. during the nights and the unoccupied week-ends. 
    > The combined use of "day-shift" (i.e. individual mode) and 
    > "night-shift" (i.e. grid mode) enables us to utilize CPU cycles (which 
    > would have been lost anyway) to provide firm computational 
    > infrastructure to the national research community (more information 
    > about our grid: http://www.clustergrid.hu/). The PCs run Linux in 
    > "grid mode" and Windows in "daytime mode". When a PC switches from 
    > Linux to Windows, its jobs are checkpointed, and when it switches 
    > back from Windows to Linux, the jobs are restarted. This is why we 
    > need checkpointing.
    >
    > We currently use Condor and its checkpointing library, but there were 
    > some problems with it, so we have decided to switch to Sun Grid Engine 
    > and BLCR soon. I read that Sun Grid Engine can be configured with BLCR:
    >
    > http://gridengine.sunsource.net/project/gridengine/howto/howto.html
    >
    > Section: Checkpointing under Linux with Berkeley Lab Checkpoint/Restart
    >
    > http://gridengine.sunsource.net/project/gridengine/howto/APSTC-TB-2004-005.pdf 
    >
    >
    > We found a problem with sleep. Here is the description:
    >
    > If I do not checkpoint the sleep process:
    >
    > roczei@knowarc2:~$ time cr_run /bin/sleep 10
    >
    > real    0m10.126s
    > user    0m0.004s
    > sys    0m0.012s
    >
    > If I checkpoint the sleep process:
    >
    > roczei@knowarc2:~$ time cr_run /bin/sleep 10
    >
    > real    0m20.404s
    > user    0m0.008s
    > sys    0m0.008s
    > roczei@knowarc2:~$
    >
    > Other terminal:
    >
    > roczei@knowarc2:~$ ps aux | grep sleep
    > roczei   17113  2.6  0.3   3048   544 pts/0    S+   09:39   0:00 /bin/sleep 10
    > roczei   17115  0.0  0.4   3120   724 pts/1    S+   09:39   0:00 grep sleep
    > roczei@knowarc2:~$ cr_checkpoint 17113
    >
    > So if I send a checkpoint signal to 17113, the running sleep process 
    > appears to "restart". Why do you think this happens? The problem 
    > occurs both with Sun Grid Engine and without it.
    >
    > Best regards,
    >
    >        Gabor Roczei
    >
    >
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
    
