From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Mar 11 2010 - 08:53:16 PST
Gábor,

To help determine if this is a BLCR-specific problem, or a
signal-handling issue in your /bin/sleep (or kernel), please try
    $ kill -TSTP <PID> ; sleep 15; kill -CONT <PID>
instead of running
    $ cr_checkpoint <PID>
and report the running time for the sleep command. You could also
repeat this TSTP/CONT experiment running /bin/sleep without cr_run.

At least for me (CentOS 5.4 w/ 2.6.18-164.11.1.el5 kernel), I find that
the TSTP/CONT experiment causes "time /bin/sleep" to report "extra" time
without any involvement from BLCR. Here is an example w/ "/bin/sleep 30"
and a "sleep 35" between the two signals. Strangely, I get 55s (which
is not 30+35):

> $ time /bin/sleep 30 &
> [1] 3538
> $ ps aux|grep sleep
> phargrov  3539  0.0  0.1  58920   516 pts/0    S    13:07   0:00 /bin/sleep 30
> phargrov  3541  0.0  0.1  61180   736 pts/0    S+   13:07   0:00 grep sleep
> $ kill -TSTP 3539
> $ sleep 35; kill -CONT 3539
> $
> real    0m55.538s
> user    0m0.000s
> sys     0m0.002s

-Paul

Gábor Rőczei wrote:
> Dear BLCR developers,
>
> I am working for a company in Hungary, its name is NIIF
> (http://www.niif.hu/en). One of our main areas is grid computing and
> we have a country-sized grid infrastructure. The PCs are provided by
> participating Hungarian institutes, such as high schools,
> universities, or public libraries. Every contributor uses their PCs
> for their own purposes during the official work hours, such as
> educational or office-like purposes, and offers the infrastructure for
> high-throughput computation whenever they do not use them for any
> other purposes, i.e. during the nights and the unoccupied week-ends.
> The combined use of "day-shift" (i.e. individual mode) and
> "night-shift" (i.e. grid mode) enables us to utilize CPU cycles (which
> would have been lost anyway) to provide firm computational
> infrastructure to the national research community (more information
> about our grid: http://www.clustergrid.hu/).
> The PCs run Linux in "grid mode" and Windows in "daytime mode". When
> a PC switches from Linux to Windows the jobs are checkpointed, and
> when it switches from Windows to Linux the jobs are restarted. This
> is why we need checkpointing.
>
> Currently we are using Condor and its checkpointing library, but
> there were some problems with it and we decided that we will switch
> to Sun Grid Engine and BLCR soon. I read that Sun Grid Engine can be
> configured with BLCR:
>
> http://gridengine.sunsource.net/project/gridengine/howto/howto.html
>
> Section: Checkpointing under Linux with Berkeley Lab Checkpoint/Restart
>
> http://gridengine.sunsource.net/project/gridengine/howto/APSTC-TB-2004-005.pdf
>
>
> We found a sleep problem. Here is the description:
>
> If I am not checkpointing the sleep process:
>
> roczei@knowarc2:~$ time cr_run /bin/sleep 10
>
> real    0m10.126s
> user    0m0.004s
> sys     0m0.012s
>
> If I am checkpointing the sleep process:
>
> roczei@knowarc2:~$ time cr_run /bin/sleep 10
>
> real    0m20.404s
> user    0m0.008s
> sys     0m0.008s
> roczei@knowarc2:~$
>
> Other terminal:
>
> roczei@knowarc2:~$ ps aux | grep sleep
> roczei   17113  2.6  0.3   3048   544 pts/0    S+   09:39   0:00 /bin/sleep 10
> roczei   17115  0.0  0.4   3120   724 pts/1    S+   09:39   0:00 grep sleep
> roczei@knowarc2:~$ cr_checkpoint 17113
>
> So if I send a checkpoint signal to 17113 then the running sleep
> process will "restart". Why do you think this happens? This error
> happens both with Sun Grid Engine and without SGE.
>
> Best regards,
>
> Gabor Roczei
>
>

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
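
For anyone who wants to repeat the TSTP/CONT experiment from the reply above without typing the commands by hand, it could be scripted roughly as follows. This is only a sketch, not anything shipped with BLCR: the function name tstp_cont_experiment and its three arguments (sleep length, delay before the stop, length of the stop) are made up for illustration, and the elapsed time is taken from "date +%s", so it is only accurate to a whole second.

```shell
#!/bin/bash
# Hypothetical helper (illustration only, not part of BLCR).
# Usage: tstp_cont_experiment SLEEP_SECS DELAY_SECS PAUSE_SECS
# Starts /bin/sleep, stops it with SIGTSTP after DELAY_SECS, resumes it
# with SIGCONT after a further PAUSE_SECS, and prints the wall-clock
# seconds that elapsed until /bin/sleep exited.
tstp_cont_experiment() {
    local sleep_secs=$1 delay_secs=$2 pause_secs=$3
    local start end pid

    start=$(date +%s)
    /bin/sleep "$sleep_secs" &    # the process under test
    pid=$!

    sleep "$delay_secs"           # let it run for a while first
    kill -TSTP "$pid"             # stands in for cr_checkpoint <PID>
    sleep "$pause_secs"           # keep it stopped for this long
    kill -CONT "$pid"             # resume it

    wait "$pid"                   # block until /bin/sleep has exited
    end=$(date +%s)
    echo $((end - start))
}
```

Calling, for example, "tstp_cont_experiment 30 20 35" would roughly mirror the session quoted above; the 20-second delay is a guess, since the original transcript does not record how long the sleep had been running when the TSTP was sent. Run it from an interactive terminal: a process in an orphaned process group does not stop on SIGTSTP, in which case the number printed would simply be the sleep length.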