From: Gábor Rőczei (roczei_at_niif.hu)
Date: Thu Mar 11 2010 - 00:58:16 PST
Dear BLCR developers,

I work for NIIF (http://www.niif.hu/en), a company in Hungary. One of our main areas is grid computing, and we operate a country-wide grid infrastructure. The PCs are provided by participating Hungarian institutes such as high schools, universities, and public libraries. Every contributor uses its PCs for its own purposes (education, office work) during official working hours, and offers them for high-throughput computation whenever they are otherwise unused, i.e. at night and on unoccupied weekends. Combining the "day shift" (individual mode) and the "night shift" (grid mode) lets us use CPU cycles that would otherwise be lost to provide a solid computational infrastructure to the national research community (more information about our grid: http://www.clustergrid.hu/).

The PCs run Linux in "grid mode" and Windows in "daytime mode". When a PC switches from Linux to Windows, the jobs are checkpointed; when it switches back from Windows to Linux, the jobs are restarted. This is why we need checkpointing.

We currently use Condor and its checkpointing library, but we had some problems with it and have decided to switch to Sun Grid Engine and BLCR soon. I read that Sun Grid Engine can be configured with BLCR:

http://gridengine.sunsource.net/project/gridengine/howto/howto.html
Section: Checkpointing under Linux with Berkeley Lab Checkpoint/Restart
http://gridengine.sunsource.net/project/gridengine/howto/APSTC-TB-2004-005.pdf

We found a problem with sleep.
Here is the description.

If I do not checkpoint the sleep process:

roczei@knowarc2:~$ time cr_run /bin/sleep 10

real 0m10.126s
user 0m0.004s
sys 0m0.012s

If I checkpoint the sleep process:

roczei@knowarc2:~$ time cr_run /bin/sleep 10

real 0m20.404s
user 0m0.008s
sys 0m0.008s
roczei@knowarc2:~$

In the other terminal:

roczei@knowarc2:~$ ps aux | grep sleep
roczei 17113 2.6 0.3 3048 544 pts/0 S+ 09:39 0:00 /bin/sleep 10
roczei 17115 0.0 0.4 3120 724 pts/1 S+ 09:39 0:00 grep sleep
roczei@knowarc2:~$ cr_checkpoint 17113

So if I send a checkpoint signal to 17113, the running sleep process seems to "restart" its wait from the beginning. Why do you think this happens? The problem occurs both with and without Sun Grid Engine.

Best regards,
Gabor Roczei