BLCR: sleep process checkpointing problem

From: Gábor Rőczei (roczei_at_niif.hu)
Date: Thu Mar 11 2010 - 00:58:16 PST

  • Next message: Paul H. Hargrove: "Re: BLCR: sleep process checkpointing problem"
    Dear BLCR developers,
    
    I work for NIIF (http://www.niif.hu/en), a company in Hungary. One of
    our main areas is grid computing, and we operate a country-wide grid
    infrastructure. The PCs are provided by participating Hungarian
    institutes, such as high schools, universities, and public libraries.
    Every contributor uses its PCs for its own purposes (for example
    educational or office work) during official working hours, and
    offers them for high-throughput computation whenever they are not
    otherwise in use, i.e. at night and on unoccupied weekends. The
    combination of "day shift" (individual mode) and "night shift" (grid
    mode) lets us use CPU cycles that would otherwise be lost to provide a
    firm computational infrastructure to the national research community
    (more information about our grid: http://www.clustergrid.hu/). The PCs
    run Linux in "grid mode" and Windows in "daytime mode". When a PC
    switches from Linux to Windows, the jobs are checkpointed, and when it
    switches back from Windows to Linux, the jobs are restarted. This is
    why we need checkpointing.
    
    At present we use Condor and its checkpointing library, but we had
    some problems with it, so we have decided to switch to Sun Grid Engine
    and BLCR soon. I read that Sun Grid Engine can be configured with BLCR:
    
    http://gridengine.sunsource.net/project/gridengine/howto/howto.html
    
    Section: Checkpointing under Linux with Berkeley Lab Checkpoint/Restart
    
    http://gridengine.sunsource.net/project/gridengine/howto/APSTC-TB-2004-005.pdf
    
    We found a problem with checkpointing a sleep process. Here is the description:
    
    If I do not checkpoint the sleep process:
    
    roczei@knowarc2:~$ time cr_run /bin/sleep 10
    
    real	0m10.126s
    user	0m0.004s
    sys	0m0.012s
    
    If I checkpoint the sleep process:
    
    roczei@knowarc2:~$ time cr_run /bin/sleep 10
    
    real	0m20.404s
    user	0m0.008s
    sys	0m0.008s
    roczei@knowarc2:~$
    
    In the other terminal:
    
    roczei@knowarc2:~$ ps aux | grep sleep
    roczei   17113  2.6  0.3   3048   544 pts/0    S+   09:39   0:00 /bin/sleep 10
    roczei   17115  0.0  0.4   3120   724 pts/1    S+   09:39   0:00 grep sleep
    roczei@knowarc2:~$ cr_checkpoint 17113
    
    So if I send a checkpoint signal to 17113, the running sleep process
    appears to "restart": the total run time is about 20 seconds instead
    of 10. Why do you think this happens? The problem occurs both with
    Sun Grid Engine and without it.
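
The two-terminal experiment above can also be run from a single script, which may make the timing easier to compare. This is only a minimal sketch: the function name measure_sleep and the LAUNCHER and CHECKPOINT variables are hypothetical names introduced here, and it assumes that cr_run and cr_checkpoint are on the PATH and that the launcher replaces itself with the command it starts, so that $! is the PID of the sleep process (as the ps output above suggests).

```shell
#!/bin/sh
# measure_sleep DURATION DELAY
# Starts "/bin/sleep DURATION" under $LAUNCHER (default: cr_run), requests a
# checkpoint with $CHECKPOINT (default: cr_checkpoint) after DELAY seconds,
# then prints the total wall-clock time in whole seconds.
# LAUNCHER and CHECKPOINT are hypothetical override variables for this sketch.
measure_sleep() {
    duration=$1
    delay=$2

    start=$(date +%s)

    # Run sleep in the background so it can be checkpointed from this script.
    "${LAUNCHER:-cr_run}" /bin/sleep "$duration" &
    pid=$!   # assumed to be the PID of sleep itself (launcher execs it)

    # Let the process sleep for a while before requesting the checkpoint.
    sleep "$delay"
    "${CHECKPOINT:-cr_checkpoint}" "$pid"

    # Wait for the sleep process to finish and report the elapsed time.
    wait "$pid"
    end=$(date +%s)
    echo $((end - start))
}
```

For example, "measure_sleep 10 5" starts a 10-second sleep and checkpoints it after 5 seconds. A printed value noticeably larger than 10 would match the behaviour reported above.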
    
    Best regards,
    
            Gabor Roczei
    
    
    
    

