Re: berkeley checkpointing

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Jul 03 2007 - 10:54:50 PDT

  • Next message: Paul H. Hargrove: "Re: Restart failed: Resource temporarily unavailable"
    Jerry Mersel wrote:
    > Hi:
    >
    >   I have a few questions about Berkeley Checkpoint/Restart that weren't
    >   clear to me in the documentation.
    >
    >
    >    Can a process be checkpointed and then restarted on a different node
    >     in a grid? Do the kernels have to be identical across all the
    > different nodes?
    >
    >     I am considering using Berkeley checkpointing with Grid Engine 6.
    >
    >
    >                                      Thank you,
    >                                        Jerry
    >   
    Jerry,
    
      If the nodes are sufficiently identical then restarting on a different 
    node *is* possible.  The kernels *do* need to be identical, as do any 
    shared libraries used by the application(s) to be checkpointed.  See the 
    BLCR FAQ entry on prelinking 
    (http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink) for the most 
    common reason that moving to a different node might fail.
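
One way to check whether two nodes are "sufficiently identical" in the sense above is to compare the kernel version and the checksums of the shared libraries an application loads. The following is only a rough sketch, not part of BLCR: the function names are hypothetical, it assumes a Linux node where `ldd` is available, and it treats a matching kernel release/version string as a stand-in for "identical kernel". Prelinked libraries would typically be expected to show up here as checksum mismatches between nodes.

```python
import hashlib
import os
import platform
import subprocess


def file_sha256(path):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def shared_libraries(executable):
    """List resolved shared library paths that `ldd` reports for an executable."""
    out = subprocess.run(
        ["ldd", executable], capture_output=True, text=True, check=True
    ).stdout
    paths = []
    for line in out.splitlines():
        # Typical line: "libc.so.6 => /lib/libc.so.6 (0x...)"
        if "=>" in line:
            target = line.split("=>", 1)[1].split("(")[0].strip()
            if target and os.path.isfile(target):
                paths.append(target)
    return sorted(set(paths))


def node_fingerprint(executable):
    """Map 'kernel' and each shared library path to an identifying string.

    Assumption: kernel release plus version string is a good enough proxy
    for kernel identity; this is an illustration, not a BLCR requirement.
    """
    fingerprint = {"kernel": platform.release() + " " + platform.version()}
    for lib in shared_libraries(executable):
        fingerprint[lib] = file_sha256(lib)
    return fingerprint


def fingerprint_differences(a, b):
    """Return sorted keys whose values differ or that exist on only one node."""
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))
```

To use it, run `node_fingerprint()` for the same executable on each node, save the resulting dictionaries, and pass two of them to `fingerprint_differences()`. An empty list suggests the nodes match for that application; any listed library path or `kernel` entry is a candidate reason for a restart on the other node to fail.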
    
    -Paul
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
