Re: berkeley checkpoint and matlab

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Jul 09 2007 - 09:11:42 PDT

  • Next message: Paul H. Hargrove: "Re: bugs in blcr"
    Jerry,
    
      I am sorry things are not working for you.  Thank you for your patience.  
    Unless my suggestions #4 or #5 turn up some additional clues, I no 
    longer have any ideas that might help.
      However, I did discover in our e-mail archive 
    (http://www.nersc.gov/hypermail/checkpoint/) a thread on Jan 3, 2006 in 
    which a user *did* have checkpoint/restart of Matlab working (and was, 
    in fact, able to migrate between nodes).
    
      In answer to your other question: no, one does not need to be root to 
    checkpoint one's own processes.
    
    -Paul
    
    Jerry Mersel wrote:
    > Hi:
    >
    >  Thanks for your advice(s) but it didn't seem to help.
    >
    >   I ran matlab like this:
    >
    >      cr_run env LD_PRELOAD=libcr.so.0:libpthread.so.0 matlab -nodisplay -nosplash -nojvm &
    >      then:
    >
    >      cr_checkpoint --tree --kill <process-id>
    >      cr_restart ./context.<process-id>
    >
    >   I still got resource not available.
    >
    >
    >                               Regards,
    >                                 Jerry
    >
    >
    > Paul H. Hargrove wrote:
    >
    >> Jerry,
    >>  I am sorry to hear it is not working as you had hoped.  Unfortunately
    >> there is little I can do to help debug the problem myself, since I have
    >> no machines for which I have both a Matlab license and root access to
    >> install BLCR.  However, I can make a few suggestions for things you
    >> might try:
    >>
    >> 1) Run cr_checkpoint with the --tree option to ensure all of the
    >> children of the main matlab process are checkpointed too.  By default
    >> cr_checkpoint saves only a single process, though it is likely that
    >> --tree will become the default in a future release.
    >> 2) Run matlab with the -nodisplay option.  The connection to the X
    >> server is one resource I can guarantee won't restore correctly.
    >> 3) Try the most recent BLCR snapshot available at
    >> http://mantis.lbl.gov/blc-dist/snapshots, which adds/improves support
    >> for various shared memory and unlinked-tempfile tricks that matlab might
    >> be using.
    >> 4) Check your syslog and/or dmesg output to see if there is some message
    >> from BLCR that may indicate what resource is unavailable.
    >> 5) Finally, if you configure BLCR with --enable-debug then it will
    >> generate additional debug messages from the kernel (to syslog and/or
    >> dmesg) that may indicate the origin of the "resource unavailable" if you
    >> didn't find anything there in #4.
    >>
    >> If #4 or #5 turn up any log messages that look related, please send them
    >> to me and I'll try to make sense of them for you.
    >>
    >> I am afraid, however, that matlab may have an open socket to a license
    >> server.  Since BLCR doesn't restore sockets, it is possible that this
    >> could be the problem and there would be no easy way to resolve it.
    >>
    >>
    >> -Paul
    >>
    >>
    >> Jerry Mersel wrote:
    >>  
    >>
    >>> Hi Paul:
    >>>
    >>>  I tried matlab with BLCR and I got the error resources not available
    >>> when I wanted to restart
    >>> matlab. I thought it had something to do with the PIDs but I checked
    >>> and all the PIDs that matlab
    >>> used were free. So I'm not sure what was causing the problem.
    >>>
    >>> Regards,
    >>>     Jerry
    >>>
    >>> Paul H. Hargrove wrote:
    >>>
    >>>   
    >>>> Jerry Mersel wrote:
    >>>>
    >>>>     
    >>>>> Does berkeley checkpoint/restart work with matlab?
    >>>>>       
    >>>> Jerry,
    >>>>
    >>>> I am not aware of any reports (positive or negative) of BLCR used with
    >>>> Matlab, and am not in a position to make the tests myself.
    >>>> If you are able to try checkpoint/restart of Matlab, I'd appreciate
    >>>> hearing about your results, either success or failure.
    >>>>
    >>>> -Paul
    >>>>
    >>>>     
    >>
    >>
    >>  
    >>
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
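
For reference, the checkpoint/restart sequence discussed in this thread can be sketched as a small shell helper. This is only a sketch under stated assumptions: the BLCR tools `cr_checkpoint` and `cr_restart` are on the PATH, and the context file is named `./context.<pid>` in the current directory, as in Jerry's commands. The function names `run`, `checkpoint_tree`, and `restart_or_diagnose` and the `DRY_RUN` variable are hypothetical and not part of BLCR. By default the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch of the checkpoint/restart sequence from this thread.
# DRY_RUN, run, checkpoint_tree and restart_or_diagnose are illustrative
# names, not part of BLCR. With DRY_RUN=1 (the default) nothing is executed.

DRY_RUN="${DRY_RUN:-1}"

# Print the command; execute it only when DRY_RUN is not 1.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

# Checkpoint a process together with its children (--tree), then kill them.
checkpoint_tree() {
    run cr_checkpoint --tree --kill "$1"
}

# Restart from the context file. If the restart fails, show the most recent
# kernel messages (suggestion #4 in the thread), where BLCR may have logged
# which resource was unavailable.
restart_or_diagnose() {
    if ! run cr_restart "./context.$1"; then
        echo "restart failed; last kernel messages:" >&2
        dmesg | tail -n 20 >&2
        return 1
    fi
}
```

Calling `checkpoint_tree 1234` and then `restart_or_diagnose 1234` prints the two commands for process 1234. Setting `DRY_RUN=0` in the environment runs them for real.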
    
