From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Jul 09 2007 - 09:11:42 PDT
Jerry,
  I am sorry things are not working for you. Thank you for your patience.
Unless my suggestions #4 or #5 turn up some additional clues, I no longer
have any ideas that might help. However, I did discover in our e-mail
archive (http://www.nersc.gov/hypermail/checkpoint/) a thread on Jan 3, 2006
in which a user *did* have checkpoint/restart of Matlab working (and was, in
fact, able to migrate between nodes).

In answer to your other question: NO, one does not need to be root to
checkpoint one's own processes.

-Paul

Jerry Mersel wrote:
> Hi:
>
> Thanks for your advice(s) but it didn't seem to help.
>
> I ran matlab like this:
>
> cr_run env LD_PRELOAD=libcr.so.0:libpthread.so.0 matlab -nodisplay -nosplash -nojvm&
>
> then:
>
> cr_checkpoint --tree --kill <process-id>
> cr_restart ./context.<process-id>
>
> I still got resource not available.
>
>
> Regards,
> Jerry
>
>
> Paul H. Hargrove wrote:
>
>> Jerry,
>> I am sorry to hear it is not working as you had hoped. Unfortunately
>> there is little I can do to help debug the problem myself, since I have
>> no machines for which I have both a Matlab license and root access to
>> install BLCR. However, I can make a few suggestions for things you
>> might try:
>>
>> 1) Run cr_checkpoint with the --tree option to ensure all of the
>> children of the main matlab process are checkpointed too. By default
>> cr_checkpoint saves only a single process, though it is likely that
>> --tree will become the default in a future release.
>> 2) Run matlab with the -nodisplay option. The connection to the X
>> server is one resource I can guarantee won't restore correctly.
>> 3) Try the most recent BLCR snapshot available at
>> http://mantis.lbl.gov/blc-dist/snapshots, which adds/improves support
>> for various shared memory and unlinked-tempfile tricks that matlab might
>> be using.
>> 4) Check your syslog and/or dmesg output to see if there is some message
>> from BLCR that may indicate what resource is unavailable.
>> 5) Finally, if you configure BLCR with --enable-debug then it will
>> generate additional debug messages from the kernel (to syslog and/or
>> dmesg) that may indicate the origin of the "resource unavailable" if you
>> didn't find anything there in #4.
>>
>> If #4 or #5 turn up any log messages that look related, please send them
>> to me and I'll try to make sense of them for you.
>>
>> I am afraid, however, that matlab may have an open socket to a license
>> server. Since BLCR doesn't restore sockets, it is possible that this
>> could be the problem and there would be no easy way to resolve it.
>>
>>
>> -Paul
>>
>>
>> Jerry Mersel wrote:
>>
>>
>>> Hi Paul:
>>>
>>> I tried matlab with BLCR and I got the error resources not available
>>> when I wanted to restart
>>> matlab. I thought it had something to do with the PIDs but I checked
>>> and all the PIDs that matlab
>>> used were free. So I'm not sure what was causing the problem.
>>>
>>> Regards,
>>> Jerry
>>>
>>> Paul H. Hargrove wrote:
>>>
>>>
>>>> Jerry Mersel wrote:
>>>>
>>>>
>>>>> Does Berkeley checkpoint/restart work with matlab?
>>>>>
>>>> Jerry,
>>>>
>>>> I am not aware of any reports (positive or negative) of BLCR used with
>>>> Matlab, and am not in a position to make the tests myself.
>>>> If you are able to try checkpoint/restart of Matlab, I'd appreciate
>>>> hearing about your results, either success or failure.
>>>>
>>>> -Paul
>>>>
>>>>
>>
>>

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900