Re: berkeley checkpointing

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Jan 14 2008 - 13:12:20 PST

  • Next message: Paul H. Hargrove: "Re: How to solve this problem?"
    Jerry,
    
      I am afraid I don't know how to help you any further.  The fact that 
    the relocation error doesn't prevent checkpoint/restart on the command 
    line suggests to me that it is not a real problem, just an annoyance.  
    The failure to restart with SGE, however, is a real problem that I don't 
    have any solution for.
      As I stated before, the log messages you sent show a failure to open 
    /dev/tty, which indicates to me that the job had a controlling tty (CTTY) 
    at checkpoint time but does not have one at restart time.  Not knowing 
    details of SGE, I don't know how that could be happening unless MATLAB 
    has gone to some special trouble to acquire a CTTY that was not opened 
    by SGE.  Since nohup didn't work, I don't have any further ideas on how 
    to avoid having a CTTY at checkpoint time, and I also have no 
    suggestions as to how to create/acquire one at restart time.
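
      For reference, the CTTY condition described above can be probed 
    directly: opening /dev/tty succeeds only when the calling process has 
    a controlling terminal.  The following is a minimal sketch, assuming 
    Python 3 is available on the node.  The function names are 
    hypothetical and are not part of BLCR or SGE.  The second helper runs 
    the same probe in a fresh session (after setsid()), which is one way 
    a process ends up without a CTTY.

```python
import os
import subprocess
import sys


def has_controlling_tty() -> bool:
    """Return True if this process can open /dev/tty, i.e. it has a CTTY."""
    try:
        fd = os.open("/dev/tty", os.O_RDWR | os.O_NOCTTY)
    except OSError:
        # Typically ENXIO when the process has no controlling terminal.
        return False
    os.close(fd)
    return True


# The same probe as a standalone script, so it can run in a child process.
_PROBE = (
    "import os, sys\n"
    "try:\n"
    "    os.close(os.open('/dev/tty', os.O_RDWR | os.O_NOCTTY))\n"
    "except OSError:\n"
    "    sys.exit(1)\n"
    "sys.exit(0)\n"
)


def has_ctty_in_new_session() -> bool:
    """Run the probe in a child that calls setsid() before exec.

    A process that has just started a new session has no controlling
    terminal, so the probe is expected to fail there.
    """
    proc = subprocess.run(
        [sys.executable, "-c", _PROBE],
        stdin=subprocess.DEVNULL,
        start_new_session=True,
    )
    return proc.returncode == 0


if __name__ == "__main__":
    print("this process has a CTTY:", has_controlling_tty())
    print("child in a new session has a CTTY:", has_ctty_in_new_session())
```

    Running the first check inside the job script just before the 
    checkpoint, and again in the environment used for the restart, would 
    show whether the two actually differ in the way the log suggests.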
      If you have any ideas you would like my help to pursue, let me know 
    and I'll try to help.  However, at this point I have nothing left to 
    suggest to you.
    
    -Paul
    
    Jerry Mersel wrote:
    > I checked and am getting the same relocation error from the command line,
    > but MATLAB and checkpointing and restarting are working.
    >
    > I don't get any relocation errors when using
    > "cr_run perl </dev/null" or "cr_run cat </dev/null" (neither from SGE
    > nor from the command line).
    >   
    [snip]
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
