Re: berkeley checkpointing

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Jan 03 2008 - 11:10:21 PST

      I am not sure what is happening here, but it looks like the non-tls 
    libc may be getting loaded along with the tls libpthread.  If you look 
    at the ldd output I requested, you will see /lib64/tls/ and 
    /lib64/tls/.  In that case both are the matching tls 
    versions.  The relocation error message, however, says /lib64/ 
    and /lib64/tls/ showing the non-tls libc.  I am guessing 
    that this is the reason for the relocation error.  However, I don't know 
    why the mismatched libraries are getting loaded.
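    As a rough illustration of the mismatch described above, the Python
    sketch below parses ldd-style output and reports whether libc and
    libpthread were resolved from different kinds of directory (tls versus
    non-tls).  The library file names, the sample text, and the helper
    names resolved_dirs and is_mismatched are assumptions made for this
    example; they are not taken from the output quoted in this thread.

```python
import re

# Matches ldd lines of the form "NAME => /some/dir/FILE (0xADDRESS)".
_LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(/\S+)\s+\(0x[0-9a-fA-F]+\)")


def resolved_dirs(ldd_output, names=("libc.so", "libpthread.so")):
    """Map each library of interest to the directory it was resolved from."""
    dirs = {}
    for line in ldd_output.splitlines():
        match = _LDD_LINE.match(line)
        if not match:
            continue
        soname, path = match.groups()
        if soname.startswith(names):
            dirs[soname] = path.rsplit("/", 1)[0]
    return dirs


def is_mismatched(dirs):
    """True when some libraries come from a tls directory and others do not."""
    kinds = {directory.endswith("/tls") for directory in dirs.values()}
    return len(kinds) > 1


# Hypothetical ldd output showing a tls libpthread next to a non-tls libc.
SAMPLE = """\
        libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000002a00000000)
        libc.so.6 => /lib64/libc.so.6 (0x0000002a00100000)
"""

print(is_mismatched(resolved_dirs(SAMPLE)))
```

    Feeding it the real ldd output captured from the command line and from
    an SGE job would show whether the two environments resolve the pair
    differently.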
    I can think of two parties that might be causing the mismatch, but am 
    uncertain why either one would be a factor only when using SGE and not 
    from the command line.
    1) Perhaps BLCR is somehow getting the non-tls libc linked.  To test 
    that, try "cr_run perl </dev/null" and "cr_run cat </dev/null".  I pick 
    those two programs because they should both be present on most systems 
    and one uses pthreads and the other does not.  If either of those two 
    commands yields the relocation error (running via SGE), then I can start 
    looking at how libcr gets linked for clues as to what is different on 
    your system from others.
    2) Perhaps Matlab is linked oddly.  If you could find the actual binary 
    run by matlab ("matlab" is usually a shell script) and run "ldd 
    full_path_to_MATLAB" that output might tell me something.  For me, 
    MATLAB is in the bin/glnx86/ directory under the Matlab release directory.
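    The first test above can be scripted so that it runs the same way from
    the command line and from an SGE job.  The sketch below is only an
    illustration: the helper name shows_relocation_error is hypothetical,
    and it assumes the dynamic loader reports the failure on stderr with
    the words "relocation error", as in the message quoted in this thread.

```python
import subprocess
import sys


def shows_relocation_error(argv):
    """Run argv with stdin from /dev/null and scan stderr for the error."""
    result = subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
        check=False,
    )
    return "relocation error" in result.stderr


# Intended use, assuming cr_run is on PATH:
#   shows_relocation_error(["cr_run", "perl"])
#   shows_relocation_error(["cr_run", "cat"])
```

    Running both calls inside and outside SGE gives four yes/no answers
    that can be compared directly.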
    I know I said I wanted to try one problem at a time, but I had a thought 
    on the second problem: the failure to reopen "/dev/tty" suggests that 
    the program had a controlling tty at checkpoint time but not at restart. 
    Perhaps you are checkpointing outside of SGE and restarting in SGE?  
    Running "nohup cr_run matlab ..." may remove the controlling tty 
    association (or not if Matlab explicitly opens /dev/tty).
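    To check the controlling-tty theory, a small probe can be run in both
    environments.  The sketch below is an assumption-laden illustration
    (the helper names has_controlling_tty and errno_name are hypothetical):
    it tries to open /dev/tty and also translates the kernel's err=-6 from
    the log into its symbolic name.  On Linux errno 6 is ENXIO, whose
    message is "No such device or address", the same text that appears in
    the "Restart failed" error.

```python
import errno
import os


def has_controlling_tty():
    """Return True if /dev/tty can be opened by this process."""
    try:
        fd = os.open("/dev/tty", os.O_RDWR)
    except OSError:
        # Opening /dev/tty fails (typically with ENXIO) when the process
        # has no controlling terminal.
        return False
    os.close(fd)
    return True


def errno_name(kernel_err):
    """Translate a negative kernel error code such as -6 into its name."""
    return errno.errorcode.get(-kernel_err, "unknown")


print(has_controlling_tty(), errno_name(-6))
```

    If the probe returns True at checkpoint time and False in the restart
    environment, that would be consistent with the failure to reopen
    /dev/tty.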
    Jerry Mersel wrote:
    > Hi Paul:
    >  Both of those commands do create the same relocation error. (I ran 
    > one without cr_run, correct?)
    >  The results from SGE and without are the same.
    > The results:
    > => /lib64/tls/ (0x0000002a9566c000)
    > => /lib64/tls/ (0x0000002a95782000)
    >        /lib64/ (0x0000002a95556000)
    >                              Thanks,
    >                                  Jerry
    > Paul H. Hargrove wrote:
    >> Jerry,
    >>  Let's try to deal with one problem at a time.  First I'd like to 
    >> address the "relocation error" and see if resolving it still leaves 
    >> the second error.
    >>  The purpose of cr_run is to set LD_PRELOAD just as you have done 
    >> manually.  If you could, please tell me if the following two commands 
    >> (executed via SGE) each produce the same relocation error:
    >> ${BLCR_HOME}/bin/cr_run matlab -nojvm -nodisplay -nosplash < $H/test.m
    >> env matlab -nojvm -nodisplay -nosplash < $H/test.m
    >> If you could, also send the output of "env ldd /bin/cat" executed 
    >> both from the command line and via SGE.
    >> -Paul
    >> Jerry Mersel wrote:
    >>> I manage to checkpoint matlab processes  from the command line.
    >>> But when I want to use SGE I get the error:
    >>> /lib64/ relocation error: /lib64/tls/ 
    >>> symbol errno, version GLIBC_PRIVATE not defined in file 
    >>> with link time reference
    >>> Restart failed: No such device or address
    >>> I get the relocation error at startup when using cr_run.
    >>> I get the "Restart failed" error when trying to restart.
    >>> I start matlab thus:
    >>> ${BLCR_HOME}/bin/cr_run env matlab -nojvm -nodisplay -nosplash < $H/test.m
    >>> and try to restart thus:
    >>> ${BLCR_HOME}/bin/cr_restart $ckptfile
    >>> my log file says this:
    >>> Jan  2 14:24:36 kam02 kernel: Skipping a socket.
    >>> Jan  2 14:24:36 kam02 kernel: Skipping a socket.
    >>> Jan  2 14:26:03 kam02 kernel: Failed to open chrdev major=5 minor=0 path='/dev/tty')
    >>> Jan  2 14:26:03 kam02 kernel: cr_restore_all_files [28703]:  Unable to restore fd 3 (type=6,err=-6)
    >>> Jan  2 14:26:03 kam02 kernel: cr_rstrt_child [28703]:  Unable to restore files!  (err=-6)
    >>> Perhaps something to do with the socket.
    >>> What do you think?
    >>>                                Regards,
    >>>                                   Jerry
    >>> P.S. I have prelinking turned off.
    >>> Paul H. Hargrove wrote:
    >>>> Jerry Mersel wrote:
    >>>>> Hi:
    >>>>>  I am trying to migrate jobs on a grid after checkpointing.
    >>>>> Must the "prelinking" fix mentioned in the FAQ be applied on both
    >>>>> the checkpointed node and the migrated-to node?
    >>>>>                                     Regards,
    >>>>>                                        Jerry
    >>>> Yes, the prelinking of libraries should be disabled on both the 
    >>>> "checkpointed on" and "migrated to" nodes.
    >>>> I will clarify this in the next FAQ version.
    >>>> -Paul
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
