Re: berkeley checkpointing

From: Jerry Mersel (jerry.mersel_at_weizmann.ac.il)
Date: Sat Jan 05 2008 - 23:43:56 PST

    I checked, and I am getting the same relocation error from the command line,
    but MATLAB itself, checkpointing, and restarting all work.
    
    I don't get any relocation errors when running
    "cr_run perl </dev/null" or "cr_run cat </dev/null" (neither from SGE nor
    from the command line).
    
    
    ldd <full path>/MATLAB yields:
    
    
    -bash-3.00$ ldd ./bin/glnxa64/MATLAB
            libut.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libut.so
    (0x0000002a9566c000)
            libmx.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmx.so
    (0x0000002a9590a000)
            libmwservices.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwservices.so
    (0x0000002a95a56000)
            libmwjmi.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwjmi.so
    (0x0000002a95c52000)
            libmwbridge.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwbridge.so
    (0x0000002a95d8b000)
            libmwmcr.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwmcr.so
    (0x0000002a95eb7000)
            libmwmvalue.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwmvalue.so
    (0x0000002a95fd3000)
            libmwm_dispatcher.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwm_dispatcher.so
    (0x0000002a960e8000)
            libmwm_interpreter.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwm_interpreter.so
    (0x0000002a9628d000)
            libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000002a96771000)
            libstdc++.so.5 =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/../../sys/os/glnxa64/libstdc++.so.5
    (0x0000002a96887000)
            libm.so.6 => /lib64/tls/libm.so.6 (0x0000002a96a64000)
            libgcc_s.so.1 =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/../../sys/os/glnxa64/libgcc_s.so.1
    (0x0000002a96bea000)
            libc.so.6 => /lib64/tls/libc.so.6 (0x0000002a96cf6000)
            librt.so.1 => /lib64/tls/librt.so.1 (0x0000002a96f2a000)
            libicudata.so.32 =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libicudata.so.32
    (0x0000002a97044000)
            libicuuc.so.32 =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libicuuc.so.32
    (0x0000002a97146000)
            libicui18n.so.32 =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libicui18n.so.32
    (0x0000002a9733a000)
            libicuio.so.32 =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libicuio.so.32
    (0x0000002a97555000)
            libMTwister.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libMTwister.so
    (0x0000002a97662000)
            libdl.so.2 => /lib64/libdl.so.2 (0x0000002a97765000)
            libz.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libz.so
    (0x0000002a97868000)
            libmwmpath.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwmpath.so
    (0x0000002a97977000)
            libncurses.so.5 => /usr/lib64/libncurses.so.5 (0x0000002a97a90000)
            libmex.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmex.so
    (0x0000002a97bec000)
            libmwm_parser.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwm_parser.so
    (0x0000002a97cf9000)
            libmwudd.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwudd.so
    (0x0000002a9804e000)
            libmat.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmat.so
    (0x0000002a98238000)
            libmwmcos.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwmcos.so
    (0x0000002a98341000)
            libmwgui.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwgui.so
    (0x0000002a9855d000)
            libmwhg.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwhg.so
    (0x0000002a98736000)
            libuij.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libuij.so
    (0x0000002a98a3d000)
            libmwudd_mi.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwudd_mi.so
    (0x0000002a98b6e000)
            libmwuinone.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwuinone.so
    (0x0000002a98ce5000)
            libmwm_ir.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwm_ir.so
    (0x0000002a98deb000)
            libmwuix.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwuix.so
    (0x0000002a98f91000)
            libmwdatasvcs.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwdatasvcs.so
    (0x0000002a99201000)
            libxerces-c.so.26 =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libxerces-c.so.26
    (0x0000002a99323000)
            libmwmlib.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwmlib.so
    (0x0000002a99799000)
            libmwm_pcodeio.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwm_pcodeio.so
    (0x0000002a99938000)
            libmwm_pcodegen.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwm_pcodegen.so
    (0x0000002a99a4b000)
            /lib64/ld-linux-x86-64.so.2 (0x0000002a95556000)
            libmwir_xfmr.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwir_xfmr.so
    (0x0000002a99b63000)
            libmwhardcopy.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwhardcopy.so
    (0x0000002a99c6f000)
            libmwnumerics.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwnumerics.so
    (0x0000002a99d9e000)
            libXm.so.3 =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/../../sys/os/glnxa64/libXm.so.3
    (0x0000002a9a078000)
            libXext.so.6 => /usr/X11R6/lib64/libXext.so.6 (0x0000002a9a427000)
            libXp.so.6 => /usr/X11R6/lib64/libXp.so.6 (0x0000002a9a538000)
            libXt.so.6 => /usr/X11R6/lib64/libXt.so.6 (0x0000002a9a642000)
            libX11.so.6 => /usr/X11R6/lib64/libX11.so.6 (0x0000002a9a7a4000)
            libmwlapack.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwlapack.so
    (0x0000002a9a99e000)
            libfftw3.so.3 =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libfftw3.so.3
    (0x0000002a9aada000)
            libfftw3f.so.3 =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libfftw3f.so.3
    (0x0000002a9ad02000)
            libmwcolamd.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwcolamd.so
    (0x0000002a9af19000)
            libmwamd.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwamd.so
    (0x0000002a9b01e000)
            libmwumfpackv4.3.so =>
    /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwumfpackv4.3.so
    (0x0000002a9b125000)
            libSM.so.6 => /usr/X11R6/lib64/libSM.so.6 (0x0000002a9b2a3000)
            libICE.so.6 => /usr/X11R6/lib64/libICE.so.6 (0x0000002a9b3ad000)
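
A rough sketch of how output like the above could be checked for the libc/libpthread mismatch discussed in this thread. The function names are made up for illustration, and the only assumption is that the text has the usual "name => path (address)" shape, possibly wrapped across lines by a mail client:

```python
import os
import re

# Matches entries such as "libc.so.6 => /lib64/tls/libc.so.6".
LDD_ENTRY = re.compile(r"(\S+)\s*=>\s*(/\S+)")


def resolved_dirs(ldd_output: str) -> dict:
    """Map each library name to the directory ldd resolved it from."""
    # Collapse all whitespace first so entries wrapped over several lines
    # are matched as a single "name => path" pair.
    flat = " ".join(ldd_output.split())
    return {name: os.path.dirname(path) for name, path in LDD_ENTRY.findall(flat)}


def libc_and_pthread_match(ldd_output: str) -> bool:
    """True only if libc and libpthread are both listed and share a directory."""
    dirs = resolved_dirs(ldd_output)
    libc = dirs.get("libc.so.6")
    pthread = dirs.get("libpthread.so.0")
    return libc is not None and libc == pthread


if __name__ == "__main__":
    sample = """
        libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000002a96771000)
        libc.so.6 => /lib64/tls/libc.so.6 (0x0000002a96cf6000)
    """
    print(libc_and_pthread_match(sample))
```

Feeding it the ldd output captured from the command line and the output captured inside an SGE job would show whether the two environments resolve these libraries differently. Note that ldd only reports what the loader would pick in the environment where ldd itself runs, so it has to be run in both places.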
    
    
    
    again, thanks for your help.
    
                              Regards,
                               Jerry
    
    
    
    
    > Jerry,
    >
    >   I am not sure what is happening here, but it looks like the non-tls
    > libc may be getting loaded along with the tls libpthread.  If you look
    > at the ldd output I requested, you will see /lib64/tls/libc.so.6 and
    > /lib64/tls/libpthread.so.0.  In that case both are the matching tls
    > versions.  The relocation error message, however, says /lib64/libc.so.6
    > and /lib64/tls/libpthread.so.0 showing the non-tls libc.  I am guessing
    > that this is the reason for the relocation error.  However, I don't know
    > why the mismatched libraries are getting loaded.
    >
    > I can think of two parties that might be causing the mismatch, but am
    > uncertain why either one would be a factor only when using SGE and not
    > from the command line.
    >
    > 1) Perhaps blcr is somehow getting the non-tls libc linked.   To test
    > that, try "cr_run perl </dev/null" and "cr_run cat </dev/null".  I pick
    > those two programs because they should both be present on most systems
    > and one uses pthreads and the other does not.  If either of those two
    > commands yields the relocation error (running via SGE), then I can start
    > looking at how libcr gets linked for clues as to what is different on
    > your system from others.
    >
    > 2) Perhaps Matlab is linked oddly.  If you could find the actual binary
    > run by matlab ("matlab" is usually a shell script) and run "ldd
    > full_path_to_MATLAB" that output might tell me something.  For me,
    > MATLAB is in the bin/glnx86/ directory under the Matlab release directory.
    >
    >
    > I know I said I wanted to try one problem at a time, but I had a thought
    > on the second problem: the failure to reopen "/dev/tty" suggests that
    > the program had a controlling tty at checkpoint time but not at restart.
    >   Perhaps you are checkpointing outside of SGE and restarting in SGE?
    > Running "nohup cr_run matlab ..." may remove the controlling tty
    > association (or not if Matlab explicitly opens /dev/tty).
    >
    > -Paul
    >
    > Jerry Mersel wrote:
    >> Hi Paul:
    >>
    >>  Both of those commands do create the same relocation error. (I ran
    >> one without cr_run, correct?)
    >>
    >>  The results from SGE and without are the same.
    >> The results:
    >>
    >> libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000002a9566c000)
    >>        libc.so.6 => /lib64/tls/libc.so.6 (0x0000002a95782000)
    >>        /lib64/ld-linux-x86-64.so.2 (0x0000002a95556000)
    >>
    >>                              Thanks,
    >>                                  Jerry
    >>
    >>
    >>
    >> Paul H. Hargrove wrote:
    >>
    >>> Jerry,
    >>>
    >>>  Let's try to deal with one problem at a time.  First I'd like to
    >>> address the "relocation error" and see if resolving it still leaves
    >>> the second error.
    >>>  The purpose of cr_run is to set LD_PRELOAD just as you have done
    >>> manually.  If you could, please tell me if the following two commands
    >>> (executed via SGE) each produce the same relocation error:
    >>>
    >>> ${BLCR_HOME}/bin/cr_run matlab -nojvm -nodisplay -nosplash < $H/test.m
    >>> env LD_PRELOAD=libcr.so.0:libpthread.so.0 matlab -nojvm -nodisplay
    >>> -nosplash < $H/test.m
    >>>
    >>> If you could, also send the output of "env LD_PRELOAD=libpthread.so.0
    >>> ldd /bin/cat" executed both from the command line and via SGE.
    >>>
    >>> -Paul
    >>>
    >>> Jerry Mersel wrote:
    >>>
    >>>> I managed to checkpoint MATLAB processes from the command line.
    >>>> But when I want to use SGE I get the error:
    >>>> /lib64/libc.so.6: relocation error: /lib64/tls/libpthread.so.0:
    >>>> symbol errno, version GLIBC_PRIVATE not defined in file libc.so.6
    >>>> with link time reference
    >>>> Restart failed: No such device or address
    >>>>
    >>>> I get the relocation error at startup when using cr_run.
    >>>> I get the "Restart failed" error when trying to restart.
    >>>>
    >>>> I start matlab thus:
    >>>> ${BLCR_HOME}/bin/cr_run env LD_PRELOAD=libcr.so.0:libpthread.so.0
    >>>> matlab -nojvm -nodisplay -nosplash < $H/test.m
    >>>>
    >>>> and try to restart thus:
    >>>> ${BLCR_HOME}/bin/cr_restart $ckptfile
    >>>>
    >>>> my log file says this:
    >>>> Jan  2 14:24:36 kam02 kernel: Skipping a socket.
    >>>> Jan  2 14:24:36 kam02 kernel: Skipping a socket.
    >>>> Jan  2 14:26:03 kam02 kernel: Failed to open chrdev major=5 minor=0
    >>>> path='/dev/tty')
    >>>> Jan  2 14:26:03 kam02 kernel: cr_restore_all_files [28703]:  Unable
    >>>> to restore fd 3 (type=6,err=-6)
    >>>> Jan  2 14:26:03 kam02 kernel: cr_rstrt_child [28703]:  Unable to
    >>>> restore files!  (err=-6)
    >>>>
    >>>> Perhaps something to do with the socket.
    >>>> What do you think?
    >>>>
    >>>>                                Regards,
    >>>>                                   Jerry
    >>>>
    >>>> P.S. I have prelinking turned off.
    >>>>
    >>>>
    >>>> Paul H. Hargrove wrote:
    >>>>
    >>>>> Jerry Mersel wrote:
    >>>>>
    >>>>>> Hi:
    >>>>>>
    >>>>>>  I am trying to migrate jobs on a grid after checkpointing.
    >>>>>> Must the "prelinking" fix mentioned in the FAQ be applied on both
    >>>>>> the checkpointed node and the migrated-to node?
    >>>>>>
    >>>>>>                                     Regards,
    >>>>>>                                        Jerry
    >>>>>
    >>>>>
    >>>>> Yes, the prelinking of libraries should be disabled on both the
    >>>>> "checkpointed on" and "migrated to" nodes.
    >>>>> I will clarify this in the next FAQ version.
    >>>>>
    >>>>> -Paul
    >>>>>
    >>>>
    >>>
    >>>
    >>
    >
    >
    > --
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group
    > HPC Research Department                   Tel: +1-510-495-2352
    > Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    >
    >
    >
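
On the /dev/tty point quoted above: the kernel log shows err=-6, which on Linux is ENXIO ("No such device or address"), the same text as the "Restart failed" message. That is also the error open() gives for /dev/tty when the calling process has no controlling terminal. A minimal sketch for checking this in a given environment follows; the helper name is hypothetical and this is not part of BLCR:

```python
import errno
import os
from typing import Optional


def controlling_tty_errno() -> Optional[int]:
    """Return None if /dev/tty can be opened, else the errno of the failure."""
    try:
        fd = os.open("/dev/tty", os.O_RDWR)
    except OSError as exc:
        return exc.errno
    os.close(fd)
    return None


if __name__ == "__main__":
    err = controlling_tty_errno()
    if err is None:
        print("controlling tty present")
    else:
        # errno.errorcode maps the number to its symbolic name, e.g. ENXIO.
        print("cannot open /dev/tty:", errno.errorcode.get(err, err), os.strerror(err))
```

Running it once from an interactive shell and once inside an SGE job would show whether the two environments differ in having a controlling tty, which is the condition suspected of breaking the restart.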
    
