Re: berkeley checkpointing

From: Jerry Mersel (jerry.mersel_at_weizmann.ac.il)
Date: Sat Jan 05 2008 - 22:40:03 PST

  • Next message: Jerry Mersel: "Re: berkeley checkpointing"
    Thank you Paul,
    
    nohup didn't help. I'll continue to check with your recommendations.
    
    > Jerry,
    >
    >   I am not sure what is happening here, but it looks like the non-tls
    > libc may be getting loaded along with the tls libpthread.  If you look
    > at the ldd output I requested, you will see /lib64/tls/libc.so.6 and
    > /lib64/tls/libpthread.so.0.  In that case both are the matching tls
    > versions.  The relocation error message, however, says /lib64/libc.so.6
    > and /lib64/tls/libpthread.so.0 showing the non-tls libc.  I am guessing
    > that this is the reason for the relocation error.  However, I don't know
    > why the mismatched libraries are getting loaded.
    >
    > I can think of two parties that might be causing the mismatch, but am
    > uncertain why either one would be a factor only when using SGE and not
    > from the command line.
    >
    > 1) Perhaps blcr is somehow getting the non-tls libc linked.   To test
    > that, try "cr_run perl </dev/null" and "cr_run cat </dev/null".  I pick
    > those two programs because they should both be present on most systems
    > and one uses pthreads and the other does not.  If either of those two
    > commands yields the relocation error (running via SGE), then I can start
    > looking at how libcr gets linked for clues as to what is different on
    > your system from others.
    >
    > 2) Perhaps Matlab is linked oddly.  If you could find the actual binary
    > run by matlab ("matlab" is usually a shell script) and run "ldd
    > full_path_to_MATLAB" that output might tell me something.  For me,
    > MATLAB is in the bin/glnx86/ directory under the Matlab release directory.
    >
    >
    > I know I said I wanted to try one problem at a time, but I had a thought
    > on the second problem: the failure to reopen "/dev/tty" suggests that
    > the program had a controlling tty at checkpoint time but not at restart.
    > Perhaps you are checkpointing outside of SGE and restarting in SGE?
    > Running "nohup cr_run matlab ..." may remove the controlling tty
    > association (or not if Matlab explicitly opens /dev/tty).
    >
    > -Paul
    >
    > Jerry Mersel wrote:
    >> Hi Paul:
    >>
    >>  Both of those commands do create the same relocation error. (I ran
    >> one without cr_run, correct?)
    >>
    >>  The results from SGE and without are the same.
    >> The results:
    >>
    >> libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000002a9566c000)
    >>        libc.so.6 => /lib64/tls/libc.so.6 (0x0000002a95782000)
    >>        /lib64/ld-linux-x86-64.so.2 (0x0000002a95556000)
    >>
    >>                              Thanks,
    >>                                  Jerry
    >>
    >>
    >>
    >> Paul H. Hargrove wrote:
    >>
    >>> Jerry,
    >>>
    >>>  Let's try to deal with one problem at a time.  First I'd like to
    >>> address the "relocation error" and see if resolving it still leaves
    >>> the second error.
    >>>  The purpose of cr_run is to set LD_PRELOAD just as you have done
    >>> manually.  If you could, please tell me if the following two commands
    >>> (executed via SGE) each produce the same relocation error:
    >>>
    >>> ${BLCR_HOME}/bin/cr_run matlab -nojvm -nodisplay -nosplash < $H/test.m
    >>> env LD_PRELOAD=libcr.so.0:libpthread.so.0 matlab -nojvm -nodisplay
    >>> -nosplash < $H/test.m
    >>>
    >>> If you could, also send the output of "env LD_PRELOAD=libpthread.so.0
    >>> ldd /bin/cat" executed both from the command line and via SGE.
    >>>
    >>> -Paul
    >>>
    >>> Jerry Mersel wrote:
    >>>
    >>>> I managed to checkpoint Matlab processes from the command line.
    >>>> But when I want to use SGE I get the error:
    >>>> /lib64/libc.so.6: relocation error: /lib64/tls/libpthread.so.0:
    >>>> symbol errno, version GLIBC_PRIVATE not defined in file libc.so.6
    >>>> with link time reference
    >>>> Restart failed: No such device or address
    >>>>
    >>>> I get the relocation error at startup, when using cr_run.
    >>>> I get the "Restart failed" error when trying to restart.
    >>>>
    >>>> I start matlab thus:
    >>>> ${BLCR_HOME}/bin/cr_run env LD_PRELOAD=libcr.so.0:libpthread.so.0
    >>>> matlab -nojvm -nodisplay -nosplash < $H/test.m
    >>>>
    >>>> and try to restart thus:
    >>>> ${BLCR_HOME}/bin/cr_restart $ckptfile
    >>>>
    >>>> my log file says this:
    >>>> Jan  2 14:24:36 kam02 kernel: Skipping a socket.
    >>>> Jan  2 14:24:36 kam02 kernel: Skipping a socket.
    >>>> Jan  2 14:26:03 kam02 kernel: Failed to open chrdev major=5 minor=0
    >>>> path='/dev/tty')
    >>>> Jan  2 14:26:03 kam02 kernel: cr_restore_all_files [28703]:  Unable
    >>>> to restore fd 3 (type=6,err=-6)
    >>>> Jan  2 14:26:03 kam02 kernel: cr_rstrt_child [28703]:  Unable to
    >>>> restore files!  (err=-6)
    >>>>
    >>>> Perhaps something to do with the socket.
    >>>> What do you think?
    >>>>
    >>>>                                Regards,
    >>>>                                   Jerry
    >>>>
    >>>> P.S. I have prelinking turned off.
    >>>>
    >>>>
    >>>>
    >>>> Paul H. Hargrove wrote:
    >>>>
    >>>>> Jerry Mersel wrote:
    >>>>>
    >>>>>> Hi:
    >>>>>>
    >>>>>>  I am trying to migrate jobs on a grid after checkpointing.
    >>>>>> Must the "prelinking" fix mentioned in the FAQ be applied on both
    >>>>>> the checkpointed node and the migrated-to node?
    >>>>>>
    >>>>>>                                     Regards,
    >>>>>>                                        Jerry
    >>>>>
    >>>>>
    >>>>> Yes, the prelinking of libraries should be disabled on both the
    >>>>> "checkpointed on" and "migrated to" nodes.
    >>>>> I will clarify this in the next FAQ version.
    >>>>>
    >>>>> -Paul
    >>>>>
    >>>>
    >>>
    >>>
    >>
    >
    >
    > --
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group
    > HPC Research Department                   Tel: +1-510-495-2352
    > Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    >
    >
    >
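
The mismatch discussed in this thread (libc from /lib64 paired with libpthread from /lib64/tls) can be checked mechanically. What follows is a minimal sketch and not part of BLCR: check_lib_dirs is a hypothetical helper name, and it assumes the usual "name => path (address)" layout of ldd output. It reads ldd output on stdin and reports whether libc.so.6 and libpthread.so.0 resolve to the same directory.

```shell
# Hypothetical helper (not part of BLCR): reads ldd-style output on stdin.
# Prints "match <dir>" and returns 0 when libc.so.6 and libpthread.so.0
# come from the same directory, "MISMATCH <libc_dir> <libpthread_dir>" and
# returns 1 when they differ, or "incomplete" and returns 2 when either
# library is missing from the input.
check_lib_dirs() {
    awk '
        $1 == "libc.so.6"       && $2 == "=>" { libc = $3 }
        $1 == "libpthread.so.0" && $2 == "=>" { pth  = $3 }
        END {
            if (libc == "" || pth == "") { print "incomplete"; exit 2 }
            # Strip the file name and keep only the directory part.
            sub("/[^/]*$", "", libc)
            sub("/[^/]*$", "", pth)
            if (libc == pth) {
                print "match " libc
            } else {
                print "MISMATCH " libc " " pth
                exit 1
            }
        }
    '
}
```

One possible use is to run the ldd command requested earlier in the thread through it, once from the command line and once from an SGE job script, and compare the two results:

    env LD_PRELOAD=libpthread.so.0 ldd /bin/cat | check_lib_dirs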
    
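
The restart failure in the kernel log (err=-6 on /dev/tty) corresponds to ENXIO, "No such device or address", which on Linux is what opening /dev/tty returns when the process has no controlling terminal. The sketch below tests for that condition in the current environment. It is only an illustration: has_ctty and report_ctty are hypothetical names, not BLCR or SGE commands.

```shell
# Hypothetical helpers (not part of BLCR or SGE).

# Returns 0 if the calling process has a controlling terminal.
# The open of /dev/tty is done in a subshell so that a failed
# redirection cannot terminate the calling shell.
has_ctty() {
    ( : </dev/tty ) 2>/dev/null
}

# Prints a one-line report suitable for a job log.
report_ctty() {
    if has_ctty; then
        echo "controlling tty: yes"
    else
        echo "controlling tty: no"
    fi
}
```

Calling report_ctty both in the shell where the checkpoint is taken and in the SGE job that runs cr_restart would show whether the two environments differ in the way suggested in the thread.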
