**From:** Jerry Mersel (*jerry.mersel_at_weizmann.ac.il*)

**Date:** Sat Jan 05 2008 - 23:43:56 PST

**Previous message:** Jerry Mersel: "Re: berkeley checkpointing"

**In reply to:** Paul H. Hargrove: "Re: berkeley checkpointing"

**Next in thread:** Paul H. Hargrove: "Re: berkeley checkpointing"

**Reply:** Paul H. Hargrove: "Re: berkeley checkpointing"

I checked, and I am getting the same relocation error from the command line, but MATLAB runs, and checkpointing and restarting both work.

I don't get any relocation errors when using "cr_run perl </dev/null" or "cr_run cat </dev/null" (neither from SGE nor from the command line).

ldd <full path>/MATLAB yields:

    -bash-3.00$ ldd ./bin/glnxa64/MATLAB
    libut.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libut.so (0x0000002a9566c000)
    libmx.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmx.so (0x0000002a9590a000)
    libmwservices.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwservices.so (0x0000002a95a56000)
    libmwjmi.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwjmi.so (0x0000002a95c52000)
    libmwbridge.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwbridge.so (0x0000002a95d8b000)
    libmwmcr.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwmcr.so (0x0000002a95eb7000)
    libmwmvalue.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwmvalue.so (0x0000002a95fd3000)
    libmwm_dispatcher.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwm_dispatcher.so (0x0000002a960e8000)
    libmwm_interpreter.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwm_interpreter.so (0x0000002a9628d000)
    libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000002a96771000)
    libstdc++.so.5 => /storage/alonwork/installations/MATLAB/bin/glnxa64/../../sys/os/glnxa64/libstdc++.so.5 (0x0000002a96887000)
    libm.so.6 => /lib64/tls/libm.so.6 (0x0000002a96a64000)
    libgcc_s.so.1 => /storage/alonwork/installations/MATLAB/bin/glnxa64/../../sys/os/glnxa64/libgcc_s.so.1 (0x0000002a96bea000)
    libc.so.6 => /lib64/tls/libc.so.6 (0x0000002a96cf6000)
    librt.so.1 => /lib64/tls/librt.so.1 (0x0000002a96f2a000)
    libicudata.so.32 => /storage/alonwork/installations/MATLAB/bin/glnxa64/libicudata.so.32 (0x0000002a97044000)
    libicuuc.so.32 => /storage/alonwork/installations/MATLAB/bin/glnxa64/libicuuc.so.32 (0x0000002a97146000)
    libicui18n.so.32 => /storage/alonwork/installations/MATLAB/bin/glnxa64/libicui18n.so.32 (0x0000002a9733a000)
    libicuio.so.32 => /storage/alonwork/installations/MATLAB/bin/glnxa64/libicuio.so.32 (0x0000002a97555000)
    libMTwister.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libMTwister.so (0x0000002a97662000)
    libdl.so.2 => /lib64/libdl.so.2 (0x0000002a97765000)
    libz.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libz.so (0x0000002a97868000)
    libmwmpath.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwmpath.so (0x0000002a97977000)
    libncurses.so.5 => /usr/lib64/libncurses.so.5 (0x0000002a97a90000)
    libmex.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmex.so (0x0000002a97bec000)
    libmwm_parser.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwm_parser.so (0x0000002a97cf9000)
    libmwudd.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwudd.so (0x0000002a9804e000)
    libmat.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmat.so (0x0000002a98238000)
    libmwmcos.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwmcos.so (0x0000002a98341000)
    libmwgui.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwgui.so (0x0000002a9855d000)
    libmwhg.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwhg.so (0x0000002a98736000)
    libuij.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libuij.so (0x0000002a98a3d000)
    libmwudd_mi.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwudd_mi.so (0x0000002a98b6e000)
    libmwuinone.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwuinone.so (0x0000002a98ce5000)
    libmwm_ir.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwm_ir.so (0x0000002a98deb000)
    libmwuix.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwuix.so (0x0000002a98f91000)
    libmwdatasvcs.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwdatasvcs.so (0x0000002a99201000)
    libxerces-c.so.26 => /storage/alonwork/installations/MATLAB/bin/glnxa64/libxerces-c.so.26 (0x0000002a99323000)
    libmwmlib.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwmlib.so (0x0000002a99799000)
    libmwm_pcodeio.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwm_pcodeio.so (0x0000002a99938000)
    libmwm_pcodegen.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwm_pcodegen.so (0x0000002a99a4b000)
    /lib64/ld-linux-x86-64.so.2 (0x0000002a95556000)
    libmwir_xfmr.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwir_xfmr.so (0x0000002a99b63000)
    libmwhardcopy.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwhardcopy.so (0x0000002a99c6f000)
    libmwnumerics.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwnumerics.so (0x0000002a99d9e000)
    libXm.so.3 => /storage/alonwork/installations/MATLAB/bin/glnxa64/../../sys/os/glnxa64/libXm.so.3 (0x0000002a9a078000)
    libXext.so.6 => /usr/X11R6/lib64/libXext.so.6 (0x0000002a9a427000)
    libXp.so.6 => /usr/X11R6/lib64/libXp.so.6 (0x0000002a9a538000)
    libXt.so.6 => /usr/X11R6/lib64/libXt.so.6 (0x0000002a9a642000)
    libX11.so.6 => /usr/X11R6/lib64/libX11.so.6 (0x0000002a9a7a4000)
    libmwlapack.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwlapack.so (0x0000002a9a99e000)
    libfftw3.so.3 => /storage/alonwork/installations/MATLAB/bin/glnxa64/libfftw3.so.3 (0x0000002a9aada000)
    libfftw3f.so.3 => /storage/alonwork/installations/MATLAB/bin/glnxa64/libfftw3f.so.3 (0x0000002a9ad02000)
    libmwcolamd.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwcolamd.so (0x0000002a9af19000)
    libmwamd.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwamd.so (0x0000002a9b01e000)
    libmwumfpackv4.3.so => /storage/alonwork/installations/MATLAB/bin/glnxa64/libmwumfpackv4.3.so (0x0000002a9b125000)
    libSM.so.6 => /usr/X11R6/lib64/libSM.so.6 (0x0000002a9b2a3000)
    libICE.so.6 => /usr/X11R6/lib64/libICE.so.6 (0x0000002a9b3ad000)

Again, thanks for your help.
Regards,
Jerry

> Jerry,
>
> I am not sure what is happening here, but it looks like the non-tls
> libc may be getting loaded along with the tls libpthread. If you look
> at the ldd output I requested, you will see /lib64/tls/libc.so.6 and
> /lib64/tls/libpthread.so.0. In that case both are the matching tls
> versions. The relocation error message, however, says /lib64/libc.so.6
> and /lib64/tls/libpthread.so.0 showing the non-tls libc. I am guessing
> that this is the reason for the relocation error. However, I don't know
> why the mismatched libraries are getting loaded.
>
> I can think of two parties that might be causing the mismatch, but am
> uncertain why either one would be a factor only when using SGE and not
> from the command line.
>
> 1) Perhaps blcr is somehow getting the non-tls libc linked. To test
> that, try "cr_run perl </dev/null" and "cr_run cat </dev/null". I pick
> those two programs because they should both be present on most systems
> and one uses pthreads and the other does not. If either of those two
> commands yields the relocation error (running via SGE), then I can start
> looking at how libcr gets linked for clues as to what is different on
> your system from others.
>
> 2) Perhaps Matlab is linked oddly. If you could find the actual binary
> run by matlab ("matlab" is usually a shell script) and run "ldd
> full_path_to_MATLAB" that output might tell me something. For me,
> MATLAB is in the bin/glnx86/ directory under the Matlab release directory.
>
>
> I know I said I wanted to try one problem at a time, but I had a thought
> on the second problem: the failure to reopen "/dev/tty" suggests that
> the program had a controlling tty at checkpoint time but not at restart.
> Perhaps you are checkpointing outside of SGE and restarting in SGE?
> Running "nohup cr_run matlab ..." may remove the controlling tty
> association (or not if Matlab explicitly opens /dev/tty).
>
> -Paul
>
> Jerry Mersel wrote:
>> Hi Paul:
>>
>> Both of those commands do create the same relocation error. (I ran
>> one without cr_run, correct)
>>
>> The results from SGE and without are the same.
>> The results:
>>
>> libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000002a9566c000
>> libc.so.6 => /lib64/tls/libc.so.6 (0x0000002a95782000)
>> /lib64/ld-linux-x86-64.so.2 (0x0000002a95556000)
>>
>> Thanks,
>> Jerry
>>
>>
>>
>> Paul H. Hargrove wrote:
>>
>>> Jerry,
>>>
>>> Let's try to deal with one problem at a time. First I'd like to
>>> address the "relocation error" and see if resolving it still leaves
>>> the second error.
>>> The purpose of cr_run is to set LD_PRELOAD just as you have done
>>> manually. If you could, please tell me if the following two commands
>>> (executed via SGE) each produce the same relocation error:
>>>
>>> ${BLCR_HOME}/bin/cr_run matlab -nojvm -nodisplay -nosplash < $H/test.m
>>> env LD_PRELOAD=libcr.so.0:libpthread.so.0 matlab -nojvm -nodisplay
>>> -nosplash < $H/test.m
>>>
>>> If you could, also send the output of "env LD_PRELOAD=libpthread.so.0
>>> ldd /bin/cat" executed both from the command line and via SGE.
>>>
>>> -Paul
>>>
>>> Jerry Mersel wrote:
>>>
>>>> I manage to checkpoint matlab processes from the command line.
>>>> But when I want to use SGE I get the error:
>>>> /lib64/libc.so.6: relocation error: /lib64/tls/libpthread.so.0:
>>>> symbol errno, version GLIBC_PRIVATE not defined in file libc.so.6
>>>> with link time reference
>>>> Restart failed: No such device or address
>>>>
>>>> The relocation error I get on the start using cr_run.
>>>> The Restart failed I get when trying to restart.
>>>>
>>>> I start matlab thus:
>>>> ${BLCR_HOME}/bin/cr_run env LD_PRELOAD=libcr.so.0:libpthread.so.0
>>>> matlab -nojvm -nodisplay -nosplash < $H/test.m
>>>>
>>>> and try to restart thus:
>>>> ${BLCR_HOME}/bin/cr_restart $ckptfile
>>>>
>>>> my log file says this:
>>>> Jan 2 14:24:36 kam02 kernel: Skipping a socket.
>>>> Jan 2 14:24:36 kam02 kernel: Skipping a socket.
>>>> Jan 2 14:26:03 kam02 kernel: Failed to open chrdev major=5 minor=0
>>>> path='/dev/tty')
>>>> Jan 2 14:26:03 kam02 kernel: cr_restore_all_files [28703]: Unable
>>>> to restore fd 3 (type=6,err=-6)
>>>> Jan 2 14:26:03 kam02 kernel: cr_rstrt_child [28703]: Unable to
>>>> restore files! (err=-6)
>>>>
>>>> Perhaps something to do with the socket.
>>>> What do you think?
>>>>
>>>> Regards,
>>>> Jerry
>>>>
>>>> P.S. I have prelinking turned off.
>>>>
>>>>
>>>> cat
>>>>
>>>> Paul H. Hargrove wrote:
>>>>
>>>>> Jerry Mersel wrote:
>>>>>
>>>>>> Hi:
>>>>>>
>>>>>> I am trying to migrate jobs on a grid after checkpointing.
>>>>>> Does the "prelinking" fix as mentioned in the faq must it be done
>>>>>> on the checkpointed node and the migrated to node?
>>>>>>
>>>>>> Regards,
>>>>>> Jerry
>>>>>
>>>>>
>>>>> Yes, the prelinking of libraries should be disabled on both the
>>>>> "checkpointed on" and "migrated to" nodes.
>>>>> I will clarify this in the next FAQ version.
>>>>>
>>>>> -Paul
>>>>>
>>>>
>>>
>>>
>>
>
>
> --
> Paul H. Hargrove PHHargrove_at_lbl_dot_gov
> Future Technologies Group
> HPC Research Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
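
Paul's working theory in the quoted reply is that a non-tls libc (/lib64/libc.so.6) is being loaded together with the tls libpthread (/lib64/tls/libpthread.so.0). The sketch below is one way to check ldd output for that kind of mix mechanically. It is only an illustration, not part of BLCR: the function names `parse_ldd` and `tls_mismatch` are hypothetical, and the "suspect" sample is built by hand from the two paths named in the relocation error message, not from real ldd output.

```python
import re

# Matches entries such as "libc.so.6 => /lib64/tls/libc.so.6 (0x0000002a96cf6000)".
# The closing parenthesis is optional because one quoted line in the thread lacks it.
LDD_ENTRY = re.compile(r"(\S+)\s+=>\s+(\S+)\s+\(0x[0-9a-fA-F]+\)?")


def parse_ldd(text):
    """Return {soname: resolved_path} for every 'name => path (addr)' entry.

    Entries without '=>' (for example the dynamic loader itself) are skipped.
    Works on multi-line output and on output flattened onto one line.
    """
    return {m.group(1): m.group(2) for m in LDD_ENTRY.finditer(text)}


def tls_mismatch(libs, names=("libc.so.6", "libpthread.so.0")):
    """Return the resolved paths of `names` if some come from a /tls/
    directory and others do not; return an empty dict otherwise."""
    found = {name: libs[name] for name in names if name in libs}
    kinds = {"/tls/" in path for path in found.values()}
    return found if len(kinds) > 1 else {}


# Two entries copied from the ldd output in this message: both are tls builds.
REPORTED = """
libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000002a96771000)
libc.so.6 => /lib64/tls/libc.so.6 (0x0000002a96cf6000)
"""

# Hypothetical input: the pair of paths named in the relocation error message.
SUSPECT = {
    "libc.so.6": "/lib64/libc.so.6",
    "libpthread.so.0": "/lib64/tls/libpthread.so.0",
}

if __name__ == "__main__":
    print("reported:", tls_mismatch(parse_ldd(REPORTED)) or "consistent")
    print("suspect: ", tls_mismatch(SUSPECT) or "consistent")
```

Running the same check on ldd output captured from the command line and from inside an SGE job might show whether the two environments resolve libc differently, which is the open question in the thread.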
