From: Jerry Mersel (jerry.mersel_at_weizmann.ac.il)
Date: Sat Jan 05 2008 - 22:40:03 PST
Thank you Paul, nohup didn't help. I'll continue to check with your recommendations.

> Jerry,
>
> I am not sure what is happening here, but it looks like the non-tls
> libc may be getting loaded along with the tls libpthread. If you look
> at the ldd output I requested, you will see /lib64/tls/libc.so.6 and
> /lib64/tls/libpthread.so.0. In that case both are the matching tls
> versions. The relocation error message, however, says /lib64/libc.so.6
> and /lib64/tls/libpthread.so.0 showing the non-tls libc. I am guessing
> that this is the reason for the relocation error. However, I don't know
> why the mismatched libraries are getting loaded.
>
> I can think of two parties that might be causing the mismatch, but am
> uncertain why either one would be a factor only when using SGE and not
> from the command line.
>
> 1) Perhaps blcr is somehow getting the non-tls libc linked. To test
> that, try "cr_run perl </dev/null" and "cr_run cat </dev/null". I pick
> those two programs because they should both be present on most systems
> and one uses pthreads and the other does not. If either of those two
> commands yields the relocation error (running via SGE), then I can start
> looking at how libcr gets linked for clues as to what is different on
> your system from others.
>
> 2) Perhaps Matlab is linked oddly. If you could find the actual binary
> run by matlab ("matlab" is usually a shell script) and run "ldd
> full_path_to_MATLAB" that output might tell me something. For me,
> MATLAB is in the bin/glnx86/ directory under the Matlab release directory.
>
>
> I know I said I wanted to try one problem at a time, but I had a thought
> on the second problem: the failure to reopen "/dev/tty" suggests that
> the program had a controlling tty at checkpoint time but not at restart.
> Perhaps you are checkpointing outside of SGE and restarting in SGE?
> Running "nohup cr_run matlab ..." may remove the controlling tty
> association (or not if Matlab explicitly opens /dev/tty).
>
> -Paul
>
> Jerry Mersel wrote:
>> Hi Paul:
>>
>> Both of those commands do create the same relocation error. (I ran
>> one without cr_run, correct)
>>
>> The results from SGE and without are the same.
>> The results:
>>
>> libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000002a9566c000
>> libc.so.6 => /lib64/tls/libc.so.6 (0x0000002a95782000)
>> /lib64/ld-linux-x86-64.so.2 (0x0000002a95556000)
>>
>> Thanks,
>> Jerry
>>
>>
>>
>> Paul H. Hargrove wrote:
>>
>>> Jerry,
>>>
>>> Let's try to deal with one problem at a time. First I'd like to
>>> address the "relocation error" and see if resolving it still leaves
>>> the second error.
>>> The purpose of cr_run is to set LD_PRELOAD just as you have done
>>> manually. If you could, please tell me if the following two commands
>>> (executed via SGE) each produce the same relocation error:
>>>
>>> ${BLCR_HOME}/bin/cr_run matlab -nojvm -nodisplay -nosplash < $H/test.m
>>> env LD_PRELOAD=libcr.so.0:libpthread.so.0 matlab -nojvm -nodisplay
>>> -nosplash < $H/test.m
>>>
>>> If you could, also send the output of "env LD_PRELOAD=libpthread.so.0
>>> ldd /bin/cat" executed both from the command line and via SGE.
>>>
>>> -Paul
>>>
>>> Jerry Mersel wrote:
>>>
>>>> I manage to checkpoint matlab processes from the command line.
>>>> But when I want to use SGE I get the error:
>>>> /lib64/libc.so.6: relocation error: /lib64/tls/libpthread.so.0:
>>>> symbol errno, version GLIBC_PRIVATE not defined in file libc.so.6
>>>> with link time reference
>>>> Restart failed: No such device or address
>>>>
>>>> The relocation error I get on the start using cr_run.
>>>> The Restart failed I get when trying to restart.
>>>>
>>>> I start matlab thus:
>>>> ${BLCR_HOME}/bin/cr_run env LD_PRELOAD=libcr.so.0:libpthread.so.0
>>>> matlab -nojvm -nodisplay -nosplash < $H/test.m
>>>>
>>>> and try to restart thus:
>>>> ${BLCR_HOME}/bin/cr_restart $ckptfile
>>>>
>>>> my log file says this:
>>>> Jan 2 14:24:36 kam02 kernel: Skipping a socket.
>>>> Jan 2 14:24:36 kam02 kernel: Skipping a socket.
>>>> Jan 2 14:26:03 kam02 kernel: Failed to open chrdev major=5 minor=0
>>>> path='/dev/tty')
>>>> Jan 2 14:26:03 kam02 kernel: cr_restore_all_files [28703]: Unable
>>>> to restore fd 3 (type=6,err=-6)
>>>> Jan 2 14:26:03 kam02 kernel: cr_rstrt_child [28703]: Unable to
>>>> restore files! (err=-6)
>>>>
>>>> Perhaps something to do with the socket.
>>>> What do you think?
>>>>
>>>> Regards,
>>>> Jerry
>>>>
>>>> P.S. I have prelinking turned off.
>>>>
>>>>
>>>> cat
>>>>
>>>> Paul H. Hargrove wrote:
>>>>
>>>>> Jerry Mersel wrote:
>>>>>
>>>>>> Hi:
>>>>>>
>>>>>> I am trying to migrate jobs on a grid after checkpointing.
>>>>>> Does the "prelinking" fix as mentioned in the faq must it be done
>>>>>> on the checkpointed node and the migrated to node?
>>>>>>
>>>>>> Regards,
>>>>>> Jerry
>>>>>
>>>>>
>>>>> Yes, the prelinking of libraries should be disabled on both the
>>>>> "checkpointed on" and "migrated to" nodes.
>>>>> I will clarify this in the next FAQ version.
>>>>>
>>>>> -Paul
>>>>>
>>>>
>>>
>>>
>>
>
>
> --
> Paul H. Hargrove PHHargrove_at_lbl_dot_gov
> Future Technologies Group
> HPC Research Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
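
For anyone repeating the ldd comparison discussed in this thread, the check for a tls/non-tls mismatch can be automated. The following is only a minimal sketch and is not part of BLCR or SGE: the function name check_libc_pthread_match is hypothetical, and it assumes ldd prints lines of the form "name => path (address)" as in the output quoted in the thread. It reports whether libc.so.6 and libpthread.so.0 resolve to the same directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of BLCR or SGE): reads ldd-style output on
# stdin and reports whether libc.so.6 and libpthread.so.0 were resolved
# from the same directory (e.g. both from /lib64/tls).
check_libc_pthread_match() {
    awk '
        # Remember the resolved path of each library of interest.
        $1 == "libc.so.6"       && $2 == "=>" { libc = $3 }
        $1 == "libpthread.so.0" && $2 == "=>" { pth  = $3 }
        END {
            if (libc == "" || pth == "") { print "incomplete"; exit 2 }
            sub("/[^/]*$", "", libc)   # strip the file name, keep the directory
            sub("/[^/]*$", "", pth)
            if (libc == pth) { print "match: " libc; exit 0 }
            print "mismatch: libc in " libc ", libpthread in " pth
            exit 1
        }
    '
}
```

It could be run both from the command line and inside an SGE job, for example as "env LD_PRELOAD=libpthread.so.0 ldd /bin/cat | check_libc_pthread_match". Note that in this thread ldd reported matching tls libraries in both environments while the run-time relocation error still named /lib64/libc.so.6, so a "match" here is one data point and does not rule out a mismatch at run time.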