From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Jan 03 2008 - 11:10:21 PST
Jerry,

I am not sure what is happening here, but it looks like the non-tls libc may be getting loaded along with the tls libpthread. If you look at the ldd output I requested, you will see /lib64/tls/libc.so.6 and /lib64/tls/libpthread.so.0. In that case both are the matching tls versions. The relocation error message, however, says /lib64/libc.so.6 and /lib64/tls/libpthread.so.0, showing the non-tls libc. I am guessing that this is the reason for the relocation error. However, I don't know why the mismatched libraries are getting loaded.

I can think of two parties that might be causing the mismatch, but am uncertain why either one would be a factor only when using SGE and not from the command line.

1) Perhaps blcr is somehow getting the non-tls libc linked. To test that, try "cr_run perl </dev/null" and "cr_run cat </dev/null". I pick those two programs because they should both be present on most systems and one uses pthreads and the other does not. If either of those two commands yields the relocation error (running via SGE), then I can start looking at how libcr gets linked for clues as to what is different on your system from others.

2) Perhaps Matlab is linked oddly. If you could find the actual binary run by matlab ("matlab" is usually a shell script) and run "ldd full_path_to_MATLAB", that output might tell me something. For me, MATLAB is in the bin/glnx86/ directory under the Matlab release directory.

I know I said I wanted to try one problem at a time, but I had a thought on the second problem: the failure to reopen "/dev/tty" suggests that the program had a controlling tty at checkpoint time but not at restart. Perhaps you are checkpointing outside of SGE and restarting in SGE? Running "nohup cr_run matlab ..." may remove the controlling tty association (or not, if Matlab explicitly opens /dev/tty).

-Paul

Jerry Mersel wrote:
> Hi Paul:
>
> Both of those commands do create the same relocation error.
> (I ran one without cr_run, correct)
>
> The results from SGE and without are the same.
> The results:
>
> libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000002a9566c000
> libc.so.6 => /lib64/tls/libc.so.6 (0x0000002a95782000)
> /lib64/ld-linux-x86-64.so.2 (0x0000002a95556000)
>
> Thanks,
> Jerry
>
> Paul H. Hargrove wrote:
>
>> Jerry,
>>
>> Let's try to deal with one problem at a time. First I'd like to
>> address the "relocation error" and see if resolving it still leaves
>> the second error.
>> The purpose of cr_run is to set LD_PRELOAD just as you have done
>> manually. If you could, please tell me if the following two commands
>> (executed via SGE) each produce the same relocation error:
>>
>> ${BLCR_HOME}/bin/cr_run matlab -nojvm -nodisplay -nosplash < $H/test.m
>> env LD_PRELOAD=libcr.so.0:libpthread.so.0 matlab -nojvm -nodisplay
>> -nosplash < $H/test.m
>>
>> If you could, also send the output of "env LD_PRELOAD=libpthread.so.0
>> ldd /bin/cat" executed both from the command line and via SGE.
>>
>> -Paul
>>
>> Jerry Mersel wrote:
>>
>>> I manage to checkpoint matlab processes from the command line.
>>> But when I want to use SGE I get the error:
>>> /lib64/libc.so.6: relocation error: /lib64/tls/libpthread.so.0:
>>> symbol errno, version GLIBC_PRIVATE not defined in file libc.so.6
>>> with link time reference
>>> Restart failed: No such device or address
>>>
>>> The relocation error I get on the start using cr_run.
>>> The Restart failed I get when trying to restart.
>>>
>>> I start matlab thus:
>>> ${BLCR_HOME}/bin/cr_run env LD_PRELOAD=libcr.so.0:libpthread.so.0
>>> matlab -nojvm -nodisplay -nosplash < $H/test.m
>>>
>>> and try to restart thus:
>>> ${BLCR_HOME}/bin/cr_restart $ckptfile
>>>
>>> my log file says this:
>>> Jan 2 14:24:36 kam02 kernel: Skipping a socket.
>>> Jan 2 14:24:36 kam02 kernel: Skipping a socket.
>>> Jan 2 14:26:03 kam02 kernel: Failed to open chrdev major=5 minor=0
>>> path='/dev/tty')
>>> Jan 2 14:26:03 kam02 kernel: cr_restore_all_files [28703]: Unable
>>> to restore fd 3 (type=6,err=-6)
>>> Jan 2 14:26:03 kam02 kernel: cr_rstrt_child [28703]: Unable to
>>> restore files! (err=-6)
>>>
>>> Perhaps something to do with the socket.
>>> What do you think?
>>>
>>> Regards,
>>> Jerry
>>>
>>> P.S. I have prelinking turned off.
>>>
>>> cat
>>>
>>> Paul H. Hargrove wrote:
>>>
>>>> Jerry Mersel wrote:
>>>>
>>>>> Hi:
>>>>>
>>>>> I am trying to migrate jobs on a grid after checkpointing.
>>>>> Does the "prelinking" fix as mentioned in the faq must it be done
>>>>> on the checkpointed node and the migrated to node?
>>>>>
>>>>> Regards,
>>>>> Jerry
>>>>
>>>> Yes, the prelinking of libraries should be disabled on both the
>>>> "checkpointed on" and "migrated to" nodes.
>>>> I will clarify this in the next FAQ version.
>>>>
>>>> -Paul

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
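
For anyone repeating the ldd comparison from this thread, here is a rough sketch of how the check for a mixed tls/non-tls pair might be automated. It assumes a POSIX shell with awk. The function name check_lib_dirs is hypothetical and is not part of BLCR or SGE. It only compares the directories that ldd reports for libc.so.6 and libpthread.so.0, so it can hint at a mismatch but cannot explain where one comes from.

```shell
#!/bin/sh
# Hypothetical helper (not part of BLCR or SGE): reads ldd-style output on
# stdin and reports whether libc.so.6 and libpthread.so.0 resolve to the
# same directory (for example, both under /lib64/tls).
check_lib_dirs() {
    awk '
        # ldd lines look like: "libc.so.6 => /lib64/tls/libc.so.6 (0x...)"
        $1 == "libc.so.6"       && $2 == "=>" { libc = $3 }
        $1 == "libpthread.so.0" && $2 == "=>" { pthr = $3 }
        END {
            if (libc == "" || pthr == "") { print "incomplete"; exit 2 }
            sub("/[^/]*$", "", libc)   # drop the file name, keep the directory
            sub("/[^/]*$", "", pthr)
            if (libc == pthr) { print "match: " libc; exit 0 }
            print "MISMATCH: libc in " libc ", libpthread in " pthr
            exit 1
        }'
}

# Example use, run once from the command line and once inside an SGE job:
#   env LD_PRELOAD=libpthread.so.0 ldd /bin/cat | check_lib_dirs
```

If the two environments give different answers, comparing the loader-related environment variables (such as LD_LIBRARY_PATH) in each would be a reasonable next step, though that is a guess and not something established in this thread.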