Re: berkeley checkpointing

From: Jerry Mersel (jerry.mersel_at_weizmann.ac.il)
Date: Wed Jan 02 2008 - 23:50:45 PST

    Hi Paul:
    
      Both of those commands produce the same relocation error. (I ran 
    one of them without cr_run, correct?)
    
      The ldd output is the same from the command line and via SGE:
    
            libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000002a9566c000)
            libc.so.6 => /lib64/tls/libc.so.6 (0x0000002a95782000)
            /lib64/ld-linux-x86-64.so.2 (0x0000002a95556000)
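
The relocation error quoted further down pairs /lib64/libc.so.6 with /lib64/tls/libpthread.so.0, so it may be worth confirming that both libraries resolve from the same directory in each environment. The following is only a minimal sketch of such a check: the helper names lib_dirs and same_dir are made up for illustration, and the parsing assumes the usual "name => path (address)" layout of ldd output.

```shell
#!/bin/sh
# Hypothetical helper: reads ldd output on stdin and prints, for libc.so.6
# and libpthread.so.0 only, the library name and the directory it resolved to.
lib_dirs() {
    awk '$2 == "=>" && ($1 == "libc.so.6" || $1 == "libpthread.so.0") {
        path = $3
        sub("/[^/]*$", "", path)   # drop the file name, keep the directory
        print $1, path
    }'
}

# Hypothetical helper: reads ldd output on stdin and succeeds only if the
# libraries reported by lib_dirs all come from one directory.
same_dir() {
    lib_dirs | awk '{ dirs[$2] = 1 }
        END { n = 0; for (d in dirs) n++; exit (n == 1 ? 0 : 1) }'
}
```

For example, `env LD_PRELOAD=libpthread.so.0 ldd /bin/cat | lib_dirs` could be run once from the command line and once inside an SGE job, and the two results compared.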
    
                                  Thanks,
                                      Jerry
    
     
    
    Paul H. Hargrove wrote:
    
    > Jerry,
    >
    >  Let's try to deal with one problem at a time.  First I'd like to 
    > address the "relocation error" and see if resolving it still leaves 
    > the second error.
    >  The purpose of cr_run is to set LD_PRELOAD just as you have done 
    > manually.  If you could, please tell me if the following two commands 
    > (executed via SGE) each produce the same relocation error:
    >
    > ${BLCR_HOME}/bin/cr_run matlab -nojvm -nodisplay -nosplash < $H/test.m
    > env LD_PRELOAD=libcr.so.0:libpthread.so.0 matlab -nojvm -nodisplay -nosplash < $H/test.m
    >
    > If you could, also send the output of "env LD_PRELOAD=libpthread.so.0 
    > ldd /bin/cat" executed both from the command line and via SGE.
    >
    > -Paul
    >
    > Jerry Mersel wrote:
    >
    >> I can checkpoint matlab processes from the command line.
    >> But when I use SGE I get these errors:
    >> /lib64/libc.so.6: relocation error: /lib64/tls/libpthread.so.0: symbol errno, version GLIBC_PRIVATE not defined in file libc.so.6 with link time reference
    >> Restart failed: No such device or address
    >>
    >> I get the relocation error at startup, when using cr_run.
    >> I get the "Restart failed" error when trying to restart.
    >>
    >> I start matlab thus:
    >> ${BLCR_HOME}/bin/cr_run env LD_PRELOAD=libcr.so.0:libpthread.so.0 matlab -nojvm -nodisplay -nosplash < $H/test.m
    >>
    >> and try to restart thus:
    >> ${BLCR_HOME}/bin/cr_restart $ckptfile
    >>
    >> my log file says this:
    >> Jan  2 14:24:36 kam02 kernel: Skipping a socket.
    >> Jan  2 14:24:36 kam02 kernel: Skipping a socket.
    >> Jan  2 14:26:03 kam02 kernel: Failed to open chrdev major=5 minor=0 path='/dev/tty')
    >> Jan  2 14:26:03 kam02 kernel: cr_restore_all_files [28703]:  Unable to restore fd 3 (type=6,err=-6)
    >> Jan  2 14:26:03 kam02 kernel: cr_rstrt_child [28703]:  Unable to restore files!  (err=-6)
    >>
    >> Perhaps something to do with the socket.
    >> What do you think?
    >>
    >>                                Regards,
    >>                                   Jerry
    >>
    >> P.S. I have prelinking turned off.
    >>
    >> Paul H. Hargrove wrote:
    >>
    >>> Jerry Mersel wrote:
    >>>
    >>>> Hi:
    >>>>
    >>>>  I am trying to migrate jobs on a grid after checkpointing.
    >>>> Must the "prelinking" fix mentioned in the FAQ be applied on both
    >>>> the checkpointed node and the migrated-to node?
    >>>>
    >>>>                                     Regards,
    >>>>                                        Jerry
    >>>
    >>>
    >>> Yes, the prelinking of libraries should be disabled on both the 
    >>> "checkpointed on" and "migrated to" nodes.
    >>> I will clarify this in the next FAQ version.
    >>>
    >>> -Paul
    >>>
    >>
    >
    >
    
